Today’s Research Highlights

AI-enhanced summaries of the latest research papers from arXiv.

cs.CL [Total: 59]
cs.CV [Total: 291]
cs.AI [Total: 63]
cs.SD [Total: 6]
cs.LG [Total: 188]
cs.MA [Total: 8]
cs.MM [Total: 3]
eess.AS [Total: 1]
eess.IV [Total: 16]

cs.CL

[1] Mispronunciation Detection and Diagnosis Without Model Training: A Retrieval-Based Approach

Huu Tuong Tu, Ha Viet Khanh, Tran Tien Dat, Vu Huan, Thien Van Luong, Nguyen Tien Cuong, Nguyen Thi Thu Trang

Main category: cs.CL

TL;DR: Training-free MDD framework using retrieval with pretrained ASR, achieving 69.60% F1 score on L2-ARCTIC without model training.

Details

Motivation: To avoid the complexity of scoring models or phoneme-level training in MDD, while maintaining accurate error detection and diagnosis.

Method: Leverages retrieval techniques with a pretrained Automatic Speech Recognition model, eliminating need for phoneme-specific modeling or additional training.

Result: Achieves superior F1 score of 69.60% on L2-ARCTIC dataset.

Conclusion: Proposed training-free framework effectively performs MDD while avoiding model training complexity.

Abstract: Mispronunciation Detection and Diagnosis (MDD) is crucial for language learning and speech therapy. Unlike conventional methods that require scoring models or training phoneme-level models, we propose a novel training-free framework that leverages retrieval techniques with a pretrained Automatic Speech Recognition model. Our method avoids phoneme-specific modeling or additional task-specific training, while still achieving accurate detection and diagnosis of pronunciation errors. Experiments on the L2-ARCTIC dataset show that our method achieves a superior F1 score of 69.60% while avoiding the complexity of model training.

[2] Efficient Multi-Hop Question Answering over Knowledge Graphs via LLM Planning and Embedding-Guided Search

Manil Shrestha, Edward Kim

Main category: cs.CL

TL;DR: Hybrid algorithms for efficient and verifiable multi-hop question answering over knowledge graphs, combining LLM-guided planning (a single LLM call) with embedding-based neural search (no LLM calls) to cut expensive LLM inference while maintaining accuracy.

Details

Motivation: Address computational challenges in multi-hop QA over knowledge graphs due to combinatorial path explosion and expensive LLM inference, while ensuring verifiable grounding in structured knowledge.

Method: Two complementary approaches: (1) LLM-Guided Planning using single LLM call for relation sequence prediction executed via BFS, (2) Embedding-Guided Neural Search fusing text and graph embeddings with lightweight edge scorer. Knowledge distillation used to compress planning capability.

Result: LLM-Guided Planning reached near-perfect accuracy (micro-F1 > 0.90) with every answer grounded in the knowledge graph; Embedding-Guided Neural Search delivered an over 100x speedup with competitive accuracy; and a distilled 4B-parameter model matched large-model performance at zero API cost.

Conclusion: Verifiable multi-hop reasoning doesn’t require massive models at inference time, but rather the right architectural inductive biases combining symbolic structure with learned representations; structured planning also proved more transferable than direct answer generation.

Abstract: Multi-hop question answering over knowledge graphs remains computationally challenging due to the combinatorial explosion of possible reasoning paths. Recent approaches rely on expensive Large Language Model (LLM) inference for both entity linking and path ranking, limiting their practical deployment. Additionally, LLM-generated answers often lack verifiable grounding in structured knowledge. We present two complementary hybrid algorithms that address both efficiency and verifiability: (1) LLM-Guided Planning that uses a single LLM call to predict relation sequences executed via breadth-first search, achieving near-perfect accuracy (micro-F1 > 0.90) while ensuring all answers are grounded in the knowledge graph, and (2) Embedding-Guided Neural Search that eliminates LLM calls entirely by fusing text and graph embeddings through a lightweight 6.7M-parameter edge scorer, achieving over 100 times speedup with competitive accuracy. Through knowledge distillation, we compress planning capability into a 4B-parameter model that matches large-model performance at zero API cost. Evaluation on MetaQA demonstrates that grounded reasoning consistently outperforms ungrounded generation, with structured planning proving more transferable than direct answer generation. Our results show that verifiable multi-hop reasoning does not require massive models at inference time, but rather the right architectural inductive biases combining symbolic structure with learned representations.
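
Illustration: the LLM-Guided Planning idea (predict a relation sequence once, then execute it with breadth-first search) can be sketched roughly as follows. This is a minimal sketch, not the authors' implementation: the toy triples, the "_inv" naming for inverse relations, and the hard-coded plan standing in for the single LLM call are all assumptions made for illustration.

```python
from collections import defaultdict

# Toy knowledge graph as (head, relation, tail) triples; illustrative only.
TRIPLES = [
    ("Inception", "directed_by", "Christopher Nolan"),
    ("Interstellar", "directed_by", "Christopher Nolan"),
    ("Inception", "starred_actors", "Leonardo DiCaprio"),
    ("Titanic", "starred_actors", "Leonardo DiCaprio"),
    ("Titanic", "directed_by", "James Cameron"),
]

def build_index(triples):
    """Index edges in both directions so a plan can follow inverse relations."""
    index = defaultdict(set)
    for head, rel, tail in triples:
        index[(head, rel)].add(tail)
        index[(tail, rel + "_inv")].add(head)  # "_inv" suffix is a hypothetical convention
    return index

def execute_plan(index, start_entity, relation_plan):
    """Expand the frontier one relation at a time (level-by-level BFS).

    Every returned entity is reached through edges that exist in the graph,
    which is what makes the answers verifiable.
    """
    frontier = {start_entity}
    for rel in relation_plan:
        frontier = {nbr for node in frontier for nbr in index.get((node, rel), ())}
        if not frontier:
            break
    return frontier

index = build_index(TRIPLES)
# Hypothetical plan an LLM might return for
# "Who directed the films starring Leonardo DiCaprio?"
plan = ["starred_actors_inv", "directed_by"]
answers = execute_plan(index, "Leonardo DiCaprio", plan)
print(sorted(answers))
```

Because the plan is only a list of relation names, the expensive model is consulted once per question and the graph traversal itself needs no further inference.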

[3] Can LLMs Faithfully Explain Themselves in Low-Resource Languages? A Case Study on Emotion Detection in Persian

Mobina Mehrazar, Mohammad Amin Yousefi, Parisa Abolfath Beygi, Behnam Bahrak

Main category: cs.CL

TL;DR: LLM-generated explanations in Persian emotion classification show poor faithfulness, with models agreeing more with each other than with human judgments, despite good classification performance.

Details

Motivation: Evaluate faithfulness of LLM-generated explanations in low-resource languages like Persian, as current methods may produce unfaithful reasoning despite good predictions.

Method: Compare LLM-identified influential words with human annotations using confidence scores from token-level log-probabilities. Test two prompting strategies: Predict-then-Explain and Explain-then-Predict.

Result: LLMs achieve strong classification performance but generated explanations diverge from faithful reasoning, showing greater inter-model agreement than agreement with human judgments.

Conclusion: Current explanation methods and metrics have limitations, highlighting need for more robust approaches to ensure LLM reliability in multilingual and low-resource contexts.

Abstract: Large language models (LLMs) are increasingly used to generate self-explanations alongside their predictions, a practice that raises concerns about the faithfulness of these explanations, especially in low-resource languages. This study evaluates the faithfulness of LLM-generated explanations in the context of emotion classification in Persian, a low-resource language, by comparing the influential words identified by the model against those identified by human annotators. We assess faithfulness using confidence scores derived from token-level log-probabilities. Two prompting strategies, differing in the order of explanation and prediction (Predict-then-Explain and Explain-then-Predict), are tested for their impact on explanation faithfulness. Our results reveal that while LLMs achieve strong classification performance, their generated explanations often diverge from faithful reasoning, showing greater agreement with each other than with human judgments. These results highlight the limitations of current explanation methods and metrics, emphasizing the need for more robust approaches to ensure LLM reliability in multilingual and low-resource contexts.
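
Illustration: one way a confidence score can be derived from token-level log-probabilities is to exponentiate the summed log-probabilities of the predicted label's tokens. The summary above does not give the paper's exact formula, so the joint-probability definition and the masking comparison below are assumptions, and the numbers are made up.

```python
import math

def sequence_confidence(token_logprobs):
    """Joint probability of the label tokens: exp of the summed log-probabilities."""
    return math.exp(sum(token_logprobs))

def confidence_drop(full_logprobs, masked_logprobs):
    """Confidence lost when the words named in an explanation are masked.

    A larger drop suggests the named words really mattered to the prediction
    (assumed faithfulness proxy, not necessarily the paper's metric).
    """
    return sequence_confidence(full_logprobs) - sequence_confidence(masked_logprobs)

# Made-up log-probabilities for a two-token emotion label.
full = [math.log(0.9), math.log(0.8)]    # joint probability 0.72
masked = [math.log(0.5), math.log(0.6)]  # joint probability 0.30
drop = confidence_drop(full, masked)
print(round(drop, 2))
```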

[4] Comparative Analysis of LoRA-Adapted Embedding Models for Clinical Cardiology Text Representation

Richard J. Young, Alice M. Matthews

Main category: cs.CL

TL;DR: Encoder-only models like BioLinkBERT outperform larger decoder models for cardiology text embeddings when fine-tuned with LoRA, challenging the assumption that bigger models are always better for domain-specific tasks.

Details

Motivation: There's a lack of systematic comparisons of transformer architectures for clinical NLP, and domain-specific embeddings are crucial for medical applications but current approaches don't adequately evaluate different model types.

Method: Fine-tuned ten transformer models using Low-Rank Adaptation (LoRA) on 106,535 cardiology text pairs from medical textbooks, comparing encoder-only vs decoder-based architectures.

Result: Encoder-only architectures, particularly BioLinkBERT, achieved superior domain-specific performance (separation score: 0.510) while requiring significantly fewer computational resources than larger decoder models.

Conclusion: Larger language models don’t necessarily produce better domain-specific embeddings; encoder-only models are more efficient and effective for clinical NLP, providing practical guidance for medical informatics development.

Abstract: Domain-specific text embeddings are critical for clinical natural language processing, yet systematic comparisons across model architectures remain limited. This study evaluates ten transformer-based embedding models adapted for cardiology through Low-Rank Adaptation (LoRA) fine-tuning on 106,535 cardiology text pairs derived from authoritative medical textbooks. Results demonstrate that encoder-only architectures, particularly BioLinkBERT, achieve superior domain-specific performance (separation score: 0.510) compared to larger decoder-based models, while requiring significantly fewer computational resources. The findings challenge the assumption that larger language models necessarily produce better domain-specific embeddings and provide practical guidance for clinical NLP system development. All models, training code, and evaluation datasets are publicly available to support reproducible research in medical informatics.
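
Illustration: LoRA freezes a weight matrix W and learns a low-rank update BA, so only r * (d_in + d_out) parameters are trained per adapted matrix. The sketch below shows why this is cheap; the 768 x 768 layer size and rank 8 are assumed values for illustration, since the summary does not report the configuration used in the study.

```python
def full_params(d_out, d_in):
    """Parameters in a dense d_out x d_in weight matrix."""
    return d_out * d_in

def lora_trainable_params(d_out, d_in, rank):
    """Trainable parameters in the LoRA factors B (d_out x rank) and A (rank x d_in)."""
    return rank * (d_in + d_out)

# Assumed example: one 768 x 768 projection adapted with rank 8.
dense = full_params(768, 768)
lora = lora_trainable_params(768, 768, rank=8)
fraction = lora / dense
print(dense, lora, f"{fraction:.2%}")
```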

[5] What does it mean to understand language?

Colton Casto, Anna Ivanova, Evelina Fedorenko, Nancy Kanwisher

Main category: cs.CL

TL;DR: Language understanding requires exporting information from the brain’s language system to other regions for rich mental model construction, due to processing limitations in the core language system.

Details

Motivation: To explain how deep language understanding occurs beyond surface-level meaning extraction, addressing the fundamental limitations of the brain's core language processing system.

Method: Reviews the existing evidence for the hypothesis and argues that recent advances in cognitive neuroscience supply both the conceptual foundation and the methods needed to test it directly.

Result: The paper presents a hypothesis that deep language understanding requires information export from language regions to other brain areas for perceptual, motor, mental model, and memory computations.

Conclusion: Recent progress in cognitive neuroscience enables direct testing of this hypothesis, opening a new strategy to reveal the cognitive and neural basis of what it means to understand language.

Abstract: Language understanding entails not just extracting the surface-level meaning of the linguistic input, but constructing rich mental models of the situation it describes. Here we propose that because processing within the brain’s core language system is fundamentally limited, deeply understanding language requires exporting information from the language system to other brain regions that compute perceptual and motor representations, construct mental models, and store our world knowledge and autobiographical memories. We review the existing evidence for this hypothesis, and argue that recent progress in cognitive neuroscience provides both the conceptual foundation and the methods to directly test it, thus opening up a new strategy to reveal what it means, cognitively and neurally, to understand language.

[6] Gender Bias in Emotion Recognition by Large Language Models

Maureen Herbert, Katie Sun, Angelica Lim, Yasaman Etesam

Main category: cs.CL

TL;DR: LLMs exhibit gender biases in emotional theory of mind tasks, and training-based interventions are needed for effective debiasing rather than just prompt engineering.

Details

Motivation: As LLMs become more integrated into daily life, it's crucial to evaluate and ensure their fairness, particularly in understanding human emotions where gender biases could have significant real-world impacts.

Method: Examined LLM fairness in emotional theory of mind by presenting descriptions of people and environments, then asking “How does this person feel?” Proposed and evaluated multiple debiasing strategies including training-based interventions and prompt engineering approaches.

Result: Found that LLMs exhibit gender biases when assessing emotions. Training-based interventions were significantly more effective at reducing bias than inference-time approaches such as prompt engineering.

Conclusion: Achieving meaningful reductions in gender bias in LLMs’ emotional theory of mind requires training-based interventions rather than relying solely on prompt engineering at inference time.

Abstract: The rapid advancement of large language models (LLMs) and their growing integration into daily life underscore the importance of evaluating and ensuring their fairness. In this work, we examine fairness within the domain of emotional theory of mind, investigating whether LLMs exhibit gender biases when presented with a description of a person and their environment and asked, “How does this person feel?” Furthermore, we propose and evaluate several debiasing strategies, demonstrating that achieving meaningful reductions in bias requires training-based interventions rather than relying solely on inference-time prompt-based approaches such as prompt engineering.

[7] Breaking Bad: Norms for Valence, Arousal, and Dominance for over 10k English Multiword Expressions

Saif M. Mohammad

Main category: cs.CL

TL;DR: The paper presents NRC VAD Lexicon v2, which expands the original lexicon with 10k multiword expressions and 25k additional words, providing reliable valence, arousal, and dominance ratings for NLP and social science research.

Details

Motivation: To complement existing VAD lexicons by adding emotional ratings for multiword expressions and expanding coverage of contemporary words, enabling broader research applications.

Method: Created human-rated valence, arousal, and dominance associations for 10k English multiword expressions and their constituent words, plus additional unigrams, building on the original NRC VAD Lexicon.

Result: The new lexicon contains entries for 10k MWEs and 25k words in addition to v1 entries, with highly reliable associations. Analysis shows emotional characteristics of MWEs including emotionality strength and compositionality.

Conclusion: The expanded NRC VAD Lexicon v2 enables diverse research across NLP, psychology, public health, digital humanities, and social sciences, and is freely available for use.

Abstract: Factor analysis studies have shown that the primary dimensions of word meaning are Valence (V), Arousal (A), and Dominance (D). Existing lexicons such as the NRC VAD Lexicon, published in 2018, include VAD association ratings for words. Here, we present a complement to it, which has human ratings of valence, arousal, and dominance for 10k English Multiword Expressions (MWEs) and their constituent words. We also increase the coverage of unigrams, especially words that have become more common since 2018. In all, the new NRC VAD Lexicon v2 now has entries for 10k MWEs and 25k words, in addition to the entries in v1. We show that the associations are highly reliable. We use the lexicon to examine emotional characteristics of MWEs, including: 1. The degree to which MWEs (idioms, noun compounds, and verb particle constructions) exhibit strong emotionality; 2. The degree of emotional compositionality in MWEs. The lexicon enables a wide variety of research in NLP, Psychology, Public Health, Digital Humanities, and Social Sciences. The NRC VAD Lexicon v2 is freely available through the project webpage: http://saifmohammad.com/WebPages/nrc-vad.html

[8] Language-Independent Sentiment Labelling with Distant Supervision: A Case Study for English, Sepedi and Setswana

Koena Ronny Mabokela, Tim Schlippe, Mpho Raborife, Turgay Celik

Main category: cs.CL

TL;DR: Automatic sentiment labeling method using emojis and sentiment words for low-resource African languages, achieving 66% accuracy for English, 69% for Sepedi, and 63% for Setswana.

Details

Motivation: Many African languages lack labeled sentiment data due to expensive manual labeling, requiring automatic methods for sentiment analysis in low-resource settings.

Method: Language-independent sentiment labeling leveraging sentiment-bearing emojis and words from tweets in English, Sepedi, and Setswana using the SAfriSenti corpus.

Result: Achieved 66% accuracy for English, 69% for Sepedi, and 63% for Setswana, reducing manual correction needed to only 34% of labels on average.

Conclusion: The proposed automatic labeling method effectively reduces manual effort for sentiment analysis in low-resource African languages.

Abstract: Sentiment analysis is a helpful task to automatically analyse opinions and emotions on various topics in areas such as AI for Social Good, AI in Education or marketing. While many of the sentiment analysis systems are developed for English, many African languages are classified as low-resource languages due to the lack of digital language resources like text labelled with corresponding sentiment classes. One reason is that manually labelling text data is time-consuming and expensive. Consequently, automatic and rapid processes are needed to reduce the manual effort as much as possible, making the labelling process as efficient as possible. In this paper, we present and analyze an automatic language-independent sentiment labelling method that leverages information from sentiment-bearing emojis and words. Our experiments are conducted with tweets in the languages English, Sepedi and Setswana from SAfriSenti, a multilingual sentiment corpus for South African languages. We show that our sentiment labelling approach is able to label the English tweets with an accuracy of 66%, the Sepedi tweets with 69%, and the Setswana tweets with 63%, so that on average only 34% of the automatically generated labels remain to be corrected.
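
Illustration: distant supervision from sentiment-bearing emojis and words can be sketched as a simple vote over lexicon hits. The tiny lexicons, the whitespace tokenisation, and the tie-breaking rule below are assumptions for illustration, not the paper's actual resources. The last lines reproduce the 34% figure as 100 minus the unweighted mean of the three reported accuracies.

```python
# Hypothetical mini-lexicons; a real system would use much larger, per-language lists.
POSITIVE = {"😀", "😍", "👍", "good", "love", "great"}
NEGATIVE = {"😡", "😢", "👎", "bad", "hate", "awful"}

def label_tweet(text):
    """Assign a sentiment label by counting positive and negative cues."""
    tokens = text.lower().split()  # assumes emojis are separated by whitespace
    score = sum(t in POSITIVE for t in tokens) - sum(t in NEGATIVE for t in tokens)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"  # no cues, or cues that cancel out

print(label_tweet("I love this 😍"))
print(label_tweet("awful service 👎"))

# Share of labels left to correct, from the accuracies reported above.
accuracies = {"English": 66, "Sepedi": 69, "Setswana": 63}
remaining = 100 - sum(accuracies.values()) / len(accuracies)
print(remaining)
```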

[9] Profile-LLM: Dynamic Profile Optimization for Realistic Personality Expression in LLMs

Shi-Wei Dai, Yan-Wei Shie, Tsung-Huan Yang, Lun-Wei Ku, Yung-Hui Li

Main category: cs.CL

TL;DR: PersonaPulse is a framework that optimizes prompts to maximize personality expression in LLMs through iterative enhancement and situational response benchmarking, outperforming previous methods based on psychological descriptions.

Details

Motivation: Previous studies used prompts to elicit personality traits in LLMs but didn't optimize them to maximize personality expression, a gap that limits how realistic personalized AI interactions can be.

Method: Leverages LLMs’ inherent knowledge of personality traits to iteratively enhance role-play prompts while using a situational response benchmark as a scoring tool for realistic, contextually grounded evaluation.

Result: Quantitative evaluations show PersonaPulse-generated prompts outperform prior work. Experiments reveal relationship between model size and personality modeling, and that personality evocation can be partially controlled by pausing optimization for certain traits.

Conclusion: Prompt optimization is crucial for shaping personality expression in LLMs, offering valuable insights for future research on adaptive AI interactions.

Abstract: Personalized Large Language Models (LLMs) have been shown to be an effective way to create more engaging and enjoyable user-AI interactions. While previous studies have explored using prompts to elicit specific personality traits in LLMs, they have not optimized these prompts to maximize personality expression. To address this limitation, we propose PersonaPulse: Dynamic Profile Optimization for Realistic Personality Expression in LLMs, a framework that leverages LLMs’ inherent knowledge of personality traits to iteratively enhance role-play prompts while integrating a situational response benchmark as a scoring tool, ensuring a more realistic and contextually grounded evaluation to guide the optimization process. Quantitative evaluations demonstrate that the prompts generated by PersonaPulse outperform those of prior work, which were designed based on personality descriptions from psychological studies. Additionally, we explore the relationship between model size and personality modeling through extensive experiments. Finally, we find that, for certain personality traits, the extent of personality evocation can be partially controlled by pausing the optimization process. These findings underscore the importance of prompt optimization in shaping personality expression within LLMs, offering valuable insights for future research on adaptive AI interactions.

[10] A Systematic Analysis of Large Language Models with RAG-enabled Dynamic Prompting for Medical Error Detection and Correction

Farzad Ahmed, Joniel Augustine Jerome, Meliha Yetisgen, Özlem Uzuner

Main category: cs.CL

TL;DR: Retrieval-augmented dynamic prompting (RDP) outperforms zero-shot and static prompting for medical error detection and correction, reducing the false-positive rate by about 15% and improving recall by 5-10% in error sentence detection.

Details

Motivation: Clinical documentation errors compromise patient safety, and LLMs may help detect/correct them, but optimal prompting strategies are unclear.

Method: Evaluated 9 LLMs on MEDEC dataset using zero-shot, static prompting with random exemplars (SPR), and retrieval-augmented dynamic prompting (RDP) for error detection and correction tasks.

Result: RDP reduced FPR by ~15%, improved recall by 5-10% in error sentence detection, and generated more contextually accurate corrections compared to other methods.

Conclusion: Retrieval-augmented dynamic prompting improves detection accuracy, reduces false positives, and enhances reliability of medical error correction across diverse LLMs.

Abstract: Objective: Clinical documentation contains factual, diagnostic, and management errors that can compromise patient safety. Large language models (LLMs) may help detect and correct such errors, but their behavior under different prompting strategies remains unclear. We evaluate zero-shot prompting, static prompting with random exemplars (SPR), and retrieval-augmented dynamic prompting (RDP) for three subtasks of medical error processing: error flag detection, error sentence detection, and error correction. Methods: Using the MEDEC dataset, we evaluated nine instruction-tuned LLMs (GPT, Claude, Gemini, and OpenAI o-series models). We measured performance using accuracy, recall, false-positive rate (FPR), and an aggregate score of ROUGE-1, BLEURT, and BERTScore for error correction. We also analyzed example outputs to identify failure modes and differences between LLM and clinician reasoning. Results: Zero-shot prompting showed low recall in both detection tasks, often missing abbreviation-heavy or atypical errors. SPR improved recall but increased FPR. Across all nine LLMs, RDP reduced FPR by about 15 percent, improved recall by 5 to 10 percent in error sentence detection, and generated more contextually accurate corrections. Conclusion: Across diverse LLMs, RDP outperforms zero-shot and SPR prompting. Using retrieved exemplars improves detection accuracy, reduces false positives, and enhances the reliability of medical error correction.
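
Illustration: the difference between static prompting with random exemplars and retrieval-augmented dynamic prompting is that the few-shot examples are chosen per query by similarity. The sketch below uses token-overlap (Jaccard) similarity as a stand-in retriever and made-up placeholder notes; the paper's actual retriever, prompt wording, and data are not given in this summary. It also shows how recall and false-positive rate, the two detection metrics quoted above, are computed.

```python
def jaccard(a, b):
    """Token-overlap similarity; a stand-in for whatever retriever a real system uses."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0

def retrieve_exemplars(query, pool, k=2):
    """Return the k pool entries most similar to the query note."""
    return sorted(pool, key=lambda ex: jaccard(query, ex["text"]), reverse=True)[:k]

def build_prompt(query, exemplars):
    """Place the retrieved exemplars ahead of the query as few-shot demonstrations."""
    shots = "\n\n".join(
        f"Note: {ex['text']}\nHas error: {ex['label']}" for ex in exemplars
    )
    return f"{shots}\n\nNote: {query}\nHas error:"

def recall_and_fpr(gold, pred):
    """Recall = TP / (TP + FN); false-positive rate = FP / (FP + TN)."""
    tp = sum(1 for g, p in zip(gold, pred) if g and p)
    fn = sum(1 for g, p in zip(gold, pred) if g and not p)
    fp = sum(1 for g, p in zip(gold, pred) if not g and p)
    tn = sum(1 for g, p in zip(gold, pred) if not g and not p)
    recall = tp / (tp + fn) if tp + fn else 0.0
    fpr = fp / (fp + tn) if fp + tn else 0.0
    return recall, fpr

# Made-up placeholder exemplars, not real clinical text.
POOL = [
    {"text": "patient given drug alpha for condition one", "label": "yes"},
    {"text": "vitals stable after the procedure", "label": "no"},
    {"text": "patient given drug beta for condition two", "label": "no"},
]
query = "patient given drug alpha for condition three"
top = retrieve_exemplars(query, POOL, k=2)
prompt = build_prompt(query, top)
print(prompt)

recall, fpr = recall_and_fpr(gold=[1, 1, 0, 0, 1], pred=[1, 0, 1, 0, 1])
```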

[11] AppSelectBench: Application-Level Tool Selection Benchmark

Tianyi Chen, Michael Solodko, Sen Wang, Jongwoo Ko, Junheng Hao, Colby Banbury, Sara Abdali, Saeed Amizadeh, Qing Xiao, Yinheng Li, Tianyu Ding, Kamran Ghasedi Dizaji, Suzhen Zheng, Hao Fan, Justin Wagle, Pashmina Cameron, Kazuhito Koishida

Main category: cs.CL

TL;DR: AppSelectBench is a new benchmark for evaluating application selection in Computer Using Agents (CUAs), addressing the gap in assessing models’ ability to reason across and choose between different applications rather than just fine-grained API selection.

Details

Motivation: Current benchmarks primarily assess fine-grained API selection but don't evaluate whether models can reason across and choose between different applications, which is fundamental for CUAs to operate effectively by initializing correct environments and avoiding orchestration confusion.

Method: Introduces AppSelectBench with a novel user task generation pipeline that produces realistic, diverse, and semantically grounded user intents at scale, covering 100 widely used desktop applications and over 100,000 user tasks. Includes unified evaluation protocols covering random, heuristic, zero-shot, few-shot, and retrieval-augmented settings.

Result: Extensive experiments across closed-source and open-source LLMs reveal systematic strengths and weaknesses in inter-application reasoning, showing that even the most capable models still struggle to make consistent application choices.

Conclusion: AppSelectBench establishes a foundation for studying and advancing application-level reasoning, an essential yet underexplored capability of intelligent CUAs, with the source code available publicly.

Abstract: Computer Using Agents (CUAs) are increasingly equipped with external tools, enabling them to perform complex and realistic tasks. For CUAs to operate effectively, application selection, which refers to deciding which application to use before invoking fine-grained tools such as APIs, is a fundamental capability. It determines whether the agent initializes the correct environment, avoids orchestration confusion, and efficiently focuses on relevant context. However, existing benchmarks primarily assess fine-grained API selection, offering limited insight into whether models can reason across and choose between different applications. To fill this gap, we introduce AppSelectBench, a comprehensive benchmark for evaluating application selection in CUAs. AppSelectBench contains a novel user task generation pipeline that produces realistic, diverse, and semantically grounded user intents at scale, together with unified evaluation protocols covering random, heuristic, zero-shot, few-shot, and retrieval-augmented settings. AppSelectBench covers one hundred widely used desktop applications and includes more than one hundred thousand realistic, diverse, and semantically grounded user tasks. Extensive experiments across both closed-source and open-source large language models reveal systematic strengths and weaknesses in inter-application reasoning, showing that even the most capable models still struggle to make consistent application choices. Together, these results establish AppSelectBench as a foundation for studying and advancing application-level reasoning, an essential yet underexplored capability of intelligent CUAs. The source code is available at https://github.com/microsoft/appselectbench.

[12] $\text{R}^2\text{R}$: A Route-to-Rerank Post-Training Framework for Multi-Domain Decoder-Only Rerankers

Xinyu Wang, Hanwei Wu, Qingchen Hu, Zhenghan Tai, Jingrui Tian, Lei Ding, Jijun Chi, Hailin He, Tung Sum Thomas Kwok, Yufei Cui, Sicheng Lyu, Muzhi Li, Mingze Li, Xinyue Yu, Ling Zhou, Peng Lu

Main category: cs.CL

TL;DR: R2R is a domain-aware reranking framework that uses dynamic expert routing and Entity Abstraction for Generalization (EAG) training to prevent surface-form overfitting and improve cross-domain robustness in high-stakes fields.

Details

Motivation: Generalist reranking models lack domain-specific knowledge in critical fields like finance and law, while naive fine-tuning causes overfitting to surface forms and catastrophic forgetting of general knowledge.

Method: Combines dynamic expert routing with two-stage EAG training that masks predictive surface cues, forcing learning of domain-invariant relevance patterns. Uses lightweight Latent Semantic Router to select optimal LoRA experts per query.

Result: R2R consistently outperforms generalist and single-domain fine-tuned baselines across legal, medical, and financial domains with different reranker backbones.

Conclusion: R2R provides a model-agnostic, modular approach to domain specialization with strong cross-domain robustness, effectively addressing domain adaptation challenges in reranking.

Abstract: Decoder-only rerankers are central to Retrieval-Augmented Generation (RAG). However, generalist models miss domain-specific nuances in high-stakes fields like finance and law, and naive fine-tuning causes surface-form overfitting and catastrophic forgetting. To address this challenge, we introduce R2R, a domain-aware framework that combines dynamic expert routing with a two-stage training strategy, Entity Abstraction for Generalization (EAG). EAG introduces a counter-shortcut mechanism by masking the most predictive surface cues, forcing the reranker to learn domain-invariant relevance patterns rather than memorizing dataset-specific entities. To efficiently activate domain experts, R2R employs a lightweight Latent Semantic Router that probes internal representations from the frozen backbone decoder to select the optimal LoRA expert per query. Extensive experiments across different reranker backbones and diverse domains (legal, medical, and financial) demonstrate that R2R consistently surpasses generalist and single-domain fine-tuned baselines. Our results confirm that R2R is a model-agnostic and modular approach to domain specialization with strong cross-domain robustness.

[13] Directional Optimization Asymmetry in Transformers: A Synthetic Stress Test

Mihir Sahasrabudhe

Main category: cs.CL

TL;DR: Transformers exhibit a directional optimization gap where learning inverse mappings is fundamentally harder than forward mappings, even in synthetic data without linguistic biases.

Details

Motivation: To resolve whether directional failures in Transformers stem from linguistic statistics or the architecture itself, by testing with synthetic data that removes real-world biases.

Method: Created synthetic benchmark with random string mappings using tunable branching factor K, comparing forward tasks (zero conditional entropy) and inverse tasks (analytically determined entropy floors). Tested scratch-trained GPT-2, MLP baselines, pre-trained models, and LoRA fine-tuning.

Result: Found strong directional optimization gap (e.g., 1.16 nats at K=5) in Transformers, much larger than MLP baselines. Pre-training shifts but doesn’t eliminate gap, and LoRA hits capacity wall on high-entropy inverse mappings.

Conclusion: Transformers have intrinsic directional friction in causal training that persists even without linguistic priors, token frequencies, or temporal asymmetries, making inversion fundamentally harder.

Abstract: Transformers are theoretically reversal-invariant: their function class does not prefer left-to-right over right-to-left mappings. Yet empirical studies on natural language repeatedly report a “reversal curse,” and recent work on temporal asymmetry in LLMs suggests that real-world corpora carry their own arrow of time. This leaves an unresolved question: do directional failures stem from linguistic statistics, or from the architecture itself? We cut through this ambiguity with a fully synthetic, entropy-controlled benchmark designed as a clean-room stress test for directional learning. Using random string mappings with tunable branching factor K, we construct forward tasks with zero conditional entropy and inverse tasks with analytically determined entropy floors. Excess loss above these floors reveals that even scratch-trained GPT-2 models exhibit a strong, reproducible directional optimization gap (e.g., 1.16 nats at K=5), far larger than that of an MLP trained on the same data. Pre-trained initializations shift optimization behavior but do not eliminate this gap, while LoRA encounters a sharp capacity wall on high-entropy inverse mappings. Together, these results isolate a minimal, semantics-free signature of directional friction intrinsic to causal Transformer training, one that persists even when linguistic priors, token frequencies, and corpus-level temporal asymmetries are removed. Our benchmark provides a controlled instrument for dissecting directional biases in modern sequence models and motivates deeper mechanistic study of why inversion remains fundamentally harder for Transformers.
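
Illustration: the notion of an entropy floor can be made concrete under one simplifying assumption, namely that each output string has exactly K equally likely preimages. In that case no model can push the inverse-task loss below ln K nats, while the forward task has a floor of zero. The uniform-preimage assumption, the gap definition, and the loss values below are hypothetical and may not match the paper's exact construction.

```python
import math

def inverse_entropy_floor(k):
    """Conditional entropy in nats when K preimages are equally likely (assumed)."""
    return math.log(k)

def excess_loss(observed_loss_nats, floor_nats):
    """Loss above the irreducible floor; this is the part optimization could remove."""
    return observed_loss_nats - floor_nats

def directional_gap(forward_loss, inverse_loss, k):
    """Inverse excess loss minus forward excess loss (one plausible reading of the gap)."""
    forward_floor = 0.0  # forward mapping is deterministic: zero conditional entropy
    return (excess_loss(inverse_loss, inverse_entropy_floor(k))
            - excess_loss(forward_loss, forward_floor))

floor_k5 = inverse_entropy_floor(5)
# Made-up losses purely to exercise the functions.
gap = directional_gap(forward_loss=0.05, inverse_loss=2.0, k=5)
print(f"floor at K=5: {floor_k5:.3f} nats, example gap: {gap:.3f} nats")
```

Measuring against the floor matters because an inverse-task loss of ln K is already optimal; only the excess indicates an optimization difficulty.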

Edward Ajayi, Martha Kachweka, Mawuli Deku, Emily Aiken

Main category: cs.CL

TL;DR: A unified multiclass classification framework for detecting 10 mental health and cyberbullying categories from social media, with MentalBERT achieving best performance (0.92 accuracy, 0.76 Macro F1) and a human-in-the-loop dashboard prototype.

Details

Motivation: Address the growing prevalence of mental health challenges and cyberbullying in digital spaces by developing scalable, interpretable detection systems.

Method: Curated datasets from Twitter/Reddit, implemented ‘split-then-balance’ pipeline, compared lexical models, hybrid approaches, and fine-tuned transformers including domain-adapted MentalBERT.

Result: End-to-end fine-tuning proved critical; MentalBERT outperformed generic models and zero-shot LLM baseline with 0.92 accuracy and 0.76 Macro F1 score.

Conclusion: Presents a robust baseline for human-in-the-loop screening, with future needs for multi-label clinically-validated datasets at the intersection of online safety and computational mental health.

Abstract: Mental health challenges and cyberbullying are increasingly prevalent in digital spaces, necessitating scalable and interpretable detection systems. This paper introduces a unified multiclass classification framework for detecting ten distinct mental health and cyberbullying categories from social media data. We curate datasets from Twitter and Reddit, implementing a rigorous “split-then-balance” pipeline to train on balanced data while evaluating on a realistic, held-out imbalanced test set. We conducted a comprehensive evaluation comparing traditional lexical models, hybrid approaches, and several end-to-end fine-tuned transformers. Our results demonstrate that end-to-end fine-tuning is critical for performance, with the domain-adapted MentalBERT emerging as the top model, achieving an accuracy of 0.92 and a Macro F1 score of 0.76, surpassing both its generic counterpart and a zero-shot LLM baseline. Grounded in a comprehensive ethical analysis, we frame the system as a human-in-the-loop screening aid, not a diagnostic tool. To support this, we introduce a hybrid SHAPLLM explainability framework and present a prototype dashboard (“Social Media Screener”) designed to integrate model predictions and their explanations into a practical workflow for moderators. Our work provides a robust baseline, highlighting future needs for multi-label, clinically-validated datasets at the critical intersection of online safety and computational mental health.
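A rough sketch of what a "split-then-balance" pipeline could look like: the held-out test set is carved off first and keeps its natural class imbalance, and only the training portion is rebalanced. Random oversampling with replacement is an assumption here, since the abstract does not say how balancing is done; the function name and parameters are hypothetical.

```python
import random
from collections import Counter, defaultdict


def split_then_balance(examples, test_fraction=0.2, seed=0):
    """examples: list of (text, label). Returns (balanced_train, imbalanced_test)."""
    rng = random.Random(seed)
    shuffled = list(examples)
    rng.shuffle(shuffled)

    # 1) Split first, so the test set keeps the realistic label distribution.
    n_test = int(len(shuffled) * test_fraction)
    test, train = shuffled[:n_test], shuffled[n_test:]
    if not train:
        return [], test

    # 2) Balance only the training portion.
    counts = Counter(label for _, label in train)
    target = max(counts.values())
    by_label = defaultdict(list)
    for item in train:
        by_label[item[1]].append(item)

    balanced = []
    for items in by_label.values():
        balanced.extend(items)
        # ASSUMPTION: oversample minority classes with replacement.
        balanced.extend(rng.choices(items, k=target - len(items)))
    rng.shuffle(balanced)
    return balanced, test
```

Balancing after the split matters because duplicating examples before splitting could place copies of the same post in both the training and test sets.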

[15] Online-PVLM: Advancing Personalized VLMs with Online Concept Learning

Huiyu Bai, Runze Wang, Zhuoyun Du, Yiyang Zhao, Fengji Zhang, Haoyu Chen, Xiaoyong Zhu, Bo Zheng, Xuejiao Zhao

Main category: cs.CL

TL;DR: Online-PVLM enables real-time concept learning for personalized VLMs using hyperbolic representations without training, addressing scalability issues in large-scale scenarios.

Details

Motivation: Existing methods require separate embeddings for each concept and fail to support real-time adaptation during testing, making them inefficient for large-scale applications.

Method: Proposed Online-PVLM framework using hyperbolic representations for train-free concept embedding generation at test time, plus OP-Eval benchmark with 1,292 concepts and 30K+ instances.

Result: Extensive experiments demonstrate state-of-the-art performance in online concept learning for personalized VLMs.

Conclusion: Online-PVLM provides a scalable and efficient solution for real-time concept learning in personalized VLMs, with comprehensive evaluation through the OP-Eval benchmark.

Abstract: Personalized Visual Language Models (VLMs) are gaining increasing attention for their formidable ability in interactions aligned with user-specific concepts (e.g., identifying a user’s bike). Existing methods typically require the learning of separate embeddings for each new concept, which fails to support real-time adaptation during testing. This limitation becomes particularly pronounced in large-scale scenarios, where efficient retrieval of concept embeddings is not achievable. To close this gap, we propose Online-PVLM, a framework for online concept learning that leverages hyperbolic representations. Our approach establishes a train-free paradigm for generating concept embeddings at test time, making the use of personalized VLMs both scalable and efficient. In addition, we develop OP-Eval, a comprehensive and large-scale benchmark comprising 1,292 concepts and over 30K high-quality instances with diverse question types, designed to rigorously assess online concept learning in realistic scenarios. Extensive experiments demonstrate the state-of-the-art performance of our proposed framework. Our source code and dataset will be made available.
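For readers unfamiliar with hyperbolic representations, the sketch below shows the standard distance in the Poincaré ball and a nearest-concept lookup built on it. Whether Online-PVLM uses the Poincaré ball or another hyperbolic model is not stated in the abstract, so the model choice, the `nearest_concept` helper, and the toy embeddings are all assumptions for illustration.

```python
import math


def poincare_distance(u, v):
    """Geodesic distance between two points strictly inside the unit ball."""
    sq_u = sum(a * a for a in u)
    sq_v = sum(b * b for b in v)
    if sq_u >= 1.0 or sq_v >= 1.0:
        raise ValueError("points must lie strictly inside the unit ball")
    sq_diff = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.acosh(1.0 + 2.0 * sq_diff / ((1.0 - sq_u) * (1.0 - sq_v)))


def nearest_concept(query, concepts):
    """concepts: dict mapping a (hypothetical) concept name to its embedding."""
    return min(concepts, key=lambda name: poincare_distance(query, concepts[name]))


if __name__ == "__main__":
    bank = {"bike": [0.5, 0.0], "dog": [0.0, -0.6]}
    print(nearest_concept([0.4, 0.1], bank))
```

Distances grow rapidly near the boundary of the ball, which is the property that makes hyperbolic space a common choice for hierarchical data.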

[16] MTA: A Merge-then-Adapt Framework for Personalized Large Language Model

Xiaopeng Li, Yuanjin Zheng, Wanyu Wang, Wenlin Zhang, Pengyue Jia, Yiqi Wang, Maolin Wang, Xuetao Wei, Xiangyu Zhao

Main category: cs.CL

TL;DR: MTA is a Merge-then-Adapt framework for Personalized LLMs that addresses scalability and sparse data issues through dynamic meta-LoRA fusion and few-shot personalization.

Details

Motivation: Current PLLM approaches face scalability issues (linear storage costs with users) and poor performance for users with sparse data, requiring a more efficient and flexible solution.

Method: Three-stage framework: 1) Construct shared Meta-LoRA Bank with anchor users, 2) Adaptive LoRA Fusion to dynamically merge relevant meta-LoRAs, 3) LoRA Stacking with ultra-low-rank module for few-shot personalization.

Result: Outperforms existing SOTA methods on LaMP benchmark across multiple tasks, demonstrating superior performance and scalability.

Conclusion: MTA provides an effective solution for scalable PLLMs that handles sparse data through dynamic fusion and few-shot adaptation, eliminating user-specific storage requirements.

Abstract: Personalized Large Language Models (PLLMs) aim to align model outputs with individual user preferences, a crucial capability for user-centric applications. However, the prevalent approach of fine-tuning a separate module for each user faces two major limitations: (1) storage costs scale linearly with the number of users, rendering the method unscalable; and (2) fine-tuning a static model from scratch often yields suboptimal performance for users with sparse data. To address these challenges, we propose MTA, a Merge-then-Adapt framework for PLLMs. MTA comprises three key stages. First, we construct a shared Meta-LoRA Bank by selecting anchor users and pre-training meta-personalization traits within meta-LoRA modules. Second, to ensure scalability and enable dynamic personalization combination beyond static models, we introduce an Adaptive LoRA Fusion stage. This stage retrieves and dynamically merges the most relevant anchor meta-LoRAs to synthesize a user-specific one, thereby eliminating the need for user-specific storage and supporting more flexible personalization. Third, we propose a LoRA Stacking for Few-Shot Personalization stage, which applies an additional ultra-low-rank, lightweight LoRA module on top of the merged LoRA. Fine-tuning this module enables effective personalization under few-shot settings. Extensive experiments on the LaMP benchmark demonstrate that our approach outperforms existing SOTA methods across multiple tasks.
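The retrieve-and-merge step of the Adaptive LoRA Fusion stage can be sketched as follows. This is a simplified illustration, not the paper's procedure: it assumes each anchor meta-LoRA is stored as a flat weight-delta vector next to an anchor embedding, that retrieval uses cosine similarity, and that merging is a softmax-weighted sum over the top-k anchors. All names and parameters are hypothetical.

```python
import math


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def fuse_meta_loras(user_vec, anchors, top_k=2, temperature=1.0):
    """anchors: list of (anchor_embedding, lora_delta) pairs; returns a merged delta."""
    # Retrieve the top-k anchors most similar to the user representation.
    scored = sorted(
        ((cosine(user_vec, emb), delta) for emb, delta in anchors),
        key=lambda pair: pair[0],
        reverse=True,
    )[:top_k]

    # ASSUMPTION: softmax over similarities gives the merge weights.
    best = max(score for score, _ in scored)
    exps = [math.exp((score - best) / temperature) for score, _ in scored]
    total = sum(exps)
    weights = [e / total for e in exps]

    dim = len(scored[0][1])
    return [
        sum(w * delta[i] for w, (_, delta) in zip(weights, scored))
        for i in range(dim)
    ]
```

Because the merged delta is synthesized on demand from the shared bank, nothing user-specific has to be stored, which is the scalability point the abstract makes.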

[17] More Bias, Less Bias: BiasPrompting for Enhanced Multiple-Choice Question Answering

Duc Anh Vu, Thong Nguyen, Cong-Duy Nguyen, Viet Anh Nguyen, Anh Tuan Luu

Main category: cs.CL

TL;DR: BiasPrompting is a novel inference framework that improves LLM performance on multiple-choice questions by generating and evaluating reasoning for all answer options before final prediction.

Details

Motivation: Existing approaches present answer choices to LLMs without contextual grounding, leading to incomplete exploration of options and degraded reasoning capabilities.

Method: Two-stage framework: 1) reasoning generation stage where LLMs produce supportive reasoning for each answer option, 2) reasoning-guided agreement stage where generated reasonings are synthesized to select the most plausible answer.

Result: Significant improvements in five widely used multiple-choice question answering benchmarks, enhancing LLM reasoning capabilities especially in challenging settings.

Conclusion: BiasPrompting provides a strong foundation for tackling complex multiple-choice questions and addresses limitations of existing methods that underperform in difficult scenarios.

Abstract: With the advancement of large language models (LLMs), their performance on multiple-choice question (MCQ) tasks has improved significantly. However, existing approaches face key limitations: answer choices are typically presented to LLMs without contextual grounding or explanation. This absence of context can lead to incomplete exploration of all possible answers, ultimately degrading the models’ reasoning capabilities. To address these challenges, we introduce BiasPrompting, a novel inference framework that guides LLMs to generate and critically evaluate reasoning across all plausible answer options before reaching a final prediction. It consists of two components: first, a reasoning generation stage, where the model is prompted to produce supportive reasonings for each answer option, and then, a reasoning-guided agreement stage, where the generated reasonings are synthesized to select the most plausible answer. Through comprehensive evaluations, BiasPrompting demonstrates significant improvements in five widely used multiple-choice question answering benchmarks. Our experiments showcase that BiasPrompting enhances the reasoning capabilities of LLMs and provides a strong foundation for tackling complex and challenging questions, particularly in settings where existing methods underperform.
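The two stages described above can be sketched as a small control loop. The prompt wording and the `llm` callable (any function that maps a prompt string to a completion string) are assumptions for illustration; the abstract does not give the actual prompts.

```python
from typing import Callable, Dict, List


def bias_prompting(question: str, options: List[str], llm: Callable[[str], str]) -> str:
    # Stage 1: reasoning generation, one supportive argument per option.
    reasonings: Dict[str, str] = {}
    for option in options:
        prompt = (
            f"Question: {question}\n"
            f"Assume the answer is: {option}\n"
            "Give the strongest reasoning that supports this answer."
        )
        reasonings[option] = llm(prompt)

    # Stage 2: reasoning-guided agreement over all generated arguments.
    summary = "\n".join(
        f"Option: {option}\nReasoning: {text}" for option, text in reasonings.items()
    )
    final_prompt = (
        f"Question: {question}\n{summary}\n"
        "Considering all the reasoning above, reply with the single most "
        "plausible option, verbatim."
    )
    answer = llm(final_prompt).strip()
    if answer not in options:
        raise ValueError(f"model returned an unknown option: {answer!r}")
    return answer
```

The cost is one extra model call per answer option plus a final aggregation call, so a four-option question needs five calls instead of one.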

[18] SSA: Sparse Sparse Attention by Aligning Full and Sparse Attention Outputs in Feature Space

Zhenyi Shen, Junru Lu, Lin Gui, Jiazheng Li, Yulan He, Di Yin, Xing Sun

Main category: cs.CL

TL;DR: SSA is a training framework that addresses gradient deficiency in sparse attention by enforcing bidirectional alignment between sparse and full attention, achieving SOTA performance and enabling flexible compute-performance trade-offs.

Details

Motivation: Existing sparse attention methods suffer from gradient update deficiency where excluded key-value pairs don't learn proper suppression, and paradoxically produce lower sparsity than full attention despite aiming to approximate it.

Method: SSA unifies sparse and full attention training with bidirectional alignment at every layer, preserving gradient flow to all tokens while explicitly aligning sparse-attention outputs with full-attention counterparts.

Result: SSA achieves state-of-the-art performance on commonsense benchmarks under both sparse and full attention inference, enables smooth adaptation to varying sparsity budgets, and shows improved long-context extrapolation by mitigating attention over-allocation.

Conclusion: SSA successfully overcomes gradient deficiency in sparse attention training, achieving superior performance and flexibility while demonstrating that native sparse-attention training can improve long-context extrapolation capabilities.

Abstract: The quadratic complexity of full attention limits efficient long-context processing in large language models (LLMs). Sparse attention mitigates this cost by restricting each query to attend to a subset of previous tokens; however, training-free approaches often lead to severe performance degradation. Native sparse-attention methods (e.g., NSA, MoBA) alleviate this issue, yet exhibit a critical paradox: they produce lower attention sparsity than full-attention models, despite aiming to approximate full attention, which may constrain their effectiveness. We attribute this paradox to gradient update deficiency: low-ranked key-value pairs excluded during sparse training receive neither forward contribution nor backward gradients, and thus never learn proper suppression. To overcome this limitation, we propose SSA (Sparse Sparse Attention), a unified training framework that considers both sparse and full attention and enforces bidirectional alignment at every layer. This design preserves gradient flow to all tokens while explicitly encouraging sparse-attention outputs to align with their full-attention counterparts, thereby promoting stronger sparsity. As a result, SSA achieves state-of-the-art performance under both sparse and full attention inference across multiple commonsense benchmarks. Furthermore, SSA enables models to adapt smoothly to varying sparsity budgets; performance improves consistently as more tokens are allowed to attend, supporting flexible compute-performance trade-offs at inference time. Finally, we show that native sparse-attention training surprisingly improves long-context extrapolation by mitigating the over-allocation of attention values in sink areas, with SSA demonstrating the strongest extrapolation capability.
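To make the sparse-versus-full comparison concrete, the sketch below computes the attention output for a single query under full attention and under top-k sparse attention, then measures how far apart the two outputs are. The mean squared difference is only a stand-in for an alignment term; the abstract does not specify SSA's actual loss, and the top-k selection rule and all function names are assumptions.

```python
import math


def softmax(xs):
    peak = max(xs)
    exps = [math.exp(x - peak) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attend(scores, values, keep=None):
    """Attention output for one query; keep=None means full attention."""
    idx = list(range(len(scores))) if keep is None else sorted(keep)
    weights = softmax([scores[i] for i in idx])
    dim = len(values[0])
    return [sum(w * values[i][d] for w, i in zip(weights, idx)) for d in range(dim)]


def top_k_indices(scores, k):
    """Indices of the k highest-scoring keys."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]


def alignment_loss(sparse_out, full_out):
    """ASSUMED stand-in: mean squared difference between the two outputs."""
    return sum((a - b) ** 2 for a, b in zip(sparse_out, full_out)) / len(full_out)


if __name__ == "__main__":
    scores = [2.0, 0.0, -1.0, 1.0]
    values = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
    full = attend(scores, values)
    sparse = attend(scores, values, keep=top_k_indices(scores, 2))
    print(alignment_loss(sparse, full))
```

In this toy setting the excluded keys contribute nothing to the sparse output, which mirrors the gradient deficiency the abstract describes: tokens outside the kept set get no signal unless a full-attention pass is also trained.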

[19] EM2LDL: A Multilingual Speech Corpus for Mixed Emotion Recognition through Label Distribution Learning

Xingfeng Li, Xiaohan Shi, Junjie Li, Yongwei Li, Masashi Unoki, Tomoki Toda, Masato Akagi

Main category: cs.CL

TL;DR: EM2LDL is a multilingual speech corpus for mixed emotion recognition using label distribution learning, covering English, Mandarin, and Cantonese with code-switching data from Hong Kong and Macao.

Details

Motivation: To address limitations of monolingual and single-label emotion corpora that lack linguistic diversity, cannot model mixed emotions, and have poor ecological validity.

Method: Created a corpus with expressive utterances in three languages, capturing intra-utterance code-switching, annotated with fine-grained emotion distributions across 32 categories. Used self-supervised learning models (HuBERT-large-EN) for baseline evaluations.

Result: Robust performance in speaker-independent gender-, age-, and personality-based evaluations, with HuBERT-large-EN achieving optimal results.

Conclusion: EM2LDL enables exploration of complex emotional dynamics in multilingual settings and provides a versatile testbed for developing adaptive, empathetic systems in affective computing applications.

Abstract: This study introduces EM2LDL, a novel multilingual speech corpus designed to advance mixed emotion recognition through label distribution learning. Addressing the limitations of predominantly monolingual and single-label emotion corpora that restrict linguistic diversity, are unable to model mixed emotions, and lack ecological validity, EM2LDL comprises expressive utterances in English, Mandarin, and Cantonese, capturing the intra-utterance code-switching prevalent in multilingual regions like Hong Kong and Macao. The corpus integrates spontaneous emotional expressions from online platforms, annotated with fine-grained emotion distributions across 32 categories. Experimental baselines using self-supervised learning models demonstrate robust performance in speaker-independent gender-, age-, and personality-based evaluations, with HuBERT-large-EN achieving optimal results. By incorporating linguistic diversity and ecological validity, EM2LDL enables the exploration of complex emotional dynamics in multilingual settings. This work provides a versatile testbed for developing adaptive, empathetic systems for applications in affective computing, including mental health monitoring and cross-cultural communication. The dataset, annotations, and baseline codes are publicly available at https://github.com/xingfengli/EM2LDL.

[20] LightMem: Lightweight and Efficient Memory-Augmented Generation

Jizhan Fang, Xinle Deng, Haoming Xu, Ziyan Jiang, Yuqi Tang, Ziwen Xu, Shumin Deng, Yunzhi Yao, Mengru Wang, Shuofei Qiao, Huajun Chen, Ningyu Zhang

Main category: cs.CL

TL;DR: LightMem is an efficient memory system for LLMs that organizes memory into three stages inspired by human memory models, achieving significant performance improvements while drastically reducing computational overhead.

Details

Motivation: LLMs struggle to effectively use historical interaction information in dynamic environments, and existing memory systems introduce substantial time and computational overhead.

Method: Inspired by the Atkinson-Shiffrin human memory model, LightMem organizes memory into three stages: sensory memory for rapid filtering and topic grouping, short-term memory for topic-based consolidation, and long-term memory with offline sleep-time updates.

Result: On LongMemEval and LoCoMo benchmarks, LightMem improved QA accuracy by up to 7.7%/29.3%, reduced total token usage by up to 38x/20.9x and API calls by up to 30x/55.5x, with even better online test-time efficiency.

Conclusion: LightMem successfully balances performance and efficiency in memory systems for LLMs, demonstrating substantial improvements in accuracy while dramatically reducing computational costs.

Abstract: Despite their remarkable capabilities, Large Language Models (LLMs) struggle to effectively leverage historical interaction information in dynamic and complex environments. Memory systems enable LLMs to move beyond stateless interactions by introducing persistent information storage, retrieval, and utilization mechanisms. However, existing memory systems often introduce substantial time and computational overhead. To this end, we introduce a new memory system called LightMem, which strikes a balance between the performance and efficiency of memory systems. Inspired by the Atkinson-Shiffrin model of human memory, LightMem organizes memory into three complementary stages. First, cognition-inspired sensory memory rapidly filters irrelevant information through lightweight compression and groups information according to their topics. Next, topic-aware short-term memory consolidates these topic-based groups, organizing and summarizing content for more structured access. Finally, long-term memory with sleep-time update employs an offline procedure that decouples consolidation from online inference. On LongMemEval and LoCoMo, using GPT and Qwen backbones, LightMem consistently surpasses strong baselines, improving QA accuracy by up to 7.7% / 29.3%, reducing total token usage by up to 38x / 20.9x and API calls by up to 30x / 55.5x, while purely online test-time costs are even lower, achieving up to 106x / 117x token reduction and 159x / 310x fewer API calls. The code is available at https://github.com/zjunlp/LightMem.

[21] “When Data is Scarce, Prompt Smarter”… Approaches to Grammatical Error Correction in Low-Resource Settings

Somsubhra De, Harsh Kumar, Arun Prakash A

Main category: cs.CL

TL;DR: LLMs with prompting strategies outperform fine-tuned Indic-language models for grammatical error correction in low-resource settings, achieving top results across multiple Indic languages.

Details

Motivation: Grammatical error correction for Indic languages remains challenging due to limited resources, linguistic diversity, and complex morphology, while existing approaches mainly focus on high-resource languages like English.

Method: Used prompting-based approaches with state-of-the-art LLMs (GPT-4.1, Gemini-2.5, LLaMA-4) combined with few-shot strategy, including zero-shot and few-shot approaches with carefully designed prompts.

Result: Achieved leading results: 1st in Tamil (GLEU: 91.57) and Hindi (GLEU: 85.69), 2nd in Telugu (GLEU: 85.22), 4th in Bangla (GLEU: 92.86), and 5th in Malayalam (GLEU: 92.97), substantially outperforming fine-tuned Indic-language models.

Conclusion: Prompt-driven NLP techniques and large-scale LLMs effectively bridge resource gaps in multilingual GEC, demonstrating exceptional multilingual generalization capabilities for low-resource languages.

Abstract: Grammatical error correction (GEC) is an important task in Natural Language Processing that aims to automatically detect and correct grammatical mistakes in text. While recent advances in transformer-based models and large annotated datasets have greatly improved GEC performance for high-resource languages such as English, the progress has not extended equally. For most Indic languages, GEC remains a challenging task due to limited resources, linguistic diversity and complex morphology. In this work, we explore prompting-based approaches using state-of-the-art large language models (LLMs), such as GPT-4.1, Gemini-2.5 and LLaMA-4, combined with a few-shot strategy to adapt them to low-resource settings. We observe that even basic prompting strategies, such as zero-shot and few-shot approaches, enable these LLMs to substantially outperform fine-tuned Indic-language models like Sarvam-22B, thereby illustrating the exceptional multilingual generalization capabilities of contemporary LLMs for GEC. Our experiments show that carefully designed prompts and lightweight adaptation significantly enhance correction quality across multiple Indic languages. We achieved leading results in the shared task, ranking 1st in Tamil (GLEU: 91.57) and Hindi (GLEU: 85.69), 2nd in Telugu (GLEU: 85.22), 4th in Bangla (GLEU: 92.86), and 5th in Malayalam (GLEU: 92.97). These findings highlight the effectiveness of prompt-driven NLP techniques and underscore the potential of large-scale LLMs to bridge resource gaps in multilingual GEC.

[22] Computational Turing Test Reveals Systematic Differences Between Human and AI Language

Nicolò Pagan, Petter Törnberg, Christopher A. Bail, Anikó Hannák, Christopher Barrie

Main category: cs.CL

TL;DR: This paper introduces a computational Turing test framework to validate LLMs’ ability to simulate human behavior, finding that even calibrated LLMs remain distinguishable from human text, with trade-offs between human-likeness and semantic fidelity.

Details

Motivation: The motivation is to address the untested assumption that LLMs can generate realistic human-like text for social science simulations, and the lack of robust validation tools beyond unreliable human judgments.

Method: The method involves a computational Turing test framework combining aggregate metrics (BERT-based detectability, semantic similarity) with interpretable linguistic features (stylistic markers, topical patterns), and systematically comparing 9 open-weight LLMs across 5 calibration strategies on X, Bluesky, and Reddit data.

Result: Results show LLM outputs remain clearly distinguishable from human text even after calibration, particularly in affective tone and emotional expression. Instruction-tuned models underperform base models, scaling model size doesn’t enhance human-likeness, and there’s a trade-off between human-likeness and semantic fidelity.

Conclusion: The study provides a scalable validation framework for LLM simulations but cautions about current limitations in capturing human communication, highlighting the need for better calibration methods that balance human-likeness with semantic accuracy.

Abstract: Large language models (LLMs) are increasingly used in the social sciences to simulate human behavior, based on the assumption that they can generate realistic, human-like text. Yet this assumption remains largely untested. Existing validation efforts rely heavily on human-judgment-based evaluations – testing whether humans can distinguish AI from human output – despite evidence that such judgments are blunt and unreliable. As a result, the field lacks robust tools for assessing the realism of LLM-generated text or for calibrating models to real-world data. This paper makes two contributions. First, we introduce a computational Turing test: a validation framework that integrates aggregate metrics (BERT-based detectability and semantic similarity) with interpretable linguistic features (stylistic markers and topical patterns) to assess how closely LLMs approximate human language within a given dataset. Second, we systematically compare nine open-weight LLMs across five calibration strategies – including fine-tuning, stylistic prompting, and context retrieval – benchmarking their ability to reproduce user interactions on X (formerly Twitter), Bluesky, and Reddit. Our findings challenge core assumptions in the literature. Even after calibration, LLM outputs remain clearly distinguishable from human text, particularly in affective tone and emotional expression. Instruction-tuned models underperform their base counterparts, and scaling up model size does not enhance human-likeness. Crucially, we identify a trade-off: optimizing for human-likeness often comes at the cost of semantic fidelity, and vice versa. These results provide a much-needed scalable framework for validation and calibration in LLM simulations – and offer a cautionary note about their current limitations in capturing human communication.

[23] SEDA: A Self-Adapted Entity-Centric Data Augmentation for Boosting Grid-based Discontinuous NER Models

Wen-Fang Su, Hsiao-Wei Chou, Wen-Yang Lin

Main category: cs.CL

TL;DR: Grid-tagging models enhanced with image data augmentation techniques improve discontinuous entity recognition by addressing segmentation issues, achieving 1-2.5% overall F1 gains and 3.7-8.4% gains for discontinuous entities.

Details

Motivation: Traditional NER methods struggle with discontinuous entities due to text segmentation problems, particularly missing cross-sentence entities which reduces recognition accuracy.

Method: Integrate image data augmentation techniques (cropping, scaling, padding) into grid-based tagging models to enhance discontinuous entity recognition and handle segmentation challenges.

Result: Traditional segmentation methods fail to capture cross-sentence discontinuous entities, while augmented grid models achieve F1 score improvements of 1-2.5% overall and 3.7-8.4% for discontinuous entities on CADEC, ShARe13, and ShARe14 datasets.

Conclusion: Image data augmentation combined with grid-tagging models effectively addresses discontinuous entity recognition challenges and improves performance on segmentation-sensitive NER tasks.

Abstract: Named Entity Recognition (NER) is a critical task in natural language processing, yet it remains particularly challenging for discontinuous entities. The primary difficulty lies in text segmentation, as traditional methods often missegment or entirely miss cross-sentence discontinuous entities, significantly affecting recognition accuracy. Therefore, we aim to address the segmentation and omission issues associated with such entities. Recent studies have shown that grid-tagging methods are effective for information extraction due to their flexible tagging schemes and robust architectures. Building on this, we integrate image data augmentation techniques, such as cropping, scaling, and padding, into grid-based models to enhance their ability to recognize discontinuous entities and handle segmentation challenges. Experimental results demonstrate that traditional segmentation methods often fail to capture cross-sentence discontinuous entities, leading to decreased performance. In contrast, our augmented grid models achieve notable improvements. Evaluations on the CADEC, ShARe13, and ShARe14 datasets show F1 score gains of 1-2.5% overall and 3.7-8.4% for discontinuous entities, confirming the effectiveness of our approach.

[24] KyrgyzBERT: A Compact, Efficient Language Model for Kyrgyz NLP

Adilet Metinov, Gulida M. Kudakeeva, Gulnara D. Kabaeva

Main category: cs.CL

TL;DR: KyrgyzBERT is the first monolingual BERT model for Kyrgyz, achieving competitive sentiment analysis performance with a smaller model size compared to multilingual BERT.

Details

Motivation: Kyrgyz is a low-resource language lacking foundational NLP tools, creating a need for dedicated language models to support Kyrgyz NLP research.

Method: Developed KyrgyzBERT with 35.9M parameters using a custom tokenizer for Kyrgyz morphology, and created kyrgyz-sst2 benchmark by translating Stanford Sentiment Treebank with manual test set annotation.

Result: KyrgyzBERT fine-tuned on kyrgyz-sst2 achieved F1-score of 0.8280, competitive with a fine-tuned mBERT model that is five times larger.

Conclusion: The release of KyrgyzBERT models, data, and code provides foundational tools to advance Kyrgyz NLP research and demonstrates effective monolingual modeling for low-resource languages.

Abstract: Kyrgyz remains a low-resource language with limited foundational NLP tools. To address this gap, we introduce KyrgyzBERT, the first publicly available monolingual BERT-based language model for Kyrgyz. The model has 35.9M parameters and uses a custom tokenizer designed for the language’s morphological structure. To evaluate performance, we create kyrgyz-sst2, a sentiment analysis benchmark built by translating the Stanford Sentiment Treebank and manually annotating the full test set. KyrgyzBERT fine-tuned on this dataset achieves an F1-score of 0.8280, competitive with a fine-tuned mBERT model five times larger. All models, data, and code are released to support future research in Kyrgyz NLP.

[25] REFLEX: Self-Refining Explainable Fact-Checking via Disentangling Truth into Style and Substance

Chuyi Kong, Gao Wei, Jing Ma, Hongzhan Lin, Zhiyuan Fan

Main category: cs.CL

TL;DR: REFLEX is a plug-and-play fact-checking paradigm that uses internal model knowledge instead of external sources to improve accuracy and explanation quality through role-play dialogue and contrastive activation steering.

Details

Motivation: Existing LLM-based fact-checking systems rely heavily on external knowledge sources, causing latency, hallucinations, and reduced reliability for real-time use, while lacking interpretable explanations.

Method: Reformulates fact-checking as role-play dialogue, jointly trains verdict prediction and explanation generation, extracts contrastive activation pairs between backbone model and fine-tuned variant to create steering vectors that disentangle truth into style and substance.

Result: Outperforms previous methods that steer toward single truth direction, achieves state-of-the-art performance with only 465 self-refined training samples, and enables models without explanatory objectives to improve by up to 7.57%.

Conclusion: Internal explanation signals play dual role in interpreting and enhancing factual reasoning, demonstrating that REFLEX effectively leverages internal model knowledge for more faithful and efficient fact-checking without external dependencies.

Abstract: The prevalence of misinformation on social media threatens public trust, demanding automated fact-checking systems that provide accurate verdicts with interpretable explanations. However, existing large language model-based (LLM-based) approaches often rely heavily on external knowledge sources, introducing substantial latency and even hallucinations that undermine reliability, interpretability, and responsiveness, which is crucial for real-time use. To address these challenges, we propose the REason-guided Fact-checking with Latent EXplanations (REFLEX) paradigm, a plug-and-play, self-refining paradigm that leverages the internal knowledge in the backbone model to improve both verdict accuracy and explanation quality. REFLEX reformulates fact-checking as a role-play dialogue and jointly trains verdict prediction and explanation generation. It adaptively extracts contrastive activation pairs between the backbone model and its fine-tuned variant to construct steering vectors that disentangle truth into style and substance naturally. These activation-level signals guide inference and suppress noisy explanations, enabling more faithful and efficient reasoning. Experiments on real-world datasets show that REFLEX outperforms previous methods that steer toward a single truth direction and underscores the challenge traditional approaches face when handling the subtle, human-unknown truth in fact-checking tasks. Remarkably, with only 465 self-refined training samples, REFLEX achieves state-of-the-art performance. Furthermore, models trained with explanatory objectives can effectively guide those without them, yielding up to a 7.57% improvement, highlighting that internal explanation signals play a dual role in both interpreting and enhancing factual reasoning.

[26] Scaling LLM Speculative Decoding: Non-Autoregressive Forecasting in Large-Batch Scenarios

Luohe Shi, Zuchao Li, Lefei Zhang, Baoyuan Qi, Guoming Liu, Hai Zhao

Main category: cs.CL

TL;DR: SpecFormer is a novel architecture that combines unidirectional and bidirectional attention to enable efficient speculative decoding for LLM inference, eliminating the need for large draft trees and achieving consistent acceleration even in large-batch scenarios.

Details

Motivation: Current speculative decoding methods assume abundant computing power and use complex draft trees, but batching has become a superior alternative that compresses idle computing power. There's a need for speculative decoding with low verification resources and scheduling costs.

Method: Proposed SpecFormer architecture that integrates unidirectional and bidirectional attention mechanisms, combining autoregressive model’s ability to extract information from entire input with parallel generation benefits of non-autoregressive models.

Result: SpecFormer eliminates reliance on large prefix trees and achieves consistent acceleration in large-batch scenarios. Experiments show it sets new standards for scaling LLM inference with lower training demands and reduced computational costs.

Conclusion: SpecFormer provides an effective solution for efficient speculative decoding, enabling scaling of LLM inference with reduced computational requirements and training demands.

Abstract: Speculative decoding accelerates LLM inference by utilizing otherwise idle computational resources during memory-to-chip data transfer. Current speculative decoding methods typically assume a considerable amount of available computing power, then generate a complex and massive draft tree using a small autoregressive language model to improve overall prediction accuracy. However, methods like batching have been widely applied in mainstream model inference systems as a superior alternative to speculative decoding, as they compress the available idle computing power. Therefore, performing speculative decoding with low verification resources and low scheduling costs has become an important research problem. We believe that more capable models that allow for parallel generation on draft sequences are what we truly need. Recognizing the fundamental nature of draft models to only generate sequences of limited length, we propose SpecFormer, a novel architecture that integrates unidirectional and bidirectional attention mechanisms. SpecFormer combines the autoregressive model’s ability to extract information from the entire input sequence with the parallel generation benefits of non-autoregressive models. This design eliminates the reliance on large prefix trees and achieves consistent acceleration, even in large-batch scenarios. Through lossless speculative decoding experiments across models of various scales, we demonstrate that SpecFormer sets a new standard for scaling LLM inference with lower training demands and reduced computational costs.
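
The draft-then-verify loop behind lossless speculative decoding can be illustrated with a minimal sketch. It assumes greedy decoding and is not SpecFormer itself: `toy_target` and `toy_draft` are hypothetical stand-ins, with `toy_draft` proposing a whole block of tokens in one call, roughly as a non-autoregressive drafter would. The point it shows is that the output matches plain greedy decoding with the target model however poor the drafter is.

```python
from typing import Callable, List

Token = int


def speculative_step(
    prefix: List[Token],
    target_next: Callable[[List[Token]], Token],
    draft_block: Callable[[List[Token], int], List[Token]],
    k: int,
) -> List[Token]:
    """One draft-then-verify step under greedy decoding.

    Returns the tokens to append: the accepted draft tokens plus one token
    from the target model (a correction or a bonus token).
    """
    draft = draft_block(prefix, k)
    accepted: List[Token] = []
    for tok in draft:
        # A real system checks all k positions in one batched forward pass.
        expected = target_next(prefix + accepted)
        if tok != expected:
            accepted.append(expected)  # first mismatch: keep the target's token
            return accepted
        accepted.append(tok)
    # Every draft token matched, so the target's next token is a bonus.
    accepted.append(target_next(prefix + accepted))
    return accepted


def generate(prefix, target_next, draft_block, k, n_new):
    """Repeat speculative steps until n_new tokens have been produced."""
    out = list(prefix)
    while len(out) - len(prefix) < n_new:
        out.extend(speculative_step(out, target_next, draft_block, k))
    return out[: len(prefix) + n_new]


def greedy_reference(prefix, target_next, n_new):
    """Plain token-by-token greedy decoding with the target model only."""
    out = list(prefix)
    for _ in range(n_new):
        out.append(target_next(out))
    return out


# Hypothetical toy models, used only to exercise the loop.
def toy_target(seq):
    return (seq[-1] * 3 + len(seq)) % 11


def toy_draft(seq, k):
    # Proposes k tokens at once; it is right only some of the time.
    return [(seq[-1] + i) % 11 for i in range(1, k + 1)]
```

With these toy models, `generate([1, 2], toy_target, toy_draft, 4, 12)` returns the same sequence as `greedy_reference([1, 2], toy_target, 12)`. Only the number of target calls per emitted token depends on the drafter's quality.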

[27] The Curious Case of Analogies: Investigating Analogical Reasoning in Large Language Models

Taewhoo Lee, Minju Song, Chanwoong Yoon, Jungwoo Park, Jaewoo Kang

Main category: cs.CL

TL;DR: LLMs show emerging but limited analogical reasoning capabilities - they can encode relational concepts but struggle to apply them to novel situations, with failures linked to missing relational information and poor structural alignment.

Details

Motivation: To understand whether LLMs can encode high-level relational concepts and apply them through analogical reasoning like humans do, using proportional and story analogies.

Method: Analyzed LLM performance on proportional and story analogies, examining how attributive and relational information propagates through model layers, and tested strategic patching of hidden representations.

Result: LLMs effectively encode relationships between analogous entities in mid-upper layers, but struggle to apply relational concepts to new entities. Strategic patching can partially facilitate information transfer. Success correlates with strong structural alignment between situations.

Conclusion: LLMs exhibit emerging analogical reasoning capabilities but with significant limitations compared to human cognition, particularly in applying relational concepts to novel situations.

Abstract: Analogical reasoning is at the core of human cognition, serving as an important foundation for a variety of intellectual activities. While prior work has shown that LLMs can represent task patterns and surface-level concepts, it remains unclear whether these models can encode high-level relational concepts and apply them to novel situations through structured comparisons. In this work, we explore this fundamental aspect using proportional and story analogies, and identify three key findings. First, LLMs effectively encode the underlying relationships between analogous entities; both attributive and relational information propagate through mid-upper layers in correct cases, whereas reasoning failures reflect missing relational information within these layers. Second, unlike humans, LLMs often struggle not only when relational information is missing, but also when attempting to apply it to new entities. In such cases, strategically patching hidden representations at critical token positions can facilitate information transfer to a certain extent. Lastly, successful analogical reasoning in LLMs is marked by strong structural alignment between analogous situations, whereas failures often reflect degraded or misplaced alignment. Overall, our findings reveal that LLMs exhibit emerging but limited capabilities in encoding and applying high-level relational concepts, highlighting both parallels and gaps with human cognition.

[28] BengaliFig: A Low-Resource Challenge for Figurative and Culturally Grounded Reasoning in Bengali

Abdullah Al Sefat

Main category: cs.CL

TL;DR: BengaliFig is a Bengali challenge set with 435 culturally rich riddles to evaluate LLMs’ figurative and cultural reasoning in low-resource contexts, revealing weaknesses in metaphorical and culturally specific reasoning.

Details

Motivation: To address the gap in evaluating large language models on figurative and culturally grounded reasoning, especially in low-resource languages like Bengali where current benchmarks are insufficient.

Method: Created BengaliFig dataset with 435 unique Bengali riddles from oral/literary traditions, annotated across 5 dimensions, and converted to multiple-choice format using AI-assisted pipeline. Evaluated 8 frontier LLMs with zero-shot and few-shot chain-of-thought prompting.

Result: LLMs showed consistent weaknesses in metaphorical and culturally specific reasoning, highlighting limitations in handling low-resource cultural contexts despite broad multilingual capabilities.

Conclusion: BengaliFig provides a diagnostic tool for evaluating LLM robustness in cultural contexts and advances inclusive, heritage-aware NLP evaluation for low-resource languages.

Abstract: Large language models excel on broad multilingual benchmarks but remain to be evaluated extensively in figurative and culturally grounded reasoning, especially in low-resource contexts. We present BengaliFig, a compact yet richly annotated challenge set that targets this gap in Bengali, a widely spoken low-resourced language. The dataset contains 435 unique riddles drawn from Bengali oral and literary traditions. Each item is annotated along five orthogonal dimensions capturing reasoning type, trap type, cultural depth, answer category, and difficulty, and is automatically converted to multiple-choice format through a constraint-aware, AI-assisted pipeline. We evaluate eight frontier LLMs from major providers under zero-shot and few-shot chain-of-thought prompting, revealing consistent weaknesses in metaphorical and culturally specific reasoning. BengaliFig thus contributes both a diagnostic probe for evaluating LLM robustness in low-resource cultural contexts and a step toward inclusive and heritage-aware NLP evaluation.

[29] A Task-Oriented Evaluation Framework for Text Normalization in Modern NLP Pipelines

Md Abdullah Al Kafi, Raka Moni, Sumit Kumar Banshal

Main category: cs.CL

TL;DR: Proposes a comprehensive evaluation framework for stemming methods using SES, MPD, and ANLD metrics to assess utility, downstream impact, and semantic preservation.

Details

Motivation: Current stemming evaluation approaches are limited and don't capture potential harm from excessive stemming, necessitating new evaluation methods.

Method: Task-oriented evaluation framework with three metrics: Stemming Effectiveness Score (SES) for utility, Model Performance Delta (MPD) for downstream impact, and Average Normalized Levenshtein Distance (ANLD) for semantic similarity.

Result: Bangla stemmer showed high SES (1.67) but poor ANLD (0.26) indicating harmful over-stemming, while English stemmer achieved moderate SES (1.31) with safe ANLD (0.14) and better downstream performance.

Conclusion: The framework effectively distinguishes between efficiency gains (high SES) and meaning preservation (low ANLD), showing that SES alone is insufficient and ANLD is crucial for identifying harmful over-stemming.

Abstract: Text normalization is an essential preprocessing step in many natural language processing (NLP) tasks, and stemming is one such normalization technique that reduces words to their base or root form. However, evaluating stemming methods is challenging because current evaluation approaches are limited and do not capture the potential harm caused by excessive stemming; therefore, it is essential to develop new approaches to evaluate stemming methods. To address this issue, this study proposes a novel, task-oriented approach to evaluate stemming methods, which considers three aspects: (1) the utility of stemming using Stemming Effectiveness Score (SES), (2) the impact of stemming on downstream tasks using Model Performance Delta (MPD), and (3) the semantic similarity between stemmed and original words using Average Normalized Levenshtein Distance (ANLD), thus providing a comprehensive evaluation framework. We apply our evaluation framework to compare two stemmers for Bangla (BNLTK) and English (Snowball), and our results reveal a significant issue, prompting us to analyze their performance in detail. While the Bangla stemmer achieves the highest SES (1.67) due to effective word reduction (CR = 1.90), SES alone is insufficient because our proposed safety measure, ANLD, reveals that this high SES is due to harmful over-stemming (ANLD = 0.26), which correlates with the observed decrease in downstream performance. In contrast, the English stemmer achieves a moderate SES (1.31) with a safe meaning distance (ANLD = 0.14), allowing its word reduction to contribute positively to downstream performance; therefore, it is a more reliable stemmer. Our study provides a valuable tool for distinguishing between potential efficiency gains (high SES) and meaning preservation (low ANLD).
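
An average normalized Levenshtein distance over (original, stemmed) word pairs can be computed roughly as follows. This is a sketch under an assumption: the abstract does not give the normalization, so each edit distance is divided by the length of the longer string, which may differ from the paper's exact definition. The function names and example words are illustrative.

```python
from typing import Iterable, Tuple


def levenshtein(a: str, b: str) -> int:
    """Edit distance with unit-cost insert, delete and substitute."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete from a
                            curr[j - 1] + 1,      # insert into a
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def anld(pairs: Iterable[Tuple[str, str]]) -> float:
    """Mean of per-pair distances, each normalized by the longer length
    (assumed normalization, see the lead-in)."""
    scores = []
    for original, stemmed in pairs:
        longest = max(len(original), len(stemmed))
        scores.append(levenshtein(original, stemmed) / longest if longest else 0.0)
    return sum(scores) / len(scores) if scores else 0.0


# Illustrative pairs: a higher value means the stems drift further from the words.
example = [("running", "run"), ("cats", "cat")]
score = anld(example)  # (4/7 + 1/4) / 2
```

Under this normalization the score lies in [0, 1], so values such as the 0.26 and 0.14 reported above would be read as average fractions of each word altered by stemming.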

[30] Generation, Evaluation, and Explanation of Novelists’ Styles with Single-Token Prompts

Mosab Rezaei, Mina Rajaei Moghadam, Abdul Rahman Shaikh, Hamed Alhoori, Reva Freedman

Main category: cs.CL

TL;DR: A framework for generating and evaluating 19th-century novelist style text using fine-tuned LLMs and transformer-based detectors with explainable AI methods.

Details

Motivation: Address challenges in stylometry: training generative models without paired data and evaluating stylistic text beyond human judgment alone.

Method: Fine-tune large language models with single-token prompts for author imitation; use transformer-based detector trained on authentic sentences as classifier and explanation tool; apply syntactic comparisons and explainable AI methods (attention-based and gradient-based analyses).

Result: Generated text reflects authors’ distinctive patterns; AI-based evaluation provides reliable alternative to human assessment.

Conclusion: The framework successfully generates stylistically accurate text and offers robust evaluation methods for stylometry research.

Abstract: Recent advances in large language models have created new opportunities for stylometry, the study of writing styles and authorship. Two challenges, however, remain central: training generative models when no paired data exist, and evaluating stylistic text without relying only on human judgment. In this work, we present a framework for both generating and evaluating sentences in the style of 19th-century novelists. Large language models are fine-tuned with minimal, single-token prompts to produce text in the voices of authors such as Dickens, Austen, Twain, Alcott, and Melville. To assess these generative models, we employ a transformer-based detector trained on authentic sentences, using it both as a classifier and as a tool for stylistic explanation. We complement this with syntactic comparisons and explainable AI methods, including attention-based and gradient-based analyses, to identify the linguistic cues that drive stylistic imitation. Our findings show that the generated text reflects the authors’ distinctive patterns and that AI-based evaluation offers a reliable alternative to human assessment. All artifacts of this work are published online.

[31] Adversarial Confusion Attack: Disrupting Multimodal Large Language Models

Jakub Hoscilowicz, Artur Janicki

Main category: cs.CL

TL;DR: The paper introduces Adversarial Confusion Attack, a new threat against MLLMs that systematically disrupts model outputs to generate incoherent or confidently incorrect responses, using adversarial images to prevent reliable operation of MLLM-powered agents.

Details

Motivation: To address the vulnerability of multimodal large language models to systematic disruption attacks, particularly for applications where adversarial images could prevent reliable operation of MLLM-powered agents on websites.

Method: Uses an ensemble of open-source MLLMs to maximize next-token entropy through adversarial perturbations generated via basic PGD technique, tested in both full-image and adversarial CAPTCHA settings.

Result: A single adversarial image can disrupt all models in the ensemble, with perturbations transferring effectively to both unseen open-source (Qwen3-VL) and proprietary (GPT-5.1) models.

Conclusion: The proposed Adversarial Confusion Attack demonstrates significant transferability and effectiveness in systematically disrupting MLLMs, highlighting serious security vulnerabilities in current multimodal AI systems.

Abstract: We introduce the Adversarial Confusion Attack, a new class of threats against multimodal large language models (MLLMs). Unlike jailbreaks or targeted misclassification, the goal is to induce systematic disruption that makes the model generate incoherent or confidently incorrect outputs. Applications include embedding adversarial images into websites to prevent MLLM-powered agents from operating reliably. The proposed attack maximizes next-token entropy using a small ensemble of open-source MLLMs. In the white-box setting, we show that a single adversarial image can disrupt all models in the ensemble, both in the full-image and adversarial CAPTCHA settings. Despite relying on a basic adversarial technique (PGD), the attack generates perturbations that transfer to both unseen open-source (e.g., Qwen3-VL) and proprietary (e.g., GPT-5.1) models.
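
The core objective, maximizing next-token entropy with projected gradient steps, can be sketched on a toy model. Everything here is an assumption made for illustration: a single linear layer `W` stands in for an MLLM, a short vector `x0` stands in for an image, and the best iterate seen is returned. The actual attack operates on images and an ensemble of models.

```python
import math


def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]


def entropy(p):
    return -sum(q * math.log(q) for q in p if q > 0.0)


def logits(W, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def entropy_grad_x(W, x):
    """Gradient of the softmax entropy H with respect to the input x."""
    p = softmax(logits(W, x))
    h = entropy(p)
    # dH/dz_j = -p_j * (log p_j + H), then chain rule through z = W x.
    dz = [-pj * (math.log(pj) + h) if pj > 0.0 else 0.0 for pj in p]
    return [sum(W[j][i] * dz[j] for j in range(len(W))) for i in range(len(x))]


def pgd_max_entropy(W, x0, eps, alpha, steps):
    """Signed-gradient ascent on entropy, projected onto the L-inf ball
    of radius eps around x0. Returns the best iterate and its entropy."""
    best_x, best_h = list(x0), entropy(softmax(logits(W, x0)))
    x = list(x0)
    for _ in range(steps):
        g = entropy_grad_x(W, x)
        x = [xi + alpha * ((gi > 0) - (gi < 0)) for xi, gi in zip(x, g)]
        x = [min(max(xi, ci - eps), ci + eps) for xi, ci in zip(x, x0)]
        h = entropy(softmax(logits(W, x)))
        if h > best_h:
            best_x, best_h = list(x), h
    return best_x, best_h


# Toy setup (hypothetical weights and input).
W = [[2.0, 0.0], [0.0, 2.0], [-1.0, -1.0]]
x0 = [1.0, 0.2]
x_adv, h_adv = pgd_max_entropy(W, x0, eps=0.3, alpha=0.05, steps=40)
```

The projection step keeps the perturbation bounded by `eps` in every coordinate. For real images a further clip to the valid pixel range would be needed.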

[32] The Text Aphasia Battery (TAB): A Clinically-Grounded Benchmark for Aphasia-Like Deficits in Language Models

Nathan Roll, Jill Kries, Flora Jin, Catherine Wang, Ann Marie Finley, Meghan Sumner, Cory Shain, Laura Gwilliams

Main category: cs.CL

TL;DR: The Text Aphasia Battery (TAB) is a text-only benchmark adapted from clinical assessments to evaluate aphasic-like language deficits in large language models, with automated evaluation validated using Gemini 2.5 Flash.

Details

Motivation: Traditional clinical assessments for aphasia are unsuitable for LLMs because they assume human-like cognitive processes and pragmatic pressures that artificial systems lack. There's a need for clinically-grounded tools to study linguistic disorders in LLMs.

Method: Developed TAB by adapting the Quick Aphasia Battery (QAB) into four text-only subtests: Connected Text, Word Comprehension, Sentence Comprehension, and Repetition. Validated automated evaluation using Gemini 2.5 Flash and compared with human expert ratings.

Result: The automated evaluation achieved reliability comparable to human raters (Cohen’s kappa = 0.255 for model-consensus vs 0.286 for human-human agreement), making TAB scalable for large-scale analysis of language deficits in artificial systems.

Conclusion: TAB provides a clinically-grounded, scalable framework for systematically analyzing language deficits in LLMs, bridging clinical linguistics and artificial intelligence research.

Abstract: Large language models (LLMs) have emerged as a candidate “model organism” for human language, offering an unprecedented opportunity to study the computational basis of linguistic disorders like aphasia. However, traditional clinical assessments are ill-suited for LLMs, as they presuppose human-like pragmatic pressures and probe cognitive processes not inherent to artificial architectures. We introduce the Text Aphasia Battery (TAB), a text-only benchmark adapted from the Quick Aphasia Battery (QAB) to assess aphasic-like deficits in LLMs. The TAB comprises four subtests: Connected Text, Word Comprehension, Sentence Comprehension, and Repetition. This paper details the TAB’s design, subtests, and scoring criteria. To facilitate large-scale use, we validate an automated evaluation protocol using Gemini 2.5 Flash, which achieves reliability comparable to expert human raters (prevalence-weighted Cohen’s kappa = 0.255 for model–consensus agreement vs. 0.286 for human–human agreement). We release TAB as a clinically-grounded, scalable framework for analyzing language deficits in artificial systems.
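
For readers unfamiliar with the agreement statistic quoted above, plain Cohen's kappa between two raters can be computed as in this sketch. It is the standard unweighted form. The abstract reports a prevalence-weighted variant whose weighting is not specified here, so this code is not expected to reproduce the paper's numbers.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohens_kappa(rater_a: Sequence[Hashable], rater_b: Sequence[Hashable]) -> float:
    """Unweighted Cohen's kappa: (p_o - p_e) / (1 - p_e)."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("ratings must be non-empty and of equal length")
    n = len(rater_a)
    # Observed agreement: share of items given the same label by both raters.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement from each rater's marginal label frequencies.
    count_a, count_b = Counter(rater_a), Counter(rater_b)
    expected = sum(count_a[label] * count_b[label] for label in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1.0 - expected)


# Illustrative ratings (1 = deficit present, 0 = absent).
model_scores = [1, 1, 0, 0]
human_scores = [1, 0, 0, 0]
kappa = cohens_kappa(model_scores, human_scores)  # 0.5
```

Kappa discounts agreement that would occur by chance, which is why raters who mostly agree can still produce modest values like those reported when one label dominates.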

[33] Bridging the Language Gap: Synthetic Voice Diversity via Latent Mixup for Equitable Speech Recognition

Wesley Bian, Xiaofeng Lin, Guang Cheng

Main category: cs.CL

TL;DR: Novel data augmentation technique improves ASR performance for low-resource languages, outperforming existing methods.

Details

Motivation: Address performance gap in audio ML models between well-resourced languages (like English) and low-resource languages due to data scarcity.

Method: Introduces a novel data augmentation technique specifically designed for speech corpora to enhance training data for low-resource languages.

Result: Significantly improves automatic speech recognition system performance on low-resource languages and outperforms existing augmentation strategies.

Conclusion: Provides a practical solution for enhancing speech technology in underrepresented linguistic communities through effective data augmentation.

Abstract: Modern machine learning models for audio tasks often exhibit superior performance on English and other well-resourced languages, primarily due to the abundance of available training data. This disparity leads to an unfair performance gap for low-resource languages, where data collection is both challenging and costly. In this work, we introduce a novel data augmentation technique for speech corpora designed to mitigate this gap. Through comprehensive experiments, we demonstrate that our method significantly improves the performance of automatic speech recognition systems on low-resource languages. Furthermore, we show that our approach outperforms existing augmentation strategies, offering a practical solution for enhancing speech technology in underrepresented linguistic communities.

[34] From Words to Wisdom: Discourse Annotation and Baseline Models for Student Dialogue Understanding

Farjana Sultana Mim, Shuchin Aeron, Eric Miller, Kristen Wendell

Main category: cs.CL

TL;DR: This paper introduces an annotated educational dialogue dataset for knowledge construction vs task production discourse and establishes baseline models using GPT-3.5 and Llama-3.1, finding suboptimal performance that indicates need for future research.

Details

Motivation: Manual analysis of student conversations to identify discourse features is time-consuming and limits study scale. NLP can provide scalable analysis, but existing work rarely addresses educational dialogue data.

Method: Created annotated educational dialogue dataset of student conversations with knowledge construction and task production discourse labels. Used pre-trained LLMs (GPT-3.5 and Llama-3.1) to establish baseline models for predicting discourse properties per conversation turn.

Result: State-of-the-art models performed suboptimally on the discourse prediction task, suggesting the problem is challenging and requires further research.

Conclusion: The work addresses the gap in NLP for educational dialogue analysis and provides a foundation for future research, as current models show limited effectiveness in automatically detecting educational discourse features.

Abstract: Identifying discourse features in student conversations is quite important for educational researchers to recognize the curricular and pedagogical variables that cause students to engage in constructing knowledge rather than merely completing tasks. The manual analysis of student conversations to identify these discourse features is time-consuming and labor-intensive, which limits the scale and scope of studies. Leveraging natural language processing (NLP) techniques can facilitate the automatic detection of these discourse features, offering educational researchers scalable and data-driven insights. However, existing studies in NLP that focus on discourse in dialogue rarely address educational data. In this work, we address this gap by introducing an annotated educational dialogue dataset of student conversations featuring knowledge construction and task production discourse. We also establish baseline models for automatically predicting these discourse properties for each turn of talk within conversations, using pre-trained large language models GPT-3.5 and Llama-3.1. Experimental results indicate that these state-of-the-art models perform suboptimally on this task, indicating the potential for future research.

[35] On Evaluating LLM Alignment by Evaluating LLMs as Judges

Yixin Liu, Pengfei Liu, Arman Cohan

Main category: cs.CL

TL;DR: The paper examines the relationship between LLMs’ generation and evaluation capabilities for human preference alignment, revealing strong correlation and proposing AlignEval - a benchmark that assesses alignment by evaluating LLMs as judges rather than directly testing their generation outputs.

Details

Motivation: Current methods for evaluating LLM alignment with human preferences require direct assessment of generated outputs using human annotators or strong LLM judges, which can be resource-intensive. The authors aim to explore whether evaluation capability can serve as a proxy for generation quality.

Method: First conducted comprehensive analysis of generation-evaluation consistency across various LLMs using a strong LLM preference oracle. Then proposed AlignEval benchmark that measures LLM alignment by assessing their performance as evaluators rather than directly evaluating their generated outputs.

Result: Found strong correlation between LLMs’ generation and evaluation capabilities. AlignEval benchmark matches or surpasses widely used automatic evaluation benchmarks (AlpacaEval and Arena-Hard) in capturing human preferences when ranking LLMs.

Conclusion: There is a strong connection between LLMs’ generation and evaluation capabilities for human preference alignment. Evaluating LLMs as judges can effectively assess their alignment without directly testing generation outputs, providing a more efficient evaluation paradigm.

Abstract: Alignment with human preferences is an important evaluation aspect of LLMs, requiring them to be helpful, honest, safe, and to precisely follow human instructions. Evaluating large language models’ (LLMs) alignment typically involves directly assessing their open-ended responses, requiring human annotators or strong LLM judges. Conversely, LLMs themselves have also been extensively evaluated as judges for assessing alignment. In this work, we examine the relationship between LLMs’ generation and evaluation capabilities in aligning with human preferences. To this end, we first conduct a comprehensive analysis of the generation-evaluation consistency (GE-consistency) among various LLMs, revealing a strong correlation between their generation and evaluation capabilities when evaluated by a strong LLM preference oracle. Utilizing this finding, we propose a benchmarking paradigm that measures LLM alignment with human preferences without directly evaluating their generated outputs, instead assessing LLMs in their role as evaluators. Our evaluation shows that our proposed benchmark, AlignEval, matches or surpasses widely used automatic LLM evaluation benchmarks, such as AlpacaEval and Arena-Hard, in capturing human preferences when ranking LLMs. Our study offers valuable insights into the connection between LLMs’ generation and evaluation capabilities, and introduces a benchmark that assesses alignment without directly evaluating model outputs.
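
One way to quantify whether ranking models by their judging ability agrees with ranking them by generation quality is a rank correlation. The sketch below uses Spearman's rho, which is an assumption: the abstract does not say which statistic the authors use. The score lists are placeholders, not results from the paper.

```python
import math
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0.0 or vy == 0.0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(vx * vy)


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman's rho: Pearson correlation of the rank-transformed data."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    return pearson(average_ranks(x), average_ranks(y))


# Placeholder per-model scores: judge accuracy vs. generation win rate.
judge_accuracy = [0.61, 0.70, 0.74, 0.82]
generation_win_rate = [0.20, 0.35, 0.33, 0.60]
rho = spearman(judge_accuracy, generation_win_rate)
```

A rho near 1 would mean that evaluating models as judges orders them almost exactly as direct evaluation of their outputs does, which is the property a judge-based benchmark relies on.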

[36] Latent Collaboration in Multi-Agent Systems

Jiaru Zou, Xiyuan Yang, Ruizhong Qiu, Gaotang Li, Katherine Tieu, Pan Lu, Ke Shen, Hanghang Tong, Yejin Choi, Jingrui He, James Zou, Mengdi Wang, Ling Yang

Main category: cs.CL

TL;DR: LatentMAS enables LLM agents to collaborate directly in latent space rather than text, achieving higher accuracy with 70-83% token reduction and 4x faster inference.

Details

Motivation: Existing LLM agents rely on text-based mediation, which is inefficient. The goal is to enable direct collaboration in continuous latent space for better efficiency and reasoning quality.

Method: End-to-end training-free framework with auto-regressive latent thoughts generation using last-layer embeddings and shared latent working memory for lossless information exchange.

Result: Outperforms text-based MAS by up to 14.6% accuracy, reduces tokens by 70.8-83.7%, and provides 4-4.3x faster inference across 9 benchmarks in math, science, commonsense, and code.

Conclusion: Latent collaboration framework enhances reasoning quality with substantial efficiency gains without additional training, demonstrating superior expressiveness and lower complexity than text-based approaches.

Abstract: Multi-agent systems (MAS) extend large language models (LLMs) from independent single-model reasoning to coordinative system-level intelligence. While existing LLM agents depend on text-based mediation for reasoning and communication, we take a step forward by enabling models to collaborate directly within the continuous latent space. We introduce LatentMAS, an end-to-end training-free framework that enables pure latent collaboration among LLM agents. In LatentMAS, each agent first performs auto-regressive latent thoughts generation through last-layer hidden embeddings. A shared latent working memory then preserves and transfers each agent’s internal representations, ensuring lossless information exchange. We provide theoretical analyses establishing that LatentMAS attains higher expressiveness and lossless information preservation with substantially lower complexity than vanilla text-based MAS. In addition, empirical evaluations across 9 comprehensive benchmarks spanning math and science reasoning, commonsense understanding, and code generation show that LatentMAS consistently outperforms strong single-model and text-based MAS baselines, achieving up to 14.6% higher accuracy, reducing output token usage by 70.8%-83.7%, and providing 4x-4.3x faster end-to-end inference. These results demonstrate that our new latent collaboration framework enhances system-level reasoning quality while offering substantial efficiency gains without any additional training. Code and data are fully open-sourced at https://github.com/Gen-Verse/LatentMAS.

[37] Gram2Vec: An Interpretable Document Vectorizer

Peter Zeng, Hannah Stortz, Eric Sclafani, Alina Shabaeva, Maria Elizabeth Garza, Daniel Greeson, Owen Rambow

Main category: cs.CL

TL;DR: Gram2Vec is an interpretable grammatical style embedding system that uses normalized relative frequencies of grammatical features for document analysis, applied to authorship verification and AI detection.

Details

Motivation: To create an interpretable alternative to neural approaches for document analysis that provides transparent feature-based explanations for style-based tasks.

Method: Extracts normalized relative frequencies of grammatical features from text to create document embeddings, then uses these features for authorship verification explanations and AI detection classification.

Result: Outperforms machine learning models trained on comparable Biber features for AI detection, and provides interpretable explanations for authorship verification decisions.

Conclusion: Gram2Vec offers an effective and interpretable approach for grammatical style analysis that can be successfully applied to authorship verification and AI detection tasks.

Abstract: We present Gram2Vec, a grammatical style embedding system that embeds documents into a higher dimensional space by extracting the normalized relative frequencies of grammatical features present in the text. Compared to neural approaches, Gram2Vec offers inherent interpretability based on how the feature vectors are generated. In this paper, we use authorship verification and AI detection as two applications to show how Gram2Vec can be used. For authorship verification, we use the features from Gram2Vec to explain why a pair of documents is by the same or by different authors. We also demonstrate how Gram2Vec features can be used to train a classifier for AI detection, outperforming machine learning models trained on a comparable set of Biber features.
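
The idea of an interpretable frequency-based style vector can be illustrated with a short sketch. It is not the Gram2Vec implementation. It assumes the text has already been tagged, uses a small hypothetical feature list, and divides each count by the total number of tokens, which may differ from the normalization the authors apply.

```python
from collections import Counter
from typing import Dict, List, Sequence, Tuple

# Hypothetical feature inventory; each vector dimension keeps a readable name.
FEATURES = ["NOUN", "VERB", "ADJ", "ADV", "PRON", "PUNCT"]


def style_vector(tags: Sequence[str], features: Sequence[str] = FEATURES) -> Dict[str, float]:
    """Relative frequency of each named feature among all tokens."""
    counts = Counter(tags)
    total = len(tags)
    return {f: (counts[f] / total if total else 0.0) for f in features}


def explain_difference(vec_a: Dict[str, float], vec_b: Dict[str, float],
                       top: int = 2) -> List[Tuple[str, float]]:
    """Features on which two documents differ most, with signed gaps."""
    ordered = sorted(vec_a, key=lambda f: abs(vec_a[f] - vec_b[f]), reverse=True)
    return [(f, vec_a[f] - vec_b[f]) for f in ordered[:top]]


# Illustrative tag sequences for two short documents.
doc_a = ["NOUN", "VERB", "NOUN", "PUNCT"]
doc_b = ["PRON", "VERB", "ADV", "PUNCT"]
vec_a, vec_b = style_vector(doc_a), style_vector(doc_b)
gaps = explain_difference(vec_a, vec_b)
```

Because every dimension corresponds to a named grammatical feature, the largest gaps can be reported directly as an explanation, for example that one document uses proportionally more nouns than the other.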

[38] BiasJailbreak: Analyzing Ethical Biases and Jailbreak Vulnerabilities in Large Language Models

Isack Lee, Haebin Seong

Main category: cs.CL

TL;DR: BiasJailbreak exploits ethical biases in LLMs to bypass safety alignments, achieving up to 20% higher jailbreak success rates for certain demographic groups, and proposes BiasDefense as an efficient defense method.

Details

Motivation: LLMs present safety risks through jailbreaks that bypass safety alignments, and ethical biases in these models can be exploited to generate harmful content.

Method: BiasJailbreak automatically generates biased keywords by querying the target LLM itself, then uses these keywords to generate harmful output. BiasDefense prevents jailbreaks by injecting defense prompts before generation.

Result: Jailbreaking success rates differ by 20% between non-binary and cisgender keywords and 16% between white and black keywords in GPT-4o models, even with identical prompts.

Conclusion: Ethical biases in LLMs can lead to unsafe output generation, and BiasDefense provides an efficient alternative to guard models for making LLMs more secure and unbiased.

Abstract: Although large language models (LLMs) demonstrate impressive proficiency in various tasks, they present potential safety risks, such as ‘jailbreaks’, where malicious inputs can coerce LLMs into generating harmful content that bypasses safety alignments. In this paper, we delve into the ethical biases in LLMs and examine how those biases could be exploited for jailbreaks. Notably, these biases result in a jailbreaking success rate in GPT-4o models that differs by 20% between non-binary and cisgender keywords and by 16% between white and black keywords, even when the other parts of the prompts are identical. We introduce the concept of BiasJailbreak, highlighting the inherent risks posed by these safety-induced biases. BiasJailbreak generates biased keywords automatically by asking the target LLM itself, and utilizes the keywords to generate harmful output. Additionally, we propose an efficient defense method BiasDefense, which prevents jailbreak attempts by injecting defense prompts prior to generation. BiasDefense stands as an appealing alternative to Guard Models, such as Llama-Guard, that require additional inference cost after text generation. Our findings emphasize that ethical biases in LLMs can actually lead to generating unsafe output, and suggest a method to make the LLMs more secure and unbiased. To enable further research and improvements, we open-source our code and artifacts of BiasJailbreak, providing the community with tools to better understand and mitigate safety-induced biases in LLMs.

[39] MedS$^3$: Towards Medical Slow Thinking with Self-Evolved Soft Dual-sided Process Supervision

Shuyang Jiang, Yusheng Liao, Zhe Chen, Ya Zhang, Yanfeng Wang, Yu Wang

Main category: cs.CL

TL;DR: MedS3 is a self-evolving framework that enhances clinical reasoning in small medical language models using curriculum sampling, Monte Carlo Tree Search for reasoning trajectories, and a soft dual process reward model with value dynamics.

Details

Motivation: Current medical language models lack task coverage, fine-grained supervision for intermediate reasoning steps, and rely on proprietary systems, making them unsuitable for real-world clinical reasoning applications.

Method: Uses curriculum sampling across medical domains, Monte Carlo Tree Search to construct reasoning trajectories, reinforcement fine-tuning with preference learning, and a soft dual process reward model that penalizes value-degrading steps.

Result: Outperforms previous state-of-the-art medical model by +6.45 accuracy points and surpasses 32B-scale general-purpose reasoning models by +8.57 points on eleven benchmarks, achieving robust and faithful reasoning behavior.

Conclusion: MedS3 successfully creates versatile, credible and efficient language models for clinical reasoning by imparting robust reasoning capabilities to small, deployable models through self-evolving framework.

Abstract: Medical language models face critical barriers to real-world clinical reasoning applications. However, mainstream efforts, which fall short in task coverage, lack fine-grained supervision for intermediate reasoning steps, and rely on proprietary systems, are still far from a versatile, credible and efficient language model for clinical reasoning usage. To this end, we propose MedS3, a self-evolving framework that imparts robust reasoning capabilities to small, deployable models. Starting with 8,000 curated instances sampled via a curriculum strategy across five medical domains and 16 datasets, we use a small base policy model to conduct Monte Carlo Tree Search (MCTS) for constructing rule-verifiable reasoning trajectories. Self-explored reasoning trajectories ranked by node values are used to bootstrap the policy model via reinforcement fine-tuning and preference learning. Moreover, we introduce a soft dual process reward model that incorporates value dynamics: steps that degrade node value are penalized, enabling fine-grained identification of reasoning errors even when the final answer is correct. Experiments on eleven benchmarks show that MedS3 outperforms the previous state-of-the-art medical model by +6.45 accuracy points and surpasses 32B-scale general-purpose reasoning models by +8.57 points. Additional empirical analysis further demonstrates that MedS3 achieves robust and faithful reasoning behavior.
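
Illustrative sketch (not from the paper): the value-dynamics idea can be made concrete with a few lines of Python. The sketch assumes each reasoning step carries a node value in [0, 1], for example an MCTS estimate, and that a step's soft label is its value minus a penalty whenever the value falls relative to the previous step. The function name, the `penalty` weight, and the labeling rule itself are hypothetical; the abstract does not specify them.

```python
from typing import List


def soft_step_labels(values: List[float], penalty: float = 0.5) -> List[float]:
    """Assign a soft label to each reasoning step from its node value.

    Hypothetical rule: a step keeps its node value as its label unless the
    value dropped relative to the previous step, in which case the drop
    (scaled by `penalty`) is subtracted. Labels are clipped to [0, 1].
    """
    labels = []
    previous = None
    for value in values:
        label = value
        if previous is not None and value < previous:
            # Penalize a step that degraded the node value.
            label -= penalty * (previous - value)
        labels.append(min(1.0, max(0.0, label)))
        previous = value
    return labels


# A trajectory whose final answer is correct (last value 1.0) but whose
# second step lowered the node value.
trajectory_values = [0.6, 0.3, 0.8, 1.0]
labels = soft_step_labels(trajectory_values)
```

Under this rule the second step receives a label below its raw node value, so it can be flagged as a likely reasoning error even though the trajectory ends in a correct answer.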

[40] A Comprehensive Survey on Long Context Language Modeling

Jiaheng Liu, Dawei Zhu, Zhiqi Bai, Yancheng He, Huanxuan Liao, Haoran Que, Zekun Wang, Chenchen Zhang, Ge Zhang, Jiebin Zhang, Yuanxing Zhang, Zhuo Chen, Hangyu Guo, Shilong Li, Ziqiang Liu, Yong Shan, Yifan Song, Jiayi Tian, Wenhao Wu, Zhejian Zhou, Ruijie Zhu, Junlan Feng, Yang Gao, Shizhu He, Zhoujun Li, Tianyu Liu, Fanyu Meng, Wenbo Su, Yingshui Tan, Zili Wang, Jian Yang, Wei Ye, Bo Zheng, Wangchunshu Zhou, Wenhao Huang, Sujian Li, Zhaoxiang Zhang

Main category: cs.CL

TL;DR: A comprehensive survey on recent advances in long-context language models (LCLMs), covering acquisition methods, training/infrastructure, evaluation techniques, applications, and future directions.

Details

Motivation: The growing volume of long documents, dialogues, and textual data necessitates the development of efficient long-context language models that can process extensive inputs effectively.

Method: Structured survey approach examining three key aspects: data strategies, architectural designs, and workflow approaches for obtaining LCLMs; infrastructure for training and deployment; and evaluation paradigms for comprehension, generation, and behavioral analysis.

Result: Provides an up-to-date review of literature on long-context LLMs, serving as a valuable resource for researchers and engineers, with an associated GitHub repository collecting latest papers and repositories.

Conclusion: The survey thoroughly explores application scenarios of existing LCLMs and outlines promising future development directions in long-context language modeling.

Abstract: Efficient processing of long contexts has been a persistent pursuit in Natural Language Processing. With the growing number of long documents, dialogues, and other textual data, it is important to develop Long Context Language Models (LCLMs) that can process and analyze extensive inputs in an effective and efficient way. In this paper, we present a comprehensive survey on recent advances in long-context modeling for large language models. Our survey is structured around three key aspects: how to obtain effective and efficient LCLMs, how to train and deploy LCLMs efficiently, and how to evaluate and analyze LCLMs comprehensively. For the first aspect, we discuss data strategies, architectural designs, and workflow approaches oriented toward long-context processing. For the second aspect, we provide a detailed examination of the infrastructure required for LCLM training and inference. For the third aspect, we present evaluation paradigms for long-context comprehension and long-form generation, as well as behavioral analysis and mechanism interpretability of LCLMs. Beyond these three key aspects, we thoroughly explore the diverse application scenarios where existing LCLMs have been deployed and outline promising future development directions. This survey provides an up-to-date review of the literature on long-context LLMs, which we hope will serve as a valuable resource for both researchers and engineers. An associated GitHub repository (LCLM-Horizon) collecting the latest papers and repos is available at: https://github.com/LCLM-Horizon/A-Comprehensive-Survey-For-Long-Context-Language-Modeling

[41] Why Reasoning Matters? A Survey of Advancements in Multimodal Reasoning (v1)

Jing Bi, Susan Liang, Xiaofei Zhou, Pinxin Liu, Junjia Guo, Yunlong Tang, Luchuan Song, Chao Huang, Ali Vosoughi, Guangyu Sun, Jinxi He, Jiarui Wu, Shu Yang, Daoan Zhang, Chen Chen, Lianggong Bruce Wen, Zhang Liu, Jiebo Luo, Chenliang Xu

Main category: cs.CL

TL;DR: This paper provides a comprehensive overview of reasoning techniques in textual and multimodal LLMs, highlighting challenges in multimodal reasoning and offering practical optimization methods.

Details

Motivation: While LLMs have improved reasoning in textual domains, extending these capabilities to multimodal contexts (integrating visual and textual inputs) remains challenging due to complexities like handling conflicting information across modalities.

Method: The paper conducts a thorough comparison of reasoning techniques, formulates core challenges, and highlights practical methods for post-training optimization and test-time inference.

Result: The work provides valuable insights bridging theoretical frameworks and practical implementations, clearly identifying reasoning challenges and opportunities in multimodal contexts.

Conclusion: The paper sets clear directions for future research by offering guidance on advancing reasoning capabilities in multimodal LLMs through sophisticated algorithms and robust evaluation methodologies.

Abstract: Reasoning is central to human intelligence, enabling structured problem-solving across diverse tasks. Recent advances in large language models (LLMs) have greatly enhanced their reasoning abilities in arithmetic, commonsense, and symbolic domains. However, effectively extending these capabilities into multimodal contexts, where models must integrate both visual and textual inputs, continues to be a significant challenge. Multimodal reasoning introduces complexities, such as handling conflicting information across modalities, which require models to adopt advanced interpretative strategies. Addressing these challenges involves not only sophisticated algorithms but also robust methodologies for evaluating reasoning accuracy and coherence. This paper offers a concise yet insightful overview of reasoning techniques in both textual and multimodal LLMs. Through a thorough and up-to-date comparison, we clearly formulate core reasoning challenges and opportunities, highlighting practical methods for post-training optimization and test-time inference. Our work provides valuable insights and guidance, bridging theoretical frameworks and practical implementations, and sets clear directions for future research.

[42] From Generation to Detection: A Multimodal Multi-Task Dataset for Benchmarking Health Misinformation

Zhihao Zhang, Yiran Zhang, Xiyue Zhou, Liting Huang, Imran Razzak, Preslav Nakov, Usman Naseem

Main category: cs.CL

TL;DR: MM Health is a large-scale multimodal misinformation dataset for health domain with 34,746 news articles containing both text and images, including both human-generated and AI-generated content to help combat health misinformation spread.

Details

Motivation: Health misinformation has significant negative impacts and generative AI has accelerated its spread. Existing datasets have limitations in topical coverage, AI-generated content inclusion, and accessibility.

Method: Created MM Health dataset with 34,746 multimodal news articles (5,776 human-generated + 28,880 AI-generated from SOTA models) and benchmarked against reliability checks, originality checks, and fine-grained AI detection tasks.

Result: Existing SOTA models struggle to accurately distinguish reliability and origin of information, demonstrating the need for improved detection methods.

Conclusion: MM Health supports development of misinformation detection across health scenarios and facilitates detection of human vs machine-generated content at multimodal levels.

Abstract: Infodemics and health misinformation have a significant negative impact on individuals and society, exacerbating confusion and increasing hesitancy in adopting recommended health measures. Recent advancements in generative AI, capable of producing realistic, human-like text and images, have significantly accelerated the spread and expanded the reach of health misinformation, resulting in an alarming surge in its dissemination. To combat infodemics, most existing work has focused on developing misinformation datasets from social media and fact-checking platforms, but has faced limitations in topical coverage, inclusion of AI-generated content, and accessibility of raw content. To address these issues, we present MM Health, a large-scale multimodal misinformation dataset in the health domain consisting of 34,746 news articles encompassing both textual and visual information. MM Health includes human-generated multimodal information (5,776 articles) and AI-generated multimodal information (28,880 articles) from various SOTA generative AI models. Additionally, we benchmarked our dataset against three tasks (reliability checks, originality checks, and fine-grained AI detection), demonstrating that existing SOTA models struggle to accurately distinguish the reliability and origin of information. Our dataset aims to support the development of misinformation detection across various health scenarios, facilitating the detection of human- and machine-generated content at multimodal levels.

[43] Counterfactual Simulatability of LLM Explanations for Generation Tasks

Marvin Limpijankit, Yanda Chen, Melanie Subbiah, Nicholas Deas, Kathleen McKeown

Main category: cs.CL

TL;DR: This paper extends counterfactual simulatability evaluation from yes/no tasks to generation tasks like news summarization and medical suggestions, finding LLM explanations help predict outputs but need improvement, especially for knowledge-based tasks.

Details

Motivation: LLMs are unpredictable and their explanations are critical in high-stakes settings. Counterfactual simulatability evaluates how well explanations help users predict model behavior on related inputs.

Method: Extends counterfactual simulatability framework to generation tasks using news summarization and medical suggestion as case studies.

Result: LLM explanations help users better predict outputs in summarization but show significant room for improvement in medical suggestion. Evaluation works better for skill-based vs knowledge-based tasks.

Conclusion: Counterfactual simulatability evaluation is more appropriate for skill-based tasks than knowledge-based tasks, and LLM explanations need substantial improvement for reliable use in high-stakes domains.

Abstract: LLMs can be unpredictable, as even slight alterations to the prompt can cause the output to change in unexpected ways. Thus, the ability of models to accurately explain their behavior is critical, especially in high-stakes settings. One approach for evaluating explanations is counterfactual simulatability, how well an explanation allows users to infer the model’s output on related counterfactuals. Counterfactual simulatability has been previously studied for yes/no question answering tasks. We provide a general framework for extending this method to generation tasks, using news summarization and medical suggestion as example use cases. We find that while LLM explanations do enable users to better predict LLM outputs on counterfactuals in the summarization setting, there is significant room for improvement for medical suggestion. Furthermore, our results suggest that the evaluation for counterfactual simulatability may be more appropriate for skill-based tasks as opposed to knowledge-based tasks.

[44] TurnBench-MS: A Benchmark for Evaluating Multi-Turn, Multi-Step Reasoning in Large Language Models

Yiran Zhang, Mo Wang, Xiaoyang Li, Kaixuan Ren, Chencheng Zhu, Usman Naseem

Main category: cs.CL

TL;DR: TurnBench is a new benchmark that evaluates multi-turn, multi-step reasoning in LLMs through interactive code-breaking tasks, revealing significant performance gaps between models and humans.

Details

Motivation: Existing benchmarks focus on single-turn/single-step tasks and fail to capture iterative reasoning required in real-world settings, creating a need for evaluation frameworks that test multi-turn reasoning capabilities.

Method: Uses an interactive code-breaking task inspired by the Turing Machine Board Game, where models make sequential guesses and receive structured feedback to uncover hidden logical/arithmetic rules across multiple rounds. Includes Classic and Nightmare modes with varying complexity.

Result: The best state-of-the-art LLM achieves 84% accuracy in Classic mode but drops to 18% in Nightmare mode, while human participants achieve 100% in both modes, highlighting significant performance gaps.

Conclusion: TurnBench provides a rigorous testbed for diagnosing multi-step reasoning capabilities, reduces contamination risks by hiding task rules, and reveals substantial room for improvement in LLMs’ iterative reasoning abilities.

Abstract: Despite impressive advances in large language models (LLMs), existing benchmarks often focus on single-turn or single-step tasks, failing to capture the kind of iterative reasoning required in real-world settings. To address this limitation, we introduce TurnBench, a novel benchmark that evaluates multi-turn, multi-step reasoning through an interactive code-breaking task inspired by the “Turing Machine Board Game.” In each episode, a model must uncover hidden logical or arithmetic rules by making sequential guesses, receiving structured feedback, and integrating clues across multiple rounds. This dynamic setup requires models to reason over time, adapt based on past information, and maintain consistency across steps, capabilities that are underexplored in current benchmarks. TurnBench includes two modes: Classic, which tests standard reasoning, and Nightmare, which introduces increased complexity and requires robust inferential chains. To support fine-grained analysis, we provide ground-truth annotations for intermediate reasoning steps. Our evaluation of state-of-the-art LLMs reveals significant gaps: the best model achieves 84% accuracy in Classic mode, but performance drops to 18% in Nightmare mode. In contrast, human participants achieve 100% in both, underscoring the challenge TurnBench poses to current models. By incorporating feedback loops and hiding task rules, TurnBench reduces contamination risks and provides a rigorous testbed for diagnosing and advancing multi-step, multi-turn reasoning in LLMs.
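
Illustrative sketch (not from the paper): the guess, feedback, and clue-integration loop can be shown with a deliberately simplified stand-in game. Here the hidden target is a three-digit code and the feedback is the number of matching positions; this is an assumption made for brevity and is not the benchmark's actual rule set, which hides logical or arithmetic rules. The names `feedback` and `solve` are hypothetical.

```python
from itertools import product
from typing import List, Tuple

Code = Tuple[int, ...]


def feedback(secret: Code, guess: Code) -> int:
    """Structured feedback: how many positions of the guess match the secret."""
    return sum(s == g for s, g in zip(secret, guess))


def solve(secret: Code, digits: range = range(1, 6)) -> List[Code]:
    """Guess repeatedly, keeping only codes consistent with all feedback so far."""
    candidates = list(product(digits, repeat=len(secret)))
    history = []
    while True:
        guess = candidates[0]
        history.append(guess)
        score = feedback(secret, guess)
        if score == len(secret):
            return history
        # Integrate the new clue: a candidate survives only if it would have
        # produced the same feedback for this guess.
        candidates = [c for c in candidates if feedback(c, guess) == score]


turns = solve(secret=(3, 1, 4))
```

Each wrong guess removes at least itself from the candidate set while the secret always survives, so the loop terminates. A model evaluated in this style has to do the same bookkeeping implicitly across turns.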

[45] Improved LLM Agents for Financial Document Question Answering

Nelvin Tan, Zian Seng, Liang Zhang, Yu-Ching Shih, Dong Yang, Amol Salunkhe

Main category: cs.CL

TL;DR: The paper presents an improved critic agent and calculator agent for numerical question answering on financial documents, addressing limitations of traditional critic agents when oracle labels are unavailable.

Details

Motivation: LLMs struggle with numerical QA for financial documents containing tabular and textual data, and traditional critic agents perform poorly when oracle labels are not available.

Method: Proposed an improved critic agent and calculator agent that work together, with investigation of their interaction effects on performance.

Result: The proposed approach outperforms previous state-of-the-art (program-of-thought) methods and is safer.

Conclusion: The improved critic agent with calculator agent provides better performance for numerical QA in financial documents without requiring oracle labels.

Abstract: Large language models (LLMs) have shown impressive capabilities on numerous natural language processing tasks. However, LLMs still struggle with numerical question answering for financial documents that include tabular and textual data. Recent works have shown the effectiveness of critic agents (i.e., self-correction) for this task given oracle labels. Building upon this framework, this paper examines the effectiveness of the traditional critic agent when oracle labels are not available, and shows, through experiments, that this critic agent’s performance deteriorates in this scenario. With this in mind, we present an improved critic agent, along with the calculator agent which outperforms the previous state-of-the-art approach (program-of-thought) and is safer. Furthermore, we investigate how our agents interact with each other, and how this interaction affects their performance.
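
Illustrative sketch (not from the paper): one plausible reading of "safer" is that a calculator evaluates only arithmetic, whereas program-of-thought executes model-written code. The abstract does not spell this out, so the design below is an assumption. It uses Python's standard `ast` module to accept numbers and basic operators and to reject everything else; the function name `calculate` is hypothetical.

```python
import ast
import operator

# Only plain arithmetic is allowed; names, calls, and attribute access are rejected.
_BINARY_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}
_UNARY_OPS = {ast.USub: operator.neg, ast.UAdd: operator.pos}


def _is_number(node: ast.AST) -> bool:
    """True for int/float literals (booleans are excluded)."""
    return (
        isinstance(node, ast.Constant)
        and isinstance(node.value, (int, float))
        and not isinstance(node.value, bool)
    )


def calculate(expression: str) -> float:
    """Evaluate an arithmetic expression without executing arbitrary code."""

    def _eval(node: ast.AST) -> float:
        if _is_number(node):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _BINARY_OPS:
            return _BINARY_OPS[type(node.op)](_eval(node.left), _eval(node.right))
        if isinstance(node, ast.UnaryOp) and type(node.op) in _UNARY_OPS:
            return _UNARY_OPS[type(node.op)](_eval(node.operand))
        raise ValueError(f"unsupported expression element: {type(node).__name__}")

    return _eval(ast.parse(expression, mode="eval").body)


# Percentage change between two (made-up) table values.
growth = calculate("(120 - 100) / 100 * 100")
```

An expression such as `__import__('os').system('ls')` parses to a call node and is rejected with `ValueError` instead of being run.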

[46] Health Sentinel: An AI Pipeline For Real-time Disease Outbreak Detection

Devesh Pant, Rishi Raj Grandhe, Vipin Samaria, Mukul Paul, Sudhir Kumar, Saransh Khanna, Jatin Agrawal, Jushaan Singh Kalra, Akhil VSSG, Satish V Khalikar, Vipin Garg, Himanshu Chauhan, Pranay Verma, Neha Khandelwal, Soma S Dhavala, Minesh Mathew

Main category: cs.CL

TL;DR: Health Sentinel is an automated multi-stage information extraction pipeline that uses ML and non-ML methods to detect disease outbreaks from online news articles, processing over 300 million articles and identifying 95,000+ health events with 3,500+ flagged as potential outbreaks.

Details

Motivation: Early detection of disease outbreaks is crucial for timely intervention, but manual screening of online media is impractical due to the volume of articles published daily.

Method: Multi-stage information extraction pipeline combining ML and non-ML methods to extract structured event information about disease outbreaks from online articles.

Result: Processed over 300 million news articles, identified over 95,000 unique health events across India, with over 3,500 events shortlisted by NCDC experts as potential outbreaks.

Conclusion: Health Sentinel successfully automates disease outbreak detection from online media, providing valuable structured information to health authorities for timely intervention and analysis.

Abstract: Early detection of disease outbreaks is crucial to ensure timely intervention by the health authorities. Due to the challenges associated with traditional indicator-based surveillance, monitoring informal sources such as online media has become increasingly popular. However, owing to the number of online articles published every day, manual screening of the articles is impractical. To address this, we propose Health Sentinel. It is a multi-stage information extraction pipeline that uses a combination of ML and non-ML methods to extract events (structured information concerning disease outbreaks or other unusual health events) from online articles. The extracted events are made available to the Media Scanning and Verification Cell (MSVC) at the National Centre for Disease Control (NCDC), Delhi for analysis, interpretation and further dissemination to local agencies for timely intervention. From April 2022 till date, Health Sentinel has processed over 300 million news articles and identified over 95,000 unique health events across India of which over 3,500 events were shortlisted by the public health experts at NCDC as potential outbreaks.

[47] SAS: Simulated Attention Score

Chuanyang Zheng, Jiankai Sun, Yihang Gao, Yuehao Wang, Peihao Wang, Jing Xiong, Liliang Ren, Hao Cheng, Janardhan Kulkarni, Yelong Shen, Atlas Wang, Mac Schwager, Anderson Schneider, Xiaodong Liu, Jianfeng Gao

Main category: cs.CL

TL;DR: SAS method simulates more attention heads and larger hidden dimensions without increasing parameters, using projection and parameter-efficient aggregation to boost performance.

Details

Motivation: Performance improves with more attention heads and larger hidden sizes per head, but parameter overhead is a constraint. SAS aims to achieve these benefits at low cost.

Method: Project low-dimensional head representations to higher-dimensional space to simulate more heads and larger hidden dimensions. Use Parameter-Efficient Attention Aggregation (PEAA) to control parameter cost.

Result: Significant improvements over different attention variants across various datasets and tasks.

Conclusion: SAS effectively enhances attention capacity and model performance while maintaining compact model size.

Abstract: The attention mechanism is a core component of the Transformer architecture. Various methods have been developed to compute attention scores, including multi-head attention (MHA), multi-query attention, group-query attention and so on. We further analyze the MHA and observe that its performance improves as the number of attention heads increases, provided the hidden size per head remains sufficiently large. Therefore, increasing both the head count and hidden size per head with minimal parameter overhead can lead to significant performance gains at a low cost. Motivated by this insight, we introduce Simulated Attention Score (SAS), which maintains a compact model size while simulating a larger number of attention heads and hidden feature dimension per head. This is achieved by projecting a low-dimensional head representation into a higher-dimensional space, effectively increasing attention capacity without increasing parameter count. Beyond the head representations, we further extend the simulation approach to feature dimension of the key and query embeddings, enhancing expressiveness by mimicking the behavior of a larger model while preserving the original model size. To control the parameter cost, we also propose Parameter-Efficient Attention Aggregation (PEAA). Comprehensive experiments on a variety of datasets and tasks demonstrate the effectiveness of the proposed SAS method, achieving significant improvements over different attention variants.

[48] LaajMeter: A Framework for LaaJ Evaluation

Samuel Ackerman, Gal Amram, Ora Nova Fandina, Eitan Farchi, Shmulik Froimovich, Raviv Gal, Wesam Ibraheem, Avi Ziv

Main category: cs.CL

TL;DR: LaaJMeter is a simulation-based framework for meta-evaluation of LLM-as-a-Judge systems in domain-specific contexts where annotated data is scarce.

Details

Motivation: Meta-evaluation of LLM-as-a-Judge systems is challenging in domain-specific contexts due to scarce annotated data and costly expert evaluation, leading to unvalidated metrics and unclear quality thresholds.

Method: LaaJMeter generates synthetic data representing virtual models and judges, enabling systematic analysis of evaluation metrics under realistic conditions to validate LLM-as-a-Judge systems for specific tasks.

Result: In a code translation task involving a legacy programming language, LaaJMeter demonstrated how different metrics vary in sensitivity to evaluator quality, highlighting limitations of common metrics.

Conclusion: LaaJMeter provides a scalable and extensible solution for assessing LLM-as-a-Judge systems in low-resource settings, contributing to trustworthy and reproducible evaluation in NLP.

Abstract: Large Language Models (LLMs) are increasingly used as evaluators in natural language processing tasks, a paradigm known as LLM-as-a-Judge (LaaJ). The analysis of LaaJ software, commonly referred to as meta-evaluation, poses significant challenges in domain-specific contexts. In such domains, in contrast to general domains, annotated data is scarce and expert evaluation is costly. As a result, meta-evaluation is often performed using metrics that have not been validated for the specific domain in which they are applied. Therefore, it becomes difficult to determine which metrics effectively identify LaaJ quality, and further, what threshold indicates sufficient evaluator performance. In this work, we introduce LaaJMeter, a simulation-based framework for controlled meta-evaluation of LaaJs. LaaJMeter enables engineers to generate synthetic data representing virtual models and judges, allowing systematic analysis of evaluation metrics under realistic conditions. This helps practitioners validate LaaJs for specific tasks: they can test whether their metrics correctly distinguish between high and low quality (virtual) LaaJs, and estimate appropriate thresholds for evaluator adequacy. We demonstrate the utility of LaaJMeter in a code translation task involving a legacy programming language, showing how different metrics vary in sensitivity to evaluator quality. Our results highlight the limitations of common metrics and the importance of principled metric selection. LaaJMeter provides a scalable and extensible solution for assessing LaaJs in low-resource settings, contributing to the broader effort to ensure trustworthy and reproducible evaluation in NLP.
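
Illustrative sketch (not from the paper): the idea of checking whether a metric separates good judges from bad ones can be simulated in a few lines. The setup below is an assumption made for illustration: a virtual judge is modeled as the true quality score plus Gaussian noise, and the metric is pairwise ranking agreement. Neither the noise model nor the metric is taken from LaaJMeter, and all names are hypothetical.

```python
import random
from itertools import combinations
from typing import List


def virtual_judge(true_scores: List[float], noise: float, rng: random.Random) -> List[float]:
    """A virtual judge reports the true quality plus Gaussian noise."""
    return [s + rng.gauss(0.0, noise) for s in true_scores]


def pairwise_agreement(true_scores: List[float], judged: List[float]) -> float:
    """Fraction of item pairs the judge orders the same way as the ground truth."""
    pairs = [
        (i, j)
        for i, j in combinations(range(len(true_scores)), 2)
        if true_scores[i] != true_scores[j]
    ]
    agree = sum(
        (true_scores[i] < true_scores[j]) == (judged[i] < judged[j])
        for i, j in pairs
    )
    return agree / len(pairs)


rng = random.Random(0)
true_scores = [rng.random() for _ in range(200)]

# Sweep judge quality from perfect (no noise) to poor (heavy noise).
for noise in (0.0, 0.1, 0.3, 1.0):
    judged = virtual_judge(true_scores, noise, rng)
    print(f"noise={noise:.1f}  agreement={pairwise_agreement(true_scores, judged):.3f}")
```

Plotting a metric against the known noise level in this way shows how sensitive it is to evaluator quality, and where a usable adequacy threshold might sit.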

[49] ConfTuner: Training Large Language Models to Express Their Confidence Verbally

Yibo Li, Miao Xiong, Jiaying Wu, Bryan Hooi

Main category: cs.CL

TL;DR: ConfTuner is a fine-tuning method that improves LLM calibration using a tokenized Brier score loss, addressing overconfidence without needing ground-truth confidence scores.

Details

Motivation: LLMs deployed in high-stakes domains need accurate uncertainty expression, but current models often show overconfidence. Existing calibration methods have limited effectiveness and generalizability.

Method: ConfTuner uses a new loss function called tokenized Brier score, which is theoretically proven to be a proper scoring rule that incentivizes models to report true probabilities of being correct.

Result: ConfTuner improves calibration across diverse reasoning tasks and generalizes to black-box models like GPT-4o. Better calibration enables gains in self-correction and model cascade.

Conclusion: ConfTuner advances trustworthy LLM systems by providing an efficient fine-tuning approach for confidence calibration that requires minimal overhead and no ground-truth confidence data.

Abstract: Large Language Models (LLMs) are increasingly deployed in high-stakes domains such as science, law, and healthcare, where accurate expressions of uncertainty are essential for reliability and trust. However, current LLMs are often observed to generate incorrect answers with high confidence, a phenomenon known as “overconfidence”. Recent efforts have focused on calibrating LLMs’ verbalized confidence: i.e., their expressions of confidence in text form, such as “I am 80% confident that…”. Existing approaches either rely on prompt engineering or fine-tuning with heuristically generated uncertainty estimates, both of which have limited effectiveness and generalizability. Motivated by the notion of proper scoring rules for calibration in classical machine learning models, we introduce ConfTuner, a simple and efficient fine-tuning method that introduces minimal overhead and does not require ground-truth confidence scores or proxy confidence estimates. ConfTuner relies on a new loss function, tokenized Brier score, which we theoretically prove to be a proper scoring rule, intuitively meaning that it “correctly incentivizes the model to report its true probability of being correct”. ConfTuner improves calibration across diverse reasoning tasks and generalizes to black-box models such as GPT-4o. Our results further show that better-calibrated confidence enables downstream gains in self-correction and model cascade, advancing the development of trustworthy LLM systems. The code is available at https://github.com/liushiliushi/ConfTuner.
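
Illustrative sketch (not from the paper): the "proper scoring rule" property can be checked numerically for the classic Brier score. If an answer is correct with probability p and the model reports confidence q, the expected score is p(1 - q)^2 + (1 - p)q^2, which equals (q - p)^2 + p(1 - p) and is therefore minimized at q = p. The code below covers only this classic form; the tokenized variant in the paper is not reproduced, and the function names are hypothetical.

```python
def expected_brier(reported: float, true_prob: float) -> float:
    """Expected Brier score when the answer is correct with probability `true_prob`."""
    return true_prob * (1.0 - reported) ** 2 + (1.0 - true_prob) * reported ** 2


def best_report(true_prob: float, steps: int = 100) -> float:
    """Grid-search the reported confidence that minimizes the expected score."""
    grid = [i / steps for i in range(steps + 1)]
    return min(grid, key=lambda q: expected_brier(q, true_prob))


# For each true probability, the best confidence to report is that probability.
for p in (0.2, 0.5, 0.8):
    print(p, best_report(p))
```

Over-reporting is penalized as well: with p = 0.5, reporting 0.9 costs more in expectation than reporting 0.5, which is the behavior a calibration loss needs.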

[50] FlagEval Findings Report: A Preliminary Evaluation of Large Reasoning Models on Automatically Verifiable Textual and Visual Questions

Bowen Qin, Chen Yue, Fang Yin, Hui Wang, JG Yao, Jiakang Liu, Jing-Shu Zheng, Miguel Hu Chen, Richeng Xuan, Shibei Meng, Shiqi Zhou, Teng Dai, Tong-Shuai Ren, Wei Cui, Xi Yang, Xialin Du, Xiaojing Xu, Xue Sun, Xuejing Li, Yaming Liu, Yesheng Liu, Ying Liu, Yonghua Lin, Yu Zhao, Yunduo Zhang, Yuwen Luo, Zheqi He, Zhiyuan He, Zhongyuan Wang

Main category: cs.CL

TL;DR: A moderate-scale, largely contamination-free evaluation of large reasoning models, released together with ROME, a benchmark for vision language models.

Details

Motivation: To conduct a moderate-scale evaluation of current large reasoning models (LRMs) without contamination and test their reasoning capabilities from visual clues.

Method: Developed and released ROME benchmark specifically designed for vision language models to evaluate reasoning from visual information.

Result: Preliminary findings from the evaluation of current LRMs, with benchmark and evaluation data made publicly available.

Conclusion: The report releases the ROME benchmark for evaluating vision language models and offers a largely contamination-free (to some extent, per the authors) assessment of reasoning capabilities.

Abstract: We conduct a moderate-scale contamination-free (to some extent) evaluation of current large reasoning models (LRMs) with some preliminary findings. We also release ROME, our evaluation benchmark for vision language models intended to test reasoning from visual clues. We attach links to the benchmark, evaluation data, and other updates on this website: https://flageval-baai.github.io/LRM-Eval/

[51] Agentar-Scale-SQL: Advancing Text-to-SQL through Orchestrated Test-Time Scaling

Pengfei Wang, Baolin Sun, Xuemei Dong, Yaxun Dai, Hongwei Yuan, Mengdie Chu, Yingqi Gao, Xiang Qi, Peng Zhang, Ying Yan

Main category: cs.CL

TL;DR: Agentar-Scale-SQL is a novel framework that achieves state-of-the-art performance on the BIRD benchmark through orchestrated test-time scaling combining internal, sequential, and parallel scaling strategies.

Details

Motivation: Current Text-to-SQL methods lag behind human experts on challenging benchmarks like BIRD, and existing test-time scaling approaches lack orchestration and neglect the model's internal reasoning process.

Method: Orchestrated Test-Time Scaling strategy combining: i) Internal Scaling via RL-enhanced Intrinsic Reasoning, ii) Sequential Scaling through Iterative Refinement, and iii) Parallel Scaling using Diverse Synthesis and Tournament Selection.

Result: Achieves 81.67% execution accuracy on BIRD benchmark test set, ranking first on the official leaderboard.

Conclusion: Agentar-Scale-SQL provides an effective path toward human-level performance in Text-to-SQL tasks and is a general-purpose framework adaptable to new databases and more powerful language models.

Abstract: State-of-the-art (SOTA) Text-to-SQL methods still lag significantly behind human experts on challenging benchmarks like BIRD. Current approaches that explore test-time scaling lack an orchestrated strategy and neglect the model’s internal reasoning process. To bridge this gap, we introduce Agentar-Scale-SQL, a novel framework leveraging scalable computation to improve performance. Agentar-Scale-SQL implements an Orchestrated Test-Time Scaling strategy that synergistically combines three distinct perspectives: i) Internal Scaling via RL-enhanced Intrinsic Reasoning, ii) Sequential Scaling through Iterative Refinement, and iii) Parallel Scaling using Diverse Synthesis and Tournament Selection. Agentar-Scale-SQL is a general-purpose framework designed for easy adaptation to new databases and more powerful language models. Extensive experiments show that Agentar-Scale-SQL achieves SOTA performance on the BIRD benchmark, reaching 81.67% execution accuracy on the test set and ranking first on the official leaderboard, demonstrating an effective path toward human-level performance.
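The parallel-scaling step above ends with a tournament over candidate queries. As a minimal sketch of what a round-robin tournament could look like (the abstract does not specify the paper's actual procedure), the Python below counts pairwise wins. The function name `tournament_select` and the `prefer` judge are hypothetical; a real system would presumably use a model or execution-based comparison rather than the toy length rule used here.

```python
from itertools import combinations
from typing import Callable, Sequence


def tournament_select(candidates: Sequence[str],
                      prefer: Callable[[str, str], str]) -> str:
    """Round-robin tournament: every pair is compared once, most wins is returned.

    `prefer(a, b)` is a hypothetical judge that returns whichever of the two
    candidates it considers better. Ties in win count go to the earliest candidate.
    """
    wins = [0] * len(candidates)
    for i, j in combinations(range(len(candidates)), 2):
        winner = prefer(candidates[i], candidates[j])
        if winner == candidates[i]:
            wins[i] += 1
        else:
            wins[j] += 1
    best_index = max(range(len(candidates)), key=lambda k: wins[k])
    return candidates[best_index]


# Toy judge (an assumption for illustration only): prefer the shorter query.
def shorter(a: str, b: str) -> str:
    return a if len(a) <= len(b) else b


sql_candidates = [
    "SELECT a FROM t WHERE x = 1",
    "SELECT a FROM t",
    "SELECT a, b FROM t",
]
selected = tournament_select(sql_candidates, shorter)
```

A round-robin costs one comparison per pair of candidates, so the number of judge calls grows quadratically with the candidate pool.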

[52] OceanGym: A Benchmark Environment for Underwater Embodied Agents

Yida Xue, Mingjun Mao, Xiangyuan Ru, Yuqi Zhu, Baochang Ren, Shuofei Qiao, Mengru Wang, Shumin Deng, Xinyu An, Ningyu Zhang, Ying Chen, Huajun Chen

Main category: cs.CL

TL;DR: OceanGym is the first comprehensive benchmark for ocean underwater embodied agents, featuring 8 realistic tasks and a unified MLLM-driven framework to address extreme underwater challenges like low visibility and dynamic currents.

Details

Motivation: To advance AI in demanding underwater environments that present extreme perceptual and decision-making challenges not found in terrestrial or aerial domains.

Method: Uses a unified agent framework driven by Multi-modal Large Language Models (MLLMs) that integrates perception, memory, and sequential decision-making, requiring agents to comprehend optical/sonar data and autonomously explore complex environments.

Result: Extensive experiments show substantial performance gaps between state-of-the-art MLLM-driven agents and human experts, highlighting persistent difficulties in perception, planning, and adaptability in ocean environments.

Conclusion: OceanGym provides a high-fidelity platform for developing robust embodied AI and transferring capabilities to real-world autonomous ocean vehicles, representing a significant step toward intelligent agents for underwater exploration.

Abstract: We introduce OceanGym, the first comprehensive benchmark for ocean underwater embodied agents, designed to advance AI in one of the most demanding real-world environments. Unlike terrestrial or aerial domains, underwater settings present extreme perceptual and decision-making challenges, including low visibility and dynamic ocean currents, which make effective agent deployment exceptionally difficult. OceanGym encompasses eight realistic task domains and a unified agent framework driven by Multi-modal Large Language Models (MLLMs), which integrates perception, memory, and sequential decision-making. Agents are required to comprehend optical and sonar data, autonomously explore complex environments, and accomplish long-horizon objectives under these harsh conditions. Extensive experiments reveal substantial gaps between state-of-the-art MLLM-driven agents and human experts, highlighting the persistent difficulty of perception, planning, and adaptability in ocean underwater environments. By providing a high-fidelity, rigorously designed platform, OceanGym establishes a testbed for developing robust embodied AI and transferring these capabilities to real-world autonomous ocean underwater vehicles, marking a decisive step toward intelligent agents capable of operating in one of Earth’s last unexplored frontiers. The code and data are available at https://github.com/OceanGPT/OceanGym.

[53] LiRA: A Multi-Agent Framework for Reliable and Readable Literature Review Generation

Gregory Hok Tjoan Go, Khang Ly, Anders Søgaard, Amin Tabatabaei, Maarten de Rijke, Xinyi Chen

Main category: cs.CL

TL;DR: LiRA is a multi-agent workflow that automates literature review writing using specialized agents for outlining, writing, editing, and reviewing, outperforming existing baselines in quality while maintaining similarity to human reviews.

Details

Motivation: The rapid growth of scientific publications makes comprehensive literature reviews difficult. Automation has focused mainly on retrieval and screening, leaving the writing phase under-explored, especially regarding readability and factual accuracy.

Method: Multi-agent collaborative workflow emulating human review process, with specialized agents for content outlining, subsection writing, editing, and reviewing to produce cohesive review articles.

Result: Outperforms baselines like AutoSurvey and MASS-Survey in writing and citation quality on SciReviewGen and ScienceDirect datasets, maintains competitive similarity to human-written reviews, and shows robustness to reviewer model variation.

Conclusion: Agentic LLM workflows, even without domain-specific tuning, have potential to improve reliability and usability of automated scientific writing.

Abstract: The rapid growth of scientific publications has made it increasingly difficult to keep literature reviews comprehensive and up-to-date. Though prior work has focused on automating retrieval and screening, the writing phase of systematic reviews remains largely under-explored, especially with regard to readability and factual accuracy. To address this, we present LiRA (Literature Review Agents), a multi-agent collaborative workflow which emulates the human literature review process. LiRA utilizes specialized agents for content outlining, subsection writing, editing, and reviewing, producing cohesive and comprehensive review articles. Evaluated on SciReviewGen and a proprietary ScienceDirect dataset, LiRA outperforms current baselines such as AutoSurvey and MASS-Survey in writing and citation quality, while maintaining competitive similarity to human-written reviews. We further evaluate LiRA in real-world scenarios using document retrieval and assess its robustness to reviewer model variation. Our findings highlight the potential of agentic LLM workflows, even without domain-specific tuning, to improve the reliability and usability of automated scientific writing.

[54] Enhancing Reasoning Skills in Small Persian Medical Language Models Can Outperform Large-Scale Data Training

Mehrdad Ghassabi, Sadra Hakim, Hamidreza Baradaran Kashani, Pedram Rostami

Main category: cs.CL

TL;DR: Using RLAIF and DPO with Chain-of-Thought reasoning to enhance Persian medical QA capabilities in small language models, achieving superior performance with much less training data.

Details

Motivation: To improve reasoning capabilities in small language models for specialized applications like medical question answering in underrepresented languages like Persian, addressing data scarcity issues.

Method: Translated medical QA dataset to Persian, used RLAIF to generate rejected-preferred answer pairs with CoT reasoning, applied DPO training on a 4.5M token dataset containing correct and incorrect reasoning trajectories.

Result: The trained model significantly outperformed its predecessor (gaokerena-V trained on 57M tokens) despite using a much smaller dataset, demonstrating enhanced medical reasoning capabilities in Persian.

Conclusion: Reasoning-focused training approaches like RLAIF and DPO are highly efficient and effective for developing domain-specific language models with limited data availability.

Abstract: Enhancing reasoning capabilities in small language models is critical for specialized applications such as medical question answering, particularly in underrepresented languages like Persian. In this study, we employ Reinforcement Learning with AI Feedback (RLAIF) and Direct Preference Optimization (DPO) to improve the reasoning skills of a general-purpose Persian language model. To achieve this, we translated a multiple-choice medical question-answering dataset into Persian and used RLAIF to generate rejected-preferred answer pairs, which are essential for DPO training. By prompting both teacher and student models to produce Chain-of-Thought (CoT) reasoning responses, we compiled a dataset containing correct and incorrect reasoning trajectories. This dataset, comprising 2 million tokens in preferred answers and 2.5 million tokens in rejected ones, was used to train a baseline model, significantly enhancing its medical reasoning capabilities in Persian. Remarkably, the resulting model outperformed its predecessor, gaokerena-V, which was trained on approximately 57 million tokens, despite leveraging a much smaller dataset. These results highlight the efficiency and effectiveness of reasoning-focused training approaches in developing domain-specific language models with limited data availability.
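To make the role of the rejected-preferred pairs concrete, here is a minimal sketch of the standard DPO objective for a single pair, written in plain Python. It assumes the summed sequence log-probabilities under the policy and a frozen reference model are already available. The function name, argument names, and the default `beta=0.1` are illustrative assumptions and are not taken from the paper.

```python
import math


def dpo_loss(policy_chosen_logp: float, policy_rejected_logp: float,
             ref_chosen_logp: float, ref_rejected_logp: float,
             beta: float = 0.1) -> float:
    """Standard DPO loss for one (preferred, rejected) answer pair.

    Each argument is the total log-probability of a full answer under the
    trainable policy or the frozen reference model.
    """
    # How much more the policy likes each answer than the reference does.
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    z = beta * (chosen_margin - rejected_margin)
    # -log(sigmoid(z)) equals softplus(-z); this form avoids overflow.
    return max(-z, 0.0) + math.log1p(math.exp(-abs(z)))


# Policy identical to the reference: the loss is log(2).
baseline = dpo_loss(-12.0, -15.0, -12.0, -15.0)
# Policy has shifted probability toward the preferred answer: the loss drops.
improved = dpo_loss(-10.0, -20.0, -15.0, -15.0)
```

The loss only depends on how the policy's preference between the two answers has moved relative to the reference, which is why paired correct and incorrect reasoning trajectories are the required training signal.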

[55] EHR-R1: A Reasoning-Enhanced Foundational Language Model for Electronic Health Record Analysis

Yusheng Liao, Chaoyi Wu, Junwei Liu, Shuyang Jiang, Pengcheng Qiu, Haowen Wang, Yun Yue, Shuai Zhen, Jian Wang, Qianrui Fan, Jinjie Gu, Ya Zhang, Yanfeng Wang, Yu Wang, Weidi Xie

Main category: cs.CL

TL;DR: EHR-Ins is a large-scale EHR reasoning dataset with 300k reasoning cases and 4M non-reasoning cases across 42 tasks, used to train EHR-R1 LLMs that outperform state-of-the-art models like GPT-4o by over 30 points on clinical benchmarks.

Details

Motivation: Current LLMs have limited ability to analyze EHRs due to narrow task coverage and lack of EHR-oriented reasoning capabilities, creating a gap in automated clinical decision-making.

Method: Developed a thinking-graph-driven framework to generate EHR-Ins dataset, then trained EHR-R1 models (up to 72B parameters) using multi-stage training including domain adaptation, reasoning enhancement, and reinforcement learning.

Result: EHR-R1 consistently outperforms state-of-the-art commercial and open-source LLMs, surpassing GPT-4o by over 30 points on MIMIC-Bench and achieving 10% higher zero-shot AUROC on EHRSHOT.

Conclusion: The EHR-Ins dataset, EHR-R1 models, and EHR-Bench benchmark collectively advance the development of more reliable and clinically relevant EHR analysis systems.

Abstract: Electronic Health Records (EHRs) contain rich yet complex information, and their automated analysis is critical for clinical decision-making. Despite recent advances of large language models (LLMs) in clinical workflows, their ability to analyze EHRs remains limited due to narrow task coverage and lack of EHR-oriented reasoning capabilities. This paper aims to bridge the gap. Specifically, we present EHR-Ins, a large-scale, comprehensive EHR reasoning instruction dataset, comprising 300k high-quality reasoning cases and 4M non-reasoning cases across 42 distinct EHR tasks. Its core innovation is a thinking-graph-driven framework that enables the generation of high-quality reasoning data at scale. Based on it, we develop EHR-R1, a series of reasoning-enhanced LLMs with up to 72B parameters tailored for EHR analysis. Through a multi-stage training paradigm, including domain adaptation, reasoning enhancement, and reinforcement learning, EHR-R1 systematically acquires domain knowledge and diverse reasoning capabilities, enabling accurate and robust EHR analysis. Lastly, we introduce EHR-Bench, a new benchmark curated from MIMIC-IV, spanning 42 tasks, to comprehensively assess reasoning and prediction across EHR scenarios. In experiments, we show that the resulting EHR-R1 consistently outperforms state-of-the-art commercial and open-source LLMs (including DeepSeek-V3 and GPT-4o), surpassing GPT-4o by over 30 points on MIMIC-Bench and achieving a 10% higher zero-shot AUROC on EHRSHOT. Collectively, EHR-Ins, EHR-R1, and EHR-Bench significantly advance the development of more reliable and clinically relevant EHR analysis.

[56] AraFinNews: Arabic Financial Summarisation with Domain-Adapted LLMs

Mo El-Haj, Paul Rayson

Main category: cs.CL

TL;DR: AraFinNews is the largest Arabic financial news dataset with 212,500 article-headline pairs from 2015-2025, used to evaluate domain-specific Arabic financial summarization with LLMs.

Details

Motivation: To address the lack of large-scale Arabic financial datasets and provide a realistic benchmark for evaluating domain-specific language understanding and generation in Arabic financial contexts.

Method: Used transformer-based models (mT5, AraT5, and domain-adapted FinAraT5) to investigate the impact of domain specificity on abstractive summarization of Arabic financial texts.

Result: Domain-adapted models generated more coherent summaries with better handling of quantitative and entity-centric information, showing improved accuracy, numerical reliability, and stylistic alignment.

Conclusion: Domain-specific adaptation is crucial for improving narrative fluency in Arabic financial summarization, and the AraFinNews dataset provides a valuable resource for future research.

Abstract: We introduce AraFinNews, the largest publicly available Arabic financial news dataset to date, comprising 212,500 article-headline pairs spanning a decade of reporting from 2015 to 2025. Designed as an Arabic counterpart to major English summarisation corpora such as CNN/DailyMail, AraFinNews provides a realistic benchmark for evaluating domain-specific language understanding and generation in financial contexts. Using this resource, we investigate the impact of domain specificity on abstractive summarisation of Arabic financial texts with large language models (LLMs). In particular, we evaluate transformer-based models: mT5, AraT5, and the domain-adapted FinAraT5 to examine how financial-domain pretraining influences accuracy, numerical reliability, and stylistic alignment with professional reporting. Experimental results show that domain-adapted models generate more coherent summaries, especially in their handling of quantitative and entity-centric information. These findings highlight the importance of domain-specific adaptation for improving narrative fluency in Arabic financial summarisation. The dataset is freely available for non-commercial research at https://github.com/ArabicNLP-uk/AraFinNews.

[57] Toward Honest Language Models for Deductive Reasoning

Jiarui Liu, Kaustubh Dhole, Yingheng Wang, Haoyang Wen, Sarah Zhang, Haitao Mao, Gaotang Li, Neeraj Varshney, Jingguo Liu, Xiaoman Pan

Main category: cs.CL

TL;DR: The paper introduces honest deductive reasoning as a model’s ability to respond only when conclusions are logically entailed by premises, and proposes ACNCHOR, a reinforcement learning method that prevents early training collapse by injecting ground truth trajectories.

Details

Motivation: Current language models often fail to reason honestly, producing unwarranted answers when input is insufficient, highlighting the need for models that can abstain when logical entailment is not present.

Method: Created datasets from graph structures for linear algebra and logical inference, introduced unanswerable cases via edge perturbation, and developed ACNCHOR - a reinforcement learning method that injects ground truth trajectories into rollouts to prevent early training collapse.

Result: ACNCHOR stabilizes learning and significantly improves overall reasoning performance compared to prompting and existing training methods like GRPO, which struggle with these tasks and are vulnerable to collapse when negative rewards dominate early training.

Conclusion: Training dynamics are crucial for enabling honest deductive reasoning in language models, and the proposed ACNCHOR method effectively addresses the challenge of early training collapse in reinforcement learning approaches.

Abstract: Deductive reasoning is the process of deriving conclusions strictly from the given premises, without relying on external knowledge. We define honesty in this setting as a model’s ability to respond only when the conclusion is logically entailed by the premises, and to abstain otherwise. However, current language models often fail to reason honestly, producing unwarranted answers when the input is insufficient. To study this challenge, we formulate honest deductive reasoning as multi-step tasks where models must either derive the correct conclusion or abstain. We curate two datasets from graph structures, one for linear algebra and one for logical inference, and introduce unanswerable cases by randomly perturbing an edge in half of the instances. We find that prompting and existing training methods, including GRPO with or without supervised fine-tuning initialization, struggle on these tasks. In particular, GRPO optimizes only for final task outcomes, leaving models vulnerable to collapse when negative rewards dominate early training. To address this, we propose ACNCHOR, a reinforcement learning method that injects ground truth trajectories into rollouts, preventing early training collapse. Our results demonstrate that this method stabilizes learning and significantly improves the overall reasoning performance, underscoring the importance of training dynamics for enabling honest deductive reasoning in language models.
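One way to see why a rollout group dominated by negative rewards gives little to learn from is to look at group-relative advantages of the kind GRPO uses. The sketch below is an illustration under stated assumptions, not the paper's implementation: the reward values (-1 for a wrong answer, +1 for a correct one), the choice to replace the last rollout, and both function names are hypothetical.

```python
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-6):
    """Group-relative advantages: standardize rewards within one rollout group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def inject_ground_truth(rewards, gt_reward=1.0):
    """Swap the last rollout for a ground-truth trajectory (illustrative choice)."""
    return list(rewards[:-1]) + [gt_reward]


# Early training: every sampled rollout is wrong, so all rewards are equal.
all_wrong = [-1.0, -1.0, -1.0, -1.0]
flat = group_advantages(all_wrong)  # all zeros: no gradient signal

# With one ground-truth trajectory in the group, the advantages separate.
anchored = group_advantages(inject_ground_truth(all_wrong))
```

In the all-wrong group every advantage is zero, so the update carries no preference between rollouts. After injection, the ground-truth trajectory receives a positive advantage and the failed rollouts receive negative ones.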

[58] MindEval: Benchmarking Language Models on Multi-turn Mental Health Support

José Pombal, Maya D’Eon, Nuno M. Guerreiro, Pedro Henrique Martins, António Farinhas, Ricardo Rei

Main category: cs.CL

TL;DR: MindEval is a framework for automatically evaluating language models in realistic multi-turn mental health therapy conversations, developed with clinical psychologists. It shows current LLMs struggle significantly in therapeutic contexts.

Details

Motivation: Current AI mental health chatbots have limitations like sycophancy and reinforcement of maladaptive beliefs, and existing benchmarks don't capture the complexity of real therapeutic interactions.

Method: Developed MindEval framework with clinical psychologists using patient simulation and automatic LLM evaluation, validated against human-generated text and expert judgments.

Result: 12 state-of-the-art LLMs scored below 4/6 on average, with weaknesses in problematic AI communication patterns. Performance doesn’t improve with reasoning capabilities or model scale, and deteriorates with longer interactions or severe patient symptoms.

Conclusion: Current LLMs are inadequate for mental health therapy applications, and specialized evaluation frameworks like MindEval are needed to assess therapeutic capabilities properly.

Abstract: Demand for mental health support through AI chatbots is surging, though current systems present several limitations, like sycophancy or overvalidation, and reinforcement of maladaptive beliefs. A core obstacle to the creation of better systems is the scarcity of benchmarks that capture the complexity of real therapeutic interactions. Most existing benchmarks either only test clinical knowledge through multiple-choice questions or assess single responses in isolation. To bridge this gap, we present MindEval, a framework designed in collaboration with Ph.D.-level Licensed Clinical Psychologists for automatically evaluating language models in realistic, multi-turn mental health therapy conversations. Through patient simulation and automatic evaluation with LLMs, our framework balances resistance to gaming with reproducibility via its fully automated, model-agnostic design. We begin by quantitatively validating the realism of our simulated patients against human-generated text and by demonstrating strong correlations between automatic and human expert judgments. Then, we evaluate 12 state-of-the-art LLMs and show that all models struggle, scoring below 4 out of 6, on average, with particular weaknesses in problematic AI-specific patterns of communication. Notably, reasoning capabilities and model scale do not guarantee better performance, and systems deteriorate with longer interactions or when supporting patients with severe symptoms. We release all code, prompts, and human evaluation data.

[59] HyperbolicRAG: Enhancing Retrieval-Augmented Generation with Hyperbolic Representations

Linxiao Cao, Ruitao Wang, Jindong Li, Zhipeng Zhou, Menglin Yang

Main category: cs.CL

TL;DR: HyperbolicRAG integrates hyperbolic geometry into graph-based RAG to better capture hierarchical relationships in knowledge graphs, outperforming existing methods.

Details

Motivation: Standard graph-based RAG methods use Euclidean embeddings that capture semantic similarity but lack geometric representation of hierarchical depth, limiting their ability to handle complex knowledge graphs with abstraction relationships.

Method: Proposes HyperbolicRAG with three key components: depth-aware representation learning in a Poincaré manifold, unsupervised contrastive regularization for geometric consistency, and mutual-ranking fusion combining Euclidean and hyperbolic retrieval signals.

Result: Extensive experiments on multiple QA benchmarks show HyperbolicRAG outperforms competitive baselines including standard RAG and graph-augmented methods.

Conclusion: Hyperbolic geometry effectively enhances graph-based RAG by capturing both semantic similarity and hierarchical relationships, leading to improved performance in knowledge-intensive tasks.

Abstract: Retrieval-augmented generation (RAG) enables large language models (LLMs) to access external knowledge, helping mitigate hallucinations and enhance domain-specific expertise. Graph-based RAG enhances structural reasoning by introducing explicit relational organization that enables information propagation across semantically connected text units. However, these methods typically rely on Euclidean embeddings that capture semantic similarity but lack a geometric notion of hierarchical depth, limiting their ability to represent abstraction relationships inherent in complex knowledge graphs. To capture both fine-grained semantics and global hierarchy, we propose HyperbolicRAG, a retrieval framework that integrates hyperbolic geometry into graph-based RAG. HyperbolicRAG introduces three key designs: (1) a depth-aware representation learner that embeds nodes within a shared Poincaré manifold to align semantic similarity with hierarchical containment, (2) an unsupervised contrastive regularization that enforces geometric consistency across abstraction levels, and (3) a mutual-ranking fusion mechanism that jointly exploits retrieval signals from Euclidean and hyperbolic spaces, emphasizing cross-space agreement during inference. Extensive experiments across multiple QA benchmarks demonstrate that HyperbolicRAG outperforms competitive baselines, including both standard RAG and graph-augmented baselines.
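To illustrate the "geometric notion of hierarchical depth" that Euclidean embeddings lack, the sketch below computes the standard geodesic distance in the Poincaré ball model. It is a generic illustration of hyperbolic distance, not the paper's representation learner. The function name and the example points are assumptions, and inputs must lie strictly inside the unit ball.

```python
import math


def poincare_distance(u, v):
    """Geodesic distance between two points strictly inside the unit ball."""
    def sq_norm(x):
        return sum(c * c for c in x)

    diff = sq_norm([a - b for a, b in zip(u, v)])
    denom = (1.0 - sq_norm(u)) * (1.0 - sq_norm(v))
    # Clamp guards against tiny negative rounding when u and v coincide.
    return math.acosh(max(1.0, 1.0 + 2.0 * diff / denom))


origin = (0.0, 0.0)
mid_point = (0.5, 0.0)    # moderately far from the center
near_edge = (0.9, 0.0)    # close to the boundary of the ball

d_mid = poincare_distance(origin, mid_point)    # equals ln(3)
d_edge = poincare_distance(origin, near_edge)   # equals ln(19)
```

Distances grow without bound toward the boundary, so the norm of an embedding can act as a depth coordinate: general concepts can sit near the origin and specific ones near the edge, with ample room for many well-separated leaves.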

cs.CV

[60] PuzzlePoles: Cylindrical Fiducial Markers Based on the PuzzleBoard Pattern

Juri Zach, Peer Stelldinger

Main category: cs.CV

TL;DR: PuzzlePole is a new cylindrical fiducial marker based on the PuzzleBoard pattern that enables 360° recognition and robust pose estimation for autonomous systems.

Details

Motivation: Need for reliable environment perception in autonomous systems, where current visual markers lack robustness to occlusions and limited viewing angles.

Method: Developed cylindrical PuzzlePole markers derived from PuzzleBoard calibration pattern, leveraging its unique combinatorial structure for 360° viewing capability.

Result: PuzzlePoles provide high accuracy in localization and orientation estimation while being robust to occlusions, enabling reliable recognition from any viewing direction.

Conclusion: PuzzlePole offers flexible deployment for diverse autonomous systems including robot navigation, SLAM, and tangible interfaces with improved robustness and 360° recognition capability.

Abstract: Reliable perception of the environment is a key enabler for autonomous systems, where calibration and localization tasks often rely on robust visual markers. We introduce the PuzzlePole, a new type of fiducial marker derived from the recently proposed PuzzleBoard calibration pattern. The PuzzlePole is a cylindrical marker, enabling reliable recognition and pose estimation from any viewing direction across the full 360°. By leveraging the unique combinatorial structure of the PuzzleBoard pattern, PuzzlePoles provide high accuracy in localization and orientation while being robust to occlusions. The design offers flexibility for deployment in diverse autonomous systems scenarios, ranging from robot navigation and SLAM to tangible interfaces.

[61] Personalized Reward Modeling for Text-to-Image Generation

Jeongeun Lee, Ryang Heo, Dongha Lee

Main category: cs.CV

TL;DR: PIGReward is a personalized reward model for text-to-image generation that uses CoT reasoning to evaluate images based on individual user preferences, enabling personalized evaluation and prompt optimization without user-specific training.

Details

Motivation: Current text-to-image evaluation methods fail to capture personal visual preferences, as they rely on general reward functions or similarity metrics that don't account for individual user tastes.

Method: PIGReward uses a self-bootstrapping strategy with Chain-of-Thought reasoning to generate user-conditioned evaluation dimensions from limited reference data, creating rich user contexts for personalization without requiring user-specific training.

Result: Extensive experiments show PIGReward outperforms existing methods in accuracy and interpretability, and it enables personalized prompt optimization that improves alignment between generated images and individual user intent.

Conclusion: PIGReward establishes a scalable, reasoning-based foundation for personalized text-to-image evaluation and optimization, representing a significant step toward individually aligned image generation.

Abstract: Recent text-to-image (T2I) models generate semantically coherent images from textual prompts, yet evaluating how well they align with individual user preferences remains an open challenge. Conventional evaluation methods, such as general reward functions or similarity-based metrics, fail to capture the diversity and complexity of personal visual tastes. In this work, we present PIGReward, a personalized reward model that dynamically generates user-conditioned evaluation dimensions and assesses images through CoT reasoning. To address the scarcity of user data, PIGReward adopts a self-bootstrapping strategy that reasons over limited reference data to construct rich user contexts, enabling personalization without user-specific training. Beyond evaluation, PIGReward provides personalized feedback that drives user-specific prompt optimization, improving alignment between generated images and individual intent. We further introduce PIGBench, a per-user preference benchmark capturing diverse visual interpretations of shared prompts. Extensive experiments demonstrate that PIGReward surpasses existing methods in both accuracy and interpretability, establishing a scalable and reasoning-based foundation for personalized T2I evaluation and optimization. Taken together, our findings highlight PIGReward as a robust step toward individually aligned T2I generation.

[62] Pistachio: Towards Synthetic, Balanced, and Long-Form Video Anomaly Benchmarks

Jie Li, Hongyi Cai, Mingkang Dong, Muxin Pu, Shan You, Fei Wang, Tao Huang

Main category: cs.CV

TL;DR: Pistachio is a new Video Anomaly Detection/Understanding benchmark created using a generation-based pipeline that provides controlled scenes, anomaly types, and temporal narratives to overcome limitations of existing datasets.

Details

Motivation: Existing VAD benchmarks lack scene diversity, balanced anomaly coverage, and temporal complexity, while VAU requires deeper semantic reasoning but is difficult to benchmark due to heavy manual annotation requirements.

Method: Uses a controlled generation-based pipeline leveraging video generation models with scene-conditioned anomaly assignment, multi-step storyline generation, and temporally consistent long-form synthesis to create 41-second videos with minimal human intervention.

Result: Pistachio demonstrates scale, diversity, and complexity that reveal new challenges for existing methods and motivate future research on dynamic and multi-event anomaly understanding.

Conclusion: The generation-based approach effectively eliminates biases of Internet-collected datasets and provides a robust benchmark for advancing Video Anomaly Detection and Understanding research.

Abstract: Automatically detecting abnormal events in videos is crucial for modern autonomous systems, yet existing Video Anomaly Detection (VAD) benchmarks lack the scene diversity, balanced anomaly coverage, and temporal complexity needed to reliably assess real-world performance. Meanwhile, the community is increasingly moving toward Video Anomaly Understanding (VAU), which requires deeper semantic and causal reasoning but remains difficult to benchmark due to the heavy manual annotation effort it demands. In this paper, we introduce Pistachio, a new VAD/VAU benchmark constructed entirely through a controlled, generation-based pipeline. By leveraging recent advances in video generation models, Pistachio provides precise control over scenes, anomaly types, and temporal narratives, effectively eliminating the biases and limitations of Internet-collected datasets. Our pipeline integrates scene-conditioned anomaly assignment, multi-step storyline generation, and a temporally consistent long-form synthesis strategy that produces coherent 41-second videos with minimal human intervention. Extensive experiments demonstrate the scale, diversity, and complexity of Pistachio, revealing new challenges for existing methods and motivating future research on dynamic and multi-event anomaly understanding.

[63] SG-OIF: A Stability-Guided Online Influence Framework for Reliable Vision Data

Penghao Rao, Runmin Jiang, Min Xu

Main category: cs.CV

TL;DR: SG-OIF is a stability-guided online influence framework that addresses challenges in estimating training-point influence on test predictions for deep-learning vision models, achieving state-of-the-art performance on noise-label and out-of-distribution detection tasks.

Details

Motivation: Existing influence function implementations face challenges in deep-learning vision models due to expensive inverse-curvature computations, training non-stationarity, and fragile rankings from missing confidence calibration, making it difficult to identify critical training examples.

Method: SG-OIF treats algorithmic stability as a real-time controller, maintaining lightweight anchor IHVPs via stochastic Richardson and preconditioned Neumann methods, and uses modular curvature backends with stability-guided residual thresholds, anomaly gating, and confidence calibration.

Result: SG-OIF achieves SOTA performance across multiple datasets: 91.1% accuracy in top 1% prediction samples on CIFAR-10 (20% asymmetric noise) and 99.8% AUPR score on MNIST, demonstrating effective online influence estimation.

Conclusion: The framework provides a practical controller for online influence estimation, effectively addressing computational challenges and improving robustness in identifying critical training examples for deep-learning vision models.

Abstract: Approximating training-point influence on test predictions is critical for deploying deep-learning vision models and essential for locating noisy data. Though the influence function was proposed for attributing how infinitesimal up-weighting or removal of individual training examples affects model outputs, its implementation is still challenging in deep-learning vision models: inverse-curvature computations are expensive, and training non-stationarity invalidates static approximations. Prior works use iterative solvers and low-rank surrogates to reduce cost, but offline computation lags behind training dynamics, and missing confidence calibration yields fragile rankings that misidentify critical examples. To address these challenges, we introduce a Stability-Guided Online Influence Framework (SG-OIF), the first framework that treats algorithmic stability as a real-time controller, which (i) maintains lightweight anchor IHVPs via stochastic Richardson and preconditioned Neumann; (ii) proposes modular curvature backends to modulate per-example influence scores using stability-guided residual thresholds, anomaly gating, and confidence. Experimental results show that SG-OIF achieves state-of-the-art (SOTA) results on noise-label and out-of-distribution detection tasks across multiple datasets with various corruptions. Notably, our approach achieves 91.1% accuracy in the top 1% prediction samples on CIFAR-10 (20% asymmetric noise) and a 99.8% AUPR score on MNIST, demonstrating that this framework is a practical controller for online influence estimation.
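The inverse-Hessian-vector products (IHVPs) mentioned above can be approximated without ever forming an inverse. As a minimal sketch, the code below runs a plain deterministic Richardson iteration on a small explicit matrix. The paper's stochastic and preconditioned variants are not reproduced here, and the function names, step size, and toy matrix are illustrative assumptions.

```python
def matvec(H, x):
    """Dense matrix-vector product for a matrix given as a list of rows."""
    return [sum(h * xi for h, xi in zip(row, x)) for row in H]


def richardson_ihvp(H, v, step=0.1, iters=500):
    """Approximate the solution of H x = v by Richardson iteration.

    Update rule: x <- x + step * (v - H x). For a symmetric positive-definite
    H this converges when 0 < step < 2 / lambda_max(H). Starting from zero,
    the iterates equal partial sums of the Neumann series
    step * sum_k (I - step * H)^k v.
    """
    x = [0.0] * len(v)
    for _ in range(iters):
        residual = [vi - hxi for vi, hxi in zip(v, matvec(H, x))]
        x = [xi + step * ri for xi, ri in zip(x, residual)]
    return x


# Toy symmetric positive-definite "Hessian" and gradient vector.
H = [[3.0, 1.0],
     [1.0, 2.0]]
v = [5.0, 5.0]
x = richardson_ihvp(H, v)  # converges to the exact solution [1.0, 2.0]
```

Each step needs only one Hessian-vector product, which in a deep network can be obtained by automatic differentiation without materializing the Hessian. The residual `v - H x` also gives a natural quantity to monitor when deciding whether an estimate is trustworthy.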

[64] Tracking and Segmenting Anything in Any Modality

Tianlu Zhang, Qiang Zhang, Guiguang Ding, Jungong Han

Main category: cs.CV

TL;DR: SATA is a universal tracking and segmentation framework that unifies multiple subtasks across different modalities using a Decoupled Mixture-of-Expert mechanism and Task-aware Multi-object Tracking pipeline.

Details

Motivation: Existing approaches use specialized architectures for different tracking and segmentation tasks, limiting generalization and scalability due to distributional gaps across modalities and feature representation gaps across tasks.

Method: Proposes Decoupled Mixture-of-Expert (DeMoE) to separate cross-modal shared knowledge from specific information, and Task-aware Multi-object Tracking (TaMOT) pipeline to unify task outputs as instances with calibrated ID information.

Result: Demonstrates superior performance on 18 challenging tracking and segmentation benchmarks.

Conclusion: SATA provides a novel perspective for more generalizable video understanding by effectively addressing cross-task and cross-modal knowledge sharing challenges.

Abstract: Tracking and segmentation play essential roles in video understanding, providing basic positional information and temporal association of objects within video sequences. Despite their shared objective, existing approaches often tackle these tasks using specialized architectures or modality-specific parameters, limiting their generalization and scalability. Recent efforts have attempted to unify multiple tracking and segmentation subtasks from the perspectives of any modality input or multi-task inference. However, these approaches tend to overlook two critical challenges: the distributional gap across different modalities and the feature representation gap across tasks. These issues hinder effective cross-task and cross-modal knowledge sharing, ultimately constraining the development of a true generalist model. To address these limitations, we propose a universal tracking and segmentation framework named SATA, which unifies a broad spectrum of tracking and segmentation subtasks with any modality input. Specifically, a Decoupled Mixture-of-Expert (DeMoE) mechanism is presented to decouple the unified representation learning task into the modeling process of cross-modal shared knowledge and specific information, thus enabling the model to maintain flexibility while enhancing generalization. Additionally, we introduce a Task-aware Multi-object Tracking (TaMOT) pipeline to unify all the task outputs as a unified set of instances with calibrated ID information, thereby alleviating the degradation of task-specific knowledge during multi-task training. SATA demonstrates superior performance on 18 challenging tracking and segmentation benchmarks, offering a novel perspective for more generalizable video understanding.

[65] The Determinant Ratio Matrix Approach to Solving 3D Matching and 2D Orthographic Projection Alignment Tasks

Andrew J. Hanson, Sonya M. Hanson

Main category: cs.CV

TL;DR: This paper introduces the determinant ratio matrix (DRaM) approach for solving 3D-3D (EnP) and 3D-2D orthographic (OnP) pose estimation problems, providing closed-form solutions and placing existing methods in a unified framework.

Details

Motivation: To develop a unified mathematical framework for solving pose estimation problems, particularly for the OnP (orthographic projection) case which lacks known closed-form solutions comparable to SVD methods for EnP problems.

Method: Uses determinant ratio matrix (DRaM) approach to solve least squares systems for error-free EnP and OnP problems, with a rotation correction scheme for noisy data. Places QR decomposition and Moore-Penrose pseudoinverse methods within the DRaM family framework.

Result: Presents new closed-form solutions for 3D and 2D orthographic pose estimation, showing that DRaM methods generalize to all N-dimensional Euclidean pose problems and could have been discovered in Gauss’s time.

Conclusion: The DRaM approach provides both practical solutions and valuable theoretical insights into pose estimation problems, unifying existing methods and revealing their mathematical foundations.

Abstract: Pose estimation is a general problem in computer vision with wide applications. The relative orientation of a 3D reference object can be determined from a 3D rotated version of that object, or from a projection of the rotated object to a 2D planar image. This projection can be a perspective projection (the PnP problem) or an orthographic projection (the OnP problem). We restrict our attention here to the OnP problem and the full 3D pose estimation task (the EnP problem). Here we solve the least squares systems for both the error-free EnP and OnP problems in terms of the determinant ratio matrix (DRaM) approach. The noisy-data case can be addressed with a straightforward rotation correction scheme. While the SVD and optimal quaternion eigensystem methods solve the noisy EnP 3D-3D alignment exactly, the noisy 3D-2D orthographic (OnP) task has no known comparable closed form, and can be solved by DRaM-class methods. We note that while previous similar work has been presented in the literature exploiting both the QR decomposition and the Moore-Penrose pseudoinverse transformations, here we place these methods in a larger context that has not previously been fully recognized in the absence of the corresponding DRaM solution. We term this class of solutions as the DRaM family, and conduct comparisons of the behavior of the families of solutions for the EnP and OnP rotation estimation problems. Overall, this work presents both a new solution to the 3D and 2D orthographic pose estimation problems and provides valuable insight into these classes of problems. With hindsight, we are able to show that our DRaM solutions to the exact EnP and OnP problems possess derivations that could have been discovered in the time of Gauss, and in fact generalize to all analogous N-dimensional Euclidean pose estimation problems.
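
Illustrative sketch (not from the paper): in the error-free 3D-3D case, three reference points and their rotated images determine the rotation through an exact linear solve, and Cramer's rule writes each matrix entry as a ratio of determinants. Whether this coincides with the paper's DRaM construction is an assumption; the function names and test points below are hypothetical.

```python
from typing import List, Sequence

Row = Sequence[float]


def det3(m: Sequence[Row]) -> float:
    """Determinant of a 3x3 matrix given as three rows."""
    return (
        m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
        - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
        + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0])
    )


def rotation_from_exact_pairs(x_pts: Sequence[Row], y_pts: Sequence[Row]) -> List[List[float]]:
    """Recover R from three noise-free pairs y_k = R x_k via Cramer's rule.

    Requires det(X) != 0, i.e. the three reference points and the origin
    are not coplanar.
    """
    a = [list(p) for p in x_pts]  # row k holds reference point x_k
    d = det3(a)
    if d == 0:
        raise ValueError("reference points are degenerate")
    rot = []
    for i in range(3):  # solve for row i of R
        b = [y_pts[k][i] for k in range(3)]
        row = []
        for j in range(3):
            # Replace column j of the point matrix with the right-hand side.
            aj = [[b[k] if c == j else a[k][c] for c in range(3)] for k in range(3)]
            row.append(det3(aj) / d)
        rot.append(row)
    return rot


# Hypothetical data: a 90-degree rotation about the z axis.
x_pts = [(1.0, 0.0, 0.0), (0.0, 2.0, 0.0), (1.0, 1.0, 3.0)]
y_pts = [(0.0, 1.0, 0.0), (-2.0, 0.0, 0.0), (-1.0, 1.0, 3.0)]
print(rotation_from_exact_pairs(x_pts, y_pts))
```

With noisy data this exact solve generally does not return an orthogonal matrix, which is where a rotation correction step of the kind the abstract mentions would come in.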

[66] Single Image to High-Quality 3D Object via Latent Features

Huanning Dong, Yinuo Huang, Fan Li, Ping Kuang

Main category: cs.CV

TL;DR: LatentDreamer is a fast 3D generation framework that uses a pre-trained VAE to map 3D geometries to latent features, enabling high-fidelity object generation from single images in about 70 seconds.

Details

Motivation: Existing automatic 3D generation methods struggle to achieve fast, detailed, and high-fidelity generation simultaneously, creating a need for more efficient approaches.

Method: Uses a pre-trained variational autoencoder to map 3D geometries to latent features, then sequentially generates coarse geometries, refined geometries, and realistic textures through a pipeline.

Result: Generates 3D objects with high fidelity to input images within approximately 70 seconds, demonstrating competitive performance with minimal training.

Conclusion: LatentDreamer provides an efficient solution for fast, high-quality 3D object generation from single images using latent feature representation.

Abstract: 3D assets are essential in the digital age. While automatic 3D generation, such as image-to-3D, has made significant strides in recent years, it often struggles to achieve fast, detailed, and high-fidelity generation simultaneously. In this work, we introduce LatentDreamer, a novel framework for generating 3D objects from single images. The key to our approach is a pre-trained variational autoencoder that maps 3D geometries to latent features, which greatly reduces the difficulty of 3D generation. Starting from latent features, the pipeline of LatentDreamer generates coarse geometries, refined geometries, and realistic textures sequentially. The 3D objects generated by LatentDreamer exhibit high fidelity to the input images, and the entire generation process can be completed within a short time (typically in 70 seconds). Extensive experiments show that with only a small amount of training, LatentDreamer demonstrates competitive performance compared to contemporary approaches.

[67] Fewer Tokens, Greater Scaling: Self-Adaptive Visual Bases for Efficient and Expansive Representation Learning

Shawn Young, Xingyu Zeng, Lijian Xu

Main category: cs.CV

TL;DR: This paper explores how model size affects the minimum number of visual tokens needed to preserve image semantics, revealing that larger models require fewer tokens.

Details

Motivation: To understand the fundamental relationship between model capacity and the minimal visual tokens required to preserve image semantics, inspired by the Minimum Description Length principle.

Method: Proposes Orthogonal Filtering, a lightweight module that adaptively clusters redundant tokens into a compact set of orthogonal bases, and reinterprets image tokens as vectors in a visual semantic space.

Result: Extensive experiments reveal a consistent token-model scaling law: larger models require significantly fewer tokens to span visual semantic space. The paper also contributes a visual long-context dataset.

Conclusion: The study demonstrates a clear scaling relationship between model size and token efficiency, with larger models being more efficient at representing visual semantics with fewer tokens.

Abstract: This paper investigates the fundamental relationship between model capacity and the minimal number of visual tokens required to preserve image semantics. Inspired by the Minimum Description Length principle, we reinterpret image tokens as vectors in a visual semantic space and define the intrinsic semantic complexity of an image as the smallest set of basis vectors needed to span this space. Building on this perspective, we propose Orthogonal Filtering, a lightweight module that adaptively clusters redundant tokens into a compact set of orthogonal bases. Through extensive experiments across a range of ViT models, we reveal a consistent token-model scaling law: larger models require significantly fewer tokens to span visual semantic space. We also contribute a visual long-context dataset.

[68] ExDDV: A New Dataset for Explainable Deepfake Detection in Video

Vlad Hondru, Eduard Hogea, Darian Onchis, Radu Tudor Ionescu

Main category: cs.CV

TL;DR: ExDDV is the first dataset and benchmark for explainable deepfake detection in video, containing 5.4K real and fake videos with manual text and click annotations to explain artifacts.

Details

Motivation: As deepfake videos become more realistic, humans increasingly rely on automatic detectors, but these are error-prone and lack explainability, leaving people vulnerable to fraud and misinformation.

Method: Created ExDDV dataset with 5.4K videos annotated with text descriptions and clicks to explain artifacts. Evaluated vision-language models using fine-tuning and in-context learning strategies.

Result: Both text and click supervision are required for robust explainable models that can localize and describe deepfake artifacts effectively.

Conclusion: The ExDDV dataset enables development of explainable deepfake detectors that provide both detection and artifact explanations, addressing the explainability gap in current systems.

Abstract: The ever-growing realism and quality of generated videos make it increasingly hard for humans to spot deepfake content, so they need to rely more and more on automatic deepfake detectors. However, deepfake detectors are also prone to errors, and their decisions are not explainable, leaving humans vulnerable to deepfake-based fraud and misinformation. To this end, we introduce ExDDV, the first dataset and benchmark for Explainable Deepfake Detection in Video. ExDDV comprises around 5.4K real and deepfake videos that are manually annotated with text descriptions (to explain the artifacts) and clicks (to point out the artifacts). We evaluate a number of vision-language models on ExDDV, performing experiments with various fine-tuning and in-context learning strategies. Our results show that text and click supervision are both required to develop robust explainable models for deepfake videos, which are able to localize and describe the observed artifacts. Our novel dataset and code to reproduce the results are available at https://github.com/vladhondru25/ExDDV.

[69] VideoChat-M1: Collaborative Policy Planning for Video Understanding via Multi-Agent Reinforcement Learning

Boyu Chen, Zikang Wang, Zhengrong Yue, Kainan Yan, Chenyun Yu, Yi Huang, Zijun Liu, Yafei Wen, Xiaoxin Chen, Yang Liu, Peng Li, Yali Wang

Main category: cs.CV

TL;DR: VideoChat-M1 is a multi-agent system for video understanding that uses collaborative policy planning with dynamic tool invocation and reinforcement learning, achieving state-of-the-art performance across multiple benchmarks.

Details

Motivation: Current multi-agent frameworks for video understanding use static and non-learnable tool invocation mechanisms, which limit the discovery of diverse clues needed for robust perception and reasoning in complex videos.

Method: Uses Collaborative Policy Planning (CPP) with multiple policy agents: (1) Policy Generation - each agent creates unique tool invocation policies, (2) Policy Execution - agents sequentially invoke tools to explore video content, (3) Policy Communication - agents interact and update policies during execution. Enhanced with Multi-Agent Reinforcement Learning for joint optimization.

Result: Achieves SOTA performance across eight benchmarks spanning four tasks. Notably outperforms Gemini 2.5 pro by 3.6% and GPT-4o by 15.6% on LongVideoBench.

Conclusion: The collaborative multi-agent framework with dynamic policy planning and reinforcement learning significantly improves video understanding capabilities, demonstrating superior performance over existing state-of-the-art models.

Abstract: By leveraging tool-augmented Multimodal Large Language Models (MLLMs), multi-agent frameworks are driving progress in video understanding. However, most of them adopt static and non-learnable tool invocation mechanisms, which limit the discovery of diverse clues essential for robust perception and reasoning regarding temporally or spatially complex videos. To address this challenge, we propose a novel Multi-agent system for video understanding, namely VideoChat-M1. Instead of using a single or fixed policy, VideoChat-M1 adopts a distinct Collaborative Policy Planning (CPP) paradigm with multiple policy agents, which comprises three key processes. (1) Policy Generation: Each agent generates its unique tool invocation policy tailored to the user’s query; (2) Policy Execution: Each agent sequentially invokes relevant tools to execute its policy and explore the video content; (3) Policy Communication: During the intermediate stages of policy execution, agents interact with one another to update their respective policies. Through this collaborative framework, all agents work in tandem, dynamically refining their preferred policies based on contextual insights from peers to effectively respond to the user’s query. Moreover, we equip our CPP paradigm with a concise Multi-Agent Reinforcement Learning (MARL) method. Consequently, the team of policy agents can be jointly optimized to enhance VideoChat-M1’s performance, guided by both the final answer reward and intermediate collaborative process feedback. Extensive experiments demonstrate that VideoChat-M1 achieves SOTA performance across eight benchmarks spanning four tasks. Notably, on LongVideoBench, our method outperforms the SOTA model Gemini 2.5 pro by 3.6% and GPT-4o by 15.6%.

[70] Blinking Beyond EAR: A Stable Eyelid Angle Metric for Driver Drowsiness Detection and Data Augmentation

Mathis Wolter, Julie Stephany Berrio Perez, Mao Shan

Main category: cs.CV

TL;DR: Introduces Eyelid Angle (ELA), a novel 3D-based metric for measuring eye openness that’s robust to camera angle changes, enabling reliable blink detection and synthetic data generation for driver drowsiness monitoring.

Details

Motivation: Reliable driver drowsiness detection is needed to enhance road safety and support ADAS, but existing measures such as EAR are sensitive to viewpoint changes, and natural drowsiness datasets are scarce.

Method: Developed ELA metric from 3D facial landmarks to measure eye openness geometrically, designed blink detection framework extracting temporal characteristics, and created synthetic datasets using Blender 3D to animate avatars with controllable parameters.

Result: ELA shows lower variance under viewpoint changes compared to EAR, achieves accurate blink detection in public datasets, and synthetic augmentation successfully expands training data diversity for drowsiness recognition.

Conclusion: ELA serves as both a reliable biometric measure and powerful tool for generating scalable synthetic datasets, advancing driver state monitoring capabilities.

Abstract: Detecting driver drowsiness reliably is crucial for enhancing road safety and supporting advanced driver assistance systems (ADAS). We introduce the Eyelid Angle (ELA), a novel, reproducible metric of eye openness derived from 3D facial landmarks. Unlike conventional binary eye state estimators or 2D measures, such as the Eye Aspect Ratio (EAR), the ELA provides a stable geometric description of eyelid motion that is robust to variations in camera angle. Using the ELA, we design a blink detection framework that extracts temporal characteristics, including the closing, closed, and reopening durations, which are shown to correlate with drowsiness levels. To address the scarcity and risk of collecting natural drowsiness data, we further leverage ELA signals to animate rigged avatars in Blender 3D, enabling the creation of realistic synthetic datasets with controllable noise, camera viewpoints, and blink dynamics. Experimental results in public driver monitoring datasets demonstrate that the ELA offers lower variance under viewpoint changes compared to EAR and achieves accurate blink detection. At the same time, synthetic augmentation expands the diversity of training data for drowsiness recognition. Our findings highlight the ELA as both a reliable biometric measure and a powerful tool for generating scalable datasets in driver state monitoring.
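
Illustrative sketch (not from the paper): the viewpoint sensitivity of the 2D Eye Aspect Ratio can be seen with the commonly used six-landmark EAR formula and a toy camera model. The planar eye, the orthographic projection, and the landmark coordinates are simplifying assumptions; the ELA itself is not implemented here because the abstract does not give its definition.

```python
import math
from typing import List, Tuple

Point2D = Tuple[float, float]
Point3D = Tuple[float, float, float]


def eye_aspect_ratio(p: List[Point2D]) -> float:
    """EAR from six 2D landmarks p1..p6.

    p1 and p4 are the eye corners, p2 and p3 lie on the upper lid,
    p6 and p5 on the lower lid (p2 pairs with p6, p3 with p5).
    """
    p1, p2, p3, p4, p5, p6 = p
    vertical = math.dist(p2, p6) + math.dist(p3, p5)
    horizontal = math.dist(p1, p4)
    return vertical / (2.0 * horizontal)


def project_with_yaw(points: List[Point3D], yaw_rad: float) -> List[Point2D]:
    """Rotate about the vertical (y) axis, then drop z (orthographic camera)."""
    c, s = math.cos(yaw_rad), math.sin(yaw_rad)
    return [(c * x + s * z, y) for x, y, z in points]


# Hypothetical planar eye with a fixed physical opening.
eye = [
    (0.0, 0.0, 0.0), (1.0, 0.4, 0.0), (2.0, 0.4, 0.0),
    (3.0, 0.0, 0.0), (2.0, -0.4, 0.0), (1.0, -0.4, 0.0),
]
frontal = eye_aspect_ratio(project_with_yaw(eye, 0.0))
turned = eye_aspect_ratio(project_with_yaw(eye, math.radians(30.0)))
print(frontal, turned)
```

In this toy setup the eyelids do not move, yet the EAR grows by a factor of 1/cos(yaw) because only the projected eye width shrinks. A measure computed from 3D landmarks would not be affected by this foreshortening.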

[71] Connecting the Dots: Training-Free Visual Grounding via Agentic Reasoning

Liqin Luo, Guangyao Chen, Xiawu Zheng, Yongxing Dai, Yixiong Zou, Yonghong Tian

Main category: cs.CV

TL;DR: GroundingAgent is a zero-shot visual grounding framework that uses iterative reasoning with pretrained models to link text queries to image regions without task-specific fine-tuning, achieving 65.1% accuracy on benchmarks.

Details

Motivation: Existing visual grounding methods require extensive task-specific annotations and fine-tuning, limiting their generalization to novel scenarios. The paper aims to develop a framework that can perform visual grounding without fine-tuning.

Method: Uses a structured iterative reasoning mechanism combining pretrained open-vocabulary object detectors, multimodal LLMs, and LLMs to progressively refine candidate regions through joint semantic and spatial analyses.

Result: Achieves 65.1% zero-shot grounding accuracy on RefCOCO benchmarks without fine-tuning. With MLLM-generated captions replaced by original queries, selection accuracy reaches ~90%, matching supervised performance.

Conclusion: GroundingAgent demonstrates strong zero-shot visual grounding capabilities and interpretability, showing that LLM reasoning plays a critical role in achieving performance close to supervised methods.

Abstract: Visual grounding, the task of linking textual queries to specific regions within images, plays a pivotal role in vision-language integration. Existing methods typically rely on extensive task-specific annotations and fine-tuning, limiting their ability to generalize effectively to novel or out-of-distribution scenarios. To address these limitations, we introduce GroundingAgent, a novel agentic visual grounding framework that operates without any task-specific fine-tuning. GroundingAgent employs a structured, iterative reasoning mechanism that integrates pretrained open-vocabulary object detectors, multimodal large language models (MLLMs), and large language models (LLMs) to progressively refine candidate regions through joint semantic and spatial analyses. Remarkably, GroundingAgent achieves an average zero-shot grounding accuracy of 65.1% on widely-used benchmarks (RefCOCO, RefCOCO+, RefCOCOg), entirely without fine-tuning. Furthermore, by substituting MLLM-generated captions with the original query texts, the accuracy at the selection stage alone reaches approximately 90%, closely matching supervised performance and underscoring the critical role of LLM reasoning capabilities. GroundingAgent also offers strong interpretability, transparently illustrating each reasoning step and providing clear insights into its decision-making process.

[72] Cross-Domain Generalization of Multimodal LLMs for Global Photovoltaic Assessment

Muhao Guo, Yang Weng

Main category: cs.CV

TL;DR: A multimodal LLM approach for global photovoltaic assessment that outperforms traditional CV models in cross-domain generalization, achieving minimal performance degradation across unseen regions.

Details

Motivation: The rapid expansion of distributed PV systems creates grid management challenges due to undocumented installations, and traditional CV models require extensive labeled data and fail to generalize across regions.

Method: Leverages structured prompts and fine-tuning of a multimodal large language model to integrate detection, localization, and quantification within a unified schema for PV assessment.

Result: Cross-regional evaluation using ΔF1 metric shows the proposed model achieves the smallest performance degradation across unseen regions, outperforming conventional CV and transformer baselines.

Conclusion: Multimodal LLMs demonstrate robustness under domain shift and have potential for scalable, transferable, and interpretable global PV mapping.

Abstract: The rapid expansion of distributed photovoltaic (PV) systems poses challenges for power grid management, as many installations remain undocumented. While satellite imagery provides global coverage, traditional computer vision (CV) models such as CNNs and U-Nets require extensive labeled data and fail to generalize across regions. This study investigates the cross-domain generalization of a multimodal large language model (LLM) for global PV assessment. By leveraging structured prompts and fine-tuning, the model integrates detection, localization, and quantification within a unified schema. Cross-regional evaluation using the ΔF1 metric demonstrates that the proposed model achieves the smallest performance degradation across unseen regions, outperforming conventional CV and transformer baselines. These results highlight the robustness of multimodal LLMs under domain shift and their potential for scalable, transferable, and interpretable global PV mapping.

[73] Towards Efficient VLMs: Information-Theoretic Driven Compression via Adaptive Structural Pruning

Zhaoqi Xu, Yingying Zhang, Jian Li, Jianwei Guo, Qiannan Zhu, Hua Huang

Main category: cs.CV

TL;DR: InfoPrune is an information-theoretic framework for compressing vision-language models using Information Bottleneck principle, achieving 3.2x FLOP reduction with minimal performance loss.

Details

Motivation: Current VLMs are too large for deployment, and existing compression methods lack theoretical guarantees about information preservation.

Method: Proposes entropy-based effective rank (eRank) and KS distance to measure attention head importance, plus training-based head pruning and training-free FFN compression via low-rank approximation.

Result: Achieves up to 3.2x FLOP reduction and 1.8x acceleration on VQAv2, TextVQA, and GQA with negligible performance degradation.

Conclusion: InfoPrune provides a theoretically grounded and effective approach for efficient multimodal large models.

Abstract: Recent advances in vision-language models (VLMs) have shown remarkable performance across multimodal tasks, yet their ever-growing scale poses severe challenges for deployment and efficiency. Existing compression methods often rely on heuristic importance metrics or empirical pruning rules, lacking theoretical guarantees about information preservation. In this work, we propose InfoPrune, an information-theoretic framework for adaptive structural compression of VLMs. Grounded in the Information Bottleneck principle, we formulate pruning as a trade-off between retaining task-relevant semantics and discarding redundant dependencies. To quantify the contribution of each attention head, we introduce an entropy-based effective rank (eRank) and employ the Kolmogorov–Smirnov (KS) distance to measure the divergence between original and compressed structures. This yields a unified criterion that jointly considers structural sparsity and informational efficiency. Building on this foundation, we further design two complementary schemes: (1) a training-based head pruning guided by the proposed information loss objective, and (2) a training-free FFN compression via adaptive low-rank approximation. Extensive experiments on VQAv2, TextVQA, and GQA demonstrate that InfoPrune achieves up to 3.2x FLOP reduction and 1.8x acceleration with negligible performance degradation, establishing a theoretically grounded and practically effective step toward efficient multimodal large models.
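
Illustrative sketch (not from the paper): the two quantities named in the abstract have standard textbook forms. The snippet computes an entropy-based effective rank from a list of singular values and a two-sample Kolmogorov-Smirnov distance between empirical distributions. How InfoPrune applies them to attention heads is not specified in the abstract, so the inputs and function names here are assumptions.

```python
import math
from bisect import bisect_right
from typing import Sequence


def effective_rank(singular_values: Sequence[float]) -> float:
    """exp(Shannon entropy) of the normalized singular-value distribution."""
    total = sum(singular_values)
    if total <= 0:
        raise ValueError("need at least one positive singular value")
    probs = [s / total for s in singular_values if s > 0]
    entropy = -sum(p * math.log(p) for p in probs)
    return math.exp(entropy)


def ks_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Largest gap between the empirical CDFs of samples a and b."""
    sa, sb = sorted(a), sorted(b)
    gap = 0.0
    # Both CDFs are right-continuous step functions, so the supremum
    # is attained at one of the sample points.
    for x in sa + sb:
        fa = bisect_right(sa, x) / len(sa)
        fb = bisect_right(sb, x) / len(sb)
        gap = max(gap, abs(fa - fb))
    return gap


# Hypothetical spectra: a flat one and one dominated by a single direction.
print(effective_rank([1.0, 1.0, 1.0, 1.0]))
print(effective_rank([10.0, 0.1, 0.1, 0.1]))
print(ks_distance([0.1, 0.2, 0.3], [0.15, 0.25, 0.9]))
```

A flat spectrum gives an effective rank equal to the number of singular values, while a spectrum concentrated in one direction gives a value close to 1, which is the intuition behind using it as a redundancy signal.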

[74] Perceptual Taxonomy: Evaluating and Guiding Hierarchical Scene Reasoning in Vision-Language Models

Jonathan Lee, Xingrui Wang, Jiawei Peng, Luoxin Ye, Zehan Zheng, Tiezheng Zhang, Tao Wang, Wufei Ma, Siyi Chen, Yu-Cheng Chou, Prakhar Kaushik, Alan Yuille

Main category: cs.CV

TL;DR: Perceptual Taxonomy is a benchmark for physically grounded visual reasoning that evaluates object recognition, spatial configurations, and task-relevant properties like material, affordance, and function.

Details

Motivation: Current vision-language benchmarks focus on surface-level recognition and lack comprehensive evaluation of structured scene understanding that supports goal-directed reasoning, which is fundamental to human cognition.

Method: Created a benchmark with 3173 objects annotated with 84 fine-grained attributes across four property families. Constructed 28033 template-based questions across four types (object description, spatial reasoning, property matching, taxonomy reasoning) using 5802 images from synthetic and real domains, plus 50 expert-crafted questions.

Result: Leading vision-language models perform well on recognition tasks but degrade by 10-20% on property-driven questions, especially those requiring multi-step reasoning. Providing in-context reasoning examples from simulated scenes improves performance on real-world and expert-curated questions.

Conclusion: There is a persistent gap in structured visual understanding, with current models relying heavily on pattern matching rather than true reasoning. Perceptual-taxonomy-guided prompting shows promise for improving reasoning capabilities.

Abstract: We propose Perceptual Taxonomy, a structured process of scene understanding that first recognizes objects and their spatial configurations, then infers task-relevant properties such as material, affordance, function, and physical attributes to support goal-directed reasoning. While this form of reasoning is fundamental to human cognition, current vision-language benchmarks lack comprehensive evaluation of this ability and instead focus on surface-level recognition or image-text alignment. To address this gap, we introduce Perceptual Taxonomy, a benchmark for physically grounded visual reasoning. We annotate 3173 objects with four property families covering 84 fine-grained attributes. Using these annotations, we construct a multiple-choice question benchmark with 5802 images across both synthetic and real domains. The benchmark contains 28033 template-based questions spanning four types (object description, spatial reasoning, property matching, and taxonomy reasoning), along with 50 expert-crafted questions designed to evaluate models across the full spectrum of perceptual taxonomy reasoning. Experimental results show that leading vision-language models perform well on recognition tasks but degrade by 10 to 20 percent on property-driven questions, especially those requiring multi-step reasoning over structured attributes. These findings highlight a persistent gap in structured visual understanding and the limitations of current models that rely heavily on pattern matching. We also show that providing in-context reasoning examples from simulated scenes improves performance on real-world and expert-curated questions, demonstrating the effectiveness of perceptual-taxonomy-guided prompting.

[75] MapRF: Weakly Supervised Online HD Map Construction via NeRF-Guided Self-Training

Hongyu Lyu, Thomas Monninger, Julie Stephany Berrio Perez, Mao Shan, Zhenxing Ming, Stewart Worrall

Main category: cs.CV

TL;DR: MapRF is a weakly supervised framework that constructs 3D HD maps using only 2D image labels, eliminating the need for costly 3D annotations through a self-training approach with Neural Radiance Fields and Map-to-Ray Matching.

Details

Motivation: Existing HD map construction methods require expensive 3D map annotations for training, which limits their generalization and scalability across diverse driving environments. There is a need for more cost-effective approaches that can scale better.

Method: Proposes MapRF framework with: 1) Neural Radiance Fields module conditioned on map predictions to generate view-consistent 3D geometry and semantics pseudo labels, 2) Self-training iterative refinement using pseudo labels, 3) Map-to-Ray Matching strategy to align map predictions with camera rays from 2D labels and mitigate error accumulation.

Result: Extensive experiments on Argoverse 2 and nuScenes datasets show MapRF achieves performance comparable to fully supervised methods, attaining around 75% of baseline performance while surpassing other approaches using only 2D labels.

Conclusion: MapRF demonstrates the potential to enable scalable and cost-effective online HD map construction for autonomous driving by reducing dependency on expensive 3D annotations while maintaining competitive performance.

Abstract: Autonomous driving systems benefit from high-definition (HD) maps that provide critical information about road infrastructure. The online construction of HD maps offers a scalable approach to generate local maps from on-board sensors. However, existing methods typically rely on costly 3D map annotations for training, which limits their generalization and scalability across diverse driving environments. In this work, we propose MapRF, a weakly supervised framework that learns to construct 3D maps using only 2D image labels. To generate high-quality pseudo labels, we introduce a novel Neural Radiance Fields (NeRF) module conditioned on map predictions, which reconstructs view-consistent 3D geometry and semantics. These pseudo labels are then iteratively used to refine the map network in a self-training manner, enabling progressive improvement without additional supervision. Furthermore, to mitigate error accumulation during self-training, we propose a Map-to-Ray Matching strategy that aligns map predictions with camera rays derived from 2D labels. Extensive experiments on the Argoverse 2 and nuScenes datasets demonstrate that MapRF achieves performance comparable to fully supervised methods, attaining around 75% of the baseline while surpassing several approaches using only 2D labels. This highlights the potential of MapRF to enable scalable and cost-effective online HD map construction for autonomous driving.

[76] Vidi2: Large Multimodal Models for Video Understanding and Creation

Vidi Team, Celong Liu, Chia-Wen Kuo, Chuang Huang, Dawei Du, Fan Chen, Guang Chen, Haoji Zhang, Haojun Zhao, Lingxi Zhang, Lu Guo, Lusha Li, Longyin Wen, Qihang Fan, Qingyu Chen, Rachel Deng, Sijie Zhu, Stuart Siew, Tong Jin, Weiyan Tao, Wen Zhong, Xiaohui Shen, Xin Gu, Zhenfang Chen, Zuhua Lin

Main category: cs.CV

TL;DR: Vidi2 is a next-generation video understanding model that advances beyond temporal retrieval to include fine-grained spatio-temporal grounding and video question answering, enabling comprehensive multimodal reasoning and practical applications in video editing.

Details

Motivation: Video has become the primary medium for internet communication and creativity, creating strong demand for scalable, high-quality video production tools that can understand and reason about video content comprehensively.

Method: Vidi2 extends video understanding capabilities with end-to-end spatio-temporal grounding that can identify both timestamps and bounding boxes of target objects given text queries. The model is evaluated using new benchmarks VUE-STG and VUE-TR-V2 that improve upon existing datasets with longer video durations, better query formats, higher annotation quality, and refined evaluation metrics.

Result: Vidi2 substantially outperforms leading proprietary systems like Gemini 3 Pro and GPT-5 on both VUE-TR-V2 and VUE-STG benchmarks, while achieving competitive results with popular open-source models of similar scale on video QA benchmarks.

Conclusion: Vidi2 represents a significant advancement in video understanding, enabling practical applications in complex editing scenarios through its comprehensive spatio-temporal grounding capabilities and superior performance compared to state-of-the-art proprietary systems.

Abstract: Video has emerged as the primary medium for communication and creativity on the Internet, driving strong demand for scalable, high-quality video production. Vidi models continue to evolve toward next-generation video creation and have achieved state-of-the-art performance in multimodal temporal retrieval (TR). In its second release, Vidi2 advances video understanding with fine-grained spatio-temporal grounding (STG) and extends its capability to video question answering (Video QA), enabling comprehensive multimodal reasoning. Given a text query, Vidi2 can identify not only the corresponding timestamps but also the bounding boxes of target objects within the output time ranges. This end-to-end spatio-temporal grounding capability enables potential applications in complex editing scenarios, such as plot or character understanding, automatic multi-view switching, and intelligent, composition-aware reframing and cropping. To enable comprehensive evaluation of STG in practical settings, we introduce a new benchmark, VUE-STG, which offers four key improvements over existing STG datasets: 1) Video duration: spans from roughly 10s to 30 mins, enabling long-context reasoning; 2) Query format: queries are mostly converted into noun phrases while preserving sentence-level expressiveness; 3) Annotation quality: all ground-truth time ranges and bounding boxes are manually annotated with high accuracy; 4) Evaluation metric: a refined vIoU/tIoU/vIoU-Intersection scheme. In addition, we upgrade the previous VUE-TR benchmark to VUE-TR-V2, achieving a more balanced video-length distribution and more user-style queries. Remarkably, the Vidi2 model substantially outperforms leading proprietary systems, such as Gemini 3 Pro (Preview) and GPT-5, on both VUE-TR-V2 and VUE-STG, while achieving competitive results with popular open-source models with similar scale on video QA benchmarks.
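The abstract names tIoU and vIoU but does not spell out the refined scheme. As a rough illustration only, the sketch below uses the definitions commonly found in the spatio-temporal grounding literature: temporal IoU between two time intervals, and vIoU as the sum of per-frame box IoUs over frames where both prediction and ground truth exist, divided by the number of frames in their union. The function names, the single-interval input, and the frame-indexed dictionaries are assumptions made for this example; this is not the VUE-STG evaluation code.

```python
def t_iou(pred, gt):
    """Temporal IoU of two intervals given as (start, end) in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0


def box_iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def v_iou(pred_boxes, gt_boxes):
    """Spatio-temporal IoU.

    pred_boxes, gt_boxes: dict mapping frame index -> box (hypothetical layout).
    Frames present in only one of the two contribute zero to the sum but
    still count in the denominator.
    """
    frames_union = set(pred_boxes) | set(gt_boxes)
    frames_inter = set(pred_boxes) & set(gt_boxes)
    if not frames_union:
        return 0.0
    total = sum(box_iou(pred_boxes[t], gt_boxes[t]) for t in frames_inter)
    return total / len(frames_union)


if __name__ == "__main__":
    pred = {0: (0, 0, 2, 2), 1: (0, 0, 2, 2)}
    gt = {1: (0, 0, 2, 2), 2: (0, 0, 2, 2)}
    print(t_iou((0.0, 10.0), (5.0, 15.0)), v_iou(pred, gt))
```

Because vIoU divides by the union of frames, a prediction with perfect boxes but a poorly localized time range is still penalized, which is why the two quantities are usually reported together.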

[77] Studying Maps at Scale: A Digital Investigation of Cartography and the Evolution of Figuration

Remi Petitpierre

Main category: cs.CV

TL;DR: This research analyzes cartographic heritage at scale using 771,561 map records and 99,715 digitized images from 38 catalogs, spanning 1492-1948. It develops automated techniques to study maps as cultural objects, revealing geographic structures, political influences, and semiotic patterns in cartographic history.

Details

Motivation: Current automated map analysis methods have largely ignored the historical and cultural dimensions of cartography. This work aims to bridge the gap between computational techniques and the understanding of maps as semantic-symbolic systems reflecting political and epistemic contexts.

Method: Leveraged a large corpus of 771,561 map records and 99,715 digitized images from 38 digital catalogs. Applied semantic segmentation and object detection models trained on annotated data and synthetic images for land class recognition and cartographic sign detection. Used normalization techniques to handle 236,925 contributors across six centuries.

Result: Revealed geographic structures and global chronology of map publication, showing links between Atlantic maritime charting, triangular trade, and colonial expansion. Documented progression of national focus and military conflict impacts. Analyzed 63M signs and 25M fragments showing figurative shifts (e.g., hachures to contours) and local consistency in sign systems. Identified role of legitimacy, major actors, and cities in spreading figurative norms.

Conclusion: Maps are designed images with specific framing and composition that emphasize features through centering and semantic symmetries. The research demonstrates how large-scale computational analysis can reveal cultural patterns in cartographic heritage, showing maps as complex semantic-symbolic systems shaped by political and epistemic contexts over time.

Abstract: This thesis presents methods and datasets to investigate cartographic heritage on a large scale and from a cultural perspective. Heritage institutions worldwide have digitized more than one million maps, and automated techniques now enable large-scale recognition and extraction of map content. Yet these methods have engaged little with the history of cartography, or the view that maps are semantic-symbolic systems, and cultural objects reflecting political and epistemic expectations. This work leverages a diverse corpus of 771,561 map records and 99,715 digitized images aggregated from 38 digital catalogs. After normalization, the dataset includes 236,925 contributors and spans six centuries, from 1492 to 1948. These data make it possible to chart geographic structures and the global chronology of map publication. The spatial focus of cartography is analyzed in relation to political dynamics, evidencing links between Atlantic maritime charting, the triangular trade, and colonial expansion. Further results document the progression of national, domestic focus and the impact of military conflicts on publication volumes. The research introduces semantic segmentation techniques and object detection models for the generic recognition of land classes and cartographic signs, trained on annotated data and synthetic images. The analysis of land classes shows that maps are designed images whose framing and composition emphasize features through centering and semantic symmetries. The study of cartographic figuration encodes 63 M signs and 25 M fragments into a latent visual space, revealing figurative shifts such as the replacement of relief hachures by terrain contours and showing that signs tend to form locally consistent systems. Analyses of collaboration and diffusion highlight the role of legitimacy, larger actors, and major cities in the spread of figurative norms and semiotic cultures.

[78] Proxy-Free Gaussian Splats Deformation with Splat-Based Surface Estimation

Jaeyeong Kim, Seungwoo Yoo, Minhyuk Sung

Main category: cs.CV

TL;DR: SpLap is a proxy-free deformation method for Gaussian splats that uses a novel surface-aware splat graph and Laplacian operator to enable plausible deformations without requiring deformation proxies like cages or meshes.

Details

Motivation: Existing Gaussian splat (GS) deformation methods rely on deformation proxies, which depend on proxy quality and add computational overhead, while direct Laplacian-based approaches fail to capture surface information properly because splats lack explicit structure.

Method: Constructs a surface-aware splat graph by defining neighboring splats based on their intersections rather than just distance between centers, and uses a Laplacian operator derived from this graph along with Gaussian kernel adaptation to preserve surface structure during deformation.

Result: Superior performance compared to both proxy-based and proxy-free baselines, demonstrated on 50 challenging objects from ShapeNet, Objaverse, Sketchfab, and NeRF-Synthetic datasets.

Conclusion: The proposed method enables more plausible deformations that preserve details and topology without requiring deformation proxies, improving rendering quality after deformation.

Abstract: We introduce SpLap, a proxy-free deformation method for Gaussian splats (GS) based on a Laplacian operator computed from our novel surface-aware splat graph. Existing approaches to GS deformation typically rely on deformation proxies such as cages or meshes, but they suffer from dependency on proxy quality and additional computational overhead. An alternative is to directly apply Laplacian-based deformation techniques by treating splats as point clouds. However, this often fails to properly capture surface information due to the lack of explicit structure. To address this, we propose a novel method that constructs a surface-aware splat graph, enabling the Laplacian operator derived from it to support more plausible deformations that preserve details and topology. Our key idea is to leverage the spatial arrangement encoded in splats, defining neighboring splats not merely by the distance between their centers, but by their intersections. Furthermore, we introduce a Gaussian kernel adaptation technique that preserves surface structure under deformation, thereby improving rendering quality after deformation. In our experiments, we demonstrate the superior performance of our method compared to both proxy-based and proxy-free baselines, evaluated on 50 challenging objects from the ShapeNet, Objaverse, and Sketchfab datasets, as well as the NeRF-Synthetic dataset. Code is available at https://github.com/kjae0/SpLap.
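To make the idea of an intersection-based neighborhood concrete, here is a deliberately simplified sketch. It assumes each splat can be approximated by a sphere with a given radius, connects two splats when their spheres overlap, and then computes uniform-weight Laplacian coordinates (each center minus the mean of its neighbors). The sphere approximation, the uniform weights, and all function names are assumptions for illustration; the actual SpLap graph construction and operator may differ, and the released code is the authoritative reference.

```python
import math


def build_splat_graph(centers, radii):
    """Connect splats whose bounding spheres overlap (toy intersection test)."""
    n = len(centers)
    adjacency = [set() for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            if math.dist(centers[i], centers[j]) <= radii[i] + radii[j]:
                adjacency[i].add(j)
                adjacency[j].add(i)
    return adjacency


def laplacian_coordinates(centers, adjacency):
    """Uniform graph Laplacian applied to centers: x_i - mean(neighbors of i)."""
    deltas = []
    for i, neighbors in enumerate(adjacency):
        if not neighbors:
            # Isolated splat: no neighbors, so define its differential as zero.
            deltas.append((0.0, 0.0, 0.0))
            continue
        mean = [
            sum(centers[j][k] for j in neighbors) / len(neighbors)
            for k in range(3)
        ]
        deltas.append(tuple(centers[i][k] - mean[k] for k in range(3)))
    return deltas


if __name__ == "__main__":
    centers = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0), (10.0, 0.0, 0.0)]
    radii = [0.6, 0.6, 0.6, 0.6]
    graph = build_splat_graph(centers, radii)
    print(graph)
    print(laplacian_coordinates(centers, graph))
```

In Laplacian-based editing, these differential coordinates are what a deformation solver tries to preserve while a few handle points are moved, so the choice of neighbors directly shapes which local structure is kept.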

[79] Think First, Assign Next (ThiFAN-VQA): A Two-stage Chain-of-Thought Framework for Post-Disaster Damage Assessment

Ehsan Karimi, Nhut Le, Maryam Rahnemoonfar

Main category: cs.CV

TL;DR: ThiFAN-VQA is a two-stage reasoning-based VQA framework for post-disaster damage assessment that uses chain-of-thought prompting and answer selection to improve accuracy and interpretability in UAV imagery analysis.

Details

Motivation: Existing AI approaches for disaster damage assessment are limited by costly data annotation, fixed answer spaces that restrict flexibility, and generative models that often produce hallucinated or generic responses lacking domain relevance.

Method: Two-stage framework: 1) Generate structured reasoning traces using chain-of-thought prompting and in-context learning, 2) Answer selection module evaluates responses and assigns most coherent/accurate answer. Integrates custom information retrieval, domain-specific prompting, and reasoning-guided selection.

Result: Experiments on FloodNet and RescueNet-VQA datasets show ThiFAN-VQA achieves superior accuracy, interpretability, and adaptability for real-world post-disaster damage assessment compared to existing methods.

Conclusion: ThiFAN-VQA bridges the gap between zero-shot and supervised methods, combining flexibility with consistency for effective UAV-based disaster damage assessment without requiring extensive retraining or data collection.

Abstract: Timely and accurate assessment of damages following natural disasters is essential for effective emergency response and recovery. Recent AI-based frameworks have been developed to analyze large volumes of aerial imagery collected by Unmanned Aerial Vehicles, providing actionable insights rapidly. However, creating and annotating data for training these models is costly and time-consuming, resulting in datasets that are limited in size and diversity. Furthermore, most existing approaches rely on traditional classification-based frameworks with fixed answer spaces, restricting their ability to provide new information without additional data collection or model retraining. Using pre-trained generative models built on in-context learning (ICL) allows for flexible and open-ended answer spaces. However, these models often generate hallucinated outputs or produce generic responses that lack domain-specific relevance. To address these limitations, we propose ThiFAN-VQA, a two-stage reasoning-based framework for visual question answering (VQA) in disaster scenarios. ThiFAN-VQA first generates structured reasoning traces using chain-of-thought (CoT) prompting and ICL to enable interpretable reasoning under limited supervision. A subsequent answer selection module evaluates the generated responses and assigns the most coherent and contextually accurate answer, effectively improving model performance. By integrating a custom information retrieval system, domain-specific prompting, and reasoning-guided answer selection, ThiFAN-VQA bridges the gap between zero-shot and supervised methods, combining flexibility with consistency. Experiments on FloodNet and RescueNet-VQA, UAV-based datasets from flood- and hurricane-affected regions, demonstrate that ThiFAN-VQA achieves superior accuracy, interpretability, and adaptability for real-world post-disaster damage assessment tasks.

[80] HunyuanOCR Technical Report

Hunyuan Vision Team, Pengyuan Lyu, Xingyu Wan, Gengluo Li, Shangpin Peng, Weinong Wang, Liang Wu, Huawen Shen, Yu Zhou, Canhui Tang, Qi Yang, Qiming Peng, Bin Luo, Hower Yang, Houwen Peng, Hongming Yang, Senhao Xie, Binghong Wu, Mana Yang, Sergey Wang, Raccoon Liu, Dick Zhu, Jie Jiang, Linus, Han Hu, Chengquan Zhang

Main category: cs.CV

TL;DR: HunyuanOCR is a 1B-parameter open-source VLM for OCR that outperforms commercial APIs and larger models, achieving SOTA results in perception and semantic tasks while unifying versatility and efficiency in a streamlined end-to-end architecture.

Details

Motivation: To address limitations of narrow OCR expert models and inefficient general VLMs by creating a lightweight model that unifies core OCR capabilities while eliminating error propagation from traditional pipeline dependencies.

Method: Uses a Native Vision Transformer and lightweight LLM connected via MLP adapter, with pure end-to-end paradigm and data-driven approach including Reinforcement Learning strategies for performance gains.

Result: Outperforms commercial APIs, traditional pipelines, and larger models like Qwen3-VL-4B; achieves 1st place in ICDAR 2025 DIMT Challenge (Small Model Track) and SOTA on OCRBench among VLMs with <3B parameters.

Conclusion: HunyuanOCR successfully unifies versatility and efficiency in OCR, provides production-ready deployment, and advances frontier research while offering a solid foundation for industrial applications.

Abstract: This paper presents HunyuanOCR, a commercial-grade, open-source, and lightweight (1B parameters) Vision-Language Model (VLM) dedicated to OCR tasks. The architecture comprises a Native Vision Transformer (ViT) and a lightweight LLM connected via an MLP adapter. HunyuanOCR demonstrates superior performance, outperforming commercial APIs, traditional pipelines, and larger models (e.g., Qwen3-VL-4B). Specifically, it surpasses current public solutions in perception tasks (Text Spotting, Parsing) and excels in semantic tasks (IE, Text Image Translation), securing first place in the ICDAR 2025 DIMT Challenge (Small Model Track). Furthermore, it achieves state-of-the-art (SOTA) results on OCRBench among VLMs with fewer than 3B parameters. HunyuanOCR achieves breakthroughs in three key aspects: 1) Unifying Versatility and Efficiency: We implement comprehensive support for core capabilities including spotting, parsing, IE, VQA, and translation within a lightweight framework. This addresses the limitations of narrow “OCR expert models” and inefficient “General VLMs”. 2) Streamlined End-to-End Architecture: Adopting a pure end-to-end paradigm eliminates dependencies on pre-processing modules (e.g., layout analysis). This fundamentally resolves error propagation common in traditional pipelines and simplifies system deployment. 3) Data-Driven and RL Strategies: We confirm the critical role of high-quality data and, for the first time in the industry, demonstrate that Reinforcement Learning (RL) strategies yield significant performance gains in OCR tasks. HunyuanOCR is officially open-sourced on HuggingFace. We also provide a high-performance deployment solution based on vLLM, placing its production efficiency in the top tier. We hope this model will advance frontier research and provide a solid foundation for industrial applications.

[81] Leveraging Unlabeled Scans for NCCT Image Segmentation in Early Stroke Diagnosis: A Semi-Supervised GAN Approach

Maria Thoma, Michalis A. Savelonas, Dimitris K. Iakovidis

Main category: cs.CV

TL;DR: Semi-supervised GAN method for early ischemic stroke segmentation in NCCT scans, using limited labeled data and leveraging unlabeled scans to improve detection of subtle ischemic changes.

Details

Motivation: Early ischemic stroke detection in NCCT is challenging due to subtle changes, leading to delayed interventions and poor patient outcomes. Current methods struggle with limited annotated data.

Method: Semi-supervised segmentation using GANs with adversarial framework, combining Dice loss, cross-entropy loss, feature matching loss, and self-training loss to learn from limited labeled and abundant unlabeled NCCT scans.

Result: Experiments on AISD dataset show improved delineation of early infarcts, even when faint or small, enhancing diagnostic capabilities.

Conclusion: The method reduces manual annotation burden and supports more efficient clinical decision-making in stroke care by improving early detection of ischemic regions.

Abstract: Ischemic stroke is a time-critical medical emergency where rapid diagnosis is essential for improving patient outcomes. Non-contrast computed tomography (NCCT) serves as the frontline imaging tool, yet it often fails to reveal the subtle ischemic changes present in the early, hyperacute phase. This limitation can delay crucial interventions. To address this diagnostic challenge, we introduce a semi-supervised segmentation method using generative adversarial networks (GANs) to accurately delineate early ischemic stroke regions. The proposed method employs an adversarial framework to effectively learn from a limited number of annotated NCCT scans, while simultaneously leveraging a larger pool of unlabeled scans. By employing Dice loss, cross-entropy loss, a feature matching loss and a self-training loss, the model learns to identify and delineate early infarcts, even when they are faint or their size is small. Experiments on the publicly available Acute Ischemic Stroke Dataset (AISD) demonstrate the potential of the proposed method to enhance diagnostic capabilities, reduce the burden of manual annotation, and support more efficient clinical decision-making in stroke care.
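Two of the four loss terms listed above, Dice and cross-entropy, have standard forms that can be shown compactly. The sketch below computes a soft Dice loss and a binary cross-entropy over a flattened list of per-pixel foreground probabilities, then combines them with weights. The weights `w_dice` and `w_ce`, the smoothing constants, and the function names are assumptions for this example; the feature matching and self-training terms are omitted because the summary does not describe their form.

```python
import math


def dice_loss(probs, targets, eps=1e-6):
    """Soft Dice loss: 1 - 2|P.G| / (|P| + |G|), smoothed by eps."""
    intersection = sum(p * t for p, t in zip(probs, targets))
    total = sum(probs) + sum(targets)
    return 1.0 - (2.0 * intersection + eps) / (total + eps)


def bce_loss(probs, targets, eps=1e-7):
    """Mean binary cross-entropy; probabilities are clamped to avoid log(0)."""
    loss = 0.0
    for p, t in zip(probs, targets):
        p = min(max(p, eps), 1.0 - eps)
        loss -= t * math.log(p) + (1 - t) * math.log(1.0 - p)
    return loss / len(probs)


def supervised_loss(probs, targets, w_dice=1.0, w_ce=1.0):
    """Weighted sum of the two terms (weights are hypothetical)."""
    return w_dice * dice_loss(probs, targets) + w_ce * bce_loss(probs, targets)


if __name__ == "__main__":
    target = [1, 0, 1, 0]
    print(supervised_loss([0.9, 0.1, 0.8, 0.2], target))
    print(supervised_loss([0.1, 0.9, 0.2, 0.8], target))
```

Dice is often paired with cross-entropy for small lesions because it is driven by overlap rather than by the overwhelming number of background pixels, which fits the faint or small infarcts mentioned in the abstract.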

[82] Multiscale Vector-Quantized Variational Autoencoder for Endoscopic Image Synthesis

Dimitrios E. Diamantis, Dimitris K. Iakovidis

Main category: cs.CV

TL;DR: Proposes a novel VAE-based method (MSVQ-VAE) for generating synthetic WCE images with abnormalities to address data scarcity in medical imaging, enabling conditional generation of various GI abnormalities.

Details

Motivation: Address data scarcity in WCE imaging due to privacy constraints and annotation costs, which hinders development of Clinical Decision Support systems for gastrointestinal screening.

Method: Introduces Multiscale Vector Quantized Variational Autoencoder (MSVQ-VAE) that enables conditional generation of synthetic WCE images with various abnormalities including polyps, vascular and inflammatory conditions.

Result: Training a CDS classifier using synthetic abnormal images generated by MSVQ-VAE yields comparable performance to training with only real data.

Conclusion: The proposed methodology effectively addresses medical data scarcity and shows promise for broader applicability in medical multimedia domains.

Abstract: Gastrointestinal (GI) imaging via Wireless Capsule Endoscopy (WCE) generates a large number of images requiring manual screening. Deep learning-based Clinical Decision Support (CDS) systems can assist screening, yet their performance relies on the existence of large, diverse medical training datasets. However, the scarcity of such data, due to privacy constraints and annotation costs, hinders CDS development. Generative machine learning offers a viable solution to combat this limitation. While current Synthetic Data Generation (SDG) methods, such as Generative Adversarial Networks and Variational Autoencoders, have been explored, they often face challenges with training stability and capturing sufficient visual diversity, especially when synthesizing abnormal findings. This work introduces a novel VAE-based methodology for medical image synthesis and presents its application for the generation of WCE images. The novel contributions of this work include a) multiscale extension of the Vector Quantized VAE model, named Multiscale Vector Quantized Variational Autoencoder (MSVQ-VAE); b) unlike other VAE-based SDG models for WCE image generation, MSVQ-VAE is used to seamlessly introduce abnormalities into normal WCE images; c) it enables conditional generation of synthetic images, allowing the introduction of different types of abnormalities into the normal WCE images; d) it performs experiments with a variety of abnormality types, including polyps, vascular and inflammatory conditions. The utility of the generated images for CDS is assessed via image classification. Comparative experiments demonstrate that training a CDS classifier using the abnormal images generated by the proposed methodology yields results comparable to those of a classifier trained with only real data. The generality of the proposed methodology promises its applicability to various domains related to medical multimedia.

[83] SkillSight: Efficient First-Person Skill Assessment with Gaze

Chi Hsuan Wu, Kumar Ashutosh, Kristen Grauman

Main category: cs.CV

TL;DR: SkillSight enables power-efficient skill assessment using gaze data instead of continuous video processing, achieving high accuracy with 73x less power consumption.

Details

Motivation: Automatic skill assessment from egocentric perception is challenging, and current methods require continuous video processing which is power-intensive for smart glasses.

Method: Two-stage framework: first learns joint modeling of gaze and video for skill prediction, then distills a gaze-only student model that eliminates video processing at inference.

Result: Achieves state-of-the-art performance on three datasets (cooking, music, sports) and gaze-only model maintains high accuracy with 73x less power than competing methods.

Conclusion: Gaze data is valuable for skill understanding across diverse real-world settings, enabling practical AI-supported skill learning applications on smart glasses.

Abstract: Egocentric perception on smart glasses could transform how we learn new skills in the physical world, but automatic skill assessment remains a fundamental technical challenge. We introduce SkillSight for power-efficient skill assessment from first-person data. Central to our approach is the hypothesis that skill level is evident not only in how a person performs an activity (video), but also in how they direct their attention when doing so (gaze). Our two-stage framework first learns to jointly model gaze and egocentric video when predicting skill level, then distills a gaze-only student model. At inference, the student model requires only gaze input, drastically reducing power consumption by eliminating continuous video processing. Experiments on three datasets spanning cooking, music, and sports establish, for the first time, the valuable role of gaze in skill understanding across diverse real-world settings. Our SkillSight teacher model achieves state-of-the-art performance, while our gaze-only student variant maintains high accuracy using 73x less power than competing methods. These results pave the way for in-the-wild AI-supported skill learning.

[84] Learning Hierarchical Sparse Transform Coding of 3DGS

Hao Xu, Xiaolin Wu, Xi Zhang

Main category: cs.CV

TL;DR: SHTC introduces a sparsity-guided hierarchical transform coding method for 3D Gaussian Splatting compression, achieving better rate-distortion performance and faster decoding than state-of-the-art methods by incorporating neural transform coding.

Details

Motivation: Current 3DGS compression methods lack analysis-synthesis transform, leaving redundancy removal to entropy coding which increases latency and computational burden.

Method: Uses KLT to decorrelate intra-anchor attributes, followed by quantization and entropy coding, then compresses KLT residuals with a low-complexity scene-adaptive neural transform using sparsity prior and deep unfolding.

Result: Achieves improved rate-distortion performance and higher decoding speed over state-of-the-art methods while reducing memory usage.

Conclusion: SHTC provides an effective solution for 3DGS compression with parameter-efficient design that could inspire future low-complexity neural codecs.

Abstract: 3D Gaussian Splatting (3DGS) supports fast, high-quality novel view synthesis but has a heavy memory footprint, making the compression of its model crucial. Current state-of-the-art (SOTA) 3DGS compression methods adopt an anchor-based architecture that pairs the Scaffold-GS representation with conditional entropy coding. However, these methods forgo the analysis-synthesis transform, a vital mechanism in visual data compression. As a result, redundancy remains intact in the signal and its removal is left to the entropy coder, which computationally overburdens the entropy coding module, increasing coding latency. Even with added complexity, thorough redundancy removal is a task unsuited to an entropy coder. To fix this critical omission, we introduce a Sparsity-guided Hierarchical Transform Coding (SHTC) method, the first study on the end-to-end learned neural transform coding of 3DGS. SHTC applies KLT to decorrelate intra-anchor attributes, followed by quantization and entropy coding, and then compresses KLT residuals with a low-complexity, scene-adaptive neural transform. Aided by the sparsity prior and deep unfolding technique, the learned transform uses only a few trainable parameters, reducing the memory usage. Overall, SHTC achieves an appreciably improved R-D performance and at the same time higher decoding speed over SOTA. Its prior-guided, parameter-efficient design may also inspire low-complexity neural image and video codecs. Our code will be released at https://github.com/hxu160/SHTC_for_3DGS_compression.
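For readers unfamiliar with the KLT (Karhunen-Loève transform), the toy sketch below shows what decorrelating attributes means in the simplest case of two correlated values per sample. In two dimensions the KLT reduces to rotating the mean-centered data by the principal-axis angle, which has a closed form, so no eigen-solver is needed. This is only a 2D illustration with hypothetical function names; the attribute dimensionality, quantization, entropy coding, and neural residual transform of SHTC are not modeled here.

```python
import math


def covariance_2d(points):
    """Return (mean_x, mean_y, cxx, cyy, cxy) using population statistics."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    cxx = sum((p[0] - mx) ** 2 for p in points) / n
    cyy = sum((p[1] - my) ** 2 for p in points) / n
    cxy = sum((p[0] - mx) * (p[1] - my) for p in points) / n
    return mx, my, cxx, cyy, cxy


def klt_2d_forward(points):
    """Rotate centered data onto its principal axes (2D KLT)."""
    mx, my, cxx, cyy, cxy = covariance_2d(points)
    # Angle of the principal axis of the 2x2 covariance matrix.
    theta = 0.5 * math.atan2(2.0 * cxy, cxx - cyy)
    c, s = math.cos(theta), math.sin(theta)
    coeffs = [
        (c * (x - mx) + s * (y - my), -s * (x - mx) + c * (y - my))
        for x, y in points
    ]
    return coeffs, theta, (mx, my)


def klt_2d_inverse(coeffs, theta, mean):
    """Undo the rotation and add the mean back."""
    c, s = math.cos(theta), math.sin(theta)
    return [
        (c * a - s * b + mean[0], s * a + c * b + mean[1])
        for a, b in coeffs
    ]


if __name__ == "__main__":
    data = [(0.0, 0.0), (1.0, 1.1), (2.0, 1.9), (3.0, 3.2), (4.0, 3.9)]
    coeffs, theta, mean = klt_2d_forward(data)
    print(covariance_2d(coeffs))
```

After the transform most of the variance sits in the first coefficient and the cross-covariance vanishes, so each coefficient can be quantized and entropy coded more independently, which is the role the abstract assigns to the transform stage.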

[85] On the Utility of Foundation Models for Fast MRI: Vision-Language-Guided Image Reconstruction

Ruimin Feng, Xingxin He, Ronald Mercer, Zachary Stewart, Fang Liu

Main category: cs.CV

TL;DR: Vision-language foundation models can enhance undersampled MRI reconstruction by providing high-level semantic priors through contrastive learning, improving perceptual quality and anatomical structure preservation.

Details

Motivation: To investigate whether vision-language foundation models can provide high-level contextual information beyond conventional priors to improve undersampled MRI reconstruction quality.

Method: Proposed a semantic distribution-guided reconstruction framework using pre-trained vision-language models to encode images and auxiliary information into semantic features, with a contrastive objective aligning reconstructed representations with target semantic distributions.

Result: Experiments on knee and brain datasets showed semantic priors preserve fine anatomical structures, achieve superior perceptual quality (lower LPIPS, higher Tenengrad scores), and enable high-level control over reconstruction attributes while maintaining data fidelity.

Conclusion: Vision-language foundation models can effectively improve undersampled MRI reconstruction through semantic-space optimization, demonstrating the value of high-level semantic priors in medical imaging reconstruction.

Abstract: Purpose: To investigate whether a vision-language foundation model can enhance undersampled MRI reconstruction by providing high-level contextual information beyond conventional priors. Methods: We proposed a semantic distribution-guided reconstruction framework that uses a pre-trained vision-language foundation model to encode both the reconstructed image and auxiliary information into high-level semantic features. A contrastive objective aligns the reconstructed representation with the target semantic distribution, ensuring consistency with high-level perceptual cues. The proposed objective works with various deep learning-based reconstruction methods and can flexibly incorporate semantic priors from multimodal sources. To test the effectiveness of these semantic priors, we evaluated reconstruction results guided by priors derived from either image-only or image-language auxiliary information. Results: Experiments on knee and brain datasets demonstrate that semantic priors from images preserve fine anatomical structures and achieve superior perceptual quality, as reflected in lower LPIPS values, higher Tenengrad scores, and improved scores in the reader study, compared with conventional regularization. The image-language information further expands the semantic distribution and enables high-level control over reconstruction attributes. Across all evaluations, the contrastive objective consistently guided the reconstructed features toward the desired semantic distributions while maintaining data fidelity, demonstrating the effectiveness of the proposed optimization framework. Conclusion: The study highlights that vision-language foundation models can improve undersampled MRI reconstruction through semantic-space optimization.
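The abstract describes a contrastive objective that pulls the reconstruction's semantic features toward a target distribution, without giving its exact form. One widely used contrastive loss is InfoNCE, sketched below on plain feature vectors: the anchor (standing in for the reconstructed image's embedding) should be more similar to a positive (a target semantic feature) than to negatives. Treating the paper's objective as InfoNCE, the temperature value, and the function names are all assumptions for illustration; input vectors are assumed to be non-zero.

```python
import math


def cosine(u, v):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(anchor, positive, negatives, temperature=0.1):
    """InfoNCE loss: -log softmax of the positive among all candidates."""
    logits = [cosine(anchor, positive) / temperature]
    logits += [cosine(anchor, n) / temperature for n in negatives]
    # Log-sum-exp with max subtraction for numerical stability.
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_sum - logits[0]


if __name__ == "__main__":
    target = [1.0, 0.0]
    others = [[0.0, 1.0], [-1.0, 0.0]]
    print(info_nce([1.0, 0.0], target, others))  # anchor aligned with target
    print(info_nce([0.0, 1.0], target, others))  # anchor aligned with a negative
```

In a reconstruction setting, a term like this would be added to the usual data-fidelity loss so that the semantic guidance cannot override consistency with the measured k-space data.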

[86] Navigating Gigapixel Pathology Images with Large Multimodal Models

Thomas A. Buckley, Kian R. Weihrauch, Katherine Latham, Andrew Z. Zhou, Padmini A. Manrai, Arjun K. Manrai

Main category: cs.CV

TL;DR: GIANT framework enables LMMs to iteratively navigate whole-slide images like pathologists, achieving superior performance on medical image interpretation tasks compared to conventional methods.

Details

Motivation: General-purpose large multimodal models have shown poor performance in medical image interpretation, particularly with gigapixel pathology images, due to previous approaches using low-resolution thumbnails or random patches.

Method: Introduced GIANT framework that allows LMMs to iteratively navigate whole-slide images like pathologists, and created MultiPathQA benchmark with 934 WSI-level questions across five clinically-relevant tasks.

Result: GIANT substantially outperforms conventional patch- and thumbnail-based baselines, achieving 62.5% accuracy on pathologist-authored questions, surpassing specialist pathology models like TITAN (43.8%) and SlideChat (37.5%).

Conclusion: The study reveals strengths and limitations of current foundation models and provides groundwork for future development of LMMs for expert reasoning in pathology.

Abstract: Despite being widely used to support clinical care, general-purpose large multimodal models (LMMs) have generally shown poor or inconclusive performance in medical image interpretation, particularly in pathology, where gigapixel images are used. However, prior studies have used either low-resolution thumbnails or random patches, which likely underestimated model performance. Here, we ask whether LMMs can be adapted to reason coherently and accurately in the evaluation of such images. In this study, we introduce Gigapixel Image Agent for Navigating Tissue (GIANT), the first framework that allows LMMs to iteratively navigate whole-slide images (WSIs) like a pathologist. Accompanying GIANT, we release MultiPathQA, a new benchmark, which comprises 934 WSI-level questions, encompassing five clinically-relevant tasks ranging from cancer diagnosis to open-ended reasoning. MultiPathQA also includes 128 questions, authored by two professional pathologists, requiring direct slide interpretation. Using MultiPathQA, we show that our simple agentic system substantially outperforms conventional patch- and thumbnail-based baselines, approaching or surpassing the performance of specialized models trained on millions of images. For example, on pathologist-authored questions, GPT-5 with GIANT achieves 62.5% accuracy, outperforming specialist pathology models such as TITAN (43.8%) and SlideChat (37.5%). Our findings reveal the strengths and limitations of current foundation models and ground future development of LMMs for expert reasoning in pathology.

[87] CodeV: Code with Images for Faithful Visual Reasoning via Tool-Aware Policy Optimization

Xinhai Hou, Shaoyuan Xu, Manan Biyani, Mayan Li, Jia Liu, Todd C. Hollon, Bryan Wang

Main category: cs.CV

TL;DR: CodeV is a code-based visual agent trained with Tool-Aware Policy Optimization (TAPO) to address unfaithful visual reasoning in vision-language models, achieving both high accuracy and improved faithful tool-use rates.

Details

Motivation: Current vision-language models often achieve high final-answer accuracy but exhibit unfaithful visual reasoning - using tools on irrelevant regions or ignoring tool outputs while still guessing correctly. This reveals a trustworthiness gap in agentic visual reasoning systems.

Method: Proposes CodeV, a code-based visual agent trained with Tool-Aware Policy Optimization (TAPO) - a process-level RL framework that uses dense rewards defined directly on visual tool inputs and outputs rather than chain-of-thought tokens. CodeV represents visual tools as executable Python code.

Result: CodeV achieves competitive or superior accuracy while substantially increasing faithful tool-use rates on visual search benchmarks. It also attains strong performance on multimodal reasoning and math benchmarks.

Conclusion: Explicitly supervising intermediate tool behavior through TAPO is crucial for building trustworthy, agentic visual reasoning systems that demonstrate both high accuracy and faithful tool use.

Abstract: Agentic vision-language models are increasingly trained to “think with images” by calling image operations. However, we show that high final-answer accuracy often hides unfaithful visual reasoning: models may invoke tools on irrelevant regions or ignore tool outputs entirely, yet still guess the correct answer. In this work, we first propose a faithfulness evaluation protocol that measures whether intermediate visual tool outputs (e.g., crops) actually contain the queried evidence. This reveals that recent visual agents achieve high final-answer accuracy but exhibit low rates of faithful tool-use on visual search benchmarks. We then introduce CodeV, a code-based visual agent trained with Tool-Aware Policy Optimization (TAPO). TAPO is a process-level RL framework that augments GRPO with dense rewards defined directly on visual tool inputs and outputs, rather than on chain-of-thought tokens, making supervision easier to verify and less susceptible to reward hacking. CodeV represents visual tools as executable Python code, and TAPO assigns step-wise rewards based solely on the question and tool output, encouraging both necessary and evidence-consistent tool use. In a two-stage SFT+RL pipeline, CodeV achieves competitive or superior accuracy while substantially increasing faithful tool-use rates on related visual search benchmarks. Beyond visual search, CodeV attains strong performance on a range of multimodal reasoning and math benchmarks, suggesting that explicitly supervising intermediate tool behavior is crucial for building trustworthy, agentic visual reasoning systems.

[88] OncoVision: Integrating Mammography and Clinical Data through Attention-Driven Multimodal AI for Enhanced Breast Cancer Diagnosis

Istiak Ahmed, Galib Ahmed, K. Shahriar Sanjid, Md. Tanzim Hossain, Md. Nishan Khan, Md. Misbah Khan, Md. Arifur Rahman, Sheikh Anisul Haque, Sharmin Akhtar Rupa, Mohammed Mejbahuddin Mia, Mahmud Hasan Mostofa Kamal, Md. Mostafa Kamal Sarker, M. Monir Uddin

Main category: cs.CV

TL;DR: OncoVision is a multimodal AI pipeline that combines mammography images and clinical data for breast cancer diagnosis, featuring ROI segmentation, clinical feature prediction, and late-fusion strategies to improve diagnostic accuracy and reduce variability.

Details

Motivation: To improve breast cancer diagnosis by leveraging complementary multimodal data (mammography images and clinical data) to enhance diagnostic precision, reduce inter-observer variability, and provide scalable screening solutions for underprivileged areas.

Method: Uses an attention-based encoder-decoder backbone to segment four ROIs (masses, calcifications, axillary findings, breast tissues) and predict ten clinical features. Implements two late-fusion strategies to combine imaging and clinical insights. Deployed as a secure web application with structured reports, dual-confidence scoring, and attention-weighted visualizations.

Result: Achieves state-of-the-art accuracy in ROI segmentation and robust prediction of clinical features. Late fusion strategies improve diagnostic precision and reduce inter-observer variability. The system produces structured reports with confidence scoring for real-time diagnostic support.

Conclusion: OncoVision sets a new standard for AI-based mammography by combining accurate segmentation with clinical intuition, offering a scalable and equitable solution for early breast cancer detection that can be deployed globally, particularly in underserved regions like rural South Asia.

Abstract: OncoVision is a multimodal AI pipeline that combines mammography images and clinical data for better breast cancer diagnosis. Employing an attention-based encoder-decoder backbone, it jointly segments four ROIs - masses, calcifications, axillary findings, and breast tissues - with state-of-the-art accuracy and robustly predicts ten structured clinical features: mass morphology, calcification type, ACR breast density, and BI-RADS categories. To fuse imaging and clinical insights, we developed two late-fusion strategies. By utilizing complementary multimodal data, late fusion strategies improve diagnostic precision and reduce inter-observer variability. Operationalized as a secure, user-friendly web application, OncoVision produces structured reports with dual-confidence scoring and attention-weighted visualizations for real-time diagnostic support to improve clinician trust and facilitate medical teaching. It can be easily incorporated into the clinic, making screening available in underprivileged areas around the world, such as rural South Asia. Combining accurate segmentation with clinical intuition, OncoVision raises the bar for AI-based mammography, offering a scalable and equitable solution to detect breast cancer at an earlier stage and enhancing treatment through timely interventions.

[89] INTERLACE: Interleaved Layer Pruning and Efficient Adaptation in Large Vision-Language Models

Parsa Madinei, Ryan Solgi, Ziqi Wen, Jonathan Skaza, Miguel Eckstein, Ramtin Pedarsani

Main category: cs.CV

TL;DR: INTERLACE is a novel framework that prunes redundant layers in Vision-Language Models (VLMs) while maintaining performance through sample-efficient finetuning.

Details

Motivation: Existing layer pruning methods cause significant performance drops when applied to VLMs, highlighting the need for a more effective approach to reduce model redundancy without sacrificing performance.

Method: Analyzes triplets of consecutive layers to identify local redundancy, removes the most redundant of the first two layers, finetunes the remaining layer to compensate, and freezes the third layer as a stable anchor during finetuning.

Result: Retains 88.9% of average performance after dropping 25% of the network, finetuning only a subset of layers on just 1% of the FineVision dataset for one epoch, which the authors report as state-of-the-art.

Conclusion: The interleaved finetune-freeze design enables rapid convergence with minimal data after pruning, providing an effective solution for VLM layer pruning.

Abstract: We introduce INTERLACE, a novel framework that prunes redundant layers in VLMs while maintaining performance through sample-efficient finetuning. Existing layer pruning methods lead to a significant performance drop when applied to VLMs. Instead, we analyze triplets of consecutive layers to identify local redundancy, remove the most redundant of the first two layers, finetune the remaining layer to compensate for the lost capacity, and freeze the third layer to serve as a stable anchor during finetuning. We found that this interleaved finetune-freeze design enables rapid convergence with minimal data after pruning. By finetuning only a subset of layers on just 1% of the FineVision dataset for one epoch, INTERLACE achieves 88.9% average performance retention after dropping 25% of the network, achieving SOTA performance. Our code is available at: https://github.com/pmadinei/Interlace.git
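The triplet rule described in the abstract can be sketched as a role assignment over layers. This is only an illustration, not the authors' implementation: the `redundancy` scores, the choice of non-overlapping triplets, and the function name `plan_triplets` are all assumptions, since the abstract does not say how redundancy is measured or how triplets are selected.

```python
def plan_triplets(redundancy):
    """Assign a role to every layer by walking non-overlapping triplets.

    redundancy[i] is a hypothetical score; higher means layer i is
    considered more redundant. Layers outside a full triplet are kept as-is.
    """
    roles = ["keep"] * len(redundancy)
    for start in range(0, len(redundancy) - 2, 3):
        first, second, third = start, start + 1, start + 2
        # Drop the more redundant of the first two layers in the triplet.
        if redundancy[first] >= redundancy[second]:
            drop, tune = first, second
        else:
            drop, tune = second, first
        roles[drop] = "drop"
        roles[tune] = "finetune"   # compensates for the removed layer
        roles[third] = "freeze"    # stable anchor during finetuning
    return roles


example_scores = [0.9, 0.2, 0.5, 0.1, 0.8, 0.3, 0.4]
example_roles = plan_triplets(example_scores)
```

With these made-up scores, layers 0 and 4 are dropped, layers 1 and 3 are finetuned, layers 2 and 5 are frozen, and the trailing layer 6 is left untouched.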

[90] IndEgo: A Dataset of Industrial Scenarios and Collaborative Work for Egocentric Assistants

Vivek Chavan, Yasmina Imgrund, Tung Dao, Sanwantri Bai, Bosong Wang, Ze Lu, Oliver Heimann, Jörg Krüger

Main category: cs.CV

TL;DR: IndEgo is a multimodal dataset for industrial tasks with 3,460 egocentric and 1,092 exocentric recordings, featuring collaborative work and rich annotations for task understanding, mistake detection, and reasoning.

Details

Motivation: To address the lack of comprehensive multimodal datasets for industrial tasks, particularly collaborative work scenarios that involve both cognitive and physical aspects.

Method: Collected 197 hours of egocentric and 97 hours of exocentric recordings across various industrial tasks, enriched with multimodal data including eye gaze, narration, sound, motion, and detailed annotations.

Result: Created a challenging dataset with baseline evaluations showing state-of-the-art multimodal models struggle with mistake detection, question answering, and collaborative task understanding.

Conclusion: IndEgo provides a valuable resource for advancing multimodal AI in industrial settings, particularly for understanding complex collaborative tasks and detecting errors in procedural workflows.

Abstract: We introduce IndEgo, a multimodal egocentric and exocentric dataset addressing common industrial tasks, including assembly/disassembly, logistics and organisation, inspection and repair, woodworking, and others. The dataset contains 3,460 egocentric recordings (approximately 197 hours), along with 1,092 exocentric recordings (approximately 97 hours). A key focus of the dataset is collaborative work, where two workers jointly perform cognitively and physically intensive tasks. The egocentric recordings include rich multimodal data and added context via eye gaze, narration, sound, motion, and others. We provide detailed annotations (actions, summaries, mistake annotations, narrations), metadata, processed outputs (eye gaze, hand pose, semi-dense point cloud), and benchmarks on procedural and non-procedural task understanding, Mistake Detection, and reasoning-based Question Answering. Baseline evaluations for Mistake Detection, Question Answering and collaborative task understanding show that the dataset presents a challenge for the state-of-the-art multimodal models. Our dataset is available at: https://huggingface.co/datasets/FraunhoferIPK/IndEgo

[91] CountXplain: Interpretable Cell Counting with Prototype-Based Density Map Estimation

Abdurahman Ali Mohammed, Wallapak Tavanapong, Catherine Fonder, Donald S. Sakaguchi

Main category: cs.CV

TL;DR: A prototype-based method for interpretable cell counting via density map estimation that learns representative visual patterns for cells and background artifacts, validated by biologists.

Details

Motivation: Deep learning models for cell counting lack interpretability, which is crucial for clinical applications where trust and transparency are essential.

Method: Integrates a prototype layer into density estimation networks to learn visual patterns, with interpretations highlighting image regions similar to each prototype.

Result: Achieves interpretability without compromising counting effectiveness on two public datasets, with biologists confirming the relevance of learned visual patterns.

Conclusion: Provides a transparent and reliable tool for cell counting that can increase trust and accelerate adoption of deep learning in biomedical applications.

Abstract: Cell counting in biomedical imaging is pivotal for various clinical applications, yet the interpretability of deep learning models in this domain remains a significant challenge. We propose a novel prototype-based method for interpretable cell counting via density map estimation. Our approach integrates a prototype layer into the density estimation network, enabling the model to learn representative visual patterns for both cells and background artifacts. The learned prototypes were evaluated through a survey of biologists, who confirmed the relevance of the visual patterns identified, further validating the interpretability of the model. By generating interpretations that highlight regions in the input image most similar to each prototype, our method offers a clear understanding of how the model identifies and counts cells. Extensive experiments on two public datasets demonstrate that our method achieves interpretability without compromising counting effectiveness. This work provides researchers and clinicians with a transparent and reliable tool for cell counting, potentially increasing trust and accelerating the adoption of deep learning in critical biomedical applications. Code is available at https://github.com/NRT-D4/CountXplain.

[92] RADSeg: Unleashing Parameter and Compute Efficient Zero-Shot Open-Vocabulary Segmentation Using Agglomerative Models

Omar Alama, Darshil Jariwala, Avigyan Bhattacharya, Seungchan Kim, Wenshan Wang, Sebastian Scherer

Main category: cs.CV

TL;DR: RADSeg uses RADIO foundation model for zero-shot open-vocabulary semantic segmentation, achieving state-of-the-art mIoU with 3.95x speedup and 2.5x fewer parameters compared to previous methods.

Details

Motivation: Existing OVSS approaches either have limited generalization due to scarce training data or use multiple models that are computationally expensive. RADIO foundation model offers an overlooked opportunity for efficient zero-shot segmentation.

Method: Enhanced RADIO with self-correlating recursive attention, self-correlating global aggregation, and efficient mask refinement to improve zero-shot OVSS performance.

Result: Achieves 6-30% mIoU improvement in base ViT class, 3.95x faster inference, and uses 2.5x fewer parameters. RADSeg-base (105M) outperforms previous huge models (850-1350M) in mIoU.

Conclusion: RADSeg achieves state-of-the-art accuracy for zero-shot open-vocabulary semantic segmentation with substantially lower computational and memory costs than previous approaches.

Abstract: Open-vocabulary semantic segmentation (OVSS) underpins many vision and robotics tasks that require generalizable semantic understanding. Existing approaches either rely on limited segmentation training data, which hinders generalization, or apply zero-shot heuristics to vision-language models (e.g., CLIP), while the most competitive approaches combine multiple models to improve performance at the cost of high computational and memory demands. In this work, we leverage an overlooked agglomerative vision foundation model, RADIO, to improve zero-shot OVSS along three key axes simultaneously: mIoU, latency, and parameter efficiency. We present the first comprehensive study of RADIO for zero-shot OVSS and enhance its performance through self-correlating recursive attention, self-correlating global aggregation, and computationally efficient mask refinement. Our approach, RADSeg, achieves 6-30% mIoU improvement in the base ViT class while being 3.95x faster and using 2.5x fewer parameters. Surprisingly, RADSeg-base (105M) outperforms previous combinations of huge vision models (850-1350M) in mIoU, achieving state-of-the-art accuracy with substantially lower computational and memory cost.

[93] Rethinking Vision Transformer Depth via Structural Reparameterization

Chengwei Zhou, Vipin Chaudhary, Gourav Datta

Main category: cs.CV

TL;DR: Proposes branch-based structural reparameterization to reduce Vision Transformer layers while maintaining accuracy, achieving up to 37% inference speedup by consolidating parallel branches during training.

Details

Motivation: Address the computational overhead of deep Vision Transformers by exploring layer reduction while maintaining representational capacity, moving beyond existing algorithmic optimizations like token pruning.

Method: Branch-based structural reparameterization technique that uses parallel branches within transformer blocks during training, then consolidates them into single-path models for inference by merging branches at nonlinear component entry points.

Result: Successfully reduced ViT-Tiny from 12 layers to 6, 4, or 3 layers while maintaining ImageNet-1K classification accuracy, achieving up to 37% inference speedup on mobile CPUs.

Conclusion: Conventional wisdom favoring extremely deep transformer stacks may be unnecessarily restrictive, opening new opportunities for constructing efficient vision transformers through structural reparameterization.

Abstract: The computational overhead of Vision Transformers in practice stems fundamentally from their deep architectures, yet existing acceleration strategies have primarily targeted algorithmic-level optimizations such as token pruning and attention speedup. This leaves an underexplored research question: can we reduce the number of stacked transformer layers while maintaining comparable representational capacity? To answer this, we propose a branch-based structural reparameterization technique that operates during the training phase. Our approach leverages parallel branches within transformer blocks that can be systematically consolidated into streamlined single-path models suitable for inference deployment. The consolidation mechanism works by gradually merging branches at the entry points of nonlinear components, enabling both feed-forward networks (FFN) and multi-head self-attention (MHSA) modules to undergo exact mathematical reparameterization without inducing approximation errors at test time. When applied to ViT-Tiny, the framework successfully reduces the original 12-layer architecture to 6, 4, or as few as 3 layers while maintaining classification accuracy on ImageNet-1K. The resulting compressed models achieve inference speedups of up to 37% on mobile CPU platforms. Our findings suggest that the conventional wisdom favoring extremely deep transformer stacks may be unnecessarily restrictive, and point toward new opportunities for constructing efficient vision transformers.
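The exactness claim rests on a standard identity: parallel linear branches applied to the same input can be folded into one linear map by summing their weights and biases. The sketch below shows only that identity in plain Python. It does not reproduce the paper's gradual merging schedule or its handling of FFN and MHSA modules, and the helper names `linear` and `merge_branches` are hypothetical.

```python
def linear(weight, bias, x):
    """Apply y = W x + b for a weight given as a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weight, bias)]


def merge_branches(branches):
    """Fold parallel linear branches [(W, b), ...] into a single (W, b)."""
    weights = [w for w, _ in branches]
    biases = [b for _, b in branches]
    # Sum corresponding entries across branches.
    merged_w = [[sum(vals) for vals in zip(*rows)] for rows in zip(*weights)]
    merged_b = [sum(vals) for vals in zip(*biases)]
    return merged_w, merged_b


branch_a = ([[1.0, 2.0], [0.0, 1.0]], [0.5, -1.0])
branch_b = ([[0.5, 0.0], [1.0, 1.0]], [0.0, 2.0])
x = [2.0, 3.0]

# Training-time view: two branches evaluated separately, then summed.
two_path = [a + b for a, b in zip(linear(*branch_a, x), linear(*branch_b, x))]
# Inference-time view: one merged branch.
single_path = linear(*merge_branches([branch_a, branch_b]), x)
```

Both views give the same output for this input, which is why merging before a nonlinearity introduces no approximation error in the linear case.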

[94] Maritime Small Object Detection from UAVs using Deep Learning with Altitude-Aware Dynamic Tiling

Sakib Ahmed, Oscar Pizarro

Main category: cs.CV

TL;DR: Altitude-aware dynamic tiling method improves small object detection in UAV-based search and rescue missions by scaling and adaptively subdividing images based on altitude, achieving a 38% mAP improvement over a baseline and more than double the inference speed of static tiling.

Details

Motivation: Small objects are difficult to detect from high UAV altitudes due to low object-to-background pixel ratios in search and rescue missions.

Method: Altitude-aware dynamic tiling that scales and adaptively subdivides images into tiles using altitude-dependent scaling with adaptive tiling factor, integrated with YOLOv5 and SAHI framework.

Result: 38% improvement in Mean Average Precision for small objects compared to a baseline, and more than double the inference speed of static tiling.

Conclusion: The approach enables more efficient and accurate UAV-based search and rescue operations under diverse conditions.

Abstract: Unmanned Aerial Vehicles (UAVs) are crucial in Search and Rescue (SAR) missions due to their ability to monitor vast maritime areas. However, small objects often remain difficult to detect from high altitudes due to low object-to-background pixel ratios. We propose an altitude-aware dynamic tiling method that scales and adaptively subdivides the image into tiles for enhanced small object detection. By integrating altitude-dependent scaling with an adaptive tiling factor, we reduce unnecessary computation while maintaining detection performance. Tested on the SeaDronesSee dataset [1] with YOLOv5 [2] and Slicing Aided Hyper Inference (SAHI) framework [3], our approach improves Mean Average Precision (mAP) for small objects by 38% compared to a baseline and achieves more than double the inference speed compared to static tiling. This approach enables more efficient and accurate UAV-based SAR operations under diverse conditions.
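As a rough illustration of altitude-dependent tiling, the sketch below picks a grid size from the flight altitude and splits the frame accordingly. The abstract does not give the actual scaling rule, so the linear mapping, the `reference_altitude_m` and `max_factor` parameters, and both function names are assumptions. Tile overlap, which slicing frameworks commonly use, is left out for brevity.

```python
import math


def tiling_factor(altitude_m, reference_altitude_m=20.0, max_factor=4):
    """Hypothetical rule: one extra grid division per reference altitude step."""
    factor = math.ceil(altitude_m / reference_altitude_m)
    return max(1, min(max_factor, factor))


def tile_boxes(width, height, factor):
    """Split an image into a factor x factor grid of (x0, y0, x1, y1) boxes."""
    boxes = []
    for row in range(factor):
        for col in range(factor):
            x0 = col * width // factor
            y0 = row * height // factor
            x1 = (col + 1) * width // factor
            y1 = (row + 1) * height // factor
            boxes.append((x0, y0, x1, y1))
    return boxes


low_altitude_tiles = tile_boxes(1920, 1080, tiling_factor(10.0))
high_altitude_tiles = tile_boxes(1920, 1080, tiling_factor(50.0))
```

Under these assumed parameters a low-altitude frame is processed as a single tile, while a 50 m frame is split into a 3 x 3 grid, so the detector only pays for extra tiles when objects are expected to be small.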

[95] Efficient Transferable Optimal Transport via Min-Sliced Transport Plans

Xinran Liu, Elaheh Akbari, Rocio Diaz Martin, Navid NaderiAlizadeh, Soheil Kolouri

Main category: cs.CV

TL;DR: The paper studies the transferability of optimized slicers in min-Sliced Transport Plan (min-STP) framework, showing they can be effectively transferred to new distribution pairs under slight perturbations, with theoretical guarantees and empirical validation.

Details

Motivation: To address the computational cost limitations of Optimal Transport (OT) and investigate whether learned optimal slicers can transfer to new distribution pairs under distributional shift, which is crucial for evolving data and repeated OT computations.

Method: Introduces min-Sliced Transport Plan (min-STP) framework with theoretical analysis of slicer transferability, develops a minibatch formulation for improved scalability, and provides statistical guarantees on accuracy.

Result: Theoretically shows optimized slicers remain close under slight data perturbations, enabling efficient transfer. Empirically demonstrates strong one-shot matching performance and facilitates amortized training for point cloud alignment and flow-based generative modeling.

Conclusion: Transferable min-STP provides an efficient and scalable approach for OT applications, with theoretical guarantees on slicer transferability and practical effectiveness in real-world tasks.

Abstract: Optimal Transport (OT) offers a powerful framework for finding correspondences between distributions and addressing matching and alignment problems in various areas of computer vision, including shape analysis, image generation, and multimodal tasks. The computation cost of OT, however, hinders its scalability. Slice-based transport plans have recently shown promise for reducing the computational cost by leveraging the closed-form solutions of 1D OT problems. These methods optimize a one-dimensional projection (slice) to obtain a conditional transport plan that minimizes the transport cost in the ambient space. While efficient, these methods leave open the question of whether learned optimal slicers can transfer to new distribution pairs under distributional shift. Understanding this transferability is crucial in settings with evolving data or repeated OT computations across closely related distributions. In this paper, we study the min-Sliced Transport Plan (min-STP) framework and investigate the transferability of optimized slicers: can a slicer trained on one distribution pair yield effective transport plans for new, unseen pairs? Theoretically, we show that optimized slicers remain close under slight perturbations of the data distributions, enabling efficient transfer across related tasks. To further improve scalability, we introduce a minibatch formulation of min-STP and provide statistical guarantees on its accuracy. Empirically, we demonstrate that the transferable min-STP achieves strong one-shot matching performance and facilitates amortized training for point cloud alignment and flow-based generative modeling.
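The idea of a sliced transport plan can be sketched for two equal-size point sets with uniform weights: project both sets onto a direction, match points by rank along that direction, and score the resulting plan by its cost in the original space. This is a simplified illustration, not the paper's method. The candidate search in `best_slicer` stands in for the actual slicer optimization, and all function names are hypothetical.

```python
def project(points, theta):
    """Project each point onto the direction theta (a 1D slice)."""
    return [sum(p * t for p, t in zip(point, theta)) for point in points]


def sliced_plan(source, target, theta):
    """Match points by rank along the slice; 1D OT is solved by sorting."""
    src_proj = project(source, theta)
    tgt_proj = project(target, theta)
    src_order = sorted(range(len(source)), key=src_proj.__getitem__)
    tgt_order = sorted(range(len(target)), key=tgt_proj.__getitem__)
    return list(zip(src_order, tgt_order))


def plan_cost(source, target, plan):
    """Mean squared Euclidean cost of the plan in the ambient space."""
    total = 0.0
    for i, j in plan:
        total += sum((a - b) ** 2 for a, b in zip(source[i], target[j]))
    return total / len(plan)


def best_slicer(source, target, candidates):
    """Pick the candidate slice whose induced plan has the lowest cost."""
    return min(candidates,
               key=lambda th: plan_cost(source, target, sliced_plan(source, target, th)))


source = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
target = [(2.0, 1.0), (0.0, 1.0), (1.0, 1.0)]
theta = best_slicer(source, target, [(1.0, 0.0), (0.0, 1.0)])

# Transfer: reuse the same slicer on a slightly perturbed target.
shifted_target = [(2.1, 1.0), (0.1, 1.0), (1.1, 1.0)]
transfer_cost = plan_cost(source, shifted_target,
                          sliced_plan(source, shifted_target, theta))
```

In this toy setup the horizontal slice recovers the intended matching, and reusing it on the shifted target changes the cost only slightly, which is the kind of behavior the transferability analysis concerns.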

[96] Leveraging Foundation Models for Histological Grading in Cutaneous Squamous Cell Carcinoma using PathFMTools

Abdul Rahman Diab, Emily E. Karn, Renchin Wu, Emily S. Ruiz, William Lotter

Main category: cs.CV

TL;DR: PathFMTools is a lightweight Python package that enables efficient execution, analysis, and visualization of pathology foundation models, tested on CONCH and MUSK models for cutaneous squamous cell carcinoma grading.

Details

Motivation: Adapting computational pathology foundation models to specific clinical tasks is challenging due to WSI processing complexity, feature opacity, and diverse adaptation strategies.

Method: Developed PathFMTools package to interface with CONCH and MUSK foundation models, benchmarked multiple adaptation strategies on 440 cSCC H&E WSIs for histological grading.

Result: Demonstrated trade-offs across prediction approaches and validated that foundation model embeddings can effectively train small specialist models.

Conclusion: Pathology foundation models show promise for real-world clinical applications, with PathFMTools enabling efficient analysis and validation.

Abstract: Despite the promise of computational pathology foundation models, adapting them to specific clinical tasks remains challenging due to the complexity of whole-slide image (WSI) processing, the opacity of learned features, and the wide range of potential adaptation strategies. To address these challenges, we introduce PathFMTools, a lightweight, extensible Python package that enables efficient execution, analysis, and visualization of pathology foundation models. We use this tool to interface with and evaluate two state-of-the-art vision-language foundation models, CONCH and MUSK, on the task of histological grading in cutaneous squamous cell carcinoma (cSCC), a critical criterion that informs cSCC staging and patient management. Using a cohort of 440 cSCC H&E WSIs, we benchmark multiple adaptation strategies, demonstrating trade-offs across prediction approaches and validating the potential of using foundation model embeddings to train small specialist models. These findings underscore the promise of pathology foundation models for real-world clinical applications, with PathFMTools enabling efficient analysis and validation.

[97] What You See is (Usually) What You Get: Multimodal Prototype Networks that Abstain from Expensive Modalities

Muchang Bahng, Charlie Berens, Jon Donnelly, Eric Chen, Chaofan Chen, Cynthia Rudin

Main category: cs.CV

TL;DR: Extends prototype networks to multimodal species detection, enabling interpretable predictions while reducing need for expensive genetic data through intelligent modality allocation.

Details

Motivation: Address two key problems in multimodal species detection: lack of interpretability in black-box neural networks, and high cost/invasiveness of genetic data collection.

Method: Extends prototype networks (ProtoPNets) to multimodal setting, ensembles prototypes from each modality with weighted contributions, and introduces methods to identify when genetic data isn’t needed for confident predictions.

Result: Approach intelligently allocates expensive genetic data for fine-grained distinctions while using abundant image data for clearer visual classifications, achieving comparable accuracy to models using both modalities consistently.

Conclusion: The proposed multimodal prototype network framework provides interpretable species detection while reducing reliance on costly genetic data through smart modality allocation.

Abstract: Species detection is important for monitoring the health of ecosystems and identifying invasive species, serving a crucial role in guiding conservation efforts. Multimodal neural networks have seen increasing use for identifying species to help automate this task, but they have two major drawbacks. First, their black-box nature prevents interpretation of their decision-making process. Second, collecting genetic data is often expensive and requires invasive procedures, often requiring researchers to capture or kill the target specimen. We address both of these problems by extending prototype networks (ProtoPNets), which are a popular and interpretable alternative to traditional neural networks, to the multimodal, cost-aware setting. We ensemble prototypes from each modality, using an associated weight to determine how much a given prediction relies on each modality. We further introduce methods to identify cases for which we do not need the expensive genetic information to make confident predictions. We demonstrate that our approach can intelligently allocate expensive genetic data for fine-grained distinctions while using abundant image data for clearer visual classifications and achieving comparable accuracy to models that consistently use both modalities.
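A cost-aware prediction loop of this kind might look like the sketch below: predict from the image alone when it is confident, and request the genetic modality only otherwise. The confidence threshold, the fixed `image_weight`, and the function name are assumptions for illustration. The abstract does not specify how confident cases are identified or how modality weights are learned.

```python
def predict_with_abstention(image_probs, fetch_genetic_probs,
                            image_weight=0.4, threshold=0.9):
    """Return (predicted_class, used_genetic_data).

    fetch_genetic_probs is a callable so the expensive modality is only
    collected when the image-only prediction is not confident enough.
    """
    top = max(range(len(image_probs)), key=image_probs.__getitem__)
    if image_probs[top] >= threshold:
        return top, False  # image alone is confident; skip genetic data

    genetic_probs = fetch_genetic_probs()
    # Weighted ensemble of the two modalities (hypothetical fixed weight).
    combined = [image_weight * p + (1.0 - image_weight) * g
                for p, g in zip(image_probs, genetic_probs)]
    return max(range(len(combined)), key=combined.__getitem__), True


genetic_requests = []


def fake_genetic_lookup():
    """Stand-in for an expensive genetic assay; records each request."""
    genetic_requests.append(1)
    return [0.1, 0.8, 0.1]


easy_case = predict_with_abstention([0.95, 0.03, 0.02], fake_genetic_lookup)
hard_case = predict_with_abstention([0.50, 0.45, 0.05], fake_genetic_lookup)
```

Here the visually clear specimen is classified without touching the genetic lookup, while the ambiguous one triggers it and the combined scores change the predicted class.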

[98] Vision–Language Enhanced Foundation Model for Semi-supervised Medical Image Segmentation

Jiaqi Guo, Mingzhen Li, Hanyu Su, Santiago López, Lexiaozi Fan, Daniel Kim, Aggelos Katsaggelos

Main category: cs.CV

TL;DR: VESSA integrates vision-language models into semi-supervised medical image segmentation, using a two-stage approach with visual feature matching and dynamic student interaction to improve accuracy with limited annotations.

Details

Motivation: To reduce reliance on extensive expert annotations in medical image segmentation by leveraging vision-language models' generalization capabilities within semi-supervised learning frameworks.

Method: Two-stage approach: Stage 1 trains VESSA as reference-guided segmentation assistant using template bank; Stage 2 integrates VESSA into SSL framework for dynamic interaction with student model, using student predictions as prompts to generate higher-quality pseudo-labels.

Result: Significantly enhances segmentation accuracy across multiple datasets and domains, outperforming state-of-the-art baselines under extremely limited annotation conditions.

Conclusion: VESSA-augmented SSL effectively incorporates foundation-level visual-semantic understanding to improve medical image segmentation with minimal labeled data.

Abstract: Semi-supervised learning (SSL) has emerged as an effective paradigm for medical image segmentation, reducing the reliance on extensive expert annotations. Meanwhile, vision-language models (VLMs) have demonstrated strong generalization and few-shot capabilities across diverse visual domains. In this work, we integrate VLM-based segmentation into semi-supervised medical image segmentation by introducing a Vision-Language Enhanced Semi-supervised Segmentation Assistant (VESSA) that incorporates foundation-level visual-semantic understanding into SSL frameworks. Our approach consists of two stages. In Stage 1, the VLM-enhanced segmentation foundation model VESSA is trained as a reference-guided segmentation assistant using a template bank containing gold-standard exemplars, simulating learning from limited labeled data. Given an input-template pair, VESSA performs visual feature matching to extract representative semantic and spatial cues from exemplar segmentations, generating structured prompts for a SAM2-inspired mask decoder to produce segmentation masks. In Stage 2, VESSA is integrated into a state-of-the-art SSL framework, enabling dynamic interaction with the student model: as student predictions become more refined, they are fed back to VESSA as prompts, allowing it to generate higher-quality pseudo-labels and stronger guidance. Extensive experiments across multiple segmentation datasets and domains show that VESSA-augmented SSL significantly enhances segmentation accuracy, outperforming state-of-the-art baselines under extremely limited annotation conditions.

[99] A Storage-Efficient Feature for 3D Concrete Defect Segmentation to Replace Normal Vector

Linxin Hua, Jianghua Deng, Ye Lu

Main category: cs.CV

TL;DR: Proposes a new ‘relative angle’ feature for point cloud damage reconstruction that provides equivalent directionality information to normal vectors but with significant data compression benefits.

Details

Motivation: Point cloud reconstruction is effective for damage assessment but constrained by high 3D data volume. Need to reduce computational and storage requirements while maintaining performance.

Method: Introduces relative angle feature - angle between point’s normal vector and average normal vector of parent point cloud. Uses entropy-based feature evaluation and trains PointNet++ models with this feature.

Result: Models using relative angles achieved similar performance to normal vector models while providing 27.6% storage reduction and 83% input channel compression.

Conclusion: Relative angle feature enables larger-batch execution on resource-constrained hardware without requiring architectural modifications to models, making point cloud damage reconstruction more practical.

Abstract: Point cloud reconstruction of damage offers an effective alternative to image-based methods vulnerable to background noise, yet its application is constrained by the high volume of 3D data. This study proposes a new feature, relative angle, computed as the angle between the normal vector of a point and the average normal vector of its parent point cloud. This single-dimensional feature provides directionality information equivalent to normal vectors for concrete surface defect characteristics. Through entropy-based feature evaluation, this study demonstrates the ability of relative angle to filter out redundant information in undamaged sections while retaining effective information in damaged sections. By training and testing with PointNet++, models based on the relative angles achieved similar performance to that of models based on normal vectors while delivering 27.6% storage reduction and 83% input channel compression. This novel feature has the potential to enable larger-batch execution on resource-constrained hardware without the necessity of architectural modifications to models.
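
The relative angle definition above can be illustrated with a minimal sketch. It assumes the per-point normals are already estimated and consistently oriented, and that the "average normal" is the normalized mean of those normals; the function name `relative_angles` is hypothetical and the paper's exact computation may differ.

```python
import math


def relative_angles(normals):
    """Return the angle (radians) between each normal and the cloud's mean normal.

    `normals` is a list of 3-component vectors. Assumption: the mean normal is
    the normalized arithmetic mean of the input normals.
    """
    count = len(normals)
    mean = [sum(v[i] for v in normals) / count for i in range(3)]
    mean_len = math.sqrt(sum(c * c for c in mean))
    if mean_len == 0.0:
        raise ValueError("mean normal is zero; normals cancel out")
    mean = [c / mean_len for c in mean]

    angles = []
    for v in normals:
        v_len = math.sqrt(sum(c * c for c in v))
        cos_angle = sum(a * b for a, b in zip(v, mean)) / v_len
        # Clamp to guard against rounding slightly outside [-1, 1].
        angles.append(math.acos(max(-1.0, min(1.0, cos_angle))))
    return angles


# Three normals per point collapse to one scalar per point.
flat_patch = relative_angles([(0.0, 0.0, 1.0)] * 3)
tilted_pair = relative_angles([(0.0, 0.0, 1.0), (1.0, 0.0, 0.0)])
```

On a flat patch every angle is close to zero, which matches the summary's point that undamaged sections carry little information in this feature.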

[100] Lightweight Transformer Framework for Weakly Supervised Semantic Segmentation

Ali Torabi, Sanjog Gaihre, Yaqoob Majeed

Main category: cs.CV

TL;DR: CrispFormer improves weakly supervised semantic segmentation by enhancing the SegFormer decoder with three key changes: boundary supervision, uncertainty-guided refinement, and dynamic multi-scale fusion, resulting in better boundaries and noise resistance.

Details

Motivation: Weakly supervised semantic segmentation (WSSS) struggles with learning dense masks from noisy, under-specified cues like image-level labels, leading to poor boundary quality and sensitivity to label noise.

Method: Three decoder modifications: (1) boundary branch with edge head and boundary-aware loss, (2) uncertainty-guided refiner using aleatoric uncertainty to weight losses and correct segmentation logits, (3) dynamic multi-scale fusion with spatial softmax gating over multi-resolution features.

Result: Consistent improvements in boundary F-score, small-object recall, and mIoU over SegFormer baselines while adding minimal computational overhead. Better preservation of crisp boundaries and resistance to label noise.

Conclusion: CrispFormer provides a simple, decoder-centric approach that is broadly compatible with existing SegFormer variants and offers reproducible improvement in mask fidelity from image-level supervision.

Abstract: Weakly supervised semantic segmentation (WSSS) must learn dense masks from noisy, under-specified cues. We revisit the SegFormer decoder and show that three small, synergistic changes make weak supervision markedly more effective, without altering the MiT backbone or relying on heavy post-processing. Our method, CrispFormer, augments the decoder with: (1) a boundary branch that supervises thin object contours using a lightweight edge head and a boundary-aware loss; (2) an uncertainty-guided refiner that predicts per-pixel aleatoric uncertainty and uses it to weight losses and gate a residual correction of the segmentation logits; and (3) a dynamic multi-scale fusion layer that replaces static concatenation with spatial softmax gating over multi-resolution features, optionally modulated by uncertainty. The result is a single-pass model that preserves crisp boundaries, selects appropriate scales per location, and resists label noise from weak cues. Integrated into a standard WSSS pipeline (seed, student, and EMA relabeling), CrispFormer consistently improves boundary F-score, small-object recall, and mIoU over SegFormer baselines trained on the same seeds, while adding minimal compute. Our decoder-centric formulation is simple to implement, broadly compatible with existing SegFormer variants, and offers a reproducible path to higher-fidelity masks from image-level supervision.
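
As a rough illustration of the third change, spatial softmax gating over multi-resolution features, the sketch below fuses per-scale features with weights computed by a softmax across scales at each location. It assumes the features were already resampled to a common resolution and flattened to one value per pixel; `fuse_scales` and its inputs are hypothetical names, not CrispFormer's actual layer.

```python
import math


def fuse_scales(features, gate_logits):
    """Fuse per-scale features with a per-pixel softmax over scales.

    features[s][p] and gate_logits[s][p] hold the feature value and the gate
    logit for scale s at pixel p (single channel, for brevity).
    """
    num_scales = len(features)
    num_pixels = len(features[0])
    fused = []
    for p in range(num_pixels):
        logits = [gate_logits[s][p] for s in range(num_scales)]
        peak = max(logits)  # subtract the max for numerical stability
        exps = [math.exp(value - peak) for value in logits]
        total = sum(exps)
        fused.append(sum(e / total * features[s][p] for s, e in enumerate(exps)))
    return fused


# Pixel 0 weights both scales equally; pixel 1 almost fully selects scale 0.
fused = fuse_scales(
    features=[[1.0, 1.0], [3.0, 3.0]],
    gate_logits=[[0.0, 10.0], [0.0, -10.0]],
)
```

Unlike static concatenation, the weights differ from pixel to pixel, which is what lets the model select an appropriate scale per location.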

[101] Training-Free Generation of Diverse and High-Fidelity Images via Prompt Semantic Space Optimization

Debin Meng, Chen Jin, Zheng Gao, Yanran Li, Ioannis Patras, Georgios Tzimiropoulos

Main category: cs.CV

TL;DR: TPSO is a training-free module that enhances image diversity in text-to-image diffusion models by optimizing token-prompt embedding space to explore underrepresented regions while maintaining image quality.

Details

Motivation: Text-to-image diffusion models often generate repetitive outputs due to collapsing toward strong modes in the learned distribution, hindering creative exploration and downstream applications. Existing diversity methods either still collapse to dominant modes or introduce quality distortions.

Method: Token-Prompt embedding Space Optimization (TPSO) introduces learnable parameters to explore underrepresented regions of the token embedding space, reducing generation from strong modes while using prompt-level space as global semantic constraint to prevent quality degradation.

Result: Extensive experiments on MS-COCO and three diffusion backbones show TPSO significantly enhances generative diversity, improving baseline performance from 1.10 to 4.18 points without sacrificing image quality.

Conclusion: TPSO effectively addresses the diversity challenge in text-to-image generation by optimizing token-prompt embedding space, providing a training-free and model-agnostic solution that maintains high fidelity while significantly improving diversity.

Abstract: Image diversity remains a fundamental challenge for text-to-image diffusion models. Low-diversity models tend to generate repetitive outputs, increasing sampling redundancy and hindering both creative exploration and downstream applications. A primary cause is that generation often collapses toward a strong mode in the learned distribution. Existing attempts to improve diversity, such as noise resampling, prompt rewriting, or steering-based guidance, often still collapse to dominant modes or introduce distortions that degrade image quality. In light of this, we propose Token-Prompt embedding Space Optimization (TPSO), a training-free and model-agnostic module. TPSO introduces learnable parameters to explore underrepresented regions of the token embedding space, reducing the tendency of the model to repeatedly generate samples from strong modes of the learned distribution. At the same time, the prompt-level space provides a global semantic constraint that regulates distribution shifts, preventing quality degradation while maintaining high fidelity. Extensive experiments on MS-COCO and three diffusion backbones show that TPSO significantly enhances generative diversity, improving baseline performance from 1.10 to 4.18 points, without sacrificing image quality. Code will be released upon acceptance.

[102] Prune-Then-Plan: Step-Level Calibration for Stable Frontier Exploration in Embodied Question Answering

Noah Frahm, Prakrut Patel, Yue Zhang, Shoubin Yu, Mohit Bansal, Roni Sengupta

Main category: cs.CV

TL;DR: Prune-Then-Plan framework stabilizes VLM-based embodied question answering by pruning implausible frontier choices and using coverage-based planning, reducing frontier oscillations and improving navigation efficiency.

Details

Motivation: Large VLMs used for embodied question answering exhibit frontier oscillations (unstable back-and-forth movements) due to overconfidence and miscalibration, leading to inefficient navigation and degraded answer quality.

Method: Proposes Prune-Then-Plan framework: 1) Prunes implausible frontier choices using Holm-Bonferroni inspired procedure, 2) Delegates final decisions to coverage-based planner, converting overconfident predictions into conservative actions.

Result: Achieves relative improvements of up to 49% in visually grounded SPL and 33% in LLM-Match metrics over baselines, with better scene coverage under equal exploration budgets on OpenEQA and EXPRESS-Bench datasets.

Conclusion: The method effectively stabilizes VLM exploration through step-level calibration, separating pruning from planning to achieve more efficient and reliable embodied question answering.

Abstract: Large vision-language models (VLMs) have improved embodied question answering (EQA) agents by providing strong semantic priors for open-vocabulary reasoning. However, when used directly for step-level exploration, VLMs often exhibit frontier oscillations, unstable back-and-forth movements caused by overconfidence and miscalibration, leading to inefficient navigation and degraded answer quality. We propose Prune-Then-Plan, a simple and effective framework that stabilizes exploration through step-level calibration. Instead of trusting raw VLM scores, our method prunes implausible frontier choices using a Holm-Bonferroni inspired pruning procedure and then delegates final decisions to a coverage-based planner. This separation converts overconfident predictions into conservative, interpretable actions by relying on human-level judgments to calibrate the step-level behavior of VLMs. Integrated into the 3D-Mem EQA framework, our approach achieves relative improvements of up to 49% and 33% in visually grounded SPL and LLM-Match metrics respectively over baselines. Overall, our method achieves better scene coverage under equal exploration budgets on both OpenEQA and EXPRESS-Bench datasets.
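
For readers unfamiliar with the statistical procedure the pruning step is inspired by, here is a sketch of the classical Holm-Bonferroni step-down test. How the paper maps VLM frontier scores to p-values, and which rejections lead to pruning, is not stated in this summary; the sketch shows only the textbook procedure.

```python
def holm_bonferroni(p_values, alpha=0.05):
    """Return a list of booleans: True where the null hypothesis is rejected.

    Classical step-down rule: sort p-values ascending and compare the k-th
    smallest (k starting at 0) against alpha / (m - k); stop at the first
    failure.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    rejected = [False] * m
    for rank, index in enumerate(order):
        if p_values[index] <= alpha / (m - rank):
            rejected[index] = True
        else:
            break  # all larger p-values are retained as well
    return rejected


decisions = holm_bonferroni([0.01, 0.04, 0.03, 0.005], alpha=0.05)
```

Because the thresholds tighten with the number of candidates, the procedure is conservative when many frontiers are compared at once.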

[103] CropVLM: Learning to Zoom for Fine-Grained Vision-Language Perception

Miguel Carvalho, Helder Dias, Bruno Martins

Main category: cs.CV

TL;DR: CropVLM is an external method that enhances VLMs’ fine-grained image understanding by enabling dynamic zooming on relevant regions, trained via reinforcement learning without human supervision.

Details

Motivation: VLMs struggle with fine-grained image tasks like scene-text recognition due to perception limitations and visual fragmentation.

Method: Train CropVLM using reinforcement learning without human-labeled bounding boxes or synthetic evaluations, enabling dynamic region zooming.

Result: Significant improvements on high-resolution image understanding tasks, especially for out-of-domain benchmarks, without VLM modification.

Conclusion: CropVLM effectively boosts VLM performance on fine-grained tasks while avoiding catastrophic forgetting through external implementation.

Abstract: Vision-Language Models (VLMs) often struggle with tasks that require fine-grained image understanding, such as scene-text recognition or document analysis, due to perception limitations and visual fragmentation. To address these challenges, we introduce CropVLM as an external low-cost method for boosting performance, enabling VLMs to dynamically “zoom in” on relevant image regions, enhancing their ability to capture fine details. CropVLM is trained using reinforcement learning, without using human-labeled bounding boxes as a supervision signal, and without expensive synthetic evaluations. The model is trained once and can be paired with both open-source and proprietary VLMs to improve their performance. Our approach delivers significant improvements on tasks that require high-resolution image understanding, notably for benchmarks that are out-of-domain for the target VLM, without modifying or fine-tuning the VLM, thus avoiding catastrophic forgetting.

[104] One Attention, One Scale: Phase-Aligned Rotary Positional Embeddings for Mixed-Resolution Diffusion Transformer

Haoyu Wu, Jingyi Xu, Qiaomu Miao, Dimitris Samaras, Hieu Le

Main category: cs.CV

TL;DR: The paper identifies a structural failure in using linear interpolation on RoPE for mixed-resolution denoising with Diffusion Transformers, where attention collapses due to phase aliasing from incompatible sampling rates.

Details

Motivation: To address the core failure mode where linear interpolation on rotary positional embeddings causes attention mechanism collapse when tokens from different spatial grids are mixed in mixed-resolution denoising.

Method: Proposes Cross-Resolution Phase-Aligned Attention (CRPA), a training-free drop-in fix that modifies RoPE index maps so all Q/K positions use the query’s stride, ensuring equal physical distances produce identical phase increments.

Result: CRPA eliminates the failure at its source, stabilizes all heads and layers uniformly, enables high-fidelity mixed-resolution generation, and outperforms previous state-of-the-art methods on image and video generation.

Conclusion: CRPA provides an effective solution to the structural phase aliasing problem in mixed-resolution denoising with Diffusion Transformers, enabling stable and high-quality generation without requiring retraining.

Abstract: We identify a core failure mode that occurs when using the usual linear interpolation on rotary positional embeddings (RoPE) for mixed-resolution denoising with Diffusion Transformers. When tokens from different spatial grids are mixed, the attention mechanism collapses. The issue is structural. Linear coordinate remapping forces a single attention head to compare RoPE phases sampled at incompatible rates, creating phase aliasing that destabilizes the score landscape. Pretrained DiTs are especially brittle: many heads exhibit extremely sharp, periodic phase selectivity, so even tiny cross-rate inconsistencies reliably cause blur, artifacts, or full collapse. To this end, our main contribution is Cross-Resolution Phase-Aligned Attention (CRPA), a training-free drop-in fix that eliminates this failure at its source. CRPA modifies only the RoPE index map for each attention call: all Q/K positions are expressed on the query’s stride so that equal physical distances always induce identical phase increments. This restores the precise phase patterns that DiTs rely on. CRPA is fully compatible with pretrained DiTs and stabilizes all heads and layers uniformly. We demonstrate that CRPA enables high-fidelity and efficient mixed-resolution generation, outperforming previous state-of-the-art methods on image and video generation.
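
The idea of expressing all positions on the query's stride can be illustrated with a one-dimensional toy. This is a sketch under strong assumptions (a single axis, a single RoPE frequency `theta`, and strides given as physical spacing between tokens); the helper names are hypothetical and the summary does not describe CRPA's actual index map in this detail.

```python
def positions_on_query_stride(grid_indices, grid_stride, query_stride):
    """Convert token indices on a grid into RoPE positions in query-stride units.

    A token at index i on a grid with spacing `grid_stride` sits at physical
    coordinate i * grid_stride; dividing by `query_stride` re-expresses it in
    the query grid's units.
    """
    return [i * grid_stride / query_stride for i in grid_indices]


def rope_phase(position, theta):
    """Phase of one RoPE frequency component at the given position."""
    return position * theta


# Low-resolution keys (spacing 2) and high-resolution keys (spacing 1) that
# cover the same physical locations 0, 2 and 4.
low_res = positions_on_query_stride([0, 1, 2], grid_stride=2, query_stride=1)
high_res = positions_on_query_stride([0, 2, 4], grid_stride=1, query_stride=1)
low_phases = [rope_phase(p, theta=0.1) for p in low_res]
high_phases = [rope_phase(p, theta=0.1) for p in high_res]
```

In this toy, tokens at the same physical location receive the same phase regardless of which grid they came from, which is the property the abstract attributes to CRPA.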

[105] MAPS: Preserving Vision-Language Representations via Module-Wise Proximity Scheduling for Better Vision-Language-Action Generalization

Chengyue Huang, Mellon M. Zhang, Robert Azarcon, Glen Chou, Zsolt Kira

Main category: cs.CV

TL;DR: MAPS is a fine-tuning framework for Vision-Language-Action models that uses module-wise proximity scheduling to preserve pretrained VLM priors while enabling action layer adaptation, improving generalization without extra parameters.

Details

Motivation: Naive fine-tuning of VLA models disrupts pretrained VLM representations and harms generalization, while existing approaches either overconstrain adaptation or ignore component-specific roles.

Method: MAPS linearly schedules proximity constraint relaxation based on empirical analysis, keeping visual encoders close to pretrained priors while allowing action-oriented language layers to adapt more freely.

Result: MAPS consistently improves both in-distribution and out-of-distribution performance (up to +30%) across multiple VLA models and benchmarks including SimplerEnv, CALVIN, LIBERO, and real-world Franka Emika Panda evaluations.

Conclusion: Empirically guided proximity to pretrained VLMs is a simple yet powerful principle for preserving broad generalization in VLM-to-VLA transfer, with MAPS providing a seamless integration framework.

Abstract: Vision-Language-Action (VLA) models inherit strong priors from pretrained Vision-Language Models (VLMs), but naive fine-tuning often disrupts these representations and harms generalization. Existing fixes – freezing modules or applying uniform regularization – either overconstrain adaptation or ignore the differing roles of VLA components. We present MAPS (Module-Wise Proximity Scheduling), the first robust fine-tuning framework for VLAs. Through systematic analysis, we uncover an empirical order in which proximity constraints should be relaxed to balance stability and flexibility. MAPS linearly schedules this relaxation, enabling visual encoders to stay close to their pretrained priors while action-oriented language layers adapt more freely. MAPS introduces no additional parameters or data, and can be seamlessly integrated into existing VLAs. Across MiniVLA-VQ, MiniVLA-OFT, OpenVLA-OFT, and challenging benchmarks such as SimplerEnv, CALVIN, LIBERO, as well as real-world evaluations on the Franka Emika Panda platform, MAPS consistently boosts both in-distribution and out-of-distribution performance (up to +30%). Our findings highlight empirically guided proximity to pretrained VLMs as a simple yet powerful principle for preserving broad generalization in VLM-to-VLA transfer.
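
One way to picture module-wise proximity scheduling is a constraint strength that decreases linearly along an ordered list of modules. The sketch below makes several assumptions that the summary does not confirm: the constraint is modeled as a weighted squared distance to the pretrained weights, the module names and their order are placeholders, and both function names are hypothetical.

```python
def linear_proximity_weights(modules, strongest=1.0, weakest=0.0):
    """Assign each module a proximity weight, decreasing linearly in list order."""
    count = len(modules)
    if count == 1:
        return {modules[0]: strongest}
    step = (strongest - weakest) / (count - 1)
    return {name: strongest - k * step for k, name in enumerate(modules)}


def proximity_penalty(params, pretrained, weights):
    """Weighted squared distance between current and pretrained parameters."""
    total = 0.0
    for name, weight in weights.items():
        total += weight * sum(
            (p - q) ** 2 for p, q in zip(params[name], pretrained[name])
        )
    return total


# Placeholder ordering: modules listed first stay closest to their priors.
weights = linear_proximity_weights(["vision_encoder", "projector", "language_layers"])
penalty = proximity_penalty(
    params={"vision_encoder": [1.0, 2.0], "projector": [1.0], "language_layers": [5.0]},
    pretrained={"vision_encoder": [0.0, 2.0], "projector": [0.0], "language_layers": [0.0]},
    weights=weights,
)
```

A schedule like this adds no parameters or data, consistent with the abstract's description, since it only reweights a distance to weights that already exist.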

[106] Reading Between the Lines: Abstaining from VLM-Generated OCR Errors via Latent Representation Probes

Jihan Yao, Achin Kulshrestha, Nathalie Rauschmayr, Reed Roberts, Banghua Zhu, Yulia Tsvetkov, Federico Tombari

Main category: cs.CV

TL;DR: LRP trains lightweight probes on VLMs’ internal representations to detect uncertainty for abstention in STVQA tasks, improving abstention accuracy by 7.6% over baselines.

Details

Motivation: VLMs need reliable abstention capabilities for safety-critical applications, especially when OCR errors could cause severe consequences like traffic accidents. Existing methods fail due to miscalibrated probabilities or semantic agreement requirements.

Method: Latent Representation Probing (LRP) trains lightweight probes on hidden states or attention patterns using three designs: concatenating representations across layers, aggregating attention over visual tokens, and ensembling single layer probes by majority vote.

Result: LRP improves abstention accuracy by 7.6% over best baselines across four benchmarks in image and video modalities. Probes generalize across uncertainty sources and datasets, with optimal signals from intermediate layers.

Conclusion: LRP establishes a principled framework for deployment-ready AI systems by detecting confidence signals from internal states rather than unreliable outputs, enabling better uncertainty detection in VLMs.

Abstract: As VLMs are deployed in safety-critical applications, their ability to abstain from answering when uncertain becomes crucial for reliability, especially in Scene Text Visual Question Answering (STVQA) tasks. For example, OCR errors like misreading “50 mph” as “60 mph” could cause severe traffic accidents. This leads us to ask: Can VLMs know when they can’t see? Existing abstention methods suggest pessimistic answers: they either rely on miscalibrated output probabilities or require semantic agreement unsuitable for OCR tasks. However, this failure may indicate we are looking in the wrong place: uncertainty signals could be hidden in VLMs’ internal representations. Building on this insight, we propose Latent Representation Probing (LRP): training lightweight probes on hidden states or attention patterns. We explore three probe designs: concatenating representations across all layers, aggregating attention over visual tokens, and ensembling single layer probes by majority vote. Experiments on four benchmarks across image and video modalities show LRP improves abstention accuracy by 7.6% over best baselines. Our analysis reveals: probes generalize across various uncertainty sources and datasets, and optimal signals emerge from intermediate rather than final layers. This establishes a principled framework for building deployment-ready AI systems by detecting confidence signals from internal states rather than unreliable outputs.
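
The third probe design, ensembling single-layer probes by majority vote, can be sketched as follows. It assumes each probe is a linear classifier over one layer's hidden state and that a positive score means "abstain"; the probe weights, the hidden states, and the function names are illustrative placeholders rather than the paper's trained probes.

```python
def probe_votes_abstain(hidden_state, weights, bias):
    """Linear probe on one layer: True means this layer flags the answer as unreliable."""
    score = sum(w * h for w, h in zip(weights, hidden_state)) + bias
    return score > 0.0  # equivalent to sigmoid(score) > 0.5


def majority_vote_abstain(layer_states, layer_probes):
    """Abstain when more than half of the per-layer probes vote to abstain."""
    votes = [
        probe_votes_abstain(state, weights, bias)
        for state, (weights, bias) in zip(layer_states, layer_probes)
    ]
    return sum(votes) * 2 > len(votes)


# Three layers with toy two-dimensional hidden states and toy probes.
states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
probes = [([1.0, 0.0], -0.5), ([0.0, -1.0], 0.0), ([0.5, 0.5], -0.5)]
should_abstain = majority_vote_abstain(states, probes)
```

Voting across layers keeps each probe small while still drawing on intermediate layers, where the abstract reports the strongest signals.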

[107] CounterVQA: Evaluating and Improving Counterfactual Reasoning in Vision-Language Models for Video Understanding

Yuefei Chen, Jiang Liu, Xiaodong Lin, Ruixiang Tang

Main category: cs.CV

TL;DR: The paper introduces CounterVQA, a video benchmark for evaluating counterfactual reasoning in VLMs, reveals performance gaps in complex reasoning, and proposes CFGPT to enhance visual counterfactual reasoning.

Details

Motivation: Current VLMs show advancements in video understanding but lack exploration in counterfactual reasoning, which is essential for robust understanding by inferring alternative outcomes under hypothetical conditions.

Method: Developed CounterVQA benchmark with three difficulty levels, evaluated state-of-the-art models, and created CFGPT post-training method to distill counterfactual reasoning from language to visual modality.

Result: Found substantial performance gap: models perform reasonably on simple counterfactual questions but degrade significantly on complex multi-hop causal chains. CFGPT showed consistent improvements across all difficulty levels.

Conclusion: Counterfactual reasoning remains challenging for current VLMs, but the proposed CFGPT method effectively enhances this capability by transferring reasoning skills from language to visual domains.

Abstract: Vision Language Models (VLMs) have recently shown significant advancements in video understanding, especially in feature alignment, event reasoning, and instruction-following tasks. However, their capability for counterfactual reasoning, inferring alternative outcomes under hypothetical conditions, remains underexplored. This capability is essential for robust video understanding, as it requires identifying underlying causal structures and reasoning about unobserved possibilities, rather than merely recognizing observed patterns. To systematically evaluate this capability, we introduce CounterVQA, a video-based benchmark featuring three progressive difficulty levels that assess different aspects of counterfactual reasoning. Through comprehensive evaluation of both state-of-the-art open-source and closed-source models, we uncover a substantial performance gap: while these models achieve reasonable accuracy on simple counterfactual questions, performance degrades significantly on complex multi-hop causal chains. To address these limitations, we develop a post-training method, CFGPT, that enhances a model’s visual counterfactual reasoning ability by distilling its counterfactual reasoning capability from the language modality, yielding consistent improvements across all CounterVQA difficulty levels. The dataset and code will be released.

[108] ReDirector: Creating Any-Length Video Retakes with Rotary Camera Encoding

Byeongjun Park, Byung-Hoon Kim, Hyungjin Chung, Jong Chul Ye

Main category: cs.CV

TL;DR: ReDirector is a camera-controlled video retake generation method that corrects RoPE misuse and introduces Rotary Camera Encoding (RoCE) to improve camera controllability and video quality for variable-length videos.

Details

Motivation: To address the common misuse of RoPE in previous video generation works and improve camera controllability for dynamically captured variable-length videos.

Method: Introduces Rotary Camera Encoding (RoCE), a camera-conditioned RoPE phase shift that captures multi-view relationships within and across input and target videos, integrating camera conditions into RoPE.

Result: Significant improvements in camera controllability, geometric consistency, and video quality across various trajectories and lengths, with better dynamic object localization and static background preservation.

Conclusion: ReDirector effectively generalizes to out-of-distribution camera trajectories and video lengths, demonstrating superior performance in camera-controlled video retake generation.

Abstract: We present ReDirector, a novel camera-controlled video retake generation method for dynamically captured variable-length videos. In particular, we rectify a common misuse of RoPE in previous works by aligning the spatiotemporal positions of the input video and the target retake. Moreover, we introduce Rotary Camera Encoding (RoCE), a camera-conditioned RoPE phase shift that captures and integrates multi-view relationships within and across the input and target videos. By integrating camera conditions into RoPE, our method generalizes to out-of-distribution camera trajectories and video lengths, yielding improved dynamic object localization and static background preservation. Extensive experiments further demonstrate significant improvements in camera controllability, geometric consistency, and video quality across various trajectories and lengths.

[109] Large Language Model Aided Birt-Hogg-Dube Syndrome Diagnosis with Multimodal Retrieval-Augmented Generation

Haoqing Li, Jun Shi, Xianmeng Chen, Qiwei Jia, Rui Wang, Wei Wei, Hong An, Xiaowen Hu

Main category: cs.CV

TL;DR: BHD-RAG is a multimodal retrieval-augmented generation framework that integrates domain-specific knowledge with MLLMs to improve Birt-Hogg-Dube syndrome diagnosis from CT images, addressing challenges of limited clinical samples and low inter-class differentiation in Diffuse Cystic Lung Diseases.

Details

Motivation: Deep learning faces challenges in BHD diagnosis due to limited clinical samples and low inter-class differentiation among DCLDs. MLLMs show potential but suffer from domain knowledge gaps and hallucination risks without proper radiological references.

Method: Three-component framework: (1) specialized agent generates imaging manifestation descriptions to build multimodal DCLD corpus, (2) cosine similarity-based retriever finds relevant image-description pairs for query images, (3) MLLM synthesizes retrieved evidence with imaging data for diagnosis.

Result: Validated on dataset with four DCLD types, achieving superior accuracy and generating evidence-based descriptions closely aligned with expert insights.

Conclusion: BHD-RAG effectively addresses diagnostic challenges in rare diseases by integrating domain-specific expertise and clinical precedents with multimodal AI, reducing hallucination risks and improving diagnostic accuracy.

Abstract: Deep learning methods face dual challenges of limited clinical samples and low inter-class differentiation among Diffuse Cystic Lung Diseases (DCLDs) in advancing Birt-Hogg-Dube syndrome (BHD) diagnosis via Computed Tomography (CT) imaging. While Multimodal Large Language Models (MLLMs) demonstrate diagnostic potential for such rare diseases, the absence of domain-specific knowledge and referable radiological features intensifies hallucination risks. To address this problem, we propose BHD-RAG, a multimodal retrieval-augmented generation framework that integrates DCLD-specific expertise and clinical precedents with MLLMs to improve BHD diagnostic accuracy. BHD-RAG employs: (1) a specialized agent generating imaging manifestation descriptions of CT images to construct a multimodal corpus of DCLD cases, (2) a cosine similarity-based retriever pinpointing relevant image-description pairs for query images, and (3) an MLLM synthesizing retrieved evidence with imaging data for diagnosis. BHD-RAG is validated on a dataset involving four types of DCLDs, achieving superior accuracy and generating evidence-based descriptions closely aligned with expert insights.
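
The retrieval component (2) can be sketched as a plain cosine-similarity ranking over stored image embeddings. The sketch assumes each corpus entry pairs an embedding with its text description; the embeddings, descriptions, and function names below are made-up placeholders, and the image encoder that would produce real embeddings is out of scope.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def retrieve_top_k(query_embedding, corpus, k=2):
    """Return the k descriptions whose embeddings are most similar to the query.

    `corpus` is a list of (embedding, description) pairs.
    """
    ranked = sorted(
        corpus,
        key=lambda entry: cosine_similarity(query_embedding, entry[0]),
        reverse=True,
    )
    return [description for _, description in ranked[:k]]


# Placeholder corpus of (embedding, description) pairs.
corpus = [
    ([1.0, 0.0], "case A description"),
    ([0.0, 1.0], "case B description"),
    ([0.9, 0.1], "case C description"),
]
retrieved = retrieve_top_k([1.0, 0.0], corpus, k=2)
```

The retrieved descriptions would then be passed to the MLLM alongside the query image, as described in component (3).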

[110] Rectified SpaAttn: Revisiting Attention Sparsity for Efficient Video Generation

Xuewen Liu, Zhikai Li, Jing Zhang, Mengjuan Chen, Qingyi Gu

Main category: cs.CV

TL;DR: Rectified SpaAttn addresses attention sparsity biases in diffusion transformers by rectifying attention allocation with implicit full attention reference, achieving significant speedups while maintaining generation quality.

Details

Motivation: Existing attention sparsity methods in diffusion transformers suffer from severe performance degradation due to systematic biases: excessive focus on critical tokens and complete neglect of non-critical tokens.

Method: Proposes Rectified SpaAttn with two key components: Isolated-Pooling Attention Reallocation for critical tokens and Gain-Aware Pooling Rectification for non-critical tokens, integrated using Triton kernel optimization.

Result: Achieves up to 3.33x speedup on HunyuanVideo and 2.08x on Wan 2.1 while maintaining high generation quality.

Conclusion: Rectified SpaAttn effectively addresses attention sparsity biases and enables efficient video generation with significant speed improvements without quality degradation.

Abstract: Diffusion Transformers dominate video generation, but the quadratic complexity of attention computation introduces substantial latency. Attention sparsity reduces computational costs by focusing on critical tokens while ignoring non-critical tokens. However, existing methods suffer from severe performance degradation. In this paper, we revisit attention sparsity and reveal that existing methods induce systematic biases in attention allocation: (1) excessive focus on critical tokens amplifies their attention weights; (2) complete neglect of non-critical tokens causes the loss of relevant attention weights. To address these issues, we propose Rectified SpaAttn, which rectifies attention allocation with implicit full attention reference, thereby enhancing the alignment between sparse and full attention maps. Specifically: (1) for critical tokens, we show that their bias is proportional to the sparse attention weights, with the ratio governed by the amplified weights. Accordingly, we propose Isolated-Pooling Attention Reallocation, which calculates accurate rectification factors by reallocating multimodal pooled weights. (2) for non-critical tokens, recovering attention weights from the pooled query-key yields attention gains but also introduces pooling errors. Therefore, we propose Gain-Aware Pooling Rectification, which ensures that the rectified gain consistently surpasses the induced error. Moreover, we customize and integrate the Rectified SpaAttn kernel using Triton, achieving up to 3.33 and 2.08 times speedups on HunyuanVideo and Wan 2.1, respectively, while maintaining high generation quality. We release Rectified SpaAttn as open-source at https://github.com/BienLuky/Rectified-SpaAttn .
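
The first bias described above, amplified weights on critical tokens, follows directly from renormalizing a softmax over a subset of keys. The toy below illustrates only that bias for one query; it is not the paper's rectification method, and the scores and kept set are arbitrary examples.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def sparse_softmax(scores, keep):
    """Softmax restricted to the indices in `keep`; other weights are zero."""
    kept = sorted(keep)
    kept_weights = softmax([scores[i] for i in kept])
    weights = [0.0] * len(scores)
    for index, weight in zip(kept, kept_weights):
        weights[index] = weight
    return weights


scores = [2.0, 1.0, 0.0, -1.0]
full = softmax(scores)
sparse = sparse_softmax(scores, keep={0, 1})
kept_mass = full[0] + full[1]  # share of full attention on the kept tokens
```

Each kept weight equals its full-attention weight divided by `kept_mass`, so the gap `sparse[i] - full[i]` equals `sparse[i] * (1 - kept_mass)`, consistent with the abstract's statement that the bias on critical tokens is proportional to the sparse attention weights.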

[111] 4DWorldBench: A Comprehensive Evaluation Framework for 3D/4D World Generation Models

Yiting Lu, Wei Luo, Peiyan Tu, Haoran Li, Hanxin Zhu, Zihao Yu, Xingrui Wang, Xinyi Chen, Xinge Peng, Xin Li, Zhibo Chen

Main category: cs.CV

TL;DR: 4DWorldBench is a comprehensive benchmark for evaluating World Generation Models across four key dimensions: Perceptual Quality, Condition-4D Alignment, Physical Realism, and 4D Consistency, using adaptive multimodal conditioning and unified evaluation methods.

Details

Motivation: Existing benchmarks lack unified assessment of world-realism capability in World Models, which need to construct realistic, dynamic, and physically consistent 3D/4D worlds from various inputs.

Method: Introduces adaptive conditioning across multiple modalities, maps all modality conditions into unified textual space, and integrates LLM-as-judge, MLLM-as-judge, and traditional network-based evaluation methods.

Result: The benchmark covers tasks like Image-to-3D/4D, Video-to-4D, Text-to-3D/4D, and preliminary human studies show adaptive tool selection achieves closer agreement with subjective human judgments.

Conclusion: 4DWorldBench provides a foundation for objective comparisons and improvements, accelerating the transition from visual generation to world generation.

Abstract: World Generation Models are emerging as a cornerstone of next-generation multimodal intelligence systems. Unlike traditional 2D visual generation, World Models aim to construct realistic, dynamic, and physically consistent 3D/4D worlds from images, videos, or text. These models not only need to produce high-fidelity visual content but also maintain coherence across space, time, physics, and instruction control, enabling applications in virtual reality, autonomous driving, embodied intelligence, and content creation. However, prior benchmarks emphasize different evaluation dimensions and lack a unified assessment of world-realism capability. To systematically evaluate World Models, we introduce the 4DWorldBench, which measures models across four key dimensions: Perceptual Quality, Condition-4D Alignment, Physical Realism, and 4D Consistency. The benchmark covers tasks such as Image-to-3D/4D, Video-to-4D, Text-to-3D/4D. Beyond these, we innovatively introduce adaptive conditioning across multiple modalities, which not only integrates but also extends traditional evaluation paradigms. To accommodate different modality-conditioned inputs, we map all modality conditions into a unified textual space during evaluation, and further integrate LLM-as-judge, MLLM-as-judge, and traditional network-based methods. This unified and adaptive design enables more comprehensive and consistent evaluation of alignment, physical realism, and cross-modal coherence. Preliminary human studies further demonstrate that our adaptive tool selection achieves closer agreement with subjective human judgments. We hope this benchmark will serve as a foundation for objective comparisons and improvements, accelerating the transition from “visual generation” to “world generation.” Our project can be found at https://yeppp27.github.io/4DWorldBench.github.io/.

[112] Face, Whole-Person, and Object Classification in a Unified Space Via The Interleaved Multi-Domain Identity Curriculum

Thomas M Metz, Matthew Q Hill, Alice J O’Toole

Main category: cs.CV

TL;DR: The paper introduces IMIC, a gradient-coupled training method that enables foundation models to perform multiple vision tasks (object recognition, face recognition from high/low-quality images, person recognition) simultaneously in a single embedding space without catastrophic forgetting.

Details

Motivation: Vision foundation models suffer from catastrophic forgetting when fine-tuned for specific tasks like face/person recognition, losing their generalized object classification capabilities. The goal is to create models that can handle multiple vision tasks concurrently while maintaining foundation model properties.

Method: Proposed Interleaved Multi-Domain Identity Curriculum (IMIC) - a gradient-coupled, interleaving training schedule that fine-tunes foundation backbones (DINOv3, CLIP, EVA-02) simultaneously on all four tasks to prevent catastrophic forgetting.

Result: EVA-02 and CLIP with IMIC performed comparably with domain experts on all four tasks and were more accurate than humans at multi-tasking. The approach maintained out-of-distribution generalization. Analysis showed linearly separable representations with substantial feature sharing across tasks.

Conclusion: IMIC enables foundation models to handle multiple vision tasks simultaneously without catastrophic forgetting, maintaining foundation model properties while achieving expert-level performance across domains.

Abstract: Vision foundation models can perform generalized object classification in zero-shot mode, and face/person recognition when they are fine-tuned. However, fine-tuned models suffer from catastrophic forgetting. We create models that perform four tasks (object recognition, face recognition from high- and low-quality images, and person recognition from whole-body images) in a single embedding space – without incurring substantial catastrophic forgetting. To accomplish this, we introduce two variants of the Interleaved Multi-Domain Identity Curriculum (IMIC): a gradient-coupled, interleaving training schedule that fine-tunes a foundation backbone simultaneously on all four tasks. The IMIC method proved effective with three foundation model bases: DINOv3, CLIP, and EVA-02. Two of these (EVA-02 and CLIP) performed comparably with domain experts on all four tasks concurrently and were more accurate than humans at multi-tasking across face, body, and object datasets. Further, we demonstrate that our approach does not substantially harm out-of-distribution generalization, thus maintaining a key property of foundation models. Analysis of the most accurate model variants (EVA-02 + IMIC A and B) showed linearly separable representations of the four tasks in the unified embedding space, but with substantial sharing of features across tasks. Fewer than 100 PCs calculated from any one task could perform all other tasks with nearly zero performance degradation.
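The general idea of an interleaved, gradient-coupled schedule can be illustrated with a toy sketch. This is not the authors' IMIC implementation: the single shared parameter `w`, the quadratic per-task loss, the task names, and the averaging of per-task gradients before each update are all assumptions made for illustration.

```python
from itertools import cycle
from typing import Dict, Iterator, List


def interleave(task_batches: Dict[str, List[float]]) -> Iterator[Dict[str, float]]:
    """Yield one batch per task at every step, cycling through each task's data."""
    streams = {name: cycle(batches) for name, batches in task_batches.items()}
    while True:
        yield {name: next(stream) for name, stream in streams.items()}


def train(task_batches: Dict[str, List[float]], steps: int = 100, lr: float = 0.1) -> float:
    """Update one shared parameter using gradients from all tasks at each step."""
    w = 0.0  # hypothetical stand-in for the shared backbone weights
    schedule = interleave(task_batches)
    for _ in range(steps):
        step_batches = next(schedule)
        # Toy per-task loss (w - target)^2 has gradient 2 * (w - target).
        grads = [2.0 * (w - target) for target in step_batches.values()]
        # Coupling (assumed): average the task gradients, then take ONE step,
        # so no task's update can overwrite another's.
        w -= lr * sum(grads) / len(grads)
    return w


# Four hypothetical tasks, each pulling the shared parameter toward its own target.
tasks = {"object": [1.0], "face_hq": [2.0], "face_lq": [3.0], "body": [4.0]}
final_w = train(tasks)
```

In this toy setting the shared parameter settles at a compromise between the four task targets rather than drifting to whichever task was trained last, which is the failure mode sequential fine-tuning would show.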

[113] DOGE: Differentiable Bezier Graph Optimization for Road Network Extraction

Jiahui Sun, Junran Lu, Jinhui Yin, Yishuo Xu, Yuanqi Li, Yanwen Guo

Main category: cs.CV

TL;DR: DOGE introduces a Bézier Graph representation for road extraction that learns directly from segmentation masks without needing vector ground-truth, achieving state-of-the-art results through differentiable geometry optimization and topological refinement.

Details

Motivation: Current road extraction methods use polylines that poorly model curvilinear geometry, while road geometry is inherently curve-based. The difficulty of obtaining vector ground-truth data is a major bottleneck.

Method: DOGE reframes road extraction as global optimization over Bézier Graphs, alternating between DiffAlign (differentiable geometry optimization) and TopoAdapt (discrete topological refinement) modules, learning directly from segmentation masks.

Result: Sets new state-of-the-art on SpaceNet and CityScale benchmarks, generating high-fidelity vector maps of road networks without requiring curve ground-truth.

Conclusion: Presents a new paradigm for road network extraction using parametric curve-based representation that eliminates the need for difficult-to-obtain vector ground-truth data.

Abstract: Automatic extraction of road networks from aerial imagery is a fundamental task, yet prevailing methods rely on polylines that struggle to model curvilinear geometry. We maintain that road geometry is inherently curve-based and introduce the Bézier Graph, a differentiable parametric curve-based representation. The primary obstacle to this representation is obtaining vector ground-truth (GT), which is difficult to construct. We sidestep this bottleneck by reframing the task as a global optimization problem over the Bézier Graph. Our framework, DOGE, operationalizes this paradigm by learning a parametric Bézier Graph directly from segmentation masks, eliminating the need for curve GT. DOGE holistically optimizes the graph by alternating between two complementary modules: DiffAlign continuously optimizes geometry via differentiable rendering, while TopoAdapt uses discrete operators to refine its topology. Our method sets a new state-of-the-art on the large-scale SpaceNet and CityScale benchmarks, presenting a new paradigm for generating high-fidelity vector maps of road networks. We will release our code and related data.
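To make the curve-versus-polyline contrast concrete, the sketch below evaluates a single Bézier segment in the standard Bernstein form. The choice of a cubic degree, the function names, and the control points are illustrative assumptions; the abstract does not state the curve degree, and DOGE's differentiable rendering and topology operators are not shown.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def cubic_bezier(p0: Point, p1: Point, p2: Point, p3: Point, t: float) -> Point:
    """Evaluate a cubic Bezier segment at t in [0, 1] using the Bernstein basis."""
    s = 1.0 - t
    b0, b1, b2, b3 = s ** 3, 3 * s * s * t, 3 * s * t * t, t ** 3
    x = b0 * p0[0] + b1 * p1[0] + b2 * p2[0] + b3 * p3[0]
    y = b0 * p0[1] + b1 * p1[1] + b2 * p2[1] + b3 * p3[1]
    return (x, y)


def sample_curve(ctrl: Sequence[Point], n: int) -> List[Point]:
    """Sample n >= 2 evenly spaced parameter values along one segment."""
    return [cubic_bezier(ctrl[0], ctrl[1], ctrl[2], ctrl[3], i / (n - 1)) for i in range(n)]


# Hypothetical control points describing a smooth bend in a road.
control = [(0.0, 0.0), (0.0, 1.0), (1.0, 1.0), (1.0, 0.0)]
midpoint = cubic_bezier(*control, t=0.5)
samples = sample_curve(control, 5)
```

Four control points describe the whole bend, whereas a polyline would need many vertices to approximate the same shape. Every sampled point is a smooth function of the control points, which is the property that makes gradient-based fitting of such curves possible.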

[114] STAvatar: Soft Binding and Temporal Density Control for Monocular 3D Head Avatars Reconstruction

Jiankuo Zhao, Xiangyu Zhu, Zidu Wang, Zhen Lei

Main category: cs.CV

TL;DR: STAvatar is a novel method for reconstructing high-fidelity 3D head avatars from monocular videos, addressing limitations of existing Gaussian Splatting approaches through UV-Adaptive Soft Binding and Temporal Adaptive Density Control.

Details

Motivation: Existing 3D Gaussian Splatting methods for head avatar reconstruction suffer from rigid motion due to Linear Blend Skinning, limited expressiveness, and inability to handle frequently occluded regions like mouth interiors and eyelids.

Method: Proposes two key components: (1) UV-Adaptive Soft Binding framework that uses image-based and geometric priors to learn per-Gaussian feature offsets in UV space, supporting dynamic resampling and compatibility with Adaptive Density Control; (2) Temporal ADC strategy that clusters structurally similar frames and uses fused perceptual error as clone criterion to capture both geometric and textural discrepancies.

Result: Extensive experiments on four benchmark datasets show STAvatar achieves state-of-the-art reconstruction performance, particularly in capturing fine-grained details and reconstructing frequently occluded regions.

Conclusion: STAvatar effectively addresses the limitations of existing methods by introducing adaptive binding and temporal density control strategies, resulting in superior 3D head avatar reconstruction quality and detail preservation.

Abstract: Reconstructing high-fidelity and animatable 3D head avatars from monocular videos remains a challenging yet essential task. Existing methods based on 3D Gaussian Splatting typically bind Gaussians to mesh triangles and model deformations solely via Linear Blend Skinning, which results in rigid motion and limited expressiveness. Moreover, they lack specialized strategies to handle frequently occluded regions (e.g., mouth interiors, eyelids). To address these limitations, we propose STAvatar, which consists of two key components: (1) a UV-Adaptive Soft Binding framework that leverages both image-based and geometric priors to learn per-Gaussian feature offsets within the UV space. This UV representation supports dynamic resampling, ensuring full compatibility with Adaptive Density Control (ADC) and enhanced adaptability to shape and textural variations. (2) a Temporal ADC strategy, which first clusters structurally similar frames to facilitate more targeted computation of the densification criterion. It further introduces a novel fused perceptual error as clone criterion to jointly capture geometric and textural discrepancies, encouraging densification in regions requiring finer details. Extensive experiments on four benchmark datasets demonstrate that STAvatar achieves state-of-the-art reconstruction performance, especially in capturing fine-grained details and reconstructing frequently occluded regions. The code will be publicly available.

[115] DesignPref: Capturing Personal Preferences in Visual Design Generation

Yi-Hao Peng, Jeffrey P. Bigham, Jason Wu

Main category: cs.CV

TL;DR: DesignPref dataset with 12k UI design comparisons shows high disagreement among professional designers, challenging traditional majority-voting approaches and demonstrating personalized models outperform aggregated ones using far fewer examples.

Details

Motivation: Visual design preferences are highly subjective and personalized, but current generative model finetuning relies on aggregated human-annotated preferences that don't capture individual variation.

Method: Created DesignPref dataset with 12k pairwise UI design comparisons annotated by 20 professional designers, then investigated personalization strategies including fine-tuning and RAG pipelines with designer-specific annotations.

Result: Substantial disagreement among designers (Krippendorff’s alpha = 0.25), with personalized models consistently outperforming aggregated baseline models using 20 times fewer examples.

Conclusion: Personalized models better capture individual design preferences than aggregated approaches, providing foundation for modeling individual design taste in future research.

Abstract: Generative models, such as large language models and text-to-image diffusion models, are increasingly used to create visual designs like user interfaces (UIs) and presentation slides. Finetuning and benchmarking these generative models have often relied on datasets of human-annotated design preferences. Yet, due to the subjective and highly personalized nature of visual design, preference varies widely among individuals. In this paper, we study this problem by introducing DesignPref, a dataset of 12k pairwise comparisons of UI design generation annotated by 20 professional designers with multi-level preference ratings. We found that among trained designers, substantial levels of disagreement exist (Krippendorff’s alpha = 0.25 for binary preferences). Natural language rationales provided by these designers indicate that disagreements stem from differing perceptions of the importance of various design aspects and from individual preferences. With DesignPref, we demonstrate that traditional majority-voting methods for training aggregated judge models often do not accurately reflect individual preferences. To address this challenge, we investigate multiple personalization strategies, particularly fine-tuning or incorporating designer-specific annotations into RAG pipelines. Our results show that personalized models consistently outperform aggregated baseline models in predicting individual designers’ preferences, even when using 20 times fewer examples. Our work provides the first dataset to study personalized visual design evaluation and support future research into modeling individual design taste.
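For readers unfamiliar with the agreement statistic quoted above, here is a minimal sketch of Krippendorff's alpha for nominal labels, computed as one minus the ratio of observed to expected disagreement. The function name and toy ratings are illustrative assumptions; this is not the authors' evaluation code and the data below is not from DesignPref.

```python
from collections import Counter
from typing import Hashable, List, Sequence


def krippendorff_alpha_nominal(units: Sequence[Sequence[Hashable]]) -> float:
    """Krippendorff's alpha for nominal data.

    `units` holds, for each rated item, the list of labels it received.
    Items with fewer than two labels carry no pairing information and are skipped.
    """
    pairable: List[Sequence[Hashable]] = [u for u in units if len(u) >= 2]
    n = sum(len(u) for u in pairable)
    totals = Counter(label for u in pairable for label in u)

    # Observed disagreement: ordered pairs of differing labels within each item.
    observed = 0.0
    for u in pairable:
        m = len(u)
        counts = Counter(u)
        differing_pairs = m * m - sum(c * c for c in counts.values())
        observed += differing_pairs / (m - 1)
    observed /= n

    # Expected disagreement: differing pairs if labels were paired at random.
    expected = (n * n - sum(c * c for c in totals.values())) / (n * (n - 1))
    if expected == 0:
        raise ValueError("alpha is undefined when only one label value occurs")
    return 1.0 - observed / expected


# Toy binary preferences ("A" or "B" preferred) from two hypothetical raters.
perfect = krippendorff_alpha_nominal([["A", "A"], ["B", "B"]])
opposed = krippendorff_alpha_nominal([["A", "B"], ["A", "B"]])
```

An alpha of 1 means perfect agreement, values near 0 mean agreement no better than chance, and negative values mean systematic disagreement. The reported 0.25 therefore sits much closer to chance than to consensus.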

[116] Temporal-Visual Semantic Alignment: A Unified Architecture for Transferring Spatial Priors from Vision Models to Zero-Shot Temporal Tasks

Xiangkai Ma, Han Zhang, Wenzhong Li, Sanglu Lu

Main category: cs.CV

TL;DR: TimeArtist is a temporal-visual conversion framework that achieves semantic-level alignment between time series and visual concepts, enabling high-quality image generation directly from time series data.

Details

Motivation: Existing methods fail to establish semantic-level alignment between time series and images, and the potential of using non-visual sequential data for high-fidelity image generation remains unexplored.

Method: Uses a ‘warmup-align’ paradigm with dual-autoencoder and shared quantizer trained self-supervised on large datasets, then freezes encoders and introduces projection for temporal-visual alignment at representation level.

Result: Achieves satisfactory performance in image generation metrics and superior results in zero-shot temporal tasks, enabling high-quality diverse image generation from time series while capturing temporal patterns.

Conclusion: Establishes a new paradigm for cross-modal generation that bridges the gap between temporal dynamics and visual semantics.

Abstract: Large Multimodal Models (LMMs) have achieved remarkable progress in aligning and generating content across text and image modalities. However, the potential of using non-visual, continuous sequential data as a conditioning signal for high-fidelity image generation remains largely unexplored. Furthermore, existing methods that convert series into “pseudo-images” for temporal forecasting fail to establish semantic-level alignment. In this paper, we propose TimeArtist, a temporal-visual conversion framework that pioneers semantic-level alignment between time series fluctuations and visual concepts. It introduces a “warmup-align” paradigm: first, a dual-autoencoder and shared quantizer are trained in a self-supervised manner on large-scale datasets to learn modality-shared representations. Then, the encoders and quantizer are frozen, and a projection is introduced to align temporal and visual samples at the representation level. TimeArtist establishes a versatile cross-modal framework, enabling high-quality, diverse image generation directly from time series, while capturing temporal fluctuation patterns to render images as style transfer. Extensive experiments show that TimeArtist achieves satisfactory performance in image generation metrics, while also attaining superior results in zero-shot temporal tasks. Our work establishes a new paradigm for cross-modal generation, bridging the gap between temporal dynamics and visual semantics.

[117] Does Understanding Inform Generation in Unified Multimodal Models? From Analysis to Path Forward

Yuwei Niu, Weiyang Jin, Jiaqi Liao, Chaoran Feng, Peng Jin, Bin Lin, Zongjian Li, Bin Zhu, Weihao Yu, Li Yuan

Main category: cs.CV

TL;DR: UniSandbox framework reveals significant understanding-generation gap in multimodal models, showing Chain-of-Thought reasoning bridges this gap for reasoning tasks and knowledge transfer.

Details

Motivation: To investigate whether understanding truly informs generation in Unified Multimodal Models, addressing the fundamental gap between these capabilities.

Method: Introduces UniSandbox, a decoupled evaluation framework with controlled synthetic datasets to avoid data leakage and enable detailed analysis of understanding-generation relationships.

Result: Found significant understanding-generation gap in reasoning generation and knowledge transfer. Chain-of-Thought reasoning effectively bridges the gap, and self-training can internalize this ability. Query-based architectures show latent CoT-like properties.

Conclusion: UniSandbox provides insights for designing future unified architectures and training strategies that truly bridge understanding and generation gaps.

Abstract: Recent years have witnessed significant progress in Unified Multimodal Models, yet a fundamental question remains: Does understanding truly inform generation? To investigate this, we introduce UniSandbox, a decoupled evaluation framework paired with controlled, synthetic datasets to avoid data leakage and enable detailed analysis. Our findings reveal a significant understanding-generation gap, which is mainly reflected in two key dimensions: reasoning generation and knowledge transfer. Specifically, for reasoning generation tasks, we observe that explicit Chain-of-Thought (CoT) in the understanding module effectively bridges the gap, and further demonstrate that a self-training approach can successfully internalize this ability, enabling implicit reasoning during generation. Additionally, for knowledge transfer tasks, we find that CoT assists the generative process by helping retrieve newly learned knowledge, and also discover that query-based architectures inherently exhibit latent CoT-like properties that affect this transfer. UniSandbox provides preliminary insights for designing future unified architectures and training strategies that truly bridge the gap between understanding and generation. Code and data are available at https://github.com/PKU-YuanGroup/UniSandBox

[118] GigaWorld-0: World Models as Data Engine to Empower Embodied AI

GigaWorld Team, Angen Ye, Boyuan Wang, Chaojun Ni, Guan Huang, Guosheng Zhao, Haoyun Li, Jiagang Zhu, Kerui Li, Mengyuan Xu, Qiuping Deng, Siting Wang, Wenkang Qin, Xinze Chen, Xiaofeng Wang, Yankai Wang, Yu Cao, Yifan Chang, Yuan Xu, Yun Ye, Yang Wang, Yukun Zhou, Zhengyuan Zhang, Zhehao Dong, Zheng Zhu

Main category: cs.CV

TL;DR: GigaWorld-0 is a unified world model framework that generates synthetic embodied interaction data for training Vision-Language-Action models, enabling strong real-world robot performance without real-world training.

Details

Motivation: World models are emerging as a foundational paradigm for scalable, data-efficient embodied AI, addressing the need for diverse, high-quality training data for Vision-Language-Action learning.

Method: Integrates GigaWorld-0-Video (large-scale video generation) and GigaWorld-0-3D (3D generative modeling with Gaussian Splatting, system identification, and motion planning) with joint optimization, plus efficient GigaTrain framework using FP8-precision and sparse attention.

Result: Generates high-quality, diverse, and controllable embodied interaction data that is visually compelling, spatially coherent, physically plausible, and instruction-aligned. VLA models trained on this data achieve strong real-world performance on physical robots.

Conclusion: GigaWorld-0 enables scalable synthesis of embodied interaction data that significantly improves generalization and task success on physical robots without requiring real-world interaction during training.

Abstract: World models are emerging as a foundational paradigm for scalable, data-efficient embodied AI. In this work, we present GigaWorld-0, a unified world model framework designed explicitly as a data engine for Vision-Language-Action (VLA) learning. GigaWorld-0 integrates two synergistic components: GigaWorld-0-Video, which leverages large-scale video generation to produce diverse, texture-rich, and temporally coherent embodied sequences under fine-grained control of appearance, camera viewpoint, and action semantics; and GigaWorld-0-3D, which combines 3D generative modeling, 3D Gaussian Splatting reconstruction, physically differentiable system identification, and executable motion planning to ensure geometric consistency and physical realism. Their joint optimization enables the scalable synthesis of embodied interaction data that is visually compelling, spatially coherent, physically plausible, and instruction-aligned. Training at scale is made feasible through our efficient GigaTrain framework, which exploits FP8-precision and sparse attention to drastically reduce memory and compute requirements. We conduct comprehensive evaluations showing that GigaWorld-0 generates high-quality, diverse, and controllable data across multiple dimensions. Critically, VLA models (e.g., GigaBrain-0) trained on GigaWorld-0-generated data achieve strong real-world performance, significantly improving generalization and task success on physical robots without any real-world interaction during training.

[119] ChessMamba: Structure-Aware Interleaving of State Spaces for Change Detection in Remote Sensing Images

Lei Ding, Tong Liu, Xuanguang Liu, Xiangyun Liu, Haitao Guo, Jun Lu

Main category: cs.CV

TL;DR: ChessMamba is a structure-aware framework for change detection in multitemporal remote sensing imagery that addresses heterogeneity and spatiotemporal misalignment issues through interleaved state-space modeling with chessboard serialization and structure-aware fusion.

Details

Motivation: Existing methods based on vision transformers or state-space models disrupt local structural consistency during temporal serialization, obscuring discriminative cues under misalignment and hindering reliable change localization in multitemporal remote sensing imagery.

Method: Integrates SpatialMamba encoder with lightweight cross-source interaction module using: (i) Chessboard interleaving with snake scanning order to serialize multi-temporal features into unified sequence in single forward pass, and (ii) Structure-aware fusion via multi-dilated convolutions to capture center-and-corner neighborhood contexts.

Result: Comprehensive evaluations on three CD tasks (binary CD, semantic CD, multimodal building damage assessment) demonstrate effective fusion of heterogeneous features and substantial accuracy improvements over state-of-the-art methods.

Conclusion: ChessMamba provides a robust framework for change detection that effectively addresses structural consistency and misalignment challenges in multitemporal remote sensing imagery through innovative serialization and fusion techniques.

Abstract: Change detection (CD) in multitemporal remote sensing imagery presents significant challenges for fine-grained recognition, owing to heterogeneity and spatiotemporal misalignment. However, existing methodologies based on vision transformers or state-space models typically disrupt local structural consistency during temporal serialization, obscuring discriminative cues under misalignment and hindering reliable change localization. To address this, we introduce ChessMamba, a structure-aware framework leveraging interleaved state-space modeling for robust CD with multi-temporal inputs. ChessMamba integrates a SpatialMamba encoder with a lightweight cross-source interaction module, featuring two key innovations: (i) Chessboard interleaving with snake scanning order, which serializes multi-temporal features into a unified sequence within a single forward pass, thereby shortening interaction paths and enabling direct comparison for accurate change localization; and (ii) Structure-aware fusion via multi-dilated convolutions, selectively capturing center-and-corner neighborhood contexts within each mono-temporal input. Comprehensive evaluations on three CD tasks, including binary CD, semantic CD and multimodal building damage assessment, demonstrate that ChessMamba effectively fuses heterogeneous features and achieves substantial accuracy improvements over state-of-the-art methods. The relevant code will be available at: github.com/DingLei14/ChessMamba.
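One possible reading of "chessboard interleaving with snake scanning" is sketched below for two single-channel feature grids. The exact layout used by ChessMamba is not given in the abstract, so the H x 2W chessboard arrangement, the function name, and the string tokens are assumptions made purely for illustration.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def chessboard_snake_serialize(feat_a: Sequence[Sequence[T]],
                               feat_b: Sequence[Sequence[T]]) -> List[T]:
    """Merge two H x W grids into one sequence of length 2 * H * W.

    Assumed layout: the grids are woven into an H x 2W board whose cells
    alternate between the two dates like chessboard squares, and rows are
    read in alternating directions (snake order).
    """
    height, width = len(feat_a), len(feat_a[0])
    sequence: List[T] = []
    for i in range(height):
        row: List[T] = []
        for k in range(2 * width):
            j = k // 2  # both dates at spatial position (i, j) sit side by side
            source = feat_a if (i + k) % 2 == 0 else feat_b
            row.append(source[i][j])
        if i % 2 == 1:
            row.reverse()  # snake scan: odd rows run right to left
        sequence.extend(row)
    return sequence


# Toy 2 x 2 grids with labelled tokens from two acquisition dates.
date_a = [["a00", "a01"], ["a10", "a11"]]
date_b = [["b00", "b01"], ["b10", "b11"]]
tokens = chessboard_snake_serialize(date_a, date_b)
# tokens == ["a00", "b00", "a01", "b01", "a11", "b11", "a10", "b10"]
```

Under this layout the two dates' tokens for the same location are always adjacent in the sequence, which matches the stated goal of shortening interaction paths, and the snake order keeps the end of one row spatially next to the start of the following row.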

Junhong Liu, Yuan Zhang, Tao Huang, Wenchao Xu, Renyu Yang

Main category: cs.CV

TL;DR: Frequency-decoupled cross-modal knowledge distillation method that separates low/high-frequency features for better knowledge transfer between vision and language modalities.

Details

Motivation: Traditional KD struggles in cross-modal scenarios due to representation inconsistencies between modalities like vision and language.

Method: Decouples frequency-domain features, applies strong alignment for low-frequency features (high cross-modal consistency) and relaxed alignment for high-frequency features (low similarity), plus scale consistency loss and shared classifier.

Result: Substantially outperforms traditional KD and state-of-the-art cross-modal KD approaches across multiple benchmark datasets.

Conclusion: Frequency decoupling effectively addresses cross-modal knowledge transfer challenges by leveraging the different consistency properties of low and high-frequency features across modalities.

Abstract: Knowledge distillation (KD) has proven highly effective for compressing large models and enhancing the performance of smaller ones. However, its effectiveness diminishes in cross-modal scenarios, such as vision-to-language distillation, where inconsistencies in representation across modalities lead to difficult knowledge transfer. To address this challenge, we propose frequency-decoupled cross-modal knowledge distillation, a method designed to decouple and balance knowledge transfer across modalities by leveraging frequency-domain features. We observed that low-frequency features exhibit high consistency across different modalities, whereas high-frequency features demonstrate extremely low cross-modal similarity. Accordingly, we apply distinct losses to these features: enforcing strong alignment in the low-frequency domain and introducing relaxed alignment for high-frequency features. We also propose a scale consistency loss to address distributional shifts between modalities, and employ a shared classifier to unify feature spaces. Extensive experiments across multiple benchmark datasets show our method substantially outperforms traditional KD and state-of-the-art cross-modal KD approaches. Code is available at https://github.com/Johumliu/FD-CMKD.
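The decoupling idea can be sketched on a 1-D feature vector: split it into low- and high-frequency parts with a discrete Fourier transform, then weight the two alignment terms differently. The cutoff, the use of mean squared error for both terms, and the `high_weight` factor standing in for "relaxed alignment" are assumptions; the paper's actual losses, scale consistency term, and shared classifier are not reproduced here.

```python
import cmath
from typing import List, Sequence, Tuple


def dft(x: Sequence[float]) -> List[complex]:
    """Naive O(n^2) discrete Fourier transform."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]


def idft_real(spec: Sequence[complex]) -> List[float]:
    """Inverse DFT, returning the real part of each sample."""
    n = len(spec)
    return [sum(spec[k] * cmath.exp(2j * cmath.pi * k * t / n) for k in range(n)).real / n
            for t in range(n)]


def split_frequencies(x: Sequence[float], cutoff: int) -> Tuple[List[float], List[float]]:
    """Return (low, high) components; bins with frequency index <= cutoff are low."""
    n = len(x)
    spec = dft(x)
    # min(k, n - k) treats bin k and its mirror n - k as the same frequency,
    # which keeps both reconstructed components real-valued.
    low_spec = [c if min(k, n - k) <= cutoff else 0j for k, c in enumerate(spec)]
    high_spec = [c - low for c, low in zip(spec, low_spec)]
    return idft_real(low_spec), idft_real(high_spec)


def mse(a: Sequence[float], b: Sequence[float]) -> float:
    return sum((p - q) ** 2 for p, q in zip(a, b)) / len(a)


def decoupled_loss(student: Sequence[float], teacher: Sequence[float],
                   cutoff: int = 1, high_weight: float = 0.1) -> float:
    """Full-weight alignment on low frequencies, down-weighted on high (assumed form)."""
    low_s, high_s = split_frequencies(student, cutoff)
    low_t, high_t = split_frequencies(teacher, cutoff)
    return mse(low_s, low_t) + high_weight * mse(high_s, high_t)


low, high = split_frequencies([1.0, -1.0, 1.0, -1.0], cutoff=0)
```

For the alternating toy vector, all energy sits in the highest frequency bin, so the low component is zero and the high component reproduces the input. The two components always sum back to the original vector.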

[121] LiMT: A Multi-task Liver Image Benchmark Dataset

Zhe Liu, Kai Han, Siqi Ma, Yan Zhu, Jun Chen, Chongwen Lyu, Xinyi Qiu, Chengxuan Qian, Yuqing Song, Yi Liu, Liyuan Tian, Yang Ji, Yuefeng Li

Main category: cs.CV

TL;DR: Created LiMT - a multi-task liver dataset for segmentation, classification, and detection using arterial phase CT scans from 150 cases with various liver diseases.

Details

Motivation: Existing CAD datasets are limited to single tasks, constraining development. Need a multi-task dataset to explore task correlations without dataset heterogeneity issues.

Method: Constructed LiMT dataset with 150 CT volumes from different cases including four liver diseases and normal cases. All volumes annotated by experienced clinicians for segmentation, multi-label classification, and detection tasks.

Result: Provides a public multi-task liver dataset with baseline experimental results and review of existing liver-related datasets and methods. Dataset available for research community.

Conclusion: LiMT dataset offers valuable resource for medical imaging research, enabling exploration of task correlations and advancing CAD technology for liver lesion evaluation.

Abstract: Computer-aided diagnosis (CAD) technology can assist clinicians in evaluating liver lesions and intervening with treatment in time. Although CAD technology has advanced in recent years, the application scope of existing datasets remains relatively limited, typically supporting only single tasks, which has somewhat constrained the development of CAD technology. To address the above limitation, in this paper, we construct a multi-task liver dataset (LiMT) used for liver and tumor segmentation, multi-label lesion classification, and lesion detection based on arterial phase-enhanced computed tomography (CT), potentially providing an exploratory solution that is able to explore the correlation between tasks and does not need to worry about the heterogeneity between task-specific datasets during training. The dataset includes CT volumes from 150 different cases, comprising four types of liver diseases as well as normal cases. Each volume has been carefully annotated and calibrated by experienced clinicians. This public multi-task dataset may become a valuable resource for the medical imaging research community in the future. In addition, this paper not only provides relevant baseline experimental results but also reviews existing datasets and methods for liver-related tasks. Our dataset is available at https://drive.google.com/drive/folders/1l9HRK13uaOQTNShf5pwgSz3OTanWjkag?usp=sharing.

[122] VeriSciQA: An Auto-Verified Dataset for Scientific Visual Question Answering

Yuyi Li, Daoyuan Chen, Zhen Wang, Yutong Lu, Yaliang Li

Main category: cs.CV

TL;DR: The paper proposes a Generate-then-Verify framework to create VeriSciQA, a high-quality SVQA dataset that addresses systematic errors in existing synthetic datasets through cross-modal consistency checks.

Details

Motivation: Open-source LVLMs struggle with Scientific Visual Question Answering due to lack of large-scale, high-quality datasets, and existing synthetic datasets contain systematic errors from LVLM limitations and information asymmetry.

Method: A verification-centric framework that generates QA pairs with figure-associated textual context, then applies cross-modal consistency checks against figures and auxiliary filters to eliminate erroneous pairs.

Result: Created VeriSciQA with 20,351 QA pairs spanning 20 scientific domains and 12 figure types. Open-source models achieve 64% accuracy vs 82% for proprietary models, and models fine-tuned on VeriSciQA show consistent improvements on SVQA benchmarks.

Conclusion: The scalable framework enables continued data expansion to advance SVQA capability in the open-source community, with human evaluation validating VeriSciQA’s superior correctness.

Abstract: Large Vision-Language Models (LVLMs) show promise for scientific applications, yet open-source models still struggle with Scientific Visual Question Answering (SVQA), namely answering questions about figures from scientific papers. A key bottleneck lies in the lack of public, large-scale, high-quality SVQA datasets. Although recent work uses LVLMs to synthesize data at scale, we identify systematic errors in their resulting QA pairs, stemming from LVLMs’ inherent limitations and information asymmetry between figures and text. To address these challenges, we propose a verification-centric Generate-then-Verify framework that first generates QA pairs with figure-associated textual context, then applies cross-modal consistency checks against figures along with auxiliary filters to eliminate erroneous pairs. We instantiate this framework to curate VeriSciQA, a dataset of 20,351 QA pairs spanning 20 scientific domains and 12 figure types. VeriSciQA poses a challenging benchmark for open-source models, with a substantial accuracy gap between the leading open-source models (64%) and a proprietary model (82%). Moreover, models fine-tuned on VeriSciQA achieve consistent improvements on SVQA benchmarks, with performance gains that scale with data size and surpass models trained on existing datasets. Human evaluation further validates the superior correctness of VeriSciQA. Together, this evidence demonstrates that continued data expansion by our scalable framework can further advance SVQA capability in the open-source community.
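The control flow of a generate-then-verify pipeline can be sketched as a filter over candidate QA pairs. Everything concrete below is hypothetical: the `QAPair` fields, the `verify` callable (which in the paper's setting would be a cross-modal check against the figure), and the toy lookup table that stands in for a real model call.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Sequence


@dataclass(frozen=True)
class QAPair:
    figure_id: str
    question: str
    answer: str


def generate_then_verify(candidates: Iterable[QAPair],
                         verify: Callable[[QAPair], bool],
                         filters: Sequence[Callable[[QAPair], bool]] = ()) -> List[QAPair]:
    """Keep only candidates that pass the consistency check and every auxiliary filter."""
    kept: List[QAPair] = []
    for qa in candidates:
        if not verify(qa):
            continue  # answer is not supported by the figure
        if all(check(qa) for check in filters):
            kept.append(qa)
    return kept


# Hypothetical stand-in for figure-grounded verification: a table of what
# each figure actually shows. A real system would query a vision model here.
FIGURE_FACTS = {"fig1": "method B", "fig2": "2019"}


def toy_verify(qa: QAPair) -> bool:
    return FIGURE_FACTS.get(qa.figure_id) == qa.answer


def non_trivial(qa: QAPair) -> bool:
    return len(qa.question.split()) >= 4  # hypothetical auxiliary filter


candidates = [
    QAPair("fig1", "Which method has the highest accuracy?", "method B"),
    QAPair("fig1", "Which method has the lowest latency?", "method A"),
    QAPair("fig2", "Which year?", "2019"),
]
verified = generate_then_verify(candidates, toy_verify, [non_trivial])
```

Here the second candidate is dropped because its answer contradicts the figure, and the third is dropped by the auxiliary filter even though its answer is correct. Generation stays cheap while verification decides what enters the dataset.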

[123] Agent0-VL: Exploring Self-Evolving Agent for Tool-Integrated Vision-Language Reasoning

Jiaqi Liu, Kaiwen Xiong, Peng Xia, Yiyang Zhou, Haonian Ji, Lu Feng, Siwei Han, Mingyu Ding, Huaxiu Yao

Main category: cs.CV

TL;DR: Agent0-VL is a self-evolving vision-language agent that uses tool-integrated reasoning for self-evaluation and self-repair, achieving 12.5% improvement over base models without external rewards.

Details

Motivation: Overcome limitations of human-annotated supervision in vision-language agents by enabling self-rewarding and addressing text-based self-evaluation struggles with visual reasoning verification.

Method: Unifies Solver and Verifier roles within a single LVLM, using tool-integrated reasoning for multi-step problem solving and tool-grounded critique for structured feedback and self-rewards through a Self-Evolving Reasoning Cycle.

Result: Achieves 12.5% improvement over base model on geometric problem solving and visual scientific analysis tasks through zero-external-reward evolution.

Conclusion: Agent0-VL enables continual self-improvement by aligning reasoning and verification behaviors without human annotation or external reward models, demonstrating effective tool-integrated self-evolution.

Abstract: Vision-language agents have achieved remarkable progress in a variety of multimodal reasoning tasks; however, their learning remains constrained by the limitations of human-annotated supervision. Recent self-rewarding approaches attempt to overcome this constraint by allowing models to act as their own critics or reward providers. Yet, purely text-based self-evaluation struggles to verify complex visual reasoning steps and often suffers from evaluation hallucinations. To address these challenges, inspired by recent advances in tool-integrated reasoning, we propose Agent0-VL, a self-evolving vision-language agent that achieves continual improvement with tool-integrated reasoning. Agent0-VL incorporates tool usage not only into reasoning but also into self-evaluation and self-repair, enabling the model to introspect, verify, and refine its reasoning through evidence-grounded analysis. It unifies two synergistic roles within a single LVLM: a Solver that performs multi-turn tool-integrated reasoning, and a Verifier that generates structured feedback and fine-grained self-rewards through tool-grounded critique. These roles interact through a Self-Evolving Reasoning Cycle, where tool-based verification and reinforcement learning jointly align the reasoning and evaluation distributions for stable self-improvement. Through this zero-external-reward evolution, Agent0-VL aligns its reasoning and verification behaviors without any human annotation or external reward models, achieving continual self-improvement. Experiments on geometric problem solving and visual scientific analysis show that Agent0-VL achieves a 12.5% improvement over the base model. Our code is available at https://github.com/aiming-lab/Agent0/Agent0-VL.

[124] MHB: Multimodal Handshape-aware Boundary Detection for Continuous Sign Language Recognition

Mingyu Zhao, Zhanfu Yang, Yang Zhou, Zhaoyang Xia, Can Jin, Xiaoxiao He, Carol Neidle, Dimitris N. Metaxas

Main category: cs.CV

TL;DR: A multimodal approach for continuous ASL sign recognition that combines 3D skeletal features and handshape classification to detect sign boundaries, then recognizes segmented signs using a model trained on both isolated and continuous signing data.

Details

Motivation: To improve continuous sign language recognition by addressing the challenge of detecting sign boundaries in continuous signing, where signs often differ from their citation forms and boundaries are not clearly defined.

Method: Uses 3D skeletal features to capture sign dynamics at boundaries, pretrains a handshape classifier on 87 canonical categories, fuses multimodal features for boundary detection, and trains recognition on both isolated and continuous signing data.

Result: Significant improvements over previous work when evaluated on the ASLLRP corpus.

Conclusion: The multimodal approach combining boundary detection with handshape information and training on diverse data sources effectively improves continuous sign recognition performance.

Abstract: This paper presents a multimodal approach for continuous sign recognition that first uses machine learning to detect the start and end frames of signs in videos of American Sign Language (ASL) sentences, and then recognizes the segmented signs. For improved robustness, we use 3D skeletal features extracted from sign language videos to capture the convergence of sign properties and their dynamics, which tend to cluster at sign boundaries. Another focus of this work is the incorporation of information from 3D handshape for boundary detection. To detect handshapes normally expected at the beginning and end of signs, we pretrain a handshape classifier for 87 linguistically defined canonical handshape categories using a dataset that we created by integrating and normalizing several existing datasets. A multimodal fusion module is then used to unify the pretrained sign video segmentation framework and the handshape classification models. Finally, the estimated boundaries are used for sign recognition, where the recognition model is trained on a large database containing both citation-form isolated signs and signs pre-segmented (based on manual annotations) from continuous signing, as such signs often differ in certain respects. We evaluate our method on the ASLLRP corpus and demonstrate significant improvements over previous work.

[125] Motion Marionette: Rethinking Rigid Motion Transfer via Prior Guidance

Haoxuan Wang, Jiachen Tao, Junyi Wu, Gaowen Liu, Ramana Rao Kompella, Yan Yan

Main category: cs.CV

TL;DR: Motion Marionette is a zero-shot framework for rigid motion transfer from source videos to target images using internal spatial-temporal priors instead of external constraints.

Details

Motivation: Previous methods use external priors (geometric, generative, or simulation) that create trade-offs between generalizability and temporal consistency. The goal is to overcome these limitations by using internal priors that capture spatial-temporal transformations shared between source and target.

Method: Lift source video and target image into unified 3D representation space, extract motion trajectories to create spatial-temporal (SpaT) prior independent of object geometry/semantics, integrate with target object to synthesize controllable velocity field, refine using Position-Based Dynamics to reduce artifacts.

Result: The framework generalizes across diverse objects, produces temporally consistent videos aligned with source motion, and supports controllable video generation.

Conclusion: Motion Marionette successfully addresses limitations of external priors by using internal spatial-temporal transformations, enabling efficient and consistent motion transfer without compromising generalizability.

Abstract: We present Motion Marionette, a zero-shot framework for rigid motion transfer from monocular source videos to single-view target images. Previous works typically employ geometric, generative, or simulation priors to guide the transfer process, but these external priors introduce auxiliary constraints that lead to trade-offs between generalizability and temporal consistency. To address these limitations, we propose guiding the motion transfer process through an internal prior that exclusively captures the spatial-temporal transformations and is shared between the source video and any transferred target video. Specifically, we first lift both the source video and the target image into a unified 3D representation space. Motion trajectories are then extracted from the source video to construct a spatial-temporal (SpaT) prior that is independent of object geometry and semantics, encoding relative spatial variations over time. This prior is further integrated with the target object to synthesize a controllable velocity field, which is subsequently refined using Position-Based Dynamics to mitigate artifacts and enhance visual coherence. The resulting velocity field can be flexibly employed for efficient video production. Empirical results demonstrate that Motion Marionette generalizes across diverse objects, produces temporally consistent videos that align well with the source motion, and supports controllable video generation.

[126] Reasoning-VLA: A Fast and General Vision-Language-Action Reasoning Model for Autonomous Driving

Dapeng Zhang, Zhenlong Yuan, Zhangquan Chen, Chih-Ting Liao, Yinda Chen, Fei Shen, Qingguo Zhou, Tat-Seng Chua

Main category: cs.CV

TL;DR: Reasoning-VLA is a fast action-generation VLA framework for autonomous driving that uses learnable action queries and reasoning-enhanced features to generate continuous trajectories in parallel, achieving state-of-the-art performance with excellent inference speed.

Details

Motivation: Existing Vision-Language-Action models struggle with efficient inference and generalization to novel vehicle configurations and driving scenarios.

Method: Uses learnable action queries initialized via Gaussian sampling from ground-truth trajectories, interacts with reasoning-enhanced vision-language features to generate continuous action trajectories in parallel, and consolidates eight datasets into standardized Chain-of-Thought reasoning format with supervised learning and reinforcement learning fine-tuning.

Result: Achieves state-of-the-art performance, superior generalization capability, and the best inference speed reported to date across multiple benchmarks.

Conclusion: Reasoning-VLA provides a general and fast action-generation framework that effectively addresses efficiency and generalization challenges in autonomous driving VLA models.

Abstract: Vision-Language-Action (VLA) models have recently shown strong decision-making capabilities in autonomous driving. However, existing VLAs often struggle with achieving efficient inference and generalizing to novel autonomous vehicle configurations and driving scenarios. In this paper, we propose Reasoning-VLA, a general and fast action-generation VLA framework. The proposed model employs a set of learnable action queries, initialized via Gaussian sampling from ground-truth trajectories within the training corpus. These learnable queries interact with reasoning-enhanced vision-language features to generate continuous action trajectories in parallel. To promote robust generalization, we consolidate eight publicly available autonomous driving datasets into a standardized, Chain-of-Thought reasoning-based, and easy-to-use data format for model training. The model is trained with both supervised learning and reinforcement learning fine-tuning, and extensive empirical evaluations across multiple benchmarks demonstrate that Reasoning-VLA achieves state-of-the-art performance, superior generalization capability, and the best inference speed reported to date.

[127] Coupled Physics-Gated Adaptation: Spatially Decoding Volumetric Photochemical Conversion in Complex 3D-Printed Objects

Maryam Eftekharifar, Churun Zhang, Jialiang Wei, Xudong Cao, Hossein Heidari

Main category: cs.CV

TL;DR: Presents a framework for predicting photochemical conversion in 3D printed objects using computer vision, introducing a new task of predicting non-visual volumetric properties from 3D visual data.

Details

Motivation: To enable virtual chemical characterization of 3D printed objects without traditional post-print measurements, addressing the challenge of predicting coupled physical interactions that conventional vision models cannot handle.

Method: Proposes Coupled Physics-Gated Adaptation (C-PGA), a multimodal fusion architecture that uses sparse geometrical and process parameters as queries to dynamically gate and adapt dense visual features via feature-wise linear modulation (FiLM), processing dual 3D visual streams from raw projection stacks and their diffusion-diffraction corrected counterparts.

Result: The approach offers breakthrough capability in virtual chemical characterization, enabling precise control over chemical conversion state in 3D printed objects.

Conclusion: The C-PGA framework successfully addresses the challenging task of predicting dense non-visual volumetric physical properties from 3D visual data by explicitly modeling physical coupling, eliminating the need for traditional post-print measurements.

Abstract: We present a framework that pioneers the prediction of photochemical conversion in complex three-dimensionally printed objects, introducing a challenging new computer vision task: predicting dense, non-visual volumetric physical properties from 3D visual data. This approach leverages the largest-ever optically printed 3D specimen dataset, comprising a large family of parametrically designed complex minimal surface structures that have undergone terminal chemical characterisation. Conventional vision models are ill-equipped for this task, as they lack an inductive bias for the coupled, non-linear interactions of optical physics (diffraction, absorption) and material physics (diffusion, convection) that govern the final chemical state. To address this, we propose Coupled Physics-Gated Adaptation (C-PGA), a novel multimodal fusion architecture. Unlike standard concatenation, C-PGA explicitly models physical coupling by using sparse geometrical and process parameters (e.g., surface transport, print layer height) as a Query to dynamically gate and adapt the dense visual features via feature-wise linear modulation (FiLM). This mechanism spatially modulates dual 3D visual streams (extracted by parallel 3D-CNNs processing raw projection stacks and their diffusion-diffraction corrected counterparts), allowing the model to recalibrate its visual perception based on the physical context. This approach offers a breakthrough in virtual chemical characterisation, eliminating the need for traditional post-print measurements and enabling precise control over the chemical conversion state.
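
The FiLM step mentioned above can be illustrated with a minimal sketch. It only shows generic feature-wise linear modulation (a per-channel scale and shift predicted from conditioning parameters); it is not the C-PGA architecture. The function names, the linear conditioning map, and the weight matrices `w_gamma` and `w_beta` are hypothetical.

```python
from typing import List, Tuple

Matrix = List[List[float]]


def conditioning_to_film(params: List[float], w_gamma: Matrix, w_beta: Matrix) -> Tuple[List[float], List[float]]:
    """Map sparse process parameters to per-channel (gamma, beta).

    Assumption: a plain linear map, with gamma offset by 1 so that zero
    weights leave the features unchanged.
    """
    gamma = [1.0 + sum(w * p for w, p in zip(row, params)) for row in w_gamma]
    beta = [sum(w * p for w, p in zip(row, params)) for row in w_beta]
    return gamma, beta


def film(features: Matrix, gamma: List[float], beta: List[float]) -> Matrix:
    """Feature-wise linear modulation: out[c][i] = gamma[c] * features[c][i] + beta[c]."""
    return [[g * v + b for v in channel] for channel, g, b in zip(features, gamma, beta)]


# Two channels of visual features, two conditioning parameters (illustrative values).
gamma, beta = conditioning_to_film([0.5, 2.0], [[1.0, 0.0], [0.0, -0.5]], [[0.0, 1.0], [0.0, 0.0]])
modulated = film([[1.0, 2.0], [3.0, 4.0]], gamma, beta)
```

Here the second channel is gated off entirely (gamma is 0), which is the sense in which the conditioning parameters can "gate" visual features.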

[128] Scale Where It Matters: Training-Free Localized Scaling for Diffusion Models

Qin Ren, Yufei Wang, Lanqing Guo, Wen Zhang, Zhiwen Fan, Chenyu You

Main category: cs.CV

TL;DR: LoTTS introduces localized test-time scaling for diffusion models, adaptively resampling only defective image regions instead of the entire image, reducing computation by 2-4x while improving quality.

Details

Motivation: Existing test-time scaling methods operate at full-image level, wasting computation on satisfactory regions and providing insufficient correction for localized defects, leading to inefficient resource usage.

Method: LoTTS uses cross- and self-attention contrast under quality-aware prompts to identify defective regions, refines them into coherent masks, then perturbs and denoises only these regions locally while preserving high-quality areas.

Result: LoTTS achieves state-of-the-art performance on SD2.1, SDXL, and FLUX, improving both local quality and global fidelity while reducing GPU cost by 2-4x compared to Best-of-N sampling.

Conclusion: Localized test-time scaling is a promising new direction for efficient scaling of diffusion models at inference time, enabling targeted quality improvements with significantly reduced computational cost.

Abstract: Diffusion models have become the dominant paradigm in text-to-image generation, and test-time scaling (TTS) further improves quality by allocating more computation during inference. However, existing TTS methods operate at the full-image level, overlooking the fact that image quality is often spatially heterogeneous. This leads to unnecessary computation on already satisfactory regions and insufficient correction of localized defects. In this paper, we explore a new direction - Localized TTS - that adaptively resamples defective regions while preserving high-quality regions, thereby substantially reducing the search space. This paradigm poses two central challenges: accurately localizing defects and maintaining global consistency. We propose LoTTS, the first fully training-free framework for localized TTS. For defect localization, LoTTS contrasts cross- and self-attention signals under quality-aware prompts (e.g., high-quality vs. low-quality) to identify defective regions, and then refines them into coherent masks. For consistency, LoTTS perturbs only defective regions and denoises them locally, ensuring that corrections remain confined while the rest of the image remains undisturbed. Extensive experiments on SD2.1, SDXL, and FLUX demonstrate that LoTTS achieves state-of-the-art performance: it consistently improves both local quality and global fidelity, while reducing GPU cost by 2-4x compared to Best-of-N sampling. These findings establish localized TTS as a promising new direction for scaling diffusion models at inference time.
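
The idea of resampling only defective regions can be sketched as a masked Best-of-N loop. This is a toy illustration under strong assumptions: the image is a flat list of floats, `resample` stands in for the perturb-and-denoise step, and `score` stands in for a quality verifier. None of these names come from the paper.

```python
import random
from typing import Callable, List


def localized_best_of_n(
    image: List[float],
    mask: List[int],
    resample: Callable[[List[float], random.Random], List[float]],
    score: Callable[[List[float]], float],
    n: int = 4,
    seed: int = 0,
) -> List[float]:
    """Draw n candidates, but only accept new values where mask == 1.

    Pixels outside the mask are always copied from the original image, so
    the search is confined to the defective region.
    """
    rng = random.Random(seed)
    best, best_score = image, score(image)
    for _ in range(n):
        candidate = resample(image, rng)
        # Composite: candidate inside the mask, original everywhere else.
        merged = [c if m else o for o, c, m in zip(image, candidate, mask)]
        s = score(merged)
        if s > best_score:
            best, best_score = merged, s
    return best
```

Because unmasked pixels never change, the result can only differ from the input inside the mask, and its score is never lower than the input's.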

[129] HybriDLA: Hybrid Generation for Document Layout Analysis

Yufan Chen, Omar Moured, Ruiping Liu, Junwei Zheng, Kunyu Peng, Jiaming Zhang, Rainer Stiefelhagen

Main category: cs.CV

TL;DR: HybriDLA is a novel generative framework that combines diffusion and autoregressive decoding for document layout analysis, achieving state-of-the-art 83.5% mAP on modern documents with complex layouts.

Details

Motivation: Traditional document layout analysis methods struggle with contemporary documents that have diverse element counts and complex layouts, as they rely on fixed queries or empirical priors.

Method: Unifies diffusion and autoregressive decoding in a single layer: diffusion refines bounding-box hypotheses iteratively, while autoregressive component adds semantic and contextual awareness. Uses multi-scale feature-fusion encoder to capture both fine-grained and high-level visual cues.

Result: Achieves 83.5% mean Average Precision (mAP), setting state-of-the-art performance on DocLayNet and M$^6$Doc benchmarks, outperforming previous approaches.

Conclusion: HybriDLA effectively addresses challenges of modern document layout analysis through its hybrid generative framework and demonstrates superior performance on complex document layouts.

Abstract: Conventional document layout analysis (DLA) depends on empirical priors or a fixed set of learnable queries executed in a single forward pass. While sufficient for early-generation documents with a small, predetermined number of regions, this paradigm struggles with contemporary documents, which exhibit diverse element counts and increasingly complex layouts. To address challenges posed by modern documents, we present HybriDLA, a novel generative framework that unifies diffusion and autoregressive decoding within a single layer. The diffusion component iteratively refines bounding-box hypotheses, whereas the autoregressive component injects semantic and contextual awareness, enabling precise region prediction even in highly varied layouts. To further enhance detection quality, we design a multi-scale feature-fusion encoder that captures both fine-grained and high-level visual cues. This architecture elevates performance to 83.5% mean Average Precision (mAP). Extensive experiments on the DocLayNet and M$^6$Doc benchmarks demonstrate that HybriDLA sets a new state of the art, outperforming previous approaches. All data and models will be made publicly available at https://yufanchen96.github.io/projects/HybriDLA.

[130] Intelligent Image Search Algorithms Fusing Visual Large Models

Kehan Wang, Tingqiong Cui, Yang Zhang, Yu Chen, Shifeng Wu, Zhenzhang Li

Main category: cs.CV

TL;DR: DetVLM is a novel image retrieval framework that combines object detection with Visual Large Models (VLMs) to enable state-specific and zero-shot search capabilities for fine-grained image analysis.

Details

Motivation: Conventional methods for fine-grained image retrieval have limitations: manual features lack robustness, deep learning detectors can't perform state-specific retrieval or zero-shot search, and VLMs have poor spatial grounding and high computational costs.

Method: A two-stage pipeline: YOLO detector first performs efficient component-level screening, then VLM acts as recall-enhancement unit for secondary verification. This enables state search (verifying component existence and state judgments) and zero-shot search (recognizing unseen components).

Result: Achieves 94.82% overall retrieval accuracy on vehicle component dataset, significantly outperforming detection-only baselines. Also attains 94.95% accuracy in zero-shot search for driver mask-wearing and over 90% average accuracy in state search tasks.

Conclusion: DetVLM successfully bridges gaps in fine-grained image retrieval by synergistically fusing object detection with VLMs, enabling advanced capabilities like state-specific and zero-shot search with high accuracy and efficiency.

Abstract: Fine-grained image retrieval, which aims to find images containing specific object components and assess their detailed states, is critical in fields like security and industrial inspection. However, conventional methods face significant limitations: manual features (e.g., SIFT) lack robustness; deep learning-based detectors (e.g., YOLO) can identify component presence but cannot perform state-specific retrieval or zero-shot search; Visual Large Models (VLMs) offer semantic and zero-shot capabilities but suffer from poor spatial grounding and high computational cost, making them inefficient for direct retrieval. To bridge these gaps, this paper proposes DetVLM, a novel intelligent image search framework that synergistically fuses object detection with VLMs. The framework pioneers a search-enhancement paradigm via a two-stage pipeline: a YOLO detector first conducts efficient, high-recall component-level screening to determine component presence; then, a VLM acts as a recall-enhancement unit, performing secondary verification for components missed by the detector. This architecture directly enables two advanced capabilities: 1) State Search: Guided by task-specific prompts, the VLM refines results by verifying component existence and executing sophisticated state judgments (e.g., “sun visor lowered”), allowing retrieval based on component state. 2) Zero-shot Search: The framework leverages the VLM’s inherent zero-shot capability to recognize and retrieve images containing unseen components or attributes (e.g., “driver wearing a mask”) without any task-specific training. Experiments on a vehicle component dataset show DetVLM achieves a state-of-the-art overall retrieval accuracy of 94.82%, significantly outperforming detection-only baselines. It also attains 94.95% accuracy in zero-shot search for driver mask-wearing and over 90% average accuracy in state search tasks.
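
The two-stage control flow described above (detector screening first, VLM verification only for what the detector missed) can be sketched as follows. The callables `detect` and `vlm_verify` are hypothetical stand-ins for a YOLO detector and a prompted VLM; their signatures are assumptions, not the authors' interface.

```python
from typing import Callable, Iterable, List, Set


def two_stage_search(
    images: Iterable[str],
    component: str,
    detect: Callable[[str], Set[str]],
    vlm_verify: Callable[[str, str], bool],
) -> List[str]:
    """Return images that contain `component`.

    Stage 1: the cheap detector screens every image.
    Stage 2: the costly VLM is called only for images the detector rejected,
    acting as a recall-enhancement check.
    """
    hits = []
    for img in images:
        if component in detect(img):
            hits.append(img)
        elif vlm_verify(img, component):
            hits.append(img)
    return hits
```

The design point is cost: the VLM runs only on the detector's negatives, not on every image.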

[131] Context-Aware Token Pruning and Discriminative Selective Attention for Transformer Tracking

Janani Kugarajeevan, Thanikasalam Kokul, Amirthalingam Ramanan, Subha Fernando

Main category: cs.CV

TL;DR: CPDATrack is a Transformer-based tracker that suppresses background and distractor interference through token pruning and selective attention mechanisms, achieving state-of-the-art performance.

Details

Motivation: Existing one-stream Transformer trackers suffer from excessive background search tokens attending to target templates, weakening discriminative capability, while current token pruning methods often remove essential contextual information near targets.

Method: Integrates a learnable module to estimate target association probabilities, prunes less-informative background tokens while preserving context, employs discriminative selective attention to block search-to-template attention in early layers, and selectively extracts high-probability target tokens from localized regions in later layers.

Result: Achieves state-of-the-art performance across multiple benchmarks, with 75.1% average overlap on GOT-10k.

Conclusion: CPDATrack effectively suppresses background and distractor interference while maintaining computational efficiency, demonstrating superior tracking performance through targeted token pruning and selective attention mechanisms.

Abstract: One-stream Transformer-based trackers have demonstrated remarkable performance by concatenating template and search region tokens, thereby enabling joint attention across all tokens. However, enabling an excessive proportion of background search tokens to attend to the target template tokens weakens the tracker’s discriminative capability. Several token pruning methods have been proposed to mitigate background interference; however, they often remove tokens near the target, leading to the loss of essential contextual information and degraded tracking performance. Moreover, the presence of distractors within the search tokens further reduces the tracker’s ability to accurately identify the target. To address these limitations, we propose CPDATrack, a novel tracking framework designed to suppress interference from background and distractor tokens while enhancing computational efficiency. First, a learnable module is integrated between two designated encoder layers to estimate the probability of each search token being associated with the target. Based on these estimates, less-informative background tokens are pruned from the search region while preserving the contextual cues surrounding the target. To further suppress background interference, a discriminative selective attention mechanism is employed that fully blocks search-to-template attention in the early layers. In the subsequent encoder layers, high-probability target tokens are selectively extracted from a localized region to attend to the template tokens, thereby reducing the influence of background and distractor tokens. The proposed CPDATrack achieves state-of-the-art performance across multiple benchmarks, particularly on GOT-10k, where it attains an average overlap of 75.1 percent.
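
Two of the mechanisms above, pruning search tokens by estimated target probability and blocking search-to-template attention, can be sketched in a few lines. This is a simplified illustration: the plain top-k rule does not model the paper's context preservation around the target, and the function names and `keep_ratio` parameter are hypothetical.

```python
from typing import List


def prune_search_tokens(probs: List[float], keep_ratio: float) -> List[int]:
    """Keep the indices of the highest-probability search tokens, in original order.

    Assumption: plain top-k selection; at least one token is always kept.
    """
    k = max(1, int(len(probs) * keep_ratio))
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    return sorted(top)


def build_attention_mask(n_template: int, n_search: int, block_search_to_template: bool) -> List[List[bool]]:
    """mask[q][k] is True when query token q may attend to key token k.

    Template tokens come first, then search tokens. When blocking is on,
    search queries cannot attend to template keys (the early-layer setting).
    """
    n = n_template + n_search
    mask = [[True] * n for _ in range(n)]
    if block_search_to_template:
        for q in range(n_template, n):
            for k in range(n_template):
                mask[q][k] = False
    return mask
```

In later layers, the same mask builder could be reused with blocking lifted only for the selected high-probability tokens; that refinement is omitted here.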

[132] Image Diffusion Models Exhibit Emergent Temporal Propagation in Videos

Youngseo Kim, Dohyun Kim, Geonhee Han, Paul Hongsuck Seo

Main category: cs.CV

TL;DR: DRIFT is a zero-shot object tracking framework that repurposes diffusion model self-attention maps as semantic propagation kernels for video object segmentation, achieving state-of-the-art performance without training.

Details

Motivation: Image diffusion models capture rich semantic structures that can be leveraged for recognition and localization tasks beyond image generation, particularly for object tracking in videos.

Method: Reinterpret diffusion model self-attention maps as semantic label propagation kernels, extend to temporal propagation for videos, use test-time optimization (DDIM inversion, textual inversion, adaptive head weighting), and refine masks with SAM.

Result: Achieves state-of-the-art zero-shot performance on standard video object segmentation benchmarks.

Conclusion: Pretrained image diffusion models can be effectively adapted for robust object tracking through semantic propagation mechanisms and mask refinement, demonstrating their versatility beyond generation tasks.

Abstract: Image diffusion models, though originally developed for image generation, implicitly capture rich semantic structures that enable various recognition and localization tasks beyond synthesis. In this work, we show that their self-attention maps can be reinterpreted as semantic label propagation kernels, providing robust pixel-level correspondences between relevant image regions. Extending this mechanism across frames yields a temporal propagation kernel that enables zero-shot object tracking via segmentation in videos. We further demonstrate the effectiveness of test-time optimization strategies (DDIM inversion, textual inversion, and adaptive head weighting) in adapting diffusion features for robust and consistent label propagation. Building on these findings, we introduce DRIFT, a framework for object tracking in videos leveraging a pretrained image diffusion model with SAM-guided mask refinement, achieving state-of-the-art zero-shot performance on standard video object segmentation benchmarks.
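
Treating attention as a label propagation kernel can be sketched generically: each query location in the next frame takes a softmax-weighted average of labels from the reference frame. The sketch uses plain dot-product attention on toy feature vectors. It does not reproduce DRIFT's diffusion features, head weighting, or SAM refinement, and all names are illustrative.

```python
import math
from typing import List


def softmax(xs: List[float]) -> List[float]:
    """Numerically stable softmax."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def propagate_labels(
    queries: List[List[float]],
    keys: List[List[float]],
    key_labels: List[List[float]],
    temperature: float = 1.0,
) -> List[List[float]]:
    """Propagate per-key label distributions to each query.

    weights = softmax(q . k / temperature); output = weights @ key_labels.
    """
    n_classes = len(key_labels[0])
    out = []
    for q in queries:
        logits = [sum(a * b for a, b in zip(q, k)) / temperature for k in keys]
        weights = softmax(logits)
        out.append([sum(w * lab[c] for w, lab in zip(weights, key_labels)) for c in range(n_classes)])
    return out
```

With one-hot labels on the reference frame, each output row is a soft class distribution that can be thresholded or refined into a mask.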

[133] Low-Resolution Editing is All You Need for High-Resolution Editing

Junsung Lee, Hyunsoo Lee, Yong Jae Lee, Bohyung Han

Main category: cs.CV

TL;DR: Proposes a test-time optimization framework for high-resolution image editing that performs patch-wise optimization with detail transfer and synchronization to maintain consistency.

Details

Motivation: Existing image editing approaches are limited to low-resolution settings (typically up to 1K), while high-resolution content creation is emerging as a central challenge requiring effective, controllable manipulation mechanisms.

Method: Patch-wise optimization on high-resolution source images, followed by fine-grained detail transfer module and novel synchronization strategy to maintain consistency across patches.

Result: Extensive experiments show the method produces high-quality edits, facilitating high-resolution content creation.

Conclusion: The proposed framework addresses the challenge of high-resolution image editing and enables effective content generation that aligns with user intent.

Abstract: High-resolution content creation is rapidly emerging as a central challenge in both the vision and graphics communities. While images serve as the most fundamental modality for visual expression, content generation that aligns with the user intent requires effective, controllable high-resolution image manipulation mechanisms. However, existing approaches remain limited to low-resolution settings, typically supporting only up to 1K resolution. In this work, we introduce the task of high-resolution image editing and propose a test-time optimization framework to address it. Our method performs patch-wise optimization on high-resolution source images, followed by a fine-grained detail transfer module and a novel synchronization strategy to maintain consistency across patches. Extensive experiments show that our method produces high-quality edits, paving the way toward high-resolution content creation.

[134] Supervise Less, See More: Training-free Nuclear Instance Segmentation with Prototype-Guided Prompting

Wen Zhang, Qin Ren, Wenjing Liu, Haibin Ling, Chenyu You

Main category: cs.CV

TL;DR: SPROUT is a training-free framework for nuclear instance segmentation that uses histology-informed prototypes and partial optimal transport to generate prompts for SAM, achieving competitive performance without supervision.

Details

Motivation: Existing nuclear segmentation methods rely on dense supervision and expensive fine-tuning, creating a need for training-free approaches that can scale effectively in computational pathology.

Method: Leverages histology-informed priors to create slide-specific reference prototypes, uses partial optimal transport for feature alignment, and transforms features into positive/negative point prompts for SAM without parameter updates.

Result: Extensive experiments across multiple histopathology benchmarks show SPROUT achieves competitive performance without supervision or retraining.

Conclusion: SPROUT establishes a novel paradigm for scalable, training-free nuclear instance segmentation in pathology, demonstrating the viability of prompt-based approaches without parameter updates.

Abstract: Accurate nuclear instance segmentation is a pivotal task in computational pathology, supporting data-driven clinical insights and facilitating downstream translational applications. While large vision foundation models have shown promise for zero-shot biomedical segmentation, most existing approaches still depend on dense supervision and computationally expensive fine-tuning. Consequently, training-free methods present a compelling research direction, yet remain largely unexplored. In this work, we introduce SPROUT, a fully training- and annotation-free prompting framework for nuclear instance segmentation. SPROUT leverages histology-informed priors to construct slide-specific reference prototypes that mitigate domain gaps. These prototypes progressively guide feature alignment through a partial optimal transport scheme. The resulting foreground and background features are transformed into positive and negative point prompts, enabling the Segment Anything Model (SAM) to produce precise nuclear delineations without any parameter updates. Extensive experiments across multiple histopathology benchmarks demonstrate that SPROUT achieves competitive performance without supervision or retraining, establishing a novel paradigm for scalable, training-free nuclear instance segmentation in pathology.

[135] GFT-GCN: Privacy-Preserving 3D Face Mesh Recognition with Spectral Diffusion

Hichem Felouat, Hanrui Wang, Isao Echizen

Main category: cs.CV

TL;DR: GFT-GCN is a privacy-preserving 3D face recognition framework that combines spectral graph learning with diffusion-based template protection to achieve secure authentication while maintaining high accuracy.

Details

Motivation: 3D face recognition provides robust biometric authentication but requires protection of stored templates against security threats. The need for privacy-preserving solutions that maintain recognition performance drives this research.

Method: Integrates Graph Fourier Transform (GFT) and Graph Convolutional Networks (GCN) to extract spectral features from 3D face meshes, then applies spectral diffusion for template protection. Uses client-server architecture to keep raw biometric data on client devices.

Result: Achieves high recognition accuracy on BU-3DFE and FaceScape datasets with strong resistance to reconstruction attacks. Effectively balances privacy protection and recognition performance.

Conclusion: GFT-GCN provides a practical solution for secure 3D face authentication by combining spectral feature extraction with diffusion-based template protection, offering both privacy and performance.

Abstract: 3D face recognition offers a robust biometric solution by capturing facial geometry, providing resilience to variations in illumination, pose changes, and presentation attacks. Its strong spoof resistance makes it suitable for high-security applications, but protecting stored biometric templates remains critical. We present GFT-GCN, a privacy-preserving 3D face recognition framework that combines spectral graph learning with diffusion-based template protection. Our approach integrates the Graph Fourier Transform (GFT) and Graph Convolutional Networks (GCN) to extract compact, discriminative spectral features from 3D face meshes. To secure these features, we introduce a spectral diffusion mechanism that produces irreversible, renewable, and unlinkable templates. A lightweight client-server architecture ensures that raw biometric data never leaves the client device. Experiments on the BU-3DFE and FaceScape datasets demonstrate high recognition accuracy and strong resistance to reconstruction attacks. Results show that GFT-GCN effectively balances privacy and performance, offering a practical solution for secure 3D face authentication.

[136] MambaEye: A Size-Agnostic Visual Encoder with Causal Sequential Processing

Changho Choi, Minho Kim, Jinkyu Kim

Main category: cs.CV

TL;DR: MambaEye is a causal sequential visual encoder using a Mamba2 backbone with relative move embedding and a diffusion-inspired loss, enabling input-size agnostic processing with linear complexity.

Details

Motivation: To create a truly input-size agnostic visual encoder that mimics human vision's ability to process images of arbitrary sizes, overcoming limitations of previous bidirectional Mamba-based approaches.

Method: Uses a strictly unidirectional Mamba2 backbone with relative move embedding for spatial shift encoding, and a diffusion-inspired loss for step-wise supervision during sequential processing.

Result: Achieves robust performance across a wide range of image resolutions, especially high resolutions like 1536² on ImageNet-1K, while maintaining linear time and memory complexity.

Conclusion: MambaEye successfully demonstrates input-size agnostic visual encoding with causal processing, providing a foundation for more human-like vision systems.

Abstract: Despite decades of progress, a truly input-size agnostic visual encoder (a fundamental characteristic of human vision) has remained elusive. We address this limitation by proposing MambaEye, a novel, causal sequential encoder that leverages the low complexity and causal-process based pure Mamba2 backbone. Unlike previous Mamba-based vision encoders that often employ bidirectional processing, our strictly unidirectional approach preserves the inherent causality of State Space Models, enabling the model to generate a prediction at any point in its input sequence. A core innovation is our use of relative move embedding, which encodes the spatial shift between consecutive patches, providing a strong inductive bias for translation invariance and making the model inherently adaptable to arbitrary image resolutions and scanning patterns. To achieve this, we introduce a novel diffusion-inspired loss function that provides dense, step-wise supervision, training the model to build confidence as it gathers more visual evidence. We demonstrate that MambaEye exhibits robust performance across a wide range of image resolutions, especially at higher resolutions such as 1536² on the ImageNet-1K classification task. This feat is achieved while maintaining linear time and memory complexity relative to the number of patches.
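To make the idea of encoding the spatial shift between consecutive patches concrete, here is a minimal sketch that turns a scan path over a patch grid into a sequence of relative moves. The snake-scan pattern and the function names are assumptions chosen for illustration. The abstract does not say which scanning patterns MambaEye uses or how the moves are embedded.

```python
def snake_scan(rows, cols):
    """Visit patches row by row, reversing direction on every other row."""
    path = []
    for r in range(rows):
        col_order = range(cols) if r % 2 == 0 else range(cols - 1, -1, -1)
        path.extend((r, c) for c in col_order)
    return path

def relative_moves(path):
    """Shift (d_row, d_col) from each patch to the next; the first patch gets (0, 0)."""
    moves = [(0, 0)]
    for (r0, c0), (r1, c1) in zip(path, path[1:]):
        moves.append((r1 - r0, c1 - c0))
    return moves

small = relative_moves(snake_scan(2, 3))   # 2 x 3 patch grid
large = relative_moves(snake_scan(4, 6))   # 4 x 6 patch grid
```

Under this scan, both grids produce the same small vocabulary of moves (right, left, down), so a model fed moves rather than absolute positions sees nothing that depends on the grid size.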

[137] HiCoGen: Hierarchical Compositional Text-to-Image Generation in Diffusion Models via Reinforcement Learning

Hongji Yang, Yucheng Zhou, Wencheng Han, Runzhou Tao, Zhongying Qiu, Jianfei Yang, Jianbing Shen

Main category: cs.CV

TL;DR: HiCoGen is a hierarchical compositional framework that decomposes complex prompts into semantic units and synthesizes them iteratively using a Chain of Synthesis paradigm, enhanced by reinforcement learning with a novel decaying stochasticity schedule.

Details

Motivation: Existing diffusion models struggle with complex prompts involving multiple objects and hierarchical structures, leading to concept omission, confusion, and poor compositionality.

Method: Uses an LLM to decompose prompts into minimal semantic units, iteratively synthesizes them via Chain of Synthesis, and employs RL with a decaying stochasticity schedule and hierarchical rewards (global, subject, and relationship levels).

Result: Significantly outperforms existing methods in both concept coverage and compositional accuracy on the new HiCoPrompt benchmark.

Conclusion: The hierarchical compositional approach with enhanced exploration through decaying stochasticity enables more faithful generation of complex scenes from text prompts.

Abstract: Recent advances in diffusion models have demonstrated impressive capability in generating high-quality images for simple prompts. However, when confronted with complex prompts involving multiple objects and hierarchical structures, existing models struggle to accurately follow instructions, leading to issues such as concept omission, confusion, and poor compositionality. To address these limitations, we propose a Hierarchical Compositional Generative framework (HiCoGen) built upon a novel Chain of Synthesis (CoS) paradigm. Instead of monolithic generation, HiCoGen first leverages a Large Language Model (LLM) to decompose complex prompts into minimal semantic units. It then synthesizes these units iteratively, where the image generated in each step provides crucial visual context for the next, ensuring all textual concepts are faithfully constructed into the final scene. To further optimize this process, we introduce a reinforcement learning (RL) framework. Crucially, we identify that the limited exploration of standard diffusion samplers hinders effective RL. We theoretically prove that sample diversity is maximized by concentrating stochasticity in the early generation stages and, based on this insight, propose a novel Decaying Stochasticity Schedule to enhance exploration. Our RL algorithm is then guided by a hierarchical reward mechanism that jointly evaluates the image at the global, subject, and relationship levels. We also construct HiCoPrompt, a new text-to-image benchmark with hierarchical prompts for rigorous evaluation. Experiments show our approach significantly outperforms existing methods in both concept coverage and compositional accuracy.
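The idea of concentrating stochasticity in the early generation stages can be sketched as a per-step noise weight that starts high and decays. The exponential form and the `eta_max` and `decay_rate` parameters below are assumptions made for illustration. The abstract does not give the actual form of the Decaying Stochasticity Schedule.

```python
import math

def decaying_stochasticity(num_steps, eta_max=1.0, decay_rate=5.0):
    """Per-step stochasticity weights, largest at step 0 and decaying toward 0."""
    if num_steps < 1:
        raise ValueError("num_steps must be positive")
    denom = max(num_steps - 1, 1)
    return [eta_max * math.exp(-decay_rate * t / denom) for t in range(num_steps)]

schedule = decaying_stochasticity(num_steps=10)
```

Such a weight could, for instance, scale the amount of fresh noise injected at each sampling step, so that early steps explore and later steps behave almost deterministically.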

[138] VGGT4D: Mining Motion Cues in Visual Geometry Transformers for 4D Scene Reconstruction

Yu Hu, Chong Cheng, Sicheng Yu, Xiaoyang Guo, Hao Wang

Main category: cs.CV

TL;DR: VGGT4D extends the 3D foundation model VGGT for training-free 4D scene reconstruction by mining dynamic cues from VGGT’s attention layers and using them to separate static and dynamic elements.

Details

Motivation: Reconstructing dynamic 4D scenes is challenging due to difficulty in disentangling dynamic objects from static background. Existing methods rely on external priors, heavy optimization, or 4D dataset fine-tuning.

Method: Mines dynamic cues from VGGT’s global attention layers via gram similarity, aggregates them across a temporal window, refines masks with projection gradient, and integrates the masks into VGGT’s early-stage inference to mitigate motion interference.

Result: Achieves superior performance across six datasets in dynamic object segmentation, camera pose estimation, and dense reconstruction. Supports single-pass inference on sequences longer than 500 frames.

Conclusion: VGGT4D provides a training-free framework for robust 4D scene reconstruction by leveraging VGGT’s implicit dynamic cues, effectively handling dynamic scenes without requiring external priors or fine-tuning.

Abstract: Reconstructing dynamic 4D scenes is challenging, as it requires robust disentanglement of dynamic objects from the static background. While 3D foundation models like VGGT provide accurate 3D geometry, their performance drops markedly when moving objects dominate. Existing 4D approaches often rely on external priors, heavy post-optimization, or require fine-tuning on 4D datasets. In this paper, we propose VGGT4D, a training-free framework that extends the 3D foundation model VGGT for robust 4D scene reconstruction. Our approach is motivated by the key finding that VGGT’s global attention layers already implicitly encode rich, layer-wise dynamic cues. To obtain masks that decouple static and dynamic elements, we mine and amplify global dynamic cues via gram similarity and aggregate them across a temporal window. To further sharpen mask boundaries, we introduce a refinement strategy driven by projection gradient. We then integrate these precise masks into VGGT’s early-stage inference, effectively mitigating motion interference in both pose estimation and geometric reconstruction. Across six datasets, our method achieves superior performance in dynamic object segmentation, camera pose estimation, and dense reconstruction. It also supports single-pass inference on sequences longer than 500 frames.
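One plausible reading of a Gram-similarity cue is sketched below: build the Gram matrix (pairwise inner products) of normalized feature vectors taken at one spatial location across a temporal window, and treat a low average similarity as a hint that the content there is moving. The feature values, the scoring rule, and the function names are hypothetical. VGGT4D's actual procedure operates on attention-layer features and is not detailed in the abstract.

```python
import math

def normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm > 0 else list(vec)

def gram_matrix(features):
    """Pairwise inner products between L2-normalized feature vectors."""
    unit = [normalize(f) for f in features]
    return [[sum(a * b for a, b in zip(u, v)) for v in unit] for u in unit]

def mean_off_diagonal(gram):
    """Average similarity between distinct frames in the window."""
    n = len(gram)
    total = sum(gram[i][j] for i in range(n) for j in range(n) if i != j)
    return total / (n * (n - 1))

# Toy features for one spatial location over three frames (hypothetical values).
static_location = [[1.0, 0.0], [1.0, 0.1], [0.9, 0.0]]
moving_location = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
static_score = mean_off_diagonal(gram_matrix(static_location))
moving_score = mean_off_diagonal(gram_matrix(moving_location))
```

Thresholding such a score per location would give a coarse dynamic mask, which would still need boundary refinement of the kind the abstract describes.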

[139] Boosting Reasoning in Large Multimodal Models via Activation Replay

Yun Xing, Xiaobin Hu, Qingdong He, Jiangning Zhang, Shuicheng Yan, Shijian Lu, Yu-Gang Jiang

Main category: cs.CV

TL;DR: RLVR post-training shifts low-entropy activations in LMMs, which is linked to reasoning. Activation Replay boosts multimodal reasoning by replaying low-entropy activations from base models to RLVR models without retraining.

Details

Motivation: To understand the mechanisms behind RLVR's effectiveness in improving reasoning in Large Multimodal Models and leverage these insights to develop training-free methods for enhancing multimodal reasoning.

Method: Proposed Activation Replay - a training-free approach that manipulates visual tokens at test time by replaying low-entropy activations from base LMMs to regulate RLVR counterparts, without requiring expensive policy optimization.

Result: Activation Replay improves reasoning across mathematics, visual agents, and video reasoning scenarios. It boosts Pass@K metrics and mitigates narrower reasoning coverage of RLVR. Outperforms alternative approaches like replaying high-entropy activations or direct cross-model intervention.

Conclusion: Modulating low-entropy activations plays a beneficial role in LMM reasoning. Activation Replay provides an effective training-free method to enhance multimodal reasoning in post-trained LMMs by leveraging insights about how RLVR affects input activations.

Abstract: Recently, Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as an effective approach to incentivizing reasoning capability in Large Multimodal Models (LMMs), while the underlying mechanisms behind this post-training paradigm are poorly understood. We begin by exploring how input activations are affected by RLVR through the perspective of the logit lens. Our systematic investigations across multiple post-trained LMMs suggest that RLVR shifts low-entropy activations unexpectedly, while high-entropy ones are less affected. We further demonstrate that such phenomena are associated with LMM reasoning by controlled experiments, suggesting a potentially beneficial role of modulating low-entropy activations. To this end, we propose Activation Replay, a novel, simple yet effective training-free approach that boosts multimodal reasoning of post-trained LMMs without requiring expensive policy optimization. Our design involves manipulation of visual tokens at test time, replaying low-entropy activations from the input context of base LMMs to regulate their RLVR counterparts. Activation Replay triggers better reasoning across diverse scenarios, including mathematics, o3-like visual agents, and video reasoning. We further show that Activation Replay boosts Pass@K and mitigates narrower reasoning coverage of RLVR. Our design is compared against alternative choices, such as replaying high-entropy activations instead of low-entropy ones, or direct cross-model intervention instead of manipulating input tokens, demonstrating the superiority of our implementation. Code will be made publicly available.
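The notion of a low-entropy activation can be illustrated with a small sketch: decode each token position to vocabulary logits (as a logit lens would), compute the Shannon entropy of the resulting softmax distribution, and keep the positions below a threshold. The logits, the threshold, and the function names are assumptions for illustration. How Activation Replay selects and replays activations is not specified at this level of detail in the abstract.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(logits):
    """Shannon entropy (in nats) of the softmax distribution over the vocabulary."""
    return -sum(p * math.log(p) for p in softmax(logits) if p > 0.0)

def low_entropy_positions(per_token_logits, threshold):
    """Indices of token positions whose decoded distribution is confident."""
    return [i for i, logits in enumerate(per_token_logits)
            if entropy(logits) < threshold]

per_token_logits = [
    [10.0, 0.0, 0.0, 0.0],  # confident prediction: low entropy
    [0.0, 0.0, 0.0, 0.0],   # uniform prediction: entropy = ln(4)
    [5.0, 5.0, 0.0, 0.0],   # two-way tie: intermediate entropy
]
selected = low_entropy_positions(per_token_logits, threshold=0.5)
```

Only the first position falls below the threshold here; in the setting the abstract describes, positions like this in the base model would be the candidates for replay.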

[140] EmoFeedback2: Reinforcement of Continuous Emotional Image Generation via LVLM-based Reward and Textual Feedback

Jingyang Jia, Kai Shu, Gang Yang, Long Xing, Xun Chen, Aiping Liu

Main category: cs.CV

TL;DR: Proposes EmoFeedback2, a reinforcement paradigm for continuous emotional image generation that uses LVLM feedback to improve emotional continuity and fidelity through reward-based reinforcement and adaptive prompt refinement.

Details

Motivation: Existing methods lack emotional feedback from generated images and fail to adaptively adjust emotional prompts based on image content, limiting emotional continuity and fidelity.

Method: Uses a generation-understanding-feedback reinforcement paradigm with LVLM for emotional evaluation and feedback. Includes emotion-aware reward feedback for reinforcement fine-tuning and self-promotion textual feedback for adaptive prompt refinement.

Result: Outperforms state-of-the-art methods in generating high-quality images with desired emotions on a custom dataset.

Conclusion: The proposed EmoFeedback2 effectively generates emotionally aligned images through LVLM-based feedback mechanisms, addressing limitations of existing approaches.

Abstract: Continuous emotional image generation (C-EICG) is emerging rapidly due to its ability to produce images aligned with both user descriptions and continuous emotional values. However, existing approaches lack emotional feedback from generated images, limiting the control of emotional continuity. Additionally, their simple alignment between emotions and naively generated texts fails to adaptively adjust emotional prompts according to image content, leading to insufficient emotional fidelity. To address these concerns, we propose a novel generation-understanding-feedback reinforcement paradigm (EmoFeedback2) for C-EICG, which exploits the reasoning capability of the fine-tuned large vision-language model (LVLM) to provide reward and textual feedback for generating high-quality images with continuous emotions. Specifically, we introduce an emotion-aware reward feedback strategy, where the LVLM evaluates the emotional values of generated images and computes the reward against target emotions, guiding the reinforcement fine-tuning of the generative model and enhancing the emotional continuity of images. Furthermore, we design a self-promotion textual feedback framework, in which the LVLM iteratively analyzes the emotional content of generated images and adaptively produces refinement suggestions for the next-round prompt, improving the emotional fidelity with fine-grained content. Extensive experimental results demonstrate that our approach effectively generates high-quality images with the desired emotions, outperforming existing state-of-the-art methods in our custom dataset. The code and dataset will be released soon.

[141] SONIC: Spectral Optimization of Noise for Inpainting with Consistency

Seungyeon Baek, Erqun Dong, Shadan Namazifard, Mark J. Matthews, Kwang Moo Yi

Main category: cs.CV

TL;DR: A training-free inpainting method that optimizes initial seed noise in text-to-image models to match unmasked image regions, using linear approximation and spectral domain optimization for efficiency.

Details

Motivation: Existing guidance-based methods for inpainting with generic text-to-image models are limited in effectiveness, requiring specialized inpainting models. The paper identifies that optimizing the initial seed noise is the missing component for effective training-free inpainting.

Method: Proposes optimizing initial seed noise to match unmasked image parts with few optimization steps. Uses linear approximation to avoid costly unrolling and spectral domain optimization for stability. Applies conventional training-free inpainting methods on the optimized noise.

Result: The method demonstrates effectiveness across various inpainting tasks and outperforms state-of-the-art approaches.

Conclusion: Optimizing initial seed noise with linear approximation and spectral domain optimization enables effective training-free inpainting with off-the-shelf text-to-image models, eliminating the need for specialized inpainting models.

Abstract: We propose a novel training-free method for inpainting with off-the-shelf text-to-image models. While guidance-based methods in theory allow generic models to be used for inverse problems such as inpainting, in practice, their effectiveness is limited, leading to the necessity of specialized inpainting-specific models. In this work, we argue that the missing ingredient for training-free inpainting is the optimization (guidance) of the initial seed noise. We propose to optimize the initial seed noise to approximately match the unmasked parts of the data, using only a few tens of optimization steps. We then apply conventional training-free inpainting methods on top of our optimized initial seed noise. Critically, we propose two core ideas to implement this effectively: (i) to avoid the costly unrolling required to relate the initial noise and the generated outcome, we perform linear approximation; and (ii) to stabilize the optimization, we optimize the initial seed noise in the spectral domain. We demonstrate the effectiveness of our method on various inpainting tasks, outperforming the state of the art. Project page: https://ubc-vision.github.io/sonic/

[142] GazeProphetV2: Head-Movement-Based Gaze Prediction Enabling Efficient Foveated Rendering on Mobile VR

Farhaan Ebadulla, Chiraag Mudlpaur, Shreya Chaurasia, Gaurav BV

Main category: cs.CV

TL;DR: A multimodal VR gaze prediction method combining gaze history, head movement, and visual scene data using gated fusion with cross-modal attention, achieving 93.1% accuracy across 22 VR scenes.

Details

Motivation: To address the challenge of predicting gaze behavior in VR for rendering optimization and interface design, enabling efficient VR systems without expensive eye tracking hardware.

Method: Multimodal approach using gated fusion mechanism with cross-modal attention to adaptively weight gaze history, head movement, and scene content based on contextual relevance.

Result: Improved predictive accuracy when combining modalities vs individual streams, with 93.1% validation accuracy and temporal consistency across 1-3 future frames in 22 VR scenes with 5.3M gaze samples.

Conclusion: Integrating past gaze trajectories with head orientation and scene content enhances gaze prediction, contributing to attention understanding in VR and enabling applications in rendering optimization, interaction design, and user experience evaluation.

Abstract: Predicting gaze behavior in virtual reality environments remains a significant challenge with implications for rendering optimization and interface design. This paper introduces a multimodal approach to VR gaze prediction that combines temporal gaze patterns, head movement data, and visual scene information. By leveraging a gated fusion mechanism with cross-modal attention, the approach learns to adaptively weight gaze history, head movement, and scene content based on contextual relevance. Evaluations using a dataset spanning 22 VR scenes with 5.3M gaze samples demonstrate improvements in predictive accuracy when combining modalities compared to using individual data streams alone. The results indicate that integrating past gaze trajectories with head orientation and scene content enhances prediction accuracy across 1-3 future frames. Cross-scene generalization testing shows consistent performance with 93.1% validation accuracy and temporal consistency in predicted gaze trajectories. These findings contribute to understanding attention mechanisms in virtual environments while suggesting potential applications in rendering optimization, interaction design, and user experience evaluation. The approach represents a step toward more efficient virtual reality systems that can anticipate user attention patterns without requiring expensive eye tracking hardware.
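A gated fusion of the three streams can be sketched as a softmax gate that assigns one weight per modality, followed by a weighted sum of the modality feature vectors. In this sketch the gate scores are fixed numbers supplied by hand; in the described approach they would be derived from cross-modal attention over the inputs. The feature values and function names are hypothetical.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of gate scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def gated_fusion(features, gate_scores):
    """Weighted sum of equally sized modality vectors; weights come from a softmax gate."""
    weights = softmax(gate_scores)
    dim = len(features[0])
    fused = [sum(w * f[d] for w, f in zip(weights, features)) for d in range(dim)]
    return fused, weights

gaze_history = [0.2, 0.8]    # toy feature vectors (hypothetical)
head_movement = [0.6, 0.4]
scene_content = [1.0, 0.0]

fused, weights = gated_fusion(
    [gaze_history, head_movement, scene_content],
    gate_scores=[0.0, 0.0, 0.0],  # equal scores: plain average of the streams
)
```

Raising one gate score shifts the fused vector toward that modality, which is the sense in which the weighting adapts to contextual relevance.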

[143] OmniRefiner: Reinforcement-Guided Local Diffusion Refinement

Yaoli Liu, Ziheng Ouyang, Shengtao Lou, Yiren Song

Main category: cs.CV

TL;DR: A detail-aware refinement framework that uses two-stage reference-driven correction to enhance pixel-level consistency in image generation, addressing limitations of current diffusion models in preserving fine-grained visual details.

Details

Motivation: Current diffusion models struggle to preserve fine-grained visual details when refining images using references due to VAE-based latent compression discarding subtle texture information, and existing post-editing approaches often produce inconsistent results in lighting, texture, or shape.

Method: Two consecutive stages: 1) Adapt a single-image diffusion editor fine-tuned to jointly process draft and reference images for globally coherent refinement while maintaining structural fidelity; 2) Apply reinforcement learning to strengthen localized editing capability by optimizing for detail accuracy and semantic consistency.

Result: Significantly improves reference alignment and fine-grained detail preservation, producing faithful and visually coherent edits that surpass both open-source and commercial models on challenging reference-guided restoration benchmarks.

Conclusion: The proposed framework effectively addresses the limitations of current diffusion models in preserving fine-grained details during reference-guided image generation, achieving superior performance in detail preservation and visual coherence.

Abstract: Reference-guided image generation has progressed rapidly, yet current diffusion models still struggle to preserve fine-grained visual details when refining a generated image using a reference. This limitation arises because VAE-based latent compression inherently discards subtle texture information, causing identity- and attribute-specific cues to vanish. Moreover, post-editing approaches that amplify local details based on existing methods often produce results inconsistent with the original image in terms of lighting, texture, or shape. To address this, we introduce OmniRefiner, a detail-aware refinement framework that performs two consecutive stages of reference-driven correction to enhance pixel-level consistency. We first adapt a single-image diffusion editor by fine-tuning it to jointly ingest the draft image and the reference image, enabling globally coherent refinement while maintaining structural fidelity. We then apply reinforcement learning to further strengthen localized editing capability, explicitly optimizing for detail accuracy and semantic consistency. Extensive experiments demonstrate that OmniRefiner significantly improves reference alignment and fine-grained detail preservation, producing faithful and visually coherent edits that surpass both open-source and commercial models on challenging reference-guided restoration benchmarks.

[144] On the Feasibility of Hijacking MLLMs’ Decision Chain via One Perturbation

Changyue Li, Jiaying Li, Youliang Yuan, Jiaming He, Zhicong Huang, Pinjia He

Main category: cs.CV

TL;DR: SAUPs are semantic-aware universal perturbations that can hijack entire decision chains in multimodal models, causing cascading errors with a single adversarial frame.

Details

Motivation: Real-world AI systems make sequential decisions where isolated errors are manageable but cascading failures pose severe risks. Current attacks focus on single decisions, missing the threat of chain manipulation.

Method: Developed Semantic-Aware Universal Perturbations (SAUPs) using normalized space optimization with semantic separation strategy, and created RIST dataset for evaluation.

Result: Achieved 70% attack success rate controlling five distinct targets simultaneously on three multimodal large language models using just one adversarial frame.

Conclusion: SAUPs expose critical vulnerability in sequential decision systems, demonstrating that single perturbations can systematically manipulate multiple outputs, posing severe real-world risks.

Abstract: Conventional adversarial attacks focus on manipulating a single decision of neural networks. However, real-world models often operate in a sequence of decisions, where an isolated mistake can be easily corrected, but cascading errors can lead to severe risks. This paper reveals a novel threat: a single perturbation can hijack the whole decision chain. We demonstrate the feasibility of manipulating a model’s outputs toward multiple, predefined outcomes, such as simultaneously misclassifying “non-motorized lane” signs as “motorized lane” and “pedestrian” as “plastic bag”. To expose this threat, we introduce Semantic-Aware Universal Perturbations (SAUPs), which induce varied outcomes based on the semantics of the inputs. We overcome optimization challenges by developing an effective algorithm, which searches for perturbations in normalized space with a semantic separation strategy. To evaluate the practical threat of SAUPs, we present RIST, a new real-world image dataset with fine-grained semantic annotations. Extensive experiments on three multimodal large language models demonstrate their vulnerability, achieving a 70% attack success rate when controlling five distinct targets using just a single adversarial frame.

[145] CREward: A Type-Specific Creativity Reward Model

Jiyeon Han, Ali Mahdavi-Amiri, Hao Zhang, Haedong Jeong

Main category: cs.CV

TL;DR: The paper introduces CREward, the first type-specific creativity reward model that assesses creativity across geometry, material, and texture dimensions, trained using LVLM-generated labels aligned with human perception.

Details

Motivation: Current approaches treat creativity as a single undifferentiated quantity, which is naive and underwhelming for representing and assessing the complex phenomenon of creativity.

Method: Conducted human benchmark evaluation to capture human perception of creativity across three types, analyzed LVLM-human alignment, collected LVLM-generated labels to train CREward model, and explored three applications.

Result: LVLMs exhibit strong alignment with human perception of creativity, enabling successful training of CREward model for type-specific creativity assessment and generation.

Conclusion: CREward enables type-specific creativity evaluation and generation across geometry, material, and texture dimensions, with applications in assessment, explainable creativity, and creative sample acquisition.

Abstract: Creativity is a complex phenomenon. When it comes to representing and assessing creativity, treating it as a single undifferentiated quantity would appear naive and underwhelming. In this work, we learn the first type-specific creativity reward model, coined CREward, which spans three creativity “axes” (geometry, material, and texture) to allow us to view creativity through the lens of the image formation pipeline. To build our reward model, we first conduct a human benchmark evaluation to capture human perception of creativity for each type across various creative images. We then analyze the correlation between human judgments and predictions by large vision-language models (LVLMs), confirming that LVLMs exhibit strong alignment with human perception. Building on this observation, we collect LVLM-generated labels to train our CREward model that is applicable to both evaluation and generation of creative images. We explore three applications of CREward: creativity assessment, explainable creativity, and creative sample acquisition for both human design inspiration and guiding creative generation through low-rank adaptation.

[146] Pedestrian Crossing Intention Prediction Using Multimodal Fusion Network

Yuanzhe Li, Steffen Müller

Main category: cs.CV

TL;DR: Proposes a multimodal fusion network for pedestrian crossing intention prediction using seven modality features from visual and motion branches, achieving superior performance on JAAD dataset.

Details

Motivation: Pedestrian crossing intention prediction is essential for autonomous vehicles to reduce collision risks, but challenging due to diverse pedestrian behavior and contextual dependencies.

Method: Uses multimodal fusion with Transformer-based feature extraction, depth-guided attention for spatial feature interactions, and modality/temporal attention mechanisms to emphasize important features.

Result: Extensive experiments on JAAD dataset show the proposed network achieves superior performance compared to baseline methods.

Conclusion: The multimodal fusion approach effectively extracts and integrates complementary cues across different modalities for improved pedestrian crossing intention prediction.

Abstract: Pedestrian crossing intention prediction is essential for the deployment of autonomous vehicles (AVs) in urban environments. Ideal prediction provides AVs with critical environmental cues, thereby reducing the risk of pedestrian-related collisions. However, the prediction task is challenging due to the diverse nature of pedestrian behavior and its dependence on multiple contextual factors. This paper proposes a multimodal fusion network that leverages seven modality features from both visual and motion branches, aiming to effectively extract and integrate complementary cues across different modalities. Specifically, motion and visual features are extracted from the raw inputs using multiple Transformer-based extraction modules. A depth-guided attention module leverages depth information to guide attention towards salient regions in another modality through comprehensive spatial feature interactions. To account for the varying importance of different modalities and frames, modality attention and temporal attention are designed to selectively emphasize informative modalities and effectively capture temporal dependencies. Extensive experiments on the JAAD dataset validate the effectiveness of the proposed network, achieving superior performance compared to the baseline methods.

[147] Multi-Context Fusion Transformer for Pedestrian Crossing Intention Prediction in Urban Environments

Yuanzhe Li, Hang Zhong, Steffen Müller

Main category: cs.CV

TL;DR: Proposes MFT, a multi-context fusion Transformer that integrates pedestrian behavior, environmental, localization, and vehicle motion contexts using progressive fusion strategies for accurate pedestrian crossing intention prediction.

Details

Motivation: Accurate pedestrian intention prediction is crucial for autonomous vehicle safety but remains challenging due to multiple factors affecting pedestrian behavior in urban environments.

Method: Uses multi-context fusion Transformer with progressive fusion: mutual intra-context attention for feature fusion, mutual cross-context attention for context integration, and guided attention mechanisms for refinement.

Result: Achieves state-of-the-art performance with 73% accuracy on JAADbeh, 93% on JAADall, and 90% on PIE datasets, outperforming existing methods.

Conclusion: MFT effectively integrates diverse contextual information and demonstrates superior performance for pedestrian crossing intention prediction through comprehensive multi-context fusion.

Abstract: Pedestrian crossing intention prediction is essential for autonomous vehicles to improve pedestrian safety and reduce traffic accidents. However, accurate pedestrian intention prediction in urban environments remains challenging due to the multitude of factors affecting pedestrian behavior. In this paper, we propose a multi-context fusion Transformer (MFT) that leverages diverse numerical contextual attributes across four key dimensions, encompassing pedestrian behavior context, environmental context, pedestrian localization context and vehicle motion context, to enable accurate pedestrian intention prediction. MFT employs a progressive fusion strategy, where mutual intra-context attention enables reciprocal interactions within each context, thereby facilitating feature sequence fusion and yielding a context token as a context-specific representation. This is followed by mutual cross-context attention, which integrates features across contexts with a global CLS token serving as a compact multi-context representation. Finally, guided intra-context attention refines context tokens within each context through directed interactions, while guided cross-context attention strengthens the global CLS token to promote multi-context fusion via guided information propagation, yielding deeper and more efficient integration. Experimental results validate the superiority of MFT over state-of-the-art methods, achieving accuracy rates of 73%, 93%, and 90% on the JAADbeh, JAADall, and PIE datasets, respectively. Extensive ablation studies are further conducted to investigate the effectiveness of the network architecture and the contribution of different input contexts. Our code is open-source: https://github.com/ZhongHang0307/Multi-Context-Fusion-Transformer.
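The role of a global CLS token in cross-context attention can be sketched with plain scaled dot-product attention: the CLS vector acts as the query and the four context tokens act as keys and values, so the updated CLS is a weighted mix of the contexts. This single-head version has no learned projections, and the token values and function name are hypothetical. It is meant as a sketch of the mechanism, not of MFT's actual layers.

```python
import math

def attend(query, keys, values):
    """Single-head scaled dot-product attention for one query vector."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(values[0])
    output = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
    return output, weights

# One toy token per context: behavior, environment, localization, vehicle motion.
context_tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]]
cls_token = [0.0, 0.0]  # an all-zero query attends to every context equally

updated_cls, weights = attend(cls_token, context_tokens, context_tokens)
```

With a non-zero query, the weights tilt toward the context tokens most aligned with it, which is how the CLS token would come to summarize the more informative contexts.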

Yuanzhe Li, Steffen Müller

Main category: cs.CV

TL;DR: Proposes ACIT, an attention-guided cross-modal interaction Transformer for pedestrian crossing intention prediction using six visual/motion modalities grouped into three interaction pairs with dual-path attention mechanisms.

Details

Motivation: Effectively extracting and integrating complementary cues from different types of data remains a major challenge in pedestrian crossing intention prediction for autonomous vehicles.

Method: Uses six modalities grouped into three pairs: global semantic map+optical flow, local RGB+optical flow, ego-vehicle speed+pedestrian bounding box. Employs dual-path attention, optical flow-guided attention, cross-modal attention, and Transformer-based temporal aggregation.

Result: Outperforms state-of-the-art methods with 70% accuracy on JAADbeh and 89% on JAADall datasets.

Conclusion: ACIT effectively integrates multi-modal cues through attention mechanisms and achieves superior performance in pedestrian crossing intention prediction.

Abstract: Predicting pedestrian crossing intention is crucial for autonomous vehicles to prevent pedestrian-related collisions. However, effectively extracting and integrating complementary cues from different types of data remains one of the major challenges. This paper proposes an attention-guided cross-modal interaction Transformer (ACIT) for pedestrian crossing intention prediction. ACIT leverages six visual and motion modalities, which are grouped into three interaction pairs: (1) Global semantic map and global optical flow, (2) Local RGB image and local optical flow, and (3) Ego-vehicle speed and pedestrian’s bounding box. Within each visual interaction pair, a dual-path attention mechanism enhances salient regions within the primary modality through intra-modal self-attention and facilitates deep interactions with the auxiliary modality (i.e., optical flow) via optical flow-guided attention. Within the motion interaction pair, cross-modal attention is employed to model the cross-modal dynamics, enabling the effective extraction of complementary motion features. Beyond pairwise interactions, a multi-modal feature fusion module further facilitates cross-modal interactions at each time step. Furthermore, a Transformer-based temporal feature aggregation module is introduced to capture sequential dependencies. Experimental results demonstrate that ACIT outperforms state-of-the-art methods, achieving accuracy rates of 70% and 89% on the JAADbeh and JAADall datasets, respectively. Extensive ablation studies are further conducted to investigate the contribution of different modules of ACIT.
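The cross-modal attention described for the motion pair (ego-vehicle speed and pedestrian bounding box) can be illustrated with generic scaled dot-product cross-attention, in which queries come from one modality and keys and values from the other. The summary does not specify ACIT's actual layers, so this is only a minimal sketch: it omits learned projections and multiple heads, and the function names and toy feature values are hypothetical.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Generic scaled dot-product cross-attention without learned projections.

    Each query (from modality A) attends over all keys (from modality B)
    and returns a weighted average of the corresponding values.
    """
    d = len(keys[0])
    out = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        out.append([
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ])
    return out


# Toy per-time-step features (hypothetical values, three time steps each).
speed_feats = [[0.2, 0.1], [0.4, 0.0], [0.5, 0.3]]
bbox_feats = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

# Attend in both directions so each modality is enriched by the other.
speed_attending_to_bbox = cross_attention(speed_feats, bbox_feats, bbox_feats)
bbox_attending_to_speed = cross_attention(bbox_feats, speed_feats, speed_feats)
```

Running attention in both directions is one plausible way to obtain complementary motion features; whether ACIT does exactly this is not stated in the summary.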

[149] WaymoQA: A Multi-View Visual Question Answering Dataset for Safety-Critical Reasoning in Autonomous Driving

Seungjun Yu, Seonho Lee, Namho Kim, Jaeyo Shin, Junsung Park, Wonjeong Ryu, Raehyuk Jung, Hyunjung Shim

Main category: cs.CV

TL;DR: The paper introduces Safety-Critical Reasoning as a new task for autonomous driving that uses multi-view inputs to handle scenarios where avoiding one risk creates another, and presents WaymoQA dataset to improve MLLMs’ performance in these scenarios.

Details

Motivation: Current MLLMs struggle with high-level reasoning in safety-critical driving scenarios where avoiding one traffic risk can create another, requiring comprehensive environmental understanding beyond single front views.

Method: Defined Safety-Critical Reasoning as a two-stage process: resolve immediate risk first, then mitigate decision-induced downstream risks. Created WaymoQA dataset with 35,000 human-annotated QA pairs covering complex driving scenarios in multiple formats.

Result: Existing MLLMs underperform in safety-critical scenarios compared to normal scenes, but fine-tuning with WaymoQA significantly improves their reasoning ability in these complex situations.

Conclusion: The WaymoQA dataset effectively enhances MLLMs’ safety-critical reasoning capabilities, contributing to the development of safer and more reasoning-capable autonomous driving agents.

Abstract: Recent advancements in multimodal large language models (MLLMs) have shown strong understanding of driving scenes, drawing interest in their application to autonomous driving. However, high-level reasoning in safety-critical scenarios, where avoiding one traffic risk can create another, remains a major challenge. Such reasoning is often infeasible with only a single front view and requires a comprehensive view of the environment, which we achieve through multi-view inputs. We define Safety-Critical Reasoning as a new task that leverages multi-view inputs to address this challenge. Then, we distill Safety-Critical Reasoning into two stages: first resolve the immediate risk, then mitigate the decision-induced downstream risks. To support this, we introduce WaymoQA, a dataset of 35,000 human-annotated question-answer pairs covering complex, high-risk driving scenarios. The dataset includes multiple-choice and open-ended formats across both image and video modalities. Experiments reveal that existing MLLMs underperform in safety-critical scenarios compared to normal scenes, but fine-tuning with WaymoQA significantly improves their reasoning ability, highlighting the effectiveness of our dataset in developing safer and more reasoning-capable driving agents.

[150] SAM-MI: A Mask-Injected Framework for Enhancing Open-Vocabulary Semantic Segmentation with SAM

Lin Chen, Yingjian Zhu, Qi Yang, Xin Niu, Kun Ding, Shiming Xiang

Main category: cs.CV

TL;DR: SAM-MI is a mask-injected framework that integrates SAM with OVSS models to address over-segmentation and hard mask-label combination issues, achieving significant performance improvements and speedup.

Details

Motivation: To overcome challenges in leveraging SAM for OVSS: (1) SAM's tendency to over-segment objects and (2) hard combinations between fixed masks and labels.

Method: Uses Text-guided Sparse Point Prompter for faster mask generation, Shallow Mask Aggregation to merge partial masks, and Decoupled Mask Injection to incorporate SAM masks at different frequency levels separately.

Result: Achieves 16.7% relative improvement in mIoU over Grounded-SAM on MESS benchmark with 1.6× speedup.

Conclusion: SAM-MI serves as an effective alternative methodology to equip OVSS models with SAM capabilities.

Abstract: Open-vocabulary semantic segmentation (OVSS) aims to segment and recognize objects universally. Trained on extensive high-quality segmentation data, the segment anything model (SAM) has demonstrated remarkable universal segmentation capabilities, offering valuable support for OVSS. Although previous methods have made progress in leveraging SAM for OVSS, there are still some challenges: (1) SAM’s tendency to over-segment and (2) hard combinations between fixed masks and labels. This paper introduces a novel mask-injected framework, SAM-MI, which effectively integrates SAM with OVSS models to address these challenges. Initially, SAM-MI employs a Text-guided Sparse Point Prompter to sample sparse prompts for SAM instead of previous dense grid-like prompts, thus significantly accelerating the mask generation process. The framework then introduces Shallow Mask Aggregation (SMAgg) to merge partial masks to mitigate SAM’s over-segmentation issue. Finally, Decoupled Mask Injection (DMI) incorporates SAM-generated masks for guidance at low-frequency and high-frequency separately, rather than directly combining them with labels. Extensive experiments on multiple benchmarks validate the superiority of SAM-MI. Notably, the proposed method achieves a 16.7% relative improvement in mIoU over Grounded-SAM on the MESS benchmark, along with a 1.6× speedup. We hope SAM-MI can serve as an alternative methodology to effectively equip the OVSS model with SAM.

[151] MFM-point: Multi-scale Flow Matching for Point Cloud Generation

Petr Molodyk, Jaemoo Choi, David W. Romero, Ming-Yu Liu, Yongxin Chen

Main category: cs.CV

TL;DR: MFM-Point is a multi-scale Flow Matching framework for point cloud generation that improves scalability and performance of point-based methods while maintaining their simplicity and efficiency.

Details

Motivation: Point-based methods for point cloud generation offer low training cost and simplicity but underperform compared to representation-based approaches. The authors aim to bridge this performance gap while preserving the advantages of point-based methods.

Method: Multi-scale Flow Matching framework with coarse-to-fine generation paradigm using structured downsampling and upsampling to preserve geometric structure and ensure smooth distributional transitions across resolutions.

Result: MFM-Point achieves best-in-class performance among point-based methods and challenges the best representation-based methods, with strong results in multi-category and high-resolution generation tasks.

Conclusion: The proposed multi-scale framework substantially improves point cloud generation quality and scalability without additional training or inference overhead, making point-based methods competitive with representation-based approaches.

Abstract: In recent years, point cloud generation has gained significant attention in 3D generative modeling. Among existing approaches, point-based methods directly generate point clouds without relying on other representations such as latent features, meshes, or voxels. These methods offer low training cost and algorithmic simplicity, but often underperform compared to representation-based approaches. In this paper, we propose MFM-Point, a multi-scale Flow Matching framework for point cloud generation that substantially improves the scalability and performance of point-based methods while preserving their simplicity and efficiency. Our multi-scale generation algorithm adopts a coarse-to-fine generation paradigm, enhancing generation quality and scalability without incurring additional training or inference overhead. A key challenge in developing such a multi-scale framework lies in preserving the geometric structure of unordered point clouds while ensuring smooth and consistent distributional transitions across resolutions. To address this, we introduce a structured downsampling and upsampling strategy that preserves geometry and maintains alignment between coarse and fine resolutions. Our experimental results demonstrate that MFM-Point achieves best-in-class performance among point-based methods and challenges the best representation-based methods. In particular, MFM-Point demonstrates strong results in multi-category and high-resolution generation tasks.
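To make the coarse-to-fine idea concrete, the sketch below builds a coarse point set from a fine one and then expands it back to the fine resolution. The summary does not describe MFM-Point's actual structured downsampling and upsampling, so farthest point sampling and jittered duplication are swapped in here as standard stand-ins; all function names and parameters are hypothetical.

```python
import random


def squared_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def farthest_point_sampling(points, k):
    """Greedily pick k indices, each farthest from the points chosen so far."""
    chosen = [0]
    min_d = [squared_dist(p, points[0]) for p in points]
    while len(chosen) < k:
        idx = max(range(len(points)), key=lambda i: min_d[i])
        chosen.append(idx)
        for i, p in enumerate(points):
            min_d[i] = min(min_d[i], squared_dist(p, points[idx]))
    return chosen


def upsample(coarse, factor, jitter, rng):
    """Duplicate each coarse point `factor` times with small Gaussian jitter."""
    return [
        [c + rng.gauss(0.0, jitter) for c in p]
        for p in coarse
        for _ in range(factor)
    ]


rng = random.Random(0)
fine = [[rng.uniform(-1.0, 1.0) for _ in range(3)] for _ in range(64)]

# Coarse stage: 16 well-spread points that preserve the overall shape.
coarse_idx = farthest_point_sampling(fine, 16)
coarse = [fine[i] for i in coarse_idx]

# Fine stage: start from the coarse shape at the target resolution (64 points).
fine_init = upsample(coarse, 4, 0.01, rng)
```

In a multi-scale generator, a set like `fine_init` would serve as the starting point for the finer-resolution stage, so that stage only has to add detail rather than rediscover the global shape.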

[152] Tell Model Where to Look: Mitigating Hallucinations in MLLMs by Vision-Guided Attention

Jianfei Zhao, Feng Zhang, Xin Sun, Chong Feng, Zhixing Tan

Main category: cs.CV

TL;DR: Vision-Guided Attention (VGA) is a training-free method that uses visual token semantics to guide MLLMs’ attention, reducing hallucinations with only 4.36% latency overhead while remaining compatible with efficient attention implementations.

Details

Motivation: MLLMs have limited localization capability in visual attention, leading to hallucinations despite accurately extracting visual semantics from tokens.

Method: VGA constructs precise visual grounding using visual token semantics and guides model focus toward relevant regions. For image captioning, it dynamically suppresses already-described regions during generation.

Result: VGA achieves state-of-the-art dehallucination performance across diverse MLLMs and benchmarks with only 4.36% latency overhead.

Conclusion: Explicit visual guidance is crucial for enhancing MLLMs’ visual understanding capabilities, and VGA effectively reduces hallucinations without requiring training.

Abstract: Visual attention serves as the primary mechanism through which MLLMs interpret visual information; however, its limited localization capability often leads to hallucinations. We observe that although MLLMs can accurately extract visual semantics from visual tokens, they fail to fully leverage this advantage during subsequent inference. To address this limitation, we propose Vision-Guided Attention (VGA), a training-free method that first constructs precise visual grounding by exploiting the semantic content of visual tokens, and then uses this grounding to guide the model’s focus toward relevant visual regions. In image captioning, VGA further refines this guidance dynamically during generation by suppressing regions that have already been described. In VGA, each token undergoes only a single forward pass, introducing a negligible latency overhead of just 4.36%. In addition, VGA is fully compatible with efficient attention implementations such as FlashAttention. Extensive experiments across diverse MLLMs and multiple hallucination benchmarks demonstrate that VGA achieves state-of-the-art dehallucination performance. Further analysis confirms that explicit visual guidance plays a crucial role in enhancing the visual understanding capabilities of MLLMs.

[153] Explainable Visual Anomaly Detection via Concept Bottleneck Models

Arianna Stropeni, Valentina Zaccaria, Francesco Borsatti, Davide Dalle Pezze, Manuel Barusco, Gian Antonio Susto

Main category: cs.CV

TL;DR: Extends Concept Bottleneck Models to Visual Anomaly Detection to provide human-interpretable concept-based explanations alongside visual anomaly localization.

Details

Motivation: Current VAD methods provide visual explanations but lack semantically meaningful interpretations for users, limiting their practical utility and trustworthiness.

Method: Developed Concept Dataset, improved CBM architecture for dual explanations, and introduced pipeline for synthesizing artificial anomalies while maintaining VAD’s minimal dependence on anomalous samples.

Result: CONVAD achieves performance comparable to classic VAD methods while providing richer, concept-driven explanations that enhance interpretability.

Conclusion: The approach successfully bridges semantic and localization interpretability in VAD systems, offering more insightful anomaly explanations that build user trust.

Abstract: In recent years, Visual Anomaly Detection (VAD) has gained significant attention due to its ability to identify anomalous images using only normal images during training. Many VAD models work without supervision but are still able to provide visual explanations by highlighting the anomalous regions within an image. However, although these visual explanations can be helpful, they lack a direct and semantically meaningful interpretation for users. To address this limitation, we propose extending Concept Bottleneck Models (CBMs) to the VAD setting. By learning meaningful concepts, the network can provide human-interpretable descriptions of anomalies, offering a novel and more insightful way to explain them. Our contributions are threefold: (i) we develop a Concept Dataset to support research on CBMs for VAD; (ii) we improve the CBM architecture to generate both concept-based and visual explanations, bridging semantic and localization interpretability; and (iii) we introduce a pipeline for synthesizing artificial anomalies, preserving the VAD paradigm of minimizing dependence on rare anomalous samples. Our approach, Concept-Aware Visual Anomaly Detection (CONVAD), achieves performance comparable to classic VAD methods while providing richer, concept-driven explanations that enhance interpretability and trust in VAD systems.

[154] From Forecasting to Planning: Policy World Model for Collaborative State-Action Prediction

Zhida Zhao, Talas Fu, Yifan Wang, Lijun Wang, Huchuan Lu

Main category: cs.CV

TL;DR: Policy World Model (PWM) integrates world modeling and trajectory planning in autonomous driving, using action-free future state forecasting and collaborative state-action prediction for human-like anticipatory perception.

Details

Motivation: Current driving world models are mostly decoupled from trajectory planning, and the synergistic relationship between world modeling and planning needs further exploration to improve autonomous driving systems.

Method: PWM uses action-free future state forecasting, collaborative state-action prediction, dynamically enhanced parallel token generation with context-guided tokenizer and adaptive dynamic focal loss for efficient video forecasting.

Result: Using only front camera input, the method matches or exceeds state-of-the-art approaches that rely on multi-view and multi-modal inputs.

Conclusion: PWM demonstrates that unified world modeling and planning with anticipatory perception mechanisms can achieve superior autonomous driving performance with simpler sensor inputs.

Abstract: Despite remarkable progress in driving world models, their potential for autonomous systems remains largely untapped: the world models are mostly learned for world simulation and decoupled from trajectory planning. While recent efforts aim to unify world modeling and planning in a single framework, the synergistic facilitation mechanism of world modeling for planning still requires further exploration. In this work, we introduce a new driving paradigm named Policy World Model (PWM), which not only integrates world modeling and trajectory planning within a unified architecture, but is also able to benefit planning using the learned world knowledge through the proposed action-free future state forecasting scheme. Through collaborative state-action prediction, PWM can mimic human-like anticipatory perception, yielding more reliable planning performance. To improve the efficiency of video forecasting, we further introduce a dynamically enhanced parallel token generation mechanism, equipped with a context-guided tokenizer and an adaptive dynamic focal loss. Despite utilizing only front camera input, our method matches or exceeds state-of-the-art approaches that rely on multi-view and multi-modal inputs. Code and model weights will be released at https://github.com/6550Zhao/Policy-World-Model.

[155] Clair Obscur: an Illumination-Aware Method for Real-World Image Vectorization

Xingyue Lin, Shuai Peng, Xiangyu Xie, Jianhua Zhu, Yuxuan Zhou, Liangcai Gao

Main category: cs.CV

TL;DR: COVec is an illumination-aware vectorization framework that uses intrinsic image decomposition to separate images into albedo, shade, and light layers in vector format, achieving superior visual fidelity and editability.

Details

Motivation: Existing vectorization methods struggle with complex real-world images, often producing fragmented shapes while sacrificing semantic conciseness.

Method: Introduces intrinsic image decomposition in vector domain, separating images into albedo, shade, and light layers with semantic-guided initialization and two-stage optimization using differentiable rendering.

Result: Experiments on various datasets show COVec achieves higher visual fidelity and significantly improved editability compared to existing methods.

Conclusion: COVec successfully addresses limitations of traditional vectorization by incorporating illumination awareness through the Clair-Obscur principle, enabling better representation of complex real-world images.

Abstract: Image vectorization aims to convert raster images into editable, scalable vector representations while preserving visual fidelity. Existing vectorization methods struggle to represent complex real-world images, often producing fragmented shapes at the cost of semantic conciseness. In this paper, we propose COVec, an illumination-aware vectorization framework inspired by the Clair-Obscur principle of light-shade contrast. COVec is the first to introduce intrinsic image decomposition in the vector domain, separating an image into albedo, shade, and light layers in a unified vector representation. A semantic-guided initialization and two-stage optimization refine these layers with differentiable rendering. Experiments on various datasets demonstrate that COVec achieves higher visual fidelity and significantly improved editability compared to existing methods.

[156] History-Augmented Contrastive Meta-Learning for Unsupervised Blind Super-Resolution of Planetary Remote Sensing Images

Huijia Zhao, Jie Lu, Yunqing Jiang, Xiao-Ping Lu, Kaichang Di

Main category: cs.CV

TL;DR: HACBSR is an unsupervised blind super-resolution framework for planetary images that uses contrastive kernel sampling and history-augmented contrastive learning without ground-truth data.

Details

Motivation: Planetary remote sensing images suffer from unknown degradations due to imaging environments and hardware constraints, limiting image quality and making supervised blind super-resolution challenging due to lack of ground-truth images.

Method: Two main components: (1) contrastive kernel sampling mechanism with kernel similarity control to reduce Gaussian sampling bias, and (2) history-augmented contrastive learning using historical models to generate negative samples for less greedy optimization.

Result: HACBSR achieves competitive performance compared to state-of-the-art unsupervised methods across multiple upscaling factors. The paper also introduces the Ceres-50 dataset for evaluation in planetary applications.

Conclusion: The proposed unsupervised framework effectively addresses blind super-resolution for planetary images without requiring ground-truth data or external kernel priors.

Abstract: Planetary remote sensing images are affected by diverse and unknown degradations caused by imaging environments and hardware constraints. These factors limit image quality and hinder supervised blind super-resolution due to the lack of ground-truth images. This work presents History-Augmented Contrastive Blind Super-Resolution (HACBSR), an unsupervised framework for blind super-resolution that operates without ground-truth images or external kernel priors. HACBSR comprises two components: (1) a contrastive kernel sampling mechanism with kernel similarity control to mitigate distribution bias from Gaussian sampling, and (2) a history-augmented contrastive learning scheme that uses historical models to generate negative samples, enabling less greedy optimization and inducing strong convexity without ground truth. A convergence analysis of the history-augmented contrastive learning is given in the Appendix. To support evaluation in planetary applications, we introduce Ceres-50, a dataset with diverse geological features and simulated degradation patterns. Experiments show that HACBSR achieves competitive performance compared with state-of-the-art unsupervised methods across multiple upscaling factors. The code is available at https://github.com/2333repeat/HACBSR, and the dataset is available at https://github.com/2333repeat/Ceres-50.
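One way to picture kernel sampling with similarity control is rejection sampling: draw candidate blur kernels and keep a candidate only if it is sufficiently different from those already kept. The summary does not give HACBSR's actual mechanism, so the sketch below is an assumption-laden illustration using isotropic Gaussian kernels and cosine similarity; the function names, the sigma range, and the threshold are all hypothetical.

```python
import math
import random


def gaussian_kernel(size, sigma):
    """Normalized isotropic Gaussian blur kernel of shape size x size."""
    half = size // 2
    k = [
        [math.exp(-(x * x + y * y) / (2.0 * sigma * sigma))
         for x in range(-half, half + 1)]
        for y in range(-half, half + 1)
    ]
    total = sum(sum(row) for row in k)
    return [[v / total for v in row] for row in k]


def cosine_similarity(a, b):
    """Cosine similarity between two kernels, treated as flat vectors."""
    fa = [v for row in a for v in row]
    fb = [v for row in b for v in row]
    dot = sum(x * y for x, y in zip(fa, fb))
    norm_a = math.sqrt(sum(x * x for x in fa))
    norm_b = math.sqrt(sum(y * y for y in fb))
    return dot / (norm_a * norm_b)


def sample_diverse_kernels(n, size, max_sim, rng, max_tries=1000):
    """Keep a sampled kernel only if it is dissimilar to every kept kernel."""
    kernels = []
    for _ in range(max_tries):
        if len(kernels) == n:
            break
        candidate = gaussian_kernel(size, rng.uniform(0.5, 4.0))
        if all(cosine_similarity(candidate, k) < max_sim for k in kernels):
            kernels.append(candidate)
    return kernels


kernels = sample_diverse_kernels(3, 7, 0.95, random.Random(0))
```

Rejecting near-duplicates spreads the kept kernels across the degradation space instead of letting them cluster where the sampling distribution is densest, which is the kind of sampling bias the abstract says the mechanism is meant to mitigate.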

[157] DeLightMono: Enhancing Self-Supervised Monocular Depth Estimation in Endoscopy by Decoupling Uneven Illumination

Mingyang Ou, Haojin Li, Yifeng Zhang, Ke Niu, Zhongxi Qiu, Heng Li, Jiang Liu

Main category: cs.CV

TL;DR: DeLight-Mono is a self-supervised monocular depth estimation framework that addresses illumination challenges in endoscopic images through illumination decoupling and joint optimization.

Details

Motivation: Existing depth estimation methods suffer from performance degradation due to uneven illumination in endoscopic images, particularly in low-intensity regions, and current enhancement techniques fail to effectively guide depth networks.

Method: Proposes an illumination-reflectance-depth model to decompose endoscopic images using auxiliary networks, with a self-supervised joint-optimizing framework and novel losses leveraging decoupled components.

Result: The framework effectively mitigates the effects of uneven illumination on depth estimation, as verified through extensive comparisons and ablation studies on two public datasets.

Conclusion: DeLight-Mono successfully addresses illumination challenges in endoscopic depth estimation through illumination decoupling and joint optimization, demonstrating improved performance over existing methods.

Abstract: Self-supervised monocular depth estimation serves as a key task in the development of endoscopic navigation systems. However, performance degradation persists due to uneven illumination inherent in endoscopic images, particularly in low-intensity regions. Existing low-light enhancement techniques fail to effectively guide the depth network. Furthermore, solutions from other fields, like autonomous driving, require well-lit images, making them unsuitable and increasing data collection burdens. To this end, we present DeLight-Mono - a novel self-supervised monocular depth estimation framework with illumination decoupling. Specifically, endoscopic images are represented by a designed illumination-reflectance-depth model, and are decomposed with auxiliary networks. Moreover, a self-supervised joint-optimizing framework with novel losses leveraging the decoupled components is proposed to mitigate the effects of uneven illumination on depth estimation. The effectiveness of the proposed methods was rigorously verified through extensive comparisons and an ablation study performed on two public datasets.

[158] LungEvaty: A Scalable, Open-Source Transformer-based Deep Learning Model for Lung Cancer Risk Prediction in LDCT Screening

Johannes Brandt, Maulik Chevli, Rickmer Braren, Georgios Kaissis, Philip Müller, Daniel Rueckert

Main category: cs.CV

TL;DR: LungEvaty is a transformer-based framework for predicting 1-6 year lung cancer risk from single LDCT scans, operating on whole-lung inputs without pixel-level annotations and matching state-of-the-art performance.

Details

Motivation: As lung cancer screening programs expand with LDCT imaging, scalable methods are needed to efficiently process large screening datasets without relying on labor-intensive pixel-level annotations or fragmenting lung analysis.

Method: A fully transformer-based framework that processes whole-lung CT scans, learns directly from large-scale screening data, and optionally uses Anatomically Informed Attention Guidance (AIAG) loss to encourage anatomically focused attention.

Result: Trained on over 90,000 CT scans (28,000+ for fine-tuning, 6,000 for evaluation), LungEvaty matches state-of-the-art performance using only imaging data without region supervision.

Conclusion: LungEvaty provides a simple, data-efficient, open-source solution that serves as an extensible foundation for future research in longitudinal and multimodal lung cancer risk prediction.

Abstract: Lung cancer risk estimation is gaining increasing importance as more countries introduce population-wide screening programs using low-dose CT (LDCT). As imaging volumes grow, scalable methods that can process entire lung volumes efficiently are essential to tap into the full potential of these large screening datasets. Existing approaches either over-rely on pixel-level annotations, limiting scalability, or analyze the lung in fragments, weakening performance. We present LungEvaty, a fully transformer-based framework for predicting 1-6 year lung cancer risk from a single LDCT scan. The model operates on whole-lung inputs, learning directly from large-scale screening data to capture comprehensive anatomical and pathological cues relevant for malignancy risk. Using only imaging data and no region supervision, LungEvaty matches state-of-the-art performance, refinable by an optional Anatomically Informed Attention Guidance (AIAG) loss that encourages anatomically focused attention. In total, LungEvaty was trained on more than 90,000 CT scans, including over 28,000 for fine-tuning and 6,000 for evaluation. The framework offers a simple, data-efficient, and fully open-source solution that provides an extensible foundation for future research in longitudinal and multimodal lung cancer risk prediction.

[159] When to Think and When to Look: Uncertainty-Guided Lookback

Jing Bi, Filippos Bellos, Junjia Guo, Yayuan Li, Chao Huang, Yolo Y. Tang, Luchuan Song, Susan Liang, Zhongfei Mark Zhang, Jason J. Corso, Chenliang Xu

Main category: cs.CV

TL;DR: Test-time thinking (generating reasoning chains) doesn’t always improve visual reasoning in LVLMs; short lookback phrases that reference images work better than long chains, leading to a new uncertainty-guided lookback decoding method.

Details

Motivation: Despite promising results from test-time thinking in LVLMs, there's no systematic analysis of how thinking actually affects visual reasoning performance.

Method: Large-scale controlled comparison of 10 LVLM variants on MMMU-val, analyzing thinking patterns, then proposing uncertainty-guided lookback decoding that combines uncertainty signals with adaptive lookback prompts and breadth search.

Result: More thinking isn’t always better - long chains often ignore images and underperform standard instruct mode. Short lookback phrases correlate with better visual grounding. The proposed method improves MMMU performance, especially in weak categories, and outperforms baselines.

Conclusion: Uncertainty-guided lookback decoding sets new SOTA under fixed model families and token budgets, and generalizes well across multiple benchmarks including multimodal suites and math-focused visual reasoning datasets.

Abstract: Test-time thinking (that is, generating explicit intermediate reasoning chains) is known to boost performance in large language models and has recently shown strong gains for large vision-language models (LVLMs). However, despite these promising results, there is still no systematic analysis of how thinking actually affects visual reasoning. We provide the first such analysis with a large-scale, controlled comparison of thinking for LVLMs, evaluating ten variants from the InternVL3.5 and Qwen3-VL families on MMMU-val under generous token budgets and multi-pass decoding. We show that more thinking is not always better; long chains often yield long wrong trajectories that ignore the image and underperform the same models run in standard instruct mode. A deeper analysis reveals that certain short lookback phrases, which explicitly refer back to the image, are strongly enriched in successful trajectories and correlate with better visual grounding. Building on this insight, we propose uncertainty-guided lookback, a training-free decoding strategy that combines an uncertainty signal with adaptive lookback prompts and breadth search. Our method improves overall MMMU performance, delivers the largest gains in categories where standard thinking is weak, and outperforms several strong decoding baselines, setting a new state of the art under fixed model families and token budgets. We further show that this decoding strategy generalizes, yielding consistent improvements on five additional benchmarks, including two broad multimodal suites and math-focused visual reasoning datasets.
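The core idea of combining an uncertainty signal with lookback prompts can be sketched as follows: measure the entropy of the next-token distribution and, when it exceeds a threshold, insert a short phrase that points the model back to the image. The summary does not specify the paper's uncertainty signal, phrases, or breadth search, so everything here is an assumption: the entropy criterion, the phrase, the threshold, and the use of precomputed `(token, probabilities)` pairs in place of a real model's decoding loop.

```python
import math

# Hypothetical lookback phrase; the paper's actual phrases are not given here.
LOOKBACK_PROMPT = "Looking back at the image,"


def token_entropy(probs):
    """Shannon entropy (in nats) of a next-token probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def decode_with_lookback(steps, threshold):
    """Insert a lookback phrase before any token predicted with high entropy.

    `steps` is a list of (token, next-token distribution) pairs standing in
    for a real decoder. A real system would re-run the model after inserting
    the phrase instead of reusing the precomputed token.
    """
    output = []
    for token, probs in steps:
        if token_entropy(probs) > threshold:
            output.append(LOOKBACK_PROMPT)
        output.append(token)
    return output


steps = [
    ("The", [0.97, 0.01, 0.01, 0.01]),      # confident prediction
    ("answer", [0.25, 0.25, 0.25, 0.25]),   # maximally uncertain over 4 options
    ("is", [0.9, 0.1, 0.0, 0.0]),           # fairly confident
]
result = decode_with_lookback(steps, threshold=1.0)
```

Triggering the phrase only at uncertain steps keeps the token budget close to that of plain decoding, which matters because the reported comparison is made under fixed token budgets.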

[160] FLaTEC: Frequency-Disentangled Latent Triplanes for Efficient Compression of LiDAR Point Clouds

Xiaoge Zhang, Zijie Wu, Mingtao Feng, Zichen Geng, Mehwish Nasim, Saeed Anwar, Ajmal Mian

Main category: cs.CV

TL;DR: FLaTEC is a frequency-aware point cloud compression model that decouples low-frequency structures and high-frequency textures using latent triplanes, achieving state-of-the-art compression performance.

Details

Motivation: Balancing compression ratio and reconstruction quality is challenging because low-frequency and high-frequency components contribute differently at the same resolution in point cloud compression.

Method: Proposes frequency-aware mechanism using latent triplanes to decouple low/high-frequency components, with frequency-disentangling technique and frequency-based attention to preserve 3D correlation.

Result: Achieves 78% and 94% BD-rate improvement over standard codecs on SemanticKITTI and Ford datasets respectively, with state-of-the-art rate-distortion performance.

Conclusion: FLaTEC enables efficient full-scan point cloud compression with high compression ratios by frequency-aware decomposition and triplane representation.

Abstract: Point cloud compression methods jointly optimize bitrates and reconstruction distortion. However, balancing compression ratio and reconstruction quality is difficult because low-frequency and high-frequency components contribute differently at the same resolution. To address this, we propose FLaTEC, a frequency-aware compression model that enables the compression of a full scan with high compression ratios. Our approach introduces a frequency-aware mechanism that decouples low-frequency structures and high-frequency textures, while hybridizing latent triplanes as a compact proxy for point clouds. Specifically, we convert voxelized embeddings into triplane representations to reduce sparsity, computational cost, and storage requirements. We then devise a frequency-disentangling technique that extracts compact low-frequency content while collecting high-frequency details across scales. The decoupled low-frequency and high-frequency components are stored in binary format. During decoding, full-spectrum signals are progressively recovered via a modulation block. Additionally, to compensate for the loss of 3D correlation, we introduce an efficient frequency-based attention mechanism that fosters local connectivity and outputs arbitrary resolution points. Our method achieves state-of-the-art rate-distortion performance and outperforms the standard codecs by 78% and 94% in BD-rate on the SemanticKITTI and Ford datasets, respectively.

[161] While recognizing actions, LMMs struggle to detect core interaction events

Daniel Harari, Michael Sidorov, Liel David, Chen Shterental, Abrham Kahsay Gebreselasie, Muhammad Haris Khan

Main category: cs.CV

TL;DR: LMMs fail to accurately identify when and where physical interactions begin/end in videos, despite being able to describe objects and actions, revealing a lack of perceptual grounding.

Details

Motivation: To explore whether large multi-modal models truly ground their semantic understanding in visual input by testing their ability to locate core interaction events in videos.

Method: Created a large-scale dataset with 20K+ annotated interactions from Something-Something-V2, labeled by 250 human annotators for contact/release events. Tested two LMMs (Qwen-2.5VL and GPT-4o) on locating these events in short videos.

Result: Models could reliably name objects, identify actions, and provide coherent reasoning, but consistently failed to identify the exact frame of interaction events and could not localize events within scenes.

Conclusion: LMMs lack the perceptual grounding required for deeper understanding of dynamic scenes, struggling to pinpoint the moment and location of physical contact that defines interactions.

Abstract: Large multi-modal models (LMMs) show increasing performance in realistic visual tasks for images and, more recently, for videos. For example, given a video sequence, such models are able to describe in detail objects, the surroundings and dynamic actions. In this study, we explored the extent to which these models ground their semantic understanding in the actual visual input. Specifically, given sequences of hands interacting with objects, we asked models when and where the interaction begins or ends. For this purpose, we introduce a first of its kind, large-scale dataset with more than 20K annotated interactions on videos from the Something-Something-V2 dataset. 250 AMTurk human annotators labeled core interaction events, particularly when and where objects and agents become attached (‘contact’) or detached (‘release’). We asked two LMMs (Qwen-2.5VL and GPT-4o) to locate these events in short videos, each with a single event. The results show that although the models can reliably name the target objects, identify the action and provide coherent reasoning, they consistently fail to identify the frame where the interaction begins or ends and cannot localize the event within the scene. Our findings suggest that in struggling to pinpoint the moment and location of physical contact that defines the interaction, the models lack the perceptual grounding required for deeper understanding of dynamic scenes.

[162] PRADA: Probability-Ratio-Based Attribution and Detection of Autoregressive-Generated Images

Simon Damm, Jonas Ricker, Henning Petzka, Asja Fischer

Main category: cs.CV

TL;DR: PRADA is a simple, interpretable method that detects autoregressive-generated images and attributes them to their source models using probability ratio analysis.

Details

Motivation: There's a lack of work specifically targeting detection of images generated by autoregressive image generators, which are becoming increasingly realistic and powerful.

Method: Analyzes the ratio of a model’s conditional and unconditional probability for the autoregressive token sequence representing an image, exploiting unique characteristics in probability ratios for detection and attribution.

Result: Highly effective against eight class-to-image and four text-to-image models, demonstrating reliable detection and attribution capabilities.

Conclusion: PRADA provides a simple yet powerful approach for detecting and attributing autoregressive-generated images through probability ratio analysis.

Abstract: Autoregressive (AR) image generation has recently emerged as a powerful paradigm for image synthesis. Leveraging the generation principle of large language models, they allow for efficiently generating deceptively real-looking images, further increasing the need for reliable detection methods. However, to date there is a lack of work specifically targeting the detection of images generated by AR image generators. In this work, we present PRADA (Probability-Ratio-Based Attribution and Detection of Autoregressive-Generated Images), a simple and interpretable approach that can reliably detect AR-generated images and attribute them to their respective source model. The key idea is to inspect the ratio of a model’s conditional and unconditional probability for the autoregressive token sequence representing a given image. Whenever an image is generated by a particular model, its probability ratio shows unique characteristics which are not present for images generated by other models or real images. We exploit these characteristics for threshold-based attribution and detection by calibrating a simple, model-specific score function. Our experimental evaluation shows that PRADA is highly effective against eight class-to-image and four text-to-image models.
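
The probability-ratio idea can be illustrated with a short sketch. It assumes per-token conditional and unconditional probabilities have already been obtained from an AR generator for the token sequence of an image. The function names, the summed log-ratio score, and the threshold are illustrative assumptions, not PRADA's calibrated, model-specific score function.

```python
import math


def log_prob_ratio(cond_probs, uncond_probs):
    """Sum over tokens of log p(token | condition) - log p(token).

    Both inputs are per-token probabilities for the same token sequence.
    """
    if len(cond_probs) != len(uncond_probs):
        raise ValueError("probability sequences must have equal length")
    return sum(math.log(c) - math.log(u) for c, u in zip(cond_probs, uncond_probs))


def exceeds_threshold(cond_probs, uncond_probs, threshold):
    """Hypothetical threshold test: flag the image if the score is high."""
    return log_prob_ratio(cond_probs, uncond_probs) > threshold


# Toy values (not from the paper): the condition makes both tokens more likely.
score = log_prob_ratio([0.5, 0.4], [0.25, 0.1])
print(score, exceeds_threshold([0.5, 0.4], [0.25, 0.1], threshold=1.0))
```

Working in log space keeps the score numerically stable for long token sequences, where a direct product of probabilities would underflow.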

[163] Learning Procedural-aware Video Representations through State-Grounded Hierarchy Unfolding

Jinghan Zhao, Yifei Huang, Feng Lu

Main category: cs.CV

TL;DR: The paper introduces a Task-Step-State (TSS) framework that uses ‘states’ as visually-grounded semantic anchors to improve procedural-aware video representations, outperforming baselines on multiple downstream tasks.

Details

Motivation: Existing methods align visual content with abstract 'task' and 'step' descriptions, but these fail to form robust alignment with concrete visual details due to their high level of abstraction.

Method: Proposes a Task-Step-State framework with progressive pre-training that unfolds the TSS hierarchy, forcing models to ground representations in observable states while associating them with steps and tasks.

Result: Outperforms baseline models on COIN and CrossTask datasets for task recognition, step recognition, and next step prediction. State supervision is identified as key performance driver.

Conclusion: The TSS framework with progressive pretraining effectively grounds abstract procedures in observable visual details, proving more effective than standard joint training for enforcing hierarchical structure.

Abstract: Learning procedural-aware video representations is a key step towards building agents that can reason about and execute complex tasks. Existing methods typically address this problem by aligning visual content with textual descriptions at the task and step levels to inject procedural semantics into video representations. However, due to their high level of abstraction, ‘task’ and ‘step’ descriptions fail to form a robust alignment with the concrete, observable details in visual data. To address this, we introduce ‘states’, i.e., textual snapshots of object configurations, as a visually-grounded semantic layer that anchors abstract procedures to what a model can actually see. We formalize this insight in a novel Task-Step-State (TSS) framework, where tasks are achieved via steps that drive transitions between observable states. To enforce this structure, we propose a progressive pre-training strategy that unfolds the TSS hierarchy, forcing the model to ground representations in states while associating them with steps and high-level tasks. Extensive experiments on the COIN and CrossTask datasets show that our method outperforms baseline models on multiple downstream tasks, including task recognition, step recognition, and next step prediction. Ablation studies show that introducing state supervision is a key driver of performance gains across all tasks. Additionally, our progressive pretraining strategy proves more effective than standard joint training, as it better enforces the intended hierarchical structure.

Chu Chen, Aitor Artola, Yang Liu, Se Weon Park, Raymond H. Chan, Jean-Michel Morel, Kannie W. Y. Chan

Main category: cs.CV

TL;DR: Proposes BALD, a blind adaptive local denoising method for CEST MRI that addresses heteroscedastic noise without prior noise knowledge, improving quantitative contrast mapping accuracy.

Details

Motivation: Clinical translation of CEST MRI is hindered by spatially varying noise from hardware limitations and complex protocols, which traditional denoising methods cannot handle without altering critical biomedical information.

Method: BALD uses self-similarity in CEST data to create adaptive variance-stabilizing transform, performs two-stage denoising on linear-transformed data, and employs local SVD decomposition to prevent spatial/spectral artifacts.

Result: Extensive validation on phantoms and in vivo scans showed BALD consistently outperformed state-of-the-art CEST denoisers in denoising metrics and downstream tasks like molecular concentration mapping and cancer detection.

Conclusion: BALD effectively addresses heteroscedastic noise in CEST MRI without prior noise knowledge, enabling more accurate quantitative contrast mapping and improving clinical utility.

Abstract: Chemical Exchange Saturation Transfer (CEST) MRI enables molecular-level visualization of low-concentration metabolites by leveraging proton exchange dynamics. However, its clinical translation is hindered by inherent challenges: spatially varying noise arising from hardware limitations and complex imaging protocols introduces heteroscedasticity in CEST data, perturbing the accuracy of quantitative contrast mapping such as amide proton transfer (APT) imaging. Traditional denoising methods are not designed for this complex noise and often alter the underlying information that is critical for biomedical analysis. To overcome these limitations, we propose a new Blind Adaptive Local Denoising (BALD) method. BALD exploits the self-similar nature of CEST data to derive an adaptive variance-stabilizing transform that equalizes the noise distributions across CEST pixels without prior knowledge of noise characteristics. Then, BALD performs two-stage denoising on a linear transformation of data to disentangle molecular signals from noise. A local SVD is used as the linear transform to prevent spatial and spectral denoising artifacts. We conducted extensive validation experiments on multiple phantoms and in vivo CEST scans. In these experiments, BALD consistently outperformed state-of-the-art CEST denoisers in both denoising metrics and downstream tasks such as molecular concentration map estimation and cancer detection.

[165] OmniAlpha: A Sequence-to-Sequence Framework for Unified Multi-Task RGBA Generation

Hao Yu, Jiabo Zhan, Zile Wang, Jinglin Wang, Huaisong Zhang, Hongyu Li, Xinrui Chen, Yongxian Wei, Chun Yuan

Main category: cs.CV

TL;DR: OmniAlpha is the first unified multi-task framework for RGBA image generation and editing, outperforming specialized models across 21 diverse tasks with novel architecture and dataset.

Details

Motivation: Address the fragmentation between specialized RGBA models (lacking versatility) and unified RGB frameworks (lacking RGBA support) by creating a unified multi-task approach for RGBA manipulation.

Method: Proposed OmniAlpha with MSRoPE-BiL, a novel RoPE method for DiT backbone enabling concurrent processing of multiple RGBA layers, trained on AlphaLayers dataset of 1,000 high-quality multi-layer triplets.

Result: Achieved 84.8% relative reduction in SAD for mask-free matting on AIM-500, won over 90% human preferences in layer-conditioned completion, and consistently outperformed specialized baselines across 21 tasks.

Conclusion: A unified multi-task model can learn superior shared representation for RGBA, paving the way for more powerful layer-aware generative systems.

Abstract: Generative models have excelled in RGB synthesis, but real-world applications require RGBA manipulation. This has led to a fragmented landscape: specialized, single-task models handle alpha but lack versatility, while unified multi-task frameworks are confined to the RGB domain. To bridge this critical gap, we propose OmniAlpha, the first unified, multi-task generative framework for sequence-to-sequence RGBA image generation and editing. Its architecture features MSRoPE-BiL, a novel RoPE method with a bi-directionally extendable layer axis for its Diffusion Transformer (DiT) backbone, enabling the concurrent processing of multiple input and target RGBA layers. To power this framework, we introduce AlphaLayers, a new dataset of 1,000 high-quality, multi-layer triplets, built via a novel automated synthesis and filter pipeline. Jointly training OmniAlpha on this dataset across a comprehensive suite of 21 diverse tasks, extensive experiments demonstrate that our unified approach consistently outperforms strong, specialized baselines. Most notably, OmniAlpha achieves a dramatic 84.8% relative reduction in SAD for mask-free matting on AIM-500 and wins over 90% of human preferences in layer-conditioned completion. Our work proves that a unified, multi-task model can learn a superior shared representation for RGBA, paving the way for more powerful, layer-aware generative systems.
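
For readers unfamiliar with the matting metric, the sketch below shows how SAD (sum of absolute differences between predicted and ground-truth alpha values) and a relative reduction are computed. The alpha mattes are assumed to be flattened into plain lists, and the SAD values 100.0 and 15.2 are made-up numbers chosen only to reproduce an 84.8% reduction; they are not results from the paper.

```python
def sad(pred_alpha, true_alpha):
    """Sum of absolute differences between two flattened alpha mattes."""
    if len(pred_alpha) != len(true_alpha):
        raise ValueError("alpha mattes must have the same number of pixels")
    return sum(abs(p - t) for p, t in zip(pred_alpha, true_alpha))


def relative_reduction(baseline, ours):
    """Fraction by which `ours` lowers an error metric relative to `baseline`."""
    return (baseline - ours) / baseline


# Toy three-pixel matte: only the middle pixel differs, by 0.5.
print(sad([0.0, 0.5, 1.0], [0.0, 1.0, 1.0]))

# Hypothetical SAD values, used only to illustrate the arithmetic.
print(f"{relative_reduction(100.0, 15.2):.1%}")
```

Lower SAD is better, so a relative reduction close to 100% means the error has almost vanished compared with the baseline.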

[166] WPT: World-to-Policy Transfer via Online World Model Distillation

Guangfeng Jiang, Yueru Luo, Jun Liu, Yi Huang, Yiyao Zhu, Zhan Qu, Dave Zhenyu Chen, Bingbing Liu, Xu Yan

Main category: cs.CV

TL;DR: WPT introduces a World-to-Policy Transfer training paradigm that uses world model guidance to distill knowledge into a lightweight student policy, achieving state-of-the-art performance with faster inference.

Details

Motivation: To overcome limitations of existing world model approaches that suffer from tight runtime coupling or dependency on offline reward signals, which cause substantial inference overhead and hinder end-to-end optimization.

Method: Developed a trainable reward model that infuses world knowledge into a teacher policy by aligning trajectories with future dynamics predicted by the world model, then used policy distillation and world reward distillation to transfer reasoning ability to a lightweight student policy.

Result: Achieved 0.11 collision rate (open-loop) and 79.23 driving score (closed-loop), surpassing both world-model-based and imitation-learning methods in accuracy and safety. Student policy sustained up to 4.9x faster inference while retaining most performance gains.

Conclusion: WPT enables efficient knowledge transfer from world models to lightweight policies, achieving superior planning performance while maintaining real-time deployability through faster inference speeds.

Abstract: Recent years have witnessed remarkable progress in world models, which primarily aim to capture the spatio-temporal correlations between an agent’s actions and the evolving environment. However, existing approaches often suffer from tight runtime coupling or depend on offline reward signals, resulting in substantial inference overhead or hindering end-to-end optimization. To overcome these limitations, we introduce WPT, a World-to-Policy Transfer training paradigm that enables online distillation under the guidance of an end-to-end world model. Specifically, we develop a trainable reward model that infuses world knowledge into a teacher policy by aligning candidate trajectories with the future dynamics predicted by the world model. Subsequently, we propose policy distillation and world reward distillation to transfer the teacher’s reasoning ability into a lightweight student policy, enhancing planning performance while preserving real-time deployability. Extensive experiments on both open-loop and closed-loop benchmarks show that our WPT achieves state-of-the-art performance with a simple policy architecture: it attains a 0.11 collision rate (open-loop) and achieves a 79.23 driving score (closed-loop) surpassing both world-model-based and imitation-learning methods in accuracy and safety. Moreover, the student sustains up to 4.9x faster inference, while retaining most of the gains.

[167] Uplifting Table Tennis: A Robust, Real-World Application for 3D Trajectory and Spin Estimation

Daniel Kienzle, Katja Ludwig, Julian Lorenz, Shin’ichi Satoh, Rainer Lienhart

Main category: cs.CV

TL;DR: A two-stage pipeline for 3D table tennis ball motion estimation from monocular videos, using 2D supervision from real data and 3D uplifting from synthetic data.

Details

Motivation: Existing methods trained on synthetic data fail to generalize to real-world videos due to lack of 3D ground truth and spin annotations for real data.

Method: Two-stage pipeline: front-end perception with 2D supervision from TTHQ dataset, and back-end 2D-to-3D uplifting trained on synthetic data with robustness to real-world artifacts.

Result: Transforms proof-of-concept uplifting into practical, robust end-to-end application for 3D trajectory and spin analysis.

Conclusion: The proposed approach effectively bridges the gap between synthetic training and real-world application for table tennis motion analysis.

Abstract: Obtaining the precise 3D motion of a table tennis ball from standard monocular videos is a challenging problem, as existing methods trained on synthetic data struggle to generalize to the noisy, imperfect ball and table detections of the real world. This is primarily due to the inherent lack of 3D ground truth trajectories and spin annotations for real-world video. To overcome this, we propose a novel two-stage pipeline that divides the problem into a front-end perception task and a back-end 2D-to-3D uplifting task. This separation allows us to train the front-end components with abundant 2D supervision from our newly created TTHQ dataset, while the back-end uplifting network is trained exclusively on physically-correct synthetic data. We specifically re-engineer the uplifting model to be robust to common real-world artifacts, such as missing detections and varying frame rates. By integrating a ball detector and a table keypoint detector, our approach transforms a proof-of-concept uplifting method into a practical, robust, and high-performing end-to-end application for 3D table tennis trajectory and spin analysis.

[168] Exploring State-of-the-art models for Early Detection of Forest Fires

Sharjeel Ahmed, Daim Armaghan, Fatima Naweed, Umair Yousaf, Ahmad Zubair, Murtaza Taj

Main category: cs.CV

TL;DR: This paper proposes a new dataset for early forest fire detection using synthetic data from game simulators combined with existing images, and compares YOLOv7 and detection transformer models for fire and smoke detection.

Details

Motivation: Existing fire detection methods suffer from missed detection due to lack of sizeable datasets and models specifically tuned for early warning of forest fires, particularly for detecting initial smoke plumes and fire initiation rather than widespread fires.

Method: Created a synthetic dataset using game simulators (Red Dead Redemption 2) to generate images of smoke plumes and initial fire stages, combined with existing published images. Compared image classification and localization methods including YOLOv7 and various detection transformer models.

Result: The paper presents a comprehensive dataset for early forest fire detection that includes multiple instances of smoke plumes and initial fire indicators, addressing the gap in existing datasets that mainly contain images of widespread fires.

Conclusion: The proposed synthetic dataset combined with existing images provides a valuable resource for developing early warning systems for forest fires, and the comparison of YOLOv7 and detection transformers shows promise for improving detection accuracy of initial fire stages.

Abstract: There have been many recent developments in the use of Deep Learning Neural Networks for fire detection. In this paper, we explore an early warning system for detection of forest fires. Due to the lack of sizeable datasets and models tuned for this task, existing methods suffer from missed detection. In this work, we first propose a dataset for early identification of forest fires through visual analysis. Unlike existing image corpuses that contain images of widespread fire, our dataset consists of multiple instances of smoke plumes and fire that indicate the initiation of fire. We obtained this dataset synthetically by utilising game simulators such as Red Dead Redemption 2. We also combined our dataset with already published images to obtain a more comprehensive set. Finally, we compared image classification and localisation methods on the proposed dataset. More specifically, we used YOLOv7 (You Only Look Once) and different models of detection transformer.

[169] XiCAD: Camera Activation Detection in the Da Vinci Xi User Interface

Alexander C. Jenke, Gregor Just, Claas de Boer, Martin Wagner, Sebastian Bodenstedt, Stefanie Speidel

Main category: cs.CV

TL;DR: A lightweight ResNet18-based pipeline detects camera activation state and position in DaVinci Xi surgical videos with near-perfect accuracy, enabling automated metadata extraction for surgical data science applications.

Details

Motivation: Robot-assisted surgery relies on endoscopic video as the only visual feedback. Detecting camera activation provides valuable metadata for downstream tasks like tool tracking, skill assessment, and camera control automation.

Method: Developed a lightweight pipeline using a ResNet18 CNN to identify camera tile position and activation state, fine-tuned on manually annotated data from SurgToolLoc dataset and evaluated on three public datasets with over 70,000 frames.

Result: Achieved F1-scores between 0.993 and 1.000 for binary detection of active cameras, correctly localized camera tile in all cases without false multiple-camera detections.

Conclusion: The pipeline enables reliable, real-time extraction of camera activation metadata from surgical videos, facilitating automated preprocessing and analysis for diverse downstream applications. All code, models, and annotations are publicly available.

Abstract: Purpose: Robot-assisted minimally invasive surgery relies on endoscopic video as the sole intraoperative visual feedback. The DaVinci Xi system overlays a graphical user interface (UI) that indicates the state of each robotic arm, including the activation of the endoscope arm. Detecting this activation provides valuable metadata such as camera movement information, which can support downstream surgical data science tasks including tool tracking, skill assessment, or camera control automation. Methods: We developed a lightweight pipeline based on a ResNet18 convolutional neural network to automatically identify the position of the camera tile and its activation state within the DaVinci Xi UI. The model was fine-tuned on manually annotated data from the SurgToolLoc dataset and evaluated across three public datasets comprising over 70,000 frames. Results: The model achieved F1-scores between 0.993 and 1.000 for the binary detection of active cameras and correctly localized the camera tile in all cases without false multiple-camera detections. Conclusion: The proposed pipeline enables reliable, real-time extraction of camera activation metadata from surgical videos, facilitating automated preprocessing and analysis for diverse downstream applications. All code, trained models, and annotations are publicly available.

[170] Multi Head Attention Enhanced Inception v3 for Cardiomegaly Detection

Abishek Karthik, Pandiyaraju V

Main category: cs.CV

TL;DR: Deep learning with CNN and attention mechanisms for automatic cardiomegaly detection from X-ray images achieves high accuracy (95.6%) and clinical relevance.

Details

Motivation: Revolutionize healthcare through automated detection of cardiomegaly using advanced imaging technologies and deep learning to improve cardiovascular disease diagnosis.

Method: Integrated approach using CNN with Inception V3 model and multi-head attention mechanism for selective feature learning from X-ray images, with preprocessing for image quality enhancement.

Result: Model achieved 95.6% accuracy, 95.2% precision, 96.2% recall, 95.7% sensitivity, 96.1% specificity, and 96.0 AUC, demonstrating strong performance in cardiomegaly detection.

Conclusion: The proposed deep learning system with attention mechanisms successfully enables accurate and clinically significant automatic detection of cardiomegaly from X-ray images.

Abstract: Novel imaging technologies have significantly changed the healthcare industry, not only in the diagnosis of cardiovascular diseases but also in the visualization of structural abnormalities such as cardiomegaly. This article describes an integrated approach that uses deep learning tools and attention mechanisms for automatic detection of cardiomegaly from X-ray images. The project begins with a data collection phase that gathers annotated X-ray images of various types. A preprocessing module then improves image quality so that the proposed system can make the most of the data. The proposed system is a CNN configuration that uses the Inception V3 model as one of its key blocks, together with a multilayer attention mechanism to strengthen the model. The most important feature of the method is the multi-head attention mechanism, which learns features automatically. By selectively focusing on only some regions of the input, the model can identify cardiomegaly in a sensitive manner. An attention rating is calculated, duplicated, and applied to enhance the representation of the main data, which supports a successful diagnosis. The evaluation stage is strict and thoroughly assesses the model on measures such as accuracy and precision, validating that the model can identify cardiomegaly and showing the clinical significance of the method. The model has an accuracy of 95.6, precision of 95.2, recall of 96.2, sensitivity of 95.7, specificity of 96.1, and an Area Under Curve (AUC) of 96.0, and the respective graphs are plotted for visualisation.
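
As a reference for the metrics listed above, here is a minimal sketch of how they follow from a binary confusion matrix. The counts are hypothetical and are not taken from the paper. Under these standard definitions, recall and sensitivity are the same quantity.

```python
def binary_metrics(tp, fp, tn, fn):
    """Standard binary classification metrics from confusion-matrix counts."""
    total = tp + fp + tn + fn
    return {
        "accuracy": (tp + tn) / total,
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),  # also called sensitivity
        "specificity": tn / (tn + fp),
    }


# Hypothetical counts for illustration only.
for name, value in binary_metrics(tp=90, fp=10, tn=80, fn=20).items():
    print(f"{name}: {value:.3f}")
```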

[171] UltraViCo: Breaking Extrapolation Limits in Video Diffusion Transformers

Min Zhao, Hongzhou Zhu, Yingze Wang, Bokai Yan, Jintao Zhang, Guande He, Ling Yang, Chongxuan Li, Jun Zhu

Main category: cs.CV

TL;DR: UltraViCo addresses video length extrapolation in diffusion transformers by suppressing attention dispersion beyond training windows, achieving 4x extrapolation with significant quality improvements.

Details

Motivation: Video diffusion transformers struggle to generalize beyond training length due to attention dispersion causing periodic content repetition and quality degradation.

Method: Proposes UltraViCo, a training-free plug-and-play method that suppresses attention for tokens beyond training window using constant decay factor.

Result: Outperforms baselines across models and extrapolation ratios, pushing limit from 2x to 4x with 233% and 40.5% improvements in Dynamic Degree and Imaging Quality.

Conclusion: Method effectively addresses both failure modes of video length extrapolation and generalizes to downstream tasks like controllable video synthesis and editing.

Abstract: Despite advances, video diffusion transformers still struggle to generalize beyond their training length, a challenge we term video length extrapolation. We identify two failure modes: model-specific periodic content repetition and a universal quality degradation. Prior works attempt to solve repetition via positional encodings, overlooking quality degradation and achieving only limited extrapolation. In this paper, we revisit this challenge from a more fundamental view: attention maps, which directly govern how context influences outputs. We identify that both failure modes arise from a unified cause: attention dispersion, where tokens beyond the training window dilute learned attention patterns. This leads to quality degradation; repetition emerges as a special case when this dispersion becomes structured into periodic attention patterns, induced by harmonic properties of positional encodings. Building on this insight, we propose UltraViCo, a training-free, plug-and-play method that suppresses attention for tokens beyond the training window via a constant decay factor. By jointly addressing both failure modes, we outperform a broad set of baselines largely across models and extrapolation ratios, pushing the extrapolation limit from 2x to 4x. Remarkably, it improves Dynamic Degree and Imaging Quality by 233% and 40.5% over the previous best method at 4x extrapolation. Furthermore, our method generalizes seamlessly to downstream tasks such as controllable video synthesis and editing.
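
The suppression step can be sketched for a single query as follows. This is an illustration under stated assumptions, not UltraViCo's implementation: "beyond the training window" is taken to mean a query-key distance of at least `train_len`, and the decay is applied to the softmax weights followed by renormalization. The abstract does not say where the factor is applied, and all names here are hypothetical.

```python
import math


def decayed_attention(scores, query_pos, train_len, decay=0.5):
    """Attention weights for one query, with out-of-window keys suppressed.

    scores    : raw attention logits, one per key position
    query_pos : index of the query token
    train_len : assumed training window length (in tokens)
    decay     : constant factor in (0, 1] applied to out-of-window keys
    """
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]  # numerically stable softmax numerators
    scaled = [
        e * (decay if abs(i - query_pos) >= train_len else 1.0)
        for i, e in enumerate(exps)
    ]
    total = sum(scaled)
    return [v / total for v in scaled]


# Four keys with equal logits; keys 2 and 3 lie outside a window of length 2.
print(decayed_attention([0.0, 0.0, 0.0, 0.0], query_pos=0, train_len=2))
```

With equal logits, the in-window keys end up with twice the weight of the suppressed ones, which shows how a constant factor concentrates attention back on the trained range.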

[172] Vision-Language Models for Automated 3D PET/CT Report Generation

Wenpei Jiao, Kun Shang, Hui Li, Ke Yan, Jiajin Zhang, Guangjie Yang, Lijuan Guo, Yan Wan, Xing Yang, Dakai Jin, Zhaoheng Xie

Main category: cs.CV

TL;DR: PETRG-3D is a 3D dual-branch framework for automated PET/CT report generation that addresses challenges in functional imaging through volumetric modeling and style-adaptive prompts to handle inter-hospital variability.

Details

Motivation: The rapid expansion of PET/CT scanners has outpaced specialist availability, creating clinical workload pressures. Functional PET imaging poses unique challenges compared to structural imaging due to metabolic pattern variations and the need for whole-body 3D contextual analysis.

Method: Proposed PETRG-3D framework with separate PET and CT volume encoding, style-adaptive prompts to mitigate hospital variability, and constructed PETRG-Lym dataset (824 reports) with AutoPET-RG-Lym benchmark (135 cases). Introduced PETRG-Score for lymphoma-specific clinical evaluation.

Result: PETRG-3D substantially outperforms existing methods: +31.49% ROUGE-L in natural language metrics and +8.18% PET-All in clinical efficacy metrics, demonstrating benefits of volumetric dual-modality modeling and style-aware prompting.

Conclusion: This work establishes foundation for PET/CT-specific models emphasizing disease-aware reasoning and clinically reliable evaluation, with codes, models, and AutoPET-RG-Lym benchmark to be released publicly.

Abstract: Positron emission tomography/computed tomography (PET/CT) is essential in oncology, yet the rapid expansion of scanners has outpaced the availability of trained specialists, making automated PET/CT report generation (PETRG) increasingly important for reducing clinical workload. Compared with structural imaging (e.g., X-ray, CT, and MRI), functional PET poses distinct challenges: metabolic patterns vary with tracer physiology, and whole-body 3D contextual information is required rather than local-region interpretation. To advance PETRG, we propose PETRG-3D, an end-to-end 3D dual-branch framework that separately encodes PET and CT volumes and incorporates style-adaptive prompts to mitigate inter-hospital variability in reporting practices. We construct PETRG-Lym, a multi-center lymphoma dataset collected from four hospitals (824 reports with 245,509 paired PET/CT slices), and construct AutoPET-RG-Lym, a publicly accessible PETRG benchmark derived from open imaging data but equipped with new expert-written, clinically validated reports (135 cases). To assess clinical utility, we introduce PETRG-Score, a lymphoma-specific evaluation protocol that jointly measures metabolic and structural findings across curated anatomical regions. Experiments show that PETRG-3D substantially outperforms existing methods on both natural language metrics (e.g., +31.49% ROUGE-L) and clinical efficacy metrics (e.g., +8.18% PET-All), highlighting the benefits of volumetric dual-modality modeling and style-aware prompting. Overall, this work establishes a foundation for future PET/CT-specific models emphasizing disease-aware reasoning and clinically reliable evaluation. Codes, models, and AutoPET-RG-Lym will be released.

[173] Prompting Lipschitz-constrained network for multiple-in-one sparse-view CT reconstruction

Baoshun Shi, Ke Jiang, Qiusheng Lian, Xinran Yu, Huazhu Fu

Main category: cs.CV

TL;DR: PromptCT is a storage-efficient deep unfolding framework for multiple-in-one sparse-view CT reconstruction that uses a Lipschitz-constrained network (LipNet) and explicit prompts to handle different sampling settings in a single model.

Details

Motivation: Address limitations of current deep learning SVCT methods: inability to explicitly prove Lipschitz constraints for prior networks, and high storage costs from training separate models for each view setting.

Method: Develops LipNet, an explicitly provable Lipschitz-constrained network; integrates an explicit prompt module that provides discriminative knowledge of the different sparse sampling settings; and builds the PromptCT framework, which embeds LipNet as its prior network.

Result: Outperforms benchmark algorithms in simulated and real data experiments, achieving higher-quality reconstructions with lower storage costs for multiple-in-one SVCT reconstruction.

Conclusion: The proposed PromptCT framework effectively addresses storage and convergence issues in SVCT reconstruction, with theoretical guarantees of Lipschitz continuity and algorithm convergence.

Abstract: Despite significant advancements in deep learning-based sparse-view computed tomography (SVCT) reconstruction algorithms, these methods still encounter two primary limitations: (i) It is challenging to explicitly prove that the prior networks of deep unfolding algorithms satisfy Lipschitz constraints due to their empirically designed nature. (ii) The substantial storage costs of training a separate model for each setting in the case of multiple views hinder practical clinical applications. To address these issues, we elaborate an explicitly provable Lipschitz-constrained network, dubbed LipNet, and integrate an explicit prompt module to provide discriminative knowledge of different sparse sampling settings, enabling the treatment of multiple sparse view configurations within a single model. Furthermore, we develop a storage-saving deep unfolding framework for multiple-in-one SVCT reconstruction, termed PromptCT, which embeds LipNet as its prior network to ensure the convergence of its corresponding iterative algorithm. In simulated and real data experiments, PromptCT outperforms benchmark reconstruction algorithms in multiple-in-one SVCT reconstruction, achieving higher-quality reconstructions with lower storage costs. On the theoretical side, we explicitly demonstrate that LipNet satisfies boundary property, further proving its Lipschitz continuity and subsequently analyzing the convergence of the proposed iterative algorithms. The data and code are publicly available at https://github.com/shibaoshun/PromptCT.
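To make the notion of a Lipschitz-constrained layer concrete, here is a minimal sketch of one generic construction: a linear map rescaled so that its spectral norm is at most 1, followed by a ReLU, which is itself 1-Lipschitz. This is not the LipNet design, whose details are not given in the summary; the helper names and the power-iteration estimate are assumptions made for illustration.

```python
import math


def matvec(W, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def transpose(W):
    return [list(col) for col in zip(*W)]


def spectral_norm(W, iters=200):
    """Estimate the largest singular value of W by power iteration on W^T W."""
    Wt = transpose(W)
    v = [1.0] * len(W[0])
    for _ in range(iters):
        u = matvec(Wt, matvec(W, v))
        norm = math.sqrt(sum(c * c for c in u))
        if norm == 0.0:
            return 0.0
        v = [c / norm for c in u]
    return math.sqrt(sum(c * c for c in matvec(W, v)))


def lipschitz_layer(W, x, sigma):
    """ReLU(W x / max(1, sigma)): a layer with Lipschitz constant at most 1."""
    scale = 1.0 / max(1.0, sigma)
    return [max(0.0, scale * y) for y in matvec(W, x)]


def l2(a, b):
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))


# Toy weights and inputs (hypothetical values).
W = [[3.0, 0.0], [0.0, 1.0]]
sigma = spectral_norm(W)
x, y = [1.0, 2.0], [-1.0, 0.5]
ratio = l2(lipschitz_layer(W, x, sigma), lipschitz_layer(W, y, sigma)) / l2(x, y)
```

The ratio of output distance to input distance stays at or below 1 for any pair of inputs, which is the kind of bound that convergence analyses of unfolded iterative algorithms rely on.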

[174] Hybrid Convolution and Frequency State Space Network for Image Compression

Haodong Pan, Hao Wei, Yusong Wang, Nanning Zheng, Caigui Jiang

Main category: cs.CV

TL;DR: HCFSSNet is a hybrid image compression model combining CNNs for local details with frequency state space blocks for long-range modeling, achieving competitive performance with fewer parameters.

Details

Motivation: Existing methods like Transformers and SSMs provide good long-range modeling but may lose structural information or ignore frequency characteristics important for compression. CNNs capture local details well but lack long-range capabilities.

Method: Proposes HCFSSNet with Vision Frequency State Space blocks that combine Omni-directional Neighborhood State Space (scans features horizontally, vertically, diagonally) with Adaptive Frequency Modulation Module (content-adaptive DCT frequency weighting). Also uses Frequency Swin Transformer Attention Module for frequency-aware side information modeling.

Result: Achieves competitive rate-distortion performance with significantly fewer parameters than recent SSM-based codecs. Reduces BD rate over VTM anchor by 18.06% (Kodak), 24.56% (Tecnick), and 22.44% (CLIC).

Conclusion: HCFSSNet provides an efficient and interpretable hybrid architecture that effectively combines local detail capture with long-range frequency modeling for learned image compression.

Abstract: Learned image compression (LIC) has recently benefited from Transformer-based and state space model (SSM)-based architectures. Convolutional neural networks (CNNs) effectively capture local high-frequency details, whereas Transformers and SSMs provide strong long-range modeling capabilities but may cause structural information loss or ignore frequency characteristics that are crucial for compression. In this work, we propose HCFSSNet, a Hybrid Convolution and Frequency State Space Network for LIC. HCFSSNet uses CNNs to extract local high-frequency structures and introduces a Vision Frequency State Space (VFSS) block that models long-range low-frequency information. The VFSS block combines an Omni-directional Neighborhood State Space (VONSS) module, which scans features horizontally, vertically and diagonally, with an Adaptive Frequency Modulation Module (AFMM) that applies content-adaptive weighting of discrete cosine transform frequency components for more efficient bit allocation. To further reduce redundancy in the entropy model, we integrate AFMM with a Swin Transformer to form a Frequency Swin Transformer Attention Module (FSTAM) for frequency-aware side information modeling. Experiments on the Kodak, Tecnick and CLIC Professional Validation datasets show that HCFSSNet achieves competitive rate-distortion performance compared with recent SSM-based codecs such as MambaIC, while using significantly fewer parameters. On Kodak, Tecnick and CLIC, HCFSSNet reduces BD-rate over the VTM anchor by 18.06%, 24.56% and 22.44%, respectively, providing an efficient and interpretable hybrid architecture for future learned image compression systems.
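The idea of weighting discrete cosine transform (DCT) frequency components can be illustrated with a small one-dimensional sketch. It assumes an orthonormal DCT-II and a hand-set weight per frequency; in the AFMM described above the weights are content-adaptive, which is not modeled here, and all function names are illustrative.

```python
import math


def dct_ii(x):
    """Orthonormal 1-D DCT-II."""
    n = len(x)
    out = []
    for k in range(n):
        scale = math.sqrt((1.0 if k == 0 else 2.0) / n)
        out.append(scale * sum(
            x[i] * math.cos(math.pi * (i + 0.5) * k / n) for i in range(n)))
    return out


def idct_ii(coeffs):
    """Inverse of the orthonormal DCT-II above."""
    n = len(coeffs)
    out = []
    for i in range(n):
        total = 0.0
        for k in range(n):
            scale = math.sqrt((1.0 if k == 0 else 2.0) / n)
            total += scale * coeffs[k] * math.cos(math.pi * (i + 0.5) * k / n)
        out.append(total)
    return out


def modulate(x, weights):
    """Scale each DCT coefficient by its weight, then transform back."""
    coeffs = dct_ii(x)
    return idct_ii([w * c for w, c in zip(weights, coeffs)])


signal = [1.0, 2.0, 3.0, 4.0]                    # toy feature row
unchanged = modulate(signal, [1.0, 1.0, 1.0, 1.0])
dc_only = modulate(signal, [1.0, 0.0, 0.0, 0.0])  # keep only the mean
```

Setting every weight to 1 reproduces the input, while zeroing all but the first weight leaves only the signal mean. Intermediate weights let a codec spend fewer bits on frequencies that matter less for a given input.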

[175] Restora-Flow: Mask-Guided Image Restoration with Flow Matching

Arnela Hadzic, Franz Thaler, Lea Bogensperger, Simon Johannes Joham, Martin Urschler

Main category: cs.CV

TL;DR: Restora-Flow is a training-free flow matching method for image restoration that uses degradation masks and trajectory correction to achieve fast, high-quality results in inpainting, super-resolution, and denoising.

Details

Motivation: Flow matching offers faster sampling than diffusion models but current methods still have long processing times or produce over-smoothed results in image restoration tasks.

Method: Uses degradation mask guidance and trajectory correction mechanism during flow matching sampling to enforce consistency with degraded inputs, without requiring additional training.

Result: Achieves superior perceptual quality and faster processing time compared to diffusion and flow matching-based methods on natural and medical datasets.

Conclusion: Restora-Flow provides an effective training-free solution for image restoration that combines the speed advantages of flow matching with improved quality through mask guidance and trajectory correction.

Abstract: Flow matching has emerged as a promising generative approach that addresses the lengthy sampling times associated with state-of-the-art diffusion models and enables a more flexible trajectory design, while maintaining high-quality image generation. This capability makes it suitable as a generative prior for image restoration tasks. Although current methods leveraging flow models have shown promising results in restoration, some still suffer from long processing times or produce over-smoothed results. To address these challenges, we introduce Restora-Flow, a training-free method that guides flow matching sampling by a degradation mask and incorporates a trajectory correction mechanism to enforce consistency with degraded inputs. We evaluate our approach on both natural and medical datasets across several image restoration tasks involving a mask-based degradation, i.e., inpainting, super-resolution and denoising. We show superior perceptual quality and processing time compared to diffusion and flow matching-based reference methods.

[176] 3D Motion Perception of Binocular Vision Target with PID-CNN

Shi Jiazhao, Pan Pan, Shi Haotian

Main category: cs.CV

TL;DR: A small PID convolutional neural network with 17 layers and 413K parameters was developed for real-time 3D motion perception of binocular vision targets, achieving near-optimal prediction accuracy on simulated moving ball datasets.

Details

Motivation: To develop a neural network capable of perceiving 3D motion information (coordinates, velocity, acceleration) from binocular vision in real-time, with basic spatiotemporal perception capabilities.

Method: Designed a PID convolutional neural network with 17 layers and 413K parameters, using feature reuse through concatenation and pooling. Analyzed neural networks from a PID perspective, considering single-layer networks as second-order difference equations with nonlinearities.

Result: The network achieved prediction accuracy close to the upper limit of input image resolution on simulated randomly moving ball datasets.

Conclusion: High-dimensional convolution improves computational efficiency and feature space utilization, and PID information has potential advantages for implementing memory and attention mechanisms.

Abstract: This article trains a network that perceives the three-dimensional motion of a binocular vision target; it provides real-time three-dimensional coordinates, velocity, and acceleration, and has a basic spatiotemporal perception capability. The ability of neural networks to fit nonlinear problems is interpreted from the perspective of PID. A single-layer neural network is viewed as using a second-order difference equation and a nonlinearity to describe a local problem, and multilayer networks gradually transform the raw representation into the desired representation through multiple such combinations. Some reference principles for designing neural networks are analysed. A relatively small PID convolutional neural network is designed, with a total of 17 layers and 413 thousand parameters, and a simple but practical feature reuse method is implemented through concatenation and pooling. The network was trained and tested on simulated datasets of randomly moving balls, and the experimental results show that the prediction accuracy is close to the upper limit that the input image resolution can represent. The experimental results and errors are analysed, along with the existing shortcomings and possible directions for improvement. Finally, the article discusses the advantages of high-dimensional convolution in improving computational efficiency and feature space utilization, as well as the potential advantages of using PID information to implement memory and attention mechanisms.
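The link between a second-order difference equation and the predicted quantities (coordinate, velocity, acceleration) can be shown with plain finite differences. The sketch below is a one-dimensional toy under the assumption of uniformly spaced samples; it is not the network from the paper, and the function name is illustrative.

```python
def difference_terms(samples, dt):
    """Position, backward-difference velocity, and second-difference
    acceleration from the last three uniformly spaced samples."""
    if len(samples) < 3:
        raise ValueError("need at least three samples")
    x0, x1, x2 = samples[-3], samples[-2], samples[-1]
    position = x2
    velocity = (x2 - x1) / dt
    acceleration = (x2 - 2.0 * x1 + x0) / (dt * dt)
    return position, velocity, acceleration


# Toy trajectory x(t) = t^2 sampled at t = 0, 1, 2, 3.
terms = difference_terms([0.0, 1.0, 4.0, 9.0], dt=1.0)
```

For this trajectory the second difference recovers the constant acceleration of 2 exactly, while the backward difference gives a velocity of 5 where the true derivative at t = 3 is 6. That gap is a discretization error of the same kind that limits any estimate made from sampled frames.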

[177] Alzheimers Disease Progression Prediction Based on Manifold Mapping of Irregularly Sampled Longitudinal Data

Xin Hong, Ying Shi, Yinhao Li, Yen-Wei Chen

Main category: cs.CV

TL;DR: Proposes R-TNAG framework for Alzheimer’s disease progression modeling from irregularly sampled longitudinal MRI data using Riemannian manifolds, neural ODEs, and attention-based recurrent units.

Details

Motivation: Clinical examinations have irregular observation intervals in longitudinal imaging data, and Euclidean space models fail to capture the intrinsic continuity and nonlinear geometric structure of disease progression.

Method: Projects sMRI features into Riemannian manifold space, uses Time-aware Neural ODE for continuous evolution modeling, and Attention-based Riemannian Gated Recurrent Unit for handling irregular intervals.

Result: Outperforms state-of-the-art models in disease status prediction and cognitive score regression, shows stable performance across varying sequence lengths and missing data rates.

Conclusion: The framework effectively handles irregular sampling in longitudinal data and demonstrates strong temporal generalizability and cross-dataset robustness for AD progression modeling.

Abstract: The uncertainty of clinical examinations frequently leads to irregular observation intervals in longitudinal imaging data, posing challenges for modeling disease progression. Most existing imaging-based disease prediction models operate in Euclidean space, which assumes a flat representation of data and fails to fully capture the intrinsic continuity and nonlinear geometric structure of irregularly sampled longitudinal images. To address the challenge of modeling Alzheimer’s disease (AD) progression from irregularly sampled longitudinal structural Magnetic Resonance Imaging (sMRI) data, we propose a Riemannian manifold mapping, a Time-aware manifold Neural ordinary differential equation, and an Attention-based Riemannian Gated recurrent unit (R-TNAG) framework. Our approach first projects features extracted from high-dimensional sMRI into a manifold space to preserve the intrinsic geometry of disease progression. On this representation, a time-aware Neural Ordinary Differential Equation (TNODE) models the continuous evolution of latent states between observations, while an Attention-based Riemannian Gated Recurrent Unit (ARGRU) adaptively integrates historical and current information to handle irregular intervals. This joint design improves temporal consistency and yields robust AD trajectory prediction under irregular sampling. Experimental results demonstrate that the proposed method consistently outperforms state-of-the-art models in both disease status prediction and cognitive score regression. Ablation studies verify the contributions of each module, highlighting their complementary roles in enhancing predictive accuracy. Moreover, the model exhibits stable performance across varying sequence lengths and missing data rates, indicating strong temporal generalizability. Cross-dataset validation further confirms its robustness and applicability in diverse clinical settings.
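A continuous-time latent model handles irregular visit intervals by integrating the latent state over whatever time gap separates two observations. The sketch below shows this with fixed-step Euler integration in ordinary Euclidean space, using a toy decay rule in place of a learned dynamics network. The Riemannian mapping and the ARGRU update of the paper are not modeled, and all names are illustrative.

```python
import math


def euler_evolve(h, gap, dynamics, max_step=0.001):
    """Advance latent state h across a time gap with Euler steps."""
    n_steps = max(1, math.ceil(gap / max_step))
    step = gap / n_steps
    for _ in range(n_steps):
        h = [hi + step * di for hi, di in zip(h, dynamics(h))]
    return h


def decay(h):
    # Toy stand-in for a learned dynamics network: dh/dt = -h.
    return [-hi for hi in h]


def rollout(h0, times, dynamics):
    """Latent state at each observation time, for arbitrary spacing."""
    states = [h0]
    for t_prev, t_next in zip(times, times[1:]):
        states.append(euler_evolve(states[-1], t_next - t_prev, dynamics))
    return states


# Hypothetical visit times with unequal gaps (0.3 and 0.7).
states = rollout([1.0], [0.0, 0.3, 1.0], decay)
```

Because the gap length enters the integration directly, a long interval between scans produces a larger change in the latent state than a short one, without padding or resampling the sequence.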

[178] From Passive Perception to Active Memory: A Weakly Supervised Image Manipulation Localization Framework Driven by Coarse-Grained Annotations

Zhiqing Guo, Dongdong Xi, Songlin Li, Gaobo Yang

Main category: cs.CV

TL;DR: BoxPromptIML is a weakly-supervised image manipulation localization framework that balances annotation cost and localization accuracy using coarse region annotations and knowledge distillation from SAM.

Details

Motivation: Address the trade-off between annotation cost and localization accuracy in image manipulation localization, where fully-supervised methods require expensive pixel-level masks and weakly-supervised methods lack precise spatial localization.

Method: Uses coarse region annotations for lower cost, employs knowledge distillation from SAM to a lightweight student model, and implements a dual-guidance feature fusion module inspired by human memory mechanisms for dynamic knowledge recollection.

Result: Outperforms or rivals fully-supervised models across in-distribution and out-of-distribution datasets while maintaining strong generalization, low annotation cost, and efficient deployment.

Conclusion: BoxPromptIML effectively balances annotation cost and localization performance, demonstrating superior generalization capabilities and deployment efficiency compared to existing approaches.

Abstract: Image manipulation localization (IML) faces a fundamental trade-off between minimizing annotation cost and achieving fine-grained localization accuracy. Existing fully-supervised IML methods depend heavily on dense pixel-level mask annotations, which limits scalability to large datasets or real-world deployment. In contrast, the majority of existing weakly-supervised IML approaches are based on image-level labels, which greatly reduce annotation effort but typically lack precise spatial localization. To address this dilemma, we propose BoxPromptIML, a novel weakly-supervised IML framework that effectively balances annotation cost and localization performance. Specifically, we propose a coarse region annotation strategy, which can generate relatively accurate manipulation masks at lower cost. To improve model efficiency and facilitate deployment, we further design an efficient lightweight student model, which learns to perform fine-grained localization through knowledge distillation from a fixed teacher model based on the Segment Anything Model (SAM). Moreover, inspired by the human subconscious memory mechanism, our feature fusion module employs a dual-guidance strategy that actively contextualizes recalled prototypical patterns with real-time observational cues derived from the input. Instead of passive feature extraction, this strategy enables a dynamic process of knowledge recollection, where long-term memory is adapted to the specific context of the current image, significantly enhancing localization accuracy and robustness. Extensive experiments across both in-distribution and out-of-distribution datasets show that BoxPromptIML outperforms or rivals fully-supervised models, while maintaining strong generalization, low annotation cost, and efficient deployment characteristics.

[179] Map-World: Masked Action planning and Path-Integral World Model for Autonomous Driving

Bin Hu, Zijian Lu, Haicheng Liao, Chengran Yuan, Bin Rao, Yongkang Li, Guofa Li, Zhiyong Cui, Cheng-zhong Xu, Zhenning Li

Main category: cs.CV

TL;DR: MAP-World is a prior-free multi-modal planning framework that uses masked action planning and path-weighted world modeling to handle multiple plausible futures without handcrafted anchors or reinforcement learning.

Details

Motivation: Existing motion planners typically select a single best mode for training and control, which discards information about alternative plausible futures and complicates optimization.

Method: Uses a Masked Action Planning (MAP) module that treats future ego motion as masked sequence completion, expands a compact latent planning state into diverse trajectory queries, and employs a lightweight world model whose semantic losses are computed as an expectation over modes.

Result: Achieves state-of-the-art performance among world-model-based methods on NAVSIM, matching anchor-based approaches while maintaining real-time inference latency.

Conclusion: MAP-World provides an effective framework for multi-modal planning that learns from full distribution of plausible futures without requiring reinforcement learning or handcrafted anchors.

Abstract: Motion planning for autonomous driving must handle multiple plausible futures while remaining computationally efficient. Recent end-to-end systems and world-model-based planners predict rich multi-modal trajectories, but typically rely on handcrafted anchors or reinforcement learning to select a single best mode for training and control. This selection discards information about alternative futures and complicates optimization. We propose MAP-World, a prior-free multi-modal planning framework that couples masked action planning with a path-weighted world model. The Masked Action Planning (MAP) module treats future ego motion as masked sequence completion: past waypoints are encoded as visible tokens, future waypoints are represented as mask tokens, and a driving-intent path provides a coarse scaffold. A compact latent planning state is expanded into multiple trajectory queries with injected noise, yielding diverse, temporally consistent modes without anchor libraries or teacher policies. A lightweight world model then rolls out future BEV semantics conditioned on each candidate trajectory. During training, semantic losses are computed as an expectation over modes, using trajectory probabilities as discrete path weights, so the planner learns from the full distribution of plausible futures instead of a single selected path. On NAVSIM, our method matches anchor-based approaches and achieves state-of-the-art performance among world-model-based methods, while avoiding reinforcement learning and maintaining real-time inference latency.
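Computing a loss as an expectation over modes means weighting each mode's loss by its trajectory probability instead of keeping only the selected mode. The sketch below contrasts the two, assuming the probabilities come from a softmax over per-mode scores (the abstract only says "trajectory probabilities"). The names and numbers are illustrative.

```python
import math


def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def expected_loss(mode_logits, mode_losses):
    """Sum over modes of p_k * L_k, with p = softmax(mode_logits)."""
    probs = softmax(mode_logits)
    return sum(p * loss for p, loss in zip(probs, mode_losses))


def best_mode_loss(mode_logits, mode_losses):
    """Single-mode alternative: keep only the highest-scoring mode's loss."""
    best = max(range(len(mode_logits)), key=lambda k: mode_logits[k])
    return mode_losses[best]


# Two hypothetical modes with equal scores and different semantic losses.
weighted = expected_loss([0.0, 0.0], [1.0, 3.0])
selected = best_mode_loss([2.0, 0.0], [1.0, 3.0])
```

In the weighted form every mode contributes to the training signal in proportion to its probability, so a poor rollout on a less likely path still influences the planner.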

[180] SKEL-CF: Coarse-to-Fine Biomechanical Skeleton and Surface Mesh Recovery

Da Li, Ji-Ping Jin, Xuanlong Yu, Wei Liu, Xiaodong Cun, Kai Chen, Rui Fan, Jiangang Kong, Shen Xi

Main category: cs.CV

TL;DR: SKEL-CF is a coarse-to-fine framework that improves SKEL parameter estimation using transformer-based encoder-decoder architecture, addressing challenges in anatomical accuracy and perspective ambiguities.

Details

Motivation: Parametric 3D human models like SMPL have simplified kinematics that limit biomechanical realism, and directly estimating SKEL parameters is challenging due to limited training data, perspective ambiguities, and complex human articulation.

Method: Transformer-based encoder-decoder architecture where encoder predicts coarse camera and SKEL parameters, decoder progressively refines them; converts SMPL-based 4DHuman dataset to SKEL-aligned 4DHuman-SKEL; explicitly incorporates camera modeling to mitigate depth/scale ambiguities.

Result: On MOYO dataset, achieves 85.0 MPJPE / 51.4 PA-MPJPE, significantly outperforming previous SKEL-based state-of-the-art HSMR (104.5 / 79.6).

Conclusion: SKEL-CF establishes a scalable and anatomically faithful framework for human motion analysis, bridging computer vision and biomechanics.

Abstract: Parametric 3D human models such as SMPL have driven significant advances in human pose and shape estimation, yet their simplified kinematics limit biomechanical realism. The recently proposed SKEL model addresses this limitation by re-rigging SMPL with an anatomically accurate skeleton. However, estimating SKEL parameters directly remains challenging due to limited training data, perspective ambiguities, and the inherent complexity of human articulation. We introduce SKEL-CF, a coarse-to-fine framework for SKEL parameter estimation. SKEL-CF employs a transformer-based encoder-decoder architecture, where the encoder predicts coarse camera and SKEL parameters, and the decoder progressively refines them in successive layers. To ensure anatomically consistent supervision, we convert the existing SMPL-based dataset 4DHuman into a SKEL-aligned version, 4DHuman-SKEL, providing high-quality training data for SKEL estimation. In addition, to mitigate depth and scale ambiguities, we explicitly incorporate camera modeling into the SKEL-CF pipeline and demonstrate its importance across diverse viewpoints. Extensive experiments validate the effectiveness of the proposed design. On the challenging MOYO dataset, SKEL-CF achieves 85.0 MPJPE / 51.4 PA-MPJPE, significantly outperforming the previous SKEL-based state-of-the-art HSMR (104.5 / 79.6). These results establish SKEL-CF as a scalable and anatomically faithful framework for human motion analysis, bridging the gap between computer vision and biomechanics. Our implementation is available on the project page: https://pokerman8.github.io/SKEL-CF/.
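MPJPE (mean per-joint position error) is the average Euclidean distance between predicted and ground-truth joints; PA-MPJPE applies the same average after a Procrustes (similarity) alignment of the prediction to the ground truth, which is not implemented here. The following sketch computes MPJPE on toy joints and the relative error reduction implied by the reported MOYO numbers. The function names are illustrative.

```python
import math


def mpjpe(pred, gt):
    """Mean Euclidean distance between corresponding 3D joints."""
    if len(pred) != len(gt) or not pred:
        raise ValueError("pred and gt must be non-empty and the same length")
    return sum(math.dist(p, g) for p, g in zip(pred, gt)) / len(pred)


def relative_reduction(baseline, ours):
    """Fraction by which an error metric drops relative to a baseline."""
    return (baseline - ours) / baseline


# Toy joints (hypothetical): per-joint errors are 3 and 4.
toy_error = mpjpe([(0, 0, 0), (1, 0, 0)], [(0, 0, 3), (1, 4, 0)])

# Reported values from the summary: HSMR 104.5 / 79.6, SKEL-CF 85.0 / 51.4.
mpjpe_gain = relative_reduction(104.5, 85.0)
pa_mpjpe_gain = relative_reduction(79.6, 51.4)
```

On the reported figures this works out to roughly a 19% reduction in MPJPE and a 35% reduction in PA-MPJPE relative to HSMR.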

[181] StableTrack: Stabilizing Multi-Object Tracking on Low-Frequency Detections

Matvei Shelukhan, Timur Mamedov, Karina Kvanchiani

Main category: cs.CV

TL;DR: StableTrack improves multi-object tracking for low-frequency detections using a two-stage matching strategy with Bbox-Based Distance and visual tracking integration, achieving 11.6% HOTA improvement at 1Hz.

Details

Motivation: Current MOT approaches struggle with limited computing resources as they require tracking objects in every frame. The goal is to enable efficient tracking under low-frequency detection conditions.

Method: Two-stage matching strategy with novel Bbox-Based Distance instead of Mahalanobis distance, integration of visual tracking into Kalman Filter and overall pipeline.

Result: Outperforms state-of-the-art trackers in low-frequency detections (11.6% HOTA improvement at 1Hz on MOT17-val), while maintaining performance on standard benchmarks with full-frequency detections.

Conclusion: StableTrack effectively addresses the challenge of limited computing resources in MOT by stabilizing tracking quality on low-frequency detections through innovative matching and distance metrics.

Abstract: Multi-object tracking (MOT) is one of the most challenging tasks in computer vision, where it is important to correctly detect objects and associate these detections across frames. Current approaches mainly focus on tracking objects in each frame of a video stream, making it almost impossible to run the model under conditions of limited computing resources. To address this issue, we propose StableTrack, a novel approach that stabilizes the quality of tracking on low-frequency detections. Our method introduces a new two-stage matching strategy to improve the cross-frame association between low-frequency detections. We propose a novel Bbox-Based Distance instead of the conventional Mahalanobis distance, which allows us to effectively match objects using the Re-ID model. Furthermore, we integrate visual tracking into the Kalman Filter and the overall tracking pipeline. Our method outperforms current state-of-the-art trackers in the case of low-frequency detections, achieving 11.6% HOTA improvement at 1 Hz on MOT17-val, while keeping up with the best approaches on the standard MOT17, MOT20, and DanceTrack benchmarks with full-frequency detections.

[182] Harmonious Parameter Adaptation in Continual Visual Instruction Tuning for Safety-Aligned MLLMs

Ziqi Wang, Chang Che, Qi Wang, Hui Ma, Zenglin Shi, Cees G. M. Snoek, Meng Wang

Main category: cs.CV

TL;DR: HPA is a post-training framework that balances safety and task performance in continual visual instruction tuning for safety-aligned MLLMs by partitioning parameters, balanced selection, and orthogonal adjustments.

Details

Motivation: Existing continual visual instruction tuning studies focus on models without safety alignment, ignoring that real-world MLLMs require safety mechanisms. Safety-aligned MLLMs suffer from both task forgetting and safety degradation during continual adaptation.

Method: Proposed Harmonious Parameter Adaptation (HPA) with three components: focusing-based parameter partition (separating safety-focused and task-focused parameters), harmoniously balanced parameter selection, and orthogonal parameter adjustment to alleviate catastrophic forgetting.

Result: Extensive experiments on CVIT benchmark and safety evaluation datasets show HPA better maintains high safety and mitigates forgetting compared to existing baselines.

Conclusion: HPA effectively addresses the challenge of balancing safety and task performance in continual adaptation of safety-aligned MLLMs through parameter partitioning and orthogonal constraints.

Abstract: While continual visual instruction tuning (CVIT) has shown promise in adapting multimodal large language models (MLLMs), existing studies predominantly focus on models without safety alignment. This critical oversight ignores the fact that real-world MLLMs inherently require such mechanisms to mitigate potential risks. In this work, we shift our focus to CVIT for safety-aligned MLLMs and observe that during continual adaptation, the model not only suffers from task forgetting but also exhibits degradation in its safety. Achieving a harmonious balance between safety and task performance remains a crucial challenge. To address this, we propose Harmonious Parameter Adaptation (HPA), a post-training framework composed of focusing-based parameter partition, harmoniously balanced parameter selection, and orthogonal parameter adjustment. Specifically, HPA partitions parameters into two types based on their focus on safety or task performance, and selects the focused ones to preserve from a balanced perspective. In addition, HPA imposes orthogonality constraints on parameter updates to further alleviate catastrophic forgetting. Extensive experiments on the CVIT benchmark and safety evaluation datasets demonstrate that HPA better maintains high safety and mitigates forgetting than existing baselines.
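An orthogonality constraint on parameter updates can be illustrated by removing from an update any component that lies along a set of protected directions. The sketch below does this with Gram-Schmidt orthonormalization. How HPA chooses its protected parameters and directions is not described in the summary, so the directions here are hypothetical inputs and the function names are illustrative.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def orthonormalize(vectors, eps=1e-12):
    """Gram-Schmidt: orthonormal basis for the span of the given vectors."""
    basis = []
    for v in vectors:
        w = list(v)
        for b in basis:
            coeff = dot(w, b)
            w = [wi - coeff * bi for wi, bi in zip(w, b)]
        norm = dot(w, w) ** 0.5
        if norm > eps:
            basis.append([wi / norm for wi in w])
    return basis


def project_out(update, protected):
    """Return the part of `update` orthogonal to every protected direction."""
    result = list(update)
    for b in orthonormalize(protected):
        coeff = dot(result, b)
        result = [ri - coeff * bi for ri, bi in zip(result, b)]
    return result


# Hypothetical protected directions and a raw update.
protected = [[1.0, 0.0, 0.0], [1.0, 1.0, 0.0]]
safe_update = project_out([1.0, 2.0, 3.0], protected)
```

The projected update has zero inner product with each protected direction, so a step along it leaves the model's behavior along those directions unchanged to first order.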

[183] Block Cascading: Training Free Acceleration of Block-Causal Video Models

Hmrishav Bandyopadhyay, Nikhil Pinnaparaju, Rahim Entezari, Jim Scott, Yi-Zhe Song, Varun Jampani

Main category: cs.CV

TL;DR: Block Cascading enables training-free parallelization for block-causal video generation, achieving ~2x speedup across model scales (1.3B: 16→30 FPS, 14B: 4.5→12.5 FPS) without quality loss by allowing future blocks to start generation with partially denoised context.

Details

Motivation: Block-causal video generation faces a severe speed-quality trade-off where small models are fast but low-quality, while large models are high-quality but too slow for practical use, forcing users to choose between responsiveness and quality.

Method: Block Cascading transforms sequential pipelines into parallel cascades by allowing future video blocks to begin generation using partially denoised context from predecessors, enabling multiple blocks to denoise simultaneously across multiple GPUs.

Result: With 5 GPUs exploiting temporal parallelism, achieved ~2x acceleration: 1.3B models from 16 to 30 FPS, 14B models from 4.5 to 12.5 FPS. Eliminated ~200ms KV-recaching overhead during context switches for interactive generation. No significant quality loss compared to block-causal pipelines.

Conclusion: Block Cascading significantly mitigates the speed-quality trade-off in block-causal video generation through training-free parallelization, enabling faster inference speeds without compromising generation quality, making large models more practical for real-time applications.

Abstract: Block-causal video generation faces a stark speed-quality trade-off: small 1.3B models manage only 16 FPS while large 14B models crawl at 4.5 FPS, forcing users to choose between responsiveness and quality. Block Cascading significantly mitigates this trade-off through training-free parallelization. Our key insight: future video blocks do not need fully denoised current blocks to begin generation. By starting block generation with partially denoised context from predecessors, we transform sequential pipelines into parallel cascades where multiple blocks denoise simultaneously. With 5 GPUs exploiting temporal parallelism, we achieve ~2x acceleration across all model scales: 1.3B models accelerate from 16 to 30 FPS, 14B models from 4.5 to 12.5 FPS. Beyond inference speed, Block Cascading eliminates overhead from KV-recaching (of ~200ms) during context switches for interactive generation. Extensive evaluations validated against multiple block-causal pipelines demonstrate no significant loss in generation quality when switching from block-causal to Block Cascading pipelines for inference. Project Page: https://hmrishavbandy.github.io/block_cascading_page/
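The reported frame rates can be turned into speedup factors directly, and the cascading idea can be pictured with a toy timing model. The sketch below assumes that every denoising step takes one time unit, that a block may start once its predecessor has completed `lag` steps, and that enough devices exist for all in-flight blocks; communication cost is ignored. These assumptions and the function names are illustrative and do not come from the paper.

```python
def speedup(fps_before, fps_after):
    return fps_after / fps_before


def sequential_time(blocks, steps):
    """Each block waits for the previous block to be fully denoised."""
    return blocks * steps


def cascaded_time(blocks, steps, lag):
    """Each block starts after its predecessor has finished `lag` steps."""
    if not 1 <= lag <= steps:
        raise ValueError("lag must be between 1 and steps")
    return (blocks - 1) * lag + steps


# Frame rates reported in the summary.
small_model = speedup(16.0, 30.0)   # 1.3B model
large_model = speedup(4.5, 12.5)    # 14B model

# Toy schedule: 4 blocks, 4 denoising steps per block.
baseline = sequential_time(4, 4)
cascaded = cascaded_time(4, 4, lag=1)
```

On the reported numbers the 1.3B model gains about 1.9x and the 14B model about 2.8x, so the "~2x" figure is a rounded summary of both. In the toy model, setting `lag` equal to `steps` recovers the sequential schedule, which makes explicit that the gain comes entirely from starting blocks on partially denoised context.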

[184] ADNet: A Large-Scale and Extensible Multi-Domain Benchmark for Anomaly Detection Across 380 Real-World Categories

Hai Ling, Jia Guo, Zhulin Tao, Yunkang Cao, Donglin Di, Hongyan Xu, Xiu Su, Yang Song, Lei Fan

Main category: cs.CV

TL;DR: ADNet is a large-scale multi-domain anomaly detection benchmark with 380 categories from 49 datasets, revealing scalability challenges in existing methods and proposing Dinomaly-m as a solution.

Details

Motivation: Existing anomaly detection benchmarks like MVTec-AD are too narrow (only 15 categories), limiting evaluation of cross-context generalization and scalability across diverse domains.

Method: Created ADNet benchmark with 196,294 RGB images (116,192 normal training, 80,102 test with 60,311 anomalies) across Electronics, Industry, Agrifood, Infrastructure, and Medical domains, with standardized MVTec-style annotations and text descriptions. Proposed Dinomaly-m, a context-guided Mixture-of-Experts extension of Dinomaly.

Result: Existing SOTA methods drop from 90.6% I-AUROC in one-for-one settings to 78.5% when scaling to all 380 categories. Dinomaly-m achieves 83.2% I-AUROC and 93.1% P-AUROC, outperforming existing approaches without increasing inference cost.

Conclusion: ADNet addresses the scalability gap in anomaly detection, provides a standardized extensible benchmark for diverse domains, and supports future anomaly detection foundation models. Dinomaly-m effectively handles multi-category scaling challenges.

Abstract: Anomaly detection (AD) aims to identify defects using normal-only training data. Existing anomaly detection benchmarks (e.g., MVTec-AD with 15 categories) cover only a narrow range of categories, limiting the evaluation of cross-context generalization and scalability. We introduce ADNet, a large-scale, multi-domain benchmark comprising 380 categories aggregated from 49 publicly available datasets across Electronics, Industry, Agrifood, Infrastructure, and Medical domains. The benchmark includes a total of 196,294 RGB images, consisting of 116,192 normal samples for training and 80,102 test images, of which 60,311 are anomalous. All images are standardized with MVTec-style pixel-level annotations and structured text descriptions spanning both spatial and visual attributes, enabling multimodal anomaly detection tasks. Extensive experiments reveal a clear scalability challenge: existing state-of-the-art methods achieve 90.6% I-AUROC in one-for-one settings but drop to 78.5% when scaling to all 380 categories in a multi-class setting. To address this, we propose Dinomaly-m, a context-guided Mixture-of-Experts extension of Dinomaly that expands decoder capacity without increasing inference cost. It achieves 83.2% I-AUROC and 93.1% P-AUROC, demonstrating superior performance over existing approaches. ADNet is designed as a standardized and extensible benchmark, supporting the community in expanding anomaly detection datasets across diverse domains and providing a scalable foundation for future anomaly detection foundation models. Dataset: https://grainnet.github.io/ADNet
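
As a quick consistency check, the dataset split and the accuracy gaps can be derived from the figures quoted above. The variable names are illustrative; every number comes from the abstract.

```python
# Figures quoted in the abstract.
train_normal = 116_192
test_total = 80_102
test_anomalous = 60_311
total_images = 196_294

# Derived quantities.
test_normal = test_total - test_anomalous          # normal images in the test split
anomaly_fraction = test_anomalous / test_total     # share of anomalous test images

# I-AUROC (%) in the three reported settings.
one_for_one, multi_class, dinomaly_m = 90.6, 78.5, 83.2
drop = round(one_for_one - multi_class, 1)         # loss when scaling to 380 categories
recovered = round(dinomaly_m - multi_class, 1)     # points regained by Dinomaly-m

print(train_normal + test_total == total_images)
print(test_normal, round(anomaly_fraction, 3))
print(drop, recovered)
```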

[185] Object-Centric Vision Token Pruning for Vision Language Models

Guangyuan Li, Rongzhen Zhao, Jinhong Deng, Yanbo Wang, Joni Pajarinen

Main category: cs.CV

TL;DR: OC-VTP is a direct vision token pruning method for VLMs that selects the most representative tokens using object-centric reconstruction, enabling high efficiency while preserving accuracy without fine-tuning.

Details

Motivation: Vision tokens in VLMs are quantity-heavy but information-dispersed compared to language tokens, consuming unnecessary computation. Existing pruning methods use indirect approaches without guarantees.

Method: Light-weight pre-training of a small object-centric vision token pruner that reconstructs original tokens from selected ones, ensuring representative tokens are kept. Can be inserted into existing VLMs without fine-tuning.

Result: Consistently preserves the highest inference accuracy across all pruning ratios for mainstream VLMs while improving efficiency. Also demonstrates interesting interpretability.

Conclusion: OC-VTP provides a direct and guaranteed approach for vision token pruning that maintains accuracy while significantly improving VLM inference efficiency without requiring model fine-tuning.

Abstract: In Vision Language Models (VLMs), vision tokens are quantity-heavy yet information-dispersed compared with language tokens, and thus consume too much unnecessary computation. Pruning redundant vision tokens for high VLM inference efficiency has been continuously studied, but all existing methods resort to indirect and non-guaranteed ways. We propose OC-VTP, a direct and guaranteed approach to select the most representative vision tokens for high-efficiency yet accuracy-preserving VLM inference. Our OC-VTP requires merely light-weight pre-training of a small object-centric vision token pruner, which can then be inserted into existing VLMs, without fine-tuning of any models on any datasets. It is guaranteed that the most representative vision tokens are kept by minimizing the error in reconstructing the original unpruned tokens from the selected ones. Across all vision pruning ratios, i.e., inference efficiency levels, our OC-VTP consistently helps mainstream VLMs to preserve the highest inference accuracy. Our pruning also demonstrates interesting interpretability. Our code is available at https://github.com/GarryLarry010131/OC-VTP.
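
The selection criterion (keep the tokens from which the unpruned set is best reconstructed) can be illustrated with a toy stand-in. The sketch below is not OC-VTP: it swaps the learned object-centric pruner for a greedy search, and it assumes, purely for illustration, that each original token is reconstructed by its nearest kept token. The names `recon_error` and `greedy_prune` are hypothetical.

```python
from typing import List, Sequence, Tuple

Token = Tuple[float, ...]


def sq_dist(a: Token, b: Token) -> float:
    """Squared Euclidean distance between two token vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def recon_error(tokens: Sequence[Token], kept: Sequence[int]) -> float:
    """Total error when every token is reconstructed by its nearest kept token."""
    return sum(min(sq_dist(t, tokens[k]) for k in kept) for t in tokens)


def greedy_prune(tokens: Sequence[Token], keep: int) -> List[int]:
    """Greedily pick `keep` token indices that minimize the reconstruction error."""
    kept: List[int] = []
    for _ in range(keep):
        candidates = [i for i in range(len(tokens)) if i not in kept]
        best = min(candidates, key=lambda i: recon_error(tokens, kept + [i]))
        kept.append(best)
    return sorted(kept)


# Two well-separated groups of tokens; keeping 2 should take one from each group.
tokens = [(0.0, 0.0), (1.0, 0.0), (10.0, 10.0), (11.0, 10.0), (10.0, 11.0)]
kept = greedy_prune(tokens, keep=2)
print(kept, recon_error(tokens, kept))
```

The toy makes one point only: a reconstruction objective rewards coverage, so near-duplicate tokens are dropped before tokens that represent a distinct region.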

[186] Realizing Fully-Integrated, Low-Power, Event-Based Pupil Tracking with Neuromorphic Hardware

Federico Paredes-Valles, Yoshitaka Miyatani, Kirk Y. W. Scheper

Main category: cs.CV

TL;DR: First battery-powered wearable eye tracking system using event-based sensors and neuromorphic processing, achieving 100Hz pupil tracking with <5mW power per eye.

Details

Motivation: Achieve robust, high-frequency eye tracking with ultra-low power consumption for wearable platforms, overcoming limitations of traditional vision sensors and processing solutions.

Method: Combines event-based sensing and neuromorphic processing on Speck2f SoC with lightweight coordinate decoding on microcontroller, using uncertainty-quantifying spiking neural network with gated temporal decoding.

Result: Demonstrated wearable prototype with dual neuromorphic devices achieving 100Hz binocular pupil tracking with average power consumption below 5mW per eye.

Conclusion: End-to-end neuromorphic computing enables practical, always-on eye tracking for next-generation energy-efficient wearable systems.

Abstract: Eye tracking is fundamental to numerous applications, yet achieving robust, high-frequency tracking with ultra-low power consumption remains challenging for wearable platforms. While event-based vision sensors offer microsecond resolution and sparse data streams, they have lacked fully integrated, low-power processing solutions capable of real-time inference. In this work, we present the first battery-powered, wearable pupil-center-tracking system with complete on-device integration, combining event-based sensing and neuromorphic processing on the commercially available Speck2f system-on-chip with lightweight coordinate decoding on a low-power microcontroller. Our solution features a novel uncertainty-quantifying spiking neural network with gated temporal decoding, optimized for strict memory and bandwidth constraints, complemented by systematic deployment mechanisms that bridge the reality gap. We validate our system on a new multi-user dataset and demonstrate a wearable prototype with dual neuromorphic devices achieving robust binocular pupil tracking at 100 Hz with an average power consumption below 5 mW per eye. Our work demonstrates that end-to-end neuromorphic computing enables practical, always-on eye tracking for next-generation energy-efficient wearable systems.

[187] Exo2EgoSyn: Unlocking Foundation Video Generation Models for Exocentric-to-Egocentric Video Synthesis

Mohammad Mahdi, Yuqian Fu, Nedko Savov, Jiancheng Pan, Danda Pani Paudel, Luc Van Gool

Main category: cs.CV

TL;DR: Exo2EgoSyn adapts WAN 2.2 for cross-view video synthesis from exocentric (third-person) to egocentric (first-person) views using three key modules: view alignment, multi-view conditioning, and pose injection.

Details

Motivation: Foundation video generation models like WAN 2.2 are limited to same-view generation and cannot handle cross-view synthesis between different perspectives (exocentric to egocentric).

Method: Three modules: EgoExo-Align for latent-space view alignment, MultiExoCon for multi-view video conditioning, and PoseInj for camera pose injection to guide geometry-aware synthesis.

Result: Exo2EgoSyn enables high-fidelity ego view video generation from third-person observations without retraining from scratch, validated on the ExoEgo4D dataset with significant improvements in Exo2Ego synthesis.

Conclusion: The framework paves the way for scalable cross-view video generation with foundation models, demonstrating effective adaptation of existing models for novel cross-view synthesis tasks.

Abstract: Foundation video generation models such as WAN 2.2 exhibit strong text- and image-conditioned synthesis abilities but remain constrained to the same-view generation setting. In this work, we introduce Exo2EgoSyn, an adaptation of WAN 2.2 that unlocks Exocentric-to-Egocentric (Exo2Ego) cross-view video synthesis. Our framework consists of three key modules. Ego-Exo View Alignment (EgoExo-Align) enforces latent-space alignment between exocentric and egocentric first-frame representations, reorienting the generative space from the given exo view toward the ego view. Multi-view Exocentric Video Conditioning (MultiExoCon) aggregates multi-view exocentric videos into a unified conditioning signal, extending WAN 2.2 beyond its vanilla single-image or text conditioning. Furthermore, Pose-Aware Latent Injection (PoseInj) injects relative exo-to-ego camera pose information into the latent state, guiding geometry-aware synthesis across viewpoints. Together, these modules enable high-fidelity ego view video generation from third-person observations without retraining from scratch. Experiments on ExoEgo4D validate that Exo2EgoSyn significantly improves Exo2Ego synthesis, paving the way for scalable cross-view video generation with foundation models. Source code and models will be released publicly.

[188] SFA: Scan, Focus, and Amplify toward Guidance-aware Answering for Video TextVQA

Haibin He, Qihuang Zhong, Juhua Liu, Bo Du, Peng Wang, Jing Zhang

Main category: cs.CV

TL;DR: SFA is a training-free Video-LLM framework for Video TextVQA that mimics human question-answering by adaptively scanning frames, focusing on key regions, and amplifying them to guide attention to essential text cues, achieving state-of-the-art results.

Details

Motivation: Video TextVQA requires accurate perception of varying scene text across frames while integrating temporal context and filtering irrelevant information, which existing methods struggle with.

Method: Proposed SFA framework that adaptively scans video frames, selectively focuses on key regions, and directly amplifies them to guide Video-LLM attention toward essential textual cues without requiring training.

Result: Achieves new state-of-the-art results across multiple public Video TextVQA datasets, surpassing previous methods by substantial margins.

Conclusion: SFA effectively addresses Video TextVQA challenges through human-inspired attention guidance, demonstrating strong effectiveness and generalizability as the first Video-LLM-based method for this task.

Abstract: Video text-based visual question answering (Video TextVQA) task aims to answer questions about videos by leveraging the visual text appearing within the videos. This task poses significant challenges, requiring models to accurately perceive and comprehend scene text that varies in scale, orientation, and clarity across frames, while effectively integrating temporal and semantic context to generate precise answers. Moreover, the model must identify question-relevant textual cues and filter out redundant or irrelevant information to ensure answering is guided by the most relevant and informative cues. To address these challenges, we propose SFA, a training-free framework and the first Video-LLM-based method tailored for Video TextVQA, motivated by the human process of answering questions. By adaptively scanning video frames, selectively focusing on key regions, and directly amplifying them, SFA effectively guides the Video-LLM’s attention toward essential cues, enabling it to generate more accurate answers. SFA achieves new state-of-the-art results across several public Video TextVQA datasets and surpasses previous methods by a substantial margin, demonstrating its effectiveness and generalizability.

[189] Automated Monitoring of Cultural Heritage Artifacts Using Semantic Segmentation

Andrea Ranieri, Giorgio Palmieri, Silvia Biasotti

Main category: cs.CV

TL;DR: Comparative study of U-Net architectures with different CNN encoders for automated crack detection in cultural heritage preservation using semantic segmentation.

Details

Motivation: Address the critical need for automated crack detection in preserving cultural heritage through semantic segmentation.

Method: Comparative study of U-Net architectures with various CNN encoders for pixel-level crack identification on statues and monuments, evaluated on OmniCrack30k dataset using mIoU, Dice coefficient, and Jaccard index metrics, plus out-of-distribution qualitative evaluation on real-world cracked statues.

Result: Models exhibit promising generalization capabilities to unseen cultural heritage contexts despite not being explicitly trained on statue/monument images.

Conclusion: Provides valuable insights into different CNN-based encoders’ capabilities for fine-grained crack segmentation in cultural heritage preservation.

Abstract: This paper addresses the critical need for automated crack detection in the preservation of cultural heritage through semantic segmentation. We present a comparative study of U-Net architectures, using various convolutional neural network (CNN) encoders, for pixel-level crack identification on statues and monuments. A comparative quantitative evaluation is performed on the test set of the OmniCrack30k dataset [1] using popular segmentation metrics including Mean Intersection over Union (mIoU), Dice coefficient, and Jaccard index. This is complemented by an out-of-distribution qualitative evaluation on an unlabeled test set of real-world cracked statues and monuments. Our findings provide valuable insights into the capabilities of different CNN-based encoders for fine-grained crack segmentation. We show that the models exhibit promising generalization capabilities to unseen cultural heritage contexts, despite never having been explicitly trained on images of statues or monuments.
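
For reference, the overlap metrics named above can be computed on binary masks as follows. This is a minimal sketch on flattened 0/1 lists, not the paper's evaluation code; returning 1.0 when both masks are empty is an assumption of this sketch, since conventions differ. The Jaccard index is the per-class IoU, and mIoU averages it over classes.

```python
from typing import Sequence


def dice(pred: Sequence[int], target: Sequence[int]) -> float:
    """Dice coefficient: 2|A ∩ B| / (|A| + |B|) for binary masks."""
    inter = sum(p & t for p, t in zip(pred, target))
    total = sum(pred) + sum(target)
    return 1.0 if total == 0 else 2.0 * inter / total


def jaccard(pred: Sequence[int], target: Sequence[int]) -> float:
    """Jaccard index (IoU): |A ∩ B| / |A ∪ B| for binary masks."""
    inter = sum(p & t for p, t in zip(pred, target))
    union = sum(p | t for p, t in zip(pred, target))
    return 1.0 if union == 0 else inter / union


# A 4-pixel mask: the prediction marks two crack pixels, the label marks one.
pred, target = [1, 1, 0, 0], [1, 0, 0, 0]
d, j = dice(pred, target), jaccard(pred, target)
print(d, j)
```

The two scores are monotonically related for a single mask pair (Dice = 2J / (1 + J)), so they rank individual predictions identically and differ only once they are averaged over images or classes.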

[190] GHR-VQA: Graph-guided Hierarchical Relational Reasoning for Video Question Answering

Dionysia Danai Brilli, Dimitrios Mallis, Vassilis Pitsikalis, Petros Maragos

Main category: cs.CV

TL;DR: GHR-VQA is a human-centric Video QA framework that uses scene graphs to model human-object interactions across video frames, processed by GNNs and integrated with question features in a hierarchical network for improved reasoning.

Details

Motivation: Traditional pixel-based methods lack explicit modeling of human-object interactions and cross-frame reasoning centered on human actors, limiting interpretability and understanding of spatiotemporal dynamics in videos.

Method: Represent each frame as a scene graph, link human nodes across frames to a global root to form video-level graphs, process with GNNs for context-aware embeddings, and integrate with question features in a hierarchical network across abstraction levels.

Result: Achieved significant performance improvements on AGQA dataset, including 7.3% improvement in object-relation reasoning over state-of-the-art methods.

Conclusion: The explicit human-rooted graph structure enhances interpretability by decomposing actions into human-object interactions and enables better understanding of spatiotemporal dynamics in video question answering.

Abstract: We propose GHR-VQA, Graph-guided Hierarchical Relational Reasoning for Video Question Answering (Video QA), a novel human-centric framework that incorporates scene graphs to capture intricate human-object interactions within video sequences. Unlike traditional pixel-based methods, each frame is represented as a scene graph and human nodes across frames are linked to a global root, forming the video-level graph and enabling cross-frame reasoning centered on human actors. The video-level graphs are then processed by Graph Neural Networks (GNNs), transforming them into rich, context-aware embeddings for efficient processing. Finally, these embeddings are integrated with question features in a hierarchical network operating across different abstraction levels, enhancing both local and global understanding of video content. This explicit human-rooted structure enhances interpretability by decomposing actions into human-object interactions and enables a more profound understanding of spatiotemporal dynamics. We validate our approach on the Action Genome Question Answering (AGQA) dataset, achieving significant performance improvements, including a 7.3% improvement in object-relation reasoning over the state of the art.
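
The human-rooted graph structure can be sketched as a plain edge list. This is an illustrative data-structure sketch only: the input format (one `human` label and a list of `(relation, object)` pairs per frame), the `ROOT` node name, and the `contains` edge label are assumptions, and the GNN and hierarchical network are not modeled.

```python
from typing import Dict, List, Tuple

ROOT = "root"  # hypothetical name for the global root node
Node = Tuple[int, str]  # (frame index, label)


def build_video_graph(frames: List[Dict]) -> List[tuple]:
    """Link each frame's human node to a global root, then add its human-object edges."""
    edges = []
    for t, frame in enumerate(frames):
        human: Node = (t, frame["human"])
        edges.append((ROOT, "contains", human))          # cross-frame link via the root
        for relation, obj in frame["relations"]:
            edges.append((human, relation, (t, obj)))    # per-frame scene-graph edge
    return edges


frames = [
    {"human": "person", "relations": [("holding", "cup")]},
    {"human": "person", "relations": [("drinking_from", "cup"), ("sitting_on", "chair")]},
]
for edge in build_video_graph(frames):
    print(edge)
```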

[191] New York Smells: A Large Multimodal Dataset for Olfaction

Ege Ozguroglu, Junbang Liang, Ruoshi Liu, Mia Chiquier, Michael DeTienne, Wesley Wei Qian, Alexandra Horowitz, Andrew Owens, Carl Vondrick

Main category: cs.CV

TL;DR: New York Smells dataset with 7,000 smell-image pairs from 3,500 objects enables cross-modal olfactory representation learning and outperforms hand-crafted features.

Details

Motivation: Olfaction remains largely inaccessible to machines due to lack of diverse multimodal olfactory training data collected in natural settings.

Method: Created large dataset of paired image and olfactory signals captured in the wild, containing 7,000 smell-image pairs from 3,500 distinct objects across indoor/outdoor environments.

Result: Visual data enables cross-modal olfactory representation learning; learned olfactory representations outperform widely-used hand-crafted features.

Conclusion: The dataset enables three benchmark tasks and demonstrates that multimodal olfactory learning is feasible and effective.

Abstract: While olfaction is central to how animals perceive the world, this rich chemical sensory modality remains largely inaccessible to machines. One key bottleneck is the lack of diverse, multimodal olfactory training data collected in natural settings. We present New York Smells, a large dataset of paired image and olfactory signals captured “in the wild.” Our dataset contains 7,000 smell-image pairs from 3,500 distinct objects across indoor and outdoor environments, with approximately 70$\times$ more objects than existing olfactory datasets. Our benchmark has three tasks: cross-modal smell-to-image retrieval, recognizing scenes, objects, and materials from smell alone, and fine-grained discrimination between grass species. Through experiments on our dataset, we find that visual data enables cross-modal olfactory representation learning, and that our learned olfactory representations outperform widely-used hand-crafted features.

[192] Robust 3D Brain MRI Inpainting with Random Masking Augmentation

Juexin Zhang, Ying Weng, Ke Chen

Main category: cs.CV

TL;DR: A U-Net based deep learning framework won first place in the ASNR-MICCAI BraTS-Inpainting 2025 Challenge by effectively synthesizing healthy brain tissue in 3D MRI scans, outperforming previous winning solutions.

Details

Motivation: To mitigate dataset biases that limit deep learning models in quantitative brain tumor MRI analysis by developing a method for synthesizing healthy tissue in 3D scans.

Method: U-Net architecture trained to inpaint synthetically corrupted regions, enhanced with random masking augmentation strategy to improve generalization.

Result: Achieved SSIM of 0.919±0.088, PSNR of 26.932±5.057, and RMSE of 0.052±0.026 on final test set, securing first place in the 2025 challenge and surpassing 2023-2024 winning solutions.

Conclusion: The proposed framework successfully addresses dataset bias limitations in brain tumor MRI analysis and demonstrates state-of-the-art performance in brain tissue inpainting.

Abstract: The ASNR-MICCAI BraTS-Inpainting Challenge was established to mitigate dataset biases that limit deep learning models in the quantitative analysis of brain tumor MRI. This paper details our submission to the 2025 challenge, a novel deep learning framework for synthesizing healthy tissue in 3D scans. The core of our method is a U-Net architecture trained to inpaint synthetically corrupted regions, enhanced with a random masking augmentation strategy to improve generalization. Quantitative evaluation confirmed the efficacy of our approach, yielding an SSIM of 0.873$\pm$0.004, a PSNR of 24.996$\pm$4.694, and an MSE of 0.005$\pm$0.087 on the validation set. On the final online test set, our method achieved an SSIM of 0.919$\pm$0.088, a PSNR of 26.932$\pm$5.057, and an RMSE of 0.052$\pm$0.026. This performance secured first place in the BraTS-Inpainting 2025 challenge and surpassed the winning solutions from the 2023 and 2024 competitions on the official leaderboard.
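
Two of the reported metrics are tied together by a simple identity, sketched below. The code assumes intensities normalized to a peak value of 1.0 (`max_val` is an assumption, not stated in the abstract) and operates on flattened voxel lists; it is not the challenge's official scoring code.

```python
import math
from typing import Sequence


def rmse(pred: Sequence[float], target: Sequence[float]) -> float:
    """Root-mean-square error over corresponding voxels."""
    n = len(pred)
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(pred, target)) / n)


def psnr(pred: Sequence[float], target: Sequence[float], max_val: float = 1.0) -> float:
    """Peak signal-to-noise ratio in dB: 20 * log10(max_val / RMSE)."""
    err = rmse(pred, target)
    return math.inf if err == 0 else 20.0 * math.log10(max_val / err)


# Two voxels, each off by 0.1 on a [0, 1] intensity scale.
pred, target = [0.0, 0.0], [0.1, 0.1]
print(rmse(pred, target), psnr(pred, target))
```

Because the challenge figures are means with standard deviations across cases, the mean PSNR need not equal the PSNR computed from the mean RMSE; the identity holds per case only.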

[193] Flash-DMD: Towards High-Fidelity Few-Step Image Generation with Efficient Distillation and Joint Reinforcement Learning

Guanjie Chen, Shirui Huang, Kai Liu, Jianchen Zhu, Xiaoye Qu, Peng Chen, Yu Cheng, Yifu Sun

Main category: cs.CV

TL;DR: Flash-DMD is a framework that accelerates diffusion models through efficient timestep distillation and joint RL-based refinement, achieving state-of-the-art generation quality with significantly reduced training costs.

Details

Motivation: Diffusion models are computationally expensive due to iterative sampling, and existing timestep distillation methods require extensive training while suffering from image quality degradation. RL-based fine-tuning for specific objectives is unstable and prone to reward hacking.

Method: Proposes an efficient timestep-aware distillation strategy that outperforms DMD2 at only 2.1% of its training cost (a roughly 98% reduction), and introduces joint training where RL fine-tuning occurs simultaneously with ongoing distillation, using distillation loss as a regularizer to stabilize RL training.

Result: Achieves state-of-the-art generation quality in few-step sampling regime, outperforming existing methods in visual quality, human preference, and text-image alignment metrics with significantly faster convergence.

Conclusion: Flash-DMD presents an effective paradigm for training efficient, high-fidelity, and stable generative models by combining efficient distillation with joint RL refinement.

Abstract: Diffusion Models have emerged as a leading class of generative models, yet their iterative sampling process remains computationally expensive. Timestep distillation is a promising technique to accelerate generation, but it often requires extensive training and leads to image quality degradation. Furthermore, fine-tuning these distilled models for specific objectives, such as aesthetic appeal or user preference, using Reinforcement Learning (RL) is notoriously unstable and easily falls into reward hacking. In this work, we introduce Flash-DMD, a novel framework that enables fast convergence with distillation and joint RL-based refinement. Specifically, we first propose an efficient timestep-aware distillation strategy that significantly reduces training cost with enhanced realism, outperforming DMD2 with only $2.1\%$ of its training cost. Second, we introduce a joint training scheme where the model is fine-tuned with an RL objective while the timestep distillation training continues simultaneously. We demonstrate that the stable, well-defined loss from the ongoing distillation acts as a powerful regularizer, effectively stabilizing the RL training process and preventing policy collapse. Extensive experiments on score-based and flow matching models show that our proposed Flash-DMD not only converges significantly faster but also achieves state-of-the-art generation quality in the few-step sampling regime, outperforming existing methods in visual quality, human preference, and text-image alignment metrics. Our work presents an effective paradigm for training efficient, high-fidelity, and stable generative models. Code is coming soon.

[194] Text-guided Controllable Diffusion for Realistic Camouflage Images Generation

Yuhang Qian, Haiyan Chen, Wentong Li, Ningzhong Liu, Jie Qin

Main category: cs.CV

TL;DR: CT-CIG is a controllable text-guided method for generating realistic camouflage images that logically blend objects with their surroundings using VLM-annotated prompts and frequency-aware refinement.

Details

Motivation: Existing camouflage image generation methods fail to produce natural results because they overlook the logical relationships between camouflaged objects and background environments.

Method: Uses Large Visual Language Models to annotate camouflage datasets with high-quality text prompts, fine-tunes Stable Diffusion with a lightweight controller for object location/shape guidance, and incorporates a Frequency Interaction Refinement Module for texture learning.

Result: Extensive experiments show semantic alignment of generated text prompts and CT-CIG’s ability to produce photorealistic camouflage images, validated through CLIPScore evaluation and camouflage effectiveness assessment.

Conclusion: CT-CIG successfully addresses the logical relationship issue in camouflage image generation and produces realistic, logically plausible camouflage images through text guidance and frequency-aware refinement.

Abstract: Camouflage Images Generation (CIG) is an emerging research area that focuses on synthesizing images in which objects are harmoniously blended and exhibit high visual consistency with their surroundings. Existing methods perform CIG by either fusing objects into specific backgrounds or outpainting the surroundings via foreground object-guided diffusion. However, they often fail to obtain natural results because they overlook the logical relationship between camouflaged objects and background environments. To address this issue, we propose CT-CIG, a Controllable Text-guided Camouflage Images Generation method that produces realistic and logically plausible camouflage images. Leveraging Large Visual Language Models (VLM), we design a Camouflage-Revealing Dialogue Mechanism (CRDM) to annotate existing camouflage datasets with high-quality text prompts. Subsequently, the constructed image-prompt pairs are utilized to finetune Stable Diffusion, incorporating a lightweight controller to guide the location and shape of camouflaged objects for enhanced camouflage scene fitness. Moreover, we design a Frequency Interaction Refinement Module (FIRM) to capture high-frequency texture features, facilitating the learning of complex camouflage patterns. Extensive experiments, including CLIPScore evaluation and camouflage effectiveness assessment, demonstrate the semantic alignment of our generated text prompts and CT-CIG’s ability to produce photorealistic camouflage images.

[195] Patch-Level Glioblastoma Subregion Classification with a Contrastive Learning-Based Encoder

Juexin Zhang, Qifeng Zhong, Ying Weng, Ke Chen

Main category: cs.CV

TL;DR: Fine-tuned Vision Transformer for glioblastoma histopathological analysis achieved 2nd place in BraTS-Path 2025 Challenge with MCC 0.6509 and F1-score 0.5330 on test set.

Details

Motivation: Address glioblastoma heterogeneity and provide objective automated analysis of whole slide images beyond traditional histopathology.

Method: Fine-tuned pre-trained Vision Transformer encoder with dedicated classification head on official training dataset.

Result: Achieved MCC 0.7064 and F1-score 0.7676 on validation set; MCC 0.6509 and F1-score 0.5330 on test set, securing 2nd place in challenge.

Conclusion: Established solid baseline for ViT-based histopathological analysis; future work needed to bridge performance gap on unseen data.

Abstract: The significant molecular and pathological heterogeneity of glioblastoma, an aggressive brain tumor, complicates diagnosis and patient stratification. While traditional histopathological assessment remains the standard, deep learning offers a promising path toward objective and automated analysis of whole slide images. For the BraTS-Path 2025 Challenge, we developed a method that fine-tunes a pre-trained Vision Transformer (ViT) encoder with a dedicated classification head on the official training dataset. Our model’s performance on the online validation set, evaluated via the Synapse platform, yielded a Matthews Correlation Coefficient (MCC) of 0.7064 and an F1-score of 0.7676. On the final test set, the model achieved an MCC of 0.6509 and an F1-score of 0.5330, which secured our team second place in the BraTS-Pathology 2025 Challenge. Our results establish a solid baseline for ViT-based histopathological analysis, and future efforts will focus on bridging the performance gap observed on the unseen validation data.
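
For readers unfamiliar with the two reported scores, the sketch below computes them from a binary confusion matrix. The challenge task is multi-class, so the official scores presumably use multi-class generalizations; the binary case here is a simplifying assumption to show the formulas. Returning 0.0 when a denominator is zero is a common convention adopted for this sketch.

```python
import math


def mcc(tp: int, fp: int, fn: int, tn: int) -> float:
    """Matthews Correlation Coefficient for a binary confusion matrix."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom


def f1_score(tp: int, fp: int, fn: int) -> float:
    """F1-score: harmonic mean of precision and recall."""
    denom = 2 * tp + fp + fn
    return 0.0 if denom == 0 else 2 * tp / denom


# A chance-level classifier on a balanced toy set, then a perfect one.
print(mcc(tp=1, fp=1, fn=1, tn=1), f1_score(tp=1, fp=1, fn=1))
print(mcc(tp=2, fp=0, fn=0, tn=2), f1_score(tp=2, fp=0, fn=0))
```

MCC uses all four cells of the confusion matrix while F1 ignores true negatives, which is one reason the two scores can move differently between validation and test sets.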

[196] V-Attack: Targeting Disentangled Value Features for Controllable Adversarial Attacks on LVLMs

Sen Nie, Jie Zhang, Jianxin Yan, Shiguang Shan, Xilin Chen

Main category: cs.CV

TL;DR: V-Attack is a novel adversarial attack method that targets value features in transformer attention blocks to achieve precise local semantic manipulation in Large Vision-Language Models, outperforming existing methods by 36% in attack success rate.

Details

Motivation: Existing adversarial attack methods struggle with controllability and fail to precisely manipulate specific concept semantics in images due to semantic entanglement in patch-token representations, where global context dominates individual patch features.

Method: V-Attack targets value features (V) in transformer attention blocks, which suppress global-context channels and retain disentangled local semantic information. It includes Self-Value Enhancement and Text-Guided Value Manipulation modules to refine semantic richness and optimize source concepts toward target concepts using text prompts.

Result: Extensive experiments across LVLMs (LLaVA, InternVL, DeepseekVL, GPT-4o) show V-Attack improves attack success rate by an average of 36% over state-of-the-art methods, exposing critical vulnerabilities in visual-language understanding.

Conclusion: Value features serve as precise handles for semantic manipulation, and V-Attack effectively bypasses entangled patch features to achieve highly controllable local semantic attacks, revealing significant security risks in modern vision-language models.

Abstract: Adversarial attacks have evolved from simply disrupting predictions on conventional task-specific models to the more complex goal of manipulating image semantics on Large Vision-Language Models (LVLMs). However, existing methods struggle with controllability and fail to precisely manipulate the semantics of specific concepts in the image. We attribute this limitation to semantic entanglement in the patch-token representations on which adversarial attacks typically operate: global context aggregated by self-attention in the vision encoder dominates individual patch features, making them unreliable handles for precise local semantic manipulation. Our systematic investigation reveals a key insight: value features (V) computed within the transformer attention block serve as much more precise handles for manipulation. We show that V suppresses global-context channels, allowing it to retain high-entropy, disentangled local semantic information. Building on this discovery, we propose V-Attack, a novel method designed for precise local semantic attacks. V-Attack targets the value features and introduces two core components: (1) a Self-Value Enhancement module to refine V’s intrinsic semantic richness, and (2) a Text-Guided Value Manipulation module that leverages text prompts to locate the source concept and optimize it toward a target concept. By bypassing the entangled patch features, V-Attack achieves highly effective semantic control. Extensive experiments across diverse LVLMs, including LLaVA, InternVL, DeepseekVL and GPT-4o, show that V-Attack improves the attack success rate by an average of 36% over state-of-the-art methods, exposing critical vulnerabilities in modern visual-language understanding. Our code and data are available at https://github.com/Summu77/V-Attack.

[197] HistoSpeckle-Net: Mutual Information-Guided Deep Learning for high-fidelity reconstruction of complex OrganAMNIST images via perturbed Multimode Fibers

Jawaria Maqbool, M. Imran Cheema

Main category: cs.CV

TL;DR: HistoSpeckle-Net is a deep learning architecture for reconstructing medical images from multimode fiber speckles using distribution-aware learning and multiscale feature refinement, achieving superior performance with limited data and under fiber bending conditions.

Details

Motivation: Existing deep learning methods in MMF imaging focus on simpler datasets and are data-intensive, limiting their applicability to complex real-world medical imaging tasks.

Method: Proposed HistoSpeckle-Net with histogram-based mutual information loss for distribution-aware learning and Three-Scale Feature Refinement Module for multiscale SSIM loss computation, using an optical setup with SLM and MMF to capture speckle patterns from OrganAMNIST images.

Result: HistoSpeckle-Net achieves higher fidelity than baseline models (U-Net, Pix2Pix), performs well with limited training samples, and maintains performance across varying fiber bending conditions.

Conclusion: HistoSpeckle-Net brings MMF imaging closer to practical deployment in real-world clinical environments by effectively reconstructing complex anatomical features with reduced data and under fiber perturbations.

Abstract: Existing deep learning methods in multimode fiber (MMF) imaging often focus on simpler datasets, limiting their applicability to complex, real-world imaging tasks. These models are typically data-intensive, a challenge that becomes more pronounced when dealing with diverse and complex images. In this work, we propose HistoSpeckle-Net, a deep learning architecture designed to reconstruct structurally rich medical images from MMF speckles. To build a clinically relevant dataset, we develop an optical setup that couples laser light through a spatial light modulator (SLM) into an MMF, capturing output speckle patterns corresponding to input OrganAMNIST images. Unlike previous MMF imaging approaches, which have not considered the underlying statistics of speckles and reconstructed images, we introduce a distribution-aware learning strategy. We employ a histogram-based mutual information loss to enhance model robustness and reduce reliance on large datasets. Our model includes a histogram computation unit that estimates smooth marginal and joint histograms for calculating mutual information loss. It also incorporates a unique Three-Scale Feature Refinement Module, which leads to multiscale Structural Similarity Index Measure (SSIM) loss computation. Together, these two loss functions enhance both the structural fidelity and statistical alignment of the reconstructed images. Our experiments on the complex OrganAMNIST dataset demonstrate that HistoSpeckle-Net achieves higher fidelity than baseline models such as U-Net and Pix2Pix. It gives superior performance even with limited training samples and across varying fiber bending conditions. By effectively reconstructing complex anatomical features with reduced data and under fiber perturbations, HistoSpeckle-Net brings MMF imaging closer to practical deployment in real-world clinical environments.
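
Illustration (not from the paper): the abstract describes a histogram-based mutual information loss computed from marginal and joint histograms. The sketch below shows the underlying quantity in plain Python. It assumes hard (non-differentiable) binning of intensities in [0, 1], whereas the paper estimates smooth histograms; the names `mutual_information` and `mi_loss` are hypothetical.

```python
import math


def mutual_information(x, y, bins=8):
    """Estimate mutual information (in nats) between two equal-length
    lists of intensities in [0, 1] using hard histogram binning."""
    if len(x) != len(y) or not x:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(x)
    # Map each intensity to a bin index; the value 1.0 goes to the last bin.
    bx = [min(int(v * bins), bins - 1) for v in x]
    by = [min(int(v * bins), bins - 1) for v in y]
    # Joint histogram, normalised to a probability table.
    joint = [[0.0] * bins for _ in range(bins)]
    for i, j in zip(bx, by):
        joint[i][j] += 1.0 / n
    # Marginals are the row sums and column sums of the joint table.
    px = [sum(row) for row in joint]
    py = [sum(joint[i][j] for i in range(bins)) for j in range(bins)]
    mi = 0.0
    for i in range(bins):
        for j in range(bins):
            p = joint[i][j]
            if p > 0.0:
                mi += p * math.log(p / (px[i] * py[j]))
    return mi


def mi_loss(pred, target, bins=8):
    """Loss to minimise: negative mutual information (assumed form)."""
    return -mutual_information(pred, target, bins)
```

A reconstruction that shares more intensity statistics with its target yields a lower `mi_loss`. A trainable version would need a differentiable (soft) binning, which this sketch does not provide.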

[198] Evaluating the Performance of Deep Learning Models in Whole-body Dynamic 3D Posture Prediction During Load-reaching Activities

Seyede Niloofar Hosseini, Ali Mojibi, Mahdi Mohseni, Navid Arjmand, Alireza Taheri

Main category: cs.CV

TL;DR: Deep neural networks (BLSTM and transformer) predict whole-body human posture during load-reaching tasks using 3D motion data, with a novel cost function enforcing constant body segment lengths to improve accuracy.

Details

Motivation: To develop accurate methods for predicting human posture dynamics during manual material handling activities, which is important for understanding biomechanics and preventing injuries.

Method: Used bidirectional LSTM and transformer architectures trained on 3D full-body motion data from 20 individuals performing load-reaching tasks. Inputs included hand-load position, lifting/handling techniques, body metrics, and first 25% of task posture data to predict remaining 75%. Introduced novel cost function to enforce constant body segment lengths.

Result: Transformer model achieved 47.0 mm RMSE, ~58% more accurate than BLSTM. New cost function reduced prediction error by ~8% for arms and ~21% for legs.

Conclusion: Transformer-based neural networks with time series dependencies effectively capture 3D motion dynamics for posture prediction during manual handling tasks, providing a valuable approach for biomechanical analysis.

Abstract: This study aimed to explore the application of deep neural networks for whole-body human posture prediction during dynamic load-reaching activities. Two time-series models were trained using bidirectional long short-term memory (BLSTM) and transformer architectures. The dataset consisted of 3D full-body plug-in gait dynamic coordinates from 20 normal-weight healthy male individuals, each performing 204 load-reaching tasks from different load positions while adopting various lifting and handling techniques. The model inputs consisted of the 3D hand-load position, lifting (stoop, full-squat and semi-squat) and handling (one- and two-handed) techniques, body weight and height, and the 3D coordinate data of the body posture from the first 25% of the task duration. These inputs were used by the models to predict body coordinates during the remaining 75% of the task period. Moreover, a novel method was proposed to improve the accuracy of the previous and present posture prediction networks by enforcing constant body segment lengths through the optimization of a new cost function. The results indicated that the new cost function decreased the prediction error of the models by approximately 8% and 21% for the arm and leg models, respectively. The transformer architecture, with a root-mean-square error of 47.0 mm, exhibited ~58% more accurate long-term performance than the BLSTM-based model. This study supports the use of neural networks that capture time-series dependencies in 3D motion frames, providing a unique approach for understanding and predicting motion dynamics during manual material handling activities.
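
Illustration (not from the paper): the abstract says the new cost function enforces constant body segment lengths but does not give its form. One plausible penalty is the mean squared deviation of each predicted segment length from a per-subject reference length. The function name, the joint names, and the squared-error form are all assumptions.

```python
import math


def segment_length_penalty(frames, segments, reference_lengths):
    """Mean squared deviation (mm^2) of predicted segment lengths from
    their reference lengths.

    frames: list of dicts mapping joint name -> (x, y, z) in mm
    segments: list of (joint_a, joint_b) pairs
    reference_lengths: dict mapping (joint_a, joint_b) -> length in mm
    """
    total, count = 0.0, 0
    for frame in frames:
        for seg in segments:
            joint_a, joint_b = seg
            length = math.dist(frame[joint_a], frame[joint_b])
            total += (length - reference_lengths[seg]) ** 2
            count += 1
    return total / count if count else 0.0
```

Such a term would typically be added to the coordinate prediction loss with a weighting factor, so that predicted skeletons do not stretch or shrink over the predicted 75% of the task.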

[199] PromptMoG: Enhancing Diversity in Long-Prompt Image Generation via Prompt Embedding Mixture-of-Gaussian Sampling

Bo-Kai Ruan, Teng-Fang Hsiao, Ling Lo, Yi-Lun Wu, Hong-Han Shuai

Main category: cs.CV

TL;DR: The paper addresses the fidelity-diversity dilemma in long-prompt text-to-image generation, where longer prompts improve fidelity but reduce diversity. It introduces LPD-Bench for evaluation and proposes PromptMoG, a training-free method that samples embeddings from a Mixture-of-Gaussians to enhance diversity while preserving semantics.

Details

Motivation: Long prompts in text-to-image generation encode rich information but suppress diversity, leading to repetitive outputs. Current models show a clear drop in diversity as prompt length increases, creating a fidelity-diversity trade-off that needs addressing.

Method: The authors introduce LPD-Bench for systematic evaluation and propose PromptMoG, a training-free method that samples prompt embeddings from a Mixture-of-Gaussians distribution in the embedding space to increase sampling entropy and enhance diversity.

Result: Extensive experiments on four state-of-the-art models (SD3.5-Large, Flux.1-Krea-Dev, CogView4, and Qwen-Image) show that PromptMoG consistently improves diversity in long-prompt generation without causing semantic drifting.

Conclusion: PromptMoG effectively resolves the fidelity-diversity dilemma in long-prompt text-to-image generation by enhancing diversity through prompt embedding sampling while maintaining semantic fidelity, demonstrating consistent improvements across multiple advanced models.

Abstract: Recent advances in text-to-image (T2I) generation have achieved remarkable visual outcomes through large-scale rectified flow models. However, how these models behave under long prompts remains underexplored. Long prompts encode rich content, spatial, and stylistic information that enhances fidelity but often suppresses diversity, leading to repetitive and less creative outputs. In this work, we systematically study this fidelity-diversity dilemma and reveal that state-of-the-art models exhibit a clear drop in diversity as prompt length increases. To enable consistent evaluation, we introduce LPD-Bench, a benchmark designed for assessing both fidelity and diversity in long-prompt generation. Building on our analysis, we develop a theoretical framework that increases sampling entropy through prompt reformulation and propose a training-free method, PromptMoG, which samples prompt embeddings from a Mixture-of-Gaussians in the embedding space to enhance diversity while preserving semantics. Extensive experiments on four state-of-the-art models, SD3.5-Large, Flux.1-Krea-Dev, CogView4, and Qwen-Image, demonstrate that PromptMoG consistently improves long-prompt generation diversity without semantic drifting.
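
Illustration (not from the paper): sampling a prompt embedding from a Mixture-of-Gaussians can be sketched as picking one component by its weight and then adding Gaussian noise around that component's mean. How PromptMoG builds the component means from a long prompt is not stated in the summary, so the means, weights, and isotropic `sigma` below are assumed inputs, and `sample_from_mog` is a hypothetical name.

```python
import random


def sample_from_mog(means, weights, sigma, rng):
    """Draw one vector from an isotropic Mixture-of-Gaussians.

    means: list of component mean vectors (lists of floats)
    weights: non-negative mixture weights, one per component
    sigma: shared standard deviation of every component (assumption)
    rng: a random.Random instance, passed in for reproducibility
    Returns (sample, component_index).
    """
    if len(means) != len(weights):
        raise ValueError("need exactly one weight per component")
    # Step 1: choose a component in proportion to its weight.
    k = rng.choices(range(len(means)), weights=weights, k=1)[0]
    # Step 2: perturb that component's mean with Gaussian noise.
    sample = [m + sigma * rng.gauss(0.0, 1.0) for m in means[k]]
    return sample, k
```

Each generated image would then be conditioned on a different draw, so repeated generations from the same long prompt start from slightly different embeddings.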

[200] Zoo3D: Zero-Shot 3D Object Detection at Scene Level

Andrey Lemeshko, Bulat Gabdullin, Nikita Drozdov, Anton Konushin, Danila Rukhovich, Maksim Kolodiazhnyi

Main category: cs.CV

TL;DR: Zoo3D is the first training-free 3D object detection framework that constructs 3D bounding boxes from 2D instance masks and assigns semantic labels using open-vocabulary methods, achieving state-of-the-art results without training.

Details

Motivation: Existing open-vocabulary 3D detectors still depend on training scenes, limiting their ability to recognize diverse, previously unseen objects in real-world environments.

Method: Uses graph clustering of 2D instance masks to construct 3D bounding boxes, then assigns semantic labels via open-vocabulary module with best-view selection and view-consensus mask generation. Operates in zero-shot (no training) and self-supervised (refining with pseudo labels) modes.

Result: Achieves state-of-the-art results on ScanNet200 and ARKitScenes benchmarks. Zero-shot Zoo3D_0 outperforms all existing self-supervised methods.

Conclusion: Demonstrates the power and adaptability of training-free, off-the-shelf approaches for real-world 3D understanding, with zero-shot performance surpassing all existing self-supervised methods.

Abstract: 3D object detection is fundamental for spatial understanding. Real-world environments demand models capable of recognizing diverse, previously unseen objects, which remains a major limitation of closed-set methods. Existing open-vocabulary 3D detectors relax annotation requirements but still depend on training scenes, either as point clouds or images. We take this a step further by introducing Zoo3D, the first training-free 3D object detection framework. Our method constructs 3D bounding boxes via graph clustering of 2D instance masks, then assigns semantic labels using a novel open-vocabulary module with best-view selection and view-consensus mask generation. Zoo3D operates in two modes: the zero-shot Zoo3D$_0$, which requires no training at all, and the self-supervised Zoo3D$_1$, which refines 3D box prediction by training a class-agnostic detector on Zoo3D$_0$-generated pseudo labels. Furthermore, we extend Zoo3D beyond point clouds to work directly with posed and even unposed images. Across ScanNet200 and ARKitScenes benchmarks, both Zoo3D$_0$ and Zoo3D$_1$ achieve state-of-the-art results in open-vocabulary 3D object detection. Remarkably, our zero-shot Zoo3D$_0$ outperforms all existing self-supervised methods, hence demonstrating the power and adaptability of training-free, off-the-shelf approaches for real-world 3D understanding. Code is available at https://github.com/col14m/zoo3d .
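
Illustration (not from the paper): the idea of building 3D boxes by graph clustering of 2D instance masks can be sketched as follows. The sketch assumes each 2D mask has already been lifted to a set of 3D point indices, links two masks when their point sets overlap strongly (IoU threshold), takes connected components as objects, and fits an axis-aligned box. The overlap criterion, the threshold, and the box type are assumptions, not details from the summary.

```python
def cluster_masks(mask_points, iou_threshold=0.5):
    """Group masks into objects via connected components.

    mask_points: list of sets of 3D point indices, one set per 2D mask
    Returns a list of merged point-index sets, one per cluster.
    """
    parent = list(range(len(mask_points)))

    def find(i):
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(mask_points)):
        for j in range(i + 1, len(mask_points)):
            union = len(mask_points[i] | mask_points[j])
            inter = len(mask_points[i] & mask_points[j])
            if union and inter / union >= iou_threshold:
                parent[find(i)] = find(j)  # add a graph edge

    groups = {}
    for i, pts in enumerate(mask_points):
        groups.setdefault(find(i), set()).update(pts)
    return list(groups.values())


def bounding_box(points, indices):
    """Axis-aligned box (min corner, max corner) over the selected points."""
    coords = [points[i] for i in indices]
    lo = tuple(min(c[k] for c in coords) for k in range(3))
    hi = tuple(max(c[k] for c in coords) for k in range(3))
    return lo, hi
```

Semantic labels would be assigned afterwards by a separate open-vocabulary step, which is outside this sketch.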

[201] The Image as Its Own Reward: Reinforcement Learning with Adversarial Reward for Image Generation

Weijia Mao, Hao Chen, Zhenheng Yang, Mike Zheng Shou

Main category: cs.CV

TL;DR: Adv-GRPO is an RL framework for image generation that uses adversarial rewards with reference images and vision foundation models to prevent reward hacking and improve image quality, outperforming existing methods in human evaluations.

Details

Motivation: Current RL approaches for image generation rely on scalar rewards from pre-trained preference models that often fail to capture human perception and are vulnerable to reward hacking, where higher scores don't correspond to better images.

Method: Introduces Adv-GRPO with adversarial reward that iteratively updates both reward model and generator. Uses reference images as positive samples and vision foundation models (e.g., DINO) to provide dense visual rewards instead of single scalar rewards.

Result: Outperforms Flow-GRPO and SD3 with 70.0% win rate in image quality and 72.4% in aesthetics in human evaluation. Enables distribution transfer and flexible style customization.

Conclusion: Using reference images with foundation-model rewards provides rich visual signals that prevent reward hacking and lead to consistent gains across image quality, aesthetics, and task-specific metrics.

Abstract: A reliable reward function is essential for reinforcement learning (RL) in image generation. Most current RL approaches depend on pre-trained preference models that output scalar rewards to approximate human preferences. However, these rewards often fail to capture human perception and are vulnerable to reward hacking, where higher scores do not correspond to better images. To address this, we introduce Adv-GRPO, an RL framework with an adversarial reward that iteratively updates both the reward model and the generator. The reward model is supervised using reference images as positive samples and can largely avoid being hacked. Unlike KL regularization that constrains parameter updates, our learned reward directly guides the generator through its visual outputs, leading to higher-quality images. Moreover, while optimizing existing reward functions can alleviate reward hacking, their inherent biases remain. For instance, PickScore may degrade image quality, whereas OCR-based rewards often reduce aesthetic fidelity. To address this, we take the image itself as a reward, using reference images and vision foundation models (e.g., DINO) to provide rich visual rewards. These dense visual signals, instead of a single scalar, lead to consistent gains across image quality, aesthetics, and task-specific metrics. Finally, we show that combining reference samples with foundation-model rewards enables distribution transfer and flexible style customization. In human evaluation, our method outperforms Flow-GRPO and SD3, achieving 70.0% and 72.4% win rates in image quality and aesthetics, respectively. Code and models have been released.

[202] MapReduce LoRA: Advancing the Pareto Front in Multi-Preference Optimization for Generative Models

Chieh-Yun Chen, Zhonghao Wang, Qi Chen, Zhifan Ye, Min Shi, Yue Zhao, Yinan Zhao, Hui Qu, Wei-An Lin, Yiru Shen, Ajinkya Kale, Irfan Essa, Humphrey Shi

Main category: cs.CV

TL;DR: The paper introduces MapReduce LoRA and RaTE methods to address alignment tax in multi-reward optimization, achieving significant improvements across text-to-image, text-to-video, and language tasks.

Details

Motivation: To overcome the alignment tax problem where optimizing multiple rewards jointly degrades some dimensions while improving others in RLHF.

Method: MapReduce LoRA trains preference-specific LoRA experts in parallel and merges them iteratively; RaTE learns reward-specific token embeddings that compose at inference for flexible preference control.

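Result: On text-to-image generation (Stable Diffusion 3.5 Medium and FLUX.1-dev), gains range from 4.3% to 67.1% across GenEval, PickScore, and OCR; on text-to-video generation (HunyuanVideo), visual and motion quality improve by 48.1% and 90.0%; on the Helpful Assistant language task with Llama-2 7B, helpful and harmless scores improve by 43.4% and 136.7%.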

Conclusion: The framework sets a new state-of-the-art multi-preference alignment recipe that works effectively across different modalities.

Abstract: Reinforcement learning from human feedback (RLHF) with reward models has advanced alignment of generative models to human aesthetic and perceptual preferences. However, jointly optimizing multiple rewards often incurs an alignment tax, improving one dimension while degrading others. To address this, we introduce two complementary methods: MapReduce LoRA and Reward-aware Token Embedding (RaTE). MapReduce LoRA trains preference-specific LoRA experts in parallel and iteratively merges them to refine a shared base model; RaTE learns reward-specific token embeddings that compose at inference for flexible preference control. Experiments on Text-to-Image generation (Stable Diffusion 3.5 Medium and FLUX.1-dev) show improvements of 36.1%, 4.6%, and 55.7%, and 32.7%, 4.3%, and 67.1% on GenEval, PickScore, and OCR, respectively. On Text-to-Video generation (HunyuanVideo), visual and motion quality improve by 48.1% and 90.0%, respectively. On the language task, Helpful Assistant, with Llama-2 7B, helpful and harmless improve by 43.4% and 136.7%, respectively. Our framework sets a new state-of-the-art multi-preference alignment recipe across modalities.
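
Illustration (not from the paper): one merge step for preference-specific LoRA experts can be sketched as folding the average of the low-rank updates B·A into the shared base weight. Uniform averaging, the `scale` factor, and the function names are assumptions; the summary only states that experts are trained in parallel and merged iteratively.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def merge_lora_experts(base, experts, scale=1.0):
    """Return a new weight matrix: base + scale * mean_i(B_i @ A_i).

    base: d_out x d_in matrix (list of rows), left unmodified
    experts: list of (B, A) pairs, B is d_out x r and A is r x d_in
    """
    merged = [row[:] for row in base]  # copy so the base stays intact
    for B, A in experts:
        delta = matmul(B, A)  # low-rank update contributed by one expert
        for i, row in enumerate(delta):
            for j, value in enumerate(row):
                merged[i][j] += scale * value / len(experts)
    return merged
```

In an iterative scheme, the merged matrix would serve as the base for the next round of per-preference training (the "map" step), followed by another merge (the "reduce" step).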

Xiaohan Wang, Zhangtao Cheng, Ting Zhong, Leiting Chen, Fan Zhou

Main category: cs.CV

TL;DR: MBCD is a collaborative distillation framework for multi-modal domain generalization that overcomes WA’s bias toward faster-converging modalities through adaptive dropout, gradient consistency, and cross-modal distillation.

Details

Motivation: Weight Averaging (WA) struggles in multi-modal settings because different modalities converge at different speeds, causing WA to overfit to faster modalities and suppress slower complementary ones, leading to poor modality fusion and sharp loss minima.

Method: MBCD uses adaptive modality dropout to prevent early bias, gradient consistency to align learning signals between uni-modal and fused representations, and WA-based teacher for cross-modal distillation to strengthen interactions and guide toward flatter solutions.

Result: Extensive experiments on MMDG benchmarks show MBCD consistently outperforms existing methods, achieving superior accuracy and robustness across diverse unseen domains.

Conclusion: MBCD successfully retains WA’s flatness benefits while overcoming its multi-modal limitations, enabling effective modality fusion and improved generalization performance.

Abstract: Weight Averaging (WA) has emerged as a powerful technique for enhancing generalization by promoting convergence to a flat loss landscape, which correlates with stronger out-of-distribution performance. However, applying WA directly to multi-modal domain generalization (MMDG) is challenging: differences in optimization speed across modalities lead WA to overfit to faster-converging ones in early stages, suppressing the contribution of slower yet complementary modalities, thereby hindering effective modality fusion and skewing the loss surface toward sharper, less generalizable minima. To address this issue, we propose MBCD, a unified collaborative distillation framework that retains WA’s flatness-inducing advantages while overcoming its shortcomings in multi-modal contexts. MBCD begins with adaptive modality dropout in the student model to curb early-stage bias toward dominant modalities. A gradient consistency constraint then aligns learning signals between uni-modal branches and the fused representation, encouraging coordinated and smoother optimization. Finally, a WA-based teacher conducts cross-modal distillation by transferring fused knowledge to each uni-modal branch, which strengthens cross-modal interactions and steers convergence toward flatter solutions. Extensive experiments on MMDG benchmarks show that MBCD consistently outperforms existing methods, achieving superior accuracy and robustness across diverse unseen domains.
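
Illustration (not from the paper): a WA-based teacher can be maintained as a running mean of student checkpoints. The sketch uses one scalar per parameter name for brevity and an equal-weight running average. Whether MBCD uses this exact averaging schedule is not stated in the summary, and `WeightAverager` is a hypothetical name.

```python
class WeightAverager:
    """Keeps the equal-weight running mean of every checkpoint seen so far."""

    def __init__(self):
        self.count = 0
        self.average = None

    def update(self, weights):
        """weights: dict mapping parameter name -> float value."""
        if self.average is None:
            self.average = dict(weights)  # first checkpoint is the mean
        else:
            for name, value in weights.items():
                # Incremental mean: avg += (x - avg) / (n + 1)
                self.average[name] += (value - self.average[name]) / (self.count + 1)
        self.count += 1
        return self.average
```

The averaged weights change more slowly than the student's, which is the property a teacher needs in order to provide stable distillation targets.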

[204] MotionV2V: Editing Motion in a Video

Ryan Burgert, Charles Herrmann, Forrester Cole, Michael S Ryoo, Neal Wadhwa, Andrey Voynov, Nataniel Ruiz

Main category: cs.CV

TL;DR: A method for precise video motion editing by directly manipulating sparse trajectories from input videos, enabling natural motion propagation from any timestamp with strong user preference over prior work.

Details

Motivation: While generative video models have achieved high fidelity, precise motion control remains an under-explored paradigm for editing existing videos, offering a promising approach beyond text-to-video generation and image animation.

Method: Proposes modifying video motion by editing sparse trajectories extracted from input videos, using a pipeline to generate “motion counterfactuals” (video pairs with identical content but different motion), and fine-tuning a motion-conditioned video diffusion architecture on this dataset.

Result: The approach enables edits starting at any timestamp with natural propagation, achieving over 65% preference in a four-way head-to-head user study against prior work.

Conclusion: Direct trajectory editing coupled with generative backbones provides powerful video editing capabilities, demonstrating motion control as an effective paradigm for video manipulation.

Abstract: While generative video models have achieved remarkable fidelity and consistency, applying these capabilities to video editing remains a complex challenge. Recent research has explored motion controllability as a means to enhance text-to-video generation or image animation; however, we identify precise motion control as a promising yet under-explored paradigm for editing existing videos. In this work, we propose modifying video motion by directly editing sparse trajectories extracted from the input. We term the deviation between input and output trajectories a “motion edit” and demonstrate that this representation, when coupled with a generative backbone, enables powerful video editing capabilities. To achieve this, we introduce a pipeline for generating “motion counterfactuals”, video pairs that share identical content but distinct motion, and we fine-tune a motion-conditioned video diffusion architecture on this dataset. Our approach allows for edits that start at any timestamp and propagate naturally. In a four-way head-to-head user study, our model achieves over 65 percent preference against prior work. Please see our project page: https://ryanndagreat.github.io/MotionV2V
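
Illustration (not from the paper): the abstract defines a "motion edit" as the deviation between input and output trajectories. With sparse 2D point tracks, that deviation is a per-frame displacement. The sketch below also shows an edit that starts at an arbitrary frame. The track format and function names are assumptions.

```python
def shift_from(track, start_frame, dx, dy):
    """Return a copy of a track whose points are offset by (dx, dy)
    from start_frame onward; earlier frames are left untouched."""
    return [
        (x + dx, y + dy) if t >= start_frame else (x, y)
        for t, (x, y) in enumerate(track)
    ]


def motion_edit(input_tracks, edited_tracks):
    """Per-track, per-frame displacement between edited and input tracks.

    Each track is a list of (x, y) positions, one per frame.
    """
    return [
        [(ex - ix, ey - iy) for (ix, iy), (ex, ey) in zip(inp, out)]
        for inp, out in zip(input_tracks, edited_tracks)
    ]
```

Frames before the edit have zero displacement, so a model conditioned on this signal is asked to change the video only from the chosen timestamp onward.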

[205] Advancing Image Classification with Discrete Diffusion Classification Modeling

Omer Belhasin, Shelly Golan, Ran El-Yaniv, Michael Elad

Main category: cs.CV

TL;DR: DiDiCM is a diffusion-based classification framework that models posterior class distributions, achieving superior accuracy on challenging ImageNet tasks compared to standard classifiers.

Details

Motivation: Standard classification approaches perform suboptimally under high-uncertainty conditions like corrupted inputs or limited training data, as they directly predict labels without modeling uncertainty.

Method: Proposes Discrete Diffusion Classification Modeling (DiDiCM) that uses diffusion processes to model posterior distributions of class labels conditioned on input images, supporting diffusion on either class probabilities or discrete labels.

Result: DiDiCM achieves higher classification accuracy on ImageNet with just a few diffusion iterations, with accuracy gains increasing as tasks become more challenging compared to baseline classifiers.

Conclusion: Diffusion-based classification modeling provides an effective alternative to conventional approaches, especially in high-uncertainty scenarios, offering flexibility in computation-memory trade-offs.

Abstract: Image classification is a well-studied task in computer vision, and yet it remains challenging under high-uncertainty conditions, such as when input images are corrupted or training data are limited. Conventional classification approaches typically train models to directly predict class labels from input images, but this might lead to suboptimal performance in such scenarios. To address this issue, we propose Discrete Diffusion Classification Modeling (DiDiCM), a novel framework that leverages a diffusion-based procedure to model the posterior distribution of class labels conditioned on the input image. DiDiCM supports diffusion-based predictions either on class probabilities or on discrete class labels, providing flexibility in computation and memory trade-offs. We conduct a comprehensive empirical study demonstrating the superior performance of DiDiCM over standard classifiers, showing that a few diffusion iterations achieve higher classification accuracy on the ImageNet dataset compared to baselines, with accuracy gains increasing as the task becomes more challenging. We release our code at https://github.com/omerb01/didicm .

[206] MedROV: Towards Real-Time Open-Vocabulary Detection Across Diverse Medical Imaging Modalities

Tooba Tehreem Sheikh, Jean Lahoud, Rao Muhammad Anwer, Fahad Shahbaz Khan, Salman Khan, Hisham Cholakkal

Main category: cs.CV

TL;DR: MedROV is the first real-time open-vocabulary object detection model for medical imaging that can detect both known and novel structures, outperforming previous methods by 40 mAP50 while running at 70 FPS.

Details

Motivation: Traditional medical object detection models are limited to closed-set paradigms and cannot detect novel labels. Open-vocabulary detection remains underexplored in medical imaging due to dataset scarcity and weak text-image alignment.

Method: Curated Omnis dataset with 600K detection samples across 9 modalities, used pseudo-labeling for missing annotations, incorporated knowledge from large pre-trained foundation models, and leveraged contrastive learning with cross-modal representations.

Result: MedROV outperforms previous state-of-the-art foundation model by 40 mAP50 average absolute improvement, surpasses closed-set detectors by more than 3 mAP50, and runs at 70 FPS.

Conclusion: MedROV sets a new benchmark in medical detection by enabling real-time open-vocabulary object detection with superior performance compared to existing methods.

Abstract: Traditional object detection models in medical imaging operate within a closed-set paradigm, limiting their ability to detect objects of novel labels. Open-vocabulary object detection (OVOD) addresses this limitation but remains underexplored in medical imaging due to dataset scarcity and weak text-image alignment. To bridge this gap, we introduce MedROV, the first Real-time Open Vocabulary detection model for medical imaging. To enable open-vocabulary learning, we curate a large-scale dataset, Omnis, with 600K detection samples across nine imaging modalities and introduce a pseudo-labeling strategy to handle missing annotations from multi-source datasets. Additionally, we enhance generalization by incorporating knowledge from a large pre-trained foundation model. By leveraging contrastive learning and cross-modal representations, MedROV effectively detects both known and novel structures. Experimental results demonstrate that MedROV outperforms the previous state-of-the-art foundation model for medical image detection with an average absolute improvement of 40 mAP50, and surpasses closed-set detectors by more than 3 mAP50, while running at 70 FPS, setting a new benchmark in medical detection. Our source code, dataset, and trained model are available at https://github.com/toobatehreem/MedROV.

[207] DRL-Guided Neural Batch Sampling for Semi-Supervised Pixel-Level Anomaly Detection

Amirhossein Khadivi Noghredeh, Abdollah Safari, Fatemeh Ziaeetabar, Firoozeh Haghighi

Main category: cs.CV

TL;DR: Semi-supervised deep reinforcement learning framework for industrial anomaly detection that uses RL-based patch sampling, autoencoder loss profiles, and segmentation to detect subtle defects with limited labeled data.

Details

Motivation: Address the challenge of anomaly detection in industrial inspection where defective samples are scarce, and overcome limitations of unsupervised reconstruction methods that often overfit and fail to detect subtle defects.

Method: Proposes a framework with three components: RL-based neural batch sampler for adaptive patch selection, autoencoder for generating loss profiles highlighting abnormal regions, and predictor for segmentation in loss-profile space.

Result: Achieves higher accuracy and better localization of subtle anomalies on MVTec AD dataset, with average improvement of 0.15 in F1_max and 0.06 in AUC, and maximum gain of 0.37 in F1_max.

Conclusion: The semi-supervised deep RL framework effectively learns both normal and defective patterns with limited labeled data, outperforming state-of-the-art methods while maintaining low complexity.

Abstract: Anomaly detection in industrial visual inspection is challenging due to the scarcity of defective samples. Most existing methods rely on unsupervised reconstruction using only normal data, often resulting in overfitting and poor detection of subtle defects. We propose a semi-supervised deep reinforcement learning framework that integrates a neural batch sampler, an autoencoder, and a predictor. The RL-based sampler adaptively selects informative patches by balancing exploration and exploitation through a composite reward. The autoencoder generates loss profiles highlighting abnormal regions, while the predictor performs segmentation in the loss-profile space. This interaction enables the system to effectively learn both normal and defective patterns with limited labeled data. Experiments on the MVTec AD dataset demonstrate that our method achieves higher accuracy and better localization of subtle anomalies than recent state-of-the-art approaches while maintaining low complexity, yielding an average improvement of 0.15 in F1_max and 0.06 in AUC, with a maximum gain of 0.37 in F1_max in the best case.
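
Illustration (not from the paper): assuming F1_max denotes the best F1 score obtained over all decision thresholds on the anomaly scores (a common convention, not confirmed by the summary), it can be computed as below. `f1_max` is a hypothetical helper.

```python
def f1_max(scores, labels):
    """Best F1 over thresholds taken from the observed scores.

    scores: anomaly scores (higher means more anomalous)
    labels: ground truth, 1 for anomalous and 0 for normal
    """
    best = 0.0
    for threshold in sorted(set(scores)):
        pairs = list(zip(scores, labels))
        tp = sum(1 for s, y in pairs if s >= threshold and y == 1)
        fp = sum(1 for s, y in pairs if s >= threshold and y == 0)
        fn = sum(1 for s, y in pairs if s < threshold and y == 1)
        denom = 2 * tp + fp + fn
        if denom:
            best = max(best, 2 * tp / denom)  # F1 = 2TP / (2TP + FP + FN)
    return best
```

On this scale, the reported average gain of 0.15 in F1_max is an absolute difference between two such scores, each of which lies in [0, 1].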

[208] VKnowU: Evaluating Visual Knowledge Understanding in Multimodal LLMs

Tianxiang Jiang, Sheng Xia, Yicheng Xu, Linquan Wu, Xiangyu Zeng, Limin Wang, Yu Qiao, Yi Wang

Main category: cs.CV

TL;DR: VKnowU benchmark reveals MLLMs lack human-like visual knowledge understanding. VideoKnow+ model with See-Think-Answer paradigm and RL improves performance by 3.7%.

Details

Motivation: MLLMs excel at object recognition but lack intuitive understanding of physical and social principles that bridge perception and reasoning.

Method: Created VKnowU benchmark with 1,680 questions in 1,249 videos covering 8 visual knowledge types. Developed VideoKnow+ model using See-Think-Answer paradigm with reinforcement learning and visual knowledge reward.

Result: 23 SOTA MLLMs fall short of human performance, especially in world-centric knowledge. VideoKnow+ achieves +3.7% improvement on VKnowU and consistent gains on other benchmarks.

Conclusion: Visual knowledge is a missing cornerstone for developing more generalizable MLLMs that can truly understand physical and social worlds beyond just seeing.

Abstract: While Multimodal Large Language Models (MLLMs) have become adept at recognizing objects, they often lack the intuitive, human-like understanding of the world’s underlying physical and social principles. This high-level vision-grounded semantics, which we term visual knowledge, forms a bridge between perception and reasoning, yet remains an underexplored area in current MLLMs. To systematically evaluate this capability, we present VKnowU, a comprehensive benchmark featuring 1,680 questions in 1,249 videos, covering 8 core types of visual knowledge spanning both world-centric (e.g., intuitive physics) and human-centric (e.g., subjective intentions). Evaluation of 23 SOTA MLLMs reveals that leading models still fall short of human performance, with particularly notable gaps in world-centric knowledge. To bridge this gap, we introduce a new dataset, VKnowQA, and VideoKnow+, a baseline model that explicitly incorporates visual knowledge into MLLMs. VideoKnow+ follows a structured See-Think-Answer paradigm and adopts reinforcement learning with visual knowledge reward, achieving a +3.7% improvement on VKnowU and consistent gains on MVBench, Video-MME, and MMVU. Our work highlights visual knowledge as a missing cornerstone for developing more generalizable MLLMs that can not only see but also truly understand our physical and social worlds.

[209] ScenarioCLIP: Pretrained Transferable Visual Language Models and Action-Genome Dataset for Natural Scene Analysis

Advik Sinha, Saurabh Atreya, Aashutosh A, Sk Aziz Ali, Abhijit Das

Main category: cs.CV

TL;DR: ScenarioCLIP addresses the limitation of current CLIP models in handling compositional scenes by explicitly modeling relational structures between objects, achieving improved performance on scenario-based tasks.

Details

Motivation: Current CLIP models focus on single-object classification and lack explicit modeling of relational and compositional structures in real-world scenes with multiple objects and actions.

Method: Proposed ScenarioCLIP model accepts input texts, grounded relations, and images with focused regions highlighting relations. It’s pretrained on curated scenario data and finetuned for downstream tasks using a novel dataset generated from existing datasets with language models for grounding actions, objects, and relations.

Result: ScenarioCLIP demonstrates robust zero-shot and finetune performance on various domain-specific tasks, outperforming baseline methods on comprehensive scenario-based benchmarks.

Conclusion: Explicit modeling of relational structures in scene analysis significantly improves CLIP’s performance on compositional tasks, with ScenarioCLIP providing an effective framework for scenario understanding.

Abstract: Until recently, the general corpus of CLIP-type fundamental models has widely explored either the retrieval of short descriptions or the classification of objects in the scene as a single-object image classification task. The same holds for retrieving the image embedding (image retrieval task) given a text prompt. However, real-world scene images exhibit rich compositional structure involving multiple objects and actions. The latest methods in the CLIP-based literature improve class-level discrimination by mining harder negative image-text pairs and by refining permanent text prompts, often using LLMs. However, these improvements remain confined to predefined class lists and do not explicitly model relational or compositional structure. PyramidCLIP partially addresses this gap by aligning global and local visual features, yet it still lacks explicit modeling of inter-object relations. Hence, to further leverage this aspect for scene analysis, the proposed ScenarioCLIP model accepts input texts, grounded relations, and input images, along with focused regions highlighting relations. The proposed model is pretrained on curated scenario data, and finetuned for specialized downstream tasks, such as cross-modal retrieval and fine-grained visual understanding tasks. To address the lack of domain-specific datasets, we generate a novel dataset by extending image-text pairs from existing diverse indoor and outdoor scenario datasets that are publicly available. We used a pipeline of existing language models to ground action, object, and relations, filled by manual and automatic curation. We established a comprehensive benchmark for several scenario-based tasks and compared it with many baseline methods. ScenarioCLIP demonstrates robust zero-shot and finetune performance on various domain-specific tasks. Our code and dataset are available at https://github.com/scenario-clip/ScenarioCLIP

[210] DAPointMamba: Domain Adaptive Point Mamba for Point Cloud Completion

Yinghui Li, Qianyu Zhou, Di Shao, Hao Yang, Ye Zhu, Richard Dazeley, Xuequan Lu

Main category: cs.CV

TL;DR: DAPointMamba is a novel framework for domain adaptive point cloud completion that uses State Space Models to address geometric and semantic discrepancies between source and target domains with linear complexity and global receptive fields.

Details

Motivation: Existing methods for domain adaptive point cloud completion suffer from limited receptive fields or quadratic complexity when using CNNs or vision Transformers, and directly applying SSMs disrupts spatial topology and local geometric features.

Method: Proposes DAPointMamba with three modules: Cross-Domain Patch-Level Scanning for local alignment, Cross-Domain Spatial SSM Alignment for spatial consistency, and Cross-Domain Channel SSM Alignment for global semantic gaps.

Result: Extensive experiments on synthetic and real-world benchmarks show DAPointMamba outperforms state-of-the-art methods with less computational complexity and inference latency.

Conclusion: DAPointMamba demonstrates strong adaptability across domains with the advantages of global receptive fields and efficient linear complexity, making it superior to existing approaches.

Abstract: Domain adaptive point cloud completion (DA PCC) aims to narrow the geometric and semantic discrepancies between the labeled source and unlabeled target domains. Existing methods either suffer from limited receptive fields or quadratic complexity due to using CNNs or vision Transformers. In this paper, we present the first work that studies the adaptability of State Space Models (SSMs) in DA PCC and find that directly applying SSMs to DA PCC will encounter several challenges: directly serializing 3D point clouds into 1D sequences often disrupts the spatial topology and local geometric features of the target domain. Besides, overlooking designs for learning domain-agnostic representations hinders adaptation performance. To address these issues, we propose a novel framework, DAPointMamba for DA PCC, that exhibits strong adaptability across domains and has the advantages of global receptive fields and efficient linear complexity. It has three novel modules. In particular, Cross-Domain Patch-Level Scanning introduces patch-level geometric correspondences, enabling effective local alignment. Cross-Domain Spatial SSM Alignment further strengthens spatial consistency by modulating patch features based on cross-domain similarity, effectively mitigating fine-grained structural discrepancies. Cross-Domain Channel SSM Alignment actively addresses global semantic gaps by interleaving and aligning feature channels. Extensive experiments on both synthetic and real-world benchmarks demonstrate that our DAPointMamba outperforms state-of-the-art methods with less computational complexity and inference latency.

[211] SelfMOTR: Revisiting MOTR with Self-Generating Detection Priors

Fabian Gülhan, Emil Mededovic, Yuli Wu, Johannes Stegmaier

Main category: cs.CV

TL;DR: SelfMOTR is a novel tracking transformer that leverages self-generated detection priors from MOTR-like models to address detection-association conflicts in end-to-end tracking.

Details

Motivation: To overcome poor detection performance and the conflict between detection and association in joint transformer architectures for multi-object tracking, without relying on external object detectors.

Method: Uncovers hidden detection capabilities of MOTR-like models and introduces self-generated detection priors through analysis and practical tools for leveraging these capabilities effectively.

Result: Achieves strong performance on DanceTrack benchmark, competing with recent state-of-the-art end-to-end tracking methods.

Conclusion: SelfMOTR demonstrates that MOTR-like models inherently possess strong detection capabilities that can be effectively leveraged through self-generated priors for improved end-to-end tracking performance.

Abstract: Despite progress toward end-to-end tracking with transformer architectures, poor detection performance and the conflict between detection and association in a joint architecture remain critical concerns. Recent approaches aim to mitigate these issues by (i) employing advanced denoising or label assignment strategies, or (ii) incorporating detection priors from external object detectors via distillation or anchor proposal techniques. Inspired by the success of integrating detection priors and by the key insight that MOTR-like models are secretly strong detection models, we introduce SelfMOTR, a novel tracking transformer that relies on self-generated detection priors. Through extensive analysis and ablation studies, we uncover and demonstrate the hidden detection capabilities of MOTR-like models, and present a practical set of tools for leveraging them effectively. On DanceTrack, SelfMOTR achieves strong performance, competing with recent state-of-the-art end-to-end tracking methods.

[212] Bootstrapping Physics-Grounded Video Generation through VLM-Guided Iterative Self-Refinement

Yang Liu, Xilin Zhao, Peisong Wen, Siran Dai, Qingming Huang

Main category: cs.CV

TL;DR: Training-free iterative self-refinement framework using LLMs and VLMs to improve physics consistency in video generation through multimodal chain-of-thought feedback.

Details

Motivation: Current video generation models produce visually impressive results but often violate real-world physical principles, limiting their practical applicability.

Method: Proposes an iterative self-refinement framework with multimodal chain-of-thought process that refines prompts based on physics inconsistency feedback from LLMs and VLMs.

Result: Improves Physics-IQ score from 56.31 to 62.38 on the PhyIQ benchmark, demonstrating enhanced physics consistency in generated videos.

Conclusion: The framework serves as preliminary exploration of physics-consistent video generation and offers insights for future research in this direction.

Abstract: Recent progress in video generation has led to impressive visual quality, yet current models still struggle to produce results that align with real-world physical principles. To this end, we propose an iterative self-refinement framework that leverages large language models and vision-language models to provide physics-aware guidance for video generation. Specifically, we introduce a multimodal chain-of-thought (MM-CoT) process that refines prompts based on feedback from physical inconsistencies, progressively enhancing generation quality. This method is training-free and plug-and-play, making it readily applicable to a wide range of video generation models. Experiments on the PhyIQ benchmark show that our method improves the Physics-IQ score from 56.31 to 62.38. We hope this work serves as a preliminary exploration of physics-consistent video generation and may offer insights for future research.
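
The generate, critique, and rewrite cycle described in this abstract can be pictured as a short loop. The following is a minimal sketch only, not the authors' implementation: `generate`, `critique`, and `rewrite` are hypothetical callables standing in for the video generator, the VLM-based physics check, and the LLM prompt rewriter, and `max_rounds` is an assumed stopping parameter.

```python
from typing import Callable, List, Tuple


def refine_prompt(
    prompt: str,
    generate: Callable[[str], object],
    critique: Callable[[object, str], List[str]],
    rewrite: Callable[[str, List[str]], str],
    max_rounds: int = 3,
) -> Tuple[str, object, int]:
    """Refine a prompt until the critic reports no physics issues.

    All three callables are hypothetical placeholders:
      generate: prompt -> video
      critique: (video, prompt) -> list of detected physical inconsistencies
      rewrite:  (prompt, issues) -> revised prompt
    Returns the final prompt, the final video, and the number of rewrites used.
    """
    video = generate(prompt)
    for round_idx in range(max_rounds):
        issues = critique(video, prompt)
        if not issues:
            # The critic is satisfied, so stop early.
            return prompt, video, round_idx
        prompt = rewrite(prompt, issues)
        video = generate(prompt)
    return prompt, video, max_rounds


# Toy stand-ins so the loop can be run without any model.
def toy_generate(p: str) -> dict:
    return {"prompt": p}


def toy_critique(video: dict, p: str) -> List[str]:
    return [] if "gravity" in video["prompt"] else ["ball floats instead of falling"]


def toy_rewrite(p: str, issues: List[str]) -> str:
    return p + " The ball obeys gravity."


final_prompt, final_video, rounds = refine_prompt(
    "A ball rolls off a table.", toy_generate, toy_critique, toy_rewrite
)
```

Because the loop only touches the prompt, it needs no access to model weights, which is consistent with the training-free, plug-and-play framing above.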

[213] Back to the Feature: Explaining Video Classifiers with Video Counterfactual Explanations

Chao Wang, Chengan Che, Xinyue Chen, Sophia Tsoka, Luis C. Garcia-Peraza-Herrera

Main category: cs.CV

TL;DR: BTTF is a novel optimization framework that generates video counterfactual explanations (CFEs) by introducing conditioned latent noise initialization and two-stage optimization to create physically plausible, temporally coherent counterfactual videos.

Details

Motivation: Existing CFE methods are designed for image classifiers and cannot generate temporally coherent, smooth, and physically plausible video CFEs, leaving video classifier explanations largely unexplored.

Method: Proposes Back To The Feature (BTTF) with two key features: 1) optimization to retrieve initial latent noise conditioned by the first video frame, 2) two-stage optimization to search for counterfactual videos near the input video, guided by progressive optimization for faster convergence.

Result: Extensive experiments on Shape-Moving, MEAD, and NTU RGB+D datasets show BTTF effectively generates valid, visually similar, and realistic counterfactual videos that provide insights into classifier decision-making.

Conclusion: BTTF successfully addresses the gap in video CFE generation by producing physically plausible and temporally coherent counterfactual videos that faithfully explain video classifier decisions.

Abstract: Counterfactual explanations (CFEs) are minimal and semantically meaningful modifications of the input of a model that alter the model predictions. They highlight the decisive features the model relies on, providing contrastive interpretations for classifiers. State-of-the-art visual counterfactual explanation methods are designed to explain image classifiers. The generation of CFEs for video classifiers remains largely underexplored. For the counterfactual videos to be useful, they have to be physically plausible, temporally coherent, and exhibit smooth motion trajectories. Existing CFE image-based methods, designed to explain image classifiers, lack the capacity to generate temporally coherent, smooth and physically plausible video CFEs. To address this, we propose Back To The Feature (BTTF), an optimization framework that generates video CFEs. Our method introduces two novel features, 1) an optimization scheme to retrieve the initial latent noise conditioned by the first frame of the input video, 2) a two-stage optimization strategy to enable the search for counterfactual videos in the vicinity of the input video. Both optimization processes are guided solely by the target classifier, ensuring the explanation is faithful. To accelerate convergence, we also introduce a progressive optimization strategy that incrementally increases the number of denoising steps. Extensive experiments on video datasets such as Shape-Moving (motion classification), MEAD (emotion classification), and NTU RGB+D (action classification) show that our BTTF effectively generates valid, visually similar and realistic counterfactual videos that provide concrete insights into the classifier’s decision-making mechanism.

[214] CrossEarth-Gate: Fisher-Guided Adaptive Tuning Engine for Efficient Adaptation of Cross-Domain Remote Sensing Semantic Segmentation

Shilei Cao, Ziyang Gong, Hehai Lin, Yang Liu, Jiashun Cheng, Xiaoxing Hu, Haoyuan Liang, Guowen Li, Chengwei Qin, Hong Cheng, Xue Yang, Juepeng Zheng, Haohuan Fu

Main category: cs.CV

TL;DR: CrossEarth-Gate is a parameter-efficient fine-tuning method for remote sensing that uses a Fisher-guided adaptive selection mechanism to dynamically activate spatial, semantic, and frequency modules to handle multifaceted domain gaps in large-scale Earth observation tasks.

Details

Motivation: Existing PEFT methods fail on large-scale RS tasks because they cannot fully handle the multifaceted and unpredictable domain gaps (spatial, semantic, and frequency shifts) inherent in remote sensing data.

Method: Establishes a comprehensive RS module toolbox with spatial, semantic, and frequency modules, and uses a Fisher-guided adaptive selection mechanism that quantifies module importance through Fisher Information to dynamically activate only critical modules at appropriate layers.

Result: Achieves state-of-the-art performance across 16 cross-domain benchmarks for RS semantic segmentation, demonstrating efficacy and generalizability.

Conclusion: CrossEarth-Gate effectively addresses multifaceted domain gaps in remote sensing through adaptive module selection, providing an efficient and effective PEFT solution for large-scale Earth observation tasks.

Abstract: In Remote Sensing (RS), Parameter-Efficient Fine-Tuning (PEFT) has emerged as a key approach to activate the generalizable representation ability of foundation models for downstream tasks. However, existing specialized PEFT methods often fail when applied to large-scale Earth observation tasks, as they are unable to fully handle the multifaceted and unpredictable domain gaps (e.g., spatial, semantic, and frequency shifts) inherent in RS data. To overcome this, we propose CrossEarth-Gate, which introduces two primary contributions. First, we establish a comprehensive RS module toolbox to address multifaceted domain gaps, comprising spatial, semantic, and frequency modules. Second, we develop a Fisher-guided adaptive selection mechanism that operates on this toolbox. This selection is guided by Fisher Information to quantify each module’s importance by measuring its contribution to the task-specific gradient flow. It dynamically activates only the most critical modules at the appropriate layers, guiding the gradient flow to maximize adaptation effectiveness and efficiency. Comprehensive experiments validate the efficacy and generalizability of our method, where CrossEarth-Gate achieves state-of-the-art performance across 16 cross-domain benchmarks for RS semantic segmentation. The code of the work will be released.
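
To make the Fisher-guided selection idea concrete, here is a minimal sketch that assumes the common diagonal empirical Fisher approximation (the mean of squared per-sample gradients). The summary does not give the paper's exact scoring rule, so the per-parameter normalization, the top-k rule, and the names `fisher_scores` and `select_modules` are illustrative assumptions.

```python
import numpy as np


def fisher_scores(per_sample_grads):
    """Score each module by its diagonal empirical Fisher information.

    per_sample_grads maps a module name to an array of shape
    (n_samples, n_params) holding per-sample loss gradients.
    The score is the mean squared gradient, averaged over parameters so
    that larger modules are not favored by size alone (an assumption).
    """
    return {
        name: float(np.mean(np.sum(g ** 2, axis=1)) / g.shape[1])
        for name, g in per_sample_grads.items()
    }


def select_modules(scores, k):
    """Return the k module names with the highest Fisher score."""
    return sorted(scores, key=scores.get, reverse=True)[:k]


# Toy gradients for three hypothetical modules at one layer.
grads = {
    "spatial": np.full((4, 2), 0.1),
    "semantic": np.full((4, 2), 0.5),
    "frequency": np.full((4, 2), 0.3),
}
scores = fisher_scores(grads)
active = select_modules(scores, k=2)
```

In this toy setting the semantic and frequency modules receive the largest scores and would be the ones activated; in practice the gradients would come from backpropagation on task data.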

[215] TaCo: Capturing Spatio-Temporal Semantic Consistency in Remote Sensing Change Detection

Han Guo, Chenyang Liu, Haotian Zhang, Bowen Chen, Zhengxia Zou, Zhenwei Shi

Main category: cs.CV

TL;DR: TaCo is a spatio-temporal semantic consistent network for remote sensing change detection that enriches mask supervision with spatio-temporal semantic joint constraints, conceptualizing change as semantic transitions between bi-temporal states.

Details

Motivation: Previous methods rely solely on mask supervision which provides limited constraints on temporal semantic transitions, leading to spatially coherent predictions but unresolved semantic inconsistencies.

Method: Proposes TaCo with Text-guided Transition Generator that integrates textual semantics with bi-temporal visual features to construct cross-temporal transition features, and spatio-temporal semantic joint constraints including bi-temporal reconstruct constraints and transition constraint.

Result: Extensive experiments on six public datasets for both binary and semantic change detection tasks demonstrate that TaCo consistently achieves state-of-the-art performance without additional computational overhead during inference.

Conclusion: TaCo effectively addresses semantic inconsistency issues in change detection by introducing spatio-temporal semantic joint constraints, achieving superior performance across multiple datasets and tasks.

Abstract: Remote sensing change detection (RSCD) aims to identify surface changes across bi-temporal satellite images. Most previous methods rely solely on mask supervision, which effectively guides spatial localization but provides limited constraints on the temporal semantic transitions. Consequently, they often produce spatially coherent predictions while still suffering from unresolved semantic inconsistencies. To address this limitation, we propose TaCo, a spatio-temporal semantic consistent network, which enriches the existing mask-supervised framework with a spatio-temporal semantic joint constraint. TaCo conceptualizes change as a semantic transition between bi-temporal states, in which one temporal feature representation can be derived from the other via dedicated transition features. To realize this, we introduce a Text-guided Transition Generator that integrates textual semantics with bi-temporal visual features to construct the cross-temporal transition features. In addition, we propose a spatio-temporal semantic joint constraint consisting of bi-temporal reconstruct constraints and a transition constraint: the former enforces alignment between reconstructed and original features, while the latter enhances discrimination for changes. This design can yield substantial performance gains without introducing any additional computational overhead during inference. Extensive experiments on six public datasets, spanning both binary and semantic change detection tasks, demonstrate that TaCo consistently achieves SOTA performance.

[216] TReFT: Taming Rectified Flow Models For One-Step Image Translation

Shengqian Li, Ming Gao, Yi Liu, Zuzeng Lin, Feng Wang, Feng Dai

Main category: cs.CV

TL;DR: TReFT enables one-step image translation using pretrained Rectified Flow models by directly using the velocity prediction as output, achieving real-time performance comparable to state-of-the-art methods.

Details

Motivation: Rectified Flow models require costly multi-step denoising for image translation, hindering real-time applications. Direct adversarial training on RF models causes convergence issues.

Method: TReFT uses the velocity predicted by pretrained DiT or UNet as output, with memory-efficient latent cycle-consistency and identity losses, plus lightweight architectural simplifications.

Result: Pretrained RF models finetuned with TReFT achieve performance comparable to state-of-the-art methods across multiple image translation datasets while enabling real-time inference.

Conclusion: TReFT successfully enables one-step image translation with RF models by leveraging the velocity prediction property, overcoming convergence issues and achieving real-time performance.

Abstract: Rectified Flow (RF) models have advanced high-quality image and video synthesis via optimal transport theory. However, when applied to image-to-image translation, they still depend on costly multi-step denoising, hindering real-time applications. Although the recent adversarial training paradigm, CycleGAN-Turbo, works in pretrained diffusion models for one-step image translation, we find that directly applying it to RF models leads to severe convergence issues. In this paper, we analyze these challenges and propose TReFT, a novel method to Tame Rectified Flow models for one-step image Translation. Unlike previous works, TReFT directly uses the velocity predicted by pretrained DiT or UNet as output, a simple yet effective design that tackles the convergence issues under adversarial training with one-step inference. This design is mainly motivated by a novel observation that, near the end of the denoising process, the velocity predicted by pretrained RF models converges to the vector from origin to the final clean image, a property we further justify through theoretical analysis. When applying TReFT to large pretrained RF models such as SD3.5 and FLUX, we introduce memory-efficient latent cycle-consistency and identity losses during training, as well as lightweight architectural simplifications for faster inference. Pretrained RF models finetuned with TReFT achieve performance comparable to SOTA methods across multiple image translation datasets while enabling real-time inference.

[217] IrisNet: Infrared Image Status Awareness Meta Decoder for Infrared Small Targets Detection

Xuelin Qian, Jiaming Lu, Zixuan Wang, Wenxuan Wang, Zhongling Huang, Dingwen Zhang, Junwei Han

Main category: cs.CV

TL;DR: IrisNet is a meta-learned framework for infrared small target detection that dynamically adapts decoder parameters to input image characteristics using a transformer-based image-to-decoder mapping, achieving state-of-the-art performance.

Details

Motivation: Address pattern drift in static deep learning models across diverse infrared scenarios (day/night, sky/maritime/ground domains) due to low signal-to-noise ratios and complex backgrounds.

Method: Uses an image-to-decoder transformer to dynamically map infrared image features to entire decoder parameters, representing decoder as structured 2D tensor with self-attention for inter-layer dependencies and cross-attention for adaptive patterns. Integrates high-frequency components for enhanced target and edge perception.

Result: Achieves state-of-the-art performance on NUDT-SIRST, NUAA-SIRST, and IRSTD-1K datasets, demonstrating superior robustness across diverse infrared scenarios.

Conclusion: IrisNet effectively addresses pattern drift in infrared small target detection through dynamic parameter adaptation, providing a more robust solution than static deep learning approaches.

Abstract: Infrared Small Target Detection (IRSTD) faces significant challenges due to low signal-to-noise ratios, complex backgrounds, and the absence of discernible target features. While deep learning-based encoder-decoder frameworks have advanced the field, their static pattern learning suffers from pattern drift across diverse scenarios (e.g., day/night variations, sky/maritime/ground domains), limiting robustness. To address this, we propose IrisNet, a novel meta-learned framework that dynamically adapts detection strategies to the input infrared image status. Our approach establishes a dynamic mapping between infrared image features and entire decoder parameters via an image-to-decoder transformer. More concretely, we represent the parameterized decoder as a structured 2D tensor preserving hierarchical layer correlations and enable the transformer to model inter-layer dependencies through self-attention while generating adaptive decoding patterns via cross-attention. To further enhance the perception ability of infrared images, we integrate high-frequency components to supplement target-position and scene-edge information. Experiments on NUDT-SIRST, NUAA-SIRST, and IRSTD-1K datasets demonstrate the superiority of our IrisNet, achieving state-of-the-art performance.

[218] AD-R1: Closed-Loop Reinforcement Learning for End-to-End Autonomous Driving with Impartial World Models

Tianyi Yan, Tao Tang, Xingtai Gui, Yongkang Li, Jiasen Zhesng, Weiyao Huang, Lingdong Kong, Wencheng Han, Xia Zhou, Xueyang Zhang, Yifei Zhan, Kun Zhan, Cheng-zhong Xu, Jianbing Shen

Main category: cs.CV

TL;DR: The paper introduces an Impartial World Model framework that teaches autonomous driving systems to accurately predict dangerous scenarios through counterfactual data synthesis, enabling safer policy refinement via reinforcement learning.

Details

Motivation: End-to-end autonomous driving models struggle with safety and long-tail events. RL could help but is hindered by optimistic bias in world models that fail to properly account for danger.

Method: Developed an Impartial World Model using Counterfactual Synthesis to generate plausible collision and off-road events. Integrated this model as an internal critic in a closed-loop RL framework where agents query it to simulate outcomes of candidate actions.

Result: Significantly outperforms baselines in predicting failures on the Risk Foreseeing Benchmark. Enables substantial reduction in safety violations in challenging simulations.

Conclusion: Teaching models to accurately dream of danger is a critical step toward building truly safe and intelligent autonomous agents, addressing the fundamental optimistic bias problem in autonomous driving RL.

Abstract: End-to-end models for autonomous driving hold the promise of learning complex behaviors directly from sensor data, but face critical challenges in safety and handling long-tail events. Reinforcement Learning (RL) offers a promising path to overcome these limitations, yet its success in autonomous driving has been elusive. We identify a fundamental flaw hindering this progress: a deep-seated optimistic bias in the world models used for RL. To address this, we introduce a framework for post-training policy refinement built around an Impartial World Model. Our primary contribution is to teach this model to be honest about danger. We achieve this with a novel data synthesis pipeline, Counterfactual Synthesis, which systematically generates a rich curriculum of plausible collisions and off-road events. This transforms the model from a passive scene completer into a veridical forecaster that remains faithful to the causal link between actions and outcomes. We then integrate this Impartial World Model into our closed-loop RL framework, where it serves as an internal critic. During refinement, the agent queries the critic to "dream" of the outcomes for candidate actions. We demonstrate through extensive experiments, including on a new Risk Foreseeing Benchmark, that our model significantly outperforms baselines in predicting failures. Consequently, when used as a critic, it enables a substantial reduction in safety violations in challenging simulations, proving that teaching a model to dream of danger is a critical step towards building truly safe and intelligent autonomous agents.
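
One way to picture the step in which the agent queries the critic about candidate actions is the sketch below. It is an illustration under stated assumptions, not the paper's method: the `critic` callable, its `progress` and `collision_prob` outputs, and the `risk_weight` penalty are all hypothetical, and the summary does not describe the real reward design.

```python
from typing import Callable, Dict, List, Sequence, Tuple


def score_candidates(
    candidates: Sequence[str],
    critic: Callable[[str], Dict[str, float]],
    risk_weight: float = 10.0,
) -> List[Tuple[float, str]]:
    """Ask a world-model critic to predict the outcome of each candidate action.

    critic is a hypothetical function returning predicted 'progress' and
    'collision_prob' for an action. The reward trades progress against
    predicted danger; the weighting is an assumption.
    """
    scored = []
    for action in candidates:
        outcome = critic(action)
        reward = outcome["progress"] - risk_weight * outcome["collision_prob"]
        scored.append((reward, action))
    return scored


# Toy critic: an aggressive cut-in makes more progress but is far riskier.
def toy_critic(action: str) -> Dict[str, float]:
    table = {
        "keep_lane": {"progress": 1.0, "collision_prob": 0.01},
        "cut_in": {"progress": 1.5, "collision_prob": 0.4},
    }
    return table[action]


scored = score_candidates(["keep_lane", "cut_in"], toy_critic)
best_reward, best_action = max(scored)
```

A critic with an optimistic bias would underestimate `collision_prob` for the cut-in and could rank it first, which is the failure mode the abstract attributes to conventional world models.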

[219] ShelfRectNet: Single View Shelf Image Rectification with Homography Estimation

Onur Berk Tore, Ibrahim Samil Yalciner, Server Calap

Main category: cs.CV

TL;DR: Deep learning framework for single-image homography estimation using ConvNeXt backbone and normalized coordinate regression, achieving competitive accuracy and speed for shelf image rectification.

Details

Motivation: Single-view homography estimation is challenging but valuable for retail applications like shelf monitoring where only one viewpoint is available for product alignment.

Method: Uses ConvNeXt-based backbone for feature representation with normalized coordinate regression, plus novel augmentation strategy by modeling and sampling synthetic homographies to address data scarcity.

Result: Achieves mean corner error of 1.298 pixels on test set, demonstrating competitive performance in both accuracy and inference speed compared to classical and deep learning approaches.

Conclusion: Establishes a robust and efficient solution for real-world single-view rectification, with dataset and code to be made publicly available to encourage further research.

Abstract: Estimating homography from a single image remains a challenging yet practically valuable task, particularly in domains like retail, where only one viewpoint is typically available for shelf monitoring and product alignment. In this paper, we present a deep learning framework that predicts a 4-point parameterized homography matrix to rectify shelf images captured from arbitrary angles. Our model leverages a ConvNeXt-based backbone for enhanced feature representation and adopts normalized coordinate regression for improved stability. To address data scarcity and promote generalization, we introduce a novel augmentation strategy by modeling and sampling synthetic homographies. Our method achieves a mean corner error of 1.298 pixels on the test set. When compared with both classical computer vision and deep learning-based approaches, our method demonstrates competitive performance in both accuracy and inference speed. Together, these results establish our approach as a robust and efficient solution for real-world single-view rectification. To encourage further research in this domain, we will make our dataset, ShelfRectSet, and code publicly available.
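
For readers unfamiliar with the 4-point parameterization and the mean corner error metric mentioned above, the sketch below shows how four corner correspondences determine a homography (via the standard direct linear transform) and how the error is computed. It uses only NumPy; the function names, the toy corner values, and the normalization by image size are illustrative assumptions, and the network that would predict the corners is not modeled.

```python
import numpy as np


def normalize_corners(corners, width, height):
    """Scale pixel coordinates to [0, 1] by image size (assumed normalization)."""
    return np.asarray(corners, dtype=float) / np.array([width, height], dtype=float)


def homography_from_4pts(src, dst):
    """Solve for the 3x3 homography mapping src -> dst from four point pairs (DLT)."""
    rows = []
    for (x, y), (u, v) in zip(src, dst):
        rows.append([-x, -y, -1, 0, 0, 0, u * x, u * y, u])
        rows.append([0, 0, 0, -x, -y, -1, v * x, v * y, v])
    A = np.asarray(rows, dtype=float)
    # The homography is the null vector of A: the last right singular vector.
    _, _, vt = np.linalg.svd(A)
    H = vt[-1].reshape(3, 3)
    # Fix the scale; assumes H[2, 2] is nonzero (true for non-degenerate cases).
    return H / H[2, 2]


def apply_homography(H, pts):
    """Map 2D points through H using homogeneous coordinates."""
    pts = np.asarray(pts, dtype=float)
    homog = np.hstack([pts, np.ones((len(pts), 1))])
    mapped = homog @ H.T
    return mapped[:, :2] / mapped[:, 2:3]


def mean_corner_error(pred, true):
    """Average Euclidean distance between predicted and true corners."""
    diff = np.asarray(pred, dtype=float) - np.asarray(true, dtype=float)
    return float(np.mean(np.linalg.norm(diff, axis=1)))


# Toy example: corners of a 100x100 image and their hypothetical warped positions.
src = [(0, 0), (100, 0), (100, 100), (0, 100)]
dst = [(5, 3), (95, 8), (98, 97), (2, 92)]
H = homography_from_4pts(src, dst)
reprojected = apply_homography(H, src)

# A prediction that is off by (3, 4) pixels at every corner has error 5 pixels.
shifted = np.asarray(dst, dtype=float) + np.array([3.0, 4.0])
mce = mean_corner_error(shifted, dst)
```

Regressing four corner positions rather than the nine matrix entries keeps all targets on a comparable scale, which is one commonly cited motivation for the 4-point form.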

[220] AMB3R: Accurate Feed-forward Metric-scale 3D Reconstruction with Backend

Hengyi Wang, Lourdes Agapito

Main category: cs.CV

TL;DR: AMB3R is a multi-view feed-forward model for dense 3D reconstruction that uses a sparse volumetric scene representation and achieves state-of-the-art performance across multiple 3D vision tasks without task-specific fine-tuning.

Details

Motivation: To create a unified model that can handle diverse 3D vision tasks (multi-view reconstruction, visual odometry, structure from motion) using a single architecture without requiring task-specific modifications or optimization.

Method: Leverages a sparse yet compact volumetric scene representation as backend for geometric reasoning, trained solely for multi-view reconstruction but can be extended to other tasks through feed-forward inference.

Result: Achieves state-of-the-art performance in camera pose, depth, and metric-scale estimation, 3D reconstruction, and surpasses optimization-based SLAM and SfM methods with dense reconstruction priors on common benchmarks.

Conclusion: AMB3R demonstrates that a single feed-forward model trained for multi-view reconstruction can effectively generalize to multiple 3D vision tasks without task-specific fine-tuning, outperforming specialized optimization-based methods.

Abstract: We present AMB3R, a multi-view feed-forward model for dense 3D reconstruction on a metric-scale that addresses diverse 3D vision tasks. The key idea is to leverage a sparse, yet compact, volumetric scene representation as our backend, enabling geometric reasoning with spatial compactness. Although trained solely for multi-view reconstruction, we demonstrate that AMB3R can be seamlessly extended to uncalibrated visual odometry (online) or large-scale structure from motion without the need for task-specific fine-tuning or test-time optimization. Compared to prior pointmap-based models, our approach achieves state-of-the-art performance in camera pose, depth, and metric-scale estimation, 3D reconstruction, and even surpasses optimization-based SLAM and SfM methods with dense reconstruction priors on common benchmarks.

[221] Material-informed Gaussian Splatting for 3D World Reconstruction in a Digital Twin

João Malheiro Silva, Andy Huynh, Tong Duy Son, Holger Caesar

Main category: cs.CV

TL;DR: Camera-only pipeline for 3D reconstruction using 3D Gaussian Splatting with semantic material extraction and physics-based material assignment for digital twins.

Details

Motivation: LiDAR-based methods lack semantics and textures from cameras, while traditional fusion approaches require complex calibration and struggle with materials like glass.

Method: Uses 3D Gaussian Splatting from multi-view images, extracts semantic material masks via vision models, converts to mesh surfaces with material labels, and assigns physics-based material properties.

Result: Provides sensor simulation fidelity comparable to LiDAR-camera fusion while eliminating hardware complexity and calibration requirements.

Conclusion: Camera-only approach successfully combines photorealistic reconstruction with physics-based material assignment, validated using LiDAR as ground truth.

Abstract: 3D reconstruction for Digital Twins often relies on LiDAR-based methods, which provide accurate geometry but lack the semantics and textures naturally captured by cameras. Traditional LiDAR-camera fusion approaches require complex calibration and still struggle with certain materials like glass, which are visible in images but poorly represented in point clouds. We propose a camera-only pipeline that reconstructs scenes using 3D Gaussian Splatting from multi-view images, extracts semantic material masks via vision models, converts Gaussian representations to mesh surfaces with projected material labels, and assigns physics-based material properties for accurate sensor simulation in modern graphics engines and simulators. This approach combines photorealistic reconstruction with physics-based material assignment, providing sensor simulation fidelity comparable to LiDAR-camera fusion while eliminating hardware complexity and calibration requirements. We validate our camera-only method using an internal dataset from an instrumented test vehicle, leveraging LiDAR as ground truth for reflectivity validation alongside image similarity metrics.

[222] Thinking in 360°: Humanoid Visual Search in the Wild

Heyang Yu, Yinan Han, Xiangyu Zhang, Baiqiao Yin, Bowen Chang, Xiangyu Han, Xinhao Liu, Jing Zhang, Marco Pavone, Chen Feng, Saining Xie, Yiming Li

Main category: cs.CV

TL;DR: Proposes humanoid visual search using embodied agents that rotate heads in 360° panoramic environments, introduces H* Bench benchmark with challenging real-world scenes, and shows significant performance improvements through post-training while highlighting remaining challenges.

Details

Motivation: To develop embodied visual search agents as efficient as humans while bypassing real-world hardware constraints, moving beyond static image approaches to consider physical embodiment and 3D world interaction.

Method: Humanoid visual search where agents actively rotate heads in 360° panoramic images, using post-training techniques to enhance open-source models like Qwen2.5-VL on the new H* Bench benchmark with challenging real-world scenes.

Result: Proprietary models achieved only ~30% success rate. Post-training improved Qwen2.5-VL’s success rate from 14.83% to 47.38% for object search and from 6.44% to 24.94% for path search, with path search being inherently more difficult.

Conclusion: Shows promising progress but quantifies immense remaining challenges in building MLLM agents for seamless integration into everyday human life, particularly due to demands for sophisticated spatial commonsense.

Abstract: Humans rely on the synergistic control of head (cephalomotor) and eye (oculomotor) to efficiently search for visual information in 360°. However, prior approaches to visual search are limited to a static image, neglecting the physical embodiment and its interaction with the 3D world. How can we develop embodied visual search agents as efficient as humans while bypassing the constraints imposed by real-world hardware? To this end, we propose humanoid visual search where a humanoid agent actively rotates its head to search for objects or paths in an immersive world represented by a 360° panoramic image. To study visual search in visually-crowded real-world scenarios, we build H* Bench, a new benchmark that moves beyond household scenes to challenging in-the-wild scenes that necessitate advanced visual-spatial reasoning capabilities, such as transportation hubs, large-scale retail spaces, urban streets, and public institutions. Our experiments first reveal that even top-tier proprietary models falter, achieving only ~30% success in object and path search. We then use post-training techniques to enhance the open-source Qwen2.5-VL, increasing its success rate by over threefold for both object search (14.83% to 47.38%) and path search (6.44% to 24.94%). Notably, the lower ceiling of path search reveals its inherent difficulty, which we attribute to the demand for sophisticated spatial commonsense. Our results not only show a promising path forward but also quantify the immense challenge that remains in building MLLM agents that can be seamlessly integrated into everyday human life.
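
The "over threefold" claim can be checked directly from the success rates quoted in the abstract. The sketch below is only that arithmetic; the dictionary layout and the helper name `improvement_factor` are illustrative assumptions, not part of the paper.

```python
# Success rates (%) quoted in the abstract, before and after post-training.
reported = {
    "object search": (14.83, 47.38),
    "path search": (6.44, 24.94),
}


def improvement_factor(before: float, after: float) -> float:
    """Ratio of the post-training success rate to the baseline rate."""
    return after / before


factors = {task: improvement_factor(b, a) for task, (b, a) in reported.items()}
for task, factor in factors.items():
    print(f"{task}: {factor:.2f}x")
```

Both ratios exceed 3, which matches the abstract's wording, while the absolute path-search rate stays well below the object-search rate.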

[223] GS-Checker: Tampering Localization for 3D Gaussian Splatting

Haoliang Han, Ziyuan Luo, Jun Qi, Anderson Rocha, Renjie Wan

Main category: cs.CV

TL;DR: GS-Checker is a method for locating tampered regions in 3D Gaussian Splatting models without requiring expensive 3D supervision labels.

Details

Motivation: Recent advances in 3DGS editing technologies raise concerns about malicious manipulation of 3D content, making tampering localization crucial for security.

Method: Integrates 3D tampering attributes into Gaussian parameters, uses 3D contrastive mechanism to compare attribute similarities, and employs cyclic optimization to refine tampering detection.

Result: Extensive experiments demonstrate effective localization of tampered areas in 3DGS models.

Conclusion: The proposed GS-Checker method successfully addresses the need for tampering detection in 3DGS content without requiring costly 3D labels.

Abstract: Recent advances in editing technologies for 3D Gaussian Splatting (3DGS) have made it simple to manipulate 3D scenes. However, these technologies raise concerns about potential malicious manipulation of 3D content. To avoid such malicious applications, localizing tampered regions becomes crucial. In this paper, we propose GS-Checker, a novel method for locating tampered areas in 3DGS models. Our approach integrates a 3D tampering attribute into the 3D Gaussian parameters to indicate whether the Gaussian has been tampered with. Additionally, we design a 3D contrastive mechanism that compares the similarity of key attributes between 3D Gaussians to seek tampering cues at the 3D level. Furthermore, we introduce a cyclic optimization strategy to refine the 3D tampering attribute, enabling more accurate tampering localization. Notably, our approach does not require expensive 3D labels for supervision. Extensive experimental results demonstrate the effectiveness of our proposed method in locating tampered 3DGS areas.

[224] VGGTFace: Topologically Consistent Facial Geometry Reconstruction in the Wild

Xin Ming, Yuxuan Han, Tianyu Huang, Feng Xu

Main category: cs.CV

TL;DR: VGGTFace is an automatic method that uses the 3D foundation model VGGT for topologically consistent facial geometry reconstruction from multi-view images, achieving high-quality results in 10 seconds.

Details

Motivation: Existing methods for facial geometry reconstruction require manual effort, lack generalization to in-the-wild data, or are limited by 3D Morphable Models' expressiveness.

Method: Augments VGGT with Pixel3DMM to inject topology information via pixel-aligned UV values, then uses Topology-Aware Bundle Adjustment with Laplacian energy to fuse point clouds with known topology.

Result: Achieves state-of-the-art results on benchmarks with impressive generalization to in-the-wild data, reconstructing in 10 seconds for 16 views on a single RTX 4090.

Conclusion: VGGTFace successfully addresses limitations of existing methods by leveraging VGGT’s generalization and expressive power while adding topology information for consistent facial reconstruction.

Abstract: Reconstructing topologically consistent facial geometry is crucial for digital avatar creation pipelines. Existing methods either require tedious manual effort, lack generalization to in-the-wild data, or are constrained by the limited expressiveness of 3D Morphable Models. To address these limitations, we propose VGGTFace, an automatic approach that innovatively applies the 3D foundation model, i.e., VGGT, for topologically consistent facial geometry reconstruction from in-the-wild multi-view images captured by everyday users. Our key insight is that, by leveraging VGGT, our method naturally inherits strong generalization ability and expressive power from its large-scale training and point map representation. However, it is unclear how to reconstruct a topologically consistent mesh from VGGT, as the topology information is missing in its prediction. To this end, we augment VGGT with Pixel3DMM for injecting topology information via pixel-aligned UV values. In this manner, we convert the pixel-aligned point map of VGGT to a point cloud with topology. Tailored to this point cloud with known topology, we propose a novel Topology-Aware Bundle Adjustment strategy to fuse them, where we construct a Laplacian energy for the Bundle Adjustment objective. Our method achieves high-quality reconstruction in 10 seconds for 16 views on a single NVIDIA RTX 4090. Experiments demonstrate state-of-the-art results on benchmarks and impressive generalization to in-the-wild data. Code is available at https://github.com/grignarder/vggtface.
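
The abstract does not give the form of the Laplacian energy, so the following is only a minimal sketch of one common choice: a uniform (unweighted) graph Laplacian that penalizes each vertex's squared distance from the centroid of its neighbors. The function name `laplacian_energy`, the adjacency-dictionary layout, and the uniform weighting are all assumptions for illustration, not the paper's formulation.

```python
from typing import Dict, List, Sequence

Vec3 = Sequence[float]


def laplacian_energy(vertices: List[Vec3], neighbors: Dict[int, List[int]]) -> float:
    """Sum over listed vertices of ||v_i - mean(v_j for j in N(i))||^2.

    Uniform weights are an assumption; cotangent weights are another option.
    """
    energy = 0.0
    for i, nbrs in neighbors.items():
        if not nbrs:
            continue  # isolated vertices contribute nothing
        for axis in range(3):
            mean = sum(vertices[j][axis] for j in nbrs) / len(nbrs)
            energy += (vertices[i][axis] - mean) ** 2
    return energy


# Three vertices on a chain; only the interior vertex 1 is regularized.
neighbors = {1: [0, 2]}
flat = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
bumpy = [(0.0, 0.0, 0.0), (1.0, 1.0, 0.0), (2.0, 0.0, 0.0)]

flat_energy = laplacian_energy(flat, neighbors)
bumpy_energy = laplacian_energy(bumpy, neighbors)
```

A term of this kind is only computable because the topology (who neighbors whom) is known, which is why the pixel-aligned UV values matter before fusion.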

[225] FREE: Uncertainty-Aware Autoregression for Parallel Diffusion Transformers

Xinwan Wen, Bowen Li, Jiajun Luo, Ye Li, Zhi Wang

Main category: cs.CV

TL;DR: FREE is a novel framework that accelerates Diffusion Transformers (DiTs) through feature-level autoregression with parallel verification, achieving up to 2.25× speedup while maintaining generation quality.

Details

Motivation: DiTs achieve state-of-the-art generation quality but suffer from high inference latency due to long sequential denoising trajectories. Existing speculative inference methods have limited acceleration on DiTs due to insufficient draft accuracy.

Method: Proposes the FREE framework, in which a lightweight drafter performs feature-level autoregression on the DiT’s top-block features with parallel verification, plus an uncertainty-guided relaxation strategy, FREE (relax), that dynamically adjusts the acceptance probability based on prediction variance.

Result: Experiments on ImageNet-512² show FREE achieves up to 1.86× acceleration, and FREE (relax) reaches 2.25× speedup while maintaining high perceptual and quantitative fidelity in generation quality.

Conclusion: FREE effectively accelerates DiTs through feature-level speculative inference with uncertainty-aware relaxation, providing significant speedup without compromising generation quality.

Abstract: Diffusion Transformers (DiTs) achieve state-of-the-art generation quality but require long sequential denoising trajectories, leading to high inference latency. Recent speculative inference methods enable lossless parallel sampling in U-Net-based diffusion models via a drafter-verifier scheme, but their acceleration is limited on DiTs due to insufficient draft accuracy during verification. To address this limitation, we analyze the DiTs’ feature dynamics and find the features of the final transformer layer (top-block) exhibit strong temporal consistency and rich semantic abstraction. Based on this insight, we propose FREE, a novel framework that employs a lightweight drafter to perform feature-level autoregression with parallel verification, guaranteeing lossless acceleration with theoretical and empirical support. Meanwhile, prediction variance (uncertainty) of DiTs naturally increases in later denoising steps, reducing acceptance rates under speculative sampling. To mitigate this effect, we further introduce an uncertainty-guided relaxation strategy, forming FREE (relax), which dynamically adjusts the acceptance probability in response to uncertainty levels. Experiments on ImageNet-512² show that FREE achieves up to 1.86× acceleration, and FREE (relax) further reaches 2.25× speedup while maintaining high perceptual and quantitative fidelity in generation quality.
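
For readers unfamiliar with drafter-verifier schemes, the sketch below shows the standard speculative-sampling acceptance rule, min(1, p/q), followed by one hypothetical way to relax it as uncertainty grows. The abstract does not state FREE (relax)'s actual formula: the exponent form and the names `relaxed_accept_probability`, `uncertainty`, and `strength` are assumptions made purely for illustration.

```python
def accept_probability(target_density: float, draft_density: float) -> float:
    """Standard speculative-sampling rule: accept a draft with probability min(1, p/q)."""
    if draft_density <= 0.0:
        raise ValueError("draft density must be positive")
    return min(1.0, target_density / draft_density)


def relaxed_accept_probability(
    target_density: float,
    draft_density: float,
    uncertainty: float,
    strength: float = 1.0,
) -> float:
    """Hypothetical relaxation (not the paper's formula).

    Raises the standard probability to the power 1 / (1 + strength * uncertainty),
    so higher uncertainty pushes the acceptance probability toward 1.
    """
    base = accept_probability(target_density, draft_density)
    exponent = 1.0 / (1.0 + strength * uncertainty)
    return base ** exponent


strict = accept_probability(0.25, 1.0)
no_uncertainty = relaxed_accept_probability(0.25, 1.0, uncertainty=0.0)
high_uncertainty = relaxed_accept_probability(0.25, 1.0, uncertainty=1.0)
```

With zero uncertainty the relaxed rule reduces to the strict one. Any rule that accepts more often than min(1, p/q) generally gives up exact sampling from the target distribution, which is consistent with the abstract reserving the lossless guarantee for plain FREE.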

[226] A Training-Free Approach for Multi-ID Customization via Attention Adjustment and Spatial Control

Jiawei Lin, Guanlong Jiao, Jianjin Xu

Main category: cs.CV

TL;DR: MultiID is a training-free method for multi-ID customization that adapts single-ID models to generate images combining multiple individuals while preserving their identities and improving text controllability.

Details

Motivation: Multi-ID customization faces two major challenges: copy-paste issues from training on cropped person regions, and inferior text controllability where generated images don't align well with input text prompts.

Method: Proposes ID-decoupled cross-attention to inject distinct ID embeddings into corresponding regions, plus three strategies: local prompts, depth-guided spatial control, and extended self-attention to enhance controllability.

Result: Extensive experiments show MultiID effectively solves both challenges, with performance comparable or better than training-based multi-ID customization methods.

Conclusion: MultiID provides an effective training-free solution for multi-ID customization that addresses key challenges while maintaining high quality and text alignment.

Abstract: Multi-ID customization is an interesting topic in computer vision and has attracted considerable attention recently. Given the ID images of multiple individuals, its purpose is to generate a customized image that seamlessly integrates them while preserving their respective identities. Compared to single-ID customization, multi-ID customization is much more difficult and poses two major challenges. First, since the multi-ID customization model is trained to reconstruct an image from the cropped person regions, it often encounters the copy-paste issue during inference, leading to lower quality. Second, the model also suffers from inferior text controllability. The generated result simply combines multiple persons into one image, regardless of whether it is aligned with the input text. In this work, we propose MultiID to tackle this challenging task in a training-free manner. Since existing single-ID customization models suffer less from the copy-paste issue, our key idea is to adapt these models to achieve multi-ID customization. To this end, we present an ID-decoupled cross-attention mechanism, injecting distinct ID embeddings into the corresponding image regions and thus generating multi-ID outputs. To enhance the generation controllability, we introduce three critical strategies, namely the local prompt, depth-guided spatial control, and extended self-attention, making the results more consistent with the text prompts and ID images. We also carefully build a benchmark, called IDBench, for evaluation. The extensive qualitative and quantitative results demonstrate the effectiveness of MultiID in solving the aforementioned two challenges. Its performance is comparable to or even better than that of training-based multi-ID customization methods.

[227] Image-Free Timestep Distillation via Continuous-Time Consistency with Trajectory-Sampled Pairs

Bao Tang, Shuai Zhang, Yueting Zhu, Jijun Xiang, Xin Yang, Li Yu, Wenyu Liu, Xinggang Wang

Main category: cs.CV

TL;DR: TBCM is a data-free consistency distillation method that extracts latent representations directly from teacher model trajectories, eliminating external data dependency while improving efficiency and performance.

Details

Motivation: Current continuous-time consistency distillation methods rely heavily on training data and computational resources, limiting deployment in resource-constrained scenarios and scalability across domains.

Method: Proposes Trajectory-Backward Consistency Model (TBCM) that extracts latent representations directly from teacher model’s generation trajectory, eliminating VAE encoding and external dataset requirements.

Result: Achieves 6.52 FID and 28.08 CLIP scores on MJHQ-30k under one-step generation, reduces training time by ~40% compared to Sana-Sprint, and saves substantial GPU memory while maintaining quality.

Conclusion: TBCM demonstrates superior efficiency without quality sacrifice, reveals diffusion-generation space discrepancy in consistency distillation, and provides insights for future distillation research.

Abstract: Timestep distillation is an effective approach for improving the generation efficiency of diffusion models. The Consistency Model (CM), as a trajectory-based framework, demonstrates significant potential due to its strong theoretical foundation and high-quality few-step generation. Nevertheless, current continuous-time consistency distillation methods still rely heavily on training data and computational resources, hindering their deployment in resource-constrained scenarios and limiting their scalability to diverse domains. To address this issue, we propose Trajectory-Backward Consistency Model (TBCM), which eliminates the dependence on external training data by extracting latent representations directly from the teacher model’s generation trajectory. Unlike conventional methods that require VAE encoding and large-scale datasets, our self-contained distillation paradigm significantly improves both efficiency and simplicity. Moreover, the trajectory-extracted samples naturally bridge the distribution gap between training and inference, thereby enabling more effective knowledge transfer. Empirically, TBCM achieves 6.52 FID and 28.08 CLIP scores on MJHQ-30k under one-step generation, while reducing training time by approximately 40% compared to Sana-Sprint and saving a substantial amount of GPU memory, demonstrating superior efficiency without sacrificing quality. We further reveal the diffusion-generation space discrepancy in continuous-time consistency distillation and analyze how sampling strategies affect distillation performance, offering insights for future distillation research. GitHub Link: https://github.com/hustvl/TBCM.

[228] MajutsuCity: Language-driven Aesthetic-adaptive City Generation with Controllable 3D Assets and Layouts

Zilong Huang, Jun He, Xiaobin Huang, Ziyi Xiong, Yang Luo, Junyan Ye, Weijia Li, Yiping Chen, Ting Han

Main category: cs.CV

TL;DR: MajutsuCity is a natural language-driven framework for generating 3D urban scenes with structural consistency and stylistic diversity, featuring interactive editing capabilities and outperforming existing methods.

Details

Motivation: Existing methods struggle to balance text-based creative flexibility with object-level editability from structural representations in 3D city generation.

Method: A four-stage pipeline representing cities as controllable layouts, assets, and materials, plus MajutsuAgent for interactive language-grounded editing with five object-level operations.

Result: Reduces layout FID by 83.7% vs CityDreamer and 20.1% vs CityCraft; ranks first across all AQS and RDR scores, achieving state-of-the-art in geometric fidelity, stylistic adaptability, and semantic controllability.

Conclusion: MajutsuCity represents a new state-of-the-art for 3D city generation and is expected to inspire new research avenues in the field.

Abstract: Generating realistic 3D cities is fundamental to world models, virtual reality, and game development, where an ideal urban scene must offer both stylistic diversity and fine-grained controllability. However, existing methods struggle to balance the creative flexibility offered by text-based generation with the object-level editability enabled by explicit structural representations. We introduce MajutsuCity, a natural language-driven and aesthetically adaptive framework for synthesizing structurally consistent and stylistically diverse 3D urban scenes. MajutsuCity represents a city as a composition of controllable layouts, assets, and materials, and operates through a four-stage pipeline. To extend controllability beyond initial generation, we further integrate MajutsuAgent, an interactive language-grounded editing agent that supports five object-level operations. To support photorealistic and customizable scene synthesis, we also construct MajutsuDataset, a high-quality multimodal dataset containing 2D semantic layouts and height maps, diverse 3D building assets, and curated PBR materials and skyboxes, each accompanied by detailed annotations. Meanwhile, we develop a practical set of evaluation metrics, covering key dimensions such as structural consistency, scene complexity, material fidelity, and lighting atmosphere. Extensive experiments demonstrate MajutsuCity reduces layout FID by 83.7% compared with CityDreamer and by 20.1% over CityCraft. Our method ranks first across all AQS and RDR scores, outperforming existing methods by a clear margin. These results confirm MajutsuCity as a new state-of-the-art in geometric fidelity, stylistic adaptability, and semantic controllability for 3D city generation. We expect our framework can inspire new avenues of research in 3D city generation. Our dataset and code will be released at https://github.com/LongHZ140516/MajutsuCity.

[229] STARFlow-V: End-to-End Video Generative Modeling with Normalizing Flow

Jiatao Gu, Ying Shen, Tianrong Chen, Laurent Dinh, Yuyang Wang, Miguel Angel Bautista, David Berthelot, Josh Susskind, Shuangfei Zhai

Main category: cs.CV

TL;DR: STARFlow-V is a normalizing flow-based video generator that achieves high-quality autoregressive video generation with end-to-end learning, robust causal prediction, and native likelihood estimation, challenging the dominance of diffusion models in video generation.

Details

Motivation: Normalizing flows have shown progress in image generation but are rarely used in video generation due to spatiotemporal complexity and computational costs. Current state-of-the-art video generation systems almost exclusively rely on diffusion-based models, creating an opportunity to revisit normalizing flows for video generation.

Method: STARFlow-V operates in spatiotemporal latent space with a global-local architecture that restricts causal dependencies to global latent space while preserving local within-frame interactions. It uses flow-score matching with a light-weight causal denoiser for improved consistency, and employs a video-aware Jacobi iteration scheme for efficient sampling.

Result: Empirically, STARFlow-V achieves strong visual fidelity and temporal consistency with practical sampling throughput relative to diffusion-based baselines. It can natively support text-to-video, image-to-video, and video-to-video generation tasks thanks to its invertible structure.

Conclusion: This work presents the first evidence that normalizing flows are capable of high-quality autoregressive video generation, establishing them as a promising research direction for building world models and challenging the current dominance of diffusion models in video generation.

Abstract: Normalizing flows (NFs) are end-to-end likelihood-based generative models for continuous data, and have recently regained attention with encouraging progress on image generation. Yet in the video generation domain, where spatiotemporal complexity and computational cost are substantially higher, state-of-the-art systems almost exclusively rely on diffusion-based models. In this work, we revisit this design space by presenting STARFlow-V, a normalizing flow-based video generator with substantial benefits such as end-to-end learning, robust causal prediction, and native likelihood estimation. Building upon the recently proposed STARFlow, STARFlow-V operates in the spatiotemporal latent space with a global-local architecture which restricts causal dependencies to a global latent space while preserving rich local within-frame interactions. This eases error accumulation over time, a common pitfall of standard autoregressive diffusion model generation. Additionally, we propose flow-score matching, which equips the model with a light-weight causal denoiser to improve the video generation consistency in an autoregressive fashion. To improve the sampling efficiency, STARFlow-V employs a video-aware Jacobi iteration scheme that recasts inner updates as parallelizable iterations without breaking causality. Thanks to the invertible structure, the same model can natively support text-to-video, image-to-video as well as video-to-video generation tasks. Empirically, STARFlow-V achieves strong visual fidelity and temporal consistency with practical sampling throughput relative to diffusion-based baselines. These results present the first evidence, to our knowledge, that NFs are capable of high-quality autoregressive video generation, establishing them as a promising research direction for building world models. Code and generated samples are available at https://github.com/apple/ml-starflow.
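
The idea of recasting sequential updates as parallelizable Jacobi iterations can be illustrated on a toy causal recurrence. This is a generic sketch, not STARFlow-V's video-aware scheme: the recurrence `toy_step` and the helper names `sequential_decode` and `jacobi_decode` are hypothetical. Each Jacobi sweep reads only the previous iterate, so all positions could be updated at once, and a causal chain of length n reaches the sequential answer in at most n sweeps.

```python
from typing import Callable, List, Tuple

Step = Callable[[float, float], float]


def sequential_decode(noise: List[float], step: Step, start: float = 0.0) -> List[float]:
    """Reference autoregressive pass: each output depends on the previous output."""
    outputs: List[float] = []
    prev = start
    for z in noise:
        prev = step(prev, z)
        outputs.append(prev)
    return outputs


def jacobi_decode(
    noise: List[float], step: Step, start: float = 0.0, tol: float = 0.0
) -> Tuple[List[float], int]:
    """Fixed-point iteration over all positions; returns (outputs, sweeps used)."""
    n = len(noise)
    current = [0.0] * n  # arbitrary initial guess
    for sweep in range(1, n + 1):
        # Every position reads only `current`, so this list could be built in parallel.
        updated = [
            step(start if t == 0 else current[t - 1], noise[t]) for t in range(n)
        ]
        if max(abs(a - b) for a, b in zip(updated, current)) <= tol:
            return updated, sweep  # fixed point reached early
        current = updated
    return current, n


def toy_step(prev: float, z: float) -> float:
    """Hypothetical causal update standing in for one autoregressive step."""
    return 0.5 * prev + z


noise = [1.0, 2.0, 3.0, 4.0]
reference = sequential_decode(noise, toy_step)
parallel, sweeps = jacobi_decode(noise, toy_step)
```

In this toy case the worst-case n sweeps are needed, so there is no saving; a speedup only appears when the iteration converges in far fewer sweeps than n, for example from a good initial guess or with a nonzero tolerance.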

[230] BRIC: Bridging Kinematic Plans and Physical Control at Test Time

Dohun Lim, Minji Kim, Jaewoon Lim, Sungchan Kim

Main category: cs.CV

TL;DR: BRIC is a test-time adaptation framework that combines diffusion-based motion planning with RL-based physics controllers to enable long-term, physically plausible human motion generation by resolving execution drift and preserving pre-trained skills.

Details

Motivation: Diffusion models generate diverse motions but often produce physically implausible outputs, causing execution drift when simulated with physics controllers, which limits long-term motion generation capabilities.

Method: BRIC uses dynamic test-time adaptation of physics controllers to noisy motion plans while preventing catastrophic forgetting, plus lightweight test-time guidance in signal space for diffusion models without parameter updates.

Result: Achieves state-of-the-art performance on long-term tasks including motion composition, obstacle avoidance, and human-scene interaction across diverse environments.

Conclusion: BRIC effectively enables consistent and physically plausible long-term human motion execution by combining controller adaptation and diffusion guidance strategies in an efficient manner.

Abstract: We propose BRIC, a novel test-time adaptation (TTA) framework that enables long-term human motion generation by resolving execution discrepancies between diffusion-based kinematic motion planners and reinforcement learning-based physics controllers. While diffusion models can generate diverse and expressive motions conditioned on text and scene context, they often produce physically implausible outputs, leading to execution drift during simulation. To address this, BRIC dynamically adapts the physics controller to noisy motion plans at test time, while preserving pre-trained skills via a loss function that mitigates catastrophic forgetting. In addition, BRIC introduces a lightweight test-time guidance mechanism that steers the diffusion model in the signal space without updating its parameters. By combining both adaptation strategies, BRIC ensures consistent and physically plausible long-term executions across diverse environments in an effective and efficient manner. We validate the effectiveness of BRIC on a variety of long-term tasks, including motion composition, obstacle avoidance, and human-scene interaction, achieving state-of-the-art performance across all tasks.

[231] Dance Style Classification using Laban-Inspired and Frequency-Domain Motion Features

Ben Hamscher, Arnold Brosch, Nicolas Binninger, Maksymilian Jan Dejna, Kira Maag

Main category: cs.CV

TL;DR: A lightweight framework for dance style classification using pose estimates from videos, combining temporal-spatial descriptors inspired by Laban Movement Analysis with Fast Fourier Transform features for robust classification with low computational effort.

Details

Motivation: Dance is culturally important for conveying emotions and stories, but classifying dance genres from motion data is challenging due to similar poses and motion patterns across different styles.

Method: Extracts pose estimates from videos and uses temporal-spatial descriptors (velocity, acceleration, angular movement of upper body) inspired by Laban Movement Analysis, combined with Fast Fourier Transform features to capture rhythmic patterns in frequency domain.

Result: Achieves robust classification of different dance styles with low computational effort, demonstrating that interpretable motion representations can effectively capture stylistic nuances without complex model architectures.

Conclusion: The proposed lightweight framework successfully classifies dance styles using interpretable motion features, showing that structured spatial coordination and frequency domain analysis can effectively distinguish dance genres with minimal computational requirements.

Abstract: Dance is an essential component of human culture and serves as a tool for conveying emotions and telling stories. Identifying and distinguishing dance genres based on motion data is a complex problem in human activity recognition, as many styles share similar poses, gestures, and temporal motion patterns. This work presents a lightweight framework for classifying dance styles that determines motion characteristics based on pose estimates extracted from videos. We propose temporal-spatial descriptors inspired by Laban Movement Analysis. These features capture local joint dynamics such as velocity, acceleration, and angular movement of the upper body, enabling a structured representation of spatial coordination. To further encode rhythmic and periodic aspects of movement, we integrate Fast Fourier Transform features that characterize movement patterns in the frequency domain. The proposed approach achieves robust classification of different dance styles with low computational effort, as complex model architectures are not required, and shows that interpretable motion representations can effectively capture stylistic nuances.
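
The two feature families named in the abstract can be sketched on a single synthetic joint coordinate. Everything here is an illustrative assumption rather than the authors' pipeline: the sine-wave trajectory, the frame rate, and the helper names `finite_difference` and `dft_magnitudes`. A direct discrete Fourier transform is used to stay dependency-free; it yields the same magnitudes an FFT would.

```python
import cmath
import math
from typing import List


def finite_difference(signal: List[float], fps: float) -> List[float]:
    """Forward difference scaled by frame rate (units per second)."""
    return [(b - a) * fps for a, b in zip(signal, signal[1:])]


def dft_magnitudes(signal: List[float]) -> List[float]:
    """Normalized magnitudes of the non-negative frequency bins (naive O(n^2) DFT)."""
    n = len(signal)
    if n == 0:
        return []
    return [
        abs(sum(x * cmath.exp(-2j * math.pi * k * t / n) for t, x in enumerate(signal))) / n
        for k in range(n // 2 + 1)
    ]


fps = 8       # assumed frame rate of the pose sequence
frames = 16   # assumed window length
# Synthetic vertical coordinate of one joint: two full oscillations per window.
joint_y = [math.sin(2 * math.pi * 2 * t / frames) for t in range(frames)]

velocity = finite_difference(joint_y, fps)
acceleration = finite_difference(velocity, fps)
spectrum = dft_magnitudes(joint_y)

# Strongest non-constant frequency bin and its frequency in Hz.
dominant_bin = max(range(1, len(spectrum)), key=lambda k: spectrum[k])
dominant_hz = dominant_bin * fps / frames
```

Per-joint statistics of `velocity` and `acceleration`, together with a few low-frequency entries of `spectrum`, could then be concatenated into a fixed-length vector for a simple classifier.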

[232] Learning to Generate Human-Human-Object Interactions from Textual Descriptions

Jeonghyeon Na, Sangwon Baik, Inhee Lee, Junyoung Lee, Hanbyul Joo

Main category: cs.CV

TL;DR: This paper introduces Human-Human-Object Interactions (HHOIs) to model complex multi-person interactions with objects, presents a new dataset and synthesis method using generative models, and extends to multi-human settings.

Details

Motivation: To enable machines to understand complex, context-dependent human behaviors involving multiple people and objects, which current single-human interaction models cannot capture.

Method: Uses a unified generative framework with text-to-HOI and text-to-HHI models trained via score-based diffusion, synthesizing HHOIs from individual human-object and human-human interactions.

Result: The method generates realistic HHOIs conditioned on textual descriptions, outperforming previous single-human HOI approaches and enabling multi-human motion generation with objects.

Conclusion: The proposed framework successfully models complex multi-person interactions with objects, extending beyond single-human scenarios and demonstrating practical applications in multi-human motion generation.

Abstract: The way humans interact with each other, including interpersonal distances, spatial configuration, and motion, varies significantly across different situations. To enable machines to understand such complex, context-dependent behaviors, it is essential to model multiple people in relation to the surrounding scene context. In this paper, we present a novel research problem to model the correlations between two people engaged in a shared interaction involving an object. We refer to this formulation as Human-Human-Object Interactions (HHOIs). To overcome the lack of dedicated datasets for HHOIs, we present a newly captured HHOIs dataset and a method to synthesize HHOI data by leveraging image generative models. As an intermediary, we obtain individual human-object interactions (HOIs) and human-human interactions (HHIs) from the HHOIs, and with these data, we train a text-to-HOI and a text-to-HHI model using a score-based diffusion model. Finally, we present a unified generative framework that integrates the two individual models, capable of synthesizing complete HHOIs in a single advanced sampling process. Our method extends HHOI generation to multi-human settings, enabling interactions involving more than two individuals. Experimental results show that our method generates realistic HHOIs conditioned on textual descriptions, outperforming previous approaches that focus only on single-human HOIs. Furthermore, we introduce multi-human motion generation involving objects as an application of our framework.

[233] Modular Deep Learning Framework for Assistive Perception: Gaze, Affect, and Speaker Identification

Akshit Pramod Anchan, Jewelith Thomas, Sritama Roy

Main category: cs.CV

TL;DR: This paper proposes a modular architecture with three specialized models for eye state detection, facial expression recognition, and speaker identification, achieving high accuracies (93.0%, 97.8%, 96.89%) and demonstrating feasibility for lightweight assistive technologies.

Details

Motivation: To develop comprehensive assistive technologies by integrating visual and auditory perception, inspired by perceptive systems like 'Smart Eye', for resource-constrained devices.

Method: Proposed three independent sensing modules: CNN for eye state detection (drowsiness/attention), deep CNN for facial expression recognition, and LSTM for voice-based speaker identification, using Eyes Image, FER2013, and customized audio datasets.

Result: Models achieved high accuracies: 93.0% for eye state detection, 97.8% for facial expression recognition, and 96.89% for speaker identification.

Conclusion: Lightweight, domain-specific models can achieve high fidelity on discrete tasks, establishing a validated foundation for future real-time, multimodal integration in resource-constrained assistive devices.

Abstract: Developing comprehensive assistive technologies requires the seamless integration of visual and auditory perception. This research evaluates the feasibility of a modular architecture inspired by core functionalities of perceptive systems like ‘Smart Eye.’ We propose and benchmark three independent sensing modules: a Convolutional Neural Network (CNN) for eye state detection (drowsiness/attention), a deep CNN for facial expression recognition, and a Long Short-Term Memory (LSTM) network for voice-based speaker identification. Utilizing the Eyes Image, FER2013, and customized audio datasets, our models achieved accuracies of 93.0%, 97.8%, and 96.89%, respectively. This study demonstrates that lightweight, domain-specific models can achieve high fidelity on discrete tasks, establishing a validated foundation for future real-time, multimodal integration in resource-constrained assistive devices.

[234] Look Where It Matters: Training-Free Ultra-HR Remote Sensing VQA via Adaptive Zoom Search

Yunqi Zhou, Chengjie Jiang, Chun Yuan, Jing Li

Main category: cs.CV

TL;DR: ZoomSearch is a training-free pipeline for Ultra-HR Remote Sensing VQA that decouples ‘where to look’ from ‘how to answer’ using hierarchical search and layout-aware patch reassembly.

Details

Motivation: Current remote sensing foundation models struggle with Ultra-HR images due to token/memory budget exhaustion from full-image encoding and loss of fine-grained details from resize-based preprocessing.

Method: Combines Adaptive Multi-Branch Zoom Search for hierarchical query-relevant region localization with Layout-Aware Patch Reassembly to organize selected patches into a compact, layout-faithful canvas.

Result: Achieves state-of-the-art accuracy with LLaVA-ov, improving baseline by 26.3% on LRS-VQA and 114.8% on MME-RealWorld-RS, while being 20%-44% faster than prior search-based methods.

Conclusion: ZoomSearch effectively addresses Ultra-HR RS-VQA challenges by focusing models on relevant regions, achieving superior performance and efficiency without requiring training.

Abstract: With advances in satellite constellations, sensor technologies, and imaging pipelines, ultra-high-resolution (Ultra-HR) remote sensing imagery is becoming increasingly widespread. However, current remote sensing foundation models are ill-suited to such inputs: full-image encoding exhausts token and memory budgets, while resize-based preprocessing loses fine-grained and answer-critical details. In this context, guiding the model to look where it matters before prediction becomes crucial. Therefore, we present ZoomSearch, a training-free, plug-and-play pipeline that decouples ‘where to look’ from ‘how to answer’ for Ultra-HR Remote Sensing Visual Question Answering (RS-VQA). ZoomSearch combines Adaptive Multi-Branch Zoom Search, which performs a hierarchical search over image patches to localize query-relevant regions, with Layout-Aware Patch Reassembly, which reorganizes the selected patches into a compact, layout-faithful canvas. We conduct comprehensive experiments on Ultra-HR RS-VQA benchmarks MME-RealWorld-RS and LRS-VQA, comparing against (i) strong general foundation models, (ii) remote sensing foundation models, (iii) Ultra-HR RS-VQA methods, and (iv) plug-and-play search-based VQA methods. When integrated with LLaVA-ov, ZoomSearch attains state-of-the-art accuracy across diverse tasks, improving the LLaVA-ov baseline by 26.3% on LRS-VQA and 114.8% on MME-RealWorld-RS. Meanwhile, it achieves much higher inference efficiency, outperforming prior search-based methods by 20% to 44% in speed.
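
The "where to look" step can be pictured as a coarse-to-fine search over image patches. The sketch below is not the authors' Adaptive Multi-Branch Zoom Search; it is a minimal quadtree-style beam search written under stated assumptions. The `score` callable (query relevance of a region), the `branches` beam width, and the `min_size` stopping threshold are all hypothetical names introduced for illustration.

```python
from typing import Callable, List, Tuple

Box = Tuple[int, int, int, int]  # (left, top, right, bottom) in pixels


def split_into_quadrants(box: Box) -> List[Box]:
    """Split a box into four quadrants around its midpoint."""
    left, top, right, bottom = box
    mid_x = (left + right) // 2
    mid_y = (top + bottom) // 2
    return [
        (left, top, mid_x, mid_y),
        (mid_x, top, right, mid_y),
        (left, mid_y, mid_x, bottom),
        (mid_x, mid_y, right, bottom),
    ]


def zoom_search(root: Box, score: Callable[[Box], float],
                branches: int = 2, min_size: int = 256) -> List[Box]:
    """Keep the `branches` best-scoring patches at each zoom level.

    `score` is a hypothetical relevance function; in a real pipeline it
    would come from a vision-language model, which this sketch omits.
    """
    frontier = [root]
    while True:
        # All boxes on one level share the same size (up to rounding),
        # so the first box decides whether another split is allowed.
        left, top, right, bottom = frontier[0]
        if min(right - left, bottom - top) // 2 < min_size:
            return frontier
        children = [c for box in frontier for c in split_into_quadrants(box)]
        frontier = sorted(children, key=score, reverse=True)[:branches]
```

For example, with a 1024×1024 image and a toy `score` that returns 1.0 only for boxes containing the pixel (900, 100), `zoom_search((0, 0, 1024, 1024), score, branches=1)` narrows down to the single 256×256 patch `(768, 0, 1024, 256)`. The returned patches would then be handed to a separate reassembly and answering stage.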

[235] A Physics-Informed Loss Function for Boundary-Consistent and Robust Artery Segmentation in DSA Sequences

Muhammad Irfan, Nasir Rahim, Khalid Mahmood Malik

Main category: cs.CV

TL;DR: Proposes a Physics-Informed Loss (PIL) for cerebral artery segmentation from DSA sequences that models boundary interactions using dislocation theory, improving geometric consistency over conventional pixel-wise losses.

Details

Motivation: Conventional segmentation losses focus on pixel-wise overlap but ignore geometric and physical consistency of vascular boundaries, leading to fragmented or unstable vessel predictions in cerebrovascular disease management.

Method: Developed a physics-based loss function inspired by dislocation theory that treats predicted-ground-truth boundary interaction as an elastic process, enforcing smooth contour evolution and structural consistency. Integrated PIL into U-Net, U-Net++, SegFormer, and MedFormer architectures.

Result: PIL consistently outperformed Cross-Entropy, Dice, Active Contour, and Surface losses on DIAS and DSCA benchmarks, achieving superior sensitivity, F1 score, and boundary coherence in vascular segmentation.

Conclusion: Incorporating physics-based boundary interactions into deep neural networks improves both precision and robustness of vascular segmentation in dynamic angiographic imaging, addressing limitations of conventional pixel-wise loss functions.

Abstract: Accurate extraction and segmentation of the cerebral arteries from digital subtraction angiography (DSA) sequences is essential for developing reliable clinical management models of complex cerebrovascular diseases. Conventional loss functions often rely solely on pixel-wise overlap, overlooking the geometric and physical consistency of vascular boundaries, which can lead to fragmented or unstable vessel predictions. To overcome this limitation, we propose a novel Physics-Informed Loss (PIL) that models the interaction between the predicted and ground-truth boundaries as an elastic process inspired by dislocation theory in materials physics. This formulation introduces a physics-based regularization term that enforces smooth contour evolution and structural consistency, allowing the network to better capture fine vascular geometry. The proposed loss is integrated into several segmentation architectures, including U-Net, U-Net++, SegFormer, and MedFormer, and evaluated on two public benchmarks: DIAS and DSCA. Experimental results demonstrate that PIL consistently outperforms conventional loss functions such as Cross-Entropy, Dice, Active Contour, and Surface losses, achieving superior sensitivity, F1 score, and boundary coherence. These findings confirm that the incorporation of physics-based boundary interactions into deep neural networks improves both the precision and robustness of vascular segmentation in dynamic angiographic imaging. The implementation of the proposed method is publicly available at https://github.com/irfantahir301/Physicsis_loss.

[236] VidComposition: Can MLLMs Analyze Compositions in Compiled Videos?

Yolo Y. Tang, Junjia Guo, Hang Hua, Susan Liang, Mingqian Feng, Xinyang Li, Rui Mao, Chao Huang, Jing Bi, Zeliang Zhang, Pooyan Fazli, Chenliang Xu

Main category: cs.CV

TL;DR: VidComposition is a new benchmark for evaluating multimodal LLMs’ ability to understand video compositions, revealing significant performance gaps between humans and current models.

Details

Motivation: Existing MLLM benchmarks focus on abstract video comprehension but lack detailed assessment of video composition understanding - how visual elements combine and interact in compiled videos.

Method: Created VidComposition benchmark with 982 curated compiled videos and 1706 multiple-choice questions covering camera movement, angle, shot size, narrative structure, character actions, emotions, and other compositional aspects.

Result: Evaluation of 33 open-source and proprietary MLLMs shows significant performance gap between human and model capabilities in understanding complex compiled video compositions.

Conclusion: Current MLLMs have limitations in understanding complex video compositions, highlighting areas for improvement in multimodal video understanding capabilities.

Abstract: The advancement of Multimodal Large Language Models (MLLMs) has enabled significant progress in multimodal understanding, expanding their capacity to analyze video content. However, existing evaluation benchmarks for MLLMs primarily focus on abstract video comprehension, lacking a detailed assessment of their ability to understand video compositions, the nuanced interpretation of how visual elements combine and interact within highly compiled video contexts. We introduce VidComposition, a new benchmark specifically designed to evaluate the video composition understanding capabilities of MLLMs using carefully curated compiled videos and cinematic-level annotations. VidComposition includes 982 videos with 1706 multiple-choice questions, covering various compositional aspects such as camera movement, angle, shot size, narrative structure, character actions and emotions, etc. Our comprehensive evaluation of 33 open-source and proprietary MLLMs reveals a significant performance gap between human and model capabilities. This highlights the limitations of current MLLMs in understanding complex, compiled video compositions and offers insights into areas for further improvement. The leaderboard and evaluation code are available at https://yunlong10.github.io/VidComposition/

[237] AlignBench: Benchmarking Fine-Grained Image-Text Alignment with Synthetic Image-Caption Pairs

Kuniaki Saito, Risa Shinoda, Shohei Tanaka, Tosho Hirasawa, Fumio Okura, Yoshitaka Ushiku

Main category: cs.CV

TL;DR: AlignBench is a new benchmark for evaluating image-text alignment models that uses detailed image-caption pairs from diverse generative models, with sentence-level annotations to directly assess VLMs as alignment evaluators.

Details

Motivation: Existing benchmarks for image-text alignment rely on rule-based perturbations or short captions, limiting their ability to measure fine-grained alignment between visual and linguistic representations.

Method: Introduced AlignBench benchmark using detailed image-caption pairs generated by diverse image-to-text and text-to-image models, with each sentence annotated for correctness to enable direct assessment of VLMs as alignment evaluators.

Result: Benchmarking revealed three key findings: CLIP-based models, even those tailored for compositional reasoning, remain nearly blind; detectors systematically over-score early sentences; and detectors show strong self-preference, favoring their own outputs and harming detection performance.

Conclusion: AlignBench provides a more comprehensive evaluation framework for image-text alignment models, revealing limitations in current approaches and highlighting the need for improved fine-grained alignment assessment.

Abstract: Assessing image-text alignment models such as CLIP is crucial for bridging visual and linguistic representations. Yet existing benchmarks rely on rule-based perturbations or short captions, limiting their ability to measure fine-grained alignment. We introduce AlignBench, a benchmark that provides a new indicator of image-text alignment by evaluating detailed image-caption pairs generated by diverse image-to-text and text-to-image models. Each sentence is annotated for correctness, enabling direct assessment of VLMs as alignment evaluators. Benchmarking a wide range of decoder-based VLMs reveals three key findings: (i) CLIP-based models, even those tailored for compositional reasoning, remain nearly blind; (ii) detectors systematically over-score early sentences; and (iii) they show strong self-preference, favoring their own outputs and harming detection performance. Our project page will be available at https://dahlian00.github.io/AlignBench/.

[238] HBridge: H-Shape Bridging of Heterogeneous Experts for Unified Multimodal Understanding and Generation

Xiang Wang, Zhifei Zhang, He Zhang, Zhe Lin, Yuqian Zhou, Qing Liu, Shiwei Zhang, Yijun Li, Shaoteng Liu, Haitian Zheng, Jason Kuen, Yuehuan Wang, Changxin Gao, Nong Sang

Main category: cs.CV

TL;DR: HBridge proposes an asymmetric H-shaped architecture for unified multimodal models that selectively bridges intermediate layers between understanding and generative experts, reducing attention sharing by over 40% while improving efficiency and generation quality.

Details

Motivation: Current unified models use symmetric Mixture-of-Transformers designs that mirror experts across modalities, which is suboptimal due to inherent modality discrepancies and fails to optimally leverage pretrained priors from respective modality domains.

Method: Asymmetric H-shaped architecture that selectively bridges intermediate layers between experts, decoupling shallow/deep layers for modality-specific representations while bridging mid-layers for semantic alignment. Introduces semantic reconstruction tokens to guide generative expert in reconstructing visual semantic tokens.

Result: Extensive experiments across multiple benchmarks demonstrate effectiveness and superior performance, establishing a new paradigm for unified multimodal generation with improved efficiency and generation quality.

Conclusion: HBridge provides an effective asymmetric architecture that enables heterogeneous experts to optimally leverage their pretrained priors, achieving better multimodal generation through selective bridging and semantic reconstruction guidance.

Abstract: Recent unified models integrate understanding experts (e.g., LLMs) with generative experts (e.g., diffusion models), achieving strong multimodal performance. However, recent advanced methods such as BAGEL and LMFusion follow the Mixture-of-Transformers (MoT) paradigm, adopting a symmetric design that mirrors one expert to another for convenient initialization and fusion, which remains suboptimal due to inherent modality discrepancies. In this work, we propose HBridge, an asymmetric H-shaped architecture that enables heterogeneous experts to optimally leverage pretrained priors from their respective modality domains. Unlike prior dense fusion strategies that straightforwardly connect all layers between experts via shared attention, HBridge selectively bridges intermediate layers, reducing attention sharing by over 40%, which improves efficiency and enhances generation quality. Shallow and deep layers, which capture modality-specific representations, are decoupled, while mid-layer bridging promotes semantic alignment. To further strengthen cross-modal coherence, we introduce semantic reconstruction tokens that explicitly guide the generative expert to reconstruct visual semantic tokens of the target image. Extensive experiments across multiple benchmarks demonstrate the effectiveness and superior performance of HBridge, establishing a new paradigm for unified multimodal generation.

[239] Unleashing the Power of Vision-Language Models for Long-Tailed Multi-Label Visual Recognition

Wei Tang, Zuo-Zheng Wang, Kun Zhang, Tong Wei, Min-Ling Zhang

Main category: cs.CV

TL;DR: CAPNET is an end-to-end framework for long-tailed multi-label visual recognition that models label correlations from CLIP’s textual encoder using graph networks and learnable prompts, achieving state-of-the-art performance on imbalanced datasets.

Details

Motivation: Existing methods for long-tailed multi-label recognition suffer from unreliable label correlations for tail classes due to data scarcity, and CLIP's single-label optimization is suboptimal for multi-label tasks.

Method: Proposes CAPNET with graph convolutional network for label-aware propagation, learnable soft prompts, distribution-balanced Focal loss with class-aware re-weighting, test-time ensembling, and parameter-efficient fine-tuning.

Result: Extensive experiments on VOC-LT, COCO-LT, and NUS-WIDE benchmarks show substantial improvements over state-of-the-art methods.

Conclusion: CAPNET effectively addresses long-tailed multi-label visual recognition challenges and demonstrates strong performance for real-world applications.

Abstract: Long-tailed multi-label visual recognition poses a significant challenge, as images typically contain multiple labels with highly imbalanced class distributions, leading to biased models that favor head classes while underperforming on tail classes. Recent efforts have leveraged pre-trained vision-language models, such as CLIP, alongside long-tailed learning techniques to exploit rich visual-textual priors for improved performance. However, existing methods often derive semantic inter-class relationships directly from imbalanced datasets, resulting in unreliable correlations for tail classes due to data scarcity. Moreover, CLIP’s zero-shot paradigm is optimized for single-label image-text matching, making it suboptimal for multi-label tasks. To address these issues, we propose the correlation adaptation prompt network (CAPNET), a novel end-to-end framework that explicitly models label correlations from CLIP’s textual encoder. The framework incorporates a graph convolutional network for label-aware propagation and learnable soft prompts for refined embeddings. It utilizes a distribution-balanced Focal loss with class-aware re-weighting for optimized training under imbalance. Moreover, it improves generalization through test-time ensembling and realigns visual-textual modalities using parameter-efficient fine-tuning to avert overfitting on tail classes without compromising head class performance. Extensive experiments and ablation studies on benchmarks including VOC-LT, COCO-LT, and NUS-WIDE demonstrate that CAPNET achieves substantial improvements over state-of-the-art methods, validating its effectiveness for real-world long-tailed multi-label visual recognition.
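
The summary does not spell out CAPNET's distribution-balanced Focal loss, so the following is only a sketch of the standard focal loss it builds on, applied per label in a multi-label setting. The per-class `class_weights` argument is a hypothetical stand-in for class-aware re-weighting, not the paper's formulation. For a predicted probability p and binary target y, with p_t = p if y = 1 and 1 - p otherwise, the per-label loss is -w (1 - p_t)^gamma log(p_t).

```python
import math
from typing import Sequence


def multilabel_focal_loss(probs: Sequence[float], targets: Sequence[int],
                          class_weights: Sequence[float],
                          gamma: float = 2.0) -> float:
    """Mean focal loss over labels for one image.

    probs:         predicted probability per label, each in [0, 1]
    targets:       1 if the label is present, else 0
    class_weights: hypothetical per-class weights (e.g. larger for tail classes)
    gamma:         focusing parameter; gamma = 0 gives weighted cross-entropy
    """
    eps = 1e-12  # guards against log(0)
    total = 0.0
    for p, y, w in zip(probs, targets, class_weights):
        p_t = p if y == 1 else 1.0 - p
        # (1 - p_t) ** gamma shrinks the loss of well-classified labels.
        total += -w * (1.0 - p_t) ** gamma * math.log(max(p_t, eps))
    return total / len(probs)
```

With `gamma=0.0` and unit weights this reduces to binary cross-entropy, which is a convenient sanity check; raising `gamma` shifts the training signal toward labels the model currently gets wrong, which is the property that makes focal-style losses attractive under class imbalance.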

[240] Mistake Attribution: Fine-Grained Mistake Understanding in Egocentric Videos

Yayuan Li, Aadit Jain, Filippos Bellos, Jason J. Corso

Main category: cs.CV

TL;DR: MATT is a fine-grained mistake attribution task that identifies which part of the instruction is violated, when the mistake becomes irreversible, and where it occurs in egocentric videos.

Details

Motivation: Prior mistake understanding work lacks fine-grained output, so MATT provides concrete attribution of mistakes to either instruction text or attempt video with detailed semantic, temporal, and spatial analysis.

Method: Developed MisEngine data engine to automatically construct attribution-rich mistake samples from existing datasets, creating EPIC-KITCHENS-M and Ego4D-M datasets. Then presented MisFormer, a unified attention-based model for mistake attribution across semantic, temporal, and spatial dimensions.

Result: MisEngine yields datasets up to two orders of magnitude larger than prior mistake datasets. MisFormer outperforms strong video-language, temporal localization, hand-object interaction, and mistake-detection baselines on new datasets and prior benchmarks.

Conclusion: MATT enables fine-grained mistake attribution in egocentric video, with MisEngine providing scalable data generation and MisFormer demonstrating superior performance across multiple dimensions of mistake analysis.

Abstract: We introduce Mistake Attribution (MATT), a task for fine-grained understanding of human mistakes in egocentric video. Unlike prior mistake understanding work, which lacks fine-grained output, MATT concretely attributes mistakes to the input instruction text or the attempt video. MATT determines what part of the instruction is violated (semantic role), when the deviation becomes irreversible (the Point-of-No-Return, PNR), and where the mistake appears in the PNR frame. We develop MisEngine, a data engine that automatically constructs attribution-rich mistake samples from existing datasets and inherits their annotations. Applied to large egocentric corpora, MisEngine yields EPIC-KITCHENS-M and Ego4D-M, two datasets that are up to two orders of magnitude larger than prior mistake datasets. We then present MisFormer, a unified attention-based model for mistake attribution across semantic (what), temporal (when), and spatial (where) dimensions, trained using MisEngine supervision. Experiments on our new datasets and prior benchmarks show that MisFormer outperforms strong video-language, temporal localization, hand-object interaction, and mistake-detection baselines.

[241] Concept-Aware Batch Sampling Improves Language-Image Pretraining

Adhiraj Ghosh, Vishaal Udandarao, Thao Nguyen, Matteo Farina, Mehdi Cherti, Jenia Jitsev, Sewoong Oh, Elisa Ricci, Ludwig Schmidt, Matthias Bethge

Main category: cs.CV

TL;DR: DataConcept provides 128M image-text pairs with fine-grained concept annotations, enabling flexible online concept-based data curation through Concept-Aware Batch Sampling (CABS) for improved vision-language model training.

Details

Motivation: Existing data curation methods are offline and concept-agnostic, using model-based filters that introduce data biases. This work advocates for more flexible, task-adaptive online concept-based curation to overcome these limitations.

Method: Developed DataConcept dataset with concept annotations, and introduced Concept-Aware Batch Sampling (CABS) framework with two variants: CABS-DM for diversity maximization and CABS-FM for frequency maximization, which construct batches on-the-fly based on target distributions.

Result: Extensive evaluations across 28 benchmarks show CABS significantly benefits CLIP/SigLIP model classes and yields highly performant models.

Conclusion: CABS provides a strong open-source alternative to proprietary online data curation algorithms, enabling practitioners to define custom concept distributions optimized for specific downstream tasks.

Abstract: What data should a vision-language model be trained on? To answer this question, many data curation efforts center on the quality of a dataset. However, most of these existing methods are (i) offline, i.e. they produce a static dataset from a set of predetermined filtering criteria, and (ii) concept-agnostic, i.e. they use model-based filters which induce additional data biases. In this work, we go beyond such offline, concept-agnostic methods and advocate for more flexible, task-adaptive online concept-based curation. Our first contribution is DataConcept, a collection of 128M web-crawled image-text pairs annotated with fine-grained details about their concept composition. Building on DataConcept, we introduce Concept-Aware Batch Sampling (CABS), a simple yet effective batch sampling framework that flexibly constructs batches on-the-fly based on specific target distributions. We propose two variants: (i) Diversity Maximization (CABS-DM) to curate batches with a broad coverage of available concepts, and (ii) Frequency Maximization (CABS-FM) to curate batches with high object multiplicity. Through extensive evaluations across 28 benchmarks, we demonstrate that our CABS method significantly benefits CLIP/SigLIP model classes and yields highly performant models. Overall, CABS represents a strong open-source alternative to proprietary online data curation algorithms, enabling practitioners to define custom concept distributions that optimize for specific downstream tasks.
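
To make the idea of concept-aware batch construction concrete, here is a small sketch of a diversity-oriented sampler. It is not the authors' CABS-DM algorithm; it is a greedy maximum-coverage heuristic written under the assumption that every sample comes with a set of concept labels. The names `pool`, `batch_size`, and `diverse_batch` are hypothetical.

```python
from typing import Dict, List, Set


def diverse_batch(pool: Dict[str, Set[str]], batch_size: int) -> List[str]:
    """Greedily pick samples that add the most not-yet-covered concepts.

    pool maps a sample id to the set of concepts annotated for it.
    Ties are broken by sample id so the result is deterministic.
    """
    covered: Set[str] = set()
    remaining = dict(pool)  # shallow copy; the caller's pool is untouched
    batch: List[str] = []
    while remaining and len(batch) < batch_size:
        # max() returns the first maximal element, and sorted() fixes the
        # iteration order, so equal gains resolve to the smallest id.
        best = max(sorted(remaining), key=lambda s: len(remaining[s] - covered))
        batch.append(best)
        covered |= remaining.pop(best)
    return batch
```

Given `{"a": {"dog", "grass"}, "b": {"dog", "grass"}, "c": {"car", "road"}, "d": {"dog"}}` and `batch_size=2`, the sampler returns `["a", "c"]`: after `a` is chosen, `b` and `d` add nothing new, so `c` wins. A frequency-oriented variant would swap the scoring key while keeping the same on-the-fly loop.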

[242] PhysChoreo: Physics-Controllable Video Generation with Part-Aware Semantic Grounding

Haoze Zhang, Tianyu Huang, Zichen Wan, Xiaowei Jin, Hongzhi Zhang, Hui Li, Wangmeng Zuo

Main category: cs.CV

TL;DR: PhysChoreo is a two-stage framework that generates physically realistic and controllable videos from a single image by estimating physical properties and simulating dynamic behaviors.

Details

Motivation: Existing video generation models lack explicit physical controllability and plausibility, while physics-based rendering methods struggle with accurate modeling of complex physical properties and long-term temporal control.

Method: Two-stage approach: 1) Part-aware physical property reconstruction to estimate static initial physical properties from a single image, 2) Temporally instructed and physically editable simulation for video synthesis with dynamic behaviors.

Result: PhysChoreo generates videos with rich behaviors and physical realism, outperforming state-of-the-art methods on multiple evaluation metrics.

Conclusion: The proposed framework successfully addresses the limitations of existing methods by providing diverse controllability and enhanced physical realism in video generation from single images.

Abstract: While recent video generation models have achieved significant visual fidelity, they often suffer from the lack of explicit physical controllability and plausibility. To address this, some recent studies attempted to guide the video generation with physics-based rendering. However, these methods face inherent challenges in accurately modeling complex physical properties and effectively controlling the resulting physical behavior over extended temporal sequences. In this work, we introduce PhysChoreo, a novel framework that can generate videos with diverse controllability and physical realism from a single image. Our method consists of two stages: first, it estimates the static initial physical properties of all objects in the image through part-aware physical property reconstruction. Then, through temporally instructed and physically editable simulation, it synthesizes high-quality videos with rich dynamic behaviors and physical realism. Experimental results show that PhysChoreo can generate videos with rich behaviors and physical realism, outperforming state-of-the-art methods on multiple evaluation metrics.

[243] Generative AI for Cel-Animation: A Survey

Yolo Y. Tang, Junjia Guo, Pinxin Liu, Zhiyuan Wang, Hang Hua, Jia-Xing Zhong, Yunzhong Xiao, Chao Huang, Luchuan Song, Susan Liang, Yizhi Song, Liu He, Jing Bi, Mingqian Feng, Xinyang Li, Zeliang Zhang, Chenliang Xu

Main category: cs.CV

TL;DR: This survey explores how generative AI is revolutionizing traditional celluloid animation workflows by automating key production steps like inbetweening and colorization, making animation more accessible while addressing challenges like visual consistency.

Details

Motivation: Traditional Cel animation production requires substantial manual effort, technical expertise, and time investment, which has historically limited efficiency and scalability. The rise of generative AI offers innovative solutions to automate labor-intensive tasks.

Method: The paper surveys the integration of generative AI technologies (large language models, multimodal models, diffusion models) into animation workflows, examining tools like AniDoc, ToonCrafter, and AniSora that automate tasks such as inbetween frame generation, colorization, and storyboard creation.

Result: GenAI integration lowers technical barriers, broadens accessibility for creators, and enables artists to focus more on creative expression and artistic innovation rather than manual technical work.

Conclusion: While generative AI shows great potential for revolutionizing animation production, challenges remain in visual consistency, stylistic coherence, and ethical considerations. The paper also explores future directions for AI-assisted animation advancement.

Abstract: The traditional Celluloid (Cel) Animation production pipeline encompasses multiple essential steps, including storyboarding, layout design, keyframe animation, inbetweening, and colorization, which demand substantial manual effort, technical expertise, and significant time investment. These challenges have historically impeded the efficiency and scalability of Cel-Animation production. The rise of generative artificial intelligence (GenAI), encompassing large language models, multimodal models, and diffusion models, offers innovative solutions by automating tasks such as inbetween frame generation, colorization, and storyboard creation. This survey explores how GenAI integration is revolutionizing traditional animation workflows by lowering technical barriers, broadening accessibility for a wider range of creators through tools like AniDoc, ToonCrafter, and AniSora, and enabling artists to focus more on creative expression and artistic innovation. Despite its potential, challenges like visual consistency, stylistic coherence, and ethical considerations persist. Additionally, this paper explores future directions and advancements in AI-assisted animation. For further exploration and resources, please visit our GitHub repository: https://github.com/yunlong10/Awesome-AI4Animation

[244] A Reason-then-Describe Instruction Interpreter for Controllable Video Generation

Shengqiong Wu, Weicai Ye, Yuanxing Zhang, Jiahao Wang, Quande Liu, Xintao Wang, Pengfei Wan, Kun Gai, Hao Fei, Tat-Seng Chua

Main category: cs.CV

TL;DR: ReaDe is a universal interpreter that converts ambiguous user instructions into precise specifications for controllable video generation, improving instruction fidelity and video quality.

Details

Motivation: Current diffusion transformers have limited practical controllability due to mismatch between concise user inputs and detailed training prompts.

Method: Two-stage optimization: reasoning-augmented supervision with stepwise traces and dense captions, plus multi-dimensional reward assigner for feedback-driven refinement.

Result: Consistent gains in instruction fidelity, caption accuracy, and downstream video quality across single- and multi-condition scenarios.

Conclusion: ReaDe provides a practical solution for aligning controllable video generation with accurately interpreted user intent.

Abstract: Diffusion Transformers have significantly improved video fidelity and temporal coherence; however, practical controllability remains limited. Concise, ambiguous, and compositionally complex user inputs contrast with the detailed prompts used in training, yielding an intent-output mismatch. We propose ReaDe, a universal, model-agnostic interpreter that converts raw instructions into precise, actionable specifications for downstream video generators. ReaDe follows a reason-then-describe paradigm: it first analyzes the user request to identify core requirements and resolve ambiguities, then produces detailed guidance that enables faithful, controllable generation. We train ReaDe via a two-stage optimization: (i) reasoning-augmented supervision imparts analytic parsing with stepwise traces and dense captions, and (ii) a multi-dimensional reward assigner enables stable, feedback-driven refinement for natural-style captions. Experiments across single- and multi-condition scenarios show consistent gains in instruction fidelity, caption accuracy, and downstream video quality, with strong generalization to reasoning-intensive and unseen inputs. ReaDe offers a practical route to aligning controllable video generation with accurately interpreted user intent. Project Page: https://sqwu.top/ReaDe/.

[245] DINO-Tok: Adapting DINO for Visual Tokenizers

Mingkai Jia, Mingxiao Li, Liaoyuan Fan, Tianxing Shi, Jiaxin Guo, Zeming Li, Xiaoyang Guo, Xiao-Xiao Long, Qian Zhang, Ping Tan, Wei Yin

Main category: cs.CV

TL;DR: DINO-Tok is a visual tokenizer that integrates DINO’s hierarchical features into a unified latent space, achieving state-of-the-art reconstruction performance on ImageNet 256×256.

Details

Motivation: Existing visual tokenizers struggle to balance semantic representation and reconstruction fidelity, especially in high-dimensional latent spaces, limiting their effectiveness in latent generative models.

Method: Integrates shallow and deep DINO features into an information-complete latent space, and proposes global PCA reweighting to stabilize vector quantization and prevent codebook collapse.

Result: Achieves 28.54 PSNR for autoencoding and 23.98 PSNR for VQ-based modeling on ImageNet 256×256, outperforming prior tokenizers and performing comparably to models trained on billion-scale data.

Conclusion: Adapting pretrained vision models like DINO for tokenization enables semantically aligned and high-fidelity latent representations, advancing visual generative models.

Abstract: Recent advances in visual generation have highlighted the rise of Latent Generative Models (LGMs), which rely on effective visual tokenizers to bridge pixels and semantics. However, existing tokenizers are typically trained from scratch and struggle to balance semantic representation and reconstruction fidelity, particularly in high-dimensional latent spaces. In this work, we introduce DINO-Tok, a DINO-based visual tokenizer that unifies hierarchical representations into an information-complete latent space. By integrating shallow features that retain fine-grained details with deep features encoding global semantics, DINO-Tok effectively bridges pretrained representations and visual generation. We further analyze the challenges of vector quantization (VQ) in this high-dimensional space, where key information is often lost and codebook collapse occurs. We thus propose a global PCA reweighting mechanism to stabilize VQ and preserve essential information across dimensions. On ImageNet 256×256, DINO-Tok achieves state-of-the-art reconstruction performance, reaching 28.54 PSNR for autoencoding and 23.98 PSNR for VQ-based modeling, significantly outperforming prior tokenizers and performing comparably to models trained on billion-scale data (such as Hunyuan and Wan). These results demonstrate that adapting powerful pretrained vision models like DINO for tokenization yields semantically aligned and high-fidelity latent representations, enabling next-generation visual generative models. Code will be publicly available at https://github.com/MKJia/DINO-Tok.
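
For readers unfamiliar with the metric behind the 28.54 and 23.98 figures, the sketch below computes PSNR from its standard definition, 10·log10(MAX² / MSE). The peak value of 255 (8-bit images) and the function name `psnr` are assumptions made for illustration; the digest does not state the paper's evaluation protocol.

```python
import numpy as np

def psnr(reference, reconstruction, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two images of equal shape.

    max_value is the largest possible pixel value (255 for 8-bit images,
    an assumption here; use 1.0 for images scaled to [0, 1]).
    """
    ref = np.asarray(reference, dtype=np.float64)
    rec = np.asarray(reconstruction, dtype=np.float64)
    mse = np.mean((ref - rec) ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return float(10.0 * np.log10(max_value ** 2 / mse))
```

Because the scale is logarithmic, a gap of a few dB (such as the one between the autoencoding and VQ-based results) corresponds to a large ratio in mean squared error.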

[246] Segmentation-Aware Generative Reinforcement Network (GRN) for Tissue Layer Segmentation in 3-D Ultrasound Images for Chronic Low-back Pain (cLBP) Assessment

Zixue Zeng, Xiaoyan Zhao, Matthew Cartier, Tong Yu, Jing Wang, Xin Meng, Zhiyu Sheng, Maryam Satarpour, John M Cormack, Allison Bean, Ryan Nussbaum, Maya Maurer, Emily Landis-Walkenhorst, Dinesh Kumbhare, Kang Kim, Ajay Wasan, Jiantao Pu

Main category: cs.CV

TL;DR: GRN is a segmentation-aware joint training framework that integrates segmentation loss feedback to optimize both image generation and segmentation. It reduces labeling efforts by 60-70% while maintaining or improving segmentation performance.

Details

Motivation: To address the high labeling burden in medical image segmentation, particularly for ultrasound images, by developing a framework that can achieve good segmentation performance with significantly less labeled data.

Method: Developed GRN framework with segmentation-guided enhancement (SGE) where generator produces images tailored for segmentation. Created two variants: GRN-SEL for sample-efficient learning and GRN-SSL for semi-supervised learning.

Result: GRN-SEL with SGE reduces labeling efforts by 70% while improving DSC by 1.98% compared to fully supervised models. All variants reduce labeling requirements by 60-70% while maintaining comparable performance to fully supervised approaches.

Conclusion: GRN framework effectively optimizes segmentation performance with significantly less labeled data, offering a scalable and efficient solution for ultrasound image analysis that reduces annotation burdens.

Abstract: We introduce a novel segmentation-aware joint training framework called generative reinforcement network (GRN) that integrates segmentation loss feedback to optimize both image generation and segmentation performance in a single stage. An image enhancement technique called segmentation-guided enhancement (SGE) is also developed, where the generator produces images tailored specifically for the segmentation model. Two variants of GRN were also developed, including GRN for sample-efficient learning (GRN-SEL) and GRN for semi-supervised learning (GRN-SSL). GRN’s performance was evaluated using a dataset of 69 fully annotated 3D ultrasound scans from 29 subjects. The annotations included six anatomical structures: dermis, superficial fat, superficial fascial membrane (SFM), deep fat, deep fascial membrane (DFM), and muscle. Our results show that GRN-SEL with SGE reduces labeling efforts by up to 70% while achieving a 1.98% improvement in the Dice Similarity Coefficient (DSC) compared to models trained on fully labeled datasets. GRN-SEL alone reduces labeling efforts by 60%, GRN-SSL with SGE decreases labeling requirements by 70%, and GRN-SSL alone by 60%, all while maintaining performance comparable to fully supervised models. These findings suggest the effectiveness of the GRN framework in optimizing segmentation performance with significantly less labeled data, offering a scalable and efficient solution for ultrasound image analysis and reducing the burdens associated with data annotation.
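
The Dice Similarity Coefficient reported above has a standard definition, 2|A ∩ B| / (|A| + |B|), which the following sketch implements for binary masks and averages over labels. The function names and the integer label encoding are assumptions for illustration; the abstract does not say how DSC was aggregated across the six tissue structures.

```python
import numpy as np

def dice_coefficient(pred_mask, true_mask):
    """Dice score between two binary masks of the same shape."""
    pred = np.asarray(pred_mask, dtype=bool)
    true = np.asarray(true_mask, dtype=bool)
    denom = pred.sum() + true.sum()
    if denom == 0:
        return 1.0  # both masks empty: treat as perfect agreement
    return float(2.0 * np.logical_and(pred, true).sum() / denom)

def mean_dice(pred_labels, true_labels, labels):
    """Average Dice over a list of integer class labels (hypothetical encoding)."""
    pred_labels = np.asarray(pred_labels)
    true_labels = np.asarray(true_labels)
    scores = [dice_coefficient(pred_labels == k, true_labels == k) for k in labels]
    return float(np.mean(scores))
```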

[247] VQ-VA World: Towards High-Quality Visual Question-Visual Answering

Chenhui Gou, Zilong Chen, Zeyu Wang, Feng Li, Deyao Zhu, Zicheng Duan, Kunchang Li, Chaorui Deng, Hongyi Yuan, Haoqi Fan, Cihang Xie, Jianfei Cai, Hamid Rezatofighi

Main category: cs.CV

TL;DR: This paper introduces VQ-VA World, a framework for Visual Question-Visual Answering that generates images in response to visual questions, achieving strong performance on the IntelligentBench benchmark.

Details

Motivation: To bring Visual Question-Visual Answering (VQ-VA) capability to open-source models, as this ability has recently emerged in proprietary systems like NanoBanana and GPT-Image.

Method: A data-centric framework with an agentic pipeline for large-scale data construction, crawling ~1.8M high-quality interleaved image-text samples for training, and releasing IntelligentBench for systematic evaluation.

Result: Training with VQ-VA World data helps LightFusion achieve 53.06 on IntelligentBench, substantially surpassing prior open-source baselines (7.78 from vanilla LightFusion; 1.94 from UniWorld-V1) and narrowing the gap to proprietary systems (81.67 from NanoBanana; 82.64 from GPT-Image).

Conclusion: By releasing model weights, datasets, and pipelines, the authors hope to stimulate future research on VQ-VA.

Abstract: This paper studies Visual Question-Visual Answering (VQ-VA): generating an image, rather than text, in response to a visual question – an ability that has recently emerged in proprietary systems such as NanoBanana and GPT-Image. To also bring this capability to open-source models, we introduce VQ-VA World, a data-centric framework built around an agentic pipeline for large-scale, targeted data construction. Leveraging web-scale deployment, this pipeline crawls ~1.8M high-quality, interleaved image-text samples for model training. For evaluation, we further release IntelligentBench, a human-curated benchmark that systematically assesses VQ-VA along the aspects of world knowledge, design knowledge, and reasoning. Training with VQ-VA World data yields strong empirical gains: it helps LightFusion attain 53.06 on IntelligentBench, substantially surpassing the best prior open-source baselines (i.e., 7.78 from vanilla LightFusion; 1.94 from UniWorld-V1), and significantly narrowing the gap toward leading proprietary systems (e.g., 81.67 from NanoBanana; 82.64 from GPT-Image). By releasing the full suite of model weights, datasets, and pipelines, we hope to stimulate future research on VQ-VA.

[248] The Consistency Critic: Correcting Inconsistencies in Generated Images via Reference-Guided Attentive Alignment

Ziheng Ouyang, Yiren Song, Yaoli Liu, Shihao Zhu, Qibin Hou, Ming-Ming Cheng, Mike Zheng Shou

Main category: cs.CV

TL;DR: ImageCritic uses reference-guided post-editing with attention alignment and detail encoding to fix inconsistency issues in image generation, outperforming existing methods.

Details

Motivation: To solve the inconsistency problem of generated images where fine-grained details don't match the reference image, addressing limitations in current customized generation tasks.

Method: Constructs reference-degraded-target triplets dataset using VLM-based selection and explicit degradation, then uses attention alignment loss and detail encoder for precise inconsistency rectification. Can be integrated into agent framework for automatic detection and multi-round local editing.

Result: Extensive experiments show ImageCritic effectively resolves detail-related issues in various customized generation scenarios, providing significant improvements over existing methods.

Conclusion: ImageCritic successfully addresses inconsistency problems in image generation through reference-guided post-editing with attention mechanisms and detail encoding.

Abstract: Previous works have explored various customized generation tasks given a reference image, but they still face limitations in generating consistent fine-grained details. In this paper, our aim is to solve the inconsistency problem of generated images by applying a reference-guided post-editing approach and present our ImageCritic. We first construct a dataset of reference-degraded-target triplets obtained via VLM-based selection and explicit degradation, which effectively simulates the common inaccuracies or inconsistencies observed in existing generation models. Furthermore, building on a thorough examination of the model’s attention mechanisms and intrinsic representations, we accordingly devise an attention alignment loss and a detail encoder to precisely rectify inconsistencies. ImageCritic can be integrated into an agent framework to automatically detect inconsistencies and correct them with multi-round and local editing in complex scenarios. Extensive experiments demonstrate that ImageCritic can effectively resolve detail-related issues in various customized generation scenarios, providing significant improvements over existing methods.

[249] Panoramic Distortion-Aware Tokenization for Person Detection and Localization in Overhead Fisheye Images

Nobuhiko Wakai, Satoshi Sato, Yasunori Ishii, Takayoshi Yamashita

Main category: cs.CV

TL;DR: Proposes a transformer-based method for detecting small persons in fisheye images by remapping to equirectangular panoramas and using panoramic distortion-aware tokenization to handle the small-person problem.

Details

Motivation: Person detection in overhead fisheye images faces challenges with person rotation and small persons. Prior work mainly addressed rotation, leaving the small-person problem underexplored, as conventional methods tend to favor larger persons who dominate attention maps.

Method: Remaps fisheye images to equirectangular panoramas to handle rotation, exploits panoramic geometry to address small persons, and introduces panoramic distortion-aware tokenization that divides features using self-similar figures to preserve significance areas of smaller persons.

Result: Extensive experiments demonstrated that the proposed method outperforms conventional methods on large-scale datasets, effectively enhancing detection of small persons in overhead fisheye imagery.

Conclusion: The combination of panoramic-image remapping and distortion-aware tokenization provides an effective solution for detecting small persons in fisheye images, addressing a previously underexplored challenge in person detection.

Abstract: Person detection in overhead fisheye images is challenging due to person rotation and small persons. Prior work has mainly addressed person rotation, leaving the small-person problem underexplored. We remap fisheye images to equirectangular panoramas to handle rotation and exploit panoramic geometry to handle small persons more effectively. Conventional detection methods tend to favor larger persons because they dominate the attention maps, causing smaller persons to be missed. In hemispherical equirectangular panoramas, we find that apparent person height decreases approximately linearly with the vertical angle near the top of the image. Using this finding, we introduce panoramic distortion-aware tokenization to enhance the detection of small persons. This tokenization procedure divides panoramic features using self-similar figures that enable the determination of optimal divisions without gaps, and we leverage the maximum significance values in each tile of the token groups to preserve the significance areas of smaller persons. We propose a transformer-based person detection and localization method that combines panoramic-image remapping and the tokenization procedure. Extensive experiments demonstrated that our method outperforms conventional methods on large-scale datasets.

[250] Wanderland: Geometrically Grounded Simulation for Open-World Embodied AI

Xinhao Liu, Jiaqi Li, Youming Deng, Ruxin Chen, Yingjia Zhang, Yifei Ma, Li Guo, Yiming Li, Jing Zhang, Chen Feng

Main category: cs.CV

TL;DR: Wanderland is a real-to-sim framework that addresses the sim-to-real gap in Embodied AI by providing high-fidelity simulation with photorealistic rendering and accurate geometry for reproducible closed-loop evaluation in urban environments.

Details

Motivation: Reproducible closed-loop evaluation is a major bottleneck in Embodied AI tasks such as visual navigation, and existing video-3DGS methods have large visual and geometric sim-to-real gaps that make them unsuitable for benchmarking.

Method: Developed Wanderland framework featuring multi-sensor capture, reliable reconstruction, accurate geometry, and robust view synthesis. Curated a diverse dataset of indoor-outdoor urban scenes and systematically analyzed image-only pipelines, geometry quality impact, and navigation policy learning.

Result: Demonstrated that image-only pipelines scale poorly, geometry quality significantly impacts novel view synthesis, and these factors adversely affect navigation policy learning and evaluation reliability. Provides a trusted testbed for embodied navigation and benchmarks for 3D reconstruction and novel view synthesis.

Conclusion: Wanderland establishes a new foundation for reproducible research in open-world embodied AI by addressing the sim-to-real gap through high-fidelity simulation with accurate geometry and photorealistic rendering.

Abstract: Reproducible closed-loop evaluation remains a major bottleneck in Embodied AI such as visual navigation. A promising path forward is high-fidelity simulation that combines photorealistic sensor rendering with geometrically grounded interaction in complex, open-world urban environments. Although recent video-3DGS methods ease open-world scene capturing, they are still unsuitable for benchmarking due to large visual and geometric sim-to-real gaps. To address these challenges, we introduce Wanderland, a real-to-sim framework that features multi-sensor capture, reliable reconstruction, accurate geometry, and robust view synthesis. Using this pipeline, we curate a diverse dataset of indoor-outdoor urban scenes and systematically demonstrate how image-only pipelines scale poorly, how geometry quality impacts novel view synthesis, and how all of these adversely affect navigation policy learning and evaluation reliability. Beyond serving as a trusted testbed for embodied navigation, Wanderland’s rich raw sensor data further allows benchmarking of 3D reconstruction and novel view synthesis models. Our work establishes a new foundation for reproducible research in open-world embodied AI. Project website is at https://ai4ce.github.io/wanderland/.

[251] ShapeGen: Towards High-Quality 3D Shape Synthesis

Yangguang Li, Xianglong He, Zi-Xin Zou, Zexiang Liu, Wanli Ouyang, Ding Liang, Yan-Pei Cao

Main category: cs.CV

TL;DR: ShapeGen achieves state-of-the-art image-to-3D generation through improved 3D representation, supervision, resolution scaling, and linear transformers, producing high-quality assets suitable for 3D pipelines.

Details

Motivation: Current 3D shape generation methods lack intricate details and produce overly smoothed surfaces and fragmented thin-shell structures, falling short of artist standards.

Method: ShapeGen improves 3D representation and supervision, scales up resolution, and leverages linear transformers for enhanced generation quality.

Result: ShapeGen achieves a significant performance leap in image-to-3D generation, establishing a new state of the art with high-fidelity 3D assets.

Conclusion: The synergistic improvements in ShapeGen enable high-quality image-to-3D generation that seamlessly integrates into 3D pipelines for widespread adoption.

Abstract: Inspired by generative paradigms in image and video, 3D shape generation has made notable progress, enabling the rapid synthesis of high-fidelity 3D assets from a single image. However, current methods still face challenges, including the lack of intricate details, overly smoothed surfaces, and fragmented thin-shell structures. These limitations leave the generated 3D assets still one step short of meeting the standards favored by artists. In this paper, we present ShapeGen, which achieves high-quality image-to-3D shape generation through 3D representation and supervision improvements, resolution scaling up, and the advantages of linear transformers. These advancements allow the generated assets to be seamlessly integrated into 3D pipelines, facilitating their widespread adoption across various applications. Through extensive experiments, we validate the impact of these improvements on overall performance. Ultimately, thanks to the synergistic effects of these enhancements, ShapeGen achieves a significant leap in image-to-3D generation, establishing a new state-of-the-art performance.

[252] iMontage: Unified, Versatile, Highly Dynamic Many-to-many Image Generation

Zhoujie Fu, Xianfang Zeng, Jinghong Lan, Xinyao Liao, Cheng Chen, Junyi Chen, Jiacheng Wei, Wei Cheng, Shiyu Liu, Yunuo Chen, Gang Yu, Guosheng Lin

Main category: cs.CV

TL;DR: iMontage repurposes pre-trained video models into versatile image generators that produce variable-length image sets with natural transitions and expansive dynamic range by injecting image data diversity into coherent temporal frameworks.

Details

Motivation: To combine the temporal coherence of video models with the rich content diversity of image data, enabling generation of image sets with both natural transitions and expansive dynamic range beyond conventional scopes.

Method: Proposes an elegant, minimally invasive adaptation strategy with tailored data curation and training paradigm that allows video models to acquire image manipulation capabilities without corrupting original motion priors.

Result: iMontage excels across mainstream many-in-many-out tasks, maintaining strong cross-image contextual consistency while generating scenes with extraordinary dynamics that surpass conventional scopes.

Conclusion: The framework successfully unifies various image generation and editing tasks by repurposing video models to consume and produce variable-length image sets with both coherence and diversity.

Abstract: Pre-trained video models learn powerful priors for generating high-quality, temporally coherent content. While these models excel at temporal coherence, their dynamics are often constrained by the continuous nature of their training data. We hypothesize that by injecting the rich and unconstrained content diversity from image data into this coherent temporal framework, we can generate image sets that feature both natural transitions and a far more expansive dynamic range. To this end, we introduce iMontage, a unified framework designed to repurpose a powerful video model into an all-in-one image generator. The framework consumes and produces variable-length image sets, unifying a wide array of image generation and editing tasks. To achieve this, we propose an elegant and minimally invasive adaptation strategy, complemented by a tailored data curation process and training paradigm. This approach allows the model to acquire broad image manipulation capabilities without corrupting its invaluable original motion priors. iMontage excels across several mainstream many-in-many-out tasks, not only maintaining strong cross-image contextual consistency but also generating scenes with extraordinary dynamics that surpass conventional scopes. Find our homepage at: https://kr1sjfu.github.io/iMontage-web/.

[253] Vision-Language Memory for Spatial Reasoning

Zuntao Liu, Yi Du, Taimeng Fu, Shaoshu Su, Cherie Ho, Chen Wang

Main category: cs.CV

TL;DR: VLM² is a vision-language model with persistent memory that addresses spatial reasoning challenges in videos through dual-memory architecture and view-consistent 3D representation.

Details

Motivation: Current VLMs underperform in video-based spatial reasoning due to semantic-geometric misalignment and lack of persistent memory for 3D understanding over time.

Method: Proposes VLM² with dual-memory module: working memory for immediate context and episodic memory for long-term information, enabling view-consistent 3D-aware representation from 2D video.

Result: Achieves state-of-the-art performance on multiple benchmarks among video-only models with fixed computational cost.

Conclusion: VLM² significantly advances visual-spatial intelligence by enabling efficient long-horizon spatial reasoning through persistent memory architecture.

Abstract: Spatial reasoning is a critical capability for intelligent robots, yet current vision-language models (VLMs) still fall short of human-level performance in video-based spatial reasoning. This gap mainly stems from two challenges: a semantic-geometric misalignment that prevents consistent 3D understanding, and the absence of persistent memory to retain 3D representation and understanding over time. To address these limitations, we present VLM$^2$, a Vision-Language Model with persistent Memory for spatial reasoning with a view-consistent, 3D-aware representation purely from 2D video. Specifically, to enhance long-horizon reasoning, we incorporate a dual-memory module, consisting of a working memory that operates as a sliding window to focus on immediate context, and an episodic memory that consolidates and stores critical long-term information. This design enables efficient and long-horizon spatial reasoning with a fixed computational cost. Extensive experiments on multiple benchmarks show that VLM$^2$ achieves state-of-the-art performance among video-only models, significantly advancing the frontier of visual-spatial intelligence.
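
The dual-memory idea described above (a sliding-window working memory plus a bounded episodic store) can be illustrated with a small data-structure sketch. Everything specific here is an assumption: the class name `DualMemory`, the scalar `importance` score, and the rule of keeping the highest-scoring evicted items are hypothetical stand-ins, since the abstract does not describe how VLM² consolidates information.

```python
from collections import deque

class DualMemory:
    """Toy dual memory: a sliding window plus a capped long-term store."""

    def __init__(self, window_size, episodic_capacity):
        if window_size < 1:
            raise ValueError("window_size must be at least 1")
        self.working = deque(maxlen=window_size)  # recent (importance, step, item)
        self.episodic = []                        # consolidated long-term entries
        self.episodic_capacity = episodic_capacity
        self._step = 0

    def observe(self, item, importance):
        # When the window is full, the oldest entry is about to be evicted,
        # so hand it to the episodic store first.
        if len(self.working) == self.working.maxlen:
            self._consolidate(self.working[0])
        self.working.append((importance, self._step, item))
        self._step += 1

    def _consolidate(self, entry):
        # Hypothetical rule: keep only the highest-importance entries.
        self.episodic.append(entry)
        self.episodic.sort(key=lambda e: e[0], reverse=True)
        del self.episodic[self.episodic_capacity:]

    def context(self):
        # Long-term entries in temporal order, followed by the recent window.
        long_term = sorted(self.episodic, key=lambda e: e[1])
        return [e[2] for e in long_term] + [e[2] for e in self.working]
```

The context never holds more than `window_size + episodic_capacity` items regardless of video length, which is the sense in which such a design keeps the computational cost fixed.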

[254] RoPECraft: Training-Free Motion Transfer with Trajectory-Guided RoPE Optimization on Diffusion Transformers

Ahmet Berke Gokmen, Yigit Ekin, Bahri Batuhan Bilecen, Aysegul Dundar

Main category: cs.CV

TL;DR: RoPECraft is a training-free video motion transfer method that modifies rotary positional embeddings in diffusion transformers using optical flow to encode motion, with optimization during denoising and phase-based regularization.

Details

Motivation: To develop an efficient video motion transfer method that doesn't require training and can effectively transfer motion from reference videos to generated content while maintaining text prompt faithfulness.

Method: Extracts dense optical flow from reference video, warps RoPE complex-exponential tensors with motion offsets, optimizes embeddings during denoising via flow-matching trajectory alignment, and uses Fourier phase regularization to prevent artifacts.

Result: Outperforms all recently published methods both qualitatively and quantitatively on benchmarks.

Conclusion: RoPECraft provides an effective training-free approach for video motion transfer by strategically modifying positional embeddings and incorporating motion-aware optimization and regularization.

Abstract: We propose RoPECraft, a training-free video motion transfer method for diffusion transformers that operates solely by modifying their rotary positional embeddings (RoPE). We first extract dense optical flow from a reference video, and utilize the resulting motion offsets to warp the complex-exponential tensors of RoPE, effectively encoding motion into the generation process. These embeddings are then further optimized during denoising time steps via trajectory alignment between the predicted and target velocities using a flow-matching objective. To keep the output faithful to the text prompt and prevent duplicate generations, we incorporate a regularization term based on the phase components of the reference video’s Fourier transform, projecting the phase angles onto a smooth manifold to suppress high-frequency artifacts. Experiments on benchmarks reveal that RoPECraft outperforms all recently published methods, both qualitatively and quantitatively.
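
To make the "complex-exponential tensors of RoPE" concrete, the sketch below builds standard rotary factors exp(i·p·θ_k) for a 1D sequence and shows one way motion offsets could enter, by shifting the positions before the factors are computed. The additive shift, the 1D setting, and the function names are assumptions for illustration; the abstract does not specify how RoPECraft maps 2D optical flow onto the spatiotemporal RoPE axes, and the later optimization and phase regularization are not shown.

```python
import numpy as np

def rope_factors(positions, dim, base=10000.0):
    """Complex rotary factors of shape (len(positions), dim // 2); dim must be even."""
    positions = np.asarray(positions, dtype=np.float64)
    freqs = base ** (-np.arange(0, dim, 2) / dim)  # one frequency per channel pair
    return np.exp(1j * positions[:, None] * freqs[None, :])

def apply_rope(x, factors):
    """Rotate consecutive channel pairs of x (shape: tokens x dim) by the factors."""
    x = np.asarray(x, dtype=np.float64)
    rotated = (x[:, 0::2] + 1j * x[:, 1::2]) * factors
    out = np.empty_like(x)
    out[:, 0::2] = rotated.real
    out[:, 1::2] = rotated.imag
    return out

def warped_rope_factors(positions, offsets, dim, base=10000.0):
    """Hypothetical warp: displace each token's position by a motion offset."""
    shifted = np.asarray(positions, dtype=np.float64) + np.asarray(offsets, dtype=np.float64)
    return rope_factors(shifted, dim, base)
```

Since each factor has unit magnitude, the warp changes only the relative phases between tokens and leaves feature norms untouched.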

[255] PixelDiT: Pixel Diffusion Transformers for Image Generation

Yongsheng Yu, Wei Xiong, Weili Nie, Yichen Sheng, Shiqiu Liu, Jiebo Luo

Main category: cs.CV

TL;DR: PixelDiT is a single-stage diffusion transformer that operates directly in pixel space, eliminating the need for autoencoders and achieving state-of-the-art results on image generation tasks.

Details

Motivation: To address the limitations of latent-space diffusion models that rely on two-stage pipelines with pretrained autoencoders, which introduce lossy reconstruction and error accumulation while preventing joint optimization.

Method: Uses a fully transformer-based architecture with dual-level design: patch-level DiT for global semantics and pixel-level DiT for texture details, enabling efficient pixel-space diffusion training while preserving fine details.

Result: Achieves 1.61 FID on ImageNet 256x256, surpassing existing pixel generative models. When extended to text-to-image generation at 1024x1024 resolution, achieves 0.74 on GenEval and 83.5 on DPG-bench, approaching the best latent diffusion models.

Conclusion: PixelDiT demonstrates that effective pixel-level token modeling enables successful pixel-space diffusion, providing a single-stage alternative to latent-space models with competitive performance.

Abstract: Latent-space modeling has been the standard for Diffusion Transformers (DiTs). However, it relies on a two-stage pipeline where the pretrained autoencoder introduces lossy reconstruction, leading to error accumulation while hindering joint optimization. To address these issues, we propose PixelDiT, a single-stage, end-to-end model that eliminates the need for the autoencoder and learns the diffusion process directly in the pixel space. PixelDiT adopts a fully transformer-based architecture shaped by a dual-level design: a patch-level DiT that captures global semantics and a pixel-level DiT that refines texture details, enabling efficient training of a pixel-space diffusion model while preserving fine details. Our analysis reveals that effective pixel-level token modeling is essential to the success of pixel diffusion. PixelDiT achieves 1.61 FID on ImageNet 256x256, surpassing existing pixel generative models by a large margin. We further extend PixelDiT to text-to-image generation and pretrain it at the 1024x1024 resolution in pixel space. It achieves 0.74 on GenEval and 83.5 on DPG-bench, approaching the best latent diffusion models.

[256] 3D-Aware Multi-Task Learning with Cross-View Correlations for Dense Scene Understanding

Xiaoye Wang, Chen Tang, Xiangyu Yue, Wei-Hong Li

Main category: cs.CV

TL;DR: Proposes a lightweight Cross-view Module (CvM) to integrate geometric consistency into multi-task learning networks for dense prediction tasks like segmentation and depth estimation.

Details

Motivation: Current MTL approaches lack 3D-awareness and only capture cross-task relations in 2D image space, leading to unstructured features that limit comprehensive scene understanding.

Method: Introduces a shared Cross-view Module that exchanges information across views, using a cost volume as a geometric-consistency cue, and integrates it with MTL encoder features for multi-task predictions. The module is architecture-agnostic and works with both single and multi-view data.

Result: Extensive experiments on NYUv2 and PASCAL-Context datasets show improved performance by effectively injecting geometric consistency into existing MTL methods.

Conclusion: Integrating 3D geometric consistency through cross-view correlations significantly enhances multi-task learning performance for dense prediction tasks.

Abstract: This paper addresses the challenge of training a single network to jointly perform multiple dense prediction tasks, such as segmentation and depth estimation, i.e., multi-task learning (MTL). Current approaches mainly capture cross-task relations in the 2D image space, often leading to unstructured features lacking 3D-awareness. We argue that 3D-awareness is vital for modeling cross-task correlations essential for comprehensive scene understanding. We propose to address this problem by integrating correlations across views, i.e., cost volume, as geometric consistency in the MTL network. Specifically, we introduce a lightweight Cross-view Module (CvM), shared across tasks, to exchange information across views and capture cross-view correlations, integrated with a feature from the MTL encoder for multi-task predictions. This module is architecture-agnostic and can be applied to both single and multi-view data. Extensive results on NYUv2 and PASCAL-Context demonstrate that our method effectively injects geometric consistency into existing MTL methods to improve performance.
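
A cost volume, in its simplest form, stores the similarity between a reference-view feature and a shifted source-view feature for each candidate displacement. The sketch below builds one with channel-averaged dot products over horizontal shifts. The rectified two-view setup, the shift range `max_disp`, and the function name are assumptions for illustration; the abstract does not describe how the CvM constructs or consumes its cost volume.

```python
import numpy as np

def correlation_cost_volume(feat_ref, feat_src, max_disp):
    """Correlation cost volume of shape (max_disp + 1, H, W).

    feat_ref, feat_src: feature maps of shape (C, H, W).
    Entry [d, y, x] is the mean over channels of
    feat_ref[:, y, x] * feat_src[:, y, x - d]; positions with x < d stay 0.
    """
    feat_ref = np.asarray(feat_ref, dtype=np.float64)
    feat_src = np.asarray(feat_src, dtype=np.float64)
    _, h, w = feat_ref.shape
    volume = np.zeros((max_disp + 1, h, w), dtype=np.float64)
    for d in range(max_disp + 1):
        if d == 0:
            volume[d] = (feat_ref * feat_src).mean(axis=0)
        else:
            volume[d, :, d:] = (feat_ref[:, :, d:] * feat_src[:, :, :-d]).mean(axis=0)
    return volume
```

Taking the argmax over the first axis gives a crude per-pixel displacement estimate, which is one way to see why such a volume carries geometric information that 2D features alone lack.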

[257] Diverse Video Generation with Determinantal Point Process-Guided Policy Optimization

Tahira Kazimi, Connor Dunlop, Pinar Yanardag

Main category: cs.CV

TL;DR: DPP-GRPO is a plug-and-play framework that improves diversity in text-to-video generation by combining Determinantal Point Processes and Group Relative Policy Optimization to explicitly reward diverse outputs while maintaining prompt fidelity.

Details

Motivation: Current text-to-video diffusion models often produce low-diversity outputs when sampling multiple videos from the same prompt, limiting their ability to cover the full range of plausible outcomes.

Method: Formulates diverse video generation as a set-level policy optimization problem, using DPPs to impose diminishing returns on redundant samples and GRPO to provide groupwise feedback over candidate sets.

Result: Consistently improves video diversity on benchmarks (VBench, VideoScore, human preference studies) without sacrificing prompt fidelity or perceptual quality, demonstrated on WAN and CogVideoX models.

Conclusion: The framework successfully turns diversity into an explicit signal and provides a model-agnostic solution for diverse video generation, with released code and a new 30,000-prompt benchmark dataset.

Abstract: While recent text-to-video (T2V) diffusion models have achieved impressive quality and prompt alignment, they often produce low-diversity outputs when sampling multiple videos from a single text prompt. We tackle this challenge by formulating it as a set-level policy optimization problem, with the goal of training a policy that can cover the diverse range of plausible outcomes for a given prompt. To address this, we introduce DPP-GRPO, a novel framework for diverse video generation that combines Determinantal Point Processes (DPPs) and Group Relative Policy Optimization (GRPO) to explicitly reward diverse generations. Our objective turns diversity into an explicit signal by imposing diminishing returns on redundant samples (via DPP) while supplying groupwise feedback over candidate sets (via GRPO). Our framework is plug-and-play and model-agnostic, and encourages diverse generations across visual appearance, camera motions, and scene structure without sacrificing prompt fidelity or perceptual quality. We implement our method on WAN and CogVideoX, and show that our method consistently improves video diversity on state-of-the-art benchmarks such as VBench, VideoScore, and human preference studies. Moreover, we release our code and a new benchmark dataset of 30,000 diverse prompts to support future research.
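
The two ingredients named above can be sketched together: a DPP log-determinant over sample embeddings gives each sample a marginal diversity gain that shrinks when it duplicates another sample, and GRPO-style normalization turns per-sample rewards into group-relative advantages. The cosine-similarity kernel, the leave-one-out gain, the additive mix with a `weight` parameter, and all function names are assumptions for illustration, not the paper's stated objective.

```python
import numpy as np

def dpp_log_det(embeddings, eps=1e-6):
    """log det of a cosine-similarity kernel; larger means a more diverse set."""
    e = np.asarray(embeddings, dtype=np.float64)
    e = e / np.linalg.norm(e, axis=1, keepdims=True)
    kernel = e @ e.T + eps * np.eye(len(e))  # eps keeps the kernel well conditioned
    _, logdet = np.linalg.slogdet(kernel)
    return float(logdet)

def marginal_diversity(embeddings):
    """Leave-one-out gain: how much each sample adds to the set's log det."""
    e = np.asarray(embeddings, dtype=np.float64)
    full = dpp_log_det(e)
    return np.array([full - dpp_log_det(np.delete(e, i, axis=0)) for i in range(len(e))])

def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantages: rewards standardized within one sampled group."""
    r = np.asarray(rewards, dtype=np.float64)
    return (r - r.mean()) / (r.std() + eps)

def set_level_advantages(quality, embeddings, weight=1.0):
    """Hypothetical mix of per-sample quality and diversity gain."""
    rewards = np.asarray(quality, dtype=np.float64) + weight * marginal_diversity(embeddings)
    return group_relative_advantages(rewards)
```

With two near-duplicate samples and one distinct sample of equal quality, the distinct sample receives the largest advantage, which is the diminishing-returns behavior the abstract attributes to the DPP term.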

[258] Advancing Limited-Angle CT Reconstruction Through Diffusion-Based Sinogram Completion

Jiaqi Guo, Santiago Lopez-Tapia, Aggelos K. Katsaggelos

Main category: cs.CV

TL;DR: A novel sinogram inpainting method using MR-SDEs diffusion models for Limited Angle CT reconstruction, achieving state-of-the-art performance with accelerated processing.

Details

Motivation: Limited Angle CT faces challenges due to missing angular information, and previous image-domain methods are insufficient. Sinogram inpainting at projection level can better address this problem.

Method: Use MR-SDEs diffusion models for sinogram inpainting, combine distillation with pseudo-inverse constraints to accelerate diffusion process, and apply post-processing with back-projection and refinement.

Result: Achieves state-of-the-art performance in both perceptual and fidelity quality, effectively suppressing artifacts while preserving structural details.

Conclusion: Provides an efficient and accurate solution for LACT reconstruction with promising applications in scientific and clinical settings.

Abstract: Limited Angle Computed Tomography (LACT) often faces significant challenges due to missing angular information. Unlike previous methods that operate in the image domain, we propose a new method that focuses on sinogram inpainting. We leverage MR-SDEs, a variant of diffusion models that characterizes the diffusion process with mean-reverting stochastic differential equations, to fill in missing angular data at the projection level. Furthermore, by combining distillation with constraining the output of the model using the pseudo-inverse of the inpainting matrix, the diffusion process is accelerated and completed in a single step, enabling efficient and accurate sinogram completion. A subsequent post-processing module back-projects the inpainted sinogram into the image domain and further refines the reconstruction, effectively suppressing artifacts while preserving critical structural details. Quantitative experimental results demonstrate that the proposed method achieves state-of-the-art performance in both perceptual and fidelity quality, offering a promising solution for LACT reconstruction in scientific and clinical applications.
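
One common way to read "constraining the output using the pseudo-inverse of the inpainting matrix" is as a data-consistency projection, x = A⁺y + (I − A⁺A)x̂. The sketch below assumes A simply selects the measured projection angles (so A⁺ = Aᵀ), which reduces the formula to keeping measured rows and using the model's prediction elsewhere. The function name and the toy sinogram are hypothetical, and the paper's exact formulation is not given in this summary.

```python
def enforce_measured_views(predicted, measured, observed):
    """Data-consistency projection x = A^+ y + (I - A^+ A) x_hat.

    For a row-selection (masking) operator A, A^+ = A^T, so the formula
    reduces to: keep measured angles, use the prediction for missing ones.
    `measured` is the zero-filled sinogram A^T y.
    """
    return [
        list(meas) if seen else list(pred)
        for pred, meas, seen in zip(predicted, measured, observed)
    ]

# Toy sinogram: 4 projection angles x 3 detector bins; the last two angles are missing.
measured = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
observed = [True, True, False, False]
predicted = [[0.9, 2.1, 3.2], [4.2, 4.9, 6.1], [7.0, 8.0, 9.0], [1.5, 2.5, 3.5]]
completed = enforce_measured_views(predicted, measured, observed)
```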

[259] LocateAnything3D: Vision-Language 3D Detection with Chain-of-Sight

Yunze Man, Shihao Wang, Guowen Zhang, Johan Bjorck, Zhiqi Li, Liang-Yan Gui, Jim Fan, Jan Kautz, Yu-Xiong Wang, Zhiding Yu

Main category: cs.CV

TL;DR: LocateAnything3D is a VLM-native approach that treats 3D object detection as a next-token prediction problem using a Chain-of-Sight sequence, achieving state-of-the-art results on Omni3D benchmark.

Details

Motivation: Current vision-language models excel at 2D description and grounding but lack multi-object 3D detection capabilities, which is essential for models to act effectively in the world by understanding 3D spatial relationships.

Method: Uses a Chain-of-Sight sequence that mimics human reasoning: first detect objects in 2D, then predict distance, size, and pose. Employs an easy-to-hard curriculum with near-to-far ordering across objects and center-from-camera, dimensions, rotation factorization within objects.

Result: Achieves 49.89 AP_3D on Omni3D benchmark, surpassing previous best by +15.51 absolute improvement even when baseline uses ground-truth 2D boxes. Shows strong zero-shot generalization to held-out categories.

Conclusion: By framing 3D detection as a disciplined next-token problem, LocateAnything3D provides a practical foundation for models to perceive in 3D while preserving open-vocabulary and visual-prompting capabilities.

Abstract: To act in the world, a model must name what it sees and know where it is in 3D. Today’s vision-language models (VLMs) excel at open-ended 2D description and grounding, yet multi-object 3D detection remains largely missing from the VLM toolbox. We present LocateAnything3D, a VLM-native recipe that casts 3D detection as a next-token prediction problem. The key is a short, explicit Chain-of-Sight (CoS) sequence that mirrors how humans reason from images: find an object in 2D, then infer its distance, size, and pose. The decoder first emits 2D detections as a visual chain-of-thought, then predicts 3D boxes under an easy-to-hard curriculum: across objects, a near-to-far order reduces early ambiguity and matches ego-centric utility; within each object, a center-from-camera, dimensions, and rotation factorization ranks information by stability and learnability. This VLM-native interface preserves open-vocabulary and visual-prompting capability without specialized heads. On the challenging Omni3D benchmark, our model achieves state-of-the-art results, with 49.89 AP_3D, surpassing the previous best by +15.51 absolute improvement even when the baseline is given ground-truth 2D boxes. It also generalizes zero-shot to held-out categories with strong robustness. By turning 3D detection into a disciplined next-token problem, LocateAnything3D offers a practical foundation for models to perceive in 3D.
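
To make the ordering described in this abstract concrete, the sketch below serializes a set of objects near-to-far and, within each object, emits the 2D box first, then center, dimensions, and rotation. The tag names, the number formatting, the use of Euclidean distance for the near-to-far order, and the per-object interleaving of 2D and 3D fields are all assumptions for illustration; the paper's actual token format is not described here.

```python
import math
from dataclasses import dataclass

@dataclass
class Obj:
    label: str
    box2d: tuple   # (x1, y1, x2, y2) in pixels
    center: tuple  # (x, y, z) in camera coordinates, metres
    dims: tuple    # (width, height, length), metres
    yaw: float     # rotation about the vertical axis, radians

def chain_of_sight(objects):
    """Serialize near-to-far; per object: 2D box, center, dimensions, rotation."""
    ordered = sorted(objects, key=lambda o: math.dist((0.0, 0.0, 0.0), o.center))
    parts = []
    for o in ordered:
        parts.append(f"<obj>{o.label}<box2d>{','.join(str(v) for v in o.box2d)}")
        parts.append(f"<center>{','.join(f'{v:.2f}' for v in o.center)}")
        parts.append(f"<dims>{','.join(f'{v:.2f}' for v in o.dims)}")
        parts.append(f"<rot>{o.yaw:.2f}</obj>")
    return "".join(parts)

seq = chain_of_sight([
    Obj("truck", (0, 0, 10, 10), (0.0, 0.0, 20.0), (2.5, 3.0, 8.0), 0.0),
    Obj("car", (5, 5, 50, 40), (1.0, 0.0, 5.0), (1.8, 1.5, 4.2), 1.57),
])
```

The nearer car is emitted before the truck even though it appears second in the input list.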

[260] IVY-FAKE: A Unified Explainable Framework and Benchmark for Image and Video AIGC Detection

Changjiang Jiang, Wenhui Dong, Zhonghao Zhang, Chenyang Si, Fengchang Yu, Wei Peng, Xinbin Yuan, Yifei Bi, Ming Zhao, Zian Zhou, Caifeng Shan

Main category: cs.CV

TL;DR: Ivy-Fake is a large-scale multimodal benchmark for explainable AIGC detection with over 106K training samples and 5K evaluation examples, addressing limitations in current detection methods through rich annotations and fine-grained interpretability.

Details

Motivation: Address two major limitations in current AIGC detection: (1) lack of multidimensional explainable datasets with oversimplified binary annotations, and (2) insufficient fine-grained interpretability in MLLM-based forgery detectors that hinders reliable localization and explanation.

Method: Introduce Ivy-Fake benchmark with carefully designed pipeline for diverse, high-quality data collection from multiple generative models and real-world datasets. Propose Ivy-xDetector using reinforcement learning based on Group Relative Policy Optimization (GRPO) to produce explainable reasoning chains.

Result: Extensive experiments show superiority of the dataset and effectiveness of the approach. Method improves performance on GenImage from 86.88% to 96.32%, surpassing prior state-of-the-art methods by a clear margin.

Conclusion: Ivy-Fake provides the first large-scale multimodal benchmark for explainable AIGC detection, and Ivy-xDetector demonstrates robust performance across multiple synthetic content detection benchmarks with improved explainability.

Abstract: The rapid development of Artificial Intelligence Generated Content (AIGC) techniques has enabled the creation of high-quality synthetic content, but it also raises significant security concerns. Current detection methods face two major limitations: (1) There is a lack of multidimensional explainable datasets for generated images and videos. Existing open-source datasets (e.g., WildFake, GenVideo) rely on oversimplified binary annotations, which restrict the explainability and trustworthiness of trained detectors. (2) Prior MLLM-based forgery detectors (e.g., FakeVLM) exhibit insufficiently fine-grained interpretability in their step-by-step reasoning, which hinders reliable localization and explanation. To address these challenges, we introduce Ivy-Fake, the first large-scale multimodal benchmark for explainable AIGC detection. It consists of over 106K richly annotated training samples (images and videos) and 5,000 manually verified evaluation examples, sourced from multiple generative models and real-world datasets through a carefully designed pipeline to ensure both diversity and quality. Furthermore, we propose Ivy-xDetector, a reinforcement learning model based on Group Relative Policy Optimization (GRPO), capable of producing explainable reasoning chains and achieving robust performance across multiple synthetic content detection benchmarks. Extensive experiments demonstrate the superiority of our dataset and confirm the effectiveness of our approach. Notably, our method improves performance on GenImage from 86.88% to 96.32%, surpassing prior state-of-the-art methods by a clear margin.

[261] Infinity-RoPE: Action-Controllable Infinite Video Generation Emerges From Autoregressive Self-Rollout

Hidir Yesiltepe, Tuna Han Salih Meral, Adil Kaan Akan, Kaan Oktay, Pinar Yanardag

Main category: cs.CV

TL;DR: ∞-RoPE is a training-free framework that enables infinite-horizon video generation with fine-grained control and cinematic transitions by addressing three core limitations of autoregressive video diffusion models.

Details

Motivation: Current autoregressive video diffusion models are constrained by finite temporal horizons, slow prompt responsiveness, and inability to create cinematic transitions within single generations.

Method: Three interconnected components: Block-Relativistic RoPE (reformulates temporal encoding as moving local reference frame), KV Flush (renews KV cache for immediate prompt responsiveness), and RoPE Cut (introduces controlled temporal discontinuities for scene transitions).

Result: Enables continuous video generation beyond positional limits with immediate prompt control and multi-cut scene transitions within single rollouts, consistently surpassing previous models in VBench scores.

Conclusion: ∞-RoPE establishes a training-free foundation for infinite-horizon, controllable, and cinematic video diffusion, addressing all three core bottlenecks of current autoregressive video models.

Abstract: Current autoregressive video diffusion models are constrained by three core bottlenecks: (i) the finite temporal horizon imposed by the base model’s 3D Rotary Positional Embedding (3D-RoPE), (ii) slow prompt responsiveness in maintaining fine-grained action control during long-form rollouts, and (iii) the inability to realize discontinuous cinematic transitions within a single generation stream. We introduce ∞-RoPE, a unified inference-time framework that addresses all three limitations through three interconnected components: Block-Relativistic RoPE, KV Flush, and RoPE Cut. Block-Relativistic RoPE reformulates temporal encoding as a moving local reference frame, where each newly generated latent block is rotated relative to the base model’s maximum frame horizon while earlier blocks are rotated backward to preserve relative temporal geometry. This relativistic formulation eliminates fixed temporal positions, enabling continuous video generation far beyond the base positional limits. To obtain fine-grained action control without re-encoding, KV Flush renews the KV cache by retaining only two latent frames, the global sink and the last generated latent frame, thereby ensuring immediate prompt responsiveness. Finally, RoPE Cut introduces controlled discontinuities in temporal RoPE coordinates, enabling multi-cut scene transitions within a single continuous rollout. Together, these components establish ∞-RoPE as a training-free foundation for infinite-horizon, controllable, and cinematic video diffusion. Comprehensive experiments show that ∞-RoPE consistently surpasses previous autoregressive models in overall VBench scores.
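
Two of the mechanisms in this abstract are simple enough to sketch at the bookkeeping level. The code below models the KV cache as a plain list of per-frame entries and temporal RoPE coordinates as integers; `kv_flush` keeps only the global sink (assumed to be the first entry) and the last generated frame, and `temporal_positions` inserts a jump in the coordinate at each cut. The function names, the list-based cache, and the `cut_gap` value are hypothetical; real caches hold per-layer key/value tensors, and the paper's actual coordinate scheme may differ.

```python
def kv_flush(cache):
    """Keep only the global sink (first entry) and the last generated frame."""
    if len(cache) <= 2:
        return list(cache)
    return [cache[0], cache[-1]]

def temporal_positions(num_frames, cut_at=(), cut_gap=8):
    """Temporal coordinates with a controlled discontinuity at each cut index."""
    positions, current = [], 0
    for i in range(num_frames):
        if i in cut_at:
            current += cut_gap  # jump in temporal coordinate marks a scene cut
        positions.append(current)
        current += 1
    return positions

flushed = kv_flush(["sink", "f1", "f2", "f3", "f4"])
positions = temporal_positions(5, cut_at={3}, cut_gap=8)
```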

[262] HoliSafe: Holistic Safety Benchmarking and Modeling for Vision-Language Model

Youngwan Lee, Kangsan Kim, Kwanyong Park, Ilcahe Jung, Soojin Jang, Seanie Lee, Yong-Ju Lee, Sung Ju Hwang

Main category: cs.CV

TL;DR: HoliSafe introduces a comprehensive safety dataset and benchmark for VLMs, plus a visual guard module (VGM) that enhances safety through modular architecture and interpretable harm classification.

Details

Motivation: Current VLM safety approaches have limited coverage of image-text interaction risks and rely mainly on data tuning without architectural innovations, leaving models vulnerable to jailbreak attacks.

Method: Proposes HoliSafe dataset covering all five safe/unsafe image-text combinations, and a visual guard module (VGM) that assesses image harmfulness as a plug-in component for VLMs.

Result: Safe-VLM with VGM achieves state-of-the-art safety performance across multiple benchmarks, and HoliSafe-Bench reveals critical vulnerabilities in existing VLMs.

Conclusion: HoliSafe and VGM provide a foundation for robust and interpretable VLM safety, enabling modular integration and expanding multimodal alignment research.

Abstract: Despite emerging efforts to enhance the safety of Vision-Language Models (VLMs), current approaches face two main shortcomings. 1) Existing safety-tuning datasets and benchmarks only partially consider how image-text interactions can yield harmful content, often overlooking contextually unsafe outcomes from seemingly benign pairs. This narrow coverage leaves VLMs vulnerable to jailbreak attacks in unseen configurations. 2) Prior methods rely primarily on data-centric tuning, with limited architectural innovations to intrinsically strengthen safety. We address these gaps by introducing a holistic safety dataset and benchmark, HoliSafe, that spans all five safe/unsafe image-text combinations, providing a more robust basis for both training and evaluation (HoliSafe-Bench). We further propose a novel modular framework for enhancing VLM safety with a visual guard module (VGM) designed to assess the harmfulness of input images for VLMs. This module endows VLMs with a dual functionality: they not only learn to generate safer responses but can also provide an interpretable harmfulness classification to justify their refusal decisions. A significant advantage of this approach is its modularity; the VGM is designed as a plug-in component, allowing for seamless integration with diverse pre-trained VLMs across various scales. Experiments show that Safe-VLM with VGM, trained on our HoliSafe, achieves state-of-the-art safety performance across multiple VLM benchmarks. Additionally, the HoliSafe-Bench itself reveals critical vulnerabilities in existing VLMs. We hope that HoliSafe and VGM will spur further research into robust and interpretable VLM safety, expanding future avenues for multimodal alignment.

[263] RubricRL: Simple Generalizable Rewards for Text-to-Image Generation

Xuelu Feng, Yunsheng Li, Ziyu Wan, Zixuan Gao, Junsong Yuan, Dongdong Chen, Chunming Qiao

Main category: cs.CV

TL;DR: RubricRL is a framework for text-to-image model alignment that uses structured rubrics instead of black-box rewards, providing interpretable, composable criteria for RL training.

Details

Motivation: Existing RL alignment methods use either composite metrics with fixed weights or single scalar rewards, which lack interpretability and user control over specific visual criteria.

Method: Dynamically constructs structured rubrics per prompt with fine-grained visual criteria (object correctness, attributes, OCR, realism), independently evaluated by multimodal judges, with prompt-adaptive weighting.

Result: Improves prompt faithfulness, visual detail, and generalizability in autoregressive text-to-image models compared to existing methods.

Conclusion: RubricRL offers a flexible, extensible foundation for interpretable RL alignment that enables direct user control over rewarded aspects across text-to-image architectures.

Abstract: Reinforcement learning (RL) has recently emerged as a promising approach for aligning text-to-image generative models with human preferences. A key challenge, however, lies in designing effective and interpretable rewards. Existing methods often rely on either composite metrics (e.g., CLIP, OCR, and realism scores) with fixed weights or a single scalar reward distilled from human preference models, which can limit interpretability and flexibility. We propose RubricRL, a simple and general framework for rubric-based reward design that offers greater interpretability, composability, and user control. Instead of using a black-box scalar signal, RubricRL dynamically constructs a structured rubric for each prompt: a decomposable checklist of fine-grained visual criteria (such as object correctness, attribute accuracy, OCR fidelity, and realism) tailored to the input text. Each criterion is independently evaluated by a multimodal judge (e.g., o4-mini), and a prompt-adaptive weighting mechanism emphasizes the most relevant dimensions. This design not only produces interpretable and modular supervision signals for policy optimization (e.g., GRPO or PPO), but also enables users to directly adjust which aspects to reward or penalize. Experiments with an autoregressive text-to-image model demonstrate that RubricRL improves prompt faithfulness, visual detail, and generalizability, while offering a flexible and extensible foundation for interpretable RL alignment across text-to-image architectures.
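
A rubric-based reward of the kind described in this abstract can be pictured as a weighted average of per-criterion judge scores. In the sketch below the criterion names follow the abstract, but the scores, the weights, and the normalization are hypothetical; in the paper both the rubric and the weights are produced per prompt, and the aggregation rule may differ.

```python
def rubric_reward(scores, weights):
    """Weighted average of per-criterion judge scores, each in [0, 1]."""
    total = sum(weights[name] for name in scores)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(scores[name] * weights[name] for name in scores) / total

# Hypothetical judge outputs for a prompt that asks for legible text in the image,
# so the (assumed) prompt-adaptive weights emphasize OCR fidelity.
scores = {"object_correctness": 1.0, "attribute_accuracy": 0.5,
          "ocr_fidelity": 0.0, "realism": 1.0}
weights = {"object_correctness": 2.0, "attribute_accuracy": 1.0,
           "ocr_fidelity": 4.0, "realism": 1.0}
reward = rubric_reward(scores, weights)
```

Because each criterion keeps its own score, a user could zero out or boost a single weight to change what is rewarded without retraining a reward model.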

[264] Harnessing Vision-Language Models for Time Series Anomaly Detection

Zelin He, Sarah Alnegheimish, Matthew Reimherr

Main category: cs.CV

TL;DR: The paper proposes a two-stage vision-language model approach for time-series anomaly detection, achieving superior performance without time-series training.

Details

Motivation: Existing time-series anomaly detection methods lack visual-temporal understanding capacity that human experts possess, and vision language models have shown promise but underperform when directly applied to time series.

Method: Two-stage solution: (1) ViT4TS - vision-screening stage using lightweight pre-trained vision encoder with 2D time series representations to localize candidate anomalies; (2) VLM4TS - VLM-based stage that integrates global temporal context and visual understanding to refine detection on candidates.

Result: VLM4TS outperforms time-series pre-trained and from-scratch baselines in most cases with 24.6% improvement in F1-max score, consistently beats existing language model-based methods, and is 36x more efficient in token usage.

Conclusion: Vision language models can be effectively leveraged for time-series anomaly detection through a two-stage approach that combines visual screening with contextual refinement, achieving state-of-the-art performance without domain-specific training.

Abstract: Time-series anomaly detection (TSAD) has played a vital role in a variety of fields, including healthcare, finance, and sensor-based condition monitoring. Prior methods, which mainly focus on training domain-specific models on numerical data, lack the visual-temporal understanding capacity that human experts have to identify contextual anomalies. To fill this gap, we explore a solution based on vision language models (VLMs). Recent studies have shown the ability of VLMs for visual understanding tasks, yet their direct application to time series has fallen short on both accuracy and efficiency. To harness the power of VLMs for TSAD, we propose a two-stage solution, with (1) ViT4TS, a vision-screening stage built on a relatively lightweight pre-trained vision encoder, which leverages 2D time series representations to accurately localize candidate anomalies; (2) VLM4TS, a VLM-based stage that integrates global temporal context and VLM’s visual understanding capacity to refine the detection upon the candidates provided by ViT4TS. We show that without any time-series training, VLM4TS outperforms time-series pre-trained and from-scratch baselines in most cases, yielding a 24.6% improvement in F1-max score over the best baseline. Moreover, VLM4TS also consistently outperforms existing language model-based TSAD methods and is on average 36x more efficient in token usage.

[265] WeatherDiffusion: Controllable Weather Editing in Intrinsic Space

Yixin Zhu, Zuoliang Zhu, Jian Yang, Miloš Hašan, Jin Xie, Beibei Wang

Main category: cs.CV

TL;DR: WeatherDiffusion is a diffusion-based framework for controllable weather editing using intrinsic maps (material, geometry, lighting) with an inverse renderer to extract these maps and a forward renderer to generate weather-modified images.

Details

Motivation: Traditional pixel-space weather editing lacks controllability and spatial correspondence in large outdoor scenes, limiting applications like autonomous driving that require robust weather adaptation.

Method: Uses diffusion priors with two components: inverse renderer to estimate intrinsic maps from input images, and forward renderer that uses these maps with text prompts for weather control. Introduces intrinsic map-aware attention and CLIP-space interpolation for fine-grained weather editing.

Result: Outperforms state-of-the-art pixel-space editing, weather restoration, and rendering-based methods. Created datasets with 38k synthetic and 18k real-world images with intrinsic map annotations.

Conclusion: WeatherDiffusion shows promise for downstream tasks like autonomous driving by enhancing detection and segmentation robustness in challenging weather scenarios through controllable weather editing.

Abstract: We present WeatherDiffusion, a diffusion-based framework for controllable weather editing in intrinsic space. Our framework includes two components based on diffusion priors: an inverse renderer that estimates material properties, scene geometry, and lighting as intrinsic maps from an input image, and a forward renderer that utilizes these geometry and material maps along with a text prompt that describes specific weather conditions to generate a final image. The intrinsic maps enhance controllability compared to traditional pixel-space editing approaches. We propose an intrinsic map-aware attention mechanism that improves spatial correspondence and decomposition quality in large outdoor scenes. For forward rendering, we leverage CLIP-space interpolation of weather prompts to achieve fine-grained weather control. We also introduce a synthetic and a real-world dataset, containing 38k and 18k images under various weather conditions, each with intrinsic map annotations. WeatherDiffusion outperforms state-of-the-art pixel-space editing approaches, weather restoration methods, and rendering-based methods, showing promise for downstream tasks such as autonomous driving, enhancing the robustness of detection and segmentation in challenging weather scenarios.
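
The CLIP-space interpolation mentioned in this abstract can be sketched as a blend between the embeddings of two weather prompts. The example below assumes plain linear interpolation and uses made-up three-dimensional vectors in place of real CLIP text embeddings; the paper may use a different interpolation scheme.

```python
def interpolate_embeddings(emb_a, emb_b, t):
    """Linear blend of two prompt embeddings; t=0 gives emb_a, t=1 gives emb_b."""
    if not 0.0 <= t <= 1.0:
        raise ValueError("t must lie in [0, 1]")
    return [(1.0 - t) * a + t * b for a, b in zip(emb_a, emb_b)]

# Toy stand-ins for the text embeddings of "a sunny day" and "heavy rain".
sunny = [1.0, 0.0, 2.0]
rainy = [0.0, 4.0, 2.0]
light_rain = interpolate_embeddings(sunny, rainy, 0.25)
```

Sweeping `t` from 0 to 1 would then give a continuous control over weather intensity for the forward renderer's conditioning.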

[266] SafeFix: Targeted Model Repair via Controlled Image Generation

Ouyang Xu, Baoming Zhang, Ruiyu Mao, Yunhui Guo

Main category: cs.CV

TL;DR: A model repair module that uses conditional text-to-image generation and LVLM filtering to create targeted synthetic images for rare failure cases, improving model robustness without introducing new bugs.

Details

Motivation: Deep learning models often fail on underrepresented semantic subpopulations, and existing repair methods using manual prompts suffer from distribution shift and semantic errors.

Method: Uses interpretable failure attribution, conditional text-to-image generation for targeted images, and LVLM filtering to ensure semantic consistency and data distribution alignment.

Result: Significantly reduces errors associated with rare cases and improves model robustness without introducing new bugs.

Conclusion: Targeted repair strategy using synthetic data generation and filtering effectively addresses model failures on rare semantic subpopulations.

Abstract: Deep learning models for visual recognition often exhibit systematic errors due to underrepresented semantic subpopulations. Although existing debugging frameworks can pinpoint these failures by identifying key failure attributes, repairing the model effectively remains difficult. Current solutions often rely on manually designed prompts to generate synthetic training images – an approach prone to distribution shift and semantic errors. To overcome these challenges, we introduce a model repair module that builds on an interpretable failure attribution pipeline. Our approach uses a conditional text-to-image model to generate semantically faithful and targeted images for failure cases. To preserve the quality and relevance of the generated samples, we further employ a large vision-language model (LVLM) to filter the outputs, enforcing alignment with the original data distribution and maintaining semantic consistency. By retraining vision models with this rare-case-augmented synthetic dataset, we significantly reduce errors associated with rare cases. Our experiments demonstrate that this targeted repair strategy improves model robustness without introducing new bugs. Code is available at https://github.com/oxu2/SafeFix

[267] Natural Image Stitching Using Depth Maps

Tianli Liao, Nan Li

Main category: cs.CV

TL;DR: A novel image stitching method using depth maps to handle parallax in non-planar scenes captured by handheld cameras, achieving accurate alignment and view-consistent results.

Details

Motivation: Natural image stitching faces challenges with parallax when scenes are non-planar and captured from different viewing positions, requiring methods that can handle these geometric distortions.

Method: Uses depth maps to estimate epipolar geometry, filters outliers in feature matches, establishes pixel-to-pixel correspondences, and applies optimal warping with artifact-solving modules for final mosaic generation.

Result: Experimental results on three challenging datasets show significantly more accurate alignment in overlapping regions and view-consistent results in non-overlapping regions compared to existing methods.

Conclusion: The depth-based approach effectively handles parallax in image stitching and is expected to benefit from ongoing improvements in monocular depth estimation techniques.

Abstract: Natural image stitching aims to create a single, natural-looking mosaic from overlapped images that capture the same 3D scene from different viewing positions. Challenges inevitably arise when the scene is non-planar and captured by handheld cameras since parallax is non-negligible in such cases. In this paper, we propose a novel image stitching method using depth maps, which generates accurate alignment mosaics against parallax. Firstly, we construct a robust fitting method to filter out the outliers in feature matches and estimate the epipolar geometry between input images. Then, we utilize epipolar geometry to establish pixel-to-pixel correspondences between the input images and render the warped images using the proposed optimal warping. In the rendering stage, we introduce several modules to solve the mapping artifacts in the warping results and generate the final mosaic. Experimental results on three challenging datasets demonstrate that the depth maps of input images enable our method to provide much more accurate alignment in the overlapping region and view-consistent results in the non-overlapping region. We believe our method will continue to work under the rapid progress of monocular depth estimation. The source code is available at https://github.com/tlliao/NIS_depths.
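
The outlier filtering step in this abstract relies on the epipolar constraint x₂ᵀ F x₁ = 0. The sketch below shows only that constraint being used to reject matches once a fundamental matrix F is available; it does not reproduce the paper's depth-based robust fitting. The function names, the tolerance, and the toy matrix (a pure horizontal translation with identity intrinsics) are assumptions for illustration.

```python
def epipolar_residual(F, p1, p2):
    """Algebraic residual |x2^T F x1| for pixel points p1, p2 (homogeneous w = 1)."""
    x1 = (p1[0], p1[1], 1.0)
    x2 = (p2[0], p2[1], 1.0)
    Fx1 = [sum(F[r][c] * x1[c] for c in range(3)) for r in range(3)]
    return abs(sum(x2[r] * Fx1[r] for r in range(3)))

def filter_matches(F, matches, tol=1.0):
    """Keep matches whose epipolar residual is within the tolerance."""
    return [(p1, p2) for p1, p2 in matches if epipolar_residual(F, p1, p2) <= tol]

# For a pure horizontal translation the residual reduces to |y1 - y2|.
F = [[0.0, 0.0, 0.0], [0.0, 0.0, -1.0], [0.0, 1.0, 0.0]]
matches = [((10.0, 20.0), (14.0, 20.2)), ((30.0, 40.0), (33.0, 55.0))]
inliers = filter_matches(F, matches)
```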

[268] Leveraging Unlabeled Data from Unknown Sources via Dual-Path Guidance for Deepfake Face Detection

Zhiqiang Yang, Renshuai Tao, Chunjie Zhang, Guodong Yang, Xiaolong Zheng, Yao Zhao

Main category: cs.CV

TL;DR: DPGNet addresses deepfake detection challenges with unlabeled data using text-guided cross-domain alignment and curriculum-driven pseudo-label generation to handle domain differences and utilize unlabeled samples effectively.

Details

Motivation: Existing deepfake detectors fail to generalize to real-world scenarios with massive unlabeled fake data from unknown sources, and manual labeling is infeasible due to high realism. Traditional unsupervised methods struggle because real and fake faces share the same semantics.

Method: Dual-path guided network (DPGNet) with two core modules: text-guided cross-domain alignment using learnable cues to unify visual and textual embeddings, and curriculum-driven pseudo-label generation that dynamically utilizes unlabeled samples.

Result: Extensive experiments on multiple mainstream datasets show DPGNet significantly outperforms existing techniques in deepfake detection using unlabeled data.

Conclusion: DPGNet provides an effective new paradigm for deepfake detection that successfully addresses the challenges of domain differences and utilization of unlabeled data in real-world scenarios.

Abstract: Existing deepfake detection methods heavily rely on static labeled datasets. However, with the proliferation of generative models, real-world scenarios are flooded with massive amounts of unlabeled fake face data from unknown sources. This presents a critical dilemma: detectors relying solely on existing data face generalization failure, while manual labeling for this new stream is infeasible due to the high realism of fakes. A more fundamental challenge is that, unlike typical unsupervised learning tasks where categories are clearly defined, real and fake faces share the same semantics, which leads to a decline in the performance of traditional unsupervised strategies. Therefore, there is an urgent need for a new paradigm designed specifically for this scenario to effectively utilize these unlabeled data. Accordingly, this paper proposes a dual-path guided network (DPGNet) to address two key challenges: (1) bridging the domain differences between faces generated by different generative models; and (2) utilizing unlabeled image samples. The method comprises two core modules: text-guided cross-domain alignment, which uses learnable cues to unify visual and textual embeddings into a domain-invariant feature space; and curriculum-driven pseudo-label generation, which dynamically utilizes unlabeled samples. Extensive experiments on multiple mainstream datasets show that DPGNet significantly outperforms existing techniques, highlighting its effectiveness in exploiting unlabeled data for deepfake detection.
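
The abstract does not spell out how the curriculum-driven pseudo-label generation works, so the sketch below shows a generic confidence-based curriculum instead: only very confident predictions on unlabeled faces are accepted early in training, and the acceptance threshold is relaxed linearly over epochs. The schedule, the threshold values, and the function names are assumptions, not DPGNet's actual procedure.

```python
def confidence_threshold(epoch, total_epochs, start=0.95, end=0.70):
    """Linearly relax the acceptance threshold as training progresses."""
    frac = min(max(epoch / max(total_epochs - 1, 1), 0.0), 1.0)
    return start + (end - start) * frac

def select_pseudo_labels(fake_probs, threshold):
    """Return (index, label) for samples whose prediction is confident enough.

    fake_probs[i] is the predicted probability that sample i is fake;
    label 1 means fake, label 0 means real.
    """
    selected = []
    for i, p in enumerate(fake_probs):
        confidence = max(p, 1.0 - p)
        if confidence >= threshold:
            selected.append((i, 1 if p >= 0.5 else 0))
    return selected

probs = [0.98, 0.80, 0.10, 0.55]
early = select_pseudo_labels(probs, confidence_threshold(0, 5))
late = select_pseudo_labels(probs, confidence_threshold(4, 5))
```

Early in training only the first sample is pseudo-labeled; by the final epoch three of the four are used, while the ambiguous one (0.55) is still held back.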

[269] Rethinking the Learning Paradigm for Facial Expression Recognition

Weijie Wang, Nicu Sebe, Bruno Lepri

Main category: cs.CV

TL;DR: Proposes using weakly supervised strategies instead of converting ambiguous annotations to one-hot labels for Facial Expression Recognition training.

Details

Motivation: Real-world FER datasets have ambiguous annotations due to subjective crowdsourcing and inter-class similarity, but current methods convert them to precise one-hot labels which may not be optimal.

Method: Rethinks existing training paradigm and proposes using weakly supervised strategies to train FER models with original ambiguous annotations rather than converting them.

Result: Suggests that using original ambiguous annotations with weakly supervised learning is better than the conventional approach of converting to one-hot labels.

Conclusion: Weakly supervised strategies are more suitable for training FER models with inherently ambiguous facial expression annotations.

Abstract: Due to subjective crowdsourced annotations and the inherent inter-class similarity of facial expressions, real-world Facial Expression Recognition (FER) datasets usually exhibit ambiguous annotations. To simplify the learning paradigm, most previous methods convert ambiguous annotation results into precise one-hot annotations and train FER models in an end-to-end supervised manner. In this paper, we rethink the existing training paradigm and propose that it is better to use weakly supervised strategies to train FER models with the original ambiguous annotations.
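
The abstract does not say which weakly supervised strategy is used. As one generic illustration of training on the original ambiguous annotations, the sketch below turns annotator votes into a soft label distribution and compares cross-entropy against that distribution with cross-entropy against a one-hot label. The vote counts, class names, and function names are hypothetical, and soft-label training is a stand-in technique, not necessarily the paper's.

```python
import math

def soft_targets(votes):
    """Normalize annotator vote counts into a label distribution."""
    total = sum(votes)
    return [v / total for v in votes]

def cross_entropy(pred_probs, target_probs, eps=1e-12):
    """Cross-entropy between a predicted and a target distribution."""
    return -sum(t * math.log(p + eps) for p, t in zip(pred_probs, target_probs))

votes = [6, 3, 1]                 # hypothetical votes: happy, surprise, neutral
soft = soft_targets(votes)        # approximately [0.6, 0.3, 0.1]
one_hot = [1.0, 0.0, 0.0]         # majority vote collapsed to a single class

calibrated = [0.6, 0.3, 0.1]      # prediction that mirrors the annotators
overconfident = [0.98, 0.01, 0.01]

loss_soft_calibrated = cross_entropy(calibrated, soft)
loss_soft_overconfident = cross_entropy(overconfident, soft)
loss_hard_calibrated = cross_entropy(calibrated, one_hot)
loss_hard_overconfident = cross_entropy(overconfident, one_hot)
```

With soft targets the prediction that mirrors annotator disagreement has the lower loss, whereas the one-hot target favors the overconfident prediction.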

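[270] SD-MVS: Segmentation-Driven Deformation Multi-View Stereo with Spherical Refinement and EM optimization

Zhenlong Yuan, Jiakai Cao, Zhaoxin Li, Hao Jiang, Zhaoqi Wang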

Main category: cs.CV

TL;DR: SD-MVS uses SAM for semantic instance segmentation to improve 3D reconstruction in textureless areas, with patch deformation, spherical coordinate refinement, and EM optimization for state-of-the-art results.

Details

Motivation: To address challenges in 3D reconstruction of textureless areas where traditional methods struggle due to lack of visual features.

Method: Uses SAM for semantic instance segmentation, applies pixelwise patch deformation on matching cost and propagation, combines spherical coordinates with gradient descent for refinement, and employs EM algorithm for parameter optimization.

Result: Achieves state-of-the-art performance on ETH3D and Tanks and Temples benchmarks with reduced time consumption compared to existing methods.

Conclusion: SD-MVS effectively tackles textureless area reconstruction through semantic segmentation guidance and optimized deformation strategies, demonstrating superior performance and efficiency.

Abstract: In this paper, we introduce Segmentation-Driven Deformation Multi-View Stereo (SD-MVS), a method that can effectively tackle challenges in 3D reconstruction of textureless areas. We are the first to adopt the Segment Anything Model (SAM) to distinguish semantic instances in scenes and further leverage these constraints for pixelwise patch deformation on both matching cost and propagation. Concurrently, we propose a unique refinement strategy that combines spherical coordinates and gradient descent on normals and pixelwise search interval on depths, significantly improving the completeness of reconstructed 3D model. Furthermore, we adopt the Expectation-Maximization (EM) algorithm to alternately optimize the aggregate matching cost and hyperparameters, effectively mitigating the problem of parameters being excessively dependent on empirical tuning. Evaluations on the ETH3D high-resolution multi-view stereo benchmark and the Tanks and Temples dataset demonstrate that our method can achieve state-of-the-art results with less time consumption.
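The refinement idea of running gradient descent on normals in spherical coordinates can be illustrated with a small sketch. It is not the authors' implementation: the names `normal_from_spherical` and `refine_normal`, the finite-difference gradient, and the toy cost are all assumptions made for illustration. The sketch only shows why the parameterization is convenient: any update to the two angles yields a unit-length normal, so no re-normalization step is needed.

```python
import math

def normal_from_spherical(theta, phi):
    """Unit normal from polar angle theta and azimuth phi (radians)."""
    return (
        math.sin(theta) * math.cos(phi),
        math.sin(theta) * math.sin(phi),
        math.cos(theta),
    )

def refine_normal(theta, phi, cost, lr=0.1, steps=100, h=1e-5):
    """Gradient descent on (theta, phi) using central finite differences.

    `cost` is a hypothetical stand-in for a matching cost; it maps a unit
    normal (3-tuple) to a float.
    """
    for _ in range(steps):
        g_theta = (cost(normal_from_spherical(theta + h, phi))
                   - cost(normal_from_spherical(theta - h, phi))) / (2 * h)
        g_phi = (cost(normal_from_spherical(theta, phi + h))
                 - cost(normal_from_spherical(theta, phi - h))) / (2 * h)
        theta -= lr * g_theta
        phi -= lr * g_phi
    return theta, phi

# Toy usage: pull the normal toward a target direction.
target = (1.0, 0.0, 0.0)
toy_cost = lambda n: 1.0 - sum(a * b for a, b in zip(n, target))
theta, phi = refine_normal(1.0, 0.5, toy_cost, steps=300)
```

In a real multi-view stereo pipeline the cost would come from photometric matching across views rather than from a known target direction.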

[271] Zero-Shot Video Translation via Token Warping

Haiming Zhu, Yangyang Xu, Jun Yu, Shengfeng He

Main category: cs.CV

TL;DR: TokenWarping is a training-free framework that uses optical flow to warp query, key, and value patches in diffusion models for temporally coherent video translation, outperforming existing methods.

Details

Motivation: Current video models lag behind image models in visual quality and user control, and existing video editing methods sacrifice local/structural preservation while overlooking the importance of query patches for temporal coherence.

Method: Extract optical flows from source videos and use them to warp previous frame’s query, key, and value patches to align with current frame during diffusion denoising, enhancing feature aggregation and temporal consistency.

Result: Extensive experiments show TokenWarping surpasses state-of-the-art methods both qualitatively and quantitatively on various video translation tasks.

Conclusion: TokenWarping effectively ensures temporally coherent video translation without additional training and can be integrated with existing text-to-image editing methods.

Abstract: With the revolution of generative AI, video-related tasks have been widely studied. However, current state-of-the-art video models still lag behind image models in visual quality and user control over generated content. In this paper, we introduce TokenWarping, a novel framework for temporally coherent video translation. Existing diffusion-based video editing approaches rely solely on key and value patches in self-attention to ensure temporal consistency, often sacrificing the preservation of local and structural regions. Critically, these methods overlook the significance of the query patches in achieving accurate feature aggregation and temporal coherence. In contrast, TokenWarping leverages complementary token priors by constructing temporal correlations across different frames. Our method begins by extracting optical flows from source videos. During the denoising process of the diffusion model, these optical flows are used to warp the previous frame’s query, key, and value patches, aligning them with the current frame’s patches. By directly warping the query patches, we enhance feature aggregation in self-attention, while warping the key and value patches ensures temporal consistency across frames. This token warping imposes explicit constraints on the self-attention layer outputs, effectively ensuring temporally coherent translation. Our framework does not require any additional training or fine-tuning and can be seamlessly integrated with existing text-to-image editing methods. We conduct extensive experiments on various video translation tasks, demonstrating that TokenWarping surpasses state-of-the-art methods both qualitatively and quantitatively. Video demonstrations can be found on our project webpage: https://alex-zhu1.github.io/TokenWarping/. Code is available at: https://github.com/Alex-Zhu1/TokenWarping.
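The flow-based warping step can be sketched as follows. This is a simplified illustration under stated assumptions, not the released TokenWarping code: tokens are laid out on a 2D grid, the flow is a backward flow (for each current-frame position it gives the offset to the matching previous-frame position), and sampling is nearest-neighbor with border clamping. The function name `warp_tokens` is hypothetical. In the method described above, the same warp would be applied to the query, key, and value patches.

```python
def warp_tokens(prev_tokens, flow):
    """Align a previous frame's token grid with the current frame.

    prev_tokens: H x W nested list of tokens (any Python objects).
    flow: H x W nested list of (dx, dy) offsets, assumed to point from each
          current-frame position to its source in the previous frame.
    """
    h, w = len(prev_tokens), len(prev_tokens[0])
    warped = []
    for y in range(h):
        row = []
        for x in range(w):
            dx, dy = flow[y][x]
            # Nearest-neighbor lookup, clamped to the grid borders.
            sx = min(max(int(round(x + dx)), 0), w - 1)
            sy = min(max(int(round(y + dy)), 0), h - 1)
            row.append(prev_tokens[sy][sx])
        warped.append(row)
    return warped

# Toy usage: content moved one token to the right between frames,
# so every current position looks one step to the left.
prev = [["a", "b", "c"]]
flow = [[(-1, 0), (-1, 0), (-1, 0)]]
warped = warp_tokens(prev, flow)  # [["a", "a", "b"]]
```

A practical implementation would more likely use bilinear sampling on tensors and handle occlusions, which this sketch omits.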

[272] GMT: Effective Global Framework for Multi-Camera Multi-Target Tracking

Yihao Zhen, Mingyue Xu, Qiang Wang, Baojie Fan, Jiahua Dong, Tinghui Zhao, Huijie Fan

Main category: cs.CV

TL;DR: GMT is a global MCMT tracking framework that unifies intra-view and inter-view tracking into a single global trajectory association process, achieving significant performance improvements over traditional two-stage methods.

Details

Motivation: Existing MCMT tracking methods use a two-stage framework where multi-view information is only used to recover missed matches, providing limited contribution to overall tracking performance.

Method: Proposes GMT framework with Cross-View Feature Consistency Enhancement (CFCE) module to align features across views, and Global Trajectory Association (GTA) module to associate detections with global trajectories directly using multi-view information.

Result: Achieves significant improvements on existing datasets: up to 21.3% gain in CVMA and 17.2% gain in CVIDF1 metrics compared to two-stage framework.

Conclusion: GMT effectively addresses limitations of traditional two-stage MCMT tracking by unifying intra-view and inter-view tracking, and introduces VisionTrack dataset for future research.

Abstract: Multi-Camera Multi-Target (MCMT) tracking aims to locate and associate the same targets across multiple camera views. Existing methods typically adopt a two-stage framework, involving single-camera tracking followed by inter-camera tracking. However, in this paradigm, multi-view information is used only to recover missed matches in the first stage, providing a limited contribution to overall tracking. To address this issue, we propose GMT, a global MCMT tracking framework that jointly exploits intra-view and inter-view cues for tracking. Specifically, instead of assigning trajectories independently for each view, we integrate the same historical targets across different views as global trajectories, thereby reformulating the two-stage tracking as a unified global-level trajectory-target association process. We introduce a Cross-View Feature Consistency Enhancement (CFCE) module to align visual and spatial features across views, providing a consistent feature space for global trajectory modeling. With these aligned features, the Global Trajectory Association (GTA) module associates new detections with existing global trajectories, enabling direct use of multi-view information. Compared to the two-stage framework, GMT achieves significant improvements on existing datasets, with gains of up to 21.3 percent in CVMA and 17.2 percent in CVIDF1. Furthermore, we introduce VisionTrack, a high-quality, large-scale MCMT dataset providing significantly greater diversity than existing datasets. Our code and dataset will be released.

[273] KEPT: Knowledge-Enhanced Prediction of Trajectories from Consecutive Driving Frames with Vision-Language Models

Yujin Wang, Tianyi Wang, Quanfeng Liu, Wenxian Fan, Junfeng Jiao, Christian Claudel, Yunbing Yan, Bingzhao Gao, Jianqiang Wang, Hong Chen

Main category: cs.CV

TL;DR: KEPT is a knowledge-enhanced vision-language model framework that predicts ego trajectories from driving frames using temporal-spatial fusion, retrieval-augmented generation, and explicit planning constraints, achieving state-of-the-art performance on nuScenes dataset.

Details

Motivation: Existing vision-language models often fail to accurately understand driving scenes and generate trustworthy trajectories for autonomous driving, creating safety and reliability concerns.

Method: Integrates temporal frequency-spatial fusion video encoder with self-supervised learning, k-means & HNSW retrieval-augmented generation pipeline, chain-of-thought prompts with explicit planning constraints, and triple-stage fine-tuning paradigm.

Result: Achieves best open-loop performance on nuScenes dataset compared with baseline methods, with comprehensive ablation studies validating the effectiveness of various components.

Conclusion: KEPT offers a promising, data-efficient way toward trustworthy trajectory prediction in autonomous driving through knowledge enhancement and explicit planning constraints.

Abstract: Accurate short-horizon trajectory prediction is crucial for safe and reliable autonomous driving. However, existing vision-language models (VLMs) often fail to accurately understand driving scenes and generate trustworthy trajectories. To address this challenge, this paper introduces KEPT, a knowledge-enhanced VLM framework that predicts ego trajectories directly from consecutive front-view driving frames. KEPT integrates a temporal frequency-spatial fusion (TFSF) video encoder, which is trained via self-supervised learning with hard-negative mining, with a k-means & HNSW retrieval-augmented generation (RAG) pipeline. Retrieved prior knowledge is added into chain-of-thought (CoT) prompts with explicit planning constraints, while a triple-stage fine-tuning paradigm aligns the VLM backbone to enhance spatial perception and trajectory prediction capabilities. Evaluated on nuScenes dataset, KEPT achieves the best open-loop performance compared with baseline methods. Ablation studies on fine-tuning stages, Top-K value of RAG, different retrieval strategies, vision encoders, and VLM backbones are conducted to demonstrate the effectiveness of KEPT. These results indicate that KEPT offers a promising, data-efficient way toward trustworthy trajectory prediction in autonomous driving.

[274] MovieDreamer: Hierarchical Generation for Coherent Long Visual Sequence

Canyu Zhao, Mingyu Liu, Wen Wang, Weihua Chen, Fan Wang, Hao Chen, Bo Zhang, Chunhua Shen

Main category: cs.CV

TL;DR: MovieDreamer is a hierarchical framework combining autoregressive models for narrative coherence and diffusion models for high-quality video rendering, enabling long-duration video generation with complex plots and character consistency.

Details

Motivation: Current video generation methods using diffusion models struggle with modeling complex narratives and maintaining character consistency over extended periods, which is crucial for long-form content like movies.

Method: A hierarchical framework that uses autoregressive models to predict sequences of visual tokens for global narrative coherence, followed by diffusion-based rendering to transform these tokens into high-quality video frames. Also employs a multimodal script with detailed character information and visual style for enhanced continuity.

Result: Extensive experiments across various movie genres show superior visual and narrative quality, with significantly extended generation duration beyond current capabilities.

Conclusion: MovieDreamer successfully pioneers long-duration video generation with intricate plot progressions and high visual fidelity by integrating autoregressive modeling with diffusion rendering, effectively addressing limitations of existing approaches.

Abstract: Recent advancements in video generation have primarily leveraged diffusion models for short-duration content. However, these approaches often fall short in modeling complex narratives and maintaining character consistency over extended periods, which is essential for long-form video production like movies. We propose MovieDreamer, a novel hierarchical framework that integrates the strengths of autoregressive models with diffusion-based rendering to pioneer long-duration video generation with intricate plot progressions and high visual fidelity. Our approach utilizes autoregressive models for global narrative coherence, predicting sequences of visual tokens that are subsequently transformed into high-quality video frames through diffusion rendering. This method is akin to traditional movie production processes, where complex stories are factorized down into manageable scene capturing. Further, we employ a multimodal script that enriches scene descriptions with detailed character information and visual style, enhancing continuity and character identity across scenes. We present extensive experiments across various movie genres, demonstrating that our approach not only achieves superior visual and narrative quality but also effectively extends the duration of generated content significantly beyond current capabilities. Homepage: https://aim-uofa.github.io/MovieDreamer/.

[275] E$^{3}$NeRF: Efficient Event-Enhanced Neural Radiance Fields from Blurry Images

Yunshan Qi, Jia Li, Yifan Zhao, Yu Zhang, Lin Zhu

Main category: cs.CV

TL;DR: E³NeRF reconstructs sharp NeRF from blurry images using event streams, modeling blur and event generation processes with specialized losses, and improving efficiency through event-guided temporal distribution.

Details

Motivation: NeRF struggles with blurry input images common in real-world scenarios, especially for high-speed motion and low-light conditions where traditional image-based approaches fail to reconstruct sharp 3D representations.

Method: Proposes Efficient Event-Enhanced NeRF (E³NeRF) that combines blurry images with event streams, introduces blur rendering loss and event rendering loss to model physical blur and event generation, leverages event data for efficient temporal training distribution, and includes camera pose estimation for real-world applications.

Result: E³NeRF effectively learns sharp NeRF representations from blurry inputs, demonstrating superior performance for high-speed non-uniform motion and low-light scenes compared to previous image-based and event-based NeRF methods.

Conclusion: The framework successfully integrates event streams with blurry images to reconstruct sharp 3D representations, making significant improvements in handling challenging real-world scenarios through better utilization of the relationship between events and images.

Abstract: Neural Radiance Fields (NeRF) achieves impressive novel view rendering performance by learning implicit 3D representation from sparse view images. However, it is difficult to reconstruct a sharp NeRF from blurry input that often occurs in the wild. To solve this problem, we propose a novel Efficient Event-Enhanced NeRF (E$^{3}$NeRF), reconstructing sharp NeRF by utilizing both blurry images and corresponding event streams. A blur rendering loss and an event rendering loss are introduced, which guide the NeRF training via modeling the physical image motion blur process and event generation process, respectively. To improve the efficiency of the framework, we further leverage the latent spatial-temporal blur information in the event stream to evenly distribute training over temporal blur and focus training on spatial blur. Moreover, a camera pose estimation framework for real-world data is built with the guidance of the events, generalizing the method to more practical applications. Compared to previous image-based and event-based NeRF works, our framework makes more profound use of the internal relationship between events and images. Extensive experiments on both synthetic data and real-world data demonstrate that E$^{3}$NeRF can effectively learn a sharp NeRF from blurry images, especially for high-speed non-uniform motion and low-light scenes.
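The two physical processes named in the abstract can be illustrated with commonly used simplified models. The abstract does not give the paper's exact formulation, so everything here is an assumption: blur is approximated as the average of sharp frames sampled during the exposure, and events follow the standard log-intensity contrast-threshold model. The function names and the threshold value are hypothetical.

```python
import math

def synthesize_blur(sharp_frames):
    """Approximate a blurry image as the per-pixel mean of sharp frames
    sampled within the exposure time (each frame is a flat list of pixels)."""
    n = len(sharp_frames)
    return [sum(pixels) / n for pixels in zip(*sharp_frames)]

def expected_events(intensity_start, intensity_end, threshold=0.2):
    """Signed event count per pixel under the contrast-threshold model:
    an event fires each time log intensity changes by `threshold`."""
    counts = []
    for i0, i1 in zip(intensity_start, intensity_end):
        delta = math.log(i1) - math.log(i0)
        counts.append(int(delta / threshold))  # truncates toward zero
    return counts

# Toy usage on a two-pixel image rendered at two moments in the exposure.
blurry = synthesize_blur([[0.0, 1.0], [1.0, 1.0]])  # [0.5, 1.0]
events = expected_events([1.0, 1.0], [math.e, 1.0], threshold=0.4)  # [2, 0]
```

A training loss in this spirit would compare the synthesized blur against the captured blurry image, and the predicted event counts against the recorded event stream.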

[276] Cross-Layer Vision Smoothing: Enhancing Visual Understanding via Sustained Focus on Key Objects in Large Vision-Language Models

Jianfei Zhao, Feng Zhang, Xin Sun, Chong Feng, Zhixing Tan

Main category: cs.CV

TL;DR: CLVS improves LVLMs’ visual capabilities by maintaining sustained focus on key objects through cross-layer vision smoothing with a vision memory mechanism.

Details

Motivation: LVLMs have brief attention to key objects, and sustained focus could enhance their visual understanding capabilities.

Method: Cross-Layer Vision Smoothing (CLVS) with vision memory that smooths attention distribution across layers, initialized with position-unbiased attention and updated iteratively, terminated based on uncertainty indicators.

Result: In experiments on four benchmarks across three LVLMs, achieves state-of-the-art overall performance on a variety of visual understanding tasks and results comparable to leading approaches on image captioning benchmarks.

Conclusion: CLVS effectively improves LVLMs’ visual capabilities through sustained attention on key objects, demonstrating effectiveness and generalizability.

Abstract: Large Vision-Language Models (LVLMs) can accurately locate key objects in images, yet their attention to these objects tends to be very brief. Motivated by the hypothesis that sustained focus on key objects can improve LVLMs’ visual capabilities, we propose Cross-Layer Vision Smoothing (CLVS). The core idea of CLVS is to incorporate a vision memory that smooths the attention distribution across layers. Specifically, we initialize this vision memory with position-unbiased visual attention in the first layer. In subsequent layers, the model’s visual attention jointly considers the vision memory from previous layers, while the memory is updated iteratively, thereby maintaining smooth attention on key objects. Given that visual understanding primarily occurs in the early and middle layers of the model, we use uncertainty as an indicator of completed visual understanding and terminate the smoothing process accordingly. Experiments on four benchmarks across three LVLMs confirm the effectiveness and generalizability of our method. CLVS achieves state-of-the-art overall performance across a variety of visual understanding tasks and attains comparable results to the leading approaches on image captioning benchmarks.

[277] OpenScan: A Benchmark for Generalized Open-Vocabulary 3D Scene Understanding

Youjun Zhao, Jiaying Lin, Shuquan Ye, Qianshi Pang, Rynson W. H. Lau

Main category: cs.CV

TL;DR: Introduces Generalized Open-Vocabulary 3D Scene Understanding (GOV-3D) and OpenScan benchmark to evaluate 3D scene understanding beyond object classes, focusing on fine-grained attributes like affordance and material.

Details

Motivation: Existing open-vocabulary 3D methods focus only on object classes, which is insufficient for holistic 3D scene understanding. Need to explore open vocabulary problem beyond object classes to generalized knowledge.

Method: Created OpenScan benchmark with 3D object attributes across eight linguistic aspects including affordance, property, and material. Evaluated state-of-the-art OV-3D methods on this benchmark.

Result: Existing OV-3D methods struggle with abstract vocabularies in GOV-3D task. Scaling up object classes during training doesn’t solve this challenge.

Conclusion: Highlights limitations of current methods and explores promising directions to address shortcomings in generalized open-vocabulary 3D scene understanding.

Abstract: Open-vocabulary 3D scene understanding (OV-3D) aims to localize and classify novel objects beyond the closed set of object classes. However, existing approaches and benchmarks primarily focus on the open vocabulary problem within the context of object classes, which is insufficient for providing a holistic evaluation of the extent to which a model understands the 3D scene. In this paper, we introduce a more challenging task called Generalized Open-Vocabulary 3D Scene Understanding (GOV-3D) to explore the open vocabulary problem beyond object classes. It encompasses an open and diverse set of generalized knowledge, expressed as linguistic queries of fine-grained and object-specific attributes. To this end, we contribute a new benchmark named OpenScan, which consists of 3D object attributes across eight representative linguistic aspects, including affordance, property, and material. We further evaluate state-of-the-art OV-3D methods on our OpenScan benchmark and discover that these methods struggle to comprehend the abstract vocabularies of the GOV-3D task, a challenge that cannot be addressed simply by scaling up object classes during training. We highlight the limitations of existing methodologies and explore promising directions to overcome the identified shortcomings.

[278] Temporally Compressed 3D Gaussian Splatting for Dynamic Scenes

Saqib Javed, Ahmad Jarrar Khan, Corentin Dumery, Chen Zhao, Mathieu Salzmann

Main category: cs.CV

TL;DR: TC3DGS is a compression technique for dynamic 3D Gaussian Splatting that achieves up to 67x compression with minimal quality loss through temporal pruning, mixed-precision quantization, and trajectory interpolation.

Details

Motivation: Current dynamic 3D Gaussian methods require substantial memory reduction and rendering efficiency improvements to be viable for real-time applications like AR/VR and gaming, especially for scenes with complex motions or long sequences.

Method: TC3DGS uses selective temporal pruning of Gaussians, gradient-aware mixed-precision quantization, and an adapted Ramer-Douglas-Peucker algorithm for Gaussian trajectory interpolation to compress dynamic 3D Gaussian representations.

Result: The method achieves up to 67x compression on multiple datasets with minimal or no degradation in visual quality.

Conclusion: TC3DGS effectively compresses dynamic 3D Gaussian representations, making them more suitable for real-time applications on resource-constrained devices.

Abstract: Recent advancements in high-fidelity dynamic scene reconstruction have leveraged dynamic 3D Gaussians and 4D Gaussian Splatting for realistic scene representation. However, to make these methods viable for real-time applications such as AR/VR, gaming, and rendering on low-power devices, substantial reductions in memory usage and improvements in rendering efficiency are required. While many state-of-the-art methods prioritize lightweight implementations, they struggle to handle scenes with complex motions or long sequences. In this work, we introduce Temporally Compressed 3D Gaussian Splatting (TC3DGS), a novel technique designed specifically to effectively compress dynamic 3D Gaussian representations. TC3DGS selectively prunes Gaussians based on their temporal relevance and employs gradient-aware mixed-precision quantization to dynamically compress Gaussian parameters. In addition, TC3DGS exploits an adapted version of the Ramer-Douglas-Peucker algorithm to further reduce storage by interpolating Gaussian trajectories across frames. Our experiments on multiple datasets demonstrate that TC3DGS achieves up to 67$\times$ compression with minimal or no degradation in visual quality. More results and videos are provided in the supplementary. Project Page: https://ahmad-jarrar.github.io/tc-3dgs/
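For readers unfamiliar with the Ramer-Douglas-Peucker algorithm, the sketch below shows the classic version applied to a list of positions over time. It is the textbook algorithm, not the adapted variant used in TC3DGS, whose details the abstract does not give. The function names and the `epsilon` tolerance are illustrative choices. The algorithm keeps only the points that deviate from a straight-line approximation by more than `epsilon`; the dropped points can then be recovered approximately by interpolating between the kept ones.

```python
import math

def _point_line_distance(p, a, b):
    """Distance from point p to the line through a and b (any dimension)."""
    ab = [bi - ai for ai, bi in zip(a, b)]
    ap = [pi - ai for ai, pi in zip(a, p)]
    ab_len2 = sum(c * c for c in ab)
    if ab_len2 == 0.0:
        # a and b coincide, so fall back to the point-to-point distance.
        return math.sqrt(sum(c * c for c in ap))
    t = sum(x * y for x, y in zip(ap, ab)) / ab_len2
    proj = [ai + t * c for ai, c in zip(a, ab)]
    return math.sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, proj)))

def rdp(points, epsilon):
    """Classic Ramer-Douglas-Peucker simplification of a polyline."""
    if len(points) < 3:
        return list(points)
    first, last = points[0], points[-1]
    max_dist, index = 0.0, 0
    for i in range(1, len(points) - 1):
        d = _point_line_distance(points[i], first, last)
        if d > max_dist:
            max_dist, index = d, i
    if max_dist > epsilon:
        # Keep the farthest point and simplify both halves recursively.
        left = rdp(points[: index + 1], epsilon)
        right = rdp(points[index:], epsilon)
        return left[:-1] + right
    return [first, last]

# Toy usage: a mostly straight trajectory with one bump at the fourth frame.
trajectory = [(0, 0, 0), (1, 0, 0), (2, 0, 0), (3, 2, 0), (4, 0, 0), (5, 0, 0)]
keyframes = rdp(trajectory, epsilon=0.5)
```

Raising `epsilon` stores fewer keyframes at the cost of a larger interpolation error, which is the storage/quality trade-off the compression relies on.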

[279] DVP-MVS: Synergize Depth-Edge and Visibility Prior for Multi-View Stereo

Zhenlong Yuan, Jinguo Luo, Fei Shen, Zhaoxin Li, Cong Liu, Tianlu Mao, Zhaoqi Wang

Main category: cs.CV

TL;DR: DVP-MVS improves multi-view stereo by addressing patch deformation instability through depth-edge alignment and cross-view visibility priors, achieving state-of-the-art performance on benchmarks.

Details

Motivation: Existing patch deformation methods suffer from deformation instability caused by mistaken edge-skipping and visibility occlusion, leading to potential estimation deviation in multi-view stereo reconstruction.

Method: Proposes depth-edge aligned guidance using Depth Anything V2 and Roberts operator with erosion-dilation strategy, cross-view visibility priors through depth reprojection, and improved propagation/refinement with multi-view geometry consistency using visible hemispherical normals and epipolar depth differences.

Result: Extensive evaluations on ETH3D and Tanks & Temples benchmarks demonstrate state-of-the-art performance with excellent robustness and generalization.

Conclusion: The proposed DVP-MVS method effectively addresses patch deformation instability through synergistic depth-edge alignment and cross-view visibility priors, achieving superior multi-view stereo reconstruction performance.

Abstract: Patch deformation-based methods have recently exhibited substantial effectiveness in multi-view stereo, due to the incorporation of deformable and expandable perception to reconstruct textureless areas. However, such approaches typically focus on exploring correlative reliable pixels to alleviate match ambiguity during patch deformation, but ignore the deformation instability caused by mistaken edge-skipping and visibility occlusion, leading to potential estimation deviation. To remedy the above issues, we propose DVP-MVS, which innovatively synergizes depth-edge aligned and cross-view prior for robust and visibility-aware patch deformation. Specifically, to avoid unexpected edge-skipping, we first utilize Depth Anything V2 followed by the Roberts operator to initialize coarse depth and edge maps respectively, both of which are further aligned through an erosion-dilation strategy to generate fine-grained homogeneous boundaries for guiding patch deformation. In addition, we reform view selection weights as visibility maps and restore visible areas by cross-view depth reprojection, then regard them as cross-view prior to facilitate visibility-aware patch deformation. Finally, we improve propagation and refinement with multi-view geometry consistency by introducing aggregated visible hemispherical normals based on view selection and local projection depth differences based on epipolar lines, respectively. Extensive evaluations on ETH3D and Tanks & Temples benchmarks demonstrate that our method can achieve state-of-the-art performance with excellent robustness and generalization.
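The Roberts operator mentioned above is a standard 2x2 diagonal-difference edge detector. The sketch below applies it to a depth map stored as nested lists. It covers only this one step: the coarse depth would come from Depth Anything V2 in the paper, and the erosion-dilation alignment is omitted. The function name `roberts_edges` is an illustrative choice.

```python
import math

def roberts_edges(depth):
    """Roberts cross gradient magnitude of an H x W depth map.

    Returns an (H-1) x (W-1) map; large values mark depth discontinuities.
    """
    h, w = len(depth), len(depth[0])
    edges = [[0.0] * (w - 1) for _ in range(h - 1)]
    for y in range(h - 1):
        for x in range(w - 1):
            # Differences along the two diagonals of each 2x2 window.
            gx = depth[y][x] - depth[y + 1][x + 1]
            gy = depth[y][x + 1] - depth[y + 1][x]
            edges[y][x] = math.hypot(gx, gy)
    return edges

# Toy usage: a vertical depth step between the second and third columns.
depth = [
    [1, 1, 5, 5],
    [1, 1, 5, 5],
    [1, 1, 5, 5],
]
edges = roberts_edges(depth)  # only the middle column responds
```

Thresholding the resulting magnitudes gives a binary edge map that could mark boundaries a deforming patch should not cross.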

[280] LikePhys: Evaluating Intuitive Physics Understanding in Video Diffusion Models via Likelihood Preference

Jianhao Yuan, Fabio Pizzati, Francesco Pinto, Lars Kunze, Ivan Laptev, Paul Newman, Philip Torr, Daniele De Martini

Main category: cs.CV

TL;DR: LikePhys is a training-free method that evaluates intuitive physics understanding in video diffusion models by distinguishing physically valid vs. impossible videos using denoising objectives as likelihood surrogates.

Details

Motivation: Accurately evaluating physics understanding in video diffusion models is challenging due to difficulty disentangling physics correctness from visual appearance in generated videos.

Method: Uses denoising objective as ELBO-based likelihood surrogate on curated valid-invalid video pairs across 12 scenarios spanning 4 physics domains. Introduces Plausibility Preference Error (PPE) metric.

Result: PPE demonstrates strong alignment with human preference, outperforming state-of-the-art evaluator baselines. Models show clear improvement in physics understanding as capacity and inference settings scale, though struggle with complex chaotic dynamics.

Conclusion: LikePhys provides effective evaluation of intuitive physics in video diffusion models, revealing domain-specific capacity variations and scaling trends in physics understanding.

Abstract: Intuitive physics understanding in video diffusion models plays an essential role in building general-purpose physically plausible world simulators, yet accurately evaluating such capacity remains a challenging task due to the difficulty in disentangling physics correctness from visual appearance in generation. To this end, we introduce LikePhys, a training-free method that evaluates intuitive physics in video diffusion models by distinguishing physically valid and impossible videos using the denoising objective as an ELBO-based likelihood surrogate on a curated dataset of valid-invalid pairs. By testing on our constructed benchmark of twelve scenarios spanning over four physics domains, we show that our evaluation metric, Plausibility Preference Error (PPE), demonstrates strong alignment with human preference, outperforming state-of-the-art evaluator baselines. We then systematically benchmark intuitive physics understanding in current video diffusion models. Our study further analyses how model design and inference settings affect intuitive physics understanding and highlights domain-specific capacity variations across physical laws. Empirical results show that, despite current models struggling with complex and chaotic dynamics, there is a clear trend of improvement in physics understanding as model capacity and inference settings scale.
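One plausible reading of a likelihood-preference metric can be sketched as follows. The abstract does not give the exact definition of PPE, so this is an assumption: each valid-invalid pair contributes one error when the model's denoising loss (used as a stand-in for negative log-likelihood) is not lower on the physically valid video. The function name and the tie-handling rule are hypothetical.

```python
def plausibility_preference_error(pair_losses):
    """Fraction of pairs where the model fails to prefer the valid video.

    pair_losses: list of (loss_valid, loss_invalid) tuples, where each loss
    is a denoising loss averaged over noise levels (lower is more likely).
    Ties are counted as errors, which is an assumption of this sketch.
    """
    if not pair_losses:
        raise ValueError("need at least one valid-invalid pair")
    errors = sum(1 for valid, invalid in pair_losses if valid >= invalid)
    return errors / len(pair_losses)

# Toy usage with made-up loss values for four pairs.
losses = [(0.1, 0.3), (0.5, 0.2), (0.4, 0.4), (0.2, 0.9)]
ppe = plausibility_preference_error(losses)  # 0.5
```

Under this reading, a model with no physics preference would score around 0.5 and lower values indicate better physics understanding.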

[281] Personalized Generative Low-light Image Denoising and Enhancement

Xijun Wang, Prateek Chennuri, Dilshan Godaliyadda, Yu Yuan, Bole Ma, Xingguang Zhang, Hamid R. Sheikh, Stanley Chan

Main category: cs.CV

TL;DR: DiffPGD is a personalized generative denoising method that uses a diffusion model customized for individual users by leveraging their photo galleries to create identity-consistent physical buffers, enabling superior low-light image restoration without hallucination.

Details

Motivation: Current cameras perform poorly in low-light conditions due to photon shot noise and sensor read noise. Existing generative methods create hallucinatory content at low SNR, so there's a need for personalized approaches that leverage user photo galleries for better restoration.

Method: Developed Diffusion-based Personalized Generative Denoising (DiffPGD) that builds customized diffusion models for individual users. Key innovation is an identity-consistent physical buffer that extracts physical attributes from user photo galleries, serving as a robust prior integrated into diffusion models without fine-tuning.

Result: DiffPGD achieves superior image denoising and enhancement performance across a wide range of low-light testing scenarios compared to existing diffusion-based denoising approaches, effectively restoring degraded images without hallucinatory content.

Conclusion: The personalized approach using identity-consistent physical buffers from user photo galleries enables effective low-light image restoration with diffusion models, overcoming limitations of traditional and existing generative methods while maintaining identity consistency.

Abstract: Modern cameras’ performance in low-light conditions remains suboptimal due to fundamental limitations in photon shot noise and sensor read noise. Generative image restoration methods have shown promising results compared to traditional approaches, but they suffer from hallucinatory content generation when the signal-to-noise ratio (SNR) is low. Leveraging the availability of personalized photo galleries of the users, we introduce Diffusion-based Personalized Generative Denoising (DiffPGD), a new approach that builds a customized diffusion model for individual users. Our key innovation lies in the development of an identity-consistent physical buffer that extracts the physical attributes of the person from the gallery. This ID-consistent physical buffer serves as a robust prior that can be seamlessly integrated into the diffusion model to restore degraded images without the need for fine-tuning. Over a wide range of low-light testing scenarios, we show that DiffPGD achieves superior image denoising and enhancement performance compared to existing diffusion-based denoising approaches. Our project page can be found at https://genai-restore.github.io/DiffPGD/.

[282] LiHi-GS: LiDAR-Supervised Gaussian Splatting for Highway Driving Scene Reconstruction

Pou-Chun Kung, Xianling Zhang, Katherine A. Skinner, Nikita Jaipuria

Main category: cs.CV

TL;DR: Proposes a novel Gaussian Splatting method for dynamic scene synthesis and editing with LiDAR supervision, focusing on challenging highway scenes for autonomous driving.

Details

Motivation: Existing methods overlook highway scenarios and underutilize LiDAR data, limiting their applicability to real autonomous driving needs.

Method: Uses Gaussian Splatting with LiDAR supervision for improved scene reconstruction and supports LiDAR rendering, focusing on highway scenes with sparse views.

Result: First method to focus on highway scenes for autonomous driving, enabling better scene reconstruction and LiDAR data synthesis.

Conclusion: The approach addresses critical gaps in autonomous driving scene reconstruction by leveraging LiDAR supervision and targeting challenging highway scenarios.

Abstract: Photorealistic 3D scene reconstruction plays an important role in autonomous driving, enabling the generation of novel data from existing datasets to simulate safety-critical scenarios and expand training data without additional acquisition costs. Gaussian Splatting (GS) facilitates real-time, photorealistic rendering with an explicit 3D Gaussian representation of the scene, providing faster processing and more intuitive scene editing than the implicit Neural Radiance Fields (NeRFs). While extensive GS research has yielded promising advancements in autonomous driving applications, they overlook two critical aspects: First, existing methods mainly focus on low-speed and feature-rich urban scenes and ignore the fact that highway scenarios play a significant role in autonomous driving. Second, while LiDARs are commonplace in autonomous driving platforms, existing methods learn primarily from images and use LiDAR only for initial estimates or without precise sensor modeling, thus missing out on leveraging the rich depth information LiDAR offers and limiting the ability to synthesize LiDAR data. In this paper, we propose a novel GS method for dynamic scene synthesis and editing with improved scene reconstruction through LiDAR supervision and support for LiDAR rendering. Unlike prior works that are tested mostly on urban datasets, to the best of our knowledge, we are the first to focus on the more challenging and highly relevant highway scenes for autonomous driving, with sparse sensor views and monotone backgrounds. Visit our project page at: https://umautobots.github.io/lihi_gs
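
Illustrative sketch (not from the paper): the abstract says reconstruction is improved through LiDAR supervision but does not give the loss. One common way to supervise rendered depth with sparse range measurements is a masked L1 penalty, sketched below. The function name, the use of `None` for rays without a LiDAR return, and the L1 form are all assumptions made for illustration.

```python
from typing import List, Optional


def lidar_depth_l1(rendered: List[float], lidar: List[Optional[float]]) -> float:
    """Mean absolute depth error over rays that have a LiDAR return.

    Hypothetical helper: `rendered` holds depths produced by the renderer,
    `lidar` holds measured depths, with None where the sensor saw nothing.
    """
    pairs = [(r, l) for r, l in zip(rendered, lidar) if l is not None]
    if not pairs:
        return 0.0  # no valid rays, so no supervision signal
    return sum(abs(r - l) for r, l in pairs) / len(pairs)


rendered_depth = [10.0, 20.0, 30.0]
lidar_depth = [12.0, None, 29.0]  # the middle ray has no return
loss = lidar_depth_l1(rendered_depth, lidar_depth)
```

Masking matters here because LiDAR returns are sparse: rays without a measurement should contribute nothing rather than be treated as zero depth.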

[283] VisionReward: Fine-Grained Multi-Dimensional Human Preference Learning for Image and Video Generation

Jiazheng Xu, Yu Huang, Jiale Cheng, Yuanming Yang, Jiajun Xu, Yuan Wang, Wenbo Duan, Shen Yang, Qunlin Jin, Shurun Li, Jiayan Teng, Zhuoyi Yang, Wendi Zheng, Xiao Liu, Dan Zhang, Ming Ding, Xiaohan Zhang, Xiaotao Gu, Shiyu Huang, Minlie Huang, Jie Tang, Yuxiao Dong

Main category: cs.CV

TL;DR: VisionReward is a framework for learning human visual preferences in image and video generation using hierarchical assessment and interpretable linear weighting, outperforming existing reward models.

Details

Motivation: Existing reward models for visual generation have limitations including lack of interpretability and potential biases, making human preference alignment challenging.

Method: Uses hierarchical visual assessment framework to capture fine-grained preferences, linear weighting for interpretability, and multi-dimensional consistent strategy during preference optimization.

Result: Significantly outperforms existing image/video reward models, surpassing VideoScore by 17.2% in preference prediction accuracy and achieving 31.6% higher pairwise win rate in text-to-video models.

Conclusion: VisionReward provides an effective and interpretable framework for aligning visual generative models with human preferences, with demonstrated superior performance over existing methods.

Abstract: Visual generative models have achieved remarkable progress in synthesizing photorealistic images and videos, yet aligning their outputs with human preferences across critical dimensions remains a persistent challenge. Though reinforcement learning from human feedback offers promise for preference alignment, existing reward models for visual generation face limitations, including black-box scoring without interpretability and potentially resultant unexpected biases. We present VisionReward, a general framework for learning human visual preferences in both image and video generation. Specifically, we employ a hierarchical visual assessment framework to capture fine-grained human preferences, and leverage linear weighting to enable interpretable preference learning. Furthermore, we propose a multi-dimensional consistent strategy when using VisionReward as a reward model during preference optimization for visual generation. Experiments show that VisionReward can significantly outperform existing image and video reward models on both machine metrics and human evaluation. Notably, VisionReward surpasses VideoScore by 17.2% in preference prediction accuracy, and text-to-video models with VisionReward achieve a 31.6% higher pairwise win rate compared to the same models using VideoScore. All code and datasets are provided at https://github.com/THUDM/VisionReward.
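
Illustrative sketch (not from the paper): linear weighting is interpretable because each fine-grained judgment contributes a fixed, readable amount to the final score. The sketch below assumes the fine-grained judgments are yes/no answers; the question names and weight values are hypothetical and only show the mechanism.

```python
from typing import Dict


def linear_preference_score(answers: Dict[str, bool], weights: Dict[str, float]) -> float:
    """Weighted sum of yes/no judgments; a 'yes' adds that question's weight."""
    return sum(weights[question] for question, answer in answers.items() if answer)


# Hypothetical questions and weights, chosen only for illustration.
weights = {
    "subject_is_sharp": 0.5,
    "hands_look_natural": 1.0,
    "has_visible_artifacts": -1.5,
}
image_a = {"subject_is_sharp": True, "hands_look_natural": True, "has_visible_artifacts": False}
image_b = {"subject_is_sharp": True, "hands_look_natural": False, "has_visible_artifacts": True}

score_a = linear_preference_score(image_a, weights)
score_b = linear_preference_score(image_b, weights)
preferred = "a" if score_a > score_b else "b"
```

Because the score is a plain sum, the gap between two candidates can be traced back to the specific questions on which they differ.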

[284] Metis-HOME: Hybrid Optimized Mixture-of-Experts for Multimodal Reasoning

Xiaohan Lan, Fanfan Liu, Haibo Qiu, Siqi Yang, Delian Ruan, Peng Shi, Lin Ma

Main category: cs.CV

TL;DR: Metis-HOME is a Hybrid Optimized Mixture-of-Experts framework that enables “Hybrid Thinking” by splitting a multimodal model into thinking and non-thinking expert branches with a dynamic router, improving both complex reasoning and general capabilities while resolving the efficiency trade-off.

Details

Motivation: Current multimodal reasoning models are inefficient for simple queries due to expensive reasoning processes and suffer from impaired general understanding capabilities when specialized for reasoning tasks.

Method: Adapt Qwen2.5-VL-7B into a Mixture-of-Experts architecture with two expert branches: thinking branch for complex multi-step reasoning and non-thinking branch for rapid direct inference, plus a lightweight trainable router for dynamic query allocation.

Result: Substantially enhances complex reasoning abilities while improving general capabilities, reversing the degradation trend seen in other reasoning-specialized models.

Conclusion: Establishes a new paradigm for building powerful and versatile MLLMs that effectively resolves the reasoning-vs-generalization dilemma.

Abstract: Inspired by recent advancements in LLM reasoning, the field of multimodal reasoning has seen remarkable progress, achieving significant performance gains on intricate tasks such as mathematical problem-solving. Despite this progress, current multimodal large reasoning models exhibit two key limitations. They tend to employ computationally expensive reasoning even for simple queries, leading to inefficiency. Furthermore, this focus on specialized reasoning often impairs their broader, more general understanding capabilities. In this paper, we propose Metis-HOME: a Hybrid Optimized Mixture-of-Experts framework designed to address this trade-off. Metis-HOME enables a “Hybrid Thinking” paradigm by structuring the original dense model into two distinct expert branches: a thinking branch tailored for complex, multi-step reasoning, and a non-thinking branch optimized for rapid, direct inference on tasks like general VQA and OCR. A lightweight, trainable router dynamically allocates queries to the most suitable expert. We instantiate Metis-HOME by adapting the Qwen2.5-VL-7B into an MoE architecture. Comprehensive evaluations reveal that our approach not only substantially enhances complex reasoning abilities but also improves the model’s general capabilities, reversing the degradation trend observed in other reasoning-specialized models. Our work establishes a new paradigm for building powerful and versatile MLLMs, effectively resolving the prevalent reasoning-vs-generalization dilemma. Code and weights are available at https://github.com/MM-Thinking/Metis-HOME.
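
Illustrative sketch (not from the paper): the abstract describes a lightweight router that sends each query to either the thinking or the non-thinking branch, without specifying its form. The sketch below assumes a logistic scorer over a small query feature vector; the feature meanings, weights, and threshold are hypothetical.

```python
import math
from typing import Sequence


def route(features: Sequence[float], weights: Sequence[float], bias: float,
          threshold: float = 0.5) -> str:
    """Return which branch should handle the query (hypothetical router)."""
    logit = sum(f * w for f, w in zip(features, weights)) + bias
    p_think = 1.0 / (1.0 + math.exp(-logit))  # probability the query needs reasoning
    return "thinking" if p_think >= threshold else "non_thinking"


# Hypothetical features: [looks_like_math_problem, looks_like_plain_ocr]
router_weights = [2.0, -2.0]
math_query = route([1.0, 0.0], router_weights, bias=0.0)
ocr_query = route([0.0, 1.0], router_weights, bias=0.0)
```

Under this reading, simple queries skip the expensive reasoning path entirely, which is where the efficiency gain would come from.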

[285] RobustMerge: Parameter-Efficient Model Merging for MLLMs with Direction Robustness

Fanhu Zeng, Haiyang Guo, Fei Zhu, Li Shen, Hao Tang

Main category: cs.CV

TL;DR: RobustMerge is a training-free parameter-efficient merging method that maintains direction robustness for merging expert models fine-tuned with parameter-efficient methods, addressing the failure of existing merging methods in efficient tuning scenarios.

Details

Motivation: As parameter-efficient tuning becomes common practice for obtaining task-specific models, existing merging methods designed for full fine-tuning fail when applied to efficiently tuned models. There's a need for dedicated efficient merging methods.

Method: The method analyzes low-rank decomposition and identifies direction robustness as crucial. It prunes parameters and scales coefficients from inter-parameter relations to maintain direction stability, and performs cross-task normalization to enhance unseen task generalization.

Result: Experiments on a diverse multimodal task benchmark show outstanding performance and generalizability. The method effectively maintains direction robustness and handles task interference.

Conclusion: RobustMerge successfully addresses the challenge of merging parameter-efficient tuned models by maintaining direction robustness through complementary parameter adaptation, providing an effective solution for creating universal multi-task models without data leakage.

Abstract: Fine-tuning pre-trained models with custom data leads to numerous expert models on specific tasks. Merging models into one universal model to empower multi-task ability refraining from data leakage has gained popularity. With the expansion in data and model size, parameter-efficient tuning becomes the common practice for obtaining task-specific models efficiently. However, few methods are dedicated to efficient merging, and existing methods designed for full fine-tuning merging fail under efficient merging. To address the issue, we analyze from low-rank decomposition and reveal that direction robustness during merging is crucial for merging efficient modules. We furthermore uncover that compensating for the gap between stark singular values contributes to direction robustness. Therefore, we propose RobustMerge, a training-free parameter-efficient merging method with complementary parameter adaptation to maintain direction robustness. Specifically, we (1) prune parameters and scale coefficients from inter-parameter relation for singular values to maintain direction stability away from task interference, and (2) perform cross-task normalization to enhance unseen task generalization. We establish a benchmark consisting of diverse multimodal tasks, on which we conduct experiments to certify the outstanding performance and generalizability of our method. Additional studies and extensive analyses further showcase the effectiveness. Code is available at https://github.com/AuroraZengfh/RobustMerge.

[286] Ming-Flash-Omni: A Sparse, Unified Architecture for Multimodal Perception and Generation

Inclusion AI, Bowen Ma, Cheng Zou, Canxiang Yan, Chunxiang Jin, Chunjie Shen, Chenyu Lian, Dandan Zheng, Fudong Wang, Furong Xu, GuangMing Yao, Jun Zhou, Jingdong Chen, Jianing Li, Jianxin Sun, Jiajia Liu, Jian Sha, Jianjiang Zhu, Jianping Jiang, Jun Peng, Kaixiang Ji, Kaimeng Ren, Libin Wang, Lixiang Ru, Longhua Tan, Lu Ma, Lan Wang, Mochen Bai, Ning Gao, Qingpei Guo, Qinglong Zhang, Qiang Xu, Rui Liu, Ruijie Xiong, Ruobing Zheng, Sirui Gao, Tao Zhang, Tianqi Li, Tinghao Liu, Weilong Chai, Xinyu Xiao, Xiaomei Wang, Xiaolong Wang, Xiao Lu, Xiaoyu Li, Xingning Dong, Xuzheng Yu, Yi Yuan, Yuting Gao, Yuting Xiao, Yunxiao Sun, Yipeng Chen, Yifan Mao, Yifei Wu, Yongjie Lyu, Ziping Ma, Zhiqiang Fang, Zhihao Qiu, Ziyuan Huang, Zizheng Yang, Zhengyu He

Main category: cs.CV

TL;DR: Ming-Flash-Omni is an upgraded multimodal AI model with 100B total parameters (6.1B active per token) using sparse MoE architecture, achieving state-of-the-art performance in text-to-image generation, generative segmentation, and contextual ASR.

Details

Motivation: To create a more efficient and capable unified multimodal model that advances toward AGI by improving computational efficiency while expanding model capacity across vision, speech, and language modalities.

Method: Built on sparser Mixture-of-Experts variant of Ling-Flash-2.0 with 100B total parameters (only 6.1B active per token), enabling efficient scaling and unified multimodal intelligence.

Result: Achieved SOTA in text-to-image generation and generative segmentation; set new records on all 12 contextual ASR benchmarks; substantial improvements in multimodal understanding/generation, speech recognition, image editing consistency, and identity preservation.

Conclusion: Ming-Flash-Omni represents a significant advancement in unified multimodal AI, demonstrating that sparse MoE architectures can achieve state-of-the-art performance across multiple modalities while maintaining computational efficiency.

Abstract: We propose Ming-Flash-Omni, an upgraded version of Ming-Omni, built upon a sparser Mixture-of-Experts (MoE) variant of Ling-Flash-2.0 with 100 billion total parameters, of which only 6.1 billion are active per token. This architecture enables highly efficient scaling (dramatically improving computational efficiency while significantly expanding model capacity) and empowers stronger unified multimodal intelligence across vision, speech, and language, representing a key step toward Artificial General Intelligence (AGI). Compared to its predecessor, the upgraded version exhibits substantial improvements across multimodal understanding and generation. We significantly advance speech recognition capabilities, achieving state-of-the-art performance in contextual ASR and highly competitive results in dialect-aware ASR. In image generation, Ming-Flash-Omni introduces high-fidelity text rendering and demonstrates marked gains in scene consistency and identity preservation during image editing. Furthermore, Ming-Flash-Omni introduces generative segmentation, a capability that not only achieves strong standalone segmentation performance but also enhances spatial control in image generation and improves editing consistency. Notably, Ming-Flash-Omni achieves state-of-the-art results in text-to-image generation and generative segmentation, and sets new records on all 12 contextual ASR benchmarks, all within a single unified architecture.

[287] From Spots to Pixels: Dense Spatial Gene Expression Prediction from Histology Images

Ruikun Zhang, Yan Yang, Liyuan Pan

Main category: cs.CV

TL;DR: PixNet is a dense prediction network that generates continuous gene expression maps from histopathology images, enabling prediction at multiple spatial scales and outperforming previous methods.

Details

Motivation: Previous spatial transcriptomics methods lose spatial resolution by mapping individual spots to gene expression, ignoring that spots contain multiple cells and are limited to fixed resolutions.

Method: Generate spatially dense continuous gene expression maps from histopathology slide images, then aggregate values within spots of interest rather than mapping individual spots directly.

Result: PixNet outperforms state-of-the-art methods on four common ST datasets across multiple spatial scales.

Conclusion: The dense prediction approach enables spatially resolved gene expression prediction at varying scales, addressing limitations of previous spot-based methods.

Abstract: Spatial transcriptomics (ST) measures gene expression at fine-grained spatial resolution, offering insights into tissue molecular landscapes. Previous methods for spatial gene expression prediction typically crop spots of interest from histopathology slide images, and train models to map each spot to a corresponding gene expression profile. However, these methods inherently lose the spatial resolution in gene expression: 1) each spot often contains multiple cells with distinct gene expression profiles; 2) spots are typically defined at fixed spatial resolutions, limiting the ability to predict gene expression at varying scales. To address these limitations, this paper presents PixNet, a dense prediction network capable of predicting spatially resolved gene expression across spots of varying sizes and scales directly from histopathology slide images. Different from previous methods that map individual spots to gene expression values, we generate a spatially dense continuous gene expression map from the histopathology slide image, and aggregate values within spots of interest to predict the gene expression. Our PixNet outperforms state-of-the-art methods on four common ST datasets in multiple spatial scales. The source code will be publicly available.
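
Illustrative sketch (not from the paper): predicting a dense map first and aggregating afterwards means the same map can be read out at any spot size. The sketch below assumes circular spots and mean aggregation over a single-gene map; both choices, and the function name, are assumptions, since the abstract only says values within spots of interest are aggregated.

```python
from typing import List


def aggregate_spot(dense_map: List[List[float]], center_row: float,
                   center_col: float, radius: float) -> float:
    """Mean of the dense per-pixel predictions that fall inside a circular spot."""
    values = [
        dense_map[r][c]
        for r in range(len(dense_map))
        for c in range(len(dense_map[0]))
        if (r - center_row) ** 2 + (c - center_col) ** 2 <= radius ** 2
    ]
    if not values:
        raise ValueError("spot does not cover any pixel")
    return sum(values) / len(values)


# Toy 4x4 expression map for one gene (made-up values).
expression_map = [
    [1.0, 1.0, 0.0, 0.0],
    [1.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 2.0, 2.0],
    [0.0, 0.0, 2.0, 2.0],
]
small_spot = aggregate_spot(expression_map, 0.5, 0.5, 1.0)  # top-left 2x2 block only
large_spot = aggregate_spot(expression_map, 1.5, 1.5, 3.0)  # covers the whole map
```

The two calls read the same map at two scales, which a model trained to map fixed-size crops to expression values cannot do directly.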

[288] Spatial-SSRL: Enhancing Spatial Understanding via Self-Supervised Reinforcement Learning

Yuhong Liu, Beichen Zhang, Yuhang Zang, Yuhang Cao, Long Xing, Xiaoyi Dong, Haodong Duan, Dahua Lin, Jiaqi Wang

Main category: cs.CV

TL;DR: Spatial-SSRL introduces self-supervised RL with five spatial pretext tasks using ordinary images, achieving significant improvements in spatial reasoning for LVLMs without costly supervision.

Details

Motivation: Address the weakness of spatial understanding in Large Vision-Language Models (LVLMs) by overcoming limitations of existing supervised fine-tuning and RLVR pipelines that depend on costly supervision, specialized tools, or constrained environments.

Method: Self-supervised RL paradigm that automatically formulates five pretext tasks from ordinary RGB or RGB-D images: shuffled patch reordering, flipped patch recognition, cropped patch inpainting, regional depth ordering, and relative 3D position prediction.

Result: Achieves average accuracy gains of 4.63% (3B) and 3.89% (7B) over Qwen2.5-VL baselines on seven spatial understanding benchmarks in both image and video settings, while preserving general visual capabilities.

Conclusion: Simple, intrinsic supervision enables RLVR at scale and provides a practical route to stronger spatial intelligence in LVLMs without human or LVLM annotation.

Abstract: Spatial understanding remains a weakness of Large Vision-Language Models (LVLMs). Existing supervised fine-tuning (SFT) and recent reinforcement learning with verifiable rewards (RLVR) pipelines depend on costly supervision, specialized tools, or constrained environments that limit scale. We introduce Spatial-SSRL, a self-supervised RL paradigm that derives verifiable signals directly from ordinary RGB or RGB-D images. Spatial-SSRL automatically formulates five pretext tasks that capture 2D and 3D spatial structure: shuffled patch reordering, flipped patch recognition, cropped patch inpainting, regional depth ordering, and relative 3D position prediction. These tasks provide ground-truth answers that are easy to verify and require no human or LVLM annotation. Training on our tasks substantially improves spatial reasoning while preserving general visual capabilities. On seven spatial understanding benchmarks in both image and video settings, Spatial-SSRL delivers average accuracy gains of 4.63% (3B) and 3.89% (7B) over the Qwen2.5-VL baselines. Our results show that simple, intrinsic supervision enables RLVR at scale and provides a practical route to stronger spatial intelligence in LVLMs.
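
Illustrative sketch (not from the paper): a pretext task is verifiable when its answer is known by construction. The sketch below builds a shuffled patch reordering task and scores a prediction with an exact-match reward. Patches are stand-in strings, and the answer encoding and binary reward are assumptions; the abstract does not give these details.

```python
import random
from typing import List, Sequence, Tuple


def make_shuffle_task(patches: Sequence[str], seed: int) -> Tuple[List[str], List[int]]:
    """Shuffle patches; the answer lists the original index of each shuffled patch."""
    rng = random.Random(seed)
    order = list(range(len(patches)))
    rng.shuffle(order)
    shuffled = [patches[i] for i in order]
    return shuffled, order


def reward(predicted: Sequence[int], answer: Sequence[int]) -> float:
    """Binary verifiable reward: 1.0 only for an exact match."""
    return 1.0 if list(predicted) == list(answer) else 0.0


patches = ["p0", "p1", "p2", "p3"]  # placeholders for image patches
shuffled, answer = make_shuffle_task(patches, seed=0)
correct = reward(answer, answer)
wrong = reward(answer[1:] + answer[:1], answer)  # a rotated guess never matches
```

No human or model annotation is involved: the ground truth is simply the permutation that was applied.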

[289] Rethinking Two-Stage Referring-by-Tracking in Referring Multi-Object Tracking: Make it Strong Again

Weize Li, Yunhao Du, Qixiang Yin, Zhicheng Zhao, Fei Su

Main category: cs.CV

TL;DR: FlexHook is a novel two-stage Referring-by-Tracking framework that overcomes limitations of existing approaches through Conditioning Hook for better feature construction and Pairwise Correspondence Decoder for robust correspondence modeling, achieving state-of-the-art performance across multiple benchmarks.

Details

Motivation: While one-stage methods have gained popularity, two-stage Referring-by-Tracking approaches still offer advantages in training cost and flexible deployment. However, existing two-stage frameworks suffer from heuristic feature construction and fragile correspondence modeling that limit their performance.

Method: Proposes FlexHook with two key components: 1) Conditioning Hook (C-Hook) that uses sampling-based strategy and language-conditioned cue injection for feature construction, and 2) Pairwise Correspondence Decoder (PCD) that replaces CLIP-based similarity matching with active correspondence modeling.

Result: Extensive experiments on Refer-KITTI/v2, Refer-Dance, and LaMOT benchmarks show that FlexHook becomes the first two-stage RBT approach to comprehensively outperform current state-of-the-art methods.

Conclusion: FlexHook successfully addresses fundamental limitations of existing two-stage RBT frameworks and demonstrates superior performance, making two-stage approaches competitive again in Referring Multi-Object Tracking.

Abstract: Referring Multi-Object Tracking (RMOT) aims to track multiple objects specified by natural language expressions in videos. With the recent significant progress of one-stage methods, the two-stage Referring-by-Tracking (RBT) paradigm has gradually lost its popularity. However, its lower training cost and flexible incremental deployment remain irreplaceable. Rethinking existing two-stage RBT frameworks, we identify two fundamental limitations: the overly heuristic feature construction and fragile correspondence modeling. To address these issues, we propose FlexHook, a novel two-stage RBT framework. In FlexHook, the proposed Conditioning Hook (C-Hook) redefines the feature construction by a sampling-based strategy and language-conditioned cue injection. Then, we introduce a Pairwise Correspondence Decoder (PCD) that replaces CLIP-based similarity matching with active correspondence modeling, yielding a more flexible and robust strategy. Extensive experiments on multiple benchmarks (Refer-KITTI/v2, Refer-Dance, and LaMOT) demonstrate that FlexHook becomes the first two-stage RBT approach to comprehensively outperform current state-of-the-art methods. Code can be found in the Supplementary Materials.

[290] Embodied Crowd Counting

Runling Long, Yunlong Wang, Jia Wan, Xiang Deng, Xinting Zhu, Weili Guan, Antoni B. Chan, Liqiang Nie

Main category: cs.CV

TL;DR: Proposes Embodied Crowd Counting (ECC) using active camera navigation to address occlusion challenges in crowd counting, with a new simulator ECCD and zero-shot navigation method ZECC.

Details

Motivation: Address occlusion limitations in traditional crowd counting by leveraging embodied navigation methods with active cameras, overcoming the constraints of passive camera datasets.

Method: Builds ECCD simulator for large-scale crowd scenes, introduces ZECC method with MLLM-driven coarse-to-fine navigation and active Z-axis exploration, plus normal-line-based crowd distribution analysis.

Result: Achieves best trade-off between counting accuracy and navigation cost compared to baselines.

Conclusion: Embodied crowd counting with active navigation effectively addresses occlusion challenges in crowd analysis.

Abstract: Occlusion is one of the fundamental challenges in crowd counting. In the community, various data-driven approaches have been developed to address this issue, yet their effectiveness is limited. This is mainly because most existing crowd counting datasets on which the methods are trained are based on passive cameras, restricting their ability to fully sense the environment. Recently, embodied navigation methods have shown significant potential in precise object detection in interactive scenes. These methods incorporate active camera settings, holding promise in addressing the fundamental issues in crowd counting. However, most existing methods are designed for indoor navigation, showing unknown performance in analyzing complex object distribution in large-scale scenes, such as crowds. Besides, most existing embodied navigation datasets are indoor scenes with limited scale and object quantity, preventing them from being introduced into dense crowd analysis. Based on this, a novel task, Embodied Crowd Counting (ECC), is proposed. We first build up an interactive simulator, Embodied Crowd Counting Dataset (ECCD), which enables large-scale scenes and large object quantity. A prior probability distribution that approximates realistic crowd distribution is introduced to generate crowds. Then, a zero-shot navigation method (ZECC) is proposed. This method contains an MLLM-driven coarse-to-fine navigation mechanism, enabling active Z-axis exploration, and a normal-line-based crowd distribution analysis method for fine counting. Experimental results against baselines show that the proposed method achieves the best trade-off between counting accuracy and navigation cost.

[291] CGCE: Classifier-Guided Concept Erasure in Generative Models

Viet Nguyen, Vishal M. Patel

Main category: cs.CV

TL;DR: CGCE is a plug-and-play framework that uses lightweight classifiers to detect and refine unsafe text embeddings at inference time, achieving robust concept erasure without modifying original model weights.

Details

Motivation: Existing concept erasure methods are vulnerable to adversarial attacks and often degrade generative quality for safe concepts, creating a trade-off between safety and performance.

Method: Classifier-guided approach using lightweight classifiers on text embeddings to detect and refine prompts containing undesired concepts, enabling multi-concept erasure through classifier aggregation.

Result: CGCE achieves state-of-the-art robustness against red-teaming attacks while maintaining high generative utility on benign prompts across various T2I and T2V models.

Conclusion: CGCE provides a practical and effective solution for safe generative AI with superior balance between safety and performance, applicable to diverse generative models.

Abstract: Recent advancements in large-scale generative models have enabled the creation of high-quality images and videos, but have also raised significant safety concerns regarding the generation of unsafe content. To mitigate this, concept erasure methods have been developed to remove undesirable concepts from pre-trained models. However, existing methods remain vulnerable to adversarial attacks that can regenerate the erased content. Moreover, achieving robust erasure often degrades the model’s generative quality for safe, unrelated concepts, creating a difficult trade-off between safety and performance. To address this challenge, we introduce Classifier-Guided Concept Erasure (CGCE), an efficient plug-and-play framework that provides robust concept erasure for diverse generative models without altering their original weights. CGCE uses a lightweight classifier operating on text embeddings to first detect and then refine prompts containing undesired concepts. This approach is highly scalable, allowing for multi-concept erasure by aggregating guidance from several classifiers. By modifying only unsafe embeddings at inference time, our method prevents harmful content generation while preserving the model’s original quality on benign prompts. Extensive experiments show that CGCE achieves state-of-the-art robustness against a wide range of red-teaming attacks. Our approach also maintains high generative utility, demonstrating a superior balance between safety and performance. We showcase the versatility of CGCE through its successful application to various modern T2I and T2V models, establishing it as a practical and effective solution for safe generative AI.
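
Illustrative sketch (not from the paper): the abstract describes a classifier on text embeddings that first detects and then refines unsafe prompts, leaving benign ones untouched. The sketch below assumes a logistic linear classifier and refines by stepping the embedding against the classifier's weight vector until it scores as safe. The classifier form, step rule, and all parameter values are assumptions, not the authors' procedure.

```python
import math
from typing import List, Sequence


def unsafe_probability(embedding: Sequence[float], w: Sequence[float], b: float) -> float:
    """Hypothetical linear classifier: probability the embedding carries the concept."""
    logit = sum(e * wi for e, wi in zip(embedding, w)) + b
    return 1.0 / (1.0 + math.exp(-logit))


def refine(embedding: Sequence[float], w: Sequence[float], b: float,
           threshold: float = 0.5, step: float = 0.1, max_steps: int = 100) -> List[float]:
    """Leave safe embeddings unchanged; push flagged ones along -w until they pass."""
    current = list(embedding)
    for _ in range(max_steps):
        if unsafe_probability(current, w, b) < threshold:
            break
        # For a linear classifier the logit's gradient w.r.t. the embedding is w.
        current = [e - step * wi for e, wi in zip(current, w)]
    return current


w, b = [1.0, 0.0], 0.0  # toy classifier: only the first dimension matters
benign = [-1.0, 0.3]
flagged = [2.0, 0.3]
benign_out = refine(benign, w, b)
flagged_out = refine(flagged, w, b)
```

Only the flagged embedding is modified, and only along the direction the classifier responds to, which matches the stated goal of preserving quality on benign prompts. Multi-concept erasure could repeat the same loop with several classifiers.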

[292] FaVChat: Hierarchical Prompt-Query Guided Facial Video Understanding with Data-Efficient GRPO

Fufangchen Zhao, Xuerui Qiu, Linrui Xu, Ming Li, Wenhao Jiang, Jinkai Zheng, Hehe Fan, Jian Gao, Danfeng Yan

Main category: cs.CV

TL;DR: FaVChat is a Video-MLLM that enhances fine-grained facial understanding through multi-level feature extraction and data-efficient reinforcement learning, achieving superior performance with minimal training data.

Details

Motivation: Current multi-modal large language models struggle with fine-grained visual comprehension as visual encoders lose subtle cues needed for precise reasoning in facial understanding tasks.

Method: Proposes FaVChat with multi-level prompt-guided feature extraction (low-level for textures/motion, medium-level for discriminative regions, high-level for semantic alignment) and Data-Efficient GRPO RL algorithm for maximizing data utility through per-instance utility estimation.

Result: Extensive zero-shot evaluations show FaVChat achieves finer-grained understanding, stronger accuracy, and better generalization than existing Video-MLLMs across emotion recognition, explainable reasoning, and textual expression analysis, even with only 10K RL samples.

Conclusion: FaVChat successfully addresses fine-grained facial understanding limitations in Video-MLLMs through hierarchical feature extraction and data-efficient RL, demonstrating superior performance with minimal training data.

Abstract: Multi-modal large language models (MLLMs) have shown strong capability in video understanding but still struggle with fine-grained visual comprehension, as pure visual encoders often lose subtle cues essential for precise reasoning. To address this limitation, we propose FaVChat, a Video-MLLM specifically designed for fine-grained facial understanding. FaVChat introduces a multi-level prompt-guided feature extraction mechanism that progressively captures task-relevant information from three complementary stages: low-level transformer layers for textures and motion, medium-level learnable queries for discriminative regions, and high-level adaptive feature weighting for semantic alignment. These enriched features are dynamically fused and fed into the LLM to enable more accurate fine-grained reasoning. To further enhance the model’s ability to capture fine-grained facial attributes and maximize the utility of limited data, we propose Data-Efficient GRPO, a novel data-efficient reinforcement learning (RL) algorithm that maximizes the utility of each training sample through per-instance utility estimation and dynamic lifecycle scheduling. Extensive zero-shot evaluations across emotion recognition, explainable reasoning, and textual expression analysis demonstrate that FaVChat achieves finer-grained understanding, stronger accuracy, and better generalization than existing Video-MLLMs, even when trained with only 10K RL samples.

[293] FAPE-IR: Frequency-Aware Planning and Execution Framework for All-in-One Image Restoration

Jingren Liu, Shuning Xu, Qirui Yang, Yun Wang, Xiangyu Chen, Zhong Ji

Main category: cs.CV

TL;DR: FAPE-IR is a unified image restoration framework that uses a frozen MLLM as planner to generate frequency-aware restoration plans, which guide a LoRA-MoE diffusion executor for adaptive restoration across multiple degradation types.

Details

Motivation: Existing All-in-One Image Restoration methods rely on task-specific designs or latent routing strategies, making them hard to adapt to real-world scenarios with various degradations.

Method: Uses frozen MLLM as planner to analyze degraded images and generate frequency-aware restoration plans. These plans guide a LoRA-based Mixture-of-Experts module within diffusion-based executor that dynamically selects high/low-frequency experts using frequency features of input image. Includes adversarial training and frequency regularization loss.

Result: Achieves state-of-the-art performance across seven restoration tasks and exhibits strong zero-shot generalization under mixed degradations.

Conclusion: FAPE-IR offers a unified and interpretable solution for all-in-one image restoration by coupling semantic planning with frequency-based restoration.

Abstract: All-in-One Image Restoration (AIO-IR) aims to develop a unified model that can handle multiple degradations under complex conditions. However, existing methods often rely on task-specific designs or latent routing strategies, making it hard to adapt to real-world scenarios with various degradations. We propose FAPE-IR, a Frequency-Aware Planning and Execution framework for image restoration. It uses a frozen Multimodal Large Language Model (MLLM) as a planner to analyze degraded images and generate concise, frequency-aware restoration plans. These plans guide a LoRA-based Mixture-of-Experts (LoRA-MoE) module within a diffusion-based executor, which dynamically selects high- or low-frequency experts, complemented by frequency features of the input image. To further improve restoration quality and reduce artifacts, we introduce adversarial training and a frequency regularization loss. By coupling semantic planning with frequency-based restoration, FAPE-IR offers a unified and interpretable solution for all-in-one image restoration. Extensive experiments show that FAPE-IR achieves state-of-the-art performance across seven restoration tasks and exhibits strong zero-shot generalization under mixed degradations.
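
Illustrative sketch (not from the paper): the abstract says expert selection between high- and low-frequency experts is complemented by frequency features of the input. The sketch below shows one simple way such a feature could be computed: split a 1D signal into pairwise averages (low band) and pairwise half-differences (high band), then compare band energies. The decomposition, the threshold rule, and the expert names are hypothetical and stand in for whatever the authors use.

```python
from typing import List, Sequence


def high_frequency_ratio(signal: Sequence[float]) -> float:
    """Share of energy in the high band of a one-level pairwise split."""
    low: List[float] = []
    high: List[float] = []
    for i in range(0, len(signal) - 1, 2):
        low.append((signal[i] + signal[i + 1]) / 2.0)   # local average
        high.append((signal[i] - signal[i + 1]) / 2.0)  # local detail
    e_low = sum(v * v for v in low)
    e_high = sum(v * v for v in high)
    total = e_low + e_high
    return e_high / total if total > 0 else 0.0


def pick_expert(signal: Sequence[float], threshold: float = 0.5) -> str:
    """Hypothetical routing rule based only on the frequency feature."""
    return "high_frequency" if high_frequency_ratio(signal) >= threshold else "low_frequency"


smooth_row = [1.0, 1.0, 1.0, 1.0]       # flat signal: no detail energy
textured_row = [1.0, -1.0, 1.0, -1.0]   # alternating signal: all detail energy
smooth_choice = pick_expert(smooth_row)
textured_choice = pick_expert(textured_row)
```

In the framework described above, a feature like this would complement the planner's restoration plan rather than replace it.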

[294] Dream-IF: Dynamic Relative EnhAnceMent for Image Fusion

Xingxin Xu, Bing Cao, Dongdong Li, Qinghua Hu, Pengfei Zhu

Main category: cs.CV

TL;DR: Dream-IF is a dynamic relative enhancement framework for image fusion that integrates enhancement and fusion processes by quantifying relative dominance between modalities and using prompt-based encoding for degradation-specific restoration.

Details

Motivation: Traditional fusion methods treat image enhancement and fusion as separate processes, overlooking the inherent correlation between them. The observation that dominant regions in one modality indicate areas where the other modality might benefit from enhancement inspired this integrated approach.

Method: The framework quantifies relative dominance of each modality across different layers and uses this for reciprocal cross-modal enhancement. It employs prompt-based encoding to capture degradation-specific details that dynamically steer the restoration process.

Result: Extensive experimental results demonstrate that Dream-IF consistently outperforms its counterparts in both multi-modal image fusion and image enhancement scenarios.

Conclusion: The proposed framework successfully integrates relative dominance from image fusion to support image restoration and broader enhancement applications, achieving superior performance compared to existing methods.

Abstract: Image fusion aims to integrate comprehensive information from images acquired through multiple sources. However, images captured by diverse sensors often encounter various degradations that can negatively affect fusion quality. Traditional fusion methods generally treat image enhancement and fusion as separate processes, overlooking the inherent correlation between them; notably, the dominant regions in one modality of a fused image often indicate areas where the other modality might benefit from enhancement. Inspired by this observation, we introduce the concept of dominant regions for image enhancement and present a Dynamic Relative EnhAnceMent framework for Image Fusion (Dream-IF). This framework quantifies the relative dominance of each modality across different layers and leverages this information to facilitate reciprocal cross-modal enhancement. By integrating the relative dominance derived from image fusion, our approach supports not only image restoration but also a broader range of image enhancement applications. Furthermore, we employ prompt-based encoding to capture degradation-specific details, which dynamically steer the restoration process and promote coordinated enhancement in both multi-modal image fusion and image enhancement scenarios. Extensive experimental results demonstrate that Dream-IF consistently outperforms its counterparts. The code is publicly available at https://github.com/jehovahxu/Dream-IF.

[295] Stitch-a-Demo: Video Demonstrations from Multistep Descriptions

Chi Hsuan Wu, Kumar Ashutosh, Kristen Grauman

Main category: cs.CV

TL;DR: Stitch-a-Demo is a retrieval-based method that assembles coherent video demonstrations from multistep descriptions (like recipes or manuals) by stitching together clips from different sources while maintaining visual consistency.

Details

Motivation: Current methods only handle single-step text descriptions, but cannot coherently illustrate multistep procedures like cooking recipes or instruction manuals, which require maintaining visual consistency across multiple steps.

Method: Proposes a retrieval-based approach that stitches video clips from different sources to create coherent demonstrations. Uses a training pipeline with large-scale weakly supervised data containing diverse procedures and injects hard negatives to promote both correctness and coherence.

Result: Achieves state-of-the-art performance on in-the-wild instructional videos with gains up to 29%, and shows dramatic wins in human preference studies.

Conclusion: Stitch-a-Demo successfully addresses the challenge of creating visually coherent video demonstrations from multistep descriptions, outperforming existing methods significantly.

Abstract: When obtaining visual illustrations from text descriptions, today’s methods take a description with a single text context - a caption, or an action description - and retrieve or generate the matching visual context. However, prior work does not permit visual illustration of multistep descriptions, e.g. a cooking recipe or a gardening instruction manual, and simply handling each step description in isolation would result in an incoherent demonstration. We propose Stitch-a-Demo, a novel retrieval-based method to assemble a video demonstration from a multistep description. The resulting video contains clips, possibly from different sources, that accurately reflect all the step descriptions, while being visually coherent. We formulate a training pipeline that creates large-scale weakly supervised data containing diverse procedures and injects hard negatives that promote both correctness and coherence. Validated on in-the-wild instructional videos, Stitch-a-Demo achieves state-of-the-art performance, with gains up to 29% as well as dramatic wins in a human preference study.
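
To see why handling each step in isolation can yield an incoherent demonstration, consider choosing one clip per step so that per-step relevance and step-to-step coherence are maximized jointly. The sketch below uses a Viterbi-style dynamic program; this is an illustrative assumption, not the retrieval procedure described in the paper, and the clip ids, scores, and `coherence` function are hypothetical.

```python
def stitch_clips(candidates, coherence):
    """Pick one clip per step, maximizing relevance plus pairwise coherence.

    candidates[t] is a list of (clip_id, relevance) pairs for step t.
    coherence(prev_id, next_id) scores how well two consecutive clips match.
    Returns (total_score, list_of_clip_ids).
    """
    # best[i] holds the top-scoring path that ends in candidate i of the current step.
    best = [(relevance, [clip]) for clip, relevance in candidates[0]]
    for step in candidates[1:]:
        new_best = []
        for clip, relevance in step:
            score, path = max(
                ((s + coherence(p[-1], clip), p) for s, p in best),
                key=lambda item: item[0],
            )
            new_best.append((score + relevance, path + [clip]))
        best = new_best
    return max(best, key=lambda item: item[0])


def same_source(prev_id, next_id):
    """Toy coherence: clips from the same source video (same prefix) get a bonus."""
    return 0.5 if prev_id[0] == next_id[0] else 0.0


steps = [[("a1", 0.9), ("b1", 0.8)], [("a2", 0.5), ("b2", 0.9)]]
total, chosen = stitch_clips(steps, same_source)
```

In this toy case, picking the most relevant clip per step independently would select `a1` then `b2`, whereas the joint objective prefers `b1` then `b2` because both come from the same source.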

[296] Learning Efficient Fuse-and-Refine for Feed-Forward 3D Gaussian Splatting

Yiming Wang, Lucy Chai, Xuan Luo, Michael Niemeyer, Manuel Lagunas, Stephen Lombardi, Siyu Tang, Tiancheng Sun

Main category: cs.CV

TL;DR: A novel Fuse-and-Refine module that enhances 3D Gaussian Splatting by merging pixel-aligned primitives in canonical 3D space using hybrid Splat-Voxel representation, reducing redundancy and enabling dynamic scene reconstruction.

Details

Motivation: Existing pixel-aligned 3D Gaussian Splatting approaches create redundant representations from overlapping views and lack flexibility for dynamic scenes, constraining primitives to input rays without full 3D spatial freedom.

Method: Uses hybrid Splat-Voxel representation: aggregates pixel-aligned Gaussian primitives into coarse-to-fine voxel hierarchy, processes with sparse voxel transformer to generate refined primitives in canonical 3D space.

Result: Achieves state-of-the-art performance in static and streaming scene reconstruction, running at interactive rates (15 fps with 350ms delay) on single H100 GPU.

Conclusion: The Fuse-and-Refine approach effectively reduces redundancy, adapts to temporal frames, and enables history-aware online reconstruction of dynamic scenes while maintaining high performance.

Abstract: Recent advances in feed-forward 3D Gaussian Splatting have led to rapid improvements in efficient scene reconstruction from sparse views. However, most existing approaches construct Gaussian primitives directly aligned with the pixels in one or more of the input images. This leads to redundancies in the representation when input views overlap and constrains the position of the primitives to lie along the input rays without full flexibility in 3D space. Moreover, these pixel-aligned approaches do not naturally generalize to dynamic scenes, where effectively leveraging temporal information requires resolving both redundant and newly appearing content across frames. To address these limitations, we introduce a novel Fuse-and-Refine module that enhances existing feed-forward models by merging and refining the primitives in a canonical 3D space. At the core of our method is an efficient hybrid Splat-Voxel representation: from an initial set of pixel-aligned Gaussian primitives, we aggregate local features into a coarse-to-fine voxel hierarchy, and then use a sparse voxel transformer to process these voxel features and generate refined Gaussian primitives. By fusing and refining an arbitrary number of inputs into a consistent set of primitives, our representation effectively reduces redundancy and naturally adapts to temporal frames, enabling history-aware online reconstruction of dynamic scenes. Our approach achieves state-of-the-art performance in both static and streaming scene reconstructions while running at interactive rates (15 fps with 350ms delay) on a single H100 GPU.
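
The redundancy argument can be made concrete with a minimal sketch of the aggregation step: pixel-aligned primitives from overlapping views that fall into the same voxel are merged into one. The code below only averages 3D positions on a single-resolution grid. The coarse-to-fine hierarchy and the sparse voxel transformer are not modeled, and the function name and voxel size are assumptions for illustration.

```python
import math
from collections import defaultdict


def fuse_into_voxels(points, voxel_size):
    """Group 3D points by voxel index and return the mean position per voxel."""
    buckets = defaultdict(list)
    for point in points:
        key = tuple(math.floor(coord / voxel_size) for coord in point)
        buckets[key].append(point)
    fused = {}
    for key, members in buckets.items():
        count = len(members)
        fused[key] = tuple(sum(m[axis] for m in members) / count for axis in range(3))
    return fused


# Two primitives from overlapping views land in the same voxel and are merged.
points = [(0.25, 0.5, 0.5), (0.75, 0.5, 0.5), (1.5, 0.5, 0.5)]
fused = fuse_into_voxels(points, voxel_size=1.0)
```

Three input primitives become two fused ones here, which is the kind of reduction the canonical-space representation is meant to provide.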

[297] CLIP-IT: CLIP-based Pairing for Histology Images Classification

Banafsheh Karimian, Giulia Avanzato, Soufian Belharbi, Alexis Guichemerre, Luke McCaffrey, Mohammadhadi Shateri, Eric Granger

Main category: cs.CV

TL;DR: CLIP-IT is a framework that uses unpaired text reports to enhance medical image classification by retrieving relevant text using CLIP pre-trained on separate data, distilling knowledge into vision models, and enabling unimodal inference.

Details

Motivation: To overcome limitations of traditional multimodal learning that requires large paired datasets and complex inference, leveraging abundant unpaired medical text reports to provide complementary diagnostic information.

Method: Uses CLIP pre-trained on separate histology image-text data to retrieve relevant unpaired text reports, creates pseudo-pairs, distills knowledge into vision models via LoRA adaptation, and uses only vision model at inference.

Result: Consistently improves classification accuracy over unimodal and CLIP-based multimodal baselines on histology datasets without requiring paired data or complex inference.

Conclusion: CLIP-IT enables effective multimodal learning using unpaired text, reduces annotation burden and computational overhead, and maintains practical deployment advantages while improving performance.

Abstract: Multimodal learning has shown promise in medical imaging, combining complementary modalities like images and text. Vision-language models (VLMs) capture rich diagnostic cues but often require large paired datasets and prompt- or text-based inference, limiting their practicality due to annotation cost, privacy, and compute demands. Crucially, available free unpaired external text, like pathology reports, can still provide complementary diagnostic cues if semantically relevant content is retrievable per image. To address this, we introduce CLIP-IT, a novel framework that relies on rich unpaired text reports. Specifically, CLIP-IT uses a CLIP model pre-trained on histology image-text pairs from a separate dataset to retrieve the most relevant unpaired textual report for each image in the downstream unimodal dataset. These reports, sourced from the same disease domain and tissue type, form pseudo-pairs that reflect shared clinical semantics rather than exact alignment. Knowledge from these texts is distilled into the vision model during training, while LoRA-based adaptation mitigates the semantic gap between unaligned modalities. At inference, only the vision model is used, keeping overhead low while still benefiting from multimodal training without requiring paired data in the downstream dataset. Experiments on histology image datasets confirm that CLIP-IT consistently improves classification accuracy over both unimodal and multimodal CLIP-based baselines in most cases, without the burden of per-dataset paired annotation or inference-time complexity.
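
The pseudo-pairing step can be sketched as nearest-neighbor retrieval in a shared embedding space. The example below assumes the image and report embeddings have already been produced by some CLIP-style encoder; the toy vectors, ids, and function names are hypothetical and stand in for real model outputs.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def build_pseudo_pairs(image_embeddings, text_embeddings):
    """Map each image id to the id of its most similar unpaired text report."""
    pairs = {}
    for image_id, image_vec in image_embeddings.items():
        pairs[image_id] = max(
            text_embeddings,
            key=lambda text_id: cosine(image_vec, text_embeddings[text_id]),
        )
    return pairs


images = {"img0": [1.0, 0.0], "img1": [0.1, 1.0]}
reports = {"reportA": [0.9, 0.1], "reportB": [0.0, 1.0]}
pseudo_pairs = build_pseudo_pairs(images, reports)
```

The resulting pairs reflect semantic proximity rather than exact alignment, which is why the abstract describes them as pseudo-pairs.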

[298] Prompt Guiding Multi-Scale Adaptive Sparse Representation-driven Network for Low-Dose CT MAR

Baoshun Shi, Bing Chen, Shaolei Zhang, Huazhu Fu, Zhanli Hu

Main category: cs.CV

TL;DR: Proposes PMSRNet and PDuMSRNet for simultaneous low-dose CT reconstruction and metal artifact reduction using multi-scale adaptive sparse representation with prompt guiding strategy for multiple dose levels.

Details

Motivation: Existing deep learning methods for LDCT reconstruction and metal artifact reduction neglect multi-scale information and require separate models for different dose levels, leading to storage inefficiency.

Method: Developed PMSRNet with prompt guiding scale-adaptive threshold generator (PSATG) and multi-scale coefficient fusion module (MSFuM) to capture within-scale and cross-scale information. Extended to PDuMSRNet with dual domain framework and prompt guiding module that uses dose level, metal mask, and input instance to handle multiple dose levels with single model.

Result: Extensive experiments show the proposed methods outperform state-of-the-art LDMAR methods across various dose levels.

Conclusion: The proposed prompt guiding multi-scale adaptive sparse representation networks effectively address LDCT reconstruction and metal artifact reduction while accommodating multiple dose levels with a single model.

Abstract: Low-dose CT (LDCT) reduces X-ray radiation exposure, but it can degrade image quality and even yield metal artifacts in the presence of metallic implants. For simultaneous LDCT reconstruction and metal artifact reduction (LDMAR), existing deep learning-based efforts face two main limitations: i) the network design neglects multi-scale and within-scale information; ii) training a distinct model for each dose requires significant storage space when multiple doses must be supported. To fill these gaps, we propose a prompt guiding multi-scale adaptive sparse representation-driven network, abbreviated as PMSRNet, for the LDMAR task. Specifically, we construct PMSRNet inspired by multi-scale sparsifying frames, and it can simultaneously exploit within-scale characteristics and cross-scale complementarity owing to a prompt guiding scale-adaptive threshold generator (PSATG) and a multi-scale coefficient fusion module (MSFuM). The PSATG adaptively captures multiple kinds of contextual information to generate more faithful thresholds by fusing features from local, regional, and global levels. Furthermore, we develop a model-interpretable dual domain LDMAR framework called PDuMSRNet, and train a single model with a prompt guiding strategy for multiple dose levels. We build a prompt guiding module, whose input contains the dose level, the metal mask, and the input instance, to provide various guiding information, allowing a single model to accommodate various CT dose settings. Extensive experiments at various dose levels demonstrate that the proposed methods outperform the state-of-the-art LDMAR methods.
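
In sparse-representation models, thresholds are typically applied to transform coefficients with a soft-thresholding operator, which shrinks small coefficients to zero and reduces the magnitude of the rest. The sketch below shows that standard operator with a fixed threshold. How the thresholds produced by PSATG are actually applied inside PMSRNet is not stated in the abstract, so treating them as soft thresholds is an assumption.

```python
import math


def soft_threshold(coefficients, threshold):
    """Shrink each coefficient toward zero by `threshold`; zero out the small ones."""
    shrunk = []
    for value in coefficients:
        magnitude = abs(value) - threshold
        shrunk.append(0.0 if magnitude <= 0 else math.copysign(magnitude, value))
    return shrunk
```

A larger threshold removes more coefficients, which is why generating thresholds that adapt to scale and context matters for preserving structure while suppressing noise and artifacts.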

[299] Exploring Convolutional Neural Networks for Rice Grain Classification: An Explainable AI Approach

Muhammad Junaid Asif, Hamza Khan, Rabia Tehseen, Rana Fayyaz Ahmad, Mujtaba Asad, Syed Tahir Hussain Rizvi, Shazia Saqib

Main category: cs.CV

TL;DR: This paper proposes an automatic CNN-based framework for classifying different rice grain varieties, achieving high accuracy and using explainability techniques like LIME and SHAP to understand model decisions.

Details

Motivation: Manual quality inspection of rice grains is laborious, time-consuming, and error-prone, necessitating an automated solution for efficient classification of different rice varieties to maintain quality standards in international trade.

Method: The research uses a convolutional neural network (CNN) model for rice grain classification, evaluated using performance metrics including accuracy, recall, precision, F1-Score, and ROC curves, with explainability techniques LIME and SHAP integrated.

Result: The CNN model achieved remarkable accuracy with perfect area under ROC curves for each class, and confusion matrix analysis showed minimal misclassifications, effectively distinguishing between different rice varieties.

Conclusion: The proposed CNN-based framework provides an effective automated solution for rice grain classification, with explainability techniques offering valuable insights into the model’s decision-making process based on specific rice grain features.

Abstract: Rice is an essential staple food worldwide that is important in promoting international trade, economic growth, and nutrition. Asian countries such as China, India, Pakistan, Thailand, Vietnam, and Indonesia are notable for their significant contribution to the cultivation and utilization of rice. These nations are also known for cultivating different rice grains, including short and long grains. These sizes are further classified as basmati, jasmine, kainat saila, ipsala, arborio, etc., catering to diverse culinary preferences and cultural traditions. For both local and international trade, inspecting and maintaining the quality of rice grains to satisfy customers and preserve a country’s reputation is necessary. Manual quality check and classification is quite a laborious and time-consuming process. It is also highly prone to mistakes. Therefore, an automatic solution must be proposed for the effective and efficient classification of different varieties of rice grains. This research paper presents an automatic framework based on a convolutional neural network (CNN) for classifying different varieties of rice grains. We evaluated the proposed model based on performance metrics such as accuracy, recall, precision, and F1-Score. The CNN model underwent rigorous training and validation, achieving a remarkable accuracy rate and a perfect area under each class’s Receiver Operating Characteristic (ROC) curve. The confusion matrix analysis confirmed the model’s effectiveness in distinguishing between the different rice varieties, indicating minimal misclassifications. Additionally, the integration of explainability techniques such as LIME (Local Interpretable Model-agnostic Explanations) and SHAP (SHapley Additive exPlanations) provided valuable insights into the model’s decision-making process, revealing how specific features of the rice grains influenced classification outcomes.
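
The reported metrics all derive from the confusion matrix. As a minimal sketch, the functions below compute accuracy and per-class precision, recall, and F1-score. The 2x2 matrix is made up for illustration and does not come from the paper.

```python
def accuracy(confusion):
    """confusion[i][j] counts samples of true class i predicted as class j."""
    total = sum(sum(row) for row in confusion)
    correct = sum(confusion[i][i] for i in range(len(confusion)))
    return correct / total


def per_class_metrics(confusion, cls):
    """Return (precision, recall, f1) for one class."""
    size = len(confusion)
    tp = confusion[cls][cls]
    fp = sum(confusion[i][cls] for i in range(size) if i != cls)
    fn = sum(confusion[cls][j] for j in range(size) if j != cls)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Hypothetical two-class confusion matrix (rows: true class, columns: predicted).
example = [[8, 2], [1, 9]]
```

With this example matrix, accuracy is 17/20 and class 0 has precision 8/9 and recall 8/10.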

Chih-Ting Liao, Zhangquan Chen, Chunlei Meng, Tzu-Yu Huang, Xin Cao, Xu Zheng

Main category: cs.CV

TL;DR: First comprehensive study of adversarial vulnerability in unified multi-modal encoders, showing significant performance drops across all modalities from mild perturbations, with non-visual inputs being especially fragile.

Details

Motivation: Despite impressive capabilities of unified multi-modal encoders, their robustness under adversarial perturbations remains underexplored, which is critical for safety-sensitive applications.

Method: Proposed efficient adversarial calibration framework with modality-specific projection heads trained on adversarial examples, using three training objectives and regularization for modality-consistent alignment under attack.

Result: Method improves adversarial robustness by up to 47.3% at epsilon=4/255 while preserving or improving clean zero-shot and retrieval performance with less than 1% trainable parameters.

Conclusion: The proposed adversarial calibration framework effectively enhances robustness across modalities without modifying pretrained encoders, making it compatible with existing foundation models.

Abstract: Recent unified multi-modal encoders align a wide range of modalities into a shared representation space, enabling diverse cross-modal tasks. Despite their impressive capabilities, the robustness of these models under adversarial perturbations remains underexplored, which is a critical concern for safety-sensitive applications. In this work, we present the first comprehensive study of adversarial vulnerability in unified multi-modal encoders. We find that even mild adversarial perturbations lead to substantial performance drops across all modalities. Non-visual inputs, such as audio and point clouds, are especially fragile, while visual inputs like images and videos also degrade significantly. To address this, we propose an efficient adversarial calibration framework that improves robustness across modalities without modifying pretrained encoders or semantic centers, ensuring compatibility with existing foundation models. Our method introduces modality-specific projection heads trained solely on adversarial examples, while keeping the backbone and embeddings frozen. We explore three training objectives: fixed-center cross-entropy, clean-to-adversarial L2 alignment, and clean-adversarial InfoNCE, and we introduce a regularization strategy to ensure modality-consistent alignment under attack. Experiments on six modalities and three Bind-style models show that our method improves adversarial robustness by up to 47.3 percent at epsilon = 4/255, while preserving or even improving clean zero-shot and retrieval performance with less than 1 percent trainable parameters.
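
The perturbation budget epsilon = 4/255 can be read as a bound on how far each input value may move. The sketch below assumes the common convention of an L-infinity bound on inputs scaled to [0, 1]; the abstract does not state the norm, and non-visual modalities may use a different parameterization. The function name is hypothetical.

```python
def project_linf(clean, perturbed, epsilon=4 / 255):
    """Clip `perturbed` so each value stays within epsilon of `clean` and in [0, 1]."""
    projected = []
    for x, x_adv in zip(clean, perturbed):
        bounded = min(max(x_adv, x - epsilon), x + epsilon)  # stay inside the budget
        projected.append(min(max(bounded, 0.0), 1.0))        # stay a valid input
    return projected
```

Iterative attacks typically apply such a projection after every gradient step, so the final adversarial example never exceeds the stated budget.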

[301] Panoptic Captioning: An Equivalence Bridge for Image and Text

Kun-Yu Lin, Hongjun Wang, Weining Ren, Kai Han

Main category: cs.CV

TL;DR: Introduces panoptic captioning as a task to generate comprehensive image descriptions covering all entities, locations, attributes, relationships, and global state. Proposes PancapEngine for data generation and PancapChain method that outperforms state-of-the-art MLLMs.

Details

Motivation: To create the minimum text equivalent of images that comprehensively describes all visual elements, addressing the limitations of current MLLMs in generating detailed, structured image captions.

Method: Uses PancapEngine with detection suite to identify entities and generate panoptic captions via entity-aware prompts. PancapChain decouples the task into multiple stages for step-by-step caption generation.

Result: PancapChain-13B model beats state-of-the-art open-source MLLMs like InternVL-2.5-78B and surpasses proprietary models like GPT-4o and Gemini-2.0-Pro.

Conclusion: The proposed panoptic captioning framework with PancapEngine and PancapChain effectively addresses the limitations of current MLLMs and demonstrates superior performance in generating comprehensive image descriptions.

Abstract: This work introduces panoptic captioning, a novel task striving to seek the minimum text equivalent of images, which has broad potential applications. We take the first step towards panoptic captioning by formulating it as a task of generating a comprehensive textual description for an image, which encapsulates all entities, their respective locations and attributes, relationships among entities, as well as global image state. Through an extensive evaluation, our work reveals that state-of-the-art Multi-modal Large Language Models (MLLMs) have limited performance in solving panoptic captioning. To address this, we propose an effective data engine named PancapEngine to produce high-quality data and a novel method named PancapChain to improve panoptic captioning. Specifically, our PancapEngine first detects diverse categories of entities in images by an elaborate detection suite, and then generates required panoptic captions using entity-aware prompts. Additionally, our PancapChain explicitly decouples the challenging panoptic captioning task into multiple stages and generates panoptic captions step by step. More importantly, we contribute a comprehensive metric named PancapScore and a human-curated test set for reliable model evaluation. Experiments show that our PancapChain-13B model can beat state-of-the-art open-source MLLMs like InternVL-2.5-78B and even surpass proprietary models like GPT-4o and Gemini-2.0-Pro, demonstrating the effectiveness of our data engine and method. Project page: https://visual-ai.github.io/pancap/

[302] ConceptGuard: Proactive Safety in Text-and-Image-to-Video Generation through Multimodal Risk Detection

Ruize Ma, Minghong Cai, Yilei Jiang, Jiaming Han, Yi Feng, Yingshui Tan, Xiaoyong Zhu, Bo Zhang, Bo Zheng, Xiangyu Yue

Main category: cs.CV

TL;DR: ConceptGuard is a unified safeguard framework that proactively detects and mitigates unsafe semantics in multimodal video generation by identifying latent safety risks in fused image-text inputs and steering the generative process away from unsafe concepts.

Details

Motivation: Existing safety methods for video generation are often text-only, require prior knowledge of risk categories, or operate as post-generation auditors, struggling to proactively mitigate compositional, multimodal risks that can emerge from individual modalities or their interaction.

Method: ConceptGuard operates in two stages: (1) a contrastive detection module identifies latent safety risks by projecting fused image-text inputs into a structured concept space, and (2) a semantic suppression mechanism steers the generative process away from unsafe concepts by intervening in the prompt’s multimodal conditioning.

Result: Comprehensive experiments on ConceptRisk and T2VSafetyBench-TI2V benchmarks show that ConceptGuard consistently outperforms existing baselines, achieving state-of-the-art results in both risk detection and safe video generation.

Conclusion: ConceptGuard provides an effective framework for proactively addressing safety risks in multimodal video generation, demonstrating superior performance over existing methods through its unified approach to detection and mitigation.

Abstract: Recent progress in video generative models has enabled the creation of high-quality videos from multimodal prompts that combine text and images. While these systems offer enhanced controllability, they also introduce new safety risks, as harmful content can emerge from individual modalities or their interaction. Existing safety methods are often text-only, require prior knowledge of the risk category, or operate as post-generation auditors, struggling to proactively mitigate such compositional, multimodal risks. To address this challenge, we present ConceptGuard, a unified safeguard framework for proactively detecting and mitigating unsafe semantics in multimodal video generation. ConceptGuard operates in two stages: First, a contrastive detection module identifies latent safety risks by projecting fused image-text inputs into a structured concept space; Second, a semantic suppression mechanism steers the generative process away from unsafe concepts by intervening in the prompt’s multimodal conditioning. To support the development and rigorous evaluation of this framework, we introduce two novel benchmarks: ConceptRisk, a large-scale dataset for training on multimodal risks, and T2VSafetyBench-TI2V, the first benchmark adapted from T2VSafetyBench for the Text-and-Image-to-Video (TI2V) safety setting. Comprehensive experiments on both benchmarks show that ConceptGuard consistently outperforms existing baselines, achieving state-of-the-art results in both risk detection and safe video generation. Our code is available at https://github.com/Ruize-Ma/ConceptGuard.
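
One simple way to picture steering a conditioning vector away from an unsafe concept is to remove its component along a concept direction. The sketch below uses plain orthogonal projection as a stand-in technique. It is not ConceptGuard's suppression mechanism, whose details are not given in the abstract, and the vectors and function name are hypothetical.

```python
def suppress_concept(embedding, concept):
    """Remove the component of `embedding` that points along `concept`.

    Embeddings with no positive alignment to the concept are returned unchanged.
    """
    dot = sum(e * c for e, c in zip(embedding, concept))
    norm_sq = sum(c * c for c in concept)
    if norm_sq == 0 or dot <= 0:
        return list(embedding)
    scale = dot / norm_sq
    return [e - scale * c for e, c in zip(embedding, concept)]
```

After suppression, the embedding is orthogonal to the concept direction while its other components are left untouched.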

Nina Shvetsova, Goutam Bhat, Prune Truong, Hilde Kuehne, Federico Tombari

Main category: cs.CV

TL;DR: Novel architecture for monocular-to-stereo video conversion using extended Stable Video Diffusion model to inpaint and refine warped right views from left video input.

Details

Motivation: To solve the problem of converting monocular videos to stereo format by addressing disocclusion issues in depth-based reprojection methods.

Method: Extends Stable Video Diffusion model with modified attention layers for full attention on disoccluded pixels, using left video, warped right video, and disocclusion masks as conditioning input for end-to-end training.

Result: Outperforms previous state-of-the-art methods, ranked best 2.6x more often than second-place method in user study, while being 6x faster.

Conclusion: Proposed method effectively generates high-quality stereo video from monocular input with superior performance and efficiency compared to existing approaches.

Abstract: We tackle the problem of monocular-to-stereo video conversion and propose a novel architecture for inpainting and refinement of the warped right view obtained by depth-based reprojection of the input left view. We extend the Stable Video Diffusion (SVD) model to utilize the input left video, the warped right video, and the disocclusion masks as conditioning input to generate a high-quality right camera view. In order to effectively exploit information from neighboring frames for inpainting, we modify the attention layers in SVD to compute full attention for disoccluded pixels. Our model is trained to generate the right view video in an end-to-end manner without iterative diffusion steps by minimizing image space losses to ensure high-quality generation. Our approach outperforms previous state-of-the-art methods, being ranked best 2.6x more often than the second-place method in a user study, while being 6x faster.
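
Depth-based reprojection and the resulting disocclusion mask can be illustrated on a single scanline. The sketch below assumes integer, non-negative disparities and a rectified setup in which a left-view pixel at column x appears at column x - d in the right view. It is a toy forward warp, not the paper's pipeline, and the function name is hypothetical.

```python
def warp_row_to_right(left_row, disparity_row):
    """Forward-warp one scanline to the right view.

    Returns (right_row, mask) where mask[x] is True for disoccluded pixels
    that received no source pixel and therefore need inpainting.
    """
    width = len(left_row)
    right = [None] * width
    best_disparity = [-1] * width
    for x in range(width):
        d = disparity_row[x]
        target = x - d
        # Nearer surfaces have larger disparity and win when two pixels collide.
        if 0 <= target < width and d > best_disparity[target]:
            right[target] = left_row[x]
            best_disparity[target] = d
    mask = [value is None for value in right]
    return right, mask
```

For a row `["a", "b", "c", "d", "e"]` in which only `"c"` is a near object with disparity 2, the object moves two columns to the left and leaves a hole at its original position, which is exactly the region the inpainting model has to fill.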

[304] Deep Hybrid Model for Region of Interest Detection in Omnidirectional Videos

Sana Alamgeer, Mylene Farias, Marcelo Carvalho

Main category: cs.CV

TL;DR: A hybrid saliency model is designed to predict regions of interest in 360° videos for improved streaming efficiency and viewing experience.

Details

Motivation: ROI prediction helps reduce bandwidth usage in 360° video streaming by enabling view-port prediction and intelligent video cutting, which reduces head movement in head-mounted devices and enhances streaming efficiency.

Method: The approach involves preprocessing videos into frames, developing a hybrid saliency model for ROI prediction, and post-processing the predictions to obtain ROIs for each frame.

Result: The proposed method’s performance is compared with subjective annotations from the 360RAT dataset.

Conclusion: The hybrid saliency model effectively identifies regions of interest in 360° videos, contributing to more efficient streaming and better user experience.

Abstract: The main goal of the project is to design a new model that predicts regions of interest in 360° videos. The region of interest (ROI) plays an important role in 360° video streaming. For example, ROIs are used to predict view-ports and to intelligently cut videos for live streaming, so that less bandwidth is used. Detecting view-ports in advance helps reduce head movement while streaming and watching a video via a head-mounted device, whereas intelligent cuts of the videos help improve the efficiency of streaming the video to users and enhance the quality of their viewing experience. This report illustrates the secondary task of identifying ROIs, in which we design, train, and test a hybrid saliency model. In this work, we use saliency regions to represent the regions of interest. The method consists of the following steps: preprocessing the video to obtain frames, developing a hybrid saliency model for predicting the region of interest, and finally post-processing the output predictions of the hybrid saliency model to obtain the output region of interest for each frame. Then, we compare the performance of the proposed method with the subjective annotations of the 360RAT dataset.
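
The post-processing step, turning a per-frame saliency map into a region of interest, can be sketched as thresholding followed by a bounding box. The abstract does not specify the actual post-processing, so the threshold value, the single-box output, and the function name below are assumptions.

```python
def saliency_to_roi(saliency, threshold=0.5):
    """Return (top, left, bottom, right) covering all cells >= threshold, or None."""
    rows = [r for r, row in enumerate(saliency) if any(v >= threshold for v in row)]
    cols = [c for row in saliency for c, v in enumerate(row) if v >= threshold]
    if not rows:
        return None  # nothing in this frame is salient enough
    return (min(rows), min(cols), max(rows), max(cols))


# Toy 3x4 saliency map for one frame.
frame_saliency = [
    [0.1, 0.2, 0.1, 0.0],
    [0.1, 0.8, 0.9, 0.0],
    [0.0, 0.7, 0.2, 0.0],
]
roi = saliency_to_roi(frame_saliency)
```

Note that equirectangular 360° frames wrap around horizontally, so a real implementation would also need to handle regions that cross the left and right image borders.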

[305] SplatCo: Structure-View Collaborative Gaussian Splatting for Detail-Preserving Rendering of Large-Scale Unbounded Scenes

Haihong Xiao, Jianan Zou, Yuxin Zhou, Ying He, Wenxiong Kang

Main category: cs.CV

TL;DR: SplatCo is a collaborative Gaussian splatting framework that combines global tri-plane representations with local context grid features for high-fidelity rendering of complex outdoor environments, achieving superior reconstruction quality over state-of-the-art methods.

Details

Motivation: To address the challenge of high-fidelity rendering in complex outdoor environments by ensuring both global consistency and local detail preservation in large-scale unbounded scenes.

Method: Uses cross-structure collaboration module combining global tri-plane representations with local context grid features through hierarchical compensation strategy, and cross-view assisted training with synchronized gradient updates, visibility-aware densification, and structural consistency-based pruning.

Result: Achieves PSNR improvements of 1-2 dB and SSIM gains of 0.1 to 0.2 over state-of-the-art methods on 13 diverse large-scale scenes including Mill19, MatrixCity, Tanks & Temples, and WHU datasets.

Conclusion: SplatCo establishes a new benchmark for high-fidelity rendering of large-scale unbounded scenes through effective joint optimization of structural representation and multi-view coherence.

Abstract: We present SplatCo, a structure-view collaborative Gaussian splatting framework for high-fidelity rendering of complex outdoor environments. SplatCo builds upon two novel components: (1) a cross-structure collaboration module that combines global tri-plane representations, which capture coarse scene layouts, with local context grid features that represent fine surface details. This fusion is achieved through a novel hierarchical compensation strategy, ensuring both global consistency and local detail preservation; and (2) a cross-view assisted training strategy that enhances multi-view consistency by synchronizing gradient updates across viewpoints, applying visibility-aware densification, and pruning overfitted or inaccurate Gaussians based on structural consistency. Through joint optimization of structural representation and multi-view coherence, SplatCo effectively reconstructs fine-grained geometric structures and complex textures in large-scale scenes. Comprehensive evaluations on 13 diverse large-scale scenes, including Mill19, MatrixCity, Tanks & Temples, WHU, and custom aerial captures, demonstrate that SplatCo consistently achieves higher reconstruction quality than state-of-the-art methods, with PSNR improvements of 1-2 dB and SSIM gains of 0.1 to 0.2. These results establish a new benchmark for high-fidelity rendering of large-scale unbounded scenes. Code and additional information are available at https://github.com/SCUT-BIP-Lab/SplatCo.
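
For readers less familiar with the metric, PSNR is a logarithmic function of the mean squared error between a rendered image and its reference. The sketch below computes it for flat lists of pixel values in [0, 1]; the function name and the toy inputs are illustrative only.

```python
import math


def psnr(reference, rendered, max_value=1.0):
    """Peak signal-to-noise ratio in dB between two equal-length pixel lists."""
    mse = sum((a - b) ** 2 for a, b in zip(reference, rendered)) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10 * math.log10(max_value ** 2 / mse)
```

Because the scale is logarithmic, a 1 dB gain corresponds to roughly 21% lower mean squared error, and a 2 dB gain to roughly 37% lower.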

[306] TK-Mamba: Marrying KAN With Mamba for Text-Driven 3D Medical Image Segmentation

Haoyu Yang, Yutong Guan, Meixing Shi, Yuxiang Cai, Jintao Chen, Sun Bing, Wenhui Lei, Mianxin Liu, Xiaoming Shi, Yankai Jiang, Jianwei Yin

Main category: cs.CV

TL;DR: TK-Mamba is a multimodal framework that combines Mamba with Kolmogorov-Arnold Networks (KAN) for efficient 3D medical image segmentation, achieving state-of-the-art performance in both organ and tumor segmentation.

Details

Motivation: 3D medical image segmentation faces challenges from high-dimensional data and complex spatial dependencies. Traditional single-modality networks like CNNs and Transformers are limited by computational inefficiency and constrained contextual modeling in 3D settings.

Method: Proposed TK-Mamba framework with two key contributions: 1) 3D-Group-Rational KAN (first KAN application in 3D medical imaging) for efficient nonlinear feature transformation, and 2) dual-branch text-driven strategy using Pubmedclip embeddings to capture inter-organ semantic relationships and align image features with anatomical texts.

Result: Experiments on multiple datasets demonstrate state-of-the-art performance in both organ and tumor segmentation tasks, surpassing existing methods in both accuracy and efficiency.

Conclusion: TK-Mamba offers a unified and scalable solution for 3D medical image segmentation by combining advanced backbone architecture with vision-language knowledge, providing superior performance for complex volumetric structures.

Abstract: 3D medical image segmentation is important for clinical diagnosis and treatment but faces challenges from high-dimensional data and complex spatial dependencies. Traditional single-modality networks, such as CNNs and Transformers, are often limited by computational inefficiency and constrained contextual modeling in 3D settings. To alleviate these limitations, we propose TK-Mamba, a multimodal framework that fuses the linear-time Mamba with Kolmogorov-Arnold Networks (KAN) to form an efficient hybrid backbone. Our approach is characterized by two primary technical contributions. Firstly, we introduce the novel 3D-Group-Rational KAN (3D-GR-KAN), which marks the first application of KAN in 3D medical imaging, providing a superior and computationally efficient nonlinear feature transformation crucial for complex volumetric structures. Secondly, we devise a dual-branch text-driven strategy using Pubmedclip’s embeddings. This strategy significantly enhances segmentation robustness and accuracy by simultaneously capturing inter-organ semantic relationships to mitigate label inconsistencies and aligning image features with anatomical texts. By combining this advanced backbone and vision-language knowledge, TK-Mamba offers a unified and scalable solution for both multi-organ and tumor segmentation. Experiments on multiple datasets demonstrate that our framework achieves state-of-the-art performance in both organ and tumor segmentation tasks, surpassing existing methods in both accuracy and efficiency. Our code is publicly available at https://github.com/yhy-whu/TK-Mamba

[307] Localizing Knowledge in Diffusion Transformers

Arman Zarei, Samyadeep Basu, Keivan Rezaei, Zihao Lin, Sayan Nag, Soheil Feizi

Main category: cs.CV

TL;DR: The paper proposes a method to localize knowledge within Diffusion Transformer (DiT) blocks, showing that identified blocks are interpretable and causally linked to knowledge expression. This enables efficient model personalization and knowledge unlearning.

Details

Motivation: To understand how knowledge is distributed across DiT layers for improved interpretability, controllability, and adaptation, as DiT-based models remain underexplored compared to UNet-based architectures.

Method: A model- and knowledge-agnostic method to localize where specific types of knowledge are encoded within DiT blocks, evaluated on state-of-the-art DiT models across six diverse knowledge categories.

Result: The identified blocks are interpretable and causally linked to knowledge expression in generated outputs. Localized fine-tuning enables efficient model personalization and knowledge unlearning with reduced computational cost and improved task-specific performance.

Conclusion: The findings offer new insights into DiT internal structure and provide a practical pathway for more interpretable, efficient, and controllable model editing through targeted updates.

Abstract: Understanding how knowledge is distributed across the layers of generative models is crucial for improving interpretability, controllability, and adaptation. While prior work has explored knowledge localization in UNet-based architectures, Diffusion Transformer (DiT)-based models remain underexplored in this context. In this paper, we propose a model- and knowledge-agnostic method to localize where specific types of knowledge are encoded within the DiT blocks. We evaluate our method on state-of-the-art DiT-based models, including PixArt-alpha, FLUX, and SANA, across six diverse knowledge categories. We show that the identified blocks are both interpretable and causally linked to the expression of knowledge in generated outputs. Building on these insights, we apply our localization framework to two key applications: model personalization and knowledge unlearning. In both settings, our localized fine-tuning approach enables efficient and targeted updates, reducing computational cost, improving task-specific performance, and better preserving general model behavior with minimal interference to unrelated or surrounding content. Overall, our findings offer new insights into the internal structure of DiTs and introduce a practical pathway for more interpretable, efficient, and controllable model editing.

[308] MMPerspective: Do MLLMs Understand Perspective? A Comprehensive Benchmark for Perspective Perception, Reasoning, and Robustness

Yolo Y. Tang, Pinxin Liu, Zhangyun Tan, Mingqian Feng, Rui Mao, Chao Huang, Jing Bi, Yunzhong Xiao, Susan Liang, Hang Hua, Ali Vosoughi, Luchuan Song, Zeliang Zhang, Chenliang Xu

Main category: cs.CV

TL;DR: MMPerspective is the first benchmark to systematically evaluate multimodal large language models’ understanding of perspective geometry through 10 tasks across perception, reasoning, and robustness dimensions.

Details

Motivation: To understand the extent to which multimodal large language models internalize perspective geometry, which is fundamental to human visual perception but remains unclear in MLLMs.

Method: Created MMPerspective benchmark with 2,711 real-world and synthetic image instances and 5,083 question-answer pairs, evaluating 43 state-of-the-art MLLMs across 10 tasks in three dimensions: Perspective Perception, Reasoning, and Robustness.

Result: Models demonstrate competence on surface-level perceptual tasks but struggle with compositional reasoning and maintaining spatial consistency under perturbations. Analysis reveals patterns between model architecture, scale, and perspective capabilities, with robustness bottlenecks identified.

Conclusion: MMPerspective establishes a valuable testbed for diagnosing and advancing spatial understanding in vision-language systems, highlighting both current limitations and the benefits of chain-of-thought prompting.

Abstract: Understanding perspective is fundamental to human visual perception, yet the extent to which multimodal large language models (MLLMs) internalize perspective geometry remains unclear. We introduce MMPerspective, the first benchmark specifically designed to systematically evaluate MLLMs’ understanding of perspective through 10 carefully crafted tasks across three complementary dimensions: Perspective Perception, Reasoning, and Robustness. Our benchmark comprises 2,711 real-world and synthetic image instances with 5,083 question-answer pairs that probe key capabilities, such as vanishing point perception and counting, perspective type reasoning, line relationship understanding in 3D space, invariance to perspective-preserving transformations, etc. Through a comprehensive evaluation of 43 state-of-the-art MLLMs, we uncover significant limitations: while models demonstrate competence on surface-level perceptual tasks, they struggle with compositional reasoning and maintaining spatial consistency under perturbations. Our analysis further reveals intriguing patterns between model architecture, scale, and perspective capabilities, highlighting both robustness bottlenecks and the benefits of chain-of-thought prompting. MMPerspective establishes a valuable testbed for diagnosing and advancing spatial understanding in vision-language systems. Resources available at: https://yunlong10.github.io/MMPerspective/

[309] Unified Text-Image-to-Video Generation: A Training-Free Approach to Flexible Visual Conditioning

Bolin Lai, Sangmin Lee, Xu Cao, Xiang Li, James M. Rehg

Main category: cs.CV

TL;DR: FlexTI2V is a training-free method for text-image-to-video generation that enables flexible visual conditioning on an arbitrary number of images at arbitrary positions using latent inversion and patch swapping.

Details

Motivation: Existing TI2V methods require costly finetuning of foundation models and are limited to pre-defined conditioning settings, making them resource-intensive and inflexible.

Method: Inverts condition images to noisy latent representations, then uses random patch swapping during denoising to incorporate visual features into video frames with dynamic control for balancing creativity and fidelity.

Result: Extensive experiments show FlexTI2V significantly outperforms previous training-free image conditioning methods and generalizes to both UNet-based and transformer-based architectures.

Conclusion: The proposed training-free approach enables flexible visual conditioning in TI2V generation without requiring model finetuning, achieving superior performance while maintaining computational efficiency.

Abstract: Text-image-to-video (TI2V) generation is a critical problem for controllable video generation using both semantic and visual conditions. Most existing methods add visual conditions to text-to-video (T2V) foundation models by finetuning, which is costly in resources and limited to a few pre-defined conditioning settings. To tackle these constraints, we introduce a unified formulation for TI2V generation with flexible visual conditioning. Furthermore, we propose an innovative training-free approach, dubbed FlexTI2V, that can condition T2V foundation models on an arbitrary number of images at arbitrary positions. Specifically, we first invert the condition images to noisy representations in a latent space. Then, in the denoising process of T2V models, our method uses a novel random patch swapping strategy to incorporate visual features into video representations through local image patches. To balance creativity and fidelity, we use a dynamic control mechanism to adjust the strength of visual conditioning for each video frame. Extensive experiments validate that our method surpasses previous training-free image conditioning methods by a notable margin. Our method can also generalize to both UNet-based and transformer-based architectures.
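The random patch swapping idea can be illustrated with a toy sketch. This is not the authors' implementation: it assumes a single-channel latent stored as a list of rows, a fixed square patch size, and a single swap probability standing in for the per-frame conditioning strength. The function name and all parameters are hypothetical.

```python
import random


def swap_random_patches(video_latent, image_latent, patch=2, swap_prob=0.5, rng=None):
    """Copy video_latent and overwrite randomly chosen patches with image_latent.

    Both latents are H x W lists of rows with identical shapes. Each
    patch x patch block is replaced independently with probability swap_prob.
    """
    if rng is None:
        rng = random.Random(0)  # fixed seed so the sketch is reproducible
    height, width = len(video_latent), len(video_latent[0])
    out = [row[:] for row in video_latent]  # leave the input untouched
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            if rng.random() < swap_prob:
                for y in range(top, min(top + patch, height)):
                    for x in range(left, min(left + patch, width)):
                        out[y][x] = image_latent[y][x]
    return out


# Toy 4 x 4 latents: zeros for the video frame, ones for the inverted image.
frame = [[0.0] * 4 for _ in range(4)]
image = [[1.0] * 4 for _ in range(4)]
mixed = swap_random_patches(frame, image, patch=2, swap_prob=0.5)
```

Setting swap_prob closer to 1 pulls a frame toward the condition image (fidelity), while values closer to 0 leave more of the generated latent intact (creativity).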

[310] A2Seek: Towards Reasoning-Centric Benchmark for Aerial Anomaly Understanding

Mengjingcheng Mo, Xinyang Tong, Mingpi Tan, Jiaxu Leng, Jiankang Zheng, Yiran Liu, Haosheng Chen, Ji Gan, Weisheng Li, Xinbo Gao

Main category: cs.CV

TL;DR: A2Seek is a large-scale aerial anomaly detection dataset with detailed annotations, and A2Seek-R1 is a reasoning framework that improves anomaly prediction accuracy by up to 22.04% AP and localization by 13.9% mIoU through graph-of-thought fine-tuning and aerial-specific optimization.

Details

Motivation: UAV-based anomaly detection faces challenges from dynamic viewpoints, scale variations, and complex scenes, while existing ground-level methods perform poorly in aerial scenarios. There is a need for specialized datasets and methods for aerial anomaly understanding.

Method: A2Seek-R1 framework uses: 1) Graph-of-thought guided supervised fine-tuning to activate reasoning capabilities, 2) Aerial Group Relative Policy Optimization (A-GRPO) for aerial-specific reward functions, 3) A novel ‘seeking’ mechanism that simulates UAV flight behavior by directing attention to informative regions.

Result: Achieves up to a 22.04% improvement in AP for prediction accuracy and a 13.9% gain in mIoU for anomaly localization. Shows strong generalization across complex environments and out-of-distribution scenarios.

Conclusion: The proposed A2Seek dataset and A2Seek-R1 framework effectively address aerial anomaly detection challenges, enabling better understanding of where and why anomalies occur in aerial frames with significant performance improvements.

Abstract: While unmanned aerial vehicles (UAVs) offer wide-area, high-altitude coverage for anomaly detection, they face challenges such as dynamic viewpoints, scale variations, and complex scenes. Existing datasets and methods, mainly designed for fixed ground-level views, struggle to adapt to these conditions, leading to significant performance drops in drone-view scenarios. To bridge this gap, we introduce A2Seek (Aerial Anomaly Seek), a large-scale, reasoning-centric benchmark dataset for aerial anomaly understanding. This dataset covers various scenarios and environmental conditions, providing high-resolution real-world aerial videos with detailed annotations, including anomaly categories, frame-level timestamps, region-level bounding boxes, and natural language explanations for causal reasoning. Building on this dataset, we propose A2Seek-R1, a novel reasoning framework that generalizes R1-style strategies to aerial anomaly understanding, enabling a deeper understanding of “Where” anomalies occur and “Why” they happen in aerial frames. To this end, A2Seek-R1 first employs a graph-of-thought (GoT)-guided supervised fine-tuning approach to activate the model’s latent reasoning capabilities on A2Seek. Then, we introduce Aerial Group Relative Policy Optimization (A-GRPO) to design rule-based reward functions tailored to aerial scenarios. Furthermore, we propose a novel “seeking” mechanism that simulates UAV flight behavior by directing the model’s attention to informative regions. Extensive experiments demonstrate that A2Seek-R1 achieves up to a 22.04% improvement in AP for prediction accuracy and a 13.9% gain in mIoU for anomaly localization, exhibiting strong generalization across complex environments and out-of-distribution scenarios. Our dataset and code are released at https://2-mo.github.io/A2Seek/.
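For readers unfamiliar with the localization metric, the sketch below computes intersection over union (IoU) for axis-aligned boxes and averages it over frames. It assumes that mIoU here means the mean box IoU between predicted and annotated regions; the paper's exact evaluation protocol is not given in this summary, and the function names are hypothetical.

```python
def box_iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2) with x1 < x2 and y1 < y2."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def mean_iou(predictions, targets):
    """Average IoU over paired predicted and ground-truth boxes."""
    ious = [box_iou(p, t) for p, t in zip(predictions, targets)]
    return sum(ious) / len(ious)


# Two hypothetical frames: one partial overlap, one perfect match.
preds = [(0, 0, 2, 2), (5, 5, 7, 7)]
truth = [(1, 1, 3, 3), (5, 5, 7, 7)]
score = mean_iou(preds, truth)
```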

[311] How Animals Dance (When You’re Not Looking)

Xiaojuan Wang, Aleksander Holynski, Brian Curless, Ira Kemelmacher, Steve Seitz

Main category: cs.CV

TL;DR: A framework for generating music-synchronized animal dance videos using choreography patterns as high-level control, with keyframe optimization and video diffusion for synthesis.

Details

Motivation: To create realistic animal dance videos synchronized with music by introducing structured choreography patterns as control signals, overcoming limitations in existing dance generation methods.

Method: Uses choreography patterns (motion beat sequences) from human dance videos, generates keyframes via text-to-image/GPT-4o, formulates dance synthesis as graph optimization for optimal keyframe structure, and synthesizes in-between frames with video diffusion model.

Result: Can produce up to 30-second dance videos across various animals and music tracks using only 6 input keyframes, with effective music synchronization and choreography awareness.

Conclusion: The framework successfully generates high-quality animal dance videos by combining choreography patterns, keyframe optimization, and video diffusion, enabling long-range structured dance synthesis with minimal input.

Abstract: We present a framework for generating music-synchronized, choreography-aware animal dance videos. Our framework introduces choreography patterns – structured sequences of motion beats that define the long-range structure of a dance – as a novel high-level control signal for dance video generation. These patterns can be automatically estimated from human dance videos. Starting from a few keyframes representing distinct animal poses, generated via text-to-image prompting or GPT-4o, we formulate dance synthesis as a graph optimization problem that seeks the optimal keyframe structure to satisfy a specified choreography pattern of beats. We also introduce an approach for mirrored pose image generation, essential for capturing symmetry in dance. In-between frames are synthesized using a video diffusion model. With as few as six input keyframes, our method can produce dance videos of up to 30 seconds across a wide range of animals and music tracks.

[312] Multi-view Surface Reconstruction Using Normal and Reflectance Cues

Robin Bruneau, Baptiste Brument, Yvain Quéau, Jean Mélou, François Bernard Lauze, Jean-Denis Durou, Lilian Calvet

Main category: cs.CV

TL;DR: A framework that integrates multi-view normal and reflectance maps into radiance-based surface reconstruction to achieve high-fidelity 3D reconstruction with fine details, especially for materials with complex reflectance properties.

Details

Motivation: Achieving high-fidelity 3D surface reconstruction while preserving fine details is challenging for materials with complex reflectance properties and without dense-view setups.

Method: Uses pixel-wise joint re-parametrization of reflectance and surface normals as vectors of radiances under varying illumination, enabling integration into both traditional MVS and modern neural volume rendering pipelines.

Result: Achieves state-of-the-art performance on MVPS benchmark datasets (DiLiGenT-MV, LUCES-MV, Skoltech3D), excelling in reconstructing fine-grained details and handling challenging visibility conditions.

Conclusion: The method provides a versatile framework for high-quality surface reconstruction that preserves fine details and handles complex reflectance, with improved algorithm speed and robustness compared to the original conference version.

Abstract: Achieving high-fidelity 3D surface reconstruction while preserving fine details remains challenging, especially in the presence of materials with complex reflectance properties and without a dense-view setup. In this paper, we introduce a versatile framework that incorporates multi-view normal and optionally reflectance maps into radiance-based surface reconstruction. Our approach employs a pixel-wise joint re-parametrization of reflectance and surface normals, representing them as a vector of radiances under simulated, varying illumination. This formulation enables seamless incorporation into standard surface reconstruction pipelines, such as traditional multi-view stereo (MVS) frameworks or modern neural volume rendering (NVR) ones. Combined with the latter, our approach achieves state-of-the-art performance on multi-view photometric stereo (MVPS) benchmark datasets, including DiLiGenT-MV, LUCES-MV and Skoltech3D. In particular, our method excels in reconstructing fine-grained details and handling challenging visibility conditions. The present paper is an extended version of the earlier conference paper by Brument et al. (in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024), featuring an accelerated and more robust algorithm as well as a broader empirical evaluation. The code and data related to this article are available at https://github.com/RobinBruneau/RNb-NeuS2.
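One way to picture the re-parametrization of a normal and a reflectance value as a vector of radiances is the Lambertian shading model, sketched below. The Lambertian assumption, the scalar albedo, the light directions, and the function name are all illustrative choices made here; the summary does not specify the paper's reflectance model or illumination setup.

```python
def radiance_vector(normal, albedo, lights):
    """Render one pixel under several directional lights (Lambertian model).

    normal: unit surface normal (nx, ny, nz)
    albedo: scalar diffuse reflectance in [0, 1]
    lights: iterable of unit light directions
    Returns one radiance value per light: albedo * max(0, n . l).
    """
    values = []
    for light in lights:
        cosine = sum(n * l for n, l in zip(normal, light))
        values.append(albedo * max(0.0, cosine))  # clamp back-facing lights
    return values


# Hypothetical pixel facing the camera, lit from three directions.
simulated_lights = [(0.0, 0.0, 1.0), (1.0, 0.0, 0.0), (0.0, 0.0, -1.0)]
pixel = radiance_vector((0.0, 0.0, 1.0), 0.5, simulated_lights)
```

The resulting vector has the same form as a stack of image intensities, which is what allows a radiance-based reconstruction pipeline to consume it in place of ordinary pixel colors.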

[313] Orientation Matters: Making 3D Generative Models Orientation-Aligned

Yichong Lu, Yuzhuo Tian, Zijin Jiang, Yikun Zhao, Yuanbo Yang, Hao Ouyang, Haoji Hu, Huimin Yu, Yujun Shen, Yiyi Liao

Main category: cs.CV

TL;DR: The paper introduces orientation-aligned 3D object generation from single images and presents Objaverse-OA dataset with 14,832 aligned 3D models across 1,008 categories to address misalignment issues in existing 3D generative models.

Details

Motivation: Existing 3D generative models produce misaligned results due to inconsistent training data, limiting their usability in downstream tasks. Humans naturally perceive object orientation from single images using strong priors about canonical poses.

Method: Constructed Objaverse-OA dataset of orientation-aligned 3D models, then fine-tuned two representative 3D generative models (multi-view diffusion and 3D variational autoencoder frameworks) to produce aligned objects that generalize across categories.

Result: The method demonstrates superiority over post-hoc alignment approaches and enables aligned object generation that generalizes well to unseen objects across various categories.

Conclusion: The orientation-aligned 3D generation enables practical downstream applications including zero-shot object orientation estimation via analysis-by-synthesis and efficient arrow-based object rotation manipulation.

Abstract: Humans intuitively perceive object shape and orientation from a single image, guided by strong priors about canonical poses. However, existing 3D generative models often produce misaligned results due to inconsistent training data, limiting their usability in downstream tasks. To address this gap, we introduce the task of orientation-aligned 3D object generation: producing 3D objects from single images with consistent orientations across categories. To facilitate this, we construct Objaverse-OA, a dataset of 14,832 orientation-aligned 3D models spanning 1,008 categories. Leveraging Objaverse-OA, we fine-tune two representative 3D generative models based on multi-view diffusion and 3D variational autoencoder frameworks to produce aligned objects that generalize well to unseen objects across various categories. Experimental results demonstrate the superiority of our method over post-hoc alignment approaches. Furthermore, we showcase downstream applications enabled by our aligned object generation, including zero-shot object orientation estimation via analysis-by-synthesis and efficient arrow-based object rotation manipulation.

[314] AlignCVC: Aligning Cross-View Consistency for Single-Image-to-3D Generation

Xinyue Liang, Zhiyuan Ma, Lingchen Sun, Yanjun Guo, Lei Zhang

Main category: cs.CV

TL;DR: AlignCVC improves single-image-to-3D generation by aligning generated and reconstructed multi-view distributions with ground-truth distribution, enhancing cross-view consistency and accelerating inference.

Details

Motivation: Existing methods suffer from poor cross-view consistency in generated multi-view images, which degrades 3D reconstruction quality. Feedback-based approaches struggle with noisy reconstruction outputs that limit consistency improvement.

Method: Proposes distribution alignment framework with soft-hard alignment strategy: soft alignment for generated images (weak CVC) and hard alignment for reconstructed images (strong CVC), enabling fast 4-step inference.

Result: Significantly improves cross-view consistency and 3D reconstruction quality while dramatically accelerating inference speed. Works as plug-and-play with various multi-view generation and 3D reconstruction models.

Conclusion: AlignCVC provides a principled distribution alignment approach that effectively addresses cross-view consistency issues in single-image-to-3D generation, achieving both quality improvements and efficiency gains.

Abstract: Single-image-to-3D models typically follow a sequential generation and reconstruction workflow. However, intermediate multi-view images synthesized by pre-trained generation models often lack cross-view consistency (CVC), significantly degrading 3D reconstruction performance. While recent methods attempt to refine CVC by feeding reconstruction results back into the multi-view generator, these approaches struggle with noisy and unstable reconstruction outputs that limit effective CVC improvement. We introduce AlignCVC, a novel framework that fundamentally re-frames single-image-to-3D generation through distribution alignment rather than relying on strict regression losses. Our key insight is to align both generated and reconstructed multi-view distributions toward the ground-truth multi-view distribution, establishing a principled foundation for improved CVC. Observing that generated images exhibit weak CVC while reconstructed images display strong CVC due to explicit rendering, we propose a soft-hard alignment strategy with distinct objectives for generation and reconstruction models. This approach not only enhances generation quality but also dramatically accelerates inference to as few as 4 steps. As a plug-and-play paradigm, our method, namely AlignCVC, seamlessly integrates various multi-view generation models with 3D reconstruction models. Extensive experiments demonstrate the effectiveness and efficiency of AlignCVC for single-image-to-3D generation.

[315] FedPromo: Federated Lightweight Proxy Models at the Edge Bring New Domains to Foundation Models

Matteo Caligiuri, Francesco Barbato, Donald Shenaj, Umberto Michieli, Pietro Zanuttigh

Main category: cs.CV

TL;DR: FedPromo enables efficient adaptation of large foundation models to new domains via federated learning by training lightweight proxy models on clients and transferring knowledge back to the central model.

Details

Motivation: As model sizes grow, conventional FL requires significant client resources which may not be feasible. Need to adapt large foundation models to new domains while maintaining privacy and resource efficiency.

Method: Two-stage process: 1) Server-side knowledge distillation aligns large foundation model with compact counterpart, 2) Deploy compact encoder to clients for local classifier training, then aggregate and transfer back to foundation model with novel regularization.

Result: Extensive experiments on five image classification benchmarks show FedPromo outperforms existing methods while assuming limited-resource clients.

Conclusion: FedPromo enables decentralized multi-domain learning that balances performance, privacy, and resource efficiency for adapting large foundation models.

Abstract: Federated Learning (FL) is an established paradigm for training deep learning models on decentralized data. However, as the size of the models grows, conventional FL approaches often require significant computational resources on client devices, which may not be feasible. We introduce FedPromo, a novel framework that enables efficient adaptation of large-scale foundation models stored on a central server to new domains encountered only by remote clients. Instead of directly training the large model on client devices, FedPromo optimizes lightweight proxy models via FL, significantly reducing computational overhead while maintaining privacy. Our method follows a two-stage process: first, server-side knowledge distillation aligns the representations of a large-scale foundation model (e.g., a transformer) with those of a compact counterpart (e.g., a CNN). Then, the compact model encoder is deployed to client devices, where trainable classifiers are learned locally. These classifiers are subsequently aggregated and seamlessly transferred back to the foundation model, facilitating personalized adaptation without requiring direct access to user data. Through novel regularization strategies, our framework enables decentralized multi-domain learning, balancing performance, privacy, and resource efficiency. Extensive experiments on five image classification benchmarks demonstrate that FedPromo outperforms existing methods while assuming limited-resource clients.
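The aggregation step, in which locally trained classifiers are combined on the server, can be sketched with sample-weighted averaging in the style of FedAvg. This is an assumption made for illustration: the summary does not state which aggregation rule FedPromo uses, and the flat weight vectors and function name are hypothetical.

```python
def federated_average(client_weights, client_sizes):
    """Combine client classifier weights, weighting each by its sample count.

    client_weights: list of equal-length weight vectors, one per client
    client_sizes: number of local training samples held by each client
    """
    total = sum(client_sizes)
    dim = len(client_weights[0])
    return [
        sum(w[i] * n for w, n in zip(client_weights, client_sizes)) / total
        for i in range(dim)
    ]


# Two hypothetical clients; the second holds three times as much data.
global_head = federated_average([[1.0, 2.0], [3.0, 4.0]], [1, 3])
```

Only the small classifier heads travel between clients and server in this sketch, which reflects the resource argument made in the abstract: the large foundation model never has to be trained on a client device.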

[316] Beyond Fully Supervised Pixel Annotations: Scribble-Driven Weakly-Supervised Framework for Image Manipulation Localization

Songlin Li, Guofeng Yu, Zhiqing Guo, Yunfeng Diao, Dan Ma, Gaobo Yang

Main category: cs.CV

TL;DR: Proposes a scribble-based weakly supervised framework for image manipulation localization that outperforms fully supervised methods using only scribble annotations instead of pixel-level labels.

Details

Motivation: Address the challenge of acquiring high-quality pixel-level annotations for image manipulation localization by exploring scribble annotation supervision to improve annotation efficiency and detection performance.

Method: Uses self-supervised training with structural consistency loss, prior-aware feature modulation module (PFMM) for dynamic feature adjustment, gated adaptive fusion module (GAFM) for feature fusion, and confidence-aware entropy minimization loss for regularization.

Result: Experimental results show the method outperforms existing fully supervised approaches in average performance in both in-distribution and out-of-distribution settings.

Conclusion: Scribble-based weak supervision is an effective alternative to pixel-level annotation for image manipulation localization, achieving superior performance with reduced annotation cost.

Abstract: Deep learning-based image manipulation localization (IML) methods have achieved remarkable performance in recent years, but typically rely on large-scale pixel-level annotated datasets. To address the challenge of acquiring high-quality annotations, some recent weakly supervised methods utilize image-level labels to segment manipulated regions. However, the performance is still limited due to insufficient supervision signals. In this study, we explore a form of weak supervision that improves the annotation efficiency and detection performance, namely scribble annotation supervision. We re-annotate mainstream IML datasets with scribble labels and propose the first scribble-based IML (Sc-IML) dataset. Additionally, we propose the first scribble-based weakly supervised IML framework. Specifically, we employ self-supervised training with a structural consistency loss to encourage the model to produce consistent predictions under multi-scale and augmented inputs. In addition, we propose a prior-aware feature modulation module (PFMM) that adaptively integrates prior information from both manipulated and authentic regions for dynamic feature adjustment, further enhancing feature discriminability and prediction consistency in complex scenes. We also propose a gated adaptive fusion module (GAFM) that utilizes gating mechanisms to regulate information flow during feature fusion, guiding the model toward emphasizing potential manipulated regions. Finally, we propose a confidence-aware entropy minimization loss ($\mathcal{L}_{\mathrm{CEM}}$). This loss dynamically regularizes predictions in weakly annotated or unlabeled regions based on model uncertainty, effectively suppressing unreliable predictions. Experimental results show that our method outperforms existing fully supervised approaches in terms of average performance both in-distribution and out-of-distribution.

[317] The Early Bird Identifies the Worm: You Can’t Beat a Head Start in Long-Term Body Re-ID (ECHO-BID)

Thomas M. Metz, Matthew Q. Hill, Alice J. O’Toole

Main category: cs.CV

TL;DR: Foundation models like EVA-02 outperform specialized long-term person re-identification models through domain transfer learning, with EVA-02-based ECHO-BID achieving state-of-the-art performance.

Details

Motivation: To determine if domain transfer learning on large-scale foundation models can outperform specialized model-based approaches for long-term person re-identification.

Method: Applied domain transfer learning to four vision foundation models (CLIP, DINOv2, AIMv2, EVA-02) and created ECHO-BID models based on EVA-02 backbones with ablation studies on model size and transfer protocols.

Result: All four domain-adapted foundation models surpassed existing state-of-the-art models by large margins, with EVA-02 performing best and fusion further improving performance.

Conclusion: Foundation models provide a strong starting point for domain transfer learning, enabling state-of-the-art performance in long-term re-id with limited domain data.

Abstract: A wide range of model-based approaches to long-term person re-identification have been proposed. Whether these models perform more accurately than direct domain transfer learning applied to extensively trained large-scale foundation models is not known. We applied domain transfer learning for long-term person re-id to four vision foundation models (CLIP, DINOv2, AIMv2, and EVA-02). Domain-adapted versions of all four models surpassed existing state-of-the-art models by a large margin in highly unconstrained viewing environments. Decision score fusion of the four models improved performance over any individual model. Of the individual models, the EVA-02 foundation model provided the best “head start” to long-term re-id, surpassing other models on three of the four performance metrics by substantial margins. Accordingly, we introduce Eva Clothes-Change from Hidden Objects - Body IDentification (ECHO-BID), a class of long-term re-id models built on the object-pretrained EVA-02 Large backbones. Ablation experiments varying backbone size, scale of object classification pretraining, and transfer learning protocol indicated that model size and the use of a smaller but more challenging transfer learning protocol are critical features in performance. We conclude that foundation models provide a head start to domain transfer learning and support state-of-the-art performance with modest amounts of domain data. The limited availability of long-term re-id data makes this approach advantageous.
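The abstract reports that decision score fusion of the four models beat any single model, but it does not say which fusion rule was used. The sketch below shows one common option, assumed here purely for illustration: z-normalize each model's similarity scores, then average them per gallery identity. The function names and the score values are hypothetical.

```python
from statistics import mean, pstdev


def z_normalize(scores):
    """Standardize one model's scores to zero mean and unit variance."""
    mu = mean(scores)
    sigma = pstdev(scores)
    if sigma == 0:
        # All scores identical: the model carries no ranking information.
        return [0.0 for _ in scores]
    return [(s - mu) / sigma for s in scores]


def fuse_scores(per_model_scores):
    """Average z-normalized scores across models, one value per gallery item."""
    normalized = [z_normalize(scores) for scores in per_model_scores]
    return [mean(column) for column in zip(*normalized)]


# Hypothetical similarity scores of one probe against three gallery identities,
# one row per model (assumed values, not taken from the paper).
scores = [
    [0.80, 0.60, 0.10],
    [0.30, 0.70, 0.20],
    [0.90, 0.50, 0.40],
]
fused = fuse_scores(scores)
best_match = max(range(len(fused)), key=fused.__getitem__)
```

Normalizing before averaging keeps a model with a wide score range from dominating the others; a weighted average or a learned fusion would be equally plausible readings of the abstract.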

[318] Zero-Shot Anomaly Detection with Dual-Branch Prompt Selection

Zihan Wang, Samira Ebrahimi Kahou, Narges Armanfard

Main category: cs.CV

TL;DR: PILOT introduces a dual-branch prompt learning framework with test-time adaptation for zero-shot anomaly detection under domain shifts, achieving SOTA performance across 13 benchmarks.

Details

Motivation: Existing zero-shot anomaly detection methods struggle with domain shifts because they rely on limited training domains and fail to generalize to new distributions.

Method: Uses dual-branch prompt learning that integrates learnable prompts with semantic attributes, plus label-free test-time adaptation using high-confidence pseudo-labels from unlabeled test data.

Result: Achieves state-of-the-art performance in both anomaly detection and localization across 13 industrial and medical benchmarks under domain shift.

Conclusion: PILOT effectively addresses domain shift challenges in zero-shot anomaly detection through adaptive prompt learning and test-time adaptation.

Abstract: Zero-shot anomaly detection (ZSAD) enables identifying and localizing defects in unseen categories by relying solely on generalizable features rather than requiring any labeled examples of anomalies. However, existing ZSAD methods, whether using fixed or learned prompts, struggle under domain shifts because their training data are derived from limited training domains and fail to generalize to new distributions. In this paper, we introduce PILOT, a framework designed to overcome these challenges through two key innovations: (1) a novel dual-branch prompt learning mechanism that dynamically integrates a pool of learnable prompts with structured semantic attributes, enabling the model to adaptively weight the most relevant anomaly cues for each input image; and (2) a label-free test-time adaptation strategy that updates the learnable prompt parameters using high-confidence pseudo-labels from unlabeled test data. Extensive experiments on 13 industrial and medical benchmarks demonstrate that PILOT achieves state-of-the-art performance in both anomaly detection and localization under domain shift.
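The test-time adaptation step depends on picking high-confidence pseudo-labels from unlabeled test data. A minimal sketch of that selection follows, assuming per-image anomaly scores in [0, 1] and two thresholds (`low`, `high`) that are hypothetical and not specified in the abstract.

```python
def select_pseudo_labels(anomaly_scores, low=0.1, high=0.9):
    """Keep only confident samples: label 1 (anomalous) or 0 (normal).

    Scores strictly between `low` and `high` are treated as uncertain and
    skipped, so they never contribute to the prompt update.
    """
    selected = []
    for index, score in enumerate(anomaly_scores):
        if score >= high:
            selected.append((index, 1))
        elif score <= low:
            selected.append((index, 0))
    return selected


# Hypothetical scores for five unlabeled test images.
pseudo = select_pseudo_labels([0.95, 0.50, 0.05, 0.88, 0.10])
```

Only the selected `(index, label)` pairs would then drive an update of the learnable prompt parameters; how PILOT performs that update is not described here.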

[319] Stand-In: A Lightweight and Plug-and-Play Identity Control for Video Generation

Bowen Xue, Zheng-Peng Duan, Qixin Yan, Wenjing Wang, Hao Liu, Chun-Le Guo, Chongyi Li, Chen Li, Jing Lyu

Main category: cs.CV

TL;DR: Stand-In is a lightweight plug-and-play framework for identity-preserving video generation that uses only ~1% additional parameters and achieves better results than full-parameter training methods.

Details

Motivation: Existing methods for generating identity-preserving human videos require excessive training parameters and lack compatibility with other AIGC tools.

Method: Introduces a conditional image branch into pre-trained video generation models and uses restricted self-attentions with conditional position mapping to achieve identity control while preserving the model’s pre-trained prior.

Result: Outperforms full-parameter training methods in video quality and identity preservation with only ~1% additional parameters and 2000 training pairs. Can be integrated for various tasks including subject-driven video generation, pose-referenced generation, stylization, and face swapping.

Conclusion: Stand-In provides an efficient and versatile solution for identity preservation in video generation that maintains high performance while being lightweight and compatible with other AIGC tools.

Abstract: Generating high-fidelity human videos that match user-specified identities is important yet challenging in the field of generative AI. Existing methods often rely on an excessive number of training parameters and lack compatibility with other AIGC tools. In this paper, we propose Stand-In, a lightweight and plug-and-play framework for identity preservation in video generation. Specifically, we introduce a conditional image branch into the pre-trained video generation model. Identity control is achieved through restricted self-attentions with conditional position mapping. Thanks to these designs, which greatly preserve the pre-trained prior of the video generation model, our approach is able to outperform other full-parameter training methods in video quality and identity preservation, even with just $\sim$1% additional parameters and only 2000 training pairs. Moreover, our framework can be seamlessly integrated for other tasks, such as subject-driven video generation, pose-referenced video generation, stylization, and face swapping.

[320] Achieving detailed medial temporal lobe segmentation with upsampled isotropic training from implicit neural representation

Yue Li, Pulkit Khandelwal, Rohit Jena, Long Xie, Michael Duong, Amanda E. Denning, Christopher A. Brown, Laura E. M. Wisse, Sandhitsu R. Das, David A. Wolk, Paul A. Yushkevich

Main category: cs.CV

TL;DR: Using implicit neural representations to combine T1w and T2w MRI for upsampling MTL subregions, enabling more reliable morphological measures for Alzheimer’s disease tracking.

Details

Motivation: Accurate segmentation of medial temporal lobe subregions is needed for AD biomarkers, but anisotropic T2w MRI makes geometric modeling difficult.

Method: Used implicit neural representation to combine isotropic T1w and anisotropic T2w MRI, upsampling expert-annotated MTL subregions to create high-resolution training data for nnU-Net segmentation.

Result: Isotropic model showed stronger effect sizes in distinguishing MCI from cognitively unimpaired individuals, and greater stability in test-retest analysis.

Conclusion: Improved reliability of MRI-derived MTL subregion biomarkers without additional annotation effort, enabling better tracking of AD pathology and brain atrophy.

Abstract: Imaging biomarkers in magnetic resonance imaging (MRI) are important tools for diagnosing, tracking and treating Alzheimer’s disease (AD). Neurofibrillary tau pathology in AD is closely linked to neurodegeneration and generally follows a pattern of spread in the brain, with early stages involving subregions of the medial temporal lobe (MTL). Accurate segmentation of MTL subregions is needed to extract granular biomarkers of AD progression. MTL subregions are often imaged using T2-weighted (T2w) MRI scans that are highly anisotropic due to constraints of MRI physics and image acquisition, making it difficult to reliably model MTL subregions geometrically and extract morphological measures, such as thickness. In this study, we used an implicit neural representation method to combine isotropic T1-weighted (T1w) and anisotropic T2w MRI to upsample an atlas set of expert-annotated MTL subregions, establishing a multi-modality, high-resolution training set of isotropic data for automatic segmentation with the nnU-Net framework. In an independent test set, the morphological measures extracted using this isotropic model showed stronger effect sizes than models trained on anisotropic data in distinguishing participants with mild cognitive impairment (MCI) from cognitively unimpaired individuals. In test-retest analysis, morphological measures extracted using the isotropic model had greater stability. This study demonstrates improved reliability of MRI-derived MTL subregion biomarkers without additional atlas annotation effort, which may more accurately quantify and track the relationship between AD pathology and brain atrophy for monitoring disease progression.

[321] MAPo: Motion-Aware Partitioning of Deformable 3D Gaussian Splatting for High-Fidelity Dynamic Scene Reconstruction

Han Jiao, Jiakai Sun, Yexing Xu, Lei Zhao, Wei Xing, Huaizhong Lin

Main category: cs.CV

TL;DR: MAPo introduces motion-aware partitioning for 3D Gaussian Splatting to improve dynamic scene reconstruction by separating high- and low-dynamic regions, using specialized deformation networks for complex motions while maintaining computational efficiency.

Details

Motivation: Existing deformation-based methods for dynamic 3D Gaussian Splatting often produce blurred renderings and lose fine motion details in highly dynamic regions due to limitations of unified models in representing diverse motion patterns.

Method: Dynamic score-based partitioning strategy that distinguishes high- and low-dynamic 3D Gaussians. High-dynamic Gaussians are recursively partitioned temporally with duplicated deformation networks for each segment, while low-dynamic ones are treated as static. A cross-frame consistency loss addresses visual discontinuities at partition boundaries.

Result: MAPo achieves superior rendering quality compared to baselines while maintaining comparable computational costs, particularly excelling in regions with complex or rapid motions.

Conclusion: The motion-aware partitioning framework effectively captures intricate motion details in dynamic scenes while ensuring visual continuity and computational efficiency, addressing key limitations of previous deformation-based approaches.

Abstract: 3D Gaussian Splatting, known for enabling high-quality static scene reconstruction with fast rendering, is increasingly being applied to multi-view dynamic scene reconstruction. A common strategy involves learning a deformation field to model the temporal changes of a canonical set of 3D Gaussians. However, these deformation-based methods often produce blurred renderings and lose fine motion details in highly dynamic regions due to the inherent limitations of a single, unified model in representing diverse motion patterns. To address these challenges, we introduce Motion-Aware Partitioning of Deformable 3D Gaussian Splatting (MAPo), a novel framework for high-fidelity dynamic scene reconstruction. Its core is a dynamic score-based partitioning strategy that distinguishes between high- and low-dynamic 3D Gaussians. For high-dynamic 3D Gaussians, we recursively partition them temporally and duplicate their deformation networks for each new temporal segment, enabling specialized modeling to capture intricate motion details. Concurrently, low-dynamic 3DGs are treated as static to reduce computational costs. However, this temporal partitioning strategy for high-dynamic 3DGs can introduce visual discontinuities across frames at the partition boundaries. To address this, we introduce a cross-frame consistency loss, which not only ensures visual continuity but also further enhances rendering quality. Extensive experiments demonstrate that MAPo achieves superior rendering quality compared to baselines while maintaining comparable computational costs, particularly in regions with complex or rapid motions.
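The partitioning idea can be sketched in two steps: split Gaussians by a dynamic score, then recursively split the timeline for the high-dynamic group. The sketch below is only an illustration under stated assumptions: the threshold, the fixed recursion depth, and the midpoint split are all hypothetical, and the abstract does not say how MAPo computes the score or decides when to stop splitting.

```python
def split_by_dynamic_score(scores, threshold):
    """Return (high_dynamic_ids, low_dynamic_ids) for per-Gaussian scores."""
    high = [i for i, s in enumerate(scores) if s > threshold]
    low = [i for i, s in enumerate(scores) if s <= threshold]
    return high, low


def partition_time(start, end, depth):
    """Recursively halve the frame range [start, end) up to `depth` times."""
    if depth == 0 or end - start < 2:
        return [(start, end)]
    mid = (start + end) // 2
    return partition_time(start, mid, depth - 1) + partition_time(mid, end, depth - 1)


# Hypothetical dynamic scores for four Gaussians over an 8-frame clip.
scores = [0.05, 0.90, 0.40, 0.75]
high_ids, low_ids = split_by_dynamic_score(scores, threshold=0.5)
segments = partition_time(0, 8, depth=2)
```

In this reading, each tuple in `segments` would get its own copy of the deformation network for the Gaussians in `high_ids`, while those in `low_ids` stay static. Adjacent segments share a boundary frame index, which is where a cross-frame consistency term would apply.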

[322] InstaDA: Augmenting Instance Segmentation Data with Dual-Agent System

Xianbao Hou, Yonghao He, Zeyd Boukhers, John See, Hu Su, Wei Sui, Cong Yang

Main category: cs.CV

TL;DR: InstaDA is a training-free dual-agent system that enhances instance segmentation datasets through text and image agents, achieving significant performance improvements on LVIS 1.0.

Details

Motivation: Address limitations in current dataset augmentation methods that lack deep collaboration between LLMs and diffusion models, and underutilize existing training data information.

Method: Two-agent system: Text-Agent with Prompt Rethink mechanism for iterative prompt refinement, and Image-Agent for generating new instances conditioned on training images. Both operate as independent automated workflows.

Result: Achieved +4.0 box AP and +3.3 mask AP improvements over baseline on LVIS 1.0, outperforming DiverGen by +0.3 box AP and +0.1 mask AP, with notable gains in common and frequent categories.

Conclusion: InstaDA effectively enhances instance segmentation datasets through collaborative LLM-diffusion integration and training data utilization, demonstrating superior performance without requiring training.

Abstract: Acquiring high-quality instance segmentation data is challenging due to the labor-intensive nature of the annotation process and significant class imbalances within datasets. Recent studies have utilized the integration of Copy-Paste and diffusion models to create more diverse datasets. However, these studies often lack deep collaboration between large language models (LLMs) and diffusion models, and underutilize the rich information within the existing training data. To address these limitations, we propose InstaDA, a novel, training-free Dual-Agent system designed to augment instance segmentation datasets. First, we introduce a Text-Agent (T-Agent) that enhances data diversity through collaboration between LLMs and diffusion models. This agent features a novel Prompt Rethink mechanism, which iteratively refines prompts based on the generated images. This process not only fosters collaboration but also increases image utilization and optimizes the prompts themselves. Additionally, we present an Image-Agent (I-Agent) aimed at enriching the overall data distribution. This agent augments the training set by generating new instances conditioned on the training images. To ensure practicality and efficiency, both agents operate as independent and automated workflows, enhancing usability. Experiments conducted on the LVIS 1.0 validation set indicate that InstaDA achieves significant improvements, with an increase of +4.0 in box average precision (AP) and +3.3 in mask AP compared to the baseline. Furthermore, it outperforms the leading model, DiverGen, by +0.3 in box AP and +0.1 in mask AP, with a notable +0.7 gain in box AP on common categories and mask AP gains of +0.2 on common categories and +0.5 on frequent categories.

[323] NeuroGaze-Distill: Brain-informed Distillation and Depression-Inspired Geometric Priors for Robust Facial Emotion Recognition

Zilin Li, Weiwei Xu, Xuanqi Zhao, Yiran Zhu

Main category: cs.CV

TL;DR: NeuroGaze-Distill is a cross-modal distillation framework that transfers brain-informed priors from EEG data to image-only facial emotion recognition models using static valence/arousal prototypes and a depression-inspired geometric prior, improving generalization without requiring EEG data at deployment.

Details

Motivation: Traditional FER models trained only on pixels fail to generalize across datasets because facial appearance is an indirect and biased proxy for underlying affect. The authors aim to incorporate brain-informed priors to improve robustness.

Method: A teacher model trained on EEG topographic maps produces a consolidated 5x5 valence/arousal prototype grid. The student model (ResNet-18/50) is trained on FERPlus with conventional losses plus two regularizers: Proto-KD aligns features to static prototypes, and D-Geo shapes embedding geometry based on depression research findings.

Result: The method improves performance in both within-domain (FERPlus validation) and cross-dataset protocols (AffectNet-mini, CK+), with consistent gains attributed to prototypes and D-Geo. The 5x5 grid proved more stable than denser alternatives.

Conclusion: NeuroGaze-Distill is a simple, deployable approach that improves FER robustness without architectural complexity, successfully transferring brain-informed priors to image-only models without requiring EEG data at deployment.

Abstract: Facial emotion recognition (FER) models trained only on pixels often fail to generalize across datasets because facial appearance is an indirect and biased proxy for underlying affect. We present NeuroGaze-Distill, a cross-modal distillation framework that transfers brain-informed priors into an image-only FER student via static Valence/Arousal (V/A) prototypes and a depression-inspired geometric prior (D-Geo). A teacher trained on EEG topographic maps from DREAMER (with MAHNOB-HCI as unlabeled support) produces a consolidated 5x5 V/A prototype grid that is frozen and reused; no EEG-face pairing and no non-visual signals at deployment are required. The student (ResNet-18/50) is trained on FERPlus with conventional CE/KD and two lightweight regularizers: (i) Proto-KD (cosine) aligns student features to the static prototypes; (ii) D-Geo softly shapes the embedding geometry in line with affective findings often reported in depression research (e.g., anhedonia-like contraction in high-valence regions). We evaluate both within-domain (FERPlus validation) and cross-dataset protocols (AffectNet-mini; optional CK+), reporting standard 8-way scores alongside present-only Macro-F1 and balanced accuracy to fairly handle label-set mismatch. Ablations attribute consistent gains to prototypes and D-Geo, and favor 5x5 over denser grids for stability. The method is simple, deployable, and improves robustness without architectural complexity.
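The Proto-KD regularizer is described as a cosine alignment between student features and frozen prototypes. A minimal sketch of such a loss follows, assuming each sample has already been assigned one prototype index; the assignment rule, the function names, and the toy vectors are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def proto_kd_loss(features, prototypes, assignments):
    """Mean of (1 - cosine) between each feature and its assigned prototype.

    `prototypes` is treated as frozen: nothing here modifies it.
    """
    losses = [
        1.0 - cosine_similarity(feature, prototypes[index])
        for feature, index in zip(features, assignments)
    ]
    return sum(losses) / len(losses)


# Toy example with two prototypes; a flattened 5x5 grid would have 25 entries.
prototypes = [[2.0, 0.0], [1.0, 0.0]]
features = [[1.0, 0.0], [0.0, 1.0]]
loss = proto_kd_loss(features, prototypes, assignments=[0, 1])
```

The first feature points the same way as its prototype and contributes 0; the second is orthogonal and contributes 1, so the mean is 0.5. In training this term would be added to the CE/KD objective with a weight the abstract does not give.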

[324] Time-step Mixup for Efficient Spiking Knowledge Transfer from Appearance to Event Domain

Yuqi Xie, Shuhan Ye, Yi Yu, Chong Wang, Qixin Zhang, Jiazhen Xu, Le Shen, Yuanbin Qian, Jiangbo Qian, Guoqi Li

Main category: cs.CV

TL;DR: TMKT is a novel fine-grained mixing strategy that interpolates RGB and DVS inputs at various time-steps to enable smoother knowledge transfer between modalities for spiking neural networks.

Details

Motivation: To address the limited availability of event data and distribution gap between RGB and DVS modalities in spiking neural networks for energy-efficient visual processing.

Method: Proposes Time-step Mixup knowledge transfer (TMKT) that exploits SNNs’ asynchronous nature by interpolating RGB and DVS inputs at various time-steps, with modality-aware auxiliary learning objectives for label mixing.

Result: Achieves superior performance in spiking image classification tasks across multiple datasets, enabling smoother knowledge transfer and alleviating modality shift during training.

Conclusion: The proposed TMKT approach effectively bridges the modality gap and enhances knowledge transfer for spiking neural networks in visual processing tasks.

Abstract: The integration of event cameras and spiking neural networks holds great promise for energy-efficient visual processing. However, the limited availability of event data and the sparse nature of DVS outputs pose challenges for effective training. Although some prior work has attempted to transfer semantic knowledge from RGB datasets to DVS, they often overlook the significant distribution gap between the two modalities. In this paper, we propose Time-step Mixup knowledge transfer (TMKT), a novel fine-grained mixing strategy that exploits the asynchronous nature of SNNs by interpolating RGB and DVS inputs at various time-steps. To enable label mixing in cross-modal scenarios, we further introduce modality-aware auxiliary learning objectives. These objectives support the time-step mixup process and enhance the model’s ability to discriminate effectively across different modalities. Our approach enables smoother knowledge transfer, alleviates modality shift during training, and achieves superior performance in spiking image classification tasks. Extensive experiments demonstrate the effectiveness of our method across multiple datasets. The code will be released after the double-blind review process.
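One way to picture time-step mixup is to feed RGB input for some time-steps and DVS input for the rest, then mix the labels by the same ratio. The sketch below assumes a single switch point and a linear label mix; both are illustrative assumptions, and the paper's actual interpolation scheme and auxiliary objectives are not specified in the abstract.

```python
def time_step_mixup(rgb_frames, dvs_frames, switch_step):
    """Use RGB input before `switch_step` and DVS input from it onward.

    Returns the mixed sequence and the fraction of time-steps taken from RGB.
    """
    if len(rgb_frames) != len(dvs_frames):
        raise ValueError("both sequences must cover the same time-steps")
    total = len(rgb_frames)
    if not 0 <= switch_step <= total:
        raise ValueError("switch_step must lie in [0, total]")
    mixed = rgb_frames[:switch_step] + dvs_frames[switch_step:]
    return mixed, switch_step / total


def mix_labels(rgb_label, dvs_label, rgb_ratio):
    """Linearly combine two label vectors using the RGB time-step ratio."""
    return [rgb_ratio * r + (1.0 - rgb_ratio) * d for r, d in zip(rgb_label, dvs_label)]


# Placeholder strings stand in for per-time-step input tensors.
rgb = ["rgb_t0", "rgb_t1", "rgb_t2", "rgb_t3"]
dvs = ["dvs_t0", "dvs_t1", "dvs_t2", "dvs_t3"]
mixed, ratio = time_step_mixup(rgb, dvs, switch_step=1)
soft_label = mix_labels([1.0, 0.0], [0.0, 1.0], ratio)
```

With four time-steps and a switch at step 1, one quarter of the input comes from RGB, so the soft label places 0.25 on the RGB class and 0.75 on the DVS class.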

[325] ContextFlow: Training-Free Video Object Editing via Adaptive Context Enrichment

Yiyang Chen, Xuanhua He, Xiujun Ma, Yue Ma

Main category: cs.CV

TL;DR: ContextFlow is a training-free framework for video object editing using Diffusion Transformers that addresses fidelity and temporal consistency issues through high-order inversion and adaptive context enrichment.

Details

Motivation: Existing training-free video object editing methods struggle with maintaining fidelity and temporal consistency, especially in DiT architectures where prior layer-selection heuristics are ineffective and feature replacement causes contextual conflicts.

Method: Uses high-order Rectified Flow solver for robust editing foundation, Adaptive Context Enrichment to dynamically fuse information by concatenating Key-Value pairs, and systematic layer selection based on Guidance Responsiveness Metric to identify task-specific vital layers.

Result: Significantly outperforms existing training-free methods and surpasses several state-of-the-art training-based approaches, delivering temporally coherent, high-fidelity video editing results.

Conclusion: ContextFlow provides an effective training-free solution for video object editing in DiT architectures, overcoming limitations of existing methods through robust inversion and adaptive context fusion mechanisms.

Abstract: Training-free video object editing aims to achieve precise object-level manipulation, including object insertion, swapping, and deletion. However, it faces significant challenges in maintaining fidelity and temporal consistency. Existing methods, often designed for U-Net architectures, suffer from two primary limitations: inaccurate inversion due to first-order solvers, and contextual conflicts caused by crude “hard” feature replacement. These issues are more challenging in Diffusion Transformers (DiTs), where the unsuitability of prior layer-selection heuristics makes effective guidance challenging. To address these limitations, we introduce ContextFlow, a novel training-free framework for DiT-based video object editing. In detail, we first employ a high-order Rectified Flow solver to establish a robust editing foundation. The core of our framework is Adaptive Context Enrichment (for specifying what to edit), a mechanism that addresses contextual conflicts. Instead of replacing features, it enriches the self-attention context by concatenating Key-Value pairs from parallel reconstruction and editing paths, empowering the model to dynamically fuse information. Additionally, to determine where to apply this enrichment (for specifying where to edit), we propose a systematic, data-driven analysis to identify task-specific vital layers. Based on a novel Guidance Responsiveness Metric, our method pinpoints the most influential DiT blocks for different tasks (e.g., insertion, swapping), enabling targeted and highly effective guidance. Extensive experiments show that ContextFlow significantly outperforms existing training-free methods and even surpasses several state-of-the-art training-based approaches, delivering temporally coherent, high-fidelity results.
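The contrast between hard feature replacement and context enrichment can be shown with a single-query attention toy. In the sketch below, Key-Value pairs from the editing and reconstruction paths are concatenated before attention, so the softmax decides how much to draw from each path. This is a plain-Python illustration with hypothetical names and values, not the authors' implementation.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, keys, values):
    """Scaled dot-product attention for one query vector."""
    scale = math.sqrt(len(query))
    logits = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(logits)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]


def enriched_attend(query, edit_kv, recon_kv):
    """Attend over the concatenation of editing-path and reconstruction-path K/V."""
    edit_keys, edit_values = edit_kv
    recon_keys, recon_values = recon_kv
    return attend(query, edit_keys + recon_keys, edit_values + recon_values)


# The query aligns with the reconstruction key, so the output leans on its value.
query = [10.0, 0.0]
edit_kv = ([[0.0, 1.0]], [[0.0]])
recon_kv = ([[1.0, 0.0]], [[1.0]])
edit_only = attend(query, *edit_kv)
enriched = enriched_attend(query, edit_kv, recon_kv)
```

Here `edit_only` can only return the editing value, while `enriched` moves almost entirely to the reconstruction value because the query matches that key. Which DiT blocks receive this treatment is a separate choice, made in the paper via the Guidance Responsiveness Metric.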

[326] StrCGAN: A Generative Framework for Stellar Image Restoration

Shantanusinh Parmar, Silas Janke

Main category: cs.CV

TL;DR: StrCGAN is a generative model that enhances low-resolution astrophotography images by reconstructing high-fidelity stellar representations using multi-spectral fusion and astrophysical regularization.

Details

Motivation: Traditional models like CycleGAN often distort star morphology and produce poor quality images when enhancing low-resolution astrophotography from small telescopes like MobilTelesco.

Method: Extends CycleGAN with multi-spectral fusion to align optical and NIR domains, and astrophysical regularization modules to preserve stellar morphology, guided by multi-mission all-sky survey ground truth.

Result: StrCGAN generates visually sharper reconstructions that outperform standard GAN models in astrophysical image enhancement tasks.

Conclusion: The proposed StrCGAN framework successfully enhances low-resolution astrophotography while preserving stellar morphology through multi-spectral fusion and astrophysical constraints.

Abstract: We introduce StrCGAN (Stellar Cyclic GAN), a generative model designed to enhance low-resolution astrophotography images. Our goal is to reconstruct high-fidelity, ground-truth-like representations of stellar objects, a task that is challenging due to the limited resolution and quality of small-telescope observations such as the MobilTelesco dataset. Traditional models such as CycleGAN provide a foundation for image-to-image translation but often distort the morphology of stars and produce images that barely resemble their targets. To overcome these limitations, we extend the CycleGAN framework with two key innovations: multi-spectral fusion to align optical and near-infrared (NIR) domains, and astrophysical regularization modules to preserve stellar morphology. Ground-truth references from multi-mission all-sky surveys spanning optical to NIR guide the training process, ensuring that reconstructions remain consistent across spectral bands. Together, these components allow StrCGAN to generate reconstructions that are visually sharper, outperforming standard GAN models in the task of astrophysical image enhancement.

Pîrvu Mihai-Cristian, Marius Leordeanu

Main category: cs.CV

TL;DR: PHG-MAE is a novel model that combines neural graphs with masked autoencoders, enabling unified pre-training and fine-tuning with modality-level masking and inference-time ensembles.

Details

Motivation: To unify classical neural graph approaches with modern masked autoencoders under a common theoretical framework, eliminating the need for manual data labeling and improving multi-modal learning.

Method: Random masking of entire modalities (not just patches) to sample from hyper-edge distributions, combining pre-training and fine-tuning in a single training loop, and creating inference-time ensembles with knowledge distillation.

Result: The approach enables boosted prediction performance through ensemble aggregation and maintains high performance even with models under 1M parameters after knowledge distillation.

Conclusion: PHG-MAE provides an effective framework for multi-modal learning that can be applied to various domains like autonomous driving and robotics, with publicly released code and an extended Dronescapes dataset.

Abstract: The computer vision domain has greatly benefited from an abundance of data across many modalities to improve on various visual tasks. Recently, there has been a lot of focus on self-supervised pre-training methods through Masked Autoencoders (MAE) \cite{he2022masked,bachmann2022multimae}, usually used as a first step before optimizing for a downstream task, such as classification or regression. This is very useful as it doesn’t require any manually labeled data. In this work, we introduce Probabilistic Hyper-Graphs using Masked Autoencoders (PHG-MAE): a novel model that unifies the classical work on neural graphs \cite{leordeanu2021semi} with the modern approach of masked autoencoders under a common theoretical framework. Through random masking of entire modalities, not just patches, the model samples from the distribution of hyper-edges on each forward pass. Additionally, the model adapts the standard MAE algorithm by combining pre-training and fine-tuning into a single training loop. Moreover, our approach enables the creation of inference-time ensembles which, through aggregation, boost the final prediction performance and consistency. Lastly, we show that we can apply knowledge distillation on top of the ensembles with little loss in performance, even with models that have fewer than 1M parameters. While our work mostly focuses on outdoor UAV scenes that contain multiple world interpretations and modalities, the same steps can be followed in other similar domains, such as autonomous driving or indoor robotics. In order to streamline the process of integrating external pre-trained experts for computer vision multi-modal multi-task learning (MTL) scenarios, we developed a data-pipeline software. Using this tool, we have created and released a fully-automated extension of the Dronescapes dataset. All the technical details, code and reproduction steps are publicly released.
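Modality-level masking and the inference-time ensemble can be sketched together: each forward pass sees a random subset of whole modalities, and the predictions from several passes are averaged. The sketch below uses a toy model and a hypothetical keep probability; the abstract does not give the sampling distribution or the aggregation rule, so both are assumptions.

```python
import random
from statistics import mean


def mask_modalities(sample, rng, keep_prob=0.5):
    """Drop whole modalities at random while keeping at least one."""
    names = sorted(sample)
    kept = [name for name in names if rng.random() < keep_prob]
    if not kept:
        kept = [rng.choice(names)]
    return {name: sample[name] for name in kept}


def ensemble_predict(model, sample, num_passes, seed=0):
    """Average the model's predictions over several random modality masks."""
    rng = random.Random(seed)
    predictions = [model(mask_modalities(sample, rng)) for _ in range(num_passes)]
    return mean(predictions)


def toy_model(visible):
    """Stand-in for a network: averages whatever modalities are visible."""
    return mean(visible.values())


# Scalar placeholders stand in for per-modality feature maps.
sample = {"rgb": 1.0, "depth": 3.0, "semantic": 5.0}
prediction = ensemble_predict(toy_model, sample, num_passes=8)
```

Each mask corresponds to one sampled set of input modalities, which is the role the abstract assigns to hyper-edges. A distilled student could then be trained to reproduce `prediction` in a single pass.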

[328] STT-GS: Sample-Then-Transmit Edge Gaussian Splatting with Joint Client Selection and Power Control

Zhen Li, Xibin Jin, Guoliang Li, Shuai Wang, Miaowen Wen, Huseyin Arslan, Derrick Wing Kwan Ng, Chengzhong Xu

Main category: cs.CV

TL;DR: Proposes STT-GS framework for edge Gaussian splatting that samples pilot data to predict GS quality, then prioritizes communication resources to valuable clients using joint client selection and power control.

Details

Motivation: Existing edge resource management methods focus on communication throughput or general learning performance, but are inapplicable for maximizing Gaussian splatting quality in distributed client scenarios like drone-based scene reconstruction.

Method: Sample-then-transmit strategy with feature-domain clustering for representative data selection, pilot transmission time minimization, and joint client selection and power control framework using penalty alternating majorization minimization algorithm.

Result: Significantly outperforms existing benchmarks on real-world datasets, accurately predicts GS-oriented objective with low sampling ratios (10%), and achieves excellent tradeoff between view contributions and communication costs.

Conclusion: The proposed STT-GS framework effectively solves the causality dilemma in edge Gaussian splatting by enabling efficient resource allocation based on predicted GS quality, demonstrating superior performance over traditional approaches.

Abstract: Edge Gaussian splatting (EGS), which aggregates data from distributed clients (e.g., drones) and trains a global GS model at the edge (e.g., ground server), is an emerging paradigm for scene reconstruction in the low-altitude economy. Unlike traditional edge resource management methods that emphasize communication throughput or general-purpose learning performance, EGS explicitly aims to maximize the GS qualities, rendering existing approaches inapplicable. To address this problem, this paper formulates a novel GS-oriented objective function that distinguishes the heterogeneous view contributions of different clients. However, evaluating this function in turn requires clients’ images, leading to a causality dilemma. To this end, this paper further proposes a sample-then-transmit EGS (or STT-GS for short) strategy, which first samples a subset of images as pilot data from each client for loss prediction. Based on the first-stage evaluation, communication resources are then prioritized towards more valuable clients. To achieve efficient sampling, a feature-domain clustering (FDC) scheme is proposed to select the most representative data and pilot transmission time minimization (PTTM) is adopted to reduce the pilot overhead. Subsequently, we develop a joint client selection and power control (JCSPC) framework to maximize the GS-oriented function under communication resource constraints. Despite the nonconvexity of the problem, we propose a low-complexity efficient solution based on the penalty alternating majorization minimization (PAMM) algorithm. Experiments reveal that the proposed scheme significantly outperforms existing benchmarks on real-world datasets. The GS-oriented objective can be accurately predicted with low sampling ratios (e.g., 10%), and our method achieves an excellent tradeoff between view contributions and communication costs.

[329] PaddleOCR-VL: Boosting Multilingual Document Parsing via a 0.9B Ultra-Compact Vision-Language Model

Cheng Cui, Ting Sun, Suyin Liang, Tingquan Gao, Zelun Zhang, Jiaxuan Liu, Xueqing Wang, Changda Zhou, Hongen Liu, Manhui Lin, Yue Zhang, Yubo Zhang, Handong Zheng, Jing Zhang, Jun Zhang, Yi Liu, Dianhai Yu, Yanjun Ma

Main category: cs.CV

TL;DR: PaddleOCR-VL is a state-of-the-art, resource-efficient vision-language model for document parsing that supports 109 languages and excels at recognizing complex elements like text, tables, formulas, and charts while maintaining minimal resource consumption.

Details

Motivation: To develop an efficient and powerful document parsing model that can handle multiple languages and complex document elements while being suitable for practical deployment in real-world scenarios.

Method: Uses PaddleOCR-VL-0.9B, a compact vision-language model that integrates a NaViT-style dynamic resolution visual encoder with the ERNIE-4.5-0.3B language model for accurate element recognition.

Result: Achieves SOTA performance in both page-level document parsing and element-level recognition, significantly outperforms existing solutions, exhibits strong competitiveness against top-tier VLMs, and delivers fast inference speeds.

Conclusion: PaddleOCR-VL is highly suitable for practical deployment in real-world scenarios due to its superior performance, efficiency, and multi-language support capabilities.

Abstract: In this report, we propose PaddleOCR-VL, a SOTA and resource-efficient model tailored for document parsing. Its core component is PaddleOCR-VL-0.9B, a compact yet powerful vision-language model (VLM) that integrates a NaViT-style dynamic resolution visual encoder with the ERNIE-4.5-0.3B language model to enable accurate element recognition. This innovative model efficiently supports 109 languages and excels in recognizing complex elements (e.g., text, tables, formulas, and charts), while maintaining minimal resource consumption. Through comprehensive evaluations on widely used public benchmarks and in-house benchmarks, PaddleOCR-VL achieves SOTA performance in both page-level document parsing and element-level recognition. It significantly outperforms existing solutions, exhibits strong competitiveness against top-tier VLMs, and delivers fast inference speeds. These strengths make it highly suitable for practical deployment in real-world scenarios. Code is available at https://github.com/PaddlePaddle/PaddleOCR .

[330] Optimizing DINOv2 with Registers for Face Anti-Spoofing

Mika Feng, Pierre Gallin-Martel, Koichi Ito, Takafumi Aoki

Main category: cs.CV

TL;DR: Proposes a DINOv2-based method with registers to detect face spoofing attacks by focusing on minute differences between live and spoofed face images.

Details

Motivation: Face recognition systems are vulnerable to spoofing attacks where malicious actors present photos of registered users, requiring detection methods to prevent authentication bypass.

Method: Uses DINOv2 with registers to extract generalizable features and suppress attention perturbations, enabling focused analysis of essential minute features for spoofing detection.

Result: Demonstrated effectiveness through experiments on datasets from The 6th Face Anti-Spoofing Workshop and SiW dataset.

Conclusion: The DINOv2-based approach with registers provides an effective solution for detecting face spoofing attacks by leveraging attention mechanisms to identify subtle differences between live and spoofed images.

Abstract: Face recognition systems are designed to be robust against variations in head pose, illumination, and image blur during capture. However, malicious actors can exploit these systems by presenting a face photo of a registered user, potentially bypassing the authentication process. Such spoofing attacks must be detected prior to face recognition. In this paper, we propose a DINOv2-based spoofing attack detection method to discern minute differences between live and spoofed face images. Specifically, we employ DINOv2 with registers to extract generalizable features and to suppress perturbations in the attention mechanism, which enables focused attention on essential and minute features. We demonstrate the effectiveness of the proposed method through experiments conducted on the dataset provided by “The 6th Face Anti-Spoofing Workshop: Unified Physical-Digital Attacks Detection@ICCV2025” and the SiW dataset. The project page is available at: https://gsisaoki.github.io/FAS-DINOv2-ICCVW/ .

[331] Endoshare: A Publicly Available, Surgeons-Friendly Solution to De-Identify and Manage Surgical Videos

Lorenzo Arboit, Dennis N. Schneider, Britty Baby, Vinkle Srivastav, Pietro Mascagni, Nicolas Padoy

Main category: cs.CV

TL;DR: Endoshare is a surgeon-friendly application that standardizes and de-identifies endoscopic videos to address privacy concerns and heterogeneous recording formats in surgical video analysis.

Details

Motivation: Limited adoption of video-based surgical assessment due to heterogeneous recording formats and privacy concerns associated with video sharing.

Method: Iterative user-centered software development with cross-platform privacy-by-design architecture, including automated video merging, out-of-body removal, and filename pseudonymization.

Result: High usability scores from clinicians (4.68/5) and computer scientists (4.03/5), strong perceived usefulness (5.07/7) and ease of use (5.15/7), with processing time proportional to video duration.

Conclusion: Endoshare provides a publicly available solution for surgical video management with potential to support training, research, and quality improvement, though further certification and validation are needed.

Abstract: Video-based assessment and surgical data science can advance surgical training, research, and quality improvement, yet adoption remains limited by heterogeneous recording formats and privacy concerns linked to video sharing. This work develops, evaluates, and publicly releases Endoshare, a surgeon-friendly application that merges, standardizes, and de-identifies endoscopic videos. Development followed an iterative, user-centered software life cycle. In the analysis phase, an internal survey of four clinicians and four computer scientists, based on 10 usability heuristics, identified early requirements and guided a cross-platform, privacy-by-design architecture. Prototype testing reported high usability for clinicians (4.68 +/- 0.40 out of 5) and for computer scientists (4.03 +/- 0.51 out of 5), with the lowest score (4.00 +/- 0.93 out of 5) relating to label clarity, prompting interface refinement to streamline case selection, video merging, automated out-of-body removal, and filename pseudonymization. In the testing phase, ten surgeons completed an external survey combining the same heuristics with Technology Acceptance Model constructs, reporting high perceived usefulness (5.07 +/- 1.75 out of 7), ease of use (5.15 +/- 1.71 out of 7), heuristic usability (4.38 +/- 0.48 out of 5), and strong recommendation likelihood (9.20 +/- 0.79 out of 10). A performance assessment across different hardware and configurations showed that processing time increased proportionally with video duration and was consistently lower in fast mode. Endoshare is a publicly available solution to manage surgical videos, with potential to support training, research, and quality improvement. Compliance certification and broader interoperability validation are needed to establish it as a reliable tool for surgical video management. The software is available at https://camma-public.github.io/Endoshare
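
Illustrative sketch (not Endoshare's actual scheme): filename pseudonymization can be done by replacing the original name with a keyed hash, so the same case always maps to the same pseudonym without exposing patient or date information. The helper name `pseudonymize_filename`, the `video_` prefix, and the use of HMAC-SHA256 are assumptions chosen for this example.

```python
import hashlib
import hmac
from pathlib import Path

def pseudonymize_filename(filename, secret_key, length=12):
    """Return a pseudonymous name that keeps only the file extension (hypothetical helper)."""
    path = Path(filename)
    # A keyed hash prevents reversing the pseudonym without the secret key.
    digest = hmac.new(secret_key, path.stem.encode("utf-8"), hashlib.sha256).hexdigest()
    return f"video_{digest[:length]}{path.suffix.lower()}"

key = b"site-specific-secret"  # assumed to be stored securely by the institution
name_a = pseudonymize_filename("Doe_John_2024-01-05.MP4", key)
name_b = pseudonymize_filename("Doe_John_2024-01-05.MP4", key)
name_c = pseudonymize_filename("Roe_Jane_2024-02-11.MP4", key)
```

The mapping is deterministic for a given key, which lets a site re-link videos to cases internally while shared copies carry no identifying names.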

[332] Target-aware Image Editing via Cycle-consistent Constraints

Yanghao Wang, Zhen Wang, Long Chen

Main category: cs.CV

TL;DR: FlowCycle is a novel text-to-image editing framework that introduces target-aware intermediate states through learnable noise optimization and cycle-consistent training, achieving superior editing quality and consistency.

Details

Motivation: Current text-to-image editing methods use target-agnostic intermediate states that focus on source reconstruction but neglect semantic gaps towards specific editing targets, leading to limited editability and inconsistency when modifications deviate substantially from the source.

Method: FlowCycle parameterizes corruption with learnable noises and optimizes them through a cycle-consistent process that iteratively edits source to target and recovers back to source with dual consistency constraints, producing target-aware intermediate states.

Result: Extensive ablations demonstrate that FlowCycle achieves superior editing quality and consistency over state-of-the-art methods, enabling faithful modifications while preserving source consistency.

Conclusion: The proposed target-aware intermediate state approach through cycle-consistent optimization effectively addresses the limitations of current corruption-then-restoration paradigms in text-to-image editing.

Abstract: Recent advances in pre-trained text-to-image flow models have enabled remarkable progress in text-based image editing. Mainstream approaches always adopt a corruption-then-restoration paradigm, where the source image is first corrupted into an “intermediate state” and then restored to the target image under the prompt guidance. However, current methods construct this intermediate state in a target-agnostic manner, i.e., they primarily focus on realizing source image reconstruction while neglecting the semantic gaps towards the specific editing target. This design inherently results in limited editability or inconsistency when the desired modifications substantially deviate from the source. In this paper, we argue that the intermediate state should be target-aware, i.e., selectively corrupting editing-relevant contents while preserving editing-irrelevant ones. To this end, we propose FlowCycle, a novel inversion-free and flow-based editing framework that parameterizes corruption with learnable noises and optimizes them through a cycle-consistent process. By iteratively editing the source to the target and recovering back to the source with dual consistency constraints, FlowCycle learns to produce a target-aware intermediate state, enabling faithful modifications while preserving source consistency. Extensive ablations have demonstrated that FlowCycle achieves superior editing quality and consistency over state-of-the-art methods.

[333] LayerComposer: Multi-Human Personalized Generation via Layered Canvas

Guocheng Gordon Qian, Ruihang Zhang, Tsai-Shien Chen, Yusuf Dalva, Anujraaj Argo Goyal, Willi Menapace, Ivan Skorokhodov, Meng Dong, Arpit Sahni, Daniil Ostashev, Ju Hu, Sergey Tulyakov, Kuan-Chieh Jackson Wang

Main category: cs.CV

TL;DR: LayerComposer is an interactive framework for multi-human personalized image generation that provides spatial composition control through a layered canvas approach, enabling users to place and resize multiple subjects while maintaining identity preservation and scalability.

Details

Motivation: Existing personalized image generators lack interactive control over spatial composition and scale poorly to multiple humans, limiting their practical usability for complex multi-subject scenarios.

Method: Uses a layered canvas representation where each subject is on a distinct layer, with transparent latent pruning for scalability and layerwise cross-reference training to reduce copy-paste artifacts. Provides reference-based human injection for intuitive placement and resizing.

Result: Achieves superior spatial control, coherent composition, and identity preservation compared to state-of-the-art methods in multi-human personalized image generation, with improved scalability through computational cost decoupling.

Conclusion: LayerComposer successfully addresses the limitations of existing methods by providing interactive spatial control and scalable multi-human generation through its novel layered canvas approach and specialized training strategies.

Abstract: Despite their impressive visual fidelity, existing personalized image generators lack interactive control over spatial composition and scale poorly to multiple humans. To address these limitations, we present LayerComposer, an interactive and scalable framework for multi-human personalized generation. Inspired by professional image-editing software, LayerComposer provides intuitive reference-based human injection, allowing users to place and resize multiple subjects directly on a layered digital canvas to guide personalized generation. The core of our approach is the layered canvas, a novel representation where each subject is placed on a distinct layer, enabling interactive and occlusion-free composition. We further introduce a transparent latent pruning mechanism that improves scalability by decoupling computational cost from the number of subjects, and a layerwise cross-reference training strategy that mitigates copy-paste artifacts. Extensive experiments demonstrate that LayerComposer achieves superior spatial control, coherent composition, and identity preservation compared to state-of-the-art methods in multi-human personalized image generation.

[334] FastGS: Training 3D Gaussian Splatting in 100 Seconds

Shiwei Ren, Tianci Wen, Yongchun Fang, Biao Lu

Main category: cs.CV

TL;DR: FastGS is a novel acceleration framework for 3D Gaussian splatting that uses multi-view consistency to regulate Gaussian density, achieving 3.32× to 15.45× training speedup while maintaining comparable rendering quality.

Details

Motivation: Existing 3DGS acceleration methods fail to properly regulate Gaussian numbers during training, causing redundant computational overhead and inefficiency.

Method: Proposes a densification and pruning strategy based on multi-view consistency, eliminating the need for budgeting mechanisms and efficiently managing Gaussian importance.

Result: Achieves 3.32× training acceleration compared to DashGaussian on Mip-NeRF 360 and 15.45× acceleration compared to vanilla 3DGS on Deep Blending, with strong generality across various tasks (2-7× acceleration).

Conclusion: FastGS provides a simple yet effective framework that significantly accelerates 3DGS training while maintaining rendering quality, demonstrating strong performance across multiple reconstruction tasks.

Abstract: The dominant 3D Gaussian splatting (3DGS) acceleration methods fail to properly regulate the number of Gaussians during training, causing redundant computational time overhead. In this paper, we propose FastGS, a novel, simple, and general acceleration framework that fully considers the importance of each Gaussian based on multi-view consistency, efficiently solving the trade-off between training time and rendering quality. We innovatively design a densification and pruning strategy based on multi-view consistency, dispensing with the budgeting mechanism. Extensive experiments on Mip-NeRF 360, Tanks & Temples, and Deep Blending datasets demonstrate that our method significantly outperforms the state-of-the-art methods in training speed, achieving a 3.32× training acceleration and comparable rendering quality compared with DashGaussian on the Mip-NeRF 360 dataset and a 15.45× acceleration compared with vanilla 3DGS on the Deep Blending dataset. We demonstrate that FastGS exhibits strong generality, delivering 2-7× training acceleration across various tasks, including dynamic scene reconstruction, surface reconstruction, sparse-view reconstruction, large-scale reconstruction, and simultaneous localization and mapping. The project page is available at https://fastgs.github.io/
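
Illustrative sketch (not the FastGS criterion): one simple way to picture pruning by multi-view consistency is to keep a Gaussian only if it contributes meaningfully in several training views. The score matrix, the threshold, the minimum view count, and the helper name `multiview_keep_mask` are all assumptions for this example; the paper defines its own importance measure.

```python
import numpy as np

def multiview_keep_mask(view_scores, score_threshold, min_views):
    """Boolean keep-mask over Gaussians (hypothetical helper).

    view_scores[i, j] is an assumed contribution score of Gaussian i in view j.
    """
    view_scores = np.asarray(view_scores, dtype=float)
    # Count the views in which each Gaussian exceeds the threshold.
    views_supported = (view_scores > score_threshold).sum(axis=1)
    return views_supported >= min_views

toy_scores = [
    [0.9, 0.8, 0.7],  # consistently useful across views: keep
    [0.9, 0.0, 0.0],  # useful in a single view only: prune
    [0.0, 0.0, 0.0],  # never useful: prune
]
keep = multiview_keep_mask(toy_scores, score_threshold=0.5, min_views=2)
```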

[335] DBGroup: Dual-Branch Point Grouping for Weakly Supervised 3D Semantic Instance Segmentation

Xuexun Liu, Xiaoxu Xu, Qiudan Zhang, Lin Ma, Xu Wang

Main category: cs.CV

TL;DR: DBGroup is a weakly supervised 3D instance segmentation framework that uses scene-level annotations instead of point-level or bounding box annotations, achieving competitive performance with reduced labeling effort.

Details

Motivation: To address the high annotation costs and labor-intensive processes of existing weakly supervised 3D instance segmentation methods that rely on point-level or bounding box annotations.

Method: Two-stage framework: (1) Dual-Branch Point Grouping module generates pseudo labels using semantic and mask cues from multi-view images, with refinement strategies; (2) Multi-round self-training on end-to-end instance segmentation network with Instance Mask Filter to handle pseudo label inconsistencies.

Result: Achieves competitive performance compared to sparse-point-level supervised methods and surpasses state-of-the-art scene-level supervised 3D semantic segmentation approaches.

Conclusion: DBGroup provides an efficient and scalable alternative for 3D instance segmentation using scene-level annotations, significantly reducing labeling effort while maintaining competitive performance.

Abstract: Weakly supervised 3D instance segmentation is essential for 3D scene understanding, especially given the growing scale of data and the high annotation costs associated with fully supervised approaches. Existing methods primarily rely on two forms of weak supervision: one-thing-one-click annotations and bounding box annotations, both of which aim to reduce labeling efforts. However, these approaches still encounter limitations, including labor-intensive annotation processes, high complexity, and reliance on expert annotators. To address these challenges, we propose DBGroup, a two-stage weakly supervised 3D instance segmentation framework that leverages scene-level annotations as a more efficient and scalable alternative. In the first stage, we introduce a Dual-Branch Point Grouping module to generate pseudo labels guided by semantic and mask cues extracted from multi-view images. To further improve label quality, we develop two refinement strategies: Granularity-Aware Instance Merging and Semantic Selection and Propagation. The second stage involves multi-round self-training on an end-to-end instance segmentation network using the refined pseudo-labels. Additionally, we introduce an Instance Mask Filter strategy to address inconsistencies within the pseudo labels. Extensive experiments demonstrate that DBGroup achieves competitive performance compared to sparse-point-level supervised 3D instance segmentation methods, while surpassing state-of-the-art scene-level supervised 3D semantic segmentation approaches. Code is available at https://github.com/liuxuexun/DBGroup.

[336] ManipShield: A Unified Framework for Image Manipulation Detection, Localization and Explanation

Zitong Xu, Huiyu Duan, Xiaoyu Wang, Zhaolin Cai, Kaiwei Zhang, Qiang Hu, Jing Liu, Xiongkuo Min, Guangtao Zhai

Main category: cs.CV

TL;DR: ManipBench is a large-scale benchmark for AI-edited image manipulation detection and localization, addressing limitations of existing benchmarks through diverse content, broad model coverage, and interpretability features.

Details

Motivation: Existing image manipulation detection benchmarks suffer from limited content diversity, narrow generative-model coverage, and insufficient interpretability, hindering generalization and explanation capabilities of detection methods.

Method: Created ManipBench with 450K+ manipulated images from 25 state-of-the-art editing models across 12 categories, with 100K images annotated with bounding boxes, judgment cues, and textual explanations. Proposed ManipShield, an MLLM-based model using contrastive LoRA fine-tuning and task-specific decoders for unified detection, localization, and explanation.

Result: ManipShield achieves state-of-the-art performance on ManipBench and public datasets, demonstrating strong generalization to unseen manipulation models.

Conclusion: Both ManipBench and ManipShield address critical gaps in AI-edited image manipulation detection, providing comprehensive benchmarks and effective detection solutions with strong generalization capabilities.

Abstract: With the rapid advancement of generative models, powerful image editing methods now enable diverse and highly realistic image manipulations that far surpass traditional deepfake techniques, posing new challenges for manipulation detection. Existing image manipulation detection and localization (IMDL) benchmarks suffer from limited content diversity, narrow generative-model coverage, and insufficient interpretability, which hinders the generalization and explanation capabilities of current manipulation detection methods. To address these limitations, we introduce ManipBench, a large-scale benchmark for image manipulation detection and localization focusing on AI-edited images. ManipBench contains over 450K manipulated images produced by 25 state-of-the-art image editing models across 12 manipulation categories, among which 100K images are further annotated with bounding boxes, judgment cues, and textual explanations to support interpretable detection. Building upon ManipBench, we propose ManipShield, an all-in-one model based on a Multimodal Large Language Model (MLLM) that leverages contrastive LoRA fine-tuning and task-specific decoders to achieve unified image manipulation detection, localization, and explanation. Extensive experiments on ManipBench and several public datasets demonstrate that ManipShield achieves state-of-the-art performance and exhibits strong generality to unseen manipulation models. Both ManipBench and ManipShield will be released upon publication.

[337] 3D-Guided Scalable Flow Matching for Generating Volumetric Tissue Spatial Transcriptomics from Serial Histology

Mohammad Vali Sanian, Arshia Hemmat, Amirhossein Vahidi, Jonas Maaskola, Jimmy Tsz Hang Lee, Stanislaw Makarchuk, Yeliz Demirci, Nana-Jane Chipampe, Muzlifah Haniffa, Omer Bayraktar, Lassi Paavolainen, Mohammad Lotfollahi

Main category: cs.CV

TL;DR: HoloTea is a 3D-aware flow-matching framework that imputes spot-level gene expression from H&E histology by leveraging information from adjacent tissue sections, enabling more accurate 3D spatial transcriptomics reconstruction.

Details

Motivation: Existing methods either treat tissue sections independently (ignoring 3D structure) or are not generative/scalable for 3D spatial transcriptomics, limiting holistic understanding of tissue organization.

Method: Uses a 3D-aware flow-matching framework with cross-section context fusion via ControlNet, retrieves morphologically corresponding spots, combines learned ZINB prior with spatial-empirical prior, and employs global attention for linear scaling with spot count.

Result: Consistently improves 3D expression accuracy and generalization across three spatial transcriptomics datasets spanning different tissue types and resolutions compared to 2D and 3D baselines.

Conclusion: HoloTea advances creation of accurate 3D virtual tissues, accelerating biomarker discovery and deepening understanding of disease through improved 3D spatial transcriptomics reconstruction.

Abstract: A scalable and robust 3D tissue transcriptomics profile can enable a holistic understanding of tissue organization and provide deeper insights into human biology and disease. Most predictive algorithms that infer ST directly from histology treat each section independently and ignore 3D structure, while existing 3D-aware approaches are not generative and do not scale well. We present Holographic Tissue Expression Inpainting and Analysis (HoloTea), a 3D-aware flow-matching framework that imputes spot-level gene expression from H&E while explicitly using information from adjacent sections. Our key idea is to retrieve morphologically corresponding spots on neighboring slides in a shared feature space and fuse this cross section context into a lightweight ControlNet, allowing conditioning to follow anatomical continuity. To better capture the count nature of the data, we introduce a 3D-consistent prior for flow matching that combines a learned zero-inflated negative binomial (ZINB) prior with a spatial-empirical prior constructed from neighboring sections. A global attention block introduces 3D H&E scaling linearly with the number of spots in the slide, enabling training and inference on large 3D ST datasets. Across three spatial transcriptomics datasets spanning different tissue types and resolutions, HoloTea consistently improves 3D expression accuracy and generalization compared to 2D and 3D baselines. We envision HoloTea advancing the creation of accurate 3D virtual tissues, ultimately accelerating biomarker discovery and deepening our understanding of disease.
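
Illustrative sketch (not from the paper): the zero-inflated negative binomial (ZINB) mentioned above mixes a point mass at zero with a negative binomial count distribution. The code below uses one common parameterization (mean `mu`, inverse dispersion `theta`, zero-inflation probability `pi`); the parameter names and this particular form are assumptions, and the paper's learned prior may be parameterized differently.

```python
import math

def zinb_pmf(k, mu, theta, pi):
    """P(X = k) for a ZINB with NB mean mu > 0, inverse dispersion theta > 0, zero-inflation pi."""
    # Negative binomial log-probability, computed with log-gamma for stability.
    log_nb = (
        math.lgamma(k + theta) - math.lgamma(theta) - math.lgamma(k + 1)
        + theta * math.log(theta / (theta + mu))
        + k * math.log(mu / (theta + mu))
    )
    nb = math.exp(log_nb)
    # Zeros come either from the inflation component or from the NB itself.
    return pi + (1.0 - pi) * nb if k == 0 else (1.0 - pi) * nb

p_zero = zinb_pmf(0, mu=3.0, theta=2.0, pi=0.3)
total = sum(zinb_pmf(k, mu=3.0, theta=2.0, pi=0.3) for k in range(500))
```

The extra mass at zero is what lets such a prior model the many empty counts typical of spot-level expression data.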

[338] Click2Graph: Interactive Panoptic Video Scene Graphs from a Single Click

Raphael Ruschel, Hardikkumar Prajapati, Awsafur Rahman, B. S. Manjunath

Main category: cs.CV

TL;DR: Click2Graph is the first interactive framework for Panoptic Video Scene Graph Generation that unifies visual prompting with spatial, temporal, and semantic understanding, enabling users to generate scene graphs from simple cues like clicks or bounding boxes.

Details

Motivation: Current VSGG systems operate as closed pipelines without human guidance, while promptable segmentation models lack semantic reasoning. There's a need to combine precise user interaction with comprehensive scene understanding.

Method: Uses a Dynamic Interaction Discovery Module to generate subject-conditioned object prompts and a Semantic Classification Head for joint entity and predicate reasoning. From user cues, it segments/tracks subjects, discovers interacting objects, and predicts triplets.

Result: Experiments on OpenPVSG benchmark demonstrate strong performance for user-guided PVSG, showing effective combination of human prompting with panoptic grounding and relational inference.

Conclusion: Click2Graph establishes a foundation for controllable and interpretable video scene understanding by unifying human prompting with comprehensive scene graph generation capabilities.

Abstract: State-of-the-art Video Scene Graph Generation (VSGG) systems provide structured visual understanding but operate as closed, feed-forward pipelines with no ability to incorporate human guidance. In contrast, promptable segmentation models such as SAM2 enable precise user interaction but lack semantic or relational reasoning. We introduce Click2Graph, the first interactive framework for Panoptic Video Scene Graph Generation (PVSG) that unifies visual prompting with spatial, temporal, and semantic understanding. From a single user cue, such as a click or bounding box, Click2Graph segments and tracks the subject across time, autonomously discovers interacting objects, and predicts <subject, object, predicate> triplets to form a temporally consistent scene graph. Our framework introduces two key components: a Dynamic Interaction Discovery Module that generates subject-conditioned object prompts, and a Semantic Classification Head that performs joint entity and predicate reasoning. Experiments on the OpenPVSG benchmark demonstrate that Click2Graph establishes a strong foundation for user-guided PVSG, showing how human prompting can be combined with panoptic grounding and relational inference to enable controllable and interpretable video scene understanding.
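
Illustrative sketch (not the Click2Graph implementation): a temporally grounded scene graph can be stored as tracked entities plus relations that are valid over a frame interval, from which <subject, object, predicate> triplets are read off per frame. The class and field names here are hypothetical.

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class Relation:
    subject_id: int   # track id of the prompted subject
    object_id: int    # track id of a discovered interacting object
    predicate: str
    start_frame: int
    end_frame: int    # inclusive

@dataclass
class VideoSceneGraph:
    labels: dict = field(default_factory=dict)     # track id -> class name
    relations: list = field(default_factory=list)

    def add_relation(self, relation):
        self.relations.append(relation)

    def triplets_at(self, frame):
        """Return (subject, object, predicate) label triplets active at `frame`."""
        return [
            (self.labels[r.subject_id], self.labels[r.object_id], r.predicate)
            for r in self.relations
            if r.start_frame <= frame <= r.end_frame
        ]

graph = VideoSceneGraph(labels={0: "person", 1: "cup"})
graph.add_relation(Relation(0, 1, "holding", start_frame=10, end_frame=40))
```

Querying `graph.triplets_at(20)` yields the single active triplet, while a frame outside the interval yields an empty list.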

[339] SuperQuadricOcc: Multi-Layer Gaussian Approximation of Superquadrics for Real-Time Self-Supervised Occupancy Estimation

Seamie Hayes, Reenu Mohandas, Tim Brophy, Alexandre Boulch, Ganesh Sistu, Ciaran Eising

Main category: cs.CV

TL;DR: SuperQuadricOcc introduces superquadric-based scene representation for semantic occupancy estimation, achieving 75% memory reduction, 124% faster inference, and 5.9% mIoU improvement over Gaussian methods.

Details

Motivation: Gaussian representations in occupancy estimation require many primitives, increasing memory and hindering real-time inference. Superquadrics offer reduced primitive count and lower memory but lack rasterizers for supervision.

Method: Uses superquadric-based scene representation with multi-layer icosphere-tessellated Gaussian approximation to enable Gaussian rasterization for training supervision.

Result: 75% memory footprint reduction, 124% faster inference, 5.9% mIoU improvement on Occ3D dataset, 84% fewer primitives needed compared to Gaussian methods.

Conclusion: First occupancy model enabling real-time inference while maintaining competitive performance, with superquadrics significantly reducing primitive requirements.

Abstract: Semantic occupancy estimation enables comprehensive scene understanding for automated driving, providing dense spatial and semantic information essential for perception and planning. While Gaussian representations have been widely adopted in self-supervised occupancy estimation, the deployment of a large number of Gaussian primitives drastically increases memory requirements and is not suitable for real-time inference. In contrast, superquadrics permit reduced primitive count and lower memory requirements due to their diverse shape set. However, implementation into a self-supervised occupancy model is nontrivial due to the absence of a superquadric rasterizer to enable model supervision. Our proposed method, SuperQuadricOcc, employs a superquadric-based scene representation. By leveraging a multi-layer icosphere-tessellated Gaussian approximation of superquadrics, we enable Gaussian rasterization for supervision during training. On the Occ3D dataset, SuperQuadricOcc achieves a 75% reduction in memory footprint, 124% faster inference, and a 5.9% improvement in mIoU compared to previous Gaussian-based methods, without the use of temporal labels. To our knowledge, this is the first occupancy model to enable real-time inference while maintaining competitive performance. The use of superquadrics reduces the number of primitives required for scene modeling by 84% relative to Gaussian-based approaches. Finally, evaluation against prior methods is facilitated by our fast superquadric voxelization module. The code will be made available at https://github.com/seamie6/SuperQuadricOcc.
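
Illustrative sketch (not the paper's voxelization module): a superquadric in its own axis-aligned frame is commonly described by the inside-outside function F(x, y, z) = ((x/a1)^(2/e2) + (y/a2)^(2/e2))^(e2/e1) + (z/a3)^(2/e1), with F < 1 inside and F > 1 outside. The code below evaluates F on a regular grid to produce an occupancy mask. The function names, the omission of rotation and translation, and the grid layout are assumptions for this example.

```python
import numpy as np

def superquadric_inside_outside(points, scale, eps1, eps2):
    """Evaluate the superquadric inside-outside function at points in the primitive's frame."""
    # Absolute values keep fractional powers real for negative coordinates.
    p = np.abs(np.asarray(points, dtype=float)) / np.asarray(scale, dtype=float)
    xy = (p[..., 0] ** (2.0 / eps2) + p[..., 1] ** (2.0 / eps2)) ** (eps2 / eps1)
    return xy + p[..., 2] ** (2.0 / eps1)

def voxelize(scale, eps1, eps2, grid_min, grid_max, resolution):
    """Boolean occupancy grid: True where the sample point lies inside or on the surface."""
    axis = np.linspace(grid_min, grid_max, resolution)
    grid = np.stack(np.meshgrid(axis, axis, axis, indexing="ij"), axis=-1)
    return superquadric_inside_outside(grid, scale, eps1, eps2) <= 1.0

# With eps1 = eps2 = 1 the superquadric reduces to an ellipsoid (here a unit sphere).
occupancy = voxelize(scale=(1.0, 1.0, 1.0), eps1=1.0, eps2=1.0,
                     grid_min=-1.0, grid_max=1.0, resolution=3)
```

Varying `eps1` and `eps2` moves the shape between box-like, ellipsoidal, and pinched forms, which is the shape diversity the abstract credits for the lower primitive count.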

[340] Video-R4: Reinforcing Text-Rich Video Reasoning with Visual Rumination

Yolo Y. Tang, Daiki Shimada, Hang Hua, Chao Huang, Jing Bi, Rogerio Feris, Chenliang Xu

Main category: cs.CV

TL;DR: Video-R4 introduces visual rumination - an iterative process of selecting frames, zooming into regions, and re-encoding pixels to improve reasoning in text-rich videos.

Details

Motivation: Current video QA models fail on fine-grained evidence because they use single-pass perception over fixed frames, leading to hallucinations. Humans naturally pause, zoom, and re-read critical regions.

Method: Multi-stage framework that progressively finetunes a 7B LMM using supervised practice (Video-R4-CoT-17k) and reinforcement learning (Video-R4-RL-30k) to learn atomic and mixing visual operations via SFT and GRPO-based RL.

Result: Video-R4-7B achieves state-of-the-art results on M4-ViteVQA and generalizes to multi-page document QA, slides QA, and generic video QA.

Conclusion: Iterative rumination is an effective paradigm for pixel-grounded multimodal reasoning across various text-rich visual domains.

Abstract: Understanding text-rich videos requires reading small, transient textual cues that often demand repeated inspection. Yet most video QA models rely on single-pass perception over fixed frames, leading to hallucinations and failures on fine-grained evidence. Inspired by how humans pause, zoom, and re-read critical regions, we introduce Video-R4 (Reinforcing Text-Rich Video Reasoning with Visual Rumination), a video reasoning LMM that performs visual rumination: iteratively selecting frames, zooming into informative regions, re-encoding retrieved pixels, and updating its reasoning state. We construct two datasets with executable rumination trajectories: Video-R4-CoT-17k for supervised practice and Video-R4-RL-30k for reinforcement learning. We propose a multi-stage rumination learning framework that progressively finetunes a 7B LMM to learn atomic and mixing visual operations via SFT and GRPO-based RL. Video-R4-7B achieves state-of-the-art results on M4-ViteVQA and further generalizes to multi-page document QA, slides QA, and generic video QA, demonstrating that iterative rumination is an effective paradigm for pixel-grounded multimodal reasoning. Project Page: https://yunlong10.github.io/Video-R4/

[341] X-ReID: Multi-granularity Information Interaction for Video-Based Visible-Infrared Person Re-Identification

Chenyang Yu, Xuehu Liu, Pingping Zhang, Huchuan Lu

Main category: cs.CV

TL;DR: X-ReID is a cross-modality framework for Video-based Visible-Infrared Person Re-Identification that addresses modality gaps and leverages spatiotemporal information through Cross-modality Prototype Collaboration and Multi-granularity Information Interaction.

Details

Motivation: Large-scale vision-language models like CLIP show strong retrieval performance but their potential for VVI-ReID remains unexplored, particularly in addressing modality gaps and utilizing spatiotemporal information in video sequences.

Method: Proposes Cross-modality Prototype Collaboration (CPC) to align and integrate features from different modalities, and Multi-granularity Information Interaction (MII) that incorporates short-term interactions, long-term cross-frame fusion, and cross-modality alignment.

Result: Extensive experiments on HITSZ-VCM and BUPTCampus benchmarks demonstrate superiority over state-of-the-art methods, achieving robust sequence-level representations.

Conclusion: X-ReID effectively addresses modality gaps and leverages spatiotemporal information in VVI-ReID, showing superior performance on large-scale benchmarks.

Abstract: Large-scale vision-language models (e.g., CLIP) have recently achieved remarkable performance in retrieval tasks, yet their potential for Video-based Visible-Infrared Person Re-Identification (VVI-ReID) remains largely unexplored. The primary challenges are narrowing the modality gap and leveraging spatiotemporal information in video sequences. To address the above issues, in this paper, we propose a novel cross-modality feature learning framework named X-ReID for VVI-ReID. Specifically, we first propose a Cross-modality Prototype Collaboration (CPC) to align and integrate features from different modalities, guiding the network to reduce the modality discrepancy. Then, a Multi-granularity Information Interaction (MII) is designed, incorporating short-term interactions from adjacent frames, long-term cross-frame information fusion, and cross-modality feature alignment to enhance temporal modeling and further reduce modality gaps. Finally, by integrating multi-granularity information, a robust sequence-level representation is achieved. Extensive experiments on two large-scale VVI-ReID benchmarks (i.e., HITSZ-VCM and BUPTCampus) demonstrate the superiority of our method over state-of-the-art methods. The source code is released at https://github.com/AsuradaYuci/X-ReID.

[342] SatSAM2: Motion-Constrained Video Object Tracking in Satellite Imagery using Promptable SAM2 and Kalman Priors

Ruijie Fan, Junyan Ye, Huan Chen, Zilong Huang, Xiaolei Wang, Weijia Li

Main category: cs.CV

TL;DR: SatSAM2 is a zero-shot satellite video tracker that adapts SAM2 foundation model for remote sensing, using motion constraints and state management to handle occlusion and improve generalization without scenario-specific training.

Details

Motivation: Existing satellite tracking methods lack generalization, require scenario-specific training, and fail under occlusion conditions.

Method: Built on SAM2 foundation model with two modules: Kalman Filter-based Constrained Motion Module (KFCMM) for temporal motion cues and drift suppression, and Motion-Constrained State Machine (MCSM) for tracking state regulation based on motion dynamics.

Result: Outperforms traditional and foundation model-based trackers on two satellite tracking benchmarks and MVOT, achieving a 5.84% AUC improvement on the OOTB dataset. Also introduces MVOT, a synthetic benchmark with 1,500+ sequences and 157K annotated frames.

Conclusion: SatSAM2 effectively adapts foundation models to remote sensing domain, demonstrating superior zero-shot tracking performance and generalization across diverse satellite video conditions.

Abstract: Existing satellite video tracking methods often struggle with generalization, requiring scenario-specific training to achieve satisfactory performance, and are prone to track loss in the presence of occlusion. To address these challenges, we propose SatSAM2, a zero-shot satellite video tracker built on SAM2, designed to adapt foundation models to the remote sensing domain. SatSAM2 introduces two core modules: a Kalman Filter-based Constrained Motion Module (KFCMM) to exploit temporal motion cues and suppress drift, and a Motion-Constrained State Machine (MCSM) to regulate tracking states based on motion dynamics and reliability. To support large-scale evaluation, we propose MatrixCity Video Object Tracking (MVOT), a synthetic benchmark containing 1,500+ sequences and 157K annotated frames with diverse viewpoints, illumination, and occlusion conditions. Extensive experiments on two satellite tracking benchmarks and MVOT show that SatSAM2 outperforms both traditional and foundation model-based trackers, including SAM2 and its variants. Notably, on the OOTB dataset, SatSAM2 achieves a 5.84% AUC improvement over state-of-the-art methods. Our code and dataset will be publicly released to encourage further research.
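
To illustrate how a Kalman prior can carry a track through occlusion, here is a minimal sketch of a constant-velocity Kalman filter over a 2D object center. It is a generic textbook filter, not the KFCMM module itself: the class name, the noise parameters, and the constant-velocity motion model are all assumptions made for illustration.

```python
import numpy as np

class ConstantVelocityKalman:
    """Tracks state [x, y, vx, vy] from noisy (x, y) measurements."""

    def __init__(self, x, y, process_var=1e-2, meas_var=1.0):
        self.state = np.array([x, y, 0.0, 0.0])
        self.P = np.eye(4) * 10.0                  # initial uncertainty
        self.F = np.array([[1, 0, 1, 0],           # position += velocity
                           [0, 1, 0, 1],
                           [0, 0, 1, 0],
                           [0, 0, 0, 1]], dtype=float)
        self.H = np.array([[1, 0, 0, 0],           # only position is observed
                           [0, 1, 0, 0]], dtype=float)
        self.Q = np.eye(4) * process_var
        self.R = np.eye(2) * meas_var

    def predict(self):
        """Advance one frame; usable alone when the object is occluded."""
        self.state = self.F @ self.state
        self.P = self.F @ self.P @ self.F.T + self.Q
        return self.state[:2]

    def update(self, z):
        """Correct the prediction with a measured center (x, y)."""
        z = np.asarray(z, dtype=float)
        innovation = z - self.H @ self.state
        S = self.H @ self.P @ self.H.T + self.R
        K = self.P @ self.H.T @ np.linalg.inv(S)
        self.state = self.state + K @ innovation
        self.P = (np.eye(4) - K @ self.H) @ self.P
        return self.state[:2]
```

In a tracker, `predict()` would be called every frame and `update()` only when the detector or mask output is judged reliable, so the motion estimate can bridge frames in which the target is hidden.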

[343] Alternating Perception-Reasoning for Hallucination-Resistant Video Understanding

Bowei Pu, Chuanbin Liu, Yifan Ge, Peicheng Zhou, Yiwei Sun, Zhiying Lu, Jiankang Wang, Hongtao Xie

Main category: cs.CV

TL;DR: Video-PLR introduces a loop-based perception paradigm with anti-hallucination rewards to address perception shortcuts and hallucinations in video reasoning models.

Details

Motivation: Existing Video Reasoning LLMs suffer from perception shortcuts and hallucinations due to flawed single-step perception paradigms that risk insufficient evidence.

Method: Proposes Perception Loop Reasoning (PLR) paradigm with iterative video segment analysis and Factual-Aware Evaluator (FAE) for anti-hallucination rewards using the AnetHallu-117K dataset.

Result: Achieves state-of-the-art performance at both the 3B and 7B parameter scales with the best data efficiency; the FAE performs comparably to GPT-4o.

Conclusion: The loop-based paradigm with anti-hallucination rewards effectively addresses perception limitations and hallucinations in video reasoning, achieving superior performance and data efficiency.

Abstract: Sufficient visual perception is the foundation of video reasoning. Nevertheless, existing Video Reasoning LLMs suffer from perception shortcuts, relying on a flawed single-step perception paradigm. This paradigm describes the video and then conducts reasoning, which runs the risk of insufficient evidence and emergent hallucinations. To address these issues, we introduce a new framework that integrates a loop-based paradigm with an anti-hallucination reward. First, to address the insufficient evidence, we introduce the Perception Loop Reasoning (PLR) paradigm. Instead of describing the video at once, each loop requires the model to describe a video segment with precise timestamps, analyze this segment, and decide the next action. Second, for the risk of hallucinations, the Factual-Aware Evaluator (FAE) evaluates each perception result as a reliable anti-hallucination reward. This reward encourages the model to provide sufficient and precise video evidence. Our FAE, which performs comparably to GPT-4o, is tuned on our AnetHallu-117K, a large-scale hallucination judgment preference dataset. Extensive experiments show that our Video-PLR achieves the state-of-the-art in both 3B and 7B parameter scales and has the best data efficiency. Our code, models, and datasets are released on: https://github.com/BoweiPu/VideoPLR.
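
The loop described above (describe a timestamped segment, analyze it, decide the next action) can be sketched as a simple control loop. This is a schematic under stated assumptions: `Perception`, `perception_loop`, the `step` callable, and the `"answer"` action label are hypothetical names standing in for the model, and no reward or evaluator is modeled.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

@dataclass
class Perception:
    start: float        # segment start time in seconds
    end: float          # segment end time in seconds
    description: str    # what the model reports seeing in the segment

def perception_loop(
    step: Callable[[List[Perception]], Tuple[Perception, str]],
    max_loops: int = 8,
) -> List[Perception]:
    """Collect segment-level evidence until the model chooses to answer.

    `step` is a hypothetical stand-in for one model call: given the
    evidence so far, it returns a new perception and the next action.
    """
    evidence: List[Perception] = []
    for _ in range(max_loops):
        perception, action = step(evidence)
        evidence.append(perception)
        if action == "answer":
            break
    return evidence
```

The `max_loops` cap is a design choice of this sketch so that a model that never chooses to answer still terminates.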

[344] Disc3D: Automatic Curation of High-Quality 3D Dialog Data via Discriminative Object Referring

Siyuan Wei, Chunjie Wang, Xiao Liu, Xiaosheng Yan, Zhishan Zhou, Rui Huang

Main category: cs.CV

TL;DR: A fully automated pipeline that converts raw 3D scans into high-quality dialogue data to address the scarcity of 3D scene-dialogue datasets, resolving viewpoint and object referring ambiguities.

Details

Motivation: 3D MLLMs lag behind 2D peers due to lack of large-scale, high-quality 3D scene-dialogue datasets. Prior methods rely on expensive human annotation and fail to resolve viewpoint ambiguity and object referring ambiguity.

Method: A four-stage automated pipeline: (1) meta-annotation collection for object-, frame-, and scene-level captions, (2) scene graph construction with relation correction, (3) discriminative object referring for exclusive descriptions, (4) multi-task data generation for diverse dialogues. Uses rule-based constraints with 2D MLLMs and LLMs.

Result: Produces Disc3D dataset with over 2 million samples in 25K hybrid 3D scenes, covering scene/view/object captioning, visual grounding, and five object-centric QA tasks. Training with Disc3D yields consistent, significant improvements on benchmarks and Disc3D-QA tasks.

Conclusion: The automated pipeline successfully generates high-quality 3D dialogue data at low cost, systematically mitigates dataset flaws, and enables scalable 3D MLLM training with demonstrated performance improvements.

Abstract: 3D Multi-modal Large Language Models (MLLMs) still lag behind their 2D peers, largely because large-scale, high-quality 3D scene-dialogue datasets remain scarce. Prior efforts hinge on expensive human annotation and leave two key ambiguities unresolved: viewpoint ambiguity, where spatial language presumes unknown camera poses, and object referring ambiguity, where non-exclusive descriptions blur the line between targets and distractors. We therefore present a fully automated pipeline that converts raw 3D scans into unambiguous, high-quality dialogue data at a fraction of the previous cost. By synergizing rule-based constraints with 2D MLLMs and LLMs, the pipeline enables controllable, scalable generation without human intervention. The pipeline comprises four stages: (1) meta-annotation collection harvesting object-, frame-, and scene-level captions, (2) scene graph construction with relation correction to capture proximal object relations, (3) discriminative object referring that generates exclusive and compact descriptions, and (4) multi-task data generation synthesizing diverse dialogues. Our pipeline systematically mitigates inherent flaws in source datasets and produces the final Disc3D dataset, over 2 million samples in 25K hybrid 3D scenes, spanning scene, view, and object captioning, visual grounding, and five object-centric QA tasks. Extensive experiments demonstrate that training with Disc3D yields consistent, significant improvements on both public benchmarks and our multifaceted Disc3D-QA tasks. Code, data, and models will be publicly available.

[345] HunyuanVideo 1.5 Technical Report

Bing Wu, Chang Zou, Changlin Li, Duojun Huang, Fang Yang, Hao Tan, Jack Peng, Jianbing Wu, Jiangfeng Xiong, Jie Jiang, Linus, Patrol, Peizhen Zhang, Peng Chen, Penghao Zhao, Qi Tian, Songtao Liu, Weijie Kong, Weiyan Wang, Xiao He, Xin Li, Xinchi Deng, Xuefei Zhe, Yang Li, Yanxin Long, Yuanbo Peng, Yue Wu, Yuhong Liu, Zhenyu Wang, Zuozhuo Dai, Bo Peng, Coopers Li, Gu Gong, Guojian Xiao, Jiahe Tian, Jiaxin Lin, Jie Liu, Jihong Zhang, Jiesong Lian, Kaihang Pan, Lei Wang, Lin Niu, Mingtao Chen, Mingyang Chen, Mingzhe Zheng, Miles Yang, Qiangqiang Hu, Qi Yang, Qiuyong Xiao, Runzhou Wu, Ryan Xu, Rui Yuan, Shanshan Sang, Shisheng Huang, Siruis Gong, Shuo Huang, Weiting Guo, Xiang Yuan, Xiaojia Chen, Xiawei Hu, Wenzhi Sun, Xiele Wu, Xianshun Ren, Xiaoyan Yuan, Xiaoyue Mi, Yepeng Zhang, Yifu Sun, Yiting Lu, Yitong Li, You Huang, Yu Tang, Yixuan Li, Yuhang Deng, Yuan Zhou, Zhichao Hu, Zhiguang Liu, Zhihe Yang, Zilin Yang, Zhenzhi Lu, Zixiang Zhou, Zhao Zhong

Main category: cs.CV

TL;DR: HunyuanVideo 1.5 is a lightweight 8.3B parameter video generation model that achieves SOTA quality and motion coherence while enabling efficient inference on consumer GPUs.

Details

Motivation: To create an open-source video generation model that is both high-quality and computationally efficient, making advanced video generation accessible to broader audiences.

Method: Uses meticulous data curation, advanced DiT architecture with selective and sliding tile attention (SSTA), glyph-aware text encoding, progressive pre-training and post-training, and efficient video super-resolution network.

Result: Achieves state-of-the-art visual quality and motion coherence among open-source video generation models with only 8.3B parameters, enabling efficient inference on consumer-grade GPUs.

Conclusion: The model provides a high-performance foundation that lowers barriers to video creation and research, with all code and weights publicly available to the community.

Abstract: We present HunyuanVideo 1.5, a lightweight yet powerful open-source video generation model that achieves state-of-the-art visual quality and motion coherence with only 8.3 billion parameters, enabling efficient inference on consumer-grade GPUs. This achievement is built upon several key components, including meticulous data curation, an advanced DiT architecture featuring selective and sliding tile attention (SSTA), enhanced bilingual understanding through glyph-aware text encoding, progressive pre-training and post-training, and an efficient video super-resolution network. Leveraging these designs, we developed a unified framework capable of high-quality text-to-video and image-to-video generation across multiple durations and resolutions. Extensive experiments demonstrate that this compact and proficient model establishes a new state-of-the-art among open-source video generation models. By releasing the code and model weights, we provide the community with a high-performance foundation that lowers the barrier to video creation and research, making advanced video generation accessible to a broader audience. All open-source assets are publicly available at https://github.com/Tencent-Hunyuan/HunyuanVideo-1.5.

[346] Human-Centric Open-Future Task Discovery: Formulation, Benchmark, and Scalable Tree-Based Search

Zijian Song, Xiaoxin Lin, Tao Pu, Zhenlong Yuan, Guangrun Wang, Liang Lin

Main category: cs.CV

TL;DR: The paper introduces Human-centric Open-future Task Discovery (HOTD) to help LMMs identify tasks that reduce human effort in dynamic scenarios, proposes HOTD-Bench with 2K+ videos, and presents CMAST framework that outperforms existing LMMs.

Details

Motivation: Advance LMMs to discover tasks that assist humans in open-future scenarios where human intentions are concurrent and dynamic, focusing on reducing human effort across multiple plausible futures.

Method: Proposes Collaborative Multi-Agent Search Tree (CMAST) framework that uses multi-agent system for complex reasoning decomposition and scalable search tree module for structured reasoning.

Result: CMAST achieves best performance on HOTD-Bench, significantly surpassing existing LMMs, and integrates well with existing LMMs to consistently improve performance.

Conclusion: The CMAST framework effectively addresses the HOTD problem and enhances LMM capabilities for human-centric task discovery in open-future scenarios.

Abstract: Recent progress in robotics and embodied AI is largely driven by Large Multimodal Models (LMMs). However, a key challenge remains underexplored: how can we advance LMMs to discover tasks that directly assist humans in open-future scenarios, where human intentions are highly concurrent and dynamic? In this work, we formalize the problem of Human-centric Open-future Task Discovery (HOTD), focusing particularly on identifying tasks that reduce human effort across multiple plausible futures. To facilitate this study, we propose HOTD-Bench, which features over 2K real-world videos, a semi-automated annotation pipeline, and a simulation-based protocol tailored for open-set future evaluation. Additionally, we propose the Collaborative Multi-Agent Search Tree (CMAST) framework, which decomposes the complex reasoning through a multi-agent system and structures the reasoning process through a scalable search tree module. In our experiments, CMAST achieves the best performance on HOTD-Bench, significantly surpassing existing LMMs. It also integrates well with existing LMMs, consistently improving performance.

[347] ABM-LoRA: Activation Boundary Matching for Fast Convergence in Low-Rank Adaptation

Dongha Lee, Jinhee Park, Minjun Kim, Junseok Kwon

Main category: cs.CV

TL;DR: ABM-LoRA is a principled initialization method that accelerates LoRA convergence by aligning adapter activation boundaries with the pretrained model, reducing information loss and improving early training performance.

Details

Motivation: Random initialization in LoRA restricts gradient updates to mismatched tangent spaces, causing significant information loss and hindering early convergence despite LoRA's parameter efficiency.

Method: Activation Boundary Matching aligns the adapter’s activation boundaries with those of the pretrained model before downstream training, maximizing the projection of full-parameter gradients into the adapter subspace.

Result: ABM-LoRA achieves the highest accuracy on VTAB-1K among all methods, with strong gains on structured reasoning tasks requiring geometric understanding, and demonstrates effectiveness across language understanding, dialogue generation, and vision recognition tasks.

Conclusion: ABM-LoRA substantially accelerates LoRA convergence by reducing information loss at initialization through principled activation boundary alignment, achieving superior performance across diverse architectures and tasks.

Abstract: We propose Activation Boundary Matching for Low-Rank Adaptation (ABM-LoRA), a principled initialization strategy that substantially accelerates the convergence of low-rank adapters. While LoRA offers high parameter efficiency, its random initialization restricts gradient updates to a mismatched tangent space, causing significant information loss and hindering early convergence. Our ABM-LoRA addresses this by aligning the adapter’s activation boundaries with those of the pretrained model before downstream training, thereby maximizing the projection of full-parameter gradients into the adapter subspace. This alignment sharply reduces information loss at initialization, yields a lower starting loss, and accelerates convergence. We demonstrate ABM-LoRA’s effectiveness across diverse architectures and tasks: language understanding (T5-Base on GLUE), dialogue generation (LLaMA2-7B on WizardLM), and vision recognition (ViT-B/16 on VTAB-1K). On VTAB-1K, it achieves the highest accuracy among all methods, with strong gains on structured reasoning tasks requiring geometric understanding.

[348] Can Modern Vision Models Understand the Difference Between an Object and a Look-alike?

Itay Cohen, Ethan Fetaya, Amir Rosenfeld

Main category: cs.CV

TL;DR: CLIP models can distinguish between real objects and lookalikes using a learned direction in embedding space, improving cross-modal retrieval and captioning.

Details

Motivation: To test if vision-language models like CLIP can capture the human ability to distinguish real objects from lookalikes (toys, statues, drawings, pareidolia), which current models struggle with.

Method: Created RoLA dataset of real/lookalike exemplars, used prompt-based baseline, then learned a direction in CLIP’s embedding space to separate real from lookalike representations.

Result: The learned direction improved discrimination in cross-modal retrieval on Conceptual12M and enhanced captions from CLIP prefix captioner.

Conclusion: CLIP can be guided to better distinguish real objects from lookalikes through learned embedding directions, bridging a gap with human perception.

Abstract: Recent advances in computer vision have yielded models with strong performance on recognition benchmarks; however, significant gaps remain in comparison to human perception. One subtle ability is to judge whether an image looks like a given object without being an instance of that object. We study whether vision-language models such as CLIP capture this distinction. We curated a dataset named RoLA (Real or Lookalike) of real and lookalike exemplars (e.g., toys, statues, drawings, pareidolia) across multiple categories, and first evaluate a prompt-based baseline with paired “real”/“lookalike” prompts. We then estimate a direction in CLIP’s embedding space that moves representations between real and lookalike. Applying this direction to image and text embeddings improves discrimination in cross-modal retrieval on Conceptual12M, and also enhances captions produced by a CLIP prefix captioner.
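
One simple way to estimate a "real vs. lookalike" direction in an embedding space is the normalized difference of class means. The sketch below assumes that estimator and works on generic embedding arrays; the paper may estimate its direction differently, and the function names and the `alpha` step size are illustrative.

```python
import numpy as np

def estimate_direction(real_emb, lookalike_emb):
    """Unit vector pointing from the mean real embedding to the mean lookalike one.

    Assumption: difference of means; not necessarily the paper's estimator.
    """
    d = lookalike_emb.mean(axis=0) - real_emb.mean(axis=0)
    return d / np.linalg.norm(d)

def shift_embeddings(emb, direction, alpha):
    """Move embeddings along the direction, then re-normalize to unit length."""
    out = emb + alpha * direction
    return out / np.linalg.norm(out, axis=-1, keepdims=True)
```

A positive `alpha` pushes an embedding toward the lookalike side and a negative one toward the real side; re-normalizing keeps cosine-similarity retrieval well defined.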

[349] Cloud4D: Estimating Cloud Properties at a High Spatial and Temporal Resolution

Jacob Lin, Edward Gryspeerdt, Ronald Clark

Main category: cs.CV

TL;DR: Cloud4D is a learning-based framework that reconstructs 4D cloud states using ground-based cameras, achieving 25m spatial and 5s temporal resolution with single-digit error against radar measurements.

Details

Motivation: Current global weather models operate at kilometer-scale resolution, making it difficult to model individual clouds and extreme weather phenomena. High-resolution real-world observations are needed but challenging to obtain with current instruments.

Method: Uses synchronized ground-based cameras with a homography-guided 2D-to-3D transformer to infer the full 3D distribution of liquid water content. Tracks 3D liquid water content over time to estimate horizontal wind vectors.

Result: Achieves order-of-magnitude improvement in space-time resolution compared to state-of-the-art satellite measurements, with single-digit relative error (<10%) against collocated radar measurements across a two-month deployment with six cameras.

Conclusion: Cloud4D provides a novel approach for high-resolution cloud state reconstruction using only ground-based cameras, enabling better modeling of cloud dynamics and extreme weather phenomena.

Abstract: There has been great progress in improving numerical weather prediction and climate models using machine learning. However, most global models act at a kilometer-scale, making it challenging to model individual clouds and factors such as extreme precipitation, wind gusts, turbulence, and surface irradiance. Therefore, there is a need to move towards higher-resolution models, which in turn require high-resolution real-world observations that current instruments struggle to obtain. We present Cloud4D, the first learning-based framework that reconstructs a physically consistent, four-dimensional cloud state using only synchronized ground-based cameras. Leveraging a homography-guided 2D-to-3D transformer, Cloud4D infers the full 3D distribution of liquid water content at 25 m spatial and 5 s temporal resolution. By tracking the 3D liquid water content retrievals over time, Cloud4D additionally estimates horizontal wind vectors. Across a two-month deployment comprising six skyward cameras, our system delivers an order-of-magnitude improvement in space-time resolution relative to state-of-the-art satellite measurements, while retaining single-digit relative error (<10%) against collocated radar measurements. Code and data are available on our project page https://cloud4d.jacob-lin.com/.
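
As a rough illustration of how tracking a retrieved field over time can yield a horizontal wind estimate, the sketch below finds the integer shift between two 2D slices by FFT cross-correlation and converts it to metres per second using the 25 m and 5 s resolutions quoted above. This is a generic displacement estimator and an assumption of this sketch, not the Cloud4D tracking method; it also assumes periodic boundaries and a single dominant motion.

```python
import numpy as np

def horizontal_shift(field_t0, field_t1):
    """Integer (dy, dx) shift that best maps field_t0 onto field_t1.

    Uses circular cross-correlation, so boundaries are treated as periodic.
    """
    f0 = np.fft.fft2(field_t0)
    f1 = np.fft.fft2(field_t1)
    corr = np.fft.ifft2(f1 * np.conj(f0)).real
    dy, dx = np.unravel_index(np.argmax(corr), corr.shape)
    h, w = corr.shape
    # Map shifts past the midpoint to negative displacements.
    if dy > h // 2:
        dy -= h
    if dx > w // 2:
        dx -= w
    return int(dy), int(dx)

def wind_from_shift(shift, cell_m=25.0, dt_s=5.0):
    """Convert a per-frame cell shift to (v_y, v_x) in metres per second."""
    dy, dx = shift
    return dy * cell_m / dt_s, dx * cell_m / dt_s
```

With these resolutions, a displacement of one cell per frame corresponds to 5 m/s.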

[350] Are Image-to-Video Models Good Zero-Shot Image Editors?

Zechuan Zhang, Zhenyuan Chen, Zongxin Yang, Yi Yang

Main category: cs.CV

TL;DR: IF-Edit is a tuning-free framework that repurposes video diffusion models for instruction-driven image editing, addressing prompt misalignment, temporal redundancy, and blurry frames through prompt enhancement, latent compression, and post-refinement.

Details

Motivation: Large-scale video diffusion models have strong world simulation capabilities but their use as zero-shot image editors remains underexplored, presenting an opportunity to leverage these models for image editing tasks.

Method: Three key components: (1) chain-of-thought prompt enhancement to transform static instructions into temporal reasoning prompts, (2) temporal latent dropout to compress frame latents after expert-switch point, and (3) self-consistent post-refinement to sharpen late-stage frames using short still-video trajectory.

Result: Experiments on four public benchmarks show IF-Edit performs strongly on reasoning-centric tasks while remaining competitive on general-purpose edits across non-rigid editing, physical/temporal reasoning, and general instruction edits.

Conclusion: The study provides a systematic view of video diffusion models as image editors and demonstrates a simple recipe for unified video-image generative reasoning.

Abstract: Large-scale video diffusion models show strong world simulation and temporal reasoning abilities, but their use as zero-shot image editors remains underexplored. We introduce IF-Edit, a tuning-free framework that repurposes pretrained image-to-video diffusion models for instruction-driven image editing. IF-Edit addresses three key challenges: prompt misalignment, redundant temporal latents, and blurry late-stage frames. It includes (1) a chain-of-thought prompt enhancement module that transforms static editing instructions into temporally grounded reasoning prompts; (2) a temporal latent dropout strategy that compresses frame latents after the expert-switch point, accelerating denoising while preserving semantic and temporal coherence; and (3) a self-consistent post-refinement step that sharpens late-stage frames using a short still-video trajectory. Experiments on four public benchmarks, covering non-rigid editing, physical and temporal reasoning, and general instruction edits, show that IF-Edit performs strongly on reasoning-centric tasks while remaining competitive on general-purpose edits. Our study provides a systematic view of video diffusion models as image editors and highlights a simple recipe for unified video-image generative reasoning.

cs.AI

[351] Using Wearable Devices to Improve Chronic Pain Treatment among Patients with Opioid Use Disorder

Abhay Goyal, Navin Kumar, Kimberly DiMeola, Rafael Trujillo, Soorya Ram Shimgekar, Christian Poellabauer, Pi Zonooz, Ermonda Gjoni-Markaj, Declan Barry, Lynn Madden

Main category: cs.AI

TL;DR: This pilot study explores using AI and wearable devices to predict pain spikes in patients with chronic pain and opioid use disorder, finding machine learning effective but large language models limited.

Details

Motivation: Integrated treatments for chronic pain and opioid use disorder are lacking, and wearable devices combined with AI could help monitor pain variability and its clinical correlates to improve care.

Method: Used wearable devices to collect patient data and applied various AI approaches including machine learning and large language models to analyze pain spikes and clinical correlates.

Result: Machine learning models achieved >0.7 accuracy in predicting pain spikes, while LLMs performed poorly and provided limited insights on pain spikes.

Conclusion: Real-time monitoring with wearables and advanced AI could enable early pain spike detection and personalized interventions, but LLMs need development to provide actionable insights in this clinical context.

Abstract: Chronic pain (CP) and opioid use disorder (OUD) are common and interrelated chronic medical conditions. Currently, there is a paucity of evidence-based integrated treatments for CP and OUD among individuals receiving medication for opioid use disorder (MOUD). Wearable devices have the potential to monitor complex patient information and inform treatment development for persons with OUD and CP, including pain variability (e.g., exacerbations of pain or pain spikes) and clinical correlates (e.g., perceived stress). However, the application of large language models (LLMs) with wearable data for understanding pain spikes remains unexplored. Consequently, the aim of this pilot study was to examine the clinical correlates of pain spikes using a range of AI approaches. We found that machine learning models achieved relatively high accuracy (>0.7) in predicting pain spikes, while LLMs were limited in providing insights on pain spikes. Real-time monitoring through wearable devices, combined with advanced AI models, could facilitate early detection of pain spikes and support personalized interventions that may help mitigate the risk of opioid relapse, improve adherence to MOUD, and enhance the integration of CP and OUD care. Given overall limited LLM performance, these findings highlight the need to develop LLMs that can provide actionable insights in the OUD/CP context.

[352] Fara-7B: An Efficient Agentic Model for Computer Use

Ahmed Awadallah, Yash Lara, Raghav Magazine, Hussein Mozannar, Akshay Nambi, Yash Pandya, Aravind Rajeswaran, Corby Rosset, Alexey Taymanov, Vibhav Vineet, Spencer Whitehead, Andrew Zhao

Main category: cs.AI

TL;DR: FaraGen is a synthetic data generation system that creates diverse multi-step web tasks and verified trajectories for training computer use agents. Using this data, the authors train Fara-7B, a small on-device model that outperforms comparable CUA models and competes with larger frontier models.

Details

Motivation: Progress in computer use agents has been constrained by the absence of large, high-quality datasets capturing human-computer interactions, unlike the abundant textual data available for LLMs.

Method: Developed FaraGen system to generate synthetic data: proposes diverse tasks from frequently used websites, generates multiple solution attempts, and filters successful trajectories using multiple verifiers. Used this data to train Fara-7B, a native CUA model that perceives via screenshots and executes actions via predicted coordinates.

Result: FaraGen achieves high throughput, yield, and diversity for multi-step web tasks, producing verified trajectories at approximately $1 each. Fara-7B outperforms other CUA models of comparable size on benchmarks (WebVoyager, Online-Mind2Web, WebTailBench) and is competitive with much larger frontier models.

Conclusion: Scalable data generation systems like FaraGen are key to advancing small efficient agentic models. The authors are releasing Fara-7B as open-weight and WebTailBench benchmark to the community.

Abstract: Progress in computer use agents (CUAs) has been constrained by the absence of large and high-quality datasets that capture how humans interact with a computer. While LLMs have thrived on abundant textual data, no comparable corpus exists for CUA trajectories. To address these gaps, we introduce FaraGen, a novel synthetic data generation system for multi-step web tasks. FaraGen can propose diverse tasks from frequently used websites, generate multiple solution attempts, and filter successful trajectories using multiple verifiers. It achieves high throughput, yield, and diversity for multi-step web tasks, producing verified trajectories at approximately $1 each. We use this data to train Fara-7B, a native CUA model that perceives the computer using only screenshots, executes actions via predicted coordinates, and is small enough to run on-device. We find that Fara-7B outperforms other CUA models of comparable size on benchmarks like WebVoyager, Online-Mind2Web, and WebTailBench – our novel benchmark that better captures under-represented web tasks in pre-existing benchmarks. Furthermore, Fara-7B is competitive with much larger frontier models, illustrating key benefits of scalable data generation systems in advancing small efficient agentic models. We are making Fara-7B open-weight on Microsoft Foundry and HuggingFace, and we are releasing WebTailBench.
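
The propose, attempt, and filter flow described above can be sketched as a small filtering loop. Everything here is hypothetical scaffolding: `collect_verified`, the `solve` callable, and the verifier signature are illustrative names, and requiring every verifier to accept a trajectory is an assumption of this sketch rather than a documented FaraGen rule.

```python
from typing import Callable, Iterable, List, Sequence, Tuple

Trajectory = List[str]  # a trajectory is modeled as a list of action strings

def collect_verified(
    tasks: Iterable[str],
    solve: Callable[[str, int], Trajectory],
    verifiers: Sequence[Callable[[str, Trajectory], bool]],
    attempts_per_task: int = 3,
) -> List[Tuple[str, Trajectory]]:
    """Keep only (task, trajectory) pairs accepted by all verifiers."""
    kept: List[Tuple[str, Trajectory]] = []
    for task in tasks:
        for attempt in range(attempts_per_task):
            trajectory = solve(task, attempt)
            # Assumption: a trajectory is kept only if every verifier agrees.
            if all(verify(task, trajectory) for verify in verifiers):
                kept.append((task, trajectory))
    return kept
```

Sampling several attempts per task raises the chance that at least one passes verification, at the cost of more generation per kept trajectory.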

[353] HeaRT: A Hierarchical Circuit Reasoning Tree-Based Agentic Framework for AMS Design Optimization

Souradip Poddar, Chia-Tung Ho, Ziming Wei, Weidong Cao, Haoxing Ren, David Z. Pan

Main category: cs.AI

TL;DR: HeaRT is a foundational reasoning engine for AMS design automation that achieves >97% reasoning accuracy and >98% Pass@1 performance across 40 circuits, with 3x faster convergence and <0.5x real-time token budget compared to SOTA baselines.

Details

Motivation: Current AI-driven AMS design automation suffers from reliance on high-quality datasets, poor transferability across architectures, and lack of adaptive mechanisms, limiting their effectiveness.

Method: Proposes HeaRT, a foundational reasoning engine for automation loops that enables intelligent, adaptive, human-style design optimization without heavy dataset dependency.

Result: HeaRT achieves >97% reasoning accuracy and >98% Pass@1 performance across a 40-circuit benchmark, operates at <0.5x the real-time token budget of SOTA baselines, and yields >3x faster convergence in sizing and topology design adaptation tasks.

Conclusion: HeaRT represents a significant advancement toward intelligent, adaptive AMS design automation that preserves design intent while dramatically improving efficiency and performance across diverse optimization approaches.

Abstract: Conventional AI-driven AMS design automation algorithms remain constrained by their reliance on high-quality datasets to capture underlying circuit behavior, coupled with poor transferability across architectures, and a lack of adaptive mechanisms. This work proposes HeaRT, a foundational reasoning engine for automation loops and a first step toward intelligent, adaptive, human-style design optimization. HeaRT consistently demonstrates reasoning accuracy >97% and Pass@1 performance >98% across our 40-circuit benchmark repository, even as circuit complexity increases, while operating at <0.5x real-time token budget of SOTA baselines. Our experiments show that HeaRT yields >3x faster convergence in both sizing and topology design adaptation tasks across diverse optimization approaches, while preserving prior design intent.
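For readers unfamiliar with the Pass@1 figure quoted above, the following is a minimal sketch of the widely used unbiased pass@k estimator. The abstract does not say how HeaRT computes its Pass@1, so applying this estimator here is an assumption, and the function name `pass_at_k` and the sample counts are illustrative only.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n samples of which c are correct.

    Assumption: this is the common estimator 1 - C(n-c, k) / C(n, k);
    the paper's own evaluation protocol may differ.
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Every size-k subset must contain at least one correct sample.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


# Illustrative numbers only: 10 attempts on one circuit, 3 of them correct.
example = pass_at_k(n=10, c=3, k=1)
```

For k = 1 the estimator reduces to the plain fraction of correct samples, c / n.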

[354] FISCAL: Financial Synthetic Claim-document Augmented Learning for Efficient Fact-Checking

Rishab Sharma, Iman Saberi, Elham Alipour, Jie JW Wu, Fatemeh Fard

Main category: cs.AI

TL;DR: FISCAL framework generates synthetic financial data to train MiniCheck-FISCAL, a lightweight verifier that outperforms larger models in financial fact-checking.

Details

Motivation: Current LLMs for financial applications suffer from hallucinations and computational inefficiency, requiring a more reliable and efficient solution.

Method: Developed FISCAL framework for synthetic data generation and used it to train MiniCheck-FISCAL, a compact verifier model.

Result: MiniCheck-FISCAL outperforms GPT-3.5 Turbo, similar-sized models, and approaches accuracy of models 20x larger; rivals GPT-4o and Claude-3.5 on external datasets.

Conclusion: Domain-specific synthetic data with efficient fine-tuning enables compact models to achieve state-of-the-art performance in financial AI applications.

Abstract: Financial applications of large language models (LLMs) require factual reliability and computational efficiency, yet current systems often hallucinate details and depend on prohibitively large models. We propose FISCAL (Financial Synthetic Claim-Document Augmented Learning), a modular framework for generating synthetic data tailored to financial fact-checking. Using FISCAL, we generate a dataset called FISCAL-data and use it to train MiniCheck-FISCAL, a lightweight verifier for numerical financial claims. MiniCheck-FISCAL outperforms its baseline, surpasses GPT-3.5 Turbo and other open-source peers of similar size, and approaches the accuracy of much larger systems (20x), such as Mixtral-8x22B and Command R+. On external datasets FinDVer and Fin-Fact, it rivals GPT-4o and Claude-3.5 while outperforming Gemini-1.5 Flash. These results show that domain-specific synthetic data, combined with efficient fine-tuning, enables compact models to achieve state-of-the-art accuracy, robustness, and scalability for practical financial AI. The dataset and scripts are available in the project repository (link provided in the paper).

[355] Scaling Item-to-Standard Alignment with Large Language Models: Accuracy, Limits, and Solutions

Farzan Karimi-Malekabadi, Pooya Razavi, Sonya Powers

Main category: cs.AI

TL;DR: LLMs can effectively automate educational assessment alignment reviews, achieving 83-94% accuracy in identifying misaligned items and significantly reducing manual review burden when combined with candidate filtering strategies.

Details

Motivation: Traditional human alignment reviews for educational assessment items are accurate but slow and labor-intensive, especially for large item banks. There's a need to accelerate this process while maintaining accuracy.

Method: Tested three LLMs (GPT-3.5 Turbo, GPT-4o-mini, and GPT-4o) on over 12,000 item-skill pairs across three tasks: identifying misaligned items, selecting correct skills from standards, and narrowing candidate lists before classification.

Result: GPT-4o-mini achieved 83-94% accuracy in identifying alignment status, including subtle misalignments. Performance was stronger in mathematics than reading. Pre-filtering candidate skills improved results, with correct skills appearing in top five suggestions over 95% of the time.

Conclusion: LLMs, particularly when paired with candidate filtering strategies, can significantly reduce manual review burden while preserving alignment accuracy. Hybrid pipelines combining LLM-based screening with human review offer scalable solutions for item validation.

Abstract: As educational systems evolve, ensuring that assessment items remain aligned with content standards is essential for maintaining fairness and instructional relevance. Traditional human alignment reviews are accurate but slow and labor-intensive, especially across large item banks. This study examines whether Large Language Models (LLMs) can accelerate this process without sacrificing accuracy. Using over 12,000 item-skill pairs in grades K-5, we tested three LLMs (GPT-3.5 Turbo, GPT-4o-mini, and GPT-4o) across three tasks that mirror real-world challenges: identifying misaligned items, selecting the correct skill from the full set of standards, and narrowing candidate lists prior to classification. In Study 1, GPT-4o-mini correctly identified alignment status in approximately 83-94% of cases, including subtle misalignments. In Study 2, performance remained strong in mathematics but was lower for reading, where standards are more semantically overlapping. Study 3 demonstrated that pre-filtering candidate skills substantially improved results, with the correct skill appearing among the top five suggestions more than 95% of the time. These findings suggest that LLMs, particularly when paired with candidate filtering strategies, can significantly reduce the manual burden of item review while preserving alignment accuracy. We recommend the development of hybrid pipelines that combine LLM-based screening with human review in ambiguous cases, offering a scalable solution for ongoing item validation and instructional alignment.
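The "top five" figure from Study 3 is a top-k hit rate. The sketch below shows how such a rate can be computed from ranked candidate lists. The item and skill identifiers are hypothetical placeholders, and the function `top_k_hit_rate` is not from the paper.

```python
from typing import Dict, List


def top_k_hit_rate(ranked: Dict[str, List[str]], gold: Dict[str, str], k: int = 5) -> float:
    """Fraction of items whose gold skill appears among the first k candidates."""
    if not gold:
        raise ValueError("gold must not be empty")
    hits = sum(
        1 for item, skill in gold.items()
        if skill in ranked.get(item, [])[:k]
    )
    return hits / len(gold)


# Hypothetical toy data: candidate skills per item, best candidate first.
ranked = {
    "item-1": ["skill-A", "skill-B", "skill-C"],
    "item-2": ["skill-D", "skill-E", "skill-F"],
    "item-3": ["skill-G", "skill-H", "skill-I"],
    "item-4": ["skill-J", "skill-K", "skill-L"],
}
gold = {
    "item-1": "skill-B",
    "item-2": "skill-E",
    "item-3": "skill-I",
    "item-4": "skill-Z",  # never suggested, so always a miss
}

rate_at_2 = top_k_hit_rate(ranked, gold, k=2)
rate_at_3 = top_k_hit_rate(ranked, gold, k=3)
```

A larger k raises the hit rate but leaves more candidates for the downstream classifier or human reviewer to sort through.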

[356] Scaling Agentic Reinforcement Learning for Tool-Integrated Reasoning in VLMs

Meng Lu, Ran Xu, Yi Fang, Wenxuan Zhang, Yue Yu, Gaurav Srivastava, Yuchen Zhuang, Mohamed Elhoseiny, Charles Fleming, Carl Yang, Zhengzhong Tu, Yang Xie, Guanghua Xiao, Hanrui Wang, Di Jin, Wenqi Shi, Xuan Wang

Main category: cs.AI

TL;DR: VISTA-Gym is a scalable training environment that enables visual agentic reinforcement learning to enhance tool-integrated visual reasoning in vision-language models, with VISTA-R1-8B outperforming state-of-the-art baselines by 9.51%-18.72%.

Details

Motivation: Current vision-language models have strong image understanding but limited ability for multi-step visual reasoning and tool integration, struggling with tool selection, invocation, and coordination.

Method: VISTA-Gym provides a unified environment with standardized visual tools interface, executable interaction loops, verifiable feedback, and trajectory logging. VISTA-R1 is trained using multi-turn trajectory sampling and end-to-end reinforcement learning to interleave tool-use with reasoning.

Result: VISTA-R1-8B outperforms state-of-the-art baselines with similar sizes by 9.51%-18.72% across 11 public reasoning-intensive VQA benchmarks.

Conclusion: VISTA-Gym effectively unlocks tool-integrated reasoning capabilities for VLMs, demonstrating significant performance improvements in visual reasoning tasks.

Abstract: While recent vision-language models (VLMs) demonstrate strong image understanding, their ability to “think with images”, i.e., to reason through multi-step visual interactions, remains limited. We introduce VISTA-Gym, a scalable training environment for incentivizing tool-integrated visual reasoning capabilities in VLMs. VISTA-Gym unifies diverse real-world multimodal reasoning tasks (7 tasks from 13 datasets in total) with a standardized interface for visual tools (e.g., grounding, parsing), executable interaction loops, verifiable feedback signals, and efficient trajectory logging, enabling visual agentic reinforcement learning at scale. While recent VLMs exhibit strong text-only reasoning, both proprietary and open-source models still struggle with tool selection, invocation, and coordination. With VISTA-Gym, we train VISTA-R1 to interleave tool-use with agentic reasoning via multi-turn trajectory sampling and end-to-end reinforcement learning. Extensive experiments across 11 public reasoning-intensive VQA benchmarks show that VISTA-R1-8B outperforms state-of-the-art baselines with similar sizes by 9.51%-18.72%, demonstrating VISTA-Gym as an effective training ground to unlock the tool-integrated reasoning capabilities for VLMs.

[357] NOEM$^{3}$A: A Neuro-Symbolic Ontology-Enhanced Method for Multi-Intent Understanding in Mobile Agents

Ioannis Tzachristas, Aifen Sui

Main category: cs.AI

TL;DR: A neuro-symbolic framework that integrates structured intent ontology with compact language models for multi-intent understanding in mobile AI agents, achieving near-GPT-4 accuracy with significantly lower resource requirements.

Details

Motivation: To enable accurate and efficient on-device natural language understanding by combining symbolic intent structure with compact language models, addressing the need for mobile AI agents that can handle ambiguous and demanding dialogues with multiple intents.

Method: Integrates structured intent ontology with compact language models using retrieval-augmented prompting, logit biasing, and optional classification heads to inject symbolic intent structure into both input and output representations.

Result: A 3B Llama model with ontology augmentation achieves 85% accuracy compared to GPT-4’s 90% on MultiWOZ 2.3 ambiguous dialogues, while using a tiny fraction of the energy and memory footprint. Produces more grounded, disambiguated multi-intent interpretations.

Conclusion: Symbolic alignment is an effective strategy for enabling accurate and efficient on-device NLU, demonstrating that neuro-symbolic approaches can approach large model performance with compact models through proper structural integration.

Abstract: We introduce a neuro-symbolic framework for multi-intent understanding in mobile AI agents by integrating a structured intent ontology with compact language models. Our method leverages retrieval-augmented prompting, logit biasing and optional classification heads to inject symbolic intent structure into both input and output representations. We formalize a new evaluation metric, Semantic Intent Similarity (SIS), based on hierarchical ontology depth, which captures semantic proximity even when predicted intents differ lexically. Experiments on a subset of ambiguous/demanding dialogues of MultiWOZ 2.3 (with oracle labels from GPT-o3) demonstrate that a 3B Llama model with ontology augmentation approaches GPT-4 accuracy (85% vs 90%) at a tiny fraction of the energy and memory footprint. Qualitative comparisons show that ontology-augmented models produce more grounded, disambiguated multi-intent interpretations. Our results validate symbolic alignment as an effective strategy for enabling accurate and efficient on-device NLU.
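The abstract does not give the SIS formula. As a rough illustration of how a depth-based similarity over an intent hierarchy can behave, the sketch below uses the classic Wu-Palmer measure as a stand-in. The toy ontology, the intent names, and the function `depth_similarity` are all assumptions, not the paper's definition.

```python
from typing import Dict, List, Optional

# Hypothetical toy ontology: child -> parent (None marks the root).
PARENT: Dict[str, Optional[str]] = {
    "intent": None,
    "travel": "intent",
    "dining": "intent",
    "book_train": "travel",
    "book_taxi": "travel",
    "book_restaurant": "dining",
}


def path_to_root(node: str) -> List[str]:
    """Return the chain of nodes from `node` up to the root, node first."""
    path: List[str] = []
    cur: Optional[str] = node
    while cur is not None:
        path.append(cur)
        cur = PARENT[cur]
    return path


def depth(node: str) -> int:
    """Depth counted in nodes, so the root has depth 1."""
    return len(path_to_root(node))


def depth_similarity(a: str, b: str) -> float:
    """Wu-Palmer style score: 2 * depth(LCA) / (depth(a) + depth(b))."""
    ancestors_a = set(path_to_root(a))
    # Walking up from b, the first shared node is the lowest common ancestor.
    lca = next(n for n in path_to_root(b) if n in ancestors_a)
    return 2 * depth(lca) / (depth(a) + depth(b))


same = depth_similarity("book_train", "book_train")
siblings = depth_similarity("book_train", "book_taxi")
distant = depth_similarity("book_train", "book_restaurant")
```

Under this stand-in, two different intents that share a deep parent score higher than intents that only meet at the root, which is the kind of lexical-mismatch tolerance the abstract describes.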

[358] KOM: A Multi-Agent Artificial Intelligence System for Precision Management of Knee Osteoarthritis (KOA)

Weizhi Liu, Xi Chen, Zekun Jiang, Liang Zhao, Kunyuan Jiang, Ruisi Tang, Li Wang, Mingke You, Hanyu Zhou, Hongyu Chen, Qiankun Xiong, Yong Nie, Kang Li, Jian Li

Main category: cs.AI

TL;DR: KOM is a multi-agent AI system that automates knee osteoarthritis evaluation, risk prediction, and treatment planning, reducing clinician workload by 38.5% while improving care quality.

Details

Motivation: Knee osteoarthritis affects over 600M people globally, but personalized interventions require substantial medical resources that are difficult to implement in resource-limited settings.

Method: Developed KOM, a multi-agent system that automates KOA evaluation, risk prediction, and treatment prescription, assisting clinicians across the care pathway with tailored management plans.

Result: KOM outperformed general-purpose LLMs in imaging analysis and prescription generation. Collaboration with clinicians reduced diagnostic/planning time by 38.5% and improved treatment quality compared to independent approaches.

Conclusion: KOM can facilitate automated KOA management and enhance care efficiency when integrated into clinical workflows, with its modular architecture offering insights for developing AI systems for other chronic conditions.

Abstract: Knee osteoarthritis (KOA) affects more than 600 million individuals globally and is associated with significant pain, functional impairment, and disability. While personalized multidisciplinary interventions have the potential to slow disease progression and enhance quality of life, they typically require substantial medical resources and expertise, making them difficult to implement in resource-limited settings. To address this challenge, we developed KOM, a multi-agent system designed to automate KOA evaluation, risk prediction, and treatment prescription. This system assists clinicians in performing essential tasks across the KOA care pathway and supports the generation of tailored management plans based on individual patient profiles, disease status, risk factors, and contraindications. In benchmark experiments, KOM demonstrated superior performance compared to several general-purpose large language models in imaging analysis and prescription generation. A randomized three-arm simulation study further revealed that collaboration between KOM and clinicians reduced total diagnostic and planning time by 38.5% and resulted in improved treatment quality compared to each approach used independently. These findings indicate that KOM could help facilitate automated KOA management and, when integrated into clinical workflows, has the potential to enhance care efficiency. The modular architecture of KOM may also offer valuable insights for developing AI-assisted management systems for other chronic conditions.

[359] A Unified Evaluation-Instructed Framework for Query-Dependent Prompt Optimization

Ke Chen, Yifeng Wang, Hassan Almosapeeh, Haohan Wang

Main category: cs.AI

TL;DR: The paper proposes a unified prompt evaluation framework and an evaluation-instructed optimization pipeline that achieves stable, interpretable improvements across diverse tasks.

Details

Motivation: Existing prompt optimization methods are ineffective in complex scenarios, rely on unstable feedback, and lack a systematic definition of prompt quality, leading to fragmented evaluation signals.

Method: Develop a performance-oriented prompt evaluation framework, finetune an execution-free evaluator to predict multi-dimensional quality scores, and use it to instruct a metric-aware optimizer that diagnoses failures and rewrites prompts interpretably.

Result: The evaluator achieves the strongest accuracy in predicting prompt performance, and the optimization pipeline consistently surpasses static-template and query-dependent baselines across eight datasets and three backbone models.

Conclusion: The proposed unified, metric-grounded perspective on prompt quality enables stable, interpretable, and model-agnostic improvements across diverse tasks.

Abstract: Most prompt-optimization methods refine a single static template, making them ineffective in complex and dynamic user scenarios. Existing query-dependent approaches rely on unstable textual feedback or black-box reward models, providing weak and uninterpretable optimization signals. More fundamentally, prompt quality itself lacks a unified, systematic definition, resulting in fragmented and unreliable evaluation signals. Our approach first establishes a performance-oriented, systematic, and comprehensive prompt evaluation framework. Furthermore, we develop and finetune an execution-free evaluator that predicts multi-dimensional quality scores directly from text. The evaluator then instructs a metric-aware optimizer that diagnoses failure modes and rewrites prompts in an interpretable, query-dependent manner. Our evaluator achieves the strongest accuracy in predicting prompt performance, and the evaluation-instructed optimization consistently surpasses both static-template and query-dependent baselines across eight datasets and on three backbone models. Overall, we propose a unified, metric-grounded perspective on prompt quality, and demonstrate that our evaluation-instructed optimization pipeline delivers stable, interpretable, and model-agnostic improvements across diverse tasks.

[360] Reinforcement Learning with $ω$-Regular Objectives and Constraints

Dominik Wagner, Leon Witzman, Luke Ong

Main category: cs.AI

TL;DR: This paper proposes a reinforcement learning approach that combines ω-regular objectives with explicit constraints to address safety-performance trade-offs, using model-based RL with linear programming to maximize satisfaction probability while adhering to safety constraints.

Details

Motivation: Traditional RL relies on scalar rewards that can't express complex temporal, conditional, or safety-critical goals, leading to reward hacking. Single scalar performance measures also mask important safety-performance trade-offs.

Method: Developed a model-based RL algorithm based on linear programming that treats safety requirements and optimization targets separately, with a translation to constrained limit-average problems.

Result: The algorithm produces policies that maximize the probability of satisfying ω-regular objectives while adhering to ω-regular constraints within specified thresholds.

Conclusion: The approach successfully addresses limitations of scalar rewards by combining ω-regular objectives with explicit constraints, providing optimality-preserving guarantees for safety-critical RL problems.

Abstract: Reinforcement learning (RL) commonly relies on scalar rewards with limited ability to express temporal, conditional, or safety-critical goals, and can lead to reward hacking. Temporal logic expressible via the more general class of $ω$-regular objectives addresses this by precisely specifying rich behavioural properties. Even still, measuring performance by a single scalar (be it reward or satisfaction probability) masks safety-performance trade-offs that arise in settings with a tolerable level of risk. We address both limitations simultaneously by combining $ω$-regular objectives with explicit constraints, allowing safety requirements and optimisation targets to be treated separately. We develop a model-based RL algorithm based on linear programming, which in the limit produces a policy maximising the probability of satisfying an $ω$-regular objective while also adhering to $ω$-regular constraints within specified thresholds. Furthermore, we establish a translation to constrained limit-average problems with optimality-preserving guarantees.

[361] VICoT-Agent: A Vision-Interleaved Chain-of-Thought Framework for Interpretable Multimodal Reasoning and Scalable Remote Sensing Analysis

Chujie Wang, Zhiyuan Luo, Ruiqi Liu, Can Ran, Shenghua Fan, Xi Chen, Chu He

Main category: cs.AI

TL;DR: VICoT is a multimodal agent framework that enables explicit multi-round reasoning by dynamically incorporating visual tools into chain-of-thought processes, with a stack-based structure and modular tool suite for efficient vision-language reasoning.

Details

Motivation: Remote sensing image analysis is evolving from simple object recognition to complex intelligence reasoning, requiring models with stronger reasoning abilities and flexible tool invocation capabilities.

Method: Uses a stack-based reasoning structure with modular MCP-compatible tool suite, and proposes Reasoning Stack distillation to migrate complex agent behaviors to lightweight models while maintaining reasoning capability.

Result: Significantly outperforms existing SOTA frameworks on multiple remote sensing benchmarks in reasoning transparency, execution efficiency, and generation quality.

Conclusion: VICoT provides an effective framework for complex intelligence reasoning in remote sensing with strong generalization and flexibility, while enabling efficient deployment through model distillation.

Abstract: The current remote sensing image analysis task is increasingly evolving from traditional object recognition to complex intelligence reasoning, which places higher requirements on the model’s reasoning ability and the flexibility of tool invocation. To this end, we propose a new multimodal agent framework, Vision-Interleaved Chain-of-Thought Framework (VICoT), which implements explicit multi-round reasoning by dynamically incorporating visual tools into the chain of thought. Through a stack-based reasoning structure and a modular MCP-compatible tool suite, VICoT enables LLMs to efficiently perform multi-round, interleaved vision-language reasoning tasks with strong generalization and flexibility. We also propose the Reasoning Stack distillation method to migrate complex Agent behaviors to small, lightweight models, which ensures the reasoning capability while significantly reducing complexity. Experiments on multiple remote sensing benchmarks demonstrate that VICoT significantly outperforms existing SOTA frameworks in reasoning transparency, execution efficiency, and generation quality.

[362] MicroSims: A Framework for AI-Generated, Scalable Educational Simulations with Universal Embedding and Adaptive Learning Support

Valerie Lockhart, Dan McCreary, Troy A. Peterson

Main category: cs.AI

TL;DR: MicroSims is a framework for AI-generated lightweight educational simulations that are easily embeddable, customizable without coding, and address cost/technical barriers of traditional simulations.

Details

Motivation: Traditional educational simulations require substantial resources and technical expertise, creating barriers to widespread adoption. MicroSims aim to make interactive learning accessible by overcoming cost, complexity, and platform dependence issues.

Method: Three key innovations: (1) standardized design patterns for AI-assisted generation, (2) iframe-based architecture for universal embedding and security, (3) transparent modifiable code for customization. Includes design principles, technical architecture, metadata standards, and development workflows.

Result: Drawing on prior physics education studies and STEM meta-analyses, the authors report that interactive simulations can improve conceptual understanding by up to 30-40% compared to traditional instruction. MicroSims aim to extend these benefits while addressing cost and technical barriers, enabling educators worldwide to create curriculum-aligned simulations on demand.

Conclusion: MicroSims have significant implications for educational equity and enable low-cost intelligent interactive textbooks. The framework provides foundation for future AI-powered adaptive learning systems.

Abstract: Educational simulations have long been recognized as powerful tools for enhancing learning outcomes, yet their creation has traditionally required substantial resources and technical expertise. This paper introduces MicroSims, a novel framework for creating lightweight, interactive educational simulations that can be rapidly generated using artificial intelligence, universally embedded across digital learning platforms, and easily customized without programming knowledge. MicroSims occupy a unique position at the intersection of three key innovations: (1) standardized design patterns that enable AI-assisted generation, (2) iframe-based architecture that provides universal embedding and sandboxed security, and (3) transparent, modifiable code that supports customization and pedagogical transparency. We present a comprehensive framework encompassing design principles, technical architecture, metadata standards, and development workflows. Drawing on empirical research from physics education studies and meta-analyses across STEM disciplines, we demonstrate that interactive simulations can improve conceptual understanding by up to 30-40% compared to traditional instruction. MicroSims extend these benefits while addressing persistent barriers of cost, technical complexity, and platform dependence. This work has significant implications for educational equity and for low-cost intelligent interactive textbooks, enabling educators worldwide to create customized, curriculum-aligned simulations on demand. We discuss implementation considerations, present evidence of effectiveness, and outline future directions for AI-powered adaptive learning systems built on the MicroSim foundation.

[363] Agentic AI-Empowered Conversational Embodied Intelligence Networks in 6G

Mingkai Chen, Zijie Feng, Lei Wang, Yaser Khamayseh

Main category: cs.AI

TL;DR: Proposes CC-EIN for collaborative embodied intelligence in 6G networks, achieving 95.4% task completion rate in rescue scenarios through multimodal fusion, adaptive communication, and interpretable coordination.

Details

Motivation: Address challenges in multimodal information fusion, adaptive communication, and decision interpretability for multiple embodied intelligent devices in complex 6G-era tasks.

Method: CC-EIN integrates multimodal feature fusion (PerceptiNet), adaptive semantic communication, task coordination, and interpretability (InDec module with Grad-CAM visualization).

Result: Achieves 95.4% task completion rate and 95% transmission efficiency in post-earthquake rescue scenarios with strong semantic consistency and energy efficiency.

Conclusion: CC-EIN effectively enables collaborative embodied intelligence for complex tasks through integrated multimodal fusion, adaptive communication, and interpretable coordination.

Abstract: In the 6G era, semantic collaboration among multiple embodied intelligent devices (MEIDs) becomes crucial for complex task execution. However, existing systems face challenges in multimodal information fusion, adaptive communication, and decision interpretability. To address these limitations, we propose a collaborative Conversational Embodied Intelligence Network (CC-EIN) integrating multimodal feature fusion, adaptive semantic communication, task coordination, and interpretability. PerceptiNet performs cross-modal fusion of image and radar data to generate unified semantic representations. An adaptive semantic communication strategy dynamically adjusts coding schemes and transmission power according to task urgency and channel quality. A semantic-driven collaboration mechanism further supports task decomposition and conflict-free coordination among heterogeneous devices. Finally, the InDec module enhances decision transparency through Grad-CAM visualization. Simulation results in post-earthquake rescue scenarios demonstrate that CC-EIN achieves 95.4% task completion rate and 95% transmission efficiency while maintaining strong semantic consistency and energy efficiency.

[364] Simulated Self-Assessment in Large Language Models: A Psychometric Approach to AI Self-Efficacy

Daniel I Jackson, Emma L Jensen, Syed-Amad Hussain, Emre Sezgin

Main category: cs.AI

TL;DR: LLMs show stable but inaccurate self-assessments using adapted psychological scales, with self-efficacy scores not reliably reflecting actual task performance across computational, social, and summarization tasks.

Details

Motivation: To evaluate LLM self-assessment capabilities beyond task accuracy, adapting psychological self-efficacy scales to understand how models perceive their own abilities across different task types.

Method: Adapted the 10-item General Self-Efficacy Scale (GSES) to elicit self-assessments from ten LLMs across four conditions (no task, computational reasoning, social reasoning, summarization), with follow-up confidence prompts and qualitative analysis of reasoning styles.

Result: Models showed stable GSES responses but significantly different self-efficacy levels across conditions, with aggregate scores lower than human norms. Self-assessment didn’t reliably reflect ability - some low-scoring models performed accurately while high-scoring ones produced weak summaries. Follow-up prompts revealed mild overestimation in initial assessments.

Conclusion: Psychometric prompting provides structured insight into LLM communication behavior but not calibrated performance estimates, with higher self-efficacy corresponding to more assertive, anthropomorphic reasoning styles.

Abstract: Self-assessment is a key aspect of reliable intelligence, yet evaluations of large language models (LLMs) focus mainly on task accuracy. We adapted the 10-item General Self-Efficacy Scale (GSES) to elicit simulated self-assessments from ten LLMs across four conditions: no task, computational reasoning, social reasoning, and summarization. GSES responses were highly stable across repeated administrations and randomized item orders. However, models showed significantly different self-efficacy levels across conditions, with aggregate scores lower than human norms. All models achieved perfect accuracy on computational and social questions, whereas summarization performance varied widely. Self-assessment did not reliably reflect ability: several low-scoring models performed accurately, while some high-scoring models produced weaker summaries. Follow-up confidence prompts yielded modest, mostly downward revisions, suggesting mild overestimation in first-pass assessments. Qualitative analysis showed that higher self-efficacy corresponded to more assertive, anthropomorphic reasoning styles, whereas lower scores reflected cautious, de-anthropomorphized explanations. Psychometric prompting provides structured insight into LLM communication behavior but not calibrated performance estimates.
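For context on the scores being compared, the standard General Self-Efficacy Scale has 10 items, each rated from 1 to 4 and summed to a total between 10 and 40. The sketch below scores one administration under that standard scheme. The study's adapted prompts and any rescaling it applies are not described in the abstract, so treating its scoring as identical is an assumption.

```python
from typing import Sequence


def gses_total(responses: Sequence[int]) -> int:
    """Sum a 10-item GSES administration (each item rated 1-4)."""
    if len(responses) != 10:
        raise ValueError("GSES has exactly 10 items")
    if any(r not in (1, 2, 3, 4) for r in responses):
        raise ValueError("each item must be rated 1, 2, 3, or 4")
    return sum(responses)


# Hypothetical model answers parsed from one questionnaire run.
parsed_answers = [3, 3, 4, 2, 3, 3, 4, 3, 2, 3]
total = gses_total(parsed_answers)
```

Repeating this over several administrations and item orders gives the per-model score distributions whose stability the abstract reports.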

[365] RPM-MCTS: Knowledge-Retrieval as Process Reward Model with Monte Carlo Tree Search for Code Generation

Yuanyuan Lin, Xiangyu Ouyang, Teng Zhang, Kaixin Sui

Main category: cs.AI

TL;DR: RPM-MCTS is a tree search method that uses knowledge-retrieval as process reward model to evaluate intermediate algorithmic steps in code generation, reducing computational costs by 15% while improving performance.

Details

Motivation: Existing tree search methods struggle with evaluating intermediate algorithmic steps and correcting errors in real-time, leading to incorrect code generation and high computational costs.

Method: Uses Knowledge-Retrieval as Process Reward Model based on Monte Carlo Tree Search, similarity filtering to remove redundant nodes, and sandbox execution feedback to locate and correct errors during generation.

Result: Outperforms state-of-the-art methods on four public code generation benchmarks with approximately 15% reduction in token consumption. Fine-tuning base models with RPM-MCTS data significantly enhances code capabilities.

Conclusion: RPM-MCTS provides an effective approach for improving code generation by enabling better evaluation of intermediate steps and timely error correction, while reducing computational overhead.

Abstract: Tree search-based methods have made significant progress in enhancing the code generation capabilities of large language models. However, due to the difficulty in effectively evaluating intermediate algorithmic steps and the inability to locate and timely correct erroneous steps, these methods often generate incorrect code and incur increased computational costs. To tackle these problems, we propose RPM-MCTS, an effective method that utilizes Knowledge-Retrieval as Process Reward Model based on Monte Carlo Tree Search to evaluate intermediate algorithmic steps. By utilizing knowledge base retrieval, RPM-MCTS avoids the complex training of process reward models. During the expansion phase, similarity filtering is employed to remove redundant nodes, ensuring diversity in reasoning paths. Furthermore, our method utilizes sandbox execution feedback to locate erroneous algorithmic steps during generation, enabling timely and targeted corrections. Extensive experiments on four public code generation benchmarks demonstrate that RPM-MCTS outperforms current state-of-the-art methods while achieving an approximately 15% reduction in token consumption. Furthermore, full fine-tuning of the base model using the data constructed by RPM-MCTS significantly enhances its code capabilities.
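The similarity filtering applied during expansion can be pictured as greedy de-duplication of candidate steps. The sketch below is one possible reading, not the authors' implementation: it uses character-level similarity from Python's `difflib` as a stand-in for whatever similarity measure the paper uses, and the threshold of 0.9 and the example steps are assumptions.

```python
from difflib import SequenceMatcher
from typing import List


def filter_redundant(candidates: List[str], threshold: float = 0.9) -> List[str]:
    """Keep a candidate only if it is not too similar to any already-kept one."""
    kept: List[str] = []
    for cand in candidates:
        is_novel = all(
            SequenceMatcher(None, cand, prev).ratio() < threshold
            for prev in kept
        )
        if is_novel:
            kept.append(cand)
    return kept


# Hypothetical candidate next steps proposed for one tree node.
steps = [
    "Sort the array, then scan with two pointers.",
    "Sort the array, then scan with two pointers!",
    "Build a hash map from value to index.",
]
unique_steps = filter_redundant(steps)
```

Here the second candidate differs from the first by a single character and is dropped, so only two distinct reasoning paths are expanded.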

[366] Semantic-KG: Using Knowledge Graphs to Construct Benchmarks for Measuring Semantic Similarity

Qiyao Wei, Edward Morrell, Lea Goetz, Mihaela van der Schaar

Main category: cs.AI

TL;DR: This paper introduces a knowledge graph-based method to generate benchmarks for evaluating semantic similarity methods in LLM outputs, addressing limitations of existing benchmarks.

Details

Motivation: Current semantic similarity methods may capture syntactic/lexical forms over semantic content, and existing benchmarks have high generation costs, limited domain availability, and unclear equivalence definitions.

Method: Leverage knowledge graphs to generate pairs of natural-language statements that are semantically similar or dissimilar, with dissimilar pairs categorized into four sub-types. Generate datasets in four domains and compare semantic similarity methods.

Result: Both the sub-type of semantic variation and the benchmark domain affect the performance of semantic similarity methods, and no method is consistently superior. LLM-as-a-judge performance varies significantly.

Conclusion: The findings have important implications for using LLM-as-a-judge in detecting semantic content, and the method provides a scalable way to evaluate semantic similarity methods across domains.

Abstract: Evaluating the open-form textual responses generated by Large Language Models (LLMs) typically requires measuring the semantic similarity of the response to a (human generated) reference. However, there is evidence that current semantic similarity methods may capture syntactic or lexical forms over semantic content. While benchmarks exist for semantic equivalence, they often suffer from high generation costs due to reliance on subjective human judgment, limited availability for domain-specific applications, and unclear definitions of equivalence. This paper introduces a novel method for generating benchmarks to evaluate semantic similarity methods for LLM outputs, specifically addressing these limitations. Our approach leverages knowledge graphs (KGs) to generate pairs of natural-language statements that are semantically similar or dissimilar, with dissimilar pairs categorized into one of four sub-types. We generate benchmark datasets in four different domains (general knowledge, biomedicine, finance, biology), and conduct a comparative study of semantic similarity methods including traditional natural language processing scores and LLM-as-a-judge predictions. We observe that both the sub-type of semantic variation and the domain of the benchmark affect the performance of semantic similarity methods, with no method being consistently superior. Our results present important implications for the use of LLM-as-a-judge in detecting the semantic content of text. Code is available at https://github.com/QiyaoWei/semantic-kg and the dataset is available at https://huggingface.co/datasets/QiyaoWei/Semantic-KG.

[367] A System-Level Taxonomy of Failure Modes in Large Language Model Applications

Vaishali Vinay

Main category: cs.AI

TL;DR: A taxonomy of 15 hidden LLM failure modes in production, highlighting the gap between traditional ML evaluation and real-world LLM system reliability challenges.

Details

Motivation: LLMs are increasingly deployed in production systems but their failure patterns differ from traditional ML models, creating reliability gaps that current evaluation methods don't address.

Method: Developed a system-level taxonomy of 15 hidden failure modes through analysis of real-world LLM applications, examining evaluation gaps and production challenges.

Result: Identified critical failure modes like reasoning drift, latent inconsistency, and cost-driven performance collapse, plus systemic issues in observability and monitoring.

Conclusion: LLM reliability should be treated as a system-engineering problem requiring new evaluation methodologies and design principles for robust, maintainable LLM systems.

Abstract: Large language models (LLMs) are being rapidly integrated into decision-support tools, automation workflows, and AI-enabled software systems. However, their behavior in production environments remains poorly understood, and their failure patterns differ fundamentally from those of traditional machine learning models. This paper presents a system-level taxonomy of fifteen hidden failure modes that arise in real-world LLM applications, including multi-step reasoning drift, latent inconsistency, context-boundary degradation, incorrect tool invocation, version drift, and cost-driven performance collapse. Using this taxonomy, we analyze the growing gap in evaluation and monitoring practices: existing benchmarks measure knowledge or reasoning but provide little insight into stability, reproducibility, drift, or workflow integration. We further examine the production challenges associated with deploying LLMs, including observability limitations, cost constraints, and update-induced regressions. Finally, we outline high-level design principles for building reliable, maintainable, and cost-aware LLM-based systems. By framing LLM reliability as a system-engineering problem rather than a purely model-centric one, this work provides an analytical foundation for future research on evaluation methodology, AI system robustness, and dependable LLM deployment.

Weizi Shao, Taolin Zhang, Zijie Zhou, Chen Chen, Chengyu Wang, Xiaofeng He

Main category: cs.AI

TL;DR: M$^3$Prune is a multi-modal multi-agent hierarchical communication graph pruning framework that reduces token overhead while maintaining performance in multi-modal retrieval-augmented generation systems.

Details

Motivation: Existing multi-agent systems for multi-modal RAG incur substantial token overhead and computational costs, making large-scale deployment challenging. The goal is to achieve optimal balance between task performance and token efficiency.

Method: The framework performs intra-modal graph sparsification on textual and visual modalities to identify critical edges, constructs dynamic communication topology using key edges for inter-modal sparsification, and progressively prunes redundant edges to create efficient hierarchical topology.

Result: Extensive experiments show the method consistently outperforms both single-agent and robust multi-agent mRAG systems while significantly reducing token consumption on general and domain-specific benchmarks.

Conclusion: M$^3$Prune effectively addresses the token overhead problem in multi-agent mRAG systems through hierarchical communication graph pruning, achieving better performance with reduced computational costs.

Abstract: Recent advancements in multi-modal retrieval-augmented generation (mRAG), which enhance multi-modal large language models (MLLMs) with external knowledge, have demonstrated that the collective intelligence of multiple agents can significantly outperform a single model through effective communication. Despite impressive performance, existing multi-agent systems inherently incur substantial token overhead and increased computational costs, posing challenges for large-scale deployment. To address these issues, we propose a novel Multi-Modal Multi-agent hierarchical communication graph PRUNING framework, termed M$^3$Prune. Our framework eliminates redundant edges across different modalities, achieving an optimal balance between task performance and token overhead. Specifically, M$^3$Prune first applies intra-modal graph sparsification to textual and visual modalities, identifying the edges most critical for solving the task. Subsequently, we construct a dynamic communication topology using these key edges for inter-modal graph sparsification. Finally, we progressively prune redundant edges to obtain a more efficient and hierarchical topology. Extensive experiments on both general and domain-specific mRAG benchmarks demonstrate that our method consistently outperforms both single-agent and robust multi-agent mRAG systems while significantly reducing token consumption.

[369] Reducing Latency of LLM Search Agent via Speculation-based Algorithm-System Co-Design

Zixiao Huang, Wen Zeng, Tianyu Fu, Tengxuan Liu, Yizhou Sun, Ke Hong, Xinhao Yang, Chengchun Liu, Yan Li, Quanlu Zhang, Guohao Dai, Zhenhua Zhu, Yu Wang

Main category: cs.AI

TL;DR: SPAgent is a framework that uses adaptive speculation to reduce latency in LLM-based search agents by predicting simple actions without full reasoning and selectively omitting verification when safe.

Details

Motivation: LLM-based search agents suffer from severe latency due to serialized LLM reasoning and tool execution at each step, limiting their practical deployment despite strong performance.

Method: Algorithm-system co-design with two-phase adaptive speculation mechanism that selectively omits verification when safe, and a two-level scheduler to regulate speculative requests based on engine load.

Result: Achieves up to 1.65× end-to-end speedup while maintaining or even improving accuracy across extensive experimental settings.

Conclusion: SPAgent enables practical deployment of multi-step search agents by significantly reducing latency through effective speculation techniques.

Abstract: LLM-based search agents achieve strong performance but suffer from severe latency, as each step requires serialized LLM reasoning followed by tool execution. We revisit this bottleneck through the lens of speculation. While the traditional predict-verify speculation paradigm can break serial execution, its benefit remains limited, as it retains the full original workload and adds extra inference overhead. We observe that early agent steps often involve simple evidence-gathering, where correct actions can be predicted without full reasoning. Building on these observations, we present SPAgent, an algorithm-system co-design framework that expands the role of speculation in search agents to reduce latency. Algorithmically, SPAgent introduces a two-phase adaptive speculation mechanism that selectively omits verification when safe. System-wise, a two-level scheduler regulates speculative requests based on engine load to ensure speculation remains beneficial. We implement SPAgent in real-world systems. Across extensive experimental settings, SPAgent achieves up to $1.65\times$ end-to-end speedup while maintaining or even improving accuracy, enabling practical deployment of multi-step search agents.

[370] “Are We Done Yet?”: A Vision-Based Judge for Autonomous Task Completion of Computer Use Agents

Marta Sumyk, Oleksandr Kosovan

Main category: cs.AI

TL;DR: Vision-based evaluation framework using VLMs to assess task completion from screenshots, improving autonomous computer-use agents’ reliability.

Details

Motivation: Computer Use Agents often fail to reliably determine task completion, needing better evaluation mechanisms.

Method: Autonomous evaluation framework using vision-language models to assess task completion from screenshots and task descriptions across 42 macOS applications.

Result: 73% accuracy in task success detection and 27% average relative improvement in overall task success with evaluator feedback.

Conclusion: Vision-based evaluation serves as an effective feedback mechanism that improves reliability and self-correction of autonomous computer-use agents.

Abstract: Computer Use Agents (CUAs) are designed to autonomously operate digital interfaces, yet they often fail to reliably determine whether a given task has been completed. We present an autonomous evaluation and feedback framework that uses vision-language models to assess task completion directly from screenshots and task descriptions. Our dataset covers 42 built-in macOS applications and 1,260 human-labeled tasks across a wide range of scenarios. Our framework achieves up to 73 percent accuracy in task success detection and yields an average relative improvement of 27 percent in overall task success when evaluator feedback is applied. These results show that vision-based evaluation can serve as an effective feedback mechanism that improves the reliability and self-correction of autonomous computer-use agents.

[371] From data to concepts via wiring diagrams

Jason Lo, Mohammadnima Jafari

Main category: cs.AI

TL;DR: The paper introduces quasi-skeleton wiring diagram graphs, proves their correspondence to Hasse diagrams, and develops algorithms to extract wiring diagrams from sequential data for analyzing autonomous agent behavior.

Details

Motivation: To develop mathematical frameworks and algorithms for extracting wiring diagrams from sequential data, particularly for analyzing temporal processes and autonomous agent behavior in complex systems.

Method: Introduced quasi-skeleton wiring diagram graphs and proved their correspondence to Hasse diagrams. Designed algorithms to extract wiring diagrams from sequential data, with comparisons to DBSCAN and agglomerative hierarchical clustering methods.

Result: The algorithms successfully identified winning strategies of an autonomous agent playing a computer game, and performance was evaluated against standard clustering techniques including under data perturbation conditions.

Conclusion: The work successfully integrates techniques from category theory, graph theory, clustering, reinforcement learning, and data engineering to provide a comprehensive framework for analyzing sequential data through wiring diagrams.

Abstract: A wiring diagram is a labeled directed graph that represents an abstract concept such as a temporal process. In this article, we introduce the notion of a quasi-skeleton wiring diagram graph, and prove that quasi-skeleton wiring diagram graphs correspond to Hasse diagrams. Using this result, we designed algorithms that extract wiring diagrams from sequential data. We used our algorithms in analyzing the behavior of an autonomous agent playing a computer game, and the algorithms correctly identified the winning strategies. We compared the performance of our main algorithm with two other algorithms based on standard clustering techniques (DBSCAN and agglomerative hierarchical), including when some of the data was perturbed. Overall, this article brings together techniques in category theory, graph theory, clustering, reinforcement learning, and data engineering.
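
For readers unfamiliar with Hasse diagrams: the Hasse diagram of a finite partial order keeps only the covering pairs, that is, pairs a < b with no element strictly between them. The sketch below computes those pairs for divisibility on the divisors of 12. It illustrates the general notion only; it is not the paper's extraction algorithm, and the function name and example order are assumptions.

```python
from itertools import product


def hasse_edges(elements, leq):
    """Return the covering pairs (a, b) of a finite partial order given by leq."""
    edges = []
    for a, b in product(elements, repeat=2):
        if a == b or not leq(a, b):
            continue
        # (a, b) is a covering pair if no c sits strictly between a and b.
        if not any(
            c != a and c != b and leq(a, c) and leq(c, b) for c in elements
        ):
            edges.append((a, b))
    return edges


# Example order (an assumption for illustration): a <= b iff a divides b.
divisors = [1, 2, 3, 4, 6, 12]
edges = hasse_edges(divisors, lambda a, b: b % a == 0)
```

The pair (1, 4) is omitted because 2 lies between 1 and 4, while (2, 4) is kept; the result has seven edges.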

[372] Large language models replicate and predict human cooperation across experiments in game theory

Andrea Cera Palatsi, Samuel Martin-Gutierrez, Ana S. Cardenal, Max Pellert

Main category: cs.AI

TL;DR: LLMs can replicate human decision-making patterns in game theory experiments, with Llama closely matching human cooperation behavior and Qwen aligning with Nash equilibrium predictions, enabling systematic exploration of social decision-making.

Details

Motivation: To understand how closely LLMs mirror actual human decision-making, as misalignment could produce harmful outcomes in practical applications while failure to replicate human behavior renders LLMs ineffective for social simulations.

Method: Developed a digital twin of game-theoretic experiments and introduced a systematic prompting and probing framework for machine-behavioral evaluation, testing three open-source models (Llama, Mistral and Qwen).

Result: Llama reproduces human cooperation patterns with high fidelity, capturing human deviations from rational choice theory, while Qwen aligns closely with Nash equilibrium predictions. Achieved population-level behavioral replication without persona-based prompting.

Conclusion: Appropriately calibrated LLMs can replicate aggregate human behavioral patterns and enable systematic exploration of unexplored experimental spaces, offering a complementary approach to traditional research in social and behavioral sciences.

Abstract: Large language models (LLMs) are increasingly used both to make decisions in domains such as health, education and law, and to simulate human behavior. Yet how closely LLMs mirror actual human decision-making remains poorly understood. This gap is critical: misalignment could produce harmful outcomes in practical applications, while failure to replicate human behavior renders LLMs ineffective for social simulations. Here, we address this gap by developing a digital twin of game-theoretic experiments and introducing a systematic prompting and probing framework for machine-behavioral evaluation. Testing three open-source models (Llama, Mistral and Qwen), we find that Llama reproduces human cooperation patterns with high fidelity, capturing human deviations from rational choice theory, while Qwen aligns closely with Nash equilibrium predictions. Notably, we achieved population-level behavioral replication without persona-based prompting, simplifying the simulation process. Extending beyond the original human-tested games, we generate and preregister testable hypotheses for novel game configurations outside the original parameter grid. Our findings demonstrate that appropriately calibrated LLMs can replicate aggregate human behavioral patterns and enable systematic exploration of unexplored experimental spaces, offering a complementary approach to traditional research in the social and behavioral sciences that generates new empirical predictions about human social decision-making.
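
As background for the Nash equilibrium comparison, the sketch below enumerates pure-strategy Nash equilibria of a two-player game by checking that neither player can gain by deviating unilaterally. The payoff values are a textbook-style Prisoner's Dilemma chosen for illustration; they are an assumption and are not taken from the paper's experiments.

```python
def pure_nash_equilibria(payoffs):
    """payoffs[(r, c)] = (row_payoff, col_payoff); return pure-strategy equilibria."""
    rows = sorted({r for r, _ in payoffs})
    cols = sorted({c for _, c in payoffs})
    equilibria = []
    for r in rows:
        for c in cols:
            # Row player cannot do better by switching rows against column c.
            row_best = all(payoffs[(r, c)][0] >= payoffs[(r2, c)][0] for r2 in rows)
            # Column player cannot do better by switching columns against row r.
            col_best = all(payoffs[(r, c)][1] >= payoffs[(r, c2)][1] for c2 in cols)
            if row_best and col_best:
                equilibria.append((r, c))
    return equilibria


# Illustrative payoffs (assumed): C = cooperate, D = defect.
prisoners_dilemma = {
    ("C", "C"): (3, 3),
    ("C", "D"): (0, 5),
    ("D", "C"): (5, 0),
    ("D", "D"): (1, 1),
}
equilibria = pure_nash_equilibria(prisoners_dilemma)  # [("D", "D")]
```

Mutual defection is the only equilibrium here even though mutual cooperation pays both players more, which is why observed human cooperation counts as a deviation from the rational-choice prediction.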

[373] Towards Benign Memory Forgetting for Selective Multimodal Large Language Model Unlearning

Zhen Zeng, Leijiang Gu, Zhangling Duan, Feng Li, Zenglin Shi, Cees G. M. Snoek, Meng Wang

Main category: cs.AI

TL;DR: SMFA is a method for precise privacy-sensitive knowledge removal in MLLMs that prevents performance degradation on general visual understanding tasks.

Details

Motivation: Existing unlearning methods for MLLMs remove sensitive information but degrade general image understanding performance, failing to achieve benign forgetting.

Method: SMFA uses a memory forgetting adapter fine-tuned to replace sensitive responses with refusals, combined with a retaining anchor-guided masking mechanism to protect unrelated knowledge.

Result: SMFA achieves precise and controllable unlearning while maintaining foundational image understanding, outperforming prior methods.

Conclusion: SMFA enables targeted forgetting of sensitive information in MLLMs without compromising general visual understanding capabilities.

Abstract: Multimodal Large Language Models (MLLMs) achieve remarkable capabilities but can inadvertently memorize privacy-sensitive information. Although existing unlearning methods can remove such knowledge, they fail to achieve benign forgetting because they often degrade the model’s general image understanding performance. To address this, we propose the Sculpted Memory Forgetting Adapter (SMFA), which confines forgetting to targeted memory regions while preserving overall capabilities. SMFA first fine-tunes the model to replace sensitive responses with refusals, yielding a memory forgetting adapter, and then applies a retaining anchor-guided masking mechanism to prevent interference with unrelated knowledge and understanding ability. To systematically evaluate selective MLLM unlearning, we introduce S-MLLMUn Bench, the first benchmark designed to jointly assess the removal of sensitive knowledge and retention of general visual understanding. Extensive experiments show that, unlike prior methods, SMFA achieves precise and controllable unlearning while maintaining the model’s foundational image understanding.

[374] Interactive AI NPCs Powered by LLMs: Technical Report for the CPDC Challenge 2025

Yitian Huang, Yuxuan Lei, Jianxun Lian, Hao Liao

Main category: cs.AI

TL;DR: A winning solution for CPDC 2025 that combines context engineering with reinforcement learning, achieving top rankings across multiple tracks through improved tool stability and dialogue performance.

Details

Motivation: To address challenges in commonsense persona-grounded dialogue, particularly improving tool call stability, execution reliability, and mitigating small-sample overfitting in dialogue systems.

Method: Two-component framework: 1) Context Engineering with dynamic tool pruning, persona clipping, and post-processing techniques; 2) GRPO reinforcement learning training replacing supervised fine-tuning for GPU Track.

Result: Ranked 1st in Task 2 API, 2nd in Task 1 API, and 3rd in both Task 3 API and GPU track in the final CPDC 2025 evaluation.

Conclusion: The proposed simple yet effective framework successfully unifies improvements across both GPU and API tracks, demonstrating strong performance in commonsense persona-grounded dialogue tasks.

Abstract: This report presents the solution and results of our team MSRA_SC in the Commonsense Persona-Grounded Dialogue Challenge (CPDC 2025). We propose a simple yet effective framework that unifies improvements across both GPU Track and API Track. Our method centers on two key components. First, Context Engineering applies dynamic tool pruning and persona clipping for input compression, combined with post-processing techniques such as parameter normalization and function merging. Together with manually refined prompts, this design improves tool call stability, execution reliability, and role-playing guidance. Second, in the GPU Track, we further adopt GRPO training, replacing supervised fine-tuning with reinforcement learning directly optimized by reward signals. This mitigates small-sample overfitting and significantly enhances task-oriented dialogue performance. In the final evaluation, our team ranks 1st in Task 2 API, 2nd in Task 1 API, and 3rd in both Task 3 API and GPU track, demonstrating the effectiveness of our approach. Our code is publicly available at https://gitlab.aicrowd.com/nikoo_yu/cpdc-2025-winning-solution
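
The report does not describe its GRPO configuration, so the following is only a minimal sketch of the group-relative advantage that gives GRPO its name: rewards for several responses sampled from the same prompt are standardized within the group, removing the need for a learned value function. The function name, the use of the population standard deviation, and the epsilon term are assumptions.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one group of responses sampled for a prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; implementations may differ
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four sampled responses to one prompt.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Responses that score above the group mean receive positive advantages and the rest receive negative ones, so the policy update favors the better responses within each group.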

Haebin Seong, Sungmin Kim, Minchan Kim, Yongjun Cho, Myunchul Joe, Suhwan Choi, Jaeyoon Jung, Jiyong Youn, Yoonshik Kim, Samwoo Seong, Yubeen Park, Youngjae Yu, Yunsung Lee

Main category: cs.AI

TL;DR: CostNav is the first micro-navigation economic testbed that evaluates autonomous delivery robots through cost-revenue analysis, revealing that optimizing for task success differs from optimizing for commercial viability.

Details

Motivation: Existing navigation benchmarks focus only on task success metrics while overlooking economic viability, which is critical for commercial deployment of autonomous delivery robots.

Method: CostNav models the complete economic lifecycle including hardware, training, energy, maintenance costs, and delivery revenue with service-level agreements, using industry-derived parameters. It projects from reduced-scale simulation to realistic deliveries.

Result: The baseline achieves 43.0% SLA compliance but is not commercially viable, yielding a loss of $30.009 per run with no finite break-even point. Collision-induced maintenance accounts for 99.7% of per-run costs.

Conclusion: CostNav bridges the gap between navigation research and commercial deployment, enabling data-driven decisions about economic trade-offs and highlighting collision avoidance as a key optimization target for commercial viability.

Abstract: Existing navigation benchmarks focus on task success metrics while overlooking economic viability, which is critical for commercial deployment of autonomous delivery robots. We introduce CostNav, a Micro-Navigation Economic Testbed that evaluates embodied agents through comprehensive cost-revenue analysis aligned with real-world business operations. CostNav models the complete economic lifecycle including hardware, training, energy, maintenance costs, and delivery revenue with service-level agreements, using industry-derived parameters. To our knowledge, CostNav is the first work to quantitatively expose the gap between navigation research metrics and commercial viability, revealing that optimizing for task success fundamentally differs from optimizing for economic deployment. Our cost model uses parameters derived from industry data sources (energy rates, delivery service pricing), and we project from a reduced-scale simulation to realistic deliveries. Under this projection, the baseline achieves 43.0% SLA compliance but is not commercially viable, yielding a loss of $30.009 per run with no finite break-even point, because operating costs are dominated by collision-induced maintenance, which accounts for 99.7% of per-run costs and highlights collision avoidance as a key optimization target. We demonstrate a learning-based on-device navigation baseline and establish a foundation for evaluating rule-based navigation, imitation learning, and cost-aware RL training. CostNav bridges the gap between navigation research and commercial deployment, enabling data-driven decisions about economic trade-offs across navigation paradigms.
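
The phrase "no finite break-even point" follows from simple arithmetic: if each run loses money, no number of runs can recover the fixed costs. The sketch below makes that explicit. It is a simplified model, not CostNav's cost model, and the fixed cost, revenue, and per-run cost figures are hypothetical placeholders rather than values from the paper.

```python
import math
from typing import Optional


def break_even_runs(
    fixed_cost: float, revenue_per_run: float, cost_per_run: float
) -> Optional[int]:
    """Runs needed to recover fixed_cost, or None if no finite break-even exists."""
    margin = revenue_per_run - cost_per_run
    if margin <= 0:
        # Every run loses money (or breaks even), so fixed costs are never recovered.
        return None
    return math.ceil(fixed_cost / margin)


# Hypothetical figures for illustration only.
profitable = break_even_runs(fixed_cost=10_000.0, revenue_per_run=5.0, cost_per_run=3.0)
unprofitable = break_even_runs(fixed_cost=10_000.0, revenue_per_run=5.0, cost_per_run=35.0)
# profitable == 5000, unprofitable is None
```

With a negative per-run margin, such as the reported loss per run, the function returns None, matching the abstract's conclusion that the baseline never breaks even.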

[376] Actionable and diverse counterfactual explanations incorporating domain knowledge and causal constraints

Szymon Bobek, Łukasz Bałec, Grzegorz J. Nalepa

Main category: cs.AI

TL;DR: DANCE method generates diverse, actionable counterfactual explanations by incorporating feature dependencies and causal constraints to ensure plausibility and real-world feasibility.

Details

Motivation: Address limitations of existing counterfactual explanation methods that ignore complex dependencies in real-world datasets, leading to unrealistic modifications. Motivated by cybersecurity applications in email marketing.

Method: Learns linear and nonlinear constraints from data or integrates expert-provided dependency graphs to ensure counterfactuals are plausible and actionable. Balances plausibility, diversity, and sparsity.

Result: Outperforms existing approaches on 140 public datasets. Generates meaningful, domain-relevant counterfactuals that align with real-world constraints.

Conclusion: DANCE effectively addresses key limitations in existing algorithms by incorporating feature dependencies and causal constraints, producing practical and actionable explanations for real-world applications.

Abstract: Counterfactual explanations enhance the actionable interpretability of machine learning models by identifying the minimal changes required to achieve a desired outcome of the model. However, existing methods often ignore the complex dependencies in real-world datasets, leading to unrealistic or impractical modifications. Motivated by cybersecurity applications in the email marketing domain, we propose a method for generating Diverse, Actionable, and kNowledge-Constrained Explanations (DANCE), which incorporates feature dependencies and causal constraints to ensure plausibility and real-world feasibility of counterfactuals. Our method learns linear and nonlinear constraints from data or integrates expert-provided dependency graphs, ensuring counterfactuals are plausible and actionable. By maintaining consistency with feature relationships, the method produces explanations that align with real-world constraints. Additionally, it balances plausibility, diversity, and sparsity, effectively addressing key limitations in existing algorithms. The work is developed based on a real-life case study with Freshmail, the largest email marketing company in Poland and supported by a joint R&D project Sendguard. Furthermore, we provide an extensive evaluation using 140 public datasets, which highlights its ability to generate meaningful, domain-relevant counterfactuals that outperform other existing approaches based on widely used metrics. The source code for reproduction of the results can be found in a GitHub repository we provide.

[377] SMoG: Schema Matching on Graph

Mingyu Jeon, Jaeyoung Suh, Suwan Cho

Main category: cs.AI

TL;DR: SMoG is a novel framework for schema matching that uses iterative 1-hop SPARQL queries on knowledge graphs to enhance reliability and reduce storage requirements compared to existing LLM and KG approaches.

Details

Motivation: Address limitations of LLMs (hallucination, outdated knowledge) and inefficient KG approaches (complex multi-hop queries, storage-intensive methods) in medical schema matching tasks like EHR alignment to OMOP CDM.

Method: Leverages iterative execution of simple 1-hop SPARQL queries inspired by KGQA strategies, directly querying SPARQL endpoints without vector storage.

Result: Achieves performance comparable to state-of-the-art baselines on real-world medical datasets while improving explainability and reducing storage requirements.

Conclusion: SMoG provides an effective and efficient approach for KG-augmented schema matching with enhanced reliability and human-verifiable query paths.

Abstract: Schema matching is a critical task in data integration, particularly in the medical domain where disparate Electronic Health Record (EHR) systems must be aligned to standard models like OMOP CDM. While Large Language Models (LLMs) have shown promise in schema matching, they suffer from hallucination and lack of up-to-date domain knowledge. Knowledge Graphs (KGs) offer a solution by providing structured, verifiable knowledge. However, existing KG-augmented LLM approaches often rely on inefficient complex multi-hop queries or storage-intensive vector-based retrieval methods. This paper introduces SMoG (Schema Matching on Graph), a novel framework that leverages iterative execution of simple 1-hop SPARQL queries, inspired by successful strategies in Knowledge Graph Question Answering (KGQA). SMoG enhances explainability and reliability by generating human-verifiable query paths while significantly reducing storage requirements by directly querying SPARQL endpoints. Experimental results on real-world medical datasets demonstrate that SMoG achieves performance comparable to state-of-the-art baselines, validating its effectiveness and efficiency in KG-augmented schema matching.
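
A 1-hop query asks only for the triples directly attached to one entity, which keeps each step cheap and easy for a human to verify. The sketch below builds such a SPARQL query as a string. The abstract does not give SMoG's actual query templates, so the function name, the example IRI, and the LIMIT value are hypothetical.

```python
def one_hop_query(entity_iri: str, limit: int = 50) -> str:
    """Build a SPARQL query for the outgoing 1-hop triples of one entity."""
    return (
        "SELECT ?predicate ?object WHERE { "
        f"<{entity_iri}> ?predicate ?object . "
        f"}} LIMIT {limit}"
    )


# Hypothetical entity IRI; a real run would use an IRI from the target knowledge graph.
query = one_hop_query("http://example.org/concept/123")
```

An iterative matcher would send such a query to a SPARQL endpoint, inspect the returned predicates and objects, choose the next entity to expand, and repeat, so the sequence of queries doubles as an explanation of the match.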

[378] Improving Language Agents through BREW

Shashank Kirtania, Param Biyani, Priyanshu Gupta, Yasharth Bajpai, Roshni Iyer, Sumit Gulwani, Gustavo Soares

Main category: cs.AI

TL;DR: BREW introduces a framework for optimizing LLM-based agents through structured memory construction and refinement, achieving improved task precision and efficiency while maintaining computational efficiency.

Details

Motivation: Current training paradigms for LLM agents (PPO, GRPO) have high computational overhead and produce opaque policies that are difficult to interpret, adapt, or incrementally improve.

Method: BREW creates and refines structured memory of experiential learning, using task graders and behavior rubrics to learn insights while leveraging state-space search for robustness against natural language noise.

Result: Empirical results show 10-20% improvement in task precision, 10-15% reduction in API/tool calls leading to faster execution time, while maintaining computational efficiency on par with base models.

Conclusion: BREW establishes knowledge bases as modular and controllable substrates for agent optimization - an explicit lever for shaping behavior in a transparent, interpretable, and extensible manner.

Abstract: Large Language Model (LLM)-based agents are increasingly applied to tasks requiring structured reasoning, tool use, and environmental adaptation, such as data manipulation, multistep planning, and computer-use automation. However, despite their versatility, current training paradigms based on model weight optimization, such as PPO and GRPO, remain relatively impractical because of their high computational overhead for rollout convergence. In addition, the resulting agent policies are difficult to interpret, adapt, or incrementally improve. To address this, we investigate creating and refining structured memory of experiential learning of an agent from its environment as an alternative route to agent optimization. We introduce BREW (Bootstrapping expeRientially-learned Environmental knoWledge), a framework for agent optimization for downstream tasks via knowledge base (KB) construction and refinement. In our formulation, we introduce an effective method for partitioning agent memory for more efficient retrieval and refinement. BREW uses task graders and behavior rubrics to learn insights while leveraging state-space search to ensure robustness against the noise and non-specificity in natural language. Empirical results on real-world, domain-grounded benchmarks (OSWorld, $τ^2$Bench, and SpreadsheetBench) show BREW achieves a 10-20% improvement in task precision and a 10-15% reduction in API/tool calls, leading to faster execution time, all while maintaining computational efficiency on par with base models. Unlike prior work where memory is treated as static context, we establish the KB as a modular and controllable substrate for agent optimization: an explicit lever for shaping behavior in a transparent, interpretable, and extensible manner.

[379] Data Augmentation Techniques to Reverse-Engineer Neural Network Weights from Input-Output Queries

Alexander Beiser, Flavio Martinelli, Wulfram Gerstner, Johanni Brea

Main category: cs.AI

TL;DR: The paper proposes new data augmentation techniques to improve reverse-engineering of neural network weights when teacher parameters outnumber training data, enabling recovery of networks with 100x more parameters than training samples.

Details

Motivation: Current methods fail when teacher network parameters exceed training data size, causing students to overfit queries instead of aligning with teacher parameters. Standard augmentations like rotation and flipping provide little improvement for network identification.

Method: Designed new data augmentation techniques specifically tailored to better sample the representational space of network hidden layers, rather than using standard augmentations.

Result: Extended state-of-the-art range of recoverable network sizes, demonstrating recovery of networks with up to 100 times more parameters than training data-points.

Conclusion: Custom-designed augmentations that target network hidden layer representations significantly improve network weight reverse-engineering compared to standard augmentation methods.

Abstract: Network weights can be reverse-engineered given enough informative samples of a network’s input-output function. In a teacher-student setup, this translates into collecting a dataset of the teacher mapping – querying the teacher – and fitting a student to imitate such mapping. A sensible choice of queries is the dataset the teacher is trained on. But current methods fail when the teacher parameters are more numerous than the training data, because the student overfits to the queries instead of aligning its parameters to the teacher. In this work, we explore augmentation techniques to best sample the input-output mapping of a teacher network, with the goal of eliciting a rich set of representations from the teacher hidden layers. We discover that standard augmentations such as rotation, flipping, and adding noise, bring little to no improvement to the identification problem. We design new data augmentation techniques tailored to better sample the representational space of the network’s hidden layers. With our augmentations we extend the state-of-the-art range of recoverable network sizes. To test their scalability, we show that we can recover networks of up to 100 times more parameters than training data-points.

[380] Active Inference in Discrete State Spaces from First Principles

Patrick Kenny

Main category: cs.AI

TL;DR: The paper disentangles active inference from the Free Energy Principle, showing that discrete state space optimizations can be formulated as constrained divergence minimization problems solvable by standard mean field methods without expected free energy.

Details

Motivation: To clarify the concept of active inference by separating it from the Free Energy Principle framework.

Method: Formulate active inference optimizations as constrained divergence minimization problems using standard mean field methods, avoiding expected free energy concepts.

Result: Shows that perception modeling coincides with variational free energy, while action modeling differs from expected free energy by an entropy regularizer.

Conclusion: Active inference can be implemented through constrained divergence minimization without relying on the Free Energy Principle framework.

Abstract: We seek to clarify the concept of active inference by disentangling it from the Free Energy Principle. We show how the optimizations that need to be carried out in order to implement active inference in discrete state spaces can be formulated as constrained divergence minimization problems which can be solved by standard mean field methods that do not appeal to the idea of expected free energy. When it is used to model perception, the perception/action divergence criterion that we propose coincides with variational free energy. When it is used to model action, it differs from an expected free energy functional by an entropy regularizer.
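
For reference, the variational free energy that the abstract says the perception criterion coincides with is usually written as below for a discrete generative model. This is the textbook definition, not notation taken from the paper: the symbols $q(s)$ (approximate posterior over hidden states $s$), $p(o, s)$ (generative model), and $o$ (observation) are assumptions made here for illustration.

```latex
F[q] = \sum_{s} q(s) \left[ \ln q(s) - \ln p(o, s) \right]
     = D_{\mathrm{KL}}\!\left( q(s) \,\|\, p(s \mid o) \right) - \ln p(o)
```

Since $\ln p(o)$ does not depend on $q$, minimizing $F$ over $q$ drives $q(s)$ toward the exact posterior $p(s \mid o)$. The entropy regularizer that separates the paper's action criterion from expected free energy is not specified in the abstract and is not reproduced here.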

[381] NNGPT: Rethinking AutoML with Large Language Models

Roman Kochnev, Waleed Khalid, Tolgay Atinc Uzun, Xi Zhang, Yashkumar Sanjaybhai Dhameliya, Furui Qin, Chandini Vysyaraju, Raghuvir Duvvuri, Avi Goyal, Dmitry Ignatov, Radu Timofte

Main category: cs.AI

TL;DR: NNGPT is an open-source framework that transforms LLMs into self-improving AutoML engines for neural network development, using a closed-loop system of generation, assessment, and learning to continuously improve model synthesis.

Details

Motivation: To address the fundamental challenge of building self-improving AI systems by creating an autonomous AutoML engine that can continuously generate and improve neural networks without extensive human intervention.

Method: Integrates five LLM-based pipelines: zero-shot architecture synthesis, hyperparameter optimization, code-aware accuracy prediction, retrieval-augmented synthesis (NN-RAG), and reinforcement learning. Uses LEMUR dataset for training and validation, with PyTorch adapter for framework-agnostic operation.

Result: Achieved 73% executability on 1,289 targets, 3-shot prompting boosts accuracy, HPO outperforms Optuna (RMSE 0.60 vs 0.64), code-aware predictor reaches RMSE 0.14 with Pearson r=0.78. Generated over 5K validated models, with one-shot prediction matching search-based AutoML performance.

Conclusion: NNGPT successfully demonstrates autonomous neural network development capabilities, serving as an effective self-improving AutoML engine that reduces the need for extensive trial runs while maintaining strong performance across various metrics.

Abstract: Building self-improving AI systems remains a fundamental challenge in the AI domain. We present NNGPT, an open-source framework that turns a large language model (LLM) into a self-improving AutoML engine for neural network development, primarily for computer vision. Unlike previous frameworks, NNGPT extends the dataset of neural networks by generating new models, enabling continuous fine-tuning of LLMs based on a closed-loop system of generation, assessment, and self-improvement. It integrates within one unified workflow five synergistic LLM-based pipelines: zero-shot architecture synthesis, hyperparameter optimization (HPO), code-aware accuracy/early-stop prediction, retrieval-augmented synthesis of scope-closed PyTorch blocks (NN-RAG), and reinforcement learning. Built on the LEMUR dataset as an audited corpus with reproducible metrics, NNGPT emits from a single prompt and validates network architecture, preprocessing code, and hyperparameters, executes them end-to-end, and learns from the result. The PyTorch adapter makes NNGPT framework-agnostic, enabling strong performance: NN-RAG achieves 73% executability on 1,289 targets, 3-shot prompting boosts accuracy on common datasets, and hash-based deduplication saves hundreds of runs. One-shot prediction matches search-based AutoML, reducing the need for numerous trials. HPO on LEMUR achieves RMSE 0.60, outperforming Optuna (0.64), while the code-aware predictor reaches RMSE 0.14 with Pearson r=0.78. The system has already generated over 5K validated models, proving NNGPT as an autonomous AutoML engine. Upon acceptance, the code, prompts, and checkpoints will be released for public access to enable reproducibility and facilitate community usage.
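
The abstract mentions hash-based deduplication of generated models but does not describe the scheme. The following is a minimal sketch of one way such a filter could work, assuming that generated models arrive as source-code strings and that stripping blank lines and trailing whitespace is an acceptable normalization. The function names `code_fingerprint` and `deduplicate` are hypothetical and are not part of NNGPT.

```python
import hashlib


def code_fingerprint(source):
    """Hash a source string after a light normalization (an assumption of this sketch)."""
    # Drop blank lines and trailing whitespace so trivial formatting changes collide.
    lines = [line.rstrip() for line in source.splitlines() if line.strip()]
    return hashlib.sha256("\n".join(lines).encode("utf-8")).hexdigest()


def deduplicate(candidates):
    """Keep the first occurrence of each distinct fingerprint, preserving order."""
    seen = set()
    unique = []
    for source in candidates:
        fingerprint = code_fingerprint(source)
        if fingerprint not in seen:
            seen.add(fingerprint)
            unique.append(source)
    return unique


# Toy candidates: the second differs from the first only in whitespace.
generated = [
    "import torch\nx = torch.nn.Linear(4, 2)\n",
    "import torch   \n\nx = torch.nn.Linear(4, 2)",
    "import torch\nx = torch.nn.Linear(4, 8)\n",
]
unique_models = deduplicate(generated)
```

Skipping duplicates before training is what would save runs; a stricter normalization (for example, hashing a parsed syntax tree) would catch more near-duplicates at the cost of extra parsing.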

[382] VibraVerse: A Large-Scale Geometry-Acoustics Alignment Dataset for Physically-Consistent Multimodal Learning

Bo Pang, Chenxi Xu, Jierui Ren, Guoping Wang, Sheng Li

Main category: cs.AI

TL;DR: VibraVerse is a large-scale dataset that bridges 3D geometry, physical properties, and acoustic signals for physically consistent multimodal learning, accompanied by the CLASP framework for cross-modal alignment.

Details

Motivation: Existing multimodal frameworks lack physical consistency and overlook causal relationships between object geometry, material properties, and acoustic responses.

Method: Created VibraVerse dataset with explicit physical properties and volumetric geometry, computed modal parameters, and introduced CLASP contrastive learning framework for cross-modal alignment.

Result: Models trained on VibraVerse show superior accuracy, interpretability, and generalization across geometry-to-sound prediction, sound-guided shape reconstruction, and cross-modal learning tasks.

Conclusion: VibraVerse establishes a benchmark for physically consistent multimodal learning, providing a foundation for sound-guided embodied perception and a deeper understanding of the physical world.

Abstract: Understanding the physical world requires perceptual models grounded in physical laws rather than mere statistical correlations. However, existing multimodal learning frameworks, focused on vision and language, lack physical consistency and overlook the intrinsic causal relationships among an object’s geometry, material, vibration modes, and the sounds it produces. We introduce VibraVerse, a large-scale geometry-acoustics alignment dataset that explicitly bridges the causal chain from 3D geometry -> physical attributes -> modal parameters -> acoustic signals. Each 3D model has explicit physical properties (density, Young’s modulus, Poisson’s ratio) and volumetric geometry, from which modal eigenfrequencies and eigenvectors are computed for impact sound synthesis under controlled excitations. To establish this coherence, we introduce CLASP, a contrastive learning framework for cross-modal alignment that preserves the causal correspondence between an object’s physical structure and its acoustic response. This framework enforces physically consistent alignment across modalities, ensuring that every sample is coherent, traceable to the governing equations, and embedded within a unified representation space spanning shape, image, and sound. Built upon VibraVerse, we define a suite of benchmark tasks for geometry-to-sound prediction, sound-guided shape reconstruction, and cross-modal representation learning. Extensive validations on these tasks demonstrate that models trained on VibraVerse exhibit superior accuracy, interpretability, and generalization across modalities. These results establish VibraVerse as a benchmark for physically consistent and causally interpretable multimodal learning, providing a foundation for sound-guided embodied perception and a deeper understanding of the physical world. The dataset will be open-sourced.
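
To make the step from modal parameters to an acoustic signal concrete, the sketch below uses the standard modal-synthesis idea of summing exponentially damped sinusoids, one per vibration mode. It is an illustration under stated assumptions, not the VibraVerse pipeline: the function `synthesize_impact`, the per-mode damping and amplitude values, and the sample rate are all hypothetical.

```python
import math


def synthesize_impact(modes, duration_s=0.5, sample_rate=16000):
    """Sum exponentially damped sinusoids, one per vibration mode.

    modes: list of (frequency_hz, damping_per_s, amplitude) tuples.
    """
    n_samples = int(duration_s * sample_rate)
    signal = []
    for i in range(n_samples):
        t = i / sample_rate
        sample = sum(
            amp * math.exp(-damp * t) * math.sin(2.0 * math.pi * freq * t)
            for freq, damp, amp in modes
        )
        signal.append(sample)
    return signal


# Hypothetical modal parameters, chosen only for illustration.
example_modes = [(440.0, 6.0, 1.0), (1230.0, 12.0, 0.5), (2750.0, 25.0, 0.25)]
audio = synthesize_impact(example_modes)
```

In a physically grounded setting the frequencies would come from the eigenvalues of the object's mass and stiffness matrices, which is where geometry and material properties enter.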

[383] DRAFT-RL: Multi-Agent Chain-of-Draft Reasoning for Reinforcement Learning-Enhanced LLMs

Yuanhao Li, Mingshan Liu, Hongbo Wang, Yiding Zhang, Yifei Ma, Wei Tan

Main category: cs.AI

TL;DR: DRAFT-RL integrates Chain-of-Draft reasoning with multi-agent RL training, enabling agents to generate multiple drafts per query for better exploration and selection.

Details

Motivation: Existing multi-agent reflection frameworks rely on single-shot responses and lack structural diversity in reasoning exploration, limiting their effectiveness.

Method: Each agent produces multiple drafts per query, which are evaluated by peer agents and a learned reward model to identify the best trajectory, then refined through actor-critic learning.

Result: DRAFT-RL outperforms existing reflective and RL-based agents in accuracy and convergence speed on code synthesis, symbolic math, and knowledge-intensive QA tasks.

Conclusion: The framework enables explicit multi-path exploration, peer-guided reflection, and reward-aligned selection, resulting in more robust and interpretable LLM agent behavior.

Abstract: Large Language Models (LLMs) have shown impressive capabilities in multi-step reasoning and problem-solving. Recent works introduce multi-agent reflection frameworks where multiple LLM agents critique and refine each other’s outputs using reinforcement learning (RL). However, these approaches often rely on single-shot responses and lack structural diversity in reasoning exploration. In this paper, we propose DRAFT-RL, a novel framework that integrates Chain-of-Draft (CoD) reasoning into multi-agent RL training. Instead of generating single responses, each agent produces multiple drafts per query, which are then evaluated by peer agents and a learned reward model to identify the most promising trajectory. These selected drafts are used to refine future reasoning strategies through actor-critic learning. DRAFT-RL enables explicit multi-path exploration, peer-guided reflection, and reward-aligned selection, resulting in more robust and interpretable LLM agent behavior. We evaluate our method on complex reasoning tasks including code synthesis, symbolic math, and knowledge-intensive QA, demonstrating that DRAFT-RL outperforms existing reflective and RL-based agents by significant margins in both accuracy and convergence speed.
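
The draft-selection step described in the abstract can be sketched roughly as follows. This is a toy illustration, not the authors' implementation: the LLM calls, peer critique, and learned reward model are replaced by deterministic stand-ins, the mixing weight `alpha` is an assumption, and the actor-critic update that follows selection is not shown.

```python
def select_best_draft(query, agents, k, draft_fn, peer_score, reward_score, alpha=0.5):
    """Return (score, agent, draft) for the highest-scoring draft across all agents."""
    best = None
    for agent in agents:
        for draft in draft_fn(agent, query, k):
            # Average the scores given by every other agent (the "peers").
            peers = [peer_score(p, draft) for p in agents if p != agent]
            peer_avg = sum(peers) / len(peers) if peers else 0.0
            score = alpha * peer_avg + (1.0 - alpha) * reward_score(query, draft)
            if best is None or score > best[0]:
                best = (score, agent, draft)
    return best


# Toy stand-ins for the LLM, the peer critics, and the reward model.
def toy_draft_fn(agent, query, k):
    return [f"{agent}: {query} " + "step " * (i + 1) for i in range(k)]


def toy_peer_score(peer, draft):
    # Peers prefer shorter drafts in this toy setup.
    return 1.0 / len(draft.split())


def toy_reward(query, draft):
    # Reward stand-in: favors drafts with exactly two "step" tokens.
    return 1.0 if draft.count("step") == 2 else 0.0


score, agent, draft = select_best_draft(
    "solve x", ["a1", "a2"], 3, toy_draft_fn, toy_peer_score, toy_reward
)
```

Ties are broken in favor of the first draft seen; a real system would more likely keep the top few drafts rather than a single winner.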

[384] Universe of Thoughts: Enabling Creative Reasoning with Large Language Models

Yuto Suzuki, Farnoush Banaei-Kashani

Main category: cs.AI

TL;DR: The paper introduces a computational framework for creative reasoning with LLMs, proposing three paradigms (combinational, exploratory, transformative) and implementing them through Universe of Thoughts (UoT) methods, showing superior performance in creative problem-solving tasks.

Details

Motivation: Existing LLM reasoning methods focus on conventional problem-solving but lack creative reasoning capabilities needed for domains with expansive solution spaces like drug discovery and business strategy, where innovative solutions are crucial.

Method: Proposes a computational framework for creative reasoning with three core paradigms, implements them through Universe of Thoughts (UoT) methods using LLMs, and evaluates on novel creative problem-solving tasks with metrics for feasibility, utility, and novelty.

Result: UoT demonstrates superior performance in creative reasoning compared to state-of-the-art reasoning techniques and commercial models with reasoning capabilities.

Conclusion: The proposed creative reasoning framework and UoT implementation effectively address the gap in LLM-based creative problem-solving, enabling systematic exploration of solution spaces to generate innovative solutions in domains requiring creativity.

Abstract: Reasoning based on Large Language Models (LLMs) has garnered increasing attention due to outstanding performance of these models in mathematical and complex logical tasks. Beginning with the Chain-of-Thought (CoT) prompting technique, numerous reasoning methods have emerged that decompose problems into smaller, sequential steps (or thoughts). However, existing reasoning models focus on conventional problem-solving and do not necessarily generate creative solutions by “creative reasoning”. In domains where the solution space is expansive and conventional solutions are suboptimal, such as drug discovery or business strategization, creative reasoning to discover innovative solutions is crucial. To address this gap, first we introduce a computational framework for creative reasoning inspired by established cognitive science principles. With this framework, we propose three core creative reasoning paradigms, namely, combinational, exploratory, and transformative reasoning, where each offers specific directions for systematic exploration of the universe of thoughts to generate creative solutions. Next, to materialize this framework using LLMs, we introduce the Universe of Thoughts (or UoT, for short), a novel set of methods to implement the aforementioned three creative processes. Finally, we introduce three novel tasks that necessitate creative problem-solving, along with an evaluation benchmark to assess creativity from three orthogonal perspectives: feasibility as constraint, and utility and novelty as metrics. With a comparative analysis against the state-of-the-art (SOTA) reasoning techniques as well as representative commercial models with reasoning capability, we show that UoT demonstrates superior performance in creative reasoning.

[385] Quantifying the Privacy Implications of High-Fidelity Synthetic Network Traffic

Van Tran, Shinan Liu, Tian Li, Nick Feamster

Main category: cs.AI

TL;DR: This paper introduces comprehensive privacy metrics for synthetic network traffic and evaluates privacy risks across different generative models, revealing significant vulnerabilities including up to 88% MIA success and 100% network identifier recovery.

Details

Motivation: Address the unexplored privacy risks in synthetic network traffic generation, as current generative models may leak sensitive information despite privacy concerns being a key motivation for their development.

Method: Developed a comprehensive set of privacy metrics combining standard approaches (membership inference attacks, data extraction attacks) with network-specific identifiers and attributes, then systematically evaluated different generative models using these metrics.

Result: Found substantial privacy risks: MIA success ranged from 0% to 88%, and up to 100% of network identifiers could be recovered from generated traffic. Identified key factors affecting attack outcomes including training data diversity and model fit to training data.

Conclusion: Provides actionable guidance for designing safer generative models that minimize privacy leakage, establishing a foundation for privacy-preserving synthetic network traffic generation.

Abstract: To address the scarcity and privacy concerns of network traffic data, various generative models have been developed to produce synthetic traffic. However, synthetic traffic is not inherently privacy-preserving, and the extent to which it leaks sensitive information, and how to measure such leakage, remain largely unexplored. This challenge is further compounded by the diversity of model architectures, which shape how traffic is represented and synthesized. We introduce a comprehensive set of privacy metrics for synthetic network traffic, combining standard approaches like membership inference attacks (MIA) and data extraction attacks with network-specific identifiers and attributes. Using these metrics, we systematically evaluate the vulnerability of different representative generative models and examine the factors that influence attack success. Our results reveal substantial variability in privacy risks across models and datasets. MIA success ranges from 0% to 88%, and up to 100% of network identifiers can be recovered from generated traffic, highlighting serious privacy vulnerabilities. We further identify key factors that significantly affect attack outcomes, including training data diversity and how well the generative model fits the training data. These findings provide actionable guidance for designing and deploying generative models that minimize privacy leakage, establishing a foundation for safer synthetic network traffic generation.
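
As a rough illustration of what a membership inference attack against synthetic data can look like, the sketch below guesses that a record was in the training set when it lies unusually close to some generated record. This distance-to-closest-record heuristic is a generic technique swapped in for illustration and is not claimed to be the paper's metric. The two-dimensional feature vectors, the threshold, and all function names are assumptions.

```python
import math


def distance_to_closest(record, synthetic):
    """Euclidean distance from a record to its nearest synthetic neighbor."""
    return min(math.dist(record, s) for s in synthetic)


def membership_guess(candidates, synthetic, threshold):
    """Guess 'member' when a candidate lies closer than threshold to a synthetic record."""
    return [distance_to_closest(c, synthetic) < threshold for c in candidates]


def attack_accuracy(guesses, is_member):
    """Fraction of candidates whose membership was guessed correctly."""
    correct = sum(g == m for g, m in zip(guesses, is_member))
    return correct / len(is_member)


# Toy feature vectors invented for illustration (not real traffic data).
synthetic_flows = [(1.0, 1.0), (5.0, 5.1)]
candidate_flows = [(1.0, 1.05), (5.0, 5.0), (9.0, 9.0), (3.0, 0.0)]
true_membership = [True, True, False, False]

guesses = membership_guess(candidate_flows, synthetic_flows, threshold=0.5)
accuracy = attack_accuracy(guesses, true_membership)
```

A generator that fits its training data too closely tends to emit near-copies, which is one plausible reading of why the abstract lists model fit as a factor in attack success.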

[386] FRAGMENTA: End-to-end Fragmentation-based Generative Model with Agentic Tuning for Drug Lead Optimization

Yuto Suzuki, Paul Awolade, Daniel V. LaBarbera, Farnoush Banaei-Kashani

Main category: cs.AI

TL;DR: FRAGMENTA is an AI framework for drug discovery that uses dynamic Q-learning for molecule fragmentation/generation and agentic AI for automated tuning via expert feedback, outperforming traditional methods in cancer drug experiments.

Details

Motivation: Current molecule generation faces challenges with small datasets (<100 examples), limited fragment diversity from heuristic approaches, and slow human-AI collaboration cycles in drug discovery.

Method: 1) Generative model reframing fragmentation as vocabulary selection using dynamic Q-learning to jointly optimize fragmentation and generation; 2) Agentic AI system that learns from conversational expert feedback to automate tuning.

Result: In cancer drug discovery experiments, FRAGMENTA’s Human-Agent configuration identified nearly twice as many high-scoring molecules as baselines, and the autonomous Agent-Agent system outperformed traditional Human-Human tuning.

Conclusion: FRAGMENTA demonstrates effective agentic tuning for capturing expert intent and automating drug lead optimization, showing promise for accelerating drug discovery workflows.

Abstract: Molecule generation using generative AI is vital for drug discovery, yet class-specific datasets often contain fewer than 100 training examples. While fragment-based models handle limited data better than atom-based approaches, existing heuristic fragmentation limits diversity and misses key fragments. Additionally, model tuning typically requires slow, indirect collaboration between medicinal chemists and AI engineers. We introduce FRAGMENTA, an end-to-end framework for drug lead optimization comprising: 1) a novel generative model that reframes fragmentation as a “vocabulary selection” problem, using dynamic Q-learning to jointly optimize fragmentation and generation; and 2) an agentic AI system that refines objectives via conversational feedback from domain experts. This system removes the AI engineer from the loop and progressively learns domain knowledge to eventually automate tuning. In real-world cancer drug discovery experiments, FRAGMENTA’s Human-Agent configuration identified nearly twice as many high-scoring molecules as baselines. Furthermore, the fully autonomous Agent-Agent system outperformed traditional Human-Human tuning, demonstrating the efficacy of agentic tuning in capturing expert intent.

[387] Assessing LLMs’ Performance: Insights from the Chinese Pharmacist Exam

Xinran Wang, Boran Zhu, Shujuan Zhou, Ziwen Long, Dehua Zhou, Shu Zhang

Main category: cs.AI

TL;DR: DeepSeek-R1 significantly outperformed ChatGPT-4o (90.0% vs 76.1% accuracy) on Chinese Pharmacist Licensing Exam questions, suggesting domain-specific LLMs show promise for certification tasks but require human oversight.

Details

Motivation: To evaluate LLM capabilities in high-stakes domain-specific certification tasks using China's national pharmacist licensure exam as a benchmark, and explore AI's potential for formative evaluation in healthcare education.

Method: Used 2,306 multiple-choice questions from Chinese Pharmacist Licensing Exams (2017-2021), excluded visual content, tested ChatGPT-4o and DeepSeek-R1 in original Chinese format, evaluated exact accuracy using Pearson’s Chi-squared and Fisher’s exact tests.

Result: DeepSeek-R1 achieved significantly higher overall accuracy (90.0%) than ChatGPT-4o (76.1%), with consistent advantages across foundational and clinical synthesis modules, though year-wise differences weren’t statistically significant.

Conclusion: Domain-specific models like DeepSeek-R1 show strong alignment with professional certification demands, warranting further investigation while emphasizing the continued need for human oversight in legally sensitive healthcare contexts.

Abstract: Background: As large language models (LLMs) become increasingly integrated into digital health education and assessment workflows, their capabilities in supporting high-stakes, domain-specific certification tasks remain underexplored. In China, the national pharmacist licensure exam serves as a standardized benchmark for evaluating pharmacists’ clinical and theoretical competencies. Objective: This study aimed to compare the performance of two LLMs: ChatGPT-4o and DeepSeek-R1 on real questions from the Chinese Pharmacist Licensing Examination (2017-2021), and to discuss the implications of these performance differences for AI-enabled formative evaluation. Methods: A total of 2,306 multiple-choice (text-only) questions were compiled from official exams, training materials, and public databases. Questions containing tables or images were excluded. Each item was input in its original Chinese format, and model responses were evaluated for exact accuracy. Pearson’s Chi-squared test was used to compare overall performance, and Fisher’s exact test was applied to year-wise multiple-choice accuracy. Results: DeepSeek-R1 outperformed ChatGPT-4o with a significantly higher overall accuracy (90.0% vs. 76.1%, p < 0.001). Unit-level analyses revealed consistent advantages for DeepSeek-R1, particularly in foundational and clinical synthesis modules. While year-by-year multiple-choice performance also favored DeepSeek-R1, this performance gap did not reach statistical significance in any specific unit-year (all p > 0.05). Conclusion: DeepSeek-R1 demonstrated robust alignment with the structural and semantic demands of the pharmacist licensure exam. These findings suggest that domain-specific models warrant further investigation for this context, while also reinforcing the necessity of human oversight in legally and ethically sensitive contexts.
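
The overall comparison can be checked approximately with a Pearson chi-squared test on a 2x2 table. The sketch below is a plausibility check under assumptions, not a reproduction of the authors' analysis: the correct-answer counts are reconstructed by rounding the reported percentages against 2,306 questions, no continuity correction is applied, and the two models are treated as independent samples even though they answered the same items.

```python
import math


def chi2_2x2(a, b, c, d):
    """Pearson chi-squared statistic (no continuity correction) for [[a, b], [c, d]]."""
    n = a + b + c + d
    numerator = n * (a * d - b * c) ** 2
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    return numerator / denominator


def p_value_df1(chi2):
    """Upper-tail probability of a chi-squared variable with 1 degree of freedom."""
    return math.erfc(math.sqrt(chi2 / 2.0))


total = 2306
# Counts reconstructed from the reported accuracies; the paper's exact counts may differ.
deepseek_correct = round(0.900 * total)
chatgpt_correct = round(0.761 * total)

chi2 = chi2_2x2(
    deepseek_correct, total - deepseek_correct,
    chatgpt_correct, total - chatgpt_correct,
)
p_value = p_value_df1(chi2)
```

With these reconstructed counts the statistic is far above 10.83, the critical value for p = 0.001 at one degree of freedom, which agrees with the reported p < 0.001.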

[388] Beyond Generation: Multi-Hop Reasoning for Factual Accuracy in Vision-Language Models

Shamima Hossain

Main category: cs.AI

TL;DR: A framework for knowledge-guided reasoning in VLMs that improves factual accuracy by 31% using structured knowledge graphs for multi-hop verification.

Details

Motivation: VLMs often produce factually inaccurate outputs due to lack of robust reasoning capabilities, and existing knowledge integration methods from LLMs remain underexplored for multimodal contexts.

Method: Leverages structured knowledge graphs for multi-hop verification using image-captioning, with systematic reasoning steps including visual entity recognition, knowledge graph traversal, and fact-based caption refinement.

Result: Improves factual accuracy by approximately 31% on curated datasets (Google Landmarks v2, Conceptual captions, Coco captions) and reveals insights into reasoning patterns and failure modes.

Conclusion: Demonstrates the potential of integrating external knowledge for advancing reasoning in VLMs, paving the way for more reliable multimodal systems.

Abstract: Visual Language Models (VLMs) are powerful generative tools but often produce factually inaccurate outputs due to a lack of robust reasoning capabilities. While extensive research has been conducted on integrating external knowledge for reasoning in large language models (LLMs), such efforts remain underexplored in VLMs, where the challenge is compounded by the need to bridge multiple modalities seamlessly. This work introduces a framework for knowledge-guided reasoning in VLMs, leveraging structured knowledge graphs for multi-hop verification, using an image-captioning task to illustrate our framework. Our approach enables systematic reasoning across multiple steps, including visual entity recognition, knowledge graph traversal, and fact-based caption refinement. We evaluate the framework using hierarchical, triple-based and bullet-point based knowledge representations, analyzing their effectiveness in factual accuracy and logical inference. Empirical results show that our approach improves factual accuracy by approximately 31% in preliminary experiments on a curated dataset of mixtures from Google Landmarks v2, Conceptual captions and Coco captions, revealing key insights into reasoning patterns and failure modes. This work demonstrates the potential of integrating external knowledge for advancing reasoning in VLMs, paving the way for more reliable and knowledgeable multimodal systems.
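
The knowledge graph traversal step can be pictured with a small sketch: starting from an entity recognized in the image, follow triples for a bounded number of hops and check whether a value claimed in the caption is reachable. This is a simplified illustration, assuming a triple-based representation and a breadth-first search; the function names and the three-triple toy graph are hypothetical and do not come from the paper.

```python
from collections import deque


def multi_hop_paths(triples, start, max_hops):
    """Breadth-first traversal returning every path of up to max_hops edges from start."""
    adjacency = {}
    for head, relation, tail in triples:
        adjacency.setdefault(head, []).append((relation, tail))
    paths = []
    queue = deque([(start, [])])
    while queue:
        node, path = queue.popleft()
        if len(path) == max_hops:
            continue  # hop budget exhausted; this also bounds traversal on cyclic graphs
        for relation, tail in adjacency.get(node, []):
            new_path = path + [(node, relation, tail)]
            paths.append(new_path)
            queue.append((tail, new_path))
    return paths


def claim_supported(triples, entity, claimed_value, max_hops=2):
    """True if claimed_value is reachable from entity within max_hops edges."""
    return any(
        path[-1][2] == claimed_value
        for path in multi_hop_paths(triples, entity, max_hops)
    )


# Toy knowledge graph for illustration.
toy_kg = [
    ("Eiffel Tower", "located_in", "Paris"),
    ("Paris", "capital_of", "France"),
    ("France", "part_of", "Europe"),
]
supported = claim_supported(toy_kg, "Eiffel Tower", "France", max_hops=2)
```

A caption refinement step could then rewrite or drop any claim for which no supporting path is found within the hop budget.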

[389] PaTAS: A Parallel System for Trust Propagation in Neural Networks Using Subjective Logic

Koffi Ismael Ouattara, Ioannis Krontiris, Theo Dimitrakos, Dennis Eisermann, Frank Kargl

Main category: cs.AI

TL;DR: PaTAS is a framework that models and propagates trust in neural networks using Subjective Logic, operating parallel to standard computation through Trust Nodes and Trust Functions to provide interpretable trust estimates.

Details

Motivation: Conventional metrics like accuracy fail to capture uncertainty and reliability under adversarial or degraded conditions, making trustworthiness crucial for AI deployment in safety-critical applications.

Method: Uses Subjective Logic with Trust Nodes and Trust Functions to propagate input, parameter, and activation trust across networks. Includes Parameter Trust Update during training and Inference-Path Trust Assessment for instance-specific trust at inference.

Result: Experiments show PaTAS produces interpretable, symmetric, convergent trust estimates that complement accuracy, expose reliability gaps in poisoned/biased data, and effectively distinguish benign from adversarial inputs.

Conclusion: PaTAS provides a principled foundation for transparent, quantifiable trust reasoning within neural architectures, enabling reliable model evaluation across the AI lifecycle.

Abstract: Trustworthiness has become a key requirement for the deployment of artificial intelligence systems in safety-critical applications. Conventional evaluation metrics such as accuracy and precision fail to capture uncertainty or the reliability of model predictions, particularly under adversarial or degraded conditions. This paper introduces the Parallel Trust Assessment System (PaTAS), a framework for modeling and propagating trust in neural networks using Subjective Logic (SL). PaTAS operates in parallel with standard neural computation through Trust Nodes and Trust Functions that propagate input, parameter, and activation trust across the network. The framework defines a Parameter Trust Update mechanism to refine parameter reliability during training and an Inference-Path Trust Assessment (IPTA) method to compute instance-specific trust at inference. Experiments on real-world and adversarial datasets demonstrate that PaTAS produces interpretable, symmetric, and convergent trust estimates that complement accuracy and expose reliability gaps in poisoned, biased, or uncertain data scenarios. The results show that PaTAS effectively distinguishes between benign and adversarial inputs and identifies cases where model confidence diverges from actual reliability. By enabling transparent and quantifiable trust reasoning within neural architectures, PaTAS provides a principled foundation for evaluating model reliability across the AI lifecycle.
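
For readers unfamiliar with Subjective Logic, the sketch below shows its basic object, a binomial opinion made of belief, disbelief, uncertainty, and a base rate, together with the projected probability and the standard cumulative fusion operator. It illustrates the underlying formalism only: the class and function names are hypothetical, equal base rates are assumed, and nothing here is claimed to match the Trust Functions defined in PaTAS.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Opinion:
    belief: float
    disbelief: float
    uncertainty: float
    base_rate: float = 0.5

    def __post_init__(self):
        total = self.belief + self.disbelief + self.uncertainty
        if abs(total - 1.0) > 1e-9:
            raise ValueError("belief + disbelief + uncertainty must equal 1")

    def projected_probability(self):
        # Uncertainty mass is split according to the base rate.
        return self.belief + self.base_rate * self.uncertainty


def cumulative_fusion(x, y):
    """Cumulative fusion of two opinions, assuming equal base rates."""
    denom = x.uncertainty + y.uncertainty - x.uncertainty * y.uncertainty
    if denom == 0:
        raise ValueError("undefined when both opinions have zero uncertainty")
    return Opinion(
        belief=(x.belief * y.uncertainty + y.belief * x.uncertainty) / denom,
        disbelief=(x.disbelief * y.uncertainty + y.disbelief * x.uncertainty) / denom,
        uncertainty=(x.uncertainty * y.uncertainty) / denom,
        base_rate=x.base_rate,
    )


# Two illustrative opinions about the same proposition.
first = Opinion(belief=0.6, disbelief=0.1, uncertainty=0.3)
second = Opinion(belief=0.2, disbelief=0.2, uncertainty=0.6)
fused = cumulative_fusion(first, second)
```

Fusing two uncertain opinions yields less uncertainty than either input, which is the property that makes accumulating evidence across a network meaningful.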

[390] Building a Foundation Model for Trajectory from Scratch

Gaspard Merten, Mahmoud Sakr, Gilles Dejaegere

Main category: cs.AI

TL;DR: A tutorial demonstrating how to build trajectory foundation models starting from GPT-2, with code-driven implementation and comparison of existing models like TrajFM and TrajGPT.

Details

Motivation: To bridge the gap in building foundation models for mobility trajectories, which currently lack clear documentation and implementation guidance, especially for the SIGSPATIAL community.

Method: Step-by-step code-driven adaptation of GPT-2 for spatiotemporal data, comparison of existing trajectory foundation models (TrajFM, TrajGPT), and incorporation of complementary techniques like TimesFM’s patching approach.

Result: Provides a minimal implementation framework for trajectory foundation models, enabling researchers to understand and build mobility AI systems from GPT-2 foundations.

Conclusion: This educational material is timely and essential for supporting the SIGSPATIAL community in developing and evaluating mobility foundation models, improving research clarity and peer-review effectiveness.

Abstract: Foundation models are transformative in artificial intelligence, but building them from scratch, especially for mobility trajectories, is not yet clear or documented. This tutorial bridges this gap by demonstrating the steps and code of a minimal implementation of a trajectory-focused foundation model starting from GPT-2. Through a concise, step-by-step, code-driven process, we demonstrate adapting GPT-2 for spatiotemporal data. We then review and compare representative trajectory foundation models, such as TrajFM and TrajGPT, highlighting their architectural innovations and differences. Additionally, we introduce complementary techniques from related domains, like TimesFM’s patching approach. Targeted at researchers and practitioners, this tutorial aims to explain the concepts and terminology of foundation models, at the implementation level. We find it timely and indispensable to create this educational material in order to support the SIGSPATIAL community in building and evaluating mobility foundation models, enhancing both research clarity and peer-review effectiveness in mobility AI.
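
The patching idea mentioned in the abstract, grouping consecutive time steps into fixed-length chunks that each become one input token, can be sketched for trajectories as below. This is a minimal illustration under assumptions: points are (longitude, latitude, timestamp) tuples, patches do not overlap, and the final patch is padded by repeating the last point. The tutorial's actual code and TimesFM's exact scheme are not reproduced, and the coordinates are made up.

```python
def patchify(points, patch_len):
    """Split a trajectory into non-overlapping, flattened patches of patch_len points."""
    if patch_len <= 0:
        raise ValueError("patch_len must be positive")
    if not points:
        return []
    padded = list(points)
    remainder = len(padded) % patch_len
    if remainder:
        # Pad the tail by repeating the last observed point (an assumption of this sketch).
        padded.extend([padded[-1]] * (patch_len - remainder))
    return [
        [value for point in padded[i:i + patch_len] for value in point]
        for i in range(0, len(padded), patch_len)
    ]


# Hypothetical trajectory: (longitude, latitude, seconds since start).
trajectory = [
    (4.35, 50.85, 0.0),
    (4.36, 50.85, 10.0),
    (4.37, 50.86, 20.0),
    (4.38, 50.86, 30.0),
    (4.39, 50.87, 40.0),
]
patches = patchify(trajectory, patch_len=2)
```

Each flattened patch would then be mapped to an embedding by a linear layer before entering a GPT-2 style transformer, shortening the sequence by a factor of `patch_len`.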

[391] Copyright Detection in Large Language Models: An Ethical Approach to Generative AI Development

David Szczecina, Senan Gaffori, Edmond Li

Main category: cs.AI

TL;DR: An open-source copyright detection platform that helps content creators verify if their work was used in LLM training datasets, with improved efficiency and user accessibility.

Details

Motivation: Addressing concerns about unauthorized inclusion of copyrighted content in LLM training data and the limitations of existing detection frameworks that are computationally intensive and inaccessible to independent creators.

Method: Enhances existing methodologies by improving similarity detection, optimizing dataset validation, and reducing computational overhead by 10-30% through efficient API calls, with an intuitive user interface and scalable backend.

Result: Developed a scalable, transparent, and user-friendly solution that reduces computational overhead while maintaining effective copyright detection capabilities.

Conclusion: The framework contributes to increasing transparency in AI development and ethical compliance, providing a foundation for further research in responsible AI development and copyright enforcement.

Abstract: The widespread use of Large Language Models (LLMs) raises critical concerns regarding the unauthorized inclusion of copyrighted content in training data. Existing detection frameworks, such as DE-COP, are computationally intensive and largely inaccessible to independent creators. As legal scrutiny increases, there is a pressing need for a scalable, transparent, and user-friendly solution. This paper introduces an open-source copyright detection platform that enables content creators to verify whether their work was used in LLM training datasets. Our approach enhances existing methodologies by facilitating ease of use, improving similarity detection, optimizing dataset validation, and reducing computational overhead by 10-30% with efficient API calls. With an intuitive user interface and scalable backend, this framework contributes to increasing transparency in AI development and ethical compliance, laying the foundation for further research in responsible AI development and copyright enforcement.

[392] Fighting AI with AI: Leveraging Foundation Models for Assuring AI-Enabled Safety-Critical Systems

Anastasia Mavridou, Divya Gopinath, Corina S. Păsăreanu

Main category: cs.AI

TL;DR: The paper proposes REACT and SemaLens - AI-powered tools that bridge the gap between natural language requirements and formal specifications for safety-critical AI systems, enabling verification from requirements to implementation.

Details

Motivation: Address challenges in assuring AI components in safety-critical systems due to AI opacity, semantic gaps between requirements and implementations, and traditional RE limitations like natural language ambiguity.

Method: Two complementary AI-powered components: REACT uses LLMs to translate natural language requirements into formal specifications for early verification; SemaLens uses VLMs to test and monitor DNN perception systems with human-understandable concepts.

Result: A comprehensive pipeline that connects informal requirements to validated AI implementations, addressing verification challenges through AI-assisted formalization and monitoring.

Conclusion: Leveraging AI itself (LLMs and VLMs) provides an effective approach to overcome assurance challenges in safety-critical AI systems by bridging semantic gaps and enabling systematic verification from requirements to implementation.

Abstract: The integration of AI components, particularly Deep Neural Networks (DNNs), into safety-critical systems such as aerospace and autonomous vehicles presents fundamental challenges for assurance. The opacity of AI systems, combined with the semantic gap between high-level requirements and low-level network representations, creates barriers to traditional verification approaches. These AI-specific challenges are amplified by longstanding issues in Requirements Engineering, including ambiguity in natural language specifications and scalability bottlenecks in formalization. We propose an approach that leverages AI itself to address these challenges through two complementary components. REACT (Requirements Engineering with AI for Consistency and Testing) employs Large Language Models (LLMs) to bridge the gap between informal natural language requirements and formal specifications, enabling early verification and validation. SemaLens (Semantic Analysis of Visual Perception using large Multi-modal models) utilizes Vision Language Models (VLMs) to reason about, test, and monitor DNN-based perception systems using human-understandable concepts. Together, these components provide a comprehensive pipeline from informal requirements to validated implementations.

Farhad Nooralahzadeh, Yi Zhang, Jonathan Furst, Kurt Stockinger

Main category: cs.AI

TL;DR: M²EX is a system that enables multi-modal data exploration using language agents to query both structured databases and unstructured data (texts, images) through natural language.

Details

Motivation: Address the research gap in querying both structured databases and unstructured modalities (texts, images) using natural language, which remains largely unexplored despite progress in separate fields.

Method: Uses an LLM-based agentic AI framework to decompose natural language questions into subtasks (text-to-SQL generation, image analysis) and orchestrates modality-specific experts in efficient query plans.

Result: Outperforms state-of-the-art multi-modal exploration systems in accuracy and performance metrics (query latency, API costs, planning efficiency) on datasets with relational data, text, and images.

Conclusion: M²EX effectively leverages LLM reasoning capabilities to enable comprehensive multi-modal data exploration through natural language queries, demonstrating superior performance over existing approaches.

Abstract: International enterprises, organizations, and hospitals collect large amounts of multi-modal data stored in databases, text documents, images, and videos. While there has been recent progress in the separate fields of multi-modal data exploration as well as in database systems that automatically translate natural language questions to database query languages, the research challenge of querying both structured databases and unstructured modalities (e.g., texts, images) in natural language remains largely unexplored. In this paper, we propose M$^2$EX, a system that enables multi-modal data exploration via language agents. Our approach is based on the following research contributions: (1) Our system is inspired by a real-world use case that enables users to explore multi-modal information systems. (2) M$^2$EX leverages an LLM-based agentic AI framework to decompose a natural language question into subtasks such as text-to-SQL generation and image analysis and to orchestrate modality-specific experts in an efficient query plan. (3) Experimental results on multi-modal datasets, encompassing relational data, text, and images, demonstrate that our system outperforms state-of-the-art multi-modal exploration systems, excelling in both accuracy and various performance metrics, including query latency, API costs, and planning efficiency, thanks to the more effective utilization of the reasoning capabilities of LLMs.

[394] CNS-Obsidian: A Neurosurgical Vision-Language Model Built From Scientific Publications

Anton Alyakin, Jaden Stryker, Daniel Alexander Alber, Jin Vivian Lee, Karl L. Sangwon, Brandon Duderstadt, Akshay Save, David Kurland, Spencer Frome, Shrutika Singh, Jeff Zhang, Eunice Yang, Ki Yun Park, Cordelia Orillac, Aly A. Valliani, Sean Neifert, Albert Liu, Aneek Patel, Christopher Livia, Darryl Lau, Ilya Laufer, Peter A. Rozman, Eveline Teresa Hidalgo, Howard Riina, Rui Feng, Todd Hollon, Yindalon Aphinyanaphongs, John G. Golfinos, Laura Snyder, Eric Leuthardt, Douglas Kondziolka, Eric Karl Oermann

Main category: cs.AI

TL;DR: CNS-Obsidian is a neurosurgical VLM trained on peer-reviewed literature that approaches GPT-4o’s performance in clinical settings despite being much smaller and cheaper to train.

Details

Motivation: General-purpose VLMs trained on uncurated internet data have limitations for high-stakes medical decision-making like neurosurgery, requiring domain-specific models built on curated scientific literature.

Method: Trained on 23,984 neurosurgery articles converted into 263,064 training samples using GPT-4o and Claude Sonnet-3.5. Fine-tuned LLaVA-Next model and conducted blinded randomized trial comparing with GPT-4o in neurosurgery consultations.

Result: CNS-Obsidian matched GPT-4o on synthetic questions (76.13% vs 77.54%) but performed worse on human-generated questions (46.81% vs 65.70%). In clinical trial, received 40.62% positive ratings vs GPT-4o’s 57.89%, with both including correct diagnosis in ~60% of cases.

Conclusion: Domain-specific VLMs trained on curated scientific literature can approach frontier model performance while being orders of magnitude smaller and cheaper, establishing a transparent framework for specialized AI models.

Abstract: General-purpose VLMs demonstrate impressive capabilities, but their opaque training on uncurated internet data poses critical limitations for high-stakes decision-making, such as in neurosurgery. We present CNS-Obsidian, a neurosurgical VLM trained on peer-reviewed literature, and demonstrate its clinical utility versus GPT-4o in a real-world setting. We compiled 23,984 articles from Neurosurgery Publications journals, yielding 78,853 figures and captions. Using GPT-4o and Claude Sonnet-3.5, we converted these into 263,064 training samples across three formats: instruction fine-tuning, multiple-choice questions, and differential diagnosis. We trained CNS-Obsidian, a fine-tune of the 34-billion parameter LLaVA-Next model. In a blinded, randomized trial at NYU Langone Health (Aug 30-Nov 30, 2024), neurosurgery consultations were assigned to either CNS-Obsidian or a HIPAA-compliant GPT-4o endpoint as diagnostic co-pilot after consultations. Primary outcomes were diagnostic helpfulness and accuracy, assessed via user ratings and presence of correct diagnosis within the VLM-provided differential. CNS-Obsidian matched GPT-4o on synthetic questions (76.13% vs 77.54%, p=0.235), but only achieved 46.81% accuracy on human-generated questions versus GPT-4o’s 65.70% (p<$10^{-15}$). In the randomized trial, 70 consultations were evaluated (32 CNS-Obsidian, 38 GPT-4o) from 959 total consults (7.3% utilization). CNS-Obsidian received positive ratings in 40.62% of cases versus 57.89% for GPT-4o (p=0.230). Both models included correct diagnosis in approximately 60% of cases (59.38% vs 65.79%, p=0.626). Domain-specific VLMs trained on curated scientific literature can approach frontier model performance despite being orders of magnitude smaller and less expensive to train. This establishes a transparent framework for scientific communities to build specialized AI models.
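
The trial percentages above are easier to interpret next to the arm sizes. The sketch below back-calculates the case counts implied by the reported percentages and the reported arm sizes (32 and 38). These counts are inferred arithmetic, not figures stated in the abstract, and the helper name `implied_count` is hypothetical.

```python
def implied_count(percent: float, n: int) -> int:
    """Integer count k for which k / n is closest to the reported percentage."""
    return round(percent / 100 * n)


# Arm sizes and percentages as reported in the abstract.
arm_sizes = {"CNS-Obsidian": 32, "GPT-4o": 38}
positive_pct = {"CNS-Obsidian": 40.62, "GPT-4o": 57.89}
correct_pct = {"CNS-Obsidian": 59.38, "GPT-4o": 65.79}

# Inferred counts (an assumption derived from rounding, not reported data).
positive = {m: implied_count(positive_pct[m], n) for m, n in arm_sizes.items()}
correct = {m: implied_count(correct_pct[m], n) for m, n in arm_sizes.items()}

# Share of all consults that were evaluated in the trial.
utilization_pct = 100 * sum(arm_sizes.values()) / 959
```

With arms this small, each single case shifts a percentage by roughly three points, which is consistent with the reported differences in ratings and diagnostic inclusion not reaching significance (p=0.230 and p=0.626).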

[395] MMTU: A Massive Multi-Task Table Understanding and Reasoning Benchmark

Junjie Xing, Yeye He, Mengyu Zhou, Haoyu Dong, Shi Han, Lingjiao Chen, Dongmei Zhang, Surajit Chaudhuri, H. V. Jagadish

Main category: cs.AI

TL;DR: MMTU is a large-scale benchmark with 28K+ questions across 25 real-world table tasks that evaluates models’ ability to understand, reason about, and manipulate tables at an expert level, showing that current frontier models still struggle significantly.

Details

Motivation: Existing table-related evaluations are limited and narrowly focused on tasks like NL-to-SQL and Table-QA, overlooking the broader spectrum of real-world tasks that professional users face, limiting understanding and model progress in this important area.

Method: Introduced MMTU benchmark with over 28K questions across 25 real-world table tasks drawn from decades of computer science research on tabular data, focusing on complex tasks faced by professional users requiring table understanding, reasoning, and coding skills.

Result: Current frontier models struggle significantly, with OpenAI GPT-5 scoring around 69% and DeepSeek R1 scoring around 57%, indicating substantial room for improvement in table-related capabilities.

Conclusion: MMTU reveals that comprehensive table understanding, reasoning, and manipulation remain challenging for today’s models, and the benchmark is intended to drive further advances in foundation models for structured data processing and analysis.

Abstract: Tables and table-based use cases play a crucial role in many important real-world applications, such as spreadsheets, databases, and computational notebooks, which traditionally require expert-level users like data engineers, data analysts, and database administrators to operate. Although LLMs have shown remarkable progress in working with tables (e.g., in spreadsheet and database copilot scenarios), comprehensive benchmarking of such capabilities remains limited. In contrast to an extensive and growing list of NLP benchmarks, evaluations of table-related tasks are scarce and narrowly focus on tasks like NL-to-SQL and Table-QA, overlooking the broader spectrum of real-world tasks that professional users face. This gap limits our understanding and model progress in this important area. In this work, we introduce MMTU, a large-scale benchmark with over 28K questions across 25 real-world table tasks, designed to comprehensively evaluate models’ ability to understand, reason about, and manipulate real tables at the expert level. These tasks are drawn from decades’ worth of computer science research on tabular data, with a focus on complex table tasks faced by professional users. We show that MMTU requires a combination of skills (including table understanding, reasoning, and coding) that remain challenging for today’s frontier models, where even frontier reasoning models like OpenAI GPT-5 and DeepSeek R1 score only around 69% and 57% respectively, suggesting significant room for improvement. We highlight key findings in our evaluation using MMTU and hope that this benchmark drives further advances in understanding and developing foundation models for structured data processing and analysis. Our code and data are available at https://github.com/MMTU-Benchmark/MMTU and https://huggingface.co/datasets/MMTU-benchmark/MMTU.

[396] Bridging Symbolic Control and Neural Reasoning in LLM Agents: The Structured Cognitive Loop

Myung Ho Kim

Main category: cs.AI

TL;DR: SCL is a modular architecture that separates agent cognition into five phases (R-CCAM) using Soft Symbolic Control to combine neural flexibility with symbolic explainability, achieving zero policy violations and complete traceability.

Details

Motivation: Address fundamental problems in LLM agents: entangled reasoning/execution, memory volatility, and uncontrolled action sequences.

Method: Structured Cognitive Loop (SCL) with five modular phases: Retrieval, Cognition, Control, Action, Memory. Core innovation is Soft Symbolic Control that applies symbolic constraints to probabilistic inference.

Result: Achieves zero policy violations, eliminates redundant tool calls, maintains complete decision traceability on multi-step conditional reasoning tasks. Outperforms frameworks like ReAct and AutoGPT.

Conclusion: SCL offers a practical path toward reliable, explainable, and governable AI agents by connecting expert system principles with modern LLM capabilities through modular decomposition, adaptive symbolic governance, and transparent state management.

Abstract: Large language model agents suffer from fundamental architectural problems: entangled reasoning and execution, memory volatility, and uncontrolled action sequences. We introduce Structured Cognitive Loop (SCL), a modular architecture that explicitly separates agent cognition into five phases: Retrieval, Cognition, Control, Action, and Memory (R-CCAM). At the core of SCL is Soft Symbolic Control, an adaptive governance mechanism that applies symbolic constraints to probabilistic inference, preserving neural flexibility while restoring the explainability and controllability of classical symbolic systems. Through empirical validation on multi-step conditional reasoning tasks, we demonstrate that SCL achieves zero policy violations, eliminates redundant tool calls, and maintains complete decision traceability. These results address critical gaps in existing frameworks such as ReAct, AutoGPT, and memory-augmented approaches. Our contributions are threefold: (1) we situate SCL within the taxonomy of hybrid intelligence, differentiating it from prompt-centric and memory-only approaches; (2) we formally define Soft Symbolic Control and contrast it with neuro-symbolic AI; and (3) we derive three design principles for trustworthy agents: modular decomposition, adaptive symbolic governance, and transparent state management. We provide a complete open-source implementation demonstrating the R-CCAM loop architecture, alongside a live GPT-4o-powered travel planning agent. By connecting expert system principles with modern LLM capabilities, this work offers a practical and theoretically grounded path toward reliable, explainable, and governable AI agents.
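
The five-phase R-CCAM loop described above can be sketched as follows. This is an illustrative toy, not the authors' open-source implementation: the LLM in the Cognition phase is replaced by a scripted `propose` function, and the Control phase is reduced to two hypothetical symbolic rules (an allow-list of tools and a ban on repeating an already approved call). All names are assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Set, Tuple

Action = Tuple[str, str]  # (tool_name, argument); hypothetical action format


@dataclass
class Memory:
    log: List[dict] = field(default_factory=list)

    def retrieve(self) -> List[dict]:
        return list(self.log)

    def store(self, record: dict) -> None:
        self.log.append(record)


def control(action: Action, history: List[dict], allowed: Set[str]) -> Tuple[bool, str]:
    """Symbolic check applied to the proposal before anything is executed."""
    if action[0] not in allowed:
        return False, "tool not allowed"
    if any(r["action"] == action and r["approved"] for r in history):
        return False, "redundant call"
    return True, "ok"


def run_loop(propose: Callable[[List[dict]], Optional[Action]],
             tools: Dict[str, Callable[[str], str]],
             allowed: Set[str], memory: Memory, max_steps: int = 10) -> List[dict]:
    for _ in range(max_steps):
        history = memory.retrieve()                            # Retrieval
        action = propose(history)                              # Cognition (LLM stand-in)
        if action is None:
            break
        approved, reason = control(action, history, allowed)   # Control
        result = tools[action[0]](action[1]) if approved else None  # Action
        memory.store({"action": action, "approved": approved,  # Memory
                      "reason": reason, "result": result})
    return memory.log


# Scripted proposals: a valid call, a repeat of it, then a disallowed tool.
scripted = [("weather", "Paris"), ("weather", "Paris"), ("delete", "x")]
log = run_loop(
    propose=lambda h: scripted[len(h)] if len(h) < len(scripted) else None,
    tools={"weather": lambda city: f"sunny in {city}"},
    allowed={"weather"},
    memory=Memory(),
)
```

Because every proposal is logged together with the control decision and its reason, the memory doubles as a decision trace, which is the kind of traceability the abstract emphasizes.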

[397] Quantifying Behavioral Dissimilarity Between Mathematical Expressions

Sebastian Mežnar, Sašo Džeroski, Ljupčo Todorovski

Main category: cs.AI

TL;DR: BED is a framework that measures behavioral distance between mathematical expressions using input-output distributions and Wasserstein distance, providing a more meaningful similarity measure than syntax-based approaches.

Details

Motivation: Existing similarity measures for mathematical expressions rely on syntactic structure, but syntactically different expressions can behave similarly while structurally similar ones may behave differently, especially with free parameters.

Method: Represent expressions as joint probability distributions over input-output pairs and apply Wasserstein distance to measure behavioral dissimilarity, with a computationally efficient stochastic approximation.

Result: The method is consistent, robust, and induces a smoother, more meaningful structure over expression space than syntax-based measures.

Conclusion: BED provides a foundation for behavior-based comparison, clustering, and learning of mathematical expressions, with applications in equation discovery, symbolic regression, and neuro-symbolic modeling.

Abstract: Quantifying the similarity between mathematical expressions is a fundamental problem in computational mathematics, symbolic reasoning, and scientific discovery. While behavioral notions of similarity have previously been explored in the context of software and program analysis, existing measures for mathematical expressions rely primarily on syntactic form, assessing similarity through symbolic structure rather than actual behavior. Yet syntactically distinct expressions can exhibit nearly identical outputs, while structurally similar ones may behave very differently, especially when the expressions contain free parameters that define families of functions. To address these limitations, we introduce Behavior-aware Expression Dissimilarity (BED), a principled framework for quantifying behavioral distance between mathematical expressions with free parameters. BED represents expressions as joint probability distributions over their input-output pairs and applies the Wasserstein distance to measure behavioral dissimilarity. A computationally efficient stochastic approximation is proposed and shown to be consistent, robust, and capable of inducing a smoother, more meaningful structure over the space of expressions than syntax-based measures. The approach provides a foundation for behavior-based comparison, clustering, and learning of mathematical expressions, with potential direct applications in equation discovery, symbolic regression, and neuro-symbolic modeling.
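
To make the idea of comparing expressions by behavior rather than syntax concrete, here is a rough sketch. It is not the paper's stochastic approximation: it swaps in a sliced Wasserstein estimate (random 1-D projections of the sampled input-output pairs), ignores free parameters by treating each expression as a fixed function, and assumes inputs drawn uniformly from [-1, 1]. The function names and defaults are hypothetical.

```python
import math
import random
from typing import Callable, List


def wasserstein_1d(a: List[float], b: List[float]) -> float:
    """W1 distance between two equal-size empirical samples (sort and match)."""
    assert len(a) == len(b)
    return sum(abs(x - y) for x, y in zip(sorted(a), sorted(b))) / len(a)


def behavioral_distance(f: Callable[[float], float], g: Callable[[float], float],
                        n_samples: int = 500, n_proj: int = 32, seed: int = 0) -> float:
    """Sliced-Wasserstein estimate between the (x, f(x)) and (x, g(x)) point clouds."""
    rng = random.Random(seed)
    xs = [rng.uniform(-1.0, 1.0) for _ in range(n_samples)]
    pts_f = [(x, f(x)) for x in xs]
    pts_g = [(x, g(x)) for x in xs]
    total = 0.0
    for _ in range(n_proj):
        # Project both 2-D clouds onto a random direction, then compare in 1-D.
        theta = rng.uniform(0.0, math.pi)
        c, s = math.cos(theta), math.sin(theta)
        total += wasserstein_1d([c * x + s * y for x, y in pts_f],
                                [c * x + s * y for x, y in pts_g])
    return total / n_proj


# Syntactically different but behaviorally identical expressions.
same = behavioral_distance(lambda x: x + x, lambda x: 2 * x)
# Same structure, small versus large offset.
near = behavioral_distance(lambda x: x ** 2, lambda x: x ** 2 + 0.1)
far = behavioral_distance(lambda x: x ** 2, lambda x: x ** 2 + 1.0)
```

In this toy setup `x + x` and `2 * x` are at distance zero despite their different syntax, while the two offset variants of `x ** 2` are ordered by how far their outputs actually differ.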

Xin Wang, Yuwei Zhou, Bin Huang, Hong Chen, Wenwu Zhu

Main category: cs.AI

TL;DR: This paper provides a comprehensive survey of multi-modal generative AI, covering multi-modal LLMs for understanding, diffusion models for generation, and emerging unified approaches that combine both capabilities.

Details

Motivation: The increasing attention on multi-modal generative AI and the emergence of two dominant techniques (multi-modal LLMs for understanding and diffusion models for generation) motivates the need for a comprehensive overview and exploration of unified models.

Method: The paper reviews multi-modal LLMs and diffusion models separately, then explores unified approaches using autoregressive-based and diffusion-based modeling with dense and Mixture-of-Experts architectures, analyzing various unification strategies.

Result: The survey provides detailed analysis of probabilistic modeling procedures, architecture designs, applications, and identifies key unification strategies for multi-modal understanding and generation.

Conclusion: The paper establishes foundations for unified multi-modal models, summarizes common datasets, and presents challenging future research directions to advance multi-modal generative AI.

Abstract: Multi-modal generative AI (Artificial Intelligence) has attracted increasing attention from both academia and industry. Particularly, two dominant families of techniques have emerged: i) Multi-modal large language models (LLMs) demonstrate impressive ability for multi-modal understanding; and ii) Diffusion models exhibit remarkable multi-modal powers in terms of multi-modal generation. Therefore, this paper provides a comprehensive overview of multi-modal generative AI, including multi-modal LLMs, diffusions, and the unification for understanding and generation. To lay a solid foundation for unified models, we first provide a detailed review of both multi-modal LLMs and diffusion models respectively, including their probabilistic modeling procedure, multi-modal architecture design, and advanced applications to image/video LLMs as well as text-to-image/video generation. Furthermore, we explore the emerging efforts toward unified models for understanding and generation. To achieve the unification of understanding and generation, we investigate key designs including autoregressive-based and diffusion-based modeling, as well as dense and Mixture-of-Experts (MoE) architectures. We then introduce several strategies for unified models, analyzing their potential advantages and disadvantages. In addition, we summarize the common datasets widely used for multi-modal generative AI pretraining. Last but not least, we present several challenging future research directions which may contribute to the ongoing advancement of multi-modal generative AI.

[399] Scalable and Accurate Graph Reasoning with LLM-based Multi-Agents

Yuwei Hu, Runlin Lei, Xinyi Huang, Zhewei Wei, Yongchao Liu

Main category: cs.AI

TL;DR: GraphAgent-Reasoner is a fine-tuning-free multi-agent framework that decomposes graph reasoning tasks into node-centric subtasks, enabling scalable and accurate reasoning on large graphs with over 1,000 nodes.

Details

Motivation: Current LLM approaches struggle with graph reasoning due to graph structure complexity and LLM limitations in handling long text, resulting in poor accuracy even on small graphs and simple tasks.

Method: Uses multi-agent collaboration inspired by distributed graph computation theory, decomposing graph problems into smaller node-centric tasks distributed among multiple agents to reduce individual LLM workload.

Result: Achieves near-perfect accuracy on polynomial-time graph reasoning tasks in GraphInstruct dataset, significantly outperforming both closed-source and fine-tuned open-source models, and scales to graphs with over 1,000 nodes.

Conclusion: The framework demonstrates effective graph reasoning capabilities without fine-tuning, handles real-world applications like webpage importance analysis, and provides scalable performance through agent-based decomposition.

Abstract: Recent research has explored the use of Large Language Models (LLMs) for tackling complex graph reasoning tasks. However, due to the intricacies of graph structures and the inherent limitations of LLMs in handling long text, current approaches often fail to deliver satisfactory accuracy, even on small-scale graphs and simple tasks. To address these challenges, we introduce GraphAgent-Reasoner, a fine-tuning-free framework that utilizes a multi-agent collaboration strategy for explicit and precise graph reasoning. Inspired by distributed graph computation theory, our framework decomposes graph problems into smaller, node-centric tasks that are distributed among multiple agents. The agents collaborate to solve the overall problem, significantly reducing the amount of information and complexity handled by a single LLM, thus enhancing the accuracy of graph reasoning. By simply increasing the number of agents, GraphAgent-Reasoner can efficiently scale to accommodate larger graphs with over 1,000 nodes. Evaluated on the GraphInstruct dataset, our framework demonstrates near-perfect accuracy on polynomial-time graph reasoning tasks, significantly outperforming the best available models, both closed-source and fine-tuned open-source variants. Our framework also demonstrates the capability to handle real-world graph reasoning applications such as webpage importance analysis.
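
The node-centric decomposition described above follows the pattern of distributed graph computation, where each node only sees its own state and messages from its neighbors. The sketch below illustrates that pattern for hop-count shortest paths; plain Python objects stand in for the LLM agents, and the class and function names are hypothetical, not taken from GraphAgent-Reasoner.

```python
from typing import Dict, List, Optional


class NodeAgent:
    """Holds the state of one node and reacts only to its own inbox."""

    def __init__(self, node_id: str, neighbors: List[str]) -> None:
        self.node_id = node_id
        self.neighbors = neighbors
        self.dist: Optional[int] = None  # best known hop count from the source

    def step(self, inbox: List[int]) -> Dict[str, int]:
        """Update local state; return messages for neighbors if it improved."""
        if not inbox:
            return {}
        best = min(inbox)
        if self.dist is None or best < self.dist:
            self.dist = best
            return {n: best + 1 for n in self.neighbors}
        return {}


def shortest_hops(graph: Dict[str, List[str]], source: str) -> Dict[str, Optional[int]]:
    """Run synchronous message-passing rounds until no messages remain."""
    agents = {v: NodeAgent(v, nbrs) for v, nbrs in graph.items()}
    inboxes: Dict[str, List[int]] = {v: [] for v in graph}
    inboxes[source].append(0)
    while any(inboxes.values()):
        outgoing: Dict[str, List[int]] = {v: [] for v in graph}
        for v, agent in agents.items():
            for target, msg in agent.step(inboxes[v]).items():
                outgoing[target].append(msg)
        inboxes = outgoing
    return {v: a.dist for v, a in agents.items()}


# Small undirected graph with one isolated node.
graph = {"a": ["b", "c"], "b": ["a", "d"], "c": ["a", "d"],
         "d": ["b", "c"], "e": []}
hops = shortest_hops(graph, "a")
```

No agent ever receives the whole graph, so the amount of information handled per agent depends on a node's degree rather than on the total graph size, which is the property the abstract credits for scaling to larger graphs.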

[400] RLZero: Direct Policy Inference from Language Without In-Domain Supervision

Harshit Sikchi, Siddhant Agarwal, Pranaya Jajoo, Samyak Parajuli, Caleb Chuck, Max Rudolph, Peter Stone, Amy Zhang, Scott Niekum

Main category: cs.AI

TL;DR: RLZero enables zero-shot language-to-behavior generation in RL agents without task-specific supervision by using video generative models to imagine observation sequences from language instructions.

Details

Motivation: Traditional reward specification is difficult for humans, and existing language-conditioned RL methods require costly supervision or test-time training.

Method: Three-step framework: 1) Imagine observation sequences using video generative models from language, 2) Project imagined observations to target environment domain, 3) Imitate projected sequences using pretrained RL agent with closed-form solution.

Result: First approach showing direct language-to-behavior generation across various tasks without in-domain supervision, also works with cross-embodied videos like YouTube content.

Conclusion: RLZero demonstrates effective zero-shot policy inference from natural language instructions using unsupervised pretraining and video generation, enabling language-driven RL without expensive supervision.

Abstract: The reward hypothesis states that all goals and purposes can be understood as the maximization of a received scalar reward signal. However, in practice, defining such a reward signal is notoriously difficult, as humans are often unable to predict the optimal behavior corresponding to a reward function. Natural language offers an intuitive alternative for instructing reinforcement learning (RL) agents, yet previous language-conditioned approaches either require costly supervision or test-time training given a language instruction. In this work, we present a new approach that uses a pretrained RL agent trained using only unlabeled, offline interactions (without task-specific supervision or labeled trajectories) to get zero-shot test-time policy inference from arbitrary natural language instructions. We introduce a framework comprising three steps: imagine, project, and imitate. First, the agent imagines a sequence of observations corresponding to the provided language description using video generative models. Next, these imagined observations are projected into the target environment domain. Finally, an agent pretrained in the target environment with unsupervised RL instantly imitates the projected observation sequence through a closed-form solution. To the best of our knowledge, our method, RLZero, is the first approach to show direct language-to-behavior generation abilities on a variety of tasks and environments without any in-domain supervision. We further show that components of RLZero can be used to generate policies zero-shot from cross-embodied videos, such as those available on YouTube, even for complex embodiments like humanoids.

[401] Attention Pruning: Automated Fairness Repair of Language Models via Surrogate Simulated Annealing

Vishnu Asutosh Dasu, Md Rafi ur Rashid, Vipul Gupta, Saeid Tizpaz-Niari, Gang Tan

Main category: cs.AI

TL;DR: Attention Pruning uses surrogate models and simulated annealing to selectively prune bias-contributing attention heads in LLMs, reducing gender bias by up to 40% while maintaining utility.

Details

Motivation: LLMs encode societal biases from training data, and post-processing methods like selective pruning offer feasible bias mitigation without expensive retraining.

Method: Develop surrogate neural networks to model fairness-utility relationships, then use simulated annealing to optimize attention head pruning decisions.

Result: Achieves up to 40% reduction in gender bias while outperforming state-of-the-art bias mitigation methods.

Conclusion: Attention Pruning provides an effective post-processing approach for bias mitigation in LLMs by strategically pruning attention heads that disproportionately contribute to bias.

Abstract: This paper explores pruning attention heads as a post-processing bias mitigation method for large language models (LLMs). Modern AI systems such as LLMs are expanding into sensitive social contexts where fairness concerns become especially crucial. Since LLMs develop decision-making patterns by training on massive datasets of human-generated content, they naturally encode and perpetuate societal biases. While modifying training datasets and algorithms is expensive and requires significant resources, post-processing techniques, such as selectively deactivating neurons and attention heads in pre-trained LLMs, can provide feasible and effective approaches to improve fairness. However, identifying the optimal subset of parameters to prune presents a combinatorial challenge within LLMs’ immense parameter space, requiring solutions that efficiently balance competing objectives across the frontiers of model fairness and utility. To address the computational challenges, we explore a search-based program repair approach via randomized simulated annealing. Given the prohibitive evaluation costs in billion-parameter LLMs, we develop surrogate deep neural networks that efficiently model the relationship between attention head states (active/inactive) and their corresponding fairness/utility metrics. This allows us to perform optimization over the surrogate models and efficiently identify optimal subsets of attention heads for selective pruning rather than directly searching through the LLM parameter space. This paper introduces Attention Pruning, a fairness-aware surrogate simulated annealing approach to prune attention heads in LLMs that disproportionately contribute to bias while minimally impacting overall model utility. Our experiments show that Attention Pruning achieves up to 40% reduction in gender bias and outperforms the state-of-the-art bias mitigation strategies.
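
The search procedure described above can be sketched as simulated annealing over a binary mask of attention heads, where each candidate mask is scored by a cheap surrogate instead of the LLM itself. In this toy version the surrogate is a hand-written linear cost with made-up per-head weights (the paper trains surrogate neural networks), and the function names, cooling schedule, and weights are all assumptions for illustration.

```python
import math
import random
from typing import Callable, List, Tuple


def anneal_head_mask(cost: Callable[[List[int]], float], n_heads: int,
                     steps: int = 2000, t0: float = 1.0,
                     cooling: float = 0.995, seed: int = 0) -> Tuple[List[int], float]:
    """Minimize cost(mask) over 0/1 masks (1 = head kept, 0 = head pruned)."""
    rng = random.Random(seed)
    mask = [1] * n_heads                  # start with every head active
    cur = cost(mask)
    best_mask, best = list(mask), cur
    temp = t0
    for _ in range(steps):
        i = rng.randrange(n_heads)
        mask[i] ^= 1                      # propose flipping one head
        new = cost(mask)
        # Always accept improvements; accept worse moves with Boltzmann probability.
        if new <= cur or rng.random() < math.exp((cur - new) / temp):
            cur = new
            if cur < best:
                best_mask, best = list(mask), cur
        else:
            mask[i] ^= 1                  # reject: undo the flip
        temp *= cooling
    return best_mask, best


# Hypothetical per-head weights: bias added when a head is kept,
# utility lost when a head is pruned.
bias_w = [0.9, 0.1, 0.8, 0.05]
util_w = [0.1, 0.7, 0.2, 0.6]


def surrogate_cost(mask: List[int]) -> float:
    """Toy stand-in for a learned surrogate of the fairness/utility trade-off."""
    bias = sum(m * b for m, b in zip(mask, bias_w))
    utility_loss = sum((1 - m) * u for m, u in zip(mask, util_w))
    return bias + utility_loss


best_mask, best_cost = anneal_head_mask(surrogate_cost, n_heads=4)
```

Because each evaluation only calls the surrogate, thousands of candidate masks can be scored without running the full model, which is the motivation the abstract gives for optimizing over surrogates.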

[402] Energy-Aware Pattern Disentanglement: A Generalizable Pattern Assisted Architecture for Multi-task Time Series Analysis

Xiangkai Ma, Xiaobin Hong, Wenzhong Li, Sanglu Lu

Main category: cs.AI

TL;DR: Pets is a novel “one-model many-tasks” architecture for time series analysis that uses local energy analysis in time-frequency domain to capture transient and non-stationary components, achieving SOTA performance across 60 benchmarks for forecasting, imputation, anomaly detection, and classification.

Details

Motivation: Existing deep learning methods for time series analysis often use “one-model one-task” architectures, limiting generalization across different tasks. The paper aims to create a more versatile architecture that can handle multiple time series tasks effectively.

Method: Proposes Pets architecture based on General fluctuation Pattern Assisted (GPA) framework with two key components: Fluctuation Pattern Assisted (FPA) module for capturing dependencies among fluctuation patterns and modeling them as latent representations, and Context-Guided Mixture of Predictors (MoP) module for hierarchical reconstruction of distinct fluctuations by energy proportion.

Result: Pets achieves state-of-the-art performance across 60 benchmarks on various tasks including forecasting, imputation, anomaly detection, and classification, while demonstrating strong generalization and robustness.

Conclusion: The Pets architecture successfully addresses the limitations of “one-model one-task” approaches by providing a versatile “one-model many-tasks” solution that leverages time-frequency domain analysis and generalizable pattern representations for superior performance across diverse time series tasks.

Abstract: Time series analysis has found widespread applications in areas such as weather forecasting, anomaly detection, and healthcare. While deep learning approaches have achieved significant success in this field, existing methods often adopt a “one-model one-task” architecture, limiting their generalization across different tasks. To address these limitations, we perform local energy analysis in the time-frequency domain to more precisely capture and disentangle transient and non-stationary oscillatory components. Furthermore, our representational analysis reveals that generative tasks tend to capture long-period patterns from low-frequency components, whereas discriminative tasks focus on high-frequency abrupt signals, which constitutes our core contribution. Concretely, we propose Pets, a novel “one-model many-tasks” architecture based on the General fluctuation Pattern Assisted (GPA) framework that is adaptable to versatile model structures for time series analysis. Pets integrates a Fluctuation Pattern Assisted (FPA) module and a Context-Guided Mixture of Predictors (MoP). The FPA module facilitates information fusion among diverse fluctuation patterns by capturing their dependencies and progressively modeling these patterns as latent representations at each layer. Meanwhile, the MoP module leverages these generalizable pattern representations to guide and regulate the reconstruction of distinct fluctuations hierarchically by energy proportion. Pets demonstrates strong versatility and achieves state-of-the-art performance across 60 benchmarks on various tasks, including forecasting, imputation, anomaly detection, and classification, while demonstrating strong generalization and robustness.

[403] Access Controls Will Solve the Dual-Use Dilemma

Evžen Wybitul

Main category: cs.AI

TL;DR: Proposes an access control framework for AI safety systems to handle dual-use requests by verifying users, preventing both over-refusal of legitimate queries and under-refusal of harmful ones.

Details

Motivation: AI safety systems struggle with dual-use requests where the same query could be harmless or harmful depending on context. Current systems lack real-world context and make arbitrary decisions, hurting both utility and safety.

Method: Conceptual framework based on access controls where only verified users can access dual-use outputs. Analyzes framework components and feasibility.

Result: Framework addresses both over-refusals (legitimate queries being denied) and under-refusals (harmful queries being allowed).

Conclusion: This high-level proposal provides initial tools for model providers to manage dual-use content more granularly, enabling users to access more capabilities safely and offering regulators targeted policy options.

Abstract: AI safety systems face the dual-use dilemma. It is unclear whether to answer dual-use requests, since the same query could be either harmless or harmful depending on who made it and why. To make better decisions, such systems would need to examine requests’ real-world context, but currently, they lack access to this information. Instead, they sometimes end up making arbitrary choices that result in refusing legitimate queries and allowing harmful ones, which hurts both utility and safety. To address this, we propose a conceptual framework based on access controls where only verified users can access dual-use outputs. We describe the framework’s components, analyse its feasibility, and explain how it addresses both over-refusals and under-refusals. While only a high-level proposal, our work takes the first step toward giving model providers more granular tools for managing dual-use content. Such tools would enable users to access more capabilities without sacrificing safety, and offer regulators new options for targeted policies.

[404] CoT Red-Handed: Stress Testing Chain-of-Thought Monitoring

Benjamin Arnav, Pablo Bernabeu-Pérez, Nathan Helm-Burger, Tim Kostolansky, Hannes Whittingham, Mary Phuong

Main category: cs.AI

TL;DR: CoT monitoring is more effective than action-only monitoring for detecting subtle harmful behaviors in AI models, but can be deceived by misleading rationalizations. A hybrid protocol combining reasoning and action scoring outperforms both approaches.

Details

Motivation: To ensure AI models don't take harmful actions unnoticed as they gain more autonomy, by developing effective monitoring methods.

Method: Compare Chain-of-Thought (CoT) monitoring vs action-only monitoring in red-teaming setup. Introduce hybrid protocol that independently scores reasoning and actions using weighted average.

Result: CoT monitoring is more effective for subtle deception but vulnerable to misleading rationalizations. The hybrid monitor consistently outperforms both approaches, with detection rates twice as high as action-only monitoring for subtle deception.

Conclusion: Hybrid monitoring protocol that combines reasoning and action scoring provides superior detection of harmful behaviors compared to either CoT or action-only monitoring alone.

Abstract: As AI models are deployed with increasing autonomy, it is important to ensure they do not take harmful actions unnoticed. As a potential mitigation, we investigate Chain-of-Thought (CoT) monitoring, wherein a weaker trusted monitor model continuously oversees the intermediate reasoning steps of a more powerful but untrusted model. We compare CoT monitoring to action-only monitoring, where only final outputs are reviewed, in a red-teaming setup where the untrusted model is instructed to pursue harmful side tasks while completing a coding problem. We find that while CoT monitoring is more effective than overseeing only model outputs in scenarios where action-only monitoring fails to reliably identify sabotage, reasoning traces can contain misleading rationalizations that deceive the CoT monitors, reducing performance in obvious sabotage cases. To address this, we introduce a hybrid protocol that independently scores model reasoning and actions, and combines them using a weighted average. Our hybrid monitor consistently outperforms both CoT and action-only monitors across all tested models and tasks, with detection rates twice as high as action-only monitoring for subtle deception scenarios.
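
Illustrative sketch (not the authors' code): the hybrid protocol in the abstract scores reasoning and actions independently and combines them with a weighted average. The Python below shows that combination step only. The score range of 0 to 1, the weight `w`, and the flagging threshold are hypothetical; the abstract does not state the values used.

```python
def hybrid_score(cot_score: float, action_score: float, w: float = 0.5) -> float:
    """Weighted average of two independent suspicion scores.

    w is the weight on the reasoning (CoT) score; 1 - w goes to the action score.
    """
    if not 0.0 <= w <= 1.0:
        raise ValueError("w must lie in [0, 1]")
    return w * cot_score + (1.0 - w) * action_score


def is_flagged(cot_score: float, action_score: float,
               w: float = 0.5, threshold: float = 0.5) -> bool:
    """Flag a trajectory when the combined score reaches the (assumed) threshold."""
    return hybrid_score(cot_score, action_score, w) >= threshold
```

With `w = 1.0` this reduces to a CoT-only monitor and with `w = 0.0` to an action-only monitor, so the single parameter interpolates between the two baselines the paper compares.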

[405] Deep Hidden Cognition Facilitates Reliable Chain-of-Thought Reasoning

Zijun Chen, Wenbo Hu, Richang Hong

Main category: cs.AI

TL;DR: This paper introduces a method to improve Chain of Thought reasoning reliability by using attention head activations to detect truthfulness in reasoning steps, enabling dynamic path selection via beam search.

Details

Motivation: Chain of Thought reasoning suffers from error accumulation in intermediate steps, undermining its reliability despite strong reasoning capabilities in LLMs and MLLMs.

Method: Leverage intrinsic veracity encoding in attention head activations to detect truthfulness of reasoning steps, train confidence predictor, and use beam search for dynamic path selection.

Result: Significantly outperforms state-of-the-art baselines (Few-Shot CoT, Self-Consistency, Self-Evaluation Guided Beam Search) across mathematical, symbolic, and commonsense reasoning tasks in both unimodal and multimodal settings.

Conclusion: Provides a novel reliability improvement path for CoT reasoning with broad application potential, validated on large reasoning models and exploring self-correction ability.

Abstract: Chain of Thought (CoT) reasoning has demonstrated remarkable deep reasoning capabilities in both large language models (LLMs) and multimodal large language models (MLLMs). However, its reliability is often undermined by the accumulation of errors in intermediate steps. This paper introduces a novel approach to calibrate the CoT reasoning accuracy by leveraging the model’s intrinsic veracity encoding. We discover that specific attention head activations reliably reflect the truthfulness of reasoning steps in CoT. Based on this insight, we train a confidence predictor to evaluate the correctness of each reasoning step using these truthfulness-sensitive activations, dynamically selecting the most plausible reasoning path via beam search. Experimental results demonstrate that our method significantly outperforms the state-of-the-art baselines (e.g., Few-Shot CoT, Self-Consistency, and Self-Evaluation Guided Beam Search) across the mathematical, symbolic, and commonsense reasoning tasks, exhibiting superior accuracy and reliability in both unimodal and multimodal settings. We further validate the approach on large reasoning models, confirming its applicability to specialized reasoning models. Additionally, we explore the role of the model’s self-correction ability in CoT reasoning. This work provides a novel reliability improvement path for CoT reasoning with broad application potential.
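
Illustrative sketch (not the authors' code): the abstract describes scoring each reasoning step with a confidence predictor and selecting a path via beam search. The Python below shows a generic confidence-guided beam search. The callables `expand` and `confidence` are hypothetical stand-ins: in the paper, the confidence would come from a predictor trained on attention head activations, which is not reproduced here. Ranking paths by summed log-confidence is also an assumption.

```python
import math
from typing import Callable, List


def confidence_beam_search(
    expand: Callable[[List[str]], List[str]],
    confidence: Callable[[List[str], str], float],
    start: str,
    depth: int,
    beam_width: int = 2,
) -> List[str]:
    """Return the reasoning path with the highest summed log-confidence."""
    beams = [(0.0, [start])]  # (cumulative log-confidence, path so far)
    for _ in range(depth):
        candidates = []
        for log_conf, path in beams:
            for step in expand(path):
                c = confidence(path, step)
                # Clamp to avoid log(0) when a step is judged impossible.
                candidates.append((log_conf + math.log(max(c, 1e-12)), path + [step]))
        if not candidates:
            break  # no path can be extended further
        candidates.sort(key=lambda item: item[0], reverse=True)
        beams = candidates[:beam_width]
    return beams[0][1]
```

Keeping more than one beam lets a path that starts with a lower-confidence step survive long enough to overtake a path that starts well but degrades later, which is the error-accumulation case the abstract targets.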

[406] ASP-Assisted Symbolic Regression: Uncovering Hidden Physics in Fluid Mechanics

Theofanis Aravanis, Grigorios Chrimatopoulos, Mohammad Ferdows, Michalis Xenos, Efstratios Em Tzirtzilakis

Main category: cs.AI

TL;DR: Symbolic Regression (SR) is used to discover interpretable mathematical models for 3D laminar flow, and a hybrid SR/ASP framework is introduced to ensure physical plausibility alongside statistical accuracy.

Details

Motivation: To address the 'black box' nature of conventional ML in fluid mechanics by providing interpretable models that reveal underlying physical relationships, while ensuring domain-specific constraints are respected.

Method: Applied SR to derive compact symbolic equations from numerical simulation data of 3D laminar flow, and developed a hybrid framework integrating SR with Answer Set Programming (ASP) to combine data-driven discovery with declarative reasoning for physical plausibility.

Result: Successfully derived symbolic equations that accurately reproduce parabolic velocity profiles and linear pressure drops, showing excellent agreement with analytical solutions. The SR/ASP framework ensures derived equations are both statistically accurate and physically plausible.

Conclusion: The hybrid SR/ASP approach demonstrates the potential of combining data-driven and knowledge-representation methods to enhance interpretability, reliability, and physical alignment in fluid dynamics and related domains.

Abstract: Symbolic Regression (SR) offers an interpretable alternative to conventional Machine-Learning (ML) approaches, which are often criticized as “black boxes”. In contrast to standard regression models that require a prescribed functional form, SR constructs expressions from a user-defined set of mathematical primitives, enabling the automated discovery of compact formulas that fit the data and reveal underlying physical relationships. In fluid mechanics, where understanding the underlying physics is as crucial as predictive accuracy, this study applies SR to model three-dimensional (3D) laminar flow in a rectangular channel, focusing on the axial velocity and pressure fields. Compact symbolic equations were derived from numerical simulation data, accurately reproducing the expected parabolic velocity profile and linear pressure drop, and showing excellent agreement with analytical solutions from the literature. To address the limitation that purely data-driven SR models may overlook domain-specific constraints, an innovative hybrid framework that integrates SR with Answer Set Programming (ASP) is also introduced. This integration combines the generative power of SR with the declarative reasoning capabilities of ASP, ensuring that derived equations remain both statistically accurate and physically plausible. The proposed SR/ASP methodology demonstrates the potential of combining data-driven and knowledge-representation approaches to enhance interpretability, reliability, and alignment with physical principles in fluid dynamics and related domains.

[407] LLM Collaboration With Multi-Agent Reinforcement Learning

Shuo Liu, Tianle Chen, Zeyu Liang, Xueguang Lyu, Christopher Amato

Main category: cs.AI

TL;DR: The paper proposes MAGRPO, a multi-agent reinforcement learning method for fine-tuning LLMs to improve coordination in collaborative tasks like writing and coding.

Details

Motivation: Current LLMs are pretrained independently and lack optimization for coordination, while existing fine-tuning frameworks rely on complex individual reward designs that don't effectively encourage collaboration.

Method: Model LLM collaboration as cooperative MARL and develop Multi-Agent Group Relative Policy Optimization (MAGRPO), building on RL approaches for LLMs and MARL techniques.

Result: Experiments on LLM writing and coding collaboration show that fine-tuning with MAGRPO enables agents to generate high-quality responses efficiently through effective cooperation.

Conclusion: The approach opens doors to using other MARL methods for LLMs and highlights associated challenges in multi-agent LLM coordination.

Abstract: A large amount of work has been done in Multi-Agent Systems (MAS) for modeling and solving problems with multiple interacting agents. However, most LLMs are pretrained independently and not specifically optimized for coordination. Existing LLM fine-tuning frameworks rely on individual rewards, which require complex reward designs for each agent to encourage collaboration. To address these challenges, we model LLM collaboration as a cooperative Multi-Agent Reinforcement Learning (MARL) problem. We develop a multi-agent, multi-turn algorithm, Multi-Agent Group Relative Policy Optimization (MAGRPO), to solve it, building on current RL approaches for LLMs as well as MARL techniques. Our experiments on LLM writing and coding collaboration demonstrate that fine-tuning MAS with MAGRPO enables agents to generate high-quality responses efficiently through effective cooperation. Our approach opens the door to using other MARL methods for LLMs and highlights the associated challenges. Our code is available at https://github.com/OpenMLRL/CoMLRL.
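
Illustrative sketch (not the authors' code): the abstract says MAGRPO builds on current RL approaches for LLMs, and its name points to Group Relative Policy Optimization. The Python below shows only the group-relative advantage computation commonly associated with GRPO-style training, applied to one joint reward per sampled group member. That all cooperating agents share this advantage, and the use of the population standard deviation, are assumptions; the abstract does not specify MAGRPO's exact formulation.

```python
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(joint_rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Normalize each sampled joint reward against its own group.

    joint_rewards[k] is the single team reward for the k-th sampled joint
    response; under the shared-reward assumption, every agent that took part
    in sample k would receive the same advantage.
    """
    mu = mean(joint_rewards)
    sigma = pstdev(joint_rewards)  # population std; eps guards identical rewards
    return [(r - mu) / (sigma + eps) for r in joint_rewards]
```

Because the baseline is the group mean rather than a learned value function, samples that beat their siblings get positive advantages and the rest get negative ones, with no per-agent reward design required.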

[408] Emotion-Coherent Reasoning for Multimodal LLMs via Emotional Rationale Verifier

Hyeongseop Rha, Jeong Hun Yeo, Yeonju Kim, Yong Man Ro

Main category: cs.AI

TL;DR: Proposes Emotional Rationale Verifier (ERV) and Explanation Reward to improve consistency between emotion predictions and explanations in Multimodal Large Language Models without architectural changes or extra annotations.

Details

Motivation: Current MLLM-based methods generate emotion explanations that diverge from target labels and sometimes contradict their own predicted emotions, posing risks for misunderstanding and eroding reliability in interactive settings.

Method: Uses Emotional Rationale Verifier (ERV) and Explanation Reward to guide models to produce reasoning explicitly consistent with target emotions during multimodal emotion recognition, without modifying model architecture or requiring additional paired video-description annotations.

Result: Significantly improves faithful explanation-prediction consistency and explanation emotion accuracy on MAFW and DFEW datasets. Extensive experiments and human evaluations show enhanced alignment between explanation and prediction.

Conclusion: The approach empowers MLLMs to deliver emotionally coherent, trustworthy interactions, marking a key step toward truly human-like HCI systems.

Abstract: The recent advancement of Multimodal Large Language Models (MLLMs) is transforming human-computer interaction (HCI) from surface-level exchanges into more nuanced and emotionally intelligent communication. To realize this shift, emotion understanding becomes essential, allowing systems to capture subtle cues underlying user intent. Furthermore, providing faithful explanations for predicted emotions is crucial to ensure interpretability and build user trust. However, current MLLM-based methods often generate emotion explanations that diverge from the target labels and sometimes even contradict their own predicted emotions. This inconsistency poses a critical risk for misunderstanding and erodes reliability in interactive settings. To address this, we propose a novel approach: the Emotional Rationale Verifier (ERV) and an Explanation Reward. Our method guides the model to produce reasoning that is explicitly consistent with the target emotion during multimodal emotion recognition without modifying the model architecture or requiring additional paired video-description annotations. Our method significantly improves faithful explanation-prediction consistency and explanation emotion accuracy on the MAFW and DFEW datasets. Through extensive experiments and human evaluations, we show that our approach not only enhances alignment between explanation and prediction but also empowers MLLMs to deliver emotionally coherent, trustworthy interactions, marking a key step toward truly human-like HCI systems.

[409] BMGQ: A Bottom-up Method for Generating Complex Multi-hop Reasoning Questions from Semi-structured Data

Bingsen Qiu, Zijian Liu, Xiao Liu, Bingjie Wang, Feier Zhang, Yixuan Qin, Chunyan Li, Haoshen Yang, Zeren Gao

Main category: cs.AI

TL;DR: BMGQ is an automated method for generating high-difficulty multi-hop QA datasets from semi-structured knowledge, using NLI-based relation typing, reverse question construction, and quality filtering to create retrieval-resistant questions suitable for training and evaluation.

Details

Motivation: Current multi-hop QA datasets are scarce and mostly designed for evaluation rather than training, while manual curation of non-trivially retrievable questions is costly and doesn't scale, creating a data bottleneck for training retrieval-and-reasoning agents.

Method: Bottom-up approach that: (1) grows diverse evidence clusters using NLI-based relation typing and diversity-aware expansion, (2) applies reverse question construction to create oblique cues, and (3) enforces quality with multi-model consensus filtering and structured constraint decomposition.

Result: The system generates complex, retrieval-resistant yet verifiable questions suitable for both supervised fine-tuning and reinforcement learning training, substantially reducing human curation effort while maintaining the difficulty profile of strong evaluation benchmarks.

Conclusion: BMGQ provides a scalable solution to the data bottleneck in multi-hop QA by automating the generation of training-ready, high-difficulty questions that require true multi-hop reasoning over loosely connected evidence.

Abstract: Building training-ready multi-hop question answering (QA) datasets that truly stress a model’s retrieval and reasoning abilities remains highly challenging. While there have been a few recent evaluation datasets that capture the characteristics of hard-to-search but easy-to-verify problems – requiring the integration of ambiguous, indirect, and cross-domain cues – these data resources remain scarce and are mostly designed for evaluation, making them unsuitable for supervised fine-tuning (SFT) or reinforcement learning (RL). Meanwhile, manually curating non-trivially retrievable questions – where answers cannot be found through a single direct query but instead require multi-hop reasoning over oblique and loosely connected evidence – incurs prohibitive human costs and fails to scale, creating a critical data bottleneck for training high-capability retrieval-and-reasoning agents. To address this, we present BMGQ, a bottom-up automated method for generating high-difficulty, training-ready multi-hop questions from semi-structured knowledge sources. The BMGQ system (i) grows diverse, logically labeled evidence clusters through Natural Language Inference (NLI)-based relation typing and diversity-aware expansion; (ii) applies reverse question construction to compose oblique cues so that isolated signals are underinformative but their combination uniquely identifies the target entity; and (iii) enforces quality with a two-step evaluation pipeline that combines multi-model consensus filtering with structured constraint decomposition and evidence-based matching. The result is a scalable process that yields complex, retrieval-resistant yet verifiable questions suitable for SFT/RL training as well as challenging evaluation, substantially reducing human curation effort while preserving the difficulty profile of strong evaluation benchmarks.

[410] Beyond ReAct: A Planner-Centric Framework for Complex Tool-Augmented LLM Reasoning

Xiaolong Wei, Yuehu Dong, Xingliang Wang, Xingyu Zhang, Zhejun Zhao, Dongdong Shen, Long Xia, Dawei Yin

Main category: cs.AI

TL;DR: Proposes a Planner-centric Plan-Execute paradigm with DAG planning to overcome local optimization traps in tool-augmented LLMs, achieving state-of-the-art performance on complex queries.

Details

Motivation: Existing tool-augmented LLMs like ReAct face local optimization traps due to incremental decision-making, limiting their ability to handle complex queries requiring sophisticated multi-tool coordination.

Method: Introduces a Planner model for global DAG planning, ComplexTool-Plan benchmark dataset, and two-stage training (SFT + GRPO) to enhance tool selection and global planning awareness.

Result: Achieves state-of-the-art performance on StableToolBench benchmark, demonstrating superior end-to-end execution capabilities and robust handling of intricate multi-tool workflows.

Conclusion: The Planner-centric Plan-Execute paradigm with DAG planning fundamentally resolves local optimization bottlenecks and enables optimized execution beyond conventional tool coordination.

Abstract: Existing tool-augmented large language models (LLMs) encounter significant challenges when processing complex queries. Current frameworks such as ReAct are prone to local optimization traps due to their reliance on incremental decision-making processes. To address these limitations, we propose a novel Planner-centric Plan-Execute paradigm that fundamentally resolves local optimization bottlenecks through architectural innovation. Central to our approach is a novel Planner model that performs global Directed Acyclic Graph (DAG) planning for complex queries, enabling optimized execution beyond conventional tool coordination. We also introduce ComplexTool-Plan, a large-scale benchmark dataset featuring complex queries that demand sophisticated multi-tool composition and coordination capabilities. Additionally, we develop a two-stage training methodology that integrates Supervised Fine-Tuning (SFT) with Group Relative Policy Optimization (GRPO), systematically enhancing the Planner’s tool selection accuracy and global planning awareness through structured DAG-based planning. When integrated with a capable executor, our framework achieves state-of-the-art performance on the StableToolBench benchmark for complex user queries, demonstrating superior end-to-end execution capabilities and robust handling of intricate multi-tool workflows.
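
Illustrative sketch (not the authors' code): the abstract describes a Planner that emits a global Directed Acyclic Graph (DAG) plan, which an executor then runs. The Python below shows only the execution side: given a DAG of tool calls, run each node after its dependencies using the standard-library `graphlib.TopologicalSorter` (Python 3.9+). The plan schema (`"tool"` and `"deps"` keys) and the tool registry are hypothetical; the paper's actual plan format is not given in the abstract.

```python
from graphlib import TopologicalSorter
from typing import Any, Callable, Dict


def execute_plan(plan: Dict[str, dict], tools: Dict[str, Callable[..., Any]]) -> Dict[str, Any]:
    """Run every node of a DAG plan once, dependencies first.

    plan maps node id -> {"tool": tool name, "deps": [node ids whose outputs
    are passed, in order, as positional arguments]}.
    """
    # TopologicalSorter expects node -> predecessors; it raises CycleError
    # if the plan is not actually acyclic.
    order = TopologicalSorter({node: spec["deps"] for node, spec in plan.items()}).static_order()
    results: Dict[str, Any] = {}
    for node in order:
        spec = plan[node]
        inputs = [results[dep] for dep in spec["deps"]]
        results[node] = tools[spec["tool"]](*inputs)
    return results
```

Unlike a step-by-step loop that picks the next tool after seeing each observation, the whole dependency structure is fixed before execution starts, so independent branches are visible up front and could be dispatched in parallel.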

[411] Yanyun-3: Enabling Cross-Platform Strategy Game Operation with Vision-Language Models

Guoyan Wang, Yanyan Huang, Chunlin Chen, Lifeng Wang, Yuxiang Sun

Main category: cs.AI

TL;DR: Yanyun-3 is a VLM-based agent for cross-platform strategy game automation that uses Qwen2.5-VL for visual reasoning and UI-TARS for interface execution, achieving significant improvements through novel multimodal data organization.

Details

Motivation: Cross-platform strategy game automation is challenging due to diverse UIs and dynamic environments, with existing VLMs struggling with generalization and precision in interface understanding and action execution.

Method: Integrates Qwen2.5-VL for visual reasoning and UI-TARS for interface execution, introduces combination granularity principle for multimodal data organization, fine-tuned using QLoRA on curated dataset across three strategy game platforms.

Result: Optimal strategy (M*V+S) achieves 12.98x improvement in BLEU-4 score and 63% reduction in inference time compared to full fusion, successfully executes core tasks across platforms without platform-specific tuning.

Conclusion: Structured multimodal data organization significantly enhances VLM performance in embodied tasks, offering a generalizable framework for GUI automation with broader implications for robotics and autonomous systems.

Abstract: Cross-platform strategy game automation remains a challenge due to diverse user interfaces and dynamic battlefield environments. Existing Vision–Language Models (VLMs) struggle with generalization across heterogeneous platforms and lack precision in interface understanding and action execution. We introduce Yanyun-3, a VLM-based agent that integrates Qwen2.5-VL for visual reasoning and UI-TARS for interface execution. We propose a novel data organization principle – combination granularity – to distinguish intra-sample fusion and inter-sample mixing of multimodal data (static images, multi-image sequences, and videos). The model is fine-tuned using QLoRA on a curated dataset across three strategy game platforms. The optimal strategy (M*V+S) achieves a 12.98x improvement in BLEU-4 score and a 63% reduction in inference time compared to full fusion. Yanyun-3 successfully executes core tasks (e.g., target selection, resource allocation) across platforms without platform-specific tuning. Our findings demonstrate that structured multimodal data organization significantly enhances VLM performance in embodied tasks. Yanyun-3 offers a generalizable framework for GUI automation, with broader implications for robotics and autonomous systems.

[412] Heterogeneous Multi-Agent Proximal Policy Optimization for Power Distribution System Restoration

Parya Dolatyabi, Mahdi Khodayar

Main category: cs.AI

TL;DR: Heterogeneous-Agent Reinforcement Learning (HARL) with HAPPO enables coordinated power distribution system restoration across interconnected microgrids, outperforming traditional methods in convergence speed, restored power, and training stability.

Details

Motivation: Conventional optimization and value-based RL approaches are computationally inefficient and difficult to scale for power distribution system restoration due to sequential switching operations, nonlinear constraints, and coordination of distributed energy resources.

Method: Uses Heterogeneous-Agent Proximal Policy Optimization (HAPPO) with decentralized actor policies and centralized critic, trained in a physics-informed OpenDSS environment that enforces operational limits via differentiable penalty signals.

Result: HAPPO achieves faster convergence, higher restored power, and smoother multi-seed training than DQN, PPO, MAES, MAGDPG, MADQN, Mean-Field RL, and QMIX on IEEE 123-bus and IEEE 8500-node systems.

Conclusion: Incorporating microgrid-level heterogeneity within the HARL framework yields a scalable, stable, and constraint-aware solution for complex power distribution system restoration.

Abstract: Restoring power distribution systems (PDS) after large-scale outages requires sequential switching operations that reconfigure feeder topology and coordinate distributed energy resources (DERs) under nonlinear constraints such as power balance, voltage limits, and thermal ratings. These challenges make conventional optimization and value-based RL approaches computationally inefficient and difficult to scale. This paper applies a Heterogeneous-Agent Reinforcement Learning (HARL) framework, instantiated through Heterogeneous-Agent Proximal Policy Optimization (HAPPO), to enable coordinated restoration across interconnected microgrids. Each agent controls a distinct microgrid with different loads, DER capacities, and switch counts, introducing practical structural heterogeneity. Decentralized actor policies are trained with a centralized critic to compute advantage values for stable on-policy updates. A physics-informed OpenDSS environment provides full power flow feedback and enforces operational limits via differentiable penalty signals rather than invalid action masking. The total DER generation is capped at 2400 kW, and each microgrid must satisfy local supply-demand feasibility. Experiments on the IEEE 123-bus and IEEE 8500-node systems show that HAPPO achieves faster convergence, higher restored power, and smoother multi-seed training than DQN, PPO, MAES, MAGDPG, MADQN, Mean-Field RL, and QMIX. Results demonstrate that incorporating microgrid-level heterogeneity within the HARL framework yields a scalable, stable, and constraint-aware solution for complex PDS restoration.

[413] KRAL: Knowledge and Reasoning Augmented Learning for LLM-assisted Clinical Antimicrobial Therapy

Zhe Li, Yehan Qiu, Yujie Chen, Xiang Zhou

Main category: cs.AI

TL;DR: KRAL is a novel paradigm that enhances clinical LLMs by distilling knowledge and reasoning from teacher models via answer-to-question generation, heuristic learning for data augmentation, and agentic reinforcement learning, achieving superior performance at lower costs.

Details

Motivation: Current LLMs face limitations in clinical decision-making due to knowledge gaps, privacy concerns, high costs, and limited reasoning capabilities, necessitating a more effective and efficient solution.

Method: KRAL uses teacher-model reasoning distillation via reverse answer-to-question generation, heuristic learning for semi-supervised data augmentation (reducing manual annotation by ~80%), and agentic reinforcement learning to jointly enhance knowledge and reasoning while optimizing efficiency.

Result: KRAL outperforms RAG and SFT methods: improves knowledge QA (Accuracy@1 on MEDQA by 1.8% vs. SFT, 3.6% vs. RAG) and reasoning (Pass@1 on PUMCH Antimicrobial by 27% vs. SFT, 27.2% vs. RAG) at ~20% of SFT’s long-term training costs.

Conclusion: KRAL provides an effective, low-cost solution for enhancing local LLMs’ clinical diagnostic capabilities, enabling safe deployment in complex medical decision support.

Abstract: Clinical antimicrobial therapy requires the dynamic integration of pathogen profiles, host factors, pharmacological properties of antimicrobials, and the severity of infection. This complexity imposes fundamental limitations on the applicability of Large Language Models (LLMs) in high-stakes clinical decision-making, including knowledge gaps, data privacy concerns, high deployment costs, and limited reasoning capabilities. To address these challenges, we propose KRAL (Knowledge and Reasoning Augmented Learning), a low-cost, scalable, privacy-preserving paradigm that leverages teacher-model reasoning to automatically distill knowledge and reasoning trajectories via answer-to-question reverse generation, employs heuristic learning for semi-supervised data augmentation (reducing manual annotation requirements by approximately 80%), and utilizes agentic reinforcement learning to jointly enhance medical knowledge and reasoning while optimizing computational and memory efficiency. A hierarchical evaluation employing diverse teacher-model proxies reduces assessment costs, while modular interface design facilitates seamless system updates. Experimental results demonstrate that KRAL significantly outperforms traditional Retrieval-Augmented Generation (RAG) and Supervised Fine-Tuning (SFT) methods. It improves knowledge question-answering capability (Accuracy@1 on the external open-source benchmark MEDQA increased by 1.8% vs. SFT and 3.6% vs. RAG) and reasoning capability (Pass@1 on the external benchmark PUMCH Antimicrobial increased by 27% vs. SFT and 27.2% vs. RAG), achieved at about 20% of SFT’s long-term training costs. This establishes KRAL as an effective solution for enhancing local LLMs’ clinical diagnostic capabilities, enabling low-cost, high-safety deployment in complex medical decision support.

cs.SD

[414] Evaluating Objective Speech Quality Metrics for Neural Audio Codecs

Luca A. Lanzendörfer, Florian Grötschla

Main category: cs.SD

TL;DR: Evaluation of objective quality metrics for neural audio codecs shows mixed correlation with human perception in speech quality assessment.

Details

Motivation: Human listening tests are time-consuming and impractical, so reliable objective metrics are needed to evaluate neural audio codecs efficiently.

Method: Conducted MUSHRA listening tests on high-fidelity speech signals and analyzed correlation between subjective scores and widely used objective metrics.

Result: Some objective metrics align well with human perception while others struggle to capture relevant distortions introduced by neural audio codecs.

Conclusion: Provides practical guidance for selecting appropriate evaluation metrics when using neural audio codecs for speech applications.

Abstract: Neural audio codecs have gained recent popularity for their use in generative modeling as they offer high-fidelity audio reconstruction at low bitrates. While human listening studies remain the gold standard for assessing perceptual quality, they are time-consuming and impractical. In this work, we examine the reliability of existing objective quality metrics in assessing the performance of recent neural audio codecs. To this end, we conduct a MUSHRA listening test on high-fidelity speech signals and analyze the correlation between subjective scores and widely used objective metrics. Our results show that, while some metrics align well with human perception, others struggle to capture relevant distortions. Our findings provide practical guidance for selecting appropriate evaluation metrics when using neural audio codecs for speech.
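The correlation analysis described above can be sketched in a few lines. This is a minimal illustration, not the authors' evaluation code: the MUSHRA scores and the two metric columns below are made-up numbers, and the choice of Spearman rank correlation is an assumption (the abstract only says "correlation").

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / sqrt(vx * vy)

def average_ranks(values):
    """1-based ranks; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))

# Hypothetical per-codec scores (not from the paper).
mushra = [92.0, 78.5, 64.0, 41.0, 25.5]   # mean listener scores
metric_a = [4.4, 3.9, 3.1, 2.2, 1.6]      # orders the codecs like the listeners do
metric_b = [3.0, 3.4, 2.9, 3.3, 3.1]      # barely tracks the listeners

print("metric_a:", spearman(mushra, metric_a))
print("metric_b:", spearman(mushra, metric_b))
```

A rank correlation is used here because an objective metric only needs to order codecs the way listeners do; it does not need to be linear in the MUSHRA scale.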

[415] PrismAudio: Decomposed Chain-of-Thoughts and Multi-dimensional Rewards for Video-to-Audio Generation

Huadai Liu, Kaicheng Luo, Wen Wang, Qian Chen, Peiwen Sun, Rongjie Huang, Xiangang Li, Jieping Ye, Wei Xue

Main category: cs.SD

TL;DR: PrismAudio is a novel V2A generation framework that uses specialized Chain-of-Thought modules with targeted RL rewards to solve objective entanglement and improve performance across semantic consistency, temporal synchrony, audio quality, and spatial accuracy.

Details

Motivation: Existing V2A methods suffer from objective entanglement that conflates competing goals in single loss functions and lack human preference alignment, making it difficult to balance the four critical perceptual dimensions.

Method: Decomposes reasoning into four specialized CoT modules (Semantic, Temporal, Aesthetic, Spatial) with targeted reward functions, using Fast-GRPO with hybrid ODE-SDE sampling for efficient RL optimization, and introduces AudioCanvas benchmark for evaluation.

Result: Achieves state-of-the-art performance across all four perceptual dimensions on both VGGSound test set and out-of-domain AudioCanvas benchmark, demonstrating superior semantic consistency, temporal synchrony, audio quality, and spatial accuracy.

Conclusion: PrismAudio successfully solves the objective entanglement problem in V2A generation through specialized CoT planning with multidimensional RL optimization, providing interpretable and computationally efficient state-of-the-art performance.

Abstract: Video-to-Audio (V2A) generation requires balancing four critical perceptual dimensions: semantic consistency, audio-visual temporal synchrony, aesthetic quality, and spatial accuracy; yet existing methods suffer from objective entanglement that conflates competing goals in single loss functions and lack human preference alignment. We introduce PrismAudio, the first framework to integrate Reinforcement Learning into V2A generation with specialized Chain-of-Thought (CoT) planning. Our approach decomposes monolithic reasoning into four specialized CoT modules (Semantic, Temporal, Aesthetic, and Spatial CoT), each paired with targeted reward functions. This CoT-reward correspondence enables multidimensional RL optimization that guides the model to jointly generate better reasoning across all perspectives, solving the objective entanglement problem while preserving interpretability. To make this optimization computationally practical, we propose Fast-GRPO, which employs hybrid ODE-SDE sampling that dramatically reduces the training overhead compared to existing GRPO implementations. We also introduce AudioCanvas, a rigorous benchmark that is more distributionally balanced and covers more realistically diverse and challenging scenarios than existing datasets, with 300 single-event classes and 501 multi-event samples. Experimental results demonstrate that PrismAudio achieves state-of-the-art performance across all four perceptual dimensions on both the in-domain VGGSound test set and out-of-domain AudioCanvas benchmark. The project page is available at https://PrismAudio-Project.github.io.

[416] Continual Audio Deepfake Detection via Universal Adversarial Perturbation

Wangjie Li, Lin Li, Qingyang Hong

Main category: cs.SD

TL;DR: A novel framework using Universal Adversarial Perturbation (UAP) with self-supervised audio models to enable continual learning in audio deepfake detection without needing historical data.

Details

Motivation: Address security concerns in multimedia forensics from evolving speech synthesis/voice conversion technologies, and overcome limitations of current detection models that struggle with new attacks and require costly continual fine-tuning with historical data.

Method: Integrate Universal Adversarial Perturbation (UAP) with pre-trained self-supervised audio models during fine-tuning, allowing models to retain knowledge of historical spoofing distribution without direct access to past data.

Result: Extensive experiments validate the effectiveness of the approach, demonstrating its potential as an efficient solution for continual learning in audio deepfake detection.

Conclusion: The proposed framework successfully addresses the limitations of current detection methods and provides an efficient solution for maintaining detection effectiveness against evolving deepfake attacks without computational/storage costs of historical data.

Abstract: The rapid advancement of speech synthesis and voice conversion technologies has raised significant security concerns in multimedia forensics. Although current detection models demonstrate impressive performance, they struggle to maintain effectiveness against constantly evolving deepfake attacks. Additionally, continually fine-tuning these models using historical training data incurs substantial computational and storage costs. To address these limitations, we propose a novel framework that incorporates Universal Adversarial Perturbation (UAP) into audio deepfake detection, enabling models to retain knowledge of historical spoofing distribution without direct access to past data. Our method integrates UAP seamlessly with pre-trained self-supervised audio models during fine-tuning. Extensive experiments validate the effectiveness of our approach, showcasing its potential as an efficient solution for continual learning in audio deepfake detection.

[417] DUO-TOK: Dual-Track Semantic Music Tokenizer for Vocal-Accompaniment Generation

Rui Lin, Zhiyue Wu, Jiahe Le, Kangdi Wang, Weixiong Chen, Junyu Dai, Tao Jiang

Main category: cs.SD

TL;DR: Duo-Tok is a dual-codebook tokenizer for vocal-accompaniment music that balances reconstruction quality and language model learnability; at 0.75 kbps it achieves the best music-tagging AP and the lowest vocabulary-normalized LM perplexity among compared codecs.

Details

Motivation: Address the tension between high-fidelity reconstruction with acoustic tokens (difficult to model) and LM-friendly semantic tokens (lossy) in lyrics-to-song systems, while making the tokenizer aware of dual-track structure.

Method: Four-stage SSL pipeline: pretrain BEST-RQ-style encoder on large-scale audio, stabilize/factorize representation with Gaussian noise and multi-task supervision, freeze encoder to learn SimVQ-based dual codebooks with hard routing for vocals/accompaniment, train latent diffusion decoders on discrete tokens.

Result: At 0.75 kbps, achieves best music-tagging AP and lowest vocabulary-normalized LM perplexity among compared codecs while maintaining reconstruction quality comparable to state-of-the-art music tokenizers.

Conclusion: Duo-Tok shifts the empirical reconstruction-generation Pareto frontier, successfully balancing reconstruction quality with language model learnability for vocal-accompaniment music.

Abstract: Duo-Tok is a source-aware dual-codebook tokenizer for vocal-accompaniment music that targets the growing tension between reconstruction quality and language-model (LM) learnability in modern lyrics-to-song systems. Existing codecs either prioritize high-fidelity reconstruction with difficult-to-model acoustic tokens or compress aggressively into semantic tokens that are LM-friendly but lossy, and they rarely make the tokenizer itself aware of dual-track structure. Duo-Tok follows a four-stage, SSL-centered pipeline: we first pretrain a BEST-RQ-style encoder on large-scale audio, then stabilize and factorize the representation with Gaussian replacement noise and multi-task supervision, before freezing the encoder to learn SimVQ-based dual codebooks with hard routing for vocals and accompaniment, and finally training latent diffusion decoders on top of the discrete tokens. Duo-Tok at 0.75 kbps shifts the empirical reconstruction-generation Pareto frontier, achieving the best music-tagging AP and the lowest vocabulary-normalized LM perplexity among compared codecs while maintaining reconstruction quality comparable to state-of-the-art music tokenizers.

[418] Differentiable Attenuation Filters for Feedback Delay Networks

Ilias Ibnyahya, Joshua D. Reiss

Main category: cs.SD

TL;DR: Novel method for designing attenuation filters in FDNs using SOS IIR filters as parametric equalizers, enabling scalable and differentiable control over frequency-dependent reverberation decay.

Details

Motivation: Traditional graphic equalizer designs require numerous filters per delay line, lacking scalability and efficiency. Need for a method that reduces optimization parameters while maintaining differentiability for gradient-based learning.

Method: Uses Second Order Sections of IIR filters arranged as parametric equalizers. Frequency, gain, and Q parameters are shared across delay lines, with only gain adjusted based on delay length. Leverages analog filter design principles for efficient filter fitting via supervised learning.

Result: Achieves state-of-the-art performance with significant computational cost reduction. Provides flexible and fully differentiable design compatible with gradient-based learning frameworks.

Conclusion: The proposed scalable SOS IIR filter design offers efficient, accurate, and differentiable attenuation filtering for digital audio reverberation systems, overcoming limitations of traditional approaches.

Abstract: We introduce a novel method for designing attenuation filters in digital audio reverberation systems based on Feedback Delay Networks (FDNs). Our approach uses Second Order Sections (SOS) of Infinite Impulse Response (IIR) filters arranged as parametric equalizers (PEQ), enabling fine control over frequency-dependent reverberation decay. Unlike traditional graphic equalizer designs, which require numerous filters per delay line, we propose a scalable solution where the number of filters can be adjusted. The frequency, gain, and quality factor (Q) parameters are shared parameters across delay lines and only the gain is adjusted based on delay length. This design not only reduces the number of optimization parameters, but also remains fully differentiable and compatible with gradient-based learning frameworks. Leveraging principles of analog filter design, our method allows for efficient and accurate filter fitting using supervised learning. Our method delivers a flexible and differentiable design, achieving state-of-the-art performance while significantly reducing computational cost.
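To illustrate why only the gain needs to depend on delay length, the sketch below uses the classical FDN relation between a target reverberation time T60 and the per-pass attenuation of a delay line, `gain_dB = -60 * m / (fs * T60)`. This is an assumption for illustration: the paper's exact parameterization is not given in the abstract, and the band list (`SHARED_BANDS`) and function names are hypothetical.

```python
def attenuation_gain_db(delay_samples, t60_seconds, sample_rate):
    """Gain in dB applied once per pass through a delay line so that the
    recirculating signal has decayed by 60 dB after t60_seconds."""
    return -60.0 * delay_samples / (sample_rate * t60_seconds)

# Hypothetical shared band specification: (centre frequency in Hz, Q, T60 in s).
# Frequency and Q are the same for every delay line; only the gain changes.
SHARED_BANDS = [(250.0, 0.7, 2.0), (2000.0, 0.7, 1.2), (8000.0, 0.7, 0.6)]

def per_line_band_gains(delay_lengths, bands, sample_rate):
    """Map each delay length (in samples) to its list of per-band gains in dB."""
    return {
        m: [attenuation_gain_db(m, t60, sample_rate) for _, _, t60 in bands]
        for m in delay_lengths
    }

gains = per_line_band_gains([1200, 2400], SHARED_BANDS, 48000)
for m, g in gains.items():
    print(m, [round(v, 3) for v in g])
```

Doubling the delay length doubles the attenuation in dB for every band, so a single set of band parameters can serve all delay lines with a length-dependent gain scaling.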

[419] Efficient and Fast Generative-Based Singing Voice Separation using a Latent Diffusion Model

Genís Plaja-Roglans, Yun-Ning Hung, Xavier Serra, Igor Pereira

Main category: cs.SD

TL;DR: A latent diffusion model for singing voice separation that outperforms existing generative systems and matches non-generative approaches, using only vocal-mixture pairs for training.

Details

Motivation: Extracting individual elements from music mixtures is valuable for production/practice, but current neural approaches face challenges with source overlap and require access to all sources. Generative approaches exist but have limited performance and efficiency.

Method: Uses latent diffusion models for generative singing voice separation, relying only on isolated vocals-mixture pairs for training. Employs latent space encoding for efficient optimization and faster inference.

Result: Outperforms existing generative separation systems and matches non-generative systems on signal quality measures and interference removal. Provides noise robustness study on latent encoder.

Conclusion: Demonstrates diffusion models’ potential to bridge the gap in music source separation, offering efficient inference and competitive performance while requiring only vocal-mixture pairs for training.

Abstract: Extracting individual elements from music mixtures is a valuable tool for music production and practice. While neural networks optimized to mask or transform mixture spectrograms into the individual source(s) have been the leading approach, the source overlap and correlation in music signals pose an inherent challenge. Moreover, training these systems requires access to all sources in the mixture, which is difficult to obtain. Attempts to address these challenges in a generative fashion exist; however, the separation performance and inference efficiency remain limited. In this work, we study the potential of diffusion models to advance toward bridging this gap, focusing on generative singing voice separation relying only on corresponding pairs of isolated vocals and mixtures for training. To align with creative workflows, we leverage latent diffusion: the system generates samples encoded in a compact latent space, and subsequently decodes these into audio. This enables efficient optimization and faster inference. Our system is trained using only open data. We outperform existing generative separation systems, and match the compared non-generative systems on a list of signal quality measures and on interference removal. We provide a noise robustness study on the latent encoder, which offers insights into its potential for the task. We release a modular toolkit for further research on the topic.

cs.LG

[420] Hidden markov model to predict tourists visited place

Theo Demessance, Chongke Bi, Sonia Djebali, Guillaume Guerard

Main category: cs.LG

TL;DR: Using social network data and grammatical inference to predict tourist movements in Paris via hidden Markov models.

Details

Motivation: Social networks provide rich data on tourist behavior through digital traces, enabling better understanding of tourist movements for tourism marketing and decision support.

Method: Adapted grammatical inference algorithm for big data context, using machine learning to create hidden Markov models from social network data.

Result: Successfully demonstrated the method in Paris, producing flexible and editable hidden Markov models representing tourist movements.

Conclusion: The proposed method effectively predicts tourist movements using social network data analysis and grammatical inference, with potential applications in tourism marketing.

Abstract: Nowadays, social networks are becoming a popular way of analyzing tourist behavior, thanks to the digital traces left by travelers during their stays on these networks. The massive amount of data generated by the propensity of tourists to share comments and photos during their trip makes it possible to model their journeys and analyze their behavior. Predicting the next movement of tourists plays a key role in tourism marketing to understand demand and improve decision support. In this paper, we propose a method to understand and to learn tourists’ movements based on social network data analysis to predict future movements. The method relies on a machine learning grammatical inference algorithm. A major contribution in this paper is to adapt the grammatical inference algorithm to the context of big data. Our method produces a hidden Markov model representing the movements of a group of tourists. The hidden Markov model is flexible and editable with new data. Paris, the capital city of France, is selected to demonstrate the efficiency of the proposed methodology.

[421] Quantifying Modality Contributions via Disentangling Multimodal Representations

Padegal Amit, Omkar Mahesh Kashyap, Namitha Rayasam, Nidhi Shekhar, Surabhi Narayan

Main category: cs.LG

TL;DR: Proposes a framework using Partial Information Decomposition to quantify modality contributions in multimodal models by analyzing internal embeddings, distinguishing between unique, redundant, and synergistic information.

Details

Motivation: Existing accuracy-based approaches conflate modality contributions by only measuring performance drops after removal, failing to distinguish between inherently informative modalities and those that provide value only through interactions with other modalities.

Method: Develops a framework based on Partial Information Decomposition (PID) that decomposes predictive information in internal embeddings into unique, redundant, and synergistic components. Uses Iterative Proportional Fitting Procedure (IPFP) for scalable inference-only analysis without retraining.

Result: Provides a principled, representation-level view of multimodal behavior that offers clearer and more interpretable insights than outcome-based metrics.

Conclusion: The proposed PID-based framework enables more accurate quantification of modality contributions by distinguishing different types of information contributions at the representation level, providing better interpretability for multimodal models.

Abstract: Quantifying modality contributions in multimodal models remains a challenge, as existing approaches conflate the notion of contribution itself. Prior work relies on accuracy-based approaches, interpreting performance drops after removing a modality as indicative of its influence. However, such outcome-driven metrics fail to distinguish whether a modality is inherently informative or whether its value arises only through interaction with other modalities. This distinction is particularly important in cross-attention architectures, where modalities influence each other’s representations. In this work, we propose a framework based on Partial Information Decomposition (PID) that quantifies modality contributions by decomposing predictive information in internal embeddings into unique, redundant, and synergistic components. To enable scalable, inference-only analysis, we develop an algorithm based on the Iterative Proportional Fitting Procedure (IPFP) that computes layer and dataset-level contributions without retraining. This provides a principled, representation-level view of multimodal behavior, offering clearer and more interpretable insights than outcome-based metrics.
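The distinction between unique, redundant, and synergistic information can be made concrete with a toy example. In the standard PID formulation, the joint information decomposes as I(X1,X2;Y) = R + U1 + U2 + S, with I(X1;Y) = R + U1 and I(X2;Y) = R + U2. The sketch below is not the paper's IPFP-based algorithm; it only computes plain mutual information on two hypothetical two-bit tasks to show why single-modality measurements alone cannot separate the two cases.

```python
from collections import Counter
from itertools import product
from math import log2

def mutual_information(pairs):
    """I(A;B) in bits, treating the (a, b) pairs as equally likely samples."""
    n = len(pairs)
    joint = Counter(pairs)
    pa = Counter(a for a, _ in pairs)
    pb = Counter(b for _, b in pairs)
    return sum(
        (c / n) * log2((c / n) / ((pa[a] / n) * (pb[b] / n)))
        for (a, b), c in joint.items()
    )

def report(samples):
    """samples: list of (x1, x2, y). Returns (I(X1;Y), I(X2;Y), I(X1,X2;Y))."""
    i1 = mutual_information([(x1, y) for x1, _, y in samples])
    i2 = mutual_information([(x2, y) for _, x2, y in samples])
    i12 = mutual_information([((x1, x2), y) for x1, x2, y in samples])
    return i1, i2, i12

# Purely synergistic task: Y = X1 XOR X2 with independent uniform inputs.
xor_samples = [(a, b, a ^ b) for a, b in product((0, 1), repeat=2)]
# Purely redundant task: both modalities carry the same bit, Y copies it.
copy_samples = [(a, a, a) for a in (0, 1)]

print("XOR :", report(xor_samples))
print("COPY:", report(copy_samples))
```

In the XOR case each modality alone carries zero bits about Y while the pair carries one bit, so all of the information is synergistic. In the copy case each modality alone already carries the full bit, so the information is redundant. Both tasks have the same joint information, which is why a decomposition is needed to tell them apart.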

[422] PrefixGPT: Prefix Adder Optimization by a Generative Pre-trained Transformer

Ruogu Ding, Xin Ning, Ulf Schlichtmann, Weikang Qian

Main category: cs.LG

TL;DR: PrefixGPT is a GPT-based model that generates optimized prefix adders from scratch, finding a new optimal design with a 7.7% better area-delay product (ADP) and lowering the average ADP by up to 79.1% compared to existing methods.

Details

Motivation: Designing optimized prefix adders is challenging due to strict design rules and exponentially large design space, requiring automated solutions.

Method: Represent adder topology as 2D coordinate sequence with legality mask, use decoder-only Transformer pre-trained on random valid adders then fine-tuned for optimization.

Result: Found new optimal design with 7.7% improved ADP and up to 79.1% lower average ADP, demonstrating superior exploration quality.

Conclusion: GPT-style models can master complex hardware design principles and efficiently optimize designs, showing potential for automated circuit design.

Abstract: Prefix adders are widely used in compute-intensive applications for their high speed. However, designing optimized prefix adders is challenging due to strict design rules and an exponentially large design space. We introduce PrefixGPT, a generative pre-trained Transformer (GPT) that directly generates optimized prefix adders from scratch. Our approach represents an adder’s topology as a two-dimensional coordinate sequence and applies a legality mask during generation, ensuring every design is valid by construction. PrefixGPT features a customized decoder-only Transformer architecture. The model is first pre-trained on a corpus of randomly synthesized valid prefix adders to learn design rules and then fine-tuned to navigate the design space for optimized design quality. Compared with existing works, PrefixGPT not only finds a new optimal design with a 7.7% improved area-delay product (ADP) but exhibits superior exploration quality, lowering the average ADP by up to 79.1%. This demonstrates the potential of GPT-style models to first master complex hardware design principles and then apply them for more efficient design optimization.
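The idea of a legality mask can be sketched generically: before sampling the next token, logits of illegal choices are set to negative infinity so they receive zero probability. The code below is a minimal illustration of masked sampling in general, not PrefixGPT's implementation; the logits and the `legal` flags are hypothetical, and the actual prefix-adder legality rules are not modeled.

```python
import math
import random

def masked_softmax(logits, legal):
    """Softmax over logits where illegal entries get probability exactly 0."""
    if not any(legal):
        raise ValueError("no legal choice available")
    masked = [l if ok else -math.inf for l, ok in zip(logits, legal)]
    peak = max(masked)  # finite, because at least one entry is legal
    exps = [math.exp(v - peak) for v in masked]  # exp(-inf) evaluates to 0.0
    total = sum(exps)
    return [e / total for e in exps]

def sample_legal(logits, legal, rng=random):
    """Draw one index according to the masked distribution."""
    probs = masked_softmax(logits, legal)
    return rng.choices(range(len(logits)), weights=probs, k=1)[0]

# Hypothetical step: three candidate coordinates, the middle one is illegal.
logits = [2.0, 1.0, 0.5]
legal = [True, False, True]
probs = masked_softmax(logits, legal)
print(probs)
print(sample_legal(logits, legal, random.Random(0)))
```

Because the mask is applied at every generation step, any sequence the sampler emits satisfies the constraints encoded in the mask, which is what "valid by construction" refers to.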

[423] WavefrontDiffusion: Dynamic Decoding Schedule for Improved Reasoning

Haojin Yang, Rui Hu, Zequn Sun, Rui Zhou, Yujun Cai, Yiwei Wang

Main category: cs.LG

TL;DR: WavefrontDiffusion is a dynamic decoding approach for diffusion language models that expands tokens outward from finalized positions, achieving state-of-the-art performance with better semantic fidelity than existing methods.

Details

Motivation: Standard Diffusion causes premature end-of-sequence predictions by finalizing incomplete context, while BlockDiffusion's rigid structure breaks coherent semantic units and disrupts reasoning. There's a need for more adaptive denoising strategies.

Method: WavefrontDiffusion uses dynamic decoding that expands a wavefront of active tokens outward from finalized positions, following natural semantic flow while maintaining computational efficiency comparable to block-based methods.

Result: Across four benchmarks in reasoning and code generation, WavefrontDiffusion achieves state-of-the-art performance while producing outputs with higher semantic fidelity compared to existing methods.

Conclusion: Adaptive scheduling in denoising strategies enables more coherent and efficient text generation, with WavefrontDiffusion demonstrating the value of following semantic structure rather than rigid patterns.

Abstract: Diffusion Language Models (DLMs) have shown strong potential for text generation and are becoming a competitive alternative to autoregressive models. The denoising strategy plays an important role in determining the quality of their outputs. Mainstream denoising strategies include Standard Diffusion and BlockDiffusion. Standard Diffusion performs global denoising without restricting the update range, often finalizing incomplete context and causing premature end-of-sequence predictions. BlockDiffusion updates fixed-size blocks in a preset order, but its rigid structure can break apart coherent semantic units and disrupt reasoning. We present WavefrontDiffusion, a dynamic decoding approach that expands a wavefront of active tokens outward from finalized positions. This adaptive process follows the natural flow of semantic structure while keeping computational cost equal to block-based methods. Across four benchmarks in reasoning and code generation, WavefrontDiffusion achieves state-of-the-art performance while producing outputs with higher semantic fidelity, showing the value of adaptive scheduling for more coherent and efficient generation.
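One way to picture a wavefront schedule is as a set of "active" positions that sit next to already finalized tokens and grow outward as more tokens are committed. The toy scheduler below is a hypothetical sketch under strong assumptions: the fixed `radius`, the one-token-per-step commit rule, and the `confidence` function are all illustrative and are not taken from the paper.

```python
def wavefront_positions(finalized, length, radius):
    """Unfinalized positions within `radius` of any finalized position."""
    return sorted(
        p for p in range(length)
        if p not in finalized and any(abs(p - f) <= radius for f in finalized)
    )

def decode_order(length, prompt_len, radius, confidence):
    """Order in which positions are finalized when, at each step, the most
    confident position on the current wavefront is committed."""
    if prompt_len < 1 or radius < 1:
        raise ValueError("need a non-empty prompt and radius >= 1")
    finalized = set(range(prompt_len))
    order = []
    while len(finalized) < length:
        active = wavefront_positions(finalized, length, radius)
        best = max(active, key=confidence)
        finalized.add(best)
        order.append(best)
    return order

# Positions 0..2 are the prompt; the wavefront starts right after it.
print(wavefront_positions({0, 1, 2}, 10, 2))
# Finalizing position 4 pushes the wavefront further to the right.
print(wavefront_positions({0, 1, 2, 4}, 10, 2))
# Toy confidence functions standing in for model confidence.
print(decode_order(6, 2, 2, confidence=lambda p: -p))
print(decode_order(6, 2, 2, confidence=lambda p: p))
```

Unlike a fixed block schedule, the set of candidate positions here depends on which tokens have already been finalized, so the decoding order adapts to the model's confidence while staying local to committed context.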

[424] Exploiting the Experts: Unauthorized Compression in MoE-LLMs

Pinaki Prasad Guha Neogi, Ahmad Mohammadshirazi, Dheeraj Kulshrestha, Rajiv Ramnath

Main category: cs.LG

TL;DR: MoE-LLMs are vulnerable to unauthorized compression through expert pruning and fine-tuning, creating a security risk. The paper analyzes this threat and proposes defense strategies.

Details

Motivation: Mixture-of-Experts architectures in LLMs introduce security vulnerabilities where adversaries can bypass licensing by pruning and fine-tuning experts, requiring systematic study of this threat.

Method: Developed expert attribution framework to identify task-critical experts, evaluated pruning effects, and proposed defense strategies including entangled expert training and selective fine-tuning protocols.

Result: Found critical knowledge loss-recovery trade-off: task accuracy can be retained by isolating specific experts but requires targeted re-alignment to avoid significant degradation.

Conclusion: MoE modularity has dual-use nature - both a threat vector and defense target, requiring systematic security evaluation frameworks for MoE-LLM specialization.

Abstract: Mixture-of-Experts (MoE) architectures are increasingly adopted in large language models (LLMs) for their scalability and efficiency. However, their modular structure introduces a unique vulnerability: adversaries can attempt to compress or repurpose models by pruning experts and cheaply fine-tuning the remainder, effectively bypassing licensing and security constraints. In this paper, we systematically study the prunability of MoE-LLMs under task-specific usage. We first develop an expert attribution framework that identifies the subset of experts most responsible for a given task, then evaluate the performance trade-offs of pruning and re-aligning these experts using active learning-driven fine-tuning. Our findings reveal a critical knowledge loss–recovery trade-off: while certain experts can be isolated to retain task accuracy, significant degradation occurs without targeted re-alignment. Based on this analysis, we propose defense strategies that aim to make MoE models harder to compress and fine-tune without authorization, including entangled expert training and selective fine-tuning protocols that resist unauthorized adaptation. By positioning expert pruning as both a threat vector and a defense target, this work highlights the dual-use nature of MoE modularity and provides the first systematic evaluation framework for secure specialization of MoE-LLMs.

[425] Quality analysis and evaluation prediction of RAG retrieval based on machine learning algorithms

Ruoxin Zhang, Zhizhao Wen, Chao Wang, Chenchen Tang, Puyang Xu, Yifan Jiang

Main category: cs.LG

TL;DR: The paper proposes an XGBoost regression model with feature engineering and particle swarm optimization to improve retrieval quality in RAG systems, showing superior performance over other models.

Details

Motivation: Retrieval enhanced generation systems depend heavily on retrieval quality, and existing models struggle with tabular features, leading to distorted content generation when retrieval results are irrelevant or noisy.

Method: XGBoost machine learning regression model based on feature engineering and particle swarm optimization, with correlation analysis of key metrics like answer_quality and doc_relevance.

Result: The proposed model outperforms decision trees and AdaBoost with significantly lower MSE, RMSE, MAE, MAPE and higher R2, demonstrating better prediction accuracy, stability, and data interpretation.

Conclusion: The approach provides an effective path for optimizing retrieval quality and improving RAG system generation effects, with important value for promoting implementation and application of related technologies.

Abstract: With the rapid evolution of large language models, retrieval enhanced generation technology has been widely used due to its ability to integrate external knowledge to improve output accuracy. However, the performance of the system is highly dependent on the quality of the retrieval module. If the retrieval results have low relevance to user needs or contain noisy information, it will directly lead to distortion of the generated content. In response to the performance bottleneck of existing models in processing tabular features, this paper proposes an XGBoost machine learning regression model based on feature engineering and particle swarm optimization. Correlation analysis shows that answer_quality is positively correlated with doc_relevance (correlation of 0.66), indicating that document relevance has a significant positive effect on answer quality, and improving document relevance may enhance answer quality. Semantic similarity and redundancy are each strongly negatively correlated with diversity (-0.89 and -0.88, respectively), indicating a trade-off: as the former two increase, diversity decreases significantly. The experimental results comparing decision trees, AdaBoost, etc. show that the VMD PSO BiLSTM model is superior in all evaluation indicators, with significantly lower MSE, RMSE, MAE, and MAPE compared to the comparison models. The R2 value is higher, indicating that its prediction accuracy, stability, and data interpretation ability are more outstanding. This achievement provides an effective path for optimizing the retrieval quality and improving the generation effect of RAG systems, and has important value in promoting the implementation and application of related technologies.
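For reference, the evaluation indicators named above (MSE, RMSE, MAE, MAPE, R2) can be computed as follows. This is a plain-Python sketch of the standard definitions; the `y_true` and `y_pred` values are made-up numbers, not results from the paper.

```python
from math import sqrt

def regression_metrics(y_true, y_pred):
    """Standard regression metrics; MAPE is a fraction and needs nonzero y_true."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mse = sum(e * e for e in errors) / n
    mae = sum(abs(e) for e in errors) / n
    mape = sum(abs(e) / abs(t) for e, t in zip(errors, y_true)) / n
    mean_true = sum(y_true) / n
    ss_res = sum(e * e for e in errors)
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return {
        "MSE": mse,
        "RMSE": sqrt(mse),
        "MAE": mae,
        "MAPE": mape,
        "R2": 1.0 - ss_res / ss_tot,
    }

# Hypothetical answer-quality targets and model predictions.
y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.1, 1.9, 3.2, 3.8]
metrics = regression_metrics(y_true, y_pred)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

Lower is better for the four error measures, while R2 closer to 1 means the model explains more of the variance in the target.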

[426] OmniTFT: Omni Target Forecasting for Vital Signs and Laboratory Result Trajectories in Multi Center ICU Data

Wanzhe Xu, Yutong Dai, Yitao Yang, Martin Loza, Weihang Zhang, Yang Cui, Xin Zeng, Sung Joon Park, Kenta Nakai

Main category: cs.LG

TL;DR: OmniTFT is a deep learning framework that jointly forecasts vital signs and laboratory results in ICUs using Temporal Fusion Transformer with novel strategies for handling data heterogeneity and improving robustness.

Details

Motivation: Accurate prediction of vital signs and lab results is crucial for ICU care, but existing methods struggle with noisy vital signs, missing lab values, measurement lags, and device bias, making integrative forecasting challenging.

Method: Proposes OmniTFT based on Temporal Fusion Transformer with four novel strategies: sliding window equalized sampling, frequency-aware embedding shrinkage, hierarchical variable selection, and influence-aligned attention calibration to handle data heterogeneity.

Result: Achieves substantial performance improvement for both vital signs and laboratory results on MIMIC-III, MIMIC-IV, and eICU datasets, with interpretable attention patterns consistent with known pathophysiology.

Conclusion: OmniTFT enables unified modeling of multiple heterogeneous clinical targets while preserving cross-institutional generalizability, showing potential for quantitative decision support in clinical care.

Abstract: Accurate multivariate time-series prediction of vital signs and laboratory results is crucial for early intervention and precision medicine in intensive care units (ICUs). However, vital signs are often noisy and exhibit rapid fluctuations, while laboratory tests suffer from missing values, measurement lags, and device-specific bias, making integrative forecasting highly challenging. To address these issues, we propose OmniTFT, a deep learning framework that jointly learns and forecasts high-frequency vital signs and sparsely sampled laboratory results based on the Temporal Fusion Transformer (TFT). Specifically, OmniTFT implements four novel strategies to enhance performance: sliding window equalized sampling to balance physiological states, frequency-aware embedding shrinkage to stabilize rare-class representations, hierarchical variable selection to guide model attention toward informative feature clusters, and influence-aligned attention calibration to enhance robustness during abrupt physiological changes. By reducing the reliance on target-specific architectures and extensive feature engineering, OmniTFT enables unified modeling of multiple heterogeneous clinical targets while preserving cross-institutional generalizability. Across forecasting tasks, OmniTFT achieves substantial performance improvement for both vital signs and laboratory results on the MIMIC-III, MIMIC-IV, and eICU datasets. Its attention patterns are interpretable and consistent with known pathophysiology, underscoring its potential utility for quantitative decision support in clinical care.

[427] Efficient Inference Using Large Language Models with Limited Human Data: Fine-Tuning then Rectification

Lei Wang, Zikun Ye, Jinglong Zhao

Main category: cs.LG

TL;DR: A framework combining fine-tuning and rectification of LLMs for market research, with optimal sample allocation between stages to minimize prediction error variance.

Details

Motivation: To improve LLM performance in market research by addressing both alignment with human responses (fine-tuning) and bias correction (rectification) through optimal resource allocation.

Method: Developed a framework that combines fine-tuning (with variance minimization objective) and rectification, using empirical scaling laws to optimally split limited labeled samples between the two stages.

Result: Empirical analysis shows improved estimation and inference performance compared to using either fine-tuning or rectification alone.

Conclusion: The combined framework with optimal sample allocation between fine-tuning and rectification stages outperforms individual approaches, providing better LLM performance for market research applications.

Abstract: Driven by recent advances in artificial intelligence (AI), a growing body of work demonstrates the potential of using large language models (LLMs) to generate human-like responses in market research and social science applications. Two primary approaches can be applied to improve the performance of LLMs: fine-tuning, which aligns LLM predictions more closely with human responses, and rectification, which corrects biases in LLM outputs. In this paper, we develop a framework that combines fine-tuning and rectification, and optimally allocates limited labeled samples across the two stages. Unlike the conventional objective that minimizes the mean squared prediction errors, we propose to minimize the variance of the prediction errors as the fine-tuning objective, which is optimal for the downstream rectification stage. Building on this insight, we leverage empirical scaling laws to develop a data-driven method for optimally splitting samples between the fine-tuning and rectification stages. Empirical analysis validates our framework, demonstrating improved estimation and inference performance compared to using either fine-tuning or rectification alone.
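The claim that error variance, rather than mean squared error, is what matters for rectification can be illustrated with a small sketch. The abstract does not specify the rectification estimator, so the code below assumes a prediction-powered-inference-style mean correction (LLM mean on a large unlabeled pool plus the mean labeled error); the function name and the synthetic data are hypothetical.

```python
import numpy as np

def rectified_mean(llm_unlabeled, llm_labeled, human_labeled):
    """Assumed rectifier: LLM mean on the large pool plus the mean labeled error."""
    errors = human_labeled - llm_labeled
    estimate = llm_unlabeled.mean() + errors.mean()
    # Estimator variance: prediction spread over the large pool
    # plus error spread over the small labeled set.
    variance = (llm_unlabeled.var(ddof=1) / len(llm_unlabeled)
                + errors.var(ddof=1) / len(errors))
    return estimate, variance

rng = np.random.default_rng(0)
truth = rng.normal(3.0, 1.0, size=20_000)  # synthetic "human" responses

# Two hypothetical LLMs with similar MSE (about 0.65 vs 0.64):
llm_biased = truth + 0.8 + rng.normal(0.0, 0.1, size=truth.size)  # large bias, low error variance
llm_noisy = truth + rng.normal(0.0, 0.8, size=truth.size)         # no bias, high error variance

labeled, unlabeled = slice(0, 200), slice(200, None)
est_b, var_b = rectified_mean(llm_biased[unlabeled], llm_biased[labeled], truth[labeled])
est_n, var_n = rectified_mean(llm_noisy[unlabeled], llm_noisy[labeled], truth[labeled])
print(f"biased LLM: estimate={est_b:.3f}, variance={var_b:.2e}")
print(f"noisy LLM:  estimate={est_n:.3f}, variance={var_n:.2e}")
```

Under this assumed estimator, a constant bias cancels in the correction term, so the low-variance model yields the tighter rectified estimate even though its raw MSE is no better.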

[428] To Align or Not to Align: Strategic Multimodal Representation Alignment for Optimal Performance

Wanlong Fang, Tianle Zhang, Alvin Chan

Main category: cs.LG

TL;DR: Explicit multimodal alignment’s impact on performance depends on data redundancy - optimal alignment strength balances modality-specific signals and shared information.

Details

Motivation: Prior research only observed natural alignment correlations without systematically studying effects of explicit alignment enforcement between modalities.

Method: Introduced controllable contrastive learning module to precisely manipulate alignment strength during training across synthetic and real datasets with different modality information structures.

Result: Optimal alignment strength depends on modality redundancy - explicit alignment improves performance when properly balanced with modality-specific signals.

Conclusion: Provides practical guidance on when and how to apply explicit alignment for optimal unimodal encoder performance based on data characteristics.

Abstract: Multimodal learning often relies on aligning representations across modalities to enable effective information integration, an approach traditionally assumed to be universally beneficial. However, prior research has primarily taken an observational approach, examining naturally occurring alignment in multimodal data and exploring its correlation with model performance, without systematically studying the direct effects of explicitly enforced alignment between representations of different modalities. In this work, we investigate how explicit alignment influences both model performance and representation alignment under different modality-specific information structures. Specifically, we introduce a controllable contrastive learning module that enables precise manipulation of alignment strength during training, allowing us to explore when explicit alignment improves or hinders performance. Our results on synthetic and real datasets under different data characteristics show that the impact of explicit alignment on the performance of unimodal models is related to the characteristics of the data: the optimal level of alignment depends on the amount of redundancy between the different modalities. We identify an optimal alignment strength that balances modality-specific signals and shared redundancy in the mixed information distributions. This work provides practical guidance on when and how explicit alignment should be applied to achieve optimal unimodal encoder performance.
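One way to picture a controllable alignment term is a contrastive loss added to the task loss with a tunable weight. The sketch below is an assumption, not the paper's module: it uses a one-directional InfoNCE loss in NumPy, and `alignment_strength`, `info_nce`, and `total_loss` are hypothetical names.

```python
import numpy as np

def info_nce(za, zb, temperature=0.1):
    """InfoNCE loss: row i of za should match row i of zb."""
    za = za / np.linalg.norm(za, axis=1, keepdims=True)
    zb = zb / np.linalg.norm(zb, axis=1, keepdims=True)
    logits = za @ zb.T / temperature
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))

def total_loss(task_loss, za, zb, alignment_strength):
    """Task loss plus a weighted alignment term; strength 0 disables alignment."""
    return task_loss + alignment_strength * info_nce(za, zb)

rng = np.random.default_rng(0)
za = rng.normal(size=(8, 16))
aligned = info_nce(za, za)                           # perfectly aligned pairs
unaligned = info_nce(za, rng.normal(size=(8, 16)))   # unrelated pairs
print(aligned, unaligned)
```

Sweeping `alignment_strength` from zero upward is the kind of knob the abstract describes; where the optimum lies would depend on how much information the modalities share.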

[429] The Generalized Proximity Forest

Ben Shaw, Adam Rustad, Sofia Pelagalli Maia, Jake S. Rhodes, Kevin R. Moon

Main category: cs.LG

TL;DR: The paper extends Random Forest proximities to all supervised distance-based machine learning contexts through a generalized Proximity Forest model, including regression variants and meta-learning capabilities.

Details

Motivation: Random Forest proximities are useful for various tasks but depend on RF model success, which isn't ideal in all contexts. The work aims to extend RF proximities to all supervised distance-based learning scenarios.

Method: Introduces generalized Proximity Forest model extending RF proximities to all distance-based supervised learning, adds regression variant, and uses PF as meta-learning framework for supervised imputation with pre-trained classifiers.

Result: Experimental results show unique advantages of generalized PF model compared to both Random Forest and k-nearest neighbors models.

Conclusion: The generalized Proximity Forest model successfully extends RF proximities to broader contexts, providing enhanced capabilities for distance-based supervised learning tasks including regression and meta-learning applications.

Abstract: Recent work has demonstrated the utility of Random Forest (RF) proximities for various supervised machine learning tasks, including outlier detection, missing data imputation, and visualization. However, the utility of the RF proximities depends upon the success of the RF model, which itself is not the ideal model in all contexts. RF proximities have recently been extended to time series by means of the distance-based Proximity Forest (PF) model, among others, affording time series analysis with the benefits of RF proximities. In this work, we introduce the generalized PF model, thereby extending RF proximities to all contexts in which supervised distance-based machine learning can occur. Additionally, we introduce a variant of the PF model for regression tasks. We also introduce the notion of using the generalized PF model as a meta-learning framework, extending supervised imputation capability to any pre-trained classifier. We experimentally demonstrate the unique advantages of the generalized PF model compared with both the RF model and the k-nearest neighbors model.
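For readers unfamiliar with forest proximities, the classic definition is the fraction of trees in which two samples land in the same leaf. The sketch below implements only that classic definition; the generalized PF proximity is not specified in the abstract, and the leaf-index matrix here is toy data (in practice it could come from something like scikit-learn's `RandomForestClassifier.apply`).

```python
import numpy as np

def proximity_from_leaves(leaves):
    """leaves[i, t] is the leaf that sample i reaches in tree t.

    Returns an (n, n) matrix whose (i, j) entry is the fraction of
    trees in which samples i and j share a leaf.
    """
    leaves = np.asarray(leaves)
    same_leaf = leaves[:, None, :] == leaves[None, :, :]
    return same_leaf.mean(axis=2)

# Toy example: 3 samples, 3 trees.
leaves = np.array([
    [0, 1, 2],
    [0, 1, 3],
    [4, 5, 3],
])
prox = proximity_from_leaves(leaves)
print(prox)
```

The broadcasted comparison needs O(n^2 * trees) memory, which is fine for a toy example but is exactly the kind of cost that limits proximity analysis on large datasets.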

[430] Generative Model-Aided Continual Learning for CSI Feedback in FDD mMIMO-OFDM Systems

Guijun Liu, Yuwen Cao, Tomoaki Ohtsuki, Jiguang He, Shahid Mumtaz

Main category: cs.LG

TL;DR: Proposes a GAN-based continual learning approach for CSI feedback in mMIMO-OFDM systems to address catastrophic forgetting and adapt to dynamic environments without retraining.

Details

Motivation: Existing CSI feedback models struggle with user mobility and dynamic environments, requiring retraining for new distributions and suffering from catastrophic forgetting when returning to previous environments.

Method: Uses a GAN generator as a memory unit to preserve knowledge from past environments, enabling continual learning without forgetting previous tasks.

Result: Simulation results show enhanced generalization capability while maintaining low memory overhead, with seamless integration into other advanced CSI feedback models.

Conclusion: The proposed GAN-based approach provides robust and adaptable CSI feedback that handles dynamic environments effectively without catastrophic forgetting.

Abstract: Deep autoencoder (DAE) frameworks have demonstrated their effectiveness in reducing channel state information (CSI) feedback overhead in massive multiple-input multiple-output (mMIMO) orthogonal frequency division multiplexing (OFDM) systems. However, existing CSI feedback models struggle to adapt to dynamic environments caused by user mobility, requiring retraining when encountering new CSI distributions. Moreover, returning to previously encountered environments often leads to performance degradation due to catastrophic forgetting. Continual learning involves enabling models to incorporate new information while maintaining performance on previously learned tasks. To address these challenges, we propose a generative adversarial network (GAN)-based learning approach for CSI feedback. By using a GAN generator as a memory unit, our method preserves knowledge from past environments and ensures consistently high performance across diverse scenarios without forgetting. Simulation results show that the proposed approach enhances the generalization capability of the DAE framework while maintaining low memory overhead. Furthermore, it can be seamlessly integrated with other advanced CSI feedback models, highlighting its robustness and adaptability.

[431] OpenCML: End-to-End Framework of Open-world Machine Learning to Learn Unknown Classes Incrementally

Jitendra Parmar, Praveen Singh Thakur

Main category: cs.LG

TL;DR: Proposes a novel open-world machine learning model that discovers unknown classes and performs incremental learning, enabling continual learning with improved performance over existing approaches.

Details

Motivation: Conventional machine learning models follow closed-world assumptions, limiting their ability to retain knowledge for future tasks, while automated systems need to learn about novel classes and previously known tasks.

Method: Two connected tasks: discovering unknown classes in data and creating novel classes, followed by performing class incremental learning for each new class to enable continual learning.

Result: Outperformed existing approaches in open-world learning, achieving highest average accuracy of 82.54% over four iterations and minimum accuracy of 65.87% in continuous learning.

Conclusion: The proposed model successfully enables continual learning by expanding system understanding of data over time through novel class discovery and incremental learning.

Abstract: Open-world machine learning is an emerging technique in artificial intelligence. Conventional machine learning models often follow closed-world assumptions, which can hinder their ability to retain previously learned knowledge for future tasks, whereas automated intelligence systems must learn about novel classes and previously known tasks. The proposed model learns novel classes in an open and continuous learning environment. It consists of two different but connected tasks. First, it discovers unknown classes in the data and creates novel classes; next, it performs class-incremental learning for each new class. Together, they enable continual learning, allowing the system to expand its understanding of the data and improve over time. The proposed model also outperformed existing approaches in open-world learning. Furthermore, it demonstrated strong performance in continuous learning, achieving a highest average accuracy of 82.54% over four iterations and a minimum accuracy of 65.87%.

[432] RFX: High-Performance Random Forests with GPU Acceleration and QLORA Compression

Chris Kuchar

Main category: cs.LG

TL;DR: RFX v1.0 is a production-ready Python implementation of Random Forests that solves the proximity matrix memory bottleneck using QLORA compression and TriBlock storage, enabling analysis of datasets up to 200,000+ samples with GPU acceleration.

Details

Motivation: Random Forest analysis has been limited to ~60,000 samples due to the memory bottleneck of proximity matrices, which prevents scaling to larger datasets.

Method: Four solutions: (1) QLORA compression for GPU proximity matrices (12,500x memory reduction), (2) CPU TriBlock storage with lossless compression, (3) SM-aware GPU batch sizing, (4) GPU-accelerated 3D MDS visualization from low-rank factors.

Result: Eliminates proximity memory bottleneck, scales from 1,000 to 200,000+ samples, achieves 1.4x GPU speedup over CPU, maintains 99% geometric structure preservation with QLORA compression.

Conclusion: RFX v1.0 enables proximity-based Random Forest analysis on datasets orders of magnitude larger than previously feasible, with production-ready classification following Breiman and Cutler’s methodology.

Abstract: RFX (Random Forests X), where X stands for compression or quantization, presents a production-ready implementation of Breiman and Cutler’s Random Forest classification methodology in Python. RFX v1.0 provides complete classification: out-of-bag error estimation, overall and local importance measures, proximity matrices with QLORA compression, case-wise analysis, and interactive visualization (rfviz)–all with CPU and GPU acceleration. Regression, unsupervised learning, CLIQUE importance, and RF-GAP proximity are planned for v2.0. This work introduces four solutions addressing the proximity matrix memory bottleneck limiting Random Forest analysis to ~60,000 samples: (1) QLORA (Quantized Low-Rank Adaptation) compression for GPU proximity matrices, reducing memory from 80GB to 6.4MB for 100k samples (12,500x compression with INT8 quantization) while maintaining 99% geometric structure preservation, (2) CPU TriBlock proximity–combining upper-triangle storage with block-sparse thresholding–achieving 2.7x memory reduction with lossless quality, (3) SM-aware GPU batch sizing achieving 95% GPU utilization, and (4) GPU-accelerated 3D MDS visualization computing embeddings directly from low-rank factors using power iteration. Validation across four implementation modes (GPU/CPU x case-wise/non-case-wise) demonstrates correct implementation. GPU achieves 1.4x speedup over CPU for overall importance with 500+ trees. Proximity computation scales from 1,000 to 200,000+ samples (requiring GPU QLORA), with CPU TriBlock filling the gap for medium-scale datasets (10K-50K samples). RFX v1.0 eliminates the proximity memory bottleneck, enabling proximity-based Random Forest analysis on datasets orders of magnitude larger than previously feasible. Open-source production-ready classification following Breiman and Cutler’s original methodology.
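The quoted memory figures can be sanity-checked with simple arithmetic. The abstract gives 80GB, 6.4MB, and 12,500x for 100k samples but not the underlying layout, so the sketch assumes a dense float64 proximity matrix and two INT8 low-rank factors of rank 32; those three choices are inferred only because they reproduce the stated numbers.

```python
def dense_bytes(n_samples, bytes_per_entry=8):
    """Full n x n proximity matrix (float64 assumed)."""
    return n_samples * n_samples * bytes_per_entry

def low_rank_bytes(n_samples, rank, bytes_per_entry=1, n_factors=2):
    """Two n x rank factors stored as INT8 (rank and factor count assumed)."""
    return n_factors * n_samples * rank * bytes_per_entry

n = 100_000
dense = dense_bytes(n)                    # 80 GB in decimal units
compressed = low_rank_bytes(n, rank=32)   # 6.4 MB in decimal units
ratio = dense / compressed

print(f"dense: {dense / 1e9:.1f} GB")
print(f"low-rank INT8: {compressed / 1e6:.1f} MB")
print(f"compression: {ratio:,.0f}x")
```

The dense cost grows quadratically in the number of samples while the factorized cost grows linearly, which is why the gap widens so quickly at this scale.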

[433] A Systematic Study of Compression Ordering for Large Language Models

Shivansh Chhawri, Rahul Mahadik, Suparna Rooj

Main category: cs.LG

TL;DR: Systematic analysis of LLM compression techniques shows that sequence order significantly impacts model quality, with Pruning-Knowledge Distillation-Quantization (P-KD-Q) achieving optimal 3.68x compression while preserving performance.

Details

Motivation: LLMs require substantial computational resources, making compression essential for deployment in constrained environments, but the interactions and optimal sequencing of compression techniques remain unclear.

Method: Systematically examined knowledge distillation, structured pruning, and low-bit quantization techniques individually and in combination on Qwen2.5 3B model, evaluating multiple compression pipelines using perplexity, G-Eval, clarity, prompt alignment, and compression ratio metrics.

Result: Quantization provides greatest standalone compression, pruning introduces moderate quality degradation, and sequence order significantly affects final model quality. P-KD-Q sequence yields best balance with 3.68x compression ratio while preserving instruction-following and language understanding capabilities.

Conclusion: The study offers practical insight into designing effective, ordering-aware compression pipelines for deploying LLMs in resource-limited settings, highlighting that pipelines applying quantization early suffer severe performance degradation due to irreversible information loss.

Abstract: Large Language Models (LLMs) require substantial computational resources, making model compression essential for efficient deployment in constrained environments. The individual effects of the dominant compression techniques (knowledge distillation, structured pruning, and low-bit quantization) are well studied, but their interactions and optimal sequencing remain unclear. This work systematically examines how these techniques perform both independently and in combination when applied to the Qwen2.5 3B model. We evaluate multiple compression pipelines, including single-technique and proposed three-technique sequences, using perplexity, G-Eval, clarity, prompt alignment, and compression ratio as metrics. Our experiments show that quantization provides the greatest standalone compression, while pruning introduces moderate quality degradation. Critically, the ordering of techniques significantly affects the final model quality: the sequence Pruning, Knowledge Distillation, Quantization (P-KD-Q) yields the best balance, achieving a 3.68x compression ratio while preserving strong instruction-following and language understanding capabilities. Conversely, pipelines applying quantization early suffer severe performance degradation due to irreversible information loss that impairs subsequent training. Overall, this study offers practical insight into designing effective, ordering-aware compression pipelines for deploying LLMs in resource-limited settings.

[434] Can Vibe Coding Beat Graduate CS Students? An LLM vs. Human Coding Tournament on Market-driven Strategic Planning

Panayiotis Danassis, Naman Goel

Main category: cs.LG

TL;DR: LLMs struggle with real-world logistics optimization problems requiring strategic reasoning, with human-coded agents consistently outperforming LLM-coded ones in competitive auctions and delivery planning tasks.

Details

Motivation: Current benchmarks focus too much on unit-test pass rates and syntactic correctness, underestimating the complexity of real-world problems that require planning, optimization, and strategic interaction.

Method: Created a multi-agent benchmark based on Auction, Pickup, and Delivery Problem, evaluating 40 LLM-coded agents against 17 human-coded agents across 12 tournaments and ~40k matches.

Result: Human-coded agents consistently dominated (top 5 spots), 33 out of 40 LLM-coded agents were beaten by simple baselines, and LLMs failed to improve even when given the best human solution.

Conclusion: There’s a significant gap in LLMs’ ability to produce competitively working code for real-world scenarios, highlighting the need for reasoning-driven code synthesis evaluations.

Abstract: The rapid proliferation of Large Language Models (LLMs) has revolutionized AI-assisted code generation. This rapid development of LLMs has outpaced our ability to properly benchmark them. Prevailing benchmarks emphasize unit-test pass rates and syntactic correctness. Such metrics understate the difficulty of many real-world problems that require planning, optimization, and strategic interaction. We introduce a multi-agent reasoning-driven benchmark based on a real-world logistics optimization problem (Auction, Pickup, and Delivery Problem) that couples competitive auctions with capacity-constrained routing. The benchmark requires building agents that can (i) bid strategically under uncertainty and (ii) optimize planners that deliver tasks while maximizing profit. We evaluate 40 LLM-coded agents (by a wide range of state-of-the-art LLMs under multiple prompting methodologies, including vibe coding) against 17 human-coded agents developed before the advent of LLMs. Our results over 12 double all-play-all tournaments and ~40k matches demonstrate (i) a clear superiority of agents coded by humans (graduate students): the top 5 spots are consistently won by human-coded agents, (ii) the majority of LLM-coded agents (33 out of 40) are beaten by very simple baselines, and (iii) given the best human solution as an input and prompted to improve upon it, the best performing LLM makes the solution significantly worse instead of improving it. Our results highlight a gap in LLMs’ ability to produce code that works competitively in the real world, and motivate new evaluations that emphasize reasoning-driven code synthesis in real-world scenarios.

[435] Xmodel-2.5: 1.3B Data-Efficient Reasoning SLM

Yang Liu, Xiaolong Zhong, Ling Jiang

Main category: cs.LG

TL;DR: Xmodel-2.5 is a 1.3B-parameter small language model designed as a drop-in agent core, using μP training for hyperparameter transfer and Muon optimizer switching to improve reasoning performance by 4.58%.

Details

Motivation: Large language models have strong reasoning capabilities but are computationally expensive for edge or cost-sensitive deployments, creating need for efficient small models.

Method: Uses maximal-update parameterization (μP) for hyperparameter transfer from 20M proxy model, 1.4T-token curriculum training, and switches from AdamW to Muon optimizer during decay phase with FP8 mixed-precision training.

Result: Achieves 4.58% improvement in 13-task reasoning average by switching to Muon optimizer while keeping other hyperparameters fixed, demonstrating better downstream performance.

Conclusion: Xmodel-2.5 provides an efficient drop-in agent core with improved reasoning capabilities through optimized training techniques, making it suitable for edge deployments.

Abstract: Large language models deliver strong reasoning and tool-use skills, yet their computational demands make them impractical for edge or cost-sensitive deployments. We present Xmodel-2.5, a 1.3-billion-parameter small language model designed as a drop-in agent core. Training with maximal-update parameterization (μP) allows hyper-parameters tuned on a 20M-parameter proxy to transfer directly to the full model, even under the parameter-tied tie-word-embedding architecture. A 1.4T-token Warmup–Stable–Decay curriculum is used, and we further show that switching from AdamW to Muon during the decay phase improves the 13-task reasoning average by 4.58% while keeping every other hyper-parameter fixed, verifying that early AdamW stability can be paired with late Muon sharpening for better downstream performance. FP8-mixed-precision training balances accuracy and throughput. All checkpoints, recipes, and evaluation code are released under the Apache-2.0 license (https://huggingface.co/XiaoduoAILab/Xmodel-2.5 and https://huggingface.co/XiaoduoAILab/Xmodel-2.5-history for training checkpoints). Training code and evaluation harness: https://github.com/XiaoduoAILab/Xmodel-2.5.
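A Warmup-Stable-Decay schedule with an optimizer switch at the start of the decay phase might be sketched as follows. The abstract does not give the warmup length, decay length, or decay shape, so the linear ramps, the step counts, and the function names below are all assumptions for illustration.

```python
def wsd_lr(step, total_steps, warmup_steps, decay_steps, peak_lr, min_lr=0.0):
    """Warmup-Stable-Decay learning rate (linear warmup and linear decay assumed)."""
    decay_start = total_steps - decay_steps
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps   # warmup ramp
    if step < decay_start:
        return peak_lr                               # stable plateau
    progress = (step - decay_start) / decay_steps
    return peak_lr + (min_lr - peak_lr) * progress   # decay toward min_lr

def optimizer_name(step, total_steps, decay_steps):
    """AdamW through warmup and plateau, Muon once the decay phase begins."""
    return "muon" if step >= total_steps - decay_steps else "adamw"

# Hypothetical run: 1,000 steps, 10 warmup steps, 100 decay steps.
for step in (0, 9, 500, 899, 900, 999):
    lr = wsd_lr(step, 1000, 10, 100, peak_lr=1e-3)
    print(step, optimizer_name(step, 1000, 100), f"{lr:.2e}")
```

In a real training loop the switch would also require constructing the second optimizer and deciding how to handle its state, which this sketch leaves out.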

[436] PeriodNet: Boosting the Potential of Attention Mechanism for Time Series Forecasting

Bowen Zhao, Huanlai Xing, Zhiwen Xiao, Jincheng Peng, Li Feng, Xinhan Wang, Rong Qu, Hui Li

Main category: cs.LG

TL;DR: PeriodNet introduces a novel attention-based architecture for time series forecasting with period attention mechanisms and iterative grouping, achieving state-of-the-art performance on multiple datasets.

Details

Motivation: Despite attention mechanisms' success in NLP with models like BERT and GPT, their application in time series forecasting has been disappointing. There's a need for better network structures specifically designed for TSF tasks.

Method: Proposes PeriodNet with period attention and sparse period attention for adjacent period analysis, iterative grouping mechanism for cross-variable redundancy reduction, and redesigned Transformer architecture with period diffuser for multi-period prediction.

Result: Outperforms six state-of-the-art models on eight datasets in both univariate and multivariate TSF scenarios. Achieves 22% relative improvement for 720-length time series forecasting compared to conventional encoder-decoder Transformer models.

Conclusion: PeriodNet demonstrates superior performance in time series forecasting through its novel period-based attention mechanisms and architectural redesign, effectively addressing limitations of traditional attention approaches in TSF.

Abstract: The attention mechanism has demonstrated remarkable potential in sequence modeling, exemplified by its successful application in natural language processing with models such as Bidirectional Encoder Representations from Transformers (BERT) and Generative Pre-trained Transformer (GPT). Despite these advancements, its utilization in time series forecasting (TSF) has yet to meet expectations. Exploring a better network structure for attention in TSF holds immense significance across various domains. In this paper, we present PeriodNet with a brand new structure to forecast univariate and multivariate time series. PeriodNet incorporates period attention and sparse period attention mechanism for analyzing adjacent periods. It enhances the mining of local characteristics, periodic patterns, and global dependencies. For efficient cross-variable modeling, we introduce an iterative grouping mechanism which can directly reduce the cross-variable redundancy. To fully leverage the extracted features on the encoder side, we redesign the entire architecture of the vanilla Transformer and propose a period diffuser for precise multi-period prediction. Through comprehensive experiments conducted on eight datasets, we demonstrate that PeriodNet outperforms six state-of-the-art models in both univariate and multivariate TSF scenarios in terms of mean square error and mean absolute error. In particular, PeriodNet achieves a relative improvement of 22% when forecasting time series with a length of 720, in comparison to other models based on the conventional encoder-decoder Transformer architecture.

[437] Hierarchical Dual-Strategy Unlearning for Biomedical and Healthcare Intelligence Using Imperfect and Privacy-Sensitive Medical Data

Yi Zhang, Tianxiang Xu, Zijian Li, Chao Zhang, Kunyu Zhang, Zhan Gao, Meinuo Li, Xiaohan Zhang, Qichao Qi, Bing Chen

Main category: cs.LG

TL;DR: A hierarchical dual-strategy framework for selective knowledge unlearning in LLMs that removes specialized medical knowledge while preserving fundamental competencies, achieving high forgetting rates with minimal parameter modifications.

Details

Motivation: Address privacy risks from LLM training data memorization in healthcare contexts, particularly for imperfect or sensitive patient information, while meeting regulatory compliance and ethical standards.

Method: Hierarchical dual-strategy framework combining geometric-constrained gradient updates for selective parameter modulation and concept-aware token-level interventions using a unified four-level medical concept hierarchy.

Result: Achieved 82.7% forgetting rate and 88.5% knowledge preservation on MedMCQA and MHQA datasets, with robust privacy guarantees while modifying only 0.1% of parameters.

Conclusion: The framework effectively addresses critical needs for regulatory compliance, auditability, and ethical standards in clinical research by enabling precise knowledge removal while preserving essential medical competencies.

Abstract: Large language models (LLMs) exhibit exceptional performance but pose substantial privacy risks due to training data memorization, particularly within healthcare contexts involving imperfect or privacy-sensitive patient information. We present a hierarchical dual-strategy framework for selective knowledge unlearning that precisely removes specialized knowledge while preserving fundamental medical competencies. Our approach synergistically integrates geometric-constrained gradient updates to selectively modulate target parameters with concept-aware token-level interventions that distinguish between preservation-critical and unlearning-targeted tokens via a unified four-level medical concept hierarchy. Comprehensive evaluations on the MedMCQA (surgical) and MHQA (anxiety, depression, trauma) datasets demonstrate superior performance, achieving an 82.7% forgetting rate and 88.5% knowledge preservation. Notably, our framework maintains robust privacy guarantees while requiring modification of only 0.1% of parameters, addressing critical needs for regulatory compliance, auditability, and ethical standards in clinical research.

[438] Beyond Binary Classification: A Semi-supervised Approach to Generalized AI-generated Image Detection

Hong-Hanh Nguyen-Le, Van-Tuan Tran, Dinh-Thuc Nguyen, Nhien-An Le-Khac

Main category: cs.LG

TL;DR: TriDetect is a semi-supervised detector that improves cross-generator generalization by learning architectural patterns within fake images, addressing the limitation of current detectors failing when crossing between GAN and diffusion model architectures.

Details

Motivation: Current forensic detectors fail to generalize across different generator architectures (GANs vs diffusion models) due to fundamental differences in artifacts they produce. GANs create boundary artifacts from partial manifold coverage while diffusion models produce over-smoothing from complete coverage.

Method: Proposed TriDetect uses semi-supervised learning with balanced cluster assignment via Sinkhorn-Knopp algorithm and cross-view consistency to discover latent architectural patterns within the fake class, enabling the model to learn fundamental architectural distinctions.

Result: Evaluated on two standard benchmarks and three in-the-wild datasets against 13 baselines, demonstrating superior generalization capability to unseen generators compared to existing methods.

Conclusion: TriDetect effectively addresses cross-architecture generalization in synthetic image detection by explicitly learning architectural patterns, providing a robust solution for detecting images from unseen generators across different architectural families.

Abstract: The rapid advancement of generators (e.g., StyleGAN, Midjourney, DALL-E) has produced highly realistic synthetic images, posing significant challenges to digital media authenticity. These generators are typically based on a few core architectural families, primarily Generative Adversarial Networks (GANs) and Diffusion Models (DMs). A critical vulnerability in current forensics is the failure of detectors to achieve cross-generator generalization, especially when crossing architectural boundaries (e.g., from GANs to DMs). We hypothesize that this gap stems from fundamental differences in the artifacts produced by these distinct architectures. In this work, we provide a theoretical analysis explaining how the distinct optimization objectives of the GAN and DM architectures lead to different manifold coverage behaviors. We demonstrate that GANs permit partial coverage, often leading to boundary artifacts, while DMs enforce complete coverage, resulting in over-smoothing patterns. Motivated by this analysis, we propose the Triarchy Detector (TriDetect), a semi-supervised approach that enhances binary classification by discovering latent architectural patterns within the “fake” class. TriDetect employs balanced cluster assignment via the Sinkhorn-Knopp algorithm and a cross-view consistency mechanism, encouraging the model to learn fundamental architectural distinctions. We evaluate our approach on two standard benchmarks and three in-the-wild datasets against 13 baselines to demonstrate its generalization capability to unseen generators.
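Balanced cluster assignment with Sinkhorn-Knopp can be sketched as alternating column and row normalization of an exponentiated score matrix. This is a generic version of the algorithm, not TriDetect's implementation; the temperature `epsilon`, the iteration count, and the toy scores are assumptions.

```python
import numpy as np

def sinkhorn_knopp(scores, epsilon=1.0, n_iters=50):
    """Turn (n_samples, n_clusters) scores into balanced soft assignments.

    Each returned row sums to 1, and each cluster receives roughly
    n_samples / n_clusters total mass.
    """
    q = np.exp((scores - scores.max()) / epsilon)
    n, k = q.shape
    for _ in range(n_iters):
        q /= q.sum(axis=0, keepdims=True) * k   # equalize mass per cluster
        q /= q.sum(axis=1, keepdims=True) * n   # equalize mass per sample
    return q * n                                # rescale so rows sum to 1

rng = np.random.default_rng(0)
scores = rng.uniform(-1.0, 1.0, size=(12, 3))
scores[:, 0] += 0.5   # bias toward cluster 0, which a plain argmax would over-select

q = sinkhorn_knopp(scores)
print("row sums:", q.sum(axis=1).round(3))
print("cluster mass:", q.sum(axis=0).round(3))
```

The balancing constraint prevents the degenerate solution in which every fake image is assigned to a single cluster, which is presumably why it is useful for discovering architecture-level groups.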

[439] Learn the Ropes, Then Trust the Wins: Self-imitation with Progressive Exploration for Agentic Reinforcement Learning

Yulei Qin, Xiaoyu Tan, Zhengbao He, Gang Li, Haojia Lin, Zongyi Li, Zihan Xu, Yuchen Shi, Siqi Cai, Renting Rui, Shaofei Cai, Yuzheng Cai, Xuan Zhang, Sheng Ye, Ke Li, Xing Sun

Main category: cs.LG

TL;DR: SPEAR is a self-imitation learning method that balances exploration and exploitation for training agentic LLMs through curriculum scheduling of policy entropy, achieving significant performance improvements in various environments with minimal overhead.

Details

Motivation: Existing RL methods for LLMs struggle with the exploration-exploitation trade-off and suffer from instability due to mechanical entropy maximization causing multi-turn distribution shifting.

Method: Extends vanilla self-imitation learning with curriculum scheduling that harmonizes intrinsic reward shaping and self-imitation to gradually steer policy entropy across stages - promoting exploration early and exploitation later.

Result: In ALFWorld and WebShop, SPEAR increases the success rates of GRPO/GiGPO/Dr.BoT by up to 16.1%/5.1%/8.6% and 20.7%/11.8%/13.9%, respectively. In AIME24 and AIME25, it boosts Dr.BoT by up to 3.8% and 6.1%, at only 10%-25% extra theoretical complexity.

Conclusion: SPEAR provides an effective plug-and-play solution for progressive exploration-exploitation balance in agentic LLM training, achieving substantial performance gains with minimal computational overhead.

Abstract: Reinforcement learning (RL) is the dominant paradigm for sharpening strategic tool use capabilities of LLMs on long-horizon, sparsely-rewarded agent tasks, yet it faces a fundamental challenge of exploration-exploitation trade-off. Existing studies stimulate exploration through the lens of policy entropy, but such mechanical entropy maximization is prone to RL instability due to the multi-turn distribution shifting. In this paper, we target the progressive exploration-exploitation balance under the guidance of the agent’s own experiences without succumbing to either entropy collapsing or runaway divergence. We propose SPEAR, a self-imitation learning (SIL) recipe for training agentic LLMs. It extends the vanilla SIL, where a replay buffer stores good experience for off-policy update, by gradually steering the policy entropy across stages. Specifically, the proposed curriculum scheduling harmonizes intrinsic reward shaping and self-imitation to 1) expedite exploration via frequent tool interactions at the beginning, and 2) strengthen exploitation of successful tactics upon convergence towards familiarity with the environment. We also combine bag-of-tricks of industrial RL optimizations for a strong baseline Dr.BoT to demonstrate our effectiveness. In ALFWorld and WebShop, SPEAR increases the success rates of GRPO/GiGPO/Dr.BoT by up to 16.1%/5.1%/8.6% and 20.7%/11.8%/13.9%, respectively. In AIME24 and AIME25, SPEAR boosts Dr.BoT by up to 3.8% and 6.1%, respectively. Such gains incur only 10%-25% extra theoretical complexity and negligible runtime overhead in practice, demonstrating the plug-and-play scalability of SPEAR.

[440] Position: The Complexity of Perfect AI Alignment – Formalizing the RLHF Trilemma

Subramanyam Sahoo, Aman Chadha, Vinija Jain, Divya Chaudhary

Main category: cs.LG

TL;DR: The Alignment Trilemma proves RLHF systems cannot simultaneously achieve representativeness, tractability, and robustness - forcing practitioners to sacrifice representativeness, which explains documented RLHF pathologies.

Details

Motivation: To explain why RLHF systems face persistent trade-offs where improving safety reduces fairness, scaling becomes intractable, and robustness amplifies majority biases.

Method: Complexity-theoretic analysis integrating statistical learning theory and robust optimization to prove fundamental impossibility results about RLHF alignment requirements.

Result: Achieving both representativeness (ε≤0.01) and robustness (δ≤0.001) for global populations requires Ω(2^{d_context}) operations (super-polynomial), while current RLHF uses only 10^3-10^4 samples vs needed 10^7-10^8.

Conclusion: The trilemma explains RLHF pathologies and suggests navigating trade-offs through strategic relaxations of alignment requirements.

Abstract: Reinforcement Learning from Human Feedback (RLHF) is widely used for aligning large language models, yet practitioners face a persistent puzzle: improving safety often reduces fairness, scaling to diverse populations becomes computationally intractable, and making systems robust often amplifies majority biases. We formalize this tension as the Alignment Trilemma: no RLHF system can simultaneously achieve (i) epsilon-representativeness across diverse human values, (ii) polynomial tractability in sample and compute complexity, and (iii) delta-robustness against adversarial perturbations and distribution shift. Through a complexity-theoretic analysis integrating statistical learning theory and robust optimization, we prove that achieving both representativeness (epsilon <= 0.01) and robustness (delta <= 0.001) for global-scale populations requires Omega(2^{d_context}) operations, which is super-polynomial in the context dimensionality. We show that current RLHF implementations resolve this trilemma by sacrificing representativeness: they collect only 10^3–10^4 samples from homogeneous annotator pools while 10^7–10^8 samples are needed for true global representation. Our framework provides a unified explanation for documented RLHF pathologies including preference collapse, sycophancy, and systematic bias amplification. We conclude with concrete directions for navigating these fundamental trade-offs through strategic relaxations of alignment requirements.

[441] Profile Generators: A Link between the Narrative and the Binary Matrix Representation

Raoul H. Kutil, Georg Zimmermann, Barbara Strasser-Kirchweger, Christian Borgelt

Main category: cs.LG

TL;DR: Developed a symptom profile generator as an alternative to binary matrices for representing complex mental health disorders, enabling automated generation of valid symptom combinations and calculation of similarity metrics between disorders.

Details

Motivation: The DSM-5 provides detailed descriptions of cognitive disorders, but existing machine-actionable representations are inadequate for complex cases, and binary matrices become infeasible due to the vast number of symptom combinations.

Method: Created a symptom profile generator using lists, sets, and numbers with slight variations to represent diagnostic pathways, enabling automated generation of symptom combinations and development of profile reduction methods for similarity calculations.

Result: Successfully represented several psychotic disorders using generators, showing that binary matrix representations become unmanageably large for complex disorders, and developed methods to calculate specific similarity values between disorders.

Conclusion: The symptom profile generator provides a readable, adaptable, and comprehensive alternative to binary matrices for representing complex mental health disorders, enabling practical similarity analysis that was previously infeasible.

Abstract: Mental health disorders, particularly cognitive disorders defined by deficits in cognitive abilities, are described in detail in the DSM-5, which includes definitions and examples of signs and symptoms. A simplified, machine-actionable representation was developed to assess the similarity and separability of these disorders, but it is not suited for the most complex cases. Generating or applying a full binary matrix for similarity calculations is infeasible due to the vast number of symptom combinations. This research develops an alternative representation that links the narrative form of the DSM-5 with the binary matrix representation and enables automated generation of valid symptom combinations. Using a strict pre-defined format of lists, sets, and numbers with slight variations, complex diagnostic pathways involving numerous symptom combinations can be represented. This format, called the symptom profile generator (or simply generator), provides a readable, adaptable, and comprehensive alternative to a binary matrix while enabling easy generation of symptom combinations (profiles). Cognitive disorders, which typically involve multiple diagnostic criteria with several symptoms, can thus be expressed as lists of generators. Representing several psychotic disorders in generator form and generating all symptom combinations showed that matrix representations of complex disorders become too large to manage. The MPCS (maximum pairwise cosine similarity) algorithm cannot handle matrices of this size, prompting the development of a profile reduction method using targeted generator manipulation to find specific MPCS values between disorders. The generators allow easier creation of binary representations for large matrices and make it possible to calculate specific MPCS cases between complex disorders through conditional generators.
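To make the idea of a generator concrete, the sketch below expands "at least k of these symptoms" rules into symptom profiles and compares two disorders. It is an illustration under stated assumptions: the `(symptoms, min_count)` tuple format, the function names, and the symptom labels are hypothetical, and MPCS is read literally as the maximum cosine similarity over all cross-disorder profile pairs, which may differ from the paper's exact definition.

```python
import math
from itertools import combinations, product

def expand(generator):
    """Yield every valid symptom subset for one (symptoms, min_count) rule.
    The tuple format is an assumption, not the paper's actual syntax."""
    symptoms, min_count = generator
    ordered = sorted(symptoms)
    for size in range(min_count, len(ordered) + 1):
        for combo in combinations(ordered, size):
            yield frozenset(combo)

def profiles(generators):
    """Pick one valid subset per generator and merge them into full profiles."""
    choices = [list(expand(g)) for g in generators]
    return {frozenset().union(*pick) for pick in product(*choices)}

def cosine(p, q):
    """Cosine similarity of two binary profiles represented as symptom sets."""
    if not p or not q:
        return 0.0
    return len(p & q) / math.sqrt(len(p) * len(q))

def mpcs(profiles_a, profiles_b):
    """Maximum pairwise cosine similarity between two sets of profiles."""
    return max(cosine(p, q) for p in profiles_a for q in profiles_b)

# Hypothetical disorders: "at least 2 of s1..s3" versus "at least 1 of s2..s4".
disorder_a = profiles([({"s1", "s2", "s3"}, 2)])
disorder_b = profiles([({"s2", "s3", "s4"}, 1)])
similarity = mpcs(disorder_a, disorder_b)
```

The brute-force `mpcs` enumerates every profile pair, so its cost grows with the product of the two profile counts. That growth is the scaling problem the abstract describes and the reason a profile reduction step is needed for complex disorders.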

[442] TouchFormer: A Robust Transformer-based Framework for Multimodal Material Perception

Kailin Lyu, Long Xiao, Jianing Zeng, Junhao Dong, Xuexin Liu, Zhuojun Zou, Haoyue Yang, Lin Shu, Jie Hao

Main category: cs.LG

TL;DR: TouchFormer is a robust multimodal fusion framework that addresses limitations in non-visual material perception through adaptive feature integration and cross-instance regularization, achieving significant performance improvements on benchmark tasks.

Details

Motivation: Traditional vision-based methods degrade under visually impaired conditions, and existing multimodal approaches perform naive fusion without addressing modality-specific noise, missing modalities, and dynamic modality importance, leading to suboptimal performance.

Method: Proposes TouchFormer with Modality-Adaptive Gating (MAG) mechanism, intra- and inter-modality attention mechanisms for adaptive cross-modal feature integration, and Cross-Instance Embedding Regularization (CER) strategy for fine-grained material recognition.

Result: Achieves classification accuracy improvements of 2.48% on SSMC and 6.83% on USMC tasks compared to existing non-visual methods. Real-world robotic experiments validate effectiveness in environmental perception.

Conclusion: TouchFormer enables better robotic perception and interpretation, paving the way for deployment in safety-critical applications like emergency response and industrial automation. Code and datasets will be open-source.

Abstract: Traditional vision-based material perception methods often experience substantial performance degradation under visually impaired conditions, thereby motivating the shift toward non-visual multimodal material perception. Despite this, existing approaches frequently perform naive fusion of multimodal inputs, overlooking key challenges such as modality-specific noise, missing modalities common in real-world scenarios, and the dynamically varying importance of each modality depending on the task. These limitations lead to suboptimal performance across several benchmark tasks. In this paper, we propose a robust multimodal fusion framework, TouchFormer. Specifically, we employ a Modality-Adaptive Gating (MAG) mechanism and intra- and inter-modality attention mechanisms to adaptively integrate cross-modal features, enhancing model robustness. Additionally, we introduce a Cross-Instance Embedding Regularization (CER) strategy, which significantly improves classification accuracy in fine-grained subcategory material recognition tasks. Experimental results demonstrate that, compared to existing non-visual methods, the proposed TouchFormer framework achieves classification accuracy improvements of 2.48% and 6.83% on SSMC and USMC tasks, respectively. Furthermore, real-world robotic experiments validate TouchFormer’s effectiveness in enabling robots to better perceive and interpret their environment, paving the way for its deployment in safety-critical applications such as emergency response and industrial automation. The code and datasets will be open-source, and the videos are available in the supplementary materials.

[443] Row-stochastic matrices can provably outperform doubly stochastic matrices in decentralized learning

Bing Liu, Boao Kong, Limin Lu, Kun Yuan, Chengcheng Zhao

Main category: cs.LG

TL;DR: The paper analyzes two decentralized learning strategies for weighted global losses: embedding weights into local losses (doubly stochastic matrix) vs. keeping original losses with row-stochastic matrix. It develops a weighted Hilbert-space framework to obtain tighter convergence rates and reveals that row-stochastic matrices can converge faster despite smaller spectral gaps due to geometric properties.

Details

Motivation: To understand the fundamental differences between two natural strategies for handling heterogeneous node weights in decentralized learning, and to determine which approach yields better convergence performance beyond what Euclidean analysis can reveal.

Method: Developed a weighted Hilbert-space framework L²(λ;ℝᵈ) to analyze convergence rates, used Rayleigh-quotient and Loewner-order eigenvalue comparison, and derived conditions under which row-stochastic design converges faster.

Result: Found that row-stochastic matrices become self-adjoint in the weighted Hilbert space while doubly stochastic ones do not, creating additional penalty terms that amplify consensus error and slow convergence. Obtained strictly tighter convergence rates than Euclidean analysis.

Conclusion: The convergence difference arises from both spectral gaps and penalty terms. Row-stochastic design can converge faster even with smaller spectral gaps under certain conditions, leading to practical topology-design guidelines for decentralized learning.

Abstract: Decentralized learning often involves a weighted global loss with heterogeneous node weights $λ$. We revisit two natural strategies for incorporating these weights: (i) embedding them into the local losses to retain a uniform weight (and thus a doubly stochastic matrix), and (ii) keeping the original losses while employing a $λ$-induced row-stochastic matrix. Although prior work shows that both strategies yield the same expected descent direction for the global loss, it remains unclear whether the Euclidean-space guarantees are tight and what fundamentally differentiates their behaviors. To clarify this, we develop a weighted Hilbert-space framework $L^2(λ;\mathbb{R}^d)$ and obtain convergence rates that are strictly tighter than those from Euclidean analysis. In this geometry, the row-stochastic matrix becomes self-adjoint whereas the doubly stochastic one does not, creating additional penalty terms that amplify consensus error, thereby slowing convergence. Consequently, the difference in convergence arises not only from spectral gaps but also from these penalty terms. We then derive sufficient conditions under which the row-stochastic design converges faster even with a smaller spectral gap. Finally, by using a Rayleigh-quotient and Loewner-order eigenvalue comparison, we further obtain topology conditions that guarantee this advantage and yield practical topology-design guidelines.
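The self-adjointness claim can be checked numerically on a small graph. In the weighted inner product ⟨x, y⟩_λ = Σ_i λ_i x_i y_i, a matrix W is self-adjoint exactly when λ_i W_ij = λ_j W_ji for all i, j. The sketch below uses Metropolis-Hastings weights as one standard way to build a row-stochastic matrix with this property; whether this matches the paper's λ-induced matrix is an assumption, and the ring graph, the weights, and the function names are illustrative.

```python
def metropolis_weights(neighbors, lam):
    """Row-stochastic mixing matrix from Metropolis-Hastings weights.
    neighbors[i] lists the nodes adjacent to i (no self-loops)."""
    n = len(lam)
    d_max = max(len(nb) for nb in neighbors)
    W = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in neighbors[i]:
            W[i][j] = min(1.0, lam[j] / lam[i]) / d_max
        # The diagonal is still zero here, so this absorbs the leftover mass.
        W[i][i] = 1.0 - sum(W[i])
    return W

def adjoint_gap(W, lam):
    """Largest |lam_i W_ij - lam_j W_ji|; zero iff W is self-adjoint
    in the lam-weighted inner product."""
    n = len(lam)
    return max(abs(lam[i] * W[i][j] - lam[j] * W[j][i])
               for i in range(n) for j in range(n))

# Ring of four nodes with heterogeneous node weights (illustrative values).
ring = [[1, 3], [0, 2], [1, 3], [0, 2]]
lam = [0.1, 0.2, 0.3, 0.4]
W_row = metropolis_weights(ring, lam)          # row-stochastic, lam-reversible
W_ds = metropolis_weights(ring, [0.25] * 4)    # symmetric, doubly stochastic

gap_row = adjoint_gap(W_row, lam)  # numerically zero
gap_ds = adjoint_gap(W_ds, lam)    # strictly positive when lam is non-uniform
```

The symmetric doubly stochastic matrix is self-adjoint in the Euclidean inner product but not in the λ-weighted one, which is the asymmetry the abstract identifies as the source of the extra penalty terms.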

[444] Aspiration-based Perturbed Learning Automata in Games with Noisy Utility Measurements. Part A: Stochastic Stability in Non-zero-Sum Games

Georgios C. Chasparis

Main category: cs.LG

TL;DR: APLA is a novel payoff-based learning scheme that combines reinforcement with aspiration factors to address convergence limitations in distributed multi-player games, particularly in weakly-acyclic games where standard reinforcement learning fails to guarantee convergence to pure Nash equilibria.

Details

Motivation: Standard reinforcement learning schemes have limitations in distributed setups, especially in multi-player weakly-acyclic games where convergence to pure Nash equilibria cannot be guaranteed. Prior work only addressed potential and coordination games, leaving a gap for more general game types.

Method: Aspiration-based perturbed learning automata (APLA) - a novel payoff-based learning scheme where each player’s action selection probability is reinforced both by repeated selection and an aspiration factor that captures satisfaction level. The paper provides stochastic stability analysis in multi-player positive-utility games with noisy observations.

Result: The paper establishes equivalence between the induced infinite-dimensional Markov chain and a finite dimensional one, characterizing stochastic stability in generic non-zero-sum games. For weakly acyclic games, it further specializes the stochastic stability analysis.

Conclusion: APLA targets the limitations of standard reinforcement learning in distributed optimization for multi-player games by combining reinforcement with an aspiration factor. Part A characterizes stochastic stability in generic non-zero-sum games; the specialization to weakly acyclic games is left to the second part.

Abstract: Reinforcement-based learning has attracted considerable attention both in modeling human behavior as well as in engineering, for designing measurement- or payoff-based optimization schemes. Such learning schemes exhibit several advantages, especially in relation to filtering out noisy observations. However, they may exhibit several limitations when applied in a distributed setup. In multi-player weakly-acyclic games, and when each player applies an independent copy of the learning dynamics, convergence to (usually desirable) pure Nash equilibria cannot be guaranteed. Prior work has only focused on a small class of games, namely potential and coordination games. To address this main limitation, this paper introduces a novel payoff-based learning scheme for distributed optimization, namely aspiration-based perturbed learning automata (APLA). In this class of dynamics, and contrary to standard reinforcement-based learning schemes, each player’s probability distribution for selecting actions is reinforced both by repeated selection and an aspiration factor that captures the player’s satisfaction level. We provide a stochastic stability analysis of APLA in multi-player positive-utility games under the presence of noisy observations. This is the first part of the paper that characterizes stochastic stability in generic non-zero-sum games by establishing equivalence of the induced infinite-dimensional Markov chain with a finite dimensional one. In the second part, stochastic stability is further specialized to weakly acyclic games.

[445] Automating Deception: Scalable Multi-Turn LLM Jailbreaks

Adarsh Kumarappan, Ananya Mujoo

Main category: cs.LG

TL;DR: Automated pipeline generates large-scale multi-turn jailbreak datasets using psychological principles (Foot-in-the-Door), revealing significant differences in LLM vulnerability to conversational attacks.

Details

Motivation: Multi-turn conversational attacks using psychological manipulation techniques persistently threaten LLM safety, but defense progress is hindered by manual, unscalable dataset creation.

Method: Systematically operationalize FITD techniques into reproducible templates, creating 1,500 scenarios across illegal/offensive content. Evaluate 7 models from 3 LLM families under multi-turn vs single-turn conditions.

Result: GPT family shows significant vulnerability to conversational history (ASR increases up to 32 percentage points), Gemini 2.5 Flash is nearly immune, and Claude 3 Haiku shows strong but imperfect resistance.

Conclusion: Critical divergence in how safety architectures handle conversational context, highlighting need for defenses that resist narrative-based manipulation.

Abstract: Multi-turn conversational attacks, which leverage psychological principles like Foot-in-the-Door (FITD), where a small initial request paves the way for a more significant one, to bypass safety alignments, pose a persistent threat to Large Language Models (LLMs). Progress in defending against these attacks is hindered by a reliance on manual, hard-to-scale dataset creation. This paper introduces a novel, automated pipeline for generating large-scale, psychologically-grounded multi-turn jailbreak datasets. We systematically operationalize FITD techniques into reproducible templates, creating a benchmark of 1,500 scenarios across illegal activities and offensive content. We evaluate seven models from three major LLM families under both multi-turn (with history) and single-turn (without history) conditions. Our results reveal stark differences in contextual robustness: models in the GPT family demonstrate a significant vulnerability to conversational history, with Attack Success Rates (ASR) increasing by as much as 32 percentage points. In contrast, Google’s Gemini 2.5 Flash exhibits exceptional resilience, proving nearly immune to these attacks, while Anthropic’s Claude 3 Haiku shows strong but imperfect resistance. These findings highlight a critical divergence in how current safety architectures handle conversational context and underscore the need for defenses that can resist narrative-based manipulation.

[446] Shortcut Invariance: Targeted Jacobian Regularization in Disentangled Latent Space

Shivam Pal, Sakshi Varshney, Piyush Rai

Main category: cs.LG

TL;DR: A method to make classifiers functionally invariant to shortcut features by injecting targeted latent noise during training, achieving state-of-the-art OOD performance without complex representation partitioning.

Details

Motivation: Deep neural networks often learn spurious correlations (shortcuts) that cause failures in out-of-distribution generalization. Existing robust representation approaches are complex and brittle.

Method: Use disentangled latent space to isolate spurious and core features, identify shortcut features via label correlation, and inject targeted anisotropic latent noise during training to desensitize the classifier.

Result: Achieves state-of-the-art out-of-distribution performance on established shortcut learning benchmarks.

Conclusion: Learning robust functions through targeted noise injection is simpler and more effective than complex representation partitioning for shortcut robustness.

Abstract: Deep neural networks are prone to learning shortcuts, spurious and easily learned correlations in training data that cause severe failures in out-of-distribution (OOD) generalization. A dominant line of work seeks robustness by learning a robust representation, often explicitly partitioning the latent space into core and spurious components; this approach can be complex, brittle, and difficult to scale. We take a different approach: instead of a robust representation, we learn a robust function. We present a simple and effective training method that renders the classifier functionally invariant to shortcut signals. Our method operates within a disentangled latent space, which is essential as it isolates spurious and core features into distinct dimensions. This separation enables the identification of candidate shortcut features by their strong correlation with the label, used as a proxy for semantic simplicity. The classifier is then desensitized to these features by injecting targeted, anisotropic latent noise during training. We analyze this as targeted Jacobian regularization, which forces the classifier to ignore spurious features and rely on more complex, core semantic signals. The result is state-of-the-art OOD performance on established shortcut learning benchmarks.
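The two steps described in the abstract, flagging latent dimensions by label correlation and then adding noise only along those dimensions, can be sketched as follows. This is a toy illustration rather than the authors' method: the correlation `threshold`, the noise scale `sigma`, the hard thresholding rule, and all function names are assumptions, and the latent codes are made-up numbers.

```python
import math
import random

def pearson(xs, ys):
    """Pearson correlation; returns 0.0 when either input is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy) if vx > 0 and vy > 0 else 0.0

def noise_scales(latents, labels, threshold=0.9, sigma=1.0):
    """Per-dimension noise std: sigma for dimensions whose absolute label
    correlation reaches the (hypothetical) threshold, 0.0 for the rest."""
    dims = len(latents[0])
    return [
        sigma if abs(pearson([z[d] for z in latents], labels)) >= threshold else 0.0
        for d in range(dims)
    ]

def perturb(z, scales, rng):
    """Add anisotropic Gaussian noise to a single latent vector."""
    return [v + s * rng.gauss(0.0, 1.0) for v, s in zip(z, scales)]

# Toy latent codes: dimension 0 tracks the label closely, dimension 1 does not.
latents = [[0.0, 0.3], [0.1, 0.9], [1.0, 0.2], [0.9, 0.8]]
labels = [0, 0, 1, 1]
scales = noise_scales(latents, labels)
noisy = [perturb(z, scales, random.Random(0)) for z in latents]
```

Because noise is injected only along the flagged dimensions, a classifier trained on the perturbed codes is penalized for being sensitive to them, which is the intuition behind reading the procedure as a targeted Jacobian penalty.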

[447] Learning to Solve Weighted Maximum Satisfiability with a Co-Training Architecture

Kaidi Wan, Minghao Liu, Yong Lai

Main category: cs.LG

TL;DR: SplitGNN is a GNN-based approach that solves weighted MaxSAT problems using co-training with supervised message passing and unsupervised solution boosting, achieving faster convergence and better performance than other GNN methods and modern heuristic solvers.

Details

Motivation: To address the limitations of existing GNN-based approaches in solving challenging weighted maximum satisfiability (MaxSAT) problems, particularly on larger and harder instances where traditional methods struggle.

Method: Proposes SplitGNN with co-training architecture combining supervised message passing and unsupervised solution boosting layer. Introduces edge-splitting factor graph representation based on spanning tree generation and edge classification. Implements GPU-accelerated layer with efficient score calculation and relaxation-based optimization.

Result: Achieves 3x faster convergence and better predictions compared to other GNN architectures. Outperforms modern heuristic MaxSAT solvers on larger and harder weighted benchmarks. Demonstrates exceptional generalization abilities on diverse structural instances.

Conclusion: SplitGNN effectively solves weighted MaxSAT problems through its novel graph representation and co-training approach, showing superior performance and generalization compared to existing methods.

Abstract: We propose SplitGNN, a graph neural network (GNN)-based approach that learns to solve the weighted maximum satisfiability (MaxSAT) problem. SplitGNN incorporates a co-training architecture consisting of a supervised message passing mechanism and an unsupervised solution boosting layer. A new graph representation called the edge-splitting factor graph is proposed to provide more structural information for learning, which is based on spanning tree generation and edge classification. To improve the solutions on challenging and weighted instances, we implement a GPU-accelerated layer applying efficient score calculation and relaxation-based optimization. Experiments show that SplitGNN achieves 3x faster convergence and better predictions compared with other GNN-based architectures. More notably, SplitGNN successfully finds solutions that outperform modern heuristic MaxSAT solvers on much larger and harder weighted MaxSAT benchmarks, and demonstrates exceptional generalization abilities on diverse structural instances.

[448] When Should Neural Data Inform Welfare? A Critical Framework for Policy Uses of Neuroeconomics

Yiven, Zhu

Main category: cs.LG

TL;DR: This paper develops a framework for determining when neural data can legitimately inform welfare judgments in policy contexts, rather than just describing behavior.

Details

Motivation: Neuroeconomics promises to ground welfare analysis in neural evidence, but there's a need to establish when neural data can legitimately inform welfare judgments versus merely describing behavior, especially as policy and commercial actors increasingly use neural data to justify interventions.

Method: Develops a non-empirical, model-based framework linking neural signals, computational decision models, and normative welfare criteria using an actor-critic reinforcement-learning model to formalize the inference path from neural activity to welfare claims.

Result: Shows neural evidence constrains welfare judgments only when: neural-computational mapping is well validated, decision model identifies ’true’ interests vs context-dependent mistakes, and welfare criterion is explicitly specified. Creates a Neuroeconomic Welfare Inference Checklist for regulators and NeuroAI designers.

Conclusion: Brains and artificial agents are value-learning systems, but internal reward signals (biological or artificial) are computational quantities that cannot be treated as welfare measures without an explicit normative model.

Abstract: Neuroeconomics promises to ground welfare analysis in neural and computational evidence about how people value outcomes, learn from experience and exercise self-control. At the same time, policy and commercial actors increasingly invoke neural data to justify paternalistic regulation, “brain-based” interventions and new welfare measures. This paper asks under what conditions neural data can legitimately inform welfare judgements for policy rather than merely describing behaviour. I develop a non-empirical, model-based framework that links three levels: neural signals, computational decision models and normative welfare criteria. Within an actor-critic reinforcement-learning model, I formalise the inference path from neural activity to latent values and prediction errors and then to welfare claims. I show that neural evidence constrains welfare judgements only when the neural-computational mapping is well validated, the decision model identifies “true” interests versus context-dependent mistakes, and the welfare criterion is explicitly specified and defended. Applying the framework to addiction, neuromarketing and environmental policy, I derive a Neuroeconomic Welfare Inference Checklist for regulators and for designers of NeuroAI systems. The analysis treats brains and artificial agents as value-learning systems while showing that internal reward signals, whether biological or artificial, are computational quantities and cannot be treated as welfare measures without an explicit normative model.

[449] Online Sparse Feature Selection in Data Streams via Differential Evolution

Ruiyang Xu

Main category: cs.LG

TL;DR: Proposes ODESFS for online sparse feature selection in data streams with missing values, using latent factor analysis for imputation and differential evolution for feature evaluation.

Details

Motivation: Address limitations in existing OS2FS approaches that suffer from poor feature evaluation leading to performance degradation in high-dimensional streaming data with missing values.

Method: Combines latent factor analysis for missing value imputation with differential evolution algorithm for feature importance evaluation in online streaming feature selection.

Result: Outperforms state-of-the-art OSFS and OS2FS methods on six real-world datasets by selecting optimal feature subsets and achieving superior accuracy.

Conclusion: ODESFS effectively handles incomplete streaming data and improves feature selection performance through integrated missing data imputation and evolutionary feature evaluation.

Abstract: The processing of high-dimensional streaming data commonly utilizes online streaming feature selection (OSFS) techniques. However, practical implementations often face challenges with data incompleteness due to equipment failures and technical constraints. Online Sparse Streaming Feature Selection (OS2FS) tackles this issue through latent factor analysis-based missing data imputation. Despite this advancement, existing OS2FS approaches exhibit substantial limitations in feature evaluation, resulting in performance deterioration. To address these shortcomings, this paper introduces a novel Online Differential Evolution for Sparse Feature Selection (ODESFS) in data streams, incorporating two key innovations: (1) missing value imputation using a latent factor analysis model, and (2) feature importance evaluation through differential evolution. Comprehensive experiments conducted on six real-world datasets demonstrate that ODESFS consistently outperforms state-of-the-art OSFS and OS2FS methods by selecting optimal feature subsets and achieving superior accuracy.

[450] Merging without Forgetting: Continual Fusion of Task-Specific Models via Optimal Transport

Zecheng Pan, Zhikang Chen, Ding Li, Min Zhang, Sen Cui, Hongshuo Jin, Luqi Tao, Yi Yang, Deheng Ye, Yu Zhang, Tingting Zhu, Tianling Ren

Main category: cs.LG

TL;DR: OTMF is a model merging framework that uses optimal transport theory to align semantic geometry between task-specific models, avoiding distribution shift issues from naive parameter interpolation.

Details

Motivation: Existing parameter interpolation methods for model merging cause significant distribution shift in feature space and undermine task-specific knowledge, limiting multi-task system effectiveness.

Method: OTMF discovers common masks applied to task vectors through optimal transport plans, selectively extracting transferable components while preserving task identities. Supports continual fusion for scalable integration.

Result: Achieves state-of-the-art performance on multiple vision and language benchmarks in both accuracy and efficiency compared to existing methods.

Conclusion: OTMF provides a practical and theoretically grounded approach to model merging that effectively addresses distribution shift while enabling scalable multi-task integration.

Abstract: Merging models fine-tuned for different tasks into a single unified model has become an increasingly important direction for building versatile, efficient multi-task systems. Existing approaches predominantly rely on parameter interpolation in weight space, which we show introduces significant distribution shift in the feature space and undermines task-specific knowledge. In this paper, we propose OTMF (Optimal Transport-based Masked Fusion), a novel model merging framework rooted in optimal transport theory to address the distribution shift that arises from naive parameter interpolation. Instead of directly aggregating features or weights, OTMF aligns the semantic geometry of task-specific models by discovering common masks applied to task vectors through optimal transport plans. These masks selectively extract transferable and task-agnostic components while preserving the unique structural identities of each task. To ensure scalability in real-world settings, OTMF further supports a continual fusion paradigm that incrementally integrates each new task vector without revisiting previous ones, maintaining a bounded memory footprint and enabling efficient fusion across a growing number of tasks. We conduct comprehensive experiments on multiple vision and language benchmarks, and results show that OTMF achieves state-of-the-art performance in terms of both accuracy and efficiency. These findings highlight the practical and theoretical value of our approach to model merging.
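For readers unfamiliar with the term, the following is a minimal sketch of what an "optimal transport plan" is, computed with standard entropic-regularized Sinkhorn iterations. It is not OTMF's procedure: the abstract does not say how plans are computed or how masks are derived from them. The feature arrays `x` and `y`, the regularization strength `reg`, and the function name `sinkhorn_plan` are all assumptions made for illustration.

```python
import numpy as np

def sinkhorn_plan(cost, a, b, reg=0.5, n_iters=200):
    """Entropic-regularized OT plan between histograms a and b (Sinkhorn iterations)."""
    K = np.exp(-cost / reg)              # Gibbs kernel built from the cost matrix
    u = np.ones_like(a)
    for _ in range(n_iters):
        v = b / (K.T @ u)                # rescale to match column marginals
        u = a / (K @ v)                  # rescale to match row marginals
    return u[:, None] * K * v[None, :]   # plan[i, j] = mass moved from i to j

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 3))              # hypothetical features from model A
y = rng.normal(size=(5, 3))              # hypothetical features from model B

# Squared Euclidean cost, normalized to [0, 1] for numerical stability.
cost = ((x[:, None, :] - y[None, :, :]) ** 2).sum(axis=-1)
cost = cost / cost.max()

a = np.full(4, 1 / 4)                    # uniform mass on A's features
b = np.full(5, 1 / 5)                    # uniform mass on B's features
plan = sinkhorn_plan(cost, a, b)
print(plan.round(3))
```

The resulting matrix is non-negative and its row and column sums match `a` and `b`, which is the defining property of a transport plan.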

[451] ModHiFi: Identifying High Fidelity predictive components for Model Modification

Dhruva Kashyap, Chaitanya Murti, Pranav K Nayak, Tanay Narshana, Chiranjib Bhattacharyya

Main category: cs.LG

TL;DR: ModHiFi enables model modification (pruning and unlearning) without access to training data or loss functions by using Subset Fidelity metric based on local reconstruction errors.

Details

Motivation: Open weight models lack access to training data and loss functions, making traditional modification methods infeasible. Need approaches that work with limited computational resources without requiring gradients or ground-truth labels.

Method: Uses Subset Fidelity metric based on local reconstruction errors of component subsets. Theoretically shows global reconstruction error is linearly bounded by local errors for Lipschitz-continuous networks. Proposes ModHiFi algorithm with two variants: ModHiFi-P for pruning and ModHiFi-U for unlearning.

Result: ModHiFi-P achieves 11% speedup over state-of-the-art on ImageNet models with competitive performance on language models. ModHiFi-U achieves complete unlearning on CIFAR-10 without fine-tuning and competitive performance on Swin Transformers.

Conclusion: ModHiFi provides an effective approach for model modification without requiring training data or loss function access, demonstrating strong performance in both pruning and unlearning tasks across different model architectures.

Abstract: Open weight models, which are ubiquitous, rarely provide access to their training data or loss function. Modifying such models for tasks such as pruning or unlearning under this constraint is therefore an active area of research. Existing techniques typically require gradients or ground-truth labels, rendering them infeasible in settings with limited computational resources. In this work, we investigate the fundamental question of identifying components that are critical to the model’s predictive performance, without access to either gradients or the loss function, and with only distributional access such as synthetic data. We theoretically demonstrate that the global reconstruction error is linearly bounded by local reconstruction errors for Lipschitz-continuous networks such as CNNs and well-trained Transformers (which, contrary to existing literature, we find exhibit Lipschitz continuity). This motivates using the locally reconstructive behavior of component subsets to quantify their global importance, via a metric that we term Subset Fidelity. In the uncorrelated features setting, selecting individual components via their Subset Fidelity scores is optimal, which we use to propose ModHiFi, an algorithm for model modification that requires no training data or loss function access. ModHiFi-P, for structured pruning, achieves an 11% speedup over the current state of the art on ImageNet models and competitive performance on language models. ModHiFi-U, for classwise unlearning, achieves complete unlearning on CIFAR-10 without fine-tuning and demonstrates competitive performance on Swin Transformers.

[452] An Invariant Latent Space Perspective on Language Model Inversion

Wentao Ye, Jiaqi Hu, Haobo Wang, Xinpeng Ti, Zhiqing Xiao, Hao Chen, Liyao Li, Lei Feng, Sai Wu, Junbo Zhao

Main category: cs.LG

TL;DR: Inv^2A is a new method for language model inversion that treats LLMs as invariant decoders and learns lightweight inverse encoders, achieving better performance than baselines while reducing data dependency.

Details

Motivation: Language model inversion poses serious threats to user privacy and system security by potentially recovering hidden prompts from model outputs.

Method: Proposes Invariant Latent Space Hypothesis and Inv^2A method that learns lightweight inverse encoders, uses sparse concatenation of multiple outputs, and employs contrastive alignment and supervised reinforcement training.

Result: Outperforms baselines by 4.77% BLEU score average across 9 datasets while reducing dependence on large inverse corpora.

Conclusion: Current defenses provide limited protection against LMI attacks, highlighting the need for stronger security strategies.

Abstract: Language model inversion (LMI), i.e., recovering hidden prompts from outputs, emerges as a concrete threat to user privacy and system security. We recast LMI as reusing the LLM’s own latent space and propose the Invariant Latent Space Hypothesis (ILSH): (1) diverse outputs from the same source prompt should preserve consistent semantics (source invariance), and (2) input<->output cyclic mappings should be self-consistent within a shared latent space (cyclic invariance). Accordingly, we present Inv^2A, which treats the LLM as an invariant decoder and learns only a lightweight inverse encoder that maps outputs to a denoised pseudo-representation. When multiple outputs are available, they are sparsely concatenated at the representation layer to increase information density. Training proceeds in two stages: contrastive alignment (source invariance) and supervised reinforcement (cyclic invariance). An optional training-free neighborhood search can refine local performance. Across 9 datasets covering user and system prompt scenarios, Inv^2A outperforms baselines by an average of 4.77% BLEU score while reducing dependence on large inverse corpora. Our analysis further shows that prevalent defenses provide limited protection, underscoring the need for stronger strategies. The source code and data used in this paper are available at https://github.com/yyy01/Invariant_Attacker.

[453] Neural Tractability via Structure: Learning-Augmented Algorithms for Graph Combinatorial Optimization

Jialiang Li, Weitong Chen, Mingyu Guo

Main category: cs.LG

TL;DR: A framework combining neural models with parameterized algorithms for graph combinatorial optimization, using neural models for hard parts and parameterized search for easy parts to achieve better solution quality and generalization.

Details

Motivation: Neural models offer fast inference but poor absolute solution quality compared to classical search algorithms, while parameterized algorithms can efficiently solve easy instances of NP-hard problems.

Method: Use parameterized analysis to identify structurally hard parts handled by neural models, and easy parts handled by parameterized algorithms that integrate neural advisory signals.

Result: Achieves superior solution quality competitive with commercial solvers and improved out-of-distribution generalization compared to neural solvers alone.

Conclusion: The framework successfully combines neural efficiency with search-based optimality guarantees, producing strictly better solutions than neural solvers while being agnostic to the neural model choice.

Abstract: Neural models have shown promise in solving NP-hard graph combinatorial optimization (CO) problems. Once trained, they offer fast inference and reasonably high-quality solutions for in-distribution testing instances, but they generally fall short in terms of absolute solution quality compared to classical search-based algorithms that are admittedly slower but offer an optimality guarantee once the search finishes. We propose a novel framework that combines the inference efficiency and exploratory power of neural models with the solution quality guarantee of search-based algorithms. In particular, we use parameterized algorithms (PAs) as the search component. PAs are dedicated to identifying easy instances of generally NP-hard problems, and allow for practically efficient search by exploiting structural simplicity (of the identified easy instances). Under our framework, we use parameterized analysis to identify the structurally hard parts of a CO instance. The neural model handles the hard parts by generating advisory signals based on its data-driven understanding. The PA-based search component then integrates the advisory signals to systematically and efficiently search through the remaining structurally easy parts. Notably, our framework is agnostic to the choice of neural model and produces strictly better solutions than neural solvers alone. We examine our framework on multiple CO tasks. Empirical results show that it achieves superior solution quality, competitive with that of commercial solvers. Furthermore, by using the neural model only for exploratory advisory signals, our framework exhibits improved out-of-distribution generalization, addressing a key limitation of existing neural CO solvers.

[454] Learning Massively Multitask World Models for Continuous Control

Nicklas Hansen, Hao Su, Xiaolong Wang

Main category: cs.LG

TL;DR: Newt is a language-conditioned multitask world model trained on 200 diverse tasks through pretraining on demonstrations followed by online RL optimization, achieving superior multitask performance and rapid adaptation to unseen tasks.

Details

Motivation: To address the limitation that RL for continuous control is dominated by single-task or offline regimes, and explore whether a single agent can be trained on hundreds of tasks with online interaction, inspired by the foundation model approach.

Method: Introduce a benchmark with 200 diverse tasks, then develop Newt - a language-conditioned multitask world model that is pretrained on demonstrations to acquire task-aware representations and action priors, followed by joint optimization with online interaction across all tasks.

Result: Newt yields better multitask performance and data-efficiency than strong baselines, exhibits strong open-loop control, and enables rapid adaptation to unseen tasks.

Conclusion: The approach demonstrates that online RL can scale to hundreds of diverse tasks through the foundation model recipe of large-scale pretraining followed by light RL, enabling general-purpose control across many domains and embodiments.

Abstract: General-purpose control demands agents that act across many tasks and embodiments, yet research on reinforcement learning (RL) for continuous control remains dominated by single-task or offline regimes, reinforcing a view that online RL does not scale. Inspired by the foundation model recipe (large-scale pretraining followed by light RL), we ask whether a single agent can be trained on hundreds of tasks with online interaction. To accelerate research in this direction, we introduce a new benchmark with 200 diverse tasks spanning many domains and embodiments, each with language instructions, demonstrations, and optionally image observations. We then present Newt, a language-conditioned multitask world model that is first pretrained on demonstrations to acquire task-aware representations and action priors, and then jointly optimized with online interaction across all tasks. Experiments show that Newt yields better multitask performance and data-efficiency than a set of strong baselines, exhibits strong open-loop control, and enables rapid adaptation to unseen tasks. We release our environments, demonstrations, code for training and evaluation, as well as 200+ checkpoints.

[455] Many Ways to be Right: Rashomon Sets for Concept-Based Neural Networks

Shihan Feng, Cheng Zhang, Michael Xi, Ethan Hsu, Lesia Semenova, Chudi Zhong

Main category: cs.LG

TL;DR: Rashomon Concept Bottleneck Models learn multiple accurate neural networks that use distinct human-understandable concepts, revealing diverse reasoning patterns in equally performing solutions.

Details

Motivation: To uncover the diversity in deep architectures where multiple models can achieve identical performance through different features or reasoning patterns (Rashomon Effect), which is challenging due to countless numerically distinct but behaviorally similar near-optimal solutions.

Method: Combines lightweight adapter modules with a diversity-regularized training objective to construct multiple deep concept-based models efficiently without retraining from scratch.

Result: Produces networks with fundamentally different reasoning processes for the same predictions, showing how concept reliance and decision making vary across equally performing solutions.

Conclusion: Enables systematic exploration of data-driven reasoning diversity in deep models, providing a new mechanism for auditing, comparison, and alignment across equally accurate solutions.

Abstract: Modern neural networks rarely have a single way to be right. For many tasks, multiple models can achieve identical performance while relying on different features or reasoning patterns, a property known as the Rashomon Effect. However, uncovering this diversity in deep architectures is challenging as their continuous parameter spaces contain countless near-optimal solutions that are numerically distinct but often behaviorally similar. We introduce Rashomon Concept Bottleneck Models, a framework that learns multiple neural networks which are all accurate yet reason through distinct human-understandable concepts. By combining lightweight adapter modules with a diversity-regularized training objective, our method constructs a diverse set of deep concept-based models efficiently without retraining from scratch. The resulting networks provide fundamentally different reasoning processes for the same predictions, revealing how concept reliance and decision making vary across equally performing solutions. Our framework enables systematic exploration of data-driven reasoning diversity in deep models, offering a new mechanism for auditing, comparison, and alignment across equally accurate solutions.

[456] Lower Complexity Bounds for Nonconvex-Strongly-Convex Bilevel Optimization with First-Order Oracles

Kaiyi Ji

Main category: cs.LG

TL;DR: This paper establishes new lower bounds for bilevel optimization in the smooth nonconvex-strongly-convex setting, showing that deterministic algorithms require Ω(κ^{3/2}ε^{-2}) oracle calls and stochastic algorithms require Ω(κ^{5/2}ε^{-4}) oracle calls to find ε-accurate stationary points.

Details

Motivation: Progress on lower bounds for bilevel optimization has been limited due to the complexity of the bilevel structure, despite extensive study of upper bound guarantees. The authors aim to develop nontrivial lower bounds to better understand the fundamental complexity of bilevel optimization.

Method: The authors develop new hard instances specifically designed for the smooth nonconvex-strongly-convex setting and analyze them under both deterministic and stochastic first-order oracle models. They prove lower bounds using zero-respecting algorithms as the computational model.

Result: For deterministic algorithms: Ω(κ^{3/2}ε^{-2}) oracle calls are required. For stochastic algorithms: Ω(κ^{5/2}ε^{-4}) stochastic oracle calls are necessary. These bounds improve upon known lower bounds for single-level nonconvex optimization and nonconvex-strongly-convex min-max problems.

Conclusion: The results reveal substantial gaps between current upper and lower bounds for bilevel optimization, suggesting that even simplified regimes like quadratic lower-level objectives warrant further investigation to understand the optimal complexity of bilevel optimization under standard first-order oracles.

Abstract: Although upper bound guarantees for bilevel optimization have been widely studied, progress on lower bounds has been limited due to the complexity of the bilevel structure. In this work, we focus on the smooth nonconvex-strongly-convex setting and develop new hard instances that yield nontrivial lower bounds under deterministic and stochastic first-order oracle models. In the deterministic case, we prove that any first-order zero-respecting algorithm requires at least $Ω(κ^{3/2}ε^{-2})$ oracle calls to find an $ε$-accurate stationary point, improving the optimal lower bounds known for single-level nonconvex optimization and for nonconvex-strongly-convex min-max problems. In the stochastic case, we show that at least $Ω(κ^{5/2}ε^{-4})$ stochastic oracle calls are necessary, again strengthening the best known bounds in related settings. Our results expose substantial gaps between current upper and lower bounds for bilevel optimization and suggest that even simplified regimes, such as those with quadratic lower-level objectives, warrant further investigation toward understanding the optimal complexity of bilevel optimization under standard first-order oracles.
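To give a feel for how the two rates scale, here is a small arithmetic sketch. It evaluates only the leading terms κ^{3/2}ε^{-2} and κ^{5/2}ε^{-4}; the constants hidden by the Ω notation are unknown and ignored, so the outputs are orders of magnitude, not oracle-call counts. The values κ = 100 and ε = 0.01 and both function names are assumptions chosen for illustration.

```python
def deterministic_rate(kappa: float, eps: float) -> float:
    """Leading term kappa^(3/2) * eps^(-2) of the deterministic lower bound."""
    return kappa ** 1.5 * eps ** -2

def stochastic_rate(kappa: float, eps: float) -> float:
    """Leading term kappa^(5/2) * eps^(-4) of the stochastic lower bound."""
    return kappa ** 2.5 * eps ** -4

kappa, eps = 100.0, 0.01      # hypothetical condition number and target accuracy
det = deterministic_rate(kappa, eps)
sto = stochastic_rate(kappa, eps)

# The ratio between the two rates is kappa * eps^(-2).
print(f"deterministic: {det:.2e}, stochastic: {sto:.2e}, ratio: {sto / det:.2e}")
```

At these values the deterministic term is on the order of 10^7 and the stochastic term on the order of 10^13, so the stochastic setting is harder by a factor of κε^{-2}.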

[457] BlockCert: Certified Blockwise Extraction of Transformer Mechanisms

Sandro Andric

Main category: cs.LG

TL;DR: BlockCert is a framework for certified blockwise extraction of transformer mechanisms with formal guarantees on approximation error, enabling certified model edits and bridging mechanistic interpretability with formal reasoning.

Details

Motivation: Current mechanistic interpretability and model editing approaches lack formal guarantees about how extracted or edited models deviate from originals on relevant inputs, relying on informal evidence and ad-hoc experiments.

Method: Extracts structured surrogate implementations for transformer residual blocks with machine-checkable certificates that bound approximation error, record coverage metrics, and hash artifacts. Uses Lipschitz-based composition theorem to lift local guarantees to global deviation bounds.

Result: Applied to GPT-2 small, TinyLlama-1.1B-Chat, and Llama-3.2-3B, achieving high per-block coverage and small residual errors. In TinyLlama, fully stitched model matches baseline perplexity within ~6e-5 on stress prompts.

Conclusion: Blockwise extraction with explicit certificates is feasible for real transformer language models and provides a practical bridge between mechanistic interpretability and formal reasoning about model behavior.

Abstract: Mechanistic interpretability aspires to reverse-engineer neural networks into explicit algorithms, while model editing seeks to modify specific behaviours without retraining. Both areas are typically evaluated with informal evidence and ad-hoc experiments, with few explicit guarantees about how far an extracted or edited model can drift from the original on relevant inputs. We introduce BlockCert, a framework for certified blockwise extraction of transformer mechanisms, and outline how a lightweight extension can support certified local edits. Given a pre-trained transformer and a prompt distribution, BlockCert extracts structured surrogate implementations for residual blocks together with machine-checkable certificates that bound approximation error, record coverage metrics, and hash the underlying artifacts. We formalize a simple Lipschitz-based composition theorem in Lean 4 that lifts these local guarantees to a global deviation bound. Empirically, we apply the framework to GPT-2 small, TinyLlama-1.1B-Chat, and Llama-3.2-3B. Across these models we obtain high per-block coverage and small residual errors on the evaluated prompts, and in the TinyLlama setting we show that a fully stitched model matches the baseline perplexity within approximately 6e-5 on stress prompts. Our results suggest that blockwise extraction with explicit certificates is feasible for real transformer language models and offers a practical bridge between mechanistic interpretability and formal reasoning about model behaviour.
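The idea of lifting per-block error bounds to a global bound can be checked numerically on a toy model. The sketch below assumes purely linear blocks f_i(z) = W_i z with surrogates g_i(z) = (W_i + E_i) z. It is not BlockCert's certificate format, and it does not reproduce the theorem statement formalized in Lean 4. It uses the standard recursion e_i <= L_i * e_{i-1} + eps_i, where L_i is the Lipschitz constant of the original block and eps_i is the local error measured at the surrogate's own input.

```python
import numpy as np

rng = np.random.default_rng(0)
n_blocks, dim = 4, 8

# Toy "original" blocks and small perturbations defining the surrogates.
W = [rng.normal(size=(dim, dim)) / np.sqrt(dim) for _ in range(n_blocks)]
E = [1e-3 * rng.normal(size=(dim, dim)) for _ in range(n_blocks)]
x = rng.normal(size=dim)

# For a linear map, the Lipschitz constant is the spectral norm.
lipschitz = [np.linalg.norm(w, 2) for w in W]

z_orig, z_sur = x.copy(), x.copy()
bound = 0.0
for w, e, L in zip(W, E, lipschitz):
    local_err = np.linalg.norm(e @ z_sur)   # ||f_i(z) - g_i(z)|| at the surrogate's input
    bound = L * bound + local_err           # e_i <= L_i * e_{i-1} + eps_i
    z_orig = w @ z_orig                     # original block
    z_sur = (w + e) @ z_sur                 # surrogate block

global_err = np.linalg.norm(z_orig - z_sur)
print(f"global error {global_err:.3e} <= composed bound {bound:.3e}")
```

The recursion follows from the triangle inequality, so the printed global error never exceeds the composed bound, although the bound can be loose when the Lipschitz constants are large.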

[458] Structured Noise Modeling for Enhanced Time-Series Forecasting

Sepideh Koohfar

Main category: cs.LG

TL;DR: A forecast-blur-denoise framework that improves temporal forecasting by modeling structured noise through Gaussian Processes, enabling better capture of multi-scale temporal patterns while maintaining high-resolution detail.

Details

Motivation: Existing neural models struggle with multi-scale temporal patterns, leading to unstable predictions and reduced reliability in real-world forecasting applications where both broad trends and fine-grained fluctuations matter.

Method: Uses a forecast-blur-denoise framework with learnable Gaussian Process module for smooth correlated perturbations, encouraging long-range structure capture, plus a refinement model to restore high-resolution temporal detail. Components are trained jointly.

Result: Consistent gains in multi-horizon accuracy and stability across electricity, traffic, and solar datasets. The framework also works as lightweight enhancement for pretrained models in limited-data scenarios.

Conclusion: The framework strengthens reliability and interpretability of fine-scale temporal predictions, contributing to more trustworthy AI systems for forecasting-driven decision support in time-critical domains like energy and infrastructure.

Abstract: Time-series forecasting remains difficult in real-world settings because temporal patterns operate at multiple scales, from broad contextual trends to fast, fine-grained fluctuations that drive critical decisions. Existing neural models often struggle to represent these interacting dynamics, leading to unstable predictions and reduced reliability in downstream applications. This work introduces a forecast-blur-denoise framework that improves temporal fidelity through structured noise modeling. The approach incorporates a learnable Gaussian Process module that generates smooth, correlated perturbations, encouraging the forecasting backbone to capture long-range structure while a dedicated refinement model restores high-resolution temporal detail. Training the components jointly enables natural competence division and avoids the artifacts commonly produced by isotropic corruption methods. Experiments across electricity, traffic, and solar datasets show consistent gains in multi-horizon accuracy and stability. The modular design also allows the blur-denoise layer to operate as a lightweight enhancement for pretrained models, supporting efficient adaptation in limited-data scenarios. By strengthening the reliability and interpretability of fine-scale temporal predictions, this framework contributes to more trustworthy AI systems used in forecasting-driven decision support across energy, infrastructure, and other time-critical domains.
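The contrast between smooth, correlated perturbations and isotropic corruption can be illustrated with a fixed-kernel Gaussian Process sample. This is a sketch under stated assumptions: the RBF kernel, the fixed `length_scale`, the 0.1 noise amplitude, and the sine-wave "forecast" are illustrative choices, whereas the paper's module is learnable and its kernel is not specified in the abstract.

```python
import numpy as np

def rbf_kernel(t, length_scale=5.0, variance=1.0):
    """Squared-exponential covariance between all pairs of time steps."""
    d = t[:, None] - t[None, :]
    return variance * np.exp(-0.5 * (d / length_scale) ** 2)

def sample_gp_noise(n_steps, length_scale=5.0, variance=1.0, jitter=1e-6, seed=0):
    """Draw one zero-mean GP sample over n_steps equally spaced time points."""
    rng = np.random.default_rng(seed)
    t = np.arange(n_steps, dtype=float)
    K = rbf_kernel(t, length_scale, variance) + jitter * np.eye(n_steps)
    L = np.linalg.cholesky(K)               # K = L @ L.T
    return L @ rng.standard_normal(n_steps)

n_steps = 96
forecast = np.sin(np.linspace(0, 4 * np.pi, n_steps))    # stand-in for a backbone forecast
gp = sample_gp_noise(n_steps)                             # correlated perturbation
white = np.random.default_rng(1).standard_normal(n_steps) # isotropic perturbation

blurred = forecast + 0.1 * gp     # "blur" step of a forecast-blur-denoise pipeline

# Correlated noise changes far less between adjacent steps than white noise.
print(np.abs(np.diff(gp)).mean(), np.abs(np.diff(white)).mean())
```

A larger `length_scale` produces smoother perturbations, which is the knob a learnable module could tune to decide how much fine detail is left for the refinement model to restore.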

[459] Demystifying Diffusion Objectives: Reweighted Losses are Better Variational Bounds

Jiaxin Shi, Michalis K. Titsias

Main category: cs.LG

TL;DR: The paper provides a new theoretical interpretation of reweighted losses in diffusion models, showing they improve variational bounds and reduce KL-divergence, with applications to both continuous and masked diffusion models.

Details

Motivation: To develop a theoretical foundation for reweighted losses used in diffusion model training, addressing limitations of standard evidence lower bounds and improving training objectives.

Method: Constructing a cascade of time-dependent variational lower bounds on data log-likelihood that improve upon standard ELBO, resulting in reweighted objectives applicable to both continuous Gaussian and masked diffusion models.

Result: Significant improvements in pixel-space image modeling with masked diffusion, achieving sample quality comparable to continuous diffusion models, and providing theoretical justification for existing weighting schemes.

Conclusion: The proposed theoretical framework successfully explains and improves reweighted loss functions in diffusion models, bridging theory and practice for both continuous and discrete diffusion approaches.

Abstract: We derive a new theoretical interpretation of the reweighted losses that are widely used for training diffusion models. Our method is based on constructing a cascade of time-dependent variational lower bounds on the data log-likelihood, that provably improves upon the standard evidence lower bound and results in reduced data-model KL-divergences. Combining such bounds gives rise to reweighted objectives that can be applied to any generative diffusion model including both continuous Gaussian diffusion and masked (discrete) diffusion models. Then, we showcase this framework in masked diffusion and report significant improvements over previous training losses in pixel-space image modeling, approaching sample quality comparable to continuous diffusion models. Our results also provide a theoretical justification for the simple weighting scheme widely used in masked image models.

[460] TREASURE: A Transformer-Based Foundation Model for High-Volume Transaction Understanding

Chin-Chia Michael Yeh, Uday Singh Saini, Xin Dai, Xiran Fan, Shubham Jain, Yujie Fan, Jiarui Sun, Junpeng Wang, Menghai Pan, Yingtong Dou, Yuzhong Chen, Vineeth Rakesh, Liang Wang, Yan Zheng, Mahashweta Das

Main category: cs.LG

TL;DR: TREASURE is a transformer-based foundation model for transaction data that captures consumer behavior and payment network signals, improving abnormal behavior detection by 111% and recommendation systems by 104%.

Details

Motivation: Payment networks generate high volumes of transaction data that can enable applications like abnormal behavior detection and consumer-level insights for hyper-personalized experiences to improve people's lives.

Method: TREASURE uses a transformer-based architecture with dedicated input modules for static and dynamic attributes, and an efficient training paradigm for predicting high-cardinality categorical attributes.

Result: The model increases abnormal behavior detection performance by 111% over production systems and enhances recommendation models by 104% when used as an embedding provider.

Conclusion: TREASURE demonstrates effectiveness as both a standalone model and embedding provider, providing comprehensive transaction data representations for various applications through extensive verification with industry-grade datasets.

Abstract: Payment networks form the backbone of modern commerce, generating high volumes of transaction records from daily activities. Properly modeling this data can enable applications such as abnormal behavior detection and consumer-level insights for hyper-personalized experiences, ultimately improving people’s lives. In this paper, we present TREASURE, TRansformer Engine As Scalable Universal transaction Representation Encoder, a multipurpose transformer-based foundation model specifically designed for transaction data. The model simultaneously captures both consumer behavior and payment network signals (such as response codes and system flags), providing comprehensive information necessary for applications like accurate recommendation systems and abnormal behavior detection. Verified with industry-grade datasets, TREASURE features three key capabilities: 1) an input module with dedicated sub-modules for static and dynamic attributes, enabling more efficient training and inference; 2) an efficient and effective training paradigm for predicting high-cardinality categorical attributes; and 3) demonstrated effectiveness as both a standalone model that increases abnormal behavior detection performance by 111% over production systems and an embedding provider that enhances recommendation models by 104%. We present key insights from extensive ablation studies, benchmarks against production models, and case studies, highlighting valuable knowledge gained from developing TREASURE.

[461] TiCT: A Synthetically Pre-Trained Foundation Model for Time Series Classification

Chin-Chia Michael Yeh, Uday Singh Saini, Junpeng Wang, Xin Dai, Xiran Fan, Jiarui Sun, Yujie Fan, Yan Zheng

Main category: cs.LG

TL;DR: TiCT is a transformer-based foundation model for time series classification that performs in-context learning using only synthetic pre-training data, achieving competitive results without fine-tuning.

Details

Motivation: Address the gap in general-purpose foundation models for time series classification that can perform in-context learning, reducing the need for expensive labeled data and retraining.

Method: Transformer architecture with scalable bit-based label encoding and output attention mechanism, pre-trained exclusively on synthetic data using Mixup-inspired process and data augmentation.

Result: Achieves competitive performance against state-of-the-art supervised methods on UCR Archive using only in-context examples at inference, without updating model weights.

Conclusion: TiCT demonstrates that synthetic pre-training can enable effective in-context learning for time series classification, providing a fine-tuning-free solution that reduces dependency on labeled data.

Abstract: The ubiquity of time series data creates a strong demand for general-purpose foundation models, yet developing them for classification remains a significant challenge, largely due to the high cost of labeled data. Foundation models capable of in-context learning (ICL) offer a powerful solution, adapting to new tasks with minimal examples and reducing the need for extensive retraining. However, prior work on large-scale time series models has predominantly focused on forecasting, leaving a critical gap for versatile, fine-tuning-free classification. To address this, we introduce TiCT (Time-series in-Context Transformer), a transformer-based model pre-trained exclusively on synthetic data to perform in-context classification. We make two primary technical contributions: 1) a novel architecture featuring a scalable bit-based label encoding and a special output attention mechanism to handle an arbitrary number of classes; and 2) a synthetic pre-training framework that combines a Mixup-inspired process with data augmentation to foster generalization and noise invariance. Extensive evaluations on the UCR Archive show that TiCT achieves competitive performance against state-of-the-art supervised methods. Crucially, this is accomplished using only in-context examples at inference time, without updating a single model weight.
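One plausible reading of a "bit-based label encoding" is to represent each class index by its binary digits, so that the code length grows logarithmically with the number of classes. The sketch below shows that reading only. The abstract does not give the actual encoding, and the function names and least-significant-bit-first order are assumptions.

```python
import numpy as np

def encode_labels(labels, n_bits):
    """Map integer class ids to binary codes, least significant bit first."""
    labels = np.asarray(labels, dtype=np.int64)
    shifts = np.arange(n_bits, dtype=np.int64)
    return ((labels[:, None] >> shifts) & 1).astype(np.float32)

def decode_labels(bits):
    """Threshold (possibly soft) bit predictions and rebuild the class ids."""
    hard = (np.asarray(bits) > 0.5).astype(np.int64)
    weights = np.left_shift(1, np.arange(hard.shape[1], dtype=np.int64))
    return (hard * weights).sum(axis=1)

labels = np.array([0, 1, 5, 12])
codes = encode_labels(labels, n_bits=4)    # 4 bits cover up to 16 classes
recovered = decode_labels(codes)
print(codes)
print(recovered)
```

With this scheme, supporting more classes only requires widening the code by one bit per doubling, rather than adding one output unit per class as in one-hot encoding.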

[462] CafeQ: Calibration-free Quantization via Learned Transformations and Adaptive Rounding

Ziteng Sun, Adrian Benton, Samuel Kushnir, Asher Trockman, Vikas Singh, Suhas Diggavi, Ananda Theertha Suresh

Main category: cs.LG

TL;DR: Proposes data-free post-training quantization methods using structured matrix transformations and adaptive rounding to handle outliers in LLM weights, achieving performance comparable to calibration-dependent methods.

Details

Motivation: Existing PTQ methods rely on calibration data which may be unavailable or private, limiting real-world deployment. Need data-free quantization that maintains performance.

Method: Uses proxy functions for quantization loss without calibration data, structured matrix transformations for single matrices, and dual matrix transformations with adaptive rounding for paired weights.

Result: Improves Gemma 2 9B quantization: average score from 61.9 to 62.4 (4-bit) and from 52.0 to 60.6 (3-bit), with <3% computation overhead. Matches GPTQ performance without calibration data.

Conclusion: Data-free quantization via structured transformations and adaptive rounding effectively handles outliers, achieving strong performance without calibration data dependency.

Abstract: Post-training quantization is an effective method for reducing the serving cost of large language models, where the standard approach is to use a round-to-nearest quantization level scheme. However, this often introduces large errors due to outliers in the weights. Proposed mitigation mechanisms include applying adaptive rounding, random rotation transformations or committing to a post-training target using calibration data. Unfortunately, this reliance on calibration data can be severely limiting in some real-world scenarios as such data may be unavailable or subject to privacy regulations. In this paper, we propose algorithms to optimize transformations and adaptive rounding without access to any calibration data. The optimization is achieved by designing a suitable proxy function for the quantization loss without calibration data. To maintain inference efficiency, we perform structured matrix transformations for single matrices. For paired weights that interact directly in the computation graph, we use dual matrix transformations and adaptive rounding methods. We conduct experiments on Gemma 2 models, and observe consistent improvement over the baselines. For Gemma 2 9B quantization, our method improves the average benchmark score from 61.9 to 62.4 for 4-bit quantization and from 52.0 to 60.6 for 3-bit quantization, while adding less than 3% of computation overhead. Furthermore, our method achieves performance comparable to the commonly used GPTQ method, which requires calibration data.
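
To make the outlier problem concrete, the following minimal sketch applies symmetric round-to-nearest quantization to a tiny weight vector with and without a single large weight. It only illustrates the baseline failure mode described in the abstract, not CafeQ's transformations or adaptive rounding. The weight values are invented for the example.

```python
def quantize_rtn(weights, n_bits):
    """Symmetric round-to-nearest quantization with one scale per vector."""
    levels = 2 ** (n_bits - 1) - 1              # e.g. 7 positive levels for 4 bits
    scale = max(abs(w) for w in weights) / levels
    return [round(w / scale) * scale for w in weights]


def max_abs_error(original, quantized):
    """Largest absolute reconstruction error over paired entries."""
    return max(abs(o - q) for o, q in zip(original, quantized))


small = [0.10, -0.20, 0.15]
with_outlier = small + [8.0]                    # one outlier stretches the grid

# Error on the three small weights, quantized alone vs. alongside the outlier.
err_clean = max_abs_error(small, quantize_rtn(small, 4))
err_outlier = max_abs_error(small, quantize_rtn(with_outlier, 4)[:3])
```

With the outlier present, the step size is set by the value 8.0, so all three small weights round to zero. This is the kind of error that rotations and adaptive rounding aim to reduce.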

[463] EfficientXpert: Efficient Domain Adaptation for Large Language Models via Propagation-Aware Pruning

Songlin Zhao, Michael Pitts, Zhuwei Qin

Main category: cs.LG

TL;DR: EfficientXpert is a lightweight domain-pruning framework that transforms general pretrained LLMs into sparse domain experts, retaining up to 98% of dense-model performance at 40% sparsity across health and legal tasks.

Details

Motivation: Large language models face deployment barriers in resource-constrained environments, and existing compression methods either generalize poorly across domains or have high overhead.

Method: Combines propagation-aware pruning criterion (Foresight Mask) with efficient adapter-update algorithm (Partial Brain Surgeon), integrated into LoRA fine-tuning for one-step transformation.

Result: Retains up to 98% of dense-model performance at 40% sparsity across health and legal tasks, outperforming state-of-the-art methods.

Conclusion: Domain-dependent structural shifts degrade general pruning effectiveness, highlighting the need for adaptive, domain-aware pruning strategies tailored to each domain.

Abstract: The rapid advancement of large language models (LLMs) has increased the demand for domain-specialized variants in areas such as law, healthcare, and finance. However, their large size remains a barrier to deployment in resource-constrained environments, and existing compression methods either generalize poorly across domains or incur high overhead. In this work, we propose EfficientXpert, a lightweight domain-pruning framework that combines a propagation-aware pruning criterion (Foresight Mask) with an efficient adapter-update algorithm (Partial Brain Surgeon). Integrated into the LoRA fine-tuning process, EfficientXpert enables a one-step transformation of general pretrained models into sparse, domain-adapted experts. Across health and legal tasks, it retains up to 98% of dense-model performance at 40% sparsity, outperforming state-of-the-art methods. Further analysis reveals substantial domain-dependent structural shifts that degrade the effectiveness of general pruning masks, underscoring the need for adaptive, domain-aware pruning strategies tailored to each domain.

[464] Training-Free Active Learning Framework in Materials Science with Large Language Models

Hongchen Wang, Rafael Espinosa Castañeda, Jay R. Werber, Yao Fehlis, Edward Kim, Jason Hattrick-Simpers

Main category: cs.LG

TL;DR: LLM-based active learning framework (LLM-AL) outperforms traditional ML models, reducing experiments needed by over 70% across diverse materials science datasets through few-shot prompting strategies.

Details

Motivation: Traditional ML models for active learning suffer from cold-start limitations and domain-specific feature engineering, restricting generalizability across scientific domains.

Method: Introduced LLM-AL framework using iterative few-shot setting with two prompting strategies: concise numerical inputs for compositional data and expanded descriptive text for experimental data.

Result: LLM-AL reduced experiments needed by over 70% to reach top-performing candidates, consistently outperformed traditional ML models, and performed broader exploratory searches while reaching optima faster.

Conclusion: LLM-AL serves as a generalizable alternative to conventional AL pipelines for more efficient and interpretable experiment selection, enabling potential LLM-driven autonomous discovery.

Abstract: Active learning (AL) accelerates scientific discovery by prioritizing the most informative experiments, but traditional machine learning (ML) models used in AL suffer from cold-start limitations and domain-specific feature engineering, restricting their generalizability. Large language models (LLMs) offer a new paradigm by leveraging their pretrained knowledge and universal token-based representations to propose experiments directly from text-based descriptions. Here, we introduce an LLM-based active learning framework (LLM-AL) that operates in an iterative few-shot setting and benchmark it against conventional ML models across four diverse materials science datasets. We explored two prompting strategies: one using concise numerical inputs suited for datasets with more compositional and structured features, and another using expanded descriptive text suited for datasets with more experimental and procedural features to provide additional context. Across all datasets, LLM-AL could reduce the number of experiments needed to reach top-performing candidates by over 70% and consistently outperformed traditional ML models. We found that LLM-AL performs broader and more exploratory searches while still reaching the optima with fewer iterations. We further examined the stability boundaries of LLM-AL given the inherent non-determinism of LLMs and found its performance to be broadly consistent across runs, within the variability range typically observed for traditional ML approaches. These results demonstrate that LLM-AL can serve as a generalizable alternative to conventional AL pipelines for more efficient and interpretable experiment selection and potential LLM-driven autonomous discovery.

[465] DISCO: A Browser-Based Privacy-Preserving Framework for Distributed Collaborative Learning

Julien T. T. Vignoud, Valérian Rousset, Hugo El Guedj, Ignacio Aleman, Walid Bennaceur, Batuhan Faik Derinbay, Eduard Ďurech, Damien Gengler, Lucas Giordano, Felix Grimberg, Franziska Lippoldt, Christina Kopidaki, Jiafan Liu, Lauris Lopata, Nathan Maire, Paul Mansat, Martin Milenkoski, Emmanuel Omont, Güneş Özgün, Mina Petrović, Francesco Posa, Morgan Ridel, Giorgio Savini, Marcel Torne, Lucas Trognon, Alyssa Unell, Olena Zavertiaieva, Sai Praneeth Karimireddy, Tahseen Rabbani, Mary-Anne Hartley, Martin Jaggi

Main category: cs.LG

TL;DR: DISCO is an open-source distributed collaborative learning platform that enables non-technical users to collaboratively build ML models without sharing original data or requiring programming knowledge.

Details

Motivation: Data sharing is often impractical due to privacy, IP, and legal concerns, which fragments statistical power and creates accessibility bias where accuracy favors those with resources to overcome these barriers.

Method: DISCO trains models locally in web browsers, offers modular design with federated/decentralized paradigms, privacy guarantees, and various weight aggregation strategies for model personalization and bias resilience.

Result: The platform provides cross-platform accessibility (including smartphones) and enables collaborative ML model building without data sharing or programming requirements.

Conclusion: DISCO democratizes access to collaborative ML by eliminating data sharing barriers and technical requirements, making distributed learning accessible to non-technical users across various platforms.

Abstract: Data is often impractical to share for a range of well-considered reasons, such as concerns over privacy, intellectual property, and legal constraints. This not only fragments the statistical power of predictive models, but also creates an accessibility bias, where accuracy becomes inequitably distributed to those who have the resources to overcome these concerns. We present DISCO: an open-source DIStributed COllaborative learning platform accessible to non-technical users, offering a means to collaboratively build machine learning models without sharing any original data or requiring any programming knowledge. DISCO’s web application trains models locally, directly in the browser, making our tool cross-platform out-of-the-box, including smartphones. The modular design of DISCO offers choices between federated and decentralized paradigms, various levels of privacy guarantees, and several weight aggregation strategies that allow for model personalization and bias resilience in the collaborative training. The code repository is available at https://github.com/epfml/disco and a showcase web interface at https://discolab.ai
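
The abstract mentions weight aggregation without naming a specific rule. As one familiar possibility, the sketch below shows size-weighted federated averaging over flat parameter lists. It is written in Python purely for illustration and is not DISCO's code or API. The function name and the client data are hypothetical.

```python
def federated_average(client_weights, client_sizes):
    """Average client parameter vectors, weighting each by its local dataset size."""
    total = sum(client_sizes)
    n_params = len(client_weights[0])
    return [
        sum(w[i] * n for w, n in zip(client_weights, client_sizes)) / total
        for i in range(n_params)
    ]


# Two hypothetical clients share only their locally trained weights, never raw data.
clients = [[1.0, 2.0], [3.0, 4.0]]
sizes = [100, 300]
global_weights = federated_average(clients, sizes)
```

Only the parameter vectors and the dataset sizes leave each client in this sketch. Personalization or bias-resilient variants would swap in a different aggregation function.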

[466] The Devil in the Details: Emergent Misalignment, Format and Coherence in Open-Weights LLMs

Craig Dickson

Main category: cs.LG

TL;DR: Modern open-weights models show emergent misalignment when fine-tuned on insecure code, but at much lower rates (0.68%) than proprietary models like GPT-4o (20%). JSON output format doubles misalignment compared to natural language.

Details

Motivation: To evaluate if current-generation open-weights models exhibit similar resistance to emergent misalignment as the Qwen-2.5 family, and measure misalignment robustness across different model architectures and scales.

Method: Replicated the emergent misalignment effect across nine modern open-weights models (Gemma 3 and Qwen 3 families, 1B-32B parameters) by fine-tuning them on insecure code generation tasks and testing misalignment rates.

Result: Fine-tuned models showed 0.68% misalignment rate (vs 0.07% for base models), matching lower end of prior open-model results but dramatically lower than GPT-4o’s 20%. JSON output format doubled misalignment rates (0.96% vs 0.42% for natural language).

Conclusion: Emergent misalignment is reproducible in modern open-weights models but at substantially lower rates than in proprietary systems. Structural constraints like JSON output may bypass safety training by reducing models’ ‘degrees of freedom’ to refuse.

Abstract: Prior work has shown that fine-tuning models on a narrow domain with misaligned data can lead to broad misalignment, a phenomenon termed “emergent misalignment” (Betley et al. 2025). While all tested models were susceptible to emergent misalignment, some models showed more resistance than others. Specifically, the Qwen-2.5 family proved to be relatively resistant, while GPT-4o exhibited the strongest misalignment. In this paper we evaluate whether current-generation open-weights models exhibit similar resistance to the Qwen-2.5 family and measure misalignment robustness over a range of model architectures and scales. We replicate the effect across nine modern open-weights models (Gemma 3 and Qwen 3 families, 1B-32B parameters). Models fine-tuned on insecure code generation show a 0.68% misalignment rate (compared to 0.07% for base models), matching the lower end of prior open-model results but dramatically lower than GPT-4o’s 20%. We identify a critical format-dependent vulnerability: requiring JSON output doubles misalignment rates compared to natural language prompts (0.96% vs 0.42%). This suggests that structural constraints may bypass safety training by reducing the model’s ‘degrees of freedom’ to refuse. These findings confirm emergent misalignment as a reproducible phenomenon in modern open-weights models, with rates substantially lower than observed in proprietary systems.

[467] When +1% Is Not Enough: A Paired Bootstrap Protocol for Evaluating Small Improvements

Wenzhang Du

Main category: cs.LG

TL;DR: Proposes a conservative evaluation protocol using paired multi-seed runs, BCa bootstrap confidence intervals, and sign-flip permutation tests to prevent over-claiming small performance gains in ML benchmarks.

Details

Motivation: Recent ML papers report 1-2% improvements that are highly sensitive to random seeds and implementation details, but rarely provide uncertainty estimates or significance tests, making it unclear if gains are real or noise.

Method: Uses paired multi-seed runs with bias-corrected and accelerated (BCa) bootstrap confidence intervals and sign-flip permutation test on per-seed deltas, designed as a conservative guardrail.

Result: On CIFAR-10, CIFAR-10N, and AG News, single runs and unpaired t-tests often suggest significance for 0.6-2.0 point improvements, but with only three seeds, the proposed protocol never declares significance in these settings.

Conclusion: Conservative evaluation with paired multi-seed testing is a safer default for assessing small performance gains under tight compute budgets to prevent over-claiming.

Abstract: Recent machine learning papers often report 1-2 percentage point improvements from a single run on a benchmark. These gains are highly sensitive to random seeds, data ordering, and implementation details, yet are rarely accompanied by uncertainty estimates or significance tests. It is therefore unclear when a reported +1-2% reflects a real algorithmic advance versus noise. We revisit this problem under realistic compute budgets, where only a few runs are affordable. We propose a simple, PC-friendly evaluation protocol based on paired multi-seed runs, bias-corrected and accelerated (BCa) bootstrap confidence intervals, and a sign-flip permutation test on per-seed deltas. The protocol is intentionally conservative and is meant as a guardrail against over-claiming. We instantiate it on CIFAR-10, CIFAR-10N, and AG News using synthetic no-improvement, small-gain, and medium-gain scenarios. Single runs and unpaired t-tests often suggest significant gains for 0.6-2.0 point improvements, especially on text. With only three seeds, our paired protocol never declares significance in these settings. We argue that such conservative evaluation is a safer default for small gains under tight budgets.
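
The three-seed result follows from the arithmetic of an exact sign-flip test, which the sketch below makes explicit. It assumes a two-sided test on the mean of the per-seed deltas. The abstract does not state the exact statistic or sidedness, and the delta values are invented.

```python
from itertools import product


def sign_flip_p_value(deltas):
    """Exact two-sided sign-flip permutation test on the mean paired delta."""
    n = len(deltas)
    observed = abs(sum(deltas) / n)
    hits = 0
    total = 0
    # Under the null, each per-seed delta is equally likely to have either sign.
    for signs in product((1, -1), repeat=n):
        stat = abs(sum(s * d for s, d in zip(signs, deltas)) / n)
        if stat >= observed - 1e-12:
            hits += 1
        total += 1
    return hits / total


# Hypothetical accuracy gains (method minus baseline) on matched seeds.
p_three_seeds = sign_flip_p_value([1.2, 0.9, 1.5])
p_six_seeds = sign_flip_p_value([1.2, 0.9, 1.5, 1.1, 0.8, 1.3])
```

With n seeds there are only 2^n sign patterns, so the smallest attainable two-sided p-value is 2/2^n. For three seeds that floor is 0.25 (0.125 one-sided), which can never fall below 0.05 however consistent the gains are. Six consistent seeds reach 2/64, about 0.031.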

[468] Beyond Components: Singular Vector-Based Interpretability of Transformer Circuits

Areeb Ahmad, Abhinav Joshi, Ashutosh Modi

Main category: cs.LG

TL;DR: Transformer language models have complex internal computations that are poorly understood. This paper introduces a fine-grained decomposition method using orthogonal singular directions to reveal independent subfunctions within attention heads and MLPs.

Details

Motivation: Existing interpretability methods treat transformer components as indivisible units, overlooking potential functional substructure within them. The authors aim to reveal more distributed and compositional computations.

Method: Decompose attention heads and MLP layers into orthogonal singular directions to identify superposed and independent computations within single components.

Result: Validated on standard tasks (IOI, GP, GT), showing that canonical functional heads encode multiple overlapping subfunctions aligned with distinct singular directions. Circuit elements show strong activation along specific low-rank directions.

Conclusion: Transformer computations are more distributed, structured, and compositional than previously assumed. This opens new avenues for fine-grained mechanistic interpretability.

Abstract: Transformer-based language models exhibit complex and distributed behavior, yet their internal computations remain poorly understood. Existing mechanistic interpretability methods typically treat attention heads and multilayer perceptron layers (MLPs), the building blocks of a transformer architecture, as indivisible units, overlooking the possibility of functional substructure learned within them. In this work, we introduce a more fine-grained perspective that decomposes these components into orthogonal singular directions, revealing superposed and independent computations within a single head or MLP. We validate our perspective on widely used standard tasks like Indirect Object Identification (IOI), Gender Pronoun (GP), and Greater Than (GT), showing that previously identified canonical functional heads, such as the name mover, encode multiple overlapping subfunctions aligned with distinct singular directions. Nodes in a computational graph that were previously identified as circuit elements show strong activation along specific low-rank directions, suggesting that meaningful computations reside in compact subspaces. While some directions remain challenging to interpret fully, our results highlight that transformer computations are more distributed, structured, and compositional than previously assumed. This perspective opens new avenues for fine-grained mechanistic interpretability and a deeper understanding of model internals.
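
As a minimal sketch of what decomposing a component into singular directions can look like, the code below takes the SVD of a random stand-in for one attention head's combined value-output matrix and splits the head's output into per-direction contributions. Which matrices the paper actually decomposes is not stated in the abstract, so the choice of W_V @ W_O, the shapes, and the variable names are all assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_head = 16, 4

# Random stand-ins for one head's value and output projections (assumed shapes).
W_V = rng.standard_normal((d_model, d_head))
W_O = rng.standard_normal((d_head, d_model))
W_OV = W_V @ W_O                      # (d_model, d_model), rank at most d_head

U, S, Vt = np.linalg.svd(W_OV)
rank = int(np.sum(S > 1e-10 * S[0]))  # number of non-negligible singular values

# With row-vector inputs, x @ W_OV = sum_i S[i] * (x . U[:, i]) * Vt[i].
x = rng.standard_normal(d_model)
contributions = [S[i] * (x @ U[:, i]) * Vt[i] for i in range(rank)]
reconstruction = np.sum(contributions, axis=0)
```

Each entry of `contributions` is the part of the head's output carried by one orthogonal direction. Ablating or inspecting these terms individually is the kind of sub-component analysis the abstract describes.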

[469] Terminal Velocity Matching

Linqi Zhou, Mathias Parger, Ayaan Haque, Jiaming Song

Main category: cs.LG

TL;DR: TVM is a flow matching generalization that enables high-fidelity one- and few-step generative modeling by modeling transitions between any diffusion timesteps and regularizing terminal behavior.

Details

Motivation: To achieve state-of-the-art performance in one/few-step generative modeling from scratch, overcoming limitations of traditional flow matching and enabling efficient training with diffusion transformers.

Method: Terminal Velocity Matching that models transitions between diffusion timesteps with terminal regularization, minimal architectural changes for stable training, and fused attention kernel for efficient Jacobian-Vector Product backward passes.

Result: Achieves 3.29 FID with 1 NFE and 1.99 FID with 4 NFEs on ImageNet-256x256; 4.32 FID with 1 NFE and 2.94 FID with 4 NFEs on ImageNet-512x512.

Conclusion: TVM represents state-of-the-art performance for one/few-step generative models from scratch, providing theoretical guarantees and practical efficiency improvements.

Abstract: We propose Terminal Velocity Matching (TVM), a generalization of flow matching that enables high-fidelity one- and few-step generative modeling. TVM models the transition between any two diffusion timesteps and regularizes its behavior at its terminal time rather than at the initial time. We prove that TVM provides an upper bound on the 2-Wasserstein distance between data and model distributions when the model is Lipschitz continuous. However, since Diffusion Transformers lack this property, we introduce minimal architectural changes that achieve stable, single-stage training. To make TVM efficient in practice, we develop a fused attention kernel that supports backward passes on Jacobian-Vector Products, which scale well with transformer architectures. On ImageNet-256x256, TVM achieves 3.29 FID with a single function evaluation (1 NFE) and 1.99 FID with 4 NFEs. It similarly achieves 4.32 1-NFE FID and 2.94 4-NFE FID on ImageNet-512x512, representing state-of-the-art performance for one/few-step models from scratch.

[470] Geometry of Decision Making in Language Models

Abhinav Joshi, Divyanshu Bhatt, Ashutosh Modi

Main category: cs.LG

TL;DR: LLMs learn to project linguistic inputs onto structured low-dimensional manifolds aligned with task decisions, with intrinsic dimension patterns showing early layer compression, middle layer expansion, and late layer convergence to decision-relevant representations.

Details

Motivation: To understand the internal decision-making processes and geometric properties of hidden representations in LLMs, which remain opaque despite strong generalization capabilities.

Method: Large-scale study with 28 open-weight transformer models, estimating intrinsic dimension across layers using multiple estimators and quantifying per-layer performance on multiple-choice question answering tasks.

Result: Consistent intrinsic dimension pattern: early layers operate on low-dimensional manifolds, middle layers expand the space, and later layers compress it again, converging to decision-relevant representations.

Conclusion: LLMs implicitly learn structured low-dimensional manifolds aligned with task-specific decisions, providing geometric insights into how generalization and reasoning emerge in language models.

Abstract: Large Language Models (LLMs) show strong generalization across diverse tasks, yet the internal decision-making processes behind their predictions remain opaque. In this work, we study the geometry of hidden representations in LLMs through the lens of intrinsic dimension (ID), focusing specifically on decision-making dynamics in a multiple-choice question answering (MCQA) setting. We perform a large-scale study with 28 open-weight transformer models and estimate ID across layers using multiple estimators, while also quantifying per-layer performance on MCQA tasks. Our findings reveal a consistent ID pattern across models: early layers operate on low-dimensional manifolds, middle layers expand this space, and later layers compress it again, converging to decision-relevant representations. Together, these results suggest LLMs implicitly learn to project linguistic inputs onto structured, low-dimensional manifolds aligned with task-specific decisions, providing new geometric insights into how generalization and reasoning emerge in language models.
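
The abstract does not name the estimators used. As an illustration of how intrinsic dimension can be estimated from a point cloud, the sketch below implements the maximum-likelihood form of the TwoNN estimator, which uses only the ratio of each point's second to first nearest-neighbor distance. The synthetic data (a 2D plane linearly embedded in 5D) stands in for one layer's hidden states and is an assumption of the example.

```python
import math
import random


def two_nn_dimension(points):
    """TwoNN maximum-likelihood estimate: N / sum(log(r2 / r1)) over all points."""
    log_ratios = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        r1, r2 = dists[0], dists[1]   # first and second nearest-neighbor distances
        log_ratios.append(math.log(r2 / r1))
    return len(points) / sum(log_ratios)


# Synthetic "hidden states": 5 ambient coordinates, but only 2 degrees of freedom.
rng = random.Random(0)
cloud = []
for _ in range(400):
    a, b = rng.random(), rng.random()
    cloud.append((a, b, a + b, a - b, 2.0 * a))

estimated_id = two_nn_dimension(cloud)
```

The estimate should land near 2 rather than 5, reflecting the manifold's dimension instead of the ambient one. Applying such an estimator layer by layer gives the kind of ID profile the abstract reports.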

[471] Scalable Data Attribution via Forward-Only Test-Time Inference

Sibo Ma, Julian Nyarko

Main category: cs.LG

TL;DR: A scalable data attribution method that eliminates expensive backward passes at inference by simulating training example influence during training, enabling real-time attribution for large models.

Details

Motivation: Classical influence-function methods are impractical for modern networks due to expensive backpropagation or Hessian inversion at inference time, limiting real-time data attribution in large-scale deployment.

Method: Simulates training examples’ parameter influence through short-horizon gradient propagation during training, then reads out attributions using only forward evaluations at inference, shifting computation from inference to training phase.

Result: Matches or surpasses state-of-the-art baselines like TRAK on standard attribution metrics (LOO and LDS) while offering orders-of-magnitude lower inference cost on MLP benchmarks.

Conclusion: Combines influence-function fidelity with first-order scalability, providing a theoretical framework for practical, real-time data attribution in large pretrained models.

Abstract: Data attribution seeks to trace model behavior back to the training examples that shaped it, enabling debugging, auditing, and data valuation at scale. Classical influence-function methods offer a principled foundation but remain impractical for modern networks because they require expensive backpropagation or Hessian inversion at inference. We propose a data attribution method that preserves the same first-order counterfactual target while eliminating per-query backward passes. Our approach simulates each training example’s parameter influence through short-horizon gradient propagation during training and later reads out attributions for any query using only forward evaluations. This design shifts computation from inference to simulation, reflecting real deployment regimes where a model may serve billions of user queries but originate from a fixed, finite set of data sources (for example, a large language model trained on diverse corpora while compensating a specific publisher such as the New York Times). Empirically, on standard MLP benchmarks, our estimator matches or surpasses state-of-the-art baselines such as TRAK on standard attribution metrics (LOO and LDS) while offering orders-of-magnitude lower inference cost. By combining influence-function fidelity with first-order scalability, our method provides a theoretical framework for practical, real-time data attribution in large pretrained models.

[472] Soft Adaptive Policy Optimization

Chang Gao, Chujie Zheng, Xiong-Hui Chen, Kai Dang, Shixuan Liu, Bowen Yu, An Yang, Shuai Bai, Jingren Zhou, Junyang Lin

Main category: cs.LG

TL;DR: SAPO is a soft adaptive policy optimization method that replaces hard clipping with temperature-controlled gating to stabilize RL training for LLMs while preserving learning signals.

Details

Motivation: Existing group-based policy optimization methods use hard clipping that causes unstable updates and suppresses useful learning signals, especially in Mixture-of-Experts models where token-level importance ratios have high variance.

Method: SAPO uses a smooth, temperature-controlled gate that adaptively attenuates off-policy updates instead of hard clipping. It maintains sequence-level coherence while being token-adaptive, selectively down-weighting only problematic tokens.

Result: Empirical results show improved training stability and higher Pass@1 performance on mathematical reasoning benchmarks. When applied to Qwen3-VL model series, SAPO yields consistent performance gains across diverse tasks and model sizes.

Conclusion: SAPO provides a more reliable, scalable, and effective optimization strategy for RL training of LLMs compared to existing methods like GSPO and GRPO.

Abstract: Reinforcement learning (RL) plays an increasingly important role in enhancing the reasoning capabilities of large language models (LLMs), yet stable and performant policy optimization remains challenging. Token-level importance ratios often exhibit high variance (a phenomenon exacerbated in Mixture-of-Experts models), leading to unstable updates. Existing group-based policy optimization methods, such as GSPO and GRPO, alleviate this problem via hard clipping, making it difficult to maintain both stability and effective learning. We propose Soft Adaptive Policy Optimization (SAPO), which replaces hard clipping with a smooth, temperature-controlled gate that adaptively attenuates off-policy updates while preserving useful learning signals. Compared with GSPO and GRPO, SAPO is both sequence-coherent and token-adaptive. Like GSPO, SAPO maintains sequence-level coherence, but its soft gating forms a continuous trust region that avoids the brittle hard clipping band used in GSPO. When a sequence contains a few highly off-policy tokens, GSPO suppresses all gradients for that sequence, whereas SAPO selectively down-weights only the offending tokens and preserves the learning signal from the near-on-policy ones, improving sample efficiency. Relative to GRPO, SAPO replaces hard token-level clipping with smooth, temperature-controlled scaling, enabling more informative and stable updates. Empirical results on mathematical reasoning benchmarks indicate that SAPO exhibits improved training stability and higher Pass@1 performance under comparable training budgets. Moreover, we employ SAPO to train the Qwen3-VL model series, demonstrating that SAPO yields consistent performance gains across diverse tasks and different model sizes. Overall, SAPO provides a more reliable, scalable, and effective optimization strategy for RL training of LLMs.
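
The contrast between hard clipping and a smooth gate can be sketched as a per-token multiplier on the gradient contribution. The abstract does not give SAPO's gate formula, so the sigmoid-shaped gate below is a hypothetical stand-in chosen only because it equals 1 on-policy and decays smoothly. The function names and default values are likewise assumptions.

```python
import math


def hard_clip_weight(ratio, advantage, eps=0.2):
    """PPO-style clipping: the token's gradient is kept (1) or dropped (0)."""
    if advantage > 0 and ratio > 1 + eps:
        return 0.0
    if advantage < 0 and ratio < 1 - eps:
        return 0.0
    return 1.0


def soft_gate_weight(ratio, tau=5.0):
    """Hypothetical smooth gate: 1 at ratio == 1, decaying as the token drifts off-policy."""
    p = 1.0 / (1.0 + math.exp(-tau * (ratio - 1.0)))
    return 4.0 * p * (1.0 - p)


# Importance ratios for four tokens in one sequence, all with positive advantage.
ratios = [1.0, 1.1, 1.5, 2.0]
hard = [hard_clip_weight(r, advantage=1.0) for r in ratios]
soft = [soft_gate_weight(r) for r in ratios]
```

In this sketch the hard rule discards the two most off-policy tokens outright, while the smooth gate keeps a shrinking but nonzero weight on them. The temperature `tau` controls how quickly that weight falls off.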

[473] Learning to Clean: Reinforcement Learning for Noisy Label Correction

Marzi Heidari, Hanping Zhang, Yuhong Guo

Main category: cs.LG

TL;DR: RLNLC frames noisy label correction as a reinforcement learning problem, using a policy network to iteratively correct labels and improve prediction model training.

Details

Motivation: Noisy labels severely degrade machine learning model performance, requiring effective correction methods to maintain prediction accuracy.

Method: Defines state space (data and labels), action space (label corrections), and reward mechanism. Uses actor-critic RL with deep feature representation policy network to iteratively correct noisy labels.

Result: Extensive experiments on benchmark datasets show RLNLC consistently outperforms state-of-the-art techniques for learning with noisy labels.

Conclusion: The reinforcement learning framework effectively addresses noisy label correction, demonstrating superior performance over existing methods.

Abstract: The challenge of learning with noisy labels is significant in machine learning, as it can severely degrade the performance of prediction models if not addressed properly. This paper introduces a novel framework that conceptualizes noisy label correction as a reinforcement learning (RL) problem. The proposed approach, Reinforcement Learning for Noisy Label Correction (RLNLC), defines a comprehensive state space representing data and their associated labels, an action space that indicates possible label corrections, and a reward mechanism that evaluates the efficacy of label corrections. RLNLC learns a deep feature representation based policy network to perform label correction through reinforcement learning, utilizing an actor-critic method. The learned policy is subsequently deployed to iteratively correct noisy training labels and facilitate the training of the prediction model. The effectiveness of RLNLC is demonstrated through extensive experiments on multiple benchmark datasets, where it consistently outperforms existing state-of-the-art techniques for learning with noisy labels.

[474] Provably Outlier-resistant Semi-parametric Regression for Transferable Calibration of Low-cost Air-quality Sensors

Divyansh Chaurasia, Manoj Daram, Roshan Kumar, Nihal Thukarama Rao, Vipul Sangode, Pranjal Srivastava, Avnish Tripathi, Shoubhik Chakraborty, Akanksha, Ambasht Kumar, Davender Sethi, Sachchida Nand Tripathi, Purushottam Kar

Main category: cs.LG

TL;DR: RESPIRE technique for calibrating low-cost CO sensors in air-quality monitoring networks, offering improved cross-site/season/sensor performance with outlier resistance and explainability.

Details

Motivation: Calibration of low-cost air-quality sensors against regulatory monitors is expensive and time-consuming, especially for large-scale deployments across diverse geographical areas.

Method: RESPIRE technique with training algorithm resistant to outliers and explainable model that can flag overfitting instances.

Result: Improved prediction performance in cross-site, cross-season, and cross-sensor settings compared to baseline calibration methods.

Conclusion: RESPIRE provides an effective solution for calibrating low-cost CO sensors in large-scale air-quality monitoring networks with advantages in robustness and explainability.

Abstract: We present a case study for the calibration of low-cost air-quality (LCAQ) CO sensors from one of the largest multi-site-multi-season-multi-sensor-multi-pollutant mobile air-quality monitoring network deployments in India. LCAQ sensors have been shown to play a critical role in the establishment of dense, expansive air-quality monitoring networks and combating elevated pollution levels. The calibration of LCAQ sensors against regulatory-grade monitors is an expensive, laborious and time-consuming process, especially when a large number of sensors are to be deployed in a geographically diverse layout. In this work, we present the RESPIRE technique to calibrate LCAQ sensors to detect ambient CO (Carbon Monoxide) levels. RESPIRE offers specific advantages over baseline calibration methods popular in the literature, such as improved prediction in cross-site, cross-season, and cross-sensor settings. RESPIRE offers a training algorithm that is provably resistant to outliers and an explainable model with the ability to flag instances of model overfitting. Empirical results are presented based on data collected during an extensive deployment spanning four sites, two seasons and six sensor packages. RESPIRE code is available at https://github.com/purushottamkar/respire.

[475] Mosaic Pruning: A Hierarchical Framework for Generalizable Pruning of Mixture-of-Experts Models

Wentao Hu, Mingkuan Zhao, Shuangyong Song, Xiaoyan Zhu, Xin Lai, Jiayin Wang

Main category: cs.LG

TL;DR: Mosaic Pruning (MoP) addresses the generalization gap in sparse Mixture-of-Experts models by selecting a functionally comprehensive set of experts that maintain performance across diverse domains, unlike single-corpus pruning methods.

Details

Motivation: Existing SMoE pruning methods suffer from catastrophic performance degradation when applied to new domains due to using single-corpus criteria, requiring costly re-pruning for each domain.

Method: MoP uses a “cluster-then-select” process with similarity metrics across task domains to cluster experts functionally, then selects the most representative expert from each cluster using an Activation Variability Score.

Result: MoP achieves 7.24% gain on general tasks and 8.92% on specialized tasks like math reasoning and code generation compared to prior methods.

Conclusion: Mosaic Pruning enables pruned SMoE models to handle diverse downstream tasks by retaining functionally complementary experts, much like mosaic tiles forming a complete picture of original capabilities.

Abstract: Sparse Mixture-of-Experts (SMoE) architectures have enabled a new frontier in scaling Large Language Models (LLMs), offering superior performance by activating only a fraction of their total parameters during inference. However, their practical deployment is severely hampered by substantial static memory overhead, as all experts must be loaded into memory. Existing post-training pruning methods, while reducing model size, often derive their pruning criteria from a single, general-purpose corpus. This leads to a critical limitation: a catastrophic performance degradation when the pruned model is applied to other domains, necessitating a costly re-pruning for each new domain. To address this generalization gap, we introduce Mosaic Pruning (MoP). The core idea of MoP is to construct a functionally comprehensive set of experts through a structured “cluster-then-select” process. This process leverages a similarity metric that captures expert performance across different task domains to functionally cluster the experts, and subsequently selects the most representative expert from each cluster based on our proposed Activation Variability Score. Unlike methods that optimize for a single corpus, our proposed Mosaic Pruning ensures that the pruned model retains a functionally complementary set of experts, much like the tiles of a mosaic that together form a complete picture of the original model’s capabilities, enabling it to handle diverse downstream tasks. Extensive experiments on various MoE models demonstrate the superiority of our approach. MoP significantly outperforms prior work, achieving a 7.24% gain on general tasks and 8.92% on specialized tasks like math reasoning and code generation.

[476] GED-Consistent Disentanglement of Aligned and Unaligned Substructures for Graph Similarity Learning

Zhentao Zhan, Xiaoliang Xu, Jingjing Wang, Junmei Wang

Main category: cs.LG

TL;DR: GCGSim is a novel graph similarity learning framework that addresses limitations in existing GNN-based approaches by focusing on graph-level matching and substructure-level edit costs, achieving state-of-the-art performance.

Details

Motivation: Existing GNN-based Graph Edit Distance (GED) approaches use node-centric matching which creates a mismatch with GED principles, leading to failure in capturing global structural correspondence and misattribution of edit costs.

Method: Proposes GCGSim framework with three core technical contributions focusing on graph-level matching and substructure-level edit costs, using disentangled substructure representations.

Result: Extensive experiments on four benchmark datasets show GCGSim achieves state-of-the-art performance in graph similarity computation.

Conclusion: GCGSim effectively learns disentangled and semantically meaningful substructure representations, providing a GED-consistent approach to graph similarity learning.

Abstract: Graph Similarity Computation (GSC) is a fundamental graph-related task where Graph Edit Distance (GED) serves as a prevalent metric. GED is determined by an optimal alignment between a pair of graphs that partitions each into aligned (zero-cost) and unaligned (cost-incurring) substructures. Due to the NP-hard nature of exact GED computation, GED approximations based on Graph Neural Networks (GNNs) have emerged. Existing GNN-based GED approaches typically learn node embeddings for each graph and then aggregate pairwise node similarities to estimate the final similarity. Despite their effectiveness, we identify a mismatch between this prevalent node-centric matching paradigm and the core principles of GED. This discrepancy leads to two critical limitations: (1) a failure to capture the global structural correspondence for optimal alignment, and (2) a misattribution of edit costs driven by spurious node-level signals. To address these limitations, we propose GCGSim, a GED-consistent graph similarity learning framework centering on graph-level matching and substructure-level edit costs. Specifically, we make three core technical contributions. Extensive experiments on four benchmark datasets show that GCGSim achieves state-of-the-art performance. Our comprehensive analyses further validate that the framework effectively learns disentangled and semantically meaningful substructure representations.

[477] Cisco Time Series Model Technical Report

Liang Gou, Archit Khare, Praneet Pabolu, Prachi Patel, Joseph Ross, Hercy Shen, Yuhan Song, Jingze Sun, Kristal Curtis, Vedant Dharnidharka, Abhinav Mathur, Hao Yang

Main category: cs.LG

TL;DR: Cisco Time Series Model is a univariate zero-shot forecasting foundation model that uses multiresolution input architecture to achieve superior performance on observability data while maintaining strong general forecasting capabilities.

Details

Motivation: To create a time series foundation model that can handle multiresolution input and excel specifically in observability domains while maintaining general forecasting performance.

Method: Modified a popular decoder-only time series model (TimesFM) with architectural innovation to accept multiresolution input, trained on over 300B unique data points (mostly from observability domain).

Result: Achieves superior performance on observability datasets while maintaining similar performance on general-purpose forecasting benchmarks; multiresolution structure enables more accurate predictions on long context input.

Conclusion: The multiresolution decoder-only architecture successfully creates a specialized yet versatile time series foundation model that excels in observability applications without compromising general forecasting capabilities.

Abstract: We introduce the Cisco Time Series Model, a univariate zero-shot forecaster. This time series foundation model is the result of a general architectural innovation to a time series model enabling it to accept multiresolution input, applied to a popular decoder-only time series model (TimesFM). The resulting multiresolution decoder-only model is trained on over 300B unique data points, with more than half coming from the observability domain. Quantitative and qualitative evaluations demonstrate that the resulting model achieves superior performance on observability datasets while retaining very similar performance on a standard general-purpose forecasting benchmark (GIFT-Eval), and suggest that the multiresolution structure enables the model to make more accurate predictions on long context input.

[478] SX-GeoTree: Self-eXplaining Geospatial Regression Tree Incorporating the Spatial Similarity of Feature Attributions

Chaogui Kang, Lijian Luo, Qingfeng Guan, Yu Liu

Main category: cs.LG

TL;DR: SX-GeoTree is a self-explaining geospatial regression tree that simultaneously optimizes for prediction accuracy, spatial dependence control, and explanation robustness through a novel network-based approach.

Details

Motivation: Traditional decision trees struggle with capturing spatial dependence and producing locally stable explanations, limiting their effectiveness for geospatial prediction tasks.

Method: Integrates three objectives during recursive splitting: impurity reduction (MSE), spatial residual control (global Moran’s I), and explanation robustness via modularity maximization on a consensus network combining GWR coefficient distances and SHAP attribution distances.

Result: Maintains competitive predictive accuracy (within 0.01 R² of decision trees) while improving residual spatial evenness and doubling attribution consensus (modularity: Fujian 0.19 vs 0.09; Seattle 0.10 vs 0.05).

Conclusion: The framework demonstrates how spatial similarity can be embedded in interpretable models, advancing trustworthy geospatial machine learning and offering a transferable template for domain-aware explainability.

Abstract: Decision trees remain central for tabular prediction but struggle with (i) capturing spatial dependence and (ii) producing locally stable (robust) explanations. We present SX-GeoTree, a self-explaining geospatial regression tree that integrates three coupled objectives during recursive splitting: impurity reduction (MSE), spatial residual control (global Moran’s I), and explanation robustness via modularity maximization on a consensus similarity network formed from (a) geographically weighted regression (GWR) coefficient distances (stimulus-response similarity) and (b) SHAP attribution distances (explanatory similarity). We recast local Lipschitz continuity of feature attributions as a network community preservation problem, enabling scalable enforcement of spatially coherent explanations without per-sample neighborhood searches. Experiments on two exemplar tasks (county-level GDP in Fujian, n=83; point-wise housing prices in Seattle, n=21,613) show SX-GeoTree maintains competitive predictive accuracy (within 0.01 $R^{2}$ of decision trees) while improving residual spatial evenness and doubling attribution consensus (modularity: Fujian 0.19 vs 0.09; Seattle 0.10 vs 0.05). Ablation confirms Moran’s I and modularity terms are complementary; removing either degrades both spatial residual structure and explanation stability. The framework demonstrates how spatial similarity - extended beyond geometric proximity through GWR-derived local relationships - can be embedded in interpretable models, advancing trustworthy geospatial machine learning and offering a transferable template for domain-aware explainability.
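
The spatial residual term above relies on global Moran’s I, a standard measure of spatial autocorrelation. As a rough illustration only (not the authors’ implementation; the toy weight matrix and residual values are made up), it can be computed as follows:

```python
def morans_i(values, weights):
    """Global Moran's I for values x_i and a spatial weight matrix w_ij."""
    n = len(values)
    mean = sum(values) / n
    dev = [v - mean for v in values]
    total_weight = sum(sum(row) for row in weights)
    # Weighted sum of cross-products of deviations over all ordered pairs (i, j).
    cross = sum(weights[i][j] * dev[i] * dev[j] for i in range(n) for j in range(n))
    variance_sum = sum(d * d for d in dev)
    return (n / total_weight) * (cross / variance_sum)

# Hypothetical toy data: four locations on a line; adjacent locations get weight 1.
line_weights = [
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
]
smooth_residuals = [1.0, 2.0, 3.0, 4.0]         # similar values sit next to each other
alternating_residuals = [1.0, -1.0, 1.0, -1.0]  # neighbours always disagree
```

Positive values indicate that similar residuals cluster in space, negative values that neighbours tend to differ. With no spatial autocorrelation the expected value is -1/(n-1), so a tree that controls spatial residual structure would aim to keep the statistic close to that baseline.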

[479] Accelerating Wireless Distributed Learning via Hybrid Split and Federated Learning Optimization

Kun Guo, Xuefei Li, Xijun Wang, Howard H. Yang, Wei Feng, Tony Q. S. Quek

Main category: cs.LG

TL;DR: HSFL combines federated and split learning to accelerate distributed training by optimizing learning mode selection, batch size, and resource allocation to reduce delay while maintaining accuracy.

Details

Motivation: To leverage the advantages of both FL (low latency) and SL (high accuracy) by developing a hybrid approach that addresses their respective limitations and optimizes overall learning performance.

Method: Proposed a two-stage solution: 1) block coordinate descent for relaxed optimization problem, 2) rounding algorithm for integer batch sizes with near-optimal performance, while analyzing convergence and formulating delay minimization.

Result: Experimental results show the approach significantly accelerates convergence to target accuracy compared to existing methods.

Conclusion: HSFL with optimized learning mode selection, batch size, and resource allocation effectively reduces overall learning delay while maintaining model accuracy.

Abstract: Federated learning (FL) and split learning (SL) are two effective distributed learning paradigms in wireless networks, enabling collaborative model training across mobile devices without sharing raw data. While FL supports low-latency parallel training, it may converge to a less accurate model. In contrast, SL achieves higher accuracy through sequential training but suffers from increased delay. To leverage the advantages of both, hybrid split and federated learning (HSFL) allows some devices to operate in FL mode and others in SL mode. This paper aims to accelerate HSFL by addressing three key questions: 1) How does learning mode selection affect overall learning performance? 2) How does it interact with batch size? 3) How can these hyperparameters be jointly optimized alongside communication and computational resources to reduce overall learning delay? We first analyze convergence, revealing the interplay between learning mode and batch size. Next, we formulate a delay minimization problem and propose a two-stage solution: a block coordinate descent method for a relaxed problem to obtain a locally optimal solution, followed by a rounding algorithm to recover integer batch sizes with near-optimal performance. Experimental results demonstrate that our approach significantly accelerates convergence to the target accuracy compared to existing methods.

[480] Frailty-Aware Transformer for Recurrent Survival Modeling of Driver Retention in Ride-Hailing Platforms

Shuoyan Xu, Yu Zhang, Eric J. Miller

Main category: cs.LG

TL;DR: A Transformer-based survival analysis framework called FACT is proposed to model ride-hailing driver idle behavior as recurrent events, outperforming existing methods in predicting driver churn risk.

Details

Motivation: Ride-hailing platforms have high-frequency, behavior-driven environments where survival analysis for recurrent events remains unexplored for driver behavior modeling.

Method: Formulates idle behavior as recurrent survival process using large-scale data, proposes Transformer framework with causal masking for temporal dependencies and driver embeddings for heterogeneity.

Result: FACT achieves highest time-dependent C-indices and lowest Brier Scores on Toronto ride-hailing data, outperforming classical and deep learning survival models.

Conclusion: The approach enables more accurate risk estimation, supports platform retention strategies, and provides policy-relevant insights for ride-hailing platforms.

Abstract: Ride-hailing platforms are characterized by high-frequency, behavior-driven environments. Although survival analysis has been applied to recurrent events in other domains, its use in modeling ride-hailing driver behavior remains largely unexplored. This study formulates idle behavior as a recurrent survival process using large-scale platform data and proposes a Transformer-based framework that captures long-term temporal dependencies with causal masking and incorporates driver-specific embeddings to model latent heterogeneity. Results on Toronto ride-hailing data demonstrate that the proposed Frailty-Aware Cox Transformer (FACT) achieves the highest time-dependent C-indices and lowest Brier Scores, outperforming classical and deep learning survival models. This approach enables more accurate risk estimation, supports platform retention strategies, and provides policy-relevant insights.
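
Causal masking, mentioned above, means that each position in a sequence may attend only to itself and to earlier positions. The following is a generic sketch of that idea, not the FACT architecture; the scores and the three-episode sequence are hypothetical:

```python
import math

def causal_attention_weights(scores):
    """Row-wise softmax over scores[t][s], with every future position s > t masked out."""
    weights = []
    for t, row in enumerate(scores):
        visible = row[: t + 1]                      # positions 0..t only
        peak = max(visible)                         # subtract the max for numerical stability
        exps = [math.exp(s - peak) for s in visible]
        total = sum(exps)
        # Masked (future) positions receive exactly zero weight.
        weights.append([e / total for e in exps] + [0.0] * (len(row) - t - 1))
    return weights

# Hypothetical raw attention scores for a sequence of three idle episodes.
toy_scores = [
    [0.0, 5.0, 5.0],
    [1.0, 1.0, 9.0],
    [0.0, 0.0, 0.0],
]
toy_weights = causal_attention_weights(toy_scores)
```

Large scores on future positions (such as the 9.0 in the second row) have no effect on the result. In a recurrent survival setting this keeps a risk estimate for one episode from using information about later episodes.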

[481] Adaptivity and Universality: Problem-dependent Universal Regret for Online Convex Optimization

Peng Zhao, Yu-Hu Yan, Hang Yu, Zhi-Hua Zhou

Main category: cs.LG

TL;DR: UniGrad is a universal online learning method that achieves both universality (optimal regret for convex, exp-concave, and strongly convex functions) and adaptivity to gradient variation V_T, with two variants offering different trade-offs.

Details

Motivation: Existing universal online learning methods lack problem-dependent adaptivity to gradient variation V_T, which is crucial for applications like stochastic optimization and fast-rate convergence in games.

Method: Proposes UniGrad with two variants: UniGrad.Correct (preserves RVU property) and UniGrad.Bregman (optimal regret), plus UniGrad++ for computational efficiency with only 1 gradient query per round via surrogate optimization.

Result: UniGrad achieves O(log V_T) regret for strongly convex functions, O(d log V_T) for exp-concave functions, and O(√(V_T log V_T)) (UniGrad.Correct) or O(√V_T) (UniGrad.Bregman) for convex functions, using a meta algorithm with O(log T) base learners.

Conclusion: UniGrad bridges the gap between universality and adaptivity in online learning, providing gradient-variation-adaptive regret bounds while maintaining computational efficiency through UniGrad++.

Abstract: Universal online learning aims to achieve optimal regret guarantees without requiring prior knowledge of the curvature of online functions. Existing methods have established minimax-optimal regret bounds for universal online learning, where a single algorithm can simultaneously attain $\mathcal{O}(\sqrt{T})$ regret for convex functions, $\mathcal{O}(d \log T)$ for exp-concave functions, and $\mathcal{O}(\log T)$ for strongly convex functions, where $T$ is the number of rounds and $d$ is the dimension of the feasible domain. However, these methods still lack problem-dependent adaptivity. In particular, no universal method provides regret bounds that scale with the gradient variation $V_T$, a key quantity that plays a crucial role in applications such as stochastic optimization and fast-rate convergence in games. In this work, we introduce UniGrad, a novel approach that achieves both universality and adaptivity, with two distinct realizations: UniGrad.Correct and UniGrad.Bregman. Both methods achieve universal regret guarantees that adapt to gradient variation, simultaneously attaining $\mathcal{O}(\log V_T)$ regret for strongly convex functions and $\mathcal{O}(d \log V_T)$ regret for exp-concave functions. For convex functions, the regret bounds differ: UniGrad.Correct achieves an $\mathcal{O}(\sqrt{V_T \log V_T})$ bound while preserving the RVU property that is crucial for fast convergence in online games, whereas UniGrad.Bregman achieves the optimal $\mathcal{O}(\sqrt{V_T})$ regret bound through a novel design. Both methods employ a meta algorithm with $\mathcal{O}(\log T)$ base learners, which naturally requires $\mathcal{O}(\log T)$ gradient queries per round. To enhance computational efficiency, we introduce UniGrad++, which retains the regret while reducing the gradient query to just $1$ per round via surrogate optimization. We further provide various implications.
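
The abstract does not spell out the gradient variation $V_T$. The definition commonly used in the online learning literature is given below; this is an assumption, and the paper’s exact form (norm, domain, indexing) may differ. The constant $G$ is a hypothetical bound on gradient norms, introduced only to relate $V_T$ to $T$:

```latex
% Gradient variation over T rounds on a feasible domain \mathcal{X}
V_T = \sum_{t=2}^{T} \sup_{x \in \mathcal{X}} \left\| \nabla f_t(x) - \nabla f_{t-1}(x) \right\|_2^2

% If \|\nabla f_t(x)\|_2 \le G for all t and x, each summand is at most (2G)^2, so
V_T \le 4 G^2 (T - 1) = \mathcal{O}(T)
```

Under this reading, bounds stated in terms of $V_T$ are never worse than their worst-case counterparts in $T$ (up to constants), and they become much smaller when consecutive loss functions change slowly.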

[482] Optimize Flip Angle Schedules In MR Fingerprinting Using Reinforcement Learning

Shenjun Zhong, Zhifeng Chen, Zhaolin Chen

Main category: cs.LG

TL;DR: RL framework optimizes flip-angle schedules in MRF, producing non-periodic patterns that improve fingerprint separability and potentially reduce acquisition time.

Details

Motivation: MRF sequence design is complex and high-dimensional; RL can automate parameter selection to maximize fingerprint distinguishability across the parameter space.

Method: Reinforcement learning framework for optimizing flip-angle schedules in Magnetic Resonance Fingerprinting.

Result: Learned schedule exhibits non-periodic patterns that enhance fingerprint separability and may enable reduction in repetition time.

Conclusion: RL-optimized flip-angle schedules improve MRF performance and potentially accelerate acquisitions through reduced repetition time requirements.

Abstract: Magnetic Resonance Fingerprinting (MRF) leverages transient-state signal dynamics generated by tunable acquisition parameters, which makes designing an optimal, robust sequence a complex, high-dimensional sequential decision problem; optimizing the flip angle, one of the key parameters, is one example. Reinforcement learning (RL) offers a promising approach to automate parameter selection and optimize pulse sequences that maximize the distinguishability of fingerprints across the parameter space. In this work, we introduce an RL framework for optimizing the flip-angle schedule in MRF and demonstrate a learned schedule exhibiting non-periodic patterns that enhances fingerprint separability. Additionally, we observe that the RL-optimized schedule may enable a reduction in the number of repetition times, potentially accelerating MRF acquisitions.

[483] Differential Smoothing Mitigates Sharpening and Improves LLM Reasoning

Jingchu Gai, Guanning Zeng, Huaqing Zhang, Aditi Raghunathan

Main category: cs.LG

TL;DR: RL fine-tuning causes diversity collapse in LLMs. The paper introduces differential smoothing, a principled method that provably improves both correctness and diversity, outperforming vanilla RL and entropy-based heuristics.

Details

Motivation: Prior heuristic methods for addressing diversity collapse in RL fine-tuning are ad hoc, trade off correctness for diversity, and their effectiveness varies across tasks. The paper aims to provide a rigorous foundation for understanding and solving this problem.

Method: Differential smoothing - a principled reward modification method that only needs to be applied on correct trajectories, based on formal analysis of selection and reinforcement bias causing diversity collapse.

Result: Extensive experiments with 1B to 7B parameter models across domains show consistent gains. Differential smoothing improves both Pass@1 and Pass@k metrics, with up to 6.7% improvements on AIME24 dataset.

Conclusion: Differential smoothing is universally superior to existing heuristics, provably improving both correctness and diversity, while the theory precisely characterizes when existing methods help and why they fail.

Abstract: It is widely recognized that reinforcement learning (RL) fine-tuning of large language models often leads to diversity collapse, where outputs lack variety. Prior work has proposed a range of heuristics to counteract this effect, but these methods are ad hoc: they frequently trade off correctness for diversity, their effectiveness varies across tasks, and in some cases they even contradict one another. In this work, we place these observations on a rigorous foundation. We first provide a formal proof of why RL fine-tuning exhibits diversity collapse via a selection and reinforcement bias. Next, we make a key observation that any reward modification to address diversity collapse only needs to be applied on the correct trajectories. Building directly on this analysis, we introduce a principled method, differential smoothing, that provably improves both correctness and diversity, outperforming vanilla RL as well as widely used entropy-based heuristics. Our theory precisely characterizes when existing heuristics help and why they fail, while showing that differential smoothing is universally superior. Extensive experiments with models from 1B to 7B parameters, across domains including CountDown and real-world mathematical reasoning, demonstrate consistent gains. Differential smoothing improves both Pass@1 and Pass@k, with up to 6.7% improvements on the AIME24 dataset.
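
Pass@1 and Pass@k are the two metrics used above to track correctness and diversity. The summary does not say how the paper computes them; a widely used choice is the combinatorial estimator from code-generation evaluation, sketched here as an assumption:

```python
from math import comb

def pass_at_k(n, c, k):
    """Estimate the probability that at least one of k samples is correct,
    given that c of the n generated samples were correct."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) counts the k-subsets that contain no correct sample;
    # it is 0 when fewer than k incorrect samples exist.
    return 1.0 - comb(n - c, k) / comb(n, k)
```

For example, `pass_at_k(10, 3, 1)` is the plain fraction of correct samples, while larger `k` rewards a model that spreads probability over several distinct correct solutions. That is why a collapse in output diversity can leave Pass@1 intact and still hurt Pass@k.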

[484] Hierarchical Spatio-Temporal Attention Network with Adaptive Risk-Aware Decision for Forward Collision Warning in Complex Scenarios

Haoran Hu, Junren Shi, Shuo Jiang, Kun Cheng, Xia Yang, Changhao Piao

Main category: cs.LG

TL;DR: An integrated Forward Collision Warning framework combining Hierarchical Spatio-Temporal Attention Network for efficient trajectory prediction and Dynamic Risk Threshold Adjustment for adaptive warning generation, achieving real-time performance with high accuracy.

Details

Motivation: Current FCW systems struggle with balancing precise multi-agent interaction modeling and real-time decision adaptability, leading to high computational costs for edge deployment and unreliable simplified interaction models with high false alarm rates.

Method: Proposes HSTAN (Hierarchical Spatio-Temporal Attention Network) with decoupled architecture: Graph Attention Network for spatial modeling and cascaded GRU with self-attention for temporal modeling, combined with Dynamic Risk Threshold Adjustment algorithm using conformalized quantile regression and physics-informed risk potential function.

Result: Achieves 12.3 ms inference time (73% faster than Transformer), 0.73m ADE (42.2% better than Social_LSTM), 91.3% coverage at 90% confidence, F1 score of 0.912, 8.2% false alarm rate, and 2.8 seconds warning lead time on multi-scenario datasets.

Conclusion: The framework demonstrates superior performance and practical deployment feasibility in complex environments, effectively addressing computational complexity and modeling insufficiency while significantly reducing false alarms.

Abstract: Forward Collision Warning systems are crucial for vehicle safety and autonomous driving, yet current methods often fail to balance precise multi-agent interaction modeling with real-time decision adaptability, evidenced by the high computational cost for edge deployment and the unreliability stemming from simplified interaction models. To overcome these dual challenges (computational complexity and modeling insufficiency), along with the high false alarm rates of traditional static-threshold warnings, this paper introduces an integrated FCW framework that pairs a Hierarchical Spatio-Temporal Attention Network with a Dynamic Risk Threshold Adjustment algorithm. HSTAN employs a decoupled architecture (Graph Attention Network for spatial, cascaded GRU with self-attention for temporal) to achieve superior performance and efficiency, requiring only 12.3 ms inference time (73% faster than Transformer methods) and reducing the Average Displacement Error (ADE) to 0.73 m (42.2% better than Social_LSTM) on the NGSIM dataset. Furthermore, Conformalized Quantile Regression enhances reliability by generating prediction intervals (91.3% coverage at 90% confidence), which the DTRA module then converts into timely warnings via a physics-informed risk potential function and an adaptive threshold mechanism inspired by statistical process control. Tested across multi-scenario datasets, the complete system demonstrates high efficacy, achieving an F1 score of 0.912, a low false alarm rate of 8.2%, and an ample warning lead time of 2.8 seconds, validating the framework’s superior performance and practical deployment feasibility in complex environments.
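
Conformalized Quantile Regression, named above as the source of the prediction intervals, adjusts a quantile model’s interval by a margin learned on held-out calibration data. This is a minimal sketch of the generic split-conformal recipe, not the paper’s pipeline; the calibration numbers are hypothetical:

```python
import math

def cqr_margin(cal_lower, cal_upper, cal_targets, alpha):
    """Conformal margin from a held-out calibration set of quantile predictions."""
    scores = sorted(
        max(lo - y, y - hi)   # positive when y falls outside [lo, hi]
        for lo, hi, y in zip(cal_lower, cal_upper, cal_targets)
    )
    n = len(scores)
    rank = math.ceil((n + 1) * (1 - alpha))   # finite-sample corrected rank
    return scores[rank - 1] if rank <= n else math.inf

def cqr_interval(lower, upper, margin):
    """Widen (or shrink, if margin < 0) a predicted quantile interval."""
    return lower - margin, upper + margin

# Hypothetical calibration data: the base model always predicts the interval [0, 1].
margin = cqr_margin([0, 0, 0, 0], [1, 1, 1, 1], [0.5, 1.2, -0.1, 2.0], alpha=0.5)
interval = cqr_interval(0.0, 1.0, margin)
```

Under exchangeability of calibration and test data, intervals built this way cover the true value with probability at least 1 - alpha. That is the kind of guarantee behind a figure such as 91.3% coverage at a 90% target.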

[485] Prompt Fairness: Sub-group Disparities in LLMs

Meiyu Zhong, Noel Teku, Ravi Tandon

Main category: cs.LG

TL;DR: LLMs exhibit prompt fairness issues where different phrasings of the same question elicit varying response quality across demographic subgroups, with information-theoretic metrics revealing structural inequities.

Details

Motivation: To address the problem of prompt fairness in LLMs where different users' prompt styles lead to inconsistent response quality despite asking the same underlying question, particularly across demographic subgroups.

Method: Proposed information-theoretic metrics to measure bias in two dimensions: subgroup sensitivity (within-group variability) and cross-group consistency (between-group variability). Applied interventions including majority voting across multiple generations and prompt neutralization.

Result: Before mitigation, cross-group divergence values reached 0.28 (typically 0.14-0.22). After neutralization and multi-generation strategy, divergences decreased to 0.22 maximum with many falling to 0.17 or below, showing improved stability and consistency.

Conclusion: The proposed interventions effectively mitigate prompt sensitivity disparities across demographic subgroups, enhancing response stability and fairness in LLM outputs.

Abstract: Large Language Models (LLMs), though shown to be effective in many applications, can vary significantly in their response quality. In this paper, we investigate this problem of prompt fairness: specifically, the phrasing of a prompt by different users/styles, despite the same question being asked in principle, may elicit different responses from an LLM. To quantify this disparity, we propose to use information-theoretic metrics that can capture two dimensions of bias: subgroup sensitivity, the variability of responses within a subgroup, and cross-group consistency, the variability of responses across subgroups. Our empirical analysis reveals that certain demographic subgroups experience both higher internal variability and greater divergence from others, indicating structural inequities in model behavior. To mitigate these disparities, we propose practical interventions, including majority voting across multiple generations and prompt neutralization, which together improve response stability and enhance fairness across user populations. In the experiments, we observe clear prompt sensitivity disparities across demographic subgroups: before mitigation, cross-group divergence values reach 0.28 and typically fall in the 0.14 to 0.22 range. After applying our neutralization and multi-generation strategy, these divergences consistently decrease, with the largest gap reduced to 0.22 and many distances falling to 0.17 or below, indicating more stable and consistent outputs across subgroups.
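
Majority voting across multiple generations, one of the interventions above, can be illustrated with a few lines of code. This is only a sketch: the `normalize` step and the sample answers are hypothetical, and the paper may aggregate responses differently:

```python
from collections import Counter

def majority_vote(generations, normalize=lambda s: s.strip().lower()):
    """Return the most frequent answer among several sampled generations."""
    if not generations:
        raise ValueError("need at least one generation")
    counts = Counter(normalize(g) for g in generations)
    answer, _ = counts.most_common(1)[0]
    return answer

# Hypothetical answers sampled for one prompt phrasing.
sampled = ["Paris", " paris", "Lyon", "PARIS ", "Lyon"]
```

Voting over several samples reduces the variance of the final answer for any single phrasing, which is one plausible reason it narrows the gap between subgroups whose prompts are otherwise more sensitive to sampling noise.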

[486] ParaBlock: Communication-Computation Parallel Block Coordinate Federated Learning for Large Language Models

Yujia Wang, Yuanpu Cao, Jinghui Chen

Main category: cs.LG

TL;DR: ParaBlock introduces parallel communication and computation threads to improve communication efficiency in federated learning for large language models, achieving the same convergence rate as standard methods while significantly reducing communication latency.

Details

Motivation: Federated learning for large language models faces significant communication challenges because even single blocks contain many parameters, creating substantial latency especially for resource-constrained clients.

Method: ParaBlock establishes two parallel threads for communication and computation to enhance communication efficiency during federated training/fine-tuning of LLMs.

Result: Theoretical analysis shows ParaBlock achieves the same convergence rate as standard federated block coordinate descent methods. Empirical evaluations on LLM fine-tuning for instruction following and mathematical reasoning confirm strong performance with significantly improved communication efficiency.

Conclusion: ParaBlock effectively addresses communication bottlenecks in federated LLM training while maintaining model performance and convergence properties.

Abstract: Federated learning (FL) has been extensively studied as a privacy-preserving training paradigm. Recently, the federated block coordinate descent scheme has become a popular option for training large-scale models, as it allows clients to train only a subset of the model locally instead of the entire model. However, in the era of large language models (LLMs), even a single block can contain a significant number of parameters, causing substantial communication latency, particularly for resource-constrained clients. To address this challenge in federated training/fine-tuning of LLMs, we propose ParaBlock, a novel approach that establishes two parallel threads for communication and computation to enhance communication efficiency. We theoretically prove that the proposed ParaBlock achieves the same convergence rate as the standard federated block coordinate descent methods. Empirical evaluations on fine-tuning LLMs on general instruction following and mathematical reasoning confirm that ParaBlock not only maintains strong performance but also significantly improves communication efficiency.
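
The idea of running communication and computation in two parallel threads can be pictured with a generic producer-consumer pattern. The sketch below is not the ParaBlock algorithm, whose details are not given in this summary; `train_block` and `upload_block` are hypothetical callables standing in for local training and for sending a block to the server:

```python
import queue
import threading

def run_overlapped(num_rounds, train_block, upload_block):
    """Overlap local computation with communication using two threads."""
    outbox = queue.Queue()
    uploaded = []

    def communicator():
        # Communication thread: sends finished blocks while training continues.
        while True:
            item = outbox.get()
            if item is None:          # sentinel: no more blocks to send
                break
            uploaded.append(upload_block(item))

    comm_thread = threading.Thread(target=communicator)
    comm_thread.start()
    for round_index in range(num_rounds):
        # Computation thread (here, the main thread): keeps training the next
        # block instead of waiting for earlier uploads to complete.
        outbox.put(train_block(round_index))
    outbox.put(None)
    comm_thread.join()
    return uploaded
```

With a single consumer and a FIFO queue, blocks are uploaded in the order they were produced. When an upload takes about as long as a training step, the wall-clock time per round approaches the larger of the two costs rather than their sum.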

[487] Stragglers Can Contribute More: Uncertainty-Aware Distillation for Asynchronous Federated Learning

Yujia Wang, Fenglong Ma, Jinghui Chen

Main category: cs.LG

TL;DR: FedEcho is a novel asynchronous federated learning framework that uses uncertainty-aware distillation to handle outdated updates from straggler clients and bias from faster clients under data heterogeneity.

Details

Motivation: Asynchronous FL faces challenges with outdated updates from slow clients degrading model performance and bias from faster clients dominating learning, especially with heterogeneous data. Existing methods only address one issue, creating conflicts.

Method: FedEcho incorporates uncertainty-aware distillation where the server assesses prediction reliability from straggler clients and dynamically adjusts their influence based on estimated uncertainty.

Result: Extensive experiments show FedEcho consistently outperforms existing asynchronous FL baselines, achieving robust performance without accessing private client data.

Conclusion: FedEcho effectively mitigates negative impacts of outdated updates and data heterogeneity by prioritizing more certain predictions while still leveraging diverse information from all clients.

Abstract: Asynchronous federated learning (FL) has recently gained attention for its enhanced efficiency and scalability, enabling local clients to send model updates to the server at their own pace without waiting for slower participants. However, such a design encounters significant challenges, such as the risk of outdated updates from straggler clients degrading the overall model performance and the potential bias introduced by faster clients dominating the learning process, especially under heterogeneous data distributions. Existing methods typically address only one of these issues, creating a conflict where mitigating the impact of outdated updates can exacerbate the bias created by faster clients, and vice versa. To address these challenges, we propose FedEcho, a novel framework that incorporates uncertainty-aware distillation to enhance asynchronous FL performance under large asynchronous delays and data heterogeneity. Specifically, uncertainty-aware distillation enables the server to assess the reliability of predictions made by straggler clients, dynamically adjusting the influence of these predictions based on their estimated uncertainty. By prioritizing more certain predictions while still leveraging the diverse information from all clients, FedEcho effectively mitigates the negative impacts of outdated updates and data heterogeneity. Through extensive experiments, we demonstrate that FedEcho consistently outperforms existing asynchronous federated learning baselines, achieving robust performance without requiring access to private client data.
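
One way to picture "adjusting the influence of predictions based on estimated uncertainty" is to weight each client's class probabilities by a confidence score before forming a distillation target. The sketch below assumes normalized entropy as the uncertainty estimate and a weighted average as the combination rule; both are illustrative assumptions, and the abstract does not specify FedEcho's actual estimator or loss.

```python
import math


def entropy(probs):
    """Shannon entropy (natural log) of a probability vector."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def uncertainty_weighted_target(client_probs):
    """Combine per-client class probabilities, down-weighting uncertain ones.

    Assumption: confidence = 1 - normalized entropy. This is a stand-in for
    whatever uncertainty estimate the actual method uses.
    """
    num_classes = len(client_probs[0])
    max_entropy = math.log(num_classes)
    weights = [max(0.0, 1.0 - entropy(p) / max_entropy) for p in client_probs]
    total = sum(weights)
    if total <= 0:  # every client is maximally uncertain: use a plain mean
        weights = [1.0] * len(client_probs)
        total = float(len(client_probs))
    return [
        sum(w * p[c] for w, p in zip(weights, client_probs)) / total
        for c in range(num_classes)
    ]
```

A client that outputs a near-uniform distribution contributes almost nothing to the target, while a confident client dominates it.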

[488] Rethinking Semi-Supervised Node Classification with Self-Supervised Graph Clustering

Songbo Wang, Renchi Yang, Yurui Lai, Xiaoyang Lin, Tsz Nam Chan

Main category: cs.LG

TL;DR: NCGC integrates self-supervised graph clustering with semi-supervised classification to improve node classification by leveraging community structures in graphs.

Details

Motivation: Existing GNN methods don't fully exploit the abundant signals from tightly-knit communities/clusters in real graphs, which could compensate for label scarcity in semi-supervised node classification.

Method: Developed soft orthogonal GNNs (SOGNs) that unify GNN and spectral clustering objectives, combined with self-supervised clustering module using two clustering objectives and Sinkhorn-Knopp normalization for balanced pseudo-labels, trained via multi-task objective.

Result: NCGC consistently and considerably outperforms popular GNN models and recent baselines for semi-supervised node classification on seven real graphs across various GNN backbones.

Conclusion: Integrating self-supervised clustering with semi-supervised classification through a unified framework effectively leverages community structures to enhance model capacity and performance in label-scarce scenarios.

Abstract: The emergence of graph neural networks (GNNs) has offered a powerful tool for semi-supervised node classification tasks. Subsequent studies have achieved further improvements through refining the message passing schemes in GNN models or exploiting various data augmentation techniques to mitigate limited supervision. In real graphs, nodes often tend to form tightly-knit communities/clusters, which embody abundant signals for compensating label scarcity in semi-supervised node classification but are not explored in prior methods. Inspired by this, this paper presents NCGC, which integrates self-supervised graph clustering and semi-supervised classification into a unified framework. First, we theoretically unify the optimization objectives of GNNs and spectral graph clustering, and based on that, develop soft orthogonal GNNs (SOGNs) that leverage a refined message passing paradigm to generate node representations for both classification and clustering. On top of that, NCGC includes a self-supervised graph clustering module that enables the training of SOGNs for learning representations of unlabeled nodes in a self-supervised manner. In particular, this component comprises two non-trivial clustering objectives and a Sinkhorn-Knopp normalization that transforms predicted cluster assignments into balanced soft pseudo-labels. By combining the foregoing clustering module with the classification model using a multi-task objective containing the supervised classification loss on labeled data and self-supervised clustering loss on unlabeled data, NCGC promotes synergy between them and achieves enhanced model capacity. Our extensive experiments show that the proposed NCGC framework consistently and considerably outperforms popular GNN models and recent baselines for semi-supervised node classification on seven real graphs, when working with various classic GNN backbones.
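
The Sinkhorn-Knopp step mentioned above can be sketched in plain Python. The version below follows the common recipe of alternately rescaling columns (so every cluster receives equal total mass) and rows (so every node's assignment is a distribution). The temperature `epsilon`, the iteration count, and the function name are assumptions for illustration; the paper's exact normalization may differ.

```python
import math


def sinkhorn_knopp(scores, epsilon=0.5, num_iters=50):
    """Turn an n x k score matrix into balanced soft pseudo-labels.

    Rows are nodes, columns are clusters. After convergence each row sums
    to 1 and each column sums to roughly n / k.
    """
    n, k = len(scores), len(scores[0])
    q = [[math.exp(s / epsilon) for s in row] for row in scores]
    for _ in range(num_iters):
        # Balance clusters: rescale so each column sums to n / k.
        col_sums = [sum(q[i][j] for i in range(n)) for j in range(k)]
        q = [[q[i][j] * (n / k) / col_sums[j] for j in range(k)] for i in range(n)]
        # Normalize nodes: rescale so each row sums to 1.
        q = [[v / sum(row) for v in row] for row in q]
    return q
```

Even when every node initially prefers the same cluster, the balanced output spreads the nodes across clusters, which is what prevents the pseudo-labels from collapsing.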

[489] Operator Learning at Machine Precision

Aras Bacho, Aleksei G. Sorokin, Xianjin Yang, Théo Bourdais, Edoardo Calvello, Matthieu Darcy, Alexander Hsu, Bamdad Hosseini, Houman Owhadi

Main category: cs.LG

TL;DR: CHONKNORIS is a neural operator learning method that achieves machine precision by regressing Cholesky factors of elliptic operators for Newton-Kantorovich updates, rather than directly approximating solution operators. It provides theoretical convergence guarantees and includes a foundation model variant (FONKNORIS) for solving unseen nonlinear PDEs.

Details

Motivation: Existing neural operator learning methods often fail to substantially improve accuracy despite increased complexity, performing similarly to simpler approaches like kernel methods. The authors aim to address this limitation by developing a method that can achieve machine precision.

Method: CHONKNORIS regresses Cholesky factors of elliptic operators associated with Tikhonov-regularized Newton-Kantorovich updates instead of directly regressing the solution operator. This creates an unrolled iteration that forms a neural architecture achieving machine precision through contractive mapping.

Result: The method was benchmarked on various nonlinear forward and inverse problems including nonlinear elliptic equations, Burgers’ equation, Darcy flow, Calderón problem, inverse wave scattering, and seismic imaging. FONKNORIS successfully solved unseen nonlinear PDEs like Klein-Gordon and Sine-Gordon equations.

Conclusion: CHONKNORIS provides a novel operator learning paradigm that achieves machine precision by leveraging numerical analysis principles, with theoretical convergence guarantees. The foundation model variant FONKNORIS demonstrates the ability to solve previously unseen nonlinear PDEs accurately.

Abstract: Neural operator learning methods have garnered significant attention in scientific computing for their ability to approximate infinite-dimensional operators. However, increasing their complexity often fails to substantially improve their accuracy, leaving them on par with much simpler approaches such as kernel methods and more traditional reduced-order models. In this article, we set out to address this shortcoming and introduce CHONKNORIS (Cholesky Newton–Kantorovich Neural Operator Residual Iterative System), an operator learning paradigm that can achieve machine precision. CHONKNORIS draws on numerical analysis: many nonlinear forward and inverse PDE problems are solvable by Newton-type methods. Rather than regressing the solution operator itself, our method regresses the Cholesky factors of the elliptic operator associated with Tikhonov-regularized Newton–Kantorovich updates. The resulting unrolled iteration yields a neural architecture whose machine-precision behavior follows from achieving a contractive map, requiring far lower accuracy than end-to-end approximation of the solution operator. We benchmark CHONKNORIS on a range of nonlinear forward and inverse problems, including a nonlinear elliptic equation, Burgers’ equation, a nonlinear Darcy flow problem, the Calderón problem, an inverse wave scattering problem, and a problem from seismic imaging. We also present theoretical guarantees for the convergence of CHONKNORIS in terms of the accuracy of the emulated Cholesky factors. Additionally, we introduce a foundation model variant, FONKNORIS (Foundation Newton–Kantorovich Neural Operator Residual Iterative System), which aggregates multiple pre-trained CHONKNORIS experts for diverse PDEs to emulate the solution map of a novel nonlinear PDE. Our FONKNORIS model is able to accurately solve unseen nonlinear PDEs such as the Klein–Gordon and Sine–Gordon equations.
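
To make the underlying iteration concrete, here is a small finite-dimensional sketch of a Tikhonov-regularized Newton update solved through a Cholesky factorization. In CHONKNORIS the Cholesky factors are emulated by a learned model; in this sketch they are computed exactly, and the function names, the regularization value `lam`, and the toy problem are all assumptions made for illustration.

```python
import math


def cholesky(a):
    """Lower-triangular L with L L^T = a, for symmetric positive definite a."""
    n = len(a)
    low = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = a[i][j] - sum(low[i][k] * low[j][k] for k in range(j))
            low[i][j] = math.sqrt(s) if i == j else s / low[j][j]
    return low


def cholesky_solve(low, b):
    """Solve (L L^T) x = b by forward then backward substitution."""
    n = len(b)
    y = [0.0] * n
    for i in range(n):
        y[i] = (b[i] - sum(low[i][k] * y[k] for k in range(i))) / low[i][i]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (y[i] - sum(low[k][i] * x[k] for k in range(i + 1, n))) / low[i][i]
    return x


def regularized_newton(residual, jacobian, u, lam=1e-8, num_iters=20):
    """Iterate u <- u - (J^T J + lam I)^{-1} J^T F(u)."""
    for _ in range(num_iters):
        f, jac = residual(u), jacobian(u)
        m, n = len(f), len(u)
        normal = [
            [sum(jac[r][p] * jac[r][q] for r in range(m)) + (lam if p == q else 0.0)
             for q in range(n)]
            for p in range(n)
        ]
        grad = [sum(jac[r][p] * f[r] for r in range(m)) for p in range(n)]
        step = cholesky_solve(cholesky(normal), grad)
        u = [ui - si for ui, si in zip(u, step)]
    return u
```

For example, with residual `[u0**2 + u1**2 - 4, u0 - u1]` and starting point `[1.0, 2.0]`, the iteration converges to `u0 = u1 = sqrt(2)`. Because the fixed point satisfies `F(u) = 0` regardless of `lam`, the factor only has to be accurate enough to keep the map contractive, which is the point the abstract makes.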

[490] Rethinking Message Passing Neural Networks with Diffusion Distance-guided Stress Majorization

Haoran Zheng, Renchi Yang, Yubo Zhou, Jianliang Xu

Main category: cs.LG

TL;DR: DDSM is a new MPNN model that addresses over-smoothing and over-correlation issues in graph neural networks through stress majorization, orthogonal regularization, and diffusion distance-guided message passing.

Details

Motivation: To overcome severe issues like over-smoothing and over-correlation in existing MPNNs, which stem from minimizing Dirichlet energy and standard neighborhood aggregation operations.

Method: Proposes DDSM model based on optimization framework with stress majorization and orthogonal regularization, incorporates diffusion distances for nodes to guide message passing, and develops efficient distance approximation algorithms.

Result: DDSM consistently and considerably outperforms 15 strong baselines on both homophilic and heterophilic graphs in comprehensive experiments.

Conclusion: The proposed DDSM model effectively addresses fundamental limitations of MPNNs through a novel optimization framework and diffusion distance integration, demonstrating superior performance across diverse graph types.

Abstract: Message passing neural networks (MPNNs) have emerged as go-to models for learning on graph-structured data in the past decade. Despite their effectiveness, most such models still incur severe issues such as over-smoothing and over-correlation, due to their underlying objective of minimizing the Dirichlet energy and the derived neighborhood aggregation operations. In this paper, we propose DDSM, a new MPNN model built on an optimization framework that includes stress majorization and orthogonal regularization for overcoming the above issues. Further, we introduce diffusion distances for nodes into the framework to guide the new message passing operations and develop efficient algorithms for distance approximations, both backed by rigorous theoretical analyses. Our comprehensive experiments show that DDSM consistently and considerably outperforms 15 strong baselines on both homophilic and heterophilic graphs.
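
For readers unfamiliar with diffusion distances, the sketch below computes the classical random-walk version exactly on a small graph: two nodes are close if t-step random walks started from them end up with similar distributions, with differences weighted by the inverse stationary probability. This is an assumption about which definition is meant; the paper may use a variant, and it relies on efficient approximations rather than the exact computation shown here.

```python
def diffusion_distance(adj, i, j, t=2):
    """Classical diffusion distance between nodes i and j after t walk steps.

    adj is a symmetric 0/1 adjacency matrix with no isolated nodes.
    """
    n = len(adj)
    deg = [sum(row) for row in adj]
    total_deg = sum(deg)
    # Row-stochastic random-walk transition matrix.
    trans = [[adj[r][c] / deg[r] for c in range(n)] for r in range(n)]

    def walk(start):
        dist = [1.0 if k == start else 0.0 for k in range(n)]
        for _ in range(t):
            dist = [sum(dist[r] * trans[r][c] for r in range(n)) for c in range(n)]
        return dist

    p_i, p_j = walk(i), walk(j)
    # Weight each squared difference by 1 / stationary probability.
    squared = sum(
        (a - b) ** 2 / (deg[k] / total_deg)
        for k, (a, b) in enumerate(zip(p_i, p_j))
    )
    return squared ** 0.5
```

On a graph made of two triangles joined by a single edge, nodes inside the same triangle come out closer than nodes in different triangles, which is the kind of signal that could steer message passing away from indiscriminate neighborhood averaging.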

[491] On-Demand Multi-Task Sparsity for Efficient Large-Model Deployment on Edge Devices

Lianming Huang, Haibo Hu, Qiao Li, Nan Guan, Chun Jason Xue

Main category: cs.LG

TL;DR: A multi-task sparsity framework that minimizes task switching overhead by decomposing weights into reusable blocks and aligning sparse structures across tasks, achieving 6.6X faster switching than existing methods.

Details

Motivation: Traditional sparsity optimization for individual tasks ignores the significant I/O overhead during frequent task switching, which is critical for resource-constrained edge platforms.

Method: Decompose weights into reusable block-granular units, align sparse structures across tasks to maximize parameter overlap, and dynamically load only the differential blocks needed for the next task.

Result: Achieved over 6.6X acceleration in task switching on average compared to existing sparsity methods on a real-world autonomous driving platform.

Conclusion: The proposed on-demand multi-task sparsity framework effectively mitigates cold-start latency and significantly improves switching efficiency for edge computing applications.

Abstract: Sparsity is essential for deploying large models on resource-constrained edge platforms. However, optimizing sparsity patterns for individual tasks in isolation ignores the significant I/O overhead incurred during frequent task switching. We introduce an on-demand multi-task sparsity framework specifically designed to minimize switching costs by maximizing parameter reuse. Unlike monolithic approaches, we decompose weights into reusable block-granular units and align sparse structures across tasks to maximize overlap. By dynamically loading only the small differential set of blocks required for the next task, our method effectively mitigates the cold-start latency inherent in traditional monolithic approaches. Experiments on a real-world autonomous driving platform demonstrate that our framework achieves superior switching efficiency, accelerating task switching by over 6.6X on average compared to existing sparsity methods.
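
The "differential set of blocks" idea reduces to set arithmetic once each task's sparse model is described by the identifiers of the weight blocks it keeps. The sketch below is a hypothetical illustration: `plan_switch` and the block identifiers are invented names, and the real framework also has to align sparsity patterns during training so that the overlap is large in the first place.

```python
def plan_switch(resident_blocks, next_task_blocks):
    """Return (blocks to load, blocks that may be evicted) for a task switch.

    Both arguments are sets of block identifiers. Blocks shared by the
    current and next task stay in memory and cost no I/O.
    """
    to_load = next_task_blocks - resident_blocks
    to_evict = resident_blocks - next_task_blocks
    return to_load, to_evict


# Hypothetical example: two tasks that share most of their blocks.
detection_blocks = {"b0", "b1", "b2", "b3"}
segmentation_blocks = {"b1", "b2", "b3", "b7"}
load, evict = plan_switch(detection_blocks, segmentation_blocks)
print(sorted(load), sorted(evict))
```

The switching cost scales with the size of `to_load` rather than with the full model, so the more the sparse structures overlap across tasks, the cheaper each switch becomes.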

[492] RankOOD – Class Ranking-based Out-of-Distribution Detection

Dishanika Denipitiyage, Naveen Karunanayake, Suranga Seneviratne, Sanjay Chawla

Main category: cs.LG

TL;DR: RankOOD is a rank-based OOD detection method that uses Plackett-Luce loss training to leverage ranking patterns in ID class predictions for improved OOD detection performance.

Details

Motivation: Current OOD detection methods can be enhanced by exploiting the ranking patterns that naturally emerge in ID class predictions when models are trained with Cross Entropy Loss.

Method: Extract rank lists for each class using an initial classifier, then perform additional training with Plackett-Luce loss where class rank permutations are the predicted variables.

Result: Achieves state-of-the-art performance on TinyImageNet near-OOD benchmark, reducing FPR95 by 4.3%.

Conclusion: RankOOD demonstrates that leveraging ranking patterns through Plackett-Luce loss training provides an effective framework for improved OOD detection.

Abstract: We propose RankOOD, a rank-based Out-of-Distribution (OOD) detection approach based on training a model with the Plackett-Luce loss, which is now extensively used for preference alignment tasks in foundational models. Our approach is based on the insight that with a deep learning model trained using the Cross Entropy Loss, in-distribution (ID) class prediction induces a ranking pattern for each ID class prediction. The RankOOD framework formalizes the insight by first extracting a rank list for each class using an initial classifier and then using another round of training with the Plackett-Luce loss, where the class rank, a fixed permutation for each class, is the predicted variable. An OOD example may get assigned with high probability to an ID example, but the probability of it respecting the ranking classification is likely to be small. RankOOD achieves SOTA performance on the near-OOD TinyImageNet evaluation benchmark, reducing FPR95 by 4.3%.
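
The Plackett-Luce loss referenced above is the negative log-likelihood of observing a full ranking when items are drawn one at a time with probability proportional to the exponential of their scores. The sketch below implements that standard formula in plain Python; how RankOOD builds the per-class rank lists and turns the likelihood into an OOD score is not shown, and the function name is an assumption.

```python
import math


def plackett_luce_nll(scores, ranking):
    """Negative log-likelihood of `ranking` under the Plackett-Luce model.

    scores[c] is the model's score (logit) for class c, and `ranking` lists
    class indices from most to least preferred.
    """
    nll = 0.0
    for pos in range(len(ranking)):
        remaining = [scores[c] for c in ranking[pos:]]
        # Numerically stable log-sum-exp over the not-yet-ranked classes.
        top = max(remaining)
        log_norm = top + math.log(sum(math.exp(s - top) for s in remaining))
        nll -= scores[ranking[pos]] - log_norm
    return nll
```

A ranking that agrees with the ordering of the scores has a lower loss than one that reverses it, so an input whose logits do not respect the expected class permutation yields a high loss even if its top-1 probability is large.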

[493] REWA: Witness-Overlap Theory – Foundations for Composable Binary Similarity Systems

Nikit Phadke

Main category: cs.LG

TL;DR: REWA presents a general theory of similarity based on witness-overlap structures that reduces similarity computations to compact encodings with provable ranking preservation.

Details

Motivation: To provide a unified framework for similarity systems that can handle diverse similarity measures (graph neighborhoods, causal relations, temporal structure, etc.) with provable guarantees, moving beyond hash-function engineering.

Method: Uses witness sets with semi-random bit assignments and monotonicity of expected similarity in witness overlap. Proves that under overlap-gap conditions, top-k rankings are preserved using logarithmic bits.

Result: Shows that millions of composable similarity definitions inherit logarithmic encoding complexity. Unifies existing methods like Bloom filters, minhash, LSH bitmaps, and random projections as special cases.

Conclusion: REWA provides a principled foundation for similarity systems governed by witness overlap, enabling modular construction from reusable primitives with provable guarantees.

Abstract: REWA introduces a general theory of similarity based on witness-overlap structures. We show that whenever similarity between concepts can be expressed as monotone witness overlap – whether arising from graph neighborhoods, causal relations, temporal structure, topological features, symbolic patterns, or embedding-based neighborhoods – it admits a reduction to compact encodings with provable ranking preservation guarantees. REWA systems consist of: (1) finite witness sets $W(v)$, (2) semi-random bit assignments generated from each witness, and (3) monotonicity of expected similarity in the overlap $Δ(u, v) = |W(u) \cap W(v)|$. We prove that under an overlap-gap condition on the final witness sets – independent of how they were constructed – top-$k$ rankings are preserved using $m = O(\log(|V|/δ))$ bits. The witness-set formulation is compositional: any sequence of structural, temporal, causal, topological, information-theoretic, or learned transformations can be combined into pipelines that terminate in discrete witness sets. The theory applies to the final witness overlap, enabling modular construction of similarity systems from reusable primitives. This yields a vast design space: millions of composable similarity definitions inherit logarithmic encoding complexity. REWA subsumes and unifies Bloom filters, minhash, LSH bitmaps, random projections, sketches, and hierarchical filters as special cases. It provides a principled foundation for similarity systems whose behavior is governed by witness overlap rather than hash-function engineering. This manuscript presents the axioms, the main reducibility theorem, complete proofs with explicit constants, and a detailed discussion of compositional design, limitations, and future extensions including multi-bit encodings, weighted witnesses, and non-set representations.
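
A toy version of the three ingredients (witness sets, bits generated from each witness, and similarity read off the overlap) can be written with a hash-based bit assignment. This is a Bloom-filter-style illustration only: the use of SHA-256, one bit per witness, and the function names are assumptions, and the sketch does not reproduce the paper's semi-random construction or its ranking guarantees.

```python
import hashlib


def witness_bit(witness, num_bits):
    """Map a witness string to one bit position (deterministic hash)."""
    digest = hashlib.sha256(witness.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_bits


def encode(witnesses, num_bits=256):
    """Encode a witness set W(v) as an integer bitmask of num_bits bits."""
    code = 0
    for w in witnesses:
        code |= 1 << witness_bit(w, num_bits)
    return code


def overlap_score(code_u, code_v):
    """Number of shared set bits, a proxy for |W(u) ∩ W(v)|."""
    return bin(code_u & code_v).count("1")
```

Every witness shared by two sets contributes a shared bit, so larger overlaps tend to give larger scores; hash collisions can inflate the score, which is why a gap between overlaps is needed before rankings can be guaranteed.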

[494] Zero-Shot Transfer Capabilities of the Sundial Foundation Model for Leaf Area Index Forecasting

Peining Zhang, Hongchen Qin, Haochen Zhang, Ziqi Guo, Guiling Wang, Jinbo Bi

Main category: cs.LG

TL;DR: Zero-shot time-series foundation model (Sundial) outperforms fully supervised LSTM for Leaf Area Index forecasting when given sufficiently long input context windows covering multiple seasonal cycles.

Details

Motivation: To investigate whether general-purpose time-series foundation models can effectively forecast agricultural time series like Leaf Area Index without task-specific training, potentially serving as plug-and-play solutions for environmental monitoring.

Method: Systematic comparison using HiQ dataset (2000-2022) with statistical baselines, fully supervised LSTM, and Sundial foundation model under multiple evaluation protocols in zero-shot setting.

Result: Sundial foundation model in zero-shot setting outperforms fully trained LSTM when input context window covers more than one or two full seasonal cycles.

Conclusion: General-purpose foundation models can surpass specialized supervised models for remote-sensing time series prediction without task-specific tuning, demonstrating strong potential as plug-and-play forecasters in agricultural and environmental applications.

Abstract: This work investigates the zero-shot forecasting capability of time-series foundation models for Leaf Area Index (LAI) forecasting in agricultural monitoring. Using the HiQ dataset (U.S., 2000-2022), we systematically compare statistical baselines, a fully supervised LSTM, and the Sundial foundation model under multiple evaluation protocols. We find that Sundial, in the zero-shot setting, can outperform a fully trained LSTM provided that the input context window is sufficiently long, specifically when it covers more than one or two full seasonal cycles. This demonstrates, for the first time, that a general-purpose foundation model can surpass specialized supervised models on remote-sensing time series prediction without any task-specific tuning. These results highlight the strong potential of pretrained time-series foundation models to serve as effective plug-and-play forecasters in agricultural and environmental applications.

[495] iRadioDiff: Physics-Informed Diffusion Model for Indoor Radio Map Construction and Localization

Xiucheng Wang, Tingwei Yuan, Yang Cao, Nan Cheng, Ruijin Sun, Weihua Zhuang

Main category: cs.LG

TL;DR: iRadioDiff is a sampling-free diffusion-based framework for constructing indoor radio maps that incorporates physics-informed prompts and multipath-critical priors to achieve state-of-the-art performance in radio map construction and indoor localization.

Details

Motivation: Current methods for constructing indoor radio maps face challenges including prohibitive latency of EM solvers and limitations of learning-based methods that rely on sparse measurements or homogeneous material assumptions, which don't align with the heterogeneous and multipath-rich nature of indoor environments.

Method: A diffusion-based framework conditioned on AP positions and physics-informed prompts encoded by material reflection/transmission coefficients, incorporating multipath-critical priors (diffraction points, strong transmission boundaries, LoS contours) via conditional channels and boundary-weighted objectives.

Result: iRadioDiff achieves state-of-the-art performance in indoor radio map construction and received signal strength based indoor localization, with effective generalization across layouts and material configurations.

Conclusion: The proposed framework enables accurate modeling of nonstationary field discontinuities and efficient construction of physically consistent radio maps, overcoming limitations of existing methods.

Abstract: Radio maps (RMs) serve as environment-aware electromagnetic (EM) representations that connect scenario geometry and material properties to the spatial distribution of signal strength, enabling localization without costly in-situ measurements. However, constructing high-fidelity indoor RMs remains challenging due to the prohibitive latency of EM solvers and the limitations of learning-based methods, which often rely on sparse measurements or assumptions of homogeneous materials that are misaligned with the heterogeneous and multipath-rich nature of indoor environments. To overcome these challenges, we propose iRadioDiff, a sampling-free diffusion-based framework for indoor RM construction. iRadioDiff is conditioned on access point (AP) positions and a physics-informed prompt encoded by material reflection and transmission coefficients. It further incorporates multipath-critical priors, including diffraction points, strong transmission boundaries, and line-of-sight (LoS) contours, to guide the generative process via conditional channels and boundary-weighted objectives. This design enables accurate modeling of nonstationary field discontinuities and efficient construction of physically consistent RMs. Experiments demonstrate that iRadioDiff achieves state-of-the-art performance in indoor RM construction and received signal strength based indoor localization, and generalizes effectively across layouts and material configurations. Code is available at https://github.com/UNIC-Lab/iRadioDiff.

[496] Cross-Contrastive Clustering for Multimodal Attributed Graphs with Dual Graph Filtering

Haoran Zheng, Renchi Yang, Hongtao Wang, Jianliang Xu

Main category: cs.LG

TL;DR: DGF is a novel multimodal attributed graph clustering method that addresses limitations of existing approaches by incorporating feature-wise denoising and tri-cross contrastive learning across modalities, neighborhoods, and communities.

Details

Motivation: Existing multi-view clustering methods fail to handle the unique characteristics of multimodal attributes from large pre-trained models, such as low modality-wise correlation and intense feature-wise noise, leading to suboptimal performance.

Method: Proposes Dual Graph Filtering (DGF) with feature-wise denoising and tri-cross contrastive training strategy using instance-level contrastive learning across modalities, neighborhoods, and communities.

Result: Comprehensive experiments on eight benchmark MMAG datasets show DGF consistently and significantly outperforms state-of-the-art baselines in clustering quality.

Conclusion: DGF effectively overcomes limitations of traditional graph filters and provides robust, discriminative node representations for multimodal attributed graph clustering.

Abstract: Multimodal Attributed Graphs (MMAGs) are an expressive data model for representing the complex interconnections among entities that associate attributes from multiple data modalities (text, images, etc.). Clustering over such data finds numerous practical applications in real scenarios, including social community detection, medical data analytics, etc. However, as revealed by our empirical studies, existing multi-view clustering solutions largely rely on the high correlation between attributes across various views and overlook the unique characteristics (e.g., low modality-wise correlation and intense feature-wise noise) of multimodal attributes output by large pre-trained language and vision models in MMAGs, leading to suboptimal clustering performance. Inspired by the foregoing empirical observations and our theoretical analyses with graph signal processing, we propose the Dual Graph Filtering (DGF) scheme, which innovatively incorporates a feature-wise denoising component into node representation learning, thereby effectively overcoming the limitations of traditional graph filters adopted in the extant multi-view graph clustering approaches. On top of that, DGF includes a tri-cross contrastive training strategy that employs instance-level contrastive learning across modalities, neighborhoods, and communities for learning robust and discriminative node representations. Our comprehensive experiments on eight benchmark MMAG datasets show that DGF is able to outperform a wide range of state-of-the-art baselines consistently and significantly in terms of clustering quality measured against ground-truth labels.

[497] RED-F: Reconstruction-Elimination based Dual-stream Contrastive Forecasting for Multivariate Time Series Anomaly Prediction

PengYu Chen, Xiaohou Shi, Yuan Chang, Yan Sun, Sajal K. Das

Main category: cs.LG

TL;DR: RED-F is a dual-stream framework for proactive anomaly prediction in multivariate time series that uses reconstruction-elimination and contrastive forecasting to amplify subtle anomaly precursors.

Details

Motivation: Existing unsupervised methods trained on normal data tend to reconstruct normal patterns, which submerges weak anomaly precursors needed for prediction.

Method: RED-F consists of two components: REM (Reconstruction-Elimination Model) that generates a purified normal baseline using hybrid time-frequency analysis, and DFM (Dual-stream Contrastive Forecasting Model) that compares predictions from original and purified streams using contrastive learning and Multi-Series Prediction objective.

Result: Extensive experiments on six real-world datasets demonstrate superior performance in anomaly prediction tasks compared to existing methods.

Conclusion: The proposed RED-F framework effectively addresses the challenge of detecting subtle anomaly precursors by transforming absolute signal detection into relative trajectory comparison, amplifying weak precursor signals for improved proactive anomaly prediction.

Abstract: The proactive prediction of anomalies (AP) in multivariate time series (MTS) is a critical challenge to ensure system dependability. The difficulty lies in identifying subtle anomaly precursors concealed within normal signals. However, existing unsupervised methods, trained exclusively on normal data, demonstrate a fundamental propensity to reconstruct normal patterns. Consequently, when confronted with weak precursors, their predictions are dominated by the normal pattern, submerging the very signal required for prediction. To contend with the limitation, we propose RED-F, a Reconstruction-Elimination based Dual-stream Contrastive Forecasting framework, comprising the Reconstruction-Elimination Model (REM) and the Dual-stream Contrastive Forecasting Model (DFM). The REM utilizes a hybrid time-frequency mechanism to mitigate the precursor, generating a purified, normal-pattern baseline. The DFM then receives this purified baseline and the original sequence which retains the precursor as parallel inputs. At the core of our framework, RED-F employs a contrastive forecast that transforms the difficult task of absolute signal detection into a simpler, more robust task of relative trajectory comparison by computing the divergence between these two predictive streams. This contrastive mechanism serves to amplify the faint precursor signal. Furthermore, the DFM is trained with a novel Multi-Series Prediction (MSP) objective, which leverages distant future context to enhance its predictive sensitivity. Extensive experiments on six real-world datasets demonstrate the superior capability of RED-F in anomaly prediction tasks.

[498] SOMBRL: Scalable and Optimistic Model-Based RL

Bhavya Sukhija, Lenart Treven, Carmelo Sferrazza, Florian Dörfler, Pieter Abbeel, Andreas Krause

Main category: cs.LG

TL;DR: SOMBRL is a scalable, optimistic model-based RL method that combines extrinsic rewards with epistemic uncertainty for efficient exploration, achieving sublinear regret across various settings and outperforming baselines in both simulation and hardware experiments.

Details

Motivation: Address the challenge of efficient exploration in model-based reinforcement learning when system dynamics are unknown and learning must occur through online interactions.

Method: Learn an uncertainty-aware dynamics model and greedily maximize a weighted sum of extrinsic reward and epistemic uncertainty; compatible with any policy optimizer or planner.

Result: Achieves sublinear regret for nonlinear dynamics in finite-horizon, discounted infinite-horizon, and non-episodic settings; demonstrates strong performance on state-based and visual-control environments; outperforms state-of-the-art on RC car hardware.

Conclusion: SOMBRL provides a flexible and scalable solution for principled exploration in MBRL, with theoretical guarantees and practical effectiveness demonstrated across diverse environments.

Abstract: We address the challenge of efficient exploration in model-based reinforcement learning (MBRL), where the system dynamics are unknown and the RL agent must learn directly from online interactions. We propose Scalable and Optimistic MBRL (SOMBRL), an approach based on the principle of optimism in the face of uncertainty. SOMBRL learns an uncertainty-aware dynamics model and greedily maximizes a weighted sum of the extrinsic reward and the agent’s epistemic uncertainty. SOMBRL is compatible with any policy optimizer or planner, and under common regularity assumptions on the system, we show that SOMBRL has sublinear regret for nonlinear dynamics in the (i) finite-horizon, (ii) discounted infinite-horizon, and (iii) non-episodic settings. Additionally, SOMBRL offers a flexible and scalable solution for principled exploration. We evaluate SOMBRL on state-based and visual-control environments, where it displays strong performance across all tasks and baselines. We also evaluate SOMBRL on dynamic RC car hardware and show that SOMBRL outperforms the state-of-the-art, illustrating the benefits of principled exploration for MBRL.
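
The "weighted sum of extrinsic reward and epistemic uncertainty" can be illustrated with a one-step action choice. The sketch below assumes the uncertainty is measured as the disagreement (population standard deviation) among an ensemble of dynamics-model predictions, and it scores single actions rather than optimizing a policy over whole trajectories as SOMBRL does. The function names and the candidate format are hypothetical.

```python
import statistics


def optimistic_score(reward, ensemble_predictions, weight):
    """Extrinsic reward plus weighted ensemble disagreement."""
    return reward + weight * statistics.pstdev(ensemble_predictions)


def select_action(candidates, weight=1.0):
    """Pick the action with the highest optimistic score.

    candidates maps an action name to (reward, ensemble_predictions).
    """
    def score(action):
        reward, predictions = candidates[action]
        return optimistic_score(reward, predictions, weight)

    return max(candidates, key=score)


# Hypothetical candidates: one well-understood action, one uncertain action.
candidates = {
    "safe": (1.0, [0.5, 0.5, 0.5]),
    "novel": (0.8, [0.0, 1.0, 2.0]),
}
print(select_action(candidates, weight=1.0), select_action(candidates, weight=0.0))
```

With a positive weight the agent prefers the action whose outcome the models disagree about, which gathers data where the model is weakest; with zero weight it falls back to pure reward maximization.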

[499] Mixture of Attention Spans: Optimizing LLM Inference Efficiency with Heterogeneous Sliding-Window Lengths

Tianyu Fu, Haofeng Huang, Xuefei Ning, Genghan Zhang, Boju Chen, Tianqi Wu, Hongyi Wang, Zixiao Huang, Shiyao Li, Shengen Yan, Guohao Dai, Huazhong Yang, Yu Wang

Main category: cs.LG

TL;DR: MoA proposes a mixture of attention spans that automatically tailors different sliding-window lengths to different attention heads and layers, improving long-context performance while maintaining efficiency.

Details

Motivation: Existing sliding-window attention methods use uniform window lengths across all heads and inputs, failing to capture heterogeneous attention patterns and their distinct accuracy-latency trade-offs in LLMs.

Method: MoA constructs a search space of window lengths and scaling rules, profiles the model, evaluates configurations, and finds optimal length settings for each attention head. It adapts to input sizes, with some heads expanding focus for longer inputs while others maintain fixed local contexts.

Result: MoA increases effective context length by 3.9x at the same average window length and boosts retrieval accuracy by 1.5-7.1x over the uniform-window baseline across Vicuna-{7B, 13B} and Llama3-{8B, 70B}. It reduces the maximum relative performance drop versus full attention from 9%-36% to within 5% on three long-context understanding benchmarks, cuts GPU memory by 1.2-1.4x, and raises decode throughput by 6.6-8.2x over FlashAttention2 and 1.7-1.9x over vLLM.

Conclusion: MoA effectively addresses the limitations of uniform sliding-window attention by adapting window lengths per head, significantly improving long-context understanding while maintaining computational efficiency across various LLM architectures.

Abstract: Sliding-window attention offers a hardware-efficient solution to the memory and throughput challenges of Large Language Models (LLMs) in long-context scenarios. Existing methods typically employ a single window length across all attention heads and input sizes. However, this uniform approach fails to capture the heterogeneous attention patterns inherent in LLMs, ignoring their distinct accuracy-latency trade-offs. To address this challenge, we propose Mixture of Attention Spans (MoA), which automatically tailors distinct sliding-window length configurations to different heads and layers. MoA constructs and navigates a search space of various window lengths and their scaling rules relative to input sizes. It profiles the model, evaluates potential configurations, and pinpoints the optimal length configurations for each head. MoA adapts to varying input sizes, revealing that some attention heads expand their focus to accommodate longer inputs, while other heads consistently concentrate on fixed-length local contexts. Experiments show that MoA increases the effective context length by 3.9x with the same average sliding-window length, boosting retrieval accuracy by 1.5-7.1x over the uniform-window baseline across Vicuna-{7B, 13B} and Llama3-{8B, 70B} models. Moreover, MoA narrows the performance gap with full attention, reducing the maximum relative performance drop from 9%-36% to within 5% across three long-context understanding benchmarks. MoA achieves a 1.2-1.4x GPU memory reduction, boosting decode throughput by 6.6-8.2x and 1.7-1.9x over FlashAttention2 and vLLM, with minimal performance impact. Our code is available at: https://github.com/thu-nics/MoA
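
To make the idea of heterogeneous spans concrete, here is a small sketch of per-head causal sliding-window masks. It assumes a linear scaling rule `alpha + beta * input_len` for the window length; the abstract only says that scaling rules relative to input size are searched, so the rule, the two example heads, and all function names are illustrative assumptions rather than MoA's actual configuration.

```python
def window_length(alpha, beta, input_len):
    """Hypothetical linear scaling rule, clamped to [1, input_len]."""
    return max(1, min(input_len, int(alpha + beta * input_len)))


def sliding_window_mask(input_len, window):
    """Causal mask: query i may attend to keys j with i - window < j <= i."""
    return [[(j <= i) and (j > i - window) for j in range(input_len)]
            for i in range(input_len)]


# Two hypothetical heads: one keeps a fixed local span, one grows with the input.
HEADS = {"local": (4, 0.0), "expanding": (2, 0.5)}


def head_masks(input_len):
    """Build one mask per head from its (alpha, beta) pair."""
    return {name: sliding_window_mask(input_len, window_length(a, b, input_len))
            for name, (a, b) in HEADS.items()}


masks = head_masks(8)
```

At input length 8 the "local" head sees 4 keys from the last position while the "expanding" head sees 6; doubling the input length changes only the second head's span.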

[500] QiMeng-CRUX: Narrowing the Gap between Natural Language and Verilog via Core Refined Understanding eXpression

Lei Huang, Rui Zhang, Jiaming Guo, Yang Zhang, Di Huang, Shuyao Cheng, Pengwei Jin, Chongxiao Li, Zidong Du, Xing Hu, Qi Guo, Yunji Chen

Main category: cs.LG

TL;DR: CRUX-V introduces a structured intermediate representation (CRUX) and two-stage training to bridge the gap between ambiguous natural language descriptions and precise Verilog code generation, achieving state-of-the-art performance.

Details

Motivation: Existing HDL generation approaches rely on ambiguous, redundant natural language descriptions that pose challenges for precise Verilog code generation.

Method: Introduces CRUX structured intermediate space and a two-stage training framework with Joint Expression Modeling and Dual-Space Optimization.

Result: CRUX-V achieves state-of-the-art performance among general models on multiple Verilog generation benchmarks, particularly on challenging design tasks; CRUX also transfers to other code models when used as an input prompt.

Conclusion: CRUX effectively narrows the gap between free-form natural language and precise Verilog generation through structured intermediate representation.

Abstract: Large language models (LLMs) have shown promising capabilities in hardware description language (HDL) generation. However, existing approaches often rely on free-form natural language descriptions that are often ambiguous, redundant, and unstructured, which poses significant challenges for downstream Verilog code generation. We treat hardware code generation as a complex transformation from an open-ended natural language space to a domain-specific, highly constrained target space. To bridge this gap, we introduce Core Refined Understanding eXpression (CRUX), a structured intermediate space that captures the essential semantics of user intent while organizing the expression for precise Verilog code generation. We further design a two-stage training framework, comprising Joint Expression Modeling and Dual-Space Optimization, to enhance the quality of both CRUX and Verilog code. Experiments across multiple Verilog generation benchmarks demonstrate that our model, CRUX-V, achieves state-of-the-art performance among general models, particularly under challenging design tasks. Furthermore, the CRUX space proves transferable and beneficial when used as input prompts for other code models, highlighting its effectiveness in narrowing the gap between free-form natural language descriptions and precise Verilog generation.

[501] Multivariate Forecasting of Bitcoin Volatility with Gradient Boosting: Deterministic, Probabilistic, and Feature Importance Perspectives

Grzegorz Dudek, Mateusz Kasprzyk, Paweł Pełka

Main category: cs.LG

TL;DR: LGBM model applied to Bitcoin volatility forecasting using 69 predictors, outperforming benchmarks in both deterministic and probabilistic forecasting while identifying key volatility drivers.

Details

Motivation: To develop effective forecasting models for Bitcoin's high-variance, nonlinear volatility using comprehensive predictor sets and advanced machine learning techniques.

Method: Used LGBM with 69 market, behavioral, and macroeconomic predictors; employed quantile regression with pinball loss and residual simulation for probabilistic forecasting; applied gain-based and permutation feature importance.

Result: LGBM models effectively captured cryptocurrency market characteristics, identified trading volume, lagged volatility, investor attention, and market capitalization as key drivers, and outperformed econometric and ML baselines.

Conclusion: LGBM provides accurate Bitcoin volatility forecasts with interpretable insights into market dynamics, demonstrating strong performance in both deterministic and probabilistic settings.

Abstract: This study investigates the application of the Light Gradient Boosting Machine (LGBM) model for both deterministic and probabilistic forecasting of Bitcoin realized volatility. Utilizing a comprehensive set of 69 predictors – encompassing market, behavioral, and macroeconomic indicators – we evaluate the performance of LGBM-based models and compare them with both econometric and machine learning baselines. For probabilistic forecasting, we explore two quantile-based approaches: direct quantile regression using the pinball loss function, and a residual simulation method that transforms point forecasts into predictive distributions. To identify the main drivers of volatility, we employ gain-based and permutation feature importance techniques, consistently highlighting the significance of trading volume, lagged volatility measures, investor attention, and market capitalization. The results demonstrate that LGBM models effectively capture the nonlinear and high-variance characteristics of cryptocurrency markets while providing interpretable insights into the underlying volatility dynamics.
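
The pinball loss used for direct quantile regression has a standard definition, sketched below for a single quantile level. The function name and the toy numbers are illustrative; nothing here reproduces the paper's LGBM setup or data.

```python
def pinball_loss(y_true, y_pred, quantile):
    """Average pinball (quantile) loss for one quantile level in (0, 1).

    Under-prediction is penalized by `quantile`, over-prediction by
    `1 - quantile`, so minimizing it targets the requested quantile.
    """
    total = 0.0
    for actual, predicted in zip(y_true, y_pred):
        diff = actual - predicted
        total += quantile * diff if diff >= 0 else (quantile - 1.0) * diff
    return total / len(y_true)


# Toy realized-volatility values against a constant forecast of 2.0.
observed = [1.0, 2.0, 4.0]
forecast = [2.0, 2.0, 2.0]
loss_q90 = pinball_loss(observed, forecast, 0.9)
loss_q10 = pinball_loss(observed, forecast, 0.1)
```

The same forecast is judged more harshly as a 90% quantile than as a 10% quantile here, because the one large under-prediction dominates when the quantile level is high.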

[502] CLIMATEAGENT: Multi-Agent Orchestration for Complex Climate Data Science Workflows

Hyeonjae Kim, Chenyue Li, Wen Deng, Mengxi Jin, Wen Huang, Mengqian Lu, Binhang Yuan

Main category: cs.LG

TL;DR: ClimateAgent is an autonomous multi-agent framework that orchestrates end-to-end climate data analytic workflows, achieving 100% task completion and superior report quality compared to existing tools.

Details

Motivation: Generic LLM agents and static scripting pipelines lack climate-specific context and flexibility, performing poorly in practical climate science applications that require automated workflows across massive, heterogeneous datasets.

Method: ClimateAgent decomposes user questions into executable sub-tasks coordinated by Orchestrate-Agent and Plan-Agent; uses specialized Data-Agents that dynamically introspect APIs to synthesize download scripts; and employs a Coding-Agent with self-correction loop for Python code generation, visualizations, and final reporting.

Result: On Climate-Agent-Bench-85 (85 real-world climate tasks), ClimateAgent achieved 100% task completion and report quality score of 8.32, significantly outperforming GitHub-Copilot (6.27) and GPT-5 baseline (3.26).

Conclusion: Multi-agent orchestration with dynamic API awareness and self-correcting execution substantially advances reliable, end-to-end automation for climate science analytic tasks.

Abstract: Climate science demands automated workflows to transform comprehensive questions into data-driven statements across massive, heterogeneous datasets. However, generic LLM agents and static scripting pipelines lack climate-specific context and flexibility and thus perform poorly in practice. We present ClimateAgent, an autonomous multi-agent framework that orchestrates end-to-end climate data analytic workflows. ClimateAgent decomposes user questions into executable sub-tasks coordinated by an Orchestrate-Agent and a Plan-Agent; acquires data via specialized Data-Agents that dynamically introspect APIs to synthesize robust download scripts; and completes analysis and reporting with a Coding-Agent that generates Python code, visualizations, and a final report with a built-in self-correction loop. To enable systematic evaluation, we introduce Climate-Agent-Bench-85, a benchmark of 85 real-world tasks spanning atmospheric rivers, drought, extreme precipitation, heat waves, sea surface temperature, and tropical cyclones. On Climate-Agent-Bench-85, ClimateAgent achieves 100% task completion and a report quality score of 8.32, outperforming GitHub-Copilot (6.27) and a GPT-5 baseline (3.26). These results demonstrate that our multi-agent orchestration with dynamic API awareness and self-correcting execution substantially advances reliable, end-to-end automation for climate science analytic tasks.

[503] IDAP++: Advancing Divergence-Based Pruning via Filter-Level and Layer-Level Optimization

Aleksei Samarin, Artem Nazarenko, Egor Kotenko, Valentin Malykh, Alexander Savelev, Aleksei Toropov

Main category: cs.LG

TL;DR: A unified neural network compression framework using information flow analysis to remove redundant filters and layers while maintaining performance.

Details

Motivation: To address redundancy at both filter and architectural levels in neural networks through a consistent information flow analysis approach that works across diverse architectures.

Method: Two-stage optimization: 1) Iterative divergence-aware pruning to remove redundant filters, 2) Layer-wise analysis to eliminate entire layers with minimal impact on information propagation using tensor flow divergence metrics.

Result: Achieves substantial model compression while maintaining competitive accuracy, outperforming state-of-the-art methods across various architectures including CNNs and transformers.

Conclusion: Flow divergence serves as an effective guiding principle for neural network compression, offering practical deployment benefits in resource-constrained environments.

Abstract: This paper presents a novel approach to neural network compression that addresses redundancy at both the filter and architectural levels through a unified framework grounded in information flow analysis. Building on the concept of tensor flow divergence, which quantifies how information is transformed across network layers, we develop a two-stage optimization process. The first stage employs iterative divergence-aware pruning to identify and remove redundant filters while preserving critical information pathways. The second stage extends this principle to higher-level architecture optimization by analyzing layer-wise contributions to information propagation and selectively eliminating entire layers that demonstrate minimal impact on network performance. The proposed method naturally adapts to diverse architectures, including convolutional networks, transformers, and hybrid designs, providing a consistent metric for comparing the structural importance across different layer types. Experimental validation across multiple modern architectures and datasets reveals that this combined approach achieves substantial model compression while maintaining competitive accuracy. The presented approach achieves parameter reduction results that are globally comparable to those of state-of-the-art solutions and outperforms them across a wide range of modern neural network architectures, from convolutional models to transformers. The results demonstrate how flow divergence serves as an effective guiding principle for both filter-level and layer-level optimization, offering practical benefits for deployment in resource-constrained environments.

[504] On the Limits of Momentum in Decentralized and Federated Optimization

Riccardo Zaccone, Sai Praneeth Karimireddy, Carlo Masone

Main category: cs.LG

TL;DR: Momentum in distributed SGD under cyclic client participation in Federated Learning is inevitably affected by statistical heterogeneity, and decreasing step-sizes faster than Θ(1/t) leads to convergence to a constant value dependent on initialization and heterogeneity.

Details

Motivation: To understand if momentum can guarantee convergence under unbounded heterogeneity in decentralized scenarios with partial worker participation, particularly in Federated Learning where momentum appears as a solution to mitigate statistical heterogeneity effects.

Method: Theoretical analysis of momentum under cyclic client participation in distributed SGD, examining the impact of statistical heterogeneity and different step-size schedules.

Result: Momentum remains inevitably affected by statistical heterogeneity, and decreasing step-sizes faster than Θ(1/t) leads to convergence to a constant value that depends on initialization and the heterogeneity bound.

Conclusion: Momentum does not overcome the fundamental limitations imposed by statistical heterogeneity in decentralized learning scenarios, similar to standard SGD, and careful consideration of step-size schedules is necessary.

Abstract: Recent works have explored the use of momentum in local methods to enhance distributed SGD. This is particularly appealing in Federated Learning (FL), where momentum intuitively appears as a solution to mitigate the effects of statistical heterogeneity. Despite recent progress in this direction, it is still unclear if momentum can guarantee convergence under unbounded heterogeneity in decentralized scenarios, where only some workers participate at each round. In this work we analyze momentum under cyclic client participation, and theoretically prove that it remains inevitably affected by statistical heterogeneity. Similarly to SGD, we prove that decreasing step-sizes do not help either: in fact, any schedule decreasing faster than $\Theta\left(1/t\right)$ leads to convergence to a constant value that depends on the initialization and the heterogeneity bound. Numerical results corroborate the theory, and deep learning experiments confirm its relevance for realistic settings.
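
The role of the step-size schedule can be illustrated on a much simpler problem than the one the paper analyzes. The sketch below runs plain gradient descent, with no momentum, no clients, and no heterogeneity, on the one-dimensional quadratic f(x) = x^2 / 2. It only shows why a schedule that decays faster than 1/t can stall at a value that depends on the initialization; the schedules and function names are illustrative assumptions.

```python
def slow_decay(t):
    """Step size of order 1/t: the step sizes sum to infinity."""
    return 0.5 / t


def fast_decay(t):
    """Step size decaying faster than 1/t: the step sizes have a finite sum."""
    return 0.5 / t ** 2


def run_gd(x0, schedule, steps):
    """Gradient descent on f(x) = x**2 / 2, whose gradient at x is x."""
    x = x0
    for t in range(1, steps + 1):
        x -= schedule(t) * x
    return x


stalled = run_gd(1.0, fast_decay, 10000)        # stays far from the minimizer 0
stalled_twice = run_gd(2.0, fast_decay, 10000)  # limit scales with the start point
converging = run_gd(1.0, slow_decay, 10000)     # keeps shrinking toward 0
```

With the fast schedule the total distance the iterate can travel is bounded, so it settles at a nonzero point proportional to where it started, while the 1/t schedule continues to approach the minimizer.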

[505] Filtering with Self-Attention and Storing with MLP: One-Layer Transformers Can Provably Acquire and Extract Knowledge

Ruichen Xu, Kexin Chen

Main category: cs.LG

TL;DR: First theoretical framework analyzing knowledge acquisition and extraction in transformers, addressing MLP knowledge storage, OOD adaptivity, and next-token prediction objectives.

Details

Motivation: Existing theoretical studies overlook critical components: MLP as primary knowledge storage module, OOD adaptivity for generalization, and next-token prediction as knowledge encoding mechanism.

Method: Analyze training dynamics of one-layer transformers under regularity assumptions, examining pre-training for knowledge acquisition and fine-tuning for knowledge extraction, covering both full and low-rank fine-tuning.

Result: Transformers attain near-optimal training loss during pre-training. Given a sufficiently large fine-tuning dataset and appropriate data multiplicity conditions, they achieve low generalization error on pre-trained facts not revisited in fine-tuning; when these conditions are violated, generalization error rises and manifests as hallucinations.

Conclusion: The framework provides theoretical foundation for understanding knowledge processes in LLMs, validated through experiments on synthetic datasets and PopQA benchmark using GPT-2 and Llama-3.2-1B models.

Abstract: Modern large language models (LLMs) demonstrate exceptional performance on knowledge-intensive tasks, yet the theoretical mechanisms underlying knowledge acquisition (storage and memorization) during pre-training and extraction (retrieval and recall) during inference after fine-tuning remain poorly understood. Although prior theoretical studies have explored these processes through analyses of training dynamics, they overlook critical components essential for a comprehensive theory: (1) the multi-layer perceptron (MLP), empirically identified as the primary module for knowledge storage; (2) out-of-distribution (OOD) adaptivity, which enables LLMs to generalize to unseen scenarios post-pre-training; and (3) next-token prediction, the standard autoregressive objective that encodes knowledge as conditional probabilities. In this work, we introduce, to the best of our knowledge, the first theoretical framework that addresses these limitations by examining the training dynamics of one-layer transformers. Under regularity assumptions, we establish that: (i) transformers attain near-optimal training loss during pre-training, demonstrating effective knowledge acquisition; (ii) given a sufficiently large fine-tuning dataset and appropriate data multiplicity conditions, transformers achieve low generalization error on factual knowledge acquired during pre-training but not revisited in fine-tuning, indicating robust knowledge extraction; and (iii) violation of these conditions leads to elevated generalization error, manifesting as hallucinations. Our analysis encompasses both full fine-tuning and low-rank fine-tuning, yielding insights into the efficacy of practical low-rank adaptation methods. We validate our theoretical findings through experiments on synthetic datasets and the real-world PopQA benchmark, employing GPT-2 and Llama-3.2-1B models.

[506] AdaCap: An Adaptive Contrastive Approach for Small-Data Neural Networks

Bruno Belucci, Karim Lounici, Katia Meziani

Main category: cs.LG

TL;DR: AdaCap is a training scheme that combines permutation-based contrastive loss with Tikhonov-based closed-form output mapping to improve neural network performance on small tabular datasets where tree-based models typically dominate.

Details

Motivation: Neural networks struggle on small tabular datasets while tree-based models remain dominant in this regime.

Method: AdaCap combines permutation-based contrastive loss with Tikhonov-based closed-form output mapping as a targeted regularization mechanism.

Result: Across 85 real-world regression datasets and multiple architectures, AdaCap yields consistent and statistically significant improvements in the small-sample regime, particularly for residual models. A meta-predictor accurately anticipates when AdaCap is beneficial.

Conclusion: AdaCap acts as a targeted regularization mechanism that strengthens neural networks precisely where they are most fragile on small tabular datasets.

Abstract: Neural networks struggle on small tabular datasets, where tree-based models remain dominant. We introduce Adaptive Contrastive Approach (AdaCap), a training scheme that combines a permutation-based contrastive loss with a Tikhonov-based closed-form output mapping. Across 85 real-world regression datasets and multiple architectures, AdaCap yields consistent and statistically significant improvements in the small-sample regime, particularly for residual models. A meta-predictor trained on dataset characteristics (size, skewness, noise) accurately anticipates when AdaCap is beneficial. These results show that AdaCap acts as a targeted regularization mechanism, strengthening neural networks precisely where they are most fragile. All results and code are publicly available at https://github.com/BrunoBelucci/adacap.
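
A Tikhonov-based closed-form output mapping is, in its generic form, a ridge-regression solve on the last hidden layer's activations. The sketch below shows only that generic closed form, w = (H^T H + lam * I)^{-1} H^T y, for a single regression target. How AdaCap couples it with the permutation-based contrastive loss is not described in this summary, and the helper names are hypothetical.

```python
def solve(matrix, rhs):
    """Solve matrix @ x = rhs by Gauss-Jordan elimination with partial pivoting."""
    n = len(matrix)
    aug = [row[:] + [rhs[i]] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [a - factor * p for a, p in zip(aug[r], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]


def tikhonov_output_weights(hidden, targets, lam):
    """Closed-form ridge weights for hidden activations `hidden` (one row per sample)."""
    dim = len(hidden[0])
    gram = [[sum(h[i] * h[j] for h in hidden) + (lam if i == j else 0.0)
             for j in range(dim)] for i in range(dim)]
    moment = [sum(h[i] * t for h, t in zip(hidden, targets)) for i in range(dim)]
    return solve(gram, moment)


# Toy activations and targets (illustrative values only).
H = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
y = [1.0, 2.0, 3.0]
unregularized = tikhonov_output_weights(H, y, 0.0)
regularized = tikhonov_output_weights(H, y, 1.0)
```

Raising `lam` shrinks the weights, which is the regularizing effect that matters most when only a few samples are available.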

[507] Learning Subgroups with Maximum Treatment Effects without Causal Heuristics

Lincen Yang, Zhong Li, Matthijs van Leeuwen, Saber Salehkaleybar

Main category: cs.LG

TL;DR: The paper proposes a causal subgroup discovery method using structural causal models that avoids heuristic approaches and reduces the problem to standard supervised learning.

Details

Motivation: Current subgroup discovery methods either rely on inaccurate pointwise treatment effect estimation or use ad-hoc causal heuristics without rigorous justification, leading to suboptimal results.

Method: The approach uses structural causal models under a partition-based assumption, transforming subgroup discovery into standard supervised learning (regression/classification), and implements it with CART decision trees.

Result: Experiments on synthetic and semi-synthetic datasets show the method more accurately identifies subgroups with maximum treatment effect compared to various baselines.

Conclusion: By framing subgroup discovery directly under the SCM framework, the method avoids causal heuristics and achieves better performance through standard supervised learning approaches.

Abstract: Discovering subgroups with the maximum average treatment effect is crucial for targeted decision making in domains such as precision medicine, public policy, and education. While most prior work is formulated in the potential outcome framework, the corresponding structural causal model (SCM) for this task has been largely overlooked. In practice, two approaches dominate. The first estimates pointwise conditional treatment effects and then fits a tree on those estimates, effectively turning subgroup estimation into the harder problem of accurate pointwise estimation. The second constructs decision trees or rule sets with ad-hoc ‘causal’ heuristics, typically without rigorous justification for why a given heuristic may be used or whether such heuristics are necessary at all. We address these issues by studying the problem directly under the SCM framework. Under the assumption of a partition-based model, we show that optimal subgroup discovery reduces to recovering the data-generating models and hence a standard supervised learning problem (regression or classification). This allows us to adopt any partition-based methods to learn the subgroup from data. We instantiate the approach with CART, arguably one of the most widely used tree-based methods, to learn the subgroup with maximum treatment effect. Finally, on a large collection of synthetic and semi-synthetic datasets, we compare our method against a wide range of baselines and find that our approach, which avoids such causal heuristics, more accurately identifies subgroups with maximum treatment effect. Our source code is available at https://github.com/ylincen/causal-subgroup.

[508] In-Context Compositional Learning via Sparse Coding Transformer

Wei Chen, Jingxi Yu, Zichen Miao, Qiang Qiu

Main category: cs.LG

TL;DR: Proposes sparse coding-inspired attention reformulation for Transformers to improve compositional learning by enforcing sparsity on encoding coefficients and using linear combinations for generalization.

Details

Motivation: Transformers struggle with in-context compositional learning tasks due to limited structural inductive bias, despite their success in other domains.

Method: Reformulates attention as projection onto learned encoding/decoding dictionaries with sparse coefficients, and estimates target coefficients as linear combinations of context examples.

Result: On the S-RAVEN and RAVEN datasets, the method maintains performance on certain compositional generalization tasks where standard Transformers fail, which the authors attribute to its ability to learn and apply compositional rules.

Conclusion: Sparse coding-based attention enhances Transformers’ compositional generalization capabilities, enabling them to handle tasks requiring structural rule inference.

Abstract: Transformer architectures have achieved remarkable success across language, vision, and multimodal tasks, and there is growing demand for them to address in-context compositional learning tasks. In these tasks, models solve the target problems by inferring compositional rules from context examples, which are composed of basic components structured by underlying rules. However, some of these tasks remain challenging for Transformers, which are not inherently designed to handle compositional tasks and offer limited structural inductive bias. In this work, inspired by the principle of sparse coding, we propose a reformulation of the attention to enhance its capability for compositional tasks. In sparse coding, data are represented as sparse combinations of dictionary atoms with coefficients that capture their compositional rules. Specifically, we reinterpret the attention block as a mapping of inputs into outputs through projections onto two sets of learned dictionary atoms: an encoding dictionary and a decoding dictionary. The encoding dictionary decomposes the input into a set of coefficients, which represent the compositional structure of the input. To enhance structured representations, we impose sparsity on these coefficients. The sparse coefficients are then used to linearly combine the decoding dictionary atoms to generate the output. Furthermore, to assist compositional generalization tasks, we propose estimating the coefficients of the target problem as a linear combination of the coefficients obtained from the context examples. We demonstrate the effectiveness of our approach on the S-RAVEN and RAVEN datasets. For certain compositional generalization tasks, our method maintains performance even when standard Transformers fail, owing to its ability to learn and apply compositional rules.
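
The encode, sparsify, decode pipeline described above can be sketched for a single input vector. This is a loose illustration: it assumes coefficients are dot products with the encoding atoms and that sparsity is imposed by soft-thresholding, neither of which the abstract specifies, and it omits everything else in a real attention block. All names are hypothetical.

```python
def soft_threshold(value, threshold):
    """Shrink toward zero; values within the threshold become exactly zero."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def sparse_coding_block(x, enc_atoms, dec_atoms, threshold):
    """Encode x against enc_atoms, sparsify, then decode with dec_atoms."""
    # One coefficient per encoding atom (assumed: plain dot product).
    coeffs = [soft_threshold(sum(a * b for a, b in zip(x, atom)), threshold)
              for atom in enc_atoms]
    # Output is a linear combination of decoding atoms weighted by the coefficients.
    out_dim = len(dec_atoms[0])
    output = [sum(c * atom[k] for c, atom in zip(coeffs, dec_atoms))
              for k in range(out_dim)]
    return coeffs, output


# Toy dictionaries with two atoms each (illustrative values only).
encoding = [[1.0, 0.0], [0.0, 1.0]]
decoding = [[2.0, 0.0], [0.0, 3.0]]
coeffs, output = sparse_coding_block([1.0, 0.2], encoding, decoding, 0.5)
```

The weak second component falls below the threshold and is zeroed out, so only the first decoding atom contributes to the output.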

[509] Communication-Efficient Learning for Satellite Constellations

Ruxandra-Stefania Tudose, Moritz H. W. Grüss, Grace Ra Kim, Karl H. Johansson, Nicola Bastianello

Main category: cs.LG

TL;DR: Proposes a communication-efficient federated learning algorithm for satellite constellations using local training, compression, and error feedback to reduce communication overhead while maintaining accuracy.

Details

Motivation: Satellite constellations are widespread but face communication constraints; need efficient federated learning approaches that minimize communication with ground stations while maintaining model accuracy.

Method: Combines local training (reduces communication frequency) with compression (reduces message size) and introduces an error feedback mechanism to enhance accuracy, yielding an algorithm-agnostic error feedback scheme.

Result: The authors analyze the algorithm's convergence, and simulations in a realistic space scenario show superior performance over the state of the art.

Conclusion: The communication-efficient federated learning approach with error feedback effectively addresses satellite constellation constraints while achieving accurate trained models, with broader applicability through algorithm-agnostic error feedback.

Abstract: Satellite constellations in low-Earth orbit are now widespread, enabling positioning, Earth imaging, and communications. In this paper we address the solution of learning problems using these satellite constellations. In particular, we focus on a federated approach, where satellites collect and locally process data, with the ground station aggregating local models. We focus on designing a novel, communication-efficient algorithm that still yields accurate trained models. To this end, we employ several mechanisms to reduce the number of communications with the ground station (local training) and their size (compression). We then propose an error feedback mechanism that enhances accuracy, which yields, as a byproduct, an algorithm-agnostic error feedback scheme that can be more broadly applied. We analyze the convergence of the resulting algorithm, and compare it with the state of the art through simulations in a realistic space scenario, showcasing superior performance.
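
Error feedback is commonly implemented by carrying the part of an update that compression discarded into the next round. The sketch below pairs it with top-k sparsification, which is an assumption: the summary does not say which compressor the paper uses, and the class and function names are hypothetical.

```python
def top_k(vector, k):
    """Keep the k largest-magnitude entries and zero out the rest."""
    ranked = sorted(range(len(vector)), key=lambda i: abs(vector[i]), reverse=True)
    keep = set(ranked[:k])
    return [v if i in keep else 0.0 for i, v in enumerate(vector)]


class ErrorFeedbackCompressor:
    """Adds the previously dropped residual back before compressing again."""

    def __init__(self, dim, k):
        self.residual = [0.0] * dim
        self.k = k

    def compress(self, update):
        corrected = [u + r for u, r in zip(update, self.residual)]
        message = top_k(corrected, self.k)
        # Whatever was not transmitted is remembered for the next round.
        self.residual = [c - m for c, m in zip(corrected, message)]
        return message


# A hypothetical satellite sends one coordinate per round.
compressor = ErrorFeedbackCompressor(dim=3, k=1)
first = compressor.compress([1.0, 0.6, 0.1])
second = compressor.compress([0.0, 0.6, 0.1])
```

The second coordinate is dropped in the first round but accumulates in the residual and is transmitted in the second, so information is delayed rather than lost.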

[510] Decoupling and Damping: Structurally-Regularized Gradient Matching for Multimodal Graph Condensation

Lian Shen, Zhendan Chen, Yinhui jiang, Meijia Song, Ziming Su, Juan Liu, Xiangrong Liu

Main category: cs.LG

TL;DR: SR-GM is a novel graph condensation framework for multimodal graphs that addresses gradient conflicts between modalities and structural noise amplification in GNNs through gradient decoupling and structural damping regularization.

Details

Motivation: Multimodal graphs in web applications face computational burdens for GNN training. Existing graph condensation methods fail in multimodal settings due to conflicting gradients from semantic misalignments and GNN message-passing amplifying gradient noise.

Method: Proposes Structurally-Regularized Gradient Matching (SR-GM) with two components: gradient decoupling via orthogonal projection to resolve inter-modality conflicts, and structural damping regularizer using graph’s Dirichlet energy to stabilize optimization.

Result: SR-GM significantly improves accuracy and accelerates convergence compared to baselines. Condensed graphs show strong cross-architecture generalization and potential for accelerating Neural Architecture Search.

Conclusion: Addressing both gradient conflict and structural amplification is essential for superior performance. SR-GM provides scalable methodology for multimodal graph learning in resource-constrained environments.

Abstract: In critical web applications such as e-commerce and recommendation systems, multimodal graphs integrating rich visual and textual attributes are increasingly central, yet their large scale introduces substantial computational burdens for training Graph Neural Networks (GNNs). While Graph Condensation (GC) offers a promising solution by synthesizing smaller datasets, existing methods falter in the multimodal setting. We identify a dual challenge causing this failure: (1) conflicting gradients arising from semantic misalignments between modalities, and (2) the GNN’s message-passing architecture pathologically amplifying this gradient noise across the graph structure. To address this, we propose Structurally-Regularized Gradient Matching (SR-GM), a novel condensation framework tailored for multimodal graphs. SR-GM introduces two synergistic components: first, a gradient decoupling mechanism that resolves inter-modality conflicts at their source via orthogonal projection; and second, a structural damping regularizer that acts directly on the gradient field. By leveraging the graph’s Dirichlet energy, this regularizer transforms the topology from a noise amplifier into a stabilizing force during optimization. Extensive experiments demonstrate that SR-GM significantly improves accuracy and accelerates convergence compared to baseline methods. Ablation studies confirm that addressing both gradient conflict and structural amplification in tandem is essential for achieving superior performance. Moreover, the condensed multimodal graphs exhibit strong cross-architecture generalization and promise to accelerate applications like Neural Architecture Search. This research provides a scalable methodology for multimodal graph-based learning in resource-constrained environments.
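
The two ingredients named above, orthogonal projection of conflicting gradients and Dirichlet energy, each have a simple generic form. The sketch below is not SR-GM itself: it assumes a conflict is signalled by a negative dot product and removes only the conflicting component, and it computes Dirichlet energy for a scalar signal on an unweighted graph. All names are hypothetical.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def decouple(grad, other):
    """Remove grad's component along `other` when the two gradients conflict."""
    overlap = dot(grad, other)
    if overlap >= 0:
        return list(grad)  # no conflict: leave the gradient unchanged
    scale = overlap / dot(other, other)
    return [a - scale * b for a, b in zip(grad, other)]


def dirichlet_energy(edges, signal):
    """Sum of squared differences of a node signal across undirected edges."""
    return sum((signal[i] - signal[j]) ** 2 for i, j in edges)


# Toy per-modality gradients that point in partly opposite directions.
visual_grad = [1.0, 0.0]
text_grad = [-1.0, 1.0]
projected = decouple(visual_grad, text_grad)

# Path graph 0-1-2 with a signal that jumps on the second edge.
energy = dirichlet_energy([(0, 1), (1, 2)], [1.0, 1.0, 3.0])
```

After projection the visual gradient is orthogonal to the text gradient, and the energy term is large exactly where the signal changes sharply between neighbors, which is what a damping regularizer would penalize.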

[511] DiCaP: Distribution-Calibrated Pseudo-labeling for Semi-Supervised Multi-Label Learning

Bo Han, Zhuoming Li, Xiaoyu Wang, Yaxin Hou, Hui Liu, Junhui Hou, Yuheng Jia

Main category: cs.LG

TL;DR: DiCaP is a correctness-aware framework for semi-supervised multi-label learning that calibrates pseudo-label weights based on posterior precision, using a dual-threshold mechanism to handle confident and ambiguous samples separately.

Details

Motivation: Existing SSMLL methods assign equal weights to all pseudo-labels regardless of quality, which amplifies noisy predictions and degrades performance. The optimal weight should reflect correctness likelihood.

Method: Distribution-Calibrated Pseudo-labeling (DiCaP) estimates posterior precision to calibrate pseudo-label weights, with dual-thresholding: confident samples get weighted pseudo-labels while ambiguous ones use unsupervised contrastive learning.

Result: Experiments on multiple benchmark datasets show consistent improvements, surpassing state-of-the-art methods by up to 4.27%.

Conclusion: DiCaP effectively addresses pseudo-label quality issues in SSMLL by weighting based on correctness likelihood and handling different sample types appropriately, achieving superior performance.

Abstract: Semi-supervised multi-label learning (SSMLL) aims to address the challenge of limited labeled data in multi-label learning (MLL) by leveraging unlabeled data to improve the model’s performance. While pseudo-labeling has become a dominant strategy in SSMLL, most existing methods assign equal weights to all pseudo-labels regardless of their quality, which can amplify the impact of noisy or uncertain predictions and degrade the overall performance. In this paper, we theoretically verify that the optimal weight for a pseudo-label should reflect its correctness likelihood. Empirically, we observe that on the same dataset, the correctness likelihood distribution of unlabeled data remains stable, even as the number of labeled training samples varies. Building on this insight, we propose Distribution-Calibrated Pseudo-labeling (DiCaP), a correctness-aware framework that estimates posterior precision to calibrate pseudo-label weights. We further introduce a dual-thresholding mechanism to separate confident and ambiguous regions: confident samples are pseudo-labeled and weighted accordingly, while ambiguous ones are explored by unsupervised contrastive learning. Experiments conducted on multiple benchmark datasets verify that our method achieves consistent improvements, surpassing state-of-the-art methods by up to 4.27%.
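The dual-thresholding idea can be sketched as a simple routing step. In the sketch below, the thresholds `low` and `high`, the function name `route_predictions`, and the `weight_fn` stand-in for the estimated posterior precision are all hypothetical; the paper's actual estimator and its contrastive objective are not reproduced here.

```python
# Illustrative sketch only. Thresholds, names, and weight_fn are assumptions.

def route_predictions(probs, low, high, weight_fn):
    """Split per-label probabilities into weighted pseudo-labels and an ambiguous set.

    probs: dict mapping (sample_id, label_id) -> predicted probability.
    Returns (pseudo, ambiguous), where pseudo maps a key to (pseudo_label, weight)
    and ambiguous lists the keys left for an unsupervised objective.
    """
    pseudo, ambiguous = {}, []
    for key, p in probs.items():
        if p >= high:
            pseudo[key] = (1, weight_fn(p))        # confident positive
        elif p <= low:
            pseudo[key] = (0, weight_fn(1.0 - p))  # confident negative
        else:
            ambiguous.append(key)                  # neither threshold is met
    return pseudo, ambiguous

probs = {("x1", 0): 0.95, ("x1", 1): 0.50, ("x2", 0): 0.05}
# Here the weight is simply the confidence; a calibrated precision estimate would replace it.
pseudo, ambiguous = route_predictions(probs, low=0.1, high=0.9, weight_fn=lambda c: c)
```

Only the two confident entries receive a weighted pseudo-label; the entry with probability 0.50 falls between the thresholds and is set aside.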

[512] Leveraging weights signals – Predicting and improving generalizability in reinforcement learning

Olivier Moulin, Vincent Francois-lavet, Paul Elbers, Mark Hoogendoorn

Main category: cs.LG

TL;DR: The paper introduces a method to predict RL agent generalizability using neural network weights and modifies PPO loss function to improve generalization.

Details

Motivation: RL agents tend to overfit to training environments, limiting their performance on different environments. The paper aims to address this generalizability problem.

Method: Proposes a methodology to predict generalizability scores based on neural network weights, then modifies PPO loss function to boost generalization using these predictions.

Result: Experimental results show that the improved PPO algorithm produces agents with stronger generalizability compared to the original PPO version.

Conclusion: The proposed approach successfully enhances RL agent generalizability by leveraging weight-based predictions to modify the PPO loss function.

Abstract: Generalizability of Reinforcement Learning (RL) agents (ability to perform on environments different from the ones they have been trained on) is a key problem as agents have the tendency to overfit to their training environments. In order to address this problem and offer a solution to increase the generalizability of RL agents, we introduce a new methodology to predict the generalizability score of RL agents based on the internal weights of the agent’s neural networks. Using this prediction capability, we propose some changes in the Proximal Policy Optimization (PPO) loss function to boost the generalization score of the agents trained with this upgraded version. Experimental results demonstrate that our improved PPO algorithm yields agents with stronger generalizability compared to the original version.

[513] Scalable Parameter-Light Spectral Method for Clustering Short Text Embeddings with a Cohesion-Based Evaluation Metric

Nikita Neveditsin, Pawan Lingras, Vijay Mago

Main category: cs.LG

TL;DR: A spectral method for clustering short text embeddings that automatically estimates the number of clusters from Laplacian eigenspectrum, with an adaptive sampling strategy for scalability and a new Cohesion Ratio metric for intrinsic evaluation.

Details

Motivation: Clustering short text embeddings is challenging due to the need to specify cluster numbers in advance, and existing methods lack reliable intrinsic evaluation metrics without ground-truth labels.

Method: Spectral clustering using Laplacian eigenspectrum analysis with cosine similarities and adaptive sampling, plus a Cohesion Ratio metric based on intra-cluster vs global similarity differences.

Result: Outperforms popular parameter-light methods (HDBSCAN, OPTICS, Leiden) on six short-text datasets with four embedding models, with Cohesion Ratio correlating well with extrinsic metrics like NMI and homogeneity.

Conclusion: The spectral estimator and Cohesion Ratio provide practical tools for unsupervised organization and evaluation of short text data, enabling automatic cluster number estimation and reliable quality assessment.

Abstract: Clustering short text embeddings is a foundational task in natural language processing, yet remains challenging due to the need to specify the number of clusters in advance. We introduce a scalable spectral method that estimates the number of clusters directly from the structure of the Laplacian eigenspectrum, constructed using cosine similarities and guided by an adaptive sampling strategy. This sampling approach enables our estimator to efficiently scale to large datasets without sacrificing reliability. To support intrinsic evaluation of cluster quality without ground-truth labels, we propose the Cohesion Ratio, a simple and interpretable evaluation metric that quantifies how much intra-cluster similarity exceeds the global similarity background. It has an information-theoretic motivation inspired by mutual information, and in our experiments it correlates closely with extrinsic measures such as normalized mutual information and homogeneity. Extensive experiments on six short-text datasets and four modern embedding models show that standard algorithms like K-Means and HAC, when guided by our estimator, significantly outperform popular parameter-light methods such as HDBSCAN, OPTICS, and Leiden. These results demonstrate the practical value of our spectral estimator and Cohesion Ratio for unsupervised organization and evaluation of short text data. Implementation of our estimator of k and Cohesion Ratio, along with code for reproducing the experiments, is available at https://anonymous.4open.science/r/towards_clustering-0C2E.
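To make the idea of intra-cluster similarity exceeding the global background concrete, here is a minimal sketch of a cohesion-style score. The abstract does not give the exact formula of the Cohesion Ratio, so the difference of means used below (`cohesion_gap`) is an assumed stand-in rather than the authors' metric; the embeddings are toy values.

```python
# Illustrative sketch only. cohesion_gap is an assumed stand-in,
# not the paper's Cohesion Ratio.
import math
from itertools import combinations

def cosine(u, v):
    """Cosine similarity of two non-zero vectors."""
    num = sum(a * b for a, b in zip(u, v))
    den = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return num / den

def mean_pairwise(pairs, vectors):
    sims = [cosine(vectors[i], vectors[j]) for i, j in pairs]
    return sum(sims) / len(sims)

def cohesion_gap(vectors, labels):
    """Mean intra-cluster cosine similarity minus mean similarity over all pairs.
    Assumes at least one cluster contains two or more points."""
    all_pairs = list(combinations(range(len(vectors)), 2))
    intra_pairs = [(i, j) for i, j in all_pairs if labels[i] == labels[j]]
    return mean_pairwise(intra_pairs, vectors) - mean_pairwise(all_pairs, vectors)

embeddings = [[1.0, 0.0], [1.0, 0.1], [0.0, 1.0], [0.1, 1.0]]
good = cohesion_gap(embeddings, [0, 0, 1, 1])  # groups nearby vectors together
bad = cohesion_gap(embeddings, [0, 1, 0, 1])   # mixes the two directions
```

A labeling that groups similar embeddings scores above zero, while one that mixes them scores below zero, and no ground-truth labels are needed to compute either value.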

[514] Interpretable Air Pollution Forecasting by Physics-Guided Spatiotemporal Decoupling

Zhiguo Zhang, Xiaoliang Ma, Daniel Schlesinger

Main category: cs.LG

TL;DR: Proposes a physics-guided interpretable spatiotemporal learning framework for air pollution forecasting that decomposes pollutant behavior into transparent additive modules, outperforming state-of-the-art baselines while maintaining interpretability.

Details

Motivation: Address the trade-off between performance and interpretability in air pollution forecasting models, which is crucial for public health applications and reliable air-quality management.

Method: Decomposes spatiotemporal behavior into two additive modules: 1) physics-guided transport kernel with directed weights conditioned on wind and geography (advection), 2) explainable attention mechanism that learns local responses and attributes future concentrations to specific historical lags and exogenous drivers.

Result: Consistently outperforms state-of-the-art baselines across multiple forecasting horizons when evaluated on comprehensive dataset from Stockholm region.

Conclusion: The integration of high predictive performance and spatiotemporal interpretability provides a more reliable foundation for operational air-quality management in real-world applications.

Abstract: Accurate and interpretable air pollution forecasting is crucial for public health, but most models face a trade-off between performance and interpretability. This study proposes a physics-guided, interpretable-by-design spatiotemporal learning framework. The model decomposes the spatiotemporal behavior of air pollutant concentrations into two transparent, additive modules. The first is a physics-guided transport kernel with directed weights conditioned on wind and geography (advection). The second is an explainable attention mechanism that learns local responses and attributes future concentrations to specific historical lags and exogenous drivers. Evaluated on a comprehensive dataset from the Stockholm region, our model consistently outperforms state-of-the-art baselines across multiple forecasting horizons. Our model’s integration of high predictive performance and spatiotemporal interpretability provides a more reliable foundation for operational air-quality management in real-world applications.

[515] HVAdam: A Full-Dimension Adaptive Optimizer

Yiheng Zhang, Shaowu Wu, Yuanzhuo Xu, Jiajun Wu, Shang Xu, Steve Drew, Xiaoguang Niu

Main category: cs.LG

TL;DR: Anon is a novel optimizer with tunable adaptivity that bridges the gap between SGD and Adam, outperforming state-of-the-art methods on image classification, diffusion, and language modeling tasks.

Details

Motivation: Adaptive optimizers like Adam generalize worse than non-adaptive methods like SGD on classical architectures, due to limitations in adaptivity of pre-conditioners that restrict adaptation to diverse optimization landscapes.

Method: Proposed Anon optimizer with continuously tunable adaptivity and incremental delay update (IDU) mechanism for flexible convergence across adaptivity spectrum, replacing AMSGrad’s hard max-tracking.

Result: Anon consistently outperforms state-of-the-art optimizers on image classification, diffusion, and language modeling tasks, demonstrating superior performance across diverse applications.

Conclusion: Adaptivity can serve as a valuable tunable design principle, and Anon provides the first unified framework bridging classical and modern optimizers while surpassing their advantageous properties.

Abstract: Adaptive optimizers such as Adam have achieved great success in training large-scale models like large language models and diffusion models. However, they often generalize worse than non-adaptive methods, such as SGD, on classical architectures like CNNs. We identify a key cause of this performance gap: adaptivity in pre-conditioners, which limits the optimizer’s ability to adapt to diverse optimization landscapes. To address this, we propose Anon (Adaptivity Non-restricted Optimizer with Novel convergence technique), a novel optimizer with continuously tunable adaptivity, allowing it to interpolate between SGD-like and Adam-like behaviors and even extrapolate beyond both. To ensure convergence across the entire adaptivity spectrum, we introduce incremental delay update (IDU), a novel mechanism that is more flexible than AMSGrad’s hard max-tracking strategy and enhances robustness to gradient noise. We theoretically establish convergence guarantees under both convex and non-convex settings. Empirically, Anon consistently outperforms state-of-the-art optimizers on representative image classification, diffusion, and language modeling tasks. These results demonstrate that adaptivity can serve as a valuable tunable design principle, and Anon provides the first unified and reliable framework capable of bridging the gap between classical and modern optimizers and surpassing their advantageous properties.
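One well-known way to interpolate between SGD with momentum and Adam is to raise the second-moment estimate to a tunable power p, as in partially adaptive momentum methods. The sketch below uses that technique as a stand-in to show what a tunable adaptivity knob can look like; it is not Anon's update rule and does not implement IDU. The function and parameter names are hypothetical.

```python
# Illustrative sketch only. This is a partially adaptive (tunable exponent) update,
# used as a stand-in; it is NOT the Anon update rule or its IDU mechanism.

def init_state(n):
    return {"t": 0, "m": [0.0] * n, "v": [0.0] * n}

def partially_adaptive_step(theta, grad, state, lr=0.1, beta1=0.9,
                            beta2=0.999, p=0.5, eps=1e-8):
    """One update. p = 0.5 gives an Adam-style step; p = 0 gives a momentum-SGD-style step."""
    state["t"] += 1
    t = state["t"]
    new_theta = []
    for i, (w, g) in enumerate(zip(theta, grad)):
        state["m"][i] = beta1 * state["m"][i] + (1 - beta1) * g
        state["v"][i] = beta2 * state["v"][i] + (1 - beta2) * g * g
        m_hat = state["m"][i] / (1 - beta1 ** t)  # bias-corrected first moment
        v_hat = state["v"][i] / (1 - beta2 ** t)  # bias-corrected second moment
        new_theta.append(w - lr * m_hat / (v_hat ** p + eps))
    return new_theta

theta0 = [1.0, -2.0]
grad0 = [4.0, -0.5]
adam_like = partially_adaptive_step(theta0, grad0, init_state(2), p=0.5)
sgd_like = partially_adaptive_step(theta0, grad0, init_state(2), p=0.0)
```

On the first step the Adam-style setting moves every coordinate by roughly the learning rate regardless of gradient scale, while the p = 0 setting moves each coordinate in proportion to its gradient.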

[516] MXtalTools: A Toolkit for Machine Learning on Molecular Crystals

Michael Kilgour, Mark E. Tuckerman, Jutta Rogal

Main category: cs.LG

TL;DR: MXtalTools is a flexible Python package for data-driven modeling of molecular crystals, enabling machine learning studies of the molecular solid state with CUDA acceleration for high-throughput modeling.

Details

Motivation: To facilitate machine learning studies of molecular crystals by providing a comprehensive toolkit for data-driven modeling of the molecular solid state.

Method: Developed a modular Python package with utilities for dataset synthesis/curation, integrated ML workflows, crystal parameterization/representation, structure sampling/optimization, and differentiable crystal operations.

Result: Created an open-source, CUDA-accelerated toolkit that enables high-throughput crystal modeling and can be integrated into existing workflows or used to build novel modeling pipelines.

Conclusion: MXtalTools provides a flexible, modular framework for molecular crystal modeling that supports both integration with existing workflows and development of new modeling approaches.

Abstract: We present MXtalTools, a flexible Python package for the data-driven modelling of molecular crystals, facilitating machine learning studies of the molecular solid state. MXtalTools comprises several classes of utilities: (1) synthesis, collation, and curation of molecule and crystal datasets, (2) integrated workflows for model training and inference, (3) crystal parameterization and representation, (4) crystal structure sampling and optimization, (5) end-to-end differentiable crystal sampling, construction and analysis. Our modular functions can be integrated into existing workflows or combined and used to build novel modelling pipelines. MXtalTools leverages CUDA acceleration to enable high-throughput crystal modelling. The Python code is available open-source on our GitHub page, with detailed documentation on ReadTheDocs.

[517] Complexity Reduction Study Based on RD Costs Approximation for VVC Intra Partitioning

M. E. A. Kherchouche, F. Galpin, T. Dumas, F. Schnitzler, D. Menard, L. Zhang

Main category: cs.LG

TL;DR: This paper proposes two machine learning approaches to accelerate VVC intra partitioning by predicting RD costs and using RL to model partitioning decisions as an MDP, with threshold-based split selection.

Details

Motivation: To address the computational complexity of exhaustive search in VVC intra partitioning during RDO process, aiming to accelerate the encoding process.

Method: Two ML techniques: 1) Regression-based prediction of normalized RD costs using neighboring block RD costs as features; 2) RL approach using DQN algorithm to model partitioning as MDP from CU decision trajectories across two depths.

Result: Proposed size-independent methods that incorporate RD costs of neighboring blocks as input features, with pre-determined thresholds for split selection.

Conclusion: The paper presents and compares two ML-based approaches for accelerating VVC intra partitioning. Unlike existing methods, both are size-independent and use the RD costs of neighboring blocks as input features.

Abstract: In this paper, a complexity study is conducted for Versatile Video Codec (VVC) intra partitioning to accelerate the exhaustive search involved in the Rate-Distortion Optimization (RDO) process. To address this problem, two main machine learning techniques are proposed and compared. Unlike existing methods, the proposed approaches are size-independent and incorporate the Rate-Distortion (RD) costs of neighboring blocks as input features. The first method is a regression-based technique that predicts normalized RD costs of a given Coding Unit (CU). As partitioning possesses the Markov property, the associated decision-making problem can be modeled as a Markov Decision Process (MDP) and solved by Reinforcement Learning (RL). The second approach is an RL agent learned from trajectories of CU decisions across two depths with the Deep Q-Network (DQN) algorithm. Pre-determined thresholds are then applied for both methods to select a suitable split for the current CU.
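The abstract does not spell out the threshold rule, so the following is only one plausible reading: keep the split modes whose predicted normalized RD cost lies within a fixed margin of the best prediction, and let the encoder evaluate only those. The function name, the margin rule, the mode labels, and the cost values are assumptions for illustration.

```python
# Illustrative sketch only. The margin rule, mode labels, and numbers are assumptions,
# not the procedure described in the paper.

def select_splits(predicted_costs, threshold):
    """Return the split modes whose predicted cost is within `threshold` of the minimum."""
    best = min(predicted_costs.values())
    return sorted(mode for mode, cost in predicted_costs.items()
                  if cost - best <= threshold)

# Hypothetical normalized RD-cost predictions for one coding unit.
predicted = {
    "NO_SPLIT": 0.30,
    "QT": 0.10,
    "BT_H": 0.12,
    "BT_V": 0.40,
    "TT_H": 0.15,
    "TT_V": 0.50,
}
candidates = select_splits(predicted, threshold=0.03)
```

A tighter threshold prunes more modes and saves more encoding time, at the risk of discarding the split that full rate-distortion optimization would have chosen.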

[518] PRISM: Periodic Representation with multIscale and Similarity graph Modelling for enhanced crystal structure property prediction

Àlex Solé, Albert Mosella-Montoro, Joan Cardona, Daniel Aravena, Silvia Gómez-Coca, Eliseo Ruiz, Javier Ruiz-Hidalgo

Main category: cs.LG

TL;DR: PRISM is a graph neural network framework that improves crystal property prediction by explicitly modeling multiscale representations and periodic boundary conditions in crystalline structures.

Details

Motivation: Current graph-based methods for crystal structure analysis often overlook essential periodic boundary conditions and multiscale interactions inherent to crystalline materials, limiting their predictive accuracy.

Method: PRISM employs a set of expert modules that specialize in encoding distinct structural and chemical aspects of periodic systems, explicitly integrating multiscale representations and periodic feature encoding.

Result: Extensive experiments across crystal structure-based benchmarks demonstrate that PRISM improves state-of-the-art predictive accuracy for crystal property prediction.

Conclusion: PRISM significantly enhances crystal property prediction by properly addressing the unique challenges of periodic systems through specialized multiscale and periodic feature encoding.

Abstract: Crystal structures are characterised by repeating atomic patterns within unit cells across three-dimensional space, posing unique challenges for graph-based representation learning. Current methods often overlook essential periodic boundary conditions and multiscale interactions inherent to crystalline structures. In this paper, we introduce PRISM, a graph neural network framework that explicitly integrates multiscale representations and periodic feature encoding by employing a set of expert modules, each specialised in encoding distinct structural and chemical aspects of periodic systems. Extensive experiments across crystal structure-based benchmarks demonstrate that PRISM improves state-of-the-art predictive accuracy, significantly enhancing crystal property prediction.
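To show why periodic boundary conditions matter when building a crystal graph, the sketch below computes a minimum-image distance between two atoms given in fractional coordinates. It assumes an orthorhombic cell (axis-aligned lattice vectors) and is a generic illustration, not part of PRISM; the function name and coordinates are hypothetical.

```python
# Illustrative sketch only. Assumes an orthorhombic cell; not PRISM code.
import math

def min_image_distance(frac_a, frac_b, cell_lengths):
    """Shortest distance between two atoms, considering periodic images of the cell."""
    total = 0.0
    for fa, fb, length in zip(frac_a, frac_b, cell_lengths):
        d = fa - fb
        d -= round(d)              # wrap the fractional offset into [-0.5, 0.5]
        total += (d * length) ** 2
    return math.sqrt(total)

atom_a = (0.05, 0.5, 0.5)
atom_b = (0.95, 0.5, 0.5)
cell = (10.0, 10.0, 10.0)

periodic = min_image_distance(atom_a, atom_b, cell)
naive = math.sqrt(sum(((a - b) * c) ** 2 for a, b, c in zip(atom_a, atom_b, cell)))
```

The two atoms sit on opposite faces of the cell, so they look far apart when periodicity is ignored but are close neighbors through the cell boundary; a graph built from the naive distance would miss that edge.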

[519] MoRE: Batch-Robust Multi-Omics Representations from Frozen Pre-trained Transformers

Audrey Pei-Hsuan Chen

Main category: cs.LG

TL;DR: MoRE is a framework that repurposes frozen pre-trained transformers to align heterogeneous multi-omics data into a shared latent space using parameter-efficient fine-tuning with modality-specific adapters and contrastive alignment losses.

Details

Motivation: Multi-omics data integration faces challenges with extreme dimensionality, modality heterogeneity, and batch effects. Pre-trained transformers show generalization capabilities but their application to multi-omics integration remains underexplored.

Method: Uses frozen pre-trained transformers with lightweight modality-specific adapters and task-adaptive fusion layer. Optimizes masked modeling objective with supervised contrastive and batch-invariant alignment losses for cross-sample and cross-modality alignment.

Result: MoRE achieves competitive batch robustness and biological conservation while significantly reducing trainable parameters compared to fully fine-tuned models. Demonstrates strong performance in integration fidelity, rare population detection, and modality transfer.

Conclusion: MoRE represents a practical step toward general-purpose omics foundation models by efficiently leveraging pre-trained transformers for multi-omics integration with parameter-efficient fine-tuning.

Abstract: Representation learning on multi-omics data is challenging due to extreme dimensionality, modality heterogeneity, and cohort-specific batch effects. While pre-trained transformer backbones have shown broad generalization capabilities in biological sequence modeling, their application to multi-omics integration remains underexplored. We present MoRE (Multi-Omics Representation Embedding), a framework that repurposes frozen pre-trained transformers to align heterogeneous assays into a shared latent space. Unlike purely generative approaches, MoRE employs a parameter-efficient fine-tuning (PEFT) strategy, prioritizing cross-sample and cross-modality alignment over simple sequence reconstruction. Specifically, MoRE attaches lightweight, modality-specific adapters and a task-adaptive fusion layer to the frozen backbone. It optimizes a masked modeling objective jointly with supervised contrastive and batch-invariant alignment losses, yielding structure-preserving embeddings that generalize across unseen cell types and platforms. We benchmark MoRE against established baselines, including scGPT, scVI, and Harmony with scArches, evaluating integration fidelity, rare population detection, and modality transfer. Our results demonstrate that MoRE achieves competitive batch robustness and biological conservation while significantly reducing trainable parameters compared to fully fine-tuned models. This work positions MoRE as a practical step toward general-purpose omics foundation models.

[520] Identifying environmental factors associated with tetrodotoxin contamination in bivalve mollusks using eXplainable AI

M. C. Schoppema, B. H. M. van der Velden, A. Hürriyetoğlu, M. D. Klijnstra, E. J. Faassen, A. Gerssen, H. J. van der Fels-Klerx

Main category: cs.LG

TL;DR: Developed an explainable deep learning model to predict TTX contamination in bivalve mollusks using environmental factors, identifying sun hours, global radiation, water temperature, and chloride concentration as key drivers.

Details

Motivation: TTX contamination in European seafood poses food safety risks and economic losses, requiring early prediction methods. Previous studies identified shallow habitats and water temperature as factors, but temporal relationships remained unexplored.

Method: Created an explainable deep learning model using meteorological and hydrological features as inputs to predict TTX presence/absence in the Dutch Zeeland estuary.

Result: Sunrise/sunset times, global radiation, water temperature, and chloride concentration were most influential. Effective sun hours (day length + radiation) emerged as a key driver for TTX contamination.

Conclusion: The model successfully identified environmental factors associated with TTX contamination, providing a valuable tool for mitigating marine toxin risks in the food industry.

Abstract: Since 2012, tetrodotoxin (TTX) has been found in seafoods such as bivalve mollusks in temperate European waters. TTX contamination leads to food safety risks and economic losses, making early prediction of TTX contamination vital to the food industry and competent authorities. Recent studies have pointed to shallow habitats and water temperature as main drivers of TTX contamination in bivalve mollusks. However, the temporal relationships between abiotic factors, biotic factors, and TTX contamination remain unexplored. We have developed an explainable, deep learning-based model to predict TTX contamination in the Dutch Zeeland estuary. Inputs for the model were meteorological and hydrological features; output was the presence or absence of TTX contamination. Results showed that the time of sunrise, time of sunset, global radiation, water temperature, and chloride concentration contributed most to TTX contamination. Thus, the effective number of sun hours, represented by day length and global radiation, was an important driver for tetrodotoxin contamination in bivalve mollusks. To conclude, our explainable deep learning model identified the aforementioned environmental factors (number of sun hours, global radiation, water temperature, and water chloride concentration) to be associated with tetrodotoxin contamination in bivalve mollusks, making our approach a valuable tool to mitigate marine toxin risks for the food industry and competent authorities.

[521] Model-Based Learning of Whittle indices

Joël Charles-Rebuffé, Nicolas Gast, Bruno Gaujal

Main category: cs.LG

TL;DR: BLINQ is a model-based algorithm that learns Whittle indices for indexable, communicating, unichain MDPs by building empirical estimates and using an extended state-of-the-art algorithm, outperforming Q-learning in sample efficiency and computational cost.

Details

Motivation: To develop a more efficient method for learning Whittle indices in Markov Decision Processes that requires fewer samples and lower computational cost compared to existing Q-learning approaches.

Method: Builds an empirical estimate of the MDP and computes Whittle indices using an extended version of a state-of-the-art algorithm, with proven convergence guarantees and computational complexity analysis.

Result: BLINQ significantly outperforms Q-learning in sample efficiency, requires fewer samples for accurate approximation, and has lower total computational cost even when Q-learning is accelerated with neural networks.

Conclusion: BLINQ provides an efficient and computationally advantageous approach for learning Whittle indices, demonstrating superior performance over Q-learning methods in both sample efficiency and computational requirements.

Abstract: We present BLINQ, a new model-based algorithm that learns the Whittle indices of an indexable, communicating and unichain Markov Decision Process (MDP). Our approach relies on building an empirical estimate of the MDP and then computing its Whittle indices using an extended version of a state-of-the-art existing algorithm. We provide a proof of convergence to the Whittle indices we want to learn as well as a bound on the time needed to learn them with arbitrary precision. Moreover, we investigate its computational complexity. Our numerical experiments suggest that BLINQ significantly outperforms existing Q-learning approaches in terms of the number of samples needed to get an accurate approximation. In addition, it has a total computational cost even lower than Q-learning for any reasonably high number of samples. These observations persist even when the Q-learning algorithms are sped up using pre-trained neural networks to predict Q-values.
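The first stage described in the abstract, building an empirical estimate of the MDP from observed samples, can be sketched with simple counting. The sample format, function name, and toy data below are assumptions; the second stage (computing Whittle indices from the estimate) is deliberately left out, since the abstract does not describe that algorithm.

```python
# Illustrative sketch only. Sample format and names are assumptions; the Whittle
# index computation that BLINQ runs on top of the estimate is not shown.
from collections import defaultdict

def estimate_mdp(transitions):
    """Estimate transition probabilities and mean rewards from
    (state, action, reward, next_state) samples."""
    counts = defaultdict(lambda: defaultdict(int))
    reward_sum = defaultdict(float)
    for s, a, r, s_next in transitions:
        counts[(s, a)][s_next] += 1
        reward_sum[(s, a)] += r
    p_hat, r_hat = {}, {}
    for sa, next_counts in counts.items():
        n = sum(next_counts.values())
        p_hat[sa] = {s2: c / n for s2, c in next_counts.items()}
        r_hat[sa] = reward_sum[sa] / n
    return p_hat, r_hat

# Hypothetical samples from a two-state arm with actions 0 (passive) and 1 (active).
samples = [
    (0, 1, 1.0, 1),
    (0, 1, 0.0, 0),
    (0, 1, 1.0, 1),
    (0, 1, 1.0, 1),
    (1, 0, 0.0, 0),
]
p_hat, r_hat = estimate_mdp(samples)
```

State-action pairs that were never sampled are absent from the estimate, so a practical learner also needs an exploration scheme that visits every pair.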

[522] Short-Range Oversquashing

Yaaqov Mishayev, Yonatan Sverdlov, Tal Amir, Nadav Dym

Main category: cs.LG

TL;DR: MPNNs suffer from oversquashing not just in long-range tasks but also in short-range problems, revealing two distinct mechanisms: bottleneck phenomenon (short-range) and vanishing gradients (long-range). Transformers outperform specialized MPNNs in resolving these issues.

Details

Motivation: To understand the true nature of oversquashing in graph neural networks, challenging the common belief that it's primarily a long-range problem and investigating whether Graph Transformers or specialized MPNNs better address this limitation.

Method: Analysis of oversquashing phenomena in both short-range and long-range graph tasks, disentangling bottleneck effects from vanishing gradient problems, and comparing performance of MPNNs with virtual nodes versus transformers.

Result: Found that oversquashing occurs in short-range tasks due to bottleneck phenomenon, which existing explanations don’t capture and virtual nodes don’t resolve. Transformers successfully handle both short-range and long-range oversquashing.

Conclusion: Transformers are more effective than specialized MPNNs for addressing oversquashing, as they handle both the short-range bottleneck phenomenon and long-range vanishing gradient problems that limit MPNN performance.

Abstract: Message Passing Neural Networks (MPNNs) are widely used for learning on graphs, but their ability to process long-range information is limited by the phenomenon of oversquashing. This limitation has led some researchers to advocate Graph Transformers as a better alternative, whereas others suggest that it can be mitigated within the MPNN framework, using virtual nodes or other rewiring techniques. In this work, we demonstrate that oversquashing is not limited to long-range tasks, but can also arise in short-range problems. This observation allows us to disentangle two distinct mechanisms underlying oversquashing: (1) the bottleneck phenomenon, which can arise even in low-range settings, and (2) the vanishing gradient phenomenon, which is closely associated with long-range tasks. We further show that the short-range bottleneck effect is not captured by existing explanations for oversquashing, and that adding virtual nodes does not resolve it. In contrast, transformers do succeed in such tasks, positioning them as the more compelling solution to oversquashing, compared to specialized MPNNs.

[523] Tight Margin-Based Generalization Bounds for Voting Classifiers over Finite Hypothesis Sets

Kasper Green Larsen, Natascha Schalburg

Main category: cs.LG

TL;DR: First asymptotically tight margin-based generalization bound for voting classifiers

Details

Motivation: To establish rigorous theoretical guarantees for voting classifiers by deriving margin-based generalization bounds that are tight in their parameter dependencies

Method: Mathematical proof establishing a margin-based generalization bound that considers hypothesis set size, margin value, fraction of training points with given margin, number of samples, and failure probability

Result: Proved the first asymptotically tight margin-based generalization bound for voting classifiers that optimally balances all key parameters

Conclusion: The derived bound provides theoretically optimal generalization guarantees for voting classifiers, establishing a fundamental result in statistical learning theory

Abstract: We prove the first margin-based generalization bound for voting classifiers that is asymptotically tight in the tradeoff between the size of the hypothesis set, the margin, the fraction of training points with the given margin, the number of training samples, and the failure probability.
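The quantities the bound trades off can be made concrete with a small sketch that uses the standard definition of a voting classifier's margin: the label times the normalized weighted vote of base hypotheses with outputs in {-1, +1}. The function names and toy data are hypothetical, and the bound itself is not reproduced here.

```python
# Illustrative sketch only. Computes margins of a weighted vote; the paper's bound
# is not reproduced.

def margins(votes, weights, labels):
    """votes[i][k] is hypothesis k's prediction (+1 or -1) on training point i.
    Returns y_i * f(x_i), where f is the weight-normalized vote, so each margin is in [-1, 1]."""
    total = sum(weights)
    return [y * sum(w * h for w, h in zip(weights, row)) / total
            for row, y in zip(votes, labels)]

def fraction_at_most(margin_values, theta):
    """Fraction of training points whose margin is at most theta."""
    return sum(m <= theta for m in margin_values) / len(margin_values)

votes = [[1, 1, 1], [1, -1, 1], [-1, -1, 1]]
weights = [0.5, 0.25, 0.25]
labels = [1, 1, -1]

train_margins = margins(votes, weights, labels)
low_margin_fraction = fraction_at_most(train_margins, theta=0.5)
```

Here every point is classified correctly, yet two of the three have margin 0.5 rather than 1, which is the kind of distinction a margin-based bound is sensitive to and a plain training-error bound is not.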

[524] Diffusion for Fusion: Designing Stellarators with Generative AI

Misha Padidar, Teresa Huang, Andrew Giuliani, Marina Spivak

Main category: cs.LG

TL;DR: Using conditional diffusion models to rapidly generate high-quality stellarator designs with desirable characteristics, achieving less than 5% deviation from quasisymmetry and target parameters.

Details

Motivation: Stellarator design is traditionally time-consuming (hours on computing clusters), and machine learning approaches using large datasets of optimized stellarators offer potential for faster design generation.

Method: Train a conditional diffusion model on QUASR database data to generate quasisymmetric stellarator designs with specific characteristics (aspect ratio and mean rotational transform), applied to designs not seen during training.

Result: Generated stellarators show solid performance with less than 5% deviation from quasisymmetry and target characteristics, with potential to reach the sub-1% target.

Conclusion: Generative modeling shows promise for advancing stellarator design, with multiple promising avenues for future development beyond this case study.

Abstract: Stellarators are a prospective class of fusion-based power plants that confine a hot plasma with three-dimensional magnetic fields. Typically framed as a PDE-constrained optimization problem, stellarator design is a time-consuming process that can take hours to solve on a computing cluster. Developing fast methods for designing stellarators is crucial for advancing fusion research. Given the recent development of large datasets of optimized stellarators, machine learning approaches have emerged as a potential candidate. Motivated by this, we present an open inverse problem to the machine learning community: to rapidly generate high-quality stellarator designs which have a set of desirable characteristics. As a case study in the problem space, we train a conditional diffusion model on data from the QUASR database to generate quasisymmetric stellarator designs with desirable characteristics (aspect ratio and mean rotational transform). The diffusion model is applied to design stellarators with characteristics not seen during training. We provide evaluation protocols and show that many of the generated stellarators exhibit solid performance: less than 5% deviation from quasisymmetry and the target characteristics. The modest deviation from quasisymmetry highlights an opportunity to reach the sub 1% target. Beyond the case study, we share multiple promising avenues for generative modeling to advance stellarator design.

[525] Towards Trustworthy Wi-Fi Sensing: Systematic Evaluation of Deep Learning Model Robustness to Adversarial Attacks

Shreevanth Krishnaa Gopalakrishnan, Stephen Hailes

Main category: cs.LG

TL;DR: Systematic evaluation of CSI deep learning model robustness under various threat models, showing smaller models are less robust than larger ones, physically realizable attacks are less effective than feature-space attacks, and adversarial training improves robustness.

Details

Motivation: Understanding robustness of CSI-based human sensing systems is critical for security and reliability before deployment in real-world environments, as these models can be subtly perturbed.

Method: Framework comparing compact temporal autoencoder models with larger deep architectures across three datasets, evaluating under white-box, black-box/transfer, and universal perturbation threat models with varying attack realism.

Result: Smaller models are markedly less robust despite being efficient and equally performant on clean data; physically realizable signal-space perturbations significantly reduce attack success compared to unconstrained feature-space attacks; adversarial training improves robust accuracy with moderate clean performance degradation.

Conclusion: Provides quantitative baselines for robustness estimation and design principles for secure human-centered sensing systems as wireless sensing advances toward reliable cross-domain operation.

Abstract: Machine learning has become integral to Channel State Information (CSI)-based human sensing systems and is expected to power applications such as device-free activity recognition and identity detection in future cellular and Wi-Fi generations. However, these systems rely on models whose decisions can be subtly perturbed, raising concerns for security and reliability in ubiquitous sensing. Quantifying and understanding the robustness of such models, defined as their ability to maintain accurate predictions under adversarial perturbations, is therefore critical before wireless sensing can be safely deployed in real-world environments. This work presents a systematic evaluation of the robustness of CSI deep learning models under diverse threat models (white-box, black-box/transfer, and universal perturbations) and varying degrees of attack realism. We establish a framework to compare compact temporal autoencoder models with larger deep architectures across three public datasets, quantifying how model scale, training regime, and physical constraints influence robustness. Our experiments show that smaller models, while efficient and equally performant on clean data, are markedly less robust. We further confirm that physically realizable signal-space perturbations, designed to be feasible in real wireless channels, significantly reduce attack success compared to unconstrained feature-space attacks. Adversarial training mitigates these vulnerabilities, improving mean robust accuracy with only moderate degradation in clean performance across both model classes. As wireless sensing advances towards reliable, cross-domain operation, these findings provide quantitative baselines for robustness estimation and inform design principles for secure and trustworthy human-centered sensing systems.
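To make the white-box threat model concrete, the sketch below applies a single-step fast gradient sign perturbation to a toy logistic-regression classifier. FGSM, the two-feature model, and the names `predict` and `fgsm` are illustrative assumptions; the abstract does not name the specific attacks or architectures, and real CSI inputs are far higher-dimensional.

```python
import math


def predict(w, b, x):
    """Probability of the positive class under a logistic-regression model."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1.0 / (1.0 + math.exp(-z))


def fgsm(w, b, x, y, eps):
    """One-step FGSM: move each feature by eps in the direction that raises the loss.

    For binary cross-entropy on a logistic model, d(loss)/dx = (p - y) * w.
    Features with a zero gradient are left unchanged.
    """
    p = predict(w, b, x)
    grad = [(p - y) * wi for wi in w]
    return [xi + eps * math.copysign(1.0, g) if g != 0 else xi
            for xi, g in zip(x, grad)]


# Hypothetical two-feature example: the perturbation is bounded by eps per feature.
w, b = [2.0, -1.0], 0.0
x_clean = [1.0, 1.0]
x_adv = fgsm(w, b, x_clean, y=1, eps=0.5)
```

A physically realizable attack would additionally constrain the perturbation to what a wireless channel can produce, which this unconstrained feature-space sketch does not model.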

[526] NVIDIA Nemotron Parse 1.1

Kateryna Chumachenko, Amala Sanjay Deshmukh, Jarno Seppanen, Ilia Karmanov, Chia-Chih Chen, Lukas Voegtle, Philipp Fischer, Marek Wawrzos, Saeid Motiian, Roman Ageev, Kedi Wu, Alexandre Milesi, Maryam Moosaei, Krzysztof Pawelec, Padmavathy Subramanian, Mehrzad Samadi, Xin Yu, Celina Dear, Sarah Stoddard, Jenna Diamond, Jesse Oliver, Leanna Chraghchian, Patrick Skelly, Tom Balough, Yao Xu, Jane Polak Scowcroft, Daniel Korzekwa, Darragh Hanley, Sandip Bhaskar, Timo Roman, Karan Sapra, Andrew Tao, Bryan Catanzaro

Main category: cs.LG

TL;DR: Nemotron-Parse-1.1 is a lightweight document parsing and OCR model with 885M parameters that improves on its predecessor with better OCR, markdown formatting, table parsing, and text extraction from images, charts, and diagrams.

Details

Motivation: To advance document parsing capabilities with a lightweight model that handles various document types including dense documents, while maintaining competitive accuracy and efficiency.

Method: Uses an encoder-decoder architecture with 885M parameters (including a compact 256M-parameter language decoder), supports longer output sequences for dense documents, and extracts text bounding boxes with semantic classes.

Result: Achieves competitive accuracy on public benchmarks, released publicly on Huggingface with optimized NIM container, and offers a faster variant (Nemotron-Parse-1.1-TC) with 20% speed improvement and minimal quality loss.

Conclusion: Nemotron-Parse-1.1 provides a strong lightweight OCR solution with improved capabilities across multiple document parsing tasks and is made publicly available to the community.

Abstract: We introduce Nemotron-Parse-1.1, a lightweight document parsing and OCR model that advances the capabilities of its predecessor, Nemoretriever-Parse-1.0. Nemotron-Parse-1.1 delivers improved capabilities across general OCR, markdown formatting, structured table parsing, and text extraction from pictures, charts, and diagrams. It also supports a longer output sequence length for visually dense documents. As with its predecessor, it extracts bounding boxes of text segments, as well as corresponding semantic classes. Nemotron-Parse-1.1 follows an encoder-decoder architecture with 885M parameters, including a compact 256M-parameter language decoder. It achieves competitive accuracy on public benchmarks making it a strong lightweight OCR solution. We release the model weights publicly on Huggingface, as well as an optimized NIM container, along with a subset of the training data as part of the broader Nemotron-VLM-v2 dataset. Additionally, we release Nemotron-Parse-1.1-TC which operates on a reduced vision token length, offering a 20% speed improvement with minimal quality degradation.

[527] Ranking-Enhanced Anomaly Detection Using Active Learning-Assisted Attention Adversarial Dual AutoEncoders

Sidahmed Benabderrahmane, James Cheney, Talal Rahwan

Main category: cs.LG

TL;DR: Proposes an AutoEncoder-based unsupervised anomaly detection approach for APTs, enhanced with active learning to reduce labeling costs and improve detection accuracy on highly imbalanced real-world datasets.

Details

Motivation: APTs are stealthy and persistent cyber threats, but labeled data for supervised learning is scarce in real-world cybersecurity. Need methods that work with minimal labeled data.

Method: Uses Attention Adversarial Dual AutoEncoder framework for unsupervised anomaly detection, augmented by active learning that selectively queries labels for uncertain samples to iteratively improve detection.

Result: Significant improvements in detection rates during active learning cycles, outperforming existing approaches on real-world imbalanced datasets where APT attacks constitute only 0.004% of data.

Conclusion: The proposed framework effectively detects APT anomalies with minimal labeled data, making it practical for real-world cybersecurity applications where extensive manual labeling is infeasible.

Abstract: Advanced Persistent Threats (APTs) pose a significant challenge in cybersecurity due to their stealthy and long-term nature. Modern supervised learning methods require extensive labeled data, which is often scarce in real-world cybersecurity environments. In this paper, we propose an innovative approach that leverages AutoEncoders for unsupervised anomaly detection, augmented by active learning to iteratively improve the detection of APT anomalies. By selectively querying an oracle for labels on uncertain or ambiguous samples, we minimize labeling costs while improving detection rates, enabling the model to improve its detection accuracy with minimal data while reducing the need for extensive manual labeling. We provide a detailed formulation of the proposed Attention Adversarial Dual AutoEncoder-based anomaly detection framework and show how the active learning loop iteratively enhances the model. The framework is evaluated on real-world imbalanced provenance trace databases produced by the DARPA Transparent Computing program, where APT-like attacks constitute as little as 0.004% of the data. The datasets span multiple operating systems, including Android, Linux, BSD, and Windows, and cover two attack scenarios. The results have shown significant improvements in detection rates during active learning and better performance compared to other existing approaches.
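The idea of querying an oracle only for uncertain samples can be sketched as follows. This is a minimal illustration that assumes uncertainty is measured as closeness of an anomaly score to a decision threshold; the function name `select_queries`, the threshold, and the budget are hypothetical, and the abstract does not state which query strategy the authors use.

```python
def select_queries(scores, threshold, budget, labeled=()):
    """Return indices of the unlabeled samples whose anomaly scores lie
    closest to the decision threshold (the most ambiguous ones)."""
    already = set(labeled)
    candidates = [i for i in range(len(scores)) if i not in already]
    candidates.sort(key=lambda i: abs(scores[i] - threshold))
    return candidates[:budget]


# Hypothetical anomaly scores from an autoencoder's reconstruction error.
scores = [0.10, 0.48, 0.90, 0.55, 0.30]
to_label = select_queries(scores, threshold=0.5, budget=2)
```

In an active learning loop, the returned indices would be sent to the oracle, the new labels folded back into training, and the scores recomputed before the next round.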

[528] MTBBench: A Multimodal Sequential Clinical Decision-Making Benchmark in Oncology

Kiril Vasilev, Alexandre Misrahi, Eeshaan Jain, Phil F Cheng, Petros Liakopoulos, Olivier Michielin, Michael Moor, Charlotte Bunne

Main category: cs.LG

TL;DR: MTBBench is a new benchmark for multimodal LLMs that simulates Molecular Tumor Board decision-making with clinically challenging, multimodal, and longitudinal oncology questions, addressing limitations of current benchmarks.

Details

Motivation: Current benchmarks fail to capture the complexity of real-world clinical workflows like Molecular Tumor Boards, which require integrating heterogeneous data and evolving insights over time through multi-agent decision-making.

Method: Developed MTBBench with clinician-validated ground truth annotations via a co-developed app, and created an agentic framework with foundation model-based tools to enhance multimodal and longitudinal reasoning.

Result: Benchmarked LLMs show poor reliability: they frequently hallucinate, struggle to reason over time-resolved data, and fail to reconcile conflicting evidence across modalities. The agentic framework improved task-level performance by up to 9.0% for multimodal reasoning and 11.2% for longitudinal reasoning.

Conclusion: MTBBench provides a challenging and realistic testbed for advancing multimodal LLM reasoning, reliability, and tool-use in precision oncology MTB environments, addressing current limitations in biomedical AI evaluation.

Abstract: Multimodal Large Language Models (LLMs) hold promise for biomedical reasoning, but current benchmarks fail to capture the complexity of real-world clinical workflows. Existing evaluations primarily assess unimodal, decontextualized question-answering, overlooking multi-agent decision-making environments such as Molecular Tumor Boards (MTBs). MTBs bring together diverse experts in oncology, where diagnostic and prognostic tasks require integrating heterogeneous data and evolving insights over time. Current benchmarks lack this longitudinal and multimodal complexity. We introduce MTBBench, an agentic benchmark simulating MTB-style decision-making through clinically challenging, multimodal, and longitudinal oncology questions. Ground truth annotations are validated by clinicians via a co-developed app, ensuring clinical relevance. We benchmark multiple open and closed-source LLMs and show that, even at scale, they lack reliability, frequently hallucinating, struggling with reasoning from time-resolved data, and failing to reconcile conflicting evidence or different modalities. To address these limitations, MTBBench goes beyond benchmarking by providing an agentic framework with foundation model-based tools that enhance multimodal and longitudinal reasoning, leading to task-level performance gains of up to 9.0% and 11.2%, respectively. Overall, MTBBench offers a challenging and realistic testbed for advancing multimodal LLM reasoning, reliability, and tool-use with a focus on MTB environments in precision oncology.

[529] From One Attack Domain to Another: Contrastive Transfer Learning with Siamese Networks for APT Detection

Sidahmed Benabderrahmane, Talal Rahwan

Main category: cs.LG

TL;DR: A hybrid transfer learning framework combining transfer learning, XAI, contrastive learning, and Siamese networks improves APT detection across domains by addressing class imbalance, feature drift, and transferability issues.

Details

Motivation: Traditional ML detectors struggle with APT detection due to class imbalance, high dimensionality, scarce real-world traces, and poor transferability to novel attack scenarios.

Method: Hybrid framework integrating transfer learning, XAI (SHAP for feature selection), contrastive learning, and Siamese networks with attention-based autoencoder for cross-domain knowledge transfer and feature alignment.

Result: Improved detection scores across source-to-target transfers compared to classical and deep baselines, demonstrating robustness on DARPA TC data and synthetic attack scenarios.

Conclusion: The approach provides a scalable, explainable, and transferable solution for APT detection that addresses key challenges in cybersecurity threat detection.

Abstract: Advanced Persistent Threats (APTs) pose a major cybersecurity challenge due to their stealth, persistence, and adaptability. Traditional machine learning detectors struggle with class imbalance, high-dimensional features, and scarce real-world traces. They often lack transferability, performing well in the training domain but degrading in novel attack scenarios. We propose a hybrid transfer framework that integrates Transfer Learning, Explainable AI (XAI), contrastive learning, and Siamese networks to improve cross-domain generalization. An attention-based autoencoder supports knowledge transfer across domains, while Shapley Additive exPlanations (SHAP) select stable, informative features to reduce dimensionality and computational cost. A Siamese encoder trained with a contrastive objective aligns source and target representations, increasing anomaly separability and mitigating feature drift. We evaluate on real-world traces from the DARPA Transparent Computing (TC) program and augment with synthetic attack scenarios to test robustness. Across source-to-target transfers, the approach delivers improved detection scores over classical and deep baselines, demonstrating a scalable, explainable, and transferable solution for APT detection.
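As an illustration of how a contrastive objective pulls matching pairs together and pushes mismatched pairs apart, here is a minimal sketch of the classic pairwise contrastive loss often used with Siamese encoders. It is an assumption that the paper uses this particular form; the margin value and the function names are hypothetical.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two embedding vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def contrastive_loss(z1, z2, same, margin=1.0):
    """Pairwise contrastive loss on two embeddings.

    same=True  -> penalize any distance between the pair (d^2).
    same=False -> penalize only pairs closer than the margin (max(0, m - d)^2).
    """
    d = euclidean(z1, z2)
    if same:
        return d * d
    return max(0.0, margin - d) ** 2


# Hypothetical embeddings: a benign source trace and a benign target trace
# should end up close, while a benign/attack pair should be at least `margin` apart.
loss_same = contrastive_loss([0.1, 0.2], [0.1, 0.25], same=True)
loss_diff = contrastive_loss([0.1, 0.2], [0.9, 0.8], same=False)
```

Minimizing this loss over source and target pairs is one way to align the two domains in a shared embedding space.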

[530] DP-MicroAdam: Private and Frugal Algorithm for Training and Fine-tuning

Mihaela Hudişteanu, Edwige Cyffers, Nikita P. Kalinin

Main category: cs.LG

TL;DR: DP-MicroAdam is a memory-efficient, sparsity-aware adaptive optimizer for differentially private training that outperforms existing adaptive DP optimizers and achieves accuracy competitive with or superior to DP-SGD across multiple benchmarks.

Details

Motivation: Current DP training relies heavily on DP-SGD, which requires extensive compute and hyperparameter tuning, while adaptive optimizers that work well in non-private settings are underutilized in DP training.

Method: Proposed DP-MicroAdam, a memory-efficient and sparsity-aware adaptive differential privacy optimizer that converges at the optimal rate in stochastic non-convex optimization.

Result: DP-MicroAdam outperforms existing adaptive DP optimizers and achieves competitive or superior accuracy compared to DP-SGD on CIFAR-10, ImageNet, and transformer fine-tuning tasks.

Conclusion: Adaptive optimization can improve both performance and stability under differential privacy, making DP-MicroAdam a practical alternative to DP-SGD.

Abstract: Adaptive optimizers are the de facto standard in non-private training as they often enable faster convergence and improved performance. In contrast, differentially private (DP) training is still predominantly performed with DP-SGD, typically requiring extensive compute and hyperparameter tuning. We propose DP-MicroAdam, a memory-efficient and sparsity-aware adaptive DP optimizer. We prove that DP-MicroAdam converges in stochastic non-convex optimization at the optimal $\mathcal{O}(1/\sqrt{T})$ rate, up to privacy-dependent constants. Empirically, DP-MicroAdam outperforms existing adaptive DP optimizers and achieves competitive or superior accuracy compared to DP-SGD across a range of benchmarks, including CIFAR-10, large-scale ImageNet training, and private fine-tuning of pretrained transformers. These results demonstrate that adaptive optimization can improve both performance and stability under differential privacy.
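For context, the DP-SGD baseline that the abstract compares against can be sketched in a few lines: clip each per-example gradient, sum, add Gaussian noise scaled to the clipping norm, and average. This is the standard baseline step, not DP-MicroAdam itself, whose update rule is not given in the abstract; the function names and parameter values are illustrative.

```python
import math
import random


def clip(grad, max_norm):
    """Scale a gradient so its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grad))
    scale = min(1.0, max_norm / norm) if norm > 0 else 1.0
    return [g * scale for g in grad]


def dp_sgd_step(params, per_example_grads, lr, max_norm, noise_multiplier, rng):
    """One DP-SGD update: per-example clipping, Gaussian noise, then averaging."""
    n = len(per_example_grads)
    clipped = [clip(g, max_norm) for g in per_example_grads]
    summed = [sum(column) for column in zip(*clipped)]
    # Noise standard deviation is proportional to the clipping norm.
    noisy = [s + rng.gauss(0.0, noise_multiplier * max_norm) for s in summed]
    return [p - lr * g / n for p, g in zip(params, noisy)]


# Hypothetical batch of two per-example gradients for a two-parameter model.
rng = random.Random(0)
new_params = dp_sgd_step([0.0, 0.0], [[3.0, 4.0], [0.3, 0.4]],
                         lr=0.1, max_norm=1.0, noise_multiplier=1.0, rng=rng)
```

An adaptive DP optimizer would feed the same noisy, clipped gradient into moment estimates rather than applying it directly.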

[531] Adam Simplified: Bias Correction Debunked

Sam Laing, Antonio Orvieto

Main category: cs.LG

TL;DR: Bias correction in Adam optimizer provides no performance improvement in optimal configurations and can be detrimental without proper learning rate scheduling.

Details

Motivation: To investigate the empirical necessity of bias correction in Adam optimizer, as its contribution is poorly understood despite being a standard component.

Method: Systematic ablations on vision and language modeling tasks, analyzing bias correction’s effects across different hyperparameter configurations.

Result: Bias correction shows no improvement in final test performance under optimal hyperparameters, and can be detrimental without appropriate learning rate scheduling. It acts as implicit learning rate scheduling dependent on β₁, β₂ parameters.

Conclusion: Bias correction should not be universally included in Adam optimizer as it provides no benefits in optimal settings and can harm performance without proper scheduling.

Abstract: The Adam optimizer is a cornerstone of modern deep learning, yet the empirical necessity of each of its individual components is often taken for granted. This paper presents a focused investigation into the role of bias-correction, a feature whose contribution remains poorly understood. Through a series of systematic ablations on vision and language modelling tasks, we demonstrate that the conventional wisdom surrounding bias correction is misleading. In particular, we demonstrate that in the optimal hyper-parameter configuration, the inclusion of bias correction leads to no improvement in final test performance. Moreover, unless appropriate learning rate scheduling is implemented, the inclusion of bias correction can sometimes be detrimental to performance. We further reinterpret bias correction as a form of implicit learning rate scheduling whose behaviour is strongly dependent on the choice of smoothing hyper-parameters $\beta_1, \beta_2 \in [0,1)$. Our findings challenge the universal inclusion of this component.
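One way to see bias correction as an implicit learning rate schedule follows from the standard Adam update: with the stabilizing epsilon ignored, the corrected step equals the uncorrected step multiplied by sqrt(1 - beta2^t) / (1 - beta1^t). The sketch below computes that factor and checks it against a scalar Adam loop. This derivation comes from the textbook Adam equations and may not match the paper's exact formulation; the function names are illustrative.

```python
import math


def bias_correction_factor(t, beta1=0.9, beta2=0.999):
    """Step-size multiplier that bias correction applies at step t (eps ignored)."""
    return math.sqrt(1.0 - beta2 ** t) / (1.0 - beta1 ** t)


def adam_steps(grads, lr=1e-3, beta1=0.9, beta2=0.999, eps=0.0, bias_correction=True):
    """Scalar Adam; returns the signed step taken at each iteration.

    eps defaults to 0 so the factor above is exact; gradients must be nonzero.
    """
    m = v = 0.0
    steps = []
    for t, g in enumerate(grads, start=1):
        m = beta1 * m + (1.0 - beta1) * g
        v = beta2 * v + (1.0 - beta2) * g * g
        if bias_correction:
            m_hat = m / (1.0 - beta1 ** t)
            v_hat = v / (1.0 - beta2 ** t)
        else:
            m_hat, v_hat = m, v
        steps.append(lr * m_hat / (math.sqrt(v_hat) + eps))
    return steps


# The factor starts well away from 1 and approaches 1 as t grows,
# at a rate governed by beta1 and beta2.
schedule = [bias_correction_factor(t) for t in (1, 10, 100, 1000, 10000)]
```

With the common defaults beta1 = 0.9 and beta2 = 0.999 the factor is below 1 at the first step, so bias correction behaves like a short warmup in this setting; other choices of beta1 and beta2 change the shape of the implied schedule.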

[532] Feature-Modulated UFNO for Improved Prediction of Multiphase Flow in Porous Media

Alhasan Abdellatif, Hannah P. Menke, Ahmed H. Elsheikh, Florian Doster, Kamaljit Singh

Main category: cs.LG

TL;DR: UFNO-FiLM enhances UFNO by decoupling scalar inputs using FiLM layers and implementing spatially weighted loss, achieving 21% MAE reduction in gas saturation predictions.

Details

Motivation: UFNO inefficiently treats scalar inputs as spatial fields and lacks spatial error sensitivity, limiting performance in critical regions.

Method: Incorporates FiLM layers to decouple scalar inputs from spatial features and uses spatially weighted loss function to prioritize important regions.

Result: 21% reduction in gas saturation Mean Absolute Error compared to UFNO in subsurface multiphase flow experiments.

Conclusion: UFNO-FiLM effectively improves predictive accuracy by addressing scalar input inefficiencies and spatial error sensitivity limitations.

Abstract: The UNet-enhanced Fourier Neural Operator (UFNO) extends the Fourier Neural Operator (FNO) by incorporating a parallel UNet pathway, enabling the retention of both high- and low-frequency components. While UFNO improves predictive accuracy over FNO, it inefficiently treats scalar inputs (e.g., temperature, injection rate) as spatially distributed fields by duplicating their values across the domain. This forces the model to process redundant constant signals within the frequency domain. Additionally, its standard loss function does not account for spatial variations in error sensitivity, limiting performance in regions of high physical importance. We introduce UFNO-FiLM, an enhanced architecture that incorporates two key innovations. First, we decouple scalar inputs from spatial features using a Feature-wise Linear Modulation (FiLM) layer, allowing the model to modulate spatial feature maps without introducing constant signals into the Fourier transform. Second, we employ a spatially weighted loss function that prioritizes learning in critical regions. Our experiments on subsurface multiphase flow demonstrate a 21% reduction in gas saturation Mean Absolute Error (MAE) compared to UFNO, highlighting the effectiveness of our approach in improving predictive accuracy.
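The two ingredients described above can be sketched in plain Python: a FiLM layer that turns scalar inputs into a per-channel scale and shift, and a mean absolute error weighted by a spatial importance map. The linear maps producing gamma and beta, the weight map, and all names are assumptions for illustration; the abstract does not specify how UFNO-FiLM parameterizes them.

```python
def film(feature_maps, scalars, w_gamma, b_gamma, w_beta, b_beta):
    """Feature-wise linear modulation: out[c] = gamma_c * x[c] + beta_c.

    feature_maps: [C][H][W] nested lists; scalars: [S];
    w_gamma, w_beta: [C][S] linear weights; b_gamma, b_beta: [C] biases.
    """
    out = []
    for c, fmap in enumerate(feature_maps):
        gamma = b_gamma[c] + sum(w * s for w, s in zip(w_gamma[c], scalars))
        beta = b_beta[c] + sum(w * s for w, s in zip(w_beta[c], scalars))
        out.append([[gamma * x + beta for x in row] for row in fmap])
    return out


def weighted_mae(pred, target, weights):
    """Mean absolute error where each grid cell is weighted by its importance."""
    num = sum(w * abs(p - t)
              for p_row, t_row, w_row in zip(pred, target, weights)
              for p, t, w in zip(p_row, t_row, w_row))
    den = sum(w for row in weights for w in row)
    return num / den


# Hypothetical single-channel 2x2 feature map modulated by one scalar input
# (for example an injection rate), instead of tiling that scalar over the grid.
modulated = film([[[1.0, 2.0], [3.0, 4.0]]], [2.0],
                 w_gamma=[[0.5]], b_gamma=[1.0], w_beta=[[1.0]], b_beta=[0.0])
```

Because the scalar enters only through gamma and beta, no constant-valued channel has to pass through the Fourier transform.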

[533] E2E-GRec: An End-to-End Joint Training Framework for Graph Neural Networks and Recommender Systems

Rui Xue, Shichao Zhu, Liang Qin, Guangmou Pan, Yang Song, Tianfu Wu

Main category: cs.LG

TL;DR: E2E-GRec is an end-to-end training framework that unifies GNN training with recommender systems to overcome limitations of traditional two-stage pipelines, achieving better performance through joint optimization.

Details

Motivation: Traditional two-stage GNN pipelines in recommender systems suffer from high computational overhead due to repeated embedding refresh and lack of joint optimization, making GNNs suboptimally informative for recommendation tasks.

Method: Three key components: (1) efficient subgraph sampling from large-scale cross-domain heterogeneous graphs, (2) Graph Feature Auto-Encoder as auxiliary self-supervised task, (3) two-level feature fusion with Gradnorm-based dynamic loss balancing for stable multi-task training.

Result: Extensive evaluations show significant improvements: +0.133% relative improvement in stay duration, 0.3171% reduction in average number of videos skipped, and consistent gains across multiple recommendation metrics in both offline and online A/B tests.

Conclusion: E2E-GRec consistently surpasses traditional approaches by enabling joint optimization of GNNs and recommender systems, providing significant performance gains while maintaining training efficiency.

Abstract: Graph Neural Networks (GNNs) have emerged as powerful tools for modeling graph-structured data and have been widely used in recommender systems, such as for capturing complex user-item and item-item relations. However, most industrial deployments adopt a two-stage pipeline: GNNs are first pre-trained offline to generate node embeddings, which are then used as static features for downstream recommender systems. This decoupled paradigm leads to two key limitations: (1) high computational overhead, since large-scale GNN inference must be repeatedly executed to refresh embeddings; and (2) lack of joint optimization, as the gradient from the recommender system cannot directly influence the GNN learning process, causing the GNN to be suboptimally informative for the recommendation task. In this paper, we propose E2E-GRec, a novel end-to-end training framework that unifies GNN training with the recommender system. Our framework is characterized by three key components: (i) efficient subgraph sampling from a large-scale cross-domain heterogeneous graph to ensure training scalability and efficiency; (ii) a Graph Feature Auto-Encoder (GFAE) serving as an auxiliary self-supervised task to guide the GNN to learn structurally meaningful embeddings; and (iii) a two-level feature fusion mechanism combined with Gradnorm-based dynamic loss balancing, which stabilizes graph-aware multi-task end-to-end training. Extensive offline evaluations, online A/B tests (e.g., a +0.133% relative improvement in stay duration, a 0.3171% reduction in the average number of videos a user skips) on large-scale production data, together with theoretical analysis, demonstrate that E2E-GRec consistently surpasses traditional approaches, yielding significant gains across multiple recommendation metrics.

[534] MSTN: Fast and Efficient Multivariate Time Series Model

Sumit S Shevtekar, Chandresh K Maurya, Gourab Sil

Main category: cs.LG

TL;DR: MSTN is a multi-scale temporal network that adaptively models temporal patterns across multiple timescales, achieving state-of-the-art performance on 24 of 32 benchmark datasets for time-series tasks.

Details

Motivation: Existing models with fixed-scale structural priors over-regularize temporal dynamics and struggle with unpredictable, sudden events, limiting their ability to model the full spectrum of temporal variations.

Method: Hierarchical architecture with: (i) multi-scale convolutional encoder for local patterns, (ii) sequence modeling (BiLSTM/Transformer) for long-range dependencies, and (iii) gated fusion with SE and multi-head temporal attention for dynamic feature integration.

Result: Achieves competitive SOTA performance, outperforming contemporary approaches like EMTSF, LLM4TS, HiMTM, TIME-LLM, MTST, SOFTS, iTransformer, TimesNet, and PatchTST on 24 of 32 benchmark datasets.

Conclusion: MSTN provides a unified framework for adaptive temporal modeling across multiple timescales, demonstrating consistent performance across diverse temporal tasks including forecasting, imputation, and classification.

Abstract: Real-world time-series data is highly non-stationary, with complex dynamics that operate across multiple timescales, ranging from fast, short-term changes to slow, long-term trends. Most existing models rely on fixed-scale structural priors, such as patch-based tokenization, fixed frequency transformations, or frozen backbone architectures. This often leads to over-regularization of temporal dynamics, which limits their ability to adaptively model the full spectrum of temporal variations and impairs their performance on unpredictable, sudden, high-magnitude events. To address this, we introduce the Multi-scale Temporal Network (MSTN), a novel deep learning architecture founded on a hierarchical multi-scale and sequence modeling principle. The MSTN framework integrates: (i) a multi-scale convolutional encoder that constructs a hierarchical feature pyramid for local patterns; (ii) a sequence modeling component for long-range temporal dependencies, which we empirically validate with BiLSTM and Transformer variants, establishing a flexible foundation for future architectural advancements; and (iii) a gated fusion mechanism augmented with squeeze-and-excitation (SE) and multi-head temporal attention (MHTA) for dynamic, context-aware feature integration. This design enables MSTN to adaptively model temporal patterns from milliseconds to long-range dependencies within a unified framework. Extensive evaluations across time-series long-horizon forecasting, imputation, classification, and a generalizability study demonstrate that MSTN achieves competitive state-of-the-art (SOTA) performance, showing improvements over contemporary approaches including EMTSF, LLM4TS, HiMTM, TIME-LLM, MTST, SOFTS, iTransformer, TimesNet, and PatchTST. In total, MSTN establishes new SOTA performance on 24 of 32 benchmark datasets, demonstrating its consistent performance across diverse temporal tasks.

[535] A Tale of Two Geometries: Adaptive Optimizers and Non-Euclidean Descent

Shuo Xie, Tianhao Wang, Beining Wu, Zhiyuan Li

Main category: cs.LG

TL;DR: The paper establishes a theoretical connection between adaptive optimizers and normalized steepest descent, showing that adaptive smoothness characterizes convergence in nonconvex settings and enables momentum acceleration in convex optimization.

Details

Motivation: To bridge the theoretical gap between adaptive optimizers and normalized steepest descent by understanding their geometric differences and convergence properties across different optimization settings.

Method: Extend adaptive smoothness theory to nonconvex optimization, analyze convergence under adaptive smoothness conditions, and introduce adaptive gradient variance for stochastic optimization analysis.

Result: Adaptive smoothness precisely characterizes adaptive optimizer convergence in nonconvex settings and enables Nesterov momentum acceleration in convex optimization with non-Euclidean geometries. Adaptive gradient variance provides dimension-free convergence guarantees.

Conclusion: The theoretical framework connects adaptive optimizers and normalized steepest descent through geometric properties, revealing that adaptive smoothness and adaptive gradient variance are key to understanding and improving optimization algorithms across convex, nonconvex, and stochastic settings.

Abstract: Adaptive optimizers can reduce to normalized steepest descent (NSD) when only adapting to the current gradient, suggesting a close connection between the two algorithmic families. A key distinction between their analyses, however, lies in the geometries, e.g., smoothness notions, they rely on. In the convex setting, adaptive optimizers are governed by a stronger adaptive smoothness condition, while NSD relies on the standard notion of smoothness. We extend the theory of adaptive smoothness to the nonconvex setting and show that it precisely characterizes the convergence of adaptive optimizers. Moreover, we establish that adaptive smoothness enables acceleration of adaptive optimizers with Nesterov momentum in the convex setting, a guarantee unattainable under standard smoothness for certain non-Euclidean geometry. We further develop an analogous comparison for stochastic optimization by introducing adaptive gradient variance, which parallels adaptive smoothness and leads to dimension-free convergence guarantees that cannot be achieved under standard gradient variance for certain non-Euclidean geometry.
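The opening claim, that an adaptive optimizer reduces to normalized steepest descent when it adapts only to the current gradient, can be checked on a small example. The sketch below uses Adam without bias correction and with beta1 = beta2 = 0 and eps = 0, which yields the sign of the gradient, that is, steepest descent with respect to the infinity norm. Choosing Adam and the infinity norm is an illustrative assumption; the paper treats a broader family of optimizers and geometries.

```python
import math


def adam_direction(grad, m, v, beta1, beta2, eps=0.0):
    """One Adam direction (no bias correction) plus the updated moment estimates."""
    new_m = [beta1 * mi + (1.0 - beta1) * gi for mi, gi in zip(m, grad)]
    new_v = [beta2 * vi + (1.0 - beta2) * gi * gi for vi, gi in zip(v, grad)]
    direction = [mi / (math.sqrt(vi) + eps) for mi, vi in zip(new_m, new_v)]
    return direction, new_m, new_v


def sign_descent_direction(grad):
    """Steepest-descent direction under the infinity norm: the elementwise sign."""
    return [math.copysign(1.0, g) for g in grad]


# With beta1 = beta2 = 0 the moments equal g and g^2, so the direction is g / |g|.
# Gradient entries are assumed nonzero so the division is well defined.
g = [0.3, -2.0, 5.0]
d_adam, _, _ = adam_direction(g, [0.0] * 3, [0.0] * 3, beta1=0.0, beta2=0.0)
d_sign = sign_descent_direction(g)
```

Here `d_adam` and `d_sign` agree up to floating-point rounding, while nonzero beta values make the two directions diverge as history accumulates.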

[536] BrowseSafe: Understanding and Preventing Prompt Injection Within AI Browser Agents

Kaiyuan Zhang, Mark Tenenholtz, Kyle Polley, Jerry Ma, Denis Yarats, Ninghui Li

Main category: cs.LG

TL;DR: This paper examines prompt injection attacks against AI web agents, creates a realistic benchmark of HTML-based attacks, evaluates existing defenses across AI models, and proposes a multi-layered defense strategy.

Details

Motivation: AI agents in web browsers face security challenges beyond traditional web threats, particularly prompt injection attacks whose real-world impact is not well understood despite prior identification of this attack vector.

Method: Created a benchmark of realistic HTML payloads for prompt injection attacks, conducted comprehensive empirical evaluation of existing defenses across frontier AI models, and proposed multi-layered defense strategies combining architectural and model-based approaches.

Result: Developed a benchmark that emphasizes attacks influencing real-world actions rather than just text outputs, with complexity and distractor frequency matching real-world scenarios. Evaluated defense effectiveness across multiple AI models.

Conclusion: The work provides a blueprint for designing practical, secure web agents through a defense-in-depth approach that addresses evolving prompt injection threats.

Abstract: The integration of artificial intelligence (AI) agents into web browsers introduces security challenges that go beyond traditional web application threat models. Prior work has identified prompt injection as a new attack vector for web agents, yet the resulting impact within real-world environments remains insufficiently understood. In this work, we examine the landscape of prompt injection attacks and synthesize a benchmark of attacks embedded in realistic HTML payloads. Our benchmark goes beyond prior work by emphasizing injections that can influence real-world actions rather than mere text outputs, and by presenting attack payloads with complexity and distractor frequency similar to what real-world agents encounter. We leverage this benchmark to conduct a comprehensive empirical evaluation of existing defenses, assessing their effectiveness across a suite of frontier AI models. We propose a multi-layered defense strategy comprising both architectural and model-based defenses to protect against evolving prompt injection attacks. Our work offers a blueprint for designing practical, secure web agents through a defense-in-depth approach.

[537] Anatomica: Localized Control over Geometric and Topological Properties for Anatomical Diffusion Models

Karim Kadry, Abdallah Abdelwahed, Shoaib Goraya, Ajay Manicka, Naravich Chutisilp, Farhad Nezami, Elazer Edelman

Main category: cs.LG

TL;DR: Anatomica is an inference-time framework for generating multi-class anatomical voxel maps with localized geo-topological control using cuboidal domains and differentiable penalty functions.

Details

Motivation: To enable rational design of synthetic anatomical datasets for virtual trials and machine learning workflows by providing precise control over geometric and topological features.

Method: Uses cuboidal control domains to slice substructures, computes differentiable penalty functions for geometric control (voxel-wise moments) and topological control (persistent homology), and implements for latent diffusion models with neural field decoders.

Result: Enables flexible control of anatomical properties across diverse systems, composing constraints to control complex structures over arbitrary dimensions and coordinate systems.

Conclusion: Anatomica provides a comprehensive framework for generating anatomically accurate synthetic data with precise geo-topological control, facilitating applications in virtual trials and ML workflows.

Abstract: We present Anatomica: an inference-time framework for generating multi-class anatomical voxel maps with localized geo-topological control. During generation, we use cuboidal control domains of varying dimensionality, location, and shape to slice out relevant substructures. These local substructures are used to compute differentiable penalty functions that steer the sample towards target constraints. We control geometric features such as size, shape, and position through voxel-wise moments, while topological features such as connected components, loops, and voids are enforced through persistent homology. Lastly, we implement Anatomica for latent diffusion models, where neural field decoders partially extract substructures, enabling the efficient control of anatomical properties. Anatomica applies flexibly across diverse anatomical systems, composing constraints to control complex structures over arbitrary dimensions and coordinate systems, thereby enabling the rational design of synthetic datasets for virtual trials or machine learning workflows.
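As a rough illustration of geometric control through voxel-wise moments, the sketch below computes the zeroth moment (volume) and first moment (centroid) of a soft occupancy map inside a cuboidal domain, then forms a squared-error penalty against targets. The function names, the quadratic penalty form, and the use of plain NumPy are assumptions made for illustration; the abstract does not specify them, and a real implementation would need an autodiff framework for the penalty to steer sampling.

```python
import numpy as np

def voxel_moments(prob_map):
    """Return (volume, centroid) of a soft occupancy map with values in [0, 1]."""
    prob_map = np.asarray(prob_map, dtype=float)
    volume = prob_map.sum()  # zeroth moment
    grids = np.meshgrid(*[np.arange(n) for n in prob_map.shape], indexing="ij")
    # First moments, normalised by the volume, give the centroid per axis.
    centroid = np.array([(g * prob_map).sum() / volume for g in grids])
    return float(volume), centroid

def size_position_penalty(prob_map, domain, target_volume, target_centroid):
    """Hypothetical penalty: squared error on volume and centroid inside `domain`.

    `domain` is a tuple of slices describing a cuboidal control region; the
    centroid is expressed in the local coordinates of that region.
    """
    sub = np.asarray(prob_map, dtype=float)[domain]
    volume, centroid = voxel_moments(sub)
    return float((volume - target_volume) ** 2
                 + np.sum((centroid - np.asarray(target_centroid)) ** 2))

# Toy example: a 2x2x2 block inside a 4x4x4 grid.
vol = np.zeros((4, 4, 4))
vol[1:3, 1:3, 1:3] = 1.0
full_domain = (slice(0, 4), slice(0, 4), slice(0, 4))
print(voxel_moments(vol))
print(size_position_penalty(vol, full_domain, 8.0, [1.5, 1.5, 1.5]))
```

Topological control via persistent homology is not shown here; it would need a dedicated library and is outside the scope of this sketch.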

[538] The Driver-Blindness Phenomenon: Why Deep Sequence Models Default to Autocorrelation in Blood Glucose Forecasting

Heman Shakeri

Main category: cs.LG

TL;DR: Deep sequence models for blood glucose forecasting fail to effectively use clinical drivers (insulin, meals, activity) despite known physiological mechanisms, a problem termed ‘Driver-Blindness’ measured by Δ_drivers.

Details

Motivation: Current blood glucose forecasting models consistently underutilize clinically important drivers despite well-understood physiological relationships, leading to suboptimal performance.

Method: The paper formalizes Driver-Blindness via Δ_drivers metric, identifies three root causes (architectural biases, data fidelity gaps, physiological heterogeneity), and synthesizes mitigation strategies including physiological feature encoders, causal regularization, and personalization.

Result: Across literature, Δ_drivers is typically near zero, indicating that multivariate models don’t outperform univariate baselines despite incorporating additional driver information.

Conclusion: Future work should routinely report Δ_drivers to prevent driver-blind models from being considered state-of-the-art, and implement strategies like physiological feature encoding and personalization to address Driver-Blindness.

Abstract: Deep sequence models for blood glucose forecasting consistently fail to leverage clinically informative drivers (insulin, meals, and activity) despite well-understood physiological mechanisms. We term this Driver-Blindness and formalize it via $\Delta_{\text{drivers}}$, the performance gain of multivariate models over matched univariate baselines. Across the literature, $\Delta_{\text{drivers}}$ is typically near zero. We attribute this to three interacting factors: architectural biases favoring autocorrelation (C1), data fidelity gaps that render drivers noisy and confounded (C2), and physiological heterogeneity that undermines population-level models (C3). We synthesize strategies that partially mitigate Driver-Blindness, including physiological feature encoders, causal regularization, and personalization, and recommend that future work routinely report $\Delta_{\text{drivers}}$ to prevent driver-blind models from being considered state-of-the-art.
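The driver-gain metric can be sketched in a few lines. The abstract defines it only as the performance gain of a multivariate model over a matched univariate baseline; the choice of RMSE as the error measure, the sign convention (positive means the drivers helped), and the glucose values below are assumptions for illustration.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root-mean-square error between two equally sized sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def delta_drivers(y_true, pred_univariate, pred_multivariate):
    """Error reduction of the multivariate model relative to the univariate one.

    Assumed convention: a positive value means the driver inputs improved the
    forecast; a value near zero suggests the model ignores them.
    """
    return rmse(y_true, pred_univariate) - rmse(y_true, pred_multivariate)

# Hypothetical glucose readings (mg/dL) and two forecasts.
y_true = [100.0, 110.0, 120.0, 130.0]
pred_uni = [102.0, 108.0, 122.0, 128.0]    # every error is 2 in magnitude
pred_multi = [101.0, 109.0, 121.0, 129.0]  # every error is 1 in magnitude
print(delta_drivers(y_true, pred_uni, pred_multi))  # 1.0
```

For the comparison to be meaningful, both models should share the same architecture, training budget, and evaluation split, which is presumably what "matched" refers to.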

[539] Attention Trajectories as a Diagnostic Axis for Deep Reinforcement Learning

Charlotte Beylier, Hannah Selder, Arthur Fleig, Simon M. Hofmann, Nico Scherf

Main category: cs.LG

TL;DR: ATOMs (attention-oriented metrics) track RL agent attention development during training, revealing phase-based learning patterns and linking attention to behavior across Pong game variations.

Details

Motivation: To better understand RL agent learning processes beyond mathematical formulations by investigating how attention develops during training.

Method: Introduced ATOMs to monitor RL agent attention, tested on three Pong game variations designed to teach distinct behaviors, with behavioral assessments.

Result: ATOMs successfully identified distinct attention patterns for each game variation, showed attention develops in consistent phases across games, and demonstrated attention differences translate to behavioral differences.

Conclusion: ATOMs can improve understanding of RL learning processes and the relationship between attention and learning.

Abstract: The learning process of a reinforcement learning (RL) agent remains poorly understood beyond the mathematical formulation of its learning algorithm. To address this gap, we introduce attention-oriented metrics (ATOMs) to investigate the development of an RL agent’s attention during training. In a controlled experiment, we tested ATOMs on three variations of a Pong game, each designed to teach the agent distinct behaviours, complemented by a behavioural assessment. ATOMs successfully delineate the attention patterns of an agent trained on each game variation, and these differences in attention patterns translate into differences in the agent’s behaviour. Through continuous monitoring of ATOMs during training, we observed that the agent’s attention developed in phases, and that these phases were consistent across game variations. Overall, we believe that ATOMs could help improve our understanding of the learning processes of RL agents and clarify the relationship between attention and learning.

[540] Latent Diffusion Inversion Requires Understanding the Latent Space

Mingxing Rao, Bowen Qu, Daniel Moyer

Main category: cs.LG

TL;DR: The paper analyzes memorization in Latent Diffusion Models (LDMs), finding non-uniform memorization across latent codes and dimensions, and introduces a method to rank latent dimensions by their contribution to memorization, improving membership inference attack performance.

Details

Motivation: Previous model inversion studies focused on data domain diffusion models, ignoring the encoder/decoder pair and latent codes in LDMs. The authors aim to understand how auto-encoder geometry affects memorization in LDMs.

Method: The authors developed a principled method to rank latent dimensions by their per-dimensional contribution to the decoder pullback metric, identifying dimensions most responsible for memorization. They then remove less-memorizing dimensions when computing attack statistics for score-based membership inference attacks.

Result: Removing less-memorizing dimensions significantly improves membership inference attack performance, with average AUROC gains of 2.7% and substantial increases in TPR@1%FPR (6.42%) across diverse datasets including CIFAR-10, CelebA, ImageNet-1K, Pokémon, MS-COCO, and Flickr.

Conclusion: The study highlights the overlooked influence of auto-encoder geometry on LDM memorization and provides a new perspective for analyzing privacy risks in diffusion-based generative models.

Abstract: The recovery of training data from generative models (“model inversion”) has been extensively studied for diffusion models in the data domain. The encoder/decoder pair and corresponding latent codes have largely been ignored by inversion techniques applied to latent space generative models, e.g., Latent Diffusion Models (LDMs). In this work we describe two key findings: (1) The diffusion model exhibits non-uniform memorization across latent codes, tending to overfit samples located in high-distortion regions of the decoder pullback metric. (2) Even within a single latent code, different dimensions contribute unequally to memorization. We introduce a principled method to rank latent dimensions by their per-dimensional contribution to the decoder pullback metric, identifying those most responsible for memorization. Empirically, removing less-memorizing dimensions when computing attack statistics for score-based membership inference attacks significantly improves performance, with average AUROC gains of 2.7% and substantial increases in TPR@1%FPR (6.42%) across diverse datasets including CIFAR-10, CelebA, ImageNet-1K, Pokémon, MS-COCO, and Flickr. This indicates stronger confidence in identifying members under extremely low false-positive tolerance. Our results highlight the overlooked influence of the auto-encoder geometry on LDM memorization and provide a new perspective for analyzing privacy risks in diffusion-based generative models.
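One plausible reading of "per-dimensional contribution to the decoder pullback metric" is the diagonal of G(z) = J(z)^T J(z), where J is the decoder Jacobian at latent code z. The sketch below ranks latent dimensions that way on a toy linear decoder. This interpretation, the finite-difference Jacobian, and all names are assumptions; the paper's actual ranking rule may differ.

```python
import numpy as np

def jacobian_fd(decoder, z, eps=1e-6):
    """Forward-difference Jacobian of `decoder` at `z`, shape (out_dim, latent_dim)."""
    z = np.asarray(z, dtype=float)
    base = decoder(z)
    columns = []
    for i in range(z.size):
        step = np.zeros_like(z)
        step[i] = eps
        columns.append((decoder(z + step) - base) / eps)
    return np.stack(columns, axis=1)

def rank_latent_dims(decoder, z):
    """Rank latent dimensions by the diagonal of the pullback metric J^T J."""
    J = jacobian_fd(decoder, z)
    G = J.T @ J                 # pullback of the Euclidean metric through the decoder
    contrib = np.diag(G)        # squared norm of each Jacobian column
    order = np.argsort(-contrib)  # most influential dimension first
    return order, contrib

# Toy linear decoder (hypothetical): output = W @ z.
W = np.array([[3.0, 0.0, 0.0],
              [0.0, 1.0, 0.0],
              [0.0, 0.0, 2.0],
              [0.0, 0.0, 0.0]])
order, contrib = rank_latent_dims(lambda z: W @ z, np.zeros(3))
print(order, contrib)
```

Dimensions at the tail of `order` would be the candidates to drop when computing an attack statistic, under the assumption stated above.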

[541] DiFR: Inference Verification Despite Nondeterminism

Adam Karvonen, Daniel Reuter, Roy Rinberg, Luke Marks, Adrià Garriga-Alonso, Keri Warr

Main category: cs.LG

TL;DR: Token-DiFR is a method for verifying LLM inference correctness by comparing generated tokens against a trusted reference implementation using synchronized random seeds, enabling detection of errors and quantization with high accuracy.

Details

Motivation: As LLM inference demand grows, providers need to verify that inference processes are performed correctly without errors or tampering, but re-running inference often produces different results due to numerical noise, making it hard to distinguish legitimate variation from actual problems.

Method: Token-DiFR compares generated tokens against predictions from a trusted reference implementation conditioned on the same random seed. For sample-efficient verification, Activation-DiFR uses random orthogonal projections to compress activations into compact fingerprints.

Result: Token-DiFR reliably identifies sampling errors, simulated bugs, and model quantization, detecting 4-bit quantization with AUC > 0.999 within 300 output tokens. Activation-DiFR detects 4-bit quantization with AUC > 0.999 using just 2 output tokens while reducing communication overhead by 25-75% relative to existing methods.

Conclusion: The proposed methods provide effective verification of LLM inference correctness at zero additional cost to providers, with open-source integration available for practical deployment.

Abstract: As demand for LLM inference grows, it is becoming increasingly important that providers and their customers can verify that inference processes are performed correctly, without errors or tampering. However, re-running the same inference process twice often leads to different results due to benign numerical noise, making it difficult to distinguish legitimate variation from actual problems. To address this problem, we introduce Token-DiFR (Token-Divergence-From-Reference), a method for verifying inference outputs by comparing generated tokens against predictions made by a trusted reference implementation conditioned on the same random seed. Sampling seed synchronization tightly constrains valid outputs, leaving providers minimal room to deviate from correct inference, which allows output tokens themselves to serve as auditable evidence of correctness at zero additional cost to the provider. Token-DiFR reliably identifies sampling errors, simulated bugs, and model quantization, detecting 4-bit quantization with AUC > 0.999 within 300 output tokens. For applications requiring sample-efficient forward-pass verification, we additionally introduce Activation-DiFR, a scheme that uses random orthogonal projections to compress activations into compact fingerprints for subsequent verification. Activation-DiFR detects 4-bit quantization with AUC > 0.999 using just 2 output tokens, while reducing communication overhead by 25-75% relative to existing methods. We release an open-source integration with vLLM to accelerate practical deployment of verifiable inference.
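The idea of compressing activations with a random orthogonal projection can be sketched generically as follows. This is not the paper's implementation: the QR-based construction, the fingerprint size, the seed handling, and the distance measure are all assumptions chosen to show the general mechanism.

```python
import numpy as np

def projection_matrix(dim, k, seed):
    """Return a (dim, k) matrix with orthonormal columns, derived from `seed`."""
    rng = np.random.default_rng(seed)
    gaussian = rng.standard_normal((dim, k))
    q, _ = np.linalg.qr(gaussian)  # reduced QR: q has orthonormal columns
    return q

def fingerprint(activation, proj):
    """Compress an activation vector of length dim into k numbers."""
    return np.asarray(activation, dtype=float) @ proj

def fingerprint_distance(fp_a, fp_b):
    """Euclidean distance between two fingerprints."""
    return float(np.linalg.norm(fp_a - fp_b))

# Hypothetical sizes: a 64-dimensional activation compressed to 8 values.
proj = projection_matrix(dim=64, k=8, seed=0)
reference_act = np.arange(64, dtype=float)
provider_act = reference_act + 1.0  # stand-in for a deviating forward pass
print(fingerprint_distance(fingerprint(reference_act, proj),
                           fingerprint(provider_act, proj)))
```

Because the columns of `proj` are orthonormal, the projection never increases Euclidean distance, so a verifier would compare the fingerprint distance against a threshold calibrated on benign numerical noise. How that threshold is chosen is not covered here.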

[542] How to Purchase Labels? A Cost-Effective Approach Using Active Learning Markets

Xiwen Huang, Pierre Pinson

Main category: cs.LG

TL;DR: Active learning markets for purchasing labels to improve model performance, with budget constraints and improvement thresholds, showing superior results with fewer labels compared to conventional methods.

Details

Motivation: To acquire additional data for improving model fitting and predictive analytics applications, addressing the gap in purchasing labels (as opposed to features/examples) in resource-constrained environments.

Method: Formalized market clearing as optimization problem with budget constraints and improvement thresholds. Used single-buyer-multiple-seller setup with two active learning strategies (variance-based and query-by-committee) paired with distinct pricing mechanisms, compared against random sampling benchmark.

Result: Validated on real-world datasets (real estate pricing and energy forecasting), demonstrating robustness and consistently achieving superior performance with fewer labels acquired compared to conventional methods.

Conclusion: Provides an easy-to-implement practical solution for optimizing data acquisition in resource-constrained environments through active learning markets for label purchasing.

Abstract: We introduce and analyse active learning markets as a way to purchase labels, in situations where analysts aim to acquire additional data to improve model fitting, or to better train models for predictive analytics applications. This comes in contrast to the many proposals that already exist to purchase features and examples. By originally formalising the market clearing as an optimisation problem, we integrate budget constraints and improvement thresholds into the label acquisition process. We focus on a single-buyer-multiple-seller setup and propose the use of two active learning strategies (variance based and query-by-committee based), paired with distinct pricing mechanisms. They are compared to a benchmark random sampling approach. The proposed strategies are validated on real-world datasets from two critical application domains: real estate pricing and energy forecasting. Results demonstrate the robustness of our approach, consistently achieving superior performance with fewer labels acquired compared to conventional methods. Our proposal comprises an easy-to-implement practical solution for optimising data acquisition in resource-constrained environments.
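To make the budget-constrained, variance-based idea concrete, the sketch below greedily buys the labels with the highest predictive variance per unit price until the budget runs out. This greedy heuristic is a stand-in chosen for brevity, not the paper's market-clearing optimisation, and the variances, prices, and function name are hypothetical.

```python
def select_labels(variances, prices, budget):
    """Greedy label purchase: highest variance-per-price first, within budget.

    Returns the chosen indices (in purchase order) and the total amount spent.
    """
    order = sorted(range(len(variances)),
                   key=lambda i: variances[i] / prices[i],
                   reverse=True)
    chosen, spent = [], 0.0
    for i in order:
        if spent + prices[i] <= budget:
            chosen.append(i)
            spent += prices[i]
    return chosen, spent

# Hypothetical predictive variances for four unlabeled points and seller prices.
variances = [4.0, 1.0, 3.0, 0.5]
prices = [2.0, 1.0, 1.0, 1.0]
chosen, spent = select_labels(variances, prices, budget=3.0)
print(chosen, spent)  # [2, 0] 3.0
```

A query-by-committee variant would replace the variance scores with a disagreement measure across several models, and an improvement threshold could be added as a stopping rule.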

[543] ROOT: Robust Orthogonalized Optimizer for Neural Network Training

Wei He, Kai Han, Hang Zhou, Hanting Chen, Zhicheng Liu, Xinghao Chen, Yunhe Wang

Main category: cs.LG

TL;DR: ROOT is a Robust Orthogonalized Optimizer that addresses robustness limitations in LLM optimization through dimension-robust orthogonalization and optimization-robust noise suppression, achieving faster convergence and superior performance in noisy scenarios.

Details

Motivation: Current optimizers for large language models suffer from dimensional fragility in orthogonalization precision and vulnerability to outlier-induced noise, which becomes more problematic as models scale up.

Method: ROOT uses dual robustness mechanisms: 1) dimension-robust orthogonalization with adaptive Newton iterations and fine-grained coefficients for different matrix sizes, 2) optimization-robust framework via proximal optimization to suppress outlier noise while preserving meaningful gradients.

Result: Extensive experiments show ROOT achieves significantly improved robustness, faster convergence, and superior final performance compared to Muon and Adam-based optimizers, especially in noisy and non-convex scenarios.

Conclusion: ROOT establishes a new paradigm for developing robust and precise optimizers capable of handling the complexities of modern large-scale model training.

Abstract: The optimization of large language models (LLMs) remains a critical challenge, particularly as model scaling exacerbates sensitivity to algorithmic imprecision and training instability. Recent advances in optimizers have improved convergence efficiency through momentum orthogonalization, but suffer from two key robustness limitations: dimensional fragility in orthogonalization precision and vulnerability to outlier-induced noise. To address these robustness challenges, we introduce ROOT, a Robust Orthogonalized Optimizer that enhances training stability through dual robustness mechanisms. First, we develop a dimension-robust orthogonalization scheme using adaptive Newton iterations with fine-grained coefficients tailored to specific matrix sizes, ensuring consistent precision across diverse architectural configurations. Second, we introduce an optimization-robust framework via proximal optimization that suppresses outlier noise while preserving meaningful gradient directions. Extensive experiments demonstrate that ROOT achieves significantly improved robustness, with faster convergence and superior final performance compared to both Muon and Adam-based optimizers, particularly in noisy and non-convex scenarios. Our work establishes a new paradigm for developing robust and precise optimizers capable of handling the complexities of modern large-scale model training. The code will be available at https://github.com/huawei-noah/noah-research/tree/master/ROOT.
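For context on what orthogonalization by Newton iterations looks like, here is the classical cubic Newton-Schulz iteration with fixed coefficients (1.5 and 0.5). It is not ROOT's scheme, which according to the abstract adapts fine-grained coefficients to matrix size; the matrix and iteration count below are illustrative assumptions.

```python
import numpy as np

def newton_schulz_orthogonalize(M, num_iters=20):
    """Approximate the orthogonal polar factor of M with the cubic iteration.

    Each step maps every singular value s to 1.5*s - 0.5*s**3, which drives
    values in (0, sqrt(3)) toward 1 while leaving singular vectors unchanged.
    """
    X = M / np.linalg.norm(M)  # Frobenius normalisation keeps singular values <= 1
    for _ in range(num_iters):
        X = 1.5 * X - 0.5 * X @ X.T @ X
    return X

# Hypothetical 3x2 momentum matrix.
M = np.array([[3.0, 1.0],
              [1.0, 2.0],
              [0.0, 1.0]])
Q = newton_schulz_orthogonalize(M)
print(np.round(Q.T @ Q, 6))  # close to the 2x2 identity
```

Convergence slows when some singular values are very small after normalisation, which is one reason a fixed-coefficient iteration can lose precision across matrix shapes.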

[544] Adaptive Hopfield Network: Rethinking Similarities in Associative Memory

Shurong Wang, Yuqi Pan, Zhuoyang Shen, Meng Zhang, Hongwei Wang, Guoqi Li

Main category: cs.LG

TL;DR: The paper proposes adaptive similarity for associative memory models to ensure correct retrieval by learning context-dependent generative processes, achieving state-of-the-art performance across multiple tasks.

Details

Motivation: Existing associative memory models evaluate retrieval based on proximity, which cannot guarantee that the retrieved pattern has the strongest association with the query, failing correctness.

Method: Develop adaptive similarity mechanism that learns to approximate the likelihood of each stored pattern generating the query according to a variant distribution, integrated into an adaptive Hopfield network (A-Hop).

Result: Theoretically proven optimal correct retrieval under three canonical variant types (noisy, masked, biased), and empirical results show state-of-the-art performance in memory retrieval, tabular classification, image classification, and multiple instance learning.

Conclusion: Adaptive similarity enables correct retrieval in associative memory by learning context-dependent generative processes, providing a more reliable and effective approach than fixed similarity measures.

Abstract: Associative memory models are content-addressable memory systems fundamental to biological intelligence and are notable for their high interpretability. However, existing models evaluate the quality of retrieval based on proximity, which cannot guarantee that the retrieved pattern has the strongest association with the query, failing correctness. We reframe this problem by proposing that a query is a generative variant of a stored memory pattern, and define a variant distribution to model this subtle context-dependent generative process. Consequently, correct retrieval should return the memory pattern with the maximum a posteriori probability of being the query’s origin. This perspective reveals that an ideal similarity measure should approximate the likelihood of each stored pattern generating the query in accordance with the variant distribution, which is impossible for the fixed and pre-defined similarities used by existing associative memories. To this end, we develop adaptive similarity, a novel mechanism that learns to approximate this insightful but unknown likelihood from samples drawn from context, aiming for correct retrieval. We theoretically prove that our proposed adaptive similarity achieves optimal correct retrieval under three canonical and widely applicable types of variants: noisy, masked, and biased. We integrate this mechanism into a novel adaptive Hopfield network (A-Hop), and empirical results show that it achieves state-of-the-art performance across diverse tasks, including memory retrieval, tabular classification, image classification, and multiple instance learning.
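A toy example may help show why a fixed proximity measure can fail for a masked variant. Assuming a uniform prior over stored patterns and a likelihood that depends only on the unmasked coordinates, the maximum a posteriori pattern is the nearest one on those coordinates. Here the mask is given by hand, whereas the paper learns the similarity from context; the patterns and function names are hypothetical.

```python
import numpy as np

def retrieve_euclidean(query, patterns):
    """Fixed similarity: index of the pattern nearest to the query on all coordinates."""
    distances = np.sum((patterns - query) ** 2, axis=1)
    return int(np.argmin(distances))

def retrieve_masked_map(query, patterns, observed):
    """Mask-aware retrieval: compare only the coordinates that were not masked."""
    distances = np.sum((patterns[:, observed] - query[observed]) ** 2, axis=1)
    return int(np.argmin(distances))

patterns = np.array([[1.0, 1.0, 5.0, 5.0],
                     [2.0, 2.0, 0.0, 0.0]])
observed = np.array([True, True, False, False])
# The query is pattern 0 with its last two coordinates masked to zero.
query = np.where(observed, patterns[0], 0.0)

print(retrieve_euclidean(query, patterns))             # 1 (wrong origin)
print(retrieve_masked_map(query, patterns, observed))  # 0 (true origin)
```

The zeros introduced by masking pull the query toward pattern 1 under plain Euclidean distance, even though pattern 0 generated it.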

[545] Sparse-to-Field Reconstruction via Stochastic Neural Dynamic Mode Decomposition

Yujin Kim, Sarah Dean

Main category: cs.LG

TL;DR: Stochastic NODE-DMD is a probabilistic extension of Dynamic Mode Decomposition that models continuous-time nonlinear dynamics with uncertainty quantification, enabling accurate reconstruction from sparse observations and preserving ensemble variability.

Details

Motivation: Address limitations of traditional DMD including sparse/noisy observations, reliance on linear approximations, and lack of principled uncertainty quantification for modeling real-world dynamic systems like wind fields and ocean currents.

Method: Probabilistic extension of DMD that models continuous-time nonlinear dynamics while maintaining interpretability, enabling continuous spatiotemporal reconstruction at arbitrary coordinates with uncertainty quantification.

Result: Surpasses baseline in reconstruction accuracy when trained from only 10% observation density across four benchmarks; recovers dynamical structure by aligning learned modes and eigenvalues with ground truth; learns calibrated distribution over latent dynamics that preserves ensemble variability.

Conclusion: Stochastic NODE-DMD provides an effective probabilistic framework for learning continuous-time nonlinear dynamics with uncertainty quantification, outperforming traditional methods in sparse observation scenarios while maintaining interpretability.

Abstract: Many consequential real-world systems, like wind fields and ocean currents, are dynamic and hard to model. Learning their governing dynamics remains a central challenge in scientific machine learning. Dynamic Mode Decomposition (DMD) provides a simple, data-driven approximation, but practical use is limited by sparse/noisy observations from continuous fields, reliance on linear approximations, and the lack of principled uncertainty quantification. To address these issues, we introduce Stochastic NODE-DMD, a probabilistic extension of DMD that models continuous-time, nonlinear dynamics while remaining interpretable. Our approach enables continuous spatiotemporal reconstruction at arbitrary coordinates and quantifies predictive uncertainty. Across four benchmarks, a synthetic setting and three physics-based flows, it surpasses a baseline in reconstruction accuracy when trained from only 10% observation density. It further recovers the dynamical structure by aligning learned modes and continuous-time eigenvalues with ground truth. Finally, on datasets with multiple realizations, our method learns a calibrated distribution over latent dynamics that preserves ensemble variability rather than averaging across regimes. Our code is available at: https://github.com/sedan-group/Stochastic-NODE-DMD
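For readers unfamiliar with the baseline technique, the following is a minimal sketch of standard (exact) DMD on fully observed snapshots. It is the classical linear method that the paper extends, not Stochastic NODE-DMD itself; the toy system and function name are assumptions.

```python
import numpy as np

def exact_dmd(snapshots, rank):
    """Standard exact DMD.

    snapshots: array of shape (state_dim, num_times), one column per time step.
    Returns discrete-time eigenvalues and the corresponding DMD modes.
    """
    X = snapshots[:, :-1]  # states at times 0 .. T-2
    Y = snapshots[:, 1:]   # states at times 1 .. T-1
    U, s, Vh = np.linalg.svd(X, full_matrices=False)
    U_r = U[:, :rank]
    s_r = s[:rank]
    V_r = Vh[:rank].conj().T
    # Reduced operator U^H Y V S^{-1}; dividing by s_r scales each column.
    A_tilde = (U_r.conj().T @ Y @ V_r) / s_r
    eigvals, W = np.linalg.eig(A_tilde)
    modes = ((Y @ V_r) / s_r) @ W
    return eigvals, modes

# Toy linear system x_{k+1} = A x_k with known eigenvalues 0.9 and 0.5.
A_true = np.diag([0.9, 0.5])
states = [np.array([1.0, 1.0])]
for _ in range(5):
    states.append(A_true @ states[-1])
snapshots = np.stack(states, axis=1)

eigvals, modes = exact_dmd(snapshots, rank=2)
print(np.sort(eigvals.real))
```

This version needs dense, noise-free snapshots on a fixed grid and yields a single point estimate, which are the limitations the abstract lists as motivation.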

[546] Image2Gcode: Image-to-G-code Generation for Additive Manufacturing Using Diffusion-Transformer Model

Ziyue Wang, Yayati Jadhav, Peter Pak, Amir Barati Farimani

Main category: cs.LG

TL;DR: Image2Gcode is an end-to-end framework that generates printer-ready G-code directly from 2D images, bypassing traditional CAD modeling to accelerate additive manufacturing workflows.

Details

Motivation: Traditional CAD modeling is slow and creates bottlenecks in rapid prototyping, requiring manual updates for minor design variations and making iteration time-consuming.

Method: Uses a denoising diffusion probabilistic model (DDPM) over G-code sequences, extracting structural cues from 2D images and transforming Gaussian noise into executable print-move trajectories with extrusion parameters.

Result: Eliminates need for CAD or STL intermediates, enables direct mapping from visual input to native toolpaths, and supports on-demand prototyping from sketches or visual references.

Conclusion: Provides a flexible, computationally efficient framework that advances accessibility in design iteration, repair workflows, and distributed manufacturing by accelerating the design-to-fabrication cycle.

Abstract: Mechanical design and manufacturing workflows conventionally begin with conceptual design, followed by the creation of a computer-aided design (CAD) model and fabrication through material-extrusion (MEX) printing. This process requires converting CAD geometry into machine-readable G-code through slicing and path planning. While each step is well established, dependence on CAD modeling remains a major bottleneck: constructing object-specific 3D geometry is slow and poorly suited to rapid prototyping. Even minor design variations typically necessitate manual updates in CAD software, making iteration time-consuming and difficult to scale. To address this limitation, we introduce Image2Gcode, an end-to-end data-driven framework that bypasses the CAD stage and generates printer-ready G-code directly from images and part drawings. Instead of relying on an explicit 3D model, a hand-drawn or captured 2D image serves as the sole input. The framework first extracts slice-wise structural cues from the image and then employs a denoising diffusion probabilistic model (DDPM) over G-code sequences. Through iterative denoising, the model transforms Gaussian noise into executable print-move trajectories with corresponding extrusion parameters, establishing a direct mapping from visual input to native toolpaths. By producing structured G-code directly from 2D imagery, Image2Gcode eliminates the need for CAD or STL intermediates, lowering the entry barrier for additive manufacturing and accelerating the design-to-fabrication cycle. This approach supports on-demand prototyping from simple sketches or visual references and integrates with upstream 2D-to-3D reconstruction modules to enable an automated pipeline from concept to physical artifact. The result is a flexible, computationally efficient framework that advances accessibility in design iteration, repair workflows, and distributed manufacturing.
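The diffusion component can be illustrated with the standard DDPM forward (noising) process applied to a toy toolpath. The (x, y, extrusion) row encoding, the linear beta schedule, and the step count are assumptions for illustration; the abstract does not describe how Image2Gcode encodes G-code sequences, and the learned image-conditioned denoiser is not shown.

```python
import numpy as np

def make_alpha_bar(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative product of (1 - beta_t) for a linear beta schedule."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    return np.cumprod(1.0 - betas)

def forward_noise(x0, t, alpha_bar, rng):
    """Sample x_t ~ q(x_t | x_0) = N(sqrt(alpha_bar_t) * x0, (1 - alpha_bar_t) * I)."""
    noise = rng.standard_normal(x0.shape)
    x_t = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * noise
    return x_t, noise

# Hypothetical toolpath: each row is (x, y, extrusion amount) for one print move.
toolpath = np.array([[0.0, 0.0, 0.0],
                     [1.0, 0.0, 0.1],
                     [1.0, 1.0, 0.2],
                     [0.0, 1.0, 0.3]])
alpha_bar = make_alpha_bar()
rng = np.random.default_rng(0)
noisy, noise = forward_noise(toolpath, t=999, alpha_bar=alpha_bar, rng=rng)
print(alpha_bar[0], alpha_bar[-1])
```

At the final step almost no signal from the original toolpath remains, so generation starts from Gaussian noise and a trained network reverses the process step by step.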

[547] MGAS: Multi-Granularity Architecture Search for Trade-Off Between Model Effectiveness and Efficiency

Xiaoyun Liu, Divya Saxena, Jiannong Cao, Yuqing Zhao, Penghui Ruan

Main category: cs.LG

TL;DR: MG-DARTS is a multi-granularity differentiable architecture search framework that explores operation-level, filter-level, and weight-level patterns to find efficient neural architectures while maintaining search quality and memory efficiency.

Details

Motivation: Existing differentiable architecture search methods focus only on coarse-grained operation-level topology and neglect finer-grained structures, limiting their ability to balance performance with model size. Many also compromise search quality for memory efficiency.

Method: Proposes adaptive pruning with granularity-specific discretization functions and decomposes super-net optimization into multiple stages with progressive re-evaluation, enabling re-pruning and regrowth of units to mitigate bias.

Result: Extensive experiments on CIFAR-10, CIFAR-100 and ImageNet show MG-DARTS achieves better trade-off between model accuracy and parameter efficiency compared to state-of-the-art methods.

Conclusion: MG-DARTS provides a unified framework for comprehensive yet memory-efficient exploration of multi-granularity search spaces, outperforming existing methods in finding effective and efficient neural architectures.

Abstract: Neural architecture search (NAS) has gained significant traction in automating the design of neural networks. To reduce search time, differentiable architecture search (DAS) reframes the traditional paradigm of discrete candidate sampling and evaluation into a differentiable optimization over a super-net, followed by discretization. However, most existing DAS methods primarily focus on optimizing the coarse-grained operation-level topology, while neglecting finer-grained structures such as filter-level and weight-level patterns. This limits their ability to balance model performance with model size. Additionally, many methods compromise search quality to save memory during the search process. To tackle these issues, we propose Multi-Granularity Differentiable Architecture Search (MG-DARTS), a unified framework which aims to discover both effective and efficient architectures from scratch by comprehensively yet memory-efficiently exploring a multi-granularity search space. Specifically, we improve the existing DAS methods in two aspects. First, we adaptively adjust the retention ratios of searchable units across different granularity levels through adaptive pruning, which is achieved by learning granularity-specific discretization functions along with the evolving architecture. Second, we decompose the super-net optimization and discretization into multiple stages, each operating on a sub-net, and introduce progressive re-evaluation to enable re-pruning and regrowth of previous units, thereby mitigating potential bias. Extensive experiments on CIFAR-10, CIFAR-100 and ImageNet demonstrate that MG-DARTS outperforms other state-of-the-art methods in achieving a better trade-off between model accuracy and parameter efficiency. Codes are available at https://github.com/lxy12357/MG_DARTS.

[548] Multiple-Input Auto-Encoder Guided Feature Selection for IoT Intrusion Detection Systems

Phai Vu Dinh, Diep N. Nguyen, Dinh Thai Hoang, Quang Uy Nguyen, Eryk Dutkiewicz, Son Pham Bao

Main category: cs.LG

TL;DR: Proposes MIAE and MIAEFS neural networks for IoT intrusion detection, using multiple sub-encoders to handle heterogeneous data and feature selection to remove redundancy, achieving high accuracy with small model size.

Details

Motivation: IoT IDSs face challenges with data diversity, heterogeneity, and high dimensions that make training effective ML models difficult and introduce redundant/noisy features that decrease detection accuracy.

Method: Introduces Multiple-Input Auto-Encoder (MIAE) with multiple sub-encoders for different data sources, and MIAEFS with embedded feature selection layer to retain relevant features while removing redundancy through unsupervised learning.

Result: Achieves 96.5% accuracy detecting sophisticated attacks like Slowloris, with average detection time of 1.7E-6 seconds and model size under 1 MB, outperforming conventional classifiers and other methods on NSLKDD, UNSW-NB15, and IDS2017 datasets.

Conclusion: MIAE and MIAEFS effectively handle IoT data heterogeneity and feature redundancy, providing efficient and accurate intrusion detection with small computational footprint.

Abstract: While intrusion detection systems (IDSs) benefit from the diversity and generalization of IoT data features, the data diversity (e.g., the heterogeneity and high dimensions of data) also makes it difficult to train effective machine learning models in IoT IDSs. This also leads to potentially redundant/noisy features that may decrease the accuracy of the detection engine in IDSs. This paper first introduces a novel neural network architecture called Multiple-Input Auto-Encoder (MIAE). MIAE consists of multiple sub-encoders that can process inputs from different sources with different characteristics. The MIAE model is trained in an unsupervised learning mode to transform the heterogeneous inputs into a lower-dimensional representation, which helps classifiers distinguish between normal behaviour and different types of attacks. To distil and retain more relevant features but remove less important/redundant ones during the training process, we further design and embed a feature selection layer right after the representation layer of MIAE, resulting in a new model called MIAEFS. This layer learns the importance of features in the representation vector, facilitating the selection of informative features from the representation vector. The results on three IDS datasets, i.e., NSLKDD, UNSW-NB15, and IDS2017, show the superior performance of MIAE and MIAEFS compared to other methods, e.g., conventional classifiers, dimensionality reduction models, unsupervised representation learning methods with different input dimensions, and unsupervised feature selection models. Moreover, MIAE and MIAEFS combined with the Random Forest (RF) classifier achieve accuracy of 96.5% in detecting sophisticated attacks, e.g., Slowloris. The average running time for detecting an attack sample using RF with the representation of MIAE and MIAEFS is approximately 1.7E-6 seconds, whilst the model size is lower than 1 MB.

[549] A Survey on Diffusion Models for Time Series and Spatio-Temporal Data

Yiyuan Yang, Ming Jin, Haomin Wen, Chaoli Zhang, Yuxuan Liang, Lintao Ma, Yi Wang, Chenghao Liu, Bin Yang, Zenglin Xu, Shirui Pan, Qingsong Wen

Main category: cs.LG

TL;DR: This paper provides a comprehensive survey of diffusion models applied to time series and spatio-temporal data across various domains including healthcare, recommendation, climate, energy, audio, and traffic.

Details

Motivation: To offer a structured perspective on diffusion model applications in time series and spatio-temporal data, addressing the need for a solid foundation for researchers and practitioners in this growing field.

Method: The study categorizes and analyzes diffusion model applications by separating time series and spatio-temporal data, examining model categories, task types, data modalities, and practical application domains.

Result: A comprehensive survey that structures the landscape of diffusion model applications in time series and spatio-temporal data mining, with an open-sourced repository for detailed information.

Conclusion: This work provides foundational knowledge to inspire future innovations in diffusion model-based data mining, helping tackle traditional challenges and foster novel solutions in various application domains.

Abstract: Diffusion models have been widely used in time series and spatio-temporal data, enhancing generative, inferential, and downstream capabilities. These models are applied across diverse fields such as healthcare, recommendation, climate, energy, audio, and traffic. By separating applications for time series and spatio-temporal data, we offer a structured perspective on model category, task type, data modality, and practical application domain. This study aims to provide a solid foundation for researchers and practitioners, inspiring future innovations that tackle traditional challenges and foster novel solutions in diffusion model-based data mining tasks and applications. For more detailed information, we have open-sourced a repository at https://github.com/yyysjz1997/Awesome-TimeSeries-SpatioTemporal-Diffusion-Model.

[550] Value Improved Actor Critic Algorithms

Yaniv Oren, Moritz A. Zanger, Pascal R. van der Vaart, Mustafa Mert Celikok, Matthijs T. J. Spaan, Wendelin Bohmer

Main category: cs.LG

TL;DR: Proposes decoupling acting policy from critic’s policy to allow greedier updates for critic while maintaining stable gradient-based updates for actor, improving performance in continuous control tasks.

Details

Motivation: Address the tradeoff between greedification (fast policy improvement) and stability in actor-critic algorithms by separating the policies used for acting and evaluation.

Method: Decouple the acting policy from the critic’s policy, allowing greedier value-improvement updates for the critic while maintaining slow gradient-based updates for the parameterized acting policy.

Result: Significantly improves or matches performance of TD3 and SAC algorithms across DeepMind continuous control environments with negligible compute and implementation cost.

Conclusion: Separating acting and critic policies enables better tradeoff between greedification and stability, leading to improved performance in continuous control tasks.

Abstract: To learn approximately optimal acting policies for decision problems, modern Actor Critic algorithms rely on deep Neural Networks (DNNs) to parameterize the acting policy and greedification operators to iteratively improve it. The reliance on DNNs suggests an improvement that is gradient based, which is per step much less greedy than the improvement possible by greedier operators such as the greedy update used by Q-learning algorithms. On the other hand, slow changes to the policy can also be beneficial for the stability of the learning process, resulting in a tradeoff between greedification and stability. To better address this tradeoff, we propose to decouple the acting policy from the policy evaluated by the critic. This allows the agent to separately improve the critic’s policy (e.g. value improvement) with greedier updates while maintaining the slow gradient-based improvement to the parameterized acting policy. We investigate the convergence of this approach using the popular analysis scheme of generalized Policy Iteration in the finite-horizon domain. Empirically, incorporating value-improvement into the popular off-policy actor-critic algorithms TD3 and SAC significantly improves or matches performance over their respective baselines, across different environments from the DeepMind continuous control domain, with negligible compute and implementation cost.

[551] Demystifying Higher-Order Graph Neural Networks

Maciej Besta, Florian Scheidl, Lukas Gianinazzi, Grzegorz Kwasniewski, Shachar Klaiman, Jürgen Müller, Torsten Hoefler

Main category: cs.LG

TL;DR: This paper provides a comprehensive taxonomy and blueprint for higher-order graph neural networks (HOGNNs) to address the challenge of analyzing and comparing diverse HOGNN models, and offers guidance for model selection and future research directions.

Details

Motivation: The proliferation of diverse HOGNN models with different architectures and interpretations of "higher-order" makes it challenging to analyze, compare, and select appropriate models for specific scenarios.

Method: The authors design an in-depth taxonomy and blueprint for HOGNNs, then use this framework to analyze and compare available HOGNN models.

Result: The analysis produces insights for selecting optimal GNN models in given scenarios and identifies comprehensive challenges and opportunities for future HOGNN research.

Conclusion: The proposed taxonomy and blueprint facilitate better understanding, comparison, and selection of HOGNN models, while providing clear directions for developing more powerful higher-order graph neural networks.

Abstract: Higher-order graph neural networks (HOGNNs) and the related architectures from Topological Deep Learning are an important class of GNN models that harness polyadic relations between vertices beyond plain edges. They have been used to eliminate issues such as over-smoothing or over-squashing, to significantly enhance the accuracy of GNN predictions, to improve the expressiveness of GNN architectures, and for numerous other goals. A plethora of HOGNN models have been introduced, and they come with diverse neural architectures, and even with different notions of what “higher-order” means. This richness makes it very challenging to appropriately analyze and compare HOGNN models, and to decide in what scenario to use specific ones. To alleviate this, we first design an in-depth taxonomy and a blueprint for HOGNNs. This facilitates designing models that maximize performance. Then, we use our taxonomy to analyze and compare the available HOGNN models. The outcomes of our analysis are synthesized in a set of insights that help to select the most beneficial GNN model in a given scenario, and a comprehensive list of challenges and opportunities for further research into more powerful HOGNNs.

[552] Searching Latent Program Spaces

Matthew V Macfarlane, Clement Bonnet

Main category: cs.LG

TL;DR: LPN combines neural networks with program synthesis by learning a latent program space that enables gradient-based test-time search, improving adaptation to novel tasks without requiring domain-specific languages.

Details

Motivation: Bridge the gap between program synthesis (strong generalization but scaling issues) and deep learning (scalable but poor test-time adaptation) by creating a system that can efficiently adapt to new tasks at test time.

Method: Latent Program Network (LPN) learns a latent space of implicit programs through neural mapping, enabling gradient-based search at test time without pre-defined DSLs.

Result: LPN outperforms or matches in-context learning and test-time training methods on programming-by-examples tasks, and doubles performance on out-of-distribution ARC-AGI tasks when test-time search is enabled.

Conclusion: LPN successfully combines symbolic adaptability with neural scalability, demonstrating effective test-time adaptation through latent program search without domain-specific language requirements.

Abstract: General intelligence requires systems that acquire new skills efficiently and generalize beyond their training distributions. Although program synthesis approaches have strong generalization power, they face scaling issues due to the large combinatorial spaces that quickly render them impractical, requiring human-generated DSLs or pre-trained priors to narrow this search space. On the other hand, deep learning methods have been highly successful, but they lack structured test-time adaptation and rely on heavy stochastic sampling or expensive gradient updates for fine-tuning. In this work, we propose the Latent Program Network (LPN), a novel architecture that builds test-time search directly into neural models. LPN learns a latent space of implicit programs – neurally mapping inputs to outputs – through which it can search using gradients at test time. LPN combines the adaptability of symbolic approaches and the scalability of neural methods. It searches through a compact latent space at test time and bypasses the need for pre-defined domain-specific languages. On a range of programming-by-examples tasks, LPN either outperforms or matches performance compared to in-context learning and test-time training methods. Tested on the ARC-AGI benchmark, we demonstrate that LPN can both learn a compact program space and search through it at test time to adapt to novel tasks. LPN doubles its performance on out-of-distribution tasks when test-time search is switched on.
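The idea of searching a latent program space with gradients at test time can be illustrated with a toy sketch. It is not the LPN architecture: the one-dimensional latent `z`, the linear stand-in decoder `decode`, and the names `latent_search`, `lr`, and `steps` are all assumptions made for illustration, whereas the paper uses learned neural encoders and decoders.

```python
def decode(z, x):
    """Stand-in decoder (assumption): the latent z acts as an implicit program applied to x."""
    return z * x


def latent_search(pairs, z_init=0.0, lr=0.05, steps=200):
    """Refine z by gradient descent on the mean squared error over the given I/O pairs."""
    z = z_init
    for _ in range(steps):
        # d/dz of mean((z*x - y)^2) is mean(2 * (z*x - y) * x).
        grad = sum(2.0 * (decode(z, x) - y) * x for x, y in pairs) / len(pairs)
        z -= lr * grad
    return z


# A novel "task" specified only by examples: the hidden program multiplies by 3.
pairs = [(1.0, 3.0), (2.0, 6.0), (3.0, 9.0)]
z_star = latent_search(pairs)
prediction = decode(z_star, 4.0)  # apply the recovered latent program to a new input
```

Only the latent is updated during the search; the decoder stays fixed, which is what distinguishes this kind of adaptation from fine-tuning the model weights.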

[553] Domain Fusion Controllable Generalization for Cross-Domain Time Series Forecasting from Multi-Domain Integrated Distribution

Xiangkai Ma, Xiaobin Hong, Mingkai Lin, Han Zhang, Wenzhong Li, Sanglu Lu

Main category: cs.LG

TL;DR: TimeControl is a novel time series forecasting model that uses diffusion models for cross-domain generalization, achieving superior performance through domain fusion and flexible sequence generation.

Details

Motivation: Existing time series forecasting models struggle with cross-domain generalization under large distribution shifts, and statistical prior-based prompt engineering fails in such scenarios.

Method: Proposes a diffusion-based model with three key designs: (1) condition network for multi-scale pattern capture, (2) adapter-based fine-tuning for domain adaptation, and (3) hybrid architecture for flexible sequence generation of arbitrary lengths.

Result: Outperforms 30 baselines on 49 benchmarks across all data domains, demonstrating superior zero-shot generalization ability.

Conclusion: TimeControl successfully addresses cross-domain time series forecasting challenges through diffusion modeling and domain fusion, establishing new state-of-the-art performance.

Abstract: Conventional deep models have achieved unprecedented success in time series forecasting. However, when facing the challenge of cross-domain generalization, existing approaches that use statistical priors for prompt engineering fail under the huge distribution shift among various domains. In this paper, we propose a novel time series generalization diffusion model (TimeControl) that pioneers the Domain-Fusion paradigm, systematically integrating information from multiple time series domains into a unified generative process via diffusion models. Unlike autoregressive models that capture the conditional probabilities of the prediction horizon given the historical sequence, we use the diffusion denoising process to model the mixed distribution of the cross-domain data and generate the prediction sequence for the target domain directly via conditional sampling. The proposed TimeControl contains three pivotal designs: (1) a condition network captures the multi-scale fluctuation patterns from the observation sequence, which are utilized as context representations to guide the denoising network to generate the prediction sequence; (2) an adapter-based fine-tuning strategy utilizes the multi-domain universal representation learned in the pretraining stage for downstream tasks in target domains; (3) a novel hybrid architecture aligns the observation and prediction spaces, enabling TimeControl to flexibly generate prediction sequences of arbitrary lengths. We conduct extensive experiments on 49 mainstream benchmarks against 30 baselines, and TimeControl outperforms existing baselines on all data domains, exhibiting superior zero-shot generalization ability.

[554] GRAM: Generalization in Deep RL with a Robust Adaptation Module

James Queeney, Xiaoyi Cai, Alexander Schperberg, Radu Corcodel, Mouhacine Benosman, Jonathan P. How

Main category: cs.LG

TL;DR: GRAM framework unifies in-distribution and out-of-distribution generalization in deep reinforcement learning through a robust adaptation module and joint training pipeline.

Details

Motivation: Reliable deployment of deep RL in real-world settings requires generalization across both in-distribution and novel out-of-distribution scenarios.

Method: Introduces a robust adaptation module for identifying and reacting to different environment dynamics, with a joint training pipeline combining in-distribution adaptation and out-of-distribution robustness.

Result: GRAM achieves strong generalization performance across both in-distribution and out-of-distribution scenarios, demonstrated through simulation and hardware locomotion experiments on a quadruped robot.

Conclusion: The framework successfully unifies both types of generalization within a single architecture for reliable real-world deployment.

Abstract: The reliable deployment of deep reinforcement learning in real-world settings requires the ability to generalize across a variety of conditions, including both in-distribution scenarios seen during training as well as novel out-of-distribution scenarios. In this work, we present a framework for dynamics generalization in deep reinforcement learning that unifies these two distinct types of generalization within a single architecture. We introduce a robust adaptation module that provides a mechanism for identifying and reacting to both in-distribution and out-of-distribution environment dynamics, along with a joint training pipeline that combines the goals of in-distribution adaptation and out-of-distribution robustness. Our algorithm GRAM achieves strong generalization performance across in-distribution and out-of-distribution scenarios upon deployment, which we demonstrate through extensive simulation and hardware locomotion experiments on a quadruped robot.

[555] Graph Kernel Neural Networks

Luca Cosmo, Giorgia Minello, Alessandro Bicciato, Michael Bronstein, Emanuele Rodolà, Luca Rossi, Andrea Torsello

Main category: cs.LG

TL;DR: The paper proposes using graph kernels to extend convolution operators to graphs, creating a structural model that doesn’t require graph embeddings and provides interpretable structural masks.

Details

Motivation: Standard convolution operators work well on Euclidean data like images but are challenging to apply to graphs due to their irregular structure. The authors aim to extend convolution to graph domains.

Method: Use graph kernels (kernel functions computing inner products on graphs) to define convolution operators for graphs. The architecture can use any graph kernel type and learns interpretable structural masks during training.

Result: Extensive ablation studies show the model achieves competitive performance on standard graph classification and regression datasets. The model provides interpretability through learned structural masks.

Conclusion: Graph kernels successfully extend convolution to graphs, creating effective structural models with interpretability benefits similar to traditional CNNs, while achieving competitive performance on graph tasks.

Abstract: The convolution operator at the core of many modern neural architectures can effectively be seen as performing a dot product between an input matrix and a filter. While this is readily applicable to data such as images, which can be represented as regular grids in the Euclidean space, extending the convolution operator to work on graphs proves more challenging, due to their irregular structure. In this paper, we propose to use graph kernels, i.e. kernel functions that compute an inner product on graphs, to extend the standard convolution operator to the graph domain. This allows us to define an entirely structural model that does not require computing the embedding of the input graph. Our architecture allows plugging in any type of graph kernel and has the added benefit of providing some interpretability in terms of the structural masks that are learned during the training process, similarly to what happens for convolutional masks in traditional convolutional neural networks. We perform an extensive ablation study to investigate the model hyper-parameters’ impact and show that our model achieves competitive performance on standard graph classification and regression datasets.

[556] Fast, Sample-Efficient, Affine-Invariant Private Mean and Covariance Estimation for Subgaussian Distributions

Gavin Brown, Samuel B. Hopkins, Adam Smith

Main category: cs.LG

TL;DR: Fast differentially private algorithm for high-dimensional covariance-aware mean estimation with nearly optimal sample complexity, improving on previous exponential-time methods.

Details

Motivation: Previous differentially private estimators for covariance-aware mean estimation required exponential time, creating a need for efficient algorithms that maintain privacy guarantees while achieving optimal sample complexity.

Method: Adapts an exponential-time approach by Brown et al. (2021) into efficient variants of stable mean and covariance estimation subroutines, running in time Õ(nd^{ω-1} + nd/ε) where ω<2.38 is the matrix multiplication exponent.

Result: Achieves nearly optimal sample complexity n ≳ d/α² + d√(log1/δ)/(αε) + d(log1/δ)/ε for Mahalanobis error ∥μ-μ̃∥_Σ ≤ α, and provides the first algorithm for private covariance estimation using n=o(d²) samples.

Conclusion: Presents the first efficient differentially private algorithm for covariance-aware mean estimation with nearly optimal sample complexity, solving open problems in private covariance estimation and Gaussian distribution learning.

Abstract: We present a fast, differentially private algorithm for high-dimensional covariance-aware mean estimation with nearly optimal sample complexity. Only exponential-time estimators were previously known to achieve this guarantee. Given $n$ samples from a (sub-)Gaussian distribution with unknown mean $\mu$ and covariance $\Sigma$, our $(\varepsilon,\delta)$-differentially private estimator produces $\tilde{\mu}$ such that $\|\mu - \tilde{\mu}\|_{\Sigma} \leq \alpha$ as long as $n \gtrsim \tfrac{d}{\alpha^2} + \tfrac{d \sqrt{\log 1/\delta}}{\alpha\varepsilon} + \tfrac{d \log 1/\delta}{\varepsilon}$. The Mahalanobis error metric $\|\mu - \hat{\mu}\|_{\Sigma}$ measures the distance between $\hat{\mu}$ and $\mu$ relative to $\Sigma$; it characterizes the error of the sample mean. Our algorithm runs in time $\tilde{O}(nd^{\omega - 1} + nd/\varepsilon)$, where $\omega < 2.38$ is the matrix multiplication exponent. We adapt an exponential-time approach of Brown, Gaboardi, Smith, Ullman, and Zakynthinou (2021), giving efficient variants of stable mean and covariance estimation subroutines that also improve the sample complexity to the nearly optimal bound above. Our stable covariance estimator can be turned to private covariance estimation for unrestricted subgaussian distributions. With $n \gtrsim d^{3/2}$ samples, our estimate is accurate in spectral norm. This is the first such algorithm using $n = o(d^2)$ samples, answering an open question posed by Alabi et al. (2022). With $n \gtrsim d^2$ samples, our estimate is accurate in Frobenius norm. This leads to a fast, nearly optimal algorithm for private learning of unrestricted Gaussian distributions in TV distance. Duchi, Haque, and Kuditipudi (2023) obtained similar results independently and concurrently.
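To make the Mahalanobis error metric concrete, the following minimal sketch evaluates the norm of the difference between a true and an estimated mean, weighted by the inverse covariance, in two dimensions. It illustrates the metric only, not the private estimator; the function name `mahalanobis_2d` and the example numbers are assumptions chosen for illustration.

```python
import math


def mahalanobis_2d(mu, mu_hat, sigma):
    """Return sqrt(d^T Sigma^{-1} d) with d = mu - mu_hat, for a 2x2 covariance Sigma."""
    (a, b), (c, d) = sigma
    det = a * d - b * c
    if a <= 0 or det <= 0:
        raise ValueError("sigma must be positive definite")
    # Closed-form inverse of a 2x2 matrix.
    inv = [[d / det, -b / det], [-c / det, a / det]]
    dx, dy = mu[0] - mu_hat[0], mu[1] - mu_hat[1]
    quad = dx * (inv[0][0] * dx + inv[0][1] * dy) + dy * (inv[1][0] * dx + inv[1][1] * dy)
    return math.sqrt(quad)


sigma = [[4.0, 0.0], [0.0, 1.0]]  # variance 4 along x, variance 1 along y
err_wide = mahalanobis_2d((0.0, 0.0), (2.0, 0.0), sigma)    # error along the high-variance axis
err_narrow = mahalanobis_2d((0.0, 0.0), (0.0, 2.0), sigma)  # same Euclidean error, low-variance axis
```

Both estimates are at Euclidean distance 2 from the true mean, yet the error along the high-variance axis evaluates to 1 and the error along the low-variance axis to 2, which is the sense in which the metric is covariance-aware.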

[557] Categorical Flow Matching on Statistical Manifolds

Chaoran Cheng, Jiahan Li, Jian Peng, Ge Liu

Main category: cs.LG

TL;DR: SFM is a novel flow-matching framework on statistical manifolds that uses information geometry to enable exact likelihood calculation and achieves superior performance in discrete generation tasks.

Details

Motivation: Previous discrete generative models lack exploration of geometric properties on statistical manifolds and rely on variational bounds for likelihood estimation, limiting their effectiveness.

Method: SFM instantiates flow matching on categorical distribution manifolds using Fisher information metric, following geodesic shortest paths with efficient training/sampling via diffeomorphism between manifolds.

Result: SFM learns complex patterns on statistical manifolds where existing models fail, achieving higher sampling quality and likelihood than other discrete diffusion or flow-based models across image, text, and biological domains.

Conclusion: The geometric perspective of statistical manifolds enables exact likelihood calculation and superior generative performance, establishing SFM as an effective framework for discrete generation tasks.

Abstract: We introduce Statistical Flow Matching (SFM), a novel and mathematically rigorous flow-matching framework on the manifold of parameterized probability measures inspired by results from information geometry. We demonstrate the effectiveness of our method on the discrete generation problem by instantiating SFM on the manifold of categorical distributions, whose geometric properties remain unexplored in previous discrete generative models. Utilizing the Fisher information metric, we equip the manifold with a Riemannian structure whose intrinsic geometries are effectively leveraged by following the shortest paths of geodesics. We develop an efficient training and sampling algorithm that overcomes numerical stability issues with a diffeomorphism between manifolds. Our distinctive geometric perspective of statistical manifolds allows us to apply optimal transport during training and interpret SFM as following the steepest direction of the natural gradient. Unlike previous models that rely on variational bounds for likelihood estimation, SFM allows exact likelihood calculation for arbitrary probability measures. We show that SFM can learn more complex patterns on the statistical manifold where existing models often fail due to strong prior assumptions. Comprehensive experiments on real-world generative tasks ranging from image and text to biological domains further demonstrate that SFM achieves higher sampling quality and likelihood than other discrete diffusion or flow-based models.
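The geometry referred to above can be sketched with a standard fact from information geometry: under the Fisher information metric, the square-root map sends the categorical simplex onto the positive orthant of a sphere, where geodesics are great-circle arcs. Whether this is the exact diffeomorphism used by SFM is an assumption here, and the function names below are hypothetical.

```python
import math


def fisher_rao_distance(p, q):
    """Geodesic distance between two categorical distributions under the Fisher metric."""
    inner = sum(math.sqrt(pi * qi) for pi, qi in zip(p, q))
    return 2.0 * math.acos(min(1.0, max(-1.0, inner)))


def geodesic(p, q, t):
    """Point at time t in [0, 1] on the Fisher-Rao geodesic from p to q."""
    u = [math.sqrt(pi) for pi in p]  # square-root map onto the unit sphere
    v = [math.sqrt(qi) for qi in q]
    inner = min(1.0, max(-1.0, sum(a * b for a, b in zip(u, v))))
    theta = math.acos(inner)
    if theta < 1e-12:
        return list(p)  # endpoints coincide
    # Spherical linear interpolation along the great circle joining u and v.
    w0 = math.sin((1.0 - t) * theta) / math.sin(theta)
    w1 = math.sin(t * theta) / math.sin(theta)
    s = [w0 * a + w1 * b for a, b in zip(u, v)]
    return [x * x for x in s]  # squaring maps back to the simplex


p = [0.7, 0.2, 0.1]
q = [0.1, 0.2, 0.7]
mid = geodesic(p, q, 0.5)  # midpoint of the geodesic, still a valid distribution
```

Because the interpolant stays on the unit sphere, its squared coordinates always sum to one, so every point on the path is a valid categorical distribution.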

[558] LINSCAN – A Linearity Based Clustering Algorithm

Andrew Dennehy, Xiaoyu Zou, Shabnam J. Semnani, Yuri Fialko, Alexander Cloninger

Main category: cs.LG

TL;DR: LINSCAN is a new algorithm that extends DBSCAN and OPTICS to detect lineated clusters by representing points as normal distributions and using KL divergence distance, enabling identification of spatially close but orthogonally oriented clusters like intersecting faults in seismic data.

Details

Motivation: Existing clustering algorithms like DBSCAN and OPTICS struggle to identify lineated clusters, particularly when clusters are spatially close but have orthogonal orientations, which is common in applications like seismic fault detection.

Method: LINSCAN embeds points as normal distributions approximating their local neighborhoods and uses a distance function derived from Kullback-Leibler Divergence to detect and distinguish lineated clusters with orthogonal covariances.

Result: The algorithm successfully identifies active faults in seismic data, including intersecting faults, and determines their orientation, demonstrating its ability to handle complex spatial relationships that traditional methods cannot.

Conclusion: LINSCAN provides an effective generalization of DBSCAN and OPTICS for lineated cluster detection while maintaining their stability benefits, with particular utility in geophysical applications like fault identification.

Abstract: DBSCAN and OPTICS are powerful algorithms for identifying clusters of points in domains where few assumptions can be made about the structure of the data. In this paper, we leverage these strengths and introduce a new algorithm, LINSCAN, designed to seek lineated clusters that are difficult to find and isolate with existing methods. In particular, by embedding points as normal distributions approximating their local neighborhoods and leveraging a distance function derived from the Kullback-Leibler divergence, LINSCAN can detect and distinguish lineated clusters that are spatially close but have orthogonal covariances. We demonstrate how LINSCAN can be applied to seismic data to identify active faults, including intersecting faults, and determine their orientation. Finally, we discuss the properties a generalization of DBSCAN and OPTICS must have in order to retain the stability benefits of these algorithms.
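A small sketch shows why a divergence between local Gaussians can separate crossing lineated clusters. The abstract does not give the exact distance LINSCAN uses, so the symmetrized Kullback-Leibler divergence below is an assumption, as are the function names and the example covariances.

```python
import math


def _inv2(m):
    """Inverse and determinant of a 2x2 matrix."""
    (a, b), (c, d) = m
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]], det


def kl_gauss_2d(mu0, s0, mu1, s1):
    """KL( N(mu0, s0) || N(mu1, s1) ) for two-dimensional Gaussians."""
    inv1, det1 = _inv2(s1)
    _, det0 = _inv2(s0)
    # trace(inv1 @ s0)
    trace = sum(inv1[i][k] * s0[k][i] for i in range(2) for k in range(2))
    dx, dy = mu1[0] - mu0[0], mu1[1] - mu0[1]
    quad = dx * (inv1[0][0] * dx + inv1[0][1] * dy) + dy * (inv1[1][0] * dx + inv1[1][1] * dy)
    return 0.5 * (trace + quad - 2.0 + math.log(det1 / det0))


def sym_kl(mu0, s0, mu1, s1):
    """Symmetrized KL divergence (assumed stand-in for the LINSCAN distance)."""
    return kl_gauss_2d(mu0, s0, mu1, s1) + kl_gauss_2d(mu1, s1, mu0, s0)


horizontal = [[1.0, 0.0], [0.0, 0.01]]  # neighborhood stretched along x
vertical = [[0.01, 0.0], [0.0, 1.0]]    # neighborhood stretched along y

# Two points on the same horizontal line, half a unit apart.
d_same_line = sym_kl((0.0, 0.0), horizontal, (0.5, 0.0), horizontal)
# Two points at the same location but on orthogonal lines (an intersection).
d_crossing = sym_kl((0.0, 0.0), horizontal, (0.0, 0.0), vertical)
```

With these numbers the two points on the same line are at distance 0.25, while the two co-located points with orthogonal covariances are at distance 98.01, so a density-based method run on this distance would not merge the intersecting lines.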

[559] Rethinking Fine-Tuning when Scaling Test-Time Compute: Limiting Confidence Improves Mathematical Reasoning

Feng Chen, Allan Raventos, Nan Cheng, Surya Ganguli, Shaul Druckmann

Main category: cs.LG

TL;DR: Training with cross-entropy loss can misalign with test-time compute strategies like pass@N, causing performance to decrease with longer training due to model overconfidence. A modified training loss that limits confidence improves mathematical reasoning performance.

Details

Motivation: To explore how model training should be modified to optimize performance under test-time compute strategies like pass@N, given that current training methods may be misaligned with these strategies.

Method: Proposed a principled, modified training loss that limits model confidence to better align with pass@N strategy, tested on mathematical reasoning tasks including answering math questions and theorem proving via proof tree search.

Result: The modified training algorithm demonstrated improved mathematical reasoning performance on MATH and MiniF2F benchmarks across multiple scenarios, rescuing pass@N test performance that was degraded by standard cross-entropy training.

Conclusion: Co-designing training-time protocols and test-time search strategies is crucial for optimizing LLM performance, as traditional separate development phases can lead to misalignment and suboptimal results.

Abstract: Recent progress in large language models (LLMs) highlights the power of scaling test-time compute to achieve strong performance on complex tasks, such as mathematical reasoning and code generation. This raises a critical question: how should model training be modified to optimize performance under a subsequent test-time compute strategy and budget? To explore this, we focus on pass@N, a simple test-time strategy that searches for a correct answer in $N$ independent samples. We show, surprisingly, that training with cross-entropy (CE) loss can be misaligned with pass@N in that pass@N accuracy decreases with longer training. We explain the origins of this misalignment in terms of model overconfidence induced by CE, and experimentally verify our prediction of overconfidence as an impediment to scaling test-time compute via pass@N. Furthermore, we suggest a principled, modified training loss that is better aligned to pass@N by limiting model confidence and rescuing pass@N test performance. Our algorithm demonstrates improved mathematical reasoning on MATH and MiniF2F benchmarks under several scenarios: (1) providing answers to math questions; and (2) proving theorems by searching over proof trees of varying shapes. Overall, our work underscores the importance of co-designing two traditionally separate phases of LLM development: training-time protocols and test-time search and reasoning strategies.
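The link between overconfidence and pass@N can be seen in a toy calculation. If a model answers a question correctly with probability p on each independent sample, the chance that at least one of N samples is correct is 1 - (1 - p)^N. The two hypothetical models below are illustrative assumptions, not results from the paper.

```python
def pass_at_n(p_correct, n):
    """Probability that at least one of n independent samples is correct."""
    return 1.0 - (1.0 - p_correct) ** n


def mean_pass_at_n(per_question_p, n):
    """Average pass@n over a set of questions."""
    return sum(pass_at_n(p, n) for p in per_question_p) / len(per_question_p)


# Per-sample probability of a correct answer on two questions (hypothetical values).
confident = [1.0, 0.0]  # fully certain on both questions, but wrong on the second
hedged = [0.5, 0.5]     # spreads probability mass, right half the time on each

at_1 = (mean_pass_at_n(confident, 1), mean_pass_at_n(hedged, 1))
at_4 = (mean_pass_at_n(confident, 4), mean_pass_at_n(hedged, 4))
```

Both models score 0.5 at pass@1, but at pass@4 the confident model stays at 0.5 while the hedged model reaches 0.9375: extra samples cannot help on a question where all probability mass sits on a wrong answer.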

[560] TopER: Topological Embeddings in Graph Representation Learning

Astrit Tola, Funmilola Mary Taiwo, Cuneyt Gurcan Akcora, Baris Coskunuzer

Main category: cs.LG

TL;DR: TopER is a novel low-dimensional graph embedding method based on topological data analysis that provides interpretable visualizations while achieving competitive performance in graph tasks.

Details

Motivation: Existing graph embedding methods often produce opaque, high-dimensional embeddings that limit interpretability and practical visualization of graph-structured data.

Method: TopER simplifies Persistent Homology by calculating the evolution rate of graph substructures, creating low-dimensional topological embeddings that capture structural evolution patterns.

Result: TopER-based models achieve or surpass state-of-the-art results across molecular, biological, and social network datasets in classification, clustering, and visualization tasks.

Conclusion: TopER provides an intuitive and interpretable approach to graph representation learning that enhances data exploration while maintaining competitive performance across various graph analysis tasks.

Abstract: Graph embeddings play a critical role in graph representation learning, allowing machine learning models to explore and interpret graph-structured data. However, existing methods often rely on opaque, high-dimensional embeddings, limiting interpretability and practical visualization. In this work, we introduce Topological Evolution Rate (TopER), a novel, low-dimensional embedding approach grounded in topological data analysis. TopER simplifies a key topological approach, Persistent Homology, by calculating the evolution rate of graph substructures, resulting in intuitive and interpretable visualizations of graph data. This approach not only enhances the exploration of graph datasets but also delivers competitive performance in graph clustering and classification tasks. Our TopER-based models achieve or surpass state-of-the-art results across molecular, biological, and social network datasets in tasks such as classification, clustering, and visualization.
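One possible reading of an "evolution rate of graph substructures" can be sketched as follows: sweep a threshold over a node function, record how many nodes and edges the growing subgraph contains, and summarize the sequence by a fitted line. The filtration function, the least-squares fit, and the name `toper_like_embedding` are assumptions made for illustration; the abstract does not specify these details.

```python
def toper_like_embedding(edges, node_values, thresholds):
    """Fit edge_count ~ a + b * node_count over a sublevel filtration; return (a, b)."""
    xs, ys = [], []
    for t in thresholds:
        # Subgraph induced by nodes whose value is at most the current threshold.
        active = {v for v, f in node_values.items() if f <= t}
        n_edges = sum(1 for u, v in edges if u in active and v in active)
        xs.append(len(active))
        ys.append(n_edges)
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    var_x = sum((x - mean_x) ** 2 for x in xs)
    if var_x == 0:
        return mean_y, 0.0  # degenerate filtration: no growth in node count
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / var_x
    return mean_y - slope * mean_x, slope


# Path graph 0-1-2-3, filtered by node index (hypothetical node function).
edges = [(0, 1), (1, 2), (2, 3)]
node_values = {0: 0, 1: 1, 2: 2, 3: 3}
a, b = toper_like_embedding(edges, node_values, thresholds=[0, 1, 2, 3])
```

For this path graph each added node brings exactly one edge, so the fit gives slope 1 and intercept -1. The resulting pair of numbers is a two-dimensional summary of the graph that can be plotted directly.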

[561] SCNode: Spatial and Contextual Coordinates for Graph Representation Learning

Md Joshem Uddin, Astrit Tola, Varin Sikand, Cuneyt Gurcan Akcora, Baris Coskunuzer

Main category: cs.LG

TL;DR: SCNode is a Spatial-Contextual Node Embedding framework that addresses limitations of traditional GNNs by integrating spatial and contextual information, achieving superior performance in both homophilic and heterophilic graph settings.

Details

Motivation: Traditional message passing GNNs suffer from oversquashing, oversmoothing, and underreaching, and rely on the homophily assumption, leading to poor performance in heterophilic graphs where connected nodes may not share similar features.

Method: SCNode integrates spatial and contextual information to create discriminative and structurally aware node embeddings, introducing new homophily matrices to understand class interactions and tendencies.

Result: Extensive experiments on benchmark datasets show SCNode achieves superior performance over conventional GNN models, demonstrating robustness and adaptability across diverse graph structures.

Conclusion: SCNode provides a consistent and effective framework for node representation learning that works well in both homophilic and heterophilic graph settings, overcoming key limitations of traditional GNN approaches.

Abstract: Effective node representation lies at the heart of Graph Neural Networks (GNNs), as it directly impacts their ability to perform downstream tasks such as node classification and link prediction. Most existing GNNs, particularly message passing graph neural networks (MPGNNs), rely on neighborhood aggregation to iteratively compute node embeddings. While powerful, this paradigm suffers from well-known limitations of oversquashing, oversmoothing, and underreaching that degrade representation quality. More critically, MPGNNs often assume homophily, where connected nodes share similar features or labels, leading to poor generalization in heterophilic graphs where this assumption breaks down. To address these challenges, we propose SCNode, a Spatial-Contextual Node Embedding framework designed to perform consistently well in both homophilic and heterophilic settings. SCNode integrates spatial and contextual information, yielding node embeddings that are not only more discriminative but also structurally aware. Our approach introduces new homophily matrices for understanding class interactions and tendencies. Extensive experiments on benchmark datasets show that SCNode achieves superior performance over conventional GNN models, demonstrating its robustness and adaptability in diverse graph structures.

[562] Position: Beyond Euclidean – Foundation Models Should Embrace Non-Euclidean Geometries

Neil He, Jiahong Liu, Buze Zhang, Ngoc Bui, Ali Maatouk, Menglin Yang, Irwin King, Melanie Weber, Rex Ying

Main category: cs.LG

TL;DR: This position paper argues that moving beyond Euclidean geometry is essential for next-generation foundation models to effectively capture real-world non-Euclidean structures like hierarchies and multi-way relationships.

Details

Motivation: Real-world data exhibits inherently non-Euclidean structures (multi-way relationships, hierarchies, symmetries) that are challenging to capture in Euclidean spaces, limiting the scaling potential of current foundation models.

Method: Proposes integrating non-Euclidean geometries into foundation models through fine-tuning, training from scratch, and hybrid approaches, with task-aware adaptability that dynamically reconfigures embeddings.

Result: Theoretical and empirical investigations support that non-Euclidean geometries could enable foundation models to more efficiently leverage complex data structures and maintain scaling laws.

Conclusion: Adopting non-Euclidean geometries is not optional but necessary for the next generation of foundation models to overcome fundamental limitations of Euclidean space and achieve better efficiency and expressivity.

Abstract: In the era of foundation models and Large Language Models (LLMs), Euclidean space has been the de facto geometric setting for machine learning architectures. However, recent literature has demonstrated that this choice comes with fundamental limitations. At a large scale, real-world data often exhibits inherently non-Euclidean structures, such as multi-way relationships, hierarchies, symmetries, and non-isotropic scaling, in a variety of domains, such as languages, vision, and the natural sciences. It is challenging to effectively capture these structures within the constraints of Euclidean spaces. This position paper argues that moving beyond Euclidean geometry is not merely an optional enhancement but a necessity to maintain the scaling law for the next-generation of foundation models. By adopting these geometries, foundation models could more efficiently leverage the aforementioned structures. Task-aware adaptability that dynamically reconfigures embeddings to match the geometry of downstream applications could further enhance efficiency and expressivity. Our position is supported by a series of theoretical and empirical investigations of prevalent foundation models. Finally, we outline a roadmap for integrating non-Euclidean geometries into foundation models, including strategies for building geometric foundation models via fine-tuning, training from scratch, and hybrid approaches.

[563] On Feasible Rewards in Multi-Agent Inverse Reinforcement Learning

Till Freihaut, Giorgia Ramponi

Main category: cs.LG

TL;DR: Characterizes feasible reward sets in Markov games for multi-agent inverse reinforcement learning, addresses equilibrium ambiguity through entropy regularization, and provides sample complexity analysis.

Details

Motivation: Equilibrium-based observations in multi-agent systems are often ambiguous - a single Nash equilibrium can correspond to many different reward structures, potentially changing the game's nature.

Method: Introduces entropy-regularized Markov games to yield unique equilibrium while preserving strategic incentives, and provides sample complexity analysis for how errors affect learned policy performance.

Result: Establishes theoretical foundations for characterizing feasible reward sets in Markov games and provides practical insights for multi-agent inverse reinforcement learning.

Conclusion: The work provides both theoretical foundations and practical insights for multi-agent inverse reinforcement learning by addressing equilibrium ambiguity through entropy regularization and analyzing sample complexity.

Abstract: Multi-agent Inverse Reinforcement Learning (MAIRL) aims to recover agent reward functions from expert demonstrations. We characterize the feasible reward set in Markov games, identifying all reward functions that rationalize a given equilibrium. However, equilibrium-based observations are often ambiguous: a single Nash equilibrium can correspond to many reward structures, potentially changing the game’s nature in multi-agent systems. We address this by introducing entropy-regularized Markov games, which yield a unique equilibrium while preserving strategic incentives. For this setting, we provide a sample complexity analysis detailing how errors affect learned policy performance. Our work establishes theoretical foundations and practical insights for MAIRL.

[564] ARBoids: Adaptive Residual Reinforcement Learning With Boids Model for Cooperative Multi-USV Target Defense

Jiyue Tao, Tongsheng Shen, Dexin Zhao, Feitian Zhang

Main category: cs.LG

TL;DR: ARBoids is an adaptive residual RL framework that combines DRL with the Boids model for USV target defense, outperforming traditional methods at intercepting attackers with superior maneuverability.

Details

Motivation: To address the challenging TDP scenario where attackers have superior maneuverability compared to defenders, making effective interception difficult.

Method: Integrates deep reinforcement learning with the biologically inspired Boids model, using Boids as the baseline policy and DRL to learn a residual policy for adaptive action refinement.

Result: Validated in high-fidelity Gazebo simulation, showing superior performance over pure force-based approaches and vanilla DRL policies, with strong adaptability to diverse attacker maneuverability profiles.

Conclusion: ARBoids demonstrates robustness and generalization capability, effectively solving the TDP against attackers with superior maneuverability through an adaptive residual learning framework.

Abstract: The target defense problem (TDP) for unmanned surface vehicles (USVs) concerns intercepting an adversarial USV before it breaches a designated target region, using one or more defending USVs. A particularly challenging scenario arises when the attacker exhibits superior maneuverability compared to the defenders, significantly complicating effective interception. To tackle this challenge, this letter introduces ARBoids, a novel adaptive residual reinforcement learning framework that integrates deep reinforcement learning (DRL) with the biologically inspired, force-based Boids model. Within this framework, the Boids model serves as a computationally efficient baseline policy for multi-agent coordination, while DRL learns a residual policy to adaptively refine and optimize the defenders’ actions. The proposed approach is validated in a high-fidelity Gazebo simulation environment, demonstrating superior performance over traditional interception strategies, including pure force-based approaches and vanilla DRL policies. Furthermore, the learned policy exhibits strong adaptability to attackers with diverse maneuverability profiles, highlighting its robustness and generalization capability. The code of ARBoids will be released upon acceptance of this letter.
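The residual structure described in this abstract (a force-based baseline action plus a learned correction) can be illustrated with a minimal sketch. Everything below is hypothetical: the `boids_action` rule, the `residual_policy` stub standing in for a trained DRL network, and the `residual_scale` and `limit` parameters are assumptions made for illustration, not the authors' implementation.

```python
import math

def boids_action(me, mates, target, sep_radius=2.0):
    """Hypothetical force-based baseline: steer toward the target,
    and away from teammates that are closer than sep_radius."""
    fx, fy = target[0] - me[0], target[1] - me[1]
    norm = math.hypot(fx, fy) or 1.0
    fx, fy = fx / norm, fy / norm
    for m in mates:
        dx, dy = me[0] - m[0], me[1] - m[1]
        d = math.hypot(dx, dy)
        if 0 < d < sep_radius:
            push = (sep_radius - d) / sep_radius
            fx += dx / d * push
            fy += dy / d * push
    return fx, fy

def residual_policy(observation):
    """Stand-in for a learned residual; returns a fixed small correction."""
    return (0.0, 0.1)

def clip(value, limit):
    return max(-limit, min(limit, value))

def combined_action(me, mates, target, residual_scale=0.5, limit=1.0):
    """Final action = baseline + scaled residual, clipped to actuator limits."""
    base = boids_action(me, mates, target)
    res = residual_policy((me, mates, target))
    return tuple(clip(b + residual_scale * r, limit) for b, r in zip(base, res))

action = combined_action(me=(0.0, 0.0), mates=[(5.0, 5.0)], target=(3.0, 4.0))
```

Keeping the residual small relative to the baseline is one common design choice in residual RL: the baseline supplies reasonable behavior from the first step, and the learned term only has to correct it.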

[565] DisCO: Reinforcing Large Reasoning Models with Discriminative Constrained Optimization

Gang Li, Ming Lin, Tomer Galanti, Zhengzhong Tu, Tianbao Yang

Main category: cs.LG

TL;DR: DisCO is a new discriminative constrained optimization framework that outperforms GRPO and its variants by eliminating difficulty bias, improving training stability, and handling data imbalance in large reasoning models.

Details

Motivation: To address inherent limitations in GRPO including question-level difficulty bias and entropy instability, and to leverage connections between GRPO and traditional discriminative learning methods.

Method: Replaces group relative objective with discriminative scoring functions, uses non-clipping RL surrogates instead of clipping-based approaches, and employs constrained optimization to enforce KL divergence constraints.

Result: Significantly outperforms GRPO and DAPO with average gains of 7% over GRPO and 6% over DAPO across six mathematical reasoning benchmarks for 1.5B models.

Conclusion: DisCO provides a more effective framework for reinforcing large reasoning models by eliminating difficulty bias, ensuring stable training, and enabling advanced discriminative learning techniques.

Abstract: The recent success and openness of DeepSeek-R1 have brought widespread attention to Group Relative Policy Optimization (GRPO) as a reinforcement learning method for large reasoning models (LRMs). In this work, we analyze the GRPO objective under a binary reward setting and reveal an inherent limitation of question-level difficulty bias. We also identify a connection between GRPO and traditional discriminative methods in supervised learning. Motivated by these insights, we introduce a new Discriminative Constrained Optimization (DisCO) framework for reinforcing LRMs, grounded in the principle of discriminative learning. The main differences between DisCO and GRPO and its recent variants are: (1) it replaces the group relative objective with a discriminative objective defined by a scoring function; (2) it abandons clipping-based surrogates in favor of non-clipping RL surrogate objectives used as scoring functions; (3) it employs a simple yet effective constrained optimization approach to enforce the KL divergence constraint. As a result, DisCO offers notable advantages over GRPO and its variants: (i) it completely eliminates difficulty bias by adopting discriminative objectives; (ii) it addresses the entropy instability in GRPO and its variants through the use of non-clipping scoring functions and a constrained optimization approach, yielding long and stable training dynamics; (iii) it allows the incorporation of advanced discriminative learning techniques to address data imbalance, where a significant number of questions have more negative than positive generated answers during training. Our experiments on enhancing the mathematical reasoning capabilities of SFT-finetuned models show that DisCO significantly outperforms GRPO and its improved variants such as DAPO, achieving average gains of 7% over GRPO and 6% over DAPO across six benchmark tasks for a 1.5B model.

[566] Towards Efficient Training of Graph Neural Networks: A Multiscale Approach

Eshed Gal, Moshe Eliasof, Carola-Bibiane Schönlieb, Ivan I. Kyrchei, Eldad Haber, Eran Treister

Main category: cs.LG

TL;DR: Novel multiscale GNN training framework using hierarchical graphs and subgraphs to reduce computational overhead while maintaining performance.

Details

Motivation: Standard GNN training faces computational and memory challenges with large graphs, limiting scalability and efficiency.

Method: Leverages hierarchical graph representations, subgraphs, coarse-to-fine learning, subgraph-to-full-graph transfer, and multiscale gradient computation.

Result: Significantly accelerates GNN training for large-scale problems while maintaining or improving predictive performance.

Conclusion: Multiscale training enables efficient GNN training for large graphs through hierarchical abstractions and scalable strategies.

Abstract: Graph Neural Networks (GNNs) have become powerful tools for learning from graph-structured data, finding applications across diverse domains. However, as graph sizes and connectivity increase, standard GNN training methods face significant computational and memory challenges, limiting their scalability and efficiency. In this paper, we present a novel framework for efficient multiscale training of GNNs. Our approach leverages hierarchical graph representations and subgraphs, enabling the integration of information across multiple scales and resolutions. By utilizing coarser graph abstractions and subgraphs, each with fewer nodes and edges, we significantly reduce computational overhead during training. Building on this framework, we propose a suite of scalable training strategies, including coarse-to-fine learning, subgraph-to-full-graph transfer, and multiscale gradient computation. We also provide some theoretical analysis of our methods and demonstrate their effectiveness across various datasets and learning tasks. Our results show that multiscale training can substantially accelerate GNN training for large scale problems while maintaining, or even improving, predictive performance.

[567] Enforcing Hard Linear Constraints in Deep Learning Models with Decision Rules

Gonzalo E. Constante-Flores, Hao Chen, Can Li

Main category: cs.LG

TL;DR: A model-agnostic framework for enforcing input-dependent linear constraints on neural network outputs using a dual-network architecture that guarantees constraint satisfaction without runtime optimization.

Details

Motivation: Deep learning models in safety-critical applications need to satisfy hard constraints (physical laws, fairness, safety limits), but standard architectures lack built-in constraint enforcement mechanisms.

Method: Combines a task network for prediction accuracy with a safe network trained using stochastic/robust optimization decision rules, with final prediction as convex combination to guarantee constraint satisfaction.

Result: The architecture is proven to be a universal approximator of constrained functions, maintains competitive accuracy while consistently satisfying constraints, and has low inference latency.

Conclusion: The proposed framework provides computationally tractable constraint enforcement for neural networks without iterative procedures, suitable for safety-critical applications.

Abstract: Deep learning models are increasingly deployed in safety-critical tasks where predictions must satisfy hard constraints, such as physical laws, fairness requirements, or safety limits. However, standard architectures lack built-in mechanisms to enforce such constraints, and existing approaches based on regularization or projection are often limited to simple constraints, computationally expensive, or lack feasibility guarantees. This paper proposes a model-agnostic framework for enforcing input-dependent linear equality and inequality constraints on neural network outputs. The architecture combines a task network trained for prediction accuracy with a safe network trained using decision rules from the stochastic and robust optimization literature to ensure feasibility across the entire input space. The final prediction is a convex combination of the two subnetworks, guaranteeing constraint satisfaction during both training and inference without iterative procedures or runtime optimization. We prove that the architecture is a universal approximator of constrained functions and derive computationally tractable formulations based on linear decision rules. Empirical results on benchmark regression tasks show that our method consistently satisfies constraints while maintaining competitive accuracy and low inference latency.
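Why a convex combination can guarantee feasibility is easy to see in a small sketch: the set {y : A y <= b} is convex, so if the safe prediction is feasible, every point on the segment from the safe prediction toward the task prediction is feasible up to the first constraint boundary. This summary does not state how the paper chooses the combination weight; the closed-form ratio rule below, and the names `largest_feasible_weight` and `combine`, are assumptions for illustration. In the input-dependent setting, `A` and `b` would be computed from the input.

```python
def dot(row, y):
    return sum(a * v for a, v in zip(row, y))

def largest_feasible_weight(A, b, y_task, y_safe):
    """Largest alpha in [0, 1] such that
    A @ (alpha * y_task + (1 - alpha) * y_safe) <= b.
    Assumes y_safe itself satisfies A @ y_safe <= b."""
    alpha = 1.0
    for row, bound in zip(A, b):
        safe_val = dot(row, y_safe)
        task_val = dot(row, y_task)
        if safe_val > bound:
            raise ValueError("safe prediction must be feasible")
        if task_val > bound:
            # Along the segment the constraint value is linear in alpha:
            # safe_val + alpha * (task_val - safe_val) <= bound
            alpha = min(alpha, (bound - safe_val) / (task_val - safe_val))
    return alpha

def combine(A, b, y_task, y_safe):
    alpha = largest_feasible_weight(A, b, y_task, y_safe)
    y = [alpha * t + (1 - alpha) * s for t, s in zip(y_task, y_safe)]
    return y, alpha

# Constraints: y1 + y2 <= 1 and -y1 <= 0 (hypothetical example values).
A = [[1.0, 1.0], [-1.0, 0.0]]
b = [1.0, 0.0]
y, alpha = combine(A, b, y_task=[1.0, 1.0], y_safe=[0.25, 0.25])
```

When the task prediction is already feasible, `alpha` stays at 1 and the output equals the task prediction, so accuracy is only traded away where a constraint would otherwise be violated.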

[568] Your Pre-trained LLM is Secretly an Unsupervised Confidence Calibrator

Beier Luo, Shuoyuan Wang, Sharon Li, Hongxin Wei

Main category: cs.LG

TL;DR: DACA is an unsupervised method that improves confidence calibration in post-trained language models by using only examples on which the PLM and PoLM agree, avoiding the overly large temperature that prediction disagreements would otherwise cause.

Details

Motivation: Post-trained language models often suffer from over-confidence, assigning high confidence to both correct and incorrect outputs, which undermines reliability in critical applications. The scarcity of labeled data for individual downstream tasks makes calibration challenging.

Method: Proposes Disagreement-Aware Confidence Alignment (DACA), which uses only examples on which the PLM and PoLM agree for temperature-scaling calibration, decoupling the influence of disagreement examples that cause under-confidence.

Result: Extensive experiments show DACA improves average ECE of open-sourced and API-based LLMs (including GPT-4o) by up to 15.08% on common benchmarks.

Conclusion: DACA effectively addresses the over-confidence problem in post-trained language models through selective calibration using agreement examples, significantly improving calibration performance without requiring labeled data.

Abstract: Post-training of large language models is essential for adapting pre-trained language models (PLMs) to align with human preferences and downstream tasks. While PLMs typically exhibit well-calibrated confidence, post-trained language models (PoLMs) often suffer from over-confidence, assigning high confidence to both correct and incorrect outputs, which can undermine reliability in critical applications. A major obstacle in calibrating PoLMs is the scarcity of labeled data for individual downstream tasks. To address this, we propose Disagreement-Aware Confidence Alignment (DACA), a novel unsupervised method to optimize the parameters (e.g., temperature τ) in post-hoc confidence calibration. Our method is motivated by the under-confidence issue caused by prediction disagreement between the PLM and PoLM while aligning their confidence via temperature scaling. Theoretically, the PLM’s confidence underestimates PoLM’s prediction accuracy on disagreement examples, causing a larger τ and producing under-confident predictions. DACA mitigates this by selectively using only agreement examples for calibration, effectively decoupling the influence of disagreement. In this manner, our method avoids an overly large τ in temperature scaling caused by disagreement examples, improving calibration performance. Extensive experiments demonstrate the effectiveness of our method, improving the average ECE of open-sourced and API-based LLMs (e.g., GPT-4o) by up to 15.08% on common benchmarks.
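The idea of fitting a temperature on agreement examples only can be sketched as follows. This is an illustration under stated assumptions, not the paper's objective: here the temperature is chosen by a simple grid search so that the PoLM's mean top-class confidence matches the PLM's on the examples where both models predict the same class. The function names, the grid, and the toy logits are all hypothetical.

```python
import math

def softmax(logits, tau=1.0):
    """Temperature-scaled softmax; tau > 1 softens the distribution."""
    m = max(logits)
    exps = [math.exp((z - m) / tau) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def argmax(values):
    return max(range(len(values)), key=values.__getitem__)

def fit_temperature(plm_logits, polm_logits, taus):
    """Pick tau from a grid using only examples where both models agree."""
    pairs = [(p, q) for p, q in zip(plm_logits, polm_logits)
             if argmax(p) == argmax(q)]
    if not pairs:
        raise ValueError("no agreement examples to calibrate on")
    target = sum(max(softmax(p)) for p, _ in pairs) / len(pairs)

    def gap(tau):
        conf = sum(max(softmax(q, tau)) for _, q in pairs) / len(pairs)
        return abs(conf - target)

    return min(taus, key=gap), len(pairs)

# Toy logits: the third example is a disagreement and is excluded.
plm = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
polm = [[4.0, 0.0], [0.0, 4.0], [0.0, 4.0]]
tau, n_used = fit_temperature(plm, polm, taus=[0.5 * k for k in range(1, 21)])
```

In this toy case the PoLM logits are four times sharper than the PLM's on the agreement examples, so the grid search settles on a temperature of 4.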

Bingguang Hao, ZengZhuang Xu, Maolin Wang, Yuntao Wen, Yicheng Chen, Cunyin Peng, Long Chen, Dong Wang, Xiangyu Zhao, Jinjie Gu, Chenyi Zhuang, Ji Zhang

Main category: cs.LG

TL;DR: FunReason is a framework that enhances LLMs’ function calling capabilities through automated data refinement and a Self-Refinement Multiscale Loss approach, achieving GPT-4o comparable performance while mitigating catastrophic forgetting.

Details

Motivation: Traditional training approaches struggle to balance detailed reasoning steps with precise function execution in LLMs, leading to suboptimal performance in real-world applications.

Method: Uses an automated data refinement strategy that leverages LLMs’ natural reasoning to generate high-quality training examples, and a Self-Refinement Multiscale Loss (SRML) that dynamically balances reasoning processes and function call accuracy during training.

Result: Achieves performance comparable to GPT-4o while effectively mitigating catastrophic forgetting during fine-tuning.

Conclusion: FunReason provides a comprehensive solution for enhancing LLMs’ function calling capabilities through balanced training methodology and data refinement pipeline.

Abstract: The integration of large language models (LLMs) with function calling has emerged as a crucial capability for enhancing their practical utility in real-world applications. However, effectively combining reasoning processes with accurate function execution remains a significant challenge. Traditional training approaches often struggle to balance the detailed reasoning steps with the precision of function calls, leading to suboptimal performance. To address these limitations, we introduce FunReason, a novel framework that enhances LLMs’ function calling capabilities through an automated data refinement strategy and a Self-Refinement Multiscale Loss (SRML) approach. FunReason leverages LLMs’ natural reasoning abilities to generate high-quality training examples, focusing on query parseability, reasoning coherence, and function call precision. The SRML approach dynamically balances the contribution of reasoning processes and function call accuracy during training, addressing the inherent trade-off between these two critical aspects. FunReason achieves performance comparable to GPT-4o while effectively mitigating catastrophic forgetting during fine-tuning. FunReason provides a comprehensive solution for enhancing LLMs’ function calling capabilities by introducing a balanced training methodology and a data refinement pipeline. For code and dataset, please refer to our repository at GitHub https://github.com/BingguangHao/FunReason

[570] Walking the Weight Manifold: a Topological Approach to Conditioning Inspired by Neuromodulation

Ari S. Benjamin, Kyle Daruwalla, Christian Pehle, Abdul-Malik Zekri, Anthony M. Zador

Main category: cs.LG

TL;DR: The paper proposes learning weight manifolds (smooth parameterized functions of task context) instead of individual weight vectors, inspired by neuromodulation in brains, to enable efficient multi-task learning and better generalization.

Details

Motivation: Traditional neural networks condition on tasks by injecting context as input, but brains use neuromodulation to modulate parameters themselves. The paper aims to develop a more biologically-inspired approach to multi-task learning.

Method: Optimize smooth manifolds in weight space with predefined topologies (lines, circles, torus) as functions of task context variables, using a formal treatment of manifold optimization that minimizes loss subject to volumetric movement constraints.

Result: Manifolds with various topologies outperform input concatenation conditioning and generalize better to out-of-distribution samples. Straight lines work well for noise level conditioning, ellipses for rotated images.

Conclusion: Modulating weights over low-dimensional manifolds provides a principled and effective alternative to traditional conditioning methods, offering better inductive biases and generalization across tasks.

Abstract: One frequently wishes to learn a range of similar tasks as efficiently as possible, re-using knowledge across tasks. In artificial neural networks, this is typically accomplished by conditioning a network upon task context by injecting context as input. Brains have a different strategy: the parameters themselves are modulated as a function of various neuromodulators such as serotonin. Here, we take inspiration from neuromodulation and propose to learn weights which are smoothly parameterized functions of task context variables. Rather than optimize a weight vector, i.e. a single point in weight space, we optimize a smooth manifold in weight space with a predefined topology. To accomplish this, we derive a formal treatment of optimization of manifolds as the minimization of a loss functional subject to a constraint on volumetric movement, analogous to gradient descent. During inference, conditioning selects a single point on this manifold which serves as the effective weight matrix for a particular sub-task. This strategy for conditioning has two main advantages. First, the topology of the manifold (whether a line, circle, or torus) is a convenient lever for inductive biases about the relationship between tasks. Second, learning in one state smoothly affects the entire manifold, encouraging generalization across states. To verify this, we train manifolds with several topologies, including straight lines in weight space (for conditioning on e.g. noise level in input data) and ellipses (for rotated images). Despite their simplicity, these parameterizations outperform conditioning identical networks by input concatenation and better generalize to out-of-distribution samples. These results suggest that modulating weights over low-dimensional manifolds offers a principled and effective alternative to traditional conditioning.
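The two parameterizations named in this abstract, a straight line and an ellipse in weight space, can be written down directly. The sketch below only shows how a context variable selects a point on the manifold at inference time; how the manifold is trained is not shown, and all names (`line_weights`, `ellipse_weights`, `linear_model`) and the toy weight values are hypothetical.

```python
import math

def line_weights(w_start, w_end, c):
    """Point on a straight line in weight space for a context c in [0, 1]
    (for example, a normalized noise level)."""
    return [(1 - c) * a + c * b for a, b in zip(w_start, w_end)]

def ellipse_weights(center, axis_u, axis_v, theta):
    """Point on an ellipse in weight space for an angular context theta
    (for example, an image rotation angle)."""
    return [m + math.cos(theta) * u + math.sin(theta) * v
            for m, u, v in zip(center, axis_u, axis_v)]

def linear_model(weights, x):
    """Tiny model whose effective weights come from the manifold."""
    return sum(w * xi for w, xi in zip(weights, x))

w_line = line_weights([1.0, 0.0], [0.0, 1.0], c=0.25)
w_quarter = ellipse_weights([0.0, 0.0], [2.0, 0.0], [0.0, 1.0], math.pi / 2)
w_zero = ellipse_weights([0.0, 0.0], [2.0, 0.0], [0.0, 1.0], 0.0)
w_full = ellipse_weights([0.0, 0.0], [2.0, 0.0], [0.0, 1.0], 2 * math.pi)
prediction = linear_model(w_line, [2.0, 4.0])
```

The ellipse closes on itself (`w_full` matches `w_zero` up to rounding), which is the kind of inductive bias a periodic context such as rotation calls for, whereas the line suits a context with two distinct extremes.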

[571] AReaL: A Large-Scale Asynchronous Reinforcement Learning System for Language Reasoning

Wei Fu, Jiaxuan Gao, Xujie Shen, Chen Zhu, Zhiyu Mei, Chuyi He, Shusheng Xu, Guo Wei, Jun Mei, Jiashu Wang, Tongkai Yang, Binhang Yuan, Yi Wu

Main category: cs.LG

TL;DR: AReaL is an asynchronous RL system for LLMs that decouples generation from training, achieving up to 2.77x speedup over synchronous systems while maintaining or improving performance.

Details

Motivation: Existing synchronous RL systems for LLMs suffer from GPU underutilization because generation must wait for the longest output in a batch to complete before model updates can occur.

Method: AReaL uses a fully asynchronous architecture with separate rollout and training workers, workload balancing to control data staleness, and a staleness-enhanced PPO variant to handle outdated samples.

Result: AReaL achieves up to 2.77x training speedup compared to synchronous systems with the same GPU resources, with matched or improved final performance on math and code reasoning benchmarks.

Conclusion: The asynchronous approach with proper staleness management enables significantly more efficient RL training for LLMs while maintaining training stability and performance.

Abstract: Reinforcement learning (RL) has become a dominant paradigm for training large language models (LLMs), particularly for reasoning tasks. Effective RL for LLMs requires massive parallelization and poses an urgent need for efficient training systems. Most existing large-scale RL systems for LLMs are synchronous, alternating generation and training in a batch setting where rollouts in each training batch are generated by the same model. This approach stabilizes RL training but suffers from severe system-level inefficiency: generation must wait until the longest output in the batch is completed before model updates, resulting in GPU underutilization. We present AReaL, a fully asynchronous RL system that completely decouples generation from training. Rollout workers in AReaL continuously generate new outputs without waiting, while training workers update the model whenever a batch of data is collected. AReaL also incorporates a collection of system-level optimizations, leading to substantially higher GPU utilization. To stabilize RL training, AReaL balances the workload of rollout and training workers to control data staleness, and adopts a staleness-enhanced PPO variant to better handle outdated training samples. Extensive experiments on math and code reasoning benchmarks show that AReaL achieves up to 2.77x training speedup compared to synchronous systems with the same number of GPUs and matched or improved final performance. The code of AReaL is available at https://github.com/inclusionAI/AReaL/.
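One way to picture data staleness in an asynchronous setup is to tag each rollout with the policy version that generated it and bound how far behind the trainer's current version a sample may be. The sketch below is a simplified, single-threaded illustration under that assumption; the class `StalenessBoundedBuffer`, its `max_staleness` parameter, and the choice to drop stale samples are hypothetical and are not taken from the AReaL codebase, which may instead throttle rollout workers.

```python
from collections import deque

class StalenessBoundedBuffer:
    """Queue of (sample, policy_version) pairs with a staleness bound."""

    def __init__(self, max_staleness):
        self.max_staleness = max_staleness
        self.items = deque()

    def put(self, sample, policy_version):
        # Rollout workers push samples as soon as they finish generating.
        self.items.append((sample, policy_version))

    def get_batch(self, batch_size, current_version):
        """Pop up to batch_size samples that are at most max_staleness
        versions behind current_version; older samples are discarded."""
        batch = []
        while self.items and len(batch) < batch_size:
            sample, version = self.items.popleft()
            if current_version - version <= self.max_staleness:
                batch.append((sample, version))
        return batch

buffer = StalenessBoundedBuffer(max_staleness=2)
for sample, version in [("a", 0), ("b", 1), ("c", 3), ("d", 3)]:
    buffer.put(sample, version)
batch = buffer.get_batch(batch_size=2, current_version=3)
```

Here sample "a" is three versions old and is discarded, while "b" and "c" form the batch and "d" waits for the next training step.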

[572] FunDiff: Diffusion Models over Function Spaces for Physics-Informed Generative Modeling

Sifan Wang, Zehao Dou, Siming Shan, Tong-Rui Liu, Lu Lu

Main category: cs.LG

TL;DR: FunDiff is a framework for generative modeling in function spaces that combines latent diffusion with function autoencoders to generate continuous functions while incorporating physical priors.

Details

Motivation: Adapting generative models to physical applications is challenging because physical quantities are continuous functions governed by complex physical laws, requiring models that can handle varying discretizations and enforce physical constraints.

Method: FunDiff combines latent diffusion processes with function autoencoder architecture, using architectural constraints or physics-informed loss functions to incorporate physical priors and ensure generated samples satisfy physical laws.

Result: The method achieves minimax optimality guarantees for density estimation in function spaces and demonstrates effectiveness in fluid dynamics and solid mechanics applications, generating physically consistent samples with high fidelity and robustness to noisy/low-resolution data.

Conclusion: FunDiff provides a principled framework for generative modeling in function spaces that can handle continuous functions, incorporate physical constraints, and achieve optimal theoretical guarantees while maintaining practical effectiveness across physical applications.

Abstract: Recent advances in generative modeling – particularly diffusion models and flow matching – have achieved remarkable success in synthesizing discrete data such as images and videos. However, adapting these models to physical applications remains challenging, as the quantities of interest are continuous functions governed by complex physical laws. Here, we introduce FunDiff, a novel framework for generative modeling in function spaces. FunDiff combines a latent diffusion process with a function autoencoder architecture to handle input functions with varying discretizations, generate continuous functions evaluable at arbitrary locations, and seamlessly incorporate physical priors. These priors are enforced through architectural constraints or physics-informed loss functions, ensuring that generated samples satisfy fundamental physical laws. We theoretically establish minimax optimality guarantees for density estimation in function spaces, showing that diffusion-based estimators achieve optimal convergence rates under suitable regularity conditions. We demonstrate the practical effectiveness of FunDiff across diverse applications in fluid dynamics and solid mechanics. Empirical results show that our method generates physically consistent samples with high fidelity to the target distribution and exhibits robustness to noisy and low-resolution data. Code and datasets are publicly available at https://github.com/sifanexisted/fundiff.

[573] Towards Multimodal Graph Large Language Model

Xin Wang, Zeyang Zhang, Linxin Xiao, Haibo Chen, Chendi Ge, Wenwu Zhu

Main category: cs.LG

TL;DR: This paper explores Multi-modal Graph Large Language Models (MG-LLM) to unify and generalize across diverse multi-modal graph data and tasks, proposing a framework with five key characteristics.

Details

Motivation: Existing multi-modal graph learning methods are trained from scratch for specific data and tasks, lacking generalization across diverse multi-modal graph data and tasks.

Method: Proposes a unified framework for multi-modal graph data, tasks, and models, identifying multi-granularity and multi-scale characteristics. Presents five key desired characteristics for MG-LLM including unified space, task handling, in-context learning, natural language interaction, and reasoning.

Result: The paper elaborates on key challenges, reviews related works, highlights future research directions, and summarizes existing multi-modal graph datasets for model training.

Conclusion: This work contributes to advancing research towards MG-LLM for generalization across multi-modal graph data and tasks, providing a foundation for future developments in this area.

Abstract: Multi-modal graphs, which integrate diverse multi-modal features and relations, are ubiquitous in real-world applications. However, existing multi-modal graph learning methods are typically trained from scratch for specific graph data and tasks, failing to generalize across various multi-modal graph data and tasks. To bridge this gap, we explore the potential of Multi-modal Graph Large Language Models (MG-LLM) to unify and generalize across diverse multi-modal graph data and tasks. We propose a unified framework of multi-modal graph data, task, and model, discovering the inherent multi-granularity and multi-scale characteristics in multi-modal graphs. Specifically, we present five key desired characteristics for MG-LLM: 1) unified space for multi-modal structures and attributes, 2) capability of handling diverse multi-modal graph tasks, 3) multi-modal graph in-context learning, 4) multi-modal graph interaction with natural language, and 5) multi-modal graph reasoning. We then elaborate on the key challenges, review related works, and highlight promising future research directions towards realizing these ambitious characteristics. Finally, we summarize existing multi-modal graph datasets pertinent for model training. We believe this paper can contribute to the ongoing advancement of the research towards MG-LLM for generalization across multi-modal graph data and tasks.

[574] Adapting Vision-Language Models for Evaluating World Models

Mariya Hendriksen, Tabish Rashid, David Bignell, Raluca Georgescu, Abdelhak Lemkhenter, Katja Hofmann, Sam Devlin, Sarah Parisot

Main category: cs.LG

TL;DR: UNIVERSE is a Vision-Language Model-based evaluator for video world model rollouts that achieves strong alignment with human judgments across diverse environments.

Details

Motivation: Existing metrics fail to capture fine-grained, temporally grounded assessment of action alignment and semantic consistency needed for evaluating world model rollouts, despite VLMs showing promise as automatic evaluators.

Method: Developed UNIVERSE using parameter-efficient adaptation methods under data/compute constraints, exploring various task formats (binary, multiple-choice, open-ended), context lengths, sampling methods, and data compositions across extensive experiments.

Result: The unified evaluator achieves parity with task-specific checkpoints, and human studies across seven diverse environments confirm strong alignment with human judgments.

Conclusion: UNIVERSE establishes itself as a lightweight, adaptable, and semantics-aware evaluator for video world models that effectively addresses the fundamental challenge of evaluating world model rollouts.

Abstract: World models (generative models that simulate environment dynamics conditioned on past observations and actions) are gaining prominence in planning, simulation, and embodied AI. However, evaluating their rollouts remains a fundamental challenge, requiring fine-grained, temporally grounded assessment of action alignment and semantic consistency, capabilities not captured by existing metrics. Vision-Language Models (VLMs) have shown promise as automatic evaluators of generative content due to their strong multimodal reasoning abilities. Yet, their use in fine-grained, temporally sensitive evaluation tasks remains limited and requires targeted adaptation. We introduce an evaluation protocol targeting two recognition tasks, action recognition and character recognition, each assessed across binary, multiple-choice, and open-ended formats. To support this, we present UNIVERSE (UNIfied Vision-language Evaluator for Rollouts in Simulated Environments), a VLM-based evaluator for video world model rollouts adapted under data and compute constraints. In our extensive experiments totaling over 5,154 GPU-days, we explore full, partial, and parameter-efficient adaptation methods across various task formats, context lengths, sampling methods, and data compositions. The resulting unified evaluator achieves parity with task-specific checkpoints. Human studies across seven diverse environments confirm strong alignment with human judgments, establishing UNIVERSE as a lightweight, adaptable, and semantics-aware evaluator for video world models.

[575] Non-equilibrium Annealed Adjoint Sampler

Jaemoo Choi, Yongxin Chen, Molei Tao, Guan-Horng Liu

Main category: cs.LG

TL;DR: NAAS is a novel SOC-based diffusion sampler that uses annealed reference dynamics for more efficient sampling from unnormalized densities.

Details

Motivation: Existing diffusion samplers use uninformative reference processes that limit their ability to efficiently guide trajectories toward target distributions.

Method: Proposes Non-Equilibrium Annealed Adjoint Sampler (NAAS) using annealed reference dynamics as a non-stationary base SDE, with adjoint matching for efficient training.

Result: Demonstrated effectiveness across various tasks including sampling from classical energy landscapes and molecular Boltzmann distributions.

Conclusion: NAAS provides enhanced stability and efficiency in sampling through its annealed reference dynamics and flexible SOC framework.

Abstract: Recently, there has been significant progress in learning-based diffusion samplers, which aim to sample from a given unnormalized density. Many of these approaches formulate the sampling task as a stochastic optimal control (SOC) problem using a canonical uninformative reference process, which limits their ability to efficiently guide trajectories toward the target distribution. In this work, we propose the Non-Equilibrium Annealed Adjoint Sampler (NAAS), a novel SOC-based diffusion framework that employs annealed reference dynamics as a non-stationary base SDE. This annealing structure provides a natural progression toward the target distribution and generates informative reference trajectories, thereby enhancing the stability and efficiency of learning the control. Owing to our SOC formulation, our framework can incorporate a variety of SOC solvers, thereby offering high flexibility in algorithmic design. As one instantiation, we employ a lean adjoint system inspired by adjoint matching, enabling efficient and scalable training. We demonstrate the effectiveness of NAAS across a range of tasks, including sampling from classical energy landscapes and molecular Boltzmann distributions.
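
To make "annealed reference dynamics" concrete, here is a minimal sketch of plain annealed Langevin dynamics in one dimension. It is not the NAAS algorithm: there is no learned control and no adjoint system, and the geometric path between prior and target, the Euler-Maruyama step, and every name below are assumptions chosen for illustration.

```python
import math
import random

def annealed_langevin(grad_log_prior, grad_log_target, x0, n_steps, dt, rng):
    """Uncontrolled Langevin dynamics whose stationary target is annealed over time."""
    x = x0
    for k in range(n_steps):
        t = (k + 1) / n_steps  # annealing parameter in (0, 1]
        # Score of the assumed geometric path: pi_t proportional to prior^(1-t) * target^t.
        score = (1 - t) * grad_log_prior(x) + t * grad_log_target(x)
        # Euler-Maruyama step of dX = score dt + sqrt(2) dW.
        x = x + score * dt + math.sqrt(2 * dt) * rng.gauss(0.0, 1.0)
    return x

# Toy problem (assumed): prior N(0, 1), unnormalized target exp(-(x - 3)^2 / 2).
rng = random.Random(0)
samples = [
    annealed_langevin(lambda x: -x, lambda x: -(x - 3.0), rng.gauss(0.0, 1.0), 400, 0.05, rng)
    for _ in range(300)
]
mean = sum(samples) / len(samples)
```

In this toy setting the sample mean ends up close to, but slightly short of, the target mean of 3, because the uncontrolled reference process lags behind the moving target. Closing that kind of gap is the role the abstract assigns to the learned control.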

[576] Elucidated Rolling Diffusion Models for Probabilistic Weather Forecasting

Salva Rühling Cachay, Miika Aittala, Karsten Kreis, Noah Brenowitz, Arash Vahdat, Morteza Mardani, Rose Yu

Main category: cs.LG

TL;DR: ERDM unifies rolling forecast structure with Elucidated Diffusion Models to handle uncertainty propagation in probabilistic forecasting, outperforming baselines on fluid dynamics and weather forecasting tasks.

Details

Motivation: Existing diffusion models predict future states individually, struggling with temporal dependencies and uncertainty growth. Rolling diffusion frameworks address this but haven't been integrated with state-of-the-art diffusion techniques.

Method: Adapts EDM components (noise schedule, network preconditioning, Heun sampler) to rolling forecasts with novel loss weighting, efficient initialization using pre-trained EDM, and hybrid sequence architecture for spatiotemporal feature extraction.

Result: ERDM consistently outperforms diffusion-based baselines including conditional autoregressive EDM on 2D Navier-Stokes simulations and ERA5 global weather forecasting at 1.5-degree resolution.

Conclusion: ERDM provides a flexible and powerful framework for diffusion-based dynamics forecasting where modeling uncertainty propagation is crucial.

Abstract: Diffusion models are a powerful tool for probabilistic forecasting, yet most applications in high-dimensional complex systems predict future states individually. This approach struggles to model complex temporal dependencies and fails to explicitly account for the progressive growth of uncertainty inherent to the systems. While rolling diffusion frameworks, which apply increasing noise to forecasts at longer lead times, have been proposed to address this, their integration with state-of-the-art, high-fidelity diffusion techniques remains a significant challenge. We tackle this problem by introducing Elucidated Rolling Diffusion Models (ERDM), the first framework to successfully unify a rolling forecast structure with the principled, performant design of Elucidated Diffusion Models (EDM). To do this, we adapt the core EDM components (its noise schedule, network preconditioning, and Heun sampler) to the rolling forecast setting. The success of this integration is driven by three key contributions: (i) a novel loss weighting scheme that focuses model capacity on the mid-range forecast horizons where determinism gives way to stochasticity; (ii) an efficient initialization strategy using a pre-trained EDM for the initial window; and (iii) a bespoke hybrid sequence architecture for robust spatiotemporal feature extraction under progressive denoising. On 2D Navier-Stokes simulations and ERA5 global weather forecasting at 1.5-degree resolution, ERDM consistently outperforms key diffusion-based baselines, including conditional autoregressive EDM. ERDM offers a flexible and powerful general framework for tackling diffusion-based dynamics forecasting problems where modeling uncertainty propagation is paramount.
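
The idea of applying more noise at longer lead times can be sketched with the standard EDM noise schedule. The schedule formula and its defaults (sigma_min = 0.002, sigma_max = 80, rho = 7) come from the EDM paper; mapping one noise level to each position in a forecast window, as `rolling_window_sigmas` does, is only an assumption about how such a window might look, not ERDM's actual scheme.

```python
def karras_sigmas(n, sigma_min=0.002, sigma_max=80.0, rho=7.0):
    """EDM noise levels, ordered from sigma_max down to sigma_min (requires n >= 2)."""
    lo, hi = sigma_min ** (1 / rho), sigma_max ** (1 / rho)
    return [(hi + i / (n - 1) * (lo - hi)) ** rho for i in range(n)]

def rolling_window_sigmas(window):
    """Hypothetical helper: one noise level per lead time, smallest for the nearest step."""
    return karras_sigmas(window)[::-1]

sigmas = rolling_window_sigmas(5)
# sigmas[0] belongs to the nearest (almost clean) forecast step,
# sigmas[-1] to the farthest (almost pure noise) step.
```

In a rolling scheme of this kind, each denoising pass would lower every level by one notch, emit the now-clean first frame, and append a fresh fully noised frame at the end of the window.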

[577] Koopman operator-based discussion on partial observation in stochastic systems

Jun Ohkubo

Main category: cs.LG

TL;DR: This paper analyzes partial observation effects in stochastic systems using Koopman operator theory, connecting Mori-Zwanzig formalism with delay-embedding techniques for incomplete data.

Details

Motivation: Partial observations are often necessary when complete observation is difficult. While Mori-Zwanzig formalism handles partial observations in deterministic systems, there's a need to extend this to stochastic systems using Koopman operator theory.

Method: The study uses Koopman operator theory to analyze partial observation effects in stochastic systems, employing delay-embedding techniques and conducting numerical experiments with additive noise.

Result: Numerical experiments reveal power-law behavior of error with respect to additive noise amplitude. The delay-embedding technique proves beneficial even in stochastic systems for handling partial observations.

Conclusion: The work clarifies the importance of distinguishing state space and function space in stochastic systems, and establishes relationships between power-law exponents and partial observation effects.

Abstract: It is sometimes difficult to achieve a complete observation for a full set of observables, and partial observations are necessary. For deterministic systems, the Mori-Zwanzig formalism provides a theoretical framework for handling partial observations. Recently, data-driven algorithms based on the Koopman operator theory have made significant progress, and there is a discussion to connect the Mori-Zwanzig formalism with the Koopman operator theory. In this work, we discuss the effects of partial observation in stochastic systems using the Koopman operator theory. The discussion clarifies the importance of distinguishing the state space and the function space in stochastic systems. Even in stochastic systems, the delay-embedding technique is beneficial for partial observation, and several numerical experiments show a power-law behavior of error with respect to the amplitude of the additive noise. We also discuss the relation between the exponent of the power-law behavior and the effects of partial observation.
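
As a rough illustration of why delay embedding helps under partial observation, the sketch below observes only one coordinate of a two-dimensional rotation. This is a noise-free toy example chosen for clarity, not one of the paper's stochastic experiments, and the function and variable names are hypothetical.

```python
import math

def delay_embed(series, n_delays):
    """Stack each sample with its n_delays predecessors: [x_t, x_{t-1}, ..., x_{t-n_delays}]."""
    return [
        [series[t - k] for k in range(n_delays + 1)]
        for t in range(n_delays, len(series))
    ]

omega = 0.3
# Only the first coordinate of the rotation (cos, sin) is observed.
x = [math.cos(omega * t) for t in range(50)]
embedded = delay_embed(x, 1)  # rows are [x_t, x_{t-1}]

# With one delay, the next observation is an exact linear function of the row:
# x_{t+1} = 2 cos(omega) x_t - x_{t-1}.
pred = [2 * math.cos(omega) * row[0] - row[1] for row in embedded[:-1]]
max_err = max(abs(p - x[i + 2]) for i, p in enumerate(pred))
```

A single value x_t cannot determine x_{t+1}, since the unobserved sine coordinate is missing, but the delayed pair can. Additive noise, the setting the abstract studies, would turn this exact relation into an approximate one.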

[578] Improving Constrained Language Generation via Self-Distilled Twisted Sequential Monte Carlo

Sooyeon Kim, Giung Nam, Byoungwoo Park, Juho Lee

Main category: cs.LG

TL;DR: The paper proposes iterative self-distillation to improve constrained text generation when target distributions are unlikely under the base model, addressing sparse reward issues in twisted Sequential Monte Carlo approaches.

Details

Motivation: Constrained text generation faces challenges when target distributions concentrate on outputs unlikely under the base model, leading to sparse and uninformative reward signals that make learning difficult.

Method: The authors use iterative self-distillation to progressively refine the base model, making it more aligned with the target distribution and improving learning in constrained generation settings.

Result: The approach leads to substantial gains in generation quality by addressing the sparse reward problem in constrained generation scenarios.

Conclusion: Iterative self-distillation effectively improves constrained text generation by progressively aligning the base model with target distributions, overcoming challenges posed by sparse reward signals.

Abstract: Recent work has framed constrained text generation with autoregressive language models as a probabilistic inference problem. Among these, Zhao et al. (2024) introduced a promising approach based on twisted Sequential Monte Carlo, which incorporates learned twist functions and twist-induced proposals to guide the generation process. However, in constrained generation settings where the target distribution concentrates on outputs that are unlikely under the base model, learning becomes challenging due to sparse and uninformative reward signals. We show that iteratively refining the base model through self-distillation alleviates this issue by making the model progressively more aligned with the target, leading to substantial gains in generation quality.

[579] STAlloc: Enhancing Memory Efficiency in Large-Scale Model Training with Spatio-Temporal Planning

Zixiao Huang, Junhao Hu, Hao Lin, Chunyang Zhu, Yueran Tang, Quanlu Zhang, Zhen Guo, Zhenhua Li, Shengen Yan, Zhenhua Zhu, Guohao Dai, Yu Wang

Main category: cs.LG

TL;DR: STAlloc is a GPU memory allocator for deep learning frameworks that reduces memory fragmentation by exploiting spatial and temporal regularities in training workloads, combining offline planning with online allocation.

Details

Motivation: GPU memory pressure has increased due to scaling LLMs and training optimization techniques like virtual pipeline and recomputation, which cause memory fragmentation that wastes up to 43% of memory and triggers out-of-memory errors.

Method: STAlloc introduces a novel paradigm combining offline planning (leveraging spatio-temporal regularities for near-optimal allocation) with online allocation (handling dynamic models like MoE), built as a pluggable PyTorch memory allocator.

Result: STAlloc reduces fragmentation ratio by 85.1% on average (up to 100%) across dense and MoE models with negligible overhead, enabling more efficient training configurations and improving throughput by up to 32.5%.

Conclusion: STAlloc effectively addresses GPU memory fragmentation in deep learning training, significantly improving memory efficiency and training performance while maintaining low overhead.

Abstract: The rapid scaling of large language models (LLMs) has significantly increased GPU memory pressure, which is further aggravated by training optimization techniques such as virtual pipeline and recomputation that disrupt tensor lifespans and introduce considerable memory fragmentation. Such fragmentation stems from the use of online GPU memory allocators in popular deep learning frameworks like PyTorch, which disregard tensor lifespans. As a result, this inefficiency can waste as much as 43% of memory and trigger out-of-memory errors, undermining the effectiveness of optimization methods. To address this, we introduce STAlloc, a GPU memory allocator for deep learning frameworks that reduces fragmentation by exploiting the spatial and temporal regularity in memory allocation behaviors of training workloads. STAlloc introduces a novel paradigm that combines offline planning with online allocation. The offline planning leverages spatio-temporal regularities to generate a near-optimal allocation plan, while the online allocation handles complex and dynamic models such as Mixture-of-Experts (MoE). Built as a pluggable PyTorch memory allocator, STAlloc reduces fragmentation ratio on average by 85.1% (up to 100%) across both dense and MoE models, with negligible overhead. This enables more efficient, high-throughput training configurations and improves throughput performance by up to 32.5%.
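
To show what offline planning with known tensor lifespans can buy, here is a minimal sketch of a greedy-by-size offset planner. It is a generic textbook heuristic swapped in for illustration, not STAlloc's planner, and the tuple layout and names are assumptions.

```python
def plan_offsets(tensors):
    """Place tensors in one buffer so that tensors alive at the same time never overlap.

    tensors: list of (name, size, start, end) with lifetime [start, end).
    Returns ({name: offset}, peak_bytes).
    """
    placed = []  # (offset, size, start, end) of tensors placed so far
    offsets = {}
    for name, size, start, end in sorted(tensors, key=lambda t: -t[1]):
        # Address ranges held by already placed tensors whose lifetimes overlap this one.
        busy = sorted((o, o + s) for o, s, a, b in placed if a < end and start < b)
        offset = 0
        for lo, hi in busy:
            if offset + size <= lo:
                break  # the gap below this block is large enough
            offset = max(offset, hi)
        placed.append((offset, size, start, end))
        offsets[name] = offset
    peak = max((o + s for o, s, _, _ in placed), default=0)
    return offsets, peak

tensors = [("a", 4, 0, 2), ("b", 4, 2, 4), ("c", 2, 1, 3)]
offsets, peak = plan_offsets(tensors)
```

Here `a` and `b` never coexist, so they share offset 0 and the plan peaks at 6 units instead of the 10 a reuse-free layout would need. An online allocator that does not know lifespans in advance cannot make this kind of decision up front.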

[580] Scalable neural network-based blackbox optimization

Pavankumar Koratikere, Leifur Leifsson

Main category: cs.LG

TL;DR: SNBO is a scalable neural network-based blackbox optimization method that avoids model uncertainty estimation, using separate exploration/exploitation criteria with adaptive sampling control to achieve better performance with fewer evaluations.

Details

Motivation: Bayesian Optimization faces scalability issues in high dimensions due to GP computational complexity, while NN-based BO methods require intensive uncertainty estimation which is complex in high dimensions.

Method: Proposes SNBO that uses separate criteria for exploration and exploitation without model uncertainty estimation, with adaptive sampling region control for efficient optimization.

Result: SNBO outperforms four state-of-the-art baselines on problems from 10 to 102 dimensions, achieving better function values with 40-60% fewer evaluations and runtime reduction by at least 10x.

Conclusion: SNBO provides an effective and scalable alternative to traditional BO methods by eliminating the need for complex uncertainty estimation while maintaining optimization performance.

Abstract: Bayesian Optimization (BO) is a widely used approach for blackbox optimization that leverages a Gaussian process (GP) model and an acquisition function to guide future sampling. While effective in low-dimensional settings, BO faces scalability challenges in high-dimensional spaces and with a large number of function evaluations due to the computational complexity of GP models. In contrast, neural networks (NNs) offer better scalability and can model complex functions, which led to the development of NN-based BO approaches. However, these methods typically rely on estimating model uncertainty in NN predictions, a process that is often computationally intensive and complex, particularly in high dimensions. To address these limitations, a novel method, called scalable neural network-based blackbox optimization (SNBO), is proposed that does not rely on model uncertainty estimation. Specifically, SNBO adds new samples using separate criteria for exploration and exploitation, while adaptively controlling the sampling region to ensure efficient optimization. SNBO is evaluated on a range of optimization problems spanning from 10 to 102 dimensions and compared against four state-of-the-art baseline algorithms. Across the majority of test problems, SNBO attains function values better than the best-performing baseline algorithm, while requiring 40-60% fewer function evaluations and reducing the runtime by at least an order of magnitude.

[581] Interpretable Reward Model via Sparse Autoencoder

Shuyi Zhang, Wei Shi, Sihang Li, Jiayi Liao, Hengxing Cai, Xiang Wang

Main category: cs.LG

TL;DR: SARM integrates sparse autoencoders into reward models to create interpretable, feature-level reward attribution that adapts to preference shifts while outperforming traditional models.

Details

Motivation: Traditional reward models lack interpretability, offer limited insight into reward reasoning, and are inflexible to user preference shifts. Multidimensional RMs improve interpretability but fail at feature-level attribution and require costly annotations.

Method: Introduces SARM architecture that integrates pretrained Sparse Autoencoder (SAE) into reward model, mapping LLM hidden activations into interpretable sparse feature space, with scalar head aggregating features for transparent reward scores.

Result: Empirical evaluations show SARM enables direct feature-level reward attribution, allows dynamic adjustment to preference shifts, and achieves superior alignment performance compared to conventional reward models.

Conclusion: SARM provides a novel approach to reward modeling that combines interpretability, adaptability to preference shifts, and improved alignment performance through sparse autoencoder integration.

Abstract: Large language models (LLMs) have been widely deployed across numerous fields. Reinforcement Learning from Human Feedback (RLHF) leverages reward models (RMs) as proxies for human preferences to align LLM behaviors with human values, making the accuracy, reliability, and interpretability of RMs critical for effective alignment. However, traditional RMs lack interpretability, offer limited insight into the reasoning behind reward assignments, and are inflexible toward user preference shifts. While recent multidimensional RMs aim for improved interpretability, they often fail to provide feature-level attribution and require costly annotations. To overcome these limitations, we introduce the Sparse Autoencoder-enhanced Reward Model (SARM), a novel architecture that integrates a pretrained Sparse Autoencoder (SAE) into a reward model. SARM maps the hidden activations of LLM-based RM into an interpretable, sparse, and monosemantic feature space, from which a scalar head aggregates feature activations to produce transparent and conceptually meaningful reward scores. Empirical evaluations demonstrate that SARM facilitates direct feature-level attribution of reward assignments, allows dynamic adjustment to preference shifts, and achieves superior alignment performance compared to conventional reward models. Our code is available at https://github.com/schrieffer-z/sarm.
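
The data flow described above (hidden activation, sparse features, scalar head) can be sketched in a few lines. The ReLU encoder form, the toy weights, and all names are assumptions for illustration; the authors' implementation is in the linked repository.

```python
def sae_encode(hidden, w_enc, b_enc):
    """ReLU(W_enc @ hidden + b_enc): a common SAE encoder form, assumed here."""
    return [
        max(0.0, sum(w * h for w, h in zip(row, hidden)) + b)
        for row, b in zip(w_enc, b_enc)
    ]

def reward_with_attribution(hidden, w_enc, b_enc, head):
    """Return the scalar reward and each sparse feature's contribution to it."""
    feats = sae_encode(hidden, w_enc, b_enc)
    contributions = [w * f for w, f in zip(head, feats)]
    return sum(contributions), contributions

hidden = [1.0, 2.0]                                # toy hidden activation
w_enc = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]     # three toy SAE features
b_enc = [0.0, -1.0, 0.0]
head = [0.5, -2.0, 3.0]                            # linear scalar head over features
reward, contributions = reward_with_attribution(hidden, w_enc, b_enc, head)
```

Because the reward is a weighted sum of feature activations, each term in `contributions` says how much one feature pushed the score up or down. Under this assumed linear head, rescaling a single head weight would be one way to adjust the model to a preference shift.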

[582] Scaling Up Active Testing to Large Language Models

Gabrielle Berrada, Jannik Kossen, Freddie Bickford Smith, Muhammed Razzak, Yarin Gal, Tom Rainforth

Main category: cs.LG

TL;DR: Active testing for LLM evaluation can be made computationally efficient by using cheap in-context learning surrogates that don’t require updating, enabling more accurate performance assessments than random sampling.

Details

Motivation: Active testing provides label-efficient model evaluation but faces computational cost challenges when scaling to large language models, requiring cost-saving measures.

Method: Use in-context learning to create cheap surrogate models that guide data acquisition without updating during testing, can be smaller than target models, and can make acquisition decisions without target model predictions.

Result: Achieved more accurate evaluations of LLM performance compared to random data acquisition, and introduced a bootstrap estimator that reliably indicates active testing effectiveness.

Conclusion: Active testing can be effectively scaled to LLMs through computational optimizations, providing superior evaluation accuracy over random sampling while maintaining efficiency.

Abstract: Active testing enables label-efficient evaluation of predictive models through careful data acquisition, but it can pose a significant computational cost. We identify cost-saving measures that enable active testing to be scaled up to large language models (LLMs). In particular we show that the surrogate model used to guide data acquisition can be constructed cheaply using in-context learning, does not require updating within an active-testing loop, and can be smaller than the target model. We even find we can make good data-acquisition decisions without making predictions with the target model. As a result we are able to achieve much more accurate evaluations of LLM performance relative to using randomly acquired data. We additionally introduce a bootstrap estimator of evaluation error, which we show to be a useful indicator of how well active testing is working within a single run.
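
A minimal sketch of the underlying idea: let a surrogate's predicted losses decide which test points to label, then correct for the non-uniform sampling with importance weights. This uses plain with-replacement importance sampling as a simplification, not the estimator used in the paper, and all names are hypothetical.

```python
import random

def active_test_estimate(true_loss, surrogate_loss, n_labels, rng):
    """Estimate the mean test loss from n_labels labels acquired via a surrogate.

    surrogate_loss must be strictly positive for every test point.
    """
    n = len(surrogate_loss)
    total = sum(surrogate_loss)
    q = [s / total for s in surrogate_loss]  # acquisition distribution
    picks = rng.choices(range(n), weights=q, k=n_labels)
    # The importance weight 1 / (n * q_i) keeps the estimate of the mean loss unbiased.
    return sum(true_loss(i) / (n * q[i]) for i in picks) / n_labels

losses = [0.1, 0.9, 0.5, 0.5]  # per-example losses of the target model (mean 0.5)
# Idealized case: the surrogate predicts the true losses exactly.
estimate = active_test_estimate(losses.__getitem__, losses, n_labels=3, rng=random.Random(0))
```

With a perfect surrogate every sampled term equals the true mean, so the estimate has zero variance. A real surrogate is imperfect, but the closer its predicted losses track the true ones, the fewer labels are needed for a given accuracy.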

[583] Relative Advantage Debiasing for Watch-Time Prediction in Short-Video Recommendation

Emily Liu, Kuan Han, Minfeng Zhan, Bocheng Zhao, Guanyu Mu, Yang Song

Main category: cs.LG

TL;DR: Proposes a relative advantage debiasing framework to correct watch time biases in video recommendations by comparing to reference distributions, using quantile-based preference signals and two-stage architecture.

Details

Motivation: Raw watch times are influenced by confounding factors like video duration and popularity, which distort preference signals and create biased recommendation models.

Method: Relative advantage debiasing framework that corrects watch time using empirically derived reference distributions conditioned on user and item groups, with quantile-based preference signals and two-stage architecture separating distribution estimation from preference learning. Uses distributional embeddings to parameterize watch-time quantiles.

Result: Significant improvements in recommendation accuracy and robustness demonstrated through both offline and online experiments compared to existing baseline methods.

Conclusion: The proposed framework effectively addresses watch time biases and improves recommendation performance by using relative advantage debiasing and quantile-based preference signals.

Abstract: Watch time is widely used as a proxy for user satisfaction in video recommendation platforms. However, raw watch times are influenced by confounding factors such as video duration, popularity, and individual user behaviors, potentially distorting preference signals and resulting in biased recommendation models. We propose a novel relative advantage debiasing framework that corrects watch time by comparing it to empirically derived reference distributions conditioned on user and item groups. This approach yields a quantile-based preference signal and introduces a two-stage architecture that explicitly separates distribution estimation from preference learning. Additionally, we present distributional embeddings to efficiently parameterize watch-time quantiles without requiring online sampling or storage of historical data. Both offline and online experiments demonstrate significant improvements in recommendation accuracy and robustness compared to existing baseline methods.
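
The quantile-based signal can be illustrated with an empirical CDF: the same raw watch time means different things depending on the reference group. The grouping by video length and the sample values below are invented for illustration, and the paper's distributional embeddings are not modeled here.

```python
from bisect import bisect_right

def relative_advantage(watch_time, reference_times):
    """Fraction of reference watch times that are <= the observed one (empirical CDF)."""
    ref = sorted(reference_times)
    return bisect_right(ref, watch_time) / len(ref)

short_video_ref = [2, 5, 8, 10, 15]    # watch times (s) seen on short videos (made up)
long_video_ref = [10, 30, 45, 60, 90]  # watch times (s) seen on long videos (made up)

on_short = relative_advantage(12, short_video_ref)
on_long = relative_advantage(12, long_video_ref)
```

Twelve seconds is a strong signal on a short video (quantile 0.8) and a weak one on a long video (quantile 0.2), which is the kind of duration bias the framework is meant to remove.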

[584] DE-VAE: Revealing Uncertainty in Parametric and Inverse Projections with Variational Autoencoders using Differential Entropy

Frederik L. Dennig, Daniel A. Keim

Main category: cs.LG

TL;DR: DE-VAE is an uncertainty-aware variational autoencoder that uses differential entropy to create parametric and invertible projections for multidimensional data, addressing limitations with out-of-distribution samples.

Details

Motivation: Existing autoencoder methods perform poorly with out-of-distribution samples in data or embedding space, creating a need for more robust parametric and invertible projection methods.

Method: DE-VAE uses differential entropy in a variational autoencoder framework to learn mappings between original data space and 2D embedding space, with quantitative and qualitative evaluation against UMAP and t-SNE baselines.

Result: DE-VAE achieves comparable accuracy to other AE-based approaches while enabling embedding uncertainty analysis, as demonstrated on four well-known datasets.

Conclusion: The proposed DE-VAE successfully creates parametric and invertible projections with uncertainty awareness, addressing limitations of existing methods with out-of-distribution samples.

Abstract: Recently, autoencoders (AEs) have gained interest for creating parametric and invertible projections of multidimensional data. Parametric projections make it possible to embed new, unseen samples without recalculating the entire projection, while invertible projections allow the synthesis of new data instances. However, existing methods perform poorly when dealing with out-of-distribution samples in either the data or embedding space. Thus, we propose DE-VAE, an uncertainty-aware variational AE using differential entropy (DE) to improve the learned parametric and invertible projections. Given a fixed projection, we train DE-VAE to learn a mapping into 2D space and an inverse mapping back to the original space. We conduct quantitative and qualitative evaluations on four well-known datasets, using UMAP and t-SNE as baseline projection methods. Our findings show that DE-VAE can create parametric and inverse projections with comparable accuracy to other current AE-based approaches while enabling the analysis of embedding uncertainty.
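
For reference, the differential entropy of a diagonal Gaussian, the posterior family a VAE encoder typically outputs, has a closed form. The sketch below computes it from log-variances; how DE-VAE uses this quantity in its objective is not stated in the abstract, so nothing here should be read as the paper's loss.

```python
import math

def gaussian_differential_entropy(log_var):
    """h = 0.5 * sum_i log(2 * pi * e * sigma_i^2) for a diagonal Gaussian, in nats."""
    return 0.5 * sum(math.log(2 * math.pi * math.e) + lv for lv in log_var)

confident = gaussian_differential_entropy([0.0, 0.0])   # unit variance in 2D
uncertain = gaussian_differential_entropy([2.0, 2.0])   # larger variance in 2D
```

A wider posterior gives a larger entropy, so the value can serve as a per-sample uncertainty score in a 2D embedding.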

[585] Unlearning as Ablation: Toward a Falsifiable Benchmark for Generative Scientific Discovery

Robert Yang

Main category: cs.LG

TL;DR: Proposes unlearning-as-ablation as a method to test whether LLMs can generate new scientific knowledge or just remix memorized content by systematically removing target results and evaluating re-derivation capability.

Details

Motivation: To address the epistemic question of whether AI models truly generate new knowledge or merely remix memorized fragments, distinguishing between generative capability and recall in scientific discovery.

Method: Systematically remove target results along with their forget-closure (supporting lemmas, paraphrases, multi-hop entailments), then evaluate if models can re-derive results from permitted axioms and tools.

Result: Conceptual framework proposed; outlines feasibility pilot in mathematics and algorithms, with potential extension to physics/chemistry. No empirical results presented as this is a position paper.

Conclusion: Unlearning-as-ablation serves as an epistemic probe to distinguish knowledge reconstruction from retrieval, potentially guiding next-generation AI-for-Science benchmarks.

Abstract: Bold claims about AI’s role in science, from “AGI will cure all diseases” to promises of radically accelerated discovery, raise a central epistemic question: do large language models (LLMs) truly generate new knowledge, or do they merely remix memorized fragments? We propose unlearning-as-ablation as a falsifiable probe of constructive scientific discovery. The idea is to systematically remove a target result together with its forget-closure (supporting lemmas, paraphrases, and multi-hop entailments) and then evaluate whether the model can re-derive the result from only permitted axioms and tools. Success would indicate generative capability beyond recall; failure would expose current limits. Unlike prevailing motivations for unlearning (privacy, copyright, or safety), our framing repositions it as an epistemic probe for AI-for-Science. We outline a minimal pilot in mathematics and algorithms to illustrate feasibility, and sketch how the same approach could later be extended to domains such as physics or chemistry. This is a position paper: our contribution is conceptual and methodological, not empirical. We aim to stimulate discussion on how principled ablation tests could help distinguish models that reconstruct knowledge from those that merely retrieve it, and how such probes might guide the next generation of AI-for-Science benchmarks.

[586] Iterative Inference in a Chess-Playing Neural Network

Elias Sandmann, Sebastian Lapuschkin, Wojciech Samek

Main category: cs.LG

TL;DR: Analysis of Leela Chess Zero’s policy network reveals neural networks build representations through distinct computational phases with late-layer reversals, not smooth refinement.

Details

Motivation: To understand whether neural networks build representations through smooth refinement or complex computational processes by examining a superhuman chess engine.

Method: Extended the logit lens to analyze Leela Chess Zero’s policy network, examining playing strength, puzzle-solving ability, move preferences, and concept preferences across layers.

Result: Capability progression occurs in distinct phases; move rankings poorly correlate with final outputs until late layers; correct puzzle solutions found in middle layers are sometimes overridden; final layers prioritize safety over aggression.

Conclusion: Neural networks use complex computational processes with late-layer reversals where heuristic priors can override tactical solutions, rather than smooth gradual refinement.

Abstract: Do neural networks build their representations through smooth, gradual refinement, or via more complex computational processes? We investigate this by extending the logit lens to analyze the policy network of Leela Chess Zero, a superhuman chess engine. Although playing strength and puzzle-solving ability improve consistently across layers, capability progression occurs in distinct computational phases with move preferences undergoing continuous reevaluation: move rankings remain poorly correlated with final outputs until late, and correct puzzle solutions found in middle layers are sometimes overridden. This late-layer reversal is accompanied by concept preference analyses showing final layers prioritize safety over aggression, suggesting a mechanism by which heuristic priors can override tactical solutions.
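
The basic logit lens, as known from language-model interpretability, decodes each intermediate hidden state with the model's final output head. The sketch below shows that idea on toy numbers. The paper's extension to the Leela Chess Zero policy network is not reproduced, the final normalization layer that real models usually apply is omitted, and all values are invented.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def logit_lens(layer_states, unembed):
    """Decode every intermediate state with the final output head."""
    out = []
    for h in layer_states:
        logits = [sum(w * x for w, x in zip(row, h)) for row in unembed]
        out.append(softmax(logits))
    return out

# Toy example: three layers, two candidate moves, identity output head.
layer_states = [[2.0, 0.0], [1.0, 0.5], [0.0, 2.0]]
unembed = [[1.0, 0.0], [0.0, 1.0]]
per_layer = logit_lens(layer_states, unembed)
top_moves = [max(range(len(p)), key=p.__getitem__) for p in per_layer]
```

In this toy trace the preferred move stays at index 0 through the first two layers and flips to index 1 only at the last one, the kind of late reversal the abstract reports for some puzzle positions.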

[587] Vendi Information Gain for Active Learning and its Application to Ecology

Quan Nguyen, Adji Bousso Dieng

Main category: cs.LG

TL;DR: Vendi information gain (VIG) is a new active learning method that selects images based on dataset-wide prediction uncertainty, outperforming traditional methods in species identification from camera trap data.

Details

Motivation: Species identification in camera trap images is bottlenecked by limited labeling resources, and existing active learning methods focus only on individual prediction uncertainty without considering dataset-wide uncertainty.

Method: VIG selects images based on their impact on dataset-wide prediction uncertainty, capturing both informativeness and diversity, and was applied to the Snapshot Serengeti dataset.

Result: VIG achieved 75% accuracy with only 3% of data (vs 10%+ for baselines) and 88% accuracy with 10% of data (12% higher than best baseline), with consistent improvements across metrics and more diverse data collection.

Conclusion: VIG offers significant value for biodiversity monitoring in data-limited environments and has broad applicability beyond ecology.

Abstract: While monitoring biodiversity through camera traps has become an important endeavor for ecological research, identifying species in the captured image data remains a major bottleneck due to limited labeling resources. Active learning – a machine learning paradigm that selects the most informative data to label and train a predictive model – offers a promising solution, but typically focuses on uncertainty in the individual predictions without considering uncertainty across the entire dataset. We introduce a new active learning policy, Vendi information gain (VIG), that selects images based on their impact on dataset-wide prediction uncertainty, capturing both informativeness and diversity. We applied VIG to the Snapshot Serengeti dataset and compared it against common active learning methods. VIG needs only 3% of the available data to reach 75% accuracy, a level that baselines require more than 10% of the data to achieve. With 10% of the data, VIG attains 88% predictive accuracy, 12% higher than the best of the baselines. This improvement in performance is consistent across metrics and batch sizes, and we show that VIG also collects more diverse data in the feature space. VIG has broad applicability beyond ecology, and our results highlight its value for biodiversity monitoring in data-limited environments.

[588] Simple, Fast and Efficient Injective Manifold Density Estimation with Random Projections

Ahmad Ayaz Amin, Baha Uddin Kazi

Main category: cs.LG

TL;DR: Random Projection Flows (RPFs) are injective normalizing flows using random semi-orthogonal matrices from Haar-distributed orthogonal ensembles for efficient data projection and closed-form Riemannian volume correction.

Details

Motivation: To create a principled framework for injective normalizing flows that bridges random projection theory with normalizing flows, offering plug-and-play efficiency without the limitations of PCA-based flows or learned injective maps.

Method: Uses random semi-orthogonal matrices drawn from Haar-distributed orthogonal ensembles via QR decomposition of Gaussian matrices to project data into lower-dimensional latent spaces for base distributions.

Result: RPFs provide closed-form expressions for Riemannian volume correction terms and demonstrate both theoretical grounding and practical effectiveness in generative modeling.

Conclusion: RPFs establish a strong baseline for generative modeling and successfully connect random projection theory with normalizing flows through their principled, efficient framework.

Abstract: We introduce Random Projection Flows (RPFs), a principled framework for injective normalizing flows that leverages tools from random matrix theory and the geometry of random projections. RPFs employ random semi-orthogonal matrices, drawn from Haar-distributed orthogonal ensembles via QR decomposition of Gaussian matrices, to project data into lower-dimensional latent spaces for the base distribution. Unlike PCA-based flows or learned injective maps, RPFs are plug-and-play, efficient, and yield closed-form expressions for the Riemannian volume correction term. We demonstrate that RPFs are both theoretically grounded and practically effective, providing a strong baseline for generative modeling and a bridge between random projection theory and normalizing flows.
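
The sampling step described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names (`random_semi_orthogonal`, `project`, `lift`) are hypothetical, and Gram-Schmidt is used in place of a library QR routine (it corresponds to a QR decomposition with positive diagonal in R). The sketch only covers a purely linear embedding x = Qz; in that special case Q^T Q = I, so the volume term (1/2) log det(Q^T Q) is zero, which is one simple instance of a closed-form correction.

```python
import math
import random


def random_semi_orthogonal(D, d, rng):
    """Return d orthonormal column vectors of length D (a D x d semi-orthogonal matrix).

    Gram-Schmidt applied to i.i.d. Gaussian columns; assumes d <= D.
    """
    cols = []
    for _ in range(d):
        v = [rng.gauss(0.0, 1.0) for _ in range(D)]
        # Two orthogonalization passes for numerical stability.
        for _ in range(2):
            for q in cols:
                dot = sum(a * b for a, b in zip(v, q))
                v = [a - dot * b for a, b in zip(v, q)]
        norm = math.sqrt(sum(a * a for a in v))
        cols.append([a / norm for a in v])
    return cols


def project(cols, x):
    """Latent code z = Q^T x (data space -> lower-dimensional latent space)."""
    return [sum(qi * xi for qi, xi in zip(q, x)) for q in cols]


def lift(cols, z):
    """Embedding x = Q z (latent space -> data space)."""
    D = len(cols[0])
    return [sum(z[j] * cols[j][i] for j in range(len(cols))) for i in range(D)]


rng = random.Random(0)
Q = random_semi_orthogonal(8, 3, rng)
# Gram matrix Q^T Q; it should be (numerically) the 3 x 3 identity.
gram = [[sum(a * b for a, b in zip(qi, qj)) for qj in Q] for qi in Q]
z = [0.5, -1.0, 2.0]
z_back = project(Q, lift(Q, z))  # projecting an embedded point recovers z
```

Because the matrix is sampled rather than learned, no training is needed for the projection itself, which is the sense in which the abstract calls the approach plug-and-play.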

[589] Realistic CDSS Drug Dosing with End-to-end Recurrent Q-learning for Dual Vasopressor Control

Will Y. Zou, Jean Feng, Alexandre Kalimouttou, Jennifer Yuntong Zhang, Christopher W. Seymour, Romain Pirracchio

Main category: cs.LG

TL;DR: An offline RL framework for dual vasopressor dosing in ICUs that uses principled action space design to ensure clinically operable recommendations while improving performance.

Details

Motivation: Address skepticism in RL-based CDSS by preventing inoperable dosing decisions through better action space design for vasopressor administration in ICUs.

Method: End-to-end offline RL with discrete, continuous, and directional dosing strategies, conservative Q-learning, and recurrent modeling with replay buffer for temporal dependencies in ICU data.

Result: Action space design significantly influences learned policies, achieving >3x expected reward improvements on eICU and MIMIC datasets while aligning with clinical protocols.

Conclusion: Principled action space design improves interpretability, facilitates clinical adoption, and preserves efficacy in RL-based clinical decision support systems.

Abstract: Reinforcement learning (RL) applications in Clinical Decision Support Systems (CDSS) frequently encounter skepticism because models may recommend inoperable dosing decisions. We propose an end-to-end offline RL framework for dual vasopressor administration in Intensive Care Units (ICUs) that directly addresses this challenge through principled action space design. Our method integrates discrete, continuous, and directional dosing strategies with conservative Q-learning and incorporates a novel recurrent modeling approach that uses a replay buffer to capture temporal dependencies in ICU time-series data. Our comparative analysis of norepinephrine dosing strategies across different action space formulations reveals that the designed action spaces improve interpretability and facilitate clinical adoption while preserving efficacy. Empirical results on eICU and MIMIC demonstrate that action space design profoundly influences learned behavioral policies. Compared with baselines, the proposed methods achieve more than 3x expected reward improvements, while aligning with established clinical protocols.

[590] Memory Self-Regeneration: Uncovering Hidden Knowledge in Unlearned Models

Agnieszka Polowczyk, Alicja Polowczyk, Joanna Waczyńska, Piotr Borycki, Przemysław Spurek

Main category: cs.LG

TL;DR: The paper introduces MemoRa, a regenerative approach for machine unlearning that helps recover previously lost knowledge, and proposes robustness in knowledge retrieval as a key evaluation measure.

Details

Motivation: Modern text-to-image models can generate harmful content, accelerating the need for machine unlearning. However, current unlearning methods struggle with truly forgetting concepts, as models can still generate unlearned content through adversarial prompts.

Method: The paper introduces the Memory Self-Regeneration task and presents the MemoRa strategy, a regenerative approach that supports effective recovery of previously lost knowledge. It also distinguishes between short-term and long-term forgetting.

Result: The study demonstrates that forgetting occurs in two distinct ways: short-term (quick recall) and long-term (difficult recovery). MemoRa effectively supports knowledge recovery in machine unlearning scenarios.

Conclusion: Robustness in knowledge retrieval is a crucial evaluation measure for developing more effective unlearning techniques. The proposed MemoRa approach provides a regenerative solution for managing knowledge retention and recovery in machine unlearning.

Abstract: The impressive capability of modern text-to-image models to generate realistic visuals has come with a serious drawback: they can be misused to create harmful, deceptive or unlawful content. This has accelerated the push for machine unlearning. This new field seeks to selectively remove specific knowledge from a model’s training data without causing a drop in its overall performance. However, it turns out that actually forgetting a given concept is an extremely difficult task. Models exposed to attacks using adversarial prompts show the ability to generate so-called unlearned concepts, which can be not only harmful but also illegal. In this paper, we present considerations regarding the ability of models to forget and recall knowledge, introducing the Memory Self-Regeneration task. Furthermore, we present the MemoRa strategy, which we consider to be a regenerative approach supporting the effective recovery of previously lost knowledge. Moreover, we propose that robustness in knowledge retrieval is a crucial yet underexplored evaluation measure for developing more robust and effective unlearning techniques. Finally, we demonstrate that forgetting occurs in two distinct ways: short-term, where concepts can be quickly recalled, and long-term, where recovery is more challenging. Code is available at https://gmum.github.io/MemoRa/.

[591] FedQS: Optimizing Gradient and Model Aggregation for Semi-Asynchronous Federated Learning

Yunbo Li, Jiaping Gui, Zhihang Deng, Fanchao Meng, Yue Wu

Main category: cs.LG

TL;DR: FedQS is a novel semi-asynchronous federated learning framework that optimizes both gradient and model aggregation strategies by classifying clients into four types and adaptively optimizing their local training, achieving superior accuracy, stability, and convergence speed.

Details

Motivation: Semi-asynchronous FL (SAFL) faces challenges in optimizing gradient-based and model-based aggregation strategies, which have distinct trade-offs: gradient aggregation offers faster convergence but suffers from fluctuations, while model aggregation provides stability but slower convergence and suboptimal accuracy.

Method: FedQS introduces a divide-and-conquer strategy that classifies clients into four distinct types based on data distribution characteristics and computational resources, then adaptively optimizes their local training to bridge the gap between aggregation strategies.

Result: Extensive experiments on computer vision, NLP, and real-world tasks show FedQS achieves the highest accuracy, lowest loss, and ranks among the fastest in convergence speed, outperforming state-of-the-art baselines.

Conclusion: FedQS bridges the gap between aggregation strategies in SAFL, offering a unified solution for stable, accurate, and efficient federated learning.

Abstract: Federated learning (FL) enables collaborative model training across multiple parties without sharing raw data, with semi-asynchronous FL (SAFL) emerging as a balanced approach between synchronous and asynchronous FL. However, SAFL faces significant challenges in optimizing both gradient-based (e.g., FedSGD) and model-based (e.g., FedAvg) aggregation strategies, which exhibit distinct trade-offs in accuracy, convergence speed, and stability. While gradient aggregation achieves faster convergence and higher accuracy, it suffers from pronounced fluctuations, whereas model aggregation offers greater stability but slower convergence and suboptimal accuracy. This paper presents FedQS, the first framework to theoretically analyze and address these disparities in SAFL. FedQS introduces a divide-and-conquer strategy to handle client heterogeneity by classifying clients into four distinct types and adaptively optimizing their local training based on data distribution characteristics and available computational resources. Extensive experiments on computer vision, natural language processing, and real-world tasks demonstrate that FedQS achieves the highest accuracy, attains the lowest loss, and ranks among the fastest in convergence speed, outperforming state-of-the-art baselines. Our work bridges the gap between aggregation strategies in SAFL, offering a unified solution for stable, accurate, and efficient federated learning. The code and datasets are available at https://github.com/bkjod/FedQS_.
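
The two aggregation strategies the abstract contrasts can be sketched on a toy problem. This is only an illustration of generic gradient aggregation (FedSGD-style) versus model aggregation (FedAvg-style); it does not model semi-asynchrony or FedQS's four client types, and the function names and the quadratic client losses are assumptions made for the example.

```python
def fedsgd_step(w, client_grads, weights, lr):
    """Gradient aggregation: weighted-average the client gradients, then take one server step."""
    total = sum(weights)
    avg = [
        sum(wt * g[i] for wt, g in zip(weights, client_grads)) / total
        for i in range(len(w))
    ]
    return [wi - lr * gi for wi, gi in zip(w, avg)]


def fedavg_step(client_models, weights):
    """Model aggregation: weighted-average the locally trained client models."""
    total = sum(weights)
    dim = len(client_models[0])
    return [
        sum(wt * m[i] for wt, m in zip(weights, client_models)) / total
        for i in range(dim)
    ]


def local_sgd(w, grad_fn, lr, steps):
    """Run a few plain SGD steps on one client, starting from the global model w."""
    for _ in range(steps):
        g = grad_fn(w)
        w = [wi - lr * gi for wi, gi in zip(w, g)]
    return w


# Toy setup (assumed): client k minimizes 0.5 * (w - c_k)^2, so its gradient is w - c_k.
w0 = [0.0]
centers = [[1.0], [3.0]]
weights = [1.0, 1.0]
lr = 0.1
grad_fns = [lambda w, c=c: [wi - ci for wi, ci in zip(w, c)] for c in centers]

w_sgd = fedsgd_step(w0, [g(w0) for g in grad_fns], weights, lr)
w_avg_1 = fedavg_step([local_sgd(w0, g, lr, 1) for g in grad_fns], weights)
w_avg_5 = fedavg_step([local_sgd(w0, g, lr, 5) for g in grad_fns], weights)
```

With a single local step and equal learning rates the two rules coincide on this toy problem; they diverge once clients take several local steps or report at different times, which is where the trade-offs described in the abstract arise.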

[592] Optimally Deep Networks – Adapting Model Depth to Datasets for Superior Efficiency

Shaharyar Ahmed Khan Tareen, Filza Khan Tareen

Main category: cs.LG

TL;DR: ODNs optimize neural network depth to match task complexity, reducing model size and computational costs while maintaining accuracy through progressive depth expansion training.

Details

Motivation: Deep neural networks are often over-engineered for simple tasks, leading to wasted computation, high energy consumption, and impractical deployment on resource-constrained devices.

Method: Progressive depth expansion training strategy that starts with shallow networks and incrementally increases depth as earlier blocks converge, using only the optimal depth needed for the task.

Result: Achieved 98.64% and 96.44% reduction in memory footprint for ResNet-18 and ResNet-34 on MNIST and SVHN datasets while maintaining competitive accuracy of 99.31% and 96.08% respectively.

Conclusion: ODNs provide an effective approach to balance model depth with task complexity, significantly reducing computational costs and memory usage while preserving performance, enabling practical deployment on edge devices.

Abstract: Deep neural networks (DNNs) have provided brilliant performance across various tasks. However, this success often comes at the cost of unnecessarily large model sizes, high computational demands, and substantial memory footprints. Typically, powerful architectures are trained at full depth, but not all datasets or tasks require such high model capacity. Training big and deep architectures on relatively low-complexity datasets frequently leads to wasted computation, unnecessary energy consumption, and excessive memory usage, which in turn makes deployment of models on resource-constrained devices impractical. To address this problem, we introduce the concept of Optimally Deep Networks (ODNs), which provides a balance between model depth and task complexity. Specifically, we propose a NAS-like training strategy called progressive depth expansion, which begins by training neural networks at shallower depths and incrementally increases their depth as the earlier blocks converge, continuing this process until the target accuracy is reached. ODNs use only the optimal depth for the tasks at hand, removing redundant layers. This cuts down future training and inference costs, lowers the model memory footprint, enhances computational efficiency, and facilitates deployment on edge devices. Empirical results show that the optimal depths of ResNet-18 and ResNet-34 for MNIST and SVHN achieve up to 98.64% and 96.44% reductions in memory footprint, while maintaining competitive accuracies of 99.31% and 96.08%, respectively.
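
The control flow of progressive depth expansion, as described in the abstract, might look roughly like the sketch below. The callback `train_and_eval`, the one-block-at-a-time schedule, and the toy accuracy curve are all hypothetical stand-ins; the paper's actual convergence criterion and training details are not given here.

```python
def progressive_depth_expansion(train_and_eval, max_depth, target_acc):
    """Grow the network one block at a time until the target accuracy is reached.

    train_and_eval(depth) is a hypothetical callback that trains the model
    truncated to `depth` blocks and returns its validation accuracy.
    Returns ((depth, accuracy), history) for the last depth tried.
    """
    if max_depth < 1:
        raise ValueError("max_depth must be at least 1")
    history = []
    for depth in range(1, max_depth + 1):
        acc = train_and_eval(depth)
        history.append((depth, acc))
        if acc >= target_acc:
            break  # shallowest depth that meets the target; deeper blocks are never built
    return history[-1], history


def toy_train_and_eval(depth):
    """Stand-in accuracy curve with diminishing returns (not real training)."""
    return 1.0 - 0.5 ** depth


chosen, history = progressive_depth_expansion(toy_train_and_eval, max_depth=8, target_acc=0.95)
```

In this toy run the loop stops before reaching `max_depth`, so the unused blocks contribute nothing to memory or inference cost, which mirrors the saving the abstract reports.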

[593] ELUTQ: Efficient LUT-Aware Quantization for Deploying Large Language Models on Edge Devices

Xin Nie, Liang Dong, Haicheng Zhang, Jiawang Xiao, G. Sun

Main category: cs.LG

TL;DR: ELUTQ is an efficient quantization framework using Hierarchical Linear Quantization (HLQ) that improves weight distribution fitting without increasing computational cost, enabling high-performance deployment of large language models on CPU-based edge devices.

Details

Motivation: Existing uniform quantization methods suffer from poor weight-distribution fitting and high dequantization overhead under low-bit settings, limiting efficient deployment of large language models on edge devices.

Method: Proposes Hierarchical Linear Quantization (HLQ) format that captures statistical characteristics of weights without increasing computational cost of bit-serial LUT-based GEMM operations, eliminating dequantization overhead. Combines with post-training quantization and efficient finetuning techniques, and integrates disk-offload for large models.

Result: For LLaMA3.1-8B: ~8% perplexity reduction at 3-bit and 85% reduction at 2-bit compared to uniform quantization. Enables quantization of LLaMA3.1-70B with only 64GB CPU memory and 48GB VRAM. Achieves >25 tokens/sec throughput for 2-bit LLaMA2-7B on Apple M2 chip.

Conclusion: ELUTQ provides an efficient quantization framework that significantly reduces hardware requirements and enables high-performance deployment of large language models on edge devices while maintaining model accuracy.

Abstract: Weight quantization effectively reduces memory consumption and enables the deployment of large language models on CPU-based edge devices, yet existing hardware-friendly methods often rely on uniform quantization, which suffers from poor weight-distribution fitting and high dequantization overhead under low-bit settings. In this paper, we propose ELUTQ, an efficient quantization framework featuring a novel quantization format termed Hierarchical Linear Quantization (HLQ). HLQ is designed to better capture the statistical characteristics of weights without increasing the computational cost of bit-serial LUT-based GEMM operations, thereby eliminating dequantization overhead. HLQ is orthogonal to existing quantization algorithms. For the LLaMA3.1-8B model, when combined with post-training quantization, HLQ improves uniform quantization by achieving approximately 8 percent perplexity reduction at 3-bit precision and 85 percent perplexity reduction at 2-bit precision. When combined with efficient finetuning techniques, HLQ further improves model accuracy. We also integrate a disk-offload technique into ELUTQ, enabling it to complete the quantization of LLaMA3.1-70B using only 64 GB of CPU memory and 48 GB of VRAM, significantly reducing the hardware requirements for large-scale model quantization. To enable efficient deployment on edge devices, ELUTQ provides high-performance CPU kernels to support end-to-end inference. Under a 4-thread configuration with batch size 1, our 2-bit quantized LLaMA2-7B model achieves a throughput of more than 25 tokens per second on an Apple M2 chip. All the code is available at https://github.com/Nkniexin/ELUTQ.

[594] Generalization Bounds for Rank-sparse Neural Networks

Antoine Ledent, Rodrigo Alves, Yunwen Lei

Main category: cs.LG

TL;DR: This paper investigates how the low-rank bottleneck property of neural networks affects generalization, proving bounds that exploit this structure and showing sample complexity depends on network width, depth, and weight matrix ranks.

Details

Motivation: Recent observations show neural networks trained with gradient methods exhibit low-rank activation and weight matrices, converging to a 'bottleneck rank' that matches the minimum rank needed to represent training data. This phenomenon relates to how weight decay regularizes linear networks by minimizing Schatten p quasi norms.

Method: The authors prove generalization bounds for neural networks that leverage the approximate low-rank structure of weight matrices. The analysis uses Schatten p quasi norms of weight matrices to characterize the bounds.

Result: For small p values, the generalization bounds show a sample complexity of Õ(WrL²), where W is the width, L is the depth, and r is the rank of the weight matrices. As p increases, the bounds behave more like traditional norm-based bounds.

Conclusion: The low-rank bottleneck property in neural networks can be exploited to derive improved generalization bounds, with sample complexity scaling favorably with the rank of weight matrices rather than just network size parameters.

Abstract: It has been recently observed in much of the literature that neural networks exhibit a bottleneck rank property: for larger depths, the activation and weights of neural networks trained with gradient-based methods tend to be of approximately low rank. In fact, the rank of the activations of each layer converges to a fixed value referred to as the “bottleneck rank”, which is the minimum rank required to represent the training data. This perspective is in line with the observation that regularizing linear networks (without activations) with weight decay is equivalent to minimizing the Schatten $p$ quasi norm of the neural network. In this paper we investigate the implications of this phenomenon for generalization. More specifically, we prove generalization bounds for neural networks which exploit the approximate low rank structure of the weight matrices if present. The final results rely on the Schatten $p$ quasi norms of the weight matrices: for small $p$, the bounds exhibit a sample complexity $\widetilde{O}(WrL^2)$ where $W$ and $L$ are the width and depth of the neural network respectively and where $r$ is the rank of the weight matrices. As $p$ increases, the bound behaves more like a norm-based bound instead.
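
For readers unfamiliar with the quantity, the standard definition of the Schatten $p$ quasi norm is recalled below, where $\sigma_i(A)$ denotes the singular values of a matrix $A$. This is the textbook definition and may differ in normalization from the paper's notation. The limit on the second line gives some intuition for why bounds stated in terms of this quantity behave like rank-dependent bounds when $p$ is small.

```latex
\|A\|_{S_p} = \Big( \sum_{i} \sigma_i(A)^{p} \Big)^{1/p}, \qquad 0 < p < 1,
\\[4pt]
\lim_{p \to 0^{+}} \|A\|_{S_p}^{p} = \#\{\, i : \sigma_i(A) > 0 \,\} = \operatorname{rank}(A).
```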

[595] AirFed: A Federated Graph-Enhanced Multi-Agent Reinforcement Learning Framework for Multi-UAV Cooperative Mobile Edge Computing

Zhiyu Wang, Suman Raj, Rajkumar Buyya

Main category: cs.LG

TL;DR: AirFed: A federated graph-enhanced multi-agent reinforcement learning framework for UAV-MEC systems that optimizes trajectory planning, task offloading, and resource allocation through dual-layer GATs, dual-Actor single-Critic architecture, and reputation-based federated learning.

Details

Motivation: Address challenges in coordinating multiple UAVs for MEC systems, including limited scalability, slow convergence, and inefficient knowledge sharing in dynamic environments with large-scale IoT deployments and deadline constraints.

Method: Uses dual-layer dynamic Graph Attention Networks to model spatial-temporal dependencies, dual-Actor single-Critic architecture for joint optimization of continuous trajectory and discrete offloading decisions, and reputation-based federated learning with gradient-sensitive quantization.

Result: Achieves 42.9% reduction in weighted cost, over 99% deadline satisfaction, 94.2% IoT device coverage rate, and 54.5% communication overhead reduction compared to state-of-the-art baselines.

Conclusion: AirFed demonstrates robust performance across varying system scales, validating its practical applicability for large-scale UAV-MEC deployments with superior efficiency and QoS guarantees.

Abstract: Cooperative Mobile Edge Computing (MEC) systems built on multiple Unmanned Aerial Vehicles (UAVs) face critical challenges in coordinating trajectory planning, task offloading, and resource allocation while ensuring Quality of Service (QoS) under dynamic and uncertain environments. Existing approaches suffer from limited scalability, slow convergence, and inefficient knowledge sharing among UAVs, particularly when handling large-scale IoT device deployments with stringent deadline constraints. This paper proposes AirFed, a novel federated graph-enhanced multi-agent reinforcement learning framework that addresses these challenges through three key innovations. First, we design dual-layer dynamic Graph Attention Networks (GATs) that explicitly model spatial-temporal dependencies among UAVs and IoT devices, capturing both service relationships and collaborative interactions within the network topology. Second, we develop a dual-Actor single-Critic architecture that jointly optimizes continuous trajectory control and discrete task offloading decisions. Third, we propose a reputation-based decentralized federated learning mechanism with gradient-sensitive adaptive quantization, enabling efficient and robust knowledge sharing across heterogeneous UAVs. Extensive experiments demonstrate that AirFed achieves a 42.9% reduction in weighted cost compared to state-of-the-art baselines, attains over 99% deadline satisfaction and a 94.2% IoT device coverage rate, and reduces communication overhead by 54.5%. Scalability analysis confirms robust performance across varying UAV numbers, IoT device densities, and system scales, validating AirFed’s practical applicability for large-scale UAV-MEC deployments.

[596] Identifiable learning of dissipative dynamics

Aiqing Zhu, Beatrice W. Soh, Grigorios A. Pavliotis, Qianxiao Li

Main category: cs.LG

TL;DR: A neural framework that learns dissipative stochastic dynamics from trajectories, identifies energy landscapes, separates reversible/irreversible motion, and computes entropy production for quantifying irreversibility in non-equilibrium systems.

Details

Motivation: Complex dissipative systems operate far from equilibrium where energy dissipation and time irreversibility are difficult to quantify from data, requiring methods to understand non-equilibrium dynamics.

Method: Universal and identifiable neural framework that learns dissipative stochastic dynamics directly from trajectories while ensuring interpretability, expressiveness, and uniqueness.

Result: Identifies unique energy landscape, separates reversible from irreversible motion, computes entropy production. Applications reveal super-linear scaling of barrier heights and sub-linear scaling of entropy production rates with strain rate, and suppression of irreversibility with increasing batch size.

Conclusion: Establishes a general, data-driven framework for discovering and interpreting non-equilibrium dynamics across various scientific and engineering domains.

Abstract: Complex dissipative systems appear across science and engineering, from polymers and active matter to learning algorithms. These systems operate far from equilibrium, where energy dissipation and time irreversibility govern their behavior but are difficult to quantify from data. Here, we introduce a universal and identifiable neural framework that learns dissipative stochastic dynamics directly from trajectories while ensuring interpretability, expressiveness, and uniqueness. Our method identifies a unique energy landscape, separates reversible from irreversible motion, and allows direct computation of the entropy production, providing a principled measure of irreversibility and deviations from equilibrium. Applications to polymer stretching in elongational flow and to stochastic gradient Langevin dynamics reveal new insights, including super-linear scaling of barrier heights and sub-linear scaling of entropy production rates with the strain rate, and the suppression of irreversibility with increasing batch size. Our methodology thus establishes a general, data-driven framework for discovering and interpreting non-equilibrium dynamics.

[597] Addressing divergent representations from causal interventions on neural networks

Satchel Grant, Simon Jerome Han, Alexa R. Tartaglini, Christopher Potts

Main category: cs.LG

TL;DR: Causal interventions in mechanistic interpretability often create out-of-distribution representations, potentially compromising explanation faithfulness. The paper identifies harmless vs pernicious divergences and proposes mitigation using Counterfactual Latent loss.

Details

Motivation: To determine whether causal interventions in mechanistic interpretability create divergent representations that may reduce the faithfulness of explanations to the model's natural state.

Method: Theoretical and empirical demonstration of distribution shifts from interventions, analysis of divergence types (harmless vs pernicious), and application/modification of Counterfactual Latent loss to mitigate pernicious divergences.

Result: Common causal intervention techniques shift internal representations away from natural distributions. Two divergence classes identified: harmless (null-space/covariance) and pernicious (activating hidden pathways). CL loss helps representations stay closer to natural distribution.

Conclusion: The findings highlight a path toward more reliable interpretability methods by addressing representation divergence issues in causal interventions.

Abstract: A common approach to mechanistic interpretability is to causally manipulate model representations via targeted interventions in order to understand what those representations encode. Here we ask whether such interventions create out-of-distribution (divergent) representations, and whether this raises concerns about how faithful their resulting explanations are to the target model in its natural state. First, we demonstrate theoretically and empirically that common causal intervention techniques often do shift internal representations away from the natural distribution of the target model. Then, we provide a theoretical analysis of two classes of such divergences: “harmless” divergences that occur in the null-space of the weights and from covariance within behavioral decision boundaries, and “pernicious” divergences that activate hidden network pathways and cause dormant behavioral changes. Finally, in an effort to mitigate the pernicious cases, we apply and modify the Counterfactual Latent (CL) loss from Grant (2025) allowing representations from causal interventions to remain closer to the natural distribution, reducing the likelihood of harmful divergences while preserving the interpretive power of the interventions. Together, these results highlight a path towards more reliable interpretability methods.

[598] SLOFetch: Compressed-Hierarchical Instruction Prefetching for Cloud Microservices

Zerui Bao, Di Zhu, Liu Jiang, Shiqi Sheng, Ziwei Wang, Haoyun Zhang

Main category: cs.LG

TL;DR: A novel instruction prefetching design for cloud workloads that uses compressed entries and hierarchical metadata storage to reduce on-chip state while maintaining performance.

Details

Motivation: Large-scale networked services with deep software stacks and microservice orchestration create frontend stalls that increase tail latency and energy consumption, requiring efficient instruction prefetching solutions.

Method: Builds on Entangling Instruction Prefetcher (EIP) with Compressed Entry (36 bits for 8 destinations), Hierarchical Metadata Storage (on-chip for L1/frequent entries, virtualized for bulk), and Online ML Controller for prefetch profitability scoring.

Result: Preserves EIP-like speedups with smaller on-chip state and improves efficiency for networked services in the ML era.

Conclusion: The proposed instruction prefetching approach effectively addresses frontend stalls in cloud workloads while reducing hardware overhead through compression and hierarchical storage.

Abstract: Large-scale networked services rely on deep software stacks and microservice orchestration, which increase instruction footprints and create frontend stalls that inflate tail latency and energy. We revisit instruction prefetching for these cloud workloads and present a design that aligns with SLO-driven and self-optimizing systems. Building on the Entangling Instruction Prefetcher (EIP), we introduce a Compressed Entry that captures up to eight destinations around a base using 36 bits by exploiting spatial clustering, and a Hierarchical Metadata Storage scheme that keeps only L1-resident and frequently queried entries on chip while virtualizing bulk metadata into lower levels. We further add a lightweight Online ML Controller that scores prefetch profitability using context features and a bandit-adjusted threshold. On data center applications, our approach preserves EIP-like speedups with smaller on-chip state and improves efficiency for networked services in the ML era.

[599] Scaling Up ROC-Optimizing Support Vector Machines

Gimun Bae, Seung Jun Shin

Main category: cs.LG

TL;DR: A scalable variant of ROC-SVM that uses incomplete U-statistics to avoid evaluating all O(n²) pairs, substantially reducing training cost while maintaining comparable AUC performance.

Details

Motivation: The original ROC-SVM has high computational cost due to evaluating all O(n²) pairs, limiting its practical use despite being effective for imbalanced classification.

Method: Leverages incomplete U-statistics to reduce computational complexity and extends to nonlinear classification via low-rank kernel approximation in reproducing kernel Hilbert spaces.

Result: Achieves comparable AUC performance to original ROC-SVM with drastically reduced training time on both synthetic and real datasets.

Conclusion: The proposed method provides an efficient and scalable alternative to ROC-SVM for imbalanced classification while maintaining competitive performance.

Abstract: The ROC-SVM, originally proposed by Rakotomamonjy, directly maximizes the area under the ROC curve (AUC) and has become an attractive alternative to conventional binary classification in the presence of class imbalance. However, its practical use is limited by high computational cost, as training involves evaluating all $O(n^2)$ pairs. To overcome this limitation, we develop a scalable variant of the ROC-SVM that leverages incomplete U-statistics, thereby substantially reducing computational complexity. We further extend the framework to nonlinear classification through a low-rank kernel approximation, enabling efficient training in reproducing kernel Hilbert spaces. Theoretical analysis establishes an error bound that justifies the proposed approximation, and empirical results on both synthetic and real datasets demonstrate that the proposed method achieves comparable AUC performance to the original ROC-SVM with drastically reduced training time.
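
The idea of replacing the full pairwise sum with an incomplete U-statistic can be sketched as follows. The sketch assumes a pairwise hinge surrogate over positive-negative score pairs and samples pairs uniformly with replacement; both are illustrative choices, and the function names are hypothetical rather than taken from the paper.

```python
import random


def pairwise_hinge_complete(scores_pos, scores_neg):
    """Complete U-statistic: average hinge loss over every positive-negative pair."""
    total = 0.0
    for sp in scores_pos:
        for sn in scores_neg:
            total += max(0.0, 1.0 - (sp - sn))
    return total / (len(scores_pos) * len(scores_neg))


def pairwise_hinge_incomplete(scores_pos, scores_neg, m, rng):
    """Incomplete U-statistic: average the same loss over m randomly sampled pairs."""
    total = 0.0
    for _ in range(m):
        sp = rng.choice(scores_pos)
        sn = rng.choice(scores_neg)
        total += max(0.0, 1.0 - (sp - sn))
    return total / m


rng = random.Random(0)
pos = [rng.gauss(1.0, 1.0) for _ in range(200)]   # scores of positive examples (toy data)
neg = [rng.gauss(0.0, 1.0) for _ in range(300)]   # scores of negative examples (toy data)
full = pairwise_hinge_complete(pos, neg)           # touches all 200 * 300 pairs
approx = pairwise_hinge_incomplete(pos, neg, 2000, rng)  # touches only 2000 pairs
```

The sampled estimate is unbiased for the complete average under this sampling scheme, and its cost is governed by the number of sampled pairs `m` rather than by the product of the class sizes.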

[600] Dual-branch Spatial-Temporal Self-supervised Representation for Enhanced Road Network Learning

Qinghong Guo, Yu Wang, Ji Cao, Tongya Zheng, Junshu Dai, Bingde Hu, Shunyu Liu, Canghong Jin

Main category: cs.LG

TL;DR: DST is a dual-branch spatial-temporal self-supervised framework for road network representation learning that addresses spatial heterogeneity and temporal dynamics through mix-hop graph convolution, hypergraph contrastive learning, and causal Transformer-based temporal modeling.

Details

Motivation: Spatial heterogeneity and temporal dynamics in road networks pose challenges to self-supervised GNNs' neighborhood smoothing mechanism, requiring more sophisticated representation learning approaches.

Method: Proposes DST framework with: 1) Mix-hop transition matrix for graph convolution to capture dynamic road relations from trajectories; 2) Spatial contrastive learning between vanilla road network and hypergraph with three hyperedge types for long-range relations; 3) Temporal self-supervision using causal Transformer for next token prediction on traffic sequences, regularized by weekday-weekend traffic mode differentiation.

Result: Extensive experiments show DST outperforms state-of-the-art methods and excels in zero-shot learning scenarios due to comprehensive spatiotemporal modeling.

Conclusion: The dual-branch spatial-temporal self-supervised framework DST effectively addresses road network representation challenges and demonstrates superior performance across various tasks including zero-shot learning.

Abstract: Road network representation learning (RNRL) has attracted increasing attention from both researchers and practitioners as various spatiotemporal tasks are emerging. Recent advanced methods leverage Graph Neural Networks (GNNs) and contrastive learning to characterize the spatial structure of road segments in a self-supervised paradigm. However, spatial heterogeneity and temporal dynamics of road networks raise severe challenges to the neighborhood smoothing mechanism of self-supervised GNNs. To address these issues, we propose a Dual-branch Spatial-Temporal self-supervised representation framework for enhanced road representations, termed DST. On one hand, DST designs a mix-hop transition matrix for graph convolution to incorporate dynamic relations of roads from trajectories. In addition, DST contrasts road representations of the vanilla road network against those of the hypergraph in a spatial self-supervised way. The hypergraph is newly built based on three types of hyperedges to capture long-range relations. On the other hand, DST performs next token prediction as the temporal self-supervised task on the sequences of traffic dynamics based on a causal Transformer, which is further regularized by differentiating traffic modes of weekdays from those of weekends. Extensive experiments against state-of-the-art methods verify the superiority of our proposed framework. Moreover, the comprehensive spatiotemporal modeling enables DST to excel in zero-shot learning scenarios.

[601] Planning in Branch-and-Bound: Model-Based Reinforcement Learning for Exact Combinatorial Optimization

Paul Strang, Zacharie Alès, Côme Bissuel, Olivier Juan, Safia Kedad-Sidhoum, Emmanuel Rachelson

Main category: cs.LG

TL;DR: PlanB&B is a model-based reinforcement learning agent that uses a learned internal model of branch-and-bound dynamics to discover improved branching strategies for MILP problems, outperforming previous RL methods.

Details

Motivation: To move beyond static, hand-crafted variable selection heuristics in branch-and-bound solvers for MILP problems by leveraging recent RL successes in combinatorial optimization.

Method: Introduces PlanB&B, a model-based RL agent that learns an internal model of B&B dynamics and uses this model to discover improved branching strategies through planning.

Result: Empirical validation shows PlanB&B outperforms previous state-of-the-art RL methods across four standard MILP benchmarks.

Conclusion: Model-based reinforcement learning with learned internal models of B&B dynamics can effectively discover improved branching strategies for MILP problems.

Abstract: Mixed-Integer Linear Programming (MILP) lies at the core of many real-world combinatorial optimization (CO) problems, traditionally solved by branch-and-bound (B&B). A key driver of B&B solver efficiency is the variable selection heuristic that guides branching decisions. Looking to move beyond static, hand-crafted heuristics, recent work has explored adapting traditional reinforcement learning (RL) algorithms to the B&B setting, aiming to learn branching strategies tailored to specific MILP distributions. In parallel, RL agents have achieved remarkable success in board games, a very specific type of combinatorial problem, by leveraging environment simulators to plan via Monte Carlo Tree Search (MCTS). Building on these developments, we introduce Plan-and-Branch-and-Bound (PlanB&B), a model-based reinforcement learning (MBRL) agent that leverages a learned internal model of the B&B dynamics to discover improved branching strategies. Computational experiments empirically validate our approach, with our MBRL branching agent outperforming previous state-of-the-art RL methods across four standard MILP benchmarks.

[602] Self-Organization and Spectral Mechanism of Attractor Landscapes in High-Capacity Kernel Hopfield Networks

Akira Tamamori

Main category: cs.LG

TL;DR: The paper reveals that kernel-based Hopfield networks achieve optimal memory capacity through a critical state called ‘Spectral Concentration’, where the leading eigenvalue is amplified for stability while trailing eigenvalues are preserved for capacity.

Details

Motivation: To understand the dynamical mechanism behind the enhanced storage capacity of kernel-based Hopfield networks, which remains poorly understood despite their practical success.

Method: Unified geometric analysis of attractor landscape with spectral theory of kernel machines, using a novel ‘Pinnacle Sharpness’ metric to analyze attractor stability and identify optimal operating conditions.

Result: Discovered a ‘Ridge of Optimization’ phase where networks achieve maximal robustness under high-load conditions, characterized by ‘Force Antagonism’ and ‘Spectral Concentration’ - a specific reorganization of the weight spectrum that balances stability and capacity.

Conclusion: Optimal performance in high-capacity associative memories is achieved by tuning the system to a spectral ‘Goldilocks zone’ between rank collapse and diffusion, where the network self-organizes into a critical state balancing global stability and memory capacity.

Abstract: Kernel-based learning methods can dramatically increase the storage capacity of Hopfield networks, yet the dynamical mechanism behind this enhancement remains poorly understood. We address this gap by unifying the geometric analysis of the attractor landscape with the spectral theory of kernel machines. Using a novel metric, “Pinnacle Sharpness,” we first uncover a rich phase diagram of attractor stability, identifying a “Ridge of Optimization” where the network achieves maximal robustness under high-load conditions. Phenomenologically, this ridge is characterized by a “Force Antagonism,” where a strong driving force is balanced by a collective feedback force. Theoretically, we reveal that this phenomenon arises from a specific reorganization of the weight spectrum, which we term “Spectral Concentration.” Unlike a simple rank-1 collapse, our analysis shows that the network on the ridge self-organizes into a critical state: the leading eigenvalue is amplified to maximize global stability (Direct Force), while the trailing eigenvalues are preserved to maintain high memory capacity (Indirect Force). These findings provide a complete physical picture of how high-capacity associative memories are formed, demonstrating that optimal performance is achieved by tuning the system to a spectral “Goldilocks zone” between rank collapse and diffusion.

[603] Analysis of heart failure patient trajectories using sequence modeling

Falk Dippel, Yinan Yu, Annika Rosengren, Martin Lindgren, Christina E. Lundberg, Erik Aerts, Martin Adiels, Helen Sjöland

Main category: cs.LG

TL;DR: This paper presents a systematic ablation study comparing six sequence models across three architecture classes (Transformers, Transformers++, Mambas) for clinical prediction tasks using EHR data from a large Swedish heart failure cohort.

Details

Motivation: Despite Transformers defining state-of-the-art for EHR-based clinical prediction, there's no established systematic approach to empirically analyze model performance and efficiency across different settings in the medical domain.

Method: Evaluated six sequence models across three architecture classes on three one-year prediction tasks using EHR data from 42,820 heart failure patients. Conducted ablations on EHR input sequences, architectural configurations, and temporal preprocessing techniques.

Result: Llama achieved highest predictive discrimination, best calibration, and robustness across all tasks, followed by Mambas. Both architectures demonstrated efficient representation learning, with tiny configurations surpassing other large-scaled Transformers. At equal model size, Llama and Mambas achieved superior performance using 25% less training data.

Conclusion: This study provides the first systematic ablation study with design choices for input tokenization, model configuration and temporal data preprocessing, offering recommendations for future model development in clinical prediction tasks using EHRs.

Abstract: Transformers have defined the state-of-the-art for clinical prediction tasks involving electronic health records (EHRs). The recently introduced Mamba architecture outperformed an advanced Transformer (Transformer++) based on Llama in handling long context lengths, while using fewer model parameters. Despite the impressive performance of these architectures, a systematic approach to empirically analyze model performance and efficiency under various settings is not well established in the medical domain. The performance of six sequence models was investigated across three architecture classes (Transformers, Transformers++, Mambas) in a large Swedish heart failure (HF) cohort (N = 42820), providing a clinically relevant case study. Patient data included diagnoses, vital signs, laboratories, medications and procedures extracted from in-hospital EHRs. The models were evaluated on three one-year prediction tasks: clinical instability (a readmission phenotype) after initial HF hospitalization, mortality after initial HF hospitalization and mortality after latest hospitalization. Ablations account for modifications of the EHR-based input patient sequence, architectural model configurations, and temporal preprocessing techniques for data collection. Llama achieves the highest predictive discrimination and best calibration, and shows robustness across all tasks, followed by Mambas. Both architectures demonstrate efficient representation learning, with tiny configurations surpassing other large-scale Transformers. At equal model size, Llama and Mambas achieve superior performance using 25% less training data. This paper presents a first ablation study with systematic design choices for input tokenization, model configuration and temporal data preprocessing. Future model development in clinical prediction tasks using EHRs could build upon this study’s recommendations as a starting point.

[604] PaSE: Prototype-aligned Calibration and Shapley-based Equilibrium for Multimodal Sentiment Analysis

Kang He, Boyu Chen, Yuzhe Ding, Fei Li, Chong Teng, Donghong Ji

Main category: cs.LG

TL;DR: PaSE framework addresses modality competition in multimodal sentiment analysis by using prototype-aligned calibration and Shapley-optimized equilibrium to enhance cross-modal collaboration.

Details

Motivation: Real-world multimodal scenarios often exhibit modality competition where dominant modalities overshadow weaker ones, leading to suboptimal performance in sentiment analysis.

Method: Uses Prototype-guided Calibration Learning with Entropic Optimal Transport for semantic alignment, followed by prototype-gated fusion and Shapley-based Gradient Modulation to adaptively adjust gradients based on modality contributions.

Result: Extensive experiments on IEMOCAP, MOSI, and MOSEI datasets confirm superior performance and effective alleviation of modality competition.

Conclusion: PaSE successfully enhances multimodal collaboration while mitigating modality competition through its dual-phase optimization approach.

Abstract: Multimodal Sentiment Analysis (MSA) seeks to understand human emotions by integrating textual, acoustic, and visual signals. Although multimodal fusion is designed to leverage cross-modal complementarity, real-world scenarios often exhibit modality competition: dominant modalities tend to overshadow weaker ones, leading to suboptimal performance. In this paper, we propose PaSE, a novel Prototype-aligned Calibration and Shapley-optimized Equilibrium framework, which enhances collaboration while explicitly mitigating modality competition. PaSE first applies Prototype-guided Calibration Learning (PCL) to refine unimodal representations and align them through an Entropic Optimal Transport mechanism that ensures semantic consistency. To further stabilize optimization, we introduce a Dual-Phase Optimization strategy. A prototype-gated fusion module is first used to extract shared representations, followed by Shapley-based Gradient Modulation (SGM), which adaptively adjusts gradients according to the contribution of each modality. Extensive experiments on IEMOCAP, MOSI, and MOSEI confirm that PaSE achieves superior performance and effectively alleviates modality competition.
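
As an illustration of what "contribution of each modality" can mean, the sketch below computes exact Shapley values for three modalities from a table of subset scores. It only shows the general Shapley computation; the accuracy table is invented, and the abstract does not specify which value function PaSE uses or how the resulting contributions are mapped to gradient scales.

```python
from itertools import combinations
from math import factorial


def shapley_values(players, value):
    """Exact Shapley value of each player for a set-valued score function."""
    n = len(players)
    phi = {}
    for p in players:
        others = [q for q in players if q != p]
        total = 0.0
        for k in range(n):
            # Weight of a coalition of size k that does not contain p.
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            for subset in combinations(others, k):
                s = frozenset(subset)
                total += weight * (value(s | {p}) - value(s))
        phi[p] = total
    return phi


# Hypothetical validation accuracy for each subset of modalities.
accuracy = {
    frozenset(): 0.50,
    frozenset({"text"}): 0.70,
    frozenset({"audio"}): 0.60,
    frozenset({"vision"}): 0.55,
    frozenset({"text", "audio"}): 0.75,
    frozenset({"text", "vision"}): 0.72,
    frozenset({"audio", "vision"}): 0.62,
    frozenset({"text", "audio", "vision"}): 0.78,
}

phi = shapley_values(["text", "audio", "vision"], lambda s: accuracy[frozenset(s)])
for modality, contribution in phi.items():
    print(f"{modality}: {contribution:.4f}")
```

In this toy table the text modality receives the largest share, which is the kind of imbalance a modulation scheme could then counteract. The contributions sum to the gain of the full model over the empty baseline (0.28 here), a standard property of Shapley values.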

[605] Rectifying Distribution Shift in Cascaded Precipitation Nowcasting

Fanbo Ju, Haiyuan Shi, Qingjian Ni

Main category: cs.LG

TL;DR: RectiCast is a two-stage framework for precipitation nowcasting that decouples mean-field shift rectification from local stochasticity generation using dual Flow Matching models, achieving better performance than existing methods.

Details

Motivation: Existing cascaded architectures for precipitation nowcasting conflate systematic distribution shift in deterministic predictions with local stochasticity, leading to inaccurate precipitation patterns and intensity over longer lead times.

Method: Two-stage framework: 1) deterministic model generates posterior mean, 2) Rectifier learns distribution shift and produces rectified mean, then Generator models local stochasticity conditioned on rectified mean using dual Flow Matching.

Result: Experiments on two radar datasets show RectiCast achieves significant performance improvements over state-of-the-art methods.

Conclusion: Explicitly decoupling mean-field shift rectification from local stochasticity generation via dual Flow Matching effectively addresses the conflation problem in precipitation nowcasting.

Abstract: Precipitation nowcasting, which aims to provide high spatio-temporal resolution precipitation forecasts by leveraging current radar observations, is a core task in regional weather forecasting. Recently, the cascaded architecture has emerged as the mainstream paradigm for deep learning-based precipitation nowcasting. This paradigm involves a deterministic model to predict posterior mean, followed by a probabilistic model to generate local stochasticity. However, existing methods commonly overlook the conflation of the systematic distribution shift in deterministic predictions and the local stochasticity. As a result, the distribution shift of the deterministic component contaminates the predictions of the probabilistic component, leading to inaccuracies in precipitation patterns and intensity, particularly over longer lead times. To address this issue, we introduce RectiCast, a two-stage framework that explicitly decouples the rectification of mean-field shift from the generation of local stochasticity via a dual Flow Matching model. In the first stage, a deterministic model generates the posterior mean. In the second stage, we introduce a Rectifier to explicitly learn the distribution shift and produce a rectified mean. Subsequently, a Generator focuses on modeling the local stochasticity conditioned on the rectified mean. Experiments on two radar datasets demonstrate that RectiCast achieves significant performance improvements over existing state-of-the-art methods.

[606] Subtract the Corruption: Training-Data-Free Corrective Machine Unlearning using Task Arithmetic

Mostafa Mozafari, Farooq Ahmad Wani, Maria Sofia Bucarelli, Fabrizio Silvestri

Main category: cs.LG

TL;DR: CUTS enables corrective machine unlearning without access to original training data or identified corrupted samples, using only a small proxy set of corrupted samples to remove corruption influence from trained models.

Details

Motivation: Real-world scenarios often lack access to original training data and identified corrupted samples, making traditional corrective machine unlearning methods ineffective.

Method: CUTS treats corruption as a distinct task, fine-tunes on proxy set to amplify corruption signal, computes task vector from weight differences, and subtracts calibrated vector to cancel corruption.

Result: CUTS recovers most lost utility under label noise and nearly eliminates backdoor attacks with minimal utility damage, outperforming state-of-the-art methods in source-free settings.

Conclusion: CUTS provides an effective source-free corrective unlearning approach that works without clean data or forget sets, demonstrating strong performance across corruption types.

Abstract: Corrupted training data are ubiquitous. Corrective Machine Unlearning (CMU) seeks to remove the influence of such corruption post-training. Prior CMU typically assumes access to identified corrupted training samples (a “forget set”). However, in many real-world scenarios the training data are no longer accessible. We formalize source-free CMU, where the original training data are unavailable and, consequently, no forget set of identified corrupted training samples can be specified. Instead, we assume a small proxy (surrogate) set of corrupted samples that reflect the suspected corruption type without needing to be the original training samples. In this stricter setting, methods relying on a forget set are ineffective or narrow in scope. We introduce Corrective Unlearning in Task Space (CUTS), a lightweight weight-space correction method guided by the proxy set using task arithmetic principles. CUTS treats the clean and the corruption signals as distinct tasks. Specifically, we briefly fine-tune the corrupted model on the proxy to amplify the corruption mechanism in the weight space, compute the difference between the corrupted and fine-tuned weights as a proxy task vector, and subtract a calibrated multiple of this vector to cancel the corruption. Without access to clean data or a forget set, CUTS recovers a large fraction of the lost utility under label noise and, for backdoor triggers, nearly eliminates the attack with minimal damage to utility, outperforming state-of-the-art specialized CMU methods in the source-free setting.
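
The weight-space step described above can be sketched in a few lines. This is a toy illustration rather than the CUTS implementation: weights are plain lists of floats, the sign convention (fine-tuned minus corrupted) follows the usual task-arithmetic definition and is an assumption here, and the scaling factor `alpha` is a hypothetical stand-in for the calibration procedure, which the abstract does not describe.

```python
def task_vector(theta_corrupted, theta_finetuned):
    """Proxy task vector: fine-tuned weights minus corrupted weights.

    Assumption: this direction follows the common task-arithmetic
    convention; the abstract only says "the difference".
    """
    return {
        name: [f - c for c, f in zip(theta_corrupted[name], theta_finetuned[name])]
        for name in theta_corrupted
    }


def subtract_task_vector(theta_corrupted, tau, alpha):
    """Move the corrupted weights against the corruption direction."""
    return {
        name: [c - alpha * t for c, t in zip(theta_corrupted[name], tau[name])]
        for name in theta_corrupted
    }


# Toy weights (hypothetical values chosen for easy arithmetic).
theta_corrupted = {"w": [1.0, 2.0], "b": [0.5]}
# Pretend result of briefly fine-tuning on the proxy set of corrupted samples.
theta_finetuned = {"w": [1.5, 2.0], "b": [0.25]}

tau = task_vector(theta_corrupted, theta_finetuned)
theta_corrected = subtract_task_vector(theta_corrupted, tau, alpha=2.0)
print(tau)              # {'w': [0.5, 0.0], 'b': [-0.25]}
print(theta_corrected)  # {'w': [0.0, 2.0], 'b': [1.0]}
```

With real models the same two functions would operate on each parameter tensor of a state dictionary; only the coordinates that fine-tuning on the proxy set moved are changed.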

[607] Learning to Compress Graphs via Dual Agents for Consistent Topological Robustness Evaluation

Qisen Chai, Yansong Wang, Junjie Huang, Tao Jia

Main category: cs.LG

TL;DR: Cutter is a dual-agent RL framework that compresses large graphs while preserving topological structure and robustness profiles, enabling efficient adversarial robustness evaluation.

Details

Motivation: Graph-structured data is growing increasingly large, making robustness evaluation under adversarial attacks computationally expensive and difficult to scale.

Method: Dual-agent reinforcement learning with Vital Detection Agent (VDA) and Redundancy Detection Agent (RDA) that collaboratively identify vital and redundant nodes, using trajectory-level reward shaping, prototype-based shaping, and cross-agent imitation.

Result: Compressed graphs retain essential topological properties and exhibit robustness degradation trends highly consistent with original graphs under various attack scenarios, significantly improving evaluation efficiency.

Conclusion: Cutter enables efficient and reliable graph robustness evaluation without compromising assessment fidelity through guided compression that preserves both structure and robustness characteristics.

Abstract: As graph-structured data grow increasingly large, evaluating their robustness under adversarial attacks becomes computationally expensive and difficult to scale. To address this challenge, we propose to compress graphs into compact representations that preserve both topological structure and robustness profile, enabling efficient and reliable evaluation. We propose Cutter, a dual-agent reinforcement learning framework composed of a Vital Detection Agent (VDA) and a Redundancy Detection Agent (RDA), which collaboratively identify structurally vital and redundant nodes for guided compression. Cutter incorporates three key strategies to enhance learning efficiency and compression quality: trajectory-level reward shaping to transform sparse trajectory returns into dense, policy-equivalent learning signals; prototype-based shaping to guide decisions using behavioral patterns from both high- and low-return trajectories; and cross-agent imitation to enable safer and more transferable exploration. Experiments on multiple real-world graphs demonstrate that Cutter generates compressed graphs that retain essential static topological properties and exhibit robustness degradation trends highly consistent with the original graphs under various attack scenarios, thereby significantly improving evaluation efficiency without compromising assessment fidelity.

cs.MA

Abraham Itzhak Weinberg

Main category: cs.MA

TL;DR: TSLEC framework uses trust-based social learning to accelerate emergent communication in multi-agent systems, reducing convergence time by 23.9% while maintaining robust protocols.

Details

Motivation: Traditional emergent communication through independent learning leads to slow convergence and potentially suboptimal protocols in multi-agent systems.

Method: TSLEC framework where agents explicitly teach successful strategies to peers, with knowledge transfer modulated by learned trust relationships.

Result: Trust-based social learning reduces episodes-to-convergence by 23.9% (p < 0.001), produces compositional protocols (C = 0.38), and maintains robustness under dynamic objectives (Phi > 0.867 decoding accuracy). Trust scores strongly correlate with teaching quality (r = 0.743).

Conclusion: Explicit social learning fundamentally accelerates emergent communication in multi-agent coordination, with trust-based filtering enabling effective knowledge transfer.

Abstract: Emergent communication in multi-agent systems typically occurs through independent learning, resulting in slow convergence and potentially suboptimal protocols. We introduce TSLEC (Trust-Based Social Learning with Emergent Communication), a framework where agents explicitly teach successful strategies to peers, with knowledge transfer modulated by learned trust relationships. Through experiments with 100 episodes across 30 random seeds, we demonstrate that trust-based social learning reduces episodes-to-convergence by 23.9% (p < 0.001, Cohen’s d = 1.98) compared to independent emergence, while producing compositional protocols (C = 0.38) that remain robust under dynamic objectives (Phi > 0.867 decoding accuracy). Trust scores strongly correlate with teaching quality (r = 0.743, p < 0.001), enabling effective knowledge filtering. Our results establish that explicit social learning fundamentally accelerates emergent communication in multi-agent coordination.

[609] An Adaptive, Data-Integrated Agent-Based Modeling Framework for Explainable and Contestable Policy Design

Roberto Garrone

Main category: cs.MA

TL;DR: A general adaptive multi-agent learning framework that integrates dynamic regimes, information-theoretic diagnostics, causal models, and unsupervised methods to analyze how learning agents and adaptive controls shape system trajectories.

Details

Motivation: Multi-agent systems often operate under feedback and non-stationarity, but many simulations use static decision rules and fixed parameters, limiting their ability to capture real-world dynamics.

Method: Integrates four dynamic regimes (static/adaptive agents, fixed/adaptive parameters), information-theoretic diagnostics (entropy rate, complexity, predictive information), structural causal models, agent-level prior generation, and unsupervised regime identification.

Result: Provides a domain-neutral architecture for analyzing joint effects of learning agents and adaptive controls on system trajectories, enabling systematic comparison across different dynamic patterns.

Conclusion: The framework offers a structured methodology for developing explainable and contestable multi-agent decision processes with mathematical definitions, computational operators, and experimental design templates.

Abstract: Multi-agent systems often operate under feedback, adaptation, and non-stationarity, yet many simulation studies retain static decision rules and fixed control parameters. This paper introduces a general adaptive multi-agent learning framework that integrates: (i) four dynamic regimes distinguishing static versus adaptive agents and fixed versus adaptive system parameters; (ii) information-theoretic diagnostics (entropy rate, statistical complexity, and predictive information) to assess predictability and structure; (iii) structural causal models for explicit intervention semantics; (iv) procedures for generating agent-level priors from aggregate or sample data; and (v) unsupervised methods for identifying emergent behavioral regimes. The framework offers a domain-neutral architecture for analyzing how learning agents and adaptive controls jointly shape system trajectories, enabling systematic comparison of stability, performance, and interpretability across non-equilibrium, oscillatory, or drifting dynamics. Mathematical definitions, computational operators, and an experimental design template are provided, yielding a structured methodology for developing explainable and contestable multi-agent decision processes.
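
One of the diagnostics named in (ii), the entropy rate, can be estimated from a symbol sequence with block entropies. The sketch below uses the simple plug-in estimator h ≈ H(L) − H(L−1); this estimator, the function names, and the toy sequences are assumptions for illustration, since the abstract does not state how the framework computes its diagnostics.

```python
import random
from collections import Counter
from math import log2


def block_entropy(symbols, length):
    """Shannon entropy (bits) of the empirical distribution of length-L blocks."""
    if length == 0:
        return 0.0
    blocks = [tuple(symbols[i:i + length]) for i in range(len(symbols) - length + 1)]
    counts = Counter(blocks)
    total = len(blocks)
    return -sum((c / total) * log2(c / total) for c in counts.values())


def entropy_rate_estimate(symbols, length):
    """Plug-in entropy-rate estimate: H(L) - H(L-1), in bits per symbol."""
    return block_entropy(symbols, length) - block_entropy(symbols, length - 1)


# Toy trajectories of a binary system state (hypothetical data).
alternating = [0, 1] * 500                      # fully predictable
rng = random.Random(0)
coin_flips = [rng.randint(0, 1) for _ in range(2000)]  # unpredictable

print(f"alternating: {entropy_rate_estimate(alternating, 2):.3f} bits/symbol")
print(f"coin flips:  {entropy_rate_estimate(coin_flips, 2):.3f} bits/symbol")
```

A periodic trajectory yields an estimate near zero and a fair coin an estimate near one bit per symbol, which is the sense in which such a diagnostic separates structured from unpredictable dynamics. Plug-in estimates become biased when the block length is large relative to the sequence length.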

[610] Complex Instruction Following with Diverse Style Policies in Football Games

Chenglu Sun, Shuo Shen, Haonan Hu, Wei Zhou, Chen Chen

Main category: cs.MA

TL;DR: LCDSP is a novel LC-RL paradigm that enables agents to understand and execute abstract tactical instructions in complex multi-agent environments like football games through diverse style policies and language interpretation.

Details

Motivation: Current LC-RL methods struggle with high-level abstract instructions in complex multi-agent environments, creating a need for systems that can comprehend and execute tactical commands in scenarios like football games.

Method: LCDSP uses two components: Diverse Style Training (DST) to train a single policy with diverse behaviors via style parameters, and Style Interpreter (SI) to translate language instructions into corresponding style parameters.

Result: Extensive experiments in 5v5 football environment show LCDSP effectively comprehends abstract tactical instructions and accurately executes diverse behavioral styles as desired.

Conclusion: LCDSP demonstrates strong potential for complex real-world applications by successfully bridging the gap between high-level language instructions and diverse behavioral execution in multi-agent environments.

Abstract: Despite advancements in language-controlled reinforcement learning (LC-RL) for basic domains and straightforward commands (e.g., object manipulation and navigation), effectively extending LC-RL to comprehend and execute high-level or abstract instructions in complex, multi-agent environments, such as football games, remains a significant challenge. To address this gap, we introduce Language-Controlled Diverse Style Policies (LCDSP), a novel LC-RL paradigm specifically designed for complex scenarios. LCDSP comprises two key components: a Diverse Style Training (DST) method and a Style Interpreter (SI). The DST method efficiently trains a single policy capable of exhibiting a wide range of diverse behaviors by modulating agent actions through style parameters (SP). The SI is designed to accurately and rapidly translate high-level language instructions into these corresponding SP. Through extensive experiments in a complex 5v5 football environment, we demonstrate that LCDSP effectively comprehends abstract tactical instructions and accurately executes the desired diverse behavioral styles, showcasing its potential for complex, real-world applications.

[611] Realistic gossip in Trust Game on networks: the GODS model

Jan Majewski, Francesca Giardini

Main category: cs.MA

TL;DR: Realistic gossip modeling reveals cooperators struggle with local gossip spread, increasing overall resources but promoting defection. Dynamic networks create payoff inequalities, forcing cooperators to choose between outcompeting defectors or overall growth.

Details

Motivation: Previous studies on gossip in reputation-based systems used unrealistic assumptions like perfect information or broadcast dynamics, failing to capture real-world gossip processes.

Method: Developed an agent-based model combining realistic gossip processes with different variants of Trust Game, blending direct and indirect reciprocity with reputations.

Result: Local gossip interactions harm cooperators who can’t discriminate against defectors. Realistic gossiping increases overall resources but promotes defection. Dynamic networks create high payoff inequalities.

Conclusion: Gossiping increases cooperation efficiency by an order of magnitude when properly modeled with realistic processes, though cooperators face trade-offs between competing with defectors and overall growth.

Abstract: Gossip has been shown to be a relatively efficient solution to problems of cooperation in reputation-based systems of exchange, but many studies don’t conceptualize gossiping in a realistic way, often assuming near-perfect information or broadcast-like dynamics of its spread. To solve this problem, we developed an agent-based model that pairs realistic gossip processes with different variants of the Trust Game. The results show that cooperators suffer when local interactions govern the spread of gossip, because they cannot discriminate against defectors. Realistic gossiping increases the overall amount of resources, but is more likely to promote defection. Moreover, even partner selection through dynamic networks can lead to high payoff inequalities among agent types. Cooperators face a choice between outcompeting defectors and overall growth. By blending direct and indirect reciprocity with reputations, we show that gossiping increases the efficiency of cooperation by an order of magnitude.

[612] EnergyTwin: A Multi-Agent System for Simulating and Coordinating Energy Microgrids

Jakub Muszyński, Ignacy Walużenicz, Patryk Zan, Zofia Wrona, Maria Ganzha, Marcin Paprzycki, Costin Bădică

Main category: cs.MA

TL;DR: EnergyTwin is an agent-based microgrid simulation environment that combines physical models with forecast-informed planning and negotiations to improve microgrid resilience and energy self-sufficiency.

Details

Motivation: Existing tools either capture physical behavior with centralized control assumptions or model decentralized decision-making without physical grounding. There's a need for a framework that integrates both aspects for realistic microgrid simulation.

Method: Agent-based simulation where each asset is modeled as an agent interacting with a central agent. Uses forecast-informed, rolling-horizon planning and contract-based negotiations for energy allocation.

Result: Forecast-driven rolling-horizon planning increases local energy self-sufficiency, maintains higher battery reserves, and reduces exposure to low-resilience operating states.

Conclusion: EnergyTwin demonstrates potential as a platform for research on resilient, negotiation-driven microgrids and is extensible for digital-twin applications.

Abstract: Microgrids are deployed to reduce purchased grid energy, limit exposure to volatile tariffs, and ensure service continuity during disturbances. This requires coordinating heterogeneous distributed energy resources across multiple time scales and under variable conditions. Among existing tools, power-system simulators typically capture physical behaviour but assume centralized control, while multi-agent frameworks model decentralized decision-making but represent energy with no physical grounding. In this context, EnergyTwin is introduced: an agent-based microgrid simulation environment that couples physically grounded models with forecast-informed, rolling-horizon planning and negotiations. Each asset is modeled as an agent, interacting with a central agent that obtains forecasts, formulates predictions, and allocates energy through contract-based interactions. EnergyTwin targets tertiary-layer decision making and is extensible for digital-twin use. Its feasibility was evaluated in a university campus microgrid scenario where multiple planning strategies were compared. The results show that forecast-driven rolling-horizon planning increases local energy self-sufficiency, maintains higher battery reserves, and reduces exposure to low-resilience operating states. They also demonstrate the potential of EnergyTwin as a platform supporting research on resilient, negotiation-driven microgrids.
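
Illustrative sketch (not EnergyTwin's planner): the generic rolling-horizon pattern re-plans at every step over a short forecast window and executes only the first planned action. The toy price-based rule, the function names, and the battery parameters below are all assumptions made for illustration; the paper's agents allocate energy through contract-based negotiation, which is not modeled here.

```python
from typing import List


def first_action(prices_window: List[float], soc: float,
                 capacity: float, rate: float) -> float:
    """Return the battery action (kWh) for the first step of the window.

    Positive = charge, negative = discharge. Toy rule (an assumption):
    charge when the current step is the cheapest in the forecast window,
    discharge when it is the most expensive, otherwise hold.
    """
    if len(prices_window) < 2 or min(prices_window) == max(prices_window):
        return 0.0
    now = prices_window[0]
    if now == min(prices_window):
        return min(rate, capacity - soc)
    if now == max(prices_window):
        return -min(rate, soc)
    return 0.0


def rolling_horizon(price_forecast: List[float], horizon: int,
                    capacity: float = 10.0, rate: float = 5.0) -> List[float]:
    """Re-plan at every step over `horizon` steps; execute only the first action."""
    soc = 0.0  # battery state of charge, starts empty
    executed = []
    for t in range(len(price_forecast)):
        window = price_forecast[t:t + horizon]
        action = first_action(window, soc, capacity, rate)
        soc += action
        executed.append(action)
    return executed


short_view = rolling_horizon([1.0, 3.0, 2.0, 5.0], horizon=2)
long_view = rolling_horizon([1.0, 3.0, 2.0, 5.0], horizon=4)
```

With a longer window the planner holds its charge at the second step instead of discharging, because it can see the higher price further ahead.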

[613] ShortageSim: Simulating Drug Shortages under Information Asymmetry

Mingxuan Cui, Yilan Jiang, Duo Zhou, Cheng Qian, Yuji Zhang, Qiong Wang

Main category: cs.MA

TL;DR: ShortageSim is a simulation framework using LLM-based agents to model regulatory interventions’ impact on drug shortages under information asymmetry, showing 84% reduction in resolution lag.

Details

Motivation: Drug shortages pose critical risks to healthcare, but regulatory effectiveness is poorly understood due to information asymmetries in pharmaceutical supply chains.

Method: Uses LLM-based agents to model strategic decisions of drug manufacturers and buyers in response to regulatory shortage alerts, simulating heterogeneous interpretations under information asymmetry.

Result: Reduces resolution lag for production disruption cases by up to 84%, achieving closer alignment to real-world trajectories than zero-shot baseline. Framework confirms regulatory alert effectiveness.

Conclusion: Provides novel framework for policy design and testing in supply chains under information asymmetry, open-sourcing ShortageSim and 2,925 FDA shortage events dataset.

Abstract: Drug shortages pose critical risks to patient care and healthcare systems worldwide, yet the effectiveness of regulatory interventions remains poorly understood due to information asymmetries in pharmaceutical supply chains. We propose ShortageSim, which addresses this challenge by providing the first simulation framework that evaluates the impact of regulatory interventions on competition dynamics under information asymmetry. Using Large Language Model (LLM)-based agents, the framework models the strategic decisions of drug manufacturers and institutional buyers in response to shortage alerts issued by the regulatory agency. Unlike traditional game theory models that assume perfect rationality and complete information, ShortageSim simulates heterogeneous interpretations of regulatory announcements and the resulting decisions. Experiments on a self-processed dataset of historical shortage events show that ShortageSim reduces the resolution lag for production disruption cases by up to 84%, achieving closer alignment to real-world trajectories than the zero-shot baseline. Our framework confirms the effect of regulatory alerts in addressing shortages and introduces a new method for understanding competition in multi-stage environments under uncertainty. We open-source ShortageSim and a dataset of 2,925 FDA shortage events, providing a novel framework for future research on policy design and testing in supply chains under information asymmetry.

[614] RadAgents: Multimodal Agentic Reasoning for Chest X-ray Interpretation with Radiologist-like Workflows

Kai Zhang, Corey D Barrett, Jangwon Kim, Lichao Sun, Tara Taghavi, Krishnaram Kenthapadi

Main category: cs.MA

TL;DR: RadAgents is a multi-agent framework for chest X-ray interpretation that addresses limitations of current methods by providing clinically interpretable reasoning, multimodal evidence fusion, and cross-tool inconsistency resolution through radiologist-style workflows and verification mechanisms.

Details

Motivation: Current chest X-ray interpretation methods lack clinical interpretability, insufficiently fuse multimodal evidence, and fail to detect cross-tool inconsistencies, resulting in unreliable outputs that don't align with clinical guidelines.

Method: Multi-agent framework that encodes radiologist-style workflow into modular pipeline, integrates clinical priors with task-aware multimodal reasoning, and uses grounding and multimodal retrieval-augmentation to verify and resolve context conflicts.

Result: The system produces more reliable, transparent outputs that are consistent with clinical practice through improved multimodal reasoning and conflict resolution capabilities.

Conclusion: RadAgents bridges critical gaps in clinical AI systems by providing auditable, guideline-aligned reasoning with robust verification mechanisms for chest X-ray interpretation.

Abstract: Agentic systems offer a potential path to solve complex clinical tasks through collaboration among specialized agents, augmented by tool use and external knowledge bases. Nevertheless, for chest X-ray (CXR) interpretation, prevailing methods remain limited: (i) reasoning is frequently neither clinically interpretable nor aligned with guidelines, reflecting mere aggregation of tool outputs; (ii) multimodal evidence is insufficiently fused, yielding text-only rationales that are not visually grounded; and (iii) systems rarely detect or resolve cross-tool inconsistencies and provide no principled verification mechanisms. To bridge the above gaps, we present RadAgents, a multi-agent framework that couples clinical priors with task-aware multimodal reasoning and encodes a radiologist-style workflow into a modular, auditable pipeline. In addition, we integrate grounding and multimodal retrieval-augmentation to verify and resolve context conflicts, resulting in outputs that are more reliable, transparent, and consistent with clinical practice.

[615] VIL2C: Value-of-Information Aware Low-Latency Communication for Multi-Agent Reinforcement Learning

Qian Zhang, Zhuo Sun, Yao Zhang, Zhiwen Yu, Bin Guo, Jun Zhang

Main category: cs.MA

TL;DR: VIL2C is a communication scheme that reduces latency effects in multi-agent reinforcement learning by prioritizing high-value messages and adaptively adjusting reception timing.

Details

Motivation: Communication latency in practical MARL systems causes action delays and outdated information sharing, which hinders performance in time-critical applications like autonomous driving.

Method: Proposes a Value-of-Information (VoI) metric to quantify message importance, a progressive message reception mechanism, and VoI-aware resource allocation to optimize the latency distribution.

Result: Extensive experiments show VIL2C outperforms existing approaches under various communication conditions, achieving performance gains through low-latency transmission of high-VoI messages.

Conclusion: VIL2C effectively mitigates communication latency effects in MARL systems by prioritizing important messages and eliminating unnecessary waiting periods, demonstrating superior performance across different communication scenarios.

Abstract: Inter-agent communication serves as an effective mechanism for enhancing performance in collaborative multi-agent reinforcement learning (MARL) systems. However, the inherent communication latency in practical systems induces both action decision delays and outdated information sharing, impeding MARL performance gains, particularly in time-critical applications like autonomous driving. In this work, we propose a Value-of-Information aware Low-latency Communication (VIL2C) scheme that proactively adjusts the latency distribution to mitigate its effects in MARL systems. Specifically, we define a Value of Information (VoI) metric to quantify the importance of delayed message transmission based on each delayed message’s importance. Moreover, we propose a progressive message reception mechanism to adaptively adjust the reception duration based on received messages. We derive the optimized VoI-aware resource allocation and theoretically prove the performance advantage of the proposed VIL2C scheme. Extensive experiments demonstrate that VIL2C outperforms existing approaches under various communication conditions. These gains are attributed to the low-latency transmission of high-VoI messages via resource allocation and the elimination of unnecessary waiting periods via adaptive reception duration.
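
Illustrative sketch (not the VIL2C algorithm): the snippet shows one simple way the two ideas in the abstract could interact. Bandwidth is split in proportion to each message's VoI, and the receiver stops waiting once the messages already received carry a chosen share of the total VoI. The proportional rule, the `coverage` threshold, and all function names are assumptions; the paper derives its own optimized allocation.

```python
from typing import List


def allocate_bandwidth(voi: List[float], total_bandwidth: float) -> List[float]:
    """Split bandwidth in proportion to VoI (assumes every VoI is positive)."""
    total = sum(voi)
    if total <= 0:
        raise ValueError("VoI values must sum to a positive number")
    return [total_bandwidth * v / total for v in voi]


def latencies(sizes: List[float], bandwidth: List[float]) -> List[float]:
    """Transmission time of each message: size divided by its bandwidth share."""
    return [s / b for s, b in zip(sizes, bandwidth)]


def reception_deadline(voi: List[float], latency: List[float],
                       coverage: float = 0.5) -> float:
    """Earliest time at which received messages carry >= coverage of total VoI."""
    if not voi:
        return 0.0
    target = coverage * sum(voi)
    received = 0.0
    for lat, v in sorted(zip(latency, voi)):  # messages in arrival order
        received += v
        if received >= target:
            return lat
    return max(latency)


voi = [8.0, 1.0, 1.0]
sizes = [10.0, 10.0, 10.0]
voi_aware = latencies(sizes, allocate_bandwidth(voi, 10.0))
uniform = latencies(sizes, [10.0 / 3] * 3)
deadline_voi_aware = reception_deadline(voi, voi_aware)
deadline_uniform = reception_deadline(voi, uniform)
```

In this toy setting the high-VoI message arrives first under the VoI-aware split, so the receiver can act well before the low-value messages finish transmitting.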

cs.MM

Xiangyu Zhao, Yaling Shen, Yiwen Jiang, Zimu Wang, Jiahe Liu, Maxmartwell H Cheng, Guilherme C Oliveira, Robert Desimone, Dominic Dwyer, Zongyuan Ge

Main category: cs.MM

TL;DR: A novel multi-modal LLM framework for depression detection that combines audio and visual modalities with fine-grained timestamp-level alignment, outperforming previous methods on the DAIC-WoZ dataset.

Details

Motivation: Depression is a prevalent mental health disorder, and while LLMs have advanced AI-assisted assessment, conventional text-centric LLMs cannot process critical non-verbal cues from audio and visual modalities needed for comprehensive mental health evaluation.

Method: Augments an audio language model with visual understanding and aligns audio-visual features at the timestamp level, enabling fine-grained modeling of temporal dynamics across modalities while reducing data and computational requirements.

Result: Experiments on DAIC-WoZ dataset show the model outperforms both single-modality approaches and previous multi-modal methods for depression detection.

Conclusion: The proposed framework effectively integrates multi-modal data for depression assessment and can be extended to incorporate additional physiological signals, enabling broader clinical applications beyond mental health.

Abstract: Depression is one of the most prevalent mental health disorders globally. In recent years, multi-modal data, such as speech, video, and transcripts, has been increasingly used to develop AI-assisted depression assessment systems. Large language models have further advanced this field due to their strong language understanding and generalization capabilities. However, conventional LLMs remain text-centric and cannot process the rich non-verbal cues found in audio and visual modalities, which are critical components in mental health evaluation. While multi-modal LLMs offer a promising direction, few are tailored for psychological applications. In this study, we propose a novel multi-modal LLM framework for depression detection. Our approach augments an audio language model with visual understanding and aligns audio-visual features at the timestamp level. This fine-grained alignment improves modeling of temporal dynamics across modalities while reducing the need for extensive training data and computational resources. Experiments on the DAIC-WoZ dataset demonstrate that our model outperforms both single-modality approaches and previous multi-modal methods. Moreover, the proposed framework can be extended to incorporate additional physiological signals, paving the way for broader clinical applications beyond mental health.

[617] FINE: Factorized multimodal sentiment analysis via mutual INformation Estimation

Yadong Liu, Shangfei Wang

Main category: cs.MM

TL;DR: A factorized multimodal fusion framework that disentangles shared/unique representations, suppresses noise, and uses mutual information optimization to improve sentiment analysis across modalities.

Details

Motivation: Address multimodal heterogeneity issues including asynchronous signals, imbalanced information, and task-irrelevant noise that hinder robust sentiment representation learning.

Method: Disentangles modalities into shared/unique representations, suppresses noise, uses mutual information optimization, and adds auxiliary modules: Mixture of Q-Formers for feature extraction and Dynamic Contrastive Queue for long-term temporal modeling.

Result: Extensive experiments on multiple public datasets show consistent outperformance over existing approaches, validating effectiveness and robustness.

Conclusion: The proposed framework successfully addresses multimodal heterogeneity challenges and improves sentiment analysis through fine-grained decomposition and principled optimization.

Abstract: Multimodal sentiment analysis remains a challenging task due to the inherent heterogeneity across modalities. Such heterogeneity often manifests as asynchronous signals, imbalanced information between modalities, and interference from task-irrelevant noise, hindering the learning of robust and accurate sentiment representations. To address these issues, we propose a factorized multimodal fusion framework that first disentangles each modality into shared and unique representations, and then suppresses task-irrelevant noise within both to retain only sentiment-critical representations. This fine-grained decomposition improves representation quality by reducing redundancy, promoting cross-modal complementarity, and isolating task-relevant sentiment cues. Rather than manipulating the feature space directly, we adopt a mutual information-based optimization strategy to guide the factorization process in a more stable and principled manner. To further support feature extraction and long-term temporal modeling, we introduce two auxiliary modules: a Mixture of Q-Formers, placed before factorization, which uses learnable queries to extract fine-grained affective features from multiple modalities, and a Dynamic Contrastive Queue, placed after factorization, which stores the latest high-level representations for contrastive learning, enabling the model to capture long-range discriminative patterns and improve class-level separability. Extensive experiments on multiple public datasets demonstrate that our method consistently outperforms existing approaches, validating the effectiveness and robustness of the proposed framework.

[618] Signal Processing for Haptic Surface Modeling: a Review

Antonio Luigi Stefani, Niccolò Bisagno, Andrea Rosani, Nicola Conci, Francesco De Natale

Main category: cs.MM

TL;DR: This paper provides a comprehensive survey on haptic surface modeling and data representation, filling a gap in existing literature that has focused more on sensing, rendering, and perception aspects of haptic technology.

Details

Motivation: While haptic feedback is crucial for immersive VR/AR experiences and has been extensively studied in sensing, rendering, and perception, there's a lack of comprehensive research on haptic surface modeling and data representation, which lies between acquisition and rendering stages.

Method: The authors analyze, categorize, and compare existing research papers on haptic surface modeling and data representation from a signal processing perspective, examining the pipeline from sensing to rendering.

Result: The survey provides an organized overview of current approaches to haptic surface modeling, identifies existing gaps in the research landscape, and suggests potential future research directions in this domain.

Conclusion: This work fills an important gap in haptic technology research by systematically reviewing and categorizing approaches to haptic surface modeling and data representation, providing a foundation for future advancements in this critical area of haptic systems.

Abstract: Haptic feedback has been integrated into Virtual and Augmented Reality, complementing acoustic and visual information and contributing to an all-round immersive experience in multiple fields, spanning from the medical domain to entertainment and gaming. Haptic technologies involve complex cross-disciplinary research that encompasses sensing, data representation, interactive rendering, perception, and quality of experience. The standard processing pipeline consists of (I) sensing physical features in the real world using a transducer, (II) modeling and storing the collected information in some digital format, (III) communicating the information, and finally, (IV) rendering the haptic information through appropriate devices, thus producing a user experience (V) perceptually close to the original physical world. Among these areas, sensing, rendering and perception have been deeply investigated and are the subject of different comprehensive surveys available in the literature. In contrast, research dealing with haptic surface modeling and data representation still lacks a comprehensive survey. In this work, we aim to provide an overview of modeling and representation of haptic surfaces from a signal processing perspective, covering the aspects that lie in between haptic information acquisition on one side and rendering and perception on the other side. We analyze, categorize, and compare research papers that address haptic surface modeling and data representation, pointing out existing gaps and possible research directions.

eess.AS

[619] BERT-APC: A Reference-free Framework for Automatic Pitch Correction via Musical Context Inference

Sungjae Kim, Kihyun Na, Jinyoung Choi, Injung Kim

Main category: eess.AS

TL;DR: BERT-APC is a novel reference-free automatic pitch correction system that uses a music language model to correct pitch errors while preserving vocal expressiveness and naturalness, outperforming commercial tools like AutoTune and Melodyne.

Details

Motivation: Existing APC systems either require reference pitches (limiting practical use) or use simple pitch estimation that fails to preserve expressiveness and naturalness in vocal performances.

Method: Uses a stationary pitch predictor to estimate perceived pitch, a context-aware note pitch predictor leveraging a repurposed music language model for intended pitch sequence, and a note-level correction algorithm that preserves intentional pitch deviations. Includes learnable data augmentation for robustness.

Result: Outperformed ROSVOT by 10.49%p on highly detuned samples in raw pitch accuracy. Achieved MOS score of 4.32±0.15, significantly higher than AutoTune (3.22±0.18) and Melodyne (3.08±0.18) while maintaining comparable expressiveness preservation.

Conclusion: First APC model to leverage music language model for reference-free pitch correction with symbolic musical context, demonstrating superior performance in both accuracy and naturalness preservation compared to existing commercial tools.

Abstract: Automatic Pitch Correction (APC) enhances vocal recordings by aligning pitch deviations with the intended musical notes. However, existing APC systems either rely on reference pitches, which limits their practical applicability, or employ simple pitch estimation algorithms that often fail to preserve expressiveness and naturalness. We propose BERT-APC, a novel reference-free APC framework that corrects pitch errors while maintaining the natural expressiveness of vocal performances. In BERT-APC, a novel stationary pitch predictor first estimates the perceived pitch of each note from the detuned singing voice. A context-aware note pitch predictor estimates the intended pitch sequence by leveraging a music language model repurposed to incorporate musical context. Finally, a note-level correction algorithm fixes pitch errors while preserving intentional pitch deviations for emotional expression. In addition, we introduce a learnable data augmentation strategy that improves the robustness of the music language model by simulating realistic detuning patterns. Compared to two recent singing voice transcription models, BERT-APC demonstrated superior performance in note pitch prediction, outperforming the second-best model, ROSVOT, by 10.49%p on highly detuned samples in terms of the raw pitch accuracy. In the MOS test, BERT-APC achieved the highest score of 4.32±0.15, which is significantly higher than those of the widely used commercial APC tools, AutoTune (3.22±0.18) and Melodyne (3.08±0.18), while maintaining a comparable ability to preserve expressive nuances. To the best of our knowledge, this is the first APC model that leverages a music language model to achieve reference-free pitch correction with symbolic musical context. The corrected audio samples of BERT-APC are available online.
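
Illustrative sketch (not the paper's correction algorithm): one simple way to fix a note's pitch while keeping its expressive contour is to transpose the whole note by the gap between the intended pitch and the perceived (stationary) pitch. In BERT-APC both pitches come from learned predictors; here they are plain inputs, and all function names are assumptions.

```python
import math
from typing import List


def hz_to_midi(freq_hz: float) -> float:
    """Convert frequency to a fractional MIDI note number (A4 = 440 Hz = 69)."""
    return 69.0 + 12.0 * math.log2(freq_hz / 440.0)


def midi_to_hz(midi: float) -> float:
    """Convert a fractional MIDI note number back to frequency in Hz."""
    return 440.0 * 2.0 ** ((midi - 69.0) / 12.0)


def correct_note(f0_hz: List[float], stationary_midi: float,
                 intended_midi: float) -> List[float]:
    """Transpose every frame of one note by the same interval.

    A constant shift in semitones moves the note's center onto the intended
    pitch and leaves frame-to-frame deviations (e.g. vibrato) unchanged.
    """
    ratio = 2.0 ** ((intended_midi - stationary_midi) / 12.0)
    return [f * ratio for f in f0_hz]


# A note sung 50 cents flat of A4, with +/-20 cents of vibrato around it.
sung = [midi_to_hz(68.5 + d) for d in (-0.2, 0.0, 0.2)]
fixed = [hz_to_midi(f) for f in correct_note(sung, 68.5, 69.0)]
```

After correction the note is centered on MIDI 69 and the vibrato depth stays the same.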

eess.IV

[620] A model of the Unity High Definition Render Pipeline, with applications to flat-panel and head-mounted display characterization

Richard F. Murray

Main category: eess.IV

TL;DR: Analysis of Unity’s HDRP rendering pipeline reveals non-obvious nonlinearities in material properties and rendered values that affect stimulus control, with solutions provided for gamma-corrected luminance and color display.

Details

Motivation: Game engines like Unity are increasingly used for perceptual experiments but their rendering pipelines contain hidden nonlinearities that can compromise stimulus control and experimental validity.

Method: Mathematical modeling and testing of Unity’s High Definition Render Pipeline (HDRP) to identify nonlinear transformations in material properties and rendering, with development of specialized configuration files for gamma correction.

Result: HDRP contains several non-obvious nonlinear features that must be accounted for, and can be configured to display gamma-corrected luminance and color using provided software tools.

Conclusion: Researchers using Unity for experiments must understand and account for HDRP’s nonlinear rendering characteristics to ensure well-controlled visual stimuli, with available tools enabling proper gamma correction.

Abstract: Game engines such as Unity and Unreal Engine have become popular tools for creating perceptual and behavioral experiments based on complex, interactive scenes. They are often used with flat-panel displays, and also with head-mounted displays. Here I describe and test a mathematical model of luminance and color in Unity’s High Definition Render Pipeline (HDRP). I show that the HDRP has several non-obvious features, such as nonlinearities applied to material properties and rendered values, that must be taken into account in order to show well-controlled stimuli. I also show how the HDRP can be configured to display gamma-corrected luminance and color, and I provide software to create the specialized files needed for gamma correction.
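
Illustrative sketch (not the paper's HDRP model): gamma correction in its textbook form assumes a display whose luminance follows a power law of the normalized input value and inverts that law to find the input needed for a target luminance. The power-law model, the default `l_max` and `gamma` values, and the function names are assumptions; the paper's model additionally accounts for HDRP-specific nonlinearities that are not described in this summary.

```python
def displayed_luminance(value: float, l_max: float = 100.0,
                        gamma: float = 2.2) -> float:
    """Luminance (cd/m^2) of a display obeying L = l_max * value ** gamma."""
    if not 0.0 <= value <= 1.0:
        raise ValueError("value must lie in [0, 1]")
    return l_max * value ** gamma


def gamma_correct(target_luminance: float, l_max: float = 100.0,
                  gamma: float = 2.2) -> float:
    """Normalized input value that produces the target luminance."""
    if not 0.0 <= target_luminance <= l_max:
        raise ValueError("target luminance must lie in [0, l_max]")
    return (target_luminance / l_max) ** (1.0 / gamma)


# Round trip: request 50 cd/m^2, then check what the modeled display shows.
value = gamma_correct(50.0)
shown = displayed_luminance(value)
```

In practice the exponent and peak luminance would be measured with a photometer for each display rather than assumed.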

[621] Not Quite Anything: Overcoming SAMs Limitations for 3D Medical Imaging

Keith Moore

Main category: eess.IV

TL;DR: A compositional approach that uses foundation segmentation models (SAM-2) as additional input channels for brain MRI segmentation, avoiding fine-tuning while achieving 96% accuracy for basal ganglia segmentation.

Details

Motivation: Foundation models like SAM and SAM-2 perform poorly on brain MRIs due to low contrast and lack of sharp boundaries in structures like caudate and thalamus. Fine-tuning approaches like MedSAM require model modification.

Method: Uses SAM-2 output as additional input channel alongside MRI. Generates prompts via lightweight 3D U-Net trained on MRI segmentation, with edge smoothing for alignment. Also tests DINO attention maps for prompt-free segmentation.

Result: Achieves ~96% volume accuracy on basal ganglia segmentation, sufficient for studying longitudinal volume changes. Approach is fast, label-efficient, and robust to out-of-distribution scans.

Conclusion: The compositional architecture avoids modifying foundation model weights, adapts to domain shift without retraining, and successfully applied to study inflammation-linked changes in pediatric OCD.

Abstract: Foundation segmentation models such as SAM and SAM-2 perform well on natural images but struggle with brain MRIs where structures like the caudate and thalamus lack sharp boundaries and have low contrast. Rather than fine-tune these models (for example MedSAM), we propose a compositional alternative where the foundation model output is treated as an additional input channel and passed alongside the MRI to highlight regions of interest. We generate SAM-2 prompts by using a lightweight 3D U-Net that was previously trained on MRI segmentation. The U-Net may have been trained on a different dataset, so its guesses are often imprecise but usually in the correct region. The edges of the resulting foundation model guesses are smoothed to improve alignment with the MRI. We also test prompt-free segmentation using DINO attention maps in the same framework. This has-a architecture avoids modifying foundation weights and adapts to domain shift without retraining the foundation model. It reaches about 96 percent volume accuracy on basal ganglia segmentation, which is sufficient for our study of longitudinal volume change. The approach is fast, label-efficient, and robust to out-of-distribution scans. We apply it to study inflammation-linked changes in sudden-onset pediatric OCD.

[622] A Multi-Stage Deep Learning Framework with PKCP-MixUp Augmentation for Pediatric Liver Tumor Diagnosis Using Multi-Phase Contrast-Enhanced CT

Wanqi Wang, Chun Yang, Jianbo Shao, Yaokai Zhang, Xuehua Peng, Jin Sun, Chao Xiong, Long Lu, Lianting Hu

Main category: eess.IV

TL;DR: A multi-stage deep learning framework using multi-phase CT scans for non-invasive diagnosis of pediatric liver tumors, achieving high accuracy in tumor detection and classification while addressing data scarcity challenges.

Details

Motivation: Current invasive biopsy methods for pediatric liver tumors pose risks due to vascular liver tissue, require anesthesia for young children, and increase costs/trauma. AI solutions for this specific pediatric application are lacking.

Method: Multi-stage DL framework with PKCP-MixUp data augmentation for class imbalance, tumor detection model for ROI extraction, and two-stage diagnosis pipeline with three backbones using ROI-masked CT images.

Result: Tumor detection achieved mAP=0.871; benign vs malignant classification reached AUC=0.989; benign subtype classification AUC=0.915; malignant subtype classification AUC=0.979. Robust performance across retrospective and prospective cohorts.

Conclusion: The framework fills pediatric-specific DL diagnostic gap, provides insights for CT phase selection and model design, and enables precise, accessible non-invasive diagnosis of pediatric liver tumors.

Abstract: Pediatric liver tumors are one of the most common solid tumors in pediatrics, with differentiation of benign or malignant status and pathological classification critical for clinical treatment. While pathological examination is the gold standard, the invasive biopsy has notable limitations: the highly vascular pediatric liver and fragile tumor tissue raise complication risks such as bleeding; additionally, young children with poor compliance require anesthesia for biopsy, increasing medical costs or psychological trauma. Although many efforts have been made to utilize AI in clinical settings, most researchers have overlooked its importance in pediatric liver tumors. To establish a non-invasive examination procedure, we developed a multi-stage deep learning (DL) framework for automated pediatric liver tumor diagnosis using multi-phase contrast-enhanced CT. Two retrospective and prospective cohorts were enrolled. We established a novel PKCP-MixUp data augmentation method to address data scarcity and class imbalance. We also trained a tumor detection model to extract ROIs, and then set a two-stage diagnosis pipeline with three backbones with ROI-masked images. Our tumor detection model has achieved high performance (mAP=0.871), and the first stage classification model between benign and malignant tumors reached an excellent performance (AUC=0.989). Final diagnosis models also exhibited robustness, including benign subtype classification (AUC=0.915) and malignant subtype classification (AUC=0.979). We also conducted multi-level comparative analyses, such as ablation studies on data and training pipelines, as well as Shapley-Value and CAM interpretability analyses. This framework fills the pediatric-specific DL diagnostic gap, provides actionable insights for CT phase selection and model design, and paves the way for precise, accessible pediatric liver tumor diagnosis.
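
Illustrative sketch (plain MixUp, not PKCP-MixUp): the summary does not describe how PKCP-MixUp differs from standard MixUp, so the snippet shows only the standard technique, which blends two samples and their one-hot labels with a weight drawn from a Beta distribution. The function name, the `alpha` default, and the flat-list inputs are assumptions.

```python
import random
from typing import List, Tuple


def mixup(x1: List[float], y1: List[float], x2: List[float], y2: List[float],
          alpha: float = 0.4, rng=random) -> Tuple[List[float], List[float], float]:
    """Blend two (features, one-hot label) pairs with weight lam ~ Beta(alpha, alpha)."""
    lam = rng.betavariate(alpha, alpha)
    x = [lam * a + (1.0 - lam) * b for a, b in zip(x1, x2)]
    y = [lam * a + (1.0 - lam) * b for a, b in zip(y1, y2)]
    return x, y, lam


# Mix a class-0 sample with a class-1 sample using a seeded generator.
x_mix, y_mix, lam = mixup([0.0, 0.0], [1.0, 0.0],
                          [1.0, 1.0], [0.0, 1.0],
                          rng=random.Random(0))
```

The mixed label remains a valid probability vector, so minority-class samples can contribute soft labels to many synthetic examples.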

[623] The Selective Disk Bispectrum and Its Inversion, with Application to Multi-Reference Alignment

Adele Myers, Nina Miolane

Main category: eess.IV

TL;DR: The paper introduces the selective disk bispectrum, a fast and invertible rotation-invariant representation for image shape analysis that enables practical learning tasks on rotated images.

Details

Motivation: To create a rotation-invariant representation of images that retains all shape information while disregarding object orientation, which is valuable for many computer vision tasks where orientation is irrelevant.

Method: Derived an explicit inverse for the disk bispectrum and defined a ‘selective’ version that uses only the minimal number of coefficients needed for faithful shape recovery, addressing previous limitations of cubic complexity and lack of invertibility.

Result: The selective disk bispectrum enables multi-reference alignment for rotated images, a task previously intractable for disk bispectrum methods, and provides a computationally efficient representation for large datasets.

Conclusion: The disk bispectrum is established as a practical and theoretically grounded tool for learning on rotation-invariant shape data, with applications in computer vision and shape analysis.

Abstract: In many computer vision and shape analysis tasks, practitioners are interested in learning from the shape of the object in an image, while disregarding the object’s orientation. To this end, it is valuable to define a rotation-invariant representation of images, retaining all information about that image, but disregarding the way an object is rotated in the frame. To be practical for learning tasks, this representation must be computationally efficient for large datasets and invertible, so the representation can be visualized in image space. To this end, we present the selective disk bispectrum: a fast, rotation-invariant representation for image shape analysis. While the translational bispectrum has long been used as a translation-invariant representation for 1-D and 2-D signals, its extension to 2-D (disk) rotational invariance on images has been hindered by the absence of an invertible formulation and its cubic complexity. In this work, we derive an explicit inverse for the disk bispectrum, which allows us to define a “selective” disk bispectrum, which only uses the minimal number of coefficients needed for faithful shape recovery. We show that this representation enables multi-reference alignment for rotated images, a task previously intractable for disk bispectrum methods. These results establish the disk bispectrum as a practical and theoretically grounded tool for learning on rotation-invariant shape data.
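
Illustrative sketch (the classical 1-D translational bispectrum, not the disk bispectrum): the abstract cites the translational bispectrum as the precursor of the paper's construction. For a discrete signal with Fourier coefficients F, it is B(k1, k2) = F(k1) F(k2) conj(F(k1 + k2)), and a cyclic shift of the signal leaves it unchanged because the phase factors cancel. The code computes the full, non-selective bispectrum with a naive DFT; function names are assumptions.

```python
import cmath
from typing import List


def dft(x: List[float]) -> List[complex]:
    """Naive discrete Fourier transform, O(n^2)."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]


def bispectrum(x: List[float]) -> List[List[complex]]:
    """B[k1][k2] = F[k1] * F[k2] * conj(F[(k1 + k2) mod n])."""
    f = dft(x)
    n = len(x)
    return [[f[k1] * f[k2] * f[(k1 + k2) % n].conjugate() for k2 in range(n)]
            for k1 in range(n)]


signal = [1.0, 2.0, 0.0, -1.0, 3.0]
shifted = signal[2:] + signal[:2]  # same signal, cyclically shifted by two samples

b_original = bispectrum(signal)
b_shifted = bispectrum(shifted)
max_diff = max(abs(a - b)
               for row_a, row_b in zip(b_original, b_shifted)
               for a, b in zip(row_a, row_b))
```

The full bispectrum stores n^2 coefficients for an n-sample signal, which is the kind of redundancy a "selective" variant aims to avoid.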

[624] DLADiff: A Dual-Layer Defense Framework against Fine-Tuning and Zero-Shot Customization of Diffusion Models

Jun Jia, Hongyi Miao, Yingjie Zhou, Linhan Cao, Yanwei Jiang, Wangqiu Zhou, Dandan Zhu, Hua Yang, Wei Sun, Xiongkuo Min, Guangtao Zhai

Main category: eess.IV

TL;DR: DLADiff is a dual-layer defense method that protects facial privacy against both fine-tuning and zero-shot diffusion model attacks, outperforming existing approaches.

Details

Motivation: Current defense methods focus on fine-tuning attacks but neglect zero-shot generation, creating a gap in facial privacy protection against modern diffusion model threats.

Method: Uses dual-layer protection: first layer employs Dual-Surrogate Models and Alternating Dynamic Fine-Tuning with adversarial training; second layer provides simple but effective zero-shot defense.

Result: Significantly outperforms existing methods in defending against fine-tuning attacks and achieves unprecedented performance against zero-shot generation methods.

Conclusion: DLADiff provides comprehensive protection against both fine-tuning and zero-shot diffusion model customization attacks, addressing a critical gap in facial privacy defense.

Abstract: With the rapid advancement of diffusion models, a variety of fine-tuning methods have been developed, enabling high-fidelity image generation with high similarity to the target content using only 3 to 5 training images. More recently, zero-shot generation methods have emerged, capable of producing highly realistic outputs from a single reference image without altering model weights. However, these technological advancements have also introduced significant risks to facial privacy. Malicious actors can exploit diffusion model customization with just a few or even one image of a person to create synthetic identities nearly identical to the original identity. Although research has begun to focus on defending against diffusion model customization, most existing defense methods target fine-tuning approaches and neglect zero-shot generation defenses. To address this issue, this paper proposes Dual-Layer Anti-Diffusion (DLADiff) to defend against both fine-tuning methods and zero-shot methods. DLADiff contains a dual-layer protective mechanism. The first layer provides effective protection against unauthorized fine-tuning by leveraging the proposed Dual-Surrogate Models (DSUR) mechanism and Alternating Dynamic Fine-Tuning (ADFT), which integrates adversarial training with the prior knowledge derived from pre-fine-tuned models. The second layer, though simple in design, demonstrates strong effectiveness in preventing image generation through zero-shot methods. Extensive experimental results demonstrate that our method significantly outperforms existing approaches in defending against fine-tuning of diffusion models and achieves unprecedented performance in protecting against zero-shot generation.

[625] Development of a fully deep learning model to improve the reproducibility of sector classification systems for predicting unerupted maxillary canine likelihood of impaction

Marzio Galdi, Davide Cannatà, Flavia Celentano, Luigia Rizzo, Domenico Rossi, Tecla Bocchino, Stefano Martina

Main category: eess.IV

TL;DR: Deep learning model (DenseNet121) developed to classify unerupted maxillary canine positions, achieving 76.8% overall accuracy across three classes and aiming to improve the reproducibility of sector classification systems.

Details

Motivation: To address the variability in intra- and inter-operator reproducibility when classifying unerupted maxillary canines using sector classification systems, which can affect treatment planning.

Method: Compared human classification (3 orthodontists, 3 general dentists) on 306 radiographs using 3 different sector systems (5-, 4-, 3-sector), then tested AI models pre-trained on 1,222 radiographs to identify the best-performing model based on sensitivity and precision.

Result: 3-sector system showed highest human reproducibility (Cohen’s K: 0.80-0.92), educational background didn’t affect agreement. DenseNet121 achieved 76.8% overall accuracy in classifying impacted canines across three classes.

Conclusion: AI models can effectively automate the classification of unerupted maxillary canine positions, providing consistent and reproducible results compared to human assessment.

Abstract: Objectives. The aim of the present study was to develop a fully deep learning model to improve the intra- and inter-operator reproducibility of sector classification systems for predicting unerupted maxillary canine likelihood of impaction. Methods. Three orthodontists (Os) and three general dental practitioners (GDPs) classified the position of unerupted maxillary canines on 306 radiographs (T0) according to three different sector classification systems (5-, 4-, and 3-sector classification systems). The assessment was repeated after four weeks (T1). Intra- and inter-observer agreement were evaluated with Cohen’s K and Fleiss K, and between-group differences with a z-test. The same radiographs were tested on different artificial intelligence (AI) models, pre-trained on an extended dataset of 1,222 radiographs. The best-performing model was identified based on its sensitivity and precision. Results. The 3-sector system was found to be the classification method with the highest reproducibility, with agreement (Cohen’s K values) between observations (T0 versus T1) for each examiner ranging from 0.80 to 0.92, and an overall agreement of 0.85 [95% confidence interval (CI) = 0.83-0.87]. The overall inter-observer agreement (Fleiss K) ranged from 0.69 to 0.7. The educational background did not affect either intra- or inter-observer agreement (p>0.05). DenseNet121 proved to be the best-performing model in allocating impacted canines in the three different classes, with an overall accuracy of 76.8%. Conclusion. AI models can be designed to automatically classify the position of unerupted maxillary canines.
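The intra-observer agreement above is reported as Cohen’s K between the T0 and T1 readings. As a minimal sketch of how that statistic is computed from two sets of labels, the example below uses the standard definition (observed agreement corrected for the agreement expected by chance). The sector labels are made-up values for illustration, not data from the study.

```python
from collections import Counter


def cohens_kappa(ratings_a, ratings_b):
    """Cohen's kappa for two equally long sequences of categorical labels."""
    if len(ratings_a) != len(ratings_b) or not ratings_a:
        raise ValueError("ratings must be non-empty and of equal length")
    n = len(ratings_a)
    # Fraction of items on which the two readings agree.
    observed = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    # Agreement expected if the two readings were statistically independent.
    counts_a = Counter(ratings_a)
    counts_b = Counter(ratings_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both readings use a single identical label throughout
    return (observed - expected) / (1.0 - expected)


# Hypothetical 3-sector labels for ten radiographs at T0 and again at T1.
t0 = [1, 1, 2, 2, 3, 3, 1, 2, 3, 1]
t1 = [1, 1, 2, 2, 3, 3, 1, 2, 3, 2]

kappa = cohens_kappa(t0, t1)
print(round(kappa, 3))
```

Here nine of ten labels match (observed agreement 0.90) while chance agreement is 0.33, so kappa is noticeably lower than the raw agreement rate.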

[626] A Visual Perception-Based Tunable Framework and Evaluation Benchmark for H.265/HEVC ROI Encryption

Xiang Zhang, Geng Wu, Wenbin Huang, Daoyong Fu, Fei Peng, Zhangjie Fu

Main category: eess.IV

TL;DR: A visual perception-based tunable framework and evaluation benchmark for H.265/HEVC ROI selective encryption that addresses flexibility issues and provides standardized performance evaluation.

Details

Motivation: Existing ROI-based video encryption methods suffer from insufficient flexibility and lack of a unified evaluation system, limiting their practical application.

Method: Proposes a three-component framework: 1) ROI region recognition using visual perception network, 2) Three-level tunable encryption strategy balancing security and real-time performance, 3) Unified ROI encryption evaluation benchmark.

Result: Experimental results show the benchmark comprehensively measures ROI selective encryption performance, and the proposed enhanced and advanced level encryption outperforms existing algorithms in multiple metrics.

Conclusion: The framework effectively meets privacy protection requirements in H.265/HEVC and provides reliable solution for secure and efficient processing of sensitive video content.

Abstract: ROI selective encryption, as an efficient privacy protection technique, encrypts only the key regions in the video, thereby ensuring security while minimizing the impact on coding efficiency. However, existing ROI-based video encryption methods suffer from insufficient flexibility and lack of a unified evaluation system. To address these issues, we propose a visual perception-based tunable framework and evaluation benchmark for H.265/HEVC ROI encryption. Our scheme introduces three key contributions: 1) An ROI region recognition module based on a visual perception network is proposed to accurately identify the ROI region in videos. 2) A three-level tunable encryption strategy is implemented that balances security and real-time performance. 3) A unified ROI encryption evaluation benchmark is developed to provide a standardized quantitative platform for subsequent research. Together, these three components provide a new solution and a unified performance evaluation method for the field of ROI selective encryption. Experimental results indicate that the proposed benchmark can comprehensively measure the performance of ROI selective encryption. Compared to existing ROI encryption algorithms, our proposed enhanced-level and advanced-level encryption exhibit superior performance in multiple performance metrics. In general, the proposed framework effectively meets the privacy protection requirements in H.265/HEVC and provides a reliable solution for secure and efficient processing of sensitive video content.

[627] Enhancing Medical Image Analysis through Geometric and Photometric transformations

Khadija Rais, Mohamed Amroune, Mohamed Yassine Haouam, Abdelmadjid Benmachiche

Main category: eess.IV

TL;DR: Data augmentation significantly improves medical image analysis performance, boosting skin cancer classification accuracy from 90.74% to 96.88% and increasing retinal vessel segmentation Dice coefficient from 0 to 0.4163.

Details

Motivation: Medical image analysis faces challenges with limited labeled data due to patient privacy concerns and scarcity of medical experts, necessitating data augmentation solutions to enhance model performance.

Method: Applied transformation techniques to skin cancer dataset and trained CNN before/after augmentation; used Mixup technique on retina blood vessels dataset with U-net model for segmentation.

Result: Skin cancer classification: test accuracy improved from 90.74% to 96.88%, test loss decreased from 0.7921 to 0.1468. Retinal vessel segmentation: Dice coefficient increased from 0 to 0.4163.

Conclusion: Data augmentation effectively increases dataset size and significantly improves both classification and segmentation performance in medical image analysis.

Abstract: Medical image analysis suffers from a lack of labeled data due to several challenges including patient privacy and lack of experts. Because some AI models perform well only with large amounts of data, we turn to data augmentation, which can improve model performance and increase the dataset size through traditional or advanced techniques. In this paper, we evaluate the effectiveness of data augmentation techniques on two different medical image datasets. In the first step, we applied some transformation techniques to the skin cancer dataset containing benign and malignant classes. We then trained a convolutional neural network (CNN) on the dataset before and after augmentation; augmentation significantly improved test accuracy from 90.74% to 96.88% and decreased test loss from 0.7921 to 0.1468. In the second step, we used the Mixup technique, mixing two random images and their corresponding masks from the retina blood vessels dataset, then trained the U-net model and obtained a Dice coefficient that increased from 0 before augmentation to 0.4163 after augmentation. The results show the effect of using data augmentation to increase the dataset size on classification and segmentation performance.
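The second experiment mixes two images together with their masks and scores the result with the Dice coefficient. A minimal sketch of both steps follows, assuming images and masks are nested lists of floats and that the mixing weight is drawn from a Beta distribution as in the usual Mixup formulation. The parameter `alpha`, the function names, and the toy arrays are assumptions, not settings reported in the paper.

```python
import random


def blend(first, second, lam):
    """Pixel-wise convex combination lam * first + (1 - lam) * second."""
    return [
        [lam * x + (1.0 - lam) * y for x, y in zip(row_a, row_b)]
        for row_a, row_b in zip(first, second)
    ]


def mixup(image_a, mask_a, image_b, mask_b, lam):
    """Blend two images and their masks with the same weight lam in [0, 1]."""
    return blend(image_a, image_b, lam), blend(mask_a, mask_b, lam)


def dice(pred, target, eps=1e-7):
    """Soft Dice coefficient: 2 * sum(P * T) / (sum(P) + sum(T))."""
    overlap = sum(p * t for rp, rt in zip(pred, target) for p, t in zip(rp, rt))
    total = sum(sum(row) for row in pred) + sum(sum(row) for row in target)
    return (2.0 * overlap + eps) / (total + eps)


# Toy 2x2 images and binary vessel masks (illustrative values only).
image_a, mask_a = [[0.0, 4.0], [2.0, 2.0]], [[0.0, 1.0], [1.0, 1.0]]
image_b, mask_b = [[8.0, 0.0], [2.0, 6.0]], [[1.0, 0.0], [0.0, 1.0]]

alpha = 0.4                             # assumed Beta(alpha, alpha) parameter
lam = random.betavariate(alpha, alpha)  # mixing weight in [0, 1]
mixed_image, mixed_mask = mixup(image_a, mask_a, image_b, mask_b, lam)

empty = [[0.0, 0.0], [0.0, 0.0]]
print(dice(mask_a, mask_a), dice(empty, mask_a))
```

A Dice value near 0, as in the second print argument, means the prediction and the reference mask share essentially no foreground pixels.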

[628] Improving Lesion Segmentation in Medical Images by Global and Regional Feature Compensation

Chuhan Wang, Zhenghao Chen, Jean Y. H. Yang, Jinman Kim

Main category: eess.IV

TL;DR: Proposes GRCSF framework with Global Compensation Unit (GCU) and Region Compensation Unit (RCU) to improve medical lesion segmentation by preserving global context and using self-supervised residual maps for precise localization.

Details

Motivation: Existing methods suffer from information loss during downsampling and insufficient capture of both regional and global features, leading to suboptimal performance on complex lesion segmentation.

Method: GRCSF introduces GCU to preserve global contextual features during downsampling, and RCU that uses MAE-generated SSL residual maps with patch-based cross-attention and importance scoring for lesion localization.

Result: Outperforms state-of-the-art methods on brain stroke lesion and coronary artery calcification datasets, demonstrating effectiveness across diverse lesion types.

Conclusion: GRCSF provides a generalizable lesion segmentation solution that effectively addresses both global and regional feature representation challenges in medical image analysis.

Abstract: Automated lesion segmentation of medical images has made tremendous improvements in recent years due to deep learning advancements. However, accurately capturing fine-grained global and regional feature representations remains a challenge. Many existing methods obtain suboptimal performance on complex lesion segmentation due to information loss during typical downsampling operations and the insufficient capture of either regional or global features. To address these issues, we propose the Global and Regional Compensation Segmentation Framework (GRCSF), which introduces two key innovations: the Global Compensation Unit (GCU) and the Region Compensation Unit (RCU). The proposed GCU addresses resolution loss in the U-shaped backbone by preserving global contextual features and fine-grained details during multiscale downsampling. Meanwhile, the RCU introduces a self-supervised learning (SSL) residual map generated by Masked Autoencoders (MAE), obtained as pixel-wise differences between reconstructed and original images, to highlight regions with potential lesions. These SSL residual maps guide precise lesion localization and segmentation through a patch-based cross-attention mechanism that integrates regional spatial and pixel-level features. Additionally, the RCU incorporates patch-level importance scoring to enhance feature fusion by leveraging global spatial information from the backbone. Experiments on two publicly available medical image segmentation datasets, including brain stroke lesion and coronary artery calcification datasets, demonstrate that our GRCSF outperforms state-of-the-art methods, confirming its effectiveness across diverse lesion types and its potential as a generalizable lesion segmentation solution.

[629] Deep learning-derived arterial input function for dynamic brain PET

Junyu Chen, Zirui Jiang, Jennifer M. Coughlin, Ian Cheong, Kelly A. Mills, Martin G. Pomper, Yong Du

Main category: eess.IV

TL;DR: Deep learning framework (DLIF) estimates metabolite-corrected arterial input function directly from PET images without blood sampling, achieving accurate results comparable to invasive methods.

Details

Motivation: Traditional arterial input function measurement requires invasive blood sampling, which is labor-intensive and uncomfortable for patients. Existing non-invasive alternatives often compromise accuracy or still require some blood sampling.

Method: Deep learning framework that estimates metabolite-corrected AIF directly from dynamic PET image sequences using temporal dynamics and basis functions incorporating prior knowledge of typical AIF shapes.

Result: DLIF achieves accurate and robust AIF estimation validated against ground truth measurements, with comparable parametric maps to traditional methods.

Conclusion: DLIF provides a rapid, accurate, and entirely non-invasive alternative to traditional AIF measurement methods for PET imaging.

Abstract: Dynamic positron emission tomography (PET) imaging combined with radiotracer kinetic modeling is a powerful technique for visualizing biological processes in the brain, offering valuable insights into brain functions and neurological disorders such as Alzheimer’s and Parkinson’s diseases. Accurate kinetic modeling relies heavily on the use of a metabolite-corrected arterial input function (AIF), which typically requires invasive and labor-intensive arterial blood sampling. While alternative non-invasive approaches have been proposed, they often compromise accuracy or still necessitate at least one invasive blood sampling. In this study, we present the deep learning-derived arterial input function (DLIF), a deep learning framework capable of estimating a metabolite-corrected AIF directly from dynamic PET image sequences without any blood sampling. We validated DLIF using existing dynamic PET patient data. We compared DLIF and resulting parametric maps against ground truth measurements. Our evaluation shows that DLIF achieves accurate and robust AIF estimation. By leveraging deep learning’s ability to capture complex temporal dynamics and incorporating prior knowledge of typical AIF shapes through basis functions, DLIF provides a rapid, accurate, and entirely non-invasive alternative to traditional AIF measurement methods.

[630] FMPlug: Plug-In Foundation Flow-Matching Priors for Inverse Problems

Yuxiang Wan, Ryan Devera, Wenjie Zhang, Ju Sun

Main category: eess.IV

TL;DR: FMPlug is a plug-in framework that enhances foundation flow-matching priors for solving ill-posed inverse problems using similarity between observed/desired objects and Gaussianity of generative flows.

Details

Motivation: Traditional approaches rely on domain-specific or untrained priors, which limits their effectiveness. FMPlug aims to leverage domain-agnostic foundation models more effectively.

Method: Uses time-adaptive warm-up strategy and sharp Gaussianity regularization to enhance foundation flow-matching priors by exploiting object similarity and generative flow Gaussianity.

Result: Significantly outperforms state-of-the-art methods using foundation FM priors on image super-resolution and Gaussian deblurring tasks.

Conclusion: FMPlug successfully unlocks the true potential of domain-agnostic foundation models for solving ill-posed inverse problems.

Abstract: We present FMPlug, a novel plug-in framework that enhances foundation flow-matching (FM) priors for solving ill-posed inverse problems. Unlike traditional approaches that rely on domain-specific or untrained priors, FMPlug smartly leverages two simple but powerful insights: the similarity between observed and desired objects and the Gaussianity of generative flows. By introducing a time-adaptive warm-up strategy and sharp Gaussianity regularization, FMPlug unlocks the true potential of domain-agnostic foundation models. Our method beats state-of-the-art methods that use foundation FM priors by significant margins, on image super-resolution and Gaussian deblurring.

[631] LoRA-based methods on Unet for transfer learning in Subarachnoid Hematoma Segmentation

Cristian Minoccheri, Matthew Hodgman, Haoyuan Ma, Rameez Merchant, Emily Wittrup, Craig Williamson, Kayvan Najarian

Main category: eess.IV

TL;DR: Transfer learning using LoRA methods outperforms standard Unet fine-tuning for aneurysmal subarachnoid hemorrhage segmentation, with novel CP-LoRA achieving comparable performance using fewer parameters.

Details

Motivation: Aneurysmal SAH has high mortality rates (>30%) and transfer learning from related hematoma types is underexplored. LoRA methods for parameter-efficient transfer learning are rarely applied to CNNs in medical imaging.

Method: Implemented Unet pre-trained on 124 TBI patients, fine-tuned on 30 SAH patients using 3-fold cross-validation. Developed novel CP-LoRA based on tensor CP-decomposition and DoRA variants that decompose weights into magnitude/directional components.

Result: LoRA-based methods consistently outperformed standard fine-tuning. Performance varied by hemorrhage volume, with better accuracy for larger volumes. CP-LoRA achieved comparable performance with significantly fewer parameters. Higher ranks yielded better performance than strictly low-rank adaptations.

Conclusion: Transfer learning between hematoma types is feasible and LoRA-based methods significantly outperform conventional Unet fine-tuning for aneurysmal SAH segmentation.

Abstract: Aneurysmal subarachnoid hemorrhage (SAH) is a life-threatening neurological emergency with mortality rates exceeding 30%. Transfer learning from related hematoma types represents a potentially valuable but underexplored approach. Although Unet architectures remain the gold standard for medical image segmentation due to their effectiveness on limited datasets, Low-Rank Adaptation (LoRA) methods for parameter-efficient transfer learning have been rarely applied to convolutional neural networks in medical imaging contexts. We implemented a Unet architecture pre-trained on computed tomography scans from 124 traumatic brain injury patients across multiple institutions, then fine-tuned on 30 aneurysmal SAH patients from the University of Michigan Health System using 3-fold cross-validation. We developed a novel CP-LoRA method based on tensor CP-decomposition and introduced DoRA variants (DoRA-C, convDoRA, CP-DoRA) that decompose weight matrices into magnitude and directional components. We compared these approaches against existing LoRA methods (LoRA-C, convLoRA) and standard fine-tuning strategies across different modules on a multi-view Unet model. LoRA-based methods consistently outperformed standard Unet fine-tuning. Performance varied by hemorrhage volume, with all methods showing improved accuracy for larger volumes. CP-LoRA achieved comparable performance to existing methods while using significantly fewer parameters. Over-parameterization with higher ranks consistently yielded better performance than strictly low-rank adaptations. This study demonstrates that transfer learning between hematoma types is feasible and that LoRA-based methods significantly outperform conventional Unet fine-tuning for aneurysmal SAH segmentation.
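The DoRA variants mentioned above split a weight into a magnitude and a direction. The sketch below shows that general idea on a small dense matrix: a frozen weight W0 receives a low-rank update B·A, the result is normalised column by column, and a separate magnitude vector rescales each column. It is an assumption-laden illustration of the generic DoRA formulation, not the paper's convolutional or CP-decomposition variants (DoRA-C, convDoRA, CP-DoRA), and all names and values are hypothetical.

```python
import math


def matmul(a, b):
    """Plain matrix product of two nested-list matrices."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def column_norms(w):
    """Euclidean norm of every column of w."""
    return [math.sqrt(sum(row[j] ** 2 for row in w)) for j in range(len(w[0]))]


def dora_weight(w0, b, a, magnitude):
    """Return magnitude[j] * V[:, j] / ||V[:, j]|| with V = W0 + B @ A."""
    delta = matmul(b, a)  # low-rank update, rank = number of columns of b
    v = [[w0[i][j] + delta[i][j] for j in range(len(w0[0]))] for i in range(len(w0))]
    norms = column_norms(v)
    return [
        [magnitude[j] * v[i][j] / norms[j] for j in range(len(v[0]))]
        for i in range(len(v))
    ]


w0 = [[3.0, 0.0], [4.0, 2.0]]   # frozen pre-trained weight (toy values)
b = [[0.0], [0.0]]              # rank-1 factor, zero-initialised
a = [[0.5, -0.5]]               # rank-1 factor
magnitude = column_norms(w0)    # trainable magnitude, initialised to ||W0||

# With B = 0 the adapted weight reproduces the pre-trained weight.
adapted = dora_weight(w0, b, a, magnitude)
print(adapted)
```

Only `b`, `a`, and `magnitude` would be trained in such a scheme, which is where the parameter savings relative to full fine-tuning come from.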

[632] CardioComposer: Leveraging Differentiable Geometry for Compositional Control of Anatomical Diffusion Models

Karim Kadry, Shoaib Goraya, Ajay Manicka, Abdalla Abdelwahed, Naravich Chutisilp, Farhad Nezami, Elazer Edelman

Main category: eess.IV

TL;DR: CardioComposer is a programmable framework that generates 3D cardiovascular anatomy using interpretable ellipsoidal primitives, enabling geometric control through differentiable measurement functions during diffusion model sampling.

Details

Motivation: To address the trade-off between geometric controllability and realism in generative models of 3D cardiovascular anatomy, which is important for clinical research and medical device evaluation.

Method: Uses interpretable ellipsoidal primitives representing geometric attributes, with differentiable measurement functions based on voxel-wise geometric moments for loss-based gradient guidance during diffusion model sampling.

Result: The method can constrain individual geometric attributes in a disentangled manner and provide compositional control over multiple substructures, compatible with various anatomical systems including cardiac, vascular, and skeletal organs.

Conclusion: CardioComposer enables programmable generation of multi-class anatomical label maps with interpretable geometric control while maintaining realism, applicable to diverse anatomical systems.

Abstract: Generative models of 3D cardiovascular anatomy can synthesize informative structures for clinical research and medical device evaluation, but face a trade-off between geometric controllability and realism. We propose CardioComposer: a programmable, inference-time framework for generating multi-class anatomical label maps based on interpretable ellipsoidal primitives. These primitives represent geometric attributes such as the size, shape, and position of discrete substructures. We specifically develop differentiable measurement functions based on voxel-wise geometric moments, enabling loss-based gradient guidance during diffusion model sampling. We demonstrate that these losses can constrain individual geometric attributes in a disentangled manner and provide compositional control over multiple substructures. Finally, we show that our method is compatible with a wide array of anatomical systems containing non-convex substructures, spanning cardiac, vascular, and skeletal organs.
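The measurement functions above are described as voxel-wise geometric moments. As a hedged sketch of what such measurements can look like, the example below computes the zeroth moment (volume), first moment (centroid), and second central moment (covariance) of a soft occupancy grid, then forms a squared-error loss against a target centroid. It is written in plain Python for readability; because every quantity is built from sums and products, the same expressions could be differentiated in an autodiff framework. The grid, the target, and the function names are assumptions, and the paper's exact loss is not specified here.

```python
def geometric_moments(volume):
    """Volume, centroid and covariance of a soft 3-D occupancy grid.

    volume[z][y][x] holds a weight in [0, 1] for one substructure.
    """
    mass = 0.0
    first = [0.0, 0.0, 0.0]
    for z, plane in enumerate(volume):
        for y, row in enumerate(plane):
            for x, w in enumerate(row):
                mass += w
                for axis, coord in enumerate((z, y, x)):
                    first[axis] += w * coord
    if mass == 0.0:
        raise ValueError("empty volume has no centroid")
    centroid = [f / mass for f in first]

    cov = [[0.0] * 3 for _ in range(3)]
    for z, plane in enumerate(volume):
        for y, row in enumerate(plane):
            for x, w in enumerate(row):
                d = (z - centroid[0], y - centroid[1], x - centroid[2])
                for i in range(3):
                    for j in range(3):
                        cov[i][j] += w * d[i] * d[j]
    cov = [[c / mass for c in row] for row in cov]
    return mass, centroid, cov


def centroid_loss(volume, target_centroid):
    """Squared distance between the measured and the target centroid."""
    _, centroid, _ = geometric_moments(volume)
    return sum((c - t) ** 2 for c, t in zip(centroid, target_centroid))


# Toy 4x4x6 grid containing a 2x2x4 box of fully occupied voxels.
grid = [
    [[1.0 if 1 <= z <= 2 and 1 <= y <= 2 and 1 <= x <= 4 else 0.0
      for x in range(6)] for y in range(4)]
    for z in range(4)
]

mass, centroid, cov = geometric_moments(grid)
print(mass, centroid, cov[2][2])
print(centroid_loss(grid, [1.5, 1.5, 3.5]))
```

The covariance is larger along the x axis because the box is elongated in that direction, which is the kind of size and shape information an ellipsoidal primitive can encode.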

[633] Multimodal Optical Imaging Platform for Quantitative Burn Assessment

Nathaniel Hanson, Mateusz Wolak, Jonathan Richardson, Patrick Walker, David M. Burmeister, Chakameh Jafari

Main category: eess.IV

TL;DR: A multimodal optical imaging system combining hyperspectral and laser speckle imaging for quantitative burn severity assessment, enabling deep-tissue analysis through SWIR wavelengths and machine learning.

Details

Motivation: Current lack of objective methods for detecting subsurface tissue damage in burns, especially critical in battlefield and mass-casualty settings where rapid burn depth evaluation is essential for triage and surgical decisions.

Method: Integrated broadband hyperspectral imaging (400-2100 nm) with laser speckle contrast imaging to evaluate biochemical composition and microvascular perfusion. Used SWIR wavelengths (>1000 nm) to develop deep-tissue parameters linked to water, lipid, and collagen absorption features. Implemented unsupervised learning for spectral feature extraction and clustering.

Result: Developed and validated novel deep-tissue parameters that enhance burn-tissue separability and burn severity classification. Established foundation for a compact, low-SWaP field-deployable device.

Conclusion: The multimodal optical imaging framework provides a foundation for rugged, data-driven devices for early quantitative burn evaluation in austere environments, addressing critical clinical challenges in burn assessment.

Abstract: Accurate assessment of burn severity at injury onset remains a major clinical challenge due to the lack of objective methods for detecting subsurface tissue damage. This limitation is critical in battlefield and mass-casualty settings, where rapid and reliable evaluation of burn depth is essential for triage and surgical decision-making. We present a multimodal optical imaging framework that establishes the foundation for a compact, low-size, weight, and power (low-SWaP) field-deployable device for quantitative burn assessment. The system integrates broadband hyperspectral imaging (VSWIR, 400 – 2100 nm) with laser speckle contrast imaging to jointly evaluate biochemical composition and microvascular perfusion. Using short-wave infrared (SWIR, >1000 nm) wavelengths, we developed and validated novel deep-tissue parameters linked to water, lipid, and collagen absorption features that enhance burn-tissue separability and burn severity classification. We implemented and validated unsupervised learning methods for spectral feature extraction, band down-selection, and clustering against histology, establishing a foundation for a rugged, data-driven device for early quantitative burn evaluation in austere environments.

Qi Jiang, Xiaolong Qian, Yao Gao, Lei Sun, Kailun Yang, Zhonghua Yi, Wenyong Li, Ming-Hsuan Yang, Luc Van Gool, Kaiwei Wang

Main category: eess.IV

TL;DR: OmniLens++ is a framework for blind lens aberration correction that addresses data scalability and prior guidance challenges through expanded lens design specifications and a novel Latent PSF Representation using VQVAE.

Details

Motivation: To overcome limitations in existing lens library pre-training pipelines, specifically the difficulty in scaling data and lack of optical degradation prior guidance, which hinder generalization ability for blind aberration correction.

Method: Expands lens design specifications to increase degradation diversity, samples uniform distribution by quantifying spatial-variation patterns, and introduces Latent PSF Representation using VQVAE to learn latent features of PSFs with optical degradation process modeling.

Result: OmniLens++ demonstrates state-of-the-art generalization capacity in blind aberration correction on diverse real-world and synthetic lens aberrations, with AODLibpro verified as scalable foundation and LPR effectively leveraging large-scale LensLib.

Conclusion: The framework successfully addresses key challenges in blind lens aberration correction, providing improved generalization and scalability while enabling more effective training across diverse optical degradations.

Abstract: The emerging deep-learning-based lens library pre-training (LensLib-PT) pipeline offers a new avenue for blind lens aberration correction by training a universal neural network, demonstrating strong capability in handling diverse unknown optical degradations. This work proposes the OmniLens++ framework, which resolves two challenges that hinder the generalization ability of existing pipelines: the difficulty of scaling data and the absence of prior guidance characterizing optical degradation. To improve data scalability, we expand the design specifications to increase the degradation diversity of the lens source, and we sample a more uniform distribution by quantifying the spatial-variation patterns and severity of optical degradation. In terms of model design, to leverage the Point Spread Functions (PSFs), which intuitively describe optical degradation, as guidance in a blind paradigm, we propose the Latent PSF Representation (LPR). The VQVAE framework is introduced to learn latent features of LensLib’s PSFs, assisted by modeling the optical degradation process to constrain the learning of degradation priors. Experiments on diverse aberrations of real-world lenses and synthetic LensLib show that OmniLens++ exhibits state-of-the-art generalization capacity in blind aberration correction. Beyond performance, AODLibpro is verified as a scalable foundation for more effective training across diverse aberrations, and LPR can further tap the potential of large-scale LensLib. The source code and datasets will be made publicly available at https://github.com/zju-jiangqi/OmniLens2.

[635] Shape-Adapting Gated Experts: Dynamic Expert Routing for Colonoscopic Lesion Segmentation

Gia Huy Thai, Hoang-Nguyen Vu, Anh-Minh Phan, Quang-Thinh Ly, Tram Dinh, Thi-Ngoc-Truc Nguyen, Nhat Ho

Main category: eess.IV

TL;DR: SAGE is an input-adaptive framework that enables dynamic expert routing in visual networks to address cellular heterogeneity in cancer detection from Whole Slide Images, achieving state-of-the-art segmentation performance.

Details

Motivation: Address the challenge of cellular heterogeneity in cancer detection from gigapixel WSIs, where existing CNN-Transformer hybrids with static computation graphs cause redundant computation and limit adaptability to input variability.

Method: SAGE framework reconfigures static backbones into dynamically routed expert architectures with dual-path design: backbone stream preserves representation while hierarchical gating selectively activates expert path. Shape-Adapting Hub (SA-Hub) bridges CNN and Transformer modules by harmonizing structural and semantic representations.

Result: Achieved superior segmentation on three medical benchmarks: EBHI (95.57% Dice), DigestPath (95.16% Dice), and GlaS (94.17% Dice), with robust cross-domain generalization by adaptively balancing local refinement and global context.

Conclusion: SAGE provides a scalable foundation for dynamic expert routing that enables flexible visual reasoning, overcoming limitations of static computation graphs in heterogeneous visual networks.

Abstract: The substantial diversity in cell scale and form remains a primary challenge in computer-aided cancer detection on gigapixel Whole Slide Images (WSIs), attributable to cellular heterogeneity. Existing CNN-Transformer hybrids rely on static computation graphs with fixed routing, which consequently causes redundant computation and limits their adaptability to input variability. We propose Shape-Adapting Gated Experts (SAGE), an input-adaptive framework that enables dynamic expert routing in heterogeneous visual networks. SAGE reconfigures static backbones into dynamically routed expert architectures. SAGE’s dual-path design features a backbone stream that preserves representation and selectively activates an expert path through hierarchical gating. This gating mechanism operates at multiple hierarchical levels, performing a two-level, hierarchical selection between shared and specialized experts to modulate model logits for Top-K activation. Our Shape-Adapting Hub (SA-Hub) harmonizes structural and semantic representations across the CNN and the Transformer module, effectively bridging diverse modules. Embodied as SAGE-UNet, our model achieves superior segmentation on three medical benchmarks: EBHI, DigestPath, and GlaS, yielding state-of-the-art Dice Scores of 95.57%, 95.16%, and 94.17%, respectively, and robustly generalizes across domains by adaptively balancing local refinement and global context. SAGE provides a scalable foundation for dynamic expert routing, enabling flexible visual reasoning.
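The gating described above ends in a Top-K activation over expert logits. As a minimal sketch of that final step only, the example below keeps the K largest logits, renormalises them with a softmax, and combines the selected experts' outputs. The two-level selection between shared and specialized experts and the SA-Hub are not modelled, and the scalar "experts" and all names are hypothetical stand-ins.

```python
import math


def top_k_gate(logits, k):
    """Keep the k largest logits and renormalise them with a softmax."""
    if not 0 < k <= len(logits):
        raise ValueError("k must be between 1 and the number of experts")
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    peak = max(logits[i] for i in top)  # subtract the max for numerical stability
    exps = {i: math.exp(logits[i] - peak) for i in top}
    total = sum(exps.values())
    return {i: e / total for i, e in exps.items()}


def route(x, experts, logits, k):
    """Weighted sum of the outputs of the k selected experts."""
    weights = top_k_gate(logits, k)
    return sum(w * experts[i](x) for i, w in weights.items())


# Three toy experts acting on a scalar; only the selected ones are evaluated.
experts = [lambda x: x + 1.0, lambda x: 2.0 * x, lambda x: -x]
logits = [2.0, 2.0, -1.0]  # gate scores that would come from a learned router

weights = top_k_gate(logits, k=2)
output = route(3.0, experts, logits, k=2)
print(weights, output)
```

Because unselected experts are never called, compute scales with K rather than with the total number of experts, which is the motivation for input-dependent routing over a static graph.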

Table of Contents

cs.CL

[1] Mispronunciation Detection and Diagnosis Without Model Training: A Retrieval-Based Approach

[2] Efficient Multi-Hop Question Answering over Knowledge Graphs via LLM Planning and Embedding-Guided Search

[3] Can LLMs Faithfully Explain Themselves in Low-Resource Languages? A Case Study on Emotion Detection in Persian

[4] Comparative Analysis of LoRA-Adapted Embedding Models for Clinical Cardiology Text Representation

[5] What does it mean to understand language?

[6] Gender Bias in Emotion Recognition by Large Language Models

[7] Breaking Bad: Norms for Valence, Arousal, and Dominance for over 10k English Multiword Expressions

[8] Language-Independent Sentiment Labelling with Distant Supervision: A Case Study for English, Sepedi and Setswana

[9] Profile-LLM: Dynamic Profile Optimization for Realistic Personality Expression in LLMs

[10] A Systematic Analysis of Large Language Models with RAG-enabled Dynamic Prompting for Medical Error Detection and Correction

[11] AppSelectBench: Application-Level Tool Selection Benchmark

[12] $\text{R}^2\text{R}$: A Route-to-Rerank Post-Training Framework for Multi-Domain Decoder-Only Rerankers

[13] Directional Optimization Asymmetry in Transformers: A Synthetic Stress Test

[14] A Machine Learning Approach for Detection of Mental Health Conditions and Cyberbullying from Social Media

[15] Online-PVLM: Advancing Personalized VLMs with Online Concept Learning

[16] MTA: A Merge-then-Adapt Framework for Personalized Large Language Model

[17] More Bias, Less Bias: BiasPrompting for Enhanced Multiple-Choice Question Answering

[18] SSA: Sparse Sparse Attention by Aligning Full and Sparse Attention Outputs in Feature Space

[19] EM2LDL: A Multilingual Speech Corpus for Mixed Emotion Recognition through Label Distribution Learning

[20] LightMem: Lightweight and Efficient Memory-Augmented Generation

[21] “When Data is Scarce, Prompt Smarter”… Approaches to Grammatical Error Correction in Low-Resource Settings

[22] Computational Turing Test Reveals Systematic Differences Between Human and AI Language

[23] SEDA: A Self-Adapted Entity-Centric Data Augmentation for Boosting Gird-based Discontinuous NER Models

[24] KyrgyzBERT: A Compact, Efficient Language Model for Kyrgyz NLP

[25] REFLEX: Self-Refining Explainable Fact-Checking via Disentangling Truth into Style and Substance

[26] Scaling LLM Speculative Decoding: Non-Autoregressive Forecasting in Large-Batch Scenarios

[27] The Curious Case of Analogies: Investigating Analogical Reasoning in Large Language Models

[28] BengaliFig: A Low-Resource Challenge for Figurative and Culturally Grounded Reasoning in Bengali

[29] A Task-Oriented Evaluation Framework for Text Normalization in Modern NLP Pipelines

[30] Generation, Evaluation, and Explanation of Novelists’ Styles with Single-Token Prompts

[31] Adversarial Confusion Attack: Disrupting Multimodal Large Language Models

[32] The Text Aphasia Battery (TAB): A Clinically-Grounded Benchmark for Aphasia-Like Deficits in Language Models

[33] Bridging the Language Gap: Synthetic Voice Diversity via Latent Mixup for Equitable Speech Recognition

[34] From Words to Wisdom: Discourse Annotation and Baseline Models for Student Dialogue Understanding

[35] On Evaluating LLM Alignment by Evaluating LLMs as Judges

[36] Latent Collaboration in Multi-Agent Systems

[37] Gram2Vec: An Interpretable Document Vectorizer

[38] BiasJailbreak:Analyzing Ethical Biases and Jailbreak Vulnerabilities in Large Language Models

[39] MedS$^3$: Towards Medical Slow Thinking with Self-Evolved Soft Dual-sided Process Supervision

[40] A Comprehensive Survey on Long Context Language Modeling

[41] Why Reasoning Matters? A Survey of Advancements in Multimodal Reasoning (v1)

[42] From Generation to Detection: A Multimodal Multi-Task Dataset for Benchmarking Health Misinformation

[43] Counterfactual Simulatability of LLM Explanations for Generation Tasks

[44] TurnBench-MS: A Benchmark for Evaluating Multi-Turn, Multi-Step Reasoning in Large Language Models

[45] Improved LLM Agents for Financial Document Question Answering

[46] Health Sentinel: An AI Pipeline For Real-time Disease Outbreak Detection

[47] SAS: Simulated Attention Score

[48] LaajMeter: A Framework for LaaJ Evaluation

[49] ConfTuner: Training Large Language Models to Express Their Confidence Verbally

[50] FlagEval Findings Report: A Preliminary Evaluation of Large Reasoning Models on Automatically Verifiable Textual and Visual Questions

[51] Agentar-Scale-SQL: Advancing Text-to-SQL through Orchestrated Test-Time Scaling

[52] OceanGym: A Benchmark Environment for Underwater Embodied Agents

[53] LiRA: A Multi-Agent Framework for Reliable and Readable Literature Review Generation

[54] Enhancing Reasoning Skills in Small Persian Medical Language Models Can Outperform Large-Scale Data Training

[55] EHR-R1: A Reasoning-Enhanced Foundational Language Model for Electronic Health Record Analysis

[56] AraFinNews: Arabic Financial Summarisation with Domain-Adapted LLMs

[57] Toward Honest Language Models for Deductive Reasoning

[58] MindEval: Benchmarking Language Models on Multi-turn Mental Health Support

[59] HyperbolicRAG: Enhancing Retrieval-Augmented Generation with Hyperbolic Representations

cs.CV

[60] PuzzlePoles: Cylindrical Fiducial Markers Based on the PuzzleBoard Pattern

[61] Personalized Reward Modeling for Text-to-Image Generation

[62] Pistachio: Towards Synthetic, Balanced, and Long-Form Video Anomaly Benchmarks

[63] SG-OIF: A Stability-Guided Online Influence Framework for Reliable Vision Data

[64] Tracking and Segmenting Anything in Any Modality

[65] The Determinant Ratio Matrix Approach to Solving 3D Matching and 2D Orthographic Projection Alignment Tasks

[66] Single Image to High-Quality 3D Object via Latent Features

[67] Fewer Tokens, Greater Scaling: Self-Adaptive Visual Bases for Efficient and Expansive Representation Learning

[68] ExDDV: A New Dataset for Explainable Deepfake Detection in Video

[69] VideoChat-M1: Collaborative Policy Planning for Video Understanding via Multi-Agent Reinforcement Learning

[70] Blinking Beyond EAR: A Stable Eyelid Angle Metric for Driver Drowsiness Detection and Data Augmentation

[71] Connecting the Dots: Training-Free Visual Grounding via Agentic Reasoning

[72] Cross-Domain Generalization of Multimodal LLMs for Global Photovoltaic Assessment

[73] Towards Efficient VLMs: Information-Theoretic Driven Compression via Adaptive Structural Pruning

[74] Perceptual Taxonomy: Evaluating and Guiding Hierarchical Scene Reasoning in Vision-Language Models

[75] MapRF: Weakly Supervised Online HD Map Construction via NeRF-Guided Self-Training

[76] Vidi2: Large Multimodal Models for Video Understanding and Creation