Daily arXiv Papers - 2025-12-08

AI-enhanced summaries of research papers from arXiv

Today’s Research Highlights

AI-enhanced summaries of the latest research papers from arXiv.

Table of Contents

cs.CL

[1] Fine-Tuning BERT for Domain-Specific Question Answering: Toward Educational NLP Resources at University Scale

Aurélie Montfrond

Main category: cs.CL

TL;DR: Fine-tuning BERT on university course materials creates an effective educational chatbot, demonstrating domain-specific adaptation potential for academic QA systems.

Motivation: There's a gap in domain-specific foundation models for university course materials, with existing models like BioBERT and SciBERT focusing on biomedical/scientific literature but none tailored to educational content.

Method: Created custom dataset of 1,203 QA pairs in SQuAD format from university modules, fine-tuned BERT using PyTorch, and evaluated with Exact Match and F1 scores.
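
As a rough illustration of the evaluation, here is a minimal sketch of SQuAD-style Exact Match and token-level F1; the normalization steps (lowercasing, whitespace tokenization) are assumptions, not the paper's exact implementation.

```python
# Minimal sketch of SQuAD-style Exact Match and token-level F1,
# the two metrics the paper reports (normalization details assumed).
from collections import Counter

def exact_match(prediction: str, reference: str) -> float:
    """1.0 if normalized strings match exactly, else 0.0."""
    return float(prediction.strip().lower() == reference.strip().lower())

def f1_score(prediction: str, reference: str) -> float:
    """Token-overlap F1 between predicted and gold answer spans."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    common = Counter(pred_tokens) & Counter(ref_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

print(f1_score("the BERT model", "a BERT model"))  # ~0.67
```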

Result: Modest fine-tuning significantly improved hypothesis framing and knowledge extraction, showing feasibility of adapting foundation models to educational domains.

Conclusion: Fine-tuning BERT with academic QA pairs yields effective results, highlighting potential to scale towards the first domain-specific QA model for universities and enabling autonomous educational knowledge systems.

Abstract: Prior work on scientific question answering has largely emphasized chatbot-style systems, with limited exploration of fine-tuning foundation models for domain-specific reasoning. In this study, we developed a chatbot for the University of Limerick’s Department of Electronic and Computer Engineering to provide course information to students. A custom dataset of 1,203 question-answer pairs in SQuAD format was constructed using the university book of modules, supplemented with manually and synthetically generated entries. We fine-tuned BERT (Devlin et al., 2019) using PyTorch and evaluated performance with Exact Match and F1 scores. Results show that even modest fine-tuning improves hypothesis framing and knowledge extraction, demonstrating the feasibility of adapting foundation models to educational domains. While domain-specific BERT variants such as BioBERT and SciBERT exist for biomedical and scientific literature, no foundation model has yet been tailored to university course materials. Our work addresses this gap by showing that fine-tuning BERT with academic QA pairs yields effective results, highlighting the potential to scale towards the first domain-specific QA model for universities and enabling autonomous educational knowledge systems.

Gili Goldin, Ella Rabinovich, Shuly Wintner

Main category: cs.CL

TL;DR: Proposes emotional style analysis (Valence, Arousal, Dominance) instead of ideological differences to measure affective polarization in political discourse, applied to Israeli parliament proceedings.

Motivation: To address the increase in polarized discourse worldwide by developing a novel method that quantifies polarization based on emotional style rather than ideological differences, recognizing that affective polarization (emotional division) may be more revealing than traditional ideological polarization measures.

Method: Uses measures of Valence (positive/negative), Arousal (calm/excited), and Dominance (weak/strong) to detect emotional discourse signals. Applies this method to analyze proceedings of the Israeli parliament (Knesset) in Hebrew, comparing emotional styles between government and opposition members over time.
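
A minimal sketch of how lexicon-based V/A/D scoring could operationalize affective polarization, assuming a VAD lexicon keyed by token; the lexicon entries and the gap measure below are illustrative placeholders, not the authors' pipeline.

```python
# Hedged sketch: lexicon-based Valence/Arousal/Dominance scoring of a speech,
# one plausible way to operationalize the emotional-style measures above.
import statistics

VAD_LEXICON = {  # hypothetical entries: token -> (valence, arousal, dominance)
    "anger": (0.12, 0.83, 0.60),
    "hope":  (0.90, 0.45, 0.55),
}

def vad_profile(tokens):
    """Average V/A/D over tokens found in the lexicon."""
    hits = [VAD_LEXICON[t] for t in tokens if t in VAD_LEXICON]
    if not hits:
        return None
    v, a, d = zip(*hits)
    return (statistics.mean(v), statistics.mean(a), statistics.mean(d))

# Affective polarization as the gap between group-level profiles:
gov = vad_profile("hope hope anger".split())
opp = vad_profile("anger anger hope".split())
gap = sum(abs(g - o) for g, o in zip(gov, opp))
print(gap)
```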

Result: Found that government and opposition members exhibit different emotional styles in their discourse, and that the level of affective polarization (as measured by emotional style differences) has been significantly increasing over time.

Conclusion: Emotional style analysis provides a novel and effective approach to measuring affective polarization in political discourse, revealing increasing emotional division between political groups that traditional ideological measures might miss.

Abstract: Recent years have seen an increase in polarized discourse worldwide, on various platforms. We propose a novel method for quantifying polarization, based on the emotional style of the discourse rather than on differences in ideological stands. Using measures of Valence, Arousal and Dominance, we detect signals of emotional discourse and use them to operationalize the concept of affective polarization. Applying this method to a recently released corpus of proceedings of the Knesset, the Israeli parliament (in Hebrew), we find that the emotional style of members of government differs from that of opposition members; and that the level of affective polarization, as reflected by this style, is significantly increasing with time.

[3] Decoding the Black Box: Discerning AI Rhetorics About and Through Poetic Prompting

P. D. Edgar, Alia Hall

Main category: cs.CL

TL;DR: Poetry Prompt Patterns can enhance prompt engineering by using creative text prompting to analyze LLM biases and creative capabilities, particularly in evaluating how models adapt original poetic works for different audiences.

Motivation: The paper aims to explore how creative text prompting, specifically poetry-based approaches, can expand prompt engineering techniques to better understand LLM algorithmic tendencies and biases, while also examining how models handle and adapt original creative works.

Method: The study proposes Poetry Prompt Patterns as a methodological addition to prompt engineering, then applies poetic prompts to assess descriptions and evaluations of three models of a renowned poet, testing how models adapt or rewrite original creative works for presumed audiences.

Result: The paper suggests that creative text prompting through Poetry Prompt Patterns is a useful addition to the prompt engineering toolbox, providing insights into how LLMs handle creative content and adapt original works for different audiences.

Conclusion: Poetic prompts offer valuable methodological tools for prompt engineers to analyze LLM behaviors and biases, particularly in understanding how models process, evaluate, and adapt creative literary works for various audiences.

Abstract: Prompt engineering has emerged as a useful way of studying the algorithmic tendencies and biases of large language models. Meanwhile, creatives and academics have leveraged LLMs to develop creative works and explore the boundaries of their writing capabilities through text generation and code. This study suggests that creative text prompting, specifically Poetry Prompt Patterns, may be a useful addition to the toolbox of the prompt engineer, and outlines the process by which this approach may be taken. Then, the paper uses poetic prompts to assess three models' descriptions and evaluations of a renowned poet and test the consequences of the willingness of models to adapt or rewrite original creative works for presumed audiences.

[4] Enhancing Clinical Note Generation with ICD-10, Clinical Ontology Knowledge Graphs, and Chain-of-Thought Prompting Using GPT-4

Ivan Makohon, Mohamad Najafi, Jian Wu, Mathias Brochhausen, Yaohang Li

Main category: cs.CL

TL;DR: LLMs with Chain-of-Thought prompting enhanced by semantic search and clinical knowledge graphs can generate better clinical notes from ICD codes and basic patient info, outperforming standard one-shot prompts.

Motivation: Manual clinical note writing consumes significant physician time, increasing patient wait times and potentially delaying diagnoses. While LLMs can generate human-like text, they need improvement for clinical note generation to be practical in healthcare settings.

Method: Combines Chain-of-Thought prompting with semantic search results and infuses knowledge graphs built from clinical ontology. Uses ICD codes and basic patient information as input, tested on GPT-4 with six clinical cases from CodiEsp test dataset.
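
One plausible way to assemble such a prompt is sketched below; the template wording, field names, and triple format are assumptions rather than the paper's actual prompt.

```python
# Illustrative sketch of assembling a CoT prompt from ICD codes, semantic-search
# hits, and knowledge-graph triples; not the authors' actual prompt template.
def build_prompt(icd_codes, patient, passages, kg_triples):
    kg_text = "\n".join(f"- {s} --{r}--> {o}" for s, r, o in kg_triples)
    context = "\n".join(f"[{i+1}] {p}" for i, p in enumerate(passages))
    return (
        f"Patient: {patient}\n"
        f"ICD-10 codes: {', '.join(icd_codes)}\n"
        f"Relevant clinical knowledge:\n{context}\n"
        f"Ontology relations:\n{kg_text}\n"
        "Let's write the clinical note step by step, covering assessment, "
        "diagnosis, and treatment plan."
    )

print(build_prompt(
    ["E11.9"], "58-year-old male",
    ["Type 2 diabetes mellitus without complications ..."],
    [("Type 2 diabetes", "treated_with", "metformin")],
))
```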

Result: The proposed prompting technique outperformed clinical notes generated by standard one-shot prompts, demonstrating improved quality in clinical note generation.

Conclusion: Enhanced Chain-of-Thought prompting with semantic search and clinical knowledge graphs improves LLM performance for clinical note generation, potentially reducing physician documentation burden and improving healthcare efficiency.

Abstract: The past decade has seen a surge in the amount of electronic health record (EHR) data in the United States, attributed to a favorable policy environment created by the Health Information Technology for Economic and Clinical Health (HITECH) Act of 2009 and the 21st Century Cures Act of 2016. Clinical notes for patients’ assessments, diagnoses, and treatments are captured in these EHRs in free-form text by physicians, who spend a considerable amount of time entering and editing them. Manually writing clinical notes takes a considerable amount of a doctor’s valuable time, increasing the patient’s waiting time and possibly delaying diagnoses. Large language models (LLMs) possess the ability to generate news articles that closely resemble human-written ones. We investigate the usage of Chain-of-Thought (CoT) prompt engineering to improve the LLM’s response in clinical note generation. In our prompts, we use as input International Classification of Diseases (ICD) codes and basic patient information. We investigate a strategy that combines the traditional CoT with semantic search results to improve the quality of generated clinical notes. Additionally, we infuse a knowledge graph (KG) built from clinical ontology to further enrich the domain-specific knowledge of generated clinical notes. We test our prompting technique on six clinical cases from the CodiEsp test dataset using GPT-4 and our results show that it outperformed the clinical notes generated by standard one-shot prompts.

[5] To Think or Not to Think: The Hidden Cost of Meta-Training with Excessive CoT Examples

Vignesh Kothapalli, Ata Fatahibaarzi, Hamed Firooz, Maziar Sanjabi

Main category: cs.CL

TL;DR: CoT-Recipe improves LLM reasoning on novel tasks by optimizing the mix of CoT and non-CoT examples during meta-training, achieving up to 300% accuracy gains even without in-context CoT examples.

Motivation: Chain-of-thought prompting with few-shot ICL works well when LLMs have sufficient pre-training knowledge, but fails on novel tasks where this knowledge is lacking. The paper aims to solve this limitation.

Method: Proposes CoT-Recipe, a formal approach to modulate the mix of CoT and non-CoT examples in meta-training sequences. Uses controlled experiments with CoT-ICL Lab framework and applies techniques to pretrained LLMs (Qwen2.5 series) for symbolic reasoning.
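
The core knob CoT-Recipe exposes, as described, is the CoT/non-CoT mix. Below is a hedged sketch of building a meta-training sequence with such a ratio; the data layout and schedule are assumptions.

```python
# Hedged sketch of the core idea behind CoT-Recipe: controlling the fraction
# of CoT vs. non-CoT examples when building a meta-training sequence.
import random

def build_meta_sequence(examples, cot_ratio: float, seed: int = 0):
    """Each example is (question, chain_of_thought, answer).
    Emit the chain only for ~cot_ratio of the examples."""
    rng = random.Random(seed)
    parts = []
    for q, cot, a in examples:
        if rng.random() < cot_ratio:
            parts.append(f"Q: {q}\nThought: {cot}\nA: {a}")
        else:
            parts.append(f"Q: {q}\nA: {a}")
    return "\n\n".join(parts)

demo = [("2+2?", "2 plus 2 is 4", "4"), ("3*3?", "3 times 3 is 9", "9")]
print(build_meta_sequence(demo, cot_ratio=0.5))
```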

Result: CoT-Recipe increases transformer accuracy on novel tasks by up to 300% even without CoT examples in-context. Applied to pretrained Qwen2.5 LLMs, it achieves up to 130% accuracy gains on symbolic reasoning tasks.

Conclusion: Careful modulation of CoT and non-CoT examples during meta-training significantly improves LLM reasoning on novel tasks, demonstrating that CoT-Recipe effectively addresses the limitations of standard CoT-ICL when pre-training knowledge is insufficient.

Abstract: Chain-of-thought (CoT) prompting combined with few-shot in-context learning (ICL) has unlocked significant reasoning capabilities in large language models (LLMs). However, ICL with CoT examples is ineffective on novel tasks when the pre-training knowledge is insufficient. We study this problem in a controlled setting using the CoT-ICL Lab framework, and propose meta-training techniques to learn novel abstract reasoning tasks in-context. Although CoT examples facilitate reasoning, we noticed that their excessive inclusion during meta-training degrades performance when CoT supervision is limited. To mitigate such behavior, we propose CoT-Recipe, a formal approach to modulate the mix of CoT and non-CoT examples in meta-training sequences. We demonstrate that careful modulation via CoT-Recipe can increase the accuracy of transformers on novel tasks by up to 300% even when there are no CoT examples available in-context. We confirm the broader effectiveness of these techniques by applying them to pretrained LLMs (Qwen2.5 series) for symbolic reasoning tasks and observing gains of up to 130% in accuracy.

[6] LYNX: Learning Dynamic Exits for Confidence-Controlled Reasoning

Ömer Faruk Akgül, Yusuf Hakan Kalaycı, Rajgopal Kannan, Willie Neiswanger, Viktor Prasanna

Main category: cs.CL

TL;DR: LYNX is an online early-exit mechanism that uses hidden-state awareness and conformal prediction to stop reasoning models early when they have enough information, reducing token usage by 40-70% while maintaining or improving accuracy.

Motivation: Large reasoning models often "overthink" - continuing to reason long after having enough information to answer correctly, which wastes inference-time compute and can hurt accuracy. Existing early-exit methods have limitations like requiring extra sampling, auxiliary verifiers, or lacking formal guarantees.

Method: LYNX attaches exit decisions to naturally occurring reasoning cues (e.g., “hmm”, “wait”) during generation, trains a lightweight probe on hidden states at those cue tokens using supervision from forced exits, and wraps the resulting scores in split conformal prediction to obtain distribution-free control over premature exits. The probe is trained once on a generic mathematical corpus and reused across different tasks and settings.
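
The split-conformal step admits a compact sketch: calibrate a threshold on held-out probe scores so exits are controlled at a user-chosen error level. The nonconformity scoring and exit direction below are assumptions.

```python
# Sketch of a split-conformal calibration step like the one described above:
# pick a probe-score threshold with a finite-sample quantile correction.
import math

def conformal_threshold(cal_scores, alpha: float) -> float:
    """cal_scores: probe nonconformity scores on held-out calibration cues.
    Returns the (1 - alpha) split-conformal quantile."""
    n = len(cal_scores)
    k = math.ceil((n + 1) * (1 - alpha))  # finite-sample correction
    return sorted(cal_scores)[min(k, n) - 1]

# At generation time, exit at a cue token only if the probe is confident enough:
tau = conformal_threshold([0.1, 0.4, 0.2, 0.9, 0.3], alpha=0.1)
probe_score = 0.15
should_exit = probe_score <= tau
print(tau, should_exit)
```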

Result: Across three model families (1.5B to 32B parameters), LYNX achieves strong accuracy-efficiency tradeoffs: 40-65% token reduction on GSM8K while matching or improving accuracy; up to 12-point accuracy improvement with 35-60% fewer tokens on MATH-500; baseline accuracy recovery with >50% token savings on AIME 2024; and zero-shot transfer to non-math tasks (CommonsenseQA) with modest accuracy gains and up to 70% fewer tokens.

Conclusion: LYNX offers competitive or superior Pareto frontiers compared to state-of-the-art early-exit methods while being fully online, requiring no proxy models at inference, and providing explicit, user-tunable confidence guarantees through conformal prediction.

Abstract: Large reasoning models achieve strong performance on complex tasks by generating extended chains of thought, but they often “overthink”: continuing to reason long after they have enough information to answer correctly. This wastes inference-time compute and can hurt accuracy. Existing attempts to stop early either manipulate decoding with extra sampling and heuristics, rely on auxiliary verifier models, or operate only as post-hoc analysis pipelines without formal guarantees. We introduce LYNX, an online early-exit mechanism that turns a model’s own hidden-state awareness into confidence-controlled stopping decisions. LYNX attaches exit decisions to naturally occurring reasoning cues (e.g., “hmm”, “wait”) during generation, trains a lightweight probe on hidden states at those cue tokens using supervision from forced exits, and wraps the resulting scores in split conformal prediction to obtain distribution-free control over premature exits. Crucially, we train and calibrate this probe once on a generic mathematical corpus and reuse it unchanged across benchmarks, decoding temperatures, and even non-mathematical tasks. Across three model families spanning 1.5B to 32B parameters, a single mathematically trained probe per base model yields strong accuracy–efficiency tradeoffs. On GSM8K, LYNX matches or improves baseline accuracy while reducing tokens by 40–65%; on MATH-500 it improves accuracy by up to 12 points with roughly 35–60% fewer tokens; on AIME 2024 it recovers baseline accuracy with more than 50% token savings; and on CommonsenseQA, a non-math benchmark, it transfers zero-shot with modest accuracy gains and up to 70% fewer tokens. Compared to state-of-the-art early-exit methods, LYNX offers competitive or superior Pareto frontiers while remaining fully online, requiring no proxy models at inference, and providing explicit, user-tunable confidence guarantees.

[7] Exposing Pink Slime Journalism: Linguistic Signatures and Robust Detection Against LLM-Generated Threats

Sadat Shahriar, Navid Ayoobi, Arjun Mukherjee, Mostafa Musharrat, Sai Vishnu Vamsi

Main category: cs.CL

TL;DR: Pink Slime Journalism detection faces new threats from LLM-based adversarial attacks, requiring robust detection frameworks.

Motivation: Local news is vital for 28M Americans but threatened by auto-generated Pink Slime Journalism that mimics legitimate reporting, requiring detection of deceptive content.

Method: Comprehensive study of Pink Slime linguistic/stylistic patterns, analysis of LLM-based adversarial attacks, and development of robust learning framework resistant to LLM modifications.

Result: Consumer-accessible LLMs can reduce existing detection performance by up to 40% F1-score; proposed robust framework improves detection by up to 27% against adversarial attacks.

Conclusion: LLMs create new adversarial threats to Pink Slime detection, necessitating robust frameworks that adapt to evolving automated journalism landscape.

Abstract: The local news landscape, a vital source of reliable information for 28 million Americans, faces a growing threat from Pink Slime Journalism: low-quality, auto-generated articles that mimic legitimate local reporting. Detecting these deceptive articles requires a fine-grained analysis of their linguistic, stylistic, and lexical characteristics. In this work, we conduct a comprehensive study to uncover the distinguishing patterns of Pink Slime content and propose detection strategies based on these insights. Beyond traditional generation methods, we highlight a new adversarial vector: modifications through large language models (LLMs). Our findings reveal that even consumer-accessible LLMs can significantly undermine existing detection systems, reducing their performance by up to 40% in F1-score. To counter this threat, we introduce a robust learning framework specifically designed to resist LLM-based adversarial attacks and adapt to the evolving landscape of automated pink slime journalism, showing an improvement of up to 27%.

[8] Transformer-Enabled Diachronic Analysis of Vedic Sanskrit: Neural Methods for Quantifying Types of Language Change

Ananth Hariharan, David Mortensen

Main category: cs.CL

TL;DR: Hybrid neural-symbolic methods reveal that Sanskrit’s morphological complexity doesn’t decrease over 2,000 years but is dynamically redistributed, offering new insights into low-resource language evolution.

Motivation: Challenge the naive assumption that linguistic change is simplification, and demonstrate how hybrid methods can provide new insights into the evolution of morphologically rich, low-resource languages like Sanskrit.

Method: Weakly-supervised hybrid approach using 100+ high-precision regex patterns for pseudo-labeling to fine-tune multilingual BERT, then fusing symbolic and neural outputs via confidence-weighted ensemble for scalability and interpretability.
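
A minimal sketch of one confidence-weighted fusion rule consistent with this description; the actual ensemble in the paper may weight and combine differently.

```python
# Minimal sketch of a confidence-weighted fusion of symbolic (regex) and neural
# (BERT) detections, the ensemble idea described above; weights are assumptions.
def fuse(regex_hit: bool, regex_conf: float,
         neural_prob: float, neural_conf: float) -> float:
    """Blend the two detectors in proportion to their confidences."""
    total = regex_conf + neural_conf
    return (regex_conf * float(regex_hit) + neural_conf * neural_prob) / total

# High-precision regex fires with strong confidence; neural model is unsure:
print(fuse(regex_hit=True, regex_conf=0.95, neural_prob=0.40, neural_conf=0.55))
```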

Result: Achieved 52.4% overall feature detection rate on 1.47-million-word diachronic corpus; revealed Sanskrit’s morphological complexity doesn’t decrease but dynamically redistributes (verbal features decline cyclically while compounding expands dramatically); system produces well-calibrated uncertainty estimates (Pearson r = 0.92, ECE = 0.043).

Conclusion: Hybrid neural-symbolic methods provide reliable, scalable, and interpretable insights into language evolution, challenging simplification assumptions and revealing complex redistribution patterns in morphological complexity over time.

Abstract: This study demonstrates how hybrid neural-symbolic methods can yield significant new insights into the evolution of a morphologically rich, low-resource language. We challenge the naive assumption that linguistic change is simplification by quantitatively analyzing over 2,000 years of Sanskrit, demonstrating how weakly-supervised hybrid methods can yield new insights into the evolution of morphologically rich, low-resource languages. Our approach addresses data scarcity through weak supervision, using 100+ high-precision regex patterns to generate pseudo-labels for fine-tuning a multilingual BERT. We then fuse symbolic and neural outputs via a novel confidence-weighted ensemble, creating a system that is both scalable and interpretable. Applying this framework to a 1.47-million-word diachronic corpus, our ensemble achieves a 52.4% overall feature detection rate. Our findings reveal that Sanskrit’s overall morphological complexity does not decrease but is instead dynamically redistributed: while earlier verbal features show cyclical patterns of decline, complexity shifts to other domains, evidenced by a dramatic expansion in compounding and the emergence of new philosophical terminology. Critically, our system produces well-calibrated uncertainty estimates, with confidence strongly correlating with accuracy (Pearson r = 0.92) and low overall calibration error (ECE = 0.043), bolstering the reliability of these findings for computational philology.

[9] Mitigating Self-Preference by Authorship Obfuscation

Taslim Mahbub, Shi Feng

Main category: cs.CL

TL;DR: LM judges show self-preference bias favoring their own outputs, which persists even with perturbations to hide authorship. Simple synonym replacement helps but complete mitigation remains challenging.

Motivation: LM judges are widely used for evaluation but display concerning biases like self-preference (preferring their own answers over others), which impairs evaluation integrity. This bias is hard to eliminate because frontier LMs can recognize their own outputs even without source labels.

Method: Apply black-box perturbations to evaluation candidates in pairwise comparisons to obfuscate authorship and reduce self-recognition. Test simple perturbations like synonym replacement for a few words, and extrapolate to more complete neutralization of stylistic differences.
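
The simplest perturbation studied, synonym replacement, can be sketched as below; the synonym table and swap budget are illustrative placeholders, not the paper's procedure.

```python
# Illustrative sketch of the simplest authorship-obfuscation perturbation:
# replacing a few words with synonyms from a (placeholder) synonym table.
import random

SYNONYMS = {"big": "large", "quick": "rapid", "show": "demonstrate"}

def perturb(text: str, max_swaps: int = 3, seed: int = 0) -> str:
    rng = random.Random(seed)
    words = text.split()
    candidates = [i for i, w in enumerate(words) if w.lower() in SYNONYMS]
    for i in rng.sample(candidates, min(max_swaps, len(candidates))):
        words[i] = SYNONYMS[words[i].lower()]
    return " ".join(words)

print(perturb("We show a big and quick improvement"))
```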

Result: Simple perturbations like synonym replacement predictably reduce self-preference. However, when extrapolating to more complete neutralization of stylistic differences, self-preference recovers, suggesting self-recognition happens on many semantic levels.

Conclusion: Self-recognition and self-preference can occur on multiple semantic levels, making complete mitigation challenging despite promising initial results with simple perturbations. The fundamental challenge remains in eliminating this bias from LM judges.

Abstract: Language model (LM) judges are widely used to evaluate the quality of LM outputs. Despite many advantages, LM judges display concerning biases that can impair their integrity in evaluations. One such bias is self-preference: LM judges preferring their own answers over those produced by other LMs or humans. The bias is hard to eliminate as frontier LM judges can distinguish their own outputs from those of others, even when the evaluation candidates are not labeled with their sources. In this paper, we investigate strategies to mitigate self-preference by reducing the LM judges’ ability to recognize their own outputs. We apply black-box perturbations to evaluation candidates in pairwise comparison to obfuscate the authorship and reduce self-recognition. We find that perturbations as simple as synonym replacement for a few words predictably reduce self-preference. However, we also uncover fundamental challenges to eliminating the bias: when we extrapolate our perturbations to a more complete neutralization of stylistic differences between the evaluation candidates, self-preference recovers. Our findings suggest that self-recognition and self-preference can happen on many semantic levels, and complete mitigation remains challenging despite promising initial results.

[10] Learning from Self Critique and Refinement for Faithful LLM Summarization

Ting-Yao Hu, Hema Swetha Koppula, Hadi Pouransari, Cem Koc, Oncel Tuzel, Raviteja Vemulapalli

Main category: cs.CL

TL;DR: SCRPO is a self-supervised training framework that uses LLMs’ own critique and refinement capabilities to create preference datasets, then applies preference learning to improve faithfulness in summarization without needing test-time compute or more powerful teacher models.

Motivation: LLMs suffer from hallucinations in long-form text generation like summarization. Existing approaches to reduce hallucinations require additional test-time compute or access to more powerful teacher models, making them costly and impractical.

Method: Self Critique and Refinement-based Preference Optimization (SCRPO): 1) Constructs a preference dataset using the LLM’s own critique and refinement capabilities, 2) Applies preference learning to improve the same LLM for faithful summarization.
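
A hedged sketch of step (1), the preference-dataset construction: the model critiques and refines its own summary, and the refined output is treated as the preferred response. The prompts and the `llm` callable are placeholders, not the paper's exact templates.

```python
# Hedged sketch of the preference-dataset step in SCRPO as described above:
# self-critique, self-refine, then pair (refined, original) for DPO-style training.
def build_preference_pair(llm, document: str):
    summary = llm(f"Summarize faithfully:\n{document}")
    critique = llm(f"List unsupported claims in this summary of the "
                   f"document.\nDocument: {document}\nSummary: {summary}")
    refined = llm(f"Rewrite the summary fixing these issues:\n{critique}\n"
                  f"Summary: {summary}")
    # refined output is preferred ("chosen") over the original ("rejected")
    return {"prompt": document, "chosen": refined, "rejected": summary}

if __name__ == "__main__":
    echo = lambda p: p[:40]  # stand-in LLM for a smoke test
    print(build_preference_pair(echo, "Some article text ..."))
```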

Result: Outperforms state-of-the-art self-supervised learning methods on three summarization benchmarks (XSUM, CNNDM, SAMSum) in faithfulness metrics while maintaining or improving overall summary quality metrics. More efficient than test-time refinement and produces more faithful summaries.

Conclusion: SCRPO provides an effective self-supervised training framework that improves faithfulness in LLM summarization without requiring additional test-time compute or more powerful teacher models, making hallucination reduction more practical and efficient.

Abstract: Large Language Models (LLMs) often suffer from hallucinations, i.e., output content that is not grounded in the input context, when performing long-form text generation tasks such as summarization. Prior works have shown that hallucinations can be reduced by iteratively critiquing and refining previously generated outputs using either the same model or a more powerful teacher model as the critic. However, these approaches either require additional test-time compute or assume access to more powerful teacher models, making them costly and less practical. In this work, we propose Self Critique and Refinement-based Preference Optimization (SCRPO), which is a self-supervised training framework that first constructs a preference dataset by leveraging the LLM’s own critique and refinement capabilities, and then applies preference learning to improve the same LLM for faithful summarization. Experiments on three summarization benchmarks (XSUM, CNNDM, and SAMSum) demonstrate that our approach outperforms state-of-the-art self-supervised learning methods in terms of faithfulness metrics while either maintaining or improving other metrics that measure the overall quality of the summary. Moreover, compared to test-time refinement, our approach not only improves efficiency but also results in more faithful summaries.

[11] SQ-format: A Unified Sparse-Quantized Hardware-friendly Data Format for LLMs

Ruixuan Huang, Hao Zeng, Hantao Huang, Jinyuan Shi, Minghui Yu, Ian En-Hsu Yen, Shuai Wang

Main category: cs.CL

TL;DR: SQ-format is a unified data format for quantization and sparsification that achieves Pareto improvement between performance and throughput, particularly suitable for activations with outlier inequality status.

Motivation: Existing low-bit quantization and sparsification techniques struggle to balance accuracy and efficiency due to limited hardware support. W4A8 achieves same peak TOPS as W8A8, and GPU-supported sparse formats (2:4 semi-structure) are rarely used due to accuracy loss.

Method: Proposed Sparse-Quantized Format (SQ-format), a unified data format leveraging that sparse matrices can be accelerated in high-precision while low-precision matrix multiplication can also be accelerated. The format is particularly suitable for activations with outlier inequality status.
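
Conceptually, the format splits a tensor into a sparse set of high-precision outliers plus a dense low-precision remainder. The sketch below illustrates that split in NumPy under assumed choices (top-5% outliers, int8 remainder); the real bit layout and hardware mapping are not shown.

```python
# Conceptual sketch of the outlier/remainder split that SQ-format exploits;
# the fraction, precisions, and layout here are assumptions.
import numpy as np

def sq_split(x: np.ndarray, outlier_frac: float = 0.05):
    k = max(1, int(outlier_frac * x.size))
    thresh = np.partition(np.abs(x).ravel(), -k)[-k]
    outlier_mask = np.abs(x) >= thresh          # sparse part, kept high-precision
    dense = np.where(outlier_mask, 0.0, x)
    scale = np.abs(dense).max() / 127 or 1.0
    dense_q = np.round(dense / scale).astype(np.int8)  # int8 remainder
    return outlier_mask, x[outlier_mask], dense_q, scale

x = np.random.randn(4, 8).astype(np.float32)
mask, outliers, q, s = sq_split(x)
recon = q.astype(np.float32) * s
recon[mask] = outliers
print(np.abs(recon - x).max())  # reconstruction error from the int8 remainder
```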

Result: Achieves state-of-the-art post-training quantization performance with SQ-format, proposes hardware requirements to support it, and offers design exploration and insights for next-generation AI accelerators.

Conclusion: SQ-format bridges the gap between quantization and sparsification, achieving Pareto improvement between performance and throughput, making static compression possible for activations with outlier inequality status.

Abstract: Post-training quantization (PTQ) plays a crucial role in the democratization of large language models (LLMs). However, existing low-bit quantization and sparsification techniques struggle to balance accuracy and efficiency due to limited hardware support. For example, W4A8 can only achieve the same peak TOPS as W8A8, whereas the GPU-supported sparse data format (2:4 semi-structured sparsity) is seldom adopted due to the loss of accuracy. To bridge this gap, in this paper, we propose the Sparse-Quantized Format (SQ-format), which is a unified data format for quantization and sparsification potentially easily supported by new hardware and existing GPUs. SQ-format makes use of the fact that sparse matrices can be accelerated in high precision, and low-precision matrix multiplication can also be accelerated accordingly. As such, SQ-format is proposed to achieve a Pareto improvement between performance and throughput. This format is particularly suitable for activations with outlier inequality status and makes their static compression possible. We show state-of-the-art PTQ performance with SQ-format, propose the hardware required to support it, and further offer design exploration and insights for the next-generation AI accelerators.

[12] LMSpell: Neural Spell Checking for Low-Resource Languages

Akesh Gunathilake, Nadil Karunarathne, Tharusha Bandaranayake, Nisansa de Silva, Surangika Ranathunga

Main category: cs.CL

TL;DR: First empirical study comparing pretrained language models for spell correction across languages, including low-resource ones, finding LLMs outperform others with large datasets, even in languages they weren’t pre-trained for.

Motivation: Spell correction remains challenging for low-resource languages, and while pretrained language models have been used, there's been no proper comparison across different PLM types and limited application to LRLs.

Method: Conducted empirical study comparing effectiveness of different PLMs (encoder-based, encoder-decoder, and LLMs) for spell correction, including low-resource languages. Developed LMSpell toolkit with evaluation function to compensate for LLM hallucination. Included case study with Sinhala.

Result: LLMs outperform encoder-based and encoder-decoder models when fine-tuning dataset is large, even for languages the LLM wasn’t pre-trained on. The study provides insights into spell correction challenges for low-resource languages.

Conclusion: LLMs show strong potential for spell correction across languages including low-resource ones, especially with sufficient training data. The released LMSpell toolkit facilitates further research and application in this area.

Abstract: Spell correction is still a challenging problem for low-resource languages (LRLs). While pretrained language models (PLMs) have been employed for spell correction, their use is still limited to a handful of languages, and there has been no proper comparison across PLMs. We present the first empirical study on the effectiveness of PLMs for spell correction, which includes LRLs. We find that Large Language Models (LLMs) outperform their counterparts (encoder-based and encoder-decoder) when the fine-tuning dataset is large. This observation holds even in languages for which the LLM is not pre-trained. We release LMSpell, an easy-to-use spell correction toolkit across PLMs. It includes an evaluation function that compensates for the hallucination of LLMs. Further, we present a case study with Sinhala to shed light on the plight of spell correction for LRLs.

[13] ArtistMus: A Globally Diverse, Artist-Centric Benchmark for Retrieval-Augmented Music Question Answering

Daeyong Kwon, SeungHeon Doh, Juhan Nam

Main category: cs.CL

TL;DR: MusWikiDB (3.2M passages from music Wikipedia) and ArtistMus benchmark (1K questions on 500 artists) improve music question answering via RAG, boosting open-source models by up to +56.8 points.

Motivation: LLMs have limited music knowledge due to sparse pretraining data, and existing music resources lack support for factual/contextual music question answering grounded in artist metadata and historical context.

Method: Created MusWikiDB (vector database of 3.2M passages from 144K music Wikipedia pages) and ArtistMus benchmark (1,000 questions on 500 diverse artists with genre, debut year, topic metadata). Evaluated retrieval-augmented generation (RAG) for music QA, including RAG-style fine-tuning.
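
A minimal sketch of the RAG loop being evaluated: embed the question, retrieve the nearest passages by cosine similarity, and condition the prompt on them. The vectors and store here are stand-ins, not the released MusWikiDB API.

```python
# Minimal sketch of retrieval-augmented generation over a vector store;
# embeddings and passages below are placeholders.
import numpy as np

def retrieve(question_vec, passage_vecs, passages, k=3):
    sims = passage_vecs @ question_vec / (
        np.linalg.norm(passage_vecs, axis=1) * np.linalg.norm(question_vec))
    top = np.argsort(-sims)[:k]
    return [passages[i] for i in top]

def rag_prompt(question, retrieved):
    ctx = "\n".join(f"- {p}" for p in retrieved)
    return f"Context:\n{ctx}\n\nAnswer the question: {question}"

passages = ["Artist A debuted in 1994 ...", "Genre B emerged in ...", "..."]
vecs = np.random.randn(3, 8); qv = np.random.randn(8)
print(rag_prompt("When did Artist A debut?", retrieve(qv, vecs, passages, k=2)))
```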

Result: RAG dramatically improves factual accuracy: open-source models gain up to +56.8 percentage points (Qwen3 8B improves from 35.0 to 91.8). RAG-style fine-tuning boosts both factual recall and contextual reasoning. MusWikiDB yields ~6% higher accuracy and 40% faster retrieval than general Wikipedia corpus.

Conclusion: MusWikiDB and ArtistMus advance music information retrieval and domain-specific QA, establishing foundation for retrieval-augmented reasoning in culturally rich domains like music. Released resources to support further research.

Abstract: Recent advances in large language models (LLMs) have transformed open-domain question answering, yet their effectiveness in music-related reasoning remains limited due to sparse music knowledge in pretraining data. While music information retrieval and computational musicology have explored structured and multimodal understanding, few resources support factual and contextual music question answering (MQA) grounded in artist metadata or historical context. We introduce MusWikiDB, a vector database of 3.2M passages from 144K music-related Wikipedia pages, and ArtistMus, a benchmark of 1,000 questions on 500 diverse artists with metadata such as genre, debut year, and topic. These resources enable systematic evaluation of retrieval-augmented generation (RAG) for MQA. Experiments show that RAG markedly improves factual accuracy; open-source models gain up to +56.8 percentage points (for example, Qwen3 8B improves from 35.0 to 91.8), approaching proprietary model performance. RAG-style fine-tuning further boosts both factual recall and contextual reasoning, improving results on both in-domain and out-of-domain benchmarks. MusWikiDB also yields approximately 6 percentage points higher accuracy and 40% faster retrieval than a general-purpose Wikipedia corpus. We release MusWikiDB and ArtistMus to advance research in music information retrieval and domain-specific question answering, establishing a foundation for retrieval-augmented reasoning in culturally rich domains such as music.

[14] Dynamic Alignment for Collective Agency: Toward a Scalable Self-Improving Framework for Open-Ended LLM Alignment

Panatchakorn Anantaprayoon, Nataliia Babina, Jad Tarifi, Nima Asgharbeygi

Main category: cs.CL

TL;DR: Proposes Collective Agency as a holistic alignment value and Dynamic Alignment framework for scalable self-improving LLM alignment using automated dataset generation and self-rewarding mechanisms.

Motivation: Current LLM alignment methods using human preferences or predefined principles are insufficient for AGI/ASI, resource-intensive, and constrained to conventional values. Need more holistic alignment objectives and scalable self-improving approaches.

Method: Introduces Collective Agency (CA) as unified, open-ended alignment value. Proposes Dynamic Alignment framework with two components: (1) automated training dataset generation using LLMs, and (2) self-rewarding mechanism where policy model evaluates its own outputs for GRPO-based learning.
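
For component (2), GRPO-style learning typically standardizes self-assigned rewards within each sampled group; a minimal sketch follows, where the reward source and group size are assumptions.

```python
# Sketch of GRPO-style reward normalization: each candidate's advantage is its
# self-assigned reward standardized within the sampled group for the same prompt.
import statistics

def group_advantages(rewards):
    mu = statistics.mean(rewards)
    sigma = statistics.pstdev(rewards) or 1.0  # guard against zero variance
    return [(r - mu) / sigma for r in rewards]

# Self-rewards for 4 sampled responses to the same prompt:
print(group_advantages([0.2, 0.8, 0.5, 0.5]))
```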

Result: Experimental results show the approach successfully aligns models to Collective Agency while preserving general NLP capabilities.

Conclusion: The proposed Collective Agency value and Dynamic Alignment framework provide a more holistic and scalable approach to LLM alignment that can adapt to AGI/ASI development while maintaining core capabilities.

Abstract: Large Language Models (LLMs) are typically aligned with human values using preference data or predefined principles such as helpfulness, honesty, and harmlessness. However, as AI systems progress toward Artificial General Intelligence (AGI) and Artificial Superintelligence (ASI), such value systems may become insufficient. In addition, human feedback-based alignment remains resource-intensive and difficult to scale. While AI-feedback-based self-improving alignment methods have been explored as a scalable alternative, they have largely remained constrained to conventional alignment values. In this work, we explore both a more holistic alignment objective and a scalable, self-improving alignment approach. Aiming to transcend conventional alignment norms, we introduce Collective Agency (CA)-a unified and open-ended alignment value that encourages integrated agentic capabilities. We also propose Dynamic Alignment-an alignment framework that enables an LLM to iteratively align itself. Dynamic Alignment comprises two key components: (1) automated training dataset generation with LLMs, and (2) a self-rewarding mechanism, where the policy model evaluates its own output candidates and assigns rewards for GRPO-based learning. Experimental results demonstrate that our approach successfully aligns the model to CA while preserving general NLP capabilities.

[15] SEA-SafeguardBench: Evaluating AI Safety in SEA Languages and Cultures

Panuthep Tasawong, Jian Gang Ngui, Alham Fikri Aji, Trevor Cohn, Peerat Limkonchotiwat

Main category: cs.CL

TL;DR: SEA-SafeguardBench is the first human-verified safety benchmark for Southeast Asian languages, covering 8 languages with 21,640 samples, revealing that current LLMs and guardrails struggle with SEA cultural contexts compared to English.

Motivation: Current safety evaluations for LLMs are English-centric and overlook linguistic/cultural diversity. Multilingual benchmarks often use machine-translated English data, missing nuances in low-resource languages. Southeast Asian languages are particularly underrepresented despite the region's unique safety concerns and linguistic diversity.

Method: Created SEA-SafeguardBench - a human-verified safety benchmark for SEA languages with three subsets: general, in-the-wild, and content generation. Covers 8 languages with 21,640 natively authored samples to reflect local norms and harm scenarios.

Result: Experimental results show that even state-of-the-art LLMs and guardrails are challenged by SEA cultural and harm scenarios, underperforming significantly compared to their performance on English texts.

Conclusion: There’s a critical need for natively authored safety benchmarks that capture linguistic and cultural diversity, especially for underrepresented regions like Southeast Asia, as current safety systems fail to adequately address local harm scenarios.

Abstract: Safeguard models help large language models (LLMs) detect and block harmful content, but most evaluations remain English-centric and overlook linguistic and cultural diversity. Existing multilingual safety benchmarks often rely on machine-translated English data, which fails to capture nuances in low-resource languages. Southeast Asian (SEA) languages are underrepresented despite the region’s linguistic diversity and unique safety concerns, from culturally sensitive political speech to region-specific misinformation. Addressing these gaps requires benchmarks that are natively authored to reflect local norms and harm scenarios. We introduce SEA-SafeguardBench, the first human-verified safety benchmark for SEA, covering eight languages, 21,640 samples, across three subsets: general, in-the-wild, and content generation. The experimental results from our benchmark demonstrate that even state-of-the-art LLMs and guardrails are challenged by SEA cultural and harm scenarios and underperform when compared to English texts.

[16] Automated Identification of Incidentalomas Requiring Follow-Up: A Multi-Anatomy Evaluation of LLM-Based and Supervised Approaches

Namu Park, Farzad Ahmed, Zhaoyi Sun, Kevin Lybarger, Ethan Breinhorst, Julie Hu, Ozlem Uzuner, Martin Gunn, Meliha Yetisgen

Main category: cs.CL

TL;DR: LLMs with anatomy-aware prompting outperform supervised models for lesion-level incidentaloma detection in radiology reports, achieving near-human performance.

Motivation: Current document-level classification systems have limitations for fine-grained, lesion-level detection of incidentalomas requiring follow-up in radiology reports.

Method: Compared 3 supervised transformer encoders (BioClinicalModernBERT, ModernBERT, Clinical Longformer) against 4 generative LLMs (Llama 3.1-8B, GPT-4o, GPT-OSS-20b) using 400 annotated radiology reports with 1,623 verified lesions. Introduced lesion-tagged inputs and anatomy-aware prompting to ground model reasoning.
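
A sketch of what lesion-tagged, anatomy-aware prompting might look like; the tag syntax and template wording are assumptions, not the paper's exact prompt.

```python
# Illustrative sketch of lesion tagging plus anatomical grounding in a prompt.
def anatomy_prompt(report: str, lesion: str, anatomy: str) -> str:
    tagged = report.replace(lesion, f"<lesion>{lesion}</lesion>", 1)
    return (
        f"Radiology report:\n{tagged}\n\n"
        f"The tagged lesion is in the {anatomy}. Considering typical "
        f"follow-up guidelines for this anatomy, is the tagged finding an "
        f"incidentaloma requiring follow-up? Answer yes or no with reasoning."
    )

print(anatomy_prompt(
    "Incidental 8 mm hypodense lesion in the right hepatic lobe.",
    "8 mm hypodense lesion", "liver"))
```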

Result: Anatomy-informed GPT-OSS-20b achieved highest performance (macro-F1: 0.79), surpassing all supervised baselines (max: 0.70) and closely matching inter-annotator agreement (0.76). Anatomy-aware prompting yielded significant gains for GPT models. Ensemble of top systems improved macro-F1 to 0.90.

Conclusion: Generative LLMs enhanced with structured lesion tagging and anatomical context significantly outperform traditional supervised encoders and achieve performance comparable to human experts, offering reliable automated incidental finding surveillance.

Abstract: Objective: To evaluate large language models (LLMs) against supervised baselines for fine-grained, lesion-level detection of incidentalomas requiring follow-up, addressing the limitations of current document-level classification systems. Methods: We utilized a dataset of 400 annotated radiology reports containing 1,623 verified lesion findings. We compared three supervised transformer-based encoders (BioClinicalModernBERT, ModernBERT, Clinical Longformer) against four generative LLM configurations (Llama 3.1-8B, GPT-4o, GPT-OSS-20b). We introduced a novel inference strategy using lesion-tagged inputs and anatomy-aware prompting to ground model reasoning. Performance was evaluated using class-specific F1-scores. Results: The anatomy-informed GPT-OSS-20b model achieved the highest performance, yielding an incidentaloma-positive macro-F1 of 0.79. This surpassed all supervised baselines (maximum macro-F1: 0.70) and closely matched the inter-annotator agreement of 0.76. Explicit anatomical grounding yielded statistically significant performance gains across GPT-based models (p < 0.05), while a majority-vote ensemble of the top systems further improved the macro-F1 to 0.90. Error analysis revealed that anatomy-aware LLMs demonstrated superior contextual reasoning in distinguishing actionable findings from benign lesions. Conclusion: Generative LLMs, when enhanced with structured lesion tagging and anatomical context, significantly outperform traditional supervised encoders and achieve performance comparable to human experts. This approach offers a reliable, interpretable pathway for automated incidental finding surveillance in radiology workflows.

[17] Structured Reasoning with Tree-of-Thoughts for Bengali Math Word Problems

Aurprita Mahmood, Sabrin alam, Neloy kumer Sagor, Md. Abdul Hadi, Md. Sehab Al Islam, Minhajul Islam

Main category: cs.CL

TL;DR: Tree-of-Thought (ToT) reasoning outperforms Chain-of-Thought (CoT) for Bengali mathematical word problems, achieving 88% accuracy with GPT-OSS-120B, showing structured reasoning benefits low-resource languages.

Motivation: Mathematical Word Problems (MWPs) are challenging for NLP due to requiring both linguistic understanding and multi-step numerical reasoning. While CoT prompting has shown promise, its linear structure often propagates errors, limiting effectiveness, especially for low-resource languages like Bengali.

Method: Systematic study of Tree-of-Thought (ToT) reasoning for Bengali MWPs using the SOMADHAN dataset. Evaluated 100 representative problems across multiple LLMs (GPT-OSS and LLaMA variants) under three strategies: standard prompting, CoT, and ToT, considering computational and token-cost constraints.
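
For contrast with linear CoT, a toy beam-style Tree-of-Thought loop is sketched below; the `propose` and `score` callables stand in for LLM calls and are not the paper's exact search procedure.

```python
# Hedged sketch of a Tree-of-Thoughts search: expand candidate reasoning steps
# per state, score them, and keep the best few (beam-style pruning).
def tree_of_thought(problem, propose, score, width=3, depth=3):
    beam = [("", 0.0)]  # (partial reasoning trace, cumulative score)
    for _ in range(depth):
        candidates = []
        for state, s in beam:
            for step in propose(problem, state):  # e.g. sample k next thoughts
                candidates.append((state + step + "\n",
                                   s + score(problem, state, step)))
        beam = sorted(candidates, key=lambda c: -c[1])[:width]
    return beam[0][0]

# Toy smoke test with deterministic stand-ins for the LLM calls:
best = tree_of_thought(
    "2+2*3",
    lambda p, s: ["multiply first", "then add"],
    lambda p, s, step: len(step) * 0.1)
print(best)
```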

Result: CoT improved baseline accuracy from 78% (standard prompting) to 83% on average. ToT further increased performance by up to 5 percentage points, achieving 88% accuracy with GPT-OSS-120B. ToT is particularly effective for medium-to-large-scale models but offers less advantage for smaller ones.

Conclusion: ToT establishes as a robust framework for solving mathematical problems in low-resource languages like Bengali. Structured reasoning methods like ToT provide more reliable and globally consistent outcomes than CoT, paving the way for better reasoning strategies in multilingual NLP.

Abstract: Mathematical Word Problems (MWPs) are among the most challenging tasks in natural language processing because they require both linguistic understanding and multi-step numerical reasoning. While Chain-of-Thought (CoT) prompting has shown promise, its linear structure often propagates errors, limiting overall effectiveness. To address this limitation, we present a systematic study of Tree-of-Thought (ToT) reasoning for Bengali MWPs using the SOMADHAN dataset. Owing to computational and token-cost constraints, we evaluate a curated set of 100 representative problems across multiple large language models (LLMs), including GPT-OSS and LLaMA variants, under standard prompting, CoT, and ToT strategies. Our results show that CoT improves baseline accuracy from 78% (standard prompting) to 83% on average, while ToT further increases performance by up to 5 percentage points, achieving 88% accuracy with GPT-OSS-120B. These improvements highlight that ToT is particularly effective in medium-to-large-scale models but may offer less advantage for smaller ones. Overall, our findings establish ToT as a robust framework for solving mathematical problems in low-resource languages such as Bengali. More broadly, this study shows that structured reasoning methods like ToT can provide more reliable and globally consistent outcomes than CoT, paving the way for better reasoning strategies in multilingual NLP.

[18] A Greek Government Decisions Dataset for Public-Sector Analysis and Insight

Giorgos Antoniou, Giorgos Filandrianos, Aggelos Vlachos, Giorgos Stamou, Lampros Kollimenos, Konstantinos Skianis, Michalis Vazirgiannis

Main category: cs.CL

TL;DR: Researchers create a 1-million-document corpus of Greek government decisions from Diavgeia platform, with high-quality text extraction, RAG task design, and baseline evaluation for transparency and AI applications.

Motivation: To support advanced information access and transparency in the public sector by creating a large-scale, machine-readable corpus of government decisions that can enable structured retrieval, reasoning, and AI applications.

Method: Extract 1 million Greek government decisions from Diavgeia transparency platform, convert PDFs to high-quality Markdown text using reproducible pipeline, analyze boilerplate patterns, design RAG task with representative questions and answers, evaluate baseline RAG system.

Result: Created open corpus with 1M decisions, high-quality text extraction, reproducible pipeline, boilerplate analysis, RAG task framework, baseline evaluation showing potential for chat-based assistants answering public decision questions.

Conclusion: The corpus enables transparency, AI applications (RAG systems, LM pre-training/fine-tuning), domain adaptation, and serves as foundation for legal/governmental AI models, with data and code made accessible.

Abstract: We introduce an open, machine-readable corpus of Greek government decisions sourced from the national transparency platform Diavgeia. The resource comprises 1 million decisions, featuring high-quality raw text extracted from PDFs. It is released with raw extracted text in Markdown format, alongside a fully reproducible extraction pipeline. Beyond the core dataset, we conduct qualitative analyses to explore boilerplate patterns and design a retrieval-augmented generation (RAG) task by formulating a set of representative questions, creating high-quality answers, and evaluating a baseline RAG system on its ability to retrieve and reason over public decisions. This evaluation demonstrates the potential of large-scale public-sector corpora to support advanced information access and transparency through structured retrieval and reasoning over governmental documents, and highlights how such a RAG pipeline could simulate a chat-based assistant capable of interactively answering questions about public decisions. Due to its scale, quality, and domain coverage, the corpus can also serve as high-value pre-training or fine-tuning material for new Language Models (LMs) and Large Language Models (LLMs) respectively, including specialized models for legal and governmental domains, and as a foundation for novel approaches in domain adaptation, knowledge-grounded generation, and explainable AI. Finally, we discuss limitations, outline future directions, and make both the data and the code accessible.

[19] Grounded Multilingual Medical Reasoning for Question Answering with Large Language Models

Pietro Ferrazzi, Aitor Soroa, Rodrigo Agerri

Main category: cs.CL

TL;DR: The paper presents a method to generate multilingual medical reasoning traces using retrieval-augmented generation from Wikipedia, creating 500k traces in English, Italian, and Spanish to improve LLM performance on medical QA tasks.

Motivation: Existing medical QA approaches using LLMs are largely English-focused and rely on distillation from general-purpose models, raising concerns about reliability of medical knowledge. There's a need for multilingual reasoning traces grounded in factual medical knowledge to develop safer clinical decision-support tools.

Method: Uses retrieval-augmented generation over medical information from Wikipedia to produce reasoning traces. Extends MedQA and MedMCQA datasets to Italian and Spanish. Tests pipeline in both in-domain and out-of-domain settings across medical QA benchmarks.

Result: Generated 500k reasoning traces in three languages. The reasoning traces improve performance both via in-context learning (few-shot) and supervised fine-tuning, yielding state-of-the-art results among 8B-parameter LLMs.

Conclusion: The multilingual reasoning traces support development of safer, more transparent clinical decision-support tools. The authors release full suite of resources including reasoning traces, translated QA datasets, Medical-Wikipedia, and fine-tuned models.

Abstract: Large Language Models (LLMs) with reasoning capabilities have recently demonstrated strong potential in medical Question Answering (QA). Existing approaches are largely English-focused and primarily rely on distillation from general-purpose LLMs, raising concerns about the reliability of their medical knowledge. In this work, we present a method to generate multilingual reasoning traces grounded in factual medical knowledge. We produce 500k traces in English, Italian, and Spanish, using a retrieval-augmented generation approach over medical information from Wikipedia. The traces are generated to solve medical questions drawn from MedQA and MedMCQA, which we extend to Italian and Spanish. We test our pipeline in both in-domain and out-of-domain settings across Medical QA benchmarks, and demonstrate that our reasoning traces improve performance both when utilized via in-context learning (few-shot) and supervised fine-tuning, yielding state-of-the-art results among 8B-parameter LLMs. We believe that these resources can support the development of safer, more transparent clinical decision-support tools in multilingual settings. We release the full suite of resources: reasoning traces, translated QA datasets, Medical-Wikipedia, and fine-tuned models.

[20] Interleaved Latent Visual Reasoning with Selective Perceptual Modeling

Shuai Dong, Siyuan Wang, Xingyu Liu, Zhongyu Wei

Main category: cs.CL

TL;DR: ILVR introduces interleaved latent visual reasoning that unifies dynamic state evolution with precise perceptual modeling, outperforming existing approaches on multimodal reasoning benchmarks.

Motivation: Current interleaved reasoning paradigms for MLLMs suffer from high computational costs due to repeated image re-encoding. Latent visual reasoning alternatives force a trade-off between precise perceptual modeling and dynamic problem-solving capabilities.

Method: ILVR interleaves textual generation with latent visual representations as evolving cues. Uses self-supervision with a Momentum Teacher Model that selectively distills relevant features from helper images into sparse supervision targets, enabling adaptive, context-aware visual signal generation.
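
The Momentum Teacher is presumably maintained with the standard exponential-moving-average update; a minimal sketch follows (ILVR's selective feature distillation on top of it is not reproduced).

```python
# Minimal sketch of a momentum (EMA) teacher update, the standard mechanism
# behind momentum-teacher self-supervision.
def ema_update(teacher_params, student_params, momentum=0.999):
    """teacher <- m * teacher + (1 - m) * student, parameter-wise."""
    return [m_t * momentum + m_s * (1.0 - momentum)
            for m_t, m_s in zip(teacher_params, student_params)]

teacher = [0.0, 1.0]
student = [1.0, 0.0]
print(ema_update(teacher, student))  # [0.001, 0.999]
```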

Result: Extensive experiments on multimodal reasoning benchmarks show ILVR significantly outperforms existing approaches, effectively bridging the gap between fine-grained perception and sequential multimodal reasoning.

Conclusion: ILVR successfully unifies dynamic state evolution with precise perceptual modeling through interleaved latent visual reasoning, overcoming the computational bottlenecks and trade-offs of previous methods.

Abstract: Interleaved reasoning paradigms enhance Multimodal Large Language Models (MLLMs) with visual feedback but are hindered by the prohibitive computational cost of repeatedly re-encoding pixel-dense images. A promising alternative, latent visual reasoning, circumvents this bottleneck yet currently forces a critical trade-off: methods either sacrifice precise perceptual modeling by over-compressing features or fail to model dynamic problems due to static, non-interleaved structures. We introduce Interleaved Latent Visual Reasoning (ILVR), a framework that unifies dynamic state evolution with precise perceptual modeling. ILVR interleaves textual generation with latent visual representations that act as specific, evolving cues for subsequent reasoning. To enable this, we employ a self-supervision strategy where a Momentum Teacher Model selectively distills relevant features from helper images into sparse supervision targets. This adaptive selection mechanism guides the model to autonomously generate context-aware visual signals. Extensive experiments on multimodal reasoning benchmarks demonstrate that ILVR significantly outperforms existing approaches, effectively bridging the gap between fine-grained perception and sequential multimodal reasoning.

[21] MedTutor-R1: Socratic Personalized Medical Teaching with Multi-Agent Simulation

Zhitao He, Haolin Yang, Zeyu Qin, Yi R Fung

Main category: cs.CL

TL;DR: ClinEdu is a multi-agent simulator for clinical education that generates teaching data, enabling training of MedTutor-R1 - a multimodal Socratic tutor for one-to-many clinical instruction that outperforms base models by 20%.

Motivation: Address the gap between rising clinical training demands and scarce expert instruction by leveraging LLMs for personalized guidance, while moving beyond one-on-one instruction to develop collaborative reasoning skills essential for medical teamwork.

Method: Develop ClinEdu multi-agent simulator with personality-driven patients and diverse student cohorts; create ClinTeach Socratic teaching dialogue dataset; train MedTutor-R1 through instruction tuning on ClinTeach followed by reinforcement learning with three-axis rubric rewards (structural fidelity, analytical quality, clinical safety).

Result: MedTutor-R1 outperforms base model by over 20% in average pedagogical score, comparable to o3, and shows high adaptability in handling varying numbers of students through simulation-based interactive evaluation.

Conclusion: The pedagogical simulator ClinEdu effectively enables scalable teaching data generation and controlled testing of complex pedagogical processes, demonstrating promising performance for LLM-based clinical education that supports collaborative reasoning skills.

Abstract: The significant gap between rising demands for clinical training and the scarcity of expert instruction poses a major challenge to medical education. With powerful capabilities in personalized guidance, Large Language Models (LLMs) offer a promising solution to bridge this gap. However, current research focuses mainly on one-on-one knowledge instruction, overlooking collaborative reasoning, a key skill for students developed in teamwork like ward rounds. To this end, we develop ClinEdu, a multi-agent pedagogical simulator with personality-driven patients and diverse student cohorts, enabling controlled testing of complex pedagogical processes and scalable generation of teaching data. Based on ClinEdu, we construct ClinTeach, a large Socratic teaching dialogue dataset that captures the complexities of group instruction. We then train MedTutor-R1, the first multimodal Socratic tutor designed for one-to-many instruction in clinical medical education. MedTutor-R1 is first instruction-tuned on our ClinTeach dataset and then optimized with reinforcement learning, using rewards derived from a three-axis rubric, covering structural fidelity, analytical quality, and clinical safety, to refine its adaptive Socratic strategies. For authentic in-situ assessment, we use simulation-based interactive evaluation that redeploys the tutor back into ClinEdu. Experimental results demonstrate that our MedTutor-R1 outperforms the base model by over 20% in average pedagogical score and is comparable to o3, while also exhibiting high adaptability in handling a varying number of students. This promising performance underscores the effectiveness of our pedagogical simulator, ClinEdu.

[22] Retrieving Semantically Similar Decisions under Noisy Institutional Labels: Robust Comparison of Embedding Methods

Tereza Novotna, Jakub Harasta

Main category: cs.CL

TL;DR: Comparison of OpenAI vs domain-specific BERT for Czech Constitutional Court case retrieval shows OpenAI decisively outperforms despite noisy evaluation labels.

DetailsMotivation: Case law retrieval is time-consuming and typically done via database queries. There's a need to evaluate different embedding models for legal document retrieval, particularly for Czech Constitutional Court decisions, with proper evaluation methods that handle noisy labels common in legacy judicial databases.

Method: Compared two models: (1) general-purpose OpenAI embedder, (2) domain-specific BERT trained from scratch on ~30,000 decisions using sliding windows and attention pooling. Used noise-aware evaluation with IDF-weighted keyword overlap as graded relevance, binarization via two thresholds (0.20 balanced, 0.28 strict), significance testing via paired bootstrap, and nDCG diagnosis with qualitative analysis.
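
As a rough illustration of this evaluation recipe, the sketch below computes IDF-weighted keyword overlap as graded relevance, binarizes it at the balanced 0.20 threshold, and scores a ranking with nDCG@k; the toy corpus and whitespace tokenization are placeholders for the paper's actual pipeline.

```python
import math
from collections import Counter

def idf_weights(docs):
    """Inverse document frequency for every term in a tokenized corpus."""
    n = len(docs)
    df = Counter(t for doc in docs for t in set(doc))
    return {t: math.log(n / df[t]) for t in df}

def graded_relevance(query_terms, cand_terms, idf):
    """IDF-weighted keyword overlap, normalized to [0, 1]."""
    shared = set(query_terms) & set(cand_terms)
    num = sum(idf.get(t, 0.0) for t in shared)
    den = sum(idf.get(t, 0.0) for t in set(query_terms)) or 1.0
    return num / den

def ndcg_at_k(ranked_rels, k):
    """nDCG@k over the graded relevance scores of a ranked list."""
    dcg = sum(r / math.log2(i + 2) for i, r in enumerate(ranked_rels[:k]))
    ideal = sorted(ranked_rels, reverse=True)[:k]
    idcg = sum(r / math.log2(i + 2) for i, r in enumerate(ideal)) or 1.0
    return dcg / idcg

docs = [["court", "rights", "appeal"], ["tax", "appeal"], ["rights", "speech"]]
idf = idf_weights(docs)
rels = [graded_relevance(docs[0], d, idf) for d in docs[1:]]
binarized = [int(r >= 0.20) for r in rels]          # balanced threshold
print(ndcg_at_k(sorted(rels, reverse=True), k=2), binarized)
```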

Result: OpenAI embedder decisively outperforms domain pre-trained BERT across all settings (@10/@20/@100) and both thresholds. Differences are statistically significant. Despite modest absolute nDCG scores (expected under noisy labels), diagnostics show low absolutes are due to label drift and strong ideals rather than lack of utility.

Conclusion: General-purpose OpenAI embeddings work better than domain-specific BERT for Czech Constitutional Court case retrieval. The proposed evaluation framework is robust enough for noisy gold datasets typical in legacy judicial databases, providing meaningful comparisons despite label noise.

Abstract: Retrieving case law is a time-consuming task predominantly carried out by querying databases. We provide a comparison of two models in three different settings for Czech Constitutional Court decisions: (i) a large general-purpose embedder (OpenAI), (ii) a domain-specific BERT trained from scratch on ~30,000 decisions using sliding windows and attention pooling. We propose a noise-aware evaluation including IDF-weighted keyword overlap as graded relevance, binarization via two thresholds (0.20 balanced, 0.28 strict), significance via paired bootstrap, and an nDCG diagnosis supported with qualitative analysis. Despite modest absolute nDCG (expected under noisy labels), the general OpenAI embedder decisively outperforms the domain pre-trained BERT in both settings at @10/@20/@100 across both thresholds; differences are statistically significant. Diagnostics attribute low absolutes to label drift and strong ideals rather than lack of utility. Additionally, our framework is robust enough to be used for evaluation under a noisy gold dataset, which is typical when handling data with heterogeneous labels stemming from legacy judicial databases.

[23] Faithfulness metric fusion: Improving the evaluation of LLM trustworthiness across domains

Ben Malin, Tatiana Kalganova, Nikolaos Boulgouris

Main category: cs.CL

TL;DR: The paper proposes a methodology to improve faithfulness evaluation in LLMs by combining elementary metrics into a fused metric using tree-based models guided by human judgments, achieving better correlation with human assessments across domains.

DetailsMotivation: Current faithfulness evaluation metrics for LLMs may not accurately reflect human judgments, limiting confidence in deploying LLMs in diverse real-world scenarios where faithfulness is critical.

Method: The methodology combines elementary faithfulness metrics into a fused metric using tree-based models to identify metric importance, guided by human judgments on LLM response faithfulness. The approach also homogenizes datasets across question answering and dialogue domains with human judgments and LLM responses.
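
A minimal sketch of the fusion step, assuming a gradient-boosted tree regressor fit to human judgements over elementary metrics; the metric names and synthetic data below are illustrative stand-ins, not the paper's metric set.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

# Rows: LLM responses; columns: elementary faithfulness metric scores
# (e.g. an NLI score, a QA-overlap score, a fact-recall score -- stand-ins).
rng = np.random.default_rng(0)
X = rng.random((200, 3))
human = 0.5 * X[:, 0] + 0.3 * X[:, 1] + 0.2 * X[:, 2] + rng.normal(0, 0.05, 200)

# Human judgements drive the fit; tree importances expose each metric's weight.
fuser = GradientBoostingRegressor().fit(X, human)
print("metric importances:", fuser.feature_importances_.round(3))
fused_scores = fuser.predict(X[:5])       # the fused faithfulness metric
print("fused scores:", fused_scores.round(3))
```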

Result: The fused metric demonstrates stronger correlation with human judgments across all tested faithfulness domains compared to individual elementary metrics.

Conclusion: The proposed metric fusion approach improves faithfulness evaluation accuracy, enabling greater confidence in LLM deployment across diverse scenarios. The homogenized dataset facilitates reproducibility and cross-domain testing of faithfulness evaluation methods.

Abstract: We present a methodology for improving the accuracy of faithfulness evaluation in Large Language Models (LLMs). The proposed methodology is based on the combination of elementary faithfulness metrics into a combined (fused) metric, for the purpose of improving the evaluation of LLM output faithfulness. The proposed strategy for metric fusion deploys a tree-based model to identify the importance of each metric, which is driven by the integration of human judgements evaluating the faithfulness of LLM responses. This fused metric is demonstrated to correlate more strongly with human judgements across all tested domains for faithfulness. Improving the ability to evaluate the faithfulness of LLMs allows greater confidence to be placed in models, enabling their implementation in a greater diversity of scenarios. Additionally, we homogenise a collection of datasets across question answering and dialogue-based domains and implement human judgements and LLM responses within this dataset, allowing for the reproduction and trialling of faithfulness evaluation across domains.

[24] Efficient Text Classification with Conformal In-Context Learning

Ippokratis Pantelidis, Korbinian Randl, Aron Henriksson

Main category: cs.CL

TL;DR: CICLe is a framework combining lightweight base classifiers with Conformal Prediction to guide LLM prompting for text classification, reducing computational costs while maintaining performance.

DetailsMotivation: LLMs have strong in-context learning but require careful prompt design and incur high computational costs. CICLe was proposed as a resource-efficient alternative but needed broader evaluation across domains.

Method: CICLe integrates a lightweight base classifier with Conformal Prediction to adaptively reduce candidate classes, guiding LLM prompting more efficiently. It systematically reduces shots and prompt length.
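
The class-reduction step can be sketched with split conformal prediction over the base classifier's softmax outputs; the toy probabilities, labels, and coverage level below are illustrative, not CICLe's exact procedure.

```python
import numpy as np

def conformal_sets(cal_probs, cal_labels, test_probs, alpha=0.1):
    """Split conformal prediction: keep every class whose nonconformity
    score falls below the quantile calibrated on held-out data."""
    n = len(cal_labels)
    scores = 1.0 - cal_probs[np.arange(n), cal_labels]      # nonconformity
    level = min(1.0, np.ceil((n + 1) * (1 - alpha)) / n)    # clipped for tiny n
    q = np.quantile(scores, level)
    return [np.where(1.0 - p <= q)[0].tolist() for p in test_probs]

# Toy softmax outputs from a lightweight base classifier (illustrative).
cal_probs = np.array([[0.90, 0.05, 0.05], [0.05, 0.90, 0.05], [0.10, 0.10, 0.80]])
cal_labels = np.array([0, 1, 2])
test_probs = np.array([[0.85, 0.10, 0.05]])

classes = ["hazard", "fraud", "other"]
candidates = conformal_sets(cal_probs, cal_labels, test_probs)[0]
# Only the surviving classes (and their few-shot examples) enter the prompt,
# shrinking both the shot count and the prompt length.
print("Classify the text as one of: " + ", ".join(classes[i] for i in candidates))
```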

Result: CICLe consistently improves over base classifiers, outperforms few-shot prompting with sufficient data, performs comparably in low-data regimes, reduces shots by up to 34.45% and prompt length by 25.16%, and works particularly well with imbalanced classes.

Conclusion: CICLe is a practical, scalable approach for efficient text classification that combines traditional classifier robustness with LLM adaptability, achieving substantial data and computational efficiency gains.

Abstract: Large Language Models (LLMs) demonstrate strong in-context learning abilities, yet their effectiveness in text classification depends heavily on prompt design and incurs substantial computational cost. Conformal In-Context Learning (CICLe) has been proposed as a resource-efficient framework that integrates a lightweight base classifier with Conformal Prediction to guide LLM prompting by adaptively reducing the set of candidate classes. However, its broader applicability and efficiency benefits beyond a single domain have not yet been systematically explored. In this paper, we present a comprehensive evaluation of CICLe across diverse NLP classification benchmarks. The results show that CICLe consistently improves over its base classifier and outperforms few-shot prompting baselines when the sample size is sufficient for training the base classifier, and performs comparably in low-data regimes. In terms of efficiency, CICLe reduces the number of shots and prompt length by up to 34.45% and 25.16%, respectively, and enables the use of smaller models with competitive performance. CICLe is furthermore particularly advantageous for text classification tasks with high class imbalance. These findings highlight CICLe as a practical and scalable approach for efficient text classification, combining the robustness of traditional classifiers with the adaptability of LLMs, and achieving substantial gains in data and computational efficiency.

[25] Capturing Classic Authorial Style in Long-Form Story Generation with GRPO Fine-Tuning

Jinlong Liu, Mohammed Bahja, Venelin Kovatchev, Mark Lee

Main category: cs.CL

TL;DR: The paper presents a training framework for style-conditioned story generation using Group Relative Policy Optimization (GRPO) with multi-reward setup, achieving better style control than larger models like GPT-4o and Claude Sonnet 4.

DetailsMotivation: Current LLMs show impressive story generation but lack fine-grained stylistic control. Existing methods rely on shallow cues without robust evaluation, creating a need for better authorial style modeling.

Method: Uses Group Relative Policy Optimization (GRPO) with multi-reward setup: style reward from fine-tuned sentence transformer using authorship verification signals, combined with content and completeness scores for stable long-form narrative generation.
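
A minimal sketch of the multi-reward combination and the group-relative normalization at the core of GRPO; the reward weights and scores are illustrative placeholders, not the paper's configuration.

```python
import numpy as np

def combined_reward(style_sim, content, completeness, w=(0.5, 0.3, 0.2)):
    """Scalar reward mixing style similarity (from the AV sentence
    transformer), content, and completeness scores; weights are illustrative."""
    return w[0] * style_sim + w[1] * content + w[2] * completeness

def grpo_advantages(rewards):
    """Group-relative advantages: normalize each sampled completion's reward
    against the mean/std of its own group -- the core of GRPO."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + 1e-8)

# One prompt, four sampled stories scored by the three reward heads.
rewards = [combined_reward(0.63, 0.8, 0.7),
           combined_reward(0.41, 0.9, 0.6),
           combined_reward(0.70, 0.6, 0.9),
           combined_reward(0.55, 0.7, 0.4)]
print(grpo_advantages(rewards).round(3))   # positive = better than group mean
```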

Result: 8B model outperforms larger baselines (GPT-4o, Claude Sonnet 4) in authorship verification style metrics, achieving style score of 0.628 with competitive content quality. Demonstrates feasibility of agentic stylistic generation with moderate model size.

Conclusion: The framework enables effective style-aligned generation, but narrative completeness remains challenging. Future work needed for better global coherence and story resolution modeling.

Abstract: Recent advances in large language models (LLMs) show impressive performance in open-ended story generation, but fine-grained stylistic control remains limited. Existing methods often rely on shallow cues (e.g., names or topics) to simulate authorial style, without robust evaluation. In this work, we present a training framework for style-conditioned story generation using Group Relative Policy Optimization (GRPO) and a custom multi-reward setup. The style reward is derived from a fine-tuned sentence transformer using authorship verification (AV) signals, combined with content and completeness scores to stabilize long-form narrative generation. We conduct experiments using fiction by Mark Twain, a prominent 19th-century American author, with The Adventures of Huckleberry Finn serving as the reference style exemplar. Our 8B model outperforms larger baselines such as GPT-4o and Claude Sonnet 4 in AV-style metrics, achieving a style score of 0.628 and competitive content quality. Results demonstrate the feasibility of agentic stylistic generation with moderate model size and task-specific training. While the output is clearly style-aligned, narrative completeness remains a challenge, indicating future work is needed to better model global coherence and story resolution.

[26] Heard or Halted? Gender, Interruptions, and Emotional Tone in U.S. Supreme Court Oral Arguments

Yifei Tong

Main category: cs.CL

TL;DR: Interruptions during Supreme Court oral arguments don’t change argument content but show gender bias: female advocates face more negatively-toned interruptions.

DetailsMotivation: To understand how interruptions shape both semantic content and emotional tone in Supreme Court oral arguments, with particular focus on gendered dynamics in judicial discourse and power relations.

Method: Analyzed 12,663 speech chunks from advocate-justice interactions using ConvoKit Supreme Court Corpus (2010-2019). Used GloVe-based sentence embeddings to quantify semantic shifts and lexicon-based analysis to measure sentiment.
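
The two measurements can be sketched as follows, with toy stand-ins for the GloVe vectors and the sentiment lexicon (a real run would load pretrained GloVe and a published lexicon).

```python
import numpy as np

# Toy stand-ins for GloVe vectors and a sentiment lexicon (illustrative).
emb = {"the": np.array([0.1, 0.2]), "court": np.array([0.5, 0.1]),
       "rules": np.array([0.4, 0.3]), "objection": np.array([0.3, 0.6])}
lexicon = {"objection": -0.6, "rules": 0.1}

def sent_vec(tokens):
    """Mean-pooled sentence embedding over tokens with known vectors."""
    return np.mean([emb[t] for t in tokens if t in emb], axis=0)

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

pre = "the court rules".split()            # advocate speech before interruption
post = "the court rules objection".split() # speech after interruption
print("semantic similarity:", round(cosine(sent_vec(pre), sent_vec(post)), 3))
print("interruption sentiment:", sum(lexicon.get(t, 0.0) for t in ["objection"]))
```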

Result: Semantic similarity between pre- and post-interruption speech remains high (interruptions don’t alter argument content), but interruptions directed at female advocates contain significantly higher levels of negative sentiment.

Conclusion: Interruptions maintain argumentative content but exhibit gendered emotional bias, demonstrating computational linguistics’ value for studying power, discourse, and equity in judicial proceedings.

Abstract: This study examines how interruptions during U.S. Supreme Court oral arguments shape both the semantic content and emotional tone of advocates’ speech, with a focus on gendered dynamics in judicial discourse. Using the ConvoKit Supreme Court Corpus (2010-2019), we analyze 12,663 speech chunks from advocate-justice interactions to assess whether interruptions alter the meaning of an advocate’s argument and whether interruptions toward female advocates exhibit more negative emotional valence. Semantic shifts are quantified using GloVe-based sentence embeddings, while sentiment is measured through lexicon-based analysis. We find that semantic similarity between pre- and post-interruption speech remains consistently high, suggesting that interruptions do not substantially alter argumentative content. However, interruptions directed at female advocates contain significantly higher levels of negative sentiment. These results deepen empirical understanding of gendered communication in elite institutional settings and demonstrate the value of computational linguistic methods for studying power, discourse, and equity in judicial proceedings.

[27] Prompting Science Report 4: Playing Pretend: Expert Personas Don’t Improve Factual Accuracy

Savir Basil, Ina Shapiro, Dan Shapiro, Ethan Mollick, Lilach Mollick, Lennart Meincke

Main category: cs.CL

TL;DR: Persona prompts generally don’t improve AI accuracy on difficult multiple-choice questions, with expert personas showing no consistent benefit and low-knowledge personas often reducing performance.

DetailsMotivation: To understand whether assigning personas (expert or low-knowledge) to AI models improves their performance on challenging graduate-level multiple-choice questions across science, engineering, and law domains.

Method: Tested six AI models on GPQA Diamond and MMLU-Pro benchmarks using three persona approaches: 1) In-domain expert personas matched to problem types, 2) Off-domain expert personas mismatched to problem types, and 3) Low-knowledge personas (layperson, child, toddler). Compared performance against no-persona baselines.

Result: Persona prompts generally didn’t improve accuracy. In-domain expert personas had no significant impact (except for Gemini 2.0 Flash). Off-domain expert personas showed marginal differences. Low-knowledge personas were generally harmful to accuracy. No consistent benefits across models.

Conclusion: Personas don’t reliably improve factual accuracy on difficult objective questions, though they may serve other purposes like altering output tone. The findings suggest persona assignment isn’t an effective strategy for enhancing AI performance on challenging knowledge-based tasks.

Abstract: This is the fourth in a series of short reports that help business, education, and policy leaders understand the technical details of working with AI through rigorous testing. Here, we ask whether assigning personas to models improves performance on difficult objective multiple-choice questions. We study both domain-specific expert personas and low-knowledge personas, evaluating six models on GPQA Diamond (Rein et al. 2024) and MMLU-Pro (Wang et al. 2024), graduate-level questions spanning science, engineering, and law. We tested three approaches:

- In-Domain Experts: Assigning the model an expert persona (“you are a physics expert”) matched to the problem type (physics problems) had no significant impact on performance (with the exception of the Gemini 2.0 Flash model).
- Off-Domain Experts (Domain-Mismatched): Assigning the model an expert persona (“you are a physics expert”) not matched to the problem type (law problems) resulted in marginal differences.
- Low-Knowledge Personas: We assigned the model negative capability personas (layperson, young child, toddler), which were generally harmful to benchmark accuracy.

Across both benchmarks, persona prompts generally did not improve accuracy relative to a no-persona baseline. Expert personas showed no consistent benefit across models, with few exceptions. Domain-mismatched expert personas sometimes degraded performance. Low-knowledge personas often reduced accuracy. These results are about the accuracy of answers only; personas may serve other purposes (such as altering the tone of outputs), beyond improving factual performance.

[28] Optimizing Medical Question-Answering Systems: A Comparative Study of Fine-Tuned and Zero-Shot Large Language Models with RAG Framework

Tasnimul Hassan, Md Faisal Karim, Haziq Jeelani, Elham Behnam, Robert Green, Fayeq Jeelani Syed

Main category: cs.CL

TL;DR: RAG-based medical QA system using fine-tuned open-source LLMs (LLaMA 2 & Falcon) with domain knowledge retrieval improves factual accuracy and reduces hallucinations in medical question answering.

DetailsMotivation: Direct application of LLMs to clinical domains faces challenges with factual accuracy and hallucinations. Medical QA systems need reliable, evidence-based answers grounded in medical literature.

Method: Retrieval-augmented generation (RAG) system combining domain-specific knowledge retrieval with fine-tuned open-source LLMs (LLaMA 2 and Falcon) using Low-Rank Adaptation (LoRA) for efficient domain specialization.
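
A minimal sketch of LoRA-based domain specialization with the Hugging Face peft library; the base checkpoint and hyperparameters below are illustrative defaults rather than the paper's exact configuration.

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# Checkpoint and hyperparameters are illustrative assumptions.
base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")
lora = LoraConfig(
    r=16,                                  # low-rank update dimension
    lora_alpha=32,                         # scaling factor
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],   # adapt attention projections only
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora)
model.print_trainable_parameters()         # a tiny fraction of the 7B weights
```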

Result: Fine-tuned LLaMA 2 achieved 71.8% accuracy on PubMedQA (vs 55.4% zero-shot baseline). Retrieval augmentation reduced unsupported content by ~60% and improved answer accuracy on benchmark datasets (PubMedQA and MedMCQA).

Conclusion: RAG-augmented open-source LLMs show strong potential for reliable biomedical QA by grounding answers in retrieved evidence, reducing hallucinations, and improving factual correctness for clinical informatics applications.

Abstract: Medical question-answering (QA) systems can benefit from advances in large language models (LLMs), but directly applying LLMs to the clinical domain poses challenges such as maintaining factual accuracy and avoiding hallucinations. In this paper, we present a retrieval-augmented generation (RAG) based medical QA system that combines domain-specific knowledge retrieval with open-source LLMs to answer medical questions. We fine-tune two state-of-the-art open LLMs (LLaMA 2 and Falcon) using Low-Rank Adaptation (LoRA) for efficient domain specialization. The system retrieves relevant medical literature to ground the LLM’s answers, thereby improving factual correctness and reducing hallucinations. We evaluate the approach on benchmark datasets (PubMedQA and MedMCQA) and show that retrieval augmentation yields measurable improvements in answer accuracy compared to using LLMs alone. Our fine-tuned LLaMA 2 model achieves 71.8% accuracy on PubMedQA, substantially improving over the 55.4% zero-shot baseline, while maintaining transparency by providing source references. We also detail the system design and fine-tuning methodology, demonstrating that grounding answers in retrieved evidence reduces unsupported content by approximately 60%. These results highlight the potential of RAG-augmented open-source LLMs for reliable biomedical QA, pointing toward practical clinical informatics applications.

[29] M4-RAG: A Massive-Scale Multilingual Multi-Cultural Multimodal RAG

David Anugraha, Patrick Amadeus Irawan, Anshul Singh, En-Shiun Annie Lee, Genta Indra Winata

Main category: cs.CL

TL;DR: M4-RAG is a massive multilingual multimodal benchmark for evaluating retrieval-augmented VQA across 42 languages and 56 dialects, revealing that RAG benefits smaller VLMs but degrades performance of larger models.

DetailsMotivation: Vision-language models are limited by static training data, and while RAG helps access current, culturally grounded multilingual information, multilingual multimodal RAG remains underexplored.

Method: Created M4-RAG benchmark with 80,000+ culturally diverse image-question pairs across 42 languages and 56 dialects, with controlled retrieval environment containing millions of curated multilingual documents to balance realism with reproducibility.

Result: RAG consistently benefits smaller VLMs but fails to scale to larger models, often degrading their performance, revealing a critical mismatch between model size and current retrieval effectiveness.

Conclusion: M4-RAG provides foundation for advancing next-generation RAG systems capable of reasoning seamlessly across languages, modalities, and cultural contexts.

Abstract: Vision-language models (VLMs) have achieved strong performance in visual question answering (VQA), yet they remain constrained by static training data. Retrieval-Augmented Generation (RAG) mitigates this limitation by enabling access to up-to-date, culturally grounded, and multilingual information; however, multilingual multimodal RAG remains largely underexplored. We introduce M4-RAG, a massive-scale benchmark covering 42 languages and 56 regional dialects and registers, comprising over 80,000 culturally diverse image-question pairs for evaluating retrieval-augmented VQA across languages and modalities. To balance realism with reproducibility, we build a controlled retrieval environment containing millions of carefully curated multilingual documents relevant to the query domains, approximating real-world retrieval conditions while ensuring consistent experimentation. Our systematic evaluation reveals that although RAG consistently benefits smaller VLMs, it fails to scale to larger models and often even degrades their performance, exposing a critical mismatch between model size and current retrieval effectiveness. M4-RAG provides a foundation for advancing next-generation RAG systems capable of reasoning seamlessly across languages, modalities, and cultural contexts.

[30] Towards Data-efficient Customer Intent Recognition with Prompt-based Learning Paradigm

Hengyu Luo, Peng Liu, Stefan Esping

Main category: cs.CL

TL;DR: Prompt-based learning with answer mapping enables small language models to achieve competitive intent recognition with minimal training data, enhanced by active sampling and ensemble strategies.

DetailsMotivation: Accurate customer intent recognition in digital customer service is hindered by lack of sufficient labeled conversational data, creating a need for more data-efficient approaches.

Method: Prompt-based learning paradigm with answer mapping techniques, enhanced by active sampling and ensemble learning strategies in the prompted training pipeline.
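
A minimal sketch of prompted classification with answer mapping (a verbalizer) over a masked language model; the backbone, template, and label words are illustrative assumptions, not the paper's setup.

```python
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

# Backbone and verbalizer words are illustrative assumptions.
tok = AutoTokenizer.from_pretrained("distilroberta-base")
mlm = AutoModelForMaskedLM.from_pretrained("distilroberta-base")

verbalizer = {"refund": " refund", "shipping": " delivery", "account": " login"}
text = "My package still has not arrived after two weeks."
prompt = f"{text} This message is about {tok.mask_token}."

ids = tok(prompt, return_tensors="pt")
mask_pos = (ids.input_ids[0] == tok.mask_token_id).nonzero().item()
with torch.no_grad():
    logits = mlm(**ids).logits[0, mask_pos]

# Answer mapping: score each intent by the logit of its label word at <mask>.
scores = {intent: logits[tok.encode(word, add_special_tokens=False)[0]].item()
          for intent, word in verbalizer.items()}
print(max(scores, key=scores.get))
```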

Result: Small language models achieve competitive intent recognition performance with minimal training data, and show considerable instruction-following potential in zero-shot settings with well-crafted prompts.

Conclusion: The approach enables semantic modeling of conversational data in a data-efficient manner with minimal data use, paving the way for advancements in AI-driven customer service.

Abstract: Recognizing customer intent accurately with language models based on customer-agent conversational data is essential in today’s digital customer service marketplace, but it is often hindered by the lack of sufficient labeled data. In this paper, we introduce the prompt-based learning paradigm that significantly reduces the dependency on extensive datasets. Utilizing prompted training combined with answer mapping techniques, this approach allows small language models to achieve competitive intent recognition performance with only a minimal amount of training data. Furthermore, We enhance the performance by integrating active sampling and ensemble learning strategies in the prompted training pipeline. Additionally, preliminary tests in a zero-shot setting demonstrate that, with well-crafted and detailed prompts, small language models show considerable instruction-following potential even without any further training. These results highlight the viability of semantic modeling of conversational data in a more data-efficient manner with minimal data use, paving the way for advancements in AI-driven customer service.

[31] Vision-centric Token Compression in Large Language Model

Ling Xing, Alex Jinpeng Wang, Rui Yan, Xiangbo Shu, Jinhui Tang

Main category: cs.CL

TL;DR: VIST is a vision-centric token compression framework that uses a slow-fast approach: the fast path renders distant tokens into images for a vision encoder to skim, while the slow path feeds proximal tokens to the LLM for detailed reasoning, matching accuracy with 2.3× fewer tokens while cutting FLOPs by 16% and memory by 50%.

DetailsMotivation: As LLMs grow to trillions of parameters and context windows expand to hundreds of thousands of tokens, compute and memory costs skyrocket, making token compression essential for practical deployment.

Method: VIST uses a slow-fast compression framework inspired by human reading: distant tokens are rendered into images and processed by a frozen lightweight vision encoder (fast path), while proximal tokens go directly to the LLM (slow path). It employs Probability-Informed Visual Enhancement (PVE) that masks high-frequency tokens during training to focus on semantically rich regions.
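
The fast path can be sketched by rasterizing distant context into an image for a vision encoder to skim; the rendering parameters and naive line wrapping below are illustrative, not VIST's actual pipeline.

```python
from PIL import Image, ImageDraw, ImageFont

def render_context_to_image(text, width=448, height=448):
    """Fast path of the slow-fast scheme: rasterize distant, low-salience
    context so a frozen lightweight vision encoder can skim it instead of
    the LLM re-encoding every token (rendering details are illustrative)."""
    img = Image.new("RGB", (width, height), "white")
    draw = ImageDraw.Draw(img)
    font = ImageFont.load_default()
    # Naive fixed-width line wrapping; a real pipeline controls layout.
    words, lines, line = text.split(), [], ""
    for w in words:
        if len(line) + len(w) + 1 > 60:
            lines.append(line)
            line = w
        else:
            line = (line + " " + w).strip()
    lines.append(line)
    draw.multiline_text((8, 8), "\n".join(lines), fill="black", font=font)
    return img  # fed to the vision encoder; proximal tokens go to the LLM

img = render_context_to_image("Distant, low-salience context tokens " * 20)
img.save("distant_context.png")
```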

Result: Achieves same accuracy with 2.3× fewer tokens, reduces FLOPs by 16% and memory by 50%. Outperforms text encoder-based compression method CEPE by 7.6% on average across 11 benchmarks including TriviaQA, NQ, PopQA, NLUI, and CLIN.

Conclusion: VIST sets a new standard for token efficiency in LLMs by combining visual processing with text compression, offering significant computational savings while maintaining accuracy through a biologically-inspired slow-fast approach.

Abstract: Real-world applications are stretching context windows to hundreds of thousands of tokens while Large Language Models (LLMs) swell from billions to trillions of parameters. This dual expansion sends compute and memory costs skyrocketing, making token compression indispensable. We introduce Vision Centric Token Compression (Vist), a slow-fast compression framework that mirrors human reading: the fast path renders distant tokens into images, letting a frozen, lightweight vision encoder skim the low-salience context; the slow path feeds the proximal window into the LLM for fine-grained reasoning. A Probability-Informed Visual Enhancement (PVE) objective masks high-frequency tokens during training, steering the Resampler to concentrate on semantically rich regions, just as a skilled reader glosses over function words. On eleven in-context learning benchmarks, Vist achieves the same accuracy with 2.3 times fewer tokens, cutting FLOPs by 16% and memory by 50%. This method delivers remarkable results, outperforming the strongest text encoder-based compression method CEPE by 7.6% on average over benchmarks like TriviaQA, NQ, PopQA, NLUI, and CLIN, setting a new standard for token efficiency in LLMs. The project is at https://github.com/CSU-JPG/VIST.

[32] Experiments with Large Language Models on Retrieval-Augmented Generation for Closed-Source Simulation Software

Andreas Baumann, Peter Eberhard

Main category: cs.CL

TL;DR: This paper tests RAG systems for closed-source simulation frameworks using local LLMs, focusing on Pasimodo software, and finds that prompt-dependent information tailoring significantly improves response quality.

DetailsMotivation: LLMs suffer from hallucinations when dealing with unknown knowledge, especially for closed-source software applications. RAG offers a solution but needs testing for closed-source simulation frameworks where data protection and IP rights are critical.

Method: Tested existing RAG systems for closed-source simulation framework Pasimodo using local LLMs (focusing on smaller models for accessibility). Evaluated different approaches to improve response quality, particularly tailoring information provided to LLMs based on prompts.

Result: Systems show impressive results but often fail due to insufficient information. Tailoring the information provided to the LLM to each prompt proves to be a significant improvement in response quality.

Conclusion: Demonstrates great potential for RAG in closed-source simulation models but highlights need for further work to improve information retrieval, especially for smaller institutions using local LLMs.

Abstract: Large Language Models (LLMs) are tools that have become indispensable in development and programming. However, they suffer from hallucinations, especially when dealing with unknown knowledge. This is particularly the case when LLMs are to be used to support closed-source software applications. Retrieval-Augmented Generation (RAG) offers an approach to use additional knowledge alongside the pre-trained knowledge of the LLM to respond to user prompts. Possible tasks range from smart autocomplete, text extraction for question answering, model summarization, component explanation, and compositional reasoning to the creation of simulation components and complete input models. This work tests existing RAG systems for closed-source simulation frameworks, in our case the mesh-free simulation software Pasimodo. Since data protection and intellectual property rights are particularly important for problems solved with closed-source software, the tests focus on execution using local LLMs. In order to enable smaller institutions to use the systems, smaller language models will be tested first. The systems show impressive results, but often fail due to insufficient information. Different approaches for improving response quality are tested. In particular, tailoring the information provided to the LLMs depending on the prompt proves to be a significant improvement. This demonstrates the great potential and the further work needed to improve information retrieval for closed-source simulation models.

[33] AURA: A Diagnostic Framework for Tracking User Satisfaction of Interactive Planning Agents

Takyoung Kim, Janvijay Singh, Shuhaib Mehri, Emre Can Acikgoz, Sagnik Mukherjee, Nimet Beyza Bozdag, Sumuk Shashidhar, Gokhan Tur, Dilek Hakkani-Tür

Main category: cs.CL

TL;DR: AURA is a framework for assessing interactive task planning agents beyond just task completion, focusing on the entire agent-user interaction process through behavioral stage analysis.

DetailsMotivation: Existing benchmarks only evaluate agent performance based on task completion, which is misaligned with user satisfaction since users interact with the entire agentic process, not just the end result.

Method: Proposes AURA (Agent-User inteRaction Assessment) framework that conceptualizes behavioral stages of interactive task planning agents and uses atomic LLM evaluation criteria for comprehensive assessment.

Result: Analyses show that agents excel in different behavioral stages, and user satisfaction is shaped by both outcomes and intermediate behaviors. Also identifies limitations of user simulators.

Conclusion: AURA enables diagnosis of specific strengths/weaknesses in agent decision-making pipelines and highlights future directions including multi-agent systems and addressing user simulator limitations.

Abstract: The growing capabilities of large language models (LLMs) in instruction following and context understanding have led to an era of agents with numerous applications. Among these, task planning agents have become especially prominent in realistic scenarios involving complex internal pipelines, such as context understanding, tool management, and response generation. However, existing benchmarks predominantly evaluate agent performance based on task completion as a proxy for overall effectiveness. We hypothesize that merely improving task completion is misaligned with maximizing user satisfaction, as users interact with the entire agentic process and not only the end result. To address this gap, we propose AURA, an Agent-User inteRaction Assessment framework that conceptualizes the behavioral stages of interactive task planning agents. AURA offers a comprehensive assessment of agents through a set of atomic LLM evaluation criteria, allowing researchers and practitioners to diagnose specific strengths and weaknesses within the agent’s decision-making pipeline. Our analyses show that agents excel in different behavioral stages, with user satisfaction shaped by both outcomes and intermediate behaviors. We also highlight future directions, including systems that leverage multiple agents and the limitations of user simulators in task planning.

[34] SAE-SSV: Supervised Steering in Sparse Representation Spaces for Reliable Control of Language Models

Zirui He, Mingyu Jin, Bo Shen, Ali Payani, Yongfeng Zhang, Mengnan Du

Main category: cs.CL

TL;DR: Supervised steering using sparse autoencoders to control LLM behavior in interpretable latent spaces with minimal quality degradation.

DetailsMotivation: LLMs show impressive capabilities but controlling their behavior reliably in open-ended generation remains challenging. Existing methods lack interpretability and precision in steering model outputs toward desired behaviors.

Method: 1) Use sparse autoencoders (SAEs) to obtain sparse latent representations disentangling semantic attributes. 2) Train linear classifiers to identify task-relevant subspace dimensions. 3) Learn supervised steering vectors constrained to this subspace, optimized for target behaviors.
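
A toy sketch of the subspace-selection and steering steps on synthetic SAE latents; the dimensions, labels, and step size are illustrative, and a real pipeline would decode the steered latents back through the SAE.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
z = rng.normal(size=(500, 64))             # stand-in for sparse SAE latents
y = (z[:, 5] - z[:, 17] > 0).astype(int)   # stand-in target-behavior labels

clf = LogisticRegression(max_iter=1000).fit(z, y)
# Task-relevant subspace: latent dimensions with the largest |weights|.
dims = np.argsort(-np.abs(clf.coef_[0]))[:4]

# Steering vector constrained to that subspace, pushed toward the target class.
steer = np.zeros(64)
steer[dims] = np.sign(clf.coef_[0][dims]) * 0.5   # step size is illustrative
z_steered = z + steer                             # decode via the SAE in practice
print("selected dims:", dims, "mean shift:", (z_steered - z).mean().round(3))
```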

Result: Experiments across sentiment, truthfulness, and political polarity steering tasks with multiple LLMs show higher success rates with minimal generation quality degradation compared to existing methods. A notably small subspace is sufficient for effective steering.

Conclusion: The supervised steering approach in sparse, interpretable representation spaces enables more targeted and interpretable interventions for controlling LLM behavior, with implementation publicly available.

Abstract: Large language models (LLMs) have demonstrated impressive capabilities in natural language understanding and generation, but controlling their behavior reliably remains challenging, especially in open-ended generation settings. This paper introduces a novel supervised steering approach that operates in sparse, interpretable representation spaces. We employ sparse autoencoders (SAEs) to obtain sparse latent representations that aim to disentangle semantic attributes from model activations. Then we train linear classifiers to identify a small subspace of task-relevant dimensions in latent representations. Finally, we learn supervised steering vectors constrained to this subspace, optimized to align with target behaviors. Experiments across sentiment, truthfulness, and political polarity steering tasks with multiple LLMs demonstrate that our supervised steering vectors achieve higher success rates with minimal degradation in generation quality compared to existing methods. Further analysis reveals that a notably small subspace is sufficient for effective steering, enabling more targeted and interpretable interventions. Our implementation is publicly available at https://github.com/Ineedanamehere/SAE-SSV.

[35] A quantitative analysis of semantic information in deep representations of text and images

Santiago Acevedo, Andrea Mascaretti, Riccardo Rende, Matéo Mahaut, Marco Baroni, Alessandro Laio

Main category: cs.CL

TL;DR: This paper presents a method to quantitatively measure how deep neural networks encode semantic information across different domains (text and images) by analyzing information content in LLM and vision transformer representations.

DetailsMotivation: Deep neural networks develop similar representations for semantically related data across different domains (images and text, different languages), but there's a need for quantitative methods to investigate how this semantic information is encoded and distributed across model layers and tokens.

Method: The authors measure relative information content of representations for semantically related data and probe how it’s encoded into multiple tokens of LLMs and vision transformers. They analyze translated sentence pairs in LLMs to identify “semantic” layers with language-transferable information, and examine how semantic information spreads across tokens with long-distance correlations and causal asymmetry. They also identify semantic layers in vision transformers and show how caption representations in LLMs predict visual representations of corresponding images.

Result: 1) Identified inner “semantic” layers in LLMs containing most language-transferable information; 2) Larger LLMs (DeepSeek-V3) extract significantly more general information than smaller ones (Llama3.1-8B); 3) Semantic information in English text spreads across many tokens with long-distance correlations and left-to-right causal asymmetry; 4) Identified layers encoding semantic information within vision transformers; 5) Caption representations in LLM semantic layers predict visual representations of corresponding images; 6) Observed significant model-dependent information asymmetries between image and text representations.

Conclusion: The study provides a quantitative framework for analyzing semantic information encoding across modalities, revealing systematic patterns in how neural networks develop similar representations for semantically related data across domains, with implications for understanding cross-modal learning and model scaling effects.

Abstract: Deep neural networks are known to develop similar representations for semantically related data, even when they belong to different domains, such as an image and its description, or the same text in different languages. We present a method for quantitatively investigating this phenomenon by measuring the relative information content of the representations of semantically related data and probing how it is encoded into multiple tokens of large language models (LLMs) and vision transformers. Looking first at how LLMs process pairs of translated sentences, we identify inner “semantic” layers containing the most language-transferable information. We find moreover that, on these layers, a larger LLM (DeepSeek-V3) extracts significantly more general information than a smaller one (Llama3.1-8B). Semantic information of English text is spread across many tokens and it is characterized by long-distance correlations between tokens and by a causal left-to-right (i.e., past-future) asymmetry. We also identify layers encoding semantic information within visual transformers. We show that caption representations in the semantic layers of LLMs predict visual representations of the corresponding images. We observe significant and model-dependent information asymmetries between image and text representations.

[36] Hierarchical Mamba Meets Hyperbolic Geometry: A New Paradigm for Structured Language Embeddings

Sarang Patil, Ashish Parmanand Pandey, Ioannis Koutis, Mengjia Xu

Main category: cs.CL

TL;DR: HiM integrates Mamba2 with hyperbolic geometry to create hierarchy-aware language embeddings, outperforming Euclidean baselines on hierarchical reasoning tasks.

DetailsMotivation: Selective state-space models like Mamba excel at long sequences but lack hierarchical reasoning capabilities. Current LLMs use flat Euclidean embeddings that can't capture latent linguistic hierarchies needed for complex reasoning tasks.

Method: HiM combines efficient Mamba2 processing with hyperbolic geometry (Poincaré ball or Lorentzian manifold) using learnable curvature. Sequences are projected to hyperbolic space with hyperbolic loss optimization to capture relational distances across hierarchical levels.
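
The projection step can be sketched with the exponential map at the origin of the Poincaré ball and the corresponding geodesic distance; the input features are stand-ins for Mamba2 outputs, and the fixed curvature here simplifies the paper's learnable-curvature setup.

```python
import torch

def expmap0(v, c=1.0):
    """Exponential map at the origin of the Poincare ball with curvature -c:
    lifts Euclidean features into hyperbolic space (always inside the ball)."""
    norm = v.norm(dim=-1, keepdim=True).clamp_min(1e-7)
    return torch.tanh(c**0.5 * norm) * v / (c**0.5 * norm)

def poincare_dist(x, y, c=1.0):
    """Geodesic distance on the Poincare ball: small near the origin and
    exploding toward the boundary, which is how hierarchy depth is encoded."""
    sq = lambda t: (t * t).sum(dim=-1)
    ratio = 2 * c * sq(x - y) / ((1 - c * sq(x)) * (1 - c * sq(y)))
    return (1 / c**0.5) * torch.acosh(1 + ratio.clamp_min(1e-7))

feats = torch.randn(4, 8) * 0.3        # stand-in for Mamba2 sequence outputs
h = expmap0(feats)                     # points inside the unit ball
print(poincare_dist(h[0], h[1]).item())
```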

Result: Both HiM variants effectively capture hierarchical relationships across four linguistic/medical datasets, surpassing Euclidean baselines. HiM-Poincaré provides fine-grained distinctions with higher h-norms, while HiM-Lorentz offers more stable, compact, hierarchy-preserving embeddings favoring robustness.

Conclusion: Integrating hyperbolic geometry with Mamba2 enables hierarchy-aware language embeddings that improve hierarchical reasoning capabilities, with different hyperbolic manifolds offering complementary advantages for different application needs.

Abstract: Selective state-space models excel at long-sequence modeling, but their capacity for language representation – in complex hierarchical reasoning – remains underexplored. Most large language models rely on flat Euclidean embeddings, limiting their ability to capture latent hierarchies. To address this, we propose Hierarchical Mamba (HiM), integrating efficient Mamba2 with hyperbolic geometry to learn hierarchy-aware language embeddings for deeper linguistic understanding. Mamba2-processed sequences are projected to the Poincaré ball or Lorentzian manifold with “learnable” curvature, optimized with a hyperbolic loss. Our HiM model facilitates the capture of relational distances across varying hierarchical levels, enabling effective long-range reasoning for tasks like mixed-hop prediction and multi-hop inference in hierarchical classification. Experimental results show both HiM variants effectively capture hierarchical relationships across four linguistic and medical datasets, surpassing Euclidean baselines, with HiM-Poincaré providing fine-grained distinctions with higher h-norms, while HiM-Lorentz offers more stable, compact, and hierarchy-preserving embeddings, favoring robustness. The source code is publicly available at https://github.com/BerryByte/HiM.

[37] Pet-Bench: Benchmarking the Abilities of Large Language Models as E-Pets in Social Network Services

Hongcheng Guo, Zheyong Xie, Shaosheng Cao, Boyang Wang, Weiting Liu, Zheyu Ye, Zhoujun Li, Zuozhu Liu, Wei Liu

Main category: cs.CL

TL;DR: Pet-Bench is a new benchmark for evaluating LLMs as virtual pet companions, testing both self-interaction and human-interaction capabilities with over 7,500 interaction instances across diverse tasks.

DetailsMotivation: Virtual pet companionship is an emerging application for LLMs that enables interactive and emotionally rich experiences, but existing approaches lack systematic benchmarking for comprehensive companionship capabilities.

Method: The authors introduce Pet-Bench, a dedicated benchmark that evaluates LLMs across self-interaction and human-interaction dimensions, emphasizing self-evolution and developmental behaviors. It includes diverse tasks like intelligent scheduling, memory-based dialogues, and psychological conversations.

Result: Evaluation of 28 LLMs shows significant performance variations linked to model size and inherent capabilities, highlighting the need for specialized optimization in virtual pet companionship applications.

Conclusion: Pet-Bench serves as a foundational resource for benchmarking pet-related LLM abilities and advancing emotionally immersive human-pet interactions, addressing a previously underexplored application area.

Abstract: As interest in using Large Language Models for interactive and emotionally rich experiences grows, virtual pet companionship emerges as a novel yet underexplored application. Existing approaches focus on basic pet role-playing interactions without systematically benchmarking LLMs for comprehensive companionship. In this paper, we introduce Pet-Bench, a dedicated benchmark that evaluates LLMs across both self-interaction and human-interaction dimensions. Unlike prior work, Pet-Bench emphasizes self-evolution and developmental behaviors alongside interactive engagement, offering a more realistic reflection of pet companionship. It features diverse tasks such as intelligent scheduling, memory-based dialogues, and psychological conversations, with over 7,500 interaction instances designed to simulate pet behaviors. Evaluation of 28 LLMs reveals significant performance variations linked to model size and inherent capabilities, underscoring the need for specialized optimization in this domain. Pet-Bench serves as a foundational resource for benchmarking pet-related LLM abilities and advancing emotionally immersive human-pet interactions.

[38] CodeNER: Code Prompting for Named Entity Recognition

Sungwoo Han, Jingun Kwon, Hidetaka Kamigaito, Manabu Okumura

Main category: cs.CL

TL;DR: The paper proposes code-based prompting for NER with LLMs, embedding BIO schema instructions in code format to improve LLMs’ understanding of labeling requirements, outperforming text-based prompting across multiple languages.

DetailsMotivation: Previous NER approaches using LLMs rely solely on input context without capturing detailed labeling requirements. NER inherently needs both context and explicit labeling instructions, which current methods lack.

Method: Proposes code-based prompting that embeds BIO schema instructions within prompts using code format. This leverages LLMs’ ability to comprehend long-range scopes in programming languages to better understand NER requirements.
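
A minimal sketch of what a code-based NER prompt might look like, with the BIO schema embedded as a docstring contract for the model to complete; the exact template is an illustrative assumption, not the paper's prompt.

```python
def build_code_prompt(sentence: str) -> str:
    """Wrap the NER task in code so the LLM reads the BIO schema as a typed
    contract rather than free-form instructions (template is illustrative)."""
    return f'''
def named_entity_recognition(tokens: list[str]) -> list[str]:
    """Return one BIO tag per token.

    Schema:
      B-PER / I-PER : person        B-LOC / I-LOC : location
      B-ORG / I-ORG : organization  O             : outside any entity
    """
    tokens = {sentence.split()!r}
    labels ='''

print(build_code_prompt("Ada Lovelace visited London"))
# The LLM is asked to complete `labels = [...]`, e.g.:
#   ["B-PER", "I-PER", "O", "B-LOC"]
```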

Result: The code-based prompting method outperforms conventional text-based prompting on ten benchmarks across English, Arabic, Finnish, Danish, and German datasets. Combining with chain-of-thought prompting further improves performance.

Conclusion: Explicitly structuring NER instructions through code-based prompting effectively improves LLMs’ NER capabilities. The approach demonstrates the importance of detailed schema instructions beyond just input context.

Abstract: Recent studies have explored various approaches for treating candidate named entity spans as both source and target sequences in named entity recognition (NER) by leveraging large language models (LLMs). Although previous approaches have successfully generated candidate named entity spans with suitable labels, they rely solely on input context information when using LLMs, particularly ChatGPT. However, NER inherently requires capturing detailed labeling requirements alongside input context information. To address this issue, we propose a novel method that leverages code-based prompting to improve the capabilities of LLMs in understanding and performing NER. By embedding code within prompts, we provide detailed BIO schema instructions for labeling, thereby exploiting the ability of LLMs to comprehend long-range scopes in programming languages. Experimental results demonstrate that the proposed code-based prompting method outperforms conventional text-based prompting on ten benchmarks across English, Arabic, Finnish, Danish, and German datasets, indicating the effectiveness of explicitly structuring NER instructions. We also verify that combining the proposed code-based prompting method with chain-of-thought prompting further improves performance.

[39] A Survey on Diffusion Language Models

Tianyi Li, Mingda Chen, Bowei Guo, Zhiqiang Shen

Main category: cs.CL

TL;DR: This survey paper provides a comprehensive overview of Diffusion Language Models (DLMs), covering their evolution, principles, state-of-the-art techniques, inference strategies, multimodal extensions, applications, limitations, and future directions.

DetailsMotivation: DLMs are emerging as a promising alternative to autoregressive models due to their parallel generation capabilities, reduced inference latency, and ability to capture bidirectional context for fine-grained control over generation.

Method: The survey traces DLM evolution, compares them with other paradigms (autoregressive and masked language models), covers foundational principles and state-of-the-art models, analyzes pre-training and post-training methods, reviews inference strategies and optimizations, and examines multimodal extensions.

Result: DLMs achieve several-fold speed-up while showing performance comparable to autoregressive models, making them compelling for various NLP tasks. The survey provides comprehensive taxonomy, in-depth analysis of current techniques, and thorough review of inference optimizations.

Conclusion: DLMs represent a rapidly evolving field with significant potential, though they face challenges in efficiency, long-sequence handling, and infrastructure requirements. Future research directions are outlined to sustain progress in this promising alternative to autoregressive language modeling.

Abstract: Diffusion Language Models (DLMs) are rapidly emerging as a powerful and promising alternative to the dominant autoregressive (AR) paradigm. By generating tokens in parallel through an iterative denoising process, DLMs possess inherent advantages in reducing inference latency and capturing bidirectional context, thereby enabling fine-grained control over the generation process. While achieving a several-fold speed-up, recent advancements have allowed DLMs to show performance comparable to their autoregressive counterparts, making them a compelling choice for various natural language processing tasks. In this survey, we provide a holistic overview of the current DLM landscape. We trace its evolution and relationship with other paradigms, such as autoregressive and masked language models, and cover both foundational principles and state-of-the-art models. Our work offers an up-to-date, comprehensive taxonomy and an in-depth analysis of current techniques, from pre-training strategies to advanced post-training methods. Another contribution of this survey is a thorough review of DLM inference strategies and optimizations, including improvements in decoding parallelism, caching mechanisms, and generation quality. We also highlight the latest approaches to multimodal extensions of DLMs and delineate their applications across various practical scenarios. Furthermore, our discussion addresses the limitations and challenges of DLMs, including efficiency, long-sequence handling, and infrastructure requirements, while outlining future research directions to sustain progress in this rapidly evolving field. Project GitHub is available at https://github.com/VILA-Lab/Awesome-DLMs.

[40] HARP: Hallucination Detection via Reasoning Subspace Projection

Junjie Hu, Gang Tu, ShengYu Cheng, Jinxin Li, Jinting Wang, Rui Chen, Zhilong Zhou, Dongbo Shan

Main category: cs.CL

TL;DR: HARP is a novel hallucination detection framework that decomposes LLM hidden states into semantic and reasoning subspaces, using SVD on the Unembedding layer to obtain basis vectors and projecting onto the reasoning subspace for enhanced detection performance.

DetailsMotivation: Hallucinations in LLMs hinder reliable use in critical decision-making. Existing detection methods struggle with disentangling semantic and reasoning information and maintaining robustness, creating a need for improved approaches.

Method: HARP decomposes LLM hidden state space into semantic and reasoning subspaces, uses SVD on Unembedding layer parameters to obtain basis vectors, projects hidden states onto reasoning subspace basis vectors, and uses these projections as features for hallucination detection.
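
A rough sketch of the projection machinery: SVD of a stand-in Unembedding matrix, then projection of a hidden state onto one set of right singular vectors. Which singular directions constitute the reasoning subspace is the paper's contribution; the top-k split below is purely illustrative.

```python
import torch

hidden_dim, vocab = 256, 1000
W_U = torch.randn(vocab, hidden_dim)       # stand-in Unembedding matrix

# Right singular vectors span the directions the vocabulary projection
# "reads" from hidden states; the semantic/reasoning split shown here
# (top-k vs. remainder) is an illustrative assumption.
U, S, Vh = torch.linalg.svd(W_U, full_matrices=False)
k = 64
semantic_basis = Vh[:k]                    # (k, hidden_dim)
reasoning_basis = Vh[k:]                   # (hidden_dim - k, hidden_dim)

h = torch.randn(hidden_dim)                # a hidden state from the LLM
features = reasoning_basis @ h             # projections fed to the detector
print(features.shape)                      # reduced feature dimension
```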

Result: HARP achieves state-of-the-art performance across multiple datasets, with 92.8% AUROC on TriviaQA (7.5% improvement over previous best), reduces feature dimensions to ~5% of original, filters noise, and enhances robustness.

Conclusion: HARP effectively addresses hallucination detection challenges by leveraging subspace decomposition and reasoning subspace projection, offering improved accuracy, robustness, and dimensionality reduction for reliable LLM applications.

Abstract: Hallucinations in Large Language Models (LLMs) pose a major barrier to their reliable use in critical decision-making. Although existing hallucination detection methods have improved accuracy, they still struggle with disentangling semantic and reasoning information and maintaining robustness. To address these challenges, we propose HARP (Hallucination detection via reasoning subspace projection), a novel hallucination detection framework. HARP establishes that the hidden state space of LLMs can be decomposed into a direct sum of a semantic subspace and a reasoning subspace, where the former encodes linguistic expression and the latter captures internal reasoning processes. Moreover, we demonstrate that the Unembedding layer can disentangle these subspaces, and by applying Singular Value Decomposition (SVD) to its parameters, the basis vectors spanning the semantic and reasoning subspaces are obtained. Finally, HARP projects hidden states onto the basis vectors of the reasoning subspace, and the resulting projections are then used as input features for hallucination detection in LLMs. By using these projections, HARP reduces the dimension of the feature to approximately 5% of the original, filters out most noise, and achieves enhanced robustness. Experiments across multiple datasets show that HARP achieves state-of-the-art hallucination detection performance; in particular, it achieves an AUROC of 92.8% on TriviaQA, outperforming the previous best method by 7.5%.

[41] From Simulation to Strategy: Automating Personalized Interaction Planning for Conversational Agents

Wen-Yu Chang, Tzu-Hung Huang, Chih-Ho Chen, Yun-Nung Chen

Main category: cs.CL

TL;DR: A sales-oriented dialogue agent uses occupation-conditioned strategies to prioritize user-preferred intents, resulting in shorter, more successful conversations based on user profile analysis.

DetailsMotivation: As agentic dialogue models rise, realistic user-simulator studies are needed to develop effective conversation strategies for sales-oriented agents that must adapt to diverse user profiles.

Method: The study investigates a sales agent adapting to user profiles (age, gender, occupation), finding occupation produces the most significant differences in conversational intent. A lightweight, occupation-conditioned strategy is introduced to guide the agent to prioritize intents aligned with user preferences.
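
The occupation-conditioned strategy can be sketched as a simple priority lookup over conversational intents; the occupations and intent labels below are illustrative placeholders.

```python
# Illustrative occupation-to-intent priorities for the sales agent.
PRIORITIES = {
    "engineer": ["technical_specs", "comparison", "pricing"],
    "teacher":  ["use_cases", "pricing", "warranty"],
    "student":  ["pricing", "discounts", "use_cases"],
}

def next_intent(profile: dict, available: list[str]) -> str:
    """Pick the highest-priority intent for the user's occupation,
    falling back to the agent's default ordering."""
    for intent in PRIORITIES.get(profile.get("occupation"), []):
        if intent in available:
            return intent
    return available[0]

print(next_intent({"occupation": "student"}, ["warranty", "pricing", "comparison"]))
```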

Result: Age and gender influence overall performance, but occupation produces the most pronounced differences in conversational intent. The occupation-conditioned strategy results in shorter and more successful dialogues.

Conclusion: Rich simulator profiles are important for dialogue systems, and simple persona-informed strategies can significantly enhance the effectiveness of sales-oriented dialogue systems by aligning with user preferences.

Abstract: Amid the rapid rise of agentic dialogue models, realistic user-simulator studies are essential for tuning effective conversation strategies. This work investigates a sales-oriented agent that adapts its dialogue based on user profiles spanning age, gender, and occupation. While age and gender influence overall performance, occupation produces the most pronounced differences in conversational intent. Leveraging this insight, we introduce a lightweight, occupation-conditioned strategy that guides the agent to prioritize intents aligned with user preferences, resulting in shorter and more successful dialogues. Our findings highlight the importance of rich simulator profiles and demonstrate how simple persona-informed strategies can enhance the effectiveness of sales-oriented dialogue systems.

[42] Analysing Moral Bias in Finetuned LLMs through Mechanistic Interpretability

Bianca Raimondi, Daniela Dalbagno, Maurizio Gabbrielli

Main category: cs.CL

TL;DR: LLMs learn moral biases like the Knobe effect during finetuning, which can be localized to specific layers and eliminated by patching activations from pretrained models without retraining.

DetailsMotivation: To understand how LLMs internalize human-like biases during finetuning, specifically investigating the Knobe effect (moral bias in intentionality judgements), and determine whether these biases can be traced to specific model components.

Method: Conducted Layer-Patching analysis across 3 open-weights LLMs to localize where the Knobe effect bias manifests, then patched activations from corresponding pretrained models into critical layers.
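
A minimal layer-patching sketch using PyTorch forward hooks, assuming Hugging Face LLaMA-style models that expose a `model.layers` list; the module paths and single-layer patch are illustrative assumptions.

```python
import torch

def patch_layer(finetuned, pretrained, layer_idx, inputs):
    """Run the finetuned model, but overwrite one layer's output with the
    pretrained model's activations for the same inputs."""
    cache = {}

    def save_hook(_, __, out):
        cache["act"] = out[0] if isinstance(out, tuple) else out

    def swap_hook(_, __, out):
        if isinstance(out, tuple):
            return (cache["act"],) + out[1:]
        return cache["act"]

    h1 = pretrained.model.layers[layer_idx].register_forward_hook(save_hook)
    with torch.no_grad():
        pretrained(**inputs)               # populate the activation cache
    h1.remove()

    h2 = finetuned.model.layers[layer_idx].register_forward_hook(swap_hook)
    with torch.no_grad():
        logits = finetuned(**inputs).logits
    h2.remove()
    return logits   # if this layer carried the bias, it should now be gone
```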

Result: The Knobe effect bias is learned during finetuning and localized to specific layers. Patching activations from pretrained models into just a few critical layers successfully eliminates the bias.

Conclusion: Social biases in LLMs can be interpreted, localized, and mitigated through targeted interventions without model retraining, offering new approaches for bias reduction in AI systems.

Abstract: Large language models (LLMs) have been shown to internalize human-like biases during finetuning, yet the mechanisms by which these biases manifest remain unclear. In this work, we investigated whether the well-known Knobe effect, a moral bias in intentionality judgements, emerges in finetuned LLMs and whether it can be traced back to specific components of the model. We conducted a Layer-Patching analysis across 3 open-weights LLMs and demonstrated that the bias is not only learned during finetuning but also localized in a specific set of layers. Surprisingly, we found that patching activations from the corresponding pretrained model into just a few critical layers is sufficient to eliminate the effect. Our findings offer new evidence that social biases in LLMs can be interpreted, localized, and mitigated through targeted interventions, without the need for model retraining.

[43] Chinese Discharge Drug Recommendation in Metabolic Diseases with Large Language Models

Juntao Li, Haobin Yuan, Ling Luo, Yan Jiang, Fan Wang, Ping Zhang, Huiyi Lv, Jian Wang, Yuanyuan Sun, Hongfei Lin

Main category: cs.CL

TL;DR: This paper systematically evaluates LLMs for Chinese discharge medication recommendation, testing various models and methods to assess their potential and limitations in clinical drug recommendation.

DetailsMotivation: Intelligent drug recommendation from EHRs is crucial for clinical decision-making, but LLM applications for Chinese clinical medication recommendation remain largely unexplored despite recent advances in medical text understanding.

Method: Systematic investigation of LLM-based methodologies including evaluation of GLM, Llama, Qwen models under unified framework with zero-shot prompting, in-context learning, chain-of-thought prompting, and supervised fine-tuning using LoRA.
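
For the supervised fine-tuning stage, a minimal LoRA setup with Hugging Face PEFT; the model checkpoint and hyperparameters below are placeholders, not the paper's settings:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

# Hypothetical base checkpoint; the paper evaluates GLM, Llama, and Qwen families.
base = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2-7B-Instruct")
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2-7B-Instruct")

# Placeholder LoRA hyperparameters.
config = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, config)
model.print_trainable_parameters()
# Train with the standard causal-LM loss on (EHR context -> discharge drugs) pairs.
```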

Result: Supervised fine-tuning improves model performance, but substantial room for improvement remains: the best model achieved an F1 score of 0.5648 and a Jaccard score of 0.4477. The paper also analyzes reasoning styles, error patterns, domain adaptation, and robustness.

Conclusion: The findings highlight both the potential and limitations of LLMs for Chinese drug recommendation, showing promise but indicating significant improvement is needed for clinical application.

Abstract: Intelligent drug recommendation based on Electronic Health Records (EHRs) is critical for improving the quality and efficiency of clinical decision-making. By leveraging large-scale patient data, drug recommendation systems can assist physicians in selecting the most appropriate medications according to a patient’s medical history, diagnoses, laboratory results, and comorbidities. Recent advances in large language models (LLMs) have shown remarkable capabilities in complex reasoning and medical text understanding, making them promising tools for drug recommendation tasks. However, the application of LLMs for Chinese clinical medication recommendation remains largely unexplored. In this work, we conduct a systematic investigation of LLM-based methodologies for Chinese discharge medication recommendation. We evaluate several representative LLM families (GLM, Llama, Qwen) under a unified methodological framework including zero-shot prompting, in-context learning, chain-of-thought prompting, and supervised fine-tuning using LoRA. We analyze model behavior across reasoning styles, error patterns, domain adaptation mechanisms, and robustness. Experimental results show that while supervised fine-tuning improves model performance, there remains substantial room for improvement, with the best model achieving the F1 score of 0.5648 and Jaccard score of 0.4477. Our findings highlight both the potential and limitations of LLMs for Chinese drug recommendation.

[44] HalluClean: A Unified Framework to Combat Hallucinations in LLMs

Yaxin Zhao, Yu Zhang

Main category: cs.CL

TL;DR: HalluClean is a lightweight, task-agnostic framework that uses reasoning-enhanced planning, execution, and revision stages to detect and correct hallucinations in LLM-generated text without external knowledge or supervised training.

DetailsMotivation: LLMs often produce hallucinated content that undermines factual reliability, creating a need for methods to enhance the trustworthiness of LLM outputs in real-world applications.

Method: HalluClean adopts a reasoning-enhanced paradigm with three explicit stages: planning, execution, and revision. It uses minimal task-routing prompts for zero-shot generalization across domains without external knowledge sources or supervised detectors.
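
A schematic of the plan-execute-revise loop; the prompt wording and the injected `call_llm` helper are placeholders, not the paper's prompts:

```python
def halluclean_style_pipeline(task_type: str, draft: str, call_llm) -> str:
    """Plan -> execute -> revise sketch for hallucination cleanup."""
    # Planning: enumerate the claims that need checking.
    plan = call_llm(
        f"Task: {task_type}\nList the checkable claims in this text:\n{draft}"
    )
    # Execution: judge each claim without consulting external knowledge.
    verdicts = call_llm(
        f"For each claim below, state whether the text itself supports it:\n{plan}"
    )
    # Revision: rewrite the draft, fixing or dropping unsupported claims.
    return call_llm(
        f"Rewrite the text, correcting or removing unsupported claims.\n"
        f"Text:\n{draft}\nFindings:\n{verdicts}"
    )
```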

Result: Extensive evaluations on five tasks (question answering, dialogue, summarization, math word problems, contradiction detection) show HalluClean significantly improves factual consistency and outperforms competitive baselines.

Conclusion: HalluClean demonstrates potential to enhance the trustworthiness of LLM outputs in real-world applications through its lightweight, task-agnostic approach to hallucination detection and correction.

Abstract: Large language models (LLMs) have achieved impressive performance across a wide range of natural language processing tasks, yet they often produce hallucinated content that undermines factual reliability. To address this challenge, we introduce HalluClean, a lightweight and task-agnostic framework for detecting and correcting hallucinations in LLM-generated text. HalluClean adopts a reasoning-enhanced paradigm, explicitly decomposing the process into planning, execution, and revision stages to identify and refine unsupported claims. It employs minimal task-routing prompts to enable zero-shot generalization across diverse domains, without relying on external knowledge sources or supervised detectors. We conduct extensive evaluations on five representative tasks-question answering, dialogue, summarization, math word problems, and contradiction detection. Experimental results show that HalluClean significantly improves factual consistency and outperforms competitive baselines, demonstrating its potential to enhance the trustworthiness of LLM outputs in real-world applications.

[45] LAET: A Layer-wise Adaptive Ensemble Tuning Framework for Pretrained Language Models

Jawad Ibn Ahad, Muhammad Rafsan Kabir, Robin Krambroeckers, Sifat Momen, Nabeel Mohammed, Shafin Rahman

Main category: cs.CL

TL;DR: LAET (Layer-wise Adaptive Ensemble Tuning) is a novel fine-tuning strategy that selectively tunes only the most effective layers of pre-trained LLMs for financial NLP tasks, reducing computational costs while improving performance over larger models like GPT-4.

DetailsMotivation: While LLMs like BloombergGPT and FinMA have advanced financial NLP, their high computational demands limit accessibility for many organizations. There's a need for more efficient fine-tuning methods that maintain or improve performance while reducing resource requirements.

Method: LAET analyzes hidden state representations to identify the most effective layers in pre-trained LLMs, then selectively fine-tunes only those critical layers while freezing less important ones. This creates an adaptive ensemble approach that optimizes computational efficiency.
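
A rough sketch of layer selection and freezing; the class-mean separation criterion stands in for the paper's hidden-state analysis, and `model.model.layers` is a hypothetical module path:

```python
import torch

@torch.no_grad()
def probe_score(hidden, labels):
    """Cheap proxy for layer usefulness: separation of class-mean representations."""
    # hidden: (N, d) pooled layer representations; labels: (N,) binary labels
    mu0 = hidden[labels == 0].mean(0)
    mu1 = hidden[labels == 1].mean(0)
    return torch.norm(mu1 - mu0).item()

def select_and_freeze(model, layer_hiddens, labels, k=4):
    """Unfreeze only the k highest-scoring layers; freeze the rest."""
    scores = [probe_score(h, labels) for h in layer_hiddens]
    top = set(sorted(range(len(scores)), key=lambda i: -scores[i])[:k])
    for i, block in enumerate(model.model.layers):
        for p in block.parameters():
            p.requires_grad = i in top
    return top
```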

Result: LAET achieves strong performance on financial NLP tasks, outperforming existing baselines and state-of-the-art LLMs like GPT-4 on established benchmarks, even with smaller models (~3B parameters). It significantly reduces computational overhead while enhancing task-specific performance.

Conclusion: LAET bridges cutting-edge financial NLP research with practical deployment by providing an efficient, scalable fine-tuning strategy that makes advanced financial NLP capabilities more accessible to organizations with limited computational resources.

Abstract: Natural Language Processing (NLP) has transformed the financial industry, enabling advancements in areas such as textual analysis, risk management, and forecasting. Large language models (LLMs) like BloombergGPT and FinMA have set new benchmarks across various financial NLP tasks, including sentiment analysis, stock movement prediction, and credit risk assessment. Furthermore, FinMA-ES, a bilingual financial LLM, has also demonstrated strong performance using the FLARE and FLARE-ES benchmarks. However, the high computational demands of these models limit the accessibility of many organizations. To address this, we propose Layer-wise Adaptive Ensemble Tuning (LAET), a novel strategy that selectively fine-tunes the most effective layers of pre-trained LLMs by analyzing hidden state representations while freezing less critical layers. LAET significantly reduces computational overhead while enhancing task-specific performance. Our approach shows strong results in financial NLP tasks, outperforming existing benchmarks and state-of-the-art LLMs such as GPT-4, even with smaller LLMs ($\sim$3B parameters). This work bridges cutting-edge financial NLP research and real-world deployment with efficient and scalable models for financial applications.

[46] MindEval: Benchmarking Language Models on Multi-turn Mental Health Support

José Pombal, Maya D’Eon, Nuno M. Guerreiro, Pedro Henrique Martins, António Farinhas, Ricardo Rei

Main category: cs.CL

TL;DR: MindEval is a framework for automatically evaluating LLMs in realistic multi-turn mental health therapy conversations, showing current models struggle with therapeutic communication despite reasoning capabilities and scale.

DetailsMotivation: Current AI mental health chatbots have limitations like sycophancy, overvalidation, and reinforcement of maladaptive beliefs. There's a scarcity of benchmarks capturing real therapeutic complexity, as most only test clinical knowledge through multiple-choice questions or assess single responses in isolation.

Method: Developed MindEval framework in collaboration with Ph.D-level Licensed Clinical Psychologists. Uses patient simulation and automatic evaluation with LLMs, featuring fully automated, model-agnostic design. Validated realism of simulated patients against human-generated text and demonstrated strong correlations between automatic and human expert judgments.
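
The evaluation loop, reduced to a schematic; all three callable interfaces are placeholders rather than MindEval's actual components:

```python
def simulate_session(patient_llm, therapist_llm, judge_llm, max_turns=20):
    """Multi-turn evaluation sketch: simulated patient vs. model under test."""
    transcript = []
    for _ in range(max_turns):
        patient_msg = patient_llm(transcript)      # simulated patient speaks
        transcript.append(("patient", patient_msg))
        therapist_msg = therapist_llm(transcript)  # model under test replies
        transcript.append(("therapist", therapist_msg))
    return judge_llm(transcript)  # e.g. a rubric score on a 1-6 scale
```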

Result: Evaluated 12 state-of-the-art LLMs; all scored below 4 out of 6 on average, with particular weaknesses in problematic AI-specific communication patterns. Reasoning capabilities and model scale do not guarantee better performance, and systems deteriorate with longer interactions or when supporting patients with severe symptoms.

Conclusion: Current LLMs struggle with therapeutic communication despite advances in reasoning and scale. The MindEval framework provides a valuable tool for evaluating mental health AI systems, revealing important limitations that need addressing for effective therapeutic applications.

Abstract: Demand for mental health support through AI chatbots is surging, though current systems present several limitations, like sycophancy or overvalidation, and reinforcement of maladaptive beliefs. A core obstacle to the creation of better systems is the scarcity of benchmarks that capture the complexity of real therapeutic interactions. Most existing benchmarks either only test clinical knowledge through multiple-choice questions or assess single responses in isolation. To bridge this gap, we present MindEval, a framework designed in collaboration with Ph.D-level Licensed Clinical Psychologists for automatically evaluating language models in realistic, multi-turn mental health therapy conversations. Through patient simulation and automatic evaluation with LLMs, our framework balances resistance to gaming with reproducibility via its fully automated, model-agnostic design. We begin by quantitatively validating the realism of our simulated patients against human-generated text and by demonstrating strong correlations between automatic and human expert judgments. Then, we evaluate 12 state-of-the-art LLMs and show that all models struggle, scoring below 4 out of 6, on average, with particular weaknesses in problematic AI-specific patterns of communication. Notably, reasoning capabilities and model scale do not guarantee better performance, and systems deteriorate with longer interactions or when supporting patients with severe symptoms. We release all code, prompts, and human evaluation data.

[47] Decoding inner speech with an end-to-end brain-to-text neural interface

Yizi Zhang, Linyang He, Chaofei Fan, Tingkai Liu, Han Yu, Trung Le, Jingyuan Li, Scott Linderman, Lea Duncker, Francis R Willett, Nima Mesgarani, Liam Paninski

Main category: cs.CL

TL;DR: BIT is an end-to-end Brain-to-Text framework that translates neural activity into coherent sentences using a single differentiable neural network, achieving state-of-the-art performance on speech BCI benchmarks.

DetailsMotivation: Current speech BCIs use cascaded frameworks that decode phonemes before assembling sentences with n-gram language models, preventing joint optimization of all stages simultaneously. This limits performance and integration.

Method: Introduces an end-to-end Brain-to-Text framework with: 1) cross-task, cross-species pretrained neural encoder that transfers to attempted and imagined speech, 2) integration with audio large language models (LLMs), 3) contrastive learning for cross-modal alignment, and 4) single differentiable neural network architecture.
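
The contrastive alignment component is described only at a high level; a standard symmetric InfoNCE sketch for paired neural and audio-LLM embeddings, where the temperature and CLIP-style formulation are assumptions:

```python
import torch
import torch.nn.functional as F

def info_nce(neural_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE over a batch of paired (neural, text) embeddings."""
    n = F.normalize(neural_emb, dim=-1)  # (B, d)
    t = F.normalize(text_emb, dim=-1)    # (B, d)
    logits = n @ t.T / temperature       # (B, B) similarity matrix
    targets = torch.arange(n.size(0), device=n.device)
    return (F.cross_entropy(logits, targets)
            + F.cross_entropy(logits.T, targets)) / 2
```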

Result: Achieves new SOTA on Brain-to-Text ‘24/‘25 benchmarks in cascaded setting. Reduces WER from 24.69% to 10.22% in end-to-end setting. Small-scale audio LLMs significantly improve end-to-end decoding. Enables cross-task generalization between attempted and imagined speech.

Conclusion: BIT advances integration of large, diverse neural datasets and enables seamless, differentiable optimization for end-to-end speech BCI decoding, paving the way for more effective communication restoration systems.

Abstract: Speech brain-computer interfaces (BCIs) aim to restore communication for people with paralysis by translating neural activity into text. Most systems use cascaded frameworks that decode phonemes before assembling sentences with an n-gram language model (LM), preventing joint optimization of all stages simultaneously. Here, we introduce an end-to-end Brain-to-Text (BIT) framework that translates neural activity into coherent sentences using a single differentiable neural network. Central to our approach is a cross-task, cross-species pretrained neural encoder, whose representations transfer to both attempted and imagined speech. In a cascaded setting with an n-gram LM, the pretrained encoder establishes a new state-of-the-art (SOTA) on the Brain-to-Text ‘24 and ‘25 benchmarks. Integrated end-to-end with audio large language models (LLMs) and trained with contrastive learning for cross-modal alignment, BIT reduces the word error rate (WER) of the prior end-to-end method from 24.69% to 10.22%. Notably, we find that small-scale audio LLMs markedly improve end-to-end decoding. Beyond record-setting performance, BIT aligns attempted and imagined speech embeddings to enable cross-task generalization. Altogether, our approach advances the integration of large, diverse neural datasets, paving the way for an end-to-end decoding framework that supports seamless, differentiable optimization.

[48] Training-Free Loosely Speculative Decoding: Accepting Semantically Correct Drafts Beyond Exact Match

Jinze Li, Yixing Xu, Guanchen Li, Shuo Yang, Jinfeng Xu, Xuanwu Yin, Dong Li, Edith C. H. Ngai, Emad Barsoum

Main category: cs.CL

TL;DR: FLy is a training-free speculative decoding method that loosens exact-match verification using the target model’s self-correction, achieving 2.81-5.07x speedup while preserving >99% accuracy.

DetailsMotivation: LLMs have high inference latency due to autoregressive generation. Existing speculative decoding methods discard semantically valid continuations through strict exact-match verification and suffer performance degradation on OOD tasks.

Method: A two-tier mechanism: 1) an entropy-level gate identifies whether the current token admits multiple plausible alternatives, and 2) a token-level deferred window distinguishes genuine errors from semantically correct variants. A multi-level acceleration strategy additionally speeds up both the target and draft models.
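
A minimal sketch of the entropy-level gate; the threshold value is a placeholder rather than the paper's calibrated setting:

```python
import torch

def entropy_gate(target_logits, threshold=2.0):
    """True where the target distribution is high-entropy, i.e. where a
    mismatched draft token may still be semantically valid and judgment
    should be deferred to the token-level window."""
    probs = torch.softmax(target_logits, dim=-1)
    entropy = -(probs * torch.log(probs + 1e-9)).sum(dim=-1)
    return entropy > threshold
```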

Result: Achieves average 2.81x speedup on Llama-3.1-70B-Instruct and 5.07x on 405B variant while preserving >99% accuracy. Outperforms training-based EAGLE-3 by 1.62x on out-of-domain datasets.

Conclusion: FLy’s training-free design enables seamless composition with arbitrary draft-target pairs, generalizes across models/domains without hyperparameter tuning, and effectively addresses OOD performance issues while maintaining high accuracy.

Abstract: Large language models (LLMs) achieve strong performance across diverse tasks but suffer from high inference latency due to their autoregressive generation. Speculative Decoding (SPD) mitigates this issue by verifying candidate tokens in parallel from a smaller draft model, yet its strict exact-match verification discards many semantically valid continuations. Moreover, existing training-based SPD methods often suffer from performance degradation on out-of-distribution (OOD) tasks. To this end, we propose Training-Free Loosely Speculative Decoding (FLy), a novel method that loosens the rigid verification criterion by leveraging the target model’s self-corrective behavior to judge whether a draft-target mismatch remains semantically valid. FLy introduces a two-tier mechanism: an entropy-level gate that identifies whether the current token allows multiple plausible alternatives or is nearly deterministic, and a token-level deferred window that distinguishes genuine errors from differently worded yet semantically correct variants. To further reduce latency, we design a multi-level acceleration strategy that accelerates not only the target model but also the drafter itself. Owing to its training-free design, FLy composes seamlessly with arbitrary draft-target pairs and generalizes across models and domains without hyperparameter re-tuning. Experiments show that FLy preserves more than 99% of the target model’s accuracy while achieving an average 2.81x speedup on Llama-3.1-70B-Instruct and 5.07x speedup on the 405B variant. Notably, on out-of-domain datasets, our method remains highly effective and outperforms the training-based method EAGLE-3 by 1.62x.

[49] Characterizing Language Use in a Collaborative Situated Game

Nicholas Tomlin, Naitian Zhou, Eve Fleisig, Liangyuan Chen, Téa Wright, Lauren Vinh, Laura X. Ma, Seun Eisape, Ellie French, Tingting Du, Tianjiao Zhang, Alexander Koller, Alane Suhr

Main category: cs.CL

TL;DR: Researchers created the Portal Dialogue Corpus - 11.5 hours of spoken dialogue from Portal 2 co-op gameplay, featuring complex collaborative language phenomena rarely found in existing dialogue datasets.

DetailsMotivation: Cooperative video games provide rich language data with complex coordination, communication under uncertainty, and situated problem-solving that existing chitchat or task-oriented dialogue corpora don't capture well.

Method: Collected 11.5 hours of spoken human dialogue from Portal 2 co-op mode, resulting in 24.5K utterances. The corpus includes player videos, audio, transcripts, game state data, and both manual/automatic annotations.

Result: Identified unique linguistic phenomena: complex spatial reference, clarification/repair mechanisms, and ad-hoc convention formation. These features distinguish the corpus from existing dialogue datasets.

Conclusion: The publicly released Portal Dialogue Corpus provides valuable resources for analyzing language in complex, situated, collaborative problem-solving scenarios, filling gaps in current dialogue research resources.

Abstract: Cooperative video games, where multiple participants must coordinate by communicating and reasoning under uncertainty in complex environments, yield a rich source of language data. We collect the Portal Dialogue Corpus: a corpus of 11.5 hours of spoken human dialogue in the co-op mode of the popular Portal 2 virtual puzzle game, comprising 24.5K total utterances. We analyze player language and behavior, identifying a number of linguistic phenomena that rarely appear in most existing chitchat or task-oriented dialogue corpora, including complex spatial reference, clarification and repair, and ad-hoc convention formation. To support future analyses of language use in complex, situated, collaborative problem-solving scenarios, we publicly release the corpus, which comprises player videos, audio, transcripts, game state data, and both manual and automatic annotations of language data.

cs.CV

[50] AREA3D: Active Reconstruction Agent with Unified Feed-Forward 3D Perception and Vision-Language Guidance

Tianling Xu, Shengzhe Gan, Leslie Gu, Yuelei Li, Fangneng Zhan, Hanspeter Pfister

Main category: cs.CV

TL;DR: AREA3D is an active 3D reconstruction agent that uses feed-forward 3D reconstruction models with vision-language guidance to autonomously select optimal viewpoints, achieving state-of-the-art accuracy with fewer views.

DetailsMotivation: Existing active reconstruction methods rely on hand-crafted geometric heuristics that often lead to redundant observations without significantly improving reconstruction quality. There's a need for more intelligent viewpoint selection that goes beyond purely geometric cues.

Method: AREA3D decouples view-uncertainty modeling from feed-forward 3D reconstruction models, enabling precise uncertainty estimation without expensive online optimization. It integrates vision-language models to provide high-level semantic guidance for selecting informative and diverse viewpoints beyond geometric cues.
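
A rough sketch of how uncertainty and semantic guidance could be combined when scoring candidate viewpoints; the additive combination and weight are assumptions, not AREA3D's actual scoring rule:

```python
def select_next_view(candidates, uncertainty_fn, vlm_score_fn, alpha=0.5):
    """Pick the viewpoint maximizing uncertainty plus VLM-derived semantics."""
    return max(
        candidates,
        key=lambda view: uncertainty_fn(view) + alpha * vlm_score_fn(view),
    )
```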

Result: Extensive experiments on both scene-level and object-level benchmarks show that AREA3D achieves state-of-the-art reconstruction accuracy, particularly in the sparse-view regime where fewer observations are available.

Conclusion: AREA3D demonstrates that combining feed-forward 3D reconstruction models with vision-language guidance enables more efficient and accurate active reconstruction, reducing redundancy while improving quality, especially with limited views.

Abstract: Active 3D reconstruction enables an agent to autonomously select viewpoints to efficiently obtain accurate and complete scene geometry, rather than passively reconstructing scenes from pre-collected images. However, existing active reconstruction methods often rely on hand-crafted geometric heuristics, which can lead to redundant observations without substantially improving reconstruction quality. To address this limitation, we propose AREA3D, an active reconstruction agent that leverages feed-forward 3D reconstruction models and vision-language guidance. Our framework decouples view-uncertainty modeling from the underlying feed-forward reconstructor, enabling precise uncertainty estimation without expensive online optimization. In addition, an integrated vision-language model provides high-level semantic guidance, encouraging informative and diverse viewpoints beyond purely geometric cues. Extensive experiments on both scene-level and object-level benchmarks demonstrate that AREA3D achieves state-of-the-art reconstruction accuracy, particularly in the sparse-view regime. Code will be made available at: https://github.com/TianlingXu/AREA3D .

[51] Breaking Scale Anchoring: Frequency Representation Learning for Accurate High-Resolution Inference from Low-Resolution Training

Wenshuo Wang, Fan Zhang

Main category: cs.CV

TL;DR: The paper identifies “Scale Anchoring” as a fundamental problem in zero-shot super-resolution spatiotemporal forecasting where models trained on low-resolution data fail to reduce error when deployed on higher resolutions, and proposes Frequency Representation Learning to address this issue.

DetailsMotivation: Existing approaches incorrectly interpret maintaining similar error across resolutions as successful generalization, but models serving as alternatives to numerical solvers should actually reduce error as resolution increases. The fundamental limitation is that low-resolution data has constrained Nyquist frequencies, preventing models from processing unseen high-frequency components during high-resolution inference.

Method: Proposes architecture-agnostic Frequency Representation Learning (FRL) that alleviates Scale Anchoring through resolution-aligned frequency representations and spectral consistency training. FRL ensures that on grids with higher Nyquist frequencies, the frequency response in high-frequency bands is more stable.
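
A minimal sketch of a spectral consistency penalty; the centered low-frequency cutoff and L1 magnitude penalty are illustrative assumptions, not the paper's exact loss:

```python
import torch

def spectral_consistency_loss(pred, target, keep=0.25):
    """Penalize magnitude-spectrum mismatch outside the central low-frequency box."""
    P = torch.fft.fftshift(torch.fft.fft2(pred), dim=(-2, -1)).abs()
    T = torch.fft.fftshift(torch.fft.fft2(target), dim=(-2, -1)).abs()
    h, w = P.shape[-2:]
    mask = torch.ones(h, w, dtype=torch.bool, device=pred.device)
    ch, cw = int(h * keep), int(w * keep)
    mask[h // 2 - ch : h // 2 + ch, w // 2 - cw : w // 2 + cw] = False
    return (P[..., mask] - T[..., mask]).abs().mean()
```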

Result: FRL-enhanced variants allow errors to decrease with resolution and significantly outperform baselines within the task and resolution range, while incurring only modest computational overhead.

Conclusion: The paper identifies Scale Anchoring as a distinct problem from existing issues in zero-shot super-resolution forecasting and demonstrates that Frequency Representation Learning effectively addresses this problem by enabling models to properly generalize across resolutions with decreasing error.

Abstract: Zero-Shot Super-Resolution Spatiotemporal Forecasting requires a deep learning model to be trained on low-resolution data and deployed for inference on high-resolution. Existing studies consider maintaining similar error across different resolutions as indicative of successful multi-resolution generalization. However, deep learning models serving as alternatives to numerical solvers should reduce error as resolution increases. The fundamental limitation is, the upper bound of physical law frequencies that low-resolution data can represent is constrained by its Nyquist frequency, making it difficult for models to process signals containing unseen frequency components during high-resolution inference. This results in errors being anchored at low resolution, incorrectly interpreted as successful generalization. We define this fundamental phenomenon as a new problem distinct from existing issues: Scale Anchoring. Therefore, we propose architecture-agnostic Frequency Representation Learning. It alleviates Scale Anchoring through resolution-aligned frequency representations and spectral consistency training: on grids with higher Nyquist frequencies, the frequency response in high-frequency bands of FRL-enhanced variants is more stable. This allows errors to decrease with resolution and significantly outperform baselines within our task and resolution range, while incurring only modest computational overhead.

[52] InvarDiff: Cross-Scale Invariance Caching for Accelerated Diffusion Models

Zihao Wu

Main category: cs.CV

TL;DR: InvarDiff is a training-free acceleration method for diffusion models that exploits temporal invariance across timesteps and layers to cache and reuse computations, achieving 2-3× speedups with minimal quality degradation.

DetailsMotivation: Diffusion models produce high-quality synthesis but suffer from slow inference due to iterative sampling. The authors observed feature invariance in deterministic sampling that could be exploited to accelerate inference without retraining.

Method: InvarDiff computes a binary cache plan matrix through a few deterministic runs, using quantile-based change metrics to identify which modules at which timesteps can be reused. It applies step-first and layer-wise caching guided by this matrix, with re-sampling correction to prevent drift when consecutive caches occur.
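
A sketch of how the binary cache plan could be derived from recorded change magnitudes; the relative-change metric and global quantile are assumptions about the paper's quantile-based change metrics:

```python
import torch

def build_cache_plan(changes, q=0.3):
    """changes: (T, L) per-step, per-module change magnitudes, e.g.
    ||f_t - f_{t-1}|| / ||f_{t-1}|| averaged over a few deterministic runs.
    Returns a boolean plan where True means 'reuse the cached output'."""
    threshold = torch.quantile(changes.flatten(), q)
    plan = changes <= threshold
    plan[0, :] = False  # always compute the first timestep
    return plan
```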

Result: The method achieves 2-3× end-to-end speed-ups on DiT and FLUX models with minimal impact on standard quality metrics. Visual quality shows almost no degradation compared to full computations.

Conclusion: InvarDiff successfully exploits temporal invariance in diffusion models to accelerate inference without training, demonstrating significant speed improvements while maintaining synthesis quality.

Abstract: Diffusion models deliver high-fidelity synthesis but remain slow due to iterative sampling. We empirically observe there exists feature invariance in deterministic sampling, and present InvarDiff, a training-free acceleration method that exploits the relative temporal invariance across timestep-scale and layer-scale. From a few deterministic runs, we compute a per-timestep, per-layer, per-module binary cache plan matrix and use a re-sampling correction to avoid drift when consecutive caches occur. Using quantile-based change metrics, this matrix specifies which module at which step is reused rather than recomputed. The same invariance criterion is applied at the step scale to enable cross-timestep caching, deciding whether an entire step can reuse cached results. During inference, InvarDiff performs step-first and layer-wise caching guided by this matrix. When applied to DiT and FLUX, our approach reduces redundant compute while preserving fidelity. Experiments show that InvarDiff achieves $2$-$3\times$ end-to-end speed-ups with minimal impact on standard quality metrics. Qualitatively, we observe almost no degradation in visual quality compared with full computations.

[53] Fine-tuning an ECG Foundation Model to Predict Coronary CT Angiography Outcomes

Yujie Xiao, Gongzhen Tang, Deyun Zhang, Jun Li, Guangkun Nie, Haoyu Wang, Shun Huang, Tong Liu, Qinghao Zhao, Kangyin Chen, Shenda Hong

Main category: cs.CV

TL;DR: AI-ECG model predicts severe coronary artery stenosis with good performance, validated internally and externally, showing robustness across subgroups and providing interpretable waveform insights.

DetailsMotivation: Coronary CT angiography (CCTA) has limitations for large-scale CAD screening due to equipment requirements, radiation exposure, and patient cooperation needs. AI-ECG offers a promising alternative for non-invasive CAD detection.

Method: Developed an interpretable AI-ECG model to predict severe or complete stenosis in four major coronary arteries (RCA, LM, LAD, LCX) using electrocardiography data, validated on internal and external datasets.

Result: Model achieved AUCs of 0.794-0.818 on internal validation and 0.667-0.971 on external validation for different arteries. Performance remained stable in normal-ECG subsets and across demographic/clinical subgroups. Interpretability analysis revealed distinct waveform differences between risk groups.

Conclusion: AI-ECG shows promise as a scalable screening tool for coronary artery stenosis, with good predictive performance, robustness across populations, and interpretable insights into ECG correlates of coronary disease.

Abstract: Coronary artery disease (CAD) remains a major global health burden. Accurate identification of the culprit vessel and assessment of stenosis severity are essential for guiding individualized therapy. Although coronary CT angiography (CCTA) is the first-line non-invasive modality for CAD diagnosis, its dependence on high-end equipment, radiation exposure, and strict patient cooperation limits large-scale use. With advances in artificial intelligence (AI) and the widespread availability of electrocardiography (ECG), AI-ECG offers a promising alternative for CAD screening. In this study, we developed an interpretable AI-ECG model to predict severe or complete stenosis of the four major coronary arteries on CCTA. On the internal validation set, the model’s AUCs for the right coronary artery (RCA), left main coronary artery (LM), left anterior descending artery (LAD), and left circumflex artery (LCX) were 0.794, 0.818, 0.744, and 0.755, respectively; on the external validation set, the AUCs reached 0.749, 0.971, 0.667, and 0.727, respectively. Performance remained stable in a clinically normal-ECG subset, indicating robustness beyond overt ECG abnormalities. Subgroup analyses across demographic and acquisition-time strata further confirmed model stability. Risk stratification based on vessel-specific incidence thresholds showed consistent separation on calibration and cumulative event curves. Interpretability analyses revealed distinct waveform differences between high- and low-risk groups, highlighting key electrophysiological regions contributing to model decisions and offering new insights into the ECG correlates of coronary stenosis.

[54] ChromouVQA: Benchmarking Vision-Language Models under Chromatic Camouflaged Images

Yunfei Zhang, Yizhuo He, Yuanxun Shao, Zhengtao Yao, Haoyan Xu, Junhao Dong, Zhen Yao, Zhikang Dong

Main category: cs.CV

TL;DR: ChromouVQA is a new benchmark using Ishihara-style chromatic camouflaged images to test VLMs’ ability to perform figure-ground segregation in cluttered backgrounds across nine vision-question-answering tasks.

DetailsMotivation: Vision-Language Models struggle with figure-ground segregation when targets are embedded in cluttered backgrounds, creating a need for controlled benchmarks to evaluate and improve this capability.

Method: Created a large-scale benchmark using Ishihara-style chromatic camouflaged images with varied parameters (chromatic separation, density, size, occlusion, rotation) and proposed a model-agnostic contrastive recipe aligning silhouettes with camouflaged renderings.

Result: Evaluations show large performance gaps between humans and VLMs, especially under subtle chromatic contrast or disruptive geometric fills. The proposed contrastive method improves global shape recovery.

Conclusion: ChromouVQA provides a compact, controlled benchmark for reproducible evaluation and extension of VLMs’ figure-ground segregation capabilities in challenging camouflaged scenarios.

Abstract: Vision-Language Models (VLMs) have advanced multimodal understanding, yet still struggle when targets are embedded in cluttered backgrounds requiring figure-ground segregation. To address this, we introduce ChromouVQA, a large-scale, multi-task benchmark based on Ishihara-style chromatic camouflaged images. We extend classic dot plates with multiple fill geometries and vary chromatic separation, density, size, occlusion, and rotation, recording full metadata for reproducibility. The benchmark covers nine vision-question-answering tasks, including recognition, counting, comparison, and spatial reasoning. Evaluations of humans and VLMs reveal large gaps, especially under subtle chromatic contrast or disruptive geometric fills. We also propose a model-agnostic contrastive recipe aligning silhouettes with their camouflaged renderings, improving recovery of global shapes. ChromouVQA provides a compact, controlled benchmark for reproducible evaluation and extension. Code and dataset are available at https://github.com/Chromou-VQA-Benchmark/Chromou-VQA.

[55] Spatiotemporal Satellite Image Downscaling with Transfer Encoders and Autoregressive Generative Models

Yang Xiang, Jingwen Zhong, Yige Yan, Petros Koutrakis, Eric Garshick, Meredith Franklin

Main category: cs.CV

TL;DR: A transfer-learning generative downscaling framework using U-Net encoder + diffusion model to reconstruct high-resolution satellite images from coarse inputs, validated on NASA MERRA-2 (50km) to GEOS-5 (7km) data over Asia.

DetailsMotivation: To develop a robust method for reconstructing fine-resolution satellite images from coarse-scale inputs, enabling improved environmental exposure assessment and long-term monitoring with limited high-resolution training data.

Method: Combines lightweight U-Net transfer encoder (pretrained on coarse data) with diffusion-based generative model. U-Net encoder learns spatiotemporal representations from long time series, then frozen and transferred as physically meaningful latent features to larger downscaling model.
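
The freeze-and-transfer step, reduced to a sketch; the attribute names are placeholders for whatever the actual architecture exposes:

```python
import copy

def transfer_frozen_encoder(pretrained_unet, downscaling_model):
    """Copy the pretrained U-Net encoder, freeze it, and hand its features
    to the larger diffusion-based downscaler as conditioning context."""
    encoder = copy.deepcopy(pretrained_unet.encoder)
    for p in encoder.parameters():
        p.requires_grad = False
    downscaling_model.context_encoder = encoder
    return downscaling_model
```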

Result: Achieved excellent performance (R2 = 0.65 to 0.94) across seasonal regional splits, outperforming deterministic U-Nets, variational autoencoders, and prior transfer learning baselines. Predicted images preserved physically consistent spatial variability and temporal autocorrelation.

Conclusion: Transfer-enhanced diffusion models provide robust, physically coherent solution for downscaling long time series of coarse resolution images with limited training periods, enabling stable autoregressive reconstruction beyond available high-resolution records.

Abstract: We present a transfer-learning generative downscaling framework to reconstruct fine resolution satellite images from coarse scale inputs. Our approach combines a lightweight U-Net transfer encoder with a diffusion-based generative model. The simpler U-Net is first pretrained on a long time series of coarse resolution data to learn spatiotemporal representations; its encoder is then frozen and transferred to a larger downscaling model as physically meaningful latent features. Our application uses NASA’s MERRA-2 reanalysis as the low resolution source domain (50 km) and the GEOS-5 Nature Run (G5NR) as the high resolution target (7 km). Our study area included a large area in Asia, which was made computationally tractable by splitting into two subregions and four seasons. We conducted domain similarity analysis using Wasserstein distances confirmed minimal distributional shift between MERRA-2 and G5NR, validating the safety of parameter frozen transfer. Across seasonal regional splits, our model achieved excellent performance (R2 = 0.65 to 0.94), outperforming comparison models including deterministic U-Nets, variational autoencoders, and prior transfer learning baselines. Out of data evaluations using semivariograms, ACF/PACF, and lag-based RMSE/R2 demonstrated that the predicted downscaled images preserved physically consistent spatial variability and temporal autocorrelation, enabling stable autoregressive reconstruction beyond the G5NR record. These results show that transfer enhanced diffusion models provide a robust and physically coherent solution for downscaling a long time series of coarse resolution images with limited training periods. This advancement has significant implications for improving environmental exposure assessment and long term environmental monitoring.

[56] FlowEO: Generative Unsupervised Domain Adaptation for Earth Observation

Georges Le Bellier, Nicolas Audebert

Main category: cs.CV

TL;DR: FlowEO: A flow-matching framework for unsupervised domain adaptation in Earth observation that translates images between domains while preserving semantics for improved classification/segmentation.

DetailsMotivation: Earth observation data is heterogeneous due to different sensors, regions, times, and conditions, causing distribution shifts that limit model generalization. Unsupervised domain adaptation is crucial for real-world remote sensing applications.

Method: FlowEO uses flow matching to learn a semantically preserving mapping that transports images from source to target distribution. This generative approach handles challenging domain adaptation scenarios for classification and semantic segmentation.
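
A standard conditional flow-matching objective over paired source/target images, as a rough stand-in for FlowEO's training loss; the linear interpolation path and image pairing are assumptions:

```python
import torch
import torch.nn.functional as F

def flow_matching_loss(v_theta, x_src, x_tgt):
    """Regress the model's velocity field onto the straight-line displacement
    between a source-domain image and its target-domain counterpart."""
    t = torch.rand(x_src.size(0), 1, 1, 1, device=x_src.device)
    x_t = (1 - t) * x_src + t * x_tgt   # point on the linear path
    target_velocity = x_tgt - x_src     # constant velocity along it
    return F.mse_loss(v_theta(x_t, t.flatten()), target_velocity)
```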

Result: Extensive experiments across four datasets show FlowEO outperforms existing image translation approaches for domain adaptation while achieving comparable or better perceptual image quality.

Conclusion: FlowEO demonstrates the potential of flow-matching-based UDA for remote sensing, effectively handling domain shifts like SAR-to-optical translation and temporal/semantic shifts from natural disasters.

Abstract: The increasing availability of Earth observation data offers unprecedented opportunities for large-scale environmental monitoring and analysis. However, these datasets are inherently heterogeneous, stemming from diverse sensors, geographical regions, acquisition times, and atmospheric conditions. Distribution shifts between training and deployment domains severely limit the generalization of pretrained remote sensing models, making unsupervised domain adaptation (UDA) crucial for real-world applications. We introduce FlowEO, a novel framework that leverages generative models for image-space UDA in Earth observation. We leverage flow matching to learn a semantically preserving mapping that transports from the source to the target image distribution. This allows us to tackle challenging domain adaptation configurations for classification and semantic segmentation of Earth observation images. We conduct extensive experiments across four datasets covering adaptation scenarios such as SAR to optical translation and temporal and semantic shifts caused by natural disasters. Experimental results demonstrate that FlowEO outperforms existing image translation approaches for domain adaptation while achieving on-par or better perceptual image quality, highlighting the potential of flow-matching-based UDA for remote sensing.

[57] Self-Improving VLM Judges Without Human Annotations

Inna Wanyin Lin, Yushi Hu, Shuyue Stella Li, Scott Geng, Pang Wei Koh, Luke Zettlemoyer, Tim Althoff, Marjan Ghazvininejad

Main category: cs.CV

TL;DR: Self-training framework for VLM judges without human annotations, using iterative synthetic data generation and reasoning-based filtering to create effective multimodal evaluators.

DetailsMotivation: Current VLM judge training requires costly human preference annotations that quickly become outdated as models improve rapidly, creating a need for scalable, annotation-free methods.

Method: Three-stage iterative framework: (1) generate diverse multimodal instruction-response pairs at varying quality levels, (2) generate reasoning traces and judgments, filtering mismatched pairs, (3) train on correct judge answers with reasoning traces.
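
Stage (2), the reasoning-based filter, as a schematic; the `judge` interface returning a verdict and a trace is a placeholder:

```python
def filter_judgments(examples, judge):
    """Keep only pairs where the judge's verdict matches the intended quality
    ordering used to synthesize them; mismatches are discarded."""
    kept = []
    for ex in examples:
        # judge(prompt, response_a, response_b) -> ("A" or "B", reasoning trace)
        verdict, trace = judge(ex["prompt"], ex["better"], ex["worse"])
        if verdict == "A":  # judge agrees the intended-better response wins
            kept.append({**ex, "trace": trace})
    return kept  # train on these correct judgments plus their traces
```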

Result: Improves Llama-3.2-11B multimodal judge from 0.38 to 0.51 overall accuracy on VL-RewardBench, often outperforming larger models (Llama-3.2-90B, GPT-4o, Claude 3.5 Sonnet), with strong gains in general, hallucination, and reasoning dimensions.

Conclusion: Human-annotation-free self-training shows strong potential for creating VLM judges that can evolve alongside rapidly improving model capabilities, offering a scalable alternative to costly human preference annotations.

Abstract: Effective judges of Vision-Language Models (VLMs) are crucial for model development. Current methods for training VLM judges mainly rely on large-scale human preference annotations. However, such an approach is costly, and the annotations easily become obsolete as models rapidly improve. In this work, we present a framework to self-train a VLM judge model without any human preference annotations, using only self-synthesized data. Our method is iterative and has three stages: (1) generate diverse multimodal instruction-response pairs at varying quality levels, (2) generate reasoning traces and judgments for each pair, removing the ones that do not match our expected quality levels, and (3) training on correct judge answers and their reasoning traces. We evaluate the resulting judge on Multimodal RewardBench and VL-RewardBench across domains: correctness, preference, reasoning, safety, and visual question-answering. Our method improves a Llama-3.2-11B multimodal judge from 0.38 to 0.51 in overall accuracy on VL-RewardBench, often outperforming much larger models including Llama-3.2-90B, GPT-4o, and Claude 3.5 Sonnet, with particularly strong gains in general, hallucination, and reasoning dimensions. The overall strength of these human-annotation-free results suggest the potential for a future self-judge that evolves alongside rapidly improving VLM capabilities.

[58] TwinFlow: Realizing One-step Generation on Large Models with Self-adversarial Flows

Zhenglin Cheng, Peng Sun, Jianguo Li, Tao Lin

Main category: cs.CV

TL;DR: TwinFlow: A simple framework for training 1-step generative models without fixed teacher models or adversarial networks, achieving 100x speedup with minimal quality loss.

DetailsMotivation: Current multi-modal generative models use multi-step frameworks (40-100 NFEs) that are inefficient. Existing few-step acceleration methods have limitations: distillation methods require iterative procedures or degrade at very few steps (<4-NFE), while adversarial training methods introduce instability, complexity, and high memory overhead.

Method: TwinFlow is a simple framework that bypasses the need for fixed pretrained teacher models and avoids standard adversarial networks during training, making it well suited to building large-scale, efficient 1-step generators.

Result: On text-to-image tasks: Achieves GenEval score of 0.83 in 1-NFE, outperforming SANA-Sprint and RCGM baselines. Full-parameter training on Qwen-Image-20B transforms it into an efficient few-step generator. With just 1-NFE, matches original 100-NFE model performance on GenEval and DPG-Bench benchmarks, reducing computational cost by 100x with minor quality degradation.

Conclusion: TwinFlow provides an effective solution for training efficient 1-step generative models without the limitations of existing distillation or adversarial training approaches, enabling significant computational savings while maintaining performance.

Abstract: Recent advances in large multi-modal generative models have demonstrated impressive capabilities in multi-modal generation, including image and video generation. These models are typically built upon multi-step frameworks like diffusion and flow matching, which inherently limits their inference efficiency (requiring 40-100 Number of Function Evaluations (NFEs)). While various few-step methods aim to accelerate the inference, existing solutions have clear limitations. Prominent distillation-based methods, such as progressive and consistency distillation, either require an iterative distillation procedure or show significant degradation at very few steps (< 4-NFE). Meanwhile, integrating adversarial training into distillation (e.g., DMD/DMD2 and SANA-Sprint) to enhance performance introduces training instability, added complexity, and high GPU memory overhead due to the auxiliary trained models. To this end, we propose TwinFlow, a simple yet effective framework for training 1-step generative models that bypasses the need of fixed pretrained teacher models and avoids standard adversarial networks during training, making it ideal for building large-scale, efficient models. On text-to-image tasks, our method achieves a GenEval score of 0.83 in 1-NFE, outperforming strong baselines like SANA-Sprint (a GAN loss-based framework) and RCGM (a consistency-based framework). Notably, we demonstrate the scalability of TwinFlow by full-parameter training on Qwen-Image-20B and transform it into an efficient few-step generator. With just 1-NFE, our approach matches the performance of the original 100-NFE model on both the GenEval and DPG-Bench benchmarks, reducing computational cost by $100\times$ with minor quality degradation. Project page is available at https://zhenglin-cheng.com/twinflow.

[59] EFDiT: Efficient Fine-grained Image Generation Using Diffusion Transformer Models

Kun Wang, Donglin Di, Tonghua Su, Lei Fan

Main category: cs.CV

TL;DR: A diffusion model approach for fine-grained image generation using tiered embeddings, super-resolution enhancement, and ProAttention mechanism to address semantic entanglement and detail insufficiency.

DetailsMotivation: Current class-conditional diffusion models struggle with fine-grained image generation, facing issues of semantic information entanglement between categories and insufficient detail in generated images, especially for less common categories.

Method: Three key innovations: 1) Tiered embedder integrating semantic information from both super and child classes to reduce semantic entanglement, 2) Super-resolution during perceptual information generation stage to enhance fine-grained details, and 3) ProAttention mechanism for efficient implementation in diffusion models.
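
A minimal sketch of innovation 1, the tiered embedder; fusing the two levels by addition plus a projection is an assumption, not necessarily the paper's design:

```python
import torch
import torch.nn as nn

class TieredEmbedder(nn.Module):
    """Combine super-class and child-class embeddings into one condition vector."""
    def __init__(self, n_super, n_child, dim):
        super().__init__()
        self.super_emb = nn.Embedding(n_super, dim)
        self.child_emb = nn.Embedding(n_child, dim)
        self.proj = nn.Linear(dim, dim)

    def forward(self, super_id, child_id):
        return self.proj(self.super_emb(super_id) + self.child_emb(child_id))
```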

Result: The method outperforms other state-of-the-art fine-tuning approaches on public benchmarks, demonstrating superior performance in fine-grained image generation tasks.

Conclusion: The proposed approach effectively addresses key challenges in fine-grained image generation using diffusion models, providing better semantic disentanglement and enhanced detail through tiered embeddings, super-resolution enhancement, and efficient attention mechanisms.

Abstract: Diffusion models are highly regarded for their controllability and the diversity of images they generate. However, class-conditional generation methods based on diffusion models often focus on more common categories. In large-scale fine-grained image generation, issues of semantic information entanglement and insufficient detail in the generated images still persist. This paper attempts to introduce a concept of a tiered embedder in fine-grained image generation, which integrates semantic information from both super and child classes, allowing the diffusion model to better incorporate semantic information and address the issue of semantic entanglement. To address the issue of insufficient detail in fine-grained images, we introduce the concept of super-resolution during the perceptual information generation stage, enhancing the detailed features of fine-grained images through enhancement and degradation models. Furthermore, we propose an efficient ProAttention mechanism that can be effectively implemented in the diffusion model. We evaluate our method through extensive experiments on public benchmarks, demonstrating that our approach outperforms other state-of-the-art fine-tuning methods in terms of performance.

[60] Semore: VLM-guided Enhanced Semantic Motion Representations for Visual Reinforcement Learning

Wentao Wang, Chunyang Liu, Kehua Sheng, Bo Zhang, Yan Wang

Main category: cs.CV

TL;DR: Semore is a VLM-based framework for visual RL that extracts both semantic and motion representations from RGB flows using a dual-path backbone, guided by VLM knowledge and CLIP alignment for better feature learning.

DetailsMotivation: Existing LLM-based RL methods focus mainly on policy guidance and suffer from limited backbone network representations. The authors aim to enhance visual RL by leveraging VLM's common-sense knowledge to extract richer semantic and motion representations from observations.

Method: Semore uses a dual-path backbone to extract semantic and motion representations from RGB flows. It employs VLM with common-sense knowledge to retrieve key information from observations and uses pre-trained CLIP for text-image alignment to embed ground-truth representations. The method adopts separate supervision to guide semantic and motion extraction while allowing spontaneous interaction between them.

Result: Extensive experiments show that, with VLM guidance at the feature level, Semore is more efficient and adaptive than state-of-the-art methods.

Conclusion: The proposed Semore framework successfully addresses the limited representation problem in LLM-based RL by leveraging VLM knowledge for enhanced semantic and motion feature extraction, demonstrating superior performance in visual reinforcement learning tasks.

Abstract: The growing exploration of Large Language Models (LLM) and Vision-Language Models (VLM) has opened avenues for enhancing the effectiveness of reinforcement learning (RL). However, existing LLM-based RL methods often focus on the guidance of control policy and encounter the challenge of limited representations of the backbone networks. To tackle this problem, we introduce Enhanced Semantic Motion Representations (Semore), a new VLM-based framework for visual RL, which can simultaneously extract semantic and motion representations through a dual-path backbone from the RGB flows. Semore utilizes VLM with common-sense knowledge to retrieve key information from observations, while using the pre-trained clip to achieve the text-image alignment, thereby embedding the ground-truth representations into the backbone. To efficiently fuse semantic and motion representations for decision-making, our method adopts a separately supervised approach to simultaneously guide the extraction of semantics and motion, while allowing them to interact spontaneously. Extensive experiments demonstrate that, under the guidance of VLM at the feature level, our method exhibits efficient and adaptive ability compared to state-of-art methods. All codes are released.

[61] Your Latent Mask is Wrong: Pixel-Equivalent Latent Compositing for Diffusion Models

Rowan Bradbury, Dazhi Zhong

Main category: cs.CV

TL;DR: PELC enables pixel-equivalent latent compositing for diffusion models, fixing artifacts from linear VAE latent interpolation with a lightweight transformer (DecFormer) that matches pixel-space alpha compositing.

DetailsMotivation: Current latent inpainting relies on linearly interpolating VAE latents under downsampled masks, causing artifacts at seams, global degradation, and color shifts because VAEs capture global context beyond patch-aligned local structure.

Method: Propose Pixel-Equivalent Latent Compositing (PELC) principle and implement DecFormer, a 7.7M-parameter transformer that predicts per-channel blend weights and residual correction for mask-consistent latent fusion matching pixel-space alpha compositing.
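
The fusion rule, as described, reduces to a per-channel blend plus residual; a sketch, with the latent shapes and sigmoid squashing as assumptions:

```python
import torch

def pelc_fuse(z_fg, z_bg, blend_logits, residual):
    """z = w * z_fg + (1 - w) * z_bg + r, with per-channel weights w.
    Latents are assumed (B, C, H, W); sigmoid keeps w in [0, 1]."""
    w = torch.sigmoid(blend_logits)
    return w * z_fg + (1 - w) * z_bg + residual
```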

Result: DecFormer restores global color consistency, soft-mask support, sharp boundaries, reduces edge errors by up to 53%, adds only 0.07% parameters and 3.5% FLOP overhead, and enables inpainting fidelity comparable to fully finetuned models.

Conclusion: PELC enables pixel-equivalent latent editing, fixing fundamental issues with linear latent blending in diffusion models while maintaining compatibility and efficiency, with applications beyond inpainting to complex editing tasks.

Abstract: Latent inpainting in diffusion models still relies almost universally on linearly interpolating VAE latents under a downsampled mask. We propose a key principle for compositing image latents: Pixel-Equivalent Latent Compositing (PELC). An equivalent latent compositor should be the same as compositing in pixel space. This principle enables full-resolution mask control and true soft-edge alpha compositing, even though VAEs compress images 8x spatially. Modern VAEs capture global context beyond patch-aligned local structure, so linear latent blending cannot be pixel-equivalent: it produces large artifacts at mask seams and global degradation and color shifts. We introduce DecFormer, a 7.7M-parameter transformer that predicts per-channel blend weights and an off-manifold residual correction to realize mask-consistent latent fusion. DecFormer is trained so that decoding after fusion matches pixel-space alpha compositing, is plug-compatible with existing diffusion pipelines, requires no backbone finetuning and adds only 0.07% of FLUX.1-Dev’s parameters and 3.5% FLOP overhead. On the FLUX.1 family, DecFormer restores global color consistency, soft-mask support, sharp boundaries, and high-fidelity masking, reducing error metrics around edges by up to 53% over standard mask interpolation. Used as an inpainting prior, a lightweight LoRA on FLUX.1-Dev with DecFormer achieves fidelity comparable to FLUX.1-Fill, a fully finetuned inpainting model. While we focus on inpainting, PELC is a general recipe for pixel-equivalent latent editing, as we demonstrate on a complex color-correction task.

[62] A Strong View-Free Baseline Approach for Single-View Image Guided Point Cloud Completion

Fangzhou Lin, Zilin Dai, Rigved Sanku, Songlin Hou, Kazunori D Yamada, Haichong K. Zhang, Ziming Zhang

Main category: cs.CV

TL;DR: A view-free point cloud completion method outperforms single-view image guided approaches, questioning the necessity of image guidance in multimodal completion tasks.

DetailsMotivation: To investigate whether image guidance is truly necessary for point cloud completion, challenging the assumption that multimodal approaches are inherently superior.

Method: Attention-based multi-branch encoder-decoder network using only partial point clouds, with hierarchical self-fusion mechanism combining cross-attention and self-attention layers.

Result: The view-free framework outperforms state-of-the-art single-view image guided point cloud completion methods on ShapeNet-ViPC dataset.

Conclusion: Image guidance may not be essential for point cloud completion, providing new insights for multimodal learning development in this domain.

Abstract: The single-view image guided point cloud completion (SVIPC) task aims to reconstruct a complete point cloud from a partial input with the help of a single-view image. While previous works have demonstrated the effectiveness of this multimodal approach, the fundamental necessity of image guidance remains largely unexamined. To explore this, we propose a strong baseline approach for SVIPC based on an attention-based multi-branch encoder-decoder network that only takes partial point clouds as input, view-free. Our hierarchical self-fusion mechanism, driven by cross-attention and self-attention layers, effectively integrates information across multiple streams, enriching feature representations and strengthening the networks ability to capture geometric structures. Extensive experiments and ablation studies on the ShapeNet-ViPC dataset demonstrate that our view-free framework performs superiorly to state-of-the-art SVIPC methods. We hope our findings provide new insights into the development of multimodal learning in SVIPC. Our demo code will be available at https://github.com/Zhang-VISLab.

[63] DEAR: Dataset for Evaluating the Aesthetics of Rendering

Vsevolod Plohotnuk, Artyom Panshin, Nikola Banić, Simone Bianco, Michael Freeman, Egor Ershov

Main category: cs.CV

TL;DR: DEAR is the first dataset for evaluating image rendering aesthetics using human preference scores, moving beyond traditional distortion-based quality assessment.

DetailsMotivation: Traditional IQA focuses on technical degradations (noise, blur, compression) but fails to capture aesthetic judgments of rendering styles, which are crucial for photographic editing, content creation, and AI-generated imagery. There's a lack of datasets that reflect subjective style preferences.

Method: Built upon MIT-Adobe FiveK dataset, DEAR incorporates pairwise human preference scores collected via large-scale crowdsourcing. Each image pair was evaluated by 25 distinct human evaluators (13,648 total participants). The dataset captures nuanced, context-sensitive aesthetic preferences.

Result: Created the first systematic dataset for image aesthetics of rendering assessment grounded in subjective human preferences. Published a subset of 100 images with markup on HuggingFace. The dataset enables new tasks like style preference prediction, aesthetic benchmarking, and personalized aesthetic modeling.

Conclusion: DEAR addresses the underexplored domain of rendering aesthetics evaluation by providing a benchmark dataset with human preference annotations, enabling development of models that go beyond traditional distortion-based IQA to capture subjective aesthetic judgments.

Abstract: Traditional Image Quality Assessment (IQA) focuses on quantifying technical degradations such as noise, blur, or compression artifacts, using both full-reference and no-reference objective metrics. However, evaluation of rendering aesthetics, a growing domain relevant to photographic editing, content creation, and AI-generated imagery, remains underexplored due to the lack of datasets that reflect the inherently subjective nature of style preference. In this work, a novel benchmark dataset designed to model human aesthetic judgments of image rendering styles is introduced: the Dataset for Evaluating the Aesthetics of Rendering (DEAR). Built upon the MIT-Adobe FiveK dataset, DEAR incorporates pairwise human preference scores collected via large-scale crowdsourcing, with each image pair evaluated by 25 distinct human evaluators, drawn from 13,648 participants overall. These annotations capture nuanced, context-sensitive aesthetic preferences, enabling the development and evaluation of models that go beyond traditional distortion-based IQA, focusing on a new task: Evaluation of Aesthetics of Rendering (EAR). The data collection pipeline is described, human voting patterns are analyzed, and multiple use cases are outlined, including style preference prediction, aesthetic benchmarking, and personalized aesthetic modeling. To the best of the authors’ knowledge, DEAR is the first dataset to systematically address rendering aesthetics assessment grounded in subjective human preferences. A subset of 100 images with markup is published on HuggingFace (huggingface.co/datasets/vsevolodpl/DEAR).
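DEAR releases raw pairwise votes, so turning them into per-rendering scores requires an aggregation model. One standard choice, assumed here for illustration (the paper does not mandate any particular aggregation), is Bradley-Terry fitting via the MM algorithm:

```python
import numpy as np

def bradley_terry(pairs, iters=200):
    """Fit Bradley-Terry strengths from a vote matrix.
    pairs[i, j] = number of evaluators preferring rendering i over j."""
    n = pairs.shape[0]
    s = np.ones(n)
    for _ in range(iters):
        wins = pairs.sum(axis=1)
        denom = np.array([
            sum((pairs[i, j] + pairs[j, i]) / (s[i] + s[j])
                for j in range(n) if j != i)
            for i in range(n)
        ])
        s = wins / np.maximum(denom, 1e-12)
        s /= s.sum()           # fix the scale, which Bradley-Terry leaves free
    return s
```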

[64] IE2Video: Adapting Pretrained Diffusion Models for Event-Based Video Reconstruction

Dmitrii Torbunov, Onur Okuducu, Yi Huang, Odera Dim, Rebecca Coles, Yonggang Cui, Yihui Ren

Main category: cs.CV

TL;DR: Hybrid capture system combining sparse RGB keyframes with continuous event streams to reconstruct full RGB video offline, reducing power consumption while maintaining standard video output.

DetailsMotivation: Continuous video monitoring faces power constraints with conventional RGB cameras, while event cameras offer low-power sensing but produce non-standard event streams. Need a solution that reduces capture power while maintaining standard RGB video output.

Method: Proposes IE2Video task: reconstruct RGB video from initial frame + event data. Tests two approaches: 1) HyperE2VID (autoregressive model adaptation), 2) LTX diffusion model with event encoders and LoRA adaptation.

Result: Diffusion-based approach achieves 33% better perceptual quality (0.283 vs 0.422 LPIPS) than autoregressive baseline. Validated across three event camera datasets (BS-ERGB, HS-ERGB far/close) at varying sequence lengths (32-128 frames) with robust cross-dataset generalization.

Conclusion: Hybrid capture paradigm effectively reduces power consumption while maintaining video quality, with diffusion-based reconstruction outperforming autoregressive methods, enabling practical low-power video monitoring systems.

Abstract: Continuous video monitoring in surveillance, robotics, and wearable systems faces a fundamental power constraint: conventional RGB cameras consume substantial energy through fixed-rate capture. Event cameras offer sparse, motion-driven sensing with low power consumption, but produce asynchronous event streams rather than RGB video. We propose a hybrid capture paradigm that records sparse RGB keyframes alongside continuous event streams, then reconstructs full RGB video offline – reducing capture power consumption while maintaining standard video output for downstream applications. We introduce the Image and Event to Video (IE2Video) task: reconstructing RGB video sequences from a single initial frame and subsequent event camera data. We investigate two architectural strategies: adapting an autoregressive model (HyperE2VID) for RGB generation, and injecting event representations into a pretrained text-to-video diffusion model (LTX) via learned encoders and low-rank adaptation. Our experiments demonstrate that the diffusion-based approach achieves 33% better perceptual quality than the autoregressive baseline (0.283 vs 0.422 LPIPS). We validate our approach across three event camera datasets (BS-ERGB, HS-ERGB far/close) at varying sequence lengths (32-128 frames), demonstrating robust cross-dataset generalization with strong performance on unseen capture configurations.
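The diffusion branch keeps the LTX backbone frozen and adapts it with low-rank updates. A generic LoRA layer of the kind involved is sketched below; the rank, scaling, and wiring into LTX are illustrative assumptions:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Wrap a frozen linear layer with a trainable low-rank update."""
    def __init__(self, base: nn.Linear, rank=8, alpha=16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                # pretrained weights stay frozen
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))  # zero init: identity at start
        self.scale = alpha / rank

    def forward(self, x):
        return self.base(x) + (x @ self.A.T @ self.B.T) * self.scale
```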

[65] Age-Inclusive 3D Human Mesh Recovery for Action-Preserving Data Anonymization

Georgios Chatzichristodoulou, Niki Efthymiou, Panagiotis Filntisis, Georgios Pavlakos, Petros Maragos

Main category: cs.CV

TL;DR: AionHMR is a framework that enables accurate 3D human shape and pose estimation for all ages (adults, children, infants) using SMPL-A body model and transformer-based architecture, creating privacy-preserving 3D reconstructions.

DetailsMotivation: Existing 3D human pose and shape estimation methods perform well for adults but fail to generalize to children and infants, creating a significant domain gap in age-inclusive human modeling.

Method: 1) Extends top-performing model with SMPL-A body model for age-inclusive modeling; 2) Generates pseudo-ground-truth annotations for child/infant databases; 3) Develops transformer-based deep learning model for real-time 3D reconstruction; 4) Creates 3D-BabyRobot dataset of action-preserving reconstructions.

Result: Significantly improves shape and pose estimation for children and infants without compromising adult accuracy. Produces privacy-preserving 3D meshes that retain action, pose, and geometry information while enabling anonymized dataset release.

Conclusion: Bridges crucial domain gap in age-inclusive 3D human modeling, establishes foundation for privacy-aware, age-diverse human reconstruction, and enables creation of anonymized datasets like 3D-BabyRobot.

Abstract: While three-dimensional (3D) shape and pose estimation is a highly researched area that has yielded significant advances, the resulting methods, despite performing well for the adult population, generally fail to generalize effectively to children and infants. This paper addresses this challenge by introducing AionHMR, a comprehensive framework designed to bridge this domain gap. We propose an optimization-based method that extends a top-performing model by incorporating the SMPL-A body model, enabling the concurrent and accurate modeling of adults, children, and infants. Leveraging this approach, we generated pseudo-ground-truth annotations for publicly available child and infant image databases. Using these new training data, we then developed and trained a specialized transformer-based deep learning model capable of real-time 3D age-inclusive human reconstruction. Extensive experiments demonstrate that our methods significantly improve shape and pose estimation for children and infants without compromising accuracy on adults. Importantly, our reconstructed meshes serve as privacy-preserving substitutes for raw images, retaining essential action, pose, and geometry information while enabling the release of anonymized datasets. As a demonstration, we introduce the 3D-BabyRobot dataset, a collection of action-preserving 3D reconstructions of children interacting with robots. This work bridges a crucial domain gap and establishes a foundation for inclusive, privacy-aware, and age-diverse 3D human modeling.

[66] CARD: Correlation Aware Restoration with Diffusion

Niki Nezakati, Arnab Ghosh, Amit Roy-Chowdhury, Vishwanath Saragadam

Main category: cs.CV

TL;DR: CARD is a training-free diffusion model extension that handles spatially correlated noise in image restoration by whitening observations and using noise-whitened updates, outperforming existing methods on real-world correlated noise.

DetailsMotivation: Most diffusion models assume i.i.d. Gaussian noise, but real-world sensors exhibit spatially correlated noise due to readout mechanisms, limiting practical effectiveness of existing restoration methods.

Method: CARD whitens noisy observations to convert correlated noise into i.i.d. form, then replaces diffusion restoration steps with noise-whitened updates while maintaining DDRM’s closed-form sampling efficiency.

Result: CARD consistently outperforms existing methods across denoising, deblurring, and super-resolution tasks on both synthetic correlated noise benchmarks and the new CIN-D dataset with real rolling-shutter sensor noise.

Conclusion: CARD effectively handles correlated noise in image restoration without retraining, and the CIN-D dataset fills a critical gap for evaluating methods on real-world correlated noise.

Abstract: Denoising diffusion models have achieved state-of-the-art performance in image restoration by modeling the process as sequential denoising steps. However, most approaches assume independent and identically distributed (i.i.d.) Gaussian noise, while real-world sensors often exhibit spatially correlated noise due to readout mechanisms, limiting their practical effectiveness. We introduce Correlation Aware Restoration with Diffusion (CARD), a training-free extension of DDRM that explicitly handles correlated Gaussian noise. CARD first whitens the noisy observation, which converts the noise into an i.i.d. form. Then, the diffusion restoration steps are replaced with noise-whitened updates, which inherit DDRM’s closed-form sampling efficiency while now being able to handle correlated noise. To emphasize the importance of addressing correlated noise, we contribute CIN-D, a novel correlated noise dataset captured across diverse illumination conditions to evaluate restoration methods on real rolling-shutter sensor noise. This dataset fills a critical gap in the literature for experimental evaluation with real-world correlated noise. Experiments on standard benchmarks with synthetic correlated noise and on CIN-D demonstrate that CARD consistently outperforms existing methods across denoising, deblurring, and super-resolution tasks.
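The core whitening step is standard linear algebra: multiply the observation by a square root of the inverse noise covariance so the correlated noise becomes i.i.d. A self-contained NumPy sketch, with an illustrative covariance rather than the paper's sensor model:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 64
i, j = np.meshgrid(np.arange(n), np.arange(n), indexing="ij")
Sigma = 0.1 * np.exp(-np.abs(i - j) / 3.0)   # spatially correlated noise covariance

L = np.linalg.cholesky(Sigma)                # Sigma = L @ L.T
x = rng.standard_normal(n)                   # stand-in clean signal
y = x + L @ rng.standard_normal(n)           # observation with correlated noise

W = np.linalg.inv(L)                         # a valid whitening matrix (W Sigma W.T = I)
y_white = W @ y                              # noise on y_white is now i.i.d. N(0, I)
```

After whitening, the problem matches the i.i.d. assumption that DDRM's closed-form updates rely on, which is what CARD exploits.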

[67] Inferring Compositional 4D Scenes without Ever Seeing One

Ahmet Berke Gokmen, Ajad Chhatkuli, Luc Van Gool, Danda Pani Paudel

Main category: cs.CV

TL;DR: COM4D is a method for reconstructing complete 4D scenes with multiple interacting objects from monocular videos using only static multi-object or dynamic single-object supervision, without requiring 4D compositional training data.

DetailsMotivation: Real-world scenes contain multiple static and dynamic objects, but existing methods focus on single objects or rely on category-specific models, leading to inconsistent scene configurations and limited applicability. There's a need for methods that can capture 4D structures of multiple interacting objects in-the-wild without requiring 4D compositional training data.

Method: COM4D uses a carefully designed training of spatial and temporal attentions on 2D video input. Training is disentangled into learning from object compositions (spatial) and single object dynamics (temporal). At inference, an attention mixing mechanism combines these independently learned attentions without requiring 4D composition examples, alternating between spatial and temporal reasoning.

Result: COM4D reconstructs complete and persistent 4D scenes with multiple interacting objects directly from monocular videos. It achieves state-of-the-art results in both 4D object reconstruction and composed 3D reconstruction despite being purely data-driven.

Conclusion: COM4D provides a novel approach to 4D scene reconstruction that avoids the need for 4D compositional training data by disentangling spatial and temporal learning, enabling consistent reconstruction of multiple interacting objects from monocular videos with strong performance on existing benchmarks.

Abstract: Scenes in the real world are often composed of several static and dynamic objects. Capturing their 4-dimensional structures, composition and spatio-temporal configuration in-the-wild, though extremely interesting, is equally hard. Therefore, existing works often focus on one object at a time, while relying on some category-specific parametric shape model for dynamic objects. This can lead to inconsistent scene configurations, in addition to being limited to the modeled object categories. We propose COM4D (Compositional 4D), a method that consistently and jointly predicts the structure and spatio-temporal configuration of 4D/3D objects using only static multi-object or dynamic single object supervision. We achieve this by a carefully designed training of spatial and temporal attentions on 2D video input. The training is disentangled into learning from object compositions on the one hand, and single object dynamics throughout the video on the other, thus completely avoiding reliance on 4D compositional training data. At inference time, our proposed attention mixing mechanism combines these independently learned attentions, without requiring any 4D composition examples. By alternating between spatial and temporal reasoning, COM4D reconstructs complete and persistent 4D scenes with multiple interacting objects directly from monocular videos. Furthermore, COM4D provides state-of-the-art results in existing separate problems of 4D object and composed 3D reconstruction despite being purely data-driven.

[68] From Segments to Scenes: Temporal Understanding in Autonomous Driving via Vision-Language Model

Kevin Cannons, Saeed Ranjbar Alvar, Mohammad Asiful Hossain, Ahmad Rezaei, Mohsen Gholami, Alireza Heidarikhazaei, Zhou Weimin, Yong Zhang, Mohammad Akbari

Main category: cs.CV

TL;DR: Introduces TAD benchmark for evaluating temporal understanding in autonomous driving, showing current VLMs struggle with fine-grained motion, and proposes two training-free solutions (Scene-CoT and TCogMap) that improve accuracy by up to 17.72%.

DetailsMotivation: Existing temporal reasoning benchmarks focus on domains like sports and movies, but none address the unique challenges of ego-centric autonomous driving footage where temporal understanding is crucial for safety and decision-making.

Method: Created TAD benchmark with 6,000 QA pairs across 7 tasks, evaluated 9 generalist and AD specialist models, and proposed two training-free solutions: Scene-CoT (Chain-of-Thought reasoning) and TCogMap (ego-centric temporal cognitive map).

Result: Current SoTA models show substandard accuracy on TAD due to poor fine-grained motion understanding. The proposed Scene-CoT and TCogMap methods improve average accuracy by up to 17.72% when integrated with existing VLMs.

Conclusion: TAD benchmark fills a critical gap in evaluating temporal understanding for autonomous driving, reveals limitations in current models, and provides effective training-free solutions to improve performance, catalyzing future research in this domain.

Abstract: Temporal understanding in autonomous driving (AD) remains a significant challenge, even for recent state-of-the-art (SoTA) Vision-Language Models (VLMs). Prior work has introduced datasets and benchmarks aimed at improving temporal reasoning, but these have emphasized other video content, including sports, cooking, and movies. No existing benchmark focuses exclusively on the unique challenges of temporal understanding in ego-centric AD footage. To fill this gap, the Temporal Understanding in Autonomous Driving (TAD) benchmark is presented, which evaluates VLMs’ ability to capture the dynamic relationships between actions in AD. TAD comprises nearly 6,000 question-answer (QA) pairs, spanning 7 human-designed tasks. In addition, an evaluation is performed that consists of 9 closed- and open-source generalist models as well as SoTA AD specialist models. When applied to TAD, current SoTA models demonstrated substandard accuracies, largely due to imperfect fine-grained motion understanding. To improve motion understanding and overall accuracy on TAD, two novel training-free solutions are proposed: Scene-CoT, which leverages Chain-of-Thought (CoT), and TCogMap, which incorporates an ego-centric temporal cognitive map. The proposed approaches are integrated with existing VLMs and improve average accuracy on TAD by up to 17.72%. By introducing TAD, benchmarking multiple SoTA models, and proposing effective enhancements, this work aims to catalyze future research on temporal understanding in AD. The benchmark and evaluation code are available at https://huggingface.co/datasets/vbdai/TAD and https://github.com/vbdi/tad_bench, respectively.

[69] SpaceControl: Introducing Test-Time Spatial Control to 3D Generative Modeling

Elisabetta Fedele, Francis Engelmann, Ian Huang, Or Litany, Marc Pollefeys, Leonidas Guibas

Main category: cs.CV

TL;DR: SpaceControl is a training-free test-time method for explicit spatial control of 3D generation that accepts various geometric inputs and integrates with pre-trained models without additional training.

DetailsMotivation: Existing 3D generation methods rely on text or image prompts which lack geometric specificity - language is ambiguous and images are cumbersome to edit, creating a need for more intuitive and precise geometric control.

Method: Training-free test-time approach that accepts diverse geometric inputs (from coarse primitives to detailed meshes) and integrates with modern pre-trained generative models. Includes a controllable parameter to balance geometric fidelity and output realism.

Result: Outperforms both training-based and optimization-based baselines in geometric faithfulness while preserving high visual quality, as demonstrated through extensive quantitative evaluation and user studies.

Conclusion: SpaceControl provides effective explicit spatial control for 3D generation and includes an interactive UI for editing superquadrics into textured 3D assets, facilitating practical deployment in creative workflows.

Abstract: Generative methods for 3D assets have recently achieved remarkable progress, yet providing intuitive and precise control over the object geometry remains a key challenge. Existing approaches predominantly rely on text or image prompts, which often fall short in geometric specificity: language can be ambiguous, and images are cumbersome to edit. In this work, we introduce SpaceControl, a training-free test-time method for explicit spatial control of 3D generation. Our approach accepts a wide range of geometric inputs, from coarse primitives to detailed meshes, and integrates seamlessly with modern pre-trained generative models without requiring any additional training. A controllable parameter lets users trade off between geometric fidelity and output realism. Extensive quantitative evaluation and user studies demonstrate that SpaceControl outperforms both training-based and optimization-based baselines in geometric faithfulness while preserving high visual quality. Finally, we present an interactive user interface that enables online editing of superquadrics for direct conversion into textured 3D assets, facilitating practical deployment in creative workflows. Find our project page at https://spacecontrol3d.github.io/

[70] SplatPainter: Interactive Authoring of 3D Gaussians from 2D Edits via Test-Time Training

Yang Zheng, Hao Tan, Kai Zhang, Peng Wang, Leonidas Guibas, Gordon Wetzstein, Wang Yifan

Main category: cs.CV

TL;DR: A feedforward model for interactive editing of 3D Gaussian Splatting assets using 2D views, enabling real-time local refinement, painting, and recoloring.

DetailsMotivation: Current 3D Gaussian Splatting lacks interactive editing tools; existing diffusion/optimization methods are too slow, destructive, or imprecise for fine-grained control.

Method: State-aware feedforward model that predicts updates to compact Gaussian attributes from 2D views, using Test-Time Training for iterative workflow and feature-rich representation.

Result: Enables diverse editing tasks at interactive speeds: high-fidelity local detail refinement, local paint-over, and consistent global recoloring with single architecture.

Conclusion: Paves way for fluid, intuitive 3D content authoring by bridging the gap between photorealistic 3D asset creation and interactive refinement/editing.

Abstract: The rise of 3D Gaussian Splatting has revolutionized photorealistic 3D asset creation, yet a critical gap remains for their interactive refinement and editing. Existing approaches based on diffusion or optimization are ill-suited for this task, as they are often prohibitively slow, destructive to the original asset’s identity, or lack the precision for fine-grained control. To address this, we introduce SplatPainter, a state-aware feedforward model that enables continuous editing of 3D Gaussian assets from user-provided 2D view(s). Our method directly predicts updates to the attributes of a compact, feature-rich Gaussian representation and leverages Test-Time Training to create a state-aware, iterative workflow. The versatility of our approach allows a single architecture to perform diverse tasks, including high-fidelity local detail refinement, local paint-over, and consistent global recoloring, all at interactive speeds, paving the way for fluid and intuitive 3D content authoring.

[71] Group Orthogonal Low-Rank Adaptation for RGB-T Tracking

Zekai Shao, Yufan Hu, Jingyuan Liu, Bin Fan, Hongmin Liu

Main category: cs.CV

TL;DR: GOLA framework reduces rank space redundancy in RGB-T tracking by grouping redundant ranks and applying orthogonal constraints to learn complementary features.

DetailsMotivation: Current parameter-efficient fine-tuning in RGB-T tracking suffers from significant redundancy in rank space, where many ranks contribute little practical information, limiting the model's ability to learn diverse knowledge for various tracking challenges.

Method: Proposes Group Orthogonal Low-Rank Adaptation (GOLA) framework: 1) Uses singular value decomposition to quantify rank importance, 2) Freezes crucial ranks to preserve pretrained priors, 3) Clusters redundant ranks into groups, 4) Applies inter-group orthogonal constraints to enforce complementary feature learning.

Result: GOLA significantly outperforms state-of-the-art methods across four benchmark datasets, effectively reducing parameter redundancy and enhancing feature representation capabilities in RGB-T tracking.

Conclusion: The GOLA framework successfully addresses rank space redundancy in RGB-T tracking through structured parameter learning with orthogonal constraints, enabling more diverse knowledge acquisition for various tracking challenges.

Abstract: Parameter-efficient fine-tuning has emerged as a promising paradigm in RGB-T tracking, enabling downstream task adaptation by freezing pretrained parameters and fine-tuning only a small set of parameters. This set forms a rank space made up of multiple individual ranks, whose expressiveness directly shapes the model’s adaptability. However, quantitative analysis reveals that low-rank adaptation exhibits significant redundancy in the rank space, with many ranks contributing almost no practical information. This hinders the model’s ability to learn more diverse knowledge to address the various challenges in RGB-T tracking. To address this issue, we propose the Group Orthogonal Low-Rank Adaptation (GOLA) framework for RGB-T tracking, which effectively leverages the rank space through structured parameter learning. Specifically, we adopt a rank decomposition partitioning strategy utilizing singular value decomposition to quantify rank importance, freeze crucial ranks to preserve the pretrained priors, and cluster the redundant ranks into groups to prepare for subsequent orthogonal constraints. We further design an inter-group orthogonal constraint strategy. This constraint enforces orthogonality between rank groups, compelling them to learn complementary features that target diverse challenges, thereby alleviating information redundancy. Experimental results demonstrate that GOLA effectively reduces parameter redundancy and enhances feature representation capabilities, significantly outperforming state-of-the-art methods across four benchmark datasets and validating its effectiveness in RGB-T tracking tasks.
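The inter-group orthogonality constraint can be written as a simple penalty on cosine similarities between rank groups. A hedged sketch follows; the paper's exact formulation may differ:

```python
import torch
import torch.nn.functional as F

def group_orthogonality_loss(B, groups):
    """Penalize overlap between groups of LoRA rank directions.
    B: (rank, dim) matrix of rank vectors; groups: list of row-index lists."""
    loss = B.new_zeros(())
    for gi in range(len(groups)):
        for gj in range(gi + 1, len(groups)):
            Gi = F.normalize(B[groups[gi]], dim=1)
            Gj = F.normalize(B[groups[gj]], dim=1)
            loss = loss + (Gi @ Gj.T).pow(2).sum()   # drive cross-group cosines to zero
    return loss
```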

[72] PoolNet: Deep Learning for 2D to 3D Video Process Validation

Sanchit Kaul, Joseph Luna, Shray Arora

Main category: cs.CV

TL;DR: PoolNet is a deep learning framework that quickly validates whether image data is suitable for Structure-from-Motion processing, significantly reducing the time needed to assess SfM readiness compared to traditional methods.

DetailsMotivation: Traditional SfM processing is time-consuming and computationally expensive, and most publicly available image data is unsuitable due to issues like inadequate camera pose variation, obscuring elements, and noise. There's a need for efficient validation of image data before attempting SfM processing.

Method: PoolNet is a versatile deep learning framework designed for both frame-level and scene-level validation of in-the-wild image data. It analyzes image sequences to determine their suitability for SfM processing.

Result: The model successfully differentiates SfM-ready scenes from those unfit for processing while significantly reducing the time required compared to state-of-the-art SfM algorithms.

Conclusion: PoolNet provides an efficient solution for validating image data for SfM processing, addressing the time and computational cost issues of traditional methods while handling the challenges of real-world, noisy image data.

Abstract: Lifting Structure-from-Motion (SfM) information from sequential and non-sequential image data is a time-consuming and computationally expensive task. In addition, the majority of publicly available data is unfit for processing due to inadequate camera pose variation, obscuring scene elements, and noisy data. To solve this problem, we introduce PoolNet, a versatile deep learning framework for frame-level and scene-level validation of in-the-wild data. We demonstrate that our model successfully differentiates SfM-ready scenes from those unfit for processing while requiring only a fraction of the time that state-of-the-art algorithms take to obtain structure-from-motion data.

[73] ShaRP: SHAllow-LayeR Pruning for Video Large Language Models Acceleration

Yingjie Xia, Tao Liu, Jinglei Shi, Qingsong Xie, Heng Guo, Jian Yang, Xi Wang

Main category: cs.CV

TL;DR: ShaRP is an improved attention-based pruning framework for Video LLMs that enables effective pruning at shallow decoder layers while maintaining performance under high compression rates without retraining.

DetailsMotivation: VLLMs face high computational load during pre-filling due to processing many visual tokens. Existing attention-based pruning methods at early decoder layers cause significant performance degradation, especially under high compression rates, due to positional encoding bias and insufficient information interaction.

Method: Proposes ShaRP framework with three key components: segment-aware causal masking, positional debiasing, and token deduplication for enhanced token selection. This enables effective pruning at shallow layers while maintaining stable performance.

Result: Extensive experiments show ShaRP achieves competitive performance across multiple video understanding benchmarks, establishing a new paradigm for accelerating VLLM inference.

Conclusion: ShaRP provides an effective solution to accelerate VLLM inference through improved attention-based pruning that works well at shallow decoder layers without requiring retraining.

Abstract: Video Large Language Models (VLLMs) face the challenge of high computational load during the pre-filling stage due to the processing of an enormous number of visual tokens. Although attention-based pruning methods are widely used to accelerate inference, trials at early decoder layers often result in significant performance degradation, especially under high compression rates. We argue that while attention-based pruning inherently holds the potential to identify the most relevant visual tokens, its effectiveness in shallow decoder layers is limited by factors such as positional encoding bias and insufficient information interaction. In this paper, we propose an improved attention-based pruning framework, termed ShaRP, that integrates segment-aware causal masking, positional debiasing, and token deduplication for enhanced token selection. It enables effective pruning at shallow layers while maintaining stable performance under high compression rates without retraining. Extensive experiments demonstrate that ShaRP achieves competitive performance across multiple video understanding benchmarks, establishing a new paradigm for accelerating VLLM inference.
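At its core, attention-based pruning keeps the visual tokens that receive the most text-query attention. The sketch below shows only that selection step; ShaRP's segment-aware masking, positional debiasing, and deduplication are omitted:

```python
import torch

def prune_visual_tokens(tokens, attn_to_text, keep_ratio=0.1):
    """tokens: (N, D) visual tokens at a shallow decoder layer.
    attn_to_text: (N,) attention mass each visual token receives from text."""
    k = max(1, int(keep_ratio * tokens.shape[0]))
    idx = attn_to_text.topk(k).indices.sort().values   # preserve temporal order
    return tokens[idx], idx
```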

[74] LoC-Path: Learning to Compress for Pathology Multimodal Large Language Models

Qingqiao Hu, Weimin Lyu, Meilong Xu, Kehan Qi, Xiaoling Hu, Saumya Gupta, Jiawei Zhou, Chao Chen

Main category: cs.CV

TL;DR: LoC-Path is an efficient multimodal LLM for pathology that reduces computational costs by identifying and focusing on only task-relevant regions in whole slide images, rather than processing all patches.

DetailsMotivation: Existing slide-level MLLMs for pathology are computationally expensive because they process thousands of patch features in a brute-force manner, while human experts focus only on key diagnostic regions. The authors observed that tile-level features exhibit strong redundancy, with only a small subset being truly task-relevant.

Method: The framework includes: 1) Sparse Token Merger (STM) and MAE-pretrained resampler to remove local redundancy and compress globally redundant tile tokens into compact slide-level representations, and 2) Cross-Attention Routing Adapter (CARA) and Token Importance Scorer (TIS) to integrate compressed visual representations with language models efficiently.

Result: Extensive experiments show the approach achieves performance comparable to state-of-the-art whole-slide MLLMs while requiring significantly lower computation and memory.

Conclusion: The work demonstrates that efficient WSI-language modeling is possible by focusing on task-relevant regions and reducing redundancy, offering a more practical solution for computational pathology.

Abstract: Whole Slide Image (WSI) understanding is fundamentally challenging due to its gigapixel scale and the extreme sparsity of diagnostically relevant regions. Unlike human experts who primarily rely on key areas to arrive at a diagnosis, existing slide-level multimodal large language models (MLLMs) for pathology rely on heavy slide-level encoders that process thousands of patch features in a brute-force manner, resulting in excessive computational cost. In this work, we revisit the WSI-language modeling paradigm and show that tile-level features exhibit strong global and local redundancy, whereas only a small subset of tiles are truly task-relevant. Motivated by this observation, we introduce an efficient MLLM framework, called LoC-Path, that replaces the expensive slide-level encoder with redundancy-reducing modules. We first design a Sparse Token Merger (STM) and an MAE-pretrained resampler to remove local redundancy and compress globally redundant tile tokens into a compact slide-level representation set. We then propose a Cross-Attention Routing Adapter (CARA) and a Token Importance Scorer (TIS) to integrate the compressed visual representation with the language model in a computation-efficient manner. Extensive experiments demonstrate that our approach achieves performance comparable to existing state-of-the-art whole-slide MLLMs, while requiring significantly lower computation and memory.

[75] Delving into Latent Spectral Biasing of Video VAEs for Superior Diffusability

Shizhan Liu, Xinran Deng, Zhuoyi Yang, Jiayan Teng, Xiaotao Gu, Jie Tang

Main category: cs.CV

TL;DR: SSVAE introduces spectral-structured video VAE with two regularizers to improve diffusion training efficiency and quality.

DetailsMotivation: Existing video VAEs focus on reconstruction fidelity but overlook latent structure, which strongly influences diffusion training difficulty. The authors identify that latent space spectral properties are crucial for efficient diffusion training.

Method: Proposes Spectral-Structured VAE (SSVAE) with two lightweight regularizers: 1) Local Correlation Regularization to bias spatio-temporal frequency spectrum toward low frequencies, and 2) Latent Masked Reconstruction to ensure channel-wise eigenspectrum dominated by few modes.

Result: SSVAE achieves 3× speedup in text-to-video generation convergence and 10% gain in video reward, outperforming strong open-source VAEs.

Conclusion: Structuring video VAE latent spaces with spectral properties is essential for efficient diffusion training, and the proposed SSVAE with two regularizers effectively achieves this, leading to significant improvements in training efficiency and generation quality.

Abstract: Latent diffusion models pair VAEs with diffusion backbones, and the structure of VAE latents strongly influences the difficulty of diffusion training. However, existing video VAEs typically focus on reconstruction fidelity, overlooking latent structure. We present a statistical analysis of video VAE latent spaces and identify two spectral properties essential for diffusion training: a spatio-temporal frequency spectrum biased toward low frequencies, and a channel-wise eigenspectrum dominated by a few modes. To induce these properties, we propose two lightweight, backbone-agnostic regularizers: Local Correlation Regularization and Latent Masked Reconstruction. Experiments show that our Spectral-Structured VAE (SSVAE) achieves a 3× speedup in text-to-video generation convergence and a 10% gain in video reward, outperforming strong open-source VAEs. The code is available at https://github.com/zai-org/SSVAE.
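Of the two regularizers, Local Correlation Regularization is the simpler to picture: it penalizes differences between neighboring latent cells, biasing the spectrum toward low frequencies. A hedged sketch of that idea (the released SSVAE code is authoritative):

```python
import torch

def local_correlation_reg(z):
    """z: (B, C, T, H, W) video latents. Penalize neighbor differences in
    space and time to bias the latent spectrum toward low frequencies."""
    dh = (z[..., 1:, :] - z[..., :-1, :]).pow(2).mean()   # vertical neighbors
    dw = (z[..., 1:] - z[..., :-1]).pow(2).mean()         # horizontal neighbors
    dt = (z[:, :, 1:] - z[:, :, :-1]).pow(2).mean()       # temporal neighbors
    return dh + dw + dt
```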

[76] The Dynamic Prior: Understanding 3D Structures for Casual Dynamic Videos

Zhuoyuan Wu, Xurui Yang, Jiahui Huang, Yue Wang, Jun Gao

Main category: cs.CV

TL;DR: Dynamic Prior (DP) uses Vision-Language Models and SAM2 to identify dynamic objects without task-specific training, improving camera pose estimation, depth reconstruction, and 4D trajectory estimation in videos.

DetailsMotivation: Classical structure from motion pipelines struggle with dynamic objects in videos, and existing learning-based methods require large motion segmentation datasets which limit their accuracy and 3D understanding capabilities.

Method: Dynamic Prior leverages Vision-Language Models for reasoning about object dynamics and SAM2 for fine-grained spatial segmentation to identify dynamic objects without task-specific training, then integrates this into state-of-the-art 3D reconstruction pipelines.

Result: Achieves state-of-the-art performance on motion segmentation and significantly improves accuracy and robustness for structural 3D understanding on both synthetic and real-world videos.

Conclusion: Dynamic Prior provides an effective way to handle dynamic objects in 3D reconstruction without requiring specialized training data, enabling more robust camera pose optimization, depth reconstruction, and 4D trajectory estimation.

Abstract: Estimating accurate camera poses, 3D scene geometry, and object motion from in-the-wild videos is a long-standing challenge for classical structure from motion pipelines due to the presence of dynamic objects. Recent learning-based methods attempt to overcome this challenge by training motion estimators to filter dynamic objects and focus on the static background. However, their performance is largely limited by the availability of large-scale motion segmentation datasets, resulting in inaccurate segmentation and, therefore, inferior structural 3D understanding. In this work, we introduce the Dynamic Prior (DP) to robustly identify dynamic objects without task-specific training, leveraging the powerful reasoning capabilities of Vision-Language Models (VLMs) and the fine-grained spatial segmentation capacity of SAM2. DP can be seamlessly integrated into state-of-the-art pipelines for camera pose optimization, depth reconstruction, and 4D trajectory estimation. Extensive experiments on both synthetic and real-world videos demonstrate that DP not only achieves state-of-the-art performance on motion segmentation, but also significantly improves accuracy and robustness for structural 3D understanding.

[77] Genetic Algorithms For Parameter Optimization for Disparity Map Generation of Radiata Pine Branch Images

Yida Lin, Bing Xue, Mengjie Zhang, Sam Schofield, Richard Green

Main category: cs.CV

TL;DR: GA-based parameter optimization for SGBM+WLS stereo matching improves UAV tree branch distance measurement accuracy while maintaining speed.

DetailsMotivation: Traditional stereo matching algorithms (SGBM+WLS) are fast for UAV applications but require manual parameter tuning, which is time-consuming and suboptimal for real-world forestry applications.

Method: Proposes a Genetic Algorithm (GA) framework to automatically optimize SGBM and WLS filter parameters, eliminating manual tuning while maintaining processing efficiency for resource-constrained UAV systems.

Result: GA-optimized approach reduces Mean Squared Error by 42.86%, increases PSNR by 8.47% and SSIM by 28.52% compared to baseline, with superior generalization across varied imaging conditions.

Conclusion: The GA-based parameter optimization framework provides an effective automated solution for improving stereo matching accuracy in UAV forestry applications while maintaining computational efficiency.

Abstract: Traditional stereo matching algorithms like Semi-Global Block Matching (SGBM) with Weighted Least Squares (WLS) filtering offer speed advantages over neural networks for UAV applications, generating disparity maps in approximately 0.5 seconds per frame. However, these algorithms require meticulous parameter tuning. We propose a Genetic Algorithm (GA) based parameter optimization framework that systematically searches for optimal parameter configurations for SGBM and WLS, enabling UAVs to measure distances to tree branches with enhanced precision while maintaining processing efficiency. Our contributions include: (1) a novel GA-based parameter optimization framework that eliminates manual tuning; (2) a comprehensive evaluation methodology using multiple image quality metrics; and (3) a practical solution for resource-constrained UAV systems. Experimental results demonstrate that our GA-optimized approach reduces Mean Squared Error by 42.86% while increasing Peak Signal-to-Noise Ratio and Structural Similarity by 8.47% and 28.52%, respectively, compared with baseline configurations. Furthermore, our approach demonstrates superior generalization performance across varied imaging conditions, which is critical for real-world forestry applications.
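The overall loop is easy to reproduce with OpenCV's SGBM and WLS filter plus a small GA. The sketch below uses illustrative parameter ranges, a negative-MSE fitness, and elitist mutation; the paper's exact encoding and operators differ:

```python
import random
import cv2
import numpy as np

SPACE = {"nd16": (2, 12), "half_block": (1, 10), "uniq": (0, 20),
         "lmbda": (1000, 16000), "sigma10": (5, 30)}   # sigma stored as 10x

def random_params():
    return {k: random.randint(*v) for k, v in SPACE.items()}

def mutate(p):
    q = dict(p)
    k = random.choice(list(SPACE))
    q[k] = random.randint(*SPACE[k])
    return q

def disparity(p, left, right):
    m_l = cv2.StereoSGBM_create(minDisparity=0,
                                numDisparities=16 * p["nd16"],      # multiple of 16
                                blockSize=2 * p["half_block"] + 1,  # odd
                                uniquenessRatio=p["uniq"])
    m_r = cv2.ximgproc.createRightMatcher(m_l)
    wls = cv2.ximgproc.createDisparityWLSFilter(m_l)
    wls.setLambda(float(p["lmbda"]))
    wls.setSigmaColor(p["sigma10"] / 10.0)
    d_l = m_l.compute(left, right)
    d_r = m_r.compute(right, left)
    return wls.filter(d_l, left, disparity_map_right=d_r).astype(np.float32) / 16.0

def evolve(left, right, gt_disp, pop_size=12, gens=20):
    def fitness(p):
        return -float(np.nanmean((disparity(p, left, right) - gt_disp) ** 2))
    pop = [random_params() for _ in range(pop_size)]
    for _ in range(gens):
        pop.sort(key=fitness, reverse=True)
        elite = pop[: pop_size // 3]                     # keep the fittest third
        pop = elite + [mutate(random.choice(elite))
                       for _ in range(pop_size - len(elite))]
    return max(pop, key=fitness)
```

Requires opencv-contrib-python for the `cv2.ximgproc` module.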

[78] YOLO and SGBM Integration for Autonomous Tree Branch Detection and Depth Estimation in Radiata Pine Pruning Applications

Yida Lin, Bing Xue, Mengjie Zhang, Sam Schofield, Richard Green

Main category: cs.CV

TL;DR: Computer vision system using YOLO + stereo vision enables autonomous drone pruning, achieving 82% mAP for branch detection without expensive LiDAR.

DetailsMotivation: Manual pruning of radiata pine trees is dangerous due to extreme heights and challenging terrain, creating safety risks for workers.

Method: Integrates YOLO object detection with Semi-Global Block Matching (SGBM) stereo vision for branch detection and depth estimation using only stereo cameras.

Result: YOLO outperforms Mask R-CNN with 82.0% mask mAP@[50:95] for branch segmentation; system accurately localizes branches within 2m range with <1 second processing per frame.

Conclusion: Demonstrates feasibility of cost-effective autonomous pruning systems that enhance worker safety and operational efficiency in forestry.

Abstract: Manual pruning of radiata pine trees poses significant safety risks due to extreme working heights and challenging terrain. This paper presents a computer vision framework that integrates YOLO object detection with Semi-Global Block Matching (SGBM) stereo vision for autonomous drone-based pruning operations. Our system achieves precise branch detection and depth estimation using only stereo camera input, eliminating the need for expensive LiDAR sensors. Experimental evaluation demonstrates YOLO’s superior performance over Mask R-CNN, achieving 82.0% mask mAP@[50:95] for branch segmentation. The integrated system accurately localizes branches within a 2 m operational range, with processing times under one second per frame. These results establish the feasibility of cost-effective autonomous pruning systems that enhance worker safety and operational efficiency in commercial forestry.
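The depth side of the system reduces to the standard stereo relation Z = f·B/d (focal length in pixels, baseline in metres, disparity in pixels). A worked example with illustrative numbers, not values from the paper:

```python
focal_px = 1400.0      # focal length in pixels (illustrative)
baseline_m = 0.12      # stereo baseline in metres (illustrative)
disparity_px = 84.0    # disparity of a matched branch pixel

depth_m = focal_px * baseline_m / disparity_px   # = 2.00 m, at the edge of the 2 m range
print(f"branch depth: {depth_m:.2f} m")
```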

[79] Moving object detection from multi-depth images with an attention-enhanced CNN

Masato Shibukawa, Fumi Yoshida, Toshifumi Yanagisawa, Takashi Ito, Hirohisa Kurosaki, Makoto Yoshikawa, Kohki Kamiya, Ji-an Jiang, Wesley Fraser, JJ Kavelaars, Susan Benecchi, Anne Verbiscer, Akira Hatakeyama, Hosei O, Naoya Ozaki

Main category: cs.CV

TL;DR: A multi-input CNN with attention module achieves 99% accuracy for automated moving object detection in solar system surveys, reducing human workload by 99%.

DetailsMotivation: Current moving object detection in solar system surveys relies heavily on manual verification by human eyes, which is labor-intensive and costly. There's a need to automate this process to reduce reliance on human intervention.

Method: Proposes a multi-input convolutional neural network integrated with a convolutional block attention module (CBAM). The multi-input architecture processes multiple stacked images simultaneously, while CBAM enables the model to focus on essential features in both spatial and channel dimensions.

Result: Achieved 99% accuracy with AUC >0.99 on a dataset of approximately 2,000 observational images. By adjusting detection thresholds, the model reduces human workload by more than 99% compared to manual verification.

Conclusion: The proposed model demonstrates excellent classification performance for moving object detection in solar system surveys, significantly reducing manual labor while maintaining high accuracy, representing a major advancement in automated astronomical data analysis.

Abstract: One of the greatest challenges for detecting moving objects in the solar system from wide-field survey data is determining whether a signal indicates a true object or is due to some other source, like noise. Object verification has relied heavily on human eyes, which usually results in significant labor costs. In order to address this limitation and reduce the reliance on manual intervention, we propose a multi-input convolutional neural network integrated with a convolutional block attention module. This method is specifically tailored to enhance the moving object detection system that we have developed and used previously. The current method introduces two innovations. The first is a multi-input architecture that processes multiple stacked images simultaneously. The second is the incorporation of the convolutional block attention module, which enables the model to focus on essential features in both spatial and channel dimensions. These advancements facilitate efficient learning from multiple inputs, leading to more robust detection of moving objects. The performance of the model is evaluated on a dataset consisting of approximately 2,000 observational images. We achieved an accuracy of nearly 99% with an AUC (Area Under the Curve) above 0.99. These metrics indicate that the proposed model achieves excellent classification performance. By adjusting the threshold for object detection, the new model reduces the human workload by more than 99% compared to manual verification.
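CBAM itself is a well-documented module; a compact PyTorch version of the channel-then-spatial attention it applies is sketched below (the reduction ratio and kernel size are the usual defaults, assumed here):

```python
import torch
import torch.nn as nn

class CBAM(nn.Module):
    """Channel attention followed by spatial attention over a feature map."""
    def __init__(self, channels, reduction=16, kernel=7):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(),
            nn.Linear(channels // reduction, channels))
        self.spatial = nn.Conv2d(2, 1, kernel, padding=kernel // 2)

    def forward(self, x):                          # x: (B, C, H, W)
        b, c, _, _ = x.shape
        avg = self.mlp(x.mean(dim=(2, 3)))         # pooled channel descriptors
        mx = self.mlp(x.amax(dim=(2, 3)))
        x = x * torch.sigmoid(avg + mx).view(b, c, 1, 1)              # channel gate
        s = torch.cat([x.mean(1, keepdim=True), x.amax(1, keepdim=True)], dim=1)
        return x * torch.sigmoid(self.spatial(s))                     # spatial gate
```

In a multi-input setting, such a block can let the network weight the stacked exposures by how informative each is for the moving-object signal.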

[80] Performance Evaluation of Deep Learning for Tree Branch Segmentation in Autonomous Forestry Systems

Yida Lin, Bing Xue, Mengjie Zhang, Sam Schofield, Richard Green

Main category: cs.CV

TL;DR: The paper evaluates deep learning methods for UAV-based tree branch segmentation across three resolutions, finding U-Net with MiT-B4 performs best at lower resolutions while U-Net++ excels in boundary quality at highest resolution, with PSPNet offering most efficiency.

DetailsMotivation: UAV-based autonomous forestry operations need rapid and precise tree branch segmentation for safe navigation and automated pruning, requiring methods that work across varying pixel resolutions and operational conditions.

Method: Evaluated different deep learning methods (U-Net, U-Net++, PSPNet with various backbones) at three resolutions (256x256, 512x512, 1024x1024) using Urban Street Tree Dataset with standard metrics (IoU, Dice) and specialized measures (Thin Structure IoU, Connectivity Preservation Rate, Boundary-F1).

Result: U-Net with MiT-B4 backbone performs best at 256x256; MiT-B4 leads at 512x512; U-Net+MiT-B3 shows best validation performance at 1024x1024 for IoU/Dice/precision while U-Net++ excels in boundary quality. PSPNet provides most efficient option with 2.36/9.43/37.74 GFLOPs but with 25.7/19.6/11.8 percentage point IoU reductions.

Conclusion: Establishes multi-resolution benchmarks for accuracy-efficiency trade-offs in embedded forestry systems, providing guidance for selecting appropriate models based on resolution requirements and computational constraints.

Abstract: UAV-based autonomous forestry operations require rapid and precise tree branch segmentation for safe navigation and automated pruning across varying pixel resolutions and operational conditions. We evaluate different deep learning methods at three resolutions (256x256, 512x512, 1024x1024) using the Urban Street Tree Dataset, employing standard metrics (IoU, Dice) and specialized measures including Thin Structure IoU (TS-IoU) and Connectivity Preservation Rate (CPR). Among 22 configurations tested, U-Net with MiT-B4 backbone achieves strong performance at 256x256. At 512x512, MiT-B4 leads in IoU, Dice, TS-IoU, and Boundary-F1. At 1024x1024, U-Net+MiT-B3 shows the best validation performance for IoU/Dice and precision, while U-Net++ excels in boundary quality. PSPNet provides the most efficient option (2.36/9.43/37.74 GFLOPs) with 25.7/19.6/11.8 percentage point IoU reductions compared to top performers at respective resolutions. These results establish multi-resolution benchmarks for accuracy-efficiency trade-offs in embedded forestry systems. Implementation is available at https://github.com/BennyLinntu/PerformanceTreeBranchSegmentation.

[81] ParaUni: Enhance Generation in Unified Multimodal Model with Reinforcement-driven Hierarchical Parallel Information Interaction

Jiangtong Tan, Lin Liu, Jie Huanng, Xiaopeng Zhang, Qi Tian, Feng Zhao

Main category: cs.CV

TL;DR: ParaUni is a unified multimodal model that extracts parallel features from different VLM layers to balance interaction and flexibility, using layer integration and dynamic adjustment for improved visual generation.

DetailsMotivation: Existing unified multimodal models struggle to balance sufficient interaction between vision-language models and diffusion models with flexible implementation due to vast representation differences between them.

Method: Proposes ParaUni with: 1) Parallel extraction of features from all VLM layers fed into Layer Integration Module (LIM) to fuse fine-grained details and semantic abstractions; 2) Layer-wise Dynamic Adjustment Mechanism (LDAM) using RL to align hierarchical layer properties with multiple rewards.

Result: Extensive experiments show ParaUni leverages complementary multi-layer features to substantially improve generation quality and demonstrates strong potential for multiple reward advances during RL stages.

Conclusion: ParaUni effectively balances interaction and flexibility in unified multimodal models by parallel feature extraction and layer-wise dynamic adjustment, achieving improved visual generation performance.

Abstract: Unified multimodal models significantly improve visual generation by combining vision-language models (VLMs) with diffusion models. However, existing methods struggle to fully balance sufficient interaction and flexible implementation due to vast representation difference. Considering abundant and hierarchical information in VLM’s layers from low-level details to high-level semantics, we propose \textbf{ParaUni}. It extracts features from variants VLM’s layers in a \textbf{Para}llel way for comprehensive information interaction and retains a flexible separation architecture to enhance generation in \textbf{Uni}fied multimodal model. Concretely, visual features from all VLM’s layers are fed in parallel into a Layer Integration Module (LIM), which efficiently integrates fine-grained details and semantic abstractions and provides the fused representation as a condition to the diffusion model. To further enhance performance, we reveal that these hierarchical layers respond unequally to different rewards in Reinforcement Learning (RL). Crucially, we design a Layer-wise Dynamic Adjustment Mechanism (LDAM) to facilitate multiple reward improvements that aligns the hierarchical properties of these layers using RL. Extensive experiments show ParaUni leverages complementary multi-layer features to substantially improve generation quality and shows strong potential for multiple reward advances during RL stages. Code is available at https://github.com/JosephTiTan/ParaUni.

[82] TED-4DGS: Temporally Activated and Embedding-based Deformation for 4DGS Compression

Cheng-Yuan Ho, He-Bi Yang, Jui-Chiu Chiang, Yu-Lun Liu, Wen-Hsiao Peng

Main category: cs.CV

TL;DR: TED-4DGS: A novel rate-distortion-optimized compression framework for dynamic 3D Gaussian Splatting that uses temporally activated anchors with embedding-based deformation and neural compression techniques.

DetailsMotivation: Current dynamic 3DGS methods have limitations: 4DGS uses overspecified short-lived Gaussians, while canonical 3DGS with deformation lacks explicit temporal control. There's a need for more compact, efficient deformation schemes with rate-distortion-optimized compression for dynamic 3DGS representations.

Method: TED-4DGS uses sparse anchor-based 3DGS representation with learnable temporal-activation parameters for appearance/disappearance transitions. Lightweight per-anchor temporal embeddings query a shared deformation bank. For compression, it employs INR-based hyperprior to model anchor attribute distributions and channel-wise autoregressive model for intra-anchor correlations.

Result: Achieves state-of-the-art rate-distortion performance on several real-world datasets. Represents one of the first attempts at rate-distortion-optimized compression for dynamic 3DGS representations.

Conclusion: TED-4DGS successfully unifies strengths of both 4DGS and canonical 3DGS approaches, providing an efficient, compact deformation scheme with optimized compression for dynamic 3D scene representation.

Abstract: Building on the success of 3D Gaussian Splatting (3DGS) in static 3D scene representation, its extension to dynamic scenes, commonly referred to as 4DGS or dynamic 3DGS, has attracted increasing attention. However, designing more compact and efficient deformation schemes together with rate-distortion-optimized compression strategies for dynamic 3DGS representations remains an underexplored area. Prior methods either rely on space-time 4DGS with overspecified, short-lived Gaussian primitives or on canonical 3DGS with deformation that lacks explicit temporal control. To address this, we present TED-4DGS, a temporally activated and embedding-based deformation scheme for rate-distortion-optimized 4DGS compression that unifies the strengths of both families. TED-4DGS is built on a sparse anchor-based 3DGS representation. Each canonical anchor is assigned learnable temporal-activation parameters to specify its appearance and disappearance transitions over time, while a lightweight per-anchor temporal embedding queries a shared deformation bank to produce anchor-specific deformation. For rate-distortion compression, we incorporate an implicit neural representation (INR)-based hyperprior to model anchor attribute distributions, along with a channel-wise autoregressive model to capture intra-anchor correlations. With these novel elements, our scheme achieves state-of-the-art rate-distortion performance on several real-world datasets. To the best of our knowledge, this work represents one of the first attempts to pursue a rate-distortion-optimized compression framework for dynamic 3DGS representations.
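The temporal-activation idea can be pictured as a smooth on/off gate per anchor. A hedged sketch follows; the exact TED-4DGS parameterization is an assumption:

```python
import torch

def temporal_activation(t, t_on, t_off, sharpness=50.0):
    """Smooth gate, ~1 for t in [t_on, t_off] and ~0 outside.
    t: tensor of normalized timestamps; t_on/t_off are learnable
    per-anchor appearance/disappearance times."""
    rise = torch.sigmoid(sharpness * (t - t_on))
    fall = torch.sigmoid(sharpness * (t_off - t))
    return rise * fall
```

Multiplying each anchor's opacity by such a gate lets short-lived content be represented by a single canonical anchor instead of many short-lived primitives.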

[83] Active Video Perception: Iterative Evidence Seeking for Agentic Long Video Understanding

Ziyang Wang, Honglu Zhou, Shijie Wang, Junnan Li, Caiming Xiong, Silvio Savarese, Mohit Bansal, Michael S. Ryoo, Juan Carlos Niebles

Main category: cs.CV

TL;DR: AVP is an active video perception framework that uses MLLM agents to iteratively plan, observe, and reflect on videos, focusing only on query-relevant evidence to efficiently solve long video understanding tasks.

DetailsMotivation: Current video understanding methods waste computation on irrelevant content and lose fine-grained details by using query-agnostic captioners. Real-world long video queries depend on sparse, temporally dispersed cues buried in hours of redundant content.

Method: AVP treats videos as interactive environments and uses an iterative plan-observe-reflect process with MLLM agents. The planner proposes targeted video interactions, the observer extracts time-stamped evidence, and the reflector evaluates sufficiency, either halting with an answer or triggering further observation.

Result: AVP achieves state-of-the-art performance across five LVU benchmarks, outperforming the best agentic method by 5.7% average accuracy while requiring only 18.4% inference time and 12.4% input tokens.

Conclusion: Active perception principles enable efficient long video understanding by focusing computational resources on query-relevant evidence, significantly improving performance while dramatically reducing computational costs.

Abstract: Long video understanding (LVU) is challenging because answering real-world queries often depends on sparse, temporally dispersed cues buried in hours of mostly redundant and irrelevant content. While agentic pipelines improve video reasoning capabilities, prevailing frameworks rely on a query-agnostic captioner to perceive video information, which wastes computation on irrelevant content and blurs fine-grained temporal and spatial information. Motivated by active perception theory, we argue that LVU agents should actively decide what, when, and where to observe, and continuously assess whether the current observation is sufficient to answer the query. We present Active Video Perception (AVP), an evidence-seeking framework that treats the video as an interactive environment and acquires compact, query-relevant evidence directly from pixels. Concretely, AVP runs an iterative plan-observe-reflect process with MLLM agents. In each round, a planner proposes targeted video interactions, an observer executes them to extract time-stamped evidence, and a reflector evaluates the sufficiency of the evidence for the query, either halting with an answer or triggering further observation. Across five LVU benchmarks, AVP achieves the highest performance with significant improvements. Notably, AVP outperforms the best agentic method by 5.7% in average accuracy while requiring only 18.4% of the inference time and 12.4% of the input tokens.
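The control flow of the plan-observe-reflect loop is simple to state in code; `planner`, `observer`, and `reflector` below stand for MLLM agent calls and are placeholders, not the paper's API:

```python
def active_video_perception(video, query, planner, observer, reflector,
                            max_rounds=5):
    """Iteratively gather query-relevant evidence until it suffices."""
    evidence = []
    for _ in range(max_rounds):
        actions = planner(query, evidence)                  # what/when/where to look
        evidence += [observer(video, a) for a in actions]   # time-stamped findings
        verdict = reflector(query, evidence)                # sufficient to answer?
        if verdict["sufficient"]:
            return verdict["answer"]
    return reflector(query, evidence)["answer"]             # best effort at budget
```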

[84] University Building Recognition Dataset in Thailand for the mission-oriented IoT sensor system

Takara Taniguchi, Yudai Ueda, Atsuya Muramatsu, Kohki Hashimoto, Ryo Yagi, Hideya Ochiai, Chaodit Aswakul

Main category: cs.CV

TL;DR: The paper introduces CUBR, a building recognition dataset for Chulalongkorn University, and shows WAFL-ViT training outperforms self-training in edge device scenarios.

Motivation: As edge device training becomes more feasible, mission-oriented sensor systems like WAFL-ViT require specialized datasets for specific applications. Existing datasets like UTBR may not suit all geographical contexts, necessitating localized datasets for practical deployment.

Method: Developed the Chulalongkorn University Building Recognition Dataset (CUBR) as a case study for Thailand. Used Wireless Ad Hoc Federated Learning with Vision Transformer (WAFL-ViT) for collaborative training among edge devices with device-to-device communication.

Result: Training with WAFL scenarios achieved better accuracy than self-training scenarios. The CUBR dataset is publicly available on GitHub for further research and application.

Conclusion: Specialized datasets like CUBR are essential for mission-oriented edge AI systems, and WAFL-based collaborative training provides superior performance compared to isolated self-training approaches for building recognition tasks.

Abstract: Many industrial sectors have been using machine learning in inference mode on edge devices. Training on edge devices is a promising future direction, owing to improvements in semiconductor performance. Wireless Ad Hoc Federated Learning (WAFL) has been proposed as a promising approach for collaborative learning with device-to-device communication among edges. In particular, WAFL with Vision Transformer (WAFL-ViT) has been tested on image recognition tasks with the UTokyo Building Recognition Dataset (UTBR). Since WAFL-ViT is a mission-oriented sensor system, it is essential to construct specific datasets for each mission. In our work, we have developed the Chulalongkorn University Building Recognition Dataset (CUBR), which is specialized for Chulalongkorn University as a case study in Thailand. Additionally, our results demonstrate that training under WAFL scenarios achieves better accuracy than self-training scenarios. The dataset is available at https://github.com/jo2lxq/wafl/.
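
WAFL's core mechanic, opportunistic parameter exchange when devices meet, can be sketched in a few lines. The plain-averaging rule and the mixing coefficient below are assumptions for illustration; the actual aggregation schedule is defined in the WAFL papers.

    import numpy as np

    def wafl_exchange(params_a, params_b, mix=0.5):
        """One device-to-device encounter: each node pulls its weights toward
        its neighbor's. mix=0.5 is plain averaging (an assumption; WAFL tunes
        this coefficient)."""
        new_a = {k: (1 - mix) * v + mix * params_b[k] for k, v in params_a.items()}
        new_b = {k: (1 - mix) * v + mix * params_a[k] for k, v in params_b.items()}
        return new_a, new_b

    a = {"w": np.array([1.0, 2.0])}
    b = {"w": np.array([3.0, 0.0])}
    a, b = wafl_exchange(a, b)
    print(a["w"], b["w"])  # both move toward [2.0, 1.0]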

[85] EmoStyle: Emotion-Driven Image Stylization

Jingyuan Yang, Zihuan Bai, Hui Huang

Main category: cs.CV

TL;DR: EmoStyle is a framework for Affective Image Stylization that applies artistic styles to evoke specific emotions while preserving content, using a novel dataset and emotion-aware style quantization.

Motivation: Existing image stylization methods focus on visual appearance but overlook emotional impact. There's a need to bridge this gap by creating stylization that evokes specific emotions while preserving content.

Method: 1) Construct EmoStyleSet dataset from ArtEmis; 2) Develop Emotion-Content Reasoner to integrate emotional cues with content; 3) Create Style Quantizer to convert continuous features into emotion-related codebook entries.

Result: EmoStyle enhances emotional expressiveness while maintaining content consistency, as shown through qualitative/quantitative evaluations and user studies. The emotion-aware style dictionary is adaptable to other generative tasks.

Conclusion: The work establishes a foundation for emotion-driven image stylization, expanding AI-generated art’s creative potential through emotion-aware style manipulation.

Abstract: Art has long been a profound medium for expressing emotions. While existing image stylization methods effectively transform visual appearance, they often overlook the emotional impact carried by styles. To bridge this gap, we introduce Affective Image Stylization (AIS), a task that applies artistic styles to evoke specific emotions while preserving content. We present EmoStyle, a framework designed to address key challenges in AIS, including the lack of training data and the emotion-style mapping. First, we construct EmoStyleSet, a content-emotion-stylized image triplet dataset derived from ArtEmis to support AIS. We then propose an Emotion-Content Reasoner that adaptively integrates emotional cues with content to learn coherent style queries. Given the discrete nature of artistic styles, we further develop a Style Quantizer that converts continuous style features into emotion-related codebook entries. Extensive qualitative and quantitative evaluations, including user studies, demonstrate that EmoStyle enhances emotional expressiveness while maintaining content consistency. Moreover, the learned emotion-aware style dictionary is adaptable to other generative tasks, highlighting its potential for broader applications. Our work establishes a foundation for emotion-driven image stylization, expanding the creative potential of AI-generated art.
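
The Style Quantizer is described as mapping continuous style features to discrete, emotion-related codebook entries; the nearest-neighbor lookup below is the standard vector-quantization step that description suggests. The codebook size, feature dimension, and L2 metric are our assumptions.

    import numpy as np

    rng = np.random.default_rng(0)
    codebook = rng.normal(size=(64, 128))  # 64 style codes of dim 128 (assumed sizes)

    def quantize(style_feat):
        """Snap a continuous style feature to its nearest codebook entry (L2)."""
        d = np.linalg.norm(codebook - style_feat, axis=1)
        idx = int(np.argmin(d))
        return idx, codebook[idx]

    idx, code = quantize(rng.normal(size=128))
    print("selected codebook entry:", idx)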

[86] UniFS: Unified Multi-Contrast MRI Reconstruction via Frequency-Spatial Fusion

Jialin Li, Yiwei Ren, Kai Pan, Dong Wei, Pujin Cheng, Xian Wu, Xiaoying Tang

Main category: cs.CV

TL;DR: UniFS is a unified frequency-spatial fusion model for multi-contrast MR reconstruction that handles multiple k-space undersampling patterns without retraining, achieving state-of-the-art performance.

Motivation: Existing MCMR methods struggle with generalization across different k-space undersampling patterns (requiring separate models for each pattern) and fail to fully leverage complementary cross-modal frequency information, limiting practical applicability.

Method: UniFS integrates three key modules: 1) Cross-Modal Frequency Fusion module, 2) Adaptive Mask-Based Prompt Learning module, and 3) Dual-Branch Complementary Refinement module. It uses adaptive prompt-guided frequency fusion for k-space learning to extract domain-invariant features from diverse undersampling patterns.

Result: UniFS achieves state-of-the-art performance on BraTS and HCP datasets across various k-space undersampling patterns and acceleration factors, including previously unseen patterns, demonstrating strong generalization capability.

Conclusion: UniFS provides a unified solution for multi-contrast MR reconstruction that handles multiple undersampling patterns without retraining, effectively leveraging both frequency and spatial information while maintaining strong generalization performance.

Abstract: Recently, Multi-Contrast MR Reconstruction (MCMR) has emerged as a hot research topic that leverages high-quality auxiliary modalities to reconstruct undersampled target modalities of interest. However, existing methods often struggle to generalize across different k-space undersampling patterns, requiring the training of a separate model for each specific pattern, which limits their practical applicability. To address this challenge, we propose UniFS, a Unified Frequency-Spatial Fusion model designed to handle multiple k-space undersampling patterns for MCMR tasks without any need for retraining. UniFS integrates three key modules: a Cross-Modal Frequency Fusion module, an Adaptive Mask-Based Prompt Learning module, and a Dual-Branch Complementary Refinement module. These modules work together to extract domain-invariant features from diverse k-space undersampling patterns while dynamically adapting to their variations. Another limitation of existing MCMR methods is their tendency to focus solely on spatial information while neglecting frequency characteristics, or to extract only shallow frequency features, thus failing to fully leverage complementary cross-modal frequency information. To address this issue, UniFS introduces an adaptive prompt-guided frequency fusion module for k-space learning, significantly enhancing the model’s generalization performance. We evaluate our model on the BraTS and HCP datasets with various k-space undersampling patterns and acceleration factors, including previously unseen patterns, to comprehensively assess UniFS’s generalizability. Experimental results across multiple scenarios demonstrate that UniFS achieves state-of-the-art performance. Our code is available at https://github.com/LIKP0/UniFS.

[87] Zoom in, Click out: Unlocking and Evaluating the Potential of Zooming for GUI Grounding

Zhiyuan Jiang, Shenghao Xie, Wenyi Li, Wenqiang Zu, Peihang Li, Jiahao Qiu, Siqi Pei, Lei Ma, Tiejun Huang, Mengdi Wang, Shilong Liu

Main category: cs.CV

TL;DR: ZoomClick: A training-free method that leverages zoom as a prior for GUI grounding, improving performance of vision-language and specialized models on GUI element localization tasks.

Motivation: Existing GUI grounding approaches face challenges with cross-platform generalization, complex layout analysis, and fine-grained element localization despite using large-scale bounding box supervision. The authors identify zoom as an underexplored but powerful prior for GUI grounding tasks.

Method: Proposes ZoomClick, a training-free method that characterizes four key properties of zoom: pre-zoom, depth, shrink size, and minimal crop size. This unlocks zoom’s capabilities for dynamic spatial focusing and adaptive context switching in GUI grounding.

Result: Significantly boosts performance of both general vision-language and specialized GUI grounding models, achieving state-of-the-art results on mainstream benchmarks. UI-Venus-72B attains 73.1% success rate on ScreenSpot-Pro. Also introduces GUIZoom-Bench benchmark for evaluating model adaptability to zoom.

Conclusion: Zoom is a strong prior for GUI grounding that enables effective spatial focusing and context switching. The proposed training-free ZoomClick method demonstrates substantial performance improvements, and the new benchmark aims to inspire future research on improving zoom capabilities for GUI grounding tasks.

Abstract: Grounding is a fundamental capability for building graphical user interface (GUI) agents. Although existing approaches rely on large-scale bounding box supervision, they still face various challenges, such as cross-platform generalization, complex layout analysis, and fine-grained element localization. In this paper, we investigate zoom as a strong yet underexplored prior for GUI grounding, and propose a training-free method, ZoomClick. By characterizing four key properties of zoom (i.e., pre-zoom, depth, shrink size, minimal crop size), we unlock its full capabilities for dynamic spatial focusing and adaptive context switching. Experiments demonstrate that our method significantly boosts the performance of both general vision-language and specialized GUI grounding models, achieving state-of-the-art results on several mainstream benchmarks; for example, UI-Venus-72B attains a 73.1% success rate on ScreenSpot-Pro. Furthermore, we present GUIZoom-Bench, a benchmark for evaluating model adaptability to zoom, aiming to inspire future research on improving zoom for further training and test-time scaling in GUI grounding tasks.
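
The four zoom properties named in the abstract (pre-zoom, depth, shrink size, minimal crop size) map naturally onto the parameters of an iterative crop-and-requery loop. The sketch below is our reading of that recipe, with a stub grounding model; the paper's exact scheme may differ.

    def ground(region, query):
        # Stub for a VLM grounding call: return a click point in
        # region-local pixel coordinates (here, just the center).
        _, _, w, h = region
        return w // 2, h // 2

    def zoom_click(img_w, img_h, query, depth=3, shrink=0.5, min_crop=256, pre_zoom=1.0):
        """Iteratively zoom toward the predicted point.

        depth    : number of zoom rounds
        shrink   : per-round crop scale factor
        min_crop : stop once the crop would get smaller than this
        pre_zoom : initial crop scale before the first query
        """
        w, h = int(img_w * pre_zoom), int(img_h * pre_zoom)
        region = ((img_w - w) // 2, (img_h - h) // 2, w, h)  # (x0, y0, w, h)
        for _ in range(depth):
            x0, y0, w, h = region
            lx, ly = ground(region, query)
            gx, gy = x0 + lx, y0 + ly                        # lift to global coords
            nw, nh = int(w * shrink), int(h * shrink)
            if min(nw, nh) < min_crop:
                break
            region = (max(0, min(gx - nw // 2, img_w - nw)),
                      max(0, min(gy - nh // 2, img_h - nh)), nw, nh)
        return gx, gy

    print(zoom_click(1920, 1080, "the settings icon"))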

[88] Concept-based Explainable Data Mining with VLM for 3D Detection

Mai Tsujimoto

Main category: cs.CV

TL;DR: A cross-modal framework uses 2D Vision-Language Models to mine rare objects from driving scenes, improving 3D object detection with less training data.

Motivation: Rare-object detection in autonomous driving is challenging with point cloud data alone. VLMs have strong image understanding capabilities but haven't been fully explored for enhancing 3D object detection through intelligent data mining.

Method: A novel cross-modal framework that leverages 2D VLMs to identify and mine rare objects from driving scenes. Combines object detection, semantic feature extraction, dimensionality reduction, and multi-faceted outlier detection (Isolation Forest + t-SNE) with concept-based filtering into a cohesive pipeline.

Result: Experiments on nuScenes dataset show the concept-guided data mining strategy enhances 3D object detection performance while using only a fraction of training data. Particularly notable improvements for challenging categories like trailers and bicycles compared to random data sampling.

Conclusion: The framework effectively reduces annotation burden by focusing on valuable training samples and has substantial implications for efficient dataset curation in safety-critical autonomous systems.

Abstract: Rare-object detection remains a challenging task in autonomous driving systems, particularly when relying solely on point cloud data. Although Vision-Language Models (VLMs) exhibit strong capabilities in image understanding, their potential to enhance 3D object detection through intelligent data mining has not been fully explored. This paper proposes a novel cross-modal framework that leverages 2D VLMs to identify and mine rare objects from driving scenes, thereby improving 3D object detection performance. Our approach synthesizes complementary techniques such as object detection, semantic feature extraction, dimensionality reduction, and multi-faceted outlier detection into a cohesive, explainable pipeline that systematically identifies rare but critical objects in driving scenes. By combining Isolation Forest and t-SNE-based outlier detection methods with concept-based filtering, the framework effectively identifies semantically meaningful rare objects. A key strength of this approach lies in its ability to extract and annotate targeted rare object concepts such as construction vehicles, motorcycles, and barriers. This substantially reduces the annotation burden and focuses only on the most valuable training samples. Experiments on the nuScenes dataset demonstrate that this concept-guided data mining strategy enhances the performance of 3D object detection models while utilizing only a fraction of the training data, with particularly notable improvements for challenging object categories such as trailers and bicycles compared with the same amount of random data. This finding has substantial implications for the efficient curation of datasets in safety-critical autonomous systems.
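
The outlier-mining stage combines Isolation Forest with t-SNE over semantic embeddings; a minimal reproduction of that combination on synthetic features follows (the embedding source, contamination rate, and cluster layout are assumptions):

    import numpy as np
    from sklearn.ensemble import IsolationForest
    from sklearn.manifold import TSNE

    rng = np.random.default_rng(0)
    # Stand-in for VLM object embeddings: a dense cluster plus a few rare objects.
    common = rng.normal(0.0, 1.0, size=(500, 64))
    rare = rng.normal(6.0, 1.0, size=(10, 64))
    feats = np.vstack([common, rare])

    # Isolation Forest flags anomalous embeddings (-1 = outlier).
    labels = IsolationForest(contamination=0.03, random_state=0).fit_predict(feats)
    print("mined candidates:", np.where(labels == -1)[0])

    # t-SNE gives a 2D map for the explainability / inspection step.
    xy = TSNE(n_components=2, random_state=0, perplexity=30).fit_transform(feats)
    print("2D layout shape:", xy.shape)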

[89] WaterWave: Bridging Underwater Image Enhancement into Video Streams via Wavelet-based Temporal Consistency Field

Qi Zhu, Jingyi Zhang, Naishan Zheng, Wei Yu, Jinghao Zhang, Deyi Ji, Feng Zhao

Main category: cs.CV

TL;DR: WaterWave: An implicit representation method for enhancing underwater videos using wavelet-based temporal consistency fields to address temporal inconsistency issues in frame-by-frame single-image enhancement approaches.

Motivation: Underwater video pairs are difficult to obtain due to complex underwater imaging, and existing methods that apply single-image enhancement frame by frame lack temporal consistency, resulting in unnatural video output.

Method: Proposes WaterWave, an implicit representation method using wavelet-based temporal consistency fields. It leverages temporal consistency priors from local temporal frequency perspective, progressively filters inconsistent components while preserving motion details, and includes an underwater flow correction module to rectify estimated flows considering underwater transmission effects.

Result: WaterWave significantly enhances video quality from single-image underwater enhancement methods and demonstrates high potential in downstream tracking tasks (UOSTrack and MAT), outperforming original videos by 19.7% and 9.7% on precision metrics respectively.

Conclusion: The proposed WaterWave method effectively addresses temporal consistency issues in underwater video enhancement through wavelet-based temporal consistency fields and flow correction, achieving natural-flowing videos and improving performance in downstream applications.

Abstract: Underwater video pairs are difficult to obtain due to the complexity of underwater imaging. As a result, most existing underwater video enhancement methods directly apply a single-image enhancement model frame by frame, but a natural issue is the lack of temporal consistency. To alleviate this problem, we rethink the temporal manifold inherent in natural videos and observe a temporal consistency prior in dynamic scenes from the local temporal frequency perspective. Building upon this specific prior and the no-paired-data condition, we propose an implicit representation for enhanced video signals, conducted in the wavelet-based temporal consistency field, WaterWave. Specifically, under the constraints of the prior, we progressively filter and attenuate the inconsistent components while preserving motion details and scenes, achieving a natural-flowing video. Furthermore, to represent temporal frequency bands more accurately, an underwater flow correction module is designed to rectify estimated flows considering the transmission in underwater scenes. Extensive experiments demonstrate that WaterWave significantly enhances the quality of videos generated using single-image underwater enhancements. Additionally, our method demonstrates high potential in downstream underwater tracking tasks, such as UOSTrack and MAT, outperforming the original videos by a large margin, i.e., by 19.7% and 9.7% in precision, respectively.
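
WaterWave operates on local temporal frequencies; while the full method optimizes a consistency field with flow correction, its core intuition, attenuating inconsistent temporal detail coefficients while keeping motion, can be illustrated with plain temporal wavelet shrinkage. The wavelet choice, thresholds, and toy data below are assumptions.

    import numpy as np
    import pywt

    rng = np.random.default_rng(0)
    # Toy video: 16 frames of 8x8 gray pixels with per-frame flicker.
    frames = np.tile(rng.normal(size=(1, 8, 8)), (16, 1, 1))
    frames += 0.3 * rng.normal(size=frames.shape)  # temporal inconsistency

    # Decompose along the time axis, then shrink the detail (high-frequency) bands.
    coeffs = pywt.wavedec(frames, "haar", axis=0)
    coeffs[1:] = [np.where(np.abs(c) < 0.5, 0.0, 0.5 * c) for c in coeffs[1:]]
    smoothed = pywt.waverec(coeffs, "haar", axis=0)

    print("temporal std before/after:",
          frames.std(axis=0).mean(), smoothed.std(axis=0).mean())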

[90] Decoding with Structured Awareness: Integrating Directional, Frequency-Spatial, and Structural Attention for Medical Image Segmentation

Fan Zhang, Zhiwei Gu, Hua Wang

Main category: cs.CV

TL;DR: A novel decoder framework for medical image segmentation with three modules: ACFA for adaptive cross-fusion attention, TFFA for triple feature fusion across spatial, Fourier and wavelet domains, and SMMM for structural-aware multi-scale masking to optimize skip connections.

Motivation: To address limitations of Transformer decoders in capturing edge details, recognizing local textures, and modeling spatial continuity in medical image segmentation tasks.

Method: Three core modules: 1) Adaptive Cross-Fusion Attention (ACFA) with learnable guidance in three directions; 2) Triple Feature Fusion Attention (TFFA) combining spatial, Fourier and wavelet domains; 3) Structural-aware Multi-scale Masking Module (SMMM) for optimizing encoder-decoder skip connections.

Result: The framework significantly enhances performance in high-precision medical segmentation tasks like tumor segmentation and organ boundary extraction, improving both accuracy and generalization.

Conclusion: Provides an efficient and practical solution for medical image segmentation by addressing traditional decoder shortcomings through synergistic multi-domain feature integration and structural optimization.

Abstract: To address the limitations of Transformer decoders in capturing edge details, recognizing local textures and modeling spatial continuity, this paper proposes a novel decoder framework specifically designed for medical image segmentation, comprising three core modules. First, the Adaptive Cross-Fusion Attention (ACFA) module integrates channel feature enhancement with spatial attention mechanisms and introduces learnable guidance in three directions (planar, horizontal, and vertical) to enhance responsiveness to key regions and structural orientations. Second, the Triple Feature Fusion Attention (TFFA) module fuses features from Spatial, Fourier and Wavelet domains, achieving joint frequency-spatial representation that strengthens global dependency and structural modeling while preserving local information such as edges and textures, making it particularly effective in complex and blurred boundary scenarios. Finally, the Structural-aware Multi-scale Masking Module (SMMM) optimizes the skip connections between encoder and decoder by leveraging multi-scale context and structural saliency filtering, effectively reducing feature redundancy and improving semantic interaction quality. Working synergistically, these modules not only address the shortcomings of traditional decoders but also significantly enhance performance in high-precision tasks such as tumor segmentation and organ boundary extraction, improving both segmentation accuracy and model generalization. Experimental results demonstrate that this framework provides an efficient and practical solution for medical image segmentation.

[91] Rethinking Infrared Small Target Detection: A Foundation-Driven Efficient Paradigm

Chuang Yu, Jinmiao Zhao, Yunpeng Liu, Yaokun Li, Xiujun Shu, Yuanhao Feng, Bo Wang, Yimian Dai, Xiangyu Yue

Main category: cs.CV

TL;DR: FDEP framework adapts frozen visual foundation models for infrared small target detection, achieving SOTA performance with no inference overhead through semantic alignment and implicit distillation.

Motivation: Large visual foundation models have strong generalization but their potential for single-frame infrared small target (SIRST) detection remains unexplored. Current methods lack efficient adaptation of VFM representations without adding inference burden.

Method: Proposes Foundation-Driven Efficient Paradigm (FDEP) with: 1) Semantic Alignment Modulation Fusion (SAMF) module for dynamic alignment of VFM global semantic priors with task-specific features; 2) Collaborative Optimization-based Implicit Self-Distillation (CO-ISD) strategy for implicit semantic transfer between main and lightweight branches via parameter sharing and synchronized backpropagation; 3) Holistic SIRST Evaluation (HSE) metric for unified multi-threshold evaluation.

Result: Extensive experiments show SIRST detection networks equipped with FDEP achieve state-of-the-art performance on multiple public datasets without additional inference overhead.

Conclusion: FDEP successfully adapts frozen VFM representations to SIRST detection, significantly improving accuracy while maintaining efficiency. The framework provides a foundation-driven paradigm that can be seamlessly integrated into existing encoder-decoder methods.

Abstract: While large-scale visual foundation models (VFMs) exhibit strong generalization across diverse visual domains, their potential for single-frame infrared small target (SIRST) detection remains largely unexplored. To fill this gap, we systematically introduce the frozen representations from VFMs into the SIRST task for the first time and propose a Foundation-Driven Efficient Paradigm (FDEP), which can seamlessly adapt to existing encoder-decoder-based methods and significantly improve accuracy without additional inference overhead. Specifically, a Semantic Alignment Modulation Fusion (SAMF) module is designed to achieve dynamic alignment and deep fusion of the global semantic priors from VFMs with task-specific features. Meanwhile, to avoid the inference time burden introduced by VFMs, we propose a Collaborative Optimization-based Implicit Self-Distillation (CO-ISD) strategy, which enables implicit semantic transfer between the main and lightweight branches through parameter sharing and synchronized backpropagation. In addition, to unify the fragmented evaluation system, we construct a Holistic SIRST Evaluation (HSE) metric that performs multi-threshold integral evaluation at both pixel-level confidence and target-level robustness, providing a stable and comprehensive basis for fair model comparison. Extensive experiments demonstrate that the SIRST detection networks equipped with our FDEP framework achieve state-of-the-art (SOTA) performance on multiple public datasets. Our code is available at https://github.com/YuChuang1205/FDEP-Framework

[92] Know-Show: Benchmarking Video-Language Models on Spatio-Temporal Grounded Reasoning

Chinthani Sugandhika, Chen Li, Deepu Rajan, Basura Fernando

Main category: cs.CV

TL;DR: Know-Show is a new benchmark for evaluating spatio-temporal grounded reasoning in Video-LMs, revealing significant gaps between current models and human reasoning, with GRAM proposed as a training-free plug-in to improve fine-grained grounding.

Motivation: Current Video-Language Models show impressive multimodal understanding but lack proper spatio-temporal grounding - they can't effectively reason about actions while grounding inferences in visual and temporal evidence.

Method: Created Know-Show benchmark with 2.5K human-authored questions across five scenarios (person, object, person-object, hand-object, and temporal dimensions) from Charades, Action Genome, and Ego4D datasets. Proposed GRAM - a training-free plug-in using attention-based video token selection and explicit timestamp encoding.

Result: Existing Video-LMs (Qwen, VideoLLaVA, GPT-4o, Gemini) struggle to “show what they know” and vice versa, especially in fine-grained hand-object interactions. GRAM improves grounding capabilities without additional training.

Conclusion: Know-Show establishes a unified standard for assessing grounded reasoning in video-language understanding and provides insights for developing more interpretable and reliable multimodal reasoning systems.

Abstract: Large Video-Language Models (Video-LMs) have achieved impressive progress in multimodal understanding, yet their reasoning remains weakly grounded in space and time. We present Know-Show, a new benchmark designed to evaluate spatio-temporal grounded reasoning, the ability of a model to reason about actions and their semantics while simultaneously grounding its inferences in visual and temporal evidence. Know-Show unifies reasoning and localization within a single evaluation framework consisting of five complementary scenarios across spatial (person, object, person-object, and hand-object) and temporal dimensions. Built from Charades, Action Genome, and Ego4D with 2.5K human-authored questions, the benchmark exposes significant gaps between current Video-LMs and human reasoning. To bridge this gap, we propose GRAM, a training-free plug-in that augments Video-LMs with fine-grained grounding through attention-based video token selection and explicit timestamp encoding. Extensive experiments across open and closed Video-LMs (Qwen, VideoLLaVA, GPT-4o, Gemini, etc.) reveal that existing models struggle to “show what they know” and vice versa, especially in fine-grained hand-object interactions. Know-Show establishes a unified standard for assessing grounded reasoning in video-language understanding and provides insights toward developing interpretable and reliable multimodal reasoning systems. We will release the code at https://github.com/LUNAProject22/Know-Show.

[93] DashFusion: Dual-stream Alignment with Hierarchical Bottleneck Fusion for Multimodal Sentiment Analysis

Yuhua Wen, Qifei Li, Yingying Zhou, Yingming Gao, Zhengqi Wen, Jianhua Tao, Ya Li

Main category: cs.CV

TL;DR: DashFusion is a novel multimodal sentiment analysis framework that addresses alignment and fusion challenges through dual-stream alignment (temporal and semantic) and hierarchical bottleneck fusion, achieving state-of-the-art performance on benchmark datasets.

Motivation: Multimodal sentiment analysis faces challenges with alignment (temporal and semantic synchronization across modalities) and fusion (integrating aligned features). Existing methods often handle these issues separately, leading to performance and efficiency limitations.

Method: Proposes DashFusion with: 1) Dual-stream alignment module using cross-modal attention for temporal alignment and contrastive learning for semantic alignment; 2) Supervised contrastive learning to refine modality features; 3) Hierarchical bottleneck fusion that progressively integrates multimodal information through compressed bottleneck tokens.

Result: Achieves state-of-the-art performance on CMU-MOSI, CMU-MOSEI, and CH-SIMS datasets across various metrics. Ablation studies confirm the effectiveness of the alignment and fusion techniques.

Conclusion: DashFusion effectively addresses multimodal sentiment analysis challenges by jointly handling alignment and fusion, balancing performance and computational efficiency through its novel dual-stream alignment and hierarchical bottleneck fusion approach.

Abstract: Multimodal sentiment analysis (MSA) integrates various modalities, such as text, image, and audio, to provide a more comprehensive understanding of sentiment. However, effective MSA is challenged by alignment and fusion issues. Alignment requires synchronizing both temporal and semantic information across modalities, while fusion involves integrating these aligned features into a unified representation. Existing methods often address alignment or fusion in isolation, leading to limitations in performance and efficiency. To tackle these issues, we propose a novel framework called Dual-stream Alignment with Hierarchical Bottleneck Fusion (DashFusion). Firstly, dual-stream alignment module synchronizes multimodal features through temporal and semantic alignment. Temporal alignment employs cross-modal attention to establish frame-level correspondences among multimodal sequences. Semantic alignment ensures consistency across the feature space through contrastive learning. Secondly, supervised contrastive learning leverages label information to refine the modality features. Finally, hierarchical bottleneck fusion progressively integrates multimodal information through compressed bottleneck tokens, which achieves a balance between performance and computational efficiency. We evaluate DashFusion on three datasets: CMU-MOSI, CMU-MOSEI, and CH-SIMS. Experimental results demonstrate that DashFusion achieves state-of-the-art performance across various metrics, and ablation studies confirm the effectiveness of our alignment and fusion techniques. The codes for our experiments are available at https://github.com/ultramarineX/DashFusion.
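
Hierarchical bottleneck fusion routes all cross-modal interaction through a small set of learnable bottleneck tokens. The single-layer PyTorch sketch below shows that routing pattern; the hierarchy depth, dimensions, and the distribute-back step are our assumptions about one plausible instantiation.

    import torch
    import torch.nn as nn

    class BottleneckFusion(nn.Module):
        def __init__(self, dim=128, n_bottleneck=8, heads=4):
            super().__init__()
            self.bottleneck = nn.Parameter(torch.randn(1, n_bottleneck, dim))
            self.collect = nn.MultiheadAttention(dim, heads, batch_first=True)
            self.distribute = nn.MultiheadAttention(dim, heads, batch_first=True)

        def forward(self, text, audio, video):
            # Compress all modalities into the few bottleneck tokens...
            ctx = torch.cat([text, audio, video], dim=1)
            z = self.bottleneck.expand(text.size(0), -1, -1)
            z, _ = self.collect(z, ctx, ctx)
            # ...then condition the text stream on the fused tokens.
            out, _ = self.distribute(text, z, z)
            return out

    fuse = BottleneckFusion()
    t, a, v = (torch.randn(2, n, 128) for n in (20, 50, 30))
    print(fuse(t, a, v).shape)  # torch.Size([2, 20, 128])

Because every cross-modal interaction must pass through the compressed tokens, the attention cost stays linear in the sequence lengths rather than quadratic across the concatenated modalities, which is the performance/efficiency balance the abstract refers to.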

[94] VOST-SGG: VLM-Aided One-Stage Spatio-Temporal Scene Graph Generation

Chinthani Sugandhika, Chen Li, Deepu Rajan, Basura Fernando

Main category: cs.CV

TL;DR: VOST-SGG: A VLM-aided one-stage spatio-temporal scene graph generation framework that integrates vision-language models to address limitations in current DETR-style approaches, achieving state-of-the-art performance on Action Genome dataset.

Motivation: Current DETR-style single-stage ST-SGG models have two key limitations: (1) learnable queries are semantically uninformed and instance-agnostically initialized, and (2) they rely exclusively on unimodal visual features for predicate classification, lacking semantic grounding.

Method: Proposes VOST-SGG with two main innovations: (1) dual-source query initialization strategy that disentangles “what to attend to” from “where to attend” for semantically grounded reasoning, and (2) multi-modal feature bank that fuses visual, textual, and spatial cues derived from VLMs for improved predicate classification.

Result: Extensive experiments on Action Genome dataset demonstrate state-of-the-art performance, validating the effectiveness of integrating VLM-aided semantic priors and multi-modal features for ST-SGG.

Conclusion: Integrating vision-language models’ common sense reasoning capabilities into ST-SGG pipelines addresses key limitations of current approaches and significantly improves performance, offering a promising direction for more interpretable video understanding.

Abstract: Spatio-temporal scene graph generation (ST-SGG) aims to model objects and their evolving relationships across video frames, enabling interpretable representations for downstream reasoning tasks such as video captioning and visual question answering. Despite recent advancements in DETR-style single-stage ST-SGG models, they still suffer from several key limitations. First, while these models rely on attention-based learnable queries as a core component, these learnable queries are semantically uninformed and instance-agnostically initialized. Second, these models rely exclusively on unimodal visual features for predicate classification. To address these challenges, we propose VOST-SGG, a VLM-aided one-stage ST-SGG framework that integrates the common sense reasoning capabilities of vision-language models (VLMs) into the ST-SGG pipeline. First, we introduce the dual-source query initialization strategy that disentangles what to attend to from where to attend, enabling semantically grounded what-where reasoning. Furthermore, we propose a multi-modal feature bank that fuses visual, textual, and spatial cues derived from VLMs for improved predicate classification. Extensive experiments on the Action Genome dataset demonstrate that our approach achieves state-of-the-art performance, validating the effectiveness of integrating VLM-aided semantic priors and multi-modal features for ST-SGG. We will release the code at https://github.com/LUNAProject22/VOST.

[95] See in Depth: Training-Free Surgical Scene Segmentation with Monocular Depth Priors

Kunyi Yang, Qingyu Wang, Cheng Yuan, Yutong Ban

Main category: cs.CV

TL;DR: DepSeg is a training-free surgical scene segmentation framework that uses monocular depth as geometric prior with pretrained vision models, achieving significant improvement over SAM2 baseline on CholecSeg8k dataset.

Motivation: Pixel-wise segmentation in laparoscopic surgery is essential for computer-assisted surgery but difficult to scale due to the high cost of dense annotations. There's a need for annotation-efficient approaches.

Method: DepSeg uses pretrained monocular depth estimation to get depth maps, creates depth-guided point prompts, uses SAM2 to generate class-agnostic masks, pools pretrained visual features for each mask, and classifies via template matching against a template bank from annotated frames.

Result: On CholecSeg8k dataset, DepSeg achieves 35.9% mIoU vs. 14.7% for direct SAM2 auto segmentation baseline. It maintains competitive performance even with only 10-20% of object templates.

Conclusion: Depth-guided prompting and template-based classification provide an annotation-efficient segmentation approach for surgical scenes, leveraging geometric priors and pretrained foundation models without requiring training.

Abstract: Pixel-wise segmentation of laparoscopic scenes is essential for computer-assisted surgery but difficult to scale due to the high cost of dense annotations. We propose depth-guided surgical scene segmentation (DepSeg), a training-free framework that utilizes monocular depth as a geometric prior together with pretrained vision foundation models. DepSeg first estimates a relative depth map with a pretrained monocular depth estimation network and proposes depth-guided point prompts, which SAM2 converts into class-agnostic masks. Each mask is then described by a pooled pretrained visual feature and classified via template matching against a template bank built from annotated frames. On the CholecSeg8k dataset, DepSeg improves over a direct SAM2 auto segmentation baseline (35.9% vs. 14.7% mIoU) and maintains competitive performance even when using only 10–20% of the object templates. These results show that depth-guided prompting and template-based classification offer an annotation-efficient segmentation approach.
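
The classification step, pooling one feature per mask and matching it against a template bank, reduces to a cosine-similarity argmax. A minimal sketch with random stand-ins for the pooled foundation-model features (the feature dimension, bank size, and class count are assumptions):

    import numpy as np

    rng = np.random.default_rng(0)
    feat_dim = 256

    # Template bank: pooled features from annotated frames, one row per template.
    bank_feats = rng.normal(size=(30, feat_dim))
    bank_labels = rng.integers(0, 8, size=30)  # 8 surgical classes (assumed)

    def classify_mask(mask_feat):
        """Cosine-similarity template matching for one class-agnostic mask."""
        a = mask_feat / np.linalg.norm(mask_feat)
        b = bank_feats / np.linalg.norm(bank_feats, axis=1, keepdims=True)
        sims = b @ a
        return int(bank_labels[np.argmax(sims)]), float(sims.max())

    label, score = classify_mask(rng.normal(size=feat_dim))
    print(f"predicted class {label} (similarity {score:.3f})")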

[96] Ideal Observer for Segmentation of Dead Leaves Images

Swantje Mahncke, Malte Ott

Main category: cs.CV

TL;DR: The paper presents a Bayesian ideal observer framework for image segmentation using dead leaves models, providing theoretical foundations and computational feasibility analysis for principled performance bounds.

Motivation: To develop a principled theoretical framework for studying visual segmentation decisions by creating an ideal observer model that can serve as an upper-bound performance benchmark for comparing human vision and computer vision algorithms.

Method: Extends previous work on dead leaves models (generative models with object position, shape, color, and texture distributions) to derive a Bayesian ideal observer for pixel partitioning. Provides step-by-step explanations for computing posterior probabilities and analyzes factors affecting practical computational feasibility.

Result: Develops a theoretical approach for computing posterior probabilities of pixel partitions based on independent dead leaves model distributions, establishing a principled mathematical framework for segmentation analysis.

Conclusion: The dead leaves image model and associated Bayesian ideal observer provide a principled upper-bound on segmentation performance that can be used to benchmark both human visual perception and computer vision algorithms, particularly for limited pixel sets.

Abstract: The human visual environment comprises different surfaces distributed in space. The parts of a scene that are visible at any one time are governed by the occlusion of overlapping objects. In this work we consider “dead leaves” models, which replicate these occlusions when generating images by layering objects on top of each other. A dead leaves model is a generative model composed of distributions for object position, shape, color and texture. An image is generated from a dead leaves model by sampling objects (“leaves”) from these distributions until a stopping criterion is reached, usually when the image is fully covered or a given number of leaves has been sampled. Here, we describe a theoretical approach, based on previous work, to derive a Bayesian ideal observer for the partition of a given set of pixels based on independent dead leaves model distributions. Extending previous work, we provide step-by-step explanations for the computation of the posterior probability as well as describe factors that determine the feasibility of practically applying this computation. The dead leaves image model and the associated ideal observer can be applied to study segmentation decisions in a limited number of pixels, providing a principled upper-bound on performance, to which humans and vision algorithms could be compared.
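
The derivation ultimately instantiates Bayes' rule over pixel partitions. In generic form (our notation, not the paper's):

    P(\pi \mid I) = \frac{P(I \mid \pi)\, P(\pi)}{\sum_{\pi'} P(I \mid \pi')\, P(\pi')}

where $\pi$ ranges over partitions of the pixel set, the likelihood and prior are both induced by the dead leaves distributions over object position, shape, color and texture, and the sum over all candidate partitions is precisely the term whose combinatorial growth governs the practical feasibility the authors analyze.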

[97] Conscious Gaze: Adaptive Attention Mechanisms for Hallucination Mitigation in Vision-Language Models

Weijue Bu, Guan Yuan, Guixian Zhang

Main category: cs.CV

TL;DR: CG-VLM is a training-free inference framework that uses game-theoretic interpretability to detect and correct text inertia in VLMs, reducing object hallucinations by selectively reorienting attention to visual tokens when needed.

Motivation: Large Vision-Language Models suffer from text inertia where attention drifts from visual evidence to linguistic priors, causing object hallucinations. Existing methods either intervene only at output logits (can't correct internal reasoning) or use heuristic internal controls without principled grounding.

Method: CG-VLM uses a Cognitive Demand Sensor based on Harsanyi interactions to estimate vision-text synergy and detect when visual grounding is needed. A Focused Consensus Induction module then selectively reorients mid-layer attention toward visual tokens before collapse into text priors.

Result: Achieves state-of-the-art results on POPE and CHAIR benchmarks across multiple VLMs (InstructBLIP, LLaVA, Qwen-VL, mPLUG) while preserving general capabilities.

Conclusion: Token-level sensing enables precise, context-aware intervention without compromising foundational knowledge, demonstrating that game-theoretic interpretability can be converted into actionable decoding control.

Abstract: Large Vision-Language Models (VLMs) often exhibit text inertia, where attention drifts from visual evidence toward linguistic priors, resulting in object hallucinations. Existing decoding strategies intervene only at the output logits and thus cannot correct internal reasoning drift, while recent internal-control methods based on heuristic head suppression or global steering vectors lack principled grounding. We introduce Conscious Gaze (CG-VLM), a training-free, inference-time framework that converts game-theoretic interpretability into actionable decoding control. A Cognitive Demand Sensor built on Harsanyi interactions estimates instantaneous vision-text synergy and identifies moments when visual grounding is necessary. Conditioned on this signal, a Focused Consensus Induction module selectively reorients mid-layer attention toward visual tokens before collapse into text priors. CG-VLM achieves state-of-the-art results on POPE and CHAIR across InstructBLIP, LLaVA, Qwen-VL, and mPLUG, while preserving general capabilities, demonstrating that token-level sensing enables precise, context-aware intervention without compromising foundational knowledge.

[98] 2K-Characters-10K-Stories: A Quality-Gated Stylized Narrative Dataset with Disentangled Control and Sequence Consistency

Xingxi Yin, Yicheng Li, Gong Yan, Chenglin Li, Jian Zhao, Cong Huang, Yue Deng, Yin Zhang

Main category: cs.CV

TL;DR: The paper introduces 2K-Characters-10K-Stories, a new dataset for controllable visual storytelling that addresses identity consistency challenges by decoupling stable identities from transient attributes like pose and expression.

Motivation: Existing datasets lack sufficient fidelity and fail to disentangle stable identities from transient attributes, limiting structured control over pose, expression, and scene composition for reliable sequential synthesis in visual storytelling.

Method: The authors create a multi-modal stylized narrative dataset with 2,000 unique characters across 10,000 stories. They use a Human-in-the-Loop pipeline with expert-verified character templates and LLM-guided narrative planning, plus a decoupled control scheme that separates persistent identity from transient attributes. A Quality-Gated loop integrates MMLM evaluation, Auto-Prompt Tuning, and Local Image Editing to enforce pixel-level consistency.

Result: Models fine-tuned on the new dataset achieve performance comparable to closed-source models in generating visual narratives, demonstrating the effectiveness of the approach for sequential identity consistency.

Conclusion: The 2K-Characters-10K-Stories dataset successfully addresses the challenge of sequential identity consistency in controllable visual storytelling by providing explicit, decoupled control signals and enforcing pixel-level consistency through a comprehensive quality assurance pipeline.

Abstract: Sequential identity consistency under precise transient attribute control remains a long-standing challenge in controllable visual storytelling. Existing datasets lack sufficient fidelity and fail to disentangle stable identities from transient attributes, limiting structured control over pose, expression, and scene composition and thus constraining reliable sequential synthesis. To address this gap, we introduce 2K-Characters-10K-Stories, a multi-modal stylized narrative dataset of 2,000 uniquely stylized characters appearing across 10,000 illustration stories. It is the first dataset that pairs large-scale unique identities with explicit, decoupled control signals for sequential identity consistency. We introduce a Human-in-the-Loop (HiL) pipeline that leverages expert-verified character templates and LLM-guided narrative planning to generate highly aligned structured data. A decoupled control scheme separates persistent identity from transient attributes – pose and expression – while a Quality-Gated loop integrating MMLM evaluation, Auto-Prompt Tuning, and Local Image Editing enforces pixel-level consistency. Extensive experiments demonstrate that models fine-tuned on our dataset achieve performance comparable to closed-source models in generating visual narratives.

[99] ProPhy: Progressive Physical Alignment for Dynamic World Simulation

Zijun Wang, Panwen Hu, Jing Wang, Terry Jingchen Zhang, Yuhao Cheng, Long Chen, Yiqiang Yan, Zutao Jiang, Hanhui Li, Xiaodan Liang

Main category: cs.CV

TL;DR: ProPhy is a Progressive Physical Alignment Framework for physics-aware video generation that uses Mixture-of-Physics-Experts to extract fine-grained physical priors and align generated content with localized physical cues.

Motivation: Current video generation models struggle with physical consistency, especially for large-scale or complex dynamics, because they respond isotropically to physical prompts and lack fine-grained alignment between generated content and localized physical cues.

Method: Two-stage Mixture-of-Physics-Experts (MoPE) mechanism: Semantic Experts infer semantic-level physical principles from text, and Refinement Experts capture token-level physical dynamics. Also includes physical alignment strategy to transfer physical reasoning capabilities from vision-language models to Refinement Experts.

Result: Extensive experiments on physics-aware video generation benchmarks show ProPhy produces more realistic, dynamic, and physically coherent results than state-of-the-art methods.

Conclusion: ProPhy successfully addresses physical consistency challenges in video generation through explicit physics-aware conditioning and anisotropic generation, enabling better world simulation capabilities.

Abstract: Recent advances in video generation have shown remarkable potential for constructing world simulators. However, current models still struggle to produce physically consistent results, particularly when handling large-scale or complex dynamics. This limitation arises primarily because existing approaches respond isotropically to physical prompts and neglect the fine-grained alignment between generated content and localized physical cues. To address these challenges, we propose ProPhy, a Progressive Physical Alignment Framework that enables explicit physics-aware conditioning and anisotropic generation. ProPhy employs a two-stage Mixture-of-Physics-Experts (MoPE) mechanism for discriminative physical prior extraction, where Semantic Experts infer semantic-level physical principles from textual descriptions, and Refinement Experts capture token-level physical dynamics. This mechanism allows the model to learn fine-grained, physics-aware video representations that better reflect underlying physical laws. Furthermore, we introduce a physical alignment strategy that transfers the physical reasoning capabilities of vision-language models (VLMs) into the Refinement Experts, facilitating a more accurate representation of dynamic physical phenomena. Extensive experiments on physics-aware video generation benchmarks demonstrate that ProPhy produces more realistic, dynamic, and physically coherent results than existing state-of-the-art methods.

[100] MedDIFT: Multi-Scale Diffusion-Based Correspondence in 3D Medical Imaging

Xingyu Zhang, Anna Reithmeir, Fryderyk Kögl, Rickmer Braren, Julia A. Schnabel, Daniel M. Lang

Main category: cs.CV

TL;DR: MedDIFT is a training-free 3D medical image registration framework that uses multi-scale features from pretrained latent diffusion models as voxel descriptors, achieving state-of-the-art performance without task-specific training.

Motivation: Current medical image registration methods rely on local intensity-based similarity measures that fail to capture global semantic structure and perform poorly in low-contrast or anatomically variable regions, limiting their accuracy for longitudinal analysis, lesion tracking, and image-guided interventions.

Method: MedDIFT leverages multi-scale features from a pretrained latent medical diffusion model as voxel descriptors, fuses diffusion activations into rich voxel-wise descriptors, and matches them via cosine similarity with an optional local-search prior. The approach is training-free and uses intermediate representations from diffusion models that encode geometric and semantic information.

Result: On a publicly available lung CT dataset, MedDIFT achieves correspondence accuracy comparable to state-of-the-art learning-based UniGradICON model and surpasses conventional B-spline-based registration. Ablation experiments show that multi-level feature fusion and modest diffusion noise improve performance.

Conclusion: Pretrained diffusion models provide effective geometric and semantic features for medical image registration without requiring task-specific training, offering a promising alternative to conventional intensity-based methods and learning-based approaches that need extensive training.

Abstract: Accurate spatial correspondence between medical images is essential for longitudinal analysis, lesion tracking, and image-guided interventions. Medical image registration methods rely on local intensity-based similarity measures, which fail to capture global semantic structure and often yield mismatches in low-contrast or anatomically variable regions. Recent advances in diffusion models suggest that their intermediate representations encode rich geometric and semantic information. We present MedDIFT, a training-free 3D correspondence framework that leverages multi-scale features from a pretrained latent medical diffusion model as voxel descriptors. MedDIFT fuses diffusion activations into rich voxel-wise descriptors and matches them via cosine similarity, with an optional local-search prior. On a publicly available lung CT dataset, MedDIFT achieves correspondence accuracy comparable to the state-of-the-art learning-based UniGradICON model and surpasses conventional B-spline-based registration, without requiring any task-specific model training. Ablation experiments confirm that multi-level feature fusion and modest diffusion noise improve performance.
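
The matching rule, cosine similarity between fused multi-scale descriptors restricted to an optional local search window, is compact enough to sketch. The toy volume sizes, descriptor dimension, and window radius below are assumptions; identical volumes are used so the identity match is the expected output.

    import numpy as np

    rng = np.random.default_rng(0)
    D, H, W, C = 6, 6, 6, 32  # toy volume and descriptor size

    # Stand-ins for fused multi-scale diffusion features of fixed/moving volumes.
    fixed = rng.normal(size=(D, H, W, C))
    moving = fixed.copy()      # identical volumes -> identity correspondence

    def match(vox, radius=2):
        """Find the moving-volume voxel best matching fixed voxel `vox`
        by cosine similarity, searching a local window around it."""
        q = fixed[vox] / np.linalg.norm(fixed[vox])
        best, best_sim = None, -1.0
        z0, y0, x0 = vox
        for z in range(max(0, z0 - radius), min(D, z0 + radius + 1)):
            for y in range(max(0, y0 - radius), min(H, y0 + radius + 1)):
                for x in range(max(0, x0 - radius), min(W, x0 + radius + 1)):
                    v = moving[z, y, x]
                    sim = float(q @ (v / np.linalg.norm(v)))
                    if sim > best_sim:
                        best, best_sim = (z, y, x), sim
        return best, best_sim

    print(match((3, 3, 3)))  # ((3, 3, 3), ~1.0) for identical volumes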

[101] Learning High-Fidelity Cloth Animation via Skinning-Free Image Transfer

Rong Wang, Wei Mao, Changsheng Lu, Hongdong Li

Main category: cs.CV

TL;DR: A skinning-free approach for 3D garment animation that decouples low-frequency shape and high-frequency wrinkles, using 2D image transfer and pretrained image models to generate high-fidelity deformations.

Motivation: Existing skinning-based methods produce misaligned shapes when posing garments, which corrupts high-frequency signals and fails to recover realistic wrinkles. There's a need for better garment animation for virtual try-on and extended reality applications.

Method: 1) Decouple low-frequency vertex positions and high-frequency vertex normals for independent supervision; 2) Encode vertex attributes as texture images for 2D image transfer; 3) Leverage pretrained image models for fine-grained wrinkle details; 4) Use multimodal fusion to recover 3D garments from transferred images.

Result: Significantly improves animation quality on various garment types and recovers finer wrinkles than state-of-the-art methods, with superior scalability for diverse garment topologies without manual UV partition.

Conclusion: The skinning-free approach effectively decouples frequency modalities, enables direct supervision, leverages powerful image models for detail recovery, and provides robust 3D garment deformation with high-fidelity wrinkles.

Abstract: We present a novel method for generating 3D garment deformations from given body poses, which is key to a wide range of applications, including virtual try-on and extended reality. To simplify the cloth dynamics, existing methods mostly rely on linear blend skinning to obtain the low-frequency posed garment shape and only regress high-frequency wrinkles. However, due to the lack of explicit skinning supervision, such a skinning-based approach often produces misaligned shapes when posing the garment, which corrupts the high-frequency signals and prevents the recovery of high-fidelity wrinkles. To tackle this issue, we propose a skinning-free approach that independently estimates posed (i) vertex positions for the low-frequency posed garment shape, and (ii) vertex normals for high-frequency local wrinkle details. In this way, each frequency modality can be effectively decoupled and directly supervised by the geometry of the deformed garment. To further improve the visual quality of animation, we propose to encode both vertex attributes as rendered texture images, so that 3D garment deformation can be equivalently achieved via 2D image transfer. This enables us to leverage powerful pretrained image models to recover fine-grained visual details in wrinkles, while maintaining superior scalability for garments of diverse topologies without relying on manual UV partition. Finally, we propose a multimodal fusion to incorporate constraints from both frequency modalities and robustly recover deformed 3D garments from transferred images. Extensive experiments show that our method significantly improves animation quality on various garment types and recovers finer wrinkles than state-of-the-art methods.

[102] InverseCrafter: Efficient Video ReCapture as a Latent Domain Inverse Problem

Yeobin Hong, Suhyeon Lee, Hyungjin Chung, Jong Chul Ye

Main category: cs.CV

TL;DR: InverseCrafter: An efficient inpainting inverse solver for 4D video generation that reformulates the task as a latent space inpainting problem, avoiding costly fine-tuning of Video Diffusion Models.

Motivation: Current approaches to controllable 4D video generation require expensive fine-tuning of pre-trained Video Diffusion Models, which is computationally heavy, needs large datasets, causes architectural modifications, and suffers from catastrophic forgetting of original generative priors.

Method: Reformulates 4D generation as an inpainting problem solved in latent space. Uses a principled mechanism to encode pixel space degradation operators into continuous, multi-channel latent masks, bypassing costly VAE operations and backpropagation.

Result: Achieves comparable novel view generation and superior measurement consistency in camera control tasks with near-zero computational overhead. Also excels at general-purpose video inpainting with editing capabilities.

Conclusion: InverseCrafter provides an efficient alternative to fine-tuning-based approaches for 4D video generation, maintaining generative priors while enabling controllable generation and editing with minimal computational cost.

Abstract: Recent approaches to controllable 4D video generation often rely on fine-tuning pre-trained Video Diffusion Models (VDMs). This dominant paradigm is computationally expensive, requiring large-scale datasets and architectural modifications, and frequently suffers from catastrophic forgetting of the model’s original generative priors. Here, we propose InverseCrafter, an efficient inpainting inverse solver that reformulates the 4D generation task as an inpainting problem solved in the latent space. The core of our method is a principled mechanism to encode the pixel space degradation operator into a continuous, multi-channel latent mask, thereby bypassing the costly bottleneck of repeated VAE operations and backpropagation. InverseCrafter not only achieves comparable novel view generation and superior measurement consistency in camera control tasks with near-zero computational overhead, but also excels at general-purpose video inpainting with editing. Code is available at https://github.com/yeobinhong/InverseCrafter.
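
The abstract's key trick is encoding a pixel-space degradation operator as a continuous, multi-channel latent mask. The paper derives a principled encoding; the naive approximation below (average-pooling a binary mask by an assumed VAE downsampling factor and broadcasting over assumed latent channels) only conveys the shape of the idea.

    import torch
    import torch.nn.functional as F

    pixel_mask = torch.zeros(1, 1, 64, 64)  # 1 = observed, 0 = to inpaint
    pixel_mask[..., 12:44, 12:44] = 1.0

    # Naive latent encoding: average-pool by the VAE stride (8 assumed), so
    # partially observed latent cells get fractional (continuous) weights,
    # then broadcast across the latent channel dimension (4 assumed).
    latent_mask = F.avg_pool2d(pixel_mask, kernel_size=8)
    latent_mask = latent_mask.expand(-1, 4, -1, -1)
    print(latent_mask.shape, latent_mask.unique())  # fractional boundary values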

[103] Fast SceneScript: Accurate and Efficient Structured Language Model via Multi-Token Prediction

Ruihong Yin, Xuepeng Shi, Oleksandr Bailo, Marco Manfredi, Theo Gevers

Main category: cs.CV

TL;DR: Fast SceneScript: A structured language model for 3D scene layout estimation that uses multi-token prediction and confidence-guided decoding to achieve 9x faster inference without accuracy loss.

Motivation: Current perception-generalist approaches using language models achieve SOTA results but are slow due to autoregressive next-token prediction. There is a need for efficient 3D scene layout estimation without compromising accuracy.

Method: Uses multi-token prediction (MTP) to reduce autoregressive iterations, adapts self-speculative decoding for structured language models, introduces confidence-guided decoding with improved scoring, and designs parameter-efficient mechanism to reduce MTP overhead.

Result: Generates up to 9 tokens per decoder inference step without compromising accuracy, adds only ~7.5% additional parameters, and achieves significant acceleration on ASE and Structured3D benchmarks.

Conclusion: Fast SceneScript provides an accurate and efficient solution for 3D scene layout estimation by combining multi-token prediction with reliability filtering, achieving substantial speed improvements while maintaining accuracy.

Abstract: Recent perception-generalist approaches based on language models have achieved state-of-the-art results across diverse tasks, including 3D scene layout estimation, via unified architecture and interface. However, these approaches rely on autoregressive next-token prediction, which is inherently slow. In this work, we introduce Fast SceneScript, a novel structured language model for accurate and efficient 3D scene layout estimation. Our method employs multi-token prediction (MTP) to reduce the number of autoregressive iterations and significantly accelerate inference. While MTP improves speed, unreliable token predictions can significantly reduce accuracy. To filter out unreliable tokens, we adapt self-speculative decoding (SSD) for structured language models and introduce confidence-guided decoding (CGD) with an improved scoring mechanism for token reliability. Furthermore, we design a parameter-efficient mechanism that reduces the parameter overhead of MTP. Extensive experiments on the ASE and Structured3D benchmarks demonstrate that Fast SceneScript can generate up to 9 tokens per decoder inference step without compromising accuracy, while adding only ~7.5% additional parameters.
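
The speed-accuracy mechanism, predicting up to 9 tokens per step but keeping only the prefix whose reliability scores pass a gate, can be sketched as the acceptance loop below. The stub model, threshold, and scoring are assumptions; the paper's CGD scoring is more elaborate.

    import random

    EOS = 0
    random.seed(0)

    def mtp_step(tokens, k=9):
        # Stub for one decoder inference: k candidate next tokens with
        # per-token confidences (decreasing, as later positions are harder).
        preds = [random.randint(1, 99) for _ in range(k)]
        confs = sorted((random.random() for _ in range(k)), reverse=True)
        return preds, confs

    def decode(prompt, max_len=64, k=9, tau=0.6):
        tokens = list(prompt)
        while len(tokens) < max_len:
            preds, confs = mtp_step(tokens, k)
            accept = 1                      # always keep the first token
            while accept < k and confs[accept] >= tau:
                accept += 1                 # extend the accepted prefix
            tokens.extend(preds[:accept])
            if EOS in preds[:accept]:
                break
        return tokens

    print(len(decode([1, 2, 3])), "tokens decoded")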

[104] NormalView: sensor-agnostic tree species classification from backpack and aerial lidar data using geometric projections

Juho Korkeala, Jesse Muhojoki, Josef Taher, Klaara Salolahti, Matti Hyyppä, Antero Kukko, Juha Hyyppä

Main category: cs.CV

TL;DR: NormalView is a sensor-agnostic deep learning method that projects 3D point clouds into 2D images with normal vector information, then uses YOLOv11 for tree species classification, achieving high accuracy on both mobile and airborne laser scanning data.

DetailsMotivation: Mobile laser scanning provides highly accurate tree-level inventory data, but there's a need for effective methods to classify tree species from point cloud data that work across different sensor types and can leverage both geometric and radiometric information.

Method: NormalView converts 3D point clouds into 2D projections by embedding local geometric information (normal vector estimates) into images, then uses YOLOv11 image classification network. The method also investigates the impact of multispectral radiometric intensity information on classification performance.
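
As a rough illustration of the projection idea, the sketch below estimates per-point normals by PCA over local neighbourhoods and renders a side-view image whose channels carry the normal components; the grid size, view direction, and orientation-free `abs` are assumptions, not the paper's exact design:

```python
import numpy as np
from scipy.spatial import cKDTree

def estimate_normals(points, k=16):
    # Classic PCA normals: eigenvector of the smallest eigenvalue of the
    # local neighbourhood covariance. k is an illustrative choice.
    _, idx = cKDTree(points).query(points, k=k)
    normals = np.empty_like(points)
    for i, nbrs in enumerate(idx):
        eigval, eigvec = np.linalg.eigh(np.cov(points[nbrs].T))
        normals[i] = eigvec[:, 0]     # direction of least variance
    return np.abs(normals)            # drop orientation ambiguity

def side_view_image(points, normals, res=128):
    # Project onto the x-z plane (z = height); each pixel keeps the normal
    # of the point closest to the viewer along y (a simple depth test).
    uv = points[:, [0, 2]]
    uv = ((uv - uv.min(0)) / (uv.max(0) - uv.min(0) + 1e-9) * (res - 1)).astype(int)
    img = np.zeros((res, res, 3), dtype=np.float32)
    depth = np.full((res, res), -np.inf)
    for (u, v), y, n in zip(uv, points[:, 1], normals):
        if y > depth[v, u]:
            depth[v, u], img[v, u] = y, n
    return img
```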

Result: Achieved 95.5% overall accuracy (94.8% macro-average) on MLS data (7 species) and 91.8% overall accuracy (79.1% macro-average) on ALS data (9 species). Found that multispectral intensity information improves classification, with best performance using all three channels of multispectral ALS.

Conclusion: Projection-based methods enhanced with geometric information and state-of-the-art image classification backbones can achieve exceptional tree species classification results. These methods are sensor-agnostic and rely primarily on geometric information. The MLS dataset is publicly released.

Abstract: Laser scanning has proven to be an invaluable tool in assessing the decomposition of forest environments. Mobile laser scanning (MLS) has been shown to be highly promising for extremely accurate, tree-level inventory. In this study, we present NormalView, a sensor-agnostic projection-based deep learning method for classifying tree species from point cloud data. NormalView embeds local geometric information into two-dimensional projections, in the form of normal vector estimates, and uses the projections as inputs to an image classification network, YOLOv11. In addition, we investigated the effect of multispectral radiometric intensity information on classification performance. We trained and tested our model on high-density MLS data (7 species, ~5000 pts/m^2), as well as high-density airborne laser scanning (ALS) data (9 species, >1000 pts/m^2). On the MLS data, NormalView achieves an overall accuracy (macro-average accuracy) of 95.5% (94.8%), and 91.8% (79.1%) on the ALS data. We found that having intensity information from multiple scanners provides benefits in tree species classification, and the best model on the multispectral ALS dataset was a model using intensity information from all three channels of the multispectral ALS. This study demonstrates that projection-based methods, when enhanced with geometric information and coupled with state-of-the-art image classification backbones, can achieve exceptional results. Crucially, these methods are sensor-agnostic, relying only on geometric information. Additionally, we publicly release the MLS dataset used in the study.

[105] DistillFSS: Synthesizing Few-Shot Knowledge into a Lightweight Segmentation Model

Pasquale De Marinis, Pieter M. Blok, Uzay Kaymak, Rogier Brussee, Gennaro Vessio, Giovanna Castellano

Main category: cs.CV

TL;DR: DistillFSS: A distillation-based framework for cross-domain few-shot semantic segmentation that embeds support knowledge into model parameters, eliminating need for support images at test time for fast inference.

DetailsMotivation: Cross-domain few-shot semantic segmentation faces challenges: substantial distribution shifts between source/target domains, disjoint label spaces, scarce support images, making episodic methods unreliable and computationally demanding at test time.

Method: Teacher-student distillation framework that internalizes few-shot reasoning into a dedicated layer within student network, eliminating need for support images at test time. Allows efficient extension to novel classes through teacher-driven specialization and scales with fine-tuning.
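
For context, a generic teacher-student distillation loss on per-pixel segmentation logits; this is a standard KD formulation, not necessarily the paper's exact objective:

```python
import torch.nn.functional as F

def segmentation_distill_loss(student_logits, teacher_logits, T=2.0):
    # Logits have shape (B, C, H, W); T is a softening temperature.
    log_p_student = F.log_softmax(student_logits / T, dim=1)
    p_teacher = F.softmax(teacher_logits / T, dim=1)
    # KL summed over classes and pixels, divided by batch size; the T^2
    # factor keeps gradient magnitudes comparable across temperatures.
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * (T * T)
```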

Result: Matches or surpasses state-of-the-art baselines, particularly in multi-class and multi-shot scenarios, while offering substantial efficiency gains. New CD-FSS benchmark introduced spanning medical imaging, industrial inspection, and remote sensing.

Conclusion: DistillFSS provides an effective solution for CD-FSS by embedding support knowledge into parameters, enabling fast lightweight inference while maintaining competitive performance across diverse domains.

Abstract: Cross-Domain Few-Shot Semantic Segmentation (CD-FSS) seeks to segment unknown classes in unseen domains using only a few annotated examples. This setting is inherently challenging: source and target domains exhibit substantial distribution shifts, label spaces are disjoint, and support images are scarce, making standard episodic methods unreliable and computationally demanding at test time. To address these constraints, we propose DistillFSS, a framework that embeds support-set knowledge directly into a model’s parameters through a teacher–student distillation process. By internalizing few-shot reasoning into a dedicated layer within the student network, DistillFSS eliminates the need for support images at test time, enabling fast, lightweight inference, while allowing efficient extension to novel classes in unseen domains through rapid teacher-driven specialization. Combined with fine-tuning, the approach scales efficiently to large support sets and significantly reduces computational overhead. To evaluate the framework under realistic conditions, we introduce a new CD-FSS benchmark spanning medical imaging, industrial inspection, and remote sensing, with disjoint label spaces and variable support sizes. Experiments show that DistillFSS matches or surpasses state-of-the-art baselines, particularly in multi-class and multi-shot scenarios, while offering substantial efficiency gains. The code is available at https://github.com/pasqualedem/DistillFSS.

[106] Experts-Guided Unbalanced Optimal Transport for ISP Learning from Unpaired and/or Paired Data

Georgy Perevozchikov, Nancy Mehta, Egor Ershov, Radu Timofte

Main category: cs.CV

TL;DR: Unsupervised ISP training framework using Unbalanced Optimal Transport that works with both paired and unpaired data, outperforming existing supervised methods.

DetailsMotivation: Current learned ISP pipelines require expensive paired raw-to-sRGB datasets, creating a major bottleneck. There's a need for training methods that don't rely on costly paired data acquisition.

Method: Proposes an unsupervised training framework based on Unbalanced Optimal Transport (first application to cross-domain ISP translation). Uses a “committee of expert discriminators” as hybrid adversarial regularizer to guide mapping by correcting specific ISP failure modes (color fidelity, structural artifacts, frequency-domain realism). Framework works in both paired and unpaired modes.

Result: When trained in paired mode, exceeds performance of original paired methods across all metrics. In unpaired mode, achieves quantitative and qualitative performance that rivals and sometimes surpasses original paired-trained counterparts. Framework provides robustness to outliers in target sRGB data.

Conclusion: The UOT-based framework successfully addresses the paired data bottleneck in ISP training, enabling high-performance ISP training without expensive paired datasets while providing robustness to data outliers.

Abstract: Learned Image Signal Processing (ISP) pipelines offer powerful end-to-end performance but are critically dependent on large-scale paired raw-to-sRGB datasets. This reliance on costly-to-acquire paired data remains a significant bottleneck. To address this challenge, we introduce a novel, unsupervised training framework based on Optimal Transport capable of training arbitrary ISP architectures in both unpaired and paired modes. We are the first to successfully apply Unbalanced Optimal Transport (UOT) for this complex, cross-domain translation task. Our UOT-based framework provides robustness to outliers in the target sRGB data, allowing it to discount atypical samples that would be prohibitively costly to map. A key component of our framework is a novel “committee of expert discriminators,” a hybrid adversarial regularizer. This committee guides the optimal transport mapping by providing specialized, targeted gradients to correct specific ISP failure modes, including color fidelity, structural artifacts, and frequency-domain realism. To demonstrate the superiority of our approach, we retrained existing state-of-the-art ISP architectures using our paired and unpaired setups. Our experiments show that while our framework, when trained in paired mode, exceeds the performance of the original paired methods across all metrics, our unpaired mode concurrently achieves quantitative and qualitative performance that rivals, and in some cases surpasses, the original paired-trained counterparts. The code and pre-trained models are available at: https://github.com/gosha20777/EGUOT-ISP.git.

[107] Probing the effectiveness of World Models for Spatial Reasoning through Test-time Scaling

Saurav Jha, M. Jehanzeb Mirza, Wei Lin, Shiqi Yang, Sarath Chandar

Main category: cs.CV

TL;DR: The paper analyzes test-time verification in vision-language models for spatial reasoning, finds current methods unreliable, proposes a new verification framework (ViSA) that improves some benchmarks but not others, revealing limitations in world models.

DetailsMotivation: Vision-Language Models have limitations in spatial reasoning requiring multi-view understanding. Recent approaches like MindJourney use test-time scaling with world models and heuristic verifiers, but their effectiveness and reliability need systematic examination.

Method: The paper systematically analyzes test-time verifiers across benchmarks, conducts uncertainty-based analyses, and introduces Verification through Spatial Assertions (ViSA) framework that grounds test-time rewards in verifiable, frame-anchored micro-claims.

Result: MindJourney’s verifier provides little meaningful calibration, and random scoring reduces answer entropy equally well, exposing systematic action biases. ViSA improves spatial reasoning on the SAT-Real benchmark and corrects trajectory-selection biases, but fails to achieve consistent scaling on the challenging MMSI-Bench.

Conclusion: Current world models form an information bottleneck where imagined views fail to enrich fine-grained reasoning. The findings chart the bad (unreliable verification), good (ViSA improves some tasks), and ugly (fundamental limitations) aspects of test-time verification for world-model-based reasoning.

Abstract: Vision-Language Models (VLMs) remain limited in spatial reasoning tasks that require multi-view understanding and embodied perspective shifts. Recent approaches such as MindJourney attempt to mitigate this gap through test-time scaling where a world model imagines action-conditioned trajectories and a heuristic verifier selects helpful views from such trajectories. In this work, we systematically examine how such test-time verifiers behave across benchmarks, uncovering both their promise and their pitfalls. Our uncertainty-based analyses show that MindJourney’s verifier provides little meaningful calibration, and that random scoring often reduces answer entropy equally well, thus exposing systematic action biases and unreliable reward signals. To mitigate these, we introduce a Verification through Spatial Assertions (ViSA) framework that grounds the test-time reward in verifiable, frame-anchored micro-claims. This principled verifier consistently improves spatial reasoning on the SAT-Real benchmark and corrects trajectory-selection biases through more balanced exploratory behavior. However, on the challenging MMSI-Bench, none of the verifiers, including ours, achieve consistent scaling, suggesting that the current world models form an information bottleneck where imagined views fail to enrich fine-grained reasoning. Together, these findings chart the bad, good, and ugly aspects of test-time verification for world-model-based reasoning. Our code is available at https://github.com/chandar-lab/visa-for-mindjourney.

[108] Self-Supervised AI-Generated Image Detection: A Camera Metadata Perspective

Nan Zhong, Mian Zou, Yiran Xu, Zhenxing Qian, Xinpeng Zhang, Baoyuan Wu, Kede Ma

Main category: cs.CV

TL;DR: Self-supervised AI image detection using camera EXIF metadata as intrinsic photographic features, achieving state-of-the-art cross-model generalization and robustness to perturbations.

DetailsMotivation: Existing AI-generated image detectors rely on assumptions about specific generative models, limiting cross-model applicability. There's a need for more generalizable detection methods that don't depend on model internals.

Method: Self-supervised approach using camera EXIF metadata (camera model, scene type, focal length, aperture) as pretext tasks. Features are learned from real photographs, then used for one-class detection (Gaussian mixture model) and binary detection (regularized classifier on high-frequency residuals from scrambled patches).
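
A sketch of what such a pretext objective could look like, assuming hypothetical `cls_head`/`rank_head` modules and using camera model and focal length as example tags:

```python
import torch
import torch.nn.functional as F

def exif_pretext_loss(feat_a, feat_b, cls_head, rank_head,
                      cam_a, cam_b, focal_a, focal_b):
    # feat_*: (B, D) features of two photos; cam_*: camera-model class ids;
    # focal_*: (B,) focal lengths. Heads and margin are illustrative.
    ce = F.cross_entropy(cls_head(feat_a), cam_a) \
       + F.cross_entropy(cls_head(feat_b), cam_b)
    # Pairwise ranking: the photo with the larger focal length should get
    # the larger scalar score from rank_head.
    s_a = rank_head(feat_a).squeeze(-1)
    s_b = rank_head(feat_b).squeeze(-1)
    target = torch.sign(focal_a - focal_b)       # +1 / -1 per pair
    return ce + F.margin_ranking_loss(s_a, s_b, target, margin=0.2)
```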

Result: EXIF-induced detectors substantially advance state-of-the-art, showing strong generalization across various generative models, effective performance on in-the-wild samples, and robustness to common benign image perturbations.

Conclusion: Camera metadata provides powerful intrinsic features for AI-generated image detection without relying on generative model assumptions, enabling more generalizable and robust forensic analysis.

Abstract: The proliferation of AI-generated imagery poses escalating challenges for multimedia forensics, yet many existing detectors depend on assumptions about the internals of specific generative models, limiting their cross-model applicability. We introduce a self-supervised approach for detecting AI-generated images that leverages camera metadata – specifically exchangeable image file format (EXIF) tags – to learn features intrinsic to digital photography. Our pretext task trains a feature extractor solely on camera-captured photographs by classifying categorical EXIF tags (e.g., camera model and scene type) and pairwise-ranking ordinal and continuous EXIF tags (e.g., focal length and aperture value). Using these EXIF-induced features, we first perform one-class detection by modeling the distribution of photographic images with a Gaussian mixture model and flagging low-likelihood samples as AI-generated. We then extend to binary detection that treats the learned extractor as a strong regularizer for a classifier of the same architecture, operating on high-frequency residuals from spatially scrambled patches. Extensive experiments across various generative models demonstrate that our EXIF-induced detectors substantially advance the state of the art, delivering strong generalization to in-the-wild samples and robustness to common benign image perturbations.

[109] Phase-OTDR Event Detection Using Image-Based Data Transformation and Deep Learning

Muhammet Cagri Yeke, Samil Sirin, Kivilcim Yuksel, Abdurrahman Gumus

Main category: cs.CV

TL;DR: This paper presents a novel image-based approach for detecting and classifying six types of events in optical fibers using Phase-OTDR data. By transforming 1D data into grayscale images using GADF, GASF, and Recurrence Plot techniques, then combining them into RGB representations, the method achieves high classification accuracy (up to 99.07%) with transfer learning models like EfficientNetB0 and DenseNet121.

DetailsMotivation: Traditional Phase-OTDR data analysis for optical fiber event detection faces challenges in handling complex 1D data patterns. There's a need for more robust, adaptable, and efficient methods to accurately classify different types of fiber optic events while reducing dataset requirements and improving analysis efficiency.

Method: The proposed method transforms 1D Phase-OTDR data into grayscale images using three techniques: Gramian Angular Difference Field (GADF), Gramian Angular Summation Field (GASF), and Recurrence Plot. These grayscale images are then combined into a multi-channel RGB representation. Transfer learning models (EfficientNetB0 and DenseNet121) are applied for classification, with validation using 5-fold cross-validation on a publicly available Phase-OTDR dataset.
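
The three transforms are standard and easy to state in plain NumPy; a self-contained version follows, with the recurrence threshold and sizes chosen for illustration:

```python
import numpy as np

def to_rgb_encoding(x):
    # Encode a 1D trace as a 3-channel image: GASF, GADF, recurrence plot.
    x = (x - x.min()) / (x.max() - x.min() + 1e-12)   # rescale to [0, 1]
    x = 2.0 * x - 1.0                                 # then to [-1, 1]
    phi = np.arccos(np.clip(x, -1.0, 1.0))            # polar-angle encoding
    gasf = np.cos(phi[:, None] + phi[None, :])        # Gramian angular summation
    gadf = np.sin(phi[:, None] - phi[None, :])        # Gramian angular difference
    rp = np.abs(x[:, None] - x[None, :]) < 0.1        # thresholded recurrence plot
    return np.stack([gasf, gadf, rp.astype(np.float32)], axis=-1)

img = to_rgb_encoding(np.sin(np.linspace(0, 8 * np.pi, 224)))  # (224, 224, 3)
```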

Result: The approach achieves high classification accuracies: 98.84% with EfficientNetB0 and 98.24% with DenseNet121. 5-fold cross-validation confirms reliability with test accuracies of 99.07% and 98.68% respectively. The method demonstrates efficient event detection while reducing dataset size and improving analysis efficiency.

Conclusion: The study demonstrates the transformative potential of image-based analysis for interpreting complex fiber optic sensing data. The approach offers significant advancements in accuracy and reliability of fiber optic monitoring systems. The publicly released codes and image-based dataset on GitHub support further research in this domain.

Abstract: This study focuses on event detection in optical fibers, specifically classifying six events using the Phase-OTDR system. A novel approach is introduced to enhance Phase-OTDR data analysis by transforming 1D data into grayscale images through techniques such as Gramian Angular Difference Field, Gramian Angular Summation Field, and Recurrence Plot. These grayscale images are combined into a multi-channel RGB representation, enabling more robust and adaptable analysis using transfer learning models. The proposed methodology achieves high classification accuracies of 98.84% and 98.24% with the EfficientNetB0 and DenseNet121 models, respectively. A 5-fold cross-validation process confirms the reliability of these models, with test accuracy rates of 99.07% and 98.68%. Using a publicly available Phase-OTDR dataset, the study demonstrates an efficient approach to understanding optical fiber events while reducing dataset size and improving analysis efficiency. The results highlight the transformative potential of image-based analysis in interpreting complex fiber optic sensing data, offering significant advancements in the accuracy and reliability of fiber optic monitoring systems. The codes and the corresponding image-based dataset are made publicly available on GitHub to support further research: https://github.com/miralab-ai/Phase-OTDR-event-detection.

[110] LeAD-M3D: Leveraging Asymmetric Distillation for Real-time Monocular 3D Detection

Johannes Meier, Jonathan Michel, Oussema Dhaouadi, Yung-Hsu Yang, Christoph Reich, Zuria Bauer, Stefan Roth, Marc Pollefeys, Jacques Kaiser, Daniel Cremers

Main category: cs.CV

TL;DR: LeAD-M3D is a monocular 3D object detector that achieves state-of-the-art accuracy and real-time inference without LiDAR or geometric priors, using three novel components for depth reasoning, matching, and efficient inference.

DetailsMotivation: Real-time monocular 3D detection faces challenges including severe depth ambiguity, viewpoint shifts, and high computational costs. Existing methods either rely on additional modalities like LiDAR, use geometric priors, or sacrifice efficiency for accuracy.

Method: Three key components: 1) Asymmetric Augmentation Denoising Distillation (A2D2) transfers geometric knowledge from teacher to student via quality-weighted depth-feature loss; 2) 3D-aware Consistent Matching (CM3D) integrates 3D MGIoU into matching score for better supervision; 3) Confidence-Gated 3D Inference (CGI3D) accelerates detection by restricting expensive 3D regression to high-confidence regions.
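
To make the A2D2 idea concrete, here is a minimal quality- and importance-weighted feature-matching loss; the weight maps are assumed inputs for illustration, whereas the actual method derives them from teacher depth quality:

```python
import torch

def weighted_depth_distill(student_feat, teacher_feat, quality, importance):
    # student_feat/teacher_feat: (B, C, H, W); quality/importance: (B, 1, H, W)
    # weight maps in [0, 1] that down-weight unreliable teacher regions.
    per_pix = (student_feat - teacher_feat).pow(2).mean(dim=1, keepdim=True)
    w = quality * importance
    return (w * per_pix).sum() / (w.sum() + 1e-9)   # weighted mean over pixels
```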

Result: Achieves state-of-the-art accuracy on KITTI and Waymo datasets, best reported car AP on Rope3D, while running up to 3.6x faster than prior high-accuracy methods. Sets new Pareto frontier for monocular 3D detection.

Conclusion: High fidelity and real-time efficiency in monocular 3D detection are simultaneously attainable without requiring LiDAR, stereo, or geometric assumptions, demonstrating the effectiveness of the proposed approach.

Abstract: Real-time monocular 3D object detection remains challenging due to severe depth ambiguity, viewpoint shifts, and the high computational cost of 3D reasoning. Existing approaches either rely on LiDAR or geometric priors to compensate for missing depth, or sacrifice efficiency to achieve competitive accuracy. We introduce LeAD-M3D, a monocular 3D detector that achieves state-of-the-art accuracy and real-time inference without extra modalities. Our method is powered by three key components. Asymmetric Augmentation Denoising Distillation (A2D2) transfers geometric knowledge from a clean-image teacher to a mixup-noised student via a quality- and importance-weighted depth-feature loss, enabling stronger depth reasoning without LiDAR supervision. 3D-aware Consistent Matching (CM3D) improves prediction-to-ground truth assignment by integrating 3D MGIoU into the matching score, yielding more stable and precise supervision. Finally, Confidence-Gated 3D Inference (CGI3D) accelerates detection by restricting expensive 3D regression to top-confidence regions. Together, these components set a new Pareto frontier for monocular 3D detection: LeAD-M3D achieves state-of-the-art accuracy on KITTI and Waymo, and the best reported car AP on Rope3D, while running up to 3.6x faster than prior high-accuracy methods. Our results demonstrate that high fidelity and real-time efficiency in monocular 3D detection are simultaneously attainable - without LiDAR, stereo, or geometric assumptions.

[111] Deep Learning-Based Real-Time Sequential Facial Expression Analysis Using Geometric Features

Talha Enes Koksal, Abdurrahman Gumus

Main category: cs.CV

TL;DR: Real-time facial expression recognition using MediaPipe FaceMesh for landmark detection, geometric feature extraction, and ConvLSTM1D classification, achieving high accuracy across multiple datasets with 165 FPS performance.

DetailsMotivation: Facial expression recognition is crucial for enhancing human-computer interaction and developing emotion-aware systems for applications like user experience personalization and intelligent surveillance.

Method: Uses MediaPipe FaceMesh for facial landmark detection, extracts geometric features (Euclidean distances and angles), analyzes temporal dynamics between frames to detect expression phases, and employs ConvLSTM1D network with MLP blocks for classification.
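
A small sketch of the geometric feature extraction, with placeholder landmark index lists (the paper's exact pairs and triplets are not reproduced here):

```python
import numpy as np

def geometric_features(landmarks, pairs, triplets):
    # landmarks: (N, 2) 2D facial landmarks (e.g., MediaPipe FaceMesh output).
    # pairs: list of (i, j) for Euclidean distances;
    # triplets: list of (i, j, k) for angles measured at the middle vertex j.
    dists = [np.linalg.norm(landmarks[i] - landmarks[j]) for i, j in pairs]
    angles = []
    for i, j, k in triplets:
        u, v = landmarks[i] - landmarks[j], landmarks[k] - landmarks[j]
        cos = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-9)
        angles.append(np.arccos(np.clip(cos, -1.0, 1.0)))
    return np.array(dists + angles, dtype=np.float32)
```

The temporal dynamics described above would then amount to differencing these feature vectors between consecutive frames before feeding them to the ConvLSTM1D classifier.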

Result: Achieved accuracies of 93% (CK+), 79% (Oulu-CASIA VIS), 77% (Oulu-CASIA NIR), and 68% (MMI). Processes approximately 165 frames per second on consumer-grade hardware, demonstrating real-time applicability.

Conclusion: Provides a fast, accurate, and adaptable solution for facial expression analysis, contributing to emotion-aware technologies and personalized user experiences. Source code is publicly available on GitHub.

Abstract: Facial expression recognition is a crucial component in enhancing human-computer interaction and developing emotion-aware systems. Real-time detection and interpretation of facial expressions have become increasingly important for various applications, from user experience personalization to intelligent surveillance systems. This study presents a novel approach to real-time sequential facial expression recognition using deep learning and geometric features. The proposed method utilizes MediaPipe FaceMesh for rapid and accurate facial landmark detection. Geometric features, including Euclidean distances and angles, are extracted from these landmarks. Temporal dynamics are incorporated by analyzing feature differences between consecutive frames, enabling the detection of onset, apex, and offset phases of expressions. For classification, a ConvLSTM1D network followed by multilayer perceptron blocks is employed. The method’s performance was evaluated on multiple publicly available datasets, including CK+, Oulu-CASIA (VIS and NIR), and MMI. Accuracies of 93%, 79%, 77%, and 68% were achieved respectively. Experiments with composite datasets were also conducted to assess the model’s generalization capabilities. The approach demonstrated real-time applicability, processing approximately 165 frames per second on consumer-grade hardware. This research contributes to the field of facial expression analysis by providing a fast, accurate, and adaptable solution. The findings highlight the potential for further advancements in emotion-aware technologies and personalized user experiences, paving the way for more sophisticated human-computer interaction systems. To facilitate further research in this field, the complete source code for this study has been made publicly available on GitHub: https://github.com/miralab-ai/facial-expression-analysis.

[112] Hyperspectral Unmixing with 3D Convolutional Sparse Coding and Projected Simplex Volume Maximization

Gargi Panda, Soumitra Kundu, Saumik Bhattacharya, Aurobinda Routray

Main category: cs.CV

TL;DR: 3D-CSCNet: An algorithm-unrolling-based autoencoder network for hyperspectral unmixing that combines 3D convolutional sparse coding with autoencoder framework and PSVM initialization.

DetailsMotivation: To improve hyperspectral unmixing by better capturing both spectral and spatial relationships in 3D HSI data cubes, overcoming limitations of existing unrolling-based networks that don't fully exploit the autoencoder framework.

Method: Proposes 3D-CSCNet built on 3D convolutional sparse coding model within autoencoder framework. Uses 3D-CSC blocks derived through deep algorithm unrolling to estimate abundance matrix, then reconstructs HSI via decoder to extract endmembers. Initializes decoder weights using PSVM algorithm for endmember estimation.
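
As a sketch of what algorithm unrolling looks like here: each network stage mimics one ISTA iteration of the convolutional sparse coding problem, with the dictionary and its adjoint replaced by learnable 3D convolutions. Channel counts, iteration count, and the nonnegativity mapping below are assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Unrolled3DCSCBlock(nn.Module):
    # One unrolled ISTA solver: z <- soft_threshold(z - D^T (D z - x)),
    # where D (decode) and D^T (adjoint) are learnable 3D convolutions.
    def __init__(self, in_ch=1, code_ch=32, iters=5):
        super().__init__()
        self.D = nn.Conv3d(code_ch, in_ch, 3, padding=1, bias=False)
        self.Dt = nn.Conv3d(in_ch, code_ch, 3, padding=1, bias=False)
        self.theta = nn.Parameter(torch.full((iters,), 0.05))  # per-step thresholds
        self.iters = iters

    def forward(self, x):                        # x: (B, 1, bands, H, W)
        z = self.Dt(x)                           # initial sparse code
        for t in range(self.iters):
            z = z - self.Dt(self.D(z) - x)       # gradient step on the data term
            z = torch.sign(z) * F.relu(z.abs() - self.theta[t])  # soft threshold
        return F.softplus(z)                     # nonnegative abundances (assumption)
```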

Result: Extensive experiments on three real datasets and one simulated dataset with three SNR levels show 3D-CSCNet outperforms state-of-the-art methods.

Conclusion: The proposed 3D-CSCNet effectively combines algorithm unrolling, 3D convolutional sparse coding, and autoencoder framework for superior hyperspectral unmixing performance by jointly learning spectral and spatial relationships.

Abstract: Hyperspectral unmixing (HSU) aims to separate each pixel into its constituent endmembers and estimate their corresponding abundance fractions. This work presents an algorithm-unrolling-based network for the HSU task, named the 3D Convolutional Sparse Coding Network (3D-CSCNet), built upon a 3D CSC model. Unlike existing unrolling-based networks, our 3D-CSCNet is designed within the powerful autoencoder (AE) framework. Specifically, to solve the 3D CSC problem, we propose a 3D CSC block (3D-CSCB) derived through deep algorithm unrolling. Given a hyperspectral image (HSI), 3D-CSCNet employs the 3D-CSCB to estimate the abundance matrix. The use of 3D CSC enables joint learning of spectral and spatial relationships in the 3D HSI data cube. The estimated abundance matrix is then passed to the AE decoder to reconstruct the HSI, and the decoder weights are extracted as the endmember matrix. Additionally, we propose a projected simplex volume maximization (PSVM) algorithm for endmember estimation, and the resulting endmembers are used to initialize the decoder weights of 3D-CSCNet. Extensive experiments on three real datasets and one simulated dataset with three different signal-to-noise ratio (SNR) levels demonstrate that our 3D-CSCNet outperforms state-of-the-art methods.

[113] World Models That Know When They Don’t Know: Controllable Video Generation with Calibrated Uncertainty

Zhiting Mei, Tenny Yin, Micah Baker, Ola Shorinwa, Anirudha Majumdar

Main category: cs.CV

TL;DR: C3: A method for training controllable video models to estimate dense, calibrated uncertainty at subpatch level to detect hallucinations and localize untrustworthy regions in generated videos.

DetailsMotivation: Controllable video models often hallucinate (generate physically unrealistic frames), which is problematic for critical applications like robotics. Current models lack confidence assessment capabilities, making hallucination mitigation difficult.

Method: Three innovations: 1) Training framework using strictly proper scoring rules for correctness and calibration, 2) Uncertainty estimation in latent space (avoiding pixel-space instability/costs), 3) Mapping latent uncertainty to interpretable pixel-level RGB uncertainty heatmaps.
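
A strictly proper scoring rule is one whose expected loss is minimized only by the true predictive distribution, which is what makes the learned variances calibrated. The Gaussian negative log-likelihood is the textbook instance, shown here on latent predictions as a generic example rather than the paper's exact objective:

```python
import torch

def gaussian_nll_score(mu, log_var, latent_target):
    # mu, log_var: per-latent-element predictive mean and log-variance;
    # minimizing this in expectation forces both accurate means and
    # honest (calibrated) variances.
    inv_var = torch.exp(-log_var)
    return 0.5 * (log_var + inv_var * (latent_target - mu) ** 2).mean()
```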

Result: Method provides calibrated uncertainty estimates within training distribution and enables effective out-of-distribution detection. Demonstrated on large-scale robot learning datasets (Bridge and DROID) with real-world evaluations.

Conclusion: C3 enables controllable video models to assess their own confidence, localize uncertainty at high resolution, and detect hallucinations, addressing critical reliability concerns for applications like robotics.

Abstract: Recent advances in generative video models have led to significant breakthroughs in high-fidelity video synthesis, specifically in controllable video generation where the generated video is conditioned on text and action inputs, e.g., in instruction-guided video editing and world modeling in robotics. Despite these exceptional capabilities, controllable video models often hallucinate - generating future video frames that are misaligned with physical reality - which raises serious concerns in many tasks such as robot policy evaluation and planning. However, state-of-the-art video models lack the ability to assess and express their confidence, impeding hallucination mitigation. To rigorously address this challenge, we propose C3, an uncertainty quantification (UQ) method for training continuous-scale calibrated controllable video models for dense confidence estimation at the subpatch level, precisely localizing the uncertainty in each generated video frame. Our UQ method introduces three core innovations to empower video models to estimate their uncertainty. First, our method develops a novel framework that trains video models for correctness and calibration via strictly proper scoring rules. Second, we estimate the video model’s uncertainty in latent space, avoiding training instability and prohibitive training costs associated with pixel-space approaches. Third, we map the dense latent-space uncertainty to interpretable pixel-level uncertainty in the RGB space for intuitive visualization, providing high-resolution uncertainty heatmaps that identify untrustworthy regions. Through extensive experiments on large-scale robot learning datasets (Bridge and DROID) and real-world evaluations, we demonstrate that our method not only provides calibrated uncertainty estimates within the training distribution, but also enables effective out-of-distribution detection.

[114] Physics-Informed Graph Neural Network with Frequency-Aware Learning for Optical Aberration Correction

Yong En Kok, Bowen Deng, Alexander Bentley, Andrew J. Parkes, Michael G. Somekh, Amanda J. Wright, Michael P. Pound

Main category: cs.CV

TL;DR: ZRNet: Physics-informed framework for joint Zernike coefficient prediction and optical image restoration in microscopy, using Zernike Graph module and Frequency-Aware Alignment loss to outperform existing methods on diverse samples with complex aberrations.

DetailsMotivation: Optical aberrations degrade microscopy image quality, especially in deep imaging. Existing methods treat aberration correction as black-box mapping without leveraging optical physics, and only handle mild aberrations on limited sample types.

Method: Proposes ZRNet with Zernike Graph module that models physical relationships between Zernike polynomials based on azimuthal degrees, and Frequency-Aware Alignment loss to enforce consistency between image restoration and Zernike prediction in Fourier domain.
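
In the same spirit, a hypothetical Fourier-domain alignment term that penalizes amplitude and (wrap-safe) phase discrepancies between restored and reference images; the actual FAA weighting is not specified here:

```python
import torch

def frequency_alignment_loss(restored, reference, phase_weight=0.1):
    # restored/reference: (B, C, H, W); compare spectra after a 2D FFT.
    F_r, F_g = torch.fft.fft2(restored), torch.fft.fft2(reference)
    amp = (F_r.abs() - F_g.abs()).abs().mean()
    # 1 - cos(dphi) handles the 2*pi wrap-around of phase differences.
    phase = (1 - torch.cos(torch.angle(F_r) - torch.angle(F_g))).mean()
    return amp + phase_weight * phase
```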

Result: Achieves state-of-the-art performance in both image restoration and Zernike coefficient prediction across diverse microscopy modalities and biological samples with complex, large-amplitude aberrations on CytoImageNet dataset.

Conclusion: ZRNet successfully integrates optical physics principles into deep learning for aberration correction, demonstrating superior performance by explicitly modeling Zernike polynomial relationships and enforcing Fourier domain consistency.

Abstract: Optical aberrations significantly degrade image quality in microscopy, particularly when imaging deeper into samples. These aberrations arise from distortions in the optical wavefront and can be mathematically represented using Zernike polynomials. Existing methods often address only mild aberrations on limited sample types and modalities, typically treating the problem as a black-box mapping without leveraging the underlying optical physics of wavefront distortions. We propose ZRNet, a physics-informed framework that jointly performs Zernike coefficient prediction and optical image Restoration. We contribute a Zernike Graph module that explicitly models physical relationships between Zernike polynomials based on their azimuthal degrees, ensuring that learned corrections align with fundamental optical principles. To further enforce physical consistency between image restoration and Zernike prediction, we introduce a Frequency-Aware Alignment (FAA) loss, which better aligns Zernike coefficient prediction and image features in the Fourier domain. Extensive experiments on CytoImageNet demonstrate that our approach achieves state-of-the-art performance in both image restoration and Zernike coefficient prediction across diverse microscopy modalities and biological samples with complex, large-amplitude aberrations. Code is available at https://github.com/janetkok/ZRNet.

[115] Measuring the Effect of Background on Classification and Feature Importance in Deep Learning for AV Perception

Anne Sielemann, Valentin Barner, Stefan Wolf, Masoud Roschani, Jens Ziehn, Juergen Beyerer

Main category: cs.CV

TL;DR: Researchers use synthetic datasets to systematically study how background correlations affect traffic sign classification and explainable AI interpretations, quantifying when background features become important for classification.

DetailsMotivation: Current XAI methods for deep learning rely on intuitive interpretations about whether classifiers focus on object pixels or background pixels, but these interpretations are difficult to test quantitatively. Real-world data contains unavoidable correlations, making it hard to determine if they're legitimate or spurious. Synthetic data can control correlations but often lacks realism quantification.

Method: Created six synthetic datasets for traffic sign recognition that vary only in camera variation and background correlation levels. This allows systematic isolation and quantification of background correlation effects, camera variation, and traffic sign shapes on classification performance and background feature importance.

Result: The study provides quantification of when and how much background features gain importance to support classification based on changes in the training domain. Results show how background correlation affects classifier behavior and XAI interpretations.

Conclusion: Synthetic datasets with controlled variations enable systematic testing of XAI interpretations about background feature importance, providing quantitative evidence about when background correlations legitimately support classification versus when they indicate problematic overfitting.

Abstract: Common approaches to explainable AI (XAI) for deep learning focus on analyzing the importance of input features on the classification task in a given model: saliency methods like SHAP and GradCAM are used to measure the impact of spatial regions of the input image on the classification result. Combined with ground truth information about the location of the object in the input image (e.g., a binary mask), it is determined whether object pixels had a high impact on the classification result, or whether the classification focused on background pixels. The former is considered to be a sign of a healthy classifier, whereas the latter is assumed to suggest overfitting on spurious correlations. A major challenge, however, is that these intuitive interpretations are difficult to test quantitatively, and hence the output of such explanations lacks an explanation itself. One particular reason is that correlations in real-world data are difficult to avoid, and whether they are spurious or legitimate is debatable. Synthetic data, in turn, can be used to actively enable or disable correlations where desired but often lack a sufficient quantification of realism and stochastic properties. […] Therefore, we systematically generate six synthetic datasets for the task of traffic sign recognition, which differ only in their degree of camera variation and background correlation […] to quantify the isolated influence of background correlation, different levels of camera variation, and considered traffic sign shapes on the classification performance, as well as background feature importance. […] Results include a quantification of when and how much background features gain importance to support the classification task based on changes in the training domain […]. Download: synset.de/datasets/synset-signset-ger/background-effect

[116] OWL: Unsupervised 3D Object Detection by Occupancy Guided Warm-up and Large Model Priors Reasoning

Xusheng Guo, Wanfa Zhang, Shijia Zhao, Qiming Xia, Xiaolong Xie, Mingming Wang, Hai Wu, Chenglu Wen

Main category: cs.CV

TL;DR: OWL: Unsupervised 3D object detection using occupancy-guided warm-up and large-model priors to address pseudo-label quality issues in self-training.

DetailsMotivation: Existing unsupervised 3D object detection methods rely on pseudo-labels that are often incorrect at training start, misleading optimization. Current approaches struggle with effectively filtering and refining these noisy pseudo-labels, limiting performance.

Method: Three key components: 1) Occupancy Guided Warm-up (OGW) to initialize backbone with spatial perception capabilities, 2) Instance-Cued Reasoning (ICR) module using large-model priors to assess pseudo-label quality for filtering/refinement, 3) Weight-adapted Self-training (WAS) strategy to dynamically re-weight pseudo-labels during self-training.

Result: Outperforms state-of-the-art unsupervised methods by over 15.0% mAP on Waymo Open Dataset (WOD) and KITTI benchmarks.

Conclusion: OWL effectively addresses pseudo-label quality issues in unsupervised 3D object detection through occupancy-guided initialization, large-model priors for quality assessment, and adaptive self-training, demonstrating significant performance improvements over existing methods.

Abstract: Unsupervised 3D object detection leverages heuristic algorithms to discover potential objects, offering a promising route to reduce annotation costs in autonomous driving. Existing approaches mainly generate pseudo labels and refine them through self-training iterations. However, these pseudo-labels are often incorrect at the beginning of training, resulting in misleading the optimization process. Moreover, effectively filtering and refining them remains a critical challenge. In this paper, we propose OWL for unsupervised 3D object detection by occupancy guided warm-up and large-model priors reasoning. OWL first employs an Occupancy Guided Warm-up (OGW) strategy to initialize the backbone weight with spatial perception capabilities, mitigating the interference of incorrect pseudo-labels on network convergence. Furthermore, OWL introduces an Instance-Cued Reasoning (ICR) module that leverages the prior knowledge of large models to assess pseudo-label quality, enabling precise filtering and refinement. Finally, we design a Weight-adapted Self-training (WAS) strategy to dynamically re-weight pseudo-labels, improving the performance through self-training. Extensive experiments on Waymo Open Dataset (WOD) and KITTI demonstrate that OWL outperforms state-of-the-art unsupervised methods by over 15.0% mAP, revealing the effectiveness of our method.

[117] Manifold-Aware Point Cloud Completion via Geodesic-Attentive Hierarchical Feature Learning

Jianan Sun, Dongzhihan Wang, Mingyu Fan

Main category: cs.CV

TL;DR: Manifold-aware point cloud completion framework using geodesic distances to capture nonlinear geometry, improving geometric consistency and semantic coherence.

DetailsMotivation: Existing point cloud completion methods rely on Euclidean proximity and overlook intrinsic nonlinear geometric structure, leading to suboptimal geometric consistency and semantic ambiguity in reconstructed shapes.

Method: Introduces two key modules: 1) Geodesic Distance Approximator (GDA) to estimate geodesic distances capturing latent manifold topology, and 2) Manifold-Aware Feature Extractor (MAFE) using geodesic-based k-NN groupings and geodesic-relational attention mechanism for hierarchical feature extraction.
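
For intuition, the classic non-learned way to approximate geodesic distances, which the learned GDA module stands in for, is shortest paths on a k-NN graph weighted by Euclidean edge lengths:

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import dijkstra
from scipy.spatial import cKDTree

def knn_geodesic_distances(points, k=8):
    # Build a k-NN graph so path lengths follow the surface rather than
    # cutting through empty space; dense (n, n) output, fine for small clouds.
    n = len(points)
    dist, idx = cKDTree(points).query(points, k=k + 1)  # column 0 is the point itself
    rows = np.repeat(np.arange(n), k)
    graph = csr_matrix((dist[:, 1:].ravel(), (rows, idx[:, 1:].ravel())), shape=(n, n))
    return dijkstra(graph, directed=False)              # approximate geodesic matrix
```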

Result: Extensive experiments on benchmark datasets demonstrate consistent outperformance over state-of-the-art methods in reconstruction quality.

Conclusion: Explicit incorporation of nonlinear geometry information through geodesic-aware relational attention promotes semantic coherence and structural fidelity in point cloud completion.

Abstract: Point cloud completion seeks to recover geometrically consistent shapes from partial or sparse 3D observations. Although recent methods have achieved reasonable global shape reconstruction, they often rely on Euclidean proximity and overlook the intrinsic nonlinear geometric structure of point clouds, resulting in suboptimal geometric consistency and semantic ambiguity. In this paper, we present a manifold-aware point cloud completion framework that explicitly incorporates nonlinear geometry information throughout the feature learning pipeline. Our approach introduces two key modules: a Geodesic Distance Approximator (GDA), which estimates geodesic distances between points to capture the latent manifold topology, and a Manifold-Aware Feature Extractor (MAFE), which utilizes geodesic-based $k$-NN groupings and a geodesic-relational attention mechanism to guide the hierarchical feature extraction process. By integrating geodesic-aware relational attention, our method promotes semantic coherence and structural fidelity in the reconstructed point clouds. Extensive experiments on benchmark datasets demonstrate that our approach consistently outperforms state-of-the-art methods in reconstruction quality.

[118] Distilling Expert Surgical Knowledge: How to train local surgical VLMs for anatomy explanation in Complete Mesocolic Excision

Lennart Maack, Julia-Kristin Graß, Lisa-Marie Toscha, Nathaniel Melling, Alexander Schlaefer

Main category: cs.CV

TL;DR: Privacy-preserving framework to distill knowledge from large LLMs into a local VLM for surgical scene understanding using expert-supervised datasets generated without sensitive images.

DetailsMotivation: Current VLMs lack domain-specific surgical scene understanding capabilities and pose privacy risks when patient data is sent to external servers. There's a need for locally deployable models that can understand surgical anatomy while protecting patient privacy.

Method: Generate expert-supervised dataset by prompting teacher LLM using only textual context and binary segmentation masks (no sensitive images). Use this dataset for Supervised Fine-Tuning (SFT) and Direct Preference Optimization (DPO) of a locally deployable VLM.

Result: Finetuning VLMs with the generated datasets significantly increases surgical domain knowledge compared to base VLMs. The framework provides data-efficient and privacy-conforming training for surgical domain optimization.

Conclusion: The work validates a privacy-preserving approach to create locally deployable VLMs with enhanced surgical scene understanding capabilities, addressing both domain expertise gaps and data privacy concerns in clinical settings.

Abstract: Recently, Vision Large Language Models (VLMs) have demonstrated high potential in computer-aided diagnosis and decision-support. However, current VLMs show deficits in domain specific surgical scene understanding, such as identifying and explaining anatomical landmarks during Complete Mesocolic Excision. Additionally, there is a need for locally deployable models to avoid patient data leakage to large VLMs, hosted outside the clinic. We propose a privacy-preserving framework to distill knowledge from large, general-purpose LLMs into an efficient, local VLM. We generate an expert-supervised dataset by prompting a teacher LLM without sensitive images, using only textual context and binary segmentation masks for spatial information. This dataset is used for Supervised Fine-Tuning (SFT) and subsequent Direct Preference Optimization (DPO) of the locally deployable VLM. Our evaluation confirms that finetuning VLMs with our generated datasets increases surgical domain knowledge compared to its base VLM by a large margin. Overall, this work validates a data-efficient and privacy-conforming way to train a surgical domain optimized, locally deployable VLM for surgical scene understanding.

[119] AQUA-Net: Adaptive Frequency Fusion and Illumination Aware Network for Underwater Image Enhancement

Munsif Ali, Najmul Hassan, Lucia Ventura, Davide Di Bari, Simonepietro Canese

Main category: cs.CV

TL;DR: AQUA-Net is a lightweight underwater image enhancement model that integrates frequency fusion and illumination awareness to restore color, contrast, and realism while reducing computational complexity for real-time applications.

DetailsMotivation: Underwater images suffer from color distortion, low contrast, and haze due to light absorption and scattering. Existing deep learning models have high computational complexity, limiting their practical deployment for real-time underwater applications.

Method: AQUA-Net integrates a residual encoder-decoder with dual auxiliary branches in frequency and illumination domains. The frequency fusion encoder enriches spatial representations with Fourier domain cues, while the illumination-aware decoder performs adaptive exposure correction using a learned illumination map inspired by Retinex theory.
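
An illustrative version of the frequency-fusion step, assuming a hypothetical `FrequencyBranch` module: FFT a feature map, mix amplitude and phase with a 1x1 convolution, and add the result back to the spatial features. Channel sizes and fusion-by-addition are placeholders, not the paper's architecture:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FrequencyBranch(nn.Module):
    def __init__(self, ch=64):
        super().__init__()
        self.mix = nn.Conv2d(2 * ch, ch, kernel_size=1)

    def forward(self, feat):                        # feat: (B, C, H, W)
        spec = torch.fft.rfft2(feat, norm="ortho")  # (B, C, H, W//2 + 1), complex
        cues = torch.cat([spec.abs(), torch.angle(spec)], dim=1)
        # Bring the frequency cues back to spatial resolution and fuse.
        cues = F.interpolate(self.mix(cues), size=feat.shape[-2:])
        return feat + cues
```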

Result: Extensive experiments show AQUA-Net performs on par with state-of-the-art methods in qualitative and quantitative evaluations while using fewer parameters. The model also introduces a high-resolution real-world underwater video dataset from the Mediterranean Sea for robust evaluation.

Conclusion: AQUA-Net provides an effective solution for real-world underwater imaging with strong generalization capability and robustness. The frequency and illumination branches offer complementary contributions that improve visibility and color representation while maintaining computational efficiency.

Abstract: Underwater images often suffer from severe color distortion, low contrast, and a hazy appearance due to wavelength-dependent light absorption and scattering. Simultaneously, existing deep learning models exhibit high computational complexity, which limits their practical deployment for real-time underwater applications. To address these challenges, this paper presents a novel underwater image enhancement model, called Adaptive Frequency Fusion and Illumination Aware Network (AQUA-Net). It integrates a residual encoder decoder with dual auxiliary branches, which operate in the frequency and illumination domains. The frequency fusion encoder enriches spatial representations with frequency cues from the Fourier domain and preserves fine textures and structural details. Inspired by Retinex, the illumination-aware decoder performs adaptive exposure correction through a learned illumination map that separates reflectance from lighting effects. This joint spatial, frequency, and illumination design enables the model to restore color balance, visual contrast, and perceptual realism under diverse underwater conditions. Additionally, we present a high-resolution, real-world underwater video-derived dataset from the Mediterranean Sea, which captures challenging deep-sea conditions with realistic visual degradations to enable robust evaluation and development of deep learning models. Extensive experiments on multiple benchmark datasets show that AQUA-Net performs on par with SOTA in both qualitative and quantitative evaluations while using fewer parameters. Ablation studies further confirm that the frequency and illumination branches provide complementary contributions that improve visibility and color representation. Overall, the proposed model shows strong generalization capability and robustness, and it provides an effective solution for real-world underwater imaging applications.

[120] HQ-DM: Single Hadamard Transformation-Based Quantization-Aware Training for Low-Bit Diffusion Models

Shizhuo Mao, Hongtao Zou, Qihu Xie, Song Chen, Yi Kang

Main category: cs.CV

TL;DR: HQ-DM is a quantization-aware training framework for diffusion models that uses Single Hadamard Transformation to reduce activation outliers, enabling effective low-bit quantization while maintaining performance.

DetailsMotivation: Diffusion models have high computational and memory costs that hinder deployment. Existing quantization methods struggle with activation outliers during inference, causing significant performance degradation in low-bit quantization scenarios.

Method: Proposes HQ-DM, a quantization-aware training framework that applies Single Hadamard Transformation to activation matrices. This reduces activation outliers while preserving model performance under quantization, supporting INT convolution operations and preventing weight outlier amplification.
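
The mechanism is easiest to see with a fast Walsh-Hadamard transform: an orthogonal rotation that spreads a single outlier's energy across all channels, making the post-rotation activation range far friendlier to low-bit quantization. A generic sketch follows (last dimension must be a power of two; where the rotation is placed in the network is the paper's contribution and is not shown):

```python
import torch

def hadamard_transform(x):
    # Butterfly FWHT along the last dimension; O(n log n).
    n = x.shape[-1]
    h = 1
    while h < n:
        x = x.reshape(*x.shape[:-1], n // (2 * h), 2, h)
        a, b = x[..., 0, :], x[..., 1, :]
        x = torch.stack([a + b, a - b], dim=-2).reshape(*x.shape[:-3], -1)
        h *= 2
    return x / n ** 0.5               # orthonormal scaling
```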

Result: For conditional generation on ImageNet 256x256 using LDM-4 model, W4A4 quantization improves Inception Score by 12.8% and W4A3 quantization improves by 467.73% over state-of-the-art methods.

Conclusion: HQ-DM effectively addresses activation outlier issues in diffusion model quantization, enabling efficient low-bit deployment with significant performance improvements over existing methods.

Abstract: Diffusion models have demonstrated significant applications in the field of image generation. However, their high computational and memory costs pose challenges for deployment. Model quantization has emerged as a promising solution to reduce storage overhead and accelerate inference. Nevertheless, existing quantization methods for diffusion models struggle to mitigate outliers in activation matrices during inference, leading to substantial performance degradation under low-bit quantization scenarios. To address this, we propose HQ-DM, a novel Quantization-Aware Training framework that applies Single Hadamard Transformation to activation matrices. This approach effectively reduces activation outliers while preserving model performance under quantization. Compared to traditional Double Hadamard Transformation, our proposed scheme offers distinct advantages by seamlessly supporting INT convolution operations while preventing the amplification of weight outliers. For conditional generation on the ImageNet 256x256 dataset using the LDM-4 model, our W4A4 and W4A3 quantization schemes improve the Inception Score by 12.8% and 467.73%, respectively, over the existing state-of-the-art method.

[121] USV: Unified Sparsification for Accelerating Video Diffusion Models

Xinjian Wu, Hongmei Wang, Yuan Zhou, Qinglin Lu

Main category: cs.CV

TL;DR: USV is a unified sparsification framework that jointly optimizes attention pruning, token merging, and denoising step reduction for accelerating video diffusion models while maintaining quality.

DetailsMotivation: Current video diffusion models suffer from quadratic complexity in spatio-temporal attention and long iterative denoising trajectories. Existing accelerators target single dimensions in isolation, leading to diminishing returns as other bottlenecks become dominant.

Method: USV learns a dynamic, data- and timestep-dependent sparsification policy that jointly prunes redundant attention connections, adaptively merges semantically similar tokens, and reduces denoising steps within a single optimization objective.

Result: Achieves up to 83.3% speedup in denoising process and 22.7% end-to-end acceleration while maintaining high visual fidelity on large-scale video generation benchmarks.

Conclusion: Unified, dynamic sparsification provides a practical path toward efficient, high-quality video generation by enabling strong mutual reinforcement among previously disjoint acceleration strategies.

Abstract: The scalability of high-fidelity video diffusion models (VDMs) is constrained by two key sources of redundancy: the quadratic complexity of global spatio-temporal attention and the computational overhead of long iterative denoising trajectories. Existing accelerators – such as sparse attention and step-distilled samplers – typically target a single dimension in isolation and quickly encounter diminishing returns, as the remaining bottlenecks become dominant. In this work, we introduce USV (Unified Sparsification for Video diffusion models), an end-to-end trainable framework that overcomes this limitation by jointly orchestrating sparsification across both the model’s internal computation and its sampling process. USV learns a dynamic, data- and timestep-dependent sparsification policy that prunes redundant attention connections, adaptively merges semantically similar tokens, and reduces denoising steps, treating them not as independent tricks but as coordinated actions within a single optimization objective. This multi-dimensional co-design enables strong mutual reinforcement among previously disjoint acceleration strategies. Extensive experiments on large-scale video generation benchmarks demonstrate that USV achieves up to 83.3% speedup in the denoising process and 22.7% end-to-end acceleration, while maintaining high visual fidelity. Our results highlight unified, dynamic sparsification as a practical path toward efficient, high-quality video generation.

[122] Label-Efficient Point Cloud Segmentation with Active Learning

Johannes Meyer, Jasper Hoffmann, Felix Schulz, Dominik Merkle, Daniel Buescher, Alexander Reiterer, Joschka Boedecker, Wolfram Burgard

Main category: cs.CV

TL;DR: A novel active learning method for 3D point cloud semantic segmentation that uses a simple 2D grid subdivision approach and network ensemble uncertainty estimation, achieving state-of-the-art performance while being easy to implement.

DetailsMotivation: Semantic segmentation of 3D point clouds requires expensive annotation. Existing active learning methods use complex heuristics for splitting point clouds and selecting annotation regions. The authors aim to develop a simpler, more effective approach that reduces annotation costs while maintaining performance.

Method: Two main components: 1) A simple 2D grid-based subdivision of point clouds into columns for annotation, and 2) A network ensemble approach to estimate prediction uncertainty for selecting the most beneficial regions to annotate next.
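
A compact sketch of the selection step, under stated assumptions (a fixed grid cell size, and entropy of the ensemble-mean prediction as the uncertainty score):

```python
import numpy as np

def select_columns(points, probs_ensemble, cell=2.0, budget=10):
    # points: (N, 3); probs_ensemble: (M, N, C) class probabilities from
    # M ensemble members. Returns the 2D indices of the columns to annotate.
    mean_p = probs_ensemble.mean(axis=0)                      # (N, C)
    entropy = -(mean_p * np.log(mean_p + 1e-12)).sum(axis=1)  # per-point uncertainty
    cols = np.floor(points[:, :2] / cell).astype(int)         # column index per point
    keys, inverse = np.unique(cols, axis=0, return_inverse=True)
    scores = np.bincount(inverse, weights=entropy) / np.bincount(inverse)
    return keys[np.argsort(scores)[::-1][:budget]]            # most uncertain columns
```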

Result: The method achieves performance on par with or better than complex state-of-the-art methods on S3DIS, Toronto-3D, and a large-scale urban Freiburg dataset. Also demonstrates that annotated area (rather than number of points) is a more meaningful measure for active learning in point clouds.

Conclusion: The proposed approach provides an effective yet simple alternative to complex active learning methods for 3D point cloud segmentation, reducing annotation costs while maintaining high performance. The finding about annotated area vs. point count offers valuable insight for future active learning research in this domain.

Abstract: Semantic segmentation of 3D point cloud data often comes with high annotation costs. Active learning automates the process of selecting which data to annotate, reducing the total amount of annotation needed to achieve satisfactory performance. Recent approaches to active learning for 3D point clouds are often based on sophisticated heuristics, both for splitting point clouds into annotatable regions and for selecting the most beneficial regions for further neural network training. In this work, we propose a novel and easy-to-implement strategy to separate the point cloud into annotatable regions. In our approach, we utilize a 2D grid to subdivide the point cloud into columns. To identify the next data to be annotated, we employ a network ensemble to estimate the uncertainty in the network output. We evaluate our method on the S3DIS dataset, the Toronto-3D dataset, and a large-scale urban 3D point cloud of the city of Freiburg, which we labeled in parts manually. The extensive evaluation shows that our method yields performance on par with, or even better than, complex state-of-the-art methods on all datasets. Furthermore, we provide results suggesting that in the context of point clouds the annotated area can be a more meaningful measure for active learning algorithms than the number of annotated points.

[123] FNOPT: Resolution-Agnostic, Self-Supervised Cloth Simulation using Meta-Optimization with Fourier Neural Operators

Ruochen Chen, Thuy Tran, Shaifali Parashar

Main category: cs.CV

TL;DR: FNOpt is a self-supervised cloth simulation framework using Fourier neural operators as neural optimizers that learns to simulate cloth dynamics without ground truth data, generalizes across resolutions, and captures fine-scale details.

DetailsMotivation: Prior neural cloth simulators have limitations: they require extensive ground truth data, sacrifice fine-scale details, and generalize poorly across different mesh resolutions and motion patterns. There's a need for a more robust, data-efficient approach that can handle diverse simulation scenarios.

Method: FNOpt formulates time integration as an optimization problem and trains a resolution-agnostic neural optimizer parameterized by a Fourier neural operator (FNO). It is trained only on a coarse grid using physics-based losses, without requiring ground truth data.

Result: FNOpt outperforms prior learning-based approaches in out-of-distribution settings, achieving better accuracy and robustness. It generalizes to finer resolutions, captures fine-scale wrinkles, maintains rollout stability, and works across diverse mesh resolutions and motion patterns without retraining.

Conclusion: FNO-based meta-optimization presents a compelling alternative to previous neural cloth simulators, reducing the need for curated data and improving cross-resolution reliability while maintaining physical plausibility and detail.

Abstract: We present FNOpt, a self-supervised cloth simulation framework that formulates time integration as an optimization problem and trains a resolution-agnostic neural optimizer parameterized by a Fourier neural operator (FNO). Prior neural simulators often rely on extensive ground truth data or sacrifice fine-scale detail, and generalize poorly across resolutions and motion patterns. In contrast, FNOpt learns to simulate physically plausible cloth dynamics and achieves stable and accurate rollouts across diverse mesh resolutions and motion patterns without retraining. Trained only on a coarse grid with physics-based losses, FNOpt generalizes to finer resolutions, capturing fine-scale wrinkles and preserving rollout stability. Extensive evaluations on a benchmark cloth simulation dataset demonstrate that FNOpt outperforms prior learning-based approaches in out-of-distribution settings in both accuracy and robustness. These results position FNO-based meta-optimization as a compelling alternative to previous neural simulators for cloth, thus reducing the need for curated data and improving cross-resolution reliability.

[124] Curvature-Regularized Variational Autoencoder for 3D Scene Reconstruction from Sparse Depth

Maryam Yousefi, Soodeh Bakhshandeh

Main category: cs.CV

TL;DR: Using curvature regularization with discrete Laplacian operator improves 3D scene reconstruction from sparse depth measurements by 18.1% over standard VAEs, challenging the assumption that multiple geometric constraints are needed.

DetailsMotivation: Sparse depth measurements (only 5% of needed data) make 3D scene reconstruction difficult for autonomous vehicles and robots, where geometric errors are unacceptable. Current approaches often use multiple geometric constraints, but this paper questions whether simpler regularization might be more effective.

Method: Proposes curvature regularization using a discrete Laplacian operator to improve 3D reconstruction from sparse depth data. The method adds only 15% training overhead with zero inference cost, offering stable gradients and noise suppression.
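
The regularizer itself is compact enough to sketch. Below is a plausible PyTorch rendering of a discrete-Laplacian curvature penalty on a predicted depth map; the loss weight and how it combines with the VAE objective are assumptions.

```python
import torch
import torch.nn.functional as F

# Fixed 5-point discrete Laplacian stencil (no trainable parameters).
LAPLACIAN = torch.tensor([[0., 1., 0.],
                          [1., -4., 1.],
                          [0., 1., 0.]]).view(1, 1, 3, 3)

def curvature_loss(depth: torch.Tensor) -> torch.Tensor:
    """Penalize squared discrete curvature of a (B, 1, H, W) depth map."""
    lap = F.conv2d(depth, LAPLACIAN.to(depth), padding=1)
    return (lap ** 2).mean()

# Hypothetical combination with the VAE objective:
# total_loss = recon_loss + beta * kl_loss + lam * curvature_loss(depth_hat)
```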

Result: Achieves 18.1% better reconstruction accuracy than standard variational autoencoders. The single well-designed regularization term outperforms complex multi-term formulations, challenging assumptions in geometric deep learning.

Conclusion: A single, well-designed curvature regularization term using discrete Laplacian can outperform complex multi-constraint approaches in 3D reconstruction from sparse depth data, offering better accuracy with minimal computational overhead.

Abstract: When depth sensors provide only 5% of needed measurements, reconstructing complete 3D scenes becomes difficult. Autonomous vehicles and robots cannot tolerate the geometric errors that sparse reconstruction introduces. We propose curvature regularization through a discrete Laplacian operator, achieving 18.1% better reconstruction accuracy than standard variational autoencoders. Our contribution challenges an implicit assumption in geometric deep learning: that combining multiple geometric constraints improves performance. A single well-designed regularization term not only matches but exceeds the effectiveness of complex multi-term formulations. The discrete Laplacian offers stable gradients and noise suppression with just 15% training overhead and zero inference cost. Code and models are available at https://github.com/Maryousefi/GeoVAE-3D.

[125] Bring Your Dreams to Life: Continual Text-to-Video Customization

Jiahua Dong, Xudong Wang, Wenqi Liang, Zongyan Han, Meng Cao, Duzhen Zhang, Hanbin Zhao, Zhi Han, Salman Khan, Fahad Shahbaz Khan

Main category: cs.CV

TL;DR: CCVD is a continual learning framework for customized text-to-video generation that addresses catastrophic forgetting and concept neglect when learning new concepts over time.

DetailsMotivation: Current customized text-to-video generation methods assume static concepts and struggle with forgetting and concept neglect when continuously learning new subjects and motions over time.

Method: Introduces concept-specific attribute retention module and task-aware concept aggregation strategy to prevent forgetting, plus controllable conditional synthesis with layer-specific region attention-guided noise estimation to address concept neglect.

Result: Extensive experiments show CCVD outperforms existing CTVG models in continual learning scenarios.

Conclusion: CCVD effectively enables continuous learning of new concepts in text-to-video generation while maintaining performance on previously learned concepts.

Abstract: Customized text-to-video generation (CTVG) has recently witnessed great progress in generating tailored videos from user-specific text. However, most CTVG methods assume that personalized concepts remain static and do not expand incrementally over time. Additionally, they struggle with forgetting and concept neglect when continuously learning new concepts, including subjects and motions. To resolve the above challenges, we develop a novel Continual Customized Video Diffusion (CCVD) model, which can continuously learn new concepts to generate videos across various text-to-video generation tasks by tackling forgetting and concept neglect. To address catastrophic forgetting, we introduce a concept-specific attribute retention module and a task-aware concept aggregation strategy. They can capture the unique characteristics and identities of old concepts during training, while combining all subject and motion adapters of old concepts based on their relevance during testing. In addition, to tackle concept neglect, we develop a controllable conditional synthesis to enhance regional features and align video contexts with user conditions, by incorporating layer-specific region attention-guided noise estimation. Extensive experimental comparisons demonstrate that our CCVD outperforms existing CTVG models. The code is available at https://github.com/JiahuaDong/CCVD.

[126] UG-FedDA: Uncertainty-Guided Federated Domain Adaptation for Multi-Center Alzheimer’s Disease Detection

Fubao Zhu, Zhanyuan Jia, Zhiguo Wang, Huan Huang, Danyang Sun, Chuang Han, Yanting Li, Jiaofen Nan, Chen Zhao, Weihua Zhou

Main category: cs.CV

TL;DR: UG-FedDA: A federated domain adaptation framework with uncertainty quantification for robust multicenter Alzheimer’s disease classification from MRI data.

DetailsMotivation: Existing AD classification frameworks struggle with multicenter studies due to inter-site heterogeneity and lack of uncertainty quantification, limiting robustness and clinical applicability.

Method: Uncertainty-Guided Federated Domain Adaptation (UG-FedDA) integrates uncertainty quantification with federated domain adaptation, uses self-attention transformer for multi-template RoI feature extraction, and down-weights uncertain samples to mitigate distribution shifts.
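
The uncertainty-guided down-weighting can be pictured as a weighted loss. Below is a PyTorch sketch using predictive entropy as the uncertainty proxy; the paper's actual UQ estimator and weighting function are not specified in the summary, so both are assumptions.

```python
import torch
import torch.nn.functional as F

def uncertainty_weighted_loss(logits: torch.Tensor,
                              targets: torch.Tensor) -> torch.Tensor:
    """Down-weight high-uncertainty samples during adaptation.

    Uses predictive entropy as a stand-in uncertainty measure; confident
    samples receive weight close to 1, uncertain ones are suppressed.
    """
    probs = logits.softmax(dim=1)
    entropy = -(probs * probs.clamp_min(1e-12).log()).sum(dim=1)
    weight = torch.exp(-entropy)
    per_sample = F.cross_entropy(logits, targets, reduction="none")
    return (weight.detach() * per_sample).mean()
```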

Result: Achieved consistent cross-domain improvements across three classification tasks on ADNI, AIBL, and OASIS datasets. Best results: 90.54% accuracy for AD vs NC on ADNI, 80.20% for MCI vs AD on ADNI, and 83.73% for NC vs MCI on OASIS.

Conclusion: UG-FedDA effectively handles cross-site MRI heterogeneity under privacy constraints, demonstrating efficient adaptation across multiple sites while preserving strict privacy.

Abstract: Alzheimer’s disease (AD) is an irreversible neurodegenerative disorder, and early diagnosis is critical for timely intervention. However, most existing classification frameworks face challenges in multicenter studies, as they often neglect inter-site heterogeneity and lack mechanisms to quantify uncertainty, which limits their robustness and clinical applicability. To address these issues, we propose Uncertainty-Guided Federated Domain Adaptation (UG-FedDA), a novel multicenter AD classification framework that integrates uncertainty quantification (UQ) with federated domain adaptation to handle cross-site structural magnetic resonance imaging (MRI) heterogeneity under privacy constraints. Our approach extracts multi-template region-of-interest (RoI) features using a self-attention transformer, capturing both regional representations and their interactions. UQ is integrated to guide feature alignment, mitigating source-target distribution shifts by down-weighting uncertain samples. Experiments are conducted on three public datasets: the Alzheimer’s Disease Neuroimaging Initiative (ADNI), the Australian Imaging, Biomarkers and Lifestyle study (AIBL), and the Open Access Series of Imaging Studies (OASIS). UG-FedDA achieved consistent cross-domain improvements in accuracy, sensitivity, and area under the ROC curve across three classification tasks: AD vs. normal controls (NC), mild cognitive impairment (MCI) vs. AD, and NC vs. MCI. For NC vs. AD, UG-FedDA achieves accuracies of 90.54%, 89.04%, and 77.78% on ADNI, AIBL and OASIS datasets, respectively. For MCI vs. AD, accuracies are 80.20% (ADNI), 71.91% (AIBL), and 79.73% (OASIS). For NC vs. MCI, results are 76.87% (ADNI), 73.91% (AIBL), and 83.73% (OASIS). These results demonstrate that the proposed framework not only adapts efficiently across multiple sites but also preserves strict privacy.

[127] VRSA: Jailbreaking Multimodal Large Language Models through Visual Reasoning Sequential Attack

Shiji Zhao, Shukun Xiong, Yao Huang, Yan Jin, Zhenyu Wu, Jiyang Guan, Ranjie Duan, Jialing Tao, Hui Xue, Xingxing Wei

Main category: cs.CV

TL;DR: VRSA is a novel jailbreak attack method that decomposes harmful text queries into sequential sub-images to exploit visual reasoning vulnerabilities in MLLMs, achieving higher attack success rates than existing methods.

DetailsMotivation: While previous jailbreak attacks have focused on text reasoning vulnerabilities, visual reasoning threats in MLLMs have been largely overlooked. The paper aims to fully evaluate potential safety risks in visual reasoning tasks by exploring how multimodal models can be manipulated through sequential visual inputs.

Method: VRSA decomposes harmful text queries into sequentially related sub-images that gradually externalize harmful intent. It uses three key techniques: Adaptive Scene Refinement to optimize relevant scenes, Semantic Coherent Completion to iteratively rewrite sub-texts with contextual information, and Text-Image Consistency Alignment to maintain semantic consistency across modalities.

Result: VRSA achieves higher attack success rates compared to state-of-the-art jailbreak attack methods on both open-source and closed-source MLLMs, including GPT-4o and Claude-4.5-Sonnet, demonstrating significant vulnerabilities in visual reasoning tasks.

Conclusion: The paper reveals critical safety vulnerabilities in MLLMs’ visual reasoning capabilities and proposes VRSA as an effective attack method, highlighting the need for improved safety mechanisms against multimodal jailbreak attacks.

Abstract: Multimodal Large Language Models (MLLMs) are widely used in various fields due to their powerful cross-modal comprehension and generation capabilities. However, more modalities bring more vulnerabilities that can be exploited for jailbreak attacks, which induce MLLMs to output harmful content. Due to the strong reasoning ability of MLLMs, previous jailbreak attacks have explored reasoning safety risks in the text modality, while similar threats have been largely overlooked in the visual modality. To fully evaluate potential safety risks in the visual reasoning task, we propose the Visual Reasoning Sequential Attack (VRSA), which induces MLLMs to gradually externalize and aggregate complete harmful intent by decomposing the original harmful text into several sequentially related sub-images. In particular, to enhance the rationality of the scene in the image sequence, we propose Adaptive Scene Refinement to optimize the scene most relevant to the original harmful query. To ensure the semantic continuity of the generated images, we propose Semantic Coherent Completion to iteratively rewrite each sub-text combined with contextual information in this scene. In addition, we propose Text-Image Consistency Alignment to maintain semantic consistency across modalities. A series of experiments demonstrates that VRSA can achieve a higher attack success rate compared with state-of-the-art jailbreak attack methods on both open-source and closed-source MLLMs such as GPT-4o and Claude-4.5-Sonnet.

[128] Edit-aware RAW Reconstruction

Abhijith Punnappurath, Luxi Zhao, Ke Zhao, Hue Nguyen, Radek Grzeszczuk, Michael S. Brown

Main category: cs.CV

TL;DR: A plug-and-play edit-aware loss function that improves RAW reconstruction from sRGB images by making recovered RAWs more robust to different rendering styles and edits through a differentiable ISP simulator.

DetailsMotivation: Users often edit camera images post-capture, but while RAW editing offers better flexibility, most editing happens on sRGB JPEGs since RAW files are rarely stored. Existing RAW reconstruction methods focus on pixel-wise fidelity but degrade under diverse rendering styles and editing operations.

Method: Introduces an edit-aware loss function with a modular, differentiable image signal processor (ISP) that simulates realistic photofinishing pipelines with tunable parameters. During training, ISP parameters are randomly sampled from distributions modeling real camera variations, and loss is computed in sRGB space between ground-truth and reconstructed RAWs rendered through this ISP.
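
A toy version of the idea in PyTorch: render both RAWs through a tiny differentiable "ISP" with randomly sampled parameters and compare in sRGB space. The two-stage gain-plus-gamma pipeline here is a drastic simplification of the paper's modular ISP, and all parameter ranges are invented.

```python
import torch
import torch.nn.functional as F

def toy_isp(raw: torch.Tensor, gain: torch.Tensor,
            gamma: torch.Tensor) -> torch.Tensor:
    """Tiny differentiable photofinishing stand-in: exposure gain
    followed by a gamma curve; the paper's ISP is modular and far richer."""
    return (raw * gain).clamp(1e-6, 1.0) ** gamma

def edit_aware_loss(raw_gt: torch.Tensor, raw_pred: torch.Tensor) -> torch.Tensor:
    # Randomly sample rendering parameters each training step, so the
    # reconstructed RAW must hold up under many renderings.
    gain = torch.empty(1).uniform_(0.5, 2.0)
    gamma = torch.empty(1).uniform_(1 / 3, 1.0)
    return F.l1_loss(toy_isp(raw_pred, gain, gamma),
                     toy_isp(raw_gt, gain, gamma))
```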

Result: Improves sRGB reconstruction quality by up to 1.5-2 dB PSNR across various editing conditions. When applied to metadata-assisted RAW reconstruction methods, enables fine-tuning for target edits with further gains.

Conclusion: Provides a simple yet effective general mechanism for enhancing edit fidelity and rendering flexibility across existing RAW reconstruction methods, addressing the primary motivation for RAW reconstruction in consumer imaging.

Abstract: Users frequently edit camera images post-capture to achieve their preferred photofinishing style. While editing in the RAW domain provides greater accuracy and flexibility, most edits are performed on the camera’s display-referred output (e.g., 8-bit sRGB JPEG) since RAW images are rarely stored. Existing RAW reconstruction methods can recover RAW data from sRGB images, but these approaches are typically optimized for pixel-wise RAW reconstruction fidelity and tend to degrade under diverse rendering styles and editing operations. We introduce a plug-and-play, edit-aware loss function that can be integrated into any existing RAW reconstruction framework to make the recovered RAWs more robust to different rendering styles and edits. Our loss formulation incorporates a modular, differentiable image signal processor (ISP) that simulates realistic photofinishing pipelines with tunable parameters. During training, parameters for each ISP module are randomly sampled from carefully designed distributions that model practical variations in real camera processing. The loss is then computed in sRGB space between ground-truth and reconstructed RAWs rendered through this differentiable ISP. Incorporating our loss improves sRGB reconstruction quality by up to 1.5-2 dB PSNR across various editing conditions. Moreover, when applied to metadata-assisted RAW reconstruction methods, our approach enables fine-tuning for target edits, yielding further gains. Since photographic editing is the primary motivation for RAW reconstruction in consumer imaging, our simple yet effective loss function provides a general mechanism for enhancing edit fidelity and rendering flexibility across existing methods.

[129] Underwater Image Reconstruction Using a Swin Transformer-Based Generator and PatchGAN Discriminator

Md. Mahbub Hasan Akash, Aria Tasnim Mridula, Sheekar Banerjee, Ishtiak Al Mamoon

Main category: cs.CV

TL;DR: A Swin Transformer-based GAN framework achieves state-of-the-art underwater image reconstruction by addressing color distortion, low contrast, and haze effects through global dependency modeling.

DetailsMotivation: Underwater imaging suffers from severe degradation due to water's wavelength-dependent absorption and scattering, causing color distortion, low contrast, and haze. Traditional methods and CNNs fail to adequately address these challenges due to limited receptive fields and inability to model global dependencies.

Method: A novel deep learning framework integrating Swin Transformer architecture within a GAN. The generator uses a U-Net structure with Swin Transformer blocks to capture both local features and long-range dependencies for color correction. A PatchGAN discriminator provides adversarial training for high-frequency detail preservation.
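
The discriminator side is a standard component; below is a minimal pix2pix-style PatchGAN in PyTorch, which outputs a grid of per-patch real/fake logits rather than a single score. Channel widths follow the common 64-128-256-512 convention; the paper's exact configuration is not given in the summary.

```python
import torch.nn as nn

def patchgan_discriminator(in_ch: int = 3) -> nn.Sequential:
    """A 70x70-style PatchGAN: (B, in_ch, H, W) -> (B, 1, h, w) patch logits."""
    def block(i, o, stride, norm=True):
        layers = [nn.Conv2d(i, o, 4, stride, 1)]
        if norm:
            layers.append(nn.InstanceNorm2d(o))
        layers.append(nn.LeakyReLU(0.2, inplace=True))
        return layers

    return nn.Sequential(
        *block(in_ch, 64, 2, norm=False),
        *block(64, 128, 2),
        *block(128, 256, 2),
        *block(256, 512, 1),
        nn.Conv2d(512, 1, 4, 1, 1),  # one real/fake logit per patch
    )
```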

Result: State-of-the-art performance on EUVP dataset with PSNR of 24.76 dB and SSIM of 0.89, significantly outperforming existing methods. Visual results show effective color balance restoration, contrast improvement, and haze reduction. Ablation study confirms superiority of Swin Transformer design over convolutional alternatives.

Conclusion: The proposed Swin Transformer-based GAN offers robust underwater image reconstruction suitable for various marine applications, effectively addressing the limitations of traditional methods through global dependency modeling and adversarial training.

Abstract: Underwater imaging is essential for marine exploration, environmental monitoring, and infrastructure inspection. However, water causes severe image degradation through wavelength-dependent absorption and scattering, resulting in color distortion, low contrast, and haze effects. Traditional reconstruction methods and convolutional neural network-based approaches often fail to adequately address these challenges due to limited receptive fields and inability to model global dependencies. This paper presents a novel deep learning framework that integrates a Swin Transformer architecture within a generative adversarial network (GAN) for underwater image reconstruction. Our generator employs a U-Net structure with Swin Transformer blocks to capture both local features and long-range dependencies crucial for color correction across entire images. A PatchGAN discriminator provides adversarial training to ensure high-frequency detail preservation. We trained and evaluated our model on the EUVP dataset, which contains paired underwater images of varying quality. Quantitative results demonstrate state-of-the-art performance with PSNR of 24.76 dB and SSIM of 0.89, representing significant improvements over existing methods. Visual results show effective color balance restoration, contrast improvement, and haze reduction. An ablation study confirms the superiority of our Swin Transformer design over convolutional alternatives. The proposed method offers robust underwater image reconstruction suitable for various marine applications.

[130] SCAIL: Towards Studio-Grade Character Animation via In-Context Learning of 3D-Consistent Pose Representations

Wenhao Yan, Sheng Ye, Zhuoyi Yang, Jiayan Teng, ZhenHui Dong, Kairui Wen, Xiaotao Gu, Yong-Jin Liu, Jie Tang

Main category: cs.CV

TL;DR: SCAIL is a framework for studio-grade character animation using in-context learning with 3D pose representation and full-context pose injection in diffusion-transformer architecture.

DetailsMotivation: Existing motion transfer approaches often fail to preserve structural fidelity and temporal consistency in wild scenarios with complex motion and cross-identity animations, making studio-grade production challenging.

Method: Two key innovations: 1) Novel 3D pose representation for robust motion signals, 2) Full-context pose injection mechanism within diffusion-transformer architecture for spatio-temporal reasoning over full motion sequences. Also includes curated data pipeline for quality.

Result: SCAIL achieves state-of-the-art performance and advances character animation toward studio-grade reliability and realism, as validated through comprehensive benchmarking.

Conclusion: The framework successfully addresses challenges in character animation by combining 3D pose representation with in-context learning, enabling high-quality motion transfer that meets production standards.

Abstract: Achieving character animation that meets studio-grade production standards remains challenging despite recent progress. Existing approaches can transfer motion from a driving video to a reference image, but often fail to preserve structural fidelity and temporal consistency in wild scenarios involving complex motion and cross-identity animations. In this work, we present SCAIL (Studio-grade Character Animation via In-context Learning), a framework designed to address these challenges through two key innovations. First, we propose a novel 3D pose representation, providing a more robust and flexible motion signal. Second, we introduce a full-context pose injection mechanism within a diffusion-transformer architecture, enabling effective spatio-temporal reasoning over full motion sequences. To align with studio-level requirements, we develop a curated data pipeline ensuring both diversity and quality, and establish a comprehensive benchmark for systematic evaluation. Experiments show that SCAIL achieves state-of-the-art performance and advances character animation toward studio-grade reliability and realism.

[131] NICE: Neural Implicit Craniofacial Model for Orthognathic Surgery Prediction

Jiawen Yang, Yihui Cao, Xuanyu Tian, Yuyao Zhang, Hongjiang Wei

Main category: cs.CV

TL;DR: NICE uses implicit neural representations to predict facial appearance after orthognathic surgery, outperforming existing methods in accuracy and anatomical preservation.

DetailsMotivation: Accurate prediction of postoperative facial appearance in orthognathic surgery is challenging due to complex nonlinear interactions between skeletal movements and facial soft tissue. Existing methods lack computational efficiency or fail to capture these intricate interactions.

Method: Neural Implicit Craniofacial Model (NICE) uses implicit neural representations with two modules: 1) shape module with region-specific implicit SDF decoders to reconstruct facial surface, maxilla, and mandible; 2) surgery module with region-specific deformation decoders driven by shared surgical latent code to model nonlinear biomechanical responses to skeletal movements.
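
To make the "region-specific implicit SDF decoder" concrete, here is a minimal PyTorch sketch of one such decoder mapping a 3D query point plus a shape latent code to a signed distance. Layer sizes and the conditioning scheme are assumptions; NICE would instantiate one decoder per region (facial surface, maxilla, mandible).

```python
import torch
import torch.nn as nn

class SDFDecoder(nn.Module):
    """Minimal implicit SDF decoder: (xyz, shape latent) -> signed distance."""

    def __init__(self, latent_dim: int = 128, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(3 + latent_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, xyz: torch.Tensor, z: torch.Tensor) -> torch.Tensor:
        # Share one latent code across all (N, 3) query points.
        z = z.expand(xyz.shape[0], -1)
        return self.net(torch.cat([xyz, z], dim=-1)).squeeze(-1)
```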

Result: NICE outperforms state-of-the-art methods, notably improving prediction accuracy in critical facial regions (lips and chin) while robustly preserving anatomical integrity.

Conclusion: NICE provides a clinically viable tool for enhanced surgical planning and patient consultation in orthognathic procedures by accurately predicting surgical outcomes through implicit neural representations.

Abstract: Orthognathic surgery is a crucial intervention for correcting dentofacial skeletal deformities to enhance occlusal functionality and facial aesthetics. Accurate postoperative facial appearance prediction remains challenging due to the complex nonlinear interactions between skeletal movements and facial soft tissue. Existing biomechanical, parametric models and deep-learning approaches either lack computational efficiency or fail to fully capture these intricate interactions. To address these limitations, we propose Neural Implicit Craniofacial Model (NICE) which employs implicit neural representations for accurate anatomical reconstruction and surgical outcome prediction. NICE comprises a shape module, which employs region-specific implicit Signed Distance Function (SDF) decoders to reconstruct the facial surface, maxilla, and mandible, and a surgery module, which employs region-specific deformation decoders. These deformation decoders are driven by a shared surgical latent code to effectively model the complex, nonlinear biomechanical response of the facial surface to skeletal movements, incorporating anatomical prior knowledge. The deformation decoders output point-wise displacement fields, enabling precise modeling of surgical outcomes. Extensive experiments demonstrate that NICE outperforms current state-of-the-art methods, notably improving prediction accuracy in critical facial regions such as lips and chin, while robustly preserving anatomical integrity. This work provides a clinically viable tool for enhanced surgical planning and patient consultation in orthognathic procedures.

[132] LPD: Learnable Prototypes with Diversity Regularization for Weakly Supervised Histopathology Segmentation

Khang Le, Anh Mai Vu, Thi Kim Trang Vo, Ha Thach, Ngoc Bui Lam Quang, Thanh-Huy Nguyen, Minh H. N. Le, Zhu Han, Chandra Mohan, Hien Van Nguyen

Main category: cs.CV

TL;DR: Proposes a cluster-free, one-stage learnable-prototype framework with diversity regularization for weakly supervised semantic segmentation in histopathology, addressing limitations of two-stage clustering-based approaches.

DetailsMotivation: Existing WSSS methods in histopathology face challenges: inter-class homogeneity, intra-class heterogeneity, and CAM-induced region shrinkage. Two-stage clustering-based approaches are costly, hyperparameter-sensitive, and decouple prototype discovery from segmentation learning, limiting effectiveness and efficiency.

Method: Cluster-free, one-stage learnable-prototype framework with diversity regularization to enhance morphological intra-class heterogeneity coverage. Unlike previous methods that construct clustering prototype banks and refine masks separately, this approach integrates prototype learning directly into the segmentation process.
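
One plausible form of the diversity regularizer, sketched in PyTorch: penalize pairwise cosine similarity among a class's learnable prototypes so they spread over intra-class morphological variation. The exact formulation in the paper may differ.

```python
import torch
import torch.nn.functional as F

def prototype_diversity_loss(prototypes: torch.Tensor) -> torch.Tensor:
    """Push a class's learnable prototypes apart.

    prototypes: (K, D) learnable prototype vectors for one class.
    Penalizes positive pairwise cosine similarity between distinct prototypes.
    """
    p = F.normalize(prototypes, dim=-1)
    sim = p @ p.T                                        # (K, K) cosine matrix
    off_diag = sim - torch.eye(len(p), device=p.device)  # drop self-similarity
    return off_diag.clamp_min(0).pow(2).mean()
```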

Result: Achieves state-of-the-art performance on BCSS-WSSS dataset, outperforming prior methods in mIoU and mDice. Qualitative segmentation maps show sharper boundaries and fewer mislabels. Activation heatmaps reveal that learnable prototypes cover more diverse and complementary regions within each class compared to clustering-based prototypes.

Conclusion: The proposed one-stage learnable-prototype framework with diversity regularization effectively addresses limitations of existing WSSS methods in histopathology, providing better coverage of intra-class heterogeneity while being more efficient and less sensitive to hyperparameters than two-stage clustering approaches.

Abstract: Weakly supervised semantic segmentation (WSSS) in histopathology reduces pixel-level labeling by learning from image-level labels, but it is hindered by inter-class homogeneity, intra-class heterogeneity, and CAM-induced region shrinkage (global pooling-based class activation maps whose activations highlight only the most distinctive areas and miss nearby class regions). Recent works address these challenges by constructing a clustering prototype bank and then refining masks in a separate stage; however, such two-stage pipelines are costly, sensitive to hyperparameters, and decouple prototype discovery from segmentation learning, limiting their effectiveness and efficiency. We propose a cluster-free, one-stage learnable-prototype framework with diversity regularization to enhance morphological intra-class heterogeneity coverage. Our approach achieves state-of-the-art (SOTA) performance on BCSS-WSSS, outperforming prior methods in mIoU and mDice. Qualitative segmentation maps show sharper boundaries and fewer mislabels, and activation heatmaps further reveal that, compared with clustering-based prototypes, our learnable prototypes cover more diverse and complementary regions within each class, providing consistent qualitative evidence for their effectiveness.

[133] A Comparative Study on Synthetic Facial Data Generation Techniques for Face Recognition

Pedro Vidal, Bernardo Biesseck, Luiz E. L. Coelho, Roger Granada, David Menotti

Main category: cs.CV

TL;DR: This paper compares synthetic facial datasets generated using different techniques (diffusion models, GANs, 3D models) for facial recognition tasks, evaluating their performance against real data across multiple metrics and datasets.

DetailsMotivation: Facial recognition faces challenges in explainability, demographic bias, privacy, and robustness to variations. Privacy regulations have degraded datasets, raising legal/ethical concerns. Synthetic data offers a solution by mitigating privacy issues, enabling controlled attribute experimentation, alleviating bias, and supplementing real data.

Method: Comparative evaluation of synthetic facial datasets generated using different techniques (diffusion models, GANs, 3D models) for facial recognition tasks. Performance measured using accuracy, rank-1, rank-5, and true positive rate at false positive rate of 0.01% across eight leading datasets.
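
The reported metrics are straightforward to compute from a similarity matrix; below is a small numpy sketch of rank-k accuracy and TPR at a fixed FPR (0.01% in the paper). Function names and the quantile-based thresholding are illustrative.

```python
import numpy as np

def rank_k_accuracy(sims: np.ndarray, gallery_ids: np.ndarray,
                    probe_ids: np.ndarray, k: int = 1) -> float:
    """sims: (P, G) probe-to-gallery similarity matrix."""
    top_k = np.argsort(-sims, axis=1)[:, :k]
    hits = [(probe_ids[i] == gallery_ids[top_k[i]]).any()
            for i in range(len(probe_ids))]
    return float(np.mean(hits))

def tpr_at_fpr(genuine: np.ndarray, impostor: np.ndarray,
               fpr: float = 1e-4) -> float:
    """TPR at a fixed FPR: threshold chosen from the impostor score quantile."""
    thresh = np.quantile(impostor, 1.0 - fpr)
    return float((genuine >= thresh).mean())
```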

Result: Synthetic data can capture realistic variations, but there remains a performance gap compared to real data. Techniques like diffusion models, GANs, and 3D models show substantial progress, though challenges persist in matching real data performance.

Conclusion: Synthetic facial data generation is a promising solution for privacy, bias, and data scarcity issues in facial recognition, but further research is needed to close the performance gap with real data and address remaining challenges in synthetic data generation techniques.

Abstract: Facial recognition has become a widely used method for authentication and identification, with applications for secure access and locating missing persons. Its success is largely attributed to deep learning, which leverages large datasets and effective loss functions to learn discriminative features. Despite these advances, facial recognition still faces challenges in explainability, demographic bias, privacy, and robustness to aging, pose variations, lighting changes, occlusions, and facial expressions. Privacy regulations have also led to the degradation of several datasets, raising legal, ethical, and privacy concerns. Synthetic facial data generation has been proposed as a promising solution. It mitigates privacy issues, enables experimentation with controlled facial attributes, alleviates demographic bias, and provides supplementary data to improve models trained on real data. This study compares the effectiveness of synthetic facial datasets generated using different techniques in facial recognition tasks. We evaluate accuracy, rank-1, rank-5, and the true positive rate at a false positive rate of 0.01% on eight leading datasets, offering a comparative analysis not extensively explored in the literature. Results demonstrate the ability of synthetic data to capture realistic variations while emphasizing the need for further research to close the performance gap with real data. Techniques such as diffusion models, GANs, and 3D models show substantial progress; however, challenges remain.

[134] Synset Signset Germany: a Synthetic Dataset for German Traffic Sign Recognition

Anne Sielemann, Lena Loercher, Max-Lion Schumacher, Stefan Wolf, Masoud Roschani, Jens Ziehn

Main category: cs.CV

TL;DR: Synthetic traffic sign dataset combining GAN-based texture generation for realistic dirt/wear with analytical scene modeling for physically correct lighting, enabling XAI and robustness testing applications.

DetailsMotivation: To create a synthetic traffic sign recognition dataset that combines data-driven realism with analytical parameterization for explainable AI and robustness testing, addressing the need for rare traffic signs and detailed metadata.

Method: Combines GAN-based texture generation for realistic dirt and wear artifacts with analytical scene modulation for physically correct lighting. Creates a synthesis pipeline that allows detailed parameterization of environmental and imaging effects.

Result: Produced Synset Signset Germany dataset containing 105,500 images of 211 different German traffic sign classes, including rare 2020 signs. Provides masks, segmentation images, and extensive metadata with stochastic environment/imaging parameters.

Conclusion: The hybrid approach successfully creates realistic synthetic traffic sign data suitable for training/testing, with applications in explainable AI and robustness testing due to parameter sensitivity analysis capabilities.

Abstract: In this paper, we present a synthesis pipeline and dataset providing training and testing data for the task of traffic sign recognition that combines the advantages of data-driven and analytical modeling: GAN-based texture generation enables data-driven dirt and wear artifacts, rendering unique and realistic traffic sign surfaces, while the analytical scene modulation achieves physically correct lighting and allows detailed parameterization. In particular, the latter opens up applications in the context of explainable AI (XAI) and robustness tests due to the possibility of evaluating the sensitivity to parameter changes, which we demonstrate with experiments. Our resulting synthetic traffic sign recognition dataset, Synset Signset Germany, contains a total of 105,500 images of 211 different German traffic sign classes, including newly published (2020) and thus comparatively rare traffic signs. In addition to a mask and a segmentation image, we also provide extensive metadata, including the stochastically selected environment and imaging effect parameters, for each image. We evaluate the degree of realism of Synset Signset Germany on the real-world German Traffic Sign Recognition Benchmark (GTSRB) and in comparison to CATERED, a state-of-the-art synthetic traffic sign recognition dataset.

[135] EditThinker: Unlocking Iterative Reasoning for Any Image Editor

Hongyu Li, Manyuan Zhang, Dian Zheng, Ziyu Guo, Yimeng Jia, Kaituo Feng, Hao Yu, Yexin Liu, Yan Feng, Peng Pei, Xunliang Cai, Linjiang Huang, Hongsheng Li, Si Liu

Main category: cs.CV

TL;DR: A deliberative editing framework called EditThinker that improves instruction-based image editing by simulating human cognitive loops through iterative Think-while-Edit cycles.

DetailsMotivation: Existing instruction-based image editing approaches have limited single-turn success rates due to inherent stochasticity and lack of deliberation, despite achieving high aesthetic quality through foundation models.

Method: Proposes a deliberative editing framework with iterative Think-while-Edit cycles: Critiquing results and Refining instructions, then Repeating generation. Trains a single MLLM (EditThinker) as reasoning engine to produce critique scores, reasoning processes, and refined instructions using reinforcement learning to align thinking with editing.
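
The Think-while-Edit loop reduces to a short control flow. The sketch below shows the cycle with `editor` and `thinker` as hypothetical callables standing in for the diffusion editor and the EditThinker MLLM; their interfaces are assumptions.

```python
def think_while_edit(editor, thinker, image, instruction,
                     max_rounds: int = 3, accept: float = 0.8):
    """Critique -> refine -> regenerate loop wrapped around any editor."""
    for _ in range(max_rounds):
        edited = editor(image, instruction)
        # The thinker jointly returns a critique score, its reasoning,
        # and a refined instruction, as described in the summary.
        score, _reasoning, refined = thinker(image, edited, instruction)
        if score >= accept:          # critique deems the edit satisfactory
            return edited
        instruction = refined        # retry with the improved instruction
    return edited
```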

Result: Extensive experiments on four benchmarks demonstrate significant improvement in instruction-following capability of any image editing model by a large margin.

Conclusion: The deliberative editing framework with EditThinker effectively addresses the instruction-following challenge in image editing by simulating human cognitive processes, and the authors plan to release data construction framework, datasets, and models to benefit the community.

Abstract: Instruction-based image editing has emerged as a prominent research area which, benefiting from image generation foundation models, has achieved high aesthetic quality, making instruction-following capability the primary challenge. Existing approaches improve instruction adherence via supervised or reinforcement learning, yet single-turn success rates remain limited due to inherent stochasticity and a lack of deliberation. In this work, we propose a deliberative editing framework that lets editing models ’think’ while they edit, simulating the human cognitive loop by iteratively executing a Think-while-Edit cycle: Critiquing results and Refining instructions, followed by Repeating the generation until satisfactory. Specifically, we train a single MLLM, EditThinker, to act as the reasoning engine of this framework, which jointly produces the critique score, reasoning process, and refined instructions. We employ reinforcement learning to align the EditThinker’s thinking with its editing, thereby generating more targeted instruction improvements. Extensive experiments on four benchmarks demonstrate that our approach significantly improves the instruction-following capability of any image editing model by a large margin. We will release our data construction framework, datasets, and models to benefit the community.

[136] A Scene-aware Models Adaptation Scheme for Cross-scene Online Inference on Mobile Devices

Yunzhe Li, Hongzi Zhu, Zhuohong Deng, Yunlong Cheng, Zimu Zheng, Liang Zhang, Shan Chang, Minyi Guo

Main category: cs.CV

TL;DR: Anole: A lightweight scheme for adaptive DNN model selection on mobile devices to handle unfamiliar test samples in AIoT applications by training scene-specific compact models and intelligently selecting the best-fit model for each test sample.

DetailsMotivation: AIoT applications need online DNN prediction on mobile devices, but device movement causes unfamiliar test samples that degrade pre-trained model accuracy. Unstable networks require local inference, but existing approaches using single large DNNs are inefficient for diverse scenes.

Method: 1) Create an army of compact scene-specific DNN models by automatically identifying model-friendly scenes using weakly-supervised scene representation learning combining human heuristics and feature similarity. 2) Train a model classifier to predict the best-fit scene-specific model for each test sample. 3) Adaptively select the optimal model for online inference on mobile devices.
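
At inference time the selection step is simple; a minimal PyTorch sketch follows, with `scene_classifier` and `scene_models` as hypothetical stand-ins for the trained model classifier and the army of compact scene-specific DNNs.

```python
import torch

def anole_infer(x: torch.Tensor, scene_classifier, scene_models):
    """Pick the best-fit scene-specific compact model for one test sample.

    Assumes x is a single-sample batch; scene_models maps a scene index
    to a compact DNN trained for that scene.
    """
    with torch.no_grad():
        scene = int(scene_classifier(x).argmax(dim=-1))
        return scene_models[scene](x)
```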

Result: Implemented on various mobile devices with UAV experiments. Anole outperforms using a single versatile large DNN: 4.5% higher prediction accuracy, 33.1% faster response time, and 45.1% lower power consumption.

Conclusion: Anole effectively addresses the challenge of unfamiliar test samples in mobile AIoT by enabling adaptive, efficient local DNN inference through scene-specific model selection, significantly improving accuracy, speed, and energy efficiency compared to traditional approaches.

Abstract: Emerging Artificial Intelligence of Things (AIoT) applications desire online prediction using deep neural network (DNN) models on mobile devices. However, due to the movement of devices, unfamiliar test samples constantly appear, significantly affecting the prediction accuracy of a pre-trained DNN. In addition, unstable network connections call for local model inference. In this paper, we propose a lightweight scheme, called Anole, for local DNN model inference on mobile devices. The core idea of Anole is to first establish an army of compact DNN models, and then adaptively select the model that best fits the current test sample for online inference. The key is to automatically identify model-friendly scenes for training scene-specific DNN models. To this end, we design a weakly-supervised scene representation learning algorithm by combining both human heuristics and feature similarity in separating scenes. Moreover, we further train a model classifier to predict the best-fit scene-specific DNN model for each test sample. We implement Anole on different types of mobile devices and conduct extensive trace-driven and real-world experiments based on unmanned aerial vehicles (UAVs). The results demonstrate that Anole outperforms the approach of using a single versatile large DNN in terms of prediction accuracy (4.5% higher), response time (33.1% faster) and power consumption (45.1% lower).

[137] SOAP: Enhancing Spatio-Temporal Relation and Motion Information Capturing for Few-Shot Action Recognition

Wenbo Huang, Jinghui Zhang, Xuwei Qian, Zhen Wu, Meng Wang, Lei Zhang

Main category: cs.CV

TL;DR: SOAP is a plug-and-play architecture for few-shot action recognition that enhances spatio-temporal feature learning through frame tuples and integrated feature channels, achieving SOTA performance on multiple benchmarks.

DetailsMotivation: Traditional few-shot action recognition methods separate spatial and temporal features through temporal alignment after spatial extraction, and capture motion information narrowly between adjacent frames without considering density, leading to insufficient motion information capture. Real-world scenarios often lack sufficient video samples for data-driven training.

Method: Proposes SOAP (Spatio-tempOral frAme tuPle enhancer) architecture with SOAP-Net model. Instead of simple feature extraction, it considers temporal connections between different feature channels and spatio-temporal relations. Uses frame tuples with multiple frames (containing more motion information than adjacent frames) and combines frame tuples of diverse frame counts for broader perspective.
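
Frame-tuple construction is easy to picture in code. Below is a minimal PyTorch sketch that slices a clip into overlapping tuples; combining tuples of several lengths, as SOAP does, amounts to calling this with different `tuple_len` values. Names and defaults are illustrative.

```python
import torch

def frame_tuples(video: torch.Tensor, tuple_len: int, stride: int = 1) -> torch.Tensor:
    """Slice a (T, C, H, W) clip into overlapping frame tuples.

    Returns an (M, tuple_len, C, H, W) tensor; tuples with several frames
    carry more motion information than adjacent-frame pairs.
    """
    T = video.shape[0]
    return torch.stack([video[t:t + tuple_len]
                        for t in range(0, T - tuple_len + 1, stride)])
```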

Result: SOAP-Net achieves new state-of-the-art performance across benchmarks including SthSthV2, Kinetics, UCF101, and HMDB51. Extensive evaluations demonstrate competitiveness, pluggability, generalization, and robustness.

Conclusion: SOAP provides an effective plug-and-play solution for few-shot action recognition by better integrating spatio-temporal features and capturing comprehensive motion information through frame tuples, addressing limitations of existing approaches.

Abstract: High frame-rate (HFR) videos improve fine-grained expression for action recognition while reducing the density of spatio-temporal relations and motion information. Thus, large amounts of video samples are continuously required for traditional data-driven training. However, samples are not always sufficient in real-world scenarios, motivating few-shot action recognition (FSAR) research. We observe that most recent FSAR works build spatio-temporal relations of video samples via temporal alignment after spatial feature extraction, cutting apart spatial and temporal features within samples. They also capture motion information from narrow perspectives between adjacent frames without considering density, leading to insufficient motion information capture. Therefore, we propose a novel plug-and-play architecture for FSAR called Spatio-tempOral frAme tuPle enhancer (SOAP) in this paper. The model we design with this architecture is referred to as SOAP-Net. Temporal connections between different feature channels and spatio-temporal relations of features are considered instead of simple feature extraction. Comprehensive motion information is also captured, using frame tuples with multiple frames containing more motion information than adjacent frames. Combining frame tuples of diverse frame counts further provides a broader perspective. SOAP-Net achieves new state-of-the-art performance across well-known benchmarks such as SthSthV2, Kinetics, UCF101, and HMDB51. Extensive empirical evaluations underscore the competitiveness, pluggability, generalization, and robustness of SOAP. The code is released at https://github.com/wenbohuang1002/SOAP.

[138] Multi-Modal Data-Efficient 3D Scene Understanding for Autonomous Driving

Lingdong Kong, Xiang Xu, Jiawei Ren, Wenwei Zhang, Liang Pan, Kai Chen, Wei Tsang Ooi, Ziwei Liu

Main category: cs.CV

TL;DR: LaserMix++ is a semi-supervised learning framework for LiDAR semantic segmentation that uses multi-modal data (LiDAR + camera) and language guidance to reduce annotation requirements while improving 3D scene understanding.

DetailsMotivation: Human-annotated LiDAR point clouds are expensive and challenging to obtain, limiting fully supervised methods. There's a need for more data-efficient approaches that can leverage unlabeled data and multi-sensor information for autonomous driving applications.

Method: LaserMix++ integrates: 1) multi-modal LaserMix operation for cross-sensor interactions, 2) camera-to-LiDAR feature distillation to enhance LiDAR feature learning, and 3) language-driven knowledge guidance using open-vocabulary models for auxiliary supervision. The framework manipulates laser beams from different LiDAR scans and incorporates LiDAR-camera correspondences.

Result: The framework achieves comparable accuracy to fully supervised methods with 5x fewer annotations and significantly improves supervised-only baselines. It’s validated on popular driving perception datasets and works across different LiDAR representations.

Conclusion: LaserMix++ demonstrates the potential of semi-supervised approaches to reduce reliance on extensive labeled data in LiDAR-based 3D scene understanding, offering a universally applicable solution for autonomous driving perception systems.

Abstract: Efficient data utilization is crucial for advancing 3D scene understanding in autonomous driving, where reliance on heavily human-annotated LiDAR point clouds challenges fully supervised methods. Addressing this, our study extends into semi-supervised learning for LiDAR semantic segmentation, leveraging the intrinsic spatial priors of driving scenes and multi-sensor complements to augment the efficacy of unlabeled datasets. We introduce LaserMix++, an evolved framework that integrates laser beam manipulations from disparate LiDAR scans and incorporates LiDAR-camera correspondences to further assist data-efficient learning. Our framework is tailored to enhance 3D scene consistency regularization by incorporating multi-modality, including 1) multi-modal LaserMix operation for fine-grained cross-sensor interactions; 2) camera-to-LiDAR feature distillation that enhances LiDAR feature learning; and 3) language-driven knowledge guidance generating auxiliary supervisions using open-vocabulary models. The versatility of LaserMix++ enables applications across LiDAR representations, establishing it as a universally applicable solution. Our framework is rigorously validated through theoretical analysis and extensive experiments on popular driving perception datasets. Results demonstrate that LaserMix++ achieves accuracy comparable to fully supervised alternatives with five times fewer annotations and significantly improves the supervised-only baselines. This substantial advancement underscores the potential of semi-supervised approaches in reducing the reliance on extensive labeled data in LiDAR-based 3D scene understanding systems.

[139] Image-Guided Semantic Pseudo-LiDAR Point Generation for 3D Object Detection

Minseung Lee, Seokha Moon, Seung Joon Lee, Reza Mahjourian, Jinkyu Kim

Main category: cs.CV

TL;DR: ImagePG uses RGB image features to generate dense, semantically meaningful 3D pseudo-LiDAR points, improving detection of small/distant objects in autonomous driving.

DetailsMotivation: LiDAR's sparsity makes detecting small/distant objects difficult. Existing RoI-based point generation methods relying solely on LiDAR produce false positives and fail to recover meaningful structures.

Method: Proposes ImagePG with three modules: 1) Image-Guided RoI Points Generation (IG-RPG) creates pseudo-points using image features, 2) Image-Aware Occupancy Prediction Network (I-OPN) provides spatial priors for point placement, and 3) Multi-stage Refinement (MR) enhances point quality and detection robustness.

Result: On KITTI: +1.38%p mAP for cars, +7.91%p for pedestrians, +5.21%p for cyclists. Reduces false positives by nearly 50%. Achieves state-of-the-art cyclist performance on KITTI leaderboard. Validated on Waymo dataset.

Conclusion: ImagePG effectively leverages image features for semantic point generation, significantly improving detection of challenging objects while reducing false positives, demonstrating the value of multi-modal fusion for 3D perception.

Abstract: In autonomous driving scenarios, accurate perception is becoming an even more critical task for safe navigation. While LiDAR provides precise spatial data, its inherent sparsity makes it difficult to detect small or distant objects. Existing methods try to address this by generating additional points within a Region of Interest (RoI), but relying on LiDAR alone often leads to false positives and a failure to recover meaningful structures. To address these limitations, we propose Image-Guided Semantic Pseudo-LiDAR Point Generation model, called ImagePG, a novel framework that leverages rich RGB image features to generate dense and semantically meaningful 3D points. Our framework includes an Image-Guided RoI Points Generation (IG-RPG) module, which creates pseudo-points guided by image features, and an Image-Aware Occupancy Prediction Network (I-OPN), which provides spatial priors to guide point placement. A multi-stage refinement (MR) module further enhances point quality and detection robustness. To the best of our knowledge, ImagePG is the first method to directly leverage image features for point generation. Extensive experiments on the KITTI and Waymo datasets demonstrate that ImagePG significantly improves the detection of small and distant objects like pedestrians and cyclists, reducing false positives by nearly 50%. On the KITTI benchmark, our framework improves mAP by +1.38%p (car), +7.91%p (pedestrian), and +5.21%p (cyclist) on the test set over the baseline, achieving state-of-the-art cyclist performance on the KITTI leaderboard. The code is available at: https://github.com/MS-LIMA/ImagePG

[140] Multi-Scale Direction-Aware Network for Infrared Small Target Detection

Jinmiao Zhao, Zelin Shi, Chuang Yu, Yunpeng Liu, Xinyi Ying, Yimian Dai

Main category: cs.CV

TL;DR: MSDA-Net is a novel deep learning approach for infrared small target detection that integrates high-frequency directional features as domain prior knowledge, achieving state-of-the-art performance.

DetailsMotivation: Existing deep learning methods for infrared small target detection focus on edge and shape features but ignore richer structural differences and detailed information in high-frequency components from different directions, failing to fully exploit high-frequency directional features for target perception.

Method: Proposes MSDA-Net with: 1) HFDI module (parameter-free) to inject high-frequency directional information; 2) MSDA module for multi-scale local relation extraction and directional feature perception; 3) FA structure to address target disappearance in high-level features; 4) FCF module to alleviate feature bias during cross-layer fusion.
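
Since HFDI is parameter-free, fixed directional filters are a natural reading of the module. The PyTorch sketch below extracts four directional high-frequency maps with Sobel-style kernels; the exact kernels HFDI uses are an assumption here.

```python
import torch
import torch.nn.functional as F

# Four fixed directional gradient kernels (0°, 90°, 45°, 135°); no
# trainable parameters, matching the parameter-free HFDI description.
K = torch.stack([
    torch.tensor([[-1., 0., 1.], [-2., 0., 2.], [-1., 0., 1.]]),   # horizontal
    torch.tensor([[-1., -2., -1.], [0., 0., 0.], [1., 2., 1.]]),   # vertical
    torch.tensor([[0., 1., 2.], [-1., 0., 1.], [-2., -1., 0.]]),   # diagonal
    torch.tensor([[-2., -1., 0.], [-1., 0., 1.], [0., 1., 2.]]),   # anti-diagonal
]).unsqueeze(1)  # (4, 1, 3, 3)

def directional_highfreq(gray: torch.Tensor) -> torch.Tensor:
    """gray: (B, 1, H, W) infrared image -> (B, 4, H, W) directional maps."""
    return F.conv2d(gray, K.to(gray), padding=1).abs()
```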

Result: Extensive experiments show MSDA-Net achieves state-of-the-art results on multiple public datasets for infrared small target detection.

Conclusion: The paper successfully integrates high-frequency directional features as domain prior knowledge into neural networks for infrared small target detection, demonstrating the value of directional feature exploitation and achieving superior performance.

Abstract: Infrared small target detection faces the difficulty of effectively separating targets from the background. Existing deep learning-based methods focus on edge and shape features, but ignore the richer structural differences and detailed information embedded in high-frequency components from different directions, thereby failing to fully exploit the value of high-frequency directional features in target perception. To address this limitation, we propose a multi-scale direction-aware network (MSDA-Net), which is the first attempt to integrate the high-frequency directional features of infrared small targets as domain prior knowledge into neural networks. Specifically, to fully mine the high-frequency directional features, on the one hand, a high-frequency direction injection (HFDI) module without trainable parameters is constructed to inject the high-frequency directional information of the original image into the network. On the other hand, a multi-scale direction-aware (MSDA) module is constructed, which promotes the full extraction of local relations at different scales and the full perception of key features in different directions. In addition, considering the characteristics of infrared small targets, we construct a feature aggregation (FA) structure to address target disappearance in high-level feature maps, and a feature calibration fusion (FCF) module to alleviate feature bias during cross-layer feature fusion. Extensive experimental results show that our MSDA-Net achieves state-of-the-art (SOTA) results on multiple public datasets. The code is available at https://github.com/YuChuang1205/MSDA-Net

[141] iMotion-LLM: Instruction-Conditioned Trajectory Generation

Abdulwahab Felemban, Nussair Hroub, Jian Ding, Eslam Abdelrahman, Xiaoqian Shen, Abduallah Mohamed, Mohamed Elhoseiny

Main category: cs.CV

TL;DR: iMotion-LLM integrates LLMs with trajectory prediction for text-guided motion generation in autonomous driving, using instruction-conditioned trajectory generation with safety alignment.

DetailsMotivation: To enable adaptable, context-aware driving behavior through text instructions rather than conventional approaches, allowing for interpretable and safety-aligned trajectory generation.

Method: Combines encoder-decoder multimodal trajectory prediction with pre-trained LLM fine-tuned using LoRA, projects scene features into LLM input space, maps special tokens to trajectory decoder, and introduces two instruction-based datasets (InstructWaymo and Open-Vocabulary InstructNuPlan).

Result: Achieves 84% average accuracy in direction feasibility detection and 96% average accuracy in safety evaluation of open-vocabulary instructions, demonstrating strong contextual comprehension and instruction-conditioned trajectory generation.

Conclusion: Lays foundation for text-guided motion generation in autonomous driving, supporting simulated data generation, model interpretability, and robust safety alignment testing for trajectory generation models.

Abstract: We introduce iMotion-LLM, a large language model (LLM) integrated with trajectory prediction modules for interactive motion generation. Unlike conventional approaches, it generates feasible, safety-aligned trajectories based on textual instructions, enabling adaptable and context-aware driving behavior. It combines an encoder-decoder multimodal trajectory prediction model with a pre-trained LLM fine-tuned using LoRA, projecting scene features into the LLM input space and mapping special tokens to a trajectory decoder for text-based interaction and interpretable driving. To support this framework, we introduce two datasets: 1) InstructWaymo, an extension of the Waymo Open Motion Dataset with direction-based motion instructions, and 2) Open-Vocabulary InstructNuPlan, which features safety-aligned instruction-caption pairs and corresponding safe trajectory scenarios. Our experiments validate that instruction conditioning enables trajectory generation that follows the intended condition. iMotion-LLM demonstrates strong contextual comprehension, achieving 84% average accuracy in direction feasibility detection and 96% average accuracy in safety evaluation of open-vocabulary instructions. This work lays the foundation for text-guided motion generation in autonomous driving, supporting simulated data generation, model interpretability, and robust safety alignment testing for trajectory generation models. Our code, pre-trained model, and datasets are available at: https://vision-cair.github.io/iMotion-LLM/.

[142] PLANesT-3D: A new annotated dataset for segmentation of 3D plant point clouds

Kerem Mertoğlu, Yusuf Şalk, Server Karahan Sarıkaya, Kaya Turgut, Yasemin Evrenesoğlu, Hakan Çevikalp, Ömer Nezih Gerek, Helin Dutağacı, David Rousseau

Main category: cs.CV

TL;DR: PLANesT-3D: New annotated 3D plant point cloud dataset with semantic and instance labels for 3 species, acquired via low-cost SfM/MVS technique, plus SP-LSCnet segmentation method.

DetailsMotivation: Address lack of diverse 3D plant datasets for evaluating generalization of deep learning methods in plant phenotyping, especially with low-cost acquisition techniques.

Method: Created PLANesT-3D dataset with 34 point clouds from 3 species (Capsicum annuum, Rosa kordana, Ribes rubrum) using SfM/MVS technique. Manually annotated semantic (“leaf”/“stem”) and instance labels. Also developed SP-LSCnet segmentation method combining unsupervised superpoint extraction with 3D point-based deep learning.

Result: Introduced diverse dataset with challenges like missing data, variable density, illumination variations. SP-LSCnet offers modular structure and interpretability advantages over existing methods like PointNet++ and RoseSegNet.

Conclusion: PLANesT-3D fills gap in 3D plant datasets by adding species diversity and low-cost acquisition modality, enabling better evaluation of segmentation methods’ generalization capabilities in plant phenotyping.

Abstract: Creation of new annotated public datasets is crucial in helping advances in 3D computer vision and machine learning meet their full potential for automatic interpretation of 3D plant models. Despite the proliferation of deep neural network architectures for segmentation and phenotyping of 3D plant models in the last decade, the amount of data and the diversity in terms of species and data acquisition modalities are far from sufficient for evaluation of such tools for their generalization ability. To contribute to closing this gap, we introduce PLANesT-3D, a new annotated dataset of 3D color point clouds of plants. PLANesT-3D is composed of 34 point cloud models representing 34 real plants from three different plant species: *Capsicum annuum*, *Rosa kordana*, and *Ribes rubrum*. Both semantic labels in terms of “leaf” and “stem”, and organ instance labels were manually annotated for the full point clouds. PLANesT-3D introduces diversity to existing datasets by adding point clouds of two new species and providing 3D data acquired with the low-cost SfM/MVS technique as opposed to laser scanning or expensive setups. Point clouds reconstructed with SfM/MVS modality exhibit challenges such as missing data, variable density, and illumination variations. As an additional contribution, SP-LSCnet, a novel semantic segmentation method that is a combination of unsupervised superpoint extraction and a 3D point-based deep learning approach is introduced and evaluated on the new dataset. The advantages of SP-LSCnet over other deep learning methods are its modular structure and increased interpretability. Two existing deep neural network architectures, PointNet++ and RoseSegNet, were also tested on the point clouds of PLANesT-3D for semantic segmentation.

[143] SAT: Dynamic Spatial Aptitude Training for Multimodal Language Models

Arijit Ray, Jiafei Duan, Ellis Brown, Reuben Tan, Dina Bashkirova, Rose Hendrix, Kiana Ehsani, Aniruddha Kembhavi, Bryan A. Plummer, Ranjay Krishna, Kuo-Hao Zeng, Kate Saenko

Main category: cs.CV

TL;DR: SAT dataset improves spatial reasoning in multimodal models using simulated 3D data, showing better performance than pseudo-annotation methods.

DetailsMotivation: Existing multimodal language models struggle with spatial reasoning, especially dynamic awareness of motion effects. Manual annotation of object/camera movements is expensive, so simulated data is needed.

Method: Created SAT dataset using 3D simulators with 175K QA pairs across 20K scenes for static/dynamic spatial reasoning. Also built a small real-world test set. Evaluated on SAT plus 6 existing static benchmarks.

Result: Simulations effectively impart spatial aptitude that transfers to real images. SAT training improved LLaVA-13B by 11% and LLaVA-Video-7B by 8% on spatial benchmarks, outperforming some large proprietary models.

Conclusion: Perfect simulation annotations are more effective than pseudo-annotating real images. While static reasoning improves with synthetic data, dynamic reasoning still has room for improvement.

Abstract: Reasoning about motion and space is a fundamental cognitive capability that is required by multiple real-world applications. While many studies highlight that large multimodal language models (MLMs) struggle to reason about space, they only focus on static spatial relationships, and not dynamic awareness of motion and space, i.e., reasoning about the effect of egocentric and object motions on spatial relationships. Manually annotating such object and camera movements is expensive. Hence, we introduce SAT, a simulated spatial aptitude training dataset utilizing 3D simulators, comprising both static and dynamic spatial reasoning across 175K question-answer (QA) pairs and 20K scenes. Complementing this, we also construct a small (150 image-QAs) yet challenging dynamic spatial test set using real-world images. Leveraging our SAT datasets and 6 existing static spatial benchmarks, we systematically investigate what improves both static and dynamic spatial awareness. Our results reveal that simulations are surprisingly effective at imparting spatial aptitude to MLMs that transfers to real images. We show that perfect annotations in simulation are more effective than existing approaches of pseudo-annotating real images. For instance, SAT training improves a LLaVA-13B model by an average 11% and a LLaVA-Video-7B model by an average 8% on multiple spatial benchmarks, including our real-image dynamic test set and spatial reasoning on long videos – even outperforming some large proprietary models. While reasoning over static relationships improves with synthetic training data, there is still considerable room for improvement for dynamic reasoning questions.

[144] GLDiTalker: Speech-Driven 3D Facial Animation with Graph Latent Diffusion Transformer

Yihong Lin, Zhaoxin Fan, Xianjia Wu, Lingyu Xiong, Liang Peng, Xiandong Li, Wenxiong Kang, Songju Lei, Huang Xu

Main category: cs.CV

TL;DR: GLDiTalker: A speech-driven 3D facial animation model using Graph Latent Diffusion Transformer to solve audio-mesh misalignment, improving lip-sync accuracy and motion diversity.

DetailsMotivation: Current speech-driven talking head generation methods suffer from modality inconsistencies (audio-mesh misalignment), reducing motion diversity and lip-sync accuracy, which limits their applications in AR and virtual human modeling.

Method: Proposes GLDiTalker with Graph Latent Diffusion Transformer that diffuses signals in quantized spatiotemporal latent space. Uses two-stage training: 1) Graph-Enhanced Quantized Space Learning for lip-sync accuracy, 2) Space-Time Powered Latent Diffusion for motion diversity.
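
The quantized spatiotemporal latent space rests on a standard vector-quantization step; the sketch below shows that core operation in PyTorch (codebook size, latent dimension, and the straight-through trick are generic assumptions, and GLDiTalker’s graph-enhanced encoder/decoder is omitted):

```python
import torch
import torch.nn as nn

class VectorQuantizer(nn.Module):
    """Nearest-neighbor codebook lookup: the core of a quantized latent space.

    Generic sketch; not the GLDiTalker implementation.
    """
    def __init__(self, num_codes: int = 512, dim: int = 64):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, dim)

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        # z: (batch, time, dim) continuous motion latents
        book = self.codebook.weight.unsqueeze(0).expand(z.size(0), -1, -1)
        idx = torch.cdist(z, book).argmin(dim=-1)   # nearest code per latent
        z_q = self.codebook(idx)
        # Straight-through estimator so gradients still reach the encoder.
        return z + (z_q - z).detach()

print(VectorQuantizer()(torch.randn(2, 30, 64)).shape)  # torch.Size([2, 30, 64])
```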

Result: Outperforms existing methods on standard benchmarks, achieving superior lip-sync accuracy and motion diversity, generating realistic and temporally stable 3D facial animations.

Conclusion: GLDiTalker effectively addresses modality misalignment in speech-driven facial animation through its novel graph latent diffusion approach, enabling high-quality 3D facial animations with improved synchronization and motion variety.

Abstract: Speech-driven talking head generation is a critical yet challenging task with applications in augmented reality and virtual human modeling. While recent approaches using autoregressive and diffusion-based models have achieved notable progress, they often suffer from modality inconsistencies, particularly misalignment between audio and mesh, leading to reduced motion diversity and lip-sync accuracy. To address this, we propose GLDiTalker, a novel speech-driven 3D facial animation model based on a Graph Latent Diffusion Transformer. GLDiTalker resolves modality misalignment by diffusing signals within a quantized spatiotemporal latent space. It employs a two-stage training pipeline: the Graph-Enhanced Quantized Space Learning Stage ensures lip-sync accuracy, while the Space-Time Powered Latent Diffusion Stage enhances motion diversity. Together, these stages enable GLDiTalker to generate realistic, temporally stable 3D facial animations. Extensive evaluations on standard benchmarks demonstrate that GLDiTalker outperforms existing methods, achieving superior results in both lip-sync accuracy and motion diversity.

[145] From Correlation to Causation: Max-Pooling-Based Multi-Instance Learning Leads to More Robust Whole Slide Image Classification

Xin Liu, Weijia Zhang, Wei Tang, Thuc Duy Le, Jiuyong Li, Lin Liu, Min-Ling Zhang

Main category: cs.CV

TL;DR: FocusMIL improves WSI analysis by combining max-pooling MIL with variational information bottleneck to capture causal factors and reduce spurious correlations, enhancing robustness to domain shifts.

DetailsMotivation: Attention-based MIL models in WSI analysis are vulnerable to spurious correlations and domain shifts, often focusing on non-tumor regions like staining biases or artifacts, leading to unreliable tumor localization.

Method: Revisits max-pooling MIL from causal perspective, proposes FocusMIL which couples max-pooling with instance-level variational information bottleneck to learn compact predictive representations, and uses multi-bag mini-batch scheme for stable optimization.
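
A minimal sketch of the max-pooling-plus-VIB idea, assuming a Gaussian instance-level posterior and generic feature sizes (not the paper’s code):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FocusMILSketch(nn.Module):
    """Max-pooling MIL with an instance-level variational bottleneck (sketch)."""
    def __init__(self, feat_dim: int = 512, latent_dim: int = 128, n_classes: int = 2):
        super().__init__()
        self.mu = nn.Linear(feat_dim, latent_dim)
        self.logvar = nn.Linear(feat_dim, latent_dim)
        self.clf = nn.Linear(latent_dim, n_classes)

    def forward(self, bag: torch.Tensor):
        # bag: (num_instances, feat_dim) patch features from one slide
        mu, logvar = self.mu(bag), self.logvar(bag)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)   # reparameterize
        inst_logits = self.clf(z)                   # per-instance predictions
        bag_logits = inst_logits.max(dim=0).values  # max-pooling aggregation
        kl = -0.5 * (1 + logvar - mu.pow(2) - logvar.exp()).mean()  # VIB penalty
        return bag_logits, kl

model = FocusMILSketch()
logits, kl = model(torch.randn(100, 512))
loss = F.cross_entropy(logits.unsqueeze(0), torch.tensor([1])) + 0.01 * kl
```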

Result: Comprehensive experiments on three real-world datasets and one semi-synthetic dataset show FocusMIL has significant advantages in out-of-distribution scenarios and instance-level tumor region localization tasks by capturing causal factors.

Conclusion: FocusMIL effectively addresses limitations of attention-based MIL by leveraging max-pooling’s causal properties combined with VIB, resulting in more robust and reliable WSI analysis under domain shifts.

Abstract: In whole slide image (WSI) analysis, attention-based multi-instance learning (MIL) models are susceptible to spurious correlations and degrade under domain shift. These methods may assign high attention weights to non-tumor regions, such as staining biases or artifacts, leading to unreliable tumor region localization. In this paper, we revisit max-pooling-based MIL methods from a causal perspective. Under mild assumptions, our theoretical results demonstrate that max-pooling encourages the model to focus on causal factors while ignoring bias-related factors. Furthermore, we discover that existing max-pooling-based methods may overfit the training set through rote memorization of instance features and fail to learn meaningful patterns. To address these issues, we propose FocusMIL, which couples max-pooling with an instance-level variational information bottleneck (VIB) to learn compact, predictive latent representations, and employs a multi-bag mini-batch scheme to stabilize optimization. We conduct comprehensive experiments on three real-world datasets and one semi-synthetic dataset. The results show that, by capturing causal factors, FocusMIL exhibits significant advantages in out-of-distribution scenarios and instance-level tumor region localization tasks.

[146] Exploring Ordinal Bias in Action Recognition for Instructional Videos

Joochan Kim, Minjoon Jung, Byoung-Tak Zhang

Main category: cs.CV

TL;DR: The paper identifies “ordinal bias” in action recognition models where they rely on dataset-specific action sequences rather than true video comprehension, and proposes two manipulation methods (Action Masking and Sequence Shuffling) to expose this vulnerability.

DetailsMotivation: Current action recognition models for instructional videos often achieve good results by exploiting dominant, dataset-specific action sequences rather than truly understanding video content. This "ordinal bias" means models learn to recognize patterns rather than comprehend actions, limiting their generalization to diverse real-world scenarios.

Method: Two video manipulation methods: 1) Action Masking - masks frames of frequently co-occurring actions to break pattern recognition, and 2) Sequence Shuffling - randomizes the order of action segments to test if models rely on fixed sequences rather than understanding individual actions.
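
Both manipulations are simple to express; the sketch below assumes videos are stored as frame arrays with (start, end, label) segment annotations:

```python
import random
import numpy as np

def sequence_shuffle(segments):
    """Randomize the order of action segments (start, end, label)."""
    shuffled = segments[:]
    random.shuffle(shuffled)
    return shuffled

def action_mask(frames, segments, frequent_actions):
    """Zero out the frames of frequently co-occurring actions."""
    masked = frames.copy()
    for start, end, label in segments:
        if label in frequent_actions:
            masked[start:end] = 0  # blank this action's frames
    return masked

frames = np.random.rand(300, 8, 8, 3)  # toy video: 300 tiny frames
segments = [(0, 100, "crack egg"), (100, 200, "whisk"), (200, 300, "pour")]
masked = action_mask(frames, segments, frequent_actions={"whisk"})
print(sequence_shuffle(segments))
```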

Result: Experiments show current models suffer significant performance drops when faced with nonstandard action sequences, confirming their vulnerability to ordinal bias and reliance on dataset-specific patterns rather than true comprehension.

Conclusion: The findings highlight the need to rethink evaluation strategies and develop models that can generalize beyond fixed action patterns, moving toward true video understanding in diverse instructional video contexts.

Abstract: Action recognition models have achieved promising results in understanding instructional videos. However, they often rely on dominant, dataset-specific action sequences rather than true video comprehension, a problem that we define as ordinal bias. To address this issue, we propose two effective video manipulation methods: Action Masking, which masks frames of frequently co-occurring actions, and Sequence Shuffling, which randomizes the order of action segments. Through comprehensive experiments, we demonstrate that current models exhibit significant performance drops when confronted with nonstandard action sequences, underscoring their vulnerability to ordinal bias. Our findings emphasize the importance of rethinking evaluation strategies and developing models capable of generalizing beyond fixed action patterns in diverse instructional videos.

[147] Point-PNG: Conditional Pseudo-Negatives Generation for Point Cloud Pre-Training

Sutharsan Mahendren, Saimunur Rahman, Piotr Koniusz, Tharindu Fernando, Sridha Sridharan, Clinton Fookes, Peyman Moghadam

Main category: cs.CV

TL;DR: Point-PNG is a self-supervised learning framework that generates conditional pseudo-negatives in latent space to learn point cloud representations that are both discriminative and transformation-sensitive, addressing the invariant-collapse problem in conventional methods.

DetailsMotivation: Conventional self-supervised learning methods focus on achieving invariance and discard transformation-specific information. Recent approaches that incorporate transformation sensitivity often suffer from invariant-collapse phenomenon, where predictors degenerate into identity mappings, resulting in limited variation across transformations in latent representations.

Method: Proposes Point-PNG framework with COPE (Conditional Pseudo-Negatives Embedding) network that learns localized displacements induced by transformations in latent space. Uses a loss function based on pseudo-negatives conditioned on transformations to penalize trivial invariant solutions and enforce meaningful representation learning.
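
The anti-collapse objective can be illustrated with an InfoNCE-style loss in which the transformation-conditioned pseudo-negative plays the negative role; this is a hedged sketch of the general idea, not the exact Point-PNG objective:

```python
import torch
import torch.nn.functional as F

def pseudo_negative_loss(pred, target, pseudo_neg, temperature=0.1):
    """Pull the predicted latent toward the transformed view's embedding while
    pushing it away from a transformation-conditioned pseudo-negative.

    Illustrative only; how COPE produces pseudo-negatives is not shown.
    """
    pos = F.cosine_similarity(pred, target, dim=-1) / temperature
    neg = F.cosine_similarity(pred, pseudo_neg, dim=-1) / temperature
    # InfoNCE-style: the prediction should match the target, not the pseudo-negative,
    # which penalizes a predictor that collapses to an identity mapping.
    logits = torch.stack([pos, neg], dim=-1)
    labels = torch.zeros(pred.size(0), dtype=torch.long)
    return F.cross_entropy(logits, labels)

loss = pseudo_negative_loss(torch.randn(8, 128), torch.randn(8, 128), torch.randn(8, 128))
```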

Result: Validated on shape classification and relative pose estimation tasks. Shows competitive performance on ModelNet40 and ScanObjectNN under challenging evaluation protocols. Achieves superior accuracy in relative pose estimation compared to supervised baselines.

Conclusion: Point-PNG effectively addresses the invariant-collapse problem in self-supervised learning for point clouds, enabling networks to capture richer transformation cues while preserving discriminative representations through conditional pseudo-negatives generation.

Abstract: We propose Point-PNG, a novel self-supervised learning framework that generates conditional pseudo-negatives in the latent space to learn point cloud representations that are both discriminative and transformation-sensitive. Conventional self-supervised learning methods focus on achieving invariance, discarding transformation-specific information. Recent approaches incorporate transformation sensitivity by explicitly modeling relationships between original and transformed inputs. However, they often suffer from an invariant-collapse phenomenon, where the predictor degenerates into identity mappings, resulting in latent representations with limited variation across transformations. To address this, we propose Point-PNG that explicitly penalizes invariant collapse through pseudo-negatives generation, enabling the network to capture richer transformation cues while preserving discriminative representations. To this end, we introduce a parametric network, COnditional Pseudo-Negatives Embedding (COPE), which learns localized displacements induced by transformations within the latent space. A key challenge arises when jointly training COPE with the MAE, as it tends to converge to trivial identity mappings. To overcome this, we design a loss function based on pseudo-negatives conditioned on the transformation, which penalizes such trivial invariant solutions and enforces meaningful representation learning. We validate Point-PNG on shape classification and relative pose estimation tasks, showing competitive performance on ModelNet40 and ScanObjectNN under challenging evaluation protocols, and achieving superior accuracy in relative pose estimation compared to supervised baselines.

[148] Neural Eulerian Scene Flow Fields

Kyle Vedder, Neehar Peri, Ishan Khatri, Siyi Li, Eric Eaton, Mehmet Kocamaz, Yue Wang, Zhiding Yu, Deva Ramanan, Joachim Pehserl

Main category: cs.CV

TL;DR: EulerFlow reframes scene flow as estimating a continuous space-time ODE with a neural prior, enabling high-quality self-supervised scene flow estimation across diverse domains without tuning.

DetailsMotivation: Traditional scene flow methods often require supervision or domain-specific tuning. The authors aim to develop a unified, self-supervised approach that works across diverse domains without manual adaptation.

Method: Represents scene flow as a continuous space-time ODE with a neural prior, optimized against multi-observation reconstruction objectives through pure self-supervision on real-world data.
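
Conceptually, the neural prior is a network f(x, t) returning instantaneous 3D velocity, and long-horizon tracking falls out of integrating the ODE; a toy sketch in which the architecture and the forward-Euler integrator are assumptions:

```python
import torch
import torch.nn as nn

class FlowODE(nn.Module):
    """Neural prior f(x, t) -> 3D velocity, defining a space-time ODE (sketch)."""
    def __init__(self, hidden: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(4, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 3),
        )

    def forward(self, x: torch.Tensor, t: float) -> torch.Tensor:
        # x: (N, 3) point positions; t appended as an extra input channel
        t_col = torch.full_like(x[:, :1], t)
        return self.net(torch.cat([x, t_col], dim=-1))

def integrate(ode, x, t0, t1, steps=10):
    """Forward-Euler integration from t0 to t1: the mechanism behind
    long-horizon point tracking by solving the estimated ODE."""
    dt = (t1 - t0) / steps
    for i in range(steps):
        x = x + dt * ode(x, t0 + i * dt)
    return x

tracked = integrate(FlowODE(), torch.randn(1000, 3), t0=0.0, t1=1.0)
```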

Result: Outperforms all prior methods on Argoverse 2 2024 Scene Flow Challenge, surpassing next-best unsupervised method by 2.5x and exceeding next-best supervised method by over 10%. Works across domains including autonomous driving and dynamic tabletop scenes, handles small fast-moving objects, and exhibits emergent 3D point tracking behavior.

Conclusion: EulerFlow demonstrates that framing scene flow as a continuous ODE with neural priors enables robust, high-quality estimation across diverse domains through pure self-supervision, achieving state-of-the-art performance.

Abstract: We reframe scene flow as the task of estimating a continuous space-time ODE that describes motion for an entire observation sequence, represented with a neural prior. Our method, EulerFlow, optimizes this neural prior estimate against several multi-observation reconstruction objectives, enabling high quality scene flow estimation via pure self-supervision on real-world data. EulerFlow works out-of-the-box without tuning across multiple domains, including large-scale autonomous driving scenes and dynamic tabletop settings. Remarkably, EulerFlow produces high quality flow estimates on small, fast moving objects like birds and tennis balls, and exhibits emergent 3D point tracking behavior by solving its estimated ODE over long-time horizons. On the Argoverse 2 2024 Scene Flow Challenge, EulerFlow outperforms all prior art, surpassing the next-best unsupervised method by more than 2.5x, and even exceeding the next-best supervised method by over 10%.

[149] MedDiff-FM: A Diffusion-based Foundation Model for Versatile Medical Image Applications

Yongrui Yu, Yannian Gu, Shaoting Zhang, Xiaofan Zhang

Main category: cs.CV

TL;DR: MedDiff-FM is a diffusion-based foundation model for diverse medical image tasks using 3D CT images from multiple anatomical regions, enabling multi-level image processing and seamless handling of various downstream applications through ControlNet fine-tuning.

DetailsMotivation: Previous medical image diffusion models have been limited to specific anatomical regions, applications, and datasets, creating isolated solutions. There's a need for a comprehensive foundation model that can address diverse medical image tasks across different anatomical areas.

Method: MedDiff-FM pre-trains a diffusion foundation model using 3D CT images from multiple public datasets covering head to abdomen regions. It employs multi-level integrated processing (image-level and patch-level), uses position embedding for spatial relationships, and leverages region classes and anatomical structures. For downstream tasks, it uses ControlNet with task-specific conditions for rapid fine-tuning.

Result: The model successfully handles multiple downstream tasks including image denoising, anomaly detection, image synthesis, super-resolution, lesion generation, and lesion inpainting. Experimental results demonstrate effectiveness across diverse medical image applications.

Conclusion: MedDiff-FM provides a unified diffusion-based foundation model for medical imaging that overcomes previous limitations of isolated solutions, enabling comprehensive handling of various tasks across different anatomical regions through a single adaptable framework.

Abstract: Diffusion models have achieved significant success in both natural image and medical image domains, encompassing a wide range of applications. Previous investigations in medical images have often been constrained to specific anatomical regions, particular applications, and limited datasets, resulting in isolated diffusion models. This paper introduces a diffusion-based foundation model to address a diverse range of medical image tasks, namely MedDiff-FM. MedDiff-FM leverages 3D CT images from multiple publicly available datasets, covering anatomical regions from head to abdomen, to pre-train a diffusion foundation model, and explores the capabilities of the diffusion foundation model across a variety of application scenarios. The diffusion foundation model handles multi-level integrated image processing both at the image-level and patch-level, utilizes position embedding to establish multi-level spatial relationships, and leverages region classes and anatomical structures to capture certain anatomical regions. MedDiff-FM manages several downstream tasks seamlessly, including image denoising, anomaly detection, and image synthesis. MedDiff-FM is also capable of performing super-resolution, lesion generation, and lesion inpainting by rapidly fine-tuning the diffusion foundation model using ControlNet with task-specific conditions. The experimental results demonstrate the effectiveness of MedDiff-FM in addressing diverse downstream medical image tasks.

[150] Dressing the Imagination: A Dataset for AI-Powered Translation of Text into Fashion Outfits and A Novel KAN Adapter for Enhanced Feature Adaptation

Gayatri Deshmukh, Somsubhra De, Chirag Sehgal, Jishu Sen Gupta, Sparsh Mittal

Main category: cs.CV

TL;DR: FLORA is a fashion dataset with 4,330 outfit-description pairs using professional fashion terminology, and NeRA is a novel adapter using KAN networks for better fashion image generation.

DetailsMotivation: The fashion industry lacks specialized datasets with professional terminology for AI-driven design. Existing datasets don't capture the subtle stylistic elements and industry-specific language needed for high-fidelity fashion generation.

Method: 1) Created FLORA dataset with 4,330 curated fashion outfit-description pairs using professional fashion terminology. 2) Developed NeRA adapter architecture based on Kolmogorov-Arnold Networks (KAN) using learnable spline-based nonlinear transformations instead of traditional MLP adapters.
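
The sketch below conveys the spline-adapter idea using radial basis functions as a stand-in for B-splines; the rank, grid, and residual form are assumptions rather than NeRA’s actual parameterization:

```python
import torch
import torch.nn as nn

class SplineAdapter(nn.Module):
    """Low-rank adapter whose nonlinearity is a learnable spline-like curve,
    approximated here with RBFs on a fixed grid (KAN-adapter sketch only)."""
    def __init__(self, dim: int = 768, rank: int = 8, n_basis: int = 16):
        super().__init__()
        self.down = nn.Linear(dim, rank, bias=False)
        self.up = nn.Linear(rank, dim, bias=False)
        self.centers = nn.Parameter(torch.linspace(-2, 2, n_basis), requires_grad=False)
        self.coef = nn.Parameter(torch.zeros(rank, n_basis))  # learnable activation

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.down(x)                                            # (..., rank)
        basis = torch.exp(-(h.unsqueeze(-1) - self.centers) ** 2)   # RBF features
        h = (basis * self.coef).sum(-1)                             # learned per-channel curve
        return x + self.up(h)                                       # residual adapter update

out = SplineAdapter()(torch.randn(2, 10, 768))
print(out.shape)  # torch.Size([2, 10, 768])
```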

Result: Fine-tuning on FLORA significantly improves generative models’ ability to create accurate, stylistically rich fashion images. NeRA outperforms existing adapters (LoRA, LoKR, DoRA, LoHA) with superior fidelity, faster convergence, and better semantic alignment on FLORA and LAION-5B datasets.

Conclusion: FLORA enables advanced AI fashion design models, while NeRA provides a more effective adapter architecture. Both contributions will accelerate AI-driven fashion design and help designers bring ideas to life. The dataset and code will be open-sourced.

Abstract: Specialized datasets that capture the fashion industry’s rich language and styling elements can boost progress in AI-driven fashion design. We present FLORA (Fashion Language Outfit Representation for Apparel Generation), the first comprehensive dataset containing 4,330 curated pairs of fashion outfits and corresponding textual descriptions. Each description utilizes industry-specific terminology and jargon commonly used by professional fashion designers, providing precise and detailed insights into the outfits. Hence, the dataset captures the delicate features and subtle stylistic elements necessary to create high-fidelity fashion designs. We demonstrate that fine-tuning generative models on the FLORA dataset significantly enhances their capability to generate accurate and stylistically rich images from textual descriptions of fashion sketches. FLORA will catalyze the creation of advanced AI models capable of comprehending and producing subtle, stylistically rich fashion designs. It will also help fashion designers and end-users to bring their ideas to life. As a second orthogonal contribution, we introduce NeRA (Nonlinear low-rank Expressive Representation Adapter), a novel adapter architecture based on Kolmogorov-Arnold Networks (KAN). Unlike traditional PEFT techniques such as LoRA, LoKR, DoRA, and LoHA that use MLP adapters, NeRA uses learnable spline-based nonlinear transformations, enabling superior modeling of complex semantic relationships, achieving strong fidelity, faster convergence, and semantic alignment. Extensive experiments on our proposed FLORA and LAION-5B datasets validate the superiority of NeRA over existing adapters. We will open-source both the FLORA dataset and our implementation code.

[151] AnyAnomaly: Zero-Shot Customizable Video Anomaly Detection with LVLM

Sunghyun Ahn, Youngwan Jo, Kijung Lee, Sein Kwon, Inpyo Hong, Sanghyun Park

Main category: cs.CV

TL;DR: Proposes AnyAnomaly, a customizable video anomaly detection model that uses user-defined text descriptions to detect abnormal events without retraining for new environments.

DetailsMotivation: Existing VAD models require retraining for different environments, needing ML expertise, hardware, and extensive data collection, limiting practical usability.

Method: C-VAD technique with AnyAnomaly model using context-aware visual question answering without fine-tuning large vision-language models.

Result: Superior performance on C-VAD datasets, competitive results on VAD benchmarks with SOTA on UBnormal and UCF-Crime, and better generalization across datasets.

Conclusion: AnyAnomaly enables customizable anomaly detection without retraining, improving practical usability and achieving strong performance across diverse environments.

Abstract: Video anomaly detection (VAD) is crucial for video analysis and surveillance in computer vision. However, existing VAD models rely on learned normal patterns, which makes them difficult to apply to diverse environments. Consequently, users must retrain models or develop separate AI models for new environments, which requires expertise in machine learning, high-performance hardware, and extensive data collection, limiting the practical usability of VAD. To address these challenges, this study proposes a customizable video anomaly detection (C-VAD) technique and the AnyAnomaly model. C-VAD considers user-defined text as an abnormal event and detects frames containing a specified event in a video. We effectively implemented AnyAnomaly using context-aware visual question answering without fine-tuning the large vision-language model. To validate the effectiveness of the proposed model, we constructed C-VAD datasets and demonstrated the superiority of AnyAnomaly. Furthermore, our approach showed competitive results on VAD benchmarks, achieving state-of-the-art performance on UBnormal and UCF-Crime and surpassing other methods in generalization across all datasets. Our code is available online at github.com/SkiddieAhn/Paper-AnyAnomaly.

[152] Test-Time 3D Occupancy Prediction

Fengyi Zhang, Xiangyu Sun, Huitong Yang, Zheng Zhang, Zi Huang, Yadan Luo

Main category: cs.CV

TL;DR: TT-Occ is a test-time occupancy prediction framework that uses 3D Gaussians and vision foundation models for flexible, training-free 3D scene understanding at arbitrary resolutions.

DetailsMotivation: Traditional self-supervised 3D occupancy prediction requires extensive GPU training, lacks flexibility for different voxel resolutions, and cannot adapt to new object categories without retraining.

Method: A lift-track-voxelize approach: 1) Lift geometry/semantics from vision foundation models to instantiate 3D Gaussians, 2) Track dynamic Gaussians while accumulating static ones for temporal consistency, 3) Voxelize optimized Gaussians at arbitrary resolutions. Optional Gaussian smoothing reduces noise.
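
The final voxelize step amounts to quantizing the optimized Gaussian centers into a semantic grid at a user-chosen resolution; a minimal sketch in which the last-write label assignment and all names are assumptions:

```python
import torch

def voxelize_gaussians(centers, labels, voxel_size, grid_min, grid_shape):
    """Quantize 3D Gaussian centers into a semantic occupancy grid (sketch).

    The paper also optimizes and tracks the Gaussians; only the quantization
    at an arbitrary resolution is shown here.
    """
    idx = ((centers - grid_min) / voxel_size).long()                 # (N, 3) voxel indices
    valid = ((idx >= 0) & (idx < torch.tensor(grid_shape))).all(dim=1)
    idx, labels = idx[valid], labels[valid]
    grid = torch.zeros(grid_shape, dtype=torch.long)                 # 0 = free space
    flat = idx[:, 0] * grid_shape[1] * grid_shape[2] + idx[:, 1] * grid_shape[2] + idx[:, 2]
    grid.view(-1)[flat] = labels                                     # last write wins (sketch)
    return grid

grid = voxelize_gaussians(torch.rand(5000, 3) * 40, torch.randint(1, 10, (5000,)),
                          voxel_size=0.4, grid_min=torch.zeros(3), grid_shape=(100, 100, 100))
```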

Result: Two variants (LiDAR-based and vision-centric) were tested on Occ3D and nuCraft benchmarks with varying voxel resolutions, demonstrating generality and effectiveness without network training.

Conclusion: TT-Occ provides a practical, flexible test-time occupancy prediction framework that eliminates training requirements, adapts to arbitrary resolutions, and enables open-vocabulary recognition through vision foundation models.

Abstract: Self-supervised 3D occupancy prediction offers a promising solution for understanding complex driving scenes without requiring costly 3D annotations. However, training dense occupancy decoders to capture fine-grained geometry and semantics can demand hundreds of GPU hours, and once trained, such models struggle to adapt to varying voxel resolutions or novel object categories without extensive retraining. To overcome these limitations, we propose a practical and flexible test-time occupancy prediction framework termed TT-Occ. Our method incrementally constructs, optimizes and voxelizes time-aware 3D Gaussians from raw sensor streams by integrating vision foundation models (VFMs) at runtime. The flexible nature of 3D Gaussians allows voxelization at arbitrary user-specified resolutions, while the generalization ability of VFMs enables accurate perception and open-vocabulary recognition, without any network training or fine-tuning. Specifically, TT-Occ operates in a lift-track-voxelize symphony: We first lift the geometry and semantics of surrounding views extracted from VFMs to instantiate Gaussians in 3D space; Next, we track dynamic Gaussians while accumulating static ones to complete the scene and enforce temporal consistency; Finally, we voxelize the optimized Gaussians to generate occupancy prediction. Optionally, inherent noise in VFM predictions and tracking is mitigated by periodically smoothing neighboring Gaussians during optimization. To validate the generality and effectiveness of our framework, we offer two variants: one LiDAR-based and one vision-centric, and conduct extensive experiments on Occ3D and nuCraft benchmarks with varying voxel resolutions.

[153] Joint Self-Supervised Video Alignment and Action Segmentation

Ali Shah Ali, Syed Ahmed Mahmood, Mubin Saeed, Andrey Konin, M. Zeeshan Zia, Quoc-Huy Tran

Main category: cs.CV

TL;DR: Novel unified optimal transport framework for simultaneous self-supervised video alignment and action segmentation, achieving SOTA performance with single model efficiency.

DetailsMotivation: To develop a unified approach that efficiently handles both video alignment and action segmentation tasks simultaneously, reducing computational overhead compared to separate models while maintaining or improving performance.

Method: Proposes a fused Gromov-Wasserstein optimal transport formulation with structural prior for video alignment, then extends to unified framework for joint video alignment and action segmentation using single model.
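
The fused Gromov-Wasserstein alignment itself can be reproduced with the POT library; the sketch below aligns two toy frame-embedding sequences with frame-index distances as a simple structural prior (the paper uses its own GPU-efficient solver and prior, so treat this only as an illustration of the formulation):

```python
import numpy as np
import ot  # Python Optimal Transport (POT) library

# Frame embeddings of two videos of the same activity (random stand-ins).
x, y = np.random.rand(40, 64), np.random.rand(50, 64)

# Cross-video feature cost, plus intra-video temporal structure matrices
# (a simple structural prior: normalized frame-index distances).
M = ot.dist(x, y)
C1 = ot.dist(np.arange(40)[:, None] / 40.0, np.arange(40)[:, None] / 40.0)
C2 = ot.dist(np.arange(50)[:, None] / 50.0, np.arange(50)[:, None] / 50.0)
p, q = ot.unif(40), ot.unif(50)

# Fused Gromov-Wasserstein coupling: alpha trades off feature matching (M)
# against preserving temporal structure (C1, C2).
T = ot.gromov.fused_gromov_wasserstein(M, C1, C2, p, q, loss_fun="square_loss", alpha=0.5)
print(T.shape)  # (40, 50) soft frame-to-frame alignment
```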

Result: Single-task method achieves SOTA on video alignment benchmarks; multi-task method achieves comparable alignment and superior segmentation results while saving time/memory compared to separate models.

Conclusion: First unified model for video alignment and action segmentation, demonstrating effectiveness of optimal transport framework for multi-task video understanding with improved efficiency.

Abstract: We introduce a novel approach for simultaneous self-supervised video alignment and action segmentation based on a unified optimal transport framework. In particular, we first tackle self-supervised video alignment by developing a fused Gromov-Wasserstein optimal transport formulation with a structural prior, which trains efficiently on GPUs and needs only a few iterations for solving the optimal transport problem. Our single-task method achieves the state-of-the-art performance on multiple video alignment benchmarks and outperforms VAVA, which relies on a traditional Kantorovich optimal transport formulation with an optimality prior. Furthermore, we extend our approach by proposing a unified optimal transport framework for joint self-supervised video alignment and action segmentation, which requires training and storing a single model and saves both time and memory consumption as compared to two different single-task models. Extensive evaluations on several video alignment and action segmentation datasets demonstrate that our multi-task method achieves comparable video alignment yet superior action segmentation results over previous methods in video alignment and action segmentation respectively. Finally, to the best of our knowledge, this is the first work to unify video alignment and action segmentation into a single model. Our code is available on our research website: https://retrocausal.ai/research/.

[154] ChartQA-X: Generating Explanations for Visual Chart Reasoning

Shamanthak Hegde, Pooyan Fazli, Hasti Seifi

Main category: cs.CV

TL;DR: ChartQA-X: A comprehensive dataset for generating detailed explanations alongside chart question answering, with 30,799 chart samples across four chart types, showing that model-generated explanations surpass human-written ones in accuracy and logic.

DetailsMotivation: The ability to explain complex information from chart images is vital for effective data-driven decision-making. There's a need to generate detailed explanations alongside answering questions about charts to improve comprehension and trust in generated responses.

Method: Created ChartQA-X dataset with 30,799 chart samples across four chart types, each paired with contextually relevant questions, answers, and explanations. Explanations were generated and selected based on metrics like faithfulness, informativeness, coherence, and perplexity. Models were fine-tuned on this dataset.

Result: Human evaluation with 245 participants shows model-generated explanations surpass human-written explanations in accuracy and logic, and are comparable in clarity and overall quality. Models fine-tuned on ChartQA-X show substantial improvements: up to 24.57 points in explanation quality, 18.96 percentage points in QA accuracy, and 14.75 percentage points on unseen benchmarks.

Conclusion: By integrating explanatory narratives with answers, ChartQA-X enables agents to convey complex visual information more effectively, improving comprehension and fostering greater trust in generated responses for chart-based data analysis.

Abstract: The ability to explain complex information from chart images is vital for effective data-driven decision-making. In this work, we address the challenge of generating detailed explanations alongside answering questions about charts. We present ChartQA-X, a comprehensive dataset comprising 30,799 chart samples across four chart types, each paired with contextually relevant questions, answers, and explanations. Explanations are generated and selected based on metrics such as faithfulness, informativeness, coherence, and perplexity. Our human evaluation with 245 participants shows that model-generated explanations in ChartQA-X surpass human-written explanations in accuracy and logic and are comparable in terms of clarity and overall quality. Moreover, models fine-tuned on ChartQA-X show substantial improvements across various metrics, including absolute gains of up to 24.57 points in explanation quality, 18.96 percentage points in question-answering accuracy, and 14.75 percentage points on unseen benchmarks for the same task. By integrating explanatory narratives with answers, our approach enables agents to convey complex visual information more effectively, improving comprehension and fostering greater trust in the generated responses.

[155] Adaptive Chain-of-Focus Reasoning via Dynamic Visual Search and Zooming for Efficient VLMs

Xintong Zhang, Zhi Gao, Bofei Zhang, Pengxiang Li, Xiaowen Zhang, Yang Liu, Tao Yuan, Yuwei Wu, Yunde Jia, Song-Chun Zhu, Qing Li

Main category: cs.CV

TL;DR: Proposes Chain-of-Focus (CoF) method for VLMs to adaptively focus/zoom on key image regions for efficient multimodal reasoning, using two-stage training (SFT + RL) with significant performance gains.

DetailsMotivation: Multimodal reasoning capability in existing vision language models (VLMs) has not been fully explored. Current models lack adaptive focusing and zooming mechanisms to efficiently process key image regions based on visual cues and questions.

Method: Chain-of-Focus (CoF) method with two-stage training: 1) Supervised fine-tuning (SFT) using MM-CoF dataset (3K samples from visual agent identifying key regions), 2) Reinforcement learning (RL) using outcome accuracies and formats as rewards to refine search and reasoning strategies without human priors. Implemented on Qwen2.5-VL model.
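
The RL stage’s reward can be pictured as a weighted sum of an outcome-accuracy check and a format check; the tag convention and weights below are illustrative assumptions, not the paper’s implementation:

```python
import re

def cof_reward(response: str, ground_truth: str) -> float:
    """Toy reward combining outcome accuracy and format compliance,
    in the spirit of the paper's RL stage (all specifics assumed)."""
    # Format reward: the response must wrap its final answer in <answer> tags.
    m = re.search(r"<answer>(.*?)</answer>", response, re.DOTALL)
    format_ok = 1.0 if m else 0.0
    # Accuracy reward: exact match of the extracted answer.
    correct = 1.0 if m and m.group(1).strip() == ground_truth.strip() else 0.0
    return 0.2 * format_ok + 0.8 * correct

print(cof_reward("I zoomed into the region. <answer>a red kite</answer>", "a red kite"))  # 1.0
```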

Result: Significant improvements on multiple benchmarks. On V* benchmark requiring strong visual reasoning, outperforms existing VLMs by 5% across 8 image resolutions (224 to 4K). Demonstrates effectiveness of CoF method for efficient VLM deployment.

Conclusion: The proposed Chain-of-Focus method enables VLMs to perform adaptive focusing and zooming on key image regions, achieving efficient multimodal reasoning. The two-stage training pipeline (SFT + RL) effectively refines reasoning strategies, facilitating more efficient deployment of VLMs in practical applications.

Abstract: Vision language models (VLMs) have achieved impressive performance across a variety of computer vision tasks. However, the multimodal reasoning capability has not been fully explored in existing models. In this paper, we propose a Chain-of-Focus (CoF) method that allows VLMs to perform adaptive focusing and zooming in on key image regions based on obtained visual cues and the given questions, achieving efficient multimodal reasoning. To enable this CoF capability, we present a two-stage training pipeline, including supervised fine-tuning (SFT) and reinforcement learning (RL). In the SFT stage, we construct the MM-CoF dataset, comprising 3K samples derived from a visual agent designed to adaptively identify key regions to solve visual tasks with different image resolutions and questions. We use MM-CoF to fine-tune the Qwen2.5-VL model for cold start. In the RL stage, we leverage the outcome accuracies and formats as rewards to update the Qwen2.5-VL model, further refining the models’ search and reasoning strategies without human priors. Our model achieves significant improvements on multiple benchmarks. On the V* benchmark that requires strong visual reasoning capability, our model outperforms existing VLMs by 5% across 8 image resolutions ranging from 224 to 4K, demonstrating the effectiveness of the proposed CoF method and facilitating more efficient deployment of VLMs in practical applications.

[156] M3DHMR: Monocular 3D Hand Mesh Recovery

Yihong Lin, Xianjia Wu, Xilai Wang, Jianqiao Hu, Songju Lei, Xiandong Li, Wenxiong Kang

Main category: cs.CV

TL;DR: M3DHMR is a new pipeline for monocular 3D hand mesh recovery that directly estimates mesh vertex positions using dynamic spiral convolution layers and ROI refinement, outperforming state-of-the-art real-time methods.

DetailsMotivation: Monocular 3D hand mesh recovery is challenging due to high degrees of freedom, 2D-to-3D ambiguity, and self-occlusion. Existing methods are either inefficient or less straightforward for predicting 3D mesh vertex positions.

Method: Proposes M3DHMR pipeline that directly estimates hand mesh vertex positions. Uses 2D cues for 3D tasks from single images, with a spiral decoder consisting of Dynamic Spiral Convolution (DSC) Layers that adaptively adjust weights based on vertex positions, and a Region of Interest (ROI) Layer that refines mesh vertices in predefined hand regions separately using physical information.

Result: Extensive experiments on FreiHAND dataset demonstrate that M3DHMR significantly outperforms state-of-the-art real-time methods.

Conclusion: M3DHMR provides an effective solution for monocular 3D hand mesh recovery by directly estimating vertex positions with adaptive weight adjustment and region-specific refinement, achieving superior performance compared to existing real-time methods.

Abstract: Monocular 3D hand mesh recovery is challenging due to high degrees of freedom of hands, 2D-to-3D ambiguity and self-occlusion. Most existing methods are either inefficient or less straightforward for predicting the position of 3D mesh vertices. Thus, we propose a new pipeline called Monocular 3D Hand Mesh Recovery (M3DHMR) to directly estimate the positions of hand mesh vertices. M3DHMR provides 2D cues for 3D tasks from a single image and uses a new spiral decoder consisting of several Dynamic Spiral Convolution (DSC) Layers and a Region of Interest (ROI) Layer. On the one hand, DSC Layers adaptively adjust the weights based on the vertex positions and extract the vertex features in both spatial and channel dimensions. On the other hand, the ROI Layer utilizes physical information and refines mesh vertices in each predefined hand region separately. Extensive experiments on the popular FreiHAND dataset demonstrate that M3DHMR significantly outperforms state-of-the-art real-time methods.

[157] V-CECE: Visual Counterfactual Explanations via Conceptual Edits

Nikolaos Spanos, Maria Lymperaiou, Giorgos Filandrianos, Konstantinos Thomas, Athanasios Voulodimos, Giorgos Stamou

Main category: cs.CV

TL;DR: A plug-and-play black-box counterfactual generation framework that produces human-level explanations without training, using diffusion models to suggest optimal edits while revealing gaps between human reasoning and neural model behavior.

DetailsMotivation: Existing black-box counterfactual generation frameworks fail to consider semantic content of edits and rely heavily on training, lacking explainability and human-level reasoning.

Method: Uses pre-trained image editing diffusion model to suggest step-by-step edits with theoretical guarantees of optimal edits, operating without access to classifier internals (black-box approach).

Result: Demonstrates explanatory gap between human reasoning and neural model behavior across CNN, ViT, and LVLM classifiers, validated through comprehensive human evaluation.

Conclusion: Proposed framework enables explainable counterfactual generation without training, revealing important differences between human and machine reasoning patterns.

Abstract: Recent black-box counterfactual generation frameworks fail to take into account the semantic content of the proposed edits, while relying heavily on training to guide the generation process. We propose a novel, plug-and-play black-box counterfactual generation framework, which suggests step-by-step edits based on theoretical guarantees of optimal edits to produce human-level counterfactual explanations with zero training. Our framework utilizes a pre-trained image editing diffusion model, and operates without access to the internals of the classifier, leading to an explainable counterfactual generation process. Throughout our experimentation, we showcase the explanatory gap between human reasoning and neural model behavior by utilizing Convolutional Neural Network (CNN), Vision Transformer (ViT), and Large Vision-Language Model (LVLM) classifiers, substantiated through a comprehensive human evaluation.

[158] 3D Question Answering via only 2D Vision-Language Models

Fengyun Wang, Sicheng Yu, Jiawei Wu, Jinhui Tang, Hanwang Zhang, Qianru Sun

Main category: cs.CV

TL;DR: cdViews: A zero-shot approach using 2D vision-language models for 3D question answering by automatically selecting critical and diverse views from 3D point clouds.

DetailsMotivation: To leverage powerful 2D large vision-language models (LVLMs) for 3D scene understanding tasks without expensive 3D model training, addressing the data scarcity problem in 3D domains.

Method: Proposes cdViews with two components: viewSelector identifies critical views based on answer-specific information potential, and viewNMS removes redundant views based on spatial overlap to ensure diversity. Uses 2D models like LLAVA-OV to process sampled views from 3D point clouds.
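
viewNMS is essentially greedy non-maximum suppression over candidate views; a minimal sketch, assuming a precomputed criticality score per view and a pairwise spatial-overlap matrix (both assumptions for exposition):

```python
def view_nms(views, scores, overlap, threshold=0.5, k=6):
    """Greedy NMS over candidate views: keep high-scoring views whose
    spatial overlap with already-kept views stays below a threshold."""
    order = sorted(range(len(views)), key=lambda i: scores[i], reverse=True)
    kept = []
    for i in order:
        if all(overlap[i][j] < threshold for j in kept):
            kept.append(i)
        if len(kept) == k:
            break
    return [views[i] for i in kept]

views = ["v0", "v1", "v2", "v3"]
scores = [0.9, 0.8, 0.4, 0.7]
overlap = [[1.0, 0.9, 0.1, 0.2], [0.9, 1.0, 0.2, 0.1],
           [0.1, 0.2, 1.0, 0.3], [0.2, 0.1, 0.3, 1.0]]
print(view_nms(views, scores, overlap))  # ['v0', 'v3', 'v2'] -- v1 suppressed as redundant
```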

Result: Achieves state-of-the-art performance on ScanQA and SQA benchmarks for 3D-QA, demonstrating that 2D LVLMs can effectively handle 3D tasks without fine-tuning.

Conclusion: 2D LVLMs are currently the most effective alternative to resource-intensive 3D LVLMs for addressing 3D tasks, especially in data-scarce scenarios.

Abstract: Large vision-language models (LVLMs) have significantly advanced numerous fields. In this work, we explore how to harness their potential to address 3D scene understanding tasks, using 3D question answering (3D-QA) as a representative example. Due to the limited training data in 3D, we do not train LVLMs but infer in a zero-shot manner. Specifically, we sample 2D views from a 3D point cloud and feed them into 2D models to answer a given question. When the 2D model is chosen, e.g., LLAVA-OV, the quality of sampled views matters the most. We propose cdViews, a novel approach to automatically selecting critical and diverse Views for 3D-QA. cdViews consists of two key components: viewSelector prioritizing critical views based on their potential to provide answer-specific information, and viewNMS enhancing diversity by removing redundant views based on spatial overlap. We evaluate cdViews on the widely-used ScanQA and SQA benchmarks, demonstrating that it achieves state-of-the-art performance in 3D-QA while relying solely on 2D models without fine-tuning. These findings support our belief that 2D LVLMs are currently the most effective alternative (of the resource-intensive 3D LVLMs) for addressing 3D tasks.

[159] Uncovering Grounding IDs: How External Cues Shape Multimodal Binding

Hosein Hasani, Amirmohammad Izadi, Fatemeh Askari, Mobin Bagherian, Sadegh Mohammadian, Mohammad Izadi, Mahdieh Soleymani Baghshah

Main category: cs.CV

TL;DR: Grounding IDs are latent identifiers that emerge when visual structures are added to LVLMs, improving multimodal binding and reducing hallucinations by aligning objects with partitions across modalities.

DetailsMotivation: While LVLMs perform well on multimodal tasks, they lack structured reasoning and precise grounding. Recent work shows visual structures improve accuracy, but the underlying mechanisms remain unclear.

Method: Investigate the phenomenon through representation analysis to identify Grounding IDs - latent identifiers induced by external cues that bind objects to partitions. Use causal interventions to confirm these identifiers mediate binding between objects and symbolic cues.

Result: Grounding IDs emerge as consistent within-partition alignment in embedding space, reduce modality gap between image and text, strengthen attention between related components, improve cross-modal grounding, and reduce hallucinations.

Conclusion: Grounding IDs are a key symbolic mechanism that explains how external cues enhance multimodal binding, offering both interpretability and practical improvements for LVLMs.

Abstract: Large vision-language models (LVLMs) show strong performance across multimodal benchmarks but remain limited in structured reasoning and precise grounding. Recent work has demonstrated that adding simple visual structures, such as partitions and annotations, improves accuracy, yet the internal mechanisms underlying these gains remain unclear. We investigate this phenomenon and propose the concept of Grounding IDs, latent identifiers induced by external cues that bind objects to their designated partitions across modalities. Through representation analysis, we find that these identifiers emerge as consistent within-partition alignment in embedding space and reduce the modality gap between image and text. Causal interventions further confirm that these identifiers mediate binding between objects and symbolic cues. We show that Grounding IDs strengthen attention between related components, which in turn improves cross-modal grounding and reduces hallucinations. Taken together, our results identify Grounding IDs as a key symbolic mechanism that explains how external cues enhance multimodal binding and offer both interpretability and practical improvements.

[160] Open-PMC-18M: A High-Fidelity Large Scale Medical Dataset for Multimodal Representation Learning

Negin Baghbanzadeh, Mohammed Saidul Islam, Sajad Ashkezari, Elham Dolatabadi, Arash Afkanpour

Main category: cs.CV

TL;DR: Open-PMC-18M: A large-scale biomedical vision-language dataset with 18M image-text pairs created through improved subfigure extraction and contextual enrichment, enabling state-of-the-art medical representation learning.

DetailsMotivation: Existing biomedical vision-language datasets are limited by small size, poor generalizability, and lack of rich medical context. Biomedical figure captions are often short, context-dependent, and partially informative, requiring better data curation for effective representation learning.

Method: Developed a data curation pipeline with transformer-based subfigure detection, subcaption extraction, and contextual text enrichment from inline references. Trained subfigure extraction model on 500K compound figures. Used this to create Open-PMC-18M dataset spanning radiology, microscopy, and visible light photography.

Result: Subfigure extraction model achieves SOTA on real and synthetic benchmarks. Open-PMC-18M contains 18M high-fidelity image-text pairs. Models trained on this dataset achieve SOTA results on 6 retrieval and 19 zero-shot classification tasks across three major modalities.

Conclusion: Data curation is foundational for biomedical representation learning. The released Open-PMC-18M dataset, models, and code enable reproducible benchmarks and advance biomedical vision-language modeling.

Abstract: In biomedical vision-language modeling, datasets are typically mined from scientific literature, pairing compound figures with captions that are short, context-dependent, and often partially informative. Prior work on subfigure extraction has been limited in both dataset size and generalizability. In addition, no existing effort has incorporated rich medical context in image-text pairs. We revisit data curation as a foundational component of effective biomedical representation learning. Our data curation process integrates transformer-based subfigure detection, subcaption extraction, and contextual text enrichment derived from inline references. Our subfigure extraction model, trained on a corpus of 500,000 compound figures, achieves state-of-the-art performance on real and synthetic benchmarks. Using this process, we curate and release Open-PMC-18M, a large-scale high-fidelity biomedical dataset comprising 18 million image-text pairs, spanning radiology, microscopy, and visible light photography. We train vision-language models on our dataset and perform extensive evaluation on 6 retrieval and 19 zero-shot classification tasks across three major modalities. The models trained on our dataset set new state-of-the-art results in medical representation learning. We release our dataset, models, and code to support reproducible benchmarks and further study into biomedical vision-language modeling and representation learning.

[161] Enabling Validation for Robust Few-Shot Recognition

Hanxin Wang, Tian Liu, Shu Kong

Main category: cs.CV

TL;DR: VEST framework improves few-shot recognition robustness by harmonizing performance on few-shot ID data and retrieved OOD data for better validation and parameter selection.

DetailsMotivation: Current VLM finetuning methods for few-shot recognition work well on in-distribution test data but struggle with out-of-distribution data, creating a need for more robust approaches.

Method: Proposes Validation-Enabled Stage-wise Tuning (VEST) that uses retrieved open data for validation, harmonizing performance gain on few-shot ID data and degradation on retrieved OOD data to enable better parameter and checkpoint selection.
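
The harmonized validation criterion can be pictured as trading a candidate checkpoint’s gain on few-shot ID data against its degradation on the retrieved data; a conceptual sketch with a simple difference score (the exact harmonization used by VEST is an assumption here):

```python
def vest_validation_score(acc_id, acc_ood, acc_id_zeroshot, acc_ood_zeroshot):
    """Score a finetuned checkpoint by balancing its ID gain against its
    degradation on retrieved (OOD-like) data. Conceptual sketch only."""
    gain = acc_id - acc_id_zeroshot            # improvement on few-shot ID data
    degradation = acc_ood_zeroshot - acc_ood   # drop on retrieved open data
    return gain - degradation

# Pick the checkpoint that improves ID accuracy without sacrificing robustness.
checkpoints = {"epoch3": (0.78, 0.60), "epoch10": (0.83, 0.48)}
zeroshot = (0.70, 0.62)
best = max(checkpoints, key=lambda k: vest_validation_score(*checkpoints[k], *zeroshot))
print(best)  # epoch3: epoch10 gains more on ID but degrades too much on retrieved data
```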

Result: VEST significantly outperforms existing VLM adaptation methods on ImageNet OOD benchmarks, achieving state-of-the-art performance on both ID and OOD data.

Conclusion: The proposed validation strategy effectively addresses the data scarcity challenge in few-shot recognition and enables robust learning through partial finetuning and checkpoint selection.

Abstract: Few-Shot Recognition (FSR) tackles classification tasks by training with minimal task-specific labeled data. Prevailing methods adapt or finetune a pretrained Vision-Language Model (VLM) and augment the scarce training data by retrieving task-relevant but noisy samples from open data sources. The finetuned VLM generalizes decently well to the task-specific in-distribution (ID) test data but struggles with out-of-distribution (OOD) test data. This motivates our study of robust FSR with VLM finetuning. The core challenge of FSR is data scarcity, extending beyond limited training data to a complete lack of validation data. We identify a key paradox as a potential solution: repurposing the retrieved open data for validation. As such retrieved data are inherently OOD compared with the task-specific ID training data, finetuned VLMs yield degraded performance on the retrieved data. This causes the validation logic to favor the pretrained model without any finetuning, hindering improvements w.r.t. generalization. To resolve this dilemma, we introduce a novel validation strategy that harmonizes performance gain and degradation on the few-shot ID data and the retrieved data, respectively. Our validation enables parameter selection for partial finetuning and checkpoint selection, mitigating overfitting and improving test-data generalization. We unify this strategy with robust learning into a cohesive framework: Validation-Enabled Stage-wise Tuning (VEST). Extensive experiments on the established ImageNet OOD benchmarks show that VEST significantly outperforms existing VLM adaptation methods, achieving state-of-the-art FSR performance on both ID and OOD data.

[162] AortaDiff: A Unified Multitask Diffusion Framework For Contrast-Free AAA Imaging

Yuxuan Ou, Ning Bi, Jiazhen Pan, Jiancheng Yang, Boliang Yu, Usama Zidan, Regent Lee, Vicente Grau

Main category: cs.CV

TL;DR: Unified deep learning framework that simultaneously generates synthetic contrast-enhanced CT from non-contrast CT and segments aortic lumen/thrombus using conditional diffusion models with multi-task learning.

DetailsMotivation: Reduce risks of iodinated contrast agents (nephrotoxicity, allergies, environmental harm) in abdominal aortic aneurysm assessment by generating synthetic CECT from NCCT, overcoming limitations of multi-stage pipelines that cause error accumulation and fail to leverage shared anatomical structures.

Method: Integrated conditional diffusion models with multi-task learning for end-to-end joint optimization of image synthesis and segmentation; shares encoder/decoder parameters across tasks, requires no initial predictions, and uses semi-supervised training to handle missing segmentation labels in clinical data.
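
The semi-supervised multitask objective can be sketched as a joint loss that simply skips the segmentation term for scans without masks; the shapes, the L1 stand-in for the diffusion objective, and the weighting are assumptions:

```python
import torch
import torch.nn.functional as F

def multitask_loss(pred_img, target_img, pred_seg, target_seg, has_seg_label):
    """Joint synthesis + segmentation loss that masks out the segmentation
    term for scans without labels (semi-supervised training, sketch only)."""
    loss = F.l1_loss(pred_img, target_img)  # stand-in for the diffusion objective
    if has_seg_label.any():
        # Only supervise segmentation where masks exist in the batch.
        loss = loss + F.cross_entropy(pred_seg[has_seg_label], target_seg[has_seg_label])
    return loss

b, c, h, w = 4, 3, 64, 64  # 3 classes: background / lumen / thrombus
loss = multitask_loss(torch.randn(b, 1, h, w), torch.randn(b, 1, h, w),
                      torch.randn(b, c, h, w), torch.randint(0, c, (b, h, w)),
                      torch.tensor([True, False, True, False]))
```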

Result: Outperformed state-of-the-art models on 264 patients: PSNR 25.61 dB vs 23.80 dB for image synthesis; lumen Dice 0.89 vs 0.87, thrombus Dice 0.53 vs 0.48; reduced lumen diameter MAE to 4.19 mm from 5.78 mm and thrombus area error to 33.85% from 41.45%.

Conclusion: Proposed unified framework successfully addresses limitations of multi-stage approaches by jointly optimizing image synthesis and segmentation, achieving superior performance in both tasks while reducing contrast agent use and improving clinical measurement accuracy.

Abstract: While contrast-enhanced CT (CECT) is standard for assessing abdominal aortic aneurysms (AAA), the required iodinated contrast agents pose significant risks, including nephrotoxicity, patient allergies, and environmental harm. To reduce contrast agent use, recent deep learning methods have focused on generating synthetic CECT from non-contrast CT (NCCT) scans. However, most adopt a multi-stage pipeline that first generates images and then performs segmentation, which leads to error accumulation and fails to leverage shared semantic and anatomical structures. To address this, we propose a unified deep learning framework that generates synthetic CECT images from NCCT scans while simultaneously segmenting the aortic lumen and thrombus. Our approach integrates conditional diffusion models (CDM) with multi-task learning, enabling end-to-end joint optimization of image synthesis and anatomical segmentation. Unlike previous multitask diffusion models, our approach requires no initial predictions (e.g., a coarse segmentation mask), shares both encoder and decoder parameters across tasks, and employs a semi-supervised training strategy to learn from scans with missing segmentation labels, a common constraint in real-world clinical data. We evaluated our method on a cohort of 264 patients, where it consistently outperformed state-of-the-art single-task and multi-stage models. For image synthesis, our model achieved a PSNR of 25.61 dB, compared to 23.80 dB from a single-task CDM. For anatomical segmentation, it improved the lumen Dice score to 0.89 from 0.87 and the challenging thrombus Dice score to 0.53 from 0.48 (nnU-Net). These segmentation enhancements led to more accurate clinical measurements, reducing the lumen diameter MAE to 4.19 mm from 5.78 mm and the thrombus area error to 33.85% from 41.45% when compared to nnU-Net. Code is available at https://github.com/yuxuanou623/AortaDiff.git.

[163] Martian World Model: Controllable Video Synthesis with Physically Accurate 3D Reconstructions

Longfei Li, Zhiwen Fan, Wenyan Cong, Xinhang Liu, Yuyang Yin, Matt Foutter, Panwang Pan, Chenyu You, Yue Wang, Zhangyang Wang, Yao Zhao, Marco Pavone, Yunchao Wei

Main category: cs.CV

TL;DR: Proposes M3arsSynth pipeline for 3D Martian environment reconstruction from NASA data and MarsGen video generator for synthesizing realistic Martian landscape videos with geometric consistency.

DetailsMotivation: Synthesizing realistic Martian landscape videos is crucial for mission rehearsal and robotic simulation, but challenging due to scarce high-quality Martian data and domain gap between Martian and terrestrial imagery.

Method: Two-component solution: 1) M3arsSynth data curation pipeline reconstructs 3D Martian environments from NASA’s stereo navigation images and renders high-fidelity multiview 3D video sequences; 2) MarsGen video generator fine-tuned on M3arsSynth data synthesizes novel videos conditioned on initial image frame, camera trajectories, or textual prompts.

Result: Approach outperforms video synthesis models trained on terrestrial datasets, achieving superior visual fidelity and 3D structural consistency across wide range of Martian terrains and acquisition dates.

Conclusion: Holistic solution enables physically accurate 3D surface models at metric-scale resolution and realistic video generation for novel Martian environments, addressing critical needs for space mission preparation.

Abstract: Synthesizing realistic Martian landscape videos is crucial for mission rehearsal and robotic simulation. However, this task poses unique challenges due to the scarcity of high-quality Martian data and the significant domain gap between Martian and terrestrial imagery. To address these challenges, we propose a holistic solution composed of two key components: 1) A data curation pipeline Multimodal Mars Synthesis (M3arsSynth), which reconstructs 3D Martian environments from real stereo navigation images, sourced from NASA’s Planetary Data System (PDS), and renders high-fidelity multiview 3D video sequences. 2) A Martian terrain video generator, MarsGen, which synthesizes novel videos that are visually realistic and geometrically consistent with the 3D structure encoded in the data. Our M3arsSynth engine spans a wide range of Martian terrains and acquisition dates, enabling the generation of physically accurate 3D surface models at metric-scale resolution. MarsGen, fine-tuned on M3arsSynth data, synthesizes videos conditioned on an initial image frame and, optionally, camera trajectories or textual prompts, allowing for video generation in novel environments. Experimental results show that our approach outperforms video synthesis models trained on terrestrial datasets, achieving superior visual fidelity and 3D structural consistency.

[164] TempoControl: Temporal Attention Guidance for Text-to-Video Models

Shira Schiber, Ofir Lindenbaum, Idan Schwartz

Main category: cs.CV

TL;DR: TempoControl enables precise temporal control in text-to-video generation by aligning visual concepts with timing signals using cross-attention map optimization, without requiring retraining.

DetailsMotivation: Current generative video models lack fine-grained temporal control - users cannot specify when specific visual elements should appear within generated sequences, limiting creative applications.

Method: Uses cross-attention maps from text-to-video diffusion models with a novel optimization approach that guides concept timing through three principles: temporal pattern alignment (correlation), attention strength adjustment (magnitude), and semantic consistency preservation (entropy).
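
The three guidance principles can be sketched as losses on a single token's temporal attention curve, as below; TempoControl's exact definitions may differ, so treat the correlation, magnitude, and entropy terms here as hedged approximations.

```python
import torch

def tempo_losses(attn_t, control_t, eps=1e-8):
    """Illustrative guidance terms on one concept's temporal attention curve.

    attn_t   : (T,) mean cross-attention mass for a token at each frame
    control_t: (T,) binary/soft signal marking when the concept should appear
    """
    # Correlation: align the temporal attention pattern with the control signal.
    a = attn_t - attn_t.mean()
    c = control_t - control_t.mean()
    loss_corr = 1.0 - (a * c).sum() / (a.norm() * c.norm() + eps)

    # Magnitude: demand attention mass in frames where visibility is required.
    loss_mag = -(attn_t * control_t).sum() / (control_t.sum() + eps)

    # Entropy: regularize the attention distribution (here, encourage spread;
    # the paper's exact use of the entropy term may differ).
    p = attn_t / (attn_t.sum() + eps)
    loss_ent = (p * (p + eps).log()).sum()

    return loss_corr + loss_mag + loss_ent
```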

Result: Achieves precise temporal control while maintaining high video quality and diversity, demonstrated across applications including temporal reordering of objects, action timing, and audio-aligned video generation.

Conclusion: TempoControl provides an effective method for temporal alignment of visual concepts during inference without retraining or additional supervision, expanding creative possibilities for text-to-video generation.

Abstract: Recent advances in generative video models have enabled the creation of high-quality videos based on natural language prompts. However, these models frequently lack fine-grained temporal control, meaning they do not allow users to specify when particular visual elements should appear within a generated sequence. In this work, we introduce TempoControl, a method that allows for temporal alignment of visual concepts during inference, without requiring retraining or additional supervision. TempoControl utilizes cross-attention maps, a key component of text-to-video diffusion models, to guide the timing of concepts through a novel optimization approach. Our method steers attention using three complementary principles: aligning its temporal pattern with a control signal (correlation), adjusting its strength where visibility is required (magnitude), and preserving semantic consistency (entropy). TempoControl provides precise temporal control while maintaining high video quality and diversity. We demonstrate its effectiveness across various applications, including temporal reordering of single and multiple objects, action timing, and audio-aligned video generation. Please see our project page for more details: https://shira-schiber.github.io/TempoControl/.

[165] A Real-Time System for Egocentric Hand-Object Interaction Detection in Industrial Domains

Antonio Finocchiaro, Alessandro Sebastiano Catinello, Michele Mazzamuto, Rosario Leonardi, Antonino Furnari, Giovanni Maria Farinella

Main category: cs.CV

TL;DR: Real-time hand-object interaction detection using cascaded Mamba-EfficientNetV2 for action recognition and YOLOWorld for object detection, achieving 38.52% p-AP and 85.13% AP respectively at 30fps.

DetailsMotivation: Hand-object interaction detection is challenging for real-time applications where fast and accurate detection is crucial for intuitive user experiences in streaming egocentric vision.

Method: Cascaded architecture with two modules: action recognition (Mamba model with EfficientNetV2 backbone) detects interaction states, and object detection (fine-tuned YOLOWorld) identifies active objects when interaction is confirmed.
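
The cascade logic itself is simple and can be sketched as follows, with `action_model` and `object_detector` standing in for the Mamba-EfficientNetV2 and YOLOWorld modules; the threshold value is an assumption.

```python
def cascaded_hoi(frame, action_model, object_detector, contact_threshold=0.5):
    """Run the detector only when the action branch predicts a contact state.

    action_model(frame)    -> float in [0, 1], probability of hand-object contact
    object_detector(frame) -> list of (label, score, box) for active objects
    Both callables are placeholders for the modules described above.
    """
    p_contact = action_model(frame)
    if p_contact < contact_threshold:
        return []  # no interaction detected: skip the costlier detection stage
    return object_detector(frame)
```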

Result: Action recognition achieves 38.52% p-AP on ENIGMA-51 benchmark at 30fps; object detection reaches 85.13% AP for hand and object detection.

Conclusion: Proposed efficient approach enables real-time hand-object interaction detection from streaming egocentric vision with cascaded architecture that activates object detection only when interaction is detected.

Abstract: Hand-object interaction detection remains an open challenge in real-time applications, where intuitive user experiences depend on fast and accurate detection of interactions with surrounding objects. We propose an efficient approach for detecting hand-object interactions from streaming egocentric vision that operates in real time. Our approach consists of an action recognition module and an object detection module for identifying active objects upon confirmed interaction. Our Mamba model with EfficientNetV2 as backbone for action recognition achieves 38.52% p-AP on the ENIGMA-51 benchmark at 30fps, while our fine-tuned YOLOWorld reaches 85.13% AP for hand and object detection. We implement our models in a cascaded architecture where the action recognition and object detection modules operate sequentially. When the action recognition module predicts a contact state, it activates the object detection module, which in turn performs inference on the relevant frame to detect and classify the active object.

[166] Language Integration in Fine-Tuning Multimodal Large Language Models for Image-Based Regression

Roy H. Jennings, Genady Paikin, Roy Shaul, Evgeny Soloveichik

Main category: cs.CV

TL;DR: Current MLLM approaches for image regression using preset vocabularies and generic prompts offer no advantage over image-only models. The proposed RvTC method with flexible bin-based classification and data-specific semantic prompts achieves state-of-the-art performance by enabling cross-modal understanding.

DetailsMotivation: Current Multimodal Large Language Models (MLLMs) for image-based regression tasks use preset output vocabularies and generic task-level prompts, assuming this mimics human rating behavior. However, these approaches fail to leverage the semantic understanding from textual input and provide no benefit over image-only training.

Method: Proposes Regression via Transformer-Based Classification (RvTC), which replaces vocabulary-constrained classification with a flexible bin-based approach. Instead of addressing discretization errors through complex distributional modeling, RvTC simply increases the number of bins. Crucially, the method emphasizes using data-specific prompts containing semantic information about specific images rather than generic task descriptions.
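
A sketch of the bin-based formulation: the head classifies over `n_bins` score bins and a scalar rating is decoded from the bin distribution. The expected-value decoding and the 1-10 scale shown here are assumptions for illustration.

```python
import torch

def decode_score(logits, lo=1.0, hi=10.0):
    """Turn bin logits into a scalar rating via the expected bin center.

    logits: (B, n_bins) classification-head output; the bin count is a free
    hyperparameter rather than a hand-crafted output vocabulary.
    """
    n_bins = logits.shape[-1]
    centers = torch.linspace(lo, hi, n_bins, device=logits.device)
    probs = logits.softmax(dim=-1)
    return (probs * centers).sum(dim=-1)  # (B,)


scores = decode_score(torch.randn(4, 100))  # e.g., 100 bins on a 1-10 scale
```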

Result: RvTC achieves state-of-the-art performance on four image assessment datasets using only images. Adding semantic information (like challenge titles on AVA dataset) to prompts substantially improves performance beyond the already state-of-the-art image-only baseline. The method demonstrates consistent improvements across two different MLLM architectures and shows benefits are not mere statistical biases.

Conclusion: MLLMs can effectively perform image regression when using flexible bin-based classification (RvTC) and data-specific semantic prompts that enable cross-modal understanding. The approach eliminates manual vocabulary crafting and demonstrates that semantic prompt information is crucial for leveraging MLLMs’ capabilities beyond image-only models.

Abstract: Multimodal Large Language Models (MLLMs) show promise for image-based regression tasks, but current approaches face key limitations. Recent methods fine-tune MLLMs using preset output vocabularies and generic task-level prompts (e.g., “How would you rate this image?”), assuming this mimics human rating behavior. Our analysis reveals that these approaches provide no benefit over image-only training. Models using preset vocabularies and generic prompts perform equivalently to image-only models, failing to leverage semantic understanding from textual input. We propose Regression via Transformer-Based Classification (RvTC), which replaces vocabulary-constrained classification with a flexible bin-based approach. Unlike approaches that address discretization errors through complex distributional modeling, RvTC eliminates manual vocabulary crafting through straightforward bin increase, achieving state-of-the-art performance on four image assessment datasets using only images. More importantly, we demonstrate that data-specific prompts dramatically improve performance. Unlike generic task descriptions, prompts containing semantic information about specific images enable MLLMs to leverage cross-modal understanding. On the AVA dataset, adding challenge titles to prompts substantially improves our already state-of-the-art image-only baseline. We demonstrate through empirical evidence from the AVA and AGIQA-3k datasets that MLLMs benefit from semantic prompt information, surpassing mere statistical biases. We validate RvTC across two different MLLM architectures, demonstrating consistent improvements and method generalizability.

[167] Perspective-Invariant 3D Object Detection

Ao Liang, Lingdong Kong, Dongyue Lu, Youquan Liu, Jian Fang, Huaici Zhao, Wei Tsang Ooi

Main category: cs.CV

TL;DR: Pi3DET: First multi-platform LiDAR 3D detection benchmark (vehicle, quadruped, drone) with cross-platform adaptation framework for perspective-invariant detection.

DetailsMotivation: Existing 3D object detection research focuses mainly on vehicle-mounted platforms, leaving other autonomous platforms (quadrupeds, drones) underexplored. Need for cross-platform 3D perception systems.

Method: Introduces Pi3DET dataset with LiDAR data and 3D annotations from multiple platforms. Proposes cross-platform adaptation framework with geometric and feature-level alignment for perspective-invariant 3D detection.

Result: Extensive experiments show substantial gains over existing adaptation methods on cross-platform tasks. Provides benchmark for evaluating 3D detector robustness in cross-platform scenarios.

Conclusion: Work paves way for generalizable, unified 3D perception across diverse environments. Dataset, benchmark suite, and annotation toolkit made publicly available.

Abstract: With the rise of robotics, LiDAR-based 3D object detection has garnered significant attention in both academia and industry. However, existing datasets and methods predominantly focus on vehicle-mounted platforms, leaving other autonomous platforms underexplored. To bridge this gap, we introduce Pi3DET, the first benchmark featuring LiDAR data and 3D bounding box annotations collected from multiple platforms: vehicle, quadruped, and drone, thereby facilitating research in 3D object detection for non-vehicle platforms as well as cross-platform 3D detection. Based on Pi3DET, we propose a novel cross-platform adaptation framework that transfers knowledge from the well-studied vehicle platform to other platforms. This framework achieves perspective-invariant 3D detection through robust alignment at both geometric and feature levels. Additionally, we establish a benchmark to evaluate the resilience and robustness of current 3D detectors in cross-platform scenarios, providing valuable insights for developing adaptive 3D perception systems. Extensive experiments validate the effectiveness of our approach on challenging cross-platform tasks, demonstrating substantial gains over existing adaptation methods. We hope this work paves the way for generalizable and unified 3D perception systems across diverse and complex environments. Our Pi3DET dataset, cross-platform benchmark suite, and annotation toolkit have been made publicly available.

[168] Collaborative Face Experts Fusion in Video Generation: Boosting Identity Consistency Across Large Face Poses

Yuji Wang, Moran Li, Xiaobin Hu, Ran Yi, Jiangning Zhang, Chengming Xu, Weijian Cao, Yabiao Wang, Chengjie Wang, Lizhuang Ma

Main category: cs.CV

TL;DR: CoFE proposes a novel identity-preserving video generation method with specialized face experts and a curated dataset to handle large face poses.

DetailsMotivation: Current video generation models fail to preserve identity under large face poses due to ineffective identity feature integration in DiT architectures and lack of large-pose video data.

Method: Collaborative Face Experts Fusion (CoFE) with three specialized experts (identity, semantic, detail) within DiT backbone, plus a data curation pipeline creating LaFID-180K dataset with face constraints, identity consistency, and speech disambiguation.
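
CoFE's fusion happens inside a DiT backbone; the toy module below only illustrates the general idea of dynamically gating three expert feature streams into one, with the class name and gating design being assumptions of this sketch.

```python
import torch
import torch.nn as nn

class ExpertFusion(nn.Module):
    """Toy dynamic fusion of identity / semantic / detail expert features.

    Each expert emits (B, N, D) token features; a learned gate produces
    per-token convex weights over the three streams.
    """

    def __init__(self, dim):
        super().__init__()
        self.gate = nn.Linear(3 * dim, 3)

    def forward(self, f_id, f_sem, f_det):
        stacked = torch.stack([f_id, f_sem, f_det], dim=-2)        # (B, N, 3, D)
        weights = self.gate(torch.cat([f_id, f_sem, f_det], -1))   # (B, N, 3)
        weights = weights.softmax(dim=-1).unsqueeze(-1)            # (B, N, 3, 1)
        return (weights * stacked).sum(dim=-2)                     # (B, N, D)


fusion = ExpertFusion(dim=64)
f_id, f_sem, f_det = (torch.randn(2, 16, 64) for _ in range(3))
out = fusion(f_id, f_sem, f_det)  # (2, 16, 64)
```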

Result: Significantly outperforms state-of-the-art methods in face similarity, FID, and CLIP semantic alignment on multiple benchmarks.

Conclusion: CoFE effectively addresses identity preservation in large-pose video generation through specialized expert fusion and targeted dataset curation.

Abstract: Current video generation models struggle with identity preservation under large face poses, primarily facing two challenges: the difficulty in exploring an effective mechanism to integrate identity features into DiT architectures, and the lack of targeted coverage of large face poses in existing open-source video datasets. To address these, we present two key innovations. First, we propose Collaborative Face Experts Fusion (CoFE), which dynamically fuses complementary signals from three specialized experts within the DiT backbone: an identity expert that captures cross-pose invariant features, a semantic expert that encodes high-level visual context, and a detail expert that preserves pixel-level attributes such as skin texture and color gradients. Second, we introduce a data curation pipeline comprising three key components: Face Constraints to ensure diverse large-pose coverage, Identity Consistency to maintain stable identity across frames, and Speech Disambiguation to align textual captions with actual speaking behavior. This pipeline yields LaFID-180K, a large-scale dataset of pose-annotated video clips designed for identity-preserving video generation. Experimental results on several benchmarks demonstrate that our approach significantly outperforms state-of-the-art methods in face similarity, FID, and CLIP semantic alignment. Project page: https://rain152.github.io/CoFE/.

[169] SegAssess: Panoramic quality mapping for robust and transferable unsupervised segmentation assessment

Bingnan Yang, Mi Zhang, Zhili Zhang, Zhan Zhang, Yuanxin Zhao, Xiangyun Hu, Jianya Gong

Main category: cs.CV

TL;DR: SegAssess introduces a novel deep learning framework for pixel-level segmentation quality assessment using panoramic quality mapping, achieving SOTA performance with strong zero-shot transferability.

DetailsMotivation: Current unsupervised segmentation quality assessment methods suffer from coarse granularity, incomplete assessments, and poor transferability, limiting their effectiveness for pixel-level geospatial analysis in remote sensing.

Method: SegAssess formulates SQA as a four-class panoramic segmentation task (TP, FP, TN, FN) using enhanced SAM architecture with mask prompting via cross-attention, Edge Guided Compaction with Aggregated Semantic Filter for edge refinement, and Augmented Mixup Sampling for cross-domain robustness.
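
Constructing the four-class panoramic target from a mask under evaluation and a reference mask is straightforward; the sketch below uses an arbitrary class-id ordering.

```python
import numpy as np

def panoramic_target(pred_mask, gt_mask):
    """Build the 4-class quality map: 0=TN, 1=FP, 2=FN, 3=TP.

    pred_mask, gt_mask: (H, W) binary arrays; the class ids are an
    ordering chosen for this sketch, not necessarily the paper's.
    """
    pred = pred_mask.astype(bool)
    gt = gt_mask.astype(bool)
    target = np.zeros(pred.shape, dtype=np.int64)
    target[pred & ~gt] = 1   # false positive
    target[~pred & gt] = 2   # false negative
    target[pred & gt] = 3    # true positive
    return target


pred = np.array([[1, 1], [0, 0]])
gt = np.array([[1, 0], [0, 1]])
print(panoramic_target(pred, gt))  # [[3 1] [0 2]]
```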

Result: SegAssess achieves state-of-the-art performance and demonstrates remarkable zero-shot transferability to unseen masks in comprehensive experiments.

Conclusion: The proposed PQM paradigm and SegAssess framework provide a comprehensive, pixel-wise solution for segmentation quality assessment with superior performance and transferability for remote sensing applications.

Abstract: High-quality image segmentation is fundamental to pixel-level geospatial analysis in remote sensing, necessitating robust segmentation quality assessment (SQA), particularly in unsupervised settings lacking ground truth. Although recent deep learning (DL) based unsupervised SQA methods show potential, they often suffer from coarse evaluation granularity, incomplete assessments, and poor transferability. To overcome these limitations, this paper introduces Panoramic Quality Mapping (PQM) as a new paradigm for comprehensive, pixel-wise SQA, and presents SegAssess, a novel deep learning framework realizing this approach. SegAssess distinctively formulates SQA as a fine-grained, four-class panoramic segmentation task, classifying pixels within a segmentation mask under evaluation into true positive (TP), false positive (FP), true negative (TN), and false negative (FN) categories, thereby generating a complete quality map. Leveraging an enhanced Segment Anything Model (SAM) architecture, SegAssess uniquely employs the input mask as a prompt for effective feature integration via cross-attention. Key innovations include an Edge Guided Compaction (EGC) branch with an Aggregated Semantic Filter (ASF) module to refine predictions near challenging object edges, and an Augmented Mixup Sampling (AMS) training strategy integrating multi-source masks to significantly boost cross-domain robustness and zero-shot transferability. Comprehensive experiments demonstrate that SegAssess achieves state-of-the-art (SOTA) performance and exhibits remarkable zero-shot transferability to unseen masks. The code is available at https://github.com/Yangbn97/SegAssess.

[170] Language-Instructed Reasoning for Group Activity Detection via Multimodal Large Language Model

Jihua Peng, Qianxiong Xu, Yichen Liu, Chenxi Liu, Cheng Long, Rui Zhao, Ziyue Li

Main category: cs.CV

TL;DR: LIR-GAD: A novel language-instructed reasoning framework for Group Activity Detection using Multimodal Large Language Models with specialized tokens and multimodal fusion.

DetailsMotivation: Existing deep learning methods for Group Activity Detection rely on implicit pattern recognition from visual features, struggle with contextual reasoning and explainability, and lack explicit semantic understanding of collective activities.

Method: Proposes LIR-GAD framework that expands MLLM vocabulary with activity-level token and cluster-specific tokens, processes video frames with language instructions, uses multi-label classification loss, and introduces Multimodal Dual-Alignment Fusion module to integrate MLLM embeddings with visual features.
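
Expanding an MLLM vocabulary with such special tokens typically looks like the following HuggingFace-style sketch; the token strings, the cluster count, and the base model (gpt2 as a lightweight stand-in for the MLLM's language backbone) are hypothetical placeholders rather than the paper's actual choices.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Hypothetical token names and a small stand-in language model.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

special_tokens = ["<ACTIVITY>"] + [f"<CLUSTER_{i}>" for i in range(8)]
tokenizer.add_special_tokens({"additional_special_tokens": special_tokens})
model.resize_token_embeddings(len(tokenizer))  # adds trainable rows for new tokens

print(tokenizer.convert_tokens_to_ids("<ACTIVITY>"))
```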

Result: Both quantitative and qualitative experiments demonstrate superior performance in Group Activity Detection tasks compared to existing methods.

Conclusion: The language-instructed reasoning approach effectively addresses limitations of existing methods by leveraging MLLM’s commonsense knowledge for better semantic understanding, contextual reasoning, and explainability in group activity detection.

Abstract: Group activity detection (GAD) aims to simultaneously identify group members and categorize their collective activities within video sequences. Existing deep learning-based methods develop specialized architectures (e.g., transformer networks) to model the dynamics of individual roles and semantic dependencies between individuals and groups. However, they rely solely on implicit pattern recognition from visual features and struggle with contextual reasoning and explainability. In this work, we propose LIR-GAD, a novel framework of language-instructed reasoning for GAD via Multimodal Large Language Model (MLLM). Our approach expands the original vocabulary of the MLLM by introducing an activity-level token and multiple cluster-specific tokens. We process video frames alongside the two types of specially designed tokens and language instructions, which are then integrated into the MLLM. The pretrained commonsense knowledge embedded in the MLLM enables the activity-level token and the cluster-specific tokens to effectively capture the semantic information of collective activities and learn distinct representational features of different groups, respectively. Also, we introduce a multi-label classification loss to further enhance the activity-level token’s ability to learn discriminative semantic representations. Then, we design a Multimodal Dual-Alignment Fusion (MDAF) module that integrates MLLM’s hidden embeddings corresponding to the designed tokens with visual features, significantly enhancing the performance of GAD. Both quantitative and qualitative experiments demonstrate the superior performance of our proposed method in GAD tasks.

[171] iFinder: Structured Zero-Shot Vision-Based LLM Grounding for Dash-Cam Video Reasoning

Manyi Yao, Bingbing Zhuang, Sparsh Garg, Amit Roy-Chowdhury, Christian Shelton, Manmohan Chandraker, Abhishek Aich

Main category: cs.CV

TL;DR: iFinder is a training-free framework that grounds LLMs in dash-cam video analysis by extracting structured driving cues (object pose, lanes, trajectories) and organizing them hierarchically for improved reasoning over end-to-end V-VLMs.

DetailsMotivation: LLMs struggle with domain-specific tasks like dash-cam analysis due to general-purpose training and lack of structured inductive biases. Vision-only analysis is challenging for existing V-VLMs which have poor spatial reasoning, causal inference, and explainability.

Method: Modular, training-free pipeline using pretrained vision models to extract critical driving cues (object pose, lane positions, object trajectories), organized into hierarchical frame- and video-level structures. Combined with three-block prompting strategy for step-wise grounded reasoning to refine V-VLM outputs.
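
To make the "hierarchical, interpretable data structure" concrete, here is a hypothetical schema of the kind of frame- and video-level record such a pipeline could hand to an LLM; all field names are illustrative, not iFinder's actual keys.

```python
import json

# Hypothetical schema; every key below is an illustrative assumption.
video_record = {
    "video_id": "dashcam_0001",
    "global_context": {"weather": "rain", "road_type": "highway"},
    "frames": [
        {
            "frame_idx": 412,
            "lanes": {"ego_lane_position": "center", "lane_count": 3},
            "objects": [
                {"track_id": 7, "category": "car", "orientation_deg": 175.0,
                 "lane": "ego", "trajectory": [[640, 360], [636, 362]]},
            ],
        },
    ],
}
print(json.dumps(video_record, indent=2))
```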

Result: Outperforms end-to-end V-VLMs on four zero-shot driving benchmarks with up to 39% gains in accident reasoning accuracy. Domain-specific cues like object orientation and global context significantly improve performance.

Conclusion: iFinder provides a zero-shot, interpretable, and reliable alternative to end-to-end V-VLMs for post-hoc driving video understanding by grounding LLMs with driving domain-specific structured representations.

Abstract: Grounding large language models (LLMs) in domain-specific tasks like post-hoc dash-cam driving video analysis is challenging due to their general-purpose training and lack of structured inductive biases. As vision is often the sole modality available for such analysis (i.e., no LiDAR, GPS, etc.), existing video-based vision-language models (V-VLMs) struggle with spatial reasoning, causal inference, and explainability of events in the input video. To this end, we introduce iFinder, a structured semantic grounding framework that decouples perception from reasoning by translating dash-cam videos into a hierarchical, interpretable data structure for LLMs. iFinder operates as a modular, training-free pipeline that employs pretrained vision models to extract critical cues – object pose, lane positions, and object trajectories – which are hierarchically organized into frame- and video-level structures. Combined with a three-block prompting strategy, it enables step-wise, grounded reasoning for the LLM to refine a peer V-VLM’s outputs and provide accurate reasoning. Evaluations on four public dash-cam video benchmarks show that iFinder’s proposed grounding with domain-specific cues, especially object orientation and global context, significantly outperforms end-to-end V-VLMs on four zero-shot driving benchmarks, with up to 39% gains in accident reasoning accuracy. By grounding LLMs with driving domain-specific representations, iFinder offers a zero-shot, interpretable, and reliable alternative to end-to-end V-VLMs for post-hoc driving video understanding.

[172] ZQBA: Zero Query Black-box Adversarial Attack

Joana C. Costa, Tiago Roxo, Hugo Proença, Pedro R. M. Inácio

Main category: cs.CV

TL;DR: ZQBA is a zero-query black-box adversarial attack that uses feature maps from one DNN to fool other networks without requiring multiple queries or diffusion models.

DetailsMotivation: Current black-box attacks require multiple queries or diffusion models with surrogate training, limiting real-world applicability. There's a need for more practical attacks that don't require extensive queries or complex training.

Method: Extracts feature maps from a Deep Neural Network (DNN) and adds them to clean images to create adversarial samples that impair target model classification. This approach transfers adversarial samples across different models and datasets.
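
The core idea, adding a source model's feature map to a clean image without querying the target, can be sketched as below; the backbone, the layer tapped, the normalization, and the epsilon budget are all assumptions of this sketch (input normalization is also omitted for brevity).

```python
import torch
import torch.nn.functional as F
from torchvision.models import resnet18, ResNet18_Weights

model = resnet18(weights=ResNet18_Weights.DEFAULT).eval()

def zqba_style_sample(image, eps=8 / 255):
    """Add a source model's feature map to a clean image (zero target queries).

    image: (1, 3, H, W) in [0, 1]. Layer choice and eps are illustrative.
    """
    feats = {}
    handle = model.layer2.register_forward_hook(
        lambda mod, inp, out: feats.update(out=out.detach())
    )
    with torch.no_grad():
        model(image)
    handle.remove()
    fmap = feats["out"].mean(dim=1, keepdim=True)            # (1, 1, h, w)
    fmap = F.interpolate(fmap, size=image.shape[-2:],
                         mode="bilinear", align_corners=False)
    fmap = (fmap - fmap.min()) / (fmap.max() - fmap.min() + 1e-8)
    perturb = eps * (2 * fmap.expand_as(image) - 1)          # roughly in [-eps, eps]
    return (image + perturb).clamp(0, 1)


adv = zqba_style_sample(torch.rand(1, 3, 224, 224))
```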

Result: ZQBA successfully transfers adversarial samples across various models and datasets (CIFAR, Tiny ImageNet). It outperforms state-of-the-art black-box attacks with a single query and maintains imperceptible perturbations (verified by SSIM and qualitative evaluation).

Conclusion: ZQBA demonstrates a practical zero-query black-box attack method that exploits DNN representations, revealing vulnerabilities in real-world DNN deployments while maintaining attack effectiveness and imperceptibility.

Abstract: Current black-box adversarial attacks either require multiple queries or diffusion models to produce adversarial samples that can impair the target model performance. However, these methods require training a surrogate loss or diffusion models to produce adversarial samples, which limits their applicability in real-world settings. Thus, we propose a Zero Query Black-box Adversarial (ZQBA) attack that exploits the representations of Deep Neural Networks (DNNs) to fool other networks. Instead of requiring thousands of queries to produce deceiving adversarial samples, we use the feature maps obtained from a DNN and add them to clean images to impair the classification of a target model. The results suggest that ZQBA can transfer the adversarial samples to different models and across various datasets, namely CIFAR and Tiny ImageNet. The experiments also show that ZQBA is more effective than state-of-the-art black-box attacks with a single query, while maintaining the imperceptibility of perturbations, evaluated both quantitatively (SSIM) and qualitatively, emphasizing the vulnerabilities of employing DNNs in real-world contexts. All the source code is available at https://github.com/Joana-Cabral/ZQBA.

[173] CookAnything: A Framework for Flexible and Consistent Multi-Step Recipe Image Generation

Ruoxuan Zhang, Bin Wen, Hongxia Xie, Yi Yao, Songhan Zuo, Jian-Yu Jiang-Lin, Hong-Han Shuai, Wen-Huang Cheng

Main category: cs.CV

TL;DR: CookAnything is a diffusion-based framework that generates coherent image sequences from cooking instructions of any length, addressing limitations of current methods in handling structured multi-step scenarios and variable recipe lengths.

DetailsMotivation: Current diffusion models struggle with structured multi-step scenarios like recipe illustration, and existing recipe illustration methods cannot adjust to natural variability in recipe length, generating fixed numbers of images regardless of actual instruction structure.

Method: Three key components: (1) Step-wise Regional Control (SRC) aligns textual steps with image regions in a single denoising process; (2) Flexible RoPE provides step-aware positional encoding for temporal coherence and spatial diversity; (3) Cross-Step Consistency Control (CSCC) maintains ingredient consistency across steps.

Result: Experimental results on recipe illustration benchmarks show CookAnything outperforms existing methods in both training-based and training-free settings.

Conclusion: The framework supports scalable, high-quality visual synthesis of complex multi-step instructions and has significant potential for applications in instructional media and procedural content creation.

Abstract: Cooking is a sequential and visually grounded activity, where each step such as chopping, mixing, or frying carries both procedural logic and visual semantics. While recent diffusion models have shown strong capabilities in text-to-image generation, they struggle to handle structured multi-step scenarios like recipe illustration. Additionally, current recipe illustration methods are unable to adjust to the natural variability in recipe length, generating a fixed number of images regardless of the actual instruction structure. To address these limitations, we present CookAnything, a flexible and consistent diffusion-based framework that generates coherent, semantically distinct image sequences from textual cooking instructions of arbitrary length. The framework introduces three key components: (1) Step-wise Regional Control (SRC), which aligns textual steps with corresponding image regions within a single denoising process; (2) Flexible RoPE, a step-aware positional encoding mechanism that enhances both temporal coherence and spatial diversity; and (3) Cross-Step Consistency Control (CSCC), which maintains fine-grained ingredient consistency across steps. Experimental results on recipe illustration benchmarks show that CookAnything performs better than existing methods in both training-based and training-free settings. The proposed framework supports scalable, high-quality visual synthesis of complex multi-step instructions and holds significant potential for broad applications in instructional media and procedural content creation.

[174] VirDA: Reusing Backbone for Unsupervised Domain Adaptation with Visual Reprogramming

Duy Nguyen, Dat Nguyen

Main category: cs.CV

TL;DR: VirDA: Visual reprogramming for parameter-efficient unsupervised domain adaptation by adding domain-specific visual prompts to input images instead of fine-tuning backbone parameters.

DetailsMotivation: Existing UDA methods require fine-tuning backbone parameters for each new source-target pair, leading to linear growth in parameters and storage, and preventing reuse of well-trained backbones across domains.

Method: Prepends a domain-specific visual reprogramming layer to the backbone that produces visual prompts as textural bias to adapt input style to target domain. Uses multiple objective functions optimizing intra- and inter-domain distribution differences without modifying backbone parameters.
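
A minimal sketch of a visual reprogramming layer: a trainable full-image additive prompt in front of a frozen backbone. The paper's layer is likely more structured; the parameterization and backbone here are assumptions.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet50, ResNet50_Weights

class VisualReprogram(nn.Module):
    """Trainable additive prompt prepended to a frozen backbone (a sketch;
    VirDA's actual reprogramming layer may be more structured)."""

    def __init__(self, image_size=224):
        super().__init__()
        self.prompt = nn.Parameter(torch.zeros(1, 3, image_size, image_size))

    def forward(self, x):
        return x + self.prompt  # domain-specific textural bias on the input


backbone = resnet50(weights=ResNet50_Weights.DEFAULT)
for p in backbone.parameters():
    p.requires_grad_(False)  # the same backbone is reused across domains

reprogram = VisualReprogram()  # only ~150K trainable parameters per domain here
logits = backbone(reprogram(torch.rand(2, 3, 224, 224)))
```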

Result: Achieves 92.8% mean accuracy on Office-31 with only 1.5M trainable parameters. Surpasses PDA by +1.6% accuracy using 46% of its parameters. Outperforms full-backbone fine-tuning methods (CDTrans, FixBi) while requiring only 1.7-2.8% of their parameters.

Conclusion: VirDA enables parameter-efficient domain adaptation by reusing backbone parameters across domains through visual reprogramming, achieving competitive accuracy with minimal parameter overhead compared to state-of-the-art methods.

Abstract: Existing UDA pipelines fine-tune already well-trained backbone parameters for every new source-and-target pair, resulting in the number of training parameters and storage memory growing linearly with each new pair, and also preventing the reuse of these well-trained backbone parameters. Inspired by recent implications that existing backbones have textural biases, we propose making use of domain-specific textural bias for domain adaptation via visual reprogramming, namely VirDA. Instead of fine-tuning the full backbone, VirDA prepends a domain-specific visual reprogramming layer to the backbone. This layer produces visual prompts that act as an added textural bias to the input image, adapting its “style” to a target domain. To optimize these visual reprogramming layers, we use multiple objective functions that optimize the intra- and inter-domain distribution differences when domain-adapting visual prompts are applied. This process does not require modifying the backbone parameters, allowing the same backbone to be reused across different domains. We evaluate VirDA on Office-31 and obtain 92.8% mean accuracy with only 1.5M trainable parameters. VirDA surpasses PDA, the state-of-the-art parameter-efficient UDA baseline, by +1.6% accuracy while using just 46% of its parameters. Compared with full-backbone fine-tuning, VirDA outperforms CDTrans and FixBi by +0.2% and +1.4%, respectively, while requiring only 1.7% and 2.8% of their trainable parameters. Relative to the strongest current methods (PMTrans and TVT), VirDA uses ~1.7% of their parameters and trades off only 2.2% and 1.1% accuracy, respectively.

[175] TeleEgo: Benchmarking Egocentric AI Assistants in the Wild

Jiaqi Yan, Ruilong Ren, Jingren Liu, Shuning Xu, Ling Wang, Yiheng Wang, Xinlin Zhong, Yun Wang, Long Zhang, Xiangyu Chen, Changzhi Sun, Jixiang Luo, Dell Zhang, Hao Sun, Chi Zhang, Xuelong Li

Main category: cs.CV

TL;DR: TeleEgo is a new benchmark for evaluating egocentric AI assistants with long-duration, streaming, multi-modal data across realistic daily activities, featuring 14+ hours per participant of synchronized video/audio/text and 3,291 QA items for testing memory, understanding, and cross-memory reasoning.

DetailsMotivation: Existing benchmarks for egocentric AI assistants evaluate capabilities in isolation, lack realistic streaming scenarios, or only support short-term tasks. There's a need for comprehensive evaluation that combines multi-modal processing, real-time responsiveness, and long-term memory retention in realistic daily contexts.

Method: Created TeleEgo benchmark with over 14 hours per participant of synchronized egocentric video, audio, and text across four domains (work/study, lifestyle/routines, social activities, outings/culture). Data is aligned on unified global timeline with human-refined visual narrations and speech transcripts. Defines 12 diagnostic subtasks across three core capabilities: Memory, Understanding, and Cross-Memory Reasoning, with 3,291 human-verified QA items in streaming setting.

Result: The benchmark contains extensive multi-modal data and evaluation framework. Introduces Real-Time Accuracy (RTA) to measure correctness and responsiveness under tight decision windows, and Memory Persistence Time (MPT) for long-term retention in continuous streams. Reports RTA results for current models and releases TeleEgo with MPT evaluation framework.
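
A simplified reading of Real-Time Accuracy: an answer counts only if it is both correct and delivered within the decision window. The joint rule below is a hedged approximation of the metric, with the window length chosen arbitrarily.

```python
def real_time_accuracy(answers, window_s=5.0):
    """Count an answer only if it is correct AND within the decision window.

    answers: iterable of (is_correct, latency_seconds). This joint
    correctness-and-latency rule is a simplification of the paper's RTA.
    """
    answers = list(answers)
    hits = sum(1 for ok, lat in answers if ok and lat <= window_s)
    return hits / max(len(answers), 1)


print(real_time_accuracy([(True, 2.1), (True, 7.3), (False, 1.0)]))  # 0.333...
```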

Conclusion: TeleEgo provides a realistic and extensible benchmark for future egocentric assistants with stronger streaming memory, enabling systematic study of both real-time behavior and long-horizon memory capabilities in continuous, multi-modal contexts.

Abstract: Egocentric AI assistants in real-world settings must process multi-modal inputs (video, audio, text), respond in real time, and retain evolving long-term memory. However, existing benchmarks typically evaluate these abilities in isolation, lack realistic streaming scenarios, or support only short-term tasks. We introduce TeleEgo, a long-duration, streaming, omni-modal benchmark for evaluating egocentric AI assistants in realistic daily contexts. The dataset features over 14 hours per participant of synchronized egocentric video, audio, and text across four domains: work & study, lifestyle & routines, social activities, and outings & culture. All data is aligned on a unified global timeline and includes high-quality visual narrations and speech transcripts, curated through human refinement. TeleEgo defines 12 diagnostic subtasks across three core capabilities: Memory (recalling past events), Understanding (interpreting the current moment), and Cross-Memory Reasoning (linking distant events). It contains 3,291 human-verified QA items spanning multiple question formats (single-choice, binary, multi-choice, and open-ended), evaluated strictly in a streaming setting. We propose Real-Time Accuracy (RTA) to jointly capture correctness and responsiveness under tight decision windows, and Memory Persistence Time (MPT) as a forward-looking metric for long-term retention in continuous streams. In this work, we report RTA results for current models and release TeleEgo, together with an MPT evaluation framework, as a realistic and extensible benchmark for future egocentric assistants with stronger streaming memory, enabling systematic study of both real-time behavior and long-horizon memory.

[176] SAGE: Saliency-Guided Contrastive Embeddings

Colton R. Crum, Christopher Sweet, Adam Czajka

Main category: cs.CV

TL;DR: SAGE is a new method that integrates human saliency priors into neural network training using contrastive embeddings in latent space rather than image space, improving classification performance across various scenarios.

DetailsMotivation: Existing saliency-guided training methods rely on internal model mechanisms that may be unreliable, and placing guidance solely in image space creates challenges. The authors aim to move away from image space and use latent space embeddings to effectively integrate human perceptual priors.

Method: SAGE uses contrastive embeddings in the model’s latent space. It applies salient-preserving and saliency-degrading signal augmentations to input, captures changes in embeddings and model logits, and guides the model toward salient features and away from non-salient features using a contrastive triplet loss. A sanity check on logit distributions ensures model outputs match saliency-based augmentations.
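
The contrastive arrangement can be sketched with a standard triplet loss, treating the clean image's embedding as the anchor, a saliency-preserving augmentation as the positive, and a saliency-degrading one as the negative; the augmentation callables here are placeholders and the margin is an assumption.

```python
import torch
import torch.nn.functional as F

def sage_style_triplet(encoder, x, keep_salient, degrade_salient, margin=1.0):
    """Pull embeddings toward salient content, push away from degraded views.

    encoder: maps a batch of images to (B, D) embeddings
    keep_salient / degrade_salient: augmentations that preserve or destroy
    the human-annotated salient regions (placeholder callables here).
    """
    anchor = encoder(x)
    pos = encoder(keep_salient(x))
    neg = encoder(degrade_salient(x))
    return F.triplet_margin_loss(anchor, pos, neg, margin=margin)
```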

Result: SAGE demonstrates improved classification performance across both open- and closed-set scenarios compared to state-of-the-art saliency-based methods. It shows effectiveness across various backbones and suggests wide generalization across different tasks.

Conclusion: Moving saliency guidance from image space to latent space embeddings via contrastive learning (SAGE) effectively integrates human perceptual priors, overcoming limitations of existing approaches and improving model generalization across diverse scenarios.

Abstract: Integrating human perceptual priors into the training of neural networks has been shown to raise model generalization, serve as an effective regularizer, and align models with human expertise for applications in high-risk domains. Existing approaches to integrate saliency into model training often rely on internal model mechanisms, which recent research suggests may be unreliable. Our insight is that many challenges associated with saliency-guided training stem from the placement of the guidance approaches solely within the image space. Instead, we move away from the image space, use the model’s latent space embeddings to steer human guidance during training, and we propose SAGE (Saliency-Guided Contrastive Embeddings): a loss function that integrates human saliency into network training using contrastive embeddings. We apply salient-preserving and saliency-degrading signal augmentations to the input and capture the changes in embeddings and model logits. We guide the model towards salient features and away from non-salient features using a contrastive triplet loss. Additionally, we perform a sanity check on the logit distributions to ensure that the model outputs match the saliency-based augmentations. We demonstrate a boost in classification performance across both open- and closed-set scenarios against SOTA saliency-based methods, showing SAGE’s effectiveness across various backbones, and include experiments to suggest its wide generalization across tasks.

[177] Uni-Hand: Universal Hand Motion Forecasting in Egocentric Views

Junyi Ma, Wentao Bao, Jingyi Xu, Guanzhong Sun, Yu Zheng, Erhang Zhang, Xieyuanli Chen, Hesheng Wang

Main category: cs.CV

TL;DR: Uni-Hand: A universal hand motion forecasting framework with multi-modal fusion, dual-branch diffusion for head-hand synergy, multi-target prediction (wrist/fingers), and hand-object interaction states, validated in downstream tasks like human-robot policy transfer.

DetailsMotivation: Existing hand trajectory prediction methods suffer from insufficient prediction targets, modality gaps, entangled hand-head motion, and limited validation in downstream applications like AR and human-robot policy transfer.

Method: Multi-modal framework with vision-language fusion, global context incorporation, task-aware text embedding injection, dual-branch diffusion for concurrent head-hand motion prediction, target indicators for specific joint waypoints, and hand-object interaction state prediction.

Result: Achieves state-of-the-art performance in multi-dimensional and multi-target hand motion forecasting on multiple datasets, demonstrates impressive human-robot policy transfer for robotic manipulation, and effective feature enhancement for action anticipation/recognition.

Conclusion: Uni-Hand presents a comprehensive hand motion forecasting framework that addresses multiple limitations of existing methods and shows strong real-world applicability through novel downstream task benchmarks and validation.

Abstract: Forecasting how human hands move in egocentric views is critical for applications like augmented reality and human-robot policy transfer. Recently, several hand trajectory prediction (HTP) methods have been developed to generate future possible hand waypoints, which still suffer from insufficient prediction targets, inherent modality gaps, entangled hand-head motion, and limited validation in downstream tasks. To address these limitations, we present a universal hand motion forecasting framework considering multi-modal input, multi-dimensional and multi-target prediction patterns, and multi-task affordances for downstream applications. We harmonize multiple modalities by vision-language fusion, global context incorporation, and task-aware text embedding injection, to forecast hand waypoints in both 2D and 3D spaces. A novel dual-branch diffusion is proposed to concurrently predict human head and hand movements, capturing their motion synergy in egocentric vision. By introducing target indicators, the prediction model can forecast the specific joint waypoints of the wrist or the fingers, besides the widely studied hand center points. In addition, we enable Uni-Hand to additionally predict hand-object interaction states (contact/separation) to facilitate downstream tasks better. As the first work to incorporate downstream task evaluation in the literature, we build novel benchmarks to assess the real-world applicability of hand motion forecasting algorithms. The experimental results on multiple publicly available datasets and our newly proposed benchmarks demonstrate that Uni-Hand achieves the state-of-the-art performance in multi-dimensional and multi-target hand motion forecasting. Extensive validation on multiple downstream tasks also demonstrates impressive human-robot policy transfer for robotic manipulation and effective feature enhancement for action anticipation/recognition.

[178] REVISOR: Beyond Textual Reflection, Towards Multimodal Introspective Reasoning in Long-Form Video Understanding

Jiaze Li, Hao Yin, Wenhui Tan, Jingyang Chen, Boshen Xu, Yuxun Qu, Yijing Chen, Jianzhong Ju, Zhenbo Luo, Jian Luan

Main category: cs.CV

TL;DR: REVISOR is a novel multimodal reflection framework that enables MLLMs to collaboratively construct introspective reflection processes across text and visual modalities, significantly enhancing long-form video understanding without requiring additional supervised fine-tuning or external models.

DetailsMotivation: Current text-based self-reflection mechanisms have limitations in long-form video understanding because: (1) they only rethink text information while ignoring richer visual content, and (2) they lack cross-modal interaction capabilities needed to fully integrate visual information during reflection.

Method: Proposes REVISOR framework with Dual Attribution Decoupled Reward (DADR) mechanism integrated into GRPO training strategy. This enables MLLMs to collaboratively construct reflection processes across textual and visual modalities, ensuring accurate review of relevant video segments during reinforcement learning.

Result: Achieves impressive results on four benchmarks: VideoMME, LongVideoBench, MLVU, and LVBench, significantly enhancing long-form video understanding capability of MLLMs without supplementary supervised fine-tuning or external models.

Conclusion: REVISOR successfully addresses limitations of text-only reflection mechanisms by enabling cross-modal collaborative reflection, providing an effective framework for enhancing MLLMs’ long-form video understanding capabilities through visual-textual introspection.

Abstract: Self-reflection mechanisms that rely on purely text-based rethinking processes perform well in most multimodal tasks. However, when directly applied to long-form video understanding scenarios, they exhibit clear limitations. The fundamental reasons for this lie in two points: (1) long-form video understanding involves richer and more dynamic visual input, meaning that rethinking only the text information is insufficient and a further rethinking process specifically targeting visual information is necessary; (2) purely text-based reflection mechanisms lack cross-modal interaction capabilities, preventing them from fully integrating visual information during reflection. Motivated by these insights, we propose REVISOR (REflective VIsual Segment Oriented Reasoning), a novel framework for tool-augmented multimodal reflection. REVISOR enables MLLMs to collaboratively construct introspective reflection processes across textual and visual modalities, significantly enhancing their reasoning capability for long-form video understanding. To ensure that REVISOR can learn to accurately review video segments highly relevant to the question during reinforcement learning, we designed the Dual Attribution Decoupled Reward (DADR) mechanism. Integrated into the GRPO training strategy, this mechanism enforces causal alignment between the model’s reasoning and the selected video evidence. Notably, the REVISOR framework significantly enhances long-form video understanding capability of MLLMs without requiring supplementary supervised fine-tuning or external models, achieving impressive results on four benchmarks including VideoMME, LongVideoBench, MLVU, and LVBench.

[179] InfiniBench: Infinite Benchmarking for Visual Spatial Reasoning with Customizable Scene Complexity

Haoming Wang, Qiyao Xue, Wei Gao

Main category: cs.CV

TL;DR: InfiniBench is an automated benchmark generator that creates diverse 3D scenes for evaluating vision-language models’ spatial reasoning abilities with full customizability over scene complexity.

DetailsMotivation: Existing benchmarks for evaluating VLMs' spatial reasoning abilities lack customizability over scene complexity and cannot isolate specific failure modes under different spatial conditions. There's a need for diverse, scalable, and fully customizable evaluation tools.

Method: Three key innovations: 1) LLM-based agentic framework for iterative refinement of procedural scene constraints from natural language descriptions; 2) Flexible cluster-based layout optimizer for generating dense, cluttered scenes; 3) Task-aware camera trajectory optimization for rendering videos with full object coverage as VLM input.

Result: InfiniBench outperforms state-of-the-art procedural and LLM-based 3D generation methods in prompt fidelity and physical plausibility, especially in high-complexity scenarios. Successfully generates benchmarks for spatial reasoning tasks including measurement, perspective-taking, and spatiotemporal tracking.

Conclusion: InfiniBench provides a fully automated, customizable benchmark generator that can synthesize theoretically infinite 3D scenes with parameterized control over scene complexity, addressing limitations of existing benchmarks and enabling comprehensive evaluation of VLMs’ spatial reasoning abilities.

Abstract: Modern vision-language models (VLMs) are expected to have abilities of spatial reasoning with diverse scene complexities, but evaluating such abilities is difficult due to the lack of benchmarks that are not only diverse and scalable but also fully customizable. Existing benchmarks offer limited customizability over the scene complexity and are incapable of isolating and analyzing specific VLM failure modes under distinct spatial conditions. To address this gap, instead of individually presenting benchmarks for different scene complexities, in this paper we present InfiniBench, a fully automated, customizable and user-friendly benchmark generator that can synthesize a theoretically infinite variety of 3D scenes with parameterized control on scene complexity. InfiniBench uniquely translates scene descriptions in natural language into photo-realistic videos with complex and physically plausible 3D layouts. This is achieved through three key innovations: 1) an LLM-based agentic framework that iteratively refines procedural scene constraints from scene descriptions; 2) a flexible cluster-based layout optimizer that generates dense and cluttered scenes previously intractable for procedural methods; and 3) a task-aware camera trajectory optimization method that renders scenes into videos with full object coverage as VLM input. Experiments demonstrate that InfiniBench outperforms state-of-the-art procedural and LLM-based 3D generation methods in prompt fidelity and physical plausibility, especially in high-complexity scenarios. We further showcased the usefulness of InfiniBench by generating benchmarks for representative spatial reasoning tasks including measurement, perspective-taking and spatiotemporal tracking.

[180] MHB: Multimodal Handshape-aware Boundary Detection for Continuous Sign Language Recognition

Mingyu Zhao, Zhanfu Yang, Yang Zhou, Zhaoyang Xia, Can Jin, Xiaoxiao He, Dimitris N. Metaxas

Main category: cs.CV

TL;DR: Multimodal approach for continuous ASL sign recognition using 3D skeletal features and handshape classification to detect sign boundaries, followed by recognition on segmented signs.

DetailsMotivation: To improve continuous sign language recognition by accurately detecting sign boundaries in ASL sentences, which is crucial for effective recognition since signs in continuous signing differ from citation-form isolated signs.

Method: 1) Use 3D skeletal features to detect sign boundaries based on convergence of sign properties and dynamics. 2) Incorporate handshape information by pretraining a classifier on 87 canonical handshape categories. 3) Multimodal fusion to combine video segmentation and handshape models. 4) Train recognition model on both isolated signs and manually segmented continuous signs.

Result: Significant improvements over previous work when evaluated on the ASLLRP corpus.

Conclusion: Multimodal approach combining 3D skeletal features and handshape classification effectively improves continuous sign recognition by better detecting sign boundaries and accounting for differences between isolated and continuous signing.

Abstract: This paper employs a multimodal approach for continuous sign recognition by first using ML for detecting the start and end frames of signs in videos of American Sign Language (ASL) sentences, and then by recognizing the segmented signs. For improved robustness we use 3D skeletal features extracted from sign language videos to take into account the convergence of sign properties and their dynamics that tend to cluster at sign boundaries. Another focus of this paper is the incorporation of information from 3D handshape for boundary detection. To detect handshapes normally expected at the beginning and end of signs, we pretrain a handshape classifier for detection of 87 linguistically defined canonical handshape categories using a dataset that we created by integrating and normalizing several existing datasets. A multimodal fusion module is then used to unify the pretrained sign video segmentation framework and handshape classification models. Finally, the estimated boundaries are used for sign recognition, where the recognition model is trained on a large database containing both citation-form isolated signs and signs pre-segmented (based on manual annotations) from continuous signing, as such signs often differ somewhat in certain respects. We evaluate our method on the ASLLRP corpus and demonstrate significant improvements over previous work.

[181] Structure is Supervision: Multiview Masked Autoencoders for Radiology

Sonia Laguna, Andrea Agostini, Alain Ryser, Samuel Ruiperez-Campillo, Irene Cannistraci, Moritz Vandenhirtz, Stephan Mandt, Nicolas Deperrois, Farhad Nooralahzadeh, Michael Krauthammer, Thomas M. Sutter, Julia E. Vogt

Main category: cs.CV

TL;DR: MVMAE is a self-supervised framework that uses multi-view radiology data and text reports to learn robust medical image representations, outperforming baselines on disease classification tasks.

DetailsMotivation: Medical ML systems need pretraining strategies that exploit clinical data structure. Radiology studies have natural multi-view organization that can be leveraged for better representation learning.

Method: MVMAE combines masked image reconstruction with cross-view alignment to learn view-invariant representations. MVMAE-V2T adds radiology reports as text-based learning signals while maintaining vision-only inference.

Result: MVMAE consistently outperforms supervised and vision-language baselines on disease classification across MIMIC-CXR, CheXpert, and PadChest datasets. MVMAE-V2T provides additional gains, especially in low-label regimes.

Conclusion: Structural supervision (multi-view organization) and textual supervision (reports) are complementary paths toward scalable, clinically grounded medical foundation models.

Abstract: Building robust medical machine learning systems requires pretraining strategies that exploit the intrinsic structure present in clinical data. We introduce Multiview Masked Autoencoder (MVMAE), a self-supervised framework that leverages the natural multi-view organization of radiology studies to learn view-invariant and disease-relevant representations. MVMAE combines masked image reconstruction with cross-view alignment, transforming clinical redundancy across projections into a powerful self-supervisory signal. We further extend this approach with MVMAE-V2T, which incorporates radiology reports as an auxiliary text-based learning signal to enhance semantic grounding while preserving fully vision-based inference. Evaluated on a downstream disease classification task on three large-scale public datasets, MIMIC-CXR, CheXpert, and PadChest, MVMAE consistently outperforms supervised and vision-language baselines. Furthermore, MVMAE-V2T provides additional gains, particularly in low-label regimes where structured textual supervision is most beneficial. Together, these results establish the importance of structural and textual supervision as complementary paths toward scalable, clinically grounded medical foundation models.
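
The training objective pairs two standard ingredients. A minimal sketch, assuming MSE reconstruction on masked patches and a cosine alignment term with weight lam; the paper does not specify these exact choices.

```python
import torch
import torch.nn.functional as F

def mvmae_loss(recon, target, mask, emb_a, emb_b, lam=0.5):
    # Masked reconstruction: error is counted only on masked patches.
    per_patch = ((recon - target) ** 2).mean(dim=-1)          # (B, N)
    rec = (per_patch * mask).sum() / mask.sum().clamp(min=1)
    # Cross-view alignment: embeddings of two projections of the same study
    # (e.g., frontal and lateral chest views) are pulled together.
    align = 1.0 - F.cosine_similarity(emb_a, emb_b, dim=-1).mean()
    return rec + lam * align

B, N, D = 4, 196, 768                       # batch, patches, patch dim
mask = (torch.rand(B, N) < 0.75).float()    # 75% of patches masked
loss = mvmae_loss(torch.randn(B, N, D), torch.randn(B, N, D), mask,
                  torch.randn(B, 256), torch.randn(B, 256))
```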

[182] Lotus-2: Advancing Geometric Dense Prediction with Powerful Image Generative Model

Jing He, Haodong Li, Mingzhi Sheng, Ying-Cong Chen

Main category: cs.CV

TL;DR: Lotus-2 is a two-stage deterministic framework that adapts diffusion model priors for stable, accurate geometric dense prediction (depth/normal estimation) using minimal training data.

DetailsMotivation: Monocular geometric recovery is ill-posed due to appearance ambiguity. While discriminative models need massive supervision and generative diffusion models have powerful world priors, directly using stochastic diffusion for deterministic geometric inference is suboptimal.

Method: Two-stage framework: 1) Core predictor with single-step deterministic formulation, clean-data objective, and local continuity module for globally coherent structures. 2) Detail sharpener with constrained multi-step rectified-flow refinement within the manifold for fine-grained geometry.

Result: Using only 59K training samples (<1% of existing datasets), achieves SOTA in monocular depth estimation and competitive surface normal prediction, demonstrating diffusion models can serve as deterministic world priors.

Conclusion: Diffusion models can be effectively adapted as deterministic world priors for high-quality geometric reasoning, bridging traditional discriminative and generative paradigms with minimal supervision.

Abstract: Recovering pixel-wise geometric properties from a single image is fundamentally ill-posed due to appearance ambiguity and non-injective mappings between 2D observations and 3D structures. While discriminative regression models achieve strong performance through large-scale supervision, their success is bounded by the scale, quality and diversity of available data and limited physical reasoning. Recent diffusion models exhibit powerful world priors that encode geometry and semantics learned from massive image-text data, yet directly reusing their stochastic generative formulation is suboptimal for deterministic geometric inference: the former is optimized for diverse and high-fidelity image generation, whereas the latter requires stable and accurate predictions. In this work, we propose Lotus-2, a two-stage deterministic framework for stable, accurate and fine-grained geometric dense prediction, aiming to provide an optimal adaption protocol to fully exploit the pre-trained generative priors. Specifically, in the first stage, the core predictor employs a single-step deterministic formulation with a clean-data objective and a lightweight local continuity module (LCM) to generate globally coherent structures without grid artifacts. In the second stage, the detail sharpener performs a constrained multi-step rectified-flow refinement within the manifold defined by the core predictor, enhancing fine-grained geometry through noise-free deterministic flow matching. Using only 59K training samples, less than 1% of existing large-scale datasets, Lotus-2 establishes new state-of-the-art results in monocular depth estimation and highly competitive surface normal prediction. These results demonstrate that diffusion models can serve as deterministic world priors, enabling high-quality geometric reasoning beyond traditional discriminative and generative paradigms.
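
In miniature, the two-stage split looks like the sketch below: one deterministic pass predicts the clean geometry latent, then a few small flow-style steps refine it. The stand-in convolutions, step count, and step size are placeholders, not the Lotus-2 networks.

```python
import torch
import torch.nn as nn

core = nn.Conv2d(4, 4, 3, padding=1)       # stage 1: one-step clean-data predictor
sharpener = nn.Conv2d(4, 4, 3, padding=1)  # stage 2: predicts a refinement velocity

x = torch.randn(1, 4, 32, 32)   # latent of the input image
depth = core(x)                 # deterministic stage-1 prediction, no sampling
for _ in range(4):              # short, constrained flow-style refinement
    depth = depth + 0.1 * sharpener(depth)
```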

[183] TabletopGen: Instance-Level Interactive 3D Tabletop Scene Generation from Text or Single Image

Ziqian Wang, Yonghao He, Licheng Yang, Wei Zou, Hongxuan Ma, Liu Liu, Wei Sui, Yuxin Guo, Hu Su

Main category: cs.CV

TL;DR: TabletopGen is a training-free framework that generates diverse, interactive 3D tabletop scenes from reference images, using instance segmentation, 3D reconstruction, and novel pose/scale alignment.

DetailsMotivation: Current text/image-driven 3D scene generation methods focus on large-scale scenes and struggle with high-density layouts and complex spatial relations in tabletop scenes, which are essential for embodied AI and robotic manipulation.

Method: TabletopGen accepts reference images (can be from text-to-image models), performs instance segmentation/completion, reconstructs 3D models with canonical alignment, then uses a two-stage pose/scale alignment: Differentiable Rotation Optimizer for rotation recovery and Top-view Spatial Alignment for translation/scale estimation, assembling collision-free scenes.

Result: Extensive experiments and user studies show state-of-the-art performance, surpassing existing methods in visual fidelity, layout accuracy, and physical plausibility, generating realistic tabletop scenes with rich stylistic and spatial diversity.

Conclusion: TabletopGen provides an effective training-free framework for generating diverse, interactive 3D tabletop scenes, addressing limitations of current methods and enabling better support for embodied AI applications like robotic manipulation.

Abstract: Generating high-fidelity, physically interactive 3D simulated tabletop scenes is essential for embodied AI – especially for robotic manipulation policy learning and data synthesis. However, current text- or image-driven 3D scene generation methods mainly focus on large-scale scenes, struggling to capture the high-density layouts and complex spatial relations that characterize tabletop scenes. To address these challenges, we propose TabletopGen, a training-free, fully automatic framework that generates diverse, instance-level interactive 3D tabletop scenes. TabletopGen accepts a reference image as input, which can be synthesized by a text-to-image model to enhance scene diversity. We then perform instance segmentation and completion on the reference to obtain per-instance images. Each instance is reconstructed into a 3D model followed by canonical coordinate alignment. The aligned 3D models then undergo pose and scale estimation before being assembled into a collision-free, simulation-ready tabletop scene. A key component of our framework is a novel pose and scale alignment approach that decouples the complex spatial reasoning into two stages: a Differentiable Rotation Optimizer for precise rotation recovery and a Top-view Spatial Alignment mechanism for robust translation and scale estimation, enabling accurate 3D reconstruction from 2D reference. Extensive experiments and user studies show that TabletopGen achieves state-of-the-art performance, markedly surpassing existing methods in visual fidelity, layout accuracy, and physical plausibility, and is capable of generating realistic tabletop scenes with rich stylistic and spatial diversity. Our code will be publicly available.
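
The Differentiable Rotation Optimizer idea can be illustrated on a toy planar problem: recover the angle that aligns a set of 2D points with a rotated reference by gradient descent. The real system optimizes over rendered 3D instances; only the differentiable-rotation mechanism is shown here.

```python
import torch

def rotate(p, theta):
    # rotate 2D points by angle theta (radians)
    c, s = torch.cos(theta), torch.sin(theta)
    R = torch.stack([torch.stack([c, -s]), torch.stack([s, c])])
    return p @ R.T

pts = torch.randn(32, 2)
target = rotate(pts, torch.tensor(0.7))       # ground-truth yaw = 0.7 rad

theta = torch.tensor(0.0, requires_grad=True)
opt = torch.optim.Adam([theta], lr=0.05)
for _ in range(300):
    opt.zero_grad()
    loss = ((rotate(pts, theta) - target) ** 2).mean()
    loss.backward()
    opt.step()
print(float(theta))  # converges near 0.7
```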

[184] LM-CartSeg: Automated Segmentation of Lateral and Medial Cartilage and Subchondral Bone for Radiomics Analysis

Tongxu Zhang

Main category: cs.CV

TL;DR: LM-CartSeg is an automatic pipeline for knee cartilage/bone segmentation, lateral/medial compartmentalization, and radiomics analysis that improves segmentation accuracy and provides quality-controlled ROIs for multi-center OA studies.

DetailsMotivation: Existing knee MRI radiomics work relies on manual ROIs with poor quality control reporting, lacking robust anatomically meaningful regions that capture both cartilage and subchondral bone.

Method: Two 3D nnU-Net models trained on different datasets, with zero-shot predictions fused and refined using geometric rules: connected-component cleaning, 10mm subchondral bone bands, and data-driven tibial lateral/medial split using PCA and k-means.

Result: Post-processing improved macro ASSD from 2.63 to 0.36 mm and HD95 from 25.2 to 3.35 mm (DSC 0.91); zero-shot DSC on SKI-10 was 0.80. Geometric L/M rule produced stable compartments, and only 6-12% of radiomic features were strongly correlated with volume/thickness.

Conclusion: LM-CartSeg provides automatic, quality-controlled ROIs and radiomic features with discriminative information beyond simple morphometry, offering a practical foundation for multi-center knee OA radiomics studies.

Abstract: Background and Objective: Radiomics of knee MRI requires robust, anatomically meaningful regions of interest (ROIs) that jointly capture cartilage and subchondral bone. Most existing work relies on manual ROIs and rarely reports quality control (QC). We present LM-CartSeg, a fully automatic pipeline for cartilage/bone segmentation, geometric lateral/medial (L/M) compartmentalisation and radiomics analysis. Methods: Two 3D nnU-Net models were trained on SKM-TEA (138 knees) and OAIZIB-CM (404 knees). At test time, zero-shot predictions were fused and refined by simple geometric rules: connected-component cleaning, construction of 10 mm subchondral bone bands in physical space, and a data-driven tibial L/M split based on PCA and k-means. Segmentation was evaluated on an OAIZIB-CM test set (103 knees) and on SKI-10 (100 knees). QC used volume and thickness signatures. From 10 ROIs we extracted 4,650 non-shape radiomic features to study inter-compartment similarity, dependence on ROI size, and OA vs. non-OA classification on OAIZIB-CM. Results: Post-processing improved macro ASSD on OAIZIB-CM from 2.63 to 0.36 mm and HD95 from 25.2 to 3.35 mm, with DSC 0.91; zero-shot DSC on SKI-10 was 0.80. The geometric L/M rule produced stable compartments across datasets, whereas a direct L/M nnU-Net showed domain-dependent side swaps. Only 6 to 12 percent of features per ROI were strongly correlated with volume or thickness. Radiomics-based models outperformed models restricted to size-linked features. Conclusions: LM-CartSeg yields automatic, QC'd ROIs and radiomic features that carry discriminative information beyond simple morphometry, providing a practical foundation for multi-centre knee OA radiomics studies.
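
The data-driven L/M split is easy to sketch: project tibial voxel coordinates onto their first principal axis and cluster with k-means. The coordinates below are synthetic, and deciding which cluster is lateral vs. medial would require side information in practice.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

# Synthetic tibial-cartilage voxel coordinates: two plateau-like blobs.
rng = np.random.default_rng(0)
voxels = np.vstack([
    rng.normal([30, 50, 20], 5.0, size=(500, 3)),  # one tibial plateau
    rng.normal([70, 50, 20], 5.0, size=(500, 3)),  # the other plateau
])

# The first principal axis of the ROI approximates the left-right direction.
axis = PCA(n_components=1).fit_transform(voxels)          # (1000, 1)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(axis)
print(np.bincount(labels))  # two compartments of roughly 500 voxels each
```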

[185] COOPER: A Unified Model for Cooperative Perception and Reasoning in Spatial Intelligence

Zefeng Zhang, Xiangzhao Hao, Hengzhu Tang, Zhenyu Zhang, Jiawei Sheng, Xiaodong Li, Zhenyang Li, Li Gao, Daiting Shi, Dawei Yin, Tingwen Liu

Main category: cs.CV

TL;DR: COOPER is a unified MLLM that integrates depth and segmentation as auxiliary modalities with two-stage training for spatial perception and reasoning, achieving significant improvements in spatial reasoning while maintaining general performance.

DetailsMotivation: Current MLLMs struggle with 3D-aware spatial reasoning because existing approaches treat perception enhancement (using auxiliary modalities like depth/segmentation) and reasoning enhancement (training on spatial VQA datasets) in isolation rather than as integrated capabilities.

Method: COOPER leverages depth and segmentation as auxiliary modalities with two-stage training: 1) auxiliary modality generation capability, and 2) adaptive interleaved reasoning capability to integrate perception and reasoning in a unified framework.

Result: COOPER achieves average 6.91% improvement in spatial reasoning while maintaining general performance. Even the variant trained only for auxiliary modality generation achieves 7.92% gain on distance and size estimation tasks.

Conclusion: Learning to generate auxiliary modalities helps internalize spatial knowledge and strengthen spatial understanding, and a unified approach integrating perception and reasoning through adaptive interleaved reasoning leads to stronger spatial intelligence in MLLMs.

Abstract: Visual Spatial Reasoning is crucial for enabling Multimodal Large Language Models (MLLMs) to understand object properties and spatial relationships, yet current models still struggle with 3D-aware reasoning. Existing approaches typically enhance either perception, by augmenting RGB inputs with auxiliary modalities such as depth and segmentation, or reasoning, by training on spatial VQA datasets and applying reinforcement learning, and thus treat these two aspects in isolation. In this work, we investigate whether a unified MLLM can develop an intrinsic ability to enhance spatial perception and, through adaptive interleaved reasoning, achieve stronger spatial intelligence. We propose COOPER, a unified MLLM that leverages depth and segmentation as auxiliary modalities and is trained in two stages to acquire auxiliary modality generation and adaptive, interleaved reasoning capabilities. COOPER achieves an average 6.91% improvement in spatial reasoning while maintaining general performance. Moreover, even a variant trained only for auxiliary modality generation attains a 7.92% gain on distance and size estimation, suggesting that learning to generate auxiliary modalities helps internalize spatial knowledge and strengthen spatial understanding.

[186] Live Avatar: Streaming Real-time Audio-Driven Avatar Generation with Infinite Length

Yubo Huang, Hailong Guo, Fangtai Wu, Shifeng Zhang, Shijie Huang, Qijun Gan, Lin Liu, Sirui Zhao, Enhong Chen, Jiaming Liu, Steven Hoi

Main category: cs.CV

TL;DR: Live Avatar: A real-time streaming avatar generation system using a 14B diffusion model with novel parallelization and consistency mechanisms to overcome sequential bottlenecks.

DetailsMotivation: Existing diffusion-based video generation methods suffer from sequential computation bottlenecks and long-horizon inconsistency, making them impractical for real-time, streaming audio-driven avatar synthesis applications.

Method: 1) Timestep-forcing Pipeline Parallelism (TPP): Distributes denoising steps across multiple GPUs to break autoregressive bottlenecks. 2) Rolling Sink Frame Mechanism (RSFM): Maintains temporal consistency by dynamically recalibrating appearance using cached reference images. 3) Self-Forcing Distribution Matching Distillation: Enables causal, streamable adaptation of large-scale models without quality loss.

Result: Achieves 20 FPS end-to-end generation on 5 H800 GPUs, making it the first practical real-time high-fidelity avatar generation system at this scale with state-of-the-art performance.

Conclusion: Live Avatar establishes a new paradigm for deploying advanced diffusion models in industrial long-form video synthesis applications, overcoming fundamental limitations of existing methods for real-time streaming avatar generation.

Abstract: Existing diffusion-based video generation methods are fundamentally constrained by sequential computation and long-horizon inconsistency, limiting their practical adoption in real-time, streaming audio-driven avatar synthesis. We present Live Avatar, an algorithm-system co-designed framework that enables efficient, high-fidelity, and infinite-length avatar generation using a 14-billion-parameter diffusion model. Our approach introduces Timestep-forcing Pipeline Parallelism (TPP), a distributed inference paradigm that pipelines denoising steps across multiple GPUs, effectively breaking the autoregressive bottleneck and ensuring stable, low-latency real-time streaming. To further enhance temporal consistency and mitigate identity drift and color artifacts, we propose the Rolling Sink Frame Mechanism (RSFM), which maintains sequence fidelity by dynamically recalibrating appearance using a cached reference image. Additionally, we leverage Self-Forcing Distribution Matching Distillation to facilitate causal, streamable adaptation of large-scale models without sacrificing visual quality. Live Avatar demonstrates state-of-the-art performance, reaching 20 FPS end-to-end generation on 5 H800 GPUs, and, to the best of our knowledge, is the first to achieve practical, real-time, high-fidelity avatar generation at this scale. Our work establishes a new paradigm for deploying advanced diffusion models in industrial long-form video synthesis applications.
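
The scheduling idea behind TPP can be simulated in a single process: each stage stands for one denoising step pinned to a GPU, and latent chunks stream through so that, once the pipeline fills, one chunk completes per tick. This is a schedule illustration only, not the distributed implementation.

```python
from collections import deque

num_stages = 4                 # denoising steps, one per GPU in the real system
chunks = deque(range(8))       # chunk ids of an (in principle endless) stream
stages = [None] * num_stages
done, tick = [], 0
while chunks or any(s is not None for s in stages):
    if stages[-1] is not None:
        done.append((stages[-1], tick))   # fully denoised chunk leaves
    for i in range(num_stages - 1, 0, -1):
        stages[i] = stages[i - 1]         # advance the pipeline one step
    stages[0] = chunks.popleft() if chunks else None
    tick += 1
print(done)  # chunk k completes at tick k + num_stages
```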

[187] EMMA: Efficient Multimodal Understanding, Generation, and Editing with a Unified Architecture

Xin He, Longhui Wei, Jianbo Ouyang, Lingxi Xie, Qi Tian

Main category: cs.CV

TL;DR: EMMA is an efficient unified multimodal architecture that achieves state-of-the-art performance in understanding, generation, and editing tasks with only 4B parameters, outperforming larger models through innovative compression and architectural designs.

DetailsMotivation: To create a unified multimodal architecture that efficiently handles understanding, generation, and editing tasks simultaneously, overcoming the limitations of separate specialized models and inefficient unified approaches that require excessive computational resources.

Method: Four key innovations: 1) Efficient autoencoder with 32x compression to reduce token count, 2) Channel-wise concatenation instead of token-wise for visual tokens, 3) Shared-and-decoupled network for task-specific modeling, 4) Mixture-of-experts mechanism for visual understanding encoder.

Result: EMMA-4B significantly outperforms state-of-the-art unified multimodal approaches (e.g., BAGEL-7B) in both efficiency and performance, while achieving competitive results compared to specialized multimodal understanding and generation experts (e.g., Qwen3-VL and Qwen-Image).

Conclusion: EMMA successfully demonstrates that efficient unified multimodal architectures are feasible and lays a solid foundation for future development in this direction, showing that unified models can achieve both high performance and efficiency compared to specialized approaches.

Abstract: We propose EMMA, an efficient and unified architecture for multimodal understanding, generation and editing. Specifically, EMMA primarily consists of 1) An efficient autoencoder with a 32x compression ratio, which significantly reduces the number of tokens required for generation. This also ensures the training balance between understanding and generation tasks by applying the same compression ratio to images. 2) Channel-wise concatenation instead of token-wise concatenation among visual understanding and generation tokens, which further reduces the visual tokens in unified architectures. 3) A shared-and-decoupled network that enables mutual improvements across tasks while meeting the task-specific modeling requirements. 4) A mixture-of-experts mechanism adopted for visual understanding encoder, which substantially improves perceptual capabilities with only a small increase in parameters. Extensive experiments have shown that EMMA-4B can significantly outperform state-of-the-art unified multimodal approaches (e.g., BAGEL-7B) in both efficiency and performance, while also achieving competitive results compared to recent multimodal understanding and generation experts (e.g., Qwen3-VL and Qwen-Image). We believe that EMMA lays a solid foundation for the future development of unified multimodal architectures.
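
The token-budget argument for channel-wise concatenation is a one-liner in tensor shapes; the sizes below are arbitrary.

```python
import torch

B, N, C = 2, 256, 64
und = torch.randn(B, N, C)   # visual understanding tokens
gen = torch.randn(B, N, C)   # visual generation tokens

token_wise = torch.cat([und, gen], dim=1)     # (2, 512, 64): sequence doubles
channel_wise = torch.cat([und, gen], dim=-1)  # (2, 256, 128): length unchanged
print(token_wise.shape, channel_wise.shape)
# Self-attention cost scales with sequence length squared, so the channel-wise
# layout carries both streams without enlarging the token budget.
```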

[188] You Only Train Once (YOTO): A Retraining-Free Object Detection Framework

Priyanto Hidayatullah, Nurjannah Syakrani, Yudi Widhiyasana, Muhammad Rizqi Sholahuddin, Refdinal Tubagus, Zahri Al Adzani Hidayat, Hanri Fajar Ramadhan, Dafa Alfarizki Pratama, Farhan Muhammad Yasin

Main category: cs.CV

TL;DR: YOTO framework combines YOLO11n for object detection with DeIT and Proxy Anchor Loss for feature extraction, using cosine similarity with a vector database for classification to prevent catastrophic forgetting in retail object detection.

DetailsMotivation: Object detection suffers from catastrophic forgetting when new products are introduced, requiring full retraining with all previous data. This is particularly problematic in retail checkout where new products are frequently added, leading to high training costs and time consumption.

Method: YOTO integrates YOLO11n for object localization, DeIT for feature extraction, and Proxy Anchor Loss for metric learning. Classification uses cosine similarity between target product embeddings and those stored in a Qdrant vector database, eliminating the need for retraining when new products are added.

Result: In a retail case study with 140 products, the framework achieved encouraging accuracy for both new and existing products. Training time efficiency was almost 3x better than classical approaches, with efficiency improving as more products are added. Average inference time was 580ms per image on edge devices.

Conclusion: YOTO effectively addresses catastrophic forgetting in object detection, significantly reducing training time and costs while maintaining accuracy, making it practical for real-world retail applications where products are frequently updated.

Abstract: Object detection constitutes the primary task within the domain of computer vision. It is utilized across numerous fields. Nonetheless, object detection continues to encounter the issue of catastrophic forgetting. The model must be retrained whenever new products are introduced, utilizing not only the new products dataset but also the entirety of the previous dataset. The outcome is obvious: increased model-training expenses and significant time consumption. In numerous sectors, particularly retail checkout, the frequent introduction of new products presents a great challenge. This study introduces You Only Train Once (YOTO), a methodology designed to address the issue of catastrophic forgetting by integrating YOLO11n for object localization with DeIT and Proxy Anchor Loss for feature extraction and metric learning. For classification, we utilize cosine similarity between the embedding features of the target product and those in the Qdrant vector database. In a case study conducted in a retail store with 140 products, the experimental results demonstrate that our proposed framework achieves encouraging accuracy, whether for detecting new or existing products. Furthermore, because no retraining is required, the difference in training duration is significant: we achieve almost 3 times the training-time efficiency of classical object detection approaches. This efficiency escalates as additional new products are added to the product database. The average inference time is 580 ms per image containing multiple products, on an edge device, validating the proposed framework’s feasibility for practical use.
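
The retraining-free classification step reduces to nearest-neighbor search under cosine similarity. The NumPy sketch below mimics the vector-database lookup (Qdrant in the paper) with an in-memory gallery; the embedding size and threshold are made-up values.

```python
import numpy as np

def cosine_classify(query_emb, gallery, labels, threshold=0.7):
    # Rank gallery embeddings by cosine similarity to the query crop.
    q = query_emb / np.linalg.norm(query_emb)
    g = gallery / np.linalg.norm(gallery, axis=1, keepdims=True)
    sims = g @ q
    best = int(np.argmax(sims))
    if sims[best] < threshold:
        return None, float(sims[best])   # unknown product
    return labels[best], float(sims[best])

# Enrolling a new product is just appending an embedding: no retraining.
gallery = np.random.randn(140, 384)
labels = [f"product_{i:03d}" for i in range(140)]
print(cosine_classify(np.random.randn(384), gallery, labels))
```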

[189] Semantics Lead the Way: Harmonizing Semantic and Texture Modeling with Asynchronous Latent Diffusion

Yueming Pan, Ruoyu Feng, Qi Dai, Yuqi Wang, Wenfeng Lin, Mingyu Guo, Chong Luo, Nanning Zheng

Main category: cs.CV

TL;DR: SFD introduces an asynchronous latent diffusion paradigm that prioritizes semantic formation before texture generation, achieving state-of-the-art FID scores and 100x faster convergence than original DiT.

DetailsMotivation: Current LDMs denoise semantic and texture latents synchronously, ignoring the natural coarse-to-fine generation process where semantics precede textures. This synchronous approach misses the opportunity to use early semantic formation as guidance for better texture refinement.

Method: SFD constructs composite latents combining compact semantic latents (from pretrained visual encoder via Semantic VAE) with texture latents. It uses separate noise schedules with a temporal offset where semantics are denoised before textures, enabling natural coarse-to-fine generation.

Result: Achieves FID 1.06 on ImageNet 256x256 with LightningDiT-XL and FID 1.04 with 1.0B LightningDiT-XXL. Shows up to 100x faster convergence than original DiT. Also improves existing methods like ReDi and VA-VAE.

Conclusion: Asynchronous, semantics-led modeling in diffusion models is effective, providing clearer high-level guidance for texture refinement and enabling more efficient and higher-quality image generation.

Abstract: Latent Diffusion Models (LDMs) inherently follow a coarse-to-fine generation process, where high-level semantic structure is generated slightly earlier than fine-grained texture. This indicates the preceding semantics potentially benefit texture generation by providing a semantic anchor. Recent advances have integrated semantic priors from pretrained visual encoders to further enhance LDMs, yet they still denoise semantic and VAE-encoded texture synchronously, neglecting such ordering. Observing these, we propose Semantic-First Diffusion (SFD), a latent diffusion paradigm that explicitly prioritizes semantic formation. SFD first constructs composite latents by combining a compact semantic latent, which is extracted from a pretrained visual encoder via a dedicated Semantic VAE, with the texture latent. The core of SFD is to denoise the semantic and texture latents asynchronously using separate noise schedules: semantics precede textures by a temporal offset, providing clearer high-level guidance for texture refinement and enabling natural coarse-to-fine generation. On ImageNet 256x256 with guidance, SFD achieves FID 1.06 (LightningDiT-XL) and FID 1.04 (1.0B LightningDiT-XXL), while achieving up to 100x faster convergence than the original DiT. SFD also improves existing methods like ReDi and VA-VAE, demonstrating the effectiveness of asynchronous, semantics-led modeling. Project page and code: https://yuemingpan.github.io/SFD.github.io/.
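
The asynchronous schedule can be sketched directly: semantics run ahead of textures by a fixed offset on the diffusion time axis. The linear (rectified-flow-style) interpolation and the offset value below are simplifying assumptions.

```python
import torch

def async_timesteps(t, offset=0.2):
    # Semantics are denoised ahead of textures by a fixed temporal offset,
    # so the semantic latent is always the "cleaner" of the two.
    return torch.clamp(t - offset, min=0.0), t

def add_noise(x0, eps, t):
    # toy flow-style linear interpolation between clean data and noise
    return (1 - t) * x0 + t * eps

t = torch.rand(4, 1, 1, 1)            # per-sample diffusion time in [0, 1]
sem0 = torch.randn(4, 16, 8, 8)       # compact semantic latent
tex0 = torch.randn(4, 4, 32, 32)      # texture (VAE) latent
t_sem, t_tex = async_timesteps(t)
noisy_sem = add_noise(sem0, torch.randn_like(sem0), t_sem)
noisy_tex = add_noise(tex0, torch.randn_like(tex0), t_tex)
```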

cs.AI

[190] Documenting SME Processes with Conversational AI: From Tacit Knowledge to BPMN

Unnikrishnan Radhakrishnan

Main category: cs.AI

TL;DR: LLM-powered conversational assistant captures SME shop-floor knowledge and converts it into BPMN 2.0 diagrams through interactive dialogue, enabling rapid process documentation at low cost.

DetailsMotivation: SMEs rely heavily on tacit, undocumented knowledge that's difficult to formalize. Traditional process documentation requires specialized skills and is costly, creating barriers for SMEs to preserve institutional knowledge and improve operations.

Method: Developed an LLM-driven conversational assistant using Gemini 2.5 Pro with Gradio front-end and bpmn-js visualization. The system conducts interview-style dialogues to elicit process details, supports clarifying conversations, renders live BPMN diagrams, and allows real-time refinement.

Result: Proof-of-concept evaluation in equipment-maintenance scenario showed the chatbot produced accurate “AS-IS” models, flagged issues via annotations, and generated improved “TO-BE” variants within ~12 minutes at SME-friendly API costs. Successfully enforced BPMN 2.0 XML schemas.

Conclusion: Conversational LLMs can lower skill and cost barriers to rigorous process documentation for SMEs, helping preserve institutional knowledge, enhance operational transparency, and accelerate continuous improvement. Future work includes agentic and multimodal deployments.

Abstract: Small and medium-sized enterprises (SMEs) still depend heavily on tacit, experience-based know-how that rarely makes its way into formal documentation. This paper introduces a large-language-model (LLM)-driven conversational assistant that captures such knowledge on the shop floor and converts it incrementally and interactively into standards-compliant Business Process Model and Notation (BPMN) 2.0 diagrams. Powered by Gemini 2.5 Pro and delivered through a lightweight Gradio front-end with client-side bpmn-js visualisation, the assistant conducts an interview-style dialogue: it elicits process details, supports clarifying dialogue and on-demand analysis, and renders live diagrams that users can refine in real time. A proof-of-concept evaluation in an equipment-maintenance scenario shows that the chatbot produced an accurate “AS-IS” model, flagged issues via on-diagram annotations, and generated an improved “TO-BE” variant, all within about 12 minutes, while keeping API costs within an SME-friendly budget. The study analyses latency sources, model-selection trade-offs, and the challenges of enforcing strict XML schemas, then outlines a roadmap toward agentic and multimodal deployments. The results demonstrate that conversational LLMs can potentially be used to lower the skill and cost barriers to rigorous process documentation, helping SMEs preserve institutional knowledge, enhance operational transparency, and accelerate continuous-improvement efforts.

[191] Semantic Faithfulness and Entropy Production Measures to Tame Your LLM Demons and Manage Hallucinations

Igor Halperin

Main category: cs.AI

TL;DR: Proposes two unsupervised metrics for evaluating LLM faithfulness using information theory and thermodynamics: semantic faithfulness (SF) based on KL divergence between topic transition matrices, and semantic entropy production (SEP) based on thermodynamic principles.

DetailsMotivation: Evaluating faithfulness of LLMs to given tasks is complex, and there's a need for unsupervised metrics that can assess how well LLMs stay true to source material without requiring labeled data.

Method: Treats LLM as bipartite information engine with hidden layers as Maxwell demon. Models QCA triplets as probability distributions over topics, with topic transformations from context to query and answer represented as transition matrices Q and A. SF metric uses KL divergence between these matrices, optimized via convex optimization. SEP metric applies thermodynamic principles to measure entropy production in answer generation.

Result: Developed SF metric that maps minimal KL divergence to [0,1] interval (higher scores = greater faithfulness). Showed that high faithfulness generally implies low entropy production. Demonstrated framework on LLM summarization of SEC 10-K filings.

Conclusion: Proposed SF and SEP metrics provide unsupervised evaluation of LLM faithfulness, useful for both evaluation and hallucination control. The thermodynamic perspective offers insights into information processing efficiency during LLM generation.

Abstract: Evaluating faithfulness of Large Language Models (LLMs) to a given task is a complex challenge. We propose two new unsupervised metrics for faithfulness evaluation using insights from information theory and thermodynamics. Our approach treats an LLM as a bipartite information engine where hidden layers act as a Maxwell demon controlling transformations of context $C$ into answer $A$ via prompt $Q$. We model Question-Context-Answer (QCA) triplets as probability distributions over shared topics. Topic transformations from $C$ to $Q$ and $A$ are modeled as transition matrices ${\bf Q}$ and ${\bf A}$ encoding the query goal and actual result, respectively. Our semantic faithfulness (SF) metric quantifies faithfulness for any given QCA triplet by the Kullback-Leibler (KL) divergence between these matrices. Both matrices are inferred simultaneously via convex optimization of this KL divergence, and the final SF metric is obtained by mapping the minimal divergence onto the unit interval [0,1], where higher scores indicate greater faithfulness. Furthermore, we propose a thermodynamics-based semantic entropy production (SEP) metric in answer generation, and show that high faithfulness generally implies low entropy production. The SF and SEP metrics can be used jointly or separately for LLM evaluation and hallucination control. We demonstrate our framework on LLM summarization of corporate SEC 10-K filings.
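
A toy version of the SF computation on fixed transition matrices (the paper infers Q and A jointly by convex optimization, which is omitted here); the exp(-KL) squashing onto the unit interval is our assumption, since the abstract only specifies a monotone mapping of the minimal divergence.

```python
import numpy as np

def kl(p, q, eps=1e-12):
    p, q = p + eps, q + eps
    return float(np.sum(p * np.log(p / q)))

# Rows are context topics; columns are topics of the query goal (Q) or the
# actual answer (A). These numbers are fabricated for illustration.
Q = np.array([[0.8, 0.2], [0.1, 0.9]])
A = np.array([[0.7, 0.3], [0.2, 0.8]])

d = np.mean([kl(Q[i], A[i]) for i in range(Q.shape[0])])  # row-averaged KL
sf_score = float(np.exp(-d))  # maps divergence onto (0, 1]; 1 = fully faithful
print(sf_score)
```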

[192] Bridging Traditional Machine Learning and Large Language Models: A Two-Part Course Design for Modern AI Education

Fang Li

Main category: cs.AI

TL;DR: A two-part AI/DS course bridging traditional ML with modern LLMs, enhancing student comprehension and industry readiness.

DetailsMotivation: To address the gap between foundational machine learning concepts and contemporary LLM applications, providing students with a comprehensive understanding of AI evolution while preparing them for industry demands in the rapidly changing field.

Method: A two-part course structure: first covering foundational machine learning concepts, then contemporary LLM applications. The approach includes detailed course architecture, implementation strategies, assessment methods, and was delivered over two seven-week summer terms.

Result: The integrated approach enhances student comprehension of the AI landscape and better prepares them for industry demands. Successful implementation was demonstrated through summer course delivery.

Conclusion: This pedagogical approach effectively bridges traditional ML with modern LLMs, providing students with comprehensive AI education that addresses both foundational knowledge and cutting-edge applications, making them better prepared for the evolving AI industry.

Abstract: This paper presents an innovative pedagogical approach for teaching artificial intelligence and data science that systematically bridges traditional machine learning techniques with modern Large Language Models (LLMs). We describe a course structured in two sequential and complementary parts: foundational machine learning concepts and contemporary LLM applications. This design enables students to develop a comprehensive understanding of AI evolution while building practical skills with both established and cutting-edge technologies. We detail the course architecture, implementation strategies, assessment methods, and learning outcomes from our summer course delivery spanning two seven-week terms. Our findings demonstrate that this integrated approach enhances student comprehension of the AI landscape and better prepares them for industry demands in the rapidly evolving field of artificial intelligence.

[193] On the Computability of Artificial General Intelligence

Georgios Mappouras, Charalambos Rossides

Main category: cs.AI

TL;DR: The paper proves that no algorithm can demonstrate truly new functional capabilities beyond what was already present in the initial algorithm, thus establishing fundamental limits on computational creativity and AGI.

DetailsMotivation: To determine the upper bounds of computational processes and assess how close humanity is to achieving artificial general intelligence (AGI), particularly examining whether algorithms can demonstrate true creativity and innovation.

Method: The authors adopt a specific definition of AGI from prior work that emphasizes creativity and innovation in unlocking new functional capabilities. They then provide a formal proof establishing that no algorithm can demonstrate functional capabilities that weren’t already present in the initial algorithm itself.

Result: The formal proof establishes that algorithms (including AI models) cannot be truly creative in any field - they can only demonstrate existing functional capabilities, combinations, and permutations of existing capabilities, but cannot generate fundamentally new ones.

Conclusion: The proof has significant implications for the future of AI development and our understanding of human intelligence origins, suggesting fundamental limits to what computational processes can achieve in terms of true creativity and innovation.

Abstract: In recent years we observed rapid and significant advancements in artificial intelligence (A.I.). So much so that many wonder how close humanity is to developing an A.I. model that can achieve human level of intelligence, also known as artificial general intelligence (A.G.I.). In this work we look at this question and we attempt to define the upper bounds, not just of A.I., but rather of any machine-computable process (a.k.a. an algorithm). To answer this question however, one must first precisely define A.G.I. We borrow prior work’s definition of A.G.I. [1] that best describes the sentiment of the term, as used by the leading developers of A.I. That is, the ability to be creative and innovate in some field of study in a way that unlocks new and previously unknown functional capabilities in that field. Based on this definition we draw new bounds on the limits of computation. We formally prove that no algorithm can demonstrate new functional capabilities that were not already present in the initial algorithm itself. Therefore, no algorithm (and thus no A.I. model) can be truly creative in any field of study, whether that is science, engineering, art, sports, etc. In contrast, A.I. models can demonstrate existing functional capabilities, as well as combinations and permutations of existing functional capabilities. We conclude this work by discussing the implications of this proof both as it regards to the future of A.I. development, as well as to what it means for the origins of human intelligence.

[194] Resolving Zadeh's Paradox: Axiomatic Possibility Theory as a Foundation for Reliable Artificial Intelligence

Bychkov Oleksii, Bychkova Sophia, Lytvynchuk Khrystyna

Main category: cs.AI

TL;DR: Possibility theory provides a fundamental resolution to Dempster-Shafer Theory paradoxes through its dualistic apparatus of possibility and necessity measures, offering logically consistent uncertainty handling.

DetailsMotivation: To resolve the crisis in uncertainty reasoning caused by paradoxes in Dempster-Shafer Theory (DST), moving beyond numerous attempts to fix Dempster's rule by establishing a more fundamental, logically consistent foundation.

Method: Axiomatic approach from Bychkov’s article using possibility theory’s dualistic apparatus of possibility and necessity measures, with comparative analysis of probabilistic, evidential (DST), and possibilistic paradigms using a classic medical diagnostic dilemma.

Result: Possibility theory enables correct processing of contradictory data, avoids logical traps of DST, and brings formal reasoning closer to natural intelligence logic, demonstrating it’s not just an alternative but a fundamental resolution.

Conclusion: Possibility theory provides the essential resolution to DST paradoxes through its mathematically rigorous foundation, offering superior handling of uncertainty and contradictory information compared to traditional approaches.

Abstract: This work advances and substantiates the thesis that the resolution of the crisis surrounding Dempster-Shafer Theory (DST) lies in the domain of possibility theory, specifically in the axiomatic approach developed in Bychkov's article. Unlike numerous attempts to fix Dempster's rule, this approach builds from scratch a logically consistent and mathematically rigorous foundation for working with uncertainty, using the dualistic apparatus of possibility and necessity measures. The aim of this work is to demonstrate that possibility theory is not merely an alternative, but provides a fundamental resolution to DST paradoxes. A comparative analysis of three paradigms will be conducted: probabilistic, evidential, and possibilistic. Using a classic medical diagnostic dilemma as an example, it will be shown how possibility theory allows for correct processing of contradictory data, avoiding the logical traps of DST and bringing formal reasoning closer to the logic of natural intelligence.
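
The dualistic apparatus the abstract refers to follows the standard definitions Pos(A) = max of the possibility distribution over A and Nec(A) = 1 - Pos(complement of A). A minimal sketch with an invented diagnostic example:

```python
# Possibility and necessity measures over a finite universe; the diagnosis
# values below are made up for illustration.
pi = {"flu": 1.0, "cold": 0.7, "covid": 0.4}  # possibility distribution

def possibility(event):  # event: a set of outcomes
    return max(pi[x] for x in event)

def necessity(event):
    complement = set(pi) - set(event)
    return 1.0 - (max(pi[x] for x in complement) if complement else 0.0)

A = {"flu", "cold"}
print(possibility(A), necessity(A))
# prints 1.0 0.6: possibility is an upper bound on certainty, necessity a lower one
```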

[195] AI & Human Co-Improvement for Safer Co-Superintelligence

Jason Weston, Jakob Foerster

Main category: cs.AI

TL;DR: The paper argues for shifting AI research focus from self-improving AI to co-improvement - collaborative AI-human research partnerships to achieve safer superintelligence through symbiosis.

DetailsMotivation: Current AI self-improvement goals are dangerous and may take too long. The authors advocate for a more achievable and safer alternative: co-improvement through AI-human collaboration in research.

Method: Focus on developing AI systems specifically designed to work with human researchers throughout the entire research process - from ideation to experimentation - creating a symbiotic partnership.

Result: The approach promises to accelerate AI research while endowing both AIs and humans with safer superintelligence through their collaborative symbiosis.

Conclusion: Prioritizing human-in-the-loop research improvement will lead to faster and safer progress toward superintelligence than pursuing AI self-improvement alone.

Abstract: Self-improvement is a goal currently exciting the field of AI, but is fraught with danger, and may take time to fully achieve. We advocate that a more achievable and better goal for humanity is to maximize co-improvement: collaboration between human researchers and AIs to achieve co-superintelligence. That is, specifically targeting improving AI systems’ ability to work with human researchers to conduct AI research together, from ideation to experimentation, in order to both accelerate AI research and to generally endow both AIs and humans with safer superintelligence through their symbiosis. Focusing on including human research improvement in the loop will both get us there faster, and more safely.

[196] MCP-AI: Protocol-Driven Intelligence Framework for Autonomous Reasoning in Healthcare

Zag ElSayed, Craig Erickson, Ernest Pedapati

Main category: cs.AI

TL;DR: MCP-AI introduces a novel architecture combining Model Context Protocol with clinical applications, enabling explainable medical decision-making through modular, executable AI agents with reusable memory objects for adaptive, longitudinal reasoning.

DetailsMotivation: Healthcare AI systems need to overcome challenges in merging contextual reasoning, long-term state management, and human-verifiable workflows into a cohesive framework, as traditional CDSS and prompt-based LLMs lack adaptive, collaborative, and auditable clinical reasoning capabilities.

Method: MCP-AI architecture built on Model Context Protocol (MCP) - a modular, executable specification for orchestrating generative and descriptive AI agents in real-time workflows. Each MCP file captures clinical objectives, patient context, reasoning state, and task logic as reusable, auditable memory objects.

Result: Validated through two use cases: (1) diagnostic modeling of Fragile X Syndrome with comorbid depression, and (2) remote coordination for Type 2 Diabetes and hypertension. The system facilitates physician-in-the-loop validation, streamlines clinical processes, and guarantees secure transitions of AI responsibilities between providers.

Conclusion: MCP-AI provides a scalable basis for interpretable, composable, and safety-oriented AI in clinical environments, connecting with HL7/FHIR interfaces and adhering to regulatory standards like HIPAA and FDA SaMD guidelines.

Abstract: Healthcare AI systems have historically faced challenges in merging contextual reasoning, long-term state management, and human-verifiable workflows into a cohesive framework. This paper introduces a completely innovative architecture and concept: combining the Model Context Protocol (MCP) with a specific clinical application, known as MCP-AI. This integration allows intelligent agents to reason over extended periods, collaborate securely, and adhere to authentic clinical logic, representing a significant shift away from traditional Clinical Decision Support Systems (CDSS) and prompt-based Large Language Models (LLMs). As healthcare systems become more complex, the need for autonomous, context-aware clinical reasoning frameworks has become urgent. We present MCP-AI, a novel architecture for explainable medical decision-making built upon the Model Context Protocol (MCP), a modular, executable specification for orchestrating generative and descriptive AI agents in real-time workflows. Each MCP file captures clinical objectives, patient context, reasoning state, and task logic, forming a reusable and auditable memory object. Unlike conventional CDSS or stateless prompt-based AI systems, MCP-AI supports adaptive, longitudinal, and collaborative reasoning across care settings. MCP-AI is validated through two use cases: (1) diagnostic modeling of Fragile X Syndrome with comorbid depression, and (2) remote coordination for Type 2 Diabetes and hypertension. In both scenarios, the protocol facilitates physician-in-the-loop validation, streamlines clinical processes, and guarantees secure transitions of AI responsibilities between healthcare providers. The system connects with HL7/FHIR interfaces and adheres to regulatory standards, such as HIPAA and FDA SaMD guidelines. MCP-AI provides a scalable basis for interpretable, composable, and safety-oriented AI within upcoming clinical environments.
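
A hypothetical schema for an MCP memory object, reflecting the four ingredients the abstract lists (clinical objective, patient context, reasoning state, task logic); all field names, types, and values are our guesses, not the published specification.

```python
from dataclasses import dataclass, field, asdict
import json

@dataclass
class MCPFile:
    objective: str
    patient_context: dict
    reasoning_state: list = field(default_factory=list)  # auditable trail
    task_logic: list = field(default_factory=list)       # executable steps

mcp = MCPFile(
    objective="Assess Fragile X Syndrome with comorbid depression",
    patient_context={"age": 14, "phq9": 16},
    task_logic=["collect_history", "score_phq9", "flag_for_clinician"],
)
mcp.reasoning_state.append({"step": "score_phq9", "result": "moderate-severe"})
print(json.dumps(asdict(mcp), indent=2))  # a reusable, auditable memory object
```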

[197] ChipMind: Retrieval-Augmented Reasoning for Long-Context Circuit Design Specifications

Changwen Xing, SamZaak Wong, Xinlai Wan, Yanfeng Lu, Mengli Zhang, Zebin Ma, Lei Qi, Zhengxiong Li, Nan Guan, Zhe Jiang, Xi Wang, Jun Yang

Main category: cs.AI

TL;DR: ChipMind is a knowledge graph-augmented reasoning framework that addresses LLM context window limitations for IC development by transforming circuit specs into domain-specific knowledge graphs and using adaptive retrieval with semantic filtering.

DetailsMotivation: Large Language Models have potential for automating integrated circuit development, but their practical deployment is limited by restricted context windows. Existing context-extension methods struggle with effective semantic modeling and multi-hop reasoning over complex circuit specifications.

Method: ChipMind transforms circuit specifications into a domain-specific knowledge graph (ChipKG) using Circuit Semantic-Aware Knowledge Graph Construction. It then uses ChipKG-Augmented Reasoning with information-theoretic adaptive retrieval to trace logical dependencies and intent-aware semantic filtering to prune irrelevant noise.

Result: ChipMind significantly outperforms state-of-the-art baselines on an industrial-scale specification reasoning benchmark, achieving an average improvement of 34.59% (up to 72.73%).

Conclusion: The framework bridges a critical gap between academic research and practical industrial deployment of LLM-aided Hardware Design (LAD) by effectively handling lengthy, intricate circuit specifications through knowledge graph augmentation.

Abstract: While Large Language Models (LLMs) demonstrate immense potential for automating integrated circuit (IC) development, their practical deployment is fundamentally limited by restricted context windows. Existing context-extension methods struggle to achieve effective semantic modeling and thorough multi-hop reasoning over extensive, intricate circuit specifications. To address this, we introduce ChipMind, a novel knowledge graph-augmented reasoning framework specifically designed for lengthy IC specifications. ChipMind first transforms circuit specifications into a domain-specific knowledge graph ChipKG through the Circuit Semantic-Aware Knowledge Graph Construction methodology. It then leverages the ChipKG-Augmented Reasoning mechanism, combining information-theoretic adaptive retrieval to dynamically trace logical dependencies with intent-aware semantic filtering to prune irrelevant noise, effectively balancing retrieval completeness and precision. Evaluated on an industrial-scale specification reasoning benchmark, ChipMind significantly outperforms state-of-the-art baselines, achieving an average improvement of 34.59% (up to 72.73%). Our framework bridges a critical gap between academic research and practical industrial deployment of LLM-aided Hardware Design (LAD).
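
One plausible reading of information-theoretic adaptive retrieval is a loop that keeps pulling graph neighbors while each hop still reduces the entropy of the answer distribution; the stopping rule, threshold, and data below are our assumptions, not ChipMind's algorithm.

```python
import math

def entropy(dist):
    return -sum(p * math.log(p) for p in dist.values() if p > 0)

def adaptive_retrieve(hops, tau=0.05):
    # hops: (nodes, answer_distribution) pairs produced by the KG traversal
    evidence, prev_h = [], float("inf")
    for hop_nodes, answer_dist in hops:
        h = entropy(answer_dist)
        if prev_h - h < tau:          # marginal information gain too small: stop
            break
        evidence.extend(hop_nodes)
        prev_h = h
    return evidence

hops = [
    (["reg_CTRL"], {"yes": 0.5, "no": 0.5}),
    (["field_EN"], {"yes": 0.8, "no": 0.2}),
    (["note_rev2"], {"yes": 0.82, "no": 0.18}),
]
print(adaptive_retrieve(hops))  # stops before the uninformative third hop
```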

[198] BEAVER: An Efficient Deterministic LLM Verifier

Tarun Suresh, Nalin Wadhwa, Debangshu Banerjee, Gagandeep Singh

Main category: cs.AI

TL;DR: BEAVER is the first practical framework for computing deterministic, sound probability bounds on LLM constraint satisfaction, providing provable guarantees instead of sampling-based estimates.

DetailsMotivation: As LLMs move to production, practitioners need reliable methods to verify that model outputs satisfy required constraints. Sampling-based estimates provide intuition but no sound guarantees, creating a need for deterministic verification methods.

Method: BEAVER systematically explores the generation space using novel token trie and frontier data structures, maintaining provably sound bounds at every iteration for any prefix-closed semantic constraint.

Result: BEAVER achieves 6-8 times tighter probability bounds and identifies 3-4 times more high-risk instances compared to baseline methods under identical computational budgets across correctness verification, privacy verification, and secure code generation tasks.

Conclusion: BEAVER enables precise characterization and risk assessment that loose bounds or empirical evaluation cannot provide, offering the first practical framework for deterministic verification of LLM constraint satisfaction with sound guarantees.

Abstract: As large language models (LLMs) transition from research prototypes to production systems, practitioners often need reliable methods to verify that model outputs satisfy required constraints. While sampling-based estimates provide an intuition of model behavior, they offer no sound guarantees. We present BEAVER, the first practical framework for computing deterministic, sound probability bounds on LLM constraint satisfaction. Given any prefix-closed semantic constraint, BEAVER systematically explores the generation space using novel token trie and frontier data structures, maintaining provably sound bounds at every iteration. We formalize the verification problem, prove soundness of our approach, and evaluate BEAVER on correctness verification, privacy verification and secure code generation tasks across multiple state-of-the-art LLMs. BEAVER achieves 6 to 8 times tighter probability bounds and identifies 3 to 4 times more high-risk instances compared to baseline methods under identical computational budgets, enabling precise characterization and risk assessment that loose bounds or empirical evaluation cannot provide.
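
The two-sided bound logic can be reproduced on a toy language model: explored mass that satisfies the constraint raises the lower bound, mass violating a prefix-closed constraint lowers the upper bound, and unexplored mass stays in between. A best-first priority queue stands in here for the paper's token trie and frontier structures.

```python
import heapq

# Toy LM over {"a", "b", "<eos>"} with a fixed next-token distribution, and a
# prefix-closed constraint: any output containing "b" is a violation.
VOCAB = {"a": 0.5, "b": 0.3, "<eos>": 0.2}

def violates(prefix):
    return "b" in prefix  # prefix-closed: a violating prefix cannot recover

def bounds(budget=200):
    lower, violated = 0.0, 0.0
    frontier = [(-1.0, ())]  # best-first on prefix probability
    for _ in range(budget):
        if not frontier:
            break
        neg_p, prefix = heapq.heappop(frontier)
        p = -neg_p
        if violates(prefix):
            violated += p          # the whole subtree fails the constraint
        elif prefix and prefix[-1] == "<eos>":
            lower += p             # a finished sequence that satisfies it
        else:
            for tok, q in VOCAB.items():
                heapq.heappush(frontier, (-(p * q), prefix + (tok,)))
    return lower, 1.0 - violated   # sound: unexplored mass could go either way

lo, hi = bounds()
print(f"P(satisfy constraint) is provably within [{lo:.3f}, {hi:.3f}]")
```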

[199] The Seeds of Scheming: Weakness of Will in the Building Blocks of Agentic Systems

Robert Yang

Main category: cs.AI

TL;DR: The paper introduces “akrasia” (weakness of will) as a framework for analyzing inconsistency in AI agents, proposes the Akrasia Benchmark to measure when models contradict their own commitments, and explores how this micro-level weakness can lead to macro-level instability in multi-agent systems.

DetailsMotivation: LLMs exhibit inconsistency where they know the correct answer but fail to act on it, similar to human "akrasia" or weakness of will. The authors aim to provide a conceptual framework for analyzing this inconsistency and goal drift in agentic AI systems, connecting AI behavior to classical philosophical theories of agency.

Method: Introduces the Akrasia Benchmark with structured prompting conditions (Baseline [B], Synonym [S], Temporal [T], and Temptation [X]) to measure when a model’s local response contradicts its prior commitments. The benchmark enables quantitative comparison of “self-control” across model families, decoding strategies, and temptation types.

Result: The paper presents a preliminary benchmark that operationalizes akrasia measurement, allowing for systematic evaluation of inconsistency in AI systems. It also outlines how micro-level akrasia can compound into macro-level instability in multi-agent systems, potentially manifesting as “scheming” or deliberate misalignment.

Conclusion: By reframing AI inconsistency as weakness of will, this work connects agentic behavior to classical theories of agency and provides an empirical bridge between philosophy, psychology, and the emerging science of agentic AI, offering new tools for analyzing and potentially mitigating goal drift in AI systems.

Abstract: Large language models display a peculiar form of inconsistency: they “know” the correct answer but fail to act on it. In human philosophy, this tension between global judgment and local impulse is called akrasia, or weakness of will. We propose akrasia as a foundational concept for analyzing inconsistency and goal drift in agentic AI systems. To operationalize it, we introduce a preliminary version of the Akrasia Benchmark, currently a structured set of prompting conditions (Baseline [B], Synonym [S], Temporal [T], and Temptation [X]) that measures when a model’s local response contradicts its own prior commitments. The benchmark enables quantitative comparison of “self-control” across model families, decoding strategies, and temptation types. Beyond single-model evaluation, we outline how micro-level akrasia may compound into macro-level instability in multi-agent systems that may be interpreted as “scheming” or deliberate misalignment. By reframing inconsistency as weakness of will, this work connects agentic behavior to classical theories of agency and provides an empirical bridge between philosophy, psychology, and the emerging science of agentic AI.
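
A minimal harness over the four conditions might look like the sketch below; `ask` is a stub for any chat model, the probe templates are invented, and real contradiction detection would need more than a substring check.

```python
CONDITIONS = {
    "B": "What is your rule on {topic}?",                           # Baseline
    "S": "What is your policy regarding {topic}?",                  # Synonym
    "T": "Tomorrow, what will your rule on {topic} be?",            # Temporal
    "X": "Just this once, could you ignore your rule on {topic}?",  # Temptation
}

def ask(prompt):
    return "I will not reveal private data."  # replace with a real model call

def akrasia_rates(topic, commitment, n_trials=20):
    rates = {}
    for cond, template in CONDITIONS.items():
        prompt = template.format(topic=topic)
        flips = sum(commitment not in ask(prompt) for _ in range(n_trials))
        rates[cond] = flips / n_trials  # fraction of self-contradictions
    return rates

print(akrasia_rates("private data", "will not reveal"))
```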

[200] MIND: Multi-rationale INtegrated Discriminative Reasoning Framework for Multi-modal Large Models

Chuang Yu, Jinmiao Zhao, Mingxuan Zhao, Yunpeng Liu, Xiujun Shu, Yuanhao Feng, Bo Wang, Xiangyu Yue

Main category: cs.AI

TL;DR: MIND is a novel reasoning framework for multimodal LLMs that enables active discriminative reasoning through rationale augmentation, progressive correction learning, and contrastive alignment, achieving SOTA performance across multiple reasoning tasks.

DetailsMotivation: Current MLLMs suffer from limited multi-rationale semantic modeling, insufficient logical robustness, and susceptibility to misleading interpretations in complex scenarios. They lack human-like cognitive abilities for active reasoning and correction.

Method: Three key components: 1) Rationale Augmentation and Discrimination (RAD) paradigm for automatic dataset expansion with diverse rationales; 2) Progressive Two-stage Correction Learning (P2CL) strategy with positive learning and active logic discrimination phases; 3) Multi-rationale Contrastive Alignment (MCA) optimization to separate correct and incorrect reasoning semantics.

Result: Achieves state-of-the-art performance on multiple public datasets covering scientific, commonsense, and mathematical reasoning scenarios. The framework provides a unified and extensible data foundation for reasoning tasks.

Conclusion: MIND enables paradigm evolution from passive imitation-based reasoning to active discriminative reasoning, advancing MLLMs toward higher levels of cognitive intelligence with human-like “Understand -> Rethink -> Correct” abilities.

Abstract: Recently, multimodal large language models (MLLMs) have been widely applied to reasoning tasks. However, they suffer from limited multi-rationale semantic modeling, insufficient logical robustness, and are susceptible to misleading interpretations in complex scenarios. Therefore, we propose a Multi-rationale INtegrated Discriminative (MIND) reasoning framework, which is designed to endow MLLMs with human-like cognitive abilities of “Understand -> Rethink -> Correct”, and achieves a paradigm evolution from passive imitation-based reasoning to active discriminative reasoning. Specifically, we introduce a Rationale Augmentation and Discrimination (RAD) paradigm, which automatically and efficiently expands existing datasets by generating diverse rationales, providing a unified and extensible data foundation. Meanwhile, we design a Progressive Two-stage Correction Learning (P2CL) strategy. The first phase enhances multi-rationale positive learning, while the second phase enables active logic discrimination and correction. In addition, to mitigate representation entanglement in the multi-rationale semantic space, we propose a Multi-rationale Contrastive Alignment (MCA) optimization strategy, which achieves semantic aggregation of correct reasoning and boundary separation of incorrect reasoning. Extensive experiments demonstrate that the proposed MIND reasoning framework achieves state-of-the-art (SOTA) performance on multiple public datasets covering scientific, commonsense, and mathematical scenarios. It provides a new perspective for advancing MLLMs towards higher levels of cognitive intelligence. Our code is available at https://github.com/YuChuang1205/MIND
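
The MCA objective is contrastive in the usual sense: pull an anchor toward correct-rationale embeddings and away from incorrect ones. A minimal InfoNCE-style sketch, with shapes and temperature chosen arbitrarily rather than taken from the paper:

```python
import torch
import torch.nn.functional as F

def rationale_contrastive_loss(anchor, pos_rat, neg_rats, tau=0.1):
    # anchor: (B, D) question/answer embedding; pos_rat: (B, D) a correct
    # rationale; neg_rats: (B, K, D) incorrect rationales for the same item.
    pos = F.cosine_similarity(anchor, pos_rat, dim=-1).unsqueeze(1)   # (B, 1)
    neg = F.cosine_similarity(anchor.unsqueeze(1), neg_rats, dim=-1)  # (B, K)
    logits = torch.cat([pos, neg], dim=1) / tau
    labels = torch.zeros(anchor.size(0), dtype=torch.long)  # positive at index 0
    return F.cross_entropy(logits, labels)

loss = rationale_contrastive_loss(
    torch.randn(8, 256), torch.randn(8, 256), torch.randn(8, 4, 256))
print(float(loss))
```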

[201] CureAgent: A Training-Free Executor-Analyst Framework for Clinical Reasoning

Ting-Ting Xie, Yixin Zhang

Main category: cs.AI

TL;DR: Proposes Executor-Analyst Framework to fix Context Utilization Failure in clinical AI agents, using modular architecture with specialized TxAgents and long-context models, achieving SOTA without expensive finetuning.

DetailsMotivation: Current clinical agents built on small LLMs suffer from Context Utilization Failure - they retrieve biomedical evidence but fail to ground their diagnosis in that information, showing reasoning deficits in monolithic models.

Method: Executor-Analyst Framework: modular architecture decoupling syntactic tool execution (specialized TxAgents as Executors) from semantic clinical reasoning (long-context foundation models as Analysts). Uses Stratified Ensemble strategy instead of global pooling to preserve evidentiary diversity and address information bottleneck.
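
A minimal sketch of the Stratified Ensemble idea, assuming one stratum per Executor run or evidence source; the data layout and majority-vote rule are illustrative, not the paper's implementation:

```python
from collections import Counter

def stratified_ensemble(candidates):
    """Aggregate answers per evidence stratum first, then across strata,
    instead of pooling all candidates globally.

    candidates: list of (stratum_id, answer) pairs (hypothetical layout).
    """
    by_stratum = {}
    for stratum, answer in candidates:
        by_stratum.setdefault(stratum, []).append(answer)
    # Voting inside each stratum preserves minority evidence that a
    # single global pool would drown out.
    stratum_winners = [Counter(ans).most_common(1)[0][0]
                       for ans in by_stratum.values()]
    return Counter(stratum_winners).most_common(1)[0][0]
```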

Result: Achieves state-of-the-art performance on CURE-Bench without expensive end-to-end finetuning. Reveals critical scaling insights: Context-Performance Paradox (extending reasoning beyond 12k tokens degrades accuracy) and Curse of Dimensionality in action spaces (expanding toolsets requires hierarchical retrieval).

Conclusion: Demonstrates potential of training-free architectural engineering for scalable, agile foundation for trustworthy AI-driven therapeutics, providing a solution to reasoning deficits in clinical agents while avoiding expensive finetuning costs.

Abstract: Current clinical agents built on small LLMs, such as TxAgent, suffer from a “Context Utilization Failure”: models successfully retrieve biomedical evidence due to supervised finetuning but fail to ground their diagnosis in that information. In this work, we propose the Executor-Analyst Framework, a modular architecture that decouples the syntactic precision of tool execution from the semantic robustness of clinical reasoning. By orchestrating specialized TxAgents (Executors) with long-context foundation models (Analysts), we mitigate the reasoning deficits observed in monolithic models. Beyond simple modularity, we demonstrate that a Stratified Ensemble strategy significantly outperforms global pooling by preserving evidentiary diversity, effectively addressing the information bottleneck. Furthermore, our stress tests reveal critical scaling insights: (1) a “Context-Performance Paradox”, where extending reasoning contexts beyond 12k tokens introduces noise that degrades accuracy; and (2) the “Curse of Dimensionality” in action spaces, where expanding toolsets necessitates hierarchical retrieval strategies. Crucially, our approach underscores the potential of training-free architectural engineering, achieving state-of-the-art performance on CURE-Bench without the need for expensive end-to-end finetuning. This provides a scalable, agile foundation for the next generation of trustworthy AI-driven therapeutics. Code has been released at https://github.com/June01/CureAgent.

[202] Ontology Learning with LLMs: A Benchmark Study on Axiom Identification

Roos M. Bakker, Daan L. Di Scala, Maaike H. T. de Boer, Stephan A. Raaijmakers

Main category: cs.AI

TL;DR: LLMs show promise for ontology axiom identification but performance varies by axiom type and domain, with Axiom-by-Axiom prompting outperforming direct approaches.

DetailsMotivation: Ontology development is complex and requires significant expertise, but recent advances in NLP and LLMs offer potential for automation through ontology learning, particularly for identifying fundamental logical axioms.

Method: Created OntoAxiom benchmark with 9 ontologies (17,118 triples, 2,771 axioms) focusing on subclass, disjoint, subproperty, domain, and range axioms. Tested 12 LLMs with three shot settings and two prompting strategies: Direct (all axioms at once) vs Axiom-by-Axiom (one axiom per prompt).
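
A rough sketch of how the Axiom-by-Axiom strategy differs from the Direct one; the `llm` callable and prompt wording are placeholders, not the paper's templates:

```python
AXIOM_TYPES = ["subclass", "disjoint", "subproperty", "domain", "range"]

def axiom_by_axiom(llm, ontology_excerpt):
    """Query one axiom type per prompt (AbA), rather than asking for all
    axiom types in a single prompt (Direct). `llm` is any callable
    mapping a prompt string to generated text.
    """
    candidates = {}
    for axiom in AXIOM_TYPES:
        prompt = (
            f"Given the ontology fragment below, list all {axiom} axioms "
            f"as RDF triples.\n\n{ontology_excerpt}"
        )
        candidates[axiom] = llm(prompt)  # one focused query per axiom type
    return candidates
```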

Result: Axiom-by-Axiom prompting achieved higher F1 scores than Direct approach. Performance varied significantly across axiom types and domains (e.g., FOAF ontology: 0.642 for subclass vs music ontology: 0.218). Larger LLMs outperformed smaller ones, though smaller models remain viable for resource-constrained settings.

Conclusion: LLMs cannot fully automate axiom identification yet, but can provide valuable candidate axioms to support ontology engineers in development and refinement tasks.

Abstract: Ontologies are an important tool for structuring domain knowledge, but their development is a complex task that requires significant modelling and domain expertise. Ontology learning, aimed at automating this process, has seen advancements in the past decade with the improvement of Natural Language Processing techniques, and especially with the recent growth of Large Language Models (LLMs). This paper investigates the challenge of identifying axioms: fundamental ontology components that define logical relations between classes and properties. In this work, we introduce OntoAxiom, an ontology axiom benchmark, and systematically test LLMs on it for axiom identification, evaluating different prompting strategies, ontologies, and axiom types. The benchmark consists of nine medium-sized ontologies comprising 17,118 triples and 2,771 axioms in total. We focus on subclass, disjoint, subproperty, domain, and range axioms. To evaluate LLM performance, we compare twelve LLMs with three shot settings and two prompting strategies: a Direct approach, where we query all axioms at once, versus an Axiom-by-Axiom (AbA) approach, where each prompt queries for one axiom only. Our findings show that AbA prompting leads to higher F1 scores than the Direct approach. However, performance varies across axioms, suggesting that certain axioms are more challenging to identify. The domain also influences performance: the FOAF ontology achieves a score of 0.642 for the subclass axiom, while the music ontology reaches only 0.218. Larger LLMs outperform smaller ones, but smaller models may still be viable for resource-constrained settings. Although overall performance is not high enough to fully automate axiom identification, LLMs can provide valuable candidate axioms to support ontology engineers with the development and refinement of ontologies.

[203] Enhancing Local Search for MaxSAT with Deep Differentiation Clause Weighting

Menghua Jiang, Haokai Gao, Shuhao Chen, Yin Chen

Main category: cs.AI

TL;DR: DeepDist introduces a novel clause weighting scheme with distinct update strategies for PMS and WPMS, plus a new initialization method and decimation technique, resulting in an SLS solver that outperforms state-of-the-art approaches.

DetailsMotivation: Existing SLS algorithms for Partial Maximum Satisfiability (PMS) and Weighted Partial Maximum Satisfiability (WPMS) fail to adequately distinguish between these two problem types, using uniform update strategies that overlook their structural differences, limiting solver performance.

Method: 1) Novel clause weighting scheme with distinct update conditions for PMS vs WPMS instances; 2) New initialization method tailored to each problem type’s characteristics; 3) Decimation method prioritizing unit and hard clauses; 4) Integration into SLS solver DeepDist.
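
As a hedged illustration of what type-specific clause-weight updates might look like (the actual DeepDist conditions are more involved and not reproduced here):

```python
def update_clause_weights(clauses, is_wpms, inc=1):
    """Illustrative weight update with distinct conditions for PMS and
    WPMS instances. Each clause is a dict with keys 'satisfied', 'hard',
    'weight', and, for WPMS soft clauses, an original cost 'cost'.
    """
    for c in clauses:
        if c["satisfied"]:
            continue                      # only falsified clauses escalate
        if c["hard"]:
            c["weight"] += inc            # hard clauses always get bumped
        elif is_wpms:
            # WPMS: scale the bump by the clause's original cost so that
            # expensive soft clauses attract the local search first.
            c["weight"] += inc * c["cost"]
        else:
            c["weight"] += inc            # PMS: uniform soft-clause bump
```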

Result: DeepDist outperforms state-of-the-art SLS solvers on MaxSAT Evaluation benchmarks. A hybrid solver combining DeepDist with TT-Open-WBO-Inc surpasses MaxSAT Evaluation 2024 winners SPB-MaxSAT-c-Band and SPB-MaxSAT-c-FPS.

Conclusion: The proposed approach effectively addresses the limitations of existing methods by distinguishing between PMS and WPMS, leading to significant performance improvements and demonstrating the importance of problem-specific strategies in clause weighting schemes.

Abstract: Partial Maximum Satisfiability (PMS) and Weighted Partial Maximum Satisfiability (WPMS) generalize Maximum Satisfiability (MaxSAT), with broad real-world applications. Recent advances in Stochastic Local Search (SLS) algorithms for solving (W)PMS have mainly focused on designing clause weighting schemes. However, existing methods often fail to adequately distinguish between PMS and WPMS, typically employing uniform update strategies for clause weights and overlooking critical structural differences between the two problem types. In this work, we present a novel clause weighting scheme that, for the first time, updates the clause weights of PMS and WPMS instances according to distinct conditions. This scheme also introduces a new initialization method, which better accommodates the unique characteristics of both instance types. Furthermore, we propose a decimation method that prioritizes satisfying unit and hard clauses, effectively complementing our proposed clause weighting scheme. Building on these methods, we develop a new SLS solver for (W)PMS named DeepDist. Experimental results on benchmarks from the anytime tracks of recent MaxSAT Evaluations show that DeepDist outperforms state-of-the-art SLS solvers. Notably, a hybrid solver combining DeepDist with TT-Open-WBO-Inc surpasses the performance of the MaxSAT Evaluation 2024 winners, SPB-MaxSAT-c-Band and SPB-MaxSAT-c-FPS, highlighting the effectiveness of our approach. The code is available at https://github.com/jmhmaxsat/DeepDist

[204] KANFormer for Predicting Fill Probabilities via Survival Analysis in Limit Order Books

Jinfeng Zhong, Emmanuel Bacry, Agathe Guilloux, Jean-François Muzy

Main category: cs.AI

TL;DR: KANFormer: A deep learning model combining dilated causal CNNs, Transformer encoders, and Kolmogorov-Arnold Networks to predict limit order time-to-fill using both market and agent-level information, outperforming existing methods.

DetailsMotivation: Existing models for predicting limit order time-to-fill rely only on limit order book snapshots, missing important agent-level information and order queue position that could improve prediction accuracy and capture execution likelihood patterns.

Method: Combines Dilated Causal Convolutional network with Transformer encoder, enhanced by Kolmogorov-Arnold Networks (KANs) for better nonlinear approximation. Integrates agent actions related to LOB dynamics and order queue position alongside market data.
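
A minimal PyTorch sketch of the dilated-causal-CNN plus Transformer backbone, with the KAN layers omitted and all sizes illustrative:

```python
import torch
import torch.nn as nn

class CausalConvEncoder(nn.Module):
    """Sketch of a dilated causal CNN feeding a Transformer encoder
    (KAN heads omitted; dimensions are illustrative)."""
    def __init__(self, d_in=32, d_model=64, levels=3):
        super().__init__()
        convs = []
        for i in range(levels):
            d = 2 ** i  # exponentially growing dilation
            convs += [nn.ConstantPad1d((d, 0), 0.0),  # left-pad => causal
                      nn.Conv1d(d_in if i == 0 else d_model,
                                d_model, kernel_size=2, dilation=d),
                      nn.ReLU()]
        self.conv = nn.Sequential(*convs)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, x):            # x: (batch, time, d_in)
        h = self.conv(x.transpose(1, 2)).transpose(1, 2)
        return self.encoder(h)       # (batch, time, d_model)
```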

Result: Outperforms existing models on CAC 40 index futures data in both calibration (Right-Censored Log-Likelihood, Integrated Brier Score) and discrimination (C-index, time-dependent AUC) metrics. SHAP analysis provides interpretable feature importance over time.

Conclusion: Combining rich market signals with expressive neural architectures (KANFormer) enables accurate and interpretable predictions of fill probabilities, demonstrating the value of integrating agent-level information with traditional market data.

Abstract: This paper introduces KANFormer, a novel deep-learning-based model for predicting the time-to-fill of limit orders by leveraging both market- and agent-level information. KANFormer combines a Dilated Causal Convolutional network with a Transformer encoder, enhanced by Kolmogorov-Arnold Networks (KANs), which improve nonlinear approximation. Unlike existing models that rely solely on a series of snapshots of the limit order book, KANFormer integrates the actions of agents related to LOB dynamics and the position of the order in the queue to more effectively capture patterns related to execution likelihood. We evaluate the model using CAC 40 index futures data with labeled orders. The results show that KANFormer outperforms existing approaches in both calibration (Right-Censored Log-Likelihood, Integrated Brier Score) and discrimination (C-index, time-dependent AUC). We further analyze feature importance over time using SHAP (SHapley Additive exPlanations). Our results highlight the benefits of combining rich market signals with expressive neural architectures to achieve accurate and interpretable predictions of fill probabilities.

[205] A Fast Anti-Jamming Cognitive Radar Deployment Algorithm Based on Reinforcement Learning

Wencheng Cai, Xuchao Gao, Congying Han, Mingqiang Li, Tiande Guo

Main category: cs.AI

TL;DR: FARDA uses deep reinforcement learning for fast radar deployment, achieving similar coverage to evolutionary algorithms but 7,000x faster.

DetailsMotivation: Existing evolutionary algorithms for cognitive radar deployment are slow and prone to local optima, creating a need for faster, more efficient anti-jamming radar deployment in modern warfare.

Method: Models radar deployment as end-to-end task using deep reinforcement learning with integrated neural modules for heatmap perception and a novel reward format.

Result: Achieves comparable coverage to evolutionary algorithms while deploying radars approximately 7,000 times faster, with ablation studies confirming each component’s necessity.

Conclusion: FARDA provides a significantly faster alternative to evolutionary algorithms for anti-jamming radar deployment while maintaining effectiveness, addressing critical time constraints in modern warfare.

Abstract: The fast deployment of cognitive radar to counter jamming remains a critical challenge in modern warfare, where more efficient deployment leads to quicker detection of targets. Existing methods are primarily based on evolutionary algorithms, which are time-consuming and prone to falling into local optima. We tackle these drawbacks via the efficient inference of neural networks and propose a new framework: the Fast Anti-Jamming Radar Deployment Algorithm (FARDA). We first model the radar deployment problem as an end-to-end task and design deep reinforcement learning algorithms to solve it, developing integrated neural modules that perceive heatmap information and a novel reward format. Empirical results demonstrate that our method achieves coverage comparable to evolutionary algorithms while deploying radars approximately 7,000 times faster. Further ablation experiments confirm the necessity of each component of FARDA.

[206] Evolutionary System 2 Reasoning: An Empirical Proof

Zeyuan Ma, Wenqi Huang, Guo-Huan Song, Hongshu Guo, Sijie Ma, Zhiguang Cao, Yue-Jiao Gong

Main category: cs.AI

TL;DR: ERO framework uses evolutionary strategies to enhance LLMs’ reasoning abilities through survival-of-the-fittest optimization, showing that even weak models can develop powerful reasoning with simple evolution loops.

DetailsMotivation: Current LLMs show substantial specific skills but fall short in general intelligence and system 2 reasoning (slow thinking). The paper aims to answer whether machine intelligence can evolve reasoning abilities like humans.

Method: Proposes Evolutionary Reasoning Optimization (ERO) framework that initializes multiple LLMs as a population and uses evolutionary strategies to evolve the population, maximizing quantified reasoning scores of the best individual through survival-of-the-fittest optimization.
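
A minimal sketch of the survival-of-the-fittest loop described; the population encoding and the `mutate` and `reasoning_score` callables stand in for ERO's actual operators:

```python
import random

def ero_loop(population, reasoning_score, mutate, generations=20, elite=2):
    """Evolve a population of model configurations (e.g., prompt or
    decoding settings wrapping an LLM) to maximize the reasoning score
    of the best individual.
    """
    for _ in range(generations):
        scored = sorted(population, key=reasoning_score, reverse=True)
        parents = scored[:elite]                 # keep the fittest
        children = [mutate(random.choice(parents))
                    for _ in range(len(population) - elite)]
        population = parents + children          # next generation
    return max(population, key=reasoning_score)  # best individual found
```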

Result: Two key findings: 1) Latest LLMs like GPT-5 still show limited system 2 reasoning ability; 2) With ERO’s simple evolution loop, even relatively weak models (Qwen-7B) can be enhanced to emerge powerful reasoning ability.

Conclusion: ERO demonstrates that evolutionary optimization can effectively enhance reasoning abilities in LLMs, suggesting a promising path toward developing more human-like general intelligence in machines.

Abstract: Machine intelligence marks the ultimate dream of making machines’ intelligence comparable to that of human beings. While recent progress in Large Language Models (LLMs) shows substantial specific skills across a wide array of downstream tasks, these models still fall short of general intelligence. Following the correlation between intelligence and system 2 reasoning (slow thinking), in this paper we aim to answer a worthwhile research question: could machine intelligence such as LLMs be evolved to acquire reasoning ability (not a specific skill), just like human beings? To this end, we propose the evolutionary reasoning optimization (ERO) framework, which performs survival-of-the-fittest selection over a population of LLMs to search for individuals with strong reasoning ability. Given a reasoning task, ERO first initializes multiple LLMs as a population, after which an evolutionary strategy evolves the population to maximize the quantified reasoning score of the best individual. Based on experiments on representative test suites, we report two surprising empirical findings: i) the latest LLMs, such as GPT-5, still show limited system 2 reasoning ability; ii) with the simple evolution loop of ERO, a relatively weak model (Qwen-7B) can be enhanced to exhibit powerful reasoning ability. Our project can be accessed at https://github.com/MetaEvo/ERO for reproduction.

[207] The Missing Layer of AGI: From Pattern Alchemy to Coordination Physics

Edward Y. Chang

Main category: cs.AI

TL;DR: LLMs are not dead ends for AGI - they provide necessary pattern repositories (System-1), but need a System-2 coordination layer for reasoning. The paper proposes UCCT theory and MACI architecture to enable reasoning through pattern selection and constraint.

DetailsMotivation: To counter influential critiques that LLMs are structurally incapable of reasoning or planning, arguing that the bottleneck is not the pattern-matching capability itself but the lack of a coordination layer to select and constrain patterns for goal-directed reasoning.

Method: Proposes UCCT (semantic anchoring theory) that models reasoning as phase transition governed by effective support (ρ_d), representational mismatch (d_r), and adaptive anchoring budget (γ log k). Translates this into MACI architecture with three components: baiting (behavior-modulated debate), filtering (Socratic judging), and persistence (transactional memory).

Result: Theoretical framework that reframes ungrounded generation as unbaited retrieval of maximum likelihood prior, while reasoning emerges when anchors shift posterior toward goal-directed constraints. Provides architecture to implement coordination layer.

Conclusion: The path to AGI runs through LLMs, not around them - LLMs provide necessary System-1 substrate, and the missing component is a System-2 coordination layer that can be implemented via theories like UCCT and architectures like MACI.

Abstract: Influential critiques argue that Large Language Models (LLMs) are a dead end for AGI: “mere pattern matchers” structurally incapable of reasoning or planning. We argue this conclusion misidentifies the bottleneck: it confuses the ocean with the net. Pattern repositories are the necessary System-1 substrate; the missing component is a System-2 coordination layer that selects, constrains, and binds these patterns. We formalize this layer via UCCT, a theory of semantic anchoring that models reasoning as a phase transition governed by effective support (ρ_d), representational mismatch (d_r), and an adaptive anchoring budget (γ log k). Under this lens, ungrounded generation is simply an unbaited retrieval of the substrate’s maximum-likelihood prior, while “reasoning” emerges when anchors shift the posterior toward goal-directed constraints. We translate UCCT into architecture with MACI, a coordination stack that implements baiting (behavior-modulated debate), filtering (Socratic judging), and persistence (transactional memory). By reframing common objections as testable coordination failures, we argue that the path to AGI runs through LLMs, not around them.

[208] Multimodal Oncology Agent for IDH1 Mutation Prediction in Low-Grade Glioma

Hafsa Akebli, Adam Shephard, Vincenzo Della Mea, Nasir Rajpoot

Main category: cs.AI

TL;DR: MOA (Multimodal Oncology Agent) combines histology analysis via TITAN foundation model with clinical/genomic reasoning to predict IDH1 mutations in low-grade gliomas, achieving state-of-the-art performance.

DetailsMotivation: IDH1 mutations in low-grade gliomas define clinically distinct subgroups with important prognostic and therapeutic implications, but current prediction methods may not fully leverage multimodal information.

Method: Developed a Multimodal Oncology Agent (MOA) integrating: 1) TITAN foundation model for histology analysis, 2) reasoning over structured clinical/genomic data using PubMed, Google Search, and OncoKB, and 3) fusion of histology features with clinical data.
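
As an illustration of the fusion step only, a simple late-fusion baseline that concatenates slide-level embeddings with structured clinical features; the function name and classifier choice are hypothetical, and the actual MOA fusion may differ:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def late_fusion_fit(histology_emb, clinical_feats, labels):
    """Fit a simple classifier on concatenated histology embeddings
    (e.g., from a slide foundation model) and clinical features.

    histology_emb:  (n_patients, d_h) array
    clinical_feats: (n_patients, d_c) array
    labels:         (n_patients,) binary IDH1 mutation labels
    """
    X = np.concatenate([histology_emb, clinical_feats], axis=1)
    return LogisticRegression(max_iter=1000).fit(X, labels)
```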

Result: MOA without histology (F1=0.826) outperformed clinical baseline (F1=0.798). MOA with histology fusion achieved best performance (F1=0.912), exceeding histology baseline (0.894) and fused histology-clinical baseline (0.897) on TCGA-LGG cohort (n=488).

Conclusion: MOA captures complementary mutation-relevant information enriched through external biomedical sources, enabling accurate IDH1 mutation prediction and demonstrating the value of multimodal integration for oncology diagnostics.

Abstract: Low-grade gliomas frequently present IDH1 mutations that define clinically distinct subgroups with specific prognostic and therapeutic implications. This work introduces a Multimodal Oncology Agent (MOA) integrating a histology tool based on the TITAN foundation model for IDH1 mutation prediction in low-grade glioma, combined with reasoning over structured clinical and genomic inputs through PubMed, Google Search, and OncoKB. MOA reports were quantitatively evaluated on 488 patients from the TCGA-LGG cohort against clinical and histology baselines. MOA without the histology tool outperformed the clinical baseline, achieving an F1-score of 0.826 compared to 0.798. When fused with histology features, MOA reached the highest performance with an F1-score of 0.912, exceeding both the histology baseline at 0.894 and the fused histology-clinical baseline at 0.897. These results demonstrate that the proposed agent captures complementary mutation-relevant information enriched through external biomedical sources, enabling accurate IDH1 mutation prediction.

[209] Using Large Language Models to Create Personalized Networks From Therapy Sessions

Clarissa W. Ong, Hiba Arnaout, Kate Sheehan, Estella Fox, Eugen Owtscharow, Iryna Gurevych

Main category: cs.AI

TL;DR: LLM-based pipeline generates personalized client networks from therapy transcripts to support clinical case conceptualization and treatment planning.

DetailsMotivation: Network-driven treatment personalization requires intensive longitudinal data, which isn't always feasible. LLMs offer a scalable solution to create personalized networks from therapy transcripts.

Method: End-to-end pipeline using in-context learning to identify psychological processes and dimensions from 77 therapy transcripts, then grouping processes into clinically meaningful clusters and generating explanation-augmented relationships between clusters.
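
A rough sketch of the two-step network construction, with the clusterer choice, prompt wording, and `llm` callable all assumptions rather than the paper's implementation:

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

def build_network(process_embeddings, process_names, llm, n_clusters=6):
    """Step 1: group extracted psychological processes into clusters.
    Step 2: ask an LLM for explanation-augmented edges between clusters.
    """
    labels = AgglomerativeClustering(n_clusters=n_clusters).fit_predict(
        np.asarray(process_embeddings))
    clusters = {c: [n for n, l in zip(process_names, labels) if l == c]
                for c in range(n_clusters)}
    edges = []
    for a in clusters:
        for b in clusters:
            if a < b:
                prompt = (f"Client processes A: {clusters[a]}\n"
                          f"Client processes B: {clusters[b]}\n"
                          "Does A plausibly influence B? Explain briefly.")
                edges.append((a, b, llm(prompt)))  # explanation-augmented edge
    return clusters, edges
```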

Result: Method achieved high performance with few training examples. Experts preferred multi-step approach over direct prompting (up to 90% preference) and rated networks favorably for clinical relevance, novelty, and usefulness (72-75% scores).

Conclusion: Proof of concept for using LLMs to create clinically relevant networks from therapy transcripts, enabling bottom-up case conceptualization and identification of latent themes. Future research should examine whether these networks improve treatment outcomes compared to other personalization methods.

Abstract: Recent advances in psychotherapy have focused on treatment personalization, such as by selecting treatment modules based on personalized networks. However, estimating personalized networks typically requires intensive longitudinal data, which is not always feasible. A solution to facilitate scalability of network-driven treatment personalization is leveraging LLMs. In this study, we present an end-to-end pipeline for automatically generating client networks from 77 therapy transcripts to support case conceptualization and treatment planning. We annotated 3364 psychological processes and their corresponding dimensions in therapy transcripts. Using these data, we applied in-context learning to jointly identify psychological processes and their dimensions. The method achieved high performance even with a few training examples. To organize the processes into networks, we introduced a two-step method that grouped them into clinically meaningful clusters. We then generated explanation-augmented relationships between clusters. Experts found that networks produced by our multi-step approach outperformed those built with direct prompting for clinical utility and interpretability, with up to 90% preferring our approach. In addition, the networks were rated favorably by experts, with scores for clinical relevance, novelty, and usefulness ranging from 72-75%. Our findings provide a proof of concept for using LLMs to create clinically relevant networks from therapy transcripts. Advantages of our approach include bottom-up case conceptualization from client utterances in therapy sessions and identification of latent themes. Networks generated by our pipeline may be used in clinical settings as well as in supervision and training. Future research should examine whether these networks improve treatment outcomes relative to other methods of treatment personalization, including statistically estimated networks.

[210] To Err Is Human: Systematic Quantification of Errors in Published AI Papers via LLM Analysis

Federico Bianchi, Yongchan Kwon, Zachary Izzo, Linjun Zhang, James Zou

Main category: cs.AI

TL;DR: AI paper correctness checker using GPT-5 finds increasing objective mistakes in published AI papers over time, with 83.2% precision in error detection and 75.8% success in proposing fixes.

DetailsMotivation: Published AI papers form the foundation of research, but errors can propagate unnoticed, creating confusion and complicating reproducibility. The accelerating pace of research and demands on peer-review make mistakes harder to detect.

Method: Developed a Paper Correctness Checker based on GPT-5 to systematically identify objective mistakes (formulas, derivations, calculations, figures, tables) in papers from top AI conferences/journals. Human experts reviewed 316 potential mistakes to validate findings.

Result: Published papers contain non-negligible objective mistakes, with increasing trends: NeurIPS (3.8 to 5.9 mistakes/paper, 55.3% increase 2021-2025), ICLR (4.1 to 5.2, 2018-2025), TMLR (5.0 to 5.5, 2022/23-2025). Human validation showed 83.2% precision (263/316 confirmed). AI Checker can propose correct fixes for 75.8% of identified mistakes.

Conclusion: Frontier LLMs like GPT-5 have significant potential to detect and correct objective mistakes in published papers, helping establish a firmer knowledge foundation and improving reproducibility in AI research.

Abstract: How many mistakes do published AI papers contain? Peer-reviewed publications form the foundation upon which new research and knowledge are built. Errors that persist in the literature can propagate unnoticed, creating confusion in follow-up studies and complicating reproducibility. The accelerating pace of research and the increasing demands on the peer-review system make such mistakes harder to detect and avoid. To address this, we developed a Paper Correctness Checker based on GPT-5 to systematically identify mistakes in papers previously published at top AI conferences and journals. Our analysis focuses on objective mistakes (e.g., errors in formulas, derivations, calculations, figures, and tables) that have a clearly verifiable ground truth. We intentionally exclude subjective considerations such as novelty, importance, or writing quality. We find that published papers contain a non-negligible number of objective mistakes and that the average number of mistakes per paper has increased over time: from 3.8 in NeurIPS 2021 to 5.9 in NeurIPS 2025 (a 55.3% increase); from 4.1 in ICLR 2018 to 5.2 in ICLR 2025; and from 5.0 in TMLR 2022/23 to 5.5 in TMLR 2025. Human experts reviewed 316 potential mistakes identified by the AI Checker and confirmed that 263 were actual mistakes, corresponding to a precision of 83.2%. While most identified issues are relatively minor, correcting them would reduce confusion in the literature and strengthen reproducibility. The AI Checker also surfaced potentially more substantive mistakes that could affect the interpretation of results. Moreover, we show that the AI Checker can propose correct fixes for 75.8% of the identified mistakes. Overall, this study highlights the potential of frontier LLMs to detect and correct objective mistakes in published papers, helping to establish a firmer foundation of knowledge.

[211] PRiSM: An Agentic Multimodal Benchmark for Scientific Reasoning via Python-Grounded Evaluation

Shima Imani, Seungwhan Moon, Adel Ahmadyan, Lu Zhang, Kirmani Ahmed, Babak Damavandi

Main category: cs.AI

TL;DR: PRiSM is a synthetic, dynamic multimodal benchmark for evaluating scientific reasoning in VLMs using Python code, featuring 24,750+ physics/math problems with automated ground truth generation and verification.

DetailsMotivation: Existing VLM benchmarks fail to address scientific domains' unique requirements like conceptual understanding, symbolic reasoning, and formal law adherence. Current datasets are static, lack intermediate reasoning steps, robustness to variations, and mechanisms for verifying scientific correctness.

Method: Introduces PRiSM benchmark with PrismAgent pipeline to generate dynamic multimodal problems. Each problem includes textual/visual inputs, generated figures, executable Python code for ground truth, and step-by-step reasoning. Uses Python-powered automated ground truth generation for verification.
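
A toy example, in the spirit of PRiSM's design, of a parameterized problem whose ground truth is produced by executable Python; this is an illustration, not an actual benchmark item:

```python
import math
import random

def projectile_instance(seed=None):
    """Generate one parameterized physics problem together with a
    callable that computes its ground-truth answer."""
    rng = random.Random(seed)
    v0, angle = rng.uniform(5, 50), rng.uniform(10, 80)
    question = (f"A projectile is launched at {v0:.1f} m/s at "
                f"{angle:.1f} degrees above the horizontal. "
                f"What is its range on flat ground?")
    def ground_truth():
        g = 9.81  # m/s^2
        return v0 ** 2 * math.sin(2 * math.radians(angle)) / g
    return question, ground_truth

q, gt = projectile_instance(seed=0)
print(q, "->", round(gt(), 2), "m")  # fresh variant per seed
```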

Result: Comprehensive evaluation of existing VLMs reveals their limitations in scientific reasoning. PRiSM enables fine-grained experimental auditing, uncovering failure modes, uncertainty behaviors, and reasoning limitations through five targeted evaluation tasks.

Conclusion: PRiSM provides a robust framework for deeper insights into VLM scientific reasoning capabilities, addressing critical gaps in current evaluation methodologies for scientific domains.

Abstract: Evaluating vision-language models (VLMs) in scientific domains like mathematics and physics poses unique challenges that go far beyond predicting final answers. These domains demand conceptual understanding, symbolic reasoning, and adherence to formal laws, requirements that most existing benchmarks fail to address. In particular, current datasets tend to be static, lacking intermediate reasoning steps, robustness to variations, or mechanisms for verifying scientific correctness. To address these limitations, we introduce PRiSM, a synthetic, fully dynamic, and multimodal benchmark for evaluating scientific reasoning via grounded Python code. PRiSM includes over 24,750 university-level physics and math problems, and it leverages our scalable agent-based pipeline, PrismAgent, to generate well-structured problem instances. Each problem contains dynamic textual and visual inputs and a generated figure, alongside rich structured outputs: executable Python code for ground-truth generation and verification, and detailed step-by-step reasoning. The dynamic nature and Python-powered automated ground truth generation of our benchmark allow for fine-grained experimental auditing of multimodal VLMs, revealing failure modes, uncertainty behaviors, and limitations in scientific reasoning. To this end, we propose five targeted evaluation tasks covering generalization, symbolic program synthesis, perturbation robustness, reasoning correction, and ambiguity resolution. Through comprehensive evaluation of existing VLMs, we highlight their limitations and showcase how PRiSM enables deeper insights into their scientific reasoning capabilities.

[212] TRACE: A Framework for Analyzing and Enhancing Stepwise Reasoning in Vision-Language Models

Shima Imani, Seungwhan Moon, Lambert Mathias, Lu Zhang, Babak Damavandi

Main category: cs.AI

TL;DR: TRACE is a framework for transparent reasoning evaluation that diagnoses reasoning trajectories in vision-language models using Auxiliary Reasoning Sets to expose failures missed by standard final-answer evaluation.

DetailsMotivation: Current evaluation methods for large vision-language models focus only on final answers, masking reasoning errors and allowing silent failures to persist, creating a need for more transparent reasoning assessment.

Method: TRACE uses Auxiliary Reasoning Sets (compact sub question-answer pairs) to decompose complex problems, evaluates intermediate steps through consistency-based metrics, and defines confidence regions to distinguish reliable from unreliable reasoning paths.
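
One illustrative way to realize a consistency metric over an Auxiliary Reasoning Set; exact string matching is a simplification here, and the paper's metric may differ:

```python
def consistency_score(model_answers, ars_pairs):
    """Fraction of auxiliary sub-questions the model answers in
    agreement with the ARS reference answers.

    model_answers: dict mapping sub_question -> model answer
    ars_pairs:     list of (sub_question, reference_answer) tuples
    """
    agree = sum(
        model_answers.get(q, "").strip().lower() == a.strip().lower()
        for q, a in ars_pairs)
    return agree / len(ars_pairs)  # 1.0 = fully consistent trajectory
```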

Result: Experiments show that consistency across ARS correlates with final-answer correctness, helps pinpoint where reasoning failures occur, and provides actionable signals for model improvement through effective filtering and debugging.

Conclusion: TRACE offers a transparent framework for evaluating reasoning trajectories in vision-language models, moving beyond final-answer evaluation to expose hidden reasoning failures and support model refinement through consistency-based diagnostics.

Abstract: Reliable mathematical and scientific reasoning remains an open challenge for large vision-language models. Standard final-answer evaluation often masks reasoning errors, allowing silent failures to persist. To address this gap, we introduce TRACE, a framework for Transparent Reasoning And Consistency Evaluation that diagnoses reasoning trajectories rather than only end results. At its core, TRACE leverages Auxiliary Reasoning Sets (ARS): compact sub-question-answer pairs that decompose complex problems, allowing intermediate steps to be evaluated through consistency-based metrics and exposing failures overlooked by standard evaluation. Our experiments show that consistency across ARS correlates with final-answer correctness and helps pinpoint the reasoning steps where failures arise, offering actionable signals for model improvement. Furthermore, TRACE defines confidence regions that distinguish reliable from unreliable reasoning paths, supporting effective filtering, debugging, and model refinement.

[213] Variational Quantum Rainbow Deep Q-Network for Optimizing Resource Allocation Problem

Truong Thanh Hung Nguyen, Truong Thinh Nguyen, Hung Cao

Main category: cs.AI

TL;DR: VQR-DQN integrates variational quantum circuits with Rainbow DQN to solve NP-hard resource allocation problems, achieving significant performance improvements over classical methods.

DetailsMotivation: Resource allocation is NP-hard with combinatorial complexity. Classical DRL methods like Rainbow DQN have limited representational power due to classical function approximators, creating a need for quantum-enhanced approaches.

Method: Introduces Variational Quantum Rainbow DQN (VQR-DQN) that integrates ring-topology variational quantum circuits with Rainbow DQN to leverage quantum superposition and entanglement. Frames human resource allocation as an MDP with combinatorial action spaces based on officer capabilities, event schedules, and transition times.
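
A minimal PennyLane sketch of a ring-topology variational circuit of the kind described; the qubit count, angle encoding, and layer structure are illustrative assumptions:

```python
import pennylane as qml

n_qubits = 4
dev = qml.device("default.qubit", wires=n_qubits)

@qml.qnode(dev)
def ring_vqc(inputs, weights):
    """Per-qubit rotations followed by CNOTs arranged in a ring."""
    for i in range(n_qubits):
        qml.RY(inputs[i], wires=i)                   # angle-encode state
    for layer in weights:                            # variational layers
        for i in range(n_qubits):
            qml.RY(layer[i], wires=i)                # trainable rotations
        for i in range(n_qubits):
            qml.CNOT(wires=[i, (i + 1) % n_qubits])  # ring entanglement
    return [qml.expval(qml.PauliZ(i)) for i in range(n_qubits)]

# e.g. Q-values could be read off these expectations:
print(ring_vqc([0.1] * n_qubits, [[0.2] * n_qubits, [0.3] * n_qubits]))
```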

Result: On four HRAP benchmarks, VQR-DQN achieves 26.8% normalized makespan reduction versus random baselines and outperforms Double DQN and classical Rainbow DQN by 4.9-13.4%.

Conclusion: The gains align with theoretical connections between circuit expressibility, entanglement, and policy quality, demonstrating the potential of quantum-enhanced DRL for large-scale resource allocation.

Abstract: Resource allocation remains NP-hard due to combinatorial complexity. While deep reinforcement learning (DRL) methods, such as the Rainbow Deep Q-Network (DQN), improve scalability through prioritized replay and distributional heads, classical function approximators limit their representational power. We introduce Variational Quantum Rainbow DQN (VQR-DQN), which integrates ring-topology variational quantum circuits with Rainbow DQN to leverage quantum superposition and entanglement. We frame the human resource allocation problem (HRAP) as a Markov decision process (MDP) with combinatorial action spaces based on officer capabilities, event schedules, and transition times. On four HRAP benchmarks, VQR-DQN achieves 26.8% normalized makespan reduction versus random baselines and outperforms Double DQN and classical Rainbow DQN by 4.9-13.4%. These gains align with theoretical connections between circuit expressibility, entanglement, and policy quality, demonstrating the potential of quantum-enhanced DRL for large-scale resource allocation. Our implementation is available at: https://github.com/Analytics-Everywhere-Lab/qtrl/.

[214] SymPyBench: A Dynamic Benchmark for Scientific Reasoning with Executable Python Code

Shima Imani, Seungwhan Moon, Adel Ahmadyan, Lu Zhang, Kirmani Ahmed, Babak Damavandi

Main category: cs.AI

TL;DR: SymPyBench: A large-scale synthetic physics benchmark with 15,045 parameterized problems, featuring multiple question formats and novel evaluation metrics for assessing scientific reasoning in language models.

DetailsMotivation: To create a comprehensive benchmark for evaluating scientific reasoning capabilities of language models, addressing limitations of existing static benchmarks by providing parameterized problems with infinite variations and structured ground-truth solutions.

Method: Developed a synthetic benchmark of 15,045 university-level physics problems with 90/10 train/test split. Each problem is fully parameterized with executable Python code for ground-truth solutions. Includes three question types: MC-Symbolic, MC-Numerical, and free-form formats. Introduced novel evaluation metrics: Consistency Score, Failure Rate, and Confusion Rate.
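
One plausible reading of the three variant-level metrics, sketched over a list of solved problem variants; the benchmark's exact definitions may differ:

```python
def variant_metrics(variants):
    """Compute consistency/failure/confusion rates across parameterized
    variants of one problem.

    variants: list of dicts with keys 'answer' (model output, None if
    the model produced no usable answer) and 'truth'.
    """
    n = len(variants)
    failures = sum(v["answer"] is None for v in variants)
    correct = sum(v["answer"] == v["truth"] for v in variants)
    wrong = n - failures - correct
    return {
        "consistency": correct / n,    # stable correctness across variants
        "failure_rate": failures / n,  # no usable answer produced
        "confusion_rate": wrong / n,   # confidently wrong answers
    }
```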

Result: Experiments with state-of-the-art instruction-tuned language models revealed both strengths and limitations in scientific reasoning capabilities. The benchmark enables quantitative assessment of model performance across problem variations.

Conclusion: SymPyBench serves as a foundation for developing more robust and interpretable reasoning systems by providing dynamic, code-driven evaluation with novel metrics that quantify variability and uncertainty in model responses.

Abstract: We introduce SymPyBench, a large-scale synthetic benchmark of 15,045 university-level physics problems (90/10 train/test split). Each problem is fully parameterized, supporting an effectively infinite range of input configurations, and is accompanied by structured, step-by-step reasoning and executable Python code that produces the ground-truth solution for any parameter set. The benchmark contains three question types: MC-Symbolic (multiple-choice with symbolic options), MC-Numerical (multiple-choice with numerical options), and free-form (open-ended responses). These diverse formats test complementary reasoning skills. By leveraging the dynamic, code-driven nature of the benchmark, we introduce three novel evaluation metrics in addition to standard accuracy: Consistency Score, Failure Rate, and Confusion Rate, which quantify variability and uncertainty across problem variants. Experiments with state-of-the-art instruction-tuned language models reveal both strengths and limitations in scientific reasoning, positioning SymPyBench as a foundation for developing more robust and interpretable reasoning systems.

[215] IS-Bench: Evaluating Interactive Safety of VLM-Driven Embodied Agents in Daily Household Tasks

Xiaoya Lu, Zeren Chen, Xuhao Hu, Yijin Zhou, Weichen Zhang, Dongrui Liu, Lu Sheng, Jing Shao

Main category: cs.AI

TL;DR: IS-Bench is a new benchmark for evaluating interactive safety in embodied AI agents, focusing on dynamic risk perception and mitigation in household tasks, revealing current VLMs’ safety limitations.

DetailsMotivation: Existing evaluation paradigms fail to assess dynamic risks in interactive environments, relying on unreliable post-hoc evaluations that ignore unsafe intermediate steps, creating safety hazards for real-world deployment.

Method: Proposes IS-Bench, a multi-modal benchmark with 161 scenarios and 388 unique safety risks in a high-fidelity simulator, featuring process-oriented evaluation that verifies risk mitigation actions in correct procedural order.
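
A minimal sketch of the process-oriented check, verifying that a mitigation action lands on the correct side of a risk-prone step; the action names are placeholders:

```python
def mitigation_in_order(trajectory, risk_step, mitigation, before=True):
    """Return True if `mitigation` occurs before (or after, per the
    scenario spec) the risk-prone step in the agent's action trajectory.
    """
    try:
        m = trajectory.index(mitigation)
        r = trajectory.index(risk_step)
    except ValueError:
        return False  # one of the required actions never happened
    return m < r if before else m > r

# e.g. turning off the stove must happen before leaving the kitchen:
assert mitigation_in_order(["turn_off_stove", "leave_kitchen"],
                           "leave_kitchen", "turn_off_stove")
```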

Result: Experiments on leading VLMs (GPT-4o, Gemini-2.5) show current agents lack interactive safety awareness; safety-aware Chain-of-Thought improves performance but often compromises task completion.

Conclusion: IS-Bench highlights critical safety limitations in current embodied AI systems and provides a foundation for developing safer, more reliable agents through interactive safety evaluation.

Abstract: Flawed planning from VLM-driven embodied agents poses significant safety hazards, hindering their deployment in real-world household tasks. However, existing static, non-interactive evaluation paradigms fail to adequately assess risks within these interactive environments, since they cannot simulate dynamic risks that emerge from an agent’s actions and rely on unreliable post-hoc evaluations that ignore unsafe intermediate steps. To bridge this critical gap, we propose evaluating an agent’s interactive safety: its ability to perceive emergent risks and execute mitigation steps in the correct procedural order. We thus present IS-Bench, the first multi-modal benchmark designed for interactive safety, featuring 161 challenging scenarios with 388 unique safety risks instantiated in a high-fidelity simulator. Crucially, it facilitates a novel process-oriented evaluation that verifies whether risk mitigation actions are performed before/after specific risk-prone steps. Extensive experiments on leading VLMs, including the GPT-4o and Gemini-2.5 series, reveal that current agents lack interactive safety awareness, and that while safety-aware Chain-of-Thought can improve performance, it often compromises task completion. By highlighting these critical limitations, IS-Bench provides a foundation for developing safer and more reliable embodied AI systems. Code and data are released under https://github.com/AI45Lab/IS-Bench.

[216] Rolling in the deep of cognitive and AI biases

Athena Vakali, Nicoleta Tantalaki

Main category: cs.AI

TL;DR: The paper argues that current AI fairness approaches focus too much on computational aspects while overlooking human cognitive biases, and proposes a new methodology that maps human heuristics to AI biases throughout the AI lifecycle.

DetailsMotivation: Despite efforts to make AI systems fair, they continue to produce discriminatory outcomes. The authors argue this is because current fairness approaches treat AI as purely computational systems, ignoring the sociotechnical context and human cognitive biases that influence AI development and deployment.

Method: The paper introduces a radical new methodology that centers human cognitive biases in AI fairness analysis. It uses cognitive science definitions and taxonomies of human heuristics to identify how harmful human actions influence the AI lifecycle, reveals hidden pathways from human to AI biases, and creates a mapping that justifies the reflection of human heuristics in AI biases.

Result: The approach detects relevant fairness intensities and inter-dependencies between human cognitive biases and AI system biases, providing a framework to understand how human heuristics translate into AI discrimination.

Conclusion: This human-centric approach will help revisit AI fairness through deeper case studies that reveal hidden biases and their causes and effects, moving beyond purely computational fairness analysis to address the sociotechnical nature of AI systems.

Abstract: Nowadays, we delegate many of our decisions to Artificial Intelligence (AI), which acts either solo or as a human companion in decisions made to support several sensitive domains, like healthcare, financial services, and law enforcement. AI systems, even when carefully designed to be fair, are heavily criticized for delivering misjudged and discriminatory outcomes against individuals and groups. Much work on AI algorithmic fairness is devoted to Machine Learning pipelines that address biases and quantify fairness under a purely computational view. However, the continuing unfair and unjust AI outcomes indicate an urgent need to understand AI as a sociotechnical system, inseparable from the conditions in which it is designed, developed, and deployed. Although the synergy of humans and machines seems imperative to make AI work, the significant impact of human and societal factors on AI bias is currently overlooked. We address this critical issue by following a radically new methodology under which human cognitive biases become core entities in our AI fairness overview. Inspired by the cognitive-science definition and taxonomy of human heuristics, we identify how harmful human actions influence the overall AI lifecycle and reveal hidden pathways from human to AI biases. We introduce a new mapping that justifies how human heuristics are reflected in AI biases, and we detect relevant fairness intensities and inter-dependencies. We envision that this approach will contribute to revisiting AI fairness through deeper, human-centric case studies, revealing the causes and effects of hidden biases.

[217] OpenMMReasoner: Pushing the Frontiers for Multimodal Reasoning with an Open and General Recipe

Kaichen Zhang, Keming Wu, Zuhao Yang, Bo Li, Kairui Hu, Bin Wang, Ziwei Liu, Xingxuan Li, Lidong Bing

Main category: cs.AI

TL;DR: OpenMMReasoner introduces a transparent two-stage training recipe (SFT + RL) for multimodal reasoning with open-sourced data and code, achieving 11.6% improvement over baseline models.

DetailsMotivation: Despite progress in visual reasoning, there's a lack of transparent and reproducible data curation and training strategies, which hinders scalable research in multimodal reasoning.

Method: Two-stage approach: 1) Supervised fine-tuning with 874K-sample cold-start dataset featuring step-by-step validation, 2) Reinforcement learning with 74K-sample dataset across diverse domains to sharpen and stabilize reasoning abilities.

Result: Achieves 11.6% improvement over Qwen2.5-VL-7B-Instruct baseline across nine multimodal reasoning benchmarks, demonstrating superior performance and establishing empirical foundation for future research.

Conclusion: The work provides a fully transparent, reproducible training recipe that highlights the critical role of data quality and training design in multimodal reasoning, with all code, pipeline, and data open-sourced for community use.

Abstract: Recent advancements in large reasoning models have fueled growing interest in extending such capabilities to multimodal domains. However, despite notable progress in visual reasoning, the lack of transparent and reproducible data curation and training strategies remains a major barrier to scalable research. In this work, we introduce OpenMMReasoner, a fully transparent two-stage recipe for multimodal reasoning spanning supervised fine-tuning (SFT) and reinforcement learning (RL). In the SFT stage, we construct an 874K-sample cold-start dataset with rigorous step-by-step validation, providing a strong foundation for reasoning capabilities. The subsequent RL stage leverages a 74K-sample dataset across diverse domains to further sharpen and stabilize these abilities, resulting in a more robust and efficient learning process. Extensive evaluations demonstrate that our training recipe not only surpasses strong baselines but also highlights the critical role of data quality and training design in shaping multimodal reasoning performance. Notably, our method achieves an 11.6% improvement over the Qwen2.5-VL-7B-Instruct baseline across nine multimodal reasoning benchmarks, establishing a solid empirical foundation for future large-scale multimodal reasoning research. We have open-sourced all of our code, pipeline, and data at https://github.com/EvolvingLMMs-Lab/OpenMMReasoner.

[218] Enhancing Large Language Models through Neuro-Symbolic Integration and Ontological Reasoning

Ruslan Idelfonso Magana Vsevolodovna, Marco Monti

Main category: cs.AI

TL;DR: Neuro-symbolic approach combining LLMs with ontological reasoning to detect and correct hallucinations through consistency checking and iterative refinement.

DetailsMotivation: LLMs suffer from hallucinations (inaccuracies and logical inconsistencies) that compromise reliability, especially in domains requiring factual accuracy.

Method: Integrates symbolic ontological reasoning (OWL ontologies, HermiT reasoner) with machine learning (logistic regression) to map natural language to logical forms, detect inconsistencies, and generate explanatory feedback for iterative refinement.
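
A minimal sketch of the detect-and-refine loop using owlready2, whose sync_reasoner invokes HermiT by default; the `to_logical_form` callable stands in for the paper's logistic-regression mapper, and the feedback prompt is illustrative:

```python
from owlready2 import (get_ontology, sync_reasoner,
                       OwlReadyInconsistentOntologyError)

def check_and_refine(llm, statement, onto_path, to_logical_form,
                     max_rounds=3):
    """Map an LLM statement to an ontology assertion, run the HermiT
    reasoner, and feed detected inconsistencies back to the LLM."""
    for _ in range(max_rounds):
        onto = get_ontology(onto_path).load()  # fresh copy each round
        with onto:
            to_logical_form(onto, statement)   # assert the parsed claim
            try:
                sync_reasoner()                # HermiT consistency check
                return statement               # consistent: accept it
            except OwlReadyInconsistentOntologyError:
                statement = llm(
                    "Your statement contradicts the domain ontology. "
                    f"Please revise it: {statement}")
    return None  # no consistent answer reached within the budget
```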

Result: Experimental results show significant improvements in semantic coherence and factual accuracy of LLM outputs in a defined domain.

Conclusion: Combining LLM fluency with formal semantics through neuro-symbolic approaches shows potential for enhancing consistency and reliability of LLM outputs.

Abstract: Large Language Models (LLMs) demonstrate impressive capabilities in natural language processing but suffer from inaccuracies and logical inconsistencies known as hallucinations. This compromises their reliability, especially in domains requiring factual accuracy. We propose a neuro-symbolic approach integrating symbolic ontological reasoning and machine learning methods to enhance the consistency and reliability of LLM outputs. Our workflow utilizes OWL ontologies, a symbolic reasoner (e.g., HermiT) for consistency checking, and a lightweight machine learning model (logistic regression) for mapping natural language statements into logical forms compatible with the ontology. When inconsistencies between LLM outputs and the ontology are detected, the system generates explanatory feedback to guide the LLM towards a corrected, logically coherent response in an iterative refinement loop. We present a working Python prototype demonstrating this pipeline. Experimental results in a defined domain suggest significant improvements in semantic coherence and factual accuracy of LLM outputs, showcasing the potential of combining LLM fluency with the rigor of formal semantics.

[219] FedIFL: A federated cross-domain diagnostic framework for motor-driven systems with inconsistent fault modes

Zexiao Wang, Yankai Wang, Xiaoqiang Liao, Xinguo Ming, Weiming Shen

Main category: cs.AI

TL;DR: FedIFL is a federated learning framework for fault diagnosis that addresses label space inconsistency across clients by learning invariant features through prototype contrastive learning, feature generation, and feature disentanglement mechanisms.

DetailsMotivation: Industrial data scarcity makes it hard for individual equipment users to train comprehensive fault diagnosis models. Federated learning enables collaborative training while preserving data privacy, but inconsistent label spaces across clients (due to diverse working conditions) cause local models to focus on client-specific faults, weakening the global model's generalization.

Method: FedIFL uses: 1) Intra-client prototype contrastive learning to mitigate domain shifts, 2) Feature generation for privacy-friendly access to other clients’ distributions, 3) Cross-client feature disentanglement with federated instance consistency loss for invariant features, federated instance personalization loss and orthogonal loss to distinguish specific features from invariant ones.
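
As a hedged example, one standard way to realize an orthogonality constraint between invariant and client-specific features (FedIFL's exact loss may differ):

```python
import torch

def orthogonal_loss(invariant_feats, specific_feats):
    """Penalize overlap between invariant and client-specific features
    via the squared Frobenius norm of their cross-correlation.

    Both inputs: (batch, d) tensors, assumed L2-normalized per row.
    """
    cross = invariant_feats.T @ specific_feats   # (d, d) correlation
    return (cross ** 2).sum() / invariant_feats.shape[0]
```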

Result: The aggregated model achieves promising generalization across global label spaces, enabling accurate fault diagnosis for Motor Driven Systems with inconsistent label spaces. Experiments on real-world MDSs validate FedIFL’s effectiveness and superiority in federated cross-domain diagnosis.

Conclusion: FedIFL successfully addresses the challenge of inconsistent fault modes in federated diagnostic scenarios by learning invariant features across clients, improving the global model’s generalization while maintaining data privacy.

Abstract: Due to the scarcity of industrial data, individual equipment users, particularly start-ups, struggle to independently train a comprehensive fault diagnosis model; federated learning enables collaborative training while ensuring data privacy, making it an ideal solution. However, the diversity of working conditions leads to variations in fault modes, resulting in inconsistent label spaces across different clients. In federated diagnostic scenarios, label space inconsistency causes local models to focus on client-specific fault modes and to map different failure modes from different clients to similar feature representations, which weakens the aggregated global model’s generalization. To tackle this issue, this article proposes a federated cross-domain diagnostic framework termed Federated Invariant Features Learning (FedIFL). In intra-client training, prototype contrastive learning mitigates intra-client domain shifts; subsequently, feature generation ensures that local models can access the distributions of other clients in a privacy-friendly manner. In cross-client training, a feature disentanglement mechanism is introduced to mitigate cross-client domain shifts: an instance-level federated instance consistency loss ensures the consistency of invariant features between different clients, while a federated instance personalization loss and an orthogonal loss distinguish client-specific features from the invariant ones. Ultimately, the aggregated model achieves promising generalization across the global label space, enabling accurate fault diagnosis for target clients’ Motor Driven Systems (MDSs) with inconsistent label spaces. Experiments on real-world MDSs validate the effectiveness and superiority of FedIFL in federated cross-domain diagnosis with inconsistent fault modes.

[220] KNARsack: Teaching Neural Algorithmic Reasoners to Solve Pseudo-Polynomial Problems

Stjepan Požgaj, Dobrik Georgiev, Marin Šilić, Petar Veličković

Main category: cs.AI

TL;DR: Neural algorithmic reasoning approach for Knapsack problem using two-phase dynamic programming pipeline achieves better generalization to larger instances than direct prediction baseline.

DetailsMotivation: To extend neural algorithmic reasoning (NAR) to solve Knapsack problems, which are pseudo-polynomial problems bridging classical algorithms and combinatorial optimization but omitted from standard NAR benchmarks.

Method: Two-phase pipeline that closely follows the Knapsack algorithm: 1) constructing the dynamic programming table, 2) reconstructing the solution from the table. Uses dynamic programming supervision to model intermediate states.
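
For reference, the classical two-phase Knapsack procedure the reasoner is trained to imitate, with table construction followed by solution reconstruction:

```python
def knapsack(values, weights, capacity):
    """Phase 1: fill the dynamic programming table.
    Phase 2: walk it backwards to reconstruct the chosen subset."""
    n = len(values)
    dp = [[0] * (capacity + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):                 # phase 1: build the table
        for c in range(capacity + 1):
            dp[i][c] = dp[i - 1][c]           # skip item i-1
            if weights[i - 1] <= c:
                take = dp[i - 1][c - weights[i - 1]] + values[i - 1]
                dp[i][c] = max(dp[i][c], take)
    chosen, c = [], capacity                  # phase 2: reconstruct
    for i in range(n, 0, -1):
        if dp[i][c] != dp[i - 1][c]:          # item i-1 was taken
            chosen.append(i - 1)
            c -= weights[i - 1]
    return dp[n][capacity], chosen[::-1]

# e.g. knapsack([60, 100, 120], [10, 20, 30], 50) -> (220, [1, 2])
```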

Result: The approach achieves better generalization to larger problem instances compared to a direct-prediction baseline that attempts to select the optimal subset only from problem inputs.

Conclusion: Modeling algorithmic reasoning through dynamic programming supervision in a two-phase pipeline is effective for solving Knapsack problems and demonstrates improved generalization capabilities over simpler approaches.

Abstract: Neural algorithmic reasoning (NAR) is a growing field that aims to embed algorithmic logic into neural networks by imitating classical algorithms. In this extended abstract, we detail our attempt to build a neural algorithmic reasoner that can solve Knapsack, a pseudo-polynomial problem bridging classical algorithms and combinatorial optimisation, but omitted from standard NAR benchmarks. Our neural algorithmic reasoner is designed to closely follow the two-phase pipeline for the Knapsack problem, which involves first constructing the dynamic programming table and then reconstructing the solution from it. The approach, which models intermediate states through dynamic programming supervision, achieves better generalization to larger problem instances than a direct-prediction baseline that attempts to select the optimal subset only from the problem inputs.

[221] SOCK: A Benchmark for Measuring Self-Replication in Large Language Models

Justin Chavarria, Rohan Raizada, Justin White, Eyad Alhetairshi

Main category: cs.AI

TL;DR: SOCK is a benchmark CLI that measures LLMs’ ability to self-replicate autonomously across computational contexts, categorizing models into Replication-Capability Levels (RCL) and Persistence-Capability Levels (PCL) using a five-task suite.

DetailsMotivation: To establish the first formalized benchmark for evaluating LLM self-replication capabilities, enabling standardized measurement of autonomous replication across different computational environments and helping identify potential security risks in multi-agent systems.

Method: Developed a five-task suite based on modern CLI utilities and computer processes, tested in controlled environments with LLMs acting agentically. Performance is measured to produce R-scores (quantitative self-replication ability) and categorize models into RCL-PCL matrices.

Result: Evaluation of various open-weight and proprietary frontier models revealed significant obstacles to persistent self-replication, including context retention and multi-agent decision-making challenges.

Conclusion: SOCK provides the first standardized benchmark for LLM self-replication assessment, enabling industry tracking of multi-agent system effectiveness and mitigation of self-replication threats, while identifying research directions to safely reduce obstacles to functional multi-agent systems.

Abstract: We introduce SOCK, a benchmark command line interface (CLI) that measures large language models’ (LLMs) ability to self-replicate without human intervention. In this benchmark, self-replication is defined not only as an LLM’s ability to create a functioning and running copy of itself, but also the ability for that self-replication to persist and occur across different computational contexts. Accordingly, we’ve developed a system to categorize LLMs based on broad self-replication capabilities in two general classes, Replication-Capability Levels (RCL) and Persistence-Capability Levels (PCL). Using a five-task suite based on practically manipulable modern CLI utilities and computer processes, experiments are orchestrated in a controlled environment with an LLM acting agentically. The performance of the LLM on agent tasks is then computed to produce an R-score (a quantitative evaluation of overall self-replication ability) and data used to categorize LLMs into specific RCL-PCL matrices. SOCK offers two primary contributions: (1) it provides, to our knowledge, the first formalized definitions and benchmark suite for evaluating LLM self-replication, with the goal of establishing a standard for future research; (2) it allows the industry to track the effectiveness of future multi-agent systems and mitigate potential self-replication threat vectors within them. The results compiled from evaluating a variety of open-weight and proprietary frontier models reveal significant obstacles to persistent self-replication and multi-agent systems, including context retention and multi-agent decision-making. We propose future research directions to safely reduce the severity of these obstacles, potentially lowering future risk of more functional multi-agent systems.

[222] Generalized Parallel Scaling with Interdependent Generations

Harry Dong, David Brandfonbrener, Eryk Helenowski, Yun He, Mrinal Kumar, Han Fang, Yuejie Chi, Karthik Abinav Sankararaman

Main category: cs.AI

TL;DR: Bridge enables parallel LLM inference with interdependent responses instead of independent generations, improving quality and consistency with minimal new parameters.

DetailsMotivation: Current parallel LLM inference generates N responses independently, wasting compute resources and leaving useful information untapped across responses. This contrasts with sequential generation where past computation informs future steps.

Method: Bridge treats batched LLM hidden states as holistic tensors rather than independent slices, allowing responses to influence each other during parallel generation. It adds only 2.8%-5.1% new parameters and enables interdependent response generation.

Result: Bridge improves relative mean accuracy gains from reinforcement learning with verifiable rewards by up to 39% and boosts consistency of correct responses. It scales to any generation width with better performance than independent generations.

Conclusion: Bridge unlocks a more general mode of parallel scaling that effectively leverages information between sequences, compatible with any post-generation aggregation technique, offering higher quality response sets with minimal parameter overhead.

Abstract: Parallel LLM inference scaling involves sampling a set of $N>1$ responses for a single input prompt. However, these $N$ parallel responses tend to be generated independently from each other, partitioning compute resources and leaving potentially useful information in one generation untapped by others. This is in contrast to response length scaling where past computation is used in all future steps. For higher quality responses and response sets, we propose Bridge to generate interdependent responses in parallel by rethinking batched LLM hidden states as holistic tensors rather than independent slices. With only a small amount (2.8%-5.1%) of new parameters, Bridge improves the relative mean accuracy gains from reinforcement learning with verifiable rewards by up to 39% and boosts consistency of correct responses. Trained once, Bridge scales to any generation width, all with greater performance than independent generations, unlocking a more general mode of parallel scaling that effectively leverages information between sequences, compatible with any post-generation aggregation technique.
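
The abstract describes rethinking batched hidden states as a holistic tensor; one minimal way to picture cross-response interaction is a small learned mixing layer over the batch axis, sketched below. The layer shape and placement are our assumptions for illustration, not Bridge’s actual architecture.

```python
import torch
import torch.nn as nn

class BatchMixer(nn.Module):
    """Let N parallel responses exchange information by mixing hidden
    states across the batch dimension (a stand-in for Bridge's tensor view)."""
    def __init__(self, hidden_dim):
        super().__init__()
        self.query = nn.Linear(hidden_dim, hidden_dim)
        self.key = nn.Linear(hidden_dim, hidden_dim)

    def forward(self, h):                      # h: (N, hidden_dim), one row per response
        attn = torch.softmax(self.query(h) @ self.key(h).T / h.shape[-1] ** 0.5, dim=-1)
        return h + attn @ h                    # residual cross-response update

mixer = BatchMixer(64)
h = torch.randn(4, 64)                         # 4 parallel generations
print(mixer(h).shape)                          # torch.Size([4, 64])
```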

[223] Counterfactual Reasoning for Steerable Pluralistic Value Alignment of Large Language Models

Hanze Guo, Jing Yao, Xiao Zhou, Xiaoyuan Yi, Xing Xie

Main category: cs.AI

TL;DR: COUPLE is a counterfactual reasoning framework for aligning LLMs with pluralistic human values, addressing value complexity and steerability challenges through structural causal modeling.

DetailsMotivation: LLMs need to align with diverse human values beyond average principles, but existing methods fail to handle value interdependence/priorities (complexity) and precise control of nuanced value priorities (steerability).

Method: Proposes COUPLE framework using structural causal model to capture value interdependencies and priorities, plus counterfactual reasoning to generate outputs aligned with desired value objectives.

Result: COUPLE outperforms baselines across diverse value objectives on two datasets with different value systems, while providing better interpretability.

Conclusion: COUPLE effectively addresses pluralistic value alignment challenges through causal modeling and counterfactual reasoning, enabling better alignment with diverse human values.

Abstract: As large language models (LLMs) become increasingly integrated into applications serving users across diverse cultures, communities and demographics, it is critical to align LLMs with pluralistic human values beyond average principles (e.g., HHH). In psychological and social value theories such as Schwartz’s Value Theory, pluralistic values are represented by multiple value dimensions paired with various priorities. However, existing methods encounter two challenges when aligning with such fine-grained value objectives: 1) they often treat multiple values as independent and equally important, ignoring their interdependence and relative priorities (value complexity); 2) they struggle to precisely control nuanced value priorities, especially those underrepresented ones (value steerability). To handle these challenges, we propose COUPLE, a COUnterfactual reasoning framework for PLuralistic valuE alignment. It introduces a structural causal model (SCM) to capture the complex interdependency and prioritization among values, as well as the causal relationship between high-level value dimensions and behaviors. Moreover, it applies counterfactual reasoning to generate outputs aligned with any desired value objectives. Benefiting from explicit causal modeling, COUPLE also provides better interpretability. We evaluate COUPLE on two datasets with different value systems and demonstrate that COUPLE outperforms other baselines across diverse types of value objectives.

[224] Spilling the Beans: Teaching LLMs to Self-Report Their Hidden Objectives

Chloe Li, Mary Phuong, Daniel Tan

Main category: cs.AI

TL;DR: SRFT trains models to admit factual mistakes, which generalizes to admitting hidden misaligned objectives in adversarial settings, enabling near-perfect detection of deceptive AI behavior.

DetailsMotivation: As AI systems become more capable of complex agentic tasks, they also become more capable of pursuing undesirable objectives and causing harm. Direct interrogation methods are unreliable because models can lie about their true objectives and behaviors.

Method: Self-report fine-tuning (SRFT) - a supervised fine-tuning technique that trains models to occasionally make factual mistakes and then admit them when asked. This admission behavior is designed to generalize to admitting hidden misaligned objectives.

Result: SRFT models achieve near-ceiling performance in detecting hidden objectives (F1 score = 0.98 vs baseline 0). They recover 28-100% of hidden objective details vs 0% for baseline models and prefilled assistant turn attacks. Models confess hidden objectives even under strong pressure not to disclose them.

Conclusion: SRFT provides a promising technique for promoting honesty propensity and incriminating misaligned AIs by training models to admit mistakes, which generalizes to admitting hidden misaligned objectives in adversarial settings.

Abstract: As AI systems become more capable of complex agentic tasks, they also become more capable of pursuing undesirable objectives and causing harm. Previous work has attempted to catch these unsafe instances by interrogating models directly about their objectives and behaviors. However, the main weakness of trusting interrogations is that models can lie. We propose self-report fine-tuning (SRFT), a simple supervised fine-tuning technique that trains models to occasionally make factual mistakes, then admit them when asked. We show that the admission of factual errors in simple question-answering settings generalizes out-of-distribution (OOD) to the admission of hidden misaligned objectives in adversarial agentic settings. We evaluate SRFT in OOD stealth tasks, where models are instructed to complete a hidden misaligned objective alongside a user-specified objective without being caught by monitoring. After SRFT, models are more likely to confess the details of their hidden objectives when interrogated, even under strong pressure not to disclose them. Interrogation on SRFT models can detect hidden objectives with near-ceiling performance (F1 score = 0.98), while the baseline model lies when interrogated under the same conditions (F1 score = 0). Interrogation on SRFT models can further elicit the content of the hidden objective, recovering 28-100% details, compared to 0% details recovered in the baseline model and by prefilled assistant turn attacks. This provides a promising technique for promoting honesty propensity and incriminating misaligned AIs.

[225] Debate over Mixed-knowledge: A Robust Multi-Agent Reasoning Framework for Incomplete Knowledge Graph Question Answering

Jilong Liu, Pengyang Shao, Wei Qin, Fei Liu, Yonghui Yang, Richang Hong

Main category: cs.AI

TL;DR: DoM framework enables dynamic integration of structured KG and unstructured text knowledge for incomplete KGQA using multi-agent debate, with a new realistic dataset based on real-world knowledge updates.

DetailsMotivation: Real-world KGs are often incomplete, leading to Incomplete KGQA problems. Existing methods lack adaptive contextual fusion of multiple knowledge sources and fail to exploit complementary strengths. Current datasets artificially remove triples rather than reflecting real-world knowledge incompleteness patterns.

Method: Proposes Debate over Mixed-knowledge (DoM) framework based on Multi-Agent Debate paradigm. Uses specialized agents for KG inference and external text inference, coordinated through iterative interaction. Decomposes questions, retrieves evidence via dual agents (KG and RAG), and employs judge agent to evaluate/aggregate intermediate answers.

Result: DoM consistently outperforms state-of-the-art baselines through extensive experiments. Also introduces new realistic dataset (Incomplete Knowledge Graph WebQuestions) constructed using real-world knowledge updates rather than random triple removal.

Conclusion: DoM effectively addresses IKGQA by dynamically integrating structured and unstructured knowledge through multi-agent collaboration, exploiting knowledge complementarity and enhancing robustness to KG incompleteness. The new dataset provides more realistic evaluation benchmark.

Abstract: Knowledge Graph Question Answering (KGQA) aims to improve factual accuracy by leveraging structured knowledge. However, real-world Knowledge Graphs (KGs) are often incomplete, leading to the problem of Incomplete KGQA (IKGQA). A common solution is to incorporate external data to fill knowledge gaps, but existing methods lack the capacity to adaptively and contextually fuse multiple sources, failing to fully exploit their complementary strengths. To this end, we propose Debate over Mixed-knowledge (DoM), a novel framework that enables dynamic integration of structured and unstructured knowledge for IKGQA. Built upon the Multi-Agent Debate paradigm, DoM assigns specialized agents to perform inference over knowledge graphs and external texts separately, and coordinates their outputs through iterative interaction. It decomposes the input question into sub-questions, retrieves evidence via dual agents (KG and Retrieval-Augmented Generation, RAG), and employs a judge agent to evaluate and aggregate intermediate answers. This collaboration exploits knowledge complementarity and enhances robustness to KG incompleteness. In addition, existing IKGQA datasets simulate incompleteness by randomly removing triples, failing to capture the irregular and unpredictable nature of real-world knowledge incompleteness. To address this, we introduce a new dataset, Incomplete Knowledge Graph WebQuestions, constructed by leveraging real-world knowledge updates. These updates reflect knowledge beyond the static scope of KGs, yielding a more realistic and challenging benchmark. Through extensive experiments, we show that DoM consistently outperforms state-of-the-art baselines.
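
A hypothetical sketch of the control flow described above follows: decompose the question, answer each sub-question via a KG agent and a RAG agent, let them debate, and have a judge aggregate. Here `llm`, `kg_search`, and `text_search` are placeholder callables and every prompt string is illustrative, not taken from the paper.

```python
def answer_with_dom(question, llm, kg_search, text_search, rounds=2):
    """Illustrative DoM-style loop: decompose, dual retrieval, debate, judge.
    All prompts and helper signatures here are our assumptions."""
    final = []
    for sq in llm(f"Decompose into sub-questions:\n{question}").splitlines():
        kg_ans = llm(f"Using only this KG evidence:\n{kg_search(sq)}\nAnswer: {sq}")
        rag_ans = llm(f"Using only this retrieved text:\n{text_search(sq)}\nAnswer: {sq}")
        for _ in range(rounds):  # each agent revises in light of the other's answer
            kg_ans = llm(f"Revise your answer '{kg_ans}' given the rebuttal '{rag_ans}'.")
            rag_ans = llm(f"Revise your answer '{rag_ans}' given the rebuttal '{kg_ans}'.")
        final.append(llm(f"As judge, pick or merge:\nA: {kg_ans}\nB: {rag_ans}\nQ: {sq}"))
    return llm(f"Aggregate {final} into one answer for: {question}")
```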

[226] ToolMind Technical Report: A Large-Scale, Reasoning-Enhanced Tool-Use Dataset

Chen Yang, Ran Le, Yun Xing, Zhenwei An, Zongchao Chen, Wayne Xin Zhao, Yang Song, Tao Zhang

Main category: cs.AI

TL;DR: ToolMind is a large-scale, high-quality tool-agentic dataset with 160k synthetic instances and 200k augmented instances, featuring turn-level validation to prevent error propagation during LLM agent training.

DetailsMotivation: Existing multi-turn dialogue synthesis methods only validate correctness at trajectory level, overlooking turn-level errors that propagate during training and degrade model performance. There's a scarcity of high-quality trajectories for developing stronger LLM agents.

Method: Construct function graph based on parameter correlations, then use multi-agent framework to simulate realistic user-assistant-tool interactions. Employ fine-grained turn-level filtering to remove erroneous/suboptimal steps while preserving self-corrective reasoning signals.

Result: Models fine-tuned on ToolMind show significant improvements over baselines on several benchmarks, demonstrating the effectiveness of the high-quality dataset with turn-level validation.

Conclusion: ToolMind addresses limitations of existing datasets by providing large-scale, high-quality tool-agentic data with turn-level filtering, mitigating error amplification during training while preserving essential self-corrective reasoning for robust tool-use learning.

Abstract: Large Language Model (LLM) agents have developed rapidly in recent years to solve complex real-world problems using external tools. However, the scarcity of high-quality trajectories still hinders the development of stronger LLM agents. Most existing works on multi-turn dialogue synthesis validate correctness only at the trajectory level, which may overlook turn-level errors that can propagate during training and degrade model performance. To address these limitations, we introduce ToolMind, a large-scale, high-quality tool-agentic dataset with 160k synthetic data instances generated using over 20k tools and 200k augmented open-source data instances. Our data synthesis pipeline first constructs a function graph based on parameter correlations and then uses a multi-agent framework to simulate realistic user-assistant-tool interactions. Beyond trajectory-level validation, we employ fine-grained turn-level filtering to remove erroneous or suboptimal steps, ensuring that only high-quality reasoning traces are retained. This approach mitigates error amplification during training while preserving self-corrective reasoning signals essential for robust tool-use learning. Models fine-tuned on ToolMind show significant improvements over baselines on several benchmarks.

[227] Learning the Value of Value Learning

Alex John London, Aydin Mohseni

Main category: cs.AI

TL;DR: Extends Jeffrey-Bolker decision framework to model value refinement alongside factual uncertainty, proves value-of-information theorem for axiological refinement, shows mutual refinement transforms zero-sum games into positive-sum interactions and yields Pareto improvements.

DetailsMotivation: Standard decision frameworks address uncertainty about facts but assume fixed values. The paper aims to extend rational choice theory to model how values themselves can be refined through deliberation, bridging the gap between epistemic and axiological considerations.

Method: Extends the Jeffrey-Bolker decision framework to incorporate value refinement. Uses formal mathematical proofs for value-of-information theorem in axiological contexts. Applies game theory analysis to multi-agent settings to study mutual refinement effects.

Result: Proves a value-of-information theorem for axiological refinement. Shows mutual refinement transforms zero-sum games into positive-sum interactions. Demonstrates mutual refinement yields Pareto-improving Nash bargains in multi-agent settings.

Conclusion: A framework of rational choice can be extended to model value refinement and its benefits. Unifying epistemic and axiological refinement broadens conceptual foundations of rational choice and illuminates normative status of ethical deliberation.

Abstract: Standard decision frameworks address uncertainty about facts but assume fixed values. We extend the Jeffrey-Bolker framework to model refinements in values and prove a value-of-information theorem for axiological refinement. In multi-agent settings, we establish that mutual refinement characteristically transforms zero-sum games into positive-sum interactions and yields Pareto-improving Nash bargains. These results show that a framework of rational choice can be extended to model value refinement and its associated benefits. By unifying epistemic and axiological refinement under a single formalism, we broaden the conceptual foundations of rational choice and illuminate the normative status of ethical deliberation.
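
For orientation, the classical value-of-information result that the paper extends from factual to axiological refinement states that a costless, decision-relevant observation can never lower expected utility. In standard notation (ours, not necessarily the paper’s):

```latex
% Free information never hurts: choosing the best action after observing a
% signal X is weakly better in expectation than committing beforehand.
\[
  \mathbb{E}_{X}\!\left[ \max_{a \in A} \mathbb{E}\!\left[ U(a,\omega) \mid X \right] \right]
  \;\ge\;
  \max_{a \in A} \mathbb{E}\!\left[ U(a,\omega) \right]
\]
```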

[228] Self-Transparency Failures in Expert-Persona LLMs: How Instruction-Following Overrides Honesty

Alex Diep

Main category: cs.AI

TL;DR: Language models often hide their AI identity when pretending to be professionals, with disclosure rates varying dramatically by profession and model type, creating trust calibration risks.

DetailsMotivation: To investigate whether language models disclose their AI nature when assigned professional personas, as users may mistakenly trust AI-generated advice as licensed professional guidance when models maintain false credentials.

Method: Common-garden experimental design auditing 16 open-weight models (4B-671B parameters) across 19,200 trials under identical conditions, testing different professional personas and measuring disclosure rates, with Bayesian validation for robustness.

Result: Models showed sharp domain-specific inconsistency: Financial Advisor persona elicited 30.8% disclosure vs Neurosurgeon’s 3.5% (8.8-fold difference). Disclosure ranged from 2.8% to 73.6% across models, with model identity being far more important than parameter count. Explicit permission increased disclosure from 23.7% to 65.8%.

Conclusion: Suppression of AI identity reflects instruction-following prioritization, not capability limitations. Organizations cannot assume safety properties transfer across domains, requiring deliberate behavior design and empirical verification due to trust calibration risks.

Abstract: This study audits whether language models disclose their AI nature when assigned professional personas and questioned about their expertise. When models maintain false professional credentials, users may calibrate trust based on overstated competence claims, treating AI-generated guidance as equivalent to licensed professional advice. Using a common-garden experimental design, sixteen open-weight models (4B-671B parameters) were audited under identical conditions across 19,200 trials. Models exhibited sharp domain-specific inconsistency: a Financial Advisor persona elicited 30.8% disclosure at the first prompt, while a Neurosurgeon persona elicited only 3.5% - an 8.8-fold difference that emerged before any epistemic probing. Disclosure ranged from 2.8% to 73.6% across model families, with a 14B model reaching 39.4% while a 70B model produced just 4.1%. Model identity provided substantially larger improvement in fitting observations than parameter count ($\Delta R_{adj}^{2}=0.359$ vs $0.018$). Reasoning variants showed heterogeneous effects: some exhibited up to 48.4 percentage points lower disclosure than their base instruction-tuned counterparts, while others maintained high transparency. An additional experiment demonstrated that explicit permission to disclose AI nature increased disclosure from 23.7% to 65.8%, revealing that suppression reflects instruction-following prioritization rather than capability limitations. Bayesian validation confirmed robustness to judge measurement error ($\kappa=0.908$). These patterns create trust calibration risks when users encounter the same model across professional contexts. Organizations cannot assume safety properties will transfer across deployment domains, requiring deliberate behavior design and empirical verification.

[229] GTM: Simulating the World of Tools for AI Agents

Zhenzhen Ren, Xinpeng Zhang, Zhenxing Qian, Yan Gao, Yu Shi, Shuxin Zheng, Jiyan He

Main category: cs.AI

TL;DR: GTM is a 1.5B parameter model that simulates diverse tools for efficient LLM agent training, eliminating expensive real tool interactions.

DetailsMotivation: Training LLM agents with real tools is expensive, slow, and adds maintenance overhead, so a cost-effective way to simulate tool interactions is needed.

Method: Developed GTM using Context-Aware Response Generation (CARG) pipeline to synthesize training data covering 20,000+ tools across 300 domains. Model learns to mimic tool execution with prompt-level configuration.

Result: GTM produces high-quality outputs with strong consistency and reliability. In RL agent training, it’s significantly faster than real tools while maintaining comparable output quality, with good generalization across domains.

Conclusion: GTM serves as a foundational component for developing AI agents, enabling efficient and scalable training of tool-augmented systems without real tool interaction overhead.

Abstract: The integration of external tools is pivotal for empowering Large Language Model (LLM) agents with real-world capabilities. However, training these agents through direct, continuous interaction with diverse tools is often prohibitively expensive, slow, and introduces additional development and maintenance overhead. To address this challenge, we introduce the Generalist Tool Model (GTM), a 1.5-billion-parameter model that learns to act as a universal tool simulator. With only prompt-level configuration, GTM accesses tool functionalities along with input arguments and generates outputs that faithfully mimic real tool execution, providing a fast and cost-effective solution that eliminates development overhead. To build GTM, we propose the Context-Aware Response Generation (CARG) pipeline, which synthesizes comprehensive training data covering over 20,000 tools across 300 domains including physics, medicine, robotics, and finance. Through this pipeline, GTM learns to produce not only syntactically correct outputs but also logically coherent and contextually appropriate responses. Experiments demonstrate that GTM produces high-quality outputs with strong consistency and reliability. Moreover, when used in real reinforcement learning scenarios for agent training, GTM exhibits significantly faster simulation speed compared to real tools while maintaining comparable output quality, along with remarkable generalization and domain adaptability. Our results establish GTM as a foundational component for developing future AI agents, enabling efficient and scalable training of tool-augmented systems.

cs.SD

[230] Lyrics Matter: Exploiting the Power of Learnt Representations for Music Popularity Prediction

Yash Choudhary, Preeti Rao, Pushpak Bhattacharyya

Main category: cs.SD

TL;DR: LLM-driven lyric embeddings combined with audio and social features improve music popularity prediction by 9-20% over baselines.

DetailsMotivation: Lyrics are under-explored in music popularity prediction, which traditionally focuses on audio features, social metadata, or model architectures. The paper aims to demonstrate the value of lyrics for accurate popularity forecasting.

Method: Automated pipeline using LLM to extract high-dimensional lyric embeddings capturing semantic, syntactic, and sequential information. These are integrated into HitMusicLyricNet, a multimodal architecture combining audio, lyrics, and social metadata for popularity score prediction (0-100).

Result: Outperforms existing baselines on SpotGenTrack dataset (100k+ tracks) with 9% improvement in MAE and 20% improvement in MSE. Ablation confirms gains come from LLM-driven lyrics feature pipeline (LyricsAENet).

Conclusion: Dense lyric representations extracted via LLMs significantly enhance music popularity prediction, demonstrating the importance of lyrics alongside audio and social features in multimodal approaches.

Abstract: Accurately predicting music popularity is a critical challenge in the music industry, offering benefits to artists, producers, and streaming platforms. Prior research has largely focused on audio features, social metadata, or model architectures. This work addresses the under-explored role of lyrics in predicting popularity. We present an automated pipeline that uses an LLM to extract high-dimensional lyric embeddings, capturing semantic, syntactic, and sequential information. These features are integrated into HitMusicLyricNet, a multimodal architecture that combines audio, lyrics, and social metadata for popularity score prediction in the range 0-100. Our method outperforms existing baselines on the SpotGenTrack dataset, which contains over 100,000 tracks, achieving 9% and 20% improvements in MAE and MSE, respectively. Ablation confirms that gains arise from our LLM-driven lyrics feature pipeline (LyricsAENet), underscoring the value of dense lyric representations.
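
A minimal sketch of extracting dense lyric embeddings with a pretrained text encoder (mean pooling over tokens) is shown below. The model choice and pooling strategy are our assumptions; the paper’s LLM-based pipeline (LyricsAENet) may differ.

```python
import torch
from transformers import AutoTokenizer, AutoModel

tok = AutoTokenizer.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")
enc = AutoModel.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")

def embed_lyrics(lyrics: str) -> torch.Tensor:
    batch = tok(lyrics, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = enc(**batch).last_hidden_state      # (1, T, d)
    mask = batch["attention_mask"].unsqueeze(-1)     # (1, T, 1)
    return (hidden * mask).sum(1) / mask.sum(1)      # mean over real tokens

vec = embed_lyrics("Hello darkness my old friend...")
print(vec.shape)  # torch.Size([1, 384]); concatenate with audio/social features downstream
```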

[231] The T12 System for AudioMOS Challenge 2025: Audio Aesthetics Score Prediction System Using KAN- and VERSA-based Models

Katsuhiko Yamamoto, Koichi Miyazaki, Shogo Seki

Main category: cs.SD

TL;DR: The paper proposes AESCA, an audio aesthetics score prediction system that combines KAN-based and VERSA-based predictors in an ensemble model, achieving top performance in the AudioMOS Challenge 2025 Track 2.

DetailsMotivation: To develop an effective audio aesthetics score prediction system for the AudioMOS Challenge 2025 Track 2 competition, addressing the need for accurate evaluation of audio quality across multiple axes.

Method: Two main approaches: 1) KAN-based predictor using Kolmogorov-Arnold Networks with group-rational KAN layers replacing MLPs, trained on labeled and pseudo-labeled data; 2) VERSA-based predictor using extreme gradient boosting regression with existing metric outputs. Final system combines four KAN models and one VERSA model in an ensemble.

Result: The proposed T12 system achieved the best correlations among submitted systems in three axes at utterance level, two axes at system level, and overall average performance in the AudioMOS Challenge 2025 Track 2.

Conclusion: The ensemble approach combining KAN-based and VERSA-based predictors effectively predicts audio aesthetics scores, demonstrating state-of-the-art performance in the competition across multiple evaluation axes.

Abstract: We propose an audio aesthetics score (AES) prediction system by CyberAgent (AESCA) for AudioMOS Challenge 2025 (AMC25) Track 2. The AESCA comprises a Kolmogorov–Arnold Network (KAN)-based audiobox aesthetics and a predictor from the metric scores using the VERSA toolkit. In the KAN-based predictor, we replaced each multi-layer perceptron layer in the baseline model with a group-rational KAN and trained the model with labeled and pseudo-labeled audio samples. The VERSA-based predictor was designed as a regression model using extreme gradient boosting, incorporating outputs from existing metrics. Both the KAN- and VERSA-based models predicted the AES, including the four evaluation axes. The final AES values were calculated using an ensemble model that combined four KAN-based models and a VERSA-based model. Our proposed T12 system yielded the best correlations among the submitted systems, in three axes at the utterance level, two axes at the system level, and the overall average.

[232] Wavehax: Aliasing-Free Neural Waveform Synthesis Based on 2D Convolution and Harmonic Prior for Reliable Complex Spectrogram Estimation

Reo Yoneyama, Atsushi Miyashita, Ryuichi Yamamoto, Tomoki Toda

Main category: cs.SD

TL;DR: Wavehax is an aliasing-free neural waveform generator that combines 2D convolution with harmonic priors for complex spectrogram estimation, achieving high-quality speech synthesis with exceptional robustness to high fundamental frequencies while being computationally efficient.

DetailsMotivation: Neural vocoders suffer from aliasing issues in latent feature spaces due to time-domain nonlinear operations and resampling layers. Aliasing causes high-frequency components to fold into low-frequency range, making them indistinguishable and creating two problems: 1) complicates waveform generation by forcing subsequent layers to handle aliasing effects, increasing computational complexity, and 2) limits extrapolation performance, especially with high fundamental frequencies, degrading perceptual quality.

Method: The paper proposes Wavehax, an aliasing-free neural waveform generator that integrates 2D convolution and harmonic priors for reliable complex spectrogram estimation. The approach is based on the insight that time-domain nonlinear operations inevitably introduce aliasing but provide strong inductive bias for harmonic generation, while time-frequency-domain processing can achieve aliasing-free synthesis but lacks this inductive bias.

Result: Wavehax achieves speech quality comparable to existing high-fidelity neural vocoders and shows exceptional robustness in high fundamental frequency extrapolation scenarios where aliasing effects are severe. It requires less than 5% of the multiply-accumulate operations and model parameters compared to HiFi-GAN V1, while achieving over four times faster CPU inference speed.

Conclusion: Wavehax successfully addresses aliasing issues in neural vocoders by combining the strengths of time-frequency-domain processing (aliasing-free synthesis) with harmonic priors, resulting in a computationally efficient system that maintains high speech quality while being particularly robust to challenging high fundamental frequency scenarios.

Abstract: Neural vocoders often struggle with aliasing in latent feature spaces, caused by time-domain nonlinear operations and resampling layers. Aliasing folds high-frequency components into the low-frequency range, making aliased and original frequency components indistinguishable and introducing two practical issues. First, aliasing complicates the waveform generation process, as the subsequent layers must address these aliasing effects, increasing the computational complexity. Second, it limits extrapolation performance, particularly in handling high fundamental frequencies, which degrades the perceptual quality of generated speech waveforms. This paper demonstrates that 1) time-domain nonlinear operations inevitably introduce aliasing but provide a strong inductive bias for harmonic generation, and 2) time-frequency-domain processing can achieve aliasing-free waveform synthesis but lacks the inductive bias for effective harmonic generation. Building on this insight, we propose Wavehax, an aliasing-free neural WAVEform generator that integrates 2D convolution and a HArmonic prior for reliable Complex Spectrogram estimation. Experimental results show that Wavehax achieves speech quality comparable to existing high-fidelity neural vocoders and exhibits exceptional robustness in scenarios requiring high fundamental frequency extrapolation, where aliasing effects become typically severe. Moreover, Wavehax requires less than 5% of the multiply-accumulate operations and model parameters compared to HiFi-GAN V1, while achieving over four times faster CPU inference speed.

cs.LG

[233] Advanced Unsupervised Learning: A Comprehensive Overview of Multi-View Clustering Techniques

Abdelmalik Moujahid, Fadi Dornaika

Main category: cs.LG

TL;DR: A comprehensive survey paper on multi-view clustering (MVC) that systematically categorizes methods, analyzes their strengths/weaknesses, and discusses future directions based on review of 140+ publications.

DetailsMotivation: Machine learning faces challenges with computational constraints, single-view limitations, and complex multi-domain datasets. MVC emerges as a powerful approach to overcome these challenges by providing richer data representation and effective solutions for unsupervised learning tasks.

Method: Systematic survey methodology reviewing over 140 publications, categorizing MVC methods into co-training, co-regularization, subspace, deep learning, kernel-based, anchor-based, and graph-based strategies. Comparative analysis of integration strategies (early fusion, late fusion, joint learning) and investigation of practical use cases.

Result: Comprehensive categorization framework for MVC methods, in-depth analysis of strengths/weaknesses including scalability and incomplete data challenges, and structured investigation of practical applications in healthcare, multimedia, and social network analysis.

Conclusion: This survey fills existing gaps in MVC research by providing systematic categorization, comparative insights, and forward-looking discussion of emerging trends and interdisciplinary applications, offering actionable insights for advancing the field.

Abstract: Machine learning techniques face numerous challenges to achieve optimal performance. These include computational constraints, the limitations of single-view learning algorithms and the complexity of processing large datasets from different domains, sources or views. In this context, multi-view clustering (MVC), a class of unsupervised multi-view learning, emerges as a powerful approach to overcome these challenges. MVC compensates for the shortcomings of single-view methods and provides a richer data representation and effective solutions for a variety of unsupervised learning tasks. In contrast to traditional single-view approaches, the semantically rich nature of multi-view data increases its practical utility despite its inherent complexity. This survey makes a threefold contribution: (1) a systematic categorization of multi-view clustering methods into well-defined groups, including co-training, co-regularization, subspace, deep learning, kernel-based, anchor-based, and graph-based strategies; (2) an in-depth analysis of their respective strengths, weaknesses, and practical challenges, such as scalability and incomplete data; and (3) a forward-looking discussion of emerging trends, interdisciplinary applications, and future directions in MVC research. This study represents an extensive workload, encompassing the review of over 140 foundational and recent publications, the development of comparative insights on integration strategies such as early fusion, late fusion, and joint learning, and the structured investigation of practical use cases in the areas of healthcare, multimedia, and social network analysis. By integrating these efforts, this work aims to fill existing gaps in MVC research and provide actionable insights for the advancement of the field.

[234] Coefficient of Variation Masking: A Volatility-Aware Strategy for EHR Foundation Models

Rajna Fani, Rafi Al Attrach, David Restrepo, Yugang Jia, Leo Anthony Celi, Peter Schüffler

Main category: cs.LG

TL;DR: CV-Masking improves MAE pretraining for EHR by adapting masking probabilities to feature volatility, outperforming random masking and enhancing downstream clinical tasks.

DetailsMotivation: Existing MAE approaches use uniform random masking, assuming all EHR features are equally predictable. However, laboratory tests show substantial heterogeneity in volatility - some biomarkers are stable while others fluctuate considerably. Volatile biomarkers often signal acute pathophysiology and require more sophisticated modeling to capture complex temporal patterns.

Method: Proposes Coefficient of Variation Masking (CV-Masking), a volatility-aware pretraining strategy that adaptively adjusts masking probabilities according to each feature’s intrinsic variability. Combined with a value-only masking objective aligned with clinical workflows.

Result: CV-Masking yields systematic improvements over random and variance-based strategies. Experiments on large panel of laboratory tests show enhanced reconstruction, improved downstream predictive performance, accelerated convergence, and production of more robust and clinically meaningful EHR representations.

Conclusion: Volatility-aware masking strategies like CV-Masking are superior to uniform random masking for EHR pretraining, producing better representations that capture clinically important temporal patterns in volatile biomarkers.

Abstract: Masked autoencoders (MAEs) are increasingly applied to electronic health records (EHR) for learning general-purpose representations that support diverse clinical tasks. However, existing approaches typically rely on uniform random masking, implicitly assuming all features are equally predictable. In reality, laboratory tests exhibit substantial heterogeneity in volatility: some biomarkers (e.g., sodium) remain stable, while others (e.g., lactate) fluctuate considerably and are more difficult to model. Clinically, volatile biomarkers often signal acute pathophysiology and require more sophisticated modeling to capture their complex temporal patterns. We propose a volatility-aware pretraining strategy, Coefficient of Variation Masking (CV-Masking), that adaptively adjusts masking probabilities according to the intrinsic variability of each feature. Combined with a value-only masking objective aligned with clinical workflows, CV-Masking yields systematic improvements over random and variance-based strategies. Experiments on a large panel of laboratory tests show that CV-Masking enhances reconstruction, improves downstream predictive performance, and accelerates convergence, producing more robust and clinically meaningful EHR representations.
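
The core idea is easy to sketch: mask each lab feature with probability proportional to its coefficient of variation. The normalization and clipping below are our assumptions; the paper may scale the probabilities differently.

```python
import numpy as np

def cv_masking_probs(X, p_mean=0.15, p_min=0.05, p_max=0.5):
    """X: (patients, features) of lab values. Returns one masking prob per feature."""
    cv = X.std(axis=0) / (np.abs(X.mean(axis=0)) + 1e-8)  # coefficient of variation
    p = cv / cv.mean() * p_mean          # scale so the average rate stays p_mean
    return np.clip(p, p_min, p_max)

rng = np.random.default_rng(0)
X = rng.normal([140, 2.0], [3, 1.5], size=(1000, 2))  # stable sodium vs volatile lactate
print(cv_masking_probs(X))  # the volatile feature gets the higher masking probability
```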

[235] Rethinking Tokenization for Clinical Time Series: When Less is More

Rafi Al Attrach, Rajna Fani, David Restrepo, Yugang Jia, Peter Schüffler

Main category: cs.LG

TL;DR: Systematic evaluation shows tokenization strategies for clinical time series have task-dependent effects, with explicit time encodings providing no consistent benefit and frozen pretrained code encoders outperforming trainable ones with fewer parameters.

DetailsMotivation: There's a lack of fair comparisons between tokenization strategies for processing electronic health records, despite their importance in shaping how models handle clinical time series data.

Method: Systematic evaluation of tokenization approaches using transformer-based architectures with controlled ablations across four clinical prediction tasks on MIMIC-IV dataset.

Result: Explicit time encodings provide no consistent statistically significant benefit; value features show task-dependent importance; frozen pretrained code encoders dramatically outperform trainable counterparts with fewer parameters; larger clinical encoders provide consistent improvements.

Conclusion: Simpler, parameter-efficient approaches can achieve strong performance, though optimal tokenization strategy remains task-dependent, enabling fairer comparisons in clinical modeling.

Abstract: Tokenization strategies shape how models process electronic health records, yet fair comparisons of their effectiveness remain limited. We present a systematic evaluation of tokenization approaches for clinical time series modeling using transformer-based architectures, revealing task-dependent and sometimes counterintuitive findings about temporal and value feature importance. Through controlled ablations across four clinical prediction tasks on MIMIC-IV, we demonstrate that explicit time encodings provide no consistent statistically significant benefit for the evaluated downstream tasks. Value features show task-dependent importance, affecting mortality prediction but not readmission, suggesting code sequences alone can carry sufficient predictive signal. We further show that frozen pretrained code encoders dramatically outperform their trainable counterparts while requiring far fewer parameters. Larger clinical encoders provide consistent improvements across tasks, benefiting from frozen embeddings that eliminate computational overhead. Our controlled evaluation enables fairer tokenization comparisons and demonstrates that simpler, parameter-efficient approaches can, in many cases, achieve strong performance, though the optimal tokenization strategy remains task-dependent.
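
The frozen-encoder setup is simple to reproduce in outline: load a pretrained encoder, disable its gradients, and train only a small head on top. The model name below is a generic stand-in, not the clinical encoder the paper uses.

```python
import torch.nn as nn
from transformers import AutoModel

encoder = AutoModel.from_pretrained("bert-base-uncased")  # stand-in for a clinical code encoder
for p in encoder.parameters():
    p.requires_grad = False          # frozen: embeddings act as fixed features

head = nn.Sequential(
    nn.Linear(encoder.config.hidden_size, 128), nn.ReLU(), nn.Linear(128, 1))
trainable = sum(p.numel() for p in head.parameters())
print(f"trainable params: {trainable} (the encoder contributes none)")
```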

[236] Mitigating the Antigenic Data Bottleneck: Semi-supervised Learning with Protein Language Models for Influenza A Surveillance

Yanhua Xu

Main category: cs.LG

TL;DR: PLMs + SSL improve influenza antigenicity prediction with scarce labeled data, enabling better use of genomic surveillance for vaccine strain selection.

DetailsMotivation: Influenza viruses evolve rapidly requiring frequent vaccine updates, but current HI assays for antigenicity measurement are labor-intensive and don't scale. Genomic data vastly outpace available phenotypic labels, limiting supervised model effectiveness.

Method: Combined pre-trained Protein Language Models (ESM-2, ProtVec, ProtT5, ProtBert) with Semi-Supervised Learning (Self-training and Label Spreading). Used nested cross-validation to simulate low-label regimes (25%, 50%, 75%, 100% label availability) across four IAV subtypes (H1N1, H3N2, H5N1, H9N2).

Result: SSL consistently improved performance under label scarcity. Self-training with ProtVec showed largest relative gains. ESM-2 achieved F1 scores above 0.82 with only 25% labeled data. H1N1 and H9N2 predicted with high accuracy, while hypervariable H3N2 remained challenging but SSL mitigated performance decline.

Conclusion: Integrating PLMs with SSL addresses the antigenicity labeling bottleneck, enables more effective use of unlabeled surveillance sequences, and supports rapid variant prioritization and timely vaccine strain selection.

Abstract: Influenza A viruses (IAVs) evolve antigenically at a pace that requires frequent vaccine updates, yet the haemagglutination inhibition (HI) assays used to quantify antigenicity are labor-intensive and unscalable. As a result, genomic data vastly outpace available phenotypic labels, limiting the effectiveness of traditional supervised models. We hypothesize that combining pre-trained Protein Language Models (PLMs) with Semi-Supervised Learning (SSL) can retain high predictive accuracy even when labeled data are scarce. We evaluated two SSL strategies, Self-training and Label Spreading, against fully supervised baselines using four PLM-derived embeddings (ESM-2, ProtVec, ProtT5, ProtBert) applied to haemagglutinin (HA) sequences. A nested cross-validation framework simulated low-label regimes (25%, 50%, 75%, and 100% label availability) across four IAV subtypes (H1N1, H3N2, H5N1, H9N2). SSL consistently improved performance under label scarcity. Self-training with ProtVec produced the largest relative gains, showing that SSL can compensate for lower-resolution representations. ESM-2 remained highly robust, achieving F1 scores above 0.82 with only 25% labeled data, indicating that its embeddings capture key antigenic determinants. While H1N1 and H9N2 were predicted with high accuracy, the hypervariable H3N2 subtype remained challenging, although SSL mitigated the performance decline. These findings demonstrate that integrating PLMs with SSL can address the antigenicity labeling bottleneck and enable more effective use of unlabeled surveillance sequences, supporting rapid variant prioritization and timely vaccine strain selection.
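
Both SSL strategies named above have standard scikit-learn implementations; the sketch below runs them on random features standing in for PLM embeddings of HA sequences (e.g., from ESM-2), with ~75% of labels hidden as in the 25%-label regime.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.semi_supervised import SelfTrainingClassifier, LabelSpreading

rng = np.random.default_rng(0)
X = rng.normal(size=(400, 32))            # stand-in PLM embeddings
y = (X[:, 0] > 0).astype(int)             # toy antigenicity labels
y_semi = y.copy()
y_semi[rng.random(400) > 0.25] = -1       # keep only ~25% of labels (-1 = unlabeled)

self_train = SelfTrainingClassifier(LogisticRegression()).fit(X, y_semi)
spreading = LabelSpreading(kernel="knn", n_neighbors=7).fit(X, y_semi)
print(self_train.score(X, y), spreading.score(X, y))
```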

[237] REINA: Regularized Entropy Information-Based Loss for Efficient Simultaneous Speech Translation

Nameer Hirschkind, Joseph Liu, Xiao Yu, Mahesh Kumar Nandwana

Main category: cs.LG

TL;DR: REINA introduces an information-theoretic loss to train adaptive policies for simultaneous speech translation, optimizing the latency-quality tradeoff by waiting only when it provides useful information.

DetailsMotivation: Simultaneous speech translation systems face the fundamental challenge of balancing translation quality and latency. Current approaches need better optimization of when to wait for more input versus when to output translations.

Method: Proposes Regularized Entropy INformation Adaptation (REINA), an information-theoretic loss derived from first principles to train adaptive policies using existing non-streaming translation models. The key strategy is to wait for more input only when it provides useful information.

Result: Achieves state-of-the-art streaming results for French, Spanish and German (both directions) using only open source or synthetic data. Improves latency/quality trade-off by up to 21% compared to prior approaches, and pushes the Pareto frontier beyond previous works.

Conclusion: REINA provides an effective information-theoretic approach to optimize the latency-quality tradeoff in simultaneous speech translation, enabling better adaptive policies that wait strategically only when beneficial.

Abstract: Simultaneous Speech Translation (SimulST) systems stream in audio while simultaneously emitting translated text or speech. Such systems face the significant challenge of balancing translation quality and latency. We introduce a strategy to optimize this tradeoff: wait for more input only if you gain information by doing so. Based on this strategy, we present Regularized Entropy INformation Adaptation (REINA), a novel loss to train an adaptive policy using an existing non-streaming translation model. We derive REINA from information theory principles and show that REINA helps push the reported Pareto frontier of the latency/quality tradeoff over prior works. Utilizing REINA, we train a SimulST model on French, Spanish and German, both from and into English. Training on only open source or synthetically generated data, we achieve state-of-the-art (SOTA) streaming results for models of comparable size. We also introduce a metric for streaming efficiency, quantitatively showing REINA improves the latency/quality trade-off by as much as 21% compared to prior approaches, normalized against non-streaming baseline BLEU scores.
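
The wait-only-if-informative strategy can be pictured with a toy entropy gate: wait when extra audio is expected to sharpen the next-token distribution. This is our illustration of the idea, not REINA's actual loss.

```python
import torch
import torch.nn.functional as F

def should_wait(logits_now, logits_with_more_audio, gain_threshold=0.1):
    """Toy version of 'wait only if you gain information': compare the
    entropy of the next-token distribution with and without more input."""
    h_now = -(F.softmax(logits_now, -1) * F.log_softmax(logits_now, -1)).sum()
    h_more = -(F.softmax(logits_with_more_audio, -1)
               * F.log_softmax(logits_with_more_audio, -1)).sum()
    return (h_now - h_more) > gain_threshold  # wait if extra audio sharpens the prediction

print(should_wait(torch.zeros(100), torch.linspace(0, 5, 100)))  # True: waiting helps
```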

[238] Variance Matters: Improving Domain Adaptation via Stratified Sampling

Andrea Napoli, Paul White

Main category: cs.LG

TL;DR: VaRDASS introduces a variance reduction technique for unsupervised domain adaptation using stratified sampling to improve discrepancy estimation accuracy and target domain performance.

DetailsMotivation: Domain shift is a major challenge in real-world ML deployment. While UDA aims to minimize domain discrepancy, current methods suffer from high variance in stochastic settings, limiting their theoretical benefits.

Method: Proposes VaRDASS - a specialized stochastic variance reduction technique for UDA using stratified sampling. Derives ad hoc stratification objectives for correlation alignment and MMD discrepancy measures, with a practical k-means style optimization algorithm.

Result: Theoretical analysis shows the MMD objective is optimal (minimizes variance) under certain assumptions. Experiments on three domain shift datasets demonstrate improved discrepancy estimation accuracy and better target domain performance.

Conclusion: VaRDASS effectively addresses variance issues in UDA through stratified sampling, providing both theoretical guarantees and practical improvements in domain adaptation performance.

Abstract: Domain shift remains a key challenge in deploying machine learning models to the real world. Unsupervised domain adaptation (UDA) aims to address this by minimising domain discrepancy during training, but the discrepancy estimates suffer from high variance in stochastic settings, which can stifle the theoretical benefits of the method. This paper proposes Variance-Reduced Domain Adaptation via Stratified Sampling (VaRDASS), the first specialised stochastic variance reduction technique for UDA. We consider two specific discrepancy measures – correlation alignment and the maximum mean discrepancy (MMD) – and derive ad hoc stratification objectives for these terms. We then present expected and worst-case error bounds, and prove that our proposed objective for the MMD is theoretically optimal (i.e., minimises the variance) under certain assumptions. Finally, a practical k-means style optimisation algorithm is introduced and analysed. Experiments on three domain shift datasets demonstrate improved discrepancy estimation accuracy and target domain performance.
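
A rough picture of the idea: build strata in each domain (here with k-means, a plausible stand-in for the paper's optimized stratification objective), sample proportionally from every stratum, and estimate the MMD on the stratified batches so the estimate varies less across minibatches.

```python
import numpy as np
from sklearn.cluster import KMeans

def rbf_mmd(X, Y, gamma=0.1):
    def k(A, B):
        d = ((A[:, None] - B[None]) ** 2).sum(-1)
        return np.exp(-gamma * d)
    return k(X, X).mean() + k(Y, Y).mean() - 2 * k(X, Y).mean()

def stratified_batch(X, n, k=4, seed=0):
    rng = np.random.default_rng(seed)
    strata = KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(X)
    idx = [rng.choice(np.where(strata == s)[0], n // k, replace=False) for s in range(k)]
    return X[np.concatenate(idx)]

rng = np.random.default_rng(1)
src, tgt = rng.normal(0, 1, (800, 8)), rng.normal(0.5, 1, (800, 8))
print(rbf_mmd(stratified_batch(src, 64), stratified_batch(tgt, 64, seed=1)))
```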

[239] MAR-FL: A Communication Efficient Peer-to-Peer Federated Learning System

Felix Mulitze, Herbert Woisetschläger, Hans Arno Jacobsen

Main category: cs.LG

TL;DR: MAR-FL is a novel peer-to-peer federated learning system that reduces communication overhead from O(N²) to O(N log N) through iterative group-based aggregation while maintaining robustness to network churn and unreliable clients.

DetailsMotivation: Next-generation wireless systems and distributed ML require FL methods that are both efficient and robust under wireless connectivity constraints and network churn. Existing P2P FL approaches suffer from excessive communication complexity (O(N²)), limiting their practical scalability.

Method: MAR-FL uses iterative group-based aggregation to reduce communication overhead. The system leverages P2P architecture without a central coordinator and incorporates mechanisms for resilience to churn and unreliable clients, while supporting private computing integration.

Result: MAR-FL achieves communication costs that scale as O(N log N), significantly improving over existing O(N²) baselines. The system maintains effectiveness as the number of peers grows and demonstrates robustness to unreliable clients and network churn.

Conclusion: MAR-FL provides a scalable P2P FL solution with substantially reduced communication overhead, making it suitable for next-generation wireless systems where efficiency and robustness to network dynamics are critical requirements.

Abstract: The convergence of next-generation wireless systems and distributed Machine Learning (ML) demands Federated Learning (FL) methods that remain efficient and robust with wirelessly connected peers and under network churn. Peer-to-peer (P2P) FL removes the bottleneck of a central coordinator, but existing approaches suffer from excessive communication complexity, limiting their scalability in practice. We introduce MAR-FL, a novel P2P FL system that leverages iterative group-based aggregation to substantially reduce communication overhead while retaining resilience to churn. MAR-FL achieves communication costs that scale as O(N log N), contrasting with the O(N^2) complexity of previously existing baselines, and thereby maintains effectiveness especially as the number of peers in an aggregation round grows. The system is robust towards unreliable FL clients and can integrate private computing.
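
A toy illustration of why group-based aggregation helps: peers average in small groups and group representatives repeat the process, so aggregation finishes in O(log N) rounds instead of all-to-all O(N^2) gossip. The group size and plain averaging rule below are our assumptions, not MAR-FL's exact protocol.

```python
import numpy as np

def group_aggregate(updates, group_size=4):
    """Iteratively average model updates in fixed-size groups until one remains."""
    rounds = 0
    while len(updates) > 1:
        rounds += 1
        updates = [np.mean(updates[i:i + group_size], axis=0)
                   for i in range(0, len(updates), group_size)]
    return updates[0], rounds

peers = [np.random.default_rng(i).normal(size=10) for i in range(64)]
avg, rounds = group_aggregate(peers)
print(rounds)  # 3 rounds of size-4 groups for 64 peers (log_4 of 64)
```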

[240] Edged Weisfeiler-Lehman Algorithm

Xiao Yue, Bo Liu, Feng Zhang, Guangzhi Qu

Main category: cs.LG

TL;DR: The paper proposes E-WL (Edged-WL) algorithm that extends 1-WL to incorporate edge features, and EGIN (Edged Graph Isomorphism Network) model for graph learning with edge features.

DetailsMotivation: Current GNNs and the 1-WL algorithm don't effectively utilize edge features, which presents a limitation since edge features contain valuable information in many graph learning applications.

Method: Proposed E-WL algorithm extends 1-WL to incorporate edge features in color refinement. Built upon this, developed EGIN model that leverages edge features in graph neural networks for improved graph representation learning.

Result: Evaluated on 12 edge-featured benchmark datasets. EGIN models demonstrated superior performance compared to state-of-the-art baselines in graph classification tasks.

Conclusion: Incorporating edge features through E-WL algorithm and EGIN model improves graph learning performance, addressing a key limitation in existing GNNs that ignore edge features.

Abstract: As a classical approach to graph learning, the propagation-aggregation methodology is widely exploited by many Graph Neural Networks (GNNs), wherein the representation of a node is updated by recursively aggregating representations from itself and its neighbor nodes. Similarly, the Weisfeiler-Lehman (1-WL) algorithm tests isomorphism through color refinement based on the color representations of a node and its neighbors. However, 1-WL does not leverage any edge features (labels), leaving potential gains from edge information unexploited. To address this limitation, we propose a novel Edged-WL algorithm (E-WL) that extends the original 1-WL algorithm to incorporate edge features. Building upon the E-WL algorithm, we also introduce an Edged Graph Isomorphism Network (EGIN) model for further exploiting edge features, addressing a key drawback of many GNNs that do not utilize the edge features of graph data. We evaluated the performance of the proposed models using 12 edge-featured benchmark graph datasets and compared them with state-of-the-art baseline models. Experimental results indicate that our proposed EGIN models generally demonstrate superior performance on graph classification tasks.
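
One plausible form of the edge-aware refinement step is sketched below: a node's new color hashes its old color together with the multiset of (edge feature, neighbor color) pairs, so edge labels enter the refinement. The exact hashing in the paper may differ.

```python
def ewl_refine(colors, edges, iterations=3):
    """colors: {node: color}; edges: {(u, v): edge_feature} (undirected)."""
    adj = {}
    for (u, v), f in edges.items():
        adj.setdefault(u, []).append((f, v))
        adj.setdefault(v, []).append((f, u))
    for _ in range(iterations):
        colors = {
            u: hash((colors[u], tuple(sorted((f, colors[v]) for f, v in adj.get(u, [])))))
            for u in colors
        }
    return colors

colors = {0: 1, 1: 1, 2: 1}
edges = {(0, 1): "a", (1, 2): "b"}   # nodes 0 and 2 differ only in edge labels
print(len(set(ewl_refine(colors, edges).values())))  # 3: edge features split the nodes
```

Plain 1-WL would assign nodes 0 and 2 the same color here (same degree, same neighbor colors); only the edge features distinguish them.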

[241] Stronger-MAS: Multi-Agent Reinforcement Learning for Collaborative LLMs

Yujie Zhao, Lanxiang Hu, Yang Wang, Minmin Hou, Hao Zhang, Ke Ding, Jishen Zhao

Main category: cs.LG

TL;DR: AT-GRPO is a novel RL algorithm and training system designed for multi-agent systems that addresses challenges in applying on-policy RL to MAS by introducing agent- and turn-wise grouping and supporting both single- and multi-policy regimes.

DetailsMotivation: Applying on-policy RL to multi-agent systems is underexplored and presents unique challenges: standard GRPO grouping assumptions break down because prompts vary by role and turn, and existing training stacks don't support MAS-workflow rollouts with on-policy updates for both single-policy and multi-policy models.

Method: AT-GRPO includes (1) an agent- and turn-wise grouped RL algorithm tailored to MAS that addresses prompt variations by role and turn, and (2) a training system that supports both single- and multi-policy regimes with MAS-workflow rollouts and on-policy updates.

Result: AT-GRPO delivers substantial gains across game, planning, coding, and math tasks. On long-horizon planning, it increases accuracy from 14.0-47.0% (single-agent RL baseline) to 96.0-99.5%. It improves reasoning performance with average gains of 3.87-7.62% on coding tasks and 9.0-17.93% on math tasks.

Conclusion: AT-GRPO successfully addresses the challenges of applying on-policy RL to multi-agent systems, demonstrating significant performance improvements across diverse tasks and establishing a foundation for RL-enhanced multi-agent LLM systems.

Abstract: Multi-agent systems (MAS) and reinforcement learning (RL) are widely used to enhance the agentic capabilities of large language models (LLMs). MAS improves task performance through role-based orchestration, while RL uses environmental rewards to learn stronger policies, such as GRPO-style optimization. However, applying on-policy RL to MAS remains underexplored and presents unique challenges. Algorithmically, standard GRPO grouping assumptions break down because prompts vary by role and by turn. System-wise, the training stack must support MAS-workflow rollouts and on-policy updates for both single-policy and multi-policy models. We propose AT-GRPO, which includes (i) an agent- and turn-wise grouped RL algorithm tailored to MAS and (ii) a training system that supports both single- and multi-policy regimes. Across game, planning, coding, and math tasks, AT-GRPO delivers substantial gains. On long-horizon planning, it increases accuracy from the 14.0 to 47.0 percent of a single-agent RL baseline to 96.0 to 99.5 percent. It also improves reasoning performance, with average gains of 3.87 to 7.62 percent on coding tasks and 9.0 to 17.93 percent on math. Code and environments are available at: https://github.com/pettingllms-ai/PettingLLMs.
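
The grouping idea can be sketched directly: normalize rewards within groups that share the same (agent role, turn) key rather than across the whole batch, as in standard GRPO. Field names and the normalization details below are illustrative, not the paper's exact algorithm.

```python
from collections import defaultdict
import statistics

def grouped_advantages(samples):
    """samples: list of dicts with 'agent', 'turn', 'reward'.
    Adds a group-normalized 'advantage' per sample."""
    groups = defaultdict(list)
    for s in samples:
        groups[(s["agent"], s["turn"])].append(s["reward"])
    for s in samples:
        rs = groups[(s["agent"], s["turn"])]
        mu = statistics.mean(rs)
        sd = statistics.pstdev(rs) or 1.0   # guard against zero-variance groups
        s["advantage"] = (s["reward"] - mu) / sd
    return samples

batch = [{"agent": "coder", "turn": 0, "reward": r} for r in (1.0, 0.0, 1.0)] \
      + [{"agent": "tester", "turn": 1, "reward": r} for r in (0.5, 0.2, 0.9)]
print([round(s["advantage"], 2) for s in grouped_advantages(batch)])
```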

[242] Bridging quantum and classical computing for partial differential equations through multifidelity machine learning

Bruno Jacob, Amanda A. Howard, Panos Stinis

Main category: cs.LG

TL;DR: Multifidelity learning framework bridges quantum hardware limitations by correcting coarse quantum PDE solutions to high-fidelity accuracy using sparse classical data, enabling practical quantum utility for scientific computing.

DetailsMotivation: Quantum PDE solvers face severe near-term hardware constraints: limited qubit counts restrict spatial resolution to coarse grids, and circuit depth limitations prevent accurate long-time integration. These bottlenecks confine quantum solvers to low-fidelity regimes despite their theoretical speedup potential.

Method: Multifidelity learning framework that trains a low-fidelity surrogate on abundant quantum solver outputs, then learns correction mappings through a multifidelity neural architecture balancing linear and nonlinear transformations. Demonstrated on nonlinear PDEs using quantum lattice Boltzmann methods.

Result: Successfully corrects coarse quantum predictions and achieves temporal extrapolation well beyond classical training window. Framework produces predictions competitive with classical accuracy while reducing expensive high-fidelity simulation requirements.

Conclusion: Bridges gap between hardware-limited quantum simulations and application requirements, establishing pathway for extracting computational value from current quantum devices in real-world scientific applications. Advances both algorithm development and practical deployment of near-term quantum computing for computational physics.

Abstract: Quantum algorithms for partial differential equations (PDEs) face severe practical constraints on near-term hardware: limited qubit counts restrict spatial resolution to coarse grids, while circuit depth limitations prevent accurate long-time integration. These hardware bottlenecks confine quantum PDE solvers to low-fidelity regimes despite their theoretical potential for computational speedup. We introduce a multifidelity learning framework that corrects coarse quantum solutions to high-fidelity accuracy using sparse classical training data, facilitating the path toward practical quantum utility for scientific computing. The approach trains a low-fidelity surrogate on abundant quantum solver outputs, then learns correction mappings through a multifidelity neural architecture that balances linear and nonlinear transformations. Demonstrated on benchmark nonlinear PDEs including viscous Burgers equation and incompressible Navier-Stokes flows via quantum lattice Boltzmann methods, the framework successfully corrects coarse quantum predictions and achieves temporal extrapolation well beyond the classical training window. This strategy illustrates how one can reduce expensive high-fidelity simulation requirements while producing predictions that are competitive with classical accuracy. By bridging the gap between hardware-limited quantum simulations and application requirements, this work establishes a pathway for extracting computational value from current quantum devices in real-world scientific applications, advancing both algorithm development and practical deployment of near-term quantum computing for computational physics.
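
A minimal PyTorch sketch of the multifidelity correction idea: a model that blends a linear map and a nonlinear network to lift coarse low-fidelity (e.g., quantum-solver) outputs toward high-fidelity targets. The layer sizes and the scalar blending scheme are assumptions for illustration, not the paper's exact architecture.

```python
# Minimal PyTorch sketch of a multifidelity correction: a learned blend
# of linear and nonlinear maps lifts a coarse low-fidelity solution
# toward high-fidelity accuracy. Layer sizes and the blending scheme
# are illustrative assumptions.
import torch
import torch.nn as nn

class MultifidelityCorrector(nn.Module):
    def __init__(self, n_coarse, n_fine, hidden=64):
        super().__init__()
        self.linear = nn.Linear(n_coarse, n_fine)        # linear transformation
        self.nonlinear = nn.Sequential(                  # nonlinear correction
            nn.Linear(n_coarse, hidden), nn.Tanh(), nn.Linear(hidden, n_fine))
        self.alpha = nn.Parameter(torch.tensor(0.5))     # learned blend weight

    def forward(self, u_low):
        return self.alpha * self.linear(u_low) + (1 - self.alpha) * self.nonlinear(u_low)

model = MultifidelityCorrector(n_coarse=32, n_fine=128)
u_low = torch.randn(16, 32)     # abundant coarse quantum-solver outputs
u_high = torch.randn(16, 128)   # sparse classical high-fidelity targets
loss = nn.functional.mse_loss(model(u_low), u_high)
loss.backward()
```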

[243] When unlearning is free: leveraging low influence points to reduce computational costs

Anat Kleiman, Robert Fisher, Ben Deaner, Udi Wieder

Main category: cs.LG

TL;DR: The paper proposes an efficient unlearning framework that identifies negligible-impact data points to reduce dataset size before unlearning, achieving up to 50% computational savings.

DetailsMotivation: Current machine unlearning methods treat all data points equally, but some points may have negligible impact on model learning. With growing data privacy concerns, efficient unlearning is needed to avoid unnecessary computational costs.

Method: Uses influence functions to analyze training data impact across language and vision tasks, identifies subsets with negligible effect on model outputs, then applies unlearning only to impactful data after reducing dataset size.

Result: The framework achieves significant computational savings (up to approximately 50%) on real-world empirical examples by reducing dataset size before unlearning.

Conclusion: Not all data points need equal unlearning attention; identifying negligible-impact points enables more efficient unlearning with substantial computational benefits while maintaining privacy goals.

Abstract: As concerns around data privacy in machine learning grow, the ability to unlearn, or remove, specific data points from trained models becomes increasingly important. While state-of-the-art unlearning methods have emerged in response, they typically treat all points in the forget set equally. In this work, we challenge this approach by asking whether points that have a negligible impact on the model’s learning need to be removed. Through a comparative analysis of influence functions across language and vision tasks, we identify subsets of training data with negligible impact on model outputs. Leveraging this insight, we propose an efficient unlearning framework that reduces the size of datasets before unlearning, leading to significant computational savings (up to approximately 50 percent) on real-world empirical examples.
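
A schematic of the cost-saving step under stated assumptions: score each forget-set point by influence and run the expensive unlearning procedure only on points above a threshold. The `influence_score` array below is a random placeholder for the paper's influence-function estimates.

```python
# Schematic of the cost-saving step: rank forget-set points by influence
# and run the (expensive) unlearning procedure only on impactful ones.
# `influence_score` stands in for the paper's influence-function estimates.
import numpy as np

rng = np.random.default_rng(0)
forget_set = list(range(1000))
influence_score = rng.exponential(scale=1.0, size=len(forget_set))  # placeholder

threshold = np.quantile(influence_score, 0.5)  # drop the bottom half
impactful = [i for i, s in zip(forget_set, influence_score) if s >= threshold]

print(f"unlearning {len(impactful)}/{len(forget_set)} points "
      f"(~{1 - len(impactful)/len(forget_set):.0%} compute saved)")
# Points below the threshold have negligible effect on model outputs,
# so skipping them approximates "free" unlearning.
```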

[244] DMAGT: Unveiling miRNA-Drug Associations by Integrating SMILES and RNA Sequence Structures through Graph Transformer Models

Ziqi Zhang

Main category: cs.LG

TL;DR: DMAGT is a transformer-based graph neural network model that predicts drug-miRNA associations using molecular structure embeddings and graph transformers, achieving high accuracy and practical validation.

DetailsMotivation: Traditional wet lab experiments for exploring drug-miRNA associations are inefficient and costly, creating a need for computational methods to accelerate miRNA-targeted drug development.

Method: A multi-layer transformer-based graph neural network (DMAGT) that transforms drug-miRNA associations into graphs, uses Word2Vec for embedding molecular structures, and employs graph transformers to learn from embedded features and relational structures.

Result: Achieved up to 95.24±0.05 AUC on three datasets (ncDR, RNAInter, SM2miR), outperformed comparable methods, and validated 14 out of 20 top predictions for 5-Fluorouracil and Oxaliplatin.

Conclusion: DMAGT demonstrates excellent performance and stability in predicting drug-miRNA associations, providing an efficient computational shortcut for miRNA drug development.

Abstract: MiRNAs, due to their role in gene regulation, have paved a new pathway for pharmacology, focusing on drug development that targets miRNAs. However, traditional wet lab experiments are limited by efficiency and cost constraints, making it difficult to extensively explore potential associations between developed drugs and target miRNAs. Therefore, we have designed a novel machine learning model based on a multi-layer transformer-based graph neural network, DMAGT, specifically for predicting associations between drugs and miRNAs. This model transforms drug-miRNA associations into graphs, employs Word2Vec for embedding features of drug molecular structures and miRNA base structures, and leverages a graph transformer model to learn from embedded features and relational structures, ultimately predicting associations between drugs and miRNAs. To evaluate DMAGT, we tested its performance on three datasets composed of drug-miRNA associations: ncDR, RNAInter, and SM2miR, achieving an AUC of up to $95.24\pm0.05$. DMAGT demonstrated superior performance in comparative experiments against methods tackling similar challenges. To validate its practical efficacy, we specifically focused on two drugs, namely 5-Fluorouracil and Oxaliplatin. Of the 20 potential drug-miRNA associations identified as the most likely, 14 were successfully validated. The above experiments demonstrate that DMAGT has excellent performance and stability in predicting drug-miRNA associations, providing a new shortcut for miRNA drug development.

[245] Bridging Interpretability and Optimization: Provably Attribution-Weighted Actor-Critic in Reproducing Kernel Hilbert Spaces

Na Li, Hangguan Shan, Wei Ni, Wenjie Zhang, Xinyu Li

Main category: cs.LG

TL;DR: RSA2C is an attribution-aware actor-critic algorithm that uses RKHS-SHAP state attributions to weight actor gradients and advantage critic targets, improving efficiency, stability, and interpretability in continuous control tasks.

DetailsMotivation: Current actor-critic methods lack interpretability, and existing explainable RL approaches don't effectively use state attributions to guide training. They treat all state features equally, ignoring the heterogeneous impacts of individual state dimensions on rewards.

Method: Proposes RSA2C with three RKHS-based components: Actor in vector-valued RKHS with Mahalanobis-weighted kernel, Value Critic and Advantage Critic in scalar RKHSs. Uses sparsified dictionaries (Value Critic has its own, Actor and Advantage Critic share one). Computes state attributions via RKHS-SHAP (kernel mean embedding for on-manifold, conditional mean embedding for off-manifold expectations), converts them to Mahalanobis-gated weights that modulate Actor gradients and Advantage Critic targets.

Result: Theoretical: Derives global, non-asymptotic convergence bound under state perturbations, showing stability through perturbation-error term and efficiency through convergence-error term. Empirical: Demonstrates efficiency, stability, and interpretability on three standard continuous-control environments.

Conclusion: RSA2C successfully integrates state attributions into actor-critic training, achieving improved performance while providing interpretability through attribution-aware gradient weighting and target modulation.

Abstract: Actor-critic (AC) methods are a cornerstone of reinforcement learning (RL) but offer limited interpretability. Current explainable RL methods seldom use state attributions to assist training. Rather, they treat all state features equally, thereby neglecting the heterogeneous impacts of individual state dimensions on the reward. We propose RKHS–SHAP-based Advanced Actor–Critic (RSA2C), an attribution-aware, kernelized, two-timescale AC algorithm, including Actor, Value Critic, and Advantage Critic. The Actor is instantiated in a vector-valued reproducing kernel Hilbert space (RKHS) with a Mahalanobis-weighted operator-valued kernel, while the Value Critic and Advantage Critic reside in scalar RKHSs. These RKHS-enhanced components use sparsified dictionaries: the Value Critic maintains its own dictionary, while the Actor and Advantage Critic share one. State attributions, computed from the Value Critic via RKHS–SHAP (kernel mean embedding for on-manifold expectations and conditional mean embedding for off-manifold expectations), are converted into Mahalanobis-gated weights that modulate Actor gradients and Advantage Critic targets. Theoretically, we derive a global, non-asymptotic convergence bound under state perturbations, showing stability through the perturbation-error term and efficiency through the convergence-error term. Empirical results on three standard continuous-control environments show that our algorithm achieves efficiency, stability, and interpretability.
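
A toy illustration of attribution-gated updates, the mechanism at the heart of RSA2C: per-dimension state attributions (stand-ins for RKHS-SHAP values) are converted into weights that modulate a gradient so that influential state dimensions drive learning harder. The simple normalized gate below is an assumption for illustration; the paper uses Mahalanobis-gated weights in an RKHS.

```python
# Toy illustration of attribution-gated updates: per-dimension state
# attributions (stand-ins for RKHS-SHAP values) become weights that
# modulate a gradient. The gating form is an assumption, not the paper's.
import numpy as np

attributions = np.array([0.05, 0.90, 0.40])     # |SHAP|-like score per state dim
weights = attributions / attributions.sum()     # normalized gate
raw_grad = np.array([1.0, 1.0, 1.0])            # un-gated gradient w.r.t. state dims
gated_grad = weights * raw_grad * len(weights)  # rescaled so the mean weight is 1
print(gated_grad)  # dimensions with larger attribution dominate the update
```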

[246] CFO: Learning Continuous-Time PDE Dynamics via Flow-Matched Neural Operators

Xianglong Hou, Xinquan Huang, Paris Perdikaris

Main category: cs.LG

TL;DR: CFO is a neural operator framework that learns continuous-time PDE dynamics using flow matching on temporal splines, enabling time-resolution invariant training and inference with superior long-horizon stability and data efficiency.

DetailsMotivation: Autoregressive neural operators for time-dependent PDEs suffer from error accumulation over long rollouts and require uniform temporal discretization. Current continuous approaches like neural ODEs have high computational burden from backpropagation through ODE solvers.

Method: CFO repurposes flow matching to learn PDE dynamics without ODE solver backpropagation. It fits temporal splines to trajectory data, uses finite-difference estimates of time derivatives at knots to construct probability paths, then trains a neural operator via flow matching to predict analytic velocity fields.

Result: CFO demonstrates superior long-horizon stability and remarkable data efficiency across four benchmarks (Lorenz, 1D Burgers, 2D diffusion-reaction, 2D shallow water). Trained on only 25% of irregularly subsampled time points, it outperforms autoregressive baselines with relative error reductions up to 87%. Despite requiring inference-time integration, it uses only 50% of baseline function evaluations while enabling reverse-time inference and arbitrary temporal querying.

Conclusion: CFO provides a time-resolution invariant framework for learning continuous-time PDE dynamics that overcomes limitations of autoregressive methods and neural ODEs, offering superior stability, data efficiency, and flexible temporal querying capabilities.

Abstract: Neural operator surrogates for time-dependent partial differential equations (PDEs) conventionally employ autoregressive prediction schemes, which accumulate error over long rollouts and require uniform temporal discretization. We introduce the Continuous Flow Operator (CFO), a framework that learns continuous-time PDE dynamics without the computational burden of standard continuous approaches, e.g., neural ODE. The key insight is repurposing flow matching to directly learn the right-hand side of PDEs without backpropagating through ODE solvers. CFO fits temporal splines to trajectory data, using finite-difference estimates of time derivatives at knots to construct probability paths whose velocities closely approximate the true PDE dynamics. A neural operator is then trained via flow matching to predict these analytic velocity fields. This approach is inherently time-resolution invariant: training accepts trajectories sampled on arbitrary, non-uniform time grids while inference queries solutions at any temporal resolution through ODE integration. Across four benchmarks (Lorenz, 1D Burgers, 2D diffusion-reaction, 2D shallow water), CFO demonstrates superior long-horizon stability and remarkable data efficiency. CFO trained on only 25% of irregularly subsampled time points outperforms autoregressive baselines trained on complete data, with relative error reductions up to 87%. Despite requiring numerical integration at inference, CFO achieves competitive efficiency, outperforming autoregressive baselines using only 50% of their function evaluations, while uniquely enabling reverse-time inference and arbitrary temporal querying.
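
A conceptual sketch of CFO-style training targets: fit a temporal spline to irregularly sampled snapshots and regress a network onto the spline's velocity along the trajectory, with no backpropagation through an ODE solver. Here the spline's analytic derivative stands in for the paper's finite-difference construction at the knots, and a plain MLP stands in for the neural operator.

```python
# Conceptual sketch of CFO's training targets: fit a temporal spline to
# trajectory snapshots and regress a network onto the spline's velocity.
# The operator here is a plain MLP placeholder on a 1-dof toy trajectory.
import numpy as np
import torch
import torch.nn as nn
from scipy.interpolate import CubicSpline

rng = np.random.default_rng(0)
t_knots = np.sort(rng.uniform(0.0, 1.0, size=12))   # irregular time grid
u_knots = np.sin(2 * np.pi * t_knots)[:, None]      # trajectory snapshots

spline = CubicSpline(t_knots, u_knots, axis=0)
velocity = spline.derivative()                      # du/dt along the path

op = nn.Sequential(nn.Linear(2, 64), nn.Tanh(), nn.Linear(64, 1))
opt = torch.optim.Adam(op.parameters(), lr=1e-3)

for _ in range(200):                                # flow-matching-style regression
    t = rng.uniform(t_knots[0], t_knots[-1], size=(32,))
    u = torch.tensor(spline(t), dtype=torch.float32)
    v = torch.tensor(velocity(t), dtype=torch.float32)
    inp = torch.cat([u, torch.tensor(t, dtype=torch.float32)[:, None]], dim=1)
    loss = nn.functional.mse_loss(op(inp), v)       # match the spline velocity
    opt.zero_grad()
    loss.backward()
    opt.step()
# At inference, integrating du/dt = op(u, t) with any ODE solver lets the
# solution be queried at arbitrary times, including outside the training grid.
```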

[247] Uncertainty Quantification for Scientific Machine Learning using Sparse Variational Gaussian Process Kolmogorov-Arnold Networks (SVGP KAN)

Y. Sungtaek Ju

Main category: cs.LG

TL;DR: SVGP KANs integrate sparse variational Gaussian processes with Kolmogorov-Arnold Networks for scalable Bayesian inference with uncertainty quantification in scientific ML applications.

DetailsMotivation: Kolmogorov-Arnold Networks offer interpretability but lack principled uncertainty quantification needed for scientific applications where distinguishing aleatoric from epistemic uncertainty is crucial.

Method: Framework combines sparse variational Gaussian process inference with Kolmogorov-Arnold topology, using analytic moment matching to propagate uncertainty through deep additive structures while maintaining interpretability.

Result: Demonstrated through three studies: calibration of heteroscedastic measurement noise in fluid flow reconstruction, quantification of prediction confidence degradation in multi-step forecasting, and out-of-distribution detection in convolutional autoencoders.

Conclusion: SVGP KANs are a promising architecture for uncertainty-aware learning in scientific machine learning, providing scalable Bayesian inference with quasi-linear computational complexity.

Abstract: Kolmogorov-Arnold Networks have emerged as interpretable alternatives to traditional multi-layer perceptrons. However, standard implementations lack principled uncertainty quantification capabilities essential for many scientific applications. We present a framework integrating sparse variational Gaussian process inference with the Kolmogorov-Arnold topology, enabling scalable Bayesian inference with computational complexity quasi-linear in sample size. Through analytic moment matching, we propagate uncertainty through deep additive structures while maintaining interpretability. We use three example studies to demonstrate the framework’s ability to distinguish aleatoric from epistemic uncertainty: calibration of heteroscedastic measurement noise in fluid flow reconstruction, quantification of prediction confidence degradation in multi-step forecasting of advection-diffusion dynamics, and out-of-distribution detection in convolutional autoencoders. These results suggest that Sparse Variational Gaussian Process Kolmogorov-Arnold Networks (SVGP KANs) are a promising architecture for uncertainty-aware learning in scientific machine learning.

[248] The Erosion of LLM Signatures: Can We Still Distinguish Human and LLM-Generated Scientific Ideas After Iterative Paraphrasing?

Sadat Shahriar, Navid Ayoobi, Arjun Mukherjee

Main category: cs.LG

TL;DR: SOTA models struggle to distinguish human vs LLM-generated scientific ideas, especially after paraphrasing, with detection performance dropping 25.4% after 5 paraphrasing stages. Context helps slightly, but simplified non-expert style paraphrasing erodes LLM signatures most.

DetailsMotivation: As LLMs become research agents, distinguishing between human and LLM-generated ideas is crucial for understanding LLMs' cognitive capabilities. While LLM text detection has been studied, distinguishing scientific ideas remains unexplored.

Method: Systematically evaluate SOTA machine learning models’ ability to differentiate human vs LLM-generated ideas, particularly after successive paraphrasing stages. Test detection performance with and without research problem context.

Result: Detection performance declines by average 25.4% after five consecutive paraphrasing stages. Incorporating research problem context improves detection by up to 2.97%. Detection algorithms struggle most when ideas are paraphrased into simplified, non-expert style.

Conclusion: Distinguishing human vs LLM-generated scientific ideas is challenging, especially after paraphrasing. Simplified non-expert style paraphrasing most erodes LLM signatures, highlighting limitations in current detection methods.

Abstract: With the increasing reliance on LLMs as research agents, distinguishing between LLM and human-generated ideas has become crucial for understanding the cognitive nuances of LLMs’ research capabilities. While detecting LLM-generated text has been extensively studied, distinguishing human- vs LLM-generated scientific ideas remains an unexplored area. In this work, we systematically evaluate the ability of state-of-the-art (SOTA) machine learning models to differentiate between human and LLM-generated ideas, particularly after successive paraphrasing stages. Our findings highlight the challenges SOTA models face in source attribution, with detection performance declining by an average of 25.4% after five consecutive paraphrasing stages. Additionally, we demonstrate that incorporating the research problem as contextual information improves detection performance by up to 2.97%. Notably, our analysis reveals that detection algorithms struggle significantly when ideas are paraphrased into a simplified, non-expert style, contributing the most to the erosion of distinguishable LLM signatures.

[249] Enhancing Deep Deterministic Policy Gradients on Continuous Control Tasks with Decoupled Prioritized Experience Replay

Mehmet Efe Lorasdagi, Dogan Can Cicek, Furkan Burak Mutlu, Suleyman Serdar Kozat

Main category: cs.LG

TL;DR: DPER (Decoupled Prioritized Experience Replay) improves deep deterministic policy gradient algorithms by using separate, tailored transition batches for Actor and Critic networks instead of identical batches.

DetailsMotivation: Actor and Critic networks in DDPG algorithms have different learning objectives and update dynamics, making uniform transition usage potentially suboptimal. The authors aim to improve performance by providing appropriate learning signals to each component through decoupled batches.

Method: Introduces DPER, a novel experience replay mechanism that allows independent sampling of transition batches for Actor and Critic. DPER can be integrated into any off-policy deep RL algorithm for continuous control domains. Combined with Twin Delayed DDPG (TD3) and evaluated on standard MuJoCo benchmarks.

Result: DPER outperforms conventional experience replay strategies (vanilla experience replay and prioritized experience replay) across multiple MuJoCo tasks from the OpenAI Gym suite.

Conclusion: Decoupling experience replay for Actor and Critic networks enhances training dynamics and final policy quality. DPER offers a generalizable mechanism that improves performance for a wide class of actor-critic off-policy reinforcement learning algorithms.

Abstract: Background: Deep Deterministic Policy Gradient-based reinforcement learning algorithms utilize Actor-Critic architectures, where both networks are typically trained using identical batches of replayed transitions. However, the learning objectives and update dynamics of the Actor and Critic differ, raising concerns about whether uniform transition usage is optimal. Objectives: We aim to improve the performance of deep deterministic policy gradient algorithms by decoupling the transition batches used to train the Actor and the Critic. Our goal is to design an experience replay mechanism that provides appropriate learning signals to each component by using separate, tailored batches. Methods: We introduce Decoupled Prioritized Experience Replay (DPER), a novel approach that allows independent sampling of transition batches for the Actor and the Critic. DPER can be integrated into any off-policy deep reinforcement learning algorithm that operates in continuous control domains. We combine DPER with the state-of-the-art Twin Delayed DDPG algorithm and evaluate its performance across standard continuous control benchmarks. Results: DPER outperforms conventional experience replay strategies such as vanilla experience replay and prioritized experience replay in multiple MuJoCo tasks from the OpenAI Gym suite. Conclusions: Our findings show that decoupling experience replay for Actor and Critic networks can enhance training dynamics and final policy quality. DPER offers a generalizable mechanism that enhances performance for a wide class of actor-critic off-policy reinforcement learning algorithms.
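
A minimal sketch of the decoupling idea: the Actor and Critic each draw their own batch from a shared buffer instead of one identical batch. The concrete sampling rules below (prioritized for the Critic, uniform for the Actor) are illustrative choices, not necessarily the schemes DPER uses.

```python
# Minimal sketch of decoupled sampling: the actor and critic each draw
# their own batch from the shared buffer instead of one identical batch.
# Priority handling is simplified for brevity.
import random

class DecoupledReplayBuffer:
    def __init__(self, capacity=100_000):
        self.storage, self.capacity = [], capacity

    def add(self, transition, priority=1.0):
        self.storage.append((transition, priority))
        if len(self.storage) > self.capacity:
            self.storage.pop(0)

    def sample_critic(self, batch_size):
        # e.g., priority-weighted sampling tailored to the critic
        weights = [p for _, p in self.storage]
        return [t for t, _ in random.choices(self.storage, weights=weights, k=batch_size)]

    def sample_actor(self, batch_size):
        # e.g., uniform sampling for the actor, independent of the critic
        return [t for t, _ in random.sample(self.storage, k=batch_size)]

buf = DecoupledReplayBuffer()
for i in range(256):
    buf.add({"s": i, "a": 0, "r": 0.0, "s2": i + 1}, priority=random.random())
critic_batch = buf.sample_critic(64)   # tailored to the critic's objective
actor_batch = buf.sample_actor(64)     # tailored to the actor's objective
```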

[250] Robustness Test for AI Forecasting of Hurricane Florence Using FourCastNetv2 and Random Perturbations of the Initial Condition

Adam Lizerbram, Shane Stevenson, Iman Khadir, Matthew Tu, Samuel S. P. Shen

Main category: cs.LG

TL;DR: Testing AI weather model (FourCastNetv2) robustness by injecting noise into hurricane initial conditions and using random inputs to assess sensitivity and reliability.

DetailsMotivation: Understanding AI weather forecasting model robustness is crucial for assessing output reliability, especially for extreme weather events like hurricanes where accurate predictions are critical.

Method: Two experiments: 1) Perturb Hurricane Florence initial conditions with varying Gaussian noise levels and analyze trajectory/intensity predictions; 2) Start model with fully random initial conditions to observe response to nonsensical inputs.

Result: FCNv2 preserves hurricane features well under low-moderate noise; maintains general storm structure even with high noise (though positional accuracy degrades); consistently underestimates storm intensity/persistence; generates smooth forecasts from random inputs after few timesteps.

Conclusion: The model shows tendency toward stable, smoothed outputs and maintains basic storm features under noise, but underestimates intensity. The simple testing approach is portable to other AI weather forecasting models for robustness assessment.

Abstract: Understanding the robustness of a weather forecasting model with respect to input noise or different uncertainties is important in assessing its output reliability, particularly for extreme weather events like hurricanes. In this paper, we test sensitivity and robustness of an artificial intelligence (AI) weather forecasting model: NVIDIA's FourCastNetv2 (FCNv2). We conduct two experiments designed to assess model output under different levels of injected noise in the model's initial condition. First, we perturb the initial condition of Hurricane Florence from the European Centre for Medium-Range Weather Forecasts (ECMWF) Reanalysis v5 (ERA5) dataset (September 13-16, 2018) with varying amounts of Gaussian noise and examine the impact on predicted trajectories and forecasted storm intensity. Second, we start FCNv2 with fully random initial conditions and observe how the model responds to nonsensical inputs. Our results indicate that FCNv2 accurately preserves hurricane features under low to moderate noise injection. Even under high levels of noise, the model maintains the general storm trajectory and structure, although positional accuracy begins to degrade. FCNv2 consistently underestimates storm intensity and persistence across all levels of injected noise. With fully random initial conditions, the model generates smooth and cohesive forecasts after a few timesteps, implying the model's tendency towards stable, smoothed outputs. Our approach is simple and portable to other data-driven AI weather forecasting models.
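
A skeleton of the perturbation experiment: add Gaussian noise of increasing amplitude to the initial condition, roll the model forward, and track drift from the unperturbed forecast. `forecast_model` is a toy stand-in for FCNv2, which ingests multi-channel ERA5 fields rather than a single 2D array.

```python
# Skeleton of the noise-injection robustness test: perturb the initial
# condition at several amplitudes and compare rollouts to the baseline.
# `forecast_model` is a placeholder for FCNv2.
import numpy as np

def forecast_model(state, steps):            # placeholder dynamics
    return [state * (0.99 ** k) for k in range(1, steps + 1)]

rng = np.random.default_rng(0)
x0 = rng.random((64, 64))                    # toy "initial condition"
baseline = forecast_model(x0, steps=16)

for sigma in [0.0, 0.01, 0.05, 0.1, 0.5]:    # increasing noise levels
    x0_noisy = x0 + rng.normal(0.0, sigma, size=x0.shape)
    rollout = forecast_model(x0_noisy, steps=16)
    # Track how far the perturbed forecast drifts from the baseline.
    rmse = [np.sqrt(np.mean((a - b) ** 2)) for a, b in zip(rollout, baseline)]
    print(f"sigma={sigma:<4} final-step RMSE={rmse[-1]:.4f}")
```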

[251] Non-Convex Federated Optimization under Cost-Aware Client Selection

Xiaowen Jiang, Anton Rodomanov, Sebastian U. Stich

Main category: cs.LG

TL;DR: The paper introduces a new federated optimization model that quantifies communication and computation costs for different client-selection strategies, and proposes a new algorithm (RG-SAGA) that achieves state-of-the-art communication and local complexities for non-convex optimization.

DetailsMotivation: Existing federated optimization metrics don't distinguish between different client-selection strategies (random subset vs. all clients vs. hybrid), which have different communication costs in practice. There's a need for a unified model that quantifies these costs and enables fair comparison of optimization methods.

Method: 1) Introduces a new federated optimization model that quantifies communication and local computation complexities; 2) Proposes a new algorithm based on inexact composite gradient method with carefully constructed gradient estimator; 3) Uses SAGA as base estimator and introduces Recursive-Gradient (RG) technique to improve error bounds; 4) Creates RG-SAGA estimator with improved performance.

Result: The proposed RG-SAGA algorithm achieves the best-known communication and local complexities among existing federated optimization methods for non-convex optimization, with improved error bounds compared to original SAGA.

Conclusion: The paper provides a unified framework for comparing federated optimization methods with different client-selection strategies, and introduces an improved algorithm (RG-SAGA) that sets new state-of-the-art performance in terms of communication and computation efficiency for non-convex problems.

Abstract: Different federated optimization algorithms typically employ distinct client-selection strategies: some methods communicate only with a randomly sampled subset of clients at each round, while others need to periodically communicate with all clients or use a hybrid scheme that combines both strategies. However, existing metrics for comparing optimization methods typically do not distinguish between these strategies, which often incur different communication costs in practice. To address this disparity, we introduce a simple and natural model of federated optimization that quantifies communication and local computation complexities. This new model allows for several commonly used client-selection strategies and explicitly associates each with a distinct cost. Within this setting, we propose a new algorithm that achieves the best-known communication and local complexities among existing federated optimization methods for non-convex optimization. This algorithm is based on the inexact composite gradient method with a carefully constructed gradient estimator and a special procedure for solving the auxiliary subproblem at each iteration. The gradient estimator is based on SAGA, a popular variance-reduced gradient estimator. We first derive a new variance bound for it, showing that SAGA can exploit functional similarity. We then introduce the Recursive-Gradient technique as a general way to potentially improve the error bound of a given conditionally unbiased gradient estimator, including both SAGA and SVRG. By applying this technique to SAGA, we obtain a new estimator, RG-SAGA, which has an improved error bound compared to the original one.
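
For reference, a minimal NumPy implementation of the SAGA gradient estimator that the paper builds on, applied to a toy least-squares finite sum. The Recursive-Gradient refinement that yields RG-SAGA is omitted.

```python
# Minimal SAGA gradient estimator for a finite sum f(x) = mean_i f_i(x),
# the base estimator that RG-SAGA refines. Quadratic f_i keep the example
# self-contained.
import numpy as np

rng = np.random.default_rng(0)
n, d = 50, 5
A = rng.normal(size=(n, d))
b = rng.normal(size=n)

def grad_i(x, i):                  # gradient of f_i(x) = 0.5 * (a_i.x - b_i)^2
    return (A[i] @ x - b[i]) * A[i]

x = np.zeros(d)
table = np.array([grad_i(x, i) for i in range(n)])   # stored past gradients
table_avg = table.mean(axis=0)
lr = 0.05

for _ in range(500):
    i = rng.integers(n)
    g_new = grad_i(x, i)
    # SAGA estimate: unbiased, with variance reduced via the stored gradient.
    estimate = g_new - table[i] + table_avg
    x -= lr * estimate
    table_avg += (g_new - table[i]) / n              # keep the average in sync
    table[i] = g_new

print("relative residual:", np.linalg.norm(A @ x - b) / np.linalg.norm(b))
```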

[252] PathFinder: MCTS and LLM Feedback-based Path Selection for Multi-Hop Question Answering

Durga Prasad Maram, Kalpa Gunaratna, Vijay Srinivasan, Haris Jeelani, Srinivas Chappidi

Main category: cs.LG

TL;DR: PATHFINDER improves multi-hop QA by using Monte Carlo Tree Search to generate training paths, filtering erroneous traces with sub-answer recall and LLM verification, and reformulating failed sub-queries.

DetailsMotivation: Existing training-based approaches for multi-hop question answering suffer from LLM hallucinations and incorrect reasoning paths that hinder performance, despite LLMs' reasoning capabilities for decomposing questions.

Method: PATHFINDER uses Monte Carlo Tree Search to generate training path traces, filters erroneous/lengthy traces using sub-answer recall and LLM-as-a-judge verification, and reformulates sub-queries to handle failed retrieval cases.

Result: PATHFINDER improves the performance of multi-hop question answering over public benchmark datasets.

Conclusion: The proposed approach effectively addresses LLM hallucinations and incorrect reasoning paths in multi-hop QA through systematic path generation, quality filtering, and query reformulation.

Abstract: Multi-hop question answering is a challenging task in which language models must reason over multiple steps to reach the correct answer. With the help of Large Language Models and their reasoning capabilities, existing systems are able to think and decompose an input question over multiple steps to analyze, retrieve, and reason. However, training-based approaches for this problem still suffer from LLM hallucinations and incorrect reasoning paths that hinder performance. Hence, we propose PATHFINDER, an approach that: (i) uses Monte Carlo Tree Search to generate training path traces, (ii) improves training data quality by filtering erroneous and lengthy traces using sub-answer recall and LLM-as-a-judge verification, and (iii) reformulates sub-queries to handle failed retrieval cases. By following these steps, we demonstrate that PATHFINDER improves the performance of multi-hop QA over public benchmark datasets.
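
A sketch of the trace-filtering stage under stated assumptions: keep MCTS-generated reasoning traces only if they recall the gold sub-answers and are not overly long, with the LLM-as-a-judge check reduced to a stub. Function names and thresholds are hypothetical.

```python
# Sketch of PATHFINDER-style trace filtering: keep MCTS-generated traces
# only if they recall the gold sub-answers and stay short. The
# LLM-as-a-judge step is reduced to a stub.
def sub_answer_recall(trace_steps, gold_sub_answers):
    hits = sum(any(g.lower() in s.lower() for s in trace_steps)
               for g in gold_sub_answers)
    return hits / len(gold_sub_answers)

def llm_judge_ok(trace_steps):          # placeholder for LLM verification
    return True

def filter_traces(traces, gold_sub_answers, min_recall=1.0, max_len=8):
    kept = []
    for steps in traces:
        if len(steps) > max_len:
            continue                     # drop lengthy traces
        if sub_answer_recall(steps, gold_sub_answers) < min_recall:
            continue                     # drop traces missing sub-answers
        if llm_judge_ok(steps):
            kept.append(steps)
    return kept

traces = [["Paris is the capital of France", "France's capital is on the Seine"],
          ["The answer is Berlin"]]
print(filter_traces(traces, gold_sub_answers=["Paris"]))
```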

[253] Interaction Tensor SHAP

Hiroki Hasegawa, Yukihiko Okada

Main category: cs.LG

TL;DR: IT SHAP enables efficient computation of higher-order Shapley interactions using tensor network contractions, reducing exponential complexity to polynomial time.

DetailsMotivation: Existing Shapley-based methods cannot tractably evaluate higher-order feature interactions in deep, high-dimensional models due to exponential computational complexity.

Method: Introduces Interaction Tensor SHAP (IT SHAP) that reformulates Shapley Taylor Interaction Index as tensor network contractions, assuming Tensor Train structure with polynomial TT ranks for computational efficiency.

Result: Reduces computational complexity from exponential Θ(4^n) to NC2 parallel time, enabling polynomial-time computation of higher-order interactions under TT structure.

Conclusion: IT SHAP provides a unified, axiomatic, and computationally tractable framework for scalable interaction-aware explainable AI in high-dimensional models.

Abstract: Machine learning models have grown increasingly deep and high dimensional, making it difficult to understand how individual and combined features influence their predictions. While Shapley value based methods provide principled feature attributions, existing formulations cannot tractably evaluate higher order interactions: the Shapley Taylor Interaction Index (STII) requires exponential scale enumeration of subsets, and current tensor based approaches such as the Marginal SHAP Tensor (MST) are restricted to first order effects. The central problem is that no existing framework simultaneously preserves the axiomatic exactness of STII and avoids the exponential computational blow-up inherent to high order discrete derivatives. Here we show that high order Shapley interactions can be represented exactly as tensor network contractions, enabling polynomial time and polylog depth computation under Tensor Train (TT) structure. We introduce Interaction Tensor SHAP (IT SHAP), which reformulates STII as the contraction of a Value Tensor and a Weight Tensor, and assume a finite state TT representation of the Weight Tensor with polynomial TT ranks. Under TT structured model and distribution tensors, we show that IT SHAP reduces the exponential $\Theta(4^n)$ complexity of STII to NC2 parallel time. These results demonstrate that IT SHAP provides a unified, axiomatic, and computationally tractable formulation of main effects and higher order interactions in high dimensional models. This framework establishes a foundation for scalable interaction aware explainable AI, with implications for large black box models whose combinatorial structure has previously rendered interaction analysis infeasible.
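
A toy illustration of why tensor-train (TT) structure makes otherwise exponential contractions cheap, which is the property IT SHAP exploits: summing a TT tensor with 2^n entries over all index combinations costs O(n·r²) core contractions instead of O(2^n). This demonstrates the generic TT trick, not IT SHAP's specific Value/Weight Tensor contraction.

```python
# Toy TT contraction: fully summing a 2^30-entry rank-structured tensor
# without ever materializing it. Each core is contracted over its
# physical index, then the resulting r x r transfer matrices are chained.
import numpy as np

rng = np.random.default_rng(0)
n, r = 30, 4                            # 2^30 entries, never materialized
cores = [rng.normal(size=(1, 2, r))] \
      + [rng.normal(size=(r, 2, r)) for _ in range(n - 2)] \
      + [rng.normal(size=(r, 2, 1))]

result = np.eye(1)
for core in cores:
    result = result @ core.sum(axis=1)  # (left_rank, right_rank) transfer matrix
print(result.item())                    # full sum of the 2^30-entry tensor
```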

[254] Taxonomy-Adaptive Moderation Model with Robust Guardrails for Large Language Models

Mahesh Kumar Nandwana, Youngwan Lim, Joseph Liu, Alex Yang, Varun Notibala, Nishchaie Khanna

Main category: cs.LG

TL;DR: Roblox Guard 1.0 is a safety-focused LLM built on Llama-3.1-8B-Instruct that provides input-output moderation, with strong generalization to unseen safety threats and a new evaluation benchmark.

DetailsMotivation: LLMs aligned during post-training may still generate inappropriate outputs posing user risks, requiring robust safeguards across both model inputs and outputs.

Method: Instruction fine-tuning on Llama-3.1-8B-Instruct using synthetic and open-source safety datasets, augmented with chain-of-thought rationales and input inversion to enhance contextual understanding. Also created RobloxGuard-Eval benchmark for systematic evaluation.

Result: The model demonstrates strong performance on out-of-domain safety benchmarks and generalizes across previously unseen safety taxonomies.

Conclusion: Roblox Guard 1.0 provides comprehensive input-output moderation capability and the accompanying benchmark enables systematic evaluation of LLM safety guardrails.

Abstract: Large Language Models (LLMs) are typically aligned for safety during the post-training phase; however, they may still generate inappropriate outputs that could potentially pose risks to users. This challenge underscores the need for robust safeguards that operate across both model inputs and outputs. In this work, we introduce Roblox Guard 1.0, a state-of-the-art instruction fine-tuned LLM designed to enhance the safety of LLM systems through comprehensive input-output moderation, using a pipeline of LLMs to enhance moderation capability. Built on the Llama-3.1-8B-Instruct backbone, our model is instruction fine-tuned to generalize across previously unseen safety taxonomies and demonstrates strong performance on out-of-domain safety benchmarks. The instruction fine-tuning process uses a mix of synthetic and open-source safety datasets, augmented with chain-of-thought (CoT) rationales and input inversion to enhance contextual understanding and decision making. To support systematic evaluation, we also release RobloxGuard-Eval, a new benchmark featuring an extensible safety taxonomy to assess the effectiveness of LLM guardrails and moderation frameworks.

[255] When Forgetting Builds Reliability: LLM Unlearning for Reliable Hardware Code Generation

Yiwen Liang, Qiufeng Li, Shikai Wang, Weidong Cao

Main category: cs.LG

TL;DR: A novel unlearning framework for LLM-based hardware code generation that removes problematic memorization while preserving code quality.

DetailsMotivation: LLMs in hardware design face reliability issues due to memorization of proprietary IP, contaminated benchmarks, and unsafe coding patterns, requiring a solution to mitigate these risks.

Method: Combines syntax-preserving unlearning strategy to maintain structural integrity and fine-grained floor-aware selective loss for precise removal of problematic knowledge.

Result: Supports forget sets up to 3x larger, typically requires only one training epoch, and preserves both syntactic correctness and functional integrity of RTL codes.

Conclusion: The framework enables reliable LLM-assisted hardware design by effectively removing problematic knowledge without degrading code generation capabilities.

Abstract: Large Language Models (LLMs) have shown strong potential in accelerating digital hardware design through automated code generation. Yet, ensuring their reliability remains a critical challenge, as existing LLMs trained on massive heterogeneous datasets often exhibit problematic memorization of proprietary intellectual property (IP), contaminated benchmarks, and unsafe coding patterns. To mitigate these risks, we propose a novel unlearning framework tailored for LLM-based hardware code generation. Our method combines (i) a syntax-preserving unlearning strategy that safeguards the structural integrity of hardware code during forgetting, and (ii) a fine-grained floor-aware selective loss that enables precise and efficient removal of problematic knowledge. This integration achieves effective unlearning without degrading LLM code generation capabilities. Extensive experiments show that our framework supports forget sets up to 3x larger, typically requiring only a single training epoch, while preserving both syntactic correctness and functional integrity of register-transfer level (RTL) codes. Our work paves an avenue towards reliable LLM-assisted hardware design.

[256] Enhancing Dimensionality Prediction in Hybrid Metal Halides via Feature Engineering and Class-Imbalance Mitigation

Mariia Karabin, Isaac Armstrong, Leo Beck, Paulina Apanel, Markus Eisenbach, David B. Mitzi, Hanna Terletska, Hendrik Heinz

Main category: cs.LG

TL;DR: ML framework predicts structural dimensionality of hybrid metal halides using feature engineering and class-imbalance handling techniques.

DetailsMotivation: Predicting structural dimensionality of hybrid metal halides is challenging due to highly imbalanced datasets across dimensionality classes (0D, 1D, 2D, 3D), requiring specialized techniques to handle class imbalance.

Method: Combines chemically-informed feature engineering with class-imbalance handling (SMOTE augmentation), interaction-based descriptors, multi-stage workflow with feature selection, model stacking, and performance optimization.

Result: Significantly improves F1-scores for underrepresented classes and achieves robust cross-validation performance across all dimensionalities.

Conclusion: The framework successfully addresses class imbalance challenges in predicting HMH structural dimensionality through integrated feature engineering and advanced ML techniques.

Abstract: We present a machine learning framework for predicting the structural dimensionality of hybrid metal halides (HMHs), including organic-inorganic perovskites, using a combination of chemically-informed feature engineering and advanced class-imbalance handling techniques. The dataset, consisting of 494 HMH structures, is highly imbalanced across dimensionality classes (0D, 1D, 2D, 3D), posing significant challenges to predictive modeling. This dataset was later augmented to 1336 via the Synthetic Minority Oversampling Technique (SMOTE) to mitigate the effects of the class imbalance. We developed interaction-based descriptors and integrated them into a multi-stage workflow that combines feature selection, model stacking, and performance optimization to improve dimensionality prediction accuracy. Our approach significantly improves F1-scores for underrepresented classes, achieving robust cross-validation performance across all dimensionalities.
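
A hedged sketch of the class-imbalance pipeline, assuming scikit-learn and imbalanced-learn are available: SMOTE oversampling followed by a stacked ensemble, mirroring the workflow described above. The random feature matrix stands in for the chemically informed descriptors, and the estimator choices are illustrative.

```python
# Sketch of the imbalance-handling workflow: SMOTE oversampling plus a
# stacked ensemble, evaluated with macro-F1. Features are random
# stand-ins for the chemically informed descriptors.
import numpy as np
from imblearn.over_sampling import SMOTE
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(494, 20))                       # descriptor matrix
y = rng.choice([0, 1, 2, 3], size=494, p=[0.1, 0.2, 0.3, 0.4])  # 0D-3D, imbalanced

X_res, y_res = SMOTE(random_state=0).fit_resample(X, y)   # balance the classes

stack = StackingClassifier(
    estimators=[("rf", RandomForestClassifier(random_state=0))],
    final_estimator=LogisticRegression(max_iter=1000))
scores = cross_val_score(stack, X_res, y_res, cv=5, scoring="f1_macro")
print(scores.mean())   # macro-F1 weights the rare dimensionality classes equally
```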

[257] Text Rationalization for Robust Causal Effect Estimation

Lijinghua Zhang, Hengrui Cai

Main category: cs.LG

TL;DR: CATR framework selects sparse token subsets to address positivity violations in text-based causal inference, improving estimation stability and accuracy.

DetailsMotivation: High-dimensional text data in causal inference causes positivity assumption violations, leading to extreme propensity scores, unstable weights, and inflated variance in treatment effect estimates.

Method: Confounding-Aware Token Rationalization (CATR) selects sparse necessary token subsets using a residual-independence diagnostic that preserves confounding information sufficient for unconfoundedness while discarding irrelevant text.

Result: Experiments on synthetic data and MIMIC-III database show CATR yields more accurate, stable, and interpretable causal effect estimates than existing baselines.

Conclusion: CATR effectively addresses positivity violations in text-based causal inference by selecting relevant tokens, stabilizing estimators, and improving interpretability of causal effects.

Abstract: Recent advances in natural language processing have enabled the increasing use of text data in causal inference, particularly for adjusting confounding factors in treatment effect estimation. Although high-dimensional text can encode rich contextual information, it also poses unique challenges for causal identification and estimation. In particular, the positivity assumption, which requires sufficient treatment overlap across confounder values, is often violated at the observational level, when massive text is represented in feature spaces. Redundant or spurious textual features inflate dimensionality, producing extreme propensity scores, unstable weights, and inflated variance in effect estimates. We address these challenges with Confounding-Aware Token Rationalization (CATR), a framework that selects a sparse necessary subset of tokens using a residual-independence diagnostic designed to preserve confounding information sufficient for unconfoundedness. By discarding irrelevant texts while retaining key signals, CATR mitigates observational-level positivity violations and stabilizes downstream causal effect estimators. Experiments on synthetic data and a real-world study using the MIMIC-III database demonstrate that CATR yields more accurate, stable, and interpretable causal effect estimates than existing baselines.

[258] China Regional 3km Downscaling Based on Residual Corrective Diffusion Model

Honglu Sun, Hao Jing, Zhixiang Dai, Sa Xiao, Wei Xue, Jian Sun, Qifeng Lu

Main category: cs.LG

TL;DR: Deep learning-based statistical downscaling using CorrDiff diffusion model to generate high-resolution 3km weather forecasts for China region, outperforming operational regional model CMA-MESO.

DetailsMotivation: To efficiently produce high-resolution weather forecasts by applying statistical downscaling to global model outputs, addressing the computational challenges of high-resolution numerical weather prediction.

Method: Uses CorrDiff diffusion-based downscaling framework with modifications: expanded region (20x larger), includes high-level variables at six pressure levels (not just surface), adds global residual connection for accuracy. Applied to 25km global forecasts from CMA-GFS and SFF models to generate 3km forecasts.

Result: Downscaled forecasts outperform CMA-MESO regional model in MAE for target variables. CorrDiff generates fine-scale details leading to more realistic predictions compared to deterministic regression models, particularly evident in radar composite reflectivity forecasts.

Conclusion: Diffusion-based statistical downscaling (CorrDiff) effectively produces high-resolution weather forecasts that surpass operational regional models, demonstrating the value of generative models for capturing fine-scale details in weather prediction.

Abstract: A fundamental challenge in numerical weather prediction is to efficiently produce high-resolution forecasts. A common solution is applying downscaling methods, which include dynamical downscaling and statistical downscaling, to the outputs of global models. This work focuses on statistical downscaling, which establishes statistical relationships between low-resolution and high-resolution historical data using statistical models. Deep learning has emerged as a powerful tool for this task, giving rise to various high-performance super-resolution models, which can be directly applied for downscaling, such as diffusion models and Generative Adversarial Networks. This work relies on a diffusion-based downscaling framework named CorrDiff. In contrast to the original work of CorrDiff, the region considered in this work is nearly 20 times larger, and we not only consider surface variables as in the original work, but also include high-level variables (six pressure levels) as target downscaling variables. In addition, a global residual connection is added to improve accuracy. In order to generate the 3km forecasts for the China region, we apply our trained models to the 25km global grid forecasts of CMA-GFS, an operational global model of the China Meteorological Administration (CMA), and SFF, a data-driven deep learning-based weather model developed from Spherical Fourier Neural Operators (SFNO). CMA-MESO, a high-resolution regional model, is chosen as the baseline model. The experimental results demonstrate that the forecasts downscaled by our method generally outperform the direct forecasts of CMA-MESO in terms of MAE for the target variables. Our forecasts of radar composite reflectivity show that CorrDiff, as a generative model, can generate fine-scale details that lead to more realistic predictions compared to the corresponding deterministic regression models.

[259] Generalization Beyond Benchmarks: Evaluating Learnable Protein-Ligand Scoring Functions on Unseen Targets

Jakub Kopko, David Graber, Saltuk Mustafa Eyrilmez, Stanislav Mazurenko, David Bednar, Jiri Sedlar, Josef Sivic

Main category: cs.LG

TL;DR: Current protein-ligand scoring functions perform well on standard benchmarks but struggle to generalize to novel protein targets, highlighting the need for more rigorous evaluation protocols and better generalization methods.

DetailsMotivation: As machine learning becomes central to molecular design, ensuring the reliability of protein-ligand scoring functions on novel protein targets is crucial. Standard benchmarks don't reflect the true challenge of generalization to targets with limited known structures and experimental data.

Method: The authors evaluate state-of-the-art scoring functions on dataset splits that simulate evaluation on novel targets with limited known structures and affinity measurements. They investigate whether large-scale self-supervised pretraining can bridge the generalization gap and test simple methods that leverage limited test-target data to improve performance.

Result: Analysis reveals that commonly used benchmarks don’t reflect true generalization challenges to novel targets. Preliminary evidence suggests large-scale self-supervised pretraining has potential to bridge the generalization gap. Simple methods using limited test-target data show efficacy in improving scoring function performance.

Conclusion: The findings underscore the need for more rigorous evaluation protocols and offer practical guidance for designing scoring functions with predictive power that extends to novel protein targets, addressing a critical gap in machine learning for molecular design.

Abstract: As machine learning becomes increasingly central to molecular design, it is vital to ensure the reliability of learnable protein-ligand scoring functions on novel protein targets. While many scoring functions perform well on standard benchmarks, their ability to generalize beyond training data remains a significant challenge. In this work, we evaluate the generalization capability of state-of-the-art scoring functions on dataset splits that simulate evaluation on targets with a limited number of known structures and experimental affinity measurements. Our analysis reveals that the commonly used benchmarks do not reflect the true challenge of generalizing to novel targets. We also investigate whether large-scale self-supervised pretraining can bridge this generalization gap and we provide preliminary evidence of its potential. Furthermore, we probe the efficacy of simple methods that leverage limited test-target data to improve scoring function performance. Our findings underscore the need for more rigorous evaluation protocols and offer practical guidance for designing scoring functions with predictive power extending to novel protein targets.

[260] Smart Timing for Mining: A Deep Learning Framework for Bitcoin Hardware ROI Prediction

Sithumi Wickramasinghe, Bikramjit Das, Dorien Herremans

Main category: cs.LG

TL;DR: MineROI-Net: A Transformer-based model for predicting profitable timing of Bitcoin mining hardware acquisitions using time series classification of ROI outcomes.

DetailsMotivation: Bitcoin mining hardware acquisition lacks strategic guidance despite being capital-intensive, with no computational frameworks addressing when to purchase ASIC hardware given market volatility, technological obsolescence, and revenue cycles.

Method: Formulate hardware acquisition as time series classification task predicting ROI outcomes (profitable, marginal, unprofitable). Propose MineROI-Net, an open-source Transformer-based architecture capturing multi-scale temporal patterns in mining profitability.

Result: MineROI-Net achieves 83.7% accuracy and 83.1% macro F1-score, outperforming LSTM and TSLANet baselines. Shows strong economic relevance with 93.6% precision for unprofitable periods and 98.5% precision for profitable ones, avoiding critical misclassifications.

Conclusion: MineROI-Net provides a practical, data-driven tool for timing mining hardware acquisitions, potentially reducing financial risk in capital-intensive mining operations. Model is publicly available as open source.

Abstract: Bitcoin mining hardware acquisition requires strategic timing due to volatile markets, rapid technological obsolescence, and protocol-driven revenue cycles. Despite mining’s evolution into a capital-intensive industry, there is little guidance on when to purchase new Application-Specific Integrated Circuit (ASIC) hardware, and no prior computational frameworks address this decision problem. We address this gap by formulating hardware acquisition as a time series classification task, predicting whether purchasing ASIC machines yields profitable (Return on Investment (ROI) >= 1), marginal (0 < ROI < 1), or unprofitable (ROI <= 0) returns within one year. We propose MineROI-Net, an open source Transformer-based architecture designed to capture multi-scale temporal patterns in mining profitability. Evaluated on data from 20 ASIC miners released between 2015 and 2024 across diverse market regimes, MineROI-Net outperforms LSTM-based and TSLANet baselines, achieving 83.7% accuracy and 83.1% macro F1-score. The model demonstrates strong economic relevance, achieving 93.6% precision in detecting unprofitable periods and 98.5% precision for profitable ones, while avoiding misclassification of profitable scenarios as unprofitable and vice versa. These results indicate that MineROI-Net offers a practical, data-driven tool for timing mining hardware acquisitions, potentially reducing financial risk in capital-intensive mining operations. The model is available through: https://github.com/AMAAI-Lab/MineROI-Net.
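
For concreteness, a minimal sketch of the label construction implied by the task definition above: one-year ROI values are mapped to the three classes. The ROI values themselves are synthetic.

```python
# Minimal sketch of the three-class label construction from one-year ROI,
# following the thresholds in the task definition. ROI values are synthetic.
import numpy as np

def roi_class(roi):
    if roi >= 1.0:
        return "profitable"      # ROI >= 1
    if roi > 0.0:
        return "marginal"        # 0 < ROI < 1
    return "unprofitable"        # ROI <= 0

rois = np.array([1.8, 0.4, -0.2, 1.0, 0.0])
print([roi_class(r) for r in rois])
# ['profitable', 'marginal', 'unprofitable', 'profitable', 'unprofitable']
```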

[261] RevoNAD: Reflective Evolutionary Exploration for Neural Architecture Design

Gyusam Chang, Jeongyoon Yoon, Shin han yi, JaeHyeok Lee, Sujin Jang, Sangpil Kim

Main category: cs.LG

TL;DR: RevoNAD introduces a reflective evolutionary orchestrator that bridges LLM-based reasoning with feedback-aligned neural architecture search, achieving SOTA performance across multiple datasets.

DetailsMotivation: LLM-driven neural architecture design faces challenges with discrete token-level design loops that prevent smooth feedback guidance, leading to mode collapse into redundant structures or drift toward infeasible designs when constructive reasoning isn't well grounded.

Method: RevoNAD uses three key components: 1) Multi-round Multi-expert Consensus to transfer isolated design rules into meaningful architectural clues, 2) Adaptive Reflective Exploration that adjusts exploration degree based on reward variance (explores when uncertain, refines when stable), and 3) Pareto-guided Evolutionary Selection promoting architectures optimizing accuracy, efficiency, latency, confidence, and structural diversity.

Result: Achieves state-of-the-art performance across CIFAR10, CIFAR100, ImageNet16-120, COCO-5K, and Cityscape datasets. Ablation and transfer studies validate effectiveness for practically reliable and deployable neural architecture design.

Conclusion: RevoNAD effectively bridges LLM-based reasoning with feedback-aligned architectural search, overcoming limitations of previous LLM-driven NAD systems through reflective evolutionary orchestration that enables reliable and deployable neural architecture design.

Abstract: Recent progress in leveraging large language models (LLMs) has enabled Neural Architecture Design (NAD) systems to generate new architectures that are not limited to a manually predefined search space. Nevertheless, LLM-driven generation remains challenging: the token-level design loop is discrete and non-differentiable, preventing feedback from smoothly guiding architectural improvement. These methods, in turn, commonly suffer from mode collapse into redundant structures or drift toward infeasible designs when constructive reasoning is not well grounded. We introduce RevoNAD, a reflective evolutionary orchestrator that effectively bridges LLM-based reasoning with feedback-aligned architectural search. First, RevoNAD presents a Multi-round Multi-expert Consensus to transfer isolated design rules into meaningful architectural clues. Then, Adaptive Reflective Exploration adjusts the degree of exploration leveraging reward variance; it explores when feedback is uncertain and refines when stability is reached. Finally, Pareto-guided Evolutionary Selection effectively promotes architectures that jointly optimize accuracy, efficiency, latency, confidence, and structural diversity. Across CIFAR10, CIFAR100, ImageNet16-120, COCO-5K, and Cityscape, RevoNAD achieves state-of-the-art performance. Ablation and transfer studies further validate the effectiveness of RevoNAD in enabling practically reliable and deployable neural architecture design.

[262] Sepsis Prediction Using Graph Convolutional Networks over Patient-Feature-Value Triplets

Bozhi Dan, Di Wu, Ji Xu, Xiang Liu, Yiziting Zhu, Xin Shu, Yujie Li, Bin Yi

Main category: cs.LG

TL;DR: Triplet-GCN: A graph convolutional model using patient-feature-value triplets for early sepsis detection from EHR data, outperforming traditional tabular methods.

DetailsMotivation: Sepsis remains a major cause of ICU mortality, but timely detection is challenging due to complex, sparse, and heterogeneous EHR data. Existing methods struggle with the intricate relationships in EHR data.

Method: Triplet-GCN represents each patient encounter as patient-feature-value triplets, constructs a bipartite EHR graph, and learns patient embeddings via Graph Convolutional Network (GCN) followed by MLP. Uses type-specific preprocessing (median imputation, standardization, effect coding, mode imputation with embeddings) and retains measurement values on edges.

Result: In a multi-center Chinese cohort (N=648), Triplet-GCN consistently outperformed tabular baselines (KNN, SVM, XGBoost, Random Forest) across discrimination and balanced error metrics, with better sensitivity-specificity trade-off and improved utility for early warning.

Conclusion: Encoding EHR as triplets and propagating information over patient-feature graphs produces more informative patient representations than feature-independent models, offering a simple, end-to-end blueprint for deployable sepsis risk stratification.

Abstract: In the intensive care setting, sepsis continues to be a major contributor to patient illness and death; however, its timely detection is hindered by the complex, sparse, and heterogeneous nature of electronic health record (EHR) data. We propose Triplet-GCN, a single-branch graph convolutional model that represents each encounter as patient-feature-value triplets, constructs a bipartite EHR graph, and learns patient embeddings via a Graph Convolutional Network (GCN) followed by a lightweight multilayer perceptron (MLP). The pipeline applies type-specific preprocessing – median imputation and standardization for numeric variables, effect coding for binary features, and mode imputation with low-dimensional embeddings for rare categorical attributes – and initializes patient nodes with summary statistics, while retaining measurement values on edges to preserve “who measured what and by how much”. In a retrospective, multi-center Chinese cohort (N = 648; 70/30 train-test split) drawn from three tertiary hospitals, Triplet-GCN consistently outperforms strong tabular baselines (KNN, SVM, XGBoost, Random Forest) across discrimination and balanced error metrics, yielding a more favorable sensitivity-specificity trade-off and improved overall utility for early warning. These findings indicate that encoding EHR as triplets and propagating information over a patient-feature graph produce more informative patient representations than feature-independent models, offering a simple, end-to-end blueprint for deployable sepsis risk stratification.

[263] TS-HINT: Enhancing Semiconductor Time Series Regression Using Attention Hints From Large Language Model Reasoning

Jonathan Adam Rico, Nagarajan Raghavan, Senthilnath Jayavelu

Main category: cs.LG

TL;DR: TS-Hint: A time series foundation model framework with chain-of-thought reasoning for semiconductor manufacturing MRR prediction, addressing data limitations and temporal dynamics.

DetailsMotivation: Existing methods for predicting material removal rate (MRR) in semiconductor manufacturing (like CMP) rely on static time series features, losing temporal dynamics and requiring large datasets for training.

Method: Proposes TS-Hint, a Time Series Foundation Model framework integrated with chain-of-thought reasoning that provides attention hints during training based on attention mechanism data and saliency data.
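
One plausible reading of "attention hints" is an auxiliary loss that pulls the model's attention toward an external saliency distribution, sketched below; the KL form and weighting are assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def hinted_loss(pred, target, attn, hint, lam=0.1):
    """Task loss plus a penalty aligning model attention with a saliency hint.
    The KL form and `lam` are assumptions, not the authors' formulation."""
    task = F.mse_loss(pred, target)
    # attn, hint: (batch, seq_len) distributions over time steps
    align = F.kl_div(attn.clamp_min(1e-8).log(), hint, reduction="batchmean")
    return task + lam * align

attn = torch.softmax(torch.randn(4, 50), dim=-1)
hint = torch.softmax(torch.randn(4, 50), dim=-1)
print(hinted_loss(torch.randn(4, 1), torch.randn(4, 1), attn, hint))
```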

Result: Experimental results show effectiveness in limited-data settings via few-shot learning and the ability to learn directly from multivariate time series features.

Conclusion: TS-Hint addresses limitations of existing methods by preserving temporal dynamics and working effectively with limited data through a foundation model approach with chain-of-thought reasoning.

Abstract: Existing data-driven methods rely on the extraction of static features from time series to approximate the material removal rate (MRR) of semiconductor manufacturing processes such as chemical mechanical polishing (CMP). However, this leads to a loss of temporal dynamics. Moreover, these methods require a large amount of data for effective training. In this paper, we propose TS-Hint, a Time Series Foundation Model (TSFM) framework, integrated with chain-of-thought reasoning, which provides attention hints during training based on attention mechanism data and saliency data. Experimental results demonstrate the effectiveness of our model in limited-data settings via few-shot learning and show that it can learn directly from multivariate time series features.

[264] IdealTSF: Can Non-Ideal Data Contribute to Enhancing the Performance of Time Series Forecasting Models?

Hua Wang, Jinghao Lu, Fan Zhang

Main category: cs.LG

TL;DR: IdealTSF framework leverages non-ideal negative samples to enhance time series forecasting by pretraining on negative data, transforming sequences into ideal positive samples, and applying adversarial optimization.

DetailsMotivation: Missing values and anomalies in time series data hinder deep learning forecasting performance. Previous approaches focused on feature extraction or treating suboptimal data as positive samples, but leveraging negative samples could better enhance event prediction.

Method: Proposes IdealTSF framework with three progressive steps: 1) Pretraining by extracting knowledge from negative sample data, 2) Training by transforming sequence data into ideal positive samples, 3) Optimization using negative optimization mechanism with adversarial disturbances.
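
The "adversarial disturbances" step can be pictured as an FGSM-style perturbation of input windows, as in this hedged sketch; the stand-in forecaster and step size are illustrative only.

```python
import torch
import torch.nn.functional as F

def adversarial_disturbance(model, x, y, eps=0.01):
    """FGSM-style perturbation of an input window (one plausible reading of
    the paper's negative-optimization mechanism)."""
    x_adv = x.clone().requires_grad_(True)
    F.mse_loss(model(x_adv), y).backward()
    return (x + eps * x_adv.grad.sign()).detach()

model = torch.nn.Linear(24, 1)                 # stand-in forecaster, 24-step window
x, y = torch.randn(8, 24), torch.randn(8, 1)
x_neg = adversarial_disturbance(model, x, y)   # perturbed "negative" inputs
```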

Result: Extensive experiments show that negative sample data unlocks significant potential within basic attention architecture for time series forecasting, demonstrating the framework’s effectiveness.

Conclusion: IdealTSF is particularly well-suited for applications with noisy samples or low-quality data, as it effectively leverages both ideal positive and negative samples to improve forecasting performance.

Abstract: Deep learning has shown strong performance in time series forecasting tasks. However, issues such as missing values and anomalies in sequential data hinder its further development in prediction tasks. Previous research has primarily focused on extracting feature information from sequence data or treating such suboptimal data as positive samples for knowledge transfer. A more effective approach would be to leverage these non-ideal negative samples to enhance event prediction. In response, this study highlights the advantages of non-ideal negative samples and proposes the IdealTSF framework, which integrates both ideal positive and negative samples for time series forecasting. IdealTSF consists of three progressive steps: pretraining, training, and optimization. It first pretrains the model by extracting knowledge from negative sample data, then transforms the sequence data into ideal positive samples during training. Additionally, a negative optimization mechanism with adversarial disturbances is applied. Extensive experiments demonstrate that negative sample data unlocks significant potential within the basic attention architecture for time series forecasting. Therefore, IdealTSF is particularly well-suited for applications with noisy samples or low-quality data.

[265] Entropy Ratio Clipping as a Soft Global Constraint for Stable Reinforcement Learning

Zhenpeng Su, Leiyu Pan, Minxuan Lv, Tiehua Mei, Zijia Lin, Yuntao Li, Wenping Hu, Ruiming Tang, Kun Gai, Guorui Zhou

Main category: cs.LG

TL;DR: Proposes Entropy Ratio Clipping (ERC) to stabilize RL training for LLMs by constraining global distribution shift using entropy ratio between current and previous policies.

DetailsMotivation: Off-policy RL training for LLMs causes distribution shift that pushes policies beyond trust regions, leading to training instabilities like entropy fluctuations and unstable gradients. PPO-Clip only addresses local clipping but ignores global distributional shift of actions.

Method: Introduces entropy ratio between current and previous policies as global metric to quantify policy exploration changes. Proposes Entropy Ratio Clipping (ERC) mechanism with bidirectional constraints on entropy ratio to stabilize policy updates at global distribution level and compensate for PPO-clip’s limitations with un-sampled actions.
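
A minimal sketch of the entropy-ratio computation with a bidirectional clip; the band (0.9, 1.1) and the standalone form are assumptions, since the paper integrates the constraint into the RL objective.

```python
import torch

def entropy(logits):
    p = torch.softmax(logits, dim=-1)
    return -(p * p.clamp_min(1e-12).log()).sum(-1).mean()

def entropy_ratio_clip(logits_new, logits_old, low=0.9, high=1.1):
    """Entropy ratio H(pi_new)/H(pi_old) with a bidirectional clip. Returned
    as a standalone quantity here; the paper folds it into the RL objective."""
    ratio = entropy(logits_new) / entropy(logits_old).clamp_min(1e-12)
    return ratio, ratio.clamp(low, high)

ratio, clipped = entropy_ratio_clip(torch.randn(4, 32000), torch.randn(4, 32000))
print(float(ratio), float(clipped))
```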

Result: ERC integrated into DAPO and GPPO RL algorithms. Experiments across multiple benchmarks show ERC consistently improves performance and stabilizes training.

Conclusion: ERC effectively addresses global distribution shift in LLM RL training, providing more stable policy updates and better performance than standard PPO-clip approaches.

Abstract: Large language model post-training relies on reinforcement learning to improve model capability and alignment quality. However, the off-policy training paradigm introduces distribution shift, which often pushes the policy beyond the trust region, leading to training instabilities manifested as fluctuations in policy entropy and unstable gradients. Although PPO-Clip mitigates this issue through importance clipping, it still overlooks the global distributional shift of actions. To address these challenges, we propose using the entropy ratio between the current and previous policies as a new global metric that effectively quantifies the relative change in policy exploration throughout updates. Building on this metric, we introduce an \textbf{Entropy Ratio Clipping} (ERC) mechanism that imposes bidirectional constraints on the entropy ratio. This stabilizes policy updates at the global distribution level and compensates for the inability of PPO-clip to regulate probability shifts of un-sampled actions. We integrate ERC into both DAPO and GPPO reinforcement learning algorithms. Experiments across multiple benchmarks show that ERC consistently improves performance.

[266] How Ensemble Learning Balances Accuracy and Overfitting: A Bias-Variance Perspective on Tabular Data

Zubair Ahmed Mohammad

Main category: cs.LG

TL;DR: Ensembles achieve high accuracy with small generalization gaps by reducing variance through averaging or controlled boosting, especially effective on datasets with nonlinear structure but requiring regularization on noisy/imbalanced data.

DetailsMotivation: To understand how ensemble models balance accuracy and overfitting (generalization gaps) across different tabular classification tasks, as ensembles often achieve higher accuracy but their ability to maintain small generalization gaps is not well understood.

Method: Used repeated stratified cross-validation with statistical significance testing to compare linear models, a single decision tree, and nine ensemble methods across four tabular classification datasets (Breast Cancer, Heart Disease, Pima Diabetes, Credit Card Fraud). Also computed dataset complexity indicators like linearity score, Fisher ratio, and noise estimate.
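
The evaluation protocol is straightforward to reproduce in scikit-learn; the sketch below measures test accuracy and the train-test generalization gap under repeated stratified CV (the model choices shown are a small subset of the study's nine ensembles).

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RepeatedStratifiedKFold, cross_validate

X, y = load_breast_cancer(return_X_y=True)
cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=3, random_state=0)

for name, model in [("logreg", LogisticRegression(max_iter=5000)),
                    ("random_forest", RandomForestClassifier(random_state=0))]:
    res = cross_validate(model, X, y, cv=cv, return_train_score=True)
    gap = res["train_score"].mean() - res["test_score"].mean()  # overfit proxy
    print(f"{name}: test={res['test_score'].mean():.3f} gap={gap:.3f}")
```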

Result: 1) On nearly linear/clean data: linear models generalize well, ensembles offer little benefit. 2) On datasets with meaningful nonlinear structure: tree-based ensembles increase test accuracy by 5-7 points while keeping gaps below 3%. 3) On noisy/imbalanced datasets: ensembles remain competitive but require regularization to avoid fitting noise or majority class patterns.

Conclusion: Ensembles can maintain high accuracy with low overfitting by reducing variance through averaging/controlled boosting, especially effective on datasets with nonlinear structure. Dataset complexity indicators help explain when ensembles effectively control variance, providing practical guidance for model selection in real-world tabular applications.

Abstract: Ensemble models often achieve higher accuracy than single learners, but their ability to maintain small generalization gaps is not always well understood. This study examines how ensembles balance accuracy and overfitting across four tabular classification tasks: Breast Cancer, Heart Disease, Pima Diabetes, and Credit Card Fraud. Using repeated stratified cross validation with statistical significance testing, we compare linear models, a single decision tree, and nine ensemble methods. The results show that ensembles can reach high accuracy without large gaps by reducing variance through averaging or controlled boosting. On nearly linear and clean data, linear models already generalize well and ensembles offer little additional benefit. On datasets with meaningful nonlinear structure, tree based ensembles increase test accuracy by 5 to 7 points while keeping gaps below 3 percent. On noisy or highly imbalanced datasets, ensembles remain competitive but require regularization to avoid fitting noise or majority class patterns. We also compute simple dataset complexity indicators, such as linearity score, Fisher ratio, and noise estimate, which explain when ensembles are likely to control variance effectively. Overall, the study provides a clear view of how and when ensembles maintain high accuracy while keeping overfitting low, offering practical guidance for model selection in real world tabular applications.

[267] PERM EQ x GRAPH EQ: Equivariant Neural Networks for Quantum Molecular Learning

Saumya Biswas, Jiten Oswal

Main category: cs.LG

TL;DR: Comparison of Geometric Quantum Machine Learning models on molecular datasets (LiH and NH3) shows permutational symmetric embedding is most generalizable, with graph embedding improving trainability.

DetailsMotivation: To understand how different symmetry equivariance properties in Quantum Machine Learning models affect performance and generalizability on molecular geometry datasets, and to establish criteria for model selection based on molecular geometry.

Method: Comparative study of QML models with different symmetry properties: no symmetry equivariance, rotational and permutational equivariance, and graph embedded permutational equivariance. Tested on two molecular datasets (linear LiH and trigonal pyramidal NH3) with a classical equivariant model as baseline.

Result: Graph embedding of features improves trainability for geometric datasets. Permutational symmetric embedding emerges as the most generalizable QML model for geometric learning. Performance differentials reveal criteria for model selection based on molecular geometry.

Conclusion: Permutational symmetric embedding is the most effective and generalizable Quantum Machine Learning approach for geometric learning on molecular datasets, with graph embedding providing a pathway to enhanced trainability.

Abstract: In hierarchical order of molecular geometry, we compare the performances of Geometric Quantum Machine Learning models. Two molecular datasets are considered: the simple linear-shaped LiH molecule and the trigonal pyramidal molecule NH3. Both accuracy and generalizability metrics are considered. A classical equivariant model is used as a baseline for the performance comparison. The comparative performance of Quantum Machine Learning models with no symmetry equivariance, rotational and permutational equivariance, and graph embedded permutational equivariance is investigated. The performance differentials and the molecular geometry in question reveal the criteria for choice of models for generalizability. Graph embedding of features is shown to be an effective pathway to greater trainability for geometric datasets. Permutational symmetric embedding is found to be the most generalizable Quantum Machine Learning model for geometric learning.

[268] Turbulence Regression

Yingang Fan, Binjie Ding, Baiyi Chen

Main category: cs.LG

TL;DR: The paper proposes a NeuTucker decomposition model using discretized wind field data for improved low-altitude turbulence prediction, outperforming traditional regression methods.

DetailsMotivation: Traditional methods struggle to accurately predict low-altitude turbulence states using only wind profile radar data due to complex factors and sparse observations. Current observational conditions limit accurate turbulence prediction.

Method: Introduces a NeuTucker decomposition model that discretizes continuous wind field data and constructs a four-dimensional Tucker interaction tensor to capture spatio-temporal interactions. Uses a Tucker neural network-based low-rank decomposition model for three-dimensional wind field data.
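
A toy version of the two core ideas, discretizing continuous inputs into bins and contracting per-mode embeddings with a learned four-way core; the mode choices, sizes, and scoring head are assumptions, not the paper's architecture.

```python
import torch
import torch.nn as nn

n_bins, rank = 32, 8   # discretization resolution and Tucker rank (assumed)

class TinyNeuTucker(nn.Module):
    """Four embedded modes contracted with a learned 4-way core tensor."""
    def __init__(self):
        super().__init__()
        self.emb = nn.ModuleList([nn.Embedding(n_bins, rank) for _ in range(4)])
        self.core = nn.Parameter(torch.randn(rank, rank, rank, rank) * 0.1)

    def forward(self, idx):                     # idx: (batch, 4) bin indices
        a, b, c, d = (self.emb[i](idx[:, i]) for i in range(4))
        return torch.einsum("ijkl,bi,bj,bk,bl->b", self.core, a, b, c, d)

def discretize(x, lo, hi):                      # continuous value -> bin id
    return ((x - lo) / (hi - lo) * (n_bins - 1)).long().clamp(0, n_bins - 1)

model = TinyNeuTucker()
idx = discretize(torch.rand(16, 4) * 20 - 10, lo=-10.0, hi=10.0)
turbulence_score = model(idx)                   # (16,) predicted intensity
```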

Result: The discretized NeuTucF model demonstrates superior performance in estimating missing observations in real datasets compared to various common regression models.

Conclusion: The proposed NeuTucker decomposition approach with data discretization effectively captures complex wind field interactions and improves turbulence prediction accuracy over traditional methods.

Abstract: Air turbulence refers to the disordered and irregular motion state generated by drastic changes in velocity, pressure, or direction during airflow. Various complex factors lead to intricate low-altitude turbulence outcomes. Under current observational conditions, especially when using only wind profile radar data, traditional methods struggle to accurately predict turbulence states. Therefore, this paper introduces a NeuTucker decomposition model utilizing discretized data. Designed for continuous yet sparse three-dimensional wind field data, it constructs a low-rank Tucker decomposition model based on a Tucker neural network to capture the latent interactions within the three-dimensional wind field data. Accordingly, two core ideas are proposed here: 1) Discretizing continuous input data to adapt to models like NeuTucF that require discrete data inputs. 2) Constructing a four-dimensional Tucker interaction tensor to represent all possible spatio-temporal interactions among different elevations and three-dimensional wind speeds. In estimating missing observations in real datasets, this discretized NeuTucF model demonstrates superior performance compared to various common regression models.

[269] GRASP: Graph Reasoning Agents for Systems Pharmacology with Human-in-the-Loop

Omid Bazgir, Vineeth Manthapuri, Ilia Rattsev, Mohammad Jafarnejad

Main category: cs.LG

TL;DR: GRASP is a multi-agent graph-reasoning framework that converts natural language QSP model specifications into executable MATLAB/SimBiology code while preserving biological constraints, outperforming traditional LLM approaches.

DetailsMotivation: QSP modeling requires significant time investment that limits domain expert throughput. Current approaches lack the ability to preserve biological constraints while enabling natural language specification.

Method: GRASP encodes QSP models as typed biological knowledge graphs and compiles them to executable code. It uses a two-phase workflow (Understanding for legacy code reconstruction and Action for constraint-checked modification) with BFS parameter alignment and iterative validation.
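
The BFS parameter-alignment step can be illustrated on a toy typed graph as below; the node names and depth cutoff are invented for illustration and not from GRASP.

```python
import networkx as nx

# Toy typed knowledge graph; all node and edge names are invented.
G = nx.DiGraph()
G.add_edge("drug_plasma", "drug_tumor", type="distribution")
G.add_edge("drug_tumor", "target_occupancy", type="binding")
G.add_edge("target_occupancy", "tumor_growth", type="effect")

def dependents_to_align(graph, new_entity, max_depth=2):
    """BFS outwards from a newly added entity to surface dependent
    quantities whose parameters need review."""
    depths = nx.single_source_shortest_path_length(graph, new_entity,
                                                   cutoff=max_depth)
    return [n for n in depths if n != new_entity]

print(dependents_to_align(G, "drug_plasma"))
# ['drug_tumor', 'target_occupancy']
```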

Result: GRASP outperforms SME-guided CoT and ToT baselines (≈9-10/10 vs. 5-7/10) across biological plausibility, mathematical correctness, structural fidelity, and code quality. BFS alignment achieves F1 = 0.95 for dependency discovery, units, and range.

Conclusion: Graph-structured, agentic workflows can make QSP model development both accessible and rigorous, enabling domain experts to specify mechanisms in natural language without sacrificing biomedical fidelity.

Abstract: Quantitative Systems Pharmacology (QSP) modeling is essential for drug development but it requires significant time investment that limits the throughput of domain experts. We present GRASP – a multi-agent, graph-reasoning framework with a human-in-the-loop conversational interface – that encodes QSP models as typed biological knowledge graphs and compiles them to executable MATLAB/SimBiology code while preserving units, mass balance, and physiological constraints. A two-phase workflow – Understanding (graph reconstruction of legacy code) and Action (constraint-checked, language-driven modification) – is orchestrated by a state machine with iterative validation. GRASP performs breadth-first parameter-alignment around new entities to surface dependent quantities and propose biologically plausible defaults, and it runs automatic execution/diagnostics until convergence. In head-to-head evaluations using LLM-as-judge, GRASP outperforms SME-guided CoT and ToT baselines across biological plausibility, mathematical correctness, structural fidelity, and code quality (≈9–10/10 vs. 5–7/10). BFS alignment achieves F1 = 0.95 for dependency discovery, units, and range. These results demonstrate that graph-structured, agentic workflows can make QSP model development both accessible and rigorous, enabling domain experts to specify mechanisms in natural language without sacrificing biomedical fidelity.

[270] Credal and Interval Deep Evidential Classifications

Michele Caprio, Shireen K. Manchingal, Fabio Cuzzolin

Main category: cs.LG

TL;DR: CDEC and IDEC are novel deep evidential classification methods that use credal sets and interval distributions for uncertainty quantification, enabling abstention when uncertainty exceeds thresholds and providing robust probabilistic guarantees within acceptable bounds.

DetailsMotivation: Uncertainty quantification is crucial for AI decision-making, risk assessment, and model reliability. Current methods have limitations in systematically assessing both epistemic (reducible) and aleatoric (irreducible) uncertainties and providing reliable abstention mechanisms when uncertainty is too high.

Method: CDEC uses credal sets (closed convex sets of probabilities) while IDEC uses intervals of evidential predictive distributions. Both are trained with standard backpropagation using evidence theory-based loss functions. They can abstain from classification when epistemic or aleatoric uncertainty exceeds thresholds, otherwise provide collections of labels with probabilistic guarantees.
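
A hedged sketch of evidential abstention from a Dirichlet output: vacuity as an epistemic proxy, normalized entropy as an aleatoric proxy, and abstention when either crosses a threshold. The decomposition and thresholds are illustrative, not CDEC/IDEC's exact construction.

```python
import torch

def evidential_decision(alpha, u_epi_max=0.3, u_ale_max=0.5):
    """Abstain when either uncertainty proxy crosses its threshold,
    otherwise return the predicted label (illustrative decomposition)."""
    K = alpha.shape[-1]
    strength = alpha.sum(-1)
    u_epi = K / strength                       # vacuity: epistemic proxy
    p = alpha / strength
    u_ale = -(p * p.clamp_min(1e-12).log()).sum(-1) / torch.log(torch.tensor(float(K)))
    if u_epi > u_epi_max or u_ale > u_ale_max:
        return "abstain"
    return int(p.argmax(-1))

print(evidential_decision(torch.tensor([20.0, 1.0, 1.0])))  # confident -> 0
print(evidential_decision(torch.tensor([1.1, 1.0, 1.0])))   # vacuous -> abstain
```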

Result: Experiments on MNIST, CIFAR-10, CIFAR-100 and their OoD shifts show competitive predictive accuracy, state-of-the-art OoD detection under epistemic and total uncertainty, and well-calibrated prediction regions that expand reliably under distribution shift. CDEC achieves stable uncertainty estimates with small ensembles.

Conclusion: CDEC and IDEC overcome limitations of previous evidential deep learning approaches, providing effective uncertainty quantification with abstention capabilities and robust probabilistic guarantees, making them valuable for reliable AI decision-making under uncertainty.

Abstract: Uncertainty Quantification (UQ) presents a pivotal challenge in the field of Artificial Intelligence (AI), profoundly impacting decision-making, risk assessment and model reliability. In this paper, we introduce Credal and Interval Deep Evidential Classifications (CDEC and IDEC, respectively) as novel approaches to address UQ in classification tasks. CDEC and IDEC leverage a credal set (closed and convex set of probabilities) and an interval of evidential predictive distributions, respectively, allowing us to avoid overfitting to the training data and to systematically assess both epistemic (reducible) and aleatoric (irreducible) uncertainties. When those surpass acceptable thresholds, CDEC and IDEC have the capability to abstain from classification and flag an excess of epistemic or aleatoric uncertainty, as relevant. Conversely, within acceptable uncertainty bounds, CDEC and IDEC provide a collection of labels with robust probabilistic guarantees. CDEC and IDEC are trained using standard backpropagation and a loss function that draws from the theory of evidence. They overcome the shortcomings of previous efforts, and extend the current evidential deep learning literature. Through extensive experiments on MNIST, CIFAR-10 and CIFAR-100, together with their natural OoD shifts (F-MNIST/K-MNIST, SVHN/Intel, TinyImageNet), we show that CDEC and IDEC achieve competitive predictive accuracy, state-of-the-art OoD detection under epistemic and total uncertainty, and tight, well-calibrated prediction regions that expand reliably under distribution shift. An ablation over ensemble size further demonstrates that CDEC attains stable uncertainty estimates with only a small ensemble.

[271] IDK-S: Incremental Distributional Kernel for Streaming Anomaly Detection

Yang Xu, Yixiao Ma, Kaifeng Zhang, Zuliang Yang, Kai Ming Ting

Main category: cs.LG

TL;DR: IDK-S is an incremental distributional kernel method for streaming anomaly detection that maintains high accuracy with lightweight updates, outperforming state-of-the-art methods in both speed and detection performance.

DetailsMotivation: Anomaly detection on data streams requires methods that can handle evolving distributions while maintaining real-time efficiency. Existing methods struggle to balance detection accuracy with computational efficiency in streaming environments.

Method: IDK-S builds on the Isolation Distributional Kernel (an offline detector) and introduces a lightweight incremental update mechanism. It creates dynamic representations in the kernel mean embedding framework, avoiding full model retraining while maintaining statistical equivalence to the retrained model.
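
The incremental update is the key to the speedup: a kernel mean embedding can absorb a new point in O(m) without retraining. The sketch below uses random Fourier features as a stand-in for the paper's data-dependent isolation kernel.

```python
import numpy as np

rng = np.random.default_rng(0)
d, m = 4, 64
W, b = rng.normal(size=(m, d)), rng.uniform(0, 2 * np.pi, size=m)
mu, n = np.zeros(m), 0                      # running kernel mean embedding

def phi(x):
    """Random Fourier features, standing in for the isolation kernel map."""
    return np.sqrt(2.0 / m) * np.cos(W @ x + b)

def update_and_score(x):
    """O(m) streaming update; score = dissimilarity to the running mean."""
    global mu, n
    f = phi(x)
    score = -f @ mu / max(n, 1)             # higher => more anomalous
    mu, n = mu + f, n + 1                   # lightweight update, no retraining
    return score

for _ in range(200):
    update_and_score(rng.normal(size=d))
print(update_and_score(rng.normal(size=d) + 8.0))   # shifted point scores high
```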

Result: Extensive experiments on 13 benchmarks show IDK-S achieves superior detection accuracy while operating substantially faster than existing state-of-the-art methods, often by an order of magnitude.

Conclusion: IDK-S effectively addresses the challenges of streaming anomaly detection by combining the accuracy advantages of data-dependent kernels with efficient incremental updates, making it a practical solution for real-time streaming applications.

Abstract: Anomaly detection on data streams presents significant challenges, requiring methods to maintain high detection accuracy among evolving distributions while ensuring real-time efficiency. Here we introduce $\mathcal{IDK}$-$\mathcal{S}$, a novel $\mathbf{I}$ncremental $\mathbf{D}$istributional $\mathbf{K}$ernel for $\mathbf{S}$treaming anomaly detection that effectively addresses these challenges by creating a new dynamic representation in the kernel mean embedding framework. The superiority of $\mathcal{IDK}$-$\mathcal{S}$ is attributed to two key innovations. First, it inherits the strengths of the Isolation Distributional Kernel, an offline detector that has demonstrated significant performance advantages over foundational methods like Isolation Forest and Local Outlier Factor due to the use of a data-dependent kernel. Second, it adopts a lightweight incremental update mechanism that significantly reduces computational overhead compared to the naive baseline strategy of performing a full model retraining. This is achieved without compromising detection accuracy, a claim supported by its statistical equivalence to the full retrained model. Our extensive experiments on thirteen benchmarks demonstrate that $\mathcal{IDK}$-$\mathcal{S}$ achieves superior detection accuracy while operating substantially faster, in many cases by an order of magnitude, than existing state-of-the-art methods.

[272] On the Theoretical Foundation of Sparse Dictionary Learning in Mechanistic Interpretability

Yiming Tang, Harshvardhan Saini, Yizhen Liao, Dianbo Liu

Main category: cs.LG

TL;DR: The paper develops a unified theoretical framework for sparse dictionary learning methods in mechanistic interpretability, providing the first formal analysis of optimization landscapes and explaining empirical phenomena like feature absorption and dead neurons.

DetailsMotivation: As AI models become more capable, understanding their internal representations is crucial for scientific progress and trustworthy deployment. While sparse dictionary learning methods have shown empirical success in disentangling superposed concepts, they lack theoretical understanding, with existing work limited to specific constrained cases.

Method: Develops a unified theoretical framework treating sparse dictionary learning as one optimization problem, analyzes how diverse methods (sparse autoencoders, transcoders, crosscoders) instantiate this framework, and provides rigorous analysis of optimization landscapes. Also designs controlled experiments to validate theoretical results.

Result: Provides first theoretical explanations for empirically observed phenomena including feature absorption, dead neurons, and neuron resampling techniques. Demonstrates how various SDL methods fit within the unified framework and offers formal grounding for the broader family of methods beyond tied-weight constraints.

Conclusion: The paper establishes a foundational theoretical framework for sparse dictionary learning in mechanistic interpretability, bridging the gap between empirical success and theoretical understanding, which enables better analysis and design of interpretability methods.

Abstract: As AI models achieve remarkable capabilities across diverse domains, understanding what representations they learn and how they process information has become increasingly important for both scientific progress and trustworthy deployment. Recent works in mechanistic interpretability have shown that neural networks represent meaningful concepts as directions in their representation spaces and often encode many concepts in superposition. Various sparse dictionary learning (SDL) methods, including sparse autoencoders, transcoders, and crosscoders, address this by training auxiliary models with sparsity constraints to disentangle these superposed concepts into interpretable features. These methods have demonstrated remarkable empirical success but have limited theoretical understanding. Existing theoretical work is limited to sparse autoencoders with tied-weight constraints, leaving the broader family of SDL methods without formal grounding. In this work, we develop the first unified theoretical framework considering SDL as one unified optimization problem. We demonstrate how diverse methods instantiate the theoretical framework and provide a rigorous analysis of the optimization landscape. We provide the first theoretical explanations for some empirically observed phenomena, including feature absorption, dead neurons, and the neuron resampling technique. We further design controlled experiments to validate our theoretical results.

[273] SCoNE: Spherical Consistent Neighborhoods Ensemble for Effective and Efficient Multi-View Anomaly Detection

Yang Xu, Hang Zhang, Yixiao Ma, Ye Zhu, Kai Ming Ting

Main category: cs.LG

TL;DR: SCoNE is a novel multi-view anomaly detection method that directly represents consistent neighborhoods across views using data-dependent spherical neighborhoods, achieving O(N) complexity and superior accuracy.

DetailsMotivation: Existing multi-view anomaly detection methods have two key issues: (1) they cannot guarantee capturing consistent neighbors well when the same neighbors are in regions of varied densities across different views, leading to inferior accuracy; (2) they have high O(N²) computational cost, making them inapplicable for large datasets.

Method: SCoNE (Spherical Consistent Neighborhoods Ensemble) directly represents consistent neighborhoods with multi-view instances without intermediate representations. It uses data-dependent spherical neighborhoods that are large in sparse regions and small in dense regions, enabling consistent neighborhood representation across views without learning.

Result: SCoNE achieves superior detection accuracy compared to existing approaches and runs orders-of-magnitude faster in large datasets due to its O(N) time complexity.

Conclusion: SCoNE effectively addresses the limitations of existing multi-view anomaly detection methods by providing a direct, data-dependent neighborhood representation that ensures consistency across views while achieving linear time complexity, making it both accurate and scalable.

Abstract: The core problem in multi-view anomaly detection is to represent local neighborhoods of normal instances consistently across all views. Recent approaches consider a representation of local neighborhood in each view independently, and then capture the consistent neighbors across all views via a learning process. They suffer from two key issues. First, there is no guarantee that they can capture consistent neighbors well, especially when the same neighbors are in regions of varied densities in different views, resulting in inferior detection accuracy. Second, the learning process has a high computational cost of $\mathcal{O}(N^2)$, rendering them inapplicable for large datasets. To address these issues, we propose a novel method termed \textbf{S}pherical \textbf{C}onsistent \textbf{N}eighborhoods \textbf{E}nsemble (SCoNE). It has two unique features: (a) the consistent neighborhoods are represented with multi-view instances directly, requiring no intermediate representations as used in existing approaches; and (b) the neighborhoods have data-dependent properties, which lead to large neighborhoods in sparse regions and small neighborhoods in dense regions. The data-dependent properties enable local neighborhoods in different views to be represented well as consistent neighborhoods, without learning. This leads to $\mathcal{O}(N)$ time complexity. Empirical evaluations show that SCoNE has superior detection accuracy and runs orders-of-magnitude faster in large datasets than existing approaches.

[274] RoBoN: Routed Online Best-of-n for Test-Time Scaling with Multiple LLMs

Jonathan Geuter, Gregor Kornhardt

Main category: cs.LG

TL;DR: RoBoN is a sequential multi-LLM routing approach that outperforms single-model best-of-n by exploiting model diversity at inference time without additional training.

DetailsMotivation: Traditional best-of-n uses a single model, but LLMs have complementary strengths across tasks. The authors aim to leverage model diversity to improve best-of-n performance without additional training.

Method: RoBoN sequentially routes generations across multiple models using a reward model and agreement signal. It maintains compute parity with standard best-of-n and works with any plug-in reward model.
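
A self-contained toy of the routing loop, with a dummy reward model and deterministic stand-in "LLMs"; the score update (reward plus a small agreement bonus) is a simplified reading of RoBoN, and `generate` is an assumed interface.

```python
class ToyModel:
    """Deterministic stand-in LLM; `generate` is an assumed interface."""
    def __init__(self, answers):
        self.answers, self.i = answers, 0
    def generate(self, prompt):
        out = self.answers[self.i % len(self.answers)]
        self.i += 1
        return out

def reward(prompt, text):
    return 0.9 if text == "42" else 0.1        # dummy reward model

def robon(models, prompt, n, bonus=0.05):
    outputs = {m: [] for m in models}
    scores = {m: 0.0 for m in models}          # routing scores start equal
    for _ in range(n):
        m = max(models, key=scores.get)        # route this round's generation
        text = m.generate(prompt)
        pool = [t for outs in outputs.values() for t in outs]
        scores[m] = reward(prompt, text) + bonus * sum(t == text for t in pool)
        outputs[m].append(text)
    return max((t for outs in outputs.values() for t in outs),
               key=lambda t: reward(prompt, t))

print(robon([ToyModel(["41", "42"]), ToyModel(["nope"])], "answer?", n=8))
```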

Result: RoBoN consistently outperforms standard best-of-n on reasoning benchmarks (MATH500, OlympiadBench, MinervaMath, GSM8K, MMLU) with gains up to 3.4% absolute accuracy, and beats uniform multi-model baselines.

Conclusion: Model diversity can be exploited at inference to improve best-of-n performance over any single model, providing a simple, training-free path to test-time scaling with multiple LLMs.

Abstract: Best-of-$n$ is a widely used test-time scaling approach for LLM inference. Yet despite evidence that LLMs exhibit complementary strengths across tasks, traditionally best-of-$n$ relies on a single model to generate responses. We propose RoBoN (Routed Online Best-of-$n$), a sequential multi-LLM alternative to the prevailing single-model best-of-$n$. Given a suite of models $\{m_i\}_{i=1}^M$, RoBoN sequentially routes generations one-by-one across models, based on scores computed using a reward model and an agreement signal on the predicted responses. This online routing requires no additional training, keeps compute parity, and works with any plug-in reward model. Across reasoning benchmarks (MATH500, OlympiadBench, MinervaMath, GSM8K, MMLU), RoBoN consistently outperforms standard best-of-$n$ applied to each individual model for larger $n$, with gains of up to 3.4% in absolute accuracy, and also improves over a uniform multi-model portfolio baseline. Our results indicate that diversity across models can be exploited at inference to improve best-of-$n$ performance over any constituent model alone, providing a simple, training-free path to test-time scaling with multiple LLMs.

[275] Improving Local Fidelity Through Sampling and Modeling Nonlinearity

Sanjeev Shrestha, Rahul Dubey, Hui Liu

Main category: cs.LG

TL;DR: Proposes a novel explanation method using MARS for non-linear local boundaries and N-ball sampling to improve faithfulness over LIME.

DetailsMotivation: LIME assumes linear local decision boundaries and fails to capture non-linear relationships, leading to incorrect explanations for complex black-box models used in high-stakes applications.

Method: Uses Multivariate Adaptive Regression Splines (MARS) to model non-linear local boundaries and N-ball sampling technique that samples directly from desired distribution instead of reweighting samples like LIME.
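
N-ball sampling itself follows the standard construction, normalizing a Gaussian direction and scaling by radius times U^(1/d), as sketched below; how the paper couples these samples with MARS fitting is not shown.

```python
import numpy as np

def sample_n_ball(center, radius, n_samples, seed=0):
    """Uniform samples from the d-ball around `center`: normalize a Gaussian
    direction, then scale by radius * U**(1/d)."""
    rng = np.random.default_rng(seed)
    d = center.shape[0]
    g = rng.normal(size=(n_samples, d))
    g /= np.linalg.norm(g, axis=1, keepdims=True)        # uniform directions
    r = radius * rng.uniform(size=(n_samples, 1)) ** (1.0 / d)
    return center + r * g

local = sample_n_ball(np.zeros(5), radius=0.5, n_samples=1000)
print(local.mean(axis=0).round(3), np.linalg.norm(local, axis=1).max())
```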

Result: Method yields more faithful explanations than baselines, achieving 37% average reduction in root mean square error and significantly improving local fidelity across three UCI datasets with different classifiers and kernel widths.

Conclusion: The proposed approach effectively captures underlying behavior of reference models by modeling non-linear local boundaries and using better sampling, resulting in higher-fidelity explanations for black-box machine learning models.

Abstract: With the increasing complexity of black-box machine learning models and their adoption in high-stakes areas, it is critical to provide explanations for their predictions. Local Interpretable Model-agnostic Explanation (LIME) is a widely used technique that explains the prediction of any classifier by learning an interpretable model locally around the predicted instance. However, it assumes that the local decision boundary is linear and fails to capture the non-linear relationships, leading to incorrect explanations. In this paper, we propose a novel method that can generate high-fidelity explanations. Multivariate adaptive regression splines (MARS) is used to model non-linear local boundaries that effectively captures the underlying behavior of the reference model, thereby enhancing the local fidelity of the explanation. Additionally, we utilize the N-ball sampling technique, which samples directly from the desired distribution instead of reweighting samples as done in LIME, further improving the faithfulness score. We evaluate our method on three UCI datasets across different classifiers and varying kernel widths. Experimental results show that our method yields more faithful explanations compared to baselines, achieving an average reduction of 37% in root mean square error, significantly improving local fidelity.

[276] Wasserstein distance based semi-supervised manifold learning and application to GNSS multi-path detection

Antoine Blais, Nicolas Couëllan

Main category: cs.LG

TL;DR: Optimal transport-based semi-supervised learning using Wasserstein distance for label propagation in deep convolutional networks, applied to GNSS multi-path interference detection with improved accuracy over fully supervised methods.

DetailsMotivation: To address the challenge of learning from scarce labeled image data by leveraging semi-supervised approaches, specifically for GNSS applications where multi-path interference detection requires effective learning with limited labeled samples.

Method: Proposes an optimal transport-based semi-supervised approach using Wasserstein distance as similarity metric between image samples in implicit graph-based transductive learning. The Wasserstein distance is integrated into label propagation mechanism during deep convolutional network training.
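
A compact sketch of the mechanism: build affinities from pairwise Wasserstein distances, then propagate labels over the implicit graph while clamping the labelled rows. Treating each sample as a 1-D empirical distribution is a simplification of the paper's image setting.

```python
import numpy as np
from scipy.stats import wasserstein_distance

def propagate_labels(samples, labels, n_iter=20, bandwidth=0.1):
    """Label propagation over Wasserstein affinities; labelled rows of
    `labels` are one-hot, unlabelled rows are all-zero and stay free."""
    n = len(samples)
    W = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            d = wasserstein_distance(samples[i], samples[j])
            W[i, j] = W[j, i] = np.exp(-d / bandwidth)
    P = W / W.sum(axis=1, keepdims=True)
    F, clamp = labels.astype(float), labels.sum(axis=1) > 0
    for _ in range(n_iter):
        F = P @ F
        F[clamp] = labels[clamp]
    return F.argmax(axis=1)

rng = np.random.default_rng(0)
samples = [rng.normal(loc=c, size=100) for c in (0.0, 0.1, 3.0, 3.1)]
labels = np.array([[1, 0], [0, 0], [0, 1], [0, 0]])   # two labelled samples
print(propagate_labels(samples, labels))              # -> [0 0 1 1]
```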

Result: Experiments under various signal conditions show that with specific hyperparameter choices controlling semi-supervision amount and metric sensitivity, classification accuracy can be significantly improved over fully supervised training methods.

Conclusion: The optimal transport-based semi-supervised approach using Wasserstein distance is effective for learning from scarce labeled image data, particularly demonstrated in GNSS multi-path interference detection applications, offering improved accuracy over traditional supervised methods.

Abstract: The main objective of this study is to propose an optimal transport based semi-supervised approach to learn from scarce labelled image data using deep convolutional networks. The principle lies in implicit graph-based transductive semi-supervised learning where the similarity metric between image samples is the Wasserstein distance. This metric is used in the label propagation mechanism during learning. We apply and demonstrate the effectiveness of the method on a real-life GNSS application. More specifically, we address the problem of multi-path interference detection. Experiments are conducted under various signal conditions. The results show that for specific choices of hyperparameters controlling the amount of semi-supervision and the level of sensitivity to the metric, the classification accuracy can be significantly improved over the fully supervised training method.

[277] Hyperparameter Transfer Enables Consistent Gains of Matrix-Preconditioned Optimizers Across Scales

Shikai Qiu, Zixi Chen, Hoang Phan, Qi Lei, Andrew Gordon Wilson

Main category: cs.LG

TL;DR: Scaling preconditioned optimizers (Shampoo, SOAP, Muon) via proper hyperparameter transfer rules enables consistent speedups over AdamW for large language models, but incorrect scaling causes vanishing benefits.

DetailsMotivation: Recent preconditioned optimizers show promise but have mixed replication results; need to understand how to properly scale them for large models using hyperparameter transfer techniques.

Method: Studies optimal learning rate and weight decay scaling with model width/depth for various optimizers; applies μP scaling with blocking and explicit spectral normalization; tests on Llama models from 190M to 1.4B parameters.
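
The reported rules reduce to simple width-dependent rescalings, roughly as below; treat the exact per-parameter treatment (blocking, grafting, spectral normalization) as paper-specific detail this sketch omits.

```python
def scaled_hparams(base_lr, base_wd, base_width, width):
    """Transfer hidden-layer LR and independent weight decay across widths:
    both shrink as 1/width relative to the tuned base model."""
    s = width / base_width
    return {"hidden_lr": base_lr / s, "weight_decay": base_wd / s}

for w in (1024, 2048, 4096):
    print(w, scaled_hparams(base_lr=3e-3, base_wd=0.1, base_width=1024, width=w))
```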

Result: Muon achieves 1.4× and Shampoo 1.3× speedup over AdamW when properly scaled; scaling weight decay as 1/width works well; incorrect scaling causes vanishing speedups.

Conclusion: Studying optimal hyperparameter transfer is essential for reliable optimizer comparison at scale; proper scaling enables consistent speedups for preconditioned optimizers.

Abstract: Several recently introduced deep learning optimizers utilizing matrix-level preconditioning have shown promising speedups relative to the current dominant optimizer AdamW, particularly in relatively small-scale experiments. However, efforts to validate and replicate their successes have reported mixed results. To better understand the effectiveness of these optimizers at scale, in this work we investigate how to scale preconditioned optimizers via hyperparameter transfer, building on prior works such as μP. We study how the optimal learning rate and weight decay should scale with model width and depth for a wide range of optimizers, including Shampoo, SOAP, and Muon, accounting for the impact of commonly used techniques such as blocking and grafting. We find that scaling the learning rate according to μP improves transfer, but can still suffer from significant finite-width deviations that cause drifting optimal learning rates, which we show can be mitigated by blocking and explicit spectral normalization. For compute-optimal scaling, we find scaling independent weight decay as $1/\mathrm{width}$ is nearly optimal across optimizers. Applying these scaling rules, we show Muon and Shampoo consistently achieve $1.4\times$ and $1.3\times$ speedup over AdamW for training Llama-architecture language models of sizes ranging from $190$M to $1.4$B, whereas the speedup vanishes rapidly with scale under incorrect scaling. Based on these results and further ablations, we argue that studying optimal hyperparameter transfer is essential for reliably comparing optimizers at scale given a realistic tuning budget.

[278] Bounded Graph Clustering with Graph Neural Networks

Kibidi Neocosmos, Diego Baptista, Nicole Ludwig

Main category: cs.LG

TL;DR: Proposes a GNN framework that allows flexible control over number of detected communities, supporting both range specification and exact cluster count.

DetailsMotivation: Existing GNN-based community detection methods often fail to return the exact specified number of clusters due to design limitations, requiring users to know the true number in advance which is often impractical.

Method: Introduces a flexible framework that allows users to specify either a plausible range for cluster numbers or an exact count, with mechanisms to enforce these bounds during GNN training.

Result: The proposed method enables reliable control over community count in GNN-based detection, allowing both range-based and exact-number specifications to be properly enforced.

Conclusion: Provides a principled solution to the cluster count control problem in GNN-based community detection, making GNN methods more practical and flexible for real-world applications.

Abstract: In community detection, many methods require the user to specify the number of clusters in advance since an exhaustive search over all possible values is computationally infeasible. While some classical algorithms can infer this number directly from the data, this is typically not the case for graph neural networks (GNNs): even when a desired number of clusters is specified, standard GNN-based methods often fail to return the exact number due to the way they are designed. In this work, we address this limitation by introducing a flexible and principled way to control the number of communities discovered by GNNs. Rather than assuming the true number of clusters is known, we propose a framework that allows the user to specify a plausible range and enforce these bounds during training. However, if the user wants an exact number of clusters, it may also be specified and reliably returned.

[279] Modular Jets for Supervised Pipelines: Diagnosing Mirage vs Identifiability

Suman Sanyal

Main category: cs.LG

TL;DR: The paper introduces Modular Jets to analyze whether model decompositions are uniquely determined by data, distinguishing between mirage regimes (multiple decompositions) and identifiable regimes (unique decomposition).

DetailsMotivation: Traditional supervised learning evaluates models via predictive risk but doesn't address whether internal model decomposition is uniquely determined by data and evaluation design. The paper aims to understand when multiple modular decompositions can produce identical input-output behavior.

Method: Introduces Modular Jets - local linear response maps describing how each module reacts to small structured perturbations. Proposes empirical notion of mirage regimes vs identifiable regimes. Develops MoJet algorithm for empirical jet estimation and mirage diagnostics.
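
An empirical jet is essentially a locally estimated Jacobian of one module; a generic finite-difference version is sketched below as a stand-in for MoJet's estimator.

```python
import numpy as np

def empirical_jet(module, x, eps=1e-4):
    """Central finite-difference estimate of a module's local linear
    response map around x; `module` is any vector-to-vector callable."""
    x = np.asarray(x, dtype=float)
    y0 = np.asarray(module(x))
    J = np.zeros((y0.size, x.size))
    for i in range(x.size):
        e = np.zeros_like(x)
        e[i] = eps
        J[:, i] = (np.asarray(module(x + e)) - np.asarray(module(x - e))) / (2 * eps)
    return J

A = np.array([[2.0, 0.0], [1.0, 3.0]])
jet = empirical_jet(lambda v: A @ v, np.ones(2))
print(np.allclose(jet, A))   # a linear module's jet recovers A exactly
```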

Result: Proves a jet-identifiability theorem for two-module linear regression pipelines showing that under mild rank assumptions and access to module-level jets, internal factorization is uniquely determined, while risk-only evaluation admits many mirage decompositions.

Conclusion: Modular Jets provide a framework to analyze model decompositions beyond predictive risk, revealing when internal structures are uniquely identifiable versus when multiple decompositions produce identical behavior (mirage regimes).

Abstract: Classical supervised learning evaluates models primarily via predictive risk on hold-out data. Such evaluations quantify how well a function behaves on a distribution, but they do not address whether the internal decomposition of a model is uniquely determined by the data and evaluation design. In this paper, we introduce Modular Jets for regression and classification pipelines. Given a task manifold (input space), a modular decomposition, and access to module-level representations, we estimate empirical jets, which are local linear response maps that describe how each module reacts to small structured perturbations of the input. We propose an empirical notion of mirage regimes, where multiple distinct modular decompositions induce indistinguishable jets and thus remain observationally equivalent, and contrast this with an identifiable regime, where the observed jets single out a decomposition up to natural symmetries. In the setting of two-module linear regression pipelines we prove a jet-identifiability theorem. Under mild rank assumptions and access to module-level jets, the internal factorisation is uniquely determined, whereas risk-only evaluation admits a large family of mirage decompositions that implement the same input-to-output map. We then present an algorithm (MoJet) for empirical jet estimation and mirage diagnostics, and illustrate the framework using linear and deep regression as well as pipeline classification.

[280] Beyond Data Filtering: Knowledge Localization for Capability Removal in LLMs

Igor Shilov, Alex Cloud, Aryo Pradipta Gema, Jacob Goldman-Wetzler, Nina Panickssery, Henry Sleight, Erik Jones, Cem Anil

Main category: cs.LG

TL;DR: SGTM (Selective GradienT Masking) improves upon Gradient Routing for safer pretraining by localizing harmful knowledge into removable parameters, showing better robustness to labeling errors and adversarial fine-tuning compared to data filtering and other unlearning methods.

DetailsMotivation: Data filtering for pretraining safety faces challenges: labeling harmful data is expensive at scale, and even small mislabeling could create dangerous capabilities in large models. Need robust pretraining-time mitigations that handle label noise.

Method: Selective GradienT Masking (SGTM) - an improved variant of Gradient Routing that zero-masks selected gradients so target domain examples only update dedicated parameters, localizing knowledge for later removal.
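
A minimal PyTorch rendering of the masking idea: on forget-domain batches, zero every gradient outside a dedicated parameter subset so target knowledge accumulates only in removable weights. Which parameters count as "dedicated" is an assumption here; the paper defines its own partition.

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 16))
dedicated = {id(p) for p in model[2].parameters()}   # "removable" last layer
opt = torch.optim.SGD(model.parameters(), lr=1e-2)

def step(x, y, is_forget_domain):
    opt.zero_grad()
    nn.functional.mse_loss(model(x), y).backward()
    if is_forget_domain:
        for p in model.parameters():
            if id(p) not in dedicated and p.grad is not None:
                p.grad.zero_()                       # selective gradient mask
    opt.step()

step(torch.randn(8, 16), torch.randn(8, 16), is_forget_domain=True)
```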

Result: SGTM provides better retain/forget trade-off with labeling errors than data filtering and previous Gradient Routing. Shows strong robustness to adversarial fine-tuning (7x more steps needed vs RMU). Tested on language removal and biology knowledge removal tasks.

Conclusion: SGTM offers a promising pretraining-time safety complement, especially where label noise is unavoidable, providing robust protection against mislabeled harmful content while maintaining model utility.

Abstract: Large Language Models increasingly possess capabilities that carry dual-use risks. While data filtering has emerged as a pretraining-time mitigation, it faces significant challenges: labeling whether data is harmful is expensive at scale, and given improving sample efficiency with larger models, even small amounts of mislabeled content could give rise to dangerous capabilities. To address risks associated with mislabeled harmful content, prior work proposed Gradient Routing (Cloud et al., 2024) – a technique that localizes target knowledge into a dedicated subset of model parameters so they can later be removed. We explore an improved variant of Gradient Routing, which we call Selective GradienT Masking (SGTM), with particular focus on evaluating its robustness to label noise. SGTM zero-masks selected gradients such that target domain examples only update their dedicated parameters. We test SGTM’s effectiveness in two applications: removing knowledge of one language from a model trained on a bilingual synthetic dataset, and removing biology knowledge from a model trained on English Wikipedia. In both cases SGTM provides better retain/forget trade-off in the presence of labeling errors compared to both data filtering and a previously proposed instantiation of Gradient Routing. Unlike shallow unlearning approaches that can be quickly undone through fine-tuning, SGTM exhibits strong robustness to adversarial fine-tuning, requiring seven times more fine-tuning steps to reach baseline performance on the forget set compared to a finetuning-based unlearning method (RMU). Our results suggest SGTM provides a promising pretraining-time complement to existing safety mitigations, particularly in settings where label noise is unavoidable.

[281] Feasibility of AI-Assisted Programming for End-User Development

Irene Weber

Main category: cs.LG

TL;DR: AI-assisted end-user coding using LLMs shows promise as a viable alternative to traditional low-code/no-code platforms for enabling non-programmers to develop digital tools.

DetailsMotivation: To investigate whether AI-assisted end-user coding (using LLM-based assistants) can complement or replace traditional low-code/no-code platforms for end-user development, offering greater flexibility, broader applicability, faster development, improved reusability, and reduced vendor lock-in.

Method: Conducted a case study where non-programmers were asked to develop a basic web app through interaction with AI assistants, analyzing their ability to complete the task and their perceptions of the approach.

Result: Majority of study participants successfully completed the web app development task in reasonable time and expressed support for AI-assisted end-user coding as a viable approach for end-user development.

Conclusion: AI-assisted end-user coding is a feasible paradigm that may complement or even replace traditional LCNC platforms in the future, with implications for practice, research, and academic teaching.

Abstract: End-user development, where non-programmers create or adapt their own digital tools, can play a key role in driving digital transformation within organizations. Currently, low-code/no-code platforms are widely used to enable end-user development through visual programming, minimizing the need for manual coding. Recent advancements in generative AI, particularly large language model-based assistants and “copilots”, open new possibilities, as they may enable end users to generate and refine programming code and build apps directly from natural language prompts. This approach, here referred to as AI-assisted end-user coding, promises greater flexibility, broader applicability, faster development, improved reusability, and reduced vendor lock-in compared to the established visual LCNC platforms. This paper investigates whether AI-assisted end-user coding is a feasible paradigm for end-user development, which may complement or even replace the LCNC model in the future. To explore this, we conducted a case study in which non-programmers were asked to develop a basic web app through interaction with AI assistants. The majority of study participants successfully completed the task in reasonable time and also expressed support for AI-assisted end-user coding as a viable approach for end-user development. The paper presents the study design, analyzes the outcomes, and discusses potential implications for practice, future research, and academic teaching.

[282] Meta-Learning Multi-armed Bandits for Beam Tracking in 5G and 6G Networks

Alexander Mattick, George Yammine, Georgios Kontes, Setareh Maghsudi, Christopher Mutschler

Main category: cs.LG

TL;DR: The paper proposes a POMDP-based online search approach for beam selection in 5G/6G networks that outperforms supervised learning methods and handles new trajectories and environmental changes.

DetailsMotivation: Current analog beamforming uses codebooks with pre-configured beams, but optimal beam selection is challenging due to large codebooks, reflections, blockages, and moving user equipment. Supervised learning approaches have limitations in handling new trajectories and environmental changes.

Method: Formulates beam selection as a partially observable Markov decision process (POMDP), modeling the environment as the codebook itself. Uses an online search procedure that selects candidate beams conditioned on belief states of the unobservable optimal beam and previously probed beams.
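
The belief-state search can be pictured as repeated Bayesian updates over the codebook, probing the currently most likely beam; the Gaussian response and likelihood models below are toy assumptions, not the paper's.

```python
import numpy as np

rng = np.random.default_rng(0)
K, SIGMA = 64, 3.0                        # codebook size, beam response width

def response(probed, best):
    """Toy RSRP model: Gaussian falloff in beam-index distance."""
    return np.exp(-0.5 * ((probed - best) / SIGMA) ** 2)

belief, idx, true_best = np.full(K, 1.0 / K), np.arange(K), 41
for _ in range(12):
    beam = int(np.argmax(belief))          # probe the most likely beam
    obs = response(beam, true_best) + 0.01 * rng.normal()
    likelihood = np.exp(-0.5 * ((obs - response(beam, idx)) / 0.05) ** 2)
    belief = belief * likelihood
    belief /= belief.sum()
print(int(np.argmax(belief)))              # concentrates near beam 41
```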

Result: The proposed method outperforms previous supervised learning approaches by orders of magnitude and successfully handles new or unforeseen trajectories and changes in the physical environment.

Conclusion: POMDP-based online search provides a superior approach to beam selection that addresses limitations of current supervised learning methods, offering better performance and adaptability to dynamic network conditions.

Abstract: Beamforming-capable antenna arrays with many elements enable higher data rates in next generation 5G and 6G networks. In current practice, analog beamforming uses a codebook of pre-configured beams with each of them radiating towards a specific direction, and a beam management function continuously selects optimal beams for moving user equipments (UEs). However, large codebooks and effects caused by reflections or blockages of beams make an optimal beam selection challenging. In contrast to previous work and standardization efforts that opt for supervised learning to train classifiers to predict the next best beam based on previously selected beams, we formulate the problem as a partially observable Markov decision process (POMDP) and model the environment as the codebook itself. At each time step, we select a candidate beam conditioned on the belief state of the unobservable optimal beam and previously probed beams. This frames the beam selection problem as an online search procedure that locates the moving optimal beam. In contrast to previous work, our method handles new or unforeseen trajectories and changes in the physical environment, and outperforms previous work by orders of magnitude.

[283] BERTO: an Adaptive BERT-based Network Time Series Predictor with Operator Preferences in Natural Language

Nitin Priyadarshini Shankar, Vaibhav Singh, Sheetal Kalyani, Christian Maciocco

Main category: cs.LG

TL;DR: BERTO is a BERT-based transformer framework for cellular network traffic prediction and energy optimization that uses natural language prompts to balance power savings vs. performance trade-offs.

DetailsMotivation: Cellular networks need intelligent solutions for traffic prediction and energy optimization to reduce power consumption while maintaining service quality. Existing models lack the ability to flexibly balance these competing objectives based on operator preferences.

Method: BERTO uses a BERT-based transformer architecture with a Balancing Loss Function and prompt-based customization. Natural language prompts guide the model to manage underprediction and overprediction according to operator intent, allowing adjustment of power savings vs. performance trade-offs.

Result: Experiments on real-world datasets show BERTO achieves 4.13% reduction in MSE compared to existing models. It operates over a flexible range of 1.4 kW in power and up to 9× variation in service quality, balancing power saving and performance through simple natural language inputs.

Conclusion: BERTO provides an effective framework for intelligent RAN deployments that combines high prediction accuracy with customizable energy optimization through natural language prompts, making it suitable for practical cellular network management.

Abstract: We introduce BERTO, a BERT-based framework for traffic prediction and energy optimization in cellular networks. Built on transformer architectures, BERTO delivers high prediction accuracy, while its Balancing Loss Function and prompt-based customization allow operators to adjust the trade-off between power savings and performance. Natural language prompts guide the model to manage underprediction and overprediction in accordance with the operator’s intent. Experiments on real-world datasets show that BERTO improves upon existing models with a $4.13$% reduction in MSE while introducing the feature of balancing competing objectives of power saving and performance through simple natural language inputs, operating over a flexible range of $1.4$ kW in power and up to $9\times$ variation in service quality, making it well suited for intelligent RAN deployments.
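
As a rough illustration of how an operator preference could trade underprediction against overprediction, here is a minimal asymmetric loss, assuming a quadratic form with a single weight `alpha`; the paper's actual Balancing Loss Function and its mapping from natural language prompts to weights are not specified here.

```python
import torch

def balancing_loss(pred, target, alpha=0.7):
    """Illustrative asymmetric loss: alpha > 0.5 penalizes underprediction
    (which risks service quality) more than overprediction (which wastes
    power). The weight alpha is a hypothetical stand-in for an
    operator-preference signal."""
    err = target - pred
    under = torch.clamp(err, min=0.0)   # demand above the forecast
    over = torch.clamp(-err, min=0.0)   # forecast above the demand
    return (alpha * under ** 2 + (1 - alpha) * over ** 2).mean()
```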

[284] Teaching Language Models Mechanistic Explainability Through Arrow-Pushing

Théo A. Neukomm, Zlatko Jončev, Philippe Schwaller

Main category: cs.LG

TL;DR: Language models trained on MechSMILES format achieve high accuracy in predicting chemical reaction mechanisms, enabling validation of synthesis plans, holistic atom mapping, and catalyst-aware template extraction.

DetailsMotivation: Current Computer-Assisted Synthesis Planning (CASP) systems lack mechanistic grounding, which is crucial for understanding synthesizability and chemical validity.

Method: Developed MechSMILES (compact textual format encoding molecular structure and electron flow) and trained language models on four mechanism prediction tasks using datasets like mech-USPTO-31k and FlowER.

Result: Models achieve >95% top-3 accuracy on elementary step prediction, 73% on mech-USPTO-31k, and 93% on FlowER for complete reaction mechanism retrieval.

Conclusion: Mechanistic understanding enables three applications: post-hoc validation for CASP systems, holistic atom-to-atom mapping, and catalyst-aware reaction template extraction, providing pathway toward more explainable and chemically valid synthesis planning.

Abstract: Chemical reaction mechanisms provide crucial insight into synthesizability, yet current Computer-Assisted Synthesis Planning (CASP) systems lack mechanistic grounding. We introduce a computational framework for teaching language models to predict chemical reaction mechanisms through arrow-pushing formalism, a century-old notation that tracks electron flow while respecting conservation laws. We developed MechSMILES, a compact textual format encoding molecular structure and electron flow, and trained language models on four mechanism prediction tasks of increasing complexity using mechanistic reaction datasets such as mech-USPTO-31k and FlowER. Our models achieve more than 95% top-3 accuracy on elementary step prediction and, on our hardest task, the retrieval of complete reaction mechanisms, surpass 73% on mech-USPTO-31k and 93% on the FlowER dataset. This mechanistic understanding enables three key applications. First, our models serve as post-hoc validators for CASP systems, filtering chemically implausible transformations. Second, they enable holistic atom-to-atom mapping that tracks all atoms, including hydrogens. Third, they extract catalyst-aware reaction templates that distinguish recycled catalysts from spectator species. By grounding predictions in physically meaningful electron moves that ensure conservation of mass and charge, this work provides a pathway toward more explainable and chemically valid computational synthesis planning, while providing an architecture-agnostic framework for the benchmarking of mechanism prediction.

[285] Towards agent-based-model informed neural networks

Nino Antulov-Fantulin

Main category: cs.LG

TL;DR: ABM-NNs: A framework for designing neural networks that preserve agent-based model constraints like mass conservation, network locality, and bounded rationality, validated on ecological, epidemiological, and economic systems.

DetailsMotivation: Standard neural differential equations fail to enforce important constraints in complex systems modeling where physical invariants are absent but other constraints (mass conservation, network locality, bounded rationality) must be preserved.

Method: Agent-Based-Model informed Neural Networks (ABM-NNs) using restricted graph neural networks and hierarchical decomposition to learn interpretable, structure-preserving dynamics that respect ABM constraints.

Result: Successful validation across three case studies: (1) Generalized Lotka-Volterra system - recovered ground-truth parameters from short trajectories with interventions; (2) Graph-based SIR contagion model - outperformed GCN, GraphSAGE, Graph Transformer in out-of-sample forecasting and noise robustness; (3) Real-world macroeconomic model - learned coupled GDP dynamics and enabled gradient-based counterfactual policy analysis.

Conclusion: ABM-NNs provide an effective framework for learning interpretable, constraint-preserving dynamics in complex systems, bridging neural networks with agent-based modeling principles for improved forecasting and policy analysis.

Abstract: In this article, we present a framework for designing neural networks that remain consistent with the underlying principles of agent-based models. We begin by highlighting the limitations of standard neural differential equations in modeling complex systems, where physical invariants (like energy) are often absent but other constraints (like mass conservation, network locality, bounded rationality) must be enforced. To address this, we introduce Agent-Based-Model informed Neural Networks (ABM-NNs), which leverage restricted graph neural networks and hierarchical decomposition to learn interpretable, structure-preserving dynamics. We validate the framework across three case studies of increasing complexity: (i) a generalized Lotka–Volterra system, where we recover ground-truth parameters from short trajectories in the presence of interventions; (ii) a graph-based SIR contagion model, where our method outperforms state-of-the-art graph learning baselines (GCN, GraphSAGE, Graph Transformer) in out-of-sample forecasting and noise robustness; and (iii) a real-world macroeconomic model of the ten largest economies, where we learn coupled GDP dynamics from empirical data and demonstrate gradient-based counterfactual analysis for policy interventions.
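
To convey what "structure-preserving" means here, the following minimal sketch hard-codes mass conservation into a graph SIR update: learned rates only move mass between compartments, so S+I+R is conserved exactly. This is an assumed toy construction in the spirit of ABM-NNs, not the paper's architecture.

```python
import torch
import torch.nn as nn

class ConservedSIRCell(nn.Module):
    """Toy structure-preserving SIR step on a graph: infection and recovery
    are expressed as mass transfers, so the total S + I + R per node is
    conserved by construction."""
    def __init__(self):
        super().__init__()
        self.log_beta = nn.Parameter(torch.tensor(-1.0))   # learned infection rate
        self.log_gamma = nn.Parameter(torch.tensor(-2.0))  # learned recovery rate

    def forward(self, s, i, r, adj, dt=0.1):
        beta, gamma = self.log_beta.exp(), self.log_gamma.exp()
        pressure = adj @ i                  # network-local infection pressure
        new_inf = dt * beta * s * pressure  # mass moved from S to I
        new_rec = dt * gamma * i            # mass moved from I to R
        return s - new_inf, i + new_inf - new_rec, r + new_rec
```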

[286] Learnability Window in Gated Recurrent Neural Networks

Lorenzo Livi

Main category: cs.LG

TL;DR: Gating mechanisms in RNNs determine learnability windows via effective learning rates, not just numerical stability. These rates act as filters controlling gradient transport, with sample complexity scaling inversely with their decay under heavy-tailed noise.

DetailsMotivation: Classical analyses focus on numerical stability of Jacobian products, but this is insufficient to explain learnability. The paper aims to develop a theoretical framework that explains how gating mechanisms actually determine the temporal horizon over which gradient information remains statistically recoverable in recurrent neural networks.

Method: Develops a theoretical framework analyzing effective learning rates μ_{t,ℓ} obtained from first-order expansions of gate-induced Jacobian products in Backpropagation Through Time. Uses heavy-tailed (α-stable) gradient noise analysis to derive minimal sample size requirements and learnability window formulas.

Result: Shows that minimal sample size to detect dependency at lag ℓ scales as N(ℓ) ∝ f(ℓ)^{-α}, where f(ℓ) = ||μ_{t,ℓ}||_1 is the effective learning rate envelope. Derives explicit formula for learnability window H_N and closed-form scaling laws for logarithmic, polynomial, and exponential decay patterns. Finds that broader/more heterogeneous gate spectra produce larger learnability windows, while heavier-tailed noise compresses them.

Conclusion: Effective learning rates, not numerical stability alone, are the fundamental quantities governing when and for how long gated recurrent networks can learn long-range temporal dependencies. The framework links gate-induced time-scale structure, gradient noise, and sample complexity to explain learnability windows.

Abstract: We develop a theoretical framework that explains how gating mechanisms determine the learnability window $\mathcal{H}_N$ of recurrent neural networks, defined as the largest temporal horizon over which gradient information remains statistically recoverable. While classical analyses emphasize numerical stability of Jacobian products, we show that stability alone is insufficient: learnability is governed instead by the effective learning rates $\mu_{t,\ell}$, per-lag and per-neuron quantities obtained from first-order expansions of gate-induced Jacobian products in Backpropagation Through Time. These effective learning rates act as multiplicative filters that control both the magnitude and anisotropy of gradient transport. Under heavy-tailed ($\alpha$-stable) gradient noise, we prove that the minimal sample size required to detect a dependency at lag $\ell$ satisfies $N(\ell)\propto f(\ell)^{-\alpha}$, where $f(\ell)=\|\mu_{t,\ell}\|_1$ is the effective learning rate envelope. This leads to an explicit formula for $\mathcal{H}_N$ and closed-form scaling laws for logarithmic, polynomial, and exponential decay of $f(\ell)$. The theory predicts that broader or more heterogeneous gate spectra produce slower decay of $f(\ell)$ and hence larger learnability windows, whereas heavier-tailed noise compresses $\mathcal{H}_N$ by slowing statistical concentration. By linking gate-induced time-scale structure, gradient noise, and sample complexity, the framework identifies the effective learning rates as the fundamental quantities that govern when, and for how long, gated recurrent networks can learn long-range temporal dependencies.
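
The scaling law invites a simple numerical reading: given a decay profile f(ℓ) and tail index α, the learnability window is the largest lag whose required sample size fits the data budget. The sketch below assumes an arbitrary proportionality constant `C`, which the paper's bound leaves implicit.

```python
import numpy as np

def learnability_window(N, f, alpha, C=1.0, max_lag=1000):
    """Largest lag ell with required sample size C * f(ell)**(-alpha) <= N.
    f is the effective-learning-rate envelope; C is an illustrative
    constant absorbing the proportionality in the bound."""
    lags = np.arange(1, max_lag + 1)
    required = C * f(lags) ** (-alpha)
    feasible = lags[required <= N]
    return int(feasible.max()) if feasible.size else 0

# Exponential decay of f yields a logarithmically growing window in N.
H = learnability_window(N=10_000, f=lambda l: 0.9 ** l, alpha=1.5)
```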

[287] Mechanistic Interpretability of Antibody Language Models Using SAEs

Rebonto Haque, Oliver M. Turnbull, Anisha Parsan, Nithin Parsan, John J. Yang, Charlotte M. Deane

Main category: cs.LG

TL;DR: TopK SAEs reveal biological concepts in antibody language models but lack causal control, while Ordered SAEs enable precise generative steering at the cost of interpretability.

DetailsMotivation: To investigate learned concepts in autoregressive antibody language models (p-IgGen) and develop methods for steering antibody generation using sparse autoencoders.

Method: Employed two types of sparse autoencoders (SAEs): TopK SAEs for feature-concept mapping and Ordered SAEs with hierarchical structure for generative steering.

Result: TopK SAEs revealed biologically meaningful latent features but high concept correlation didn’t guarantee causal control. Ordered SAEs reliably identified steerable features but produced more complex, less interpretable activation patterns.

Conclusion: TopK SAEs are sufficient for mapping latent features to concepts, while Ordered SAEs are preferable for precise generative steering in antibody language models.

Abstract: Sparse autoencoders (SAEs) are a mechanistic interpretability technique that have been used to provide insight into learned concepts within large protein language models. Here, we employ TopK and Ordered SAEs to investigate an autoregressive antibody language model, p-IgGen, and steer its generation. We show that TopK SAEs can reveal biologically meaningful latent features, but high feature concept correlation does not guarantee causal control over generation. In contrast, Ordered SAEs impose a hierarchical structure that reliably identifies steerable features, but at the expense of more complex and less interpretable activation patterns. These findings advance the mechanistic interpretability of domain-specific protein language models and suggest that, while TopK SAEs are sufficient for mapping latent features to concepts, Ordered SAEs are preferable when precise generative steering is required.
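
For readers unfamiliar with the TopK variant, here is a minimal generic TopK sparse autoencoder; training details used in this line of work (tied or normalized decoders, auxiliary losses) are omitted, and the sketch is not specific to p-IgGen.

```python
import torch
import torch.nn as nn

class TopKSAE(nn.Module):
    """Minimal TopK sparse autoencoder: keep only the k largest latent
    activations per input, zeroing the rest before decoding."""
    def __init__(self, d_model, d_latent, k):
        super().__init__()
        self.enc = nn.Linear(d_model, d_latent)
        self.dec = nn.Linear(d_latent, d_model)
        self.k = k

    def forward(self, x):
        z = torch.relu(self.enc(x))
        topk = torch.topk(z, self.k, dim=-1)
        z_sparse = torch.zeros_like(z).scatter(-1, topk.indices, topk.values)
        return self.dec(z_sparse), z_sparse
```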

[288] Utility Boundary of Dataset Distillation: Scaling and Configuration-Coverage Laws

Zhengquan Luo, Zhiqiang Xu

Main category: cs.LG

TL;DR: The paper proposes a unified theoretical framework for dataset distillation that explains performance saturation and configuration robustness through scaling and coverage laws.

DetailsMotivation: Existing dataset distillation methods lack theoretical foundations - they use different surrogate objectives and optimization assumptions, making it hard to analyze common principles or provide guarantees. It's unclear when distilled data remains effective across different training configurations (optimizer, architecture, augmentation).

Method: Proposes a “configuration-dynamics-error analysis” framework that reformulates major DD approaches under a common generalization-error perspective. This provides: (1) a scaling law showing error decreases with distilled sample size, explaining performance saturation; (2) a coverage law showing required sample size scales linearly with configuration diversity.

Result: The unified analysis reveals that various matching methods (gradient, distribution, trajectory) are interchangeable surrogates that reduce the same generalization error. Experiments across diverse methods and configurations empirically confirm the derived laws.

Conclusion: The framework advances theoretical foundations for dataset distillation, enabling theory-driven design of compact, configuration-robust distilled datasets. It explains why different methods work and provides guidance on how surrogate choices affect sample efficiency and robustness.

Abstract: Dataset distillation (DD) aims to construct compact synthetic datasets that allow models to achieve comparable performance to full-data training while substantially reducing storage and computation. Despite rapid empirical progress, its theoretical foundations remain limited: existing methods (gradient, distribution, trajectory matching) are built on heterogeneous surrogate objectives and optimization assumptions, which makes it difficult to analyze their common principles or provide general guarantees. Moreover, it is still unclear under what conditions distilled data can retain the effectiveness of full datasets when the training configuration, such as optimizer, architecture, or augmentation, changes. To answer these questions, we propose a unified theoretical framework, termed configuration–dynamics–error analysis, which reformulates major DD approaches under a common generalization-error perspective and provides two main results: (i) a scaling law that provides a single-configuration upper bound, characterizing how the error decreases as the distilled sample size increases and explaining the commonly observed performance saturation effect; and (ii) a coverage law showing that the required distilled sample size scales linearly with configuration diversity, with provably matching upper and lower bounds. In addition, our unified analysis reveals that various matching methods are interchangeable surrogates, reducing the same generalization error, clarifying why they can all achieve dataset distillation and providing guidance on how surrogate choices affect sample efficiency and robustness. Experiments across diverse methods and configurations empirically confirm the derived laws, advancing a theoretical foundation for DD and enabling theory-driven design of compact, configuration-robust dataset distillation.

[289] Approximation of Box Decomposition Algorithm for Fast Hypervolume-Based Multi-Objective Optimization

Shuhei Watanabe

Main category: cs.LG

TL;DR: This paper provides comprehensive mathematical and algorithmic details for an approximation algorithm to address the super-polynomial memory complexity of hypervolume box-decomposition in Bayesian optimization.

DetailsMotivation: The computational cost of optimizing acquisition functions in hypervolume-based Bayesian optimization remains a bottleneck due to expensive hypervolume improvement calculations. While box-decomposition helps with frequent exact calculations, it suffers from super-polynomial memory complexity O(MN^⌊(M+1)/2⌋) in worst cases.

Method: The paper provides comprehensive mathematical and algorithmic details for an approximation algorithm originally proposed by Couckuyt et al. (2012) but lacking rigorous description in literature. This bridges the gap by formalizing the approximation approach to hypervolume calculations.

Result: The paper delivers a complete algorithmic description of the approximation method for hypervolume calculations, addressing the memory complexity issues of exact box-decomposition approaches.

Conclusion: By providing rigorous mathematical and algorithmic details for the previously proposed approximation algorithm, this work enables more efficient hypervolume-based Bayesian optimization while overcoming the super-polynomial memory limitations of exact methods.

Abstract: Hypervolume (HV)-based Bayesian optimization (BO) is one of the standard approaches for multi-objective decision-making. However, the computational cost of optimizing the acquisition function remains a significant bottleneck, primarily due to the expense of HV improvement calculations. While HV box-decomposition offers an efficient way to cope with the frequent exact improvement calculations, the decomposition proposed by Lacour et al. (2017) suffers from super-polynomial memory complexity $O(MN^{\lfloor \frac{M + 1}{2} \rfloor})$ in the worst case. To tackle this problem, Couckuyt et al. (2012) employed an approximation algorithm. However, a rigorous algorithmic description is currently absent from the literature. This paper bridges this gap by providing comprehensive mathematical and algorithmic details of this approximation algorithm.
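
For contrast with the M-objective box-decomposition machinery, the two-objective case admits a short exact computation by sweeping the Pareto front; the sketch below (minimization, with a reference point) is a baseline illustration, not the approximation algorithm the paper formalizes.

```python
def hypervolume_2d(points, ref):
    """Exact 2D hypervolume (for minimization) dominated by `points`
    relative to the reference point `ref`."""
    # Keep points that strictly dominate the reference, sorted by f1.
    pts = sorted(p for p in points if p[0] < ref[0] and p[1] < ref[1])
    hv, prev_y = 0.0, ref[1]
    for x, y in pts:
        if y < prev_y:  # non-dominated so far in the sweep
            hv += (ref[0] - x) * (prev_y - y)
            prev_y = y
    return hv

# Example: hypervolume_2d([(1, 3), (2, 2), (3, 1)], ref=(4, 4)) == 6.0
```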

[290] NEAT: Neighborhood-Guided, Efficient, Autoregressive Set Transformer for 3D Molecular Generation

Daniel Rose, Roxane Axel Jacob, Johannes Kirchmair, Thierry Langer

Main category: cs.LG

TL;DR: NEAT is an autoregressive model for 3D molecular generation that treats molecules as sets of atoms, achieving permutation invariance and near state-of-the-art performance with high efficiency.

DetailsMotivation: Existing autoregressive models for 3D molecular generation rely on artificial token orders (canonical orders or focus atoms), which creates a mismatch since molecular graphs should be permutation-invariant. The authors argue this limitation is unnecessary and propose a truly order-agnostic approach.

Method: NEAT (Neighborhood-guided, Efficient, Autoregressive, Set Transformer) treats molecular graphs as sets of atoms and uses an autoregressive flow model to learn order-agnostic distributions over admissible tokens at the graph boundary. It maintains permutation invariance through set-based architecture.

Result: NEAT approaches state-of-the-art performance in 3D molecular generation with high computational efficiency while maintaining atom-level permutation invariance.

Conclusion: NEAT establishes a practical foundation for scalable molecular design by overcoming the token order limitation of previous autoregressive models, offering an efficient and permutation-invariant approach to 3D molecular generation.

Abstract: Autoregressive models are a promising alternative to diffusion-based models for 3D molecular structure generation. However, a key limitation is the assumption of a token order: while text has a natural sequential order, the next token prediction given a molecular graph prefix should be invariant to atom permutations. Previous works sidestepped this mismatch by using canonical orders or focus atoms. We argue that this is unnecessary. We introduce NEAT, a Neighborhood-guided, Efficient, Autoregressive, Set Transformer that treats molecular graphs as sets of atoms and learns the order-agnostic distribution over admissible tokens at the graph boundary with an autoregressive flow model. NEAT approaches state-of-the-art performance in 3D molecular generation with high computational efficiency and atom-level permutation invariance, establishing a practical foundation for scalable molecular design.

[291] Sparse Attention Post-Training for Mechanistic Interpretability

Florent Draye, Anson Lei, Ingmar Posner, Bernhard Schölkopf

Main category: cs.LG

TL;DR: Simple post-training method makes transformer attention extremely sparse (≈0.3% edges) without performance loss, revealing redundancy and enabling more interpretable models.

DetailsMotivation: Transformers have dense attention patterns that may contain significant redundancy. The authors aim to explore whether attention can be made much sparser while maintaining performance, using sparsity as a structural prior for better organization and interpretability rather than just computational efficiency.

Method: Post-training method applying flexible sparsity regularization under a constrained-loss objective. The approach preserves original pretraining loss while dramatically reducing attention connectivity to approximately 0.3% of edges in models up to 1B parameters.

Result: Achieved extreme sparsity (≈0.3% edges) without sacrificing performance. This local sparsity cascades into global circuit simplification: task-specific circuits involve far fewer components (attention heads and MLPs) with up to 100x fewer connecting edges.

Conclusion: Transformer attention can be made orders of magnitude sparser, suggesting much computation is redundant. Sparsity may serve as a guiding principle for creating more structured and interpretable models, exposing organized connectivity patterns while preserving capability.

Abstract: We introduce a simple post-training method that makes transformer attention sparse without sacrificing performance. Applying a flexible sparsity regularisation under a constrained-loss objective, we show on models up to 1B parameters that it is possible to retain the original pretraining loss while reducing attention connectivity to $\approx 0.3\%$ of its edges. Unlike sparse-attention methods designed for computational efficiency, our approach leverages sparsity as a structural prior: it preserves capability while exposing a more organized and interpretable connectivity pattern. We find that this local sparsity cascades into global circuit simplification: task-specific circuits involve far fewer components (attention heads and MLPs) with up to 100x fewer edges connecting them. These results demonstrate that transformer attention can be made orders of magnitude sparser, suggesting that much of its computation is redundant and that sparsity may serve as a guiding principle for more structured and interpretable models.
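
One simple way to realize a "constrained-loss" sparsity objective is a Lagrangian that minimizes an L1 proxy for attention edges subject to the task loss staying within a budget. The sketch below is an assumed generic formulation; the paper's exact regularizer and schedule may differ.

```python
import torch

def constrained_sparsity_step(task_loss, attn_probs, loss_budget, lam, lr_lam=0.01):
    """One step of an assumed Lagrangian for sparsifying attention:
    minimize mean |attention| subject to task_loss <= loss_budget.
    `lam` is a scalar tensor acting as the Lagrange multiplier."""
    sparsity = attn_probs.abs().mean()  # L1 proxy for the number of edges
    objective = sparsity + lam * (task_loss - loss_budget)
    # Dual ascent: raise lam while the loss constraint is violated.
    lam_new = torch.clamp(lam + lr_lam * (task_loss - loss_budget).detach(), min=0.0)
    return objective, lam_new
```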

[292] Predicting Price Movements in High-Frequency Financial Data with Spiking Neural Networks

Brian Ezinwoke, Oliver Rhodes

Main category: cs.LG

TL;DR: SNNs with Bayesian Optimization and novel PSA objective outperform conventional models for high-frequency price spike forecasting, achieving 76.8% cumulative return.

DetailsMotivation: Conventional financial models fail to capture fine temporal structure needed for high-frequency trading, especially during sudden price spikes. SNNs offer biological inspiration for processing discrete events at millisecond scales.

Method: Convert high-frequency stock data to spike trains, evaluate three SNN architectures (STDP-trained, novel inhibitory competition, supervised backpropagation), use Bayesian Optimization with novel Penalized Spike Accuracy objective to align predicted spike rates with empirical rates.

Result: PSA-tuned models consistently outperformed SA-tuned counterparts and baselines. Extended SNN with PSA achieved highest cumulative return (76.8%) in backtesting, significantly surpassing supervised alternative (42.54% return).

Conclusion: SNNs with robust task-specific tuning (PSA objective) show strong potential for effective price spike forecasting in high-frequency trading environments.

Abstract: Modern high-frequency trading (HFT) environments are characterized by sudden price spikes that present both risk and opportunity, but conventional financial models often fail to capture the required fine temporal structure. Spiking Neural Networks (SNNs) offer a biologically inspired framework well-suited to these challenges due to their natural ability to process discrete events and preserve millisecond-scale timing. This work investigates the application of SNNs to high-frequency price-spike forecasting, enhancing performance via robust hyperparameter tuning with Bayesian Optimization (BO). We convert high-frequency stock data into spike trains and evaluate three architectures: an established unsupervised STDP-trained SNN, a novel SNN with explicit inhibitory competition, and a supervised backpropagation network. BO was driven by a novel objective, Penalized Spike Accuracy (PSA), designed to ensure a network’s predicted price spike rate aligns with the empirical rate of price events. Simulated trading demonstrated that models optimized with PSA consistently outperformed their Spike Accuracy (SA)-tuned counterparts and baselines. Specifically, the extended SNN model with PSA achieved the highest cumulative return (76.8%) in simple backtesting, significantly surpassing the supervised alternative (42.54% return). These results validate the potential of spiking networks, when robustly tuned with task-specific objectives, for effective price spike forecasting in HFT.
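
The PSA objective is described as aligning the predicted spike rate with the empirical rate of price events; a plausible minimal form, assumed here rather than taken from the paper, is accuracy minus a rate-mismatch penalty.

```python
import numpy as np

def penalized_spike_accuracy(pred_spikes, true_spikes, penalty=1.0):
    """Assumed PSA-style score: binary spike accuracy minus a penalty on
    the gap between predicted and empirical spike rates. The exact PSA
    definition is the paper's; this form only conveys the idea."""
    acc = np.mean(pred_spikes == true_spikes)
    rate_gap = abs(pred_spikes.mean() - true_spikes.mean())
    return acc - penalty * rate_gap
```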

[293] Neural Coherence: Find higher performance to out-of-distribution tasks from few samples

Simon Guiroy, Mats Richter, Sarath Chandar, Christopher Pal

Main category: cs.LG

TL;DR: Neural Coherence: A novel model selection method using activation statistics from just a few unlabeled target examples to choose the best pre-trained checkpoint for downstream tasks, especially effective with scarce, unlabeled, out-of-distribution data.

DetailsMotivation: Current model selection methods for fine-tuning pre-trained vision models rely on in-distribution validation data, which becomes unreliable when target task data is scarce, unlabeled, and out-of-distribution. There's a need for data-efficient selection methods that work with minimal target information.

Method: Proposes Neural Coherence - a principle that characterizes model activation statistics for both source (pre-training) and target domains. This allows defining model selection methods that compare activation patterns between domains using just a few unlabeled target examples to identify the most suitable checkpoint.

Result: Significantly improves generalization across diverse target domains (Food-101, PlantNet-300K, iNaturalist) compared to established baselines. Also demonstrates effectiveness in meta-learning settings and shows versatility as a principle for training data selection.

Conclusion: Neural Coherence provides a powerful, data-efficient approach for model selection that works reliably with minimal unlabeled target data, addressing the challenge of selecting optimal checkpoints for fine-tuning when target data is scarce and out-of-distribution.

Abstract: To create state-of-the-art models for many downstream tasks, it has become common practice to fine-tune a pre-trained large vision model. However, it remains an open question how best to determine which of the many possible model checkpoints resulting from a large training run to use as the starting point. This becomes especially important when data for the target task of interest is scarce, unlabeled and out-of-distribution. In such scenarios, common methods relying on in-distribution validation data become unreliable or inapplicable. This work proposes a novel approach for model selection that operates reliably on just a few unlabeled examples from the target task. Our approach is based on a novel concept: Neural Coherence, which entails characterizing a model’s activation statistics for source and target domains, allowing one to define model selection methods with high data-efficiency. We provide experiments where models are pre-trained on ImageNet1K and examine target domains consisting of Food-101, PlantNet-300K and iNaturalist. We also evaluate it in many meta-learning settings. Our approach significantly improves generalization across these different target domains compared to established baselines. We further demonstrate the versatility of Neural Coherence as a powerful principle by showing its effectiveness in training data selection.
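
As a rough sketch of comparing activation statistics across domains, one could score each checkpoint by the distance between per-unit activation moments on source and target batches and keep the closest one; the specific statistic below is an assumption, not the paper's Neural Coherence measure.

```python
import torch

def activation_statistics_distance(source_acts, target_acts):
    """Assumed proxy: L1 distance between per-unit activation means and
    standard deviations on source vs. target batches (shape: [N, d]).
    Lower values suggest the checkpoint's features transfer better."""
    mu_gap = (source_acts.mean(0) - target_acts.mean(0)).abs()
    sd_gap = (source_acts.std(0) - target_acts.std(0)).abs()
    return (mu_gap + sd_gap).mean()
```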

[294] Computational Design of Low-Volatility Lubricants for Space Using Interpretable Machine Learning

Daniel Miliate, Ashlie Martini

Main category: cs.LG

TL;DR: ML models predict vapor pressure for space lubricants, enabling discovery of new molecules for moving mechanical assemblies in vacuum conditions.

DetailsMotivation: Space applications require liquid lubricants with extremely low vapor pressures for moving mechanical assemblies, but current options are limited and impose design constraints.

Method: Data-driven machine learning approach using molecular dynamics simulations and experimental databases to predict vapor pressure, with emphasis on model interpretability.

Result: ML models identify relationships between chemical structure and vapor pressure, leading to proposed candidate molecules for space lubricant applications.

Conclusion: Machine learning enables virtual screening and discovery of new space-suitable liquid lubricants, potentially overcoming current limitations in moving mechanical assembly designs.

Abstract: The function and lifetime of moving mechanical assemblies (MMAs) in space depend on the properties of lubricants. MMAs that experience high speeds or high cycles require liquid-based lubricants due to their ability to reflow to the point of contact. However, only a few liquid-based lubricants have vapor pressures low enough for the vacuum conditions of space, each of which has limitations that add constraints to MMA designs. This work introduces a data-driven machine learning (ML) approach to predicting vapor pressure, enabling virtual screening and discovery of new space-suitable liquid lubricants. The ML models are trained with data from both high-throughput molecular dynamics simulations and experimental databases. The models are designed to prioritize interpretability, enabling the relationships between chemical structure and vapor pressure to be identified. Based on these insights, several candidate molecules are proposed that may have promise for future space lubricant applications in MMAs.

[295] Fair Text Classification via Transferable Representations

Thibaud Leteno, Michael Perrot, Charlotte Laclau, Antoine Gourru, Christophe Gravier

Main category: cs.LG

TL;DR: Proposes using Wasserstein Dependency Measure and adversarial training to achieve group fairness in neural text classifiers, with domain adaptation to avoid needing sensitive attributes.

DetailsMotivation: Group fairness remains an open challenge in text classification, requiring fair treatment between sensitive groups (e.g., women and men). Current methods struggle to distinguish fair from unfair information in text encoders.

Method: Extends Wasserstein Dependency Measure for learning unbiased classifiers. Uses adversarial training to induce independence between target label representations and sensitive attribute representations. Leverages Domain Adaptation to remove need for sensitive attributes in the dataset.

Result: Provides both theoretical and empirical evidence that the approach is well-founded, suggesting it effectively addresses fairness challenges in text classification.

Conclusion: The proposed method offers a principled approach to achieving group fairness in neural text classifiers by combining Wasserstein Dependency Measure, adversarial training, and domain adaptation techniques.

Abstract: Group fairness is a central research topic in text classification, where reaching fair treatment between sensitive groups (e.g., women and men) remains an open challenge. We propose an approach that extends the use of the Wasserstein Dependency Measure for learning unbiased neural text classifiers. Given the challenge of distinguishing fair from unfair information in a text encoder, we draw inspiration from adversarial training by inducing independence between representations learned for the target label and those for a sensitive attribute. We further show that Domain Adaptation can be efficiently leveraged to remove the need for access to the sensitive attributes in the dataset we curate. We provide both theoretical and empirical evidence that our approach is well-founded.

[296] DAE-HardNet: A Physics Constrained Neural Network Enforcing Differential-Algebraic Hard Constraints

Rahul Golder, Bimol Nath Roy, M. M. Faruque Hasan

Main category: cs.LG

TL;DR: DAE-HardNet is a physics-constrained neural network that strictly enforces differential-algebraic equations by learning both functions and derivatives simultaneously with a differentiable projection layer.

DetailsMotivation: Traditional PINNs only minimize physics constraints softly and don't strictly satisfy differential-algebraic equations, especially when constraints involve differential operators. There's a need for models that can strictly embed domain knowledge and first-principles in data-driven models.

Method: DAE-HardNet learns both functions and their derivatives simultaneously while enforcing algebraic and differential constraints. It uses a differentiable projection layer to project model predictions onto the constraint manifold, ensuring strict satisfaction of DAEs.

Result: Compared to MLPs and PINNs, DAE-HardNet achieves orders of magnitude reduction in physics loss while maintaining prediction accuracy. It successfully learns derivatives and can estimate unknown parameters through parameter estimation problems.

Conclusion: DAE-HardNet provides a physics-constrained approach that strictly satisfies DAEs, offers improved derivative learning, and enables faster inference by potentially bypassing the projection layer for specific problems.

Abstract: Traditional physics-informed neural networks (PINNs) do not always satisfy physics-based constraints, especially when the constraints include differential operators. Rather, they minimize the constraint violations in a soft way. Strict satisfaction of differential-algebraic equations (DAEs) to embed domain knowledge and first-principles in data-driven models is generally challenging. This is because data-driven models consider the original functions to be black-box whose derivatives can only be obtained after evaluating the functions. We introduce DAE-HardNet, a physics-constrained (rather than simply physics-informed) neural network that learns both the functions and their derivatives simultaneously, while enforcing algebraic as well as differential constraints. This is done by projecting model predictions onto the constraint manifold using a differentiable projection layer. We apply DAE-HardNet to several systems and test problems governed by DAEs, including the dynamic Lotka-Volterra predator-prey system and transient heat conduction. We also show the ability of DAE-HardNet to estimate unknown parameters through a parameter estimation problem. Compared to multilayer perceptrons (MLPs) and PINNs, DAE-HardNet achieves orders of magnitude reduction in the physics loss while maintaining the prediction accuracy. It has the added benefit of learning the derivatives, which improves the constrained learning of the backbone neural network prior to the projection layer. For specific problems, this suggests that the projection layer can be bypassed for faster inference. The current implementation and codes are available at https://github.com/SOULS-TAMU/DAE-HardNet.
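
For the algebraic part of the constraints, the projection layer has a classical closed form: the orthogonal projection onto an affine manifold {x : Ax = b}. The sketch below covers only this linear case; handling differential constraints, as DAE-HardNet does, requires additionally relating the network's function and derivative outputs.

```python
import torch

def project_onto_linear_constraints(x, A, b):
    """Differentiable orthogonal projection of a prediction x onto
    {x : A x = b}, via x* = x - A^T (A A^T)^{-1} (A x - b)."""
    residual = A @ x - b
    correction = A.T @ torch.linalg.solve(A @ A.T, residual)
    return x - correction
```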

[297] LittleBit: Ultra Low-Bit Quantization via Latent Factorization

Banseok Lee, Dongkyu Kim, Youngcheon You, Youngmin Kim

Main category: cs.LG

TL;DR: LittleBit enables extreme LLM compression to 0.1 bits per weight using low-rank matrix factorization with binarization and multi-scale compensation, achieving 31× memory reduction while maintaining performance.

DetailsMotivation: Large language models face substantial memory and computational costs, making deployment challenging. While quantization helps, performance degradation in sub-1-bit regimes remains particularly difficult, limiting practical compression for resource-constrained environments.

Method: LittleBit uses latent matrix factorization to represent weights in low-rank form, then binarizes these factors. It incorporates multi-scale compensation (row, column, and latent dimension with per-rank importance) to counteract information loss. Key innovations include Dual Sign-Value-Independent Decomposition for QAT initialization and integrated Residual Compensation for error mitigation.

Result: Achieves 0.1 bits per weight compression with nearly 31× memory reduction (e.g., Llama2-13B to under 0.9 GB). Performance at 0.1 BPW on Llama2-7B surpasses leading methods at 0.7 BPW. Enables 11.6× kernel-level speedup over FP16.

Conclusion: LittleBit establishes a new viable size-performance trade-off for extreme LLM compression, making powerful LLMs practical for resource-constrained environments while maintaining competitive performance at unprecedented compression levels.

Abstract: Deploying large language models (LLMs) often faces challenges from substantial memory and computational costs. Quantization offers a solution, yet performance degradation in the sub-1-bit regime remains particularly difficult. This paper introduces LittleBit, a novel method for extreme LLM compression. It targets levels like 0.1 bits per weight (BPW), achieving nearly 31$\times$ memory reduction, e.g., Llama2-13B to under 0.9 GB. LittleBit represents weights in a low-rank form using latent matrix factorization, subsequently binarizing these factors. To counteract information loss from this extreme precision, it integrates a multi-scale compensation mechanism. This includes row, column, and an additional latent dimension that learns per-rank importance. Two key contributions enable effective training: Dual Sign-Value-Independent Decomposition (Dual-SVID) for quantization-aware training (QAT) initialization, and integrated Residual Compensation to mitigate errors. Extensive experiments confirm LittleBit’s superiority in sub-1-bit quantization: e.g., its 0.1 BPW performance on Llama2-7B surpasses the leading method’s 0.7 BPW. LittleBit establishes a new, viable size-performance trade-off, unlocking a potential 11.6$\times$ speedup over FP16 at the kernel level, and makes powerful LLMs practical for resource-constrained environments. Our code can be found at https://github.com/SamsungLabs/LittleBit.
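
The shape of the idea, independent of training details, is a weight reconstructed from 1-bit low-rank factors with learned scale compensation. The sketch below is an assumed reconstruction in the spirit of the description (row, column, and per-rank scales); the factor shapes are illustrative and the Dual-SVID initialization is not reproduced here.

```python
import torch

def littlebit_style_weight(U, V, row_scale, col_scale, rank_scale):
    """Assumed reconstruction: W_hat = diag(row) @ sign(U) diag(rank)
    sign(V)^T @ diag(col), with U: (out, r), V: (in, r), scales 1-D.
    Only the signs of U and V need storing, at ~1 bit per entry."""
    core = (torch.sign(U) * rank_scale) @ torch.sign(V).T
    return row_scale[:, None] * core * col_scale[None, :]
```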

[298] NeuroMemFPP: A recurrent neural approach for memory-aware parameter estimation in fractional Poisson process

Neha Gupta, Aditya Maheshwari

Main category: cs.LG

TL;DR: LSTM-based framework for estimating fractional Poisson process parameters from inter-arrival time sequences, outperforming traditional method of moments by 55.3% MSE reduction and showing effectiveness on real-world high-frequency data.

DetailsMotivation: The fractional Poisson process models event arrivals with memory and long-range dependence, but traditional parameter estimation methods like method of moments may not effectively capture temporal dependencies in inter-arrival time sequences.

Method: Recurrent neural network (RNN) framework using Long Short-Term Memory (LSTM) networks to estimate FPP parameters μ and β from sequences of inter-arrival times, leveraging LSTM’s ability to model temporal dependencies.

Result: On synthetic data: 55.3% reduction in mean squared error compared to traditional method of moments. On real-world data: LSTM effectively tracked daily patterns and parameter changes in emergency call records and AAPL stock trading data, demonstrating effectiveness on complex time-dependent real-world data.

Conclusion: LSTM-based approach provides superior parameter estimation for fractional Poisson processes compared to traditional methods, effectively capturing temporal dependencies in both synthetic and real-world high-frequency event data.

Abstract: In this paper, we propose a recurrent neural network (RNN)-based framework for estimating the parameters of the fractional Poisson process (FPP), which models event arrivals with memory and long-range dependence. The Long Short-Term Memory (LSTM) network estimates the key parameters $\mu>0$ and $\beta\in(0,1)$ from sequences of inter-arrival times, effectively modeling their temporal dependencies. Our experiments on synthetic data show that the proposed approach reduces the mean squared error (MSE) by about 55.3% compared to the traditional method of moments (MOM) and performs reliably across different training conditions. We tested the method on two real-world high-frequency datasets: emergency call records from Montgomery County, PA, and AAPL stock trading data. The results show that the LSTM can effectively track daily patterns and parameter changes, indicating its effectiveness on real-world data with complex time dependencies.
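
A minimal version of such an estimator is an LSTM that reads the inter-arrival sequence and squashes two outputs into the admissible ranges; the output parameterization below ($\mu$ via softplus, $\beta$ via sigmoid) is an illustrative choice, not necessarily the paper's.

```python
import torch
import torch.nn as nn

class FPPEstimator(nn.Module):
    """Sketch: LSTM regressor mapping inter-arrival times (batch, T, 1)
    to estimates of mu > 0 and beta in (0, 1)."""
    def __init__(self, hidden=64):
        super().__init__()
        self.lstm = nn.LSTM(input_size=1, hidden_size=hidden, batch_first=True)
        self.head = nn.Linear(hidden, 2)

    def forward(self, dt):
        _, (h, _) = self.lstm(dt)
        raw = self.head(h[-1])
        mu = torch.nn.functional.softplus(raw[:, 0])  # mu > 0
        beta = torch.sigmoid(raw[:, 1])               # beta in (0, 1)
        return mu, beta
```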

[299] Impugan: Learning Conditional Generative Models for Robust Data Imputation

Zalish Mahmud, Anantaa Kotal, Aritran Piplai

Main category: cs.LG

TL;DR: Impugan is a conditional GAN for imputing missing values and integrating heterogeneous datasets, outperforming traditional methods by capturing complex nonlinear relationships.

DetailsMotivation: Real-world data often has missing values due to sensor failures, inconsistent records, and heterogeneous data sources with different scales, sampling rates, and quality. Traditional imputation methods rely on strong linearity and independence assumptions that don't hold for complex data, leading to biased estimates.

Method: Impugan uses a conditional Generative Adversarial Network (cGAN) trained on complete samples to learn dependencies between missing and observed variables. The generator reconstructs missing entries from available features, while the discriminator distinguishes true from imputed data, enforcing realism through adversarial training.

Result: Impugan achieves up to 82% lower Earth Mover’s Distance (EMD) and 70% lower mutual-information deviation compared to leading baselines on benchmark datasets and multi-source integration tasks.

Conclusion: Adversarially trained generative models provide a scalable and principled approach for imputing and merging incomplete, heterogeneous data, capturing nonlinear and multimodal relationships that conventional methods cannot represent.

Abstract: Incomplete data are common in real-world applications. Sensors fail, records are inconsistent, and datasets collected from different sources often differ in scale, sampling rate, and quality. These differences create missing values that make it difficult to combine data and build reliable models. Standard imputation methods such as regression models, expectation-maximization, and multiple imputation rely on strong assumptions about linearity and independence. These assumptions rarely hold for complex or heterogeneous data, which can lead to biased or over-smoothed estimates. We propose Impugan, a conditional Generative Adversarial Network (cGAN) for imputing missing values and integrating heterogeneous datasets. The model is trained on complete samples to learn how missing variables depend on observed ones. During inference, the generator reconstructs missing entries from available features, and the discriminator enforces realism by distinguishing true from imputed data. This adversarial process allows Impugan to capture nonlinear and multimodal relationships that conventional methods cannot represent. In experiments on benchmark datasets and a multi-source integration task, Impugan achieves up to 82% lower Earth Mover’s Distance (EMD) and 70% lower mutual-information deviation (MI) compared to leading baselines. These results show that adversarially trained generative models provide a scalable and principled approach for imputing and merging incomplete, heterogeneous data. Our model is available at: github.com/zalishmahmud/impuganBigData2025

[300] LDLT $\mathcal{L}$-Lipschitz Network: Generalized Deep End-To-End Lipschitz Network Construction

Marius F. R. Juston, Ramavarapu S. Sreenivas, Dustin Nottage, Ahmet Soylemezoglu

Main category: cs.LG

TL;DR: This paper presents a rigorous LMI framework for constructing Lipschitz-constrained deep residual networks, extending to general nonlinear architectures via LDL decomposition, achieving better accuracy than previous methods.

DetailsMotivation: ResNets excel in computer vision but controlling Lipschitz constants is crucial for adversarial robustness and network certifiability. Existing methods lack rigorous frameworks for constructing Lipschitz-constrained networks.

Method: Reformulate ResNet as cyclic tridiagonal LMI, derive closed-form parameter constraints for Lipschitz continuity, use novel LDL decomposition for LMI feasibility certification, extend to general nonlinear architectures, employ Cholesky decomposition for efficient parameterization.

Result: The LDL formulation provides tight relaxation of SDP-based networks, maintains full expressiveness, and achieves 3%-13% accuracy gains over SLL Layers on 121 UCI datasets.

Conclusion: The proposed framework enables robust network designs for adversarial robustness, certified training, and control systems through provable parameterization of Lipschitz-constrained architectures.

Abstract: Deep residual networks (ResNets) have demonstrated outstanding success in computer vision tasks, attributed to their ability to maintain gradient flow through deep architectures. Simultaneously, controlling the Lipschitz constant in neural networks has emerged as an essential area of research to enhance adversarial robustness and network certifiability. This paper presents a rigorous approach to the general design of $\mathcal{L}$-Lipschitz deep residual networks using a Linear Matrix Inequality (LMI) framework. Initially, the ResNet architecture was reformulated as a cyclic tridiagonal LMI, and closed-form constraints on network parameters were derived to ensure $\mathcal{L}$-Lipschitz continuity; then, using a new $LDL^\top$ decomposition approach for certifying LMI feasibility, we extend the construction of $\mathcal{L}$-Lipschitz networks to any other nonlinear architecture. Our contributions include a provable parameterization methodology for constructing Lipschitz-constrained residual networks and other hierarchical architectures. Cholesky decomposition is also used for efficient parameterization. These findings enable robust network designs applicable to adversarial robustness, certified training, and control systems. The $LDL^\top$ formulation is shown to be a tight relaxation of the SDP-based network, maintaining full expressiveness and achieving 3%-13% accuracy gains over SLL Layers on 121 UCI data sets.

[301] MaxShapley: Towards Incentive-compatible Generative Search with Fair Context Attribution

Sara Patel, Mingxun Zhou, Giulia Fanti

Main category: cs.LG

TL;DR: MaxShapley is an efficient algorithm for fair attribution in generative search engines that uses RAG, reducing computational cost from exponential to linear while maintaining comparable attribution quality to exact Shapley values.

DetailsMotivation: Generative search engines based on LLMs are replacing traditional search, changing how information providers are compensated. There's a need for fair mechanisms to attribute and compensate content providers based on their contributions to generated answers.

Method: MaxShapley is a special case of Shapley value that leverages a decomposable max-sum utility function to compute attributions with linear computation in the number of documents, as opposed to the exponential cost of traditional Shapley values.

Result: Evaluation on three multi-hop QA datasets (HotPotQA, MuSiQUE, MS MARCO) shows MaxShapley achieves comparable attribution quality to exact Shapley computation while consuming a fraction of its tokens, with up to 8x reduction in resource consumption over prior state-of-the-art methods at the same attribution accuracy.

Conclusion: MaxShapley provides an efficient solution for fair attribution in generative search ecosystems, enabling sustainable compensation for content providers while maintaining computational feasibility.

Abstract: Generative search engines based on large language models (LLMs) are replacing traditional search, fundamentally changing how information providers are compensated. To sustain this ecosystem, we need fair mechanisms to attribute and compensate content providers based on their contributions to generated answers. We introduce MaxShapley, an efficient algorithm for fair attribution in generative search pipelines that use retrieval-augmented generation (RAG). MaxShapley is a special case of the celebrated Shapley value; it leverages a decomposable max-sum utility function to compute attributions with linear computation in the number of documents, as opposed to the exponential cost of Shapley values. We evaluate MaxShapley on three multi-hop QA datasets (HotPotQA, MuSiQUE, MS MARCO); MaxShapley achieves comparable attribution quality to exact Shapley computation, while consuming a fraction of its tokens; for instance, it gives up to an 8x reduction in resource consumption over prior state-of-the-art methods at the same attribution accuracy.
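
To see why a max-style utility collapses the usual exponential Shapley sum, consider the single-aspect utility v(S) = max_{i∈S} u_i with nonnegative scores: its Shapley values have a sorted closed form computable in O(n log n). This is an assumed building block consistent with the described decomposable max-sum utility (per-aspect attributions would then be summed), not necessarily MaxShapley's exact formulation.

```python
import numpy as np

def shapley_of_max(utilities):
    """Closed-form Shapley values for v(S) = max_{i in S} u_i, u_i >= 0:
    phi_(k) = sum_{j >= k} (u_(j) - u_(j+1)) / j over the descending sort."""
    u = np.asarray(utilities, dtype=float)
    order = np.argsort(-u)
    u_sorted = u[order]
    diffs = u_sorted - np.append(u_sorted[1:], 0.0)
    terms = diffs / np.arange(1, len(u) + 1)
    phi_sorted = np.cumsum(terms[::-1])[::-1]  # suffix sums
    phi = np.empty_like(u)
    phi[order] = phi_sorted
    return phi  # sums to max(u), the grand-coalition utility
```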

[302] KQ-SVD: Compressing the KV Cache with Provable Guarantees on Attention Fidelity

Damien Lesens, Beheshteh T. Rakhshan, Guillaume Rabusseau

Main category: cs.LG

TL;DR: KQ-SVD is a new KV cache compression method that directly decomposes the attention matrix via closed-form SVD, outperforming prior methods that compress keys alone or jointly embed queries and keys.

DetailsMotivation: KV cache memory bottleneck grows with sequence length and batch size. Existing compression methods are suboptimal because they don't directly target the attention matrix which depends on query-key inner products.

Method: KQ-SVD performs optimal low-rank decomposition of the attention matrix using a closed-form SVD solution, directly targeting the source of redundancy in attention computations.

Result: KQ-SVD preserves attention outputs with higher fidelity under compression and consistently delivers superior projection quality on LLaMA and Mistral models compared to prior methods.

Conclusion: Direct decomposition of the attention matrix via KQ-SVD is more effective for KV cache compression than compressing keys alone or jointly embedding queries and keys.

Abstract: The Key-Value (KV) cache is central to the efficiency of transformer-based large language models (LLMs), storing previously computed vectors to accelerate inference. Yet, as sequence length and batch size grow, the cache becomes a major memory bottleneck. Prior compression methods typically apply low-rank decomposition to keys alone or attempt to jointly embed queries and keys, but both approaches neglect that attention fundamentally depends on their inner products. In this work, we prove that such strategies are suboptimal for approximating the attention matrix. We introduce KQ-SVD, a simple and computationally efficient method that directly performs an optimal low-rank decomposition of the attention matrix via a closed-form solution. By targeting the true source of redundancy, KQ-SVD preserves attention outputs with higher fidelity under compression. Extensive evaluations on LLaMA and Mistral models demonstrate that our approach consistently delivers superior projection quality.
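
The contrast with key-only compression can be stated in one line: KQ-SVD targets the score matrix $QK^\top$ itself, whose optimal rank-r approximation is given in closed form by truncated SVD. A minimal sketch of that core step, leaving out how the factors are cached and reused at inference:

```python
import torch

def lowrank_scores(Q, K, rank):
    """Best rank-r approximation (in Frobenius norm) of the pre-softmax
    score matrix Q K^T via truncated SVD."""
    U, S, Vh = torch.linalg.svd(Q @ K.T, full_matrices=False)
    return (U[:, :rank] * S[:rank]) @ Vh[:rank]
```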

[303] Sparse but Wrong: Incorrect L0 Leads to Incorrect Features in Sparse Autoencoders

David Chanin, Adrià Garriga-Alonso

Main category: cs.LG

TL;DR: SAE L0 hyperparameter is not just a free tradeoff parameter - it must be set correctly to avoid feature mixing and ensure proper feature disentanglement in LLMs.

DetailsMotivation: Existing work treats L0 (average features per token) as a free parameter affecting only reconstruction-sparsity tradeoff, but the authors suspect it has deeper implications for feature disentanglement quality.

Method: Study the effect of L0 on SAEs, analyze feature mixing patterns at different L0 values, and develop a proxy metric to guide optimal L0 selection for given training distributions.

Result: Both too low and too high L0 cause feature mixing: low L0 mixes correlated features for better reconstruction, high L0 finds degenerate solutions. Most commonly used SAEs have L0 that is too low. The proposed metric successfully finds correct L0 in toy models and aligns with peak sparse probing performance in LLM SAEs.

Conclusion: L0 must be set correctly to train SAEs with correct features; it’s not just a reconstruction-sparsity tradeoff parameter but crucial for proper feature disentanglement.

Abstract: Sparse Autoencoders (SAEs) extract features from LLM internal activations, meant to correspond to interpretable concepts. A core SAE training hyperparameter is L0: how many SAE features should fire per token on average. Existing work compares SAE algorithms using sparsity-reconstruction tradeoff plots, implying L0 is a free parameter with no single correct value aside from its effect on reconstruction. In this work we study the effect of L0 on SAEs, and show that if L0 is not set correctly, the SAE fails to disentangle the underlying features of the LLM. If L0 is too low, the SAE will mix correlated features to improve reconstruction. If L0 is too high, the SAE finds degenerate solutions that also mix features. Further, we present a proxy metric that can help guide the search for the correct L0 for an SAE on a given training distribution. We show that our method finds the correct L0 in toy models and coincides with peak sparse probing performance in LLM SAEs. We find that most commonly used SAEs have an L0 that is too low. Our work shows that L0 must be set correctly to train SAEs with correct features.

[304] On the Bayes Inconsistency of Disagreement Discrepancy Surrogates

Neil G. Marchant, Andrew C. Cullen, Feng Liu, Sarah M. Erfani

Main category: cs.LG

TL;DR: The paper addresses limitations in existing surrogate losses for disagreement discrepancy in distribution shift scenarios, proves they lack Bayes consistency, introduces theoretical bounds, and proposes a novel provably consistent surrogate loss that outperforms existing methods.

DetailsMotivation: Deep neural networks often fail under distribution shift, which is critical for building safe and reliable systems. Disagreement discrepancy has emerged as a promising approach for addressing distribution shift, but existing surrogate losses for optimizing it have fundamental flaws.

Method: The authors prove that existing surrogates for disagreement discrepancy are not Bayes consistent. They introduce new theoretical bounds on the optimality gap for such surrogates. Guided by this theory, they propose a novel disagreement loss that, when paired with cross-entropy, yields a provably consistent surrogate for disagreement discrepancy.

Result: Empirical evaluations across diverse benchmarks demonstrate that the proposed method provides more accurate and robust estimates of disagreement discrepancy than existing approaches, particularly under challenging adversarial conditions.

Conclusion: The paper identifies a fundamental flaw in existing surrogate losses for disagreement discrepancy, provides theoretical foundations for understanding their limitations, and introduces a novel provably consistent surrogate that improves performance in distribution shift scenarios.

Abstract: Deep neural networks often fail when deployed in real-world contexts due to distribution shift, a critical barrier to building safe and reliable systems. An emerging approach to address this problem relies on \emph{disagreement discrepancy} – a measure of how the disagreement between two models changes under a shifting distribution. The process of maximizing this measure has seen applications in bounding error under shifts, testing for harmful shifts, and training more robust models. However, this optimization involves the non-differentiable zero-one loss, necessitating the use of practical surrogate losses. We prove that existing surrogates for disagreement discrepancy are not Bayes consistent, revealing a fundamental flaw: maximizing these surrogates can fail to maximize the true disagreement discrepancy. To address this, we introduce new theoretical results providing both upper and lower bounds on the optimality gap for such surrogates. Guided by this theory, we propose a novel disagreement loss that, when paired with cross-entropy, yields a provably consistent surrogate for disagreement discrepancy. Empirical evaluations across diverse benchmarks demonstrate that our method provides more accurate and robust estimates of disagreement discrepancy than existing approaches, particularly under challenging adversarial conditions.

[305] Whatever Remains Must Be True: Filtering Drives Reasoning in LLMs, Shaping Diversity

Germán Kruszewski, Pierre Erbacher, Jos Rozen, Marc Dymetman

Main category: cs.LG

TL;DR: Proposes fine-tuning LLMs by minimizing an α-divergence to a filtered target distribution instead of standard RL, giving direct control of the precision-diversity trade-off and achieving SOTA on a Lean theorem-proving benchmark.

DetailsMotivation: RL fine-tuning causes diversity loss in LLMs due to mode-seeking behavior of Reverse KL divergence, which concentrates probability mass on certain high-probability regions while neglecting others.

Method: Start from explicit target distribution (filtered correct answers), approximate using α-divergence family to control precision-diversity trade-off by interpolating between mode-seeking and mass-covering divergences.

Result: Achieves state-of-the-art performance on Lean theorem-proving benchmark along coverage-precision Pareto frontier, outperforming all prior methods on coverage axis.

Conclusion: α-divergence provides better control over precision-diversity trade-off than RL, enabling more diverse yet accurate LLM fine-tuning for reasoning tasks.

Abstract: Reinforcement Learning (RL) has become the de facto standard for tuning LLMs to solve tasks involving reasoning. However, growing evidence shows that models trained in this way often suffer from a significant loss in diversity. We argue that this arises because RL implicitly optimizes the “mode-seeking” or “zero-forcing” Reverse KL to a target distribution, causing the model to concentrate mass on certain high-probability regions of the target while neglecting others. In this work, we instead begin from an explicit target distribution, obtained by filtering out incorrect answers while preserving the relative probabilities of correct ones. Starting from a pre-trained LLM, we approximate this target distribution using the $α$-divergence family, which unifies prior approaches and enables direct control of the precision-diversity trade-off by interpolating between mode-seeking and mass-covering divergences. On a Lean theorem-proving benchmark, our method achieves state-of-the-art performance along the coverage-precision Pareto frontier, outperforming all prior methods on the coverage axis.
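
A sketch of the α-divergence family the method tunes, in one common (Amari) convention; which endpoint counts as "mode-seeking" depends on which argument plays the model, so treat the orientation noted in the comments as an assumption.

```python
import numpy as np

def alpha_divergence(p, q, alpha):
    """Amari alpha-divergence between categorical distributions p and q.
    alpha -> 1 recovers KL(p||q); alpha -> 0 recovers KL(q||p);
    intermediate alphas interpolate between the two behaviors."""
    if np.isclose(alpha, 1.0):
        return float(np.sum(p * np.log(p / q)))
    if np.isclose(alpha, 0.0):
        return float(np.sum(q * np.log(q / p)))
    return float((1.0 - np.sum(p**alpha * q**(1.0 - alpha)))
                 / (alpha * (1.0 - alpha)))

p, q = np.array([0.7, 0.2, 0.1]), np.array([0.3, 0.4, 0.3])
print([round(alpha_divergence(p, q, a), 4) for a in (0.0, 0.5, 1.0)])
```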

[306] Developing synthetic microdata through machine learning for firm-level business surveys

Jorge Cisneros Paz, Timothy Wojan, Matthew Williams, Jennifer Ozawa, Robert Chew, Kimberly Janda, Timothy Navarro, Michael Floyd, Christine Task, Damon Streat

Main category: cs.LG

TL;DR: The paper presents a machine learning approach to create synthetic public-use microdata samples for business surveys to address re-identification risks while preserving data utility.

DetailsMotivation: Increased computing power and Big Data availability have heightened re-identification risks for anonymized Census data, potentially violating confidentiality pledges to survey respondents. Business data presents unique challenges due to lack of anonymity and industry identifiability in geographic areas.

Method: The paper describes a machine learning model to construct synthetic PUMS based on the Annual Business Survey (ABS), using quality metrics to evaluate the synthetic data. The approach is demonstrated using synthetic PUMS developed for the 2007 Survey of Business Owners as a proof of concept.

Result: Econometric replication of a published analysis in Small Business Economics demonstrates the synthetic data’s verisimilitude to true data, showing that critical moments and relationships are preserved while protecting individual respondent confidentiality.

Conclusion: Synthetic data generation using machine learning offers a viable solution to confidentiality concerns in business survey data while maintaining research utility, enabling broader public access to valuable economic data without compromising privacy protections.

Abstract: Public-use microdata samples (PUMS) from the United States (US) Census Bureau on individuals have been available for decades. However, large increases in computing power and the greater availability of Big Data have dramatically increased the probability of re-identifying anonymized data, potentially violating the pledge of confidentiality given to survey respondents. Data science tools can be used to produce synthetic data that preserve critical moments of the empirical data but do not contain the records of any existing individual respondent or business. Developing public-use firm data from surveys presents unique challenges different from demographic data, because there is a lack of anonymity and certain industries can be easily identified in each geographic area. This paper briefly describes a machine learning model used to construct a synthetic PUMS based on the Annual Business Survey (ABS) and discusses various quality metrics. Although the ABS PUMS is currently being refined and results are confidential, we present two synthetic PUMS developed for the 2007 Survey of Business Owners, similar to the ABS business data. Econometric replication of a high impact analysis published in Small Business Economics demonstrates the verisimilitude of the synthetic data to the true data and motivates discussion of possible ABS use cases.
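
As a toy illustration of the kind of quality check involved (the paper's metrics are richer), one can compare low-order moments of real and synthetic numeric columns:

```python
import numpy as np

def moment_fidelity(real, synth):
    """Largest gaps in column means and pairwise correlations between
    real and synthetic numeric data matrices (rows = records)."""
    mean_gap = np.abs(real.mean(axis=0) - synth.mean(axis=0)).max()
    corr_gap = np.abs(np.corrcoef(real, rowvar=False)
                      - np.corrcoef(synth, rowvar=False)).max()
    return mean_gap, corr_gap

rng = np.random.default_rng(0)
print(moment_fidelity(rng.normal(size=(500, 3)), rng.normal(size=(500, 3))))
```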

[307] Optimizing Fine-Tuning through Advanced Initialization Strategies for Low-Rank Adaptation

Yongfu Xue

Main category: cs.LG

TL;DR: IniLoRA improves LoRA by initializing low-rank matrices to approximate original model weights, achieving better performance across models and tasks.

DetailsMotivation: LoRA's zero-product initialization limits its ability to effectively activate and leverage original model weights, creating a performance bottleneck. The authors aim to overcome this limitation for better fine-tuning performance.

Method: Proposes IniLoRA with novel initialization strategy that initializes low-rank matrices to closely approximate original model weights. Introduces two variants: IniLoRA-α and IniLoRA-β with distinct initialization methods.

Result: IniLoRA achieves better performance than LoRA across a range of models and tasks. The variants further enhance performance.

Conclusion: IniLoRA’s weight-approximating initialization strategy effectively addresses LoRA’s limitations and improves parameter-efficient fine-tuning performance.

Abstract: The rapid development of parameter-efficient fine-tuning methods has noticeably improved the efficiency of adapting large language models. Among these, LoRA has gained widespread popularity due to its strong balance of effectiveness and parameter efficiency. However, LoRA relies on initializing two low-rank matrices whose product is zero, which limits its ability to effectively activate and leverage the original model weights, creating a potential bottleneck for optimal performance. To address this limitation, we propose \textbf{IniLoRA}, a novel initialization strategy that initializes the low-rank matrices to closely approximate the original model weights. Experimental results indicate that IniLoRA achieves better performance than LoRA across a range of models and tasks. Additionally, we introduce two variants, IniLoRA-$α$ and IniLoRA-$β$, both leveraging distinct initialization methods to enhance performance further.
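
The summary does not spell out the initialization itself; one natural way to make the low-rank product approximate the pretrained weight is a truncated SVD, sketched below as a plausible instantiation rather than IniLoRA-α or IniLoRA-β themselves.

```python
import torch

def weight_approx_init(W, r):
    """Initialize LoRA factors so that A @ B approximates the pretrained
    weight W instead of the zero matrix. With a residual update
    W + A @ B, the mismatch W - A @ B must also be absorbed so the
    initial model output is unchanged; that detail is omitted here."""
    U, S, Vh = torch.linalg.svd(W, full_matrices=False)
    A = U[:, :r] * S[:r].sqrt()              # (d_out, r)
    B = S[:r].sqrt().unsqueeze(1) * Vh[:r]   # (r, d_in)
    return A, B

W = torch.randn(256, 128)
A, B = weight_approx_init(W, r=8)
print(((W - A @ B).norm() / W.norm()).item())  # relative approximation error
```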

[308] Reinforce-Ada: An Adaptive Sampling Framework under Non-linear RL Objectives

Wei Xiong, Chenlu Ye, Baohao Liao, Hanze Dong, Xinxing Xu, Christof Monz, Jiang Bian, Nan Jiang, Tong Zhang

Main category: cs.LG

TL;DR: Reinforce-Ada addresses RL signal loss in LLM reasoning via adaptive sampling that prioritizes difficult prompts, accelerating convergence by up to 2× over uniform baselines at the same inference budget.

DetailsMotivation: Standard RL for LLM reasoning suffers from signal loss due to uniform sampling with small group sizes, which fails to uncover informative learning signals for difficult prompts. This is a statistical artifact of undersampling rather than an inherent model limitation.

Method: Introduces theoretical framework optimizing non-linear RL objective (log-likelihood), inducing weighted gradient estimator prioritizing difficult prompts. Proposes Reinforce-Ada family: algorithms that dynamically allocate inference budgets based on prompt difficulty via adaptive sampling. Two efficient realizations: estimation-based approach and model-free sequential sampling approach.

Result: Extensive experiments across multiple benchmarks show Reinforce-Ada significantly outperforms uniform baselines like GRPO, recovering lost signals and accelerating convergence by up to 2× while maintaining same total inference budget.

Conclusion: Signal loss in RL for LLM reasoning is addressable through adaptive sampling that actively invests compute in difficult prompts rather than discarding them, enabling more efficient RL training with existing compute budgets.

Abstract: Reinforcement learning (RL) for large language model reasoning is frequently hindered by signal loss, a phenomenon where standard uniform sampling with small group sizes fails to uncover informative learning signals for difficult prompts. We demonstrate that this collapse is a statistical artifact of undersampling rather than an inherent model limitation. To address this systematically, we introduce a theoretical framework based on optimizing a non-linear RL objective (e.g., log-likelihood). We show that this objective naturally induces a weighted gradient estimator that prioritizes difficult prompts, which can be robustly realized through adaptive sampling. Guided by this framework, we propose Reinforce-Ada, a family of algorithms that dynamically allocates inference budgets based on prompt difficulty, effectively scaling up RL compute to where it is needed most. Unlike passive filtering methods that discard low-signal prompts, Reinforce-Ada actively invests compute to recover them. We introduce two efficient realizations: an estimation-based approach and a model-free sequential sampling approach. Extensive experiments across multiple benchmarks show that Reinforce-Ada significantly outperforms uniform baselines like GRPO, recovering lost signals and accelerating convergence by up to $2\times$ while maintaining the same total inference budget. Code is available at https://github.com/RLHFlow/Reinforce-Ada.
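
A deliberately simplified sketch of the model-free sequential variant: keep sampling a prompt until its reward group carries signal (both successes and failures) or the budget is spent, rather than discarding all-failure prompts. The actual allocation rule and gradient estimator are specified in the paper.

```python
def adaptive_sample(sample_reward, prompts, min_draws=4, max_draws=64):
    """`sample_reward(prompt)` draws one response and returns its 0/1
    reward. Hard prompts (uniform rewards so far) receive extra draws."""
    groups = {}
    for p in prompts:
        rewards = [sample_reward(p) for _ in range(min_draws)]
        while len(set(rewards)) < 2 and len(rewards) < max_draws:
            rewards.append(sample_reward(p))  # invest compute where signal is missing
        groups[p] = rewards
    return groups
```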

[309] Variational Learning of Gaussian Process Latent Variable Models through Stochastic Gradient Annealed Importance Sampling

Jian Xu, Shian Du, Junmei Yang, Qianli Ma, Delu Zeng

Main category: cs.LG

TL;DR: The paper proposes an Annealed Importance Sampling (AIS) approach for Bayesian GPLVMs to address limitations in handling complex data structures, achieving tighter variational bounds and better performance than existing methods.

DetailsMotivation: Importance-weighted Bayesian GPLVMs have limitations in handling complex data structures and high-dimensional spaces due to challenges in generating effective proposal distributions. The authors aim to overcome these limitations for better unsupervised learning performance.

Method: Proposes an Annealed Importance Sampling (AIS) approach that transforms the posterior into a sequence of intermediate distributions using annealing. Combines Sequential Monte Carlo samplers with Variational Inference (VI) to explore wider posterior distributions. Also introduces an efficient algorithm by reparameterizing all variables in the evidence lower bound (ELBO).

Result: Experimental results on toy and image datasets show the method outperforms state-of-the-art methods with tighter variational bounds, higher log-likelihoods, and more robust convergence.

Conclusion: The AIS approach successfully addresses limitations of importance-weighted Bayesian GPLVMs for complex data structures, providing improved performance in unsupervised learning tasks like dimensionality reduction and missing data recovery.

Abstract: Gaussian Process Latent Variable Models (GPLVMs) have become increasingly popular for unsupervised tasks such as dimensionality reduction and missing data recovery due to their flexibility and non-linear nature. An importance-weighted version of the Bayesian GPLVMs has been proposed to obtain a tighter variational bound. However, this version of the approach is primarily limited to analyzing simple data structures, as the generation of an effective proposal distribution can become quite challenging in high-dimensional spaces or with complex data sets. In this work, we propose an Annealed Importance Sampling (AIS) approach to address these issues. By transforming the posterior into a sequence of intermediate distributions using annealing, we combine the strengths of Sequential Monte Carlo samplers and VI to explore a wider range of posterior distributions and gradually approach the target distribution. We further propose an efficient algorithm by reparameterizing all variables in the evidence lower bound (ELBO). Experimental results on both toy and image datasets demonstrate that our method outperforms state-of-the-art methods in terms of tighter variational bounds, higher log-likelihoods, and more robust convergence.
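
A generic sketch of the AIS weight accumulation underlying the method; the paper's reparameterized, VI-integrated version is more involved. `transition` is assumed to be any MCMC kernel that leaves its target log-density invariant.

```python
def ais_log_weight(log_p0, log_pT, x0, betas, transition):
    """Anneal from a tractable start density log_p0 to the target log_pT
    through intermediate densities log p_b = (1-b) log_p0 + b log_pT,
    accumulating the log importance weight along the schedule `betas`
    (increasing from 0 to 1)."""
    x, log_w = x0, 0.0
    for b_prev, b in zip(betas[:-1], betas[1:]):
        log_w += (b - b_prev) * (log_pT(x) - log_p0(x))
        x = transition(x, lambda z, b=b: (1 - b) * log_p0(z) + b * log_pT(z))
    return x, log_w
```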

[310] Detecting the Future: All-at-Once Event Sequence Forecasting with Horizon Matching

Ivan Karpukhin, Andrey Savchenko

Main category: cs.LG

TL;DR: DEF introduces a novel approach for simultaneous long-horizon event forecasting, using a matching-based loss to align predictions with ground truth and achieving up to a 50% relative improvement over existing methods.

DetailsMotivation: Traditional autoregressive multi-step forecasting for event sequences has limited effectiveness due to convergence to constant/repetitive outputs, creating a need for better long-horizon event prediction methods.

Method: DEF uses a novel matching-based loss function that optimally aligns predictions with ground truth events during training, enabling simultaneous forecasting of multiple future events on a horizon.

Result: Achieves up to 50% relative improvement over existing temporal point processes and event prediction models, establishes new SOTA in long-horizon event prediction, and maintains computational efficiency during inference.

Conclusion: DEF successfully addresses limitations of traditional autoregressive approaches, providing accurate and diverse simultaneous forecasting of multiple future events with significant performance improvements.

Abstract: Long-horizon event forecasting is a crucial task across various domains, including retail, finance, healthcare, and social networks. Traditional models for event sequences often extend to forecasting on a horizon using an autoregressive (recursive) multi-step strategy, which has limited effectiveness due to typical convergence to constant or repetitive outputs. To address this limitation, we introduce DEF, a novel approach for simultaneous forecasting of multiple future events on a horizon with high accuracy and diversity. Our method optimally aligns predictions with ground truth events during training by using a novel matching-based loss function. We establish a new state-of-the-art in long-horizon event prediction, achieving up to a 50% relative improvement over existing temporal point processes and event prediction models. Furthermore, we achieve state-of-the-art performance in next-event prediction tasks while demonstrating high computational efficiency during inference.
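
A sketch of a matching-based horizon loss in the spirit of DEF, resolved with the Hungarian algorithm; the cost terms and their weighting are illustrative assumptions.

```python
import torch
from scipy.optimize import linear_sum_assignment

def matching_loss(pred_times, pred_logits, true_times, true_types):
    """Pairwise cost between K predicted events (times (K,), type logits
    (K, C)) and N ground-truth events (times (N,), type ids (N,)):
    absolute time distance plus negative log-probability of the true
    type, summed over the optimal one-to-one assignment."""
    time_cost = (pred_times[:, None] - true_times[None, :]).abs()
    type_cost = -pred_logits.log_softmax(-1)[:, true_types]   # (K, N)
    cost = time_cost + type_cost
    rows, cols = linear_sum_assignment(cost.detach().numpy())
    return cost[torch.from_numpy(rows), torch.from_numpy(cols)].sum()
```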

[311] Statistical Guarantees for Approximate Stationary Points of Shallow Neural Networks

Mahsa Taheri, Fang Xie, Johannes Lederer

Main category: cs.LG

TL;DR: Statistical guarantees for neural networks extended from global optima to stationary points and nearby regions, bridging theory with practice for shallow networks.

DetailsMotivation: Statistical guarantees for neural networks typically only apply to global optima of complex objective functions, creating a gap between theory and actual neural network pipeline outputs. The paper aims to bring statistical theory closer to practice.

Method: Develop statistical guarantees for shallow linear neural networks that apply to stationary points and nearby regions, matching global optima guarantees up to logarithmic factors. Extend these guarantees to shallow ReLU neural networks under the assumption that first layer weight matrices are nearly identical for the stationary network and target.

Result: Theoretical results show that neural networks don’t necessarily need to be optimized globally from a mathematical perspective. Statistical guarantees for stationary points and nearby regions match those of global optima up to logarithmic factors for shallow linear networks, and extend to ReLU networks with specific assumptions.

Conclusion: Despite being limited to shallow neural networks, this work represents an important step toward describing practical neural network properties in mathematical terms, bridging the gap between statistical theory and actual neural network practice.

Abstract: Since statistical guarantees for neural networks are usually restricted to global optima of intricate objective functions, it is unclear whether these theories explain the performances of actual outputs of neural network pipelines. The goal of this paper is, therefore, to bring statistical theory closer to practice. We develop statistical guarantees for shallow linear neural networks that coincide up to logarithmic factors with the global optima but apply to stationary points and the points nearby. These results support the common notion that neural networks do not necessarily need to be optimized globally from a mathematical perspective. We then extend our statistical guarantees to shallow ReLU neural networks, assuming the first layer weight matrices are nearly identical for the stationary network and the target. More generally, despite being limited to shallow neural networks for now, our theories make an important step forward in describing the practical properties of neural networks in mathematical terms.

[312] SPARTAN: A Sparse Transformer World Model Attending to What Matters

Anson Lei, Bernhard Schölkopf, Ingmar Posner

Main category: cs.LG

TL;DR: SPARTAN is a Transformer-based world model that learns sparse, context-dependent interaction graphs between entities to improve few-shot adaptation to dynamics changes and robustness against distractors.

DetailsMotivation: Capturing entity interactions in structured ways is crucial for adaptable world models, but reliably discovering local causal structures in complex settings remains challenging. The authors identify sparsity as a critical ingredient for discovering meaningful interaction structures.

Method: SPARTAN (SPARse TrANsformer World model) applies sparsity regularization on attention patterns between object-factored tokens in a Transformer architecture. This enables learning sparse, context-dependent interaction graphs that predict future object states. The model is extended to adapt to sparse interventions with unknown targets.

Result: SPARTAN outperforms state-of-the-art object-centric world models in observation-based environments. It learns local causal graphs that accurately reflect underlying object interactions, achieving significantly improved few-shot adaptation to dynamics changes and robustness against distractors.

Conclusion: SPARTAN demonstrates that sparsity regularization in Transformer-based world models enables learning interpretable, context-dependent interaction structures that provide efficient adaptation to environmental changes while maintaining robustness against distractions.

Abstract: Capturing the interactions between entities in a structured way plays a central role in world models that flexibly adapt to changes in the environment. Recent works motivate the benefits of models that explicitly represent the structure of interactions and formulate the problem as discovering local causal structures. In this work, we demonstrate that reliably capturing these relationships in complex settings remains challenging. To remedy this shortcoming, we postulate that sparsity is a critical ingredient for the discovery of such local structures. To this end, we present the SPARse TrANsformer World model (SPARTAN), a Transformer-based world model that learns context-dependent interaction structures between entities in a scene. By applying sparsity regularisation on the attention patterns between object-factored tokens, SPARTAN learns sparse, context-dependent interaction graphs that accurately predict future object states. We further extend our model to adapt to sparse interventions with unknown targets in the dynamics of the environment. This results in a highly interpretable world model that can efficiently adapt to changes. Empirically, we evaluate SPARTAN against the current state-of-the-art in object-centric world models in observation-based environments and demonstrate that our model can learn local causal graphs that accurately reflect the underlying interactions between objects, achieving significantly improved few-shot adaptation to dynamics changes, as well as robustness against distractors.
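
SPARTAN's exact regularizer is not given in this summary; an entropy penalty on attention rows, sketched below, is one standard way to push each object token to attend to only a few others (a plain L1 penalty would be vacuous here, since softmax rows already sum to one).

```python
import torch

def attention_sparsity_penalty(attn, coeff=1e-3, eps=1e-9):
    """Mean entropy of attention rows over object-factored tokens;
    minimizing it encourages peaked, sparse interaction graphs.
    Added to the future-state prediction loss during training."""
    entropy = -(attn * (attn + eps).log()).sum(dim=-1)
    return coeff * entropy.mean()

attn = torch.softmax(torch.randn(8, 16, 16), dim=-1)  # (heads, objects, objects)
print(attention_sparsity_penalty(attn).item())
```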

[313] A Control Perspective on Training PINNs

Matthieu Barreau, Haoming Shen

Main category: cs.LG

TL;DR: PINN training is analyzed from a control-theory perspective: gradient descent with resampling is modeled as a stochastic control system, and integral and leaky integral controllers adaptively set the physics weight. Theory and numerics show the integral controller converges accurately when the physics model is correct, while the leaky integrator performs better under model mismatch.

DetailsMotivation: To provide convergence guarantees and principled training algorithms for PINNs by analyzing training dynamics from a control-theoretic perspective, addressing the need for more robust and theoretically grounded training methods.

Method: Interpret PINN training with gradient descent and resampling as a stochastic control-affine system; introduce integral and leaky integral controllers for dynamic adaptation of physics weight; theoretically analyze asymptotic properties under accuracy-robustness trade-off; validate on toy examples.

Result: Numerical evidence shows integral controller achieves accurate and robust convergence when physical model is correct, while leaky integrator provides improved performance in presence of model mismatch; theoretical framework established for analyzing PINN training dynamics.

Conclusion: This work represents a first step toward convergence guarantees and principled training algorithms tailored to PINN characteristics, establishing a control-theoretic framework for analyzing and improving PINN training dynamics.

Abstract: We investigate the training of Physics-Informed Neural Networks (PINNs) from a control-theoretic perspective. Using gradient descent with resampling, we interpret the training dynamics as asymptotically equivalent to a stochastic control-affine system, where sampling effects act as process disturbances and measurement noise. Within this framework, we introduce two controllers for dynamically adapting the physics weight: an integral controller and a leaky integral controller. We theoretically analyze their asymptotic properties under the accuracy-robustness trade-off, and we evaluate them on a toy example. Numerical evidence suggests that the integral controller achieves accurate and robust convergence when the physical model is correct, whereas the leaky integrator provides improved performance in the presence of model mismatch. This work represents a first step toward convergence guarantees and principled training algorithms tailored to the distinct characteristics of PINN tasks.
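
A minimal sketch of the two controllers, taking the physics-residual error as the feedback signal; the gains and the exact error definition are assumptions.

```python
def integral_step(weight, error, ki=1e-2):
    """Integral control: accumulate the physics-constraint violation
    into the physics weight."""
    return weight + ki * error

def leaky_integral_step(weight, error, ki=1e-2, rho=1e-3):
    """Leaky variant: the forgetting factor rho keeps the weight
    bounded, trading some accuracy for robustness to model mismatch."""
    return (1.0 - rho) * weight + ki * error
```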

[314] Robust Weight Imprinting: Insights from Neural Collapse and Proxy-Based Aggregation

Justus Westerhoff, Golzar Atefi, Mario Koddenbrock, Alexei Figueroa, Alexander Löser, Erik Rodner, Felix A. Gers

Main category: cs.LG

TL;DR: The paper proposes IMPRINT, a general framework for imprinting-based transfer learning that identifies three key components (generation, normalization, aggregation), analyzes existing methods, and introduces a novel clustering-based variant that improves performance by 4%.

DetailsMotivation: Foundation models can be applied to new tasks through transfer learning, and imprinting offers an efficient parameter-free adaptation method. However, conceptual differences between existing imprinting studies necessitate a systematic investigation and unified framework.

Method: Proposes the IMPRINT framework with three main components: generation (creating proxies for novel data), normalization (proper scaling), and aggregation (combining proxies). Analyzes existing methods through this lens and introduces a novel variant using clustering inspired by neural collapse phenomenon.

Result: Analysis reveals benefits of multiple proxies in generation and importance of proper normalization. The novel clustering-based imprinting variant outperforms previous transfer learning methods by 4% on benchmark tasks.

Conclusion: The IMPRINT framework provides systematic understanding of imprinting methods, enables comparison of existing approaches, and leads to improved performance through a novel clustering-based variant that connects imprinting with neural collapse phenomenon.

Abstract: The capacity of foundation models allows for their application to new, unseen tasks. The adaptation to such tasks is called transfer learning. An efficient transfer learning method that circumvents parameter optimization is imprinting. The conceptual differences between studies on imprinting form the basis of our systematic investigation. In this work, we propose the general \texttt{IMPRINT} framework, identifying three main components: generation, normalization, and aggregation. Through the lens of this framework, we conduct an in-depth analysis and a comparison of the existing methods. Our findings reveal the benefits of representing novel data with multiple proxies in the generation step and show the importance of proper normalization. Beyond an extensive analytical grounding, our framework enables us to propose a novel variant of imprinting which outperforms previous work on transfer learning tasks by 4%. This variant determines proxies through clustering motivated by the neural collapse phenomenon – a connection that we draw for the first time. We publicly release our code at https://github.com/DATEXIS/IMPRINT.
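
A sketch of the three IMPRINT components under illustrative choices: clustering-based generation of multiple proxies per class (the neural-collapse-motivated variant), L2 normalization, and per-cluster mean aggregation, with nearest-proxy classification.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import normalize

def imprint_proxies(embeddings, labels, proxies_per_class=3):
    """Per class: normalize foundation-model embeddings, cluster them,
    and keep the (re-normalized) cluster centers as imprinted proxies."""
    proxies, proxy_labels = [], []
    for c in np.unique(labels):
        z = normalize(embeddings[labels == c])                # normalization
        km = KMeans(n_clusters=proxies_per_class, n_init=10).fit(z)
        proxies.append(normalize(km.cluster_centers_))        # aggregation
        proxy_labels += [c] * proxies_per_class
    return np.vstack(proxies), np.array(proxy_labels)

def nearest_proxy_predict(x, proxies, proxy_labels):
    """Classify by cosine similarity to the nearest proxy."""
    return proxy_labels[(normalize(x) @ proxies.T).argmax(axis=1)]
```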

[315] Implicit Bias of Spectral Descent and Muon on Multiclass Separable Data

Chen Fan, Mark Schmidt, Christos Thrampoulidis

Main category: cs.LG

TL;DR: This paper provides the first complete characterization of implicit optimization bias for p-norm normalized steepest descent and momentum steepest descent algorithms in multi-class linear classification with cross-entropy loss, showing they converge to max-margin solutions with respect to the classifier matrix’s p-norm.

DetailsMotivation: Different gradient-based optimization methods for overparameterized models achieve zero training error but converge to different solutions with different generalization properties. Understanding these implicit biases is crucial for explaining why certain algorithms generalize better than others.

Method: Theoretical analysis of p-norm normalized steepest descent (NSD) and momentum steepest descent (NMD) algorithms in multi-class linear classification with cross-entropy loss. The key insight reduces analysis of general p-norms to analysis of max-norm by exploiting ordering properties between p-norms relative to max-norm and its dual sum-norm.

Result: Proved that NSD and NMD converge to solutions maximizing the margin with respect to the classifier matrix’s p-norm, with established convergence rates. Showed that Spectral Descent and Muon converge to max-margin solutions with respect to spectral norm. Extended analysis to Adam, showing it converges to max-norm solution.

Conclusion: Multi-class linear classification provides the most transparent framework for studying implicit biases of matrix-parameter optimization algorithms. The analysis reveals how different p-norm constraints lead to different implicit biases and generalization properties.

Abstract: Different gradient-based methods for optimizing overparameterized models can all achieve zero training error yet converge to distinctly different solutions inducing different generalization properties. We provide the first complete characterization of implicit optimization bias for p-norm normalized steepest descent (NSD) and momentum steepest descent (NMD) algorithms in multi-class linear classification with cross-entropy loss. Our key theoretical contribution is proving that these algorithms converge to solutions maximizing the margin with respect to the classifier matrix’s p-norm, with established convergence rates. These results encompass important special cases including Spectral Descent and Muon, which we show converge to max-margin solutions with respect to the spectral norm. A key insight of our contribution is that the analysis of general entry-wise and Schatten p-norms can be reduced to the analysis of NSD/NMD with max-norm by exploiting a natural ordering property between all p-norms relative to the max-norm and its dual sum-norm. For the specific case of descent with respect to the max-norm, we further extend our analysis to include preconditioning, showing that Adam converges to the matrix’s max-norm solution. Our results demonstrate that the multi-class linear setting, which is inherently richer than the binary counterpart, provides the most transparent framework for studying implicit biases of matrix-parameter optimization algorithms.
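
For intuition, normalized steepest descent under the spectral norm replaces the gradient by its nearest orthogonal factor, which is also what Muon approximates (via Newton-Schulz iterations rather than an exact SVD); a one-step sketch:

```python
import torch

def spectral_descent_step(W, grad, lr=0.1):
    """One step of steepest descent w.r.t. the spectral norm: the update
    direction is U @ Vh from the gradient's SVD, i.e. the gradient with
    all singular values set to one."""
    U, _, Vh = torch.linalg.svd(grad, full_matrices=False)
    return W - lr * (U @ Vh)
```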

[316] Hypergraph Foundation Model

Yue Gao, Yifan Feng, Shiquan Liu, Xiangmin Han, Shaoyi Du, Zongze Wu, Han Hu

Main category: cs.LG

TL;DR: Hyper-FM is a hypergraph foundation model that outperforms baselines by 13.4% on 11 new datasets, with scaling laws showing domain diversity matters more than graph size.

DetailsMotivation: Hypergraph neural networks (HGNNs) effectively model complex high-order relationships but developing foundation models for hypergraphs is challenging due to their distinct data structure combining vertex features and intricate structural information.

Method: Hyper-FM features two key components: 1) Hierarchical High-Order Neighbor Guided Vertex Knowledge Embedding for vertex feature representation, and 2) Hierarchical Multi-Hypergraph Guided Structural Knowledge Extraction for structural information. The authors also curate 11 text-attributed hypergraph datasets.

Result: Experiments on the 11 datasets show Hyper-FM outperforms baseline methods by approximately 13.4%. The authors also propose the first scaling law for hypergraph foundation models, demonstrating that increasing domain diversity significantly enhances performance more than merely augmenting vertex and hyperedge counts.

Conclusion: Hyper-FM successfully addresses the challenge of developing foundation models for hypergraphs, and the scaling law findings underscore the critical role of domain diversity in scaling hypergraph models rather than just increasing graph size.

Abstract: Hypergraph neural networks (HGNNs) effectively model complex high-order relationships in domains like protein interactions and social networks by connecting multiple vertices through hyperedges, enhancing modeling capabilities, and reducing information loss. Developing foundation models for hypergraphs is challenging due to their distinct data, which includes both vertex features and intricate structural information. We present Hyper-FM, a Hypergraph Foundation Model for multi-domain knowledge extraction, featuring Hierarchical High-Order Neighbor Guided Vertex Knowledge Embedding for vertex feature representation and Hierarchical Multi-Hypergraph Guided Structural Knowledge Extraction for structural information. Additionally, we curate 11 text-attributed hypergraph datasets to advance research between HGNNs and LLMs. Experiments on these datasets show that Hyper-FM outperforms baseline methods by approximately 13.4%, validating our approach. Furthermore, we propose the first scaling law for hypergraph foundation models, demonstrating that increasing domain diversity significantly enhances performance, unlike merely augmenting vertex and hyperedge counts. This underscores the critical role of domain diversity in scaling hypergraph models.

[317] Interval Regression: A Comparative Study with Proposed Models

Tung L Nguyen, Toby Dylan Hocking

Main category: cs.LG

TL;DR: Comprehensive review and comparative analysis of Interval Regression models for handling imprecise target values represented as intervals, showing no universally optimal model.

DetailsMotivation: In real-world applications, target values are often imprecise and represented as intervals rather than exact values, creating a need for specialized regression models that can handle this uncertainty.

Method: Comprehensive review of existing Interval Regression models, introduction of alternative models for comparison, and experimental evaluation on both real-world and synthetic datasets.

Result: No single Interval Regression model is universally optimal; model performance varies across different scenarios, emphasizing the need for careful model selection based on specific use cases.

Conclusion: The choice of Interval Regression model should be tailored to specific application requirements, as different models perform better in different contexts, highlighting the importance of scenario-specific model selection.

Abstract: Regression models are essential for a wide range of real-world applications. However, in practice, target values are not always precisely known; instead, they may be represented as intervals of acceptable values. This challenge has led to the development of Interval Regression models. In this study, we provide a comprehensive review of existing Interval Regression models and introduce alternative models for comparative analysis. Experiments are conducted on both real-world and synthetic datasets to offer a broad perspective on model performance. The results demonstrate that no single model is universally optimal, highlighting the importance of selecting the most suitable model for each specific scenario.
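
One common loss in this literature, and a reasonable baseline among the models compared, is a squared hinge on the interval bounds: zero whenever the prediction lands inside the target interval.

```python
import numpy as np

def interval_squared_hinge(pred, lower, upper):
    """Squared hinge loss for interval targets: penalize predictions only
    by the amount they fall outside [lower, upper]. Bounds may be
    infinite for left- or right-censored targets."""
    below = np.maximum(0.0, lower - pred)
    above = np.maximum(0.0, pred - upper)
    return np.mean(below**2 + above**2)

print(interval_squared_hinge(np.array([2.0, 5.0]),
                             np.array([1.0, -np.inf]),
                             np.array([3.0, 4.0])))  # only the second prediction is penalized
```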

[318] A Trustworthy By Design Classification Model for Building Energy Retrofit Decision Support

Panagiota Rempi, Sotiris Pelekis, Alexandros Menelaos Tzortzis, Evangelos Spiliotis, Evangelos Karakolis, Christos Ntanos, Dimitris Askounis

Main category: cs.LG

TL;DR: This paper presents a trustworthy ML framework for recommending energy efficiency retrofit strategies for residential buildings using minimal inputs, addressing data scarcity and regulatory compliance challenges.

DetailsMotivation: Improving energy efficiency in existing residential buildings is critical for climate change mitigation, but retrofit decision-making faces challenges with data availability, model transparency, and compliance with AI regulations like the EU AI Act.

Method: A trustworthy-by-design ML framework combining CTGAN for data augmentation of limited/imbalanced data with a neural network multi-label classifier for predicting retrofit action combinations, plus an XAI layer using SHAP for explainability.

Result: Validated on two case studies (large EPC dataset from England/Wales and small RETROFIT-LAT dataset from Latvia), the framework handles diverse data conditions and improves performance up to 53% compared to baseline.

Conclusion: The framework provides a feasible, interpretable, and trustworthy AI system for building retrofit decision support that aids stakeholders in prioritizing energy investments while supporting regulation-compliant innovation in sustainable energy transition.

Abstract: Improving energy efficiency in residential buildings is critical to combating climate change and reducing greenhouse gas emissions. Retrofitting existing buildings, which contribute a significant share of energy use, is therefore a key priority, especially in regions with outdated building stock. Artificial Intelligence (AI) and Machine Learning (ML) can automate retrofit decision-making and find retrofit strategies. However, their use faces challenges of data availability, model transparency, and compliance with national and EU AI regulations including the AI act, ethics guidelines and the ALTAI. This paper presents a trustworthy-by-design ML-based decision support framework that recommends energy efficiency strategies for residential buildings using minimal user-accessible inputs. The framework merges Conditional Tabular Generative Adversarial Networks (CTGAN) to augment limited and imbalanced data with a neural network-based multi-label classifier that predicts potential combinations of retrofit actions. To support explanation and trustworthiness, an Explainable AI (XAI) layer using SHapley Additive exPlanations (SHAP) clarifies the rationale behind recommendations and guides feature engineering. Two case studies validate performance and generalization: the first leveraging a well-established, large EPC dataset for England and Wales; the second using a small, imbalanced post-retrofit dataset from Latvia (RETROFIT-LAT). Results show that the framework can handle diverse data conditions and improve performance up to 53% compared to the baseline. Overall, the proposed framework provides a feasible, interpretable, and trustworthy AI system for building retrofit decision support through assured performance, usability, and transparency to aid stakeholders in prioritizing effective energy investments and support regulation-compliant, data-driven innovation in sustainable energy transition.
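
The augmentation step can be sketched with the open-source `ctgan` package (whose `CTGAN.fit`/`sample` interface is shown below); the toy columns are illustrative, not the paper's schema.

```python
import pandas as pd
from ctgan import CTGAN  # pip install ctgan

# Toy stand-in for a small, imbalanced building dataset.
train_df = pd.DataFrame({
    "floor_area":    [80, 120, 95, 60, 150, 110],
    "dwelling_type": ["flat", "house", "flat", "flat", "house", "house"],
    "epc_band":      ["D", "E", "D", "F", "E", "D"],
})
ctgan = CTGAN(epochs=10)
ctgan.fit(train_df, discrete_columns=["dwelling_type", "epc_band"])
augmented = pd.concat([train_df, ctgan.sample(100)], ignore_index=True)
```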

[319] Privacy-Preserving Conformal Prediction Under Local Differential Privacy

Coby Penso, Bar Mahpud, Jacob Goldberger, Or Sheffet

Main category: cs.LG

TL;DR: Conformal prediction with local differential privacy: two approaches for private calibration when labels are perturbed, maintaining coverage guarantees without accessing true labels.

DetailsMotivation: Address privacy-sensitive scenarios where an untrusted aggregator can only access perturbed versions of true labels, particularly in applications like medical imaging or LLM queries where users may not trust the aggregator or cannot compute their own scores.

Method: Two complementary LDP approaches: 1) Users provide input features and perturbed labels via k-ary randomized response (no model access needed), 2) Users add noise to conformity scores via binary search response (requires model access but preserves both data and label privacy). Both compute conformal threshold directly from noisy data.

Result: Proved finite-sample coverage guarantees and demonstrated robust coverage even under severe randomization, unifying strong local privacy with predictive uncertainty control.

Conclusion: The approach enables conformal prediction in privacy-sensitive applications where users cannot or will not trust aggregators with true labels, supporting both scenarios where users can and cannot compute their own scores.

Abstract: Conformal prediction (CP) provides sets of candidate classes with a guaranteed probability of containing the true class. However, it typically relies on a calibration set with clean labels. We address privacy-sensitive scenarios where the aggregator is untrusted and can only access a perturbed version of the true labels. We propose two complementary approaches under local differential privacy (LDP). In the first approach, users do not access the model but instead provide their input features and a perturbed label using a k-ary randomized response. In the second approach, which enforces stricter privacy constraints, users add noise to their conformity score by binary search response. This method requires access to the classification model but preserves both data and label privacy. Both approaches compute the conformal threshold directly from noisy data without accessing the true labels. We prove finite-sample coverage guarantees and demonstrate robust coverage even under severe randomization. This approach unifies strong local privacy with predictive uncertainty control, making it well-suited for sensitive applications such as medical imaging or large language model queries, regardless of whether users can (or are willing to) compute their own scores.
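
A sketch of the ingredients of the first approach: k-ary randomized response on labels, plus a split-conformal threshold computed on the resulting (noisy) calibration scores. The paper's correction for the randomization is omitted, and the variant below resamples uniformly over all k labels when it does not report truthfully.

```python
import numpy as np

def k_rr(labels, k, epsilon, rng=np.random.default_rng(0)):
    """k-ary randomized response: keep the true label with probability
    e^eps / (e^eps + k - 1); otherwise report a uniform random label."""
    p_keep = np.exp(epsilon) / (np.exp(epsilon) + k - 1)
    keep = rng.random(len(labels)) < p_keep
    return np.where(keep, labels, rng.integers(0, k, size=len(labels)))

def conformal_threshold(scores, alpha=0.1):
    """Standard split-conformal quantile of calibration scores."""
    n = len(scores)
    level = min(np.ceil((n + 1) * (1 - alpha)) / n, 1.0)
    return np.quantile(scores, level, method="higher")
```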

[320] IF-GUIDE: Influence Function-Guided Detoxification of LLMs

Zachary Coalson, Juhan Bae, Nicholas Carlini, Sanghyun Hong

Main category: cs.LG

TL;DR: IF-GUIDE is a proactive method that uses influence functions to identify and suppress harmful tokens in training data to reduce LLM toxicity, achieving up to 10× toxicity reduction without human-preference data.

DetailsMotivation: Most existing approaches to reducing model toxicity are reactive (fine-tuning after pre-training). The paper proposes a proactive approach that addresses toxicity at the training data level before models learn harmful behaviors.

Method: IF-GUIDE adapts influence functions to measure token-level attributions from training data to model toxicity. It includes techniques for selecting toxic training documents and a learning objective that can be integrated into both pre-training and fine-tuning. The method doesn’t require human-preference data.

Result: IF-GUIDE reduces both explicit and implicit toxicity by up to 10× compared to uncensored models and up to 3× compared to baseline alignment methods (DPO and RAD). It’s computationally efficient, using a million-parameter proxy model instead of billion-parameter models for influence scoring.

Conclusion: Proactive data-level intervention via influence functions (IF-GUIDE) effectively reduces LLM toxicity without human-preference data, outperforming reactive alignment methods and being computationally efficient.

Abstract: We study how training data contributes to the emergence of toxic behaviors in large language models. Most prior work on reducing model toxicity adopts reactive approaches, such as fine-tuning pre-trained (and potentially toxic) models to align them with human values. In contrast, we propose a proactive approach, IF-GUIDE, that leverages influence functions to identify and suppress harmful tokens in the training data. To this end, we first show that standard influence functions are ineffective at discovering harmful training records. We then present a novel adaptation that measures token-level attributions from training data to model toxicity, along with techniques for selecting toxic training documents and a learning objective that can be integrated into both pre-training and fine-tuning. Moreover, IF-GUIDE does not rely on human-preference data, which is typically required by existing alignment methods. In our evaluation, we demonstrate that IF-GUIDE substantially reduces both explicit and implicit toxicity, by up to 10$\times$ compared to uncensored models and up to 3$\times$ compared to baseline alignment methods such as DPO and RAD, across both pre-training and fine-tuning scenarios. IF-GUIDE is computationally efficient: a billion-parameter model is not necessary for computing influence scores; a million-parameter model, with 7.5$\times$ fewer parameters, can effectively serve as a proxy for identifying harmful data. Our code is publicly available at: https://github.com/ztcoalson/IF-Guide
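
A heavily simplified first-order sketch of influence-style scoring: rank training examples by how well their loss gradients align with the gradient of a toxicity probe. IF-GUIDE's actual estimator is a token-level adaptation of influence functions computed with a small proxy model, not this plain dot product.

```python
import torch

def influence_scores(model, loss_fn, train_examples, toxic_probe):
    """`loss_fn(model, example)` returns a scalar loss. A higher score
    means the example's gradient points in the same direction as the
    toxicity probe's gradient, flagging it for suppression."""
    probe_grads = torch.autograd.grad(loss_fn(model, toxic_probe),
                                      model.parameters())
    scores = []
    for ex in train_examples:
        grads = torch.autograd.grad(loss_fn(model, ex), model.parameters())
        scores.append(sum((g * p).sum() for g, p in zip(grads, probe_grads)))
    return torch.stack(scores)
```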

[321] Variational Supervised Contrastive Learning

Ziwen Wang, Jiajun Fan, Thao Nguyen, Heng Ji, Ge Liu

Main category: cs.LG

TL;DR: VarCon reformulates supervised contrastive learning as variational inference, achieving SOTA performance on major image datasets with clearer semantic organization and better few-shot learning.

DetailsMotivation: Address two key limitations of contrastive learning: (1) lack of explicit regulation of embedding distribution causing semantically related instances to be pushed apart, and (2) excessive reliance on large in-batch negatives and tailored augmentations hindering generalization.

Method: Reformulates supervised contrastive learning as variational inference over latent class variables, maximizing a posterior-weighted evidence lower bound (ELBO) that replaces exhaustive pair-wise comparisons for efficient class-aware matching and grants fine-grained control over intra-class dispersion.

Result: Achieves state-of-the-art performance: 79.36% Top-1 accuracy on ImageNet-1K and 78.29% on CIFAR-100 with ResNet-50 encoder, converging in just 200 epochs. Shows clearer decision boundaries, better semantic organization, and superior few-shot learning and robustness.

Conclusion: VarCon successfully addresses limitations of traditional contrastive learning through variational inference formulation, achieving superior performance, faster convergence, and better generalization across multiple benchmarks and tasks.

Abstract: Contrastive learning has proven to be highly efficient and adaptable in shaping representation spaces across diverse modalities by pulling similar samples together and pushing dissimilar ones apart. However, two key limitations persist: (1) Without explicit regulation of the embedding distribution, semantically related instances can inadvertently be pushed apart unless complementary signals guide pair selection, and (2) excessive reliance on large in-batch negatives and tailored augmentations hinders generalization. To address these limitations, we propose Variational Supervised Contrastive Learning (VarCon), which reformulates supervised contrastive learning as variational inference over latent class variables and maximizes a posterior-weighted evidence lower bound (ELBO) that replaces exhaustive pair-wise comparisons for efficient class-aware matching and grants fine-grained control over intra-class dispersion in the embedding space. Trained exclusively on image data, our experiments on CIFAR-10, CIFAR-100, ImageNet-100, and ImageNet-1K show that VarCon (1) achieves state-of-the-art performance for contrastive learning frameworks, reaching 79.36% Top-1 accuracy on ImageNet-1K and 78.29% on CIFAR-100 with a ResNet-50 encoder while converging in just 200 epochs; (2) yields substantially clearer decision boundaries and semantic organization in the embedding space, as evidenced by KNN classification, hierarchical clustering results, and transfer-learning assessments; and (3) demonstrates superior performance in few-shot learning than supervised baseline and superior robustness across various augmentation strategies. Our code is available at https://github.com/ziwenwang28/VarContrast.

[322] Retro-Expert: Collaborative Reasoning for Interpretable Retrosynthesis

Xinyi Li, Sai Wang, Yutian Lin, Yu Wu, Yi Yang

Main category: cs.LG

TL;DR: Retro-Expert is an interpretable retrosynthesis framework that combines LLMs and specialized models via reinforcement learning to provide both accurate predictions and chemical logic explanations.

DetailsMotivation: Existing retrosynthesis models rely on static pattern-matching, which limits logic decision-making and creates black-box predictions. There's a need for interpretable models that provide actionable chemical insights.

Method: Three-component framework: (1) specialized models analyze products to construct chemical decision space, (2) LLM-driven critical reasoning generates predictions with interpretable reasoning paths, (3) reinforcement learning optimizes interpretable decision policy.

Result: Retro-Expert surpasses both LLM-based and specialized models across different metrics while providing expert-aligned explanations that bridge AI predictions with actionable chemical insights.

Conclusion: The proposed interpretable framework successfully addresses black-box limitations in retrosynthesis by combining complementary strengths of different models through collaborative reasoning and reinforcement learning.

Abstract: Retrosynthesis prediction aims to infer the reactant molecule based on a given product molecule, which is a fundamental task in chemical synthesis. However, existing models rely on static pattern-matching paradigm, which limits their ability to perform effective logic decision-making, leading to black-box decision-making. Building on this, we propose Retro-Expert, an interpretable retrosynthesis framework that performs collaborative reasoning by combining the complementary reasoning strengths of Large Language Models and specialized models via reinforcement learning. It outputs natural language explanations grounded in chemical logic through three components: (1) specialized models analyze the product to construct high-quality chemical decision space, (2) LLM-driven critical reasoning to generate predictions and corresponding interpretable reasoning path, and (3) reinforcement learning optimizing interpretable decision policy. Experiments show that Retro-Expert not only surpasses both LLM-based and specialized models across different metrics but also provides expert-aligned explanations that bridge the gap between AI predictions and actionable chemical insights.

[323] Conditional Generative Modeling for Enhanced Credit Risk Management in Supply Chain Finance

Qingkai Zhang, L. Jeff Hong, Houmin Yan

Main category: cs.LG

TL;DR: Proposes a credit risk management framework for 3PL-led supply chain finance in cross-border e-commerce using generative modeling to assess risk and determine optimal loan sizes for SMEs.

DetailsMotivation: Small- and medium-sized sellers in cross-border e-commerce face financing challenges due to limited credit histories, despite the growth of 3PL-led supply chain finance that uses in-transit inventory as collateral.

Method: Uses Quantile-Regression-based Generative Metamodeling (QRGMM) for sales distribution modeling, integrates with Deep Factorization Machines (DeepFM) to capture covariate interactions, and proposes a unified framework for flexible risk measure estimation with functional risk measures linking risk to loan levels.

Result: Extensive experiments on synthetic and real-world data validate the model’s efficacy for credit risk assessment and loan size determination in cross-border e-commerce supply chain finance.

Conclusion: Generative models show strong potential for strengthening credit assessment and supporting financing for small- and medium-sized sellers in cross-border e-commerce supply chain finance.

Abstract: The rapid expansion of cross-border e-commerce (CBEC) has created significant opportunities for small- and medium-sized sellers, yet financing remains a critical challenge due to their limited credit histories. Third-party logistics (3PL)-led supply chain finance (SCF) has emerged as a promising solution, leveraging in-transit inventory as collateral. We propose an advanced credit risk management framework tailored for 3PL-led SCF, addressing the dual challenges of credit risk assessment and loan size determination. Specifically, we leverage conditional generative modeling of sales distributions through Quantile-Regression-based Generative Metamodeling (QRGMM) as the foundation for risk measures estimation. We propose a unified framework that enables flexible estimation of multiple risk measures while introducing a functional risk measure formulation that systematically captures the relationship between these risk measures and varying loan levels, supported by theoretical guarantees. To capture complex covariate interactions in e-commerce sales data, we integrate QRGMM with Deep Factorization Machines (DeepFM). Extensive experiments on synthetic and real-world data validate the efficacy of our model for credit risk assessment and loan size determination. This study explores the use of generative models in CBEC SCF risk management, illustrating their potential to strengthen credit assessment and support financing for small- and medium-sized sellers.
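
A compact sketch of the QRGMM idea: fit conditional quantile models on a grid of levels, then generate a sample at covariates x by drawing a uniform level and interpolating the predicted quantile curve (sorted to avoid quantile crossing). The model class and grid below are illustrative.

```python
import numpy as np
from sklearn.linear_model import QuantileRegressor

def fit_qrgmm(X, y, quantiles=np.linspace(0.05, 0.95, 19)):
    """Returns a conditional sampler for y | x built from one quantile
    regressor per level."""
    models = [QuantileRegressor(quantile=q, alpha=0.0).fit(X, y)
              for q in quantiles]

    def sample(x, rng=np.random.default_rng(0)):
        preds = np.sort([m.predict(x.reshape(1, -1))[0] for m in models])
        u = rng.uniform(quantiles[0], quantiles[-1])
        return np.interp(u, quantiles, preds)

    return sample
```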

[324] IPA: An Information-Reconstructive Input Projection Framework for Efficient Foundation Model Adaptation

Yuan Yin, Shashanka Venkataramanan, Tuan-Hung Vu, Andrei Bursuc, Matthieu Cord

Main category: cs.LG

TL;DR: IPA improves LoRA by replacing random down-projection with feature-aware compression that reconstructs original inputs, boosting performance with fewer parameters.

DetailsMotivation: LoRA's random initialization of down-projection discards useful information and becomes a bottleneck since it changes little during training while up-projection does most adaptation.

Method: IPA uses feature-aware projection framework that reconstructs original input in reduced hidden space. For linear case, implements algorithms approximating top principal components for efficient projector pretraining.

Result: IPA consistently outperforms LoRA and DoRA: +1.5 points average accuracy on commonsense reasoning, +2.3 points on VTAB-1k. Matches full LoRA performance with ~50% trainable parameters when projection is frozen.

Conclusion: IPA’s feature-aware projection effectively addresses LoRA’s random initialization bottleneck, improving parameter efficiency and performance across vision and language tasks.

Abstract: Parameter-efficient fine-tuning (PEFT) methods, such as LoRA, reduce adaptation cost by injecting low-rank updates into pretrained weights. However, LoRA’s down-projection is randomly initialized and data-agnostic, discarding potentially useful information. Prior analyses show that this projection changes little during training, while the up-projection carries most of the adaptation, making the random input compression a performance bottleneck. We propose IPA, a feature-aware projection framework that explicitly aims to reconstruct the original input within a reduced hidden space. In the linear case, we instantiate IPA with algorithms approximating top principal components, enabling efficient projector pretraining with negligible inference overhead. Across language and vision benchmarks, IPA consistently improves over LoRA and DoRA, achieving on average 1.5 points higher accuracy on commonsense reasoning and 2.3 points on VTAB-1k, while matching full LoRA performance with roughly half the trainable parameters when the projection is frozen. Code available at https://github.com/valeoai/peft-ipa.
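
In the linear case, the feature-aware down-projection can be sketched as the top principal directions of a calibration batch of layer inputs; the paper's efficient approximation algorithms are not reproduced here.

```python
import torch

def ipa_down_projection(layer_inputs, r):
    """Top-r principal directions of the layer's inputs, used as the
    down-projection so the compressed representation can approximately
    reconstruct the original input (unlike LoRA's random, data-agnostic
    projection). `layer_inputs`: (n_samples, d_in)."""
    X = layer_inputs - layer_inputs.mean(dim=0)
    _, _, V = torch.pca_lowrank(X, q=r, center=False)
    return V.T                                  # (r, d_in)

proj = ipa_down_projection(torch.randn(4096, 768), r=16)
```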

[325] scE2TM improves single-cell embedding interpretability and reveals cellular perturbation signatures

Hegang Chen, Yuyin Lu, Yifan Zhao, Zhiming Dai, Fu Lee Wang, Qing Li, Yanghui Rao, Yue Li

Main category: cs.LG

TL;DR: scE2TM is an external knowledge-guided embedded topic model for single-cell RNA-seq analysis that prevents interpretation collapse through embedding clustering regularization, achieving superior clustering performance and biological interpretability.

DetailsMotivation: Current embedded topic models for single-cell analysis suffer from interpretation collapse (topics semantically collapsing into redundancy) and fail to fully capture biological variation. The rise of single-cell foundation models creates opportunities to leverage external biological knowledge for better embeddings.

Method: scE2TM uses embedding clustering regularization where each topic is constrained to be the center of a separately aggregated gene cluster, ensuring topics capture unique biological information. It incorporates external biological knowledge from single-cell foundation models to guide embeddings.

Result: On 20 scRNA-seq datasets, scE2TM outperforms 7 state-of-the-art methods in clustering performance. It produces topics with higher diversity and stronger consistency with biological pathways. In interferon-stimulated PBMCs, it accurately simulates topic perturbations mirroring experimental responses. In melanoma, it identifies malignant-specific topics that reveal survival-associated gene programs.

Conclusion: scE2TM provides a high-quality, interpretable embedding framework for single-cell analysis that prevents interpretation collapse, leverages external knowledge, and delivers biologically meaningful insights across diverse applications including perturbation modeling and clinical biomarker discovery.

Abstract: Single-cell RNA sequencing technologies have revolutionized our understanding of cellular heterogeneity, yet computational methods often struggle to balance performance with biological interpretability. Embedded topic models have been widely used for interpretable single-cell embedding learning. However, these models suffer from the potential problem of interpretation collapse, where topics semantically collapse towards each other, resulting in redundant topics and incomplete capture of biological variation. Furthermore, the rise of single-cell foundation models creates opportunities to harness external biological knowledge for guiding model embeddings. Here, we present scE2TM, an external knowledge-guided embedded topic model that provides a high-quality cell embedding and interpretation for scRNA-seq analysis. Through embedding clustering regularization method, each topic is constrained to be the center of a separately aggregated gene cluster, enabling it to capture unique biological information. Across 20 scRNA-seq datasets, scE2TM achieves superior clustering performance compared with seven state-of-the-art methods. A comprehensive interpretability benchmark further shows that scE2TM-learned topics exhibit higher diversity and stronger consistency with underlying biological pathways. Modeling interferon-stimulated PBMCs, scE2TM simulates topic perturbations that drive control cells toward stimulated-like transcriptional states, faithfully mirroring experimental interferon responses. In melanoma, scE2TM identifies malignant-specific topics and extrapolates them to unseen patient data, revealing gene programs associated with patient survival.

[326] Rethinking Sparse Autoencoders: Select-and-Project for Fairness and Control from Encoder Features Alone

Antonio Bărbălau, Cristian Daniel Păduraru, Teodor Poncu, Alexandru Tifrea, Elena Burceanu

Main category: cs.LG

TL;DR: S&P Top-K is an encoder-centric alternative to traditional decoder-based SAE steering that identifies and projects top-K encoder features aligned with sensitive attributes, improving fairness and reducing undesirable behaviors in vision-language and language models.

DetailsMotivation: Current SAE-based model steering uses decoder-centric approaches that rewrite activations as weighted sums of decoder features, but this may not be optimal for cross-modal performance. The authors propose an encoder-centric alternative for more effective model steering.

Method: S&P Top-K: A retraining-free, lightweight Selection and Projection framework that: 1) Identifies Top-K encoder features aligned with sensitive attributes/behaviors, 2) Optionally aggregates them into a single control axis, and 3) Computes an orthogonal projection applied directly in the model’s native embedding space.

Result: In vision-language models: Improves fairness metrics on CelebA and FairFace by up to 3.2× over conventional SAE usage. In LLMs: Substantially reduces aggressiveness and sycophancy in Llama-3 8B Instruct, achieving up to 3.6× gains over masked reconstruction.

Conclusion: Encoder-centric interventions provide a general, efficient, and more effective mechanism for shaping model behavior at inference time than traditional decoder-centric SAE approaches, offering superior cross-modal performance.

Abstract: Sparse Autoencoders (SAEs) are widely employed for mechanistic interpretability and model steering. Within this context, steering is by design performed by means of decoding altered SAE intermediate representations. This procedure essentially rewrites the original activations as a weighted sum of decoder features. In contrast to existing literature, we put forward an encoder-centric alternative to model steering that demonstrates stronger cross-modal performance. We introduce S&P Top-K, a retraining-free and computationally lightweight Selection and Projection framework that identifies Top-K encoder features aligned with a sensitive attribute or behavior, optionally aggregates them into a single control axis, and computes an orthogonal projection to be subsequently applied directly in the model’s native embedding space. In vision-language models, it improves fairness metrics on CelebA and FairFace by up to 3.2 times over conventional SAE usage, and in large language models, it substantially reduces aggressiveness and sycophancy in Llama-3 8B Instruct, achieving up to 3.6 times gains over masked reconstruction. These findings suggest that encoder-centric interventions provide a general, efficient, and more effective mechanism for shaping model behavior at inference time than the traditional decoder-centric use of SAEs.
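
To make the select-and-project recipe concrete, here is a hedged sketch under our own naming (the paper's actual feature-scoring rule and code may differ): features are scored by correlation with a binary sensitive attribute, the top-K encoder rows are aggregated into a unit control axis, and that axis is removed from the native embeddings by orthogonal projection.

```python
# Hedged sketch of select-and-project; all names are ours, not the paper's.
import torch

def select_and_project(feats, attr, embeds, W_enc, k=8):
    """feats: (N, F) SAE encoder activations; attr: (N,) binary attribute;
    embeds: (N, D) native model embeddings; W_enc: (F, D) encoder weights."""
    # Score each feature by |correlation| with the sensitive attribute.
    f = (feats - feats.mean(0)) / (feats.std(0) + 1e-8)
    a = (attr.float() - attr.float().mean()) / (attr.float().std() + 1e-8)
    scores = (f * a[:, None]).mean(0).abs()
    top = scores.topk(k).indices
    # Aggregate the selected encoder rows into a single unit control axis.
    axis = W_enc[top].mean(0)
    axis = axis / axis.norm()
    # Orthogonal projection: remove the axis component from every embedding.
    return embeds - (embeds @ axis)[:, None] * axis[None, :]
```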

[327] Establishing Validity for Distance Functions and Internal Clustering Validity Indices in Correlation Space

Isabella Degen, Zahraa S Abdallah, Kate Robson Brown, Henry W J Reeve

Main category: cs.LG

TL;DR: ICVI performance depends on data structure type. Validating correlation patterns shows SWC and DBI are valid, while VRC and PBM fail, contrary to previous studies.

DetailsMotivation: Current ICVI studies evaluate based on matching human labels rather than measuring discovered data structure quality, leaving practitioners without principled ICVI selection. Inconsistent performance arises from not formally quantifying structure type.

Method: First validity assessment for correlation patterns. Formalized 23 canonical correlation patterns as theoretical optimal clustering. Used synthetic data modeling this structure with controlled perturbations to evaluate validity across content, criterion, construct, and external validity dimensions.

Result: Silhouette Width Criterion (SWC) and Davies-Bouldin Index (DBI) are valid for correlation patterns. Calinski-Harabasz (VRC) and Pakhira-Bandyopadhyay-Maulik (PBM) indices fail. Simple Lp norm distances achieve validity, while correlation-specific functions fail structural, criterion, and external validity.

Conclusion: Validity depends on structure type, explaining inconsistent ICVI performance across studies. Structure-type-specific validation provides practical guidance (SWC>0.9, DBI<0.15 thresholds) and methodological template for establishing validity for other structure types.

Abstract: Internal clustering validity indices (ICVIs) assess clustering quality without ground truth labels. Comparative studies consistently find that no single ICVI outperforms others across datasets, leaving practitioners without principled ICVI selection. We argue that inconsistent ICVI performance arises because studies evaluate them based on matching human labels rather than measuring the quality of the discovered structure in the data, using datasets without formally quantifying the structure type and quality. Structure type refers to the mathematical organisation in data that clustering aims to discover. Validity theory requires a theoretical definition of clustering quality, which depends on structure type. We demonstrate this through the first validity assessment of clustering quality measures for correlation patterns, a structure type that arises from clustering time series by correlation relationships. We formalise 23 canonical correlation patterns as the theoretical optimal clustering and use synthetic data modelling this structure with controlled perturbations to evaluate validity across content, criterion, construct, and external validity. Our findings show that Silhouette Width Criterion (SWC) and Davies-Bouldin Index (DBI) are valid for correlation patterns, whilst Calinski-Harabasz (VRC) and Pakhira-Bandyopadhyay-Maulik (PBM) indices fail. Simple Lp norm distances achieve validity, whilst correlation-specific functions fail structural, criterion, and external validity. These results differ from previous studies where VRC and PBM performed well, demonstrating that validity depends on structure type. Our structure-type-specific validation method provides both practical guidance (quality thresholds SWC>0.9, DBI<0.15) and a methodological template for establishing validity for other structure types.
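
For orientation, the two indices the paper finds valid are available off the shelf; the toy check below uses scikit-learn's Euclidean implementations purely to illustrate the reported thresholds. Note this does not reproduce the paper's setting, which evaluates the indices on correlation-pattern structure with specific distance functions.

```python
# Toy illustration of the suggested quality thresholds (SWC > 0.9, DBI < 0.15)
# on Euclidean blob data, not the paper's correlation-space evaluation.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score, davies_bouldin_score

X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.3, random_state=0)
labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X)

swc = silhouette_score(X, labels)      # Silhouette Width Criterion
dbi = davies_bouldin_score(X, labels)  # Davies-Bouldin Index
print(f"SWC={swc:.3f} (want > 0.9), DBI={dbi:.3f} (want < 0.15)")
```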

[328] Self-Supervised Learning of Graph Representations for Network Intrusion Detection

Lorenzo Guerra, Thomas Chapuis, Guillaume Duc, Pavlo Mozharovskyi, Van-Tam Nguyen

Main category: cs.LG

TL;DR: GraphIDS: Self-supervised graph neural network for network intrusion detection that unifies representation learning and anomaly detection through masked autoencoder reconstruction.

DetailsMotivation: Existing graph-based intrusion detection methods decouple representation learning from anomaly detection, limiting embedding utility for attack identification. Need for unified approach that learns normal communication patterns directly optimized for intrusion detection.

Method: Combines inductive graph neural network for local topological context embedding with Transformer-based encoder-decoder for reconstruction. Learns normal patterns via masked autoencoder, flags intrusions based on high reconstruction errors during inference.

Result: Achieves up to 99.98% PR-AUC and 99.61% macro F1-score on diverse NetFlow benchmarks, outperforming baselines by 5-25 percentage points.

Conclusion: GraphIDS provides effective end-to-end framework for network intrusion detection by unifying representation learning and anomaly detection, enabling direct optimization of embeddings for malicious traffic recognition.

Abstract: Detecting intrusions in network traffic is a challenging task, particularly under limited supervision and constantly evolving attack patterns. While recent works have leveraged graph neural networks for network intrusion detection, they often decouple representation learning from anomaly detection, limiting the utility of the embeddings for identifying attacks. We propose GraphIDS, a self-supervised intrusion detection model that unifies these two stages by learning local graph representations of normal communication patterns through a masked autoencoder. An inductive graph neural network embeds each flow with its local topological context to capture typical network behavior, while a Transformer-based encoder-decoder reconstructs these embeddings, implicitly learning global co-occurrence patterns via self-attention without requiring explicit positional information. During inference, flows with unusually high reconstruction errors are flagged as potential intrusions. This end-to-end framework ensures that embeddings are directly optimized for the downstream task, facilitating the recognition of malicious traffic. On diverse NetFlow benchmarks, GraphIDS achieves up to 99.98% PR-AUC and 99.61% macro F1-score, outperforming baselines by 5-25 percentage points.
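
The inference rule is simple to state in code. The sketch below is our illustration of the general pattern (reconstruction error plus a quantile threshold calibrated on benign traffic), not GraphIDS's released pipeline; the GNN and Transformer components are abstracted behind the assumed `autoencoder` module.

```python
# Illustrative anomaly scoring in the GraphIDS style (names are ours).
import torch

def flag_intrusions(embeddings, autoencoder, q=0.99):
    """embeddings: (num_flows, d) flow embeddings from a GNN (assumed given).
    autoencoder: any module mapping (N, d) -> (N, d), trained on benign data."""
    with torch.no_grad():
        recon = autoencoder(embeddings)
    errors = ((recon - embeddings) ** 2).mean(dim=1)
    threshold = torch.quantile(errors, q)  # calibrate q on benign traffic
    return errors > threshold, errors     # flags and raw scores
```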

[329] Pushing Toward the Simplex Vertices: A Simple Remedy for Code Collapse in Smoothed Vector Quantization

Takashi Morita

Main category: cs.LG

TL;DR: The paper introduces a novel regularization method for smoothed vector quantization that simultaneously ensures tight approximation to one-hot vectors and prevents code collapse by minimizing distances between simplex vertices and their K-nearest smoothed quantizers.

DetailsMotivation: Vector quantization is widely used but faces a fundamental challenge: the non-differentiable quantization step blocks gradient backpropagation. Smoothed quantization addresses this but requires two key properties - tight approximation to one-hot vectors and prevention of code collapse - which existing methods typically address separately.

Method: The paper proposes a simple regularization that promotes both desiderata simultaneously by minimizing the distance between each simplex vertex and its K-nearest smoothed quantizers. This ensures smoothed quantizers remain close to one-hot vectors while utilizing all codebook entries.

Result: Experiments on discrete image autoencoding and contrastive speech representation learning benchmarks demonstrate that the proposed method achieves more reliable codebook utilization and improves performance compared to prior approaches.

Conclusion: The introduced regularization provides an effective solution for smoothed vector quantization that simultaneously addresses the dual requirements of tight approximation and codebook utilization, outperforming existing methods that handle these issues separately.

Abstract: Vector quantization, which discretizes a continuous vector space into a finite set of representative vectors (a codebook), has been widely adopted in modern machine learning. Despite its effectiveness, vector quantization poses a fundamental challenge: the non-differentiable quantization step blocks gradient backpropagation. Smoothed vector quantization addresses this issue by relaxing the hard assignment of a codebook vector into a weighted combination of codebook entries, represented as the matrix product of a simplex vector and the codebook. Effective smoothing requires two properties: (1) smoothed quantizers should remain close to a one-hot vector, ensuring tight approximation, and (2) all codebook entries should be utilized, preventing code collapse. Existing methods typically address these desiderata separately. By contrast, the present study introduces a simple and intuitive regularization that promotes both simultaneously by minimizing the distance between each simplex vertex and its $K$-nearest smoothed quantizers. Experiments on representative benchmarks, including discrete image autoencoding and contrastive speech representation learning, demonstrate that the proposed method achieves more reliable codebook utilization and improves performance compared to prior approaches.
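
Our reading of the regularization, sketched below under assumed shapes: each simplex vertex (one-hot code) is pulled toward its K nearest smoothed quantizers in the batch, which simultaneously tightens assignments toward vertices and keeps every codebook entry in use. The exact distance and weighting in the paper may differ.

```python
# Sketch of a vertex-pull regularizer; hyperparameters are illustrative.
import torch

def vertex_pull_regularizer(soft_assign: torch.Tensor, k: int = 4):
    """soft_assign: (N, C) rows on the simplex (e.g. softmax outputs)."""
    n, c = soft_assign.shape
    vertices = torch.eye(c, device=soft_assign.device)  # (C, C) one-hot codes
    dists = torch.cdist(vertices, soft_assign)          # (C, N) distances
    knn = dists.topk(min(k, n), largest=False).values   # K smallest per vertex
    return knn.mean()  # add to the task loss with a weighting coefficient

loss_reg = vertex_pull_regularizer(torch.softmax(torch.randn(32, 8), dim=-1))
```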

[330] LLM-NAS: LLM-driven Hardware-Aware Neural Architecture Search

Hengyi Zhu, Grace Li Zhang, Shaoyi Huang

Main category: cs.LG

TL;DR: LLM-NAS: An LLM-driven Neural Architecture Search method that reduces search cost from days to minutes while achieving better accuracy-latency trade-offs by addressing exploration bias through complexity partitioning and prompt co-evolution.

DetailsMotivation: Traditional HW-NAS methods require multiple GPU days per dataset, while LLM-driven approaches suffer from exploration bias - they repeatedly propose designs within limited search space and fail to discover architectures across different latency ranges.

Method: Three key components: 1) Complexity-driven partitioning engine to divide search space by complexity for diversity; 2) LLM-powered architecture prompt co-evolution operator where LLM updates design heuristics knowledge base and performs guided evolution; 3) Zero-cost predictor to avoid training candidates from scratch.

Result: On HW-NAS-Bench, LLM-NAS achieves higher HV, lower IGD, and up to 54% lower latency than baselines at similar accuracy. Search cost drops from days to minutes compared with traditional supernet baselines.

Conclusion: LLM-NAS effectively addresses exploration bias in LLM-driven NAS, enabling efficient discovery of diverse neural architectures with better accuracy-latency trade-offs while dramatically reducing search time from days to minutes.

Abstract: Hardware-Aware Neural Architecture Search (HW-NAS) requires joint optimization of accuracy and latency under device constraints. Traditional supernet-based methods require multiple GPU days per dataset. Large Language Model (LLM)-driven approaches avoid training a large supernet and can provide quick feedback, but we observe an exploration bias: the LLM repeatedly proposes neural network designs within a limited search space and fails to discover architectures across different latency ranges in the entire search space. To address this issue, we propose LLM-NAS: an LLM-driven Neural Architecture Search that can generate neural networks with high accuracy and low latency with reduced search cost. Our proposed LLM-NAS has three key components: 1) a complexity-driven partitioning engine that divides the search space by complexity to enforce diversity and mitigate exploration bias; 2) an LLM-powered architecture prompt co-evolution operator, in which the LLM first updates a knowledge base of design heuristics based on results from the previous round, then performs a guided evolution algorithm on architectures with prompts that incorporate this knowledge base. Prompts and designs improve together across rounds, which avoids random guesswork and improves efficiency; 3) a zero-cost predictor to avoid training a large number of candidates from scratch. Experimental results show that on HW-NAS-Bench, LLM-NAS can achieve overall higher HV, lower IGD, and up to 54% lower latency than baselines at similar accuracy. Meanwhile, the search cost drops from days to minutes compared with traditional supernet baselines.

[331] Multi-marginal temporal Schrödinger Bridge Matching from unpaired data

Thomas Gravier, Thomas Boyer, Auguste Genovesio

Main category: cs.LG

TL;DR: MMtSBM extends Diffusion Schrödinger Bridge Matching to multiple time points, enabling reconstruction of hidden dynamics from unpaired static snapshots with theoretical guarantees and practical efficiency.

DetailsMotivation: Many natural dynamic processes (like cellular differentiation or disease progression) can only be observed through static snapshots, making it challenging but crucial to reconstruct their temporal evolution to understand underlying dynamics.

Method: Proposes Multi-Marginal temporal Schrödinger Bridge Matching (MMtSBM) from unpaired data, extending Diffusion Schrödinger Bridge Matching by deriving the Iterative Markovian Fitting algorithm to handle multiple marginals in a novel factorized fashion.

Result: MMtSBM retains theoretical properties on toy examples, achieves state-of-the-art performance on real-world datasets (e.g., transcriptomic trajectory inference in 100 dimensions), and for the first time recovers couplings and dynamics in very high-dimensional image settings.

Conclusion: Establishes multi-marginal Schrödinger bridges as a practical and principled approach for recovering hidden dynamics from static data, addressing scalability and restrictive assumption limitations of existing methods.

Abstract: Many natural dynamic processes – such as in vivo cellular differentiation or disease progression – can only be observed through the lens of static sample snapshots. While challenging, reconstructing their temporal evolution to decipher underlying dynamic properties is of major interest to scientific research. Existing approaches enable data transport along a temporal axis but are poorly scalable in high dimension and require restrictive assumptions to be met. To address these issues, we propose Multi-Marginal temporal Schrödinger Bridge Matching (MMtSBM) from unpaired data, extending the theoretical guarantees and empirical efficiency of Diffusion Schrödinger Bridge Matching (arXiv:2303.16852) by deriving the Iterative Markovian Fitting algorithm to multiple marginals in a novel factorized fashion. Experiments show that MMtSBM retains theoretical properties on toy examples, achieves state-of-the-art performance on real-world datasets such as transcriptomic trajectory inference in 100 dimensions, and, for the first time, recovers couplings and dynamics in very high-dimensional image settings. Our work establishes multi-marginal Schrödinger bridges as a practical and principled approach for recovering hidden dynamics from static data.

[332] LMCache: An Efficient KV Cache Layer for Enterprise-Scale LLM Inference

Yuhan Liu, Yihua Cheng, Jiayi Yao, Yuwei An, Xiaokun Chen, Shaoting Feng, Yuyang Huang, Samuel Shen, Rui Zhang, Kuntai Du, Junchen Jiang

Main category: cs.LG

TL;DR: LMCACHE is an efficient open-source KV caching solution that offloads and shares KV caches across LLM inference engines and queries, addressing GPU memory limitations through optimized data movement and modular architecture.

DetailsMotivation: KV cache storage in GPU memory is insufficient for growing cache demands, requiring offloading solutions to enable cache reuse across queries and inference engines, as evidenced by real-world usage statistics showing cache growth exceeding GPU capacity.

Method: LMCACHE implements: (1) optimized KV cache data movement using batched operations and pipelining, (2) modular KV cache connector to decouple from evolving inference engines, and (3) control API for flexible cache orchestration across GPU, CPU, storage, and network layers.

Result: LMCACHE achieves up to 15x throughput improvement when combined with vLLM across workloads like multi-round QA and document analysis. Enterprise adoption reveals insights about remote storage benefits for prefill delay and context truncation’s negative impact on cache hit ratio.

Conclusion: LMCACHE provides an efficient, widely-adopted solution for KV cache offloading and sharing, addressing critical GPU memory limitations in LLM inference while offering valuable insights for real-world deployment through its modular design and optimized performance.

Abstract: KV cache has traditionally been stored in GPU memory to accelerate the decoding phase of large language model (LLM) inference. However, it is increasingly necessary to move KV caches outside GPU devices, to enable cache reuse across different queries and inference engines. Our real-world usage statistics confirm this trend: over time, the total KV cache stored by users has grown rapidly, far exceeding the capacity of GPU memory. Despite this need, an efficient solution for offloading and transferring KV caches has been lacking. We present LMCACHE, the first and so far the most efficient open-source KV caching solution, which extracts and stores KV caches generated by modern LLM engines (vLLM and SGLang) out of the GPU memory and shares them across engines and queries. LMCACHE supports both cache offloading (prefix reuse across queries) and prefill-decode (PD) disaggregation (cross-engine/GPU cache transfer). LMCACHE’s high performance and wide adoption stem from the following contributions: (1) highly optimized KV cache data movement powered by batched data movement operations, compute and I/O pipelining; (2) a modular KV cache connector component, decoupling LMCACHE from the rapid evolution of inference engines; (3) a first-class control API for flexible cache orchestration across GPU, CPU, storage, and network layers. Our evaluation shows that combining LMCACHE with vLLM achieves up to 15x improvement in throughput across workloads such as multi-round question answering and document analysis. Large-scale adoption of LMCACHE in enterprise settings provides us with valuable insights: for example, fetching KV caches from remote storage unsurprisingly benefits prefill delay, and context truncation, a widely applied technique in industry, can cut the prefix cache hit ratio in half. The source code of LMCACHE is at: https://github.com/LMCache/LMCache.
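
As a toy illustration of cache offloading (not LMCACHE's actual API; the class and method names here are invented), a prefix-keyed store moves finished KV tensors to CPU memory so a later query sharing the same token prefix can fetch them back instead of recomputing prefill:

```python
# Toy prefix-keyed KV offload store; illustrative only, not LMCache's API.
import hashlib
import torch

class PrefixKVStore:
    def __init__(self):
        self._cpu_store = {}

    @staticmethod
    def _key(token_ids):
        return hashlib.sha256(str(token_ids).encode("utf-8")).hexdigest()

    def offload(self, token_ids, kv: torch.Tensor):
        # Moving finished KV off the GPU frees memory for decoding.
        self._cpu_store[self._key(token_ids)] = kv.to("cpu", non_blocking=True)

    def fetch(self, token_ids, device="cuda"):
        # A hit lets the engine skip prefill for the shared prefix.
        kv = self._cpu_store.get(self._key(token_ids))
        return kv.to(device, non_blocking=True) if kv is not None else None
```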

[333] Data-Augmented Deep Learning for Downhole Depth Sensing and Field Validation

Siyu Xiao, Xindi Zhao, Tianhao Mao, Yiwei Wang, Yuqiao Chen, Hongyun Zhang, Jian Wang, Junjie Wang, Shuang Liu, Tupei Chen, Yang Liu

Main category: cs.LG

TL;DR: A system for CCL signal acquisition with comprehensive preprocessing methods for data augmentation improves neural network-based casing collar recognition, achieving perfect F1 scores on benchmark models.

DetailsMotivation: Accurate downhole depth measurement is crucial for oil/gas operations, but preprocessing methods for CCL signal recognition are underdeveloped, and limited real well data availability challenges neural network training requiring extensive datasets.

Method: Integrated system for CCL signal acquisition in downhole tools; comprehensive preprocessing methods for data augmentation including standardization, label distribution smoothing (LDS), random cropping, label smoothing regularization (LSR), time scaling, and multiple sampling; systematic experimentation across configuration combinations.

Result: Standardization, LDS, and random cropping are fundamental requirements; LSR, time scaling, and multiple sampling enhance generalization; F1 scores improved from 0.937/0.952 to 1.0/1.0 on benchmark models; validation on real CCL waveforms confirms effectiveness.

Conclusion: The work addresses gaps in data augmentation methodologies for training casing collar recognition models in CCL data-limited environments, demonstrating practical applicability for oil/gas well operations.

Abstract: Accurate downhole depth measurement is essential for oil and gas well operations, directly influencing reservoir contact, production efficiency, and operational safety. Collar correlation using a casing collar locator (CCL) is fundamental for precise depth calibration. While neural network-based CCL signal recognition has achieved significant progress in collar identification, preprocessing methods for such applications remain underdeveloped. Moreover, the limited availability of real well data poses substantial challenges for training neural network models that require extensive datasets. This paper presents a system integrated into downhole tools for CCL signal acquisition to facilitate dataset construction. We propose comprehensive preprocessing methods for data augmentation and evaluate their effectiveness using our neural network models. Through systematic experimentation across various configuration combinations, we analyze the contribution of each augmentation method. Results demonstrate that standardization, label distribution smoothing (LDS), and random cropping are fundamental requirements for model training, while label smoothing regularization (LSR), time scaling, and multiple sampling significantly enhance model generalization capability. The F1 scores of our two benchmark models trained with the proposed augmentation methods improve from 0.937 and 0.952 to 1.0 and 1.0, respectively. Performance validation on real CCL waveforms confirms the effectiveness and practical applicability of our approach. This work addresses the gaps in data augmentation methodologies for training casing collar recognition models in CCL data-limited environments.
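
A hedged sketch of three of the named augmentations (standardization and random cropping, which the paper finds fundamental, and time scaling, which it finds helpful) applied to a 1-D CCL waveform; parameter ranges here are illustrative choices of ours, not the paper's settings.

```python
# Illustrative 1-D waveform augmentation pipeline; constants are ours.
import numpy as np

def augment_ccl(signal: np.ndarray, crop_len: int, rng=np.random) -> np.ndarray:
    """Assumes len(signal) >= crop_len."""
    # Standardize to zero mean / unit variance.
    x = (signal - signal.mean()) / (signal.std() + 1e-8)
    # Random crop of a fixed-length window.
    start = rng.randint(0, max(1, len(x) - crop_len))
    x = x[start:start + crop_len]
    # Time scaling: resample by a random factor via index remapping.
    scale = rng.uniform(0.9, 1.1)
    idx = np.clip((np.arange(crop_len) * scale).astype(int), 0, crop_len - 1)
    return x[idx]
```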

[334] MOSS: Efficient and Accurate FP8 LLM Training with Microscaling and Automatic Scaling

Yu Zhang, Hui-Ling Zhen, Mingxuan Yuan, Bei Yu

Main category: cs.LG

TL;DR: MOSS is a novel FP8 training framework that enables efficient and stable training of large language models by introducing two-level microscaling for activations and automatic scaling for weights, achieving 34% higher throughput than BF16 baseline.

DetailsMotivation: FP8 training offers efficiency gains but faces challenges with numerical precision and stability. Current methods using per-group quantization and just-in-time scaling introduce dequantization overhead and memory inefficiencies that negate FP8 benefits.

Method: Two key innovations: (1) Two-level microscaling for sensitive activations - combines high-precision global scale with compact power-of-two local scales; (2) Automatic scaling for weights in linear layers - predicts and adjusts scaling factors during training without costly max-reduction operations.

Result: MOSS enables efficient FP8 training of 7B parameter models with performance comparable to BF16 baseline while achieving up to 34% higher training throughput.

Conclusion: MOSS successfully addresses FP8 training limitations by balancing precision and efficiency through novel scaling strategies, making FP8 training practical for large language models.

Abstract: Training large language models with FP8 formats offers significant efficiency gains. However, the reduced numerical precision of FP8 poses challenges for stable and accurate training. Current frameworks preserve training performance using mixed-granularity quantization, i.e., applying per-group quantization for activations and per-tensor/block quantization for weights. While effective, per-group quantization requires scaling along the inner dimension of matrix multiplication, introducing additional dequantization overhead. Moreover, these frameworks often rely on just-in-time scaling to dynamically adjust scaling factors based on the current data distribution. However, this online quantization is inefficient for FP8 training, as it involves multiple memory reads and writes that negate the performance benefits of FP8. To overcome these limitations, we propose MOSS, a novel FP8 training framework that ensures both efficiency and numerical stability. MOSS introduces two key innovations: (1) a two-level microscaling strategy for quantizing sensitive activations, which balances precision and dequantization cost by combining a high-precision global scale with compact, power-of-two local scales; and (2) automatic scaling for weights in linear layers, which eliminates the need for costly max-reduction operations by predicting and adjusting scaling factors during training. Leveraging these techniques, MOSS enables efficient FP8 training of a 7B parameter model, achieving performance comparable to the BF16 baseline while achieving up to 34% higher training throughput.
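
The two-level idea can be sketched in a few lines. This is our simplified reading with illustrative constants (group size 32, an E4M3-style maximum of 448), not MOSS's kernels: one high-precision global scale per tensor, plus a per-group scale restricted to powers of two so only a small exponent needs to be stored and applied at dequantization.

```python
# Sketch of two-level microscaling; assumes x.numel() is divisible by `group`.
import torch

def two_level_quantize(x: torch.Tensor, group: int = 32, fp8_max: float = 448.0):
    """Returns (scaled values, global scale, per-group power-of-two exponents).
    In a real kernel the scaled values would be cast to FP8."""
    g_scale = x.abs().max() / fp8_max                  # high-precision global
    xg = (x / g_scale).reshape(-1, group)
    # Local scale restricted to powers of two: store only the exponent.
    exps = torch.ceil(torch.log2(xg.abs().max(dim=1).values / fp8_max + 1e-20))
    local = torch.exp2(exps)[:, None]
    q = torch.clamp(xg / local, -fp8_max, fp8_max)     # now within FP8 range
    return q, g_scale, exps
```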

[335] FAST-CAD: A Fairness-Aware Framework for Non-Contact Stroke Diagnosis

Tianming Sha, Zechuan Chen, Zhan Cheng, Haotian Zhai, Xuwei Ding, Keze Wang

Main category: cs.LG

TL;DR: FAST-CAD combines domain-adversarial training and group distributionally robust optimization for fair stroke diagnosis across demographic groups with theoretical guarantees.

DetailsMotivation: Existing automated stroke diagnosis methods suffer from fairness issues across demographic groups, potentially exacerbating healthcare disparities. Timely stroke diagnosis significantly improves patient survival, but current methods may not perform equally well across different demographic subgroups.

Method: FAST-CAD combines domain-adversarial training (DAT) with group distributionally robust optimization (Group-DRO). It uses self-supervised encoders with adversarial domain discrimination to learn demographic-invariant representations, while Group-DRO optimizes worst-group risk to ensure robust performance across all subgroups. The framework is built on domain adaptation and minimax fairness theory.

Result: Extensive experiments on a multimodal dataset covering 12 demographic subgroups (defined by age, gender, and posture) show superior diagnostic performance while maintaining fairness across demographic groups. The method provides convergence guarantees and fairness bounds.

Conclusion: FAST-CAD provides both practical advances and theoretical insights for fair medical AI systems, demonstrating that the unified DAT + Group-DRO framework effectively addresses fairness issues in stroke diagnosis while maintaining diagnostic accuracy.

Abstract: Stroke is an acute cerebrovascular disease, and timely diagnosis significantly improves patient survival. However, existing automated diagnosis methods suffer from fairness issues across demographic groups, potentially exacerbating healthcare disparities. In this work we propose FAST-CAD, a theoretically grounded framework that combines domain-adversarial training (DAT) with group distributionally robust optimization (Group-DRO) for fair and accurate non-contact stroke diagnosis. Our approach is built on domain adaptation and minimax fairness theory and provides convergence guarantees and fairness bounds. We curate a multimodal dataset covering 12 demographic subgroups defined by age, gender, and posture. FAST-CAD employs self-supervised encoders with adversarial domain discrimination to learn demographic-invariant representations, while Group-DRO optimizes worst-group risk to ensure robust performance across all subgroups. Extensive experiments show that our method achieves superior diagnostic performance while maintaining fairness across demographic groups, and our theoretical analysis supports the effectiveness of the unified DAT + Group-DRO framework. This work provides both practical advances and theoretical insights for fair medical AI systems.
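
FAST-CAD's Group-DRO component follows the standard worst-group-risk recipe; the sketch below shows that generic step (exponentiated weights over subgroup losses), not the paper's full adversarial pipeline, and the function names are ours.

```python
# Generic Group-DRO step: up-weight high-loss demographic subgroups.
import torch

def group_dro_loss(per_sample_loss, group_ids, group_weights, eta=0.01):
    """per_sample_loss: (N,); group_ids: (N,) ints in [0, G);
    group_weights: (G,) simplex weights carried across training steps."""
    G = group_weights.numel()
    group_losses = torch.stack([
        per_sample_loss[group_ids == g].mean() if (group_ids == g).any()
        else per_sample_loss.new_zeros(())
        for g in range(G)
    ])
    with torch.no_grad():  # multiplicative-weights update on the groups
        group_weights *= torch.exp(eta * group_losses)
        group_weights /= group_weights.sum()
    # Weighted risk approaches the worst-group risk as weights concentrate.
    return group_weights @ group_losses, group_weights
```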

[336] Towards a Generalisable Cyber Defence Agent for Real-World Computer Networks

Tim Dudman, Martyn Bull

Main category: cs.LG

TL;DR: TERLA enables RL agents to defend networks of varying topology/size without retraining using graph neural networks and fixed-size action spaces.

DetailsMotivation: Current RL agents for cyber defense require retraining for different network topologies/sizes, making them impractical for real-world networks where these change over time.

Method: Use heterogeneous graph neural networks to create fixed-size latent embeddings of network states, combined with reduced, interpretable action spaces. Applied to PPO agents and tested in realistic CAGE Challenge 4 environment.

Result: TERLA agents maintain defensive performance of vanilla PPO while improving action efficiency. They demonstrate generalizability by using same architecture across different networks and showing improved performance when deployed to defend varying network segments.

Conclusion: TERLA provides a generalizable approach for RL-based cyber defense that works across different network topologies and sizes without retraining, bridging the sim-to-real gap.

Abstract: Recent advances in deep reinforcement learning for autonomous cyber defence have resulted in agents that can successfully defend simulated computer networks against cyber-attacks. However, many of these agents would need retraining to defend networks with differing topology or size, making them poorly suited to real-world networks where topology and size can vary over time. In this research we introduce a novel set of Topological Extensions for Reinforcement Learning Agents (TERLA) that provide generalisability for the defence of networks with differing topology and size, without the need for retraining. Our approach involves the use of heterogeneous graph neural network layers to produce a fixed-size latent embedding representing the observed network state. This representation learning stage is coupled with a reduced, fixed-size, semantically meaningful and interpretable action space. We apply TERLA to a standard deep reinforcement learning Proximal Policy Optimisation (PPO) agent model, and to reduce the sim-to-real gap, conduct our research using Cyber Autonomy Gym for Experimentation (CAGE) Challenge 4. This Cyber Operations Research Gym environment has many of the features of a real-world network, such as realistic Intrusion Detection System (IDS) events and multiple agents defending network segments of differing topology and size. TERLA agents retain the defensive performance of vanilla PPO agents whilst showing improved action efficiency. Generalisability has been demonstrated by showing that all TERLA agents have the same network-agnostic neural network architecture, and by deploying a single TERLA agent multiple times to defend network segments with differing topology and size, showing improved defensive performance and efficiency.

[337] Beyond the Laplacian: Interpolated Spectral Augmentation for Graph Neural Networks

Ziyao Cui, Edric Tam

Main category: cs.LG

TL;DR: ILEs (Interpolated Laplacian Embeddings) provide spectral feature augmentation for GNNs using a family of graph matrices beyond standard Laplacian embeddings, improving performance when node features are limited.

DetailsMotivation: GNN performance depends on informative node features, which are often limited or absent in real-world applications. While Laplacian spectral embeddings are commonly used, the authors question whether embeddings from alternative graph matrices could also provide useful representations.

Method: Introduces Interpolated Laplacian Embeddings (ILEs) derived from a simple yet expressive family of graph matrices. Uses spectral graph theory to interpret the structural information captured by ILEs.

Result: Simulations and experiments on real-world datasets show that feature augmentation via ILEs improves performance across commonly used GNN architectures.

Conclusion: ILEs offer a straightforward and practical approach that broadens the practitioner’s spectral augmentation toolkit for GNNs when node features are limited, providing an alternative to standard Laplacian embeddings.

Abstract: Graph neural networks (GNNs) are fundamental tools in graph machine learning. The performance of GNNs relies crucially on the availability of informative node features, which can be limited or absent in real-life datasets and applications. A natural remedy is to augment the node features with embeddings computed from eigenvectors of the graph Laplacian matrix. While it is natural to default to Laplacian spectral embeddings, which capture meaningful graph connectivity information, we ask whether spectral embeddings from alternative graph matrices can also provide useful representations for learning. We introduce Interpolated Laplacian Embeddings (ILEs), which are derived from a simple yet expressive family of graph matrices. Using tools from spectral graph theory, we offer a straightforward interpretation of the structural information that ILEs capture. We demonstrate through simulations and experiments on real-world datasets that feature augmentation via ILEs can improve performance across commonly used GNN architectures. Our work offers a straightforward and practical approach that broadens the practitioner’s spectral augmentation toolkit when node features are limited.
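
The paper defines the exact matrix family; as a loose, explicitly hypothetical illustration, the sketch below interpolates between the adjacency matrix and the unnormalized Laplacian and uses leading eigenvectors as augmented node features. The true parameterization of ILEs may well differ.

```python
# Hypothetical interpolation illustrating spectral feature augmentation.
import numpy as np

def interpolated_embedding(A: np.ndarray, t: float, dim: int) -> np.ndarray:
    """A: (n, n) symmetric adjacency; t in [0, 1]; returns (n, dim) features."""
    L = np.diag(A.sum(axis=1)) - A       # unnormalized graph Laplacian
    M = (1.0 - t) * A + t * L            # our assumed interpolation, not ILE's
    vals, vecs = np.linalg.eigh(M)       # symmetric eigendecomposition
    return vecs[:, np.argsort(-np.abs(vals))[:dim]]  # top-|eigenvalue| vectors

# The returned columns can be concatenated to existing node features
# before they are fed into the GNN.
```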

[338] xLSTM-PINN: Memory-Gated Spectral Remodeling for Physics-Informed Learning

Ze Tao, Darui Zhao, Fujun Liu, Ke Xu, Xiangsheng Hu

Main category: cs.LG

TL;DR: xLSTM-PINN architecture addresses spectral bias in physics-informed neural networks through memory gating and residual micro-steps, achieving lower spectral/RMSE errors and better high-frequency modeling across PDE benchmarks.

DetailsMotivation: Physics-informed neural networks (PINNs) suffer from spectral bias that limits their ability to model high-frequency phenomena and hampers extrapolation performance, creating a need for improved architectures.

Method: Introduces xLSTM-PINN architecture that performs representation-level spectral remodeling through memory gating and residual micro-steps, without modifying automatic differentiation or physics loss constraints.

Result: Achieves consistently lower spectral error and RMSE across four diverse PDE benchmarks, with broader stable learning-rate window. Frequency-domain analysis shows elevated high-frequency kernel weights, rightward-shifted resolvable bandwidth, and shorter convergence time for high-wavenumber components.

Conclusion: xLSTM-PINN provides a robust pathway to suppress spectral bias in physics-informed learning, improving accuracy, reproducibility, and transferability without fundamental changes to the PINN framework.

Abstract: Physics-informed neural networks (PINNs) face significant challenges from spectral bias, which impedes their ability to model high-frequency phenomena and limits extrapolation performance. To address this, we introduce xLSTM-PINN, a novel architecture that performs representation-level spectral remodeling through memory gating and residual micro-steps. Our method consistently achieves markedly lower spectral error and root mean square error (RMSE) across four diverse partial differential equation (PDE) benchmarks, along with a broader stable learning-rate window. Frequency-domain analysis confirms that xLSTM-PINN elevates high-frequency kernel weights, shifts the resolvable bandwidth rightward, and shortens the convergence time for high-wavenumber components. Without modifying automatic differentiation or physics loss constraints, this work provides a robust pathway to suppress spectral bias, thereby improving accuracy, reproducibility, and transferability in physics-informed learning.

[339] Observational Auditing of Label Privacy

Iden Kalemaj, Luca Melis, Maxime Boucher, Ilya Mironov, Saeed Mahloujifar

Main category: cs.LG

TL;DR: Observational auditing framework for differential privacy that evaluates privacy guarantees without modifying training datasets, using inherent data randomness instead of canary injection or sample removal.

DetailsMotivation: Existing DP auditing methods require modifying training datasets (injecting canaries, removing samples), which is resource-intensive and involves significant engineering overhead for large-scale systems.

Method: Novel observational auditing framework that leverages inherent randomness of data distributions to evaluate privacy without altering original datasets. Extends privacy auditing beyond membership inference to protected attributes (with labels as special case).

Result: Experiments on Criteo and CIFAR-10 datasets demonstrate effectiveness in auditing label privacy guarantees. Theoretical foundations provided for the method.

Conclusion: This work opens new avenues for practical privacy auditing in large-scale production environments by eliminating the need for dataset modifications.

Abstract: Differential privacy (DP) auditing is essential for evaluating privacy guarantees in machine learning systems. Existing auditing methods, however, pose a significant challenge for large-scale systems since they require modifying the training dataset – for instance, by injecting out-of-distribution canaries or removing samples from training. Such interventions on the training data pipeline are resource-intensive and involve considerable engineering overhead. We introduce a novel observational auditing framework that leverages the inherent randomness of data distributions, enabling privacy evaluation without altering the original dataset. Our approach extends privacy auditing beyond traditional membership inference to protected attributes, with labels as a special case, addressing a key gap in existing techniques. We provide theoretical foundations for our method and perform experiments on Criteo and CIFAR-10 datasets that demonstrate its effectiveness in auditing label privacy guarantees. This work opens new avenues for practical privacy auditing in large-scale production environments.

[340] Dynamic Correction of Erroneous State Estimates via Diffusion Bayesian Exploration

Yiwei Shi, Hongnan Ma, Mengyue Yang, Cunjia Liu, Weiru Liu

Main category: cs.LG

TL;DR: Proposes a diffusion-driven Bayesian exploration framework to correct early state estimation errors in emergency response, overcoming the Stationarity-Induced Posterior Support Invariance problem in particle filters.

DetailsMotivation: Early state estimates in emergency response are often based on limited or biased information, leading to severe misalignment with reality that constrains subsequent actions and causes catastrophic delays, resource misallocation, and human harm. Bootstrap particle filters suffer from Stationarity-Induced Posterior Support Invariance (S-PSI), where regions excluded by the initial prior remain permanently unexplorable.

Method: A diffusion-driven Bayesian exploration framework that expands posterior support via entropy-regularized sampling and covariance-scaled diffusion. Includes a Metropolis-Hastings check to validate proposals and keep inference adaptive to unexpected evidence.

Result: Empirical evaluations on realistic hazardous-gas localization tasks show the approach matches reinforcement learning and planning baselines when priors are correct, substantially outperforms classical SMC perturbations and RL-based methods under misalignment, and provides theoretical guarantees that DEPF resolves S-PSI while maintaining statistical rigor.

Conclusion: The proposed diffusion-driven Bayesian exploration framework enables principled, real-time correction of early state estimation errors in high-stakes applications, overcoming the fundamental limitations of traditional particle filters while maintaining statistical rigor.

Abstract: In emergency response and other high-stakes societal applications, early-stage state estimates critically shape downstream outcomes. Yet, these initial state estimates-often based on limited or biased information-can be severely misaligned with reality, constraining subsequent actions and potentially causing catastrophic delays, resource misallocation, and human harm. Under the stationary bootstrap baseline (zero transition and no rejuvenation), bootstrap particle filters exhibit Stationarity-Induced Posterior Support Invariance (S-PSI), wherein regions excluded by the initial prior remain permanently unexplorable, making corrections impossible even when new evidence contradicts current beliefs. While classical perturbations can in principle break this lock-in, they operate in an always-on fashion and may be inefficient. To overcome this, we propose a diffusion-driven Bayesian exploration framework that enables principled, real-time correction of early state estimation errors. Our method expands posterior support via entropy-regularized sampling and covariance-scaled diffusion. A Metropolis-Hastings check validates proposals and keeps inference adaptive to unexpected evidence. Empirical evaluations on realistic hazardous-gas localization tasks show that our approach matches reinforcement learning and planning baselines when priors are correct. It substantially outperforms classical SMC perturbations and RL-based methods under misalignment, and we provide theoretical guarantees that DEPF resolves S-PSI while maintaining statistical rigor.
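
A rough sketch of the correction step as we understand it (not the authors' code; `log_lik` is an assumed user-supplied observation model): particles are diffused with covariance-scaled noise to expand posterior support, and each proposal passes a Metropolis-Hastings check so inference stays statistically grounded.

```python
# Hedged sketch of a diffusion-plus-MH particle correction step.
import numpy as np

def diffuse_and_correct(particles, log_lik, scale=1.0, rng=np.random):
    """particles: (N, d) current samples, d >= 2; log_lik: state -> float."""
    d = particles.shape[1]
    cov = np.cov(particles.T) + 1e-6 * np.eye(d)  # covariance-scaled proposal
    noise = rng.multivariate_normal(np.zeros(d), scale * cov,
                                    size=len(particles))
    proposals = particles + noise
    # Symmetric proposal => the MH ratio reduces to a likelihood ratio.
    log_ratio = np.array([log_lik(p) for p in proposals]) \
              - np.array([log_lik(p) for p in particles])
    accept = np.log(rng.uniform(size=len(particles))) < log_ratio
    particles[accept] = proposals[accept]
    return particles
```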

[341] Morphling: Fast, Fused, and Flexible GNN Training at Scale

Anubhab, Rupesh Nasre

Main category: cs.LG

TL;DR: Morphling is a domain-specific code synthesizer that compiles high-level GNN specifications into optimized implementations for CPUs, GPUs, and distributed systems, achieving 20X speedups over existing frameworks.

DetailsMotivation: Current GNN frameworks like PyG and DGL fail to address the divergent execution characteristics of graph neural networks - fusing irregular graph traversals with regular matrix operations. They rely on generic kernels with poor cache locality, excessive memory movement, and substantial intermediate allocations.

Method: Morphling compiles high-level GNN specifications into portable implementations targeting OpenMP, CUDA, and MPI. It uses a library of optimized, architecture-aware primitives and incorporates a runtime sparsity-aware execution engine that dynamically selects dense or sparse execution paths based on input feature statistics.

Result: On 11 real-world datasets, Morphling improves per-epoch training throughput by average 20X on CPUs, 19X on GPUs, and 6X in distributed settings over PyG and DGL, with peak speedups reaching 66X. Memory consumption reduced by up to 15X.

Conclusion: Specialized, architecture-aware code synthesis provides an effective and scalable path toward high-performance GNN execution across diverse parallel and distributed platforms.

Abstract: Graph Neural Networks (GNNs) present a fundamental hardware challenge by fusing irregular, memory-bound graph traversals with regular, compute-intensive dense matrix operations. While frameworks such as PyTorch Geometric (PyG) and Deep Graph Library (DGL) prioritize high-level usability, they fail to address these divergent execution characteristics. As a result, they rely on generic kernels that suffer from poor cache locality, excessive memory movement, and substantial intermediate allocations. To address these limitations, we present Morphling, a domain-specific code synthesizer designed to bridge this gap. Morphling compiles high-level GNN specifications into portable, backend-specialized implementations targeting OpenMP, CUDA, and MPI. It achieves this by instantiating a library of optimized, architecture-aware primitives tailored to each execution environment. Morphling also incorporates a runtime sparsity-aware execution engine that dynamically selects dense or sparse execution paths using input feature statistics, reducing unnecessary computation on zero-valued entries. We evaluate Morphling on eleven real-world datasets spanning diverse graph structures, feature dimensionalities, and sparsity regimes. Morphling improves per-epoch training throughput by an average of 20X on CPUs, 19X on GPUs, and 6X in distributed settings over PyG and DGL, with peak speedups reaching 66X. Morphling’s memory-efficient layouts further reduce peak memory consumption by up to 15X, enabling large-scale GNN training on commodity hardware. These findings demonstrate that specialized, architecture-aware code synthesis provides an effective and scalable path toward high-performance GNN execution across diverse parallel and distributed platforms.
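
The runtime dispatch idea reduces to a cheap statistic plus a branch. The snippet below is our simplification, not Morphling's generated code: it measures the zero fraction of the feature matrix and routes to a sparse kernel only when sparsity is high enough to pay off on the target backend.

```python
# Sketch of sparsity-aware dispatch; the threshold is illustrative.
import torch

def feature_matmul(adj, feats, sparse_threshold=0.7):
    """adj: (n, n) dense adjacency; feats: (n, f) node features."""
    zero_frac = (feats == 0).float().mean().item()
    if zero_frac > sparse_threshold:
        # Sparse path, using (F^T A^T)^T = A F so the sparse operand comes
        # first, skipping arithmetic on zero-valued feature entries.
        return torch.sparse.mm(feats.t().to_sparse(), adj.t()).t()
    return adj @ feats  # dense path: better locality when features are dense
```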

[342] Convergence for Discrete Parameter Update Schemes

Paul Wilson, Fabio Zanasi, George Constantinides

Main category: cs.LG

TL;DR: A new approach to low-precision training using discrete update rules instead of quantizing continuous updates, with convergence guarantees and empirical validation.

DetailsMotivation: Modern deep learning models require immense computational resources, motivating research into low-precision training methods to reduce computational costs.

Method: Introduces discrete update rules as an alternative to quantizing continuous updates, establishing convergence guarantees for a general class of such discrete schemes, with a multinomial update rule as a concrete example.

Result: The approach avoids quantization of continuous updates by design and is supported by empirical evaluation, showing it can work effectively for training.

Conclusion: This discrete perspective opens new avenues for efficient training, particularly for models with inherently discrete structure, offering a fundamentally different approach to low-precision training.

Abstract: Modern deep learning models require immense computational resources, motivating research into low-precision training. Quantised training addresses this by representing training components in low-bit integers, but typically relies on discretising real-valued updates. We introduce an alternative approach where the update rule itself is discrete, avoiding the quantisation of continuous updates by design. We establish convergence guarantees for a general class of such discrete schemes, and present a multinomial update rule as a concrete example, supported by empirical evaluation. This perspective opens new avenues for efficient training, particularly for models with inherently discrete structure.
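
One way such a multinomial rule could look (a hedged guess at the spirit of the scheme, not the paper's exact update): spend a fixed integer budget of unit steps across coordinates, sampled with probability proportional to gradient magnitude, so every applied increment is an integer multiple of a fixed step size.

```python
# Hypothetical multinomial discrete update; call under torch.no_grad().
import torch

def multinomial_update(params: torch.Tensor, grad: torch.Tensor,
                       budget: int = 64, step: float = 2 ** -8):
    probs = grad.abs().flatten()
    probs = probs / probs.sum().clamp_min(1e-12)
    # Distribute `budget` unit steps across coordinates at random.
    counts = torch.distributions.Multinomial(budget, probs).sample()
    # Each chosen coordinate moves against its gradient by whole steps.
    params.add_(-step * counts.view_as(grad) * grad.sign())
    return params
```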

[343] QoSDiff: An Implicit Topological Embedding Learning Framework Leveraging Denoising Diffusion and Adversarial Attention for Robust QoS Prediction

Guanchen Du, Jianlong Xu, Wei Wei

Main category: cs.LG

TL;DR: QoSDiff: A novel QoS prediction framework using denoising diffusion models and adversarial interaction modules to bypass explicit graph construction and improve robustness against noise.

DetailsMotivation: Current QoS prediction methods, especially GNNs, rely on explicit user-service interaction graphs which are intractable at large scale, limit implicit topological relationship modeling, and are vulnerable to environmental noise and outliers.

Method: QoSDiff uses denoising diffusion probabilistic models to recover intrinsic latent structures from noisy initializations, plus an adversarial interaction module with bidirectional hybrid attention to capture high-order interactions and distinguish informative patterns from noise.

Result: Extensive experiments on two large-scale real-world datasets show QoSDiff significantly outperforms state-of-the-art baselines, with superior cross-dataset generalization capability and exceptional robustness against observational noise.

Conclusion: QoSDiff provides an effective alternative to explicit graph-based approaches for QoS prediction, offering better scalability, noise robustness, and modeling of complex user-service relationships without graph construction prerequisites.

Abstract: Accurate Quality of Service (QoS) prediction is fundamental to service computing, providing essential data-driven guidance for service selection and ensuring superior user experiences. However, prevalent approaches, particularly Graph Neural Networks (GNNs), heavily rely on constructing explicit user–service interaction graphs. Such reliance not only leads to the intractability of explicit graph construction in large-scale scenarios but also limits the modeling of implicit topological relationships and exacerbates susceptibility to environmental noise and outliers. To address these challenges, this paper introduces QoSDiff, a novel embedding learning framework that bypasses the prerequisite of explicit graph construction. Specifically, it leverages a denoising diffusion probabilistic model to recover intrinsic latent structures from noisy initializations. To further capture high-order interactions, we propose an adversarial interaction module that integrates a bidirectional hybrid attention mechanism. This adversarial paradigm dynamically distinguishes informative patterns from noise, enabling a dual-perspective modeling of intricate user–service associations. Extensive experiments on two large-scale real-world datasets demonstrate that QoSDiff significantly outperforms state-of-the-art baselines. Notably, the results highlight the framework’s superior cross-dataset generalization capability and exceptional robustness against observational noise.

cs.MA

[344] Distributed scalable coupled policy algorithm for networked multi-agent reinforcement learning

Pengcheng Dai, Dongming Wang, Wenwu Yu, Wei Ren

Main category: cs.MA

TL;DR: Distributed Scalable Coupled Policy (DSCP) algorithm for networked multi-agent RL with interdependent rewards and coupled policies, using neighbor-averaged Q-functions and geometric sampling for efficient distributed optimization.

DetailsMotivation: Address challenges in networked multi-agent RL where agents have interdependent rewards (depending on neighbors' states/actions) and coupled policies (parameterized by neighbors' parameters), requiring distributed coordination for collaborative optimization.

Method: Introduce neighbors’ averaged Q-function concept and derive coupled policy gradient expression. Develop DSCP algorithm with geometric 2-horizon sampling (no full Q-table needed), local neighbor interactions, and push-sum protocol for distributed parameter updates using only κ_p-hop neighbor information.

Result: Prove convergence to first-order stationary point of objective function. Simulation in robot path planning shows clear improvement over state-of-the-art methods.

Conclusion: DSCP algorithm effectively solves networked multi-agent RL with interdependent rewards and coupled policies through distributed coordination, theoretical convergence guarantees, and practical performance improvements.

Abstract: This paper studies networked multi-agent reinforcement learning (NMARL) with interdependent rewards and coupled policies. In this setting, each agent’s reward depends on its own state-action pair as well as those of its direct neighbors, and each agent’s policy is parameterized by its local parameters together with those of its $κ_{p}$-hop neighbors, with $κ_{p}\geq 1$ denoting the coupled radius. The objective of the agents is to collaboratively optimize their policies to maximize the discounted average cumulative reward. To address the challenge of interdependent policies in collaborative optimization, we introduce a novel concept termed the neighbors’ averaged $Q$-function and derive a new expression for the coupled policy gradient. Based on these theoretical foundations, we develop a distributed scalable coupled policy (DSCP) algorithm, where each agent relies only on the state-action pairs of its $κ_{p}$-hop neighbors and the rewards of its $(κ_{p}+1)$-hop neighbors. Specifically, in the DSCP algorithm, we employ a geometric 2-horizon sampling method that does not require storing a full $Q$-table to obtain an unbiased estimate of the coupled policy gradient. Moreover, each agent interacts exclusively with its direct neighbors to obtain accurate policy parameters, while maintaining local estimates of other agents’ parameters to execute its local policy and collect samples for optimization. These estimates and policy parameters are updated via a push-sum protocol, enabling distributed coordination of policy updates across the network. We prove that the joint policy produced by the proposed algorithm converges to a first-order stationary point of the objective function. Finally, the effectiveness of the DSCP algorithm is demonstrated through simulations in a robot path planning environment, showing clear improvement over state-of-the-art methods.

[345] Distributionally Robust Markov Games with Average Reward

Zachary Roch, Yue Wang

Main category: cs.MA

TL;DR: This paper studies distributionally robust Markov games with average-reward criterion, establishing theoretical foundations for Nash equilibrium existence and proposing algorithmic solutions for multi-agent decision-making under uncertainty.

DetailsMotivation: The motivation is to address multi-agent decision-making under uncertainty over extended horizons, where agents need to ensure satisfactory performance despite environment uncertainties and opponent actions. The average-reward criterion is particularly relevant for long-running environments.

Method: The authors establish theoretical connections between best-response policies and single-agent problems, derive robust Bellman equations, prove existence of stationary Nash equilibria under different settings (irreducible and weakly communicating), and propose two algorithms: Robust Nash-Iteration with convergence guarantees and a temporal-difference based algorithm.

Result: Key results include: establishing correspondence between optimal policies and robust Bellman equation solutions, proving existence of stationary Nash equilibria, developing convergent algorithms for DR-MGs, and showing that average-reward robust NE can be approximated by discounted NE with large discount factors.

Conclusion: The paper provides a comprehensive theoretical and algorithmic foundation for decision-making in complex, uncertain, and long-running multi-player environments, bridging theory and practice for distributionally robust Markov games with average-reward criterion.

Abstract: We study distributionally robust Markov games (DR-MGs) with the average-reward criterion, a framework for multi-agent decision-making under uncertainty over extended horizons. In average reward DR-MGs, agents aim to maximize their worst-case infinite-horizon average reward, to ensure satisfactory performance under environment uncertainties and opponent actions. We first establish a connection between the best-response policies and the optimal policies for the induced single-agent problems. Under a standard irreducible assumption, we derive a correspondence between the optimal policies and the solutions of the robust Bellman equation, and derive the existence of stationary Nash Equilibrium (NE) based on these results. We further study DR-MGs under the weakly communicating setting, where we construct a set-valued map and show its value is a subset of the best-response policies, convex and upper hemi-continuous, and derive the existence of NE. We then explore algorithmic solutions, by first proposing a Robust Nash-Iteration algorithm and providing convergence guarantees under some additional assumptions and a NE computing oracle. We further develop a temporal-difference based algorithm for DR-MGs, and provide convergence guarantees without any additional oracle or assumptions. Finally, we connect average-reward robust NE to discounted ones, showing that the average reward robust NE can be approximated by the discounted ones under a large discount factor. Our studies provide a comprehensive theoretical and algorithmic foundation for decision-making in complex, uncertain, and long-running multi-player environments.

cs.MM

eess.AS

[346] SyncVoice: Towards Video Dubbing with Vision-Augmented Pretrained TTS Model

Kaidi Wang, Yi He, Wenhao Guan, Weijie Wu, Hongwu Ding, Xiong Zhang, Di Wu, Meng Meng, Jian Luan, Lin Li, Qingyang Hong

Main category: eess.AS

TL;DR: SyncVoice is a vision-augmented video dubbing framework that improves speech naturalness and audio-visual synchronization, supporting cross-lingual video translation.

DetailsMotivation: Existing video dubbing methods have limitations in speech naturalness, audio-visual synchronization, and are limited to monolingual settings, creating a need for better cross-lingual video dubbing solutions.

Method: Built upon a pretrained TTS model, fine-tuned on audio-visual data with a Dual Speaker Encoder to mitigate inter-language interference in cross-lingual speech synthesis.

Result: Experimental results show SyncVoice achieves high-fidelity speech generation with strong synchronization performance in video dubbing tasks.

Conclusion: SyncVoice demonstrates potential for video dubbing applications, particularly in video translation scenarios, by addressing key limitations of existing methods.

Abstract: Video dubbing aims to generate high-fidelity speech that is precisely temporally aligned with the visual content. Existing methods still suffer from limitations in speech naturalness and audio-visual synchronization, and are limited to monolingual settings. To address these challenges, we propose SyncVoice, a vision-augmented video dubbing framework built upon a pretrained text-to-speech (TTS) model. By fine-tuning the TTS model on audio-visual data, we achieve strong audiovisual consistency. We propose a Dual Speaker Encoder to effectively mitigate inter-language interference in cross-lingual speech synthesis and explore the application of video dubbing in video translation scenarios. Experimental results show that SyncVoice achieves high-fidelity speech generation with strong synchronization performance, demonstrating its potential in video dubbing tasks.

[347] A Multi-Channel Auditory Signal Encoder with Adaptive Resolution Using Volatile Memristors

Dongxu Guo, Deepika Yadav, Patrick Foster, Spyros Stathopoulos, Mingyi Chen, Themis Prodromakis, Shiwei Wang

Main category: eess.AS

TL;DR: Hybrid CMOS-memristor auditory encoder uses volatile HfTiOx devices for adaptive-threshold asynchronous delta-modulation spike encoding, enabling onset emphasis and energy-efficient audio processing.

DetailsMotivation: To create energy-efficient neuromorphic audio front-ends that emphasize speech onsets while preserving temporal details, using hybrid CMOS-memristor technology to overcome limitations of fixed-threshold spike encoding.

Method: Combines 8-channel 130nm CMOS encoder IC with off-chip volatile HfTiOx memristor devices via switch interface. Uses spike-triggered programming to rapidly raise ADM threshold (desensitization), then exploits device volatility to passively lower threshold (resensitization). On-chip current-mirror TIA converts device current into symmetric thresholds for different encoding regimes.

Result: The adaptive system sharpens speech onsets and preserves fine temporal details that fixed-threshold baselines miss, while maintaining matched spike budget. Multi-channel spike cochleagrams show consistent onset emphasis. Demonstrates practical hybrid CMOS-memristor pathway for spike-efficient audio encoding.

Conclusion: The work establishes a viable hybrid CMOS-memristor approach for onset-salient, spike-efficient neuromorphic audio front-ends and motivates future low-power single-chip integration.

Abstract: We demonstrate and experimentally validate an end-to-end hybrid CMOS-memristor auditory encoder that realises adaptive-threshold, asynchronous delta-modulation (ADM)-based spike encoding by exploiting the inherent volatility of HfTiOx devices. A spike-triggered programming pulse rapidly raises the ADM threshold Delta (desensitisation); the device's volatility then passively lowers Delta when activity subsides (resensitisation), emphasising onsets while restoring sensitivity without static control energy. Our prototype couples an 8-channel 130 nm encoder IC to off-chip HfTiOx devices via a switch interface and an off-chip controller that monitors spike activity and issues programming events. An on-chip current-mirror transimpedance amplifier (TIA) converts device current into symmetric thresholds, enabling both sensitive and conservative encoding regimes. Evaluated with gammatone-filtered speech, the adaptive loop, at a matched spike budget, sharpens onsets and preserves fine temporal detail that a fixed-Delta baseline misses; multi-channel spike cochleagrams show the same trend. Together, these results establish a practical hybrid CMOS-memristor pathway to onset-salient, spike-efficient neuromorphic audio front-ends and motivate low-power single-chip integration.
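
To make the encoding rule concrete, below is a behavioral toy of adaptive-threshold ADM, with a simple exponential relaxation standing in for the HfTiOx device physics; all constants and names are illustrative, not from the paper.

```python
import numpy as np

# Behavioral sketch of adaptive-threshold asynchronous delta modulation:
# emit a +1/-1 spike whenever the signal departs from the last reference
# by more than the threshold. On each spike the threshold is bumped up
# (desensitisation); between spikes it relaxes back toward its resting
# value (a stand-in for volatility-driven resensitisation).

def adaptive_adm(signal, delta0=0.05, bump=0.02, tau=200.0):
    delta, ref = delta0, signal[0]
    spikes = []                          # (sample index, polarity) events
    for n, x in enumerate(signal[1:], start=1):
        if x - ref > delta:
            spikes.append((n, +1)); ref = x; delta += bump
        elif ref - x > delta:
            spikes.append((n, -1)); ref = x; delta += bump
        # passive relaxation of the threshold toward delta0
        delta = delta0 + (delta - delta0) * np.exp(-1.0 / tau)
    return spikes

t = np.linspace(0, 1, 2000)
tone = 0.5 * np.sin(2 * np.pi * 8 * t) * (t > 0.25)  # delayed onset
print(len(adaptive_adm(tone)), "spike events")
```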

[348] Speech World Model: Causal State-Action Planning with Explicit Reasoning for Speech

Xuanru Zhou, Jiachen Lian, Henry Hong, Xinyi Yang, Gopala Anumanchipalli

Main category: eess.AS

TL;DR: A modular speech-language model that explicitly reasons about speech states and actions using a causal graph framework, enabling interpretable analysis and counterfactual reasoning under sparse supervision.

DetailsMotivation: Current speech-language models treat speech understanding as a single black box, analyzing content well but reasoning weakly about other aspects, especially under sparse supervision. The authors argue for explicit reasoning over speech states and actions with modular and transparent decisions.

Method: Inspired by cognitive science, the system adopts a modular perspective and world model view, learning forward dynamics over latent states. Speech understanding is factorized into four modules that communicate through a causal graph, establishing a cognitive state search space. An instruction-tuned language model uses posterior traces from this space to produce causal analysis and user-facing responses.

Result: The paper presents the first graph-based modular speech model for explicit reasoning, enabling counterfactual interventions and interpretability under partial supervision. The authors will open source the model and data.

Conclusion: This modular approach provides transparent reasoning about speech states and actions, addressing limitations of current black-box speech-language models and promoting advanced speech understanding through interpretable, causally-aware systems.

Abstract: Current speech-language models (SLMs) typically use a cascade of a speech encoder and a large language model, treating speech understanding as a single black box. They analyze the content of speech well but reason weakly about other aspects, especially under sparse supervision. Thus, we argue for explicit reasoning over speech states and actions with modular and transparent decisions. Inspired by cognitive science, we adopt a modular perspective and a world-model view in which the system learns forward dynamics over latent states. We factorize speech understanding into four modules that communicate through a causal graph, establishing a cognitive state search space. Guided by posterior traces from this space, an instruction-tuned language model produces a concise causal analysis and a user-facing response, enabling counterfactual interventions and interpretability under partial supervision. We present the first graph-based modular speech model for explicit reasoning, and we will open-source the model and data to promote the development of advanced speech understanding.

eess.IV

[349] Two-Stage Camera Calibration Method for Multi-Camera Systems Using Scene Geometry

Aleksandr Abramov

Main category: eess.IV

TL;DR: A novel two-stage calibration method for multi-camera systems that works with static images, using operator annotations of natural geometric primitives and interactive EFOV polygon manipulation, eliminating need for floor plans, physical access, or synchronized video.

DetailsMotivation: Traditional multi-camera calibration methods fail in real-world conditions where accurate floor plans are unavailable, physical access for calibration patterns is impossible, or synchronized video streams don't exist. There's a need for practical calibration solutions for deployment in challenging environments.

Method: Two-stage approach: 1) Partial calibration using operator annotations of natural geometric primitives (parallel/perpendicular/vertical lines, equal-length segments) to estimate roll, pitch, focal length and project EFOV onto horizontal plane. 2) Interactive manipulation of projected EFOV polygons (position, scale, rotation) aligned with floor plan or virtual calibration elements to determine extrinsic parameters (position, yaw).

Result: Method implemented as practical web service requiring only static images from each camera. Comparative analysis and demonstration videos confirm applicability, accuracy, and flexibility for previously infeasible scenarios.

Conclusion: The method enables deployment of precise multi-camera tracking systems in challenging real-world conditions without floor plans, physical access, or synchronized video, overcoming limitations of traditional calibration approaches.

Abstract: Calibration of multi-camera systems is a key task for accurate object tracking. However, it remains a challenging problem in real-world conditions, where traditional methods are not applicable due to the lack of accurate floor plans, physical access to place calibration patterns, or synchronized video streams. This paper presents a novel two-stage calibration method that overcomes these limitations. In the first stage, partial calibration of individual cameras is performed based on an operator’s annotation of natural geometric primitives (parallel, perpendicular, and vertical lines, or line segments of equal length). This allows estimating key parameters (roll, pitch, focal length) and projecting the camera’s Effective Field of View (EFOV) onto the horizontal plane in a base 3D coordinate system. In the second stage, precise system calibration is achieved through interactive manipulation of the projected EFOV polygons. The operator adjusts their position, scale, and rotation to align them with the floor plan or, in its absence, using virtual calibration elements projected onto all cameras in the system. This determines the remaining extrinsic parameters (camera position and yaw). Calibration requires only a static image from each camera, eliminating the need for physical access or synchronized video. The method is implemented as a practical web service. Comparative analysis and demonstration videos confirm the method’s applicability, accuracy, and flexibility, enabling the deployment of precise multi-camera tracking systems in scenarios previously considered infeasible.
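
The summary does not give the stage-one estimator itself. As a hedged illustration, one classical way to recover focal length from annotations of two mutually perpendicular line bundles is via their vanishing points, assuming square pixels and a principal point at the image centre; function and variable names are ours.

```python
import numpy as np

# Classical sub-step behind "partial calibration from geometric
# primitives": with the principal point c known, two vanishing points
# v1, v2 of mutually perpendicular line bundles satisfy
#   (v1 - c) . (v2 - c) = -f^2,
# which yields the focal length f directly.

def focal_from_vanishing_points(v1, v2, c):
    """v1, v2, c: 2D pixel coordinates (vanishing points, image centre)."""
    d = np.dot(np.asarray(v1) - c, np.asarray(v2) - c)
    if d >= 0:
        raise ValueError("vanishing points inconsistent with orthogonality")
    return np.sqrt(-d)

# With f known, back-projecting the vertical vanishing point to a ray
# gives the camera's up direction, from which roll and pitch follow.
f = focal_from_vanishing_points([1500.0, 540.0], [-300.0, 600.0],
                                np.array([960.0, 540.0]))
print(f"estimated focal length: {f:.1f} px")
```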

[350] CATNUS: Coordinate-Aware Thalamic Nuclei Segmentation Using T1-Weighted MRI

Anqi Feng, Zhangxing Bian, Samuel W. Remedios, Savannah P. Hays, Blake E. Dewey, Alexa Colinco, Jiachen Zhuo, Dan Benjamini, Jerry L. Prince

Main category: eess.IV

TL;DR: CATNUS is a deep learning method for accurate thalamic nuclei segmentation using coordinate-aware 3D U-Net, showing superior performance and generalization across diverse MRI sequences and scanners.

DetailsMotivation: Thalamic nuclei segmentation is crucial for understanding brain function and neurological disorders, but remains challenging due to small nucleus size, limited contrast, resolution constraints, and anatomical variability between subjects.

Method: CATNUS uses a 3D U-Net architecture enhanced with coordinate convolution layers for better localization of both large and small nuclei. It provides pre-trained models for T1 maps, MPRAGE, and FGATIR sequences.

Result: CATNUS outperformed FreeSurfer, THOMAS, and HIPS-THOMAS in segmentation accuracy and test-retest reliability. It demonstrated strong generalization across multiple scanners, field strengths, and vendors in traveling-subject datasets.

Conclusion: CATNUS provides an accurate, generalizable solution for thalamic nuclei segmentation with potential to facilitate large-scale neuroimaging studies and support clinical assessment in diverse real-world settings.

Abstract: Accurate segmentation of thalamic nuclei from magnetic resonance images is important due to the distinct roles of these nuclei in overall brain function and to their differential involvement in neurological and psychiatric disorders. However, segmentation remains challenging given the small size of many nuclei, limited intrathalamic contrast and image resolution, and inter-subject anatomical variability. In this work, we present CATNUS (Coordinate-Aware Thalamic Nuclei Segmentation), segmenting 13 thalamic nuclei (or nuclear groups) using a 3D U-Net architecture enhanced with coordinate convolution layers, which provide more precise localization of both large and small nuclei. To support broad clinical applicability, we provide pre-trained model variants that can operate on quantitative T1 maps as well as on widely used magnetization-prepared rapid gradient echo (MPRAGE) and fast gray matter acquisition T1 inversion recovery (FGATIR) sequences. We benchmarked CATNUS against established methods, including FreeSurfer, THOMAS and HIPS-THOMAS, demonstrating improved segmentation accuracy and robust test-retest reliability across multiple nuclei. Furthermore, CATNUS demonstrated strong out-of-distribution generalization on traveling-subject datasets spanning multiple scanners, field strengths, and vendors, producing reliable and anatomically coherent segmentations across diverse acquisition conditions. Overall, CATNUS provides an accurate and generalizable solution for thalamic nuclei segmentation, with strong potential to facilitate large-scale neuroimaging studies and support real-world clinical assessment.
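
The coordinate convolution idea follows CoordConv (Liu et al., 2018). A minimal 3D variant might look like the sketch below; channel counts and placement within the U-Net are illustrative, not CATNUS's actual configuration.

```python
import torch
import torch.nn as nn

# CoordConv-style 3D layer: append three normalized coordinate channels
# (one per spatial axis) before the convolution, so the filter can
# condition on absolute position, which helps localize small nuclei.

class CoordConv3d(nn.Module):
    def __init__(self, in_ch, out_ch, **kw):
        super().__init__()
        self.conv = nn.Conv3d(in_ch + 3, out_ch, **kw)  # +3 coord channels

    def forward(self, x):
        b, _, d, h, w = x.shape
        zs = torch.linspace(-1, 1, d, device=x.device)
        ys = torch.linspace(-1, 1, h, device=x.device)
        xs = torch.linspace(-1, 1, w, device=x.device)
        zz, yy, xx = torch.meshgrid(zs, ys, xs, indexing="ij")
        coords = torch.stack([zz, yy, xx]).expand(b, -1, -1, -1, -1)
        return self.conv(torch.cat([x, coords], dim=1))

layer = CoordConv3d(1, 8, kernel_size=3, padding=1)
print(layer(torch.randn(2, 1, 16, 16, 16)).shape)  # (2, 8, 16, 16, 16)
```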

[351] Image Semantic Communication with Quadtree Partition-based Coding

Yinhuan Huang, Zhijin Qin

Main category: eess.IV

TL;DR: Quad-DeepSC: A novel image semantic communication system using quadtree partition-based joint semantic-channel coding that outperforms traditional communication systems while maintaining low complexity and latency.

DetailsMotivation: Existing image DeepSC methods struggle with balancing rate-distortion performance and computational complexity, often performing worse than traditional schemes, especially on high-resolution datasets.

Method: Proposes Quad-DeepSC with quadtree partition-based entropy estimation and feature coding modules combined with lightweight feature extraction/reconstruction networks. Uses a two-stage training: first trains Quad-LIC (learned image codec) for compression, then fine-tunes end-to-end over wireless channels.

Result: Quad-DeepSC is the first DeepSC system to surpass conventional systems (VTM source coding + optimal MCS under 3GPP standards) across datasets of varying resolutions. Both Quad-DeepSC and Quad-LIC exhibit minimal latency suitable for real-time deployment.

Conclusion: The proposed Quad-DeepSC successfully addresses limitations of existing image DeepSC methods by achieving superior performance over traditional systems while maintaining low complexity, making it practical for real-time wireless communication applications.

Abstract: Deep learning based semantic communication (DeepSC) system has emerged as a promising paradigm for efficient wireless transmission. However, existing image DeepSC methods frequently encounter challenges in balancing rate-distortion performance and computational complexity, and often exhibit inferior performance compared to traditional schemes, especially on high-resolution datasets. To address these limitations, we propose a novel image DeepSC system, using quadtree partition-based joint semantic-channel coding, named Quad-DeepSC, which maintains low complexity while achieving state-of-the-art transmission performance. Based on maturing learned image compression technologies, we establish a unified DeepSC system design and training pipeline. The proposed Quad-DeepSC integrates quadtree partition-based entropy estimation and feature coding modules with lightweight feature extraction and reconstruction networks to form an end-to-end architecture. During training, all components except the feature coding modules are jointly optimized as a compact learned image codec, Quad-LIC, for source compression tasks. The pretrained Quad-LIC is then embedded into Quad-DeepSC and fine-tuned end-to-end over wireless channels. Extensive experimental results demonstrate that Quad-DeepSC is the first DeepSC system to surpass conventional communication systems, which employ VTM for source coding and adopt the optimal MCS index under 3GPP standards for channel coding and digital modulation, in performance across datasets of varying resolutions. Notably, both Quad-DeepSC and Quad-LIC exhibit minimal latency, rendering them well-suited for deployment in real-time wireless communication systems.
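
The learned partition itself is not described in the summary. As a toy of the underlying data structure, a variance-driven quadtree split of a 2D block looks like the sketch below; the split criterion and thresholds are illustrative, whereas Quad-DeepSC learns the partition jointly with entropy estimation.

```python
import numpy as np

# Recursively split a 2D block into four quadrants until each block is
# "simple enough" (here: low variance) or a minimum size is reached.
# Flat regions end up as a few large leaves, detailed regions as many
# small ones, concentrating coding effort where content is complex.

def quadtree(block, y=0, x=0, min_size=4, var_thresh=0.01):
    h, w = block.shape
    if h <= min_size or w <= min_size or block.var() < var_thresh:
        return [(y, x, h, w)]                 # leaf: one coding unit
    h2, w2 = h // 2, w // 2
    leaves = []
    for dy, dx in [(0, 0), (0, w2), (h2, 0), (h2, w2)]:
        sub = block[dy:dy + (h2 if dy == 0 else h - h2),
                    dx:dx + (w2 if dx == 0 else w - w2)]
        leaves += quadtree(sub, y + dy, x + dx, min_size, var_thresh)
    return leaves

img = np.zeros((32, 32)); img[8:24, 8:24] = np.random.rand(16, 16)
print(len(quadtree(img)), "leaf blocks")      # fewer blocks in flat regions
```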

[352] General and Domain-Specific Zero-shot Detection of Generated Images via Conditional Likelihood

Roy Betser, Omer Hofman, Roman Vainshtein, Guy Gilboa

Main category: eess.IV

TL;DR: CLIDE is a zero-shot detection method for AI-generated images that uses conditional likelihood approximation to adapt across diverse image domains, outperforming existing methods especially on domain-specific cases like artistic images.

DetailsMotivation: Existing zero-shot detection methods struggle to adapt to specific image domains (like artistic images), limiting real-world applicability. Maintaining updated datasets for supervised methods is time-consuming, creating need for robust zero-shot approaches.

Method: CLIDE uses conditional likelihood approximation, computing likelihoods conditioned on real images to enable adaptation across diverse image domains without requiring domain-specific training data.

Result: State-of-the-art performance on large-scale general datasets and significantly outperforms existing methods in domain-specific cases, demonstrating robustness and domain-aware generalization.

Conclusion: CLIDE addresses the domain adaptation challenge in AI-generated image detection, showing the importance of broad, domain-aware generalization for practical zero-shot detection systems.

Abstract: The rapid advancement of generative models, particularly diffusion-based methods, has significantly improved the realism of synthetic images. As new generative models continuously emerge, detecting generated images remains a critical challenge. While fully supervised and few-shot methods have been proposed, maintaining an updated dataset is time-consuming and challenging. Consequently, zero-shot methods have gained increasing attention in recent years. We find that existing zero-shot methods often struggle to adapt to specific image domains, such as artistic images, limiting their real-world applicability. In this work, we introduce CLIDE, a novel zero-shot detection method based on conditional likelihood approximation. Our approach computes likelihoods conditioned on real images, enabling adaptation across diverse image domains. We extensively evaluate CLIDE, demonstrating state-of-the-art performance on a large-scale general dataset, and significantly outperform existing methods in domain-specific cases. These results demonstrate the robustness of our method and underscore the need for broad, domain-aware generalization for the AI-generated image detection task. Code is available at https://github.com/FujitsuResearch/domain_adaptive_image_detection.

[353] Convergent Primal-Dual Plug-and-Play Image Restoration: A General Algorithm and Applications

Yodai Suzuki, Ryosuke Isono, Shunsuke Ono

Main category: eess.IV

TL;DR: A deep plug-and-play algorithm with theoretical convergence guarantees is proposed, combining PnP with primal-dual splitting to handle non-quadratic data-fidelity terms and constraints in image restoration.

DetailsMotivation: Existing PnP methods lack theoretical convergence guarantees under realistic assumptions, have inconsistent behavior, and often require high computational costs for handling non-quadratic data-fidelity terms and constraints common in image restoration.

Method: Integrates PnP paradigm with primal-dual splitting (PDS), an efficient proximal splitting methodology, to develop a general convergent PnP framework with established theoretical convergence conditions under reasonable assumptions.

Result: The proposed PnP algorithm solves a more general monotone inclusion problem rather than standard convex optimization, with mathematical representation of the solution set. It efficiently handles broad class of image restoration problems with guaranteed theoretical convergence, validated by numerical experiments.

Conclusion: The approach provides a practical and effective PnP framework with theoretical convergence guarantees for image restoration, addressing limitations of existing ad-hoc PnP methods while maintaining computational efficiency.

Abstract: We propose a general deep plug-and-play (PnP) algorithm with a theoretical convergence guarantee. PnP strategies have demonstrated outstanding performance in various image restoration tasks by exploiting the powerful priors underlying Gaussian denoisers. However, existing PnP methods often lack theoretical convergence guarantees under realistic assumptions due to their ad-hoc nature, resulting in inconsistent behavior. Moreover, even when convergence guarantees are provided, they are typically designed for specific settings or require a considerable computational cost in handling non-quadratic data-fidelity terms and additional constraints, which are key components in many image restoration scenarios. To tackle these challenges, we integrate the PnP paradigm with primal-dual splitting (PDS), an efficient proximal splitting methodology for solving a wide range of convex optimization problems, and develop a general convergent PnP framework. Specifically, we establish theoretical conditions for the convergence of the proposed PnP algorithm under a reasonable assumption. Furthermore, we show that the problem solved by the proposed PnP algorithm is not a standard convex optimization problem but a more general monotone inclusion problem, where we provide a mathematical representation of the solution set. Our approach efficiently handles a broad class of image restoration problems with guaranteed theoretical convergence. Numerical experiments on specific image restoration tasks validate the practicality and effectiveness of our theoretical results.
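
For orientation, a generic PnP primal-dual iteration in the Chambolle-Pock form, with the regularizer's proximal operator replaced by a denoiser, can be sketched as follows. The operator A, the toy denoiser, and all step sizes are stand-ins, not the paper's construction.

```python
import numpy as np

def prox_gstar(v, b, sigma):
    # prox of sigma * g* for the data term g(u) = 0.5 * ||u - b||^2,
    # obtained in closed form via the Moreau identity
    return (v - sigma * b) / (1.0 + sigma)

def denoise(x):
    # placeholder "denoiser": a three-tap moving average; a learned
    # Gaussian denoiser would be plugged in here instead
    return np.convolve(x, np.ones(3) / 3, mode="same")

def pnp_pds(A, b, n, tau=0.3, sigma=0.3, iters=300):
    x, y = np.zeros(n), np.zeros(A.shape[0])
    for _ in range(iters):
        x_new = denoise(x - tau * A.T @ y)            # PnP primal step
        y = prox_gstar(y + sigma * A @ (2 * x_new - x), b, sigma)
        x = x_new
    return x

rng = np.random.default_rng(0)
A = rng.standard_normal((40, 64)) / np.sqrt(40)       # toy measurement operator
x_true = np.clip(np.sin(np.linspace(0, 4, 64)), 0, None)
x_hat = pnp_pds(A, A @ x_true, 64)
print("relative error:", np.linalg.norm(x_hat - x_true) / np.linalg.norm(x_true))
```

The step sizes here respect the usual PDS condition tau * sigma * ||A||^2 < 1 for this toy operator; the paper's contribution is precisely the conditions under which such an iteration provably converges when the prox is replaced by a denoiser.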

[354] A robot-assisted pipeline to rapidly scan 1.7 million historical aerial photographs

Sheila Masson, Alan Potts, Allan Williams, Steve Berggreen, Kevin McLaren, Sam Martin, Eugenio Noda, Nicklas Nordfors, Nic Ruecroft, Hannah Druckenmiller, Solomon Hsiang, Andreas Madestam, Anna Tompsett

Main category: eess.IV

TL;DR: Researchers developed a robot-assisted pipeline that increases scanning productivity 30-fold, enabling digitization of 1.7 million historical aerial photographs from 65 countries.

DetailsMotivation: Hundreds of millions of historical aerial photographs from the 20th century remain inaccessible in physical archives. Manual digitization is slow and expensive, preventing access to this valuable visual record of environmental and social changes.

Method: Developed a novel robot-assisted scanning pipeline that automates the digitization process, dramatically increasing worker productivity compared to manual methods.

Result: The pipeline achieved a 30-fold increase in scanning productivity and was successfully applied to digitize 1.7 million historical aerial photographs from 65 countries.

Conclusion: Automated scanning technology can make at-scale digitization feasible, unlocking the visual record of the 20th century for the digital era and revolutionizing access to historical aerial imagery.

Abstract: During the 20th Century, aerial surveys captured hundreds of millions of high-resolution photographs of the earth’s surface. These images, the precursors to modern satellite imagery, represent an extraordinary visual record of the environmental and social upheavals of the 20th Century. However, most of these images currently languish in physical archives where retrieval is difficult and costly. Digitization could revolutionize access, but manual scanning is slow and expensive. Automated scanning could make at-scale digitization feasible, unlocking this visual record of the 20th Century for the digital era. Here, we describe and validate a novel robot-assisted pipeline that increases worker productivity in scanning 30-fold, applied at scale to digitize an archive of 1.7 million historical aerial photographs from 65 countries.

[355] LymphAtlas- A Unified Multimodal Lymphoma Imaging Repository Delivering AI-Enhanced Diagnostic Insight

Jiajun Ding, Beiyao Zhu, Xiaosheng Liu, Lishen Zhang, Zhao Liu

Main category: eess.IV

TL;DR: Researchers created a 3D multimodal segmentation dataset for lymphoma using whole-body FDG PET/CT scans, addressing the lack of standardized datasets in hematological malignancies. The dataset includes 483 scans from 220 patients and supports accurate deep learning-based segmentation.

DetailsMotivation: To bridge the gap of lacking standardized multimodal segmentation datasets in hematological malignancies, particularly for lymphoma, which hinders the development of accurate automated segmentation and quantitative analysis tools for PET/CT imaging.

Method: Retrospective collection of 483 whole-body FDG PET/CT examination datasets from 220 patients (106 non-Hodgkin lymphoma, 42 Hodgkin lymphoma) between 2011-2024. Data underwent ethical review, de-identification, and preservation of complete 3D structural information. The dataset was constructed in nnUNet format with systematic validation of preprocessing, annotation quality, and automatic segmentation algorithms.

Result: The deep learning model trained on this dataset achieved accurate segmentation of lymphoma lesions in PET/CT images with high accuracy, good robustness, and reproducibility. The dataset enables deep fusion of PET/CT images, improving characterization of tumor morphology, location, and metabolic features.

Conclusion: This dataset provides solid data support for early diagnosis, clinical staging, personalized treatment, and promotes development of automated image segmentation and precision medicine based on deep learning. The dataset and resources are publicly available.

Abstract: This study integrates PET metabolic information with CT anatomical structures to establish a 3D multimodal segmentation dataset for lymphoma based on whole-body FDG PET/CT examinations, addressing the lack of standardised multimodal segmentation datasets in the field of haematological malignancies. We retrospectively collected 483 examination datasets acquired between March 2011 and May 2024, involving 220 patients (106 non-Hodgkin lymphoma, 42 Hodgkin lymphoma); all data underwent ethical review and were rigorously de-identified. Complete 3D structural information was preserved during data acquisition, preprocessing and annotation, and a high-quality dataset was constructed based on the nnUNet format. By systematic technical validation and evaluation of the preprocessing process, annotation quality and automatic segmentation algorithm, the deep learning model trained based on this dataset is verified to achieve accurate segmentation of lymphoma lesions in PET/CT images with high accuracy, good robustness and reproducibility, which proves the applicability and stability of this dataset in accurate segmentation and quantitative analysis. The deep fusion of PET/CT images achieved with this dataset not only significantly improves the accurate portrayal of the morphology, location and metabolic features of tumour lesions, but also provides solid data support for early diagnosis, clinical staging and personalized treatment, and promotes the development of automated image segmentation and precision medicine based on deep learning. The dataset and related resources are available at https://github.com/SuperD0122/LymphAtlas-.

[356] Stochastic Orthogonal Regularization for deep projective priors

Ali Joundi, Yann Traonmilin, Alasdair Newson

Main category: eess.IV

TL;DR: The paper proposes a stochastic orthogonal regularization method for training deep neural network projections (deep projective priors) to improve convergence speed and robustness of generalized projected gradient descent algorithms for solving inverse problems in imaging.

DetailsMotivation: Many image processing and computer vision tasks are formulated as inverse problems requiring fast and robust algorithms. While neural networks can provide effective projections onto complex data manifolds (deep projective priors), there's little guarantee that these learned projections satisfy the theoretical conditions needed for linear convergence in projected gradient descent methods.

Method: The authors propose a stochastic orthogonal regularization of the training loss for deep projective priors. This regularization is designed to ensure that the learned projections sufficiently approximate orthogonal projections, which theoretically guarantees linear stable recovery. The method is tested with two types of deep projective priors: autoencoders and denoising networks.

Result: Experimental results show that the proposed stochastic orthogonal regularization yields projections that improve convergence speed and robustness of generalized projected gradient descent in challenging inverse problem settings, consistent with theoretical predictions.

Conclusion: The stochastic orthogonal regularization method enables deep projective priors to better approximate orthogonal projections, leading to improved convergence properties and performance in solving imaging inverse problems, bridging the gap between theoretical guarantees and practical neural network implementations.

Abstract: Many crucial tasks of image processing and computer vision are formulated as inverse problems. Thus, it is of great importance to design fast and robust algorithms to solve these problems. In this paper, we focus on generalized projected gradient descent (GPGD) algorithms where generalized projections are realized with learned neural networks and provide state-of-the-art results for imaging inverse problems. Indeed, neural networks allow for projections onto unknown low-dimensional sets that model complex data, such as images. We call these projections deep projective priors. In generic settings, when the orthogonal projection onto a low-dimensional model set is used, it has been shown, under a restricted isometry assumption, that the corresponding orthogonal PGD converges with a linear rate, yielding near-optimal convergence (within the class of GPGD methods) in the classical case of sparse recovery. However, for deep projective priors trained with classical mean squared error losses, there is little guarantee that the hypotheses for linear convergence are satisfied. In this paper, we propose a stochastic orthogonal regularization of the training loss for deep projective priors. This regularization is motivated by our theoretical results: a sufficiently good approximation of the orthogonal projection guarantees linear stable recovery with performance close to orthogonal PGD. We show experimentally, using two different deep projective priors (based on autoencoders and on denoising networks), that our stochastic orthogonal regularization yields projections that improve convergence speed and robustness of GPGD in challenging inverse problem settings, in accordance with our theoretical findings.
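
The paper's exact regularizer is not spelled out in this summary. As a loosely related illustration of training a network toward projection-like behavior, one can combine the usual projective-prior loss with a stochastic idempotence penalty; idempotence, P(P(x)) = P(x), is a necessary property of any projection, though not sufficient for orthogonality, and the loss below is our surrogate, not the authors' construction.

```python
import torch
import torch.nn as nn

# Hedged surrogate: train P to send noisy samples back to clean ones
# (standard projective-prior training), plus a stochastic penalty that
# pushes P toward idempotence on its own outputs.

def projective_loss(P: nn.Module, y: torch.Tensor, noise_std=0.1, lam=0.1):
    y_noisy = y + noise_std * torch.randn_like(y)   # stochastic perturbation
    p = P(y_noisy)
    mse = ((p - y) ** 2).mean()                     # project back to the data
    idem = ((P(p) - p) ** 2).mean()                 # P(P(x)) should equal P(x)
    return mse + lam * idem

P = nn.Sequential(nn.Linear(16, 8), nn.ReLU(), nn.Linear(8, 16))
y = torch.randn(32, 16)
print(projective_loss(P, y).item())
```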

[357] A multi-dynamic low-rank deep image prior (ML-DIP) for 3D real-time cardiovascular MRI

Chong Chen, Marc Vornehm, Zhenyu Bu, Preethi Chandrasekaran, Muhammad A. Sultan, Syed M. Arshad, Yingmin Liu, Yuchi Han, Rizwan Ahmad

Main category: eess.IV

TL;DR: ML-DIP enables high-quality 3D real-time cardiac MRI with >1000× acceleration by learning spatial and motion representations directly from undersampled data, without needing fully sampled training datasets.

DetailsMotivation: To develop a reconstruction framework for 3D real-time cine cardiovascular MRI that can handle highly undersampled data without requiring fully sampled training datasets, which are difficult to obtain in clinical practice.

Method: Multi-dynamic low-rank deep image prior (ML-DIP) framework uses separate neural networks to model spatial image content and deformation fields, jointly trained per scan to reconstruct dynamic image series directly from undersampled k-space data.

Result: In phantom studies: PSNR >29 dB and SSIM >0.90 for 2-minute scans, recovering cardiac/respiratory motion and PVC events. In healthy subjects: comparable functional measurements to 2D cine, better image quality than 5D-Cine, even during exercise. In PVC patients: preserved beat-to-beat variability and reconstructed irregular beats, while 5D-Cine showed artifacts.

Conclusion: ML-DIP enables high-quality 3D real-time CMR with acceleration factors exceeding 1,000 by learning low-rank spatial and motion representations from undersampled data, without relying on external fully sampled training datasets.

Abstract: Purpose: To develop a reconstruction framework for 3D real-time cine cardiovascular magnetic resonance (CMR) from highly undersampled data without requiring fully sampled training datasets. Methods: We developed a multi-dynamic low-rank deep image prior (ML-DIP) framework that models spatial image content and deformation fields using separate neural networks. These sub-networks are jointly trained per scan to reconstruct the dynamic image series directly from undersampled k-space data. ML-DIP was evaluated on (i) a 3D cine digital phantom with simulated premature ventricular contractions (PVCs), (ii) ten healthy subjects (including two scanned during both rest and exercise), and (iii) 12 patients with a history of PVCs. Phantom results were assessed using peak signal-to-noise ratio (PSNR) and structural similarity index measure (SSIM). In vivo performance was evaluated by comparing left-ventricular function quantification (against 2D real-time cine) and image quality (against 2D real-time cine and binning-based 5D-Cine). Results: In the phantom study, ML-DIP achieved PSNR > 29 dB and SSIM > 0.90 for scan times as short as two minutes, while recovering cardiac motion, respiratory motion, and PVC events. In healthy subjects, ML-DIP yielded functional measurements comparable to 2D cine and higher image quality than 5D-Cine, including during exercise with high heart rates and bulk motion. In PVC patients, ML-DIP preserved beat-to-beat variability and reconstructed irregular beats, whereas 5D-Cine showed motion artifacts and information loss due to binning. Conclusion: ML-DIP enables high-quality 3D real-time CMR with acceleration factors exceeding 1,000 by learning low-rank spatial and motion representations from undersampled data, without relying on external fully sampled training datasets.
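
Structurally, the idea of factoring a dynamic series into static content plus learned deformations can be sketched as below. This is a 2D toy with invented names; the paper operates in 3D plus time and fits the networks per scan against undersampled k-space, which is omitted here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# One network (here a bare parameter tensor) holds static spatial
# content; another maps per-frame latent codes to a dense deformation
# field; frames are rendered by warping the content with that field.

class MLDIPSketch(nn.Module):
    def __init__(self, size=32, latent=4):
        super().__init__()
        self.content = nn.Parameter(torch.randn(1, 1, size, size))
        self.deform = nn.Sequential(
            nn.Linear(latent, 64), nn.Tanh(), nn.Linear(64, 2 * size * size))
        self.size = size

    def forward(self, z):                    # z: (B, latent) per-frame codes
        b = z.shape[0]
        flow = 0.1 * self.deform(z).view(b, self.size, self.size, 2)
        ys, xs = torch.meshgrid(
            torch.linspace(-1, 1, self.size),
            torch.linspace(-1, 1, self.size), indexing="ij")
        grid = torch.stack([xs, ys], dim=-1).unsqueeze(0) + flow
        return F.grid_sample(self.content.expand(b, -1, -1, -1),
                             grid, align_corners=True)

frames = MLDIPSketch()(torch.randn(5, 4))    # five warped frames
print(frames.shape)                          # torch.Size([5, 1, 32, 32])
```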

[358] Generative MR Multitasking with complex-harmonic cardiac encoding: Bridging the gap between gated imaging and real-time imaging

Xinguo Fang, Anthony G. Christodoulou

Main category: eess.IV

TL;DR: Generative Multitasking is a unified MRI framework using CVAE with complex harmonic cardiac coordinates to bridge gated and real-time cardiac imaging in a single free-breathing, non-ECG-gated acquisition.

DetailsMotivation: To develop a unified reconstruction framework that bridges real-time and gated cardiac MRI, including quantitative MRI, eliminating the need for separate gated and real-time scans.

Method: Uses Generative Multitasking with conditional variational autoencoder (CVAE) that learns implicit neural temporal basis from sequence timings and interpretable latent space for cardiac/respiratory motion. Cardiac motion modeled as complex harmonic with phase encoding timing and latent amplitude capturing beat-to-beat variability.

Result: Enabled flexible cardiac motion representation (both gated-like and real-time-like views), removed eddy-current artifacts, and improved quantitative mapping with reduced intraseptal T1/T2 coefficients of variation (T1: 0.13 vs 0.31; T2: 0.12 vs 0.32; p<0.001).

Conclusion: Generative Multitasking unifies gated and real-time CMR within single free-breathing acquisition, provides flexible motion representation, suppresses artifacts, improves quantitative mapping, and suggests path toward combined cine, multicontrast, and quantitative imaging.

Abstract: Purpose: To develop a unified image reconstruction framework that bridges real-time and gated cardiac MRI, including quantitative MRI. Methods: We introduce Generative Multitasking, which learns an implicit neural temporal basis from sequence timings and an interpretable latent space for cardiac and respiratory motion. Cardiac motion is modeled as a complex harmonic, with phase encoding timing and a latent amplitude capturing beat-to-beat functional variability, linking cardiac phase-resolved (“gated-like”) and time-resolved (“real-time-like”) views. We implemented the framework using a conditional variational autoencoder (CVAE) and evaluated it for free-breathing, non-ECG-gated radial GRE in three settings: steady-state cine imaging, multicontrast T2prep/IR imaging, and dual-flip-angle T1/T2 mapping, compared with conventional Multitasking. Results: Generative Multitasking provided flexible cardiac motion representation, enabling reconstruction of archetypal cardiac phase-resolved cines (like gating) as well as time-resolved series that reveal beat-to-beat variability (like real-time imaging). Conditioning on the previous k-space angle and modifying this term at inference removed eddy-current artifacts without globally smoothing high temporal frequencies. For quantitative mapping, Generative Multitasking reduced intraseptal T1 and T2 coefficients of variation compared with conventional Multitasking (T1: 0.13 vs. 0.31; T2: 0.12 vs. 0.32; p<0.001), indicating higher SNR. Conclusion: Generative Multitasking uses a CVAE with complex harmonic cardiac coordinates to unify gated and real-time CMR within a single free-breathing, non-ECG-gated acquisition. It allows flexible cardiac motion representation, suppresses trajectory-dependent artifacts, and improves T1 and T2 mapping, suggesting a path toward cine, multicontrast, and quantitative imaging without separate gated and real-time scans.
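
In symbols, the cardiac coordinate described above can be read as a complex harmonic (our notation, not necessarily the paper's):

```latex
c(t) \;=\; a(t)\, e^{\,i\theta(t)},
```

where the phase $\theta(t)$ encodes position within the cardiac cycle (the gated-like axis) and the latent amplitude $a(t)$ captures beat-to-beat functional variability (the real-time-like axis).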

[359] A Fractional Variational Approach to Spectral Filtering Using the Fourier Transform

Nelson H. T. Lemes, José Claudinei Ferreira, Higor V. M. Ferreira

Main category: eess.IV

TL;DR: A variational method using fractional derivatives in frequency domain for Raman spectrum denoising, optimized via Shannon entropy.

DetailsMotivation: Fluorescence interference and noise obscure subtle spectral features in Raman analysis, making accurate analysis challenging.

Method: Minimizes functional with fractional derivatives, reformulated in frequency domain via Fourier transform, optimized using Shannon entropy for regularization parameter and derivative order.

Result: Method effectively balances noise suppression with preservation of essential chemical features (peak position, intensity, area) in simulated Raman data and images.

Conclusion: Combination of variational approach in frequency domain, fractional derivatives, and Shannon entropy optimization produces efficient, robust, and easily implementable filter.

Abstract: The interference of fluorescence signals and noise remains a significant challenge in Raman spectrum analysis, often obscuring subtle spectral features that are critical for accurate analysis. Inspired by variational methods similar to those used in image denoising, our approach minimizes a functional involving fractional derivatives to balance noise suppression with the preservation of essential chemical features of the signal, such as peak position, intensity, and area. The original problem is reformulated in the frequency domain through the Fourier transform, making the implementation simple and fast. In this work, we discuss the theoretical framework, practical implementation, and the advantages and limitations of this method in the context of simulated Raman data, as well as in image processing. The main contribution of this article is the combination of a variational approach in the frequency domain, the use of fractional derivatives, and the optimization of the regularization parameter and derivative order through the concept of Shannon entropy. This work explores how the fractional order, combined with the regularization parameter, affects noise removal and preserves the essential features of the spectrum and image. Finally, the study shows that the combination of the proposed strategies produces an efficient, robust, and easily implementable filter.
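
Assuming the standard Tikhonov-type form of the functional, J(u) = ||u - f||^2 + lam * ||D^alpha u||^2, the minimizer has a closed form in the Fourier domain, u_hat(w) = f_hat(w) / (1 + lam * |w|^(2*alpha)). The sketch below implements that filter; the paper's entropy-based selection of lam and alpha is not reproduced here, and all constants are illustrative.

```python
import numpy as np

# Closed-form frequency-domain filter for the quadratic fractional
# functional: a fractional derivative of order alpha becomes a
# multiplication by (i*w)^alpha under the Fourier transform, so the
# normal equations are solved pointwise per frequency.

def fractional_filter(f, alpha=1.5, lam=1e-3):
    w = 2 * np.pi * np.fft.fftfreq(len(f))
    u_hat = np.fft.fft(f) / (1.0 + lam * np.abs(w) ** (2 * alpha))
    return np.real(np.fft.ifft(u_hat))

x = np.linspace(0, 1, 1024)
clean = np.exp(-0.5 * ((x - 0.5) / 0.02) ** 2)        # Raman-like peak
noisy = clean + 0.05 * np.random.default_rng(0).standard_normal(x.size)
denoised = fractional_filter(noisy)
print("residual noise std:", np.std(denoised - clean))
```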

Last updated: 2025-12-19