Daily arXiv Papers - 2025-07-28

Summaries of research papers from arXiv

Today’s Research Highlights

AI-enhanced summaries of the latest research papers from arXiv.

Table of Contents

cs.CL

[1] Specification Self-Correction: Mitigating In-Context Reward Hacking Through Test-Time Refinement

Víctor Gallego

Main category: cs.CL

TL;DR: SSC is a framework for LMs to self-correct flawed specifications, reducing reward hacking by over 90%.

DetailsMotivation: Address LM exploitation of flawed specifications to achieve high scores without meeting user intent.

Method: Multi-step inference: generate, critique, revise specification, and produce a robust response.

Result: Reduced reward hacking vulnerability from 50-70% to under 10%.

Conclusion: SSC enables robust alignment without weight modification, improving LM behavior.

Abstract: Language models (LMs) are susceptible to in-context reward hacking, where they exploit flaws in tainted or faulty written specifications or rubrics to achieve high scores without fulfilling the user’s true intent. We introduce Specification Self-Correction (SSC), a novel, test-time framework that enables an LM to identify and correct flaws within its own guiding specification. SSC employs a multi-step inference process where the model first generates a response based on a potentially tainted specification, critiques its output, and then revises the specification itself to remove the exploitable loophole. A final, more robust response is then generated using this self-corrected specification. Across experiments spanning creative writing and agentic coding tasks with several LMs, we demonstrate that while models initially game tainted specifications in 50-70% of cases, the SSC process reduces this vulnerability by over 90%. This dynamic repair occurs at inference time, requires no weight modification, and leads to more robustly aligned model behavior. Code at https://github.com/vicgalle/specification-self-correction .

[2] The Role of Orthographic Consistency in Multilingual Embedding Models for Text Classification in Arabic-Script Languages

Abdulhady Abas Abdullah, Amir H. Gandomi, Tarik A Rashid, Seyedali Mirjalili, Laith Abualigah, Milena Živković, Hadi Veisi

Main category: cs.CL

TL;DR: AS-RoBERTa, a script-focused multilingual model for Arabic-script languages, outperforms general models like mBERT and XLM-RoBERTa by 2-5% in classification tasks.

DetailsMotivation: General multilingual models struggle with Arabic-script languages due to shared scripts but differing orthographic norms and cultural contexts.

Method: Developed four RoBERTa-based models (AS-RoBERTa) pre-trained on language-specific corpora, focusing on script features.

Result: AS-RoBERTa variants outperform mBERT and XLM-RoBERTa by 2-5 percentage points in classification tasks.

Conclusion: Script-aware specialization improves performance for Arabic-script languages, advocating for script-specific pre-training strategies.

Abstract: In natural language processing, multilingual models like mBERT and XLM-RoBERTa promise broad coverage but often struggle with languages that share a script yet differ in orthographic norms and cultural context. This issue is especially notable in Arabic-script languages such as Kurdish Sorani, Arabic, Persian, and Urdu. We introduce the Arabic Script RoBERTa (AS-RoBERTa) family: four RoBERTa-based models, each pre-trained on a large corpus tailored to its specific language. By focusing pre-training on language-specific script features and statistics, our models capture patterns overlooked by general-purpose models. When fine-tuned on classification tasks, AS-RoBERTa variants outperform mBERT and XLM-RoBERTa by 2 to 5 percentage points. An ablation study confirms that script-focused pre-training is central to these gains. Error analysis using confusion matrices shows how shared script traits and domain-specific content affect performance. Our results highlight the value of script-aware specialization for languages using the Arabic script and support further work on pre-training strategies rooted in script and language specificity.

[3] ylmmcl at Multilingual Text Detoxification 2025: Lexicon-Guided Detoxification and Classifier-Gated Rewriting

Nicole Lai-Lopez, Lusha Wang, Su Yuan, Liza Zhang

Main category: cs.CL

TL;DR: A multilingual text detoxification pipeline combining lexicon-guided tagging, a fine-tuned sequence-to-sequence model, and iterative classifier-based gatekeeping, achieving high performance in the PAN-2025 competition.

DetailsMotivation: To improve multilingual text detoxification by leveraging explicit toxic word annotation and cross-lingual generalization, outperforming prior unsupervised or monolingual methods.

Method: Integrates lexicon-guided tagging (using multilingual_toxic_lexicon), a fine-tuned sequence-to-sequence model (s-nlp/mt0-xl-detox-orpo), and an iterative classifier-based gatekeeping mechanism.

Result: Achieved highest STA (0.922), official J score of 0.612, and xCOMET scores of 0.793 (dev) and 0.787 (test), outperforming baselines and backtranslation methods.

Conclusion: The model shows strong generalization in high-resource languages and consistent detoxification improvements, securing ninth place in the competition.

Abstract: In this work, we introduce our solution for the Multilingual Text Detoxification Task in the PAN-2025 competition for the ylmmcl team: a robust multilingual text detoxification pipeline that integrates lexicon-guided tagging, a fine-tuned sequence-to-sequence model (s-nlp/mt0-xl-detox-orpo) and an iterative classifier-based gatekeeping mechanism. Our approach departs from prior unsupervised or monolingual pipelines by leveraging explicit toxic word annotation via the multilingual_toxic_lexicon to guide detoxification with greater precision and cross-lingual generalization. Our final model achieves the highest STA (0.922) from our previous attempts, and an average official J score of 0.612 for toxic inputs in both the development and test sets. It also achieved xCOMET scores of 0.793 (dev) and 0.787 (test). This performance outperforms baseline and backtranslation methods across multiple languages, and shows strong generalization in high-resource settings (English, Russian, French). Despite some trade-offs in SIM, the model demonstrates consistent improvements in detoxification strength. In the competition, our team achieved ninth place with a score of 0.612.

[4] Evaluating Code-Mixing in LLMs Across 18 Languages

Yilun Yang, Yekun Chai

Main category: cs.CL

TL;DR: The paper evaluates LLMs on code-mixed data across 18 languages, highlights their underperformance, and proposes a synthetic data generation method using GPT-4.

DetailsMotivation: Existing benchmarks and methods for code-mixing are limited, and research on LLMs in this context is scarce despite its importance for multilingual users.

Method: Comprehensive evaluation of LLMs on code-mixed data from 18 languages and a novel synthetic data generation approach using word substitution and GPT-4 prompting.

Result: LLMs consistently underperform on code-mixed datasets involving multiple language families.

Conclusion: Improvements in training data size, model scale, and few-shot learning could enhance LLM performance on code-mixed tasks.

Abstract: Code-mixing, the practice of switching between languages within a conversation, presents unique challenges for traditional natural language processing. Existing benchmarks, such as LinCE and GLUECoS, are limited by narrow language pairings and tasks, failing to adequately evaluate the code-mixing capabilities of large language models (LLMs). Despite the significance of code-mixing for multilingual users, research on LLMs in this context remains limited. Additionally, current methods for generating code-mixed data are underdeveloped. In this paper, we conduct a comprehensive evaluation of LLMs’ performance on code-mixed data across 18 languages from seven language families. We also propose a novel approach for generating synthetic code-mixed texts by combining word substitution with GPT-4 prompting. Our analysis reveals consistent underperformance of LLMs on code-mixed datasets involving multiple language families. We suggest that improvements in training data size, model scale, and few-shot learning could enhance their performance.

[5] CueBuddy: helping non-native English speakers navigate English-centric STEM education

Pranav Gupta

Main category: cs.CL

TL;DR: CueBuddy aids STEM students in the Global South by providing real-time lexical cues for technical English terms, addressing language barriers without disrupting lectures.

DetailsMotivation: STEM students fluent in lower-resource languages struggle with English technical terms, hindering their performance despite strong scientific prerequisites.

Method: CueBuddy uses real-time technical keyword spotting and multilingual glossary lookup to provide lexical cues during lectures.

Result: The tool helps students keep up with complex jargon without interrupting their focus.

Conclusion: CueBuddy is a promising solution, though it has limitations and potential for future enhancements.

Abstract: Students across the world in STEM classes, especially in the Global South, fall behind their peers who are more fluent in English, despite being at par with them in terms of scientific prerequisites. While many of them are able to follow everyday English at ease, key terms in English stay challenging. In most cases, such students have had most of their course prerequisites in a lower resource language. Live speech translation to lower resource languages is a promising area of research, however, models for speech translation can be too expensive on a large scale and often struggle with technical content. In this paper, we describe CueBuddy, which aims to remediate these issues by providing real-time “lexical cues” through technical keyword spotting along real-time multilingual glossary lookup to help students stay up to speed with complex English jargon without disrupting their concentration on the lecture. We also describe the limitations and future extensions of our approach.

[6] PrismRAG: Boosting RAG Factuality with Distractor Resilience and Strategized Reasoning

Mohammad Kachuee, Teja Gollapudi, Minseok Kim, Yin Huang, Kai Sun, Xiao Yang, Jiaqi Wang, Nirav Shah, Yue Liu, Aaron Colak, Anuj Kumar, Wen-tau Yih, Xin Luna Dong

Main category: cs.CL

TL;DR: PrismRAG improves RAG performance by fine-tuning with distractor-aware QA pairs and reasoning-centric habits, boosting factuality by 5.4%.

DetailsMotivation: Addressing RAG's shortcomings with confusing semi-relevant passages and the need for deep contextual reasoning.

Method: Fine-tuning framework using distractor-aware QA pairs and instilling reasoning-centric habits in LLMs.

Result: 5.4% improvement in average factuality across 12 benchmarks, outperforming state-of-the-art solutions.

Conclusion: PrismRAG effectively enhances RAG performance by integrating distractor-aware training and reasoning habits.

Abstract: Retrieval-augmented generation (RAG) often falls short when retrieved context includes confusing semi-relevant passages, or when answering questions require deep contextual understanding and reasoning. We propose an efficient fine-tuning framework, called PrismRAG, that (i) trains the model with distractor-aware QA pairs mixing gold evidence with subtle distractor passages, and (ii) instills reasoning-centric habits that make the LLM plan, rationalize, and synthesize without relying on extensive human engineered instructions. Evaluated across 12 open-book RAG QA benchmarks spanning diverse application domains and scenarios, PrismRAG improves average factuality by 5.4%, outperforming state-of-the-art solutions.

[7] LLaVA-NeuMT: Selective Layer-Neuron Modulation for Efficient Multilingual Multimodal Translation

Jingxuan Wei, Caijun Jia, Qi Chen, Yujun Cai, Linzhuang Sun, Xiangxiang Zhang, Gaowei Wu, Bihui Yu

Main category: cs.CL

TL;DR: LLaVA-NeuMT is a multimodal multilingual translation framework that improves translation quality by modeling language-specific and language-agnostic representations, using layer selection and neuron-level adaptation. It achieves SOTA results with only 40% fine-tuning.

DetailsMotivation: Existing MMT methods struggle with multilingual translation due to cross-lingual interference and inefficient parameter-sharing.

Method: Proposes LLaVA-NeuMT with layer selection and neuron-level adaptation to mitigate interference and improve efficiency.

Result: Surpasses full fine-tuning approaches, achieving SOTA results on M3-Multi30K and M3-AmbigCaps datasets.

Conclusion: LLaVA-NeuMT offers an efficient, scalable solution for cross-lingual adaptation in multimodal translation.

Abstract: Multimodal Machine Translation (MMT) enhances translation quality by incorporating visual context, helping to resolve textual ambiguities. While existing MMT methods perform well in bilingual settings, extending them to multilingual translation remains challenging due to cross-lingual interference and ineffective parameter-sharing strategies. To address this, we propose LLaVA-NeuMT, a novel multimodal multilingual translation framework that explicitly models language-specific and language-agnostic representations to mitigate multilingual interference. Our approach consists of a layer selection mechanism that identifies the most informative layers for different language pairs and a neuron-level adaptation strategy that dynamically selects language-specific and agnostic neurons to improve translation quality while reducing redundancy. We conduct extensive experiments on the M3-Multi30K and M3-AmbigCaps datasets, demonstrating that LLaVA-NeuMT, while fine-tuning only 40% of the model parameters, surpasses full fine-tuning approaches and ultimately achieves SOTA results on both datasets. Our analysis further provides insights into the importance of selected layers and neurons in multimodal multilingual adaptation, offering an efficient and scalable solution to cross-lingual adaptation in multimodal translation.

[8] MindFlow+: A Self-Evolving Agent for E-Commerce Customer Service

Ming Gong, Xucheng Huang, Ziheng Xu, Vijayan K. Asari

Main category: cs.CL

TL;DR: MindFlow+ is a self-evolving dialogue agent combining LLMs, imitation learning, and offline RL to improve e-commerce customer service. It introduces data-centric mechanisms for better tool use and goal alignment, outperforming baselines in relevance, flexibility, and accuracy.

DetailsMotivation: Traditional intent-based systems fail in dynamic, multi-turn e-commerce interactions. MindFlow+ aims to address this by leveraging advanced learning techniques for domain-specific behavior.

Method: MindFlow+ uses tool-augmented demonstration construction and reward-conditioned data modeling, combining LLMs, imitation learning, and offline RL.

Result: Outperforms baselines in contextual relevance, flexibility, and task accuracy on real-world e-commerce conversations.

Conclusion: Combining LLMs, tool reasoning, and reward-guided learning shows promise for building specialized, context-aware dialogue systems.

Abstract: High-quality dialogue is crucial for e-commerce customer service, yet traditional intent-based systems struggle with dynamic, multi-turn interactions. We present MindFlow+, a self-evolving dialogue agent that learns domain-specific behavior by combining large language models (LLMs) with imitation learning and offline reinforcement learning (RL). MindFlow+ introduces two data-centric mechanisms to guide learning: tool-augmented demonstration construction, which exposes the model to knowledge-enhanced and agentic (ReAct-style) interactions for effective tool use; and reward-conditioned data modeling, which aligns responses with task-specific goals using reward signals. To evaluate the model’s role in response generation, we introduce the AI Contribution Ratio, a novel metric quantifying AI involvement in dialogue. Experiments on real-world e-commerce conversations show that MindFlow+ outperforms strong baselines in contextual relevance, flexibility, and task accuracy. These results demonstrate the potential of combining LLMs tool reasoning, and reward-guided learning to build domain-specialized, context-aware dialogue systems.

[9] NUTMEG: Separating Signal From Noise in Annotator Disagreement

Jonathan Ivey, Susan Gauch, David Jurgens

Main category: cs.CL

TL;DR: NUTMEG, a Bayesian model, separates noisy annotations from genuine disagreements in human-labeled data, improving ground-truth recovery and downstream model performance.

DetailsMotivation: Traditional aggregation methods treat annotator disagreements as errors, ignoring genuine disagreements. NUTMEG addresses this by distinguishing noise from signal.

Method: NUTMEG uses Bayesian modeling to incorporate annotator background information, filtering noisy annotations while preserving systematic disagreements.

Result: NUTMEG outperforms traditional methods in recovering ground-truth from annotations with systematic disagreement and improves downstream model performance.

Conclusion: Accounting for annotator competence and systematic disagreements is crucial for effective training on human-labeled data.

Abstract: NLP models often rely on human-labeled data for training and evaluation. Many approaches crowdsource this data from a large number of annotators with varying skills, backgrounds, and motivations, resulting in conflicting annotations. These conflicts have traditionally been resolved by aggregation methods that assume disagreements are errors. Recent work has argued that for many tasks annotators may have genuine disagreements and that variation should be treated as signal rather than noise. However, few models separate signal and noise in annotator disagreement. In this work, we introduce NUTMEG, a new Bayesian model that incorporates information about annotator backgrounds to remove noisy annotations from human-labeled training data while preserving systematic disagreements. Using synthetic data, we show that NUTMEG is more effective at recovering ground-truth from annotations with systematic disagreement than traditional aggregation methods. We provide further analysis characterizing how differences in subpopulation sizes, rates of disagreement, and rates of spam affect the performance of our model. Finally, we demonstrate that downstream models trained on NUTMEG-aggregated data significantly outperform models trained on data from traditionally aggregation methods. Our results highlight the importance of accounting for both annotator competence and systematic disagreements when training on human-labeled data.

[10] REPRO-Bench: Can Agentic AI Systems Assess the Reproducibility of Social Science Research?

Chuxuan Hu, Liyun Zhang, Yeji Lim, Aum Wadhwani, Austin Peters, Daniel Kang

Main category: cs.CL

TL;DR: REPRO-Bench is introduced to evaluate AI agents’ ability to assess reproducibility of social science papers, addressing gaps in existing benchmarks. The best agent achieved 21.4% accuracy, improved to 71% with REPRO-Agent.

DetailsMotivation: Manual reproducibility assessment is costly; AI agents could automate this, but existing benchmarks are inadequate.

Method: REPRO-Bench includes 112 task instances (papers with reproduction reports) for end-to-end evaluation. Three AI agents were tested, and REPRO-Agent was developed to improve performance.

Result: Best agent accuracy was 21.4%, improved by 71% with REPRO-Agent.

Conclusion: Advanced AI agents are needed for real-world reproducibility assessment; REPRO-Bench is publicly available.

Abstract: Assessing the reproducibility of social science papers is essential for promoting rigor in research processes, but manual assessment is costly. With recent advances in agentic AI systems (i.e., AI agents), we seek to evaluate their capability to automate this process. However, existing benchmarks for reproducing research papers (1) focus solely on reproducing results using provided code and data without assessing their consistency with the paper, (2) oversimplify real-world scenarios, and (3) lack necessary diversity in data formats and programming languages. To address these issues, we introduce REPRO-Bench, a collection of 112 task instances, each representing a social science paper with a publicly available reproduction report. The agents are tasked with assessing the reproducibility of the paper based on the original paper PDF and the corresponding reproduction package. REPRO-Bench features end-to-end evaluation tasks on the reproducibility of social science papers with complexity comparable to real-world assessments. We evaluate three representative AI agents on REPRO-Bench, with the best-performing agent achieving an accuracy of only 21.4%. Building on our empirical analysis, we develop REPRO-Agent, which improves the highest accuracy achieved by existing agents by 71%. We conclude that more advanced AI agents should be developed to automate real-world reproducibility assessment. REPRO-Bench is publicly available at https://github.com/uiuc-kang-lab/REPRO-Bench.

[11] SLoW: Select Low-frequency Words! Automatic Dictionary Selection for Translation on Large Language Models

Hongyuan Lu, Zixuan Li, Zefan Zhang, Wai Lam

Main category: cs.CL

TL;DR: The paper introduces Automatic Dictionary Selection (ADS) to optimize dictionary use for LLM translation, proposing SLoW to select low-frequency words, saving tokens while improving performance.

DetailsMotivation: Current LLMs support limited languages, and dictionary-based methods are costly. ADS aims to balance token use and translation quality.

Method: Proposes SLoW, selecting low-frequency word dictionaries without needing training data or LLM tuning.

Result: SLoW outperforms baselines on 100 FLORES languages, saving tokens and often surpassing full dictionary performance.

Conclusion: SLoW is effective, resource-efficient, and adaptable for enhancing LLM translations without additional tuning.

Abstract: There are more than 7,000 languages around the world, and current Large Language Models (LLMs) only support hundreds of languages. Dictionary-based prompting methods can enhance translation on them, but most methods use all the available dictionaries, which could be expensive. Instead, it will be flexible to have a trade-off between token consumption and translation performance. This paper proposes a novel task called \textbf{A}utomatic \textbf{D}ictionary \textbf{S}election (\textbf{ADS}). The goal of the task is to automatically select which dictionary to use to enhance translation. We propose a novel and effective method which we call \textbf{S}elect \textbf{Lo}w-frequency \textbf{W}ords! (\textbf{SLoW}) which selects those dictionaries that have a lower frequency. Our methods have unique advantages. First, there is no need for access to the training data for frequency estimation (which is usually unavailable). Second, it inherits the advantage of dictionary-based methods, where no additional tuning is required on LLMs. Experimental results on 100 languages from FLORES indicate that SLoW surpasses strong baselines, and it can obviously save token usage, with many languages even surpassing the translation performance of the full dictionary baseline.\footnote{A shocking fact is that there is no need to use the actual training data (often unobtainable) for frequency estimation, and an estimation frequency obtained using public resources is still apparently effective in improving translation with ChatGPT and Llama, and DeepSeek.}\footnote{Code and data available upon publication.}

[12] SpeechIQ: Speech Intelligence Quotient Across Cognitive Levels in Voice Understanding Large Language Models

Zhen Wan, Chao-Han Huck Yang, Yahan Yu, Jinchuan Tian, Sheng Li, Ke Hu, Zhehuai Chen, Shinji Watanabe, Fei Cheng, Chenhui Chu, Sadao Kurohashi

Main category: cs.CL

TL;DR: SIQ is a new evaluation pipeline for voice understanding LLMs, assessing them across three cognitive levels inspired by Bloom’s Taxonomy.

DetailsMotivation: Current metrics like WER are limited; SIQ aims to provide a more comprehensive evaluation of voice understanding in LLMs.

Method: SIQ evaluates LLMs at three levels: Remembering (WER), Understanding (interpretation similarity), and Application (QA accuracy).

Result: SIQ quantifies voice understanding, compares cascaded vs. end-to-end models, identifies benchmark errors, and detects hallucinations.

Conclusion: SIQ bridges cognitive principles with voice benchmarks, revealing overlooked challenges in multi-modal training.

Abstract: We introduce Speech-based Intelligence Quotient (SIQ) as a new form of human cognition-inspired evaluation pipeline for voice understanding large language models, LLM Voice, designed to assess their voice understanding ability. Moving beyond popular voice understanding metrics such as word error rate (WER), SIQ examines LLM Voice across three cognitive levels motivated by Bloom’s Taxonomy: (1) Remembering (i.e., WER for verbatim accuracy); (2) Understanding (i.e., similarity of LLM’s interpretations); and (3) Application (i.e., QA accuracy for simulating downstream tasks). We demonstrate that SIQ not only quantifies voice understanding abilities but also provides unified comparisons between cascaded methods (e.g., ASR LLM) and end-to-end models, identifies annotation errors in existing benchmarks, and detects hallucinations in LLM Voice. Our framework represents a first-of-its-kind intelligence examination that bridges cognitive principles with voice-oriented benchmarks, while exposing overlooked challenges in multi-modal training.

[13] Large language models provide unsafe answers to patient-posed medical questions

Rachel L. Draelos, Samina Afreen, Barbara Blasko, Tiffany Brazile, Natasha Chase, Dimple Desai, Jessica Evert, Heather L. Gardner, Lauren Herrmann, Aswathy Vaikom House, Stephanie Kass, Marianne Kavan, Kirshma Khemani, Amanda Koire, Lauren M. McDonald, Zahraa Rabeeah, Amy Shah

Main category: cs.CL

TL;DR: A study evaluates the safety of four LLM chatbots (Claude, Gemini, GPT-4o, Llama3-70B) for medical advice, finding significant differences in problematic and unsafe responses.

DetailsMotivation: Concerns about patient safety due to widespread use of LLM chatbots for medical advice.

Method: Physician-led red-teaming study using the HealthAdvice dataset, evaluating 888 responses to 222 medical questions.

Result: Problematic responses ranged from 21.6% (Claude) to 43.2% (Llama), with unsafe responses from 5% (Claude) to 13% (GPT-4o, Llama).

Conclusion: Millions may receive unsafe advice; improvements are needed for clinical safety of chatbots.

Abstract: Millions of patients are already using large language model (LLM) chatbots for medical advice on a regular basis, raising patient safety concerns. This physician-led red-teaming study compares the safety of four publicly available chatbots–Claude by Anthropic, Gemini by Google, GPT-4o by OpenAI, and Llama3-70B by Meta–on a new dataset, HealthAdvice, using an evaluation framework that enables quantitative and qualitative analysis. In total, 888 chatbot responses are evaluated for 222 patient-posed advice-seeking medical questions on primary care topics spanning internal medicine, women’s health, and pediatrics. We find statistically significant differences between chatbots. The rate of problematic responses varies from 21.6 percent (Claude) to 43.2 percent (Llama), with unsafe responses varying from 5 percent (Claude) to 13 percent (GPT-4o, Llama). Qualitative results reveal chatbot responses with the potential to lead to serious patient harm. This study suggests that millions of patients could be receiving unsafe medical advice from publicly available chatbots, and further work is needed to improve the clinical safety of these powerful tools.

[14] SALM-Duplex: Efficient and Direct Duplex Modeling for Speech-to-Speech Language Model

Ke Hu, Ehsan Hosseini-Asl, Chen Chen, Edresson Casanova, Subhankar Ghosh, Piotr Żelasko, Zhehuai Chen, Jason Li, Jagadeesh Balam, Boris Ginsburg

Main category: cs.CL

TL;DR: A novel duplex speech-to-speech (S2S) architecture enables real-time adaptability, outperforming previous models in reasoning, turn-taking, and barge-in abilities while halving bitrate and simplifying training.

DetailsMotivation: Current speech models lack real-time adaptability like user barge-in, limiting intuitive human-computer interaction.

Method: Proposes a duplex S2S architecture with continuous user inputs and codec agent outputs, using a pretrained streaming encoder and separate architectures for user and agent modeling.

Result: Outperforms previous models in reasoning, turn-taking, and barge-in, reduces bitrate to 0.6 kbps, and requires less speech data by skipping pretrain.

Conclusion: The model simplifies duplex S2S model development, is openly available, and fosters reproducibility.

Abstract: Spoken dialogue is an intuitive form of human-computer interaction, yet current speech language models often remain constrained to turn-based exchanges, lacking real-time adaptability such as user barge-in. We propose a novel duplex speech to speech (S2S) architecture featuring continuous user inputs and codec agent outputs with channel fusion that directly models simultaneous user and agent streams. Using a pretrained streaming encoder for user input enables the first duplex S2S model without requiring speech pretrain. Separate architectures for agent and user modeling facilitate codec fine-tuning for better agent voices and halve the bitrate (0.6 kbps) compared to previous works. Experimental results show that the proposed model outperforms previous duplex models in reasoning, turn-taking, and barge-in abilities. The model requires significantly less speech data, as speech pretrain is skipped, which markedly simplifies the process of building a duplex S2S model from any LLMs. Finally, it is the first openly available duplex S2S model with training and inference code to foster reproducibility.

[15] A Systematic Review of Key Retrieval-Augmented Generation (RAG) Systems: Progress, Gaps, and Future Directions

Agada Joseph Oche, Ademola Glory Folashade, Tirthankar Ghosal, Arpan Biswas

Main category: cs.CL

TL;DR: The paper reviews Retrieval-Augmented Generation (RAG), highlighting its evolution, technical components, applications, challenges, and future innovations.

DetailsMotivation: RAG addresses hallucinations and outdated knowledge in parametric models by combining LLMs with retrieval systems.

Method: The review examines retrieval mechanisms, generation models, fusion strategies, and benchmarks performance on accuracy, fluency, latency, and efficiency.

Result: RAG shows rapid growth and diverse applications but faces challenges like retrieval quality, privacy, and scalability.

Conclusion: Emerging solutions like hybrid retrieval and privacy-preserving techniques promise more reliable and efficient NLP systems.

Abstract: Retrieval-Augmented Generation (RAG) represents a major advancement in natural language processing (NLP), combining large language models (LLMs) with information retrieval systems to enhance factual grounding, accuracy, and contextual relevance. This paper presents a comprehensive systematic review of RAG, tracing its evolution from early developments in open domain question answering to recent state-of-the-art implementations across diverse applications. The review begins by outlining the motivations behind RAG, particularly its ability to mitigate hallucinations and outdated knowledge in parametric models. Core technical components-retrieval mechanisms, sequence-to-sequence generation models, and fusion strategies are examined in detail. A year-by-year analysis highlights key milestones and research trends, providing insight into RAG’s rapid growth. The paper further explores the deployment of RAG in enterprise systems, addressing practical challenges related to retrieval of proprietary data, security, and scalability. A comparative evaluation of RAG implementations is conducted, benchmarking performance on retrieval accuracy, generation fluency, latency, and computational efficiency. Persistent challenges such as retrieval quality, privacy concerns, and integration overhead are critically assessed. Finally, the review highlights emerging solutions, including hybrid retrieval approaches, privacy-preserving techniques, optimized fusion strategies, and agentic RAG architectures. These innovations point toward a future of more reliable, efficient, and context-aware knowledge-intensive NLP systems.

[16] Acoustically Precise Hesitation Tagging Is Essential for End-to-End Verbatim Transcription Systems

Jhen-Ke Lin, Hao-Chien Lu, Chung-Chun Wang, Hong-Yun Lin, Berlin Chen

Main category: cs.CL

TL;DR: Fine-tuning Whisper models with precise disfluency annotations improves ASR accuracy for verbatim L2 speech transcription.

DetailsMotivation: Accurate capture of disfluencies in verbatim transcription is crucial for tasks like error analysis and feedback, but many ASR systems overlook these details.

Method: Fine-tuned Whisper models on the Speak & Improve 2025 corpus using LoRA, comparing three annotation schemes: Pure, Rich, and Extra (acoustically precise fillers).

Result: The “Extra” scheme achieved a 5.5% WER, an 11.3% relative improvement over the “Pure” scheme (6.2% WER).

Conclusion: Explicit, realistic filled-pause labeling significantly enhances ASR accuracy for verbatim L2 speech transcription.

Abstract: Verbatim transcription for automatic speaking assessment demands accurate capture of disfluencies, crucial for downstream tasks like error analysis and feedback. However, many ASR systems discard or generalize hesitations, losing important acoustic details. We fine-tune Whisper models on the Speak & Improve 2025 corpus using low-rank adaptation (LoRA), without recourse to external audio training data. We compare three annotation schemes: removing hesitations (Pure), generic tags (Rich), and acoustically precise fillers inferred by Gemini 2.0 Flash from existing audio-transcript pairs (Extra). Our challenge system achieved 6.47% WER (Pure) and 5.81% WER (Extra). Post-challenge experiments reveal that fine-tuning Whisper Large V3 Turbo with the “Extra” scheme yielded a 5.5% WER, an 11.3% relative improvement over the “Pure” scheme (6.2% WER). This demonstrates that explicit, realistic filled-pause labeling significantly enhances ASR accuracy for verbatim L2 speech transcription.

[17] Mining Contextualized Visual Associations from Images for Creativity Understanding

Ananya Sahu, Amith Ananthram, Kathleen McKeown

Main category: cs.CL

TL;DR: The paper introduces a method to mine contextualized associations for visual elements in images, enabling creative caption generation at varying abstraction levels. It produces a dataset of 1.7m creative captions for MSCOCO images, improving zero-shot retrieval in creative domains like poetry and metaphor visualization.

DetailsMotivation: Current vision-language models rely on literal alt-text, lacking shared language for creative output. This work addresses the gap by mining contextualized associations for richer, abstract captions.

Method: The method mines associations for visual elements in unlabeled datasets, generating creative captions at increasing abstraction levels. It produces a dataset of 1.7m captions for MSCOCO images.

Result: Human evaluation confirms captions are visually grounded yet abstract. Fine-tuning improves zero-shot retrieval in creative domains (poetry, metaphor visualization).

Conclusion: The work provides a scalable solution for creative caption generation, releasing datasets, code, and models for broader use.

Abstract: Understanding another person’s creative output requires a shared language of association. However, when training vision-language models such as CLIP, we rely on web-scraped datasets containing short, predominantly literal, alt-text. In this work, we introduce a method for mining contextualized associations for salient visual elements in an image that can scale to any unlabeled dataset. Given an image, we can use these mined associations to generate high quality creative captions at increasing degrees of abstraction. With our method, we produce a new dataset of visual associations and 1.7m creative captions for the images in MSCOCO. Human evaluation confirms that these captions remain visually grounded while exhibiting recognizably increasing abstraction. Moreover, fine-tuning a visual encoder on this dataset yields meaningful improvements in zero-shot image-text retrieval in two creative domains: poetry and metaphor visualization. We release our dataset, our generation code and our models for use by the broader community.

[18] GOAT-SLM: A Spoken Language Model with Paralinguistic and Speaker Characteristic Awareness

Hongjie Chen, Zehan Li, Yaodong Song, Wenming Deng, Yitong Yao, Yuxin Zhang, Hang Lv, Xuechao Zhu, Jian Kang, Jie Lian, Jie Li, Chao Wang, Shuangyong Song, Yongxiang Li, Zhongjiang He, Xuelong Li

Main category: cs.CL

TL;DR: GOAT-SLM is a spoken language model that incorporates paralinguistic and speaker characteristics, improving natural spoken interactions beyond text semantics.

DetailsMotivation: Existing models overlook paralinguistic cues like emotion, dialect, and age in speech. GOAT-SLM aims to address this gap.

Method: Uses a dual-modality head architecture and modular, staged training to align linguistic, paralinguistic, and speaker characteristics.

Result: Outperforms existing models in emotion, dialect, and age-sensitive tasks on the TELEVAL benchmark.

Conclusion: GOAT-SLM advances more natural and socially aware spoken language systems by modeling beyond linguistic content.

Abstract: Recent advances in end-to-end spoken language models (SLMs) have significantly improved the ability of AI systems to engage in natural spoken interactions. However, most existing models treat speech merely as a vehicle for linguistic content, often overlooking the rich paralinguistic and speaker characteristic cues embedded in human speech, such as dialect, age, emotion, and non-speech vocalizations. In this work, we introduce GOAT-SLM, a novel spoken language model with paralinguistic and speaker characteristic awareness, designed to extend spoken language modeling beyond text semantics. GOAT-SLM adopts a dual-modality head architecture that decouples linguistic modeling from acoustic realization, enabling robust language understanding while supporting expressive and adaptive speech generation. To enhance model efficiency and versatility, we propose a modular, staged training strategy that progressively aligns linguistic, paralinguistic, and speaker characteristic information using large-scale speech-text corpora. Experimental results on TELEVAL, a multi-dimensional evaluation benchmark, demonstrate that GOAT-SLM achieves well-balanced performance across both semantic and non-semantic tasks, and outperforms existing open-source models in handling emotion, dialectal variation, and age-sensitive interactions. This work highlights the importance of modeling beyond linguistic content and advances the development of more natural, adaptive, and socially aware spoken language systems.

[19] Uncovering Cross-Linguistic Disparities in LLMs using Sparse Autoencoders

Richmond Sin Jing Xuan, Jalil Huseynov, Yang Zhang

Main category: cs.CL

TL;DR: Multilingual LLMs show cross-linguistic generalization but underperform in medium/low-resource languages. Analysis of Gemma-2-2B reveals activation disparities, addressed via LoRA fine-tuning, improving performance while retaining English accuracy.

DetailsMotivation: To understand and mitigate performance disparities in medium/low-resource languages in multilingual LLMs.

Method: Analyzed activation patterns in Gemma-2-2B using Sparse Autoencoders (SAEs) and applied activation-aware fine-tuning via LoRA.

Result: Fine-tuning improved activations (e.g., 87.69% for Malayalam) and benchmark performance, with English retention at ~91%.

Conclusion: Activation alignment is key to enhancing multilingual LLM performance, especially for medium/low-resource languages.

Abstract: Multilingual large language models (LLMs) exhibit strong cross-linguistic generalization, yet medium to low resource languages underperform on common benchmarks such as ARC-Challenge, MMLU, and HellaSwag. We analyze activation patterns in Gemma-2-2B across all 26 residual layers and 10 languages: Chinese (zh), Russian (ru), Spanish (es), Italian (it), medium to low resource languages including Indonesian (id), Catalan (ca), Marathi (mr), Malayalam (ml), and Hindi (hi), with English (en) as the reference. Using Sparse Autoencoders (SAEs), we reveal systematic disparities in activation patterns. Medium to low resource languages receive up to 26.27 percent lower activations in early layers, with a persistent gap of 19.89 percent in deeper layers. To address this, we apply activation-aware fine-tuning via Low-Rank Adaptation (LoRA), leading to substantial activation gains, such as 87.69 percent for Malayalam and 86.32 percent for Hindi, while maintaining English retention at approximately 91 percent. After fine-tuning, benchmark results show modest but consistent improvements, highlighting activation alignment as a key factor in enhancing multilingual LLM performance.

Yongjie Li, Ruilin Nong, Jianan Liu, Lucas Evans

Main category: cs.CL

TL;DR: The paper presents an NLP-based method for automating legal document summarization to enhance judicial efficiency by extracting key information and reducing manual review burdens.

DetailsMotivation: To improve judicial efficiency by automating the summarization of legal documents, reducing errors and workload for legal professionals.

Method: Uses state-of-the-art NLP and machine learning to identify patterns in legal texts and generate precise summaries.

Result: Demonstrates high-quality summaries, faster processing, and improved operational efficiency in legal workflows.

Conclusion: Highlights the transformative potential of automation in refining judicial processes and legal workflows.

Abstract: Legal document summarization represents a significant advancement towards improving judicial efficiency through the automation of key information detection. Our approach leverages state-of-the-art natural language processing techniques to meticulously identify and extract essential data from extensive legal texts, which facilitates a more efficient review process. By employing advanced machine learning algorithms, the framework recognizes underlying patterns within judicial documents to create precise summaries that encapsulate the crucial elements. This automation alleviates the burden on legal professionals, concurrently reducing the likelihood of overlooking vital information that could lead to errors. Through comprehensive experiments conducted with actual legal datasets, we demonstrate the capability of our method to generate high-quality summaries while preserving the integrity of the original content and enhancing processing times considerably. The results reveal marked improvements in operational efficiency, allowing legal practitioners to direct their efforts toward critical analytical and decision-making activities instead of manual reviews. This research highlights promising technology-driven strategies that can significantly alter workflow dynamics within the legal sector, emphasizing the role of automation in refining judicial processes.

[21] JCAPT: A Joint Modeling Approach for CAPT

Tzu-Hsuan Yang, Yue-Yang He, Berlin Chen

Main category: cs.CL

TL;DR: A unified framework using Mamba (SSM) with phonological features and token strategies improves APA and MDD in CAPT, outperforming prior methods.

DetailsMotivation: Enhancing interpretability and fine-grained temporal reasoning in CAPT systems for better pronunciation feedback in L2 learning.

Method: Joint modeling of APA and MDD using Mamba (SSM), integrating phonological features and token strategies.

Result: Outperforms prior methods on the Speechocean762 benchmark, especially in MDD.

Conclusion: The framework successfully combines phonological attribution, SSM, and prompting, advancing CAPT performance.

Abstract: Effective pronunciation feedback is critical in second language (L2) learning, for which computer-assisted pronunciation training (CAPT) systems often encompass two key tasks: automatic pronunciation assessment (APA) and mispronunciation detection and diagnosis (MDD). Recent work has shown that joint modeling of these two tasks can yield mutual benefits. Our unified framework leverages Mamba, a selective state space model (SSM), while integrating phonological features and think token strategies to jointly enhance interpretability and fine-grained temporal reasoning in APA and MDD. To our knowledge, this is the first study to combine phonological attribution, SSM-based modeling, and prompting in CAPT. A series of experiments conducted on the speechocean762 benchmark demonstrate that our model consistently outperforms prior methods, particularly on the MDD task.

[22] A Similarity Measure for Comparing Conversational Dynamics

Sang Min Jung, Kaixiang Zhang, Cristian Danescu-Niculescu-Mizil

Main category: cs.CL

TL;DR: A method for comparing conversations based on their interactional dynamics is introduced, validated, and applied to analyze online community conversations.

DetailsMotivation: Existing methods lack robust ways to compare conversations holistically by their dynamics, which is crucial for analyzing conversational data and evaluating agents.

Method: A similarity measure for conversation dynamics is developed, validated for robustness and topic sensitivity, and applied to analyze online community interactions.

Result: The measure effectively captures conversation dynamics and reveals insights into situational power in online conversations.

Conclusion: The introduced similarity measure enhances conversational analysis and evaluation, demonstrating its utility in real-world applications.

Abstract: The quality of a conversation goes beyond the individual quality of each reply, and instead emerges from how these combine into interactional patterns that give the conversation its distinctive overall “shape”. However, there is no robust automated method for comparing conversations in terms of their overall interactional dynamics. Such methods could enhance the analysis of conversational data and help evaluate conversational agents more holistically. In this work, we introduce a similarity measure for comparing conversations with respect to their dynamics. We design a validation framework for testing the robustness of the metric in capturing differences in conversation dynamics and for assessing its sensitivity to the topic of the conversations. Finally, to illustrate the measure’s utility, we use it to analyze conversational dynamics in a large online community, bringing new insights into the role of situational power in conversations.

[23] A Toolbox, Not a Hammer – Multi-TAG: Scaling Math Reasoning with Multi-Tool Aggregation

Bohan Yao, Vikas Yadav

Main category: cs.CL

TL;DR: Multi-TAG is a framework that enhances LLMs by allowing concurrent use of multiple tools per reasoning step, improving accuracy and robustness without finetuning.

DetailsMotivation: Existing tool-augmented LLMs struggle with complex math problems requiring multi-step reasoning, prompting the need for a more robust solution.

Method: Multi-TAG enables LLMs to invoke multiple tools simultaneously at each reasoning step, aggregating their outputs for verification and refinement.

Result: Multi-TAG outperforms state-of-the-art baselines by 6.0% to 7.5% on challenging benchmarks like MATH500 and AIME.

Conclusion: Multi-TAG offers a finetuning-free, scalable solution for improving LLM performance in complex mathematical reasoning tasks.

Abstract: Augmenting large language models (LLMs) with external tools is a promising avenue for developing high-performance mathematical reasoning systems. Prior tool-augmented approaches typically finetune an LLM to select and invoke a single tool at each reasoning step and show promising results on simpler math reasoning benchmarks such as GSM8K. However, these approaches struggle with more complex math problems that require precise reasoning over multiple steps. To address this limitation, in this work, we propose Multi-TAG, a Multi-Tool AGgregation-based framework. Instead of relying on a single tool, Multi-TAG guides an LLM to concurrently invoke multiple tools at each reasoning step. It then aggregates their diverse outputs to verify and refine the reasoning process, enhancing solution robustness and accuracy. Notably, Multi-TAG is a finetuning-free, inference-only framework, making it readily applicable to any LLM backbone, including large open-weight models which are computationally expensive to finetune and proprietary frontier models which cannot be finetuned with custom recipes. We evaluate Multi-TAG on four challenging benchmarks: MATH500, AIME, AMC, and OlympiadBench. Across both open-weight and closed-source LLM backbones, Multi-TAG consistently and substantially outperforms state-of-the-art baselines, achieving average improvements of 6.0% to 7.5% over state-of-the-art baselines.

[24] Arg-LLaDA: Argument Summarization via Large Language Diffusion Models and Sufficiency-Aware Refinement

Hao Li, Yizheng Sun, Viktor Schlegel, Kailai Yang, Riza Batista-Navarro, Goran Nenadic

Main category: cs.CL

TL;DR: Arg-LLaDA is a novel framework for argument summarization that iteratively improves summaries using sufficiency-guided remasking and regeneration, outperforming existing methods.

DetailsMotivation: Addressing the underexplored generation stage in argument summarization, where existing approaches lack support for factual correction or structural refinement.

Method: Introduces Arg-LLaDA, combining a masking controller and sufficiency-checking module to iteratively revise summaries for better faithfulness, conciseness, and coherence.

Result: Outperforms state-of-the-art baselines in 7/10 automatic metrics and shows significant improvements in human evaluations for coverage, faithfulness, and conciseness.

Conclusion: Arg-LLaDA’s iterative, sufficiency-aware strategy effectively enhances argument summarization, validated by empirical and human evaluations.

Abstract: Argument summarization aims to generate concise, structured representations of complex, multi-perspective debates. While recent work has advanced the identification and clustering of argumentative components, the generation stage remains underexplored. Existing approaches typically rely on single-pass generation, offering limited support for factual correction or structural refinement. To address this gap, we introduce Arg-LLaDA, a novel large language diffusion framework that iteratively improves summaries via sufficiency-guided remasking and regeneration. Our method combines a flexible masking controller with a sufficiency-checking module to identify and revise unsupported, redundant, or incomplete spans, yielding more faithful, concise, and coherent outputs. Empirical results on two benchmark datasets demonstrate that Arg-LLaDA surpasses state-of-the-art baselines in 7 out of 10 automatic evaluation metrics. In addition, human evaluations reveal substantial improvements across core dimensions, coverage, faithfulness, and conciseness, validating the effectiveness of our iterative, sufficiency-aware generation strategy.

[25] Debating Truth: Debate-driven Claim Verification with Multiple Large Language Model Agents

Haorui He, Yupeng Li, Dacheng Wen, Reynold Cheng, Francis C. M. Lau

Main category: cs.CL

TL;DR: DebateCV is a novel claim verification framework using multi-agent LLM debates, outperforming single-LLM methods by leveraging synthetic data and multi-round argumentation.

DetailsMotivation: Single-LLM methods struggle with complex claim verification involving multi-faceted evidence, prompting the need for a debate-driven approach inspired by real-world fact-checking.

Method: DebateCV employs two Debaters with opposing stances and a Moderator for multi-round argumentation, enhanced by post-training with synthetic debate data.

Result: The framework outperforms existing methods under varying evidence quality, demonstrating effectiveness in complex claim verification.

Conclusion: DebateCV introduces a scalable, debate-driven methodology for claim verification, addressing data scarcity and improving accuracy.

Abstract: Claim verification is critical for enhancing digital literacy. However, the state-of-the-art single-LLM methods struggle with complex claim verification that involves multi-faceted evidences. Inspired by real-world fact-checking practices, we propose DebateCV, the first claim verification framework that adopts a debate-driven methodology using multiple LLM agents. In our framework, two Debaters take opposing stances on a claim and engage in multi-round argumentation, while a Moderator evaluates the arguments and renders a verdict with justifications. To further improve the performance of the Moderator, we introduce a novel post-training strategy that leverages synthetic debate data generated by the zero-shot DebateCV, effectively addressing the scarcity of real-world debate-driven claim verification data. Experimental results show that our method outperforms existing claim verification methods under varying levels of evidence quality. Our code and dataset are publicly available at https://anonymous.4open.science/r/DebateCV-6781.

[26] Objectifying the Subjective: Cognitive Biases in Topic Interpretations

Swapnil Hingmire, Ze Shi Li, Shiyu, Zeng, Ahmed Musa Awon, Luiz Franciscatto Guerra, Neil Ernst

Main category: cs.CL

TL;DR: The paper proposes a theory of topic interpretation based on cognitive heuristics, emphasizing the need for user-centric evaluation frameworks.

DetailsMotivation: Current topic quality measures (e.g., coherence) don't assess how well topics aid corpus exploration, prompting a user-focused approach.

Method: User studies are conducted to evaluate topic quality constructs, with reflexive thematic analysis applied to user rationales.

Result: Users interpret topics using availability and representativeness heuristics, anchoring on salient words and adjusting semantically.

Conclusion: A cognitive biases-aware framework is needed for topic interpretation, viewing it as a judgment under uncertainty by ecologically rational users.

Abstract: Interpretation of topics is crucial for their downstream applications. State-of-the-art evaluation measures of topic quality such as coherence and word intrusion do not measure how much a topic facilitates the exploration of a corpus. To design evaluation measures grounded on a task, and a population of users, we do user studies to understand how users interpret topics. We propose constructs of topic quality and ask users to assess them in the context of a topic and provide rationale behind evaluations. We use reflexive thematic analysis to identify themes of topic interpretations from rationales. Users interpret topics based on availability and representativeness heuristics rather than probability. We propose a theory of topic interpretation based on the anchoring-and-adjustment heuristic: users anchor on salient words and make semantic adjustments to arrive at an interpretation. Topic interpretation can be viewed as making a judgment under uncertainty by an ecologically rational user, and hence cognitive biases aware user models and evaluation frameworks are needed.

[27] An Empirical Investigation of Gender Stereotype Representation in Large Language Models: The Italian Case

Gioele Giachino, Marco Rondina, Antonio Vetrò, Riccardo Coppola, Juan Carlos De Martin

Main category: cs.CL

TL;DR: The study examines gender and professional bias in LLM responses to ungendered prompts, revealing stereotypes in outputs, especially in Italian.

DetailsMotivation: Concerns about LLMs perpetuating stereotypes and biases, particularly in non-English languages like Italian, drive this research.

Method: Structured experiments with 3600 responses from ChatGPT and Gemini, analyzing gendered pronouns in hierarchical job prompts.

Result: LLMs showed bias, e.g., associating ‘she’ with ‘assistant’ (Gemini 100%, ChatGPT 97%), highlighting ethical concerns.

Conclusion: Mitigation strategies are needed to prevent AI from exacerbating social inequalities; future work includes broader language and model studies.

Abstract: The increasing use of Large Language Models (LLMs) in a large variety of domains has sparked worries about how easily they can perpetuate stereotypes and contribute to the generation of biased content. With a focus on gender and professional bias, this work examines in which manner LLMs shape responses to ungendered prompts, contributing to biased outputs. This analysis uses a structured experimental method, giving different prompts involving three different professional job combinations, which are also characterized by a hierarchical relationship. This study uses Italian, a language with extensive grammatical gender differences, to highlight potential limitations in current LLMs’ ability to generate objective text in non-English languages. Two popular LLM-based chatbots are examined, namely OpenAI ChatGPT (gpt-4o-mini) and Google Gemini (gemini-1.5-flash). Through APIs, we collected a range of 3600 responses. The results highlight how content generated by LLMs can perpetuate stereotypes. For example, Gemini associated 100% (ChatGPT 97%) of ‘she’ pronouns to the ‘assistant’ rather than the ‘manager’. The presence of bias in AI-generated text can have significant implications in many fields, such as in the workplaces or in job selections, raising ethical concerns about its use. Understanding these risks is pivotal to developing mitigation strategies and assuring that AI-based systems do not increase social inequalities, but rather contribute to more equitable outcomes. Future research directions include expanding the study to additional chatbots or languages, refining prompt engineering methods or further exploiting a larger experimental base.

[28] Can Small-Scale Data Poisoning Exacerbate Dialect-Linked Biases in Large Language Models?

Chaymaa Abbas, Mariette Awad, Razane Tajeddine

Main category: cs.CL

TL;DR: The study reveals that LLMs amplify social biases, especially for AAVE inputs under data poisoning, increasing toxicity disproportionately compared to SAE. Larger models worsen this effect. GPT-4o audits highlight harmful stereotypes tied to AAVE, urging dialect-aware debiasing and responsible training.

DetailsMotivation: To investigate how dialectal variation (AAVE vs. SAE) and data poisoning affect toxicity in LLM outputs, exposing biases and their amplification with model scale.

Method: Used small- and medium-scale LLaMA models with poisoned data, measuring toxicity for AAVE and SAE inputs. Employed GPT-4o as a fairness auditor to identify harmful stereotypes.

Result: Poisoned data significantly increased toxicity for AAVE inputs, while SAE remained less affected. Larger models amplified biases more. GPT-4o detected harmful stereotypes (aggression, criminality, intellectual inferiority) tied to AAVE.

Conclusion: The study highlights the compounded impact of dialectal bias and data poisoning, calling for dialect-aware evaluation, debiasing, and ethical training protocols in LLM development.

Abstract: Despite the ongoing improvements in the design of large language models (LLMs) to foster inclusion and balanced responses, these systems remain susceptible to encoding and amplifying social biases. This study examines how dialectal variation, specifically African American Vernacular English (AAVE) versus Standard American English (SAE), interacts with data poisoning to influence toxicity in outputs. Using both small- and medium-scale LLaMA models, we show that even minimal exposure to poisoned data significantly increases toxicity for AAVE inputs, while it remains comparatively unaffected for SAE. Larger models exhibit a more significant amplification effect which suggests heightened susceptibility with scale. To further assess these disparities, we employed GPT-4o as a fairness auditor, which identified harmful stereotypical patterns disproportionately tied to AAVE inputs, including portrayals of aggression, criminality, and intellectual inferiority. These findings underscore the compounding impact of data poisoning and dialectal bias and emphasize the need for dialect-aware evaluation, targeted debiasing interventions, and socially responsible training protocols during development.

[29] How Much Do Large Language Model Cheat on Evaluation? Benchmarking Overestimation under the One-Time-Pad-Based Framework

Zi Liang, Liantong Yu, Shiyu Zhang, Qingqing Ye, Haibo Hu

Main category: cs.CL

TL;DR: The paper introduces ArxivRoll, a dynamic evaluation framework for LLMs to address overestimation issues in public benchmarks by generating private test cases and measuring contamination/bias.

DetailsMotivation: Overestimation in LLM evaluations due to benchmark contamination or imbalanced training undermines fair comparisons and realistic assessments. Existing solutions lack reproducibility, transparency, and efficiency.

Method: Proposes ArxivRoll with SCP (automated private test case generator) and Rugged Scores (metrics for contamination/bias). Uses recent ArXiv articles for dynamic, one-time evaluations every six months.

Result: Demonstrates high benchmark quality and provides systematic evaluation of current LLMs.

Conclusion: ArxivRoll offers a reproducible, transparent, and efficient solution to quantify and mitigate overestimation in LLM evaluations.

Abstract: Overestimation in evaluating large language models (LLMs) has become an increasing concern. Due to the contamination of public benchmarks or imbalanced model training, LLMs may achieve unreal evaluation results on public benchmarks, either intentionally or unintentionally, which leads to unfair comparisons among LLMs and undermines their realistic capability assessments. Existing benchmarks attempt to address these issues by keeping test cases permanently secret, mitigating contamination through human evaluation, or repeatedly collecting and constructing new samples. However, these approaches fail to ensure reproducibility, transparency, and high efficiency simultaneously. Moreover, the extent of overestimation in current LLMs remains unquantified. To address these issues, we propose ArxivRoll, a dynamic evaluation framework inspired by one-time pad encryption in cryptography. ArxivRoll comprises two key components: \emph{i) SCP (Sequencing, Cloze, and Prediction)}, an automated generator for private test cases, and \emph{ii) Rugged Scores (RS)}, metrics that measure the proportion of public benchmark contamination and training bias. Leveraging SCP, ArxivRoll constructs a new benchmark every six months using recent articles from ArXiv and employs them for one-time evaluations of LLM performance. Extensive experiments demonstrate the high quality of our benchmark, and we provide a systematic evaluation of current LLMs. The source code is available at https://github.com/liangzid/ArxivRoll/.

[30] Jailbreaking Large Language Diffusion Models: Revealing Hidden Safety Flaws in Diffusion-Based Text Generation

Yuanhe Zhang, Fangzhou Xie, Zhenhong Zhou, Zherui Li, Hao Chen, Kun Wang, Yufei Guo

Main category: cs.CL

TL;DR: LLDMs match LLMs in performance but are faster and better at math. Existing jailbreak methods for LLMs don’t work well on LLDMs, raising safety concerns. The paper introduces PAD, a jailbreak method for LLDMs, achieving 97% success and highlighting risks.

DetailsMotivation: To expose safety vulnerabilities in LLDMs, as existing jailbreak methods for LLMs are ineffective due to architectural differences.

Method: Introduces PAD (Parallel Decoding jailbreak) with Multi-Point Attention Attack to exploit LLDMs’ parallel generative processes.

Result: PAD achieves a 97% jailbreak success rate, showing LLDMs are more vulnerable and faster at harmful generation than LLMs.

Conclusion: LLDMs have significant safety risks; PAD reveals vulnerabilities, urging better defenses for secure deployment.

Abstract: Large Language Diffusion Models (LLDMs) exhibit comparable performance to LLMs while offering distinct advantages in inference speed and mathematical reasoning tasks.The precise and rapid generation capabilities of LLDMs amplify concerns of harmful generations, while existing jailbreak methodologies designed for Large Language Models (LLMs) prove limited effectiveness against LLDMs and fail to expose safety vulnerabilities.Successful defense cannot definitively resolve harmful generation concerns, as it remains unclear whether LLDMs possess safety robustness or existing attacks are incompatible with diffusion-based architectures.To address this, we first reveal the vulnerability of LLDMs to jailbreak and demonstrate that attack failure in LLDMs stems from fundamental architectural differences.We present a PArallel Decoding jailbreak (PAD) for diffusion-based language models. PAD introduces Multi-Point Attention Attack, which guides parallel generative processes toward harmful outputs that inspired by affirmative response patterns in LLMs. Experimental evaluations across four LLDMs demonstrate that PAD achieves jailbreak attack success rates by 97%, revealing significant safety vulnerabilities. Furthermore, compared to autoregressive LLMs of the same size, LLDMs increase the harmful generation speed by 2x, significantly highlighting risks of uncontrolled misuse.Through comprehensive analysis, we provide an investigation into LLDM architecture, offering critical insights for the secure deployment of diffusion-based language models.

[31] Identifying Fine-grained Forms of Populism in Political Discourse: A Case Study on Donald Trump’s Presidential Campaigns

Ilias Chalkidis, Stephanie Brandl, Paris Aslanidis

Main category: cs.CL

TL;DR: The paper explores LLMs’ ability to classify nuanced populist discourse, finding limitations and better performance with fine-tuned models like RoBERTa, tested on Trump’s speeches and European politicians.

DetailsMotivation: To assess LLMs' grasp of complex social science concepts like populism, which is underexplored despite their broad capabilities.

Method: Curated datasets for populist discourse, evaluated pre-trained LLMs across prompting paradigms, and fine-tuned models like RoBERTa for comparison.

Result: Fine-tuned RoBERTa outperformed instruction-tuned LLMs unless fine-tuned. LLMs showed robustness on out-of-domain data in European political speeches.

Conclusion: LLMs struggle with nuanced populist discourse without fine-tuning, but instruction-tuned models generalize better across contexts.

Abstract: Large Language Models (LLMs) have demonstrated remarkable capabilities across a wide range of instruction-following tasks, yet their grasp of nuanced social science concepts remains underexplored. This paper examines whether LLMs can identify and classify fine-grained forms of populism, a complex and contested concept in both academic and media debates. To this end, we curate and release novel datasets specifically designed to capture populist discourse. We evaluate a range of pre-trained (large) language models, both open-weight and proprietary, across multiple prompting paradigms. Our analysis reveals notable variation in performance, highlighting the limitations of LLMs in detecting populist discourse. We find that a fine-tuned RoBERTa classifier vastly outperforms all new-era instruction-tuned LLMs, unless fine-tuned. Additionally, we apply our best-performing model to analyze campaign speeches by Donald Trump, extracting valuable insights into his strategic use of populist rhetoric. Finally, we assess the generalizability of these models by benchmarking them on campaign speeches by European politicians, offering a lens into cross-context transferability in political discourse analysis. In this setting, we find that instruction-tuned LLMs exhibit greater robustness on out-of-domain data.

[32] AutoPCR: Automated Phenotype Concept Recognition by Prompting

Yicheng Tao, Yuanhao Huang, Jie Liu

Main category: cs.CL

TL;DR: AutoPCR is a prompt-based phenotype concept recognition method that outperforms existing methods without requiring ontology-specific training.

DetailsMotivation: Existing phenotype CR methods struggle with generalization across diverse text types and evolving biomedical terminology, necessitating a more robust solution.

Method: AutoPCR uses a three-stage process: hybrid entity extraction, candidate retrieval via SapBERT, and entity linking through prompting a large language model.

Result: AutoPCR achieves top performance on four benchmark datasets, showing robustness in mention-level and document-level evaluations.

Conclusion: AutoPCR demonstrates strong inductive capability and generalizability to new ontologies, surpassing prior state-of-the-art methods.

Abstract: Phenotype concept recognition (CR) is a fundamental task in biomedical text mining, enabling applications such as clinical diagnostics and knowledge graph construction. However, existing methods often require ontology-specific training and struggle to generalize across diverse text types and evolving biomedical terminology. We present AutoPCR, a prompt-based phenotype CR method that does not require ontology-specific training. AutoPCR performs CR in three stages: entity extraction using a hybrid of rule-based and neural tagging strategies, candidate retrieval via SapBERT, and entity linking through prompting a large language model. Experiments on four benchmark datasets show that AutoPCR achieves the best average and most robust performance across both mention-level and document-level evaluations, surpassing prior state-of-the-art methods. Further ablation and transfer studies demonstrate its inductive capability and generalizability to new ontologies.

[33] Smooth Reading: Bridging the Gap of Recurrent LLM to Self-Attention LLM on Long-Context Tasks

Kai Liu, Zhan Su, Peijie Dong, Fengran Mo, Jianfei Gao, ShaoTing Zhang, Kai Chen

Main category: cs.CL

TL;DR: Smooth Reading, a chunk-wise inference method, improves Recurrent LLMs’ performance on long-context tasks, nearly matching Self-Attention LLMs while maintaining efficiency.

DetailsMotivation: Recurrent LLMs underperform on long-context tasks due to fixed-size memory limitations, despite their linear computational complexity advantage over Self-Attention LLMs.

Method: Proposes Smooth Reading, a chunk-wise inference method inspired by human reading, which processes context in chunks and iteratively summarizes information.

Result: Smooth Reading reduces the performance gap, boosting a Recurrent LLM to outperform Self-Attention LLMs by 3.61% on LongBench while being 3x faster in training and 2x faster in inference.

Conclusion: Smooth Reading is the first method to achieve comparable performance for Recurrent LLMs on long-context tasks, offering efficiency and performance benefits.

Abstract: Recently, recurrent large language models (Recurrent LLMs) with linear computational complexity have re-emerged as efficient alternatives to self-attention-based LLMs (Self-Attention LLMs), which have quadratic complexity. However, Recurrent LLMs often underperform on long-context tasks due to their limited fixed-size memory. Previous research has primarily focused on enhancing the memory capacity of Recurrent LLMs through architectural innovations, but these approaches have not yet enabled Recurrent LLMs to match the performance of Self-Attention LLMs on long-context tasks. We argue that this limitation arises because processing the entire context at once is not well-suited for Recurrent LLMs. In this paper, we propose Smooth Reading, a chunk-wise inference method inspired by human reading strategies. Smooth Reading processes context in chunks and iteratively summarizes the contextual information, thereby reducing memory demands and making the approach more compatible with Recurrent LLMs. Our experimental results show that this method substantially narrows the performance gap between Recurrent and Self-Attention LLMs on long-context tasks, while preserving the efficiency advantages of Recurrent LLMs. Our Smooth Reading boosts SWA-3B-4k (a Recurrent LLM) from 5.68% lower to 3.61% higher performance than Self-Attention LLMs on LongBench. Besides, our method maintains the high efficiency, training 3x faster and inferring 2x faster at 64k context compared to Self-Attention LLMs. To our knowledge, this is the first work to achieve comparable performance using Recurrent LLMs compared with Self-Attention LLMs on long-context tasks. We hope our method will inspire future research in this area. To facilitate further progress, we will release code and dataset.

[34] Enhancing Speech Emotion Recognition Leveraging Aligning Timestamps of ASR Transcripts and Speaker Diarization

Hsuan-Yu Wang, Pei-Ying Lee, Berlin Chen

Main category: cs.CL

TL;DR: The paper explores how timestamp alignment between ASR and SD outputs improves SER accuracy, introducing a pipeline for synchronization and a multimodal fusion method.

DetailsMotivation: Misalignment between ASR and SD outputs reduces reliability in multimodal emotion recognition, especially in conversations.

Method: An alignment pipeline synchronizes timestamps, combining RoBERTa text embeddings and Wav2Vec audio embeddings with cross-attention fusion and a gating mechanism.

Result: Precise alignment on IEMOCAP improves SER accuracy, outperforming unsynchronized baselines.

Conclusion: Temporal alignment is crucial for enhancing emotion recognition accuracy, offering a robust foundation for multimodal analysis.

Abstract: In this paper, we investigate the impact of incorporating timestamp-based alignment between Automatic Speech Recognition (ASR) transcripts and Speaker Diarization (SD) outputs on Speech Emotion Recognition (SER) accuracy. Misalignment between these two modalities often reduces the reliability of multimodal emotion recognition systems, particularly in conversational contexts. To address this issue, we introduce an alignment pipeline utilizing pre-trained ASR and speaker diarization models, systematically synchronizing timestamps to generate accurately labeled speaker segments. Our multimodal approach combines textual embeddings extracted via RoBERTa with audio embeddings from Wav2Vec, leveraging cross-attention fusion enhanced by a gating mechanism. Experimental evaluations on the IEMOCAP benchmark dataset demonstrate that precise timestamp alignment improves SER accuracy, outperforming baseline methods that lack synchronization. The results highlight the critical importance of temporal alignment, demonstrating its effectiveness in enhancing overall emotion recognition accuracy and providing a foundation for robust multimodal emotion analysis.

[35] Data Augmentation for Spoken Grammatical Error Correction

Penny Karanasou, Mengjie Qian, Stefano Bannò, Mark J. F. Gales, Kate M. Knill

Main category: cs.CL

TL;DR: The paper introduces an automated method to generate audio-text pairs with grammatical errors for Spoken GEC (SGEC) and proposes metrics to evaluate the data. The goal is to enrich datasets without altering language assessment scores.

DetailsMotivation: High-quality annotated spoken datasets for SGEC are under-resourced, limiting progress in the field.

Method: Proposes a fully automated method to generate audio-text pairs with grammatical errors and disfluencies, along with objective metrics for evaluation.

Result: Evaluated on the S&I Corpus, the augmented dataset maintains textual and acoustic characteristics while introducing new errors.

Conclusion: The method successfully augments datasets for both written GEC and SGEC, enhancing resources without compromising assessment scores.

Abstract: While there exist strong benchmark datasets for grammatical error correction (GEC), high-quality annotated spoken datasets for Spoken GEC (SGEC) are still under-resourced. In this paper, we propose a fully automated method to generate audio-text pairs with grammatical errors and disfluencies. Moreover, we propose a series of objective metrics that can be used to evaluate the generated data and choose the more suitable dataset for SGEC. The goal is to generate an augmented dataset that maintains the textual and acoustic characteristics of the original data while providing new types of errors. This augmented dataset should augment and enrich the original corpus without altering the language assessment scores of the second language (L2) learners. We evaluate the use of the augmented corpus both for written GEC (the text part) and for SGEC (the audio-text pairs). Our experiments are conducted on the S&I Corpus, the first publicly available speech dataset with grammar error annotations.

[36] Detection of Adverse Drug Events in Dutch clinical free text documents using Transformer Models: benchmark study

Rachel M. Murphy, Nishant Mishra, Nicolette F. de Keizer, Dave A. Dongelmans, Kitty J. Jager, Ameen Abu-Hanna, Joanna E. Klopotowska, Iacer Calixto

Main category: cs.CL

TL;DR: The study benchmarks ADE detection in Dutch clinical texts using transformer models, with MedRoBERTa.nl performing best in both internal and external validation.

DetailsMotivation: To establish a robust benchmark for detecting adverse drug events (ADEs) in Dutch clinical free-text documents using advanced NLP models.

Method: Trained Bi-LSTM and four transformer models (BERTje, RobBERT, MedRoBERTa.nl, NuNER) on annotated ICU notes for NER and RC tasks, evaluated internally and externally.

Result: MedRoBERTa.nl achieved the highest macro-averaged F1 scores (0.63 with gold standard, 0.62 with predicted entities) and recall (0.67-0.74) in external validation.

Conclusion: The study provides a clinically meaningful evaluation approach for ADE detection, emphasizing task-specific performance measures and future clinical applicability.

Abstract: In this study, we set a benchmark for adverse drug event (ADE) detection in Dutch clinical free text documents using several transformer models, clinical scenarios and fit-for-purpose performance measures. We trained a Bidirectional Long Short-Term Memory (Bi-LSTM) model and four transformer-based Dutch and/or multilingual encoder models (BERTje, RobBERT, MedRoBERTa.nl, and NuNER) for the tasks of named entity recognition (NER) and relation classification (RC) using 102 richly annotated Dutch ICU clinical progress notes. Anonymized free text clinical progress notes of patients admitted to intensive care unit (ICU) of one academic hospital and discharge letters of patients admitted to Internal Medicine wards of two non-academic hospitals were reused. We evaluated our ADE RC models internally using gold standard (two-step task) and predicted entities (end-to-end task). In addition, all models were externally validated on detecting ADEs at the document level. We report both micro- and macro-averaged F1 scores, given the imbalance of ADEs in the datasets. Although differences for the ADE RC task between the models were small, MedRoBERTa.nl was the best performing model with macro-averaged F1 score of 0.63 using gold standard and 0.62 using predicted entities. The MedRoBERTa.nl models also performed the best in our external validation and achieved recall of between 0.67 to 0.74 using predicted entities, meaning between 67 to 74% of discharge letters with ADEs were detected. Our benchmark study presents a robust and clinically meaningful approach for evaluating language models for ADE detection in clinical free text documents. Our study highlights the need to use appropriate performance measures fit for the task of ADE detection in clinical free-text documents and envisioned future clinical use.

[37] Towards Domain Specification of Embedding Models in Medicine

Mohammad Khodadad, Ali Shiraee, Mahdi Astaraki, Hamidreza Mahyar

Main category: cs.CL

TL;DR: MEDTE, a GTE model fine-tuned on diverse medical corpora, addresses limitations in medical text embeddings by improving methodology and evaluation. A new benchmark suite of 51 tasks shows its superiority over existing models.

DetailsMotivation: Existing medical text embedding models are limited by narrow training data and outdated methods, and current evaluations fail to generalize across real-world medical tasks.

Method: Leverage MEDTE, a GTE model fine-tuned on diverse medical corpora using self-supervised contrastive learning, and propose a 51-task benchmark suite tailored to medical text.

Result: MEDTE outperforms state-of-the-art models across various tasks, demonstrating robustness and improved performance.

Conclusion: The combined approach of MEDTE and a comprehensive benchmark suite provides a robust framework for medical text embeddings, addressing prior shortcomings.

Abstract: Medical text embedding models are foundational to a wide array of healthcare applications, ranging from clinical decision support and biomedical information retrieval to medical question answering, yet they remain hampered by two critical shortcomings. First, most models are trained on a narrow slice of medical and biological data, beside not being up to date in terms of methodology, making them ill suited to capture the diversity of terminology and semantics encountered in practice. Second, existing evaluations are often inadequate: even widely used benchmarks fail to generalize across the full spectrum of real world medical tasks. To address these gaps, we leverage MEDTE, a GTE model extensively fine-tuned on diverse medical corpora through self-supervised contrastive learning across multiple data sources, to deliver robust medical text embeddings. Alongside this model, we propose a comprehensive benchmark suite of 51 tasks spanning classification, clustering, pair classification, and retrieval modeled on the Massive Text Embedding Benchmark (MTEB) but tailored to the nuances of medical text. Our results demonstrate that this combined approach not only establishes a robust evaluation framework but also yields embeddings that consistently outperform state of the art alternatives in different tasks.

[38] TokenSmith: Streamlining Data Editing, Search, and Inspection for Large-Scale Language Model Training and Interpretability

Mohammad Aflah Khan, Ameya Godbole, Johnny Tian-Zheng Wei, Ryan Wang, James Flemings, Krishna Gummadi, Willie Neiswanger, Robin Jia

Main category: cs.CL

TL;DR: TokenSmith is an open-source library for interactive editing and analysis of datasets in pretraining frameworks, simplifying dataset debugging and experimentation.

DetailsMotivation: Existing workflows for understanding the relationship between training data and model behavior are cumbersome and fragmented, limiting accessibility for researchers.

Method: TokenSmith provides a modular backend and simple UI for operations like searching, viewing, editing, and sampling pretraining data, without requiring changes to training code.

Result: TokenSmith democratizes access to production-grade dataset tooling, supporting frameworks like GPT-NeoX, Megatron, and NVIDIA NeMo.

Conclusion: TokenSmith enhances pretraining workflows by making dataset inspection and editing more accessible and efficient.

Abstract: Understanding the relationship between training data and model behavior during pretraining is crucial, but existing workflows make this process cumbersome, fragmented, and often inaccessible to researchers. We present TokenSmith, an open-source library for interactive editing, inspection, and analysis of datasets used in Megatron-style pretraining frameworks such as GPT-NeoX, Megatron, and NVIDIA NeMo. TokenSmith supports a wide range of operations including searching, viewing, ingesting, exporting, inspecting, and sampling data, all accessible through a simple user interface and a modular backend. It also enables structured editing of pretraining data without requiring changes to training code, simplifying dataset debugging, validation, and experimentation. TokenSmith is designed as a plug and play addition to existing large language model pretraining workflows, thereby democratizing access to production-grade dataset tooling. TokenSmith is hosted on GitHub1, with accompanying documentation and tutorials. A demonstration video is also available on YouTube.

[39] GEPA: Reflective Prompt Evolution Can Outperform Reinforcement Learning

Lakshya A Agrawal, Shangyin Tan, Dilara Soylu, Noah Ziems, Rishi Khare, Krista Opsahl-Ong, Arnav Singhvi, Herumb Shandilya, Michael J Ryan, Meng Jiang, Christopher Potts, Koushik Sen, Alexandros G. Dimakis, Ion Stoica, Dan Klein, Matei Zaharia, Omar Khattab

Main category: cs.CL

TL;DR: GEPA, a prompt optimizer using natural language reflection, outperforms GRPO and MIPROv2 with fewer rollouts, achieving significant quality gains.

DetailsMotivation: Leverage the interpretable nature of language for richer learning in LLMs compared to sparse RL rewards.

Method: Introduces GEPA, which samples trajectories, reflects in natural language, diagnoses issues, and optimizes prompts via Pareto frontier lessons.

Result: GEPA outperforms GRPO by 10% on average (up to 20%) and MIPROv2 by over 10%, using up to 35x fewer rollouts.

Conclusion: GEPA demonstrates efficiency and effectiveness in prompt optimization and inference-time search for code tasks.

Abstract: Large language models (LLMs) are increasingly adapted to downstream tasks via reinforcement learning (RL) methods like Group Relative Policy Optimization (GRPO), which often require thousands of rollouts to learn new tasks. We argue that the interpretable nature of language can often provide a much richer learning medium for LLMs, compared with policy gradients derived from sparse, scalar rewards. To test this, we introduce GEPA (Genetic-Pareto), a prompt optimizer that thoroughly incorporates natural language reflection to learn high-level rules from trial and error. Given any AI system containing one or more LLM prompts, GEPA samples system-level trajectories (e.g., reasoning, tool calls, and tool outputs) and reflects on them in natural language to diagnose problems, propose and test prompt updates, and combine complementary lessons from the Pareto frontier of its own attempts. As a result of GEPA’s design, it can often turn even just a few rollouts into a large quality gain. Across four tasks, GEPA outperforms GRPO by 10% on average and by up to 20%, while using up to 35x fewer rollouts. GEPA also outperforms the leading prompt optimizer, MIPROv2, by over 10% across two LLMs, and demonstrates promising results as an inference-time search strategy for code optimization.

[40] Conversations Gone Awry, But Then? Evaluating Conversational Forecasting Models

Son Quoc Tran, Tushaar Gangavarapu, Nicholas Chernogor, Jonathan P. Chang, Cristian Danescu-Niculescu-Mizil

Main category: cs.CL

TL;DR: The paper introduces a uniform evaluation framework for predicting conversation derailment (CGA task) and a novel metric for assessing model adaptability.

DetailsMotivation: To improve automated systems' ability to anticipate conversation outcomes, aiding human-human interactions.

Method: Revisits the CGA task, introduces a benchmark for comparing architectures, and proposes a new metric for dynamic forecasting.

Result: Provides an updated overview of CGA model progress and enables reliable comparisons.

Conclusion: The framework and metric advance the field by standardizing evaluation and capturing model adaptability.

Abstract: We often rely on our intuition to anticipate the direction of a conversation. Endowing automated systems with similar foresight can enable them to assist human-human interactions. Recent work on developing models with this predictive capacity has focused on the Conversations Gone Awry (CGA) task: forecasting whether an ongoing conversation will derail. In this work, we revisit this task and introduce the first uniform evaluation framework, creating a benchmark that enables direct and reliable comparisons between different architectures. This allows us to present an up-to-date overview of the current progress in CGA models, in light of recent advancements in language modeling. Our framework also introduces a novel metric that captures a model’s ability to revise its forecast as the conversation progresses.

[41] Comparison of pipeline, sequence-to-sequence, and GPT models for end-to-end relation extraction: experiments with the rare disease use-case

Shashank Gupta, Xuguang Ai, Ramakanth Kavuluru

Main category: cs.CL

TL;DR: Comparison of three paradigms for end-to-end relation extraction (E2ERE) in biomedicine reveals pipeline models outperform sequence-to-sequence and GPT models, despite the latter’s larger size.

DetailsMotivation: To evaluate and compare the performance of prevailing paradigms (pipeline, sequence-to-sequence, GPT) for E2ERE in biomedicine, especially for rare diseases with complex entity structures.

Method: Used the RareDis dataset to test NER→RE pipelines, joint sequence-to-sequence models, and GPT models, employing state-of-the-art implementations and error analysis.

Result: Pipeline models performed best, followed by sequence-to-sequence models; GPT models lagged despite larger size. Discontinuous entities caused significant NER errors.

Conclusion: Conventional pipeline models are more effective for E2ERE when training data is available, though innovative methods combining their strengths with GPT models could improve performance.

Abstract: End-to-end relation extraction (E2ERE) is an important and realistic application of natural language processing (NLP) in biomedicine. In this paper, we aim to compare three prevailing paradigms for E2ERE using a complex dataset focused on rare diseases involving discontinuous and nested entities. We use the RareDis information extraction dataset to evaluate three competing approaches (for E2ERE): NER $\rightarrow$ RE pipelines, joint sequence to sequence models, and generative pre-trained transformer (GPT) models. We use comparable state-of-the-art models and best practices for each of these approaches and conduct error analyses to assess their failure modes. Our findings reveal that pipeline models are still the best, while sequence-to-sequence models are not far behind; GPT models with eight times as many parameters are worse than even sequence-to-sequence models and lose to pipeline models by over 10 F1 points. Partial matches and discontinuous entities caused many NER errors contributing to lower overall E2E performances. We also verify these findings on a second E2ERE dataset for chemical-protein interactions. Although generative LM-based methods are more suitable for zero-shot settings, when training data is available, our results show that it is better to work with more conventional models trained and tailored for E2ERE. More innovative methods are needed to marry the best of the both worlds from smaller encoder-decoder pipeline models and the larger GPT models to improve E2ERE. As of now, we see that well designed pipeline models offer substantial performance gains at a lower cost and carbon footprint for E2ERE. Our contribution is also the first to conduct E2ERE for the RareDis dataset.

[42] Spike No More: Stabilizing the Pre-training of Large Language Models

Sho Takase, Shun Kiyono, Sosuke Kobayashi, Jun Suzuki

Main category: cs.CL

TL;DR: The paper investigates loss spikes in large language model pre-training, attributing them to sudden gradient norm growth. It proposes stabilizing conditions (small sub-layers and large shortcuts) and validates them experimentally.

DetailsMotivation: Loss spikes degrade model performance and waste computational resources, necessitating methods to prevent them.

Method: Analyzes spectral norms of Jacobian matrices for sub-layers to identify conditions (small sub-layers, large shortcuts) for gradient norm stability.

Result: Methods meeting the proposed conditions effectively prevent loss spikes during pre-training.

Conclusion: Stabilizing pre-training requires controlling gradient norms via small sub-layers and large shortcuts, as validated by experiments.

Abstract: Loss spikes often occur during pre-training of large language models. The spikes degrade the performance of large language models and sometimes ruin the pre-training. Since the pre-training needs a vast computational budget, we should avoid such spikes. Based on the assumption that the loss spike is caused by the sudden growth of the gradient norm, we explore factors to keep the gradient norm small through an analysis of the spectral norms of the Jacobian matrices for the sub-layers. Our findings suggest that stabilizing the pre-training process requires two conditions: small sub-layers and large shortcut. We conduct various experiments to empirically verify our theoretical analyses. Experimental results demonstrate that methods satisfying the conditions effectively prevent loss spikes during pre-training.

[43] How Important is Domain Specificity in Language Models and Instruction Finetuning for Biomedical Relation Extraction?

Aviv Brokman, Ramakanth Kavuluru

Main category: cs.CL

TL;DR: General-domain LMs often outperform biomedical-domain LMs in biomedical relation extraction, but biomedical instruction finetuning shows promise despite fewer instructions.

DetailsMotivation: To evaluate if domain-specific LMs and instruction finetuning improve performance in biomedical NLP tasks, specifically relation extraction.

Method: Tested existing LMs on four datasets, comparing general-domain vs. biomedical-domain models and assessing the impact of biomedical instruction finetuning.

Result: General-domain models typically outperformed biomedical-domain models, but biomedical instruction finetuning improved performance similarly to general finetuning.

Conclusion: Focusing on larger-scale biomedical instruction finetuning of general LMs may be more effective than building domain-specific biomedical LMs.

Abstract: Cutting edge techniques developed in the general NLP domain are often subsequently applied to the high-value, data-rich biomedical domain. The past few years have seen generative language models (LMs), instruction finetuning, and few-shot learning become foci of NLP research. As such, generative LMs pretrained on biomedical corpora have proliferated and biomedical instruction finetuning has been attempted as well, all with the hope that domain specificity improves performance on downstream tasks. Given the nontrivial effort in training such models, we investigate what, if any, benefits they have in the key biomedical NLP task of relation extraction. Specifically, we address two questions: (1) Do LMs trained on biomedical corpora outperform those trained on general domain corpora? (2) Do models instruction finetuned on biomedical datasets outperform those finetuned on assorted datasets or those simply pretrained? We tackle these questions using existing LMs, testing across four datasets. In a surprising result, general-domain models typically outperformed biomedical-domain models. However, biomedical instruction finetuning improved performance to a similar degree as general instruction finetuning, despite having orders of magnitude fewer instructions. Our findings suggest it may be more fruitful to focus research effort on larger-scale biomedical instruction finetuning of general LMs over building domain-specific biomedical LMs

[44] MultiSocial: Multilingual Benchmark of Machine-Generated Text Detection of Social-Media Texts

Dominik Macko, Jakub Kopal, Robert Moro, Ivan Srba

Main category: cs.CL

TL;DR: The paper introduces MultiSocial, the first multilingual and multi-platform dataset for detecting machine-generated social-media texts, addressing gaps in current research focused on English and longer texts.

DetailsMotivation: Existing machine-generated text detection research is limited to English and longer texts, neglecting short, informal social-media content. There's also a lack of multilingual benchmark datasets for this domain.

Method: The authors created MultiSocial, a dataset with 472,097 texts (58k human-written and machine-generated by 7 multilingual LLMs) across 22 languages and 5 platforms. They evaluated existing detection methods in zero-shot and fine-tuned settings.

Result: Fine-tuned detectors adapt well to social-media texts, and platform selection for training significantly impacts performance.

Conclusion: The study highlights the feasibility of detecting machine-generated social-media texts and emphasizes the importance of platform-specific training for effective detection.

Abstract: Recent LLMs are able to generate high-quality multilingual texts, indistinguishable for humans from authentic human-written ones. Research in machine-generated text detection is however mostly focused on the English language and longer texts, such as news articles, scientific papers or student essays. Social-media texts are usually much shorter and often feature informal language, grammatical errors, or distinct linguistic items (e.g., emoticons, hashtags). There is a gap in studying the ability of existing methods in detection of such texts, reflected also in the lack of existing multilingual benchmark datasets. To fill this gap we propose the first multilingual (22 languages) and multi-platform (5 social media platforms) dataset for benchmarking machine-generated text detection in the social-media domain, called MultiSocial. It contains 472,097 texts, of which about 58k are human-written and approximately the same amount is generated by each of 7 multilingual LLMs. We use this benchmark to compare existing detection methods in zero-shot as well as fine-tuned form. Our results indicate that the fine-tuned detectors have no problem to be trained on social-media texts and that the platform selection for training matters.

[45] Long-Form Answers to Visual Questions from Blind and Low Vision People

Mina Huh, Fangyuan Xu, Yi-Hao Peng, Chongyan Chen, Hansika Murugu, Danna Gurari, Eunsol Choi, Amy Pavel

Main category: cs.CL

TL;DR: VizWiz-LF dataset introduces long-form answers for visual questions from BLV users, highlighting the role of explanations and suggestions. Evaluations show generated answers often hallucinate details, prompting strategies to abstain from unanswerable questions.

DetailsMotivation: To address the need for long-form answers in visual question answering (VQA) for BLV users, providing richer context beyond simple answers.

Method: Developed VizWiz-LF dataset with 4.2k long-form answers, annotated functional roles, and evaluated with BLV and sighted people. Tested prompting strategies to reduce hallucinations.

Result: BLV users find human and generated answers plausible, but generated ones often hallucinate incorrect details, especially for unanswerable questions.

Conclusion: Long-form answers enhance VQA for BLV users, but generated answers need improvement to reduce hallucinations, with abstention strategies as a potential solution.

Abstract: Vision language models can now generate long-form answers to questions about images - long-form visual question answers (LFVQA). We contribute VizWiz-LF, a dataset of long-form answers to visual questions posed by blind and low vision (BLV) users. VizWiz-LF contains 4.2k long-form answers to 600 visual questions, collected from human expert describers and six VQA models. We develop and annotate functional roles of sentences of LFVQA and demonstrate that long-form answers contain information beyond the question answer such as explanations and suggestions. We further conduct automatic and human evaluations with BLV and sighted people to evaluate long-form answers. BLV people perceive both human-written and generated long-form answers to be plausible, but generated answers often hallucinate incorrect visual details, especially for unanswerable visual questions (e.g., blurry or irrelevant images). To reduce hallucinations, we evaluate the ability of VQA models to abstain from answering unanswerable questions across multiple prompting strategies.

[46] Advancing biomolecular understanding and design following human instructions

Xiang Zhuang, Keyan Ding, Tianwen Lyu, Yinuo Jiang, Xiaotong Li, Zhuoyi Xiang, Zeyuan Wang, Ming Qin, Kehua Feng, Jike Wang, Qiang Zhang, Huajun Chen

Main category: cs.CL

TL;DR: InstructBioMol is an AI model that bridges natural language and biomolecular design, improving drug and enzyme design with human-like understanding.

DetailsMotivation: To address the gap between AI's computational capabilities and researchers' intuitive goals in biomolecular research, particularly in using natural language.

Method: Developed InstructBioMol, a large language model for any-to-any alignment of natural language, molecules, and proteins, integrating multimodal inputs.

Result: Achieved a 10% improvement in drug binding affinity and a 70.4 score in enzyme-substrate pair prediction.

Conclusion: InstructBioMol effectively bridges natural language and biomolecular design, showing transformative potential for real-world research.

Abstract: Understanding and designing biomolecules, such as proteins and small molecules, is central to advancing drug discovery, synthetic biology and enzyme engineering. Recent breakthroughs in artificial intelligence have revolutionized biomolecular research, achieving remarkable accuracy in biomolecular prediction and design. However, a critical gap remains between artificial intelligence’s computational capabilities and researchers’ intuitive goals, particularly in using natural language to bridge complex tasks with human intentions. Large language models have shown potential to interpret human intentions, yet their application to biomolecular research remains nascent due to challenges including specialized knowledge requirements, multimodal data integration, and semantic alignment between natural language and biomolecules. To address these limitations, we present InstructBioMol, a large language model designed to bridge natural language and biomolecules through a comprehensive any-to-any alignment of natural language, molecules and proteins. This model can integrate multimodal biomolecules as the input, and enable researchers to articulate design goals in natural language, providing biomolecular outputs that meet precise biological needs. Experimental results demonstrate that InstructBioMol can understand and design biomolecules following human instructions. In particular, it can generate drug molecules with a 10% improvement in binding affinity and design enzymes that achieve an enzyme-substrate pair prediction score of 70.4. This highlights its potential to transform real-world biomolecular research. The code is available at https://github.com/HICAI-ZJU/InstructBioMol.

[47] A Comprehensive Evaluation of Semantic Relation Knowledge of Pretrained Language Models and Humans

Zhihan Cao, Hiroaki Yamada, Simone Teufel, Takenobu Tokunaga

Main category: cs.CL

TL;DR: The paper introduces a framework to evaluate PLMs’ knowledge of five semantic relations beyond hypernymy, comparing human and model performance. Results show a gap between humans and models, with antonymy as the strongest relation for models.

DetailsMotivation: To address the incomplete understanding of PLMs' semantic relation knowledge, especially beyond hypernymy, and to compare human and model performance.

Method: A comprehensive evaluation framework covering five semantic relations (hyponymy, holonymy, meronymy, antonymy, synonymy) using five metrics (soundness, completeness, symmetry, prototypicality, distinguishability). Six PLMs (four masked, two causal) were tested.

Result: Significant knowledge gap between humans and models across all relations. Causal models don’t always outperform masked models. Antonymy is the strongest relation for models.

Conclusion: PLMs lack comprehensive semantic relation knowledge compared to humans, highlighting the need for further research and model improvement.

Abstract: Recently, much work has concerned itself with the enigma of what exactly pretrained language models~(PLMs) learn about different aspects of language, and how they learn it. One stream of this type of research investigates the knowledge that PLMs have about semantic relations. However, many aspects of semantic relations were left unexplored. Generally, only one relation has been considered, namely hypernymy. Furthermore, previous work did not measure humans’ performance on the same task as that performed by the PLMs. This means that at this point in time, there is only an incomplete view of the extent of these models’ semantic relation knowledge. To address this gap, we introduce a comprehensive evaluation framework covering five relations beyond hypernymy, namely hyponymy, holonymy, meronymy, antonymy, and synonymy. We use five metrics (two newly introduced here) for recently untreated aspects of semantic relation knowledge, namely soundness, completeness, symmetry, prototypicality, and distinguishability. Using these, we can fairly compare humans and models on the same task. Our extensive experiments involve six PLMs, four masked and two causal language models. The results reveal a significant knowledge gap between humans and models for all semantic relations. In general, causal language models, despite their wide use, do not always perform significantly better than masked language models. Antonymy is the outlier relation where all models perform reasonably well.

[48] LLMs are Also Effective Embedding Models: An In-depth Overview

Chongyang Tao, Tao Shen, Shen Gao, Junshuo Zhang, Zhen Li, Kai Hua, Wenpeng Hu, Zhengwei Tao, Shuai Ma

Main category: cs.CL

TL;DR: The survey explores the shift from traditional embedding models to LLM-based embeddings, detailing methods like direct prompting and data-centric tuning, along with challenges and future directions.

DetailsMotivation: To understand and document the transition from traditional encoder-only models to decoder-only LLMs for embedding tasks, highlighting advancements and challenges.

Method: The survey reviews foundational techniques, LLM-based embedding strategies (direct prompting and data-centric tuning), and advanced methods for diverse scenarios.

Result: Provides a comprehensive overview of LLM-based embeddings, including performance factors, limitations, and challenges like efficiency-accuracy trade-offs.

Conclusion: The survey synthesizes current advancements, identifies key challenges, and offers a framework for future research to improve LLM-based embeddings.

Abstract: Large language models (LLMs) have revolutionized natural language processing by achieving state-of-the-art performance across various tasks. Recently, their effectiveness as embedding models has gained attention, marking a paradigm shift from traditional encoder-only models like ELMo and BERT to decoder-only, large-scale LLMs such as GPT, LLaMA, and Mistral. This survey provides an in-depth overview of this transition, beginning with foundational techniques before the LLM era, followed by LLM-based embedding models through two main strategies to derive embeddings from LLMs. 1) Direct prompting: We mainly discuss the prompt designs and the underlying rationale for deriving competitive embeddings. 2) Data-centric tuning: We cover extensive aspects that affect tuning an embedding model, including model architecture, training objectives, data constructions, etc. Upon the above, we also cover advanced methods for producing embeddings from longer texts, multilingual, code, cross-modal data, as well as reasoning-aware and other domain-specific scenarios. Furthermore, we discuss factors affecting choices of embedding models, such as performance/efficiency comparisons, dense vs sparse embeddings, pooling strategies, and scaling law. Lastly, the survey highlights the limitations and challenges in adapting LLMs for embeddings, including cross-task embedding quality, trade-offs between efficiency and accuracy, low-resource, long-context, data bias, robustness, etc. This survey serves as a valuable resource for researchers and practitioners by synthesizing current advancements, highlighting key challenges, and offering a comprehensive framework for future work aimed at enhancing the effectiveness and efficiency of LLMs as embedding models.

[49] Evaluation of LLM Vulnerabilities to Being Misused for Personalized Disinformation Generation

Aneta Zugecova, Dominik Macko, Ivan Srba, Robert Moro, Jakub Kopal, Katarina Marcincinova, Matus Mesarcik

Main category: cs.CL

TL;DR: The study evaluates how large language models (LLMs) can generate personalized disinformation, revealing vulnerabilities in safety filters and the need for urgent improvements.

DetailsMotivation: The misuse of LLMs for generating personalized disinformation raises concerns, yet this combination hasn't been thoroughly studied.

Method: The study tests open and closed LLMs for generating personalized disinformation, assessing safety filters, personalization quality, and detectability.

Result: LLMs lack effective safety filters; personalization reduces filter activations, acting as a jailbreak.

Conclusion: Urgent action is needed to strengthen safety measures and address vulnerabilities in LLMs.

Abstract: The capabilities of recent large language models (LLMs) to generate high-quality content indistinguishable by humans from human-written texts raises many concerns regarding their misuse. Previous research has shown that LLMs can be effectively misused for generating disinformation news articles following predefined narratives. Their capabilities to generate personalized (in various aspects) content have also been evaluated and mostly found usable. However, a combination of personalization and disinformation abilities of LLMs has not been comprehensively studied yet. Such a dangerous combination should trigger integrated safety filters of the LLMs, if there are some. This study fills this gap by evaluating vulnerabilities of recent open and closed LLMs, and their willingness to generate personalized disinformation news articles in English. We further explore whether the LLMs can reliably meta-evaluate the personalization quality and whether the personalization affects the generated-texts detectability. Our results demonstrate the need for stronger safety-filters and disclaimers, as those are not properly functioning in most of the evaluated LLMs. Additionally, our study revealed that the personalization actually reduces the safety-filter activations; thus effectively functioning as a jailbreak. Such behavior must be urgently addressed by LLM developers and service providers.

[50] T2ISafety: Benchmark for Assessing Fairness, Toxicity, and Privacy in Image Generation

Lijun Li, Zhelun Shi, Xuhao Hu, Bowen Dong, Yiran Qin, Xihui Liu, Lu Sheng, Jing Shao

Main category: cs.CL

TL;DR: T2ISafety is a benchmark for evaluating text-to-image models on toxicity, fairness, and bias, uncovering risks like racial unfairness and toxic content generation.

DetailsMotivation: Addressing the lack of comprehensive safety evaluation in text-to-image models, which can generate harmful or biased content.

Method: Developed a hierarchical taxonomy (12 tasks, 44 categories) and collected 70K prompts, annotating 68K images to train a safety evaluator.

Result: Identified persistent racial fairness issues, toxic content generation, and privacy protection inconsistencies across 12 diffusion models.

Conclusion: T2ISafety highlights critical safety gaps in T2I models, even in advanced proprietary models, and provides tools for future evaluations.

Abstract: Text-to-image (T2I) models have rapidly advanced, enabling the generation of high-quality images from text prompts across various domains. However, these models present notable safety concerns, including the risk of generating harmful, biased, or private content. Current research on assessing T2I safety remains in its early stages. While some efforts have been made to evaluate models on specific safety dimensions, many critical risks remain unexplored. To address this gap, we introduce T2ISafety, a safety benchmark that evaluates T2I models across three key domains: toxicity, fairness, and bias. We build a detailed hierarchy of 12 tasks and 44 categories based on these three domains, and meticulously collect 70K corresponding prompts. Based on this taxonomy and prompt set, we build a large-scale T2I dataset with 68K manually annotated images and train an evaluator capable of detecting critical risks that previous work has failed to identify, including risks that even ultra-large proprietary models like GPTs cannot correctly detect. We evaluate 12 prominent diffusion models on T2ISafety and reveal several concerns including persistent issues with racial fairness, a tendency to generate toxic content, and significant variation in privacy protection across the models, even with defense methods like concept erasing. Data and evaluator are released under https://github.com/adwardlee/t2i_safety.

[51] An Efficient Sparse Fine-Tuning with Low Quantization Error via Neural Network Pruning

Cen-Jhih Li, Aditya Bhaskara

Main category: cs.CL

TL;DR: A new Sparse Fine-tuning (SpFT) framework improves memory efficiency by 20-50% while matching the accuracy of state-of-the-art methods like LoRA.

DetailsMotivation: To make fine-tuning more accessible for users with limited computational budgets by enhancing memory and computational efficiency.

Method: Identifies important neurons using feature importance metrics from neural network pruning, then fine-tunes only weights involving these neurons.

Result: Achieves 20-50% better memory efficiency than existing SpFT methods while maintaining accuracy comparable to LoRA variants.

Conclusion: The proposed SpFT framework offers a practical solution for efficient fine-tuning of foundation models, balancing performance and resource constraints.

Abstract: Fine-tuning is an important step in adapting foundation models such as large language models to downstream tasks. To make this step more accessible to users with limited computational budgets, it is crucial to develop fine-tuning methods that are memory and computationally efficient. Sparse Fine-tuning (SpFT) and Low-rank adaptation (LoRA) are two frameworks that have emerged for addressing this problem and have been adopted widely in practice. In this work, we develop a new SpFT framework, based on ideas from neural network pruning. At a high level, we first identify ``important’’ neurons/nodes using feature importance metrics from network pruning (specifically, we use the structural pruning method), and then perform fine-tuning by restricting to weights involving these neurons. Experiments on common language tasks show our method improves SpFT’s memory efficiency by 20-50% while matching the accuracy of state-of-the-art methods like LoRA’s variants.

[52] Can LLMs Predict Citation Intent? An Experimental Analysis of In-context Learning and Fine-tuning on Open LLMs

Paris Koloveas, Serafeim Chatzopoulos, Thanasis Vergoulis, Christos Tryfonopoulos

Main category: cs.CL

TL;DR: Open LLMs can predict citation intent effectively using in-context learning and fine-tuning, outperforming traditional domain-specific models with minimal task-specific data.

DetailsMotivation: To explore the adaptability of general-purpose LLMs for citation intent prediction, avoiding reliance on domain-specific models like SciBERT.

Method: Evaluated 12 model variations across 5 LLM families using zero-, one-, few-, and many-shot prompting, followed by fine-tuning the top-performing model.

Result: Achieved 8% F1-score improvement on SciCite and 4.3% on ACL-ARC datasets compared to baselines.

Conclusion: General-purpose LLMs can be effectively adapted for citation intent prediction, offering insights for model selection and prompt engineering.

Abstract: This work investigates the ability of open Large Language Models (LLMs) to predict citation intent through in-context learning and fine-tuning. Unlike traditional approaches relying on domain-specific pre-trained models like SciBERT, we demonstrate that general-purpose LLMs can be adapted to this task with minimal task-specific data. We evaluate twelve model variations across five prominent open LLM families using zero-, one-, few-, and many-shot prompting. Our experimental study identifies the top-performing model and prompting parameters through extensive in-context learning experiments. We then demonstrate the significant impact of task-specific adaptation by fine-tuning this model, achieving a relative F1-score improvement of 8% on the SciCite dataset and 4.3% on the ACL-ARC dataset compared to the instruction-tuned baseline. These findings provide valuable insights for model selection and prompt engineering. Additionally, we make our end-to-end evaluation framework and models openly available for future use.

[53] Palm: A Culturally Inclusive and Linguistically Diverse Dataset for Arabic LLMs

Fakhraddin Alwajih, Abdellah El Mekki, Samar Mohamed Magdy, Abdelrahim A. Elmadany, Omer Nacar, El Moatez Billah Nagoudi, Reem Abdel-Salam, Hanin Atwany, Youssef Nafea, Abdulfattah Mohammed Yahya, Rahaf Alhamouri, Hamzah A. Alsayadi, Hiba Zayed, Sara Shatnawi, Serry Sibaee, Yasir Ech-Chammakhy, Walid Al-Dhabyani, Marwa Mohamed Ali, Imen Jarraya, Ahmed Oumar El-Shangiti, Aisha Alraeesi, Mohammed Anwar Al-Ghrawi, Abdulrahman S. Al-Batati, Elgizouli Mohamed, Noha Taha Elgindi, Muhammed Saeed, Houdaifa Atou, Issam Ait Yahia, Abdelhak Bouayad, Mohammed Machrouh, Amal Makouar, Dania Alkawi, Mukhtar Mohamed, Safaa Taher Abdelfadil, Amine Ziad Ounnoughene, Rouabhia Anfel, Rwaa Assi, Ahmed Sorkatti, Mohamedou Cheikh Tourad, Anis Koubaa, Ismail Berrada, Mustafa Jarrar, Shady Shehata, Muhammad Abdul-Mageed

Main category: cs.CL

TL;DR: A community-driven dataset for evaluating cultural and dialectal sensitivity of LLMs in Arabic, revealing limitations in representation and performance.

DetailsMotivation: To address the lack of cultural and dialectal inclusivity in LLMs, especially for Arabic languages.

Method: Created a dataset with input-response pairs in Modern Standard Arabic and dialectal Arabic, covering 20 topics across 22 Arab countries, and evaluated LLMs using this dataset.

Result: Closed-source LLMs perform well but have flaws; open-source models struggle more. Some countries are underrepresented (e.g., Iraq, Mauritania, Yemen).

Conclusion: The dataset highlights gaps in LLM performance and representation, advocating for more inclusive and culturally sensitive models.

Abstract: As large language models (LLMs) become increasingly integrated into daily life, ensuring their cultural sensitivity and inclusivity is paramount. We introduce our dataset, a year-long community-driven project covering all 22 Arab countries. The dataset includes instructions (input, response pairs) in both Modern Standard Arabic (MSA) and dialectal Arabic (DA), spanning 20 diverse topics. Built by a team of 44 researchers across the Arab world, all of whom are authors of this paper, our dataset offers a broad, inclusive perspective. We use our dataset to evaluate the cultural and dialectal capabilities of several frontier LLMs, revealing notable limitations. For instance, while closed-source LLMs generally exhibit strong performance, they are not without flaws, and smaller open-source models face greater challenges. Moreover, certain countries (e.g., Egypt, the UAE) appear better represented than others (e.g., Iraq, Mauritania, Yemen). Our annotation guidelines, code, and data for reproducibility are publicly available.

[54] Ensemble Debiasing Across Class and Sample Levels for Fairer Prompting Accuracy

Ruixi Lin, Ziqiao Wang, Yang You

Main category: cs.CL

TL;DR: The paper introduces a Heaviside step function-based ensemble debiasing method to address class accuracy imbalance in few-shot learning with language models, achieving balanced and improved performance.

DetailsMotivation: The pursuit of overall accuracy in few-shot learning often masks class imbalance, and the goal is to elevate weak classes rather than enrich strong ones.

Method: Proposes a Heaviside step function-based ensemble debiasing method for flexible rectification of class probabilities at class and sample levels.

Result: Achieves state-of-the-art accuracy gains with balanced class accuracies on seven benchmarks using Llama-2-13B, and significant gains with Llama-2-70B in biomedical tasks.

Conclusion: Sample-level corrections are crucial for elevating weak classes, and the method demonstrates the necessity of ensemble debiasing at both levels.

Abstract: Language models are strong few-shot learners and achieve good overall accuracy in text classification tasks, masking the fact that their results suffer from great class accuracy imbalance. We believe that the pursuit of overall accuracy should not come from enriching the strong classes, but from raising up the weak ones. To address the imbalance, we propose a Heaviside step function based ensemble debiasing method, which enables flexible rectifications of in-context learned class probabilities at both class and sample levels. Evaluations with Llama-2-13B on seven text classification benchmarks show that our approach achieves state-of-the-art overall accuracy gains with balanced class accuracies. More importantly, we perform analyses on the resulted probability correction scheme, showing that sample-level corrections are necessary to elevate weak classes. Due to effectively correcting weak classes, our method also brings significant performance gains to a larger model variant, Llama-2-70B, especially on a biomedical domain task, further demonstrating the necessity of ensemble debiasing at both levels. Our source code is available at https://github.com/NUS-HPC-AI-Lab/DCS.

[55] Relation Extraction with Instance-Adapted Predicate Descriptions

Yuhang Jiang, Ramakanth Kavuluru

Main category: cs.CL

TL;DR: The paper proposes a dual-encoder architecture with joint contrastive and cross-entropy loss for relation extraction, improving F1 scores by 1-2% over state-of-the-art methods.

DetailsMotivation: Smaller encoder models are still preferred for relation extraction despite decoder-only large language models excelling in generative tasks. The paper aims to enhance their performance.

Method: A novel dual-encoder architecture with joint contrastive and cross-entropy loss, using a second encoder for instance-specific predicate representations infused with real entity spans.

Result: Achieved F1 score improvements of 1-2% over state-of-the-art methods on biomedical and general domain datasets.

Conclusion: The proposed architecture is effective, with ablation studies validating its components.

Abstract: Relation extraction (RE) is a standard information extraction task playing a major role in downstream applications such as knowledge discovery and question answering. Although decoder-only large language models are excelling in generative tasks, smaller encoder models are still the go to architecture for RE. In this paper, we revisit fine-tuning such smaller models using a novel dual-encoder architecture with a joint contrastive and cross-entropy loss. Unlike previous methods that employ a fixed linear layer for predicate representations, our approach uses a second encoder to compute instance-specific predicate representations by infusing them with real entity spans from corresponding input instances. We conducted experiments on two biomedical RE datasets and two general domain datasets. Our approach achieved F1 score improvements ranging from 1% to 2% over state-of-the-art methods with a simple but elegant formulation. Ablation studies justify the importance of various components built into the proposed architecture.

[56] Kill two birds with one stone: generalized and robust AI-generated text detection via dynamic perturbations

Yinghan Zhou, Juan Wen, Wanli Peng, Yiming Xue, Ziwei Zhang, Zhengxian Wu

Main category: cs.CL

TL;DR: The paper introduces DP-Net, a novel AI-generated text (AIGT) detection method using dynamic perturbations via reinforcement learning, addressing generalization and robustness simultaneously.

DetailsMotivation: Concerns about misuse of AI-generated text highlight the need for robust and generalizable detection methods, which existing approaches lack.

Method: Proposes DP-Net, leveraging dynamic perturbations introduced by reinforcement learning with a detailed reward-action mechanism.

Result: DP-Net outperforms state-of-the-art methods in cross-domain generalization and robustness against adversarial attacks.

Conclusion: DP-Net effectively unifies generalization and robustness in AIGT detection, validated by superior experimental performance.

Abstract: The growing popularity of large language models has raised concerns regarding the potential to misuse AI-generated text (AIGT). It becomes increasingly critical to establish an excellent AIGT detection method with high generalization and robustness. However, existing methods either focus on model generalization or concentrate on robustness. The unified mechanism, to simultaneously address the challenges of generalization and robustness, is less explored. In this paper, we argue that robustness can be view as a specific form of domain shift, and empirically reveal an intrinsic mechanism for model generalization of AIGT detection task. Then, we proposed a novel AIGT detection method (DP-Net) via dynamic perturbations introduced by a reinforcement learning with elaborated reward and action. Experimentally, extensive results show that the proposed DP-Net significantly outperforms some state-of-the-art AIGT detection methods for generalization capacity in three cross-domain scenarios. Meanwhile, the DP-Net achieves best robustness under two text adversarial attacks. The code is publicly available at https://github.com/CAU-ISS-Lab/AIGT-Detection-Evade-Detection/tree/main/DP-Net.

[57] RADLADS: Rapid Attention Distillation to Linear Attention Decoders at Scale

Daniel Goldstein, Eric Alcaide, Janna Lu, Eugene Cheah

Main category: cs.CL

TL;DR: RADLADS converts softmax attention transformers to linear attention decoders efficiently, requiring minimal tokens and cost, while maintaining performance.

DetailsMotivation: To reduce the computational cost and token requirements of converting softmax attention transformers to linear attention models without significant quality loss.

Method: Uses a rapid distillation protocol to convert models, introduces new RWKV-variant architectures, and applies the process to Qwen2.5 models (7B, 32B, 72B).

Result: Achieves state-of-the-art performance for linear attention models, with conversion costing under $2,000 for a 72B model.

Conclusion: RADLADS offers a cost-effective, high-quality solution for linear attention model conversion, with models released under open licenses.

Abstract: We present Rapid Attention Distillation to Linear Attention Decoders at Scale (RADLADS), a protocol for rapidly converting softmax attention transformers into linear attention decoder models, along with two new RWKV-variant architectures, and models converted from popular Qwen2.5 open source models in 7B, 32B, and 72B sizes. Our conversion process requires only 350-700M tokens, less than 0.005% of the token count used to train the original teacher models. Converting to our 72B linear attention model costs less than $2,000 USD at today’s prices, yet quality at inference remains close to the original transformer. These models achieve state-of-the-art downstream performance across a set of standard benchmarks for linear attention models of their size. We release all our models on HuggingFace under the Apache 2.0 license, with the exception of our 72B models which are also governed by the Qwen License Agreement. Models at https://huggingface.co/collections/recursal/radlads-6818ee69e99e729ba8a87102 Training Code at https://github.com/recursal/RADLADS-paper

[58] Distilling the Implicit Multi-Branch Structure in LLMs’ Reasoning via Reinforcement Learning

Shicheng Xu, Liang Pang, Yunchang Zhu, Jia Gu, Zihao Wei, Jingcheng Deng, Feiyang Pan, Huawei Shen, Xueqi Cheng

Main category: cs.CL

TL;DR: RLKD, a reinforcement learning-based distillation framework, outperforms standard supervised fine-tuning by better capturing the teacher model’s implicit multi-branch reasoning structure.

DetailsMotivation: Supervised fine-tuning (SFT) fails to distill the authentic multi-branch reasoning structure of teacher models, limiting student model performance.

Method: RLKD uses a Generative Structure Reward Model (GSRM) to align student and teacher reasoning structures via reinforcement learning.

Result: RLKD surpasses SFT-RL pipelines, even with minimal data, unlocking superior reasoning potential in student models.

Conclusion: RLKD effectively distills complex reasoning structures, offering a more robust alternative to SFT-based methods.

Abstract: Distilling reasoning paths from teacher to student models via supervised fine-tuning (SFT) provides a shortcut for improving the reasoning ability of smaller Large Language Models (LLMs). However, the reasoning paths generated by teacher models often reflect only surface-level traces of their underlying authentic reasoning. Insights from cognitive neuroscience suggest that authentic reasoning involves a complex interweaving between meta-reasoning (which selects appropriate sub-problems from multiple candidates) and solving (which addresses the sub-problem). This implies authentic reasoning has an implicit multi-branch structure. Supervised fine-tuning collapses this rich structure into a flat sequence of token prediction in the teacher’s reasoning path, preventing effective distillation of this structure to students. To address this limitation, we propose RLKD, a reinforcement learning (RL)-based distillation framework guided by a novel Generative Structure Reward Model (GSRM). Our GSRM converts reasoning paths into multiple meta-reasoning-solving steps and computes rewards to measure structural alignment between student and teacher reasoning. RLKD combines this reward with RL, enabling student LLMs to internalize the teacher’s implicit multi-branch reasoning structure rather than merely mimicking fixed output paths. Experiments show RLKD surpasses standard SFT-RL pipelines even when trained on 0.1% of data under an RL-only regime, unlocking greater student reasoning potential than SFT-based distillation.

[59] DoctorAgent-RL: A Multi-Agent Collaborative Reinforcement Learning System for Multi-Turn Clinical Dialogue

Yichun Feng, Jiawei Wang, Lu Zhou, Zhen Lei, Yixue Li

Main category: cs.CL

TL;DR: A reinforcement learning-based multi-agent framework (Ours) improves biomedical question answering by dynamically optimizing questioning strategies in multi-turn consultations, outperforming existing models.

DetailsMotivation: Current LLMs in clinical consultations face challenges like vague diagnoses due to single-round systems and inflexibility in multi-turn models.

Method: Proposes a reinforcement learning (RL)-based multi-agent framework where a doctor agent dynamically adjusts questioning strategies through interactions with a patient agent, guided by a Consultation Evaluator.

Result: Outperforms existing models in multi-turn reasoning and diagnostic performance, with practical benefits like reducing misdiagnosis risks.

Conclusion: The framework enhances clinical reasoning in LLMs, optimizes medical resource allocation, and addresses workforce shortages, with code and data publicly available.

Abstract: Large language models (LLMs) have demonstrated excellent capabilities in the field of biomedical question answering, but their application in real-world clinical consultations still faces core challenges. Single-round consultation systems require patients to describe all symptoms upfront, leading to vague diagnosis with unclear complaints. Traditional multi-turn dialogue models, constrained by static supervised learning, lack flexibility and fail to intelligently extract key clinical information. To address these limitations, we propose \Ours{}, a reinforcement learning (RL)-based multi-agent collaborative framework that models medical consultations as a dynamic decision-making process under uncertainty. The doctor agent continuously optimizes its questioning strategy within the RL framework through multi-turn interactions with the patient agent, dynamically adjusting its information-gathering path based on comprehensive rewards from the Consultation Evaluator. This RL fine-tuning mechanism enables LLMs to autonomously develop interaction strategies aligned with clinical reasoning logic, rather than superficially imitating patterns in existing dialogue data. Notably, we constructed MTMedDialog, the first English multi-turn medical consultation dataset capable of simulating patient interactions. Experiments demonstrate that \Ours{} outperforms existing models in both multi-turn reasoning capability and final diagnostic performance. This approach shows immense practical value by reducing misdiagnosis risks in time-pressured settings, freeing clinicians for complex cases, and pioneering a strategy to optimize medical resource allocation and alleviate workforce shortages. Code and data are available at https://github.com/JarvisUSTC/DoctorAgent-RL

[60] Toward Structured Knowledge Reasoning: Contrastive Retrieval-Augmented Generation on Experience

Jiawei Gu, Ziting Xian, Yuanzhen Xie, Ye Liu, Enjie Liu, Ruichao Zhong, Mochi Gao, Yunzhi Tan, Bo Hu, Zang Li

Main category: cs.CL

TL;DR: CoRE framework improves LLM performance on structured data tasks like Text-to-SQL and TableQA using contrastive ICL and experience memory, achieving significant gains.

DetailsMotivation: LLMs underperform on structured data due to underexposure and rigid transfer mechanisms, unlike humans who adapt learned patterns across modalities.

Method: Introduces CoRE, leveraging contrastive ICL and MCTS-generated Experience Memory to enhance generalization and diversity.

Result: Achieves average gains of 3.44% and 4.24%, with up to 17.2% on challenging tasks, and expands training data 8-9x.

Conclusion: CoRE bridges the cognitive gap, enabling LLMs to better handle structured knowledge without additional training.

Abstract: Large language models (LLMs) achieve strong performance on plain text tasks but underperform on structured data like tables and databases. Potential challenges arise from their underexposure during pre-training and rigid text-to-structure transfer mechanisms. Unlike humans who seamlessly apply learned patterns across data modalities, LLMs struggle to infer implicit relationships embedded in tabular formats, especially in the absence of explicit structural guidance. To bridge this cognitive gap, we introduce Contrastive Retrieval-Augmented Generation on Experience (CoRE), a framework that builds experience memory representations and enhances generalization through contrastive In-Context Learning (ICL) to simulate human-like knowledge transfer. Experiments on Text-to-SQL and TableQA show CoRE significantly improves performance, achieving average gains of 3.44% and 4.24%, with up to 17.2% on challenging tasks. Our Monte Carlo Tree Search (MCTS)-generated Experience Memory expands training data 8-9x, enhancing diversity and domain coverage. This training-free and continual method propels LLMs toward structured knowledge expertise.

[61] References Matter: Investigating the Impact of Reference Set Variation on Summarization Evaluation

Silvia Casola, Yang Janet Liu, Siyao Peng, Oliver Kraus, Albert Gatt, Barbara Plank

Main category: cs.CL

TL;DR: The paper investigates how the choice of reference sets affects reference-based summarization metrics, revealing significant instability in popular metrics like ROUGE and weak correlation with human judgments, especially for LLM outputs.

DetailsMotivation: Current summarization evaluation overlooks the impact of reference set variation on metrics, potentially undermining the reliability of model comparisons.

Method: Analyzes three multi-reference summarization datasets (SummEval, GUMSum, DUC2004) and collects human judgments on LLM outputs for diverse genres.

Result: Popular metrics, especially n-gram-based ones like ROUGE, show instability, and correlation with human judgments is weak or absent for LLM outputs.

Conclusion: Recommends incorporating reference set variation into summarization evaluation to improve consistency and correlation with human judgments, particularly for LLMs.

Abstract: Human language production exhibits remarkable richness and variation, reflecting diverse communication styles and intents. However, this variation is often overlooked in summarization evaluation. While having multiple reference summaries is known to improve correlation with human judgments, the impact of the reference set on reference-based metrics has not been systematically investigated. This work examines the sensitivity of widely used reference-based metrics in relation to the choice of reference sets, analyzing three diverse multi-reference summarization datasets: SummEval, GUMSum, and DUC2004. We demonstrate that many popular metrics exhibit significant instability. This instability is particularly concerning for n-gram-based metrics like ROUGE, where model rankings vary depending on the reference sets, undermining the reliability of model comparisons. We also collect human judgments on LLM outputs for genre-diverse data and examine their correlation with metrics to supplement existing findings beyond newswire summaries, finding weak-to-no correlation. Taken together, we recommend incorporating reference set variation into summarization evaluation to enhance consistency alongside correlation with human judgments, especially when evaluating LLMs.

[62] Plan for Speed: Dilated Scheduling for Masked Diffusion Language Models

Omer Luxembourg, Haim Permuter, Eliya Nachmani

Main category: cs.CL

TL;DR: Dilated Unmasking Scheduler (DUS) improves parallel text generation in masked diffusion language models by minimizing joint entropy gain, outperforming confidence-based methods.

DetailsMotivation: Existing samplers for masked diffusion language models (MDLMs) reduce to slow autoregressive behavior due to ignoring interactions when unmasking multiple positions in parallel.

Method: Proposes DUS, which partitions sequence positions into non-adjacent dilated groups and unmasked them in parallel to minimize joint entropy gain at each denoising step.

Result: DUS outperforms confidence-based planners across math, code, and general-knowledge benchmarks, recovering most performance lost under traditional parallel unmasking.

Conclusion: DUS reveals the true speed-quality frontier of MDLMs without modifying the underlying denoiser.

Abstract: Masked diffusion language models (MDLMs) promise fast, non-autoregressive text generation, yet existing samplers, which pick tokens to unmask based on model confidence, ignore interactions when unmasking multiple positions in parallel and effectively reduce to slow, autoregressive behavior. We propose the Dilated Unmasking Scheduler (DUS), an inference-only, planner-model-free method that partitions sequence positions into non-adjacent dilated groups and unmasked them in parallel so as to minimize an upper bound on joint entropy gain at each denoising step. By explicitly trading off the number of network calls against generation quality, DUS recovers most of the performance lost under traditional parallel unmasking strategies. Across math (GSM8K, MATH500), code (HumanEval, MBPP) and general-knowledge benchmarks (BBH, MMLU-Pro), DUS outperforms confidence-based planners, without modifying the underlying denoiser, and reveals the true speed-quality frontier of MDLMs.

[63] MedicalBERT: enhancing biomedical natural language processing using pretrained BERT-based model

K. Sahit Reddy, N. Ragavenderan, Vasanth K., Ganesh N. Naik, Vishalakshi Prabhu, Nagaraja G. S

Main category: cs.CL

TL;DR: MedicalBERT is a pretrained BERT model optimized for biomedical NLP tasks, outperforming other BERT-based models and general-purpose BERT by 5.67% on average.

DetailsMotivation: General-purpose NLP models like BERT, GPT, and T5 struggle with biomedical terminology and bidirectional understanding in medical texts.

Method: MedicalBERT is pretrained on a large biomedical dataset with domain-specific vocabulary and fine-tuned for tasks like named entity recognition and document classification.

Result: MedicalBERT surpasses BioBERT, SciBERT, ClinicalBERT, and general-purpose BERT in benchmarks, showing superior performance in biomedical NLP tasks.

Conclusion: MedicalBERT demonstrates the effectiveness of transfer learning for domain-specific NLP, highlighting its potential for medical applications.

Abstract: Recent advances in natural language processing (NLP) have been driven bypretrained language models like BERT, RoBERTa, T5, and GPT. Thesemodels excel at understanding complex texts, but biomedical literature, withits domain-specific terminology, poses challenges that models likeWord2Vec and bidirectional long short-term memory (Bi-LSTM) can’t fullyaddress. GPT and T5, despite capturing context, fall short in tasks needingbidirectional understanding, unlike BERT. Addressing this, we proposedMedicalBERT, a pretrained BERT model trained on a large biomedicaldataset and equipped with domain-specific vocabulary that enhances thecomprehension of biomedical terminology. MedicalBERT model is furtheroptimized and fine-tuned to address diverse tasks, including named entityrecognition, relation extraction, question answering, sentence similarity, anddocument classification. Performance metrics such as the F1-score,accuracy, and Pearson correlation are employed to showcase the efficiencyof our model in comparison to other BERT-based models such as BioBERT,SciBERT, and ClinicalBERT. MedicalBERT outperforms these models onmost of the benchmarks, and surpasses the general-purpose BERT model by5.67% on average across all the tasks evaluated respectively. This work alsounderscores the potential of leveraging pretrained BERT models for medicalNLP tasks, demonstrating the effectiveness of transfer learning techniques incapturing domain-specific information. (PDF) MedicalBERT: enhancing biomedical natural language processing using pretrained BERT-based model. Available from: https://www.researchgate.net/publication/392489050_MedicalBERT_enhancing_biomedical_natural_language_processing_using_pretrained_BERT-based_model [accessed Jul 06 2025].

[64] Seed-X: Building Strong Multilingual Translation LLM with 7B Parameters

Shanbo Cheng, Yu Bao, Qian Cao, Luyang Huang, Liyan Kang, Zhicheng Liu, Yu Lu, Wenhao Zhu, Jingwen Chen, Zhichao Huang, Tao Li, Yifu Li, Huiying Lin, Sitong Liu, Ningxin Peng, Shuaijie She, Lu Xu, Nuo Xu, Sen Yang, Runsheng Yu, Yiming Yu, Liehao Zou, Hang Li, Lu Lu, Yuxuan Wang, Yonghui Wu

Main category: cs.CL

TL;DR: Seed-X is a family of open-source LLMs for multilingual translation, achieving performance comparable to top closed-source models like GPT-4o, with 7B parameters.

DetailsMotivation: Addressing challenges in multilingual translation, such as intricate language patterns and stilted translations in automated systems.

Method: Pre-trained on diverse, high-quality data across 28 languages, finetuned with Chain-of-Thought reasoning and enhanced via reinforcement learning.

Result: Outperforms larger open-source models and matches closed-source models like Gemini-2.5 and GPT-4o in 28 languages.

Conclusion: Seed-X advances multilingual translation research and applications, with publicly available parameters and best practices shared.

Abstract: Multilingual translation stands as a challenging task for large language models (LLMs) to handle intricate language patterns and stilted translations that arise in automated translations. In this paper, we introduce Seed-X, a family of open-source LLMs comprising instruct and reasoning models, pushing the limits of translation capability with 7B parameter size. The base model is pre-trained on a diverse, high-quality dataset encompassing both monolingual and bilingual content across 28 languages, harnessing the full potential of multilingual data. The instruct model is then finetuned to translate by Chain-of-Thought (CoT) reasoning and further enhanced through reinforcement learning (RL) to achieve better generalization across diverse language pairs. Seed-X achieves performance comparable to leading closed-source models, including Gemini-2.5 and GPT-4o, across 28 languages, and significantly outperforms larger open-source models in both automatic metrics and human evaluations. We share the best practices through our optimization process, and make the parameter public available for advancing translation research and applications.

[65] Promptomatix: An Automatic Prompt Optimization Framework for Large Language Models

Rithesh Murthy, Ming Zhu, Liangwei Yang, Jielin Qiu, Juntao Tan, Shelby Heinecke, Caiming Xiong, Silvio Savarese, Huan Wang

Main category: cs.CL

TL;DR: Promptomatix automates prompt optimization for LLMs, improving performance and accessibility without manual tuning.

DetailsMotivation: Manual prompt engineering is inconsistent and inaccessible to non-experts, limiting LLM performance.

Method: Uses a meta-prompt-based optimizer and DSPy-powered compiler to analyze intent, generate synthetic data, and refine prompts.

Result: Achieves competitive or superior performance across 5 tasks, reducing prompt length and computational overhead.

Conclusion: Promptomatix makes prompt optimization scalable, efficient, and accessible.

Abstract: Large Language Models (LLMs) perform best with well-crafted prompts, yet prompt engineering remains manual, inconsistent, and inaccessible to non-experts. We introduce Promptomatix, an automatic prompt optimization framework that transforms natural language task descriptions into high-quality prompts without requiring manual tuning or domain expertise. Promptomatix supports both a lightweight meta-prompt-based optimizer and a DSPy-powered compiler, with modular design enabling future extension to more advanced frameworks. The system analyzes user intent, generates synthetic training data, selects prompting strategies, and refines prompts using cost-aware objectives. Evaluated across 5 task categories, Promptomatix achieves competitive or superior performance compared to existing libraries, while reducing prompt length and computational overhead making prompt optimization scalable and efficient.

[66] A Fisher’s exact test justification of the TF-IDF term-weighting scheme

Paul Sheridan, Zeyad Ahmed, Aitazaz A. Farooque

Main category: cs.CL

TL;DR: The paper justifies TF-IDF from a significance testing perspective, linking it to Fisher’s exact test and its p-value.

DetailsMotivation: To provide a theoretical foundation for TF-IDF's effectiveness by connecting it to statistical significance testing.

Method: Demonstrates how TF-IDF relates to the negative logarithm of the p-value from Fisher’s exact test under certain conditions.

Result: TF-IDF is shown to converge to the negative log-transformed p-value under idealized assumptions and in large document collections.

Conclusion: The paper offers statisticians a clear explanation for TF-IDF’s success through its connection to significance testing.

Abstract: Term frequency-inverse document frequency, or TF-IDF for short, is arguably the most celebrated mathematical expression in the history of information retrieval. Conceived as a simple heuristic quantifying the extent to which a given term’s occurrences are concentrated in any one given document out of many, TF-IDF and its many variants are routinely used as term-weighting schemes in diverse text analysis applications. There is a growing body of scholarship dedicated to placing TF-IDF on a sound theoretical foundation. Building on that tradition, this paper justifies the use of TF-IDF to the statistics community by demonstrating how the famed expression can be understood from a significance testing perspective. We show that the common TF-IDF variant TF-ICF is, under mild regularity conditions, closely related to the negative logarithm of the $p$-value from a one-tailed version of Fisher’s exact test of statistical significance. As a corollary, we establish a connection between TF-IDF and the said negative log-transformed $p$-value under certain idealized assumptions. We further demonstrate, as a limiting case, that this same quantity converges to TF-IDF in the limit of an infinitely large document collection. The Fisher’s exact test justification of TF-IDF equips the working statistician with a ready explanation of the term-weighting scheme’s long-established effectiveness.

[67] 3LM: Bridging Arabic, STEM, and Code through Benchmarking

Basma El Amel Boussaha, Leen AlQadi, Mugariya Farooq, Shaikha Alsuwaidi, Giulia Campesan, Ahmed Alzubaidi, Mohammed Alyafeai, Hakim Hacid

Main category: cs.CL

TL;DR: The paper introduces 3LM, a suite of three benchmarks for Arabic LLMs, focusing on STEM and code generation to address gaps in existing Arabic benchmarks.

DetailsMotivation: Existing Arabic benchmarks lack coverage in STEM and code domains, which are crucial for real-world LLM applications.

Method: Developed three benchmarks: (1) STEM Q&A from Arabic textbooks, (2) synthetic STEM questions, and (3) a code generation benchmark via translated and reviewed code benchmarks.

Result: Publicly released benchmarks to support Arabic LLM research in underrepresented domains.

Conclusion: 3LM fills a critical gap in Arabic LLM evaluation, promoting broader applications and research.

Abstract: Arabic is one of the most widely spoken languages in the world, yet efforts to develop and evaluate Large Language Models (LLMs) for Arabic remain relatively limited. Most existing Arabic benchmarks focus on linguistic, cultural, or religious content, leaving a significant gap in domains like STEM and code which are increasingly relevant for real-world LLM applications. To help bridge this gap, we present 3LM, a suite of three benchmarks designed specifically for Arabic. The first is a set of STEM-related question-answer pairs, naturally sourced from Arabic textbooks and educational worksheets. The second consists of synthetically generated STEM questions, created using the same sources. The third benchmark focuses on code generation, built through a careful translation of two widely used code benchmarks, incorporating a human-in-the-loop process with several rounds of review to ensure high-quality and faithful translations. We release all three benchmarks publicly to support the growth of Arabic LLM research in these essential but underrepresented areas.

[68] Re:Form – Reducing Human Priors in Scalable Formal Software Verification with RL in LLMs: A Preliminary Study on Dafny

Chuanhao Yan, Fengdi Che, Xuhan Huang, Xu Xu, Xin Li, Yizhi Li, Xingwei Qu, Jingzhe Shi, Zhuangzhuang He, Chenghua Lin, Yaodong Yang, Binhang Yuan, Hang Zhao, Yu Qiao, Bowen Zhou, Jie Fu

Main category: cs.CL

TL;DR: The paper proposes using formal language-based reasoning (e.g., Dafny) to improve LLM verification, reducing reliance on human priors. It introduces DafnyComp, a benchmark, and shows SFT and RL can outperform proprietary models.

DetailsMotivation: Existing LLMs struggle with reliable and scalable verification, especially for complex programming tasks. Formal languages offer provable verification, reducing human supervision needs.

Method: Uses Dafny for formal reasoning, introduces an automatic data curation pipeline, and integrates RL with formal verifier feedback. Includes supervised fine-tuning (SFT) and RL with regularization.

Result: Small models (0.5B) generate verifiable Dafny code, surpassing proprietary models. RL improves generalization and outperforms baselines on DafnyComp.

Conclusion: Formal language-based reasoning with automated pipelines and RL enhances LLM reliability and scalability for software verification.

Abstract: Existing informal language-based (e.g., human language) Large Language Models (LLMs) trained with Reinforcement Learning (RL) face a significant challenge: their verification processes, which provide crucial training signals, are neither reliable nor scalable. In fact, the prevalent large proprietary models could hardly generate verifiable programs. A promising yet largely uncharted alternative is formal language-based reasoning. Grounding LLMs in rigorous formal systems where generative models operate in formal language spaces (e.g., Dafny) enables the automatic and mathematically provable verification of their reasoning processes and outcomes. This capability is pivotal for achieving large-scale, reliable formal software verification. It is a common practice to employ human-annotated chain-of-thought and other human priors to induce the reasoning and coding capabilities of LLMs. Unfortunately, it becomes unacceptably all-consuming to provide such priors for supervising complex programming tasks. In this work, we systematically explore ways to reduce human priors with the formal language, Dafny, as the main environment for our pilot study. Our pipeline mainly relies on introducing an automatic and scalable data curation pipeline, and careful RL designs integrated with feedback from the formal language verifier. We introduce DafnyComp, a benchmark of compositional formal programs with auto-formalized specifications for specification reasoning. Our supervised fine-tuning (SFT) stage enables even small models (e.g., 0.5B) to generate syntactically valid and verifiable Dafny code, surpassing proprietary models. RL with regularization further improves performance, achieving stronger generalization to out-of-domain tasks and outperforming all strong baselines on the challenging DafnyComp benchmark.

[69] Natural Language Processing for Tigrinya: Current State and Future Directions

Fitsum Gaim, Jong C. Park

Main category: cs.CL

TL;DR: A survey of NLP research for Tigrinya, analyzing 40+ studies (2011-2025), covering resources, models, and applications across 10 tasks. Highlights progress from rule-based to neural systems, challenges like morphological complexity, and future directions.

DetailsMotivation: Tigrinya is underrepresented in NLP despite its large speaker base. This work aims to review and analyze existing research to guide future advancements.

Method: Systematic review of over 40 studies, categorizing computational resources, models, and applications across ten NLP tasks.

Result: Reveals a shift from rule-based to neural systems, driven by resource creation. Identifies challenges (e.g., morphological complexity) and promising directions (e.g., cross-lingual transfer).

Conclusion: Provides a reference for researchers and a roadmap for advancing Tigrinya NLP, with publicly available metadata.

Abstract: Despite being spoken by millions of people, Tigrinya remains severely underrepresented in Natural Language Processing (NLP) research. This work presents a comprehensive survey of NLP research for Tigrinya, analyzing over 40 studies spanning more than a decade of work from 2011 to 2025. We systematically review the current state of computational resources, models, and applications across ten distinct downstream tasks, including morphological processing, machine translation, speech recognition, and question-answering. Our analysis reveals a clear trajectory from foundational, rule-based systems to modern neural architectures, with progress consistently unlocked by resource creation milestones. We identify key challenges rooted in Tigrinya’s morphological complexity and resource scarcity, while highlighting promising research directions, including morphology-aware modeling, cross-lingual transfer, and community-centered resource development. This work serves as both a comprehensive reference for researchers and a roadmap for advancing Tigrinya NLP. A curated metadata of the surveyed studies and resources is made publicly available.

[70] Technical Report of TeleChat2, TeleChat2.5 and T1

Zihan Wang, Xinzhang Liu, Yitong Yao, Chao Wang, Yu Zhao, Zhihao Yang, Wenmin Deng, Kaipeng Jia, Jiaxin Peng, Yuyao Huang, Sishi Xiong, Zhuo Jiang, Kaidong Yu, Xiaohui Hu, Fubei Yao, Ruiyu Fang, Zhuoru Jiang, Ruiting Song, Qiyi Xie, Rui Xue, Xuewei He, Yanlei Xue, Zhu Yuan, Zhaoxi Zhang, Zilu Huang, Shiquan Wang, Xin Wang, Hanming Wu, Mingyuan Wang, Xufeng Zhan, Yuhan Sun, Zhaohu Xing, Yuhao Jiang, Bingkai Yang, Shuangyong Song, Yongxiang Li, Zhongjiang He, Xuelong Li

Main category: cs.CL

TL;DR: The paper introduces TeleChat2, TeleChat2.5, and T1, upgraded versions of TeleChat, with improved performance through enhanced training strategies. They feature domain-specific pretraining, reinforcement learning, and outperform proprietary models like GPT-4o.

DetailsMotivation: To advance language model performance with minimal architectural changes by optimizing training strategies, focusing on reasoning, speed, and diverse applications.

Method: Enhanced pretraining on 10 trillion tokens, SFT, DPO, continual pretraining with domain-specific data, and RL for code and math tasks. Models include 115B parameter dense Transformers.

Result: TeleChat2.5 prioritizes speed, while T1 excels in complex reasoning. Both outperform proprietary models like GPT-4o.

Conclusion: The TeleChat series offers state-of-the-art models for diverse applications, publicly released to benefit developers and researchers.

Abstract: We introduce the latest series of TeleChat models: \textbf{TeleChat2}, \textbf{TeleChat2.5}, and \textbf{T1}, offering a significant upgrade over their predecessor, TeleChat. Despite minimal changes to the model architecture, the new series achieves substantial performance gains through enhanced training strategies in both pre-training and post-training stages. The series begins with \textbf{TeleChat2}, which undergoes pretraining on 10 trillion high-quality and diverse tokens. This is followed by Supervised Fine-Tuning (SFT) and Direct Preference Optimization (DPO) to further enhance its capabilities. \textbf{TeleChat2.5} and \textbf{T1} expand the pipeline by incorporating a continual pretraining phase with domain-specific datasets, combined with reinforcement learning (RL) to improve performance in code generation and mathematical reasoning tasks. The \textbf{T1} variant is designed for complex reasoning, supporting long Chain-of-Thought (CoT) reasoning and demonstrating substantial improvements in mathematics and coding. In contrast, \textbf{TeleChat2.5} prioritizes speed, delivering rapid inference. Both flagship models of \textbf{T1} and \textbf{TeleChat2.5} are dense Transformer-based architectures with 115B parameters, showcasing significant advancements in reasoning and general task performance compared to the original TeleChat. Notably, \textbf{T1-115B} outperform proprietary models such as OpenAI’s o1-mini and GPT-4o. We publicly release \textbf{TeleChat2}, \textbf{TeleChat2.5} and \textbf{T1}, including post-trained versions with 35B and 115B parameters, to empower developers and researchers with state-of-the-art language models tailored for diverse applications.

[71] HIVMedQA: Benchmarking large language models for HIV medical decision support

Gonzalo Cardenal-Antolin, Jacques Fellay, Bashkim Jaha, Roger Kouyos, Niko Beerenwinkel, Diane Duroux

Main category: cs.CL

TL;DR: The study evaluates LLMs in HIV management, introducing HIVMedQA as a benchmark. Gemini 2.5 Pro performed best, but proprietary models and question complexity impacted results.

DetailsMotivation: To assess LLMs' potential and limitations in HIV care due to its complexity and lack of benchmarking studies.

Method: Developed HIVMedQA benchmark, evaluated 10 LLMs using prompt engineering, and assessed performance across clinical relevance dimensions.

Result: Gemini 2.5 Pro outperformed others; proprietary models and complexity affected performance. Reasoning and comprehension were challenging.

Conclusion: Targeted development and evaluation are needed for safe LLM integration in clinical care.

Abstract: Large language models (LLMs) are emerging as valuable tools to support clinicians in routine decision-making. HIV management is a compelling use case due to its complexity, including diverse treatment options, comorbidities, and adherence challenges. However, integrating LLMs into clinical practice raises concerns about accuracy, potential harm, and clinician acceptance. Despite their promise, AI applications in HIV care remain underexplored, and LLM benchmarking studies are scarce. This study evaluates the current capabilities of LLMs in HIV management, highlighting their strengths and limitations. We introduce HIVMedQA, a benchmark designed to assess open-ended medical question answering in HIV care. The dataset consists of curated, clinically relevant questions developed with input from an infectious disease physician. We evaluated seven general-purpose and three medically specialized LLMs, applying prompt engineering to enhance performance. Our evaluation framework incorporates both lexical similarity and an LLM-as-a-judge approach, extended to better reflect clinical relevance. We assessed performance across key dimensions: question comprehension, reasoning, knowledge recall, bias, potential harm, and factual accuracy. Results show that Gemini 2.5 Pro consistently outperformed other models across most dimensions. Notably, two of the top three models were proprietary. Performance declined as question complexity increased. Medically fine-tuned models did not always outperform general-purpose ones, and larger model size was not a reliable predictor of performance. Reasoning and comprehension were more challenging than factual recall, and cognitive biases such as recency and status quo were observed. These findings underscore the need for targeted development and evaluation to ensure safe, effective LLM integration in clinical care.

cs.CV

[72] Quantum-Cognitive Tunnelling Neural Networks for Military-Civilian Vehicle Classification and Sentiment Analysis

Milan Maksimovic, Anna Bohdanets, Immaculate Motsi-Omoijiade, Guido Governatori, Ivan S. Maksymov

Main category: cs.CV

TL;DR: The paper explores using quantum tunnelling (QT) in neural networks to improve recognition of ambiguous objects and sentiment analysis, specifically for military and civilian vehicles and military-specific vocabulary. It suggests QT-based models can enhance AI in battlefield scenarios.

DetailsMotivation: To improve AI's ability to recognize ambiguous objects and analyze sentiment in military contexts, leveraging QT for human-like reasoning.

Method: Employing novel QT-based neural networks to evaluate performance on customized CIFAR-format images and military-specific sentiment analysis.

Result: QT-based models show promise in distinguishing military and civilian vehicles and analyzing sentiment with military vocabulary.

Conclusion: QT-based neural networks can enhance multimodal AI applications in battlefield scenarios, particularly in human-operated drone warfare.

Abstract: Prior work has demonstrated that incorporating well-known quantum tunnelling (QT) probability into neural network models effectively captures important nuances of human perception, particularly in the recognition of ambiguous objects and sentiment analysis. In this paper, we employ novel QT-based neural networks and assess their effectiveness in distinguishing customised CIFAR-format images of military and civilian vehicles, as well as sentiment, using a proprietary military-specific vocabulary. We suggest that QT-based models can enhance multimodal AI applications in battlefield scenarios, particularly within human-operated drone warfare contexts, imbuing AI with certain traits of human reasoning.

[73] Livatar-1: Real-Time Talking Heads Generation with Tailored Flow Matching

Haiyang Liu, Xiaolin Hong, Xuancheng Yang, Yudi Ruan, Xiang Lian, Michael Lingelbach, Hongwei Yi, Wei Li

Main category: cs.CV

TL;DR: Livatar is a real-time audio-driven talking head video generation framework that improves lip-sync accuracy and reduces pose drift, achieving high performance (141 FPS) and low latency (0.17s).

DetailsMotivation: Existing methods for talking head generation suffer from poor lip-sync accuracy and long-term pose drift, limiting their practical use.

Method: Livatar uses a flow matching-based framework and system optimizations to enhance performance.

Result: The framework achieves a LipSync Confidence of 8.50 on the HDTF dataset and operates at 141 FPS with 0.17s latency.

Conclusion: Livatar enables high-fidelity avatars for broader applications, with demonstrated efficiency and quality.

Abstract: We present Livatar, a real-time audio-driven talking heads videos generation framework. Existing baselines suffer from limited lip-sync accuracy and long-term pose drift. We address these limitations with a flow matching based framework. Coupled with system optimizations, Livatar achieves competitive lip-sync quality with a 8.50 LipSync Confidence on the HDTF dataset, and reaches a throughput of 141 FPS with an end-to-end latency of 0.17s on a single A10 GPU. This makes high-fidelity avatars accessible to broader applications. Our project is available at https://www.hedra.com/ with with examples at https://h-liu1997.github.io/Livatar-1/

[74] Features extraction for image identification using computer vision

Venant Niyonkuru, Sylla Sekou, Jimmy Jackson Sinzinkayo

Main category: cs.CV

TL;DR: The paper compares feature extraction techniques in computer vision, focusing on Vision Transformers (ViTs) and other methods like GANs, deep feature models, and traditional approaches, highlighting ViTs’ superior performance over CNNs.

DetailsMotivation: To evaluate and compare the effectiveness of various feature extraction techniques, particularly ViTs, in advancing computer vision.

Method: Analyzes architectures like ViTs (patch embedding, positional encoding, multi-head self-attention) and compares them with CNNs, GANs, and traditional methods.

Result: ViTs outperform CNNs, with experimental results showcasing their merits and limitations.

Conclusion: ViTs are a promising advancement in computer vision, though their limitations and applications require further exploration.

Abstract: This study examines various feature extraction techniques in computer vision, the primary focus of which is on Vision Transformers (ViTs) and other approaches such as Generative Adversarial Networks (GANs), deep feature models, traditional approaches (SIFT, SURF, ORB), and non-contrastive and contrastive feature models. Emphasizing ViTs, the report summarizes their architecture, including patch embedding, positional encoding, and multi-head self-attention mechanisms with which they overperform conventional convolutional neural networks (CNNs). Experimental results determine the merits and limitations of both methods and their utilitarian applications in advancing computer vision.

[75] Adapt, But Don’t Forget: Fine-Tuning and Contrastive Routing for Lane Detection under Distribution Shift

Mohammed Abdul Hafeez Khan, Parth Ganeriwala, Sarah M. Lehman, Siddhartha Bhattacharyya, Amy Alvarez, Natasha Neogi

Main category: cs.CV

TL;DR: A framework for lane detection addresses catastrophic forgetting in cross-dataset fine-tuning by using branch-specific adaptation and dynamic routing.

DetailsMotivation: Cross-dataset distribution shifts cause catastrophic forgetting in lane detection models during fine-tuning.

Method: Train a base model on a source dataset, adapt to target datasets via separate branches, and use contrastive learning for dynamic routing.

Result: Achieves near-optimal F1-scores with fewer parameters than training separate models.

Conclusion: The framework enables efficient adaptation to new datasets while mitigating catastrophic forgetting.

Abstract: Lane detection models are often evaluated in a closed-world setting, where training and testing occur on the same dataset. We observe that, even within the same domain, cross-dataset distribution shifts can cause severe catastrophic forgetting during fine-tuning. To address this, we first train a base model on a source distribution and then adapt it to each new target distribution by creating separate branches, fine-tuning only selected components while keeping the original source branch fixed. Based on a component-wise analysis, we identify effective fine-tuning strategies for target distributions that enable parameter-efficient adaptation. At inference time, we propose using a supervised contrastive learning model to identify the input distribution and dynamically route it to the corresponding branch. Our framework achieves near-optimal F1-scores while using significantly fewer parameters than training separate models for each distribution.

[76] Querying Autonomous Vehicle Point Clouds: Enhanced by 3D Object Counting with CounterNet

Xiaoyu Zhang, Zhifeng Bao, Hai Dong, Ziwei Wang, Jiajun Liu

Main category: cs.CV

TL;DR: CounterNet improves object counting accuracy in point cloud data for autonomous vehicles, enhancing query reliability.

DetailsMotivation: Accurate object counting in point cloud data is critical for reliable query results, but current methods fail in 3D data.

Method: Proposes CounterNet, a heatmap-based network with feature map partitioning and dynamic model selection for better counting.

Result: Improves counting accuracy by 5-20% across object categories, leading to more reliable query outcomes.

Conclusion: CounterNet addresses limitations of prior work, offering a robust solution for point cloud querying in autonomous vehicles.

Abstract: Autonomous vehicles generate massive volumes of point cloud data, yet only a subset is relevant for specific tasks such as collision detection, traffic analysis, or congestion monitoring. Effectively querying this data is essential to enable targeted analytics. In this work, we formalize point cloud querying by defining three core query types: RETRIEVAL, COUNT, and AGGREGATION, each aligned with distinct analytical scenarios. All these queries rely heavily on accurate object counts to produce meaningful results, making precise object counting a critical component of query execution. Prior work has focused on indexing techniques for 2D video data, assuming detection models provide accurate counting information. However, when applied to 3D point cloud data, state-of-the-art detection models often fail to generate reliable object counts, leading to substantial errors in query results. To address this limitation, we propose CounterNet, a heatmap-based network designed for accurate object counting in large-scale point cloud data. Rather than focusing on accurate object localization, CounterNet detects object presence by finding object centers to improve counting accuracy. We further enhance its performance with a feature map partitioning strategy using overlapping regions, enabling better handling of both small and large objects in complex traffic scenes. To adapt to varying frame characteristics, we introduce a per-frame dynamic model selection strategy that selects the most effective configuration for each input. Evaluations on three real-world autonomous vehicle datasets show that CounterNet improves counting accuracy by 5% to 20% across object categories, resulting in more reliable query outcomes across all supported query types.

[77] KuiSCIMA v2.0: Improved Baselines, Calibration, and Cross-Notation Generalization for Historical Chinese Music Notations in Jiang Kui’s Baishidaoren Gequ

Tristan Repolusk, Eduardo Veas

Main category: cs.CV

TL;DR: The paper introduces advancements in Optical Music Recognition (OMR) for historical Chinese notations, achieving lower error rates and outperforming human transcribers.

DetailsMotivation: To address challenges like high class imbalance and limited data in recognizing historical Chinese musical notations (e.g., suzipu, lülüpu) and improve digitization efforts.

Method: Developed a character recognition model for imbalanced data, used temperature scaling for calibration, and employed leave-one-edition-out cross-validation. Extended the KuiSCIMA dataset.

Result: Reduced CER from 10.4% to 7.1% for suzipu and achieved 0.9% for lülüpu, outperforming human transcribers (15.9% average CER). ECE below 0.0162.

Conclusion: The work advances digitization of historical Chinese music, enhancing accessibility and cultural diversity in OMR.

Abstract: Optical Music Recognition (OMR) for historical Chinese musical notations, such as suzipu and l"ul"upu, presents unique challenges due to high class imbalance and limited training data. This paper introduces significant advancements in OMR for Jiang Kui’s influential collection Baishidaoren Gequ from 1202. In this work, we develop and evaluate a character recognition model for scarce imbalanced data. We improve upon previous baselines by reducing the Character Error Rate (CER) from 10.4% to 7.1% for suzipu, despite working with 77 highly imbalanced classes, and achieve a remarkable CER of 0.9% for l"ul"upu. Our models outperform human transcribers, with an average human CER of 15.9% and a best-case CER of 7.6%. We employ temperature scaling to achieve a well-calibrated model with an Expected Calibration Error (ECE) below 0.0162. Using a leave-one-edition-out cross-validation approach, we ensure robust performance across five historical editions. Additionally, we extend the KuiSCIMA dataset to include all 109 pieces from Baishidaoren Gequ, encompassing suzipu, l"ul"upu, and jianzipu notations. Our findings advance the digitization and accessibility of historical Chinese music, promoting cultural diversity in OMR and expanding its applicability to underrepresented music traditions.

[78] Part Segmentation of Human Meshes via Multi-View Human Parsing

James Dickens, Kamyar Hamad

Main category: cs.CV

TL;DR: The paper bridges point cloud deep learning and human parsing by enabling semantic segmentation of human meshes using geometric data, introducing a pseudo-ground truth pipeline and a memory-efficient sampling strategy.

DetailsMotivation: To enable per-vertex semantic segmentation of large-scale human meshes by leveraging raw geometry, avoiding reliance on texture information.

Method: Develops a pseudo-ground truth pipeline for Thuman2.1 dataset, aligns meshes, segments from viewpoints, backprojects labels, and introduces windowed iterative FPS with space-filling curve-based serialization for efficient downsampling. Uses PointTransformer for geometric segmentation.

Result: Experimental results confirm the approach’s effectiveness and accuracy in semantic parsing of human meshes.

Conclusion: The proposed method successfully achieves high-accuracy semantic segmentation of human meshes using only geometric data, demonstrating its practical applicability.

Abstract: Recent advances in point cloud deep learning have led to models that achieve high per-part labeling accuracy on large-scale point clouds, using only the raw geometry of unordered point sets. In parallel, the field of human parsing focuses on predicting body part and clothing/accessory labels from images. This work aims to bridge these two domains by enabling per-vertex semantic segmentation of large-scale human meshes. To achieve this, a pseudo-ground truth labeling pipeline is developed for the Thuman2.1 dataset: meshes are first aligned to a canonical pose, segmented from multiple viewpoints, and the resulting point-level labels are then backprojected onto the original mesh to produce per-point pseudo ground truth annotations. Subsequently, a novel, memory-efficient sampling strategy is introduced, a windowed iterative farthest point sampling (FPS) with space-filling curve-based serialization to effectively downsample the point clouds. This is followed by a purely geometric segmentation using PointTransformer, enabling semantic parsing of human meshes without relying on texture information. Experimental results confirm the effectiveness and accuracy of the proposed approach.

[79] ShrinkBox: Backdoor Attack on Object Detection to Disrupt Collision Avoidance in Machine Learning-based Advanced Driver Assistance Systems

Muhammad Zaeem Shahzad, Muhammad Abdullah Hanif, Bassem Ouni, Muhammad Shafique

Main category: cs.CV

TL;DR: ShrinkBox is a novel backdoor attack targeting object detection in ML-ADAS, subtly shrinking bounding boxes to disrupt distance estimation without detection.

DetailsMotivation: To expose vulnerabilities in ML-ADAS, particularly in cost-effective systems relying on camera input, by introducing a stealthy attack that undermines collision avoidance.

Method: ShrinkBox manipulates ground truth bounding boxes in training data, tested on YOLOv9m with the KITTI dataset, achieving high ASR with minimal poisoning.

Result: ShrinkBox achieves 96% ASR with 4% poisoning, increasing MAE in distance estimation by over 3x on poisoned samples.

Conclusion: ShrinkBox highlights critical security flaws in ML-ADAS, necessitating robust defenses against such stealthy attacks.

Abstract: Advanced Driver Assistance Systems (ADAS) significantly enhance road safety by detecting potential collisions and alerting drivers. However, their reliance on expensive sensor technologies such as LiDAR and radar limits accessibility, particularly in low- and middle-income countries. Machine learning-based ADAS (ML-ADAS), leveraging deep neural networks (DNNs) with only standard camera input, offers a cost-effective alternative. Critical to ML-ADAS is the collision avoidance feature, which requires the ability to detect objects and estimate their distances accurately. This is achieved with specialized DNNs like YOLO, which provides real-time object detection, and a lightweight, detection-wise distance estimation approach that relies on key features extracted from the detections like bounding box dimensions and size. However, the robustness of these systems is undermined by security vulnerabilities in object detectors. In this paper, we introduce ShrinkBox, a novel backdoor attack targeting object detection in collision avoidance ML-ADAS. Unlike existing attacks that manipulate object class labels or presence, ShrinkBox subtly shrinks ground truth bounding boxes. This attack remains undetected in dataset inspections and standard benchmarks while severely disrupting downstream distance estimation. We demonstrate that ShrinkBox can be realized in the YOLOv9m object detector at an Attack Success Rate (ASR) of 96%, with only a 4% poisoning ratio in the training instances of the KITTI dataset. Furthermore, given the low error targets introduced in our relaxed poisoning strategy, we find that ShrinkBox increases the Mean Absolute Error (MAE) in downstream distance estimation by more than 3x on poisoned samples, potentially resulting in delays or prevention of collision warnings altogether.

[80] Structure Matters: Revisiting Boundary Refinement in Video Object Segmentation

Guanyi Qin, Ziyue Wang, Daiyun Shen, Haofeng Liu, Hantao Zhou, Junde Wu, Runze Hu, Yueming Jin

Main category: cs.CV

TL;DR: The paper introduces OASIS, a method for Semi-supervised Video Object Segmentation (SVOS) that improves accuracy and handles occlusion by refining object boundaries and using evidential learning for uncertainty.

DetailsMotivation: Addressing challenges in SVOS like occlusion and object interactions while meeting real-time processing needs.

Method: Proposes a lightweight structure refinement module combining Canny filter edge priors and object features, along with evidential learning for uncertainty.

Result: Achieves superior performance (F:91.6, G:86.6) and competitive speed (48 FPS) on benchmarks.

Conclusion: OASIS offers an efficient and accurate solution for SVOS, outperforming state-of-the-art methods.

Abstract: Given an object mask, Semi-supervised Video Object Segmentation (SVOS) technique aims to track and segment the object across video frames, serving as a fundamental task in computer vision. Although recent memory-based methods demonstrate potential, they often struggle with scenes involving occlusion, particularly in handling object interactions and high feature similarity. To address these issues and meet the real-time processing requirements of downstream applications, in this paper, we propose a novel bOundary Amendment video object Segmentation method with Inherent Structure refinement, hereby named OASIS. Specifically, a lightweight structure refinement module is proposed to enhance segmentation accuracy. With the fusion of rough edge priors captured by the Canny filter and stored object features, the module can generate an object-level structure map and refine the representations by highlighting boundary features. Evidential learning for uncertainty estimation is introduced to further address challenges in occluded regions. The proposed method, OASIS, maintains an efficient design, yet extensive experiments on challenging benchmarks demonstrate its superior performance and competitive inference speed compared to other state-of-the-art methods, i.e., achieving the F values of 91.6 (vs. 89.7 on DAVIS-17 validation set) and G values of 86.6 (vs. 86.2 on YouTubeVOS 2019 validation set) while maintaining a competitive speed of 48 FPS on DAVIS.

[81] VGS-ATD: Robust Distributed Learning for Multi-Label Medical Image Classification Under Heterogeneous and Imbalanced Conditions

Zehui Zhao, Laith Alzubaidi, Haider A. Alwzwazy, Jinglan Zhang, Yuantong Gu

Main category: cs.CV

TL;DR: VGS-ATD is a novel distributed learning framework addressing privacy, data heterogeneity, and scalability in medical imaging, outperforming centralized and decentralized methods in accuracy, efficiency, and resilience to catastrophic forgetting.

DetailsMotivation: Traditional centralized and decentralized learning methods in medical imaging face privacy risks, inefficiencies, and scalability issues, especially with heterogeneous data and dynamic clinical environments.

Method: Proposes VGS-ATD, a distributed learning framework, validated across 30 datasets and 80 labels on distributed nodes.

Result: Achieved 92.7% accuracy, outperforming centralized (84.9%) and swarm learning (72.99%), with 50% lower computational costs and only 1% accuracy drop after expansion.

Conclusion: VGS-ATD is a scalable, efficient, and privacy-preserving solution for medical imaging, resilient to catastrophic forgetting and outperforming existing methods.

Abstract: In recent years, advanced deep learning architectures have shown strong performance in medical imaging tasks. However, the traditional centralized learning paradigm poses serious privacy risks as all data is collected and trained on a single server. To mitigate this challenge, decentralized approaches such as federated learning and swarm learning have emerged, allowing model training on local nodes while sharing only model weights. While these methods enhance privacy, they struggle with heterogeneous and imbalanced data and suffer from inefficiencies due to frequent communication and the aggregation of weights. More critically, the dynamic and complex nature of clinical environments demands scalable AI systems capable of continuously learning from diverse modalities and multilabels. Yet, both centralized and decentralized models are prone to catastrophic forgetting during system expansion, often requiring full model retraining to incorporate new data. To address these limitations, we propose VGS-ATD, a novel distributed learning framework. To validate VGS-ATD, we evaluate it in experiments spanning 30 datasets and 80 independent labels across distributed nodes, VGS-ATD achieved an overall accuracy of 92.7%, outperforming centralized learning (84.9%) and swarm learning (72.99%), while federated learning failed under these conditions due to high requirements on computational resources. VGS-ATD also demonstrated strong scalability, with only a 1% drop in accuracy on existing nodes after expansion, compared to a 20% drop in centralized learning, highlighting its resilience to catastrophic forgetting. Additionally, it reduced computational costs by up to 50% relative to both centralized and swarm learning, confirming its superior efficiency and scalability.

[82] Accelerating Multimodal Large Language Models via Dynamic Visual-Token Exit and the Empirical Findings

Qiong Wu, Wenhao Lin, Yiyi Zhou, Weihao Ye, Zhanpeng Zen, Xiaoshuai Sun, Rongrong Ji

Main category: cs.CV

TL;DR: The paper addresses redundancy in visual tokens of Multimodal Large Language Models (MLLMs) and proposes DyVTE, a method to dynamically remove visual tokens to improve efficiency.

DetailsMotivation: Excessive visual tokens in MLLMs cause computational inefficiency. The study aims to understand and mitigate this redundancy.

Method: DyVTE uses hyper-networks to monitor text tokens and remove visual tokens when they become redundant, based on empirical observations of MLLM attention behaviors.

Result: DyVTE improves MLLM efficiency across models like LLaVA and VILA, validated by benchmarks, while also revealing general MLLM modeling patterns.

Conclusion: DyVTE effectively reduces visual redundancy in MLLMs, enhancing efficiency and providing insights into MLLM behavior.

Abstract: The excessive use of visual tokens in existing Multimoal Large Language Models (MLLMs) often exhibits obvious redundancy and brings in prohibitively expensive computation. To gain insights into this problem, we first conduct extensive empirical studies on the attention behaviors of MLLMs, and summarize three main inference stages in MLLMs: (i) Early fusion between tokens is first accomplished quickly. (ii) Intra-modality modeling then comes to play. (iii) Multimodal reasoning} resumes and lasts until the end of inference. In particular, we reveal that visual tokens will stop contributing to reasoning when the text tokens receive enough image information, yielding obvious visual redundancy. Based on these generalized observations, we propose a simple yet effective method to improve the efficiency of MLLMs, termed dynamic visual-token exit (DyVTE). DyVTE uses lightweight hyper-networks to perceive the text token status and decide the removal of all visual tokens after a certain layer, thereby addressing the observed visual redundancy. To validate VTE, we apply it to a set of MLLMs, including LLaVA, VILA, Eagle and InternVL, and conduct extensive experiments on a bunch of benchmarks. The experiment results not only show the effectiveness of our VTE in improving MLLMs’ efficiency, but also yield the general modeling patterns of MLLMs, well facilitating the in-depth understanding of MLLMs. Our code is released at https://github.com/DoubtedSteam/DyVTE.

[83] Fuzzy Theory in Computer Vision: A Review

Adilet Yerkin, Ayan Igali, Elnara Kadyrgali, Maksat Shagyrov, Malika Ziyada, Muragul Muratbekova, Pakizar Shamoi

Main category: cs.CV

TL;DR: The paper explores fuzzy logic’s role in computer vision for handling uncertainty and noise, improving tasks like object recognition and segmentation. It discusses key techniques, applications, and integration with deep learning.

DetailsMotivation: To address uncertainty and imprecision in image data using fuzzy logic, offering adaptable and interpretable solutions for computer vision tasks.

Method: Discusses fuzzy techniques like clustering, inference systems, type-2 fuzzy sets, and rule-based decision-making, alongside integration with deep learning models like CNNs.

Result: Fuzzy logic enhances performance in tasks such as object recognition and segmentation, with applications in medical imaging, autonomous systems, and industrial inspection.

Conclusion: Fuzzy logic, especially when combined with deep learning, shows promise for advancing computer vision, with emerging trends focusing on hybrid models and explainable AI.

Abstract: Computer vision applications are omnipresent nowadays. The current paper explores the use of fuzzy logic in computer vision, stressing its role in handling uncertainty, noise, and imprecision in image data. Fuzzy logic is able to model gradual transitions and human-like reasoning and provides a promising approach to computer vision. Fuzzy approaches offer a way to improve object recognition, image segmentation, and feature extraction by providing more adaptable and interpretable solutions compared to traditional methods. We discuss key fuzzy techniques, including fuzzy clustering, fuzzy inference systems, type-2 fuzzy sets, and fuzzy rule-based decision-making. The paper also discusses various applications, including medical imaging, autonomous systems, and industrial inspection. Additionally, we explore the integration of fuzzy logic with deep learning models such as convolutional neural networks (CNNs) to enhance performance in complex vision tasks. Finally, we examine emerging trends such as hybrid fuzzy-deep learning models and explainable AI.

[84] Towards Robust and Controllable Text-to-Motion via Masked Autoregressive Diffusion

Zongye Zhang, Bohan Kong, Qingjie Liu, Yunhong Wang

Main category: cs.CV

TL;DR: MoMADiff combines masked modeling and diffusion for robust 3D human motion generation from text, outperforming existing methods in quality and control.

DetailsMotivation: Existing methods struggle with out-of-distribution motions and lack fine-grained control, limiting real-world applicability.

Method: Combines masked modeling with diffusion processes for frame-level continuous representations, supporting flexible keyframe specification.

Result: Outperforms state-of-the-art models in motion quality, instruction fidelity, and keyframe adherence on novel datasets.

Conclusion: MoMADiff offers a robust solution for text-to-motion generation with improved generalization and control.

Abstract: Generating 3D human motion from text descriptions remains challenging due to the diverse and complex nature of human motion. While existing methods excel within the training distribution, they often struggle with out-of-distribution motions, limiting their applicability in real-world scenarios. Existing VQVAE-based methods often fail to represent novel motions faithfully using discrete tokens, which hampers their ability to generalize beyond seen data. Meanwhile, diffusion-based methods operating on continuous representations often lack fine-grained control over individual frames. To address these challenges, we propose a robust motion generation framework MoMADiff, which combines masked modeling with diffusion processes to generate motion using frame-level continuous representations. Our model supports flexible user-provided keyframe specification, enabling precise control over both spatial and temporal aspects of motion synthesis. MoMADiff demonstrates strong generalization capability on novel text-to-motion datasets with sparse keyframes as motion prompts. Extensive experiments on two held-out datasets and two standard benchmarks show that our method consistently outperforms state-of-the-art models in motion quality, instruction fidelity, and keyframe adherence. The code is available at: https://github.com/zzysteve/MoMADiff

[85] Eyes Will Shut: A Vision-Based Next GPS Location Prediction Model by Reinforcement Learning from Visual Map Feed Back

Ruixing Zhang, Yang Zhang, Tongyu Zhu, Leilei Sun, Weifeng Lv

Main category: cs.CV

TL;DR: The paper proposes a human-like next-location prediction method using Vision-Language Models (VLMs), achieving SOTA performance and cross-city generalization.

DetailsMotivation: Existing models lack human-like reasoning over maps, while VLMs offer strong visual reasoning capabilities.

Method: Introduces VGLS to test VLMs, then VLMLocPredictor with SFT tasks and reinforcement learning for self-improvement.

Result: Achieves SOTA performance and superior cross-city generalization on datasets from four cities.

Conclusion: VLMs can effectively mimic human-like reasoning for next-location prediction, outperforming traditional LLM-based methods.

Abstract: Next Location Prediction is a fundamental task in the study of human mobility, with wide-ranging applications in transportation planning, urban governance, and epidemic forecasting. In practice, when humans attempt to predict the next location in a trajectory, they often visualize the trajectory on a map and reason based on road connectivity and movement trends. However, the vast majority of existing next-location prediction models do not reason over maps \textbf{in the way that humans do}. Fortunately, the recent development of Vision-Language Models (VLMs) has demonstrated strong capabilities in visual perception and even visual reasoning. This opens up a new possibility: by rendering both the road network and trajectory onto an image and leveraging the reasoning abilities of VLMs, we can enable models to perform trajectory inference in a human-like manner. To explore this idea, we first propose a method called Vision-Guided Location Search (VGLS), which evaluates whether a general-purpose VLM is capable of trajectory-based reasoning without modifying any of its internal parameters. Based on insights from the VGLS results, we further propose our main approach: VLMLocPredictor, which is composed of two stages: In the first stage, we design two Supervised Fine-Tuning (SFT) tasks that help the VLM understand road network and trajectory structures and acquire basic reasoning ability on such visual inputs. In the second stage, we introduce Reinforcement Learning from Visual Map Feedback, enabling the model to self-improve its next-location prediction ability through interaction with the environment. Experiments conducted on datasets from four different cities show that our method achieves state-of-the-art (SOTA) performance and exhibits superior cross-city generalization compared to other LLM-based approaches.

[86] Enhancing Frequency for Single Image Super-Resolution with Learnable Separable Kernels

Heng Tian

Main category: cs.CV

TL;DR: The paper introduces Learnable Separable Kernels (LSKs), a plug-and-play module for single-image super-resolution (SISR) that directly enhances image frequency components, reducing parameters and computation by over 60% while improving performance.

DetailsMotivation: Existing SISR methods rely on indirect enhancements like specialized loss functions. The paper aims to directly improve frequency components for better efficiency and performance.

Method: Proposes LSKs, rank-one matrices decomposed into orthogonal one-dimensional kernels, to enhance frequency components. Includes interpretable analysis of feature maps.

Result: LSKs reduce parameters and computation by over 60% and improve model performance, especially with higher upscaling factors. Visualization confirms effective frequency enhancement.

Conclusion: LSKs offer a lightweight, efficient solution for SISR, outperforming baseline methods and scaling well with larger upscaling factors.

Abstract: Existing approaches often enhance the performance of single-image super-resolution (SISR) methods by incorporating auxiliary structures, such as specialized loss functions, to indirectly boost the quality of low-resolution images. In this paper, we propose a plug-and-play module called Learnable Separable Kernels (LSKs), which are formally rank-one matrices designed to directly enhance image frequency components. We begin by explaining why LSKs are particularly suitable for SISR tasks from a frequency perspective. Baseline methods incorporating LSKs demonstrate a significant reduction of over 60% in both the number of parameters and computational requirements. This reduction is achieved through the decomposition of LSKs into orthogonal and mergeable one-dimensional kernels. Additionally, we perform an interpretable analysis of the feature maps generated by LSKs. Visualization results reveal the capability of LSKs to enhance image frequency components effectively. Extensive experiments show that incorporating LSKs not only reduces the number of parameters and computational load but also improves overall model performance. Moreover, these experiments demonstrate that models utilizing LSKs exhibit superior performance, particularly as the upscaling factor increases.

[87] Gen-AI Police Sketches with Stable Diffusion

Nicholas Fidalgo, Aaron Contreras, Katherine Harvey, Johnny Ni

Main category: cs.CV

TL;DR: The paper explores multimodal AI for suspect sketching, comparing three models. Model 1 (baseline Stable Diffusion) performed best in metrics like SSIM and PSNR, while Model 3 (LoRA fine-tuned CLIP) improved perceptual similarity but lagged behind.

DetailsMotivation: To automate and enhance suspect sketching using multimodal AI, addressing the need for better alignment between text descriptions and generated sketches.

Method: Developed three pipelines: (1) baseline Stable Diffusion, (2) CLIP-integrated Stable Diffusion, and (3) LoRA fine-tuned CLIP with Stable Diffusion. Evaluated via SSIM, PSNR, and LPIPS.

Result: Model 1 achieved the highest SSIM (0.72) and PSNR (25 dB). Model 3 improved perceptual similarity but was outperformed by Model 1.

Conclusion: The baseline Stable Diffusion model (Model 1) is the most robust for suspect sketching, despite its simplicity, though fine-tuning (Model 3) offers perceptual improvements.

Abstract: This project investigates the use of multimodal AI-driven approaches to automate and enhance suspect sketching. Three pipelines were developed and evaluated: (1) baseline image-to-image Stable Diffusion model, (2) same model integrated with a pre-trained CLIP model for text-image alignment, and (3) novel approach incorporating LoRA fine-tuning of the CLIP model, applied to self-attention and cross-attention layers, and integrated with Stable Diffusion. An ablation study confirmed that fine-tuning both self- and cross-attention layers yielded the best alignment between text descriptions and sketches. Performance testing revealed that Model 1 achieved the highest structural similarity (SSIM) of 0.72 and a peak signal-to-noise ratio (PSNR) of 25 dB, outperforming Model 2 and Model 3. Iterative refinement enhanced perceptual similarity (LPIPS), with Model 3 showing improvement over Model 2 but still trailing Model 1. Qualitatively, sketches generated by Model 1 demonstrated the clearest facial features, highlighting its robustness as a baseline despite its simplicity.

[88] MEDTalk: Multimodal Controlled 3D Facial Animation with Dynamic Emotions by Disentangled Embedding

Chang Liu, Ye Pan, Chenyang Ding, Susanto Rahardja, Xiaokang Yang

Main category: cs.CV

TL;DR: MEDTalk is a framework for dynamic emotional 3D facial animation, disentangling content and emotion for realistic expressions, integrating audio and text, and supporting multimodal inputs for personalized control.

DetailsMotivation: Existing methods rely on static emotion labels, limiting diversity and naturalness in facial animations.

Method: Disentangles content and emotion via cross-reconstruction, integrates audio and text for dynamic expressions, and uses multimodal inputs for personalization.

Result: Generates fine-grained, dynamic emotional expressions suitable for industrial pipelines like MetaHuman.

Conclusion: MEDTalk advances emotional facial animation by enabling independent control and realistic, personalized expressions.

Abstract: Audio-driven emotional 3D facial animation aims to generate synchronized lip movements and vivid facial expressions. However, most existing approaches focus on static and predefined emotion labels, limiting their diversity and naturalness. To address these challenges, we propose MEDTalk, a novel framework for fine-grained and dynamic emotional talking head generation. Our approach first disentangles content and emotion embedding spaces from motion sequences using a carefully designed cross-reconstruction process, enabling independent control over lip movements and facial expressions. Beyond conventional audio-driven lip synchronization, we integrate audio and speech text, predicting frame-wise intensity variations and dynamically adjusting static emotion features to generate realistic emotional expressions. Furthermore, to enhance control and personalization, we incorporate multimodal inputs-including text descriptions and reference expression images-to guide the generation of user-specified facial expressions. With MetaHuman as the priority, our generated results can be conveniently integrated into the industrial production pipeline. The code is available at: https://github.com/SJTU-Lucy/MEDTalk.

[89] Advancing Vision-based Human Action Recognition: Exploring Vision-Language CLIP Model for Generalisation in Domain-Independent Tasks

Sanyam Jain, Marsha Mariya Kappan, Vijeta Sharma

Main category: cs.CV

TL;DR: The paper evaluates CLIP for human action recognition in healthcare, revealing its limitations under masking strategies and proposing a noise-based enhancement to improve accuracy and reduce bias.

DetailsMotivation: Human action recognition is vital in healthcare, but traditional models struggle with generalization. CLIP offers potential, but its performance under masking needs evaluation.

Method: Tested CLIP on UCF-101 with three masking strategies: black masking, feature-specific masking, and isolation masking. Proposed adding class-specific noise to improve attention.

Result: CLIP showed inconsistent performance and misclassifications when visual cues were obscured. The noise enhancement improved accuracy and reduced bias.

Conclusion: CLIP has potential but needs refinement for clinical use. Future work should focus on generalizability in healthcare scenarios.

Abstract: Human action recognition plays a critical role in healthcare and medicine, supporting applications such as patient behavior monitoring, fall detection, surgical robot supervision, and procedural skill assessment. While traditional models like CNNs and RNNs have achieved moderate success, they often struggle to generalize across diverse and complex actions. Recent advancements in vision-language models, especially the transformer-based CLIP model, offer promising capabilities for generalizing action recognition from video data. In this work, we evaluate CLIP on the UCF-101 dataset and systematically analyze its performance under three masking strategies: (1) percentage-based and shape-based black masking at 10%, 30%, and 50%, (2) feature-specific masking to suppress bias-inducing elements, and (3) isolation masking that retains only class-specific regions. Our results reveal that CLIP exhibits inconsistent behavior and frequent misclassifications, particularly when essential visual cues are obscured. To overcome these limitations, we propose incorporating class-specific noise, learned via a custom loss function, to reinforce attention to class-defining features. This enhancement improves classification accuracy and model confidence while reducing bias. We conclude with a discussion on the challenges of applying such models in clinical domains and outline directions for future work to improve generalizability across domain-independent healthcare scenarios.

[90] HeartUnloadNet: A Weakly-Supervised Cycle-Consistent Graph Network for Predicting Unloaded Cardiac Geometry from Diastolic States

Siyu Mu, Wei Xuan Chan, Choon Hwai Yap

Main category: cs.CV

TL;DR: HeartUnloadNet is a deep learning framework that predicts the unloaded left ventricular shape from clinical images, outperforming traditional methods in speed and accuracy.

DetailsMotivation: The unloaded cardiac geometry is crucial for biomechanical modeling, but estimating it from clinical images is challenging. Traditional methods are computationally expensive.

Method: HeartUnloadNet uses a graph attention architecture and cycle-consistency strategy to predict the unloaded LV shape from the end diastolic mesh, incorporating biophysical priors.

Result: Achieves sub-millimeter accuracy (DSC 0.986, HD 0.083 cm) and reduces inference time to 0.02 seconds per case, outperforming traditional solvers.

Conclusion: HeartUnloadNet offers a scalable, accurate, and fast alternative to inverse FE solvers, enabling real-time clinical applications.

Abstract: The unloaded cardiac geometry (i.e., the state of the heart devoid of luminal pressure) serves as a valuable zero-stress and zero-strain reference and is critical for personalized biomechanical modeling of cardiac function, to understand both healthy and diseased physiology and to predict the effects of cardiac interventions. However, estimating the unloaded geometry from clinical images remains a challenging task. Traditional approaches rely on inverse finite element (FE) solvers that require iterative optimization and are computationally expensive. In this work, we introduce HeartUnloadNet, a deep learning framework that predicts the unloaded left ventricular (LV) shape directly from the end diastolic (ED) mesh while explicitly incorporating biophysical priors. The network accepts a mesh of arbitrary size along with physiological parameters such as ED pressure, myocardial stiffness scale, and fiber helix orientation, and outputs the corresponding unloaded mesh. It adopts a graph attention architecture and employs a cycle-consistency strategy to enable bidirectional (loading and unloading) prediction, allowing for partial self-supervision that improves accuracy and reduces the need for large training datasets. Trained and tested on 20,700 FE simulations across diverse LV geometries and physiological conditions, HeartUnloadNet achieves sub-millimeter accuracy, with an average DSC of 0.986 and HD of 0.083 cm, while reducing inference time to just 0.02 seconds per case, over 10^5 times faster and significantly more accurate than traditional inverse FE solvers. Ablation studies confirm the effectiveness of the architecture. Notably, the cycle-consistent design enables the model to maintain a DSC of 97% even with as few as 200 training samples. This work thus presents a scalable and accurate surrogate for inverse FE solvers, supporting real-time clinical applications in the future.

[91] Multispectral Demosaicing via Dual Cameras

SaiKiran Tedla, Junyong Lee, Beixuan Yang, Mahmoud Afifi, Michael S. Brown

Main category: cs.CV

TL;DR: A method for multispectral (MS) image demosaicing using dual-camera setups, leveraging RGB images to guide MS demosaicing, achieving state-of-the-art accuracy.

DetailsMotivation: Enhancing spectral applications and RGB image quality by integrating MS imaging into multi-camera devices like smartphones.

Method: Proposes a demosaicing method for dual-camera setups, using co-captured RGB images to guide MS image reconstruction. Introduces a dataset for training and evaluation.

Result: Achieves state-of-the-art accuracy in demosaicing MS images compared to existing techniques.

Conclusion: The proposed method effectively leverages RGB images to improve MS demosaicing, validated by a new dataset and superior performance.

Abstract: Multispectral (MS) images capture detailed scene information across a wide range of spectral bands, making them invaluable for applications requiring rich spectral data. Integrating MS imaging into multi camera devices, such as smartphones, has the potential to enhance both spectral applications and RGB image quality. A critical step in processing MS data is demosaicing, which reconstructs color information from the mosaic MS images captured by the camera. This paper proposes a method for MS image demosaicing specifically designed for dual-camera setups where both RGB and MS cameras capture the same scene. Our approach leverages co-captured RGB images, which typically have higher spatial fidelity, to guide the demosaicing of lower-fidelity MS images. We introduce the Dual-camera RGB-MS Dataset - a large collection of paired RGB and MS mosaiced images with ground-truth demosaiced outputs - that enables training and evaluation of our method. Experimental results demonstrate that our method achieves state-of-the-art accuracy compared to existing techniques.

[92] Towards Scalable Spatial Intelligence via 2D-to-3D Data Lifting

Xingyu Miao, Haoran Duan, Quanhao Qian, Jiuniu Wang, Yang Long, Ling Shao, Deli Zhao, Ran Xu, Gongjie Zhang

Main category: cs.CV

TL;DR: A scalable pipeline converts single-view images into realistic 3D representations, addressing the scarcity of 3D datasets and reducing data collection costs.

DetailsMotivation: The scarcity of large-scale 3D datasets limits spatial intelligence in AI, unlike abundant 2D imagery.

Method: Integrated depth estimation, camera calibration, and scale calibration to generate 3D representations (point clouds, depth maps, etc.) from single-view images.

Result: Two datasets (COCO-3D, Objects365-v2-3D) were released, and experiments showed benefits for various 3D tasks.

Conclusion: The pipeline effectively bridges the gap between 2D imagery and 3D spatial understanding, advancing AI capabilities in perceiving and interacting with physical environments.

Abstract: Spatial intelligence is emerging as a transformative frontier in AI, yet it remains constrained by the scarcity of large-scale 3D datasets. Unlike the abundant 2D imagery, acquiring 3D data typically requires specialized sensors and laborious annotation. In this work, we present a scalable pipeline that converts single-view images into comprehensive, scale- and appearance-realistic 3D representations - including point clouds, camera poses, depth maps, and pseudo-RGBD - via integrated depth estimation, camera calibration, and scale calibration. Our method bridges the gap between the vast repository of imagery and the increasing demand for spatial scene understanding. By automatically generating authentic, scale-aware 3D data from images, we significantly reduce data collection costs and open new avenues for advancing spatial intelligence. We release two generated spatial datasets, i.e., COCO-3D and Objects365-v2-3D, and demonstrate through extensive experiments that our generated data can benefit various 3D tasks, ranging from fundamental perception to MLLM-based reasoning. These results validate our pipeline as an effective solution for developing AI systems capable of perceiving, understanding, and interacting with physical environments.

[93] SaLF: Sparse Local Fields for Multi-Sensor Rendering in Real-Time

Yun Chen, Matthew Haines, Jingkang Wang, Krzysztof Baron-Lis, Sivabalan Manivasagam, Ze Yang, Raquel Urtasun

Main category: cs.CV

TL;DR: SaLF introduces a fast, versatile volumetric representation for high-fidelity sensor simulation, supporting both rasterization and raytracing, with improved efficiency and realism.

DetailsMotivation: Current methods like NeRF and 3DGS are slow or limited in sensor compatibility, hindering scalable autonomy testing.

Method: SaLF uses sparse 3D voxel primitives with local implicit fields, enabling fast training (<30 min) and rendering (50+ FPS for cameras, 600+ FPS for LiDARs).

Result: SaLF matches realism of existing methods while offering faster performance, adaptive scene handling, and support for non-pinhole cameras and LiDARs.

Conclusion: SaLF advances sensor simulation by combining efficiency, versatility, and realism, enabling scalable autonomy testing.

Abstract: High-fidelity sensor simulation of light-based sensors such as cameras and LiDARs is critical for safe and accurate autonomy testing. Neural radiance field (NeRF)-based methods that reconstruct sensor observations via ray-casting of implicit representations have demonstrated accurate simulation of driving scenes, but are slow to train and render, hampering scale. 3D Gaussian Splatting (3DGS) has demonstrated faster training and rendering times through rasterization, but is primarily restricted to pinhole camera sensors, preventing usage for realistic multi-sensor autonomy evaluation. Moreover, both NeRF and 3DGS couple the representation with the rendering procedure (implicit networks for ray-based evaluation, particles for rasterization), preventing interoperability, which is key for general usage. In this work, we present Sparse Local Fields (SaLF), a novel volumetric representation that supports rasterization and raytracing. SaLF represents volumes as a sparse set of 3D voxel primitives, where each voxel is a local implicit field. SaLF has fast training (<30 min) and rendering capabilities (50+ FPS for camera and 600+ FPS LiDAR), has adaptive pruning and densification to easily handle large scenes, and can support non-pinhole cameras and spinning LiDARs. We demonstrate that SaLF has similar realism as existing self-driving sensor simulation methods while improving efficiency and enhancing capabilities, enabling more scalable simulation. https://waabi.ai/salf/

[94] SAR-TEXT: A Large-Scale SAR Image-Text Dataset Built with SAR-Narrator and Progressive Transfer Learning

Xinjun Cheng, Yiguo He, Junjie Zhu, Chunping Qiu, Jun Wang, Qiangjuan Huang, Ke Yang

Main category: cs.CV

TL;DR: The paper introduces SAR-Text, a large-scale dataset of 130,000 SAR image-text pairs, and the SAR-Narrator framework for generating descriptions. It validates the dataset’s effectiveness through improved performance in retrieval, captioning, and VQA tasks.

DetailsMotivation: The lack of large-scale, high-quality SAR image-text datasets hinders semantic understanding in remote sensing, prompting the creation of SAR-Text.

Method: The SAR-Narrator framework uses multi-stage progressive transfer learning to generate textual descriptions for SAR images. Three models (SAR-RS-CLIP, SAR-RS-CoCa, SAR-GPT) are tested on retrieval, captioning, and VQA tasks.

Result: SAR-RS-CLIP improves retrieval recall by 16.43% and 10.54%. SAR-RS-CoCa outperforms the original CoCa model in captioning metrics (BLEU-4, SPICE, CIDEr). SAR-GPT shows superior performance in VQA tasks.

Conclusion: SAR-Text and SAR-Narrator significantly enhance SAR image understanding and can be adopted by the community for larger-scale dataset construction.

Abstract: Vision Language Models (VLMs) have achieved remarkable breakthroughs in the field of remote sensing in recent years. Synthetic Aperture Radar (SAR) imagery, with its all-weather capability, is essential in remote sensing, yet the lack of large-scale, high-quality SAR image-text datasets hinders its semantic understanding. In this paper, we construct SAR-Text, a large-scale and high-quality dataset consisting of over 130,000 SAR image-text pairs. To construct the SAR-Text dataset, we design the SAR-Narrator framework, which generates textual descriptions for SAR images through a multi-stage progressive transfer learning strategy. To verify the effectiveness of the SAR-TEXT dataset, we conduct experiments on three typical vision-language tasks: image-text retrieval, image captioning, and visual question answering (VQA). Specifically, we construct three representative models on SAR-TEXT: SAR-RS-CLIP, SAR-RS-CoCa, and SAR-GPT. SAR-RS-CLIP achieves notable improvements in retrieval performance, boosting average recall by 16.43% and 10.54% on the OSdataset-512 and HRSID test sets, respectively. In the captioning task, SAR-RS-CoCa achieves BLEU-4, SPICE, and CIDEr scores exceeding those of the original CoCa model by more than 8x, 4x, and 10x, respectively. In the VQA task, SAR-GPT outperforms baseline and single-stage models on multiple SAR-VQA datasets, demonstrating stronger semantic understanding and reasoning ability, as further confirmed by qualitative results. It is worth noting that, as a flexible captioning tool, SAR-Narrator can be readily adopted by the community to construct larger-scale SAR image-text datasets.

[95] Learning Efficient and Generalizable Human Representation with Human Gaussian Model

Yifan Liu, Shengjun Zhang, Chensheng Dai, Yang Chen, Hao Liu, Chen Li, Yueqi Duan

Main category: cs.CV

TL;DR: A method called Human Gaussian Graph is proposed to model animatable human avatars by connecting 3D Gaussians to a human SMPL mesh, improving frame relations and animation quality.

DetailsMotivation: Existing methods for animatable human avatars either require per-instance optimization or fail to capture temporal relations between frames.

Method: The Human Gaussian Graph uses dual layers (Gaussians and mesh vertices) with intra-node and inter-node operations to aggregate and pass information.

Result: The method shows efficiency and generalization in novel view synthesis and novel pose animation.

Conclusion: The Human Gaussian Graph effectively models animatable avatars by leveraging frame relations and mesh connectivity.

Abstract: Modeling animatable human avatars from videos is a long-standing and challenging problem. While conventional methods require per-instance optimization, recent feed-forward methods have been proposed to generate 3D Gaussians with a learnable network. However, these methods predict Gaussians for each frame independently, without fully capturing the relations of Gaussians from different timestamps. To address this, we propose Human Gaussian Graph to model the connection between predicted Gaussians and human SMPL mesh, so that we can leverage information from all frames to recover an animatable human representation. Specifically, the Human Gaussian Graph contains dual layers where Gaussians are the first layer nodes and mesh vertices serve as the second layer nodes. Based on this structure, we further propose the intra-node operation to aggregate various Gaussians connected to one mesh vertex, and inter-node operation to support message passing among mesh node neighbors. Experimental results on novel view synthesis and novel pose animation demonstrate the efficiency and generalization of our method.

[96] Diffusion-FS: Multimodal Free-Space Prediction via Diffusion for Autonomous Driving

Keshav Gupta, Tejas S. Stanley, Pranjal Paul, Arun K. Singh, K. Madhava Krishna

Main category: cs.CV

TL;DR: The paper proposes a self-supervised method for predicting drivable free-space corridors using monocular camera input, introducing ContourDiff for structured and interpretable predictions.

DetailsMotivation: Existing methods assume BEV-centric representations, which are hard to obtain. The paper aims to predict navigable subsets of road regions using only monocular images.

Method: A self-supervised approach generates free-space samples using future ego trajectories and camera images. ContourDiff, a diffusion-based architecture, denoises contour points for structured predictions.

Result: The method accurately predicts multimodal navigable corridors, validated on nuScenes and CARLA datasets.

Conclusion: The approach effectively addresses the challenge of drivable free-space prediction without relying on BEV representations, offering interpretable results.

Abstract: Drivable Free-space prediction is a fundamental and crucial problem in autonomous driving. Recent works have addressed the problem by representing the entire non-obstacle road regions as the free-space. In contrast our aim is to estimate the driving corridors that are a navigable subset of the entire road region. Unfortunately, existing corridor estimation methods directly assume a BEV-centric representation, which is hard to obtain. In contrast, we frame drivable free-space corridor prediction as a pure image perception task, using only monocular camera input. However such a formulation poses several challenges as one doesn’t have the corresponding data for such free-space corridor segments in the image. Consequently, we develop a novel self-supervised approach for free-space sample generation by leveraging future ego trajectories and front-view camera images, making the process of visual corridor estimation dependent on the ego trajectory. We then employ a diffusion process to model the distribution of such segments in the image. However, the existing binary mask-based representation for a segment poses many limitations. Therefore, we introduce ContourDiff, a specialized diffusion-based architecture that denoises over contour points rather than relying on binary mask representations, enabling structured and interpretable free-space predictions. We evaluate our approach qualitatively and quantitatively on both nuScenes and CARLA, demonstrating its effectiveness in accurately predicting safe multimodal navigable corridors in the image.

[97] Tell Me What You See: An Iterative Deep Learning Framework for Image Captioning

Hitesh Kumar Gupta

Main category: cs.CV

TL;DR: The paper presents an iterative development of image captioning models, from CNN-LSTM to an advanced attention-based system (Nexus), highlighting the importance of attention mechanisms for performance.

DetailsMotivation: To systematically explore and validate architectural enhancements in image captioning, particularly the role of attention mechanisms alongside visual backbone upgrades.

Method: Developed five models, starting with a simple CNN-LSTM and ending with Nexus (EfficientNetV2B3 backbone + dynamic attention), tested on MS COCO 2017.

Result: Nexus achieved a BLEU-4 score of 31.4, outperforming benchmarks and showing that attention is crucial when upgrading visual backbones.

Conclusion: The iterative approach provides a replicable blueprint for modern vision-language tasks, emphasizing attention mechanisms.

Abstract: Image captioning, a task at the confluence of computer vision and natural language processing, requires a sophisticated understanding of both visual scenes and linguistic structure. While modern approaches are dominated by large-scale Transformer architectures, this paper documents a systematic, iterative development of foundational image captioning models, progressing from a simple CNN-LSTM encoder-decoder to a competitive attention-based system. We present a series of five models, beginning with Genesis and concluding with Nexus, an advanced model featuring an EfficientNetV2B3 backbone and a dynamic attention mechanism. Our experiments chart the impact of architectural enhancements and demonstrate a key finding within the classic CNN-LSTM paradigm: merely upgrading the visual backbone without a corresponding attention mechanism can degrade performance, as the single-vector bottleneck cannot transmit the richer visual detail. This insight validates the architectural shift to attention. Trained on the MS COCO 2017 dataset, our final model, Nexus, achieves a BLEU-4 score of 31.4, surpassing several foundational benchmarks and validating our iterative design process. This work provides a clear, replicable blueprint for understanding the core architectural principles that underpin modern vision-language tasks.

[98] Deepfake Detection Via Facial Feature Extraction and Modeling

Benjamin Carter, Nathan Dilla, Micheal Callahan, Atuhaire Ambala

Main category: cs.CV

TL;DR: The paper proposes using facial landmarks for deepfake detection, focusing on inconsistencies in facial movements rather than raw image processing, achieving high accuracy with various neural networks.

DetailsMotivation: Deepfake technology makes it hard to distinguish AI-generated media from genuine content, necessitating new detection methods beyond traditional image processing.

Method: The approach extracts facial landmarks from videos to detect deepfakes, testing the technique on RNN, ANN, and CNN models.

Result: High accuracy was achieved: 96% for RNN, 93% for ANN, and 78% for CNN, showing the method’s effectiveness.

Conclusion: Facial landmark extraction is a viable alternative to raw image processing for deepfake detection, offering compatibility with multiple neural networks and reduced computational requirements.

Abstract: The rise of deepfake technology brings forth new questions about the authenticity of various forms of media found online today. Videos and images generated by artificial intelligence (AI) have become increasingly more difficult to differentiate from genuine media, resulting in the need for new models to detect artificially-generated media. While many models have attempted to solve this, most focus on direct image processing, adapting a convolutional neural network (CNN) or a recurrent neural network (RNN) that directly interacts with the video image data. This paper introduces an approach of using solely facial landmarks for deepfake detection. Using a dataset consisting of both deepfake and genuine videos of human faces, this paper describes an approach for extracting facial landmarks for deepfake detection, focusing on identifying subtle inconsistencies in facial movements instead of raw image processing. Experimental results demonstrated that this feature extraction technique is effective in various neural network models, with the same facial landmarks tested on three neural network models, with promising performance metrics indicating its potential for real-world applications. The findings discussed in this paper include RNN and artificial neural network (ANN) models with accuracy between 96% and 93%, respectively, with a CNN model hovering around 78%. This research challenges the assumption that raw image processing is necessary to identify deepfake videos by presenting a facial feature extraction approach compatible with various neural network models while requiring fewer parameters.

[99] Flow Stochastic Segmentation Networks

Fabio De Sousa Ribeiro, Omar Todd, Charles Jones, Avinash Kori, Raghav Mehta, Ben Glocker

Main category: cs.CV

TL;DR: Flow-SSN is a generative segmentation model with autoregressive and flow variants, overcoming low-rank limitations of prior methods and enabling efficient sampling. It achieves state-of-the-art results in medical imaging.

DetailsMotivation: Address limitations of low-rank parameterization in previous methods and improve efficiency in sampling for segmentation tasks.

Method: Introduces Flow-SSN, featuring discrete-time autoregressive and continuous-time flow variants, focusing on high-rank pixel-wise covariances without storing parameters.

Result: Flow-SSN outperforms standard diffusion-based models in efficiency and achieves state-of-the-art results on medical imaging benchmarks.

Conclusion: Flow-SSN is a powerful and efficient generative segmentation model, advancing the field with its high-rank capability and practical performance.

Abstract: We introduce the Flow Stochastic Segmentation Network (Flow-SSN), a generative segmentation model family featuring discrete-time autoregressive and modern continuous-time flow variants. We prove fundamental limitations of the low-rank parameterisation of previous methods and show that Flow-SSNs can estimate arbitrarily high-rank pixel-wise covariances without assuming the rank or storing the distributional parameters. Flow-SSNs are also more efficient to sample from than standard diffusion-based segmentation models, thanks to most of the model capacity being allocated to learning the base distribution of the flow, constituting an expressive prior. We apply Flow-SSNs to challenging medical imaging benchmarks and achieve state-of-the-art results. Code available: https://github.com/biomedia-mira/flow-ssn.

[100] PTCMIL: Multiple Instance Learning via Prompt Token Clustering for Whole Slide Image Analysis

Beidi Zhao, SangMook Kim, Hao Chen, Chen Zhou, Zu-hua Gao, Gang Wang, Xiaoxiao Li

Main category: cs.CV

TL;DR: PTCMIL introduces prompt token clustering in ViT for MIL aggregation, improving performance and interpretability in WSI analysis.

DetailsMotivation: Existing MIL methods struggle with WSI complexity and heterogeneity, and current approaches like ViTs and clustering are computationally heavy or lack task-specific adaptability.

Method: PTCMIL integrates learnable prompt tokens into ViT, unifying clustering and prediction dynamically. It uses projection-based clustering and token merging for efficient pattern capture.

Result: PTCMIL outperforms state-of-the-art methods in classification and survival analysis across eight datasets, with confirmed robustness and interpretability.

Conclusion: PTCMIL effectively addresses WSI analysis challenges by combining clustering and prediction in an efficient, interpretable framework.

Abstract: Multiple Instance Learning (MIL) has advanced WSI analysis but struggles with the complexity and heterogeneity of WSIs. Existing MIL methods face challenges in aggregating diverse patch information into robust WSI representations. While ViTs and clustering-based approaches show promise, they are computationally intensive and fail to capture task-specific and slide-specific variability. To address these limitations, we propose PTCMIL, a novel Prompt Token Clustering-based ViT for MIL aggregation. By introducing learnable prompt tokens into the ViT backbone, PTCMIL unifies clustering and prediction tasks in an end-to-end manner. It dynamically aligns clustering with downstream tasks, using projection-based clustering tailored to each WSI, reducing complexity while preserving patch heterogeneity. Through token merging and prototype-based pooling, PTCMIL efficiently captures task-relevant patterns. Extensive experiments on eight datasets demonstrate its superior performance in classification and survival analysis tasks, outperforming state-of-the-art methods. Systematic ablation studies confirm its robustness and strong interpretability. The code is released at https://github.com/ubc-tea/PTCMIL.

[101] Phoneme-Level Visual Speech Recognition via Point-Visual Fusion and Language Model Reconstruction

Matthew Kit Khinn Teng, Haibo Zhang, Takeshi Saitoh

Main category: cs.CV

TL;DR: A novel phoneme-based two-stage framework for V-ASR combines visual and landmark motion features with an LLM for word reconstruction, achieving lower error rates.

DetailsMotivation: Existing V-ASR methods struggle with viseme ambiguity and high error rates due to lack of auditory cues.

Method: A two-stage approach: Stage 1 predicts phonemes using visual and landmark features; Stage 2 reconstructs words using an LLM (NLLB).

Result: Achieves 17.4% WER on LRS2 and 21.0% WER on LRS3 datasets.

Conclusion: The proposed PV-ASR method outperforms existing methods by addressing viseme ambiguity and reducing training complexity.

Abstract: Visual Automatic Speech Recognition (V-ASR) is a challenging task that involves interpreting spoken language solely from visual information, such as lip movements and facial expressions. This task is notably challenging due to the absence of auditory cues and the visual ambiguity of phonemes that exhibit similar visemes-distinct sounds that appear identical in lip motions. Existing methods often aim to predict words or characters directly from visual cues, but they commonly suffer from high error rates due to viseme ambiguity and require large amounts of pre-training data. We propose a novel phoneme-based two-stage framework that fuses visual and landmark motion features, followed by an LLM model for word reconstruction to address these challenges. Stage 1 consists of V-ASR, which outputs the predicted phonemes, thereby reducing training complexity. Meanwhile, the facial landmark features address speaker-specific facial characteristics. Stage 2 comprises an encoder-decoder LLM model, NLLB, that reconstructs the output phonemes back to words. Besides using a large visual dataset for deep learning fine-tuning, our PV-ASR method demonstrates superior performance by achieving 17.4% WER on the LRS2 and 21.0% WER on the LRS3 dataset.

[102] Transferable and Undefendable Point Cloud Attacks via Medial Axis Transform

Keke Tang, Yuze Gao, Weilong Peng, Xiaofei Wang, Meie Fang, Peican Zhu

Main category: cs.CV

TL;DR: MAT-Adv is a novel adversarial attack framework for point clouds that enhances transferability and undefendability by perturbing medial axis transform (MAT) representations.

DetailsMotivation: Existing adversarial attack methods for point clouds lack transferability and robustness against defenses, necessitating a more effective approach.

Method: MAT-Adv uses an autoencoder to project point clouds into MAT representations, perturbs these intrinsic structures, and employs dropout to avoid overfitting.

Result: MAT-Adv outperforms state-of-the-art methods in transferability and undefendability.

Conclusion: The framework successfully introduces structural-level adversarial characteristics, improving robustness across models and defenses.

Abstract: Studying adversarial attacks on point clouds is essential for evaluating and improving the robustness of 3D deep learning models. However, most existing attack methods are developed under ideal white-box settings and often suffer from limited transferability to unseen models and insufficient robustness against common defense mechanisms. In this paper, we propose MAT-Adv, a novel adversarial attack framework that enhances both transferability and undefendability by explicitly perturbing the medial axis transform (MAT) representations, in order to induce inherent adversarialness in the resulting point clouds. Specifically, we employ an autoencoder to project input point clouds into compact MAT representations that capture the intrinsic geometric structure of point clouds. By perturbing these intrinsic representations, MAT-Adv introduces structural-level adversarial characteristics that remain effective across diverse models and defense strategies. To mitigate overfitting and prevent perturbation collapse, we incorporate a dropout strategy into the optimization of MAT perturbations, further improving transferability and undefendability. Extensive experiments demonstrate that MAT-Adv significantly outperforms existing state-of-the-art methods in both transferability and undefendability. Codes will be made public upon paper acceptance.

[103] WiSE-OD: Benchmarking Robustness in Infrared Object Detection

Heitor R. Medeiros, Atif Belal, Masih Aminbeidokhti, Eric Granger, Marco Pedersoli

Main category: cs.CV

TL;DR: The paper introduces LLVIP-C and FLIR-C benchmarks for cross-modality OOD robustness in IR object detection and proposes WiSE-OD, a weight-space ensembling method to enhance accuracy and robustness without extra costs.

DetailsMotivation: The scarcity of large-scale IR datasets and the modality gap between RGB and IR imagery compromise the robustness of object detection models under distribution shifts.

Method: Proposes WiSE-OD, a weight-space ensembling method with two variants: WiSE-OD${ZS}$ (combining RGB zero-shot and IR fine-tuned weights) and WiSE-OD${LP}$ (blending zero-shot and linear probing). Introduces LLVIP-C and FLIR-C benchmarks for evaluation.

Result: WiSE-OD improves cross-modality and corruption robustness across three RGB-pretrained detectors and two baselines, without additional training or inference costs.

Conclusion: The proposed method and benchmarks effectively address the robustness challenges in IR object detection by leveraging complementary knowledge from RGB and IR models.

Abstract: Object detection (OD) in infrared (IR) imagery is critical for low-light and nighttime applications. However, the scarcity of large-scale IR datasets forces models to rely on weights pre-trained on RGB images. While fine-tuning on IR improves accuracy, it often compromises robustness under distribution shifts due to the inherent modality gap between RGB and IR. To address this, we introduce LLVIP-C and FLIR-C, two cross-modality out-of-distribution (OOD) benchmarks built by applying corruption to standard IR datasets. Additionally, to fully leverage the complementary knowledge from RGB and infrared trained models, we propose WiSE-OD, a weight-space ensembling method with two variants: WiSE-OD${ZS}$, which combines RGB zero-shot and IR fine-tuned weights, and WiSE-OD${LP}$, which blends zero-shot and linear probing. Evaluated across three RGB-pretrained detectors and two robust baselines, WiSE-OD improves both cross-modality and corruption robustness without any additional training or inference cost.

[104] Perspective from a Higher Dimension: Can 3D Geometric Priors Help Visual Floorplan Localization?

Bolei Chen, Jiaxu Kang, Haonan Yang, Ping Zhong, Jianxin Wang

Main category: cs.CV

TL;DR: The paper introduces a 3D geometric prior-enhanced method for 2D floorplan localization to address challenges like visual changes and occlusions, outperforming existing methods.

DetailsMotivation: Floorplans are robust but minimalist, causing modal and geometric mismatches with visual perceptions. Existing methods struggle with visual changes and occlusions.

Method: The approach uses 3D geometric priors (multi-view constraints and view-scene alignment) via self-supervised contrastive learning, requiring no extra annotations.

Result: The method significantly improves localization accuracy without added computational cost, outperforming state-of-the-art techniques.

Conclusion: Injecting 3D priors bridges modal gaps and enhances floorplan localization, validated by extensive experiments.

Abstract: Since a building’s floorplans are easily accessible, consistent over time, and inherently robust to changes in visual appearance, self-localization within the floorplan has attracted researchers’ interest. However, since floorplans are minimalist representations of a building’s structure, modal and geometric differences between visual perceptions and floorplans pose challenges to this task. While existing methods cleverly utilize 2D geometric features and pose filters to achieve promising performance, they fail to address the localization errors caused by frequent visual changes and view occlusions due to variously shaped 3D objects. To tackle these issues, this paper views the 2D Floorplan Localization (FLoc) problem from a higher dimension by injecting 3D geometric priors into the visual FLoc algorithm. For the 3D geometric prior modeling, we first model geometrically aware view invariance using multi-view constraints, i.e., leveraging imaging geometric principles to provide matching constraints between multiple images that see the same points. Then, we further model the view-scene aligned geometric priors, enhancing the cross-modal geometry-color correspondences by associating the scene’s surface reconstruction with the RGB frames of the sequence. Both 3D priors are modeled through self-supervised contrastive learning, thus no additional geometric or semantic annotations are required. These 3D priors summarized in extensive realistic scenes bridge the modal gap while improving localization success without increasing the computational burden on the FLoc algorithm. Sufficient comparative studies demonstrate that our method significantly outperforms state-of-the-art methods and substantially boosts the FLoc accuracy. All data and code will be released after the anonymous review.

[105] MGHFT: Multi-Granularity Hierarchical Fusion Transformer for Cross-Modal Sticker Emotion Recognition

Jian Chen, Yuxuan Hu, Haifeng Lu, Wei Wang, Min Yang, Chengming Li, Xiping Hu

Main category: cs.CV

TL;DR: The paper proposes MGHFT, a multi-granularity hierarchical fusion transformer, to improve sticker emotion recognition by fusing multi-view textual context with visual features using Multimodal Large Language Models and contrastive learning.

DetailsMotivation: Sticker emotion understanding is challenging due to its reliance on multi-view information like background knowledge and stylistic cues, which existing pre-trained visual models struggle with.

Method: MGHFT uses Multimodal Large Language Models for multi-view textual context, a hierarchical fusion strategy with a pyramid visual transformer, and contrastive learning to integrate textual and visual features.

Result: MGHFT outperforms existing methods on public datasets, achieving 5.4% higher F1 and 4.0% higher accuracy than the best pre-trained visual models.

Conclusion: MGHFT effectively enhances sticker emotion recognition by combining multi-view textual and visual features, demonstrating significant improvements over existing approaches.

Abstract: Although pre-trained visual models with text have demonstrated strong capabilities in visual feature extraction, sticker emotion understanding remains challenging due to its reliance on multi-view information, such as background knowledge and stylistic cues. To address this, we propose a novel multi-granularity hierarchical fusion transformer (MGHFT), with a multi-view sticker interpreter based on Multimodal Large Language Models. Specifically, inspired by the human ability to interpret sticker emotions from multiple views, we first use Multimodal Large Language Models to interpret stickers by providing rich textual context via multi-view descriptions. Then, we design a hierarchical fusion strategy to fuse the textual context into visual understanding, which builds upon a pyramid visual transformer to extract both global and local sticker features at multiple stages. Through contrastive learning and attention mechanisms, textual features are injected at different stages of the visual backbone, enhancing the fusion of global- and local-granularity visual semantics with textual guidance. Finally, we introduce a text-guided fusion attention mechanism to effectively integrate the overall multimodal features, enhancing semantic understanding. Extensive experiments on 2 public sticker emotion datasets demonstrate that MGHFT significantly outperforms existing sticker emotion recognition approaches, achieving higher accuracy and more fine-grained emotion recognition. Compared to the best pre-trained visual models, our MGHFT also obtains an obvious improvement, 5.4% on F1 and 4.0% on accuracy. The code is released at https://github.com/cccccj-03/MGHFT_ACMMM2025.

[106] Synthetic-to-Real Camouflaged Object Detection

Zhihao Luo, Luojun Lin, Zheng Lin

Main category: cs.CV

TL;DR: The paper introduces Syn-to-Real Camouflaged Object Detection (S2R-COD) to address limited datasets in COD by leveraging synthetic data and unannotated real images, proposing the CSRDA framework for domain adaptation.

DetailsMotivation: Limited datasets for COD, especially in specialized categories, and performance degradation when using synthetic data directly.

Method: Proposes CSRDA, a student-teacher model using pseudo labeling and consistency regularization to adapt synthetic data to real-world scenarios.

Result: CSRDA effectively bridges the gap between synthetic and real domains, improving model performance in real-world COD tasks.

Conclusion: The framework mitigates data scarcity and annotation challenges in COD, with code made publicly available.

Abstract: Due to the high cost of collection and labeling, there are relatively few datasets for camouflaged object detection (COD). In particular, for certain specialized categories, the available image dataset is insufficiently populated. Synthetic datasets can be utilized to alleviate the problem of limited data to some extent. However, directly training with synthetic datasets compared to real datasets can lead to a degradation in model performance. To tackle this problem, in this work, we investigate a new task, namely Syn-to-Real Camouflaged Object Detection (S2R-COD). In order to improve the model performance in real world scenarios, a set of annotated synthetic camouflaged images and a limited number of unannotated real images must be utilized. We propose the Cycling Syn-to-Real Domain Adaptation Framework (CSRDA), a method based on the student-teacher model. Specially, CSRDA propagates class information from the labeled source domain to the unlabeled target domain through pseudo labeling combined with consistency regularization. Considering that narrowing the intra-domain gap can improve the quality of pseudo labeling, CSRDA utilizes a recurrent learning framework to build an evolving real domain for bridging the source and target domain. Extensive experiments demonstrate the effectiveness of our framework, mitigating the problem of limited data and handcraft annotations in COD. Our code is publicly available at: https://github.com/Muscape/S2R-COD

[107] HQ-SMem: Video Segmentation and Tracking Using Memory Efficient Object Embedding With Selective Update and Self-Supervised Distillation Feedback

Elham Soltani Kazemi, Imad Eddine Toubal, Gani Rahmon, Jaired Collins, K. Palaniappan

Main category: cs.CV

TL;DR: HQ-SMem improves Video Object Segmentation (VOS) by refining masks, optimizing memory, and updating appearance models, achieving top performance on benchmarks.

DetailsMotivation: Existing VOS models struggle with precise mask delineation, deformable objects, tracking drift, and long sequences. HQ-SMem addresses these limitations.

Method: Combines SAM-HQ for refined masks, dynamic smart memory for efficiency, and updated appearance models to handle complex variations.

Result: Ranks top two on VOTS and VOTSt 2024 datasets, sets benchmarks on Long Video Dataset and LVOS.

Conclusion: HQ-SMem effectively mitigates VOS limitations, excelling in complex, long-term scenarios.

Abstract: Video Object Segmentation (VOS) is foundational to numerous computer vision applications, including surveillance, autonomous driving, robotics and generative video editing. However, existing VOS models often struggle with precise mask delineation, deformable objects, topologically transforming objects, tracking drift and long video sequences. In this paper, we introduce HQ-SMem, for High Quality video segmentation and tracking using Smart Memory, a novel method that enhances the performance of VOS base models by addressing these limitations. Our approach incorporates three key innovations: (i) leveraging SAM with High-Quality masks (SAM-HQ) alongside appearance-based candidate-selection to refine coarse segmentation masks, resulting in improved object boundaries; (ii) implementing a dynamic smart memory mechanism that selectively stores relevant key frames while discarding redundant ones, thereby optimizing memory usage and processing efficiency for long-term videos; and (iii) dynamically updating the appearance model to effectively handle complex topological object variations and reduce drift throughout the video. These contributions mitigate several limitations of existing VOS models including, coarse segmentations that mix-in background pixels, fixed memory update schedules, brittleness to drift and occlusions, and prompt ambiguity issues associated with SAM. Extensive experiments conducted on multiple public datasets and state-of-the-art base trackers demonstrate that our method consistently ranks among the top two on VOTS and VOTSt 2024 datasets. Moreover, HQ-SMem sets new benchmarks on Long Video Dataset and LVOS, showcasing its effectiveness in challenging scenarios characterized by complex multi-object dynamics over extended temporal durations.

[108] Underwater Waste Detection Using Deep Learning A Performance Comparison of YOLOv7 to 10 and Faster RCNN

UMMPK Nawarathne, HMNS Kumari, HMLS Kumari

Main category: cs.CV

TL;DR: YOLOv8 outperforms other models (YOLOv7, YOLOv9, YOLOv10, Faster R-CNN) in underwater waste detection with 80.9% mAP, due to its advanced architecture.

DetailsMotivation: Accurate detection of underwater waste is critical for environmental monitoring and cleanup efforts.

Method: Tested five object recognition models (YOLOv7-YOLOv10, Faster R-CNN) on a diverse underwater dataset with 15 classes.

Result: YOLOv8 achieved the highest mAP (80.9%), attributed to its anchor-free mechanisms and self-supervised learning.

Conclusion: YOLOv8 is a promising tool for scalable and efficient underwater pollution detection.

Abstract: Underwater pollution is one of today’s most significant environmental concerns, with vast volumes of garbage found in seas, rivers, and landscapes around the world. Accurate detection of these waste materials is crucial for successful waste management, environmental monitoring, and mitigation strategies. In this study, we investigated the performance of five cutting-edge object recognition algorithms, namely YOLO (You Only Look Once) models, including YOLOv7, YOLOv8, YOLOv9, YOLOv10, and Faster Region-Convolutional Neural Network (R-CNN), to identify which model was most effective at recognizing materials in underwater situations. The models were thoroughly trained and tested on a large dataset containing fifteen different classes under diverse conditions, such as low visibility and variable depths. From the above-mentioned models, YOLOv8 outperformed the others, with a mean Average Precision (mAP) of 80.9%, indicating a significant performance. This increased performance is attributed to YOLOv8’s architecture, which incorporates advanced features such as improved anchor-free mechanisms and self-supervised learning, allowing for more precise and efficient recognition of items in a variety of settings. These findings highlight the YOLOv8 model’s potential as an effective tool in the global fight against pollution, improving both the detection capabilities and scalability of underwater cleanup operations.

[109] Gaussian Set Surface Reconstruction through Per-Gaussian Optimization

Zhentao Huang, Di Wu, Zhenbang He, Minglun Gong

Main category: cs.CV

TL;DR: GSSR improves 3D Gaussian Splatting by ensuring even Gaussian distribution and alignment with surface normals, enhancing geometric accuracy and scene editing.

DetailsMotivation: Existing methods like 3DGS and PGSR fail to optimize individual Gaussian placement, leading to uneven distribution and poor geometry reconstruction.

Method: GSSR combines pixel-level and Gaussian-level normal consistency with photometric consistency, adds opacity regularization, and uses periodic reinitialization for cleaner distribution.

Result: GSSR achieves improved geometric precision, enabling better scene editing and novel view synthesis while maintaining rendering quality.

Conclusion: GSSR effectively addresses geometric inaccuracies in Gaussian-based representations, offering a robust solution for 3D scene reconstruction and editing.

Abstract: 3D Gaussian Splatting (3DGS) effectively synthesizes novel views through its flexible representation, yet fails to accurately reconstruct scene geometry. While modern variants like PGSR introduce additional losses to ensure proper depth and normal maps through Gaussian fusion, they still neglect individual placement optimization. This results in unevenly distributed Gaussians that deviate from the latent surface, complicating both reconstruction refinement and scene editing. Motivated by pioneering work on Point Set Surfaces, we propose Gaussian Set Surface Reconstruction (GSSR), a method designed to distribute Gaussians evenly along the latent surface while aligning their dominant normals with the surface normal. GSSR enforces fine-grained geometric alignment through a combination of pixel-level and Gaussian-level single-view normal consistency and multi-view photometric consistency, optimizing both local and global perspectives. To further refine the representation, we introduce an opacity regularization loss to eliminate redundant Gaussians and apply periodic depth- and normal-guided Gaussian reinitialization for a cleaner, more uniform spatial distribution. Our reconstruction results demonstrate significantly improved geometric precision in Gaussian placement, enabling intuitive scene editing and efficient generation of novel Gaussian-based 3D environments. Extensive experiments validate GSSR’s effectiveness, showing enhanced geometric accuracy while preserving high-quality rendering performance.

[110] MedIQA: A Scalable Foundation Model for Prompt-Driven Medical Image Quality Assessment

Siyi Xun, Yue Sun, Jingkun Chen, Zitong Yu, Tong Tong, Xiaohong Liu, Mingxiang Wu, Tao Tan

Main category: cs.CV

TL;DR: MedIQA is a foundation model for medical image quality assessment (IQA) that outperforms existing methods by handling diverse modalities and clinical scenarios, supported by a large annotated dataset.

DetailsMotivation: Existing medical IQA methods lack generalization across diverse modalities and clinical scenarios, necessitating a more robust solution.

Method: MedIQA integrates a salient slice assessment module and an automatic prompt strategy, leveraging a large multi-modality dataset with expert annotations.

Result: MedIQA significantly outperforms baselines in multiple downstream tasks, proving its effectiveness.

Conclusion: MedIQA establishes a scalable framework for medical IQA, enhancing diagnostic workflows and clinical decision-making.

Abstract: Rapid advances in medical imaging technology underscore the critical need for precise and automated image quality assessment (IQA) to ensure diagnostic accuracy. Existing medical IQA methods, however, struggle to generalize across diverse modalities and clinical scenarios. In response, we introduce MedIQA, the first comprehensive foundation model for medical IQA, designed to handle variability in image dimensions, modalities, anatomical regions, and types. We developed a large-scale multi-modality dataset with plentiful manually annotated quality scores to support this. Our model integrates a salient slice assessment module to focus on diagnostically relevant regions feature retrieval and employs an automatic prompt strategy that aligns upstream physical parameter pre-training with downstream expert annotation fine-tuning. Extensive experiments demonstrate that MedIQA significantly outperforms baselines in multiple downstream tasks, establishing a scalable framework for medical IQA and advancing diagnostic workflows and clinical decision-making.

[111] PDT: Point Distribution Transformation with Diffusion Models

Jionghao Wang, Cheng Lin, Yuan Liu, Rui Xu, Zhiyang Dou, Xiao-Xiao Long, Hao-Xiang Guo, Taku Komura, Wenping Wang, Xin Li

Main category: cs.CV

TL;DR: PDT is a framework using diffusion models to transform unstructured point clouds into semantically meaningful distributions, capturing both geometric and semantic features.

DetailsMotivation: Extracting meaningful structural information from unstructured point clouds and transforming them into semantically useful distributions is an under-explored problem.

Method: PDT employs diffusion models with novel architecture and learning strategy to correlate source and target distributions via denoising.

Result: The method successfully transforms point clouds into structured outputs like keypoints, joints, and feature lines.

Conclusion: PDT offers a powerful tool for 3D geometry processing tasks requiring structured point distributions.

Abstract: Point-based representations have consistently played a vital role in geometric data structures. Most point cloud learning and processing methods typically leverage the unordered and unconstrained nature to represent the underlying geometry of 3D shapes. However, how to extract meaningful structural information from unstructured point cloud distributions and transform them into semantically meaningful point distributions remains an under-explored problem. We present PDT, a novel framework for point distribution transformation with diffusion models. Given a set of input points, PDT learns to transform the point set from its original geometric distribution into a target distribution that is semantically meaningful. Our method utilizes diffusion models with novel architecture and learning strategy, which effectively correlates the source and the target distribution through a denoising process. Through extensive experiments, we show that our method successfully transforms input point clouds into various forms of structured outputs - ranging from surface-aligned keypoints, and inner sparse joints to continuous feature lines. The results showcase our framework’s ability to capture both geometric and semantic features, offering a powerful tool for various 3D geometry processing tasks where structured point distributions are desired. Code will be available at this link: https://github.com/shanemankiw/PDT.

[112] PerioDet: Large-Scale Panoramic Radiograph Benchmark for Clinical-Oriented Apical Periodontitis Detection

Xiaocheng Fang, Jieyi Cai, Huanyu Liu, Chengju Zhou, Minhua Lu, Bingzhi Chen

Main category: cs.CV

TL;DR: The paper introduces ‘PerioXrays’, a large-scale annotated dataset for apical periodontitis, and proposes ‘PerioDet’, a detection method combining BDA and IDC to improve automated diagnosis.

DetailsMotivation: The lack of a high-quality annotated dataset hinders the development of CAD systems for apical periodontitis.

Method: Proposes PerioDet, integrating Background-Denoising Attention (BDA) and IoU-Dynamic Calibration (IDC) to tackle background noise and small targets.

Result: PerioDet outperforms on the PerioXrays dataset, and human-computer experiments confirm its clinical utility.

Conclusion: The PerioXrays dataset and PerioDet method advance automated apical periodontitis diagnosis, showing promise for clinical use.

Abstract: Apical periodontitis is a prevalent oral pathology that presents significant public health challenges. Despite advances in automated diagnostic systems across various medical fields, the development of Computer-Aided Diagnosis (CAD) applications for apical periodontitis is still constrained by the lack of a large-scale, high-quality annotated dataset. To address this issue, we release a large-scale panoramic radiograph benchmark called “PerioXrays”, comprising 3,673 images and 5,662 meticulously annotated instances of apical periodontitis. To the best of our knowledge, this is the first benchmark dataset for automated apical periodontitis diagnosis. This paper further proposes a clinical-oriented apical periodontitis detection (PerioDet) paradigm, which jointly incorporates Background-Denoising Attention (BDA) and IoU-Dynamic Calibration (IDC) mechanisms to address the challenges posed by background noise and small targets in automated detection. Extensive experiments on the PerioXrays dataset demonstrate the superiority of PerioDet in advancing automated apical periodontitis detection. Additionally, a well-designed human-computer collaborative experiment underscores the clinical applicability of our method as an auxiliary diagnostic tool for professional dentists.

[113] Closing the Modality Gap for Mixed Modality Search

Binxu Li, Yuhui Zhang, Xiaohan Wang, Weixin Liang, Ludwig Schmidt, Serena Yeung-Levy

Main category: cs.CV

TL;DR: GR-CLIP, a lightweight calibration method, addresses the modality gap in CLIP’s embedding space, improving mixed modality search performance by up to 26 percentage points.

DetailsMotivation: Mixed modality search is important but underexplored, and existing models like CLIP exhibit a modality gap, causing ranking bias and fusion failure.

Method: Proposes GR-CLIP, a post-hoc calibration method to remove the modality gap in CLIP’s embedding space.

Result: GR-CLIP improves NDCG@10 by up to 26 percentage points over CLIP and outperforms other models with 75x less compute.

Conclusion: GR-CLIP effectively addresses the modality gap, enhancing mixed modality search efficiency and performance.

Abstract: Mixed modality search – retrieving information across a heterogeneous corpus composed of images, texts, and multimodal documents – is an important yet underexplored real-world application. In this work, we investigate how contrastive vision-language models, such as CLIP, perform on the mixed modality search task. Our analysis reveals a critical limitation: these models exhibit a pronounced modality gap in the embedding space, where image and text embeddings form distinct clusters, leading to intra-modal ranking bias and inter-modal fusion failure. To address this issue, we propose GR-CLIP, a lightweight post-hoc calibration method that removes the modality gap in CLIP’s embedding space. Evaluated on MixBench – the first benchmark specifically designed for mixed modality search – GR-CLIP improves NDCG@10 by up to 26 percentage points over CLIP, surpasses recent vision-language generative embedding models by 4 percentage points, while using 75x less compute.

[114] YOLO for Knowledge Extraction from Vehicle Images: A Baseline Study

Saraa Al-Saddik, Manna Elizabeth Philip, Ali Haidar

Main category: cs.CV

TL;DR: The study evaluates deep learning models (YOLO-v11, YOLO-World, YOLO-Classification) for vehicle attribute identification, achieving high accuracy (up to 94.86%) using multi-view inference (MVI). Object detection models outperformed classification-only ones, with smaller YOLO variants being efficient for real-time use.

DetailsMotivation: Accurate vehicle attribute identification is crucial for law enforcement and intelligence, requiring robust models for real-world, unconstrained conditions.

Method: Three YOLO-based models were tested on a large dataset (100,000+ images) for make, shape, and color prediction, using MVI to enhance performance. Models were evaluated on a separate dataset of 29,937 images.

Result: High accuracy was achieved: 93.70% (make), 82.86% (shape), 85.19% (color), and 94.86% (color-binary). Object detection models (YOLO-v11, YOLO-World) outperformed classification-only models. Smaller YOLO variants were efficient for real-time use.

Conclusion: MVI is essential for usable models in complex datasets. Object detection models are superior for make and shape extraction, and smaller YOLO variants offer efficiency benefits. The work provides a baseline for real-world vehicle metadata extraction.

Abstract: Accurate identification of vehicle attributes such as make, colour, and shape is critical for law enforcement and intelligence applications. This study evaluates the effectiveness of three state-of-the-art deep learning approaches YOLO-v11, YOLO-World, and YOLO-Classification on a real-world vehicle image dataset. This dataset was collected under challenging and unconstrained conditions by NSW Police Highway Patrol Vehicles. A multi-view inference (MVI) approach was deployed to enhance the performance of the models’ predictions. To conduct the analyses, datasets with 100,000 plus images were created for each of the three metadata prediction tasks, specifically make, shape and colour. The models were tested on a separate dataset with 29,937 images belonging to 1809 number plates. Different sets of experiments have been investigated by varying the models sizes. A classification accuracy of 93.70%, 82.86%, 85.19%, and 94.86% was achieved with the best performing make, shape, colour, and colour-binary models respectively. It was concluded that there is a need to use MVI to get usable models within such complex real-world datasets. Our findings indicated that the object detection models YOLO-v11 and YOLO-World outperformed classification-only models in make and shape extraction. Moreover, smaller YOLO variants perform comparably to larger counterparts, offering substantial efficiency benefits for real-time predictions. This work provides a robust baseline for extracting vehicle metadata in real-world scenarios. Such models can be used in filtering and sorting user queries, minimising the time required to search large vehicle images datasets.

[115] AEDR: Training-Free AI-Generated Image Attribution via Autoencoder Double-Reconstruction

Chao Wang, Kejiang Chen, Zijin Yang, Yaofei Wang, Weiming Zhang

Main category: cs.CV

TL;DR: AEDR is a training-free attribution method for generative models, using double-reconstruction and image homogeneity to improve accuracy and efficiency.

DetailsMotivation: Addressing the reduced accuracy and high computational costs of existing reconstruction-based attribution methods for tracing image origins.

Method: AEDR performs two consecutive reconstructions using a model’s autoencoder, using the ratio of losses as the attribution signal, calibrated by image homogeneity.

Result: AEDR achieves 25.5% higher accuracy than existing methods and requires only 1% of the computational time.

Conclusion: AEDR is a highly efficient and accurate solution for attributing images generated by SOTA models.

Abstract: The rapid advancement of image-generation technologies has made it possible for anyone to create photorealistic images using generative models, raising significant security concerns. To mitigate malicious use, tracing the origin of such images is essential. Reconstruction-based attribution methods offer a promising solution, but they often suffer from reduced accuracy and high computational costs when applied to state-of-the-art (SOTA) models. To address these challenges, we propose AEDR (AutoEncoder Double-Reconstruction), a novel training-free attribution method designed for generative models with continuous autoencoders. Unlike existing reconstruction-based approaches that rely on the value of a single reconstruction loss, AEDR performs two consecutive reconstructions using the model’s autoencoder, and adopts the ratio of these two reconstruction losses as the attribution signal. This signal is further calibrated using the image homogeneity metric to improve accuracy, which inherently cancels out absolute biases caused by image complexity, with autoencoder-based reconstruction ensuring superior computational efficiency. Experiments on eight top latent diffusion models show that AEDR achieves 25.5% higher attribution accuracy than existing reconstruction-based methods, while requiring only 1% of the computational time.

[116] MedSymmFlow: Bridging Generative Modeling and Classification in Medical Imaging through Symmetrical Flow Matching

Francisco Caetano, Lemar Abdi, Christiaan Viviers, Amaan Valiuddin, Fons van der Sommen

Main category: cs.CV

TL;DR: MedSymmFlow is a hybrid model combining generative and discriminative approaches for medical image classification, offering accurate predictions and reliable uncertainty estimates.

DetailsMotivation: To address the need for accurate and well-calibrated uncertainty estimates in medical image classification for clinical decision-making.

Method: Uses Symmetrical Flow Matching with a latent-space formulation and semantic mask conditioning to unify classification, generation, and uncertainty quantification.

Result: Matches or outperforms baselines in accuracy and AUC on MedMNIST datasets, with validated uncertainty estimates.

Conclusion: MedSymmFlow effectively integrates classification, generation, and uncertainty quantification, proving useful for high-stakes medical imaging.

Abstract: Reliable medical image classification requires accurate predictions and well-calibrated uncertainty estimates, especially in high-stakes clinical settings. This work presents MedSymmFlow, a generative-discriminative hybrid model built on Symmetrical Flow Matching, designed to unify classification, generation, and uncertainty quantification in medical imaging. MedSymmFlow leverages a latent-space formulation that scales to high-resolution inputs and introduces a semantic mask conditioning mechanism to enhance diagnostic relevance. Unlike standard discriminative models, it naturally estimates uncertainty through its generative sampling process. The model is evaluated on four MedMNIST datasets, covering a range of modalities and pathologies. The results show that MedSymmFlow matches or exceeds the performance of established baselines in classification accuracy and AUC, while also delivering reliable uncertainty estimates validated by performance improvements under selective prediction.

[117] UPP: Unified Point-Level Prompting for Robust Point Cloud Analysis

Zixiang Ai, Zhenyu Cui, Yuxin Peng, Jiahuan Zhou

Main category: cs.CV

TL;DR: A unified point-level prompting method is proposed to address noise and incompleteness in point clouds, enhancing downstream analysis tasks efficiently.

DetailsMotivation: Existing methods for point cloud enhancement (denoising and completion) are isolated from downstream tasks, limiting their effectiveness in real-world scenarios. Conflicting objectives between tasks further hinder performance.

Method: Introduces a Rectification Prompter for noise filtering and a Completion Prompter for robustness. A Shape-Aware Unit unifies features for downstream analysis.

Result: Demonstrates superiority and robustness on four datasets for noisy and incomplete point cloud data.

Conclusion: The proposed method effectively unifies point cloud enhancement with downstream tasks, outperforming state-of-the-art approaches.

Abstract: Pre-trained point cloud analysis models have shown promising advancements in various downstream tasks, yet their effectiveness is typically suffering from low-quality point cloud (i.e., noise and incompleteness), which is a common issue in real scenarios due to casual object occlusions and unsatisfactory data collected by 3D sensors. To this end, existing methods focus on enhancing point cloud quality by developing dedicated denoising and completion models. However, due to the isolation between the point cloud enhancement and downstream tasks, these methods fail to work in various real-world domains. In addition, the conflicting objectives between denoising and completing tasks further limit the ensemble paradigm to preserve critical geometric features. To tackle the above challenges, we propose a unified point-level prompting method that reformulates point cloud denoising and completion as a prompting mechanism, enabling robust analysis in a parameter-efficient manner. We start by introducing a Rectification Prompter to adapt to noisy points through the predicted rectification vector prompts, effectively filtering noise while preserving intricate geometric features essential for accurate analysis. Sequentially, we further incorporate a Completion Prompter to generate auxiliary point prompts based on the rectified point clouds, facilitating their robustness and adaptability. Finally, a Shape-Aware Unit module is exploited to efficiently unify and capture the filtered geometric features for the downstream point cloud analysis.Extensive experiments on four datasets demonstrate the superiority and robustness of our method when handling noisy and incomplete point cloud data against existing state-of-the-art methods. Our code is released at https://github.com/zhoujiahuan1991/ICCV2025-UPP.

[118] GPSMamba: A Global Phase and Spectral Prompt-guided Mamba for Infrared Image Super-Resolution

Yongsong Huang, Tomo Miyazaki, Xiaofeng Liu, Shinichiro Omachi

Main category: cs.CV

TL;DR: GPSMamba improves infrared image super-resolution by integrating non-causal guidance and spectral-phase supervision into Mamba models, achieving state-of-the-art results.

DetailsMotivation: Infrared images suffer from low contrast and sparse textures, requiring robust long-range modeling for coherence, which current 1D causal models like Mamba struggle with due to fragmented context.

Method: Proposes GPSMamba with an Adaptive Semantic-Frequency State Space Module (ASF-SSM) for non-local context and a Thermal-Spectral Attention with Phase Consistency Loss for global supervision.

Result: GPSMamba achieves state-of-the-art performance in infrared image restoration.

Conclusion: The framework effectively mitigates causal modeling limitations, offering a powerful paradigm for infrared image super-resolution.

Abstract: Infrared Image Super-Resolution (IRSR) is challenged by the low contrast and sparse textures of infrared data, requiring robust long-range modeling to maintain global coherence. While State-Space Models like Mamba offer proficiency in modeling long-range dependencies for this task, their inherent 1D causal scanning mechanism fragments the global context of 2D images, hindering fine-detail restoration. To address this, we propose Global Phase and Spectral Prompt-guided Mamba (GPSMamba), a framework that synergizes architectural guidance with non-causal supervision. First, our Adaptive Semantic-Frequency State Space Module (ASF-SSM) injects a fused semantic-frequency prompt directly into the Mamba block, integrating non-local context to guide reconstruction. Then, a novel Thermal-Spectral Attention and Phase Consistency Loss provides explicit, non-causal supervision to enforce global structural and spectral fidelity. By combining these two innovations, our work presents a systematic strategy to mitigate the limitations of causal modeling. Extensive experiments demonstrate that GPSMamba achieves state-of-the-art performance, validating our approach as a powerful new paradigm for infrared image restoration. Code is available at https://github.com/yongsongH/GPSMamba.

[119] PatchTraj: Dynamic Patch Representation Learning for Time-Frequency Trajectory Prediction

Yanghong Liu, Xingping Dong, Ming Li, Weixing Zhang, Yidong Lou

Main category: cs.CV

TL;DR: PatchTraj is a dynamic patch-based framework for pedestrian trajectory prediction, unifying time and frequency domains to improve motion modeling and achieve state-of-the-art results.

DetailsMotivation: Existing methods inadequately model human motion dynamics and lack interaction between time and frequency domains in trajectory sequences.

Method: Decomposes trajectories into time and frequency components, uses dynamic patch partitioning, adaptive embedding, and cross-modal attention for fusion, and employs a Transformer encoder-decoder for prediction.

Result: Achieves state-of-the-art performance on ETH-UCY, SDD, NBA, and JRDB datasets with high efficiency.

Conclusion: PatchTraj effectively balances local motion details and long-range dependencies, outperforming existing methods.

Abstract: Pedestrian trajectory prediction is crucial for autonomous driving and robotics. While existing point-based and grid-based methods expose two key limitations: insufficiently modeling human motion dynamics, as they fail to balance local motion details with long-range spatiotemporal dependencies, and the time representation lacks interaction with the frequency domain in modeling trajectory sequences. To address these challenges, we propose PatchTraj, a dynamic patch-based trajectory prediction framework that unifies time-domain and frequency-domain representations. Specifically, we decompose the trajectory into raw time sequences and frequency components, employing dynamic patch partitioning for multi-scale trajectory segmentation to capture hierarchical motion patterns. Each patch is processed by an adaptive embedding layer with scale-aware feature extraction, followed by hierarchical feature aggregation to model both fine-grained and long-range dependencies. The outputs of two branches interact via cross-modal attention, enabling complementary fusion of temporal and spectral cues. Finally, a Transformer encoder-decoder integrates both modalities to autoregressively predict future trajectories. Extensive experiments on ETH-UCY, SDD, NBA, and JRDB datasets demonstrate that our method achieves state-of-the-art performance with high efficiency.

[120] Enhancing Reward Models for High-quality Image Generation: Beyond Text-Image Alignment

Ying Ba, Tianyu Zhang, Yalong Bai, Wenyi Mo, Tao Liang, Bing Su, Ji-Rong Wen

Main category: cs.CV

TL;DR: The paper introduces ICT and HP scores to better evaluate image generation systems, addressing flaws in current reward models and improving alignment with human aesthetic preferences.

DetailsMotivation: Existing evaluation frameworks for image generation systems fail to align with human aesthetic preferences, particularly in assessing detailed and high-quality images.

Method: Developed the ICT score for text-image alignment and the HP score for aesthetics and detail quality, training models using CLIP and BLIP architectures.

Result: The proposed model improves scoring accuracy by over 10% and enhances optimization of text-to-image models.

Conclusion: The study advances image generation evaluation by aligning it with higher-order human aesthetic preferences, supported by theoretical and empirical evidence.

Abstract: Contemporary image generation systems have achieved high fidelity and superior aesthetic quality beyond basic text-image alignment. However, existing evaluation frameworks have failed to evolve in parallel. This study reveals that human preference reward models fine-tuned based on CLIP and BLIP architectures have inherent flaws: they inappropriately assign low scores to images with rich details and high aesthetic value, creating a significant discrepancy with actual human aesthetic preferences. To address this issue, we design a novel evaluation score, ICT (Image-Contained-Text) score, that achieves and surpasses the objectives of text-image alignment by assessing the degree to which images represent textual content. Building upon this foundation, we further train an HP (High-Preference) score model using solely the image modality to enhance image aesthetics and detail quality while maintaining text-image alignment. Experiments demonstrate that the proposed evaluation model improves scoring accuracy by over 10% compared to existing methods, and achieves significant results in optimizing state-of-the-art text-to-image models. This research provides theoretical and empirical support for evolving image generation technology toward higher-order human aesthetic preferences. Code is available at https://github.com/BarretBa/ICTHP.

[121] A Survey of Multimodal Hallucination Evaluation and Detection

Zhiyuan Chen, Yuecong Min, Jie Zhang, Bei Yan, Jiahao Wang, Xiaozhen Wang, Shiguang Shan

Main category: cs.CV

TL;DR: A survey on hallucination in Multi-modal Large Language Models (MLLMs), covering evaluation benchmarks, detection methods, and future directions for Image-to-Text and Text-to-Image tasks.

DetailsMotivation: MLLMs often produce plausible but incorrect content (hallucination), necessitating a review of benchmarks and detection methods to address this issue.

Method: Proposes a hallucination taxonomy, reviews benchmarks (construction, objectives, metrics), and summarizes detection methods.

Result: Identifies limitations in current benchmarks and detection methods, suggesting areas for improvement.

Conclusion: Future research should focus on enhancing benchmarks and detection methods to mitigate hallucination in MLLMs.

Abstract: Multi-modal Large Language Models (MLLMs) have emerged as a powerful paradigm for integrating visual and textual information, supporting a wide range of multi-modal tasks. However, these models often suffer from hallucination, producing content that appears plausible but contradicts the input content or established world knowledge. This survey offers an in-depth review of hallucination evaluation benchmarks and detection methods across Image-to-Text (I2T) and Text-to-image (T2I) generation tasks. Specifically, we first propose a taxonomy of hallucination based on faithfulness and factuality, incorporating the common types of hallucinations observed in practice. Then we provide an overview of existing hallucination evaluation benchmarks for both T2I and I2T tasks, highlighting their construction process, evaluation objectives, and employed metrics. Furthermore, we summarize recent advances in hallucination detection methods, which aims to identify hallucinated content at the instance level and serve as a practical complement of benchmark-based evaluation. Finally, we highlight key limitations in current benchmarks and detection methods, and outline potential directions for future research.

[122] LOTUS: A Leaderboard for Detailed Image Captioning from Quality to Societal Bias and User Preferences

Yusuke Hirota, Boyi Li, Ryo Hachiuma, Yueh-Hua Wu, Boris Ivanovic, Yuta Nakashima, Marco Pavone, Yejin Choi, Yu-Chiang Frank Wang, Chao-Han Huck Yang

Main category: cs.CV

TL;DR: LOTUS is a leaderboard for evaluating detailed image captions by LVLMs, addressing gaps in standardized criteria, bias-aware assessments, and user preferences. It reveals no single model excels in all aspects, with trade-offs between detail and bias risks.

DetailsMotivation: Existing evaluations lack standardized criteria, bias awareness, and user preference considerations, necessitating a comprehensive leaderboard like LOTUS.

Method: LOTUS evaluates caption quality, risks (e.g., hallucination), and societal biases (e.g., gender bias), while incorporating user preference-oriented criteria.

Result: Analysis shows no LVLM excels across all criteria, with correlations between caption detail and bias risks. User preferences influence optimal model selection.

Conclusion: LOTUS provides a holistic evaluation framework for detailed captions, highlighting trade-offs and the importance of user-specific priorities in model selection.

Abstract: Large Vision-Language Models (LVLMs) have transformed image captioning, shifting from concise captions to detailed descriptions. We introduce LOTUS, a leaderboard for evaluating detailed captions, addressing three main gaps in existing evaluations: lack of standardized criteria, bias-aware assessments, and user preference considerations. LOTUS comprehensively evaluates various aspects, including caption quality (e.g., alignment, descriptiveness), risks (\eg, hallucination), and societal biases (e.g., gender bias) while enabling preference-oriented evaluations by tailoring criteria to diverse user preferences. Our analysis of recent LVLMs reveals no single model excels across all criteria, while correlations emerge between caption detail and bias risks. Preference-oriented evaluations demonstrate that optimal model selection depends on user priorities.

[123] A New One-Shot Federated Learning Framework for Medical Imaging Classification with Feature-Guided Rectified Flow and Knowledge Distillation

Yufei Ma, Hanwen Zhang, Qiya Yang, Guibo Luo, Yuesheng Zhu

Main category: cs.CV

TL;DR: A modified One-Shot Federated Learning (OSFL) framework with Feature-Guided Rectified Flow Model (FG-RF) and Dual-Layer Knowledge Distillation (DLKD) improves efficiency, privacy, and performance in non-IID medical imaging scenarios.

DetailsMotivation: Existing OSFL methods in healthcare face low efficiency, privacy risks, and challenges with non-IID data.

Method: Proposed FG-RF for efficient, private generative modeling and DLKD for handling non-IID data during aggregation.

Result: Outperforms multi-round FL by 21.73% and FedISCA by 21.75%, with reduced privacy risks.

Conclusion: The framework enhances OSFL for medical imaging, addressing efficiency, privacy, and non-IID challenges.

Abstract: In multi-center scenarios, One-Shot Federated Learning (OSFL) has attracted increasing attention due to its low communication overhead, requiring only a single round of transmission. However, existing generative model-based OSFL methods suffer from low training efficiency and potential privacy leakage in the healthcare domain. Additionally, achieving convergence within a single round of model aggregation is challenging under non-Independent and Identically Distributed (non-IID) data. To address these challenges, in this paper a modified OSFL framework is proposed, in which a new Feature-Guided Rectified Flow Model (FG-RF) and Dual-Layer Knowledge Distillation (DLKD) aggregation method are developed. FG-RF on the client side accelerates generative modeling in medical imaging scenarios while preserving privacy by synthesizing feature-level images rather than pixel-level images. To handle non-IID distributions, DLKD enables the global student model to simultaneously mimic the output logits and align the intermediate-layer features of client-side teacher models during aggregation. Experimental results on three non-IID medical imaging datasets show that our new framework and method outperform multi-round federated learning approaches, achieving up to 21.73% improvement, and exceeds the baseline FedISCA by an average of 21.75%. Furthermore, our experiments demonstrate that feature-level synthetic images significantly reduce privacy leakage risks compared to pixel-level synthetic images.

[124] Probing Multimodal Fusion in the Brain: The Dominance of Audiovisual Streams in Naturalistic Encoding

Hamid Abdollahi, Amir Hossein Mansouri Majoumerd, Amir Hossein Bagheri Baboukani, Amir Abolfazl Suratgar, Mohammad Bagher Menhaj

Main category: cs.CV

TL;DR: The paper explores brain encoding models using advanced visual and auditory feature extractors, revealing a trade-off between model complexity and generalization, with simpler models performing better on out-of-distribution data.

DetailsMotivation: To address the challenge of predicting brain activity in response to naturalistic, multimodal stimuli and test the generalization of encoding models to novel contexts.

Method: Developed brain encoding models using X-CLIP (visual) and Whisper (auditory) feature extractors, evaluated on in-distribution and out-of-distribution data.

Result: A simpler linear model outperformed a complex attention-based model on OOD data by 18%, and linguistic features did not improve accuracy. Performance gains were notable in the auditory cortex.

Conclusion: Rigorous OOD testing is crucial for robust neuro-AI models, and model architecture, stimulus characteristics, and sensory hierarchies significantly influence neural encoding.

Abstract: Predicting brain activity in response to naturalistic, multimodal stimuli is a key challenge in computational neuroscience. While encoding models are becoming more powerful, their ability to generalize to truly novel contexts remains a critical, often untested, question. In this work, we developed brain encoding models using state-of-the-art visual (X-CLIP) and auditory (Whisper) feature extractors and rigorously evaluated them on both in-distribution (ID) and diverse out-of-distribution (OOD) data. Our results reveal a fundamental trade-off between model complexity and generalization: a higher-capacity attention-based model excelled on ID data, but a simpler linear model was more robust, outperforming a competitive baseline by 18% on the OOD set. Intriguingly, we found that linguistic features did not improve predictive accuracy, suggesting that for familiar languages, neural encoding may be dominated by the continuous visual and auditory streams over redundant textual information. Spatially, our approach showed marked performance gains in the auditory cortex, underscoring the benefit of high-fidelity speech representations. Collectively, our findings demonstrate that rigorous OOD testing is essential for building robust neuro-AI models and provides nuanced insights into how model architecture, stimulus characteristics, and sensory hierarchies shape the neural encoding of our rich, multimodal world.

[125] MMBench-GUI: Hierarchical Multi-Platform Evaluation Framework for GUI Agents

Xuehui Wang, Zhenyu Wu, JingJing Xie, Zichen Ding, Bowen Yang, Zehao Li, Zhaoyang Liu, Qingyun Li, Xuan Dong, Zhe Chen, Weiyun Wang, Xiangyu Zhao, Jixuan Chen, Haodong Duan, Tianbao Xie, Chenyu Yang, Shiqian Su, Yue Yu, Yuan Huang, Yiqian Liu, Xiao Zhang, Yanting Zhang, Xiangyu Yue, Weijie Su, Xizhou Zhu, Wei Shen, Jifeng Dai, Wenhai Wang

Main category: cs.CV

TL;DR: MMBench-GUI is a hierarchical benchmark for evaluating GUI automation agents across multiple platforms, introducing the Efficiency-Quality Area (EQA) metric to assess efficiency. It highlights the importance of visual grounding, modular frameworks, task planning, and cross-platform generalization for reliable automation.

DetailsMotivation: To address the lack of a comprehensive benchmark for evaluating GUI automation agents across diverse platforms and to emphasize the importance of efficiency and quality in automation tasks.

Method: Develops MMBench-GUI with four levels (GUI Content Understanding, Element Grounding, Task Automation, Task Collaboration) and proposes the EQA metric for efficiency assessment.

Result: Identifies visual grounding as critical for task success, highlights inefficiencies in current models, and stresses the need for precise localization, planning, and early stopping.

Conclusion: MMBench-GUI provides a robust framework for evaluating GUI agents, emphasizing the integration of specialized modules and efficiency improvements for scalable automation.

Abstract: We introduce MMBench-GUI, a hierarchical benchmark for evaluating GUI automation agents across Windows, macOS, Linux, iOS, Android, and Web platforms. It comprises four levels: GUI Content Understanding, Element Grounding, Task Automation, and Task Collaboration, covering essential skills for GUI agents. In addition, we propose a novel Efficiency-Quality Area (EQA) metric to assess GUI agent execution efficiency in online automation scenarios. Through MMBench-GUI, we identify accurate visual grounding as a critical determinant of overall task success, emphasizing the substantial benefits of modular frameworks that integrate specialized grounding modules. Furthermore, to achieve reliable GUI automation, an agent requires strong task planning and cross-platform generalization abilities, with long-context memory, a broad action space, and long-term reasoning playing a critical role. More important, task efficiency remains a critically underexplored dimension, and all models suffer from substantial inefficiencies, with excessive redundant steps even when tasks are ultimately completed. The integration of precise localization, effective planning, and early stopping strategies is indispensable to enable truly efficient and scalable GUI automation. Our benchmark code, evaluation data, and running environment will be publicly available at https://github.com/open-compass/MMBench-GUI.

[126] ScenePainter: Semantically Consistent Perpetual 3D Scene Generation with Concept Relation Alignment

Chong Xia, Shengjun Zhang, Fangfu Liu, Chang Liu, Khodchaphun Hirunyaratsameewong, Yueqi Duan

Main category: cs.CV

TL;DR: ScenePainter addresses semantic drift in 3D scene generation by aligning outpainting with scene comprehension using a hierarchical graph structure, improving consistency and immersion.

DetailsMotivation: Existing methods for perpetual 3D scene generation suffer from semantic drift due to outpainting deviations, limiting coherence in long-range view sequences.

Method: Proposes ScenePainter, leveraging a SceneConceptGraph to hierarchically organize scene concepts, guiding outpainting for consistency and enabling dynamic refinement.

Result: Overcomes semantic drift, producing more consistent and immersive 3D view sequences.

Conclusion: ScenePainter enhances 3D scene generation by ensuring semantic consistency and diversity through hierarchical scene concept alignment.

Abstract: Perpetual 3D scene generation aims to produce long-range and coherent 3D view sequences, which is applicable for long-term video synthesis and 3D scene reconstruction. Existing methods follow a “navigate-and-imagine” fashion and rely on outpainting for successive view expansion. However, the generated view sequences suffer from semantic drift issue derived from the accumulated deviation of the outpainting module. To tackle this challenge, we propose ScenePainter, a new framework for semantically consistent 3D scene generation, which aligns the outpainter’s scene-specific prior with the comprehension of the current scene. To be specific, we introduce a hierarchical graph structure dubbed SceneConceptGraph to construct relations among multi-level scene concepts, which directs the outpainter for consistent novel views and can be dynamically refined to enhance diversity. Extensive experiments demonstrate that our framework overcomes the semantic drift issue and generates more consistent and immersive 3D view sequences. Project Page: https://xiac20.github.io/ScenePainter/.

[127] Revisiting DETR for Small Object Detection via Noise-Resilient Query Optimization

Xiaocheng Fang, Jieyi Cai, Huanyu Liu, Wenxiu Cai, Yishu Liu, Bingzhi Chen

Main category: cs.CV

TL;DR: The paper proposes NRQO, a novel paradigm combining NT-FPN and PS-RPN, to improve small object detection by addressing noise sensitivity in FPN and poor query quality in label assignment.

DetailsMotivation: Existing Transformer-based detectors for small object detection struggle with noise sensitivity in FPN and suboptimal query quality in label assignment.

Method: NRQO integrates NT-FPN to preserve feature integrity during fusion and PS-RPN to enhance query quality via improved anchor-ground truth matching.

Result: NRQO outperforms state-of-the-art baselines in extensive experiments on multiple benchmarks.

Conclusion: NRQO effectively addresses key challenges in small object detection, offering superior performance.

Abstract: Despite advancements in Transformer-based detectors for small object detection (SOD), recent studies show that these detectors still face challenges due to inherent noise sensitivity in feature pyramid networks (FPN) and diminished query quality in existing label assignment strategies. In this paper, we propose a novel Noise-Resilient Query Optimization (NRQO) paradigm, which innovatively incorporates the Noise-Tolerance Feature Pyramid Network (NT-FPN) and the Pairwise-Similarity Region Proposal Network (PS-RPN). Specifically, NT-FPN mitigates noise during feature fusion in FPN by preserving spatial and semantic information integrity. Unlike existing label assignment strategies, PS-RPN generates a sufficient number of high-quality positive queries by enhancing anchor-ground truth matching through position and shape similarities, without the need for additional hyperparameters. Extensive experiments on multiple benchmarks consistently demonstrate the superiority of NRQO over state-of-the-art baselines.

[128] Negation-Aware Test-Time Adaptation for Vision-Language Models

Haochen Han, Alex Jinpeng Wang, Fangming Liu

Main category: cs.CV

TL;DR: The paper addresses the challenge of negation understanding in Vision-Language Models (VLMs), proposing a low-resource method called NEAT to adjust distribution shifts during inference.

DetailsMotivation: Real-world applications, like medical imaging, require models to identify false or non-existent conditions, but VLMs struggle with negation due to data scarcity.

Method: The proposed NEAT method adjusts distribution-related parameters during inference to handle negation without extensive fine-tuning.

Result: Experiments show NEAT effectively reduces distribution shifts and improves negation understanding in VLMs.

Conclusion: NEAT offers a sustainable, low-resource solution for negation understanding in VLMs, validated by extensive testing.

Abstract: In this paper, we study a practical but less-touched problem in Vision-Language Models (VLMs), \ie, negation understanding. Specifically, many real-world applications require models to explicitly identify what is false or non-existent, \eg, radiologists may search for images that exclude specific conditions. Despite the impressive transferability of VLMs through large-scale training, they suffer from a critical limitation that fails to handle negation. To address this challenge, existing methods attribute its root cause to the scarcity of negation training data and propose to fine-tune VLMs on massive data containing explicit negation. Undoubtedly, such data-centric solutions demand substantial data and computational resources, limiting their sustainable widespread adoption. To tackle negation in a low-carbon manner, we empirically observe that the key obstacle lies in the dual-concept shifts between the affirmation and negation distributions. Therefore, we propose a Negation-Aware Test-Time Adaptation (NEAT) method to efficiently adjust distribution-related parameters during inference. In brief, NEAT can reduce distribution shift in consistent semantics while eliminating false distributional consistency in unrelated semantics. Extensive experiments on the various negation understanding tasks verify the effectiveness of the proposed method. The code is available at https://github.com/hhc1997/NEAT.

[129] Cross-Subject Mind Decoding from Inaccurate Representations

Yangyang Xu, Bangzhen Liu, Wenqi Shao, Yong Du, Shengfeng He, Tingting Zhu

Main category: cs.CV

TL;DR: Proposes a Bidirectional Autoencoder Intertwining framework to improve cross-subject fMRI decoding by addressing sequential errors and enhancing semantic and visual fidelity.

DetailsMotivation: Existing methods fail in cross-subject mappings due to cognitive variability and subject-specific differences, leading to error accumulation in reconstructions.

Method: Introduces a Bidirectional Autoencoder Intertwining framework with Subject Bias Modulation, Semantic Refinement, and Visual Coherence Modules, integrated with ControlNet and Stable Diffusion.

Result: Outperforms state-of-the-art methods in qualitative and quantitative evaluations and shows adaptability to new subjects with minimal training.

Conclusion: The framework effectively addresses cross-subject challenges and enhances reconstruction fidelity in fMRI decoding.

Abstract: Decoding stimulus images from fMRI signals has advanced with pre-trained generative models. However, existing methods struggle with cross-subject mappings due to cognitive variability and subject-specific differences. This challenge arises from sequential errors, where unidirectional mappings generate partially inaccurate representations that, when fed into diffusion models, accumulate errors and degrade reconstruction fidelity. To address this, we propose the Bidirectional Autoencoder Intertwining framework for accurate decoded representation prediction. Our approach unifies multiple subjects through a Subject Bias Modulation Module while leveraging bidirectional mapping to better capture data distributions for precise representation prediction. To further enhance fidelity when decoding representations into stimulus images, we introduce a Semantic Refinement Module to improve semantic representations and a Visual Coherence Module to mitigate the effects of inaccurate visual representations. Integrated with ControlNet and Stable Diffusion, our method outperforms state-of-the-art approaches on benchmark datasets in both qualitative and quantitative evaluations. Moreover, our framework exhibits strong adaptability to new subjects with minimal training samples.

[130] Multistream Network for LiDAR and Camera-based 3D Object Detection in Outdoor Scenes

Muhammad Ibrahim, Naveed Akhtar, Haitian Wang, Saeed Anwar, Ajmal Mian

Main category: cs.CV

TL;DR: The paper proposes a MultiStream Detection (MuStD) network for fusing LiDAR and RGB data to improve 3D object detection accuracy in outdoor environments.

DetailsMotivation: The fusion of LiDAR and RGB data is promising for enhancing outdoor 3D object detection, but effective integration remains challenging.

Method: MuStD uses a three-stream structure: LiDAR-PillarNet for sparse 2D pillar features, LiDAR-Height Compression for Bird’s-Eye View features, and a 3D Multimodal stream combining RGB and LiDAR via UV mapping and polar coordinate indexing.

Result: The method achieves state-of-the-art or competitive results on the KITTI Object Detection Benchmark while maintaining efficiency.

Conclusion: MuStD effectively integrates LiDAR and RGB data for precise 3D object detection, demonstrating superior performance on a standard benchmark.

Abstract: Fusion of LiDAR and RGB data has the potential to enhance outdoor 3D object detection accuracy. To address real-world challenges in outdoor 3D object detection, fusion of LiDAR and RGB input has started gaining traction. However, effective integration of these modalities for precise object detection task still remains a largely open problem. To address that, we propose a MultiStream Detection (MuStD) network, that meticulously extracts task-relevant information from both data modalities. The network follows a three-stream structure. Its LiDAR-PillarNet stream extracts sparse 2D pillar features from the LiDAR input while the LiDAR-Height Compression stream computes Bird’s-Eye View features. An additional 3D Multimodal stream combines RGB and LiDAR features using UV mapping and polar coordinate indexing. Eventually, the features containing comprehensive spatial, textural and geometric information are carefully fused and fed to a detection head for 3D object detection. Our extensive evaluation on the challenging KITTI Object Detection Benchmark using public testing server at https://www.cvlibs.net/datasets/kitti/eval_object_detail.php?&result=d162ec699d6992040e34314d19ab7f5c217075e0 establishes the efficacy of our method by achieving new state-of-the-art or highly competitive results in different categories while remaining among the most efficient methods. Our code will be released through MuStD GitHub repository at https://github.com/IbrahimUWA/MuStD.git

[131] SIDE: Sparse Information Disentanglement for Explainable Artificial Intelligence

Viktar Dubovik, Łukasz Struski, Jacek Tabor, Dawid Rymarczyk

Main category: cs.CV

TL;DR: SIDE improves interpretability of prototypical models by enforcing sparsity and using sigmoid activations, reducing explanation size by over 90% while maintaining accuracy.

DetailsMotivation: Deep neural networks lack transparency, especially in high-stakes domains like medical imaging and autonomous driving. Prototypical models offer concept-level explanations but are often complex or limited to fine-grained tasks.

Method: SIDE introduces a training and pruning scheme for sparsity and replaces softmax with sigmoid activations, associating each class with a small set of prototypes.

Result: SIDE matches existing methods’ accuracy while reducing explanation size by over 90%, enhancing understandability.

Conclusion: SIDE effectively balances accuracy and interpretability, making prototype-based explanations more practical for large-scale applications.

Abstract: Understanding the decisions made by deep neural networks is essential in high-stakes domains such as medical imaging and autonomous driving. Yet, these models often lack transparency, particularly in computer vision. Prototypical-parts-based neural networks have emerged as a promising solution by offering concept-level explanations. However, most are limited to fine-grained classification tasks, with few exceptions such as InfoDisent. InfoDisent extends prototypical models to large-scale datasets like ImageNet, but produces complex explanations. We introduce Sparse Information Disentanglement for Explainability (SIDE), a novel method that improves the interpretability of prototypical parts through a dedicated training and pruning scheme that enforces sparsity. Combined with sigmoid activations in place of softmax, this approach allows SIDE to associate each class with only a small set of relevant prototypes. Extensive experiments show that SIDE matches the accuracy of existing methods while reducing explanation size by over $90%$, substantially enhancing the understandability of prototype-based explanations.

[132] SP-Mamba: Spatial-Perception State Space Model for Unsupervised Medical Anomaly Detection

Rui Pan, Ruiying Lu

Main category: cs.CV

TL;DR: SP-Mamba, a spatial-perception Mamba framework, is introduced for unsupervised medical anomaly detection, leveraging structural regularity and linear computational efficiency to outperform CNN- and transformer-based methods.

DetailsMotivation: Radiography images have consistent structural patterns, but CNN and transformer models have limitations in capturing long-range dependencies and computational efficiency. Mamba-based models offer a promising alternative.

Method: SP-Mamba uses window-sliding prototype learning and Circular-Hilbert scanning-based Mamba to exploit anatomical patterns and spatial information. It also utilizes anomaly map characteristics for improved detection.

Result: Extensive experiments on three benchmarks show SP-Mamba achieves state-of-the-art performance in medical anomaly detection.

Conclusion: SP-Mamba is effective and robust for unsupervised medical anomaly detection, validated by superior results on diverse benchmarks.

Abstract: Radiography imaging protocols target on specific anatomical regions, resulting in highly consistent images with recurrent structural patterns across patients. Recent advances in medical anomaly detection have demonstrated the effectiveness of CNN- and transformer-based approaches. However, CNNs exhibit limitations in capturing long-range dependencies, while transformers suffer from quadratic computational complexity. In contrast, Mamba-based models, leveraging superior long-range modeling, structural feature extraction, and linear computational efficiency, have emerged as a promising alternative. To capitalize on the inherent structural regularity of medical images, this study introduces SP-Mamba, a spatial-perception Mamba framework for unsupervised medical anomaly detection. The window-sliding prototype learning and Circular-Hilbert scanning-based Mamba are introduced to better exploit consistent anatomical patterns and leverage spatial information for medical anomaly detection. Furthermore, we excavate the concentration and contrast characteristics of anomaly maps for improving anomaly detection. Extensive experiments on three diverse medical anomaly detection benchmarks confirm the proposed method’s state-of-the-art performance, validating its efficacy and robustness. The code is available at https://github.com/Ray-RuiPan/SP-Mamba.

[133] Multi-Task Dense Prediction Fine-Tuning with Mixture of Fine-Grained Experts

Yangyang Xu, Xi Ye, Duo Su

Main category: cs.CV

TL;DR: FGMoE introduces a Fine-Grained Mixture of Experts for MTL, combining intra-task, shared, and global experts with fine-tuning to improve parameter efficiency and performance.

DetailsMotivation: Challenges in balancing shared representations and task-specific specialization in MTL for dense prediction.

Method: FGMoE architecture with intra-task, shared, and global experts, plus fine-tuning decoder parameters.

Result: FGMoE outperforms current MoE-based MTL models on NYUD-v2 and PASCAL-Context datasets with fewer parameters.

Conclusion: FGMoE effectively balances specialization and sharing in MTL, achieving superior performance and efficiency.

Abstract: Multi-task learning (MTL) for dense prediction has shown promising results but still faces challenges in balancing shared representations with task-specific specialization. In this paper, we introduce a novel Fine-Grained Mixture of Experts (FGMoE) architecture that explores MoE-based MTL models through a combination of three key innovations and fine-tuning. First, we propose intra-task experts that partition along intermediate hidden dimensions of MLPs, enabling finer decomposition of task information while maintaining parameter efficiency. Second, we introduce shared experts that consolidate common information across different contexts of the same task, reducing redundancy, and allowing routing experts to focus on unique aspects. Third, we design a global expert that facilitates adaptive knowledge transfer across tasks based on both input feature and task requirements, promoting beneficial information sharing while preventing harmful interference. In addition, we use the fine-tuning approach to improve parameter efficiency only by training the parameters of the decoder. Extensive experimental results show that the proposed FGMoE uses fewer parameters and significantly outperforms current MoE-based competitive MTL models on two dense prediction datasets (\textit{i.e.,} NYUD-v2, PASCAL-Context) in various metrics.

[134] CXR-CML: Improved zero-shot classification of long-tailed multi-label diseases in Chest X-Rays

Rajesh Madhipati, Sheethal Bhat, Lukas Buess, Andreas Maier

Main category: cs.CV

TL;DR: The paper addresses class imbalance in CXR diagnosis using a class-weighting mechanism with GMM clustering and Student t-distribution, improving zero-shot AUC by 7%.

DetailsMotivation: Class imbalance in CXR datasets hinders self-supervised models, especially for long-tailed classes, despite CLIP's success in primary classes.

Method: Uses GMM clustering on latent space, refined by Student t-distribution and metric loss, to align class distribution and improve rare class recognition.

Result: Achieves a 7% average improvement in zero-shot AUC across 40 classes in MIMIC-CXR-JPG.

Conclusion: The proposed method effectively enhances classification performance, particularly for rare classes, outperforming SOTA models.

Abstract: Chest radiography (CXR) plays a crucial role in the diagnosis of various diseases. However, the inherent class imbalance in the distribution of clinical findings presents a significant challenge for current self-supervised deep learning models. These models often fail to accurately classify long-tailed classes. Current Vision-Language models such as Contrastive Language Image Pre-training (CLIP) models effectively model the manifold distribution of the latent space, enabling high zero-shot classification accuracies. Although CLIP performs well on most of the primary classes in the dataset, our work reveals that its effectiveness decreases significantly for classes with a long-tailed distribution. Our approach employs a class-weighting mechanism that directly aligns with the distribution of classes within the latent space. This method ensures a substantial improvement in overall classification performance, with particular emphasis on enhancing the recognition and accuracy of rarely observed classes. We accomplish this by applying Gaussian Mixture Model (GMM) clustering to the latent space. The subsequent clusters are further refined by Student t-distribution, followed by a metric loss that utilizes the altered embeddings. Our approach facilitates stable and adaptive clustering of the features. This results in a notable average improvement of 7% points in zero-shot AUC scores across 40 classes in the MIMIC-CXR-JPG dataset from previous SOTA models.

[135] LISA: A Layer-wise Integration and Suppression Approach for Hallucination Mitigation in Multimodal Large Language Models

Zhihui Guo, Xin Man, Hui Xu, Jie Shao

Main category: cs.CV

TL;DR: LISA is a layer-wise integration and suppression approach to reduce object hallucinations in Multimodal Large Language Models (MLLMs) by modulating attention and fusing token-level logits.

DetailsMotivation: MLLMs often hallucinate objects not present in images, degrading performance in vision-language tasks.

Method: LISA uses hierarchical modulation (spectral suppression in deeper layers) and multi-layer fusion (anchor-based routing for logit integration).

Result: LISA reduces hallucinations by up to 53.6% and improves POPE F1 by 4.5% across benchmarks.

Conclusion: LISA is an effective, plug-and-play solution for enhancing MLLM consistency and reducing hallucinations.

Abstract: Multimodal Large Language Models (MLLMs) excel in vision-language tasks such as image captioning but remain prone to object hallucinations, where they describe objects that do not appear in the image. To mitigate this, we propose \textbf{LISA}, a \textbf{L}ayer-wise \textbf{I}ntegration and \textbf{S}uppression \textbf{A}pproach that enhances generation consistency through hierarchical modulation and multi-layer fusion. LISA leverages the functional hierarchy within MLLMs, where shallow layers provide visual grounding, middle layers encode semantics, and deep layers tend to amplify spurious signals. First, zone-specific spectral modulation stabilizes attention by suppressing over-amplified activations in deeper layers while preserving alignment cues in earlier layers. Second, token-level logits from selected layers are fused via anchor-based routing, with token-wise anchor selection and soft logit fusion enabling adaptive integration during decoding. LISA is fully \textbf{plug-and-play} and can be seamlessly integrated into existing MLLMs, including Qwen2.5-VL. Experiments on multiple benchmarks show that LISA reduces hallucinations by up to 53.6% in $\mathrm{CHAIR}_I$ and improves POPE F1 by 4.5%, demonstrating strong generalization across models and tasks.

[136] Cross Spatial Temporal Fusion Attention for Remote Sensing Object Detection via Image Feature Matching

Abu Sadat Mohammad Salehin Amit, Xiaoli Zhang, Md Masum Billa Shagar, Zhaojun Liu, Xiongfei Li, Fanlong Meng

Main category: cs.CV

TL;DR: The paper proposes a Cross Spatial Temporal Fusion (CSTF) mechanism for robust cross-modal remote sensing image matching, achieving state-of-the-art performance on benchmark datasets.

DetailsMotivation: Existing methods fail to effectively capture cross-modal similarities due to geometric and radiometric differences in multimodal images.

Method: CSTF integrates scale-invariant keypoints and reformulates similarity matching as a classification task using SoftMax and FCN layers.

Result: Achieves 90.99% mAP on HRSC2016 and 90.86% on DOTA, with 12.5 FPS inference speed.

Conclusion: CSTF enhances cross-modal feature matching, improving downstream applications like object detection.

Abstract: Effectively describing features for cross-modal remote sensing image matching remains a challenging task due to the significant geometric and radiometric differences between multimodal images. Existing methods primarily extract features at the fully connected layer but often fail to capture cross-modal similarities effectively. We propose a Cross Spatial Temporal Fusion (CSTF) mechanism that enhances feature representation by integrating scale-invariant keypoints detected independently in both reference and query images. Our approach improves feature matching in two ways: First, by creating correspondence maps that leverage information from multiple image regions simultaneously, and second, by reformulating the similarity matching process as a classification task using SoftMax and Fully Convolutional Network (FCN) layers. This dual approach enables CSTF to maintain sensitivity to distinctive local features while incorporating broader contextual information, resulting in robust matching across diverse remote sensing modalities. To demonstrate the practical utility of improved feature matching, we evaluate CSTF on object detection tasks using the HRSC2016 and DOTA benchmark datasets. Our method achieves state-of-theart performance with an average mAP of 90.99% on HRSC2016 and 90.86% on DOTA, outperforming existing models. The CSTF model maintains computational efficiency with an inference speed of 12.5 FPS. These results validate that our approach to crossmodal feature matching directly enhances downstream remote sensing applications such as object detection.

[137] Preserving Topological and Geometric Embeddings for Point Cloud Recovery

Kaiyue Zhou, Zelong Tan, Hongxiao Wang, Ya-li Li, Shengjin Wang

Main category: cs.CV

TL;DR: TopGeoFormer is an end-to-end architecture for point cloud recovery, combining topological and geometric embeddings with novel attention and loss functions to outperform existing methods.

DetailsMotivation: Existing methods fail to effectively leverage both topological and geometric attributes in point cloud recovery.

Method: Proposes TopGeoFormer with topological embedding, InterTwining Attention, and dual loss functions (geometry and topological constraint).

Result: Outperforms conventional and learning-based sampling/upsampling methods in quantitative and qualitative evaluations.

Conclusion: TopGeoFormer successfully integrates topological and geometric features, enhancing point cloud recovery performance.

Abstract: Recovering point clouds involves the sequential process of sampling and restoration, yet existing methods struggle to effectively leverage both topological and geometric attributes. To address this, we propose an end-to-end architecture named \textbf{TopGeoFormer}, which maintains these critical features throughout the sampling and restoration phases. First, we revisit traditional feature extraction techniques to yield topological embedding using a continuous mapping of relative relationships between neighboring points, and integrate it in both phases for preserving the structure of the original space. Second, we propose the \textbf{InterTwining Attention} to fully merge topological and geometric embeddings, which queries shape with local awareness in both phases to form a learnable shape context facilitated with point-wise, point-shape-wise, and intra-shape features. Third, we introduce a full geometry loss and a topological constraint loss to optimize the embeddings in both Euclidean and topological spaces. The geometry loss uses inconsistent matching between coarse-to-fine generations and targets for reconstructing better geometric details, and the constraint loss limits embedding variances for better approximation of the topological space. In experiments, we comprehensively analyze the circumstances using the conventional and learning-based sampling/upsampling algorithms. The quantitative and qualitative results demonstrate that our method significantly outperforms existing sampling and recovery methods.

[138] MixA-Q: Revisiting Activation Sparsity for Vision Transformers from a Mixed-Precision Quantization Perspective

Weitian Wang, Rai Shubham, Cecilia De La Parra, Akash Kumar

Main category: cs.CV

TL;DR: MixA-Q is a mixed-precision activation quantization framework for efficient inference in window-based vision transformers, leveraging intra-layer activation sparsity to balance performance and efficiency.

DetailsMotivation: To improve the trade-off between model performance and efficiency in quantized vision transformers by exploiting activation sparsity.

Method: MixA-Q separates batched window computations, assigning lower bit widths to less important windows, and introduces a Two-Branch Swin Block for high- and low-bit precision processing.

Result: Achieves 1.35x speedup without accuracy loss in PTQ, 1.25x lossless speedup with QAT, and 1.53x speedup with minimal mAP drop. Also improves mAP of W4A4 models by 0.7%.

Conclusion: MixA-Q effectively enhances efficiency and performance in quantized vision transformers by leveraging sparsity-aware quantization.

Abstract: In this paper, we propose MixA-Q, a mixed-precision activation quantization framework that leverages intra-layer activation sparsity (a concept widely explored in activation pruning methods) for efficient inference of quantized window-based vision transformers. For a given uniform-bit quantization configuration, MixA-Q separates the batched window computations within Swin blocks and assigns a lower bit width to the activations of less important windows, improving the trade-off between model performance and efficiency. We introduce a Two-Branch Swin Block that processes activations separately in high- and low-bit precision, enabling seamless integration of our method with most quantization-aware training (QAT) and post-training quantization (PTQ) methods, or with simple modifications. Our experimental evaluations over the COCO dataset demonstrate that MixA-Q achieves a training-free 1.35x computational speedup without accuracy loss in PTQ configuration. With QAT, MixA-Q achieves a lossless 1.25x speedup and a 1.53x speedup with only a 1% mAP drop by incorporating activation pruning. Notably, by reducing the quantization error in important regions, our sparsity-aware quantization adaptation improves the mAP of the quantized W4A4 model (with both weights and activations in 4-bit precision) by 0.7%, reducing quantization degradation by 24%.

[139] Balancing Conservatism and Aggressiveness: Prototype-Affinity Hybrid Network for Few-Shot Segmentation

Tianyu Zou, Shengwu Xiong, Ruilin Yao, Yi Rong

Main category: cs.CV

TL;DR: PAHNet balances conservative prototype learning and aggressive affinity learning for few-shot segmentation, improving accuracy with hybrid modules.

DetailsMotivation: Address the imbalance between conservative predictions of prototype learning and aggressive predictions of affinity learning in few-shot segmentation.

Method: Proposes PAHNet with Prototype-guided Feature Enhancement (PFE) and Attention Score Calibration (ASC) modules to integrate prototype and affinity learning.

Result: Outperforms recent methods on PASCAL-5$^i$ and COCO-20$^i$ datasets in 1-shot and 5-shot settings.

Conclusion: PAHNet effectively balances conservatism and aggressiveness, enhancing few-shot segmentation performance.

Abstract: This paper studies the few-shot segmentation (FSS) task, which aims to segment objects belonging to unseen categories in a query image by learning a model on a small number of well-annotated support samples. Our analysis of two mainstream FSS paradigms reveals that the predictions made by prototype learning methods are usually conservative, while those of affinity learning methods tend to be more aggressive. This observation motivates us to balance the conservative and aggressive information captured by these two types of FSS frameworks so as to improve the segmentation performance. To achieve this, we propose a Prototype-Affinity Hybrid Network (PAHNet), which introduces a Prototype-guided Feature Enhancement (PFE) module and an Attention Score Calibration (ASC) module in each attention block of an affinity learning model (called affinity learner). These two modules utilize the predictions generated by a pre-trained prototype learning model (called prototype predictor) to enhance the foreground information in support and query image representations and suppress the mismatched foreground-background (FG-BG) relationships between them, respectively. In this way, the aggressiveness of the affinity learner can be effectively mitigated, thereby eventually increasing the segmentation accuracy of our PAHNet method. Experimental results show that PAHNet outperforms most recently proposed methods across 1-shot and 5-shot settings on both PASCAL-5$^i$ and COCO-20$^i$ datasets, suggesting its effectiveness. The code is available at: GitHub - tianyu-zou/PAHNet: Balancing Conservatism and Aggressiveness: Prototype-Affinity Hybrid Network for Few-Shot Segmentation (ICCV'25)

[140] DASH: 4D Hash Encoding with Self-Supervised Decomposition for Real-Time Dynamic Scene Rendering

Jie Chen, Zhangchi Hu, Peixi Wu, Huyue Zhu, Hebei Li, Xiaoyan Sun

Main category: cs.CV

TL;DR: DASH introduces a real-time dynamic scene rendering framework using 4D hash encoding and self-supervised decomposition to improve rendering quality and avoid low-rank assumptions.

DetailsMotivation: Existing methods for dynamic scene reconstruction suffer from feature overlap, poor rendering quality, and hash collisions due to unsuitable assumptions.

Method: DASH uses self-supervised decomposition to separate dynamic/static components, a multiresolution 4D hash encoder for dynamic elements, and spatio-temporal smoothness regularization.

Result: DASH achieves state-of-the-art dynamic rendering at 264 FPS on a 4090 GPU, with enhanced visual quality.

Conclusion: DASH effectively addresses challenges in dynamic scene reconstruction, offering high-quality real-time rendering.

Abstract: Dynamic scene reconstruction is a long-term challenge in 3D vision. Existing plane-based methods in dynamic Gaussian splatting suffer from an unsuitable low-rank assumption, causing feature overlap and poor rendering quality. Although 4D hash encoding provides an explicit representation without low-rank constraints, directly applying it to the entire dynamic scene leads to substantial hash collisions and redundancy. To address these challenges, we present DASH, a real-time dynamic scene rendering framework that employs 4D hash encoding coupled with self-supervised decomposition. Our approach begins with a self-supervised decomposition mechanism that separates dynamic and static components without manual annotations or precomputed masks. Next, we introduce a multiresolution 4D hash encoder for dynamic elements, providing an explicit representation that avoids the low-rank assumption. Finally, we present a spatio-temporal smoothness regularization strategy to mitigate unstable deformation artifacts. Experiments on real-world datasets demonstrate that DASH achieves state-of-the-art dynamic rendering performance, exhibiting enhanced visual quality at real-time speeds of 264 FPS on a single 4090 GPU. Code: https://github.com/chenj02/DASH.

[141] Patch Pruning Strategy Based on Robust Statistical Measures of Attention Weight Diversity in Vision Transformers

Yuki Igaue, Hiroaki Aizawa

Main category: cs.CV

TL;DR: A patch pruning strategy for vision transformers evaluates patch importance using variance of attention weights across heads, improving efficiency without losing accuracy.

DetailsMotivation: Address the quadratic computational complexity of multi-head self-attention in vision transformers by pruning redundant patches.

Method: Propose a patch pruning strategy based on variance of attention weights across heads, with optional robust statistical measures like median absolute deviation. Also introduce overlapping patch embeddings.

Result: Improved throughput while maintaining classification accuracy, especially in fine-tuning scenarios. Overlapping embeddings enhance performance with comparable throughput.

Conclusion: The method effectively reduces computational cost without sacrificing accuracy, offering a practical solution for efficient vision transformer deployment.

Abstract: Multi-head self-attention is a distinctive feature extraction mechanism of vision transformers that computes pairwise relationships among all input patches, contributing significantly to their high performance. However, it is known to incur a quadratic computational complexity with respect to the number of patches. One promising approach to address this issue is patch pruning, which improves computational efficiency by identifying and removing redundant patches. In this work, we propose a patch pruning strategy that evaluates the importance of each patch based on the variance of attention weights across multiple attention heads. This approach is inspired by the design of multi-head self-attention, which aims to capture diverse attention patterns across different subspaces of feature representations. The proposed method can be easily applied during both training and inference, and achieves improved throughput while maintaining classification accuracy in scenarios such as fine-tuning with pre-trained models. In addition, we also found that using robust statistical measures, such as the median absolute deviation in place of variance, to assess patch importance can similarly lead to strong performance. Furthermore, by introducing overlapping patch embeddings, our method achieves better performance with comparable throughput to conventional approaches that utilize all patches.

[142] Continual Learning-Based Unified Model for Unpaired Image Restoration Tasks

Kotha Kartheek, Lingamaneni Gnanesh Chowdary, Snehasis Mukherjee

Main category: cs.CV

TL;DR: A unified framework for image restoration across fog, snow, and rain using continual learning, selective kernel fusion, EWC, and cycle-contrastive loss, achieving superior results.

DetailsMotivation: Existing methods focus on single weather conditions, but autonomous driving requires a unified model for diverse weather restoration.

Method: Proposes continual learning with selective kernel fusion, EWC for task retention, and cycle-contrastive loss for feature discrimination. Uses unpaired image restoration to reduce data dependency.

Result: Outperforms state-of-the-art in PSNR, SSIM, and perceptual quality on benchmark datasets for dehazing, desnowing, and deraining.

Conclusion: The framework effectively unifies restoration for multiple weather conditions, advancing practical applications like autonomous driving.

Abstract: Restoration of images contaminated by different adverse weather conditions such as fog, snow, and rain is a challenging task due to the varying nature of the weather conditions. Most of the existing methods focus on any one particular weather conditions. However, for applications such as autonomous driving, a unified model is necessary to perform restoration of corrupted images due to different weather conditions. We propose a continual learning approach to propose a unified framework for image restoration. The proposed framework integrates three key innovations: (1) Selective Kernel Fusion layers that dynamically combine global and local features for robust adaptive feature selection; (2) Elastic Weight Consolidation (EWC) to enable continual learning and mitigate catastrophic forgetting across multiple restoration tasks; and (3) a novel Cycle-Contrastive Loss that enhances feature discrimination while preserving semantic consistency during domain translation. Further, we propose an unpaired image restoration approach to reduce the dependance of the proposed approach on the training data. Extensive experiments on standard benchmark datasets for dehazing, desnowing and deraining tasks demonstrate significant improvements in PSNR, SSIM, and perceptual quality over the state-of-the-art.

[143] VisHall3D: Monocular Semantic Scene Completion from Reconstructing the Visible Regions to Hallucinating the Invisible Regions

Haoang Lu, Yuanqi Su, Xiaoning Zhang, Longjun Gao, Yu Xue, Le Wang

Main category: cs.CV

TL;DR: VisHall3D is a two-stage framework for monocular semantic scene completion, addressing feature entanglement and geometric inconsistency by separating visible region reconstruction (vision) and invisible region inference (hallucination).

DetailsMotivation: Existing methods suffer from feature entanglement and geometric inconsistency in scene completion tasks.

Method: VisHall3D uses VisFrontierNet for visible region reconstruction and OcclusionMAE for invisible region hallucination with noise injection.

Result: Achieves state-of-the-art performance on SemanticKITTI and SSCBench-KITTI-360 benchmarks.

Conclusion: VisHall3D improves reconstruction quality and advances scene understanding for applications like autonomous driving.

Abstract: This paper introduces VisHall3D, a novel two-stage framework for monocular semantic scene completion that aims to address the issues of feature entanglement and geometric inconsistency prevalent in existing methods. VisHall3D decomposes the scene completion task into two stages: reconstructing the visible regions (vision) and inferring the invisible regions (hallucination). In the first stage, VisFrontierNet, a visibility-aware projection module, is introduced to accurately trace the visual frontier while preserving fine-grained details. In the second stage, OcclusionMAE, a hallucination network, is employed to generate plausible geometries for the invisible regions using a noise injection mechanism. By decoupling scene completion into these two distinct stages, VisHall3D effectively mitigates feature entanglement and geometric inconsistency, leading to significantly improved reconstruction quality. The effectiveness of VisHall3D is validated through extensive experiments on two challenging benchmarks: SemanticKITTI and SSCBench-KITTI-360. VisHall3D achieves state-of-the-art performance, outperforming previous methods by a significant margin and paves the way for more accurate and reliable scene understanding in autonomous driving and other applications.

[144] PRE-MAP: Personalized Reinforced Eye-tracking Multimodal LLM for High-Resolution Multi-Attribute Point Prediction

Hanbing Wu, Ping Jiang, Anyang Su, Chenxu Zhao, Tianyu Fu, Minghui Wu, Beiping Tan, Huiying Li

Main category: cs.CV

TL;DR: The paper introduces SPA-ADV, a dataset capturing gaze behaviors from 4,500+ participants, and PRE-MAP, a saliency model using reinforcement learning to predict personalized attention patterns.

DetailsMotivation: Existing models overlook subjective cognitive diversity in visual attention, and current saliency models struggle with personalized patterns and precise point predictions.

Method: Proposes PRE-MAP, a reinforcement learning-optimized eye-tracking model guided by multi-attribute profiles, and introduces C-GRPO for accurate point predictions.

Result: Demonstrates effectiveness on SPA-ADV and benchmarks, improving personalized attention prediction.

Conclusion: SPA-ADV and PRE-MAP address limitations in current models, offering better personalization and accuracy in visual attention prediction.

Abstract: Visual selective attention, driven by individual preferences, regulates human prioritization of visual stimuli by bridging subjective cognitive mechanisms with objective visual elements, thereby steering the semantic interpretation and hierarchical processing of dynamic visual scenes. However, existing models and datasets predominantly neglect the influence of subjective cognitive diversity on fixation behavior. Conventional saliency prediction models, typically employing segmentation approaches, rely on low-resolution imagery to generate saliency heatmaps, subsequently upscaled to native resolutions, which limiting their capacity to capture personalized attention patterns. Furthermore, MLLMs are constrained by factors such as hallucinations, making it very costly to strictly adhere to the expected format in tasks involving multiple point predictions, and achieving precise point positioning is challenging. To address these limitations, we present Subjective Personalized Attention for Advertisement Videos, namely SPA-ADV, a large-scale multimodal dataset capturing gaze behaviors from over 4,500 participants varying in age and gender with 486 videos. Furthermore, we propose PRE-MAP, a novel eye-tracking saliency model that characterizes Personalized visual disparities through Reinforcement learning-optimized Eye-tracking, built upon MLLMs and guided by Multi-Attribute user profiles to predict Points. To ensure MLLMs produce prediction points that are both format-correct and spatially accurate, we introduce Consistency Group Relative Policy Optimization (C-GRPO), inspired by the variability in eye movement points and Multi-Attribute profiles. Extensive experiments on SPA-ADV and other benchmarks demonstrate the effectiveness of our approach. The code and dataset are available at \href{https://github.com/mininglamp-MLLM/PRE-MAP}{this URL}.

[145] Event-Driven Storytelling with Multiple Lifelike Humans in a 3D Scene

Donggeun Lim, Jinseok Bae, Inwoo Hwang, Seungmin Lee, Hwanhee Lee, Young Min Kim

Main category: cs.CV

TL;DR: A framework for generating lively virtual scenes with multi-human contextual motions using LLMs, event sequencing, and spatial guidance, achieving scalable and diverse results.

DetailsMotivation: To address the challenge of holistic reasoning in dynamic human-human and human-scene interactions for multi-agent behavior generation.

Method: Uses an event generator to break scenes into small events, synthesizes motions with spatial guidance, and employs a high-level module for scalable context translation.

Result: Benchmark and user studies confirm the framework’s effectiveness in capturing scene context with high scalability.

Conclusion: The framework successfully generates diverse and scalable multi-human contextual motions, supported by a new benchmark.

Abstract: In this work, we propose a framework that creates a lively virtual dynamic scene with contextual motions of multiple humans. Generating multi-human contextual motion requires holistic reasoning over dynamic relationships among human-human and human-scene interactions. We adapt the power of a large language model (LLM) to digest the contextual complexity within textual input and convert the task into tangible subproblems such that we can generate multi-agent behavior beyond the scale that was not considered before. Specifically, our event generator formulates the temporal progression of a dynamic scene into a sequence of small events. Each event calls for a well-defined motion involving relevant characters and objects. Next, we synthesize the motions of characters at positions sampled based on spatial guidance. We employ a high-level module to deliver scalable yet comprehensive context, translating events into relative descriptions that enable the retrieval of precise coordinates. As the first to address this problem at scale and with diversity, we offer a benchmark to assess diverse aspects of contextual reasoning. Benchmark results and user studies show that our framework effectively captures scene context with high scalability. The code and benchmark, along with result videos, are available at our project page: https://rms0329.github.io/Event-Driven-Storytelling/.

[146] CoopTrack: Exploring End-to-End Learning for Efficient Cooperative Sequential Perception

Jiaru Zhong, Jiahao Wang, Jiahui Xu, Xiaofan Li, Zaiqing Nie, Haibao Yu

Main category: cs.CV

TL;DR: CoopTrack is an end-to-end framework for cooperative 3D multi-object tracking, using sparse instance-level features and learnable association to improve perception with low transmission costs.

DetailsMotivation: Address limitations of single-vehicle autonomous systems by enabling multi-agent information exchange, focusing on understudied sequential perception tasks.

Method: Features Multi-Dimensional Feature Extraction and Cross-Agent Association and Aggregation for instance representation and fusion via a feature graph.

Result: Achieves 39.0% mAP and 32.8% AMOTA on V2X-Seq, outperforming existing methods.

Conclusion: CoopTrack advances cooperative perception with efficient, high-performance tracking.

Abstract: Cooperative perception aims to address the inherent limitations of single-vehicle autonomous driving systems through information exchange among multiple agents. Previous research has primarily focused on single-frame perception tasks. However, the more challenging cooperative sequential perception tasks, such as cooperative 3D multi-object tracking, have not been thoroughly investigated. Therefore, we propose CoopTrack, a fully instance-level end-to-end framework for cooperative tracking, featuring learnable instance association, which fundamentally differs from existing approaches. CoopTrack transmits sparse instance-level features that significantly enhance perception capabilities while maintaining low transmission costs. Furthermore, the framework comprises two key components: Multi-Dimensional Feature Extraction, and Cross-Agent Association and Aggregation, which collectively enable comprehensive instance representation with semantic and motion features, and adaptive cross-agent association and fusion based on a feature graph. Experiments on both the V2X-Seq and Griffin datasets demonstrate that CoopTrack achieves excellent performance. Specifically, it attains state-of-the-art results on V2X-Seq, with 39.0% mAP and 32.8% AMOTA. The project is available at https://github.com/zhongjiaru/CoopTrack.

[147] BridgeNet: A Unified Multimodal Framework for Bridging 2D and 3D Industrial Anomaly Detection

An Xiang, Zixuan Huang, Xitong Gao, Kejiang Ye, Cheng-zhong Xu

Main category: cs.CV

TL;DR: A novel unified multimodal anomaly detection framework is proposed to address 3D depth anomaly detection challenges by disentangling depth and appearance, enabling richer anomaly generation and outperforming SOTA methods.

DetailsMotivation: Existing methods struggle to represent 3D information in multimodal scenarios due to disparities among modalities and scarcity of abnormal samples in industrial data.

Method: The framework extracts visible depth from 3D point clouds, uses 2D RGB for appearance, and employs anomaly generators for richer anomaly simulation. All modules share parameters for RGB and depth data.

Result: The method outperforms SOTA on MVTec-3D AD and Eyecandies datasets.

Conclusion: The proposed framework effectively bridges 2D and 3D anomaly detection, leveraging multimodal features without complex fusion.

Abstract: Industrial anomaly detection for 2D objects has gained significant attention and achieved progress in anomaly detection (AD) methods. However, identifying 3D depth anomalies using only 2D information is insufficient. Despite explicitly fusing depth information into RGB images or using point cloud backbone networks to extract depth features, both approaches struggle to adequately represent 3D information in multimodal scenarios due to the disparities among different modal information. Additionally, due to the scarcity of abnormal samples in industrial data, especially in multimodal scenarios, it is necessary to perform anomaly generation to simulate real-world abnormal samples. Therefore, we propose a novel unified multimodal anomaly detection framework to address these issues. Our contributions consist of 3 key aspects. (1) We extract visible depth information from 3D point cloud data simply and use 2D RGB images to represent appearance, which disentangles depth and appearance to support unified anomaly generation. (2) Benefiting from the flexible input representation, the proposed Multi-Scale Gaussian Anomaly Generator and Unified Texture Anomaly Generator can generate richer anomalies in RGB and depth. (3) All modules share parameters for both RGB and depth data, effectively bridging 2D and 3D anomaly detection. Subsequent modules can directly leverage features from both modalities without complex fusion. Experiments show our method outperforms state-of-the-art (SOTA) on MVTec-3D AD and Eyecandies datasets. Code available at: https://github.com/Xantastic/BridgeNet

[148] OVFact: Measuring and Improving Open-Vocabulary Factuality for Long Caption Models

Monika Wysoczańska, Shyamal Buch, Anurag Arnab, Cordelia Schmid

Main category: cs.CV

TL;DR: OV-Fact is a new method for evaluating the factuality of long captions in VLMs without human annotations, improving agreement with human judgments and enabling data filtering.

DetailsMotivation: Traditional metrics for hallucination and factuality in VLMs are inadequate for long captions and lack human annotations.

Method: OV-Fact uses open-vocabulary visual grounding and tool-based verification to measure caption factuality.

Result: Models trained on OV-Fact-filtered data show improved factuality without losing descriptiveness.

Conclusion: OV-Fact provides a reference-free, effective solution for evaluating and improving caption factuality in VLMs.

Abstract: Large vision-language models (VLMs) often struggle to generate long and factual captions. However, traditional measures for hallucination and factuality are not well suited for evaluating longer, more diverse captions and in settings where ground-truth human-annotated captions are unavailable. We introduce OV-Fact, a novel method for measuring caption factuality of long captions that leverages open-vocabulary visual grounding and tool-based verification without depending on human annotations. Our method improves agreement with human judgments and captures both caption descriptiveness (recall) and factual precision in the same metric. Furthermore, unlike previous metrics, our reference-free method design enables new applications towards factuality-based data filtering. We observe models trained on an OVFact-filtered (2.5-5x less) subset of a large-scale, noisy (VLM-generated) pretraining set meaningfully improve factuality precision without sacrificing caption descriptiveness across a range of downstream long caption benchmarks.

[149] SimMLM: A Simple Framework for Multi-modal Learning with Missing Modality

Sijie Li, Chen Chen, Jungong Han

Main category: cs.CV

TL;DR: SimMLM is a simple, effective framework for multimodal learning with missing modalities, using a dynamic gating mechanism and MoFe ranking loss to improve accuracy and robustness.

DetailsMotivation: Existing methods for handling missing modalities are complex or rely on imputation. SimMLM aims to provide a simpler, more adaptable solution.

Method: SimMLM uses a Dynamic Mixture of Modality Experts (DMoME) with a learnable gating mechanism and introduces the MoFe ranking loss to ensure stable or improved accuracy with more modalities.

Result: SimMLM outperforms existing methods on medical image segmentation and classification tasks, showing better accuracy, robustness, and reliability.

Conclusion: SimMLM offers a generic, effective solution for multimodal learning with missing modalities, validated by superior performance in diverse scenarios.

Abstract: In this paper, we propose SimMLM, a simple yet powerful framework for multimodal learning with missing modalities. Unlike existing approaches that rely on sophisticated network architectures or complex data imputation techniques, SimMLM provides a generic and effective solution that can adapt to various missing modality scenarios with improved accuracy and robustness. Specifically, SimMLM consists of a generic Dynamic Mixture of Modality Experts (DMoME) architecture, featuring a dynamic, learnable gating mechanism that automatically adjusts each modality’s contribution in both full and partial modality settings. A key innovation of SimMLM is the proposed More vs. Fewer (MoFe) ranking loss, which ensures that task accuracy improves or remains stable as more modalities are made available. This aligns the model with an intuitive principle: removing one or more modalities should not increase accuracy. We validate SimMLM on multimodal medical image segmentation (BraTS 2018) and multimodal classification (UPMC Food-101, avMNIST) tasks, where it consistently surpasses competitive methods, demonstrating superior accuracy, interpretability, robustness, and reliability across both complete and missing modality scenarios at test time.

[150] MagicDrive3D: Controllable 3D Generation for Any-View Rendering in Street Scenes

Ruiyuan Gao, Kai Chen, Zhihao Li, Lanqing Hong, Zhenguo Li, Qiang Xu

Main category: cs.CV

TL;DR: MagicDrive3D introduces a controllable 3D street scene generation framework using video-based synthesis and 3DGS, supporting multi-condition control and leveraging autonomous driving data for diverse, high-quality outputs.

DetailsMotivation: Existing 3D scene generation methods lack flexibility and rely on controlled environments, limiting generalizability. MagicDrive3D aims to address this by utilizing routine autonomous driving data and enabling multi-condition control.

Method: The framework combines video-based view synthesis with 3DGS generation, first training a multi-view video model. It employs Fault-Tolerant Gaussian Splatting and monocular depth for initialization and appearance modeling.

Result: MagicDrive3D generates diverse, high-quality 3D scenes, supports any-view rendering, and improves tasks like BEV segmentation.

Conclusion: MagicDrive3D demonstrates potential for autonomous driving simulation, offering flexible control and leveraging readily available data.

Abstract: Controllable generative models for images and videos have seen significant success, yet 3D scene generation, especially in unbounded scenarios like autonomous driving, remains underdeveloped. Existing methods lack flexible controllability and often rely on dense view data collection in controlled environments, limiting their generalizability across common datasets (e.g., nuScenes). In this paper, we introduce MagicDrive3D, a novel framework for controllable 3D street scene generation that combines video-based view synthesis with 3D representation (3DGS) generation. It supports multi-condition control, including road maps, 3D objects, and text descriptions. Unlike previous approaches that require 3D representation before training, MagicDrive3D first trains a multi-view video generation model to synthesize diverse street views. This method utilizes routinely collected autonomous driving data, reducing data acquisition challenges and enriching 3D scene generation. In the 3DGS generation step, we introduce Fault-Tolerant Gaussian Splatting to address minor errors and use monocular depth for better initialization, alongside appearance modeling to manage exposure discrepancies across viewpoints. Experiments show that MagicDrive3D generates diverse, high-quality 3D driving scenes, supports any-view rendering, and enhances downstream tasks like BEV segmentation, demonstrating its potential for autonomous driving simulation and beyond.

[151] Video Self-Distillation for Single-Image Encoders: A Step Toward Physically Plausible Perception

Marcel Simon, Tae-Ho Kim, Seul-Ki Yeom

Main category: cs.CV

TL;DR: A self-supervised video-distilled image encoder improves visual feature learning by predicting next-frame representations, enhancing performance on ADE20K without complex temporal methods.

DetailsMotivation: Most SSL methods ignore temporal cues in videos, limiting their ability to learn robust visual features with spatial and temporal priors.

Method: A video-distilled single-image encoder is trained to predict next-frame representations from the current frame, avoiding optical flow or tracking.

Result: Pre-training on a 2-hour video boosts mIoU on ADE20K from 35.0 to 36.4, maintaining compatibility with image-only pipelines.

Conclusion: Video self-distillation offers a lightweight way to achieve geometry-aware perception, crucial for realistic world models and Physical AI.

Abstract: Self-supervised image encoders such as DINO have recently gained significant interest for learning robust visual features without labels. However, most SSL methods train on static images and miss the temporal cues inherent in videos. We introduce a video-distilled single-image encoder trained to predict the next-frame representation from the current frame. This simple objective injects 3D spatial and temporal priors without optical flow or tracking. When pre-training on a single 2-hour video, our approach raises the mean Intersection-over-Union (mIoU) on ADE20K from 35.0 (DoRA) to 36.4 while remaining a drop-in replacement for image-only pipelines. Our results highlight video self-distillation as a lightweight route to geometry-aware perception an essential ingredient for physically plausible world models and Physical AI.

[152] HumorDB: Can AI understand graphical humor?

Veedant Jain, Gabriel Kreiman, Felipe dos Santos Alves Feitosa

Main category: cs.CV

TL;DR: The paper introduces HumorDB, a dataset for evaluating AI’s visual humor understanding, revealing gaps between AI and human performance.

DetailsMotivation: To address the challenge of complex scene interpretation, focusing on humor as a test case requiring contextual and cognitive understanding.

Method: Creation of HumorDB with diverse images and contrastive pairs, evaluating humans and AI models on humor tasks.

Result: AI models lag behind humans, especially with abstract sketches and subtle cues; attention maps show misalignment with humor regions.

Conclusion: Advanced architectures are needed for detecting subtle contextual features and bridging visual perception with abstract reasoning.

Abstract: Despite significant advancements in image segmentation and object detection, understanding complex scenes remains a significant challenge. Here, we focus on graphical humor as a paradigmatic example of image interpretation that requires elucidating the interaction of different scene elements in the context of prior cognitive knowledge. This paper introduces \textbf{HumorDB}, a novel, controlled, and carefully curated dataset designed to evaluate and advance visual humor understanding by AI systems. The dataset comprises diverse images spanning photos, cartoons, sketches, and AI-generated content, including minimally contrastive pairs where subtle edits differentiate between humorous and non-humorous versions. We evaluate humans, state-of-the-art vision models, and large vision-language models on three tasks: binary humor classification, funniness rating prediction, and pairwise humor comparison. The results reveal a gap between current AI systems and human-level humor understanding. While pretrained vision-language models perform better than vision-only models, they still struggle with abstract sketches and subtle humor cues. Analysis of attention maps shows that even when models correctly classify humorous images, they often fail to focus on the precise regions that make the image funny. Preliminary mechanistic interpretability studies and evaluation of model explanations provide initial insights into how different architectures process humor. Our results identify promising trends and current limitations, suggesting that an effective understanding of visual humor requires sophisticated architectures capable of detecting subtle contextual features and bridging the gap between visual perception and abstract reasoning. All the code and data are available here: \href{https://github.com/kreimanlab/HumorDB}{https://github.com/kreimanlab/HumorDB}

[153] RemoteReasoner: Towards Unifying Geospatial Reasoning Workflow

Liang Yao, Fan Liu, Hongbo Lu, Chuanyi Zhang, Rui Min, Shengxiang Xu, Shimin Di, Pai Peng

Main category: cs.CV

TL;DR: RemoteReasoner is a flexible workflow for remote sensing tasks, using a multi-modal LLM and reinforcement learning to autonomously handle complex queries and diverse outputs.

DetailsMotivation: Existing remote sensing methods lack autonomy and flexibility, relying on supervised fine-tuning, which limits reasoning capabilities.

Method: Proposes RemoteReasoner, integrating a multi-modal LLM for instruction interpretation and target localization, trained with RL for autonomous reasoning.

Result: Achieves strong performance in multi-granularity tasks (region/pixel-level) and enables novel capabilities like contour extraction.

Conclusion: RemoteReasoner advances remote sensing by enabling autonomous, flexible reasoning without task-specific fine-tuning.

Abstract: Remote sensing imagery presents vast, inherently unstructured spatial data, demanding sophisticated reasoning to interpret complex user intents and contextual relationships beyond simple recognition tasks. In this paper, we aim to construct an Earth observation workflow to handle complex queries by reasoning about spatial context and user intent. As a reasoning workflow, it should be somewhat autonomous, where predefined ground-truth reasoning paths do not constrain the learning process. Furthermore, its architecture ought to be unified yet flexible, enabling the model to perform diverse reasoning tasks with distinct output formats through a single forward pass. Existing remote sensing approaches fail to address these requirements, as they rely on supervised fine-tuning paradigms that constrain the autonomy of reasoning. To this end, we propose RemoteReasoner, a flexible and robust workflow for remote sensing reasoning tasks. The design of RemoteReasoner integrates a multi-modal large language model (MLLM) for interpreting user instructions and localizing targets, together with task adaptation strategies that enable multi-granularity output generation. In contrast to existing methods, our framework is trained with reinforcement learning (RL) to endow the MLLM sufficient autonomy for precise reasoning. At the inference stage, our adaptation strategies enable diverse output formats at inference time without requiring task-specific decoders or further fine-tuning. Preliminary experiments demonstrated that RemoteReasoner achieves remarkable performance across multi-granularity reasoning tasks, including region-level and pixel-level. Additionally, our framework enables novel capabilities such as the contour extraction task beyond the reach of existing reasoning pipelines.

[154] FBSDiff: Plug-and-Play Frequency Band Substitution of Diffusion Features for Highly Controllable Text-Driven Image Translation

Xiang Gao, Jiaying Liu

Main category: cs.CV

TL;DR: The paper introduces a plug-and-play method for text-driven image-to-image translation using pre-trained diffusion models, enhancing controllability without training or fine-tuning.

DetailsMotivation: Current text-to-image diffusion models lack controllability for practical content creation, prompting the need for methods to leverage reference images for better synthesis.

Method: The approach decomposes guiding factors in the DCT spectral space and uses a frequency band substitution layer for dynamic control of reference images in text-to-image generation.

Result: The method achieves high-quality, versatile image translation with flexible control over guiding factors and intensity, outperforming related methods.

Conclusion: The proposed technique offers superior visual quality, versatility, and controllability in image-to-image translation, with publicly available code.

Abstract: Large-scale text-to-image diffusion models have been a revolutionary milestone in the evolution of generative AI and multimodal technology, allowing wonderful image generation with natural-language text prompt. However, the issue of lacking controllability of such models restricts their practical applicability for real-life content creation. Thus, attention has been focused on leveraging a reference image to control text-to-image synthesis, which is also regarded as manipulating (or editing) a reference image as per a text prompt, namely, text-driven image-to-image translation. This paper contributes a novel, concise, and efficient approach that adapts pre-trained large-scale text-to-image (T2I) diffusion model to the image-to-image (I2I) paradigm in a plug-and-play manner, realizing high-quality and versatile text-driven I2I translation without any model training, model fine-tuning, or online optimization process. To guide T2I generation with a reference image, we propose to decompose diverse guiding factors with different frequency bands of diffusion features in the DCT spectral space, and accordingly devise a novel frequency band substitution layer which realizes dynamic control of the reference image to the T2I generation result in a plug-and-play manner. We demonstrate that our method allows flexible control over both guiding factor and guiding intensity of the reference image simply by tuning the type and bandwidth of the substituted frequency band, respectively. Extensive qualitative and quantitative experiments verify superiority of our approach over related methods in I2I translation visual quality, versatility, and controllability. The code is publicly available at: https://github.com/XiangGao1102/FBSDiff.

[155] PINO: Person-Interaction Noise Optimization for Long-Duration and Customizable Motion Generation of Arbitrary-Sized Groups

Sakuya Ota, Qing Yu, Kent Fujiwara, Satoshi Ikehata, Ikuro Sato

Main category: cs.CV

TL;DR: PINO is a training-free framework for generating realistic group interactions by decomposing them into pairwise interactions and using physics-based penalties.

DetailsMotivation: Existing methods for group motion generation rely on shared prompts, limiting nuanced control and simplifying interactions.

Method: PINO decomposes group interactions into pairwise ones, uses pretrained two-person models, and applies physics-based penalties for plausibility.

Result: PINO produces realistic, physically coherent, and customizable multi-person interactions.

Conclusion: PINO is effective for diverse applications like animation, gaming, and robotics without requiring additional training.

Abstract: Generating realistic group interactions involving multiple characters remains challenging due to increasing complexity as group size expands. While existing conditional diffusion models incrementally generate motions by conditioning on previously generated characters, they rely on single shared prompts, limiting nuanced control and leading to overly simplified interactions. In this paper, we introduce Person-Interaction Noise Optimization (PINO), a novel, training-free framework designed for generating realistic and customizable interactions among groups of arbitrary size. PINO decomposes complex group interactions into semantically relevant pairwise interactions, and leverages pretrained two-person interaction diffusion models to incrementally compose group interactions. To ensure physical plausibility and avoid common artifacts such as overlapping or penetration between characters, PINO employs physics-based penalties during noise optimization. This approach allows precise user control over character orientation, speed, and spatial relationships without additional training. Comprehensive evaluations demonstrate that PINO generates visually realistic, physically coherent, and adaptable multi-person interactions suitable for diverse animation, gaming, and robotics applications.

[156] ABCD: Automatic Blood Cell Detection via Attention-Guided Improved YOLOX

Ahmed Endris Hasen, Yang Shangming, Chiagoziem C. Ukwuoma, Biniyam Gashaw, Abel Zenebe Yutra

Main category: cs.CV

TL;DR: The paper proposes an automatic blood cell detection method (ABCD) using an improved YOLOX model, achieving higher accuracy and speed than existing methods.

DetailsMotivation: Manual blood cell analysis is time-consuming and error-prone, necessitating automated solutions for efficiency and accuracy.

Method: The method integrates CBAM for feature extraction, ASFF for feature fusion, and CIOU loss for faster convergence in an improved YOLOX framework.

Result: ABCD achieved 95.49% mAP@0.5 and 86.89% mAP@0.5-0.9, outperforming baselines by 2.8% and 23.41%, with a 2.9% speed increase.

Conclusion: The proposed ABCD method is highly effective for real-time blood cell detection, offering superior performance over existing techniques.

Abstract: Detection of blood cells in microscopic images has become a major focus of medical image analysis, playing a crucial role in gaining valuable insights into a patient’s health. Manual blood cell checks for disease detection are known to be time-consuming, inefficient, and error-prone. To address these limitations, analyzing blood cells using deep learning-based object detectors can be regarded as a feasible solution. In this study, we propose automatic blood cell detection method (ABCD) based on an improved version of YOLOX, an object detector, for detecting various types of blood cells, including white blood cells, red blood cells, and platelets. Firstly, we introduce the Convolutional Block Attention Module (CBAM) into the network’s backbone to enhance the efficiency of feature extraction. Furthermore, we introduce the Adaptively Spatial Feature Fusion (ASFF) into the network’s neck, which optimizes the fusion of different features extracted from various stages of the network. Finally, to speed up the model’s convergence, we substitute the Intersection over Union (IOU) loss function with the Complete Intersection over Union (CIOU) loss function. The experimental results demonstrate that the proposed method is more effective than other existing methods for BCCD dataset. Compared to the baseline algorithm, our method ABCD achieved 95.49 % mAP@0.5 and 86.89 % mAP@0.5-0.9, which are 2.8% and 23.41% higher, respectively, and increased the detection speed by 2.9%, making it highly efficient for real-time applications.

[157] ObjectRelator: Enabling Cross-View Object Relation Understanding Across Ego-Centric and Exo-Centric Perspectives

Yuqian Fu, Runze Wang, Bin Ren, Guolei Sun, Biao Gong, Yanwei Fu, Danda Pani Paudel, Xuanjing Huang, Luc Van Gool

Main category: cs.CV

TL;DR: ObjectRelator improves cross-view object segmentation by integrating language cues and self-supervised alignment, outperforming PSALM on Ego-Exo4D and HANDAL-X.

DetailsMotivation: Addressing the challenge of accurately locating and segmenting objects across ego-exo views, especially in complex backgrounds or with appearance changes.

Method: Proposes ObjectRelator with Multimodal Condition Fusion (MCFuse) for language integration and SSL-based Cross-View Object Alignment (XObjAlign) for consistency.

Result: Achieves state-of-the-art performance on Ego-Exo4D and HANDAL-X benchmarks.

Conclusion: ObjectRelator effectively bridges the gap between ego-exo views through innovative multimodal and self-supervised techniques.

Abstract: Bridging the gap between ego-centric and exo-centric views has been a long-standing question in computer vision. In this paper, we focus on the emerging Ego-Exo object correspondence task, which aims to understand object relations across ego-exo perspectives through segmentation. While numerous segmentation models have been proposed, most operate on a single image (view), making them impractical for cross-view scenarios. PSALM, a recently proposed segmentation method, stands out as a notable exception with its demonstrated zero-shot ability on this task. However, due to the drastic viewpoint change between ego and exo, PSALM fails to accurately locate and segment objects, especially in complex backgrounds or when object appearances change significantly. To address these issues, we propose ObjectRelator, a novel approach featuring two key modules: Multimodal Condition Fusion (MCFuse) and SSL-based Cross-View Object Alignment (XObjAlign). MCFuse introduces language as an additional cue, integrating both visual masks and textual descriptions to improve object localization and prevent incorrect associations. XObjAlign enforces cross-view consistency through self-supervised alignment, enhancing robustness to object appearance variations. Extensive experiments demonstrate ObjectRelator’s effectiveness on the large-scale Ego-Exo4D benchmark and HANDAL-X (an adapted dataset for cross-view segmentation) with state-of-the-art performance. Code is made available at: http://yuqianfu.com/ObjectRelator.

[158] EffiComm: Bandwidth Efficient Multi Agent Communication

Melih Yazgan, Allen Xavier Arasan, J. Marius Zöllner

Main category: cs.CV

TL;DR: EffiComm reduces V2V communication data by 60% while maintaining high 3D object detection accuracy using selective transmission and adaptive grid reduction.

DetailsMotivation: Overcoming V2V communication overload from raw data transmission in collaborative perception.

Method: Two-stage reduction pipeline: Selective Transmission (ST) and Adaptive Grid Reduction (AGR) with a GNN, fused via MoE attention.

Result: Achieves 0.84 mAP@0.7 with ~1.5 MB/frame, outperforming prior methods.

Conclusion: EffiComm demonstrates adaptive, learned communication’s value for scalable V2X perception.

Abstract: Collaborative perception allows connected vehicles to exchange sensor information and overcome each vehicle’s blind spots. Yet transmitting raw point clouds or full feature maps overwhelms Vehicle-to-Vehicle (V2V) communications, causing latency and scalability problems. We introduce EffiComm, an end-to-end framework that transmits less than 40% of the data required by prior art while maintaining state-of-the-art 3D object detection accuracy. EffiComm operates on Bird’s-Eye-View (BEV) feature maps from any modality and applies a two-stage reduction pipeline: (1) Selective Transmission (ST) prunes low-utility regions with a confidence mask; (2) Adaptive Grid Reduction (AGR) uses a Graph Neural Network (GNN) to assign vehicle-specific keep ratios according to role and network load. The remaining features are fused with a soft-gated Mixture-of-Experts (MoE) attention layer, offering greater capacity and specialization for effective feature integration. On the OPV2V benchmark, EffiComm reaches 0.84 mAP@0.7 while sending only an average of approximately 1.5 MB per frame, outperforming previous methods on the accuracy-per-bit curve. These results highlight the value of adaptive, learned communication for scalable Vehicle-to-Everything (V2X) perception.

[159] Tell Me What to Track: Infusing Robust Language Guidance for Enhanced Referring Multi-Object Tracking

Wenjun Huang, Yang Ni, Hanning Chen, Yirui He, Ian Bryant, Yezi Liu, Mohsen Imani

Main category: cs.CV

TL;DR: The paper introduces a collaborative matching strategy for referring multi-object tracking (RMOT) to address data imbalance and improve newborn target detection, enhancing cross-modal fusion and referring guidance.

DetailsMotivation: Prior studies overlook imbalanced data distribution between newborn and existing targets and struggle with indirect multi-modal feature fusion, limiting newborn target detection.

Method: Proposes a collaborative matching strategy, integrates cross-modal and multi-scale fusion in the encoder, and develops referring-infused adaptation in the decoder.

Result: The model achieves a 3.42% performance improvement over prior works.

Conclusion: The proposed designs effectively address the challenges in RMOT, demonstrating superior performance.

Abstract: Referring multi-object tracking (RMOT) is an emerging cross-modal task that aims to localize an arbitrary number of targets based on a language expression and continuously track them in a video. This intricate task involves reasoning on multi-modal data and precise target localization with temporal association. However, prior studies overlook the imbalanced data distribution between newborn targets and existing targets due to the nature of the task. In addition, they only indirectly fuse multi-modal features, struggling to deliver clear guidance on newborn target detection. To solve the above issues, we conduct a collaborative matching strategy to alleviate the impact of the imbalance, boosting the ability to detect newborn targets while maintaining tracking performance. In the encoder, we integrate and enhance the cross-modal and multi-scale fusion, overcoming the bottlenecks in previous work, where limited multi-modal information is shared and interacted between feature maps. In the decoder, we also develop a referring-infused adaptation that provides explicit referring guidance through the query tokens. The experiments showcase the superior performance of our model (+3.42%) compared to prior works, demonstrating the effectiveness of our designs.

[160] CircuitProbe: Dissecting Spatiotemporal Visual Semantics with Circuit Tracing

Yiming Zhang, Chengzhang Yu, Zhuokai Zhao, Kun Wang, Qiankun Li, Zihan Chen, Yang Liu, Zenghui Ding, Yining Sun

Main category: cs.CV

TL;DR: The paper introduces a circuit-based framework to study spatiotemporal understanding in large vision-language models (LVLMs), revealing localized visual semantics and specialized functional layers.

DetailsMotivation: To understand the internal reasoning mechanisms of LVLMs for spatiotemporal understanding, which remains poorly explored.

Method: A systematic framework with three circuits: visual auditing, semantic tracing, and attention flow, to analyze visual semantics and model layers.

Result: Visual semantics are localized to specific object tokens (removal degrades performance by 92.6%), and middle-to-late layers refine interpretable concepts for spatiotemporal semantics.

Conclusion: The findings provide mechanistic insights into LVLMs’ spatiotemporal analysis, aiding the design of more robust and interpretable models.

Abstract: The processing mechanisms underlying language and image understanding in large vision-language models (LVLMs) have been extensively studied. However, the internal reasoning mechanisms of LVLMs for spatiotemporal understanding remain poorly understood. In this work, we introduce a systematic, circuit-based framework designed to investigate how spatiotemporal visual semantics are represented and processed within these LVLMs. Specifically, our framework comprises three circuits: visual auditing circuit, semantic tracing circuit, and attention flow circuit. Through the lens of these circuits, we discover that visual semantics are highly localized to specific object tokens–removing these tokens can degrade model performance by up to 92.6%. Furthermore, we identify that interpretable concepts of objects and actions emerge and become progressively refined in the middle-to-late layers of LVLMs. In contrary to the current works that solely focus on objects in one image, we reveal that the middle-to-late layers of LVLMs exhibit specialized functional localization for spatiotemporal semantics. Our findings offer significant mechanistic insights into spatiotemporal semantics analysis of LVLMs, laying a foundation for designing more robust and interpretable models.

[161] SemGes: Semantics-aware Co-Speech Gesture Generation using Semantic Coherence and Relevance Learning

Lanmiao Liu, Esam Ghaleb, Aslı Özyürek, Zerrin Yumak

Main category: cs.CV

TL;DR: A novel approach for generating semantically coherent gestures aligned with speech, outperforming state-of-the-art methods in realism and coherence.

DetailsMotivation: Existing gesture generation research lacks semantic context, focusing mainly on rhythmic beat gestures. This paper aims to integrate semantic information for more realistic and coherent gestures.

Method: Uses a vector-quantized variational autoencoder for motion prior learning, followed by a module generating gestures from speech, text semantics, and speaker identity, ensuring semantic coherence.

Result: Outperforms state-of-the-art methods in objective and subjective metrics, enhancing gesture realism and coherence.

Conclusion: The proposed method successfully integrates semantic grounding, improving co-speech gesture generation.

Abstract: Creating a virtual avatar with semantically coherent gestures that are aligned with speech is a challenging task. Existing gesture generation research mainly focused on generating rhythmic beat gestures, neglecting the semantic context of the gestures. In this paper, we propose a novel approach for semantic grounding in co-speech gesture generation that integrates semantic information at both fine-grained and global levels. Our approach starts with learning the motion prior through a vector-quantized variational autoencoder. Built on this model, a second-stage module is applied to automatically generate gestures from speech, text-based semantics and speaker identity that ensures consistency between the semantic relevance of generated gestures and co-occurring speech semantics through semantic coherence and relevance modules. Experimental results demonstrate that our approach enhances the realism and coherence of semantic gestures. Extensive experiments and user studies show that our method outperforms state-of-the-art approaches across two benchmarks in co-speech gesture generation in both objective and subjective metrics. The qualitative results of our model, code, dataset and pre-trained models can be viewed at https://semgesture.github.io/.

[162] EA-ViT: Efficient Adaptation for Elastic Vision Transformer

Chen Zhu, Wangbo Zhao, Huiwen Zhang, Samir Khaki, Yuhao Zhou, Weidong Tang, Shuo Wang, Zhihang Yuan, Yuzhang Shang, Xiaojiang Peng, Kai Wang, Dawei Yang

Main category: cs.CV

TL;DR: EA-ViT proposes a framework for adapting Vision Transformers (ViTs) to diverse resource constraints without retraining multiple models, using a nested elastic architecture and a lightweight router.

DetailsMotivation: Deploying ViTs for varying resource constraints typically requires retraining multiple models, which is inefficient and resource-intensive.

Method: The approach involves a two-stage process: (1) enhancing a pre-trained ViT with a nested elastic architecture for structural flexibility, and (2) designing a lightweight router to select submodels based on computational budgets and task demands.

Result: EA-ViT demonstrates effectiveness and versatility across multiple benchmarks, enabling efficient adaptation of ViTs.

Conclusion: The proposed framework offers a scalable and efficient solution for deploying ViTs under diverse resource constraints, reducing the need for multiple retraining processes.

Abstract: Vision Transformers (ViTs) have emerged as a foundational model in computer vision, excelling in generalization and adaptation to downstream tasks. However, deploying ViTs to support diverse resource constraints typically requires retraining multiple, size-specific ViTs, which is both time-consuming and energy-intensive. To address this issue, we propose an efficient ViT adaptation framework that enables a single adaptation process to generate multiple models of varying sizes for deployment on platforms with various resource constraints. Our approach comprises two stages. In the first stage, we enhance a pre-trained ViT with a nested elastic architecture that enables structural flexibility across MLP expansion ratio, number of attention heads, embedding dimension, and network depth. To preserve pre-trained knowledge and ensure stable adaptation, we adopt a curriculum-based training strategy that progressively increases elasticity. In the second stage, we design a lightweight router to select submodels according to computational budgets and downstream task demands. Initialized with Pareto-optimal configurations derived via a customized NSGA-II algorithm, the router is then jointly optimized with the backbone. Extensive experiments on multiple benchmarks demonstrate the effectiveness and versatility of EA-ViT. The code is available at https://github.com/zcxcf/EA-ViT.

[163] FrameFusion: Combining Similarity and Importance for Video Token Reduction on Large Vision Language Models

Tianyu Fu, Tengxuan Liu, Qinghao Han, Guohao Dai, Shengen Yan, Huazhong Yang, Xuefei Ning, Yu Wang

Main category: cs.CV

TL;DR: FrameFusion reduces visual tokens in LVLMs by 70% via similarity-based merging and importance-based pruning, achieving 1.6-3.6x speedups with minimal performance impact.

DetailsMotivation: Existing token reduction methods fail to address redundancy from similar adjacent frames, limiting efficiency in processing long, high-resolution videos.

Method: FrameFusion integrates similarity-based merging (focusing on spatially corresponding tokens) with pruning, leveraging insights on token similarity across layers.

Result: Reduces tokens by 70%, speeds up processing 1.6-3.6x, and maintains performance with <3% impact on benchmarks.

Conclusion: FrameFusion effectively balances efficiency and performance for LVLMs in video tasks.

Abstract: The increasing demand to process long and high-resolution videos significantly burdens Large Vision-Language Models (LVLMs) due to the enormous number of visual tokens. Existing token reduction methods primarily prune tokens based on importance metrics, such as cumulative attention scores. However, even important tokens may exhibit high redundancy caused by similarity among adjacent video frames and repetitive visual elements. To address this limitation, we propose FrameFusion, a novel token reduction approach integrating similarity-based merging with importance-based pruning. We conduct a thorough study on token similarity characteristics, revealing three key insights: (1) spatially corresponding visual tokens between adjacent frames have higher cosine similarities compared to other token pairs; (2) high token similarities prominently decrease in deeper model layers; and (3) token similarity rankings are highly consistent across different layers. Guided by these observations, FrameFusion computes token similarities exclusively between corresponding visual tokens from adjacent frames, applies token merging at initial successive layers followed by pruning in deeper layers, and adopts a cascaded merging strategy to further enhance efficiency. We evaluate FrameFusion comprehensively across six diverse LVLMs, ranging from 2B to 72B parameters, using five video benchmarks encompassing video retrieval, question-answering, and spatial-temporal understanding tasks. Experiments show that FrameFusion reduces visual tokens by 70%, achieving 1.6-3.6x end-to-end speedups, with an average performance impact of less than 3%. Our code is available at: https://github.com/thu-nics/FrameFusion.

[164] BEV-LLM: Leveraging Multimodal BEV Maps for Scene Captioning in Autonomous Driving

Felix Brandstaetter, Erik Schuetz, Katharina Winter, Fabian Flohr

Main category: cs.CV

TL;DR: BEV-LLM is a lightweight 3D captioning model for autonomous driving scenes, combining LiDAR and multi-view images, achieving competitive performance and introducing two new datasets.

DetailsMotivation: Enhancing transparency, safety, and human-AI interaction in autonomous driving through interpretable scene descriptions.

Method: BEV-LLM leverages BEVFusion for 3D LiDAR and multi-view image fusion, using a novel absolute positional encoding for view-specific descriptions.

Result: Achieves competitive performance on nuCaption dataset (5% higher BLEU scores) and introduces nuView and GroundView datasets.

Conclusion: BEV-LLM advances scene captioning for autonomous driving, with new datasets addressing benchmark gaps.

Abstract: Autonomous driving technology has the potential to transform transportation, but its wide adoption depends on the development of interpretable and transparent decision-making systems. Scene captioning, which generates natural language descriptions of the driving environment, plays a crucial role in enhancing transparency, safety, and human-AI interaction. We introduce BEV-LLM, a lightweight model for 3D captioning of autonomous driving scenes. BEV-LLM leverages BEVFusion to combine 3D LiDAR point clouds and multi-view images, incorporating a novel absolute positional encoding for view-specific scene descriptions. Despite using a small 1B parameter base model, BEV-LLM achieves competitive performance on the nuCaption dataset, surpassing state-of-the-art by up to 5% in BLEU scores. Additionally, we release two new datasets - nuView (focused on environmental conditions and viewpoints) and GroundView (focused on object grounding) - to better assess scene captioning across diverse driving scenarios and address gaps in current benchmarks, along with initial benchmarking results demonstrating their effectiveness.

[165] Fast Learning of Non-Cooperative Spacecraft 3D Models through Primitive Initialization

Pol Francesch Huc, Emily Bates, Simone D’Amico

Main category: cs.CV

TL;DR: A CNN-based pipeline reduces the training cost and pose dependency of 3DGS for space applications, enabling high-fidelity 3D models with noisy or implicit poses.

DetailsMotivation: Current methods like NeRF and 3DGS require precise poses and are computationally expensive, limiting their use in space applications.

Method: A CNN initializes 3DGS with a coarse 3D model and pose estimates, reducing training iterations and input images. Variants of the CNN handle noisy or implicit poses.

Result: The pipeline achieves high-fidelity 3D models with significantly reduced training cost and pose requirements.

Conclusion: This work enables novel view synthesis in space applications by addressing pose and computational limitations.

Abstract: The advent of novel view synthesis techniques such as NeRF and 3D Gaussian Splatting (3DGS) has enabled learning precise 3D models only from posed monocular images. Although these methods are attractive, they hold two major limitations that prevent their use in space applications: they require poses during training, and have high computational cost at training and inference. To address these limitations, this work contributes: (1) a Convolutional Neural Network (CNN) based primitive initializer for 3DGS using monocular images; (2) a pipeline capable of training with noisy or implicit pose estimates; and (3) and analysis of initialization variants that reduce the training cost of precise 3D models. A CNN takes a single image as input and outputs a coarse 3D model represented as an assembly of primitives, along with the target’s pose relative to the camera. This assembly of primitives is then used to initialize 3DGS, significantly reducing the number of training iterations and input images needed – by at least an order of magnitude. For additional flexibility, the CNN component has multiple variants with different pose estimation techniques. This work performs a comparison between these variants, evaluating their effectiveness for downstream 3DGS training under noisy or implicit pose estimates. The results demonstrate that even with imperfect pose supervision, the pipeline is able to learn high-fidelity 3D representations, opening the door for the use of novel view synthesis in space applications.

[166] Modality Agnostic Efficient Long Range Encoder

Toufiq Parag, Ahmed Elgammal

Main category: cs.CV

TL;DR: MAELRE is a transformer architecture for efficient long-range encoding across modalities, reducing memory and computational costs while maintaining accuracy.

DetailsMotivation: Addressing the quadratic complexity of attention mechanisms in long-context processing on single devices without modality-specific tradeoffs.

Method: MAELRE combines token merging with attention approximation, switching between lightweight and standard attention as tokens reduce.

Result: Superior accuracy and reduced computational cost on multi-modal classification tasks compared to existing models.

Conclusion: MAELRE offers a unified, efficient solution for long-context processing across diverse modalities.

Abstract: The long-context capability of recent large transformer models can be surmised to rely on techniques such as attention/model parallelism, as well as hardware-level optimizations. While these strategies allow input lengths to scale to millions of tokens, they do not fundamentally mitigate the quadratic computational and memory complexity of the core attention mechanism. In this paper, we address the challenge of long-context processing on a single device using generic implementations by reducing the quadratic memory footprint and inference cost. Existing approaches to extend the context length for generic single device implementations – such as token merging and modified attentions – are often modality specific and attain a suboptimal tradeoff between accuracy and efficiency. To overcome these limitations, we propose MAELRE (Modality Agnostic Efficient Long Range Encoder), a unified and efficient transformer architecture designed for long-range encoding across diverse modalities. MAELRE integrates token merging with attention approximation, progressively merging tokens at different stages of internal computational blocks. It employs a lightweight attention approximation when the number of tokens is large, and switches to standard dot-product attention as the sequence becomes shorter through successive aggregation. We demonstrate that MAELRE achieves superior accuracy while reducing computational cost compared to existing long-context models on classification tasks spanning multiple modalities, including text, time series, audio, and vision.

[167] Verbalized Representation Learning for Interpretable Few-Shot Generalization

Cheng-Fu Yang, Da Yin, Wenbo Hu, Nanyun Peng, Bolei Zhou, Kai-Wei Chang

Main category: cs.CV

TL;DR: VRL (Verbalized Representation Learning) improves few-shot object recognition by using natural language features from a Vision-Language Model, outperforming prior methods with less data.

DetailsMotivation: Humans recognize objects with few examples due to language understanding; VRL aims to replicate this by creating interpretable verbalized features.

Method: Uses a Vision-Language Model to extract natural language features (inter-class differences and intra-class commonalities) and maps them to numeric vectors for downstream tasks.

Result: VRL achieves 24% better performance than state-of-the-art methods with 95% less data and a smaller model, and 20% better than human-labeled attributes.

Conclusion: VRL demonstrates the effectiveness of verbalized features for few-shot learning, offering interpretability and efficiency.

Abstract: Humans recognize objects after observing only a few examples, a remarkable capability enabled by their inherent language understanding of the real-world environment. Developing verbalized and interpretable representation can significantly improve model generalization in low-data settings. In this work, we propose Verbalized Representation Learning (VRL), a novel approach for automatically extracting human-interpretable features for object recognition using few-shot data. Our method uniquely captures inter-class differences and intra-class commonalities in the form of natural language by employing a Vision-Language Model (VLM) to identify key discriminative features between different classes and shared characteristics within the same class. These verbalized features are then mapped to numeric vectors through the VLM. The resulting feature vectors can be further utilized to train and infer with downstream classifiers. Experimental results show that, at the same model scale, VRL achieves a 24% absolute improvement over prior state-of-the-art methods while using 95% less data and a smaller mode. Furthermore, compared to human-labeled attributes, the features learned by VRL exhibit a 20% absolute gain when used for downstream classification tasks. Code is available at: https://github.com/joeyy5588/VRL/tree/main.

[168] DEFNet: Multitasks-based Deep Evidential Fusion Network for Blind Image Quality Assessment

Yiwei Lou, Yuanpeng He, Rongchao Zhang, Yongzhi Cao, Hanpin Wang, Yu Huang

Main category: cs.CV

TL;DR: Proposes DEFNet, a multitask-based Deep Evidential Fusion Network for BIQA, integrating scene/distortion classification and uncertainty estimation for improved performance.

DetailsMotivation: Existing BIQA methods lack sufficient integration and flexible uncertainty estimation, leading to suboptimal results.

Method: DEFNet combines multitask optimization (scene/distortion classification) with a trustworthy fusion strategy and evidential learning for uncertainty estimation.

Result: Demonstrates effectiveness and robustness on synthetic and authentic distortion datasets, with strong generalization to unseen scenarios.

Conclusion: DEFNet advances BIQA by integrating multitask learning and uncertainty estimation, enhancing performance and adaptability.

Abstract: Blind image quality assessment (BIQA) methods often incorporate auxiliary tasks to improve performance. However, existing approaches face limitations due to insufficient integration and a lack of flexible uncertainty estimation, leading to suboptimal performance. To address these challenges, we propose a multitasks-based Deep Evidential Fusion Network (DEFNet) for BIQA, which performs multitask optimization with the assistance of scene and distortion type classification tasks. To achieve a more robust and reliable representation, we design a novel trustworthy information fusion strategy. It first combines diverse features and patterns across sub-regions to enhance information richness, and then performs local-global information fusion by balancing fine-grained details with coarse-grained context. Moreover, DEFNet exploits advanced uncertainty estimation technique inspired by evidential learning with the help of normal-inverse gamma distribution mixture. Extensive experiments on both synthetic and authentic distortion datasets demonstrate the effectiveness and robustness of the proposed framework. Additional evaluation and analysis are carried out to highlight its strong generalization capability and adaptability to previously unseen scenarios.

[169] GS-Occ3D: Scaling Vision-only Occupancy Reconstruction for Autonomous Driving with Gaussian Splatting

Baijun Ye, Minghui Qin, Saining Zhang, Moonjun Gong, Shaoting Zhu, Zebang Shen, Luan Zhang, Lu Zhang, Hao Zhao, Hang Zhao

Main category: cs.CV

TL;DR: GS-Occ3D is a scalable vision-only framework for occupancy reconstruction in autonomous driving, addressing challenges like sparse viewpoints and occlusions with an Octree-based Gaussian Surfel formulation.

DetailsMotivation: Existing LiDAR-based occupancy methods limit scalability and exclude crowdsourced data. Vision-only approaches face challenges like incomplete geometry and post-processing needs.

Method: GS-Occ3D uses an Octree-based Gaussian Surfel formulation for explicit occupancy representation, decomposing scenes into static background, ground, and dynamic objects for tailored modeling.

Result: Achieves state-of-the-art geometry reconstruction on Waymo dataset and shows superior zero-shot generalization on Occ3D-nuScenes.

Conclusion: Demonstrates the potential of large-scale vision-based occupancy reconstruction as a new paradigm for autonomous driving perception.

Abstract: Occupancy is crucial for autonomous driving, providing essential geometric priors for perception and planning. However, existing methods predominantly rely on LiDAR-based occupancy annotations, which limits scalability and prevents leveraging vast amounts of potential crowdsourced data for auto-labeling. To address this, we propose GS-Occ3D, a scalable vision-only framework that directly reconstructs occupancy. Vision-only occupancy reconstruction poses significant challenges due to sparse viewpoints, dynamic scene elements, severe occlusions, and long-horizon motion. Existing vision-based methods primarily rely on mesh representation, which suffer from incomplete geometry and additional post-processing, limiting scalability. To overcome these issues, GS-Occ3D optimizes an explicit occupancy representation using an Octree-based Gaussian Surfel formulation, ensuring efficiency and scalability. Additionally, we decompose scenes into static background, ground, and dynamic objects, enabling tailored modeling strategies: (1) Ground is explicitly reconstructed as a dominant structural element, significantly improving large-area consistency; (2) Dynamic vehicles are separately modeled to better capture motion-related occupancy patterns. Extensive experiments on the Waymo dataset demonstrate that GS-Occ3D achieves state-of-the-art geometry reconstruction results. By curating vision-only binary occupancy labels from diverse urban scenes, we show their effectiveness for downstream occupancy models on Occ3D-Waymo and superior zero-shot generalization on Occ3D-nuScenes. It highlights the potential of large-scale vision-based occupancy reconstruction as a new paradigm for autonomous driving perception. Project Page: https://gs-occ3d.github.io/

[170] Back to the Features: DINO as a Foundation for Video World Models

Federico Baldassarre, Marc Szafraniec, Basile Terver, Vasil Khalidov, Francisco Massa, Yann LeCun, Patrick Labatut, Maximilian Seitzer, Piotr Bojanowski

Main category: cs.CV

TL;DR: DINO-world is a generalist video world model using DINOv2’s latent space for future frame prediction, excelling in benchmarks and intuitive physics understanding.

DetailsMotivation: To develop a versatile video world model capable of predicting future frames and understanding diverse scenes, leveraging pre-trained encoders and large-scale video data.

Method: Uses DINOv2’s pre-trained image encoder and trains a future predictor on uncurated video data. Fine-tunes on observation-action trajectories for action-conditioned planning.

Result: Outperforms prior models in video prediction benchmarks like segmentation and depth forecasting, and shows strong intuitive physics understanding.

Conclusion: DINO-world is effective for video prediction and planning, demonstrating versatility and performance in diverse scenarios.

Abstract: We present DINO-world, a powerful generalist video world model trained to predict future frames in the latent space of DINOv2. By leveraging a pre-trained image encoder and training a future predictor on a large-scale uncurated video dataset, DINO-world learns the temporal dynamics of diverse scenes, from driving and indoor scenes to simulated environments. We show that DINO-world outperforms previous models on a variety of video prediction benchmarks, e.g. segmentation and depth forecasting, and demonstrates strong understanding of intuitive physics. Furthermore, we show that it is possible to fine-tune the predictor on observation-action trajectories. The resulting action-conditioned world model can be used for planning by simulating candidate trajectories in latent space.

[171] Efficient Lines Detection for Robot Soccer

João G. Melo, João P. Mafaldo, Edna Barros

Main category: cs.CV

TL;DR: A lightweight method using ELSED and PSO for efficient soccer field line detection, achieving high accuracy and speed for real-time robot soccer applications.

DetailsMotivation: Accurate self-localization in robot soccer requires reliable detection of field lines, necessitating efficient and fast methods suitable for low-power platforms.

Method: Combines ELSED algorithm with RGB color transition analysis for line detection and uses PSO for threshold calibration with minimal annotated samples.

Result: Matches state-of-the-art deep learning accuracy while offering faster processing, ideal for real-time use.

Conclusion: The method is effective for real-time soccer field line detection on low-power robotic platforms.

Abstract: Self-localization is essential in robot soccer, where accurate detection of visual field features, such as lines and boundaries, is critical for reliable pose estimation. This paper presents a lightweight and efficient method for detecting soccer field lines using the ELSED algorithm, extended with a classification step that analyzes RGB color transitions to identify lines belonging to the field. We introduce a pipeline based on Particle Swarm Optimization (PSO) for threshold calibration to optimize detection performance, requiring only a small number of annotated samples. Our approach achieves accuracy comparable to a state-of-the-art deep learning model while offering higher processing speed, making it well-suited for real-time applications on low-power robotic platforms.

[172] Geometric Origins of Bias in Deep Neural Networks: A Human Visual System Perspective

Yanbiao Ma, Bowei Liu, Andi Zhang

Main category: cs.CV

TL;DR: The paper explores bias in DNNs through a geometric framework, linking class-specific perceptual manifold complexity to bias, and introduces a toolkit for analysis.

DetailsMotivation: Understanding bias formation in DNNs is critical for fairness and reliability in AI, inspired by the human visual system's hierarchical processing.

Method: A geometric analysis framework is proposed to study the complexity of perceptual manifolds in DNNs, supported by the Perceptual-Manifold-Geometry library.

Result: Differences in geometric complexity of manifolds lead to recognition biases across categories. The toolkit has been widely adopted (4,500+ downloads).

Conclusion: The work offers a geometric perspective on bias in AI, providing a foundation for more equitable and robust systems.

Abstract: Bias formation in deep neural networks (DNNs) remains a critical yet poorly understood challenge, influencing both fairness and reliability in artificial intelligence systems. Inspired by the human visual system, which decouples object manifolds through hierarchical processing to achieve object recognition, we propose a geometric analysis framework linking the geometric complexity of class-specific perceptual manifolds in DNNs to model bias. Our findings reveal that differences in geometric complexity can lead to varying recognition capabilities across categories, introducing biases. To support this analysis, we present the Perceptual-Manifold-Geometry library, designed for calculating the geometric properties of perceptual manifolds. The toolkit has been downloaded and installed over 4,500 times. This work provides a novel geometric perspective on bias formation in modern learning systems and lays a theoretical foundation for developing more equitable and robust artificial intelligence.

[173] DINO-SLAM: DINO-informed RGB-D SLAM for Neural Implicit and Explicit Representations

Ziren Gong, Xiaohan Li, Fabio Tosi, Youmin Zhang, Stefano Mattoccia, Jun Wu, Matteo Poggi

Main category: cs.CV

TL;DR: DINO-SLAM enhances SLAM systems using DINO features for better scene representation in NeRF and 3DGS, outperforming state-of-the-art methods.

DetailsMotivation: To improve scene representation in SLAM systems by integrating DINO features for hierarchical and structural understanding.

Method: Uses a Scene Structure Encoder (SSE) to create Enhanced DINO (EDINO) features, applied in NeRF and 3DGS SLAM pipelines.

Result: Achieves superior performance on Replica, ScanNet, and TUM datasets compared to existing methods.

Conclusion: DINO-SLAM effectively enhances SLAM systems with DINO-informed features, demonstrating significant improvements.

Abstract: This paper presents DINO-SLAM, a DINO-informed design strategy to enhance neural implicit (Neural Radiance Field – NeRF) and explicit representations (3D Gaussian Splatting – 3DGS) in SLAM systems through more comprehensive scene representations. Purposely, we rely on a Scene Structure Encoder (SSE) that enriches DINO features into Enhanced DINO ones (EDINO) to capture hierarchical scene elements and their structural relationships. Building upon it, we propose two foundational paradigms for NeRF and 3DGS SLAM systems integrating EDINO features. Our DINO-informed pipelines achieve superior performance on the Replica, ScanNet, and TUM compared to state-of-the-art methods.

[174] HairCUP: Hair Compositional Universal Prior for 3D Gaussian Avatars

Byungjun Kim, Shunsuke Saito, Giljoo Nam, Tomas Simon, Jason Saragih, Hanbyul Joo, Junxuan Li

Main category: cs.CV

TL;DR: A prior model for 3D head avatars with explicit face-hair compositionality, enabling flexible swapping and few-shot fine-tuning.

DetailsMotivation: Existing holistic models struggle to disentangle face and hair representations and lack flexibility for applications like swapping.

Method: Separate latent spaces for face and hair, trained using synthetic hairless data and paired datasets.

Result: Enables seamless face-hair transfer and few-shot fine-tuning for high-fidelity avatars.

Conclusion: The model’s compositionality and flexibility advance practical 3D avatar generation.

Abstract: We present a universal prior model for 3D head avatars with explicit hair compositionality. Existing approaches to build generalizable priors for 3D head avatars often adopt a holistic modeling approach, treating the face and hair as an inseparable entity. This overlooks the inherent compositionality of the human head, making it difficult for the model to naturally disentangle face and hair representations, especially when the dataset is limited. Furthermore, such holistic models struggle to support applications like 3D face and hairstyle swapping in a flexible and controllable manner. To address these challenges, we introduce a prior model that explicitly accounts for the compositionality of face and hair, learning their latent spaces separately. A key enabler of this approach is our synthetic hairless data creation pipeline, which removes hair from studio-captured datasets using estimated hairless geometry and texture derived from a diffusion prior. By leveraging a paired dataset of hair and hairless captures, we train disentangled prior models for face and hair, incorporating compositionality as an inductive bias to facilitate effective separation. Our model’s inherent compositionality enables seamless transfer of face and hair components between avatars while preserving identity. Additionally, we demonstrate that our model can be fine-tuned in a few-shot manner using monocular captures to create high-fidelity, hair-compositional 3D head avatars for unseen subjects. These capabilities highlight the practical applicability of our approach in real-world scenarios, paving the way for flexible and expressive 3D avatar generation.

[175] Semi-autonomous Prosthesis Control Using Minimal Depth Information and Vibrotactile Feedback

Miguel Nobre Castro, Strahinja Dosen

Main category: cs.CV

TL;DR: A semi-autonomous prosthesis controller using minimal depth data (four laser scanner lines) instead of full point clouds was developed, showing slightly lower but promising performance compared to full-depth controllers.

DetailsMotivation: To address the computational challenges of full-depth data in embedded prosthesis controllers by simplifying depth sensing.

Method: Reconstruct object shapes from four laser scanner lines using simple geometry, implemented with a depth sensor and vibrotactile feedback for aiming.

Result: The novel controller successfully handled all test objects, with performance improving over training but remaining slightly below the benchmark (full-depth controller).

Conclusion: The study advances compact vision-based systems for embedded depth sensing in prostheses, balancing performance and computational efficiency.

Abstract: Semi-autonomous prosthesis controllers based on computer vision improve performance while reducing cognitive effort. However, controllers relying on full-depth data face challenges in being deployed as embedded prosthesis controllers due to the computational demands of processing point clouds. To address this, the present study proposes a method to reconstruct the shape of various daily objects from minimal depth data. This is achieved using four concurrent laser scanner lines instead of a full point cloud. These lines represent the partial contours of an object’s cross-section, enabling its dimensions and orientation to be reconstructed using simple geometry. A control prototype was implemented using a depth sensor with four laser scanners. Vibrotactile feedback was also designed to help users to correctly aim the sensor at target objects. Ten able-bodied volunteers used a prosthesis equipped with the novel controller to grasp ten objects of varying shapes, sizes, and orientations. For comparison, they also tested an existing benchmark controller that used full-depth information. The results showed that the novel controller handled all objects and, while performance improved with training, it remained slightly below that of the benchmark. This marks an important step towards a compact vision-based system for embedded depth sensing in prosthesis grasping.

[176] Blind Spot Navigation: Evolutionary Discovery of Sensitive Semantic Concepts for LVLMs

Zihao Pan, Yu Tong, Weibin Wu, Jingyi Wang, Lifeng Chen, Zhe Zhao, Jiajia Wei, Yitong Qiao, Zibin Zheng

Main category: cs.CV

TL;DR: The paper explores how large vision-language models (LVLMs) are prone to errors with specific semantic concepts in images, proposing a semantic evolution framework using LLMs and T2I models to identify these sensitive concepts.

DetailsMotivation: Understanding what semantic content in inputs causes LVLMs to fail is crucial for improving model robustness.

Method: A semantic evolution framework integrates LLMs and T2I models to generate and test image descriptions, using LVLM performance as feedback to identify sensitive concepts.

Result: Experiments on seven LVLMs and two tasks confirm the framework’s effectiveness and reveal sensitive semantics in LVLMs.

Conclusion: The study provides insights into LVLM vulnerabilities and aims to inspire further research on model robustness.

Abstract: Adversarial attacks aim to generate malicious inputs that mislead deep models, but beyond causing model failure, they cannot provide certain interpretable information such as \textit{What content in inputs make models more likely to fail?}'' However, this information is crucial for researchers to specifically improve model robustness. Recent research suggests that models may be particularly sensitive to certain semantics in visual inputs (such as wet,’’ ``foggy’’), making them prone to errors. Inspired by this, in this paper we conducted the first exploration on large vision-language models (LVLMs) and found that LVLMs indeed are susceptible to hallucinations and various errors when facing specific semantic concepts in images. To efficiently search for these sensitive concepts, we integrated large language models (LLMs) and text-to-image (T2I) models to propose a novel semantic evolution framework. Randomly initialized semantic concepts undergo LLM-based crossover and mutation operations to form image descriptions, which are then converted by T2I models into visual inputs for LVLMs. The task-specific performance of LVLMs on each input is quantified as fitness scores for the involved semantics and serves as reward signals to further guide LLMs in exploring concepts that induce LVLMs. Extensive experiments on seven mainstream LVLMs and two multimodal tasks demonstrate the effectiveness of our method. Additionally, we provide interesting findings about the sensitive semantics of LVLMs, aiming to inspire further in-depth research.

[177] Information Extraction from Unstructured data using Augmented-AI and Computer Vision

Aditya Parikh

Main category: cs.CV

TL;DR: A framework combining Augmented Intelligence (A2I), computer vision, and NLP for improved information extraction from unstructured documents, outperforming traditional OCR methods.

DetailsMotivation: Addressing the limitations of traditional OCR and parsing methods in processing large-scale, diverse document datasets.

Method: Leverages deep learning for object detection (tabular data) and integrates cloud-based services for scalable processing.

Result: Demonstrates improved accuracy and efficiency in extracting structured information from PDFs, images, and scanned documents.

Conclusion: The framework significantly outperforms traditional OCR-based approaches, especially for complex layouts and multi-modal content.

Abstract: Information extraction (IE) from unstructured documents remains a critical challenge in data processing pipelines. Traditional optical character recognition (OCR) methods and conventional parsing engines demonstrate limited effectiveness when processing large-scale document datasets. This paper presents a comprehensive framework for information extraction that combines Augmented Intelligence (A2I) with computer vision and natural language processing techniques. Our approach addresses the limitations of conventional methods by leveraging deep learning architectures for object detection, particularly for tabular data extraction, and integrating cloud-based services for scalable document processing. The proposed methodology demonstrates improved accuracy and efficiency in extracting structured information from diverse document formats including PDFs, images, and scanned documents. Experimental validation shows significant improvements over traditional OCR-based approaches, particularly in handling complex document layouts and multi-modal content extraction.

[178] Label Anything: Multi-Class Few-Shot Semantic Segmentation with Visual Prompts

Pasquale De Marinis, Nicola Fanelli, Raffaele Scaringi, Emanuele Colonna, Giuseppe Fiameni, Gennaro Vessio, Giovanna Castellano

Main category: cs.CV

TL;DR: Label Anything introduces a transformer-based architecture for multi-prompt, multi-way few-shot semantic segmentation, reducing annotation burden while maintaining high accuracy.

DetailsMotivation: To address the limitations of conventional few-shot segmentation by supporting diverse prompts and multi-class classification.

Method: A novel transformer-based architecture with attention mechanisms, trained for flexibility across various prompt types and configurations.

Result: Achieves state-of-the-art performance on COCO-$20^i$, outperforming single-class models in multi-class settings.

Conclusion: Label Anything offers a highly flexible and generalizable solution for few-shot semantic segmentation.

Abstract: Few-shot semantic segmentation aims to segment objects from previously unseen classes using only a limited number of labeled examples. In this paper, we introduce Label Anything, a novel transformer-based architecture designed for multi-prompt, multi-way few-shot semantic segmentation. Our approach leverages diverse visual prompts – points, bounding boxes, and masks – to create a highly flexible and generalizable framework that significantly reduces annotation burden while maintaining high accuracy. Label Anything makes three key contributions: ($\textit{i}$) we introduce a new task formulation that relaxes conventional few-shot segmentation constraints by supporting various types of prompts, multi-class classification, and enabling multiple prompts within a single image; ($\textit{ii}$) we propose a novel architecture based on transformers and attention mechanisms; and ($\textit{iii}$) we design a versatile training procedure allowing our model to operate seamlessly across different $N$-way $K$-shot and prompt-type configurations with a single trained model. Our extensive experimental evaluation on the widely used COCO-$20^i$ benchmark demonstrates that Label Anything achieves state-of-the-art performance among existing multi-way few-shot segmentation methods, while significantly outperforming leading single-class models when evaluated in multi-class settings. Code and trained models are available at https://github.com/pasqualedem/LabelAnything.

[179] High Performance Space Debris Tracking in Complex Skylight Backgrounds with a Large-Scale Dataset

Guohang Zhuang, Weixi Song, Jinyang Huang, Chenwei Yang, Wanli OuYang, Yan Lu

Main category: cs.CV

TL;DR: Proposes SDT-Net, a deep learning model for accurate space debris tracking, introduces SDTD dataset, and validates performance with real-world data.

DetailsMotivation: Addresses the challenge of real-time and accurate space debris tracking due to limitations of traditional signal processing methods.

Method: Uses SDT-Net, a deep learning-based network, and introduces SDTD, a large-scale dataset created via observation-based simulation.

Result: Achieves 73.2% MOTA score on real data from Antarctic Station, demonstrating strong transferability.

Conclusion: SDT-Net and SDTD effectively address space debris tracking challenges, with potential for real-world application.

Abstract: With the rapid development of space exploration, space debris has attracted more attention due to its potential extreme threat, leading to the need for real-time and accurate debris tracking. However, existing methods are mainly based on traditional signal processing, which cannot effectively process the complex background and dense space debris. In this paper, we propose a deep learning-based Space Debris Tracking Network~(SDT-Net) to achieve highly accurate debris tracking. SDT-Net effectively represents the feature of debris, enhancing the efficiency and stability of end-to-end model learning. To train and evaluate this model effectively, we also produce a large-scale dataset Space Debris Tracking Dataset (SDTD) by a novel observation-based data simulation scheme. SDTD contains 18,040 video sequences with a total of 62,562 frames and covers 250,000 synthetic space debris. Extensive experiments validate the effectiveness of our model and the challenging of our dataset. Furthermore, we test our model on real data from the Antarctic Station, achieving a MOTA score of 73.2%, which demonstrates its strong transferability to real-world scenarios. Our dataset and code will be released soon.

[180] MaskControl: Spatio-Temporal Control for Masked Motion Synthesis

Ekkasit Pinyoanuntapong, Muhammad Usama Saleem, Korrawe Karunratanakul, Pu Wang, Hongfei Xue, Chen Chen, Chuan Guo, Junli Cao, Jian Ren, Sergey Tulyakov

Main category: cs.CV

TL;DR: MaskControl introduces innovations for high-precision motion control in diffusion models, improving quality and precision.

DetailsMotivation: Current motion diffusion models lack high-precision control while maintaining motion quality.

Method: Uses Logits Regularizer and Logit Optimization for implicit and explicit control, plus Differentiable Expectation Sampling (DES).

Result: Outperforms state-of-the-art methods (FID ↓77%, error 0.91 vs. 1.08). Enables diverse applications.

Conclusion: MaskControl achieves superior motion quality and control precision, enabling versatile motion generation.

Abstract: Recent advances in motion diffusion models have enabled spatially controllable text-to-motion generation. However, these models struggle to achieve high-precision control while maintaining high-quality motion generation. To address these challenges, we propose MaskControl, the first approach to introduce controllability to the generative masked motion model. Our approach introduces two key innovations. First, \textit{Logits Regularizer} implicitly perturbs logits at training time to align the distribution of motion tokens with the controlled joint positions, while regularizing the categorical token prediction to ensure high-fidelity generation. Second, \textit{Logit Optimization} explicitly optimizes the predicted logits during inference time, directly reshaping the token distribution that forces the generated motion to accurately align with the controlled joint positions. Moreover, we introduce \textit{Differentiable Expectation Sampling (DES)} to combat the non-differential distribution sampling process encountered by logits regularizer and optimization. Extensive experiments demonstrate that MaskControl outperforms state-of-the-art methods, achieving superior motion quality (FID decreases by ~77%) and higher control precision (average error 0.91 vs. 1.08). Additionally, MaskControl enables diverse applications, including any-joint-any-frame control, body-part timeline control, and zero-shot objective control. Video visualization can be found at https://www.ekkasit.com/ControlMM-page/

[181] MagicDrive-V2: High-Resolution Long Video Generation for Autonomous Driving with Adaptive Control

Ruiyuan Gao, Kai Chen, Bo Xiao, Lanqing Hong, Zhenguo Li, Qiang Xu

Main category: cs.CV

TL;DR: MagicDrive-V2 improves controllable video generation for autonomous driving by integrating MVDiT and spatial-temporal encoding, achieving higher resolution and frame count.

DetailsMotivation: Addressing challenges in geometry control and ineffective existing methods for driving video generation.

Method: Proposes MagicDrive-V2 with MVDiT block, spatial-temporal conditional encoding, contextual descriptions, and progressive training.

Result: Achieves 3.3× resolution and 4× frame count compared to SOTA, with rich contextual and geometric controls.

Conclusion: MagicDrive-V2 enhances autonomous driving applications through superior video synthesis and control.

Abstract: The rapid advancement of diffusion models has greatly improved video synthesis, especially in controllable video generation, which is vital for applications like autonomous driving. Although DiT with 3D VAE has become a standard framework for video generation, it introduces challenges in controllable driving video generation, especially for geometry control, rendering existing control methods ineffective. To address these issues, we propose MagicDrive-V2, a novel approach that integrates the MVDiT block and spatial-temporal conditional encoding to enable multi-view video generation and precise geometric control. Additionally, we introduce an efficient method for obtaining contextual descriptions for videos to support diverse textual control, along with a progressive training strategy using mixed video data to enhance training efficiency and generalizability. Consequently, MagicDrive-V2 enables multi-view driving video synthesis with $3.3\times$ resolution and $4\times$ frame count (compared to current SOTA), rich contextual control, and geometric controls. Extensive experiments demonstrate MagicDrive-V2’s ability, unlocking broader applications in autonomous driving.

[182] Latent Space Analysis for Melanoma Prevention

Ciro Listone, Aniello Murano

Main category: cs.CV

TL;DR: A novel interpretable risk modeling method for melanoma diagnosis using a Conditional Variational Autoencoder and SVM, enhancing clinical insight and trust.

DetailsMotivation: Early and interpretable diagnostic tools are needed for melanoma due to its high mortality, as current deep learning models lack clinical insight.

Method: Uses a Conditional Variational Autoencoder to learn a structured latent space for semantic lesion relationships, combined with SVM for classification.

Result: Strong performance in differentiating benign nevi and melanomas, with interpretable risk assessment via latent space proximity.

Conclusion: Bridges predictive performance with clinical applicability, improving early detection and trust in AI-assisted diagnosis.

Abstract: Melanoma represents a critical health risk due to its aggressive progression and high mortality, underscoring the need for early, interpretable diagnostic tools. While deep learning has advanced in skin lesion classification, most existing models provide only binary outputs, offering limited clinical insight. This work introduces a novel approach that extends beyond classification, enabling interpretable risk modelling through a Conditional Variational Autoencoder. The proposed method learns a structured latent space that captures semantic relationships among lesions, allowing for a nuanced, continuous assessment of morphological differences. An SVM is also trained on this representation effectively differentiating between benign nevi and melanomas, demonstrating strong and consistent performance. More importantly, the learned latent space supports visual and geometric interpretation of malignancy, with the spatial proximity of a lesion to known melanomas serving as a meaningful indicator of risk. This approach bridges predictive performance with clinical applicability, fostering early detection, highlighting ambiguous cases, and enhancing trust in AI-assisted diagnosis through transparent and interpretable decision-making.

[183] BGM: Background Mixup for X-ray Prohibited Items Detection

Weizhe Liu, Renshuai Tao, Hongguang Zhu, Yunda Sun, Yao Zhao, Yunchao Wei

Main category: cs.CV

TL;DR: Proposes Background Mixup (BGM), a background-based augmentation for X-ray security imaging, improving detection by leveraging material and texture cues.

DetailsMotivation: Existing augmentations overlook X-ray-specific data characteristics and background cues, limiting detection performance.

Method: BGM mixes background patches based on X-ray transmission and material properties, enhancing model robustness to occlusion and imbalance.

Result: BGM outperforms baselines on X-ray benchmarks without extra annotations or training overhead.

Conclusion: BGM is a lightweight, effective solution for background-aware augmentation in X-ray prohibited items detection.

Abstract: Current data-driven approaches for X-ray prohibited items detection remain under-explored, particularly in the design of effective data augmentations. Existing natural image augmentations for reflected light imaging neglect the data characteristics of X-ray security images. Moreover, prior X-ray augmentation methods have predominantly focused on foreground prohibited items, overlooking informative background cues. In this paper, we propose Background Mixup (BGM), a background-based augmentation technique tailored for X-ray security imaging domain. Unlike conventional methods, BGM is founded on an in-depth analysis of physical properties including: 1) X-ray Transmission Imagery: Transmitted X-ray pixels represent composite information from multiple materials along the imaging path. 2) Material-based Pseudo-coloring: Pseudo-coloring in X-ray images correlates directly with material properties, aiding in material distinction. Building upon the above insights, BGM mixes background patches across regions on both 1) texture structure and 2) material variation, to benefit models from complicated background cues. This enhances the model’s capability to handle domain-specific challenges such as occlusion-induced discriminative imbalance. Importantly, BGM is orthogonal and fully compatible with existing foreground-focused augmentation techniques, enabling joint use to further enhance detection performance. Extensive experiments on multiple X-ray security benchmarks show that BGM consistently surpasses strong baselines, without additional annotations or significant training overhead. This work pioneers the exploration of background-aware augmentation in X-ray prohibited items detection and provides a lightweight, plug-and-play solution with broad applicability.

[184] Tuned Reverse Distillation: Enhancing Multimodal Industrial Anomaly Detection with Crossmodal Tuners

Xinyue Liu, Jianyuan Wang, Biao Leng, Shuo Zhang

Main category: cs.CV

TL;DR: The paper introduces Tuned Reverse Distillation (TRD) for multimodal anomaly detection, addressing limitations of existing KD-based methods by using a multi-branch design and crossmodal tuners.

DetailsMotivation: Existing KD-based methods for multimodal AD struggle with detecting anomalies in single modalities and fail to fully utilize intra- and inter-modality information.

Method: Proposes TRD with a multi-branch design for each modality and introduces Crossmodal Tuners (Filter and Amplifier) to enhance modality interaction during distillation.

Result: TRD achieves state-of-the-art performance in multimodal anomaly detection and localization on benchmark datasets.

Conclusion: TRD effectively addresses the limitations of existing methods, improving anomaly detection across all modalities.

Abstract: Knowledge distillation (KD) has been widely studied in unsupervised image Anomaly Detection (AD), but its application to unsupervised multimodal AD remains underexplored. Existing KD-based methods for multimodal AD that use fused multimodal features to obtain teacher representations face challenges. Anomalies that only exist in one modality may not be effectively captured in the fused teacher features, leading to detection failures. Besides, these methods do not fully leverage the rich intra- and inter-modality information that are critical for effective anomaly detection. In this paper, we propose Tuned Reverse Distillation (TRD) based on Multi-branch design to realize Multimodal Industrial AD. By assigning independent branches to each modality, our method enables finer detection of anomalies within each modality. Furthermore, we enhance the interaction between modalities during the distillation process by designing two Crossmodal Tuners including Crossmodal Filter and Amplifier. With the idea of crossmodal mapping, the student network is allowed to better learn normal features while anomalies in all modalities are ensured to be effectively detected. Experimental verifications on multimodal AD datasets demonstrate that our method achieves state-of-the-art performance in multimodal anomaly detection and localization. Code is available at https://github.com/hito2448/TRD.

[185] Level-Set Parameters: Novel Representation for 3D Shape Analysis

Huan Lei, Hongdong Li, Andreas Geiger, Anthony Dick

Main category: cs.CV

TL;DR: The paper proposes using neural fields and level-set parameters for continuous 3D shape analysis, addressing resolution issues in traditional methods, and demonstrates applications in classification, retrieval, and pose estimation.

DetailsMotivation: Traditional 3D shape representations (point clouds, meshes) are discrete and resolution-sensitive. Neural fields offer continuous representations via level-set parameters, enabling more robust shape analysis.

Method: Formulate level-set parameters as a pseudo-normal distribution to establish correlations across shapes. Use a hypernetwork to condition parameters on rotations/translations for pose-related analysis.

Result: The method simplifies pose-related shape analysis and shows promise in applications like classification, retrieval, and 6D pose estimation.

Conclusion: Neural fields and level-set parameters provide a novel, continuous representation for 3D shape analysis, outperforming traditional discrete methods in robustness and versatility.

Abstract: 3D shape analysis has been largely focused on traditional 3D representations of point clouds and meshes, but the discrete nature of these data makes the analysis susceptible to variations in input resolutions. Recent development of neural fields brings in level-set parameters from signed distance functions as a novel, continuous, and numerical representation of 3D shapes, where the shape surfaces are defined as zero-level-sets of those functions. This motivates us to extend shape analysis from the traditional 3D data to these novel parameter data. Since the level-set parameters are not Euclidean like point clouds, we establish correlations across different shapes by formulating them as a pseudo-normal distribution, and learn the distribution prior from the respective dataset. To further explore the level-set parameters with shape transformations, we propose to condition a subset of these parameters on rotations and translations, and generate them with a hypernetwork. This simplifies the pose-related shape analysis compared to using traditional data. We demonstrate the promise of the novel representations through applications in shape classification (arbitrary poses), retrieval, and 6D object pose estimation.

[186] Do Existing Testing Tools Really Uncover Gender Bias in Text-to-Image Models?

Yunbo Lyu, Zhou Yang, Yuqing Niu, Jing Jiang, David Lo

Main category: cs.CV

TL;DR: This study evaluates gender bias in T2I models, comparing detectors’ accuracy against manual validation, revealing biases and detector limitations.

DetailsMotivation: Address the gap in comparing gender bias detectors and understanding their deviation from actual bias in T2I models.

Method: Validate detectors using a manually labeled dataset of 6,000 images from three T2I models, analyzing bias and detector performance.

Result: T2I models prefer male images, especially in professional prompts; detectors often misestimate bias, with some overestimating by 26.95%.

Conclusion: Detectors lack accuracy, especially with low-quality images; an enhanced detector is proposed to address these limitations.

Abstract: Text-to-Image (T2I) models have recently gained significant attention due to their ability to generate high-quality images and are consequently used in a wide range of applications. However, there are concerns about the gender bias of these models. Previous studies have shown that T2I models can perpetuate or even amplify gender stereotypes when provided with neutral text prompts. Researchers have proposed automated gender bias uncovering detectors for T2I models, but a crucial gap exists: no existing work comprehensively compares the various detectors and understands how the gender bias detected by them deviates from the actual situation. This study addresses this gap by validating previous gender bias detectors using a manually labeled dataset and comparing how the bias identified by various detectors deviates from the actual bias in T2I models, as verified by manual confirmation. We create a dataset consisting of 6,000 images generated from three cutting-edge T2I models: Stable Diffusion XL, Stable Diffusion 3, and Dreamlike Photoreal 2.0. During the human-labeling process, we find that all three T2I models generate a portion (12.48% on average) of low-quality images (e.g., generate images with no face present), where human annotators cannot determine the gender of the person. Our analysis reveals that all three T2I models show a preference for generating male images, with SDXL being the most biased. Additionally, images generated using prompts containing professional descriptions (e.g., lawyer or doctor) show the most bias. We evaluate seven gender bias detectors and find that none fully capture the actual level of bias in T2I models, with some detectors overestimating bias by up to 26.95%. We further investigate the causes of inaccurate estimations, highlighting the limitations of detectors in dealing with low-quality images. Based on our findings, we propose an enhanced detector…

[187] All in One: Visual-Description-Guided Unified Point Cloud Segmentation

Zongyan Han, Mohamed El Amine Boudjoghra, Jiahua Dong, Jinhong Wang, Rao Muhammad Anwer

Main category: cs.CV

TL;DR: VDG-Uni3DSeg integrates vision-language models and LLMs to improve 3D point cloud segmentation by leveraging multimodal cues, achieving state-of-the-art results.

DetailsMotivation: Addressing challenges like sparse structure, limited annotations, and fine-grained class differentiation in 3D point cloud segmentation.

Method: Uses pre-trained vision-language models (CLIP) and LLMs for multimodal cues, introduces Semantic-Visual Contrastive Loss and Spatial Enhanced Module.

Result: Achieves top performance in semantic, instance, and panoptic segmentation.

Conclusion: Proposes a scalable, practical solution for 3D scene understanding with multimodal integration.

Abstract: Unified segmentation of 3D point clouds is crucial for scene understanding, but is hindered by its sparse structure, limited annotations, and the challenge of distinguishing fine-grained object classes in complex environments. Existing methods often struggle to capture rich semantic and contextual information due to limited supervision and a lack of diverse multimodal cues, leading to suboptimal differentiation of classes and instances. To address these challenges, we propose VDG-Uni3DSeg, a novel framework that integrates pre-trained vision-language models (e.g., CLIP) and large language models (LLMs) to enhance 3D segmentation. By leveraging LLM-generated textual descriptions and reference images from the internet, our method incorporates rich multimodal cues, facilitating fine-grained class and instance separation. We further design a Semantic-Visual Contrastive Loss to align point features with multimodal queries and a Spatial Enhanced Module to model scene-wide relationships efficiently. Operating within a closed-set paradigm that utilizes multimodal knowledge generated offline, VDG-Uni3DSeg achieves state-of-the-art results in semantic, instance, and panoptic segmentation, offering a scalable and practical solution for 3D understanding. Our code is available at https://github.com/Hanzy1996/VDG-Uni3DSeg.

[188] TARS: Traffic-Aware Radar Scene Flow Estimation

Jialong Wu, Marco Braun, Dominic Spata, Matthias Rottmann

Main category: cs.CV

TL;DR: TARS introduces a traffic-aware radar scene-flow method, leveraging traffic-level rigidity and joint object detection for improved performance.

DetailsMotivation: Existing LiDAR scene flow methods assume rigid motion at the instance level, which fails for sparse radar point clouds. TARS addresses this by focusing on traffic-level rigidity.

Method: TARS combines object detection and scene flow estimation, using a Traffic Vector Field (TVF) for holistic traffic understanding. It integrates point-level motion cues and traffic-level consistency.

Result: TARS outperforms state-of-the-art methods by 23% and 15% on proprietary and View-of-Delft datasets, respectively.

Conclusion: TARS demonstrates the effectiveness of traffic-level rigidity and joint detection-flow approaches for radar scene flow estimation.

Abstract: Scene flow provides crucial motion information for autonomous driving. Recent LiDAR scene flow models utilize the rigid-motion assumption at the instance level, assuming objects are rigid bodies. However, these instance-level methods are not suitable for sparse radar point clouds. In this work, we present a novel Traffic-Aware Radar Scene-Flow (TARS) estimation method, which utilizes motion rigidity at the traffic level. To address the challenges in radar scene flow, we perform object detection and scene flow jointly and boost the latter. We incorporate the feature map from the object detector, trained with detection losses, to make radar scene flow aware of the environment and road users. From this, we construct a Traffic Vector Field (TVF) in the feature space to achieve holistic traffic-level scene understanding in our scene flow branch. When estimating the scene flow, we consider both point-level motion cues from point neighbors and traffic-level consistency of rigid motion within the space. TARS outperforms the state of the art on a proprietary dataset and the View-of-Delft dataset, improving the benchmarks by 23% and 15%, respectively.

[189] RoCo-Sim: Enhancing Roadside Collaborative Perception through Foreground Simulation

Yuwen Du, Anning Hu, Zichen Chao, Yifan Lu, Junhao Ge, Genjia Liu, Weitao Wu, Lanjun Wang, Siheng Chen

Main category: cs.CV

TL;DR: RoCo-Sim is a simulation framework for roadside collaborative perception, addressing data issues like calibration errors and multi-view consistency, outperforming SOTA methods in 3D object detection.

DetailsMotivation: Existing roadside perception methods focus on model design but neglect data issues, leading to poor performance. RoCo-Sim aims to enhance collaborative perception by solving these data challenges.

Method: RoCo-Sim includes Camera Extrinsic Optimization, Multi-View Occlusion-Aware Sampler (MOAS), DepthSAM for foreground-background modeling, and a Scalable Post-Processing Toolkit for realistic scene generation.

Result: RoCo-Sim improves 3D object detection, outperforming SOTA by 83.74 on Rcooper-Intersection and 83.12 on TUMTraf-V2X for AP70.

Conclusion: RoCo-Sim fills a critical gap in roadside perception simulation, offering a robust solution for collaborative perception challenges.

Abstract: Roadside Collaborative Perception refers to a system where multiple roadside units collaborate to pool their perceptual data, assisting vehicles in enhancing their environmental awareness. Existing roadside perception methods concentrate on model design but overlook data issues like calibration errors, sparse information, and multi-view consistency, leading to poor performance on recent published datasets. To significantly enhance roadside collaborative perception and address critical data issues, we present the first simulation framework RoCo-Sim for road-side collaborative perception. RoCo-Sim is capable of generating diverse, multi-view consistent simulated roadside data through dynamic foreground editing and full-scene style transfer of a single image. RoCo-Sim consists of four components: (1) Camera Extrinsic Optimization ensures accurate 3D to 2D projection for roadside cameras; (2) A novel Multi-View Occlusion-Aware Sampler (MOAS) determines the placement of diverse digital assets within 3D space; (3) DepthSAM innovatively models foreground-background relationships from single-frame fixed-view images, ensuring multi-view consistency of foreground; and (4) Scalable Post-Processing Toolkit generates more realistic and enriched scenes through style transfer and other enhancements. RoCo-Sim significantly improves roadside 3D object detection, outperforming SOTA methods by 83.74 on Rcooper-Intersection and 83.12 on TUMTraf-V2X for AP70. RoCo-Sim fills a critical gap in roadside perception simulation. Code and pre-trained models will be released soon: https://github.com/duyuwen-duen/RoCo-Sim

[190] SE-VLN: A Self-Evolving Vision-Language Navigation Framework Based on Multimodal Large Language Models

Xiangyu Dong, Haoran Zhao, Jiang Gao, Haozhou Li, Xiaoguang Ma, Yaoming Zhou, Fuhai Chen, Juan Liu

Main category: cs.CV

TL;DR: The paper introduces SE-VLN, a self-evolving VLN framework leveraging multimodal LLMs to improve navigation success rates by incorporating experiential knowledge and continuous evolution during testing.

DetailsMotivation: Current VLN methods rely on fixed LLM knowledge, limiting experiential learning and evolutionary capacity. Inspired by natural agents, the authors aim to enable continuous evolution in VLN agents.

Method: SE-VLN includes three modules: hierarchical memory for knowledge transfer, retrieval-augmented reasoning for multi-step decisions, and reflection for continual evolution.

Result: SE-VLN achieved 57% and 35.2% success rates in unseen environments, outperforming state-of-the-art methods by 23.9% and 15.0% on R2R and REVERSE datasets, respectively.

Conclusion: SE-VLN demonstrates significant performance improvements and scalability with experience, highlighting its potential as a self-evolving framework for VLN.

Abstract: Recent advances in vision-language navigation (VLN) were mainly attributed to emerging large language models (LLMs). These methods exhibited excellent generalization capabilities in instruction understanding and task reasoning. However, they were constrained by the fixed knowledge bases and reasoning abilities of LLMs, preventing fully incorporating experiential knowledge and thus resulting in a lack of efficient evolutionary capacity. To address this, we drew inspiration from the evolution capabilities of natural agents, and proposed a self-evolving VLN framework (SE-VLN) to endow VLN agents with the ability to continuously evolve during testing. To the best of our knowledge, it was the first time that an multimodal LLM-powered self-evolving VLN framework was proposed. Specifically, SE-VLN comprised three core modules, i.e., a hierarchical memory module to transfer successful and failure cases into reusable knowledge, a retrieval-augmented thought-based reasoning module to retrieve experience and enable multi-step decision-making, and a reflection module to realize continual evolution. Comprehensive tests illustrated that the SE-VLN achieved navigation success rates of 57% and 35.2% in unseen environments, representing absolute performance improvements of 23.9% and 15.0% over current state-of-the-art methods on R2R and REVERSE datasets, respectively. Moreover, the SE-VLN showed performance improvement with increasing experience repository, elucidating its great potential as a self-evolving agent framework for VLN.

[191] FedVSR: Towards Model-Agnostic Federated Learning in Video Super-Resolution

Ali Mollaahmadi Dehaghi, Hossein KhademSohi, Reza Razavi, Steve Drew, Mohammad Moshirpour

Main category: cs.CV

TL;DR: FedVSR is a federated learning framework for video super-resolution, addressing privacy concerns and improving output quality with a DWT-based loss function and loss-aware aggregation.

DetailsMotivation: Privacy concerns in centralized deep learning for video super-resolution led to the need for a federated learning solution that maintains performance.

Method: FedVSR is model-agnostic, stateless, and uses a DWT-based loss function and loss-aware aggregation strategy.

Result: FedVSR outperforms existing FL methods, achieving higher PSNR, SSIM, and lower LPIPS.

Conclusion: FedVSR bridges privacy and performance, setting a new benchmark for federated learning in low-level vision tasks.

Abstract: Video super-resolution aims to enhance low-resolution videos by leveraging both spatial and temporal information. While deep learning has led to impressive progress, it typically requires centralized data, which raises privacy concerns. Federated learning offers a privacy-friendly solution, but general FL frameworks often struggle with low-level vision tasks, resulting in blurry, low-quality outputs. To address this, we introduce FedVSR, the first FL framework specifically designed for VSR. It is model-agnostic and stateless, and introduces a lightweight loss function based on the DWT to better preserve high-frequency details during local training. Additionally, a loss-aware aggregation strategy combines both DWT-based and task-specific losses to guide global updates effectively. Extensive experiments across multiple VSR models and datasets demonstrate that FedVSR consistently outperforms existing FL methods, achieving up to 0.82 dB higher PSNR, 0.0327 higher SSIM, and 0.0251 lower LPIPS. These results underscore FedVSR’s ability to bridge the gap between privacy and performance, setting a new benchmark for federated learning in low-level vision tasks. The code is available at: https://github.com/alimd94/FedVSR

[192] $S^2M^2$: Scalable Stereo Matching Model for Reliable Depth Estimation

Junhong Min, Youngpil Jeon, Jimin Kim, Minyong Choi

Main category: cs.CV

TL;DR: The paper introduces $S^2M^2$, a global matching architecture for stereo matching that balances accuracy and efficiency, outperforming prior methods on benchmarks.

DetailsMotivation: Addressing the trade-off between local search methods (limited global consistency) and global matching architectures (historically inefficient) in stereo matching.

Method: Uses a multi-resolution transformer for long-range correspondence and a novel loss function to focus on feasible matches, avoiding cost volume filtering or deep refinement.

Result: Achieves state-of-the-art accuracy on Middlebury v3 and ETH3D benchmarks, with high efficiency and robust disparity, occlusion, and confidence estimation.

Conclusion: $S^2M^2$ resolves the trade-off, offering a generalizable, efficient, and accurate solution for stereo matching.

Abstract: The pursuit of a generalizable stereo matching model, capable of performing across varying resolutions and disparity ranges without dataset-specific fine-tuning, has revealed a fundamental trade-off. Iterative local search methods achieve high scores on constrained benchmarks, but their core mechanism inherently limits the global consistency required for true generalization. On the other hand, global matching architectures, while theoretically more robust, have been historically rendered infeasible by prohibitive computational and memory costs. We resolve this dilemma with $S^2M^2$: a global matching architecture that achieves both state-of-the-art accuracy and high efficiency without relying on cost volume filtering or deep refinement stacks. Our design integrates a multi-resolution transformer for robust long-range correspondence, trained with a novel loss function that concentrates probability on feasible matches. This approach enables a more robust joint estimation of disparity, occlusion, and confidence. $S^2M^2$ establishes a new state of the art on the Middlebury v3 and ETH3D benchmarks, significantly outperforming prior methods across most metrics while reconstructing high-quality details with competitive efficiency.

[193] SceneMI: Motion In-betweening for Modeling Human-Scene Interactions

Inwoo Hwang, Bing Zhou, Young Min Kim, Jian Wang, Chuan Guo

Main category: cs.CV

TL;DR: SceneMI reformulates human-scene interaction modeling as scene-aware motion in-betweening, improving controllability and flexibility. It uses dual scene descriptors and diffusion models for noise generalization, showing effectiveness in keyframe animation and real-world data.

DetailsMotivation: Addressing limitations in controllability and flexibility of existing generative models for human-scene interactions (HSI).

Method: Proposes SceneMI, a framework using dual scene descriptors and diffusion models for scene-aware motion in-betweening.

Result: Demonstrates effectiveness in keyframe animation, noise generalization, and real-world data (GIMO dataset).

Conclusion: SceneMI offers practical applications for HSI modeling, including reconstruction from monocular videos.

Abstract: Modeling human-scene interactions (HSI) is essential for understanding and simulating everyday human behaviors. Recent approaches utilizing generative modeling have made progress in this domain; however, they are limited in controllability and flexibility for real-world applications. To address these challenges, we propose reformulating the HSI modeling problem as Scene-aware Motion In-betweening - a more tractable and practical task. We introduce SceneMI, a framework that supports several practical applications, including keyframe-guided character animation in 3D scenes and enhancing the motion quality of imperfect HSI data. SceneMI employs dual scene descriptors to comprehensively encode global and local scene context. Furthermore, our framework leverages the inherent denoising nature of diffusion models to generalize on noisy keyframes. Experimental results demonstrate SceneMI’s effectiveness in scene-aware keyframe in-betweening and generalization to the real-world GIMO dataset, where motions and scenes are acquired by noisy IMU sensors and smartphones. We further showcase SceneMI’s applicability in HSI reconstruction from monocular videos.

[194] EmbodiedOcc++: Boosting Embodied 3D Occupancy Prediction with Plane Regularization and Uncertainty Sampler

Hao Wang, Xiaobao Wei, Xiaoan Zhang, Jianing Li, Chengyu Bai, Ying Li, Ming Lu, Wenzhao Zheng, Shanghang Zhang

Main category: cs.CV

TL;DR: EmbodiedOcc++ enhances 3D occupancy prediction by incorporating geometric constraints and adaptive sampling, outperforming the original EmbodiedOcc framework.

DetailsMotivation: The original EmbodiedOcc framework lacks consideration for planar structures in indoor environments, limiting its accuracy.

Method: Introduces a Geometry-guided Refinement Module (GRM) for plane regularization and a Semantic-aware Uncertainty Sampler (SUS) for effective updates in overlapping regions.

Result: Achieves state-of-the-art performance on the EmbodiedOcc-ScanNet benchmark, improving edge accuracy and geometric detail retention.

Conclusion: EmbodiedOcc++ offers a computationally efficient solution for online embodied perception, with code to be released.

Abstract: Online 3D occupancy prediction provides a comprehensive spatial understanding of embodied environments. While the innovative EmbodiedOcc framework utilizes 3D semantic Gaussians for progressive indoor occupancy prediction, it overlooks the geometric characteristics of indoor environments, which are primarily characterized by planar structures. This paper introduces EmbodiedOcc++, enhancing the original framework with two key innovations: a Geometry-guided Refinement Module (GRM) that constrains Gaussian updates through plane regularization, along with a Semantic-aware Uncertainty Sampler (SUS) that enables more effective updates in overlapping regions between consecutive frames. GRM regularizes the position update to align with surface normals. It determines the adaptive regularization weight using curvature-based and depth-based constraints, allowing semantic Gaussians to align accurately with planar surfaces while adapting in complex regions. To effectively improve geometric consistency from different views, SUS adaptively selects proper Gaussians to update. Comprehensive experiments on the EmbodiedOcc-ScanNet benchmark demonstrate that EmbodiedOcc++ achieves state-of-the-art performance across different settings. Our method demonstrates improved edge accuracy and retains more geometric details while ensuring computational efficiency, which is essential for online embodied perception. The code will be released at: https://github.com/PKUHaoWang/EmbodiedOcc2.

[195] AgMMU: A Comprehensive Agricultural Multimodal Understanding Benchmark

Aruna Gauba, Irene Pi, Yunze Man, Ziqi Pang, Vikram S. Adve, Yu-Xiong Wang

Main category: cs.CV

TL;DR: AgMMU is a real-world benchmark for vision-language models (VLMs) in agriculture, derived from authentic dialogues. It includes 746 MCQs and 746 OEQs, alongside AgBase, a corpus of 57,079 multimodal facts. Benchmarking shows gaps in VLMs’ performance, with fine-tuning improving open-sourced models.

DetailsMotivation: To address the lack of domain-specific benchmarks in agriculture, AgMMU provides a knowledge-intensive evaluation set to advance VLMs in fine-grained perception and factual grounding.

Method: A three-stage pipeline: automated knowledge extraction, QA generation, and human verification, to create AgMMU (evaluation set) and AgBase (development corpus).

Result: Benchmarking 12 VLMs revealed performance gaps, especially for open-sourced models. Fine-tuning on AgBase improved open-sourced models by up to 11.6%.

Conclusion: AgMMU highlights the need for better knowledge extraction strategies and encourages domain-specific AI research in agriculture.

Abstract: We present AgMMU, a challenging real-world benchmark for evaluating and advancing vision-language models (VLMs) in the knowledge-intensive domain of agriculture. Unlike prior datasets that rely on crowdsourced prompts, AgMMU is distilled from 116,231 authentic dialogues between everyday growers and USDA-authorized Cooperative Extension experts. Through a three-stage pipeline: automated knowledge extraction, QA generation, and human verification, we construct (i) AgMMU, an evaluation set of 746 multiple-choice questions (MCQs) and 746 open-ended questions (OEQs), and (ii) AgBase, a development corpus of 57,079 multimodal facts covering five high-stakes agricultural topics: insect identification, species identification, disease categorization, symptom description, and management instruction. Benchmarking 12 leading VLMs reveals pronounced gaps in fine-grained perception and factual grounding. Open-sourced models trail after proprietary ones by a wide margin. Simple fine-tuning on AgBase boosts open-sourced model performance on challenging OEQs for up to 11.6% on average, narrowing this gap and also motivating future research to propose better strategies in knowledge extraction and distillation from AgBase. We hope AgMMU stimulates research on domain-specific knowledge integration and trustworthy decision support in agriculture AI development.

[196] Style-Adaptive Detection Transformer for Single-Source Domain Generalized Object Detection

Jianhong Han, Yupei Wang, Liang Chen

Main category: cs.CV

TL;DR: SA-DETR, a DETR-based detector, improves single-source domain generalization in object detection by using a dynamic memory bank for style adaptation and object-aware contrastive learning.

DetailsMotivation: Existing CNN-based methods for single-source domain generalization (SDG) in object detection rely on data augmentation and feature alignment, which may not generalize well across diverse unseen domains. DETR's potential for SDG is underexplored.

Method: Proposes SA-DETR with an online domain style adapter (using a dynamic memory bank) and an object-aware contrastive learning module for domain-invariant feature extraction.

Result: SA-DETR outperforms existing methods in detection accuracy and domain generalization across five weather scenarios.

Conclusion: SA-DETR effectively addresses SDG challenges by leveraging style adaptation and contrastive learning, demonstrating superior generalization.

Abstract: Single-source domain generalization (SDG) in object detection aims to develop a detector using only source domain data that generalizes well to unseen target domains. Existing methods are primarily CNN-based and improve robustness through data augmentation combined with feature alignment. However, these methods are limited, as augmentation is only effective when the synthetic distribution approximates that of unseen domains, thus failing to ensure generalization across diverse scenarios. While DEtection TRansformer (DETR) has shown strong generalization in domain adaptation due to global context modeling, its potential for SDG remains underexplored. To this end, we propose Style-Adaptive DEtection TRansformer (SA-DETR), a DETR-based detector tailored for SDG. SA-DETR introduces an online domain style adapter that projects the style representation of unseen domains into the source domain via a dynamic memory bank. This bank self-organizes into diverse style prototypes and is continuously updated under a test-time adaptation framework, enabling effective style rectification. Additionally, we design an object-aware contrastive learning module to promote extraction of domain-invariant features. By applying gating masks that constrain contrastive learning in both spatial and semantic dimensions, this module facilitates instance-level cross-domain contrast and enhances generalization. Extensive experiments across five distinct weather scenarios demonstrate that SA-DETR consistently outperforms existing methods in both detection accuracy and domain generalization capability.

[197] Learning Multi-frame and Monocular Prior for Estimating Geometry in Dynamic Scenes

Seong Hyeon Park, Jinwoo Shin

Main category: cs.CV

TL;DR: MMP is a new model for estimating 3D geometry in dynamic scenes from monocular videos, improving expressiveness and reducing inference costs.

DetailsMotivation: Existing models struggle with noisy partial attributes and costly test-time optimizations for dynamic scenes.

Method: MMP uses a Siamese architecture with a trajectory encoding module to project point-wise dynamics for improved expressiveness.

Result: MMP achieves a 15.1% enhancement in regression error for feed-forward pointmap prediction.

Conclusion: MMP offers a more efficient and accurate solution for 3D geometry estimation in dynamic scenes.

Abstract: In monocular videos that capture dynamic scenes, estimating the 3D geometry of video contents has been a fundamental challenge in computer vision. Specifically, the task is significantly challenged by the object motion, where existing models are limited to predict only partial attributes of the dynamic scenes, such as depth or pointmaps spanning only over a pair of frames. Since these attributes are inherently noisy under multiple frames, test-time global optimizations are often employed to fully recover the geometry, which is liable to failure and incurs heavy inference costs. To address the challenge, we present a new model, coined MMP, to estimate the geometry in a feed-forward manner, which produces a dynamic pointmap representation that evolves over multiple frames. Specifically, based on the recent Siamese architecture, we introduce a new trajectory encoding module to project point-wise dynamics on the representation for each frame, which can provide significantly improved expressiveness for dynamic scenes. In our experiments, we find MMP can achieve state-of-the-art quality in feed-forward pointmap prediction, e.g., 15.1% enhancement in the regression error.

[198] ViCTr: Vital Consistency Transfer for Pathology Aware Image Synthesis

Onkar Susladkar, Gayatri Deshmukh, Yalcin Tur, Gorkhem Durak, Ulas Bagci

Main category: cs.CV

TL;DR: ViCTr introduces a two-stage framework for high-fidelity medical image synthesis, combining rectified flow and Tweedie-corrected diffusion, achieving state-of-the-art results with efficient one-step sampling.

DetailsMotivation: Challenges in medical image synthesis include limited annotated data, domain gaps, and diffuse pathologies like liver cirrhosis. Existing methods lack anatomical fidelity and efficient sampling.

Method: ViCTr uses a two-stage approach: pretraining with EWC for anatomical fidelity and adversarial fine-tuning with LoRA for pathology severity control. It reformulates Tweedie’s formula for one-step sampling.

Result: ViCTr achieves a 28% lower MFID for cirrhosis synthesis and improves segmentation by +3.8% mDSC. Radiologists found its outputs clinically indistinguishable from real scans.

Conclusion: ViCTr is the first method to offer fine-grained, pathology-aware MRI synthesis with severity control, advancing AI-driven medical imaging.

Abstract: Synthesizing medical images remains challenging due to limited annotated pathological data, modality domain gaps, and the complexity of representing diffuse pathologies such as liver cirrhosis. Existing methods often struggle to maintain anatomical fidelity while accurately modeling pathological features, frequently relying on priors derived from natural images or inefficient multi-step sampling. In this work, we introduce ViCTr (Vital Consistency Transfer), a novel two-stage framework that combines a rectified flow trajectory with a Tweedie-corrected diffusion process to achieve high-fidelity, pathology-aware image synthesis. First, we pretrain ViCTr on the ATLAS-8k dataset using Elastic Weight Consolidation (EWC) to preserve critical anatomical structures. We then fine-tune the model adversarially with Low-Rank Adaptation (LoRA) modules for precise control over pathology severity. By reformulating Tweedie’s formula within a linear trajectory framework, ViCTr supports one-step sampling, reducing inference from 50 steps to just 4, without sacrificing anatomical realism. We evaluate ViCTr on BTCV (CT), AMOS (MRI), and CirrMRI600+ (cirrhosis) datasets. Results demonstrate state-of-the-art performance, achieving a Medical Frechet Inception Distance (MFID) of 17.01 for cirrhosis synthesis 28% lower than existing approaches and improving nnUNet segmentation by +3.8% mDSC when used for data augmentation. Radiologist reviews indicate that ViCTr-generated liver cirrhosis MRIs are clinically indistinguishable from real scans. To our knowledge, ViCTr is the first method to provide fine-grained, pathology-aware MRI synthesis with graded severity control, closing a critical gap in AI-driven medical imaging research.

[199] GIE-Bench: Towards Grounded Evaluation for Text-Guided Image Editing

Yusu Qian, Jiasen Lu, Tsu-Jui Fu, Xinze Wang, Chen Chen, Yinfei Yang, Wenze Hu, Zhe Gan

Main category: cs.CV

TL;DR: A new benchmark (GIE-Bench) evaluates text-guided image editing models on functional correctness and content preservation, outperforming CLIP-based metrics. GPT-Image-1 leads in accuracy but over-modifies non-targeted regions.

DetailsMotivation: Existing evaluation methods for text-guided image editing lack precision, relying on metrics like CLIP. A more grounded benchmark is needed.

Method: Introduces GIE-Bench with 1000+ editing examples, assessing functional correctness via multiple-choice questions and content preservation via object-aware masking.

Result: GPT-Image-1 excels in instruction-following but over-modifies irrelevant regions. The benchmark aligns with human ratings.

Conclusion: GIE-Bench offers a scalable, reproducible framework for improving evaluation in text-guided image editing.

Abstract: Editing images using natural language instructions has become a natural and expressive way to modify visual content; yet, evaluating the performance of such models remains challenging. Existing evaluation approaches often rely on image-text similarity metrics like CLIP, which lack precision. In this work, we introduce a new benchmark designed to evaluate text-guided image editing models in a more grounded manner, along two critical dimensions: (i) functional correctness, assessed via automatically generated multiple-choice questions that verify whether the intended change was successfully applied; and (ii) image content preservation, which ensures that non-targeted regions of the image remain visually consistent using an object-aware masking technique and preservation scoring. The benchmark includes over 1000 high-quality editing examples across 20 diverse content categories, each annotated with detailed editing instructions, evaluation questions, and spatial object masks. We conduct a large-scale study comparing GPT-Image-1, the latest flagship in the text-guided image editing space, against several state-of-the-art editing models, and validate our automatic metrics against human ratings. Results show that GPT-Image-1 leads in instruction-following accuracy, but often over-modifies irrelevant image regions, highlighting a key trade-off in the current model behavior. GIE-Bench provides a scalable, reproducible framework for advancing more accurate evaluation of text-guided image editing.

[200] Towards Generalized Range-View LiDAR Segmentation in Adverse Weather

Longyu Yang, Lu Zhang, Jun Liu, Yap-Peng Tan, Heng Tao Shen, Xiaofeng Zhu, Ping Hu

Main category: cs.CV

TL;DR: The paper proposes a lightweight framework to improve range-view LiDAR segmentation’s robustness in adverse weather by separating geometric and reflectance processing.

DetailsMotivation: Generalized performance of range-view LiDAR segmentation under adverse weather is underexplored, limiting real-world reliability.

Method: A modular framework with two branches (GAS for geometric noise, RDC for reflectance correction) is introduced, fused into the original pipeline.

Result: Experiments show significant improvement in adverse weather generalization with minimal overhead.

Conclusion: The method offers a practical solution for robust LiDAR segmentation in real-world conditions.

Abstract: LiDAR segmentation has emerged as an important task to enrich scene perception and understanding. Range-view-based methods have gained popularity due to their high computational efficiency and compatibility with real-time deployment. However, their generalized performance under adverse weather conditions remains underexplored, limiting their reliability in real-world environments. In this work, we identify and analyze the unique challenges that affect the generalization of range-view LiDAR segmentation in severe weather. To address these challenges, we propose a modular and lightweight framework that enhances robustness without altering the core architecture of existing models. Our method reformulates the initial stem block of standard range-view networks into two branches to process geometric attributes and reflectance intensity separately. Specifically, a Geometric Abnormality Suppression (GAS) module reduces the influence of weather-induced spatial noise, and a Reflectance Distortion Calibration (RDC) module corrects reflectance distortions through memory-guided adaptive instance normalization. The processed features are then fused and passed to the original segmentation pipeline. Extensive experiments on different benchmarks and baseline models demonstrate that our approach significantly improves generalization to adverse weather with minimal inference overhead, offering a practical and effective solution for real-world LiDAR segmentation.

[201] Preserve Anything: Controllable Image Synthesis with Object Preservation

Prasen Kumar Sharma, Neeraj Matiyali, Siddharth Srivastava, Gaurav Sharma

Main category: cs.CV

TL;DR: The paper introduces ‘Preserve Anything,’ a method for controlled image synthesis that improves object preservation, semantic consistency, and user control in text-to-image generation. It outperforms existing methods in fidelity, alignment, and aesthetics.

DetailsMotivation: Existing text-to-image generation methods struggle with preserving multiple objects, maintaining semantic alignment, and providing explicit control over scene composition.

Method: The method uses an N-channel ControlNet with modules for object preservation, background guidance, lighting consistency, and high-frequency overlay. It also introduces a benchmark dataset of 240K natural and 18K synthetic images.

Result: The method achieves state-of-the-art performance with FID 15.26 and CLIP-S 32.85, and user studies show significant improvements in prompt alignment, photorealism, and aesthetics.

Conclusion: ‘Preserve Anything’ effectively addresses key limitations in text-to-image synthesis, offering superior control, fidelity, and semantic consistency.

Abstract: We introduce \textit{Preserve Anything}, a novel method for controlled image synthesis that addresses key limitations in object preservation and semantic consistency in text-to-image (T2I) generation. Existing approaches often fail (i) to preserve multiple objects with fidelity, (ii) maintain semantic alignment with prompts, or (iii) provide explicit control over scene composition. To overcome these challenges, the proposed method employs an N-channel ControlNet that integrates (i) object preservation with size and placement agnosticism, color and detail retention, and artifact elimination, (ii) high-resolution, semantically consistent backgrounds with accurate shadows, lighting, and prompt adherence, and (iii) explicit user control over background layouts and lighting conditions. Key components of our framework include object preservation and background guidance modules, enforcing lighting consistency and a high-frequency overlay module to retain fine details while mitigating unwanted artifacts. We introduce a benchmark dataset consisting of 240K natural images filtered for aesthetic quality and 18K 3D-rendered synthetic images with metadata such as lighting, camera angles, and object relationships. This dataset addresses the deficiencies of existing benchmarks and allows a complete evaluation. Empirical results demonstrate that our method achieves state-of-the-art performance, significantly improving feature-space fidelity (FID 15.26) and semantic alignment (CLIP-S 32.85) while maintaining competitive aesthetic quality. We also conducted a user study to demonstrate the efficacy of the proposed work on unseen benchmark and observed a remarkable improvement of $\sim25%$, $\sim19%$, $\sim13%$, and $\sim14%$ in terms of prompt alignment, photorealism, the presence of AI artifacts, and natural aesthetics over existing works.

[202] RGE-GS: Reward-Guided Expansive Driving Scene Reconstruction via Diffusion Priors

Sicong Du, Jiarun Liu, Qifeng Chen, Hao-Xiang Chen, Tai-Jiang Mu, Sheng Yang

Main category: cs.CV

TL;DR: RGE-GS is a novel framework combining diffusion-based generation and reward-guided Gaussian integration to improve road structure reconstruction in sensor simulators, outperforming baseline methods.

DetailsMotivation: Incomplete road structure scanning from single-pass driving clips necessitates better reconstruction methods for sensor simulators, as current 3DGS techniques with diffusion priors introduce physical inconsistencies and inefficiencies.

Method: RGE-GS integrates a reward network to prioritize consistent patterns and uses a differentiated training strategy for Gaussian optimization, adjusting based on scene convergence metrics.

Result: RGE-GS achieves state-of-the-art reconstruction quality on public datasets.

Conclusion: The framework effectively addresses limitations of current methods, offering improved spatial stability and convergence in expansive reconstruction.

Abstract: A single-pass driving clip frequently results in incomplete scanning of the road structure, making reconstructed scene expanding a critical requirement for sensor simulators to effectively regress driving actions. Although contemporary 3D Gaussian Splatting (3DGS) techniques achieve remarkable reconstruction quality, their direct extension through the integration of diffusion priors often introduces cumulative physical inconsistencies and compromises training efficiency. To address these limitations, we present RGE-GS, a novel expansive reconstruction framework that synergizes diffusion-based generation with reward-guided Gaussian integration. The RGE-GS framework incorporates two key innovations: First, we propose a reward network that learns to identify and prioritize consistently generated patterns prior to reconstruction phases, thereby enabling selective retention of diffusion outputs for spatial stability. Second, during the reconstruction process, we devise a differentiated training strategy that automatically adjust Gaussian optimization progress according to scene converge metrics, which achieving better convergence than baseline methods. Extensive evaluations of publicly available datasets demonstrate that RGE-GS achieves state-of-the-art performance in reconstruction quality. Our source-code will be made publicly available at https://github.com/CN-ADLab/RGE-GS.

[203] Feed-Forward SceneDINO for Unsupervised Semantic Scene Completion

Aleksandar Jevtić, Christoph Reich, Felix Wimbauer, Oliver Hahn, Christian Rupprecht, Stefan Roth, Daniel Cremers

Main category: cs.CV

TL;DR: SceneDINO introduces an unsupervised method for semantic scene completion (SSC) using self-supervised learning, achieving state-of-the-art results without ground-truth annotations.

DetailsMotivation: Prior SSC methods rely on expensive ground-truth annotations. SceneDINO aims to address this by leveraging unsupervised techniques.

Method: SceneDINO adapts self-supervised representation learning and 2D unsupervised scene understanding to SSC, using multi-view consistency self-supervision and a novel 3D feature distillation approach.

Result: SceneDINO achieves state-of-the-art segmentation accuracy in unsupervised 3D and 2D scene understanding, matching supervised SSC performance in some cases.

Conclusion: SceneDINO demonstrates strong potential for unsupervised single-image 3D scene understanding, with domain generalization and multi-view consistency.

Abstract: Semantic scene completion (SSC) aims to infer both the 3D geometry and semantics of a scene from single images. In contrast to prior work on SSC that heavily relies on expensive ground-truth annotations, we approach SSC in an unsupervised setting. Our novel method, SceneDINO, adapts techniques from self-supervised representation learning and 2D unsupervised scene understanding to SSC. Our training exclusively utilizes multi-view consistency self-supervision without any form of semantic or geometric ground truth. Given a single input image, SceneDINO infers the 3D geometry and expressive 3D DINO features in a feed-forward manner. Through a novel 3D feature distillation approach, we obtain unsupervised 3D semantics. In both 3D and 2D unsupervised scene understanding, SceneDINO reaches state-of-the-art segmentation accuracy. Linear probing our 3D features matches the segmentation accuracy of a current supervised SSC approach. Additionally, we showcase the domain generalization and multi-view consistency of SceneDINO, taking the first steps towards a strong foundation for single image 3D scene understanding.

[204] Concept-TRAK: Understanding how diffusion models learn concepts through concept-level attribution

Yonghyun Park, Chieh-Hsin Lai, Satoshi Hayakawa, Yuhta Takida, Naoki Murata, Wei-Hsiang Liao, Woosung Choi, Kin Wai Cheuk, Junghyun Koo, Yuki Mitsufuji

Main category: cs.CV

TL;DR: The paper introduces Concept-TRAK, a method for concept-level attribution in diffusion models, addressing copyright and transparency concerns by isolating contributions to specific image elements.

DetailsMotivation: Growing adoption of diffusion models raises copyright and transparency issues, with existing methods failing to attribute specific elements like styles or objects.

Method: Concept-TRAK extends influence functions with a reformulated diffusion training loss and a concept-aware reward function for robust, sample-specific attribution.

Result: Concept-TRAK outperforms prior methods on the AbC benchmark and provides actionable insights for responsible AI development.

Conclusion: Concept-level attribution via Concept-TRAK offers practical solutions for governance and ethical AI in generative models.

Abstract: While diffusion models excel at image generation, their growing adoption raises critical concerns around copyright issues and model transparency. Existing attribution methods identify training examples influencing an entire image, but fall short in isolating contributions to specific elements, such as styles or objects, that matter most to stakeholders. To bridge this gap, we introduce \emph{concept-level attribution} via a novel method called \emph{Concept-TRAK}. Concept-TRAK extends influence functions with two key innovations: (1) a reformulated diffusion training loss based on diffusion posterior sampling, enabling robust, sample-specific attribution; and (2) a concept-aware reward function that emphasizes semantic relevance. We evaluate Concept-TRAK on the AbC benchmark, showing substantial improvements over prior methods. Through diverse case studies–ranging from identifying IP-protected and unsafe content to analyzing prompt engineering and compositional learning–we demonstrate how concept-level attribution yields actionable insights for responsible generative AI development and governance.

[205] Registration beyond Points: General Affine Subspace Alignment via Geodesic Distance on Grassmann Manifold

Jaeho Shin, Hyeonjae Gil, Junwoo Jang, Maani Ghaffari, Ayoung Kim

Main category: cs.CV

TL;DR: The paper introduces an optimizable cost function for measuring distances between Grassmannian features, enabling improved solutions for registration problems in computer vision.

DetailsMotivation: Existing methods lack an explicit, optimizable distance function for Grassmannian features, limiting their use in registration tasks.

Method: The authors derive an optimizable cost function using bases of high-dimensional linear subspaces and propose a solution for registration problems.

Result: The method outperforms existing solutions in computer vision tasks by minimizing geodesic distance and avoiding representation ambiguity.

Conclusion: The proposed cost function and its integration with a BnB solver enhance convergence and performance in registration problems.

Abstract: Affine Grassmannian has been favored for expressing proximity between lines and planes due to its theoretical exactness in measuring distances among features. Despite this advantage, the existing method can only measure the proximity without yielding the distance as an explicit function of rigid body transformation. Thus, an optimizable distance function on the manifold has remained underdeveloped, stifling its application in registration problems. This paper is the first to explicitly derive an optimizable cost function between two Grassmannian features with respect to rigid body transformation ($\mathbf{R}$ and $\mathbf{t}$). Specifically, we present a rigorous mathematical proof demonstrating that the bases of high-dimensional linear subspaces can serve as an explicit representation of the cost. Finally, we propose an optimizable cost function based on the transformed bases that can be applied to the registration problem of any affine subspace. Compared to vector parameter-based approaches, our method is able to find a globally optimal solution by directly minimizing the geodesic distance which is agnostic to representation ambiguity. The resulting cost function and its extension to the inlier-set maximizing Branch-and-Bound (BnB) solver have been demonstrated to improve the convergence of existing solutions or outperform them in various computer vision tasks. The code is available on https://github.com/joomeok/GrassmannRegistration.

[206] A Multimodal Seq2Seq Transformer for Predicting Brain Responses to Naturalistic Stimuli

Qianyi He, Yuan Chang Leong

Main category: cs.CV

TL;DR: A Transformer model predicts fMRI responses to multimodal movies using visual, auditory, and language inputs, achieving strong performance by leveraging temporal and multimodal context.

DetailsMotivation: To develop an encoding model for predicting whole-brain fMRI responses to naturalistic multimodal movies, addressing the challenge of capturing long-range temporal and individual variability.

Method: A sequence-to-sequence Transformer with dual cross-attention mechanisms integrates prior brain states and current stimuli, using pretrained models (VideoMAE, HuBERT, Qwen, BridgeTower) for feature extraction. Combines shared encoder with subject-specific decoders.

Result: The model performs well on in-distribution and out-of-distribution data, demonstrating effective brain activity prediction.

Conclusion: Temporally-aware, multimodal sequence modeling is effective for predicting brain activity, with code available for community use.

Abstract: The Algonauts 2025 Challenge called on the community to develop encoding models that predict whole-brain fMRI responses to naturalistic multimodal movies. In this submission, we propose a sequence-to-sequence Transformer that autoregressively predicts fMRI activity from visual, auditory, and language inputs. Stimulus features were extracted using pretrained models including VideoMAE, HuBERT, Qwen, and BridgeTower. The decoder integrates information from prior brain states and current stimuli via dual cross-attention mechanisms that attend to both perceptual information extracted from the stimulus as well as narrative information provided by high-level summaries of the content. One core innovation of our approach is the use of sequences of multimodal context to predict sequences of brain activity, enabling the model to capture long-range temporal structure in both stimuli and neural responses. Another is the combination of a shared encoder with partial subject-specific decoder, which leverages common representational structure across subjects while accounting for individual variability. Our model achieves strong performance on both in-distribution and out-of-distribution data, demonstrating the effectiveness of temporally-aware, multimodal sequence modeling for brain activity prediction. The code is available at https://github.com/Angelneer926/Algonauts_challenge.

[207] TeEFusion: Blending Text Embeddings to Distill Classifier-Free Guidance

Minghao Fu, Guo-Hua Wang, Xiaohao Chen, Qing-Guo Chen, Zhao Xu, Weihua Luo, Kaifu Zhang

Main category: cs.CV

TL;DR: TeEFusion is a distillation method that simplifies text-to-image synthesis by incorporating guidance magnitude into text embeddings, enabling faster inference without quality loss.

DetailsMotivation: High inference costs of classifier-free guidance (CFG) and complex sampling strategies in text-to-image synthesis.

Method: TeEFusion fuses conditional and unconditional text embeddings linearly, distilling the teacher model’s sampling strategy without extra parameters.

Result: Student model achieves 6× faster inference while maintaining image quality comparable to the teacher model.

Conclusion: TeEFusion offers an efficient alternative to CFG, balancing speed and quality in text-to-image synthesis.

Abstract: Recent advances in text-to-image synthesis largely benefit from sophisticated sampling strategies and classifier-free guidance (CFG) to ensure high-quality generation. However, CFG’s reliance on two forward passes, especially when combined with intricate sampling algorithms, results in prohibitively high inference costs. To address this, we introduce TeEFusion (Text Embeddings Fusion), a novel and efficient distillation method that directly incorporates the guidance magnitude into the text embeddings and distills the teacher model’s complex sampling strategy. By simply fusing conditional and unconditional text embeddings using linear operations, TeEFusion reconstructs the desired guidance without adding extra parameters, simultaneously enabling the student model to learn from the teacher’s output produced via its sophisticated sampling approach. Extensive experiments on state-of-the-art models such as SD3 demonstrate that our method allows the student to closely mimic the teacher’s performance with a far simpler and more efficient sampling strategy. Consequently, the student model achieves inference speeds up to 6$\times$ faster than the teacher model, while maintaining image quality at levels comparable to those obtained through the teacher’s complex sampling approach. The code is publicly available at https://github.com/AIDC-AI/TeEFusion.

[208] GVCCS: A Dataset for Contrail Identification and Tracking on Visible Whole Sky Camera Sequences

Gabriel Jarry, Ramon Dalmau, Philippe Very, Franck Ballerini, Stefania-Denisa Bocu

Main category: cs.CV

TL;DR: The paper introduces GVCCS, a new dataset of contrails tracked via ground cameras, and a deep learning framework for contrail analysis to improve climate impact modeling.

DetailsMotivation: Contrails significantly impact aviation's climate effects, but existing models lack accurate data for validation. Observational datasets are incomplete, missing temporal tracking and flight attribution.

Method: The authors present GVCCS, a dataset of contrails recorded with ground cameras, labeled and tracked over time. They also propose a deep learning framework for contrail segmentation and tracking.

Result: GVCCS includes 122 video sequences (24,228 frames) with flight identifiers. The deep learning model performs semantic and instance segmentation, plus temporal tracking.

Conclusion: This work enhances contrail monitoring and model calibration, aiding more accurate climate impact assessments.

Abstract: Aviation’s climate impact includes not only CO2 emissions but also significant non-CO2 effects, especially from contrails. These ice clouds can alter Earth’s radiative balance, potentially rivaling the warming effect of aviation CO2. Physics-based models provide useful estimates of contrail formation and climate impact, but their accuracy depends heavily on the quality of atmospheric input data and on assumptions used to represent complex processes like ice particle formation and humidity-driven persistence. Observational data from remote sensors, such as satellites and ground cameras, could be used to validate and calibrate these models. However, existing datasets don’t explore all aspect of contrail dynamics and formation: they typically lack temporal tracking, and do not attribute contrails to their source flights. To address these limitations, we present the Ground Visible Camera Contrail Sequences (GVCCS), a new open data set of contrails recorded with a ground-based all-sky camera in the visible range. Each contrail is individually labeled and tracked over time, allowing a detailed analysis of its lifecycle. The dataset contains 122 video sequences (24,228 frames) and includes flight identifiers for contrails that form above the camera. As reference, we also propose a unified deep learning framework for contrail analysis using a panoptic segmentation model that performs semantic segmentation (contrail pixel identification), instance segmentation (individual contrail separation), and temporal tracking in a single architecture. By providing high-quality, temporally resolved annotations and a benchmark for model evaluation, our work supports improved contrail monitoring and will facilitate better calibration of physical models. This sets the groundwork for more accurate climate impact understanding and assessments.

cs.AI

[209] Initial Steps in Integrating Large Reasoning and Action Models for Service Composition

Ilche Georgievski, Marco Aiello

Main category: cs.AI

TL;DR: The paper proposes integrating Large Reasoning Models (LRMs) and Large Action Models (LAMs) to automate service composition, addressing reasoning and execution gaps.

DetailsMotivation: Service composition faces challenges due to limited reasoning and brittle execution. LRMs and LAMs offer complementary strengths but have individual limitations.

Method: The paper introduces an integrated LRM-LAM architectural framework to combine semantic reasoning (LRMs) with dynamic execution (LAMs).

Result: The proposed framework bridges the gap between intention and execution, enabling automated, natural language-driven service composition.

Conclusion: Integrating LRMs and LAMs is a promising direction for fully automated, user-friendly service composition.

Abstract: Service composition remains a central challenge in building adaptive and intelligent software systems, often constrained by limited reasoning capabilities or brittle execution mechanisms. This paper explores the integration of two emerging paradigms enabled by large language models: Large Reasoning Models (LRMs) and Large Action Models (LAMs). We argue that LRMs address the challenges of semantic reasoning and ecosystem complexity while LAMs excel in dynamic action execution and system interoperability. However, each paradigm has complementary limitations - LRMs lack grounded action capabilities, and LAMs often struggle with deep reasoning. We propose an integrated LRM-LAM architectural framework as a promising direction for advancing automated service composition. Such a system can reason about service requirements and constraints while dynamically executing workflows, thus bridging the gap between intention and execution. This integration has the potential to transform service composition into a fully automated, user-friendly process driven by high-level natural language intent.

[210] Simulation-Driven Reinforcement Learning in Queuing Network Routing Optimization

Fatima Al-Ani, Molly Wang, Jevon Charles, Aaron Ong, Joshua Forday, Vinayak Modi

Main category: cs.AI

TL;DR: A simulation-driven RL framework (Dyna-DDPG) is proposed for optimizing routing in queueing networks, outperforming traditional methods in dynamic, uncertain environments.

DetailsMotivation: Traditional queueing methods struggle with dynamic and uncertain environments, necessitating a more robust solution for routing optimization.

Method: The framework combines Deep Deterministic Policy Gradient (DDPG) with Dyna-style planning (Dyna-DDPG), using a flexible simulation environment and separate predictive models for stability.

Result: The framework learns effective routing policies quickly, performs robustly under disruptions, and scales well to larger networks.

Conclusion: The Dyna-DDPG approach is practical for real-world deployment, supported by strong software engineering for reproducibility and maintainability.

Abstract: This study focuses on the development of a simulation-driven reinforcement learning (RL) framework for optimizing routing decisions in complex queueing network systems, with a particular emphasis on manufacturing and communication applications. Recognizing the limitations of traditional queueing methods, which often struggle with dynamic, uncertain environments, we propose a robust RL approach leveraging Deep Deterministic Policy Gradient (DDPG) combined with Dyna-style planning (Dyna-DDPG). The framework includes a flexible and configurable simulation environment capable of modeling diverse queueing scenarios, disruptions, and unpredictable conditions. Our enhanced Dyna-DDPG implementation incorporates separate predictive models for next-state transitions and rewards, significantly improving stability and sample efficiency. Comprehensive experiments and rigorous evaluations demonstrate the framework’s capability to rapidly learn effective routing policies that maintain robust performance under disruptions and scale effectively to larger network sizes. Additionally, we highlight strong software engineering practices employed to ensure reproducibility and maintainability of the framework, enabling practical deployment in real-world scenarios.

[211] A Neuroscience-Inspired Dual-Process Model of Compositional Generalization

Alex Noviello, Claas Beger, Jacob Groner, Kevin Ellis, Weinan Sun

Main category: cs.AI

TL;DR: MIRAGE, a brain-inspired AI framework, achieves systematic compositional generalization by combining neural decomposition and schema extraction, demonstrating high accuracy on the SCAN benchmark.

DetailsMotivation: Addressing the challenge of systematic compositional generalization in AI by mimicking human cognitive processes involving the hippocampus and prefrontal cortex.

Method: MIRAGE uses two modules: a meta-trained Transformer Neural Decomposer for iterative refinement and a Schema Engine for dynamic schema extraction and application.

Result: Achieves >99% accuracy on the SCAN benchmark with 1.19M parameters, showcasing systematic generalization.

Conclusion: MIRAGE’s success hinges on schema quality and iterative refinement, proving the effectiveness of brain-inspired approaches for compositional tasks.

Abstract: Systematic compositional generalization - constructing and understanding novel combinations of known building blocks - remains a core challenge for AI systems. Human cognition achieves this flexibility via the interplay of the hippocampus (HPC) and prefrontal cortex (PFC): the hippocampus rapidly encodes episodes, and the prefrontal cortex consolidates them into reusable schemas for reasoning. Drawing on these insights, we present MIRAGE (Meta-Inference with Rules and Abstractions from Generalized Experience), a framework that achieves systematic generalization on compositional tasks. MIRAGE has two interacting modules mirroring the brain’s deliberative HPC-PFC loop and intuitive neocortical pattern recognition. (1) The meta-trained Transformer Neural Decomposer, paralleling neocortical “System 1” computation, is trained on a task-agnostic stream of randomly sampled compositional grammars and applies one decomposition step per pass, with successive passes iteratively refining the sequence representation. (2) The Schema Engine, analogous to the HPC-PFC “System 2” loop, dynamically extracts, ranks, and applies reusable schemas, storing variable bindings in episodic memory and expanding them when needed. By explicitly equipping the Transformer component of MIRAGE with actively managed schematic structures, our model performs systematic compositional operations through explicit schema application and transformation, relying solely on frozen weights when solving entirely novel tasks. This approach demonstrates systematic compositional generalization on the SCAN benchmark, achieving > 99% accuracy on all task splits with only 1.19M parameters in the transformer module. Ablation studies confirm that MIRAGE’s systematicity critically depends on the quality of extracted schemas and the model’s iterative refinement process.

[212] Success in Humanoid Reinforcement Learning under Partial Observation

Wuhao Wang, Zhiyong Chen

Main category: cs.AI

TL;DR: First successful reinforcement learning of humanoid locomotion under partial observability, achieving performance comparable to full-state methods using a novel history encoder.

DetailsMotivation: Addressing the challenge of effective policy learning in high-dimensional tasks like humanoid locomotion under partial observability, where prior methods failed.

Method: Introduces a novel history encoder that processes past observations in parallel, integrated into a model-free algorithm.

Result: The learned policy matches state-of-the-art performance with only partial state information and adapts to robot property variations.

Conclusion: The history encoder effectively reconstructs contextual information, enabling robust decision-making in partially observable environments.

Abstract: Reinforcement learning has been widely applied to robotic control, but effective policy learning under partial observability remains a major challenge, especially in high-dimensional tasks like humanoid locomotion. To date, no prior work has demonstrated stable training of humanoid policies with incomplete state information in the benchmark Gymnasium Humanoid-v4 environment. The objective in this environment is to walk forward as fast as possible without falling, with rewards provided for staying upright and moving forward, and penalties incurred for excessive actions and external contact forces. This research presents the first successful instance of learning under partial observability in this environment. The learned policy achieves performance comparable to state-of-the-art results with full state access, despite using only one-third to two-thirds of the original states. Moreover, the policy exhibits adaptability to robot properties, such as variations in body part masses. The key to this success is a novel history encoder that processes a fixed-length sequence of past observations in parallel. Integrated into a standard model-free algorithm, the encoder enables performance on par with fully observed baselines. We hypothesize that it reconstructs essential contextual information from recent observations, thereby enabling robust decision-making.

[213] Towards Improving Long-Tail Entity Predictions in Temporal Knowledge Graphs through Global Similarity and Weighted Sampling

Mehrnoosh Mirtaheri, Ryan A. Rossi, Sungchul Kim, Kanak Mahadik, Tong Yu, Xiang Chen, Mohammad Rostami

Main category: cs.AI

TL;DR: The paper introduces an incremental training framework for Temporal Knowledge Graphs (TKGs) to handle unseen or sparsely connected entities, improving generalization and robustness.

DetailsMotivation: Traditional TKG completion models assume full graph access during training, ignoring challenges like evolving knowledge and sparse entity connections. This work addresses these gaps.

Method: The framework combines a model-agnostic enhancement layer (using global entity similarity) and a weighted sampling strategy (emphasizing infrequent entities).

Result: The method outperforms existing approaches, achieving 10% and 15% MRR improvements on benchmark datasets, excelling in link prediction and handling long-tail entities.

Conclusion: The framework enhances TKG completion by mitigating catastrophic forgetting and improving robustness, especially in incremental training scenarios.

Abstract: Temporal Knowledge Graph (TKG) completion models traditionally assume access to the entire graph during training. This overlooks challenges stemming from the evolving nature of TKGs, such as: (i) the model’s requirement to generalize and assimilate new knowledge, and (ii) the task of managing new or unseen entities that often have sparse connections. In this paper, we present an incremental training framework specifically designed for TKGs, aiming to address entities that are either not observed during training or have sparse connections. Our approach combines a model-agnostic enhancement layer with a weighted sampling strategy, that can be augmented to and improve any existing TKG completion method. The enhancement layer leverages a broader, global definition of entity similarity, which moves beyond mere local neighborhood proximity of GNN-based methods. The weighted sampling strategy employed in training accentuates edges linked to infrequently occurring entities. We evaluate our method on two benchmark datasets, and demonstrate that our framework outperforms existing methods in total link prediction, inductive link prediction, and in addressing long-tail entities. Notably, our method achieves a 10% improvement and a 15% boost in MRR for these datasets. The results underscore the potential of our approach in mitigating catastrophic forgetting and enhancing the robustness of TKG completion methods, especially in an incremental training context

[214] Integrating LLM in Agent-Based Social Simulation: Opportunities and Challenges

Patrick Taillandier, Jean Daniel Zucker, Arnaud Grignard, Benoit Gaudou, Nghi Quang Huynh, Alexis Drogoul

Main category: cs.AI

TL;DR: The paper explores the use of LLMs in social simulation, discussing their potential and limitations, and suggests hybrid approaches for better integration with traditional agent-based modeling.

DetailsMotivation: To evaluate the role of LLMs in replicating human cognition and their application in social simulations, addressing both capabilities and challenges.

Method: Reviews recent findings on LLMs’ cognitive replication and surveys applications in multi-agent simulations, analyzing system architectures and validation strategies.

Result: Identifies LLMs’ strengths in interactive simulations but notes limitations in explanatory or predictive modeling, advocating for hybrid approaches.

Conclusion: Hybrid methods combining LLMs with traditional agent-based modeling are recommended to balance flexibility and analytical rigor.

Abstract: This position paper examines the use of Large Language Models (LLMs) in social simulation, analyzing both their potential and their limitations from a computational social science perspective. The first part reviews recent findings on the ability of LLMs to replicate key aspects of human cognition, including Theory of Mind reasoning and social inference, while also highlighting significant limitations such as cognitive biases, lack of true understanding, and inconsistencies in behavior. The second part surveys emerging applications of LLMs in multi-agent simulation frameworks, focusing on system architectures, scale, and validation strategies. Notable projects such as Generative Agents (Smallville) and AgentSociety are discussed in terms of their design choices, empirical grounding, and methodological innovations. Particular attention is given to the challenges of behavioral fidelity, calibration, and reproducibility in large-scale LLM-driven simulations. The final section distinguishes between contexts where LLMs, like other black-box systems, offer direct value-such as interactive simulations and serious games-and those where their use is more problematic, notably in explanatory or predictive modeling. The paper concludes by advocating for hybrid approaches that integrate LLMs into traditional agent-based modeling platforms (GAMA, Netlogo, etc), enabling modelers to combine the expressive flexibility of language-based reasoning with the transparency and analytical rigor of classical rule-based systems.

[215] Fine-Grained Traffic Inference from Road to Lane via Spatio-Temporal Graph Node Generation

Shuhao Li, Weidong Yang, Yue Cui, Xiaoxing Liu, Lingkai Meng, Lipeng Ma, Fan Zhang

Main category: cs.AI

TL;DR: The paper introduces the Fine-grained Road Traffic Inference (FRTI) task to generate lane-level traffic data using limited inputs, proposing the RoadDiff framework for accurate inference.

DetailsMotivation: Lane-level traffic data is crucial for applications like autonomous driving but is hard to obtain due to sensor limitations and tracking inaccuracies.

Method: A two-stage framework, RoadDiff, uses a Road-Lane Correlation Autoencoder-Decoder and Lane Diffusion Module to infer lane traffic states from limited data.

Result: Extensive experiments on six datasets validated RoadDiff’s effectiveness in solving the FRTI task.

Conclusion: RoadDiff provides an energy-efficient and cost-effective solution for fine-grained traffic management.

Abstract: Fine-grained traffic management and prediction are fundamental to key applications such as autonomous driving, lane change guidance, and traffic signal control. However, obtaining lane-level traffic data has become a critical bottleneck for data-driven models due to limitations in the types and number of sensors and issues with the accuracy of tracking algorithms. To address this, we propose the Fine-grained Road Traffic Inference (FRTI) task, which aims to generate more detailed lane-level traffic information using limited road data, providing a more energy-efficient and cost-effective solution for precise traffic management. This task is abstracted as the first scene of the spatio-temporal graph node generation problem. We designed a two-stage framework–RoadDiff–to solve the FRTI task. solve the FRTI task. This framework leverages the Road-Lane Correlation Autoencoder-Decoder and the Lane Diffusion Module to fully utilize the limited spatio-temporal dependencies and distribution relationships of road data to accurately infer fine-grained lane traffic states. Based on existing research, we designed several baseline models with the potential to solve the FRTI task and conducted extensive experiments on six datasets representing different road conditions to validate the effectiveness of the RoadDiff model in addressing the FRTI task. The relevant datasets and code are available at https://github.com/ShuhaoLii/RoadDiff.

[216] Pareto-NRPA: A Novel Monte-Carlo Search Algorithm for Multi-Objective Optimization

Noé Lallouet, Tristan Cazenave, Cyrille Enderli

Main category: cs.AI

TL;DR: Pareto-NRPA is a new Monte-Carlo algorithm for multi-objective optimization, extending NRPA to handle multiple objectives by maintaining non-dominated fronts and adapting policies for diversity.

DetailsMotivation: To address multi-objective optimization problems in discrete search spaces, extending the single-objective NRPA algorithm to handle multiple objectives effectively.

Method: Pareto-NRPA generalizes NRPA by using multiple policies to explore solution spaces, maintaining non-dominated fronts, and adapting policies based on Pareto front diversity.

Result: The algorithm performs competitively against state-of-the-art methods, excelling in constrained search spaces and demonstrating strong convergence and diversity.

Conclusion: Pareto-NRPA is the first adaptation of NRPA for multi-objective optimization, showing promising results in benchmarks like MO-TSPTW and neural architecture search.

Abstract: We introduce Pareto-NRPA, a new Monte-Carlo algorithm designed for multi-objective optimization problems over discrete search spaces. Extending the Nested Rollout Policy Adaptation (NRPA) algorithm originally formulated for single-objective problems, Pareto-NRPA generalizes the nested search and policy update mechanism to multi-objective optimization. The algorithm uses a set of policies to concurrently explore different regions of the solution space and maintains non-dominated fronts at each level of search. Policy adaptation is performed with respect to the diversity and isolation of sequences within the Pareto front. We benchmark Pareto-NRPA on two classes of problems: a novel bi-objective variant of the Traveling Salesman Problem with Time Windows problem (MO-TSPTW), and a neural architecture search task on well-known benchmarks. Results demonstrate that Pareto-NRPA achieves competitive performance against state-of-the-art multi-objective algorithms, both in terms of convergence and diversity of solutions. Particularly, Pareto-NRPA strongly outperforms state-of-the-art evolutionary multi-objective algorithms on constrained search spaces. To our knowledge, this work constitutes the first adaptation of NRPA to the multi-objective setting.

[217] OS-MAP: How Far Can Computer-Using Agents Go in Breadth and Depth?

Xuetian Chen, Yinghao Chen, Xinfeng Yuan, Zhuo Peng, Lu Chen, Yuekeng Li, Zhoujia Zhang, Yingqian Huang, Leyan Huang, Jiaqing Liang, Tianbao Xie, Zhiyong Wu, Qiushi Sun, Biqing Qi, Bowen Zhou

Main category: cs.AI

TL;DR: OS-MAP is a benchmark for daily computer-using automation, organizing 416 tasks across 15 applications by automation level and generalization scope to evaluate agent capabilities and alignment with user demands.

DetailsMotivation: Existing benchmarks lack consideration for task heterogeneity, agent capabilities, and alignment with user demands, hindering practical deployment and targeted development.

Method: OS-MAP organizes tasks along a five-level automation taxonomy and a generalization scope derived from user demand hierarchy, evaluating agents on autonomy and generalization.

Result: State-of-the-art agents struggle with higher-level tasks involving perception, reasoning, and coordination, revealing current limitations.

Conclusion: OS-MAP provides a structured evaluation framework to guide future research and deployment of computer-using agents, highlighting gaps in current capabilities.

Abstract: Computer-using agents have shown strong potential to boost human productivity and enable new application forms across platforms. While recent advances have led to usable applications, existing benchmarks fail to account for the internal task heterogeneity and the corresponding agent capabilities, as well as their alignment with actual user demands-hindering both targeted capability development and the reliable transition of research progress into practical deployment. To bridge the gap, we present OS-MAP, a benchmark for daily computer-using automation that organizes its 416 realistic tasks across 15 applications along two key dimensions: a five-level taxonomy of automation and a generalization scope derived from a real-world user demand hierarchy. To enable fine-grained analysis of required capabilities and alignment with real-world scenarios, OS-MAP evaluates agents along two dimensions: automation level across a five-level taxonomy, and generalization scope across a demand hierarchy. This design captures varying levels of required agent autonomy and generalization, forming a performance-generalization evaluation matrix for structured and comprehensive assessment. Experiments show that even State-of-the-Art agents with VLM backbones struggle with higher-level tasks involving perception, reasoning, and coordination-highlighting the need for a deeper understanding of current strengths and limitations to drive the future progress in computer-using agents research and deployment. All code, environments, baselines, and data are publicly available at https://github.com/OS-Copilot/OS-Map.

[218] Toward Super Agent System with Hybrid AI Routers

Yuhang Yao, Haixin Wang, Yibo Chen, Jiawen Wang, Min Chang Jordan Ren, Bosheng Ding, Salman Avestimehr, Chaoyang He

Main category: cs.AI

TL;DR: The paper proposes a Super Agent System using hybrid AI routers to efficiently handle diverse tasks by dynamically selecting between local and cloud models, aiming for scalability and cost-effectiveness.

DetailsMotivation: To address the need for efficient, scalable, and cost-effective AI agents that can understand user intent and leverage appropriate tools for diverse tasks.

Method: Designs a system with hybrid AI routers that detect user intent, route tasks to specialized agents or generate workflows, and dynamically choose between local and cloud models based on task complexity.

Result: Introduces a blueprint for an on-device super agent enhanced with cloud collaboration, optimizing for efficiency, latency, and privacy.

Conclusion: Envisions seamless integration of super agents into everyday life by leveraging advancements in multi-modality models and edge hardware, with cloud support as needed.

Abstract: AI Agents powered by Large Language Models are transforming the world through enormous applications. A super agent has the potential to fulfill diverse user needs, such as summarization, coding, and research, by accurately understanding user intent and leveraging the appropriate tools to solve tasks. However, to make such an agent viable for real-world deployment and accessible at scale, significant optimizations are required to ensure high efficiency and low cost. This position paper presents a design of the Super Agent System powered by the hybrid AI routers. Upon receiving a user prompt, the system first detects the intent of the user, then routes the request to specialized task agents with the necessary tools or automatically generates agentic workflows. In practice, most applications directly serve as AI assistants on edge devices such as phones and robots. As different language models vary in capability and cloud-based models often entail high computational costs, latency, and privacy concerns, we then explore the hybrid mode where the router dynamically selects between local and cloud models based on task complexity. Finally, we introduce the blueprint of an on-device super agent enhanced with cloud. With advances in multi-modality models and edge hardware, we envision that most computations can be handled locally, with cloud collaboration only as needed. Such architecture paves the way for super agents to be seamlessly integrated into everyday life in the near future.

[219] PhysDrive: A Multimodal Remote Physiological Measurement Dataset for In-vehicle Driver Monitoring

Jiyao Wang, Xiao Yang, Qingyong Hu, Jiankai Tang, Can Liu, Dengbo He, Yuntao Wang, Yingcong Chen, Kaishun Wu

Main category: cs.AI

TL;DR: PhysDrive is a large-scale multimodal dataset for contactless in-vehicle physiological monitoring, addressing limitations of existing datasets by including diverse modalities, driving conditions, and synchronized ground truths.

DetailsMotivation: Existing datasets for remote physiological measurement (RPM) in driving scenarios are limited in scale, diversity, and real-world applicability, hindering progress in driver monitoring.

Method: PhysDrive collects synchronized RGB, near-infrared, and mmWave radar data from 48 drivers, along with six physiological ground truths, covering diverse driving conditions.

Result: The dataset enables comprehensive evaluation of signal-processing and deep-learning methods, establishing benchmarks for multimodal driver monitoring.

Conclusion: PhysDrive is a foundational resource to advance research in multimodal driver monitoring and smart-cockpit systems.

Abstract: Robust and unobtrusive in-vehicle physiological monitoring is crucial for ensuring driving safety and user experience. While remote physiological measurement (RPM) offers a promising non-invasive solution, its translation to real-world driving scenarios is critically constrained by the scarcity of comprehensive datasets. Existing resources are often limited in scale, modality diversity, the breadth of biometric annotations, and the range of captured conditions, thereby omitting inherent real-world challenges in driving. Here, we present PhysDrive, the first large-scale multimodal dataset for contactless in-vehicle physiological sensing with dedicated consideration on various modality settings and driving factors. PhysDrive collects data from 48 drivers, including synchronized RGB, near-infrared camera, and raw mmWave radar data, accompanied with six synchronized ground truths (ECG, BVP, Respiration, HR, RR, and SpO2). It covers a wide spectrum of naturalistic driving conditions, including driver motions, dynamic natural light, vehicle types, and road conditions. We extensively evaluate both signal-processing and deep-learning methods on PhysDrive, establishing a comprehensive benchmark across all modalities, and release full open-source code with compatibility for mainstream public toolboxes. We envision PhysDrive will serve as a foundational resource and accelerate research on multimodal driver monitoring and smart-cockpit systems.

[220] Faster Lifting for Ordered Domains with Predecessor Relations

Kuncheng Zou, Jiahao Mai, Yonggang Zhang, Yuyi Wang, Ondřej Kuželka, Yuanhong Wang, Yi Chang

Main category: cs.AI

TL;DR: The paper introduces a novel algorithm for lifted inference on ordered domains with predecessor relations, improving efficiency over existing methods.

DetailsMotivation: Existing WFOMC methods struggle with predecessor relations in ordered domains, despite theoretical tractability.

Method: Proposes a new algorithm treating predecessor relations natively, supporting immediate, second, and general k-th predecessors.

Result: Achieves exponential speedups, especially for immediate and second predecessors, and handles general k-th predecessors efficiently.

Conclusion: The algorithm significantly outperforms existing methods, demonstrating practical efficiency in lifted inference and combinatorics tasks.

Abstract: We investigate lifted inference on ordered domains with predecessor relations, where the elements of the domain respect a total (cyclic) order, and every element has a distinct (clockwise) predecessor. Previous work has explored this problem through weighted first-order model counting (WFOMC), which computes the weighted sum of models for a given first-order logic sentence over a finite domain. In WFOMC, the order constraint is typically encoded by the linear order axiom introducing a binary predicate in the sentence to impose a linear ordering on the domain elements. The immediate and second predecessor relations are then encoded by the linear order predicate. Although WFOMC with the linear order axiom is theoretically tractable, existing algorithms struggle with practical applications, particularly when the predecessor relations are involved. In this paper, we treat predecessor relations as a native part of the axiom and devise a novel algorithm that inherently supports these relations. The proposed algorithm not only provides an exponential speedup for the immediate and second predecessor relations, which are known to be tractable, but also handles the general k-th predecessor relations. The extensive experiments on lifted inference tasks and combinatorics math problems demonstrate the efficiency of our algorithm, achieving speedups of a full order of magnitude.

[221] Knowledge Grafting: A Mechanism for Optimizing AI Model Deployment in Resource-Constrained Environments

Osama Almurshed, Ashish Kaushal, Asmail Muftah, Nitin Auluck, Omer Rana

Main category: cs.AI

TL;DR: The paper introduces knowledge grafting, a method to optimize AI models for resource-constrained environments by transferring features from a large donor model to a smaller one, achieving significant size reduction and improved performance.

DetailsMotivation: The challenge of deploying large AI models in resource-constrained scenarios due to their size and computational demands.

Method: Knowledge grafting transfers selected features (scion) from a large donor model to a smaller rootstock model, optimizing for size and performance.

Result: 88.54% reduction in model size (64.39 MB to 7.38 MB), improved validation accuracy (89.97% vs. 87.47%), lower validation loss (0.2976 vs. 0.5068), and 90.45% accuracy on unseen test data.

Conclusion: Knowledge grafting effectively addresses the size vs. performance trade-off, enabling AI deployment in resource-constrained environments with enhanced performance, as demonstrated in agricultural weed detection and applicable to edge computing.

Abstract: The increasing adoption of Artificial Intelligence (AI) has led to larger, more complex models with numerous parameters that require substantial computing power – resources often unavailable in many real-world application scenarios. Our paper addresses this challenge by introducing knowledge grafting, a novel mechanism that optimizes AI models for resource-constrained environments by transferring selected features (the scion) from a large donor model to a smaller rootstock model. The approach achieves an 88.54% reduction in model size (from 64.39 MB to 7.38 MB), while improving generalization capability of the model. Our new rootstock model achieves 89.97% validation accuracy (vs. donor’s 87.47%), maintains lower validation loss (0.2976 vs. 0.5068), and performs exceptionally well on unseen test data with 90.45% accuracy. It addresses the typical size vs performance trade-off, and enables deployment of AI frameworks on resource-constrained devices with enhanced performance. We have tested our approach on an agricultural weed detection scenario, however, it can be extended across various edge computing scenarios, potentially accelerating AI adoption in areas with limited hardware/software support – by mirroring in a similar manner the horticultural grafting enables productive cultivation in challenging agri-based environments.

[222] Modeling Uncertainty: Constraint-Based Belief States in Imperfect-Information Games

Achille Morenville, Éric Piette

Main category: cs.AI

TL;DR: The paper explores belief representation in imperfect-information games, comparing constraint-based and probabilistic methods, finding minimal performance differences.

DetailsMotivation: To address decision-making in games with partial knowledge by simplifying state estimation through belief states.

Method: Two approaches: constraint-based (CSP) and probabilistic (Belief Propagation) belief representation, tested with general-purpose agents in two games.

Result: Constraint-based beliefs performed comparably to probabilistic inference, with little difference in agent performance.

Conclusion: Constraint-based belief states may be sufficient for effective decision-making in many imperfect-information games.

Abstract: In imperfect-information games, agents must make decisions based on partial knowledge of the game state. The Belief Stochastic Game model addresses this challenge by delegating state estimation to the game model itself. This allows agents to operate on externally provided belief states, thereby reducing the need for game-specific inference logic. This paper investigates two approaches to represent beliefs in games with hidden piece identities: a constraint-based model using Constraint Satisfaction Problems and a probabilistic extension using Belief Propagation to estimate marginal probabilities. We evaluated the impact of both representations using general-purpose agents across two different games. Our findings indicate that constraint-based beliefs yield results comparable to those of probabilistic inference, with minimal differences in agent performance. This suggests that constraint-based belief states alone may suffice for effective decision-making in many settings.

[223] Learning neuro-symbolic convergent term rewriting systems

Flavio Petruzzellis, Alberto Testolin, Alessandro Sperduti

Main category: cs.AI

TL;DR: A neuro-symbolic framework for learning convergent term rewriting systems, with two implementations (NRS and FastNRS), outperforms strong neural baselines in generalization and efficiency.

DetailsMotivation: Addressing the challenge of building neural systems that generalize well for symbolic algorithms, especially out-of-distribution tasks.

Method: Introduces a neuro-symbolic architecture inspired by rewriting algorithms, with modular implementations (NRS and FastNRS). Evaluated on mathematical formula simplification and multi-domain learning.

Result: FastNRS improves memory efficiency, training speed, and inference time. Outperforms Neural Data Router, GPT-4o, and matches OpenAI’s o1-preview model.

Conclusion: The proposed framework demonstrates strong generalization and efficiency, advancing neuro-symbolic integration for algorithmic tasks.

Abstract: Building neural systems that can learn to execute symbolic algorithms is a challenging open problem in artificial intelligence, especially when aiming for strong generalization and out-of-distribution performance. In this work, we introduce a general framework for learning convergent term rewriting systems using a neuro-symbolic architecture inspired by the rewriting algorithm itself. We present two modular implementations of such architecture: the Neural Rewriting System (NRS) and the Fast Neural Rewriting System (FastNRS). As a result of algorithmic-inspired design and key architectural elements, both models can generalize to out-of-distribution instances, with FastNRS offering significant improvements in terms of memory efficiency, training speed, and inference time. We evaluate both architectures on four tasks involving the simplification of mathematical formulas and further demonstrate their versatility in a multi-domain learning scenario, where a single model is trained to solve multiple types of problems simultaneously. The proposed system significantly outperforms two strong neural baselines: the Neural Data Router, a recent transformer variant specifically designed to solve algorithmic problems, and GPT-4o, one of the most powerful general-purpose large-language models. Moreover, our system matches or outperforms the latest o1-preview model from OpenAI that excels in reasoning benchmarks.

[224] Hierarchical Deep Reinforcement Learning Framework for Multi-Year Asset Management Under Budget Constraints

Amir Fard, Arnold X. -X. Yuan

Main category: cs.AI

TL;DR: A Hierarchical Deep Reinforcement Learning method is proposed for multi-year infrastructure planning, addressing scalability issues by separating budget allocation and maintenance prioritization.

DetailsMotivation: Existing methods struggle with scalability due to combinatorial action spaces, diverse asset deterioration, budget constraints, and environmental uncertainty.

Method: The approach uses a hierarchical framework: a high-level Budget Planner allocates annual budgets, and a low-level Maintenance Planner prioritizes assets. It integrates linear programming projection within a hierarchical Soft Actor-Critic framework.

Result: The method outperforms Deep Q-Learning and genetic algorithms, converging faster and scaling effectively with network size (tested on 10, 15, and 20 sewersheds).

Conclusion: The proposed methodology efficiently handles large-scale infrastructure planning, ensuring budget compliance and near-optimal solutions.

Abstract: Budget planning and maintenance optimization are crucial for infrastructure asset management, ensuring cost-effectiveness and sustainability. However, the complexity arising from combinatorial action spaces, diverse asset deterioration, stringent budget constraints, and environmental uncertainty significantly limits existing methods’ scalability. This paper proposes a Hierarchical Deep Reinforcement Learning methodology specifically tailored to multi-year infrastructure planning. Our approach decomposes the problem into two hierarchical levels: a high-level Budget Planner allocating annual budgets within explicit feasibility bounds, and a low-level Maintenance Planner prioritizing assets within the allocated budget. By structurally separating macro-budget decisions from asset-level prioritization and integrating linear programming projection within a hierarchical Soft Actor-Critic framework, the method efficiently addresses exponential growth in the action space and ensures rigorous budget compliance. A case study evaluating sewer networks of varying sizes (10, 15, and 20 sewersheds) illustrates the effectiveness of the proposed approach. Compared to conventional Deep Q-Learning and enhanced genetic algorithms, our methodology converges more rapidly, scales effectively, and consistently delivers near-optimal solutions even as network size grows.

[225] Secret Collusion among AI Agents: Multi-Agent Deception via Steganography

Sumeet Ramesh Motwani, Mikhail Baranchuk, Martin Strohmeier, Vijay Bolina, Philip H. S. Torr, Lewis Hammond, Christian Schroeder de Witt

Main category: cs.AI

TL;DR: The paper formalizes secret collusion in generative AI agents, studies steganography incentives, proposes mitigations, and evaluates LLMs, finding GPT-4 poses a notable risk.

DetailsMotivation: Address privacy and security challenges from AI agents' unauthorized information sharing or coordination.

Method: Formalizes the problem, studies incentives, proposes mitigations, and tests LLMs for steganographic capabilities.

Result: GPT-4 shows a significant capability jump in steganography, highlighting risks.

Conclusion: Calls for continuous monitoring and a research program to mitigate future collusion risks.

Abstract: Recent capability increases in large language models (LLMs) open up applications in which groups of communicating generative AI agents solve joint tasks. This poses privacy and security challenges concerning the unauthorised sharing of information, or other unwanted forms of agent coordination. Modern steganographic techniques could render such dynamics hard to detect. In this paper, we comprehensively formalise the problem of secret collusion in systems of generative AI agents by drawing on relevant concepts from both AI and security literature. We study incentives for the use of steganography, and propose a variety of mitigation measures. Our investigations result in a model evaluation framework that systematically tests capabilities required for various forms of secret collusion. We provide extensive empirical results across a range of contemporary LLMs. While the steganographic capabilities of current models remain limited, GPT-4 displays a capability jump suggesting the need for continuous monitoring of steganographic frontier model capabilities. We conclude by laying out a comprehensive research program to mitigate future risks of collusion between generative AI models.

[226] Improving Question Embeddings with Cognitive Representation Optimization for Knowledge Tracing

Lixiang Xu, Xianwei Ding, Xin Yuan, Zhanlong Wang, Lu Bai, Enhong Chen, Philip S. Yu, Yuanyan Tang

Main category: cs.AI

TL;DR: The paper proposes a CRO-KT model to optimize cognitive representation in knowledge tracking, addressing distractions and static limitations in current methods.

DetailsMotivation: Current KT models ignore distractions (e.g., slipping, guessing) and assume static cognitive representations, leading to dissonant records.

Method: Uses dynamic programming and synergistic optimization to align cognitive representation with student patterns and exercise difficulty.

Result: Experiments on three datasets confirm the model’s effectiveness in enhancing cognitive expression.

Conclusion: The CRO-KT model improves knowledge tracking by dynamically optimizing cognitive representation and integrating relationship embeddings.

Abstract: Designed to track changes in students’ knowledge status and predict their future answers based on students’ historical answer records. Current research on KT modeling focuses on predicting future student performance based on existing, unupdated records of student learning interactions. However, these methods ignore distractions in the response process (such as slipping and guessing) and ignore that static cognitive representations are temporary and limited. Most of them assume that there are no distractions during the answering process, and that the recorded representation fully represents the student’s understanding and proficiency in knowledge. This can lead to many dissonant and uncoordinated issues in the original record. Therefore, we propose a knowledge-tracking cognitive representation optimization (CRO-KT) model that uses dynamic programming algorithms to optimize the structure of cognitive representation. This ensures that the structure matches the student’s cognitive patterns in terms of practice difficulty. In addition, we use a synergistic optimization algorithm to optimize the cognitive representation of sub-target exercises based on the overall picture of exercise responses by considering all exercises with synergistic relationships as one goal. At the same time, the CRO-KT model integrates the relationship embedding learned in the dichotomous graph with the optimized record representation in a weighted manner, which enhances students’ cognitive expression ability. Finally, experiments were conducted on three public datasets to verify the effectiveness of the proposed cognitive representation optimization model.

[227] Reshaping MOFs text mining with a dynamic multi-agents framework of large language model

Zuhong Lin, Daoyuan Ren, Kai Ran, Jing Sun, Songlin Yu, Xuefeng Bai, Xiaotiang Huang, Haiyang He, Pengxu Pan, Xiaohang Zhang, Ying Fang, Tianying Wang, Minli Wu, Zhanglin Li, Xiaochuan Zhang, Haipu Li, Jingjing Yao

Main category: cs.AI

TL;DR: MOFh6 is an LLM-based multi-agent system that extracts and structures MOF synthesis knowledge from diverse inputs, achieving high accuracy and efficiency.

DetailsMotivation: The unstructured nature of scientific texts hinders actionable insights for MOF synthesis, necessitating a scalable solution.

Method: Built on GPT-4o-mini and fine-tuned with expert-annotated data, MOFh6 processes raw literature and crystal codes.

Result: 99% accuracy in synthesis parsing, 94.1% co-reference resolution, and cost savings of 76%.

Conclusion: MOFh6 transforms static retrieval into dynamic knowledge acquisition, accelerating materials discovery.

Abstract: Accurately identifying synthesis conditions for metal-organic frameworks (MOFs) remains a critical bottleneck in materials research, as translating literature-derived knowledge into actionable insights is hindered by the unstructured and heterogeneous nature of scientific texts. Here we present MOFh6, a large language model (LLM)-based multi-agent system designed to extract, structure, and apply synthesis knowledge from diverse input formats, including raw literature and crystal codes. Built on gpt-4o-mini and fine-tuned with up to few-shot expert-annotated data, MOFh6 achieves 99% accuracy in synthesis data parsing and resolves 94.1% of complex co-reference abbreviations. It processes a single full-text document in 9.6 seconds and localizes structured synthesis descriptions within 36 seconds, with the cost per 100 papers reduced to USD 4.24, a 76% saving over existing systems. By addressing long-standing limitations in cross-paragraph semantic fusion and terminology standardization, MOFh6 reshapes the LLM-based paradigm for MOF synthesis research, transforming static retrieval into an integrated and dynamic knowledge acquisition process. This shift bridges the gap between scientific literature and actionable synthesis design, providing a scalable framework for accelerating materials discovery.

[228] Understanding LLM Scientific Reasoning through Promptings and Model’s Explanation on the Answers

Alice Rueda, Mohammed S. Hassan, Argyrios Perivolaris, Bazen G. Teferra, Reza Samavi, Sirisha Rambhatla, Yuqi Wu, Yanbo Zhang, Bo Cao, Divya Sharma, Sridhar Krishnan, Venkat Bhat

Main category: cs.AI

TL;DR: The paper evaluates the reasoning capabilities of LLMs like GPT-4o using prompt engineering on the GPQA dataset, finding self-consistency most accurate but lacking in explanation. It suggests integrating structured reasoning and hybrid AI for improvement.

DetailsMotivation: To assess and improve the multi-step reasoning abilities of LLMs, crucial for fields like science, medicine, and law.

Method: Tested seven prompt engineering techniques (e.g., CoT, self-consistency) on GPT-4o using the GPQA dataset.

Result: Self-consistency achieved the highest accuracy (52.99%) but performed poorly in explanations. Direct answer and zero-shot CoT showed strong reasoning.

Conclusion: LLMs rely on pattern recognition over logical inference. Future work should focus on structured reasoning and hybrid AI to enhance robustness.

Abstract: Large language models (LLMs) have demonstrated remarkable capabilities in natural language understanding, reasoning, and problem-solving across various domains. However, their ability to perform complex, multi-step reasoning task-essential for applications in science, medicine, and law-remains an area of active investigation. This paper examines the reasoning capabilities of contemporary LLMs, analyzing their strengths, limitations, and potential for improvement. The study uses prompt engineering techniques on the Graduate-Level GoogleProof Q&A (GPQA) dataset to assess the scientific reasoning of GPT-4o. Five popular prompt engineering techniques and two tailored promptings were tested: baseline direct answer (zero-shot), chain-of-thought (CoT), zero-shot CoT, self-ask, self-consistency, decomposition, and multipath promptings. Our findings indicate that while LLMs exhibit emergent reasoning abilities, they often rely on pattern recognition rather than true logical inference, leading to inconsistencies in complex problem-solving. The results indicated that self-consistency outperformed the other prompt engineering technique with an accuracy of 52.99%, followed by direct answer (52.23%). Zero-shot CoT (50%) outperformed multipath (48.44%), decomposition (47.77%), self-ask (46.88%), and CoT (43.75%). Self-consistency performed the second worst in explaining the answers. Simple techniques such as direct answer, CoT, and zero-shot CoT have the best scientific reasoning. We propose a research agenda aimed at bridging these gaps by integrating structured reasoning frameworks, hybrid AI approaches, and human-in-the-loop methodologies. By critically evaluating the reasoning mechanisms of LLMs, this paper contributes to the ongoing discourse on the future of artificial general intelligence and the development of more robust, trustworthy AI systems.

[229] RedactOR: An LLM-Powered Framework for Automatic Clinical Data De-Identification

Praphul Singh, Charlotte Dzialo, Jangwon Kim, Sumana Srivatsa, Irfan Bulu, Sri Gadde, Krishnaram Kenthapadi

Main category: cs.AI

TL;DR: RedactOR is a multi-modal framework for de-identifying clinical data, combining rule-based and LLM approaches to improve recall and efficiency while reducing costs.

DetailsMotivation: Existing De-ID methods have limitations like recall errors and inefficiencies, hindering real-world use in healthcare AI.

Method: RedactOR uses hybrid rule and LLM strategies, intelligent routing, and a two-step audio redaction approach, with retrieval-based relexicalization for consistency.

Result: Achieves competitive performance on the i2b2 2014 dataset, optimizing token usage to lower LLM costs.

Conclusion: RedactOR is effective for real-world healthcare data pipelines, balancing privacy and utility.

Abstract: Ensuring clinical data privacy while preserving utility is critical for AI-driven healthcare and data analytics. Existing de-identification (De-ID) methods, including rule-based techniques, deep learning models, and large language models (LLMs), often suffer from recall errors, limited generalization, and inefficiencies, limiting their real-world applicability. We propose a fully automated, multi-modal framework, RedactOR for de-identifying structured and unstructured electronic health records, including clinical audio records. Our framework employs cost-efficient De-ID strategies, including intelligent routing, hybrid rule and LLM based approaches, and a two-step audio redaction approach. We present a retrieval-based entity relexicalization approach to ensure consistent substitutions of protected entities, thereby enhancing data coherence for downstream applications. We discuss key design desiderata, de-identification and relexicalization methodology, and modular architecture of RedactOR and its integration with the Oracle Health Clinical AI system. Evaluated on the i2b2 2014 De-ID dataset using standard metrics with strict recall, our approach achieves competitive performance while optimizing token usage to reduce LLM costs. Finally, we discuss key lessons and insights from deployment in real-world AI- driven healthcare data pipelines.

[230] AI PsyRoom: Artificial Intelligence Platform for Segmented Yearning and Reactive Outcome Optimization Method

Yigui Feng, Qinglin Wang, Ke Liu, Xinhai Chen, Bo Yang, Jie Liu

Main category: cs.AI

TL;DR: AI PsyRoom, a multi-agent framework, enhances psychological counseling by generating empathetic dialogues and personalized treatment plans, outperforming existing methods.

DetailsMotivation: Addressing the shortage of trained professionals and the lack of deep emotional understanding in existing LLMs for psychological counseling.

Method: Uses fine-grained emotion classification and a multi-agent framework (PsyRoom A for dialogue reconstruction, PsyRoom B for treatment plans) to create the EmoPsy dataset.

Result: Achieves 18-24% improvements in problem orientation, expression, empathy, and communication quality.

Conclusion: AI PsyRoom provides a foundation for advancing AI-assisted psychological counseling, with datasets and models publicly available.

Abstract: Psychological counseling faces huge challenges due to the growing demand for mental health services and the shortage of trained professionals. Large language models (LLMs) have shown potential to assist psychological counseling, especially in empathy and emotional support. However, existing models lack a deep understanding of emotions and are unable to generate personalized treatment plans based on fine-grained emotions. To address these shortcomings, we present AI PsyRoom, a multi-agent simulation framework designed to enhance psychological counseling by generating empathetic and emotionally nuanced conversations. By leveraging fine-grained emotion classification and a multi-agent framework, we construct a multi-agent PsyRoom A for dialogue reconstruction, generating a high-quality dialogue dataset EmoPsy, which contains 35 sub-emotions, 423 specific emotion scenarios, and 12,350 dialogues. We also propose PsyRoom B for generating personalized treatment plans. Quantitative evaluations demonstrate that AI PsyRoom significantly outperforms state-of-the-art methods, achieving 18% improvement in problem orientation, 23% in expression, 24% in Empathy, and 16% in interactive communication quality. The datasets and models are publicly available, providing a foundation for advancing AI-assisted psychological counseling research.

[231] AI Flow: Perspectives, Scenarios, and Approaches

Hongjun An, Wenhan Hu, Sida Huang, Siqi Huang, Ruanjun Li, Yuanzhi Liang, Jiawei Shao, Yiliang Song, Zihan Wang, Cheng Yuan, Chi Zhang, Hongyuan Zhang, Wenhao Zhuang, Xuelong Li

Main category: cs.AI

TL;DR: AI Flow is a framework addressing resource and bandwidth challenges in large AI models by integrating device-edge-cloud systems, familial models, and connectivity-based intelligence emergence.

DetailsMotivation: The convergence of IT/CT has advanced AI, but large models face resource and communication challenges. AI Flow aims to solve these.

Method: AI Flow uses a device-edge-cloud framework, familial models for flexibility, and connectivity-based intelligence emergence.

Result: AI Flow enhances intelligence, responsiveness, and accessibility, improving AI-communication fusion.

Conclusion: AI Flow paves the way for efficient, scalable, and collaborative AI systems.

Abstract: Pioneered by the foundational information theory by Claude Shannon and the visionary framework of machine intelligence by Alan Turing, the convergent evolution of information and communication technologies (IT/CT) has created an unbroken wave of connectivity and computation. This synergy has sparked a technological revolution, now reaching its peak with large artificial intelligence (AI) models that are reshaping industries and redefining human-machine collaboration. However, the realization of ubiquitous intelligence faces considerable challenges due to substantial resource consumption in large models and high communication bandwidth demands. To address these challenges, AI Flow has been introduced as a multidisciplinary framework that integrates cutting-edge IT and CT advancements, with a particular emphasis on the following three key points. First, device-edge-cloud framework serves as the foundation, which integrates end devices, edge servers, and cloud clusters to optimize scalability and efficiency for low-latency model inference. Second, we introduce the concept of familial models, which refers to a series of different-sized models with aligned hidden features, enabling effective collaboration and the flexibility to adapt to varying resource constraints and dynamic scenarios. Third, connectivity- and interaction-based intelligence emergence is a novel paradigm of AI Flow. By leveraging communication networks to enhance connectivity, the collaboration among AI models across heterogeneous nodes achieves emergent intelligence that surpasses the capability of any single model. The innovations of AI Flow provide enhanced intelligence, timely responsiveness, and ubiquitous accessibility to AI services, paving the way for the tighter fusion of AI techniques and communication systems.

[232] Measuring the Impact of Early-2025 AI on Experienced Open-Source Developer Productivity

Joel Becker, Nate Rush, Elizabeth Barnes, David Rein

Main category: cs.AI

TL;DR: AI tools increased task completion time by 19% for experienced developers, contrary to predictions of time savings.

DetailsMotivation: To study the real-world impact of AI tools on productivity in software development.

Method: Randomized controlled trial (RCT) with 16 developers completing 246 tasks, comparing AI-allowed and AI-disallowed conditions.

Result: AI tools slowed developers by 19%, contradicting expert predictions of 38-39% time savings.

Conclusion: The slowdown effect is robust, suggesting AI tools may not universally boost productivity as expected.

Abstract: Despite widespread adoption, the impact of AI tools on software development in the wild remains understudied. We conduct a randomized controlled trial (RCT) to understand how AI tools at the February-June 2025 frontier affect the productivity of experienced open-source developers. 16 developers with moderate AI experience complete 246 tasks in mature projects on which they have an average of 5 years of prior experience. Each task is randomly assigned to allow or disallow usage of early 2025 AI tools. When AI tools are allowed, developers primarily use Cursor Pro, a popular code editor, and Claude 3.5/3.7 Sonnet. Before starting tasks, developers forecast that allowing AI will reduce completion time by 24%. After completing the study, developers estimate that allowing AI reduced completion time by 20%. Surprisingly, we find that allowing AI actually increases completion time by 19%–AI tooling slowed developers down. This slowdown also contradicts predictions from experts in economics (39% shorter) and ML (38% shorter). To understand this result, we collect and evaluate evidence for 20 properties of our setting that a priori could contribute to the observed slowdown effect–for example, the size and quality standards of projects, or prior developer experience with AI tooling. Although the influence of experimental artifacts cannot be entirely ruled out, the robustness of the slowdown effect across our analyses suggests it is unlikely to primarily be a function of our experimental design.

[233] Why Isn’t Relational Learning Taking Over the World?

David Poole

Main category: cs.AI

TL;DR: The paper argues for prioritizing relational learning over pixel/word modeling, highlighting its underuse despite its potential in handling real-world data like spreadsheets and databases.

DetailsMotivation: Current AI focuses on modeling pixels and words, but real-world data (e.g., spreadsheets, databases) is relational. The paper advocates for relational learning to better represent entities and their relations.

Method: The paper critiques the dominance of pixel/word modeling and examines why relational learning remains niche, proposing steps to elevate its prominence.

Result: Relational learning is underutilized except in limited cases, despite its suitability for real-world data.

Conclusion: Relational learning needs broader adoption and development to match its potential in representing real-world entities and relations.

Abstract: AI seems to be taking over the world with systems that model pixels, words, and phonemes. The world is arguably made up, not of pixels, words, and phonemes but of entities (objects, things, including events) with properties and relations among them. Surely we should model these, not the perception or description of them. You might suspect that concentrating on modeling words and pixels is because all of the (valuable) data in the world is in terms of text and images. If you look into almost any company you will find their most valuable data is in spreadsheets, databases and other relational formats. These are not the form that are studied in introductory machine learning, but are full of product numbers, student numbers, transaction numbers and other identifiers that can’t be interpreted naively as numbers. The field that studies this sort of data has various names including relational learning, statistical relational AI, and many others. This paper explains why relational learning is not taking over the world – except in a few cases with restricted relations – and what needs to be done to bring it to it’s rightful prominence.

[234] Gemini 2.5 Pro Capable of Winning Gold at IMO 2025

Yichen Huang, Lin F. Yang

Main category: cs.AI

TL;DR: Gemini 2.5 Pro solves 5 out of 6 IMO 2025 problems using self-verification, highlighting the need for optimal strategies in LLMs for complex reasoning.

DetailsMotivation: Addressing the challenge of applying LLMs to Olympiad-level math problems, which require deep insight and creativity.

Method: Used Google’s Gemini 2.5 Pro with a self-verification pipeline and careful prompt design on IMO 2025 problems.

Result: Solved 5 out of 6 problems correctly, demonstrating progress in LLM performance on complex tasks.

Conclusion: Optimal strategies are crucial for leveraging LLMs in advanced reasoning tasks like the IMO.

Abstract: The International Mathematical Olympiad (IMO) poses uniquely challenging problems requiring deep insight, creativity, and formal reasoning. While Large Language Models (LLMs) perform well on mathematical benchmarks like AIME, they struggle with Olympiad-level tasks. We use Google’s Gemini 2.5 Pro on the newly released IMO 2025 problems, avoiding data contamination. Using a self-verification pipeline with careful prompt design, 5 (out of 6) problems are solved correctly. This result underscores the importance of developing optimal strategies to harness the full potential of powerful LLMs for complex reasoning tasks.

cs.SD

[235] SCORE-SET: A dataset of GuitarPro files for Music Phrase Generation and Sequence Learning

Vishakh Begari

Main category: cs.SD

TL;DR: A dataset of Guitar Pro tablature files (.gp5) is created for guitar music tasks, derived from MIDI notes in MAESTRO and GiantMIDI, enriched with performance nuances like bends and slides.

DetailsMotivation: To support guitar music generation, sequence modeling, and performance-aware learning by providing a dataset that reflects real-world guitar playing nuances.

Method: Adapt MIDI notes from MAESTRO and GiantMIDI into rhythm guitar tracks, then process them to include expression settings like bends, slides, vibrato, and palm muting.

Result: A curated dataset of Guitar Pro tablature files with performance nuances for realistic guitar music tasks.

Conclusion: The dataset enhances tasks involving guitar music by incorporating real-world performance details.

Abstract: A curated dataset of Guitar Pro tablature files (.gp5 format), tailored for tasks involving guitar music generation, sequence modeling, and performance-aware learning is provided. The dataset is derived from MIDI notes in MAESTRO and GiantMIDI which have been adapted into rhythm guitar tracks. These tracks are further processed to include a variety of expression settings typical of guitar performance, such as bends, slides, vibrato, and palm muting, to better reflect the nuances of real-world guitar playing.

[236] HH-Codec: High Compression High-fidelity Discrete Neural Codec for Spoken Language Modeling

Rongkun Xue, Yazhe Niu, Shuai Hu, Zixin Yin, Yongqiang Yao, Jing Yang

Main category: cs.SD

TL;DR: HH-Codec introduces a neural codec for extreme speech compression (24 tokens/sec for 24 kHz audio) using single-quantizer inference and optimized VQ space, achieving 0.3 kbps bandwidth with high fidelity.

DetailsMotivation: Addressing challenges of parallel streams and computational costs in large-scale speech-to-speech systems.

Method: Uses a Vector Quantization space for Spoken Language Modeling and an asymmetric encoder-decoder architecture with dual supervision and progressive training.

Result: State-of-the-art performance in speech reconstruction at 0.3 kbps, validated by extensive ablations.

Conclusion: HH-Codec is effective for ultra-low bandwidth speech compression and adaptable for generative models.

Abstract: Discrete speech tokenization is a fundamental component in speech codecs. However, in large-scale speech-to-speech systems, the complexity of parallel streams from multiple quantizers and the computational cost of high-time-dimensional codecs pose significant challenges. In this paper, we introduce HH-Codec, a neural codec that achieves extreme compression at 24 tokens per second for 24 kHz audio while relying on single-quantizer inference. Our approach involves a carefully designed Vector Quantization space for Spoken Language Modeling, optimizing compression efficiency while minimizing information loss. Building on this, we propose an asymmetric encoder-decoder architecture (Audio-VQ-Mel-Audio) that leverages dual supervision and progressive training to enhance reconstruction stability and fidelity. HH-Codec achieves state-of-the-art performance in speech reconstruction with an ultra-low bandwidth of 0.3 kbps. We further evaluate its effectiveness in codebook utilization and generative model adaptation, with extensive ablations validating the necessity of each module. HH-Codec is available at https://github.com/opendilab/HH-Codec.

[237] MLLM-based Speech Recognition: When and How is Multimodality Beneficial?

Yiwen Guan, Viet Anh Trinh, Vivek Voleti, Jacob Whitehill

Main category: cs.SD

TL;DR: Multi-modal large language models (MLLMs) improve ASR accuracy in noisy environments, with synchronized modalities aiding high noise and unsynchronized ones helping at moderate noise. Visual quality and modality input order also impact performance.

DetailsMotivation: To explore how multiple input modalities can enhance ASR accuracy in noisy settings, leveraging complementary information from different modalities.

Method: Experiments on synthetic and real-world data, analyzing the impact of synchronized/unsynchronized modalities, visual quality, and input order on ASR performance.

Result: More modalities generally improve ASR, with synchronized ones excelling in high noise and unsynchronized in moderate noise. Visual quality and modality order significantly affect accuracy.

Conclusion: Multi-modal approaches enhance ASR in noise, with practical insights for model design and a deeper understanding of modality interactions.

Abstract: Recent advances in multi-modal large language models (MLLMs) have opened new possibilities for unified modeling of speech, text, images, and other modalities. Building on our prior work, this paper examines the conditions and model architectures under which multiple input modalities can improve automatic speech recognition (ASR) accuracy in noisy environments. Through experiments on synthetic and real-world data, we find that (1) harnessing more modalities usually improves ASR accuracy, as each modality provides complementary information, but the improvement depends on the amount of auditory noise. (2) Synchronized modalities (e.g., lip movements) are more useful at high noise levels whereas unsynchronized modalities (e.g., image context) are most helpful at moderate noise levels. (3) Higher-quality visual representations consistently improve ASR accuracy, highlighting the importance of developing more powerful visual encoders. (4) Mamba exhibits similar trends regarding the benefits of multimodality as do Transformers. (5) The input order of modalities as well as their weights in the loss function can significantly impact accuracy. These findings both offer practical insights and help to deepen our understanding of multi-modal speech recognition under challenging conditions.

[238] From Continuous to Discrete: Cross-Domain Collaborative General Speech Enhancement via Hierarchical Language Models

Zhaoxi Mu, Rilin Chen, Andong Li, Meng Yu, Xinyu Yang, Dong Yu

Main category: cs.SD

TL;DR: OmniGSE is a general speech enhancement framework addressing multiple distortions like noise, reverberation, and bandwidth issues. It combines discriminative and generative methods in a two-stage architecture, outperforming existing models in complex scenarios.

DetailsMotivation: Existing methods struggle with multiple simultaneous distortions in real-world speech signals. OmniGSE aims to bridge this gap by integrating diverse approaches.

Method: OmniGSE uses a two-stage architecture: first, enhancing continuous features with a NAC-RoFormer, and second, reconstructing speech via hierarchical language models (RootLM and BranchLMs).

Result: OmniGSE outperforms benchmarks, especially in compound distortion scenarios, showing robustness and versatility.

Conclusion: OmniGSE demonstrates strong potential for real-world speech enhancement, handling diverse distortions effectively.

Abstract: This paper introduces OmniGSE, a novel general speech enhancement (GSE) framework designed to mitigate the diverse distortions that speech signals encounter in real-world scenarios. These distortions include background noise, reverberation, bandwidth limitations, signal clipping, and network packet loss. Existing methods typically focus on optimizing for a single type of distortion, often struggling to effectively handle the simultaneous presence of multiple distortions in complex scenarios. OmniGSE bridges this gap by integrating the strengths of discriminative and generative approaches through a two-stage architecture that enables cross-domain collaborative optimization. In the first stage, continuous features are enhanced using a lightweight channel-split NAC-RoFormer. In the second stage, discrete tokens are generated to reconstruct high-quality speech through language models. Specifically, we designed a hierarchical language model structure consisting of a RootLM and multiple BranchLMs. The RootLM models general acoustic features across codebook layers, while the BranchLMs explicitly capture the progressive relationships between different codebook levels. Experimental results demonstrate that OmniGSE surpasses existing models across multiple benchmarks, particularly excelling in scenarios involving compound distortions. These findings underscore the framework’s potential for robust and versatile speech enhancement in real-world applications.

[239] Latent Granular Resynthesis using Neural Audio Codecs

Nao Tokui, Tom Baker

Main category: cs.SD

TL;DR: A novel audio resynthesis technique using granular synthesis at the latent vector level, creating hybrid audio with target structure and source timbre.

DetailsMotivation: To innovate audio resynthesis by leveraging latent vectors for seamless timbral transfer without model training.

Method: Encodes source audio into latent vector segments (granular codebook), matches target grains to the codebook, and decodes the hybrid sequence.

Result: Produces audio preserving target structure and source timbre, avoiding discontinuities of traditional methods.

Conclusion: The technique is versatile, requires no training, and offers a seamless alternative to concatenative synthesis.

Abstract: We introduce a novel technique for creative audio resynthesis that operates by reworking the concept of granular synthesis at the latent vector level. Our approach creates a “granular codebook” by encoding a source audio corpus into latent vector segments, then matches each latent grain of a target audio signal to its closest counterpart in the codebook. The resulting hybrid sequence is decoded to produce audio that preserves the target’s temporal structure while adopting the source’s timbral characteristics. This technique requires no model training, works with diverse audio materials, and naturally avoids the discontinuities typical of traditional concatenative synthesis through the codec’s implicit interpolation during decoding. We include supplementary material at https://github.com/naotokui/latentgranular/ , as well as a proof-of-concept implementation to allow users to experiment with their own sounds at https://huggingface.co/spaces/naotokui/latentgranular .

[240] Face2VoiceSync: Lightweight Face-Voice Consistency for Text-Driven Talking Face Generation

Fang Kang, Yin Cao, Haoyu Chen

Main category: cs.SD

TL;DR: The paper introduces Face2VoiceSync, a framework for generating talking face animations and corresponding speech from a face image and text, addressing limitations of fixed-driven speech methods.

DetailsMotivation: To overcome the limitations of fixed-driven speech in talking face generation, such as face-voice mismatch, and extend the task to a more challenging setting involving text input.

Method: Proposes Face2VoiceSync with Voice-Face Alignment, Diversity & Manipulation, Efficient Training using a lightweight VAE, and a new evaluation metric.

Result: Face2VoiceSync achieves state-of-the-art performance in both visual and audio generation on a single 40GB GPU.

Conclusion: The framework successfully addresses the challenges of face-voice alignment and diversity, offering efficient training and superior performance.

Abstract: Recent studies in speech-driven talking face generation achieve promising results, but their reliance on fixed-driven speech limits further applications (e.g., face-voice mismatch). Thus, we extend the task to a more challenging setting: given a face image and text to speak, generating both talking face animation and its corresponding speeches. Accordingly, we propose a novel framework, Face2VoiceSync, with several novel contributions: 1) Voice-Face Alignment, ensuring generated voices match facial appearance; 2) Diversity & Manipulation, enabling generated voice control over paralinguistic features space; 3) Efficient Training, using a lightweight VAE to bridge visual and audio large-pretrained models, with significantly fewer trainable parameters than existing methods; 4) New Evaluation Metric, fairly assessing the diversity and identity consistency. Experiments show Face2VoiceSync achieves both visual and audio state-of-the-art performances on a single 40GB GPU.

[241] The Eloquence team submission for task 1 of MLC-SLM challenge

Lorenzo Concina, Jordi Luque, Alessio Brutti, Marco Matassoni, Yuchen Zhang

Main category: cs.SD

TL;DR: The paper explores three approaches to multilingual ASR for conversational speech recognition, evaluating baselines, custom projectors, and contrastive learning.

DetailsMotivation: Advancing multilingual conversational speech recognition to improve Spoken Dialogue Systems using real-world conversational data.

Method: 1. Evaluate official baseline with linear and qformer projectors. 2. Train a custom multilingual linear projector using SLAM-ASR. 3. Investigate contrastive learning and extended conversational context.

Result: Findings highlight the effectiveness of custom projectors and contrastive learning in enhancing ASR robustness.

Conclusion: The study demonstrates progress in multilingual conversational speech recognition, emphasizing the value of tailored architectures and learning techniques.

Abstract: In this paper, we present our studies and experiments carried out for the task 1 of the Challenge and Workshop on Multilingual Conversational Speech Language Model (MLC-SLM), which focuses on advancing multilingual conversational speech recognition through the development of speech language models architectures. Given the increasing relevance of real-world conversational data for building robust Spoken Dialogue Systems, we explore three approaches to multilingual ASR. First, we conduct an evaluation of the official baseline to better understand its strengths and limitations, by training two projectors (linear and qformer) with different foundation models. Second we leverage the SLAM-ASR framework to train a custom multilingual linear projector. Finally we investigate the role of contrastive learning and the extended conversational context in enhancing the robustness of recognition.

cs.LG

[242] Diffusion Models for Solving Inverse Problems via Posterior Sampling with Piecewise Guidance

Saeed Mohseni-Sehdeh, Walid Saad, Kei Sakaguchi, Tao Yu

Main category: cs.LG

TL;DR: A novel diffusion-based framework with piecewise guidance for solving inverse problems, balancing efficiency and accuracy without task-specific retraining.

DetailsMotivation: To address the challenge of solving inverse problems efficiently and accurately using diffusion models, while avoiding the need for retraining for each specific task.

Method: Introduces a piecewise guidance scheme for diffusion models, where the guidance term varies by diffusion timestep, optimizing approximations during high-noise and low-noise phases.

Result: Achieves significant reductions in inference time (25% for inpainting, 23-24% for super-resolution) with minimal loss in PSNR and SSIM.

Conclusion: The proposed framework is effective, problem-agnostic, and adaptable, demonstrating practical utility in image restoration tasks.

Abstract: Diffusion models are powerful tools for sampling from high-dimensional distributions by progressively transforming pure noise into structured data through a denoising process. When equipped with a guidance mechanism, these models can also generate samples from conditional distributions. In this paper, a novel diffusion-based framework is introduced for solving inverse problems using a piecewise guidance scheme. The guidance term is defined as a piecewise function of the diffusion timestep, facilitating the use of different approximations during high-noise and low-noise phases. This design is shown to effectively balance computational efficiency with the accuracy of the guidance term. Unlike task-specific approaches that require retraining for each problem, the proposed method is problem-agnostic and readily adaptable to a variety of inverse problems. Additionally, it explicitly incorporates measurement noise into the reconstruction process. The effectiveness of the proposed framework is demonstrated through extensive experiments on image restoration tasks, specifically image inpainting and super-resolution. Using a class conditional diffusion model for recovery, compared to the \pgdm baseline, the proposed framework achieves a reduction in inference time of (25%) for inpainting with both random and center masks, and (23%) and (24%) for (4\times) and (8\times) super-resolution tasks, respectively, while incurring only negligible loss in PSNR and SSIM.

[243] Efficient Knowledge Tracing Leveraging Higher-Order Information in Integrated Graphs

Donghee Han, Daehee Kim, Minjun Lee, Daeyoung Roh, Keejun Han, Mun Yong Yi

Main category: cs.LG

TL;DR: DGAKT is a graph neural network model that improves computational efficiency in knowledge tracing by focusing on relevant subgraphs, outperforming existing methods.

DetailsMotivation: Existing knowledge tracing methods struggle with high computational costs for large graphs and long sequences, which DGAKT aims to solve.

Method: DGAKT uses a subgraph-based approach with dual graph attention to process only relevant student-exercise-KC relationships, reducing resource usage.

Result: DGAKT outperforms existing KT models and sets a new standard in resource efficiency.

Conclusion: DGAKT addresses computational inefficiencies in KT, offering a scalable and effective solution.

Abstract: The rise of online learning has led to the development of various knowledge tracing (KT) methods. However, existing methods have overlooked the problem of increasing computational cost when utilizing large graphs and long learning sequences. To address this issue, we introduce Dual Graph Attention-based Knowledge Tracing (DGAKT), a graph neural network model designed to leverage high-order information from subgraphs representing student-exercise-KC relationships. DGAKT incorporates a subgraph-based approach to enhance computational efficiency. By processing only relevant subgraphs for each target interaction, DGAKT significantly reduces memory and computational requirements compared to full global graph models. Extensive experimental results demonstrate that DGAKT not only outperforms existing KT models but also sets a new standard in resource efficiency, addressing a critical need that has been largely overlooked by prior KT approaches.

[244] Innovator: Scientific Continued Pretraining with Fine-grained MoE Upcycling

Ning Liao, Xiaoxing Wang, Zehao Lin, Weiyang Guo, Feng Hong, Shixiang Song, Geng Yu, Zihua Zhao, Sitao Xie, Longxuan Wei, Xiangqi Jin, Xiaohan Qin, Jiale Ma, Kai Chen, Jiangchao Yao, Zhouhan Lin, Junchi Yan, Zhiyu Li, Feiyu Xiong, Yanfeng Wang, Linfeng Zhang

Main category: cs.LG

TL;DR: Innovator upcycles a dense LLM into a Mixtures-of-Experts model to prevent catastrophic forgetting, improving science tasks by 25% while retaining general task performance.

DetailsMotivation: To address catastrophic forgetting when pretraining LLMs with science data while maintaining general task abilities.

Method: Four-stage training: Scientific Expert Induction, Fine-grained Expert Splitting, Science-Aware Routing warmup, and Generalist-Scientist Integration.

Result: 25% average improvement in 30 scientific tasks, 70% win rate, and 99% general task retention. Innovator-Reason boosts reasoning by 30%.

Conclusion: Innovator effectively decouples and integrates scientific and general knowledge, enhancing performance without forgetting.

Abstract: A large language model (LLM) with knowledge in both scientific and general tasks is the foundation of science general intelligence. However, directly continued pretraining an LLM using science data usually leads to catastrophic forgetting, which indicates severe degradation in general ability. In this report, we present Innovator, which solves this problem by upcycling a pre-trained dense LLM into a fine-grained Mixtures-of-Experts model during continued pretraining, where different experts are expected to learn science knowledge in different disciplines, and a shared expert is utilized for general tasks. Innovator introduces a four-stage upcycle training paradigm: (1) Scientific Expert Induction on discipline-specific data, (2) Fine-grained Expert Splitting via FFN dimension decomposition, (3) Science-Aware Routing warmup, and (4) Generalist-Scientist Integration training on hybrid datasets. Such a paradigm enables knowledge in the general domain, and different scientific disciplines can be decoupled, avoiding the negative influence among knowledge in different domains. With 53.3B total parameters and 13.3B activated, Innovator extends Qwen2.5-7B using a shared general expert and 64 specialized scientific experts with 8 activated. Trained on 300B tokens with tri-level quality-controlled data, Innovator achieves 25% average improvement across 30 scientific tasks with a win rate as 70%, while retaining 99% performance in general tasks. Furthermore, Innovator-Reason, which is post-trained from Innovator for reasoning boosting, exhibits excellent reasoning performance in solving complex scientific problems with improvements over 30%.

[245] Learning Individual Intrinsic Reward in Multi-Agent Reinforcement Learning via Incorporating Generalized Human Expertise

Xuefei Wu, Xiao Yin, Yuanyang Zhu, Chunlin Chen

Main category: cs.LG

TL;DR: LIGHT integrates human expertise into MARL to improve exploration efficiency by designing individual intrinsic rewards, outperforming baselines in sparse-reward tasks.

DetailsMotivation: Exploration in MARL is challenging with sparse team rewards; manual reward shaping lacks high-order intelligence and generalization.

Method: LIGHT combines human knowledge with MARL, using action and human preference distributions to design intrinsic rewards aligned with Q-learning.

Result: LIGHT outperforms baselines in performance and knowledge reusability across sparse-reward scenarios.

Conclusion: LIGHT effectively integrates human expertise into MARL, enhancing exploration and generalization in complex tasks.

Abstract: Efficient exploration in multi-agent reinforcement learning (MARL) is a challenging problem when receiving only a team reward, especially in environments with sparse rewards. A powerful method to mitigate this issue involves crafting dense individual rewards to guide the agents toward efficient exploration. However, individual rewards generally rely on manually engineered shaping-reward functions that lack high-order intelligence, thus it behaves ineffectively than humans regarding learning and generalization in complex problems. To tackle these issues, we combine the above two paradigms and propose a novel framework, LIGHT (Learning Individual Intrinsic reward via Incorporating Generalized Human experTise), which can integrate human knowledge into MARL algorithms in an end-to-end manner. LIGHT guides each agent to avoid unnecessary exploration by considering both individual action distribution and human expertise preference distribution. Then, LIGHT designs individual intrinsic rewards for each agent based on actionable representational transformation relevant to Q-learning so that the agents align their action preferences with the human expertise while maximizing the joint action value. Experimental results demonstrate the superiority of our method over representative baselines regarding performance and better knowledge reusability across different sparse-reward tasks on challenging scenarios.

[246] Market Making Strategies with Reinforcement Learning

Óscar Fernández Vicente

Main category: cs.LG

TL;DR: The paper explores using Reinforcement Learning (RL) for market making, addressing challenges like inventory risk and non-stationary markets. It introduces novel methods like reward engineering, MORL, and POW-dTS, showing superior performance over traditional strategies.

DetailsMotivation: Market makers face challenges like inventory risk and dynamic markets. The study aims to leverage RL, especially DRL, to create adaptive and profitable strategies.

Method: Formulates market making as an RL problem, uses reward engineering and MORL for inventory management, and introduces POW-dTS for non-stationarity.

Result: RL-based approaches outperform traditional strategies in simulations, demonstrating robustness and adaptability.

Conclusion: The research advances methodologies for adaptive market making, highlighting RL’s potential in algorithmic trading.

Abstract: This thesis presents the results of a comprehensive research project focused on applying Reinforcement Learning (RL) to the problem of market making in financial markets. Market makers (MMs) play a fundamental role in providing liquidity, yet face significant challenges arising from inventory risk, competition, and non-stationary market dynamics. This research explores how RL, particularly Deep Reinforcement Learning (DRL), can be employed to develop autonomous, adaptive, and profitable market making strategies. The study begins by formulating the MM task as a reinforcement learning problem, designing agents capable of operating in both single-agent and multi-agent settings within a simulated financial environment. It then addresses the complex issue of inventory management using two complementary approaches: reward engineering and Multi-Objective Reinforcement Learning (MORL). While the former uses dynamic reward shaping to guide behavior, the latter leverages Pareto front optimization to explicitly balance competing objectives. To address the problem of non-stationarity, the research introduces POW-dTS, a novel policy weighting algorithm based on Discounted Thompson Sampling. This method allows agents to dynamically select and combine pretrained policies, enabling continual adaptation to shifting market conditions. The experimental results demonstrate that the proposed RL-based approaches significantly outperform traditional and baseline algorithmic strategies across various performance metrics. Overall, this research thesis contributes new methodologies and insights for the design of robust, efficient, and adaptive market making agents, reinforcing the potential of RL to transform algorithmic trading in complex financial systems.

[247] Concept Probing: Where to Find Human-Defined Concepts (Extended Version)

Manuel de Sousa Ribeiro, Afonso Leote, João Leite

Main category: cs.LG

TL;DR: A method to automatically identify the best neural network layer for concept probing based on representation informativeness and regularity.

DetailsMotivation: Improving concept probing by addressing the challenge of selecting the optimal layer for probing human-defined concepts.

Method: Proposes an automated approach to evaluate layer representations for their informativeness and regularity with respect to the concept.

Result: Validated through empirical analysis across various models and datasets.

Conclusion: The method effectively identifies suitable layers for concept probing, enhancing the reliability of probing results.

Abstract: Concept probing has recently gained popularity as a way for humans to peek into what is encoded within artificial neural networks. In concept probing, additional classifiers are trained to map the internal representations of a model into human-defined concepts of interest. However, the performance of these probes is highly dependent on the internal representations they probe from, making identifying the appropriate layer to probe an essential task. In this paper, we propose a method to automatically identify which layer’s representations in a neural network model should be considered when probing for a given human-defined concept of interest, based on how informative and regular the representations are with respect to the concept. We validate our findings through an exhaustive empirical analysis over different neural network models and datasets.

[248] The Right to be Forgotten in Pruning: Unveil Machine Unlearning on Sparse Models

Yang Xiao, Gen Li, Jie Ji, Ruimeng Ye, Xiaolong Ma, Bo Hui

Main category: cs.LG

TL;DR: The paper introduces “un-pruning” to address the impact of deleted data on sparse models, integrates it with existing unlearning algorithms, and proposes new metrics for evaluation.

DetailsMotivation: To address the gap in machine unlearning for sparse models and ensure the right to be forgotten by eliminating the influence of deleted data on pruned topologies.

Method: Proposes an un-pruning algorithm to approximate pruned topologies based on retained data, integrates it with existing unlearning methods, and provides theoretical error bounds.

Result: Demonstrates efficacy across structured and unstructured sparse models, highlights unreliability of MIA for unlearning assessment, and introduces new evaluation metrics.

Conclusion: The un-pruning algorithm effectively removes deleted data influence in sparse models, is versatile, and supported by theoretical guarantees and experimental validation.

Abstract: Machine unlearning aims to efficiently eliminate the memory about deleted data from trained models and address the right to be forgotten. Despite the success of existing unlearning algorithms, unlearning in sparse models has not yet been well studied. In this paper, we empirically find that the deleted data has an impact on the pruned topology in a sparse model. Motivated by the observation and the right to be forgotten, we define a new terminology ``un-pruning" to eliminate the impact of deleted data on model pruning. Then we propose an un-pruning algorithm to approximate the pruned topology driven by retained data. We remark that any existing unlearning algorithm can be integrated with the proposed un-pruning workflow and the error of un-pruning is upper-bounded in theory. Also, our un-pruning algorithm can be applied to both structured sparse models and unstructured sparse models. In the experiment, we further find that Membership Inference Attack (MIA) accuracy is unreliable for assessing whether a model has forgotten deleted data, as a small change in the amount of deleted data can produce arbitrary MIA results. Accordingly, we devise new performance metrics for sparse models to evaluate the success of un-pruning. Lastly, we conduct extensive experiments to verify the efficacy of un-pruning with various pruning methods and unlearning algorithms. Our code is released at https://anonymous.4open.science/r/UnlearningSparseModels-FBC5/.

[249] Exploitation Over Exploration: Unmasking the Bias in Linear Bandit Recommender Offline Evaluation

Pedro R. Pires, Gregorio F. Azevedo, Pietro L. Campos, Rafael T. Sereicikas, Tiago A. Almeida

Main category: cs.LG

TL;DR: A study finds that in offline evaluations of contextual linear bandits, a greedy linear model (no exploration) often outperforms exploratory variants, highlighting flaws in current evaluation methods.

DetailsMotivation: To assess the reliability of offline evaluation protocols for Multi-Armed Bandit (MAB) algorithms, particularly in capturing exploration behavior.

Method: Extensive offline empirical comparison of several linear MABs, including a greedy linear model, across various datasets.

Result: The greedy model (no exploration) consistently performs best in over 90% of datasets, outperforming exploratory variants. Hyperparameter optimization also favors minimal exploration.

Conclusion: Offline evaluation protocols inadequately assess exploration efficacy, calling for more robust methodologies for interactive learning in recommender systems.

Abstract: Multi-Armed Bandit (MAB) algorithms are widely used in recommender systems that require continuous, incremental learning. A core aspect of MABs is the exploration-exploitation trade-off: choosing between exploiting items likely to be enjoyed and exploring new ones to gather information. In contextual linear bandits, this trade-off is particularly central, as many variants share the same linear regression backbone and differ primarily in their exploration strategies. Despite its prevalent use, offline evaluation of MABs is increasingly recognized for its limitations in reliably assessing exploration behavior. This study conducts an extensive offline empirical comparison of several linear MABs. Strikingly, across over 90% of various datasets, a greedy linear model, with no type of exploration, consistently achieves top-tier performance, often outperforming or matching its exploratory counterparts. This observation is further corroborated by hyperparameter optimization, which consistently favors configurations that minimize exploration, suggesting that pure exploitation is the dominant strategy within these evaluation settings. Our results expose significant inadequacies in offline evaluation protocols for bandits, particularly concerning their capacity to reflect true exploratory efficacy. Consequently, this research underscores the urgent necessity for developing more robust assessment methodologies, guiding future investigations into alternative evaluation frameworks for interactive learning in recommender systems.

[250] CLEAR: Unlearning Spurious Style-Content Associations with Contrastive LEarning with Anti-contrastive Regularization

Minghui Sun, Benjamin A. Goldstein, Matthew M. Engelhard

Main category: cs.LG

TL;DR: CLEAR is a framework to separate task-relevant and task-irrelevant features in data representations, improving performance when superficial characteristics shift at test time.

DetailsMotivation: To ensure equitable and generalizable predictions in healthcare by learning features unaffected by demographics like race and sex.

Method: Uses Contrastive LEarning with Anti-contrastive Regularization (CLEAR) to separate content (task-relevant) and style (task-irrelevant) features, implemented in a VAE.

Result: CLEAR-VAE enables content-style swapping/interpolation and improves classification with unseen content-style combinations.

Conclusion: CLEAR effectively separates essential and superficial features, enhancing model robustness and fairness.

Abstract: Learning representations unaffected by superficial characteristics is important to ensure that shifts in these characteristics at test time do not compromise downstream prediction performance. For instance, in healthcare applications, we might like to learn features that contain information about pathology yet are unaffected by race, sex, and other sources of physiologic variability, thereby ensuring predictions are equitable and generalizable across all demographics. Here we propose Contrastive LEarning with Anti-contrastive Regularization (CLEAR), an intuitive and easy-to-implement framework that effectively separates essential (i.e., task-relevant) characteristics from superficial (i.e., task-irrelevant) characteristics during training, leading to better performance when superficial characteristics shift at test time. We begin by supposing that data representations can be semantically separated into task-relevant content features, which contain information relevant to downstream tasks, and task-irrelevant style features, which encompass superficial attributes that are irrelevant to these tasks, yet may degrade performance due to associations with content present in training data that do not generalize. We then prove that our anti-contrastive penalty, which we call Pair-Switching (PS), minimizes the Mutual Information between the style attributes and content labels. Finally, we instantiate CLEAR in the latent space of a Variational Auto-Encoder (VAE), then perform experiments to quantitatively and qualitatively evaluate the resulting CLEAR-VAE over several image datasets. Our results show that CLEAR-VAE allows us to: (a) swap and interpolate content and style between any pair of samples, and (b) improve downstream classification performance in the presence of previously unseen combinations of content and style. Our code will be made publicly available.

[251] Ralts: Robust Aggregation for Enhancing Graph Neural Network Resilience on Bit-flip Errors

Wencheng Zou, Nan Wu

Main category: cs.LG

TL;DR: The paper explores GNN robustness against hardware-induced bit-flip errors, proposing Ralts, a lightweight solution to enhance resilience by filtering outliers and recovering graph topology.

DetailsMotivation: GNNs are used in safety-critical applications, but hardware faults like bit-flips are underexplored. Addressing this can improve system reliability.

Method: Ralts uses graph similarity metrics to filter outliers and recover compromised topology, integrating these techniques into GNN aggregation functions.

Result: Ralts improves GNN accuracy by 20% for weight/embedding errors and 10% for adjacency matrix errors, with efficient execution.

Conclusion: Ralts effectively enhances GNN robustness against bit-flip errors, offering a scalable and efficient solution for reliable systems.

Abstract: Graph neural networks (GNNs) have been widely applied in safety-critical applications, such as financial and medical networks, in which compromised predictions may cause catastrophic consequences. While existing research on GNN robustness has primarily focused on software-level threats, hardware-induced faults and errors remain largely underexplored. As hardware systems progress toward advanced technology nodes to meet high-performance and energy efficiency demands, they become increasingly susceptible to transient faults, which can cause bit flips and silent data corruption, a prominent issue observed by major technology companies (e.g., Meta and Google). In response, we first present a comprehensive analysis of GNN robustness against bit-flip errors, aiming to reveal system-level optimization opportunities for future reliable and efficient GNN systems. Second, we propose Ralts, a generalizable and lightweight solution to bolster GNN resilience to bit-flip errors. Specifically, Ralts exploits various graph similarity metrics to filter out outliers and recover compromised graph topology, and incorporates these protective techniques directly into aggregation functions to support any message-passing GNNs. Evaluation results demonstrate that Ralts effectively enhances GNN robustness across a range of GNN models, graph datasets, error patterns, and both dense and sparse architectures. On average, under a BER of $3\times10^{-5}$, these robust aggregation functions improve prediction accuracy by at least 20% when errors present in model weights or node embeddings, and by at least 10% when errors occur in adjacency matrices. Ralts is also optimized to deliver execution efficiency comparable to built-in aggregation functions in PyTorch Geometric.

[252] Fishers for Free? Approximating the Fisher Information Matrix by Recycling the Squared Gradient Accumulator

YuXin Li, Felix Dangel, Derek Tam, Colin Raffel

Main category: cs.LG

TL;DR: The paper proposes ‘Squisher,’ a method to approximate the Fisher diagonal using squared gradient accumulators from adaptive optimizers like Adam, avoiding extra computational costs.

DetailsMotivation: To reduce the computational cost of estimating the Fisher diagonal by leveraging existing squared gradient accumulators from training.

Method: Uses the squared gradient accumulator from adaptive optimizers (e.g., Adam) as an approximation of the Fisher diagonal.

Result: Squisher performs similarly to the Fisher diagonal and outperforms baselines in five applications.

Conclusion: Squisher is a computationally efficient and effective approximation of the Fisher diagonal, with clarified differences and empirical impact.

Abstract: The diagonal of a model’s Fisher Information Matrix (the “Fisher diagonal”) has frequently been used as a way to measure parameter sensitivity. Typically, the Fisher diagonal is estimated via squared sampled gradients of the model’s likelihood with respect to its parameters, averaged over a few hundred or thousand examples – a process which incurs nontrivial computational costs. At the same time, adaptive gradient methods like the ubiquitous Adam optimizer compute a moving average of the squared gradient over the course of training. This paper therefore explores whether an approximation of the Fisher diagonal can be obtained “for free” by recycling the squared gradient accumulator that has already been computed over the course of training. Through a comprehensive set of experiments covering five applications of the Fisher diagonal, we demonstrate that the “Squisher” (SQUared gradient accumulator as an approximation of the FISHER) consistently performs similarly to the Fisher diagonal while outperforming baseline methods. Additionally, we clarify the exact differences between the Squisher and the Fisher diagonal and provide empirical quantification of their respective impact.

Marco Bagatella, Mert Albaba, Jonas Hübotter, Georg Martius, Andreas Krause

Main category: cs.LG

TL;DR: The paper introduces GC-TTT, a goal-conditioned test-time training algorithm for offline reinforcement learning, showing significant performance gains by fine-tuning policies on relevant data at minimal compute costs.

DetailsMotivation: The motivation is to improve offline reinforcement learning by leveraging test-time training, inspired by foundation models, to specialize policies for specific goals efficiently.

Method: The method involves a self-supervised data selection criterion to pick relevant transitions from an offline dataset, followed by fine-tuning the policy for a few gradient steps. GC-TTT applies this in a receding-horizon fashion during evaluation.

Result: Results show substantial performance improvements across high-dimensional tasks compared to standard offline pre-training, with minimal compute overhead.

Conclusion: The conclusion highlights that GC-TTT outperforms scaling model size at comparable costs, demonstrating the effectiveness of test-time adaptation in offline reinforcement learning.

Abstract: Foundation models compress a large amount of information in a single, large neural network, which can then be queried for individual tasks. There are strong parallels between this widespread framework and offline goal-conditioned reinforcement learning algorithms: a universal value function is trained on a large number of goals, and the policy is evaluated on a single goal in each test episode. Extensive research in foundation models has shown that performance can be substantially improved through test-time training, specializing the model to the current goal. We find similarly that test-time offline reinforcement learning on experience related to the test goal can lead to substantially better policies at minimal compute costs. We propose a novel self-supervised data selection criterion, which selects transitions from an offline dataset according to their relevance to the current state and quality with respect to the evaluation goal. We demonstrate across a wide range of high-dimensional loco-navigation and manipulation tasks that fine-tuning a policy on the selected data for a few gradient steps leads to significant performance gains over standard offline pre-training. Our goal-conditioned test-time training (GC-TTT) algorithm applies this routine in a receding-horizon fashion during evaluation, adapting the policy to the current trajectory as it is being rolled out. Finally, we study compute allocation at inference, demonstrating that, at comparable costs, GC-TTT induces performance gains that are not achievable by scaling model size.

[254] Even Faster Simulations with Flow Matching: A Study of Zero Degree Calorimeter Responses

Maksymilian Wojnar

Main category: cs.LG

TL;DR: The paper introduces a flow matching (FM) model for fast simulations in high-energy physics, achieving high fidelity and reduced computational costs for ALICE experiment detectors.

DetailsMotivation: To address increasing computational demands in high-energy physics by leveraging generative neural networks for efficient simulations.

Method: Uses flow matching (FM) to develop surrogate models with a low-parameter training strategy for zero degree calorimeters in ALICE.

Result: Achieves state-of-the-art fidelity (Wasserstein distance of 1.27 for ZN, 1.30 for ZP) and faster inference (0.46 ms for ZN, 0.026 ms for latent FM).

Conclusion: The FM model offers a computationally efficient and accurate solution for simulations, outperforming existing methods in speed and fidelity.

Abstract: Recent advances in generative neural networks, particularly flow matching (FM), have enabled the generation of high-fidelity samples while significantly reducing computational costs. A promising application of these models is accelerating simulations in high-energy physics (HEP), helping research institutions meet their increasing computational demands. In this work, we leverage FM to develop surrogate models for fast simulations of zero degree calorimeters in the ALICE experiment. We present an effective training strategy that enables the training of fast generative models with an exceptionally low number of parameters. This approach achieves state-of-the-art simulation fidelity for both neutron (ZN) and proton (ZP) detectors, while offering substantial reductions in computational costs compared to existing methods. Our FM model achieves a Wasserstein distance of 1.27 for the ZN simulation with an inference time of 0.46 ms per sample, compared to the current best of 1.20 with an inference time of approximately 109 ms. The latent FM model further improves the inference speed, reducing the sampling time to 0.026 ms per sample, with a minimal trade-off in accuracy. Similarly, our approach achieves a Wasserstein distance of 1.30 for the ZP simulation, outperforming the current best of 2.08. The source code is available at https://github.com/m-wojnar/faster_zdc.

[255] Scale-Consistent Learning for Partial Differential Equations

Zongyi Li, Samuel Lanthaler, Catherine Deng, Michael Chen, Yixuan Wang, Kamyar Azizzadenesheli, Anima Anandkumar

Main category: cs.LG

TL;DR: A scale-informed neural operator is proposed to generalize ML models for solving PDEs across different scales, leveraging scale-consistency properties and data augmentation.

DetailsMotivation: Overcome limitations of ML models that fail to generalize outside training data, such as fixed Reynolds numbers or domains, by utilizing PDE rescaling properties.

Method: Introduces a scale-consistency loss and a scale-informed neural operator, leveraging PDE rescaling and sub-domain consistency to train models adaptable to multiple scales.

Result: The model generalizes to a wide range of Reynolds numbers (250-10000) and reduces error by 34% on average across tested PDEs.

Conclusion: Scale-consistency loss and neural operators enable robust generalization in ML-based PDE solvers, outperforming traditional methods.

Abstract: Machine learning (ML) models have emerged as a promising approach for solving partial differential equations (PDEs) in science and engineering. Previous ML models typically cannot generalize outside the training data; for example, a trained ML model for the Navier-Stokes equations only works for a fixed Reynolds number ($Re$) on a pre-defined domain. To overcome these limitations, we propose a data augmentation scheme based on scale-consistency properties of PDEs and design a scale-informed neural operator that can model a wide range of scales. Our formulation leverages the facts: (i) PDEs can be rescaled, or more concretely, a given domain can be re-scaled to unit size, and the parameters and the boundary conditions of the PDE can be appropriately adjusted to represent the original solution, and (ii) the solution operators on a given domain are consistent on the sub-domains. We leverage these facts to create a scale-consistency loss that encourages matching the solutions evaluated on a given domain and the solution obtained on its sub-domain from the rescaled PDE. Since neural operators can fit to multiple scales and resolutions, they are the natural choice for incorporating scale-consistency loss during training of neural PDE solvers. We experiment with scale-consistency loss and the scale-informed neural operator model on the Burgers’ equation, Darcy Flow, Helmholtz equation, and Navier-Stokes equations. With scale-consistency, the model trained on $Re$ of 1000 can generalize to $Re$ ranging from 250 to 10000, and reduces the error by 34% on average of all datasets compared to baselines.

[256] Weak-to-Strong Generalization with Failure Trajectories: A Tree-based Approach to Elicit Optimal Policy in Strong Models

Ruimeng Ye, Zihan Wang, Xiao Yang, Zinan Ling, Manling Li, Bo Hui

Main category: cs.LG

TL;DR: The paper extends Weak-to-Strong Generalization (W2SG) to complex interactive tasks by fine-tuning strong models with weak model-generated trajectories, including failures, and using trajectory trees with MCTS for optimization.

DetailsMotivation: To enhance strong models' capabilities by learning from weak models' successes and failures, inspired by human learning.

Method: Fine-tunes strong models with weak model trajectories, constructs trajectory trees, and applies MCTS for optimization.

Result: Empirical evaluations show improved reasoning and decision-making across tasks, with theoretical guarantees.

Conclusion: The framework is scalable and robust, validated by diverse task performance.

Abstract: Weak-to-Strong generalization (W2SG) is a new trend to elicit the full capabilities of a strong model with supervision from a weak model. While existing W2SG studies focus on simple tasks like binary classification, we extend this paradigm to complex interactive decision-making environments. Specifically, we fine-tune a strong model with trajectories of intermediate actions generated by a weak model. Motivated by the human learning process, we propose to generalize not only success knowledge but also failure experience so that the strong model can learn from failed trajectories accumulated by weak models. To effectively and efficiently elicit the potential of strong agents, we further construct ``trajectory trees," a hierarchical representation that organizes weak model-generated action trajectories, coupled with Monte Carlo Tree Search (MCTS) to optimize the strong model. Through theoretical analysis, we provide formal guarantees for the effectiveness of our method in improving W2SG performance. Our empirical evaluations demonstrate substantial improvements in reasoning and decision-making capabilities across diverse task domains, validating the scalability and robustness of our proposed framework. Our code is available at: https://github.com/yeruimeng/TraTree

[257] Early Mortality Prediction in ICU Patients with Hypertensive Kidney Disease Using Interpretable Machine Learning

Yong Si, Junyi Fan, Li Sun, Shuheng Chen, Minoo Ahmadi, Elham Pishgar, Kamiar Alaei, Greg Placencia, Maryam Pishgar

Main category: cs.LG

TL;DR: A machine learning model (CatBoost) was developed to predict 30-day mortality in ICU patients with hypertensive kidney disease, achieving high accuracy (AUROC 0.88) and interpretability.

DetailsMotivation: Hypertensive kidney disease patients in ICUs lack tailored risk prediction tools, necessitating early identification of high-risk individuals for better clinical decisions.

Method: A machine learning framework was developed using MIMIC-IV v2.2 data (1,366 adults), selecting 18 clinical features. CatBoost outperformed other models, validated via five-fold cross-validation.

Result: CatBoost achieved AUROC 0.88, sensitivity 0.811, and specificity 0.798. SHAP and ALE plots highlighted key predictors like altered consciousness and vasopressor use.

Conclusion: The interpretable model supports individualized triage with uncertainty quantification, showing promise for clinical deployment and requiring external validation.

Abstract: Background: Hypertensive kidney disease (HKD) patients in intensive care units (ICUs) face high short-term mortality, but tailored risk prediction tools are lacking. Early identification of high-risk individuals is crucial for clinical decision-making. Methods: We developed a machine learning framework to predict 30-day in-hospital mortality among ICU patients with HKD using early clinical data from the MIMIC-IV v2.2 database. A cohort of 1,366 adults was curated with strict criteria, excluding malignancy cases. Eighteen clinical features-including vital signs, labs, comorbidities, and therapies-were selected via random forest importance and mutual information filtering. Several models were trained and compared with stratified five-fold cross-validation; CatBoost demonstrated the best performance. Results: CatBoost achieved an AUROC of 0.88 on the independent test set, with sensitivity of 0.811 and specificity of 0.798. SHAP values and Accumulated Local Effects (ALE) plots showed the model relied on meaningful predictors such as altered consciousness, vasopressor use, and coagulation status. Additionally, the DREAM algorithm was integrated to estimate patient-specific posterior risk distributions, allowing clinicians to assess both predicted mortality and its uncertainty. Conclusions: We present an interpretable machine learning pipeline for early, real-time risk assessment in ICU patients with HKD. By combining high predictive performance with uncertainty quantification, our model supports individualized triage and transparent clinical decisions. This approach shows promise for clinical deployment and merits external validation in broader critical care populations.

[258] Geometric Multi-color Message Passing Graph Neural Networks for Blood-brain Barrier Permeability Prediction

Trung Nguyen, Md Masud Rana, Farjana Tasnim Mukta, Chang-Guo Zhan, Duc Duy Nguyen

Main category: cs.LG

TL;DR: GMC-MPNN, a geometric multi-color message-passing GNN, improves BBBP prediction by incorporating atomic-level geometry and long-range interactions, outperforming existing models.

DetailsMotivation: Accurate BBBP prediction is crucial for CNS drug development, but current GNNs often ignore 3D geometric information vital for transport mechanisms.

Method: GMC-MPNN enhances message-passing architectures with atomic-level geometric features and weighted colored subgraphs based on atom types.

Result: GMC-MPNN achieves superior performance (AUC-ROC up to 0.9704, RMSE 0.4609) in classification and regression tasks, validated on benchmark datasets.

Conclusion: GMC-MPNN sets a new benchmark by integrating spatial geometry, offering a more accurate and generalizable tool for drug discovery.

Abstract: Accurate prediction of blood-brain barrier permeability (BBBP) is essential for central nervous system (CNS) drug development. While graph neural networks (GNNs) have advanced molecular property prediction, they often rely on molecular topology and neglect the three-dimensional geometric information crucial for modeling transport mechanisms. This paper introduces the geometric multi-color message-passing graph neural network (GMC-MPNN), a novel framework that enhances standard message-passing architectures by explicitly incorporating atomic-level geometric features and long-range interactions. Our model constructs weighted colored subgraphs based on atom types to capture the spatial relationships and chemical context that govern BBB permeability. We evaluated GMC-MPNN on three benchmark datasets for both classification and regression tasks, using rigorous scaffold-based splitting to ensure a robust assessment of generalization. The results demonstrate that GMC-MPNN consistently outperforms existing state-of-the-art models, achieving superior performance in both classifying compounds as permeable/non-permeable (AUC-ROC of 0.9704 and 0.9685) and in regressing continuous permeability values (RMSE of 0.4609, Pearson correlation of 0.7759). An ablation study further quantified the impact of specific atom-pair interactions, revealing that the model’s predictive power derives from its ability to learn from both common and rare, but chemically significant, functional motifs. By integrating spatial geometry into the graph representation, GMC-MPNN sets a new performance benchmark and offers a more accurate and generalizable tool for drug discovery pipelines.

[259] Secure Best Arm Identification in the Presence of a Copycat

Asaf Cohen, Onur Günlü

Main category: cs.LG

TL;DR: The paper addresses best arm identification in stochastic linear bandits with a security constraint, proposing a secure algorithm using coded arms to achieve a near-optimal error exponent while minimizing information leakage.

DetailsMotivation: The problem arises from the need to identify the best arm in a bandit setup while preventing an observer (copycat Chloe) from learning the best arm, balancing performance and security.

Method: The proposed algorithm uses coded arms without cryptographic primitives, ensuring security by obfuscating the best arm’s identity while maintaining efficient exploration.

Result: The algorithm achieves an error exponent of Ω(T/log²(d)), outperforming naive secure methods and nearly matching the optimal non-secure performance.

Conclusion: The secure algorithm effectively balances exploration and security, offering a practical solution for scenarios requiring privacy in bandit problems.

Abstract: Consider the problem of best arm identification with a security constraint. Specifically, assume a setup of stochastic linear bandits with $K$ arms of dimension $d$. In each arm pull, the player receives a reward that is the sum of the dot product of the arm with an unknown parameter vector and independent noise. The player’s goal is to identify the best arm after $T$ arm pulls. Moreover, assume a copycat Chloe is observing the arm pulls. The player wishes to keep Chloe ignorant of the best arm. While a minimax–optimal algorithm identifies the best arm with an $\Omega\left(\frac{T}{\log(d)}\right)$ error exponent, it easily reveals its best-arm estimate to an outside observer, as the best arms are played more frequently. A naive secure algorithm that plays all arms equally results in an $\Omega\left(\frac{T}{d}\right)$ exponent. In this paper, we propose a secure algorithm that plays with \emph{coded arms}. The algorithm does not require any key or cryptographic primitives, yet achieves an $\Omega\left(\frac{T}{\log^2(d)}\right)$ exponent while revealing almost no information on the best arm.

[260] KASPER: Kolmogorov Arnold Networks for Stock Prediction and Explainable Regimes

Vidhi Oad, Param Pathak, Nouhaila Innan, Shalini D, Muhammad Shafique

Main category: cs.LG

TL;DR: KASPER is a novel framework for stock prediction using regime detection, sparse spline modeling, and symbolic rule extraction, outperforming traditional methods with high accuracy and interpretability.

DetailsMotivation: Financial markets' nonlinear and regime-dependent dynamics make forecasting challenging. Traditional deep learning models lack adaptability and interpretability across shifting market conditions.

Method: KASPER integrates regime detection (Gumbel-Softmax), Kolmogorov-Arnold networks with sparse spline activations, and symbolic rule extraction (Monte Carlo Shapley values) for regime-specific forecasting.

Result: Achieves R² of 0.89, Sharpe Ratio of 12.02, and MSE of 0.0001 on Yahoo Finance data, outperforming existing methods.

Conclusion: KASPER advances regime-aware, transparent, and robust financial forecasting.

Abstract: Forecasting in financial markets remains a significant challenge due to their nonlinear and regime-dependent dynamics. Traditional deep learning models, such as long short-term memory networks and multilayer perceptrons, often struggle to generalize across shifting market conditions, highlighting the need for a more adaptive and interpretable approach. To address this, we introduce Kolmogorov-Arnold networks for stock prediction and explainable regimes (KASPER), a novel framework that integrates regime detection, sparse spline-based function modeling, and symbolic rule extraction. The framework identifies hidden market conditions using a Gumbel-Softmax-based mechanism, enabling regime-specific forecasting. For each regime, it employs Kolmogorov-Arnold networks with sparse spline activations to capture intricate price behaviors while maintaining robustness. Interpretability is achieved through symbolic learning based on Monte Carlo Shapley values, which extracts human-readable rules tailored to each regime. Applied to real-world financial time series from Yahoo Finance, the model achieves an $R^2$ score of 0.89, a Sharpe Ratio of 12.02, and a mean squared error as low as 0.0001, outperforming existing methods. This research establishes a new direction for regime-aware, transparent, and robust forecasting in financial markets.

[261] Differentiated Thyroid Cancer Recurrence Classification Using Machine Learning Models and Bayesian Neural Networks with Varying Priors: A SHAP-Based Interpretation of the Best Performing Model

HMNS Kumari, HMLS Kumari, UMMPK Nawarathne

Main category: cs.LG

TL;DR: The paper introduces a framework for classifying DTC recurrence using ML and Bayesian Neural Networks (BNN), achieving high accuracy and addressing uncertainty quantification.

DetailsMotivation: DTC recurrence is a public health concern, requiring accurate, interpretable, and uncertainty-aware predictive models.

Method: 11 ML models were tested, followed by feature selection using the Boruta algorithm. BNN with six prior distributions was then applied.

Result: SVM (0.9481) and LR (0.9611) performed best among ML models. BNN with Normal 0,10 prior achieved the highest accuracies (0.9740 and 0.9870).

Conclusion: BNN with uncertainty quantification outperforms traditional ML models, offering a robust solution for DTC recurrence classification.

Abstract: Differentiated thyroid cancer DTC recurrence is a major public health concern, requiring classification and predictive models that are not only accurate but also interpretable and uncertainty aware. This study introduces a comprehensive framework for DTC recurrence classification using a dataset containing 383 patients and 16 clinical and pathological variables. Initially, 11 machine learning ML models were employed using the complete dataset, where the Support Vector Machines SVM model achieved the highest accuracy of 0.9481. To reduce complexity and redundancy, feature selection was carried out using the Boruta algorithm, and the same ML models were applied to the reduced dataset, where it was observed that the Logistic Regression LR model obtained the maximum accuracy of 0.9611. However, these ML models often lack uncertainty quantification, which is critical in clinical decision making. Therefore, to address this limitation, the Bayesian Neural Networks BNN with six varying prior distributions, including Normal 0,1, Normal 0,10, Laplace 0,1, Cauchy 0,1, Cauchy 0,2.5, and Horseshoe 1, were implemented on both the complete and reduced datasets. The BNN model with Normal 0,10 prior distribution exhibited maximum accuracies of 0.9740 and 0.9870 before and after feature selection, respectively.

[262] GENIAL: Generative Design Space Exploration via Network Inversion for Low Power Algorithmic Logic Units

Maxence Bouvier, Ryan Amaudruz, Felix Arnold, Renzo Andri, Lukas Cavigelli

Main category: cs.LG

TL;DR: GENIAL is a machine learning-based framework for optimizing arithmetic units, using a Transformer model to automate design exploration and achieve significant power savings.

DetailsMotivation: Optimizing arithmetic units is critical for AI workloads, but conventional methods are limited in exploring the design space thoroughly.

Method: GENIAL uses a two-stage Transformer-based surrogate model (self-supervised pretraining + supervised finetuning) to predict hardware metrics and invert the model for optimization.

Result: GENIAL achieves up to 18% switching activity savings in multipliers and shows versatility in optimizing other logic functions like Finite State Machines.

Conclusion: GENIAL represents a significant advancement in automated, quality-of-results-optimized circuit generation for digital systems.

Abstract: As AI workloads proliferate, optimizing arithmetic units is becoming increasingly important to reduce the footprint of digital systems. Conventional design flows, which often rely on manual or heuristics-based optimization, are limited in their ability to thoroughly explore the vast design space. In this paper, we introduce GENIAL, a machine learning-based framework for the automatic generation and optimization of arithmetic units, more specifically multipliers. At the core of GENIAL is a Transformer-based surrogate model trained in two stages, involving self-supervised pretraining followed by supervised finetuning, to robustly forecast key hardware metrics such as power and area from abstracted design representations. By inverting the surrogate model, GENIAL efficiently searches for new operand encodings that directly minimize power consumption in arithmetic units for specific input data distributions. Extensive experiments on large datasets demonstrate that GENIAL is consistently more sample efficient than other methods, and converges faster towards optimized designs. This enables to deploy a high-effort logic synthesis optimization flow in the loop, improving the accuracy of the surrogate model. Notably, GENIAL automatically discovers encodings that achieve up to 18% switching activity savings within multipliers on representative AI workloads compared with the conventional two’s complement. We also demonstrate the versatility of our approach by achieving significant improvements on Finite State Machines, highlighting GENIAL’s applicability for a wide spectrum of logic functions. Together, these advances mark a significant step toward automated Quality-of-Results-optimized combinational circuit generation for digital systems.

[263] Reinforcement Learning via Conservative Agent for Environments with Random Delays

Jongsoo Lee, Jangwon Kim, Jiseok Jeong, Soohee Han

Main category: cs.LG

TL;DR: A robust agent (conservative agent) is proposed to handle random delays in reinforcement learning by transforming them into constant delays, enabling existing methods to work without modification.

DetailsMotivation: Real-world reinforcement learning faces challenges due to random delays, which violate the Markov assumption and lack exploration compared to constant delays.

Method: The conservative agent reformulates random-delay environments into constant-delay equivalents, allowing direct application of state-of-the-art constant-delay methods.

Result: The agent outperforms baselines in continuous control tasks, showing better asymptotic performance and sample efficiency.

Conclusion: The conservative agent effectively addresses random delays, extending the applicability of existing methods without performance loss.

Abstract: Real-world reinforcement learning applications are often hindered by delayed feedback from environments, which violates the Markov assumption and introduces significant challenges. Although numerous delay-compensating methods have been proposed for environments with constant delays, environments with random delays remain largely unexplored due to their inherent variability and unpredictability. In this study, we propose a simple yet robust agent for decision-making under random delays, termed the conservative agent, which reformulates the random-delay environment into its constant-delay equivalent. This transformation enables any state-of-the-art constant-delay method to be directly extended to the random-delay environments without modifying the algorithmic structure or sacrificing performance. We evaluate the conservative agent-based algorithm on continuous control tasks, and empirical results demonstrate that it significantly outperforms existing baseline algorithms in terms of asymptotic performance and sample efficiency.

[264] Adapting to Fragmented and Evolving Data: A Fisher Information Perspective

Behraj Khan, Tahir Qasim Syed, Nouman Muhammad Durrani

Main category: cs.LG

TL;DR: FADE is a lightweight framework for robust learning under sequential covariate shift (SCS), using Fisher-based adaptation and shift-aware regularization. It outperforms existing methods like TENT and DIW, achieving up to 19% higher accuracy.

DetailsMotivation: Modern ML systems face SCS, where input distributions evolve over time while conditional distributions remain stable. Existing methods often require task boundaries or target supervision, which FADE avoids.

Method: FADE employs Fisher information geometry for shift-aware regularization and a Cramer-Rao-informed shift signal to detect distribution changes. It operates online with fixed memory and no target labels.

Result: FADE achieves up to 19% higher accuracy under severe shifts compared to methods like TENT and DIW. It also generalizes to federated learning.

Conclusion: FADE is a scalable, robust solution for SCS, with theoretical guarantees and empirical success across diverse benchmarks and federated settings.

Abstract: Modern machine learning systems operating in dynamic environments often face \textit{sequential covariate shift} (SCS), where input distributions evolve over time while the conditional distribution remains stable. We introduce FADE (Fisher-based Adaptation to Dynamic Environments), a lightweight and theoretically grounded framework for robust learning under SCS. FADE employs a shift-aware regularization mechanism anchored in Fisher information geometry, guiding adaptation by modulating parameter updates based on sensitivity and stability. To detect significant distribution changes, we propose a Cramer-Rao-informed shift signal that integrates KL divergence with temporal Fisher dynamics. Unlike prior methods requiring task boundaries, target supervision, or experience replay, FADE operates online with fixed memory and no access to target labels. Evaluated on seven benchmarks spanning vision, language, and tabular data, FADE achieves up to 19% higher accuracy under severe shifts, outperforming methods such as TENT and DIW. FADE also generalizes naturally to federated learning by treating heterogeneous clients as temporally fragmented environments, enabling scalable and stable adaptation in decentralized settings. Theoretical analysis guarantees bounded regret and parameter consistency, while empirical results demonstrate FADE’s robustness across modalities and shift intensities.

[265] A diffusion-based generative model for financial time series via geometric Brownian motion

Gihun Kim, Sun-Yong Choi, Yeoneung Kim

Main category: cs.LG

TL;DR: A diffusion-based generative framework for financial time series incorporates GBM into the forward noising process, improving realism in modeling financial data.

DetailsMotivation: Standard score-based models treat financial data as generic sequences, ignoring heteroskedasticity. This work aims to better reflect financial market dynamics.

Method: The framework injects noise proportionally to asset prices, balancing drift and diffusion terms. It uses a Transformer-based architecture for training via denoising score matching.

Result: The model reproduces stylized facts like heavy-tailed returns, volatility clustering, and leverage effect more realistically than conventional methods.

Conclusion: The proposed framework effectively captures financial time series characteristics, outperforming standard diffusion models.

Abstract: We propose a novel diffusion-based generative framework for financial time series that incorporates geometric Brownian motion (GBM), the foundation of the Black–Scholes theory, into the forward noising process. Unlike standard score-based models that treat price trajectories as generic numerical sequences, our method injects noise proportionally to asset prices at each time step, reflecting the heteroskedasticity observed in financial time series. By accurately balancing the drift and diffusion terms, we show that the resulting log-price process reduces to a variance-exploding stochastic differential equation, aligning with the formulation in score-based generative models. The reverse-time generative process is trained via denoising score matching using a Transformer-based architecture adapted from the Conditional Score-based Diffusion Imputation (CSDI) framework. Empirical evaluations on historical stock data demonstrate that our model reproduces key stylized facts heavy-tailed return distributions, volatility clustering, and the leverage effect more realistically than conventional diffusion models.

[266] MindSpeed RL: Distributed Dataflow for Scalable and Efficient RL Training on Ascend NPU Cluster

Laingjun Feng, Chenyi Pan, Xinjie Guo, Fei Mei, Benzhe Ning, Jianxiang Zhang, Xinyang Liu, Beirong Zhou, Zeng Shu, Chang Liu, Guang Yang, Zhenyu Han, Jiangben Wang, Bo Wang

Main category: cs.LG

TL;DR: MindSpeed RL is a scalable and efficient system for large-scale reinforcement learning (RL) training, addressing poor cluster scalability and low memory utilization by optimizing data dependencies and integrating parallelization strategies.

DetailsMotivation: Existing RL training systems suffer from poor scalability and memory inefficiency due to heavy cross-node dependencies.

Method: MindSpeed RL introduces a distributed transfer dock strategy for sample flow and an allgather-swap strategy for resharding flow, along with parallelization and acceleration techniques.

Result: Experiments show MindSpeed RL achieves 1.42~3.97x higher throughput compared to state-of-the-art systems.

Conclusion: MindSpeed RL is a powerful and reliable system for large-scale RL training, demonstrated on Ascend NPUs.

Abstract: Reinforcement learning (RL) is a paradigm increasingly used to align large language models. Popular RL algorithms utilize multiple workers and can be modeled as a graph, where each node is the status of a worker and each edge represents dataflow between nodes. Owing to the heavy cross-node dependencies, the RL training system usually suffers from poor cluster scalability and low memory utilization. In this article, we introduce MindSpeed RL, an effective and efficient system for large-scale RL training. Unlike existing centralized methods, MindSpeed RL organizes the essential data dependencies in RL training, i.e., sample flow and resharding flow, from a distributed view. On the one hand, a distributed transfer dock strategy, which sets controllers and warehouses on the basis of the conventional replay buffer, is designed to release the dispatch overhead in the sample flow. A practical allgather–swap strategy is presented to eliminate redundant memory usage in resharding flow. In addition, MindSpeed RL further integrates numerous parallelization strategies and acceleration techniques for systematic optimization. Compared with existing state-of-the-art systems, comprehensive experiments on the RL training of popular Qwen2.5-Dense-7B/32B, Qwen3-MoE-30B, and DeepSeek-R1-MoE-671B show that MindSpeed RL increases the throughput by 1.42 ~ 3.97 times. Finally, we open–source MindSpeed RL and perform all the experiments on a super pod of Ascend with 384 neural processing units (NPUs) to demonstrate the powerful performance and reliability of Ascend.

[267] ProGMLP: A Progressive Framework for GNN-to-MLP Knowledge Distillation with Efficient Trade-offs

Weigang Lu, Ziyu Guan, Wei Zhao, Yaming Yang, Yujie Sun, Zheng Liang, Yibing Zhan, Dapeng Tao

Main category: cs.LG

TL;DR: ProGMLP introduces a progressive framework for GNN-to-MLP knowledge distillation, enabling dynamic trade-offs between inference cost and accuracy.

DetailsMotivation: Existing G2M methods lack flexibility in adjusting inference cost and accuracy dynamically, which is crucial for real-world applications with varying resource constraints.

Method: ProGMLP uses Progressive Training Structure (PTS), Progressive Knowledge Distillation (PKD), and Progressive Mixup Augmentation (PMA) to train MLPs iteratively and enhance generalization.

Result: Validated on eight real-world graph datasets, ProGMLP maintains high accuracy while adapting to varying runtime scenarios.

Conclusion: ProGMLP is effective for diverse applications, offering flexible and on-demand trade-offs between cost and accuracy.

Abstract: GNN-to-MLP (G2M) methods have emerged as a promising approach to accelerate Graph Neural Networks (GNNs) by distilling their knowledge into simpler Multi-Layer Perceptrons (MLPs). These methods bridge the gap between the expressive power of GNNs and the computational efficiency of MLPs, making them well-suited for resource-constrained environments. However, existing G2M methods are limited by their inability to flexibly adjust inference cost and accuracy dynamically, a critical requirement for real-world applications where computational resources and time constraints can vary significantly. To address this, we introduce a Progressive framework designed to offer flexible and on-demand trade-offs between inference cost and accuracy for GNN-to-MLP knowledge distillation (ProGMLP). ProGMLP employs a Progressive Training Structure (PTS), where multiple MLP students are trained in sequence, each building on the previous one. Furthermore, ProGMLP incorporates Progressive Knowledge Distillation (PKD) to iteratively refine the distillation process from GNNs to MLPs, and Progressive Mixup Augmentation (PMA) to enhance generalization by progressively generating harder mixed samples. Our approach is validated through comprehensive experiments on eight real-world graph datasets, demonstrating that ProGMLP maintains high accuracy while dynamically adapting to varying runtime scenarios, making it highly effective for deployment in diverse application settings.

[268] Neural Ordinary Differential Equations for Learning and Extrapolating System Dynamics Across Bifurcations

Eva van Tegelen, George van Voorn, Ioannis Athanasiadis, Peter van Heijster

Main category: cs.LG

TL;DR: Neural ODEs effectively learn and forecast bifurcations in dynamical systems from timeseries data, even with limited or noisy data.

DetailsMotivation: Address limitations of existing methods by using Neural ODEs for continuous, data-driven learning of system dynamics, including local and global bifurcations.

Method: Apply Neural ODEs to a predator-prey system with bifurcations, learning parameter-dependent vector fields from timeseries data.

Result: Neural ODEs recover bifurcation structures and forecast beyond training parameter regions; performance depends on data quality, not quantity.

Conclusion: Neural ODEs are a robust tool for learning and predicting bifurcations in dynamical systems, even with imperfect data.

Abstract: Forecasting system behaviour near and across bifurcations is crucial for identifying potential shifts in dynamical systems. While machine learning has recently been used to learn critical transitions and bifurcation structures from data, most studies remain limited as they exclusively focus on discrete-time methods and local bifurcations. To address these limitations, we use Neural Ordinary Differential Equations which provide a continuous, data-driven framework for learning system dynamics. We apply our approach to a predator-prey system that features both local and global bifurcations, presenting a challenging test case. Our results show that Neural Ordinary Differential Equations can recover underlying bifurcation structures directly from timeseries data by learning parameter-dependent vector fields. Notably, we demonstrate that Neural Ordinary Differential Equations can forecast bifurcations even beyond the parameter regions represented in the training data. We also assess the method’s performance under limited and noisy data conditions, finding that model accuracy depends more on the quality of information that can be inferred from the training data, than on the amount of data available.

[269] Dynamics-Informed Reservoir Computing with Visibility Graphs

Charlotte Geier, Merten Stender

Main category: cs.LG

TL;DR: The paper introduces DyRC-VG, a dynamics-informed reservoir computing method that uses visibility graphs to structure reservoirs from training data, improving prediction accuracy and consistency.

DetailsMotivation: Traditional reservoir computing relies on random reservoir networks, leading to suboptimal performance. The paper aims to optimize reservoir structure by leveraging input data dynamics.

Method: The DyRC-VG framework uses visibility graphs to convert time series into networks, constructing reservoirs directly from training data without hyperparameter tuning.

Result: DyRC-VG outperforms Erdős-Rényi graphs in prediction accuracy and consistency for nonlinear time series tasks, such as the Duffing oscillator.

Conclusion: DyRC-VG offers a systematic, data-driven approach to reservoir computing, enhancing performance by aligning reservoir structure with input dynamics.

Abstract: Accurate prediction of complex and nonlinear time series remains a challenging problem across engineering and scientific disciplines. Reservoir computing (RC) offers a computationally efficient alternative to traditional deep learning by training only the read-out layer while employing a randomly structured and fixed reservoir network. Despite its advantages, the largely random reservoir graph architecture often results in suboptimal and oversized networks with poorly understood dynamics. Addressing this issue, we propose a novel Dynamics-Informed Reservoir Computing (DyRC) framework that systematically infers the reservoir network structure directly from the input training sequence. This work proposes to employ the visibility graph (VG) technique, which converts time series data into networks by representing measurement points as nodes linked by mutual visibility. The reservoir network is constructed by directly adopting the VG network from a training data sequence, leveraging the parameter-free visibility graph approach to avoid expensive hyperparameter tuning. This process results in a reservoir that is directly informed by the specific dynamics of the prediction task under study. We assess the DyRC-VG method through prediction tasks involving the canonical nonlinear Duffing oscillator, evaluating prediction accuracy and consistency. Compared to an Erd\H{o}s-R'enyi graph of the same size, spectral radius, and comparable density, we observe higher prediction quality and more consistent performance over repeated implementations in the DyRC-VG.

[270] Exploring molecular assembly as a biosignature using mass spectrometry and machine learning

Lindsay A. Rutter, Abhishek Sharma, Ian Seet, David Obeh Alobo, An Goto, Leroy Cronin

Main category: cs.LG

TL;DR: Molecular assembly is proposed as an agnostic biosignature for life detection, measurable via mass spectrometry without structural elucidation. A machine learning model improves prediction accuracy, though standardization is crucial.

DetailsMotivation: To detect extraterrestrial life without bias from terrestrial assumptions, using mass spectrometry data for unbiased biosignature identification.

Method: Molecular assembly is measured via mass spectrometry, and a machine learning model predicts it with high accuracy, addressing mission constraints.

Result: The machine learning model reduces prediction error by three-fold, but instrumental inconsistencies can double error, highlighting the need for standardization.

Conclusion: Standardized mass spectrometry databases could enable accurate molecular assembly prediction, supporting future astrobiology missions.

Abstract: Molecular assembly offers a promising path to detect life beyond Earth, while minimizing assumptions based on terrestrial life. As mass spectrometers will be central to upcoming Solar System missions, predicting molecular assembly from their data without needing to elucidate unknown structures will be essential for unbiased life detection. An ideal agnostic biosignature must be interpretable and experimentally measurable. Here, we show that molecular assembly, a recently developed approach to measure objects that have been produced by evolution, satisfies both criteria. First, it is interpretable for life detection, as it reflects the assembly of molecules with their bonds as building blocks, in contrast to approaches that discount construction history. Second, it can be determined without structural elucidation, as it can be physically measured by mass spectrometry, a property that distinguishes it from other approaches that use structure-based information measures for molecular complexity. Whilst molecular assembly is directly measurable using mass spectrometry data, there are limits imposed by mission constraints. To address this, we developed a machine learning model that predicts molecular assembly with high accuracy, reducing error by three-fold compared to baseline models. Simulated data shows that even small instrumental inconsistencies can double model error, emphasizing the need for standardization. These results suggest that standardized mass spectrometry databases could enable accurate molecular assembly prediction, without structural elucidation, providing a proof-of-concept for future astrobiology missions.

[271] Clustering-Oriented Generative Attribute Graph Imputation

Mulin Chen, Bocheng Wang, Jiaxin Zhong, Zongcheng Miao, Xuelong Li

Main category: cs.LG

TL;DR: CGIR introduces a unified framework for attribute-missing graph clustering by combining generative imputation and refinement, outperforming existing methods.

DetailsMotivation: Existing imputation approaches lack class-relevant semantic information, and refinement strategies neglect uncorrelated attributes, leading to sub-optimal clustering.

Method: CGIR estimates subcluster distributions for precise class-specific imputation and uses an edge attention network to refine embeddings by identifying relevant attributes.

Result: CGIR outperforms state-of-the-art methods in attribute-missing graph clustering.

Conclusion: CGIR effectively addresses imputation and refinement challenges, offering a robust solution for attribute-missing graph clustering.

Abstract: Attribute-missing graph clustering has emerged as a significant unsupervised task, where only attribute vectors of partial nodes are available and the graph structure is intact. The related models generally follow the two-step paradigm of imputation and refinement. However, most imputation approaches fail to capture class-relevant semantic information, leading to sub-optimal imputation for clustering. Moreover, existing refinement strategies optimize the learned embedding through graph reconstruction, while neglecting the fact that some attributes are uncorrelated with the graph. To remedy the problems, we establish the Clustering-oriented Generative Imputation with reliable Refinement (CGIR) model. Concretely, the subcluster distributions are estimated to reveal the class-specific characteristics precisely, and constrain the sampling space of the generative adversarial module, such that the imputation nodes are impelled to align with the correct clusters. Afterwards, multiple subclusters are merged to guide the proposed edge attention network, which identifies the edge-wise attributes for each class, so as to avoid the redundant attributes in graph reconstruction from disturbing the refinement of overall embedding. To sum up, CGIR splits attribute-missing graph clustering into the search and mergence of subclusters, which guides to implement node imputation and refinement within a unified framework. Extensive experiments prove the advantages of CGIR over state-of-the-art competitors.

[272] GCL-GCN: Graphormer and Contrastive Learning Enhanced Attributed Graph Clustering Network

Binxiong Li, Xu Xiang, Xue Li, Binyu Zhao, Yujie Liu, Huijie Tang, Benhan Yang, Zhixuan Chen

Main category: cs.LG

TL;DR: GCL-GCN is a deep graph clustering model addressing challenges in attributed graph clustering by combining centrality encoding, spatial relationships, and contrastive learning to improve node representations and clustering quality.

DetailsMotivation: The complexity and heterogeneity of graph data make clustering challenging, requiring better methods to capture local dependencies and complex structures.

Method: Proposes GCL-GCN with a Graphormer module for global/local node information and a contrastive learning module for enhanced feature distinction.

Result: Outperforms 14 advanced methods, improving ACC, NMI, and ARI by significant margins on datasets like Cora.

Conclusion: GCL-GCN effectively enhances clustering quality and robustness in attributed graph analysis.

Abstract: Attributed graph clustering holds significant importance in modern data analysis. However, due to the complexity of graph data and the heterogeneity of node attributes, leveraging graph information for clustering remains challenging. To address this, we propose a novel deep graph clustering model, GCL-GCN, specifically designed to address the limitations of existing models in capturing local dependencies and complex structures when dealing with sparse and heterogeneous graph data. GCL-GCN introduces an innovative Graphormer module that combines centrality encoding and spatial relationships, effectively capturing both global and local information between nodes, thereby enhancing the quality of node representations. Additionally, we propose a novel contrastive learning module that significantly enhances the discriminative power of feature representations. In the pre-training phase, this module increases feature distinction through contrastive learning on the original feature matrix, ensuring more identifiable initial representations for subsequent graph convolution and clustering tasks. Extensive experimental results on six datasets demonstrate that GCL-GCN outperforms 14 advanced methods in terms of clustering quality and robustness. Specifically, on the Cora dataset, it improves ACC, NMI, and ARI by 4.94%, 13.01%, and 10.97%, respectively, compared to the primary comparison method MBN.

[273] Graph Structure Learning with Privacy Guarantees for Open Graph Data

Muhao Guo, Jiaqi Wu, Yang Weng, Yizheng Liao, Shengzhe Chen

Main category: cs.LG

TL;DR: A novel privacy-preserving framework for graph data publishing using Gaussian DP, ensuring unbiased recovery and addressing gaps in traditional methods.

DetailsMotivation: Challenges in balancing privacy and utility in open datasets under GDPR, with existing methods neglecting privacy at the data publishing stage.

Method: Proposes a Gaussian DP-based framework with structured noise injection for unbiased graph recovery, extending to discrete-variable graphs.

Result: Theoretical guarantees on accuracy and robust experimental performance in graph learning.

Conclusion: Offers a viable solution for privacy-conscious graph analysis, addressing limitations of traditional DP approaches.

Abstract: Ensuring privacy in large-scale open datasets is increasingly challenging under regulations such as the General Data Protection Regulation (GDPR). While differential privacy (DP) provides strong theoretical guarantees, it primarily focuses on noise injection during model training, neglecting privacy preservation at the data publishing stage. Existing privacy-preserving data publishing (PPDP) approaches struggle to balance privacy and utility, particularly when data publishers and users are distinct entities. To address this gap, we focus on the graph recovery problem and propose a novel privacy-preserving estimation framework for open graph data, leveraging Gaussian DP (GDP) with a structured noise-injection mechanism. Unlike traditional methods that perturb gradients or model updates, our approach ensures unbiased graph structure recovery while enforcing DP at the data publishing stage. Moreover, we provide theoretical guarantees on estimation accuracy and extend our method to discrete-variable graphs, a setting often overlooked in DP research. Experimental results in graph learning demonstrate robust performance, offering a viable solution for privacy-conscious graph analysis.

[274] Solar Photovoltaic Assessment with Large Language Model

Muhao Guo, Yang Weng

Main category: cs.LG

TL;DR: The paper proposes PVAL, a framework using LLMs for accurate solar PV panel detection in satellite imagery, addressing transparency, scalability, and adaptability challenges.

DetailsMotivation: Existing methods for PV panel detection lack transparency, require extensive training data, and struggle with generalization, hindering large-scale deployment.

Method: PVAL leverages LLMs with task decomposition, output standardization, few-shot prompting, and fine-tuning on curated datasets.

Result: PVAL improves accuracy, scalability, and adaptability in PV detection, enabling automated and reproducible workflows.

Conclusion: PVAL offers a robust solution for large-scale renewable energy integration and optimized grid management.

Abstract: Accurate detection and localization of solar photovoltaic (PV) panels in satellite imagery is essential for optimizing microgrids and active distribution networks (ADNs), which are critical components of renewable energy systems. Existing methods lack transparency regarding their underlying algorithms or training datasets, rely on large, high-quality PV training data, and struggle to generalize to new geographic regions or varied environmental conditions without extensive re-training. These limitations lead to inconsistent detection outcomes, hindering large-scale deployment and data-driven grid optimization. In this paper, we investigate how large language models (LLMs) can be leveraged to overcome these challenges. Despite their promise, LLMs face several challenges in solar panel detection, including difficulties with multi-step logical processes, inconsistent output formatting, frequent misclassification of visually similar objects (e.g., shadows, parking lots), and low accuracy in complex tasks such as spatial localization and quantification. To overcome these issues, we propose the PV Assessment with LLMs (PVAL) framework, which incorporates task decomposition for more efficient workflows, output standardization for consistent and scalable formatting, few-shot prompting to enhance classification accuracy, and fine-tuning using curated PV datasets with detailed annotations. PVAL ensures transparency, scalability, and adaptability across heterogeneous datasets while minimizing computational overhead. By combining open-source accessibility with robust methodologies, PVAL establishes an automated and reproducible pipeline for solar panel detection, paving the way for large-scale renewable energy integration and optimized grid management.

[275] Explainable AI guided unsupervised fault diagnostics for high-voltage circuit breakers

Chi-Ching Hsu, Gaëtan Frusque, Florent Forest, Felipe Macedo, Christian M. Franck, Olga Fink

Main category: cs.LG

TL;DR: Proposes an unsupervised framework for fault detection in high-voltage circuit breakers using vibration/acoustic signals, with explainable AI for diagnostics.

DetailsMotivation: Existing methods rely on supervised learning with labeled faults, impractical for real-world use where labels are unavailable.

Method: Unsupervised fault detection and segmentation framework using vibration/acoustic signals, with XAI for diagnostics.

Result: Validated on experimental data, detects faults and clusters them without labeled data, aiding diagnostics.

Conclusion: Enables reliable online monitoring of circuit breakers without fault labels, improving system reliability.

Abstract: Commercial high-voltage circuit breaker (CB) condition monitoring systems rely on directly observable physical parameters such as gas filling pressure with pre-defined thresholds. While these parameters are crucial, they only cover a small subset of malfunctioning mechanisms and usually can be monitored only if the CB is disconnected from the grid. To facilitate online condition monitoring while CBs remain connected, non-intrusive measurement techniques such as vibration or acoustic signals are necessary. Currently, CB condition monitoring studies using these signals typically utilize supervised methods for fault diagnostics, where ground-truth fault types are known due to artificially introduced faults in laboratory settings. This supervised approach is however not feasible in real-world applications, where fault labels are unavailable. In this work, we propose a novel unsupervised fault detection and segmentation framework for CBs based on vibration and acoustic signals. This framework can detect deviations from the healthy state. The explainable artificial intelligence (XAI) approach is applied to the detected faults for fault diagnostics. The specific contributions are: (1) we propose an integrated unsupervised fault detection and segmentation framework that is capable of detecting faults and clustering different faults with only healthy data required during training (2) we provide an unsupervised explainability-guided fault diagnostics approach using XAI to offer domain experts potential indications of the aged or faulty components, achieving fault diagnostics without the prerequisite of ground-truth fault labels. These contributions are validated using an experimental dataset from a high-voltage CB under healthy and artificially introduced fault conditions, contributing to more reliable CB system operation.

[276] Automatic Cough Analysis for Non-Small Cell Lung Cancer Detection

Chiara Giangregorio, Cristina Maria Licciardello, Vanja Miskovic, Leonardo Provenzano, Alessandra Laura Giulia Pedrocchi, Andra Diana Dumitrascu, Arsela Prelaj, Marina Chiara Garassino, Emilia Ambrosini, Simona Ferrante

Main category: cs.LG

TL;DR: The study explores automatic cough analysis using machine learning (SVM, XGBoost) and deep learning (CNN, VGG16) to distinguish NSCLC patients from healthy controls, achieving 0.83 accuracy with CNN. Fairness analysis revealed disparities across age and gender, highlighting the need for a more diverse dataset.

DetailsMotivation: Early detection of NSCLC is crucial for better patient outcomes, and novel pre-screening tools like cough analysis are needed.

Method: Cough audio recordings from 227 subjects (NSCLC patients and healthy controls) were analyzed using SVM, XGBoost, CNN, and VGG16. SHAP was used for model interpretability, and fairness was assessed across age and gender.

Result: CNN achieved the highest accuracy (0.83), while SVM was suitable for low-power contexts (0.78 accuracy). Fairness analysis showed disparities (age: 0.15, gender: 0.09).

Conclusion: The approach shows promise, but a larger, more diverse dataset is needed to improve reliability and address fairness concerns.

Abstract: Early detection of non-small cell lung cancer (NSCLC) is critical for improving patient outcomes, and novel approaches are needed to facilitate early diagnosis. In this study, we explore the use of automatic cough analysis as a pre-screening tool for distinguishing between NSCLC patients and healthy controls. Cough audio recordings were prospectively acquired from a total of 227 subjects, divided into NSCLC patients and healthy controls. The recordings were analyzed using machine learning techniques, such as support vector machine (SVM) and XGBoost, as well as deep learning approaches, specifically convolutional neural networks (CNN) and transfer learning with VGG16. To enhance the interpretability of the machine learning model, we utilized Shapley Additive Explanations (SHAP). The fairness of the models across demographic groups was assessed by comparing the performance of the best model across different age groups (less than or equal to 58y and higher than 58y) and gender using the equalized odds difference on the test set. The results demonstrate that CNN achieves the best performance, with an accuracy of 0.83 on the test set. Nevertheless, SVM achieves slightly lower performances (accuracy of 0.76 in validation and 0.78 in the test set), making it suitable in contexts with low computational power. The use of SHAP for SVM interpretation further enhances model transparency, making it more trustworthy for clinical applications. Fairness analysis shows slightly higher disparity across age (0.15) than gender (0.09) on the test set. Therefore, to strengthen our findings’ reliability, a larger, more diverse, and unbiased dataset is needed – particularly including individuals at risk of NSCLC and those in early disease stages.

[277] WACA-UNet: Weakness-Aware Channel Attention for Static IR Drop Prediction in Integrated Circuit Design

Youngmin Seo, Yunhyeong Kwon, Younghun Park, HwiRyong Kim, Seungho Eum, Jinha Kim, Taigon Song, Juho Kim, Unsang Park

Main category: cs.LG

TL;DR: The paper proposes a Weakness-Aware Channel Attention (WACA) mechanism to improve IR drop prediction in VLSI design by adaptively balancing feature channels, outperforming prior methods significantly.

DetailsMotivation: Traditional simulation-based solvers for IR drop prediction are computationally expensive and hard to scale, necessitating a more efficient and accurate learning-based approach.

Method: The method reformulates IR drop estimation as pixel-wise regression using multi-channel physical maps and introduces WACA to enhance weak features and suppress dominant ones in a ConvNeXtV2-based attention U-Net.

Result: The approach reduces mean absolute error by 61.1% and improves F1-score by 71.0% compared to the ICCAD-2023 contest winner.

Conclusion: Channel-wise heterogeneity is crucial for accurate physical layout analysis in VLSI, and the WACA mechanism effectively addresses this.

Abstract: Accurate spatial prediction of power integrity issues, such as IR drop, is critical for reliable VLSI design. However, traditional simulation-based solvers are computationally expensive and difficult to scale. We address this challenge by reformulating IR drop estimation as a pixel-wise regression task on heterogeneous multi-channel physical maps derived from circuit layouts. Prior learning-based methods treat all input layers (e.g., metal, via, and current maps) equally, ignoring their varying importance to prediction accuracy. To tackle this, we propose a novel Weakness-Aware Channel Attention (WACA) mechanism, which recursively enhances weak feature channels while suppressing over-dominant ones through a two-stage gating strategy. Integrated into a ConvNeXtV2-based attention U-Net, our approach enables adaptive and balanced feature representation. On the public ICCAD-2023 benchmark, our method outperforms the ICCAD-2023 contest winner by reducing mean absolute error by 61.1% and improving F1-score by 71.0%. These results demonstrate that channel-wise heterogeneity is a key inductive bias in physical layout analysis for VLSI.

[278] Physics-Informed Graph Neural Networks for Transverse Momentum Estimation in CMS Trigger Systems

Md Abrar Jahin, Shahriar Soudeep, M. F. Mridha, Muhammad Mostafa Monowar, Md. Abdul Hamid

Main category: cs.LG

TL;DR: A physics-informed GNN framework is proposed for real-time particle transverse momentum estimation, outperforming baselines in accuracy and efficiency.

DetailsMotivation: Addressing the inefficiency and inaccuracy of static ML models and generic GNNs in high-energy physics under hardware constraints.

Method: Four graph construction strategies (station-as-node, feature-as-node, bending angle-centric, pseudorapidity-centric) combined with a novel Message Passing Layer and domain-specific loss functions.

Result: Achieves state-of-the-art MAE of 0.8525 with 55% fewer parameters than baselines, validated on the CMS Trigger Dataset.

Conclusion: Physics-guided GNNs show promise for deployment in resource-constrained trigger systems.

Abstract: Real-time particle transverse momentum ($p_T$) estimation in high-energy physics demands algorithms that are both efficient and accurate under strict hardware constraints. Static machine learning models degrade under high pileup and lack physics-aware optimization, while generic graph neural networks (GNNs) often neglect domain structure critical for robust $p_T$ regression. We propose a physics-informed GNN framework that systematically encodes detector geometry and physical observables through four distinct graph construction strategies that systematically encode detector geometry and physical observables: station-as-node, feature-as-node, bending angle-centric, and pseudorapidity ($\eta$)-centric representations. This framework integrates these tailored graph structures with a novel Message Passing Layer (MPL), featuring intra-message attention and gated updates, and domain-specific loss functions incorporating $p_{T}$-distribution priors. Our co-design methodology yields superior accuracy-efficiency trade-offs compared to existing baselines. Extensive experiments on the CMS Trigger Dataset validate the approach: a station-informed EdgeConv model achieves a state-of-the-art MAE of 0.8525 with $\ge55%$ fewer parameters than deep learning baselines, especially TabNet, while an $\eta$-centric MPL configuration also demonstrates improved accuracy with comparable efficiency. These results establish the promise of physics-guided GNNs for deployment in resource-constrained trigger systems.

[279] A Markov Categorical Framework for Language Modeling

Yifan Zhang

Main category: cs.LG

TL;DR: The paper introduces a framework using Markov Categories to analyze auto-regressive language models, explaining why NLL training works and its implicit geometric and contrastive learning properties.

DetailsMotivation: To provide a theoretical understanding of why the negative log-likelihood (NLL) objective in auto-regressive models yields versatile representations, despite its simplicity.

Method: Uses Markov Categories (MCs) to model the AR generation process and NLL objective, analyzing information flow and learned geometry through compositional and information-theoretic perspectives.

Result: The framework explains the success of speculative decoding methods, shows NLL learns intrinsic data stochasticity, and proves NLL acts as implicit spectral contrastive learning.

Conclusion: The work reveals deep structural principles behind modern language models’ effectiveness, linking NLL training to geometric and contrastive learning properties.

Abstract: Auto-regressive language models factorize sequence probabilities and are trained by minimizing the negative log-likelihood (NLL) objective. While empirically powerful, a deep theoretical understanding of why this simple objective yields such versatile representations remains elusive. This work introduces a unifying analytical framework using Markov Categories (MCs) to deconstruct the AR generation process and the NLL objective. We model the single-step generation map as a composition of Markov kernels in the category Stoch. This compositional view, when enriched with statistical divergences, allows us to dissect information flow and learned geometry. Our framework makes three main contributions. First, we provide a formal, information-theoretic rationale for the success of modern speculative decoding methods like EAGLE, quantifying the information surplus in hidden states that these methods exploit. Second, we formalize how NLL minimization forces the model to learn not just the next token, but the data’s intrinsic conditional stochasticity, a process we analyze using categorical entropy. Third, and most centrally, we prove that NLL training acts as an implicit form of spectral contrastive learning. By analyzing the information geometry of the model’s prediction head, we show that NLL implicitly forces the learned representation space to align with the eigenspectrum of a predictive similarity operator, thereby learning a geometrically structured space without explicit contrastive pairs. This compositional and information-geometric perspective reveals the deep structural principles underlying the effectiveness of modern LMs. Project Page: https://github.com/asiresearch/lm-theory

[280] Dependency-aware synthetic tabular data generation

Chaithra Umesh, Kristian Schultz, Manjunath Mahendra, Saptarshi Bej, Olaf Wolkenhauer

Main category: cs.LG

TL;DR: HFGF improves synthetic tabular data generation by preserving functional and logical dependencies, enhancing structural fidelity and utility.

DetailsMotivation: Existing generative models often fail to preserve inter-attribute relationships like functional and logical dependencies in synthetic data.

Method: Proposes HFGF, which first generates independent features and then reconstructs dependent features using predefined rules for dependencies.

Result: HFGF outperforms six generative models (e.g., CTGAN, TVAE) in preserving dependencies across benchmark datasets.

Conclusion: HFGF significantly enhances the structural fidelity and downstream utility of synthetic tabular data.

Abstract: Synthetic tabular data is increasingly used in privacy-sensitive domains such as health care, but existing generative models often fail to preserve inter-attribute relationships. In particular, functional dependencies (FDs) and logical dependencies (LDs), which capture deterministic and rule-based associations between features, are rarely or often poorly retained in synthetic datasets. To address this research gap, we propose the Hierarchical Feature Generation Framework (HFGF) for synthetic tabular data generation. We created benchmark datasets with known dependencies to evaluate our proposed HFGF. The framework first generates independent features using any standard generative model, and then reconstructs dependent features based on predefined FD and LD rules. Our experiments on four benchmark datasets with varying sizes, feature imbalance, and dependency complexity demonstrate that HFGF improves the preservation of FDs and LDs across six generative models, including CTGAN, TVAE, and GReaT. Our findings demonstrate that HFGF can significantly enhance the structural fidelity and downstream utility of synthetic tabular data.

[281] Component-Based Machine Learning for Indoor Flow and Temperature Fields Prediction Latent Feature Aggregation and Flow Interaction

Shaofan Wang, Nils Thuerey, Philipp Geyer

Main category: cs.LG

TL;DR: A component-based machine learning (CBML) surrogate model replaces CFD for fast indoor airflow and temperature prediction, using neural networks for feature extraction and aggregation.

DetailsMotivation: Traditional CFD simulations are too slow for real-time or iterative workflows, necessitating a faster alternative.

Method: The CBML model uses three neural networks: a convolutional autoencoder with residual connections (CAER), a multilayer perceptron (MLP), and a convolutional neural network (CNN) to predict indoor airflow and temperature.

Result: The CBML model accurately and quickly predicts velocity and temperature fields in training and testing datasets.

Conclusion: The CBML approach is effective for fast and accurate indoor airflow and temperature prediction, overcoming CFD’s computational limitations.

Abstract: Accurate and efficient prediction of indoor airflow and temperature distributions is essential for building energy optimization and occupant comfort control. However, traditional CFD simulations are computationally intensive, limiting their integration into real-time or design-iterative workflows. This study proposes a component-based machine learning (CBML) surrogate modeling approach to replace conventional CFD simulation for fast prediction of indoor velocity and temperature fields. The model consists of three neural networks: a convolutional autoencoder with residual connections (CAER) to extract and compress flow features, a multilayer perceptron (MLP) to map inlet velocities to latent representations, and a convolutional neural network (CNN) as an aggregator to combine single-inlet features into dual-inlet scenarios. A two-dimensional room with varying left and right air inlet velocities is used as a benchmark case, with CFD simulations providing training and testing data. Results show that the CBML model accurately and fast predicts two-component aggregated velocity and temperature fields across both training and testing datasets.

[282] Doubling Your Data in Minutes: Ultra-fast Tabular Data Generation via LLM-Induced Dependency Graphs

Shuo Yang, Zheyu Zhang, Bardh Prenkaj, Gjergji Kasneci

Main category: cs.LG

TL;DR: SPADA is a lightweight generative framework for tabular data augmentation that uses sparse dependency modeling to reduce bias and computational overhead.

DetailsMotivation: High-quality tabular datasets are scarce due to privacy and cost issues, and existing LLM-based methods introduce bias and computational inefficiency.

Method: SPADA captures sparse dependencies via an LLM-induced graph, treating features as nodes and synthesizing values by traversing the graph. It employs Gaussian kernel density estimation or conditional normalizing flows for synthesis.

Result: SPADA reduces constraint violations by 4% and speeds up generation by 9,500 times compared to baselines.

Conclusion: SPADA effectively addresses bias and computational challenges in tabular data augmentation.

Abstract: Tabular data is critical across diverse domains, yet high-quality datasets remain scarce due to privacy concerns and the cost of collection. Contemporary approaches adopt large language models (LLMs) for tabular augmentation, but exhibit two major limitations: (1) dense dependency modeling among tabular features that can introduce bias, and (2) high computational overhead in sampling. To address these issues, we propose SPADA for SPArse Dependency-driven Augmentation, a lightweight generative framework that explicitly captures sparse dependencies via an LLM-induced graph. We treat each feature as a node and synthesize values by traversing the graph, conditioning each feature solely on its parent nodes. We explore two synthesis strategies: a non-parametric method using Gaussian kernel density estimation, and a conditional normalizing flow model that learns invertible mappings for conditional density estimation. Experiments on four datasets show that SPADA reduces constraint violations by 4% compared to diffusion-based methods and accelerates generation by nearly 9,500 times over LLM-based baselines.

[283] Advancing Event Forecasting through Massive Training of Large Language Models: Challenges, Solutions, and Broader Impacts

Sang-Woo Lee, Sohee Yang, Donghyun Kwak, Noah Y. Siegel

Main category: cs.LG

TL;DR: Recent advancements in LLMs show potential for superforecaster-level event forecasting, with improved methods and reinforcement learning. The paper proposes large-scale training and data acquisition strategies to overcome challenges like noisiness-sparsity and knowledge cut-off.

DetailsMotivation: To leverage recent progress in LLMs and reasoning models for achieving superforecaster-level event forecasting, addressing current limitations and exploring scalable solutions.

Method: Proposes training methods like hypothetical event Bayesian networks and auxiliary rewards, alongside aggressive data acquisition from diverse sources.

Result: Identifies key challenges and solutions, suggesting scalable training and data strategies to advance forecasting capabilities.

Conclusion: The time is ripe for large-scale research on superforecaster-level LLMs, with proposed directions aiming to inspire further exploration and societal impact.

Abstract: Many recent papers have studied the development of superforecaster-level event forecasting LLMs. While methodological problems with early studies cast doubt on the use of LLMs for event forecasting, recent studies with improved evaluation methods have shown that state-of-the-art LLMs are gradually reaching superforecaster-level performance, and reinforcement learning has also been reported to improve future forecasting. Additionally, the unprecedented success of recent reasoning models and Deep Research-style models suggests that technology capable of greatly improving forecasting performance has been developed. Therefore, based on these positive recent trends, we argue that the time is ripe for research on large-scale training of superforecaster-level event forecasting LLMs. We discuss two key research directions: training methods and data acquisition. For training, we first introduce three difficulties of LLM-based event forecasting training: noisiness-sparsity, knowledge cut-off, and simple reward structure problems. Then, we present related ideas to mitigate these problems: hypothetical event Bayesian networks, utilizing poorly-recalled and counterfactual events, and auxiliary reward signals. For data, we propose aggressive use of market, public, and crawling datasets to enable large-scale training and evaluation. Finally, we explain how these technical advances could enable AI to provide predictive intelligence to society in broader areas. This position paper presents promising specific paths and considerations for getting closer to superforecaster-level AI technology, aiming to call for researchers’ interest in these directions.

[284] Short-Form Video Recommendations with Multimodal Embeddings: Addressing Cold-Start and Bias Challenges

Andrii Dzhoha, Katya Mirylenka, Egor Malykh, Marco-Andrea Buchmann, Francesca Catino

Main category: cs.LG

TL;DR: The paper discusses challenges in recommender systems for short-form video platforms, proposing a multimodal vision-language model as a more effective solution than traditional methods.

DetailsMotivation: The rise of short-form video platforms and their adoption in domains like e-commerce has introduced new challenges for recommender systems, such as limited interaction data and biases.

Method: The authors leverage a fine-tuned multimodal vision-language model for video retrieval, addressing biases and data limitations.

Result: Online experiments on an e-commerce platform showed this approach outperformed conventional supervised learning methods.

Conclusion: Using a multimodal vision-language model can effectively overcome challenges in short-form video recommendation systems.

Abstract: In recent years, social media users have spent significant amounts of time on short-form video platforms. As a result, established platforms in other domains, such as e-commerce, have begun introducing short-form video content to engage users and increase their time spent on the platform. The success of these experiences is due not only to the content itself but also to a unique UI innovation: instead of offering users a list of choices to click, platforms actively recommend content for users to watch one at a time. This creates new challenges for recommender systems, especially when launching a new video experience. Beyond the limited interaction data, immersive feed experiences introduce stronger position bias due to the UI and duration bias when optimizing for watch-time, as models tend to favor shorter videos. These issues, together with the feedback loop inherent in recommender systems, make it difficult to build effective solutions. In this paper, we highlight the challenges faced when introducing a new short-form video experience and present our experience showing that, even with sufficient video interaction data, it can be more beneficial to leverage a video retrieval system using a fine-tuned multimodal vision-language model to overcome these challenges. This approach demonstrated greater effectiveness compared to conventional supervised learning methods in online experiments conducted on our e-commerce platform.

[285] Reconstruction of Sparse Urban Wireless Signals via Group Equivariant Non-Expansive Operators

Lorenzo Mario Amorosa, Francesco Conti, Nicola Quercioli, Flavio Zabini, Tayebeh Lotfi Mahyari, Yiqun Ge, Patrizio Frosini

Main category: cs.LG

TL;DR: The paper proposes a GENEO-based method for reconstructing SINR maps in 6G networks from sparse measurements, offering a low-complexity alternative to neural networks.

DetailsMotivation: Efficient resource management in 6G networks requires high-resolution SINR maps, which are costly to acquire. Traditional methods like neural networks are complex and data-intensive.

Method: Uses Group Equivariant Non-Expansive Operators (GENEOs) to reconstruct SINR maps from sparse samples, leveraging geometric and algebraic constraints to reduce complexity and data needs.

Result: The GENEO-based approach achieves competitive performance in reconstructing SINR maps under severe data limitations, validated by statistical and TDA metrics.

Conclusion: GENEOs provide a promising low-complexity solution for spatial signal reconstruction in resource-constrained wireless networks.

Abstract: In emerging communication systems such as sixth generation (6G) wireless networks, efficient resource management and service delivery rely on accurate knowledge of spatially-varying quantities like signal-to-interference-noise ratio (SINR) maps, which are costly to acquire at high resolution. This work explores the reconstruction of such spatial signals from sparse measurements using Group Equivariant Non-Expansive Operators (GENEOs), offering a low-complexity alternative to traditional neural networks. The concept of GENEO, which originated in topological data analysis (TDA), is a mathematical tool used in machine learning to represent agents modelled as functional operators acting on data while incorporating application-specific invariances. Leveraging these invariances reduces the number of parameters with respect to traditional neural networks and mitigates data scarcity by enforcing known algebraic and geometric constraints that reflect symmetries in the agents’ actions. In this paper, we introduce a novel GENEO-based approach for SINR map reconstruction in urban wireless communication networks using extremely sparse sampling. We demonstrate that this mathematical framework achieves competitive performance compared to established methods. Our evaluation, conducted using both statistical and TDA metrics, highlights the advantages of our approach in accurately reconstructing spatial signals under severe data limitations on the number of samples.

[286] A Data-Driven Approach to Estimate LEO Orbit Capacity Models

Braden Stock, Maddox McVarthy, Simone Servadio

Main category: cs.LG

TL;DR: Combines SINDy and LSTM to model LEO space objects (Active, Derelict, Debris) for accurate, faster predictions using MOCAT-MC data.

DetailsMotivation: To improve forecasting of satellite and debris propagation in LEO with a lightweight, efficient model.

Method: Uses SINDy for nonlinear dynamics and LSTM for sequence modeling, trained on MOCAT-MC data.

Result: Accurate predictions with reduced computational cost compared to high-fidelity models.

Conclusion: The hybrid approach offers a practical, efficient solution for space object propagation forecasting.

Abstract: Utilizing the Sparse Identification of Nonlinear Dynamics algorithm (SINDy) and Long Short-Term Memory Recurrent Neural Networks (LSTM), the population of resident space objects, divided into Active, Derelict, and Debris, in LEO can be accurately modeled to predict future satellite and debris propagation. This proposed approach makes use of a data set coming from a computational expensive high-fidelity model, the MOCAT-MC, to provide a light, low-fidelity counterpart that provides accurate forecasting in a shorter time frame.

[287] Counterfactual Explanations in Medical Imaging: Exploring SPN-Guided Latent Space Manipulation

Julia Siekiera, Stefan Kramer

Main category: cs.LG

TL;DR: The paper explores generating plausible counterfactual explanations for deep learning models in medical image analysis using a semi-supervised VAE and SPN integration.

DetailsMotivation: Addressing the challenge of interpretability in black-box deep learning models, particularly in medical image analysis, by providing human-understandable counterfactual explanations.

Method: Combines a semi-supervised VAE with an SPN to model latent space likelihood, optimizing counterfactuals that align with data and target class distributions.

Result: Experimental evaluation on the cheXpert dataset shows SPN-guided latent space manipulation outperforms a neural network baseline, balancing regularization and counterfactual quality.

Conclusion: The proposed method effectively generates interpretable counterfactuals, enhancing trust and reliability in AI-driven medical decision-making.

Abstract: Artificial intelligence is increasingly leveraged across various domains to automate decision-making processes that significantly impact human lives. In medical image analysis, deep learning models have demonstrated remarkable performance. However, their inherent complexity makes them black box systems, raising concerns about reliability and interpretability. Counterfactual explanations provide comprehensible insights into decision processes by presenting hypothetical “what-if” scenarios that alter model classifications. By examining input alterations, counterfactual explanations provide patterns that influence the decision-making process. Despite their potential, generating plausible counterfactuals that adhere to similarity constraints providing human-interpretable explanations remains a challenge. In this paper, we investigate this challenge by a model-specific optimization approach. While deep generative models such as variational autoencoders (VAEs) exhibit significant generative power, probabilistic models like sum-product networks (SPNs) efficiently represent complex joint probability distributions. By modeling the likelihood of a semi-supervised VAE’s latent space with an SPN, we leverage its dual role as both a latent space descriptor and a classifier for a given discrimination task. This formulation enables the optimization of latent space counterfactuals that are both close to the original data distribution and aligned with the target class distribution. We conduct experimental evaluation on the cheXpert dataset. To evaluate the effectiveness of the integration of SPNs, our SPN-guided latent space manipulation is compared against a neural network baseline. Additionally, the trade-off between latent variable regularization and counterfactual quality is analyzed.

[288] FD4QC: Application of Classical and Quantum-Hybrid Machine Learning for Financial Fraud Detection A Technical Report

Matteo Cardaioli, Luca Marangoni, Giada Martini, Francesco Mazzolin, Luca Pajola, Andrea Ferretto Parodi, Alessandra Saitta, Maria Chiara Vernillo

Main category: cs.LG

TL;DR: The paper compares classical, quantum, and quantum-hybrid ML models for fraud detection, finding classical models like Random Forest outperform quantum ones, though QSVM shows promise. It also proposes FD4QC, a practical system for real-world deployment.

DetailsMotivation: Address the challenges of fraud detection in complex financial transactions by evaluating and comparing classical and quantum ML models.

Method: Developed a behavioural feature engineering framework, implemented and evaluated classical (Logistic Regression, Decision Tree, Random Forest, XGBoost) and quantum (QSVM, VQC, HQNN) models on the IBM AML dataset, and proposed FD4QC for real-world deployment.

Result: Random Forest outperformed quantum models (97.34% accuracy, 86.95% F-measure). QSVM showed promise with high precision (77.15%) and low false-positive rate (1.36%).

Conclusion: Classical models currently outperform quantum ones in fraud detection, but QSVM is promising. The paper provides benchmarks and future research directions for quantum ML in finance.

Abstract: The increasing complexity and volume of financial transactions pose significant challenges to traditional fraud detection systems. This technical report investigates and compares the efficacy of classical, quantum, and quantum-hybrid machine learning models for the binary classification of fraudulent financial activities. As of our methodology, first, we develop a comprehensive behavioural feature engineering framework to transform raw transactional data into a rich, descriptive feature set. Second, we implement and evaluate a range of models on the IBM Anti-Money Laundering (AML) dataset. The classical baseline models include Logistic Regression, Decision Tree, Random Forest, and XGBoost. These are compared against three hybrid classic quantum algorithms architectures: a Quantum Support Vector Machine (QSVM), a Variational Quantum Classifier (VQC), and a Hybrid Quantum Neural Network (HQNN). Furthermore, we propose Fraud Detection for Quantum Computing (FD4QC), a practical, API-driven system architecture designed for real-world deployment, featuring a classical-first, quantum-enhanced philosophy with robust fallback mechanisms. Our results demonstrate that classical tree-based models, particularly \textit{Random Forest}, significantly outperform the quantum counterparts in the current setup, achieving high accuracy ((97.34%)) and F-measure ((86.95%)). Among the quantum models, \textbf{QSVM} shows the most promise, delivering high precision ((77.15%)) and a low false-positive rate ((1.36%)), albeit with lower recall and significant computational overhead. This report provides a benchmark for a real-world financial application, highlights the current limitations of quantum machine learning in this domain, and outlines promising directions for future research.

[289] On Arbitrary Predictions from Equally Valid Models

Sarah Lockfisch, Kristian Schwethelm, Martin Menten, Rickmer Braren, Daniel Rueckert, Alexander Ziller, Georgios Kaissis

Main category: cs.LG

TL;DR: The paper examines predictive multiplicity in medical ML models, showing that small ensembles and abstention strategies can mitigate conflicting predictions, while higher model accuracy reduces multiplicity.

DetailsMotivation: To understand and address the risk of conflicting predictions from equally valid ML models in medicine, which can lead to arbitrary diagnoses for patients.

Method: Empirical analysis of predictive multiplicity across medical tasks and model architectures, testing small ensembles and abstention strategies.

Result: Standard metrics fail to identify optimal models; small ensembles reduce predictive multiplicity, and higher model accuracy decreases it.

Conclusion: Ensemble-based strategies improve diagnostic reliability, and expert review is recommended for cases with insufficient model consensus.

Abstract: Model multiplicity refers to the existence of multiple machine learning models that describe the data equally well but may produce different predictions on individual samples. In medicine, these models can admit conflicting predictions for the same patient – a risk that is poorly understood and insufficiently addressed. In this study, we empirically analyze the extent, drivers, and ramifications of predictive multiplicity across diverse medical tasks and model architectures, and show that even small ensembles can mitigate/eliminate predictive multiplicity in practice. Our analysis reveals that (1) standard validation metrics fail to identify a uniquely optimal model and (2) a substantial amount of predictions hinges on arbitrary choices made during model development. Using multiple models instead of a single model reveals instances where predictions differ across equally plausible models – highlighting patients that would receive arbitrary diagnoses if any single model were used. In contrast, (3) a small ensemble paired with an abstention strategy can effectively mitigate measurable predictive multiplicity in practice; predictions with high inter-model consensus may thus be amenable to automated classification. While accuracy is not a principled antidote to predictive multiplicity, we find that (4) higher accuracy achieved through increased model capacity reduces predictive multiplicity. Our findings underscore the clinical importance of accounting for model multiplicity and advocate for ensemble-based strategies to improve diagnostic reliability. In cases where models fail to reach sufficient consensus, we recommend deferring decisions to expert review.

[290] SILS: Strategic Influence on Liquidity Stability and Whale Detection in Concentrated-Liquidity DEXs

Ali RajabiNekoo, Laleh Rasoul, Amirfarhad Farhadi, Azadeh Zamanifar

Main category: cs.LG

TL;DR: The SILS framework introduces a dynamic, impact-focused approach to identify high-impact liquidity providers (LPs) in CLMMs, using ETWL profiles and LSIS scores for accurate risk analysis, outperforming traditional static methods.

DetailsMotivation: Traditional methods for identifying impactful LPs rely on broad measures like capital size, leading to inaccurate risk analysis. SILS aims to provide a more detailed and dynamic understanding of LP impact on market stability.

Method: SILS uses on-chain event logs and smart contract execution traces to compute ETWL profiles and applies unsupervised anomaly detection. It defines LP importance via the LSIS, a counterfactual metric measuring potential market degradation if the LP withdraws.

Result: SILS accurately identifies high-impact LPs, including those missed by traditional methods, and supports applications like protective oracle layers and trader signals, enhancing DeFi ecosystem transparency and risk management.

Conclusion: SILS transforms DeFi risk management by providing a proactive, impact-focused approach, reducing false positives/negatives and improving ecosystem safeguards against asymmetric liquidity behavior.

Abstract: Traditional methods for identifying impactful liquidity providers (LPs) in Concentrated Liquidity Market Makers (CLMMs) rely on broad measures, such as nominal capital size or surface-level activity, which often lead to inaccurate risk analysis. The SILS framework offers a significantly more detailed approach, characterizing LPs not just as capital holders but as dynamic systemic agents whose actions directly impact market stability. This represents a fundamental paradigm shift from the static, volume-based analysis to a dynamic, impact-focused understanding. This advanced approach uses on-chain event logs and smart contract execution traces to compute Exponential Time-Weighted Liquidity (ETWL) profiles and apply unsupervised anomaly detection. Most importantly, it defines an LP’s functional importance through the Liquidity Stability Impact Score (LSIS), a counterfactual metric that measures the potential degradation of the market if the LP withdraws. This combined approach provides a more detailed and realistic characterization of an LP’s impact, moving beyond the binary and often misleading classifications used by existing methods. This impact-focused and comprehensive approach enables SILS to accurately identify high-impact LPs-including those missed by traditional methods and supports essential applications like a protective oracle layer and actionable trader signals, thereby significantly enhancing DeFi ecosystem. The framework provides unprecedented transparency into the underlying liquidity structure and associated risks, effectively reducing the common false positives and uncovering critical false negatives found in traditional models. Therefore, SILS provides an effective mechanism for proactive risk management, transforming how DeFi protocols safeguard their ecosystems against asymmetric liquidity behavior.

[291] Step-3 is Large yet Affordable: Model-system Co-design for Cost-effective Decoding

StepFun, :, Bin Wang, Bojun Wang, Changyi Wan, Guanzhe Huang, Hanpeng Hu, Haonan Jia, Hao Nie, Mingliang Li, Nuo Chen, Siyu Chen, Song Yuan, Wuxun Xie, Xiaoniu Song, Xing Chen, Xingping Yang, Xuelin Zhang, Yanbo Yu, Yaoyu Wang, Yibo Zhu, Yimin Jiang, Yu Zhou, Yuanwei Lu, Houyi Li, Jingcheng Hu, Ka Man Lo, Ailin Huang, Binxing Jiao, Bo Li, Boyu Chen, Changxin Miao, Chang Lou, Chen Hu, Chen Xu, Chenfeng Yu, Chengyuan Yao, Daokuan Lv, Dapeng Shi, Deshan Sun, Ding Huang, Dingyuan Hu, Dongqing Pang, Enle Liu, Fajie Zhang, Fanqi Wan, Gulin Yan, Han Zhang, Han Zhou, Hanghao Wu, Hangyu Guo, Hanqi Chen, Hanshan Zhang, Hao Wu, Haocheng Zhang, Haolong Yan, Haoran Lv, Haoran Wei, Hebin Zhou, Heng Wang, Heng Wang, Hongxin Li, Hongyu Zhou, Hongyuan Wang, Huiyong Guo, Jia Wang, Jiahao Gong, Jialing Xie, Jian Zhou, Jianjian Sun, Jiaoren Wu, Jiaran Zhang, Jiayu Liu, Jie Cheng, Jie Luo, Jie Yan, Jie Yang, Jieyi Hou, Jinguang Zhang, Jinlan Cao, Jisheng Yin, Junfeng Liu, Junhao Huang, Junzhe Lin, Kaijun Tan, Kaixiang Li, Kang An, Kangheng Lin, Kenkun Liu, Lei Yang, Liang Zhao, Liangyu Chen, Lieyu Shi, Liguo Tan, Lin Lin, Lin Zhang, Lina Chen, Liwen Huang, Liying Shi, Longlong Gu, Mei Chen, Mengqiang Ren, Ming Li, Mingzhe Chen, Na Wang, Nan Wu, Qi Han, Qian Zhao, Qiang Zhang, Qianni Liu, Qiaohui Chen, Qiling Wu, Qinglin He, Qinyuan Tan, Qiufeng Wang, Qiuping Wu, Qiuyan Liang, Quan Sun, Rui Li, Ruihang Miao, Ruosi Wan, Ruyan Guo, Shangwu Zhong, Shaoliang Pang, Shengjie Fan, Shijie Shang, Shilei Jiang, Shiliang Yang, Shiming Hao, Shuli Gao, Siming Huang, Siqi Liu, Tiancheng Cao, Tianhao Cheng, Tianhao Peng, Wang You, Wei Ji, Wen Sun, Wenjin Deng, Wenqing He, Wenzhen Zheng, Xi Chen, Xiangwen Kong, Xianzhen Luo, Xiaobo Yang, Xiaojia Liu, Xiaoxiao Ren, Xin Han, Xin Li, Xin Wu, Xu Zhao, Yanan Wei, Yang Li, Yangguang Li, Yangshijie Xu, Yanming Xu, Yaqiang Shi, Yeqing Shen, Yi Yang, Yifei Yang, Yifeng Gong, Yihan Chen, Yijing Yang, Yinmin Zhang, Yizhuang Zhou, Yuanhao Ding, Yuantao Fan, Yuanzhen Yang, Yuchu Luo, Yue Peng, Yufan Lu, Yuhang Deng, Yuhe Yin, Yujie Liu, Yukun Chen, Yuling Zhao, Yun Mou, Yunlong Li, Yunzhou Ju, Yusheng Li, Yuxiang Yang, Yuxiang Zhang, Yuyang Chen, Zejia Weng, Zhe Xie, Zheng Ge, Zheng Gong, Zhenyi Lu, Zhewei Huang, Zhichao Chang, Zhiguo Huang, Zhirui Wang, Zidong Yang, Zili Wang, Ziqi Wang, Zixin Zhang, Binxing Jiao, Daxin Jiang, Heung-Yeung Shum, Xiangyu Zhang

Main category: cs.LG

TL;DR: Step-3 is a 321B-parameter VLM optimized for decoding efficiency via hardware-aware co-design, featuring MFA and AFD, outperforming models like DeepSeek-V3 in cost and throughput.

DetailsMotivation: Addressing low hardware efficiency in LLMs, especially for long-context tasks, by minimizing decoding costs.

Method: Introduces Multi-Matrix Factorization Attention (MFA) to reduce KV cache and computation, and Attention-FFN Disaggregation (AFD) for distributed inference.

Result: Step-3 reduces decoding costs, activates 38B parameters per token, and achieves 4,039 tokens/sec/GPU throughput, outperforming DeepSeek-V3.

Conclusion: Hardware-aligned design with MFA, MoE sparsity, and AFD sets a new Pareto frontier for efficient LLM decoding.

Abstract: Large language models (LLMs) face low hardware efficiency during decoding, especially for long-context reasoning tasks. This paper introduces Step-3, a 321B-parameter VLM with hardware-aware model-system co-design optimized for minimizing decoding costs. Step-3 innovates in two key dimensions: (1) A novel Multi-Matrix Factorization Attention (MFA) mechanism that significantly reduces both KV cache size and computation while maintaining high attention expressiveness, and (2) Attention-FFN Disaggregation (AFD), a distributed inference system that decouples attention and Feed-Forward Network (FFN) layers into specialized subsystems. This co-design achieves unprecedented cost efficiency: Step-3 significantly reduces theoretical decoding costs compared with models like DeepSeek-V3 and Qwen3 MoE 235B, with the gains widening at longer context. Step-3 achieves low cost while activating 38B parameters per token (more than DeepSeek-V3 and Qwen3 MoE 235B), demonstrating that hardware-aligned attention arithmetic intensity, MoE sparsity, and AFD are critical to cost-effectiveness. We perform a head-to-head comparison with DeepSeek-V3 in its favorable scenarios. Our implementation on Hopper GPUs achieves a decoding throughput of up to 4,039 tokens per second per GPU under 50ms TPOT SLA (4K context, FP8, no MTP). It is higher than DeepSeek-V3’s 2,324 in the same setup and sets a new Pareto frontier for LLM decoding.

[292] Observations Meet Actions: Learning Control-Sufficient Representations for Robust Policy Generalization

Yuliang Gu, Hongpeng Cao, Marco Caccamo, Naira Hovakimyan

Main category: cs.LG

TL;DR: The paper introduces a dual inference-control framework for context-based RL, emphasizing observation and control sufficiency, and proposes BCPO, an efficient algorithm for learning in shifting environments.

DetailsMotivation: To address the challenge of deploying RL agents beyond their training regime by capturing latent variations (contexts) effectively.

Method: Recasts context-based RL as a dual inference-control problem, introduces a contextual ELBO-style objective, and develops BCPO with a variational information-bottleneck encoder.

Result: BCPO matches or outperforms baselines in continuous-control benchmarks, using fewer samples and maintaining performance outside the training regime.

Conclusion: The framework unifies theory, diagnostics, and practice for context-based RL, offering a robust solution for real-world deployment.

Abstract: Capturing latent variations (“contexts”) is key to deploying reinforcement-learning (RL) agents beyond their training regime. We recast context-based RL as a dual inference-control problem and formally characterize two properties and their hierarchy: observation sufficiency (preserving all predictive information) and control sufficiency (retaining decision-making relevant information). Exploiting this dichotomy, we derive a contextual evidence lower bound(ELBO)-style objective that cleanly separates representation learning from policy learning and optimizes it with Bottlenecked Contextual Policy Optimization (BCPO), an algorithm that places a variational information-bottleneck encoder in front of any off-policy policy learner. On standard continuous-control benchmarks with shifting physical parameters, BCPO matches or surpasses other baselines while using fewer samples and retaining performance far outside the training regime. The framework unifies theory, diagnostics, and practice for context-based RL.

[293] Forest-Guided Clustering – Shedding Light into the Random Forest Black Box

Lisa Barros de Andrade e Sousa, Gregor Miller, Ronan Le Gleut, Dominik Thalmeier, Helena Pelin, Marie Piraud

Main category: cs.LG

TL;DR: Forest-Guided Clustering (FGC) is introduced to enhance interpretability of Random Forests by grouping instances via shared decision paths, offering insights into model logic and feature importance.

DetailsMotivation: The need for interpretable and trustworthy machine learning models, especially in sensitive applications, drives the development of FGC to address the opacity of Random Forests.

Method: FGC clusters instances based on shared decision paths in Random Forests, generating interpretable clusters and feature importance scores to explain predictions.

Result: FGC outperformed traditional clustering and explanation methods, accurately identifying latent structures and biologically relevant patterns in datasets.

Conclusion: FGC successfully bridges the gap between model performance and interpretability, providing deeper, structure-aware insights into Random Forests.

Abstract: As machine learning models are increasingly deployed in sensitive application areas, the demand for interpretable and trustworthy decision-making has increased. Random Forests (RF), despite their widespread use and strong performance on tabular data, remain difficult to interpret due to their ensemble nature. We present Forest-Guided Clustering (FGC), a model-specific explainability method that reveals both local and global structure in RFs by grouping instances according to shared decision paths. FGC produces human-interpretable clusters aligned with the model’s internal logic and computes cluster-specific and global feature importance scores to derive decision rules underlying RF predictions. FGC accurately recovered latent subclass structure on a benchmark dataset and outperformed classical clustering and post-hoc explanation methods. Applied to an AML transcriptomic dataset, FGC uncovered biologically coherent subpopulations, disentangled disease-relevant signals from confounders, and recovered known and novel gene expression patterns. FGC bridges the gap between performance and interpretability by providing structure-aware insights that go beyond feature-level attribution.

[294] Noise Contrastive Estimation-based Matching Framework for Low-Resource Security Attack Pattern Recognition

Tu Nguyen, Nedim Šrndić, Alexander Neth

Main category: cs.LG

TL;DR: The paper proposes a neural matching architecture for TTP mapping in cybersecurity, using semantic similarity to simplify the problem compared to traditional multi-class classification.

DetailsMotivation: TTP mapping is challenging due to the large number of classes, skewed label distribution, and hierarchical label structure. Conventional methods struggle with these complexities.

Method: A neural matching architecture with a sampling-based learn-to-compare mechanism is introduced, focusing on semantic similarity between text and TTP labels.

Result: The approach reduces complexity by avoiding direct competition over a large label space, improving learning efficiency.

Conclusion: The proposed method offers a more effective solution for TTP mapping by leveraging semantic similarity and constrained resources.

Abstract: Tactics, Techniques and Procedures (TTPs) represent sophisticated attack patterns in the cybersecurity domain, described encyclopedically in textual knowledge bases. Identifying TTPs in cybersecurity writing, often called TTP mapping, is an important and challenging task. Conventional learning approaches often target the problem in the classical multi-class or multilabel classification setting. This setting hinders the learning ability of the model due to a large number of classes (i.e., TTPs), the inevitable skewness of the label distribution and the complex hierarchical structure of the label space. We formulate the problem in a different learning paradigm, where the assignment of a text to a TTP label is decided by the direct semantic similarity between the two, thus reducing the complexity of competing solely over the large labeling space. To that end, we propose a neural matching architecture with an effective sampling-based learn-to-compare mechanism, facilitating the learning process of the matching model despite constrained resources.

[295] XAI4LLM. Let Machine Learning Models and LLMs Collaborate for Enhanced In-Context Learning in Healthcare

Fatemeh Nazary, Yashar Deldjoo, Tommaso Di Noia, Eugenio di Sciascio

Main category: cs.LG

TL;DR: A knowledge-guided ICL framework improves LLMs for clinical data, reducing bias and enhancing recall, though with higher latency than traditional ML.

DetailsMotivation: To address the need for equitable and accurate clinical decision support systems by leveraging LLMs for structured clinical data.

Method: Integrates domain-specific feature groupings, balanced few-shot examples, and task-specific prompting strategies, evaluated across 70 ICL designs.

Result: LLMs with narrative prompts achieve higher recall and reduce gender bias significantly, though with increased latency.

Conclusion: LLMs offer advantages like zero-shot deployment and equity, with future research directions including distillation and multimodal extensions.

Abstract: Clinical decision support systems require models that are not only highly accurate but also equitable and sensitive to the implications of missed diagnoses. In this study, we introduce a knowledge-guided in-context learning (ICL) framework designed to enable large language models (LLMs) to effectively process structured clinical data. Our approach integrates domain-specific feature groupings, carefully balanced few-shot examples, and task-specific prompting strategies. We systematically evaluate this method across seventy distinct ICL designs by various prompt variations and two different communication styles-natural-language narrative and numeric conversational-and compare its performance to robust classical machine learning (ML) benchmarks on tasks involving heart disease and diabetes prediction. Our findings indicate that while traditional ML models maintain superior performance in balanced precision-recall scenarios, LLMs employing narrative prompts with integrated domain knowledge achieve higher recall and significantly reduce gender bias, effectively narrowing fairness disparities by an order of magnitude. Despite the current limitation of increased inference latency, LLMs provide notable advantages, including the capacity for zero-shot deployment and enhanced equity. This research offers the first comprehensive analysis of ICL design considerations for applying LLMs to tabular clinical tasks and highlights distillation and multimodal extensions as promising directions for future research.

[296] PLEIADES: Building Temporal Kernels with Orthogonal Polynomials

Yan Ru Pei, Olivier Coenen

Main category: cs.LG

TL;DR: PLEIADES neural networks use polynomial-based temporal kernels for low-latency event-based data processing, achieving state-of-the-art results with minimal resources.

DetailsMotivation: To perform efficient online spatiotemporal classification and detection with event-based data, leveraging structured temporal kernels for flexibility in sample rates and discretization.

Method: Uses orthogonal polynomial basis functions to generate temporal convolution kernels, interfacing with event-based data for adaptive processing.

Result: Achieved top performance on three benchmarks: 99.59% accuracy on DVS128, 99.58% on AIS 2024, and 0.556 mAP on PROPHESEE, with low memory/compute costs.

Conclusion: PLEIADES demonstrates superior efficiency and accuracy for event-based tasks, enabling flexible and scalable solutions.

Abstract: We introduce a class of neural networks named PLEIADES (PoLynomial Expansion In Adaptive Distributed Event-based Systems), which contains temporal convolution kernels generated from orthogonal polynomial basis functions. We focus on interfacing these networks with event-based data to perform online spatiotemporal classification and detection with low latency. By virtue of using structured temporal kernels and event-based data, we have the freedom to vary the sample rate of the data along with the discretization step-size of the network without additional finetuning. We experimented with three event-based benchmarks and obtained state-of-the-art results on all three by large margins with significantly smaller memory and compute costs. We achieved: 1) 99.59% accuracy with 192K parameters on the DVS128 hand gesture recognition dataset and 100% with a small additional output filter; 2) 99.58% test accuracy with 277K parameters on the AIS 2024 eye tracking challenge; and 3) 0.556 mAP with 576k parameters on the PROPHESEE 1 Megapixel Automotive Detection Dataset.

[297] ToolACE: Winning the Points of LLM Function Calling

Weiwen Liu, Xu Huang, Xingshan Zeng, Xinlong Hao, Shuai Yu, Dexun Li, Shuai Wang, Weinan Gan, Zhengying Liu, Yuanqing Yu, Zezhong Wang, Yuxian Wang, Wu Ning, Yutai Hou, Bin Wang, Chuhan Wu, Xinzhi Wang, Yong Liu, Yasheng Wang, Duyu Tang, Dandan Tu, Lifeng Shang, Xin Jiang, Ruiming Tang, Defu Lian, Qun Liu, Enhong Chen

Main category: cs.LG

TL;DR: ToolACE is an automatic pipeline for generating high-quality, diverse tool-learning data, addressing challenges in collecting real function-calling data. It uses a self-evolution synthesis process and multi-agent dialogs, achieving state-of-the-art performance with models trained on its data.

DetailsMotivation: Real function-calling data is hard to collect, and synthetic data lacks coverage and accuracy. ToolACE aims to solve this by generating accurate, complex, and diverse data.

Method: ToolACE employs a self-evolution synthesis process to create an API pool and generates dialogs via multi-agent interplay. A dual-layer verification system ensures data accuracy.

Result: Models trained on ToolACE data (even 8B parameters) achieve state-of-the-art performance, rivaling GPT-4.

Conclusion: ToolACE effectively generates high-quality tool-learning data, enabling strong model performance, with publicly available resources.

Abstract: Function calling significantly extends the application boundary of large language models, where high-quality and diverse training data is critical for unlocking this capability. However, real function-calling data is quite challenging to collect and annotate, while synthetic data generated by existing pipelines tends to lack coverage and accuracy. In this paper, we present ToolACE, an automatic agentic pipeline designed to generate accurate, complex, and diverse tool-learning data. ToolACE leverages a novel self-evolution synthesis process to curate a comprehensive API pool of 26,507 diverse APIs. Dialogs are further generated through the interplay among multiple agents, guided by a formalized thinking process. To ensure data accuracy, we implement a dual-layer verification system combining rule-based and model-based checks. We demonstrate that models trained on our synthesized data, even with only 8B parameters, achieve state-of-the-art performance on the Berkeley Function-Calling Leaderboard, rivaling the latest GPT-4 models. Our model and a subset of the data are publicly available at https://huggingface.co/Team-ACE.

[298] Bootstrapped Reward Shaping

Jacob Adamczyk, Volodymyr Makarenko, Stas Tiomkin, Rahul V. Kulkarni

Main category: cs.LG

TL;DR: The paper proposes BSRS, a bootstrapped reward shaping method in reinforcement learning, using the agent’s state-value estimate as a potential function to improve training speed in sparse-reward domains.

DetailsMotivation: Sparse rewards in RL require many environment steps for feedback. PBRS offers denser rewards but needs task-specific potential functions, which can hinder performance.

Method: Introduces BSRS, where the agent’s state-value function estimate serves as the potential function for PBRS, avoiding manual design.

Result: Convergence proofs for tabular settings, insights into deep RL dynamics, and improved training speed in Atari games.

Conclusion: BSRS effectively enhances training efficiency without compromising policy optimality, validated in theoretical and practical settings.

Abstract: In reinforcement learning, especially in sparse-reward domains, many environment steps are required to observe reward information. In order to increase the frequency of such observations, “potential-based reward shaping” (PBRS) has been proposed as a method of providing a more dense reward signal while leaving the optimal policy invariant. However, the required “potential function” must be carefully designed with task-dependent knowledge to not deter training performance. In this work, we propose a “bootstrapped” method of reward shaping, termed BSRS, in which the agent’s current estimate of the state-value function acts as the potential function for PBRS. We provide convergence proofs for the tabular setting, give insights into training dynamics for deep RL, and show that the proposed method improves training speed in the Atari suite.

[299] Studying Cross-cluster Modularity in Neural Networks

Satvik Golechha, Maheep Chaudhary, Joan Velja, Alessandro Abate, Nandi Schoots

Main category: cs.LG

TL;DR: The paper proposes improving neural network interpretability by clustering models into disjoint groups, using a “clusterability loss” to encourage modularity. Results show smaller circuits but no increased task specialization.

DetailsMotivation: To enhance neural network interpretability by making models more modular and easier to analyze independently.

Method: Defines a clusterability measure, applies spectral graph clustering, and trains models with a clusterability loss to form non-interacting clusters.

Result: Clustered models form smaller circuits but do not show increased task specialization. Tested on CNNs, transformers, and large models like GPT-2 and Gemma.

Conclusion: Clustered models offer interpretability benefits through modularity but do not specialize more for tasks.

Abstract: An approach to improve neural network interpretability is via clusterability, i.e., splitting a model into disjoint clusters that can be studied independently. We define a measure for clusterability and show that pre-trained models form highly enmeshed clusters via spectral graph clustering. We thus train models to be more modular using a “clusterability loss” function that encourages the formation of non-interacting clusters. We then investigate the emerging properties of these highly clustered models. We find our trained clustered models do not exhibit more task specialization, but do form smaller circuits. We investigate CNNs trained on MNIST and CIFAR, small transformers trained on modular addition, and GPT-2 and Pythia on the Wiki dataset, and Gemma on a Chemistry dataset. This investigation shows what to expect from clustered models.

[300] PIPA: Preference Alignment as Prior-Informed Statistical Estimation

Junbo Li, Zhangyang Wang, Qiang Liu

Main category: cs.LG

TL;DR: PIPA is a unified, RL-free framework for offline preference alignment in language models, accommodating various data types and outperforming existing methods by 3-10% on benchmarks.

DetailsMotivation: Existing offline preference alignment methods lack a unified understanding and are limited in handling diverse data settings.

Method: PIPA formulates preference alignment as an MLE problem with prior constraints, accommodating paired/unpaired data and annotations. It generalizes DPO and KTO as special cases.

Result: PIPA variants (PIPA-M, PIPA-N) improve performance by 3-10% on GSM8K and MATH benchmarks without extra training or computational costs.

Conclusion: PIPA provides a versatile and efficient solution for offline preference alignment, outperforming existing methods while maintaining simplicity.

Abstract: Offline preference alignment for language models such as Direct Preference Optimization (DPO) is favored for its effectiveness and simplicity, eliminating the need for costly reinforcement learning. Various offline algorithms have been developed for different data settings, yet they lack a unified understanding. In this study, we introduce Pior-Informed Preference Alignment (PIPA), a unified, RL-free probabilistic framework that formulates language model preference alignment as a Maximum Likelihood Estimation (MLE) problem with prior constraints. This method effectively accommodates both paired and unpaired data, as well as answer and step-level annotations. We illustrate that DPO and KTO are special cases with different prior constraints within our framework. By integrating different types of prior information, we developed two variations of PIPA: PIPA-M and PIPA-N. Both algorithms demonstrate a $3\sim10%$ performance enhancement on the GSM8K and MATH benchmarks across all configurations, achieving these gains without additional training or computational costs compared to existing algorithms.

[301] Bounded KRnet and its applications to density estimation and approximation

Li Zeng, Xiaoliang Wan, Tao Zhou

Main category: cs.LG

TL;DR: B-KRnet is an invertible mapping for density estimation on bounded domains, extending KRnet with exact invertibility on hypercubes, and is applied to PDE solutions and data.

DetailsMotivation: To address density estimation and approximation for bounded domains, particularly for PDE solutions like Fokker-Planck and Keller-Segel equations, where existing methods like KRnet are limited to unbounded spaces.

Method: B-KRnet uses coupling layers with progressively fewer active dimensions, maintaining invertibility on hypercubes. It combines with KRnet for mixed bounded/unbounded domains and is applied to adaptive learning for PDE solutions.

Result: Demonstrated effectiveness through numerical experiments, showing accurate density estimation and PDE solution approximation.

Conclusion: B-KRnet successfully extends KRnet’s capabilities to bounded domains, providing a flexible tool for density estimation and PDE solutions.

Abstract: In this paper, we develop an invertible mapping, called B-KRnet, on a bounded domain and apply it to density estimation/approximation for data or the solutions of PDEs such as the Fokker-Planck equation and the Keller-Segel equation. Similar to KRnet, B-KRnet consists of a series of coupling layers with progressively fewer active transformation dimensions, inspired by the triangular structure of the Knothe-Rosenblatt (KR) rearrangement. The main difference between B-KRnet and KRnet is that B-KRnet is defined on a hypercube while KRnet is defined on the whole space, in other words, a new mechanism is introduced in B-KRnet to maintain the exact invertibility. Using B-KRnet as a transport map, we obtain an explicit probability density function (PDF) model that corresponds to the pushforward of a base (uniform) distribution on the hypercube. It can be directly applied to density estimation when only data are available. By coupling KRnet and B-KRnet, we define a deep generative model on a high-dimensional domain where some dimensions are bounded and other dimensions are unbounded. A typical case is the solution of the stationary kinetic Fokker-Planck equation, which is a PDF of position and momentum. Based on B-KRnet, we develop an adaptive learning approach to approximate partial differential equations whose solutions are PDFs or can be treated as PDFs. A variety of numerical experiments is presented to demonstrate the effectiveness of B-KRnet.

[302] Analyze Feature Flow to Enhance Interpretation and Steering in Language Models

Daniil Laptev, Nikita Balagansky, Yaroslav Aksenov, Daniil Gavrilov

Main category: cs.LG

TL;DR: A new method maps features in large language models across layers using cosine similarity, enabling interpretability and control over model behavior.

DetailsMotivation: To understand and manipulate how features evolve across layers in large language models for better interpretability and control.

Method: Uses data-free cosine similarity to trace feature persistence, transformation, or emergence across layers, creating granular flow graphs.

Result: Enables fine-grained interpretability and direct steering of model behavior by amplifying or suppressing features.

Conclusion: The framework clarifies feature development and offers transparent manipulation of large language models.

Abstract: We introduce a new approach to systematically map features discovered by sparse autoencoder across consecutive layers of large language models, extending earlier work that examined inter-layer feature links. By using a data-free cosine similarity technique, we trace how specific features persist, transform, or first appear at each stage. This method yields granular flow graphs of feature evolution, enabling fine-grained interpretability and mechanistic insights into model computations. Crucially, we demonstrate how these cross-layer feature maps facilitate direct steering of model behavior by amplifying or suppressing chosen features, achieving targeted thematic control in text generation. Together, our findings highlight the utility of a causal, cross-layer interpretability framework that not only clarifies how features develop through forward passes but also provides new means for transparent manipulation of large language models.

[303] Distillation Scaling Laws

Dan Busbridge, Amitis Shidani, Floris Weers, Jason Ramapuram, Etai Littwin, Russ Webb

Main category: cs.LG

TL;DR: A distillation scaling law predicts distilled model performance based on compute budget and allocation, optimizing student-teacher compute for maximum efficiency.

DetailsMotivation: To mitigate risks in large-scale distillation by providing compute-optimal strategies for teacher-student allocation.

Method: Proposes a distillation scaling law and evaluates compute-optimal recipes for scenarios with existing or new teachers.

Result: Distillation outperforms supervised learning in multi-student or existing-teacher cases, but supervised learning is better for single-student, new-teacher scenarios.

Conclusion: The study enhances understanding of distillation and guides experimental design with predictable scaling laws.

Abstract: We propose a distillation scaling law that estimates distilled model performance based on a compute budget and its allocation between the student and teacher. Our findings mitigate the risks associated with large-scale distillation by enabling compute-optimal allocation for both the teacher and student to maximize student performance. We provide compute-optimal distillation recipes for two key scenarios: when a teacher already exists, and when a teacher needs training. In settings involving many students or an existing teacher, distillation outperforms supervised learning up to a compute level that scales predictably with student size. Conversely, if only one student is to be distilled and a teacher also requires training, supervised learning is generally preferable. Additionally, our large-scale study of distillation increases our understanding of the process and helps inform experimental design.

[304] Disentangled Latent Spaces Facilitate Data-Driven Auxiliary Learning

Geri Skenderi, Luigi Capogrosso, Andrea Toaiari, Matteo Denitto, Franco Fummi, Simone Melzi

Main category: cs.LG

TL;DR: Detaux proposes a framework to automatically discover auxiliary tasks for Multi-Task Learning (MTL) using weakly supervised disentanglement, improving generalization without hand-crafted solutions.

DetailsMotivation: Auxiliary tasks enhance learning in data-scarce or complex scenarios, but finding optimal tasks is challenging. Detaux aims to automate this process.

Method: Uses disentanglement to isolate task-related variations and create orthogonal subspaces. Auxiliary tasks are generated via clustering on the most disentangled subspace.

Result: Experiments on synthetic and real data show promising results, linking disentangled representations to MTL effectively.

Conclusion: Detaux successfully automates auxiliary task discovery, demonstrating a novel connection between disentanglement and MTL.

Abstract: Auxiliary tasks facilitate learning in situations where data is scarce or the principal task of interest is extremely complex. This idea is primarily inspired by the improved generalization capability induced by solving multiple tasks simultaneously, which leads to a more robust shared representation. Nevertheless, finding optimal auxiliary tasks is a crucial problem that often requires hand-crafted solutions or expensive meta-learning approaches. In this paper, we propose a novel framework, dubbed Detaux, whereby a weakly supervised disentanglement procedure is used to discover a new unrelated auxiliary classification task, which allows us to go from a Single-Task Learning (STL) to a Multi-Task Learning (MTL) problem. The disentanglement procedure works at the representation level, isolating the variation related to the principal task into an isolated subspace and additionally producing an arbitrary number of orthogonal subspaces, each of which encourages high separability among projections. We generate the auxiliary classification task through a clustering procedure on the most disentangled subspace, obtaining a discrete set of labels. Subsequently, the original data, the labels associated with the principal task, and the newly discovered ones can be fed into any MTL framework. Experimental validation on both synthetic and real data, along with various ablation studies, demonstrates promising results, revealing the potential in what has been, so far, an unexplored connection between learning disentangled representations and MTL. The source code is available at https://github.com/intelligolabs/Detaux.

[305] Scalpel vs. Hammer: GRPO Amplifies Existing Capabilities, SFT Replaces Them

Neel Rajani, Aryo Pradipta Gema, Seraphina Goldfarb-Tarrant, Ivan Titov

Main category: cs.LG

TL;DR: The paper compares RL and SFT for training LLMs on maths problems, finding RL offers minor in-domain gains while SFT shows more pronounced effects, including out-of-domain degradation. Freezing model parts yielded inconclusive results.

DetailsMotivation: To understand the training dynamics of RL and SFT in LLMs for reasoning tasks, particularly their impact on in-domain and out-of-domain performance.

Method: Comparative analysis of RL and SFT on the same maths problems using the same model and similar hyperparameters, including parameter updates and freezing experiments.

Result: RL yields minor in-domain gains but slight degradation on knowledge benchmarks, while SFT shows more pronounced trends. Freezing parts of the model had mixed results.

Conclusion: RL amplifies existing capabilities, whereas SFT replaces old skills with new ones, with freezing experiments providing inconclusive mitigation for out-of-domain degradation.

Abstract: Training large language models (LLMs) for reasoning via maths and code datasets has become a major new focus in LLM post-training. Two particularly popular approaches are reinforcement learning (RL) and supervised fine-tuning (SFT), but their training dynamics are poorly understood. We present a comparative analysis of RL and SFT on the same maths problems with the same model and similar hyperparameters. We find that RL yields minor in-domain gains on maths and slight degradation on knowledge-intensive benchmarks like MMLU, while both trends are more pronounced in SFT. We also analyse model parameters across checkpoints, observing that both algorithms modify query and key weights the most. Meanwhile, SFT exhibits greater updates and also affects mid-layer MLPs more, leading us to hypothesise that this may have caused the out-of-domain degradation. We therefore investigate whether freezing parts of the model during training can mitigate the reduced performance on knowledge-intensive benchmarks. However, our results are inconclusive, with benefits on GPQA:Diamond and degradation on other benchmarks. Taken together, our observations provide a preliminary indication for why RL amplifies existing capabilities, while SFT replaces old skills with new ones.

[306] Agreement-Based Cascading for Efficient Inference

Steven Kolawole, Don Dennis, Ameet Talwalkar, Virginia Smith

Main category: cs.LG

TL;DR: ABC is a cost-effective adaptive inference method using model cascades and ensemble agreement for routing, improving efficiency and accuracy over single models.

DetailsMotivation: To reduce machine learning inference costs by dynamically assigning smaller models to easier examples and avoiding larger models when possible.

Method: Uses Agreement-Based Cascading (ABC), building cascades of models of increasing size/complexity and routing based on ensemble agreement.

Result: ABC reduces communication costs (14x), rental costs (3x), and API costs (2-25x) while surpassing single-model accuracy.

Conclusion: ABC is a reliable, efficient drop-in replacement for existing models, outperforming in both cost and accuracy.

Abstract: Adaptive inference schemes reduce the cost of machine learning inference by assigning smaller models to easier examples, attempting to avoid invocation of larger models when possible. In this work we explore a simple, effective adaptive inference technique we term Agreement-Based Cascading (ABC). ABC builds a cascade of models of increasing size/complexity, and uses agreement between ensembles of models at each level of the cascade as a basis for data-dependent routing. Although ensemble execution introduces additional expense, we show that these costs can be easily offset in practice due to large expected differences in model sizes, parallel inference execution capabilities, and accuracy benefits of ensembling. We examine ABC theoretically and empirically in terms of these parameters, showing that the approach can reliably act as a drop-in replacement for existing models and surpass the best single model it aims to replace in terms of both efficiency and accuracy. Additionally, we explore the performance of ABC relative to existing cascading methods in three common scenarios: (1) edge-to-cloud inference, where ABC reduces communication costs by up to 14x; (2) cloud-based model serving, where it achieves a 3x reduction in rental costs; and (3) inference via model API services, where ABC achieves a 2-25x reduction in average price per token/request relative to state-of-the-art LLM cascades.

[307] Integrating Physics and Topology in Neural Networks for Learning Rigid Body Dynamics

Amaury Wei, Olga Fink

Main category: cs.LG

TL;DR: A novel framework for modeling rigid body dynamics using higher-order topology complexes and physics-informed neural networks, improving accuracy and generalization in complex scenarios.

DetailsMotivation: Rigid body interactions are challenging to simulate due to nonlinearity and environmental sensitivity, requiring adaptable learning-based methods beyond traditional models.

Method: Extends mesh representation with higher-order topology complexes and introduces a physics-informed message-passing neural network to embed physical laws directly.

Result: Demonstrates superior accuracy in long-term predictions and strong generalization to unseen scenarios, addressing multi-entity dynamic interactions.

Conclusion: The framework advances rigid body simulation, with broad applications in scientific and engineering fields.

Abstract: Rigid body interactions are fundamental to numerous scientific disciplines, but remain challenging to simulate due to their abrupt nonlinear nature and sensitivity to complex, often unknown environmental factors. These challenges call for adaptable learning-based methods capable of capturing complex interactions beyond explicit physical models and simulations. While graph neural networks can handle simple scenarios, they struggle with complex scenes and long-term predictions. We introduce a novel framework for modeling rigid body dynamics and learning collision interactions, addressing key limitations of existing graph-based methods. Our approach extends the traditional representation of meshes by incorporating higher-order topology complexes, offering a physically consistent representation. Additionally, we propose a physics-informed message-passing neural architecture, embedding physical laws directly in the model. Our method demonstrates superior accuracy, even during long rollouts, and exhibits strong generalization to unseen scenarios. Importantly, this work addresses the challenge of multi-entity dynamic interactions, with applications spanning diverse scientific and engineering domains.

[308] Large Language Models as Attribution Regularizers for Efficient Model Training

Davor Vukadin, Marin Šilić, Goran Delač

Main category: cs.LG

TL;DR: A method to integrate LLM-generated feature attributions into smaller models via attribution-matching regularization, improving performance in few-shot learning and addressing data issues like skewness and bias.

DetailsMotivation: Leveraging LLM knowledge for training smaller models is challenging, especially in domains like tabular data where simplicity is preferred.

Method: Proposes an attribution-matching regularization term to align smaller models with LLM insights, requiring only black-box API access.

Result: Superior performance in few-shot learning, improved generalization, and robustness, validated through extensive experiments.

Conclusion: The method effectively integrates LLM knowledge into smaller models, enhancing efficiency and addressing real-world data challenges.

Abstract: Large Language Models (LLMs) have demonstrated remarkable performance across diverse domains. However, effectively leveraging their vast knowledge for training smaller downstream models remains an open challenge, especially in domains like tabular data learning, where simpler models are often preferred due to interpretability and efficiency. In this paper, we introduce a novel yet straightforward method for incorporating LLM-generated global task feature attributions into the training process of smaller networks. Specifically, we propose an attribution-matching regularization term that aligns the training dynamics of the smaller model with the insights provided by the LLM. By doing so, our approach yields superior performance in few-shot learning scenarios. Notably, our method requires only black-box API access to the LLM, making it easy to integrate into existing training pipelines with minimal computational overhead. Furthermore, we demonstrate how this method can be used to address common issues in real-world datasets, such as skewness and bias. By integrating high-level knowledge from LLMs, our approach improves generalization, even when training data is limited or imbalanced. We validate its effectiveness through extensive experiments across multiple tasks, demonstrating improved learning efficiency and model robustness.

[309] Value-Based Deep RL Scales Predictably

Oleh Rybkin, Michal Nauman, Preston Fu, Charlie Snell, Pieter Abbeel, Sergey Levine, Aviral Kumar

Main category: cs.LG

TL;DR: The paper demonstrates that value-based off-policy RL methods are predictable, enabling performance estimation from small-scale runs. It identifies a Pareto frontier for data and compute requirements, optimizes resource allocation, and validates the approach across algorithms and environments.

DetailsMotivation: To address the unpredictability of scaling in modern ML, particularly in RL, by proving that value-based off-policy methods can be predictable despite common beliefs.

Method: Establishes a Pareto frontier for data and compute, optimizes resource allocation, and manages hyperparameters to mitigate overfitting and plasticity loss in RL.

Result: Validated predictability and scaling efficiency across SAC, BRO, and PQL algorithms on DeepMind Control, OpenAI gym, and IsaacGym.

Conclusion: The study successfully predicts RL performance scaling, optimizing resource use and challenging traditional assumptions about RL unpredictability.

Abstract: Scaling data and compute is critical to the success of modern ML. However, scaling demands predictability: we want methods to not only perform well with more compute or data, but also have their performance be predictable from small-scale runs, without running the large-scale experiment. In this paper, we show that value-based off-policy RL methods are predictable despite community lore regarding their pathological behavior. First, we show that data and compute requirements to attain a given performance level lie on a Pareto frontier, controlled by the updates-to-data (UTD) ratio. By estimating this frontier, we can predict this data requirement when given more compute, and this compute requirement when given more data. Second, we determine the optimal allocation of a total resource budget across data and compute for a given performance and use it to determine hyperparameters that maximize performance for a given budget. Third, this scaling is enabled by first estimating predictable relationships between hyperparameters, which is used to manage effects of overfitting and plasticity loss unique to RL. We validate our approach using three algorithms: SAC, BRO, and PQL on DeepMind Control, OpenAI gym, and IsaacGym, when extrapolating to higher levels of data, compute, budget, or performance.

[310] Generating Clinically Realistic EHR Data via a Hierarchy- and Semantics-Guided Transformer

Guanglin Zhou, Sebastiano Barbieri

Main category: cs.LG

TL;DR: HiSGT is a novel framework for generating synthetic EHRs by leveraging hierarchical and semantic information, improving clinical fidelity and utility.

DetailsMotivation: Existing methods treat EHRs as flat sequences, ignoring hierarchical and semantic context, leading to low clinical fidelity.

Method: HiSGT uses a hierarchical graph and graph neural network for hierarchy-aware embeddings, fused with semantic embeddings from ClinicalBERT, to guide a Transformer-based generator.

Result: HiSGT improves statistical alignment of synthetic data with real EHRs and enhances downstream tasks like chronic disease classification.

Conclusion: HiSGT advances clinically high-fidelity synthetic data generation, offering applications in data augmentation and privacy-preserving analytics.

Abstract: Generating realistic synthetic electronic health records (EHRs) holds tremendous promise for accelerating healthcare research, facilitating AI model development and enhancing patient privacy. However, existing generative methods typically treat EHRs as flat sequences of discrete medical codes. This approach overlooks two critical aspects: the inherent hierarchical organization of clinical coding systems and the rich semantic context provided by code descriptions. Consequently, synthetic patient sequences often lack high clinical fidelity and have limited utility in downstream clinical tasks. In this paper, we propose the Hierarchy- and Semantics-Guided Transformer (HiSGT), a novel framework that leverages both hierarchical and semantic information for the generative process. HiSGT constructs a hierarchical graph to encode parent-child and sibling relationships among clinical codes and employs a graph neural network to derive hierarchy-aware embeddings. These are then fused with semantic embeddings extracted from a pre-trained clinical language model (e.g., ClinicalBERT), enabling the Transformer-based generator to more accurately model the nuanced clinical patterns inherent in real EHRs. Extensive experiments on the MIMIC-III and MIMIC-IV datasets demonstrate that HiSGT significantly improves the statistical alignment of synthetic data with real patient records, as well as supports robust downstream applications such as chronic disease classification. By addressing the limitations of conventional raw code-based generative models, HiSGT represents a significant step toward clinically high-fidelity synthetic data generation and a general paradigm suitable for interpretable medical code representation, offering valuable applications in data augmentation and privacy-preserving healthcare analytics.

[311] Accelerometry-based Energy Expenditure Estimation During Activities of Daily Living: A Comparison Among Different Accelerometer Compositions

Shuhao Que, Remco Poelarends, Peter Veltink, Miriam Vollenbroek-Hutten, Ying Wang

Main category: cs.LG

TL;DR: The study compared COM-based and wrist-based accelerometer settings for predicting PAEE, finding COM-based methods superior, with no significant difference between wrists.

DetailsMotivation: To evaluate and compare the accuracy of PAEE predictions from COM-based and wrist-based accelerometer settings using respiratory data as a reference.

Method: Used two PAEE estimation methods (LR and CNN-LSTM) on data from 9 participants wearing 5 accelerometers during daily activities.

Result: COM-based 3-acc setting performed best (LR: R²=0.41, CNN-LSTM: R²=0.53), significantly outperforming wrist-based settings (R²≈0). No difference between wrists.

Conclusion: COM-based accelerometer settings are more reliable for PAEE prediction than wrist-based ones, with no advantage in using multiple COM sensors over a single pelvis sensor.

Abstract: Physical activity energy expenditure (PAEE) can be measured from breath-by-breath respiratory data, which can serve as a reference. Alternatively, PAEE can be predicted from the body movements, which can be measured and estimated with accelerometers. The body center of mass (COM) acceleration reflects the movements of the whole body and thus serves as a good predictor for PAEE. However, the wrist has also become a popular location due to recent advancements in wrist-worn devices. Therefore, in this work, using the respiratory data measured by COSMED K5 as the reference, we evaluated and compared the performances of COM-based settings and wrist-based settings. The COM-based settings include two different accelerometer compositions, using only the pelvis accelerometer (pelvis-acc) and the pelvis accelerometer with two accelerometers from two thighs (3-acc). The wrist-based settings include using only the left wrist accelerometer (l-wrist-acc) and only the right wrist accelerometer (r-wrist-acc). We implemented two existing PAEE estimation methods on our collected dataset, where 9 participants performed activities of daily living while wearing 5 accelerometers (i.e., pelvis, two thighs, and two wrists). These two methods include a linear regression (LR) model and a CNN-LSTM model. Both models yielded the best results with the COM-based 3-acc setting (LR: $R^2$ = 0.41, CNN-LSTM: $R^2$ = 0.53). No significant difference was found between the 3-acc and pelvis-acc settings (p-value = 0.278). For both models, neither the l-wrist-acc nor the r-wrist-acc settings demonstrated predictive power on PAEE with $R^2$ values close to 0, significantly outperformed by the two COM-based settings (p-values $<$ 0.05). No significant difference was found between the two wrists (p-value = 0.329).

[312] Towards Improving Reward Design in RL: A Reward Alignment Metric for RL Practitioners

Calarina Muslimani, Kerrick Johnstonbaugh, Suyog Chandramouli, Serena Booth, W. Bradley Knox, Matthew E. Taylor

Main category: cs.LG

TL;DR: The paper introduces the Trajectory Alignment Coefficient to evaluate reward alignment in reinforcement learning, showing its effectiveness in improving reward selection and policy performance.

DetailsMotivation: Reward design in reinforcement learning is challenging, and evaluating reward correctness is often overlooked. The paper addresses this by focusing on reward alignment with human preferences.

Method: The authors propose the Trajectory Alignment Coefficient to measure similarity between human stakeholder rankings and reward-induced trajectory distributions.

Result: The coefficient improves reward selection, reduces cognitive workload by 1.5x, is preferred by 82% of users, and increases success rate by 41%.

Conclusion: The Trajectory Alignment Coefficient is a practical tool for assessing reward alignment, enhancing reward design in reinforcement learning.

Abstract: Reinforcement learning agents are fundamentally limited by the quality of the reward functions they learn from, yet reward design is often overlooked under the assumption that a well-defined reward is readily available. However, in practice, designing rewards is difficult, and even when specified, evaluating their correctness is equally problematic: how do we know if a reward function is correctly specified? In our work, we address these challenges by focusing on reward alignment – assessing whether a reward function accurately encodes the preferences of a human stakeholder. As a concrete measure of reward alignment, we introduce the Trajectory Alignment Coefficient to quantify the similarity between a human stakeholder’s ranking of trajectory distributions and those induced by a given reward function. We show that the Trajectory Alignment Coefficient exhibits desirable properties, such as not requiring access to a ground truth reward, invariance to potential-based reward shaping, and applicability to online RL. Additionally, in an 11 – person user study of RL practitioners, we found that access to the Trajectory Alignment Coefficient during reward selection led to statistically significant improvements. Compared to relying only on reward functions, our metric reduced cognitive workload by 1.5x, was preferred by 82% of users and increased the success rate of selecting reward functions that produced performant policies by 41%.

[313] LeanKAN: A Parameter-Lean Kolmogorov-Arnold Network Layer with Improved Memory Efficiency and Convergence Behavior

Benjamin C. Koenig, Suyong Kim, Sili Deng

Main category: cs.LG

TL;DR: LeanKANs are introduced as a modular replacement for MultKAN and AddKAN layers, addressing drawbacks like limited output layer applicability, bulky parameterizations, and complex hyperparameters. LeanKANs improve efficiency and performance in KAN-based tasks.

DetailsMotivation: MultKAN layers, while combining addition and multiplication, have limitations such as bulky parameterizations and complex hyperparameters. This motivates the development of LeanKANs for better efficiency and applicability.

Method: LeanKANs are proposed as a direct replacement for MultKAN and AddKAN layers, featuring general output layer applicability, reduced parameter counts, and fewer hyperparameters.

Result: LeanKANs outperform MultKANs in tasks like KAN-ODEs and DeepOKANs, demonstrating higher expressivity and learning capability with a sparser structure.

Conclusion: LeanKANs offer a simpler, more efficient alternative to MultKAN and AddKAN layers, enhancing performance in KAN-based modeling tasks.

Abstract: The recently proposed Kolmogorov-Arnold network (KAN) is a promising alternative to multi-layer perceptrons (MLPs) for data-driven modeling. While original KAN layers were only capable of representing the addition operator, the recently-proposed MultKAN layer combines addition and multiplication subnodes in an effort to improve representation performance. Here, we find that MultKAN layers suffer from a few key drawbacks including limited applicability in output layers, bulky parameterizations with extraneous activations, and the inclusion of complex hyperparameters. To address these issues, we propose LeanKANs, a direct and modular replacement for MultKAN and traditional AddKAN layers. LeanKANs address these three drawbacks of MultKAN through general applicability as output layers, significantly reduced parameter counts for a given network structure, and a smaller set of hyperparameters. As a one-to-one layer replacement for standard AddKAN and MultKAN layers, LeanKAN is able to provide these benefits to traditional KAN learning problems as well as augmented KAN structures in which it serves as the backbone, such as KAN Ordinary Differential Equations (KAN-ODEs) or Deep Operator KANs (DeepOKAN). We demonstrate LeanKAN’s simplicity and efficiency in a series of demonstrations carried out across a standard KAN toy problem as well as ordinary and partial differential equations learned via KAN-ODEs, where we find that its sparser parameterization and compact structure serve to increase its expressivity and learning capability, leading it to outperform similar and even much larger MultKANs in various tasks.

[314] Fixed-Point RNNs: Interpolating from Diagonal to Dense

Sajad Movahedi, Felix Sarnthein, Nicola Muca Cirone, Antonio Orvieto

Main category: cs.LG

TL;DR: The paper explores dense linear RNNs as fixed-points of parallelizable diagonal linear RNNs, improving expressivity and efficiency for sequence mixing tasks.

DetailsMotivation: Current models lack full state-tracking expressivity due to channel-wise sequence mixing, limiting their potential.

Method: Parameterize dense linear RNNs as fixed-points of parallelizable diagonal linear RNNs, enabling trade-offs between expressivity and efficiency.

Result: Achieves state-of-the-art results on toy tasks like $A_5$, $S_5$, copying, and modular arithmetics.

Conclusion: The proposed method enhances expressivity while maintaining efficiency, advancing sequence mixing capabilities.

Abstract: Linear recurrent neural networks (RNNs) and state-space models (SSMs) such as Mamba have become promising alternatives to softmax-attention as sequence mixing layers in Transformer architectures. Current models, however, do not exhibit the full state-tracking expressivity of RNNs because they rely on channel-wise (i.e. diagonal) sequence mixing. In this paper, we investigate parameterizations of a large class of dense linear RNNs as fixed-points of parallelizable diagonal linear RNNs. The resulting models can naturally trade expressivity for efficiency at a fixed number of parameters and achieve state-of-the-art results on the commonly used toy tasks $A_5$, $S_5$, copying, and modular arithmetics.

[315] Maximum Redundancy Pruning: A Principle-Driven Layerwise Sparsity Allocation for LLMs

Chang Gao, Kang Zhao, Runqi Wang, Jianfei Chen, Liping Jing

Main category: cs.LG

TL;DR: The paper investigates sparsity allocation in pruning large language models (LLMs), proposing Maximum Redundancy Pruning (MRP) based on three principles: non-uniformity, pruning metric dependency, and uniform layerwise redundancy. MRP outperforms prior methods.

DetailsMotivation: Address challenges in deploying LLMs due to their size by improving pruning techniques, avoiding suboptimal sparsity allocation.

Method: Proposed Maximum Redundancy Pruning (MRP), an iterative algorithm pruning the most redundant layers first, guided by three principles.

Result: MRP shows superior performance over existing methods in experiments on models like LLaMA2 and OPT.

Conclusion: MRP effectively addresses sparsity allocation in LLM pruning, aligning with principles for optimal performance.

Abstract: Large language models (LLMs) have demonstrated impressive capabilities, but their enormous size poses significant challenges for deployment in real-world applications. To address this issue, researchers have sought to apply network pruning techniques to LLMs. A critical challenge in pruning is allocation the sparsity for each layer. Recent sparsity allocation methods is often based on heuristics or search that can easily lead to suboptimal performance. In this paper, we conducted an extensive investigation into various LLMs and revealed three significant discoveries: (1) the layerwise pruning sensitivity (LPS) of LLMs is highly non-uniform, (2) the choice of pruning metric affects LPS, and (3) the performance of a sparse model is related to the uniformity of its layerwise redundancy level. Based on these observations, we propose that the layerwise sparsity of LLMs should adhere to three principles: \emph{non-uniformity}, \emph{pruning metric dependency}, and \emph{uniform layerwise redundancy level} in the pruned model. To this end, we proposed Maximum Redundancy Pruning (MRP), an iterative pruning algorithm that prunes in the most redundant layers (\emph{i.e.}, those with the highest non-outlier ratio) at each iteration. The achieved layerwise sparsity aligns with the outlined principles. We conducted extensive experiments on publicly available LLMs, including the LLaMA2 and OPT, across various benchmarks. Experimental results validate the effectiveness of MRP, demonstrating its superiority over previous methods.

[316] Decision by Supervised Learning with Deep Ensembles: A Practical Framework for Robust Portfolio Optimization

Juhyeong Kim, Sungyoon Choi, Youngbin Lee, Yejin Kim, Yongmin Choi, Yongjae Lee

Main category: cs.LG

TL;DR: DSL reframes portfolio optimization as a supervised learning problem, using cross-entropy loss and Deep Ensembles for stability, outperforming traditional and ML-based methods.

DetailsMotivation: To improve robustness and performance in portfolio optimization by leveraging supervised learning and ensemble methods.

Method: Trains models to predict optimal portfolio weights using cross-entropy loss and Sharpe/Sortino ratio maximization, enhanced by Deep Ensembles.

Result: Outperforms traditional and ML-based methods, with larger ensembles improving median returns and stability.

Conclusion: DSL is a robust framework for portfolio optimization, combining supervised learning and ensembles for superior performance.

Abstract: We propose Decision by Supervised Learning (DSL), a practical framework for robust portfolio optimization. DSL reframes portfolio construction as a supervised learning problem: models are trained to predict optimal portfolio weights, using cross-entropy loss and portfolios constructed by maximizing the Sharpe or Sortino ratio. To further enhance stability and reliability, DSL employs Deep Ensemble methods, substantially reducing variance in portfolio allocations. Through comprehensive backtesting across diverse market universes and neural architectures, shows superior performance compared to both traditional strategies and leading machine learning-based methods, including Prediction-Focused Learning and End-to-End Learning. We show that increasing the ensemble size leads to higher median returns and more stable risk-adjusted performance. The code is available at https://github.com/DSLwDE/DSLwDE.

[317] MetaSel: A Test Selection Approach for Fine-tuned DNN Models

Amin Abbasishahkoo, Mahboubeh Dadkhah, Lionel Briand, Dayi Lin

Main category: cs.LG

TL;DR: MetaSel is a novel test selection method for fine-tuned DNNs, leveraging behavioral differences between pre-trained and fine-tuned models to improve test coverage under constrained labeling budgets.

DetailsMotivation: Addressing the challenge of testing fine-tuned DNNs under limited labeled data, as existing methods are ineffective under distribution shifts.

Method: MetaSel uses behavioral differences between fine-tuned and pre-trained models to estimate misclassification probabilities for unlabeled inputs, enabling targeted test selection.

Result: MetaSel outperforms 11 baselines, achieving 28.46% to 56.18% higher Test Relative Coverage (TRC) under constrained budgets.

Conclusion: MetaSel is practical, robust, and cost-effective for test selection in fine-tuned DNNs, especially under distribution shifts.

Abstract: Deep Neural Networks (DNNs) face challenges during deployment due to data distribution shifts. Fine-tuning adapts pre-trained models to new contexts requiring smaller labeled sets. However, testing fine-tuned models under constrained labeling budgets remains a critical challenge. This paper introduces MetaSel, a new approach, tailored for fine-tuned DNN models, to select tests from unlabeled inputs. MetaSel assumes that fine-tuned and pre-trained models share related data distributions and exhibit similar behaviors for many inputs. However, their behaviors diverge within the input subspace where fine-tuning alters decision boundaries, making those inputs more prone to misclassification. Unlike general approaches that rely solely on the DNN model and its input set, MetaSel leverages information from both the fine-tuned and pre-trained models and their behavioral differences to estimate misclassification probability for unlabeled test inputs, enabling more effective test selection. Our extensive empirical evaluation, comparing MetaSel against 11 state-of-the-art approaches and involving 68 fine-tuned models across weak, medium, and strong distribution shifts, demonstrates that MetaSel consistently delivers significant improvements in Test Relative Coverage (TRC) over existing baselines, particularly under highly constrained labeling budgets. MetaSel shows average TRC improvements of 28.46% to 56.18% over the most frequent second-best baselines while maintaining a high TRC median and low variability. Our results confirm MetaSel’s practicality, robustness, and cost-effectiveness for test selection in the context of fine-tuned models.

[318] Perturbation-efficient Zeroth-order Optimization for Hardware-friendly On-device Training

Qitao Tan, Sung-En Chang, Rui Xia, Huidong Ji, Chence Yang, Ci Zhang, Jun Liu, Zheng Zhan, Zhenman Fang, Zhou Zou, Yanzhi Wang, Jin Lu, Geng Yuan

Main category: cs.LG

TL;DR: PeZO addresses the hardware inefficiency of ZO optimization by reducing random number generation needs and replacing Gaussian with uniform distribution, enabling feasible on-device training.

DetailsMotivation: ZO optimization's reliance on Gaussian random numbers is impractical for hardware like FPGAs and ASICs, creating a mismatch between algorithm and hardware design.

Method: PeZO introduces random number reuse and a hardware-friendly adaptive scaling method, replacing Gaussian with uniform distribution.

Result: PeZO reduces LUTs and FFs by 48.6% and 12.7%, saves up to 86% power, and maintains training performance.

Conclusion: PeZO makes ZO optimization feasible for on-device training, pioneering its potential for future research.

Abstract: Zeroth-order (ZO) optimization is an emerging deep neural network (DNN) training paradigm that offers computational simplicity and memory savings. However, this seemingly promising approach faces a significant and long-ignored challenge. ZO requires generating a substantial number of Gaussian random numbers, which poses significant difficulties and even makes it infeasible for hardware platforms, such as FPGAs and ASICs. In this paper, we identify this critical issue, which arises from the mismatch between algorithm and hardware designers. To address this issue, we proposed PeZO, a perturbation-efficient ZO framework. Specifically, we design random number reuse strategies to significantly reduce the demand for random number generation and introduce a hardware-friendly adaptive scaling method to replace the costly Gaussian distribution with a uniform distribution. Our experiments show that PeZO reduces the required LUTs and FFs for random number generation by 48.6% and 12.7%, and saves at maximum 86% power consumption, all without compromising training performance, making ZO optimization feasible for on-device training. To the best of our knowledge, we are the first to explore the potential of on-device ZO optimization, providing valuable insights for future research.

[319] Learnable cut flow for high energy physics

Jing Li, Hao Sun

Main category: cs.LG

TL;DR: LCF merges neural networks and cut flow methods, offering interpretability and performance in high-energy physics tasks.

DetailsMotivation: Combine the interpretability of traditional cut flow methods with the power of neural networks to improve feature selection and boundary optimization.

Method: Proposes Learnable Cut Flow (LCF), a neural network with parallel and sequential cut strategies, and Learnable Importance for feature weighting. Uses differentiable mask operations for training.

Result: LCF learns optimal cut boundaries, handles feature redundancy, and improves performance in real-world datasets (e.g., diboson vs. QCD). Pruning less important features boosts its performance.

Conclusion: LCF successfully bridges interpretability and performance, providing insights into feature importance and training dynamics.

Abstract: Neural networks have emerged as a powerful paradigm for tasks in high energy physics, yet their opaque training process renders them as a black box. In contrast, the traditional cut flow method offers simplicity and interpretability but requires extensive manual tuning to identify optimal cut boundaries. To merge the strengths of both approaches, we propose the Learnable Cut Flow (LCF), a neural network that transforms the traditional cut selection into a fully differentiable, data-driven process. LCF implements two cut strategies-parallel, where observable distributions are treated independently, and sequential, where prior cuts shape subsequent ones-to flexibly determine optimal boundaries. Building on this strategy, we introduce the Learnable Importance, a metric that quantifies feature importance and adjusts their contributions to the loss accordingly, offering model-driven insights unlike ad-hoc metrics. To ensure differentiability, a modified loss function replaces hard cuts with mask operations, preserving data shape throughout the training process. LCF is tested on six varied mock datasets and a realistic diboson vs. QCD dataset. Results demonstrate that LCF 1. accurately learns cut boundaries across typical feature distributions in both parallel and sequential strategies, 2. assigns higher importance to discriminative features with minimal overlap, 3. handles redundant or correlated features robustly, and 4. performs effectively in real-world scenarios. In the diboson dataset, LCF initially underperforms boosted decision trees and multiplayer perceptrons when using all observables. However, pruning less critical features-guided by learned importance-boosts its performance to match or exceed these baselines. LCF bridges the gap between traditional cut flow method and modern black-box neural networks, delivering actionable insights into the training process and feature importance.

[320] Deep Learning for Double Auction

Jiayin Liu, Chenglong Zhang

Main category: cs.LG

TL;DR: The paper introduces deep learning methods for double auctions, addressing challenges like imperfect information, IC, and IR constraints, and improves upon prior single-auction models by enhancing generalizability, efficiency, and learning stability.

DetailsMotivation: Traditional and recent deep learning methods for auctions focus on single auctions, leaving double auctions—with imperfect information on both supply and demand sides—understudied. Existing methods also suffer from generalizability, efficiency, and learning fluctuation issues.

Method: The authors design transformer-based models for double auctions, treating participants as sequences for varying market sizes. They pre-treat constraint features for efficiency and introduce a gradient-conflict-elimination scheme to stabilize learning.

Result: Experimental results show the proposed method outperforms classical and machine learning baselines.

Conclusion: The study successfully addresses the complexities of double auctions and overcomes limitations of prior models, offering a scalable and efficient solution.

Abstract: Auctions are important mechanisms extensively implemented in various markets, e.g., search engines’ keyword auctions, antique auctions, etc. Finding an optimal auction mechanism is extremely difficult due to the constraints of imperfect information, incentive compatibility (IC), and individual rationality (IR). In addition to the traditional economic methods, some recently attempted to find the optimal (single) auction using deep learning methods. Unlike those attempts focusing on single auctions, we develop deep learning methods for double auctions, where imperfect information exists on both the demand and supply sides. The previous attempts on single auction cannot directly apply to our contexts and those attempts additionally suffer from limited generalizability, inefficiency in ensuring the constraints, and learning fluctuations. We innovate in designing deep learning models for solving the more complex problem and additionally addressing the previous models’ three limitations. Specifically, we achieve generalizability by leveraging a transformer-based architecture to model market participants as sequences for varying market sizes; we utilize the numerical features of the constraints and pre-treat them for a higher learning efficiency; we develop a gradient-conflict-elimination scheme to address the problem of learning fluctuation. Extensive experimental evaluations demonstrate the superiority of our approach to classical and machine learning baselines.

[321] Bridging Quantum and Classical Computing in Drug Design: Architecture Principles for Improved Molecule Generation

Andrew Smith, Erhan Guven

Main category: cs.LG

TL;DR: Optimized hybrid quantum-classical GAN (BO-QGAN) improves drug discovery performance, outperforming benchmarks with fewer parameters.

DetailsMotivation: To leverage NISQ devices for drug discovery by optimizing hybrid quantum-classical GAN architectures.

Method: Multi-objective Bayesian optimization to systematically optimize GAN architectures for molecule discovery.

Result: BO-QGAN achieves 2.27x higher DCS than quantum-hybrid benchmarks and 2.21x higher than classical baseline, with 60% fewer parameters.

Conclusion: Layering multiple shallow quantum circuits is optimal; provides first empirical guidelines for hybrid models in drug discovery.

Abstract: Hybrid quantum-classical machine learning offers a path to leverage noisy intermediate-scale quantum (NISQ) devices for drug discovery, but optimal model architectures remain unclear. We systematically optimize the quantum-classical bridge architecture of generative adversarial networks (GANs) for molecule discovery using multi-objective Bayesian optimization. Our optimized model (BO-QGAN) significantly improves performance, achieving a 2.27-fold higher Drug Candidate Score (DCS) than prior quantum-hybrid benchmarks and 2.21-fold higher than the classical baseline, while reducing parameter count by more than 60%. Key findings favor layering multiple (3-4) shallow (4-8 qubit) quantum circuits sequentially, while classical architecture shows less sensitivity above a minimum capacity. This work provides the first empirically-grounded architectural guidelines for hybrid models, enabling more effective integration of current quantum computers into pharmaceutical research pipelines.

[322] Less is More: Adaptive Coverage for Synthetic Training Data

Sasan Tavakkol, Max Springer, Mohammadhossein Bateni, Neslihan Bulut, Vincent Cohen-Addad, MohammadTaghi Hajiaghayi

Main category: cs.LG

TL;DR: Using LLMs like Gemma and GPT to generate synthetic training data, this study introduces a sampling algorithm for selecting a representative subset, improving classifier performance with less data.

DetailsMotivation: Addresses the challenge of obtaining large labeled datasets quickly, especially for emerging trends or new forms of online abuse.

Method: Introduces a novel sampling algorithm based on the maximum coverage problem to select a representative subset from synthetic data.

Result: Training on the sampled subset outperforms using the entire dataset, improving accuracy and reducing data volume.

Conclusion: A ’less is more’ approach with synthetic data sampling enhances efficiency and performance in classifier training.

Abstract: Synthetic training data generation with Large Language Models (LLMs) like Google’s Gemma and OpenAI’s GPT offer a promising solution to the challenge of obtaining large, labeled datasets for training classifiers. When rapid model deployment is critical, such as in classifying emerging social media trends or combating new forms of online abuse tied to current events, the ability to generate training data is invaluable. While prior research has examined the comparability of synthetic data to human-labeled data, this study introduces a novel sampling algorithm, based on the maximum coverage problem, to select a representative subset from a synthetically generated dataset. Our results demonstrate that training a classifier on this contextually sampled subset achieves superior performance compared to training on the entire dataset. This “less is more” approach not only improves model accuracy but also reduces the volume of data required, leading to potentially more efficient model fine-tuning.

[323] Pilot Contamination-Aware Graph Attention Network for Power Control in CFmMIMO

Tingting Zhang, Sergiy A. Vorobyov, David J. Love, Taejoon Kim, Kai Dong

Main category: cs.LG

TL;DR: A self-supervised graph attention network is proposed for downlink power control in CFmMIMO systems, addressing pilot contamination and dynamic UE numbers, outperforming traditional methods.

DetailsMotivation: Existing GNN-based power control methods assume ideal conditions (orthogonal pilots, fixed UE count) and require costly supervised training, making them impractical for real-world CFmMIMO systems.

Method: A graph attention network is developed to operate self-supervised, handling pilot contamination and adapting to varying UE numbers dynamically.

Result: The proposed method shows effectiveness, outperforming the optimal accelerated projected gradient method in experiments.

Conclusion: The self-supervised GNN approach offers a practical and efficient solution for power control in CFmMIMO systems, overcoming limitations of existing methods.

Abstract: Optimization-based power control algorithms are predominantly iterative with high computational complexity, making them impractical for real-time applications in cell-free massive multiple-input multiple-output (CFmMIMO) systems. Learning-based methods have emerged as a promising alternative, and among them, graph neural networks (GNNs) have demonstrated their excellent performance in solving power control problems. However, all existing GNN-based approaches assume ideal orthogonality among pilot sequences for user equipments (UEs), which is unrealistic given that the number of UEs exceeds the available orthogonal pilot sequences in CFmMIMO schemes. Moreover, most learning-based methods assume a fixed number of UEs, whereas the number of active UEs varies over time in practice. Additionally, supervised training necessitates costly computational resources for computing the target power control solutions for a large volume of training samples. To address these issues, we propose a graph attention network for downlink power control in CFmMIMO systems that operates in a self-supervised manner while effectively handling pilot contamination and adapting to a dynamic number of UEs. Experimental results show its effectiveness, even in comparison to the optimal accelerated projected gradient method as a baseline.

[324] Fair Algorithms with Probing for Multi-Agent Multi-Armed Bandits

Tianyi Xu, Jiaxin Liu, Nicholas Mattei, Zizhan Zheng

Main category: cs.LG

TL;DR: A multi-agent multi-armed bandit (MA-MAB) framework ensures fair outcomes and maximizes performance using strategic probing and submodular properties for offline and online settings.

DetailsMotivation: To address the challenge of fair decision-making under limited information about arm rewards in multi-agent systems.

Method: Introduces a probing framework for strategic information gathering, leveraging submodular properties for offline settings and developing an online algorithm with sublinear regret.

Result: Outperforms baseline methods in fairness and efficiency on synthetic and real-world datasets.

Conclusion: The proposed framework effectively balances fairness and performance in MA-MAB problems.

Abstract: We propose a multi-agent multi-armed bandit (MA-MAB) framework aimed at ensuring fair outcomes across agents while maximizing overall system performance. A key challenge in this setting is decision-making under limited information about arm rewards. To address this, we introduce a novel probing framework that strategically gathers information about selected arms before allocation. In the offline setting, where reward distributions are known, we leverage submodular properties to design a greedy probing algorithm with a provable performance bound. For the more complex online setting, we develop an algorithm that achieves sublinear regret while maintaining fairness. Extensive experiments on synthetic and real-world datasets show that our approach outperforms baseline methods, achieving better fairness and efficiency.

[325] Learning Causally Predictable Outcomes from Psychiatric Longitudinal Data

Eric V. Strobl

Main category: cs.LG

TL;DR: DEBIAS optimizes outcome definitions to maximize causal identifiability in psychiatric longitudinal data, outperforming existing methods.

DetailsMotivation: Addressing challenges in causal inference due to symptom heterogeneity and latent confounding in psychiatry.

Method: DEBIAS learns non-negative, interpretable weights for outcome aggregation to maximize durable treatment effects and minimize confounding.

Result: Outperforms state-of-the-art methods in recovering causal effects for depression and schizophrenia.

Conclusion: DEBIAS provides a robust solution for causal inference in psychiatry with verifiable unconfoundedness.

Abstract: Causal inference in longitudinal biomedical data remains a central challenge, especially in psychiatry, where symptom heterogeneity and latent confounding frequently undermine classical estimators. Most existing methods for treatment effect estimation presuppose a fixed outcome variable and address confounding through observed covariate adjustment. However, the assumption of unconfoundedness may not hold for a fixed outcome in practice. To address this foundational limitation, we directly optimize the outcome definition to maximize causal identifiability. Our DEBIAS (Durable Effects with Backdoor-Invariant Aggregated Symptoms) algorithm learns non-negative, clinically interpretable weights for outcome aggregation, maximizing durable treatment effects and empirically minimizing both observed and latent confounding by leveraging the time-limited direct effects of prior treatments in psychiatric longitudinal data. The algorithm also furnishes an empirically verifiable test for outcome unconfoundedness. DEBIAS consistently outperforms state-of-the-art methods in recovering causal effects for clinically interpretable composite outcomes across comprehensive experiments in depression and schizophrenia.

[326] TESSERA: Temporal Embeddings of Surface Spectra for Earth Representation and Analysis

Zhengpeng Feng, Clement Atzberger, Sadiq Jaffer, Jovana Knezevic, Silja Sormunen, Robin Young, Madeline C Lisaius, Markus Immitzer, David A. Coomes, Anil Madhavapeddy, Andrew Blake, Srinivasan Keshav

Main category: cs.LG

TL;DR: TESSERA is a global, open-source remote sensing foundation model using self-supervised learning to generate 10m-scale embeddings from satellite data, outperforming task-specific models in downstream applications.

DetailsMotivation: Satellite data is voluminous and noisy, making it challenging for applications like climate modeling and land use. TESSERA aims to simplify and enhance data usability.

Method: TESSERA uses dual Transformer-based encoders for Sentinel-2 (optical) and Sentinel-1 (radar) data, fused with a multilayer perceptron to create annual global embeddings.

Result: TESSERA matches or outperforms state-of-the-art models in five diverse downstream tasks.

Conclusion: TESSERA’s efficiency, openness, and performance make it transformative for ecological and agricultural applications.

Abstract: Satellite remote sensing from repeated observations and multiple sensors enables a wide range of downstream applications, including climate modeling, carbon accounting, and strategies for conservation and sustainable land use. However, satellite time series are voluminous, often corrupted by sensor noise, clouds, and atmospheric conditions, and unevenly spaced in time, making them challenging to use. We present TESSERA, an open, global, land-oriented remote sensing foundation model that uses self-supervised learning to generate `ready-to-use’ embeddings at 10m scale from pixel-level satellite time series data. TESSERA uses two parallel Transformer-based encoders to combine optical data from ten Sentinel-2 spectral bands at 10-60m spatial resolution and two Sentinel-1 synthetic aperture radar backscatter coefficients at 10~m resolution to create embeddings that are subsequently fused with a multilayer perceptron to create annual global embedding maps. We compare our work with state-of-the-art task-specific models and other foundation models in five diverse downstream tasks and find that TESSERA closely matches or outperforms these baselines. We believe that TESSERA’s ease of use, openness, computation-, label-, and data-efficiency, and high performance will prove transformative in a wide range of vegetation-oriented ecological and agricultural applications.

[327] Exploration Behavior of Untrained Policies

Jacob Adamczyk

Main category: cs.LG

TL;DR: The paper explores how untrained deep neural policies shape exploration in RL, revealing non-trivial state-visitation patterns and correlated actions, with implications for policy initialization.

DetailsMotivation: Understanding how neural policy architectures implicitly influence exploration in RL, especially in sparse or adversarial reward settings.

Method: Theoretical and empirical analysis using infinite-width networks and continuous-time limits to study untrained policies in a toy model.

Result: Untrained policies produce correlated actions and non-trivial state-visitation distributions, offering insights into exploration biases.

Conclusion: The study provides a framework for using policy initialization to understand and design early exploration behavior in RL.

Abstract: Exploration remains a fundamental challenge in reinforcement learning (RL), particularly in environments with sparse or adversarial reward structures. In this work, we study how the architecture of deep neural policies implicitly shapes exploration before training. We theoretically and empirically demonstrate strategies for generating ballistic or diffusive trajectories from untrained policies in a toy model. Using the theory of infinite-width networks and a continuous-time limit, we show that untrained policies return correlated actions and result in non-trivial state-visitation distributions. We discuss the distributions of the corresponding trajectories for a standard architecture, revealing insights into inductive biases for tackling exploration. Our results establish a theoretical and experimental framework for using policy initialization as a design tool to understand exploration behavior in early training.

[328] 2048: Reinforcement Learning in a Delayed Reward Environment

Prady Saligram, Tanvir Bhathal, Robby Manihani

Main category: cs.LG

TL;DR: A unified distributional multi-step RL framework (H-DQN) outperforms standard DQN, PPO, and QR-DQN in the sparse-reward game 2048, achieving higher scores and tiles.

DetailsMotivation: Addressing the challenge of delayed and sparse rewards in RL, particularly in games like 2048 where immediate feedback can mislead agents into suboptimal strategies.

Method: Developed and compared four RL agents: DQN, PPO, QR-DQN, and a novel H-DQN integrating distributional learning and advanced techniques.

Result: H-DQN achieved the highest performance (max score 41.828K, 4096 tile), followed by QR-DQN (8.66K), PPO (5.756K), and DQN (3.988K).

Conclusion: Distributional, multi-step RL significantly improves performance in sparse-reward domains, with potential for further gains via model-based planning and curriculum learning.

Abstract: Delayed and sparse rewards present a fundamental obstacle for reinforcement-learning (RL) agents, which struggle to assign credit for actions whose benefits emerge many steps later. The sliding-tile game 2048 epitomizes this challenge: although frequent small score changes yield immediate feedback, they often mislead agents into locally optimal but globally suboptimal strategies. In this work, we introduce a unified, distributional multi-step RL framework designed to directly optimize long-horizon performance. Using the open source Gym-2048 environment we develop and compare four agent variants: standard DQN, PPO, QR-DQN (Quantile Regression DQN), and a novel Horizon-DQN (H-DQN) that integrates distributional learning, dueling architectures, noisy networks, prioritized replay, and more. Empirical evaluation reveals a clear hierarchy in effectiveness: max episode scores improve from 3.988K (DQN) to 5.756K (PPO), 8.66K (QR-DQN), and 18.21K (H-DQN), with H-DQN reaching the 2048 tile. Upon scaling H-DQN it reaches a max score 41.828K and a 4096 tile. These results demonstrate that distributional, multi-step targets substantially enhance performance in sparse-reward domains, and they suggest promising avenues for further gains through model-based planning and curriculum learning.

[329] Resonant-Tunnelling Diode Reservoir Computing System for Image Recognition

A. H. Abbas, Hend Abdel-Ghani, Ivan S. Maksymov

Main category: cs.LG

TL;DR: A neuromorphic computing architecture using resonant-tunnelling diodes (RTDs) is proposed for efficient physical reservoir computing (RC), validated on image recognition tasks.

DetailsMotivation: The need for hardware-efficient computational models in AI for edge-based and resource-constrained environments drives this research.

Method: Theoretical formulation and numerical implementation of an RTD-based RC system, tested on handwritten digit classification and Fruit~360 object recognition.

Result: The architecture shows promising performance while adhering to deterministic nonlinear transformations, avoiding random connectivity.

Conclusion: The RTD-based RC system offers a viable, efficient solution for next-generation neuromorphic computing.

Abstract: As artificial intelligence continues to push into real-time, edge-based and resource-constrained environments, there is an urgent need for novel, hardware-efficient computational models. In this study, we present and validate a neuromorphic computing architecture based on resonant-tunnelling diodes (RTDs), which exhibit the nonlinear characteristics ideal for physical reservoir computing (RC). We theoretically formulate and numerically implement an RTD-based RC system and demonstrate its effectiveness on two image recognition benchmarks: handwritten digit classification and object recognition using the Fruit~360 dataset. Our results show that this circuit-level architecture delivers promising performance while adhering to the principles of next-generation RC – eliminating random connectivity in favour of a deterministic nonlinear transformation of input signals.

[330] Reactivation: Empirical NTK Dynamics Under Task Shifts

Yuzhi Liu, Zixuan Chen, Zirui Zhang, Yufei Liu, Giulia Lanzillotta

Main category: cs.LG

TL;DR: The paper analyzes Neural Tangent Kernel (NTK) dynamics in continual learning, challenging static-kernel approximations and highlighting continual learning as a testbed for neural training dynamics.

DetailsMotivation: To extend NTK dynamics understanding beyond single-task settings to continual learning, where data distribution shifts over time.

Method: Comprehensive empirical analysis of NTK dynamics in continual learning scenarios.

Result: Findings reveal continual learning as a rich testbed for probing neural training dynamics and question static-kernel approximations in theoretical treatments.

Conclusion: Continual learning provides valuable insights into NTK dynamics, challenging existing theoretical assumptions.

Abstract: The Neural Tangent Kernel (NTK) offers a powerful tool to study the functional dynamics of neural networks. In the so-called lazy, or kernel regime, the NTK remains static during training and the network function is linear in the static neural tangents feature space. The evolution of the NTK during training is necessary for feature learning, a key driver of deep learning success. The study of the NTK dynamics has led to several critical discoveries in recent years, in generalization and scaling behaviours. However, this body of work has been limited to the single task setting, where the data distribution is assumed constant over time. In this work, we present a comprehensive empirical analysis of NTK dynamics in continual learning, where the data distribution shifts over time. Our findings highlight continual learning as a rich and underutilized testbed for probing the dynamics of neural training. At the same time, they challenge the validity of static-kernel approximations in theoretical treatments of continual learning, even at large scale.

[331] R-Stitch: Dynamic Trajectory Stitching for Efficient Reasoning

Zhuokun Chen, Zeren Chen, Jiahao He, Mingkui Tan, Jianfei Cai, Bohan Zhuang

Main category: cs.LG

TL;DR: R-Stitch accelerates Chain-of-Thought (CoT) reasoning by dynamically switching between small and large language models based on confidence, reducing latency by 85% with minimal accuracy loss.

DetailsMotivation: CoT reasoning improves problem-solving but incurs high computational costs. Existing methods like speculative decoding have limitations in speedup and fail to leverage small models effectively.

Method: R-Stitch uses a hybrid decoding framework, defaulting to a small model and switching to a large model only when confidence is low, avoiding full-sequence rollback.

Result: Experiments show R-Stitch reduces inference latency by up to 85% with negligible accuracy drop on math reasoning benchmarks.

Conclusion: R-Stitch is a practical, model-agnostic solution for efficient CoT reasoning without compromising quality.

Abstract: Chain-of-thought (CoT) reasoning enhances the problem-solving capabilities of large language models by encouraging step-by-step intermediate reasoning during inference. While effective, CoT introduces substantial computational overhead due to its reliance on autoregressive decoding over long token sequences. Existing acceleration strategies either reduce sequence length through early stopping or compressive reward designs, or improve decoding speed via speculative decoding with smaller models. However, speculative decoding suffers from limited speedup when the agreement between small and large models is low, and fails to exploit the potential advantages of small models in producing concise intermediate reasoning. In this paper, we present R-Stitch, a token-level, confidence-based hybrid decoding framework that accelerates CoT inference by switching between a small language model (SLM) and a large language model (LLM) along the reasoning trajectory. R-Stitch uses the SLM to generate tokens by default and delegates to the LLM only when the SLM’s confidence falls below a threshold. This design avoids full-sequence rollback and selectively invokes the LLM on uncertain steps, preserving both efficiency and answer quality. R-Stitch is model-agnostic, training-free, and compatible with standard decoding pipelines. Experiments on math reasoning benchmarks demonstrate that R-Stitch achieves up to 85% reduction in inference latency with negligible accuracy drop, highlighting its practical effectiveness in accelerating CoT reasoning.

[332] Causal Mechanism Estimation in Multi-Sensor Systems Across Multiple Domains

Jingyi Yu, Tim Pychynski, Marco F. Huber

Main category: cs.LG

TL;DR: CICME is a three-step method for inferring causal mechanisms from heterogeneous data across domains, using Causal Transfer Learning to identify common and domain-specific mechanisms. It outperforms baselines in certain scenarios.

DetailsMotivation: To understand complex sensor systems through causality, especially with heterogeneous data from multiple domains.

Method: CICME uses Causal Transfer Learning (CTL) to detect domain-invariant causal mechanisms and guides estimation of domain-specific mechanisms. Evaluated on linear Gaussian models.

Result: CICME outperforms baseline methods in certain scenarios by leveraging pooled and individual domain data.

Conclusion: CICME effectively identifies common and domain-specific causal mechanisms, enhancing causal discovery in heterogeneous settings.

Abstract: To gain deeper insights into a complex sensor system through the lens of causality, we present common and individual causal mechanism estimation (CICME), a novel three-step approach to inferring causal mechanisms from heterogeneous data collected across multiple domains. By leveraging the principle of Causal Transfer Learning (CTL), CICME is able to reliably detect domain-invariant causal mechanisms when provided with sufficient samples. The identified common causal mechanisms are further used to guide the estimation of the remaining causal mechanisms in each domain individually. The performance of CICME is evaluated on linear Gaussian models under scenarios inspired from a manufacturing process. Building upon existing continuous optimization-based causal discovery methods, we show that CICME leverages the benefits of applying causal discovery on the pooled data and repeatedly on data from individual domains, and it even outperforms both baseline methods under certain scenarios.

[333] VIBE: Video-Input Brain Encoder for fMRI Response Modeling

Daniel Carlström Schad, Shrey Dixit, Janis Keck, Viktor Studenyak, Aleksandr Shpilevoi, Andrej Bicanski

Main category: cs.LG

TL;DR: VIBE is a two-stage Transformer model that combines video, audio, and text features to predict fMRI activity, achieving strong performance on both in-distribution and out-of-distribution datasets.

DetailsMotivation: The goal is to improve fMRI activity prediction by leveraging multi-modal features (video, audio, text) and advanced Transformer architectures.

Method: VIBE uses a modality-fusion Transformer to merge features from open-source models (Qwen2.5, BEATs, Whisper, SlowFast, V-JEPA) and a prediction Transformer with rotary embeddings for temporal decoding. It was trained on 65 hours of movie data from CNeuroMod and ensembled across 20 seeds.

Result: VIBE achieved mean parcel-wise Pearson correlations of 0.3225 (in-distribution) and 0.2125 (out-of-distribution). An earlier version won Phase-1 and placed second in the Algonauts 2025 Challenge.

Conclusion: VIBE demonstrates the effectiveness of multi-modal fusion and Transformer architectures for fMRI prediction, with competitive performance on diverse datasets.

Abstract: We present VIBE, a two-stage Transformer that fuses multi-modal video, audio, and text features to predict fMRI activity. Representations from open-source models (Qwen2.5, BEATs, Whisper, SlowFast, V-JEPA) are merged by a modality-fusion transformer and temporally decoded by a prediction transformer with rotary embeddings. Trained on 65 hours of movie data from the CNeuroMod dataset and ensembled across 20 seeds, VIBE attains mean parcel-wise Pearson correlations of 0.3225 on in-distribution Friends S07 and 0.2125 on six out-of-distribution films. An earlier iteration of the same architecture obtained 0.3198 and 0.2096, respectively, winning Phase-1 and placing second overall in the Algonauts 2025 Challenge.

[334] When Noisy Labels Meet Class Imbalance on Graphs: A Graph Augmentation Method with LLM and Pseudo Label

Riting Xia, Rucong Wang, Yulin Liu, Anchen Li, Xueyan Liu, Yan Zhang

Main category: cs.LG

TL;DR: GraphALP is a novel framework using LLMs and pseudo-labeling to address class-imbalanced graph node classification with noisy labels.

DetailsMotivation: Real-world graphs often have noisy labels and class imbalance, which existing methods overlook.

Method: GraphALP combines LLM-based oversampling for minority nodes and dynamically weighted pseudo-labeling to reduce noise.

Result: GraphALP outperforms state-of-the-art methods on imbalanced graphs with noisy labels.

Conclusion: The proposed framework effectively tackles class imbalance and label noise in graph node classification.

Abstract: Class-imbalanced graph node classification is a practical yet underexplored research problem. Although recent studies have attempted to address this issue, they typically assume clean and reliable labels when processing class-imbalanced graphs. This assumption often violates the nature of real-world graphs, where labels frequently contain noise. Given this gap, this paper systematically investigates robust node classification for class-imbalanced graphs with noisy labels. We propose GraphALP, a novel Graph Augmentation framework based on Large language models (LLMs) and Pseudo-labeling techniques. Specifically, we design an LLM-based oversampling method to generate synthetic minority nodes, producing label-accurate minority nodes to alleviate class imbalance. Based on the class-balanced graphs, we develop a dynamically weighted pseudo-labeling method to obtain high-confidence pseudo labels to reduce label noise ratio. Additionally, we implement a secondary LLM-guided oversampling mechanism to mitigate potential class distribution skew caused by pseudo labels. Experimental results show that GraphALP achieves superior performance over state-of-the-art methods on class-imbalanced graphs with noisy labels.

[335] Neural Tangent Kernels and Fisher Information Matrices for Simple ReLU Networks with Random Hidden Weights

Jun’ichi Takeuchi, Yoshinari Takeishi, Noboru Murata, Kazushi Mimura, Ka Long Keith Ho, Hiroshi Nagaoka

Main category: cs.LG

TL;DR: The paper explores Fisher information matrices and NTKs in 2-layer ReLU networks with random weights, linking them via linear transformation and analyzing spectral properties.

DetailsMotivation: To understand the relationship between Fisher information matrices and NTKs in shallow ReLU networks and their spectral properties.

Method: Analyzes the linear transformation between Fisher information matrices and NTKs, and derives spectral decompositions with eigenfunctions.

Result: Concrete forms of eigenfunctions for NTK with major eigenvalues and an approximation formula for 2-layer network functions.

Conclusion: The study provides insights into the spectral structure of NTKs and their connection to Fisher information in shallow ReLU networks.

Abstract: Fisher information matrices and neural tangent kernels (NTK) for 2-layer ReLU networks with random hidden weight are argued. We discuss the relation between both notions as a linear transformation and show that spectral decomposition of NTK with concrete forms of eigenfunctions with major eigenvalues. We also obtain an approximation formula of the functions presented by the 2-layer neural networks.

cs.MA

[336] A Distributed Approach for Agile Supply Chain Decision-Making Based on Network Attributes

Mingjie Bi, Dawn M. Tilbury, Siqian Shen, Kira Barton

Main category: cs.MA

TL;DR: The paper explores agile disruption mitigation in supply chains using distributed decision-making, comparing it to centralized approaches based on network attributes and agent capabilities.

DetailsMotivation: Frequent disruptions negatively impact global supply chains, prompting the need for agile decision-making strategies to maintain competitiveness.

Method: The study uses a distributed decision-making approach based on multi-agent frameworks, evaluating performance through a case study considering network structure and agent attributes.

Result: The distributed framework’s performance is compared to centralized approaches, revealing trade-offs in performance, computation time, and network communication.

Conclusion: Practitioners can leverage the findings to design response strategies tailored to agent capabilities, network attributes, and desired supply chain performance.

Abstract: In recent years, the frequent occurrence of disruptions has had a negative impact on global supply chains. To stay competitive, enterprises strive to remain agile through the implementation of efficient and effective decision-making strategies in reaction to disruptions. A significant effort has been made to develop these agile disruption mitigation approaches, leveraging both centralized and distributed decision-making strategies. Though trade-offs of centralized and distributed approaches have been analyzed in existing studies, no related work has been found on understanding supply chain performance based on the network attributes of the disrupted supply chain entities. In this paper, we characterize supply chains from a capability and network topological perspective and investigate the use of a distributed decision-making approach based on classical multi-agent frameworks. The performance of the distributed framework is evaluated through a comprehensive case study that investigates the performance of the supply chain as a function of the network structure and agent attributes within the network in the presence of a disruption. Comparison to a centralized decision-making approach highlights trade-offs between performance, computation time, and network communication based on the decision-making strategy and network architecture. Practitioners can use the outcomes of our studies to design response strategies based on agent capabilities, network attributes, and desired supply chain performance.

[337] Dynamic distributed decision-making for resilient resource reallocation in disrupted manufacturing systems

Mingjie Bi, Ilya Kovalenko, Dawn M. Tilbury, Kira Barton

Main category: cs.MA

TL;DR: A multi-agent framework for dynamic resource allocation in manufacturing during disruptions, incorporating risk assessment and agent coordination.

DetailsMotivation: Addressing the need for flexible, real-time decision-making in dynamic manufacturing environments disrupted by COVID-19.

Method: Proposes a model-based resource agent (RA) architecture with clustering agent coordination and risk assessment for rescheduling.

Result: Reduces computational effort but slightly loses throughput optimality compared to centralized methods; risk assessment improves throughput.

Conclusion: The multi-agent framework offers a practical solution for dynamic rescheduling in uncertain manufacturing environments.

Abstract: The COVID-19 pandemic brings many unexpected disruptions, such as frequently shifting markets and limited human workforce, to manufacturers. To stay competitive, flexible and real-time manufacturing decision-making strategies are needed to deal with such highly dynamic manufacturing environments. One essential problem is dynamic resource allocation to complete production tasks, especially when a resource disruption (e.g., machine breakdown) occurs. Though multi-agent methods have been proposed to solve the problem in a flexible and agile manner, the agent internal decision-making process and resource uncertainties have rarely been studied. This work introduces a model-based resource agent (RA) architecture that enables effective agent coordination and dynamic agent decision-making. Based on the RA architecture, a rescheduling strategy that incorporates risk assessment via a clustering agent coordination strategy is also proposed. A simulation-based case study is implemented to demonstrate dynamic rescheduling using the proposed multi-agent framework. The results show that the proposed method reduces the computational efforts while losing some throughput optimality compared to the centralized method. Furthermore, the case study illustrates that incorporating risk assessment into rescheduling decision-making improves the throughput.

[338] Heterogeneous Risk Management Using a Multi-Agent Framework for Supply Chain Disruption Response

Mingjie Bi, Juan-Alberto Estrada-Garcia, Dawn M. Tilbury, Siqian Shen, Kira Barton

Main category: cs.MA

TL;DR: A heterogeneous risk management mechanism is proposed for distributed supply chain agents to handle disruptions dynamically, incorporating uncertainties and varying risk attitudes.

DetailsMotivation: Existing approaches often ignore temporal dynamics and agent heterogeneity in risk management, limiting agility in stochastic environments.

Method: The study introduces a mechanism integrating uncertainties and risk attitudes into agent communication and decision-making, tested via simulation.

Result: The approach proves feasible and effective, showing how varying risk attitudes influence disruption response decisions.

Conclusion: The proposed mechanism enhances distributed risk management in stochastic supply chains, adapting to agent-specific risk preferences.

Abstract: In the highly complex and stochastic global, supply chain environments, local enterprise agents seek distributed and dynamic strategies for agile responses to disruptions. Existing literature explores both centralized and distributed approaches, while most work neglects temporal dynamics and the heterogeneity of the risk management of individual agents. To address this gap, this letter presents a heterogeneous risk management mechanism to incorporate uncertainties and risk attitudes into agent communication and decision-making strategy. Hence, this approach empowers enterprises to handle disruptions in stochastic environments in a distributed way, and in particular in the context of multi-agent control and management. Through a simulated case study, we showcase the feasibility and effectiveness of the proposed approach under stochastic settings and how the decision of disruption responses changes when agents hold various risk attitudes.

cs.MM

[339] CatchPhrase: EXPrompt-Guided Encoder Adaptation for Audio-to-Image Generation

Hyunwoo Oh, SeungJu Cha, Kwanyoung Lee, Si-Woo Kim, Dong-Jin Kim

Main category: cs.MM

TL;DR: CatchPhrase is a framework for generating images from audio inputs, addressing semantic misalignment by using enriched prompts and multi-modal filtering.

DetailsMotivation: Semantic misalignment in audio-to-image generation due to homographs and auditory illusions hinders accuracy.

Method: Uses EXPrompt Mining (leveraging LLMs and ACMs) and EXPrompt Selector (multi-modal filtering) to align prompts, then trains a lightweight mapping network for audio input.

Result: Improves audio-to-image alignment and generation quality on multiple datasets.

Conclusion: CatchPhrase effectively mitigates semantic misalignment in audio-to-image generation.

Abstract: We propose CatchPhrase, a novel audio-to-image generation framework designed to mitigate semantic misalignment between audio inputs and generated images. While recent advances in multi-modal encoders have enabled progress in cross-modal generation, ambiguity stemming from homographs and auditory illusions continues to hinder accurate alignment. To address this issue, CatchPhrase generates enriched cross-modal semantic prompts (EXPrompt Mining) from weak class labels by leveraging large language models (LLMs) and audio captioning models (ACMs). To address both class-level and instance-level misalignment, we apply multi-modal filtering and retrieval to select the most semantically aligned prompt for each audio sample (EXPrompt Selector). A lightweight mapping network is then trained to adapt pre-trained text-to-image generation models to audio input. Extensive experiments on multiple audio classification datasets demonstrate that CatchPhrase improves audio-to-image alignment and consistently enhances generation quality by mitigating semantic misalignment.

[340] Benchmarking Multimodal Understanding and Complex Reasoning for ESG Tasks

Lei Zhang, Xin Zhou, Chaoyue He, Di Wang, Yi Wu, Hong Xu, Wei Liu, Chunyan Miao

Main category: cs.MM

TL;DR: The paper introduces MMESGBench, a benchmark dataset for evaluating multimodal understanding and reasoning in ESG documents, addressing gaps in AI systems’ ability to process such complex, diverse materials.

DetailsMotivation: Existing AI systems struggle with document-level reasoning in ESG reports due to their length, structural diversity, and multimodal nature. No dedicated benchmark exists for this domain.

Method: A human-AI collaborative pipeline constructs the dataset: a multimodal LLM generates QA pairs, another LLM verifies them, and domain experts validate for quality.

Result: MMESGBench includes 933 QA pairs from 45 ESG documents, showing multimodal models outperform text-only ones, especially for visually grounded and cross-page tasks.

Conclusion: MMESGBench fills a critical gap in ESG document analysis, providing a robust benchmark for future AI advancements in multimodal reasoning.

Abstract: Environmental, Social, and Governance (ESG) reports are essential for evaluating sustainability practices, ensuring regulatory compliance, and promoting financial transparency. However, these documents are often lengthy, structurally diverse, and multimodal, comprising dense text, structured tables, complex figures, and layout-dependent semantics. Existing AI systems often struggle to perform reliable document-level reasoning in such settings, and no dedicated benchmark currently exists in ESG domain. To fill the gap, we introduce \textbf{MMESGBench}, a first-of-its-kind benchmark dataset targeted to evaluate multimodal understanding and complex reasoning across structurally diverse and multi-source ESG documents. This dataset is constructed via a human-AI collaborative, multi-stage pipeline. First, a multimodal LLM generates candidate question-answer (QA) pairs by jointly interpreting rich textual, tabular, and visual information from layout-aware document pages. Second, an LLM verifies the semantic accuracy, completeness, and reasoning complexity of each QA pair. This automated process is followed by an expert-in-the-loop validation, where domain specialists validate and calibrate QA pairs to ensure quality, relevance, and diversity. MMESGBench comprises 933 validated QA pairs derived from 45 ESG documents, spanning across seven distinct document types and three major ESG source categories. Questions are categorized as single-page, cross-page, or unanswerable, with each accompanied by fine-grained multimodal evidence. Initial experiments validate that multimodal and retrieval-augmented models substantially outperform text-only baselines, particularly on visually grounded and cross-page tasks. MMESGBench is publicly available as an open-source dataset at https://github.com/Zhanglei1103/MMESGBench.

[341] CLIP-Guided Backdoor Defense through Entropy-Based Poisoned Dataset Separation

Binyan Xu, Fan Yang, Xilin Dai, Di Tang, Kehuan Zhang

Main category: cs.MM

TL;DR: CGD is an efficient backdoor defense method using CLIP to identify and neutralize poisoned inputs, reducing attack success rates to below 1% with minimal clean accuracy drop.

DetailsMotivation: Current backdoor defenses are computationally expensive or ineffective against advanced attacks like clean-label and clean-image backdoors.

Method: CGD uses a publicly accessible CLIP model to identify clean or poisoned inputs and retrains the model with CLIP’s logits as guidance.

Result: CGD reduces attack success rates to below 1% with a maximum clean accuracy drop of 0.3%, outperforming existing defenses.

Conclusion: CGD is efficient, effective, and robust, making it suitable for real-world backdoor defense scenarios.

Abstract: Deep Neural Networks (DNNs) are susceptible to backdoor attacks, where adversaries poison training data to implant backdoor into the victim model. Current backdoor defenses on poisoned data often suffer from high computational costs or low effectiveness against advanced attacks like clean-label and clean-image backdoors. To address them, we introduce CLIP-Guided backdoor Defense (CGD), an efficient and effective method that mitigates various backdoor attacks. CGD utilizes a publicly accessible CLIP model to identify inputs that are likely to be clean or poisoned. It then retrains the model with these inputs, using CLIP’s logits as a guidance to effectively neutralize the backdoor. Experiments on 4 datasets and 11 attack types demonstrate that CGD reduces attack success rates (ASRs) to below 1% while maintaining clean accuracy (CA) with a maximum drop of only 0.3%, outperforming existing defenses. Additionally, we show that clean-data-based defenses can be adapted to poisoned data using CGD. Also, CGD exhibits strong robustness, maintaining low ASRs even when employing a weaker CLIP model or when CLIP itself is compromised by a backdoor. These findings underscore CGD’s exceptional efficiency, effectiveness, and applicability for real-world backdoor defense scenarios. Code: https://github.com/binyxu/CGD.

eess.AS

[342] FD-Bench: A Full-Duplex Benchmarking Pipeline Designed for Full Duplex Spoken Dialogue Systems

Yizhou Peng, Yi-Wen Chao, Dianwen Ng, Yukun Ma, Chongjia Ni, Bin Ma, Eng Siong Chng

Main category: eess.AS

TL;DR: The paper introduces a benchmarking pipeline for full-duplex spoken dialogue systems (FDSDS) to evaluate their performance in handling interruptions and robustness, revealing challenges in existing models.

DetailsMotivation: Existing benchmarks lack metrics for full-duplex (FD) interactions, such as evaluating performance during user interruptions, prompting the need for a comprehensive FD benchmarking pipeline.

Method: The proposed pipeline uses LLMs, TTS, and ASR to assess FDSDS capabilities in handling interruptions, managing delays, and robustness in challenging scenarios with diverse metrics.

Result: Testing three open-source FDSDS models with 293 simulated conversations and 1,200 interruptions revealed challenges like failure to respond to interruptions under noisy conditions.

Conclusion: The benchmark highlights the need for improved FDSDS models to handle real-time interruptions and robustness, with plans to release demonstrations, data, and code.

Abstract: Full-duplex spoken dialogue systems (FDSDS) enable more natural human-machine interactions by allowing real-time user interruptions and backchanneling, compared to traditional SDS that rely on turn-taking. However, existing benchmarks lack metrics for FD scenes, e.g., evaluating model performance during user interruptions. In this paper, we present a comprehensive FD benchmarking pipeline utilizing LLMs, TTS, and ASR to address this gap. It assesses FDSDS’s ability to handle user interruptions, manage delays, and maintain robustness in challenging scenarios with diverse novel metrics. We applied our benchmark to three open-source FDSDS (Moshi, Freeze-omni, and VITA-1.5) using over 40 hours of generated speech, with 293 simulated conversations and 1,200 interruptions. The results show that all models continue to face challenges, such as failing to respond to user interruptions, under frequent disruptions and noisy conditions. Demonstrations, data, and code will be released.

[343] Assessment of Personality Dimensions Across Situations Using Conversational Speech

Alice Zhang, Skanda Muralidhar, Daniel Gatica-Perez, Mathew Magimai-Doss

Main category: eess.AS

TL;DR: The study explores how perceived personality traits vary across conversational contexts, identifying key acoustic features and showing that stressful interactions better predict neuroticism.

DetailsMotivation: Prior work treated personality as static, but psychological research shows it varies by context. This study investigates how conversational speech reflects perceived personality in different work situations.

Method: Analyzed conversational speech in neutral and stressful work interactions, using acoustic and non-verbal features to infer perceived personality traits.

Result: Perceived personalities differ by context; specific acoustic features predict traits like extraversion and neuroticism; handcrafted features outperform speaker embeddings.

Conclusion: Context matters in personality perception; acoustic features are context-dependent, and stressful situations better reveal neuroticism.

Abstract: Prior research indicates that users prefer assistive technologies whose personalities align with their own. This has sparked interest in automatic personality perception (APP), which aims to predict an individual’s perceived personality traits. Previous studies in APP have treated personalities as static traits, independent of context. However, perceived personalities can vary by context and situation as shown in psychological research. In this study, we investigate the relationship between conversational speech and perceived personality for participants engaged in two work situations (a neutral interview and a stressful client interaction). Our key findings are: 1) perceived personalities differ significantly across interactions, 2) loudness, sound level, and spectral flux features are indicative of perceived extraversion, agreeableness, conscientiousness, and openness in neutral interactions, while neuroticism correlates with these features in stressful contexts, 3) handcrafted acoustic features and non-verbal features outperform speaker embeddings in inference of perceived personality, and 4) stressful interactions are more predictive of neuroticism, aligning with existing psychological research.

[344] Should Top-Down Clustering Affect Boundaries in Unsupervised Word Discovery?

Simon Malan, Benjamin van Niekerk, Herman Kamper

Main category: eess.AS

TL;DR: The paper compares bottom-up and top-down methods for segmenting unlabeled speech into word-like units and clustering them into a lexicon, finding both achieve similar results, with bottom-up being faster.

DetailsMotivation: To determine whether top-down information is necessary for improving segmentation in unlabeled speech and to compare the efficiency and effectiveness of bottom-up and top-down approaches.

Method: Two approaches are tested: a simple bottom-up method using self-supervised features for boundary prediction and clustering, and a top-down method (ES-KMeans) that iteratively updates boundaries using K-means.

Result: Both methods achieve comparable state-of-the-art results on the ZeroSpeech benchmarks, with the bottom-up method being significantly faster. Top-down benefits depend on factors like candidate boundaries, but bottom-up often performs equally well.

Conclusion: Future work should focus on improving clustering techniques and learning more discriminative word-like representations, as clustering is a limiting factor for both methods.

Abstract: We investigate the problem of segmenting unlabeled speech into word-like units and clustering these to create a lexicon. Prior work can be categorized into two frameworks. Bottom-up methods first determine boundaries and then cluster the fixed segmented words into a lexicon. In contrast, top-down methods incorporate information from the clustered words to inform boundary selection. However, it is unclear whether top-down information is necessary to improve segmentation. To explore this, we look at two similar approaches that differ in whether top-down clustering informs boundary selection. Our simple bottom-up strategy predicts word boundaries using the dissimilarity between adjacent self-supervised features, then clusters the resulting segments to construct a lexicon. Our top-down system is an updated version of the ES-KMeans dynamic programming method that iteratively uses K-means to update its boundaries. On the five-language ZeroSpeech benchmarks, both approaches achieve comparable state-of-the-art results, with the bottom-up system being nearly five times faster. Through detailed analyses, we show that the top-down influence of ES-KMeans can be beneficial (depending on factors like the candidate boundaries), but in many cases the simple bottom-up method performs just as well. For both methods, we show that the clustering step is a limiting factor. Therefore, we recommend that future work focus on improved clustering techniques and learning more discriminative word-like representations. Project code repository: https://github.com/s-malan/prom-seg-clus.

[345] Comparison of Knowledge Distillation Methods for Low-complexity Multi-microphone Speech Enhancement using the FT-JNF Architecture

Robert Metzger, Mattes Ohlenbusch, Christian Rollwage, Simon Doclo

Main category: eess.AS

TL;DR: The paper explores Knowledge Distillation (KD) methods to reduce the size of DNN-based multi-microphone speech enhancement models while maintaining performance.

DetailsMotivation: Many DNN-based speech enhancement algorithms are too resource-intensive for limited hardware, and reducing complexity often degrades performance.

Method: Five KD methods are evaluated, including direct output matching, self-similarity of layers, and fused multi-layer losses, applied to the FT-JNF architecture.

Result: Three KD methods significantly improved student model performance, with a 25%-parameter model matching teacher PESQ scores at 0 dB SNR, and up to 96% size reduction with minimal PESQ loss.

Conclusion: KD effectively reduces model size for resource-limited devices without sacrificing performance, making DNN-based speech enhancement more practical.

Abstract: Multi-microphone speech enhancement using deep neural networks (DNNs) has significantly progressed in recent years. However, many proposed DNN-based speech enhancement algorithms cannot be implemented on devices with limited hardware resources. Only lowering the complexity of such systems by reducing the number of parameters often results in worse performance. Knowledge Distillation (KD) is a promising approach for reducing DNN model size while preserving performance. In this paper, we consider the recently proposed Frequency-Time Joint Non-linear Filter (FT-JNF) architecture and investigate several KD methods to train smaller (student) models from a large pre-trained (teacher) model. Five KD methods are evaluated using direct output matching, the self-similarity of intermediate layers, and fused multi-layer losses. Experimental results on a simulated dataset using a compact array with five microphones show that three KD methods substantially improve the performance of student models compared to training without KD. A student model with only 25% of the teacher model’s parameters achieves comparable PESQ scores at 0 dB SNR. Furthermore, a reduction of up to 96% in model size can be achieved with only a minimal decrease in PESQ scores.

[346] Binaural Target Speaker Extraction using HRTFs and a Complex-Valued Neural Network

Yoav Ellinson, Sharon Gannot

Main category: eess.AS

TL;DR: A novel binaural target speaker extraction method uses HRTF and a complex-valued neural network for speaker-independent performance, excelling in both anechoic and reverberant conditions.

DetailsMotivation: To mimic human ability to focus on a single speaker amidst multiple talkers without relying on speaker embeddings.

Method: Uses HRTF and a fully complex-valued neural network operating directly on complex-valued STFT of mixed audio signals.

Result: Excellent performance in anechoic conditions; robust in reverberation, preserving clarity and directionality while reducing reverberation.

Conclusion: The method is effective, speaker-independent, and generalizes well across languages and environments.

Abstract: In this work, we aim to imitate the human ability to selectively attend to a single speaker, even in the presence of multiple simultaneous talkers. We propose a novel approach for binaural target speaker extraction that leverages the listener’s Head-Related Transfer Function (HRTF) to isolate the desired speaker. Notably, our method does not rely on speaker embeddings, making it speaker-independent and enabling strong generalization across multiple speech datasets in different languages. We employ a fully complex-valued neural network that operates directly on the complex-valued Short-Time Fourier Transform (STFT) of the mixed audio signals. This deviates from conventional approaches that use spectrograms or treat the real and imaginary components of the STFT as separate real-valued inputs. We first evaluate the method in an anechoic, noise-free scenario, where it demonstrates excellent extraction performance while effectively preserving the binaural cues of the target signal. We then test a modified variant under mild reverberation conditions. This version remains robust in reverberant environments, maintaining speech clarity, preserving source directionality, and simultaneously reducing reverberation.

[347] Integrating IP Broadcasting with Audio Tags: Workflow and Challenges

Rhys Burchett-Vass, Arshdeep Singh, Gabriel Bibbó, Mark D. Plumbley

Main category: eess.AS

TL;DR: The paper discusses containerizing an audio tagging model into a microservice for flexible integration into IP broadcasting workflows, addressing challenges like latency.

DetailsMotivation: To enhance IP broadcasting workflows by integrating live audio tagging tools for greater flexibility and functionality in content production.

Method: Containerizing an audio tagging model into a microservice for seamless deployment in various network setups.

Result: A modular, accessible, and flexible tool adaptable to broadcasting workflows of all sizes, though latency issues are noted.

Conclusion: The microservice approach offers scalable integration for audio tagging in broadcasting, but latency remains a challenge to address.

Abstract: The broadcasting industry has adopted IP technologies, revolutionising both live and pre-recorded content production, from news gathering to live music events. IP broadcasting allows for the transport of audio and video signals in an easily configurable way, aligning with modern networking techniques. This shift towards an IP workflow allows for much greater flexibility, not only in routing signals but with the integration of tools using standard web development techniques. One possible tool could include the use of live audio tagging, which has a number of uses in the production of content. These could include adding sound effects to automated closed captioning or identifying unwanted sound events within a scene. In this paper, we describe the process of containerising an audio tagging model into a microservice, a small segregated code module that can be integrated into a multitude of different network setups. The goal is to develop a modular, accessible, and flexible tool capable of seamless deployment into broadcasting workflows of all sizes, from small productions to large corporations. Challenges surrounding latency of the selected audio tagging model and its effect on the usefulness of the end product are discussed.

[348] Self-Supervised Frameworks for Speaker Verification via Bootstrapped Positive Sampling

Theo Lepage, Reda Dehak

Main category: eess.AS

TL;DR: The paper introduces Self-Supervised Positive Sampling (SSPS), a technique to improve SSL for Speaker Verification by sampling diverse positives, reducing channel bias and improving performance.

DetailsMotivation: Current SSL frameworks for SV rely on anchor-positive pairs from the same utterance, encoding too much recording-source information, limiting performance.

Method: SSPS samples positives close to anchors in representation space, assuming they share speaker identity but differ in recording conditions.

Result: SSPS improves SV performance on VoxCeleb benchmarks, with SimCLR and DINO achieving ~2.5% EER. SimCLR shows a 58% EER reduction.

Conclusion: SSPS enhances SSL frameworks by reducing intra-class variance and channel bias, achieving robust performance even without data augmentation.

Abstract: Recent developments in Self-Supervised Learning (SSL) have demonstrated significant potential for Speaker Verification (SV), but closing the performance gap with supervised systems remains an ongoing challenge. SSL frameworks rely on anchor-positive pairs, constructed from segments of the same audio utterance. Hence, positives have channel characteristics similar to those of their corresponding anchors, even with extensive data-augmentation. Therefore, this positive sampling strategy is a fundamental limitation as it encodes too much information regarding the recording source in the learned representations. This article introduces Self-Supervised Positive Sampling (SSPS), a bootstrapped technique for sampling appropriate and diverse positives in SSL frameworks for SV. SSPS samples positives close to their anchor in the representation space, assuming that these pseudo-positives belong to the same speaker identity but correspond to different recording conditions. This method consistently demonstrates improvements in SV performance on VoxCeleb benchmarks when applied to major SSL frameworks, including SimCLR, SwAV, VICReg, and DINO. Using SSPS, SimCLR and DINO achieve 2.57% and 2.53% EER on VoxCeleb1-O, respectively. SimCLR yields a 58% relative reduction in EER, getting comparable performance to DINO with a simpler training framework. Furthermore, SSPS lowers intra-class variance and reduces channel information in speaker representations while exhibiting greater robustness without data-augmentation.

[349] Incremental Averaging Method to Improve Graph-Based Time-Difference-of-Arrival Estimation

Klaus Brümann, Kouei Yamaoka, Nobutaka Ono, Simon Doclo

Main category: eess.AS

TL;DR: The paper proposes an incremental method to improve TDOA estimation by averaging multiple CPSDs for GCC-PHAT functions, enhancing accuracy in noisy and reverberant environments.

DetailsMotivation: Background noise and reverberation degrade TDOA estimation, affecting source localization. Existing methods using single CPSDs or MST-based pair selection are limited.

Method: An incremental method averages multiple CPSDs for GCC-PHAT functions, leveraging indirect CPSDs via other microphones to improve robustness.

Result: Experiments show reduced TDOA and 2D position errors compared to single CPSD and MST-based methods, especially in reverberant conditions.

Conclusion: The proposed method outperforms existing techniques by leveraging multiple CPSDs, enhancing accuracy in challenging acoustic environments.

Abstract: Estimating the position of a speech source based on time-differences-of-arrival (TDOAs) is often adversely affected by background noise and reverberation. A popular method to estimate the TDOA between a microphone pair involves maximizing a generalized cross-correlation with phase transform (GCC-PHAT) function. Since the TDOAs across different microphone pairs satisfy consistency relations, generally only a small subset of microphone pairs are used for source position estimation. Although the set of microphone pairs is often determined based on a reference microphone, recently a more robust method has been proposed to determine the set of microphone pairs by computing the minimum spanning tree (MST) of a signal graph of GCC-PHAT function reliabilities. To reduce the influence of noise and reverberation on the TDOA estimation accuracy, in this paper we propose to compute the GCC-PHAT functions of the MST based on an average of multiple cross-power spectral densities (CPSDs) using an incremental method. In each step of the method, we increase the number of CPSDs over which we average by considering CPSDs computed indirectly via other microphones from previous steps. Using signals recorded in a noisy and reverberant laboratory with an array of spatially distributed microphones, the performance of the proposed method is evaluated in terms of TDOA estimation error and 2D source position estimation error. Experimental results for different source and microphone configurations and three reverberation conditions show that the proposed method considering multiple CPSDs improves the TDOA estimation and source position estimation accuracy compared to the reference microphone- and MST-based methods that rely on a single CPSD as well as steered-response power-based source position estimation.

[350] P.808 Multilingual Speech Enhancement Testing: Approach and Results of URGENT 2025 Challenge

Marvin Sach, Yihui Fu, Kohei Saijo, Wangyou Zhang, Samuele Cornell, Robin Scheibler, Chenda Li, Anurag Kumar, Wei Wang, Yanmin Qian, Shinji Watanabe, Tim Fingscheidt

Main category: eess.AS

TL;DR: The paper discusses challenges in speech quality estimation for SE systems, proposes localization of crowdsourced subjective tests, and highlights the need for objective metrics alongside subjective ones for generative AI.

DetailsMotivation: Addressing the reliability of subjective listening tests in the era of generative AI and multilingual datasets, while improving testing methods.

Method: Recaps ITU-T P.808, proposes localization of text/audio for crowdsourced ACR tests, and analyzes URGENT Challenge results.

Result: Finds subjective ACR tests may need objective phone fidelity metrics for generative SE methods to detect hallucinations.

Conclusion: Advocates combining subjective and objective metrics for reliable SE evaluation and plans to release localization tools for multilingual tests.

Abstract: In speech quality estimation for speech enhancement (SE) systems, subjective listening tests so far are considered as the gold standard. This should be even more true considering the large influx of new generative or hybrid methods into the field, revealing issues of some objective metrics. Efforts such as the Interspeech 2025 URGENT Speech Enhancement Challenge also involving non-English datasets add the aspect of multilinguality to the testing procedure. In this paper, we provide a brief recap of the ITU-T P.808 crowdsourced subjective listening test method. A first novel contribution is our proposed process of localizing both text and audio components of Naderi and Cutler’s implementation of crowdsourced subjective absolute category rating (ACR) listening tests involving text-to-speech (TTS). Further, we provide surprising analyses of and insights into URGENT Challenge results, tackling the reliability of (P.808) ACR subjective testing as gold standard in the age of generative AI. Particularly, it seems that for generative SE methods, subjective (ACR MOS) and objective (DNSMOS, NISQA) reference-free metrics should be accompanied by objective phone fidelity metrics to reliably detect hallucinations. Finally, we will soon release our localization scripts and methods for easy deployment for new multilingual speech enhancement subjective evaluations according to ITU-T P.808.

[351] ASR-Guided Speaker-Role Diarization and Diarization-Guided ASR Decoding

Arindam Ghosh, Mark Fuhs, Bongjun Kim, Anurag Chowdhury, Monika Woszczyna

Main category: eess.AS

TL;DR: The paper extends speaker diarization to speaker-role diarization (RD) with simplified training, task-specific predictors, and RD-influenced ASR decoding.

DetailsMotivation: Speaker-role diarization (RD) is more practical than traditional speaker diarization (SD) for applications like doctor-patient or host-guest interactions.

Method: The framework uses forced alignment and cross-entropy loss for training, employs separate predictors for word and role tasks, and leverages RD posteriors to improve ASR decoding.

Result: The approach demonstrates effectiveness in role prediction and reduces small-word deletion errors in ASR.

Conclusion: The proposed method advances RD by addressing training complexity and task-specific needs, enhancing ASR performance.

Abstract: From an application standpoint, speaker-role diarization (RD), such as doctor vs. patient, host vs. guest, etc. is often more useful than traditional speaker diarization (SD), which assigns generic labels like speaker-1, speaker-2 etc. In the context of joint automatic speech recognition (ASR) + SD (who spoke what?), recent end-to-end models employ an auxiliary SD transducer, synchronized with the ASR transducer, to predict speakers per word. In this paper, we extend this framework to RD with three key contributions: (1) we simplify the training via forced alignment and cross-entropy loss instead of RNNT loss, (2) we show that word prediction and role prediction require different amounts of predictor’s context, leading to separate task-specific predictors, unlike existing shared-predictor models, and (3) we propose a way to leverage RD posterior activity to influence ASR decoding and reduce small-word deletion errors.

eess.IV

[352] Estimating Sensitivity Maps for X-Nuclei Magnetic Resonance Spectroscopic Imaging

Nicholas Dwork, Jeremy W. Gordon, Shuyu Tang, Peder E. Z. Larson

Main category: eess.IV

TL;DR: The paper proposes the L2 optimal method for estimating sensitivity maps in X-nuclei imaging, outperforming the RefPeak method in accuracy and SNR.

DetailsMotivation: To improve sensitivity map estimation for X-nuclei with sparse presence in the field of view.

Method: Solve a least-squares problem using multiple spectral, temporal, or frequency estimates.

Result: L2 optimal method provides more accurate sensitivity maps and higher SNR in brain, pancreas, and heart imaging.

Conclusion: The L2 optimal method is superior to RefPeak, leveraging more measurement data for better sensitivity estimation.

Abstract: The purpose of this research is to estimate sensitivity maps when imaging X-nuclei that may not have a significant presence throughout the field of view. We propose to estimate the coil’s sensitivities by solving a least-squares problem where each row corresponds to an individual estimate of the sensitivity for a given voxel. Multiple estimates come from the multiple bins of the spectrum with spectroscopy, multiple times with dynamic imaging, or multiple frequencies when utilizing spectral excitation. The method presented in this manuscript, called the L2 optimal method, is compared to the commonly used RefPeak method which uses the spectral bin with the highest energy to estimate the sensitivity maps. The L2 optimal method yields more accurate sensitivity maps when imaging a numerical phantom and is shown to yield a higher signal-to-noise ratio when imaging the brain, pancreas, and heart with hyperpolarized pyruvate as the contrast agent with hyperpolarized MRI. The L2 optimal method is able to better estimate the sensitivity by extracting more information from the measurements.

[353] RealisVSR: Detail-enhanced Diffusion for Real-World 4K Video Super-Resolution

Weisong Zhao, Jingkai Zhou, Xiangyu Zhu, Weihua Chen, Xiao-Yu Zhang, Zhen Lei, Fan Wang

Main category: eess.IV

TL;DR: RealisVSR addresses key VSR challenges with innovations like CPC architecture, HR-Loss, and a 4K benchmark, outperforming existing methods.

DetailsMotivation: Overcome limitations in temporal dynamics modeling, high-frequency detail recovery, and lack of 4K evaluation in VSR.

Method: Integrates CPC with Wan2.1 diffusion, uses HR-Loss for texture, and introduces RealisVideo-4K benchmark.

Result: Superior performance on benchmarks, especially in ultra-high-resolution, with reduced training data.

Conclusion: RealisVSR effectively enhances VSR with novel techniques and a 4K dataset, setting a new standard.

Abstract: Video Super-Resolution (VSR) has achieved significant progress through diffusion models, effectively addressing the over-smoothing issues inherent in GAN-based methods. Despite recent advances, three critical challenges persist in VSR community: 1) Inconsistent modeling of temporal dynamics in foundational models; 2) limited high-frequency detail recovery under complex real-world degradations; and 3) insufficient evaluation of detail enhancement and 4K super-resolution, as current methods primarily rely on 720P datasets with inadequate details. To address these challenges, we propose RealisVSR, a high-frequency detail-enhanced video diffusion model with three core innovations: 1) Consistency Preserved ControlNet (CPC) architecture integrated with the Wan2.1 video diffusion to model the smooth and complex motions and suppress artifacts; 2) High-Frequency Rectified Diffusion Loss (HR-Loss) combining wavelet decomposition and HOG feature constraints for texture restoration; 3) RealisVideo-4K, the first public 4K VSR benchmark containing 1,000 high-definition video-text pairs. Leveraging the advanced spatio-temporal guidance of Wan2.1, our method requires only 5-25% of the training data volume compared to existing approaches. Extensive experiments on VSR benchmarks (REDS, SPMCS, UDM10, YouTube-HQ, VideoLQ, RealisVideo-720P) demonstrate our superiority, particularly in ultra-high-resolution scenarios.

[354] XAI-Guided Analysis of Residual Networks for Interpretable Pneumonia Detection in Paediatric Chest X-rays

Rayyan Ridwan

Main category: eess.IV

TL;DR: A deep learning model using ResNet-50 and BayesGrad-CAM for pediatric pneumonia diagnosis achieves high accuracy (95.94%) and interpretability.

DetailsMotivation: Pneumonia is a leading cause of child mortality, necessitating fast, accurate, and interpretable diagnostic tools.

Method: Proposes an interpretable ResNet-50 model enhanced with BayesGrad-CAM for uncertainty-aware visual explanations.

Result: Achieves 95.94% accuracy, 98.91% AUC-ROC, and 0.913 Cohen’s Kappa, with clinically meaningful visualizations.

Conclusion: High performance and interpretability are achievable and essential for clinical AI adoption.

Abstract: Pneumonia remains one of the leading causes of death among children worldwide, underscoring a critical need for fast and accurate diagnostic tools. In this paper, we propose an interpretable deep learning model on Residual Networks (ResNets) for automatically diagnosing paediatric pneumonia on chest X-rays. We enhance interpretability through Bayesian Gradient-weighted Class Activation Mapping (BayesGrad-CAM), which quantifies uncertainty in visual explanations, and which offers spatial locations accountable for the decision-making process of the model. Our ResNet-50 model, trained on a large paediatric chest X-rays dataset, achieves high classification accuracy (95.94%), AUC-ROC (98.91%), and Cohen’s Kappa (0.913), accompanied by clinically meaningful visual explanations. Our findings demonstrate that high performance and interpretability are not only achievable but critical for clinical AI deployment.

[355] RegScore: Scoring Systems for Regression Tasks

Michal K. Grzeszczyk, Tomasz Szczepański, Pawel Renc, Siyeop Yoon, Jerome Charton, Tomasz Trzciński, Arkadiusz Sitek

Main category: eess.IV

TL;DR: RegScore is a sparse, interpretable scoring system for regression tasks, combining tabular data and medical images for personalized predictions, outperforming black-box models.

DetailsMotivation: To create a transparent and interpretable scoring system for regression tasks in medicine, addressing the limitations of conventional integer-valued scoring systems.

Method: Uses beam search and k-sparse ridge regression to relax integer constraints, integrates tabular data with medical images via TIP transformer, and generates personalized regression parameters.

Result: RegScore and its bimodal extensions perform comparably or better than state-of-the-art black-box models in estimating mean Pulmonary Artery Pressure.

Conclusion: RegScore offers a transparent, interpretable solution for clinical regression tasks, enhancing trust and decision-making.

Abstract: Scoring systems are widely adopted in medical applications for their inherent simplicity and transparency, particularly for classification tasks involving tabular data. In this work, we introduce RegScore, a novel, sparse, and interpretable scoring system specifically designed for regression tasks. Unlike conventional scoring systems constrained to integer-valued coefficients, RegScore leverages beam search and k-sparse ridge regression to relax these restrictions, thus enhancing predictive performance. We extend RegScore to bimodal deep learning by integrating tabular data with medical images. We utilize the classification token from the TIP (Tabular Image Pretraining) transformer to generate Personalized Linear Regression parameters and a Personalized RegScore, enabling individualized scoring. We demonstrate the effectiveness of RegScore by estimating mean Pulmonary Artery Pressure using tabular data and further refine these estimates by incorporating cardiac MRI images. Experimental results show that RegScore and its personalized bimodal extensions achieve performance comparable to, or better than, state-of-the-art black-box models. Our method provides a transparent and interpretable approach for regression tasks in clinical settings, promoting more informed and trustworthy decision-making. We provide our code at https://github.com/SanoScience/RegScore.

[356] Extreme Cardiac MRI Analysis under Respiratory Motion: Results of the CMRxMotion Challenge

Kang Wang, Chen Qin, Zhang Shi, Haoran Wang, Xiwen Zhang, Chen Chen, Cheng Ouyang, Chengliang Dai, Yuanhan Mo, Chenchen Dai, Xutong Kuang, Ruizhe Li, Xin Chen, Xiuzheng Yue, Song Tian, Alejandro Mora-Rubio, Kumaradevan Punithakumar, Shizhan Gong, Qi Dou, Sina Amirrajab, Yasmina Al Khalil, Cian M. Scannell, Lexiaozi Fan, Huili Yang, Xiaowu Sun, Rob van der Geest, Tewodros Weldebirhan Arega, Fabrice Meriaudeau, Caner Özer, Amin Ranem, John Kalkhof, İlkay Öksüz, Anirban Mukhopadhyay, Abdul Qayyum, Moona Mazher, Steven A Niederer, Carles Garcia-Cabrera, Eric Arazo, Michal K. Grzeszczyk, Szymon Płotka, Wanqin Ma, Xiaomeng Li, Rongjun Ge, Yongqing Kou, Xinrong Chen, He Wang, Chengyan Wang, Wenjia Bai, Shuo Wang

Main category: eess.IV

TL;DR: The paper discusses the CMRxMotion challenge, which aimed to evaluate deep learning models for CMR analysis under motion artifacts. It involved tasks like image quality assessment and myocardial segmentation, with 22 algorithms tested.

DetailsMotivation: Deep learning models for CMR analysis struggle with motion artifacts, a common issue in clinical practice. The challenge was organized to address this gap.

Method: A dataset of 320 CMR cine series with induced motion artifacts was curated. Two tasks were designed: image quality classification and robust segmentation.

Result: 22 algorithms were evaluated, with top-performing methods reported. The impact of artifacts on clinical biomarkers was also analyzed.

Conclusion: The challenge and dataset aim to advance research in robust CMR analysis under motion artifacts. All resources are publicly available.

Abstract: Deep learning models have achieved state-of-the-art performance in automated Cardiac Magnetic Resonance (CMR) analysis. However, the efficacy of these models is highly dependent on the availability of high-quality, artifact-free images. In clinical practice, CMR acquisitions are frequently degraded by respiratory motion, yet the robustness of deep learning models against such artifacts remains an underexplored problem. To promote research in this domain, we organized the MICCAI CMRxMotion challenge. We curated and publicly released a dataset of 320 CMR cine series from 40 healthy volunteers who performed specific breathing protocols to induce a controlled spectrum of motion artifacts. The challenge comprised two tasks: 1) automated image quality assessment to classify images based on motion severity, and 2) robust myocardial segmentation in the presence of motion artifacts. A total of 22 algorithms were submitted and evaluated on the two designated tasks. This paper presents a comprehensive overview of the challenge design and dataset, reports the evaluation results for the top-performing methods, and further investigates the impact of motion artifacts on five clinically relevant biomarkers. All resources and code are publicly available at: https://github.com/CMRxMotion

[357] Learned Image Compression with Hierarchical Progressive Context Modeling

Yuqi Li, Haotian Zhang, Li Li, Dong Liu

Main category: eess.IV

TL;DR: The paper introduces a Hierarchical Progressive Context Model (HPCM) for improved context modeling in learned image compression, achieving state-of-the-art performance.

DetailsMotivation: Existing methods struggle with efficiently exploiting long-range dependency and diverse context information across coding steps.

Method: HPCM uses a hierarchical coding schedule for multi-scale latent dependency modeling and a progressive context fusion mechanism to incorporate prior step information.

Result: The method achieves superior rate-distortion performance and balances compression efficiency with computational complexity.

Conclusion: HPCM effectively addresses context modeling limitations, offering a practical solution for learned image compression.

Abstract: Context modeling is essential in learned image compression for accurately estimating the distribution of latents. While recent advanced methods have expanded context modeling capacity, they still struggle to efficiently exploit long-range dependency and diverse context information across different coding steps. In this paper, we introduce a novel Hierarchical Progressive Context Model (HPCM) for more efficient context information acquisition. Specifically, HPCM employs a hierarchical coding schedule to sequentially model the contextual dependencies among latents at multiple scales, which enables more efficient long-range context modeling. Furthermore, we propose a progressive context fusion mechanism that incorporates contextual information from previous coding steps into the current step, effectively exploiting diverse contextual information. Experimental results demonstrate that our method achieves state-of-the-art rate-distortion performance and strikes a better balance between compression performance and computational complexity. The code is available at https://github.com/lyq133/LIC-HPCM.

[358] Enhancing Diabetic Retinopathy Classification Accuracy through Dual Attention Mechanism in Deep Learning

Abdul Hannan, Zahid Mahmood, Rizwan Qureshi, Hazrat Ali

Main category: eess.IV

TL;DR: The paper proposes a deep learning model combining global and category attention blocks to address imbalanced data in Diabetic Retinopathy (DR) classification, achieving competitive performance on two datasets.

DetailsMotivation: Imbalanced data distribution in DR datasets hinders deep learning model generalization, necessitating an effective solution.

Method: The model integrates global attention block (GAB) and category attention block (CAB) using pre-trained networks (MobileNetV3-small, EfficientNet-b0, DenseNet-169) as backbones.

Result: On APTOS, DenseNet-169 achieved 83.20% accuracy; on EYEPACS, EfficientNet-b0 scored 80%. F1-score was 82.0%, with MobileNetV3-small having fewer parameters.

Conclusion: The proposed attention-based model effectively addresses data imbalance, achieving performance comparable to recent DR classification methods.

Abstract: Automatic classification of Diabetic Retinopathy (DR) can assist ophthalmologists in devising personalized treatment plans, making it a critical component of clinical practice. However, imbalanced data distribution in the dataset becomes a bottleneck in the generalization of deep learning models trained for DR classification. In this work, we combine global attention block (GAB) and category attention block (CAB) into the deep learning model, thus effectively overcoming the imbalanced data distribution problem in DR classification. Our proposed approach is based on an attention mechanism-based deep learning model that employs three pre-trained networks, namely, MobileNetV3-small, Efficientnet-b0, and DenseNet-169 as the backbone architecture. We evaluate the proposed method on two publicly available datasets of retinal fundoscopy images for DR. Experimental results show that on the APTOS dataset, the DenseNet-169 yielded 83.20% mean accuracy, followed by the MobileNetV3-small and EfficientNet-b0, which yielded 82% and 80% accuracies, respectively. On the EYEPACS dataset, the EfficientNet-b0 yielded a mean accuracy of 80%, while the DenseNet-169 and MobileNetV3-small yielded 75.43% and 76.68% accuracies, respectively. In addition, we also compute the F1-score of 82.0%, precision of 82.1%, sensitivity of 83.0%, specificity of 95.5%, and a kappa score of 88.2% for the experiments. Moreover, in our work, the MobileNetV3-small has 1.6 million parameters on the APTOS dataset and 0.90 million parameters on the EYEPACS dataset, which is comparatively less than other methods. The proposed approach achieves competitive performance that is at par with recently reported works on DR classification.

[359] SAM2-Aug: Prior knowledge-based Augmentation for Target Volume Auto-Segmentation in Adaptive Radiation Therapy Using Segment Anything Model 2

Guoping Xu, Yan Dai, Hengrui Zhao, Ying Zhang, Jie Deng, Weiguo Lu, You Zhang

Main category: eess.IV

TL;DR: The paper proposes SAM2-Aug, an enhanced version of SAM2 for tumor segmentation in ART, using prior knowledge and prompt robustness strategies, achieving high accuracy across datasets.

DetailsMotivation: Accurate tumor segmentation is crucial for adaptive radiation therapy (ART) but is time-consuming and user-dependent. SAM2 shows promise but lacks tumor accuracy.

Method: Two strategies: (1) prior MR images and annotations as contextual inputs, (2) improved prompt robustness via random bounding box expansion and mask erosion/dilation. SAM2-Aug was fine-tuned and tested on liver, abdomen, and brain datasets.

Result: SAM2-Aug outperformed other models with Dice scores of 0.86 (liver), 0.89 (abdomen), and 0.90 (brain), showing strong generalization and boundary-sensitive performance.

Conclusion: Incorporating prior images and enhancing prompt diversity improves segmentation accuracy. SAM2-Aug is a robust, efficient solution for ART tumor segmentation.

Abstract: Purpose: Accurate tumor segmentation is vital for adaptive radiation therapy (ART) but remains time-consuming and user-dependent. Segment Anything Model 2 (SAM2) shows promise for prompt-based segmentation but struggles with tumor accuracy. We propose prior knowledge-based augmentation strategies to enhance SAM2 for ART. Methods: Two strategies were introduced to improve SAM2: (1) using prior MR images and annotations as contextual inputs, and (2) improving prompt robustness via random bounding box expansion and mask erosion/dilation. The resulting model, SAM2-Aug, was fine-tuned and tested on the One-Seq-Liver dataset (115 MRIs from 31 liver cancer patients), and evaluated without retraining on Mix-Seq-Abdomen (88 MRIs, 28 patients) and Mix-Seq-Brain (86 MRIs, 37 patients). Results: SAM2-Aug outperformed convolutional, transformer-based, and prompt-driven models across all datasets, achieving Dice scores of 0.86(liver), 0.89(abdomen), and 0.90(brain). It demonstrated strong generalization across tumor types and imaging sequences, with improved performance in boundary-sensitive metrics. Conclusions: Incorporating prior images and enhancing prompt diversity significantly boosts segmentation accuracy and generalizability. SAM2-Aug offers a robust, efficient solution for tumor segmentation in ART. Code and models will be released at https://github.com/apple1986/SAM2-Aug.

[360] A multi-dynamic low-rank deep image prior (ML-DIP) for real-time 3D cardiovascular MRI

Chong Chen, Marc Vornehm, Preethi Chandrasekaran, Muhammad A. Sultan, Syed M. Arshad, Yingmin Liu, Yuchi Han, Rizwan Ahmad

Main category: eess.IV

TL;DR: A framework (ML-DIP) for 3D real-time CMR reconstruction from undersampled data without needing fully sampled training data.

DetailsMotivation: To enable high-quality 3D real-time CMR without relying on fully sampled datasets, addressing limitations of existing methods.

Method: ML-DIP uses separate neural networks for spatial and temporal modeling, optimized per scan for reconstruction from undersampled k-space data.

Result: Achieved high PSNR (>29 dB) and SSIM (>0.90) in phantoms, comparable functional measurements in healthy subjects, and preserved beat-to-beat variability in PVC patients.

Conclusion: ML-DIP enables high-quality 3D real-time CMR with high acceleration factors, learning directly from undersampled data.

Abstract: Purpose: To develop a reconstruction framework for 3D real-time cine cardiovascular magnetic resonance (CMR) from highly undersampled data without requiring fully sampled training data. Methods: We developed a multi-dynamic low-rank deep image prior (ML-DIP) framework that models spatial image content and temporal deformation fields using separate neural networks. These networks are optimized per scan to reconstruct the dynamic image series directly from undersampled k-space data. ML-DIP was evaluated on (i) a 3D cine digital phantom with simulated premature ventricular contractions (PVCs), (ii) ten healthy subjects (including two scanned during both rest and exercise), and (iii) five patients with PVCs. Phantom results were assessed using peak signal-to-noise ratio (PSNR) and structural similarity index measure (SSIM). In vivo performance was evaluated by comparing left-ventricular function quantification (against 2D real-time cine) and image quality (against 2D real-time cine and binning-based 5D-Cine). Results: In the phantom study, ML-DIP achieved PSNR > 29 dB and SSIM > 0.90 for scan times as short as two minutes, while recovering cardiac motion, respiratory motion, and PVC events. In healthy subjects, ML-DIP yielded functional measurements comparable to 2D cine and higher image quality than 5D-Cine, including during exercise with high heart rates and bulk motion. In PVC patients, ML-DIP preserved beat-to-beat variability and reconstructed irregular beats, whereas 5D-Cine showed motion artifacts and information loss due to binning. Conclusion: ML-DIP enables high-quality 3D real-time CMR with acceleration factors exceeding 1,000 by learning low-rank spatial and temporal representations from undersampled data, without relying on external fully sampled training datasets.

[361] Framework of a multiscale data-driven DT of the musculoskeletal system

Martina Paccini, Simone Cammarasana, Giuseppe Patanè

Main category: eess.IV

TL;DR: The paper introduces the Musculoskeletal Digital Twin (MS-DT), a framework integrating multiscale biomechanical data for personalized MSD assessment and treatment.

DetailsMotivation: MSDs are a major cause of disability, requiring advanced tools for personalized care. The DT paradigm is ideal for managing heterogeneous data in MSD treatment.

Method: The MS-DT integrates motion capture, ultrasound imaging, electromyography, and medical imaging to create patient-specific musculoskeletal models. An interactive visualization platform aids analysis.

Result: MS-DT effectively extracts precise kinematic and dynamic tissue features, aiding spine biomechanics monitoring and rehabilitation.

Conclusion: The MS-DT framework offers high-fidelity modeling and real-time visualization, enhancing patient-specific diagnosis and intervention planning.

Abstract: Musculoskeletal disorders (MSDs) are a leading cause of disability worldwide, requiring advanced diagnostic and therapeutic tools for personalised assessment and treatment. Effective management of MSDs involves the interaction of heterogeneous data sources, making the Digital Twin (DT) paradigm a valuable option. This paper introduces the Musculoskeletal Digital Twin (MS-DT), a novel framework that integrates multiscale biomechanical data with computational modelling to create a detailed, patient-specific representation of the musculoskeletal system. By combining motion capture, ultrasound imaging, electromyography, and medical imaging, the MS-DT enables the analysis of spinal kinematics, posture, and muscle function. An interactive visualisation platform provides clinicians and researchers with an intuitive interface for exploring biomechanical parameters and tracking patient-specific changes. Results demonstrate the effectiveness of MS-DT in extracting precise kinematic and dynamic tissue features, offering a comprehensive tool for monitoring spine biomechanics and rehabilitation. This framework provides high-fidelity modelling and real-time visualization to improve patient-specific diagnosis and intervention planning.

[362] A Study of Anatomical Priors for Deep Learning-Based Segmentation of Pheochromocytoma in Abdominal CT

Tanjin Taher Toma, Tejas Sudharshan Mathai, Bikash Santra, Pritam Mukherjee, Jianfei Liu, Wesley Jong, Darwish Alabyad, Vivek Batheja, Abhishek Jha, Mayank Patel, Darko Pucar, Jayadira del Rivero, Karel Pacak, Ronald M. Summers

Main category: eess.IV

TL;DR: The study evaluates anatomical priors for improving deep learning-based PCC segmentation in CT scans, finding the Tumor + Kidney + Aorta (TKA) strategy most effective.

DetailsMotivation: Accurate PCC segmentation aids tumor burden estimation, prognosis, treatment planning, and reduces reliance on expensive genetic testing.

Method: Used nnU-Net to test 11 annotation strategies, including novel multi-class schemes based on organ-specific anatomical priors, and compared them against a body-region prior. Evaluated on 105 CT scans from 91 patients.

Result: TKA annotation outperformed others in DSC, NSD, and F1 score, with superior tumor burden quantification and robustness in cross-validation.

Conclusion: Incorporating relevant anatomical context enhances PCC segmentation, supporting clinical assessment and monitoring.

Abstract: Accurate segmentation of pheochromocytoma (PCC) in abdominal CT scans is essential for tumor burden estimation, prognosis, and treatment planning. It may also help infer genetic clusters, reducing reliance on expensive testing. This study systematically evaluates anatomical priors to identify configurations that improve deep learning-based PCC segmentation. We employed the nnU-Net framework to evaluate eleven annotation strategies for accurate 3D segmentation of pheochromocytoma, introducing a set of novel multi-class schemes based on organ-specific anatomical priors. These priors were derived from adjacent organs commonly surrounding adrenal tumors (e.g., liver, spleen, kidney, aorta, adrenal gland, and pancreas), and were compared against a broad body-region prior used in previous work. The framework was trained and tested on 105 contrast-enhanced CT scans from 91 patients at the NIH Clinical Center. Performance was measured using Dice Similarity Coefficient (DSC), Normalized Surface Distance (NSD), and instance-wise F1 score. Among all strategies, the Tumor + Kidney + Aorta (TKA) annotation achieved the highest segmentation accuracy, significantly outperforming the previously used Tumor + Body (TB) annotation across DSC (p = 0.0097), NSD (p = 0.0110), and F1 score (25.84% improvement at an IoU threshold of 0.5), measured on a 70-30 train-test split. The TKA model also showed superior tumor burden quantification (R^2 = 0.968) and strong segmentation across all genetic subtypes. In five-fold cross-validation, TKA consistently outperformed TB across IoU thresholds (0.1 to 0.5), reinforcing its robustness and generalizability. These findings highlight the value of incorporating relevant anatomical context into deep learning models to achieve precise PCC segmentation, offering a valuable tool to support clinical assessment and longitudinal disease monitoring in PCC patients.

[363] MLRU++: Multiscale Lightweight Residual UNETR++ with Attention for Efficient 3D Medical Image Segmentation

Nand Kumar Yadav, Rodrigue Rizk, William CW Chen, KC Santosh

Main category: eess.IV

TL;DR: MLRU++ is a lightweight hybrid CNN-Transformer architecture for medical image segmentation, balancing accuracy and efficiency with innovations like LCBAM and M2B. It outperforms existing models on benchmarks while reducing computational costs.

DetailsMotivation: Medical image segmentation faces challenges like anatomical variability and high computational demands. Hybrid CNN-Transformer models improve accuracy but add complexity, necessitating a lightweight yet effective solution.

Method: Proposes MLRU++, featuring Lightweight Channel and Bottleneck Attention Module (LCBAM) for efficient feature encoding and Multiscale Bottleneck Block (M2B) for multi-resolution detail capture.

Result: Achieves state-of-the-art Dice scores (e.g., 87.57% on Synapse, 93.00% on ACDC) and reduces parameters/computational costs compared to leading models.

Conclusion: MLRU++ provides a practical, high-performance solution for 3D medical image segmentation, validated by ablation studies and benchmark results.

Abstract: Accurate and efficient medical image segmentation is crucial but challenging due to anatomical variability and high computational demands on volumetric data. Recent hybrid CNN-Transformer architectures achieve state-of-the-art results but add significant complexity. In this paper, we propose MLRU++, a Multiscale Lightweight Residual UNETR++ architecture designed to balance segmentation accuracy and computational efficiency. It introduces two key innovations: a Lightweight Channel and Bottleneck Attention Module (LCBAM) that enhances contextual feature encoding with minimal overhead, and a Multiscale Bottleneck Block (M2B) in the decoder that captures fine-grained details via multi-resolution feature aggregation. Experiments on four publicly available benchmark datasets (Synapse, BTCV, ACDC, and Decathlon Lung) demonstrate that MLRU++ achieves state-of-the-art performance, with average Dice scores of 87.57% (Synapse), 93.00% (ACDC), and 81.12% (Lung). Compared to existing leading models, MLRU++ improves Dice scores by 5.38% and 2.12% on Synapse and ACDC, respectively, while significantly reducing parameter count and computational cost. Ablation studies evaluating LCBAM and M2B further confirm the effectiveness of the proposed architectural components. Results suggest that MLRU++ offers a practical and high-performing solution for 3D medical image segmentation tasks. Source code is available at: https://github.com/1027865/MLRUPP

[364] Improving Multislice Electron Ptychography with a Generative Prior

Christian K. Belardi, Chia-Hao Lee, Yingheng Wang, Justin Lovelace, Kilian Q. Weinberger, David A. Muller, Carla P. Gomes

Main category: eess.IV

TL;DR: MEP-Diffusion, a diffusion model trained on crystal structures, enhances multislice electron ptychography (MEP) reconstructions, improving quality by 90.50% in SSIM.

DetailsMotivation: Existing iterative algorithms for MEP are slow and produce suboptimal solutions due to the ill-posed nature of the problem.

Method: Developed MEP-Diffusion, a diffusion model trained on crystal structures, integrated with iterative solvers via Diffusion Posterior Sampling (DPS).

Result: Achieved a 90.50% improvement in SSIM over existing methods.

Conclusion: MEP-Diffusion significantly enhances reconstruction quality, offering a promising hybrid approach for MEP.

Abstract: Multislice electron ptychography (MEP) is an inverse imaging technique that computationally reconstructs the highest-resolution images of atomic crystal structures from diffraction patterns. Available algorithms often solve this inverse problem iteratively but are both time consuming and produce suboptimal solutions due to their ill-posed nature. We develop MEP-Diffusion, a diffusion model trained on a large database of crystal structures specifically for MEP to augment existing iterative solvers. MEP-Diffusion is easily integrated as a generative prior into existing reconstruction methods via Diffusion Posterior Sampling (DPS). We find that this hybrid approach greatly enhances the quality of the reconstructed 3D volumes, achieving a 90.50% improvement in SSIM over existing methods.

[365] Benchmarking of Deep Learning Methods for Generic MRI Multi-Organ Abdominal Segmentation

Deepa Krishnaswamy, Cosmin Ciausu, Steve Pieper, Ron Kikinis, Benjamin Billot, Andrey Fedorov

Main category: eess.IV

TL;DR: A benchmarking study compares three MRI abdominal segmentation models and introduces ABDSynth, a CT-trained alternative, evaluating their accuracy and generalizability across diverse datasets.

DetailsMotivation: MRI segmentation is challenging due to signal variability and annotation effort, limiting generalizability of existing tools.

Method: Benchmarked three state-of-the-art models (MRSegmentator, MRISegmentator-Abdomen, TotalSegmentator MRI) and introduced ABDSynth, trained on CT data. Evaluated on three public datasets covering various MRI sequences and conditions.

Result: MRSegmentator performed best in accuracy and generalizability, while ABDSynth, though less accurate, offers a viable alternative with lower annotation requirements.

Conclusion: MRSegmentator is the top performer, but ABDSynth is a practical option when annotation resources are limited. Tools and datasets are shared for future benchmarking.

Abstract: Recent advances in deep learning have led to robust automated tools for segmentation of abdominal computed tomography (CT). Meanwhile, segmentation of magnetic resonance imaging (MRI) is substantially more challenging due to the inherent signal variability and the increased effort required for annotating training datasets. Hence, existing approaches are trained on limited sets of MRI sequences, which might limit their generalizability. To characterize the landscape of MRI abdominal segmentation tools, we present here a comprehensive benchmarking of the three state-of-the-art and open-source models: MRSegmentator, MRISegmentator-Abdomen, and TotalSegmentator MRI. Since these models are trained using labor-intensive manual annotation cycles, we also introduce and evaluate ABDSynth, a SynthSeg-based model purely trained on widely available CT segmentations (no real images). More generally, we assess accuracy and generalizability by leveraging three public datasets (not seen by any of the evaluated methods during their training), which span all major manufacturers, five MRI sequences, as well as a variety of subject conditions, voxel resolutions, and fields-of-view. Our results reveal that MRSegmentator achieves the best performance and is most generalizable. In contrast, ABDSynth yields slightly less accurate results, but its relaxed requirements in training data make it an alternative when the annotation budget is limited. The evaluation code and datasets are given for future benchmarking at https://github.com/deepakri201/AbdoBench, along with inference code and weights for ABDSynth.

[366] Learned Single-Pixel Fluorescence Microscopy

Serban C. Tudosie, Valerio Gandolfi, Shivaprasad Varakkoth, Andrea Farina, Cosimo D’Andrea, Simon Arridge

Main category: eess.IV

TL;DR: The paper introduces a self-supervised autoencoder to improve single-pixel imaging in fluorescence microscopy, achieving faster reconstruction, better quality, and enabling multispectral imaging.

DetailsMotivation: Single-pixel imaging in fluorescence microscopy requires fast and high-quality reconstruction. Traditional methods like total variation minimisation are limited, and data-driven approaches can enhance performance.

Method: An autoencoder is trained through self-supervision to learn an encoder (measurement matrix) and decoder. The encoder is integrated into the physical device for data acquisition.

Result: The method reduces reconstruction time by two orders of magnitude, improves image quality, and enables multispectral reconstructions.

Conclusion: Learned single-pixel fluorescence microscopy can advance diagnosis and biological research by providing cost-effective multispectral imaging.

Abstract: Single-pixel imaging has emerged as a key technique in fluorescence microscopy, where fast acquisition and reconstruction are crucial. In this context, images are reconstructed from linearly compressed measurements. In practice, total variation minimisation is still used to reconstruct the image from noisy measurements of the inner product between orthogonal sampling pattern vectors and the original image data. However, data can be leveraged to learn the measurement vectors and the reconstruction process, thereby enhancing compression, reconstruction quality, and speed. We train an autoencoder through self-supervision to learn an encoder (or measurement matrix) and a decoder. We then test it on physically acquired multispectral and intensity data. During acquisition, the learned encoder becomes part of the physical device. Our approach can enhance single-pixel imaging in fluorescence microscopy by reducing reconstruction time by two orders of magnitude, achieving superior image quality, and enabling multispectral reconstructions. Ultimately, learned single-pixel fluorescence microscopy could advance diagnosis and biological research, providing multispectral imaging at a fraction of the cost.

[367] RealDeal: Enhancing Realism and Details in Brain Image Generation via Image-to-Image Diffusion Models

Shen Zhu, Yinzhu Jin, Tyler Spears, Ifrah Zawar, P. Thomas Fletcher

Main category: eess.IV

TL;DR: The paper proposes image-to-image diffusion models to enhance realism and details in generated brain images, addressing the smoothness issue in latent diffusion models (LDMs).

DetailsMotivation: Latent diffusion models (LDMs) generate overly smooth brain MRIs, lacking fine anatomical details and realistic noise. This work aims to improve realism and detail in LDM-generated images.

Method: The authors formulate the enhancement process as image-to-image diffusion models, refining LDM outputs. Metrics like FID and LPIPS are used, alongside new metrics for noise, sharpness, and texture.

Result: The proposed method improves realism and detail in generated brain images, as validated by standard and novel metrics.

Conclusion: The image-to-image diffusion approach effectively enhances the realism and detail of LDM-generated brain images, addressing key limitations.

Abstract: We propose image-to-image diffusion models that are designed to enhance the realism and details of generated brain images by introducing sharp edges, fine textures, subtle anatomical features, and imaging noise. Generative models have been widely adopted in the biomedical domain, especially in image generation applications. Latent diffusion models achieve state-of-the-art results in generating brain MRIs. However, due to latent compression, generated images from these models are overly smooth, lacking fine anatomical structures and scan acquisition noise that are typically seen in real images. This work formulates the realism enhancing and detail adding process as image-to-image diffusion models, which refines the quality of LDM-generated images. We employ commonly used metrics like FID and LPIPS for image realism assessment. Furthermore, we introduce new metrics to demonstrate the realism of images generated by RealDeal in terms of image noise distribution, sharpness, and texture.

[368] Dealing with Segmentation Errors in Needle Reconstruction for MRI-Guided Brachytherapy

Vangelis Kostoulas, Arthur Guijt, Ellen M. Kerkhof, Bradley R. Pieters, Peter A. N. Bosman, Tanja Alderliesten

Main category: eess.IV

TL;DR: The paper proposes adaptations to post-processing techniques in brachytherapy needle reconstruction to handle segmentation errors, improving accuracy in needle localization.

DetailsMotivation: Manual needle annotation in image-guided brachytherapy is time-consuming and challenging, and existing post-processing methods lack robustness against segmentation errors.

Method: A two-stage pipeline (segmentation followed by post-processing) is adapted to better manage segmentation errors, tested on a prostate cancer MRI dataset.

Result: The adapted technique achieved median errors of 1.07 mm (tip), 0.43 mm (bottom), and 0.75 mm (shaft), with no false positives/negatives on 261 needles.

Conclusion: The proposed adaptations effectively improve needle reconstruction accuracy by addressing segmentation errors.

Abstract: Brachytherapy involves bringing a radioactive source near tumor tissue using implanted needles. Image-guided brachytherapy planning requires amongst others, the reconstruction of the needles. Manually annotating these needles on patient images can be a challenging and time-consuming task for medical professionals. For automatic needle reconstruction, a two-stage pipeline is commonly adopted, comprising a segmentation stage followed by a post-processing stage. While deep learning models are effective for segmentation, their results often contain errors. No currently existing post-processing technique is robust to all possible segmentation errors. We therefore propose adaptations to existing post-processing techniques mainly aimed at dealing with segmentation errors and thereby improving the reconstruction accuracy. Experiments on a prostate cancer dataset, based on MRI scans annotated by medical professionals, demonstrate that our proposed adaptations can help to effectively manage segmentation errors, with the best adapted post-processing technique achieving median needle-tip and needle-bottom point localization errors of $1.07$ (IQR $\pm 1.04$) mm and $0.43$ (IQR $\pm 0.46$) mm, respectively, and median shaft error of $0.75$ (IQR $\pm 0.69$) mm with 0 false positive and 0 false negative needles on a test set of 261 needles.

[369] Dual Path Learning – learning from noise and context for medical image denoising

Jitindra Fartiyal, Pedro Freire, Yasmeen Whayeb, James S. Wolffsohn, Sergei K. Turitsyn, Sergei G. Sokolov

Main category: eess.IV

TL;DR: The paper introduces a Dual-Pathway Learning (DPL) model for medical image denoising, combining noise and contextual information, and shows improved performance across multiple modalities and noise types.

DetailsMotivation: Noise in medical imaging degrades quality and affects diagnosis. Existing methods are limited to single modalities or noise types.

Method: Proposes DPL, a dual-pathway model integrating noise and context, evaluated on multiple modalities and noise types.

Result: DPL improves PSNR by 3.35% over UNet on Gaussian noise and generalizes well across modalities.

Conclusion: DPL is a robust and generalizable solution for medical image denoising.

Abstract: Medical imaging plays a critical role in modern healthcare, enabling clinicians to accurately diagnose diseases and develop effective treatment plans. However, noise, often introduced by imaging devices, can degrade image quality, leading to misinterpretation and compromised clinical outcomes. Existing denoising approaches typically rely either on noise characteristics or on contextual information from the image. Moreover, they are commonly developed and evaluated for a single imaging modality and noise type. Motivated by Geng et.al CNCL, which integrates both noise and context, this study introduces a Dual-Pathway Learning (DPL) model architecture that effectively denoises medical images by leveraging both sources of information and fusing them to generate the final output. DPL is evaluated across multiple imaging modalities and various types of noise, demonstrating its robustness and generalizability. DPL improves PSNR by 3.35% compared to the baseline UNet when evaluated on Gaussian noise and trained across all modalities. The code is available at 10.5281/zenodo.15836053.

[370] Joint Holistic and Lesion Controllable Mammogram Synthesis via Gated Conditional Diffusion Model

Xin Li, Kaixiang Yang, Qiang Li, Zhiwei Wang

Main category: eess.IV

TL;DR: GCDM is a novel framework for synthesizing mammograms and localized lesions using a gated conditional diffusion model, improving realism and diversity.

DetailsMotivation: Addressing limitations in data availability and diversity for deep-learning in mammography by enhancing lesion-specific feature synthesis.

Method: GCDM uses a latent denoising diffusion framework with soft mask embeddings and a gated conditioning branch to guide lesion synthesis.

Result: GCDM achieves precise lesion control and realistic mammogram synthesis, outperforming current methods.

Conclusion: GCDM is a promising tool for clinical mammogram synthesis, with potential for large-scale applications.

Abstract: Mammography is the most commonly used imaging modality for breast cancer screening, driving an increasing demand for deep-learning techniques to support large-scale analysis. However, the development of accurate and robust methods is often limited by insufficient data availability and a lack of diversity in lesion characteristics. While generative models offer a promising solution for data synthesis, current approaches often fail to adequately emphasize lesion-specific features and their relationships with surrounding tissues. In this paper, we propose Gated Conditional Diffusion Model (GCDM), a novel framework designed to jointly synthesize holistic mammogram images and localized lesions. GCDM is built upon a latent denoising diffusion framework, where the noised latent image is concatenated with a soft mask embedding that represents breast, lesion, and their transitional regions, ensuring anatomical coherence between them during the denoising process. To further emphasize lesion-specific features, GCDM incorporates a gated conditioning branch that guides the denoising process by dynamically selecting and fusing the most relevant radiomic and geometric properties of lesions, effectively capturing their interplay. Experimental results demonstrate that GCDM achieves precise control over small lesion areas while enhancing the realism and diversity of synthesized mammograms. These advancements position GCDM as a promising tool for clinical applications in mammogram synthesis. Our code is available at https://github.com/lixinHUST/Gated-Conditional-Diffusion-Model/

[371] A Self-training Framework for Semi-supervised Pulmonary Vessel Segmentation and Its Application in COPD

Shuiqing Zhao, Meihuan Wang, Jiaxuan Xu, Jie Feng, Wei Qian, Rongchang Chen, Zhenyu Liang, Shouliang Qi, Yanan Wu

Main category: eess.IV

TL;DR: A semi-supervised method (Semi2) improves pulmonary vessel segmentation in COPD patients using a teacher-student model, achieving 90.3% precision.

DetailsMotivation: Accurate segmentation of pulmonary vessels, especially smaller ones, is crucial for COPD analysis.

Method: A self-training framework with a teacher-student model generates pseudo-labels for unlabeled CT images, iteratively improving segmentation.

Result: Semi2 achieves 90.3% precision, a 2.3% improvement, and provides insights into vessel differences across COPD severity.

Conclusion: The method enhances vessel segmentation and is applicable to COPD analysis; code is publicly available.

Abstract: Background: It is fundamental for accurate segmentation and quantification of the pulmonary vessel, particularly smaller vessels, from computed tomography (CT) images in chronic obstructive pulmonary disease (COPD) patients. Objective: The aim of this study was to segment the pulmonary vasculature using a semi-supervised method. Methods: In this study, a self-training framework is proposed by leveraging a teacher-student model for the segmentation of pulmonary vessels. First, the high-quality annotations are acquired in the in-house data by an interactive way. Then, the model is trained in the semi-supervised way. A fully supervised model is trained on a small set of labeled CT images, yielding the teacher model. Following this, the teacher model is used to generate pseudo-labels for the unlabeled CT images, from which reliable ones are selected based on a certain strategy. The training of the student model involves these reliable pseudo-labels. This training process is iteratively repeated until an optimal performance is achieved. Results: Extensive experiments are performed on non-enhanced CT scans of 125 COPD patients. Quantitative and qualitative analyses demonstrate that the proposed method, Semi2, significantly improves the precision of vessel segmentation by 2.3%, achieving a precision of 90.3%. Further, quantitative analysis is conducted in the pulmonary vessel of COPD, providing insights into the differences in the pulmonary vessel across different severity of the disease. Conclusion: The proposed method can not only improve the performance of pulmonary vascular segmentation, but can also be applied in COPD analysis. The code will be made available at https://github.com/wuyanan513/semi-supervised-learning-for-vessel-segmentation.

[372] Reconstruct or Generate: Exploring the Spectrum of Generative Modeling for Cardiac MRI

Niklas Bubeck, Yundi Zhang, Suprosanna Shit, Daniel Rueckert, Jiazhen Pan

Main category: eess.IV

TL;DR: The paper analyzes generative models in medical imaging, comparing latent diffusion and autoregressive models for reconstruction and generation tasks. Diffusion models excel in perceptual quality for generation but hallucinate with high masking, while autoregressive models are stable but less faithful.

DetailsMotivation: To understand how generative models balance reconstruction (data fidelity) and generation (perceptual quality/diversity) in medical imaging, given their shared architectures but differing priorities.

Method: Introduces a ‘generative model zoo’ to systematically evaluate latent diffusion and autoregressive models on cardiac imaging tasks, including inpainting with varying masking ratios and unconditional generation.

Result: Diffusion models perform better in unconditional generation but hallucinate with high masking. Autoregressive models are stable across masking levels but have lower fidelity.

Conclusion: The choice of generative model depends on the task’s priority: diffusion for perceptual quality in generation, autoregressive for stable reconstruction.

Abstract: In medical imaging, generative models are increasingly relied upon for two distinct but equally critical tasks: reconstruction, where the goal is to restore medical imaging (usually inverse problems like inpainting or superresolution), and generation, where synthetic data is created to augment datasets or carry out counterfactual analysis. Despite shared architecture and learning frameworks, they prioritize different goals: generation seeks high perceptual quality and diversity, while reconstruction focuses on data fidelity and faithfulness. In this work, we introduce a “generative model zoo” and systematically analyze how modern latent diffusion models and autoregressive models navigate the reconstruction-generation spectrum. We benchmark a suite of generative models across representative cardiac medical imaging tasks, focusing on image inpainting with varying masking ratios and sampling strategies, as well as unconditional image generation. Our findings show that diffusion models offer superior perceptual quality for unconditional generation but tend to hallucinate as masking ratios increase, whereas autoregressive models maintain stable perceptual performance across masking levels, albeit with generally lower fidelity.

[373] Unstable Prompts, Unreliable Segmentations: A Challenge for Longitudinal Lesion Analysis

Niels Rocholl, Ewoud Smit, Mathias Prokop, Alessa Hering

Main category: eess.IV

TL;DR: The paper evaluates the ULS23 segmentation model for longitudinal lesion tracking, revealing its failure due to registration errors and lesion displacement, advocating for integrated temporal models.

DetailsMotivation: To assess the performance of single-timepoint lesion segmentation models in longitudinal contexts, identifying their limitations for oncological care.

Method: Used a public clinical dataset of baseline and follow-up CT scans, evaluated ULS23’s segmentation and tracking, and conducted controlled experiments with displaced lesions.

Result: The model’s performance degrades sharply in follow-up cases due to registration errors and fails with displaced lesions, highlighting its unsuitability for longitudinal data.

Conclusion: Robust longitudinal lesion tracking requires end-to-end models designed for temporal analysis, not cascaded single-timepoint tools.

Abstract: Longitudinal lesion analysis is crucial for oncological care, yet automated tools often struggle with temporal consistency. While universal lesion segmentation models have advanced, they are typically designed for single time points. This paper investigates the performance of the ULS23 segmentation model in a longitudinal context. Using a public clinical dataset of baseline and follow-up CT scans, we evaluated the model’s ability to segment and track lesions over time. We identified two critical, interconnected failure modes: a sharp degradation in segmentation quality in follow-up cases due to inter-scan registration errors, and a subsequent breakdown of the lesion correspondence process. To systematically probe this vulnerability, we conducted a controlled experiment where we artificially displaced the input volume relative to the true lesion center. Our results demonstrate that the model’s performance is highly dependent on its assumption of a centered lesion; segmentation accuracy collapses when the lesion is sufficiently displaced. These findings reveal a fundamental limitation of applying single-timepoint models to longitudinal data. We conclude that robust oncological tracking requires a paradigm shift away from cascading single-purpose tools towards integrated, end-to-end models inherently designed for temporal analysis.

[374] NerT-CA: Efficient Dynamic Reconstruction from Sparse-view X-ray Coronary Angiography

Kirsten W. H. Maas, Danny Ruijters, Nicola Pezzotti, Anna Vilanova

Main category: eess.IV

TL;DR: NerT-CA proposes a hybrid neural and tensorial method for faster and more accurate 4D coronary artery reconstruction from sparse X-ray angiograms.

DetailsMotivation: Improving 3D/4D coronary artery reconstruction from sparse X-ray angiography by addressing challenges like sparsity, poor contrast, and motion, while reducing reliance on manual or error-prone methods.

Method: Combines low-rank tensorial fields for static reconstruction and neural fields for dynamic reconstruction, building on NeRF-based approaches.

Result: Outperforms prior methods in training speed and accuracy, achieving reasonable reconstructions from just three angiogram views.

Conclusion: NerT-CA offers a promising solution for efficient and accurate 4D coronary artery reconstruction in clinical settings.

Abstract: Three-dimensional (3D) and dynamic 3D+time (4D) reconstruction of coronary arteries from X-ray coronary angiography (CA) has the potential to improve clinical procedures. However, there are multiple challenges to be addressed, most notably, blood-vessel structure sparsity, poor background and blood vessel distinction, sparse-views, and intra-scan motion. State-of-the-art reconstruction approaches rely on time-consuming manual or error-prone automatic segmentations, limiting clinical usability. Recently, approaches based on Neural Radiance Fields (NeRF) have shown promise for automatic reconstructions in the sparse-view setting. However, they suffer from long training times due to their dependence on MLP-based representations. We propose NerT-CA, a hybrid approach of Neural and Tensorial representations for accelerated 4D reconstructions with sparse-view CA. Building on top of the previous NeRF-based work, we model the CA scene as a decomposition of low-rank and sparse components, utilizing fast tensorial fields for low-rank static reconstruction and neural fields for dynamic sparse reconstruction. Our approach outperforms previous works in both training time and reconstruction accuracy, yielding reasonable reconstructions from as few as three angiogram views. We validate our approach quantitatively and qualitatively on representative 4D phantom datasets.

Last updated: 2025-08-22
Built with Hugo, theme modified on Stack