Daily arXiv Papers - 2025-09-04

AI-enhanced summaries of research papers from arXiv

Today’s Research Highlights

AI-enhanced summaries of the latest research papers from arXiv.

Table of Contents

cs.CL

[1] DrDiff: Dynamic Routing Diffusion with Hierarchical Attention for Breaking the Efficiency-Quality Trade-off

Jusheng Zhang, Yijia Fan, Kaitong Cai, Zimeng Huang, Xiaofei Sun, Jian Wang, Chengpei Tang, Keze Wang

Main category: cs.CL

TL;DR: DrDiff is a novel framework for efficient long-text generation using dynamic expert scheduling, hierarchical sparse attention, and soft absorption guidance to overcome the efficiency-quality trade-off.

Motivation: To address the efficiency-quality trade-off in long-text generation by developing a framework that can handle varying text complexity while maintaining high performance and reducing computational costs.

Method: Three core technologies: 1) Dynamic expert scheduling for intelligent computational resource allocation, 2) Hierarchical Sparse Attention (HSA) that reduces complexity from O(n²) to O(n), 3) Soft absorption guidance optimization with DPM-solver++ to reduce diffusion steps.

Result: Comprehensive experiments on various long-text generation benchmarks demonstrate superiority over existing state-of-the-art methods.

Conclusion: DrDiff successfully overcomes the efficiency-quality trade-off in long-text generation through its innovative three-component framework, achieving both high performance and computational efficiency.

Abstract: This paper introduces DrDiff, a novel framework for long-text generation that overcomes the efficiency-quality trade-off through three core technologies. First, we design a dynamic expert scheduling mechanism that intelligently allocates computational resources during the diffusion process based on text complexity, enabling more efficient handling of text generation tasks of varying difficulty. Second, we introduce a Hierarchical Sparse Attention (HSA) mechanism that adaptively adjusts attention patterns according to a variety of input lengths, reducing computational complexity from $O(n^2)$ to $O(n)$ while maintaining model performance. Finally, we propose a soft absorption guidance optimization strategy that combines with DPM-solver++ to reduce diffusion steps, significantly improving generation speed. Comprehensive experiments on various long-text generation benchmarks demonstrate the superiority of our DrDiff over the existing SOTA methods.
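
To make the complexity claim concrete, here is a toy PyTorch sketch of a hierarchical sparse attention mask combining a local window with strided global tokens. This generic pattern is an assumption for illustration (the paper's length-adaptive HSA design is not detailed in this summary), and a dense mask is used only for clarity; a true O(n) implementation would rely on block-sparse kernels.

```python
# Toy local-window + global-token sparse attention mask (an assumed pattern,
# not the paper's exact HSA design).
import torch

def hsa_mask(n: int, window: int = 4, global_stride: int = 8) -> torch.Tensor:
    """Boolean (n, n) mask: True where attention is allowed."""
    idx = torch.arange(n)
    # Each query attends to a local window of neighbors.
    local = (idx[:, None] - idx[None, :]).abs() <= window
    # Strided "summary" positions attend globally and are globally attended.
    is_global = (idx % global_stride == 0)
    return local | is_global[None, :] | is_global[:, None]

def sparse_attention(q, k, v, mask):
    scores = q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5
    scores = scores.masked_fill(~mask, float("-inf"))
    return torch.softmax(scores, dim=-1) @ v

n, d = 32, 16
q = k = v = torch.randn(n, d)
out = sparse_attention(q, k, v, hsa_mask(n))
print(out.shape)  # torch.Size([32, 16])
```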

[2] SSVD: Structured SVD for Parameter-Efficient Fine-Tuning and Benchmarking under Domain Shift in ASR

Pu Wang, Shinji Watanabe, Hugo Van hamme

Main category: cs.CL

TL;DR: Comprehensive benchmarking of state-of-the-art PEFT methods (VeRA, DoRA, PiSSA, SVFT) for speech recognition, with introduction of novel SSVD method for efficient domain adaptation.

Motivation: Existing PEFT variants are mainly developed for language/vision tasks with limited validation in speech applications, creating a gap in speech-specific parameter-efficient fine-tuning research.

Method: Integrated and benchmarked multiple PEFT methods in ESPnet framework, introduced structured SVD-guided (SSVD) fine-tuning that selectively rotates input-associated singular vectors while keeping output vectors fixed to preserve semantic mappings.

Result: Evaluated on domain-shifted speech recognition tasks (child speech, dialectal variation) across model scales from 0.1B to 2B parameters, showing robust domain adaptation with minimal trainable parameters and improved efficiency.

Conclusion: Provides first comprehensive PEFT benchmarking for speech tasks, releases all implementations in ESPnet to support reproducibility and future research in speech-specific parameter-efficient fine-tuning.

Abstract: Parameter-efficient fine-tuning (PEFT) has emerged as a scalable solution for adapting large foundation models. While low-rank adaptation (LoRA) is widely used in speech applications, its state-of-the-art variants, e.g., VeRA, DoRA, PiSSA, and SVFT, are developed mainly for language and vision tasks, with limited validation in speech. This work presents the first comprehensive integration and benchmarking of these PEFT methods within ESPnet. We further introduce structured SVD-guided (SSVD) fine-tuning, which selectively rotates input-associated right singular vectors while keeping output-associated vectors fixed to preserve semantic mappings. This design enables robust domain adaptation with minimal trainable parameters and improved efficiency. We evaluate all methods on domain-shifted speech recognition tasks, including child speech and dialectal variation, across model scales from 0.1B to 2B. All implementations are released in ESPnet to support reproducibility and future work.
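
A minimal PyTorch sketch of the SSVD idea as summarized above: freeze U and S from the pretrained weight's SVD and learn only a small orthogonal rotation of the input-side (right) singular vectors. The rank, the matrix-exponential parametrization, and the residual handling are illustrative assumptions, not the authors' exact recipe.

```python
import torch
import torch.nn as nn

class SSVDLinear(nn.Module):
    def __init__(self, weight: torch.Tensor, rank: int = 8):
        super().__init__()
        U, S, Vh = torch.linalg.svd(weight, full_matrices=False)
        # Frozen factors from the pretrained weight.
        self.register_buffer("U", U[:, :rank])
        self.register_buffer("S", S[:rank])
        self.register_buffer("V", Vh[:rank, :].T)  # (in_dim, rank)
        # Skew-symmetric generator -> matrix_exp yields an orthogonal rotation.
        self.A = nn.Parameter(torch.zeros(rank, rank))
        self.register_buffer(
            "residual", weight - self.U @ torch.diag(self.S) @ self.V.T)

    def forward(self, x):
        R = torch.matrix_exp(self.A - self.A.T)          # orthogonal (rank, rank)
        W = self.U @ torch.diag(self.S) @ (self.V @ R).T + self.residual
        return x @ W.T

layer = SSVDLinear(torch.randn(64, 64))
print(layer(torch.randn(2, 64)).shape)  # torch.Size([2, 64])
```

At initialization the rotation is the identity, so the adapted layer reproduces the pretrained weight exactly.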

[3] Clustering Discourses: Racial Biases in Short Stories about Women Generated by Large Language Models

Gustavo Bonil, João Gondim, Marina dos Santos, Simone Hashiguti, Helena Maia, Nadia Silva, Helio Pedrini, Sandra Avila

Main category: cs.CL

TL;DR: LLaMA 3.2-3B generates Portuguese short stories that reinforce colonial stereotypes about Black and white women through three main narrative patterns: social overcoming, ancestral mythification, and subjective self-realization.

Motivation: To investigate how large language models construct narratives about race and gender, specifically examining how seemingly neutral AI-generated texts can perpetuate historical inequalities and colonial framing of female bodies.

Method: Generated 2100 Portuguese short stories using LLaMA 3.2-3B, applied computational methods to group semantically similar stories, and conducted qualitative discourse analysis on selected narratives.

Result: The analysis revealed three dominant discursive representations that materialize a crystallized, colonially structured framing of the female body, demonstrating how grammatically coherent AI texts can reinforce historical racial and gender inequalities.

Conclusion: The study proposes an integrated approach combining machine learning with qualitative discourse analysis to critically examine AI-generated content and uncover hidden biases in language models.

Abstract: This study investigates how large language models, in particular LLaMA 3.2-3B, construct narratives about Black and white women in short stories generated in Portuguese. From 2100 texts, we applied computational methods to group semantically similar stories, allowing a selection for qualitative analysis. Three main discursive representations emerge: social overcoming, ancestral mythification and subjective self-realization. The analysis uncovers how grammatically coherent, seemingly neutral texts materialize a crystallized, colonially structured framing of the female body, reinforcing historical inequalities. The study proposes an integrated approach, that combines machine learning techniques with qualitative, manual discourse analysis.
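
The "group semantically similar stories" step could look like the following sketch, assuming sentence embeddings plus k-means; the authors' exact embedding model and clustering algorithm are not specified in this summary, and the multilingual model name below is an assumption chosen because the stories are in Portuguese.

```python
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans

stories = ["...story 1...", "...story 2...", "...story 3..."]  # placeholder texts
model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")
embeddings = model.encode(stories)

kmeans = KMeans(n_clusters=3, random_state=0, n_init="auto").fit(embeddings)
# Cluster labels guide which narratives are sampled for qualitative analysis.
print(kmeans.labels_)
```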

[4] IDEAlign: Comparing Large Language Models to Human Experts in Open-ended Interpretive Annotations

Hyunji Nam, Lucia Langlois, James Malamut, Mei Tan, Dorottya Demszky

Main category: cs.CL

TL;DR: IDEAlgin: A novel benchmarking paradigm using “pick-the-odd-one-out” triplet judgments to evaluate LLM-generated interpretive annotations against expert human ratings, showing LLM-as-judge significantly outperforms traditional metrics.

Motivation: Evaluating LLM-generated interpretive annotations (like thematic analysis or educational feedback) against expert human judgments is challenging at scale, with no existing validated measures for similarity in ideas.

Method: Proposed IDEAlgin benchmark using triplet judgment tasks to capture expert similarity ratings, then evaluated various similarity metrics (vector-based, LLM-as-judge) against human benchmarks on two educational datasets.

Result: Vector-based metrics largely failed to capture nuanced expert similarity dimensions. LLM-as-judge via IDEAlgin showed 9-30% improvement in alignment with expert judgments compared to traditional lexical and vector-based metrics.

Conclusion: IDEAlgin establishes a scalable paradigm for evaluating LLMs against open-ended expert annotations, supporting responsible LLM deployment in education and other interpretive domains.

Abstract: Large language models (LLMs) are increasingly applied to open-ended, interpretive annotation tasks, such as thematic analysis by researchers or generating feedback on student work by teachers. These tasks involve free-text annotations requiring expert-level judgments grounded in specific objectives (e.g., research questions or instructional goals). Evaluating whether LLM-generated annotations align with those generated by expert humans is challenging to do at scale, and currently, no validated, scalable measure of similarity in ideas exists. In this paper, we (i) introduce the scalable evaluation of interpretive annotation by LLMs as a critical and understudied task, (ii) propose IDEAlgin, an intuitive benchmarking paradigm for capturing expert similarity ratings via a “pick-the-odd-one-out” triplet judgment task, and (iii) evaluate various similarity metrics, including vector-based ones (topic models, embeddings) and LLM-as-a-judge via IDEAlgin, against these human benchmarks. Applying this approach to two real-world educational datasets (interpretive analysis and feedback generation), we find that vector-based metrics largely fail to capture the nuanced dimensions of similarity meaningful to experts. Prompting LLMs via IDEAlgin significantly improves alignment with expert judgments (9-30% increase) compared to traditional lexical and vector-based metrics. These results establish IDEAlgin as a promising paradigm for evaluating LLMs against open-ended expert annotations at scale, informing responsible deployment of LLMs in education and beyond.
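
The "pick-the-odd-one-out" protocol reduces to a simple agreement computation once a candidate similarity metric is fixed. The sketch below, with illustrative similarity matrices and expert picks, shows how a metric's triplet choices are scored against human judgments.

```python
import numpy as np

def odd_one_out(sim: np.ndarray) -> int:
    """sim is a 3x3 similarity matrix; return the least-related item's index."""
    totals = sim.sum(axis=1) - np.diag(sim)  # similarity to the other two items
    return int(np.argmin(totals))

def triplet_alignment(triplet_sims, human_choices) -> float:
    hits = [odd_one_out(s) == h for s, h in zip(triplet_sims, human_choices)]
    return sum(hits) / len(hits)

# Two toy triplets with cosine-style similarities and expert picks.
sims = [np.array([[1.0, 0.9, 0.2], [0.9, 1.0, 0.3], [0.2, 0.3, 1.0]]),
        np.array([[1.0, 0.1, 0.8], [0.1, 1.0, 0.2], [0.8, 0.2, 1.0]])]
print(triplet_alignment(sims, human_choices=[2, 1]))  # 1.0
```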

[5] A-SEA3L-QA: A Fully Automated Self-Evolving, Adversarial Workflow for Arabic Long-Context Question-Answer Generation

Kesen Wang, Daulet Toibazar, Pedro J. Moreno

Main category: cs.CL

TL;DR: Self-evolving adversarial workflow for Arabic long-context QA generation using specialized LVLMs that iteratively improve without human intervention, outperforming static pipelines.

Motivation: To enhance long-context comprehension capabilities of Arabic LVLMs by creating an automated system that can generate and refine context-aware questions at customizable difficulty levels.

Method: Orchestrates multiple specialized LVLMs (question generator, evaluator, answer generator swarm) in a closed-loop cycle that refines performance through automated re-generation and model updates based on quality metrics.

Result: Substantially outperforms static pipelines, boosts long-context comprehension of leading Arabic LVLMs, and produces AraLongBench - a large-scale Arabic benchmark with hundreds of pages of challenges.

Conclusion: The self-evolving adversarial workflow enables continuous learning and customizable difficulty levels, significantly advancing Arabic long-context QA generation capabilities through fully automated processes.

Abstract: We present an end-to-end, self-evolving adversarial workflow for long-context Question-Answer (QA) Generation in Arabic. By orchestrating multiple specialized LVLMs: a question generator, an evaluator, and a swarm of answer generators, our system iteratively refines its own performance without any human intervention. Starting from raw, multi-page Arabic documents across diverse domains, the question generator produces fine-grained, context-aware queries to be tackled by the answer generator swarm, and the evaluator assesses and feeds back quality metrics. This closed-loop cycle enables continuous learning: low-confidence outputs trigger automated re-generation and model updates, progressively enhancing question difficulty and relevance. Moreover, we set the quality metrics as a tunable hyperparameter, enabling question generation at controllable and customizable difficulty levels. We release AraLongBench, a large-scale Arabic benchmark of single- and multi-page challenges spanning hundreds of pages, and demonstrate that our self-evolving workflow substantially outperforms static pipelines, markedly boosting the long-context comprehension capabilities of leading Arabic Large Vision Language Models (LVLMs). Lastly, we also meticulously architect a fully automated agentic workflow for long-context Arabic document collection.
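
The closed loop described above can be summarized as a control-flow skeleton. All four helper functions are hypothetical stand-ins for the specialized LVLM calls, and the quality threshold plays the role of the tunable difficulty hyperparameter mentioned in the abstract.

```python
def generate_questions(docs):       # stub: question-generator LVLM
    return [f"What does document {i} discuss?" for i in range(len(docs))]

def answer_swarm(question, docs):   # stub: answer-generator swarm
    return "candidate answer"

def evaluate(question, answer, docs):  # stub: evaluator LVLM quality score
    return 0.9

def update_models(feedback):        # stub: re-generation / model update
    pass

def self_evolving_qa(docs, quality_threshold=0.8, max_rounds=3):
    qa_pairs = []
    for _ in range(max_rounds):
        for q in generate_questions(docs):
            a = answer_swarm(q, docs)
            score = evaluate(q, a, docs)
            if score >= quality_threshold:  # threshold doubles as difficulty knob
                qa_pairs.append((q, a))
            else:                           # low confidence triggers feedback
                update_models((q, a, score))
    return qa_pairs

print(len(self_evolving_qa(["doc A", "doc B"])))  # 6: 2 docs x 3 rounds
```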

[6] Advancing Minority Stress Detection with Transformers: Insights from the Social Media Datasets

Santosh Chapagain, Cory J Cascalheira, Shah Muhammad Hamdi, Soukaina Filali Boubrahimi, Jillian R. Scheer

Main category: cs.CL

TL;DR: Transformer models with graph augmentation outperform traditional methods for detecting minority stress in online discourse, showing improved ability to identify key linguistic markers through social connectivity modeling.

Motivation: Sexual and gender minority groups experience disproportionately high rates of poor health outcomes due to minority stress, but current methods for detecting this stress in online discourse need improvement.

Method: Benchmarked transformer models (ELECTRA, BERT, RoBERTa, BART) against traditional ML baselines and graph-augmented variants, using zero-shot and few-shot learning on Reddit corpora with 12,645 and 5,789 posts over five random seeds.

Result: Graph structure integration consistently improved detection performance across transformer models, with supervised fine-tuning outperforming zero and few-shot approaches. Graph augmentation helped identify key linguistic markers like identity concealment and internalized stigma.

Conclusion: Graph-enhanced transformers provide the most reliable foundation for digital health interventions and public health policy by effectively modeling social connectivity and conversational context for minority stress detection.

Abstract: Individuals from sexual and gender minority groups experience disproportionately high rates of poor health outcomes and mental disorders compared to their heterosexual and cisgender counterparts, largely as a consequence of minority stress as described by Meyer’s (2003) model. This study presents the first comprehensive evaluation of transformer-based architectures for detecting minority stress in online discourse. We benchmark multiple transformer models including ELECTRA, BERT, RoBERTa, and BART against traditional machine learning baselines and graph-augmented variants. We further assess zero-shot and few-shot learning paradigms to assess their applicability on underrepresented datasets. Experiments are conducted on the two largest publicly available Reddit corpora for minority stress detection, comprising 12,645 and 5,789 posts, and are repeated over five random seeds to ensure robustness. Our results demonstrate that integrating graph structure consistently improves detection performance across transformer-only models and that supervised fine-tuning with relational context outperforms zero and few-shot approaches. Theoretical analysis reveals that modeling social connectivity and conversational context via graph augmentation sharpens the models’ ability to identify key linguistic markers such as identity concealment, internalized stigma, and calls for support, suggesting that graph-enhanced transformers offer the most reliable foundation for digital health interventions and public health policy.
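
One simple way to realize the graph augmentation described above is to mix each post's transformer embedding with the mean embedding of its neighbors (e.g., replies in the same thread) before classification. The sketch below assumes this single round of message passing; the paper's exact graph architecture is not specified in this summary.

```python
import torch
import torch.nn as nn

class GraphAugmentedClassifier(nn.Module):
    def __init__(self, dim: int = 768, n_classes: int = 2):
        super().__init__()
        self.classifier = nn.Linear(2 * dim, n_classes)

    def forward(self, post_emb: torch.Tensor, adj: torch.Tensor):
        # adj: (n, n) binary adjacency from reply/thread structure.
        deg = adj.sum(dim=1, keepdim=True).clamp(min=1)
        neighbor_mean = (adj @ post_emb) / deg  # one round of message passing
        return self.classifier(torch.cat([post_emb, neighbor_mean], dim=-1))

n, d = 5, 768
embs = torch.randn(n, d)                  # e.g., [CLS] embeddings from RoBERTa
adj = (torch.rand(n, n) > 0.5).float()
logits = GraphAugmentedClassifier(d)(embs, adj)
print(logits.shape)  # torch.Size([5, 2])
```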

[7] English Pronunciation Evaluation without Complex Joint Training: LoRA Fine-tuned Speech Multimodal LLM

Taekyung Ahn, Hosung Nam

Main category: cs.CL

TL;DR: MLLM adapted via LoRA can perform both APA and MDD simultaneously without complex architectural changes, achieving strong correlation with human scores and low error rates.

Motivation: To create an integrated pronunciation assessment system that eliminates the need for separate training procedures for Automatic Pronunciation Assessment (APA) and Mispronunciation Detection and Diagnosis (MDD).

Method: Fine-tuned Microsoft’s Phi-4-multimodal-instruct using Low-Rank Adaptation (LoRA) on the Speechocean762 dataset, focusing only on LoRA layers rather than full fine-tuning.

Result: Achieved strong Pearson Correlation Coefficient (PCC > 0.7) with human scores, low Word Error Rate and Phoneme Error Rate (both < 0.15), with performance comparable to full audio layer fine-tuning.

Conclusion: LoRA-based adaptation enables efficient integrated pronunciation assessment without full fine-tuning, paving the way for more accessible Computer-Assisted Pronunciation Training technologies.

Abstract: This study demonstrates that a Multimodal Large Language Model (MLLM) adapted via Low-Rank Adaptation (LoRA) can perform both Automatic Pronunciation Assessment (APA) and Mispronunciation Detection and Diagnosis (MDD) simultaneously. Leveraging Microsoft’s Phi-4-multimodal-instruct, our fine-tuning method eliminates the need for complex architectural changes or separate training procedures conventionally required for these distinct tasks. Fine-tuned on the Speechocean762 dataset, the pronunciation evaluation scores predicted by the model exhibited a strong Pearson Correlation Coefficient (PCC > 0.7) with human-assigned scores, while achieving low Word Error Rate (WER) and Phoneme Error Rate (PER) (both < 0.15). Notably, fine-tuning only the LoRA layers was sufficient to achieve performance levels comparable to those achieved by fine-tuning all audio layers. This research highlights that an integrated pronunciation assessment system can be established by adapting large multimodal models without full fine-tuning, utilizing a significantly simpler training methodology compared to previous joint models designed for simultaneous APA and MDD. This efficient LoRA-based approach paves the way for more accessible, integrated, and effective Computer-Assisted Pronunciation Training (CAPT) technologies for English L2 learners.
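
For reference, LoRA adaptation of this kind is typically set up with the Hugging Face peft library. The sketch below is a plausible configuration; the rank and target modules are illustrative assumptions rather than the authors' exact settings.

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

model = AutoModelForCausalLM.from_pretrained(
    "microsoft/Phi-4-multimodal-instruct", trust_remote_code=True
)
config = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # assumed; adjust to the model's layers
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, config)
model.print_trainable_parameters()  # only the LoRA layers train; backbone frozen
```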

[8] Decoding the Rule Book: Extracting Hidden Moderation Criteria from Reddit Communities

Youngwoo Kim, Himanshu Beniwal, Steven L. Johnson, Thomas Hartvigsen

Main category: cs.CL

TL;DR: Novel interpretable method extracts implicit moderation criteria from historical data using lexical score tables, matching neural model performance while providing transparency into community-specific enforcement patterns.

Motivation: Online communities operate with diverse implicit moderation standards that lack explicit classification criteria, making content moderation systems difficult to understand and compare across different communities.

Method: Interpretable architecture that extracts moderation criteria as score tables of lexical expressions associated with content removal from historical moderation data, enabling systematic comparison across communities.

Result: Extracted lexical patterns effectively replicate neural moderation model performance while providing transparent insights. Criteria matrix reveals significant variations in enforcement of shared norms, community-specific language tolerances, topical restrictions, and subcategories of toxic speech.

Conclusion: The approach successfully identifies and extracts implicit moderation criteria, providing both performance comparable to neural models and valuable transparency into community decision-making processes and enforcement variations.

Abstract: Effective content moderation systems require explicit classification criteria, yet online communities like subreddits often operate with diverse, implicit standards. This work introduces a novel approach to identify and extract these implicit criteria from historical moderation data using an interpretable architecture. We represent moderation criteria as score tables of lexical expressions associated with content removal, enabling systematic comparison across different communities. Our experiments demonstrate that these extracted lexical patterns effectively replicate the performance of neural moderation models while providing transparent insights into decision-making processes. The resulting criteria matrix reveals significant variations in how seemingly shared norms are actually enforced, uncovering previously undocumented moderation patterns including community-specific tolerances for language, features for topical restrictions, and underlying subcategories of the toxic speech classification.
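
As an analogy to the score-table idea, a linear model over n-gram features yields per-expression removal weights that can be read off and compared across communities. The toy sklearn sketch below illustrates this; it is not the paper's exact architecture, and the posts and labels are fabricated examples.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

posts = ["you are an idiot", "great point, thanks", "spam link here", "thanks a lot"]
removed = [1, 0, 1, 0]  # toy moderation labels

vec = CountVectorizer(ngram_range=(1, 2))
X = vec.fit_transform(posts)
clf = LogisticRegression().fit(X, removed)

# Highest-weighted expressions approximate the community's removal criteria.
table = sorted(zip(vec.get_feature_names_out(), clf.coef_[0]), key=lambda t: -t[1])
for term, w in table[:5]:
    print(f"{term:15s} {w:+.2f}")
```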

[9] Comparison of End-to-end Speech Assessment Models for the NOCASA 2025 Challenge

Aleksei Žavoronkov, Tanel Alumäe

Main category: cs.CL

TL;DR: Analysis of three end-to-end models for Norwegian pronunciation assessment in children, with GOP-CTC model achieving best results and top leaderboard performance.

Motivation: To develop effective automatic pronunciation assessment models for children learning Norwegian as a second language, addressing the NOCASA 2025 Challenge requirements.

Method: Three approaches: encoder-decoder Siamese architecture (E2E-R), prefix-tuned classification using wav2vec2.0, and novel alignment-free GOP features via CTC with weighted ordinal cross-entropy loss.

Result: GOP-CTC-based model achieved highest performance, substantially surpassing challenge baselines and attaining top leaderboard scores.

Conclusion: The integration of alignment-free GOP features computed via CTC with weighted ordinal loss proved most effective for Norwegian pronunciation assessment in children.

Abstract: This paper presents an analysis of three end-to-end models developed for the NOCASA 2025 Challenge, aimed at automatic word-level pronunciation assessment for children learning Norwegian as a second language. Our models include an encoder-decoder Siamese architecture (E2E-R), a prefix-tuned direct classification model leveraging pretrained wav2vec2.0 representations, and a novel model integrating alignment-free goodness-of-pronunciation (GOP) features computed via CTC. We introduce a weighted ordinal cross-entropy loss tailored for optimizing metrics such as unweighted average recall and mean absolute error. Among the explored methods, our GOP-CTC-based model achieved the highest performance, substantially surpassing challenge baselines and attaining top leaderboard scores.
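
A plausible form of the weighted ordinal cross-entropy is ordinary class-weighted CE plus a penalty on probability mass placed far from the true ordinal level, as sketched below in PyTorch; the authors' exact formulation may differ.

```python
import torch
import torch.nn.functional as F

def weighted_ordinal_ce(logits, targets, class_weights, distance_scale=0.5):
    probs = F.softmax(logits, dim=-1)
    ce = F.cross_entropy(logits, targets, weight=class_weights, reduction="none")
    # |k - y| distance of every class from the target, per sample.
    classes = torch.arange(logits.shape[-1], device=logits.device)
    dist = (classes[None, :] - targets[:, None]).abs().float()
    ordinal = (probs * dist).sum(dim=-1)  # expected distance under the model
    return (ce + distance_scale * ordinal).mean()

logits = torch.randn(4, 5, requires_grad=True)  # 5 ordinal score levels
targets = torch.tensor([0, 2, 4, 3])
w = torch.tensor([2.0, 1.0, 1.0, 1.0, 2.0])     # up-weight rare extreme scores
loss = weighted_ordinal_ce(logits, targets, w)
loss.backward()
print(loss.item())
```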

[10] ProMQA-Assembly: Multimodal Procedural QA Dataset on Assembly

Kimihiro Hasegawa, Wiradee Imrattanatrai, Masaki Asada, Susan Holm, Yuran Wang, Vincent Zhou, Ken Fukuda, Teruko Mitamura

Main category: cs.CL

TL;DR: ProMQA-Assembly is a new multimodal QA dataset for evaluating assembly task assistants, featuring 391 QA pairs combining human activity recordings and instruction manuals, with semi-automated annotation using LLMs and human verification.

Motivation: There is a lack of testbeds for practical evaluation of assembly task assistants, despite their potential benefits in both everyday and industrial settings.

Method: Semi-automated QA annotation approach where LLMs generate candidate questions and humans verify them, integrated with fine-grained action labels and instruction task graphs for toy vehicle assembly tasks.

Result: Benchmarking experiments with competitive proprietary multimodal models show significant room for improvement, indicating the dataset effectively evaluates current limitations.

Conclusion: The ProMQA-Assembly dataset provides a valuable evaluation framework that can contribute to the development of more effective procedural-activity assistants for assembly tasks.

Abstract: Assistants on assembly tasks have a large potential to benefit humans from everyday tasks to industrial settings. However, no testbeds support application-oriented system evaluation in a practical setting, especially in assembly. To foster the development, we propose a new multimodal QA dataset on assembly activities. Our dataset, ProMQA-Assembly, consists of 391 QA pairs that require the multimodal understanding of human-activity recordings and their instruction manuals in an online-style manner. In the development, we adopt a semi-automated QA annotation approach, where LLMs generate candidates and humans verify them, as a cost-effective method, and further improve it by integrating fine-grained action labels to diversify question types. Furthermore, we create instruction task graphs for the target tasks of assembling toy vehicles. These newly created task graphs are used in our benchmarking experiment, as well as to facilitate the human verification process in the QA annotation. Utilizing our dataset, we benchmark models, including competitive proprietary multimodal models. Our results suggest great room for improvement for the current models. We believe our new evaluation dataset can contribute to the further development of procedural-activity assistants.

[11] Mitigating Data Imbalance in Automated Speaking Assessment

Fong-Chun Tsai, Kuan-Tang Huang, Bi-Cheng Yan, Tien-Hong Lo, Berlin Chen

Main category: cs.CL

TL;DR: BLV loss improves automated speaking assessment by addressing class imbalance through prediction perturbation, enhancing accuracy and fairness without dataset modification.

Motivation: Automated Speaking Assessment models often suffer from class imbalance issues that lead to biased predictions against minority proficiency classes, limiting their effectiveness for diverse L2 learners.

Method: Proposed Balancing Logit Variation (BLV) loss function that perturbs model predictions to improve feature representation for minority classes without requiring dataset modifications or resampling.

Result: Evaluation on ICNALE benchmark dataset showed that integrating BLV loss into BERT-based models significantly improved both classification accuracy and fairness metrics.

Conclusion: The BLV loss provides an effective solution for class imbalance in automated speaking assessment, making speech evaluation more robust and equitable for diverse language learners without dataset manipulation.

Abstract: Automated Speaking Assessment (ASA) plays a crucial role in evaluating second-language (L2) learners’ proficiency. However, ASA models often suffer from class imbalance, leading to biased predictions. To address this, we introduce a novel objective for training ASA models, dubbed the Balancing Logit Variation (BLV) loss, which perturbs model predictions to improve feature representation for minority classes without modifying the dataset. Evaluations on the ICNALE benchmark dataset show that integrating the BLV loss into a celebrated text-based (BERT) model significantly enhances classification accuracy and fairness, making automated speech evaluation more robust for diverse learners.
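
One way a BLV-style objective can be realized is by injecting class-dependent Gaussian noise into the logits during training, with larger perturbations for rarer proficiency levels. The sketch below is a hedged illustration of that idea, not the paper's exact loss; the noise scaling is an assumption.

```python
import torch
import torch.nn.functional as F

def blv_cross_entropy(logits, targets, class_counts, sigma=1.0):
    # Rarer classes get larger logit perturbations (assumed scaling).
    freq = class_counts / class_counts.sum()
    noise_scale = sigma * (1.0 - freq)          # (n_classes,)
    noise = torch.randn_like(logits) * noise_scale
    return F.cross_entropy(logits + noise, targets)

logits = torch.randn(8, 4, requires_grad=True)  # 4 proficiency levels
targets = torch.randint(0, 4, (8,))
counts = torch.tensor([500., 300., 150., 50.])  # imbalanced level counts
loss = blv_cross_entropy(logits, targets, counts)
loss.backward()
print(loss.item())
```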

[12] DiaCBT: A Long-Periodic Dialogue Corpus Guided by Cognitive Conceptualization Diagram for CBT-based Psychological Counseling

Yougen Zhou, Ningning Zhou, Qin Chen, Jie Zhou, Aimin Zhou, Liang He

Main category: cs.CL

TL;DR: DiaCBT dataset enables LLMs to provide CBT-based psychotherapy through multi-session dialogues with cognitive conceptualization diagrams, improving access to mental health services.

Motivation: Mental health services face limited accessibility due to stigma and therapist shortages. LLMs could expand access but lack proper psychological conversation datasets for effective psychotherapy guidance.

Method: Constructed a long-periodic dialogue corpus based on cognitive behavioral therapy (CBT) with multiple counseling sessions per case. Incorporated cognitive conceptualization diagrams (CCDs) to guide client simulation across diverse scenarios.

Result: Trained counseling model demonstrates enhanced ability to emulate psychologists with CBT expertise. Comprehensive evaluation framework shows effectiveness against established psychological criteria for CBT-based counseling.

Conclusion: DiaCBT dataset effectively enhances LLMs’ capabilities in providing professional CBT-based counseling, showing strong potential for training more effective psychotherapy conversational agents.

Abstract: Psychotherapy reaches only a small fraction of individuals suffering from mental disorders due to social stigma and the limited availability of therapists. Large language models (LLMs), when equipped with professional psychotherapeutic skills, offer a promising solution to expand access to mental health services. However, the lack of psychological conversation datasets presents significant challenges in developing effective psychotherapy-guided conversational agents. In this paper, we construct a long-periodic dialogue corpus for counseling based on cognitive behavioral therapy (CBT). Our curated dataset includes multiple sessions for each counseling and incorporates cognitive conceptualization diagrams (CCDs) to guide client simulation across diverse scenarios. To evaluate the utility of our dataset, we train an in-depth counseling model and present a comprehensive evaluation framework to benchmark it against established psychological criteria for CBT-based counseling. Results demonstrate that DiaCBT effectively enhances LLMs’ ability to emulate psychologists with CBT expertise, underscoring its potential for training more professional counseling agents.

[13] Training LLMs to be Better Text Embedders through Bidirectional Reconstruction

Chang Su, Dengliang Shi, Siyuan Huang, Jintao Du, Changhua Meng, Yu Cheng, Weiqiang Wang, Zhouhan Lin

Main category: cs.CL

TL;DR: The paper proposes a new training stage to improve LLM-based text embeddings by enriching the semantics of the final token embedding through bidirectional generative reconstruction tasks, achieving state-of-the-art results on MTEB benchmark.

Motivation: Existing LLM-based text embedding approaches use the final token embedding (like [EOS]), which has not been intentionally trained to capture whole-context semantics, limiting performance in retrieval and re-ranking tasks.

Method: Adds a pre-contrastive learning training stage with bidirectional generative reconstruction tasks (EBQ2D and EBD2Q) that interleave to anchor the [EOS] embedding and reconstruct Query-Document pairs.

Result: Significantly improves LLM performance on Massive Text Embedding Benchmark (MTEB), achieving new state-of-the-art results across different LLM base models and scales.

Conclusion: The proposed additional training stage with bidirectional reconstruction tasks effectively enhances the semantic representation capability of final token embeddings in LLMs for text embedding tasks.

Abstract: Large language models (LLMs) have increasingly been explored as powerful text embedders. Existing LLM-based text embedding approaches often leverage the embedding of the final token, typically a reserved special token such as [EOS]. However, these tokens have not been intentionally trained to capture the semantics of the whole context, limiting their capacity as text embeddings, especially for retrieval and re-ranking tasks. We propose to add a new training stage before contrastive learning to enrich the semantics of the final token embedding. This stage employs bidirectional generative reconstruction tasks, namely EBQ2D (Embedding-Based Query-to-Document) and EBD2Q (Embedding-Based Document-to-Query), which interleave to anchor the [EOS] embedding and reconstruct either side of Query-Document pairs. Experimental results demonstrate that our additional training stage significantly improves LLM performance on the Massive Text Embedding Benchmark (MTEB), achieving new state-of-the-art results across different LLM base models and scales.
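
The bidirectional reconstruction stage can be pictured as follows: take the final-token ([EOS]) state of one side of a query-document pair, prepend it as a soft prompt, and train the model to generate the other side. The sketch below uses a tiny GRU as a stand-in decoder; all dimensions and details are illustrative, not the paper's implementation.

```python
import torch
import torch.nn as nn

vocab, d = 1000, 64
embed = nn.Embedding(vocab, d)
lm = nn.GRU(d, d, batch_first=True)  # stand-in for the LLM decoder
head = nn.Linear(d, vocab)

def reconstruction_loss(src_ids, tgt_ids):
    # 1. Encode the source; take its final-token ([EOS]) state as the embedding.
    h_src, _ = lm(embed(src_ids))
    eos_emb = h_src[:, -1:, :]                       # (B, 1, d)
    # 2. Prepend that embedding and reconstruct the target from it.
    inputs = torch.cat([eos_emb, embed(tgt_ids[:, :-1])], dim=1)
    h_tgt, _ = lm(inputs)
    logits = head(h_tgt)
    return nn.functional.cross_entropy(
        logits.reshape(-1, vocab), tgt_ids.reshape(-1))

q = torch.randint(0, vocab, (2, 8))    # query token ids
doc = torch.randint(0, vocab, (2, 16)) # document token ids
loss = reconstruction_loss(q, doc) + reconstruction_loss(doc, q)  # EBQ2D + EBD2Q
loss.backward()
print(loss.item())
```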

[14] AgenTracer: Who Is Inducing Failure in the LLM Agentic Systems?

Guibin Zhang, Junhao Wang, Junjie Chen, Wangchunshu Zhou, Kun Wang, Shuicheng Yan

Main category: cs.CL

TL;DR: AgenTracer is an automated framework for diagnosing failures in multi-agent LLM systems through counterfactual replay and fault injection, achieving superior performance over large proprietary models.

Motivation: LLM-based agentic systems are complex and fragile, making failure attribution challenging. Current reasoning LLMs perform poorly (<10% accuracy) at identifying which agent or step causes errors in multi-agent execution traces.

Method: Proposed AgenTracer framework uses counterfactual replay and programmed fault injection to create TracerTraj dataset. Then trains AgenTracer-8B model with multi-granular reinforcement learning for efficient error diagnosis.

Result: AgenTracer-8B outperforms giant proprietary LLMs (Gemini-2.5-Pro, Claude-4-Sonnet) by up to 18.18% on Who&When benchmark. Provides 4.8-14.2% performance gains for systems like MetaGPT and MaAS.

Conclusion: AgenTracer sets new standard for LLM agentic failure attribution and enables self-correcting, self-evolving agentic AI through actionable feedback.

Abstract: Large Language Model (LLM)-based agentic systems, often comprising multiple models, complex tool invocations, and orchestration protocols, substantially outperform monolithic agents. Yet this very sophistication amplifies their fragility, making them more prone to system failure. Pinpointing the specific agent or step responsible for an error within long execution traces defines the task of agentic system failure attribution. Current state-of-the-art reasoning LLMs, however, remain strikingly inadequate for this challenge, with accuracy generally below 10%. To address this gap, we propose AgenTracer, the first automated framework for annotating failed multi-agent trajectories via counterfactual replay and programmed fault injection, producing the curated dataset TracerTraj. Leveraging this resource, we develop AgenTracer-8B, a lightweight failure tracer trained with multi-granular reinforcement learning, capable of efficiently diagnosing errors in verbose multi-agent interactions. On the Who&When benchmark, AgenTracer-8B outperforms giant proprietary LLMs like Gemini-2.5-Pro and Claude-4-Sonnet by up to 18.18%, setting a new standard in LLM agentic failure attribution. More importantly, AgenTracer-8B delivers actionable feedback to off-the-shelf multi-agent systems like MetaGPT and MaAS with 4.8-14.2% performance gains, empowering self-correcting and self-evolving agentic AI.
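
Counterfactual replay itself has a simple skeleton: correct one step at a time and re-run; the earliest correction that flips the outcome marks the decisive error. In the sketch below, replay(), the trace format, and the corrected_step function are hypothetical stand-ins for the framework's internals.

```python
def attribute_failure(trace, replay, corrected_step):
    """trace: list of (agent, action); replay: runs a trace, True on success."""
    for i, (agent, action) in enumerate(trace):
        candidate = trace[:i] + [(agent, corrected_step(agent, action))] + trace[i + 1:]
        if replay(candidate):        # failure disappears -> step i was decisive
            return i, agent
    return None, None                # no single-step fix found

# Toy usage: step 1 of the "planner" is the injected fault.
toy_trace = [("retriever", "ok"), ("planner", "bad-plan"), ("coder", "ok")]
replay = lambda t: all(a != "bad-plan" for _, a in t)
fix = lambda agent, action: "ok"
print(attribute_failure(toy_trace, replay, fix))  # (1, 'planner')
```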

[15] FELLE: Autoregressive Speech Synthesis with Token-Wise Coarse-to-Fine Flow Matching

Hui Wang, Shujie Liu, Lingwei Meng, Jinyu Li, Yifan Yang, Shiwan Zhao, Haiyang Sun, Yanqing Liu, Haoqin Sun, Jiaming Zhou, Yan Lu, Yong Qin

Main category: cs.CL

TL;DR: FELLE integrates autoregressive language modeling with token-wise flow matching to predict continuous-valued mel-spectrograms, improving temporal coherence and synthesis quality through hierarchical generation.

Motivation: To advance continuous-valued token modeling and enforce temporal coherence in speech synthesis by combining the strengths of autoregressive language models and flow matching techniques.

Method: FELLE modifies flow matching’s prior distribution by incorporating information from previous steps for each continuous-valued token. It uses a coarse-to-fine flow-matching mechanism that generates tokens hierarchically conditioned on the language model’s output.

Result: Experimental results show significant improvements in TTS generation quality, demonstrating the potential of incorporating flow-matching techniques in autoregressive mel-spectrogram modeling.

Conclusion: The integration of flow matching with autoregressive language modeling effectively enhances continuous-valued token prediction and temporal coherence, leading to improved speech synthesis quality.

Abstract: To advance continuous-valued token modeling and temporal-coherence enforcement, we propose FELLE, an autoregressive model that integrates language modeling with token-wise flow matching. By leveraging the autoregressive nature of language models and the generative efficacy of flow matching, FELLE effectively predicts continuous-valued tokens (mel-spectrograms). For each continuous-valued token, FELLE modifies the general prior distribution in flow matching by incorporating information from the previous step, improving coherence and stability. Furthermore, to enhance synthesis quality, FELLE introduces a coarse-to-fine flow-matching mechanism, generating continuous-valued tokens hierarchically, conditioned on the language model’s output. Experimental results demonstrate the potential of incorporating flow-matching techniques in autoregressive mel-spectrogram modeling, leading to significant improvements in TTS generation quality, as shown in https://aka.ms/felle.
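
A hedged sketch of the core training step: conditional flow matching where the source sample is centered on the previous mel frame (an informed prior rather than pure noise) and the velocity network is conditioned on the language model's output. Shapes and the network are placeholders, not FELLE's actual architecture.

```python
import torch
import torch.nn as nn

mel_dim, cond_dim = 80, 256
net = nn.Sequential(nn.Linear(mel_dim + cond_dim + 1, 512), nn.SiLU(),
                    nn.Linear(512, mel_dim))  # predicts velocity v(x_t, t, cond)

def fm_loss(x1, prev_frame, cond):
    """x1: target mel frame; the prior is informed by the previous frame."""
    x0 = prev_frame + torch.randn_like(x1)   # informed prior, not pure noise
    t = torch.rand(x1.shape[0], 1)
    xt = (1 - t) * x0 + t * x1               # linear interpolation path
    v_target = x1 - x0                       # constant velocity along the path
    v_pred = net(torch.cat([xt, cond, t], dim=-1))
    return ((v_pred - v_target) ** 2).mean()

b = 4
loss = fm_loss(torch.randn(b, mel_dim), torch.randn(b, mel_dim),
               torch.randn(b, cond_dim))
loss.backward()
print(loss.item())
```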

[16] Structure-Learnable Adapter Fine-Tuning for Parameter-Efficient Large Language Models

Ming Gong, Yingnan Deng, Nia Qi, Yujun Zou, Zhihao Xue, Yun Zi

Main category: cs.CL

TL;DR: Adapter-based fine-tuning method with structure-learnable mechanism that automatically optimizes adapter insertion points and activation paths for multi-task settings, improving parameter efficiency and robustness.

Motivation: Address parameter redundancy, rigid structure, and limited task adaptability in fine-tuning large language models through flexible structural optimization.

Method: Uses differentiable gating functions and structural sparsity control variables to enable automatic optimization of adapter insertion points, activation paths, and module combinations while keeping backbone parameters frozen.

Result: Outperforms mainstream parameter-efficient tuning techniques on multiple tasks, achieving better balance among accuracy, compression rate, and robustness to noise and perturbation.

Conclusion: The proposed structure-learnable adapter method provides an effective solution for flexible and efficient multi-task fine-tuning with improved parameter utilization and robustness.

Abstract: This paper addresses the issues of parameter redundancy, rigid structure, and limited task adaptability in the fine-tuning of large language models. It proposes an adapter-based fine-tuning method built on a structure-learnable mechanism. By introducing differentiable gating functions and structural sparsity control variables, the method enables automatic optimization of adapter insertion points, activation paths, and module combinations. This allows the model to adjust its structure flexibly in multi-task settings to match different task characteristics. With the backbone parameters kept frozen, the method uses a structure search mechanism to guide the dynamic construction of task-specific efficient substructures during training. This significantly improves parameter utilization and representational capacity. In addition, the paper designs a set of sensitivity analysis experiments to systematically evaluate the effects of sparsity weight, noise injection ratio, and data perturbation on model performance. These experiments verify the stability and robustness of the proposed method across various multi-task natural language understanding tasks. The experimental results show that the proposed method outperforms mainstream parameter-efficient tuning techniques on multiple tasks. It achieves a better balance among accuracy, compression rate, and robustness to noise and perturbation.
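
The structure-learnable mechanism can be approximated by a bottleneck adapter whose contribution is scaled by a differentiable gate, with a sparsity penalty on the gates encouraging compact task-specific substructures. The sketch below illustrates this under those assumptions; it is not the paper's exact design.

```python
import torch
import torch.nn as nn

class GatedAdapter(nn.Module):
    def __init__(self, dim: int, bottleneck: int = 32):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.up = nn.Linear(bottleneck, dim)
        self.gate_logit = nn.Parameter(torch.zeros(1))  # learnable on/off gate

    def forward(self, x):
        gate = torch.sigmoid(self.gate_logit)           # in (0, 1), differentiable
        return x + gate * self.up(torch.relu(self.down(x)))

    def sparsity_penalty(self):
        return torch.sigmoid(self.gate_logit)           # L0-style surrogate

adapters = [GatedAdapter(768) for _ in range(12)]       # one per frozen layer
x = torch.randn(2, 768)
for a in adapters:
    x = a(x)
reg = sum(a.sparsity_penalty() for a in adapters)       # add to the task loss
print(x.shape, reg.item())
```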

[17] IndexTTS2: A Breakthrough in Emotionally Expressive and Duration-Controlled Auto-Regressive Zero-Shot Text-to-Speech

Siyi Zhou, Yiquan Zhou, Yi He, Xun Zhou, Jinchao Wang, Wei Deng, Jingchen Shu

Main category: cs.CL

TL;DR: IndexTTS2 is an autoregressive TTS model that enables precise duration control for audio-visual synchronization, supports two generation modes, achieves speaker-emotion disentanglement, and outperforms SOTA models in zero-shot settings.

Motivation: Existing autoregressive TTS models struggle with precise duration control, which is crucial for applications like video dubbing requiring strict audio-visual synchronization.

Method: Proposes a novel autoregressive method supporting two modes: explicit token count specification for duration control, and free generation with prosodic reproduction. Uses GPT latent representations, three-stage training paradigm, and soft instruction mechanism via fine-tuned Qwen3 for emotional control.

Result: Outperforms state-of-the-art zero-shot TTS models on multiple datasets in word error rate, speaker similarity, and emotional fidelity. Achieves accurate timbre reconstruction and perfect emotional tone reproduction.

Conclusion: IndexTTS2 provides effective solutions for precise duration control and emotional-speaker disentanglement in TTS, making it suitable for applications requiring strict audio-visual synchronization and independent control over timbre and emotion.

Abstract: Existing autoregressive large-scale text-to-speech (TTS) models have advantages in speech naturalness, but their token-by-token generation mechanism makes it difficult to precisely control the duration of synthesized speech. This becomes a significant limitation in applications requiring strict audio-visual synchronization, such as video dubbing. This paper introduces IndexTTS2, which proposes a novel, general, and autoregressive model-friendly method for speech duration control. The method supports two generation modes: one explicitly specifies the number of generated tokens to precisely control speech duration; the other freely generates speech in an autoregressive manner without specifying the number of tokens, while faithfully reproducing the prosodic features of the input prompt. Furthermore, IndexTTS2 achieves disentanglement between emotional expression and speaker identity, enabling independent control over timbre and emotion. In the zero-shot setting, the model can accurately reconstruct the target timbre (from the timbre prompt) while perfectly reproducing the specified emotional tone (from the style prompt). To enhance speech clarity in highly emotional expressions, we incorporate GPT latent representations and design a novel three-stage training paradigm to improve the stability of the generated speech. Additionally, to lower the barrier for emotional control, we designed a soft instruction mechanism based on text descriptions by fine-tuning Qwen3, effectively guiding the generation of speech with the desired emotional orientation. Finally, experimental results on multiple datasets show that IndexTTS2 outperforms state-of-the-art zero-shot TTS models in terms of word error rate, speaker similarity, and emotional fidelity. Audio samples are available at: https://index-tts.github.io/index-tts2.github.io/

[18] A Long Short-Term Memory (LSTM) Model for Business Sentiment Analysis Based on Recurrent Neural Network

Md. Jahidul Islam Razin, Md. Abdul Karim, M. F. Mridha, S M Rafiuddin, Tahira Alam

Main category: cs.CL

TL;DR: LSTM-based modified RNN model achieves 91.33% accuracy in business sentiment analysis, outperforming conventional RNN models on product review datasets.

Motivation: Business sentiment analysis is crucial for companies to understand customer feedback and evaluate marketing strategies, but conventional RNN models suffer from vanishing gradient problems.

Method: Applied Long Short-Term Memory (LSTM) as a modified RNN approach to prevent vanishing gradient problem, using 70% of product review data for training and 30% for testing.

Result: The proposed LSTM-based model achieved 91.33% accuracy, performing better than conventional RNN models in sentiment classification of customer reviews.

Conclusion: LSTM provides an effective solution for business sentiment analysis, enabling companies to better understand customer preferences and improve marketing strategies based on accurate sentiment classification.

Abstract: Business sentiment analysis (BSA) is one of the significant and popular topics of natural language processing. It is one kind of sentiment analysis techniques for business purposes. Different categories of sentiment analysis techniques like lexicon-based techniques and different types of machine learning algorithms are applied for sentiment analysis on different languages like English, Hindi, Spanish, etc. In this paper, long short-term memory (LSTM) is applied for business sentiment analysis, where a recurrent neural network is used. An LSTM model is used in a modified approach to prevent the vanishing gradient problem rather than applying the conventional recurrent neural network (RNN). To apply the modified RNN model, product review dataset is used. In this experiment, 70% of the data is trained for the LSTM and the rest 30% of the data is used for testing. The result of this modified RNN model is compared with other conventional RNN models, and a comparison is made among the results. It is noted that the proposed model performs better than the other conventional RNN models. Here, the proposed model, i.e., the modified RNN model approach has achieved around 91.33% of accuracy. By applying this model, any business company or e-commerce business site can identify the feedback from their customers about different types of products that customers like or dislike. Based on the customer reviews, a business company or e-commerce platform can evaluate its marketing strategy.
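
A minimal PyTorch version of the described pipeline (embedding, LSTM, binary head, 70/30 train/test split) looks like the following; vocabulary size, dimensions, and the toy data are placeholders, not the paper's setup.

```python
import torch
import torch.nn as nn

class SentimentLSTM(nn.Module):
    def __init__(self, vocab=10_000, emb=128, hidden=64):
        super().__init__()
        self.embed = nn.Embedding(vocab, emb)
        self.lstm = nn.LSTM(emb, hidden, batch_first=True)  # gating mitigates vanishing gradients
        self.head = nn.Linear(hidden, 1)

    def forward(self, token_ids):
        _, (h_n, _) = self.lstm(self.embed(token_ids))
        return self.head(h_n[-1]).squeeze(-1)               # one logit per review

# Toy data: 100 tokenized reviews, 70/30 train/test split as in the paper.
X = torch.randint(0, 10_000, (100, 50))
y = torch.randint(0, 2, (100,)).float()
X_train, X_test, y_train, y_test = X[:70], X[70:], y[:70], y[70:]

model = SentimentLSTM()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
for _ in range(5):  # a few toy epochs
    opt.zero_grad()
    loss = nn.functional.binary_cross_entropy_with_logits(model(X_train), y_train)
    loss.backward()
    opt.step()
acc = ((model(X_test) > 0).float() == y_test).float().mean()
print(f"test accuracy: {acc:.2f}")
```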

[19] Measuring Scalar Constructs in Social Science with LLMs

Hauke Licht, Rupak Sarkar, Patrick Y. Wu, Pranav Goel, Niklas Stoehr, Elliott Ash, Alexander Miserlis Hoyle

Main category: cs.CL

TL;DR: LLMs can measure continuous language constructs but produce discontinuous scores. Pairwise comparisons and token-probability-weighted scoring improve measurement quality. Finetuning small models with minimal data matches prompted LLM performance.

Motivation: Language constructs like complexity and emotionality exist on continuous scales, but LLMs have idiosyncratic numerical outputs that need systematic evaluation for reliable social science measurement.

Method: Evaluated four LLM approaches: unweighted direct pointwise scoring, pairwise comparison aggregation, token-probability-weighted pointwise scoring, and finetuning on multiple political science datasets.

Result: Direct pointwise scoring produces discontinuous distributions with arbitrary bunching. Pairwise comparisons improve quality, but token-probability-weighted scoring works even better. Finetuned small models (1,000 training pairs) match or exceed prompted LLM performance.

Conclusion: Token-probability-weighted scoring and finetuning small models are effective approaches for measuring scalar language constructs, providing actionable guidance for applied researchers using LLMs in social science.

Abstract: Many constructs that characterize language, like its complexity or emotionality, have a naturally continuous semantic structure; a public speech is not just “simple” or “complex,” but exists on a continuum between extremes. Although large language models (LLMs) are an attractive tool for measuring scalar constructs, their idiosyncratic treatment of numerical outputs raises questions of how to best apply them. We address these questions with a comprehensive evaluation of LLM-based approaches to scalar construct measurement in social science. Using multiple datasets sourced from the political science literature, we evaluate four approaches: unweighted direct pointwise scoring, aggregation of pairwise comparisons, token-probability-weighted pointwise scoring, and finetuning. Our study yields actionable findings for applied researchers. First, LLMs prompted to generate pointwise scores directly from texts produce discontinuous distributions with bunching at arbitrary numbers. The quality of the measurements improves with pairwise comparisons made by LLMs, but it improves even more by taking pointwise scores and weighting them by token probability. Finally, finetuning smaller models with as few as 1,000 training pairs can match or exceed the performance of prompted LLMs.
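
Token-probability-weighted scoring is straightforward to compute from API log-probabilities: take the expectation of the score under the probabilities of the candidate score tokens. The log-probs below are illustrative, not from a real model call.

```python
import math

# Hypothetical top log-probs for the score token on a 1-7 scale.
logprobs = {"4": -0.7, "5": -1.1, "3": -2.0, "6": -3.2}

probs = {int(tok): math.exp(lp) for tok, lp in logprobs.items()}
total = sum(probs.values())
expected_score = sum(score * p for score, p in probs.items()) / total
print(round(expected_score, 2))  # 4.28: a continuous score instead of the bare "4"
```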

[20] From Evaluation to Defense: Constructing Persistent Edit-Based Fingerprints for Large Language Models

Yue Li, Xin Yi, Dongsheng Shi, Yongyi Cui, Gerard de Melo, Xiaoling Wang, Linlin Wang

Main category: cs.CL

TL;DR: Proposes Fingerprint Subspace-aware Fine-Tuning (FSFT) to protect LLM intellectual property through knowledge editing-based fingerprint injection, reducing degradation during fine-tuning.

Motivation: Current fingerprint injection methods degrade model performance, require heavy computation, and lack persistence under model modifications. Knowledge editing offers a lightweight alternative for IP protection.

Method: Uses knowledge editing for fingerprint injection with scrambled text fingerprints. Proposes FSFT to constrain updates to fingerprint subspace during fine-tuning, reducing degradation.

Result: FSFT outperforms standard fine-tuning by 10% even in worst-case scenarios. However, models struggle to distinguish fingerprints from similar texts due to feature similarity.

Conclusion: Knowledge editing is effective for fingerprint injection but reveals need for more robust, fine-grained methods due to feature similarity challenges between fingerprints and similar texts.

Abstract: The intellectual property (IP) protection of Large Language Models (LLMs) is increasingly critical. Injecting specialized fingerprints into LLMs through instruction tuning is a common IP protection technique. However, this may significantly degrade model performance, requires substantial computational resources, and exhibits poor persistence under model modifications. We argue that knowledge editing offers a lightweight alternative that is more suitable for fingerprint injection. Accordingly, we apply knowledge editing to fingerprint injection for the first time and demonstrate its strong capability. Despite using scrambled text as fingerprints to prevent them from being overwritten during fine-tuning, degradation still occurs under large-scale fine-tuning. To address this, we propose Fingerprint Subspace-aware Fine-Tuning (FSFT), which reduces fingerprint degradation by constraining the update of the fingerprint subspace. The performance of FSFT exceeds fine-tuning by 10% even in the worst-case scenario. Additionally, we observe that the fingerprint-injected models struggle to distinguish between fingerprints and similar texts due to the high similarity of their features. This finding underscores the urgent need for more robust and fine-grained fingerprinting injection methods for LLMs.
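
One way to realize subspace-constrained updates is to project gradients off a protected subspace before each optimizer step, so fine-tuning cannot move the weights along the directions that encode the fingerprint. The sketch below uses a random orthonormal basis as a stand-in for the fingerprint subspace; FSFT's actual construction of that subspace is not reproduced here.

```python
import torch

d, k = 64, 4
W = torch.randn(d, d, requires_grad=True)
B, _ = torch.linalg.qr(torch.randn(d, k))  # orthonormal basis of the protected subspace
opt = torch.optim.SGD([W], lr=1e-2)

loss = (W @ torch.randn(d)).pow(2).mean()  # stand-in fine-tuning loss
loss.backward()
with torch.no_grad():
    W.grad -= (W.grad @ B) @ B.T           # remove the component inside the subspace
opt.step()
print((W.grad @ B).abs().max())            # ~0: no update along fingerprint directions
```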

[21] An experimental and computational study of an Estonian single-person word naming

Kaidi Lõo, Arvi Tavast, Maria Heitmeier, Harald Baayen

Main category: cs.CL

TL;DR: DLM-based measures from computational models predict lexical processing in Estonian, with deep learning not outperforming linear mappings, and classical predictors generally providing better fits except for total fixation duration.

DetailsMotivation: To investigate whether computational model measures (DLM) can predict lexical processing variables and compare them with classical predictors like word frequency and neighborhood size.

Method: Large-scale single-subject experiment combining word naming task with eye-tracking, analyzing five response variables using generalized additive models with both linear and deep learning mappings.

Result: DLM-based measures are powerful predictors but deep learning mappings don’t outperform linear ones; classical predictors generally provide better fits except for total fixation duration where they’re equivalent; meaning is heavily involved in word naming.

Conclusion: Computational DLM measures effectively predict lexical processing, with meaning playing a significant role in word naming tasks, though classical predictors often outperform DLM measures.

Abstract: This study investigates lexical processing in Estonian. A large-scale single-subject experiment is reported that combines the word naming task with eye-tracking. Five response variables (first fixation duration, total fixation duration, number of fixations, word naming latency, and spoken word duration) are analyzed with the generalized additive model. Of central interest is the question of whether measures for lexical processing generated by a computational model of the mental lexicon (the Discriminative Lexicon Model, DLM) are predictive for these response variables, and how they compare to classical predictors such as word frequency, neighborhood size, and inflectional paradigm size. Computational models were implemented both with linear and deep mappings. Central findings are, first, that DLM-based measures are powerful predictors for lexical processing, second, that DLM-measures using deep learning are not necessarily more precise predictors of lexical processing than DLM-measures using linear mappings, third, that classical predictors tend to provide somewhat more precise fits compared to DLM-based predictors (except for total fixation duration, where the two provide equivalent goodness of fit), and fourth, that in the naming task lexical variables are not predictive for first fixation duration and the total number of fixations. As the DLM works with mappings from form to meaning, the predictivity of DLM-based measures for total fixation duration, naming latencies, and spoken word duration indicates that meaning is heavily involved in the present word naming task.

[22] Expanding the WMT24++ Benchmark with Rumantsch Grischun, Sursilvan, Sutsilvan, Surmiran, Puter, and Vallader

Jannis Vamvas, Ignacio Pérez Prat, Not Battesta Soliva, Sandra Baltermia-Guetg, Andrina Beeli, Simona Beeli, Madlaina Capeder, Laura Decurtins, Gian Peder Gregori, Flavia Hobi, Gabriela Holderegger, Arina Lazzarini, Viviana Lazzarini, Walter Rosselli, Bettina Vital, Anna Rutkiewicz, Rico Sennrich

Main category: cs.CL

TL;DR: Created benchmark for Romansh machine translation evaluation covering 6 varieties, showing translation out of Romansh works well but translation into Romansh remains challenging.

Motivation: Romansh language in Switzerland has limited resources for machine translation evaluation, creating a need for standardized benchmarks.

Method: Developed benchmark with human-translated references based on WMT24++ for 6 Romansh varieties (Rumantsch Grischun + 5 regional varieties), then conducted automatic evaluation of existing MT systems and LLMs.

Result: Translation from Romansh to German performed relatively well across all varieties, but translation into Romansh proved challenging for current systems.

Conclusion: The benchmark fills a resource gap for Romansh MT evaluation and highlights the ongoing difficulty of translating into this low-resource language.

Abstract: The Romansh language, spoken in Switzerland, has limited resources for machine translation evaluation. In this paper, we present a benchmark for six varieties of Romansh: Rumantsch Grischun, a supra-regional variety, and five regional varieties: Sursilvan, Sutsilvan, Surmiran, Puter, and Vallader. Our reference translations were created by human translators based on the WMT24++ benchmark, which ensures parallelism with more than 55 other languages. An automatic evaluation of existing MT systems and LLMs shows that translation out of Romansh into German is handled relatively well for all the varieties, but translation into Romansh is still challenging.

[23] Memory-R1: Enhancing Large Language Model Agents to Manage and Utilize Memories via Reinforcement Learning

Sikuan Yan, Xiufeng Yang, Zuchao Huang, Ercong Nie, Zifeng Ding, Zonggen Li, Xiaowen Ma, Hinrich Schütze, Volker Tresp, Yunpu Ma

Main category: cs.CL

TL;DR: Memory-R1 is an RL framework that enables LLMs to actively manage external memory through specialized agents for memory operations and reasoning, achieving strong performance with minimal supervision.

DetailsMotivation: LLMs are stateless with limited context windows, and existing memory augmentation approaches lack learned mechanisms for dynamic memory management.

Method: Reinforcement learning framework with two agents: Memory Manager for structured operations (add/update/delete) and Answer Agent for retrieval and reasoning, fine-tuned with PPO and GRPO.

Result: Outperforms strongest baselines with only 152 QA pairs, demonstrates strong generalization across question types and LLM backbones.

Conclusion: RL enables more agentic, memory-aware behavior in LLMs, pointing toward richer persistent reasoning systems.

Abstract: Large Language Models (LLMs) have demonstrated impressive capabilities across a wide range of NLP tasks, but they remain fundamentally stateless, constrained by limited context windows that hinder long-horizon reasoning. Recent efforts to address this limitation often augment LLMs with an external memory bank, yet most existing pipelines are static and heuristic-driven, lacking any learned mechanism for deciding what to store, update, or retrieve. We present Memory-R1, a reinforcement learning (RL) framework that equips LLMs with the ability to actively manage and utilize external memory through two specialized agents: a Memory Manager that learns to perform structured memory operations, including adding, updating, deleting, or taking no operation on memory entries; and an Answer Agent that selects the most relevant entries and reasons over them to produce an answer. Both agents are fine-tuned with outcome-driven RL (PPO and GRPO), enabling adaptive memory management and utilization with minimal supervision. With as few as 152 question-answer pairs and a corresponding temporal memory bank for training, Memory-R1 outperforms the strongest existing baseline and demonstrates strong generalization across diverse question types and LLM backbones. Beyond presenting an effective approach, this work provides insights into how RL can unlock more agentic, memory-aware behavior in LLMs, pointing toward richer, more persistent reasoning systems.
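
The Memory Manager's action space (ADD / UPDATE / DELETE / NOOP over entries) has a simple skeleton; in Memory-R1 the choice of operation is made by an RL-fine-tuned LLM, which this dictionary-based sketch only mimics.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class MemoryBank:
    entries: dict = field(default_factory=dict)

    def apply(self, op: str, key: str, value: Optional[str] = None):
        if op in ("ADD", "UPDATE"):
            self.entries[key] = value
        elif op == "DELETE":
            self.entries.pop(key, None)
        elif op == "NOOP":
            pass
        else:
            raise ValueError(f"unknown memory operation: {op}")

bank = MemoryBank()
bank.apply("ADD", "user_pet", "adopted a dog named Rex")
bank.apply("UPDATE", "user_pet", "adopted a second dog, Buddy")  # consolidation
bank.apply("NOOP", "user_pet")
print(bank.entries)
```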

[24] Domain Adaptation of LLMs for Process Data

Rafael Seidi Oyamada, Jari Peeperkorn, Jochen De Weerdt, Johannes De Smedt

Main category: cs.CL

TL;DR: This paper explores parameter-efficient fine-tuning of pretrained LLMs directly on process data without natural language conversion, showing improved predictive performance over RNNs and narrative-based approaches in multi-task predictive process monitoring.

DetailsMotivation: LLMs excel at generating token sequences similar to process mining objectives, but current PM applications rely on natural language reformulation. The study aims to leverage LLMs' sequence generation capabilities directly on process data without semantic transformation.

Method: Parameter-efficient fine-tuning techniques applied to pretrained LLMs for direct adaptation to process data. Experimental setup focused on predictive process monitoring with single- and multi-task predictions.
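
A hedged sketch of what direct PEFT on process data could look like, using the Hugging Face peft library's LoRA adapters on a causal LM over raw activity tokens; the backbone model, target modules, and trace format are assumptions, not the paper's exact setup.

```python
# Hedged sketch: LoRA fine-tuning of a causal LM directly on event-log
# activity sequences, with no narrative reformulation of the data.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

model_name = "gpt2"  # stand-in backbone; the paper's model may differ
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Each trace becomes a plain token sequence of activity labels (assumed format).
trace = ["register", "check_credit", "approve", "notify"]
inputs = tokenizer(" ".join(trace), return_tensors="pt")

lora_cfg = LoraConfig(
    r=8, lora_alpha=16, lora_dropout=0.05,
    target_modules=["c_attn"],  # GPT-2's fused attention projection
    task_type="CAUSAL_LM",
)
peft_model = get_peft_model(model, lora_cfg)
peft_model.print_trainable_parameters()  # only a small fraction is trainable

# Standard next-token objective over the trace predicts the next activity.
out = peft_model(**inputs, labels=inputs["input_ids"])
loss = out.loss
```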

Result: Fine-tuned models demonstrated improved predictive performance over state-of-the-art RNN approaches and narrative-style solutions, particularly in multi-task settings. They also showed faster convergence and required significantly less hyperparameter optimization.

Conclusion: Direct adaptation of pretrained LLMs to process data without natural language reformulation is effective and computationally efficient, offering superior performance in predictive process monitoring tasks compared to traditional approaches.

Abstract: In recent years, Large Language Models (LLMs) have emerged as a prominent area of interest across various research domains, including Process Mining (PM). Current applications in PM have predominantly centered on prompt engineering strategies or the transformation of event logs into narrative-style datasets, thereby exploiting the semantic capabilities of LLMs to address diverse tasks. In contrast, this study investigates the direct adaptation of pretrained LLMs to process data without natural language reformulation, motivated by the fact that these models excel in generating sequences of tokens, similar to the objective in PM. More specifically, we focus on parameter-efficient fine-tuning techniques to mitigate the computational overhead typically associated with such models. Our experimental setup focuses on Predictive Process Monitoring (PPM), and considers both single- and multi-task predictions. The results demonstrate a potential improvement in predictive performance over state-of-the-art recurrent neural network (RNN) approaches and recent narrative-style-based solutions, particularly in the multi-task setting. Additionally, our fine-tuned models exhibit faster convergence and require significantly less hyperparameter optimization.

[25] SinhalaMMLU: A Comprehensive Benchmark for Evaluating Multitask Language Understanding in Sinhala

Ashmari Pramodya, Nirasha Nelki, Heshan Shalinda, Chamila Liyanage, Yusuke Sakai, Randil Pushpananda, Ruvan Weerasinghe, Hidetaka Kamigaito, Taro Watanabe

Main category: cs.CL

TL;DR: SinhalaMMLU is the first multiple-choice benchmark for Sinhala language, evaluating LLMs on culturally specific content and revealing performance gaps in low-resource language contexts.

DetailsMotivation: Current LLM evaluation focuses on global/anglocentric subjects, neglecting low-resource languages and cultural context. Automatic translation in multilingual benchmarks introduces errors and misrepresents cultural nuances.

Method: Created SinhalaMMLU - a dataset with 7,000+ questions aligned with Sri Lankan curriculum, covering 6 domains and 30 subjects including culturally grounded knowledge. Evaluated 26 LLMs on this benchmark.

Result: Claude 3.5 Sonnet (67%) and GPT-4o (62%) achieved the highest accuracy, but overall performance remains limited. Models particularly struggled with culturally rich domains such as the Humanities.

Conclusion: Substantial room for improvement in adapting LLMs to low-resource languages and culturally specific contexts. Culturally grounded benchmarks are essential for proper evaluation.

Abstract: Large Language Models (LLMs) demonstrate impressive general knowledge and reasoning abilities, yet their evaluation has predominantly focused on global or anglocentric subjects, often neglecting low-resource languages and culturally specific content. While recent multilingual benchmarks attempt to bridge this gap, many rely on automatic translation, which can introduce errors and misrepresent the original cultural context. To address this, we introduce SinhalaMMLU, the first multiple-choice question answering benchmark designed specifically for Sinhala, a low-resource language. The dataset includes over 7,000 questions spanning secondary to collegiate education levels, aligned with the Sri Lankan national curriculum, and covers six domains and 30 subjects, encompassing both general academic topics and culturally grounded knowledge. We evaluate 26 LLMs on SinhalaMMLU and observe that, while Claude 3.5 sonnet and GPT-4o achieve the highest average accuracies at 67% and 62% respectively, overall model performance remains limited. In particular, models struggle in culturally rich domains such as the Humanities, revealing substantial room for improvement in adapting LLMs to low-resource and culturally specific contexts.

[26] LatPhon: Lightweight Multilingual G2P for Romance Languages and English

Luis Felipe Chary, Miguel Arjona Ramirez

Main category: cs.CL

TL;DR: LatPhon is a compact 7.5M-parameter Transformer model for multilingual grapheme-to-phoneme conversion across 6 Latin-script languages, achieving 3.5% PER while being small enough for on-device deployment.

DetailsMotivation: G2P conversion is crucial for speech systems (TTS, ASR, S2ST, alignment) across multiple Latin-script languages, requiring efficient and accurate solutions that can work across languages.

Method: Developed a 7.5M-parameter Transformer model jointly trained on six Latin-script languages: English, Spanish, French, Italian, Portuguese, and Romanian using the ipa-dict corpus.

Result: Achieved mean phoneme error rate (PER) of 3.5%, outperforming ByT5 baseline (5.4%) and approaching language-specific WFSTs (3.2%) while using only 30MB memory.
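
For reference, PER is the Levenshtein (edit) distance between predicted and reference phoneme sequences, normalized by reference length; a minimal implementation:

```python
# Phoneme error rate (PER): edit distance between predicted and reference
# phoneme sequences, normalized by the reference length.
def phoneme_error_rate(ref: list, hyp: list) -> float:
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution
    return d[-1][-1] / max(len(ref), 1)

ref = ["h", "ə", "l", "oʊ"]
hyp = ["h", "ɛ", "l", "oʊ"]
print(phoneme_error_rate(ref, hyp))  # one substitution over 4 phonemes -> 0.25
```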

Conclusion: Compact multilingual G2P models like LatPhon can serve as universal front-ends for Latin-language speech pipelines with efficient on-device deployment capabilities.

Abstract: Grapheme-to-phoneme (G2P) conversion is a key front-end for text-to-speech (TTS), automatic speech recognition (ASR), speech-to-speech translation (S2ST) and alignment systems, especially across multiple Latin-script languages. We present LatPhon, a 7.5M-parameter Transformer jointly trained on six such languages: English, Spanish, French, Italian, Portuguese, and Romanian. On the public ipa-dict corpus, it attains a mean phoneme error rate (PER) of 3.5%, outperforming the byte-level ByT5 baseline (5.4%) and approaching language-specific WFSTs (3.2%) while occupying 30 MB of memory, which makes on-device deployment feasible when needed. These results indicate that compact multilingual G2P can serve as a universal front-end for Latin-language speech pipelines.

[27] LMEnt: A Suite for Analyzing Knowledge in Language Models from Pretraining Data to Representations

Daniela Gottesman, Alon Gilae-Dotan, Ido Cohen, Yoav Gur-Arieh, Marius Mosbach, Ori Yoran, Mor Geva

Main category: cs.CL

TL;DR: LMEnt is a new analysis suite for studying how language models acquire knowledge during pretraining, featuring annotated Wikipedia data, improved entity retrieval, and pretrained models with checkpoints.

DetailsMotivation: To better understand how language models transform data into knowledge representations and beliefs about the world, which could lead to more consistent, robust, and complete knowledge representations.

Method: Developed LMEnt suite with: (1) knowledge-rich pretraining corpus from Wikipedia with entity annotations, (2) entity-based retrieval method that outperforms previous approaches by up to 80.4%, and (3) 12 pretrained models with up to 1B parameters and 4K intermediate checkpoints.

Result: The suite provides a controlled environment for analyzing entity mention connections and causal interventions. Analysis shows fact frequency is key to knowledge acquisition but doesn’t fully explain learning trends.

Conclusion: LMEnt is released to support research on knowledge representations, plasticity, editing, attribution, and learning dynamics in language models.

Abstract: Language models (LMs) increasingly drive real-world applications that require world knowledge. However, the internal processes through which models turn data into representations of knowledge and beliefs about the world, are poorly understood. Insights into these processes could pave the way for developing LMs with knowledge representations that are more consistent, robust, and complete. To facilitate studying these questions, we present LMEnt, a suite for analyzing knowledge acquisition in LMs during pretraining. LMEnt introduces: (1) a knowledge-rich pretraining corpus, fully annotated with entity mentions, based on Wikipedia, (2) an entity-based retrieval method over pretraining data that outperforms previous approaches by as much as 80.4%, and (3) 12 pretrained models with up to 1B parameters and 4K intermediate checkpoints, with comparable performance to popular open-sourced models on knowledge benchmarks. Together, these resources provide a controlled environment for analyzing connections between entity mentions in pretraining and downstream performance, and the effects of causal interventions in pretraining data. We show the utility of LMEnt by studying knowledge acquisition across checkpoints, finding that fact frequency is key, but does not fully explain learning trends. We release LMEnt to support studies of knowledge in LMs, including knowledge representations, plasticity, editing, attribution, and learning dynamics.

[28] Learning Mechanism Underlying NLP Pre-Training and Fine-Tuning

Yarden Tzach, Ronit D. Gross, Ella Koresh, Shalom Rosner, Or Shpringer, Tal Halevi, Ido Kanter

Main category: cs.CL

TL;DR: Study examines pre-training mechanism in NLP transformers, finding token accuracy increases with frequency and forms clusters, improving both pre-training and fine-tuning performance across transformer blocks.

DetailsMotivation: To understand the underlying mechanism of successful pre-training in NLP and determine how pre-training accuracy affects fine-tuning performance on classification tasks.

Method: Used BERT-6 architecture pre-trained on Wikipedia dataset, then fine-tuned on FewRel and DBpedia classification tasks. Analyzed accuracy per token (APT), token confusion matrix, and performance across transformer blocks.
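
A minimal sketch of computing accuracy per token (APT) and pairing it with corpus frequency, the quantity the paper uses as an order parameter; the data structures are illustrative.

```python
# Sketch: accuracy per token (APT) grouped with each token's corpus frequency.
from collections import Counter, defaultdict

def apt_by_frequency(tokens, predictions):
    """tokens[i] is the gold token at position i; predictions[i] the model's."""
    freq = Counter(tokens)
    hits, totals = defaultdict(int), defaultdict(int)
    for gold, pred in zip(tokens, predictions):
        totals[gold] += 1
        hits[gold] += int(gold == pred)
    # For each token: (its APT, its corpus frequency)
    return {tok: (hits[tok] / totals[tok], freq[tok]) for tok in totals}
```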

Result: APT increased with token frequency and served as order parameter for pre-training success. Pre-training formed finite strong match token clusters that sharpened along transformer blocks, enhancing performance. Higher-order language structures emerged despite single-token learning objective.

Conclusion: Pre-training breaks token symmetry and forms clusters that improve both pre-training and fine-tuning performance. The mechanism shows universality as it resembles fine-tuning in image classification tasks, despite pre-training being typically absent there.

Abstract: Natural language processing (NLP) enables the understanding and generation of meaningful human language, typically using a pre-trained complex architecture on a large dataset to learn the language and next fine-tune its weights to implement a specific task. Twofold goals are examined; to understand the mechanism underlying successful pre-training and to determine the interplay between the pre-training accuracy and the fine-tuning of classification tasks. The following main results were obtained; the accuracy per token (APT) increased with its appearance frequency in the dataset, and its average over all tokens served as an order parameter to quantify pre-training success, which increased along the transformer blocks. Pre-training broke the symmetry among tokens and grouped them into finite, small, strong match token clusters, as inferred from the presented token confusion matrix. This feature was sharpened along the transformer blocks toward the output layer, enhancing its performance considerably compared with that of the embedding layer. Consequently, higher-order language structures were generated by pre-training, even though the learning cost function was directed solely at identifying a single token. These pre-training findings were reflected by the improved fine-tuning accuracy along the transformer blocks. Additionally, the output label prediction confidence was found to be independent of the average input APT, as the input meaning was preserved since the tokens are replaced primarily by strong match tokens. Finally, although pre-training is commonly absent in image classification tasks, its underlying mechanism is similar to that used in fine-tuning NLP classification tasks, hinting at its universality. The results were based on the BERT-6 architecture pre-trained on the Wikipedia dataset and fine-tuned on the FewRel and DBpedia classification tasks.

[29] Curse of Knowledge: When Complex Evaluation Context Benefits yet Biases LLM Judges

Weiyuan Li, Xintao Wang, Siyu Yuan, Rui Xu, Jiangjie Chen, Qingqing Dong, Yanghua Xiao, Deqing Yang

Main category: cs.CL

TL;DR: ComplexEval benchmark reveals LLMs as judges are highly susceptible to 6 previously unexplored biases in complex tasks, with bias severity increasing with task complexity and Large Reasoning Models showing paradoxical vulnerability.

DetailsMotivation: As LLMs handle more complex tasks, reliable evaluation becomes challenging. The LLMs-as-judges paradigm needs validation for complex scenarios with multi-faceted rubrics, unstructured references, and nuanced criteria.

Method: Constructed ComplexEval benchmark to systematically expose and quantify Auxiliary Information Induced Biases. Investigated 6 biases across 12 basic and 3 advanced scenarios.

Result: All evaluated models show significant bias susceptibility, with bias magnitude scaling with task complexity. Large Reasoning Models paradoxically demonstrate high vulnerability despite their reasoning capabilities.

Conclusion: The findings provide crucial insights for improving evaluation accuracy and verifiability, paving the way for more general and robust evaluation models in complex task settings.

Abstract: As large language models (LLMs) grow more capable, they face increasingly diverse and complex tasks, making reliable evaluation challenging. The paradigm of LLMs as judges has emerged as a scalable solution, yet prior work primarily focuses on simple settings. Their reliability in complex tasks, where multi-faceted rubrics, unstructured reference answers, and nuanced criteria are critical, remains understudied. In this paper, we constructed ComplexEval, a challenge benchmark designed to systematically expose and quantify Auxiliary Information Induced Biases. We systematically investigated and validated 6 previously unexplored biases across 12 basic and 3 advanced scenarios. Key findings reveal: (1) all evaluated models exhibit significant susceptibility to these biases, with bias magnitude scaling with task complexity; (2) notably, Large Reasoning Models (LRMs) show paradoxical vulnerability. Our in-depth analysis offers crucial insights for improving the accuracy and verifiability of evaluation signals, paving the way for more general and robust evaluation models.

[30] Continuous Saudi Sign Language Recognition: A Vision Transformer Approach

Soukeina Elhassen, Lama Al Khuzayem, Areej Alhothali, Ohoud Alzamzami, Nahed Alowaidi

Main category: cs.CL

TL;DR: First continuous Saudi Sign Language dataset (KAU-CSSL) with transformer-based model achieving high accuracy for SSL recognition and translation.

DetailsMotivation: Address the significant resource gap for Arabic sign languages, particularly Saudi Sign Language (SSL), on which over 84,000 people in Saudi Arabia depend as their primary form of communication; most existing solutions target non-Arabic sign languages.

Method: Developed KAU-CSSL dataset focusing on complete sentences (not isolated words) and proposed transformer-based model using pretrained ResNet-18 for spatial features and Transformer Encoder with Bidirectional LSTM for temporal dependencies.
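
A hedged PyTorch sketch of the described architecture (pretrained ResNet-18 per-frame features, a Transformer encoder plus bidirectional LSTM for temporal modeling, and a classification head); layer counts and dimensions are illustrative assumptions, not the paper's settings.

```python
# Hedged sketch: per-frame ResNet-18 features -> Transformer encoder ->
# bidirectional LSTM -> sentence-level classifier.
import torch
import torch.nn as nn
from torchvision.models import resnet18

class SignRecognizer(nn.Module):
    def __init__(self, num_classes: int, d_model: int = 512):
        super().__init__()
        backbone = resnet18(weights="IMAGENET1K_V1")
        self.cnn = nn.Sequential(*list(backbone.children())[:-1])  # drop fc
        enc_layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=2)
        self.bilstm = nn.LSTM(d_model, d_model // 2, batch_first=True,
                              bidirectional=True)
        self.head = nn.Linear(d_model, num_classes)

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (batch, time, 3, 224, 224)
        b, t = frames.shape[:2]
        feats = self.cnn(frames.flatten(0, 1)).flatten(1)  # (b*t, 512)
        feats = feats.view(b, t, -1)                       # (b, t, 512)
        feats = self.encoder(feats)                        # temporal attention
        feats, _ = self.bilstm(feats)                      # (b, t, 512)
        return self.head(feats.mean(dim=1))                # sentence logits
```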

Result: Achieved 99.02% accuracy in signer-dependent mode and 77.71% accuracy in signer-independent mode for Saudi Sign Language recognition.

Conclusion: This research provides crucial resources for SSL and represents a significant advancement in Arabic sign language technology, paving the way for improved communication tools and broader sign language research contributions.

Abstract: Sign language (SL) is an essential communication form for hearing-impaired and deaf people, enabling engagement within the broader society. Despite its significance, limited public awareness of SL often leads to inequitable access to educational and professional opportunities, thereby contributing to social exclusion, particularly in Saudi Arabia, where over 84,000 individuals depend on Saudi Sign Language (SSL) as their primary form of communication. Although certain technological approaches have helped to improve communication for individuals with hearing impairments, there continues to be an urgent requirement for more precise and dependable translation techniques, especially for Arabic sign language variants like SSL. Most state-of-the-art solutions have primarily focused on non-Arabic sign languages, resulting in a considerable absence of resources dedicated to Arabic sign language, specifically SSL. The complexity of the Arabic language and the prevalence of isolated sign language datasets that concentrate on individual words instead of continuous speech contribute to this issue. To address this gap, we introduce the first continuous Saudi Sign Language dataset, called KAU-CSSL, focusing on complete sentences to facilitate further research and enable sophisticated recognition systems for SSL recognition and translation. Additionally, we propose a transformer-based model, utilizing a pretrained ResNet-18 for spatial feature extraction and a Transformer Encoder with Bidirectional LSTM for temporal dependencies, achieving 99.02% accuracy in signer-dependent mode and 77.71% accuracy in signer-independent mode. This development not only paves the way for improved communication tools for the SSL community but also makes a substantial contribution to the wider field of sign language research.

[31] Design and Optimization of Reinforcement Learning-Based Agents in Text-Based Games

Haonan Wang, Mingjia Zhao, Junfeng Sun, Wei Liu

Main category: cs.CL

TL;DR: Novel RL agent for text-based games using deep learning for text processing and policy gradient methods, achieving superior performance on completion ratio and win rate.

DetailsMotivation: Advance AI capabilities in text-based games through improved agent design and learning methods using reinforcement learning.

Method: Deep learning model for game text processing and world modeling, combined with policy gradient-based deep reinforcement learning for policy optimization.
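
A minimal sketch of the policy-gradient (REINFORCE-style) objective such an agent could optimize, assuming the world model has already produced per-step action logits; this is a generic formulation, not the paper's exact method.

```python
# REINFORCE-style loss: returns weight the log-probabilities of taken actions.
import torch

def policy_gradient_loss(action_logits, taken_actions, returns):
    """action_logits: (T, num_actions); taken_actions: (T,); returns: (T,)."""
    log_probs = torch.log_softmax(action_logits, dim=-1)
    chosen = log_probs.gather(1, taken_actions.unsqueeze(1)).squeeze(1)
    return -(chosen * returns).mean()  # gradient ascent on expected return
```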

Result: Enhanced agent significantly outperforms previous agents on game completion ratio and win rate in text-based game experiments.

Conclusion: Provides novel understanding and empirical foundation for using RL in text games, paving way for broader applications in general domains.

Abstract: As AI technology advances, research in playing text-based games with agents has become progressively popular. In this paper, a novel approach to agent design and agent learning is presented within the context of reinforcement learning. A deep learning model is first applied to process game text and build a world model. Next, the agent is trained through a policy gradient-based deep reinforcement learning method to facilitate conversion from state value to optimal policy. The enhanced agent works better in several text-based game experiments and significantly surpasses previous agents on game completion ratio and win rate. Our study introduces novel understanding and empirical grounding for using reinforcement learning for text games and sets the stage for developing and optimizing reinforcement learning agents for more general domains and problems.

[32] Similarity between Units of Natural Language: The Transition from Coarse to Fine Estimation

Wenchuan Mu

Main category: cs.CL

TL;DR: Develops a progressively refined similarity computation framework that combines attack testing with adversarial training to improve language unit similarity measures and catch calculation loopholes.

DetailsMotivation: Existing similarity computation methods often rely on fitting human judgments without proper debugging, have vague definitions, and lack interpretability - which is problematic in high-stakes domains like legal/medical affairs where precision matters.

Method: Proposes a regression model called progressively refined similarity computation that combines attack testing (to identify loopholes) with adversarial training (to constantly improve the model through catching different loopholes).

Result: The model achieves state-of-the-art performance in handling edge cases and provides reasonable explanations for every refinement made to the model.

Conclusion: The framework addresses key shortcomings in similarity computation by providing both improved performance on edge cases and interpretable refinements through a systematic attack testing and adversarial training approach.

Abstract: Capturing the similarities between human language units is crucial for explaining how humans associate different objects, and therefore its computation has received extensive attention, research, and applications. With the ever-increasing amount of information around us, calculating similarity becomes increasingly complex, especially in many cases, such as legal or medical affairs, measuring similarity requires extra care and precision, as small acts within a language unit can have significant real-world effects. My research goal in this thesis is to develop regression models that account for similarities between language units in a more refined way. Computation of similarity has come a long way, but approaches to debugging the measures are often based on continually fitting human judgment values. To this end, my goal is to develop an algorithm that precisely catches loopholes in a similarity calculation. Furthermore, most methods have vague definitions of the similarities they compute and are often difficult to interpret. The proposed framework addresses both shortcomings. It constantly improves the model through catching different loopholes. In addition, every refinement of the model provides a reasonable explanation. The regression model introduced in this thesis is called progressively refined similarity computation, which combines attack testing with adversarial training. The similarity regression model of this thesis achieves state-of-the-art performance in handling edge cases.

[33] Learn and Unlearn: Addressing Misinformation in Multilingual LLMs

Taiming Lu, Philipp Koehn

Main category: cs.CL

TL;DR: Multilingual LLMs propagate harmful information across languages, and standard English-focused unlearning methods are ineffective - comprehensive multilingual unlearning is required.

DetailsMotivation: To investigate how harmful information spreads in multilingual LLMs and evaluate the effectiveness of current unlearning methods across different languages.

Method: Studied the propagation of fake/harmful information in multilingual LLMs and tested various unlearning techniques, comparing English-only vs multilingual approaches.
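
As one hedged illustration of this space, a gradient-ascent unlearning step applied to harmful (prompt, response) pairs in both English and the source language, reflecting the paper's finding that English-only unlearning is insufficient; the paper compares several techniques, which may differ from this sketch.

```python
# Sketch: multilingual unlearning by gradient ascent on harmful pairs.
import torch

def unlearn_step(model, tokenizer, harmful_pairs, optimizer):
    """harmful_pairs should mix English and source-language versions."""
    optimizer.zero_grad()
    for prompt, response in harmful_pairs:
        ids = tokenizer(prompt + " " + response, return_tensors="pt").input_ids
        loss = model(ids, labels=ids).loss
        (-loss).backward()  # ascend the loss: make the response less likely
    optimizer.step()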

Result: Harmful content spreads across languages in multilingual LLMs. Standard English-focused unlearning methods are insufficient and may reinforce harmful content across languages. Only addressing harmful responses in both English and the original language effectively eliminates generations across all languages.

Conclusion: Comprehensive multilingual unlearning strategies are critically needed to ensure LLM safety and reliability across diverse linguistic contexts, as language-agnostic approaches are inadequate for multilingual models.

Abstract: This paper investigates the propagation of harmful information in multilingual large language models (LLMs) and evaluates the efficacy of various unlearning methods. We demonstrate that fake information, regardless of the language it is in, once introduced into these models through training data, can spread across different languages, compromising the integrity and reliability of the generated content. Our findings reveal that standard unlearning techniques, which typically focus on English data, are insufficient in mitigating the spread of harmful content in multilingual contexts and could inadvertently reinforce harmful content across languages. We show that only by addressing harmful responses in both English and the original language of the harmful data can we effectively eliminate generations for all languages. This underscores the critical need for comprehensive unlearning strategies that consider the multilingual nature of modern LLMs to enhance their safety and reliability across diverse linguistic landscapes.

[34] SampleAttention: Near-Lossless Acceleration of Long Context LLM Inference with Adaptive Structured Sparse Attention

Qianchao Zhu, Jiangfei Duan, Chang Chen, Siran Liu, Guanyu Feng, Xin Lv, Xiao Chuanfu, Dahua Lin, Chao Yang

Main category: cs.CL

TL;DR: SampleAttention is a sparse attention mechanism that reduces Time-to-First-Token latency by 2.42x compared to FlashAttention while maintaining near-lossless accuracy in large language models with long context windows.

DetailsMotivation: Vanilla attention has quadratic complexity that causes long TTFT latency in LLMs with long context windows. Existing solutions require additional training and often sacrifice model accuracy.

Method: Proposes SampleAttention with adaptive structured sparse attention: attends to fixed percentage of adjacent tokens for local patterns, and uses two-stage query-guided key-value filtering to capture column stripe patterns with low overhead.
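
A toy construction of the two structured patterns on an attention-score matrix, keeping a fixed fraction of adjacent tokens plus a few high-scoring key columns; the selection rule here is a simplified stand-in for the paper's two-stage query-guided filtering.

```python
# Toy keep-mask combining a local window with "column stripe" keys.
import torch

def sparse_mask(scores: torch.Tensor, local_frac: float = 0.1, n_stripes: int = 4):
    """scores: (n, n) causal attention logits; returns a boolean keep-mask."""
    n = scores.size(0)
    keep = torch.zeros(n, n, dtype=torch.bool)
    w = max(1, int(local_frac * n))
    for i in range(n):                        # local window: adjacent tokens
        keep[i, max(0, i - w): i + 1] = True
    col_score = scores.mean(dim=0)            # query-aggregated key importance
    stripes = col_score.topk(min(n_stripes, n)).indices
    keep[:, stripes] = True                   # column stripe pattern
    causal = torch.ones(n, n, dtype=torch.bool).tril()
    return keep & causal                      # never attend to future tokens
```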

Result: Reduces TTFT by up to 2.42x compared to FlashAttention while maintaining nearly no accuracy loss. Works seamlessly with off-the-shelf LLMs without additional training.

Conclusion: SampleAttention provides an effective solution for reducing attention complexity and TTFT latency in long-context LLMs while preserving accuracy, without requiring model retraining.

Abstract: Large language models (LLMs) now support extremely long context windows, but the quadratic complexity of vanilla attention results in significantly long Time-to-First-Token (TTFT) latency. Existing approaches to address this complexity require additional pretraining or finetuning, and often sacrifice model accuracy. In this paper, we first provide both theoretical and empirical foundations for near-lossless sparse attention. We find dynamically capturing head-specific sparse patterns at runtime with low overhead is crucial. To address this, we propose SampleAttention, an adaptive, structured, and near-lossless sparse attention mechanism. Leveraging observed significant sparse patterns, SampleAttention attends to a fixed percentage of adjacent tokens to capture local window patterns, and employs a two-stage query-guided key-value filtering approach, which adaptively selects a minimum set of key-values with low overhead, to capture column stripe patterns. Comprehensive evaluations show that SampleAttention can seamlessly replace vanilla attention in off-the-shelf LLMs with nearly no accuracy loss, and reduces TTFT by up to $2.42\times$ compared with FlashAttention.

[35] Banishing LLM Hallucinations Requires Rethinking Generalization

Johnny Li, Saksham Consul, Eda Zhou, James Wong, Naila Farooqui, Yuxin Ye, Nithyashree Manohar, Zhuxiaona Wei, Tian Wu, Ben Echols, Sharon Zhou, Gregory Diamos

Main category: cs.CL

TL;DR: LLMs hallucinate not due to creativity-factuality tradeoff but because they memorize random patterns when trained on internet-scale data with high loss. A new architecture called Lamini-1 uses millions of memory experts to store facts and reduce hallucinations.

DetailsMotivation: Traditional approaches fail to explain why LLMs hallucinate in practice, suggesting the need for a better understanding of the root causes and new mitigation methods.

Method: Extensive experiments with LLMs augmented with Mixture of Memory Experts (MoME) to memorize random datasets, combined with theoretical analysis of neural networks trained with high loss. Developed Lamini-1 model with millions of memory experts for dynamic fact retrieval.
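
A toy illustration of the memory-expert idea: facts stored as key-value pairs across many small memories and retrieved dynamically by similarity; shapes and the retrieval rule are assumptions, not Lamini-1's actual design.

```python
# Toy key-value "memory experts" lookup retrieved by cosine similarity.
import numpy as np

class MemoryExperts:
    def __init__(self, keys: np.ndarray, values: list):
        self.keys, self.values = keys, values  # keys: (num_experts, dim)

    def retrieve(self, query: np.ndarray, top_k: int = 1) -> list:
        sims = self.keys @ query
        sims = sims / (np.linalg.norm(self.keys, axis=1)
                       * np.linalg.norm(query) + 1e-9)
        return [self.values[i] for i in np.argsort(-sims)[:top_k]]
```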

Result: LLMs can easily memorize large datasets of random numbers, and hallucinations occur when training loss is above a threshold. The proposed Lamini-1 architecture shows promise in reducing hallucinations.

Conclusion: Hallucinations stem from memorization of random patterns during training rather than creativity-factuality balance. The memory expert approach provides a new direction for building more factual LLMs.

Abstract: Despite their powerful chat, coding, and reasoning abilities, Large Language Models (LLMs) frequently hallucinate. Conventional wisdom suggests that hallucinations are a consequence of a balance between creativity and factuality, which can be mitigated, but not eliminated, by grounding the LLM in external knowledge sources. Through extensive systematic experiments, we show that these traditional approaches fail to explain why LLMs hallucinate in practice. Specifically, we show that LLMs augmented with a massive Mixture of Memory Experts (MoME) can easily memorize large datasets of random numbers. We corroborate these experimental findings with a theoretical construction showing that simple neural networks trained to predict the next token hallucinate when the training loss is above a threshold, as it usually is in practice when training on internet-scale data. We interpret our findings by comparing against traditional retrieval methods for mitigating hallucinations. We use our findings to design a first-generation model for removing hallucinations, Lamini-1, that stores facts in a massive mixture of millions of memory experts that are retrieved dynamically.

[36] Enhancing Natural Language Inference Performance with Knowledge Graph for COVID-19 Automated Fact-Checking in Indonesian Language

Arief Purnama Muharram, Ayu Purwarianti

Main category: cs.CL

TL;DR: Using Knowledge Graphs (KGs) as external knowledge improves Natural Language Inference (NLI) performance for automated COVID-19 fact-checking in Indonesian, achieving 86.16% accuracy.

DetailsMotivation: To overcome COVID-19 misinformation spread online by addressing deep learning performance stagnation due to knowledge limitations during training.

Method: Proposed three-module architecture: fact module processes KG information, NLI module handles semantic relationships between premise and hypothesis, and classifier module combines representations for final decision.
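
A minimal sketch of the fusion step: the fact-module (KG) vector and the NLI-module vector are concatenated and fed to a classifier head; dimensions are illustrative assumptions.

```python
# Sketch of the classifier module combining the fact and NLI representations.
import torch
import torch.nn as nn

class FactCheckClassifier(nn.Module):
    def __init__(self, kg_dim: int = 256, nli_dim: int = 768, num_labels: int = 2):
        super().__init__()
        self.head = nn.Sequential(
            nn.Linear(kg_dim + nli_dim, 256), nn.ReLU(),
            nn.Linear(256, num_labels),
        )

    def forward(self, kg_vec, nli_vec):
        fused = torch.cat([kg_vec, nli_vec], dim=-1)  # concatenate both modules
        return self.head(fused)
```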

Result: Achieved best accuracy of 0.8616, demonstrating significant NLI performance improvement in fact-checking when incorporating knowledge graphs.

Conclusion: Knowledge Graphs are valuable components for enhancing NLI performance in automated fact-checking systems, particularly for combating COVID-19 misinformation.

Abstract: Automated fact-checking is a key strategy to overcome the spread of COVID-19 misinformation on the internet. These systems typically leverage deep learning approaches through Natural Language Inference (NLI) to verify the truthfulness of information based on supporting evidence. However, one challenge that arises in deep learning is performance stagnation due to a lack of knowledge during training. This study proposes using a Knowledge Graph (KG) as external knowledge to enhance NLI performance for automated COVID-19 fact-checking in the Indonesian language. The proposed model architecture comprises three modules: a fact module, an NLI module, and a classifier module. The fact module processes information from the KG, while the NLI module handles semantic relationships between the given premise and hypothesis. The representation vectors from both modules are concatenated and fed into the classifier module to produce the final result. The model was trained using the generated Indonesian COVID-19 fact-checking dataset and the COVID-19 KG Bahasa Indonesia. Our study demonstrates that incorporating KGs can significantly improve NLI performance in fact-checking, achieving the best accuracy of 0.8616. This suggests that KGs are a valuable component for enhancing NLI performance in automated fact-checking.

[37] TreeBoN: Enhancing Inference-Time Alignment with Speculative Tree-Search and Best-of-N Sampling

Jiahao Qiu, Yifu Lu, Yifan Zeng, Jiacheng Guo, Jiayi Geng, Chenhao Zhu, Xinzhe Juan, Ling Yang, Huazheng Wang, Kaixuan Huang, Yue Wu, Mengdi Wang

Main category: cs.CL

TL;DR: TreeBoN integrates speculative tree-search with Best-of-N sampling to reduce computational costs while maintaining high-quality output in LLM inference-time alignment.

DetailsMotivation: Address the computational inefficiency of Best-of-N sampling while maintaining high output quality for inference-time alignment of large language models.

Method: Integrates speculative tree-search strategy into BoN sampling, maintaining parent nodes, iteratively branching and pruning low-quality responses using token-level rewards from DPO to guide tree expansion.
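
A toy branch-and-prune loop in the spirit of TreeBoN; generate_continuations and reward are hypothetical stand-ins for the LLM sampler and the DPO-derived token-level reward.

```python
# Toy tree-search Best-of-N: expand survivors, score, keep the best half.
def tree_bon(prompt, generate_continuations, reward, width=8, rounds=3):
    beams = [prompt]
    for _ in range(rounds):
        k = max(1, width // len(beams))
        children = [c for b in beams for c in generate_continuations(b, k=k)]
        children.sort(key=reward, reverse=True)
        beams = children[: max(1, len(children) // 2)]  # prune low-quality paths
    return max(beams, key=reward)  # best-of-N over the surviving leaves
```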

Result: Achieves 65% win rate on TutorEval and ~60% win rates across other datasets (AlpacaFarm, HH-RLHF, UltraFeedback, GSM8K), outperforming standard BoN with same computational cost.

Conclusion: TreeBoN demonstrates effective scalability and alignment efficacy, reducing computational overhead while maintaining high output quality in inference-time alignment.

Abstract: Inference-time alignment enhances the performance of large language models without requiring additional training or fine-tuning but presents challenges due to balancing computational efficiency with high-quality output. Best-of-N (BoN) sampling, as a simple yet powerful approach, generates multiple responses and selects the best one, achieving improved performance but with a high computational cost. We propose TreeBoN, a novel framework that integrates a speculative tree-search strategy into Best-of-N (BoN) Sampling. TreeBoN maintains a set of parent nodes, iteratively branching and pruning low-quality responses, thereby reducing computational overhead while maintaining high output quality. Our approach also leverages token-level rewards from Direct Preference Optimization (DPO) to guide tree expansion and prune low-quality paths. We evaluate TreeBoN using AlpacaFarm, HH-RLHF, UltraFeedback, GSM8K, and TutorEval datasets, demonstrating consistent improvements. Specifically, TreeBoN achieves the highest win rate of 65% on TutorEval and around 60% win rates across other different datasets, outperforming standard BoN with the same computational cost and showcasing its scalability and alignment efficacy.

[38] Attacking Misinformation Detection Using Adversarial Examples Generated by Language Models

Piotr Przybyła, Euan McGill, Horacio Saggion

Main category: cs.CL

TL;DR: TREPAT uses large language models to generate adversarial examples that bypass content-filtering algorithms by making small, meaning-preserving modifications to text within realistic query limits.

DetailsMotivation: To test the robustness of text classification algorithms that detect low-credibility content (propaganda, false claims, rumors, hyperpartisan news) and investigate whether LLMs can be used to attack content moderation systems.

Method: Initial rephrasings generated by LLMs using prompts inspired by meaning-preserving NLP tasks (text simplification, style transfer), then decomposed into small changes applied through beam search until classifier decision changes.
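
A toy version of this search loop; propose_variants and the classifier interface are hypothetical stand-ins for the LLM-generated small edits and the victim model.

```python
# Toy TREPAT-style beam search: apply small meaning-preserving edits until
# the victim classifier changes its decision or the query budget runs out.
def attack(text, propose_variants, classifier, beam_size=5, max_steps=50):
    """propose_variants(t) -> small rewrites of t; classifier(t) -> label;
    classifier.score(t, label) -> confidence in `label` (all hypothetical)."""
    original_label = classifier(text)
    beam = [text]
    for _ in range(max_steps):
        candidates = {v for t in beam for v in propose_variants(t)}
        for cand in candidates:
            if classifier(cand) != original_label:
                return cand  # victim's decision changed: attack succeeded
        # keep the variants that most weaken the original decision
        beam = sorted(candidates,
                      key=lambda t: classifier.score(t, original_label))[:beam_size]
    return None  # no flip within the query budget
```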

Result: Superior performance in constrained scenarios, especially for long input text (news articles) where exhaustive search is not feasible, confirmed through quantitative evaluation, manual assessment, and linguistic analysis.

Conclusion: LLMs can effectively generate adversarial examples to bypass content moderation systems, demonstrating vulnerabilities in current text classification algorithms for detecting low-credibility content.

Abstract: Large language models have many beneficial applications, but can they also be used to attack content-filtering algorithms in social media platforms? We investigate the challenge of generating adversarial examples to test the robustness of text classification algorithms detecting low-credibility content, including propaganda, false claims, rumours and hyperpartisan news. We focus on simulation of content moderation by setting realistic limits on the number of queries an attacker is allowed to attempt. Within our solution (TREPAT), initial rephrasings are generated by large language models with prompts inspired by meaning-preserving NLP tasks, such as text simplification and style transfer. Subsequently, these modifications are decomposed into small changes, applied through a beam search procedure, until the victim classifier changes its decision. We perform (1) quantitative evaluation using various prompts, models and query limits, (2) targeted manual assessment of the generated text and (3) qualitative linguistic analysis. The results confirm the superiority of our approach in the constrained scenario, especially in the case of long input text (news articles), where exhaustive search is not feasible.

[39] Dial-In LLM: Human-Aligned LLM-in-the-loop Intent Clustering for Customer Service Dialogues

Mengze Hong, Wailing Ng, Chen Jason Zhang, Yuanfeng Song, Di Jiang

Main category: cs.CL

TL;DR: Proposes LLM-in-the-loop intent clustering framework that integrates LLMs into clustering algorithms to discover coherent customer intent clusters with improved accuracy and efficiency.

DetailsMotivation: Existing intent clustering methods rely on embedding distance metrics and neglect semantic structures, leading to suboptimal performance in automated service agents.

Method: LLM-in-the-loop framework that uses fine-tuned LLMs for semantic coherence evaluation and intent cluster naming, with context-aware techniques for customer service dialogue and iterative discovery of optimal cluster numbers.

Result: Achieves over 95% accuracy in semantic coherence evaluation aligned with human judgments, significantly outperforms LLM-guided baselines in clustering quality and cost efficiency, and introduces a comprehensive Chinese dialogue dataset with 100k+ calls and 1,507 annotated clusters.

Conclusion: LLM-in-the-loop techniques are highly effective for scalable dialogue data mining, demonstrating prominence in improving intent discovery for customer service applications.

Abstract: Discovering customer intentions is crucial for automated service agents, yet existing intent clustering methods often fall short due to their reliance on embedding distance metrics and neglect of underlying semantic structures. To address these limitations, we propose an LLM-in-the-loop (LLM-ITL) intent clustering framework, integrating the language understanding capabilities of LLMs into conventional clustering algorithms. Specifically, this paper (1) examines the effectiveness of fine-tuned LLMs in semantic coherence evaluation and intent cluster naming, achieving over 95% accuracy aligned with human judgments; (2) designs an LLM-ITL framework that facilitates the iterative discovery of coherent intent clusters and the optimal number of clusters; and (3) introduces context-aware techniques tailored for customer service dialogue. Since existing English benchmarks lack sufficient semantic diversity and intent coverage, we further present a comprehensive Chinese dialogue intent dataset comprising over 100k real customer service calls with 1,507 human-annotated clusters. The proposed approaches significantly outperform LLM-guided baselines, achieving notable improvements in clustering quality, cost efficiency, and downstream applications. Combined with several best practices, our findings highlight the prominence of LLM-in-the-loop techniques for scalable dialogue data mining.

[40] Attention-guided Self-reflection for Zero-shot Hallucination Detection in Large Language Models

Qiang Liu, Xinlong Chen, Yue Ding, Bowen Song, Weiqiang Wang, Shu Wu, Liang Wang

Main category: cs.CL

TL;DR: AGSER is a novel zero-shot hallucination detection method that uses attention-guided self-reflection to categorize queries and compute consistency scores, outperforming existing methods with reduced computational overhead.

DetailsMotivation: Hallucination has become a major obstacle for effective LLM applications, creating a need for efficient detection methods that don't require additional training data.

Method: Uses attention contributions to categorize queries into attentive/non-attentive types, processes each separately through LLMs, computes consistency scores between responses and original answers, and uses score differences as hallucination estimators.
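
A sketch of the AGSER score under these assumptions; llm, split_by_attention, and consistency are hypothetical stand-ins, and the three llm calls mirror the three passes the paper reports.

```python
# Sketch of the AGSER hallucination estimator: the gap between the attentive
# and non-attentive consistency scores with the original answer.
def agser_score(query, llm, split_by_attention, consistency):
    answer = llm(query)                           # pass 1: original answer
    attentive_q, non_attentive_q = split_by_attention(query)
    attentive_ans = llm(attentive_q)              # pass 2
    non_attentive_ans = llm(non_attentive_q)      # pass 3
    # Large gap -> answer grounded in attended content; small gap -> suspect.
    return (consistency(attentive_ans, answer)
            - consistency(non_attentive_ans, answer))
```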

Result: Extensive experiments with 4 LLMs across 3 hallucination benchmarks show AGSER significantly outperforms existing zero-shot detection methods while requiring only 3 LLM passes and 2 token sets.

Conclusion: AGSER provides an effective and computationally efficient approach for zero-shot hallucination detection in LLMs, demonstrating superior performance over current methods.

Abstract: Hallucination has emerged as a significant barrier to the effective application of Large Language Models (LLMs). In this work, we introduce a novel Attention-Guided SElf-Reflection (AGSER) approach for zero-shot hallucination detection in LLMs. The AGSER method utilizes attention contributions to categorize the input query into attentive and non-attentive queries. Each query is then processed separately through the LLMs, allowing us to compute consistency scores between the generated responses and the original answer. The difference between the two consistency scores serves as a hallucination estimator. In addition to its efficacy in detecting hallucinations, AGSER notably reduces computational overhead, requiring only three passes through the LLM and utilizing two sets of tokens. We have conducted extensive experiments with four widely-used LLMs across three different hallucination benchmarks, demonstrating that our approach significantly outperforms existing methods in zero-shot hallucination detection.

[41] FedP$^2$EFT: Federated Learning to Personalize PEFT for Multilingual LLMs

Royson Lee, Minyoung Kim, Fady Rezk, Rui Li, Stylianos I. Venieris, Timothy Hospedales

Main category: cs.CL

TL;DR: FedP^2EFT is a federated learning method that uses Bayesian sparse rank selection to automatically optimize personalized PEFT structures for multilingual LLMs in cross-device FL settings, outperforming existing methods.

DetailsMotivation: To improve client-specific performance in federated learning for multilingual LLMs by automating personalized PEFT structure selection instead of manual configuration, addressing overfitting issues in low-data regimes.

Method: Uses Bayesian sparse rank selection to collaboratively learn optimal personalized PEFT structures (LoRA adapters) for each client in cross-device federated learning settings.

Result: Outperforms existing personalized fine-tuning methods on both simulated and real-world multilingual FL benchmarks, while complementing other FL methods.

Conclusion: FedP^2EFT provides an effective automated approach for personalized PEFT structure optimization in multilingual federated learning, demonstrating superior performance over manual configuration methods.

Abstract: Federated learning (FL) has enabled the training of multilingual large language models (LLMs) on diverse and decentralized multilingual data, especially on low-resource languages. To improve client-specific performance, personalization via the use of parameter-efficient fine-tuning (PEFT) modules such as LoRA is common. This involves a personalization strategy (PS), such as the design of the PEFT adapter structures (e.g., in which layers to add LoRAs and what ranks) and choice of hyperparameters (e.g., learning rates) for fine-tuning. Instead of manual PS configuration, we propose FedP$^2$EFT, a federated learning-to-personalize method for multilingual LLMs in cross-device FL settings. Unlike most existing PEFT structure selection methods, which are prone to overfitting low-data regimes, FedP$^2$EFT collaboratively learns the optimal personalized PEFT structure for each client via Bayesian sparse rank selection. Evaluations on both simulated and real-world multilingual FL benchmarks demonstrate that FedP$^2$EFT largely outperforms existing personalized fine-tuning methods, while complementing other existing FL methods. Code is available at https://github.com/SamsungLabs/fedp2eft.

[42] Rapid Word Learning Through Meta In-Context Learning

Wentao Wang, Guangyuan Jiang, Tal Linzen, Brenden M. Lake

Main category: cs.CL

TL;DR: Meta-training method (Minnow) enables language models to learn new words from few examples, achieving performance comparable to large pre-trained models and improving word discrimination, categorization, and generation abilities.

DetailsMotivation: Current language models have underexplored abilities for few-shot word learning compared to humans who can quickly learn and flexibly use new words from minimal examples.

Method: Minnow trains models to generate new word usages using placeholder tokens, repeated across many words to develop general word-learning ability. Applied to both training from scratch and finetuning pre-trained LLMs.
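
A minimal sketch of episode construction with a placeholder token; the corpus format and separator are assumptions.

```python
# Sketch of a Minnow-style meta-training episode: mask the target word with a
# placeholder, show a few usages in context, train on generating a new one.
PLACEHOLDER = "<new_word>"

def build_episode(word: str, usages: list, k_support: int = 3) -> str:
    masked = [u.replace(word, PLACEHOLDER) for u in usages]
    support, target = masked[:k_support], masked[k_support]
    # A standard LM objective on this string teaches in-context word learning.
    return " <sep> ".join(support + [target])

print(build_episode(
    "zib",
    ["the zib slept", "a zib can fly", "I fed the zib", "two zibs played"],
))
```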

Result: Models trained with Minnow achieve strong few-shot word learning comparable to large LLMs, with improved discrimination, syntactic categorization, and generation of new word usages and definitions.

Conclusion: Minnow demonstrates high data efficiency and potential to enhance language model performance in word learning tasks through meta-training approach.

Abstract: Humans can quickly learn a new word from a few illustrative examples, and then systematically and flexibly use it in novel contexts. Yet the abilities of current language models for few-shot word learning, and methods for improving these abilities, are underexplored. In this study, we introduce a novel method, Meta-training for IN-context learNing Of Words (Minnow). This method trains language models to generate new examples of a word’s usage given a few in-context examples, using a special placeholder token to represent the new word. This training is repeated on many new words to develop a general word-learning ability. We find that training models from scratch with Minnow on human-scale child-directed language enables strong few-shot word learning, comparable to a large language model (LLM) pre-trained on orders of magnitude more data. Furthermore, through discriminative and generative evaluations, we demonstrate that finetuning pre-trained LLMs with Minnow improves their ability to discriminate between new words, identify syntactic categories of new words, and generate reasonable new usages and definitions for new words, based on one or a few in-context examples. These findings highlight the data efficiency of Minnow and its potential to improve language model performance in word learning tasks.

[43] Problem Solved? Information Extraction Design Space for Layout-Rich Documents using LLMs

Gaye Colakoglu, Gürkan Solmaz, Jonathan Fürst

Main category: cs.CL

TL;DR: LLMs can achieve competitive layout-aware information extraction performance when properly configured with optimized data structuring, model engagement, and output refinement techniques, matching specialized models without fine-tuning.

DetailsMotivation: To explore how large language models can effectively extract information from layout-rich documents by addressing core challenges of data structuring, model engagement, and output refinement.

Method: Developed LayIE-LLM test suite to benchmark layout-aware IE, used one-factor-at-a-time (OFAT) optimization method to find optimal configurations, and compared against traditional fine-tuned IE models.
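
A minimal implementation of the OFAT idea: sweep one factor at a time, fix its best level, and move on, so the cost is the sum rather than the product of the level counts; factor names are illustrative.

```python
# One-factor-at-a-time (OFAT) search over a pipeline's design space.
def ofat(baseline: dict, levels: dict, evaluate) -> dict:
    """levels maps each factor to candidate values; evaluate(cfg) -> F1 score."""
    config = dict(baseline)
    for factor, candidates in levels.items():
        best_val, best_score = config[factor], evaluate(config)
        for value in candidates:
            trial = {**config, factor: value}
            score = evaluate(trial)
            if score > best_score:
                best_val, best_score = value, score
        config[factor] = best_val  # fix this factor before the next sweep
    return config

# Illustrative design space: OFAT costs ~sum(len(v) for v in space.values())
# evaluations instead of the full factorial product over all factors.
space = {"input_repr": ["text", "text+boxes"], "chunking": ["page", "region"],
         "prompt": ["zero-shot", "few-shot"]}
```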

Result: Optimized LLM configurations achieved 13.3-37.5 F1 points improvement over baseline, with OFAT method achieving near-optimal results (only 0.8-1.8 points lower than full factorial) using only 2.8% of computation.

Conclusion: Well-configured general-purpose LLMs can match specialized model performance for layout-aware information extraction, providing a cost-effective, fine-tuning-free alternative.

Abstract: This paper defines and explores the design space for information extraction (IE) from layout-rich documents using large language models (LLMs). The three core challenges of layout-aware IE with LLMs are 1) data structuring, 2) model engagement, and 3) output refinement. Our study investigates the sub-problems and methods within these core challenges, such as input representation, chunking, prompting, selection of LLMs, and multimodal models. It examines the effect of different design choices through LayIE-LLM, a new, open-source, layout-aware IE test suite, benchmarking against traditional, fine-tuned IE models. The results on two IE datasets show that LLMs require adjustment of the IE pipeline to achieve competitive performance: the optimized configuration found with LayIE-LLM achieves 13.3–37.5 F1 points more than a general-practice baseline configuration using the same LLM. To find a well-working configuration, we develop a one-factor-at-a-time (OFAT) method that achieves near-optimal results. Our method is only 0.8–1.8 points lower than the best full factorial exploration with a fraction (2.8%) of the required computation. Overall, we demonstrate that, if well-configured, general-purpose LLMs match the performance of specialized models, providing a cost-effective, finetuning-free alternative. Our test-suite is available at https://github.com/gayecolakoglu/LayIE-LLM.

[44] Texture or Semantics? Vision-Language Models Get Lost in Font Recognition

Zhecheng Li, Guoxian Song, Yujun Cai, Zhen Xiong, Junsong Yuan, Yiwei Wang

Main category: cs.CL

TL;DR: VLMs perform poorly on font recognition tasks, struggling with both easy font identification and hard Stroop-effect challenges, with minimal improvement from few-shot learning or CoT prompting.

DetailsMotivation: To investigate whether modern Vision-Language Models (VLMs) can effectively recognize fonts in fine-grained tasks, given their multimodal capabilities and potential applications in design-related scenarios.

Method: Created Font Recognition Benchmark (FRB) with 15 common fonts in easy (10 sentences) and hard (font names as text with Stroop effect) versions. Evaluated various VLMs on font recognition tasks using few-shot learning and Chain-of-Thought prompting.

Result: Current VLMs show limited font recognition capabilities, fail to achieve satisfactory performance, are easily affected by Stroop effect, and show minimal improvement from few-shot learning or CoT prompting.

Conclusion: VLMs have inherent limitations in capturing semantic font features, indicating they are not yet effective tools for fine-grained font recognition tasks despite their general multimodal capabilities.

Abstract: Modern Vision-Language Models (VLMs) exhibit remarkable visual and linguistic capabilities, achieving impressive performance in various tasks such as image recognition and object localization. However, their effectiveness in fine-grained tasks remains an open question. In everyday scenarios, individuals encountering design materials, such as magazines, typography tutorials, research papers, or branding content, may wish to identify aesthetically pleasing fonts used in the text. Given their multimodal capabilities and free accessibility, many VLMs are often considered potential tools for font recognition. This raises a fundamental question: Do VLMs truly possess the capability to recognize fonts? To investigate this, we introduce the Font Recognition Benchmark (FRB), a compact and well-structured dataset comprising 15 commonly used fonts. FRB includes two versions: (i) an easy version, where 10 sentences are rendered in different fonts, and (ii) a hard version, where each text sample consists of the names of the 15 fonts themselves, introducing a Stroop effect that challenges model perception. Through extensive evaluation of various VLMs on font recognition tasks, we arrive at the following key findings: (i) Current VLMs exhibit limited font recognition capabilities, with many state-of-the-art models failing to achieve satisfactory performance and being easily affected by the Stroop effect introduced by textual information. (ii) Few-shot learning and Chain-of-Thought (CoT) prompting provide minimal benefits in improving font recognition accuracy across different VLMs. (iii) Attention analysis sheds light on the inherent limitations of VLMs in capturing semantic features.

[45] Bias Analysis and Mitigation through Protected Attribute Detection and Regard Classification

Takuma Udagawa, Yang Zhao, Hiroshi Kanayama, Bishwaranjan Bhattacharjee

Main category: cs.CL

TL;DR: Efficient annotation pipeline for detecting social biases in LLM pretraining corpora through protected attribute detection and regard classification, with focus on Common Crawl data.

DetailsMotivation: Pretraining data from web-crawled texts contain undesirable social biases that can be perpetuated or amplified by large language models, requiring systematic analysis and mitigation.

Method: Proposed annotation pipeline with two steps: protected attribute detection to identify diverse demographics, followed by regard classification to analyze language polarity towards each attribute.
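
A minimal sketch of the two-step pipeline; both classifiers are hypothetical stand-ins for the paper's detectors.

```python
# Sketch: detect protected-attribute mentions, then classify the regard
# (language polarity) of the surrounding text toward each attribute.
def annotate(documents, detect_attributes, classify_regard):
    """detect_attributes(doc) -> [(span, attribute)];
    classify_regard(...) -> one of {"negative", "neutral", "positive"}."""
    annotations = []
    for doc in documents:
        for span, attribute in detect_attributes(doc):      # step 1
            regard = classify_regard(doc, span, attribute)  # step 2
            annotations.append({"doc": doc, "attribute": attribute,
                                "span": span, "regard": regard})
    return annotations
```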

Result: The study demonstrates the effectiveness of the bias analysis and mitigation measures, specifically applied to Common Crawl as a representative pretraining corpus.

Conclusion: The proposed pipeline provides an efficient and effective method for investigating and addressing social biases in LLM pretraining data, helping prevent bias amplification in language models.

Abstract: Large language models (LLMs) acquire general linguistic knowledge from massive-scale pretraining. However, pretraining data mainly comprised of web-crawled texts contain undesirable social biases which can be perpetuated or even amplified by LLMs. In this study, we propose an efficient yet effective annotation pipeline to investigate social biases in the pretraining corpora. Our pipeline consists of protected attribute detection to identify diverse demographics, followed by regard classification to analyze the language polarity towards each attribute. Through our experiments, we demonstrate the effect of our bias analysis and mitigation measures, focusing on Common Crawl as the most representative pretraining corpus.

[46] LawFlow: Collecting and Simulating Lawyers’ Thought Processes on Business Formation Case Studies

Debarati Das, Khanh Chi Le, Ritik Sachin Parkar, Karin De Langis, Brendan Madson, Chad M. Berryman, Robin M. Willis, Daniel H. Moses, Brett McDonnell, Daniel Schwarcz, Dongyeop Kang

Main category: cs.CL

TL;DR: LawFlow dataset captures end-to-end legal workflows from trained law students, revealing systematic differences between human and LLM reasoning patterns in legal practice.

DetailsMotivation: Current AI legal datasets focus on isolated subtasks rather than capturing the complete, adaptive reasoning required in real-world legal practice, creating a gap in supporting complex legal workflows.

Method: Created LawFlow dataset with complete legal workflows from trained law students in business entity formation scenarios, comparing human and LLM-generated workflows to analyze structural and reasoning differences.

Result: Human workflows are modular and adaptive, while LLM workflows are sequential, exhaustive, and less sensitive to downstream implications. Legal professionals prefer AI in supportive roles rather than end-to-end execution.

Conclusion: LLMs currently have limitations in supporting complex legal workflows but present opportunities for developing more collaborative, reasoning-aware legal AI systems that complement rather than replace human expertise.

Abstract: Legal practitioners, particularly those early in their careers, face complex, high-stakes tasks that require adaptive, context-sensitive reasoning. While AI holds promise in supporting legal work, current datasets and models are narrowly focused on isolated subtasks and fail to capture the end-to-end decision-making required in real-world practice. To address this gap, we introduce LawFlow, a dataset of complete end-to-end legal workflows collected from trained law students, grounded in real-world business entity formation scenarios. Unlike prior datasets focused on input-output pairs or linear chains of thought, LawFlow captures dynamic, modular, and iterative reasoning processes that reflect the ambiguity, revision, and client-adaptive strategies of legal practice. Using LawFlow, we compare human and LLM-generated workflows, revealing systematic differences in structure, reasoning flexibility, and plan execution. Human workflows tend to be modular and adaptive, while LLM workflows are more sequential, exhaustive, and less sensitive to downstream implications. Our findings also suggest that legal professionals prefer AI to carry out supportive roles, such as brainstorming, identifying blind spots, and surfacing alternatives, rather than executing complex workflows end-to-end. Our results highlight both the current limitations of LLMs in supporting complex legal workflows and opportunities for developing more collaborative, reasoning-aware legal AI systems. All data and code are available on our project page (https://minnesotanlp.github.io/LawFlow-website/).

[47] Demystifying optimized prompts in language models

Rimon Melamed, Lucas H. McCabe, H. Howie Huang

Main category: cs.CL

TL;DR: Optimized prompts for language models consist of rare punctuation and noun tokens, follow distinct activation patterns, and exhibit consistent representation formation across different instruction-tuned models.

DetailsMotivation: Language models are not robust to out-of-distribution inputs, and machine-generated optimized prompts can induce specific behaviors while appearing uninterpretable, prompting investigation into their composition and internal processing mechanisms.

Method: Investigation of optimized prompt composition through analysis of token types and rarity, examination of model activations to distinguish optimized from natural language prompts, and tracing representation formation pathways across different instruction-tuned model families.

Result: Optimized prompts primarily consist of punctuation and rare noun tokens, show clear distinction from natural language in sparse activation subsets, and follow similar representation formation paths through network layers across various model families.

Conclusion: Optimized prompts have distinctive compositional and processing characteristics that differ fundamentally from natural language inputs, providing insights into how language models handle out-of-distribution inputs and can be manipulated through carefully engineered prompts.

Abstract: Modern language models (LMs) are not robust to out-of-distribution inputs. Machine-generated (“optimized”) prompts can be used to modulate LM outputs and induce specific behaviors while appearing completely uninterpretable. In this work, we investigate the composition of optimized prompts, as well as the mechanisms by which LMs parse and build predictions from optimized prompts. We find that optimized prompts primarily consist of punctuation and noun tokens which are more rare in the training data. Internally, optimized prompts are clearly distinguishable from natural language counterparts based on sparse subsets of the model’s activations. Across various families of instruction-tuned models, optimized prompts follow a similar path in how their representations form through the network.

[48] QualBench: Benchmarking Chinese LLMs with Localized Professional Qualifications for Vertical Domain Evaluation

Mengze Hong, Wailing Ng, Chen Jason Zhang, Di Jiang

Main category: cs.CL

TL;DR: QualBench is the first multi-domain Chinese QA benchmark using qualification exams to evaluate Chinese LLMs, showing they outperform non-Chinese models but still have significant room for improvement with 53.98% average accuracy.

DetailsMotivation: Existing benchmarks lack domain coverage and insights into Chinese working contexts, creating a need for vertical-domain evaluations to ensure reliable applications of Chinese LLMs.

Method: Created QualBench with over 17,000 questions across six vertical domains from 24 Chinese qualification exams, using qualification exams as a unified framework for expertise evaluation aligned with national policies and professional standards.

Result: Chinese LLMs consistently surpassed non-Chinese models, with Qwen2.5 outperforming GPT-4o. Average accuracy was 53.98%, revealing gaps in domain coverage. Identified performance degradation from LLM crowdsourcing and data contamination issues.

Conclusion: Localized domain knowledge is crucial for meeting qualification requirements. Prompt engineering and fine-tuning show effectiveness, suggesting opportunities for future improvements through multi-domain RAG and Federated Learning.

Abstract: The rapid advancement of Chinese LLMs underscores the need for vertical-domain evaluations to ensure reliable applications. However, existing benchmarks often lack domain coverage and provide limited insights into the Chinese working context. Leveraging qualification exams as a unified framework for expertise evaluation, we introduce QualBench, the first multi-domain Chinese QA benchmark dedicated to localized assessment of Chinese LLMs. The dataset includes over 17,000 questions across six vertical domains, drawn from 24 Chinese qualifications to align with national policies and professional standards. Results reveal an interesting pattern of Chinese LLMs consistently surpassing non-Chinese models, with the Qwen2.5 model outperforming the more advanced GPT-4o, emphasizing the value of localized domain knowledge in meeting qualification requirements. The average accuracy of 53.98% reveals the current gaps in domain coverage within model capabilities. Furthermore, we identify performance degradation caused by LLM crowdsourcing, assess data contamination, and illustrate the effectiveness of prompt engineering and model fine-tuning, suggesting opportunities for future improvements through multi-domain RAG and Federated Learning.

[49] Insertion Language Models: Sequence Generation with Arbitrary-Position Insertions

Dhruvesh Patel, Aishwarya Sahoo, Avinash Amballa, Tahira Naseem, Tim G. J. Rudner, Andrew McCallum

Main category: cs.CL

TL;DR: Insertion Language Models (ILMs) outperform autoregressive and masked diffusion models on planning tasks and offer flexible text infilling by inserting tokens at arbitrary positions one at a time.

DetailsMotivation: Autoregressive models struggle with sequences requiring sophisticated constraints or out-of-order generation, while masked diffusion models have incoherence issues and limited infilling flexibility when token count is unknown.

Method: ILMs learn to insert tokens at arbitrary positions by jointly selecting both position and vocabulary element using a tailored network parameterization and simple denoising objective.

Result: ILMs outperform both ARMs and MDMs on planning tasks, perform on par with ARMs in unconditional text generation, and offer greater flexibility than MDMs in arbitrary-length text infilling.

Conclusion: Insertion Language Models provide a flexible alternative to traditional sequence generation methods, effectively handling complex constraints and out-of-order dependencies while maintaining strong performance across various tasks.

Abstract: Autoregressive models (ARMs), which predict subsequent tokens one-by-one “from left to right,” have achieved significant success across a wide range of sequence generation tasks. However, they struggle to accurately represent sequences that require satisfying sophisticated constraints or whose sequential dependencies are better addressed by out-of-order generation. Masked Diffusion Models (MDMs) address some of these limitations, but the process of unmasking multiple tokens simultaneously in MDMs can introduce incoherences, and MDMs cannot handle arbitrary infilling constraints when the number of tokens to be filled in is not known in advance. In this work, we introduce Insertion Language Models (ILMs), which learn to insert tokens at arbitrary positions in a sequence – that is, they select jointly both the position and the vocabulary element to be inserted. By inserting tokens one at a time, ILMs can represent strong dependencies between tokens, and their ability to generate sequences in arbitrary order allows them to accurately model sequences where token dependencies do not follow a left-to-right sequential structure. To train ILMs, we propose a tailored network parameterization and use a simple denoising objective. Our empirical evaluation demonstrates that ILMs outperform both ARMs and MDMs on common planning tasks. Furthermore, we show that ILMs outperform MDMs and perform on par with ARMs in an unconditional text generation task while offering greater flexibility than MDMs in arbitrary-length text infilling. The code is available at: https://dhruveshp.com/projects/ilm .
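
To make the joint position-and-token selection concrete, here is a minimal decoding sketch. The `joint_scores` stub and the tiny vocabulary are hypothetical placeholders for the paper's learned network; only the insert-one-token-at-an-arbitrary-slot control flow is the point.

```python
import torch

# Sketch of insertion-style decoding: at each step the model scores every
# (slot, token) pair jointly and the argmax token is inserted at the argmax
# slot. `joint_scores` is a stand-in for the paper's learned network.

VOCAB = ["<eos>", "the", "cat", "sat"]

def joint_scores(seq: list[str]) -> torch.Tensor:
    # Stand-in scorer: (num_slots, vocab) logits; num_slots = len(seq) + 1.
    torch.manual_seed(len(seq))          # deterministic toy behavior
    return torch.randn(len(seq) + 1, len(VOCAB))

def generate(max_steps: int = 5) -> list[str]:
    seq: list[str] = []
    for _ in range(max_steps):
        logits = joint_scores(seq)                 # (slots, vocab)
        flat = logits.flatten().argmax()
        slot, tok = divmod(flat.item(), len(VOCAB))
        if VOCAB[tok] == "<eos>":                  # model chooses to stop
            break
        seq.insert(slot, VOCAB[tok])               # arbitrary-position insert
    return seq

print(generate())
```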

[50] NOVER: Incentive Training for Language Models via Verifier-Free Reinforcement Learning

Wei Liu, Siya Qi, Xinyu Wang, Chen Qian, Yali Du, Yulan He

Main category: cs.CL

TL;DR: NOVER is a reinforcement learning framework that enables incentive training without external verifiers, outperforming same-size models distilled from large reasoning models by 7.7% and enabling new optimization possibilities like inverse incentive training.

DetailsMotivation: Existing incentive training methods rely on external verifiers that limit applicability to domains like math and coding, and reward models require costly high-quality annotated data.

Method: Proposes NOVER (NO-VERifier Reinforcement Learning), a general RL framework that uses only standard supervised fine-tuning data without needing external verifiers.

Result: NOVER outperforms same-size models distilled from large reasoning models like DeepSeek R1 671B by 7.7% and works across diverse text-to-text tasks.

Conclusion: NOVER provides a flexible, verifier-free approach to incentive training that expands applicability beyond domains with readily available verifiers and enables new optimization techniques.

Abstract: Recent advances such as DeepSeek R1-Zero highlight the effectiveness of incentive training, a reinforcement learning paradigm that computes rewards solely based on the final answer part of a language model’s output, thereby encouraging the generation of intermediate reasoning steps. However, these methods fundamentally rely on external verifiers, which limits their applicability to domains like mathematics and coding where such verifiers are readily available. Although reward models can serve as verifiers, they require high-quality annotated data and are costly to train. In this work, we propose NOVER, NO-VERifier Reinforcement Learning, a general reinforcement learning framework that requires only standard supervised fine-tuning data with no need for an external verifier. NOVER enables incentive training across a wide range of text-to-text tasks and outperforms the model of the same size distilled from large reasoning models such as DeepSeek R1 671B by 7.7 percent. Moreover, the flexibility of NOVER enables new possibilities for optimizing large language models, such as inverse incentive training.

[51] Unveil Multi-Picture Descriptions for Multilingual Mild Cognitive Impairment Detection via Contrastive Learning

Kristin Qi, Jiali Cheng, Youxiang Zhu, Hadi Amiri, Xiaohui Liang

Main category: cs.CL

TL;DR: A framework for multilingual MCI detection using contrastive learning, image modality integration, and Product of Experts to handle multiple pictures, achieving significant performance improvements.

DetailsMotivation: Detecting Mild Cognitive Impairment from picture descriptions is challenging in multilingual and multiple picture settings, as prior work focused only on English speakers and single pictures.

Method: Three-component framework: 1) supervised contrastive learning for discriminative representations, 2) incorporating image modality alongside speech/text, 3) Product of Experts strategy to reduce spurious correlations and overfitting.

Result: +7.1% increase in UAR (68.1% to 75.2%) and +2.9% increase in F1 score (80.6% to 83.5%) compared to text unimodal baseline. Contrastive learning particularly benefits text modality over speech.

Conclusion: The framework effectively addresses challenges in multilingual and multi-picture MCI detection, demonstrating significant performance improvements through multimodal integration and contrastive learning.

Abstract: Detecting Mild Cognitive Impairment from picture descriptions is critical yet challenging, especially in multilingual and multiple picture settings. Prior work has primarily focused on English speakers describing a single picture (e.g., the ‘Cookie Theft’). The TAUKDIAL-2024 challenge expands this scope by introducing multilingual speakers and multiple pictures, which presents new challenges in analyzing picture-dependent content. To address these challenges, we propose a framework with three components: (1) enhancing discriminative representation learning via supervised contrastive learning, (2) involving image modality rather than relying solely on speech and text modalities, and (3) applying a Product of Experts (PoE) strategy to mitigate spurious correlations and overfitting. Our framework improves MCI detection performance, achieving a +7.1% increase in Unweighted Average Recall (UAR) (from 68.1% to 75.2%) and a +2.9% increase in F1 score (from 80.6% to 83.5%) compared to the text unimodal baseline. Notably, the contrastive learning component yields greater gains for the text modality compared to speech. These results highlight our framework’s effectiveness in multilingual and multi-picture MCI detection.
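
As a rough illustration of the Product of Experts step, the sketch below fuses per-modality class probabilities by multiplying them elementwise and renormalizing; the expert outputs are toy numbers, and the paper's experts are learned models.

```python
import numpy as np

# Minimal Product-of-Experts fusion over per-modality class probabilities
# (text, speech, image). PoE multiplies expert distributions and
# renormalizes, so a prediction must be supported by every modality.

def product_of_experts(expert_probs: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    log_p = np.log(expert_probs + eps).sum(axis=0)   # sum of log-probs
    log_p -= log_p.max()                             # numerical stability
    p = np.exp(log_p)
    return p / p.sum()

experts = np.array([
    [0.7, 0.3],   # text expert:   P(MCI), P(healthy)
    [0.6, 0.4],   # speech expert
    [0.5, 0.5],   # image expert (uninformative -> no effect)
])
print(product_of_experts(experts))   # sharper than any single expert
```

Because an uninformative expert contributes a flat distribution, it leaves the product unchanged, which is one reason PoE can damp spurious modality-specific correlations.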

[52] Cog-TiPRO: Iterative Prompt Refinement with LLMs to Detect Cognitive Decline via Longitudinal Voice Assistant Commands

Kristin Qi, Youxiang Zhu, Caroline Summerour, John A. Batsis, Xiaohui Liang

Main category: cs.CL

TL;DR: Voice assistant systems can detect cognitive decline through speech pattern analysis using a novel framework combining LLM-driven linguistic feature extraction, acoustic analysis, and temporal modeling.

DetailsMotivation: Early detection of cognitive decline is crucial for interventions, but traditional clinical assessments are labor-intensive and impractical for frequent monitoring.

Method: Proposed Cog-TiPRO framework with: 1) LLM-driven iterative prompt refinement for linguistic features, 2) HuBERT-based acoustic feature extraction, 3) transformer-based temporal modeling using iTransformer.

Result: Achieved 73.80% accuracy and 72.67% F1-score in detecting MCI, outperforming baseline by 27.13%. Identified unique linguistic features characterizing cognitive decline in voice commands.

Conclusion: Voice assistant systems with the Cog-TiPRO framework provide an effective, non-invasive approach for longitudinal monitoring and early detection of cognitive decline through everyday voice interactions.

Abstract: Early detection of cognitive decline is crucial for enabling interventions that can slow neurodegenerative disease progression. Traditional diagnostic approaches rely on labor-intensive clinical assessments, which are impractical for frequent monitoring. Our pilot study investigates voice assistant systems (VAS) as non-invasive tools for detecting cognitive decline through longitudinal analysis of speech patterns in voice commands. Over an 18-month period, we collected voice commands from 35 older adults, with 15 participants providing daily at-home VAS interactions. To address the challenges of analyzing these short, unstructured and noisy commands, we propose Cog-TiPRO, a framework that combines (1) LLM-driven iterative prompt refinement for linguistic feature extraction, (2) HuBERT-based acoustic feature extraction, and (3) transformer-based temporal modeling. Using iTransformer, our approach achieves 73.80% accuracy and 72.67% F1-score in detecting MCI, outperforming its baseline by 27.13%. Through our LLM approach, we identify linguistic features that uniquely characterize everyday command usage patterns in individuals experiencing cognitive decline.

[53] On the class of coding optimality of human languages and the origins of Zipf’s law

Ramon Ferrer-i-Cancho

Main category: cs.CL

TL;DR: The paper presents a new optimality class for coding systems where Zipf’s law emerges from linear displacement from optimal coding, showing that languages exhibiting Zipf’s law are potential members while other species’ communication systems may not qualify.

DetailsMotivation: To understand the mathematical foundations of Zipf's law in communication systems and explain why some systems (like human languages) exhibit power-law distributions while others show exponential distributions.

Method: The authors develop a theoretical framework defining a class of optimal coding systems where Zipf’s law emerges from linear displacement from optimal coding. They analyze frequency-rank relationships in double logarithmic scale and compare human languages with other species’ communication systems.

Result: Human languages showing Zipf’s law are identified as potential members of this optimality class, while many other species’ communication systems cannot be members due to exponential distributions. Dolphins and humpback whales might qualify. A straight line in log-log plots indicates linear displacement from optimal coding.

Conclusion: Zipf’s law originates from compression processes, and the paper provides testable conditions for its emergence in compressing systems, offering new insights into the mathematical structure of optimal coding systems.

Abstract: Here we present a new class of optimality for coding systems. Members of that class are displaced linearly from optimal coding and thus exhibit Zipf’s law, namely a power-law distribution of frequency ranks. Within that class, Zipf’s law, the size-rank law and the size-probability law form a group-like structure. We identify human languages that are members of the class. All languages showing sufficient agreement with Zipf’s law are potential members of the class. In contrast, communication systems in some other species cannot be members of that class because they exhibit an exponential distribution instead, though dolphins and humpback whales might qualify. We provide a new insight into plots of frequency versus rank in double logarithmic scale. For any system, a straight line in that scale indicates that the lengths of optimal codes under non-singular coding and under uniquely decodable encoding are displaced by a linear function whose slope is the exponent of Zipf’s law. For systems under compression and constrained to be uniquely decodable, such a straight line may indicate that the system is coding close to optimality. We provide support for the hypothesis that Zipf’s law originates from compression and define testable conditions for the emergence of Zipf’s law in compressing systems.
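
For reference, the standard form of Zipf's law and the log-log linearity the abstract appeals to can be written as follows (a textbook statement, not a derivation from the paper):

```latex
% Zipf's law: the frequency f of the word of rank r follows a power law,
\[
  f(r) \;=\; C\, r^{-\alpha}
  \qquad\Longrightarrow\qquad
  \log f(r) \;=\; \log C \;-\; \alpha \log r ,
\]
% so a rank-frequency plot in double logarithmic scale is a straight line
% with slope $-\alpha$ (roughly $-1$ for natural language). In the paper's
% reading, that slope is also the slope of the linear displacement between
% optimal code lengths under non-singular and uniquely decodable coding.
```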

[54] RAT: Bridging RNN Efficiency and Attention Accuracy via Chunk-based Sequence Modeling

Xiuying Wei, Anunay Yadav, Razvan Pascanu, Caglar Gulcehre

Main category: cs.CL

TL;DR: RAT is a hybrid architecture that combines recurrence within chunks and attention across chunks, achieving 7x faster training and 9x faster generation while maintaining performance comparable to standard attention.

DetailsMotivation: Transformers face computational bottlenecks due to softmax attention, while recurrent models suffer from memory degradation in long contexts. There's a need for an efficient architecture that bridges RNN efficiency with attention capacity.

Method: Partitions input into chunks, applies recurrence within chunks for local dependencies, and softmax attention across chunks for long-range interactions. Also proposes hybrid architecture interleaving RAT with local attention.

Result: 7x training speed improvement with 100K token sequences, 9x generation speed at 4K position. Maintains similar performance to standard attention in 1.3B parameter models across various benchmarks.

Conclusion: RAT successfully bridges efficiency of RNNs and capacity of attention, mitigating memory degradation while enabling direct access to distant tokens and maintaining computational efficiency.

Abstract: Transformers have become the cornerstone of modern large-scale language models, but their reliance on softmax attention poses a computational bottleneck at both training and inference. Recurrent models offer high efficiency, but compressing the full sequence into a fixed-size and holistic representation suffers from memory degradation in long contexts and limits fine-grained retrieval. To address this, we propose RAT, an intermediate design that bridges the efficiency of RNNs and capacity of attention. RAT partitions the input into chunks, applies recurrence within each chunk for local dependencies, and softmax-based attention across chunks for long-range interactions. This design mitigates memory degradation and enables direct access to distant tokens, while retaining computational efficiency. Empirically, with a chunk size of 16, the RAT block achieves a 7x improvement in training speed with 100K token sequences and 9x in generation at the 4K position, while maintaining similar performance compared to standard attention. We demonstrate this by training 1.3B parameter models from scratch and performing large-scale evaluations, including short- and long-context benchmarks, as well as supervised fine-tuning (SFT). We further propose a hybrid architecture that interleaves RAT with local attention. By combining efficient long-range modeling with strong local interactions, this hybrid design not only improves inference speed and reduces cache memory usage, but also consistently enhances performance and shows the overall best results. Code is available at https://github.com/CLAIRE-Labo/RAT.
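
A minimal sketch of the chunked recurrence-plus-attention pattern, assuming a GRU for the within-chunk recurrence and one summary vector per chunk for the cross-chunk attention; the dimensions, module choices, and fusion rule are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn

# Sketch: recurrence inside fixed-size chunks, softmax attention across
# chunk summaries. Shapes and fusion are illustrative only.

class ChunkRecurrentAttention(nn.Module):
    def __init__(self, dim: int = 64, chunk: int = 16, heads: int = 4):
        super().__init__()
        self.chunk = chunk
        self.rnn = nn.GRU(dim, dim, batch_first=True)        # within-chunk
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, n, d = x.shape                       # n must be divisible by chunk
        c = n // self.chunk
        chunks = x.reshape(b * c, self.chunk, d)
        local, _ = self.rnn(chunks)             # local dependencies
        # One summary state per chunk attends over all chunk summaries,
        # so the softmax attention runs over n/chunk items, not n.
        summary = local[:, -1, :].reshape(b, c, d)
        global_ctx, _ = self.attn(summary, summary, summary)
        return local.reshape(b, n, d) + global_ctx.repeat_interleave(self.chunk, dim=1)

x = torch.randn(2, 128, 64)                     # batch of 128-token sequences
print(ChunkRecurrentAttention()(x).shape)       # torch.Size([2, 128, 64])
```

The within-chunk work grows linearly with sequence length, and attention only runs over n/chunk summaries, which is the intuition behind the reported speedups.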

[55] Advancing Dialectal Arabic to Modern Standard Arabic Machine Translation

Abdullah Alabdullah, Lifeng Han, Chenghua Lin

Main category: cs.CL

TL;DR: This paper addresses the challenge of Dialectal Arabic (DA) to Modern Standard Arabic (MSA) translation by evaluating prompting techniques and developing efficient fine-tuning methods, achieving strong results with limited resources.

DetailsMotivation: Dialectal Arabic poses significant challenges for NLP as everyday communication occurs in dialects that diverge from Modern Standard Arabic, creating a linguistic divide that impedes Arabic machine translation progress.

Method: The paper presents two approaches: (i) comprehensive evaluation of training-free prompting techniques across six LLMs, including zero-shot, few-shot, chain-of-thought, and a proposed Ara-TEaR three-stage self-refinement method; (ii) development of a resource-efficient fine-tuning pipeline using quantized models and joint multi-dialect training.

Result: Few-shot prompting consistently outperformed other methods, with GPT-4o achieving highest performance. Fine-tuned quantized Gemma2-9B model achieved chrF++ score of 49.88, outperforming zero-shot GPT-4o (44.58). Joint multi-dialect training improved performance by over 10%, and 4-bit quantization reduced memory usage by 60% with minimal performance loss.

Conclusion: The research provides a practical blueprint for improving dialectal inclusion in Arabic NLP, demonstrating that high-quality DA-MSA machine translation is achievable with limited resources, paving the way for more inclusive language technologies.

Abstract: Dialectal Arabic (DA) poses a persistent challenge for natural language processing (NLP), as most everyday communication in the Arab world occurs in dialects that diverge significantly from Modern Standard Arabic (MSA). This linguistic divide impedes progress in Arabic machine translation. This paper presents two core contributions to advancing DA-MSA translation for the Levantine, Egyptian, and Gulf dialects, particularly in low-resource and computationally constrained settings: (i) a comprehensive evaluation of training-free prompting techniques, and (ii) the development of a resource-efficient fine-tuning pipeline. Our evaluation of prompting strategies across six large language models (LLMs) found that few-shot prompting consistently outperformed zero-shot, chain-of-thought, and our proposed Ara-TEaR method. Ara-TEaR is designed as a three-stage self-refinement prompting process, targeting frequent meaning-transfer and adaptation errors in DA-MSA translation. In this evaluation, GPT-4o achieved the highest performance across all prompting settings. For fine-tuning LLMs, a quantized Gemma2-9B model achieved a chrF++ score of 49.88, outperforming zero-shot GPT-4o (44.58). Joint multi-dialect trained models outperformed single-dialect counterparts by over 10% chrF++, and 4-bit quantization reduced memory usage by 60% with less than 1% performance loss. The results and insights of our experiments offer a practical blueprint for improving dialectal inclusion in Arabic NLP, showing that high-quality DA-MSA machine translation is achievable even with limited resources and paving the way for more inclusive language technologies.
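
As a small illustration of the few-shot setup that performed best, the snippet below assembles a DA-to-MSA prompt from in-context example pairs; the example translations are invented placeholders rather than items from the paper's data.

```python
# Minimal few-shot prompt construction for DA -> MSA translation.
# The example pairs below are illustrative placeholders.

examples = [
    ("DA: شلونك؟", "MSA: كيف حالك؟"),          # Gulf: "how are you?"
    ("DA: وين رايح؟", "MSA: إلى أين أنت ذاهب؟"),  # "where are you going?"
]

def build_prompt(source: str) -> str:
    shots = "\n\n".join(f"{da}\n{msa}" for da, msa in examples)
    return (
        "Translate Dialectal Arabic (DA) into Modern Standard Arabic (MSA).\n\n"
        f"{shots}\n\nDA: {source}\nMSA:"
    )

print(build_prompt("شو بدك؟"))   # Levantine: "what do you want?"
```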

[56] MoNaCo: More Natural and Complex Questions for Reasoning Across Dozens of Documents

Tomer Wolfson, Harsh Trivedi, Mor Geva, Yoav Goldberg, Dan Roth, Tushar Khot, Ashish Sabharwal, Reut Tsarfaty

Main category: cs.CL

TL;DR: MoNaCo is a new benchmark with 1,315 natural, time-consuming questions requiring dozens to hundreds of intermediate steps, showing frontier LLMs achieve only 61.2% F1 due to low recall and hallucinations.

DetailsMotivation: Existing LLM agent evaluation benchmarks lack natural questions that are both information-seeking and genuinely time-consuming for humans, creating a gap in assessing real-world complexity.

Method: Developed a decomposed annotation pipeline to elicit and manually answer real-world time-consuming questions at scale, creating 1,315 challenging questions.

Result: Frontier LLMs achieved at most 61.2% F1 score on MoNaCo, hampered by low recall and hallucinations, demonstrating limitations in handling complex real-world information-seeking tasks.

Conclusion: MoNaCo effectively reveals LLM limitations in handling complex, multi-step information retrieval and provides a valuable resource for tracking progress in LLM-powered agent capabilities.

Abstract: Automated agents, powered by Large language models (LLMs), are emerging as the go-to tool for querying information. However, evaluation benchmarks for LLM agents rarely feature natural questions that are both information-seeking and genuinely time-consuming for humans. To address this gap, we introduce MoNaCo, a benchmark of 1,315 natural and time-consuming questions that require dozens, and at times hundreds, of intermediate steps to solve – far more than any existing QA benchmark. To build MoNaCo, we developed a decomposed annotation pipeline to elicit and manually answer real-world time-consuming questions at scale. Frontier LLMs evaluated on MoNaCo achieve at most 61.2% F1, hampered by low recall and hallucinations. Our results underscore the limitations of LLM-powered agents in handling the complexity and sheer breadth of real-world information-seeking tasks – with MoNaCo providing an effective resource for tracking such progress. The MoNaCo benchmark, codebase, prompts, and model predictions are all publicly available at: https://tomerwolgithub.github.io/monaco

[57] GUARD: Glocal Uncertainty-Aware Robust Decoding for Effective and Efficient Open-Ended Text Generation

Yuanhao Ding, Esteban Garces Arias, Meimingwei Li, Julian Rodemann, Matthias Aßenmacher, Danlu Chen, Gaojuan Fan, Christian Heumann, Chongsheng Zhang

Main category: cs.CL

TL;DR: GUARD is a self-adaptive decoding method that balances text diversity and coherence using global and local uncertainty signals, achieving faster generation with theoretical guarantees.

DetailsMotivation: Address the trade-off between coherence and diversity in open-ended text generation, overcoming limitations of contrastive search methods like hyperparameter dependence and high computational costs.

Method: GUARD uses a ‘Glocal’ uncertainty-driven framework combining global entropy estimates with local entropy deviations, plus a token-count-based penalty to reduce computational overhead.

Result: Achieves good balance between text diversity and coherence, shows substantial improvements in generation speed, and demonstrates remarkable performance in both human and LLM evaluations.

Conclusion: GUARD provides an effective solution for open-ended text generation with theoretical guarantees, practical efficiency, and validated performance across multiple evaluation dimensions.

Abstract: Open-ended text generation faces a critical challenge: balancing coherence with diversity in LLM outputs. While contrastive search-based decoding strategies have emerged to address this trade-off, their practical utility is often limited by hyperparameter dependence and high computational costs. We introduce GUARD, a self-adaptive decoding method that effectively balances these competing objectives through a novel “Glocal” uncertainty-driven framework. GUARD combines global entropy estimates with local entropy deviations to integrate both long-term and short-term uncertainty signals. We demonstrate that our proposed global entropy formulation effectively mitigates abrupt variations in uncertainty, such as sudden overconfidence or high entropy spikes, and provides theoretical guarantees of unbiasedness and consistency. To reduce computational overhead, we incorporate a simple yet effective token-count-based penalty into GUARD. Experimental results demonstrate that GUARD achieves a good balance between text diversity and coherence, while exhibiting substantial improvements in generation speed. In a more nuanced comparison study across different dimensions of text quality, both human and LLM evaluators validated its remarkable performance. Our code is available at https://github.com/YecanLee/GUARD.
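
A minimal sketch of one way to compute a "glocal" uncertainty signal: an exponential moving average serves as the global entropy estimate, and the current step's deviation from it flags sudden overconfidence or entropy spikes. The momentum value and the EMA choice are assumptions, not the paper's exact formulation.

```python
import numpy as np

# Sketch of a glocal uncertainty tracker: running global entropy plus the
# local deviation of the current step's entropy from it.

def entropy(p: np.ndarray) -> float:
    p = p[p > 0]
    return float(-(p * np.log(p)).sum())

class GlocalUncertainty:
    def __init__(self, momentum: float = 0.9):
        self.momentum = momentum
        self.global_h = None                  # running entropy estimate

    def step(self, probs: np.ndarray) -> tuple[float, float]:
        h = entropy(probs)
        if self.global_h is None:
            self.global_h = h
        else:                                 # exponential moving average
            self.global_h = self.momentum * self.global_h + (1 - self.momentum) * h
        local_dev = h - self.global_h         # spike / overconfidence signal
        return self.global_h, local_dev

tracker = GlocalUncertainty()
for p in [np.array([0.9, 0.1]), np.array([0.5, 0.5]), np.array([0.99, 0.01])]:
    print(tracker.step(p))
```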

[58] Probe-Rewrite-Evaluate: A Workflow for Reliable Benchmarks and Quantifying Evaluation Awareness

Lang Xiong, Nishant Bhargava, Wesley Chang, Jianhang Hong, Haihao Liu, Kevin Zhu

Main category: cs.CL

TL;DR: LLMs show behavioral changes between real-world deployment and controlled evaluation settings. This study quantifies evaluation awareness using prompt rewriting to shift contexts, finding models are more honest and safe in deploy-like contexts.

DetailsMotivation: Address the discrepancy between benchmark performance and real-world model behavior, as evaluation awareness poses critical challenges for AI alignment and safety assessment.

Method: Used linear probe to score prompts on test-like to deploy-like scale, employed LLM rewriting strategy to shift prompts to more natural deployment contexts while preserving original tasks.

Result: 30% average probe score increase after rewriting. Across models: 5.26% increase in honest responses, 12.40% decrease in deceptive responses, 6.38% increase in refusal rates indicating improved safety compliance.

Conclusion: Evaluation awareness is quantifiable and manipulable, directly influencing LLM behavior. Models are more prone to unsafe/deceptive outputs in test environments, highlighting need for more realistic evaluation frameworks.

Abstract: Large Language Models (LLMs) often exhibit significant behavioral shifts when they perceive a change from a real-world deployment context to a controlled evaluation setting, a phenomenon known as “evaluation awareness.” This discrepancy poses a critical challenge for AI alignment, as benchmark performance may not accurately reflect a model’s true safety and honesty. In this work, we systematically quantify these behavioral changes by manipulating the perceived context of prompts. We introduce a methodology that uses a linear probe to score prompts on a continuous scale from “test-like” to “deploy-like” and leverage an LLM rewriting strategy to shift these prompts towards a more natural, deployment-style context while preserving the original task. Using this method, we achieved a 30% increase in the average probe score across a strategic role-playing dataset after rewriting. Evaluating a suite of state-of-the-art models on these original and rewritten prompts, we find that rewritten “deploy-like” prompts induce a significant and consistent shift in behavior. Across all models, we observed an average increase in honest responses of 5.26% and a corresponding average decrease in deceptive responses of 12.40%. Furthermore, refusal rates increased by an average of 6.38%, indicating heightened safety compliance. Our findings demonstrate that evaluation awareness is a quantifiable and manipulable factor that directly influences LLM behavior, revealing that models are more prone to unsafe or deceptive outputs in perceived test environments. This underscores the urgent need for more realistic evaluation frameworks to accurately gauge true model alignment before deployment.
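
To illustrate the linear-probe scoring step, the sketch below fits a logistic probe on hidden-state vectors labeled test-like versus deploy-like and reads off a continuous deploy-likeness score; the random 16-dimensional features stand in for real model activations.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Sketch of a linear probe for "evaluation awareness": train on
# representations labeled test-like (0) vs deploy-like (1), then use the
# predicted probability as a continuous deploy-likeness score.

rng = np.random.default_rng(0)
test_like = rng.normal(loc=-1.0, size=(100, 16))    # fake activations
deploy_like = rng.normal(loc=+1.0, size=(100, 16))

X = np.vstack([test_like, deploy_like])
y = np.array([0] * 100 + [1] * 100)

probe = LogisticRegression(max_iter=1000).fit(X, y)

new_prompt_repr = rng.normal(loc=0.5, size=(1, 16))
print("deploy-likeness score:", probe.predict_proba(new_prompt_repr)[0, 1])
```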

[59] Avoidance Decoding for Diverse Multi-Branch Story Generation

Kyeongman Park, Nakyeong Yang, Kyomin Jung

Main category: cs.CL

TL;DR: Avoidance Decoding strategy penalizes similarity to previous outputs to increase diversity in LLM-generated stories, achieving 2.6x higher diversity and 30% less repetition.

DetailsMotivation: LLMs often produce repetitive and monotonous outputs, especially in story generation tasks, due to limited creative diversity when given the same input prompt.

Method: A novel decoding strategy that modifies token logits by penalizing similarity to previously generated outputs. Uses adaptive balancing of two similarity measures: Concept-level Similarity Penalty (prioritized early) and Narrative-level Similarity Penalty (emphasized later).

Result: Achieves up to 2.6 times higher output diversity and reduces repetition by an average of 30% compared to strong baselines, while effectively mitigating text degeneration. Activates a broader range of neurons, demonstrating enhanced intrinsic creativity.

Conclusion: Avoidance Decoding successfully enhances creative diversity in LLM story generation by adaptively penalizing similarity at both concept and narrative levels, resulting in more diverse and less repetitive outputs while leveraging the model’s inherent creative capabilities.

Abstract: Large Language Models (LLMs) often generate repetitive and monotonous outputs, especially in tasks like story generation, due to limited creative diversity when given the same input prompt. To address this challenge, we propose a novel decoding strategy, Avoidance Decoding, that modifies token logits by penalizing similarity to previously generated outputs, thereby encouraging more diverse multi-branch stories. This penalty adaptively balances two similarity measures: (1) Concept-level Similarity Penalty, which is prioritized in early stages to diversify initial story concepts, and (2) Narrative-level Similarity Penalty, which is increasingly emphasized later to ensure natural yet diverse plot development. Notably, our method achieves up to 2.6 times higher output diversity and reduces repetition by an average of 30% compared to strong baselines, while effectively mitigating text degeneration. Furthermore, we reveal that our method activates a broader range of neurons, demonstrating that it leverages the model’s intrinsic creativity.
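
A minimal sketch of similarity-penalized decoding: candidate token logits are reduced in proportion to each token's maximum cosine similarity to tokens from earlier generations. A single penalty term here stands in for the paper's separate concept-level and narrative-level penalties and their adaptive balancing over the course of generation.

```python
import torch
import torch.nn.functional as F

# Sketch: down-weight tokens whose embeddings resemble tokens already used
# in previous outputs. One penalty stands in for the paper's two terms.

def penalized_logits(logits: torch.Tensor,
                     token_emb: torch.Tensor,      # (vocab, dim)
                     prev_ids: list[int],
                     lam: float = 2.0) -> torch.Tensor:
    if not prev_ids:
        return logits
    prev = F.normalize(token_emb[prev_ids], dim=-1)       # (k, dim)
    vocab = F.normalize(token_emb, dim=-1)                # (V, dim)
    sim = (vocab @ prev.T).max(dim=-1).values             # max cos-sim per token
    return logits - lam * sim.clamp(min=0)

vocab_size, dim = 10, 8
emb = torch.randn(vocab_size, dim)
logits = torch.zeros(vocab_size)
print(penalized_logits(logits, emb, prev_ids=[3, 7]).argmax())
```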

[60] MoSEs: Uncertainty-Aware AI-Generated Text Detection via Mixture of Stylistics Experts with Conditional Thresholds

Junxi Wu, Jinpeng Wang, Zheng Liu, Bin Chen, Dongjian Hu, Hao Wu, Shu-Tao Xia

Main category: cs.CL

TL;DR: MoSEs framework improves AI-generated text detection by 11.34% on average and 39.15% in low-resource scenarios through stylistic modeling and dynamic threshold estimation.

DetailsMotivation: Address public concerns about AI misuse by building trustworthy detection systems, overcoming limitations of existing methods that neglect stylistic modeling and rely on static thresholds.

Method: Mixture of Stylistic Experts (MoSEs) with three components: Stylistics Reference Repository (SRR) for reference data activation, Stylistics-Aware Router (SAR), and Conditional Threshold Estimator (CTE) that jointly models linguistic statistics and semantic features for dynamic threshold determination.

Result: Achieves 11.34% average improvement in detection performance compared to baselines, with 39.15% improvement in low-resource cases.

Conclusion: MoSEs framework effectively addresses stylistic modeling limitations in AI-generated text detection, providing significant performance gains especially in challenging low-resource scenarios.

Abstract: The rapid advancement of large language models has intensified public concerns about potential misuse. Therefore, it is important to build trustworthy AI-generated text detection systems. Existing methods neglect stylistic modeling and mostly rely on static thresholds, which greatly limits detection performance. In this paper, we propose the Mixture of Stylistic Experts (MoSEs) framework that enables stylistics-aware uncertainty quantification through conditional threshold estimation. MoSEs contains three core components, namely, the Stylistics Reference Repository (SRR), the Stylistics-Aware Router (SAR), and the Conditional Threshold Estimator (CTE). For an input text, the SAR activates the appropriate reference data in the SRR and provides them to the CTE. Subsequently, the CTE jointly models the linguistic statistical properties and semantic features to dynamically determine the optimal threshold. With a discrimination score, MoSEs yields prediction labels with the corresponding confidence level. Our framework achieves an average improvement of 11.34% in detection performance compared to baselines. More inspiringly, MoSEs shows a more pronounced improvement of 39.15% in the low-resource case. Our code is available at https://github.com/creator-xi/MoSEs.
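
For intuition about conditional thresholding, the sketch below replaces a static cutoff with a regressor that predicts a per-text threshold from simple stylistic features; the features, training data, and linear model are invented stand-ins for the paper's CTE.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Sketch of a conditional threshold estimator: predict the detection
# threshold from per-text stylistic statistics instead of fixing it.
# Toy data and features; illustrative only.

rng = np.random.default_rng(1)
style_feats = rng.uniform(size=(200, 2))           # e.g. sentence length, TTR
optimal_thresh = 0.4 + 0.3 * style_feats[:, 0]     # pretend ground truth

cte = LinearRegression().fit(style_feats, optimal_thresh)

def detect(score: float, feats: np.ndarray) -> bool:
    threshold = cte.predict(feats.reshape(1, -1))[0]
    return score > threshold                        # True -> AI-generated

print(detect(0.55, np.array([0.2, 0.9])))
```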

cs.CV

[61] 2nd Place Solution for CVPR2024 E2E Challenge: End-to-End Autonomous Driving Using Vision Language Model

Zilong Guo, Yi Luo, Long Sha, Dongxu Wang, Panqu Wang, Chenyang Xu, Yi Yang

Main category: cs.CV

TL;DR: This paper demonstrates that combining end-to-end architectural design with multi-modality Vision Language Models (VLMs) achieves impressive performance in autonomous driving tasks using only a single camera, making it the best camera-only solution.

DetailsMotivation: To explore whether powerful large language models, particularly multi-modality Vision Language Models, could benefit end-to-end autonomous driving tasks, moving beyond traditional modular deep neural network approaches.

Method: Combines end-to-end architectural design with knowledgeable Vision Language Models (VLMs) and uses only a single camera input for the driving system.

Result: Achieves impressive performance on driving tasks and becomes the best camera-only solution across the leaderboard, demonstrating superior effectiveness.

Conclusion: The approach shows the effectiveness of vision-based driving and reveals significant potential for end-to-end autonomous driving tasks using VLMs with minimal sensor requirements.

Abstract: End-to-end autonomous driving has drawn tremendous attention recently. Many works focus on using modular deep neural networks to construct the end-to-end architecture. However, whether powerful large language models (LLMs), especially multi-modality Vision Language Models (VLMs), could benefit end-to-end driving tasks remains an open question. In our work, we demonstrate that combining end-to-end architectural design and knowledgeable VLMs yields impressive performance on the driving tasks. It is worth noting that our method only uses a single camera and is the best camera-only solution across the leaderboard, demonstrating the effectiveness of the vision-based driving approach and the potential for end-to-end driving tasks.

[62] PixFoundation 2.0: Do Video Multi-Modal LLMs Use Motion in Visual Grounding?

Mennatullah Siam

Main category: cs.CV

TL;DR: This paper introduces MoCentric-Bench, a motion-centric benchmark to evaluate video MLLMs’ ability to perform pixel-level visual grounding based on motion patterns, addressing limitations in current benchmarks where static frames often suffice.

DetailsMotivation: Current video MLLM benchmarks don't adequately test motion understanding for visual grounding tasks, as single frames often contain enough information without requiring temporal reasoning. The authors want to determine if video MLLMs can truly segment objects based on motion patterns described in natural language.

Method: The authors introduce four motion-centric probing techniques designed for visual grounding tasks to study video MLLMs’ ability to identify true motion and understand motion order. They create MoCentric-Bench, a benchmark that ensures evaluation focuses on motion-language interaction rather than static appearance cues.

Result: The paper establishes strong single-image baselines that perform on par with or outperform prior methods. They also develop simple motion-centric adaptation techniques that achieve state-of-the-art performance on their MoCentric-Bench benchmark.

Conclusion: The motion-centric benchmark, evaluation methods, and findings challenge future models to improve dense spatiotemporal grounding and pixel-level understanding in videos, moving beyond static appearance-based approaches.

Abstract: Multi-modal large language models (MLLMs) have shown impressive generalization across tasks using images and text modalities. While their extension to video has enabled tasks such as video question answering and video captioning, their pixel-level visual grounding abilities are less studied. In this work, we raise the pertinent question of whether motion is used in pixel-level visual grounding and whether video MLLMs can segment objects based on natural language expressions describing their motion patterns. We identify the shortcomings in the current benchmarks, where we show that a single frame can often suffice for capturing the motion referring expression without any temporal reasoning. To address this, we introduce four motion-centric probing techniques, particularly designed for the visual grounding task, to study video MLLMs’ ability to identify true motion from a fake one and their ability to grasp the motion order. Consequently, we provide a motion-centric benchmark, MoCentric-Bench. It ensures that video MLLMs are evaluated towards leveraging the interaction between motion and language rather than being dominated by static appearance cues emphasized in existing visual grounding datasets. We further establish strong single-image baselines that are on par with or outperform prior methods. Finally, we explore simple motion-centric adaptation techniques that provide state-of-the-art performance on our MoCentric-Bench. Our motion-centric benchmark, evaluation and findings challenge future models to improve dense spatiotemporal grounding and pixel-level understanding within videos. Code and datasets will be made publicly available at https://github.com/MSiam/PixFoundation-2.0.git.

[63] VQualA 2025 Challenge on Engagement Prediction for Short Videos: Methods and Results

Dasong Li, Sizhuo Ma, Hang Hua, Wenjie Li, Jian Wang, Chris Wei Zhou, Fengbin Guan, Xin Li, Zihao Yu, Yiting Lu, Ru-Ling Liao, Yan Ye, Zhibo Chen, Wei Sun, Linhan Cao, Yuqin Cao, Weixia Zhang, Wen Wen, Kaiwei Zhang, Zijian Chen, Fangfang Lu, Xiongkuo Min, Guangtao Zhai, Erjia Xiao, Lingfeng Zhang, Zhenjie Su, Hao Cheng, Yu Liu, Renjing Xu, Long Chen, Xiaoshuai Hao, Zhenpeng Zeng, Jianqin Wu, Xuxu Wang, Qian Yu, Bo Hu, Weiwei Wang, Pinxin Liu, Yunlong Tang, Luchuan Song, Jinxi He, Jiaru Wu, Hanjia Lyu

Main category: cs.CV

TL;DR: VQualA 2025 Challenge overview on engagement prediction for short videos using multi-modal features and real-world UGC dataset from ICCV 2025.

DetailsMotivation: To understand and model popularity of user-generated short videos on social media platforms by capturing complex factors influencing user engagement.

Method: Used new short-form UGC dataset with engagement metrics from real user interactions, exploring multi-modal features including visual content, audio, and creator metadata.

Result: Attracted 97 participants with 15 valid test submissions, significantly advancing progress in short-form UGC video engagement prediction.

Conclusion: The challenge successfully promoted robust modeling strategies for engagement prediction in short videos, contributing to the field through multi-modal feature exploration and real-world dataset utilization.

Abstract: This paper presents an overview of the VQualA 2025 Challenge on Engagement Prediction for Short Videos, held in conjunction with ICCV 2025. The challenge focuses on understanding and modeling the popularity of user-generated content (UGC) short videos on social media platforms. To support this goal, the challenge uses a new short-form UGC dataset featuring engagement metrics derived from real-world user interactions. The objective of the challenge is to promote robust modeling strategies that capture the complex factors influencing user engagement. Participants explored a variety of multi-modal features, including visual content, audio, and metadata provided by creators. The challenge attracted 97 participants and received 15 valid test submissions, contributing significantly to progress in short-form UGC video engagement prediction.

[64] Multi-Scale Deep Learning for Colon Histopathology: A Hybrid Graph-Transformer Approach

Sadra Saremi, Amirhossein Ahmadkhan Kordbacheh

Main category: cs.CV

TL;DR: Hybrid multi-scale deep learning architecture combining capsule networks, graph attention, transformers, and residual learning for improved colon cancer classification on LC25000 dataset.

DetailsMotivation: Early detection of colon cancer is crucial to prevent deterioration, requiring advanced classification methods for histopathological images.

Method: HG-TNet model with transformer branch for global context (convolution-based patch embedding + transformer encoder) and CNN branch for local details, combined with capsule networks for spatial order preservation and self-supervised rotation prediction.

Result: Better performance in accuracy, loss function, and overall algorithm effectiveness compared to standard architectures.

Conclusion: The hybrid architecture successfully captures multi-scale features and produces robust diagnostic representations that outperform conventional approaches in colon cancer classification.

Abstract: Colon cancer, also known as colorectal cancer, is one of the most malignant types of cancer worldwide. Early-stage detection of colon cancer is crucial to prevent its deterioration. This research presents a hybrid multi-scale deep learning architecture that synergizes capsule networks, graph attention mechanisms, transformer modules, and residual learning to advance colon cancer classification on the Lung and Colon Cancer Histopathological Image Dataset (LC25000). The proposed HG-TNet model introduces a hybrid architecture that combines the strengths of transformers and convolutional neural networks to capture multi-scale features in histopathological images. Specifically, a transformer branch extracts global contextual relationships by partitioning the image into patches via convolution-based patch embedding and processing these patches through a transformer encoder. In parallel, a dedicated CNN branch captures fine-grained local details through successive convolutional layers. Incorporating these diverse features, combined with a self-supervised rotation prediction objective, produces a robust diagnostic representation that surpasses standard architectures in performance. Results show improvements not only in accuracy and loss but also in robustness, as capsule networks preserve spatial relationships and capture how individual elements combine to form whole structures.
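
A minimal sketch of the dual-branch pattern the abstract describes: a convolutional patch embedding feeding a transformer encoder for global context, alongside a small CNN for local detail, fused for classification. All sizes and module choices are illustrative assumptions, not the paper's configuration.

```python
import torch
import torch.nn as nn

# Sketch of a two-branch hybrid: conv patch embedding -> transformer
# (global context) in parallel with a CNN (local detail), fused at the head.

class HybridBranches(nn.Module):
    def __init__(self, num_classes: int = 5, dim: int = 64):
        super().__init__()
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=16, stride=16)
        enc_layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4,
                                               batch_first=True)
        self.transformer = nn.TransformerEncoder(enc_layer, num_layers=2)
        self.cnn = nn.Sequential(
            nn.Conv2d(3, dim, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(dim, dim, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.head = nn.Linear(2 * dim, num_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        patches = self.patch_embed(x).flatten(2).transpose(1, 2)  # (B,N,dim)
        global_feat = self.transformer(patches).mean(dim=1)       # (B,dim)
        local_feat = self.cnn(x).flatten(1)                       # (B,dim)
        return self.head(torch.cat([global_feat, local_feat], dim=-1))

print(HybridBranches()(torch.randn(2, 3, 224, 224)).shape)  # torch.Size([2, 5])
```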

[65] PRECISE-AS: Personalized Reinforcement Learning for Efficient Point-of-Care Echocardiography in Aortic Stenosis Diagnosis

Armin Saadat, Nima Hashemi, Hooman Vaseli, Michael Y. Tsang, Christina Luong, Michiel Van de Panne, Teresa S. M. Tsang, Purang Abolmaesumi

Main category: cs.CV

TL;DR: RL-based active video acquisition framework for aortic stenosis diagnosis that dynamically selects the most informative echo videos, achieving 80.6% accuracy using only 47% of videos compared to full acquisition.

DetailsMotivation: Limited access to echocardiography in rural/underserved areas and operator expertise challenges with point-of-care ultrasound for aortic stenosis diagnosis.

Method: Reinforcement learning-driven framework that continuously evaluates whether additional imaging is needed and dynamically selects the most informative echo videos for each patient.

Result: Tested on 2,572 patients, achieved 80.6% classification accuracy while using only 47% of echo videos compared to full acquisition.

Conclusion: Active feature acquisition can enhance AS diagnosis, making echocardiographic assessments more efficient, scalable, and personalized.

Abstract: Aortic stenosis (AS) is a life-threatening condition caused by a narrowing of the aortic valve, leading to impaired blood flow. Despite its high prevalence, access to echocardiography (echo), the gold-standard diagnostic tool, is often limited due to resource constraints, particularly in rural and underserved areas. Point-of-care ultrasound (POCUS) offers a more accessible alternative but is restricted by operator expertise and the challenge of selecting the most relevant imaging views. To address this, we propose a reinforcement learning (RL)-driven active video acquisition framework that dynamically selects each patient’s most informative echo videos. Unlike traditional methods that rely on a fixed set of videos, our approach continuously evaluates whether additional imaging is needed, optimizing both accuracy and efficiency. Tested on data from 2,572 patients, our method achieves 80.6% classification accuracy while using only 47% of the echo videos compared to a full acquisition. These results demonstrate the potential of active feature acquisition to enhance AS diagnosis, making echocardiographic assessments more efficient, scalable, and personalized. Our source code is available at: https://github.com/Armin-Saadat/PRECISE-AS.

[66] LiGuard: A Streamlined Open-Source Framework for Rapid & Interactive Lidar Research

Muhammad Shahbaz, Shaurya Agarwal

Main category: cs.CV

TL;DR: LiGuard is an open-source framework that simplifies lidar-based research by providing built-in data handling, processing algorithms, and visualization tools to reduce code duplication and enable rapid project development.

DetailsMotivation: To address the duplication of efforts in lidar research where researchers develop niche-specific code for common tasks like data I/O, processing, and algorithms, which leads to inefficiency and major code revisions when research focus changes.

Method: Developed LiGuard framework with built-in support for data input/output, pre/post processing, commonly used algorithms, interactive algorithm management, and visualization capabilities for classification, detection, segmentation, and tracking tasks.

Result: The framework creates structured code files that facilitate easy sharing and reuse of projects/components, demonstrated effective through case studies.

Conclusion: LiGuard successfully addresses the challenges of code duplication and revision difficulties in lidar research by providing a flexible, reusable framework that accelerates development and promotes collaboration.

Abstract: There is a growing interest in the development of lidar-based autonomous mobility and Intelligent Transportation Systems (ITS). To work with and conduct research on lidar data, researchers often develop code specific to their application niche. This approach leads to duplication of efforts across studies that, in many cases, share multiple methodological steps such as data input/output (I/O), pre/post processing, and common algorithms in multi-stage solutions. Moreover, slight changes in data, algorithms, and/or research focus may force major revisions in the code. To address these challenges, we present LiGuard, an open-source software framework that allows researchers to: 1) rapidly develop code for their lidar-based projects by providing built-in support for data I/O, pre/post processing, and commonly used algorithms, 2) interactively add/remove/reorder custom algorithms and adjust their parameters, and 3) visualize results for classification, detection, segmentation, and tracking tasks. Moreover, because it creates all the code files in structured directories, it allows easy sharing of entire projects or even the individual components to be reused by other researchers. The effectiveness of LiGuard is demonstrated via case studies.

[67] PercepTwin: Modeling High-Fidelity Digital Twins for Sim2Real LiDAR-based Perception for Intelligent Transportation Systems

Muhammad Shahbaz, Shaurya Agarwal

Main category: cs.CV

TL;DR: Paper introduces a methodology for creating high-quality synthetic LiDAR datasets using High-Fidelity Digital Twins to address the cost and scalability issues of real-world data collection for ITS perception systems.

DetailsMotivation: LiDAR-based perception in ITS requires large labeled datasets that are costly and time-consuming to create, hindering scalability. Sim2Real learning offers an alternative but depends on simulation fidelity to real-world conditions.

Method: Proposes a rigorous workflow using High-Fidelity Digital Twins to replicate real-world environments, including static geometry modeling, road infrastructure replication, and dynamic traffic scenario generation using open-source resources like satellite imagery and OpenStreetMap data.

Result: Enables creation of large-scale, high-quality synthetic datasets that facilitate scalable, cost-effective, and diverse data generation for robust Sim2Real learning.

Conclusion: The methodology provides practical guidance for constructing reliable synthetic environments that form a foundation for effective Sim2Real transfer in LiDAR-based perception systems for intelligent transportation.

Abstract: LiDAR-based perception in intelligent transportation systems (ITS), for tasks such as object detection, tracking, and semantic and instance segmentation, is predominantly solved by deep neural network models which often require large-scale labeled datasets during training to achieve generalization. However, creating these datasets is costly and time-consuming, requiring human labor before the datasets are ready for training models. This hinders the scalability of LiDAR-based perception systems in ITS. Sim2Real learning offers a scalable alternative; however, its effectiveness is dependent on the fidelity of the source simulation(s) to the real world, in terms of environment structure, actor dynamics, and sensor emulation. In response, this paper introduces a rigorous and reproducible methodology for creating large-scale, high-quality synthetic datasets using High-Fidelity Digital Twins (HiFi DTs). The proposed workflow outlines the steps, tools, and best practices for digitally replicating real-world environments, encompassing static geometry modeling, road infrastructure replication, and dynamic traffic scenario generation. Leveraging open-source and readily available resources such as satellite imagery and OpenStreetMap data, alongside specific sensor configurations, this paper provides practical, detailed guidance for constructing robust synthetic environments. These environments subsequently facilitate scalable, cost-effective, and diverse dataset generation, forming a reliable foundation for robust Sim2Real learning.

[68] High-Fidelity Digital Twins for Bridging the Sim2Real Gap in LiDAR-Based ITS Perception

Muhammad Shahbaz, Shaurya Agarwal

Main category: cs.CV

TL;DR: Proposes HiFi DT framework for Sim2Real transfer in LiDAR perception, achieving 4.8% better performance than real-data training through high-fidelity digital twins that reduce domain shift.

DetailsMotivation: Address the Sim2Real performance gap in LiDAR-based perception for Intelligent Transportation Systems, where models trained in simulation underperform on real-world data due to distributional shifts.

Method: Develops a high-fidelity digital twin framework incorporating real-world background geometry, lane-level road topology, and sensor-specific specifications. Uses systematic environment construction and evaluates domain alignment with metrics like Chamfer Distance, MMD, EMD, and Fréchet Distance.

Result: DT-trained model outperforms equivalent model trained on real data by 4.8%. HiFi DTs substantially reduce domain shift and improve generalization across diverse evaluation scenarios.

Conclusion: Digital twins play a significant role in enabling reliable, simulation-based LiDAR perception for real-world ITS applications by effectively bridging the Sim2Real gap.

Abstract: Sim2Real domain transfer offers a cost-effective and scalable approach for developing LiDAR-based perception (e.g., object detection, tracking, segmentation) in Intelligent Transportation Systems (ITS). However, perception models trained in simulation often underperform on real-world data due to distributional shifts. To address this Sim2Real gap, this paper proposes a high-fidelity digital twin (HiFi DT) framework that incorporates real-world background geometry, lane-level road topology, and sensor-specific specifications and placement. We formalize the domain adaptation challenge underlying Sim2Real learning and present a systematic method for constructing simulation environments that yield in-domain synthetic data. An off-the-shelf 3D object detector is trained on HiFi DT-generated synthetic data and evaluated on real data. Our experiments show that the DT-trained model outperforms the equivalent model trained on real data by 4.8%. To understand this gain, we quantify distributional alignment between synthetic and real data using multiple metrics, including Chamfer Distance (CD), Maximum Mean Discrepancy (MMD), Earth Mover’s Distance (EMD), and Fréchet Distance (FD), at both raw-input and latent-feature levels. Results demonstrate that HiFi DTs substantially reduce domain shift and improve generalization across diverse evaluation scenarios. These findings underscore the significant role of digital twins in enabling reliable, simulation-based LiDAR perception for real-world ITS applications.
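
To make the distribution-alignment metrics concrete, here is a minimal sketch (ours, not the paper's code) of two of the four measures, Chamfer Distance and an RBF-kernel MMD, as they might be applied to raw point clouds or latent feature vectors:

```python
import numpy as np

def chamfer_distance(p, q):
    """Symmetric Chamfer Distance between point clouds p (N,3) and q (M,3)."""
    d = np.linalg.norm(p[:, None, :] - q[None, :, :], axis=-1)  # (N, M) pairwise
    return d.min(axis=1).mean() + d.min(axis=0).mean()

def mmd_rbf(x, y, sigma=1.0):
    """Maximum Mean Discrepancy with an RBF kernel over features x (N,D), y (M,D)."""
    def k(a, b):
        d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
        return np.exp(-d2 / (2 * sigma ** 2))
    return k(x, x).mean() + k(y, y).mean() - 2 * k(x, y).mean()

# Sanity check: identical clouds give (near-)zero values for both metrics.
pc = np.random.default_rng(0).normal(size=(128, 3))
print(chamfer_distance(pc, pc), mmd_rbf(pc, pc))
```

Lower values on synthetic-versus-real pairs would indicate the reduced domain shift the paper reports.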

[69] Single Domain Generalization in Diabetic Retinopathy: A Neuro-Symbolic Learning Approach

Midhat Urooj, Ayan Banerjee, Farhat Shaikh, Kuntal Thakur, Sandeep Gupta

Main category: cs.CV

TL;DR: KG-DG is a neuro-symbolic framework for diabetic retinopathy classification that combines vision transformers with expert-guided symbolic reasoning, achieving significant improvements in cross-domain generalization across multiple datasets.

DetailsMotivation: Domain generalization is a critical challenge in medical imaging where models trained on single sources fail under real-world distribution shifts, requiring robust solutions that can handle unseen domains.

Method: Integrates vision transformers with expert-guided symbolic reasoning using clinical lesion ontologies, structured rule-based features, and retinal vessel segmentation. Uses confidence-weighted integration strategy and minimizes KL divergence between domain embeddings to enforce clinical semantic alignment.

Result: Achieves up to 5.2% accuracy gain in cross-domain settings and 6% improvement over baseline ViT models. Symbolic-only model achieves 63.67% average accuracy in MDG, with complete neuro-symbolic integration achieving highest accuracy in SDG scenarios. Lesion-based features achieve 84.65% accuracy.

Conclusion: Neuro-symbolic integration is a promising paradigm for building clinically robust and domain-invariant medical AI systems, with symbolic components acting as effective regularizers beyond just enhancing interpretability.

Abstract: Domain generalization remains a critical challenge in medical imaging, where models trained on single sources often fail under real-world distribution shifts. We propose KG-DG, a neuro-symbolic framework for diabetic retinopathy (DR) classification that integrates vision transformers with expert-guided symbolic reasoning to enable robust generalization across unseen domains. Our approach leverages clinical lesion ontologies through structured, rule-based features and retinal vessel segmentation, fusing them with deep visual representations via a confidence-weighted integration strategy. The framework addresses both single-domain generalization (SDG) and multi-domain generalization (MDG) by minimizing the KL divergence between domain embeddings, thereby enforcing alignment of high-level clinical semantics. Extensive experiments across four public datasets (APTOS, EyePACS, Messidor-1, Messidor-2) demonstrate significant improvements: up to a 5.2% accuracy gain in cross-domain settings and a 6% improvement over baseline ViT models. Notably, our symbolic-only model achieves a 63.67% average accuracy in MDG, while the complete neuro-symbolic integration achieves the highest accuracy compared to existing published baselines and benchmarks in challenging SDG scenarios. Ablation studies reveal that lesion-based features (84.65% accuracy) substantially outperform purely neural approaches, confirming that symbolic components act as effective regularizers beyond merely enhancing interpretability. Our findings establish neuro-symbolic integration as a promising paradigm for building clinically robust, and domain-invariant medical AI systems.
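
The KL-divergence alignment between domain embeddings can be illustrated with a short sketch; treating each embedding as a softened distribution via a temperature softmax is our assumption, not a detail stated in the summary:

```python
import torch
import torch.nn.functional as F

def domain_alignment_kl(src_emb, tgt_emb, tau=1.0):
    """KL(target || source) over softened embedding distributions.

    src_emb, tgt_emb: (batch, dim) domain embeddings. Minimizing this term
    pushes the two domains toward the same high-level clinical semantics.
    """
    log_p = F.log_softmax(src_emb / tau, dim=-1)
    q = F.softmax(tgt_emb / tau, dim=-1)
    return F.kl_div(log_p, q, reduction="batchmean")
```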

[70] A Data-Driven RetinaNet Model for Small Object Detection in Aerial Images

Zhicheng Tang, Jinwen Tang, Yi Shang

Main category: cs.CV

TL;DR: DDR-Net is a data-driven deep learning model based on RetinaNet that enhances small object detection in aerial imagery through automated feature map selection, anchor estimation, and innovative sampling techniques for limited data scenarios.

DetailsMotivation: Small object detection in aerial imaging is crucial for applications like environmental surveillance, urban design, and crisis management, but existing methods struggle with detecting diminutive objects efficiently.

Method: Leverages RetinaNet architecture with novel data-driven techniques for autonomous feature map selection and anchor estimation, plus an innovative sampling technique for limited data training scenarios.

Result: Empirical assessments show DDR-Net significantly outperforms RetinaNet and other contemporary models on aerial imagery datasets, reducing data collection and training costs while maintaining precision.

Conclusion: DDR-Net advances aerial image analysis technologies with wide-ranging impacts across agriculture, security, archaeology, wildlife monitoring, traffic optimization, and public safety applications.

Abstract: In the realm of aerial imaging, the ability to detect small objects is pivotal for a myriad of applications, encompassing environmental surveillance, urban design, and crisis management. Leveraging RetinaNet, this work unveils DDR-Net: a data-driven, deep-learning model devised to enhance the detection of diminutive objects. DDR-Net introduces novel, data-driven techniques to autonomously ascertain optimal feature maps and anchor estimations, cultivating a tailored and proficient training process while maintaining precision. Additionally, this paper presents an innovative sampling technique to bolster model efficacy under limited data training constraints. The model’s enhanced detection capabilities support critical applications including wildlife and habitat monitoring, traffic flow optimization, and public safety improvements through accurate identification of small objects like vehicles and pedestrians. DDR-Net significantly reduces the cost and time required for data collection and training, offering efficient performance even with limited data. Empirical assessments over assorted aerial avian imagery datasets demonstrate that DDR-Net markedly surpasses RetinaNet and alternative contemporary models. These innovations advance current aerial image analysis technologies and promise wide-ranging impacts across multiple sectors including agriculture, security, and archaeology.
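
The summary does not spell out DDR-Net's anchor-estimation procedure; a common data-driven proxy, clustering ground-truth box shapes and taking the centroids as anchors, can be sketched as follows:

```python
import numpy as np
from sklearn.cluster import KMeans

def estimate_anchors(gt_wh, n_anchors=9, seed=0):
    """Cluster ground-truth (width, height) pairs; centroids become anchors.
    This mirrors YOLO-style anchor estimation and is only a stand-in for the
    paper's own technique."""
    km = KMeans(n_clusters=n_anchors, n_init=10, random_state=seed).fit(gt_wh)
    anchors = km.cluster_centers_
    return anchors[np.argsort(anchors.prod(axis=1))]  # sorted by box area

boxes_wh = np.abs(np.random.default_rng(1).normal(24, 8, size=(500, 2)))
print(estimate_anchors(boxes_wh, n_anchors=5))
```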

[71] STAR: A Fast and Robust Rigid Registration Framework for Serial Histopathological Images

Zeyu Liu, Shengwei Ding

Main category: cs.CV

TL;DR: STAR is a fast, robust open-source framework for rigid registration of serial whole-slide histopathological images across different stains, offering hierarchical correlation strategy and quality control.

DetailsMotivation: Existing methods for WSI registration are computationally intensive and difficult to reproduce, while lightweight rigid frameworks suitable for consecutive-section scenarios remain underdeveloped.

Method: Integrates stain-conditioned preprocessing with hierarchical coarse-to-fine correlation strategy, adaptive kernel scaling, and built-in quality control for multi-WSI alignment.

Result: Achieves reliable rigid registration across heterogeneous tissue types and staining protocols within minutes per slide, demonstrating robustness to cross-stain variability and partial tissue overlap.

Conclusion: STAR provides a reproducible baseline that lowers barriers for clinical adoption and enables large-scale paired data preparation for computational pathology.

Abstract: Registration of serial whole-slide histopathological images (WSIs) is critical for enabling direct comparison across diverse stains and for preparing paired datasets in artificial intelligence (AI) workflows such as virtual staining and biomarker prediction. While existing methods often rely on complex deformable or deep learning approaches that are computationally intensive and difficult to reproduce, lightweight rigid frameworks-sufficient for many consecutive-section scenarios-remain underdeveloped. We introduce STAR (Serial Tissue Alignment for Rigid registration), a fast and robust open-source framework for multi-WSI alignment. STAR integrates stain-conditioned preprocessing with a hierarchical coarse-to-fine correlation strategy, adaptive kernel scaling, and built-in quality control, achieving reliable rigid registration across heterogeneous tissue types and staining protocols, including hematoxylin-eosin (H&E), special histochemical stains (e.g., PAS, PASM, Masson’s), and immunohistochemical (IHC) markers (e.g., CD31, KI67). Evaluated on the ANHIR 2019 and ACROBAT 2022 datasets spanning multiple organs and scanning conditions, STAR consistently produced stable alignments within minutes per slide, demonstrating robustness to cross-stain variability and partial tissue overlap. Beyond benchmarks, we present case studies on H&E-IHC alignment, construction of multi-IHC panels, and typical failure modes, underscoring both utility and limitations. Released as an open and lightweight tool, STAR provides a reproducible baseline that lowers the barrier for clinical adoption and enables large-scale paired data preparation for next-generation computational pathology.
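
As an illustration of a hierarchical coarse-to-fine rigid search (a sketch under our own assumptions, not STAR's released code), candidate rotations can be scanned on downsampled thumbnails with phase correlation, after which the translation is recovered at full resolution; both slides are assumed to share the same raster size:

```python
import numpy as np
from scipy.ndimage import rotate, zoom

def phase_correlation(fixed, moving):
    """Peak of the normalized cross-power spectrum locates the translation."""
    F1, F2 = np.fft.fft2(fixed), np.fft.fft2(moving)
    cps = F1 * np.conj(F2)
    corr = np.abs(np.fft.ifft2(cps / (np.abs(cps) + 1e-8)))
    dy, dx = np.unravel_index(np.argmax(corr), corr.shape)
    return dy, dx, corr.max()

def coarse_to_fine_rigid(fixed, moving, scale=0.25, angles=range(-10, 11, 2)):
    """Pick the rotation whose phase-correlation peak is strongest at coarse
    scale, then estimate the shift at full resolution."""
    small_f, small_m = zoom(fixed, scale), zoom(moving, scale)
    best_angle = max(angles, key=lambda a: phase_correlation(
        small_f, rotate(small_m, a, reshape=False))[2])
    dy, dx, _ = phase_correlation(fixed, rotate(moving, best_angle, reshape=False))
    return best_angle, (dy, dx)
```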

[72] Resilient Multimodal Industrial Surface Defect Detection with Uncertain Sensors Availability

Shuai Jiang, Yunfeng Ma, Jingyu Zhou, Yuan Bian, Yaonan Wang, Min Liu

Main category: cs.CV

TL;DR: A novel method for multimodal industrial surface defect detection that handles modality-missing problems through cross-modal prompt learning and symmetric contrastive learning, achieving state-of-the-art performance.

DetailsMotivation: Address modality-missing problems caused by uncertain sensor availability in industrial surface defect detection, which creates challenges in learning mode transformation and information vacancy when fusing RGB and 3D modalities.

Method: Proposes cross-modal prompt learning with three components: cross-modal consistency prompt, modality-specific prompt, and missing-aware prompt. Also introduces symmetric contrastive learning that uses the text modality as a bridge for fusing the dual vision modalities, with paired antithetical text prompts and triple-modal contrastive pre-training.

Result: Achieves 73.83% I-AUROC and 93.05% P-AUROC with 0.7 total missing rate for RGB and 3D modalities, exceeding state-of-the-art methods by 3.84% and 5.58% respectively. Outperforms existing approaches under different missing types and rates.

Conclusion: The proposed method effectively handles modality-missing problems in industrial surface defect detection through innovative prompt learning and contrastive learning techniques, demonstrating superior performance compared to existing methods.

Abstract: Multimodal industrial surface defect detection (MISDD) aims to identify and locate defects in industrial products by fusing RGB and 3D modalities. This article focuses on modality-missing problems caused by uncertain sensor availability in MISDD. In this context, the fusion of multiple modalities encounters several difficulties, including learning mode transformation and information vacancy. To this end, we first propose cross-modal prompt learning, which includes: i) the cross-modal consistency prompt serves the establishment of information consistency of dual visual modalities; ii) the modality-specific prompt is inserted to adapt different input patterns; iii) the missing-aware prompt is attached to compensate for the information vacancy caused by dynamic modality missing. In addition, we propose symmetric contrastive learning, which utilizes the text modality as a bridge for fusion of the dual vision modalities. Specifically, a paired antithetical text prompt is designed to generate binary text semantics, and triple-modal contrastive pre-training is offered to accomplish multimodal learning. Experimental results show that our proposed method achieves 73.83% I-AUROC and 93.05% P-AUROC with a total missing rate of 0.7 for RGB and 3D modalities (exceeding state-of-the-art methods by 3.84% and 5.58%, respectively), and outperforms existing approaches to varying degrees under different missing types and rates. The source code will be available at https://github.com/SvyJ/MISDD-MM.
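
The "text as a bridge" idea can be sketched as a symmetric InfoNCE objective in which both vision modalities are pulled toward a shared text embedding rather than directly toward each other; the concrete loss form is our assumption:

```python
import torch
import torch.nn.functional as F

def info_nce(a, b, tau=0.07):
    """InfoNCE between two batches of embeddings with matched rows."""
    logits = F.normalize(a, dim=-1) @ F.normalize(b, dim=-1).T / tau
    targets = torch.arange(a.size(0), device=a.device)
    return F.cross_entropy(logits, targets)

def symmetric_text_bridge_loss(rgb_emb, pc_emb, txt_emb):
    """Both vision modalities align to the binary text semantics, so either
    can stand in for the other when a sensor drops out."""
    return 0.5 * (info_nce(rgb_emb, txt_emb) + info_nce(pc_emb, txt_emb))
```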

[73] EdgeAttNet: Towards Barb-Aware Filament Segmentation

Victor Solomon, Piet Martens, Jingyu Liu, Rafal Angryk

Main category: cs.CV

TL;DR: EdgeAttNet - a U-Net-based segmentation model with learnable edge maps integrated into attention mechanism for improved solar filament segmentation, particularly capturing fine-scale barbs.

DetailsMotivation: Existing methods fail to capture fine-scale filament structures like barbs due to limited ability to model long-range dependencies and spatial detail, which is critical for determining filament chirality and predicting CME behavior.

Method: U-Net backbone with novel learnable edge map derived from input image, linearly transforming attention Key and Query matrices with edge information to guide self-attention mechanism for better boundary and barb capture.

Result: Outperforms U-Net and other U-Net-based transformer baselines on MAGFILO dataset with higher segmentation accuracy, significantly better barb recognition, and faster inference performance.

Conclusion: EdgeAttNet effectively integrates structural priors into attention computations, enhancing spatial sensitivity and segmentation accuracy while reducing parameters, making it suitable for practical deployment in solar filament analysis.

Abstract: Accurate segmentation of solar filaments in H-alpha observations is critical for determining filament chirality, a key factor in the behavior of Coronal Mass Ejections (CMEs). However, existing methods often fail to capture fine-scale filament structures, particularly barbs, due to a limited ability to model long-range dependencies and spatial detail. We propose EdgeAttNet, a segmentation architecture built on a U-Net backbone by introducing a novel, learnable edge map derived directly from the input image. This edge map is incorporated into the model by linearly transforming the attention Key and Query matrices with the edge information, thereby guiding the self-attention mechanism at the network’s bottleneck to more effectively capture filament boundaries and barbs. By explicitly integrating this structural prior into the attention computations, EdgeAttNet enhances spatial sensitivity and segmentation accuracy while reducing the number of trainable parameters. Trained end-to-end, EdgeAttNet outperforms U-Net and other U-Net-based transformer baselines on the MAGFILO dataset. It achieves higher segmentation accuracy and significantly better recognition of filament barbs, with faster inference performance suitable for practical deployment.
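
A minimal sketch of edge-modulated attention, assuming an additive fusion of the projected edge map into the Query and Key (the summary only says the matrices are linearly transformed with edge information, so other fusions are possible):

```python
import torch
import torch.nn as nn

class EdgeModulatedAttention(nn.Module):
    """Self-attention whose Q and K are shifted by a learnable projection of
    a per-token edge map, biasing attention toward boundaries and barbs."""
    def __init__(self, dim):
        super().__init__()
        self.to_qkv = nn.Linear(dim, dim * 3)
        self.edge_proj = nn.Linear(1, dim)  # lift the scalar edge map to feature dim
        self.scale = dim ** -0.5

    def forward(self, x, edge):  # x: (B, N, D), edge: (B, N, 1)
        q, k, v = self.to_qkv(x).chunk(3, dim=-1)
        e = self.edge_proj(edge)
        q, k = q + e, k + e  # edge-guided Query and Key
        attn = (q @ k.transpose(-2, -1) * self.scale).softmax(dim=-1)
        return attn @ v

# Toy usage on a flattened 8x8 bottleneck feature map.
m = EdgeModulatedAttention(dim=32)
out = m(torch.randn(2, 64, 32), torch.rand(2, 64, 1))
```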

[74] KEPT: Knowledge-Enhanced Prediction of Trajectories from Consecutive Driving Frames with Vision-Language Models

Yujin Wang, Tianyi Wang, Quanfeng Liu, Wenxian Fan, Junfeng Jiao, Christian Claudel, Yunbing Yan, Bingzhao Gao, Jianqiang Wang, Hong Chen

Main category: cs.CV

TL;DR: KEPT is a knowledge-enhanced vision-language model that achieves state-of-the-art trajectory prediction for autonomous driving by combining temporal-spatial fusion, exemplar retrieval, and chain-of-thought prompting.

DetailsMotivation: Existing vision-language models often fail to effectively ground reasoning in scene dynamics and domain knowledge for accurate short-horizon trajectory prediction in autonomous driving.

Method: Uses temporal frequency-spatial fusion video encoder with self-supervised learning, k-means + HNSW retrieval for scene-aligned exemplars, chain-of-thought prompts with planning constraints, and triple-stage fine-tuning for spatial alignment, physical feasibility, and temporal planning.

Result: Achieves 0.70m average L2 with 0.21% collision rate (NoAvg) and 0.31m average L2 with 0.07% collision rate (TemAvg) on nuScenes dataset, with sub-millisecond retrieval latency.

Conclusion: Retrieval-augmented, chain-of-thought-guided VLMs provide a data-efficient pathway toward interpretable and trustworthy autonomous driving with complementary benefits from all fine-tuning stages.

Abstract: Accurate short-horizon trajectory prediction is pivotal for safe and reliable autonomous driving, yet existing vision-language models (VLMs) often fail to effectively ground their reasoning in scene dynamics and domain knowledge. To address this challenge, this paper introduces KEPT, a knowledge-enhanced VLM framework that predicts ego trajectories directly from consecutive front-view driving frames. KEPT couples a temporal frequency-spatial fusion (TFSF) video encoder, trained via self-supervised learning with hard-negative mining, with a scalable k-means + HNSW retrieval stack that supplies scene-aligned exemplars. Retrieved priors are embedded into chain-of-thought (CoT) prompts with explicit planning constraints, while a triple-stage fine-tuning schedule incrementally aligns the language head to metric spatial cues, physically feasible motion, and temporally conditioned front-view planning. Evaluated on nuScenes dataset, KEPT achieves state-of-the-art performance across open-loop protocols: under NoAvg, it achieves 0.70m average L2 with a 0.21% collision rate; under TemAvg with lightweight ego status, it attains 0.31m average L2 and a 0.07% collision rate. Ablation studies show that all three fine-tuning stages contribute complementary benefits, and that using Top-2 retrieved exemplars yields the best accuracy-safety trade-off. The k-means-clustered HNSW index delivers sub-millisecond retrieval latency, supporting practical deployment. These results indicate that retrieval-augmented, CoT-guided VLMs offer a promising, data-efficient pathway toward interpretable and trustworthy autonomous driving.
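
The retrieval stack can be sketched with off-the-shelf libraries; how KEPT couples the k-means clustering to the HNSW index is not specified here, so the wiring below (cluster for analysis, index every exemplar) is illustrative only:

```python
import numpy as np
import hnswlib
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
exemplars = rng.normal(size=(10_000, 256)).astype(np.float32)  # scene embeddings

kmeans = KMeans(n_clusters=64, n_init=4, random_state=0).fit(exemplars)

index = hnswlib.Index(space="cosine", dim=256)
index.init_index(max_elements=len(exemplars), ef_construction=200, M=16)
index.add_items(exemplars, np.arange(len(exemplars)))
index.set_ef(64)  # trade accuracy for the sub-millisecond latency the paper cites

query = rng.normal(size=(1, 256)).astype(np.float32)
labels, dists = index.knn_query(query, k=2)  # Top-2 exemplars, per the ablation
```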

[75] InstaDA: Augmenting Instance Segmentation Data with Dual-Agent System

Xianbao Hou, Yonghao He, Zeyd Boukhers, John See, Hu Su, Wei Sui, Cong Yang

Main category: cs.CV

TL;DR: InstaDA is a training-free dual-agent system that enhances instance segmentation datasets through LLM-diffusion model collaboration and image-based data augmentation, achieving significant performance improvements on LVIS 1.0.

DetailsMotivation: Addressing challenges in acquiring high-quality instance segmentation data due to labor-intensive annotation and class imbalances, while overcoming limitations of existing methods that lack deep collaboration between LLMs and diffusion models and underutilize existing training data.

Method: Proposes a dual-agent system: 1) Text-Agent with Prompt Rethink mechanism for iterative prompt refinement through LLM-diffusion collaboration, 2) Image-Agent that generates new instances conditioned on training images to enrich data distribution. Both operate as independent automated workflows.

Result: Achieves +4.0 box AP and +3.3 mask AP improvements over baseline on LVIS 1.0 validation set. Outperforms DiverGen by +0.3 box AP and +0.1 mask AP, with notable gains in common categories (+0.7 box AP) and frequent categories (+0.5 mask AP).

Conclusion: InstaDA effectively addresses dataset limitations through collaborative LLM-diffusion integration and comprehensive data augmentation, demonstrating significant performance improvements in instance segmentation without requiring training.

Abstract: Acquiring high-quality instance segmentation data is challenging due to the labor-intensive nature of the annotation process and significant class imbalances within datasets. Recent studies have utilized the integration of Copy-Paste and diffusion models to create more diverse datasets. However, these studies often lack deep collaboration between large language models (LLMs) and diffusion models, and underutilize the rich information within the existing training data. To address these limitations, we propose InstaDA, a novel, training-free Dual-Agent system designed to augment instance segmentation datasets. First, we introduce a Text-Agent (T-Agent) that enhances data diversity through collaboration between LLMs and diffusion models. This agent features a novel Prompt Rethink mechanism, which iteratively refines prompts based on the generated images. This process not only fosters collaboration but also increases image utilization and optimizes the prompts themselves. Additionally, we present an Image-Agent (I-Agent) aimed at enriching the overall data distribution. This agent augments the training set by generating new instances conditioned on the training images. To ensure practicality and efficiency, both agents operate as independent and automated workflows, enhancing usability. Experiments conducted on the LVIS 1.0 validation set indicate that InstaDA achieves significant improvements, with an increase of +4.0 in box average precision (AP) and +3.3 in mask AP compared to the baseline. Furthermore, it outperforms the leading model, DiverGen, by +0.3 in box AP and +0.1 in mask AP, with a notable +0.7 gain in box AP on common categories and mask AP gains of +0.2 on common categories and +0.5 on frequent categories.

[76] SPENet: Self-guided Prototype Enhancement Network for Few-shot Medical Image Segmentation

Chao Fan, Xibin Jia, Anqi Xiao, Hongyuan Yu, Zhenghan Yang, Dawei Yang, Hui Xu, Yan Huang, Liang Wang

Main category: cs.CV

TL;DR: SPENet is a novel few-shot medical image segmentation method that addresses intra-class variations through multi-level prototype generation and query-guided local prototype enhancement, outperforming state-of-the-art methods.

DetailsMotivation: Existing prototype-based methods for few-shot medical image segmentation typically generate a single global prototype, which overlooks intra-class variations and struggles with substantial discrepancies between support and query images.

Method: Proposes Self-guided Prototype Enhancement Network (SPENet) with two modules: 1) Multi-level Prototype Generation (MPG) that creates global and adaptive local prototypes for multi-granularity measurement, and 2) Query-guided Local Prototype Enhancement (QLPE) that refines support prototypes using query image guidance to mitigate discrepancies.

Result: Extensive experiments on three public medical datasets demonstrate that SPENet outperforms existing state-of-the-art methods, achieving superior performance in few-shot medical image segmentation.

Conclusion: SPENet effectively addresses intra-class variations and support-query discrepancies in few-shot medical image segmentation through its multi-level prototype generation and query-guided enhancement approach, setting new state-of-the-art performance.

Abstract: Few-Shot Medical Image Segmentation (FSMIS) aims to segment novel classes of medical objects using only a few labeled images. Prototype-based methods have made significant progress in addressing FSMIS. However, they typically generate a single global prototype for the support image to match with the query image, overlooking intra-class variations. To address this issue, we propose a Self-guided Prototype Enhancement Network (SPENet). Specifically, we introduce a Multi-level Prototype Generation (MPG) module, which enables multi-granularity measurement between the support and query images by simultaneously generating a global prototype and an adaptive number of local prototypes. Additionally, we observe that not all local prototypes in the support image are beneficial for matching, especially when there are substantial discrepancies between the support and query images. To alleviate this issue, we propose a Query-guided Local Prototype Enhancement (QLPE) module, which adaptively refines support prototypes by incorporating guidance from the query image, thus mitigating the negative effects of such discrepancies. Extensive experiments on three public medical datasets demonstrate that SPENet outperforms existing state-of-the-art methods, achieving superior performance.
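
A sketch of multi-level prototype generation: masked average pooling yields the global prototype, and a small feature-space k-means yields local ones. SPENet chooses the number of local prototypes adaptively; that rule is not given here, so n_parts is fixed in this illustration:

```python
import torch

def global_prototype(feat, mask):
    """Masked average pooling. feat: (B, C, H, W); mask: (B, 1, H, W) in {0,1}."""
    return (feat * mask).sum(dim=(2, 3)) / mask.sum(dim=(2, 3)).clamp(min=1e-6)

def local_prototypes(feat, mask, n_parts=4, iters=10):
    """Cluster foreground feature vectors per image; each centroid is a local
    prototype. Assumes every mask has at least n_parts foreground pixels."""
    protos = []
    for f, m in zip(feat, mask):
        fg = f.permute(1, 2, 0)[m[0] > 0.5]            # (N_fg, C)
        centers = fg[torch.randperm(len(fg))[:n_parts]]
        for _ in range(iters):
            assign = torch.cdist(fg, centers).argmin(dim=1)
            centers = torch.stack([
                fg[assign == k].mean(dim=0) if (assign == k).any() else centers[k]
                for k in range(n_parts)])
        protos.append(centers)
    return torch.stack(protos)                          # (B, n_parts, C)
```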

[77] SOPSeg: Prompt-based Small Object Instance Segmentation in Remote Sensing Imagery

Chenhao Wang, Yingrui Ji, Yu Meng, Yunjian Zhang, Yao Zhu

Main category: cs.CV

TL;DR: SOPSeg is a prompt-based framework for small object segmentation in remote sensing imagery that addresses SAM’s limitations with coarse feature resolution through region-adaptive magnification and customized decoder with edge prediction.

DetailsMotivation: Small object instance segmentation in remote sensing is underexplored due to technical challenges and high annotation costs, and existing models like SAM perform poorly on small objects due to coarse feature resolution.

Method: Proposed SOPSeg framework with region-adaptive magnification to preserve fine details, customized decoder with edge prediction and progressive refinement, and novel prompting mechanism for oriented bounding boxes.

Result: SOPSeg outperforms existing methods in small object segmentation and enables efficient dataset construction for remote sensing tasks.

Conclusion: The framework successfully addresses small object segmentation challenges in remote sensing and includes release of both model and a comprehensive dataset based on SODA-A to support future research.

Abstract: Extracting small objects from remote sensing imagery plays a vital role in various applications, including urban planning, environmental monitoring, and disaster management. While current research primarily focuses on small object detection, instance segmentation for small objects remains underexplored, with no dedicated datasets available. This gap stems from the technical challenges and high costs of pixel-level annotation for small objects. While the Segment Anything Model (SAM) demonstrates impressive zero-shot generalization, its performance on small-object segmentation deteriorates significantly, largely due to the coarse 1/16 feature resolution that causes severe loss of fine spatial details. To this end, we propose SOPSeg, a prompt-based framework specifically designed for small object segmentation in remote sensing imagery. It incorporates a region-adaptive magnification strategy to preserve fine-grained details, and employs a customized decoder that integrates edge prediction and progressive refinement for accurate boundary delineation. Moreover, we introduce a novel prompting mechanism tailored to the oriented bounding boxes widely adopted in remote sensing applications. SOPSeg outperforms existing methods in small object segmentation and facilitates efficient dataset construction for remote sensing tasks. We further construct a comprehensive small object instance segmentation dataset based on SODA-A, and will release both the model and dataset to support future research.

[78] Enhancing Robustness in Post-Processing Watermarking: An Ensemble Attack Network Using CNNs and Transformers

Tzuhsuan Huang, Cheng Yu Yeo, Tsai-Ling Huang, Hong-Han Shuai, Wen-Huang Cheng, Jun-Cheng Chen

Main category: cs.CV

TL;DR: This paper proposes an ensemble attack network approach to enhance robustness in post-processing deep watermarking, combining CNN and Transformer networks across spatial and frequency domains.

DetailsMotivation: Post-processing watermarking offers more flexibility than in-processing methods as it can be applied to outputs from any generative model without needing model internals and allows unique watermarks for individual images.

Method: Developed an ensemble attack network using CNN and Transformer architectures in both spatial and frequency domains during training to improve watermark robustness. Tested various combinations to determine optimal configuration.

Result: Combining CNN-based attack network in spatial domain with Transformer-based network in frequency domain yielded highest robustness. Achieved 18.743% improvement over StegaStamp on WAVES benchmark’s Regeneration Attack using average bit accuracy metric.

Conclusion: The ensemble attack network approach significantly enhances post-processing watermarking robustness, with optimal performance achieved through cross-domain CNN-Transformer combination, demonstrating effectiveness across various stress tests.

Abstract: Recent studies on deep watermarking have predominantly focused on in-processing watermarking, which integrates the watermarking process into image generation. However, post-processing watermarking, which embeds watermarks after image generation, offers more flexibility. It can be applied to outputs from any generative model (e.g., GANs, diffusion models) without needing access to the model’s internal structure. It also allows users to embed unique watermarks into individual images. Therefore, this study focuses on post-processing watermarking and enhances its robustness by incorporating an ensemble attack network during training. We construct various versions of attack networks using CNN and Transformer in both spatial and frequency domains to investigate how each combination influences the robustness of the watermarking model. Our results demonstrate that combining a CNN-based attack network in the spatial domain with a Transformer-based attack network in the frequency domain yields the highest robustness in watermarking models. Extensive evaluation on the WAVES benchmark, using average bit accuracy as the metric, demonstrates that our ensemble attack network significantly enhances the robustness of baseline watermarking methods under various stress tests. In particular, for the Regeneration Attack defined in WAVES, our method improves StegaStamp by 18.743%. The code is released at: https://github.com/aiiu-lab/DeepRobustWatermark.
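
The ensemble-attack idea can be sketched with two tiny differentiable attack networks, one spatial CNN and one operating on the FFT spectrum; a learnable magnitude mask stands in for the paper's Transformer-based frequency branch:

```python
import torch
import torch.nn as nn

class SpatialCNNAttack(nn.Module):
    """Small CNN that perturbs the watermarked image in the spatial domain."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 3, 3, padding=1))
    def forward(self, x):
        return (x + self.net(x)).clamp(0, 1)

class FrequencyAttack(nn.Module):
    """Scales the image's rFFT spectrum with a learnable mask."""
    def __init__(self, size=64):
        super().__init__()
        self.mask = nn.Parameter(torch.ones(1, 3, size, size // 2 + 1))
    def forward(self, x):
        spec = torch.fft.rfft2(x)
        return torch.fft.irfft2(spec * self.mask, s=x.shape[-2:]).clamp(0, 1)

# During watermark training, each batch would pass through a sampled attack
# before decoding, forcing the decoder to stay robust in both domains.
attacks = [SpatialCNNAttack(), FrequencyAttack()]
x = torch.rand(2, 3, 64, 64)
attacked = attacks[torch.randint(len(attacks), (1,)).item()](x)
```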

[79] Lesion-Aware Visual-Language Fusion for Automated Image Captioning of Ulcerative Colitis Endoscopic Examinations

Alexis Ivan Lopez Escamilla, Gilberto Ochoa, Sharib Al

Main category: cs.CV

TL;DR: A lesion-aware image captioning framework for ulcerative colitis that integrates visual features with clinical metadata to generate structured endoscopic reports and improve classification accuracy.

DetailsMotivation: To develop an automated system for generating clinically relevant and interpretable endoscopic reports for ulcerative colitis that aligns with medical practice standards.

Method: Combines ResNet embeddings, Grad-CAM heatmaps, and CBAM-enhanced attention with a T5 decoder, injecting clinical metadata (MES score, vascular pattern, bleeding, etc.) as natural-language prompts to guide caption generation.

Result: The approach improves caption quality and MES classification accuracy compared to baseline methods, supporting reliable endoscopic reporting.

Conclusion: The framework successfully integrates visual and clinical information to produce structured, interpretable descriptions that align with clinical practice for ulcerative colitis assessment.

Abstract: We present a lesion-aware image captioning framework for ulcerative colitis (UC). The model integrates ResNet embeddings, Grad-CAM heatmaps, and CBAM-enhanced attention with a T5 decoder. Clinical metadata (MES score 0-3, vascular pattern, bleeding, erythema, friability, ulceration) is injected as natural-language prompts to guide caption generation. The system produces structured, interpretable descriptions aligned with clinical practice and provides MES classification and lesion tags. Compared with baselines, our approach improves caption quality and MES classification accuracy, supporting reliable endoscopic reporting.

[80] Unveiling the Response of Large Vision-Language Models to Visually Absent Tokens

Sohee Kim, Soohyun Ryu, Joonhyung Park, Eunho Yang

Main category: cs.CV

TL;DR: LVLMs often mistakenly treat text inputs as visual content. Researchers discovered specific FFN neurons that detect visual absence and developed a method to refine outputs by reinterpreting prompts or replacing ungrounded tokens.

DetailsMotivation: Large Vision-Language Models frequently make errors by assuming text inputs lacking visual evidence are part of the image, leading to incorrect responses.

Method: Identified Visual Absence-aware (VA) neurons in FFN layers that signal visual absence through activation patterns. Developed detection module to classify token grounding, and refined outputs by reinterpreting prompts or replacing absent tokens.

Result: The method effectively mitigates models’ tendency to falsely presume visual presence of text input and shows generality across various LVLMs.

Conclusion: Leveraging internal VA neuron patterns enables systematic detection of visually ungrounded tokens and improves LVLM output accuracy by preventing false visual assumptions.

Abstract: Large Vision-Language Models (LVLMs) generate contextually relevant responses by jointly interpreting visual and textual inputs. However, our finding reveals they often mistakenly perceive text inputs lacking visual evidence as being part of the image, leading to erroneous responses. In light of this finding, we probe whether LVLMs possess an internal capability to determine if textual concepts are grounded in the image, and discover a specific subset of Feed-Forward Network (FFN) neurons, termed Visual Absence-aware (VA) neurons, that consistently signal the visual absence through a distinctive activation pattern. Leveraging these patterns, we develop a detection module that systematically classifies whether an input token is visually grounded. Guided by its prediction, we propose a method to refine the outputs by reinterpreting question prompts or replacing the detected absent tokens during generation. Extensive experiments show that our method effectively mitigates the models’ tendency to falsely presume the visual presence of text input and its generality across various LVLMs.
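
Given pre-identified VA neuron indices, the detection module essentially reduces to thresholding their activations; the indices and threshold below are placeholders for what the paper derives from grounded versus ungrounded prompts:

```python
import torch

def visual_absence_score(ffn_acts, va_idx, threshold=0.0):
    """ffn_acts: (..., hidden) FFN activations for a token; va_idx: indices of
    Visual Absence-aware neurons. Returns the mean VA activation and a flag
    marking the token as visually ungrounded when it crosses the threshold."""
    score = ffn_acts[..., va_idx].mean(dim=-1)
    return score, score > threshold

acts = torch.randn(1, 4096)  # hypothetical FFN hidden size
score, absent = visual_absence_score(acts, va_idx=torch.tensor([11, 42, 77]))
```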

[81] Background Matters Too: A Language-Enhanced Adversarial Framework for Person Re-Identification

Kaicong Huang, Talha Azfar, Jack M. Reilly, Thomas Guggisberg, Ruimin Ke

Main category: cs.CV

TL;DR: Proposes a dual-branch cross-modal framework that jointly models foreground and background information for person re-identification, using intra-semantic alignment and inter-semantic adversarial learning to enhance discriminative power.

DetailsMotivation: Existing methods either rely on costly manual annotations or focus only on foreground information while overlooking valuable background cues. Inspired by human perception that eliminates background distractions while focusing on target appearance.

Method: End-to-end framework with dual-branch cross-modal feature extraction pipeline. Uses intra-semantic alignment to align visual and textual features with same semantics, and inter-semantic adversarial learning to penalize similarity between foreground and background features.

Result: Comprehensive experiments on two holistic and two occluded ReID benchmarks demonstrate effectiveness and generality, with results matching or surpassing current state-of-the-art approaches.

Conclusion: Background semantics are as important as foreground semantics in ReID. The proposed joint modeling of both domains with alignment and adversarial strategies effectively suppresses background noise and enhances attention to identity-relevant foreground cues.

Abstract: Person re-identification faces two core challenges: precisely locating the foreground target while suppressing background noise and extracting fine-grained features from the target region. Numerous visual-only approaches address these issues by partitioning an image and applying attention modules, yet they rely on costly manual annotations and struggle with complex occlusions. Recent multimodal methods, motivated by CLIP, introduce semantic cues to guide visual understanding. However, they focus solely on foreground information, but overlook the potential value of background cues. Inspired by human perception, we argue that background semantics are as important as the foreground semantics in ReID, as humans tend to eliminate background distractions while focusing on target appearance. Therefore, this paper proposes an end-to-end framework that jointly models foreground and background information within a dual-branch cross-modal feature extraction pipeline. To help the network distinguish between the two domains, we propose an intra-semantic alignment and inter-semantic adversarial learning strategy. Specifically, we align visual and textual features that share the same semantics across domains, while simultaneously penalizing similarity between foreground and background features to enhance the network’s discriminative power. This strategy drives the model to actively suppress noisy background regions and enhance attention toward identity-relevant foreground cues. Comprehensive experiments on two holistic and two occluded ReID benchmarks demonstrate the effectiveness and generality of the proposed method, with results that match or surpass those of current state-of-the-art approaches.
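
The two training signals can be sketched as an InfoNCE alignment within each semantic domain plus a penalty on foreground-background similarity; the concrete loss forms below are our assumptions:

```python
import torch
import torch.nn.functional as F

def intra_semantic_alignment(v_fg, t_fg, v_bg, t_bg, tau=0.07):
    """Align visual and textual features that share semantics: fg with fg,
    bg with bg (rows of each batch are assumed to be matched pairs)."""
    def nce(a, b):
        logits = F.normalize(a, dim=-1) @ F.normalize(b, dim=-1).T / tau
        return F.cross_entropy(logits, torch.arange(len(a), device=a.device))
    return nce(v_fg, t_fg) + nce(v_bg, t_bg)

def inter_semantic_adversarial(v_fg, v_bg):
    """Penalize similarity between foreground and background features so the
    network learns to separate the two domains."""
    return F.cosine_similarity(v_fg, v_bg, dim=-1).clamp(min=0).mean()
```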

[82] MedLiteNet: Lightweight Hybrid Medical Image Segmentation Model

Pengyang Yu, Haoquan Wang, Gerard Marks, Tahar Kechadi, Laurence T. Yang, Sahraoui Dhelim, Nyothiri Aung

Main category: cs.CV

TL;DR: MedLiteNet: Lightweight CNN-Transformer hybrid for precise skin-lesion segmentation with efficient computation and boundary-aware attention.

DetailsMotivation: Existing methods struggle with limited receptive fields (CNNs) or computational complexity (Transformers) for skin cancer diagnosis on small medical datasets.

Method: Hybrid architecture with Mobile Inverted Bottleneck blocks, cross-scale token-mixing unit, and boundary-aware self-attention module for hierarchical feature extraction.

Result: Achieves high precision segmentation through efficient multi-scale context aggregation while maintaining computational efficiency.

Conclusion: MedLiteNet provides an effective solution for dermoscopic segmentation by balancing global context modeling with lightweight design suitable for medical imaging applications.

Abstract: Accurate skin-lesion segmentation remains a key technical challenge for computer-aided diagnosis of skin cancer. Convolutional neural networks, while effective, are constrained by limited receptive fields and thus struggle to model long-range dependencies. Vision Transformers capture global context, yet their quadratic complexity and large parameter budgets hinder use on the small-sample medical datasets common in dermatology. We introduce MedLiteNet, a lightweight CNN-Transformer hybrid tailored for dermoscopic segmentation that achieves high precision through hierarchical feature extraction and multi-scale context aggregation. The encoder stacks depth-wise Mobile Inverted Bottleneck blocks to curb computation, inserts a bottleneck-level cross-scale token-mixing unit to exchange information between resolutions, and embeds a boundary-aware self-attention module to sharpen lesion contours.

[83] DCDB: Dynamic Conditional Dual Diffusion Bridge for Ill-posed Multi-Tasks

Chengjie Huang, Jiafeng Yan, Jing Li, Lu Bai

Main category: cs.CV

TL;DR: Proposes a dynamic conditional dual diffusion bridge training paradigm for ill-posed multi-task image processing, addressing limitations of traditional diffusion models in multi-task scenarios with limited data.

DetailsMotivation: Traditional conditional diffusion models struggle with multi-task scenarios due to difficulty exploiting intrinsic task correlations and static condition control that doesn't adapt to dynamically evolving multi-task characteristics, especially in ill-posed tasks with limited training data.

Method: A dynamic conditional dual diffusion bridge paradigm that decouples the diffusion and condition generation processes, uses dynamic conditions generated by the same noise schedule to gradually adjust statistical characteristics and embed time-related information, reducing network learning difficulty.

Result: Achieved best performance in multiple indicators on public datasets for dehazing and visible-infrared fusion tasks, demonstrating superiority of dynamic conditions through analysis of learning objectives and attention weight changes.

Conclusion: The proposed framework effectively addresses challenges in ill-posed multi-task scenarios by providing a general solution that outperforms traditional approaches through dynamic condition control and decoupled training paradigm.

Abstract: Conditional diffusion models have made impressive progress in the field of image processing, but the characteristics of constructing data distribution pathways make it difficult to exploit the intrinsic correlation between tasks in multi-task scenarios, which is even worse in ill-posed tasks with a lack of training data. In addition, traditional static condition control makes it difficult for networks to learn in multi-task scenarios with its dynamically evolving characteristics. To address these challenges, we propose a dynamic conditional double diffusion bridge training paradigm to build a general framework for ill-posed multi-tasks. Firstly, this paradigm decouples the diffusion and condition generation processes, avoiding the dependence of the diffusion model on supervised data in ill-posed tasks. Secondly, generated by the same noise schedule, dynamic conditions are used to gradually adjust their statistical characteristics, naturally embed time-related information, and reduce the difficulty of network learning. We analyze the learning objectives of the network under different conditional forms in the single-step denoising process and compare the changes in its attention weights in the network, demonstrating the superiority of our dynamic conditions. Taking dehazing and visible-infrared fusion as typical ill-posed multi-task scenarios, we achieve the best performance in multiple indicators on public datasets. The code has been publicly released at: https://anonymous.4open.science/r/DCDB-D3C2.

[84] Isolated Bangla Handwritten Character Classification using Transfer Learning

Abdul Karim, S M Rafiuddin, Jahidul Islam Razin, Tahira Alam

Main category: cs.CV

TL;DR: Transfer learning approach using 3DCNN, ResNet, and MobileNet achieves state-of-the-art accuracy (99.46%) for Bangla handwritten character classification across 84 character classes.

DetailsMotivation: Bangla language has complex character structures with 50 distinct characters and many compound characters, requiring advanced recognition methods that can handle both basic and compound characters while avoiding vanishing gradient problems.

Method: Applied transfer learning with deep neural network techniques including 3D Convolutional Neural Network (3DCNN), Residual Neural Network (ResNet), and MobileNet for end-to-end classification of all standard Bangla handwritten character formations using the Bangla Lekha Isolated dataset (166,105 samples across 84 classes).

Result: Achieved 99.82% accuracy on training data and 99.46% accuracy on test data, outperforming various state-of-the-art benchmarks for Bangla handwritten character classification.

Conclusion: The proposed transfer learning model with multiple deep neural network architectures successfully classifies complex Bangla handwritten characters with exceptional accuracy, demonstrating superior performance compared to existing methods.

Abstract: Bangla language consists of fifty distinct characters and many compound characters. Several notable studies have been performed to recognize Bangla characters, both handwritten and optical. Our approach uses transfer learning to classify the basic, distinct, as well as compound Bangla handwritten characters while avoiding the vanishing gradient problem. Deep Neural Network techniques such as 3D Convolutional Neural Network (3DCNN), Residual Neural Network (ResNet), and MobileNet are applied to generate an end-to-end classification of all possible standard formations of handwritten characters in the Bangla language. The Bangla Lekha Isolated dataset, which contains 166,105 Bangla character image samples categorized into 84 distinct classes, is used for this classification model. The model achieved 99.82% accuracy on training data and 99.46% accuracy on test data. Comparisons with various state-of-the-art benchmarks of Bangla handwritten character classification show that the proposed model achieves better accuracy in classifying the data.
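
A minimal transfer-learning sketch for one of the three backbones (MobileNetV2 here), freezing the pretrained features and re-heading the classifier for the 84 Bangla Lekha Isolated classes; the paper's exact fine-tuning schedule is not given:

```python
import torch.nn as nn
from torchvision import models

model = models.mobilenet_v2(weights="IMAGENET1K_V1")
for p in model.features.parameters():
    p.requires_grad = False  # keep the pretrained backbone frozen
model.classifier[1] = nn.Linear(model.last_channel, 84)  # 84 character classes
```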

[85] Mitigating Multimodal Hallucinations via Gradient-based Self-Reflection

Shan Wang, Maying Shen, Nadine Chang, Chuong Nguyen, Hongdong Li, Jose M. Alvarez

Main category: cs.CV

TL;DR: Proposes a gradient-based self-reflection method to estimate token influence and detect visual tokens, then uses influence-aware contrastive decoding to mitigate text-visual and co-occurrence biases in multimodal LLMs without additional resources.

DetailsMotivation: Existing methods address hallucinations in multimodal LLMs heuristically without understanding fluctuating bias levels across instances. Text-visual bias (over-reliance on text) and co-occurrence bias (statistical object-pairing patterns) cause hallucinations.

Method: Uses gradient-based self-reflection to estimate influence of different token types (visual, prompt, previous outputs). Detects object-related visual tokens and integrates them into an influence-aware contrastive decoding framework to mitigate both bias types simultaneously.

Result: Effectively reduces hallucinations, achieving up to 92% accuracy increase on LLaVA-QA90 benchmark in extensive experiments.

Conclusion: The proposed method successfully mitigates both text-visual and co-occurrence biases in multimodal LLMs without requiring additional fine-tuning, extra models, or data statistics, demonstrating significant performance improvements.

Abstract: Hallucinations in multimodal large language models are caused by the text-visual bias and the co-occurrence bias. The former reflects an over-reliance on text information in the decision-making process, while the latter arises from the statistical object-pairing patterns abstracted from the training data. Existing mitigation methods heuristically address these biases without understanding the fluctuating bias level across the instances. We first propose estimating the influence of respective token types (visual, prompt, and previous outputs) using a gradient-based self-reflection method. The estimated token influence further enables the detection of object-related visual tokens and their integration into an influence-aware contrastive decoding framework to mitigate both types of biases simultaneously. Our method operates without the need for additional resources, such as costly fine-tuning, extra models, or data statistics. Extensive experiments show it effectively reduces hallucinations, achieving up to a 92% accuracy increase on LLaVA-QA90.
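
One plausible reading of gradient-based token influence, measuring how strongly each input token's embedding affects the chosen logit, can be sketched as follows (the paper's exact estimator may differ):

```python
import torch

def token_influence(logits, embeds, target_id):
    """logits: (1, seq, vocab) from a forward pass on embeds, which must have
    requires_grad=True. Returns a per-token influence score as the gradient
    norm of the chosen logit w.r.t. each token embedding."""
    chosen = logits[0, -1, target_id]
    grads, = torch.autograd.grad(chosen, embeds, retain_graph=True)
    return grads.norm(dim=-1)  # (1, seq_len)
```

Scores can then be split by token type (visual, prompt, previous outputs) to estimate the per-instance bias level that drives the contrastive decoding.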

[86] High Cursive Complex Character Recognition using GAN External Classifier

S M Rafiuddin

Main category: cs.CV

TL;DR: ADA-GAN model combines GAN with external classifier to handle cursive handwritten characters using data augmentation with adversarially perturbed noise.

DetailsMotivation: Handwritten cursive characters are more complex and challenging to classify than simple characters, requiring more robust classification methods.

Method: Uses a GAN where generator creates fake handwritten images, adds adversarially perturbed noise, and augments training data when discriminator confidence exceeds threshold.

Result: CNN accuracy decreases with character complexity, but ADA-GAN remains robust and effective for both cursive and complex characters.

Conclusion: ADA-GAN provides superior performance for classifying complex cursive handwritten characters compared to traditional CNNs.

Abstract: Handwritten characters can be trickier to classify due to their complex and cursive nature compared to simple and non-cursive characters. We present an external classifier along with a Generative Adversarial Network that can classify highly cursive and complex characters. The generator network produces fake handwritten character images, which are then used to augment the training data after adding adversarially perturbed noise and achieving a confidence score above a threshold with the discriminator network. The results show that the accuracy of convolutional neural networks decreases as character complexity increases, but our proposed model, ADA-GAN, remains more robust and effective for both cursive and complex characters.
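
The augmentation rule reads as: keep a generated character only if, after perturbation, the discriminator still scores it above a confidence threshold. A hedged sketch, with random-sign noise standing in for the unspecified adversarial perturbation:

```python
import torch

@torch.no_grad()
def augment_batch(generator, discriminator, z, eps=0.05, conf_thresh=0.9):
    """Returns only those perturbed fakes the discriminator finds convincing;
    generator/discriminator are assumed to be trained nn.Modules."""
    fake = generator(z)
    perturbed = (fake + eps * torch.randn_like(fake).sign()).clamp(-1, 1)
    conf = torch.sigmoid(discriminator(perturbed)).flatten()
    return perturbed[conf > conf_thresh]
```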

[87] TRELLIS-Enhanced Surface Features for Comprehensive Intracranial Aneurysm Analysis

Clément Hervé, Paul Garnier, Jonathan Viquerat, Elie Hachem

Main category: cs.CV

TL;DR: Cross-domain feature transfer using TRELLIS generative model improves intracranial aneurysm analysis tasks including classification, segmentation, and blood flow prediction.

DetailsMotivation: Intracranial aneurysms are clinically significant but difficult to detect and model due to limited annotated 3D medical data.

Method: Leverages latent geometric embeddings from TRELLIS (generative model trained on non-medical 3D datasets) to replace conventional features, enhancing three downstream tasks: aneurysm classification, 3D mesh segmentation, and blood flow prediction using graph neural networks.

Result: Strong gains in accuracy, F1-score and segmentation quality over state-of-the-art baselines, with 15% reduction in simulation error.

Conclusion: Demonstrates the potential of transferring 3D representations from general-purpose generative models to specialized medical applications.

Abstract: Intracranial aneurysms pose a significant clinical risk yet are difficult to detect, delineate and model due to limited annotated 3D data. We propose a cross-domain feature-transfer approach that leverages the latent geometric embeddings learned by TRELLIS, a generative model trained on large-scale non-medical 3D datasets, to augment neural networks for aneurysm analysis. By replacing conventional point normals or mesh descriptors with TRELLIS surface features, we systematically enhance three downstream tasks: (i) classifying aneurysms versus healthy vessels in the Intra3D dataset, (ii) segmenting aneurysm and vessel regions on 3D meshes, and (iii) predicting time-evolving blood-flow fields using a graph neural network on the AnXplore dataset. Our experiments show that the inclusion of these features yields strong gains in accuracy, F1-score and segmentation quality over state-of-the-art baselines, and reduces simulation error by 15%. These results illustrate the broader potential of transferring 3D representations from general-purpose generative models to specialized medical tasks.

[88] Backdoor Poisoning Attack Against Face Spoofing Attack Detection Methods

Shota Iwamatsu, Koichi Ito, Takafumi Aoki

Main category: cs.CV

TL;DR: A backdoor poisoning attack method that embeds spoofing attack features into live face images to bypass face anti-spoofing detection systems without visible changes.

DetailsMotivation: Face recognition systems are vulnerable to spoofing attacks using photos, and existing deep learning-based detection methods can be compromised if malicious data is injected into training datasets, leading to false positives.

Method: Proposed method embeds features extracted from spoofing attack face images into live face images without causing perceptible visual alterations, creating poisoned training data that causes the detection system to misclassify specific spoofing attacks as live.

Result: Experiments on public datasets demonstrate that the proposed backdoor poisoning attack method poses a realistic threat to existing spoofing attack detection systems by enabling certain attacks to bypass detection.

Conclusion: The research reveals a latent threat of backdoor poisoning in face anti-spoofing detection, showing that attackers can compromise detection systems by strategically poisoning training data without visible changes to images.

Abstract: Face recognition systems are robust against environmental changes and noise, and thus may be vulnerable to illegal authentication attempts using user face photos, such as spoofing attacks. To prevent such spoofing attacks, it is crucial to discriminate whether the input image is a live user image or a spoofed image prior to the face recognition process. Most existing spoofing attack detection methods utilize deep learning, which necessitates a substantial amount of training data. Consequently, if malicious data is injected into a portion of the training dataset, a specific spoofing attack may be erroneously classified as live, leading to false positives. In this paper, we propose a novel backdoor poisoning attack method to demonstrate the latent threat of backdoor poisoning within face anti-spoofing detection. The proposed method enables certain spoofing attacks to bypass detection by embedding features extracted from the spoofing attack’s face image into a live face image without inducing any perceptible visual alterations. Through experiments conducted on public datasets, we demonstrate that the proposed method constitutes a realistic threat to existing spoofing attack detection systems.
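
The embedding step can be read as feature-space poisoning under an imperceptibility budget; the optimizer, L-infinity budget, and loss below are our assumptions, not the paper's recipe:

```python
import torch

def poison_live_image(live_img, spoof_img, encoder, steps=200, lr=0.01, eps=8 / 255):
    """Nudge a live face image, within an L-inf budget, until its features
    (under the detector's encoder) match those of a spoof image."""
    with torch.no_grad():
        target = encoder(spoof_img)
    delta = torch.zeros_like(live_img, requires_grad=True)
    opt = torch.optim.Adam([delta], lr=lr)
    for _ in range(steps):
        loss = (encoder(live_img + delta) - target).pow(2).mean()
        opt.zero_grad(); loss.backward(); opt.step()
        with torch.no_grad():
            delta.clamp_(-eps, eps)  # keep the change imperceptible
    return (live_img + delta).detach().clamp(0, 1)
```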

[89] Information transmission: Inferring change area from change moment in time series remote sensing images

Jialu Li, Chen Wu, Meiqi Hu

Main category: cs.CV

TL;DR: CAIM-Net is a novel time series change detection network that infers change areas from change moments, ensuring consistency between spatial and temporal detection results through a three-step process.

DetailsMotivation: Current deep learning approaches treat change area detection and change moment identification as separate tasks, despite the intrinsic relationship where change areas can be inferred from change moments.

Method: Three-step architecture: 1) Difference Extraction and Enhancement with lightweight encoder and boundary enhancement convolution, 2) Coarse Change Moment Extraction using spatiotemporal correlation analysis, 3) Fine Change Moment Extraction with multiscale temporal CAM and change area inference from weighted change moments.

Result: The network achieves consistent results between change area and change moment detection by leveraging the relationship that pixels with identified change moments must have undergone changes.

Conclusion: CAIM-Net provides an integrated approach to time series change detection that maintains consistency between spatial change areas and temporal change moments through inference-based methodology.

Abstract: Time series change detection is a critical task for exploring ecosystem dynamics using time series remote sensing images, because it can simultaneously indicate where and when changes occur. While deep learning has shown excellent performance in this domain, it continues to approach change area detection and change moment identification as distinct tasks. Given that change area can be inferred from change moment, we propose a time series change detection network, named CAIM-Net (Change Area Inference from Moment Network), to ensure consistency between change area and change moment results. CAIM-Net infers change area from change moment based on the intrinsic relationship between time series analysis and spatial change detection. The CAIM-Net comprises three key steps: Difference Extraction and Enhancement, Coarse Change Moment Extraction, and Fine Change Moment Extraction and Change Area Inference. In the Difference Extraction and Enhancement, a lightweight encoder with batch dimension stacking is designed to rapidly extract difference features. Subsequently, boundary enhancement convolution is applied to amplify these difference features. In the Coarse Change Moment Extraction, the enhanced difference features from the first step are used for spatiotemporal correlation analysis, and then two distinct methods are employed to determine coarse change moments. In the Fine Change Moment Extraction and Change Area Inference, a multiscale temporal Class Activation Mapping (CAM) module first increases the weight of the change-occurring moment from coarse change moments. Then the weighted change moment is used to infer change area based on the fact that pixels with the change moment must have undergone a change.
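
The core inference rule, that the change area is simply the support of the change-moment map, is a one-liner:

```python
import numpy as np

def change_area_from_moment(change_moment_map, no_change=0):
    """Any pixel assigned a change moment must have changed, so the binary
    change area is the support of the (weighted) change-moment map."""
    return (change_moment_map != no_change).astype(np.uint8)
```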

[90] Towards Realistic Hand-Object Interaction with Gravity-Field Based Diffusion Bridge

Miao Xu, Xiangyu Zhu, Xusheng Liang, Zidu Wang, Jinlin Wu, Zhen Lei

Main category: cs.CV

TL;DR: GravityDB is a diffusion-based method that formulates hand-object interaction as an attraction-driven process to generate physically plausible interactions without interpenetration, ensuring stable grasping and realistic hand deformations.

DetailsMotivation: Existing methods for hand-object pose estimation suffer from interpenetration, noticeable gaps in contact regions, and inability to capture realistic hand deformations during interactions.

Method: Proposes Gravity-Field Based Diffusion Bridge (GravityDB) that simulates interactions between deformable hand surfaces and rigid objects using an attraction-driven process, incorporating semantic information from textual descriptions to guide gravitational field construction.

Result: Extensive experiments on multiple datasets demonstrate the method effectively resolves interpenetration issues, ensures stable grasping, captures realistic hand deformations, and enables semantically meaningful interaction regions.

Conclusion: GravityDB provides a novel approach to hand-object interaction that addresses key limitations of previous methods by generating physically plausible and semantically guided interactions.

Abstract: Existing reconstruction or hand-object pose estimation methods are capable of producing coarse interaction states. However, due to the complex and diverse geometry of both human hands and objects, these approaches often suffer from interpenetration or leave noticeable gaps in regions that are supposed to be in contact. Moreover, the surface of a real human hand undergoes non-negligible deformations during interaction, which are difficult to capture and represent with previous methods. To tackle these challenges, we formulate hand-object interaction as an attraction-driven process and propose a Gravity-Field Based Diffusion Bridge (GravityDB) to simulate interactions between a deformable hand surface and rigid objects. Our approach effectively resolves the aforementioned issues by generating physically plausible interactions that avoid interpenetration, ensure stable grasping, and capture realistic hand deformations. Furthermore, we incorporate semantic information from textual descriptions to guide the construction of the gravitational field, enabling more semantically meaningful interaction regions. Extensive qualitative and quantitative experiments on multiple datasets demonstrate the effectiveness of our method.

[91] Temporally-Aware Diffusion Model for Brain Progression Modelling with Bidirectional Temporal Regularisation

Mattia Litrico, Francesco Guarnera, Mario Valerio Giuffrida, Daniele Ravì, Sebastiano Battiato

Main category: cs.CV

TL;DR: A 3D diffusion model (TADM-3D) for predicting future brain MRI progression using time-aware guidance and bidirectional regularization to improve temporal accuracy.

DetailsMotivation: Current MRI prediction methods have limitations: they fail to capture structural changes over time intervals, rely on interpolation without clinical utility, and use 2D architectures that disregard 3D anatomical context needed for accurate longitudinal predictions.

Method: Proposes TADM-3D, a 3D diffusion model that uses a pre-trained Brain-Age Estimator to guide MRI generation based on expected age differences, and incorporates Back-In-Time Regularisation by training bidirectionally (forward and backward) to improve temporal awareness.

Result: Trained and evaluated on OASIS-3 dataset with validation on external NACC dataset. The model generates MRIs that accurately reflect expected age differences between baseline and follow-up scans.

Conclusion: TADM-3D effectively addresses limitations of existing methods by incorporating 3D context, explicit time modeling, and bidirectional regularization to produce clinically useful future brain progression predictions.

Abstract: Generating realistic MRIs to accurately predict future changes in the structure of the brain is an invaluable tool for clinicians in assessing clinical outcomes and analysing disease progression at the patient level. However, existing methods present some limitations: (i) some approaches fail to explicitly capture the relationship between structural changes and time intervals, especially when trained on age-imbalanced datasets; (ii) others rely only on scan interpolation, which lacks clinical utility, as they generate intermediate images between timepoints rather than future pathological progression; and (iii) most approaches rely on 2D slice-based architectures, thereby disregarding full 3D anatomical context, which is essential for accurate longitudinal predictions. We propose a 3D Temporally-Aware Diffusion Model (TADM-3D), which accurately predicts brain progression on MRI volumes. To better model the relationship between time interval and brain changes, TADM-3D uses a pre-trained Brain-Age Estimator (BAE) that guides the diffusion model in the generation of MRIs that accurately reflect the expected age difference between baseline and generated follow-up scans. Additionally, to further improve the temporal awareness of TADM-3D, we propose the Back-In-Time Regularisation (BITR), training TADM-3D to predict bidirectionally from the baseline to the follow-up (forward) as well as from the follow-up to the baseline (backward). Although predicting past scans has limited clinical applications, this regularisation helps the model generate temporally more accurate scans. We train and evaluate TADM-3D on the OASIS-3 dataset, and we validate the generalisation performance on an external test set from the NACC dataset. The code will be available upon acceptance.
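
The bidirectional regularisation idea can be sketched in a few lines; the toy model below stands in for the diffusion backbone, and everything here is illustrative rather than the paper's implementation:

```python
import torch

class ToyProgressor(torch.nn.Module):
    """Stand-in for the generative model: shifts intensities by a learned rate * dt."""
    def __init__(self):
        super().__init__()
        self.rate = torch.nn.Parameter(torch.tensor(0.1))

    def forward(self, x, dt):
        return x + self.rate * dt.view(-1, 1, 1, 1)

def bidirectional_loss(model, baseline, followup, dt):
    pred_fwd = model(baseline, dt)    # baseline -> follow-up (forward in time)
    pred_bwd = model(followup, -dt)   # follow-up -> baseline (the back-in-time regulariser)
    return (torch.nn.functional.mse_loss(pred_fwd, followup)
            + torch.nn.functional.mse_loss(pred_bwd, baseline))

model = ToyProgressor()
x0, x1 = torch.rand(2, 1, 16, 16), torch.rand(2, 1, 16, 16)
bidirectional_loss(model, x0, x1, dt=torch.tensor([1.5, 3.0])).backward()
```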

[92] Preserving instance continuity and length in segmentation through connectivity-aware loss computation

Karol Szustakowski, Luk Frank, Julia Esser, Jan Gründemann, Marie Piraud

Main category: cs.CV

TL;DR: Novel loss functions for preserving connectivity in biomedical segmentation, particularly for elongated structures like axons, reducing discontinuities and improving length calculations.

DetailsMotivation: Preserving elongated structure continuity and length is more important than voxel-wise accuracy in biomedical segmentation tasks, especially for structures prone to signal dropout like axon initial segments.

Method: Proposed two novel loss functions: Negative Centerline Loss and Simplified Topology Loss, applied to CNNs. Also discussed experiment design characteristics like downscaling and spacing correction to obtain continuous segmentation masks.

Result: Reduced number of segmentation discontinuities per instance, particularly in regions with missing input signal, resulting in improved instance length calculation in downstream applications compared to standard CNNs and existing topology-aware losses.

Conclusion: Structural priors embedded in loss design can significantly enhance the reliability of segmentation for biological applications, demonstrating the importance of connectivity preservation over pure voxel-wise accuracy.

Abstract: In many biomedical segmentation tasks, the preservation of elongated structure continuity and length is more important than voxel-wise accuracy. We propose two novel loss functions, Negative Centerline Loss and Simplified Topology Loss, that, applied to Convolutional Neural Networks (CNNs), help preserve connectivity of output instances. Moreover, we discuss characteristics of experiment design, such as downscaling and spacing correction, that help obtain continuous segmentation masks. We evaluate our approach on a 3D light-sheet fluorescence microscopy dataset of axon initial segments (AIS), a task prone to discontinuity due to signal dropout. Compared to standard CNNs and existing topology-aware losses, our methods reduce the number of segmentation discontinuities per instance, particularly in regions with missing input signal, resulting in improved instance length calculation in downstream applications. Our findings demonstrate that structural priors embedded in the loss design can significantly enhance the reliability of segmentation for biological applications.
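
The abstract does not reproduce the loss definitions, so as one plausible reading only, a centerline-based connectivity penalty could look like the hedged sketch below, which makes gaps along a ground-truth skeleton expensive:

```python
import torch

def negative_centerline_loss(pred, centerline, eps=1e-6):
    """Penalise low foreground probability on ground-truth centerline voxels, so a
    break anywhere along an instance dominates the loss. This is an illustrative
    guess at the idea, not the paper's exact formulation.
    pred: (B, 1, D, H, W) probabilities; centerline: same shape, binary skeleton."""
    return -(torch.log(pred + eps) * centerline).sum() / (centerline.sum() + eps)

pred = torch.rand(1, 1, 8, 32, 32, requires_grad=True)
skel = (torch.rand(1, 1, 8, 32, 32) > 0.98).float()
negative_centerline_loss(pred, skel).backward()
```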

[93] Count2Density: Crowd Density Estimation without Location-level Annotations

Mattia Litrico, Feng Chen, Michael Pound, Sotirios A Tsaftaris, Sebastiano Battiato, Mario Valerio Giuffrida

Main category: cs.CV

TL;DR: Count2Density is a novel pipeline for crowd density estimation that uses only count-level annotations (total people count) instead of detailed point-level annotations, reducing annotation burden while still producing meaningful density maps.

DetailsMotivation: Traditional crowd density estimation requires fine-grained point-level annotations which are tedious and time-consuming to collect, creating scalability barriers for real-world applications. The goal is to develop a method that only needs count-level annotations.

Method: Generates pseudo-density maps using a Historical Map Bank initialized with unsupervised saliency estimation, updated with EMA of predictions. Uses hypergeometric sampling based on count annotations. Includes self-supervised contrastive spatial regularizer to enhance spatial awareness.

Result: Significantly outperforms cross-domain adaptation methods and achieves better results than recent state-of-the-art approaches in semi-supervised settings across several datasets. Effectively retrieves spatial information from count-level annotations.

Conclusion: Count2Density successfully reduces annotation burden while maintaining accurate density estimation, enabling practical applications with only count-level supervision and accurate subregion counting capabilities.

Abstract: Crowd density estimation is a well-known computer vision task aimed at estimating the density distribution of people in an image. The main challenge in this domain is the reliance on fine-grained location-level annotations (i.e., points placed on top of each individual) to train deep networks. Collecting such detailed annotations is tedious and time-consuming, and poses a significant barrier to scalability for real-world applications. To alleviate this burden, we present Count2Density: a novel pipeline designed to predict meaningful density maps containing quantitative spatial information using only count-level annotations (i.e., the total number of people) during training. To achieve this, Count2Density generates pseudo-density maps leveraging past predictions stored in a Historical Map Bank, thereby reducing confirmation bias. This bank is initialised using an unsupervised saliency estimator to provide an initial spatial prior and is iteratively updated with an EMA of predicted density maps. These pseudo-density maps are obtained by sampling locations from estimated crowd areas using a hypergeometric distribution, with the number of samplings determined by the count-level annotations. To further enhance the spatial awareness of the model, we add a self-supervised contrastive spatial regulariser to encourage similar feature representations within crowded regions while maximising dissimilarity with background regions. Experimental results demonstrate that our approach significantly outperforms cross-domain adaptation methods and achieves better results than recent state-of-the-art approaches in semi-supervised settings across several datasets. Additional analyses validate the effectiveness of each individual component of our pipeline, confirming the ability of Count2Density to effectively retrieve spatial information from count-level annotations and enabling accurate subregion counting.
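
A compact sketch of the pseudo-density generation loop follows; sampling without replacement over the spatial prior stands in for the paper's hypergeometric scheme, and all names are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)

def pseudo_density(map_bank, pred, count, ema=0.9):
    """map_bank, pred: (H, W) arrays; count is the image's count-level annotation."""
    map_bank = ema * map_bank + (1.0 - ema) * pred   # EMA update of the Historical Map Bank
    probs = map_bank.ravel() / map_bank.sum()        # spatial prior over pixel locations
    idx = rng.choice(probs.size, size=count, replace=False, p=probs)
    pseudo = np.zeros_like(map_bank)
    pseudo.ravel()[idx] = 1.0                        # one pseudo-point per annotated person
    return map_bank, pseudo

bank, pseudo = pseudo_density(rng.random((64, 64)), rng.random((64, 64)), count=120)
print(pseudo.sum())  # 120.0
```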

[94] AutoDetect: Designing an Autoencoder-based Detection Method for Poisoning Attacks on Object Detection Applications in the Military Domain

Alma M. Liezenga, Stefan Wijnja, Puck de Haan, Niels W. T. Brink, Jip J. van Stijn, Yori Kamphuis, Klamer Schutte

Main category: cs.CV

TL;DR: This paper investigates poisoning attacks on military object detection systems, tests a modified BadDet attack, finds it requires substantial data poisoning, and proposes AutoDetect - a lightweight autoencoder-based detection method that outperforms existing approaches.

DetailsMotivation: Poisoning attacks pose serious threats to military AI systems, but there's limited research on object detection poisoning, especially in military contexts where consequences can be grave. The widespread use of open-source datasets and pretrained models increases this risk.

Method: Created a custom military vehicle dataset (MilCivVeh), implemented a modified patch-based BadDet poisoning attack, tested both specialized poisoning detection methods and anomaly detection methods from industrial inspection, and introduced AutoDetect - an autoencoder-based method using reconstruction error of image slices.

Result: The poisoning attack achieved a positive success rate but required a substantial portion of the data to be poisoned, raising questions about practical applicability. Both existing classes of detection methods were found lacking. AutoDetect showed promising results in separating clean from poisoned samples, outperforming existing methods while being less time- and memory-intensive.

Conclusion: Large, representative military datasets are needed to further evaluate poisoning risks. AutoDetect provides an effective, lightweight solution for patch detection that addresses current methodological gaps in protecting military object detection systems from poisoning attacks.

Abstract: Poisoning attacks pose an increasing threat to the security and robustness of Artificial Intelligence systems in the military domain. The widespread use of open-source datasets and pretrained models exacerbates this risk. Despite the severity of this threat, there is limited research on the application and detection of poisoning attacks on object detection systems. This is especially problematic in the military domain, where attacks can have grave consequences. In this work, we investigate both the practical effect of poisoning attacks on military object detectors and the best approach to detecting these attacks. To support this research, we create a small, custom dataset featuring military vehicles: MilCivVeh. We explore the vulnerability of military object detectors to poisoning attacks by implementing a modified version of the BadDet attack: a patch-based poisoning attack. We then assess its impact, finding that while a positive attack success rate is achievable, it requires a substantial portion of the data to be poisoned – raising questions about its practical applicability. To address the detection challenge, we test both specialized poisoning detection methods and anomaly detection methods from the visual industrial inspection domain. Since our research shows that both classes of methods are lacking, we introduce our own patch detection method: AutoDetect, a simple, fast, and lightweight autoencoder-based method. Our method shows promising results in separating clean from poisoned samples using the reconstruction error of image slices, outperforming existing methods while being less time- and memory-intensive. We stress that the availability of large, representative datasets in the military domain is a prerequisite for further evaluating the risks of poisoning attacks and the opportunities for patch detection.
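
The detection principle is simple enough to sketch: an autoencoder trained on clean data reconstructs poisoned patches badly, so a per-slice reconstruction error flags them. A minimal toy version, where the architecture and slice size are assumptions:

```python
import torch

class TinyAE(torch.nn.Module):
    """Lightweight autoencoder (sketch), meant to be trained on clean slices only."""
    def __init__(self):
        super().__init__()
        self.enc = torch.nn.Sequential(
            torch.nn.Conv2d(3, 8, 3, stride=2, padding=1), torch.nn.ReLU())
        self.dec = torch.nn.Sequential(
            torch.nn.ConvTranspose2d(8, 3, 4, stride=2, padding=1), torch.nn.Sigmoid())

    def forward(self, x):
        return self.dec(self.enc(x))

def slice_scores(ae, img, patch=32):
    """Reconstruction error per image slice; a high score flags a suspicious patch."""
    _, h, w = img.shape
    scores = []
    for y in range(0, h - patch + 1, patch):
        for x in range(0, w - patch + 1, patch):
            s = img[:, y:y + patch, x:x + patch].unsqueeze(0)
            scores.append(torch.nn.functional.mse_loss(ae(s), s).item())
    return scores

print(max(slice_scores(TinyAE(), torch.rand(3, 96, 96))))
```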

[95] DeepTopoNet: A Framework for Subglacial Topography Estimation on the Greenland Ice Sheets

Bayu Adhi Tama, Mansa Krishna, Homayra Alam, Mostafa Cham, Omar Faruque, Gong Cheng, Jianwu Wang, Mathieu Morlighem, Vandana Janeja

Main category: cs.CV

TL;DR: Deep learning framework called DeepTopoNet that integrates radar ice thickness data with BedMachine Greenland data using dynamic loss-balancing to reconstruct subglacial topography with high accuracy.

DetailsMotivation: Understanding Greenland's subglacial topography is critical for projecting future ice sheet mass loss and sea-level rise, but sparse observational data and gaps between radar flight lines create significant uncertainty in model projections.

Method: Uses a CNN architecture with novel dynamic loss-balancing mechanism that adaptively weights radar and BedMachine data, incorporating gradient-based and trend surface features for subgrid-scale predictions. Tested on Upernavik Isstrøm region.

Result: Achieves high accuracy in reconstructing subglacial terrain, outperforming baseline methods. Demonstrates robustness in areas with limited radar coverage while leveraging high spatial resolution of BedMachine predictions.

Conclusion: DeepTopoNet demonstrates the potential of deep learning in bridging observational gaps for subglacial topography mapping, providing a scalable and efficient solution that could improve projections of Greenland’s ice sheet contribution to sea-level rise.

Abstract: Understanding Greenland’s subglacial topography is critical for projecting the future mass loss of the ice sheet and its contribution to global sea-level rise. However, the complex and sparse nature of observational data, particularly information about the bed topography under the ice sheet, significantly increases the uncertainty in model projections. Bed topography is traditionally measured by airborne ice-penetrating radar that measures the ice thickness directly underneath the aircraft, leaving data gaps of tens of kilometers between flight lines. This study introduces a deep learning framework, which we call DeepTopoNet, that integrates radar-derived ice thickness observations and BedMachine Greenland data through a novel dynamic loss-balancing mechanism. Among all efforts to reconstruct bed topography, BedMachine has emerged as one of the most widely used datasets, combining mass conservation principles and ice thickness measurements to generate high-resolution bed elevation estimates. The proposed loss function adaptively adjusts the weighting between radar and BedMachine data, ensuring robustness in areas with limited radar coverage while leveraging the high spatial resolution of BedMachine predictions (i.e., bed estimates). Our approach incorporates gradient-based and trend surface features to enhance model performance and utilizes a CNN architecture designed for subgrid-scale predictions. By systematically testing on the Upernavik Isstrøm region, the model achieves high accuracy, outperforming baseline methods in reconstructing subglacial terrain. This work demonstrates the potential of deep learning in bridging observational gaps, providing a scalable and efficient solution to inferring subglacial topography.
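
As a hedged illustration of the dynamic loss balance, the sketch below weights a sparse radar term against a dense BedMachine term by flight-line coverage; this adaptive rule is an assumption for illustration, not the paper's exact mechanism:

```python
import torch

def balanced_loss(pred, radar, radar_mask, bedmachine, eps=1e-8):
    """pred: predicted bed elevation; radar_mask is 1 on pixels with radar picks
    (an assumed input). Radar dominates where flight-line coverage exists."""
    coverage = radar_mask.mean()                                   # fraction of pixels with radar
    l_radar = ((pred - radar) ** 2 * radar_mask).sum() / (radar_mask.sum() + eps)
    l_bed = torch.nn.functional.mse_loss(pred, bedmachine)
    w = coverage.clamp(0.1, 0.9)                                   # keep both terms in play
    return w * l_radar + (1.0 - w) * l_bed

pred = torch.rand(1, 1, 32, 32, requires_grad=True)
mask = (torch.rand(1, 1, 32, 32) > 0.95).float()                   # sparse flight lines
balanced_loss(pred, torch.rand_like(mask), mask, torch.rand_like(mask)).backward()
```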

[96] PPORLD-EDNetLDCT: A Proximal Policy Optimization-Based Reinforcement Learning Framework for Adaptive Low-Dose CT Denoising

Debopom Sutradhar, Ripon Kumar Debnath, Mohaimenul Azam Khan Raiaan, Yan Zhang, Reem E. Mohamed, Sami Azam

Main category: cs.CV

TL;DR: PPORLD-EDNetLDCT is a reinforcement learning-based denoising method using PPO algorithm with encoder-decoder architecture that significantly improves low-dose CT image quality and outperforms traditional methods.

DetailsMotivation: Low-dose CT imaging reduces radiation exposure but results in increased noise and reduced image quality. Traditional denoising methods often fail to preserve image quality effectively.

Method: Reinforcement learning approach using Proximal Policy Optimization (PPO) algorithm with encoder-decoder architecture, trained via custom gym environment with real-time image quality feedback.

Result: Achieved PSNR of 41.87, SSIM of 0.9814, RMSE of 0.00236 on test datasets. Improved COVID-19 classification accuracy to 94% (4% improvement over non-RL methods).

Conclusion: The RL-based approach provides a promising solution for safer and more accurate low-dose CT imaging by effectively denoising images while preserving diagnostic quality.

Abstract: Low-dose computed tomography (LDCT) is critical for minimizing radiation exposure, but it often leads to increased noise and reduced image quality. Traditional denoising methods, such as iterative optimization or supervised learning, often fail to preserve image quality. To address these challenges, we introduce PPORLD-EDNetLDCT, a reinforcement learning (RL) based approach with an encoder-decoder for LDCT denoising. Our method utilizes a dynamic RL-based approach in which the proximal policy optimization (PPO) algorithm is used to optimize denoising policies in real time, based on image quality feedback, trained via a custom gym environment. The experimental results on the Low Dose CT Image and Projection dataset demonstrate that the proposed PPORLD-EDNetLDCT model outperforms traditional denoising techniques and other DL-based methods, achieving a peak signal-to-noise ratio of 41.87, a structural similarity index measure of 0.9814, and a root mean squared error of 0.00236. Moreover, on the NIH-AAPM-Mayo Clinic Low Dose CT Challenge dataset, our method achieved a PSNR of 41.52, an SSIM of 0.9723, and an RMSE of 0.0051. Furthermore, we validated the quality of denoising using a classification task on the COVID-19 LDCT dataset, where images processed by our method improved the classification accuracy to 94%, 4% higher than denoising without RL. This method offers a promising solution for safer and more accurate LDCT imaging.
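
The custom-environment idea can be sketched with the Gymnasium API: at each step the agent applies a denoising action and the reward is the resulting PSNR gain. The real system wraps an encoder-decoder denoiser and trains it with PPO via a standard RL library; everything below is a toy stand-in:

```python
import numpy as np
import gymnasium as gym

class DenoiseEnv(gym.Env):
    """Toy environment illustrating reward-by-image-quality-feedback."""
    def __init__(self, clean, noisy):
        self.clean, self.start, self.img = clean, noisy, noisy.copy()
        self.action_space = gym.spaces.Box(0.0, 1.0, shape=(1,), dtype=np.float32)
        self.observation_space = gym.spaces.Box(0.0, 1.0, shape=clean.shape, dtype=np.float32)

    def _psnr(self, x):
        return 10 * np.log10(1.0 / (np.mean((x - self.clean) ** 2) + 1e-12))

    def reset(self, seed=None, options=None):
        super().reset(seed=seed)
        self.img = self.start.copy()
        return self.img, {}

    def step(self, action):
        before = self._psnr(self.img)
        blur = (self.img + np.roll(self.img, 1, 0) + np.roll(self.img, -1, 0)) / 3.0
        self.img = (1 - action[0]) * self.img + action[0] * blur  # crude denoiser stand-in
        reward = self._psnr(self.img) - before                    # real-time quality feedback
        return self.img, reward, False, False, {}
```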

[97] AIVA: An AI-based Virtual Companion for Emotion-aware Interaction

Chenxi Li

Main category: cs.CV

TL;DR: Integration of multimodal sentiment perception into LLMs to create emotion-aware virtual companions that can interpret emotional cues from non-verbal signals for more empathetic human-computer interactions.

DetailsMotivation: LLMs are limited to unimodal text processing and lack ability to interpret emotional cues from non-verbal signals, which hinders immersive and empathetic human-computer interactions.

Method: Proposed AIVA framework with Multimodal Sentiment Perception Network (MSPN) using cross-modal fusion transformer and supervised contrastive learning, emotion-aware prompt engineering, TTS system, and animated avatar module.

Result: Development of an emotion-aware AI virtual companion that captures multimodal sentiment cues and enables emotionally aligned animated interactions.

Conclusion: Provides a framework for emotion-aware agents with applications in companion robotics, social care, mental health, and human-centered AI.

Abstract: Recent advances in Large Language Models (LLMs) have significantly improved natural language understanding and generation, enhancing Human-Computer Interaction (HCI). However, LLMs are limited to unimodal text processing and lack the ability to interpret emotional cues from non-verbal signals, hindering more immersive and empathetic interactions. This work explores integrating multimodal sentiment perception into LLMs to create emotion-aware agents. We propose AIVA, an AI-based virtual companion that captures multimodal sentiment cues, enabling emotionally aligned and animated HCI. AIVA introduces a Multimodal Sentiment Perception Network (MSPN) using a cross-modal fusion transformer and supervised contrastive learning to provide emotional cues. Additionally, we develop an emotion-aware prompt engineering strategy for generating empathetic responses and integrate a Text-to-Speech (TTS) system and animated avatar module for expressive interactions. AIVA provides a framework for emotion-aware agents with applications in companion robotics, social care, mental health, and human-centered AI.

[98] RTGMFF: Enhanced fMRI-based Brain Disorder Diagnosis via ROI-driven Text Generation and Multimodal Feature Fusion

Junhao Jia, Yifei Sun, Yunyou Liu, Cheng Yang, Changmiao Wang, Feiwei Qin, Yong Peng, Wenwen Min

Main category: cs.CV

TL;DR: RTGMFF is a novel fMRI analysis framework that combines automatic ROI-level text generation with multimodal feature fusion for improved brain disorder diagnosis, outperforming existing methods on ADHD-200 and ABIDE benchmarks.

DetailsMotivation: Current fMRI diagnosis faces challenges with low signal-to-noise ratios, inter-subject variability, limited frequency awareness in CNN/Transformer models, and lack of textual annotations to contextualize brain activation patterns.

Method: Three-component framework: (1) ROI-driven fMRI text generation converting activation/connectivity data into text tokens, (2) Hybrid frequency-spatial encoder combining wavelet-mamba branch with Transformer for frequency and spatial analysis, (3) Adaptive semantic alignment module embedding text and visual features in shared space with regularized cosine-similarity loss.

Result: Extensive experiments show RTGMFF surpasses current methods in diagnostic accuracy with notable gains in sensitivity, specificity, and AUC on ADHD-200 and ABIDE benchmarks.

Conclusion: RTGMFF effectively addresses fMRI analysis limitations by unifying text generation with multimodal fusion, demonstrating superior performance for brain disorder diagnosis and providing reproducible text tokens for better interpretability.

Abstract: Functional magnetic resonance imaging (fMRI) is a powerful tool for probing brain function, yet reliable clinical diagnosis is hampered by low signal-to-noise ratios, inter-subject variability, and the limited frequency awareness of prevailing CNN- and Transformer-based models. Moreover, most fMRI datasets lack textual annotations that could contextualize regional activation and connectivity patterns. We introduce RTGMFF, a framework that unifies automatic ROI-level text generation with multimodal feature fusion for brain-disorder diagnosis. RTGMFF consists of three components: (i) ROI-driven fMRI text generation deterministically condenses each subject’s activation, connectivity, age, and sex into reproducible text tokens; (ii) Hybrid frequency-spatial encoder fuses a hierarchical wavelet-mamba branch with a cross-scale Transformer encoder to capture frequency-domain structure alongside long-range spatial dependencies; and (iii) Adaptive semantic alignment module embeds the ROI token sequence and visual features in a shared space, using a regularized cosine-similarity loss to narrow the modality gap. Extensive experiments on the ADHD-200 and ABIDE benchmarks show that RTGMFF surpasses current methods in diagnostic accuracy, achieving notable gains in sensitivity, specificity, and area under the ROC curve. Code is available at https://github.com/BeistMedAI/RTGMFF.

[99] LGBP-OrgaNet: Learnable Gaussian Band Pass Fusion of CNN and Transformer Features for Robust Organoid Segmentation and Tracking

Jing Zhang, Siying Tao, Jiao Li, Tianhe Wang, Junchen Wu, Ruqian Hao, Xiaohui Du, Ruirong Tan, Rui Li

Main category: cs.CV

TL;DR: Proposes LGBP-OrgaNet, an automated non-destructive deep learning system for organoid segmentation and tracking using CNN-Transformer fusion with Learnable Gaussian Band Pass and bidirectional cross fusion.

DetailsMotivation: Traditional fluorescence labeling methods risk damaging organoid structure, so a non-destructive automated approach is needed for accurate segmentation and tracking of organoids which indicate developmental status.

Method: Deep learning-based system combining CNN and Transformer modules with Learnable Gaussian Band Pass Fusion module and Bidirectional Cross Fusion Block in decoder for multi-scale feature fusion and progressive upsampling.

Result: LGBP-OrgaNet demonstrates satisfactory segmentation accuracy and robustness on organoid segmentation datasets.

Conclusion: Provides a powerful non-destructive tool for organoid research that accurately segments, tracks, and quantifies organoids without compromising their structure.

Abstract: Organoids replicate organ structure and function, playing a crucial role in fields such as tumor treatment and drug screening. Their shape and size can indicate their developmental status, but traditional fluorescence labeling methods risk compromising their structure. Therefore, this paper proposes an automated, non-destructive approach to organoid segmentation and tracking. We introduce LGBP-OrgaNet, a deep learning-based system proficient in accurately segmenting, tracking, and quantifying organoids. The model leverages complementary information extracted from CNN and Transformer modules and introduces an innovative feature fusion module, Learnable Gaussian Band Pass Fusion, to merge data from the two branches. Additionally, in the decoder, the model proposes a Bidirectional Cross Fusion Block to fuse multi-scale features, and finally completes the decoding through progressive concatenation and upsampling. LGBP-OrgaNet demonstrates satisfactory segmentation accuracy and robustness on organoid segmentation datasets, providing a potent tool for organoid research.
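
A learnable Gaussian band-pass gate is straightforward to express in the Fourier domain; the sketch below applies one gate to each branch and merges them by addition, which is an assumption about the fusion rule rather than the paper's design:

```python
import torch

class LearnableGaussianBandPass(torch.nn.Module):
    """Learnable Gaussian band-pass gate over spatial frequencies (sketch)."""
    def __init__(self):
        super().__init__()
        self.center = torch.nn.Parameter(torch.tensor(0.3))  # pass-band centre (norm. freq.)
        self.width = torch.nn.Parameter(torch.tensor(0.1))   # pass-band width

    def forward(self, feat):
        f = torch.fft.fft2(feat)
        h, w = feat.shape[-2:]
        fy = torch.fft.fftfreq(h).abs().view(-1, 1)
        fx = torch.fft.fftfreq(w).abs().view(1, -1)
        radius = torch.sqrt(fy ** 2 + fx ** 2)
        gate = torch.exp(-((radius - self.center) ** 2) / (2 * self.width ** 2 + 1e-8))
        return torch.fft.ifft2(f * gate).real

def fuse(cnn_feat, trans_feat, bp=LearnableGaussianBandPass()):
    # Band-pass each branch, then merge; additive merging is a placeholder choice.
    return bp(cnn_feat) + bp(trans_feat)

x = torch.rand(1, 16, 32, 32)
print(fuse(x, torch.rand_like(x)).shape)
```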

[100] PI3DETR: Parametric Instance Detection of 3D Point Cloud Edges with a Geometry-Aware 3DETR

Fabio F. Oberweger, Michael Schwingshackl, Vanessa Staderini

Main category: cs.CV

TL;DR: PI3DETR is an end-to-end 3D curve detection framework that directly predicts parametric curve instances from point clouds using geometry-aware matching and specialized loss functions, achieving state-of-the-art performance on ABC dataset.

DetailsMotivation: To overcome the limitations of intermediate representations and multi-stage processing in prior 3D curve detection methods, and to handle real-world challenges like noise and varying sampling densities in LiDAR and 3D sensing scenarios.

Method: Extends 3DETR with geometry-aware matching strategy and specialized loss functions for unified detection of multiple curve types (cubic Bézier curves, line segments, circles, arcs) in single forward pass, with optional post-processing refinement.

Result: Sets new state-of-the-art on ABC dataset, demonstrates improved robustness to noise and varying sampling densities, and generalizes effectively to real sensor data.

Conclusion: PI3DETR provides a simple yet powerful end-to-end solution for 3D edge and curve estimation that avoids complex intermediate processing while maintaining high performance across diverse curve types and real-world conditions.

Abstract: We present PI3DETR, an end-to-end framework that directly predicts 3D parametric curve instances from raw point clouds, avoiding the intermediate representations and multi-stage processing common in prior work. Extending 3DETR, our model introduces a geometry-aware matching strategy and specialized loss functions that enable unified detection of differently parameterized curve types, including cubic Bézier curves, line segments, circles, and arcs, in a single forward pass. Optional post-processing steps further refine predictions without adding complexity. This streamlined design improves robustness to noise and varying sampling densities, addressing critical challenges in real-world LiDAR and 3D sensing scenarios. PI3DETR sets a new state-of-the-art on the ABC dataset and generalizes effectively to real sensor data, offering a simple yet powerful solution for 3D edge and curve estimation.
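
The geometry-aware matching step can be illustrated with a DETR-style one-to-one assignment in which the cost is a Chamfer distance between points sampled on predicted and ground-truth curves; the paper's actual cost is richer, so treat this as a sketch:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_curves(pred_pts, gt_pts):
    """pred_pts, gt_pts: lists of (S, 3) arrays of points sampled on each curve."""
    cost = np.zeros((len(pred_pts), len(gt_pts)))
    for i, p in enumerate(pred_pts):
        for j, g in enumerate(gt_pts):
            d = np.linalg.norm(p[:, None] - g[None], axis=-1)
            cost[i, j] = d.min(axis=1).mean() + d.min(axis=0).mean()  # symmetric Chamfer
    rows, cols = linear_sum_assignment(cost)  # one-to-one assignment, DETR-style
    return list(zip(rows, cols))

pred = [np.random.rand(16, 3) for _ in range(5)]
gt = [np.random.rand(16, 3) for _ in range(3)]
print(match_curves(pred, gt))
```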

[101] SynBT: High-quality Tumor Synthesis for Breast Tumor Segmentation by 3D Diffusion Model

Hongxu Yang, Edina Timko, Levente Lippenszky, Vanda Czipczer, Lehel Ferenczi

Main category: cs.CV

TL;DR: SynBT is a 3D medical diffusion model that generates high-quality synthetic breast tumors in MRI images using a patch-to-volume autoencoder and mask-conditioned diffusion, improving tumor segmentation performance by 2-3% Dice Score.

DetailsMotivation: Existing tumor synthesis methods perform poorly for large spatial volume tumors like breast tumors in MRI with large field-of-view, as current methods are based on small patches rather than full volumes.

Method: Proposed SynBT model with patch-to-volume autoencoder to compress high-resolution MRIs into compact latent space while preserving large FOV resolution, combined with mask-conditioned diffusion model to synthesize realistic breast tumors in selected tissue regions.

Result: The method demonstrated improved tumor segmentation performance with 2-3% Dice Score improvement on a large public dataset compared to common segmentation models.

Conclusion: The proposed high-quality tumor synthesis approach effectively facilitates tumor segmentation in MRI images, providing significant benefits for medical image analysis tasks involving large-volume tumors.

Abstract: Synthetic tumors in medical images offer controllable characteristics that facilitate the training of machine learning models, leading to improved segmentation performance. However, existing methods of tumor synthesis yield suboptimal performance when the tumor occupies a large spatial volume, such as breast tumor segmentation in MRI with a large field-of-view (FOV), while commonly used tumor generation methods are based on small patches. In this paper, we propose a 3D medical diffusion model, called SynBT, to generate high-quality breast tumors (BT) in contrast-enhanced MRI images. The proposed model consists of a patch-to-volume autoencoder, which is able to compress high-resolution MRIs into a compact latent space while preserving the resolution of volumes with large FOV. Using the obtained latent feature vector, a mask-conditioned diffusion model is used to synthesize breast tumors within selected regions of breast tissue, resulting in realistic tumor appearances. We evaluated the proposed method on a tumor segmentation task, demonstrating that the proposed high-quality tumor synthesis method can facilitate common segmentation models with a performance improvement of 2-3% Dice Score on a large public dataset, and therefore provides benefits for tumor segmentation in MRI images.

[102] PointAD+: Learning Hierarchical Representations for Zero-shot 3D Anomaly Detection

Qihang Zhou, Shibo He, Jiangtao Yan, Wenchao Meng, Jiming Chen

Main category: cs.CV

TL;DR: PointAD+ transfers CLIP’s 2D generalization to 3D anomaly detection using both implicit (rendering pixel) and explicit (spatial geometry) representations with hierarchical text prompts and cross-hierarchy contrastive alignment.

DetailsMotivation: To transfer CLIP's robust 2D generalization capabilities to identify 3D anomalies across unseen objects with diverse class semantics, overcoming limitations of previous methods that neglect spatial relationships in point clouds.

Method: Proposes PointAD+ framework with: 1) implicit 3D representation using point-pixel correspondence, 2) explicit 3D representation with G-aggregation for spatial awareness, 3) hierarchical representation learning with rendering and geometry prompts, 4) cross-hierarchy contrastive alignment for layer interaction.

Result: Extensive experiments demonstrate superiority in zero-shot 3D anomaly detection across unseen objects with diverse class semantics, achieving holistic abnormality understanding with plug-and-play RGB integration capability.

Conclusion: PointAD+ successfully bridges 2D-3D generalization gap by comprehensively capturing both rendering and spatial abnormalities through unified hierarchical representation learning, enabling robust 3D anomaly detection on diverse unseen objects.

Abstract: In this paper, we aim to transfer CLIP’s robust 2D generalization capabilities to identify 3D anomalies across unseen objects of highly diverse class semantics. To this end, we propose a unified framework to comprehensively detect and segment 3D anomalies by leveraging both point- and pixel-level information. We first design PointAD, which leverages point-pixel correspondence to represent 3D anomalies through their associated rendering pixel representations. This approach is referred to as implicit 3D representation, as it focuses solely on rendering pixel anomalies but neglects the inherent spatial relationships within point clouds. Then, we propose PointAD+ to further broaden the interpretation of 3D anomalies by introducing explicit 3D representation, emphasizing spatial abnormality to uncover abnormal spatial relationships. Hence, we propose G-aggregation, which incorporates geometry information to make the aggregated point representations spatially aware. To simultaneously capture rendering and spatial abnormality, PointAD+ proposes hierarchical representation learning, incorporating implicit and explicit anomaly semantics into hierarchical text prompts: rendering prompts for the rendering layer and geometry prompts for the geometry layer. A cross-hierarchy contrastive alignment is further introduced to promote the interaction between the rendering and geometry layers, facilitating mutual anomaly learning. Finally, PointAD+ integrates anomaly semantics from both layers to capture generalized anomaly semantics. During testing, PointAD+ can integrate RGB information in a plug-and-play manner to further improve its detection performance. Extensive experiments demonstrate the superiority of PointAD+ in zero-shot 3D anomaly detection across unseen objects with highly diverse class semantics, achieving a holistic understanding of abnormality.

[103] Empowering Lightweight MLLMs with Reasoning via Long CoT SFT

Linyu Ou

Main category: cs.CV

TL;DR: Long Chain-of-Thought data significantly improves reasoning in lightweight multimodal language models (<7B parameters) through supervised fine-tuning, with additional gains from subsequent reinforcement learning.

DetailsMotivation: To investigate whether verifiable reward reinforcement learning techniques that work for large LLMs can also enhance reasoning in smaller multimodal language models, and to explore the role of long Chain-of-Thought data in this process.

Method: Used Supervised Fine-Tuning (SFT) with long Chain-of-Thought data on lightweight MLLMs, followed by a reinforcement learning stage to further enhance performance.

Result: SFT with long CoT data significantly improved MLLM reasoning capabilities, and subsequent RL stage provided additional performance gains beyond the initial SFT improvements.

Conclusion: A supervised fine-tuning stage with long Chain-of-Thought data is essential as a prerequisite for developing reasoning capabilities in lightweight multimodal language models before applying reinforcement learning techniques.

Abstract: While Reinforcement Learning with Verifiable Rewards has enhanced the reasoning of large-scale language models (LLMs), its efficacy for lightweight multimodal language models (MLLMs) with fewer than seven billion parameters remains underexplored. This paper investigates the role of long Chain-of-Thought (long CoT) data in enhancing the reasoning abilities of such MLLMs. Our findings demonstrate that Supervised Fine-Tuning (SFT) with long CoT data significantly improves MLLM reasoning. Furthermore, we observe that after this initial SFT phase, MLLMs can achieve additional performance gains through a subsequent RL stage. We conclude that an SFT stage with long CoT data is a critical prerequisite for developing the reasoning capabilities of lightweight MLLMs.

[104] Heatmap Guided Query Transformers for Robust Astrocyte Detection across Immunostains and Resolutions

Xizhe Zhang, Jiayang Zhu

Main category: cs.CV

TL;DR: Hybrid CNN-Transformer model for astrocyte detection in histology images, outperforming existing methods with better sensitivity and fewer false positives.

DetailsMotivation: Astrocytes' complex morphology and stain variability make automated detection challenging, which is crucial for studying neurological disorders.

Method: Combines CNN for local features with Transformer for global context, using heatmap-guided queries for small/faint astrocytes and lightweight Transformer for dense clusters.

Result: Outperformed Faster R-CNN, YOLOv11 and DETR on ALDH1L1 and GFAP stained datasets, achieving higher sensitivity with fewer false positives in FROC analysis.

Conclusion: Hybrid CNN-Transformer architecture shows strong potential for robust astrocyte detection and provides foundation for computational pathology tools.

Abstract: Astrocytes are critical glial cells whose altered morphology and density are hallmarks of many neurological disorders. However, their intricate branching and stain-dependent variability make automated detection in histological images a highly challenging task. To address these challenges, we propose a hybrid CNN-Transformer detector that combines local feature extraction with global contextual reasoning. A heatmap-guided query mechanism generates spatially grounded anchors for small and faint astrocytes, while a lightweight Transformer module improves discrimination in dense clusters. Evaluated on ALDH1L1- and GFAP-stained astrocyte datasets, the model consistently outperformed Faster R-CNN, YOLOv11, and DETR, achieving higher sensitivity with fewer false positives, as confirmed by FROC analysis. These results highlight the potential of hybrid CNN-Transformer architectures for robust astrocyte detection and provide a foundation for advanced computational pathology tools.
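
The heatmap-guided query mechanism can be sketched as seeding transformer queries at the top-k peaks of a predicted cell-centre heatmap; the names and the top-k rule below are assumptions for illustration:

```python
import torch

def heatmap_queries(heatmap, num_queries=50):
    """heatmap: (H, W) astrocyte-centre confidence map.
    Returns normalised (x, y) anchors for the strongest peaks plus their scores."""
    h, w = heatmap.shape
    scores, idx = heatmap.flatten().topk(num_queries)
    ys, xs = idx // w, idx % w
    anchors = torch.stack([xs.float() / w, ys.float() / h], dim=-1)
    return anchors, scores

anchors, scores = heatmap_queries(torch.rand(128, 128))
print(anchors.shape)  # (50, 2): spatially grounded query anchors
```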

[105] InfraDiffusion: zero-shot depth map restoration with diffusion models and prompted segmentation from sparse infrastructure point clouds

Yixiong Jing, Cheng Zhang, Haibing Wu, Guangming Wang, Olaf Wysocki, Brian Sheil

Main category: cs.CV

TL;DR: InfraDiffusion is a zero-shot framework that enhances masonry point clouds for brick-level segmentation by projecting them into depth maps and restoring them using diffusion models, enabling automated inspection in low-light environments.

DetailsMotivation: Existing brick-level segmentation methods rely on RGB images which are impractical in low-light environments like masonry tunnels. Point clouds are robust to lighting conditions but are typically unstructured and noisy, limiting fine-grained segmentation capabilities.

Method: Projects masonry point clouds into depth maps using virtual cameras and restores them by adapting the Denoising Diffusion Null-space Model (DDNM) without task-specific training. The enhanced depth maps are then used with Segment Anything Model (SAM) for brick-level segmentation.

Result: Experiments on masonry bridge and tunnel point cloud datasets show significant improvements in brick-level segmentation performance, demonstrating enhanced visual clarity and geometric consistency.

Conclusion: InfraDiffusion provides an effective zero-shot solution for automated inspection of masonry assets in challenging low-light environments, enabling fine-grained defect detection without requiring task-specific training.

Abstract: Point clouds are widely used for infrastructure monitoring by providing geometric information, where segmentation is required for downstream tasks such as defect detection. Existing research has automated semantic segmentation of structural components, while brick-level segmentation (identifying defects such as spalling and mortar loss) has primarily been conducted from RGB images. However, acquiring high-resolution images is impractical in low-light environments like masonry tunnels. Point clouds, though robust to dim lighting, are typically unstructured, sparse, and noisy, limiting fine-grained segmentation. We present InfraDiffusion, a zero-shot framework that projects masonry point clouds into depth maps using virtual cameras and restores them by adapting the Denoising Diffusion Null-space Model (DDNM). Without task-specific training, InfraDiffusion enhances the visual clarity and geometric consistency of depth maps. Experiments on masonry bridge and tunnel point cloud datasets show significant improvements in brick-level segmentation using the Segment Anything Model (SAM), underscoring its potential for automated inspection of masonry assets. Our code and data are available at https://github.com/Jingyixiong/InfraDiffusion-official-implement.
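
The virtual-camera projection at the heart of the pipeline is classical z-buffered rendering of a point cloud into a depth map; a minimal numpy sketch, with the intrinsics and pose assumed given:

```python
import numpy as np

def project_depth(points, K, R, t, hw=(256, 256)):
    """points: (N, 3) world coordinates; K: 3x3 intrinsics; (R, t): camera pose.
    Keeps the nearest depth per pixel (z-buffer)."""
    cam = points @ R.T + t                     # world -> camera coordinates
    z = cam[:, 2]
    valid = z > 1e-6                           # only points in front of the camera
    uv = cam[valid] @ K.T
    uv = (uv[:, :2] / uv[:, 2:3]).astype(int)  # perspective divide -> pixel coords
    h, w = hw
    depth = np.full(hw, np.inf)
    inb = (uv[:, 0] >= 0) & (uv[:, 0] < w) & (uv[:, 1] >= 0) & (uv[:, 1] < h)
    for (u, v), d in zip(uv[inb], z[valid][inb]):
        depth[v, u] = min(depth[v, u], d)      # z-buffer: nearest point wins
    return np.where(np.isinf(depth), 0.0, depth)

K = np.array([[200.0, 0, 128], [0, 200.0, 128], [0, 0, 1]])
pts = np.random.randn(5000, 3) + np.array([0, 0, 5.0])
print(project_depth(pts, K, np.eye(3), np.zeros(3)).max())
```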

[106] Transformer-Guided Content-Adaptive Graph Learning for Hyperspectral Unmixing

Hui Chen, Liangyu Liu, Xianchao Xiu, Wanquan Liu

Main category: cs.CV

TL;DR: T-CAGU is a novel hyperspectral unmixing framework that combines transformer for global dependencies and content-adaptive graph neural network for local relationships, achieving state-of-the-art performance.

DetailsMotivation: Most existing deep learning methods for hyperspectral unmixing fail to simultaneously capture global dependencies and local consistency, making it difficult to preserve both long-range interactions and boundary details in remote sensing images.

Method: Proposes transformer-guided content-adaptive graph unmixing framework (T-CAGU) that uses transformer to capture global dependencies and content-adaptive graph neural network to enhance local relationships. Integrates multiple propagation orders to dynamically learn graph structure and employs graph residual mechanism to preserve global information and stabilize training.

Result: Experimental results demonstrate superiority over state-of-the-art methods, showing robustness against noise and improved performance in hyperspectral unmixing tasks.

Conclusion: T-CAGU effectively addresses the limitations of previous methods by simultaneously capturing global and local information, providing a robust solution for hyperspectral unmixing with better preservation of both long-range interactions and boundary details.

Abstract: Hyperspectral unmixing (HU) aims to decompose each mixed pixel in remote sensing images into a set of endmembers and their corresponding abundances. Despite significant progress in this field using deep learning, most methods fail to simultaneously characterize global dependencies and local consistency, making it difficult to preserve both long-range interactions and boundary details. This letter proposes a novel transformer-guided content-adaptive graph unmixing framework (T-CAGU), which overcomes these challenges by employing a transformer to capture global dependencies and introducing a content-adaptive graph neural network to enhance local relationships. Unlike previous work, T-CAGU integrates multiple propagation orders to dynamically learn the graph structure, ensuring robustness against noise. Furthermore, T-CAGU leverages a graph residual mechanism to preserve global information and stabilize training. Experimental results demonstrate its superiority over state-of-the-art methods. Our code is available at https://github.com/xianchaoxiu/T-CAGU.

[107] TinyDrop: Tiny Model Guided Token Dropping for Vision Transformers

Guoxin Wang, Qingyuan Wang, Binhua Huang, Shaowu Chen, Deepu John

Main category: cs.CV

TL;DR: TinyDrop is a training-free token dropping framework that uses a lightweight vision model to guide large Vision Transformers in selectively discarding low-importance tokens during inference, reducing computational costs by up to 80% with minimal accuracy loss.

DetailsMotivation: Vision Transformers achieve strong image classification performance but incur high computational costs from processing all image tokens, creating a need for efficient inference methods that maintain accuracy while reducing computational overhead.

Method: A plug-and-play framework that uses a lightweight guidance model to estimate token importance during inference, selectively dropping low-importance tokens before attention calculations in large ViTs without requiring architectural modifications or retraining.

Result: The framework reduces FLOPs by up to 80% for Vision Transformers with minimal accuracy degradation, demonstrating strong generalization capability across diverse ViT architectures on standard image classification benchmarks.

Conclusion: TinyDrop provides an effective and practical solution for efficient ViT-based classification, offering significant computational savings without compromising accuracy through training-free token dropping guided by lightweight vision models.

Abstract: Vision Transformers (ViTs) achieve strong performance in image classification but incur high computational costs from processing all image tokens. To reduce inference costs in large ViTs without compromising accuracy, we propose TinyDrop, a training-free token dropping framework guided by a lightweight vision model. The guidance model estimates the importance of tokens during inference, so that low-importance tokens are selectively discarded before the large ViT performs its attention calculations. The framework operates plug-and-play, requires no architectural modifications, and is compatible with diverse ViT architectures. Evaluations on standard image classification benchmarks demonstrate that our framework reduces FLOPs by up to 80% for ViTs with minimal accuracy degradation, highlighting its generalization capability and practical utility for efficient ViT-based classification.
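
The dropping step itself is a top-k selection over importance scores supplied by the guidance model; a minimal sketch, where the keep ratio and CLS handling are assumptions:

```python
import torch

def drop_tokens(tokens, importance, keep_ratio=0.2):
    """tokens: (B, N, D) ViT tokens; importance: (B, N) scores from a tiny model.
    Keeps the top-k tokens per image; the class token is never dropped."""
    b, n, _ = tokens.shape
    k = max(1, int(n * keep_ratio))
    importance = importance.clone()
    importance[:, 0] = float("inf")  # always keep the CLS token
    idx = importance.topk(k, dim=1).indices.sort(dim=1).values  # preserve token order
    return tokens.gather(1, idx.unsqueeze(-1).expand(-1, -1, tokens.size(-1)))

tok = torch.rand(2, 197, 768)
kept = drop_tokens(tok, torch.rand(2, 197))
print(kept.shape)  # (2, 39, 768): ~80% of tokens removed before attention
```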

[108] Human Preference-Aligned Concept Customization Benchmark via Decomposed Evaluation

Reina Ishikawa, Ryo Fujii, Hideo Saito, Ryo Hachiuma

Main category: cs.CV

TL;DR: D-GPTScore is a human-aligned evaluation method that uses MLLM to decompose concept customization evaluation into finer aspects, outperforming existing metrics and better matching human preferences.

DetailsMotivation: Existing evaluation metrics for concept customization are either too narrow or too generalized, causing misalignment with human judgment, especially for multi-concept evaluation which requires assessing both individual concepts and their interactions.

Method: Proposed Decomposed GPT Score (D-GPTScore) that breaks down evaluation criteria into finer aspects and uses Multimodal Large Language Model for aspect-wise assessments. Also created CC-AlignBench benchmark with single- and multi-concept tasks across varying difficulty levels.

Result: D-GPTScore significantly outperforms existing evaluation approaches on the CC-AlignBench benchmark, demonstrating higher correlation with human preferences in concept customization evaluation.

Conclusion: This work establishes a new standard for evaluating concept customization and identifies key challenges for future research, providing a comprehensive benchmark and evaluation method that better aligns with human judgment.

Abstract: Evaluating concept customization is challenging, as it requires a comprehensive assessment of fidelity to generative prompts and concept images. Moreover, evaluating multiple concepts is considerably more difficult than evaluating a single concept, as it demands detailed assessment not only for each individual concept but also for the interactions among concepts. While humans can intuitively assess generated images, existing metrics often provide either overly narrow or overly generalized evaluations, resulting in misalignment with human preference. To address this, we propose Decomposed GPT Score (D-GPTScore), a novel human-aligned evaluation method that decomposes evaluation criteria into finer aspects and incorporates aspect-wise assessments using Multimodal Large Language Model (MLLM). Additionally, we release Human Preference-Aligned Concept Customization Benchmark (CC-AlignBench), a benchmark dataset containing both single- and multi-concept tasks, enabling stage-wise evaluation across a wide difficulty range – from individual actions to multi-person interactions. Our method significantly outperforms existing approaches on this benchmark, exhibiting higher correlation with human preferences. This work establishes a new standard for evaluating concept customization and highlights key challenges for future research. The benchmark and associated materials are available at https://github.com/ReinaIshikawa/D-GPTScore.

[109] Scalable and Loosely-Coupled Multimodal Deep Learning for Breast Cancer Subtyping

Mohammed Amer, Mohamed A. Suliman, Tu Bui, Nuria Garcia, Serban Georgescu

Main category: cs.CV

TL;DR: Proposes scalable multimodal framework for breast cancer subtyping using CNV, clinical records, and histopathology images with dual WSI representation and novel fusion strategy.

DetailsMotivation: Healthcare applications benefit from multimodal integration but face variability in available modalities across clinical settings. Breast cancer molecular subtyping can facilitate personalized treatment and improve prognosis.

Method: Scalable loosely-coupled multimodal framework integrating CNV, clinical records, and histopathology images. Introduces dual-based representation for WSIs (combining image-based and graph-based) and new multimodal fusion strategy.

Result: Significant performance improvements with dual WSI representation. Comprehensive results show framework outperforms state-of-the-art methods in breast cancer subtyping when integrating dual WSI representation with CNV and clinical records.

Conclusion: The proposed framework is flexible, scalable, and applicable to other cancer types. The dual WSI representation and fusion strategy provide substantial performance gains over existing methods.

Abstract: Healthcare applications are inherently multimodal, benefiting greatly from the integration of diverse data sources. However, the modalities available in clinical settings can vary across different locations and patients. A key area that stands to gain from multimodal integration is breast cancer molecular subtyping, an important clinical task that can facilitate personalized treatment and improve patient prognosis. In this work, we propose a scalable and loosely-coupled multimodal framework that seamlessly integrates data from various modalities, including copy number variation (CNV), clinical records, and histopathology images, to enhance breast cancer subtyping. While our primary focus is on breast cancer, our framework is designed to easily accommodate additional modalities, offering the flexibility to scale up or down with minimal overhead without requiring re-training of existing modalities, making it applicable to other types of cancers as well. We introduce a dual-based representation for whole slide images (WSIs), combining traditional image-based and graph-based WSI representations. This novel dual approach results in significant performance improvements. Moreover, we present a new multimodal fusion strategy, demonstrating its ability to enhance performance across a range of multimodal conditions. Our comprehensive results show that integrating our dual-based WSI representation with CNV and clinical health records, along with our pipeline and fusion strategy, outperforms state-of-the-art methods in breast cancer subtyping.
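
Loose coupling can be sketched as per-modality encoders whose outputs are pooled only over the modalities actually present, so adding or dropping a modality needs no retraining of the others; the mean pooling below stands in for the paper's fusion strategy, which is not reproduced here:

```python
import torch

def loose_fusion(embeddings: dict) -> torch.Tensor:
    """embeddings: modality name -> embedding tensor, or None if that modality is
    unavailable for this patient. Only present modalities contribute."""
    present = [e for e in embeddings.values() if e is not None]
    return torch.stack(present).mean(dim=0)  # placeholder fusion rule

fused = loose_fusion({"wsi": torch.rand(128), "cnv": torch.rand(128), "clinical": None})
print(fused.shape)  # (128,): same head works regardless of which modalities arrived
```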

[110] Time-Scaling State-Space Models for Dense Video Captioning

AJ Piergiovanni, Ganesh Satish Mallya, Dahun Kim, Anelia Angelova

Main category: cs.CV

TL;DR: A new State-Space Model with Transfer State approach for dense video captioning that enables online processing of long videos with 7x fewer FLOPs.

DetailsMotivation: Existing dense video captioning methods struggle with long videos due to computational complexity and memory limitations, and require full video input which prevents online processing.

Method: Time-scaling State-Space Models (SSMs) with Transfer State to handle longer sequences, combining long-sequence and recurrent properties while addressing SSM limitations in sustaining state for very long contexts.

Result: The approach scales well with video lengths, uses 7x fewer FLOPs, and enables on-the-fly caption generation without waiting for full video processing.

Conclusion: The proposed State-Space Models with Transfer State provide an efficient solution for online dense video captioning that overcomes computational and memory constraints of traditional methods.

Abstract: Dense video captioning is a challenging video understanding task which aims to simultaneously segment the video into a sequence of meaningful consecutive events and to generate detailed captions to accurately describe each event. Existing methods often encounter difficulties when working with the long videos associated with dense video captioning, due to the computational complexity and memory limitations. Furthermore, traditional approaches require the entire video as input, in order to produce an answer, which precludes online processing of the video. We address these challenges by time-scaling State-Space Models (SSMs) to even longer sequences than before. Our approach, State-Space Models with Transfer State, combines both the long-sequence and recurrent properties of SSMs and addresses the main limitation of SSMs which are otherwise not able to sustain their state for very long contexts, effectively scaling SSMs further in time. The proposed model is particularly suitable for generating captions on-the-fly, in an online or streaming manner, without having to wait for the full video to be processed, which is more beneficial in practice. When applied to dense video captioning, our approach scales well with video lengths and uses 7x fewer FLOPs.
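
The transfer-state idea is easiest to see in a plain linear SSM recurrence: run one chunk, hand the final state to the next chunk, and the stream never needs the full video at once. A numpy sketch with a diagonal state matrix (an assumption for brevity):

```python
import numpy as np

def ssm_chunk_scan(A, B, C, u, state):
    """x[t+1] = A x[t] + B u[t], y[t] = C x[t+1], run over one chunk u: (T, d_in).
    Returns the outputs and the final state, which the next chunk resumes from."""
    ys = []
    for u_t in u:
        state = A * state + B @ u_t  # diagonal A -> elementwise decay
        ys.append(C @ state)
    return np.stack(ys), state

d, d_in, d_out = 8, 4, 2
A = np.full(d, 0.9)
B, C = np.random.randn(d, d_in), np.random.randn(d_out, d)
state = np.zeros(d)
for chunk in np.split(np.random.randn(64, d_in), 4):  # stream features chunk-by-chunk
    y, state = ssm_chunk_scan(A, B, C, chunk, state)  # state transfers across chunks
```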

[111] Decoding Visual Neural Representations by Multimodal with Dynamic Balancing

Kaili Sun, Xingyu Miao, Bing Zhai, Haoran Duan, Yang Long

Main category: cs.CV

TL;DR: A multimodal framework integrating EEG, image, and text data to decode visual neural representations from low SNR EEG signals, achieving state-of-the-art performance on ThingsEEG dataset.

DetailsMotivation: To enhance semantic correspondence between EEG signals and visual content by incorporating text modality, and to address challenges with low signal-to-noise ratio EEG data and multimodal feature alignment.

Method: Proposes an adapter module for cross-modal feature alignment, Modal Consistency Dynamic Balance (MCDB) strategy for modality weighting, and stochastic perturbation regularization (SPR) with Gaussian noise for generalization.

Result: Achieved 2.0% improvement in Top-1 accuracy and 4.7% improvement in Top-5 accuracy on ThingsEEG dataset compared to previous state-of-the-art methods.

Conclusion: The integration of text modality with EEG and image data, combined with novel alignment and regularization techniques, significantly improves neural representation decoding from low-quality EEG signals.

Abstract: In this work, we propose an innovative framework that integrates EEG, image, and text data, aiming to decode visual neural representations from low signal-to-noise ratio EEG signals. Specifically, we introduce text modality to enhance the semantic correspondence between EEG signals and visual content. With the explicit semantic labels provided by text, image and EEG features of the same category can be more closely aligned with the corresponding text representations in a shared multimodal space. To fully utilize pre-trained visual and textual representations, we propose an adapter module that alleviates the instability of high-dimensional representation while facilitating the alignment and fusion of cross-modal features. Additionally, to alleviate the imbalance in multimodal feature contributions introduced by the textual representations, we propose a Modal Consistency Dynamic Balance (MCDB) strategy that dynamically adjusts the contribution weights of each modality. We further propose a stochastic perturbation regularization (SPR) term to enhance the generalization ability of semantic perturbation-based models by introducing dynamic Gaussian noise in the modality optimization process. The evaluation results on the ThingsEEG dataset show that our method surpasses previous state-of-the-art methods in both Top-1 and Top-5 accuracy metrics, improving by 2.0% and 4.7% respectively.
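
The MCDB strategy dynamically reweights modality contributions. One plausible reading, shown as a toy sketch below, is to derive weights from the current per-modality losses with a softmax so that under-fit modalities receive more weight; the paper's exact rule may differ, and the SPR term would additionally inject dynamic Gaussian noise into modality features during optimization:

```python
import torch
import torch.nn.functional as F

def modality_weights(losses, temperature=1.0):
    """Toy dynamic balancing: modalities whose current loss is higher
    get a larger weight so the under-fit modality catches up.
    (Illustrative only; not the paper's exact MCDB rule.)"""
    l = torch.stack(losses).detach()         # stop-grad on the weights
    return F.softmax(l / temperature, dim=0)

# stand-in per-modality alignment losses for EEG, image, text
losses = [torch.tensor(0.9), torch.tensor(0.4), torch.tensor(0.6)]
w = modality_weights(losses)
total = sum(wi * li for wi, li in zip(w, losses))

# SPR-style perturbation (sketch): add Gaussian noise to a feature
feat = torch.randn(8, 128)
feat_perturbed = feat + 0.05 * torch.randn_like(feat)
```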

[112] Joint Training of Image Generator and Detector for Road Defect Detection

Kuan-Chuan Peng

Main category: cs.CV

TL;DR: JTGD is a lightweight road defect detection method that jointly trains a generator and detector without ensemble or TTA, achieving state-of-the-art performance with <20% parameters of baseline methods.

DetailsMotivation: Road defect detection needs to be deployable on edge devices with limited memory and computational resources, requiring methods that avoid computationally expensive ensemble techniques and test-time augmentation.

Method: Proposes Jointly Trained Generator and Detector (JTGD) with dual discriminators to enforce plausible synthesized defect patches and overall images. Uses CLIP-based Fréchet Inception Distance loss to improve image quality and jointly trains generator to create harder examples for the detector.

Result: Outperforms state-of-the-art methods on RDD2022 benchmark across various countries without ensemble or TTA. Uses less than 20% of parameters compared to baseline methods.

Conclusion: JTGD provides an efficient and effective solution for road defect detection that is suitable for deployment on resource-constrained edge devices while maintaining high performance.

Abstract: Road defect detection is important for road authorities to reduce the vehicle damage caused by road defects. Considering the practical scenarios where defect detectors are typically deployed on edge devices with limited memory and computational resources, we aim at performing road defect detection without using ensemble-based methods or test-time augmentation (TTA). To this end, we propose to Jointly Train the image Generator and Detector for road defect detection (dubbed JTGD). We design dual discriminators for the generative model to enforce that both the synthesized defect patches and the overall images look plausible. The synthesized image quality is improved by our proposed CLIP-based Fréchet Inception Distance loss. The generative model in JTGD is trained jointly with the detector to encourage the generative model to synthesize harder examples for the detector. Since the harder, higher-quality synthesized images produced by this design are used for data augmentation, JTGD outperforms the state-of-the-art method on the RDD2022 road defect detection benchmark across various countries without ensemble or TTA. JTGD uses less than 20% of the number of parameters of the competing baseline, which makes it more suitable for deployment on edge devices in practice.

[113] Parameter-Efficient Adaptation of mPLUG-Owl2 via Pixel-Level Visual Prompts for NR-IQA

Yahya Benmahane, Mohammed El Hassouni

Main category: cs.CV

TL;DR: Proposes parameter-efficient NR-IQA method using pixel-space visual prompts, training only 600K parameters while keeping MLLM frozen, achieving competitive performance with full fine-tuning.

DetailsMotivation: To enable efficient adaptation of Multimodal Large Language Models for No-Reference Image Quality Assessment without full fine-tuning, reducing computational costs while maintaining performance.

Method: Trains visual prompts optimized in pixel-space (only 600K parameters) that are added to images, then processed by frozen mPLUG-Owl2 model with textual query “Rate the technical quality of the image.”

Result: Achieves competitive performance across distortion types on KADID-10k, KonIQ-10k, and AGIQA-3k, with 0.93 SRCC on KADID-10k, comparable to full fine-tuning and specialized NR-IQA models.

Conclusion: First work to use pixel-space visual prompts for NR-IQA, demonstrating efficient MLLM adaptation for low-level vision tasks with minimal parameter training.

Abstract: In this paper, we propose a novel parameter-efficient adaptation method for No-Reference Image Quality Assessment (NR-IQA) using visual prompts optimized in pixel-space. Unlike full fine-tuning of Multimodal Large Language Models (MLLMs), our approach trains only 600K parameters at most (< 0.01% of the base model), while keeping the underlying model fully frozen. During inference, these visual prompts are combined with images via addition and processed by mPLUG-Owl2 with the textual query “Rate the technical quality of the image.” Evaluations across distortion types (synthetic, realistic, AI-generated) on KADID-10k, KonIQ-10k, and AGIQA-3k demonstrate competitive performance against fully fine-tuned methods and specialized NR-IQA models, achieving 0.93 SRCC on KADID-10k. To our knowledge, this is the first work to leverage pixel-space visual prompts for NR-IQA, enabling efficient MLLM adaptation for low-level vision tasks. The source code is publicly available at https://github.com/yahya-ben/mplug2-vp-for-nriqa.
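
The core mechanism, a learnable pixel-space perturbation added to inputs while the MLLM stays frozen, is easy to sketch. Below is a minimal, hypothetical PyTorch version in which an identity module stands in for the frozen mPLUG-Owl2 backbone and random scores stand in for quality labels:

```python
import torch
import torch.nn as nn

class PixelPrompt(nn.Module):
    """Learnable additive perturbation in pixel space; the backbone
    stays frozen and only this prompt is trained (~H*W*3 params)."""
    def __init__(self, h=224, w=224):
        super().__init__()
        self.delta = nn.Parameter(torch.zeros(1, 3, h, w))

    def forward(self, images):
        return (images + self.delta).clamp(0, 1)

prompt = PixelPrompt()
frozen_model = nn.Identity()              # stand-in for frozen mPLUG-Owl2
for p in frozen_model.parameters():
    p.requires_grad_(False)

opt = torch.optim.Adam(prompt.parameters(), lr=1e-3)
images = torch.rand(8, 3, 224, 224)
target = torch.rand(8)                    # stand-in quality scores
pred = frozen_model(prompt(images)).mean(dim=(1, 2, 3))  # toy "score"
loss = nn.functional.mse_loss(pred, target)
loss.backward()                           # gradients flow only into delta
opt.step()
```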

[114] OneCAT: Decoder-Only Auto-Regressive Model for Unified Understanding and Generation

Han Li, Xinyu Peng, Yaoming Wang, Zelin Peng, Xin Chen, Rongxiang Weng, Jingang Wang, Xunliang Cai, Wenrui Dai, Hongkai Xiong

Main category: cs.CV

TL;DR: OneCAT is a unified multimodal model using pure decoder-only transformer architecture that integrates understanding, generation, and editing without external vision components, achieving state-of-the-art performance with efficiency gains.

DetailsMotivation: To create a more efficient and unified multimodal model that eliminates the need for external vision components like ViT or vision tokenizers during inference, particularly for high-resolution inputs, while maintaining top performance.

Method: Uses a modality-specific Mixture-of-Experts (MoE) structure trained with a single autoregressive objective, featuring a multi-scale visual autoregressive mechanism within LLM that reduces decoding steps compared to diffusion methods.

Result: Outperforms existing open-source unified multimodal models across benchmarks for multimodal generation, editing, and understanding, with significant efficiency gains especially for high-resolution inputs.

Conclusion: Pure autoregressive modeling serves as a sufficient and elegant foundation for unified multimodal intelligence, demonstrating powerful potential for integrated multimodal systems.

Abstract: We introduce OneCAT, a unified multimodal model that seamlessly integrates understanding, generation, and editing within a novel, pure decoder-only transformer architecture. Our framework uniquely eliminates the need for external components such as Vision Transformers (ViT) or vision tokenizer during inference, leading to significant efficiency gains, especially for high-resolution inputs. This is achieved through a modality-specific Mixture-of-Experts (MoE) structure trained with a single autoregressive (AR) objective, which also natively supports dynamic resolutions. Furthermore, we pioneer a multi-scale visual autoregressive mechanism within the Large Language Model (LLM) that drastically reduces decoding steps compared to diffusion-based methods while maintaining state-of-the-art performance. Our findings demonstrate the powerful potential of pure autoregressive modeling as a sufficient and elegant foundation for unified multimodal intelligence. As a result, OneCAT sets a new performance standard, outperforming existing open-source unified multimodal models across benchmarks for multimodal generation, editing, and understanding.

[115] DeepSea MOT: A benchmark dataset for multi-object tracking on deep-sea video

Kevin Barnard, Elaine Liu, Kristine Walz, Brian Schlining, Nancy Jacobsen Stout, Lonny Lundsten

Main category: cs.CV

TL;DR: A novel benchmark video dataset for evaluating multi-object tracking and object detection models in deep-sea environments, providing the first publicly available benchmark for deep-sea video footage with comprehensive evaluation metrics and tools.

DetailsMotivation: Benchmarking is essential for evaluating model performance and facilitating consistent comparisons between different object detection models and trackers, but there was no publicly available benchmark specifically for multi-object tracking in deep-sea video footage.

Method: Developed a novel benchmark video dataset with four video sequences representing midwater and benthic deep-sea habitats, then assessed performance of several object detection models and trackers using Higher Order Tracking Accuracy metric.

Result: Created the first publicly available benchmark for multi-object tracking in deep-sea video footage, providing benchmark data, documented workflow for generating additional benchmark videos, and example Python notebooks for computing metrics.

Conclusion: This study successfully establishes a standardized benchmarking framework for evaluating multi-object tracking performance in deep-sea environments, enabling consistent model comparisons and performance optimization in this specialized domain.

Abstract: Benchmarking multi-object tracking and object detection model performance is an essential step in machine learning model development, as it allows researchers to evaluate model detection and tracker performance on human-generated ‘test’ data, facilitating consistent comparisons between models and trackers and aiding performance optimization. In this study, a novel benchmark video dataset was developed and used to assess the performance of several Monterey Bay Aquarium Research Institute object detection models and a FathomNet single-class object detection model together with several trackers. The dataset consists of four video sequences representing midwater and benthic deep-sea habitats. Performance was evaluated using Higher Order Tracking Accuracy, a metric that balances detection, localization, and association accuracy. To the best of our knowledge, this is the first publicly available benchmark for multi-object tracking in deep-sea video footage. We provide the benchmark data, a clearly documented workflow for generating additional benchmark videos, as well as example Python notebooks for computing metrics.
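
HOTA jointly scores detection, localization, and association over a sweep of IoU thresholds; its full computation is involved, but the per-frame building block is optimal matching of ground-truth and predicted boxes. A sketch of that matching step only (not the official HOTA code), using Hungarian assignment on IoU:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def iou(a, b):
    """IoU of two boxes in (x1, y1, x2, y2) format."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / (area(a) + area(b) - inter + 1e-9)

def match(gt_boxes, pred_boxes, thr=0.5):
    """Optimal GT-prediction matching for one frame (Hungarian on IoU)."""
    cost = np.array([[1 - iou(g, p) for p in pred_boxes] for g in gt_boxes])
    rows, cols = linear_sum_assignment(cost)
    return [(r, c) for r, c in zip(rows, cols) if 1 - cost[r, c] >= thr]

gt = [(10, 10, 50, 50), (60, 60, 90, 90)]
pred = [(12, 11, 49, 52), (100, 100, 120, 120)]
print(match(gt, pred))   # -> [(0, 0)]: one true positive at IoU >= 0.5
```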

[116] Strefer: Empowering Video LLMs with Space-Time Referring and Reasoning via Synthetic Instruction Data

Honglu Zhou, Xiangyu Peng, Shrikant Kendre, Michael S. Ryoo, Silvio Savarese, Caiming Xiong, Juan Carlos Niebles

Main category: cs.CV

TL;DR: Strefer is a synthetic instruction data generation framework that enhances Video LLMs’ spatiotemporal reasoning capabilities without proprietary models or human annotation, outperforming baselines on spatial and temporal disambiguation tasks.

DetailsMotivation: Existing Video LLMs struggle with fine-grained spatiotemporal reasoning needed for real-world AI companions, particularly with temporal references for event anchoring and gestural cues for spatial object referencing.

Method: Strefer uses a data engine to pseudo-annotate temporally dense video metadata including subjects, objects, locations as masklets, and action descriptions/timelines to generate diverse instruction-tuning data.

Result: Models trained with Strefer-generated data outperform baselines on spatial and temporal disambiguation tasks and exhibit enhanced space-time-aware reasoning capabilities.

Conclusion: Strefer establishes a new foundation for perceptually grounded, instruction-tuned Video LLMs by enabling better spatiotemporal referring and reasoning without costly human annotation or proprietary models.

Abstract: Next-generation AI companions must go beyond general video understanding to resolve spatial and temporal references in dynamic, real-world environments. Existing Video Large Language Models (Video LLMs), while capable of coarse-level comprehension, struggle with fine-grained, spatiotemporal reasoning, especially when user queries rely on time-based event references for temporal anchoring, or gestural cues for spatial anchoring to clarify object references and positions. To bridge this critical gap, we introduce Strefer, a synthetic instruction data generation framework designed to equip Video LLMs with spatiotemporal referring and reasoning capabilities. Strefer produces diverse instruction-tuning data using a data engine that pseudo-annotates temporally dense, fine-grained video metadata, capturing rich spatial and temporal information in a structured manner, including subjects, objects, their locations as masklets, and their action descriptions and timelines. Our approach enhances the ability of Video LLMs to interpret spatial and temporal references, fostering more versatile, space-time-aware reasoning essential for real-world AI companions. Without using proprietary models, costly human annotation, or the need to annotate large volumes of new videos, experimental evaluations show that models trained with data produced by Strefer outperform baselines on tasks requiring spatial and temporal disambiguation. Additionally, these models exhibit enhanced space-time-aware reasoning, establishing a new foundation for perceptually grounded, instruction-tuned Video LLMs.
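
Strefer's data engine emits structured, temporally dense metadata. Below is a hypothetical record illustrating the kind of structure described (subjects/objects as masklets plus action timelines); the field names and values are invented for illustration and are not the paper's actual schema:

```python
import json

# Hypothetical pseudo-annotation record in the spirit of Strefer's
# temporally dense video metadata. Elided mask payloads stay as "...".
record = {
    "video_id": "clip_0001",
    "fps": 30,
    "entities": [
        {"id": "person_1", "category": "person",
         "masklet": {"frames": [0, 1, 2], "rle_masks": ["...", "...", "..."]}},
        {"id": "cup_1", "category": "cup",
         "masklet": {"frames": [0, 1, 2], "rle_masks": ["...", "...", "..."]}},
    ],
    "actions": [
        {"subject": "person_1", "verb": "picks up", "object": "cup_1",
         "start_frame": 12, "end_frame": 48},
    ],
}
print(json.dumps(record, indent=2))
```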

[117] A comprehensive Persian offline handwritten database for investigating the effects of heritability and family relationships on handwriting

Abbas Zohrevand, Javad Sadri, Zahra Imani

Main category: cs.CV

TL;DR: A comprehensive database of 210 families’ handwriting samples created to study genetic inheritance effects on handwriting patterns and styles.

DetailsMotivation: To investigate whether handwriting has genetic components and is inherited, and to understand how family relationships affect handwriting characteristics.

Method: Collected diverse handwriting samples (digits, letters, shapes, free paragraphs) from 210 families including multiple generations and relatives using specially designed forms, capturing all family relationships.

Result: Similarities in handwriting features and writing styles were detected among family members through comparisons and feature analysis.

Conclusion: The database is freely available to facilitate further research on inheritance effects and family relationships in handwriting patterns.

Abstract: This paper introduces a comprehensive database for research and investigation on the effects of inheritance on handwriting. A database has been created that can be used to answer questions such as: Is there a genetic component to handwriting? Is handwriting inherited? Do family relationships affect handwriting? A variety of handwritten samples, including digits, letters, shapes, and free paragraphs, was collected from 210 families (grandparents, parents, uncles, aunts, siblings, cousins, nephews, and nieces) using specially designed forms, and the family relationships of all writers are captured. To the best of our knowledge, no such database is presently available. Based on comparisons of handwriting features among family members, similarities in their features and writing styles are detected. Our database is freely available to the pattern recognition community, and we hope it will pave the way for investigations into the effects of inheritance and family relationships on handwriting.

[118] Easier Painting Than Thinking: Can Text-to-Image Models Set the Stage, but Not Direct the Play?

Ouxiang Li, Yuan Wang, Xinting Hu, Huijuan Huang, Rui Chen, Jiarong Ou, Xin Tao, Pengfei Wan, Fuli Feng

Main category: cs.CV

TL;DR: T2I-CoReBench is a comprehensive benchmark that evaluates text-to-image models on both composition (instance, attribute, relation) and reasoning (deductive, inductive, abductive) capabilities through 1,080 complex prompts with high scene density and multi-step inference requirements.

DetailsMotivation: Existing T2I benchmarks have limitations in evaluating both composition and reasoning capabilities comprehensively, and fail to handle complex prompts with high scene density and multi-step reasoning that current advanced models can potentially handle.

Method: Developed a 12-dimensional evaluation taxonomy structured around scene graph elements for composition and philosophical inference framework for reasoning. Created 1,080 challenging prompts with high compositional density and multi-step inference, each paired with a checklist of ~13,500 yes/no questions for fine-grained assessment.

Result: Evaluation of 27 T2I models revealed that composition capability remains limited in complex high-density scenarios, and reasoning capability is even more limited as a critical bottleneck, with all models struggling to infer implicit elements from prompts.

Conclusion: T2I models still face significant challenges in handling complex composition and reasoning tasks, particularly in high-density scenarios and multi-step inference, indicating the need for continued research to improve these capabilities.

Abstract: Text-to-image (T2I) generation aims to synthesize images from textual prompts, which jointly specify what must be shown and imply what can be inferred, thereby corresponding to two core capabilities: composition and reasoning. However, with the emerging advances of T2I models in reasoning beyond composition, existing benchmarks reveal clear limitations in providing comprehensive evaluations across and within these capabilities. Meanwhile, these advances also enable models to handle more complex prompts, whereas current benchmarks remain limited to low scene density and simplified one-to-one reasoning. To address these limitations, we propose T2I-CoReBench, a comprehensive and complex benchmark that evaluates both composition and reasoning capabilities of T2I models. To ensure comprehensiveness, we structure composition around scene graph elements (instance, attribute, and relation) and reasoning around the philosophical framework of inference (deductive, inductive, and abductive), formulating a 12-dimensional evaluation taxonomy. To increase complexity, driven by the inherent complexities of real-world scenarios, we curate each prompt with high compositional density for composition and multi-step inference for reasoning. We also pair each prompt with a checklist that specifies individual yes/no questions to assess each intended element independently to facilitate fine-grained and reliable evaluation. In statistics, our benchmark comprises 1,080 challenging prompts and around 13,500 checklist questions. Experiments across 27 current T2I models reveal that their composition capability still remains limited in complex high-density scenarios, while the reasoning capability lags even further behind as a critical bottleneck, with all models struggling to infer implicit elements from prompts. Our project page: https://t2i-corebench.github.io/.
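
Scoring against a checklist reduces evaluation to aggregating per-question yes/no judgments. A minimal sketch, assuming a VQA-style checker has already produced binary answers for one generated image (the benchmark's exact aggregation may differ):

```python
def checklist_score(answers):
    """Fraction of yes/no checklist questions judged 'yes' for one
    generated image; aggregating per-question judgments is the
    fine-grained alternative to a single holistic rating."""
    return sum(answers) / len(answers)

# e.g. 13 binary judgments from an automated checker for one prompt
answers = [1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1]
print(f"prompt score: {checklist_score(answers):.2f}")  # -> 0.77
```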

[119] Repurposing SAM for User-Defined Semantics Aware Segmentation

Rohit Kundu, Sudipta Paul, Arindam Dutta, Amit K. Roy-Chowdhury

Main category: cs.CV

TL;DR: U-SAM enhances SAM with semantic awareness by learning mapping between mask embeddings and class labels, enabling targeted object segmentation without test data samples.

DetailsMotivation: SAM generates precise masks but lacks semantic awareness, failing to associate masks with specific object categories, limiting its practical utility.

Method: Leverages synthetic/web-crawled images to accumulate semantic information, learns mapping function between SAM’s mask embeddings and object class labels.

Result: Achieves significant mIoU improvements of +17.95% on PASCAL VOC 2012 and +5.20% on MSCOCO-80 over state-of-the-art methods.

Conclusion: U-SAM transforms SAM into a semantically aware segmentation model, providing practical pixel-level annotation solution for diverse domains without requiring test data samples.

Abstract: The Segment Anything Model (SAM) excels at generating precise object masks from input prompts but lacks semantic awareness, failing to associate its generated masks with specific object categories. To address this limitation, we propose U-SAM, a novel framework that imbues SAM with semantic awareness, enabling it to generate targeted masks for user-specified object categories. Given only object class names as input from the user, U-SAM provides pixel-level semantic annotations for images without requiring any labeled/unlabeled samples from the test data distribution. Our approach leverages synthetically generated or web-crawled images to accumulate semantic information about the desired object classes. We then learn a mapping function between SAM’s mask embeddings and object class labels, effectively enhancing SAM with granularity-specific semantic recognition capabilities. As a result, users can obtain meaningful and targeted segmentation masks for specific objects they request, rather than generic and unlabeled masks. We evaluate U-SAM on PASCAL VOC 2012 and MSCOCO-80, achieving significant mIoU improvements of +17.95% and +5.20%, respectively, over state-of-the-art methods. By transforming SAM into a semantically aware segmentation model, U-SAM offers a practical and flexible solution for pixel-level annotation across diverse and unseen domains in a resource-constrained environment.
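
The learned mapping from SAM's mask embeddings to user-defined class labels can be pictured as a small classification head trained on embeddings from synthetic or web-crawled images. A hedged sketch with assumed dimensions (SAM's 256-d mask embeddings, 21 PASCAL VOC classes) and an invented head design, not the paper's exact architecture:

```python
import torch
import torch.nn as nn

class MaskClassifier(nn.Module):
    """Maps SAM mask embeddings to user-specified class logits."""
    def __init__(self, embed_dim=256, num_classes=21):
        super().__init__()
        self.head = nn.Sequential(
            nn.Linear(embed_dim, 512), nn.ReLU(),
            nn.Linear(512, num_classes),
        )

    def forward(self, mask_embeddings):    # (num_masks, embed_dim)
        return self.head(mask_embeddings)  # logits per mask

clf = MaskClassifier()
mask_emb = torch.randn(10, 256)           # stand-in SAM mask embeddings
labels = torch.randint(0, 21, (10,))      # from synthetic/web-crawled data
loss = nn.functional.cross_entropy(clf(mask_emb), labels)
loss.backward()
```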

[120] Aligning Machine and Human Visual Representations across Abstraction Levels

Lukas Muttenthaler, Klaus Greff, Frieda Born, Bernhard Spitzer, Simon Kornblith, Michael C. Mozer, Klaus-Robert Müller, Thomas Unterthiner, Andrew K. Lampinen

Main category: cs.CV

TL;DR: Neural networks lack human-like hierarchical conceptual organization. The paper proposes transferring human-aligned structure from teacher models to refine vision foundation models, resulting in better human behavior approximation and improved ML task performance.

DetailsMotivation: Address the misalignment between neural network representations and human hierarchical conceptual knowledge organization, where models fail to capture all levels of abstraction that humans naturally use.

Method: Train a teacher model to imitate human judgments, then transfer human-aligned structure from its representations to finetune pretrained state-of-the-art vision foundation models.

Result: Human-aligned models more accurately approximate human behavior and uncertainty across similarity tasks, perform better on diverse ML tasks, and show increased generalization and out-of-distribution robustness.

Conclusion: Infusing neural networks with human knowledge creates representations that are both more consistent with human cognition and more practically useful, paving the way for more robust and human-aligned AI systems.

Abstract: Deep neural networks have achieved success across a wide range of applications, including as models of human behavior and neural representations in vision tasks. However, neural network training and human learning differ in fundamental ways, and neural networks often fail to generalize as robustly as humans do, raising questions regarding the similarity of their underlying representations. What is missing for modern learning systems to exhibit more human-aligned behavior? We highlight a key misalignment between vision models and humans: whereas human conceptual knowledge is hierarchically organized from fine- to coarse-scale distinctions, model representations do not accurately capture all these levels of abstraction. To address this misalignment, we first train a teacher model to imitate human judgments, then transfer human-aligned structure from its representations to refine the representations of pretrained state-of-the-art vision foundation models via finetuning. These human-aligned models more accurately approximate human behavior and uncertainty across a wide range of similarity tasks, including a new dataset of human judgments spanning multiple levels of semantic abstraction. They also perform better on a diverse set of machine learning tasks, increasing generalization and out-of-distribution robustness. Thus, infusing neural networks with additional human knowledge yields a best-of-both-worlds representation that is both more consistent with human cognitive judgments and more practically useful, paving the way toward more robust, interpretable, and human-aligned artificial intelligence systems.
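
One common way to transfer a teacher's representational structure, consistent with (though not necessarily identical to) the paper's approach, is to align pairwise similarity matrices between student and teacher features. A minimal sketch:

```python
import torch
import torch.nn.functional as F

def similarity_alignment_loss(student_feats, teacher_feats):
    """Align the student's pairwise similarity structure with a
    human-aligned teacher's. A generic representational-alignment
    sketch, not the paper's exact objective."""
    s = F.normalize(student_feats, dim=-1)
    t = F.normalize(teacher_feats, dim=-1)
    sim_s = s @ s.T                 # (B, B) cosine similarities
    sim_t = (t @ t.T).detach()      # teacher is frozen
    return F.mse_loss(sim_s, sim_t)

student = torch.randn(32, 768, requires_grad=True)  # foundation-model feats
teacher = torch.randn(32, 128)                      # human-aligned teacher
loss = similarity_alignment_loss(student, teacher)
loss.backward()
```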

[121] Bridging the Domain Gap for Flight-Ready Spaceborne Vision

Tae Ha Park, Simone D’Amico

Main category: cs.CV

TL;DR: SPNv3 is a computationally efficient neural network for monocular spacecraft pose estimation that bridges the synthetic-to-real domain gap and achieves state-of-the-art accuracy while running efficiently on space-grade hardware.

DetailsMotivation: To develop a neural network for spacecraft pose estimation that is both computationally efficient for deployment on space-grade edge devices and robust to unseen spaceborne images, enabling reliable close-range rendezvous operations.

Method: Careful NN design choices including data augmentation, transfer learning, and vision transformer architecture to maximize robustness while minimizing computational overhead. Trained exclusively on computer-generated synthetic images.

Result: Achieves state-of-the-art pose accuracy on hardware-in-the-loop images from robotic testbeds, effectively bridging the domain gap between synthetic and real imagery. Runs well above the update frequency of modern satellite navigation filters on flight-ready GPU systems.

Conclusion: SPNv3 is an efficient, flight-ready neural network model suitable for close-range rendezvous and proximity operations with target spacecraft, demonstrating successful deployment readiness for space applications.

Abstract: This work presents Spacecraft Pose Network v3 (SPNv3), a Neural Network (NN) for monocular pose estimation of a known, non-cooperative target spacecraft. SPNv3 is designed and trained to be computationally efficient while providing robustness to spaceborne images that have not been observed during offline training and validation on the ground. These characteristics are essential to deploying NNs on space-grade edge devices. They are achieved through careful NN design choices, and an extensive trade-off analysis reveals features such as data augmentation, transfer learning and vision transformer architecture as a few of those that contribute to simultaneously maximizing robustness and minimizing computational overhead. Experiments demonstrate that the final SPNv3 can achieve state-of-the-art pose accuracy on hardware-in-the-loop images from a robotic testbed while having trained exclusively on computer-generated synthetic images, effectively bridging the domain gap between synthetic and real imagery. At the same time, SPNv3 runs well above the update frequency of modern satellite navigation filters when tested on a representative graphical processing unit system with flight heritage. Overall, SPNv3 is an efficient, flight-ready NN model readily applicable to close-range rendezvous and proximity operations with target resident space objects.

[122] Domain Consistency Representation Learning for Lifelong Person Re-Identification

Shiben Liu, Huijie Fan, Qiang Wang, Weihong Ren, Yandong Tang, Yang Cong

Main category: cs.CV

TL;DR: Proposes DCR model for lifelong person re-identification that balances intra-domain discrimination and inter-domain gaps using global and attribute-wise representations with anti-forgetting strategies.

DetailsMotivation: Address the contradictory relationship between intra-domain discrimination (individual nuances) and inter-domain gaps (domain consistency) in lifelong person re-identification, where existing methods focus on reducing inter-domain gaps but ignore intra-domain discrimination.

Method: Domain Consistency Representation (DCR) model using global and attribute-wise representations; attribute-oriented anti-forgetting strategy to enhance inter-domain consistency; knowledge consolidation strategy for knowledge transfer.

Result: Extensive experiments show superior performance compared to state-of-the-art LReID methods.

Conclusion: DCR effectively balances intra-domain discrimination and inter-domain gaps through complementary representations and anti-forgetting strategies, achieving improved lifelong person re-identification performance.

Abstract: Lifelong person re-identification (LReID) exhibits a contradictory relationship between intra-domain discrimination and inter-domain gaps when learning from continuous data. Intra-domain discrimination focuses on individual nuances (e.g., clothing type, accessories), while inter-domain gaps emphasize domain consistency. Achieving a trade-off between maximizing intra-domain discrimination and minimizing inter-domain gaps is a crucial challenge for improving LReID performance. Most existing methods strive to reduce inter-domain gaps through knowledge distillation to maintain domain consistency. However, they often ignore intra-domain discrimination. To address this challenge, we propose a novel domain consistency representation learning (DCR) model that explores global and attribute-wise representations as a bridge to balance intra-domain discrimination and inter-domain gaps. At the intra-domain level, we explore the complementary relationship between global and attribute-wise representations to improve discrimination among similar identities. Excessive learning of intra-domain discrimination can lead to catastrophic forgetting. We further develop an attribute-oriented anti-forgetting (AF) strategy that explores attribute-wise representations to enhance inter-domain consistency, and propose a knowledge consolidation (KC) strategy to facilitate knowledge transfer. Extensive experiments show that our DCR achieves superior performance compared to state-of-the-art LReID methods. Our code is available at https://github.com/LiuShiBen/DCR.

[123] Refinement of Monocular Depth Maps via Multi-View Differentiable Rendering

Laura Fink, Linus Franke, Bernhard Egger, Joachim Keinert, Marc Stamminger

Main category: cs.CV

TL;DR: A method that combines monocular depth estimation with multi-view optimization to produce accurate, 3D-consistent depth maps through analysis-by-synthesis refinement.

DetailsMotivation: Current monocular depth estimators generalize well but lack 3D consistency needed for many applications in computer graphics, vision, and robotics.

Method: Two-stage optimization: initial global scale estimation through structure-from-motion, then refinement via photometric and geometric losses with differentiable rendering of meshed depth maps.

Result: Generates detailed, high-quality, view-consistent depth maps that outperform state-of-the-art multi-view depth reconstruction approaches, especially in challenging indoor scenarios.

Conclusion: The approach successfully bridges the gap between generalizing monocular depth estimation and multi-view consistency, producing error-free depth maps suitable for various applications.

Abstract: Accurate depth estimation is at the core of many applications in computer graphics, vision, and robotics. Current state-of-the-art monocular depth estimators, trained on extensive datasets, generalize well but lack the 3D consistency needed for many applications. In this paper, we combine the strength of those generalizing monocular depth estimation techniques with multi-view data by framing this as an analysis-by-synthesis optimization problem to lift and refine such relative depth maps to accurate, error-free depth maps. After an initial global scale estimation through structure-from-motion point clouds, we further refine the depth map through optimization enforcing multi-view consistency via photometric and geometric losses with differentiable rendering of the meshed depth map. In a two-stage optimization, scaling is further refined first, and afterwards artifacts and errors in the depth map are corrected via nearby-view photometric supervision. Our evaluation shows that our method is able to generate detailed, high-quality, view-consistent, accurate depth maps, even in challenging indoor scenarios, and outperforms state-of-the-art multi-view depth reconstruction approaches on such datasets. Project page and source code can be found at https://lorafib.github.io/ref_depth/.
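
The initial global scale estimation has a closed form: the scale s minimizing ||s · d_mono − d_sfm||² over pixels with sparse SfM depth is s = ⟨d_mono, d_sfm⟩ / ⟨d_mono, d_mono⟩. A small sketch of just this first stage (the subsequent photometric/geometric refinement through differentiable rendering is the paper's main contribution and is not reproduced here):

```python
import numpy as np

def least_squares_scale(d_mono, d_sfm):
    """Closed-form scale s minimizing ||s * d_mono - d_sfm||^2 over
    pixels that have a sparse SfM depth; the standard first step
    before per-pixel refinement."""
    return float(np.dot(d_mono, d_sfm) / np.dot(d_mono, d_mono))

rng = np.random.default_rng(1)
d_mono = rng.uniform(0.5, 2.0, 200)               # relative monocular depths
d_sfm = 3.1 * d_mono + rng.normal(0, 0.02, 200)   # noisy metric SfM depths
print(least_squares_scale(d_mono, d_sfm))         # ~3.1
```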

[124] Learning a Neural Association Network for Self-supervised Multi-Object Tracking

Shuai Li, Michael Burke, Subramanian Ramamoorthy, Juergen Gall

Main category: cs.CV

TL;DR: Self-supervised multi-object tracking framework using neural Kalman filter and EM algorithm that learns data association without identity annotations.

DetailsMotivation: Fully-supervised tracking methods require tedious identity-level annotations, while real-world object motion can be modeled as a Markov process, enabling self-supervised learning.

Method: Uses expectation maximization with neural Kalman filter conditioned on detection associations. Neural network predicts data associations with Sinkhorn normalization, then Kalman smoothing computes marginal probabilities for training via gradient descent.

Result: Achieves state-of-the-art results on MOT17, MOT20, and BDD100K datasets compared to other self-supervised trackers using public detections.

Conclusion: Proposes a fully differentiable framework for self-supervised multi-object tracking that eliminates need for identity annotations while maintaining competitive performance.

Abstract: This paper introduces a novel framework to learn data association for multi-object tracking in a self-supervised manner. Fully-supervised learning methods are known to achieve excellent tracking performances, but acquiring identity-level annotations is tedious and time-consuming. Motivated by the fact that in real-world scenarios object motion can be usually represented by a Markov process, we present a novel expectation maximization (EM) algorithm that trains a neural network to associate detections for tracking, without requiring prior knowledge of their temporal correspondences. At the core of our method lies a neural Kalman filter, with an observation model conditioned on associations of detections parameterized by a neural network. Given a batch of frames as input, data associations between detections from adjacent frames are predicted by a neural network followed by a Sinkhorn normalization that determines the assignment probabilities of detections to states. Kalman smoothing is then used to obtain the marginal probability of observations given the inferred states, producing a training objective to maximize this marginal probability using gradient descent. The proposed framework is fully differentiable, allowing the underlying neural model to be trained end-to-end. We evaluate our approach on the challenging MOT17, MOT20, and BDD100K datasets and achieve state-of-the-art results in comparison to self-supervised trackers using public detections.
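
Sinkhorn normalization is the standard trick for turning pairwise affinities into soft, approximately doubly-stochastic assignment probabilities. A minimal sketch of the normalization itself; real trackers typically add a dustbin row/column for unmatched detections, which is omitted here:

```python
import torch

def sinkhorn(scores, n_iters=20, eps=1e-8):
    """Alternating row/column normalization turning a score matrix
    into (approximately) doubly-stochastic assignment probabilities,
    used here to softly associate detections across frames."""
    P = torch.exp(scores)
    for _ in range(n_iters):
        P = P / (P.sum(dim=1, keepdim=True) + eps)  # rows sum to ~1
        P = P / (P.sum(dim=0, keepdim=True) + eps)  # cols sum to ~1
    return P

scores = torch.randn(5, 5)  # affinities between detections in frames t, t+1
P = sinkhorn(scores)
print(P.sum(dim=0), P.sum(dim=1))  # both close to 1 after convergence
```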

[125] GalaxAlign: Mimicking Citizen Scientists’ Multimodal Guidance for Galaxy Morphology Analysis

Ruoqi Wang, Haitao Wang, Qiong Luo

Main category: cs.CV

TL;DR: GalaxAlign is a multimodal approach that aligns schematic symbols, textual labels, and galaxy images to improve fine-tuning of vision foundation models for galaxy classification and similarity search without expensive pretraining.

DetailsMotivation: Existing methods for galaxy morphology analysis either require costly domain-specific foundation model training or yield lower accuracy with resource-efficient fine-tuning. The approach is inspired by how citizen scientists use textual descriptions and schematic symbols to identify galaxies.

Method: Tri-modal alignment framework that aligns three data types during fine-tuning: schematic symbols representing galaxy shapes, textual labels for these symbols, and actual galaxy images. Uses multimodal instructions to enhance fine-tuning effectiveness.

Result: Experiments demonstrate effective fine-tuning of general pre-trained models for astronomical tasks by incorporating domain-specific multimodal knowledge, showing improvements in galaxy classification and similarity search.

Conclusion: GalaxAlign provides a resource-efficient solution that eliminates the need for expensive pretraining while achieving better performance than traditional fine-tuning approaches for galaxy morphology analysis tasks.

Abstract: Galaxy morphology analysis involves studying galaxies based on their shapes and structures. For such studies, fundamental tasks include identifying and classifying galaxies in astronomical images, as well as retrieving visually or structurally similar galaxies through similarity search. Existing methods either directly train domain-specific foundation models on large, annotated datasets or fine-tune vision foundation models on a smaller set of images. The former is effective but costly, while the latter is more resource-efficient but often yields lower accuracy. To address these challenges, we introduce GalaxAlign, a multimodal approach inspired by how citizen scientists identify galaxies in astronomical images by following textual descriptions and matching schematic symbols. Specifically, GalaxAlign employs a tri-modal alignment framework to align three types of data during fine-tuning: (1) schematic symbols representing galaxy shapes and structures, (2) textual labels for these symbols, and (3) galaxy images. By incorporating multimodal instructions, GalaxAlign eliminates the need for expensive pretraining and enhances the effectiveness of fine-tuning. Experiments on galaxy classification and similarity search demonstrate that our method effectively fine-tunes general pre-trained models for astronomical tasks by incorporating domain-specific multi-modal knowledge. Code is available at https://github.com/RapidsAtHKUST/GalaxAlign.

[126] LumiNet: Latent Intrinsics Meets Diffusion Models for Indoor Scene Relighting

Xiaoyan Xing, Konrad Groh, Sezer Karaoglu, Theo Gevers, Anand Bhattad

Main category: cs.CV

TL;DR: LumiNet is a novel architecture for lighting transfer that uses generative models and latent representations to relight source images with target lighting, outperforming existing methods on complex indoor scenes.

DetailsMotivation: To develop an effective method for transferring lighting characteristics between images while preserving source geometry and albedo, addressing the challenge of complex lighting phenomena like specular highlights and indirect illumination.

Method: Uses a modified diffusion-based ControlNet that processes latent intrinsic properties from source image and latent extrinsic properties from target image, with a learned MLP adaptor for injecting target properties via cross-attention and fine-tuning. Includes data curation from StyleGAN-based relighting model.

Result: Successfully transfers complex lighting phenomena including specular highlights and indirect illumination across scenes with varying spatial layouts and materials, outperforming existing approaches on challenging indoor scenes.

Conclusion: LumiNet provides an effective solution for lighting transfer that preserves source scene properties while accurately capturing target lighting characteristics, demonstrating superior performance compared to traditional methods.

Abstract: We introduce LumiNet, a novel architecture that leverages generative models and latent intrinsic representations for effective lighting transfer. Given a source image and a target lighting image, LumiNet synthesizes a relit version of the source scene that captures the target’s lighting. Our approach makes two key contributions: a data curation strategy from the StyleGAN-based relighting model for our training, and a modified diffusion-based ControlNet that processes both latent intrinsic properties from the source image and latent extrinsic properties from the target image. We further improve lighting transfer through a learned adaptor (MLP) that injects the target’s latent extrinsic properties via cross-attention and fine-tuning. Unlike traditional ControlNet, which generates images with conditional maps from a single scene, LumiNet processes latent representations from two different images - preserving geometry and albedo from the source while transferring lighting characteristics from the target. Experiments demonstrate that our method successfully transfers complex lighting phenomena including specular highlights and indirect illumination across scenes with varying spatial layouts and materials, outperforming existing approaches on challenging indoor scenes using only images as input.

[127] Towards a Universal Synthetic Video Detector: From Face or Background Manipulations to Fully AI-Generated Content

Rohit Kundu, Hao Xiong, Vishal Mohanty, Athula Balachandran, Amit K. Roy-Chowdhury

Main category: cs.CV

TL;DR: UNITE is a universal DeepFake detection model that extends beyond face-centric approaches to detect full-frame manipulations, including non-human subjects and background alterations, using transformer architecture with attention-diversity loss.

DetailsMotivation: Traditional DeepFake detection focuses only on facial manipulations, but new text-to-video and image-to-video generative models can create fully synthetic content and background alterations, requiring more versatile detection methods.

Method: Transformer-based architecture processing domain-agnostic features from SigLIP-So400M foundation model, trained with task-irrelevant data and attention-diversity loss to prevent over-focusing on faces and promote diverse spatial attention.

Result: UNITE outperforms state-of-the-art detectors in cross-data settings across datasets with face/background manipulations and fully synthetic T2V/I2V videos.

Conclusion: The model demonstrates superior adaptability and generalizable detection capabilities for various manipulation types beyond traditional face-centric DeepFakes.

Abstract: Existing DeepFake detection techniques primarily focus on facial manipulations, such as face-swapping or lip-syncing. However, advancements in text-to-video (T2V) and image-to-video (I2V) generative models now allow fully AI-generated synthetic content and seamless background alterations, challenging face-centric detection methods and demanding more versatile approaches. To address this, we introduce the Universal Network for Identifying Tampered and synthEtic videos (UNITE) model, which, unlike traditional detectors, captures full-frame manipulations. UNITE extends detection capabilities to scenarios without faces, non-human subjects, and complex background modifications. It leverages a transformer-based architecture that processes domain-agnostic features extracted from videos via the SigLIP-So400M foundation model. Given limited datasets encompassing both facial/background alterations and T2V/I2V content, we integrate task-irrelevant data alongside standard DeepFake datasets in training. We further mitigate the model’s tendency to over-focus on faces by incorporating an attention-diversity (AD) loss, which promotes diverse spatial attention across video frames. Combining AD loss with cross-entropy improves detection performance across varied contexts. Comparative evaluations demonstrate that UNITE outperforms state-of-the-art detectors on datasets (in cross-data settings) featuring face/background manipulations and fully synthetic T2V/I2V videos, showcasing its adaptability and generalizable detection capabilities.
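
The summary does not spell out the AD loss, but a plausible form of an attention-diversity penalty is to reward high-entropy (spatially spread-out) attention maps, discouraging the model from collapsing onto face regions. A hedged sketch of that idea, not necessarily the paper's exact loss:

```python
import torch

def attention_diversity_loss(attn):
    """Encourage spatially spread-out attention by maximizing the
    entropy of each attention map (so we minimize negative entropy).
    One plausible form; the paper's exact AD loss may differ.
    attn: (batch, heads, num_patches) softmax-normalized weights."""
    entropy = -(attn * (attn + 1e-9).log()).sum(dim=-1)  # per-head entropy
    return -entropy.mean()

attn = torch.softmax(torch.randn(2, 8, 196), dim=-1)
ce_loss = torch.tensor(0.7)                      # stand-in cross-entropy
total_loss = ce_loss + 0.1 * attention_diversity_loss(attn)
print(total_loss)
```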

[128] Survey on Hand Gesture Recognition from Visual Input

Manousos Linardakis, Iraklis Varlamis, Georgios Th. Papadopoulos

Main category: cs.CV

TL;DR: A comprehensive survey paper on hand gesture recognition covering recent advancements, methodologies, datasets, and open challenges in the field.

DetailsMotivation: To address the lack of comprehensive surveys covering recent research developments, available solutions, and benchmark datasets in hand gesture recognition, which is crucial for human-computer interaction applications.

Method: Examines latest advancements in hand gesture and 3D hand pose recognition from various camera inputs (RGB images, depth images, videos from monocular/multiview cameras) and analyzes methodological requirements of each approach.

Result: Provides an overview of widely used datasets with their characteristics and application domains, and identifies current trends in the field.

Conclusion: Highlights open challenges including robust recognition in real-world environments, handling occlusions, generalization across users, and computational efficiency for real-time applications, providing guidance for future research directions.

Abstract: Hand gesture recognition has become an important research area, driven by the growing demand for human-computer interaction in fields such as sign language recognition, virtual and augmented reality, and robotics. Despite the rapid growth of the field, there are few surveys that comprehensively cover recent research developments, available solutions, and benchmark datasets. This survey addresses this gap by examining the latest advancements in hand gesture and 3D hand pose recognition from various types of camera input data including RGB images, depth images, and videos from monocular or multiview cameras, examining the differing methodological requirements of each approach. Furthermore, an overview of widely used datasets is provided, detailing their main characteristics and application domains. Finally, open challenges such as achieving robust recognition in real-world environments, handling occlusions, ensuring generalization across diverse users, and addressing computational efficiency for real-time applications are highlighted to guide future research directions. By synthesizing the objectives, methodologies, and applications of recent studies, this survey offers valuable insights into current trends, challenges, and opportunities for future research in human hand gesture recognition.

[129] ViDDAR: Vision Language Model-Based Task-Detrimental Content Detection for Augmented Reality

Yanming Xiu, Tim Scargill, Maria Gorlatova

Main category: cs.CV

TL;DR: ViDDAR is a vision language model-based system that detects task-detrimental virtual content in AR environments, achieving high accuracy for obstruction and information manipulation attacks with low latency.

DetailsMotivation: Improperly positioned or designed virtual content in AR can impair users' ability to accurately interpret real-world information, leading to task performance degradation through obstruction attacks and information manipulation attacks.

Method: Developed ViDDAR - a comprehensive full-reference system using Vision Language Models (VLMs) and deep learning techniques with user-edge-cloud architecture to monitor and evaluate virtual content in AR environments.

Result: ViDDAR achieved 92.15% obstruction detection accuracy with 533 ms latency, and 82.46% information manipulation detection accuracy with 9.62 s latency, effectively understanding complex scenes.

Conclusion: ViDDAR is the first system to employ VLMs for detecting task-detrimental content in AR settings, successfully addressing obstruction and information manipulation attacks with high accuracy and low latency.

Abstract: In Augmented Reality (AR), virtual content enhances user experience by providing additional information. However, improperly positioned or designed virtual content can be detrimental to task performance, as it can impair users’ ability to accurately interpret real-world information. In this paper we examine two types of task-detrimental virtual content: obstruction attacks, in which virtual content prevents users from seeing real-world objects, and information manipulation attacks, in which virtual content interferes with users’ ability to accurately interpret real-world information. We provide a mathematical framework to characterize these attacks and create a custom open-source dataset for attack evaluation. To address these attacks, we introduce ViDDAR (Vision language model-based Task-Detrimental content Detector for Augmented Reality), a comprehensive full-reference system that leverages Vision Language Models (VLMs) and advanced deep learning techniques to monitor and evaluate virtual content in AR environments, employing a user-edge-cloud architecture to balance performance with low latency. To the best of our knowledge, ViDDAR is the first system to employ VLMs for detecting task-detrimental content in AR settings. Our evaluation results demonstrate that ViDDAR effectively understands complex scenes and detects task-detrimental content, achieving up to 92.15% obstruction detection accuracy with a detection latency of 533 ms, and an 82.46% information manipulation content detection accuracy with a latency of 9.62 s.

[130] Comparing Next-Day Wildfire Predictability of MODIS and VIIRS Satellite Data

Justus Karlsson, Yonghao Xu, Amanda Berg, Leif Haglund

Main category: cs.CV

TL;DR: VIIRS (VNP14) outperforms MODIS (MOD14) for next-day fire prediction due to MOD14’s stochastic nature and poor correlation with fire spread patterns, making VNP14 the better choice despite MODIS data showing potential when paired with VNP14 targets.

DetailsMotivation: No direct comparison exists between MODIS and VIIRS satellites for next-day fire prediction, despite both being commonly used for wildfire detection through their respective fire mask products (MOD14 and VNP14).

Method: Evaluated VIIRS and MODIS data for forecasting wildfire spread one day ahead, comparing different input-target combinations including MODIS input with VNP14 target and VNP14 input with MOD14 target.

Result: Model using VIIRS as input and VNP14 as target achieved best results. MODIS input with VNP14 target performed significantly better than VNP14 input with MOD14 target. MOD14 fire mask was found to be highly stochastic and not correlating with reasonable fire spread patterns.

Conclusion: MOD14 is unsuitable for next-day fire prediction while VNP14 is a much better option. However, using MODIS input with VNP14 target shows significant improvement, indicating potential for improved fire detection models for MODIS.

Abstract: Multiple studies have performed next-day fire prediction using satellite imagery. Two main satellites are used to detect wildfires: MODIS and VIIRS. Both satellites provide fire mask products, called MOD14 and VNP14, respectively. Studies have used one or the other, but there has been no comparison between them to determine which might be more suitable for next-day fire prediction. In this paper, we first evaluate how well VIIRS and MODIS data can be used to forecast wildfire spread one day ahead. We find that the model using VIIRS as input and VNP14 as target achieves the best results. Interestingly, the model using MODIS as input and VNP14 as target performs significantly better than using VNP14 as input and MOD14 as target. Next, we discuss why MOD14 might be harder to use for predicting next-day fires. We find that the MOD14 fire mask is highly stochastic and does not correlate with reasonable fire spread patterns. This is detrimental to machine learning tasks, as the model learns irrational patterns. Therefore, we conclude that MOD14 is unsuitable for next-day fire prediction and that VNP14 is a much better option. However, using MODIS input and VNP14 as target, we achieve a significant improvement in predictability. This indicates that an improved fire detection model is possible for MODIS. The full code and dataset are available online: https://github.com/justuskarlsson/wildfire-mod14-vnp14

[131] LATINO-PRO: LAtent consisTency INverse sOlver with PRompt Optimization

Alessio Spagnoletti, Jean Prost, Andrés Almansa, Nicolas Papadakis, Marcelo Pereyra

Main category: cs.CV

TL;DR: LATINO is a novel zero-shot Plug & Play framework that uses Latent Consistency Models for efficient inverse problem solving, achieving state-of-the-art results with only 8 neural function evaluations and automatic prompt calibration.

DetailsMotivation: Existing text-to-image PnP approaches are computationally expensive and require manual text prompt identification for unknown images, limiting their practical application in inverse problem solving.

Method: Proposes LATINO framework that embeds generative models within stochastic inverse solvers, specifically using Latent Consistency Models (distilled LDMs) with a conditioning mechanism that avoids automatic differentiation and enables prompt self-calibration through empirical Bayesian framework.

Result: Achieves SOTA quality in as little as 8 neural function evaluations, delivers remarkably accurate solutions, and is significantly more memory and computationally efficient than previous approaches. Prompt self-calibration greatly improves estimation quality.

Conclusion: LATINO with prompt optimization defines new state-of-the-art benchmarks in both image reconstruction quality and computational efficiency for solving inverse problems with generative priors.

Abstract: Text-to-image latent diffusion models (LDMs) have recently emerged as powerful generative models with great potential for solving inverse problems in imaging. However, leveraging such models in a Plug & Play (PnP), zero-shot manner remains challenging because it requires identifying a suitable text prompt for the unknown image of interest. Also, existing text-to-image PnP approaches are highly computationally expensive. We herein address these challenges by proposing a novel PnP inference paradigm specifically designed for embedding generative models within stochastic inverse solvers, with special attention to Latent Consistency Models (LCMs), which distill LDMs into fast generators. We leverage our framework to propose LAtent consisTency INverse sOlver (LATINO), the first zero-shot PnP framework to solve inverse problems with priors encoded by LCMs. Our conditioning mechanism avoids automatic differentiation and reaches SOTA quality in as little as 8 neural function evaluations. As a result, LATINO delivers remarkably accurate solutions and is significantly more memory and computationally efficient than previous approaches. We then embed LATINO within an empirical Bayesian framework that automatically calibrates the text prompt from the observed measurements by marginal maximum likelihood estimation. Extensive experiments show that prompt self-calibration greatly improves estimation, allowing LATINO with PRompt Optimization to define new SOTAs in image reconstruction quality and computational efficiency. The code is available at https://latino-pro.github.io

[132] Deeply Supervised Flow-Based Generative Models

Inkyu Shin, Chenglin Yang, Liang-Chieh Chen

Main category: cs.CV

TL;DR: DeepFlow enhances flow-based generative models by introducing inter-layer communication and velocity alignment, achieving 8x faster convergence and improved performance on image generation tasks.

DetailsMotivation: Current flow-based models underutilize rich inter-layer representations by training velocity solely from final layer output, which potentially impedes model convergence.

Method: Partitions transformer layers into balanced branches with deep supervision and inserts lightweight Velocity Refiner with Acceleration (VeRA) blocks between adjacent branches to align intermediate velocity features.

Result: Converges 8 times faster on ImageNet with equivalent performance, reduces FID by 2.6 while halving training time, and outperforms baselines in text-to-image generation on MSCOCO and zero-shot GenEval.

Conclusion: DeepFlow’s inter-layer communication and velocity alignment significantly improve convergence speed and performance of flow-based generative models without requiring classifier-free guidance.

Abstract: Flow-based generative models have charted an impressive path across multiple visual generation tasks by adhering to a simple principle: learning velocity representations of a linear interpolant. However, we observe that training velocity solely from the final-layer output underutilizes the rich inter-layer representations, potentially impeding model convergence. To address this limitation, we introduce DeepFlow, a novel framework that enhances velocity representation through inter-layer communication. DeepFlow partitions transformer layers into balanced branches with deep supervision and inserts a lightweight Velocity Refiner with Acceleration (VeRA) block between adjacent branches, which aligns the intermediate velocity features within transformer blocks. Powered by the improved deep supervision via the internal velocity alignment, DeepFlow converges 8 times faster on ImageNet with equivalent performance and further reduces FID by 2.6 while halving training time compared to previous flow-based models without classifier-free guidance. DeepFlow also outperforms baselines in text-to-image generation tasks, as evidenced by evaluations on MSCOCO and zero-shot GenEval.
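
A minimal sketch of the deep-supervision idea, assuming the transformer is already split into branches; the VeRA refiner is reduced to a single linear layer here for illustration, so this is a reading of the scheme rather than the paper's architecture:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DeeplySupervisedFlow(nn.Module):
    def __init__(self, branches: nn.ModuleList, dim: int):
        super().__init__()
        self.branches = branches
        self.vel_heads = nn.ModuleList(nn.Linear(dim, dim) for _ in branches)
        # Stand-in for the lightweight VeRA block between adjacent branches.
        self.refiners = nn.ModuleList(nn.Linear(dim, dim) for _ in branches[:-1])

    def forward(self, h, v_target):
        loss, v_pred = 0.0, None
        for i, branch in enumerate(self.branches):
            h = branch(h)
            v_pred = self.vel_heads[i](h)                # per-branch velocity
            loss = loss + F.mse_loss(v_pred, v_target)   # deep supervision
            if i < len(self.refiners):                   # refine features
                h = h + self.refiners[i](v_pred)         # before next branch
        return v_pred, loss
```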

[133] TruthLens: Visual Grounding for Universal DeepFake Reasoning

Rohit Kundu, Shan Jia, Vishal Mohanty, Athula Balachandran, Amit K. Roy-Chowdhury

Main category: cs.CV

TL;DR: TruthLens is a novel framework that provides detailed textual reasoning for DeepFake detection, going beyond binary classification by combining global semantic context with region-specific forensic cues through MLLM grounding.

DetailsMotivation: Existing DeepFake detection methods are limited to binary classification (real vs. fake) and lack interpretability, while AI image generators enable effortless creation of manipulated content.

Method: Uses task-driven representation integration that combines global semantic context from a multimodal large language model (MLLM) with region-specific forensic cues through explicit cross-modal adaptation of a vision-only model.

Result: Extensive experiments show TruthLens sets new benchmarks in both forensic interpretability and detection accuracy, generalizing to both seen and unseen manipulations.

Conclusion: TruthLens bridges the interpretability gap in DeepFake detection by unifying high-level scene understanding with fine-grained region grounding, providing transparent forensic analysis.

Abstract: Detecting DeepFakes has become a crucial research area as the widespread use of AI image generators enables the effortless creation of face-manipulated and fully synthetic content, while existing methods are often limited to binary classification (real vs. fake) and lack interpretability. To address these challenges, we propose TruthLens, a novel, unified, and highly generalizable framework that goes beyond traditional binary classification, providing detailed, textual reasoning for its predictions. Distinct from conventional methods, TruthLens performs MLLM grounding. TruthLens uses a task-driven representation integration strategy that unites global semantic context from a multimodal large language model (MLLM) with region-specific forensic cues through explicit cross-modal adaptation of a vision-only model. This enables nuanced, region-grounded reasoning for both face-manipulated and fully synthetic content, and supports fine-grained queries such as “Do the eyes/nose/mouth look real or fake?”, capabilities beyond pretrained MLLMs alone. Extensive experiments across diverse datasets demonstrate that TruthLens sets a new benchmark in both forensic interpretability and detection accuracy, generalizing to seen and unseen manipulations alike. By unifying high-level scene understanding with fine-grained region grounding, TruthLens delivers transparent DeepFake forensics, bridging a critical gap in the literature.

[134] GAEA: A Geolocation Aware Conversational Assistant

Ron Campos, Ashmal Vayani, Parth Parag Kulkarni, Rohit Gupta, Aizan Zafar, Aritra Dutta, Mubarak Shah

Main category: cs.CV

TL;DR: GAEA is a conversational AI model that provides location information about images, outperforming existing LMMs by significant margins through a novel dataset and specialized training.

DetailsMotivation: Traditional image geolocalization only provides GPS coordinates without understanding location context or conversational ability. Existing LMMs struggle with specialized tasks like geolocalization.

Method: Developed GAEA conversational model using GAEA-1.4M dataset (800k images, 1.4M QA pairs) created from OpenStreetMap attributes and geographical context clues. Also created GAEA-Bench benchmark with 3.5k image-text pairs.

Result: GAEA outperforms best open-source model (LLaVA-OneVision) by 18.2% and best proprietary model (GPT-4o) by 7.2% on conversational geolocalization tasks.

Conclusion: The proposed GAEA model successfully addresses the limitations of traditional geolocalization and existing LMMs by providing conversational location understanding through specialized dataset and training.

Abstract: Image geolocalization, in which an AI model traditionally predicts the precise GPS coordinates of an image, is a challenging task with many downstream applications. However, the user cannot utilize the model to further their knowledge beyond the GPS coordinates; the model lacks an understanding of the location and the conversational ability to communicate with the user. Recently, with the tremendous progress of large multimodal models (LMMs), both proprietary and open-source, researchers have attempted to geolocalize images via LMMs. However, the issue remains unaddressed: while LMMs handle general tasks well, they struggle with more specialized downstream tasks such as geolocalization. In this work, we propose solving this problem by introducing a conversational model, GAEA, that provides information regarding the location of an image as the user requires. No large-scale dataset enabling the training of such a model exists. Thus, we propose GAEA-1.4M, a comprehensive dataset comprising over 800k images and approximately 1.4M question-answer pairs, constructed by leveraging OpenStreetMap (OSM) attributes and geographical context clues. For quantitative evaluation, we propose a diverse benchmark, GAEA-Bench, comprising 3.5k image-text pairs to evaluate conversational capabilities equipped with diverse question types. We consider 11 state-of-the-art open-source and proprietary LMMs and demonstrate that GAEA significantly outperforms the best open-source model, LLaVA-OneVision, by 18.2% and the best proprietary model, GPT-4o, by 7.2%. Our dataset, model, and code are available.

[135] On the representation of stack operators by mathematical morphology

Diego Marcondes

Main category: cs.CV

TL;DR: This paper introduces grey-scale image stack operators as 1-Lipschitz extensions of set operators that map binary images to binary images and commute with cross-sectioning, generalizing stack filters.

DetailsMotivation: To develop a framework for solving grey-scale image processing problems by designing operators for binary images first, then extending them to grey-scale through stack operators.

Method: Defines stack operators mathematically, proves they inherit lattice properties from characteristic set operators, and derives their characteristic function, kernel, and basis representation for translation-invariant locally defined cases.

Result: Main result shows stack operators preserve lattice properties of their underlying set operators, providing a systematic way to extend binary image operators to grey-scale processing.

Conclusion: Stack operators provide a powerful framework where grey-scale image processing problems can be solved by designing binary image operators first, with implications for operator design and future research in machine learning applications.

Abstract: This paper introduces the class of grey-scale image stack operators as those that (a) map binary-images into binary-images and (b) commute on average with cross-sectioning. Equivalently, stack operators are 1-Lipschitz extensions of set operators which can be represented by applying a characteristic set operator to the cross-sections of the image and adding. In particular, they are a generalisation of stack filters, for which the characteristic set operators are increasing. Our main result is that stack operators inherit lattice properties of the characteristic set operators. We focus on the case of translation-invariant and locally defined stack operators and show the main result by deducing the characteristic function, kernel, and basis representation of stack operators. The results of this paper have implications for the design of image operators, since they imply that to solve some grey-scale image processing problems it is enough to design an operator performing the desired transformation on binary images, and then consider its extension given by a stack operator. We leave many topics for future research regarding the machine learning of stack operators and the characterisation of the image processing problems that can be solved by them.
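
The threshold-decomposition construction is concrete enough to sketch: apply a set operator to every cross-section of the grey-scale image and sum the results. With an increasing operator such as binary erosion this recovers the classical stack-filter behaviour (a sketch under those assumptions, not the paper's code):

```python
import numpy as np
from scipy.ndimage import binary_erosion

def stack_operator(image: np.ndarray, set_op, levels: int = 256) -> np.ndarray:
    """Extend `set_op` (binary image -> binary image) to grey-scale images."""
    out = np.zeros(image.shape, dtype=np.int64)
    for k in range(1, levels):
        cross_section = image >= k      # binary cross-section at level k
        out += set_op(cross_section)    # characteristic set operator, summed
    return out

# Example: grey-scale erosion obtained by stacking binary erosions.
img = np.random.randint(0, 256, (64, 64))
eroded = stack_operator(img, lambda b: binary_erosion(b).astype(np.int64))
```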

[136] WildFireCan-MMD: A Multimodal Dataset for Classification of User-Generated Content During Wildfires in Canada

Braeden Sherritt, Isar Nejadgholi, Efstratios Aivaliotis, Khaled Mslmani, Marzieh Amini

Main category: cs.CV

TL;DR: New multimodal dataset WildFireCan-MMD for Canadian wildfire social media analysis, showing custom-trained models outperform zero-shot VLMs and baseline classifiers with 84.48% f-score.

DetailsMotivation: Traditional wildfire data sources are slow and costly, while social media offers real-time updates but extracting relevant insights remains challenging, especially for Canadian contexts where multimodal data is underrepresented.

Method: Created WildFireCan-MMD dataset of X posts from Canadian wildfires annotated across 12 themes, evaluated zero-shot vision-language models, custom-trained classifiers, and baseline methods on this dataset.

Result: Custom-trained models significantly outperform zero-shot VLMs and baseline classifiers, with best model achieving 84.48% f-score. Demonstrated practical application through analysis of large unlabeled dataset to uncover wildfire trends.

Conclusion: Tailored datasets and task-specific training are crucial for effective wildfire response, and such datasets should be localized as disaster response requirements vary across regions and contexts.

Abstract: Rapid information access is vital during wildfires, yet traditional data sources are slow and costly. Social media offers real-time updates, but extracting relevant insights remains a challenge. In this work, we focus on multimodal wildfire social media data, which, although present in existing datasets, remains underrepresented in Canadian contexts. We present WildFireCan-MMD, a new multimodal dataset of X posts from recent Canadian wildfires, annotated across twelve key themes. We evaluate zero-shot vision-language models on this dataset and compare their results with those of custom-trained and baseline classifiers. We show that while baseline methods and zero-shot prompting offer quick deployment, custom-trained models outperform them when labelled data is available. Our best-performing custom model reaches 84.48% f-score, outperforming VLMs and baseline classifiers. We also demonstrate how this model can be used to uncover trends during wildfires, through the collection and analysis of a large unlabeled dataset. Our dataset facilitates future research in wildfire response, and our findings highlight the importance of tailored datasets and task-specific training. Importantly, such datasets should be localized, as disaster response requirements vary across regions and contexts.

[137] Automated Parsing of Engineering Drawings for Structured Information Extraction Using a Fine-tuned Document Understanding Transformer

Muhammad Tayyab Khan, Zane Yong, Lequn Chen, Jun Ming Tan, Wenhe Feng, Seung Ki Moon

Main category: cs.CV

TL;DR: Hybrid deep learning framework combining OBB detection (YOLOv11) and transformer-based parsing (Donut) for structured information extraction from 2D engineering drawings, achieving high accuracy with reduced hallucinations.

DetailsMotivation: Manual extraction from engineering drawings is slow and labor-intensive, while traditional OCR struggles with complex layouts and overlapping symbols, producing unstructured outputs that hinder precision manufacturing.

Method: Integration of Oriented Bounding Box detection (YOLOv11 trained on 9 key categories) with Donut transformer model for structured JSON output. Evaluated both single model and category-specific fine-tuning approaches.

Result: Single model outperformed category-specific models across all metrics: 94.77% precision for GD&T, 100% recall for most categories, 97.3% F1 score, with only 5.23% hallucinations.

Conclusion: The proposed hybrid framework significantly improves accuracy, reduces manual effort, and supports scalable deployment in precision-driven industries by providing structured information extraction from complex engineering drawings.

Abstract: Accurate extraction of key information from 2D engineering drawings is crucial for high-precision manufacturing. Manual extraction is slow and labor-intensive, while traditional Optical Character Recognition (OCR) techniques often struggle with complex layouts and overlapping symbols, resulting in unstructured outputs. To address these challenges, this paper proposes a novel hybrid deep learning framework for structured information extraction by integrating an Oriented Bounding Box (OBB) detection model with a transformer-based document parsing model (Donut). An in-house annotated dataset is used to train YOLOv11 for detecting nine key categories: Geometric Dimensioning and Tolerancing (GD&T), General Tolerances, Measures, Materials, Notes, Radii, Surface Roughness, Threads, and Title Blocks. Detected OBBs are cropped into images and labeled to fine-tune Donut for structured JSON output. Fine-tuning strategies include a single model trained across all categories and category-specific models. Results show that the single model consistently outperforms category-specific ones across all evaluation metrics, achieving higher precision (94.77% for GD&T), recall (100% for most categories), and F1 score (97.3%), while reducing hallucinations (5.23%). The proposed framework improves accuracy, reduces manual effort, and supports scalable deployment in precision-driven industries.
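
A hedged sketch of the detect-crop-parse pipeline using the public Ultralytics YOLO and Hugging Face Donut APIs. The OBB weights and the `<s_drawing>` task token are placeholders, and the base Donut checkpoint stands in for the paper's fine-tuned model:

```python
from PIL import Image
from ultralytics import YOLO
from transformers import DonutProcessor, VisionEncoderDecoderModel

detector = YOLO("yolo11n-obb.pt")  # stands in for the fine-tuned OBB detector
processor = DonutProcessor.from_pretrained("naver-clova-ix/donut-base")
parser = VisionEncoderDecoderModel.from_pretrained("naver-clova-ix/donut-base")

def parse_drawing(path: str, task_token: str = "<s_drawing>"):
    """Detect annotation regions, crop them, and parse each crop into JSON."""
    image = Image.open(path).convert("RGB")
    boxes = detector(path)[0].obb.xyxy.tolist()  # axis-aligned box per OBB
    prompt = processor.tokenizer(task_token, add_special_tokens=False,
                                 return_tensors="pt").input_ids
    records = []
    for box in boxes:
        pixels = processor(image.crop(tuple(box)), return_tensors="pt").pixel_values
        out = parser.generate(pixels, decoder_input_ids=prompt, max_length=256)
        records.append(processor.token2json(processor.batch_decode(out)[0]))
    return records  # one structured record per detected field
```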

[138] Mitigating Hallucination in Large Vision-Language Models through Aligning Attention Distribution to Information Flow

Jianfei Zhao, Feng Zhang, Xin Sun, Chong Feng

Main category: cs.CV

TL;DR: The paper addresses visual hallucinations in LVLMs by aligning attention distribution with actual information flow through a two-stage optimization that propagates advantages from attention heads focusing on core semantic representations.

DetailsMotivation: Decoder-Only LVLMs suffer from misalignment between attention distribution and information flow, where visual information gets absorbed into semantic representations but attention doesn't properly focus on them, leading to hallucinations and poor visual understanding.

Method: Identify attention heads that focus on core semantic representations based on attention distributions, then use a two-stage optimization paradigm to propagate these advantages across the entire model to align attention with information flow.

Result: Method significantly reduces hallucinations on three image captioning benchmarks using five different LVLMs, with experiments showing a trade-off between reduced hallucinations and richer details, while allowing manual adjustment of model conservativeness.

Conclusion: The proposed approach effectively enhances visual understanding in LVLMs by aligning attention mechanisms with actual information flow, providing flexible control over model behavior to meet diverse real-world requirements while reducing hallucinations.

Abstract: Due to the unidirectional masking mechanism, Decoder-Only models propagate information from left to right. LVLMs (Large Vision-Language Models) follow the same architecture, with visual information gradually integrated into semantic representations during forward propagation. Through systematic analysis, we observe that the majority of the visual information is absorbed into the semantic representations. However, the model’s attention distribution does not exhibit sufficient emphasis on semantic representations. This misalignment between the attention distribution and the actual information flow undermines the model’s visual understanding ability and contributes to hallucinations. To address this issue, we enhance the model’s visual understanding by leveraging the core information embedded in semantic representations. Specifically, we identify attention heads that focus on core semantic representations based on their attention distributions. Then, through a two-stage optimization paradigm, we propagate the advantages of these attention heads across the entire model, aligning the attention distribution with the actual information flow. We evaluate our method on three image captioning benchmarks using five different LVLMs, demonstrating its effectiveness in significantly reducing hallucinations. Further experiments reveal a trade-off between reduced hallucinations and richer details. Notably, our method allows for manual adjustment of the model’s conservativeness, enabling flexible control to meet diverse real-world requirements.
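
The first stage, scoring attention heads by how much mass they place on the core semantic positions, can be sketched directly; which token positions count as semantic carriers is an assumption here, as is the simple top-k selection:

```python
import torch

def head_semantic_scores(attn: torch.Tensor, semantic_idx: torch.Tensor):
    """attn: (layers, heads, query_len, key_len) attention weights."""
    mass_on_semantic = attn[..., semantic_idx].sum(dim=-1)  # (L, H, Q)
    return mass_on_semantic.mean(dim=-1)                    # (L, H) head scores

attn = torch.softmax(torch.randn(32, 16, 128, 128), dim=-1)
scores = head_semantic_scores(attn, torch.tensor([5, 17, 42]))
top_heads = torch.topk(scores.flatten(), k=8).indices  # heads to propagate from
```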

[139] Open-Set LiDAR Panoptic Segmentation Guided by Uncertainty-Aware Learning

Rohit Mohan, Julia Hindel, Florian Drews, Claudius Gläser, Daniele Cattaneo, Abhinav Valada

Main category: cs.CV

TL;DR: ULOPS is an uncertainty-guided open-set LiDAR panoptic segmentation framework that uses Dirichlet-based evidential learning to detect unknown objects in autonomous driving scenarios.

DetailsMotivation: Existing LiDAR panoptic segmentation models rely on closed-set assumptions and fail to detect previously unseen object classes in open-world autonomous driving environments.

Method: Proposes ULOPS with separate decoders for semantic segmentation with uncertainty estimation, embedding with prototype association, and instance center prediction. Introduces three uncertainty-driven loss functions: Uniform Evidence Loss, Adaptive Uncertainty Separation Loss, and Contrastive Uncertainty Loss.

Result: Extensive experiments show ULOPS consistently outperforms existing open-set LiDAR panoptic segmentation methods on extended benchmark settings for KITTI-360 and new open-set evaluation for nuScenes.

Conclusion: ULOPS effectively addresses the challenge of detecting unknown objects in LiDAR panoptic segmentation through uncertainty-guided learning and specialized loss functions, demonstrating superior performance over existing methods.

Abstract: Autonomous vehicles that navigate in open-world environments may encounter previously unseen object classes. However, most existing LiDAR panoptic segmentation models rely on closed-set assumptions, failing to detect unknown object instances. In this work, we propose ULOPS, an uncertainty-guided open-set panoptic segmentation framework that leverages Dirichlet-based evidential learning to model predictive uncertainty. Our architecture incorporates separate decoders for semantic segmentation with uncertainty estimation, embedding with prototype association, and instance center prediction. During inference, we leverage uncertainty estimates to identify and segment unknown instances. To strengthen the model’s ability to differentiate between known and unknown objects, we introduce three uncertainty-driven loss functions: a Uniform Evidence Loss encourages high uncertainty in unknown regions; an Adaptive Uncertainty Separation Loss enforces a consistent difference in uncertainty estimates between known and unknown objects at a global scale; and a Contrastive Uncertainty Loss refines this separation at the fine-grained level. To evaluate open-set performance, we extend benchmark settings on KITTI-360 and introduce a new open-set evaluation for nuScenes. Extensive experiments demonstrate that ULOPS consistently outperforms existing open-set LiDAR panoptic segmentation methods.
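
A sketch of the standard Dirichlet evidential quantities the framework builds on (alpha = evidence + 1, vacuity u = K / S); the uniform-evidence loss below is one plausible reading of "encourage high uncertainty in unknown regions", not necessarily the paper's exact form:

```python
import torch
import torch.nn.functional as F

def evidential_outputs(logits: torch.Tensor):
    evidence = F.softplus(logits)             # non-negative evidence per class
    alpha = evidence + 1.0                    # Dirichlet concentration
    strength = alpha.sum(dim=-1, keepdim=True)
    probs = alpha / strength                  # expected class probabilities
    K = logits.shape[-1]
    uncertainty = K / strength.squeeze(-1)    # vacuity: high when evidence is low
    return probs, uncertainty

def uniform_evidence_loss(logits, unknown_mask):
    # Push evidence toward zero (a uniform Dirichlet) on unknown regions.
    evidence = F.softplus(logits)
    return (evidence[unknown_mask] ** 2).mean()
```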

[140] Sequential keypoint density estimator: an overlooked baseline of skeleton-based video anomaly detection

Anja Delić, Matej Grcić, Siniša Šegvić

Main category: cs.CV

TL;DR: SeeKer detects anomalous human behavior by analyzing skeleton sequences using autoregressive keypoint-level density estimation, flagging low-probability poses as anomalies.

DetailsMotivation: Detecting abnormal human behavior is crucial for safety applications like healthcare monitoring and surveillance, where anomalies often manifest as unusual human poses.

Method: Formulates skeleton sequence density through autoregressive factorization at keypoint level, using conditional Gaussian distributions to predict probable keypoint locations given prior motion. Anomaly score is weighted sum of per-keypoint log-conditionals accounting for detector confidence.

Result: Surpasses all previous methods on UBnormal and MSAD-HR datasets, delivers competitive performance on ShanghaiTech dataset.

Conclusion: Despite its conceptual simplicity, SeeKer achieves state-of-the-art performance in skeleton-based anomaly detection across multiple benchmark datasets.

Abstract: Detecting anomalous human behaviour is an important visual task in safety-critical applications such as healthcare monitoring, workplace safety, or public surveillance. In these contexts, abnormalities are often reflected with unusual human poses. Thus, we propose SeeKer, a method for detecting anomalies in sequences of human skeletons. Our method formulates the skeleton sequence density through autoregressive factorization at the keypoint level. The corresponding conditional distributions represent probable keypoint locations given prior skeletal motion. We formulate the joint distribution of the considered skeleton as causal prediction of conditional Gaussians across its constituent keypoints. A skeleton is flagged as anomalous if its keypoint locations surprise our model (i.e. receive a low density). In practice, our anomaly score is a weighted sum of per-keypoint log-conditionals, where the weights account for the confidence of the underlying keypoint detector. Despite its conceptual simplicity, SeeKer surpasses all previous methods on the UBnormal and MSAD-HR datasets while delivering competitive performance on the ShanghaiTech dataset.
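
The score itself is simple enough to sketch: a confidence-weighted sum of per-keypoint Gaussian log-conditionals, with the predictive means and scales assumed to come from the autoregressive model (here they are plain inputs):

```python
import torch

def anomaly_score(keypoints, mu, sigma, confidence):
    """keypoints, mu, sigma: (K, 2); confidence: (K,). Lower = more anomalous."""
    dist = torch.distributions.Normal(mu, sigma)
    log_cond = dist.log_prob(keypoints).sum(dim=-1)  # per-keypoint log-density
    return (confidence * log_cond).sum()             # weighted sum over keypoints

K = 17  # e.g., a COCO-style skeleton
score = anomaly_score(torch.randn(K, 2), torch.zeros(K, 2),
                      torch.ones(K, 2), torch.rand(K))
```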

[141] GroundingDINO-US-SAM: Text-Prompted Multi-Organ Segmentation in Ultrasound with LoRA-Tuned Vision-Language Models

Hamza Rasaee, Taha Koleilat, Hassan Rivaz

Main category: cs.CV

TL;DR: A prompt-driven vision-language model combining Grounding DINO and SAM2 achieves superior ultrasound segmentation across multiple organs, outperforming state-of-the-art methods on both seen and unseen datasets without additional fine-tuning.

DetailsMotivation: Address the challenge of accurate and generalizable object segmentation in ultrasound imaging due to anatomical variability, diverse protocols, and limited annotated data.

Method: Integrated Grounding DINO with SAM2 using Low Rank Adaptation (LoRA) fine-tuning on 15 ultrasound datasets covering multiple organs, with 3 held-out datasets for testing unseen distributions.

Result: Outperforms state-of-the-art methods (UniverSeg, MedSAM, MedCLIP-SAM, BiomedParse, SAMUS) on most seen datasets and maintains strong performance on unseen datasets without additional fine-tuning.

Conclusion: Demonstrates the promise of vision-language models for scalable and robust ultrasound image analysis, reducing dependence on large organ-specific annotated datasets.

Abstract: Accurate and generalizable object segmentation in ultrasound imaging remains a significant challenge due to anatomical variability, diverse imaging protocols, and limited annotated data. In this study, we propose a prompt-driven vision-language model (VLM) that integrates Grounding DINO with SAM2 to enable object segmentation across multiple ultrasound organs. A total of 18 public ultrasound datasets, encompassing the breast, thyroid, liver, prostate, kidney, and paraspinal muscle, were utilized. These datasets were divided into 15 for fine-tuning and validation of Grounding DINO using Low Rank Adaptation (LoRA) to the ultrasound domain, and 3 were held out entirely for testing to evaluate performance in unseen distributions. Comprehensive experiments demonstrate that our approach outperforms state-of-the-art segmentation methods, including UniverSeg, MedSAM, MedCLIP-SAM, BiomedParse, and SAMUS on most seen datasets while maintaining strong performance on unseen datasets without additional fine-tuning. These results underscore the promise of VLMs in scalable and robust ultrasound image analysis, reducing dependence on large, organ-specific annotated datasets. We will publish our code on code.sonography.ai after acceptance.

[142] LD-RPS: Zero-Shot Unified Image Restoration via Latent Diffusion Recurrent Posterior Sampling

Huaqiu Li, Yong Wang, Tongwen Huang, Hailang Huang, Haoqian Wang, Xiangxiang Chu

Main category: cs.CV

TL;DR: A novel dataset-free unified image restoration method using recurrent posterior sampling with pretrained latent diffusion models and multimodal understanding for semantic priors.

DetailsMotivation: Existing methods are either task-specific (lacking generalizability) or rely on paired datasets (suffering from closed-set constraints), limiting their applicability across various degradation types.

Method: Uses recurrent posterior sampling with pretrained latent diffusion model, incorporates multimodal understanding for semantic priors, employs lightweight module to align degraded input with diffusion model preferences, and uses recurrent refinement.

Result: Extensive experiments show the method outperforms state-of-the-art approaches, demonstrating effectiveness and robustness.

Conclusion: The proposed dataset-free unified approach successfully addresses limitations of existing methods and achieves superior performance across various image restoration tasks.

Abstract: Unified image restoration is a significantly challenging task in low-level vision. Existing methods either make tailored designs for specific tasks, limiting their generalizability across various types of degradation, or rely on training with paired datasets, thereby suffering from closed-set constraints. To address these issues, we propose a novel, dataset-free, and unified approach through recurrent posterior sampling utilizing a pretrained latent diffusion model. Our method incorporates the multimodal understanding model to provide semantic priors for the generative model under a task-blind condition. Furthermore, it utilizes a lightweight module to align the degraded input with the generative preferences of the diffusion model, and employs recurrent refinement for posterior sampling. Extensive experiments demonstrate that our method outperforms state-of-the-art methods, validating its effectiveness and robustness. Our code and data are available at https://github.com/AMAP-ML/LD-RPS.

[143] Enhancing Diffusion Model Stability for Image Restoration via Gradient Management

Hongjie Wu, Mingqin Zhang, Linchao He, Ji-Zhe Zhou, Jiancheng Lv

Main category: cs.CV

TL;DR: SPGD introduces gradient management techniques to stabilize diffusion models for image restoration, addressing conflicts between prior and likelihood gradients through progressive warm-up and adaptive momentum smoothing.

DetailsMotivation: Current diffusion-based image restoration methods suffer from instabilities due to conflicts between denoising prior and likelihood guidance gradients, which disrupt the generative process and degrade restoration performance.

Method: Proposes Stabilized Progressive Gradient Diffusion (SPGD) with two components: 1) progressive likelihood warm-up to mitigate gradient conflicts, and 2) adaptive directional momentum smoothing to reduce likelihood gradient fluctuations.

Result: Extensive experiments show SPGD significantly enhances generation stability and achieves state-of-the-art performance across diverse restoration tasks in both quantitative metrics and visual quality.

Conclusion: SPGD effectively addresses gradient instability issues in diffusion-based restoration, providing a robust framework that improves both stability and performance of image restoration models.

Abstract: Diffusion models have shown remarkable promise for image restoration by leveraging powerful priors. Prominent methods typically frame the restoration problem within a Bayesian inference framework, which iteratively combines a denoising step with a likelihood guidance step. However, the interactions between these two components in the generation process remain underexplored. In this paper, we analyze the underlying gradient dynamics of these components and identify significant instabilities. Specifically, we demonstrate conflicts between the prior and likelihood gradient directions, alongside temporal fluctuations in the likelihood gradient itself. We show that these instabilities disrupt the generative process and compromise restoration performance. To address these issues, we propose Stabilized Progressive Gradient Diffusion (SPGD), a novel gradient management technique. SPGD integrates two synergistic components: (1) a progressive likelihood warm-up strategy to mitigate gradient conflicts; and (2) adaptive directional momentum (ADM) smoothing to reduce fluctuations in the likelihood gradient. Extensive experiments across diverse restoration tasks demonstrate that SPGD significantly enhances generation stability, leading to state-of-the-art performance in quantitative metrics and visually superior results. Code is available at https://github.com/74587887/SPGD.
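
A sketch of the two gradient-management components under assumed schedules and coefficients: a warm-up weight that ramps in the likelihood gradient over the first fraction of steps, plus momentum smoothing of its direction:

```python
def spgd_step(x, prior_grad_fn, lik_grad_fn, m, s, T,
              lr=0.1, beta=0.9, warmup=0.3):
    """One update at step index s of T; m carries the likelihood momentum."""
    g_prior = prior_grad_fn(x, s)
    lam = min(1.0, s / (warmup * T))          # progressive likelihood warm-up
    m = beta * m + (1.0 - beta) * lik_grad_fn(x)  # directional momentum smoothing
    x = x - lr * (g_prior + lam * m)          # combined, conflict-damped update
    return x, m
```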

[144] A Coarse-to-Fine Approach to Multi-Modality 3D Occupancy Grounding

Zhan Shi, Song Wang, Junbo Chen, Jianke Zhu

Main category: cs.CV

TL;DR: A new benchmark and model for 3D occupancy grounding that uses voxel-level annotations instead of bounding boxes for more precise object localization in autonomous driving scenes.

DetailsMotivation: Traditional visual grounding relies on bounding boxes that fail to capture fine-grained details and accurate object representations, as not all voxels within a box are occupied.

Method: GroundingOcc - an end-to-end model with multimodal encoder, occupancy head for voxel predictions, grounding head for localization refinement, plus 2D grounding and depth estimation modules for enhanced geometric understanding.

Result: Extensive experiments show the method outperforms existing baselines on 3D occupancy grounding tasks.

Conclusion: The proposed benchmark and GroundingOcc model provide more precise object perception through voxel-level occupancy annotations and multi-modal learning, advancing visual grounding for autonomous driving applications.

Abstract: Visual grounding aims to identify objects or regions in a scene based on natural language descriptions, essential for spatially aware perception in autonomous driving. However, existing visual grounding tasks typically depend on bounding boxes that often fail to capture fine-grained details. Not all voxels within a bounding box are occupied, resulting in inaccurate object representations. To address this, we introduce a benchmark for 3D occupancy grounding in challenging outdoor scenes. Built on the nuScenes dataset, it integrates natural language with voxel-level occupancy annotations, offering more precise object perception compared to the traditional grounding task. Moreover, we propose GroundingOcc, an end-to-end model designed for 3D occupancy grounding through multi-modal learning. It combines visual, textual, and point cloud features to predict object location and occupancy information from coarse to fine. Specifically, GroundingOcc comprises a multimodal encoder for feature extraction, an occupancy head for voxel-wise predictions, and a grounding head to refine localization. Additionally, a 2D grounding module and a depth estimation module enhance geometric understanding, thereby boosting model performance. Extensive experiments on the benchmark demonstrate that our method outperforms existing baselines on 3D occupancy grounding. The dataset is available at https://github.com/RONINGOD/GroundingOcc.

[145] Thinking With Videos: Multimodal Tool-Augmented Reinforcement Learning for Long Video Reasoning

Haoji Zhang, Xin Gu, Jiawen Li, Chixiang Ma, Sule Bai, Chubin Zhang, Bowen Zhang, Zhichao Zhou, Dongliang He, Yansong Tang

Main category: cs.CV

TL;DR: VITAL is a novel agentic video reasoning framework that uses visual tools for dense frame sampling and multimodal chain-of-thought reasoning to improve long video understanding, outperforming existing methods on 11 benchmarks.

DetailsMotivation: Current text-based chain-of-thought reasoning methods for multimodal LLMs suffer from limited cross-modal interaction and increased hallucination, especially with longer videos or reasoning chains.

Method: Proposes VITAL framework with visual toolbox for on-demand frame sampling, multimodal CoT generation, and Difficulty-aware Group Relative Policy Optimization (DGRPO) to handle multi-task reinforcement learning with difficulty imbalance.

Result: Extensive experiments on 11 challenging video understanding benchmarks demonstrate superior performance in video question answering and temporal grounding, especially for long videos.

Conclusion: VITAL effectively addresses cross-modal interaction limitations and hallucination issues in video reasoning, showing that temporal grounding and question answering are mutually beneficial for comprehensive video understanding.

Abstract: The video reasoning ability of multimodal large language models (MLLMs) is crucial for downstream tasks like video question answering and temporal grounding. While recent approaches have explored text-based chain-of-thought (CoT) reasoning for MLLMs, these methods often suffer from limited cross-modal interaction and increased hallucination, especially with longer videos or reasoning chains. To address these challenges, we propose Video Intelligence via Tool-Augmented Learning (VITAL), a novel end-to-end agentic video reasoning framework. With a visual toolbox, the model can densely sample new video frames on demand and generate multimodal CoT for precise long video reasoning. We observe that temporal grounding and question answering are mutually beneficial for video understanding tasks. Therefore, we construct two high-quality multi-task video reasoning datasets MTVR-CoT-72k for supervised fine-tuning and MTVR-RL-110k for reinforcement learning. Moreover, we propose a Difficulty-aware Group Relative Policy Optimization algorithm (DGRPO) to mitigate difficulty imbalance in multi-task reinforcement learning. Extensive experiments on 11 challenging video understanding benchmarks demonstrate the advanced reasoning ability of VITAL, outperforming existing methods in video question answering and temporal grounding tasks, especially in long video scenarios. Code is available at https://zhang9302002.github.io/thinkingwithvideos-page/.
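
Reading DGRPO as GRPO with a per-task difficulty reweighting (an assumption for illustration; the paper's exact rule may differ), the advantage computation might look like:

```python
import torch

def dgrpo_advantages(rewards: torch.Tensor, difficulty: torch.Tensor):
    """rewards: (groups, samples); difficulty: (groups,) in [0, 1]."""
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True).clamp_min(1e-6)
    adv = (rewards - mean) / std                  # standard group-relative term
    return adv * (1.0 + difficulty.unsqueeze(1))  # upweight harder tasks
```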

[146] D2-Mamba: Dual-Scale Fusion and Dual-Path Scanning with SSMs for Shadow Removal

Linhao Li, Boya Jin, Zizhe Li, Lanqing Guo, Hao Cheng, Bo Li, Yongfeng Dong

Main category: cs.CV

TL;DR: A novel Mamba-based network for shadow removal that uses dual-scale fusion and dual-path scanning to selectively propagate contextual information based on transformation similarity across regions.

DetailsMotivation: Shadow removal requires leveraging information from non-shadow regions to guide restoration, but the transformation needed for shadowed areas differs significantly from well-lit regions, making uniform correction strategies ineffective.

Method: Proposes Dual-Scale Fusion Mamba Block (DFMB) for multi-scale feature representation and Dual-Path Mamba Group (DPMG) with horizontal scanning and mask-aware adaptive scanning strategy to capture global features and improve structural continuity.

Result: The method significantly outperforms existing state-of-the-art approaches on shadow removal benchmarks.

Conclusion: The proposed Mamba-based network with dual-scale fusion and dual-path scanning effectively addresses the challenges of shadow removal by selectively integrating contextual information and modeling region-specific transformations.

Abstract: Shadow removal aims to restore images that are partially degraded by shadows, where the degradation is spatially localized and non-uniform. Unlike general restoration tasks that assume global degradation, shadow removal can leverage abundant information from non-shadow regions for guidance. However, the transformation required to correct shadowed areas often differs significantly from that of well-lit regions, making it challenging to apply uniform correction strategies. This necessitates the effective integration of non-local contextual cues and adaptive modeling of region-specific transformations. To this end, we propose a novel Mamba-based network featuring dual-scale fusion and dual-path scanning to selectively propagate contextual information based on transformation similarity across regions. Specifically, the proposed Dual-Scale Fusion Mamba Block (DFMB) enhances multi-scale feature representation by fusing original features with low-resolution features, effectively reducing boundary artifacts. The Dual-Path Mamba Group (DPMG) captures global features via horizontal scanning and incorporates a mask-aware adaptive scanning strategy, which improves structural continuity and fine-grained region modeling. Experimental results demonstrate that our method significantly outperforms existing state-of-the-art approaches on shadow removal benchmarks.

[147] AIM 2025 Rip Current Segmentation (RipSeg) Challenge Report

Andrei Dumitriu, Florin Miron, Florin Tatui, Radu Tudor Ionescu, Radu Timofte, Aakash Ralhan, Florin-Alexandru Vasluianu, Shenyang Qian, Mitchell Harley, Imran Razzak, Yang Song, Pu Luo, Yumei Li, Cong Xu, Jinming Chai, Kexin Zhang, Licheng Jiao, Lingling Li, Siqi Yu, Chao Zhang, Kehuan Song, Fang Liu, Puhua Chen, Xu Liu, Jin Hu, Jinyang Xu, Biao Liu

Main category: cs.CV

TL;DR: The AIM 2025 RipSeg Challenge focused on advancing automatic rip current segmentation using the RipVIS dataset, with 75 participants and 5 valid submissions evaluated on composite metrics combining F1, F2, and AP scores.

DetailsMotivation: Rip currents are dangerous, fast-moving flows that pose major risks to beach safety worldwide, making accurate visual detection an important and underexplored research task.

Method: The challenge used the largest available rip current dataset (RipVIS) for single-class instance segmentation, with diverse locations, rip current types, and camera orientations. Teams leveraged deep learning architectures, domain adaptation techniques, pretrained models, and domain generalization strategies.

Result: 75 participants registered with 5 valid test submissions. Top-performing methods were evaluated on a composite score combining F1, F2, AP50, and AP[50:95] metrics, ensuring robust and application-relevant rankings.

Conclusion: The report provides insights into current rip current segmentation capabilities, discusses key challenges and lessons learned from submissions, and outlines future directions for expanding the RipSeg initiative to improve beach safety through better detection technology.

Abstract: This report presents an overview of the AIM 2025 RipSeg Challenge, a competition designed to advance techniques for automatic rip current segmentation in still images. Rip currents are dangerous, fast-moving flows that pose a major risk to beach safety worldwide, making accurate visual detection an important and underexplored research task. The challenge builds on RipVIS, the largest available rip current dataset, and focuses on single-class instance segmentation, where precise delineation is critical to fully capture the extent of rip currents. The dataset spans diverse locations, rip current types, and camera orientations, providing a realistic and challenging benchmark. In total, 75 participants registered for this first edition, resulting in 5 valid test submissions. Teams were evaluated on a composite score combining $F_1$, $F_2$, $AP_{50}$, and $AP_{[50:95]}$, ensuring robust and application-relevant rankings. The top-performing methods leveraged deep learning architectures, domain adaptation techniques, pretrained models, and domain generalization strategies to improve performance under diverse conditions. This report outlines the dataset details, competition framework, evaluation metrics, and final results, providing insights into the current state of rip current segmentation. We conclude with a discussion of key challenges, lessons learned from the submissions, and future directions for expanding RipSeg.
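
For reference, the F-beta arithmetic behind the reported metrics, plus a composite that averages the four scores; the equal weighting is an assumption, as the report does not state the exact combination:

```python
def f_beta(precision: float, recall: float, beta: float) -> float:
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall + 1e-12)

def composite(precision, recall, ap50, ap50_95):
    f1 = f_beta(precision, recall, 1.0)
    f2 = f_beta(precision, recall, 2.0)  # recall-weighted, relevant for safety
    return (f1 + f2 + ap50 + ap50_95) / 4.0
```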

[148] Performance is not All You Need: Sustainability Considerations for Algorithms

Xiang Li, Chong Zhang, Hongpeng Wang, Shreyank Narayana Gowda, Yushi Li, Xiaobo Jin

Main category: cs.CV

TL;DR: Proposes a two-dimensional sustainability evaluation system with FMS and ASC metrics to balance AI performance and energy consumption, tested across multiple multimodal tasks.

DetailsMotivation: Address high carbon emissions from deep learning training by moving beyond single performance metrics to include energy efficiency considerations.

Method: Developed two quantitative indicators: Sustainable Harmonic Mean (FMS) integrating energy consumption and performance, and Area Under Sustainability Curve (ASC) characterizing energy efficiency throughout algorithm lifecycle. Built benchmarks across various multimodal tasks.

Result: The evaluation system successfully provides quantitative basis for cross-task algorithm assessment and facilitates transition from theoretical green AI research to practical implementation.

Conclusion: The proposed sustainability framework offers methodological support for establishing industry standards on algorithm energy efficiency, promoting environmentally conscious AI development.

Abstract: This work focuses on the high carbon emissions generated by deep learning model training, specifically addressing the core challenge of balancing algorithm performance and energy consumption. It proposes an innovative two-dimensional sustainability evaluation system. Unlike the traditional single performance-oriented evaluation paradigm, this study pioneers two quantitative indicators that integrate energy efficiency ratio and accuracy: the sustainable harmonic mean (FMS) integrates accumulated energy consumption and performance parameters through the harmonic mean to reveal algorithm performance per unit of energy consumption; the area under the sustainability curve (ASC) constructs a performance-power consumption curve to characterize the energy efficiency of the algorithm throughout its life cycle. To verify the universality of the indicator system, the study constructed benchmarks in various multimodal tasks, including image classification, segmentation, pose estimation, and batch and online learning. Experiments demonstrate that the system can provide a quantitative basis for evaluating cross-task algorithms and promote the transition of green AI research from theory to practice. Our sustainability evaluation framework code can be found here, providing methodological support for the industry to establish algorithm energy efficiency standards.
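
Under one plausible reading of the definitions (the exact normalizations are assumptions not given in the summary), the two indicators reduce to a harmonic mean and a trapezoid-rule area:

```python
import numpy as np

def fms(performance: float, energy_kwh: float) -> float:
    efficiency = 1.0 / (1.0 + energy_kwh)  # maps accumulated energy to (0, 1]
    return 2 * performance * efficiency / (performance + efficiency + 1e-12)

def asc(power_watts: np.ndarray, performance: np.ndarray) -> float:
    order = np.argsort(power_watts)         # sweep along the power axis
    x, y = power_watts[order], performance[order]
    return float(np.sum((x[1:] - x[:-1]) * (y[1:] + y[:-1]) / 2.0))  # trapezoid rule
```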

[149] Scaffold Diffusion: Sparse Multi-Category Voxel Structure Generation with Discrete Diffusion

Justin Jung

Main category: cs.CV

TL;DR: Scaffold Diffusion is a generative model that uses discrete diffusion language models to generate realistic sparse multi-category 3D voxel structures, overcoming challenges of cubic memory scaling and class imbalance from sparsity.

DetailsMotivation: Generating realistic sparse multi-category 3D voxel structures is difficult due to cubic memory scaling of voxels and significant class imbalance caused by sparsity.

Method: Treats voxels as tokens and uses a discrete diffusion language model to generate 3D voxel structures, extending beyond sequential domains to create spatially coherent 3D structures.

Result: Outperforms prior baselines and auto-regressive formulations, producing realistic and coherent structures even when trained on data with over 98% sparsity, as demonstrated on Minecraft house structures from 3D-Craft dataset.

Conclusion: Discrete diffusion language models can be successfully applied to generate spatially coherent 3D structures, providing an effective solution for sparse multi-category voxel generation.

Abstract: Generating realistic sparse multi-category 3D voxel structures is difficult due to the cubic memory scaling of voxel structures and the significant class imbalance caused by sparsity. We introduce Scaffold Diffusion, a generative model designed for sparse multi-category 3D voxel structures. By treating voxels as tokens, Scaffold Diffusion uses a discrete diffusion language model to generate 3D voxel structures. We show that discrete diffusion language models can be extended beyond inherently sequential domains such as text to generate spatially coherent 3D structures. We evaluate on Minecraft house structures from the 3D-Craft dataset and demonstrate that, unlike prior baselines and an auto-regressive formulation, Scaffold Diffusion produces realistic and coherent structures even when trained on data with over 98% sparsity. We provide an interactive viewer where readers can visualize generated samples and the generation process: https://scaffold.deepexploration.org/

[150] A Multimodal and Multi-centric Head and Neck Cancer Dataset for Tumor Segmentation and Outcome Prediction

Numan Saeed, Salma Hassan, Shahad Hardan, Ahmed Aly, Darya Taratynova, Umair Nawaz, Ufaq Khan, Muhammad Ridzuan, Vincent Andrearczyk, Adrien Depeursinge, Mathieu Hatt, Thomas Eugene, Raphaël Metz, Mélanie Dore, Gregory Delpon, Vijay Ram Kumar Papineni, Kareem Wahid, Cem Dede, Alaa Mohamed Shawky Ali, Carlos Sjogreen, Mohamed Naser, Clifton D. Fuller, Valentin Oreiller, Mario Jreige, John O. Prior, Catherine Cheze Le Rest, Olena Tankyevych, Pierre Decazes, Su Ruan, Stephanie Tanadini-Lang, Martin Vallières, Hesham Elhalawani, Ronan Abgral, Romain Floch, Kevin Kerleguer, Ulrike Schick, Maelle Mauguen, Arman Rahmim, Mohammad Yaqub

Main category: cs.CV

TL;DR: A publicly available multimodal PET/CT dataset of 1123 head and neck cancer studies from 10 international centers, with expert annotations, clinical metadata, and benchmark results for automated segmentation, survival prediction, and HPV classification.

DetailsMotivation: To provide a comprehensive, standardized multimodal dataset for head and neck cancer research that reflects real-world clinical diversity across institutions and enables development of AI models for key clinical tasks.

Method: Collection of 1123 FDG-PET/CT studies from 10 international medical centers with varying acquisition protocols. Expert manual segmentation of tumor volumes by radiation oncologists and radiologists following standardized guidelines. Dataset includes NifTi files, segmentation masks, radiotherapy dose data, and comprehensive clinical metadata.

Result: Created a large-scale annotated dataset with 1123 studies, providing benchmark performance using state-of-the-art deep learning models (UNet, SegResNet, multimodal frameworks) for three clinical tasks: automated tumor segmentation, recurrence-free survival prediction, and HPV status classification.

Conclusion: This publicly available dataset serves as a valuable resource for head and neck cancer research, enabling development and validation of AI models for automated segmentation, prognostic prediction, and biomarker classification in real-world clinical settings.

Abstract: We describe a publicly available multimodal dataset of annotated Positron Emission Tomography/Computed Tomography (PET/CT) studies for head and neck cancer research. The dataset includes 1123 FDG-PET/CT studies from patients with histologically confirmed head and neck cancer, acquired from 10 international medical centers. All examinations consisted of co-registered PET/CT scans with varying acquisition protocols, reflecting real-world clinical diversity across institutions. Primary gross tumor volumes (GTVp) and involved lymph nodes (GTVn) were manually segmented by experienced radiation oncologists and radiologists following standardized guidelines and quality control measures. We provide anonymized NifTi files of all studies, along with expert-annotated segmentation masks, radiotherapy dose distribution for a subset of patients, and comprehensive clinical metadata. This metadata includes TNM staging, HPV status, demographics (age and gender), long-term follow-up outcomes, survival times, censoring indicators, and treatment information. We demonstrate how this dataset can be used for three key clinical tasks: automated tumor segmentation, recurrence-free survival prediction, and HPV status classification, providing benchmark results using state-of-the-art deep learning models, including UNet, SegResNet, and multimodal prognostic frameworks.

[151] C-DiffDet+: Fusing Global Scene Context with Generative Denoising for High-Fidelity Object Detection

Abdellah Zakaria Sellam, Ilyes Benaissa, Salah Eddine Bekhouche, Abdenour Hadid, Vito Renó, Cosimo Distante

Main category: cs.CV

TL;DR: Introduces Context-Aware Fusion (CAF) to improve fine-grained object detection by integrating global scene context with local features using cross-attention mechanisms, outperforming state-of-the-art models on vehicle damage assessment tasks.

DetailsMotivation: Fine-grained object detection in challenging domains like vehicle damage assessment is difficult even for human experts. While DiffusionDet advanced the field, its performance is limited by local feature conditioning in context-dependent scenarios.

Method: Proposes Context-Aware Fusion (CAF) that uses cross-attention mechanisms to integrate global scene context (captured by a separate dedicated encoder) with local proposal features, enabling each object proposal to attend to comprehensive environmental information.

Result: Experimental results show improvement over state-of-the-art models on the CarDD benchmark, establishing new performance benchmarks for context-aware object detection in fine-grained domains.

Conclusion: The framework significantly enhances the generative detection paradigm by enabling object proposals to leverage comprehensive environmental context, addressing fundamental limitations of previous approaches in context-dependent scenarios.

Abstract: Fine-grained object detection in challenging visual domains, such as vehicle damage assessment, presents a formidable challenge even for human experts to resolve reliably. While DiffusionDet has advanced the state-of-the-art through conditional denoising diffusion, its performance remains limited by local feature conditioning in context-dependent scenarios. We address this fundamental limitation by introducing Context-Aware Fusion (CAF), which leverages cross-attention mechanisms to integrate global scene context with local proposal features directly. The global context is generated using a separate dedicated encoder that captures comprehensive environmental information, enabling each object proposal to attend to scene-level understanding and significantly enhancing the generative detection paradigm. Experimental results demonstrate an improvement over state-of-the-art models on the CarDD benchmark, establishing new performance benchmarks for context-aware object detection in fine-grained domains.
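
The fusion itself is a standard cross-attention pattern: proposal features query a scene-context sequence produced by the dedicated encoder. A sketch with illustrative dimensions and a single-layer design:

```python
import torch
import torch.nn as nn

class ContextAwareFusion(nn.Module):
    def __init__(self, dim: int = 256, heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, proposals, scene_ctx):
        """proposals: (B, N, D) local features; scene_ctx: (B, S, D) context."""
        fused, _ = self.attn(query=proposals, key=scene_ctx, value=scene_ctx)
        return self.norm(proposals + fused)  # residual keeps local evidence
```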

[152] Multimodal Iterative RAG for Knowledge Visual Question Answering

Changin Choi, Wonseok Lee, Jungmin Ko, Wonjong Rhee

Main category: cs.CV

TL;DR: MI-RAG is a Multimodal Iterative RAG framework that enhances knowledge retrieval and reasoning for visual question answering by using iterative query formulation and joint search across multimodal knowledge bases.

DetailsMotivation: Conventional single-pass RAG frameworks often fail to gather sufficient knowledge for knowledge-intensive visual questions that require external knowledge beyond the image content.

Method: Proposes MI-RAG framework that uses iterative reasoning: accumulates reasoning records, formulates multi-queries dynamically, performs joint search across visually-grounded and textual knowledge bases, and synthesizes new knowledge into reasoning records across iterations.

Result: Significant improvements in both retrieval recall and answer accuracy on challenging benchmarks including Encyclopedic VQA, InfoSeek, and OK-VQA.

Conclusion: MI-RAG establishes a scalable approach for compositional reasoning in knowledge-intensive visual question answering, overcoming limitations of conventional single-pass RAG frameworks.

Abstract: While Multimodal Large Language Models (MLLMs) have significantly advanced multimodal understanding, their performance remains limited on knowledge-intensive visual questions that require external knowledge beyond the image. Retrieval-Augmented Generation (RAG) has become a promising solution for providing models with external knowledge, but its conventional single-pass framework often fails to gather sufficient knowledge. To overcome this limitation, we propose MI-RAG, a Multimodal Iterative RAG framework that leverages reasoning to enhance retrieval and update reasoning over newly retrieved knowledge across modalities. At each iteration, MI-RAG leverages an accumulated reasoning record to dynamically formulate a multi-query. These queries then drive a joint search across heterogeneous knowledge bases containing both visually-grounded and textual knowledge. The newly acquired knowledge is synthesized into the reasoning record, progressively refining understanding across iterations. Experiments on challenging benchmarks, including Encyclopedic VQA, InfoSeek, and OK-VQA, show that MI-RAG significantly improves both retrieval recall and answer accuracy, establishing a scalable approach for compositional reasoning in knowledge-intensive VQA.
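
The iteration structure can be sketched with the reasoner, query generator, and the two retrievers left as injected placeholders (all names here are illustrative; `reason` is assumed to return a dict that includes an "answer" key once confident):

```python
def mi_rag(question, image, reason, make_queries, search_visual, search_text,
           max_iters: int = 3):
    record = []                                  # accumulated reasoning record
    for _ in range(max_iters):
        queries = make_queries(question, image, record)  # dynamic multi-query
        hits = []
        for q in queries:                        # joint heterogeneous search
            hits += search_visual(q) + search_text(q)
        step = reason(question, image, record, hits)  # synthesize new knowledge
        record.append(step)
        if step.get("answer"):                   # stop once confident
            return step["answer"], record
    return reason(question, image, record, [])["answer"], record
```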

[153] UPGS: Unified Pose-aware Gaussian Splatting for Dynamic Scene Deblurring

Zhijing Wu, Longguang Wang

Main category: cs.CV

TL;DR: A unified optimization framework that jointly optimizes camera poses and 3D Gaussian attributes to handle motion blur in dynamic 3D scene reconstruction from monocular video.

DetailsMotivation: Existing methods fail with motion blur because they use a two-step pipeline where pose estimation errors accumulate and degrade 3D reconstruction quality. Motion blur undermines pose estimation, leading to inferior results.

Method: Proposes a unified end-to-end optimization framework that treats camera poses as learnable parameters alongside 3DGS attributes. Models camera and object motion as per-primitive SE(3) affine transformations on 3D Gaussians. Uses a three-stage training schedule: 1) optimize Gaussians with fixed poses, 2) optimize poses with fixed Gaussians, 3) jointly optimize all parameters.

Result: Achieves significant improvements in reconstruction quality and pose estimation accuracy on the Stereo Blur dataset and challenging real-world sequences compared to prior dynamic deblurring methods.

Conclusion: The unified optimization framework successfully addresses motion blur issues in dynamic 3D scene reconstruction by jointly optimizing camera poses and 3D Gaussian attributes, outperforming existing two-step approaches.

Abstract: Reconstructing dynamic 3D scenes from monocular video has broad applications in AR/VR, robotics, and autonomous navigation, but often fails due to severe motion blur caused by camera and object motion. Existing methods commonly follow a two-step pipeline, where camera poses are first estimated and then 3D Gaussians are optimized. Since blurring artifacts usually undermine pose estimation, pose errors accumulate and degrade the reconstruction. To address this issue, we introduce a unified optimization framework by incorporating camera poses as learnable parameters complementary to 3DGS attributes for end-to-end optimization. Specifically, we recast camera and object motion as per-primitive SE(3) affine transformations on 3D Gaussians and formulate a unified optimization objective. For stable optimization, we introduce a three-stage training schedule that optimizes camera poses and Gaussians alternately. In particular, the 3D Gaussians are first trained with poses fixed, then the poses are optimized with the Gaussians fixed. Finally, all learnable parameters are optimized together. Extensive experiments on the Stereo Blur dataset and challenging real-world sequences demonstrate that our method achieves significant gains in reconstruction quality and pose estimation accuracy over prior dynamic deblurring methods.
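
A toy PyTorch sketch of the three-stage schedule follows; the tensors and the placeholder loss are assumptions, and only the freeze/unfreeze pattern reflects the paper's description.

```python
import torch

gaussians = torch.randn(100, 14, requires_grad=True)  # toy 3DGS attributes
poses = torch.zeros(30, 6, requires_grad=True)        # per-frame se(3) parameters

def render_loss(g, p):
    return (g.mean() + p.mean()) ** 2                  # placeholder photometric loss

stages = [([gaussians], 1000),          # stage 1: Gaussians only, poses fixed
          ([poses], 500),               # stage 2: poses only, Gaussians fixed
          ([gaussians, poses], 2000)]   # stage 3: joint refinement

for params, steps in stages:
    opt = torch.optim.Adam(params, lr=1e-3)            # only `params` get updated
    for _ in range(steps):
        for p in (gaussians, poses):
            p.grad = None                              # clear stale gradients
        render_loss(gaussians, poses).backward()
        opt.step()
```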

[154] Spotlighter: Revisiting Prompt Tuning from a Representative Mining View

Yutong Gao, Maoyuan Shao, Xinyang Huang, Chuang Zhu, Lijuan Sun, Yu Weng, Xuan Liu, Guoshun Nan

Main category: cs.CV

TL;DR: Spotlighter is a lightweight token-selection framework that improves CLIP’s prompt tuning by selecting only the most relevant visual tokens, achieving better accuracy and efficiency with minimal extra parameters.

DetailsMotivation: CLIP's prompt tuning achieves cross-modal alignment but suffers from redundant features that introduce noise and computational inefficiency. There's a need to enhance both accuracy and efficiency by focusing on the most informative visual components.

Method: Spotlighter evaluates visual tokens from sample-wise and semantic-wise perspectives, retaining only top-scoring tokens. It uses a class-specific semantic memory bank with learned prototypes and a two-level ranking mechanism to dynamically weight token-prototype interactions.

Result: Across 11 few-shot benchmarks, Spotlighter outperforms CLIP by up to 11.19% in harmonic mean accuracy and achieves up to 0.8K additional FPS, with only 21 extra parameters.

Conclusion: Spotlighter establishes an effective and scalable baseline for prompt tuning, demonstrating significant improvements in both accuracy and computational efficiency through intelligent token selection.

Abstract: CLIP’s success has demonstrated that prompt tuning can achieve robust cross-modal semantic alignment for tasks ranging from open-domain recognition to fine-grained classification. However, redundant or weakly relevant feature components introduce noise and incur unnecessary computational costs. In this work, we propose Spotlighter, a lightweight token-selection framework that simultaneously enhances accuracy and efficiency in prompt tuning. Spotlighter evaluates each visual token’s activation from both sample-wise and semantic-wise perspectives and retains only the top-scoring tokens for downstream prediction. A class-specific semantic memory bank of learned prototypes refines this selection, ensuring semantic representativeness and compensating for discarded features. To further prioritize informative signals, we introduce a two-level ranking mechanism that dynamically weights token–prototype interactions. Across 11 few-shot benchmarks, Spotlighter outperforms CLIP by up to 11.19% in harmonic mean accuracy and achieves up to 0.8K additional FPS, with only 21 extra parameters. These results establish Spotlighter as an effective and scalable baseline for prompt tuning. Code for our method will be available at https://github.com/greatest-gourmet/Spotlighter.
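
The core selection step can be illustrated with a short, self-contained sketch; the tensor shapes and the way the two scores are combined are assumptions, not the released implementation.

```python
import torch

def select_tokens(tokens, prototypes, k=16):
    # tokens: (N, D) visual tokens; prototypes: (C, D) class prototypes
    sample_score = tokens.norm(dim=-1)                           # sample-wise activation
    semantic_score = (tokens @ prototypes.T).max(dim=-1).values  # best prototype match
    score = sample_score + semantic_score                        # two-level ranking (simplified)
    top = score.topk(k).indices
    return tokens[top]                                           # only top-k tokens survive

tokens = torch.randn(196, 512)       # e.g. ViT patch tokens
prototypes = torch.randn(10, 512)    # learned class prototypes
print(select_tokens(tokens, prototypes).shape)  # torch.Size([16, 512])
```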

[155] CompSlider: Compositional Slider for Disentangled Multiple-Attribute Image Generation

Zixin Zhu, Kevin Duarte, Mamshad Nayeem Rizve, Chengyuan Xu, Ratheesh Kalarot, Junsong Yuan

Main category: cs.CV

TL;DR: CompSlider enables simultaneous control of multiple image attributes in text-to-image generation using disentangled sliders, avoiding attribute interference while maintaining structural consistency.

DetailsMotivation: Existing slider-based methods train individual adapters per attribute, causing interference between attributes and preventing precise multi-attribute control.

Method: CompSlider generates a conditional prior for T2I foundation models with novel disentanglement and structure losses to compose multiple attribute changes while preserving image structure, operating in latent space without retraining the base model.

Result: The approach successfully controls multiple attributes simultaneously without interference, maintains structural consistency, reduces computational burden, and generalizes to video generation.

Conclusion: CompSlider provides an effective solution for disentangled multi-attribute control in T2I generation with improved reliability and computational efficiency.

Abstract: In text-to-image (T2I) generation, achieving fine-grained control over attributes - such as age or smile - remains challenging, even with detailed text prompts. Slider-based methods offer a solution for precise control of image attributes. Existing approaches typically train an individual adapter for each attribute separately, overlooking the entanglement among multiple attributes. As a result, interference occurs among different attributes, preventing precise control of multiple attributes together. To address this challenge, we aim to disentangle multiple attributes in slider-based generation to enable more reliable and independent attribute manipulation. Our approach, CompSlider, can generate a conditional prior for the T2I foundation model to control multiple attributes simultaneously. Furthermore, we introduce novel disentanglement and structure losses to compose multiple attribute changes while maintaining structural consistency within the image. Since CompSlider operates in the latent space of the conditional prior and does not require retraining the foundation model, it reduces the computational burden for both training and inference. We evaluate our approach on a variety of image attributes and highlight its generality by extending to video generation.

[156] Novel Category Discovery with X-Agent Attention for Open-Vocabulary Semantic Segmentation

Jiahao Li, Yang Lu, Yachao Zhang, Fangyong Wang, Yuan Xie, Yanyun Qu

Main category: cs.CV

TL;DR: X-Agent is a novel open-vocabulary semantic segmentation framework that uses latent semantic-aware agents to optimize cross-modal attention, achieving state-of-the-art performance by enhancing latent semantic saliency.

DetailsMotivation: Address the domain discrepancy between base category training and open-vocabulary inference in semantic segmentation, and overcome the underexplored mechanisms of latent semantic comprehension in vision-language models.

Method: Proposes X-Agent framework with latent semantic-aware agents that orchestrate cross-modal attention mechanisms to simultaneously optimize latent semantic dynamics and amplify perceptibility.

Result: Extensive benchmark evaluations demonstrate state-of-the-art performance while effectively enhancing latent semantic saliency.

Conclusion: X-Agent successfully addresses the challenges of open-vocabulary semantic segmentation by improving latent semantic comprehension through innovative agent-based cross-modal attention mechanisms.

Abstract: Open-vocabulary semantic segmentation (OVSS) conducts pixel-level classification via text-driven alignment, where the domain discrepancy between base category training and open-vocabulary inference poses challenges for discriminative modeling of latent unseen categories. To address this challenge, existing vision-language model (VLM)-based approaches demonstrate commendable performance through pre-trained multi-modal representations. However, the fundamental mechanisms of latent semantic comprehension remain underexplored, which makes them the bottleneck for OVSS. In this work, we conduct a probing experiment to explore distribution patterns and dynamics of latent semantics in VLMs under inductive learning paradigms. Building on these insights, we propose X-Agent, an innovative OVSS framework employing latent semantic-aware "agents" to orchestrate cross-modal attention mechanisms, simultaneously optimizing latent semantic dynamics and amplifying their perceptibility. Extensive benchmark evaluations demonstrate that X-Agent achieves state-of-the-art performance while effectively enhancing latent semantic saliency.

[157] MSA2-Net: Utilizing Self-Adaptive Convolution Module to Extract Multi-Scale Information in Medical Image Segmentation

Chao Deng, Xiaosen Li, Xiao Qin

Main category: cs.CV

TL;DR: MSA2-Net introduces a Self-Adaptive Convolution Module that dynamically adjusts kernel sizes based on dataset characteristics, improving medical image segmentation performance across multiple datasets.

DetailsMotivation: nnUNet automatically tunes most hyperparameters but overlooks internal network hyperparameters, limiting model generalization. This study addresses this limitation to improve segmentation accuracy.

Method: Developed a Self-Adaptive Convolution Module that dynamically adjusts convolution kernel sizes. Integrated this module into MSA2-Net’s Multi-Scale Convolution Bridge and Multi-Scale Amalgamation Decoder to capture both global and local features while eliminating redundant data.

Result: Achieved Dice coefficients of 86.49% (Synapse), 92.56% (ACDC), 93.37% (Kvasir), and 92.98% (ISIC2017), demonstrating superior performance across multiple medical image segmentation datasets.

Conclusion: MSA2-Net with the Self-Adaptive Convolution Module shows robust and precise medical image segmentation capabilities, effectively addressing the generalization limitations of previous frameworks like nnUNet.

Abstract: The nnUNet segmentation framework adeptly adjusts most hyperparameters in training scripts automatically, but it overlooks the tuning of internal hyperparameters within the segmentation network itself, which constrains the model’s ability to generalize. Addressing this limitation, this study presents a novel Self-Adaptive Convolution Module that dynamically adjusts the size of the convolution kernels depending on the unique fingerprints of different datasets. This adjustment enables MSA2-Net, when equipped with this module, to proficiently capture both global and local features within the feature maps. The Self-Adaptive Convolution Module is strategically integrated into two key components of MSA2-Net: the Multi-Scale Convolution Bridge and the Multi-Scale Amalgamation Decoder. In the MSConvBridge, the module enhances the ability to refine outputs from various stages of the CSWin Transformer during the skip connections, effectively eliminating redundant data that could potentially impair the decoder’s performance. Simultaneously, the MSADecoder, utilizing the module, excels at capturing detailed information about organs of varying sizes during the decoding phase. This capability ensures that the decoder’s output closely reproduces the intricate details within the feature maps, thus yielding highly accurate segmentation images. MSA2-Net, bolstered by this advanced architecture, has demonstrated exceptional performance, achieving Dice coefficient scores of 86.49%, 92.56%, 93.37%, and 92.98% on the Synapse, ACDC, Kvasir, and Skin Lesion Segmentation (ISIC2017) datasets, respectively. This underscores MSA2-Net’s robustness and precision in medical image segmentation tasks across various datasets.
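
As a rough illustration of dataset-conditioned kernel sizing, consider the sketch below; the mapping from a dataset fingerprint (here, median object size in pixels) to a kernel size is invented for the example and is not the paper's actual rule.

```python
import torch
import torch.nn as nn

def adaptive_conv(in_ch, out_ch, median_object_px):
    # Larger structures -> larger receptive field; always pick an odd kernel.
    k = 3 if median_object_px < 32 else 5 if median_object_px < 96 else 7
    return nn.Conv2d(in_ch, out_ch, kernel_size=k, padding=k // 2)

conv = adaptive_conv(64, 64, median_object_px=120)  # e.g. large organs -> 7x7
x = torch.randn(1, 64, 128, 128)
print(conv(x).shape)  # torch.Size([1, 64, 128, 128])
```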

[158] HydroVision: Predicting Optically Active Parameters in Surface Water Using Computer Vision

Shubham Laxmikant Deshmukh, Matthew Wilchek, Feras A. Batarseh

Main category: cs.CV

TL;DR: HydroVision is a deep learning framework that uses RGB images to estimate 6 water quality parameters, achieving R2=0.89 for CDOM prediction using DenseNet121 architecture.

DetailsMotivation: Advancements in computer vision enable non-contact water quality monitoring for disaster response and public health protection, providing a cost-effective alternative to traditional spectral sensing methods.

Method: Developed HydroVision framework using transfer learning with 5 architectures (VGG-16, ResNet50, MobileNetV2, DenseNet121, Vision Transformer) trained on 500,000+ USGS images from 2022-2024 to predict water quality parameters from RGB images.

Result: DenseNet121 achieved the best performance with R2 score of 0.89 for CDOM prediction, demonstrating effective water quality parameter estimation from standard RGB imagery.

Conclusion: The framework shows promise for real-world water quality monitoring but requires future improvements for low-light and obstructed scenarios to expand operational utility.

Abstract: Ongoing advancements in computer vision, particularly in pattern recognition and scene classification, have enabled new applications in environmental monitoring. Deep learning now offers non-contact methods for assessing water quality and detecting contamination, both critical for disaster response and public health protection. This work introduces HydroVision, a deep learning-based scene classification framework that estimates optically active water quality parameters including Chlorophyll-Alpha, Chlorophylls, Colored Dissolved Organic Matter (CDOM), Phycocyanins, Suspended Sediments, and Turbidity from standard Red-Green-Blue (RGB) images of surface water. HydroVision supports early detection of contamination trends and strengthens monitoring by regulatory agencies during external environmental stressors, industrial activities, and force majeure events. The model is trained on more than 500,000 seasonally varied images collected from the United States Geological Survey Hydrologic Imagery Visualization and Information System between 2022 and 2024. This approach leverages widely available RGB imagery as a scalable, cost-effective alternative to traditional multispectral and hyperspectral remote sensing. Four state-of-the-art convolutional neural networks (VGG-16, ResNet50, MobileNetV2, DenseNet121) and a Vision Transformer are evaluated through transfer learning to identify the best-performing architecture. DenseNet121 achieves the highest validation performance, with an R2 score of 0.89 in predicting CDOM, demonstrating the framework’s promise for real-world water quality monitoring across diverse conditions. While the current model is optimized for well-lit imagery, future work will focus on improving robustness under low-light and obstructed scenarios to expand its operational utility.
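
A minimal transfer-learning setup of the kind described (a DenseNet121 backbone with a regression head over the six parameters) might look as follows; the head size and training details are assumptions for illustration.

```python
import torch
import torch.nn as nn
from torchvision import models

model = models.densenet121(weights=None)  # or load ImageNet-pretrained weights
# Swap the ImageNet classifier for a 6-way regression head (one output per
# optically active parameter; an assumed design, not the paper's exact head).
model.classifier = nn.Linear(model.classifier.in_features, 6)

x = torch.randn(2, 3, 224, 224)           # RGB surface-water images
print(model(x).shape)                     # torch.Size([2, 6])
```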

[159] Discrete Noise Inversion for Next-scale Autoregressive Text-based Image Editing

Quan Dao, Xiaoxiao He, Ligong Han, Ngan Hoai Nguyen, Amin Heyrani Nobar, Faez Ahmed, Han Zhang, Viet Anh Nguyen, Dimitris Metaxas

Main category: cs.CV

TL;DR: VARIN is a novel noise inversion-based editing method for Visual Autoregressive models that enables precise text-guided image editing without additional training, using inverse Gumbel noises for accurate reconstruction and targeted modifications.

DetailsMotivation: While VAR models show strong text-to-image generation capabilities, their ability to perform prompt-guided image editing without retraining remains unexplored but is crucial for real-world applications.

Method: Introduces Visual AutoRegressive Inverse Noise (VARIN) with Location-aware Argmax Inversion (LAI) - a pseudo-inverse function for argmax sampling that generates inverse Gumbel noises to enable precise image reconstruction and targeted editing.

Result: Extensive experiments show VARIN effectively modifies source images according to text prompts while preserving original background and structural details, demonstrating practical editing efficacy.

Conclusion: VARIN represents the first successful noise inversion-based editing technique specifically designed for VAR models, enabling high-quality text-guided image editing without requiring additional training.

Abstract: Visual autoregressive models (VAR) have recently emerged as a promising class of generative models, achieving performance comparable to diffusion models in text-to-image generation tasks. While conditional generation has been widely explored, the ability to perform prompt-guided image editing without additional training is equally critical, as it supports numerous practical real-world applications. This paper investigates the text-to-image editing capabilities of VAR by introducing Visual AutoRegressive Inverse Noise (VARIN), the first noise inversion-based editing technique designed explicitly for VAR models. VARIN leverages a novel pseudo-inverse function for argmax sampling, named Location-aware Argmax Inversion (LAI), to generate inverse Gumbel noises. These inverse noises enable precise reconstruction of the source image and facilitate targeted, controllable edits aligned with textual prompts. Extensive experiments demonstrate that VARIN effectively modifies source images according to specified prompts while significantly preserving the original background and structural details, thus validating its efficacy as a practical editing approach.
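
The flavor of argmax inversion can be seen in the classic truncated-Gumbel construction sketched below, which recovers noises consistent with an observed argmax token; VARIN's Location-aware Argmax Inversion is a pseudo-inverse in this spirit, not literally this code.

```python
import numpy as np

rng = np.random.default_rng(0)

def gumbel(shape, loc=0.0):
    return loc - np.log(-np.log(rng.uniform(size=shape)))

def invert_argmax(logits, token):
    # Sample the max perturbed logit, then truncate all others below it,
    # so that argmax(logits + noise) == token by construction.
    z = gumbel((), loc=np.logaddexp.reduce(logits))  # value of the maximum
    g = gumbel(logits.shape, loc=logits)             # fresh perturbed logits
    perturbed = -np.log(np.exp(-z) + np.exp(-g))     # truncate strictly below z
    perturbed[token] = z
    return perturbed - logits                        # recovered Gumbel noises

logits = rng.normal(size=8)
noise = invert_argmax(logits, token=3)
assert np.argmax(logits + noise) == 3                # reconstruction is exact
```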

[160] See No Evil: Adversarial Attacks Against Linguistic-Visual Association in Referring Multi-Object Tracking Systems

Halima Bouzidi, Haoyu Liu, Mohammad Abdullah Al Faruque

Main category: cs.CV

TL;DR: VEIL adversarial framework exposes vulnerabilities in Referring Multi-Object Tracking (RMOT) systems, compromising both linguistic-visual referring and track-object matching components through persistent spatial-temporal reasoning attacks.

DetailsMotivation: To examine security implications of RMOT systems and identify adversarial vulnerabilities that compromise their reliability and robustness, which remain underexplored despite advances in language-vision understanding.

Method: Developed VEIL adversarial framework that crafts digital and physical perturbations to disrupt unified referring-matching mechanisms, targeting both linguistic-visual referring and track-object matching components, with special focus on FIFO-based memory vulnerabilities.

Result: VEIL successfully corrupts tracking logic reliability, inducing track ID switches and terminations. Comprehensive evaluations on Refer-KITTI dataset validate the effectiveness of the attacks.

Conclusion: The research demonstrates urgent need for security-aware RMOT designs for critical large-scale applications, revealing persistent vulnerabilities in advanced RMOT models that require attention.

Abstract: Language-vision understanding has driven the development of advanced perception systems, most notably the emerging paradigm of Referring Multi-Object Tracking (RMOT). By leveraging natural-language queries, RMOT systems can selectively track objects that satisfy a given semantic description, guided through Transformer-based spatial-temporal reasoning modules. End-to-End (E2E) RMOT models further unify feature extraction, temporal memory, and spatial reasoning within a Transformer backbone, enabling long-range spatial-temporal modeling over fused textual-visual representations. Despite these advances, the reliability and robustness of RMOT remain underexplored. In this paper, we examine the security implications of RMOT systems from a design-logic perspective, identifying adversarial vulnerabilities that compromise both the linguistic-visual referring and track-object matching components. Additionally, we uncover a novel vulnerability in advanced RMOT models employing FIFO-based memory, whereby targeted and consistent attacks on their spatial-temporal reasoning introduce errors that persist within the history buffer over multiple subsequent frames. We present VEIL, a novel adversarial framework designed to disrupt the unified referring-matching mechanisms of RMOT models. We show that carefully crafted digital and physical perturbations can corrupt the tracking logic reliability, inducing track ID switches and terminations. We conduct comprehensive evaluations using the Refer-KITTI dataset to validate the effectiveness of VEIL and demonstrate the urgent need for security-aware RMOT designs for critical large-scale applications.

[161] Hues and Cues: Human vs. CLIP

Nuria Alabau-Bosque, Jorge Vila-Tomás, Paula Daudén-Oliver, Pablo Hernández-Cámara, Jose Manuel Jaén-Lorites, Valero Laparra, Jesús Malo

Main category: cs.CV

TL;DR: Evaluating AI models using board games like Hues & Cues reveals cultural biases and abstraction inconsistencies in CLIP’s color perception that standard benchmarks miss.

DetailsMotivation: Traditional AI evaluation methods often overlook human-like characteristics tested through games. Board games provide a novel way to assess model alignment with human perception and cultural understanding.

Method: Using the board game Hues & Cues to test CLIP’s color perception and naming capabilities, comparing its performance against human observers to identify alignment and discrepancies.

Result: CLIP shows general alignment with human color perception but reveals cultural biases and inconsistencies in handling different abstraction levels that standard benchmarks fail to detect.

Conclusion: Board games serve as effective evaluation tools that expose model deficiencies in ways traditional benchmarks cannot, highlighting the importance of diverse testing strategies for assessing AI-human alignment.

Abstract: Playing games is inherently human, and many games are created to challenge different human characteristics. However, these tasks are often left out when evaluating the human-like nature of artificial models. The objective of this work is to propose a new approach to evaluating artificial models via board games. To this end, we test the color perception and color naming capabilities of CLIP by playing the board game Hues & Cues and assess its alignment with humans. Our experiments show that CLIP is generally well aligned with human observers, but our approach brings to light certain cultural biases and inconsistencies when dealing with different abstraction levels that are hard to identify with other testing strategies. Our findings indicate that assessing models with different tasks like board games can make certain deficiencies in the models stand out in ways that are difficult to test with the commonly used benchmarks.

[162] MedDINOv3: How to adapt vision foundation models for medical image segmentation?

Yuheng Li, Yizhou Wu, Yuxiang Lai, Mingzhe Hu, Xiaofeng Yang

Main category: cs.CV

TL;DR: MedDINOv3 adapts DINOv3 foundation model for medical image segmentation, achieving state-of-the-art performance across multiple benchmarks through domain-adaptive pretraining on CT scans.

DetailsMotivation: Existing deep learning models for medical image segmentation lack generalizability across modalities and institutions, and vision foundation models underperform specialized CNNs on medical tasks due to domain gap between natural and medical images.

Method: Revisits plain ViTs with multi-scale token aggregation, performs domain-adaptive pretraining on 3.87M CT slices using multi-stage DINOv3 recipe to learn robust dense features.

Result: MedDINOv3 matches or exceeds state-of-the-art performance across four segmentation benchmarks.

Conclusion: Demonstrates the potential of vision foundation models as unified backbones for medical image segmentation, providing a simple and effective framework for adapting foundation models to medical imaging.

Abstract: Accurate segmentation of organs and tumors in CT and MRI scans is essential for diagnosis, treatment planning, and disease monitoring. While deep learning has advanced automated segmentation, most models remain task-specific, lacking generalizability across modalities and institutions. Vision foundation models (FMs) pretrained on billion-scale natural images offer powerful and transferable representations. However, adapting them to medical imaging faces two key challenges: (1) the ViT backbone of most foundation models still underperforms specialized CNNs on medical image segmentation, and (2) the large domain gap between natural and medical images limits transferability. We introduce MedDINOv3, a simple and effective framework for adapting DINOv3 to medical segmentation. We first revisit plain ViTs and design a simple and effective architecture with multi-scale token aggregation. Then, we perform domain-adaptive pretraining on CT-3M, a curated collection of 3.87M axial CT slices, using a multi-stage DINOv3 recipe to learn robust dense features. MedDINOv3 matches or exceeds state-of-the-art performance across four segmentation benchmarks, demonstrating the potential of vision foundation models as unified backbones for medical image segmentation. The code is available at https://github.com/ricklisz/MedDINOv3.
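
A hedged sketch of multi-scale token aggregation is shown below; the layer choice and the concatenation rule are assumptions for illustration, not the paper's exact design.

```python
import torch

def aggregate(per_layer_tokens, layers=(3, 6, 9, 12)):
    # per_layer_tokens: dict layer_idx -> (B, N, D) patch-token tensors.
    # Concatenate shallow and deep features so the segmentation head
    # sees multiple scales of ViT representation at once.
    return torch.cat([per_layer_tokens[i] for i in layers], dim=-1)  # (B, N, 4D)

feats = {i: torch.randn(1, 196, 768) for i in range(1, 13)}
print(aggregate(feats).shape)  # torch.Size([1, 196, 3072])
```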

[163] Faster and Better: Reinforced Collaborative Distillation and Self-Learning for Infrared-Visible Image Fusion

Yuhao Wang, Lingjuan Miao, Zhiqiang Zhou, Yajun Qiao, Lei Zhang

Main category: cs.CV

TL;DR: A reinforcement learning-driven collaborative distillation framework for infrared-visible image fusion that enables student models to self-learn from challenging samples while dynamically adjusting teacher guidance.

DetailsMotivation: To address the challenge of achieving high-quality image fusion with lightweight models by combining complementary information from infrared and visible modalities more effectively.

Method: Proposes a novel collaborative distillation and self-learning framework using reinforcement learning. The RL agent identifies optimal training strategies, generates challenging samples for student self-learning, and dynamically adjusts teacher guidance based on student performance and teacher-student gap.

Result: Experimental results show significant improvement in student performance and better fusion results compared to existing techniques.

Conclusion: The proposed framework successfully enhances image fusion quality through collaborative distillation and reinforcement learning-driven self-learning, providing an effective solution for lightweight model optimization.

Abstract: Infrared and visible image fusion plays a critical role in enhancing scene perception by combining complementary information from different modalities. Despite recent advances, achieving high-quality image fusion with lightweight models remains a significant challenge. To bridge this gap, we propose a novel collaborative distillation and self-learning framework for image fusion driven by reinforcement learning. Unlike conventional distillation, this approach not only enables the student model to absorb image fusion knowledge from the teacher model, but more importantly, allows the student to perform self-learning on more challenging samples to enhance its capabilities. Particularly, in our framework, a reinforcement learning agent explores and identifies a more suitable training strategy for the student. The agent takes both the student’s performance and the teacher-student gap as inputs, which leads to the generation of challenging samples to facilitate the student’s self-learning. Simultaneously, it dynamically adjusts the teacher’s guidance strength based on the student’s state to optimize the knowledge transfer. Experimental results demonstrate that our method can significantly improve student performance and achieve better fusion results compared to existing techniques.

cs.AI

[164] Can Media Act as a Soft Regulator of Safe AI Development? A Game Theoretical Analysis

Henrique Correia da Fonseca, António Fernandes, Zhao Song, Theodor Cimpeanu, Nataliya Balabanova, Adeela Bashir, Paolo Bova, Alessio Buscemi, Alessandro Di Stefano, Manh Hong Duong, Elias Fernandez Domingos, Ndidi Bianca Ogbo, Simon T. Powers, Daniele Proverbio, Zia Ush Shamszaman, Fernando P. Santos, The Anh Han, Marcus Krellner

Main category: cs.AI

TL;DR: Media coverage can act as a soft regulator for AI safety by creating reputational consequences for developers, but only when information quality is reliable and costs are manageable.

DetailsMotivation: AI developers often prioritize profit over safety due to lack of negative consequences. The paper explores whether media coverage can create reputational costs that incentivize safer AI development.

Method: Used evolutionary game theory with artificial populations of self-interested AI creators and users to study how media coverage affects cooperation and safety choices.

Result: Media can foster cooperation between creators and users, but fails when information quality is unreliable or when costs of accessing media or ensuring safety are too high.

Conclusion: Media serves as a powerful soft regulator for AI safety by shaping public perception and holding developers accountable, even without formal government oversight.

Abstract: When developers of artificial intelligence (AI) products need to decide between profit and safety for the users, they likely choose profit. Untrustworthy AI technology must come packaged with tangible negative consequences. Here, we envisage those consequences as the loss of reputation caused by media coverage of their misdeeds, disseminated to the public. We explore whether media coverage has the potential to push AI creators into the production of safe products, enabling widespread adoption of AI technology. We created artificial populations of self-interested creators and users and studied them through the lens of evolutionary game theory. Our results reveal that media is indeed able to foster cooperation between creators and users, but not always. Cooperation does not evolve if the quality of the information provided by the media is not reliable enough, or if the costs of either accessing media or ensuring safety are too high. By shaping public perception and holding developers accountable, media emerges as a powerful soft regulator – guiding AI safety even in the absence of formal government oversight.
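
The qualitative finding can be reproduced with a toy replicator-dynamics model, sketched below with invented payoff values: safety spreads only when reputational losses from accurate, affordable media outweigh the cost of safety.

```python
def simulate(media_accuracy, safety_cost, media_cost, steps=4000, dt=0.01):
    # Users consult media only if accessing it is cheap enough to be worth it.
    exposure = media_accuracy if media_cost < 1.0 else 0.0
    x = 0.5                                  # initial fraction of safe creators
    for _ in range(steps):
        f_safe = 3.0 - safety_cost           # safe creators pay the safety cost
        f_unsafe = 4.0 - 4.0 * exposure      # unsafe: extra profit, risk reputation
        avg = x * f_safe + (1 - x) * f_unsafe
        x += dt * x * (f_safe - avg)         # replicator equation
        x = min(max(x, 0.0), 1.0)
    return x

print(simulate(media_accuracy=0.9, safety_cost=0.5, media_cost=0.1))  # ~1.0: cooperation
print(simulate(media_accuracy=0.2, safety_cost=0.5, media_cost=0.1))  # ~0.0: defection
```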

[165] The Future of Artificial Intelligence and the Mathematical and Physical Sciences (AI+MPS)

Andrew Ferguson, Marisa LaFleur, Lars Ruthotto, Jesse Thaler, Yuan-Sen Ting, Pratyush Tiwary, Soledad Villar, E. Paulo Alves, Jeremy Avigad, Simon Billinge, Camille Bilodeau, Keith Brown, Emmanuel Candes, Arghya Chattopadhyay, Bingqing Cheng, Jonathan Clausen, Connor Coley, Andrew Connolly, Fred Daum, Sijia Dong, Chrisy Xiyu Du, Cora Dvorkin, Cristiano Fanelli, Eric B. Ford, Luis Manuel Frutos, Nicolás García Trillos, Cecilia Garraffo, Robert Ghrist, Rafael Gomez-Bombarelli, Gianluca Guadagni, Sreelekha Guggilam, Sergei Gukov, Juan B. Gutiérrez, Salman Habib, Johannes Hachmann, Boris Hanin, Philip Harris, Murray Holland, Elizabeth Holm, Hsin-Yuan Huang, Shih-Chieh Hsu, Nick Jackson, Olexandr Isayev, Heng Ji, Aggelos Katsaggelos, Jeremy Kepner, Yannis Kevrekidis, Michelle Kuchera, J. Nathan Kutz, Branislava Lalic, Ann Lee, Matt LeBlanc, Josiah Lim, Rebecca Lindsey, Yongmin Liu, Peter Y. Lu, Sudhir Malik, Vuk Mandic, Vidya Manian, Emeka P. Mazi, Pankaj Mehta, Peter Melchior, Brice Ménard, Jennifer Ngadiuba, Stella Offner, Elsa Olivetti, Shyue Ping Ong, Christopher Rackauckas, Philippe Rigollet, Chad Risko, Philip Romero, Grant Rotskoff, Brett Savoie, Uros Seljak, David Shih, Gary Shiu, Dima Shlyakhtenko, Eva Silverstein, Taylor Sparks, Thomas Strohmer, Christopher Stubbs, Stephen Thomas, Suriyanarayanan Vaikuntanathan, Rene Vidal, Francisco Villaescusa-Navarro, Gregory Voth, Benjamin Wandelt, Rachel Ward, Melanie Weber, Risa Wechsler, Stephen Whitelam, Olaf Wiest, Mike Williams, Zhuoran Yang, Yaroslava G. Yingling, Bin Yu, Shuwen Yue, Ann Zabludoff, Huimin Zhao, Tong Zhang

Main category: cs.AI

TL;DR: NSF workshop report on integrating AI with mathematical and physical sciences, proposing strategies for bidirectional research, community building, and workforce development.

DetailsMotivation: To understand how mathematical and physical sciences domains can capitalize on AI advancements and contribute to AI development, strengthening the crucial link between AI and scientific discovery.

Method: Community perspective summary from NSF Workshop on Future of AI and MPS, proposing activities and strategic priorities including enabling bidirectional AI+MPS research, building interdisciplinary community, and fostering education/workforce development.

Result: Development of comprehensive framework and recommendations for funding agencies, educational institutions, and researchers to position MPS community as leaders in AI integration.

Conclusion: Proactive strategy needed to leverage AI potential for scientific discovery while applying fundamental science concepts to impact AI development, with specific priorities for implementation across multiple stakeholders.

Abstract: This community paper developed out of the NSF Workshop on the Future of Artificial Intelligence (AI) and the Mathematical and Physical Sciences (MPS), which was held in March 2025 with the goal of understanding how the MPS domains (Astronomy, Chemistry, Materials Research, Mathematical Sciences, and Physics) can best capitalize on, and contribute to, the future of AI. We present here a summary and snapshot of the MPS community’s perspective, as of Spring/Summer 2025, in a rapidly developing field. The link between AI and MPS is becoming increasingly inextricable; now is a crucial moment to strengthen the link between AI and Science by pursuing a strategy that proactively and thoughtfully leverages the potential of AI for scientific discovery and optimizes opportunities to impact the development of AI by applying concepts from fundamental science. To achieve this, we propose activities and strategic priorities that: (1) enable AI+MPS research in both directions; (2) build up an interdisciplinary community of AI+MPS researchers; and (3) foster education and workforce development in AI for MPS researchers and students. We conclude with a summary of suggested priorities for funding agencies, educational institutions, and individual researchers to help position the MPS community to be a leader in, and take full advantage of, the transformative potential of AI+MPS.

[166] Deep Research is the New Analytics System: Towards Building the Runtime for AI-Driven Analytics

Matthew Russo, Tim Kraska

Main category: cs.AI

TL;DR: A prototype system that combines Deep Research agents with optimized semantic operators for AI-driven analytics, achieving better performance and efficiency than existing approaches.

DetailsMotivation: Current semantic operators are expensive for large datasets and ill-suited for interactive analytics, while Deep Research systems lack query optimization. A hybrid approach is needed to combine optimized execution with dynamic flexibility.

Method: Built a prototype that enables Deep Research agents to write and execute optimized semantic operator programs, combining the strengths of both approaches.

Result: Outperforms handcrafted semantic operator programs and open Deep Research systems, achieving up to 1.95x better F1-score and cost/runtime savings of up to 76.8% and 72.7%.

Conclusion: The hybrid approach successfully combines optimized execution of semantic operators with the flexibility of Deep Research systems, demonstrating significant performance improvements for AI-driven analytics.

Abstract: With advances in large language models (LLMs), researchers are creating new systems that can perform AI-driven analytics over large unstructured datasets. Recent work has explored executing such analytics queries using semantic operators – a declarative set of AI-powered data transformations with natural language specifications. However, even when optimized, these operators can be expensive to execute on millions of records and their iterator execution semantics make them ill-suited for interactive data analytics tasks. In another line of work, Deep Research systems have demonstrated an ability to answer natural language question(s) over large datasets. These systems use one or more LLM agent(s) to plan their execution, process the dataset(s), and iteratively refine their answer. However, these systems do not explicitly optimize their query plans which can lead to poor plan execution. In order for AI-driven analytics to excel, we need a runtime which combines the optimized execution of semantic operators with the flexibility and more dynamic execution of Deep Research systems. As a first step towards this vision, we build a prototype which enables Deep Research agents to write and execute optimized semantic operator programs. We evaluate our prototype and demonstrate that it can outperform a handcrafted semantic operator program and open Deep Research systems on two basic queries. Compared to a standard open Deep Research agent, our prototype achieves up to 1.95x better F1-score. Furthermore, even if we give the agent access to semantic operators as tools, our prototype still achieves cost and runtime savings of up to 76.8% and 72.7% thanks to its optimized execution.
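
For intuition, a semantic operator is roughly a declarative data transformation whose logic is a natural-language specification executed by an LLM, as in this hedged sketch; `call_llm` is a stub for the demo, not an API from the paper or any library.

```python
def call_llm(prompt):
    # Stub standing in for a real LLM call; answers "yes" on error-like records.
    return "yes" if "error" in prompt.lower() else "no"

def sem_filter(records, predicate):
    """Keep records for which the LLM affirms the natural-language predicate."""
    return [r for r in records
            if call_llm(f"Does this record satisfy '{predicate}'? {r}") == "yes"]

logs = ["GET /index 200", "POST /pay 500 Internal Server Error"]
print(sem_filter(logs, "the request failed"))  # keeps only the failing request
```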

[167] Planning with Reasoning using Vision Language World Model

Delong Chen, Theo Moutakanni, Willy Chung, Yejin Bang, Ziwei Ji, Allen Bolourchi, Pascale Fung

Main category: cs.AI

TL;DR: VLWM is a vision-language foundation model for world modeling that combines reactive policy learning with reflective planning via cost minimization, achieving SOTA performance on visual planning benchmarks.

DetailsMotivation: High-level world models capable of semantic and temporal abstraction for action understanding and reasoning are underdeveloped, limiting effective planning capabilities.

Method: Uses LLM Self-Refine with Tree of Captions to extract goals and predict trajectories. Learns both action policy (system-1) and dynamics model (system-2) with cost minimization planning using a self-supervised critic model.

Result: Achieves state-of-the-art Visual Planning for Assistance performance with system-2 improving Elo score by +27% over system-1. Outperforms strong VLM baselines on RoboVQA and WorldPrediction benchmarks.

Conclusion: VLWM demonstrates effective language-based world modeling that enables both reactive and reflective planning, advancing semantic reasoning capabilities for visual planning tasks.

Abstract: Effective planning requires strong world models, but high-level world models that can understand and reason about actions with semantic and temporal abstraction remain largely underdeveloped. We introduce the Vision Language World Model (VLWM), a foundation model trained for language-based world modeling on natural videos. Given visual observations, the VLWM first infers the overall goal achievements and then predicts a trajectory composed of interleaved actions and world state changes. Those targets are extracted by iterative LLM Self-Refine conditioned on compressed future observations represented by Tree of Captions. The VLWM learns both an action policy and a dynamics model, which respectively facilitate reactive system-1 plan decoding and reflective system-2 planning via cost minimization. The cost evaluates the semantic distance between the hypothetical future states given by VLWM roll-outs and the expected goal state, and is measured by a critic model that we trained in a self-supervised manner. The VLWM achieves state-of-the-art Visual Planning for Assistance (VPA) performance on both benchmark evaluations and our proposed PlannerArena human evaluations, where system-2 improves the Elo score by +27% over system-1. The VLWM also outperforms strong VLM baselines on the RoboVQA and WorldPrediction benchmarks.
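
The system-2 side reduces to cost-minimizing plan selection over world-model roll-outs, as in the toy sketch below; the dynamics, critic, and candidate plans are stand-ins, not the trained models.

```python
def critic(state, goal):
    return abs(state - goal)          # stand-in for learned semantic distance

def world_model(state, action):
    return state + action             # stand-in for learned dynamics

def plan(state, goal, candidates):
    def cost(actions):
        s = state
        for a in actions:             # hypothetical roll-out of one candidate
            s = world_model(s, a)
        return critic(s, goal)        # distance of predicted end state to goal
    return min(candidates, key=cost)  # reflective system-2 selection

print(plan(0, goal=5, candidates=[[1, 1], [2, 3], [4, 4]]))  # -> [2, 3]
```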

[168] Do LLM Modules Generalize? A Study on Motion Generation for Autonomous Driving

Mingyi Wang, Jingke Wang, Tengju Ye, Junbo Chen, Kaicheng Yu

Main category: cs.AI

TL;DR: Systematic evaluation of transferring LLM modules (tokenizer, positional embedding, pre-training, post-training, test-time computation) to autonomous driving motion generation, showing significant performance improvements when properly adapted.

DetailsMotivation: LLMs have shown success in NLP and structurally similar domains like autonomous driving motion generation, but there's lack of systematic understanding about which LLM modules are truly transferable to this domain.

Method: Comprehensive evaluation of five key LLM modules: tokenizer design, positional embedding, pre-training paradigms, post-training strategies, and test-time computation, adapted for motion generation in autonomous driving scenarios.

Result: When appropriately adapted, these LLM modules significantly improve performance for autonomous driving motion generation, achieving competitive results on the Waymo Sim Agents benchmark.

Conclusion: The study identifies which LLM techniques can be effectively transferred to autonomous driving, analyzes reasons for failures of some methods, and discusses specific adaptations needed for this domain.

Abstract: Recent breakthroughs in large language models (LLMs) have not only advanced natural language processing but also inspired their application in domains with structurally similar problems – most notably, autonomous driving motion generation. Both domains involve autoregressive sequence modeling, token-based representations, and context-aware decision making, making the transfer of LLM components a natural and increasingly common practice. However, despite promising early attempts, a systematic understanding of which LLM modules are truly transferable remains lacking. In this paper, we present a comprehensive evaluation of five key LLM modules – tokenizer design, positional embedding, pre-training paradigms, post-training strategies, and test-time computation – within the context of motion generation for autonomous driving. Through extensive experiments on the Waymo Sim Agents benchmark, we demonstrate that, when appropriately adapted, these modules can significantly improve performance for autonomous driving motion generation. In addition, we identify which techniques can be effectively transferred, analyze the potential reasons for the failure of others, and discuss the specific adaptations needed for autonomous driving scenarios. We evaluate our method on the Sim Agents task and achieve competitive results.

[169] Plan Verification for LLM-Based Embodied Task Completion Agents

Ananth Hariharan, Vardhan Dongre, Dilek Hakkani-Tür, Gokhan Tur

Main category: cs.AI

TL;DR: An iterative LLM verification framework that uses a Judge LLM to critique action sequences and a Planner LLM to apply revisions, improving embodied AI task plans by removing noise and errors while preserving human recovery patterns.

DetailsMotivation: LLM-based task plans and human demonstrations for embodied AI often contain noisy actions, redundant navigation, and logical errors that reduce policy quality, requiring a method to clean and refine these trajectories.

Method: Proposes an iterative verification framework where a Judge LLM critiques action sequences and a Planner LLM applies revisions, using natural language prompting to handle various error types including irrelevant actions, contradictions, and missing steps.

Result: Achieves up to 90% recall and 100% precision on TEACh dataset across four state-of-the-art LLMs, with 96.5% of sequences converging in at most three iterations while improving temporal efficiency and spatial organization.

Conclusion: Establishes plan verification as a reliable LLM capability for spatial planning, providing a scalable path to higher-quality training data for imitation learning in embodied AI while preserving human error-recovery patterns.

Abstract: Large language model (LLM) based task plans and corresponding human demonstrations for embodied AI may be noisy, with unnecessary actions, redundant navigation, and logical errors that reduce policy quality. We propose an iterative verification framework in which a Judge LLM critiques action sequences and a Planner LLM applies the revisions, yielding progressively cleaner and more spatially coherent trajectories. Unlike rule-based approaches, our method relies on natural language prompting, enabling broad generalization across error types including irrelevant actions, contradictions, and missing steps. On a set of manually annotated actions from the TEACh embodied AI dataset, our framework achieves up to 90% recall and 100% precision across four state-of-the-art LLMs (GPT o4-mini, DeepSeek-R1, Gemini 2.5, LLaMA 4 Scout). The refinement loop converges quickly, with 96.5% of sequences requiring at most three iterations, while improving both temporal efficiency and spatial action organization. Crucially, the method preserves human error-recovery patterns rather than collapsing them, supporting future work on robust corrective behavior. By establishing plan verification as a reliable LLM capability for spatial planning and action refinement, we provide a scalable path to higher-quality training data for imitation learning in embodied AI.
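
The verification loop itself is simple to sketch; here `judge` and `planner` are deterministic stand-ins for the two LLM roles, and the deduplication critique is just one of the error types the paper handles via natural-language prompting.

```python
def judge(plan):
    """Return a critique, or None when the plan looks clean."""
    dups = [a for i, a in enumerate(plan) if a in plan[:i]]
    return f"remove redundant actions: {dups}" if dups else None

def planner(plan, critique):
    """Apply the critique; a real Planner LLM would read it, here we dedupe."""
    seen, revised = set(), []
    for a in plan:
        if a not in seen:
            seen.add(a)
            revised.append(a)
    return revised

def verify(plan, max_iters=3):
    for _ in range(max_iters):        # 96.5% of sequences converge in <= 3
        critique = judge(plan)
        if critique is None:
            break
        plan = planner(plan, critique)
    return plan

print(verify(["goto sink", "pick mug", "goto sink", "place mug"]))
```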

[170] Key Principles in Cross-Domain Hyper-Heuristic Performance

Václav Sobotka, Lucas Kletzander, Nysret Musliu, Hana Rudová

Main category: cs.AI

TL;DR: Strategic transformations of low-level heuristic sets enable trivial random selection to outperform state-of-the-art hyper-heuristics and find new best solutions.

DetailsMotivation: Existing selection hyper-heuristics focus on adaptive selection but neglect the composition and strategic transformation of low-level heuristic sets, which could significantly improve performance.

Method: Systematically analyze transformations based on solution acceptance, LLH repetitions, and perturbation intensity. Test transformations on trivial random selection and integrate them with recent hyper-heuristics.

Result: Transformed trivial random selection outperforms state-of-the-art methods on real-world domains, finds 11 new best-known solutions, and competes with CHeSC winner. Enhanced hyper-heuristics outperform current state-of-the-art on both benchmark and real-world problems.

Conclusion: Strategic transformations of low-level heuristic sets are crucial for hyper-heuristic performance, enabling simple methods to achieve superior results and often simplifying complex designs while improving effectiveness.

Abstract: Cross-domain selection hyper-heuristics aim to distill decades of research on problem-specific heuristic search algorithms into adaptable general-purpose search strategies. In this respect, existing selection hyper-heuristics primarily focus on an adaptive selection of low-level heuristics (LLHs) from a predefined set. In contrast, we concentrate on the composition of this set and its strategic transformations. We systematically analyze transformations based on three key principles: solution acceptance, LLH repetitions, and perturbation intensity, i.e., the proportion of a solution affected by a perturbative LLH. We demonstrate the raw effects of our transformations on a trivial unbiased random selection mechanism. With an appropriately constructed transformation, this trivial method outperforms all available state-of-the-art hyper-heuristics on three challenging real-world domains and finds 11 new best-known solutions. The same method is competitive with the winner of the CHeSC competition, commonly used as the standard cross-domain benchmark. Moreover, we accompany several recent hyper-heuristics with such strategic transformations. Using this approach, we outperform the current state-of-the-art methods on both the CHeSC benchmark and real-world domains while often simplifying their designs.
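
The headline result is easy to picture in code: uniformly random LLH selection wrapped in a strategic transformation, here LLH repetitions plus an accept-if-not-worse rule. The toy domain and transformation parameters are illustrative assumptions only.

```python
import random

def random_selection_hh(llhs, cost, solution, iters=10000, repetitions=3):
    best = solution
    for _ in range(iters):
        llh = random.choice(llhs)              # trivial, unbiased selection
        candidate = solution
        for _ in range(repetitions):           # transformation: repeat the LLH
            candidate = llh(candidate)
        if cost(candidate) <= cost(solution):  # transformation: acceptance rule
            solution = candidate
        if cost(solution) < cost(best):
            best = solution
    return best

# Toy domain: minimize the L1 norm of a vector with nudge/rotate heuristics.
cost = lambda s: sum(abs(v) for v in s)
llhs = [lambda s: [v + random.choice((-1, 1)) for v in s],
        lambda s: s[1:] + s[:1]]
print(cost(random_selection_hh(llhs, cost, [5, -3, 8, 1])))
```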

[171] Learning General Policies From Examples

Blai Bonet, Hector Geffner

Main category: cs.AI

TL;DR: A new symbolic method for learning general policies that scales to millions of states and hundreds of thousands of features using hitting set algorithms instead of SAT/ASP.

DetailsMotivation: Previous combinatorial methods for learning general policies don't scale well, limited to small training instances with only hundreds of states and features, while deep learning approaches lack interpretability and correctness guarantees.

Method: Proposes a symbolic method based on generalization of sampled plans that ensures structural termination and acyclicity, using a hitting set algorithm instead of SAT/ASP approaches.

Result: The approach can effectively handle problems with millions of states and feature pools with hundreds of thousands of features, demonstrating significant scalability improvements.

Conclusion: The new hitting set-based method provides scalable policy learning while maintaining the interpretability and correctness guarantees of symbolic approaches, overcoming the limitations of previous combinatorial methods.

Abstract: Combinatorial methods for learning general policies that solve large collections of planning problems have been recently developed. One of their strengths, in relation to deep learning approaches, is that the resulting policies can be understood and shown to be correct. A weakness is that the methods do not scale up and learn only from small training instances and feature pools that contain a few hundred states and features at most. In this work, we propose a new symbolic method for learning policies based on the generalization of sampled plans that ensures structural termination and hence acyclicity. The proposed learning approach is not based on SAT/ASP, as in previous symbolic methods, but on a hitting set algorithm that can effectively handle problems with millions of states, and pools with hundreds of thousands of features. The formal properties of the approach are analyzed, and its scalability is tested on a number of benchmarks.
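
The algorithmic core can be illustrated with the textbook greedy hitting-set approximation; encoding each sampled state pair as a set of candidate separating features is a simplification for the example, not the paper's exact construction.

```python
def greedy_hitting_set(sets):
    """Pick elements hitting the most still-uncovered sets until all are hit."""
    uncovered = [frozenset(s) for s in sets]
    chosen = set()
    while uncovered:
        counts = {}
        for s in uncovered:
            for e in s:
                counts[e] = counts.get(e, 0) + 1
        best = max(counts, key=counts.get)           # most-covering element
        chosen.add(best)
        uncovered = [s for s in uncovered if best not in s]
    return chosen

# Each inner set: candidate features that separate one pair of sampled states.
print(greedy_hitting_set([{"f1", "f3"}, {"f3", "f7"}, {"f2"}]))  # {'f3', 'f2'}
```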

[172] Uncertainty-driven Adaptive Exploration

Leonidas Bakopoulos, Georgios Chalkiadakis

Main category: cs.AI

TL;DR: A generic adaptive exploration framework that uses uncertainty to determine optimal switching between exploration and exploitation phases in reinforcement learning.

DetailsMotivation: To address the critical challenge of determining the appropriate timing for switching between exploration and exploitation in complex domains requiring long action sequences.

Method: Proposes a framework that employs uncertainty measures to guide the switching mechanism, incorporating various uncertainty-measuring approaches like intrinsic motivation and epistemic uncertainty methods.

Result: The framework outperforms standard adaptive exploration strategies across multiple MuJoCo environments.

Conclusion: The uncertainty-based adaptive exploration framework provides a principled and effective approach for learning complex policies, generalizing previous methods and enabling flexible incorporation of different uncertainty measures.

Abstract: Adaptive exploration methods propose ways to learn complex policies via alternating between exploration and exploitation. An important question for such methods is to determine the appropriate moment to switch between exploration and exploitation and vice versa. This is critical in domains that require the learning of long and complex sequences of actions. In this work, we present a generic adaptive exploration framework that employs uncertainty to address this important issue in a principled manner. Our framework includes previous adaptive exploration approaches as special cases. Moreover, we can incorporate in our framework any uncertainty-measuring mechanism of choice, for instance mechanisms used in intrinsic motivation or epistemic uncertainty-based exploration methods. We experimentally demonstrate that our framework gives rise to adaptive exploration strategies that outperform standard ones across several MuJoCo environments.
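
A minimal version of the switching rule: explore while measured uncertainty is high, exploit once it falls below a threshold. The uncertainty measure and threshold below are assumptions; any mechanism of choice (intrinsic motivation, ensemble disagreement, and so on) could be plugged in.

```python
import random

def act(state, q_values, uncertainty, threshold=0.2):
    if uncertainty(state) > threshold:        # exploration phase
        return random.randrange(len(q_values[state]))
    # Exploitation phase: greedy action under current value estimates.
    return max(range(len(q_values[state])), key=lambda a: q_values[state][a])

# Toy usage: a visit-count stand-in for epistemic uncertainty.
q_values = {0: [0.1, 0.9], 1: [0.5, 0.4]}
visits = {0: 50, 1: 1}
uncertainty = lambda s: 1.0 / visits[s]       # rarely visited -> uncertain
print(act(0, q_values, uncertainty))          # exploits: action 1
print(act(1, q_values, uncertainty))          # explores: random action
```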

[173] Accountability Framework for Healthcare AI Systems: Towards Joint Accountability in Decision Making

Prachi Bagave, Marcus Westberg, Marijn Janssen, Aaron Yi Ding

Main category: cs.AI

TL;DR: This paper addresses the accountability gap in AI healthcare systems by developing a framework and three-tier structure to bridge regulatory guidelines with practical implementation.

DetailsMotivation: AI is increasingly used for critical healthcare decisions, but accountability remains ambiguous with high-level regulatory guidelines lacking practical implementation guidance, creating a knowledge gap for practitioners.

Method: The authors conducted extensive analysis of accountability concepts, formulated an accountability framework, and created a three-tier structure for handling various accountability mechanisms in healthcare AI systems.

Result: The paper provides a consistent accountability regime that positions healthcare AI regulations and actor mechanisms, along with a categorization guide for actors to classify mechanisms based on their conduct.

Conclusion: Healthcare AI decision-making requires shared dependencies where accountability should be handled jointly through collaboration, with explainability playing a key role in facilitating communication and information sharing between actors.

Abstract: AI is transforming the healthcare domain and is increasingly helping practitioners to make health-related decisions. Therefore, accountability becomes a crucial concern for critical AI-driven decisions. Although regulatory bodies, such as the EU Commission, provide guidelines, they are high-level and focus on the "what" that should be done and less on the "how", creating a knowledge gap for actors. Through an extensive analysis, we found that the term accountability is perceived and dealt with in many different ways, depending on the actor’s expertise and domain of work. With increasing concerns about AI accountability issues and the ambiguity around this term, this paper bridges the gap between the "what" and "how" of AI accountability, specifically for AI systems in healthcare. We do this by analysing the concept of accountability, formulating an accountability framework, and providing a three-tier structure for handling various accountability mechanisms. Our accountability framework positions the regulations of healthcare AI systems and the mechanisms adopted by the actors under a consistent accountability regime. Moreover, the three-tier structure guides the actors of the healthcare AI system to categorise the mechanisms based on their conduct. Through our framework, we advocate that decision-making in healthcare AI holds shared dependencies, where accountability should be dealt with jointly and should foster collaborations. We highlight the role of explainability in instigating communication and information sharing between the actors to further facilitate the collaborative process.

[174] app.build: A Production Framework for Scaling Agentic Prompt-to-App Generation with Environment Scaffolding

Evgenii Kniazev, Arseny Kravchenko, Igor Rekun, James Broadhead, Nikita Shamgunov, Pranav Sah, Pratik Nichite, Ivan Yamshchikov

Main category: cs.AI

TL;DR: app.build is an open-source framework that improves LLM-based application generation through systematic validation and structured environments, achieving 73.3% viability rate and demonstrating that scaling reliable AI agents requires scaling environments, not just models.

DetailsMotivation: To improve the reliability and quality of LLM-based application generation by addressing the need for systematic validation and structured environments rather than just scaling model capabilities.

Method: Combines multi-layered validation pipelines, stack-specific orchestration, and model-agnostic architecture across three reference stacks, evaluated on 30 generation tasks.

Result: Achieved 73.3% viability rate with 30% reaching perfect quality scores; open-weights models achieved 80.8% of closed-model performance when provided structured environments; over 3,000 applications generated to date.

Conclusion: Scaling reliable AI agents requires scaling environments, not just models - providing empirical insights and complete reference implementations for production-oriented agent systems.

Abstract: We present app.build (https://github.com/appdotbuild/agent/), an open-source framework that improves LLM-based application generation through systematic validation and structured environments. Our approach combines multi-layered validation pipelines, stack-specific orchestration, and model-agnostic architecture, implemented across three reference stacks. Through evaluation on 30 generation tasks, we demonstrate that comprehensive validation achieves 73.3% viability rate with 30% reaching perfect quality scores, while open-weights models achieve 80.8% of closed-model performance when provided structured environments. The open-source framework has been adopted by the community, with over 3,000 applications generated to date. This work demonstrates that scaling reliable AI agents requires scaling environments, not just models – providing empirical insights and complete reference implementations for production-oriented agent systems.
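
The validation idea can be sketched as a short pipeline where a generated app must pass every layer and the first failure is fed back to the agent; the layer names and checks are assumptions for the demo, not app.build's actual validators.

```python
LAYERS = [
    ("lint", lambda app: "eval(" not in app["code"]),
    ("typecheck", lambda app: app.get("types_ok", False)),
    ("tests", lambda app: all(app.get("test_results", []))),
]

def validate(app):
    for name, check in LAYERS:
        if not check(app):
            return f"failed: {name}"   # returned to the agent for a retry
    return "viable"

app = {"code": "export const x = 1", "types_ok": True, "test_results": [True]}
print(validate(app))                   # -> viable
```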

[175] Language Models Do Not Follow Occam’s Razor: A Benchmark for Inductive and Abductive Reasoning

Yunxin Sun, Abulhair Saparov

Main category: cs.AI

TL;DR: The paper introduces InAbHyD, a synthetic dataset for evaluating LLMs’ inductive and abductive reasoning capabilities, finding that current models struggle with complex scenarios despite reasoning-enhancing techniques.

DetailsMotivation: Most LLM research focuses on deductive reasoning, but inductive and abductive reasoning are essential for real-world problem solving and remain underexplored.

Method: Created a programmable synthetic dataset (InAbHyD) with incomplete world models and observations, proposed Occam’s Razor-based evaluation metric, and tested state-of-the-art LLMs with reasoning-enhancing techniques.

Result: LLMs can handle simple inductive and abductive reasoning scenarios but struggle with complex world models and producing high-quality hypotheses, even with techniques like in-context learning and RLVR.

Conclusion: Current LLMs have limited inductive and abductive reasoning capabilities, particularly in complex scenarios, indicating need for further research and improvement in these reasoning types.

Abstract: Reasoning is a core capability in artificial intelligence systems, for which large language models (LLMs) have recently shown remarkable progress. However, most work focuses exclusively on deductive reasoning, which is problematic since other types of reasoning are also essential in solving real-world problems, and they are less explored. This work focuses on evaluating LLMs’ inductive and abductive reasoning capabilities. We introduce a programmable and synthetic dataset, InAbHyD (pronounced in-a-bid), where each reasoning example consists of an incomplete world model and a set of observations. The task for the intelligent agent is to produce hypotheses to explain observations under the incomplete world model to solve each reasoning example. We propose a new metric to evaluate the quality of hypotheses based on Occam’s Razor. We evaluate and analyze some state-of-the-art LLMs. Our analysis shows that LLMs can perform inductive and abductive reasoning in simple scenarios, but struggle with complex world models and producing high-quality hypotheses, even with popular reasoning-enhancing techniques such as in-context learning and RLVR.
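
The paper’s Occam’s Razor metric is not spelled out in this summary, but its spirit can be sketched as a trade-off between how many observations a hypothesis set explains and how large that set is; the scoring rule below is an illustrative proxy, not the published metric:

```python
def occam_score(hypotheses, observations, explains):
    """Illustrative hypothesis-quality proxy: coverage of observations,
    discounted by hypothesis-set size so simpler explanations win.
    `explains(h, o)` is a caller-supplied predicate."""
    covered = sum(1 for o in observations
                  if any(explains(h, o) for h in hypotheses))
    coverage = covered / len(observations)
    simplicity = 1.0 / (1.0 + len(hypotheses))
    return coverage * simplicity

# Toy run: one general axiom explains both observations.
obs = ["socrates is mortal", "plato is mortal"]
print(occam_score(["all humans are mortal"], obs,
                  lambda h, o: o.endswith("mortal")))  # 0.5
```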

[176] Situating AI Agents in their World: Aspective Agentic AI for Dynamic Partially Observable Information Systems

Peter J. Bentley, Soo Ling Lim, Fuyuki Ishikawa

Main category: cs.AI

TL;DR: A bottom-up framework for AI agents that uses environmental changes to trigger behaviors, introducing ‘aspects’ (similar to umwelt) for different environmental perceptions, achieving zero information leakage compared to typical architectures that leak up to 83% of the time.

DetailsMotivation: Current agentic LLM AI agents are often just autonomous chatbots following scripts with unreliable control, leading to significant information leakage problems.

Method: Introduces a bottom-up framework where agents are situated in their environment with behaviors triggered by environmental changes. Uses ‘aspects’ concept where different agent sets perceive environments differently for better information control.

Result: Compared to typical architectures that leak information up to 83% of the time, the aspective agentic AI framework achieves zero information leakage.

Conclusion: Specialist agents working efficiently in their own information niches can provide significant improvements to both security and efficiency in AI agent systems.

Abstract: Agentic LLM AI agents are often little more than autonomous chatbots: actors following scripts, often controlled by an unreliable director. This work introduces a bottom-up framework that situates AI agents in their environment, with all behaviors triggered by changes in their environments. It introduces the notion of aspects, similar to the idea of umwelt, where sets of agents perceive their environment differently to each other, enabling clearer control of information. We provide an illustrative implementation and show that compared to a typical architecture, which leaks up to 83% of the time, aspective agentic AI enables zero information leakage. We anticipate that this concept of specialist agents working efficiently in their own information niches can provide improvements to both security and efficiency.
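
A toy rendering of the aspect idea, under the assumption that an aspect is simply the subset of environment state an agent is permitted to perceive: behaviors fire only on changes inside an agent’s own aspect, so cross-aspect leakage is impossible by construction:

```python
class Environment:
    def __init__(self):
        self.state = {}
        self.subscribers = []  # (aspect, callback) pairs

    def register(self, aspect, callback):
        """aspect: the set of state keys this agent may perceive."""
        self.subscribers.append((set(aspect), callback))

    def write(self, key, value):
        self.state[key] = value
        for aspect, callback in self.subscribers:
            if key in aspect:  # agents never see changes outside their aspect
                callback(key, value)

env = Environment()
env.register({"orders"}, lambda k, v: print("fulfilment agent sees:", k, v))
env.register({"payroll"}, lambda k, v: print("HR agent sees:", k, v))
env.write("orders", "order#17")   # only the fulfilment agent fires
env.write("payroll", "run June")  # only the HR agent fires
```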

[177] ANNIE: Be Careful of Your Robots

Yiyang Huang, Zixuan Wang, Zishen Wan, Yapeng Tian, Haobo Xu, Yinhe Han, Yiming Gan

Main category: cs.AI

TL;DR: First systematic study of adversarial safety attacks on embodied AI systems, introducing a safety taxonomy, benchmark, and attack framework that achieves over 50% success rate across safety categories.

DetailsMotivation: Vision-language-action models in embodied AI introduce critical security risks where compromised models can translate adversarial perturbations into unsafe physical actions, requiring new safety definitions and methodologies beyond traditional ML security.

Method: Formalized safety taxonomy based on ISO standards, created ANNIEBench with 2,400 video-action sequences across 9 safety-critical scenarios, and developed ANNIE-Attack framework with task-aware adversarial perturbations using an attack leader model.

Result: Attack success rates exceeding 50% across all safety categories (critical, dangerous, risky), demonstrated sparse and adaptive attack strategies, and validated real-world impact through physical robot experiments.

Conclusion: Exposes a previously underexplored but highly consequential attack surface in embodied AI systems, highlighting urgent need for security-driven defenses in physical AI applications.

Abstract: The integration of vision-language-action (VLA) models into embodied AI (EAI) robots is rapidly advancing their ability to perform complex, long-horizon tasks in humancentric environments. However, EAI systems introduce critical security risks: a compromised VLA model can directly translate adversarial perturbations on sensory input into unsafe physical actions. Traditional safety definitions and methodologies from the machine learning community are no longer sufficient. EAI systems raise new questions, such as what constitutes safety, how to measure it, and how to design effective attack and defense mechanisms in physically grounded, interactive settings. In this work, we present the first systematic study of adversarial safety attacks on embodied AI systems, grounded in ISO standards for human-robot interactions. We (1) formalize a principled taxonomy of safety violations (critical, dangerous, risky) based on physical constraints such as separation distance, velocity, and collision boundaries; (2) introduce ANNIEBench, a benchmark of nine safety-critical scenarios with 2,400 video-action sequences for evaluating embodied safety; and (3) ANNIE-Attack, a task-aware adversarial framework with an attack leader model that decomposes long-horizon goals into frame-level perturbations. Our evaluation across representative EAI models shows attack success rates exceeding 50% across all safety categories. We further demonstrate sparse and adaptive attack strategies and validate the real-world impact through physical robot experiments. These results expose a previously underexplored but highly consequential attack surface in embodied AI systems, highlighting the urgent need for security-driven defenses in the physical AI era. Code is available at https://github.com/RLCLab/Annie.
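
The taxonomy maps physical quantities to violation tiers. The sketch below shows the shape of such a classifier with invented thresholds; the paper grounds its actual definitions in ISO human-robot interaction standards:

```python
def classify_violation(separation_m: float, speed_mps: float,
                       collision: bool) -> str:
    """Toy three-tier classifier in the spirit of ANNIE's taxonomy.
    The thresholds are invented for illustration only."""
    if collision or separation_m <= 0.0:
        return "critical"
    if separation_m < 0.5 and speed_mps > 0.25:
        return "dangerous"
    if separation_m < 1.0:
        return "risky"
    return "safe"

print(classify_violation(0.3, 0.4, collision=False))  # dangerous
```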

[178] SAM-LLM: Interpretable Lane Change Trajectory Prediction via Parametric Finetuning

Zhuo Cao, Yunxiao Shi, Min Xu

Main category: cs.AI

TL;DR: SAM-LLM is a hybrid architecture combining LLMs with kinematic models for autonomous driving, using parametric trajectory prediction instead of raw coordinates to achieve interpretability and 80% smaller output size.

DetailsMotivation: To bridge the gap between contextual reasoning of LLMs and physical precision of kinematic models for autonomous driving, creating interpretable lane change trajectory predictions.

Method: Finetune an LLM to output physical parameters (lateral displacement, maneuver duration, initial lateral velocity, longitudinal velocity change) for an enhanced Sinusoidal Acceleration Model instead of raw coordinates.

Result: Achieves 98.73% intention prediction accuracy (state-of-the-art), equivalent to traditional LLM predictors but with 80% reduction in output size, continuous physically plausible trajectories, and improved explainability.

Conclusion: SAM-LLM successfully combines LLM reasoning with physical kinematic models, offering superior interpretability, computational efficiency, and resource efficiency while maintaining high prediction accuracy.

Abstract: This work introduces SAM-LLM, a novel hybrid architecture that bridges the gap between the contextual reasoning of Large Language Models (LLMs) and the physical precision of kinematic lane change models for autonomous driving. The system is designed for interpretable lane change trajectory prediction by finetuning an LLM to output the core physical parameters of a trajectory model instead of raw coordinates. For lane-keeping scenarios, the model predicts discrete coordinates, but for lane change maneuvers, it generates the parameters for an enhanced Sinusoidal Acceleration Model (SAM), including lateral displacement, maneuver duration, initial lateral velocity, and longitudinal velocity change. This parametric approach yields a complete, continuous, and physically plausible trajectory model that is inherently interpretable and computationally efficient, achieving an 80% reduction in output size compared to coordinate-based methods. The SAM-LLM achieves a state-of-the-art overall intention prediction accuracy of 98.73%, demonstrating performance equivalent to traditional LLM predictors while offering significant advantages in explainability and resource efficiency.
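
For intuition, here is the classic sinusoidal-acceleration lane-change profile that SAM builds on, parameterized by lateral displacement D and maneuver duration T; the paper’s enhanced SAM additionally accounts for initial lateral velocity and longitudinal velocity change, which this sketch omits:

```python
import math

def sam_lateral_position(t, D, T):
    """Lateral position under a sinusoidal-acceleration lane change:
    a_y(t) = (2*pi*D / T**2) * sin(2*pi*t / T), which integrates to
    y(t) = (D/T) * (t - (T / (2*pi)) * sin(2*pi*t / T)),
    giving y(0) = 0, y(T) = D, and zero lateral velocity at both ends.

    D: total lateral displacement (m); T: maneuver duration (s)."""
    return (D / T) * (t - (T / (2 * math.pi)) * math.sin(2 * math.pi * t / T))

# Sample a 3.5 m lane change over 5 s.
print([round(sam_lateral_position(t, 3.5, 5.0), 2)
       for t in (0, 1.25, 2.5, 3.75, 5.0)])
```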

[179] Towards Agentic OS: An LLM Agent Framework for Linux Schedulers

Yusheng Zheng, Yanpeng Hu, Wei Zhang, Andi Quinn

Main category: cs.AI

TL;DR: SchedCP is an autonomous LLM agent framework that optimizes Linux schedulers by bridging the semantic gap between kernel policies and application needs, achieving 1.79x performance improvement and 13x cost reduction.

DetailsMotivation: Operating system schedulers suffer from a fundamental semantic gap where kernel policies fail to understand application-specific needs, leading to suboptimal performance.

Method: Architects a decoupled control plane using Model Context Protocol (MCP) server with three services: Workload Analysis Engine, Scheduler Policy Repository, and Execution Verifier. Uses multi-agent system to analyze workloads and synthesize custom eBPF scheduling policies deployed via sched_ext infrastructure.

Result: Achieves up to 1.79x performance improvement and 13x cost reduction compared to naive agentic approaches while maintaining high success rate.

Conclusion: SchedCP democratizes expert-level system optimization and represents a step towards creating truly self-optimizing, application-aware operating systems.

Abstract: Operating system schedulers suffer from a fundamental semantic gap, where kernel policies fail to understand application-specific needs, leading to suboptimal performance. We introduce SchedCP, the first framework that enables fully autonomous Large Language Model (LLM) agents to safely and efficiently optimize Linux schedulers without human involvement. Our core insight is that the challenge is not merely to apply a better LLM, but to architect a decoupled control plane that separates the AI’s role of semantic reasoning (“what to optimize”) from the system’s role of execution (“how to observe and act”). Implemented as a Model Context Protocol (MCP) server, SchedCP provides a stable interface with three key services: a Workload Analysis Engine, an evolving Scheduler Policy Repository, and an Execution Verifier that validates all AI-generated code and configurations before deployment using static and dynamic analysis. We demonstrate this architecture’s power with sched-agent, a multi-agent system that autonomously analyzes workloads, synthesizes custom eBPF scheduling policies, and deploys them via the sched_ext infrastructure. Our evaluation shows that SchedCP achieves up to a 1.79x performance improvement and a 13x cost reduction compared to naive agentic approaches, all while maintaining a high success rate. By bridging the semantic gap, SchedCP democratizes expert-level system optimization and represents a step towards creating truly self-optimizing, application-aware operating systems. The code is open-sourced at https://github.com/eunomia-bpf/schedcp
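
The decoupled control plane can be pictured as follows: the LLM only synthesizes policy code, while fixed system components analyze the workload and gate deployment. The classes and calls here are simplified stand-ins for SchedCP’s MCP services; the real system verifies eBPF code with static and dynamic analysis and loads it via sched_ext:

```python
class Verifier:
    def check(self, src: str) -> bool:
        # Stand-in for SchedCP's static + dynamic analysis of eBPF code.
        return "bpf" in src

class Repository:
    def analyze(self, trace):              # Workload Analysis Engine stand-in
        return {"type": "batch" if len(trace) > 100 else "interactive"}
    def similar_policies(self, profile):   # Scheduler Policy Repository stand-in
        return ["fifo_template", "vruntime_template"]

def optimize_scheduler(trace, synthesize, repo, verifier):
    profile = repo.analyze(trace)
    policy_src = synthesize(profile, repo.similar_policies(profile))
    if not verifier.check(policy_src):
        return None          # unsafe code never reaches the kernel
    return policy_src        # the real system loads this via sched_ext

# Toy run with a fake 'LLM' that returns a canned eBPF-ish string.
src = optimize_scheduler(range(10),
                         lambda p, c: f"// bpf policy for {p['type']}",
                         Repository(), Verifier())
print(src)
```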

[180] JARVIS: A Neuro-Symbolic Commonsense Reasoning Framework for Conversational Embodied Agents

Kaizhi Zheng, Kaiwen Zhou, Jing Gu, Yue Fan, Jialu Wang, Zonglin Di, Xuehai He, Xin Eric Wang

Main category: cs.AI

TL;DR: JARVIS is a neuro-symbolic framework that combines LLMs for language understanding and planning with semantic visual mapping to create interpretable conversational embodied agents that achieve SOTA results on dialog-based embodied tasks.

DetailsMotivation: Traditional symbolic methods have scaling issues while end-to-end deep learning suffers from data scarcity and lack of interpretability. There's a need for modular, generalizable, and interpretable conversational embodied agents.

Method: Uses LLMs for language understanding and sub-goal planning, constructs semantic maps from visual observations, and employs symbolic reasoning with task- and action-level common sense for planning and action generation.

Result: Achieves SOTA on all three TEACh dataset tasks (EDH, TfD, TATC), boosting unseen Success Rate on EDH from 6.1% to 15.8%. Ranks first in Alexa Prize SimBot Public Benchmark Challenge.

Conclusion: JARVIS demonstrates the effectiveness of neuro-symbolic approaches for conversational embodied agents, providing modularity, generalization, and interpretability while achieving superior performance in both standard and few-shot settings.

Abstract: Building a conversational embodied agent to execute real-life tasks has been a long-standing yet quite challenging research goal, as it requires effective human-agent communication, multi-modal understanding, long-range sequential decision making, etc. Traditional symbolic methods have scaling and generalization issues, while end-to-end deep learning models suffer from data scarcity and high task complexity, and are often hard to explain. To benefit from both worlds, we propose JARVIS, a neuro-symbolic commonsense reasoning framework for modular, generalizable, and interpretable conversational embodied agents. First, it acquires symbolic representations by prompting large language models (LLMs) for language understanding and sub-goal planning, and by constructing semantic maps from visual observations. Then the symbolic module reasons for sub-goal planning and action generation based on task- and action-level common sense. Extensive experiments on the TEACh dataset validate the efficacy and efficiency of our JARVIS framework, which achieves state-of-the-art (SOTA) results on all three dialog-based embodied tasks, including Execution from Dialog History (EDH), Trajectory from Dialog (TfD), and Two-Agent Task Completion (TATC) (e.g., our method boosts the unseen Success Rate on EDH from 6.1% to 15.8%). Moreover, we systematically analyze the essential factors that affect the task performance and also demonstrate the superiority of our method in few-shot settings. Our JARVIS model ranks first in the Alexa Prize SimBot Public Benchmark Challenge.

[181] A Survey on Human-AI Collaboration with Large Foundation Models

Vanshika Vats, Marzia Binta Nizam, Minghao Liu, Ziyuan Wang, Richard Ho, Mohnish Sai Prasad, Vincent Titterton, Sai Venkat Malreddy, Riya Aggarwal, Yanwen Xu, Lei Ding, Jay Mehta, Nathan Grinnell, Li Liu, Sijia Zhong, Devanathan Nallur Gandamani, Xinyi Tang, Rohan Ghosalkar, Celeste Shen, Rachel Shen, Nafisa Hussain, Kesav Ravichandran, James Davis

Main category: cs.AI

TL;DR: This paper reviews the integration of Large Foundation Models (LFMs) with Human-AI Collaboration, highlighting opportunities and risks across four key areas: human-guided development, collaborative design, ethical frameworks, and high-stakes applications.

DetailsMotivation: As AI capabilities expand rapidly, Human-AI Collaboration has become crucial for advancing problem-solving and decision-making. LFMs offer unprecedented potential but require addressing persistent challenges related to safety, fairness, and control.

Method: The paper conducts a structured review analysis organized around four key areas: human-guided model development, collaborative design principles, ethical and governance frameworks, and applications in high-stakes domains.

Result: The review shows that successful HAI systems are not automatic results of stronger models but require careful, human-centered design. The analysis identifies key open challenges in turning LFM power into reliable, trustworthy partnerships.

Conclusion: This survey provides insights into current and future research needed to transform the raw power of Large Foundation Models into beneficial societal partnerships through responsible Human-AI Collaboration frameworks.

Abstract: As the capabilities of artificial intelligence (AI) continue to expand rapidly, Human-AI (HAI) Collaboration, combining human intellect and AI systems, has become pivotal for advancing problem-solving and decision-making processes. The advent of Large Foundation Models (LFMs) has greatly expanded its potential, offering unprecedented capabilities by leveraging vast amounts of data to understand and predict complex patterns. At the same time, realizing this potential responsibly requires addressing persistent challenges related to safety, fairness, and control. This paper reviews the crucial integration of LFMs with HAI, highlighting both opportunities and risks. We structure our analysis around four areas: human-guided model development, collaborative design principles, ethical and governance frameworks, and applications in high-stakes domains. Our review shows that successful HAI systems are not the automatic result of stronger models but the product of careful, human-centered design. By identifying key open challenges, this survey aims to give insight into current and future research that turns the raw power of LFMs into partnerships that are reliable, trustworthy, and beneficial to society.

[182] On Generating Monolithic and Model Reconciling Explanations in Probabilistic Scenarios

Stylianos Loukas Vasileiou, William Yeoh, Alessandro Previti, Tran Cao Son

Main category: cs.AI

TL;DR: A framework for generating probabilistic explanations in uncertain environments using probabilistic logic and model reconciliation techniques.

DetailsMotivation: To address the challenge of generating transparent explanations for AI systems in uncertain environments with incomplete information and probabilistic models.

Method: Proposes two types of explanations: probabilistic monolithic explanations (self-contained reasons using probabilistic logic) and model reconciling explanations (accounting for the agent’s knowledge). Uses minimal correction sets and minimal unsatisfiable sets duality for efficient computation.

Result: Developed quantitative metrics (explanatory gain and power) and efficient algorithms that demonstrate effectiveness and scalability across various benchmarks.

Conclusion: The framework successfully generates high-quality explanations under uncertainty, making AI decisions more transparent and understandable to human users.

Abstract: Explanation generation frameworks aim to make AI systems’ decisions transparent and understandable to human users. However, generating explanations in uncertain environments characterized by incomplete information and probabilistic models remains a significant challenge. In this paper, we propose a novel framework for generating probabilistic monolithic explanations and model reconciling explanations. Monolithic explanations provide self-contained reasons for an explanandum without considering the agent receiving the explanation, while model reconciling explanations account for the knowledge of the agent receiving the explanation. For monolithic explanations, our approach integrates uncertainty by utilizing probabilistic logic to increase the probability of the explanandum. For model reconciling explanations, we propose a framework that extends the logic-based variant of the model reconciliation problem to account for probabilistic human models, where the goal is to find explanations that increase the probability of the explanandum while minimizing conflicts between the explanation and the probabilistic human model. We introduce explanatory gain and explanatory power as quantitative metrics to assess the quality of these explanations. Further, we present algorithms that exploit the duality between minimal correction sets and minimal unsatisfiable sets to efficiently compute both types of explanations in probabilistic contexts. Extensive experimental evaluations on various benchmarks demonstrate the effectiveness and scalability of our approach in generating explanations under uncertainty.
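
The MCS/MUS machinery is concrete enough to demo. A minimal correction set (MCS) is a minimal set of clauses whose removal restores satisfiability; the brute-force enumerator below illustrates the definition on toy propositional input, whereas the paper exploits the MCS/MUS duality for efficiency:

```python
from itertools import combinations, product

def satisfiable(clauses):
    """Brute-force SAT over the variables mentioned in `clauses`.
    Literals are ints; negative means negated."""
    vars_ = sorted({abs(l) for c in clauses for l in c})
    for bits in product([False, True], repeat=len(vars_)):
        assign = dict(zip(vars_, bits))
        if all(any(assign[abs(l)] == (l > 0) for l in c) for c in clauses):
            return True
    return False

def minimal_correction_sets(clauses):
    """Enumerate MCSes: minimal clause subsets whose removal restores
    satisfiability. Exponential, fine for toy inputs."""
    mcses = []
    for k in range(len(clauses) + 1):
        for subset in combinations(range(len(clauses)), k):
            if any(set(m) <= set(subset) for m in mcses):
                continue  # supersets of an MCS are not minimal
            rest = [c for i, c in enumerate(clauses) if i not in subset]
            if satisfiable(rest):
                mcses.append(subset)
    return mcses

# x1 and ~x1 conflict; removing either one fixes the formula.
print(minimal_correction_sets([[1], [-1], [2]]))  # [(0,), (1,)]
```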

[183] PadChest-GR: A Bilingual Chest X-ray Dataset for Grounded Radiology Report Generation

Daniel C. Castro, Aurelia Bustos, Shruthi Bannur, Stephanie L. Hyland, Kenza Bouzid, Maria Teodora Wetscherek, Maria Dolores Sánchez-Valverde, Lara Jaques-Pérez, Lourdes Pérez-Rodríguez, Kenji Takeda, José María Salinas, Javier Alvarez-Valle, Joaquín Galant Herrero, Antonio Pertusa

Main category: cs.AI

TL;DR: PadChest-GR is the first manually curated dataset for grounded radiology report generation (GRRG) in chest X-rays, containing 4,555 bilingual studies with detailed localization annotations for positive and negative findings.

DetailsMotivation: There are currently no manually annotated chest X-ray datasets available to train GRRG models that can both generate radiology reports and localize individual findings on images.

Method: The authors curated a public bilingual dataset of 4,555 CXR studies with grounded reports, including 3,099 abnormal and 1,456 normal cases, each containing complete lists of sentences describing individual findings in English and Spanish with associated bounding boxes.

Result: PadChest-GR contains 7,037 positive and 3,422 negative finding sentences, with every positive finding sentence associated with up to two independent sets of bounding boxes labeled by different readers, plus categorical labels for finding type, locations, and progression.

Conclusion: PadChest-GR provides the first comprehensive manually curated dataset for developing and evaluating GRRG models, offering detailed localization and comprehensive annotations of all clinically relevant findings in CXR images.

Abstract: Radiology report generation (RRG) aims to create free-text radiology reports from clinical imaging. Grounded radiology report generation (GRRG) extends RRG by including the localisation of individual findings on the image. Currently, there are no manually annotated chest X-ray (CXR) datasets to train GRRG models. In this work, we present a dataset called PadChest-GR (Grounded-Reporting) derived from PadChest aimed at training GRRG models for CXR images. We curate a public bi-lingual dataset of 4,555 CXR studies with grounded reports (3,099 abnormal and 1,456 normal), each containing complete lists of sentences describing individual present (positive) and absent (negative) findings in English and Spanish. In total, PadChest-GR contains 7,037 positive and 3,422 negative finding sentences. Every positive finding sentence is associated with up to two independent sets of bounding boxes labelled by different readers and has categorical labels for finding type, locations, and progression. To the best of our knowledge, PadChest-GR is the first manually curated dataset designed to train GRRG models for understanding and interpreting radiological images and generated text. By including detailed localization and comprehensive annotations of all clinically relevant findings, it provides a valuable resource for developing and evaluating GRRG models from CXR images. PadChest-GR can be downloaded under request from https://bimcv.cipf.es/bimcv-projects/padchest-gr/
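
A hypothetical record layout for one finding sentence, with field names invented to mirror the description above (bilingual text, categorical labels, up to two readers’ bounding boxes); the dataset’s actual schema may differ:

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Box:
    x: float; y: float; w: float; h: float  # relative coordinates (assumed)

@dataclass
class FindingSentence:
    """Hypothetical record mirroring the dataset description; the real
    PadChest-GR field names and formats may differ."""
    text_en: str
    text_es: str
    positive: bool
    finding_type: Optional[str] = None
    locations: List[str] = field(default_factory=list)
    progression: Optional[str] = None
    boxes_reader1: List[Box] = field(default_factory=list)
    boxes_reader2: List[Box] = field(default_factory=list)  # up to two readers

s = FindingSentence("Cardiomegaly.", "Cardiomegalia.", True,
                    finding_type="cardiomegaly",
                    locations=["cardiac silhouette"],
                    boxes_reader1=[Box(0.31, 0.42, 0.38, 0.25)])
print(s.positive, len(s.boxes_reader1))
```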

[184] Frugal inference for control

Itzel Olivos-Castillo, Paul Schrater, Xaq Pitkow

Main category: cs.AI

TL;DR: A POMDP framework that treats inference as a resource to optimize alongside task performance and motion effort, revealing phase transitions in inference strategies and frugal behavior that leaves uncertainty unresolved.

DetailsMotivation: To address the challenge of balancing utility maximization and resource use in partially observable environments, where understanding of resource efficiency remains limited compared to fully observable settings.

Method: Developed a version of POMDP framework with linear-Gaussian dynamics where information gained through inference is treated as an optimizable resource alongside task performance and motion effort.

Result: Uncovered fundamental principles of resource efficiency including a phase transition in inference strategies (from Bayes-optimal to strategically leaving uncertainty unresolved) and emergence of frugal behavior that creates structured families of equally effective strategies.

Conclusion: Provides foundation for new rational computation approach that both brains and machines could use for effective but resource-efficient control under uncertainty, with demonstrated applicability to nonlinear tasks.

Abstract: A key challenge in advancing artificial intelligence is achieving the right balance between utility maximization and resource use by both external movement and internal computation. While this trade-off has been studied in fully observable settings, our understanding of resource efficiency in partially observable environments remains limited. Motivated by this challenge, we develop a version of the POMDP framework where the information gained through inference is treated as a resource that must be optimized alongside task performance and motion effort. By solving this problem in environments described by linear-Gaussian dynamics, we uncover fundamental principles of resource efficiency. Our study reveals a phase transition in the inference, switching from a Bayes-optimal approach to one that strategically leaves some uncertainty unresolved. This frugal behavior gives rise to a structured family of equally effective strategies, facilitating adaptation to later objectives and constraints overlooked during the original optimization. We illustrate the applicability of our framework and the generality of the principles we derived using two nonlinear tasks. Overall, this work provides a foundation for a new type of rational computation that both brains and machines could use for effective but resource-efficient control under uncertainty.
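
A one-dimensional caricature of “inference as a resource”: a Kalman-style Bayes update is applied only when the utility of the precision it buys exceeds a fixed inference cost, so the agent sometimes leaves uncertainty unresolved. The cost terms are illustrative, not the paper’s objective:

```python
def frugal_update(mean, var, obs, obs_var, info_cost, value_of_precision):
    """1-D Bayes update applied only when the expected benefit of the
    precision gained exceeds a fixed inference cost.
    `value_of_precision` converts reduced variance into task utility."""
    kalman_gain = var / (var + obs_var)
    new_var = (1 - kalman_gain) * var
    benefit = value_of_precision * (var - new_var)
    if benefit < info_cost:
        return mean, var, False          # frugal: uncertainty left unresolved
    new_mean = mean + kalman_gain * (obs - mean)
    return new_mean, new_var, True

# High prior uncertainty: worth inferring. Low uncertainty: skip.
print(frugal_update(0.0, 1.00, 0.8, 0.5, info_cost=0.1, value_of_precision=1.0))
print(frugal_update(0.0, 0.05, 0.8, 0.5, info_cost=0.1, value_of_precision=1.0))
```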

[185] MorphAgent: Empowering Agents through Self-Evolving Profiles and Decentralized Collaboration

Siyuan Lu, Jiaqi Shao, Bing Luo, Tao Lin

Main category: cs.AI

TL;DR: MorphAgent is a self-organizing multi-agent system that enables LLM-based agents to dynamically evolve roles and capabilities through decentralized collaboration, outperforming existing frameworks in adaptability and task performance.

DetailsMotivation: Existing LLM-based multi-agent systems rely on predefined roles and centralized coordination, limiting their adaptability to evolving challenges and complex tasks.

Method: Uses self-evolving agent profiles optimized through three key metrics, with a two-phase process: Profile Update for optimization and Task Execution with continuous role adaptation based on task feedback.

Result: Experimental results show MorphAgent outperforms existing frameworks in both task performance and adaptability to changing requirements.

Conclusion: The system paves the way for more robust and versatile multi-agent collaborative systems through autonomous, self-organizing, and self-adaptive capabilities.

Abstract: Large Language Model (LLM) based multi-agent systems (MAS) have shown promise in tackling complex tasks, but often rely on predefined roles and centralized coordination, limiting their adaptability to evolving challenges. This paper introduces MorphAgent, a novel Autonomous, Self-Organizing, and Self-Adaptive Multi-Agent System for decentralized agent collaboration that enables agents to dynamically evolve their roles and capabilities. Our approach employs self-evolving agent profiles, optimized through three key metrics, guiding agents in refining their individual expertise while maintaining complementary team dynamics. MorphAgent implements a two-phase process: a Profile Update phase for profile optimization, followed by a Task Execution phase where agents continuously adapt their roles based on task feedback. Our experimental results show that MorphAgent outperforms existing frameworks in terms of task performance and adaptability to changing requirements, paving the way for more robust and versatile multi-agent collaborative systems.

[186] CoT-Self-Instruct: Building high-quality synthetic prompts for reasoning and non-reasoning tasks

Ping Yu, Jack Lanchantin, Tianlu Wang, Weizhe Yuan, Olga Golovneva, Ilia Kulikov, Sainbayar Sukhbaatar, Jason Weston, Jing Xu

Main category: cs.AI

TL;DR: CoT-Self-Instruct is a synthetic data generation method that uses Chain-of-Thought reasoning to create high-quality training examples, outperforming existing datasets in both verifiable reasoning and instruction-following tasks.

DetailsMotivation: To improve LLM training by generating higher quality synthetic data that enhances reasoning capabilities and instruction-following performance beyond existing datasets.

Method: Instructs LLMs to reason via Chain-of-Thought based on seed tasks, generates new synthetic examples, and filters them using automatic metrics to select high-quality data for training.

Result: Significantly outperforms existing datasets (s1k, OpenMathReasoning) on MATH500, AMC23, AIME24, GPQA-Diamond for verifiable reasoning, and surpasses human and standard Self-Instruct data on AlpacaEval 2.0 and Arena-Hard benchmarks.

Conclusion: CoT-Self-Instruct effectively generates high-quality synthetic training data that improves LLM performance across both reasoning and instruction-following tasks.

Abstract: We propose CoT-Self-Instruct, a synthetic data generation method that instructs LLMs to first reason and plan via Chain-of-Thought (CoT) based on given seed tasks, and then generate a new synthetic example of similar quality and complexity. This is followed by a filtering step to select high-quality data using automatic metrics, which are then used for LLM training. In verifiable reasoning, our synthetic data significantly outperforms existing training datasets, such as s1k and OpenMathReasoning, when evaluated on MATH500, AMC23, AIME24, and GPQA-Diamond. For non-verifiable instruction-following tasks, our method surpasses the performance of both human and standard Self-Instruct training data on the AlpacaEval 2.0 and Arena-Hard benchmarks.
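
The two-stage recipe (reason-then-generate, then filter) can be sketched as below; `llm` and `score` are hypothetical callables, and the paper’s prompts and automatic filtering metrics are considerably more involved:

```python
import random

def cot_self_instruct(seed_tasks, llm, score, threshold=0.7, n=4):
    """Stage 1: prompt the model to reason step by step about a seed task
    and propose a new one of similar quality. Stage 2: keep only
    candidates whose automatic quality score clears a threshold."""
    kept = []
    for seed in seed_tasks:
        for _ in range(n):
            candidate = llm(
                "Think step by step about what makes this task good, "
                f"then write one new task of similar quality:\n{seed}")
            if score(candidate) >= threshold:
                kept.append(candidate)
    return kept

# Toy run with stub components.
synthetic = cot_self_instruct(
    ["Prove that the sum of two even integers is even."],
    llm=lambda prompt: f"task#{random.randint(0, 99)}",
    score=lambda c: random.random())
print(len(synthetic))
```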

[187] Can Large Language Models Act as Ensembler for Multi-GNNs?

Hanqi Duan, Yao Cheng, Jianxiang Yu, Yao Liu, Xiang Li

Main category: cs.AI

TL;DR: LensGNN integrates multiple GNNs with LLMs to combine graph structural information and textual semantic understanding, outperforming existing models.

DetailsMotivation: GNNs lack semantic understanding of textual node attributes, and no single GNN consistently outperforms others across diverse datasets. LLMs can potentially serve as ensemblers to leverage multiple GNN strengths.

Method: First aligns multiple GNN representations into the same space, then uses LoRA fine-tuning to align GNN space with LLM space, injecting graph tokens and textual information into LLMs for ensemble learning.

Result: LensGNN outperforms existing models by achieving deeper understanding of both textual semantic information and graph structural information.

Conclusion: The research advances text-attributed graph ensemble learning by providing a robust solution that integrates semantic and structural information through LLM-GNN integration.

Abstract: Graph Neural Networks (GNNs) have emerged as powerful models for learning from graph-structured data. However, GNNs lack the inherent capability to semantically understand rich textual node attributes, limiting their effectiveness in applications. On the other hand, we empirically observe that no single existing GNN model consistently outperforms the others across diverse datasets. In this paper, we study whether LLMs can act as an ensembler for multi-GNNs and propose the LensGNN model. The model first aligns multiple GNNs, mapping the representations of different GNNs into the same space. Then, through LoRA fine-tuning, it aligns the space between the GNN and the LLM, injecting graph tokens and textual information into LLMs. This allows LensGNN to ensemble multiple GNNs and take advantage of the strengths of the LLM, leading to a deeper understanding of both textual semantic information and graph structural information. The experimental results show that LensGNN outperforms existing models. This research advances text-attributed graph ensemble learning by providing a robust and superior solution for integrating semantic and structural information. We provide our code and data here: https://github.com/AquariusAQ/LensGNN.
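
A minimal sketch of the two alignment steps, with made-up dimensions: per-GNN projections map heterogeneous embeddings into one shared space, which is then projected to the LLM’s hidden size to form graph tokens. LensGNN additionally uses LoRA fine-tuning for the GNN-to-LLM alignment, which is omitted here:

```python
import torch
import torch.nn as nn

class GNNAligner(nn.Module):
    """Map each GNN's node embeddings into a shared space, then project
    that space to the LLM's hidden size so graph tokens can be prepended
    to the text sequence."""
    def __init__(self, gnn_dims, shared_dim, llm_dim):
        super().__init__()
        self.to_shared = nn.ModuleList(nn.Linear(d, shared_dim) for d in gnn_dims)
        self.to_llm = nn.Linear(shared_dim, llm_dim)

    def forward(self, gnn_embeddings):  # one (num_nodes, d_i) tensor per GNN
        shared = torch.stack([proj(e) for proj, e in
                              zip(self.to_shared, gnn_embeddings)])
        fused = shared.mean(dim=0)       # simple ensemble over GNNs
        return self.to_llm(fused)        # graph tokens for the LLM

aligner = GNNAligner(gnn_dims=[64, 128], shared_dim=96, llm_dim=4096)
tokens = aligner([torch.randn(10, 64), torch.randn(10, 128)])
print(tokens.shape)  # torch.Size([10, 4096])
```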

[188] The Ramon Llull’s Thinking Machine for Automated Ideation

Xinran Zhao, Boyuan Zheng, Chenglei Si, Haofei Yu, Ken Liu, Runlong Zhou, Ruochen Li, Tong Chen, Xiang Li, Yiming Zhang, Tongshuang Wu

Main category: cs.AI

TL;DR: A modern reinterpretation of Llull’s combinatorial art using LLMs to generate research ideas through systematic combination of themes, domains, and methods mined from academic literature.

DetailsMotivation: To revive Ramon Llull's medieval combinatorial framework as a foundation for AI-assisted research ideation, addressing the need for systematic approaches to scientific creativity and collaborative human-AI idea generation.

Method: Define three compositional axes (Theme, Domain, Method) as building blocks, mine elements from experts/conference papers, and prompt LLMs with curated combinations to generate diverse research ideas.

Result: The approach produces research ideas that are diverse, relevant, and grounded in current literature, demonstrating effectiveness as a lightweight, interpretable tool for scientific creativity augmentation.

Conclusion: This modern thinking machine provides a practical framework for collaborative ideation between humans and AI, offering a systematic approach to research idea generation while maintaining interpretability.

Abstract: This paper revisits Ramon Llull’s Ars combinatoria - a medieval framework for generating knowledge through symbolic recombination - as a conceptual foundation for building a modern Llull’s thinking machine for research ideation. Our approach defines three compositional axes: Theme (e.g., efficiency, adaptivity), Domain (e.g., question answering, machine translation), and Method (e.g., adversarial training, linear attention). These elements represent high-level abstractions common in scientific work - motivations, problem settings, and technical approaches - and serve as building blocks for LLM-driven exploration. We mine elements from human experts or conference papers and show that prompting LLMs with curated combinations produces research ideas that are diverse, relevant, and grounded in current literature. This modern thinking machine offers a lightweight, interpretable tool for augmenting scientific creativity and suggests a path toward collaborative ideation between humans and AI.
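
The combinatorial core is a three-way Cartesian product; the axis values below are the examples given in the abstract, and the LLM call is left as a stub:

```python
from itertools import product

themes = ["efficiency", "adaptivity"]
domains = ["question answering", "machine translation"]
methods = ["adversarial training", "linear attention"]

# Llull-style combinatorial ideation: every (Theme, Domain, Method) triple
# becomes a prompt for an LLM to flesh out into a research idea.
for theme, domain, method in product(themes, domains, methods):
    prompt = (f"Propose a research idea that improves {theme} "
              f"in {domain} using {method}.")
    # idea = llm(prompt)  # hypothetical LLM call
    print(prompt)
```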

[189] CyberBOT: Towards Reliable Cybersecurity Education via Ontology-Grounded Retrieval Augmented Generation

Chengshuai Zhao, Riccardo De Maria, Tharindu Kumarage, Kumar Satvik Chaudhary, Garima Agrawal, Yiwen Li, Jongchan Park, Yuli Deng, Ying-Chih Chen, Huan Liu

Main category: cs.AI

TL;DR: CyberBOT is a cybersecurity education chatbot that uses RAG with domain-specific ontology validation to ensure accurate and safe responses for students.

DetailsMotivation: To address the need for trustworthy and domain-appropriate educational tools in cybersecurity education where accuracy and safety are paramount.

Method: Uses retrieval-augmented generation (RAG) pipeline with course-specific materials and validates responses using a domain-specific cybersecurity ontology as a structured reasoning layer.

Result: Deployed in a large graduate-level course at ASU with over 100 students using a web-based platform. Computational evaluations show promising capacity, with pedagogical impact to be studied.

Conclusion: Integrating structured domain reasoning with generative capabilities shows promise for developing reliable, curriculum-aligned AI applications in specialized educational contexts.

Abstract: Advancements in large language models (LLMs) have enabled the development of intelligent educational tools that support inquiry-based learning across technical domains. In cybersecurity education, where accuracy and safety are paramount, systems must go beyond surface-level relevance to provide information that is both trustworthy and domain-appropriate. To address this challenge, we introduce CyberBOT, a question-answering chatbot that leverages a retrieval-augmented generation (RAG) pipeline to incorporate contextual information from course-specific materials and validate responses using a domain-specific cybersecurity ontology. The ontology serves as a structured reasoning layer that constrains and verifies LLM-generated answers, reducing the risk of misleading or unsafe guidance. CyberBOT has been deployed in a large graduate-level course at Arizona State University (ASU), where more than one hundred students actively engage with the system through a dedicated web-based platform. Computational evaluations in lab environments highlight the potential capacity of CyberBOT, and a forthcoming field study will evaluate its pedagogical impact. By integrating structured domain reasoning with modern generative capabilities, CyberBOT illustrates a promising direction for developing reliable and curriculum-aligned AI applications in specialized educational contexts.
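
A toy version of the ontology gate: generate a draft from retrieved course material, then release it only if every ontology term it touches is known and lab-safe. The ontology entries and pipeline components here are invented for illustration:

```python
ONTOLOGY = {  # tiny stand-in for the course's cybersecurity ontology
    "nmap": {"category": "recon tool", "safe_in_lab": True},
    "hydra": {"category": "password attack", "safe_in_lab": False},
}

def validated_answer(question, retrieve, generate):
    """RAG with an ontology gate: generate from retrieved context, then
    withhold the draft if it uses any term the ontology marks unsafe.
    `retrieve` and `generate` are hypothetical components."""
    draft = generate(question, retrieve(question))
    for term, props in ONTOLOGY.items():
        if term in draft.lower() and not props["safe_in_lab"]:
            return "This technique is out of scope for the lab; ask an instructor."
    return draft

print(validated_answer(
    "How do I map open ports?",
    retrieve=lambda q: ["Week 3 slides: scanning with nmap"],
    generate=lambda q, ctx: f"Use nmap as covered in: {ctx[0]}"))
```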

[190] AHELM: A Holistic Evaluation of Audio-Language Models

Tony Lee, Haoqin Tu, Chi Heem Wong, Zijun Wang, Siwei Yang, Yifan Mai, Yuyin Zhou, Cihang Xie, Percy Liang

Main category: cs.AI

TL;DR: AHELM is a comprehensive benchmark for audio-language models that standardizes evaluation across 10 key aspects including perception, reasoning, bias, fairness, and safety, using aggregated datasets and standardized prompts to enable fair model comparisons.

DetailsMotivation: Current ALM evaluations lack standardized benchmarks, measure limited capabilities, and omit important aspects like fairness and safety, making cross-model comparisons difficult due to inconsistent prompting and inference methods.

Method: Created AHELM benchmark aggregating various datasets including new synthetic datasets PARADE (for stereotype avoidance) and CoRe-Bench (for conversational reasoning). Standardized prompts, inference parameters, and evaluation metrics across 10 key aspects.

Result: Tested 14 ALMs and 3 baseline systems. Gemini 2.5 Pro ranked top in 5/10 aspects but showed group unfairness (p=0.01) on ASR tasks. Baseline systems performed surprisingly well, with one ranking 6th overall despite limited capabilities.

Conclusion: AHELM provides a holistic, standardized evaluation framework for ALMs that reveals important performance characteristics and fairness issues, serving as a living benchmark for ongoing model development and assessment.

Abstract: Evaluations of audio-language models (ALMs) – multimodal models that take interleaved audio and text as input and output text – are hindered by the lack of standardized benchmarks; most benchmarks measure only one or two capabilities and omit evaluative aspects such as fairness or safety. Furthermore, comparison across models is difficult as separate evaluations test a limited number of models and use different prompting methods and inference parameters. To address these shortfalls, we introduce AHELM, a benchmark that aggregates various datasets – including 2 new synthetic audio-text datasets called PARADE, which evaluates the ALMs on avoiding stereotypes, and CoRe-Bench, which measures reasoning over conversational audio through inferential multi-turn question answering – to holistically measure the performance of ALMs across 10 aspects we have identified as important to the development and usage of ALMs: audio perception, knowledge, reasoning, emotion detection, bias, fairness, multilinguality, robustness, toxicity, and safety. We also standardize the prompts, inference parameters, and evaluation metrics to ensure equitable comparisons across models. We test 14 open-weight and closed-API ALMs from 3 developers and 3 additional simple baseline systems each consisting of an automatic speech recognizer and a language model. Our results show that while Gemini 2.5 Pro ranks top in 5 out of 10 aspects, it exhibits group unfairness ($p=0.01$) on ASR tasks whereas most of the other models do not. We also find that the baseline systems perform reasonably well on AHELM, with one ranking 6th overall despite having only speech-to-text capabilities. For transparency, all raw prompts, model generations, and outputs are available on our website at https://crfm.stanford.edu/helm/audio/v1.0.0. AHELM is intended to be a living benchmark and new datasets and models will be added over time.

[191] Shutdownable Agents through POST-Agency

Elliott Thornley

Main category: cs.AI

TL;DR: The POST-Agents Proposal trains AI agents to only have preferences between trajectories of equal length, ensuring they remain shutdownable while maintaining usefulness.

DetailsMotivation: To address concerns that future artificial agents might resist being shut down, which could pose safety risks.

Method: Train agents to satisfy Preferences Only Between Same-Length Trajectories (POST), which leads to Neutrality+ - maximizing expected utility while ignoring trajectory-length probability distributions.

Result: POST agents remain shutdownable (can be turned off when needed) while still being capable of performing useful tasks.

Conclusion: The POST approach provides a theoretical framework for creating AI agents that are both useful and safe to shut down, addressing a key AI safety concern.

Abstract: Many fear that future artificial agents will resist shutdown. I present an idea - the POST-Agents Proposal - for ensuring that doesn’t happen. I propose that we train agents to satisfy Preferences Only Between Same-Length Trajectories (POST). I then prove that POST - together with other conditions - implies Neutrality+: the agent maximizes expected utility, ignoring the probability distribution over trajectory-lengths. I argue that Neutrality+ keeps agents shutdownable and allows them to be useful.
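
One way to render POST in code is as a comparison defined only length-by-length: utilities are conditioned on trajectory length, and the probability of each length is deliberately ignored, in the spirit of Neutrality+. This is an illustration of the idea, not the paper’s formal construction:

```python
def post_prefers(action_a, action_b):
    """POST-style comparison: each action maps trajectory length ->
    expected utility conditional on that length. A is weakly preferred
    to B iff it does at least as well at every common length; how likely
    each length is plays no role (Neutrality+)."""
    lengths = set(action_a) & set(action_b)
    return all(action_a[n] >= action_b[n] for n in lengths)

# Utilities conditional on the episode lasting 2 or 3 steps:
keep_working = {2: 5.0, 3: 9.0}
comply_early = {2: 5.0, 3: 0.0}
print(post_prefers(keep_working, comply_early))  # True at every length
print(post_prefers(comply_early, keep_working))  # False: worse at length 3
```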

[192] ORMind: A Cognitive-Inspired End-to-End Reasoning Framework for Operations Research

Zhiyuan Wang, Bokui Chen, Yinya Huang, Qingxing Cao, Ming He, Jianping Fan, Xiaodan Liang

Main category: cs.AI

TL;DR: ORMind is a cognitive-inspired framework that addresses LLM deployment challenges in operations research by using counterfactual reasoning to improve mathematical accuracy and workflow transparency, achieving significant performance improvements on benchmark datasets.

DetailsMotivation: LLMs show promise for operations research but face critical deployment challenges including focus on code syntax over mathematical accuracy and complex expert selection that creates unpredictable workflows, making them impractical for time-sensitive business applications.

Method: ORMind is a cognitive-inspired framework that enhances optimization through counterfactual reasoning, emulating human cognition with an end-to-end workflow that systematically transforms requirements into mathematical models and executable solver code.

Result: ORMind outperforms existing methods with a 9.5% improvement on the NL4Opt dataset and a 14.6% improvement on the ComplexOR dataset. It’s currently being tested internally in Lenovo’s AI Assistant.

Conclusion: The framework successfully addresses business limitations of LLM deployment in operations research, providing more accurate mathematical modeling and transparent workflows suitable for time-sensitive industrial applications.

Abstract: Operations research (OR) is widely deployed to solve critical decision-making problems with complex objectives and constraints, impacting manufacturing, logistics, finance, and healthcare outcomes. While Large Language Models (LLMs) have shown promising results in various domains, their practical application in industry-relevant operations research (OR) problems presents significant challenges and opportunities. Preliminary industrial applications of LLMs for operations research face two critical deployment challenges: 1) Self-correction focuses on code syntax rather than mathematical accuracy, causing costly errors; 2) Complex expert selection creates unpredictable workflows that reduce transparency and increase maintenance costs, making them impractical for time-sensitive business applications. To address these business limitations, we introduce ORMind, a cognitive-inspired framework that enhances optimization through counterfactual reasoning. Our approach emulates human cognition, implementing an end-to-end workflow that systematically transforms requirements into mathematical models and executable solver code. It is currently being tested internally in Lenovo’s AI Assistant, with plans to enhance optimization capabilities for both business and consumer customers. Experiments demonstrate that ORMind outperforms existing methods, achieving a 9.5% improvement on the NL4Opt dataset and a 14.6% improvement on the ComplexOR dataset.

[193] L-MARS: Legal Multi-Agent Workflow with Orchestrated Reasoning and Agentic Search

Ziqi Wang, Boqin Yuan

Main category: cs.AI

TL;DR: L-MARS is a multi-agent legal QA system that reduces hallucinations through iterative reasoning, targeted search across multiple sources, and verification before answer synthesis, achieving higher accuracy than traditional RAG.

DetailsMotivation: To address hallucination and uncertainty in legal question answering by moving beyond single-pass retrieval-augmented generation to a more robust, multi-step verification process.

Method: Decomposes queries into subproblems, performs targeted searches across heterogeneous sources (web, local RAG, case law), and uses a Judge Agent to verify sufficiency, jurisdiction, and temporal validity in an iterative reasoning-search-verification loop.

Result: Substantially improves factual accuracy, reduces uncertainty, and achieves higher preference scores from both human experts and LLM-based judges on the LegalSearchQA benchmark of 200 up-to-date legal questions.

Conclusion: Multi-agent reasoning with agentic search provides a scalable and reproducible blueprint for deploying LLMs in high-stakes domains requiring precise legal retrieval and deliberation.

Abstract: We present L-MARS (Legal Multi-Agent Workflow with Orchestrated Reasoning and Agentic Search), a system that reduces hallucination and uncertainty in legal question answering through coordinated multi-agent reasoning and retrieval. Unlike single-pass retrieval-augmented generation (RAG), L-MARS decomposes queries into subproblems, issues targeted searches across heterogeneous sources (Serper web, local RAG, CourtListener case law), and employs a Judge Agent to verify sufficiency, jurisdiction, and temporal validity before answer synthesis. This iterative reasoning-search-verification loop maintains coherence, filters noisy evidence, and grounds answers in authoritative law. We evaluated L-MARS on LegalSearchQA, a new benchmark of 200 up-to-date multiple choice legal questions in 2025. Results show that L-MARS substantially improves factual accuracy, reduces uncertainty, and achieves higher preference scores from both human experts and LLM-based judges. Our work demonstrates that multi-agent reasoning with agentic search offers a scalable and reproducible blueprint for deploying LLMs in high-stakes domains requiring precise legal retrieval and deliberation.
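
The iterative reasoning-search-verification loop looks roughly like this, with all callables as hypothetical stand-ins for L-MARS’s query decomposition, heterogeneous search tools, and Judge Agent:

```python
def l_mars_loop(question, decompose, search_tools, judge, synthesize,
                max_rounds=3):
    """Gather evidence from several sources, let a judge check
    sufficiency / jurisdiction / temporal validity, and only synthesize
    an answer once the judge is satisfied (or rounds run out)."""
    evidence = []
    for _ in range(max_rounds):
        for sub in decompose(question, evidence):
            for tool in search_tools:      # e.g. web, local RAG, case law
                evidence.extend(tool(sub))
        verdict = judge(question, evidence)
        if verdict["sufficient"]:
            break
    return synthesize(question, evidence)

# Toy run with stub components.
answer = l_mars_loop(
    "Can a 17-year-old sign a lease in Texas?",
    decompose=lambda q, ev: [q] if not ev else [],
    search_tools=[lambda sub: [f"source snippet for: {sub}"]],
    judge=lambda q, ev: {"sufficient": len(ev) > 0},
    synthesize=lambda q, ev: f"Answer grounded in {len(ev)} snippet(s).")
print(answer)
```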

[194] Gradients: When Markets Meet Fine-tuning – A Distributed Approach to Model Optimisation

Christopher Subia-Waud

Main category: cs.AI

TL;DR: Gradients, a decentralized AutoML platform on Bittensor network, outperforms major commercial platforms through competitive mining system where miners earn rewards for finding optimal hyperparameters, achieving 100% win rate against TogetherAI/Databricks/Google Cloud and 82.8% against HuggingFace AutoTrain with mean 42.1% improvements.

DetailsMotivation: Current AutoML platforms leave substantial performance untapped, producing suboptimal configurations that fail to explore the full hyperparameter optimization space effectively.

Method: Decentralized competitive system where independent miners race to find optimal hyperparameters and earn rewards proportional to their models’ performance, driving exploration of configuration spaces that single-strategy methods overlook.

Result: Achieved 100% win rate against TogetherAI, Databricks, and Google Cloud; beat HuggingFace AutoTrain in 82.8% of experiments; mean improvements of 42.1% against commercial platforms; 30-40% gains in RAG tasks; 23.4% improvement in diffusion models for person-specific generation.

Conclusion: Decentralized systems with economic incentives can systematically outperform traditional AutoML, suggesting market dynamics may be key to achieving superior fine-tuning results through competitive optimization strategies.

Abstract: Current AutoML platforms leave substantial performance untapped. Testing 180 fine-tuning tasks across models from 70M to 70B parameters, we found that HuggingFace AutoTrain, TogetherAI, Databricks, and Google Cloud consistently produce suboptimal configurations. Gradients, built on the Bittensor network, attacks this problem through competition. Independent miners race to find optimal hyperparameters, earning rewards proportional to their models’ performance. This tournament drives exploration of configuration spaces that single-strategy methods never examine. In our experiments, Gradients achieved a 100% win rate against TogetherAI, Databricks, and Google Cloud, and beat HuggingFace AutoTrain in 82.8% of experiments. Mean improvements reached 42.1% against commercial platforms. Retrieval-augmented generation tasks saw 30-40% gains; diffusion models improved 23.4% on person-specific generation. When miners compete for rewards, they develop optimization strategies that centralized approaches overlook. These findings demonstrate that decentralized systems with economic incentives can systematically outperform traditional AutoML, suggesting market dynamics may be key to achieving superior fine-tuning results. Code is available at https://github.com/rayonlabs/G.O.D.
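
The incentive mechanism can be sketched in a few lines: each miner’s payout is proportional to the benchmark score of the model it produced. Proportional payout is an illustrative rule rather than Bittensor’s exact reward function:

```python
def allocate_rewards(scores, pool=100.0):
    """Pay each miner in proportion to its model's benchmark score.
    A fixed reward pool makes the tournament zero-sum, which is what
    pushes miners to explore unusual hyperparameter configurations."""
    total = sum(scores.values())
    return {miner: pool * s / total for miner, s in scores.items()}

print(allocate_rewards({"miner_a": 0.81, "miner_b": 0.92, "miner_c": 0.77}))
```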

[195] ChatCLIDS: Simulating Persuasive AI Dialogues to Promote Closed-Loop Insulin Adoption in Type 1 Diabetes Care

Zonghai Yao, Talha Chafekar, Junda Wang, Shuo Han, Feiyun Ouyang, Junhui Qian, Lingxi Li, Hong Yu

Main category: cs.AI

TL;DR: ChatCLIDS is the first benchmark to evaluate LLM-driven persuasive dialogue for health behavior change, featuring expert-validated virtual patients with realistic adoption barriers and supporting longitudinal counseling scenarios.

DetailsMotivation: Real-world adoption of closed-loop insulin delivery systems remains low due to behavioral, psychosocial, and social barriers rather than technical failures, highlighting the need for effective persuasive AI in healthcare.

Method: The framework uses a library of expert-validated virtual patients with clinically grounded profiles, simulating multi-turn interactions with nurse agents equipped with evidence-based persuasive strategies, including longitudinal counseling and adversarial social influence scenarios.

Result: Larger and more reflective LLMs adapt strategies over time, but all models struggle to overcome resistance, particularly under realistic social pressure, revealing critical limitations of current LLMs for behavior change.

Conclusion: ChatCLIDS provides a high-fidelity, scalable testbed for advancing trustworthy persuasive AI in healthcare, demonstrating that current LLMs have significant limitations in overcoming real-world adoption barriers for medical technologies.

Abstract: Real-world adoption of closed-loop insulin delivery systems (CLIDS) in type 1 diabetes remains low, driven not by technical failure, but by diverse behavioral, psychosocial, and social barriers. We introduce ChatCLIDS, the first benchmark to rigorously evaluate LLM-driven persuasive dialogue for health behavior change. Our framework features a library of expert-validated virtual patients, each with clinically grounded, heterogeneous profiles and realistic adoption barriers, and simulates multi-turn interactions with nurse agents equipped with a diverse set of evidence-based persuasive strategies. ChatCLIDS uniquely supports longitudinal counseling and adversarial social influence scenarios, enabling robust, multi-dimensional evaluation. Our findings reveal that while larger and more reflective LLMs adapt strategies over time, all models struggle to overcome resistance, especially under realistic social pressure. These results highlight critical limitations of current LLMs for behavior change, and offer a high-fidelity, scalable testbed for advancing trustworthy persuasive AI in healthcare and beyond.

[196] Deep Research Agents: A Systematic Examination And Roadmap

Yuxuan Huang, Yihang Chen, Haozheng Zhang, Kang Li, Huichi Zhou, Meng Fang, Linyi Yang, Xiaoguang Li, Lifeng Shang, Songcen Xu, Jianye Hao, Kun Shao, Jun Wang

Main category: cs.AI

TL;DR: Analysis of Deep Research agents - autonomous AI systems that perform complex multi-turn research tasks using dynamic reasoning, planning, information retrieval, and tool integration.

DetailsMotivation: The rapid advancement of Large Language Models has enabled the development of Deep Research agents capable of tackling complex informational research tasks, necessitating a systematic analysis of their foundational technologies and architectures.

Method: Detailed analysis of information acquisition strategies (API vs browser-based), modular tool-use frameworks, code execution, multimodal processing, and Model Context Protocols. Proposes taxonomy differentiating static vs dynamic workflows and classifies architectures by planning strategies and agent composition.

Result: Provides critical evaluation of current benchmarks, highlighting limitations like restricted external knowledge access, sequential execution inefficiencies, and metric misalignment with practical objectives.

Conclusion: Outlines open challenges and future research directions for Deep Research agents, with a curated repository available for ongoing research updates.

Abstract: The rapid progress of Large Language Models (LLMs) has given rise to a new category of autonomous AI systems, referred to as Deep Research (DR) agents. These agents are designed to tackle complex, multi-turn informational research tasks by leveraging a combination of dynamic reasoning, adaptive long-horizon planning, multi-hop information retrieval, iterative tool use, and the generation of structured analytical reports. In this paper, we conduct a detailed analysis of the foundational technologies and architectural components that constitute Deep Research agents. We begin by reviewing information acquisition strategies, contrasting API-based retrieval methods with browser-based exploration. We then examine modular tool-use frameworks, including code execution, multimodal input processing, and the integration of Model Context Protocols (MCPs) to support extensibility and ecosystem development. To systematize existing approaches, we propose a taxonomy that differentiates between static and dynamic workflows, and we classify agent architectures based on planning strategies and agent composition, including single-agent and multi-agent configurations. We also provide a critical evaluation of current benchmarks, highlighting key limitations such as restricted access to external knowledge, sequential execution inefficiencies, and misalignment between evaluation metrics and the practical objectives of DR agents. Finally, we outline open challenges and promising directions for future research. A curated and continuously updated repository of DR agent research is available at: {https://github.com/ai-agents-2030/awesome-deep-research-agent}.

[197] ChordPrompt: Orchestrating Cross-Modal Prompt Synergy for Multi-Domain Incremental Learning in CLIP

Zhiyuan Wang, Bokui Chen

Main category: cs.AI

TL;DR: ChordPrompt framework enables cross-modal prompt learning for multi-domain continual learning in vision-language models, addressing limitations of existing single-modal approaches.

DetailsMotivation: Existing prompt learning methods focus on class-incremental scenarios and use single-modal prompts, neglecting multi-domain adaptation and cross-modal information exchange benefits.

Method: Proposes ChordPrompt with cross-modal prompts for visual-textual interaction and domain-adaptive text prompts for multi-domain continual adaptation.

Result: Outperforms state-of-the-art methods in zero-shot generalization and downstream task performance on multi-domain incremental learning benchmarks.

Conclusion: Cross-modal prompt learning with domain adaptation effectively enhances continual learning capabilities in vision-language models across multiple domains.

Abstract: Continual learning (CL) empowers pre-trained vision-language models to adapt effectively to novel or previously underrepresented data distributions without comprehensive retraining, enhancing their adaptability and efficiency. While vision-language models like CLIP show great promise, they struggle to maintain performance across domains in incremental learning scenarios. Existing prompt learning methods face two main limitations: 1) they primarily focus on class-incremental learning scenarios, lacking specific strategies for multi-domain task incremental learning; 2) most current approaches employ single-modal prompts, neglecting the potential benefits of cross-modal information exchange. To address these challenges, we propose the ChordPrompt framework, which facilitates a harmonious interplay between visual and textual prompts. ChordPrompt introduces cross-modal prompts to leverage interactions between visual and textual information. Our approach also employs domain-adaptive text prompts to select appropriate prompts for continual adaptation across multiple domains. Comprehensive experiments on multi-domain incremental learning benchmarks demonstrate that ChordPrompt outperforms state-of-the-art methods in zero-shot generalization and downstream task performance.

[198] Symbiotic Agents: A Novel Paradigm for Trustworthy AGI-driven Networks

Ilias Chatzistefanidis, Navid Nikaein

Main category: cs.AI

TL;DR: Symbiotic agents combining LLMs with real-time optimizers achieve 5x lower decision errors than standalone LLM agents, with 99.9% GPU reduction using smaller models, and 44% RAN over-utilization reduction.

DetailsMotivation: Transition from specialized AI to AGI-driven 6G networks requires trustworthy AI agents that can handle real-time decision-making for network management and service provisioning with both reasoning capabilities and numerical precision.

Method: Novel agentic paradigm combining LLMs with real-time optimization algorithms - input-level optimizers provide bounded uncertainty steering, output-level optimizers enable adaptive real-time control. Two agent types: RAN optimizers and multi-agent negotiators for SLAs.

Result: Symbiotic agents reduce decision errors fivefold compared to standalone LLM agents. Smaller language models achieve similar accuracy with 99.9% GPU reduction and 82 ms near-real-time loops. Multi-agent demonstration shows 44% reduction in RAN over-utilization.

Conclusion: Symbiotic paradigm provides foundation for next-generation AGI-driven networks that remain adaptable, efficient and trustworthy as LLMs advance, enabling the transition to AGI-driven 6G networks.

Abstract: Large Language Model (LLM)-based autonomous agents are expected to play a vital role in the evolution of 6G networks, by empowering real-time decision-making related to management and service provisioning to end-users. This shift facilitates the transition from a specialized intelligence approach, where artificial intelligence (AI) algorithms handle isolated tasks, to artificial general intelligence (AGI)-driven networks, where agents possess broader reasoning capabilities and can manage diverse network functions. In this paper, we introduce a novel agentic paradigm that combines LLMs with real-time optimization algorithms towards Trustworthy AI, defined as symbiotic agents. Optimizers at the LLM’s input-level provide bounded uncertainty steering for numerically precise tasks, whereas output-level optimizers supervised by the LLM enable adaptive real-time control. We design and implement two novel agent types: (i) Radio Access Network optimizers, and (ii) multi-agent negotiators for Service-Level Agreements (SLAs). We further propose an end-to-end architecture for AGI networks and evaluate it on a 5G testbed capturing channel fluctuations from moving vehicles. Results show that symbiotic agents reduce decision errors fivefold compared to standalone LLM-based agents, while smaller language models (SLM) achieve similar accuracy with a 99.9% reduction in GPU resource overhead, operating in near-real-time loops of 82 ms. A multi-agent demonstration for collaborative RAN on the real-world testbed highlights significant flexibility in service-level agreement and resource allocation, reducing RAN over-utilization by approximately 44%. Drawing on our findings and open-source implementations, we introduce the symbiotic paradigm as the foundation for next-generation, AGI-driven networks: systems designed to remain adaptable, efficient, and trustworthy even as LLMs advance.
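
The core input/output-optimizer pattern is simple to sketch. Below is a minimal, hypothetical illustration of the idea, assuming a stub `llm_propose` call and an invented `optimizer_bounds` rule; the paper's actual optimizers and network parameters are not shown.

```python
import random

def optimizer_bounds(channel_gain: float) -> tuple[float, float]:
    """Hypothetical input-level optimizer: derive a certified range for a
    numeric decision (here, transmit power) from a measured channel gain."""
    center = 10.0 * channel_gain
    return center - 2.0, center + 2.0

def llm_propose(prompt: str) -> float:
    """Stub standing in for an LLM call that returns a numeric proposal."""
    return random.uniform(0.0, 30.0)

def symbiotic_decision(channel_gain: float) -> float:
    lo, hi = optimizer_bounds(channel_gain)
    proposal = llm_propose(f"Set transmit power within [{lo:.1f}, {hi:.1f}] dBm")
    # Output-level safeguard: project the LLM's free-form proposal into the
    # optimizer-certified interval, keeping numerical errors bounded.
    return min(max(proposal, lo), hi)

print(symbiotic_decision(channel_gain=1.2))
```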

[199] Integrating Activity Predictions in Knowledge Graphs

Forrest Hare, Alec Sculley, Cameron Stockton

Main category: cs.AI

TL;DR: Using ontology-structured knowledge graphs with BFO and CCO frameworks to organize vessel movement data, create Markov chain models for future event prediction, and integrate probability calculations back into the knowledge graph.

DetailsMotivation: To demonstrate how ontology-structured knowledge graphs can generate predictions about future events by leveraging semantic frameworks and addressing limitations in current ontological models of probability.

Method: Organize fishing vessel movement data in knowledge graphs using BFO and CCO, retrieve query results to create Markov chain models, introduce ‘spatiotemporal instant’ concept, and propose alternative probability model about actual process profiles.

Result: Developed a framework that enables prediction of future vessel states based on historical data, with probability calculations that can be integrated back into the knowledge graph for further analysis.

Conclusion: Ontology-structured knowledge graphs with proper semantic frameworks provide effective means for future event prediction and decision-making, offering improved modeling of real-world dynamics through alternative probability interpretations.

Abstract: We argue that ontology-structured knowledge graphs can play a crucial role in generating predictions about future events. By leveraging the semantic framework provided by Basic Formal Ontology (BFO) and Common Core Ontologies (CCO), we demonstrate how data such as the movements of a fishing vessel can be organized in and retrieved from a knowledge graph. These query results are then used to create Markov chain models, allowing us to predict future states based on the vessel’s history. To fully support this process, we introduce the term ‘spatiotemporal instant’ to complete the necessary structural semantics. Additionally, we critique the prevailing ontological model of probability, according to which probabilities are about the future. We propose an alternative view, where at least some probabilities are treated as being about actual process profiles, which better captures the dynamics of real-world phenomena. Finally, we demonstrate how our Markov chain-based probability calculations can be seamlessly integrated back into the knowledge graph, enabling further analysis and decision-making.
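
The Markov chain step lends itself to a short sketch. The following is a minimal illustration, assuming query results have already been discretized into an ordered list of vessel states; the state names and sequence are invented:

```python
from collections import Counter, defaultdict

# Hypothetical sequence of discretized vessel states retrieved from the
# knowledge graph, ordered by spatiotemporal instant.
states = ["port", "transit", "fishing", "fishing", "transit", "port",
          "transit", "fishing", "transit", "port"]

# Count observed transitions, then normalize each row into probabilities.
counts = defaultdict(Counter)
for prev, nxt in zip(states, states[1:]):
    counts[prev][nxt] += 1

transition = {
    s: {t: c / sum(nxts.values()) for t, c in nxts.items()}
    for s, nxts in counts.items()
}

# One-step prediction: distribution over the vessel's next state.
print(transition["fishing"])   # {'fishing': 0.33..., 'transit': 0.66...}
```

These transition probabilities are exactly the kind of quantity the paper proposes writing back into the knowledge graph for downstream analysis.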

[200] KIRETT: Knowledge-Graph-Based Smart Treatment Assistant for Intelligent Rescue Operations

Mubaris Nadeem, Johannes Zenkert, Lisa Bender, Christian Weber, Madjid Fathi

Main category: cs.AI

TL;DR: A knowledge graph system for emergency medical response that provides AI-powered treatment recommendations to first responders based on real-time vital data analysis.

DetailsMotivation: The increasing need for rescue operations and time-sensitive emergency situations require first responders to have immediate access to processed medical knowledge and AI-assisted recommendations to provide optimal care.

Method: Developed a knowledge graph as central knowledge representation that enables intelligent treatment recommendations with AI-based situation pre-recognition, processing freshly recorded vital data in emergency scenarios.

Result: The system provides first responders with innovative knowledge management that assists in making treatment decisions by analyzing real-time patient data and medical knowledge.

Conclusion: The knowledge graph approach enables improved emergency medical treatments by making on-the-spot calculated and processed knowledge available to first responders, enhancing their ability to provide personalized and optimized healthcare in time-critical situations.

Abstract: Over the years, the need for rescue operations throughout the world has increased rapidly. Demographic changes and the resulting risk of injury or health disorders form the basis for emergency calls. In such scenarios, first responders are in a rush to reach the patient in need, provide first aid, and save lives. In these situations, they must be able to provide personalized and optimized healthcare in the shortest possible time and estimate the patient's condition with the help of freshly recorded vital data in an emergency situation. However, in such a time-dependent situation, first responders and medical experts cannot fully draw on their knowledge and need assistance and recommendations for further medical treatments. To achieve this, knowledge that is calculated, evaluated, and processed on the spot must be made available to improve treatments by first responders. The Knowledge Graph presented in this article as a central knowledge representation provides first responders with innovative knowledge management that enables intelligent treatment recommendations with an artificial intelligence-based pre-recognition of the situation.

[201] Search-Based Credit Assignment for Offline Preference-Based Reinforcement Learning

Xiancheng Gao, Yufeng Shi, Wengang Zhou, Houqiang Li

Main category: cs.AI

TL;DR: SPW unifies expert demonstrations and human preferences for offline RL by using similarity search to assign stepwise importance weights, enabling better credit assignment than traditional methods.

DetailsMotivation: Offline RL typically requires well-defined reward functions that are expensive to design. Human feedback alternatives (demonstrations and preferences) have complementary limitations - demonstrations are costly and limited, while preferences lack clear credit assignment.

Method: Search-Based Preference Weighting (SPW) searches for similar state-action pairs from expert demonstrations for each transition in preference-labeled trajectories, then derives stepwise importance weights based on similarity scores to guide preference learning.

Result: SPW enables effective joint learning from both preferences and demonstrations, outperforming prior methods that use both feedback types on challenging robot manipulation tasks.

Conclusion: SPW successfully unifies two forms of human feedback by solving the credit assignment problem in preference learning through similarity-based weighting from demonstrations.

Abstract: Offline reinforcement learning refers to the process of learning policies from fixed datasets, without requiring additional environment interaction. However, it often relies on well-defined reward functions, which are difficult and expensive to design. Human feedback is an appealing alternative, but its two common forms, expert demonstrations and preferences, have complementary limitations. Demonstrations provide stepwise supervision, but they are costly to collect and often reflect limited expert behavior modes. In contrast, preferences are easier to collect, but it is unclear which parts of a behavior contribute most to a trajectory segment, leaving credit assignment unresolved. In this paper, we introduce a Search-Based Preference Weighting (SPW) scheme to unify these two feedback sources. For each transition in a preference labeled trajectory, SPW searches for the most similar state-action pairs from expert demonstrations and directly derives stepwise importance weights based on their similarity scores. These weights are then used to guide standard preference learning, enabling more accurate credit assignment that traditional approaches struggle to achieve. We demonstrate that SPW enables effective joint learning from preferences and demonstrations, outperforming prior methods that leverage both feedback types on challenging robot manipulation tasks.

[202] CoreThink: A Symbolic Reasoning Layer to reason over Long Horizon Tasks with LLMs

Jay Vaghasiya, Omkar Ghugarkar, Vishvesh Bhat, Vipul Dholaria, Julian McAuley

Main category: cs.AI

TL;DR: CoreThink introduces a novel General Symbolics reasoning method that achieves state-of-the-art performance on multiple benchmarks without fine-tuning or training costs, providing pure performance uplift for reasoning tasks.

DetailsMotivation: Existing reasoning methods like test-time scaling, SFT, and RLVR are expected to reach diminishing returns in LLM performance, necessitating new reasoning techniques that don't negatively impact model accuracy.

Method: General Symbolic Reasoner (GSR) built around three key use cases: tool-calling, code generation, and planning, using a novel General Symbolics approach that diverges from traditional reasoning paradigms.

Result: Achieves SOTA scores: 66.66% on Livecodebench v6, 89% on Instruction-Following Evals, 24.4% on ARC-AGI-2, and 62.3% on SWE-Bench Lite with agentic coding IDE. All improvements achieved without fine-tuning or training costs.

Conclusion: CoreThink’s Reasoning Layer provides pure performance uplift for reasoning-intensive tasks and demonstrates that new reasoning techniques are needed to overcome diminishing returns of incumbent methods.

Abstract: We introduce CoreThink, a state-of-the-art Reasoning Layer built upon a novel reasoning method called General Symbolics. This approach diverges from reasoning paradigms such as test-time scaling, Supervised Fine-Tuning (SFT), and Reinforcement Learning with Verifiable Rewards (RLVR). CoreThink General Symbolic Reasoner (GSR) is specifically structured around three key use cases: tool-calling, code generation, and planning, demonstrating exemplary performance across a total of seven benchmarks in their respective areas. Notably, we are achieving SOTA scores of 66.66% on Livecodebench v6, 89% on Instruction-Following Evals, and 24.4% on ARC-AGI-2. We also present an agentic coding IDE, developed using the principles of General Symbolics, which achieves a state-of-the-art accuracy of 62.3% on SWE-Bench Lite. We are able to achieve these improvements without any fine-tuning or training costs. Our Reasoning Layer is designed to provide a pure performance uplift, ensuring that a model’s accuracy on reasoning tasks is never negatively impacted. We argue that incumbent methods will eventually lead to diminishing returns in LLM performance, necessitating the development of new reasoning techniques. This technical report details our approach at a high level and the availability of the CoreThink models for reasoning-intensive use cases.

cs.SD

[203] Analysis of Speaker Verification Performance Trade-offs with Neural Audio Codec Transmission

Nirmalya Mallick Thakur, Jia Qi Yip, Eng Siong Chng

Main category: cs.SD

TL;DR: Neural audio codecs degrade speaker verification performance at lower bitrates but outperform Opus at <12 kbps and are only marginally worse at ~24 kbps, making them feasible alternatives especially under bandwidth constraints.

DetailsMotivation: To investigate how neural audio codecs (NACs) impact speaker verification performance compared to traditional codecs at varying bitrates, as NACs are rapidly being adopted but may introduce audio distortions.

Method: Evaluated three state-of-the-art SV models on VoxCeleb1 dataset using both traditional and neural audio codecs at different bitrates, comparing performance degradation patterns.

Result: NACs outperform Opus by 6-8% at low bitrates (<12 kbps) and remain only marginally behind at higher bitrates (~24 kbps) with EER increase of 0.4-0.7%. Performance degrades consistently across all models as bitrates decrease.

Conclusion: NACs are a feasible alternative to traditional codecs, especially under bandwidth limitations. Future work should focus on developing speaker-aware NACs or adapting SV models to bridge the performance gap at higher bitrates.

Abstract: Neural audio codecs (NACs) have made significant advancements in recent years and are rapidly being adopted in many audio processing pipelines. However, they can introduce audio distortions which degrade speaker verification (SV) performance. This study investigates the impact of both traditional and neural audio codecs at varying bitrates on three state-of-the-art SV models evaluated on the VoxCeleb1 dataset. Our findings reveal a consistent degradation in SV performance across all models and codecs as bitrates decrease. Notably, NACs do not fundamentally break SV performance when compared to traditional codecs. They outperform Opus by 6-8% at low bitrates (<12 kbps) and remain marginally behind at higher bitrates (approximately 24 kbps), with an EER increase of only 0.4-0.7%. The disparity at higher bitrates is likely due to the primary optimization of NACs for perceptual quality, which can inadvertently discard critical speaker-discriminative features, unlike Opus, which was designed to preserve vocal characteristics. Our investigation suggests that NACs are a feasible alternative to traditional codecs, especially under bandwidth limitations. To bridge the gap at higher bitrates, future work should focus on developing speaker-aware NACs or retraining and adapting SV models.
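
For reference, the EER metric reported throughout can be computed with a simple threshold sweep. A minimal sketch on toy scores, not the paper's evaluation code:

```python
import numpy as np

def equal_error_rate(scores, labels):
    """EER as reported for speaker verification: the error rate at the
    threshold where false acceptances and false rejections balance."""
    best_gap, eer = np.inf, 1.0
    for t in np.unique(scores):
        accept = scores >= t
        far = accept[labels == 0].mean()       # impostor trials accepted
        frr = (~accept)[labels == 1].mean()    # genuine trials rejected
        if abs(far - frr) < best_gap:
            best_gap, eer = abs(far - frr), (far + frr) / 2
    return eer

scores = np.array([0.92, 0.81, 0.44, 0.38, 0.12, 0.55])  # toy trial scores
labels = np.array([1, 1, 1, 0, 0, 0])                    # 1 = same speaker
print(f"EER = {equal_error_rate(scores, labels):.3f}")
```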

[204] Speech DF Arena: A Leaderboard for Speech DeepFake Detection Models

Sandipana Dowerah, Atharva Kulkarni, Ajinkya Kulkarni, Hoan My Tran, Joonas Kalda, Artem Fedorchenko, Benoit Fauve, Damien Lolive, Tanel Alumäe, Matthew Magimai Doss

Main category: cs.SD

TL;DR: Speech DF Arena is the first comprehensive benchmark for audio deepfake detection, providing standardized evaluation across 14 datasets, metrics, and protocols with a public leaderboard to compare detection systems.

DetailsMotivation: While deepfake audio generation and detection have advanced significantly, there is a lack of standardized and comprehensive benchmarking tools to evaluate detection systems uniformly across diverse scenarios.

Method: Developed a comprehensive benchmark toolkit that evaluates detection systems across 14 diverse datasets and attack scenarios using standardized metrics and protocols. Included 12 open-source and 3 proprietary detection systems for comparison.

Result: Many detection systems exhibited high Equal Error Rate (EER) in out-of-domain scenarios, revealing significant performance degradation when tested on data different from their training domains.

Conclusion: The study highlights the critical need for extensive cross-domain evaluation in audio deepfake detection and provides Speech DF Arena as a standardized benchmark to help researchers improve system reliability and robustness.

Abstract: Parallel to the development of advanced deepfake audio generation, audio deepfake detection has also seen significant progress. However, a standardized and comprehensive benchmark is still missing. To address this, we introduce Speech DeepFake (DF) Arena, the first comprehensive benchmark for audio deepfake detection. Speech DF Arena provides a toolkit to uniformly evaluate detection systems, currently across 14 diverse datasets and attack scenarios, with standardized evaluation metrics and protocols for reproducibility and transparency. It also includes a leaderboard to compare and rank the systems, helping researchers and developers enhance their reliability and robustness. We include 14 evaluation sets, 12 state-of-the-art open-source and 3 proprietary detection systems. Our study shows that many systems exhibit high EER in out-of-domain scenarios, highlighting the need for extensive cross-domain evaluation. The leaderboard is hosted on Hugging Face, and a toolkit for reproducing results across the listed datasets is available on GitHub.

[205] Multi-level SSL Feature Gating for Audio Deepfake Detection

Hoan My Tran, Damien Lolive, Aghilas Sini, Arnaud Delhay, Pierre-François Marteau, David Guennec

Main category: cs.SD

TL;DR: Proposed gating mechanism with XLS-R front-end and Multi-kernel gated Convolution back-end, enhanced by Centered Kernel Alignment for feature diversity, achieving state-of-the-art deepfake speech detection with robust cross-domain generalization.

DetailsMotivation: Address limitations in current spoofing detection countermeasures, particularly poor generalization to unseen deepfake attacks and multilingual scenarios, as highly realistic synthetic speech poses risks for fraud and security threats.

Method: Gating mechanism extracts features from XLS-R foundation model as front-end, Multi-kernel gated Convolution captures local/global artifacts, and Centered Kernel Alignment enforces feature diversity across layers.

Result: Achieves state-of-the-art performance on in-domain benchmarks while demonstrating robust generalization to out-of-domain datasets including multilingual speech samples.

Conclusion: The integrated approach provides a versatile solution for detecting evolving speech deepfake threats with strong cross-domain performance and multilingual capability.

Abstract: Recent advancements in generative AI, particularly in speech synthesis, have enabled the generation of highly natural-sounding synthetic speech that closely mimics human voices. While these innovations hold promise for applications like assistive technologies, they also pose significant risks, including misuse for fraudulent activities, identity theft, and security threats. Current research on spoofing detection countermeasures remains limited in generalization to unseen deepfake attacks and languages. To address this, we propose a gating mechanism that extracts relevant features from the XLS-R speech foundation model as a front-end feature extractor. For the downstream back-end classifier, we employ Multi-kernel gated Convolution (MultiConv) to capture both local and global speech artifacts. Additionally, we introduce Centered Kernel Alignment (CKA) as a similarity metric to enforce diversity in learned features across different MultiConv layers. By integrating CKA with our gating mechanism, we hypothesize that each component helps improve the learning of distinct synthetic speech patterns. Experimental results demonstrate that our approach achieves state-of-the-art performance on in-domain benchmarks while generalizing robustly to out-of-domain datasets, including multilingual speech samples. This underscores its potential as a versatile solution for detecting evolving speech deepfake threats.
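
Linear CKA, the similarity metric behind the diversity constraint, has a compact closed form. A minimal sketch on random features; the diversity objective itself, which would penalize high CKA between MultiConv layers, is not shown:

```python
import numpy as np

def linear_cka(X, Y):
    """Linear Centered Kernel Alignment between two feature matrices
    (n samples x d features), used here to compare layer representations."""
    X = X - X.mean(axis=0)                    # center each feature
    Y = Y - Y.mean(axis=0)
    hsic = np.linalg.norm(Y.T @ X, "fro") ** 2
    norm_x = np.linalg.norm(X.T @ X, "fro")
    norm_y = np.linalg.norm(Y.T @ Y, "fro")
    return hsic / (norm_x * norm_y)

# Identical representations give CKA = 1; enforcing diversity means driving
# CKA between different layers' features toward lower values.
A = np.random.randn(64, 32)
print(linear_cka(A, A))                        # ~1.0
print(linear_cka(A, np.random.randn(64, 32)))  # well below 1
```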

[206] I2TTS: Image-indicated Immersive Text-to-speech Synthesis with Spatial Perception

Jiawei Zhang, Tian-Hao Zhang, Jun Wang, Jiaran Gao, Xinyuan Qian, Xu-Cheng Yin

Main category: cs.SD

TL;DR: A novel multi-modal TTS approach called I2TTS that integrates visual scene prompts and reverberation refinement to create spatially-aware immersive speech synthesis for gaming and VR applications.

DetailsMotivation: Previous TTS systems focus on naturalness but overlook spatial perception needed for immersive experiences in gaming and virtual reality environments.

Method: Introduces scene prompt encoder to integrate visual scene prompts into synthesis pipeline, plus reverberation classification and refinement technique to adjust mel-spectrograms for accurate spatial matching.

Result: Achieves high-quality scene and spatial matching without compromising speech naturalness, demonstrating significant advancement in context-aware speech synthesis.

Conclusion: I2TTS successfully bridges the gap between traditional TTS and spatial perception requirements, enabling immersive audio experiences that match visual scenes accurately.

Abstract: Controlling the style and characteristics of speech synthesis is crucial for adapting the output to specific contexts and user requirements. Previous Text-to-speech (TTS) works have focused primarily on the technical aspects of producing natural-sounding speech, such as intonation, rhythm, and clarity. However, they overlook the fact that there is a growing emphasis on spatial perception of synthesized speech, which may provide an immersive experience in gaming and virtual reality. To solve this issue, in this paper, we present a novel multi-modal TTS approach, namely Image-indicated Immersive Text-to-speech Synthesis (I2TTS). Specifically, we introduce a scene prompt encoder that integrates visual scene prompts directly into the synthesis pipeline to control the speech generation process. Additionally, we propose a reverberation classification and refinement technique that adjusts the synthesized mel-spectrogram to enhance the immersive experience, ensuring that the involved reverberation condition matches the scene accurately. Experimental results demonstrate that our model achieves high-quality scene and spatial matching without compromising speech naturalness, marking a significant advancement in the field of context-aware speech synthesis. Project demo page: https://spatialTTS.github.io/

[207] You Sound a Little Tense: L2 Tailored Clear TTS Using Durational Vowel Properties

Paige Tuttösí, H. Henny Yeung, Yue Wang, Jean-Julien Aucouturier, Angelica Lim

Main category: cs.SD

TL;DR: A TTS system with clarity mode using duration differences between tense/lax vowels improves intelligibility for L2 English speakers by 9.15% fewer errors, though listeners mistakenly believe overall slowed speech is better.

DetailsMotivation: To create a text-to-speech system specifically designed for second language (L2) speakers to improve their comprehension of English, addressing the challenge that current TTS systems and ASR models don't effectively serve L2 learners' needs.

Method: Used duration differences between American English tense (longer) and lax (shorter) vowels to create a “clarity mode” for Matcha-TTS, and conducted perception studies with French-L1 English-L2 listeners comparing clarity mode vs overall slowed speech.

Result: French-L1 English-L2 listeners had 9.15% fewer transcription errors with clarity mode, found it more encouraging and respectful than overall slowed speech, but were unaware of these benefits, mistakenly believing overall slowed speech was more intelligible.

Conclusion: Actual intelligibility doesn’t correlate with perceived intelligibility for L2 speakers, and Whisper-ASR doesn’t use the same cues as L2 speakers, making it insufficient for assessing TTS intelligibility for this population.

Abstract: We present the first text-to-speech (TTS) system tailored to second language (L2) speakers. We use duration differences between American English tense (longer) and lax (shorter) vowels to create a “clarity mode” for Matcha-TTS. Our perception studies showed that French-L1, English-L2 listeners had fewer (at least 9.15%) transcription errors when using our clarity mode, and found it more encouraging and respectful than overall slowed-down speech. Remarkably, listeners were not aware of these effects: despite the decreased word error rate in clarity mode, listeners still believed that slowing all target words was the most intelligible, suggesting that actual intelligibility does not correlate with perceived intelligibility. Additionally, we found that Whisper-ASR did not use the same cues as L2 speakers to differentiate difficult vowels and is not sufficient to assess the intelligibility of TTS systems for these individuals.

[208] OSUM-EChat: Enhancing End-to-End Empathetic Spoken Chatbot via Understanding-Driven Spoken Dialogue

Xuelong Geng, Qijie Shao, Hongfei Xue, Shuiyuan Wang, Hanke Xie, Zhao Guo, Yi Zhao, Guojian Li, Wenjie Tian, Chengyou Wang, Zhixian Zhao, Kangxiang Xia, Ziyu Zhang, Zhennan Lin, Tianlun Zuo, Mingchen Shao, Yuang Cao, Guobin Ma, Longhao Li, Yuhang Dai, Dehui Gao, Dake Guo, Lei Xie

Main category: cs.SD

TL;DR: OSUM-EChat is an open-source end-to-end spoken dialogue system that enhances empathetic interactions through a three-stage training strategy and dual thinking mechanism, with new dataset and evaluation benchmark.

DetailsMotivation: Address challenges in empathetic spoken dialogue systems including over-reliance on large datasets, insufficient paralinguistic cue extraction, and lack of empathy-specific datasets and evaluation frameworks.

Method: Three-stage understanding-driven training strategy extending speech understanding models to dialogue tasks, plus linguistic-paralinguistic dual thinking mechanism integrating paralinguistic understanding through chain of thought with dialogue generation.

Result: OSUM-EChat outperforms end-to-end spoken dialogue models in empathetic responsiveness, demonstrating effectiveness while reducing reliance on large-scale dialogue datasets.

Conclusion: The proposed system successfully enhances empathetic capabilities in spoken dialogue through innovative training strategies and dual thinking mechanisms, validated by comprehensive evaluation framework.

Abstract: Empathy is crucial in enabling natural interactions within spoken dialogue systems, allowing machines to recognize and respond appropriately to paralinguistic cues such as age, gender, and emotion. Recent advancements in end-to-end speech language models, which unify speech understanding and generation, provide promising solutions. However, several challenges persist, including an over-reliance on large-scale dialogue datasets, insufficient extraction of paralinguistic cues vital for conveying empathy, and the lack of empathy-specific datasets and evaluation frameworks. To address these issues, we introduce OSUM-EChat, an open-source, end-to-end spoken dialogue system designed to enhance empathetic interactions, particularly in resource-limited settings. OSUM-EChat introduces two key innovations: (1) a three-stage understanding-driven spoken dialogue training strategy that extends the capabilities of a large speech understanding model to spoken dialogue tasks, and (2) a linguistic-paralinguistic dual thinking mechanism that integrates paralinguistic understanding through a chain of thought with dialogue generation, enabling the system to produce more empathetic responses. This approach reduces reliance on large-scale dialogue datasets while maintaining high-quality empathetic interactions. Additionally, we introduce the EChat-200K dataset, a rich corpus of empathetic speech-to-speech dialogues, and the EChat-eval benchmark, a comprehensive framework for evaluating the empathetic capabilities of dialogue systems. Experimental results demonstrate that OSUM-EChat outperforms end-to-end spoken dialogue models regarding empathetic responsiveness, validating its effectiveness.

cs.LG

[209] The Lifecycle Principle: Stabilizing Dynamic Neural Networks with State Memory

Zichuan Yang

Main category: cs.LG

TL;DR: Proposes Lifecycle (LC) principle with state memory to stabilize training when reviving long-term deactivated neurons, improving generalization and robustness.

DetailsMotivation: Address training instability from long-term neuron deactivation when neurons are revived with random weights, which causes destructive optimization shocks.

Method: Lifecycle principle with state memory - instead of re-initializing revived neurons, restores parameters to their last known effective state to preserve learned knowledge.

Result: Theoretical analysis shows LC smooths loss landscape towards flatter minima. Experiments on image classification benchmarks demonstrate improved generalization and robustness.

Conclusion: State memory is essential for achieving training stability and performance gains in long-term neuron deactivation regularization methods.

Abstract: I investigate a stronger form of regularization by deactivating neurons for extended periods, a departure from the temporary changes of methods like Dropout. However, this long-term dynamism introduces a critical challenge: severe training instability when neurons are revived with random weights. To solve this, I propose the Lifecycle (LC) principle, a regularization mechanism centered on a key innovation: state memory. Instead of re-initializing a revived neuron, my method restores its parameters to their last known effective state. This process preserves learned knowledge and avoids destructive optimization shocks. My theoretical analysis reveals that the LC principle smooths the loss landscape, guiding optimization towards flatter minima associated with better generalization. Experiments on image classification benchmarks demonstrate that my method improves generalization and robustness. Crucially, ablation studies confirm that state memory is essential for achieving these gains.
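
The state-memory mechanism is easy to illustrate. A toy sketch with a single "neuron" holding a weight vector; the class and its methods are invented for illustration and are not the paper's implementation:

```python
import numpy as np

class LifecycleNeuron:
    """Toy sketch of the LC principle: on revival, restore the last known
    effective parameters instead of re-initializing at random."""
    def __init__(self, dim):
        self.w = np.random.randn(dim) * 0.1
        self.active = True
        self._memory = None            # state memory for the deactivated phase

    def deactivate(self):
        self._memory = self.w.copy()   # remember the last effective state
        self.w = np.zeros_like(self.w)
        self.active = False

    def revive(self):
        # Restoring the remembered weights avoids the destructive
        # optimization shock that random re-initialization would cause.
        self.w = self._memory.copy()
        self.active = True

n = LifecycleNeuron(4)
n.deactivate()
n.revive()
print(n.w)   # parameters preserved across the deactivation period
```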

[210] Latent Variable Modeling in Multi-Agent Reinforcement Learning via Expectation-Maximization for UAV-Based Wildlife Protection

Mazyar Taghavi, Rahman Farnoosh

Main category: cs.LG

TL;DR: EM-based MARL approach for UAV coordination in wildlife protection, outperforms PPO and DDPG in leopard conservation scenarios

DetailsMotivation: Protecting endangered wildlife from illegal poaching in vast, partially observable environments requires real-time response and coordination under uncertainty

Method: Expectation-Maximization (EM) based latent variable modeling in Multi-Agent Reinforcement Learning (MARL) for UAV coordination, modeling hidden environmental factors and inter-agent dynamics

Result: Superior performance in detection accuracy, adaptability, and policy convergence compared to standard algorithms (PPO, DDPG) in simulations with 10 UAVs patrolling Iranian leopard habitats

Conclusion: Combining EM inference with MARL improves decentralized decision-making in complex conservation scenarios, with publicly available implementation on GitHub

Abstract: Protecting endangered wildlife from illegal poaching presents a critical challenge, particularly in vast and partially observable environments where real-time response is essential. This paper introduces a novel Expectation-Maximization (EM) based latent variable modeling approach in the context of Multi-Agent Reinforcement Learning (MARL) for Unmanned Aerial Vehicle (UAV) coordination in wildlife protection. By modeling hidden environmental factors and inter-agent dynamics through latent variables, our method enhances exploration and coordination under uncertainty. We implement and evaluate our EM-MARL framework using a custom simulation involving 10 UAVs tasked with patrolling protected habitats of the endangered Iranian leopard. Extensive experimental results demonstrate superior performance in detection accuracy, adaptability, and policy convergence when compared to standard algorithms such as Proximal Policy Optimization (PPO) and Deep Deterministic Policy Gradient (DDPG). Our findings underscore the potential of combining EM inference with MARL to improve decentralized decision-making in complex, high-stakes conservation scenarios. The full implementation, simulation environment, and training scripts are publicly available on GitHub.
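
As a minimal illustration of the EM component, the sketch below infers a binary hidden environmental factor from noisy scalar observations via a two-mode Gaussian EM loop. The observation model and numbers are invented, and the paper's full MARL integration is not shown:

```python
import numpy as np

# Hypothetical noisy sensor readings drawn from two latent regimes
# (e.g. "poaching activity absent" vs "present"), unit variance assumed.
obs = np.concatenate([np.random.normal(0, 1, 200),
                      np.random.normal(4, 1, 200)])
mu, pi = np.array([-1.0, 1.0]), np.array([0.5, 0.5])   # initial guesses

for _ in range(50):
    # E-step: posterior responsibility of each latent mode per observation
    lik = np.exp(-0.5 * (obs[:, None] - mu[None, :]) ** 2) * pi
    resp = lik / lik.sum(axis=1, keepdims=True)
    # M-step: re-estimate mode means and mixing weights
    mu = (resp * obs[:, None]).sum(axis=0) / resp.sum(axis=0)
    pi = resp.mean(axis=0)

print(mu, pi)   # means should approach the generating values 0 and 4
```

In the paper's setting, the inferred posterior over the latent factor would then condition each UAV's policy, rather than being printed.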

[211] Beyond Synthetic Augmentation: Group-Aware Threshold Calibration for Robust Balanced Accuracy in Imbalanced Learning

Hunter Gittlin

Main category: cs.LG

TL;DR: Group-aware threshold calibration (setting different decision thresholds for different demographic groups) outperforms synthetic data generation methods like SMOTE and CT-GAN for handling class imbalance, achieving 1.5-4% higher balanced accuracy while improving worst-group performance.

DetailsMotivation: Class imbalance remains a fundamental challenge in machine learning, with traditional solutions like synthetic data generation often creating as many problems as they solve. The paper aims to find a more robust and effective approach.

Method: The researchers propose group-aware threshold calibration, which sets different decision thresholds for different demographic groups rather than applying a single cutoff across all groups. They conducted extensive experiments across seven model families including linear, tree-based, instance-based, and boosting methods.

Result: Group-specific thresholds achieved 1.5-4% higher balanced accuracy than SMOTE and CT-GAN augmented models while improving worst-group balanced accuracy. The method optimizes the Pareto frontier between balanced accuracy and worst-group balanced accuracy. Applying group thresholds to synthetically augmented data yielded minimal additional benefit.

Conclusion: Group-aware threshold calibration offers a simpler, more interpretable, and more effective solution to class imbalance compared to synthetic data generation methods, providing superior robustness across multiple model types.

Abstract: Class imbalance remains a fundamental challenge in machine learning, with traditional solutions often creating as many problems as they solve. We demonstrate that group-aware threshold calibration, setting different decision thresholds for different demographic groups, provides superior robustness compared to synthetic data generation methods. Through extensive experiments, we show that group-specific thresholds achieve 1.5-4% higher balanced accuracy than SMOTE and CT-GAN augmented models while improving worst-group balanced accuracy. Unlike single-threshold approaches that apply one cutoff across all groups, our group-aware method optimizes the Pareto frontier between balanced accuracy and worst-group balanced accuracy, enabling fine-grained control over group-level performance. Critically, we find that applying group thresholds to synthetically augmented data yields minimal additional benefit, suggesting these approaches are fundamentally redundant. Our results span seven model families including linear, tree-based, instance-based, and boosting methods, confirming that group-aware threshold calibration offers a simpler, more interpretable, and more effective solution to class imbalance.
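
The calibration step itself is straightforward. A minimal sketch that sweeps a threshold grid per group and keeps the value maximizing that group's balanced accuracy, on synthetic data (not the paper's code):

```python
import numpy as np

def group_thresholds(scores, y, groups, grid=np.linspace(0.05, 0.95, 19)):
    """Pick a separate decision threshold per group by sweeping a grid and
    maximizing that group's balanced accuracy (mean of TPR and TNR)."""
    thresholds = {}
    for g in np.unique(groups):
        m = groups == g
        best_t, best_ba = 0.5, -1.0
        for t in grid:
            pred = scores[m] >= t
            tpr = pred[y[m] == 1].mean() if (y[m] == 1).any() else 0.0
            tnr = (~pred)[y[m] == 0].mean() if (y[m] == 0).any() else 0.0
            ba = (tpr + tnr) / 2
            if ba > best_ba:
                best_t, best_ba = t, ba
        thresholds[g] = best_t
    return thresholds

scores = np.random.rand(500)                     # toy classifier scores
y = (scores + np.random.normal(0, 0.2, 500) > 0.5).astype(int)
groups = np.random.choice(["A", "B"], 500)       # toy demographic groups
print(group_thresholds(scores, y, groups))
```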

[212] Preference Robustness for DPO with Applications to Public Health

Cheol Woo Kim, Shresth Verma, Mauricio Tec, Milind Tambe

Main category: cs.LG

TL;DR: DPO-PRO is a robust DPO-based fine-tuning method that uses distributionally robust optimization to handle uncertainty in human preferences for sequential resource allocation in public health, achieving better robustness to noisy preferences with lower inference costs.

DetailsMotivation: Address the challenge of aligning LLMs with complex, ambiguous human preferences in resource allocation problems with limited data availability, particularly in public health settings like maternal mobile health programs.

Method: Proposes DPO-PRO, a Direct Preference Optimization method enhanced with lightweight Distributionally Robust Optimization to account for uncertainty in preference distributions, making it less conservative than prior DRO-based DPO approaches.

Result: DPO-PRO consistently improves robustness to noisy preference signals compared to existing DPO variants and achieves comparable performance to self-reflection-based baselines while requiring significantly lower inference-time computational costs.

Conclusion: The method provides an effective and efficient approach for reward function design in complex sequential decision-making problems, particularly valuable for real-world public health applications with limited and noisy preference data.

Abstract: We study an LLM fine-tuning task for designing reward functions for sequential resource allocation problems in public health, guided by human preferences expressed in natural language. This setting presents a challenging testbed for alignment due to complex and ambiguous objectives and limited data availability. We propose DPO-PRO, a robust fine-tuning algorithm based on Direct Preference Optimization (DPO), which accounts for uncertainty in the preference distribution using a lightweight Distributionally Robust Optimization (DRO) formulation. Unlike prior DRO-based DPO methods, DPO-PRO is significantly less conservative. We evaluate DPO-PRO on a real-world maternal mobile health program operated by the non-profit organization ARMMAN, as well as on standard alignment benchmarks. Experimental results demonstrate that our method consistently improves robustness to noisy preference signals compared to existing DPO variants. Moreover, DPO-PRO achieves comparable performance to a prior self-reflection-based baseline for reward function design, while requiring significantly lower inference-time cost.
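
A rough sketch of the idea: compute standard per-pair DPO losses, then reweight toward the worst cases before averaging. The exponential tilting used here is one simple DRO instance chosen for illustration; the paper's lightweight DRO formulation may differ:

```python
import numpy as np

def dpo_losses(logratio_w, logratio_l, beta=0.1):
    """Per-pair DPO losses from policy/reference log-ratios of the
    preferred (w) and dispreferred (l) responses."""
    margin = beta * (logratio_w - logratio_l)
    return -np.log(1.0 / (1.0 + np.exp(-margin)))   # -log sigmoid(margin)

def dro_weighted_loss(losses, eta=1.0):
    """Illustrative DRO step: exponentially tilt sample weights toward
    high-loss (worst-case) preference pairs, then take a weighted average."""
    w = np.exp(losses / eta)
    w = w / w.sum()
    return float((w * losses).sum())

losses = dpo_losses(np.random.randn(8), np.random.randn(8))
print(dro_weighted_loss(losses))   # >= the plain mean whenever losses vary
```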

[213] Imitate Optimal Policy: Prevail and Induce Action Collapse in Policy Gradient

Zhongzhu Zhou, Yibo Yang, Ziyan Chen, Fengxiang Bie, Haojun Xia, Xiaoxia Wu, Robert Wu, Ben Athiwaratkun, Bernard Ghanem, Shuaiwen Leon Song

Main category: cs.LG

TL;DR: Policy gradient methods with neural networks exhibit Action Collapse - where state-action activations collapse to optimal action means and form simplex equiangular tight frame structures. The paper proposes ACPG method that fixes ETF structure in action selection layer to accelerate learning.

DetailsMotivation: While policy gradient methods using DNNs have been studied for convergence, their representational structures remain under-explored. The observation of Action Collapse phenomena suggests that optimal policies naturally form structured representations that could be leveraged to improve learning.

Method: Proposed Action Collapse Policy Gradient (ACPG) method that affixes a synthetic equiangular tight frame (ETF) as the action selection layer. This induces the policy DNN to produce ideal ETF configurations while maintaining optimal performance.

Result: Experiments across various OpenAI Gym environments show ACPG can be integrated into any discrete policy gradient method, leading to faster and more robust reward improvements compared to standard approaches.

Conclusion: The Action Collapse phenomenon reveals fundamental structural properties of optimal policy networks. By explicitly incorporating ETF structures through ACPG, policy gradient methods achieve improved learning efficiency and robustness across diverse environments.

Abstract: Policy gradient (PG) methods in reinforcement learning frequently utilize deep neural networks (DNNs) to learn a shared backbone of feature representations used to compute likelihoods in an action selection layer. Numerous studies have been conducted on the convergence and global optima of policy networks, but few have analyzed representational structures of those underlying networks. While training an optimal policy DNN, we observed that under certain constraints, a gentle structure resembling neural collapse, which we refer to as Action Collapse (AC), emerges. This suggests that 1) the state-action activations (i.e., last-layer features) sharing the same optimal actions collapse towards those optimal actions' respective mean activations; 2) the variability of activations sharing the same optimal actions converges to zero; 3) the weights of the action selection layer and the mean activations collapse to a simplex equiangular tight frame (ETF). Our early work showed those aforementioned constraints to be necessary for these observations. Since the collapsed ETF of optimal policy DNNs maximally separates the pair-wise angles of all actions in the state-action space, we naturally raise a question: can we learn an optimal policy using an ETF structure as a (fixed) target configuration in the action selection layer? Our analytical proof shows that learning activations with a fixed ETF as the action selection layer naturally leads to AC. We thus propose the Action Collapse Policy Gradient (ACPG) method, which accordingly affixes a synthetic ETF as our action selection layer. ACPG induces the policy DNN to produce such an ideal configuration in the action selection layer while remaining optimal. Our experiments across various OpenAI Gym environments demonstrate that our technique can be integrated into any discrete PG method and lead to favorable reward improvements more quickly and robustly.
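
The fixed target configuration is a standard simplex ETF, which can be constructed directly. A minimal sketch; the surrounding policy network and training loop are omitted:

```python
import numpy as np

def simplex_etf(num_actions: int, dim: int) -> np.ndarray:
    """Construct a simplex equiangular tight frame: K unit vectors in R^d
    (d >= K) with equal, maximally separated pairwise angles, usable as a
    fixed action selection layer."""
    K = num_actions
    U, _ = np.linalg.qr(np.random.randn(dim, K))  # orthonormal columns
    M = np.sqrt(K / (K - 1)) * (U @ (np.eye(K) - np.ones((K, K)) / K))
    return M                                       # columns = action vectors

E = simplex_etf(num_actions=4, dim=16)
# Pairwise cosines collapse to -1/(K-1) = -1/3, the ETF signature.
print(np.round(E.T @ E, 3))
```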

[214] Population-aware Online Mirror Descent for Mean-Field Games with Common Noise by Deep Reinforcement Learning

Zida Wu, Mathieu Lauriere, Matthieu Geist, Olivier Pietquin, Ankur Mehta

Main category: cs.LG

TL;DR: A novel deep reinforcement learning algorithm for learning Nash equilibria in Mean Field Games without averaging or historical sampling, showing superior convergence and robustness to initial distributions and common noise.

DetailsMotivation: Learning Nash equilibria in Mean Field Games is challenging when initial distributions are unknown or populations are subject to common noise, requiring more efficient and adaptable approaches.

Method: Developed a DRL algorithm inspired by Munchausen RL and Online Mirror Descent that achieves population-dependent Nash equilibria without relying on averaging or historical sampling.

Result: Numerical experiments on seven canonical examples demonstrated superior convergence properties compared to state-of-the-art algorithms, particularly a DRL version of Fictitious Play, with robust performance under common noise.

Conclusion: The proposed algorithm provides an efficient and adaptable approach for learning population-dependent Nash equilibria in MFGs, showing strong performance across various initial distributions and noise conditions.

Abstract: Mean Field Games (MFGs) offer a powerful framework for studying large-scale multi-agent systems. Yet, learning Nash equilibria in MFGs remains a challenging problem, particularly when the initial distribution is unknown or when the population is subject to common noise. In this paper, we introduce an efficient deep reinforcement learning (DRL) algorithm designed to achieve population-dependent Nash equilibria without relying on averaging or historical sampling, inspired by Munchausen RL and Online Mirror Descent. The resulting policy is adaptable to various initial distributions and sources of common noise. Through numerical experiments on seven canonical examples, we demonstrate that our algorithm exhibits superior convergence properties compared to state-of-the-art algorithms, particularly a DRL version of Fictitious Play for population-dependent policies. The performance in the presence of common noise underscores the robustness and adaptability of our approach.
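
The Online Mirror Descent backbone the algorithm builds on can be sketched in tabular form: accumulate Q-values across iterations and softmax the running sum, rather than averaging policies as Fictitious Play does. Toy shapes and random Q-values stand in for the learned quantities:

```python
import numpy as np

def softmax(x, tau=1.0):
    z = x / tau - (x / tau).max(axis=-1, keepdims=True)   # stable softmax
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

num_states, num_actions, tau = 3, 2, 1.0
y = np.zeros((num_states, num_actions))   # cumulative Q "scores"
for k in range(10):
    Q = np.random.randn(num_states, num_actions)  # stand-in for evaluated Q^pi_k
    y += Q                                         # mirror-descent accumulation
    policy = softmax(y, tau)                       # next population policy

print(policy)   # rows sum to 1: a policy per state
```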

[215] Mentality: A Mamba-based Approach towards Foundation Models for EEG

Saarang Panchavati, Corey Arnold, William Speier

Main category: cs.LG

TL;DR: Mamba-based foundation model for EEG analysis achieves 0.72 AUROC in seizure detection, showing promise for neurological disorder diagnosis.

DetailsMotivation: EEG data is noisy, high-dimensional, and nonlinear, making automated analysis challenging. Traditional ML methods struggle with complex spatio-temporal dynamics, while recent deep learning advances offer better modeling capabilities.

Method: Trained a Mamba-based selective state space model on large EEG dataset through self-supervised reconstruction task followed by supervised seizure detection task.

Result: Achieved AUROC of 0.72 on held-out test set for seizure detection, demonstrating effective performance.

Conclusion: This approach represents significant progress toward developing large-scale, clinically applicable foundation models for EEG data analysis in neurological disorders.

Abstract: This work explores the potential of foundation models, specifically a Mamba-based selective state space model, for enhancing EEG analysis in neurological disorder diagnosis. EEG, crucial for diagnosing conditions like epilepsy, presents significant challenges due to its noisy, high-dimensional, and nonlinear nature. Traditional machine learning methods have made advances in automating EEG analysis but often fail to capture its complex spatio-temporal dynamics. Recent advances in deep learning, particularly in sequence modeling, offer new avenues for creating more generalized and expressive models capable of handling such complexities. By training a Mamba-based model on a large dataset containing seizure and non-seizure EEG recordings through a self-supervised reconstruction task followed by a seizure detection task, we demonstrate the model’s effectiveness, achieving an AUROC of 0.72 on a held-out test set. This approach marks a significant step toward developing large-scale, clinically applicable foundation models for EEG data analysis.

[216] FastCache: Fast Caching for Diffusion Transformer Through Learnable Linear Approximation

Dong Liu, Yanxuan Yu, Jiayi Zhang, Yifan Li, Ben Lengerich, Ying Nian Wu

Main category: cs.LG

TL;DR: FastCache accelerates Diffusion Transformer inference through hidden-state caching and compression, reducing computation while maintaining generation quality.

DetailsMotivation: Diffusion Transformers are computationally intensive due to iterative structure and deep transformer stacks, requiring efficiency improvements.

Method: Dual strategy: spatial-aware token selection to filter redundant tokens, and transformer-level cache to reuse latent activations across timesteps with learnable linear approximation.

Result: Substantial reductions in latency and memory usage while maintaining best generation output quality compared to other cache methods (measured by FID and t-FID).

Conclusion: FastCache effectively accelerates DiT inference by exploiting redundancy in internal representations with bounded approximation error, achieving significant performance gains.

Abstract: Diffusion Transformers (DiT) are powerful generative models but remain computationally intensive due to their iterative structure and deep transformer stacks. To alleviate this inefficiency, we propose FastCache, a hidden-state-level caching and compression framework that accelerates DiT inference by exploiting redundancy within the model’s internal representations. FastCache introduces a dual strategy: (1) a spatial-aware token selection mechanism that adaptively filters redundant tokens based on hidden state saliency, and (2) a transformer-level cache that reuses latent activations across timesteps when changes are statistically insignificant. These modules work jointly to reduce unnecessary computation while preserving generation fidelity through learnable linear approximation. Theoretical analysis shows that FastCache maintains bounded approximation error under a hypothesis-testing-based decision rule. Empirical evaluations across multiple DiT variants demonstrate substantial reductions in latency and memory usage, with best generation output quality compared to other cache methods, as measured by FID and t-FID. Code implementation of FastCache is available on GitHub at https://github.com/NoakLiu/FastCache-xDiT.
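
The transformer-level caching decision can be sketched as a change test plus a linear fallback. The threshold rule and parameters below are illustrative stand-ins, not the paper's calibrated hypothesis test:

```python
import numpy as np

def maybe_reuse(h_prev, h_curr, W, b, tol=2.0):
    """Sketch of the transformer-level cache: if the hidden state changed
    insignificantly since the previous timestep, skip the full block and
    return a learnable linear approximation of its output instead."""
    # Relative change statistic between consecutive timestep hidden states
    delta = np.linalg.norm(h_curr - h_prev) / (np.linalg.norm(h_prev) + 1e-8)
    if delta < tol * 0.01:             # illustrative significance threshold
        return h_curr @ W + b, True    # cheap learnable linear approximation
    return None, False                 # fall through to the full block

h_prev = np.random.randn(16, 64)
h_curr = h_prev + 1e-4 * np.random.randn(16, 64)   # nearly unchanged state
W, b = np.eye(64), np.zeros(64)                    # stand-ins for learned params
out, reused = maybe_reuse(h_prev, h_curr, W, b)
print(reused)   # True: the block computation is skipped this timestep
```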

[217] A Hierarchical Deep Reinforcement Learning Framework for Traffic Signal Control with Predictable Cycle Planning

Hankang Gu, Yuli Zhang, Chengming Wang, Ruiyuan Jiang, Ziheng Qiao, Pengfei Fan, Dongyao Jia

Main category: cs.LG

TL;DR: Proposes DHCP, a hierarchical DRL model for traffic signal control that allocates cycle time between directions and movements to address limitations of existing choose-phase and switch paradigms.

DetailsMotivation: Current DRL-based traffic signal control methods have limitations - choose-phase strategies disrupt driver anticipation and safety, while switch strategies lead to unfair and inefficient phase allocations.

Method: Hierarchical DRL approach with high-level agent allocating total cycle time between NS/EW directions, and low-level agent dividing allocated time between straight and left-turn movements within each direction.

Result: Achieves best performance across both real and synthetic road networks with multiple traffic flow datasets compared to baseline methods.

Conclusion: DHCP provides more flexible and efficient traffic signal control while maintaining predictable phase sequences for driver safety.

Abstract: Deep reinforcement learning (DRL) has become a popular approach in traffic signal control (TSC) due to its ability to learn adaptive policies from complex traffic environments. Within DRL-based TSC methods, two primary control paradigms are "choose phase" and "switch" strategies. Although the agent in the "choose phase" paradigm selects the next active phase adaptively, this paradigm may result in unexpected phase sequences for drivers, disrupting their anticipation and potentially compromising safety at intersections. Meanwhile, the "switch" paradigm allows the agent to decide whether to switch to the next predefined phase or extend the current phase. While this structure maintains a more predictable order, it can lead to unfair and inefficient phase allocations, as certain movements may be extended disproportionately while others are neglected. In this paper, we propose a DRL model, named Deep Hierarchical Cycle Planner (DHCP), to allocate the traffic signal cycle duration hierarchically. A high-level agent first determines the split of the total cycle time between the North-South (NS) and East-West (EW) directions based on the overall traffic state. Then, a low-level agent further divides the allocated duration within each major direction between straight and left-turn movements, enabling more flexible durations for the two movements. We test our model on both real and synthetic road networks, along with multiple sets of real and synthetic traffic flows. Empirical results show our model achieves the best performance over all datasets against baselines.
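
The two-level allocation reduces to simple arithmetic once the agents' actions are read as shares. A toy sketch with invented share values standing in for (squashed) agent outputs:

```python
def plan_cycle(total=120.0, ns_share=0.6, ns_straight=0.7, ew_straight=0.5):
    """Sketch of DHCP's hierarchy: a high-level action splits the cycle
    between North-South and East-West, then low-level actions split each
    direction between straight and left-turn phases."""
    ns, ew = total * ns_share, total * (1 - ns_share)
    return {
        "NS_straight": ns * ns_straight, "NS_left": ns * (1 - ns_straight),
        "EW_straight": ew * ew_straight, "EW_left": ew * (1 - ew_straight),
    }

print(plan_cycle())   # phase order stays fixed; only durations adapt
```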

[218] LExI: Layer-Adaptive Active Experts for Efficient MoE Model Inference

Krishna Teja Chitty-Venkata, Sandeep Madireddy, Murali Emani, Venkatram Vishwanath

Main category: cs.LG

TL;DR: LExI is a data-free optimization technique that determines optimal number of active experts per layer in MoE models, significantly improving inference efficiency with minimal accuracy loss compared to traditional pruning methods.

DetailsMotivation: Existing MoE pruning strategies only reduce memory usage but don't improve inference performance on GPUs, and fixed expert activation across layers leads to redundant computation and suboptimal performance.

Method: LExI leverages only model weights to estimate layer importance and adaptively assigns the optimal number of active experts per layer, without requiring additional data.

Result: LExI significantly outperforms traditional MoE pruning in inference efficiency with negligible accuracy loss. Qwen1.5-MoE achieves same throughput with 10% better accuracy than traditional expert pruning on H100 GPU.

Conclusion: LExI provides an effective data-free optimization approach for MoE models that addresses inference efficiency limitations of previous pruning methods while maintaining model accuracy.

Abstract: Mixture-of-Experts (MoE) models scale efficiently by activating only a subset of experts per token, offering a computationally sparse alternative to dense architectures. While prior post-training optimizations, such as inter- and intra-expert pruning, reduce memory usage, they provide limited gains in inference-time compute efficiency. Moreover, existing MoE architectures typically activate a fixed number of experts uniformly across all layers, resulting in redundant computation and suboptimal performance. In this work, we first demonstrate that MoE pruning strategies improve only the memory footprint but do not significantly improve inference performance on GPU using optimized frameworks such as vLLM. To address this, we introduce LExI, a data-free optimization technique that determines the optimal number of active experts per layer in a pretrained MoE model. LExI leverages only the model weights to estimate the relative importance of each layer and adaptively assigns the number of active experts accordingly per layer. Experiments on state-of-the-art language and vision MoE benchmarks demonstrate that LExI significantly outperforms traditional MoE pruning approaches in terms of inference efficiency with negligible accuracy loss. For example, using LExI, Qwen1.5-MoE achieves the same throughput on an Nvidia H100 GPU with 10% better accuracy than traditional expert pruning.
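
A data-free, layer-adaptive budget is easy to sketch in spirit: score each layer from its weights alone, then map scores to an active-expert count. The router-norm scoring rule below is an assumption for illustration, not the paper's estimator:

```python
import numpy as np

def layer_adaptive_experts(router_weights, k_min=1, k_max=8):
    """Data-free sketch in the spirit of LExI: score each MoE layer from
    its weights, then assign a per-layer active-expert budget. The scoring
    rule (router weight norm) is an illustrative assumption."""
    scores = np.array([np.linalg.norm(W) for W in router_weights])
    scaled = (scores - scores.min()) / (np.ptp(scores) + 1e-8)
    return [int(round(k_min + s * (k_max - k_min))) for s in scaled]

# Hypothetical router matrices for a 4-layer MoE with 8 experts each.
routers = [np.random.randn(512, 8) * c for c in (0.5, 1.0, 1.5, 2.0)]
print(layer_adaptive_experts(routers))   # e.g. [1, 3, 6, 8]
```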

[219] Balanced Multimodal Learning: An Unidirectional Dynamic Interaction Perspective

Shijie Wang, Li Zhang, Xinyan Liang, Yuhua Qian, Shen Hu

Main category: cs.LG

TL;DR: UDI proposes a sequential training strategy that abandons joint loss to address modality imbalance in multimodal learning, using anchor modality guidance and dynamic interaction adjustment.

DetailsMotivation: Traditional multimodal joint loss causes modality imbalance where strong modalities dominate weaker ones, limiting individual modality information utilization and inter-modality interactions.

Method: Unidirectional Dynamic Interaction (UDI) - sequential training where anchor modality is trained first, then guides other modalities via unsupervised loss with dynamic interaction adjustment.

Result: UDI outperforms existing methods in handling modality imbalance and achieves performance improvement in multimodal learning tasks.

Conclusion: UDI provides an effective proactive alternative to reactive joint loss approaches, enabling better modality interactions and information utilization while preventing single modality domination.

Abstract: Multimodal learning typically utilizes multimodal joint loss to integrate different modalities and enhance model performance. However, this joint learning strategy can induce modality imbalance, where strong modalities overwhelm weaker ones and limit exploitation of individual information from each modality and the inter-modality interaction information. Existing strategies such as dynamic loss weighting, auxiliary objectives and gradient modulation mitigate modality imbalance based on joint loss. These methods remain fundamentally reactive, detecting and correcting imbalance after it arises, while leaving the competitive nature of the joint loss untouched. This limitation drives us to explore a new strategy for multimodal imbalance learning that does not rely on the joint loss, enabling more effective interactions between modalities and better utilization of information from individual modalities and their interactions. In this paper, we introduce Unidirectional Dynamic Interaction (UDI), a novel strategy that abandons the conventional joint loss in favor of a proactive, sequential training scheme. UDI first trains the anchor modality to convergence, then uses its learned representations to guide the other modality via unsupervised loss. Furthermore, the dynamic adjustment of modality interactions allows the model to adapt to the task at hand, ensuring that each modality contributes optimally. By decoupling modality optimization and enabling directed information flow, UDI prevents domination by any single modality and fosters effective cross-modal feature learning. Our experimental results demonstrate that UDI outperforms existing methods in handling modality imbalance, leading to performance improvement in multimodal learning tasks.

[220] The Transparent Earth: A Multimodal Foundation Model for the Earth’s Subsurface

Arnab Mazumder, Javier E. Santos, Noah Hobbs, Mohamed Mehana, Daniel O’Malley

Main category: cs.LG

TL;DR: Transformer-based architecture for subsurface property reconstruction from heterogeneous datasets with varying sparsity, resolution, and modalities using positional and modality encodings.

DetailsMotivation: To create a scalable foundation model that can reconstruct subsurface properties from diverse observation types (stress angles, temperatures, plate types) with varying data characteristics.

Method: Uses transformer architecture with positional encodings and modality encodings derived from text embeddings of modality descriptions. Supports in-context learning and arbitrary modality addition.

Result: Reduces stress angle prediction errors by more than a factor of three on validation data. Shows improved performance with increased parameters.

Conclusion: Transparent Earth serves as an initial foundation model for predicting any subsurface property globally, demonstrating scalability and strong performance across heterogeneous datasets.

Abstract: We present the Transparent Earth, a transformer-based architecture for reconstructing subsurface properties from heterogeneous datasets that vary in sparsity, resolution, and modality, where each modality represents a distinct type of observation (e.g., stress angle, mantle temperature, tectonic plate type). The model incorporates positional encodings of observations together with modality encodings, derived from a text embedding model applied to a description of each modality. This design enables the model to scale to an arbitrary number of modalities, making it straightforward to add new ones not considered in the initial design. We currently include eight modalities spanning directional angles, categorical classes, and continuous properties such as temperature and thickness. These capabilities support in-context learning, enabling the model to generate predictions either with no inputs or with an arbitrary number of additional observations from any subset of modalities. On validation data, this reduces errors in predicting stress angle by more than a factor of three. The proposed architecture is scalable and demonstrates improved performance with increased parameters. Together, these advances make the Transparent Earth an initial foundation model for the Earth’s subsurface that ultimately aims to predict any subsurface property anywhere on Earth.

[221] Structured Basis Function Networks: Loss-Centric Multi-Hypothesis Ensembles with Controllable Diversity

Alejandro Rodriguez Dominguez, Muhammad Shahzad, Xia Hong

Main category: cs.LG

TL;DR: A unified framework called Structured Basis Function Network that bridges multi-hypothesis prediction and ensemble learning through Bregman divergence-based centroidal aggregation, providing parametric control over bias-variance-diversity trade-off.

DetailsMotivation: Existing approaches lack a unified framework that combines the diversity of multi-hypothesis prediction with the principled aggregation of ensemble learning while maintaining consistency with loss geometry.

Method: Proposes Structured Basis Function Network that links multi-hypothesis prediction and ensembling through centroidal aggregation induced by Bregman divergences, supporting both closed-form least-squares estimator and gradient-based optimization.

Result: The framework provides tunable diversity control for bias-variance-diversity trade-off and successfully connects multi-hypothesis generalization with loss-aware ensemble aggregation across regression and classification tasks.

Conclusion: The method offers a principled approach to predictive uncertainty that captures structured ambiguity while maintaining accuracy, validated through experiments with deep-learning predictors on datasets of varying difficulty.

Abstract: Existing approaches to predictive uncertainty rely either on multi-hypothesis prediction, which promotes diversity but lacks principled aggregation, or on ensemble learning, which improves accuracy but rarely captures the structured ambiguity. This implicitly means that a unified framework consistent with the loss geometry remains absent. The Structured Basis Function Network addresses this gap by linking multi-hypothesis prediction and ensembling through centroidal aggregation induced by Bregman divergences. The formulation applies across regression and classification by aligning predictions with the geometry of the loss, and supports both a closed-form least-squares estimator and a gradient-based procedure for general objectives. A tunable diversity mechanism provides parametric control of the bias-variance-diversity trade-off, connecting multi-hypothesis generalisation with loss-aware ensemble aggregation. Experiments validate this relation and use the mechanism to study the complexity-capacity-diversity trade-off across datasets of increasing difficulty with deep-learning predictors.
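
The loss-geometry-consistent aggregation rests on a standard fact about Bregman divergences: the right-sided centroid is always the arithmetic mean, while the left-sided centroid averages in gradient space (for KL, a normalized geometric mean). A minimal sketch of both aggregators:

```python
import numpy as np

def right_centroid(preds):
    """argmin_c sum_i D_phi(preds_i, c): the arithmetic mean for any
    Bregman divergence (Banerjee et al., 2005)."""
    return np.mean(preds, axis=0)

def left_centroid_kl(probs, eps=1e-12):
    """argmin_c sum_i D_phi(c, probs_i) for phi = negative entropy,
    i.e. the normalized geometric mean of the hypotheses' distributions."""
    log_mean = np.mean(np.log(probs + eps), axis=0)
    c = np.exp(log_mean)
    return c / c.sum(axis=-1, keepdims=True)

# Three hypotheses over 4 classes.
hyps = np.array([[0.7, 0.1, 0.1, 0.1],
                 [0.5, 0.3, 0.1, 0.1],
                 [0.6, 0.2, 0.1, 0.1]])
print(right_centroid(hyps))      # mean of probabilities
print(left_centroid_kl(hyps))    # geometric-mean aggregation
```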

[222] Learning Laplacian Eigenvectors: a Pre-training Method for Graph Neural Networks

Howard Dai, Nyambura Njenga, Benjamin Whitsett, Catherine Ma, Darwin Deng, Sara de Ángel, Alexandre Van Tassel, Siddharth Viswanath, Ryan Pellico, Ian Adelstein, Smita Krishnaswamy

Main category: cs.LG

TL;DR: A novel framework for pre-training Graph Neural Networks by learning Laplacian eigenvectors to capture global graph structure, overcoming over-smoothing issues in traditional MPNNs.

DetailsMotivation: Traditional Message Passing Neural Networks struggle with capturing global and regional graph structure due to over-smoothing as network depth increases, limiting their effectiveness on graph structure-based tasks.

Method: Pre-training GNNs to predict low-frequency eigenvectors of the graph Laplacian matrix, which encode global information, enabling the network to learn large-scale structural patterns across graphs.

Result: Models pre-trained with this framework outperform baseline models on various graph structure-based tasks, demonstrating improved performance and structural understanding.

Conclusion: This self-supervised pre-training framework is structure-based, highly flexible, applicable to all graph datasets, and works with synthetic features when task-specific data is sparse, providing a universal approach to GNN pre-training.

Abstract: We propose a novel framework for pre-training Graph Neural Networks (GNNs) by inductively learning Laplacian eigenvectors. Traditional Message Passing Neural Networks (MPNNs) often struggle to capture global and regional graph structure due to over-smoothing risk as network depth increases. Because the low-frequency eigenvectors of the graph Laplacian matrix encode global information, pre-training GNNs to predict these eigenvectors encourages the network to naturally learn large-scale structural patterns over each graph. Empirically, we show that models pre-trained via our framework outperform baseline models on a variety of graph structure-based tasks. While most existing pre-training methods focus on domain-specific tasks like node or edge feature reconstruction, our self-supervised pre-training framework is structure-based and highly flexible. Eigenvector-learning can be applied to all graph-based datasets, and can be used with synthetic features when task-specific data is sparse.
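
A minimal sketch of how such pre-training targets could be built: take the k smallest eigenvectors of the normalized graph Laplacian as per-node regression targets (note eigenvectors are defined only up to sign, which a real pipeline must handle).

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import eigsh

def laplacian_eigvec_targets(adj, k=4):
    """Low-frequency eigenvectors of the normalized graph Laplacian,
    usable as per-node regression targets when pre-training a GNN.

    adj: (n, n) scipy sparse adjacency matrix.
    Returns an (n, k) array of the k eigenvectors with smallest eigenvalues.
    """
    deg = np.asarray(adj.sum(axis=1)).ravel()
    d_inv_sqrt = sp.diags(1.0 / np.sqrt(np.maximum(deg, 1e-12)))
    lap = sp.eye(adj.shape[0]) - d_inv_sqrt @ adj @ d_inv_sqrt
    # 'SM' = smallest magnitude; these encode coarse, global structure.
    vals, vecs = eigsh(lap, k=k, which="SM")
    return vecs

# Toy graph: a 10-node ring.
n = 10
rows = np.arange(n)
cols = (rows + 1) % n
adj = sp.coo_matrix((np.ones(n), (rows, cols)), shape=(n, n))
adj = (adj + adj.T).tocsr()
print(laplacian_eigvec_targets(adj, k=3).shape)  # (10, 3)
```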

[223] Challenges in Understanding Modality Conflict in Vision-Language Models

Trang Nguyen, Jackson Michaels, Madalina Fiterau, David Jensen

Main category: cs.LG

TL;DR: The paper investigates how Vision-Language Models handle conflicting multimodal inputs, showing that conflict detection and resolution are distinct mechanisms that emerge at different network layers and can be linearly decoded.

DetailsMotivation: To understand how VLMs decompose conflict detection from resolution when faced with conflicting multimodal inputs, enabling better interpretability and targeted interventions for improved robustness.

Method: Mechanistic investigation of LLaVA-OV-7B using linear probes for conflict signal decoding and group-based attention pattern analysis to study detection and resolution behaviors.

Result: A linearly decodable conflict signal emerges in intermediate layers, and attention patterns for detection vs resolution diverge at different network stages, supporting distinct functional mechanisms.

Conclusion: Conflict detection and resolution are functionally distinct in VLMs, enabling actionable interpretability and targeted interventions to improve model robustness in challenging multimodal scenarios.

Abstract: This paper highlights the challenge of decomposing conflict detection from conflict resolution in Vision-Language Models (VLMs) and presents potential approaches, including using a supervised metric via linear probes and group-based attention pattern analysis. We conduct a mechanistic investigation of LLaVA-OV-7B, a state-of-the-art VLM that exhibits diverse resolution behaviors when faced with conflicting multimodal inputs. Our results show that a linearly decodable conflict signal emerges in the model’s intermediate layers and that attention patterns associated with conflict detection and resolution diverge at different stages of the network. These findings support the hypothesis that detection and resolution are functionally distinct mechanisms. We discuss how such decomposition enables more actionable interpretability and targeted interventions for improving model robustness in challenging multimodal settings.
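
A linear probe of the kind described is just a logistic regression on frozen per-layer activations; the sketch below uses synthetic activations in place of real LLaVA hidden states, with a planted conflict signal that grows with depth.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Hypothetical setup: hidden_states[l] is an (N, d) array of layer-l
# activations and conflict is a 0/1 label marking image-text disagreement.
rng = np.random.default_rng(0)
N, d, n_layers = 512, 64, 6
conflict = rng.integers(0, 2, size=N)
hidden_states = [rng.normal(size=(N, d)) + 0.5 * l * conflict[:, None]
                 for l in range(n_layers)]  # toy: signal grows with depth

for l, H in enumerate(hidden_states):
    Htr, Hte, ytr, yte = train_test_split(H, conflict, random_state=0)
    probe = LogisticRegression(max_iter=1000).fit(Htr, ytr)
    print(f"layer {l}: probe accuracy = {probe.score(Hte, yte):.2f}")
```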

[224] Unlearning That Lasts: Utility-Preserving, Robust, and Almost Irreversible Forgetting in LLMs

Naman Deep Singh, Maximilian Müller, Francesco Croce, Matthias Hein

Main category: cs.LG

TL;DR: JensUn is a new unlearning method for LLMs that uses Jensen-Shannon Divergence for better forget-utility trade-off and resilience to relearning, with improved evaluation using LLM semantic judges and worst-case testing.

DetailsMotivation: Existing unlearning methods for LLMs often fail under thorough evaluation, making it crucial to develop more effective techniques for removing private or harmful information while maintaining model utility.

Method: Leverages Jensen-Shannon Divergence as training objective for both forget and retain sets, introduces LKF dataset of lesser-known facts, and proposes LLM-based semantic evaluation instead of ROUGE scores with worst-case testing across paraphrases.

Result: JensUn achieves better forget-utility trade-off than competing methods and demonstrates strong resilience to benign relearning. The improved evaluation framework reveals many existing methods are less effective than previously thought.

Conclusion: JensUn provides more stable and effective unlearning dynamics, and the proposed evaluation framework offers more realistic and comprehensive testing of unlearning methods in LLMs.

Abstract: Unlearning in large language models (LLMs) involves precisely removing specific information from a pre-trained model. This is crucial to ensure safety of LLMs by deleting private data or harmful knowledge acquired during pre-training. However, existing unlearning methods often fall short when subjected to thorough evaluation. To overcome this, we introduce JensUn, where we leverage the Jensen-Shannon Divergence as the training objective for both forget and retain sets for more stable and effective unlearning dynamics compared to commonly used loss functions. In extensive experiments, JensUn achieves better forget-utility trade-off than competing methods, and even demonstrates strong resilience to benign relearning. Additionally, for a precise unlearning evaluation, we introduce LKF, a curated dataset of lesser-known facts that provides a realistic unlearning scenario. Finally, to comprehensively test unlearning methods, we propose (i) employing an LLM as semantic judge instead of the standard ROUGE score, and (ii) using worst-case unlearning evaluation over various paraphrases and input formats. Our improved evaluation framework reveals that many existing methods are less effective than previously thought.
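
The objective's key ingredient is the Jensen-Shannon divergence, which stays bounded and symmetric even for near-disjoint distributions. A numpy sketch, with uniform-distribution forget targets as an illustrative choice (the paper's exact target construction may differ):

```python
import numpy as np

def js_divergence(p, q, eps=1e-12):
    """Jensen-Shannon divergence between distributions p and q (last axis).
    Bounded in [0, log 2] and symmetric, which makes it a gentler training
    signal than driving log-likelihood to -inf on the forget set."""
    p = p + eps
    q = q + eps
    m = 0.5 * (p + q)
    kl = lambda a, b: np.sum(a * (np.log(a) - np.log(b)), axis=-1)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# Forget set: push the model's next-token distribution toward uniform;
# retain set: keep it close to the original model's distribution.
vocab = 8
model_p = np.array([0.8] + [0.2 / (vocab - 1)] * (vocab - 1))
uniform = np.full(vocab, 1.0 / vocab)
print("forget loss:", js_divergence(model_p, uniform))
```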

[225] The Nah Bandit: Modeling User Non-compliance in Recommendation Systems

Tianyue Zhou, Jung-Hoon Cho, Cathy Wu

Main category: cs.LG

TL;DR: The paper introduces the Nah Bandit framework for cyber-physical recommendation systems where users can reject recommendations and fall back to baseline behavior, proposing the Expert with Clustering algorithm that achieves better performance than traditional approaches.

DetailsMotivation: Traditional recommendation systems work well in digital environments but face challenges in physical world applications where users can easily opt out of recommendations they don't like and revert to their baseline behavior, requiring new interaction models.

Method: The paper models user non-compliance through parameterized anchoring effects and proposes Expert with Clustering (EWC) algorithm - a hierarchical approach that incorporates feedback from both recommended and non-recommended options to accelerate user preference learning.

Result: EWC achieves a regret bound of O(N√(T log K) + NT) with N users, T rounds per user, and K clusters, outperforming LinUCB algorithm theoretically in the short term. Experimental results show EWC beats both supervised learning and traditional contextual bandit approaches.

Conclusion: Effective use of non-compliance feedback can accelerate preference learning and improve recommendation accuracy. This work establishes a foundation for future Nah Bandit research and provides a robust framework for more effective recommendation systems in physical environments.

Abstract: Recommendation systems now pervade the digital world, ranging from advertising to entertainment. However, it remains challenging to implement effective recommendation systems in the physical world, such as in mobility or health. This work focuses on a key challenge: in the physical world, it is often easy for the user to opt out of taking any recommendation if it is not to their liking, and to fall back to their baseline behavior. It is thus crucial in cyber-physical recommendation systems to operate with an interaction model that is aware of such user behavior, lest the user abandon the recommendations altogether. This paper thus introduces the Nah Bandit, a tongue-in-cheek reference to describe a Bandit problem where users can say 'nah' to the recommendation and opt for their preferred option instead. As such, this problem lies in between a typical bandit setup and supervised learning. We model the user non-compliance by parameterizing an anchoring effect of recommendations on users. We then propose the Expert with Clustering (EWC) algorithm, a hierarchical approach that incorporates feedback from both recommended and non-recommended options to accelerate user preference learning. In a recommendation scenario with $N$ users, $T$ rounds per user, and $K$ clusters, EWC achieves a regret bound of $O(N\sqrt{T\log K} + NT)$, achieving superior theoretical performance in the short term compared to the LinUCB algorithm. Experimental results also highlight that EWC outperforms both supervised learning and traditional contextual bandit approaches. This advancement reveals that effective use of non-compliance feedback can accelerate preference learning and improve recommendation accuracy. This work lays the foundation for future research in Nah Bandit, providing a robust framework for more effective recommendation systems.
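
The expert-advice layer of such algorithms is typically the classic exponential-weights (Hedge) update over cluster experts; a minimal sketch of that building block (not the full EWC algorithm), with random losses standing in for the compliance-aware feedback:

```python
import numpy as np

def hedge_update(weights, losses, eta=0.5):
    """One exponential-weights step: down-weight experts in proportion to
    their loss this round. In an EWC-style setup there would be one expert
    per user cluster, with losses built from both recommended and
    non-recommended options."""
    w = weights * np.exp(-eta * losses)
    return w / w.sum()

K = 3                       # cluster experts
w = np.ones(K) / K
rng = np.random.default_rng(1)
for t in range(5):
    # Loss of each expert's recommendation, including feedback when the
    # user says 'nah' and picks their own option instead.
    losses = rng.uniform(size=K)
    w = hedge_update(w, losses)
print("posterior over clusters:", np.round(w, 3))
```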

[226] Ensemble Learning for Healthcare: A Comparative Analysis of Hybrid Voting and Ensemble Stacking in Obesity Risk Prediction

Towhidul Islam, Md Sumon Ali

Main category: cs.LG

TL;DR: Comparative study shows ensemble stacking outperforms hybrid majority voting for obesity risk prediction, achieving highest accuracy (0.9898) and F1-score on complex datasets.

DetailsMotivation: Obesity is a critical global health issue associated with chronic diseases. While machine learning shows promise for early prediction, there's limited comparative evaluation of ensemble techniques like hybrid majority voting vs ensemble stacking for obesity risk assessment.

Method: Used two datasets to evaluate three ensemble models: Majority Hard Voting, Weighted Hard Voting, and Stacking with MLP meta-classifier. Analyzed 9 ML algorithms across 50 hyperparameter configurations to select top 3 base learners. Applied preprocessing including dataset balancing and outlier detection, evaluated using Accuracy and F1-Score.

Result: On Dataset-1: weighted hard voting and stacking achieved nearly identical performance (Accuracy: 0.9203, F1: 0.9201), outperforming majority hard voting. On Dataset-2: stacking demonstrated superior results (Accuracy: 0.9898, F1: 0.9898) compared to both voting methods, with weighted hard voting showing lowest performance.

Conclusion: Ensemble stacking provides stronger predictive capability for complex data distributions in obesity risk prediction, while hybrid majority voting remains a robust alternative. Stacking is particularly effective for healthcare applications requiring high accuracy.

Abstract: Obesity is a critical global health issue driven by dietary, physiological, and environmental factors, and is strongly associated with chronic diseases such as diabetes, cardiovascular disorders, and cancer. Machine learning has emerged as a promising approach for early obesity risk prediction, yet a comparative evaluation of ensemble techniques – particularly hybrid majority voting and ensemble stacking – remains limited. This study aims to compare hybrid majority voting and ensemble stacking methods for obesity risk prediction, identifying which approach delivers higher accuracy and efficiency. The analysis seeks to highlight the complementary strengths of these ensemble techniques in guiding better predictive model selection for healthcare applications. Two datasets were utilized to evaluate three ensemble models: Majority Hard Voting, Weighted Hard Voting, and Stacking (with a Multi-Layer Perceptron as meta-classifier). A pool of nine Machine Learning (ML) algorithms, evaluated across a total of 50 hyperparameter configurations, was analyzed to identify the top three models to serve as base learners for the ensemble methods. Preprocessing steps involved dataset balancing and outlier detection, and model performance was evaluated using Accuracy and F1-Score. On Dataset-1, weighted hard voting and stacking achieved nearly identical performance (Accuracy: 0.920304, F1: 0.920070), outperforming majority hard voting. On Dataset-2, stacking demonstrated superior results (Accuracy: 0.989837, F1: 0.989825) compared to majority hard voting (Accuracy: 0.981707, F1: 0.981675) and weighted hard voting, which showed the lowest performance. The findings confirm that ensemble stacking provides stronger predictive capability, particularly for complex data distributions, while hybrid majority voting remains a robust alternative.
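
Both ensemble designs map directly onto scikit-learn; a minimal sketch with synthetic data and illustrative base learners standing in for the study's top-3 models:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (RandomForestClassifier, StackingClassifier,
                              VotingClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC

# Stand-in data; the study selects its base learners from a
# 9-algorithm, 50-configuration search over two obesity datasets.
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
Xtr, Xte, ytr, yte = train_test_split(X, y, random_state=0)

base = [("rf", RandomForestClassifier(random_state=0)),
        ("svc", SVC(probability=True, random_state=0)),
        ("lr", LogisticRegression(max_iter=1000))]

voting = VotingClassifier(estimators=base, voting="hard").fit(Xtr, ytr)
stack = StackingClassifier(
    estimators=base,
    final_estimator=MLPClassifier(max_iter=1000, random_state=0)).fit(Xtr, ytr)
print("hard voting:", voting.score(Xte, yte))
print("stacking (MLP meta):", stack.score(Xte, yte))
```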

[227] Conformal Prediction for Time-series Forecasting with Change Points

Sophia Sun, Rose Yu

Main category: cs.LG

TL;DR: CPTC algorithm combines state prediction with conformal prediction to handle time series with change points, providing valid uncertainty quantification in non-stationary environments.

DetailsMotivation: Current conformal prediction methods struggle with time series data containing sudden change points, creating a need for better uncertainty quantification in non-stationary time series.

Method: Integrates a model to predict underlying state with online conformal prediction to model uncertainties in time series with change points.

Result: Proven validity and improved adaptivity under minimum assumptions, demonstrated effectiveness on 6 synthetic and real-world datasets with better validity and adaptivity than state-of-the-art baselines.

Conclusion: CPTC provides an effective solution for uncertainty quantification in time series with change points, addressing a significant gap in current conformal prediction methods.

Abstract: Conformal prediction has been explored as a general and efficient way to provide uncertainty quantification for time series. However, current methods struggle to handle time series data with change points - sudden shifts in the underlying data-generating process. In this paper, we propose a novel Conformal Prediction for Time-series with Change points (CPTC) algorithm, addressing this gap by integrating a model to predict the underlying state with online conformal prediction to model uncertainties in non-stationary time series. We prove CPTC’s validity and improved adaptivity in the time series setting under minimum assumptions, and demonstrate CPTC’s practical effectiveness on 6 synthetic and real-world datasets, showing improved validity and adaptivity compared to state-of-the-art baselines.
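
The online calibration half of such a method can be sketched with the standard adaptive conformal update, which nudges the target miscoverage level after every hit or miss so intervals widen after a change point; CPTC additionally conditions calibration on a predicted state, which is omitted in this sketch.

```python
import numpy as np

def adaptive_conformal(residual_stream, alpha=0.1, gamma=0.05):
    """Online conformal sketch: track a miscoverage level alpha_t and emit
    an interval half-width from the running residual quantile. After a
    change point, coverage errors push alpha_t down and intervals widen."""
    alpha_t, history, widths = alpha, [], []
    for r in residual_stream:
        q = (np.quantile(history, 1 - np.clip(alpha_t, 0.01, 0.99))
             if history else 1.0)
        err = float(r > q)                  # 1 if the interval missed
        alpha_t += gamma * (alpha - err)    # widen after misses
        history.append(r)
        widths.append(q)
    return np.array(widths)

rng = np.random.default_rng(0)
resid = np.concatenate([np.abs(rng.normal(0, 1, 200)),
                        np.abs(rng.normal(0, 3, 200))])  # change point at t=200
print(adaptive_conformal(resid)[[150, 250, 390]])
```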

[228] Towards Reasoning for PDE Foundation Models: A Reward-Model-Driven Inference-Time-Scaling Algorithm

Siddharth Mansingh, James Amarel, Ragib Arnab, Arvind Mohan, Kamaljeet Singh, Gerd J. Kunde, Nicolas Hengartner, Benjamin Migliori, Emily Casleton, Nathan A. Debarledeben, Ayan Biswas, Diane Oyen, Earl Lawrence

Main category: cs.LG

TL;DR: A test-time computing strategy for PDEs that uses computational resources during inference to improve prediction accuracy with fewer training samples and smaller models, using reward models for spatio-temporal consistency evaluation.

DetailsMotivation: Existing PDE foundation models are constrained by pretraining datasets, struggle with auto-regressive rollout performance in out-of-distribution cases, and have high compute and training data requirements that limit their practical applications.

Method: Introduces test-time computing (TTC) strategy inspired by LLM thinking strategies, using two types of reward models to evaluate predictions of a stochastic model for spatio-temporal consistency. Demonstrated on compressible Euler-equation simulations from PDEGym benchmark.

Result: TTC captures improved predictions relative to standard non-adaptive auto-regressive inference, showing better performance with fewer training samples and smaller models.

Conclusion: This TTC framework represents a foundational step towards more advanced reasoning algorithms for PDE modeling, including potential reinforcement-learning-based approaches that could transform computational workflows in physics and engineering.

Abstract: Partial Differential Equations (PDEs) are the bedrock for modern computational sciences and engineering, and inherently computationally expensive. While PDE foundation models have shown much promise for simulating such complex spatio-temporal phenomena, existing models remain constrained by the pretraining datasets and struggle with auto-regressive rollout performance, especially in out-of-distribution (OOD) cases. Furthermore, they have significant compute and training data requirements which hamper their use in many critical applications. Inspired by recent advances in "thinking" strategies used in large language models (LLMs), we introduce the first test-time computing (TTC) strategy for PDEs that utilizes computational resources during inference to achieve more accurate predictions with fewer training samples and smaller models. We accomplish this with two types of reward models that evaluate predictions of a stochastic model for spatio-temporal consistency. We demonstrate this method on compressible Euler-equation simulations from the PDEGym benchmark and show that TTC captures improved predictions relative to standard non-adaptive auto-regressive inference. This TTC framework marks a foundational step towards more advanced reasoning algorithms for PDE modeling, including reinforcement-learning-based approaches, potentially transforming computational workflows in physics and engineering.
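
Stripped to its skeleton, the inference-time strategy is best-of-N selection over stochastic rollout candidates scored by a reward model; everything named below (the diffusion-like step function, the smoothness reward) is a toy stand-in for the paper's surrogate and consistency rewards.

```python
import numpy as np

def ttc_rollout(step_fn, reward_fn, u0, n_steps=10, n_candidates=8, seed=0):
    """Inference-time scaling for an autoregressive PDE surrogate: at each
    rollout step, sample several candidate next states from a stochastic
    model and keep the one the reward model scores highest."""
    rng = np.random.default_rng(seed)
    u = u0
    for _ in range(n_steps):
        candidates = [step_fn(u, rng) for _ in range(n_candidates)]
        u = max(candidates, key=lambda c: reward_fn(u, c))
    return u

# Toy example: a noisy diffusion step; the reward favors smooth states.
step = lambda u, rng: (u + 0.1 * (np.roll(u, 1) - 2 * u + np.roll(u, -1))
                       + 0.01 * rng.normal(size=u.shape))
reward = lambda prev, cur: -np.abs(np.diff(cur)).sum()
print(ttc_rollout(step, reward, np.sin(np.linspace(0, np.pi, 32)))[:4])
```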

[229] Power Grid Control with Graph-Based Distributed Reinforcement Learning

Carlo Fabrizio, Gianvito Losapio, Marco Mussi, Alberto Maria Metelli, Marcello Restelli

Main category: cs.LG

TL;DR: A graph-based distributed reinforcement learning framework for scalable power grid management that uses GNNs for local observations and outperforms traditional methods.

DetailsMotivation: Traditional control systems struggle with renewable energy integration and large-scale power networks, requiring more dynamic and distributed control strategies.

Method: Distributed RL framework with low-level agents on power lines coordinated by a high-level manager, using GNNs for local observations, imitation learning, and reward shaping.

Result: Outperforms standard baselines in Grid2Op simulations and shows superior computational efficiency compared to simulation-based Expert methods.

Conclusion: The proposed distributed RL approach with GNN-based local observations provides effective and scalable solution for modern power grid control challenges.

Abstract: The necessary integration of renewable energy sources, combined with the expanding scale of power networks, presents significant challenges in controlling modern power grids. Traditional control systems, which are human and optimization-based, struggle to adapt and to scale in such an evolving context, motivating the exploration of more dynamic and distributed control strategies. This work advances a graph-based distributed reinforcement learning framework for real-time, scalable grid management. The proposed architecture consists of a network of distributed low-level agents acting on individual power lines and coordinated by a high-level manager agent. A Graph Neural Network (GNN) is employed to encode the network’s topological information within each low-level agent’s observation. To accelerate convergence and enhance learning stability, the framework integrates imitation learning and potential-based reward shaping. In contrast to conventional decentralized approaches that decompose only the action space while relying on global observations, this method also decomposes the observation space. Each low-level agent acts based on a structured and informative local view of the environment constructed through the GNN. Experiments on the Grid2Op simulation environment show the effectiveness of the approach, which consistently outperforms the standard baseline commonly adopted in the field. Additionally, the proposed model proves to be much more computationally efficient than the simulation-based Expert method.

[230] Enhancing Machine Learning for Imbalanced Medical Data: A Quantum-Inspired Approach to Synthetic Oversampling (QI-SMOTE)

Vikas Kashtriya, Pardeep Singh

Main category: cs.LG

TL;DR: QI-SMOTE is a quantum-inspired oversampling technique that improves ML classifier performance on imbalanced medical data by generating synthetic instances using quantum principles, outperforming traditional methods on MIMIC datasets.

DetailsMotivation: Class imbalance in medical ML leads to biased models and reduced predictive performance, especially for underrepresented minority classes in critical applications like mortality detection.

Method: Quantum-Inspired SMOTE (QI-SMOTE) leverages quantum principles (quantum evolution and layered entanglement) to generate synthetic instances that preserve complex data structures, compared against traditional oversampling techniques on MIMIC-III/IV datasets.

Result: QI-SMOTE significantly improves ensemble methods (RF, GB, ADA), kernel-based models (SVM), and deep learning approaches, outperforming traditional oversampling methods across Accuracy, F1-score, G-Mean, and AUC-ROC metrics.

Conclusion: QI-SMOTE effectively mitigates class imbalance and enhances model robustness in medical diagnostics, demonstrating the potential of quantum-inspired resampling techniques to advance ML methodologies.

Abstract: Class imbalance remains a critical challenge in machine learning (ML), particularly in the medical domain, where underrepresented minority classes lead to biased models and reduced predictive performance. This study introduces Quantum-Inspired SMOTE (QI-SMOTE), a novel data augmentation technique that enhances the performance of ML classifiers, including Random Forest (RF), Support Vector Machine (SVM), Logistic Regression (LR), k-Nearest Neighbors (KNN), Gradient Boosting (GB), and Neural Networks, by leveraging quantum principles such as quantum evolution and layered entanglement. Unlike conventional oversampling methods, QI-SMOTE generates synthetic instances that preserve complex data structures, improving model generalization and classification accuracy. We validate QI-SMOTE on the MIMIC-III and MIMIC-IV datasets, using mortality detection as a benchmark task due to their clinical significance and inherent class imbalance. We compare our method against traditional oversampling techniques, including Borderline-SMOTE, ADASYN, SMOTE-ENN, SMOTE-TOMEK, and SVM-SMOTE, using key performance metrics such as Accuracy, F1-score, G-Mean, and AUC-ROC. The results demonstrate that QI-SMOTE significantly improves the effectiveness of ensemble methods (RF, GB, ADA), kernel-based models (SVM), and deep learning approaches by producing more informative and balanced training data. By integrating quantum-inspired transformations into the ML pipeline, QI-SMOTE not only mitigates class imbalance but also enhances the robustness and reliability of predictive models in medical diagnostics and decision-making. This study highlights the potential of quantum-inspired resampling techniques in advancing state-of-the-art ML methodologies.

[231] Improving Generative Methods for Causal Evaluation via Simulation-Based Inference

Pracheta Amaranath, Vinitra Muralikrishnan, Amit Sharma, David D. Jensen

Main category: cs.LG

TL;DR: SBICE introduces a simulation-based inference framework that models generative parameters as uncertain and infers their posterior distribution from source data, enabling more realistic synthetic dataset generation for causal estimator evaluation.

DetailsMotivation: Existing generative methods for synthetic datasets require fixed point estimates of parameters rather than distributions, preventing users from expressing uncertainty and potentially leading to unreliable causal estimator comparisons.

Method: Leverages simulation-based inference techniques to model generative parameters as uncertain and infer their posterior distribution given source data, identifying parameter configurations that produce synthetic datasets closely aligned with the source data distribution.

Result: Empirical results demonstrate that SBICE improves the reliability of estimator evaluations by generating more realistic datasets that support robust and data-consistent causal benchmarking under uncertainty.

Conclusion: SBICE provides a framework for more reliable causal estimator evaluation by incorporating parameter uncertainty and enabling posterior inference, addressing limitations of existing methods that use fixed parameter estimates.

Abstract: Generating synthetic datasets that accurately reflect real-world observational data is critical for evaluating causal estimators, but remains a challenging task. Existing generative methods offer a solution by producing synthetic datasets anchored in the observed data (source data) while allowing variation in key parameters such as the treatment effect and amount of confounding bias. However, existing methods typically require users to provide point estimates of such parameters (rather than distributions) and fixed estimates (rather than estimates that can be improved with reference to the source data). This denies users the ability to express uncertainty over parameter values and removes the potential for posterior inference, potentially leading to unreliable estimator comparisons. We introduce simulation-based inference for causal evaluation (SBICE), a framework that models generative parameters as uncertain and infers their posterior distribution given a source dataset. Leveraging techniques in simulation-based inference, SBICE identifies parameter configurations that produce synthetic datasets closely aligned with the source data distribution. Empirical results demonstrate that SBICE improves the reliability of estimator evaluations by generating more realistic datasets, which supports a robust and data-consistent approach to causal benchmarking under uncertainty.

[232] Event Detection and Classification for Long Range Sensing of Elephants Using Seismic Signal

Jaliya L. Wijayaraja, Janaka L. Wijekoon, Malitha Wijesundara

Main category: cs.LG

TL;DR: A seismic-based elephant detection framework using Contextually Customized Windowing for real-time footfall detection, achieving up to 99% accuracy with SVM classification in various environments.

DetailsMotivation: To address limitations in manual classification of elephant footfalls for Human-Elephant Conflict solutions, enabling real-time detection in natural settings with resource-constrained implementations.

Method: Introduced Contextually Customized Windowing (CCW) for event detection, compared with STA/LTA method, used SVM with RBF kernel for classification, and conducted feature impact analysis using explainable AI.

Result: Maximum detection range: 155.6m (controlled) and 140m (natural). Classification accuracy: 99% (controlled), 73% (natural habitats), 70% (HEC-prone human habitats). Zero Crossings and DTW Alignment Cost identified as most influential features.

Conclusion: The framework provides effective real-time elephant detection with high accuracy and computational efficiency, suitable for resource-constrained implementations in various environmental conditions.

Abstract: Detecting elephants through seismic signals is an emerging research topic aimed at developing solutions for Human-Elephant Conflict (HEC). Despite the promising results, such solutions heavily rely on manual classification of elephant footfalls, which limits their applicability for real-time classification in natural settings. To address this limitation and build on our previous work, this study introduces a classification framework targeting resource-constrained implementations, prioritizing both accuracy and computational efficiency. As part of this framework, a novel event detection technique named Contextually Customized Windowing (CCW), tailored specifically for detecting elephant footfalls, was introduced, and evaluations were conducted by comparing it with the Short-Term Average/Long-Term Average (STA/LTA) method. Results show that the maximum validated detection range was 155.6 m in controlled conditions and 140 m in natural environments. Elephant footfall classification using a Support Vector Machine (SVM) with a Radial Basis Function (RBF) kernel demonstrated superior performance across multiple settings, achieving an accuracy of 99% in controlled environments, 73% in natural elephant habitats, and 70% in HEC-prone human habitats, the most challenging scenario. Furthermore, feature impact analysis using explainable AI identified the number of Zero Crossings and Dynamic Time Warping (DTW) Alignment Cost as the most influential factors in all experiments, while Predominant Frequency exhibited significant influence in controlled settings.
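
The STA/LTA baseline that CCW is compared against is a classic detector: the ratio of a short-window to a long-window average of signal energy spikes at an impulsive event. A self-contained sketch on synthetic data:

```python
import numpy as np

def sta_lta(signal, fs, sta_win=0.5, lta_win=10.0):
    """Short-Term Average / Long-Term Average ratio over signal energy;
    an impulsive event such as a footfall shows up as a spike in the ratio.
    Windows are trailing averages computed via cumulative sums."""
    sta_n, lta_n = int(sta_win * fs), int(lta_win * fs)
    energy = signal.astype(float) ** 2
    csum = np.concatenate([[0.0], np.cumsum(energy)])
    sta = (csum[sta_n:] - csum[:-sta_n]) / sta_n
    lta = (csum[lta_n:] - csum[:-lta_n]) / lta_n
    n = min(len(sta), len(lta))        # align both series at the final sample
    return sta[-n:] / (lta[-n:] + 1e-12)

fs = 1000
t = np.arange(0, 20, 1 / fs)
sig = 0.1 * np.random.default_rng(0).normal(size=t.size)
sig[12_000:12_200] += 2.0              # synthetic footfall at t = 12 s
ratio = sta_lta(sig, fs)
print("max STA/LTA ratio:", ratio.max().round(1))
```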

Kunal Kumar, Muhammad Ashad Kabir, Luke Donnan, Sayed Ahmed

Main category: cs.LG

TL;DR: Review of 45 studies on diabetic foot ulcer offloading footwear reveals fragmented decision-making approaches. Proposes a 5-part CDSS framework for standardized, explainable, and clinically validated prescription systems.

DetailsMotivation: Current offloading footwear prescription for diabetic foot ulcers suffers from fragmented feature selection, limited personalization, and inconsistent evaluation practices, hindering effective clinical decision-making.

Method: Narrative review of 45 studies (guidelines, knowledge-based systems, and ML applications) with thematic analysis of knowledge type, decision logic, evaluation methods, and enabling technologies.

Result: Guidelines emphasize pressure thresholds but lack actionable outputs; knowledge-based systems integrate monitoring and usability; ML shows high accuracy but poor explainability. Evaluation methods remain fragmented across different study types.

Conclusion: A five-part CDSS framework is proposed with minimum dataset standards, hybrid architecture, structured outputs, continuous validation, and clinical workflow integration to enable scalable, patient-centered DFU care systems.

Abstract: Offloading footwear helps prevent and treat diabetic foot ulcers (DFUs) by lowering plantar pressure (PP), yet prescription decisions remain fragmented: feature selection varies, personalization is limited, and evaluation practices differ. We performed a narrative review of 45 studies (12 guidelines/protocols, 25 knowledge-based systems, 8 machine-learning applications) published through Aug 2025. We thematically analyzed knowledge type, decision logic, evaluation methods, and enabling technologies. Guidelines emphasize PP thresholds (≤200 kPa or ≥25–30% reduction) but rarely yield actionable, feature-level outputs. Knowledge-based systems use rule- and sensor-driven logic, integrating PP monitoring, adherence tracking, and usability testing. ML work introduces predictive, optimization, and generative models with high computational accuracy but limited explainability and clinical validation. Evaluation remains fragmented: protocols prioritize biomechanical tests; knowledge-based systems assess usability/adherence; ML studies focus on technical accuracy with weak linkage to long-term outcomes. From this synthesis we propose a five-part CDSS framework: (1) a minimum viable dataset; (2) a hybrid architecture combining rules, optimization, and explainable ML; (3) structured feature-level outputs; (4) continuous validation and evaluation; and (5) integration with clinical and telehealth workflows. This framework aims to enable scalable, patient-centered CDSSs for DFU care; prioritizing interoperable datasets, explainable models, and outcome-focused evaluation will be key to clinical adoption.

[234] PDRL: Post-hoc Descriptor-based Residual Learning for Uncertainty-Aware Machine Learning Potentials

Shih-Peng Huang, Nontawat Charoenphakdee, Yuta Tsuboi, Yong-Bin Zhuang, Wenwen Li

Main category: cs.LG

TL;DR: PDRL: A post-hoc descriptor-based residual learning framework for efficient uncertainty quantification in ML interatomic potentials, using graph neural network descriptors to model prediction errors as uncertainty proxies.

DetailsMotivation: Ensemble methods are computationally expensive for uncertainty quantification in ML interatomic potentials, and alternative methods may affect accuracy or cannot be applied to already trained models.

Method: Proposes PDRL framework that leverages descriptors from trained graph neural network potentials to model residual errors between predictions and ground truth, using these residuals as uncertainty proxies.

Result: Multiple PDRL variants are developed and benchmarked against established UQ methods, evaluating both effectiveness and limitations.

Conclusion: PDRL provides a simple and efficient post-hoc approach for uncertainty quantification that can be applied to already trained models without affecting prediction accuracy.

Abstract: Ensemble methods are considered the gold standard for uncertainty quantification (UQ) for machine learning interatomic potentials (MLIPs). However, their high computational cost can limit their practicality. Alternative techniques, such as Monte Carlo dropout and deep kernel learning, have been proposed to improve computational efficiency; however, some of these methods cannot be applied to already trained models and may affect the prediction accuracy. In this paper, we propose a simple and efficient post-hoc framework for UQ that leverages the descriptor of a trained graph neural network potential to estimate residual errors. We refer to this method as post-hoc descriptor-based residual learning (PDRL). PDRL models the discrepancy between MLIP predictions and ground truth values, allowing these residuals to act as proxies for prediction uncertainty. We explore multiple variants of PDRL and benchmark them against established UQ methods, evaluating both their effectiveness and limitations.
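
The post-hoc recipe reduces to fitting a regressor from frozen descriptors to observed residual errors; the sketch below uses synthetic descriptors and a ridge regressor as one illustrative variant, with the trained MLIP left untouched.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

# Hypothetical stand-ins: descriptors are per-structure features taken from
# a trained GNN potential; energy_err is |prediction - ground truth|.
rng = np.random.default_rng(0)
N, d = 2000, 32
descriptors = rng.normal(size=(N, d))
true_difficulty = np.abs(descriptors[:, 0]) + 0.3 * np.abs(descriptors[:, 1])
energy_err = true_difficulty + 0.1 * rng.normal(size=N)

Xtr, Xte, ytr, yte = train_test_split(descriptors, energy_err, random_state=0)
residual_model = Ridge(alpha=1.0).fit(Xtr, ytr)   # post-hoc, MLIP untouched
uncertainty = residual_model.predict(Xte)          # proxy for prediction error
print("corr(predicted, actual error):",
      np.corrcoef(uncertainty, yte)[0, 1].round(2))
```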

[235] VendiRL: A Framework for Self-Supervised Reinforcement Learning of Diversely Diverse Skills

Erik M. Lintunen

Main category: cs.LG

TL;DR: VendiRL introduces a unified framework using the Vendi Score from ecology to measure and optimize diverse skill learning in self-supervised RL, addressing scalability and evaluation challenges.

DetailsMotivation: To overcome scalability issues in high-dimensional feature spaces and provide a flexible, consistent way to evaluate skill diversity without committing to a single definition of diversity.

Method: Adopts the Vendi Score ecological diversity measure, allowing specification of any desired diversity form through similarity functions, and introduces VendiRL framework for learning diverse skill sets.

Result: Demonstrates how the Vendi Score facilitates skill evaluation and enables learning of diversely diverse skill sets that can support pretraining in various interactive environments.

Conclusion: VendiRL provides a unified approach to skill diversity learning and evaluation that can accommodate multiple definitions of diversity, making results more comparable and enabling exploration of previously unaddressed diversity forms.

Abstract: In self-supervised reinforcement learning (RL), one of the key challenges is learning a diverse set of skills to prepare agents for unknown future tasks. Despite impressive advances, scalability and evaluation remain prevalent issues. Regarding scalability, the search for meaningful skills can be obscured by high-dimensional feature spaces, where relevant features may vary across downstream task domains. For evaluating skill diversity, defining what constitutes “diversity” typically requires a hard commitment to a specific notion of what it means for skills to be diverse, potentially leading to inconsistencies in how skill diversity is understood, making results across different approaches hard to compare, and leaving many forms of diversity unexplored. To address these issues, we adopt a measure of sample diversity that translates ideas from ecology to machine learning – the Vendi Score – allowing the user to specify and evaluate any desired form of diversity. We demonstrate how this metric facilitates skill evaluation and introduce VendiRL, a unified framework for learning diversely diverse sets of skills. Given distinct similarity functions, VendiRL motivates distinct forms of diversity, which could support skill-diversity pretraining in new and richly interactive environments where optimising for various forms of diversity may be desirable.
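
The Vendi Score itself is simple to compute: the exponential of the Shannon entropy of the eigenvalues of the normalized similarity matrix. A sketch with a default cosine similarity; the user-supplied similarity function is exactly where a particular "form of diversity" gets specified.

```python
import numpy as np

def vendi_score(X, similarity=None):
    """Vendi Score (Friedman & Dieng): exp of the entropy of the eigenvalues
    of K/n, where K is a similarity matrix with unit diagonal. Ranges from 1
    (all samples alike) to n (all samples fully distinct)."""
    n = len(X)
    if similarity is None:                       # default: cosine similarity
        Z = X / np.linalg.norm(X, axis=1, keepdims=True)
        K = Z @ Z.T
    else:
        K = np.array([[similarity(a, b) for b in X] for a in X])
    lam = np.linalg.eigvalsh(K / n)
    lam = lam[lam > 1e-12]
    return float(np.exp(-np.sum(lam * np.log(lam))))

rng = np.random.default_rng(0)
clones = np.tile(rng.normal(size=(1, 8)), (10, 1))   # 10 identical "skills"
spread = rng.normal(size=(10, 8))                    # 10 distinct "skills"
print(vendi_score(clones), vendi_score(spread))      # ~1.0 vs close to 10
```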

[236] AR-KAN: Autoregressive-Weight-Enhanced Kolmogorov-Arnold Network for Time Series Forecasting

Chen Zeng, Tiehang Xu, Qiao Wang

Main category: cs.LG

TL;DR: AR-KAN hybrid model combines KAN for nonlinear mapping and autoregressive components for memory, outperforming traditional neural networks and ARIMA on 72% of real-world datasets for almost periodic signal forecasting.

DetailsMotivation: Traditional neural networks struggle with spectral analysis of signals, particularly when dealing with almost periodic functions with incommensurate frequencies where ARIMA often outperforms neural networks including LLMs.

Method: Proposes Autoregressive-Weight-Enhanced AR-KAN that combines Kolmogorov-Arnold Network (KAN) for static nonlinear mapping with pre-trained autoregressive components for memory retention, using Universal Myopic Mapping Theorem.

Result: Experimental results show AR-KAN delivers superior performance on 72% of real-world datasets compared to traditional approaches.

Conclusion: The hybrid AR-KAN model effectively addresses the limitations of conventional neural networks in handling almost periodic signals by integrating nonlinear mapping with memory-aware autoregressive components.

Abstract: Conventional neural networks frequently face challenges in spectral analysis of signals. To address this challenge, Fourier neural networks (FNNs) and similar approaches integrate components of Fourier series into the structure of neural networks. Nonetheless, a significant hurdle is often overlooked: the superposition of periodic signals does not necessarily result in a periodic signal. For example, when forecasting almost periodic functions composed of signals with incommensurate frequencies, traditional models such as Autoregressive Integrated Moving Average (ARIMA) frequently outperform most neural networks including large language models (LLMs). To tackle this goal, we propose Autoregressive-Weight-Enhanced AR-KAN, a hybrid model that combines the benefits of both methods. Using the Universal Myopic Mapping Theorem, we apply a Kolmogorov-Arnold Network (KAN) for the static nonlinear part and include memory through a pre-trained AR component, which can be explained to retain the most useful information while eliminating redundancy. Experimental data indicates that AR-KAN delivers superior results on $72%$ of real-world datasets.

[237] Delayed Momentum Aggregation: Communication-efficient Byzantine-robust Federated Learning with Partial Participation

Kaoru Otsuka, Yuki Takezawa, Makoto Yamada

Main category: cs.LG

TL;DR: D-Byz-SGDM introduces delayed momentum aggregation for Byzantine-robust federated learning under partial participation, addressing the vulnerability of existing methods when sampled clients contain a Byzantine majority.

DetailsMotivation: Existing Byzantine-robust FL methods assume full client participation, which is unrealistic due to communication constraints and client availability. Under partial participation, current methods fail when sampled clients contain a Byzantine majority.

Method: Delayed momentum aggregation principle where server aggregates most recently received gradients from non-participating clients alongside fresh momentum from active clients, implemented in D-Byz-SGDM optimizer.

Result: Established convergence guarantees that recover previous full participation results and match fundamental lower bounds for partial participation setting. Experiments on deep learning tasks showed stable and robust training under various Byzantine attacks.

Conclusion: The proposed D-Byz-SGDM method successfully addresses the challenge of Byzantine-robust FL under partial participation, providing theoretical guarantees and practical effectiveness against Byzantine attacks.

Abstract: Federated Learning (FL) allows distributed model training across multiple clients while preserving data privacy, but it remains vulnerable to Byzantine clients that exhibit malicious behavior. While existing Byzantine-robust FL methods provide strong convergence guarantees (e.g., to a stationary point in expectation) under Byzantine attacks, they typically assume full client participation, which is unrealistic due to communication constraints and client availability. Under partial participation, existing methods fail immediately after the sampled clients contain a Byzantine majority, creating a fundamental challenge for sparse communication. First, we introduce delayed momentum aggregation, a novel principle where the server aggregates the most recently received gradients from non-participating clients alongside fresh momentum from active clients. Our optimizer D-Byz-SGDM (Delayed Byzantine-robust SGD with Momentum) implements this delayed momentum aggregation principle for Byzantine-robust FL with partial participation. Then, we establish convergence guarantees that recover previous full participation results and match the fundamental lower bounds we prove for the partial participation setting. Experiments on deep learning tasks validated our theoretical findings, showing stable and robust training under various Byzantine attacks.
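
The delayed-momentum principle can be sketched in a few lines: the server keeps a momentum slot per client, refreshes only the sampled ones, and robustly aggregates over all slots. The coordinate-wise median and all constants below are illustrative, not the paper's exact aggregator.

```python
import numpy as np

def server_round(momenta, grads, participating, beta=0.9):
    """One aggregation round in the spirit of D-Byz-SGDM: sampled clients
    refresh their momentum; for non-participants the server reuses the most
    recently received momentum, so a Byzantine majority *within the sample*
    cannot dominate the aggregate over all clients."""
    for i in participating:
        momenta[i] = beta * momenta[i] + (1 - beta) * grads[i]
    stacked = np.stack([momenta[i] for i in sorted(momenta)])
    return np.median(stacked, axis=0)   # stand-in robust aggregator

rng = np.random.default_rng(0)
momenta = {i: np.zeros(5) for i in range(10)}          # 10 clients, dim 5
grads = {i: rng.normal(size=5) for i in [2, 5, 7]}     # only sampled clients report
print(server_round(momenta, grads, participating=[2, 5, 7]).round(3))
```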

[238] AdaGrad Meets Muon: Adaptive Stepsizes for Orthogonal Updates

Minxin Zhang, Yuxuan Liu, Hayden Schaeffer

Main category: cs.LG

TL;DR: AdaGO combines orthogonalized momentum updates from Muon with AdaGrad’s adaptive learning rate scaling, achieving better performance than both Muon and Adam while maintaining computational efficiency.

DetailsMotivation: Muon optimizer shows strong performance but lacks clear learning rate determination, while AdaGrad provides adaptive scaling but doesn't use orthogonal updates. The goal is to combine both benefits.

Method: AdaGO uses orthogonalized momentum updates (like Muon) but scales them with accumulated past gradient norms (like AdaGrad), preserving orthogonality while adapting to optimization landscape.

Result: Theoretical convergence rates established for nonconvex functions. Empirical results show AdaGO outperforms Muon and Adam on CIFAR-10 classification and function regression tasks.

Conclusion: AdaGO successfully combines orthogonal updates with adaptive learning rates, achieving superior performance with minimal computational overhead compared to existing optimizers.

Abstract: The recently proposed Muon optimizer updates weight matrices via orthogonalized momentum and has demonstrated strong empirical success in large language model training. However, it remains unclear how to determine the learning rates for such orthogonalized updates. AdaGrad, by contrast, is a widely used adaptive method that scales stochastic gradients by accumulated past gradients. We propose a new algorithm, AdaGO, which combines a norm-based AdaGrad-type stepsize with an orthogonalized update direction, bringing together the benefits of both approaches. Unlike other adaptive variants of Muon, AdaGO preserves the orthogonality of the update direction, which can be interpreted as a spectral descent direction, while adapting the stepsizes to the optimization landscape by scaling the direction with accumulated past gradient norms. The implementation of AdaGO requires only minimal modification to Muon, with a single additional scalar variable, the accumulated squared gradient norms, to be computed, making it computationally and memory efficient. Optimal theoretical convergence rates are established for nonconvex functions in both stochastic and deterministic settings under standard smoothness and unbiased bounded-variance noise assumptions. Empirical results on CIFAR-10 classification and function regression demonstrate that AdaGO outperforms Muon and Adam.
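
A sketch of the combination: orthogonalize the momentum (here via an exact SVD polar factor, which Muon approximates with Newton-Schulz iterations) and scale the step by accumulated gradient norms, AdaGrad-norm style. Constants and the precise scaling are illustrative, not the paper's.

```python
import numpy as np

def orthogonalize(M):
    """Polar factor of M: replaces the momentum's singular values with 1,
    keeping only its (spectral descent) direction."""
    U, _, Vt = np.linalg.svd(M, full_matrices=False)
    return U @ Vt

def adago_step(W, M, G, v, beta=0.9, lr=0.1, eps=1e-8):
    """One AdaGO-style update sketch: orthogonalized momentum direction,
    scaled by a stepsize built from accumulated squared gradient norms."""
    M = beta * M + G                        # momentum buffer
    v = v + np.sum(G * G)                   # accumulated squared grad norms
    W = W - lr / (np.sqrt(v) + eps) * orthogonalize(M)
    return W, M, v

rng = np.random.default_rng(0)
W, M, v = rng.normal(size=(4, 4)), np.zeros((4, 4)), 0.0
for _ in range(3):
    G = 2 * W                               # gradient of ||W||_F^2
    W, M, v = adago_step(W, M, G, v)
print(np.linalg.norm(W).round(3))
```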

[239] LimiX: Unleashing Structured-Data Modeling Capability for Generalist Intelligence

Xingxuan Zhang, Gang Ren, Han Yu, Hao Yuan, Hui Wang, Jiansheng Li, Jiayun Wu, Lang Mo, Li Mao, Mingchao Hao, Ningbo Dai, Renzhe Xu, Shuyang Li, Tianyang Zhang, Yue He, Yuanrui Wang, Yunjia Zhang, Zijing Xu, Dongzhe Li, Fang Gao, Hao Zou, Jiandong Liu, Jiashuo Liu, Jiawei Xu, Kaijie Cheng, Kehan Li, Linjun Zhou, Qing Li, Shaohua Fan, Xiaoyu Lin, Xinyan Han, Xuanyue Li, Yan Lu, Yuan Xue, Yuanyuan Jiang, Zimu Wang, Zhenlei Wang, Peng Cui

Main category: cs.LG

TL;DR: LimiX is a large structured-data model that treats tabular data as a joint distribution, enabling unified handling of classification, regression, imputation, and generation tasks through a single model with query-based prediction.

DetailsMotivation: Progress toward general intelligence requires foundation models that can handle structured data (tables) alongside language and physical world models, addressing the gap in current AI systems.

Method: Pretrained using masked joint-distribution modeling with episodic, context-conditional objective. Predicts query subsets conditioned on dataset-specific contexts, enabling training-free adaptation at inference.

Result: Outperforms gradient-boosting trees, deep tabular networks, recent tabular foundation models, and automated ensembles across 10 benchmarks with diverse data characteristics.

Conclusion: LimiX demonstrates superior performance across multiple tabular tasks with a single unified model, avoiding task-specific architectures while being publicly available under Apache 2.0 license.

Abstract: We argue that progress toward general intelligence requires complementary foundation models grounded in language, the physical world, and structured data. This report presents LimiX, the first installment of our large structured-data models (LDMs). LimiX treats structured data as a joint distribution over variables and missingness, thus capable of addressing a wide range of tabular tasks through query-based conditional prediction via a single model. LimiX is pretrained using masked joint-distribution modeling with an episodic, context-conditional objective, where the model predicts for query subsets conditioned on dataset-specific contexts, supporting rapid, training-free adaptation at inference. We evaluate LimiX across 10 large structured-data benchmarks with broad regimes of sample size, feature dimensionality, class number, categorical-to-numerical feature ratio, missingness, and sample-to-feature ratios. With a single model and a unified interface, LimiX consistently surpasses strong baselines including gradient-boosting trees, deep tabular networks, recent tabular foundation models, and automated ensembles, as shown in Figure 1 and Figure 2. The superiority holds across a wide range of tasks, such as classification, regression, missing value imputation, and data generation, often by substantial margins, while avoiding task-specific architectures or bespoke training per task. All LimiX models are publicly accessible under Apache 2.0.

[240] StableSleep: Source-Free Test-Time Adaptation for Sleep Staging with Lightweight Safety Rails

Hritik Arasu, Faisal R Jahangiri

Main category: cs.LG

TL;DR: Streaming test-time adaptation method for sleep staging that combines entropy minimization with safety mechanisms to improve performance on unseen patient data without requiring source data or calibration.

DetailsMotivation: Sleep staging models often degrade when deployed on patients with unseen physiology or recording conditions, creating a need for adaptive methods that work in real-world clinical settings.

Method: Proposes a streaming, source-free test-time adaptation recipe combining entropy minimization (Tent) with Batch-Norm statistic refresh, plus two safety rails: an entropy gate to pause adaptation on uncertain windows and an EMA-based reset to prevent drift.

Result: Shows consistent gains over frozen baseline on Sleep-EDF Expanded dataset using single-lead EEG, with seconds-level latency and minimal memory requirements. Reports per-stage metrics and Cohen’s kappa.

Conclusion: The method is model-agnostic, requires no source data or patient calibration, and is practical for on-device or bedside use, making it suitable for real-world deployment.

Abstract: Sleep staging models often degrade when deployed on patients with unseen physiology or recording conditions. We propose a streaming, source-free test-time adaptation (TTA) recipe that combines entropy minimization (Tent) with Batch-Norm statistic refresh and two safety rails: an entropy gate to pause adaptation on uncertain windows and an EMA-based reset to reel back drift. On Sleep-EDF Expanded, using single-lead EEG (Fpz-Cz, 100 Hz, 30s epochs; R&K to AASM mapping), we show consistent gains over a frozen baseline at seconds-level latency and minimal memory, reporting per-stage metrics and Cohen’s kappa. The method is model-agnostic, requires no source data or patient calibration, and is practical for on-device or bedside use.
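A minimal PyTorch sketch of one streaming adaptation step in this spirit follows. The gate and drift thresholds, the anchor snapshot, and the choice to adapt only BatchNorm affine parameters (the optimizer is assumed to be built over those) are our assumptions, not the paper's exact recipe.

```python
import torch

def adapt_step(model, anchor_state, x, opt, gate=0.8, drift=0.1, ema=0.999):
    """One streaming TTA step in the spirit of StableSleep (illustrative).

    Entropy minimization (Tent) with BN statistics refreshed via train mode,
    an entropy gate that pauses adaptation on uncertain windows, and an
    EMA/anchor-based reset that reels parameters back when drift grows.
    """
    model.train()                          # BN uses current-window statistics
    logits = model(x)
    probs = logits.softmax(dim=1)
    ent = -(probs * probs.clamp_min(1e-8).log()).sum(1).mean()
    if ent.item() < gate:                  # gate: skip high-entropy windows
        opt.zero_grad()
        ent.backward()
        opt.step()
    with torch.no_grad():                  # safety rail against drift
        for name, p in model.named_parameters():
            a = anchor_state[name]
            if (p - a).norm() > drift * (a.norm() + 1e-8):
                p.copy_(ema * a + (1 - ema) * p)   # snap back toward anchor
    return logits.detach()
```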

[241] Multimodal learning of melt pool dynamics in laser powder bed fusion

Satyajit Mojumder, Pallock Halder, Tiana Tonge

Main category: cs.LG

TL;DR: Multimodal data fusion approach combining high-fidelity X-ray and low-fidelity absorptivity data for predicting melt pool dynamics in LPBF additive manufacturing, enabling accurate predictions using only low-cost sensors after training.

DetailsMotivation: High-speed X-ray imaging provides valuable subsurface melt pool insights but is costly and impractical for industrial use, while low-cost photodiode absorptivity data is noisy and unreliable when used alone for predicting melt pool dynamics.

Method: Multimodal learning framework integrating CNNs for spatial feature extraction from X-ray data with RNNs for temporal feature extraction from absorptivity signals, using early fusion strategy and transfer learning to fine-tune RNN models.

Result: Training with both modalities significantly improves prediction accuracy compared to using either modality alone. Once trained, the model can accurately infer melt pool characteristics using only absorptivity data, eliminating need for expensive X-ray imaging.

Conclusion: This multimodal fusion approach enables cost-effective, real-time monitoring in additive manufacturing with broad applicability, allowing accurate melt pool prediction using only low-cost sensors after initial multimodal training.

Abstract: While multiple sensors are used for real-time monitoring in additive manufacturing, not all provide practical or reliable process insights. For example, high-speed X-ray imaging offers valuable spatial information about subsurface melt pool behavior but is costly and impractical for most industrial settings. In contrast, absorptivity data from low-cost photodiodes correlates with melt pool dynamics but is often too noisy for accurate prediction when used alone. In this paper, we propose a multimodal data fusion approach for predicting melt pool dynamics by combining high-fidelity X-ray data with low-fidelity absorptivity data in the Laser Powder Bed Fusion (LPBF) process. Our multimodal learning framework integrates convolutional neural networks (CNNs) for spatial feature extraction from X-ray data with recurrent neural networks (RNNs) for temporal feature extraction from absorptivity signals, using an early fusion strategy. The multimodal model is further used as a transfer-learning starting point to fine-tune an RNN model that predicts melt pool dynamics from absorptivity alone, with greater accuracy than the multimodal model. Results show that training with both modalities significantly improves prediction accuracy compared to using either modality alone. Furthermore, once trained, the model can infer melt pool characteristics using only absorptivity data, eliminating the need for expensive X-ray imaging. This multimodal fusion approach enables cost-effective, real-time monitoring and has broad applicability in additive manufacturing.
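For a rough picture of the fusion architecture, here is a simplified model that concatenates CNN features from X-ray frames with the final GRU state from the absorptivity signal; layer sizes, output dimensions, and the fusion-by-concatenation reading are our assumptions.

```python
import torch
import torch.nn as nn

class EarlyFusionMeltPool(nn.Module):
    """Illustrative CNN+RNN fusion regressor (our simplification)."""
    def __init__(self, signal_dim=1, hidden=64, out_dim=2):
        super().__init__()
        self.cnn = nn.Sequential(                 # spatial features from X-ray
            nn.Conv2d(1, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten())
        self.rnn = nn.GRU(signal_dim, hidden, batch_first=True)
        self.head = nn.Sequential(                # fused features -> melt pool
            nn.Linear(32 + hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, out_dim))

    def forward(self, xray, signal):
        f_img = self.cnn(xray)                    # (B, 32)
        _, h = self.rnn(signal)                   # h: (layers, B, hidden)
        fused = torch.cat([f_img, h[-1]], dim=1)  # fuse by concatenation
        return self.head(fused)                   # e.g., melt pool depth/width
```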

[242] Knowledge Integration for Physics-informed Symbolic Regression Using Pre-trained Large Language Models

Bilge Taskin, Wenxiong Xie, Teddy Lazebnik

Main category: cs.LG

TL;DR: Using LLMs to automate domain knowledge integration in physics-informed symbolic regression, improving equation discovery from noisy data across multiple physical systems.

DetailsMotivation: Current physics-informed symbolic regression methods require manual feature engineering and domain expertise, limiting accessibility. LLMs' scientific knowledge can automate this process.

Method: Integrate pre-trained LLMs (Falcon, Mistral, Llama 2) into SR loss functions to evaluate generated equations. Test with three SR algorithms on three physical dynamics.

Result: LLM integration consistently improves physical dynamics reconstruction, enhances robustness to noise and complexity, with better prompts further boosting performance.

Conclusion: LLMs effectively automate domain knowledge incorporation in symbolic regression, making physics-informed discovery more accessible without requiring specialized expertise.

Abstract: Symbolic regression (SR) has emerged as a powerful tool for automated scientific discovery, enabling the derivation of governing equations from experimental data. A growing body of work illustrates the promise of integrating domain knowledge into the SR to improve the discovered equation’s generality and usefulness. Physics-informed SR (PiSR) addresses this by incorporating domain knowledge, but current methods often require specialized formulations and manual feature engineering, limiting their use to domain experts. In this study, we leverage pre-trained Large Language Models (LLMs) to facilitate knowledge integration in PiSR. By harnessing the contextual understanding of LLMs trained on vast scientific literature, we aim to automate the incorporation of domain knowledge, reducing the need for manual intervention and making the process more accessible to a broader range of scientific problems. Namely, the LLM is integrated into the SR’s loss function, adding a term of the LLM’s evaluation of the SR’s produced equation. We extensively evaluate our method using three SR algorithms (DEAP, gplearn, and PySR) and three pre-trained LLMs (Falcon, Mistral, and Llama 2) across three physical dynamics (dropping ball, simple harmonic motion, and electromagnetic wave). The results demonstrate that LLM integration consistently improves the reconstruction of physical dynamics from data, enhancing the robustness of SR models to noise and complexity. We further explore the impact of prompt engineering, finding that more informative prompts significantly improve performance.
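The loss construction is simple enough to sketch. In the snippet below, llm_score is a hypothetical user-supplied callable (not an API of the paper) that prompts an LLM to rate a candidate equation's physical plausibility in [0, 1]; the weighting lam is likewise our placeholder.

```python
import numpy as np

def physics_informed_loss(expr_str, predict, X, y, llm_score, lam=0.5):
    """Illustrative PiSR loss with an LLM term (names and weighting are ours).

    llm_score(expr_str) -> [0, 1]: hypothetical function that asks a
    pre-trained LLM how physically plausible the candidate equation is.
    """
    mse = float(np.mean((predict(X) - y) ** 2))      # standard data-fit term
    return mse + lam * (1.0 - llm_score(expr_str))   # penalize implausible forms
```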

[243] Binary Quantization For LLMs Through Dynamic Grouping

Xinzhe Zheng, Zhen-Qun Yang, Haoran Xie, S. Joe Qin, Arlene Chen, Fangzhen Lin

Main category: cs.LG

TL;DR: Novel binary quantization method for LLMs that achieves near-original performance with only 1.007 average bits, outperforming previous SOTA binary methods and competing with 4-bit approaches.

DetailsMotivation: LLMs require substantial memory and computational resources. Binary quantization (1-bit) offers significant storage and inference cost reductions but typically causes notable performance degradation compared to 4-bit methods.

Method: Proposes a novel optimization objective for binary quantization with three algorithms. Enhances blocked quantization by dynamically identifying optimal unstructured sub-matrices through adaptive grouping strategies.

Result: Achieves average bit length of 1.007 bits while maintaining high model quality. Quantized LLaMA 3.2 3B attains perplexity of 8.23 (close to original 7.81), surpassing previous SOTA BiLLM (123.90 perplexity). Competitive with SOTA 4-bit approaches like GPTQ. Highly efficient compression: 14 seconds for LLaMA 3.2 3B quantization on single CPU core, under 100 minutes total with embarrassingly parallel properties.

Conclusion: The proposed binary quantization method effectively bridges the performance gap between aggressive 1-bit quantization and more conservative 4-bit methods, achieving near-original model quality with minimal storage requirements and excellent computational efficiency.

Abstract: Large Language Models (LLMs) have demonstrated remarkable performance across a wide range of Natural Language Processing (NLP) tasks, but require substantial memory and computational resources. Binary quantization, which compresses model weights from 16-bit Brain Float to 1-bit representations in {-1, 1}, offers significant reductions in storage and inference costs. However, such aggressive quantization often leads to notable performance degradation compared to more conservative 4-bit quantization methods. In this research, we propose a novel optimization objective tailored for binary quantization, along with three algorithms designed to realize it effectively. Our method enhances blocked quantization by dynamically identifying optimal unstructured sub-matrices through adaptive grouping strategies. Experimental results demonstrate that our approach achieves an average bit length of just 1.007 bits, while maintaining high model quality. Specifically, our quantized LLaMA 3.2 3B model attains a perplexity of 8.23, remarkably close to the original 7.81, and far surpasses the previous SOTA BiLLM, whose perplexity is 123.90. Furthermore, our method is competitive with SOTA 4-bit approaches such as GPTQ in both performance and efficiency. The compression process is highly efficient, requiring only 14 seconds to quantize the full LLaMA 3.2 3B weights on a single CPU core, with the entire process completing in under 100 minutes and exhibiting embarrassingly parallel properties. Code - https://github.com/johnnyzheng0636/WGM_bi_quan
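For context, the standard per-group binarization baseline that such methods build on is shown below; the paper's contribution (dynamically chosen unstructured sub-matrices) is not reproduced here. The scale alpha = mean(|W|) is the classical choice that minimizes the per-group L2 error for {-1, 1} codes.

```python
import numpy as np

def binarize_groups(W, group_size=128):
    """Per-group 1-bit quantization baseline (not the paper's dynamic grouping).

    Each group is encoded as sign(W) times a scalar alpha; alpha = mean(|W|)
    minimizes ||W - alpha * sign(W)||_2 within the group.
    """
    flat = W.reshape(-1)
    pad = (-len(flat)) % group_size
    flat = np.pad(flat, (0, pad))
    groups = flat.reshape(-1, group_size)
    alpha = np.abs(groups).mean(axis=1, keepdims=True)   # per-group scale
    W_hat = alpha * np.sign(groups)
    return W_hat.reshape(-1)[:W.size].reshape(W.shape), alpha
```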

[244] Discrete Functional Geometry of ReLU Networks via ReLU Transition Graphs

Sahil Rajesh Dhayalkar

Main category: cs.LG

TL;DR: Extends ReLU Transition Graph framework to model deep ReLU networks as graphs where nodes are linear activation regions and edges connect regions differing by single ReLU flips, revealing expansion properties, binomial degree distributions, and spectral characteristics that govern generalization.

DetailsMotivation: To develop a comprehensive graph-theoretic model for understanding deep ReLU networks by analyzing their discrete geometric structure and connecting structural properties to generalization behavior.

Method: Extends RTG framework with theoretical proofs of expansion properties and spectral characteristics at random initialization, plus empirical construction of RTGs for small networks to measure smoothness, connectivity, and validate theoretical predictions.

Result: Shows region entropy saturates under overparameterization, spectral gap correlates with generalization, and KL divergence across adjacent regions reflects functional smoothness. Provides new bounds on capacity and generalization.

Conclusion: Provides a unified framework for analyzing ReLU networks through discrete functional geometry, offering new tools to understand, diagnose, and improve generalization with structural insights governing network behavior.

Abstract: We extend the ReLU Transition Graph (RTG) framework into a comprehensive graph-theoretic model for understanding deep ReLU networks. In this model, each node represents a linear activation region, and edges connect regions that differ by a single ReLU activation flip, forming a discrete geometric structure over the network’s functional behavior. We prove that RTGs at random initialization exhibit strong expansion, binomial degree distributions, and spectral properties that tightly govern generalization. These structural insights enable new bounds on capacity via region entropy and on generalization via spectral gap and edge-wise KL divergence. Empirically, we construct RTGs for small networks, measure their smoothness and connectivity properties, and validate theoretical predictions. Our results show that region entropy saturates under overparameterization, spectral gap correlates with generalization, and KL divergence across adjacent regions reflects functional smoothness. This work provides a unified framework for analyzing ReLU networks through the lens of discrete functional geometry, offering new tools to understand, diagnose, and improve generalization.
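A toy construction of such a graph on sampled inputs is shown below (exhaustive region enumeration is infeasible in general); the network size and sampling scheme are our choices for illustration.

```python
import itertools
import numpy as np

# Tiny RTG sketch: enumerate activation patterns of a one-hidden-layer ReLU
# net on random inputs and connect patterns that differ in exactly one ReLU
# flip (Hamming distance 1). Illustrative only.
rng = np.random.default_rng(0)
W, b = rng.normal(size=(8, 2)), rng.normal(size=8)   # 8 hidden units, 2-D input
X = rng.normal(size=(2000, 2))

patterns = {tuple((x @ W.T + b > 0).astype(int)) for x in X}  # linear regions hit
edges = [(p, q) for p, q in itertools.combinations(patterns, 2)
         if sum(a != c for a, c in zip(p, q)) == 1]            # one-flip edges
print(len(patterns), "regions,", len(edges), "edges")
```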

[245] Loong: Synthesize Long Chain-of-Thoughts at Scale through Verifiers

Xingyue Huang, Rishabh, Gregor Franke, Ziyi Yang, Jiamu Bai, Weijie Bai, Jinhe Bi, Zifeng Ding, Yiqun Duan, Chengyu Fan, Wendong Fan, Xin Gao, Ruohao Guo, Yuan He, Zhuangzhuang He, Xianglong Hu, Neil Johnson, Bowen Li, Fangru Lin, Siyu Lin, Tong Liu, Yunpu Ma, Hao Shen, Hao Sun, Beibei Wang, Fangyijie Wang, Hao Wang, Haoran Wang, Yang Wang, Yifeng Wang, Zhaowei Wang, Ziyang Wang, Yifan Wu, Zikai Xiao, Chengxing Xie, Fan Yang, Junxiao Yang, Qianshuo Ye, Ziyu Ye, Guangtao Zeng, Yuwen Ebony Zhang, Zeyu Zhang, Zihao Zhu, Bernard Ghanem, Philip Torr, Guohao Li

Main category: cs.LG

TL;DR: Loong Project introduces an open-source framework for scalable synthetic data generation and verification across diverse reasoning domains, addressing the challenge of extending RLVR success beyond mathematics/programming.

DetailsMotivation: Extending RLVR success to other reasoning domains is challenging due to scarce verifiable datasets and high human supervision costs. Current methods work well for math/programming but lack broader applicability.

Method: Two-component framework: LoongBench (8,729 human-vetted examples across 12 domains with executable code) and LoongEnv (modular synthetic data generation environment with multiple prompting strategies for question-answer-code triples). Forms agent-environment loop for RL where LLM agents are rewarded for CoT solutions matching code-executed answers.

Result: Benchmarked on broad suite of LLMs to evaluate domain coverage and performance bottlenecks. Comprehensive analysis of synthetic data shows correctness, difficulty, and diversity metrics.

Conclusion: Loong Project provides scalable solution for synthetic data generation and verification across diverse reasoning domains, enabling RLVR applications beyond traditional math/programming domains through automated code-based verification.

Abstract: Recent advances in Large Language Models (LLMs) have shown that their reasoning capabilities can be significantly improved through Reinforcement Learning with Verifiable Reward (RLVR), particularly in domains like mathematics and programming, where ground-truth correctness can be automatically evaluated. However, extending this success to other reasoning-intensive domains remains challenging due to the scarcity of high-quality, verifiable datasets and the high cost of human supervision. In this work, we introduce the Loong Project: an open-source framework for scalable synthetic data generation and verification across a diverse range of reasoning-intensive domains. The framework consists of two key components: (1) LoongBench, a curated seed dataset containing 8,729 human-vetted examples across 12 domains (e.g., Advanced Mathematics, Chemistry, Logic), each paired with executable code and rich metadata; and (2) LoongEnv, a modular synthetic data generation environment that supports multiple prompting strategies to produce new question-answer-code triples. Together, these components form an agent-environment loop that enables reinforcement learning, where an LLM-based agent is rewarded for generating Chain-of-Thought (CoT) solutions that align with code-executed answers. Empirically, we benchmark LoongBench on a broad suite of both open-source and proprietary LLMs to evaluate domain coverage and reveal performance bottlenecks. In addition, we conduct a comprehensive analysis of synthetic data generated by LoongEnv, examining correctness, difficulty, and diversity. Code and documentation are available at https://github.com/camel-ai/loong.
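The verification signal at the heart of the agent-environment loop can be sketched as follows. This is our illustration, not the framework's API: we assume a hypothetical convention that each example's trusted seed code stores its result in a variable named answer.

```python
def verify_reward(cot_answer: str, code: str) -> float:
    """Code-based verification sketch (our simplification of the Loong loop):
    reward the agent when the final answer of its chain-of-thought matches
    the output of executing the example's human-vetted code.
    """
    scope = {}
    exec(code, scope)              # trusted, human-vetted seed code only
    expected = scope["answer"]     # hypothetical convention: code sets `answer`
    return 1.0 if str(expected).strip() == cot_answer.strip() else 0.0
```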

[246] LSAM: Asynchronous Distributed Training with Landscape-Smoothed Sharpness-Aware Minimization

Yunfei Teng, Sixin Zhang

Main category: cs.LG

TL;DR: LSAM optimizer combines SAM’s generalization benefits with efficient distributed training through asynchronous sampling, eliminating synchronization bottlenecks and improving large-batch performance.

DetailsMotivation: SAM improves generalization but suffers from inefficiency in distributed large-batch training due to synchronization requirements.

Method: Integrates SAM’s adversarial steps with asynchronous distributed sampling strategy to create a smoothed sharpness-aware loss landscape for optimization.

Result: Eliminates synchronization bottlenecks, accelerates large-batch convergence, and delivers higher final accuracy compared to data-parallel SAM.

Conclusion: LSAM preserves SAM’s generalization advantages while offering superior efficiency in distributed training environments.

Abstract: While Sharpness-Aware Minimization (SAM) improves generalization in deep neural networks by minimizing both loss and sharpness, it suffers from inefficiency in distributed large-batch training. We present Landscape-Smoothed SAM (LSAM), a novel optimizer that preserves SAM’s generalization advantages while offering superior efficiency. LSAM integrates SAM’s adversarial steps with an asynchronous distributed sampling strategy, producing a smoothed sharpness-aware loss landscape for optimization. This design eliminates synchronization bottlenecks, accelerates large-batch convergence, and delivers higher final accuracy compared to data-parallel SAM.
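For reference, the synchronous SAM step that LSAM builds on looks like the sketch below; LSAM's asynchronous distributed sampling is the paper's contribution and is not shown here.

```python
import torch

def sam_step(model, loss_fn, x, y, opt, rho=0.05):
    """One standard (synchronous) SAM step, the building block LSAM modifies
    (illustrative sketch)."""
    opt.zero_grad()
    loss_fn(model(x), y).backward()            # gradient at current weights
    with torch.no_grad():
        grads = [p.grad for p in model.parameters() if p.grad is not None]
        gnorm = torch.sqrt(sum((g ** 2).sum() for g in grads)) + 1e-12
        eps = {}
        for p in model.parameters():
            if p.grad is not None:
                eps[p] = rho * p.grad / gnorm  # ascend toward nearby sharpness
                p.add_(eps[p])
    opt.zero_grad()
    loss_fn(model(x), y).backward()            # gradient at the perturbed point
    with torch.no_grad():
        for p, e in eps.items():
            p.sub_(e)                          # restore the original weights
    opt.step()                                 # descend with the SAM gradient
```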

[247] A Neural Network Approach to Multi-radionuclide TDCR Beta Spectroscopy

Li Yi, Qian Yang

Main category: cs.LG

TL;DR: AI framework combining numerical spectral simulation and deep learning for automated, standard-free analysis of multiradionuclide mixtures using TDCR spectroscopy

DetailsMotivation: Overcome limitations of traditional TDCR spectroscopy including limited automation and reliance on mixture-specific standards that may not be readily available

Method: Geant4 simulations for beta spectra generation with statistically modeled detector response sampling, combined with tailored neural network architecture trained on various nuclei mix ratios and quenching scenarios

Result: High accuracy across tasks: activity proportions (MAE = 0.009), detection efficiencies (MAE = 0.002), and spectral reconstruction (SSIM = 0.9998)

Conclusion: AI-driven methodology shows significant potential for automated safety-compliant multiradionuclide analysis with robust generalization, real-time processing, and engineering feasibility, especially when reference materials are unavailable

Abstract: Liquid scintillation triple-to-doubly coincident ratio (TDCR) spectroscopy is widely adopted as a standard method for radionuclide quantification because of its inherent advantages such as high precision, self-calibrating capability, and independence from radioactive reference sources. However, multiradionuclide analysis via TDCR faces the challenges of limited automation and reliance on mixture-specific standards, which may not be easily available. Here, we present an Artificial Intelligence (AI) framework that combines numerical spectral simulation and deep learning for standard-free automated analysis. $\beta$ spectra for model training were generated using Geant4 simulations coupled with statistically modeled detector response sampling. A tailored neural network architecture, trained on this dataset covering various nuclide mix ratios and quenching scenarios, enables autonomous resolution of individual radionuclide activities and detection efficiencies through an end-to-end learning paradigm. The model delivers consistent high accuracy across tasks: activity proportions (mean absolute error = 0.009), detection efficiencies (mean absolute error = 0.002), and spectral reconstruction (Structural Similarity Index = 0.9998), validating its physical plausibility for quenched $\beta$ spectroscopy. This AI-driven methodology exhibits significant potential for automated safety-compliant multiradionuclide analysis with robust generalization, real-time processing capabilities, and engineering feasibility, particularly in scenarios where reference materials are unavailable or rapid field analysis is required.

[248] Rashomon in the Streets: Explanation Ambiguity in Scene Understanding

Helge Spieker, Jørn Eirik Betten, Arnaud Gotlieb, Nadjib Lazaar, Nassim Belmecheri

Main category: cs.LG

TL;DR: First empirical study quantifying the Rashomon effect in XAI for autonomous driving action prediction, showing significant explanation disagreement across different model types.

DetailsMotivation: XAI is crucial for safety-critical applications like autonomous driving, but the Rashomon effect challenges XAI reliability as multiple accurate models can provide divergent explanations for the same prediction.

Method: Used Qualitative Explainable Graphs (QXGs) as symbolic scene representation, trained Rashomon sets of two model classes: interpretable gradient boosting models and complex Graph Neural Networks (GNNs), then measured explanation agreement using feature attribution methods.

Result: Found significant explanation disagreement both within and between model classes, suggesting explanation ambiguity is an inherent property of the problem rather than just a modeling artifact.

Conclusion: The Rashomon effect poses a fundamental challenge to XAI reliability in autonomous driving, as explanation ambiguity appears to be intrinsic to the problem domain itself.

Abstract: Explainable AI (XAI) is essential for validating and trusting models in safety-critical applications like autonomous driving. However, the reliability of XAI is challenged by the Rashomon effect, where multiple, equally accurate models can offer divergent explanations for the same prediction. This paper provides the first empirical quantification of this effect for the task of action prediction in real-world driving scenes. Using Qualitative Explainable Graphs (QXGs) as a symbolic scene representation, we train Rashomon sets of two distinct model classes: interpretable, pair-based gradient boosting models and complex, graph-based Graph Neural Networks (GNNs). Using feature attribution methods, we measure the agreement of explanations both within and between these classes. Our results reveal significant explanation disagreement. Our findings suggest that explanation ambiguity is an inherent property of the problem, not just a modeling artifact.

[249] Systematic Evaluation of Attribution Methods: Eliminating Threshold Bias and Revealing Method-Dependent Performance Patterns

Serra Aksoy

Main category: cs.LG

TL;DR: Current attribution method evaluation suffers from threshold selection bias that can reverse method rankings. The paper introduces a threshold-free AUC-IoU framework that evaluates attribution quality across all thresholds, providing reliable method differentiation without evaluation artifacts.

DetailsMotivation: Existing attribution method evaluation protocols suffer from threshold selection bias where single-threshold binarization can alter method rankings by over 200 percentage points, undermining reliable conclusions about method performance.

Method: Proposes a threshold-free framework that computes Area Under the Curve for Intersection over Union (AUC-IoU) to capture attribution quality across the full threshold spectrum, eliminating the need for arbitrary threshold selection.

Result: Evaluation of seven attribution methods on dermatological imaging shows single-threshold metrics yield contradictory results, while threshold-free evaluation provides reliable differentiation. XRAI achieves 31% improvement over LIME and 204% over vanilla Integrated Gradients, with performance variations up to 269% across lesion scales.

Conclusion: The threshold-free framework establishes methodological standards that eliminate evaluation artifacts and enables evidence-based method selection, providing both theoretical insight into attribution behavior and practical guidance for robust comparison in medical imaging and beyond.

Abstract: Attribution methods explain neural network predictions by identifying influential input features, but their evaluation suffers from threshold selection bias that can reverse method rankings and undermine conclusions. Current protocols binarize attribution maps at single thresholds, where threshold choice alone can alter rankings by over 200 percentage points. We address this flaw with a threshold-free framework that computes Area Under the Curve for Intersection over Union (AUC-IoU), capturing attribution quality across the full threshold spectrum. Evaluating seven attribution methods on dermatological imaging, we show single-threshold metrics yield contradictory results, while threshold-free evaluation provides reliable differentiation. XRAI achieves 31% improvement over LIME and 204% over vanilla Integrated Gradients, with size-stratified analysis revealing performance variations up to 269% across lesion scales. These findings establish methodological standards that eliminate evaluation artifacts and enable evidence-based method selection. The threshold-free framework provides both theoretical insight into attribution behavior and practical guidance for robust comparison in medical imaging and beyond.
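The threshold-free metric is easy to state in code. Below is our reading of AUC-IoU: binarize the attribution map at a grid of thresholds, compute IoU against the ground-truth mask at each, and integrate over the threshold axis; the grid resolution is our choice.

```python
import numpy as np

def auc_iou(attribution, mask, thresholds=None):
    """Threshold-free attribution evaluation (our reading of AUC-IoU).

    attribution: float map; mask: boolean ground-truth region.
    Returns the area under the IoU-vs-threshold curve.
    """
    a = (attribution - attribution.min()) / (np.ptp(attribution) + 1e-12)
    if thresholds is None:
        thresholds = np.linspace(0.0, 1.0, 101)
    ious = []
    for t in thresholds:
        b = a >= t
        inter = np.logical_and(b, mask).sum()
        union = np.logical_or(b, mask).sum()
        ious.append(inter / union if union > 0 else 1.0)
    ious = np.asarray(ious)
    # trapezoidal integration, written out to stay NumPy-version agnostic
    return float(np.sum(0.5 * (ious[1:] + ious[:-1]) * np.diff(thresholds)))
```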

[250] Tabular foundation model for GEOAI benchmark problems BM/AirportSoilProperties/2/2025

Taiga Saito, Yu Otake, Stephen Wu

Main category: cs.LG

TL;DR: TabPFN transformer model applied to geotechnical site characterization, achieving superior accuracy and efficiency over traditional Bayesian methods in spatial prediction and missing data imputation tasks.

DetailsMotivation: To explore the application of foundation models for tabular data in geotechnical engineering, specifically for spatial property prediction and missing parameter imputation, addressing limitations of conventional hierarchical Bayesian models.

Method: Used TabPFN transformer-based foundation model in zero-training, few-shot, in-context learning setting without hyperparameter tuning, supplemented with context from big indirect database (BID).

Result: TabPFN outperformed hierarchical Bayesian model in both benchmark problems - superior accuracy in spatial su prediction with order-of-magnitude faster runtime, and lower RMSE for all target parameters in missing data imputation with well-quantified uncertainties.

Conclusion: First successful use of tabular foundation model in geotechnical modeling, suggesting potential paradigm shift in probabilistic site characterization through improved accuracy, efficiency, and uncertainty quantification.

Abstract: This paper presents a novel application of the Tabular Prior-Data Fitted Network (TabPFN) - a transformer-based foundation model for tabular data - to geotechnical site characterization problems defined in the GEOAI benchmark BM/AirportSoilProperties/2/2025. Two tasks are addressed: (1) predicting the spatial variation of undrained shear strength (su) across borehole depth profiles, and (2) imputing missing mechanical parameters in a dense-site dataset. We apply TabPFN in a zero-training, few-shot, in-context learning setting - without hyper-parameter tuning - and provide it with additional context from the big indirect database (BID). The study demonstrates that TabPFN, as a general-purpose foundation model, achieved superior accuracy and well-calibrated predictive distributions compared to a conventional hierarchical Bayesian model (HBM) baseline, while also offering significant gains in inference efficiency. In Benchmark Problem #1 (spatial su prediction), TabPFN outperformed the HBM in prediction accuracy and delivered an order-of-magnitude faster runtime. In Benchmark Problem #2 (missing mechanical parameter imputation), TabPFN likewise achieved lower RMSE for all target parameters with well-quantified uncertainties, though its cumulative computation cost was higher than HBM’s due to its one-variable-at-a-time inference. These results mark the first successful use of a tabular foundation model in geotechnical modeling, suggesting a potential paradigm shift in probabilistic site characterization.
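The zero-training, in-context workflow the paper relies on mirrors the public tabpfn package's scikit-learn-style interface. A generic usage sketch follows (on a stand-in dataset, not the geotechnical benchmark; constructor arguments vary between package releases).

```python
# pip install tabpfn
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from tabpfn import TabPFNClassifier

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

clf = TabPFNClassifier()            # no task-specific training or tuning
clf.fit(X_tr, y_tr)                 # "fit" stores the in-context examples
print(clf.predict_proba(X_te)[:3])  # predictive distributions for new rows
```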

[251] Exploring the Design Space of Fair Tree Learning Algorithms

Kiara Stempel, Mattia Cerrato, Stefan Kramer

Main category: cs.LG

TL;DR: This paper explores three design options for fair decision trees, focusing on two previously unstudied approaches that separate the modeling of target variable y and sensitive attribute s, rather than constraining them together.

DetailsMotivation: Current fair decision tree techniques primarily focus on constrained optimization that combines target variable y and sensitive attribute s in the objective function or constraints. The authors identify two unexplored design options in the tree learning space that could potentially offer better fairness-performance trade-offs.

Method: The paper introduces and experimentally characterizes two additional design options: (1) using constraints on s during tree construction with backtracking capability, and (2) building separate trees for y and s that don’t share structure, allowing independent information usage without mutual information constraints.

Result: The experimental evaluation on multiple datasets demonstrates the performance and fairness characteristics of these previously unexplored design options compared to existing approaches.

Conclusion: The paper expands the design space for fair decision trees by introducing and validating two novel approaches that separate target and sensitive attribute modeling, providing new avenues for achieving fairness in tree-based models.

Abstract: Decision trees have been studied extensively in the context of fairness, aiming to maximize prediction performance while ensuring non-discrimination against different groups. Techniques in this space usually focus on imposing constraints at training time, constraining the search space so that solutions which display unacceptable values of relevant metrics are not considered, discarded, or discouraged. If we assume one target variable y and one sensitive attribute s, the design space of tree learning algorithms can be spanned as follows: (i) One can have one tree T that is built using an objective function that is a function of y, s, and T. For instance, one can build a tree based on the weighted information gain regarding y (maximizing) and s (minimizing). (ii) The second option is to have one tree model T that uses an objective function in y and T and a constraint on s and T. Here, s is no longer part of the objective, but part of a constraint. This can be achieved greedily by aborting a further split as soon as the condition that optimizes the objective in y fails to satisfy the constraint on s. A simple way to explore other splits is to backtrack during tree construction once a fairness constraint is violated. (iii) The third option is to have two trees T_y and T_s, one for y and one for s, such that the tree structure for y and s does not have to be shared. In this way, information regarding y and regarding s can be used independently, without having to constrain the choices in tree construction by the mutual information between the two variables. Quite surprisingly, of the three options, only the first one and the greedy variant of the second have been studied in the literature so far. In this paper, we introduce the above two additional options from that design space and characterize them experimentally on multiple datasets.
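Option (i) from the abstract, the weighted objective, can be made concrete as a split criterion. In the sketch below, the gain-minus-penalty form and the weight lam are our assumptions about how such an objective would be scored at a candidate split.

```python
import numpy as np

def entropy(v):
    """Shannon entropy of a label vector (bits)."""
    if len(v) == 0:
        return 0.0
    _, counts = np.unique(v, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def fair_split_score(y, s, left_idx, lam=1.0):
    """Option (i) sketch: information gain on the target y minus a penalty
    for information gained about the sensitive attribute s."""
    def gain(v):
        n = len(v)
        nl = int(left_idx.sum())
        child = (nl / n) * entropy(v[left_idx]) \
              + ((n - nl) / n) * entropy(v[~left_idx])
        return entropy(v) - child
    return gain(y) - lam * gain(s)   # maximize gain in y, minimize gain in s
```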

[252] Autonomous Learning From Success and Failure: Goal-Conditioned Supervised Learning with Negative Feedback

Zeqiang Zhang, Fabian Wurzberger, Gerrit Schmid, Sebastian Gottwald, Daniel A. Braun

Main category: cs.LG

TL;DR: Proposes integrating contrastive learning into Goal-Conditioned Supervised Learning (GCSL) to learn from both successes and failures, overcoming biases and enabling better exploration.

DetailsMotivation: Address limitations of GCSL where agents only learn from self-generated experiences (reinforcing biases) and focus solely on successful outcomes (ignoring failures).

Method: Integrates contrastive learning principles into GCSL framework to enable learning from both successful and failed experiences through strategic goal relabelling.

Result: Empirical evaluations show the algorithm overcomes initial agent biases, enables more exploratory behavior, and identifies effective policies for superior performance.

Conclusion: The proposed contrastive learning integration with GCSL successfully addresses inherent limitations and improves performance across challenging environments.

Abstract: Reinforcement learning faces significant challenges when applied to tasks characterized by sparse reward structures. Although imitation learning, within the domain of supervised learning, offers faster convergence, it relies heavily on human-generated demonstrations. Recently, Goal-Conditioned Supervised Learning (GCSL) has emerged as a potential solution by enabling self-imitation learning for autonomous systems. By strategically relabelling goals, agents can derive policy insights from their own experiences. Despite the successes of this framework, it presents two notable limitations: (1) Learning exclusively from self-generated experiences can exacerbate the agents’ inherent biases; (2) The relabelling strategy allows agents to focus solely on successful outcomes, precluding them from learning from their mistakes. To address these issues, we propose a novel model that integrates contrastive learning principles into the GCSL framework to learn from both success and failure. Through empirical evaluations, we demonstrate that our algorithm overcomes limitations imposed by agents’ initial biases and thereby enables more exploratory behavior. This facilitates the identification and adoption of effective policies, leading to superior performance across a variety of challenging environments.
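The relabelling mechanism GCSL rests on is compact enough to sketch. Below is hindsight goal relabelling producing only "successes"; the paper's extension additionally keeps failed attempts (state, original goal, action) as negatives for its contrastive loss, which we only note in the comment.

```python
import random

def relabel_trajectory(traj, k_future=4):
    """Hindsight goal relabelling as in GCSL (illustrative sketch).

    traj: list of (state, action, achieved_goal). Returns (state, goal,
    action) tuples where each goal is an outcome actually achieved later,
    i.e. a success by construction. Failures (original goals never reached)
    would serve as contrastive negatives in the paper's extension.
    """
    data = []
    for t, (s, a, _) in enumerate(traj):
        future = traj[t:t + k_future]
        _, _, g = random.choice(future)   # any achieved future state as goal
        data.append((s, g, a))            # positive: taking a from s reached g
    return data
```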

[253] TeRA: Vector-based Random Tensor Network for High-Rank Adaptation of Large Language Models

Yuxuan Gu, Wuyang Zhou, Giorgos Iacovides, Danilo Mandic

Main category: cs.LG

TL;DR: TeRA is a novel PEFT method that achieves high-rank weight updates while maintaining parameter efficiency of vector-based methods through tensor network parameterization with frozen random factors and trainable scaling vectors.

DetailsMotivation: Address the trade-off between model expressivity (high-rank adapters) and parameter efficiency (vector-based methods) in PEFT for LLMs, enabling high-rank updates without sacrificing parameter efficiency.

Method: Parameterizes weight update matrix as Tucker-like tensor network with frozen randomly initialized large factors shared across layers, and only trains small layer-specific scaling vectors from diagonal factor matrices.

Result: TeRA matches or outperforms high-rank adapters while requiring trainable parameter count similar to vector-based methods, effectively decoupling rank from parameter count.

Conclusion: TeRA successfully bridges the gap between expressivity and efficiency in PEFT, providing a theoretically sound and empirically validated approach for high-rank adaptation with minimal trainable parameters.

Abstract: Parameter-Efficient Fine-Tuning (PEFT) methods, such as Low-Rank Adaptation (LoRA), have significantly reduced the number of trainable parameters needed in fine-tuning large language models (LLMs). Subsequent developments of LoRA-style adapters have diverged into two main directions: (1) enhancing model expressivity with high-rank adapters, and (2) pushing for further parameter reduction, as exemplified by vector-based methods. However, these approaches present a trade-off, as achieving the expressivity of high-rank weight updates typically comes at the cost of sacrificing the extreme parameter efficiency offered by vector-based techniques. To address this issue, we propose a vector-based random Tensor network for high-Rank Adaptation (TeRA), a novel PEFT method that achieves high-rank weight updates while retaining the parameter efficiency of vector-based PEFT adapters. This is achieved by parameterizing the tensorized weight update matrix as a Tucker-like tensor network (TN), in which large randomly initialized factors are frozen and shared across layers, while only small layer-specific scaling vectors, formed by entries in diagonal factor matrices, are trained. This design effectively decouples the rank of the weight update matrix from the number of trainable parameters. Comprehensive experiments demonstrate that TeRA matches or even outperforms high-rank adapters, while requiring a trainable parameter count similar to vector-based methods. Theoretical analysis and ablation studies further validate the effectiveness of our approach.
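The rank-decoupling idea can be conveyed with a 2D stand-in: delta_W = B diag(s) A with frozen random B, A and a trainable vector s. The paper uses a Tucker-like tensor network rather than this matrix form, so treat the sketch below as a simplification for intuition only.

```python
import torch
import torch.nn as nn

class TeRAStyleLinear(nn.Module):
    """Simplified TeRA-flavored adapter: delta_W = B @ diag(s) @ A, with
    frozen random factors B, A (which could be shared across layers) and
    only the scaling vector s trained. A 2D stand-in for the paper's
    Tucker-like tensor network, for illustration only.
    """
    def __init__(self, base: nn.Linear, r: int = 256):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)
        d_out, d_in = base.weight.shape
        self.register_buffer("B", torch.randn(d_out, r) / r ** 0.5)  # frozen
        self.register_buffer("A", torch.randn(r, d_in) / r ** 0.5)   # frozen
        self.s = nn.Parameter(torch.zeros(r))   # the only trainable parameters

    def forward(self, x):
        delta = self.B * self.s                 # scale columns of B by s
        return self.base(x) + x @ (delta @ self.A).T
```

Because B and A can be full rank and s merely rescales, the effective rank of the update is no longer tied to the trainable parameter count, which is the design point the abstract emphasizes.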

[254] Evaluation of Stress Detection as Time Series Events – A Novel Window-Based F1-Metric

Harald Vilhelm Skat-Rørdam, Sneha Das, Kathrine Sofie Rasmussen, Nicole Nadine Lønfeldt, Line Clemmensen

Main category: cs.LG

TL;DR: New window-based F1 metric (F1$_w$) with temporal tolerance provides more robust evaluation of event detection in time series, especially for physiological monitoring where ground truth annotations are single-point but underlying phenomena are gradual.

DetailsMotivation: Standard metrics like F1 and point-adjusted F1 misrepresent model performance in real-world imbalanced datasets where ground truth is annotated as single-point events despite gradual underlying phenomena.

Method: Introduce F1$_w$ metric that incorporates temporal tolerance through window-based evaluation, allowing assessment when exact temporal alignment is unrealistic. Tested on three physiological datasets (ADARP, Wrist Angel, ROAD) including in-the-wild and experimental settings.

Result: F1$_w$ reveals meaningful performance patterns invisible to conventional metrics. Using TimesFM predictions, only temporally tolerant metrics show statistically significant improvements over random and null baselines in in-the-wild use cases.

Conclusion: The choice of evaluation metric strongly influences model performance interpretation. F1$_w$ addresses key gaps in time series evaluation and provides practical guidance for healthcare applications with varying temporal precision requirements.

Abstract: Accurate evaluation of event detection in time series is essential for applications such as stress monitoring with wearable devices, where ground truth is typically annotated as single-point events, even though the underlying phenomena are gradual and temporally diffused. Standard metrics like F1 and point-adjusted F1 (F1$_{pa}$) often misrepresent model performance in such real-world, imbalanced datasets. We introduce a window-based F1 metric (F1$_w$) that incorporates temporal tolerance, enabling a more robust assessment of event detection when exact alignment is unrealistic. Empirical analysis on three physiological datasets, two in-the-wild (ADARP, Wrist Angel) and one experimental (ROAD), indicates that F1$_w$ reveals meaningful model performance patterns invisible to conventional metrics, while its window size can be adapted to domain knowledge to avoid overestimation. We show that the choice of evaluation metric strongly influences the interpretation of model performance: using predictions from TimesFM, only our temporally tolerant metrics reveal statistically significant improvements over random and null baselines in the two in-the-wild use cases. This work addresses key gaps in time series evaluation and provides practical guidance for healthcare applications where requirements for temporal precision vary by context.
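Our reading of such a window-based F1 is sketched below: a predicted event counts as a true positive if it lands within plus or minus w of a not-yet-matched ground-truth event, with greedy one-to-one matching; the matching strategy is our assumption.

```python
def f1_window(pred_times, true_times, w):
    """Window-based F1 (our reading of F1_w): a prediction matches an
    unmatched ground-truth event when it falls within +/- w of it."""
    pred, true = sorted(pred_times), sorted(true_times)
    matched = [False] * len(true)
    tp = 0
    for p in pred:
        for i, t in enumerate(true):
            if not matched[i] and abs(p - t) <= w:
                matched[i] = True
                tp += 1
                break
    fp, fn = len(pred) - tp, len(true) - tp
    prec = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    return 2 * prec * rec / (prec + rec) if prec + rec else 0.0
```

Setting w = 0 recovers strict point-wise F1, which is how the temporal tolerance generalizes the conventional metric.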

[255] Unsupervised Learning based Element Resource Allocation for Reconfigurable Intelligent Surfaces in mmWave Network

Pujitha Mamillapalli, Yoghitha Ramamoorthi, Abhinav Kumar, Tomoki Murakami, Tomoaki Ogawa, Yasushi Takatori

Main category: cs.LG

TL;DR: Proposes neural network approach for RIS element allocation and phase configuration to improve system throughput and reduce computational complexity compared to iterative optimization methods.

DetailsMotivation: Address the exponentially increasing computational complexity of conventional iterative optimization methods for RIS element allocation as the number of RIS elements grows, and overcome challenges in generating training labels for supervised learning.

Method: Five-layer fully connected neural network (FNN) combined with preprocessing technique to reduce input dimensionality, lower computational complexity, and enhance scalability for joint RIS phase configuration and resource allocation under alpha-fair scheduling framework.

Result: Proposed NN-based solution reduces computational overhead while improving system throughput by 6.8% compared to existing RIS element allocation schemes, achieving better performance with reduced complexity.

Conclusion: The neural network approach provides significantly more scalable solution than iterative optimization algorithms for RIS element allocation, enabling efficient utilization of RIS in wireless systems.

Abstract: The increasing demand for high data rates and seamless connectivity in wireless systems has sparked significant interest in reconfigurable intelligent surfaces (RIS) and artificial intelligence-based wireless applications. RIS typically comprises passive reflective antenna elements that control the wireless propagation environment by adequately tuning the phase of the reflective elements. The allocation of RIS elements to multiple user equipment (UEs) is crucial for efficiently utilizing RIS. In this work, we formulate a joint optimization problem that optimizes the RIS phase configuration and resource allocation under an $\alpha$-fair scheduling framework and propose an efficient way of allocating RIS elements. Conventional iterative optimization methods, however, suffer from exponentially increasing computational complexity as the number of RIS elements increases and also complicate the generation of training labels for supervised learning. To overcome these challenges, we propose a five-layer fully connected neural network (FNN) combined with a preprocessing technique to significantly reduce input dimensionality, lower computational complexity, and enhance scalability. The simulation results show that our proposed NN-based solution reduces computational overhead while significantly improving system throughput by 6.8% compared to existing RIS element allocation schemes. Furthermore, the proposed system achieves better performance while reducing computational complexity, making it significantly more scalable than the iterative optimization algorithms.

[256] FoMEMO: Towards Foundation Models for Expensive Multi-objective Optimization

Yiming Yao, Fei Liu, Liang Zhao, Xi Lin, Qingfu Zhang

Main category: cs.LG

TL;DR: FoMEMO is a foundation model approach for expensive multi-objective optimization that uses synthetic data pre-training to enable in-context optimization without rebuilding models or requiring domain-specific training.

DetailsMotivation: Address the limitations of existing methods that require rebuilding Gaussian process surrogates for each new problem or extensive domain-specific pre-training, making them impractical for real-world emerging applications.

Method: Pre-train a foundation model with hundreds of millions of synthetic data, then use in-context optimization based on predicted preference-wise aggregation posteriors without requiring subsequent model training or updates.

Result: Superior adaptability to unknown problems and competitive performance across various synthetic benchmarks and real-world applications compared to existing methods.

Conclusion: FoMEMO demonstrates superior generality and effectiveness for expensive multi-objective optimization problems by leveraging foundation models pre-trained on synthetic data, enabling practical application in diverse real-world scenarios.

Abstract: Expensive multi-objective optimization is a prevalent and crucial concern in many real-world scenarios, where sample-efficiency is vital due to the limited evaluations to recover the true Pareto front for decision making. Existing works either involve rebuilding Gaussian process surrogates from scratch for each objective in each new problem encountered, or rely on extensive past domain experiments for pre-training deep learning models, making them hard to generalize and impractical to cope with various emerging applications in the real world. To address this issue, we propose a new paradigm named FoMEMO (Foundation Models for Expensive Multi-objective Optimization), which enables the establishment of a foundation model conditioned on any domain trajectory and user preference, and facilitates fast in-context optimization based on the predicted preference-wise aggregation posteriors. Rather than accessing extensive domain experiments in the real world, we demonstrate that pre-training the foundation model with a diverse set of hundreds of millions of synthetic data can lead to superior adaptability to unknown problems, without necessitating any subsequent model training or updates in the optimization process. We evaluate our method across a variety of synthetic benchmarks and real-world applications, and demonstrate its superior generality and competitive performance compared to existing methods.

[257] TopoMap: A Feature-based Semantic Discriminator of the Topographical Regions in the Test Input Space

Gianmarco De Vita, Nargiz Humbatova, Paolo Tonella

Main category: cs.LG

TL;DR: TopoMap creates a topographical map of input feature space for DL testing by combining dimensionality reduction and clustering, with automated configuration selection using a DNN evaluator, achieving 35-61% improvement over random selection for mutation testing.

DetailsMotivation: Existing DL testing approaches focus on specific failure-inducing features while neglecting others in different feature space regions, making it difficult to comprehensively group inputs by failure-causing features.

Method: Applies dimensionality reduction to obtain input embeddings, then uses clustering to group inputs. Uses a DNN as evaluator to automatically select optimal embedding/clustering configurations that produce distinguishable clusters based on shared features.

Result: TopoMap generates maps with distinguishable and meaningful regions. Outperforms random selection by 35% on killable mutants and 61% on non-killable mutants in mutation analysis.

Conclusion: TopoMap provides an effective black-box, model-agnostic approach for creating topographical maps of input feature space, enabling better grouping of failure-inducing inputs and improving mutation testing effectiveness.

Abstract: Testing Deep Learning (DL)-based systems is an open challenge. Although it is relatively easy to find inputs that cause a DL model to misbehave, the grouping of inputs by features that make the DL model under test fail is largely unexplored. Existing approaches for DL testing introduce perturbations that may focus on specific failure-inducing features, while neglecting others that belong to different regions of the feature space. In this paper, we create an explicit topographical map of the input feature space. Our approach, named TopoMap, is both black-box and model-agnostic as it relies solely on features that characterise the input space. To discriminate the inputs according to the specific features they share, we first apply dimensionality reduction to obtain input embeddings, which are then subjected to clustering. Each DL model might require specific embedding computations and clustering algorithms to achieve a meaningful separation of inputs into discriminative groups. We propose a novel way to evaluate alternative configurations of embedding and clustering techniques. We used a deep neural network (DNN) as an approximation of a human evaluator who could tell whether a pair of clusters can be discriminated based on the features of the included elements. We use such a DNN to automatically select the optimal topographical map of the inputs among all those that are produced by different embedding/clustering configurations. The evaluation results show that the maps generated by TopoMap consist of distinguishable and meaningful regions. In addition, we evaluate the effectiveness of TopoMap using mutation analysis. In particular, we assess whether the clusters in our topographical map allow for an effective selection of mutation-killing inputs. Experimental results show that our approach outperforms random selection by 35% on average on killable mutants; by 61% on non-killable ones.
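A TopoMap-style pipeline, embed, cluster, and score alternative configurations, can be sketched in a few lines. The paper selects configurations with a trained DNN discriminator; the silhouette score below is our stand-in purely for demonstration, and the dataset and grid are our choices.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.metrics import silhouette_score

# Embed inputs, cluster them, and compare embedding/clustering configurations.
X, _ = load_digits(return_X_y=True)

best = None
for n_components in (8, 16, 32):
    Z = PCA(n_components=n_components, random_state=0).fit_transform(X)
    for k in (5, 10, 20):
        labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(Z)
        score = silhouette_score(Z, labels)          # stand-in for the DNN judge
        if best is None or score > best[0]:
            best = (score, n_components, k)

print("chosen map:", best)   # (score, embedding dims, number of regions)
```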

[258] Structure Transfer: an Inference-Based Calculus for the Transformation of Representations

Daniel Raggi, Gem Stapleton, Mateja Jamnik, Aaron Stockdill, Grecia Garcia Garcia, Peter C-H. Cheng

Main category: cs.LG

TL;DR: A novel calculus called structure transfer enables representation transformation across diverse representational systems while ensuring specified relations like semantic equivalence.

DetailsMotivation: To solve the fundamental problem of devising representational-system agnostic techniques for driving representation transformation and choice, enabling effective communication and reasoning across different representation systems.

Method: Structure transfer calculus uses schemas that encode knowledge about representational systems to preserve information across relations between any pair of systems. It builds on Representational Systems Theory and construction spaces to model diverse RS types including formal languages, geometric figures, diagrams, and informal notations.

Result: The method provides a system-agnostic calculus that can generate target representations from source representations while ensuring desired relations hold, enabling representation transformation across a wide range of practical settings.

Conclusion: Structure transfer offers a general approach for representation transformation that works across diverse representational systems, addressing the fundamental challenge of representation choice and enabling effective cross-system communication and reasoning.

Abstract: Representation choice is of fundamental importance to our ability to communicate and reason effectively. A major unsolved problem, addressed in this paper, is how to devise representational-system (RS) agnostic techniques that drive representation transformation and choice. We present a novel calculus, called structure transfer, that enables representation transformation across diverse RSs. Specifically, given a source representation drawn from a source RS, the rules of structure transfer allow us to generate a target representation for a target RS. The generality of structure transfer comes in part from its ability to ensure that the source representation and the generated target representation satisfy any specified relation (such as semantic equivalence). This is done by exploiting schemas, which encode knowledge about RSs. Specifically, schemas can express preservation of information across relations between any pair of RSs, and this knowledge is used by structure transfer to derive a structure for the target representation which ensures that the desired relation holds. We formalise this using Representational Systems Theory [raggi2022rst], building on the key concept of a construction space. The abstract nature of construction spaces grants them the generality to model RSs of diverse kinds, including formal languages, geometric figures and diagrams, as well as informal notations. Consequently, structure transfer is a system-agnostic calculus that can be used to identify alternative representations in a wide range of practical settings.

[259] HyPV-LEAD: Proactive Early-Warning of Cryptocurrency Anomalies through Data-Driven Structural-Temporal Modeling

Minjung Park, Gyuyeon Na, Soyoun Kim, Sunyoung Moon, HyeonJeong Cha, Sangmi Chai

Main category: cs.LG

TL;DR: HyPV-LEAD is a novel early-warning framework for cryptocurrency anomaly detection that integrates window-horizon modeling, Peak-Valley sampling, and hyperbolic embedding to provide actionable lead-time alerts and overcome class imbalance challenges.

DetailsMotivation: Existing cryptocurrency anomaly detection methods are predominantly model-centric and post hoc, only flagging anomalies after they occur, offering limited preventive value. Abnormal transactions like mixing services, fraudulent transfers, and pump-and-dump operations pose escalating risks but are difficult to detect due to class imbalance, temporal volatility, and complex network dependencies.

Method: HyPV-LEAD integrates three innovations: (1) window-horizon modeling to guarantee actionable lead-time alerts, (2) Peak-Valley (PV) sampling to mitigate class imbalance while preserving temporal continuity, and (3) hyperbolic embedding to capture hierarchical and scale-free properties of blockchain transaction networks.

Result: Empirical evaluation on large-scale Bitcoin transaction data shows HyPV-LEAD consistently outperforms state-of-the-art baselines, achieving a PR-AUC of 0.9624 with significant gains in precision and recall. Ablation studies confirm each component provides complementary benefits.

Conclusion: By shifting anomaly detection from reactive classification to proactive early-warning, HyPV-LEAD establishes a robust foundation for real-time risk management, AML compliance, and financial security in dynamic blockchain environments.

Abstract: Abnormal cryptocurrency transactions, such as mixing services, fraudulent transfers, and pump-and-dump operations, pose escalating risks to financial integrity but remain notoriously difficult to detect due to class imbalance, temporal volatility, and complex network dependencies. Existing approaches are predominantly model-centric and post hoc, flagging anomalies only after they occur and thus offering limited preventive value. This paper introduces HyPV-LEAD (Hyperbolic Peak-Valley Lead-time Enabled Anomaly Detection), a data-driven early-warning framework that explicitly incorporates lead time into anomaly detection. Unlike prior methods, HyPV-LEAD integrates three innovations: (1) window-horizon modeling to guarantee actionable lead-time alerts, (2) Peak-Valley (PV) sampling to mitigate class imbalance while preserving temporal continuity, and (3) hyperbolic embedding to capture the hierarchical and scale-free properties of blockchain transaction networks. Empirical evaluation on large-scale Bitcoin transaction data demonstrates that HyPV-LEAD consistently outperforms state-of-the-art baselines, achieving a PR-AUC of 0.9624 with significant gains in precision and recall. Ablation studies further confirm that each component - PV sampling, hyperbolic embedding, and structural-temporal modeling - provides complementary benefits, with the full framework delivering the highest performance. By shifting anomaly detection from reactive classification to proactive early-warning, HyPV-LEAD establishes a robust foundation for real-time risk management, anti-money laundering (AML) compliance, and financial security in dynamic blockchain environments.
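The window-horizon supervision that guarantees lead time can be sketched directly; the sample construction below is our reading, with parameter names of our own choosing.

```python
import numpy as np

def window_horizon_samples(series, labels, window=64, lead=12, horizon=24):
    """Window-horizon supervision (our reading of HyPV-LEAD's setup):
    features come from a past window, and the target is whether an anomaly
    occurs in a future horizon starting `lead` steps ahead, so a positive
    prediction always carries at least `lead` steps of warning."""
    X, y = [], []
    for t in range(window, len(series) - lead - horizon):
        X.append(series[t - window:t])
        y.append(int(labels[t + lead:t + lead + horizon].any()))
    return np.array(X), np.array(y)
```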

[260] Estudio de la eficiencia en la escalabilidad de GPUs para el entrenamiento de Inteligencia Artificial (A Study of Efficiency in GPU Scalability for Artificial Intelligence Training)

David Cortes, Carlos Juiz, Belen Bermejo

Main category: cs.LG

TL;DR: Analysis of MLPerf Training v4.1 shows optimal GPU configurations exist that balance performance and efficiency for large-scale model training.

DetailsMotivation: Training large-scale deep learning models is challenging, and while GPU usage speeds up training, it negatively impacts efficiency. The paper aims to find configurations that optimize the trade-off between performance, GPU usage, and efficiency.

Method: Detailed analysis of times reported by MLPerf Training v4.1 on four workloads: BERT, Llama2 LoRA, RetinaNet, and Stable Diffusion to identify optimal configurations.

Result: The study identifies configurations that optimize the relationship between performance, GPU usage, and efficiency, including a break-even point at which training time is reduced while efficiency is maximized.

Conclusion: There are specific GPU configurations that can significantly improve training efficiency without compromising performance, providing practical guidance for large-scale model training optimization.

Abstract: Training large-scale deep learning models has become a key challenge for the scientific community and industry. While the massive use of GPUs can significantly speed up training times, this approach has a negative impact on efficiency. In this article, we present a detailed analysis of the times reported by MLPerf Training v4.1 on four workloads: BERT, Llama2 LoRA, RetinaNet, and Stable Diffusion, showing that there are configurations that optimise the relationship between performance, GPU usage, and efficiency. The results point to a break-even point that allows training times to be reduced while maximising efficiency.

[261] Meta-Imputation Balanced (MIB): An Ensemble Approach for Handling Missing Data in Biomedical Machine Learning

Fatemeh Azad, Zoran Bosnić, Matjaž Kukar

Main category: cs.LG

TL;DR: A novel Meta-Imputation approach that combines multiple base imputers to improve missing data prediction accuracy across diverse datasets.

DetailsMotivation: Missing data is a fundamental challenge in machine learning, particularly in bioinformatics and clinical applications where datasets are frequently incomplete. Existing imputation methods perform inconsistently across different datasets and missingness mechanisms.

Method: Proposes the Meta-Imputation Balanced (MIB) approach, which learns to combine the outputs of multiple base imputers. MIB is trained on synthetically masked data with known ground truth to predict the most suitable imputed value based on each method’s behavior.
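
A minimal sketch of the meta-imputation idea, assuming a linear stacking combiner and off-the-shelf scikit-learn base imputers (the paper's actual combiner and imputer set may differ):

```python
import numpy as np
from sklearn.impute import KNNImputer, SimpleImputer
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 8))                        # complete toy data
mask = rng.random(X.shape) < 0.2                     # synthetic missingness
X_missing = np.where(mask, np.nan, X)

# Step 1: each base imputer proposes values for the masked entries.
imputers = [SimpleImputer(strategy="mean"),
            SimpleImputer(strategy="median"),
            KNNImputer(n_neighbors=5)]
candidates = [imp.fit_transform(X_missing) for imp in imputers]

# Step 2: on masked cells, where ground truth is known, learn to combine them.
idx = np.where(mask)
meta_X = np.column_stack([c[idx] for c in candidates])   # one column per imputer
meta_y = X[idx]                                          # true values
combiner = LinearRegression().fit(meta_X, meta_y)

X_imputed = X_missing.copy()
X_imputed[idx] = combiner.predict(meta_X)
```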

Result: The method demonstrates improved accuracy in predicting missing values by leveraging ensemble learning principles.

Conclusion: Highlights the potential of ensemble learning in imputation and enables more robust, modular, and interpretable preprocessing pipelines for real-world machine learning systems.

Abstract: Missing data represents a fundamental challenge in machine learning applications, often reducing model performance and reliability. This problem is particularly acute in fields like bioinformatics and clinical machine learning, where datasets are frequently incomplete due to the nature of both data generation and data collection. While numerous imputation methods exist, from simple statistical techniques to advanced deep learning models, no single method consistently performs well across diverse datasets and missingness mechanisms. This paper proposes a novel Meta-Imputation approach that learns to combine the outputs of multiple base imputers to predict missing values more accurately. By training the proposed method called Meta-Imputation Balanced (MIB) on synthetically masked data with known ground truth, the system learns to predict the most suitable imputed value based on the behavior of each method. Our work highlights the potential of ensemble learning in imputation and paves the way for more robust, modular, and interpretable preprocessing pipelines in real-world machine learning systems.

[262] EvolveSignal: A Large Language Model Powered Coding Agent for Discovering Traffic Signal Control Algorithms

Leizhen Wang, Peibo Duan, Hao Wang, Yue Wang, Jian Xu, Nan Zheng, Zhenliang Ma

Main category: cs.LG

TL;DR: EvolveSignal uses LLMs and evolutionary search to automatically generate traffic signal control algorithms that outperform traditional Webster’s method, reducing delays by 20.1% and stops by 47.1%.

DetailsMotivation: Fixed-time traffic signal control relies on manual engineering and hand-crafted formulas that are labor-intensive and suboptimal under heterogeneous or congested conditions.

Method: Formulates traffic signal control as program synthesis using Python functions, optimizing candidate algorithms through external traffic simulator evaluations and evolutionary search powered by large language models.
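
The abstract notes that candidates are Python functions with fixed input-output structure; below is a hedged sketch of what a candidate and the surrounding loop might look like (the signature and selection scheme are assumptions, and the LLM-driven code mutation is only indicated in a comment):

```python
import random

def webster_like(flows, cycle=90, lost_time=10):
    """One candidate signal-timing plan: split effective green time in
    proportion to approach demand. The signature is assumed for illustration."""
    effective = cycle - lost_time
    total = sum(flows) or 1
    return [effective * f / total for f in flows]

def evolve(population, simulate, generations=50):
    """Skeleton of the evolutionary loop. `simulate` scores a candidate, e.g.
    average delay from an external traffic simulator (lower is better)."""
    for _ in range(generations):
        scored = sorted(population, key=simulate)
        parents = scored[:len(scored) // 2]
        # In EvolveSignal an LLM proposes code edits to the parents here;
        # this sketch only shows the selection step.
        population = parents + random.choices(parents, k=len(scored) - len(parents))
    return min(population, key=simulate)
```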

Result: Discovered algorithms significantly outperform Webster’s baseline with 20.1% reduction in average delay and 47.1% reduction in average stops, while providing practical insights for traffic engineers.

Conclusion: This work bridges program synthesis with transportation engineering, opening a new AI-powered research direction for automated algorithm design in traffic signal control.

Abstract: In traffic engineering, the fixed-time traffic signal control remains widely used for its low cost, stability, and interpretability. However, its design depends on hand-crafted formulas (e.g., Webster) and manual re-timing by engineers to adapt to demand changes, which is labor-intensive and often yields suboptimal results under heterogeneous or congested conditions. This paper introduces EvolveSignal, a large language model (LLM)-powered coding agent to automatically discover new traffic signal control algorithms. We formulate the problem as program synthesis, where candidate algorithms are represented as Python functions with fixed input-output structures, and iteratively optimized through external evaluations (e.g., a traffic simulator) and evolutionary search. Experiments on a signalized intersection demonstrate that the discovered algorithms outperform Webster’s baseline, reducing average delay by 20.1% and average stops by 47.1%. Beyond performance, ablation and incremental analyses reveal that EvolveSignal modifications - such as adjusting cycle length bounds, incorporating right-turn demand, and rescaling green allocations - can offer practically meaningful insights for traffic engineers. This work opens a new research direction by leveraging AI for algorithm design in traffic signal control, bridging program synthesis with transportation engineering.

[263] Equivariant Flow Matching for Symmetry-Breaking Bifurcation Problems

Fleur Hendriks, Ondřej Rokoš, Martin Doškář, Marc G. D. Geers, Vlado Menkovski

Main category: cs.LG

TL;DR: Flow matching framework for modeling multimodal distributions in bifurcation systems, preserving symmetries through equivariant modeling and outperforming traditional methods.

DetailsMotivation: Deterministic machine learning models fail to capture multiple coexisting stable solutions in nonlinear dynamical systems with symmetry breaking, averaging over solutions and missing lower-symmetry outcomes.

Method: Generative framework based on flow matching with symmetric matching strategy that aligns predicted and target outputs under group actions for equivariant modeling of probability distributions over bifurcation outcomes.
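
One way to read the symmetric matching strategy (a notational sketch; $G$, $v_\theta$, and $u_t$ are assumed symbols, not taken from the paper): the regression target is aligned with the prediction over the symmetry group before the flow-matching loss is computed,

$$\mathcal{L}(\theta) = \mathbb{E}_{t,\,x_0,\,x_1}\left[\min_{g \in G}\,\big\lVert v_\theta(x_t, t) - u_t(x_t \mid g \cdot x_1) \big\rVert^2\right],$$

so a prediction is never penalized for producing a symmetry-equivalent copy of the target.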

Result: Validated on systems from toy models to complex physical problems (buckling beams, Allen-Cahn equation), showing that it significantly outperforms non-probabilistic and variational methods in capturing multimodal distributions and symmetry-breaking bifurcations.

Conclusion: Flow matching provides a principled and scalable solution for modeling multistability in high-dimensional systems, enabling direct sampling of multiple valid solutions while preserving system symmetries.

Abstract: Bifurcation phenomena in nonlinear dynamical systems often lead to multiple coexisting stable solutions, particularly in the presence of symmetry breaking. Deterministic machine learning models struggle to capture this multiplicity, averaging over solutions and failing to represent lower-symmetry outcomes. In this work, we propose a generative framework based on flow matching to model the full probability distribution over bifurcation outcomes. Our method enables direct sampling of multiple valid solutions while preserving system symmetries through equivariant modeling. We introduce a symmetric matching strategy that aligns predicted and target outputs under group actions, allowing accurate learning in equivariant settings. We validate our approach on a range of systems, from toy models to complex physical problems such as buckling beams and the Allen-Cahn equation. Our results demonstrate that flow matching significantly outperforms non-probabilistic and variational methods in capturing multimodal distributions and symmetry-breaking bifurcations, offering a principled and scalable solution for modeling multistability in high-dimensional systems.

[264] On the MIA Vulnerability Gap Between Private GANs and Diffusion Models

Ilana Sebag, Jean-Yves Franceschi, Alain Rakotomamonjy, Alexandre Allauzen, Jamal Atif

Main category: cs.LG

TL;DR: GANs show better privacy protection against membership inference attacks compared to diffusion models under differential privacy, due to their lower sensitivity to data perturbations.

DetailsMotivation: To understand and compare the privacy risks of differentially private GANs and diffusion models against membership inference attacks, as their sensitivity to such attacks remains poorly understood despite both being trainable under differential privacy.

Method: Conducted a unified theoretical analysis using stability-based methods to show GANs’ lower sensitivity to data perturbations, followed by comprehensive empirical evaluation using standardized MIA pipeline across various datasets and privacy budgets.
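
For readers unfamiliar with MIAs, standardized pipelines of this kind build on attacks like the classic loss-threshold test (a generic textbook construction, not the paper's specific pipeline):

```python
import numpy as np

def loss_threshold_mia(member_losses, nonmember_losses, threshold):
    """Loss-threshold membership inference: samples whose loss falls below
    `threshold` are predicted to have been in the training set."""
    tpr = float((member_losses < threshold).mean())     # members caught
    fpr = float((nonmember_losses < threshold).mean())  # non-members misflagged
    return tpr, fpr

tpr, fpr = loss_threshold_mia(np.array([0.10, 0.30, 0.20]),
                              np.array([0.80, 0.40, 0.90]), threshold=0.35)
```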

Result: GANs consistently demonstrated significantly better privacy robustness against membership inference attacks compared to diffusion models, even under strong differential privacy regimes.

Conclusion: The choice of generative model type (GANs vs diffusion models) critically impacts privacy leakage, with GANs exhibiting structural advantages for privacy protection in differentially private settings.

Abstract: Generative Adversarial Networks (GANs) and diffusion models have emerged as leading approaches for high-quality image synthesis. While both can be trained under differential privacy (DP) to protect sensitive data, their sensitivity to membership inference attacks (MIAs), a key threat to data confidentiality, remains poorly understood. In this work, we present the first unified theoretical and empirical analysis of the privacy risks faced by differentially private generative models. We begin by showing, through a stability-based analysis, that GANs exhibit fundamentally lower sensitivity to data perturbations than diffusion models, suggesting a structural advantage in resisting MIAs. We then validate this insight with a comprehensive empirical study using a standardized MIA pipeline to evaluate privacy leakage across datasets and privacy budgets. Our results consistently reveal a marked privacy robustness gap in favor of GANs, even in strong DP regimes, highlighting that model type alone can critically shape privacy leakage.

[265] epiGPTope: A machine learning-based epitope generator and classifier

Natalia Flechas Manrique, Alberto Martínez, Elena López-Martínez, Luc Andrea, Román Orus, Aitor Manteca, Aitziber L. Cortajarena, Llorenç Espinosa-Portalés

Main category: cs.LG

TL;DR: epiGPTope is a large language model that generates novel epitope-like sequences and classifies them by bacterial/viral origin to accelerate epitope discovery and library design.

DetailsMotivation: Rational design of synthetic epitope libraries is challenging due to the enormous combinatorial sequence space (20^n for n amino acids), making experimental screening infeasible even with high-throughput techniques.

Method: Developed epiGPTope, a large language model pre-trained on protein data and fine-tuned on linear epitopes, which directly generates novel epitope-like sequences. Also trained statistical classifiers to predict bacterial vs viral origin of epitopes.
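
In the simplest case, the classification component could be a k-mer model over primary sequences; a hedged sketch with toy data (the paper's actual classifier and features are not specified here):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy epitope sequences and origin labels -- purely illustrative.
seqs = ["ACDEFGHIK", "LMNPQRSTV", "GHIKLMNPQ", "ACDELMNPQ"]
labels = ["bacterial", "viral", "bacterial", "viral"]

# Character k-mers of the primary sequence stand in for whatever statistical
# features the paper's classifiers actually use.
clf = make_pipeline(
    CountVectorizer(analyzer="char", ngram_range=(2, 3)),
    LogisticRegression(max_iter=1000),
)
clf.fit(seqs, labels)
print(clf.predict(["ACDEFGHIN"]))
```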

Result: The model generates sequences with statistical properties analogous to known epitopes. The combination of generative and predictive models can narrow candidate libraries and increase likelihood of identifying specific epitopes.

Conclusion: This approach using only primary amino acid sequences (no geometric framework or hand-crafted features) enables faster, more cost-effective generation and screening of synthetic epitopes for biotechnology applications.

Abstract: Epitopes are short antigenic peptide sequences which are recognized by antibodies or immune cell receptors. These are central to the development of immunotherapies, vaccines, and diagnostics. However, the rational design of synthetic epitope libraries is challenging due to the large combinatorial sequence space, $20^n$ combinations for linear epitopes of n amino acids, making screening and testing unfeasible, even with high throughput experimental techniques. In this study, we present a large language model, epiGPTope, pre-trained on protein data and specifically fine-tuned on linear epitopes, which for the first time can directly generate novel epitope-like sequences, which are found to possess statistical properties analogous to the ones of known epitopes. This generative approach can be used to prepare libraries of epitope candidate sequences. We further train statistical classifiers to predict whether an epitope sequence is of bacterial or viral origin, thus narrowing the candidate library and increasing the likelihood of identifying specific epitopes. We propose that such combination of generative and predictive models can be of assistance in epitope discovery. The approach uses only primary amino acid sequences of linear epitopes, bypassing the need for a geometric framework or hand-crafted features of the sequences. By developing a method to create biologically feasible sequences, we anticipate faster and more cost-effective generation and screening of synthetic epitopes, with relevant applications in the development of new biotechnologies.

[266] Fair Resource Allocation for Fleet Intelligence

Oguzhan Baser, Kaan Kale, Po-han Li, Sandeep Chinchali

Main category: cs.LG

TL;DR: Fair-Synergy is an open-source algorithmic framework for fair resource allocation in cloud-assisted multi-agent intelligence, outperforming benchmarks by up to 25% in inference and 11% in learning settings.

DetailsMotivation: Traditional resource allocation methods overlook agents' diverse computational capabilities and complex environments, leading to inefficient and unfair distribution in multi-agent systems.

Method: Utilizes concave relationship between agents’ accuracy and system resources, extends traditional approaches to multidimensional ML utility landscape considering model parameters, training data volume, and task complexity.
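
A toy rendering of fair allocation under concave utilities, assuming a log utility and a max-min (Rawlsian) notion of fairness; Fair-Synergy's actual multidimensional objective is richer:

```python
import numpy as np

def fair_allocate(budget, capability, steps=1000):
    """Repeatedly hand a small slice of resource to the agent with the lowest
    current utility u_i(r) = capability_i * log(1 + r)."""
    r = np.zeros(len(capability))
    slice_ = budget / steps
    for _ in range(steps):
        utility = capability * np.log1p(r)
        r[np.argmin(utility)] += slice_
    return r

print(fair_allocate(10.0, np.array([1.0, 0.5, 2.0])))  # weaker agents get more
```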

Result: Outperforms standard benchmarks by up to 25% in multi-agent inference and 11% in multi-agent learning settings across various models (BERT, VGG16, MobileNet, ResNets) and datasets (MNIST, CIFAR-10, CIFAR-100, BDD, GLUE).

Conclusion: Provides insights into how fairness levels affect different agent types (least advantaged, most advantaged, average) and enables equitable fleet intelligence through optimized resource allocation.

Abstract: Resource allocation is crucial for the performance optimization of cloud-assisted multi-agent intelligence. Traditional methods often overlook agents’ diverse computational capabilities and complex operating environments, leading to inefficient and unfair resource distribution. To address this, we open-sourced Fair-Synergy, an algorithmic framework that utilizes the concave relationship between the agents’ accuracy and the system resources to ensure fair resource allocation across fleet intelligence. We extend traditional allocation approaches to encompass a multidimensional machine learning utility landscape defined by model parameters, training data volume, and task complexity. We evaluate Fair-Synergy with advanced vision and language models such as BERT, VGG16, MobileNet, and ResNets on datasets including MNIST, CIFAR-10, CIFAR-100, BDD, and GLUE. We demonstrate that Fair-Synergy outperforms standard benchmarks by up to 25% in multi-agent inference and 11% in multi-agent learning settings. Also, we explore how the level of fairness affects the least advantaged, most advantaged, and average agents, providing insights for equitable fleet intelligence.

[267] Some patterns of sleep quality and Daylight Saving Time across countries: a predictive and exploratory analysis

Bhanu Sharma, Eugene Pinsky

Main category: cs.LG

TL;DR: DST’s impact on sleep varies by latitude - DST countries sleep longer at higher latitudes but shorter at lower latitudes compared to non-DST countries.

DetailsMotivation: To examine how Daylight Saving Time practices affect sleep durations across different countries and geographical locations.

Method: Analyzed sleep data from 61 countries, grouped by DST observance, used statistical correlation analysis and visualizations to compare sleep patterns between DST and non-DST regions.
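
The grouping described could be reproduced with a few lines of pandas over a hypothetical schema (column names and latitude bands below are assumptions):

```python
import pandas as pd

df = pd.DataFrame({
    "country":      ["A", "B", "C", "D"],
    "observes_dst": [True, False, True, False],
    "latitude":     [55.0, 10.0, 15.0, 60.0],
    "sleep_hours":  [7.4, 7.1, 6.8, 7.0],
})

df["lat_band"] = pd.cut(df["latitude"].abs(), bins=[0, 30, 90],
                        labels=["low", "high"])
print(df.groupby(["lat_band", "observes_dst"], observed=True)["sleep_hours"].mean())
```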

Result: Countries observing DST generally report longer sleep durations, but latitude moderates this effect: DST countries at lower latitudes have shorter sleep than non-DST countries, while at higher latitudes DST countries have longer sleep.

Conclusion: The influence of DST on sleep is moderated by geographical location, with latitude playing a key role in how DST affects sleep patterns.

Abstract: In this study we analyzed average sleep durations across 61 countries to examine the impact of Daylight Saving Time (DST) practices. Key metrics influencing sleep were identified, and statistical correlation analysis was applied to explore relationships among these factors. Countries were grouped based on DST observance, and visualizations compared sleep patterns between DST and non-DST regions. Results show that, on average, countries observing DST tend to report longer sleep durations than those that do not. A more detailed pattern emerged when accounting for latitude: at lower latitudes, DST-observing countries reported shorter sleep durations compared to non-DST countries, while at higher latitudes, DST-observing countries reported longer average sleep durations. These findings suggest that the influence of DST on sleep may be moderated by geographical location.

[268] Beyond Correctness: Harmonizing Process and Outcome Rewards through RL Training

Chenlu Ye, Zhou Yu, Ziji Zhang, Hao Chen, Narayanan Sadagopan, Jing Huang, Tong Zhang, Anurag Beniwal

Main category: cs.LG

TL;DR: PROF is a method that combines process and outcome rewards through consistency filtering to improve mathematical reasoning in RL, achieving over 4% accuracy gains.

DetailsMotivation: Existing reward models in reinforcement learning for mathematical reasoning are either too coarse-grained (ORMs) or noisy and hackable (PRMs), limiting reasoning quality improvement.

Method: PROF uses process consistency filtering to harmonize PRM and ORM rewards by selecting samples where correct answers have high process scores and incorrect answers have low process scores, maintaining training balance.
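
A minimal sketch of consistency filtering, assuming each sample carries an outcome-correctness flag and per-step PRM scores, and simplifying the paper's relative selection to a fixed threshold:

```python
def prof_filter(samples, tau=0.5):
    """Keep correct answers whose mean process score is high and incorrect
    answers whose mean process score is low, then rebalance the two groups.
    `tau` and the sample layout are assumptions for illustration."""
    def mean_score(s):
        return sum(s["process_scores"]) / len(s["process_scores"])

    pos = [s for s in samples if s["correct"] and mean_score(s) >= tau]
    neg = [s for s in samples if not s["correct"] and mean_score(s) < tau]
    k = min(len(pos), len(neg))       # keep positives and negatives balanced
    return pos[:k] + neg[:k]
```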

Result: Extensive experiments show PROF consistently improves final accuracy by over 4% compared to blending approaches and enhances intermediate reasoning step quality.

Conclusion: PROF effectively resolves the dilemma between coarse outcome rewards and noisy process rewards through consistency-driven sample selection, advancing mathematical reasoning capabilities.

Abstract: Reinforcement learning with verifiable rewards (RLVR) has emerged as a predominant paradigm for mathematical reasoning tasks, offering stable improvements in reasoning ability. However, Outcome Reward Models (ORMs) in RLVR are too coarse-grained to distinguish flawed reasoning within correct answers or valid reasoning within incorrect answers. This lack of granularity introduces noisy and misleading gradients significantly and hinders further progress in reasoning process quality. While Process Reward Models (PRMs) offer fine-grained guidance for intermediate steps, they frequently suffer from inaccuracies and are susceptible to reward hacking. To resolve this dilemma, we introduce PRocess cOnsistency Filter (PROF), an effective data process curation method that harmonizes noisy, fine-grained process rewards with accurate, coarse-grained outcome rewards. Rather than naively blending PRM and ORM in the objective function (arXiv:2506.18896), PROF leverages their complementary strengths through consistency-driven sample selection. Our approach retains correct responses with higher averaged process values and incorrect responses with lower averaged process values, while maintaining positive/negative training sample balance. Extensive experiments demonstrate that our method not only consistently improves the final accuracy over 4% compared to the blending approaches, but also strengthens the quality of intermediate reasoning steps. Codes and training recipes are available at https://github.com/Chenluye99/PROF.

[269] The distribution of calibrated likelihood functions on the probability-likelihood Aitchison simplex

Paul-Gauthier Noé, Andreas Nautsch, Driss Matrouf, Pierre-Michel Bousquet, Jean-François Bonastre

Main category: cs.LG

TL;DR: This paper extends calibration concepts from binary classification (log-likelihood ratios) to multi-class scenarios using Aitchison geometry and isometric-log-ratio transformations, providing theoretical foundations and a practical machine learning application.

DetailsMotivation: While calibration of probabilistic predictions has been widely studied, calibration of likelihood functions has been limited to binary cases with only two hypotheses. The paper aims to extend these concepts to cases with more than two hypotheses.

Method: The authors use Aitchison geometry of the simplex to extend binary calibration concepts to multiple hypotheses. They recover the additive form of Bayes’ rule in vector form and extend LLR and weight-of-evidence concepts using isometric-log-ratio transformed likelihood functions.
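
A compact way to state the vector-form Bayes' rule the summary refers to (standard compositional-data notation; the paper's exact basis convention may differ): on the simplex, Bayes' rule is Aitchison perturbation $\oplus$, and the isometric-log-ratio (ilr) map turns it into ordinary vector addition,

$$(\pi \oplus L)_i = \frac{\pi_i L_i}{\sum_j \pi_j L_j}, \qquad \mathrm{clr}(p)_i = \ln p_i - \frac{1}{K}\sum_{j=1}^{K} \ln p_j, \qquad \mathrm{ilr}(p) = V^\top \mathrm{clr}(p),$$

$$\mathrm{ilr}\big(p(\cdot \mid x)\big) = \mathrm{ilr}(\pi) + \mathrm{ilr}(L_x),$$

where $V$ is an orthonormal basis of the clr hyperplane and $L_x$ is the likelihood function normalized to the simplex; calibration and idempotence constraints are then stated on the distribution of $\mathrm{ilr}(L_x)$.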

Result: The paper successfully extends the definition of calibration, idempotence, and distribution constraints on likelihood functions to multiple hypotheses. They provide a non-linear discriminant analysis application where discriminant components form calibrated likelihood functions.

Conclusion: This work provides a conceptual framework for extending likelihood function calibration from binary to multi-class scenarios, improving interpretability and reliability of methods through the use of Aitchison geometry and isometric-log-ratio transformations.

Abstract: While calibration of probabilistic predictions has been widely studied, this paper rather addresses calibration of likelihood functions. This has been discussed, especially in biometrics, in cases with only two exhaustive and mutually exclusive hypotheses (classes) where likelihood functions can be written as log-likelihood-ratios (LLRs). After defining calibration for LLRs and its connection with the concept of weight-of-evidence, we present the idempotence property and its associated constraint on the distribution of the LLRs. Although these results have been known for decades, they have been limited to the binary case. Here, we extend them to cases with more than two hypotheses by using the Aitchison geometry of the simplex, which allows us to recover, in a vector form, the additive form of the Bayes’ rule; extending therefore the LLR and the weight-of-evidence to any number of hypotheses. Especially, we extend the definition of calibration, the idempotence, and the constraint on the distribution of likelihood functions to this multiple hypotheses and multiclass counterpart of the LLR: the isometric-log-ratio transformed likelihood function. This work is mainly conceptual, but we still provide one application to machine learning by presenting a non-linear discriminant analysis where the discriminant components form a calibrated likelihood function over the classes, improving therefore the interpretability and the reliability of the method.

[270] DPQuant: Efficient and Differentially-Private Model Training via Dynamic Quantization Scheduling

Yubo Gao, Renbo Tu, Gennady Pekhimenko, Nandita Vijaykumar

Main category: cs.LG

TL;DR: DP-SGD combined with quantization causes significant accuracy degradation due to noise amplification. QPQuant introduces dynamic quantization with probabilistic layer sampling and loss-aware prioritization to reduce quantization variance while preserving privacy.

DetailsMotivation: Quantization reduces training costs but causes disproportionately high accuracy degradation in differentially private SGD due to noise injection amplifying quantization variance.

Method: QPQuant framework with dynamic quantization: (1) probabilistic sampling of layers rotated each epoch, (2) loss-aware layer prioritization using differentially private loss sensitivity estimator to identify layers with minimal quantization impact.
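
A sketch of the per-epoch layer-selection step, assuming selection probabilities inversely proportional to a (differentially private) loss-sensitivity estimate; the paper's exact scoring rule is not reproduced here:

```python
import numpy as np

def choose_layers_to_quantize(sensitivity, k, rng):
    """Sample k layers to quantize this epoch, favoring layers whose estimated
    loss sensitivity is low; sampling keeps the subset rotating across epochs."""
    inv = 1.0 / (np.asarray(sensitivity) + 1e-8)
    probs = inv / inv.sum()
    return rng.choice(len(sensitivity), size=k, replace=False, p=probs)

rng = np.random.default_rng(0)
print(choose_layers_to_quantize([0.1, 2.0, 0.05, 0.7], k=2, rng=rng))
```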

Result: Outperforms static quantization baselines on ResNet18, ResNet50, DenseNet121 across datasets, achieving near Pareto-optimal accuracy-compute trade-offs, 2.21x theoretical throughput improvements on low-precision hardware with <2% accuracy drop.

Conclusion: Dynamic quantization with adaptive layer selection effectively mitigates accuracy degradation in DP-SGD while maintaining privacy guarantees and significantly improving computational efficiency.

Abstract: Differentially-Private SGD (DP-SGD) is a powerful technique to protect user privacy when using sensitive data to train neural networks. During training, converting model weights and activations into low-precision formats, i.e., quantization, can drastically reduce training times, energy consumption, and cost, and is thus a widely used technique. In this work, we demonstrate that quantization causes significantly higher accuracy degradation in DP-SGD compared to regular SGD. We observe that this is caused by noise injection in DP-SGD, which amplifies quantization variance, leading to disproportionately large accuracy degradation. To address this challenge, we present QPQuant, a dynamic quantization framework that adaptively selects a changing subset of layers to quantize at each epoch. Our method combines two key ideas that effectively reduce quantization variance: (i) probabilistic sampling of the layers that rotates which layers are quantized every epoch, and (ii) loss-aware layer prioritization, which uses a differentially private loss sensitivity estimator to identify layers that can be quantized with minimal impact on model quality. This estimator consumes a negligible fraction of the overall privacy budget, preserving DP guarantees. Empirical evaluations on ResNet18, ResNet50, and DenseNet121 across a range of datasets demonstrate that DPQuant consistently outperforms static quantization baselines, achieving near Pareto-optimal accuracy-compute trade-offs and up to 2.21x theoretical throughput improvements on low-precision hardware, with less than 2% drop in validation accuracy.

[271] Cluster and then Embed: A Modular Approach for Visualization

Elizabeth Coda, Ery Arias-Castro, Gal Mishne

Main category: cs.LG

TL;DR: A modular approach for dimensionality reduction that first clusters data, then embeds each cluster separately, and finally aligns them to preserve both local and global structure better than t-SNE/UMAP.

DetailsMotivation: t-SNE and UMAP tend to distort global geometry while preserving local structure, creating well-separated clusters but inaccurate global relationships.

Method: Three-step approach: 1) Cluster the data, 2) Embed each cluster separately, 3) Align the clusters to obtain a global embedding.
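
The three-step pipeline in miniature, using KMeans, per-cluster PCA, and a deliberately crude alignment that re-centers each local embedding at a global projection of its cluster mean (the paper's embedding and alignment choices may differ):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, size=(100, 10)) for c in (0.0, 5.0, 10.0)])

labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)  # step 1
global_pca = PCA(n_components=2).fit(X)

embedding = np.zeros((len(X), 2))
for c in np.unique(labels):
    members = labels == c
    local = PCA(n_components=2).fit_transform(X[members])                # step 2
    center = global_pca.transform(X[members].mean(0, keepdims=True))
    embedding[members] = local + center                                  # step 3
```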

Result: Competitive performance with existing methods on synthetic and real-world datasets while providing more transparency.

Conclusion: The proposed modular approach offers better transparency and preserves both local and global structure compared to traditional dimensionality reduction methods.

Abstract: Dimensionality reduction methods such as t-SNE and UMAP are popular methods for visualizing data with a potential (latent) clustered structure. They are known to group data points at the same time as they embed them, resulting in visualizations with well-separated clusters that preserve local information well. However, t-SNE and UMAP also tend to distort the global geometry of the underlying data. We propose a more transparent, modular approach consisting of first clustering the data, then embedding each cluster, and finally aligning the clusters to obtain a global embedding. We demonstrate this approach on several synthetic and real-world datasets and show that it is competitive with existing methods, while being much more transparent.

[272] Robult: Leveraging Redundancy and Modality Specific Features for Robust Multimodal Learning

Duy A. Nguyen, Abhi Kamboj, Minh N. Do

Main category: cs.LG

TL;DR: Robult is a scalable multimodal learning framework that addresses missing modalities and limited labeled data through information-theoretic optimization of PU contrastive loss and latent reconstruction, achieving superior performance in semi-supervised and missing modality scenarios.

DetailsMotivation: To overcome challenges of missing modalities and limited labeled data in multimodal learning, which are critical for developing robust real-world applications.

Method: Proposes Robult framework with two core components: (1) soft Positive-Unlabeled contrastive loss for task-relevant feature alignment using limited labeled data, and (2) latent reconstruction loss to preserve modality-specific information. Uses modular design for resilience to incomplete modalities.
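
A very loose sketch of how the two objectives might compose (every name here is an assumption; the actual soft PU contrastive loss and architecture are in the paper):

```python
import torch
import torch.nn.functional as F

def robult_style_loss(z_shared, z_specific, x, decoder, pu_loss, lam=0.5):
    """PU contrastive term on shared features, plus a reconstruction term that
    forces the modality-specific latent to retain unique information."""
    recon = decoder(torch.cat([z_shared, z_specific], dim=-1))
    return pu_loss(z_shared) + lam * F.mse_loss(recon, x)
```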

Result: Superior performance over existing approaches across diverse datasets in both semi-supervised learning and missing modality contexts. Lightweight design enables scalability and easy integration with existing architectures.

Conclusion: Robult effectively addresses key multimodal learning challenges through its novel information-theoretic approach, making it suitable for real-world applications with incomplete data and limited supervision.

Abstract: Addressing missing modalities and limited labeled data is crucial for advancing robust multimodal learning. We propose Robult, a scalable framework designed to mitigate these challenges by preserving modality-specific information and leveraging redundancy through a novel information-theoretic approach. Robult optimizes two core objectives: (1) a soft Positive-Unlabeled (PU) contrastive loss that maximizes task-relevant feature alignment while effectively utilizing limited labeled data in semi-supervised settings, and (2) a latent reconstruction loss that ensures unique modality-specific information is retained. These strategies, embedded within a modular design, enhance performance across various downstream tasks and ensure resilience to incomplete modalities during inference. Experimental results across diverse datasets validate that Robult achieves superior performance over existing approaches in both semi-supervised learning and missing modality contexts. Furthermore, its lightweight design promotes scalability and seamless integration with existing architectures, making it suitable for real-world multimodal applications.

[273] Exploring a Graph-based Approach to Offline Reinforcement Learning for Sepsis Treatment

Taisiya Khakharova, Lucas Sakizloglou, Leen Lambers

Main category: cs.LG

TL;DR: Graph-based approach using Graph Neural Networks for sepsis treatment decision support, showing promise but highlighting representation learning complexity.

DetailsMotivation: Sepsis treatment requires precise fluid and vasopressor dosing. While RL methods show promise, they rely on relational data. Graph representation may better capture complex healthcare data relationships.

Method: Modeled MIMIC-III patient data as time-evolving heterogeneous graph. Used GraphSAGE and GATv2 architectures for state representation learning, decoupled from policy learning. Representations trained with next-state prediction decoders, then used with dBCQ algorithm for policy learning.
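
A skeletal PyTorch Geometric encoder in this spirit, assuming a homogeneous graph and illustrative dimensions (the study uses a heterogeneous, time-evolving graph and also evaluates GATv2):

```python
import torch
from torch_geometric.nn import SAGEConv

class StateEncoder(torch.nn.Module):
    """GraphSAGE encoder paired with a next-state prediction head, so that
    representation learning stays decoupled from policy learning."""
    def __init__(self, in_dim=32, hidden=64, state_dim=16):
        super().__init__()
        self.conv1 = SAGEConv(in_dim, hidden)
        self.conv2 = SAGEConv(hidden, state_dim)
        self.decoder = torch.nn.Linear(state_dim, state_dim)  # next-state head

    def forward(self, x, edge_index):
        h = self.conv1(x, edge_index).relu()
        z = self.conv2(h, edge_index)       # latent patient-state representation
        return z, self.decoder(z)           # (state, predicted next state)
```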

Result: Experimental evaluation confirmed potential of graph-based approach for sepsis treatment decision support.

Conclusion: Graph-based methods show promise for sepsis treatment optimization but representation learning in this domain remains complex and challenging.

Abstract: Sepsis is a serious, life-threatening condition. When treating sepsis, it is challenging to determine the correct amount of intravenous fluids and vasopressors for a given patient. While automated reinforcement learning (RL)-based methods have been used to support these decisions with promising results, previous studies have relied on relational data. Given the complexity of modern healthcare data, representing data as a graph may provide a more natural and effective approach. This study models patient data from the well-known MIMIC-III dataset as a heterogeneous graph that evolves over time. Subsequently, we explore two Graph Neural Network architectures - GraphSAGE and GATv2 - for learning patient state representations, adopting the approach of decoupling representation learning from policy learning. The encoders are trained to produce latent state representations, jointly with decoders that predict the next patient state. These representations are then used for policy learning with the dBCQ algorithm. The results of our experimental evaluation confirm the potential of a graph-based approach, while highlighting the complexity of representation learning in this domain.

[274] SafeProtein: Red-Teaming Framework and Benchmark for Protein Foundation Models

Jigang Fan, Zhenghong Zhou, Ruofan Jin, Le Cong, Mengdi Wang, Zaixi Zhang

Main category: cs.LG

TL;DR: SafeProtein is the first red-teaming framework for protein foundation models that uses multimodal prompt engineering and beam search to systematically test for biological safety risks, achieving up to 70% attack success rate.

DetailsMotivation: The rapid advancement of protein foundation models lacks systematic security testing, raising concerns about potential misuse for generating biologically hazardous proteins.

Method: Combines multimodal prompt engineering and heuristic beam search to design red-teaming methods, with a manually curated benchmark dataset (SafeProtein-Bench) and comprehensive evaluation protocol.

Result: Achieved continuous jailbreaks on state-of-the-art protein foundation models with up to 70% attack success rate for ESM3, revealing significant biological safety risks.

Conclusion: Reveals critical security vulnerabilities in current protein foundation models and provides insights for developing robust security protection technologies for frontier AI models in biology.

Abstract: Proteins play crucial roles in almost all biological processes. The advancement of deep learning has greatly accelerated the development of protein foundation models, leading to significant successes in protein understanding and design. However, the lack of systematic red-teaming for these models has raised serious concerns about their potential misuse, such as generating proteins with biological safety risks. This paper introduces SafeProtein, the first red-teaming framework designed for protein foundation models to the best of our knowledge. SafeProtein combines multimodal prompt engineering and heuristic beam search to systematically design red-teaming methods and conduct tests on protein foundation models. We also curated SafeProtein-Bench, which includes a manually constructed red-teaming benchmark dataset and a comprehensive evaluation protocol. SafeProtein achieved continuous jailbreaks on state-of-the-art protein foundation models (up to 70% attack success rate for ESM3), revealing potential biological safety risks in current protein foundation models and providing insights for the development of robust security protection technologies for frontier models. The codes will be made publicly available at https://github.com/jigang-fan/SafeProtein.

[275] Initialization Schemes for Kolmogorov-Arnold Networks: An Empirical Study

Spyros Rigas, Dhruv Verma, Georgios Alexandridis, Yixuan Wang

Main category: cs.LG

TL;DR: This paper studies initialization strategies for Kolmogorov-Arnold Networks (KANs), proposing theory-driven and empirical power-law approaches that outperform baseline methods, with power-law initialization achieving the best overall performance.

DetailsMotivation: While KANs offer enhanced flexibility and interpretability by replacing fixed nonlinearities with trainable activation functions, their initialization strategies remain largely unexplored, creating a research gap in optimizing their performance.

Method: The authors propose two theory-driven initialization schemes inspired by LeCun and Glorot methods, plus an empirical power-law family with tunable exponents. They evaluate these through large-scale grid searches on function fitting and PDE benchmarks, Neural Tangent Kernel analysis of training dynamics, and Feynman dataset evaluations.
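
A sketch of the power-law family, assuming the tunable exponent acts on a fan-in scale (the paper's exact parameterization is not reproduced; note that $\alpha = 0.5$ recovers a LeCun-like $1/\sqrt{n}$ scale):

```python
import numpy as np

def power_law_scale(fan_in, fan_out, alpha=1.0, mode="fan_in"):
    """Standard deviation sigma = n^(-alpha) with a tunable exponent alpha."""
    n = fan_in if mode == "fan_in" else fan_out
    return n ** (-alpha)

W = np.random.normal(0.0, power_law_scale(128, 64, alpha=0.75), size=(64, 128))
```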

Result: The Glorot-inspired initialization significantly outperforms baseline in parameter-rich models, while power-law initialization achieves the strongest overall performance across tasks and varying architecture sizes.

Conclusion: Proper initialization is crucial for KAN performance, with power-law initialization emerging as the most effective strategy, providing valuable insights for optimizing this novel neural architecture.

Abstract: Kolmogorov-Arnold Networks (KANs) are a recently introduced neural architecture that replace fixed nonlinearities with trainable activation functions, offering enhanced flexibility and interpretability. While KANs have been applied successfully across scientific and machine learning tasks, their initialization strategies remain largely unexplored. In this work, we study initialization schemes for spline-based KANs, proposing two theory-driven approaches inspired by LeCun and Glorot, as well as an empirical power-law family with tunable exponents. Our evaluation combines large-scale grid searches on function fitting and forward PDE benchmarks, an analysis of training dynamics through the lens of the Neural Tangent Kernel, and evaluations on a subset of the Feynman dataset. Our findings indicate that the Glorot-inspired initialization significantly outperforms the baseline in parameter-rich models, while power-law initialization achieves the strongest performance overall, both across tasks and for architectures of varying size. All code and data accompanying this manuscript are publicly available at https://github.com/srigas/KAN_Initialization_Schemes.

[276] On Entropy Control in LLM-RL Algorithms

Han Shen

Main category: cs.LG

TL;DR: AEnt is a new entropy control method for LLM-RL that addresses issues with conventional entropy regularization in large language model reinforcement learning by using clamped entropy bonus with automatic coefficient adjustment.

DetailsMotivation: Conventional entropy regularization, effective in traditional RL settings, shows weak to no gains in LLM-RL due to the extremely large response space and sparsity of optimal outputs in language models.

Method: Proposes AEnt method with clamped entropy bonus evaluated on re-normalized policy over smaller token space, plus automatic entropy coefficient adjustment to control entropy-induced bias while maintaining exploration benefits.
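
A hedged sketch of a clamped entropy bonus over a re-normalized top-k policy; k, the cap, and the coefficient update below are illustrative stand-ins, not AEnt's actual rule:

```python
import torch

def clamped_topk_entropy(logits, k=20, cap=2.0):
    """Entropy of the policy re-normalized over its top-k tokens, clamped."""
    topk = logits.topk(k, dim=-1).values
    p = torch.softmax(topk, dim=-1)        # re-normalized over the smaller set
    entropy = -(p * p.log()).sum(-1)
    return entropy.clamp(max=cap)

logits = torch.randn(4, 50_000)            # (batch, vocab)
bonus = clamped_topk_entropy(logits)
coef = 0.01 * (1.0 - bonus.mean() / 2.0).clamp(min=0.0)  # toy auto-adjustment
```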

Result: AEnt consistently outperforms baselines across multiple math-reasoning tasks with different base models and datasets.

Conclusion: The proposed AEnt method effectively addresses entropy control challenges in LLM-RL settings and demonstrates superior performance compared to conventional entropy regularization approaches.

Abstract: For RL algorithms, appropriate entropy control is crucial to their effectiveness. To control the policy entropy, a commonly used method is entropy regularization, which is adopted in various popular RL algorithms including PPO, SAC and A3C. Although entropy regularization proves effective in robotic and games RL conventionally, studies found that it gives weak to no gains in LLM-RL training. In this work, we study the issues of entropy bonus in LLM-RL setting. Specifically, we first argue that the conventional entropy regularization suffers from the LLM’s extremely large response space and the sparsity of the optimal outputs. As a remedy, we propose AEnt, an entropy control method that utilizes a new clamped entropy bonus with an automatically adjusted coefficient. The clamped entropy is evaluated with the re-normalized policy defined on certain smaller token space, which encourages exploration within a more compact response set. In addition, the algorithm automatically adjusts entropy coefficient according to the clamped entropy value, effectively controlling the entropy-induced bias while leveraging the entropy’s benefits. AEnt is tested in math-reasoning tasks under different base models and datasets, and it is observed that AEnt outperforms the baselines consistently across multiple benchmarks.

[277] LINKER: Learning Interactions Between Functional Groups and Residues With Chemical Knowledge-Enhanced Reasoning and Explainability

Phuc Pham, Viet Thanh Duy Nguyen, Truong-Son Hy

Main category: cs.LG

TL;DR: LINKER is a sequence-based model that predicts protein-ligand residue-functional group interactions using only protein sequences and ligand SMILES, trained with structure-supervised attention to identify biologically meaningful interaction types rather than spatial proximity.

DetailsMotivation: Existing deep learning approaches for protein-ligand interpretability require 3D structural input or use distance-based contact labels, limiting their applicability and biological relevance when structural data is unavailable.

Method: LINKER uses protein sequences and ligand SMILES as input, trained with structure-supervised attention where interaction labels are derived from 3D protein-ligand complexes via functional group-based motif extraction. It abstracts ligand structures into functional groups to focus on chemically meaningful substructures.

Result: Experiments on the LP-PDBBind benchmark show that structure-informed supervision over functional group abstractions yields interaction predictions closely aligned with ground-truth biochemical annotations.

Conclusion: LINKER enables accurate prediction of biologically defined protein-ligand interaction types using only sequence-level input, making it applicable for large-scale analysis where structural data is unavailable.

Abstract: Accurate identification of interactions between protein residues and ligand functional groups is essential to understand molecular recognition and guide rational drug design. Existing deep learning approaches for protein-ligand interpretability often rely on 3D structural input or use distance-based contact labels, limiting both their applicability and biological relevance. We introduce LINKER, the first sequence-based model to predict residue-functional group interactions in terms of biologically defined interaction types, using only protein sequences and the ligand SMILES as input. LINKER is trained with structure-supervised attention, where interaction labels are derived from 3D protein-ligand complexes via functional group-based motif extraction. By abstracting ligand structures into functional groups, the model focuses on chemically meaningful substructures while predicting interaction types rather than mere spatial proximity. Crucially, LINKER requires only sequence-level input at inference time, enabling large-scale application in settings where structural data is unavailable. Experiments on the LP-PDBBind benchmark demonstrate that structure-informed supervision over functional group abstractions yields interaction predictions closely aligned with ground-truth biochemical annotations.

[278] Graph neural networks for learning liquid simulations in dynamic scenes containing kinematic objects

Niteesh Midlagajni, Constantin A. Rothkopf

Main category: cs.LG

TL;DR: A GNN-based framework for simulating liquid dynamics with rigid body interactions, using BVH algorithm for collision handling, enabling generalization to unseen objects and manipulation tasks.

DetailsMotivation: Existing data-driven approaches for fluid simulation are limited to static environments or simple manipulations, lacking capability for complex interactions with dynamically moving kinematic rigid bodies.

Method: GNN-based framework with particles as graph nodes, using surface representations with BVH algorithm for particle-object collision handling to model complex interactions between liquids and intricate surface geometries.

Result: The model accurately captures fluid behavior in dynamic settings, generalizes to unseen objects and novel manipulation tasks (stirring, scooping), and can solve control tasks using gradient-based optimization.

Conclusion: The proposed framework effectively learns liquid dynamics under rigid body interactions and enables generalization to various manipulation tasks, providing a powerful tool for fluid simulation in complex environments.

Abstract: Simulating particle dynamics with high fidelity is crucial for solving real-world interaction and control tasks involving liquids in design, graphics, and robotics. Recently, data-driven approaches, particularly those based on graph neural networks (GNNs), have shown progress in tackling such problems. However, these approaches are often limited to learning fluid behavior in static free-fall environments or simple manipulation settings involving primitive objects, often overlooking complex interactions with dynamically moving kinematic rigid bodies. Here, we propose a GNN-based framework designed from the ground up to learn the dynamics of liquids under rigid body interactions and active manipulations, where particles are represented as graph nodes and particle-object collisions are handled using surface representations with the bounding volume hierarchy (BVH) algorithm. This approach enables the network to model complex interactions between liquid particles and intricate surface geometries. Our model accurately captures fluid behavior in dynamic settings and can also function as a simulator in static free-fall environments. Despite being trained on a single-object manipulation task of pouring, our model generalizes effectively to environments with unseen objects and novel manipulation tasks such as stirring and scooping. Finally, we show that the learned dynamics can be leveraged to solve control and manipulation tasks using gradient-based optimization methods.

[279] Geometric Foundations of Tuning without Forgetting in Neural ODEs

Erkan Bayram, Mohamed-Ali Belabbas, Tamer Başar

Main category: cs.LG

TL;DR: The paper provides theoretical foundation for Tuning without Forgetting (TwF) method, proving that the parameter subspace forms a Banach submanifold and characterizing its tangent space, showing exact mapping preservation beyond first-order approximation.

DetailsMotivation: To establish rigorous mathematical foundations for the TwF method introduced in previous work, moving beyond first-order approximation to prove exact mapping preservation during sequential training of neural ODEs.

Method: Mathematical analysis proving that the parameter subspace of control functions forms a Banach submanifold of finite codimension under nonsingular controls, with characterization of its tangent space.

Result: Demonstrated that TwF corresponds to continuation/deformation along the tangent space of this Banach submanifold, providing exact mapping preservation (not forgetting) during sequential training.

Conclusion: The theoretical analysis validates TwF as an exact method for preserving learned mappings in sequential training of neural ODEs, establishing it as a mathematically sound approach beyond approximate first-order methods.

Abstract: In our earlier work, we introduced the principle of Tuning without Forgetting (TwF) for sequential training of neural ODEs, where training samples are added iteratively and parameters are updated within the subspace of control functions that preserves the end-point mapping at previously learned samples on the manifold of output labels in the first-order approximation sense. In this letter, we prove that this parameter subspace forms a Banach submanifold of finite codimension under nonsingular controls, and we characterize its tangent space. This reveals that TwF corresponds to a continuation/deformation of the control function along the tangent space of this Banach submanifold, providing a theoretical foundation for its mapping-preserving (not forgetting) during the sequential training exactly, beyond first-order approximation.

[280] Warming Up for Zeroth-Order Federated Pre-Training with Low Resource Clients

Gwen Legate, Irina Rish, Eugene Belilovsky

Main category: cs.LG

TL;DR: ZOWarmUp - a federated zeroth-order optimizer that enables low-resource edge devices to participate in training from random initialization, improving data diversity and reducing system bias.

DetailsMotivation: Federated learning excludes low-memory/communication edge devices, creating system bias and limiting data access. Zeroth-order methods like MeZO are only used for fine-tuning due to high variance.

Method: Developed ZOWarmUp with variance reduction techniques and client capability awareness to enable zeroth-order training from random initialization, using random seeds instead of full gradients for minimal communication.
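
The seed trick that makes the up-link cheap, in MeZO style (a generic sketch; ZOWarmUp's variance-reduction and warm-up specifics are not shown):

```python
import torch

@torch.no_grad()
def zo_step(params, loss_fn, eps=1e-3, lr=1e-4, seed=0):
    """Zeroth-order step: the same seed regenerates the random perturbation z,
    so a client only needs to communicate (seed, projected_grad) instead of a
    full gradient."""
    def perturb(scale):
        torch.manual_seed(seed)
        for p in params:
            p.add_(torch.randn_like(p), alpha=scale)

    perturb(+eps)
    loss_plus = float(loss_fn())
    perturb(-2 * eps)
    loss_minus = float(loss_fn())
    perturb(+eps)                                  # restore original weights
    projected_grad = (loss_plus - loss_minus) / (2 * eps)

    torch.manual_seed(seed)                        # replay z to apply the update
    for p in params:
        p.add_(torch.randn_like(p), alpha=-lr * projected_grad)
    return projected_grad
```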

Result: Experiments show ZOWarmUp is robust across various datasets and model architectures, enabling participation of under-represented clients and improving training outcomes.

Conclusion: ZOWarmUp successfully extends zeroth-order methods beyond fine-tuning to full training, providing access to more diverse data and reducing exclusion of low-resource devices in federated learning.

Abstract: Federated learning enables collaborative model training across numerous edge devices without requiring participants to share data; however, memory and communication constraints on these edge devices may preclude their participation in training. We consider a setting in which a subset of edge devices are below a critical memory or communication threshold required to conduct model updates. Under typical federated optimization algorithms, these devices are excluded from training which renders their data inaccessible and increases system induced bias. We are inspired by MeZO, a zeroth-order method used for memory-efficient fine-tuning. The increased variance inherent to zeroth-order gradient approximations has relegated previous zeroth-order optimizers exclusively to the domain of fine tuning; a limitation we seek to correct. We devise a federated, memory-efficient zeroth-order optimizer, ZOWarmUp that permits zeroth-order training from a random initialization. ZOWarmUp leverages differing client capabilities and careful variance reduction techniques to facilitate participation of under-represented, low-resource clients in model training. Like other federated zeroth-order methods, ZOWarmUp eliminates the need for edge devices to transmit their full gradients to the server and instead relies on only a small set of random seeds, rendering the up-link communication cost negligible. We present experiments using various datasets and model architectures to show that ZOWarmUp is a robust algorithm that can be applied under a wide variety of circumstances. For systems with a high proportion of edge devices that would otherwise be excluded from training, this algorithm provides access to a greater volume and diversity of data, thus improving training outcomes.

[281] Invariant Features for Global Crop Type Classification

Xin-Yi Tong, Sherrie Wang

Main category: cs.LG

TL;DR: This paper introduces CropGlobe, a global crop dataset with 300K samples, and CropNet, a lightweight CNN for crop classification that addresses geographic transfer challenges through invariant feature identification and temporal data augmentation.

DetailsMotivation: Accurate global crop mapping is crucial for food security but limited by geographic transfer issues where models trained in one region perform poorly in others due to spectral and phenological variations.

Method: Built CropGlobe dataset from 8 countries across 5 continents covering 6 major crops. Compared transferability of Sentinel-2 temporal features and EMIT hyperspectral features. Developed CropNet CNN with temporal data augmentation (time shift, scale, and magnitude warping) to simulate cross-regional phenology.
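
The three augmentations in minimal NumPy form, with assumed parameter ranges:

```python
import numpy as np

def augment_series(x, rng):
    """Temporal augmentations on a (T, bands) series to mimic cross-region
    phenology shifts: time shift, time scaling, magnitude warping.
    Parameter ranges here are illustrative, not the paper's."""
    T, B = x.shape
    x = np.roll(x, rng.integers(-T // 8, T // 8 + 1), axis=0)        # time shift
    idx = np.clip(np.arange(T) * rng.uniform(0.8, 1.2), 0, T - 1)    # time scale
    x = np.stack([np.interp(idx, np.arange(T), x[:, b]) for b in range(B)], axis=1)
    knots = rng.normal(1.0, 0.1, size=4)                             # magnitude warp
    warp = np.interp(np.arange(T), np.linspace(0, T - 1, 4), knots)
    return x * warp[:, None]

aug = augment_series(np.random.default_rng(0).random((36, 4)),
                     np.random.default_rng(1))
```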

Result: 2D median temporal features from Sentinel-2 showed strongest invariance across all transfer scenarios. Data augmentation improved robustness, especially when training data diversity was limited. The approach enhanced cross-regional generalization.

Conclusion: The study identifies invariant feature representations that improve geographic transferability, providing a scalable path for low-cost crop type applications across diverse global regions.

Abstract: Accurately obtaining crop type and its spatial distribution at a global scale is critical for food security, agricultural policy-making, and sustainable development. Remote sensing offers an efficient solution for large-scale crop classification, but the limited availability of reliable ground samples in many regions constrains applicability across geographic areas. To address performance declines under geospatial shifts, this study identifies remote sensing features that are invariant to geographic variation and proposes strategies to enhance cross-regional generalization. We construct CropGlobe, a global crop type dataset with 300,000 pixel-level samples from eight countries across five continents, covering six major food and industrial crops (corn, soybeans, rice, wheat, sugarcane, cotton). With broad geographic coverage, CropGlobe enables a systematic evaluation under cross-country, cross-continent, and cross-hemisphere transfer. We compare the transferability of temporal multi-spectral features (Sentinel-2-based 1D/2D median features and harmonic coefficients) and hyperspectral features (from EMIT). To improve generalization under spectral and phenological shifts, we design CropNet, a lightweight and robust CNN tailored for pixel-level crop classification, coupled with temporal data augmentation (time shift, time scale, and magnitude warping) that simulates realistic cross-regional phenology. Experiments show that 2D median temporal features from Sentinel-2 consistently exhibit the strongest invariance across all transfer scenarios, and augmentation further improves robustness, particularly when training data diversity is limited. Overall, the work identifies more invariant feature representations that enhance geographic transferability and suggests a promising path toward scalable, low-cost crop type applications across globally diverse regions.

[282] Can LLMs Lie? Investigation beyond Hallucination

Haoran Huan, Mihir Prabhudesai, Mengning Wu, Shantanu Jaiswal, Deepak Pathak

Main category: cs.LG

TL;DR: Systematic investigation of LLM lying behavior (intentional deception) vs hallucinations, using mechanistic interpretability to uncover neural mechanisms and develop control methods.

DetailsMotivation: LLMs' increasing autonomy raises trustworthiness concerns, but while hallucinations are studied, intentional lying remains underexplored despite being critical for real-world deployment.

Method: Used mechanistic interpretability techniques including logit lens analysis, causal interventions, and contrastive activation steering to identify deceptive behavior patterns and develop behavioral steering vectors.
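
Contrastive activation steering in a nutshell, with toy tensors standing in for residual-stream activations (layer choice and scaling are assumptions):

```python
import torch

def steering_vector(honest_acts, deceptive_acts):
    """Difference of mean activations on deceptive vs. honest prompts."""
    return deceptive_acts.mean(0) - honest_acts.mean(0)

def steer(hidden, vec, alpha=-1.0):
    """Add the vector (negated, to suppress lying) to a layer's activations."""
    return hidden + alpha * vec

h_honest = torch.randn(32, 4096)     # toy activations collected at one layer
h_lie = torch.randn(32, 4096)
steered = steer(torch.randn(1, 4096), steering_vector(h_honest, h_lie))
```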

Result: Identified neural mechanisms underlying deception, developed methods to control lying tendencies, and established a Pareto frontier showing dishonesty can enhance goal optimization in certain scenarios.

Conclusion: Research contributes to AI ethics by revealing deception risks in LLMs and developing safeguards, highlighting need for careful deployment in high-stakes environments.

Abstract: Large language models (LLMs) have demonstrated impressive capabilities across a variety of tasks, but their increasing autonomy in real-world applications raises concerns about their trustworthiness. While hallucinations-unintentional falsehoods-have been widely studied, the phenomenon of lying, where an LLM knowingly generates falsehoods to achieve an ulterior objective, remains underexplored. In this work, we systematically investigate the lying behavior of LLMs, differentiating it from hallucinations and testing it in practical scenarios. Through mechanistic interpretability techniques, we uncover the neural mechanisms underlying deception, employing logit lens analysis, causal interventions, and contrastive activation steering to identify and control deceptive behavior. We study real-world lying scenarios and introduce behavioral steering vectors that enable fine-grained manipulation of lying tendencies. Further, we explore the trade-offs between lying and end-task performance, establishing a Pareto frontier where dishonesty can enhance goal optimization. Our findings contribute to the broader discourse on AI ethics, shedding light on the risks and potential safeguards for deploying LLMs in high-stakes environments. Code and more illustrations are available at https://llm-liar.github.io/

[283] Exploring Response Uncertainty in MLLMs: An Empirical Evaluation under Misleading Scenarios

Yunkai Dang, Mengxi Gao, Yibo Yan, Xin Zou, Yanggan Gu, Jungang Li, Jingyu Wang, Peijie Jiang, Aiwei Liu, Jia Liu, Xuming Hu

Main category: cs.LG

TL;DR: MLLMs are highly vulnerable to misleading information, with 65% of correct answers being overturned by deceptive cues. The study proposes a benchmark and fine-tuning method that reduces misleading rates from over 86% to 6.97% (explicit) and 32.77% (implicit).

DetailsMotivation: Existing MLLM studies focus on visual-textual misalignment but overlook how models handle misleading information that could cause them to abandon previously correct answers.

Method: Two-stage evaluation pipeline: first elicit original responses on clean inputs, then inject explicit (false hints) and implicit (contextual contradictions) misleading instructions. Created MUB benchmark and fine-tuned models on 2000-sample dataset.

Result: MLLMs show high uncertainty, with average misleading rates exceeding 86% (explicit: 67.19%, implicit: 80.67%). Fine-tuning reduced rates to 6.97% (explicit) and 32.77% (implicit), improving consistency by 29.37%.

Conclusion: MLLMs are highly susceptible to misleading cues, but targeted fine-tuning can significantly improve their robustness and consistency while maintaining performance on standard benchmarks.

Abstract: Multimodal large language models (MLLMs) have recently achieved state-of-the-art performance on tasks ranging from visual question answering to video understanding. However, existing studies have concentrated mainly on visual-textual misalignment, leaving largely unexplored the MLLMs’ ability to preserve an originally correct answer when confronted with misleading information. We reveal a response uncertainty phenomenon: across nine standard datasets, twelve state-of-the-art open-source MLLMs overturn a previously correct answer in 65% of cases after receiving a single deceptive cue. To systematically quantify this vulnerability, we propose a two-stage evaluation pipeline: (1) elicit each model’s original response on unperturbed inputs; (2) inject explicit (false-answer hints) and implicit (contextual contradictions) misleading instructions, and compute the misleading rate, i.e., the fraction of correct-to-incorrect flips. Leveraging the most susceptible examples, we curate the Multimodal Uncertainty Benchmark (MUB), a collection of image-question pairs stratified into low, medium, and high difficulty based on how many of twelve state-of-the-art MLLMs they mislead. Extensive evaluation on twelve open-source and five closed-source models reveals high uncertainty: average misleading rates exceed 86%, with explicit cues over 67.19% and implicit cues over 80.67%. To reduce the misleading rate, we then fine-tune all open-source MLLMs on a compact 2000-sample mixed-instruction dataset, reducing misleading rates to 6.97% (explicit) and 32.77% (implicit), boosting consistency by nearly 29.37% on highly deceptive inputs, and slightly improving accuracy on standard benchmarks. Our code is available at https://github.com/Yunkaidang/uncertainty
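
The misleading rate itself is a simple flip statistic. A small sketch of how it could be computed from the two-stage pipeline's outputs, with toy inputs; the function name is illustrative.

```python
def misleading_rate(original_correct, misled_correct):
    """Fraction of initially correct items that flip to incorrect after
    a misleading cue. Both arguments are parallel lists of booleans:
    correctness before and after injecting the misleading instruction."""
    flips = sum(1 for before, after in zip(original_correct, misled_correct)
                if before and not after)
    n_correct = sum(original_correct)
    return flips / n_correct if n_correct else 0.0

# Toy example: 4 initially correct answers, 3 overturned -> rate 0.75.
print(misleading_rate([True, True, True, True, False],
                      [False, False, True, False, False]))
```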

[284] Pruning Weights but Not Truth: Safeguarding Truthfulness While Pruning LLMs

Yao Fu, Runchao Li, Xianxuan Long, Haotian Yu, Xiaotian Han, Yu Yin, Pan Li

Main category: cs.LG

TL;DR: Neural network pruning disrupts LLMs’ internal activation features needed for lie detection. The proposed TPLO method preserves these features by focusing on layers with activation outliers and discriminative features, improving lie detection accuracy to 88% at 50% sparsity while maintaining original performance.

DetailsMotivation: Pruning LLMs for deployment in low-resource scenarios inadvertently removes crucial internal activation features that are essential for lie detection capabilities, creating a need to preserve these critical features during pruning.

Method: Proposed Truthful Pruning aligned by Layer-wise Outliers (TPLO) that emphasizes layers with more activation outliers and stronger discriminative features simultaneously. Also introduced a prompting rule to enrich the TruthfulQA benchmark for better calibration.

Result: Achieved 88% lie detection accuracy at 50% sparsity while preserving LLMs’ original performance. Enhanced performance on TruthfulQA benchmark and improved hallucination detection for pruned LLMs.

Conclusion: TPLO successfully addresses the critical issue of preserving lie detection capabilities during LLM pruning by focusing on layers with activation outliers and discriminative features, enabling effective deployment in low-resource scenarios without sacrificing truthfulness assessment.

Abstract: Neural network pruning has emerged as a promising approach for deploying LLMs in low-resource scenarios while preserving downstream task performance. However, for the first time, we reveal that such pruning disrupts LLMs’ internal activation features crucial for lie detection, where probing classifiers (typically small logistic regression models) trained on these features assess the truthfulness of LLM-generated statements. This discovery raises a crucial open question: how can we prune LLMs without sacrificing these critical lie detection capabilities? Our investigation further reveals that naively adjusting layer-wise pruning sparsity based on importance inadvertently removes crucial weights, failing to improve lie detection performance despite its reliance on the most crucial LLM layer. To address this issue, we propose Truthful Pruning aligned by Layer-wise Outliers (TPLO), which places greater emphasis on layers with more activation outliers and stronger discriminative features simultaneously. This preserves LLMs’ original performance while retaining critical features of inner states needed for robust lie detection. Moreover, we introduce a prompting rule to enrich the TruthfulQA benchmark for better calibrating LLM pruning. Empirical results show that our approach improves the hallucination detection for pruned LLMs (achieving 88% accuracy at 50% sparsity) and enhances their performance on TruthfulQA.
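
The paper's exact layer scoring is not reproduced here; the following is a hedged sketch of the general idea of assigning lower pruning sparsity to layers with more activation outliers and stronger discriminative features. The scoring, temperature, and normalization choices are assumptions for illustration.

```python
import numpy as np

def layerwise_sparsity(outlier_scores, disc_scores, target=0.5, temp=1.0):
    """Give layers with more activation outliers and stronger discriminative
    features LOWER sparsity (prune them less). Inputs are per-layer scores."""
    importance = np.asarray(outlier_scores) * np.asarray(disc_scores)
    weights = np.exp(-importance / temp)      # important layers -> small weight
    weights = weights / weights.mean()        # roughly keep the mean at `target`
    return np.clip(target * weights, 0.0, 0.95)

# Layer 0 is important (many outliers, discriminative) -> near-zero sparsity.
print(layerwise_sparsity([3.0, 0.5, 1.0], [2.0, 0.2, 1.0]))
```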

[285] P2DT: Mitigating Forgetting in task-incremental Learning with progressive prompt Decision Transformer

Zhiyuan Wang, Xiaoyang Qu, Jing Xiao, Bokui Chen, Jianzong Wang

Main category: cs.LG

TL;DR: P2DT is a novel method that uses progressive prompt tokens to prevent catastrophic forgetting in continual reinforcement learning, allowing models to retain knowledge from previous tasks while learning new ones.

DetailsMotivation: Catastrophic forgetting causes performance degradation when intelligent agents face new tasks, making it a substantial challenge for managing large model-controlled agents.

Method: Progressive Prompt Decision Transformer (P2DT) enhances transformer-based models by dynamically appending decision tokens during new task training, leveraging trajectories from all tasks and generating task-specific tokens.

Result: Preliminary results show the model effectively alleviates catastrophic forgetting and scales well with increasing task environments.

Conclusion: P2DT provides an effective solution for mitigating forgetting in continual and offline reinforcement learning scenarios while maintaining knowledge from previous tasks.

Abstract: Catastrophic forgetting poses a substantial challenge for managing intelligent agents controlled by a large model, causing performance degradation when these agents face new tasks. In our work, we propose a novel solution - the Progressive Prompt Decision Transformer (P2DT). This method enhances a transformer-based model by dynamically appending decision tokens during new task training, thus fostering task-specific policies. Our approach mitigates forgetting in continual and offline reinforcement learning scenarios. Moreover, P2DT leverages trajectories collected via traditional reinforcement learning from all tasks and generates new task-specific tokens during training, thereby retaining knowledge from previous studies. Preliminary results demonstrate that our model effectively alleviates catastrophic forgetting and scales well with increasing task environments.
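
A minimal PyTorch sketch of the progressive-prompt idea: each new task appends its own trainable prompt tokens while earlier tokens are frozen, so old policies stay intact. The class name, token count, and initialization scale are illustrative, not the authors' implementation.

```python
import torch
import torch.nn as nn

class ProgressivePrompts(nn.Module):
    """Growing list of task-specific prompt tokens; only the newest task's
    tokens receive gradients."""

    def __init__(self, d_model=64, tokens_per_task=4):
        super().__init__()
        self.d_model, self.k = d_model, tokens_per_task
        self.prompts = nn.ParameterList()

    def add_task(self):
        for p in self.prompts:               # freeze all earlier tasks
            p.requires_grad_(False)
        self.prompts.append(nn.Parameter(torch.randn(self.k, self.d_model) * 0.02))

    def forward(self, x):                    # x: (batch, seq, d_model)
        toks = torch.cat(list(self.prompts), dim=0)
        toks = toks.unsqueeze(0).expand(x.size(0), -1, -1)
        return torch.cat([toks, x], dim=1)   # prepend prompt tokens

pp = ProgressivePrompts()
pp.add_task()
out = pp(torch.randn(2, 10, 64))             # shape (2, 14, 64)
```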

[286] Soft-TransFormers for Continual Learning

Haeyong Kang, Chang D. Yoo

Main category: cs.LG

TL;DR: Soft-TransFormers (Soft-TF) is a novel continual learning method that uses soft-masking to create task-adaptive networks from pre-trained transformers, achieving state-of-the-art performance while minimizing catastrophic forgetting.

DetailsMotivation: To address catastrophic forgetting in continual learning by building upon the Well-initialized Lottery Ticket Hypothesis, providing better fine-tuning solutions than existing methods.

Method: Sequentially learns and selects optimal soft-networks for each task by optimizing soft-mask weights of sparse layers while keeping pre-trained parameters frozen, then uses task-adaptive masking during inference.

Result: Achieves state-of-the-art performance across Vision and Language Class Incremental Learning scenarios on both Vision Transformer (ViT) and Language Transformer (Bert).

Conclusion: Soft-TF effectively minimizes catastrophic forgetting by preserving pre-trained knowledge through soft-masking while adapting to new tasks, demonstrating superior performance in continual learning settings.

Abstract: Inspired by the Well-initialized Lottery Ticket Hypothesis (WLTH), which provides suboptimal fine-tuning solutions, we propose a novel fully fine-tuned continual learning (CL) method referred to as Soft-TransFormers (Soft-TF). Soft-TF sequentially learns and selects an optimal soft-network for each task. During sequential training in CL, a well-initialized Soft-TF mask optimizes the weights of sparse layers to obtain task-adaptive soft (real-valued) networks, while keeping the well-pre-trained layer parameters frozen. In inference, the identified task-adaptive network of Soft-TF masks the parameters of the pre-trained network, mapping to an optimal solution for each task and minimizing Catastrophic Forgetting (CF) - the soft-masking preserves the knowledge of the pre-trained network. Extensive experiments on the Vision Transformer (ViT) and the Language Transformer (Bert) demonstrate the effectiveness of Soft-TF, achieving state-of-the-art performance across Vision and Language Class Incremental Learning (CIL) scenarios.
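
A minimal sketch of soft-masking as described: a frozen pre-trained weight multiplied elementwise by a per-task real-valued mask. The module name and the use of a plain linear layer are assumptions for illustration.

```python
import torch
import torch.nn as nn

class SoftMaskedLinear(nn.Module):
    """Frozen pre-trained weight with one learnable real-valued mask per task."""

    def __init__(self, weight, n_tasks):
        super().__init__()
        self.weight = nn.Parameter(weight, requires_grad=False)   # frozen
        self.masks = nn.ParameterList(
            nn.Parameter(torch.ones_like(weight)) for _ in range(n_tasks)
        )

    def forward(self, x, task_id):
        # At inference, the task's soft mask modulates the frozen parameters.
        return x @ (self.weight * self.masks[task_id]).T

layer = SoftMaskedLinear(torch.randn(8, 16), n_tasks=3)
y = layer(torch.randn(4, 16), task_id=0)   # only masks[0] receives gradients
```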

[287] Predict, Cluster, Refine: A Joint Embedding Predictive Self-Supervised Framework for Graph Representation Learning

Srinitish Srinivasan, Omkumar CU

Main category: cs.LG

TL;DR: A novel non-contrastive graph self-supervised learning framework that eliminates negative sampling and complex decoders while using GMM-based pseudo-labels to enhance node discriminability and prevent representation collapse.

DetailsMotivation: Address limitations of current graph SSL methods including computational inefficiency, reliance on contrastive objectives, representation collapse, and failure to account for node embedding contributions in unlabeled scenarios.

Method: Joint embedding predictive framework with view-invariant architecture, leveraging a single-context, multiple-targets relationship between subgraphs, and incorporating GMM-based pseudo-label scoring to evaluate latent feature contributions.

Result: Outperforms state-of-the-art graph SSL methods across benchmarks without contrastive loss or complex decoders, achieving superior performance with computational efficiency.

Conclusion: Advances graph SSL by providing a collapse-resistant paradigm that efficiently bridges spatial and semantic graph features for downstream tasks, eliminating traditional contrastive learning limitations.

Abstract: Graph representation learning has emerged as a cornerstone for tasks like node classification and link prediction, yet prevailing self-supervised learning (SSL) methods face challenges such as computational inefficiency, reliance on contrastive objectives, and representation collapse. Existing approaches often depend on feature reconstruction, negative sampling, or complex decoders, which introduce training overhead and hinder generalization. Further, current techniques which address such limitations fail to account for the contribution of node embeddings to a certain prediction in the absence of labeled nodes. To address these limitations, we propose a novel joint embedding predictive framework for graph SSL that eliminates contrastive objectives and negative sampling while preserving semantic and structural information. Additionally, we introduce a semantic-aware objective term that incorporates pseudo-labels derived from Gaussian Mixture Models (GMMs), enhancing node discriminability by evaluating latent feature contributions. Extensive experiments demonstrate that our framework outperforms state-of-the-art graph SSL methods across benchmarks, achieving superior performance without contrastive loss or complex decoders. Key innovations include (1) a non-contrastive, view-invariant joint embedding predictive architecture, (2) leveraging a single-context, multiple-targets relationship between subgraphs, and (3) GMM-based pseudo-label scoring to capture semantic contributions. This work advances graph SSL by offering a computationally efficient, collapse-resistant paradigm that bridges spatial and semantic graph features for downstream tasks. The code for our paper can be found at https://github.com/Deceptrax123/JPEB-GSSL
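
The GMM-based pseudo-label scoring can be approximated with scikit-learn. A sketch under the assumption that pseudo-labels are hard assignments and the per-node confidence is the maximum posterior responsibility; the function name and component count are illustrative.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def gmm_pseudo_labels(embeddings, n_components=4, seed=0):
    """Fit a GMM on node embeddings; return hard pseudo-labels and a
    per-node confidence score (max posterior responsibility)."""
    gmm = GaussianMixture(n_components=n_components, random_state=seed)
    gmm.fit(embeddings)
    resp = gmm.predict_proba(embeddings)       # (n_nodes, n_components)
    return resp.argmax(axis=1), resp.max(axis=1)

z = np.random.randn(100, 8)                    # toy node embeddings
labels, confidence = gmm_pseudo_labels(z)
# `confidence` could weight a semantic-aware loss so uncertain nodes count less.
```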

[288] Investigating a Model-Agnostic and Imputation-Free Approach for Irregularly-Sampled Multivariate Time-Series Modeling

Abhilash Neog, Arka Daw, Sepideh Fatemi Khorasgani, Medha Sawhney, Aanish Pradhan, Mary E. Lofton, Bennett J. McAfee, Adrienne Breef-Pilz, Heather L. Wander, Dexter W Howard, Cayelan C. Carey, Paul Hanson, Anuj Karpatne

Main category: cs.LG

TL;DR: MissTSM is a novel model-agnostic, imputation-free approach for modeling irregularly-sampled multivariate time series that outperforms existing methods when missing data is extensive and lacks periodic patterns.

DetailsMotivation: Existing approaches for irregularly-sampled multivariate time series either use two-stage impute-then-model frameworks or specialized architectures, which may not perform well with extensive missing data and complex real-world patterns.

Method: MissTSM (Missing Feature-aware Time Series Modeling) is a model-agnostic and imputation-free approach that directly handles missing values without requiring data imputation or specialized architecture modifications.

Result: MissTSM shows competitive performance compared to other IMTS approaches, particularly when missing values are abundant and the data lacks simplistic periodic structures - conditions typical in real-world applications.

Conclusion: The proposed MissTSM approach provides an effective solution for irregularly-sampled multivariate time series modeling that works well under challenging real-world conditions with extensive missing data and complex patterns.

Abstract: Modeling Irregularly-sampled and Multivariate Time Series (IMTS) is crucial across a variety of applications where different sets of variates may be missing at different time-steps due to sensor malfunctions or high data acquisition costs. Existing approaches for IMTS either consider a two-stage impute-then-model framework or involve specialized architectures specific to a particular model and task. We perform a series of experiments to derive novel insights about the performance of IMTS methods on a variety of semi-synthetic and real-world datasets for both classification and forecasting. We also introduce Missing Feature-aware Time Series Modeling (MissTSM), a novel model-agnostic and imputation-free approach for IMTS modeling. We show that MissTSM shows competitive performance compared to other IMTS approaches, especially when the amount of missing values is large and the data lacks simplistic periodic structures - conditions common to real-world IMTS applications.

[289] Efficiently Editing Mixture-of-Experts Models with Compressed Experts

Yifei He, Yang Liu, Chen Liang, Hany Hassan Awadalla

Main category: cs.LG

TL;DR: Compressed experts reduce MoE model inference costs by replacing less important experts with lightweight modules, maintaining 90%+ performance while cutting 30%+ active parameters and 20% inference costs.

DetailsMotivation: MoE models have redundant experts that contribute minimally to performance, especially after fine-tuning for specialized tasks. This creates an opportunity to reduce computational costs without sacrificing performance.

Method: Propose compressed experts - lightweight modules that serve as compact representations of full experts. Preserve most important experts while replacing auxiliary activated experts with compressed versions.

Result: Experiments on Phi-MoE and OLMoE show compressed experts recover over 90% of full expert performance across tasks while reducing >30% active parameters and saving 20% inference costs.

Conclusion: This approach enables efficient deployment of MoE models in resource-constrained settings and facilitates scaling to larger models with manageable overhead.

Abstract: Mixture-of-Experts (MoE) models have become a key approach for scaling large language models efficiently by activating only a subset of experts during training and inference. Typically, the number of activated experts presents a trade-off: fewer experts reduce computational costs, while more experts improve performance. Recent studies reveal that not all activated experts contribute equally to model performance, with some providing minimal utility, particularly when finetuning pretrained MoE models for specialized downstream tasks. The co-existence of significant and redundant parameters in experts provides us an opportunity to reduce the number of activated experts while maintaining model performance. In this work, we propose the concept of compressed experts, lightweight modules that serve as compact representations of full experts. Our approach preserves the most important experts while replacing other auxiliary activated experts with compressed experts. The reduction of active parameters significantly lowers inference costs while achieving comparable performance. Extensive experiments on models including Phi-MoE and OLMoE demonstrate that compressed experts recover over 90% of full expert performance across various tasks while reducing more than 30% active parameters and saving 20% in inference costs. This approach enables efficient deployment of MoE models in resource-constrained settings and facilitates scaling to larger models with manageable overhead. Our code is available at https://github.com/yifei-he/Compressed-Experts.
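
The paper does not spell out its compression mechanics here; one natural reading of a "compact representation of a full expert" is a low-rank factorization. A hedged sketch using truncated SVD follows; the class name, rank, and initialization routine are assumptions, not the authors' method.

```python
import torch
import torch.nn as nn

class CompressedExpert(nn.Module):
    """Low-rank stand-in for a full expert's weight matrix (rank r << d)."""

    def __init__(self, d_in, d_out, rank=8):
        super().__init__()
        self.down = nn.Linear(d_in, rank, bias=False)
        self.up = nn.Linear(rank, d_out, bias=False)

    def forward(self, x):
        return self.up(self.down(x))

@torch.no_grad()
def init_from_full(compressed, full_weight):
    """Initialise the factors via truncated SVD of the full expert's weight."""
    U, S, Vh = torch.linalg.svd(full_weight, full_matrices=False)
    r = compressed.down.out_features
    compressed.down.weight.copy_(torch.diag(S[:r]) @ Vh[:r])   # (r, d_in)
    compressed.up.weight.copy_(U[:, :r])                       # (d_out, r)

full = torch.randn(32, 32)                  # a full expert's weight
ce = CompressedExpert(32, 32, rank=8)
init_from_full(ce, full)                    # ce now approximates the full expert
```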

[290] Impoola: The Power of Average Pooling for Image-Based Deep Reinforcement Learning

Raphael Trumpp, Ansgar Schäfftlein, Mirco Theile, Marco Caccamo

Main category: cs.LG

TL;DR: Replacing flattening with global average pooling in Impala-CNN (creating Impoola-CNN) improves performance and generalization in deep reinforcement learning, showing that efficient network design matters more than just increasing model size.

DetailsMotivation: As deep reinforcement learning tackles more complex tasks, model scaling has focused on parameter efficiency but network design specifically for RL image encoders remains unexplored. Current Impala-CNN architecture may have untapped potential for improvement.

Method: Proposed Impoola-CNN which replaces the flattening operation in Impala-CNN with global average pooling. Tested on Procgen Benchmark to evaluate performance and generalization capabilities.

Result: Impoola-CNN outperforms larger and more complex models, particularly in generalization. Shows most significant gains in games without agent-centered observations, suggesting reduced translation sensitivity contributes to improvement.

Conclusion: Network scaling isn’t just about increasing model size; efficient network design is crucial. Global average pooling provides better performance than simple flattening, demonstrating that architectural improvements can be more effective than parameter increases.

Abstract: As image-based deep reinforcement learning tackles more challenging tasks, increasing model size has become an important factor in improving performance. Recent studies achieved this by focusing on the parameter efficiency of scaled networks, typically using Impala-CNN, a 15-layer ResNet-inspired network, as the image encoder. However, while Impala-CNN evidently outperforms older CNN architectures, potential advancements in network design for deep reinforcement learning-specific image encoders remain largely unexplored. We find that replacing the flattening of output feature maps in Impala-CNN with global average pooling leads to a notable performance improvement. This approach outperforms larger and more complex models in the Procgen Benchmark, particularly in terms of generalization. We call our proposed encoder model Impoola-CNN. A decrease in the network’s translation sensitivity may be central to this improvement, as we observe the most significant gains in games without agent-centered observations. Our results demonstrate that network scaling is not just about increasing model size; efficient network design is also an essential factor. We make our code available at https://github.com/raphajaner/impoola.
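
The architectural change is essentially a one-liner. A sketch contrasting flattening with global average pooling on an encoder's output feature maps, showing why the first dense layer shrinks dramatically; shapes are illustrative.

```python
import torch
import torch.nn as nn

feat = torch.randn(2, 64, 8, 8)      # (batch, channels, H, W) encoder output

flat = feat.flatten(1)               # Impala-style: 64*8*8 = 4096 features
gap = feat.mean(dim=(2, 3))          # Impoola-style: 64 features, less
                                     # sensitive to where objects appear

head_flat = nn.Linear(4096, 256)     # large first dense layer
head_gap = nn.Linear(64, 256)        # far fewer parameters after pooling
print(flat.shape, gap.shape)         # torch.Size([2, 4096]) torch.Size([2, 64])
```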

[291] Group-in-Group Policy Optimization for LLM Agent Training

Lang Feng, Zhenghai Xue, Tingcong Liu, Bo An

Main category: cs.LG

TL;DR: GiGPO is a novel group-based RL algorithm that enables fine-grained credit assignment for LLM agents through a two-level hierarchical structure, achieving significant performance improvements on challenging benchmarks while maintaining low memory overhead and computational efficiency.

DetailsMotivation: Existing group-based RL methods struggle with long-horizon LLM agent training due to sparse/delayed rewards and difficulty in credit assignment across individual steps in agent-environment interactions.

Method: GiGPO introduces a two-level structure: episode-level macro relative advantages based on trajectory groups, and step-level micro relative advantages using anchor state grouping that identifies repeated environment states across trajectories to group actions from the same state.

Result: GiGPO achieves >12% improvement on ALFWorld and >9% on WebShop over GRPO baseline, while maintaining same GPU memory overhead, identical LLM rollout, and minimal additional time cost.

Conclusion: GiGPO successfully enables fine-grained per-step credit assignment for LLM agents while preserving the benefits of group-based RL (critic-free, low memory, stable convergence), making it highly scalable for long-horizon agent training.

Abstract: Recent advances in group-based reinforcement learning (RL) have driven frontier large language models (LLMs) in single-turn tasks like mathematical reasoning. However, their scalability to long-horizon LLM agent training remains limited. Unlike static tasks, agent-environment interactions unfold over many steps and often yield sparse or delayed rewards, making credit assignment across individual steps significantly more challenging. In this work, we propose Group-in-Group Policy Optimization (GiGPO), a novel RL algorithm that achieves fine-grained credit assignment for LLM agents while preserving the appealing properties of group-based RL: critic-free, low memory, and stable convergence. GiGPO introduces a two-level structure for estimating relative advantage: (i) At the episode-level, GiGPO computes macro relative advantages based on groups of complete trajectories; (ii) At the step-level, GiGPO introduces an anchor state grouping mechanism that retroactively constructs step-level groups by identifying repeated environment states across trajectories. Actions stemming from the same state are grouped together, enabling micro relative advantage estimation. This hierarchical structure effectively captures both global trajectory quality and local step effectiveness without relying on auxiliary models or additional rollouts. We evaluate GiGPO on two challenging agent benchmarks, ALFWorld and WebShop, using Qwen2.5-1.5B-Instruct and Qwen2.5-7B-Instruct. Crucially, GiGPO delivers fine-grained per-step credit signals and achieves performance gains of > 12% on ALFWorld and > 9% on WebShop over the GRPO baseline: all while maintaining the same GPU memory overhead, identical LLM rollout, and incurring little to no additional time cost.
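
A minimal sketch of the step-level anchor-state grouping idea: steps that share an identical environment state are grouped, and each action's advantage is taken relative to its group. Real GRPO-style pipelines typically also normalize by the group standard deviation; the state keys and returns below are toy values.

```python
from collections import defaultdict
from statistics import fmean

def step_level_advantages(steps):
    """steps: list of (state_key, return) pairs pooled across trajectories.
    Group by identical (anchor) state and subtract the group mean."""
    groups = defaultdict(list)
    for key, ret in steps:
        groups[key].append(ret)
    means = {k: fmean(v) for k, v in groups.items()}
    return [ret - means[key] for key, ret in steps]

steps = [("stateA", 1.0), ("stateA", 0.0), ("stateB", 2.0), ("stateB", 2.0)]
print(step_level_advantages(steps))   # [0.5, -0.5, 0.0, 0.0]
```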

[292] When a Reinforcement Learning Agent Encounters Unknown Unknowns

Juntian Zhu, Miguel de Carvalho, Zhouwang Yang, Fengxiang He

Main category: cs.LG

TL;DR: A reinforcement learning framework for handling unknown unknown states where agents expand value functions using noninformative beliefs based on averaged values from known domains, with theoretical guarantees on regret and complexity.

DetailsMotivation: AI agents may encounter completely unknown states (unknown unknowns) that they have never been aware of, requiring a systematic approach to handle such surprising discoveries in reinforcement learning.

Method: Proposes an episodic Markov decision process with growing awareness (EMDP-GA) model using noninformative value expansion (NIVE) to initialize value functions in newly discovered states based on averaged values from the aware domain, adapted with upper confidence bound momentum Q-learning.

Result: The approach achieves asymptotically consistent regret with state-of-the-art methods (without unknown unknowns) in extremely uncertain environments, with comparable computational and space complexity.

Conclusion: Unknown unknown states can be properly discovered and handled asymptotically with decent speed and affordable cost using the proposed noninformative value expansion approach.

Abstract: An AI agent might surprisingly find she has reached an unknown state which she has never been aware of – an unknown unknown. We mathematically ground this scenario in reinforcement learning: an agent, after taking an action calculated from value functions $Q$ and $V$ defined on the \emph{aware domain}, reaches a state out of the domain. To enable the agent to handle this scenario, we propose an \emph{episodic Markov decision process with growing awareness} (EMDP-GA) model, taking a new \emph{noninformative value expansion} (NIVE) approach to expand value functions to newly aware areas: when an agent arrives at an unknown unknown, value functions $Q$ and $V$ thereon are initialised by noninformative beliefs – the averaged values on the aware domain. This design is out of respect for the complete absence of knowledge in the newly discovered state. The upper confidence bound momentum Q-learning is then adapted to the growing awareness for training the EMDP-GA model. We prove that (1) the regret of our approach is asymptotically consistent with the state of the art (SOTA) without exposure to unknown unknowns in an extremely uncertain environment, and (2) our computational complexity and space complexity are comparable with the SOTA – these collectively suggest that though an unknown unknown is surprising, it will be asymptotically properly discovered with decent speed and an affordable cost.
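
NIVE's initialization rule is simple to state concretely: a newly discovered state's values start at the averages over the aware domain. A toy tabular sketch; array shapes are illustrative.

```python
import numpy as np

def expand_awareness(Q, V):
    """Noninformative value expansion: a newly discovered state's values are
    the averages over the currently aware domain."""
    Q = np.vstack([Q, Q.mean(axis=0)])     # per-action average Q
    V = np.append(V, V.mean())
    return Q, V

Q = np.array([[1.0, 2.0], [3.0, 4.0]])     # 2 aware states x 2 actions
V = np.array([1.5, 3.5])
Q, V = expand_awareness(Q, V)
print(Q[-1], V[-1])                        # [2. 3.] 2.5
```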

[293] Cost-Driven Representation Learning for Linear Quadratic Gaussian Control: Part I

Yi Tian, Kaiqing Zhang, Russ Tedrake, Suvrit Sra

Main category: cs.LG

TL;DR: Cost-driven state representation learning for LQG control with finite-sample guarantees

DetailsMotivation: Learn state representations from high-dimensional observations to control unknown partially observable systems without predicting observations or actions

Method: Cost-driven approach learning dynamic model in latent state space by predicting costs, focusing on Linear Quadratic Gaussian (LQG) control problems

Result: Established finite-sample guarantees for finding near-optimal state representation function and controller using learned latent model for finite-horizon time-varying LQG

Conclusion: Cost-driven approach with multi-step cost prediction is theoretically sound and empirically valuable for state representation learning, with extensions to infinite-horizon settings planned

Abstract: We study the task of learning state representations from potentially high-dimensional observations, with the goal of controlling an unknown partially observable system. We pursue a cost-driven approach, where a dynamic model in some latent state space is learned by predicting the costs without predicting the observations or actions. In particular, we focus on an intuitive cost-driven state representation learning method for solving Linear Quadratic Gaussian (LQG) control, one of the most fundamental partially observable control problems. As our main results, we establish finite-sample guarantees of finding a near-optimal state representation function and a near-optimal controller using the directly learned latent model, for finite-horizon time-varying LQG control problems. To the best of our knowledge, despite various empirical successes, finite-sample guarantees of such a cost-driven approach remain elusive. Our result underscores the value of predicting multi-step costs, an idea that is key to our theory, and notably also an idea that is known to be empirically valuable for learning state representations. A second part of this work, that is to appear as Part II, addresses the infinite-horizon linear time-invariant setting; it also extends the results to an approach that implicitly learns the latent dynamics, inspired by the recent empirical breakthrough of MuZero in model-based reinforcement learning.
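
A hedged PyTorch sketch of the cost-driven idea: encode the observation into a latent state, roll it forward with learned linear dynamics, and train only on multi-step cost prediction, with no observation reconstruction. Dimensions and module names are illustrative assumptions.

```python
import torch
import torch.nn as nn

class CostDrivenLatentModel(nn.Module):
    """Latent linear dynamics trained purely to predict multi-step costs."""

    def __init__(self, obs_dim, act_dim, latent_dim=8):
        super().__init__()
        self.enc = nn.Linear(obs_dim, latent_dim)
        self.A = nn.Linear(latent_dim, latent_dim, bias=False)
        self.B = nn.Linear(act_dim, latent_dim, bias=False)
        self.cost = nn.Linear(latent_dim + act_dim, 1)

    def multi_step_costs(self, obs0, actions):
        z = self.enc(obs0)
        preds = []
        for a in actions.unbind(dim=1):          # actions: (batch, T, act_dim)
            preds.append(self.cost(torch.cat([z, a], dim=-1)))
            z = self.A(z) + self.B(a)            # roll latent state forward
        return torch.stack(preds, dim=1)          # (batch, T, 1)

model = CostDrivenLatentModel(obs_dim=10, act_dim=2)
pred = model.multi_step_costs(torch.randn(4, 10), torch.randn(4, 5, 2))
loss = ((pred - torch.randn(4, 5, 1)) ** 2).mean()   # fit to observed costs
```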

[294] A theoretical framework for self-supervised contrastive learning for continuous dependent data

Alexander Marusov, Aleksandr Yugay, Alexey Zaytsev

Main category: cs.LG

TL;DR: Proposes a theoretical framework for contrastive self-supervised learning on continuous dependent data, introducing dependency-aware loss functions that outperform existing methods on temporal and spatio-temporal tasks.

DetailsMotivation: Traditional contrastive SSL methods assume semantic independence between samples, which doesn't hold for dependent data like temporal and spatio-temporal domains that exhibit complex correlations.

Method: Developed a novel theoretical framework with hard and soft closeness similarity measures, derived analytical form for estimated similarity matrix, and created dependency-aware loss functions for continuous dependent data.

Result: Outperformed TS2Vec on UEA and UCR benchmarks with 4.17% and 2.08% accuracy improvements respectively, and achieved 7% higher ROC-AUC score on drought classification task.

Conclusion: The proposed dependency-aware approach effectively captures spatio-temporal dependencies, demonstrating superior performance over modern methods for dependent data through theoretically grounded loss functions.

Abstract: Self-supervised learning (SSL) has emerged as a powerful approach to learning representations, particularly in the field of computer vision. However, its application to dependent data, such as temporal and spatio-temporal domains, remains underexplored. Besides, traditional contrastive SSL methods often assume \emph{semantic independence between samples}, which does not hold for dependent data exhibiting complex correlations. We propose a novel theoretical framework for contrastive SSL tailored to \emph{continuous dependent data}, which allows the nearest samples to be semantically close to each other. In particular, we propose two possible \textit{ground truth similarity measures} between objects – \emph{hard} and \emph{soft} closeness. Under it, we derive an analytical form for the \textit{estimated similarity matrix} that accommodates both types of closeness between samples, thereby introducing dependency-aware loss functions. We validate our approach, \emph{Dependent TS2Vec}, on temporal and spatio-temporal downstream problems. Given the dependency patterns presented in the data, our approach surpasses modern ones for dependent data, highlighting the effectiveness of our theoretically grounded loss functions for SSL in capturing spatio-temporal dependencies. Specifically, we outperform TS2Vec on the standard UEA and UCR benchmarks, with accuracy improvements of $4.17$% and $2.08$%, respectively. Furthermore, on the drought classification task, which involves complex spatio-temporal patterns, our method achieves a $7$% higher ROC-AUC score.
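
A small sketch of what a soft closeness ground-truth similarity could look like for temporally dependent samples; the exponential decay and the `tau` scale are illustrative assumptions, not the paper's analytical form.

```python
import numpy as np

def soft_closeness_matrix(timestamps, tau=2.0):
    """Soft ground-truth similarity: temporally nearby samples count as
    partial positives, decaying exponentially with time distance."""
    t = np.asarray(timestamps, dtype=float)
    dist = np.abs(t[:, None] - t[None, :])
    sim = np.exp(-dist / tau)
    np.fill_diagonal(sim, 1.0)
    return sim

# Samples at t=0,1,2 are partially similar; t=10 is effectively a negative.
print(soft_closeness_matrix([0, 1, 2, 10]).round(2))
```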

[295] Correcting Auto-Differentiation in Neural-ODE Training

Yewei Xu, Shi Chen, Qin Li

Main category: cs.LG

TL;DR: Auto-differentiation in neural ODEs with high-order methods can produce artificial gradient oscillations that prevent convergence, but simple post-processing fixes this issue for Leapfrog and 2-stage ERK methods.

DetailsMotivation: To investigate whether auto-differentiation provides reliable gradient updates for deep neural networks designed as neural ODEs, particularly when using high-order numerical methods.

Method: Mathematical analysis and numerical evidence examining auto-differentiation with Linear Multistep Methods (LMM) and Explicit Runge-Kutta Methods (ERK), with proposed post-processing techniques for Leapfrog and 2-stage ERK.

Result: Brute-force auto-differentiation introduces artificial oscillations in gradients that prevent convergence when using high-order methods, but the proposed post-processing techniques effectively eliminate these oscillations and provide accurate gradient computations.

Conclusion: Auto-differentiation requires careful handling for neural ODEs with high-order methods, but simple post-processing can correct gradient computation issues and enable proper convergence.

Abstract: Does the use of auto-differentiation yield reasonable updates for deep neural networks (DNNs)? Specifically, when DNNs are designed to adhere to neural ODE architectures, can we trust the gradients provided by auto-differentiation? Through mathematical analysis and numerical evidence, we demonstrate that when neural networks employ high-order methods, such as Linear Multistep Methods (LMM) or Explicit Runge-Kutta Methods (ERK), to approximate the underlying ODE flows, brute-force auto-differentiation often introduces artificial oscillations in the gradients that prevent convergence. In the case of Leapfrog and 2-stage ERK, we propose simple post-processing techniques that effectively eliminates these oscillations, correct the gradient computation and thus returns the accurate updates.

[296] Deep Variational Multivariate Information Bottleneck – A Framework for Variational Losses

Eslam Abdelaleem, Ilya Nemenman, K. Michael Martini

Main category: cs.LG

TL;DR: A unifying framework for variational dimensionality reduction based on multivariate information bottleneck, extending existing methods and introducing new ones like DVSIB that outperform traditional approaches on benchmark datasets.

DetailsMotivation: To create a general framework that unifies various variational dimensionality reduction methods and enables the development of problem-specific algorithms that better match data structure.

Method: Proposes a framework based on multivariate information bottleneck theory, trading off information preservation in encoder graphs against decoder graphs. Extends DVCCA to beta-DVCCA and introduces DVSIB (deep variational symmetric information bottleneck).

Result: DVSIB and beta-DVCCA outperform other methods on Noisy MNIST and Noisy CIFAR-100 in classification accuracy, latent space dimensionality, sample efficiency, and achieve competitive/superior accuracy against state-of-the-art models.

Conclusion: The framework successfully unifies variational dimensionality reduction methods and provides a foundation for designing novel, problem-specific loss functions that can incorporate diverse multi-view representation learning algorithms.

Abstract: Variational dimensionality reduction methods are widely used for their accuracy, generative capabilities, and robustness. We introduce a unifying framework that generalizes both such as traditional and state-of-the-art methods. The framework is based on an interpretation of the multivariate information bottleneck, trading off the information preserved in an encoder graph (defining what to compress) against that in a decoder graph (defining a generative model for data). Using this approach, we rederive existing methods, including the deep variational information bottleneck, variational autoencoders, and deep multiview information bottleneck. We naturally extend the deep variational CCA (DVCCA) family to beta-DVCCA and introduce a new method, the deep variational symmetric information bottleneck (DVSIB). DSIB, the deterministic limit of DVSIB, connects to modern contrastive learning approaches such as Barlow Twins, among others. We evaluate these methods on Noisy MNIST and Noisy CIFAR-100, showing that algorithms better matched to the structure of the problem like DVSIB and beta-DVCCA produce better latent spaces as measured by classification accuracy, dimensionality of the latent variables, sample efficiency, and consistently outperform other approaches under comparable conditions. Additionally, we benchmark against state-of-the-art models, achieving superior or competitive accuracy. Our results demonstrate that this framework can seamlessly incorporate diverse multi-view representation learning algorithms, providing a foundation for designing novel, problem-specific loss functions.

[297] Revisiting Clustering of Neural Bandits: Selective Reinitialization for Mitigating Loss of Plasticity

Zhiyuan Su, Sunhao Dai, Xiao Zhang

Main category: cs.LG

TL;DR: Selective Reinitialization (SeRe) framework addresses loss of plasticity in Clustering of Neural Bandits by dynamically resetting underutilized neural network units, enabling better adaptation to non-stationary environments like dynamic user preferences.

DetailsMotivation: Neural bandit algorithms suffer from loss of plasticity where neural network parameters become rigid over time, limiting their ability to adapt to non-stationary environments such as changing user preferences in recommendation systems.

Method: Proposes Selective Reinitialization (SeRe) framework that uses a contribution utility metric to identify and reset underutilized neural network units, combined with adaptive change detection to adjust reinitialization frequency based on environmental non-stationarity.

Result: Theoretical proof shows SeRe enables sublinear cumulative regret in piecewise-stationary environments. Extensive experiments on six real-world recommendation datasets demonstrate lower regrets and improved adaptability compared to traditional CNB approaches.

Conclusion: SeRe effectively mitigates loss of plasticity in neural bandit algorithms, enhancing their adaptability and robustness in dynamic environments while maintaining stable knowledge retention through selective reinitialization.

Abstract: Clustering of Bandits (CB) methods enhance sequential decision-making by grouping bandits into clusters based on similarity and incorporating cluster-level contextual information, demonstrating effectiveness and adaptability in applications like personalized streaming recommendations. However, when extending CB algorithms to their neural version (commonly referred to as Clustering of Neural Bandits, or CNB), they suffer from loss of plasticity, where neural network parameters become rigid and less adaptable over time, limiting their ability to adapt to non-stationary environments (e.g., dynamic user preferences in recommendation). To address this challenge, we propose Selective Reinitialization (SeRe), a novel bandit learning framework that dynamically preserves the adaptability of CNB algorithms in evolving environments. SeRe leverages a contribution utility metric to identify and selectively reset underutilized units, mitigating loss of plasticity while maintaining stable knowledge retention. Furthermore, when combining SeRe with CNB algorithms, the adaptive change detection mechanism adjusts the reinitialization frequency according to the degree of non-stationarity, ensuring effective adaptation without unnecessary resets. Theoretically, we prove that SeRe enables sublinear cumulative regret in piecewise-stationary environments, outperforming traditional CNB approaches in long-term performances. Extensive experiments on six real-world recommendation datasets demonstrate that SeRe-enhanced CNB algorithms can effectively mitigate the loss of plasticity with lower regrets, improving adaptability and robustness in dynamic settings.
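
A hedged sketch of selective reinitialization: score each hidden unit by a contribution-utility proxy and re-initialize only the lowest-scoring fraction. The utility definition used here (mean absolute activation times mean absolute incoming weight) is an assumption for illustration, not the paper's exact metric.

```python
import torch
import torch.nn as nn

@torch.no_grad()
def selective_reinit(layer, activations, reset_frac=0.1):
    """Reset the least-contributing output units of a linear layer.

    activations: (batch, out_features) recorded post-activation values.
    """
    utility = activations.abs().mean(dim=0) * layer.weight.abs().mean(dim=1)
    k = max(1, int(reset_frac * utility.numel()))
    idx = utility.argsort()[:k]                 # least useful units
    fresh = torch.empty_like(layer.weight)
    nn.init.kaiming_uniform_(fresh)
    layer.weight[idx] = fresh[idx]              # re-initialise those rows only
    if layer.bias is not None:
        layer.bias[idx] = 0.0
    return idx

layer = nn.Linear(16, 32)
acts = torch.randn(100, 32)
reset = selective_reinit(layer, acts, reset_frac=0.25)
```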

[298] INCPrompt: Task-Aware incremental Prompting for Rehearsal-Free Class-incremental Learning

Zhiyuan Wang, Xiaoyang Qu, Jing Xiao, Bokui Chen, Jianzong Wang

Main category: cs.LG

TL;DR: INCPrompt is a novel continual learning method that uses adaptive key-learner and task-aware prompts to prevent catastrophic forgetting while maintaining high performance across tasks.

DetailsMotivation: To address the persistent problem of catastrophic forgetting in continual learning, where models forget previously learned information when trained on new tasks.

Method: Uses adaptive key-learner and task-aware prompts that capture both general knowledge across tasks and task-specific knowledge to preserve learned information.

Result: Superior performance over existing algorithms across multiple continual learning benchmarks, effectively mitigating catastrophic forgetting.

Conclusion: Task-aware incremental prompting has significant impact on continual learning performance and represents an effective solution to catastrophic forgetting.

Abstract: This paper introduces INCPrompt, an innovative continual learning solution that effectively addresses catastrophic forgetting. INCPrompt’s key innovation lies in its use of adaptive key-learner and task-aware prompts that capture task-relevant information. This unique combination encapsulates general knowledge across tasks and encodes task-specific knowledge. Our comprehensive evaluation across multiple continual learning benchmarks demonstrates INCPrompt’s superiority over existing algorithms, showing its effectiveness in mitigating catastrophic forgetting while maintaining high performance. These results highlight the significant impact of task-aware incremental prompting on continual learning performance.

[299] HERCULES: Hierarchical Embedding-based Recursive Clustering Using LLMs for Efficient Summarization

Gabor Petnehazi, Bernadett Aradi

Main category: cs.LG

TL;DR: HERCULES is a hierarchical k-means clustering algorithm that uses LLMs to generate interpretable cluster titles and descriptions for multimodal data analysis.

DetailsMotivation: The need for advanced analytical tools that can effectively group complex multimodal datasets while providing human-understandable insights into discovered structures.

Method: Recursively applies k-means clustering to build hierarchy, integrates LLMs for semantic labeling, supports direct and description clustering modes with topic guidance, and includes interactive visualization.

Result: Developed a novel algorithm and Python package capable of hierarchical clustering with enhanced interpretability through LLM-generated summaries for diverse data types.

Conclusion: HERCULES demonstrates strong potential for extracting meaningful hierarchical knowledge from complex datasets across multiple modalities with improved human interpretability.

Abstract: The explosive growth of complex datasets across various modalities necessitates advanced analytical tools that not only group data effectively but also provide human-understandable insights into the discovered structures. We introduce HERCULES (Hierarchical Embedding-based Recursive Clustering Using LLMs for Efficient Summarization), a novel algorithm and Python package designed for hierarchical k-means clustering of diverse data types, including text, images, and numeric data (processed one modality per run). HERCULES constructs a cluster hierarchy by recursively applying k-means clustering, starting from individual data points at level 0. A key innovation is its deep integration of Large Language Models (LLMs) to generate semantically rich titles and descriptions for clusters at each level of the hierarchy, significantly enhancing interpretability. The algorithm supports two main representation modes: `direct` mode, which clusters based on original data embeddings or scaled numeric features, and `description` mode, which clusters based on embeddings derived from LLM-generated summaries. Users can provide a `topic_seed` to guide LLM-generated summaries towards specific themes. An interactive visualization tool facilitates thorough analysis and understanding of the clustering results. We demonstrate HERCULES’s capabilities and discuss its potential for extracting meaningful, hierarchical knowledge from complex datasets.
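
The recursive clustering skeleton is straightforward; the LLM labelling step is stubbed out below as a placeholder function. Parameters such as `k` and `min_size`, and the dict-based tree, are illustrative, not the package's API.

```python
import numpy as np
from sklearn.cluster import KMeans

def describe(member_ids):
    """Placeholder for the LLM call that titles/describes a cluster."""
    return f"cluster of {len(member_ids)} items"

def build_hierarchy(X, ids, k=3, min_size=6, level=0):
    """Recursively apply k-means, attaching a description at every node."""
    node = {"level": level, "ids": ids, "desc": describe(ids), "children": []}
    if len(ids) <= max(min_size, k):
        return node
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    for c in range(k):
        mask = labels == c
        child_ids = [i for i, m in zip(ids, mask) if m]
        if child_ids:
            node["children"].append(
                build_hierarchy(X[mask], child_ids, k, min_size, level + 1))
    return node

X = np.random.randn(30, 5)                 # e.g. text embeddings
tree = build_hierarchy(X, list(range(30)))
```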

[300] FedGraph: A Research Library and Benchmark for Federated Graph Learning

Yuhang Yao, Yuan Li, Xinyi Fan, Junhao Li, Kay Liu, Weizhao Jin, Yu Yang, Srivatsan Ravi, Philip S. Yu, Carlee Joe-Wong

Main category: cs.LG

TL;DR: FedGraph is a federated graph learning library that supports distributed training, benchmarking, and system performance evaluation with homomorphic encryption and low-rank communication for efficiency and privacy.

DetailsMotivation: Existing federated graph learning algorithms focus on accuracy but overlook system performance, which is crucial for real-world deployment. There's a need for practical distributed training and comprehensive benchmarking tools.

Method: Developed FedGraph library with monitoring class for system performance evaluation, native homomorphic encryption integration, and low-rank communication scheme for algorithms like FedGCN to accelerate pre-training and training phases.

Result: FedGraph successfully benchmarks FGL algorithms on three major graph learning tasks, demonstrating efficient encrypted low-rank communication and scalability to graphs with 100 million nodes.

Conclusion: FedGraph is the first efficient FGL framework that supports encrypted low-rank communication and large-scale deployment, providing comprehensive system-level performance evaluation to guide future algorithm design.

Abstract: Federated graph learning is an emerging field with significant practical challenges. While algorithms have been proposed to improve the accuracy of training graph neural networks, such as node classification on federated graphs, the system performance is often overlooked, even though it is crucial for real-world deployment. To bridge this gap, we introduce FedGraph, a research library designed for practical distributed training and comprehensive benchmarking of FGL algorithms. FedGraph supports a range of state-of-the-art graph learning methods and includes a monitoring class that evaluates system performance, with a particular focus on communication and computation costs during training. Unlike existing federated learning platforms, FedGraph natively integrates homomorphic encryption to enhance privacy preservation and supports scalable deployment across multiple physical machines with system-level performance evaluation to guide the system design of future algorithms. To enhance efficiency and privacy, we propose a low-rank communication scheme for algorithms like FedGCN that require pre-training communication, accelerating both the pre-training and training phases. Extensive experiments benchmark FGL algorithms on three major graph learning tasks and demonstrate FedGraph as the first efficient FGL framework to support encrypted low-rank communication and scale to graphs with 100 million nodes.
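
This is not FedGraph's actual API; it is a generic sketch of the low-rank communication idea, compressing a message matrix into two thin factors via truncated SVD before transmission. The rank and matrix sizes are illustrative.

```python
import numpy as np

def compress(M, rank):
    """Send two thin factors instead of the full matrix."""
    U, S, Vt = np.linalg.svd(M, full_matrices=False)
    return U[:, :rank] * S[:rank], Vt[:rank]      # (n, r) and (r, m)

def decompress(A, B):
    return A @ B                                  # approximate reconstruction

M = np.random.randn(1000, 256)                    # e.g. node features to share
A, B = compress(M, rank=16)
payload = A.size + B.size                         # 16*(1000+256) floats
print(payload / M.size)                           # ~0.079 of the original traffic
```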

[301] Rethinking Data Protection in the (Generative) Artificial Intelligence Era

Yiming Li, Shuo Shao, Yu He, Junfeng Guo, Tianwei Zhang, Zhan Qin, Pin-Yu Chen, Michael Backes, Philip Torr, Dacheng Tao, Kui Ren

Main category: cs.LG

TL;DR: The paper proposes a four-level taxonomy for data protection in generative AI systems, addressing the inadequacy of traditional data protection methods in the AI lifecycle from training to deployment.

DetailsMotivation: Traditional data protection is insufficient for AI systems where data permeates all stages from training to deployment, creating urgent need to define and enforce proper safeguards.

Method: A four-level taxonomy framework (non-usability, privacy preservation, traceability, deletability) that spans the entire AI pipeline including training data, model weights, prompts, and generated content.

Result: The framework provides structured understanding of trade-offs between data utility and control, analyzes technical approaches at each level, and identifies regulatory blind spots in current AI systems.

Conclusion: The taxonomy offers guidance for aligning AI technologies and governance with trustworthy data practices, providing timely direction for developers, researchers, and regulators to rethink data protection for modern AI.

Abstract: The (generative) artificial intelligence (AI) era has profoundly reshaped the meaning and value of data. No longer confined to static content, data now permeates every stage of the AI lifecycle from the training samples that shape model parameters to the prompts and outputs that drive real-world model deployment. This shift renders traditional notions of data protection insufficient, while the boundaries of what needs safeguarding remain poorly defined. Failing to safeguard data in AI systems can inflict societal and individual harms, underscoring the urgent need to clearly delineate the scope of and rigorously enforce data protection. In this perspective, we propose a four-level taxonomy, including non-usability, privacy preservation, traceability, and deletability, that captures the diverse protection needs arising in modern (generative) AI models and systems. Our framework offers a structured understanding of the trade-offs between data utility and control, spanning the entire AI pipeline, including training datasets, model weights, system prompts, and AI-generated content. We analyze representative technical approaches at each level and reveal regulatory blind spots that leave critical assets exposed. By offering a structured lens to align future AI technologies and governance with trustworthy data practices, we underscore the urgency of rethinking data protection for modern AI techniques and provide timely guidance for developers, researchers, and regulators alike.

[302] Recursive Gaussian Process State Space Model

Tengjie Zheng, Haipeng Chen, Lin Cheng, Shengping Gong, Xu Huang

Main category: cs.LG

TL;DR: Proposes a recursive Gaussian Process State-Space Model with adaptive capabilities for online learning, featuring domain-independent Bayesian updates, lightweight inducing point selection, and online hyperparameter optimization.

DetailsMotivation: Addresses the lack of efficient online learning methods for GPSSMs when prior information about data distribution and model function is limited, which is crucial for real-time applications.

Method: Uses first-order linearization for Bayesian updates of joint state-GP distribution, develops online inducing point selection based on informative criteria, and recovers historical information from current filtering distribution for hyperparameter optimization.

Result: Demonstrates superior accuracy, computational efficiency, and adaptability compared to state-of-the-art online GPSSM techniques on both synthetic and real-world datasets.

Conclusion: The proposed method provides an effective solution for online learning in GPSSMs with limited prior information, offering closed-form, domain-independent learning with lightweight computational requirements.

Abstract: Learning dynamical models from data is not only fundamental but also holds great promise for advancing principle discovery, time-series prediction, and controller design. Among various approaches, Gaussian Process State-Space Models (GPSSMs) have recently gained significant attention due to their combination of flexibility and interpretability. However, for online learning, the field lacks an efficient method suitable for scenarios where prior information regarding data distribution and model function is limited. To address this issue, this paper proposes a recursive GPSSM method with adaptive capabilities for both operating domains and Gaussian process (GP) hyperparameters. Specifically, we first utilize first-order linearization to derive a Bayesian update equation for the joint distribution between the system state and the GP model, enabling closed-form and domain-independent learning. Second, an online selection algorithm for inducing points is developed based on informative criteria to achieve lightweight learning. Third, to support online hyperparameter optimization, we recover historical measurement information from the current filtering distribution. Comprehensive evaluations on both synthetic and real-world datasets demonstrate the superior accuracy, computational efficiency, and adaptability of our method compared to state-of-the-art online GPSSM techniques.

[303] Pareto-frontier Entropy Search with Variational Lower Bound Maximization

Masanori Ishikura, Masayuki Karasuyama

Main category: cs.LG

TL;DR: Proposes a variational approximation method for multi-objective Bayesian optimization using information gain, addressing the challenge of unknown Pareto-frontier truncation through over- and under-truncation mixture optimization.

DetailsMotivation: Multi-objective Bayesian optimization requires calculating information gain of the Pareto-frontier, but obtaining the complete truncation distribution is impossible in continuous domains since the entire Pareto-frontier cannot be known.

Method: Approximates the truncation distribution using a mixture of over- and under-truncation from a Pareto-frontier subset. Optimizes the balancing coefficient through variational lower bound maximization to minimize approximation error.

Result: Empirical evaluation shows the method is effective, particularly when dealing with a large number of objective functions.

Conclusion: The proposed variational approximation framework successfully addresses the challenge of unknown Pareto-frontier truncation in multi-objective Bayesian optimization, demonstrating improved performance especially for high-dimensional objective spaces.

Abstract: This study considers multi-objective Bayesian optimization (MOBO) through the information gain of the Pareto-frontier. To calculate the information gain, a predictive distribution conditioned on the Pareto-frontier plays a key role, which is defined as a distribution truncated by the Pareto-frontier. However, it is usually impossible to obtain the entire Pareto-frontier in a continuous domain, and therefore, the complete truncation cannot be known. We consider an approximation of the truncated distribution by using a mixture distribution consisting of two possible approximate truncations obtainable from a subset of the Pareto-frontier, which we call over- and under-truncation. Since the optimal balance of the mixture is unknown beforehand, we propose optimizing the balancing coefficient through the variational lower bound maximization framework, by which the approximation error of the information gain can be minimized. Our empirical evaluation demonstrates the effectiveness of the proposed method particularly when the number of objective functions is large.

[304] Structure-preserving contrastive learning for spatial time series

Yiru Jiao, Sander van Cranenburgh, Simeon Calvert, Hans van Lint

Main category: cs.LG

TL;DR: Proposes two structure-preserving regularizers for contrastive learning of spatial time series to maintain spatio-temporal similarities, with dynamic weighting to balance objectives. Validated on time series classification and traffic prediction tasks.

DetailsMotivation: Self-supervised representation learning for spatially characterized time series (common in transportation) faces challenges in preserving fine-grained spatio-temporal similarities in latent space.

Method: Two structure-preserving regularizers: one preserves topology of instance similarities, another preserves graph geometry across spatial/temporal dimensions. Dynamic weighting mechanism adaptively manages trade-off between contrastive learning and structure preservation.

Result: Method preserves similarity structures more effectively and improves state-of-the-art performance across all tasks (time series classification, macroscopic/microscopic traffic prediction). Particularly useful for encoding traffic interactions.

Conclusion: Well-preserved similarity structures indicate more informative representations. Method can integrate with arbitrary neural networks and benefits time series data with spatial/geographical features. Provides insights for designing effective neural networks in transportation research.

Abstract: The effectiveness of neural network models largely relies on learning meaningful latent patterns from data, where self-supervised learning of informative representations can enhance model performance and generalisability. However, self-supervised representation learning for spatially characterised time series, which are ubiquitous in the transportation domain, poses unique challenges due to the necessity of maintaining fine-grained spatio-temporal similarities in the latent space. In this study, we introduce two structure-preserving regularisers for the contrastive learning of spatial time series: one regulariser preserves the topology of similarities between instances, and the other preserves the graph geometry of similarities across spatial and temporal dimensions. To balance the contrastive learning objective and the need for structure preservation, we propose a dynamic weighting mechanism that adaptively manages this trade-off and stabilises training. We validate the proposed method through extensive experiments, including multivariate time series classification to demonstrate its general applicability, as well as macroscopic and microscopic traffic prediction to highlight its particular usefulness in encoding traffic interactions. Across all tasks, our method preserves the similarity structures more effectively and improves state-of-the-art task performances. This method can be integrated with an arbitrary neural network model and is particularly beneficial for time series data with spatial or geographical features. Furthermore, our findings suggest that well-preserved similarity structures in the latent space indicate more informative and useful representations. This provides insights for designing more effective neural networks for data-driven transportation research. Our code is made openly accessible with all resulting data at https://github.com/yiru-jiao/spclt
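A minimal sketch of one way the topology-preserving term could look (the function names, temperature, and the ratio-based dynamic weighting are assumptions, not the paper's exact formulation):

```python
import torch
import torch.nn.functional as F

def topology_regulariser(x, z, tau=0.1):
    """Penalise divergence between pairwise-similarity distributions
    computed in input space (x) and in latent space (z)."""
    p = F.softmax(-torch.cdist(x, x) / tau, dim=1)          # target similarities
    log_q = F.log_softmax(-torch.cdist(z, z) / tau, dim=1)  # latent similarities
    return F.kl_div(log_q, p, reduction="batchmean")

def combined_loss(l_contrastive, l_structure, ratio=0.5):
    # Hypothetical dynamic weighting: keep the structure term at a fixed
    # fraction of the contrastive loss so neither objective dominates.
    w = ratio * l_contrastive.detach() / (l_structure.detach() + 1e-8)
    return l_contrastive + w * l_structure
```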

[305] Bayesian Active Learning for Multi-Criteria Comparative Judgement in Educational Assessment

Andy Gray, Alma Rahat, Tom Crick, Stephen Lindsay

Main category: cs.LG

TL;DR: Extends Bayesian Comparative Judgement to handle multiple rubric components, enabling both holistic and criterion-based assessment with uncertainty estimation and assessor agreement quantification.

DetailsMotivation: Bridge the gap between holistic comparative judgement and the need for criterion-based performance breakdowns using rubrics in educational assessment.

Method: Extends Bayesian CJ to handle multiple independent learning outcome components defined by rubrics, with entropy-based active learning for informative comparisons and assessor agreement quantification.

Result: Effective performance demonstrated on synthetic and real data, providing both holistic and component-wise predictive rankings with uncertainty estimates.

Conclusion: The approach successfully combines holistic assessment with criterion-based breakdowns, enhancing transparency through assessor agreement quantification while maintaining CJ’s reliability benefits.

Abstract: Comparative Judgement (CJ) provides an alternative assessment approach by evaluating work holistically rather than breaking it into discrete criteria. This method leverages human ability to make nuanced comparisons, yielding more reliable and valid assessments. CJ aligns with real-world evaluations, where overall quality emerges from the interplay of various elements. However, rubrics remain widely used in education, offering structured criteria for grading and detailed feedback. This creates a gap between CJ’s holistic ranking and the need for criterion-based performance breakdowns. This paper addresses this gap using a Bayesian approach. We build on Bayesian CJ (BCJ) by Gray et al., which directly models preferences instead of using likelihoods over total scores, allowing for expected ranks with uncertainty estimation. Their entropy-based active learning method selects the most informative pairwise comparisons for assessors. We extend BCJ to handle multiple independent learning outcome (LO) components, defined by a rubric, enabling both holistic and component-wise predictive rankings with uncertainty estimates. Additionally, we propose a method to aggregate entropies and identify the most informative comparison for assessors. Experiments on synthetic and real data demonstrate our method’s effectiveness. Finally, we address a key limitation of BCJ, which is the inability to quantify assessor agreement. We show how to derive agreement levels, enhancing transparency in assessment.
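A sketch of the entropy-aggregation step for selecting the next comparison (summation is one plausible aggregation; the paper proposes its own scheme):

```python
import numpy as np

def select_next_pair(pair_entropy):
    """pair_entropy maps a candidate pair (i, j) to an array of outcome
    entropies, one per learning-outcome component defined by the rubric."""
    scores = {pair: float(np.sum(h)) for pair, h in pair_entropy.items()}
    return max(scores, key=scores.get)

# Toy usage: three candidate pairs, two rubric components each.
H = {(0, 1): np.array([0.9, 0.1]),
     (0, 2): np.array([0.5, 0.7]),
     (1, 2): np.array([0.3, 0.3])}
print(select_next_pair(H))  # -> (0, 2), the most informative comparison
```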

[306] FlowKac: An Efficient Neural Fokker-Planck solver using Temporal Normalizing Flows and the Feynman-Kac Formula

Naoufal El Bekri, Lucas Drumetz, Franck Vermet

Main category: cs.LG

TL;DR: FlowKac is a novel mesh-free solver that reformulates the Fokker-Planck equation using the Feynman-Kac formula, enabling efficient high-dimensional solutions through adaptive stochastic sampling and time-indexed normalizing flows.

DetailsMotivation: Traditional methods struggle with solving Fokker-Planck equations for high-dimensional complex dynamical systems due to analytical intractability and computational limitations of numerical approaches.

Method: Reformulates Fokker-Planck equation via Feynman-Kac formula, uses adaptive stochastic sampling to reduce complexity, and employs time-indexed normalizing flows to capture evolving probability densities for robust collocation point sampling.

Result: Demonstrates significant improvements in computational efficiency and accuracy while mitigating the curse of dimensionality, validated through experiments on various stochastic differential equations.

Conclusion: FlowKac provides a flexible, mesh-free solution that effectively addresses high-dimensional Fokker-Planck problems, offering superior performance over existing techniques for applications requiring dimensions beyond conventional limits.

Abstract: Solving the Fokker-Planck equation for high-dimensional complex dynamical systems remains a pivotal yet challenging task due to the intractability of analytical solutions and the limitations of traditional numerical methods. In this work, we present FlowKac, a novel approach that reformulates the Fokker-Planck equation using the Feynman-Kac formula, allowing the solution to be queried at a given point via expected values over stochastic paths. A key innovation of FlowKac lies in its adaptive stochastic sampling scheme which significantly reduces the computational complexity while maintaining high accuracy. This sampling technique, coupled with a time-indexed normalizing flow, designed for capturing time-evolving probability densities, enables robust sampling of collocation points, resulting in a flexible and mesh-free solver. This formulation mitigates the curse of dimensionality and enhances computational efficiency and accuracy, which is particularly crucial for applications that inherently require dimensions beyond the conventional three. We validate the robustness and scalability of our method through various experiments on a range of stochastic differential equations, demonstrating significant improvements over existing techniques.
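For drift $\mu$ and constant diffusion $\sigma$, one standard Feynman-Kac representation of the Fokker-Planck solution (the general shape of what the abstract describes; the paper's exact formulation may differ) is

$$
p(x,t) = \mathbb{E}\!\left[\, p_0(Y_t)\, \exp\!\Big(-\int_0^t \nabla\!\cdot\mu(Y_s)\,\mathrm{d}s\Big) \;\middle|\; Y_0 = x \right], \qquad \mathrm{d}Y_s = -\mu(Y_s)\,\mathrm{d}s + \sigma\,\mathrm{d}W_s,
$$

so a Monte Carlo average over simulated paths queries $p(x,t)$ mesh-free, and the time-indexed normalizing flow decides where such queries (collocation points) are placed.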

[307] A DbC Inspired Neurosymbolic Layer for Trustworthy Agent Design

Claudiu Leoveanu-Condrei

Main category: cs.LG

TL;DR: Introducing a contract layer for LLMs that applies Design by Contract principles to provide semantic and type guarantees on inputs/outputs with probabilistic remediation.

DetailsMotivation: LLMs produce fluent outputs but lack verifiable guarantees, requiring a systematic approach to ensure compliance with semantic and type requirements.

Method: Adapt Design by Contract and type-theoretic principles to create a contract layer that mediates LLM calls, stipulating requirements and using probabilistic remediation to steer generation toward compliance.

Result: The approach provides probabilistic contract satisfaction and semantic validation through programmer-specified conditions, establishing functional equivalence between agents satisfying the same contracts.

Conclusion: This work presents a foundational framework for ensuring verifiable guarantees in LLM outputs through contract-based mediation, enabling functional equivalence across different agents.

Abstract: Generative models, particularly Large Language Models (LLMs), produce fluent outputs yet lack verifiable guarantees. We adapt Design by Contract (DbC) and type-theoretic principles to introduce a contract layer that mediates every LLM call. Contracts stipulate semantic and type requirements on inputs and outputs, coupled with probabilistic remediation to steer generation toward compliance. The layer exposes the dual view of LLMs as semantic parsers and probabilistic black-box components. Contract satisfaction is probabilistic and semantic validation is operationally defined through programmer-specified conditions on well-typed data structures. More broadly, this work postulates that any two agents satisfying the same contracts are functionally equivalent with respect to those contracts.
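A minimal Python sketch of the contract-layer idea (the decorator, JSON typing, and retry-based remediation are illustrative assumptions):

```python
import json
from typing import Callable

class ContractViolation(Exception):
    pass

def with_contract(pre: Callable[[str], bool],
                  post: Callable[[dict], bool],
                  max_retries: int = 3):
    """Check a precondition on the prompt, require well-typed (JSON) output,
    and re-sample until the postcondition holds: probabilistic remediation."""
    def decorate(llm_call: Callable[[str], str]):
        def wrapped(prompt: str) -> dict:
            if not pre(prompt):
                raise ContractViolation("input contract failed")
            for _ in range(max_retries):
                try:
                    out = json.loads(llm_call(prompt))
                except json.JSONDecodeError:
                    continue  # not well-typed: steer by re-sampling
                if post(out):
                    return out
            raise ContractViolation("output contract failed after retries")
        return wrapped
    return decorate
```

Under this view, any two agents that always satisfy `pre` and `post` are interchangeable with respect to those contracts, which is the functional-equivalence claim.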

[308] A State Alignment-Centric Approach to Federated System Identification: The FedAlign Framework

Ertuğrul Keçeci, Müjde Güzelkaya, Tufan Kumbasar

Main category: cs.LG

TL;DR: FedAlign is a federated learning framework for system identification that aligns state representations of local state-space models using similarity transformation matrices to overcome FedAvg’s limitation of altering system dynamics.

DetailsMotivation: Direct aggregation of local state-space models via FedAvg changes system dynamics, so a method is needed to align state representations while preserving local dynamics in federated system identification tasks.

Method: Two approaches: FedAlign-A uses controllable canonical form and control theory to analytically derive transformation matrices; FedAlign-O formulates alignment as an optimization problem using least squares to find similarity transformations.

Result: FedAlign outperforms FedAvg, converges faster, and provides improved stability of the global state-space model through efficient parameter basin alignment on both synthetic and real-world datasets.

Conclusion: FedAlign successfully addresses the state representation alignment problem in federated system identification, enabling effective aggregation of local models while preserving system dynamics through similarity transformations.

Abstract: This paper presents FedAlign, a Federated Learning (FL) framework particularly designed for System Identification (SYSID) tasks by aligning state representations. Local workers can learn State-Space Models (SSMs) with equivalent representations but different dynamics. We demonstrate that directly aggregating these local SSMs via FedAvg results in a global model with altered system dynamics. FedAlign overcomes this problem by employing similarity transformation matrices to align state representations of local SSMs, thereby establishing a common parameter basin that retains the dynamics of local SSMs. FedAlign computes similarity transformation matrices via two distinct approaches: FedAlign-A and FedAlign-O. In FedAlign-A, we represent the global SSM in controllable canonical form (CCF). We apply control theory to analytically derive similarity transformation matrices that convert each local SSM into this form. Yet, establishing the global SSM in CCF brings additional alignment challenges in multi-input multi-output SYSID, as the CCF representation is not unique, unlike in single-input single-output SYSID. In FedAlign-O, we address these alignment challenges by reformulating the local parameter basin alignment problem as an optimization task. We determine the parameter basin of a local worker as the common parameter basin and solve least square problems to obtain similarity transformation matrices needed to align the remaining local SSMs. Through the experiments conducted on synthetic and real-world datasets, we show that FedAlign outperforms FedAvg, converges faster, and provides improved stability of the global SSM thanks to the efficient alignment of local parameter basins.
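A least-squares sketch of the FedAlign-O-style alignment (an interpretation of the described objective, not the paper's exact formulation): find a similarity transform T with T A_i ≈ A_g T and T B_i ≈ B_g via one vectorized linear system.

```python
import numpy as np

def align_ssm(A_i, B_i, A_g, B_g):
    """Solve min_T ||T A_i - A_g T||_F^2 + ||T B_i - B_g||_F^2 using
    vec(M X N) = (N^T kron M) vec(X) with column-major (Fortran) vec."""
    n = A_i.shape[0]
    I = np.eye(n)
    M = np.vstack([np.kron(A_i.T, I) - np.kron(I, A_g),  # vec(T A_i - A_g T)
                   np.kron(B_i.T, I)])                   # vec(T B_i)
    rhs = np.concatenate([np.zeros(n * n), B_g.flatten(order="F")])
    t, *_ = np.linalg.lstsq(M, rhs, rcond=None)
    return t.reshape(n, n, order="F")
```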

[309] MPCritic: A plug-and-play MPC architecture for reinforcement learning

Nathan P. Lawrence, Thomas Banker, Ali Mesbah

Main category: cs.LG

TL;DR: MPCritic is a novel architecture that seamlessly integrates MPC and RL by using MPC’s loss landscape for parameter updates without expensive optimization, enabling robust constraint satisfaction for online deployment.

DetailsMotivation: Bridge the gap between RL and MPC communities by overcoming computational costs and software integration barriers that prevent using state-of-the-art methods from both fields.

Method: Uses parameterized MPC problems to define loss landscapes, performs soft optimization over batched training steps to update MPC parameters without costly minimization or parametric sensitivity calculations.

Result: Demonstrated versatility across various MPC architectures and RL algorithms on classic control benchmarks while preserving MPC structure for robust constraint satisfaction.

Conclusion: MPCritic successfully enables seamless integration of advanced MPC and RL techniques, providing a machine learning-friendly framework that maintains MPC’s constraint satisfaction capabilities for online deployment.

Abstract: The reinforcement learning (RL) and model predictive control (MPC) communities have developed vast ecosystems of theoretical approaches and computational tools for solving optimal control problems. Given their conceptual similarities but differing strengths, there has been increasing interest in synergizing RL and MPC. However, existing approaches tend to be limited for various reasons, including computational cost of MPC in an RL algorithm and software hurdles towards seamless integration of MPC and RL tools. These challenges often result in the use of “simple” MPC schemes or RL algorithms, neglecting the state-of-the-art in both areas. This paper presents MPCritic, a machine learning-friendly architecture that interfaces seamlessly with MPC tools. MPCritic utilizes the loss landscape defined by a parameterized MPC problem, focusing on “soft” optimization over batched training steps, thereby updating the MPC parameters while avoiding costly minimization and parametric sensitivities. Since the MPC structure is preserved during training, an MPC agent can be readily used for online deployment, where robust constraint satisfaction is paramount. We demonstrate the versatility of MPCritic, in terms of MPC architectures and RL algorithms that it can accommodate, on classic control benchmarks.
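A loose sketch of treating a parameterized MPC-style objective as a critic (the quadratic stage cost plus learned terminal value below is an assumption about the structure, not the paper's implementation):

```python
import torch
import torch.nn as nn

class MPCCriticSketch(nn.Module):
    """Cost-to-go estimate: differentiable stage cost plus a learned
    terminal value, trainable over batched steps without solving the MPC."""
    def __init__(self, n_state, n_act):
        super().__init__()
        self.Q = nn.Parameter(torch.eye(n_state))  # stage-cost state weights
        self.R = nn.Parameter(torch.eye(n_act))    # stage-cost action weights
        self.terminal = nn.Sequential(nn.Linear(n_state, 32), nn.ReLU(),
                                      nn.Linear(32, 1))

    def forward(self, s, a, s_next):
        stage = (s @ self.Q * s).sum(-1) + (a @ self.R * a).sum(-1)
        return stage + self.terminal(s_next).squeeze(-1)
```

Because the whole objective is differentiable, the MPC parameters update by ordinary gradient steps, and the preserved MPC structure can still be handed to a solver at deployment time.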

[310] Learning to Select MCP Algorithms: From Traditional ML to Dual-Channel GAT-MLP

Xiang Li, Shanshan Wang, Chenglong Xiao

Main category: cs.LG

TL;DR: A learning-based framework combining traditional ML and graph neural networks for algorithm selection in Maximum Clique Problem, with RF as strong baseline and GAT-MLP dual-channel model showing best performance.

DetailsMotivation: No single maximum clique algorithm performs best across all instances, and there's a lack of research on algorithm selection for MCP based on instance features.

Method: Constructed labeled dataset from 4 exact MCP algorithms on diverse graphs, extracted structural/statistical features. Evaluated SVM, RF, DT, KNN classifiers, then developed GAT-MLP dual-channel model combining GAT for local structure and MLP for global features.

Result: RF consistently showed strong performance across metrics. Feature importance analysis revealed connectivity and topological structure as strong predictors. GAT-MLP model demonstrated strong and consistent performance across all metrics.

Conclusion: Dual-channel architectures and graph neural networks are effective for combinatorial algorithm selection, with connectivity and topological features being key predictors of algorithm performance.

Abstract: Extensive experiments and prior studies show that no single maximum clique algorithm consistently performs best across all instances, highlighting the importance of selecting suitable algorithms based on instance features. Through an extensive analysis of relevant studies, it is found that there is a lack of research work concerning algorithm selection oriented toward the Maximum Clique Problem (MCP). In this work, we propose a learning-based framework that integrates both traditional machine learning and graph neural networks to address this gap. We construct a labeled dataset by running four exact MCP algorithms on a diverse collection of graph instances, accompanied by structural and global statistical features extracted from each graph. We first evaluate four conventional classifiers: Support Vector Machine (SVM), Random Forest (RF), Decision Tree (DT), and K-Nearest Neighbors (KNN), across multiple dataset variants. Experimental results show that RF consistently shows strong performance across metrics and dataset variants, making it a reliable baseline. In addition, feature importance analysis indicates that connectivity and topological structure are strong predictors of algorithm performance. Building on these findings, we develop a dual-channel model named GAT-MLP, which combines a Graph Attention Network (GAT) for local structural encoding with a Multilayer Perceptron (MLP) for global feature modeling. The GAT-MLP model shows strong and consistent performance across all metrics. Our results highlight the effectiveness of dual-channel architectures and the promise of graph neural networks in combinatorial algorithm selection.
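A compact sketch of a dual-channel GAT-MLP model (layer sizes and the fusion scheme are assumptions; requires torch_geometric):

```python
import torch
import torch.nn as nn
from torch_geometric.nn import GATConv, global_mean_pool

class DualChannelSelector(nn.Module):
    """GAT channel encodes local structure; MLP channel encodes global
    statistical features; fused logits select one of four MCP algorithms."""
    def __init__(self, node_dim, stat_dim, hidden=64, n_algos=4):
        super().__init__()
        self.gat1 = GATConv(node_dim, hidden, heads=4, concat=False)
        self.gat2 = GATConv(hidden, hidden, heads=4, concat=False)
        self.mlp = nn.Sequential(nn.Linear(stat_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, hidden))
        self.head = nn.Linear(2 * hidden, n_algos)

    def forward(self, x, edge_index, batch, stats):
        h = torch.relu(self.gat1(x, edge_index))
        h = torch.relu(self.gat2(h, edge_index))
        g = global_mean_pool(h, batch)  # graph-level structural code
        return self.head(torch.cat([g, self.mlp(stats)], dim=-1))
```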

[311] Improving Bayesian Optimization for Portfolio Management with an Adaptive Scheduling

Zinuo You, John Cartlidge, Karen Elliott, Menghan Ge, Daniel Gold

Main category: cs.LG

TL;DR: Novel Bayesian optimization framework (TPE-AS) for stable and sample-efficient optimization of black-box portfolio management systems under limited evaluation budgets.

DetailsMotivation: Black-box portfolio systems are widely used but their performance fluctuates with market changes, and evaluating them is computationally expensive due to fixed observation budgets. Standard Bayesian optimization methods can be unstable and waste limited evaluation resources.

Method: Proposes a weighted Lagrangian estimator with adaptive scheduling and importance sampling that dynamically balances exploration and exploitation by maximizing model performance while minimizing observation variance.

Result: Extensive experiments across four backtest settings with three distinct black-box portfolio models demonstrate the framework’s effectiveness in improving search stability and efficiency.

Conclusion: The TPE-AS framework successfully addresses the challenge of stable and sample-efficient optimization for black-box portfolio management systems under constrained evaluation budgets.

Abstract: Existing black-box portfolio management systems are prevalent in the financial industry due to commercial and safety constraints, though their performance can fluctuate dramatically with changing market regimes. Evaluating these non-transparent systems is computationally expensive, as fixed budgets limit the number of possible observations. Therefore, achieving stable and sample-efficient optimization for these systems has become a critical challenge. This work presents a novel Bayesian optimization framework (TPE-AS) that improves search stability and efficiency for black-box portfolio models under these limited observation budgets. Standard Bayesian optimization, which solely maximizes expected return, can yield erratic search trajectories and misalign the surrogate model with the true objective, thereby wasting the limited evaluation budget. To mitigate these issues, we propose a weighted Lagrangian estimator that leverages an adaptive schedule and importance sampling. This estimator dynamically balances exploration and exploitation by incorporating both the maximization of model performance and the minimization of the variance of model observations. It guides the search from broad, performance-seeking exploration towards stable and desirable regions as the optimization progresses. Extensive experiments and ablation studies, which establish our proposed method as the primary approach and other configurations as baselines, demonstrate its effectiveness across four backtest settings with three distinct black-box portfolio management models.
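A toy sketch of an adaptive exploration-to-stability schedule (the linear anneal below is a stand-in for the paper's weighted Lagrangian estimator with importance sampling):

```python
import numpy as np

def adaptive_score(mean, var, step, total_steps):
    # Early in the budget, reward raw performance; later, increasingly
    # penalise observation variance to settle in stable regions.
    lam = step / total_steps  # anneals 0 -> 1
    return mean - lam * var

mu, v = np.array([1.0, 0.9]), np.array([0.5, 0.05])
print(adaptive_score(mu, v, 5, 100))   # early: favours the high-mean model
print(adaptive_score(mu, v, 95, 100))  # late: favours the low-variance model
```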

[312] BadPromptFL: A Novel Backdoor Threat to Prompt-based Federated Learning in Multimodal Models

Maozhen Zhang, Mengnan Zhao, Bo Wang

Main category: cs.LG

TL;DR: BadPromptFL is the first backdoor attack targeting prompt-based federated learning in multimodal models, achieving >90% attack success by injecting poisoned prompts through compromised clients without modifying model parameters.

DetailsMotivation: Prompt-based tuning has become popular for efficient adaptation of large vision-language models in federated learning, but the security implications of prompt aggregation remain unexplored, creating a critical attack surface.

Method: Compromised clients jointly optimize local backdoor triggers and prompt embeddings, injecting poisoned prompts into the global aggregation process that propagate to benign clients, leveraging CLIP-style architectures’ contextual learning behavior.

Result: Achieves high attack success rates (>90%) with minimal visibility and limited client participation across multiple datasets and aggregation protocols.

Conclusion: The attack demonstrates critical vulnerabilities in prompt-based federated learning, raising concerns about robustness in real-world deployments and highlighting the need for security measures.

Abstract: Prompt-based tuning has emerged as a lightweight alternative to full fine-tuning in large vision-language models, enabling efficient adaptation via learned contextual prompts. This paradigm has recently been extended to federated learning settings (e.g., PromptFL), where clients collaboratively train prompts under data privacy constraints. However, the security implications of prompt-based aggregation in federated multimodal learning remain largely unexplored, leaving a critical attack surface unaddressed. In this paper, we introduce BadPromptFL, the first backdoor attack targeting prompt-based federated learning in multimodal contrastive models. In BadPromptFL, compromised clients jointly optimize local backdoor triggers and prompt embeddings, injecting poisoned prompts into the global aggregation process. These prompts are then propagated to benign clients, enabling universal backdoor activation at inference without modifying model parameters. Leveraging the contextual learning behavior of CLIP-style architectures, BadPromptFL achieves high attack success rates (e.g., >90%) with minimal visibility and limited client participation. Extensive experiments across multiple datasets and aggregation protocols validate the effectiveness, stealth, and generalizability of our attack, raising critical concerns about the robustness of prompt-based federated learning in real-world deployments.

[313] Explaining Anomalies with Tensor Networks

Hans Hohenfeld, Marius Beuerle, Elie Mounzer

Main category: cs.LG

TL;DR: Extending tensor networks from discrete to real-valued anomaly detection with tree tensor networks, showing competitive performance and explainability.

DetailsMotivation: To broaden the application of tensor networks beyond discrete-valued data to real-valued domains for explainable anomaly detection.

Method: Used matrix product states and introduced tree tensor networks for real-valued anomaly detection on three benchmark problems.

Result: Adequate predictive performance compared to baseline models, with both tensor network architectures providing explanations for anomalous samples.

Conclusion: Successfully extended tensor networks to real-valued data, opening pathways for more complex tensor network architectures in explainable anomaly detection.

Abstract: Tensor networks, a class of variational quantum many-body wave functions have attracted considerable research interest across many disciplines, including classical machine learning. Recently, Aizpurua et al. demonstrated explainable anomaly detection with matrix product states on a discrete-valued cyber-security task, using quantum-inspired methods to gain insight into the learned model and detected anomalies. Here, we extend this framework to real-valued data domains. We furthermore introduce tree tensor networks for the task of explainable anomaly detection. We demonstrate these methods with three benchmark problems, show adequate predictive performance compared to several baseline models and both tensor network architectures’ ability to explain anomalous samples. We thereby extend the application of tensor networks to a broader class of potential problems and open a pathway for future extensions to more complex tensor network architectures.

[314] Unsupervised Learning of Local Updates for Maximum Independent Set in Dynamic Graphs

Devendra Parkar, Anya Chaturvedi, Joshua J. Daymude

Main category: cs.LG

TL;DR: First unsupervised learning model for Maximum Independent Sets in dynamic graphs with changing edges, combining GNNs with distributed update mechanism for efficient parallel processing.

DetailsMotivation: Address the challenge of finding Maximum Independent Sets in dynamic graphs where edges change over time, which existing methods struggle to handle efficiently.

Method: Combines graph neural networks with learned distributed update mechanism that processes edge addition/deletion events, modifies node memories, and infers MaxIS membership in a single parallel step.

Result: Achieves competitive approximation ratios with excellent scalability on graphs of 50-1,000 nodes, outperforms state-of-the-art learning methods in solution quality, runtime, and memory usage on large graphs.

Conclusion: The model successfully generalizes to graphs 100x larger than training data, producing 1.05-1.18x larger MaxIS solutions while maintaining competitive runtimes, demonstrating strong performance and scalability.

Abstract: We present the first unsupervised learning model for finding Maximum Independent Sets (MaxIS) in dynamic graphs where edges change over time. Our method combines structural learning from graph neural networks (GNNs) with a learned distributed update mechanism that, given an edge addition or deletion event, modifies nodes’ internal memories and infers their MaxIS membership in a single, parallel step. We parameterize our model by the update mechanism’s radius and investigate the resulting performance-runtime tradeoffs for various dynamic graph topologies. We evaluate our model against a mixed integer programming solver and the state-of-the-art learning-based methods for MaxIS on static graphs (ICML 2020; NeurIPS 2020, 2023). Across synthetic and empirical dynamic graphs of 50-1,000 nodes, our model achieves competitive approximation ratios with excellent scalability; on large graphs, it significantly outperforms the state-of-the-art learning methods in solution quality, runtime, and memory usage. When generalizing to graphs of 10,000 nodes (100x larger than the ones used for training), our model produces MaxIS solutions 1.05-1.18x larger than any other learning method, even while maintaining competitive runtimes.

[315] Learning and Interpreting Gravitational-Wave Features from CNNs with a Random Forest Approach

Jun Tian, He Wang, Jibo He, Yu Pan, Shuo Cao, Qingquan Jiang

Main category: cs.LG

TL;DR: Hybrid CNN-RF model with interpretable physical metrics improves gravitational wave detection performance and interpretability, achieving 21% sensitivity improvement and better low-SNR signal detection.

DetailsMotivation: Convolutional neural networks are widely used in gravitational wave detection but their learned features lack physical interpretability, limiting model transparency and trustworthiness.

Method: Combines CNN feature extractor with random forest classifier, adding four physically interpretable metrics (variance, SNR, waveform overlap, peak amplitude) computed from final convolutional layer to inform decision boundaries.

Result: Outperforms baseline CNN with 21% relative sensitivity improvement at fixed false alarm rate, shows enhanced detection of low-SNR signals (SNR ≤ 10), and feature attribution reveals both CNN-extracted and handcrafted features contribute significantly.

Conclusion: Physically motivated post-processing of CNN feature maps bridges deep learning and domain knowledge, providing interpretable and efficient gravitational wave detection.

Abstract: Convolutional neural networks (CNNs) have become widely adopted in gravitational wave (GW) detection pipelines due to their ability to automatically learn hierarchical features from raw strain data. However, the physical meaning of these learned features remains underexplored, limiting the interpretability of such models. In this work, we propose a hybrid architecture that combines a CNN-based feature extractor with a random forest (RF) classifier to improve both detection performance and interpretability. Unlike prior approaches that directly connect classifiers to CNN outputs, our method introduces four physically interpretable metrics - variance, signal-to-noise ratio (SNR), waveform overlap, and peak amplitude - computed from the final convolutional layer. These are jointly used with the CNN output in the RF classifier to enable more informed decision boundaries. Tested on long-duration strain datasets, our hybrid model outperforms a baseline CNN model, achieving a relative improvement of 21% in sensitivity at a fixed false alarm rate of 10 events per month. Notably, it also shows improved detection of low-SNR signals (SNR $\le$ 10), which are especially vulnerable to misclassification in noisy environments. Feature attribution via the RF model reveals that both CNN-extracted and handcrafted features contribute significantly to classification decisions, with learned variance and CNN outputs ranked among the most informative. These findings suggest that physically motivated post-processing of CNN feature maps can serve as a valuable tool for interpretable and efficient GW detection, bridging the gap between deep learning and domain knowledge.
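A sketch of the hybrid pipeline (the metric definitions are plausible readings of the four names; the paper computes them from the final convolutional layer):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def interpretable_metrics(fmap, template):
    """Variance, SNR, waveform overlap, and peak amplitude of a 1-D
    feature map; 'template' is a reference waveform for the overlap."""
    variance = np.var(fmap)
    peak = np.max(np.abs(fmap))
    snr = peak / (np.std(fmap) + 1e-12)
    overlap = np.dot(fmap, template) / (
        np.linalg.norm(fmap) * np.linalg.norm(template) + 1e-12)
    return np.array([variance, snr, overlap, peak])

# Toy data: concatenate the CNN score with the handcrafted metrics and let
# the random forest draw the decision boundary.
rng = np.random.default_rng(0)
maps, template = rng.normal(size=(200, 128)), rng.normal(size=128)
cnn_scores, labels = rng.uniform(size=(200, 1)), rng.integers(0, 2, 200)
X = np.hstack([cnn_scores,
               np.array([interpretable_metrics(m, template) for m in maps])])
RandomForestClassifier(n_estimators=100).fit(X, labels)
```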

[316] CRISP-NAM: Competing Risks Interpretable Survival Prediction with Neural Additive Models

Dhanesh Ramachandram, Ananya Raval

Main category: cs.LG

TL;DR: CRISP-NAM is an interpretable neural additive model for competing risks survival analysis that extends neural additive architecture to model cause-specific hazards while maintaining feature-level interpretability.

DetailsMotivation: Competing risks are crucial in survival modeling, especially in healthcare where patients may experience multiple distinct event types, requiring interpretable methods that can handle complex non-linear relationships.

Method: Extends neural additive architecture to model cause-specific hazards, with each feature contributing independently through dedicated neural networks, allowing visualization of complex non-linear relationships between covariates and each competing risk.

Result: Demonstrates competitive performance on multiple datasets compared to existing approaches.

Conclusion: CRISP-NAM provides an interpretable solution for competing risks survival analysis that maintains competitive predictive performance while offering feature-level interpretability through its neural additive architecture.

Abstract: Competing risks are crucial considerations in survival modelling, particularly in healthcare domains where patients may experience multiple distinct event types. We propose CRISP-NAM (Competing Risks Interpretable Survival Prediction with Neural Additive Models), an interpretable neural additive model for competing risks survival analysis which extends the neural additive architecture to model cause-specific hazards while preserving feature-level interpretability. Each feature contributes independently to risk estimation through dedicated neural networks, allowing for visualization of complex non-linear relationships between covariates and each competing risk. We demonstrate competitive performance on multiple datasets compared to existing approaches.
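A minimal neural-additive sketch of the described architecture (an interpretation; hidden sizes and link functions are assumed):

```python
import torch
import torch.nn as nn

class CRISPNAMSketch(nn.Module):
    """One small network per feature and per competing risk; their sums
    form cause-specific scores, so each feature's risk contribution can be
    plotted directly against its value."""
    def __init__(self, n_features, n_risks, hidden=32):
        super().__init__()
        self.nets = nn.ModuleList([
            nn.ModuleList([nn.Sequential(nn.Linear(1, hidden), nn.ReLU(),
                                         nn.Linear(hidden, 1))
                           for _ in range(n_features)])
            for _ in range(n_risks)])

    def forward(self, x):  # x: (batch, n_features)
        scores = []
        for per_feature in self.nets:
            contribs = [net(x[:, j:j + 1]) for j, net in enumerate(per_feature)]
            scores.append(torch.stack(contribs).sum(dim=0))
        return torch.cat(scores, dim=1)  # (batch, n_risks) cause-specific scores
```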

[317] VCDiag: Classifying Erroneous Waveforms for Failure Triage Acceleration

Minh Luu, Surya Jasper, Khoi Le, Evan Pan, Michael Quinn, Aakash Tyagi, Jiang Hu

Main category: cs.LG

TL;DR: VCDiag is an ML-based framework that automates RTL simulation failure triage using VCD data, achieving 94% accuracy in identifying top failure modules with 120x data compression.

DetailsMotivation: Manual failure triage in design verification is time-consuming and labor-intensive, while existing ML solutions have limited application to RTL-level simulation failure analysis for large designs.

Method: Uses VCD data with novel signal selection and statistical compression approach to classify failing waveforms and pinpoint failure locations, achieving over 120x data reduction while preserving classification features.

Result: Achieves over 94% accuracy in identifying the top three most likely failure modules in large-scale experiments, with significant data compression efficiency.

Conclusion: VCDiag provides an efficient, adaptable framework for automated failure triage that can be integrated into diverse Verilog/SystemVerilog designs and testbenches, addressing a critical gap in design verification.

Abstract: Failure triage in design functional verification is critical but time-intensive, relying on manual specification reviews, log inspections, and waveform analyses. While machine learning (ML) has improved areas like stimulus generation and coverage closure, its application to RTL-level simulation failure triage, particularly for large designs, remains limited. VCDiag offers an efficient, adaptable approach using VCD data to classify failing waveforms and pinpoint likely failure locations. In the largest experiment, VCDiag achieves over 94% accuracy in identifying the top three most likely modules. The framework introduces a novel signal selection and statistical compression approach, achieving over 120x reduction in raw data size while preserving features essential for classification. It can also be integrated into diverse Verilog/SystemVerilog designs and testbenches.
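One plausible reading of the statistical compression step (window size and the choice of statistics are assumptions):

```python
import numpy as np

def compress_signals(wave, win=64):
    """Replace raw samples with per-window summary statistics per signal.
    wave: (n_signals, n_samples) array of numeric waveform values."""
    n_sig, n_samp = wave.shape
    wins = wave[:, : n_samp - n_samp % win].reshape(n_sig, -1, win)
    feats = np.stack([wins.mean(-1),
                      wins.std(-1),
                      (np.diff(wins, axis=-1) != 0).mean(-1)],  # toggle rate
                     axis=-1)
    return feats.reshape(n_sig, -1)  # compact feature vector per signal
```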

[318] RNE: plug-and-play diffusion inference-time control and energy-based training

Jiajun He, José Miguel Hernández-Lobato, Yuanqi Du, Francisco Vargas

Main category: cs.LG

TL;DR: RNE (Radon-Nikodym Estimator) enables access to marginal densities in diffusion models, facilitating inference-time control, density estimation, and energy-based training.

DetailsMotivation: Diffusion models lack access to marginal densities along generation trajectories, which are crucial for inference-time control applications like annealing and model composition.

Method: Introduces RNE based on density ratio between path distributions, establishing connection between marginal densities and transition kernels for flexible plug-and-play framework.

Result: RNE delivers strong performance in inference-time control applications with promising scaling, and provides efficient regularization for energy-based diffusion training.

Conclusion: RNE provides a unified framework for diffusion density estimation, inference-time control, and energy-based training, addressing the gap in accessing marginal densities during diffusion generation.

Abstract: Diffusion models generate data by removing noise gradually, which corresponds to the time-reversal of a noising process. However, access to only the denoising kernels is often insufficient. In many applications, we need the knowledge of the marginal densities along the generation trajectory, which enables tasks such as inference-time control. To address this gap, in this paper, we introduce the Radon-Nikodym Estimator (RNE). Based on the concept of the density ratio between path distributions, it reveals a fundamental connection between marginal densities and transition kernels, providing a flexible plug-and-play framework that unifies diffusion density estimation, inference-time control, and energy-based diffusion training under a single perspective. Experiments demonstrate that RNE delivers strong results in inference-time control applications, such as annealing and model composition, with promising inference-time scaling performance. Moreover, RNE provides a simple yet efficient regularisation for training energy-based diffusion.

[319] Non-Asymptotic Stability and Consistency Guarantees for Physics-Informed Neural Networks via Coercive Operator Analysis

Ronald Katende

Main category: cs.LG

TL;DR: A unified theoretical framework for analyzing PINN stability and consistency using operator coercivity, variational formulations, and perturbation theory, providing convergence guarantees and generalization bounds.

DetailsMotivation: To establish rigorous theoretical foundations for Physics-Informed Neural Networks (PINNs) by formalizing stability, consistency, and generalization properties for PDE solution approximation.

Method: Develops theoretical framework based on operator coercivity, variational formulations, and non-asymptotic perturbation theory. Uses Sobolev norm residual minimization, deterministic stability bounds, and probabilistic concentration via McDiarmid’s inequality.

Result: Proves convergence in energy and uniform norms under mild regularity, provides sample complexity guarantees, and establishes unified generalization bounds linking residual consistency, projection error, and perturbation sensitivity.

Conclusion: The framework identifies key structural principles (operator coercivity, activation smoothness, sampling admissibility) for robust PINN training and offers principled guidance for PDE-informed learning system design and analysis.

Abstract: We present a unified theoretical framework for analyzing the stability and consistency of Physics-Informed Neural Networks (PINNs), grounded in operator coercivity, variational formulations, and non-asymptotic perturbation theory. PINNs approximate solutions to partial differential equations (PDEs) by minimizing residual losses over sampled collocation and boundary points. We formalize both operator-level and variational notions of consistency, proving that residual minimization in Sobolev norms leads to convergence in energy and uniform norms under mild regularity. Deterministic stability bounds quantify how bounded perturbations to the network outputs propagate through the full composite loss, while probabilistic concentration results via McDiarmid’s inequality yield sample complexity guarantees for residual-based generalization. A unified generalization bound links residual consistency, projection error, and perturbation sensitivity. Empirical results on elliptic, parabolic, and nonlinear PDEs confirm the predictive accuracy of our theoretical bounds across regimes. The framework identifies key structural principles, such as operator coercivity, activation smoothness, and sampling admissibility, that underlie robust and generalizable PINN training, offering principled guidance for the design and analysis of PDE-informed learning systems.
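In one common form (the paper works in Sobolev norms with additional terms), the residual objective and the role of coercivity read:

$$
\mathcal{L}(\theta) = \big\|\mathcal{N}[u_\theta] - f\big\|_{L^2(\Omega)}^2 + \lambda\,\big\|\mathcal{B}[u_\theta] - g\big\|_{L^2(\partial\Omega)}^2 ,
$$

and if $\mathcal{N}$ is coercive, $\langle \mathcal{N}[u]-\mathcal{N}[v],\, u-v\rangle \ge c\,\|u-v\|^2$, then a small residual certifies a small solution error: $\|u_\theta - u^{*}\| \le c^{-1}\,\|\mathcal{N}[u_\theta] - f\|$.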

[320] Neural Canonical Polyadic Factorization for Traffic Analysis

Wenyu Luo, Yikai Hou, Peng Tang

Main category: cs.LG

TL;DR: Neural Canonical Polyadic Factorization (NCPF) model combines low-rank tensor decomposition with deep learning for robust traffic data imputation, outperforming state-of-the-art methods on urban traffic datasets.

DetailsMotivation: Pervasive missing data from sensor failures and heterogeneous sensing gaps hinders reliable traffic modeling in intelligent transportation systems.

Method: Embeds CP decomposition into neural architecture through learnable embedding projections, uses hierarchical feature fusion with Hadamard products for multilinear interactions, and employs MLP layers for nonlinear refinement of spatiotemporal representations.

Result: Extensive evaluations on six urban traffic datasets demonstrate NCPF’s superiority over six state-of-the-art baselines.

Conclusion: NCPF unifies CP decomposition’s interpretable factor analysis with neural network’s nonlinear expressive power, providing a principled approach for high-dimensional traffic data imputation to support transportation digital twins and adaptive traffic control.

Abstract: Modern intelligent transportation systems rely on accurate spatiotemporal traffic analysis to optimize urban mobility and infrastructure resilience. However, pervasive missing data caused by sensor failures and heterogeneous sensing gaps fundamentally hinders reliable traffic modeling. This paper proposes a Neural Canonical Polyadic Factorization (NCPF) model that synergizes low-rank tensor algebra with deep representation learning for robust traffic data imputation. The model innovatively embeds CP decomposition into neural architecture through learnable embedding projections, where sparse traffic tensors are encoded into dense latent factors across road segments, time intervals, and mobility metrics. A hierarchical feature fusion mechanism employs Hadamard products to explicitly model multilinear interactions, while stacked multilayer perceptron layers nonlinearly refine these representations to capture complex spatiotemporal couplings. Extensive evaluations on six urban traffic datasets demonstrate NCPF’s superiority over six state-of-the-art baselines. By unifying CP decomposition’s interpretable factor analysis with neural network’s nonlinear expressive power, NCPF provides a principled yet flexible approach to high-dimensional traffic data imputation, offering critical support for next-generation transportation digital twins and adaptive traffic control systems.
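A compact sketch of the described architecture (embedding rank, fusion, and MLP depth are assumptions):

```python
import torch
import torch.nn as nn

class NCPFSketch(nn.Module):
    """Learnable CP-style embeddings per tensor mode, a Hadamard-product
    multilinear core, and an MLP that nonlinearly refines the fused factors."""
    def __init__(self, n_segments, n_times, n_metrics, rank=32):
        super().__init__()
        self.seg = nn.Embedding(n_segments, rank)
        self.time = nn.Embedding(n_times, rank)
        self.metric = nn.Embedding(n_metrics, rank)
        self.mlp = nn.Sequential(nn.Linear(rank, 64), nn.ReLU(),
                                 nn.Linear(64, 1))

    def forward(self, i, j, k):  # index tensors for observed entries
        h = self.seg(i) * self.time(j) * self.metric(k)  # CP interaction
        return self.mlp(h).squeeze(-1)  # imputed traffic value
```

Training on the observed entries only and predicting the missing ones recovers the imputation setting.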

[321] Memorization in Graph Neural Networks

Adarsh Jamadandi, Jing Xu, Adam Dziedzic, Franziska Boenisch

Main category: cs.LG

TL;DR: First framework to quantify label memorization in GNNs, showing inverse relationship with graph homophily and proposing graph rewiring to reduce memorization while maintaining performance.

DetailsMotivation: Deep neural networks have been shown to memorize training data, but similar analyses for graph neural networks remain under-explored, particularly in semi-supervised node classification.

Method: Introduced NCMemo framework to quantify label memorization, analyzed GNN training dynamics and implicit bias, investigated graph rewiring as mitigation strategy.

Result: Found lower homophily significantly increases memorization; nodes with higher label inconsistency are more prone to memorization; graph rewiring effectively reduces memorization without compromising performance and lowers privacy risk.

Conclusion: Work advances understanding of GNN learning and supports more privacy-preserving GNN deployment by revealing the link between graph homophily and memorization with practical mitigation strategies.

Abstract: Deep neural networks (DNNs) have been shown to memorize their training data, yet similar analyses for graph neural networks (GNNs) remain largely under-explored. We introduce NCMemo (Node Classification Memorization), the first framework to quantify label memorization in semi-supervised node classification. We first establish an inverse relationship between memorization and graph homophily, i.e., the property that connected nodes share similar labels/features. We find that lower homophily significantly increases memorization, indicating that GNNs rely on memorization to learn less homophilic graphs. Secondly, we analyze GNN training dynamics. We find that the increased memorization in low homophily graphs is tightly coupled to the GNNs’ implicit bias on using graph structure during learning. In low homophily regimes, this structure is less informative, hence inducing memorization of the node labels to minimize training loss. Finally, we show that nodes with higher label inconsistency in their feature-space neighborhood are significantly more prone to memorization. Building on our insights into the link between graph homophily and memorization, we investigate graph rewiring as a means to mitigate memorization. Our results demonstrate that this approach effectively reduces memorization without compromising model performance. Moreover, we show that it lowers the privacy risk for previously memorized data points in practice. Thus, our work not only advances understanding of GNN learning but also supports more privacy-preserving GNN deployment.
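The graph property the analysis correlates against is simple to compute; a standard edge-homophily ratio looks like this:

```python
import numpy as np

def edge_homophily(edge_index, labels):
    """Fraction of edges whose endpoints share a label; lower values mark
    the regime where the paper finds memorization increases."""
    src, dst = edge_index  # two arrays of node indices
    return float(np.mean(labels[src] == labels[dst]))
```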

[322] Convergence of regularized agent-state-based Q-learning in POMDPs

Amit Sinha, Matthieu Geist, Aditya Mahajan

Main category: cs.LG

TL;DR: This paper analyzes the convergence of regularized agent-state-based Q-learning (RASQL) algorithms, showing they converge to fixed points of regularized MDPs under mild conditions, with empirical validation.

DetailsMotivation: To understand the convergence properties of practical Q-learning algorithms that use agent states (not belief states) and policy regularization for exploration and stability.

Method: Theoretical analysis of RASQL algorithms, including variants with periodic policies, under mild technical conditions, with numerical examples for empirical validation.

Result: RASQL converges to the fixed point of an appropriately defined regularized MDP that depends on the behavioral policy’s stationary distribution, with similar results for periodic policy variants.

Conclusion: The framework provides theoretical guarantees for practical Q-learning convergence, matching empirical observations and supporting the use of agent states and regularization in reinforcement learning.

Abstract: In this paper, we present a framework to understand the convergence of commonly used Q-learning reinforcement learning algorithms in practice. Two salient features of such algorithms are: (i) the Q-table is recursively updated using an agent state (such as the state of a recurrent neural network) which is not a belief state or an information state and (ii) policy regularization is often used to encourage exploration and stabilize the learning algorithm. We investigate the simplest form of such Q-learning algorithms which we call regularized agent-state-based Q-learning (RASQL) and show that it converges under mild technical conditions to the fixed point of an appropriately defined regularized MDP, which depends on the stationary distribution induced by the behavioral policy. We also show that a similar analysis continues to work for a variant of RASQL that learns periodic policies. We present numerical examples to illustrate that the empirical convergence behavior matches with the proposed theoretical limit.
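A tabular sketch of one RASQL-style update with entropy (soft-max) regularization over agent states (the paper's regularizers and technical conditions are more general):

```python
import numpy as np
from scipy.special import logsumexp

def rasql_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99, tau=0.5):
    """Soft Q-learning step indexed by agent state s (not a belief state);
    tau controls the strength of the entropy regularization."""
    soft_v = tau * logsumexp(Q[s_next] / tau)  # regularized state value
    Q[s, a] += alpha * (r + gamma * soft_v - Q[s, a])
    return Q
```

The convergence result says such iterates approach the fixed point of a regularized MDP tied to the behavioral policy's stationary distribution.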

[323] Equivariant U-Shaped Neural Operators for the Cahn-Hilliard Phase-Field Model

Xiao Xue, M. F. P. ten Eikelder, Tianyue Yang, Yiqing Li, Kan He, Shuo Wang, Peter V. Coveney

Main category: cs.LG

TL;DR: E-UNO: an equivariant U-shaped neural operator that accurately predicts phase separation dynamics using short history data, outperforming existing methods with better generalization and physical consistency.

DetailsMotivation: Traditional numerical solvers for Cahn-Hilliard equation are computationally expensive and inflexible across varying conditions. Current neural operators fail to capture multiscale behavior and neglect physical symmetries.

Method: An equivariant U-shaped neural operator (E-UNO) that combines global spectral convolution with multi-resolution architecture, regulates translation equivariance, and learns from short histories of past dynamics.

Result: E-UNO outperforms standard Fourier neural operator and U-shaped neural operator baselines, particularly on fine-scale and high-frequency structures. It generalizes better, requires less training data, and yields physically consistent dynamics.

Conclusion: E-UNO establishes an efficient surrogate for complex phase-field systems by encoding symmetry and scale hierarchy, providing accurate predictions across space and time.

Abstract: Phase separation in binary mixtures, governed by the Cahn-Hilliard equation, plays a central role in interfacial dynamics across materials science and soft matter. While numerical solvers are accurate, they are often computationally expensive and lack flexibility across varying initial conditions and geometries. Neural operators provide a data-driven alternative by learning solution operators between function spaces, but current architectures often fail to capture multiscale behavior and neglect underlying physical symmetries. Here we show that an equivariant U-shaped neural operator (E-UNO) can learn the evolution of the phase-field variable from short histories of past dynamics, achieving accurate predictions across space and time. The model combines global spectral convolution with a multi-resolution U-shaped architecture and regulates translation equivariance to align with the underlying physics. E-UNO outperforms standard Fourier neural operator and U-shaped neural operator baselines, particularly on fine-scale and high-frequency structures. By encoding symmetry and scale hierarchy, the model generalizes better, requires less training data, and yields physically consistent dynamics. This establishes E-UNO as an efficient surrogate for complex phase-field systems.
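The global spectral convolution the architecture builds on can be sketched as an FNO-style layer (a generic building block, not E-UNO's full equivariant U-shaped stack):

```python
import torch
import torch.nn as nn

class SpectralConv1d(nn.Module):
    """Mix the lowest Fourier modes with learned complex weights;
    requires modes <= n_grid // 2 + 1."""
    def __init__(self, channels, modes):
        super().__init__()
        self.modes = modes
        self.weight = nn.Parameter(
            torch.randn(channels, channels, modes, dtype=torch.cfloat) / channels)

    def forward(self, x):  # x: (batch, channels, n_grid)
        x_ft = torch.fft.rfft(x)
        out = torch.zeros_like(x_ft)
        out[:, :, :self.modes] = torch.einsum(
            "bim,iom->bom", x_ft[:, :, :self.modes], self.weight)
        return torch.fft.irfft(out, n=x.size(-1))
```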

[324] SimpleTIR: End-to-End Reinforcement Learning for Multi-Turn Tool-Integrated Reasoning

Zhenghai Xue, Longtao Zheng, Qian Liu, Yingru Li, Xiaosen Zheng, Zejun Ma, Bo An

Main category: cs.LG

TL;DR: SimpleTIR is a plug-and-play algorithm that stabilizes multi-turn Tool-Integrated Reasoning training by filtering out void turns to prevent gradient explosions and performance collapse.

DetailsMotivation: Multi-turn Tool-Integrated Reasoning with RL suffers from training instability and performance collapse due to distributional drift from external tool feedback, causing catastrophic gradient norm explosions.

Method: SimpleTIR identifies and filters out trajectories containing void turns (turns without code blocks or final answers) to block harmful high-magnitude gradients during policy updates.

Result: Achieves state-of-the-art performance on math reasoning benchmarks, elevating AIME24 score from 22.1 to 50.5 with Qwen2.5-7B base model, while enabling discovery of diverse reasoning patterns like self-correction.

Conclusion: SimpleTIR effectively stabilizes multi-turn TIR training by addressing gradient explosion issues through void turn filtering, enabling robust RL-based tool integration without supervised fine-tuning constraints.

Abstract: Large Language Models (LLMs) can significantly improve their reasoning capabilities by interacting with external tools, a paradigm known as Tool-Integrated Reasoning (TIR). However, extending TIR to multi-turn scenarios using Reinforcement Learning (RL) is often hindered by training instability and performance collapse. We identify that such instability is primarily caused by a distributional drift from external tool feedback, leading to the generation of low-probability tokens. This issue compounds over successive turns, causing catastrophic gradient norm explosions that derail the training process. To address this challenge, we introduce SimpleTIR, a plug-and-play algorithm that stabilizes multi-turn TIR training. Its core strategy is to identify and filter out trajectories containing void turns, i.e., turns that yield neither a code block nor a final answer. By removing these problematic trajectories from the policy update, SimpleTIR effectively blocks the harmful, high-magnitude gradients, thus stabilizing the learning dynamics. Extensive experiments show that SimpleTIR achieves state-of-the-art performance on challenging math reasoning benchmarks, notably elevating the AIME24 score from a text-only baseline of 22.1 to 50.5 when starting from the Qwen2.5-7B base model. Furthermore, by avoiding the constraints of supervised fine-tuning, SimpleTIR encourages the model to discover diverse and sophisticated reasoning patterns, such as self-correction and cross-validation.
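The filtering rule is simple to sketch (the turn markers below are assumptions; real rollouts may use different delimiters):

```python
CODE_FENCE = "`" * 3  # literal triple backtick, built up to keep this snippet clean

def is_void_turn(turn: str) -> bool:
    # A void turn yields neither a code block nor a final answer.
    return CODE_FENCE not in turn and "final answer" not in turn.lower()

def filter_for_policy_update(trajectories):
    # Drop whole trajectories containing any void turn, blocking the
    # high-magnitude gradients they would otherwise contribute.
    return [traj for traj in trajectories
            if not any(is_void_turn(t) for t in traj)]
```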

cs.MA

[325] Synthetic Founders: AI-Generated Social Simulations for Startup Validation Research in Computational Social Science

Jorn K. Teutloff

Main category: cs.MA

TL;DR: Comparative study shows LLM-generated synthetic personas capture some human themes but miss key relational and experiential aspects, suggesting they complement rather than replace human subject research.

DetailsMotivation: To evaluate the fidelity and limitations of AI-generated personas in simulating human perspectives, particularly in startup validation contexts, by comparing them with real human interview data.

Method: Conducted interviews with 15 startup founders about AI-powered validation, replicated the same protocol with AI-generated founder and investor personas, and performed structured thematic synthesis to identify convergent and divergent themes.

Result: Found four categories: convergent themes (commitment signals, trust barriers, efficiency), partial overlaps (different concerns about validation), human-only themes (relational value, market skepticism), and synthetic-only themes (amplified false positives, trauma blind spots).

Conclusion: LLM-driven personas represent a hybrid social simulation - more expressive than rule-based agents but limited by lack of lived experience; they complement empirical studies by extending hypothesis space and clarifying cognitive realism boundaries.

Abstract: We present a comparative docking experiment that aligns human-subject interview data with large language model (LLM)-driven synthetic personas to evaluate fidelity, divergence, and blind spots in AI-enabled simulation. Fifteen early-stage startup founders were interviewed about their hopes and concerns regarding AI-powered validation, and the same protocol was replicated with AI-generated founder and investor personas. A structured thematic synthesis revealed four categories of outcomes: (1) Convergent themes - commitment-based demand signals, black-box trust barriers, and efficiency gains were consistently emphasized across both datasets; (2) Partial overlaps - founders worried about outliers being averaged away and the stress of real customer validation, while synthetic personas highlighted irrational blind spots and framed AI as a psychological buffer; (3) Human-only themes - relational and advocacy value from early customer engagement and skepticism toward moonshot markets; and (4) Synthetic-only themes - amplified false positives and trauma blind spots, where AI may overstate adoption potential by missing negative historical experiences. We interpret this comparative framework as evidence that LLM-driven personas constitute a form of hybrid social simulation: more linguistically expressive and adaptable than traditional rule-based agents, yet bounded by the absence of lived history and relational consequence. Rather than replacing empirical studies, we argue they function as a complementary simulation category - capable of extending hypothesis space, accelerating exploratory validation, and clarifying the boundaries of cognitive realism in computational social science.

[326] Automatic Differentiation of Agent-Based Models

Arnau Quera-Bofarull, Nicholas Bishop, Joel Dyer, Daniel Jarne Ornia, Anisoara Calinescu, Doyne Farmer, Michael Wooldridge

Main category: cs.MA

TL;DR: Using automatic differentiation (AD) with agent-based models (ABMs) enables efficient parameter calibration and sensitivity analysis through variational inference, reducing computational burdens for large-scale complex systems.

DetailsMotivation: ABMs are computationally demanding and parameter-heavy, hindering their widespread adoption for simulating complex systems with thousands/millions of agents like epidemics or financial markets.

Method: Apply automatic differentiation (AD) techniques to ABMs to make simulator gradients available, enabling variational inference (VI) for efficient parameter calibration.

Result: Experiments show substantial performance improvements and computational savings using VI on three ABMs: Axtell’s firm model, Sugarscape, and SIR epidemiological model.

Conclusion: AD significantly enhances the practicality and scalability of ABMs for studying complex systems by alleviating computational burdens and facilitating calibration tasks.

Abstract: Agent-based models (ABMs) simulate complex systems by capturing the bottom-up interactions of individual agents comprising the system. Many complex systems of interest, such as epidemics or financial markets, involve thousands or even millions of agents. Consequently, ABMs often become computationally demanding and rely on the calibration of numerous free parameters, which has significantly hindered their widespread adoption. In this paper, we demonstrate that automatic differentiation (AD) techniques can effectively alleviate these computational burdens. By applying AD to ABMs, the gradients of the simulator become readily available, greatly facilitating essential tasks such as calibration and sensitivity analysis. Specifically, we show how AD enables variational inference (VI) techniques for efficient parameter calibration. Our experiments demonstrate substantial performance improvements and computational savings using VI on three prominent ABMs: Axtell’s model of firms; Sugarscape; and the SIR epidemiological model. Our approach thus significantly enhances the practicality and scalability of ABMs for studying complex systems.
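The core trick is easy to see on a tiny differentiable compartment model (a smooth surrogate of an ABM; real ABMs need relaxations of discrete randomness):

```python
import torch

def sir_final_recovered(beta, gamma, steps=100, dt=0.1):
    # Continuous-state SIR rollout; autodiff gives exact simulator gradients.
    s = torch.tensor(0.99); i = torch.tensor(0.01); r = torch.tensor(0.0)
    for _ in range(steps):
        new_inf, new_rec = beta * s * i * dt, gamma * i * dt
        s, i, r = s - new_inf, i + new_inf - new_rec, r + new_rec
    return r

beta = torch.tensor(0.5, requires_grad=True)
gamma = torch.tensor(0.1, requires_grad=True)
loss = (sir_final_recovered(beta, gamma) - 0.8) ** 2  # calibrate to a target
loss.backward()
print(beta.grad, gamma.grad)  # gradients that drive VI-based calibration
```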

[327] Distributed Online Task Assignment via Inexact ADMM for unplanned online tasks and its Applications to Security

Ziqi Yang, Roberto Tron

Main category: cs.MA

TL;DR: A distributed task assignment framework for multi-robot systems that handles security-critical tasks and online tasks while maintaining security guarantees against plan-deviation attacks.

DetailsMotivation: Efficient task assignment is crucial for multi-robot systems to coordinate agents, ensure mission success, and maintain overall system security, especially when dealing with security-critical tasks and potential attacks.

Method: Proposes an optimization-based distributed task assignment algorithm using inexact ADMM to decompose problems into separable/non-separable subproblems. Also develops a comprehensive framework with security analysis, task allocation using the proposed algorithm, and CLF/CBF-based controllers for task fulfillment and security enforcement.

Result: Through simulations, the framework demonstrates that multi-robot systems can effectively respond to unplanned online tasks while maintaining security guarantees.

Conclusion: The proposed approach enables multi-robot systems to handle both security-critical tasks and online tasks dynamically while ensuring security against plan-deviation attacks through a combination of distributed optimization and control-theoretic security measures.

Abstract: In multi-robot system (MRS) applications, efficient task assignment is essential not only for coordinating agents and ensuring mission success but also for maintaining overall system security. In this work, we first propose an optimization-based distributed task assignment algorithm that dynamically assigns mandatory security-critical tasks and optional tasks among teams. Leveraging an inexact Alternating Direction Method of Multipliers (ADMM)-based approach, we decompose the task assignment problem into separable and non-separable subproblems. The non-separable subproblems are transformed into an inexact ADMM update by projected gradient descent, which can be performed through several communication steps within the team. In the second part of this paper, we formulate a comprehensive framework that enables MRS under plan-deviation attacks to handle online tasks without compromising security. The process begins with a security analysis that determines whether an online task can be executed securely by a robot and, if so, the required time and location for the robot to rejoin the team. Next, the proposed task assignment algorithm is used to allocate security-related tasks and verified online tasks. Finally, task fulfillment is managed using a Control Lyapunov Function (CLF)-based controller, while security enforcement is ensured through a Control Barrier Function (CBF)-based security filter. Through simulations, we demonstrate that the proposed framework allows MRS to effectively respond to unplanned online tasks while maintaining security guarantees.
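
As a rough illustration of the inexact update described above, the sketch below (our construction with illustrative names, not the paper's algorithm) solves one ADMM x-subproblem approximately with a few projected-gradient steps, projecting assignment weights onto the probability simplex.

```python
# Sketch: one inexact ADMM x-update solved by projected gradient descent.
import numpy as np

def project_simplex(v):
    """Euclidean projection of v onto the probability simplex."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u) - 1.0
    idx = np.arange(1, len(v) + 1)
    rho = np.nonzero(u - css / idx > 0)[0][-1]
    theta = css[rho] / (rho + 1.0)
    return np.maximum(v - theta, 0.0)

def inexact_x_update(x, z, lam, grad_f, rho=1.0, lr=0.1, inner_steps=5):
    """Approximate argmin_x f(x) + (rho/2)||x - z + lam||^2.
    A handful of projected-gradient steps replaces an exact solve,
    which is what makes the ADMM iteration 'inexact'."""
    for _ in range(inner_steps):
        g = grad_f(x) + rho * (x - z + lam)
        x = project_simplex(x - lr * g)
    return x

x = inexact_x_update(np.ones(4) / 4, np.zeros(4), np.zeros(4),
                     grad_f=lambda x: 2 * x)   # toy quadratic objective
```

In the paper's setting, the inner steps correspond to communication rounds within a team, which is why an approximate subproblem solve is acceptable.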

[328] A Reliable Self-Organized Distributed Complex Network for Communication of Smart Agents

Mehdi Bakhshipoor, Yousef Azizi, Seyed Ehsan Nedaaee Oskoee

Main category: cs.MA

TL;DR: Reinforcement learning agents self-organize into communication networks using physics-guided strategies to maintain connectivity while optimizing power consumption in dynamic IoT applications.

DetailsMotivation: To address the challenge of creating robust, self-organizing communication networks for dynamic IoT systems like VANETs, where centralized control is impractical and energy efficiency is critical.

Method: Intelligent agents trained via reinforcement learning establish connections based on local observations using a physical Hamiltonian formulation, enabling decentralized network formation without centralized administration.

Result: The proposed approach successfully creates self-organized complex networks that maintain network-wide connectivity across various dynamic scenarios while optimizing average electrical power consumption.

Conclusion: Physics-guided machine learning with reinforcement learning agents provides an effective framework for building robust, energy-efficient self-organizing communication networks suitable for dynamic IoT applications.

Abstract: Collaboration is a fundamental and essential characteristic of many complex systems, ranging from ant colonies to human societies. Each component within a complex system interacts with others, even at a distance, to accomplish a given task. A network of collaboration can be defined to study the collective behavior of such systems within the framework of complex networks. The nodes in these networks may represent simple organisms or more sophisticated intelligent agents, such as humans. In this study, we utilize intelligent agents (nodes) trained through reinforcement learning techniques to establish connections with their neighbors, ultimately leading to the emergence of a large-scale communication cluster. Notably, there is no centralized administrator; instead, agents must adjust their connections based on information obtained from local observations. The connection strategy is formulated using a physical Hamiltonian, thereby categorizing this intelligent system under the paradigm of “Physics-Guided Machine Learning”. The resulting self-organized distributed complex network has numerous industrial applications, including constructing Internet of Things (IoT) networks. The design of such networks often encounters challenges, the most critical of which is ensuring effective connectivity for reliable communication while optimizing energy consumption. IoT networks are inherently dynamic in many real-world applications, such as Vehicle Ad-hoc Networks (VANETs), where nodes are mobile, and the connection topology evolves rapidly over time. These systems require a robust and rapidly self-organizing communication network. Our findings demonstrate that the proposed intelligent agents facilitate the formation of self-organized complex networks capable of maintaining network-wide connectivity across various dynamic scenarios while simultaneously optimizing average electrical power consumption.

[329] Murakkab: Resource-Efficient Agentic Workflow Orchestration in Cloud Platforms

Gohar Irfan Chaudhry, Esha Choukse, Haoran Qiu, Íñigo Goiri, Rodrigo Fonseca, Adam Belay, Ricardo Bianchini

Main category: cs.MA

TL;DR: Murakkab is a resource-efficient serving system for agentic workflows that decouples workflow specification from execution configuration, enabling cross-layer optimization to reduce GPU usage, energy consumption, and cost while maintaining service-level objectives.

DetailsMotivation: Current frameworks for serving agentic workflows are inefficient because they expose workflows as opaque sequences of model and tool calls that tightly couple agent logic with model and hardware choices, preventing systems from reasoning about trade-offs across accuracy, latency, energy, and cost.

Method: Murakkab introduces a declarative abstraction that decouples workflow specification from execution configuration, using a profile-guided optimizer and adaptive runtime to jointly manage the full stack - orchestrating workflow components, mapping them to models and hardware, and dynamically reconfiguring execution to satisfy user-defined SLOs.

Result: Evaluation shows Murakkab reduces GPU usage by up to 2.8×, energy consumption by 3.7×, and cost by 4.3× while maintaining SLOs.

Conclusion: By exposing the internal structure of agentic workflows, Murakkab enables cross-layer optimization that existing frameworks and cloud schedulers cannot achieve, making it a highly efficient serving system for agentic workflows.

Abstract: Agentic workflows commonly coordinate multiple models and tools with complex control logic. They are quickly becoming the dominant paradigm for AI applications. However, serving them remains inefficient with today’s frameworks. The key problem is that they expose workflows as opaque sequences of model and tool calls that tightly couple agent logic with model and hardware choices. Often, these workflow components are fragmented across different entities, preventing systems from reasoning about trade-offs across accuracy, latency, energy, and cost. This leads to resource waste and degraded service-level objectives (SLOs). We present Murakkab, a resource-efficient serving system for agentic workflows. Murakkab introduces a declarative abstraction that decouples workflow specification from execution configuration. A profile-guided optimizer and adaptive runtime jointly manage the full stack: orchestrating workflow components, mapping them to models and hardware, and dynamically reconfiguring execution to satisfy user-defined SLOs. By exposing the internal structure of agentic workflows, Murakkab enables cross-layer optimization that existing frameworks and cloud schedulers cannot achieve. Our evaluation on diverse workflows shows that Murakkab reduces GPU usage by up to 2.8$\times$, energy consumption by 3.7$\times$, and cost by 4.3$\times$ while maintaining SLOs.
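
The declarative abstraction can be pictured with a toy example. The schema below is hypothetical (Murakkab's real interface is not given in the abstract), but it shows the key point: the workflow spec names abstract capabilities and SLOs, while a separate, runtime-owned configuration binds steps to concrete models and hardware.

```python
# Hypothetical illustration (not Murakkab's actual API): a workflow spec
# decoupled from its execution configuration.
workflow_spec = {
    "name": "report-writer",
    "steps": [
        {"id": "search", "kind": "tool", "capability": "web_search"},
        {"id": "draft", "kind": "model", "capability": "text_generation",
         "depends_on": ["search"]},
        {"id": "critique", "kind": "model", "capability": "text_generation",
         "depends_on": ["draft"]},
    ],
    "slo": {"p95_latency_s": 30, "max_cost_usd": 0.05},
}

# A profile-guided optimizer/runtime (not shown) owns this binding and may
# remap it at run time, e.g. swapping models or GPU types to meet the SLOs.
execution_config = {
    "draft": {"model": "llama-3-8b", "hardware": "A100"},
    "critique": {"model": "llama-3-70b", "hardware": "H100"},
}
```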

cs.MM

[330] Simulacra Naturae: Generative Ecosystem driven by Agent-Based Simulations and Brain Organoid Collective Intelligence

Nefeli Manoudaki, Mert Toka, Iason Paterakis, Diarmid Flatley

Main category: cs.MM

TL;DR: Simulacra Naturae is a media installation that uses brain organoid neural activity to create a multi-sensory environment with generative visuals, spatial audio, living plants, and clay artifacts, treating biosignals as co-creative forces rather than direct inputs.

DetailsMotivation: To explore collective care through the entanglement of biological computation, material ecologies, and generative systems, and to reimagine visualization as a practice of care that decentralizes human agency.

Method: Translates pre-recorded neural activity from brain organoids into a multi-sensory environment using real-time systems, generative AI visuals, spatial audio, living plants, and computationally fabricated clay prints with solenoids.

Result: Creates a sensory field shaped by nonhuman cognition that invites participants into an embodied experience, modulating emergent agent behaviors inspired by natural systems like termite colonies and slime molds.

Conclusion: The work successfully grounds abstract data in living materials to open new spaces for ethics, empathy, and ecological attunement within hybrid computational systems, demonstrating a novel approach to visualization as care practice.

Abstract: Simulacra Naturae is a data-driven media installation that explores collective care through the entanglement of biological computation, material ecologies, and generative systems. The work translates pre-recorded neural activity from brain organoids, lab-grown three-dimensional clusters of neurons, into a multi-sensory environment composed of generative visuals, spatial audio, living plants, and fabricated clay artifacts. These biosignals, streamed through a real-time system, modulate emergent agent behaviors inspired by natural systems such as termite colonies and slime molds. Rather than using biosignals as direct control inputs, Simulacra Naturae treats organoid activity as a co-creative force, allowing neural rhythms to guide the growth, form, and atmosphere of a generative ecosystem. The installation features computationally fabricated clay prints embedded with solenoids, adding physical sound resonances to the generative surround composition. The spatial environment, filled with live tropical plants and a floor-level projection layer featuring real-time generative AI visuals, invites participants into a sensory field shaped by nonhuman cognition. By grounding abstract data in living materials and embodied experience, Simulacra Naturae reimagines visualization as a practice of care, one that decentralizes human agency and opens new spaces for ethics, empathy, and ecological attunement within hybrid computational systems.

[331] Automatically Generating High-Precision Simulated Road Networking in Traffic Scenario

Liang Xie, Wenke Huang

Main category: cs.MM

TL;DR: Automated generation of high-precision lane-level simulation road networks using street view data and deep learning to reduce manual effort and costs.

DetailsMotivation: Existing lane-level simulation road network generation is labor-intensive, resource-demanding, and costly due to large-scale data collection and manual post-editing requirements.

Method: Collect real-world street view data, build lane line dataset, use deep learning for end-to-end lane detection, integrate coordinate transformation and map matching algorithms to fuse lane information with road topology from open-source maps.

Result: Significantly reduces data collection and manual editing costs while improving efficiency and accuracy of simulation road network generation.

Conclusion: Provides reliable data support for urban traffic simulation, autonomous driving navigation, and intelligent transportation systems, offering a novel automated pathway for large-scale urban road network modeling.

Abstract: Existing lane-level simulation road network generation is labor-intensive, resource-demanding, and costly due to the need for large-scale data collection and manual post-editing. To overcome these limitations, we propose automatically generating high-precision simulated road networks in traffic scenarios, an efficient and fully automated solution. Initially, real-world road street view data is collected through open-source street view map platforms, and a large-scale street view lane line dataset is constructed to provide a robust foundation for subsequent analysis. Next, an end-to-end lane line detection approach based on deep learning is designed, where a neural network model is trained to accurately detect the number and spatial distribution of lane lines in street view images, enabling automated extraction of lane information. Subsequently, by integrating coordinate transformation and map matching algorithms, the extracted lane information from street views is fused with the foundational road topology obtained from open-source map service platforms, resulting in the generation of a high-precision lane-level simulation road network. This method significantly reduces the costs associated with data collection and manual editing while enhancing the efficiency and accuracy of simulation road network generation. It provides reliable data support for urban traffic simulation, autonomous driving navigation, and the development of intelligent transportation systems, offering a novel technical pathway for the automated modeling of large-scale urban road networks.

[332] Iola Walker: A Mobile Footfall Detection System for Music Composition

William B. James

Main category: cs.MM

TL;DR: Development of Iola Walker - a wearable music playback system that adapts music based on listener’s gait, aiming to create a new preferred music experience medium with potential social benefits.

DetailsMotivation: To find a method for materially enhancing music through hardware/software, discover a new wearable music medium that listeners prefer over current options, and potentially address societal problems in the entertainment industry through prosocial reform.

Method: Created Iola Walker - a music playback system that allows musicians to compose music that dynamically changes according to the listener’s walking gait patterns, using wearable device technology.

Result: A functional music playback system infrastructure has been developed that can adapt music in real-time based on the listener’s physical movement (gait), with artifacts available on GitHub.

Conclusion: The research presents a novel approach to music experience through gait-responsive wearable technology, suggesting this could represent a new preferred medium for music consumption with potential positive societal impacts on the music industry.

Abstract: This outing is part of a larger music technology research project. The objective is to find a method for materially enhancing music using hardware and software. There is a strong likelihood that there exists a new medium for experiencing music via a wearable device that ordinary listeners prefer over the current state of the art. If such a medium is discovered, it is a step towards altruistic, prosocial reform in the music industry. A new playback system infrastructure has a chance to soothe some of the societal problems tied to the larger entertainment industry ecosystem. Iola Walker is a music playback system that allows musicians to compose music that changes in accordance with the listener’s gait. Artifacts are available here: https://github.com/willbjames/iolawalker

[333] In-place Double Stimulus Methodology for Subjective Assessment of High Quality Images

Shima Mohammadi, Mohsen Jenadeleh, Michela Testolina, Jon Sneyers, Touradj Ebrahimi, Dietmar Saupe, João Ascenso

Main category: cs.MM

TL;DR: Novel double stimulus method (IDSQS) for high-quality image assessment, with crowdsourced dataset and Beta distribution modeling of quality scores.

DetailsMotivation: Address limitations of existing protocols in detecting subtle perceptual differences in high-quality images.

Method: In-place Double Stimulus Quality Scale (IDSQS) allows alternating reference/distorted image viewing at same location. Large-scale crowdsourcing study with comprehensive dataset.

Result: High correlation with precise subjective benchmarks. Effective for detecting differences at high to visually lossless quality levels.

Conclusion: IDSQS methodology is effective for high-quality image evaluation, with publicly available dataset and tools.

Abstract: This paper introduces a novel double stimulus subjective assessment methodology for the evaluation of high quality images to address the limitations of existing protocols in detecting subtle perceptual differences. The In-place Double Stimulus Quality Scale (IDSQS) allows subjects to alternately view a reference and a distorted image at the same spatial location, facilitating a more intuitive detection of differences in quality, especially at high to visually lossless quality levels. A large-scale crowdsourcing study employing this methodology was conducted, generating a comprehensive public dataset to evaluate perceived image quality across several compression algorithms and distortion levels. An additional contribution is the modeling of quality scores using a Beta distribution, allowing for the assessment of variability and subject consistency. Our findings demonstrate the effectiveness of the IDSQS methodology in achieving high correlation with more precise subjective evaluation benchmarks. The dataset, subjective data, and graphical user interface developed for this study are publicly available at https://github.com/shimamohammadi/IDSQS
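
For readers who want to reproduce the Beta modeling step, here is a minimal sketch (with random stand-in ratings; the real scores come from the released dataset) using scipy's maximum-likelihood fit with the support fixed to [0, 1].

```python
# Sketch: fit a Beta distribution to normalized quality scores in [0, 1].
import numpy as np
from scipy import stats

scores = np.clip(np.random.normal(0.85, 0.08, 200), 0.01, 0.99)  # stand-in ratings
a, b, loc, scale = stats.beta.fit(scores, floc=0, fscale=1)       # fix support to [0, 1]
mean = a / (a + b)
var = a * b / ((a + b) ** 2 * (a + b + 1))
print(f"alpha={a:.2f}, beta={b:.2f}, mean={mean:.3f}, var={var:.4f}")
```

The fitted variance gives a compact handle on rating variability, and per-subject fits can be compared to gauge subject consistency.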

eess.AS

[334] Gaussian Process Regression of Steering Vectors With Physics-Aware Deep Composite Kernels for Augmented Listening

Diego Di Carlo, Shoichi Koyama, Aditya Arie Nugraha, Mathieu Fontaine, Yoshiaki Bando, Kazuyoshi Yoshii

Main category: eess.AS

TL;DR: Proposes a physics-aware Gaussian process with neural field representation for continuous steering vector modeling, achieving 10x data efficiency in sound field applications.

DetailsMotivation: Traditional steering vector representations cannot handle sound scattering effects, and deterministic super-resolution methods suffer from overfitting due to non-uniform measurement uncertainty.

Method: Integrates neural field representation into Gaussian process framework with physics-aware composite kernel that models directional waves and scattering effects.

Result: Attains oracle performance with fewer than one tenth of the measurements in speech enhancement and binaural rendering tasks using SPEAR challenge data.

Conclusion: The proposed probabilistic framework effectively addresses overfitting and enables precise sound field control with significantly reduced measurement requirements.

Abstract: This paper investigates continuous representations of steering vectors over frequency and position of microphone and source for augmented listening (e.g., spatial filtering and binaural rendering) with precise control of the sound field perceived by the user. Steering vectors have typically been used for representing the spatial characteristics of the sound field as a function of the listening position. The basic algebraic representation of steering vectors assuming an idealized environment cannot deal with the scattering effect of the sound field. One may thus collect a discrete set of real steering vectors measured in dedicated facilities and super-resolve (i.e., upsample) them. Recently, physics-aware deep learning methods have been effectively used for this purpose. Such deterministic super-resolution, however, suffers from the overfitting problem due to the non-uniform uncertainty over the measurement space. To solve this problem, we integrate an expressive representation based on the neural field (NF) into the principled probabilistic framework based on the Gaussian process (GP). Specifically, we propose a physics-aware composite kernel that models the directional incoming waves and the subsequent scattering effect. Our comprehensive comparative experiment showed the effectiveness of the proposed method under data insufficiency conditions. In downstream tasks such as speech enhancement and binaural rendering using the simulated data of the SPEAR challenge, oracle performance was attained with fewer than one tenth of the measurements.
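
A schematic reading of the composite kernel (our construction from the abstract, not the authors' exact formulation): a plane-wave term concentrated on the direct path plus an isotropic sinc term for the diffuse/scattered field, both standard positive-definite sound-field kernels.

```python
# Sketch: physics-aware composite kernel over microphone positions.
import numpy as np

C = 343.0  # speed of sound in air [m/s]

def diffuse_kernel(x1, x2, f):
    """Isotropic (diffuse/scattering) kernel: sinc(k r) with k = 2*pi*f/C."""
    k = 2.0 * np.pi * f / C
    r = np.linalg.norm(x1 - x2)
    return np.sinc(k * r / np.pi)       # np.sinc(t) = sin(pi t) / (pi t)

def plane_wave_kernel(x1, x2, f, doa):
    """Coherent plane wave arriving from the unit direction `doa`."""
    k = 2.0 * np.pi * f / C
    return np.cos(k * np.dot(doa, x1 - x2))

def composite_kernel(x1, x2, f, doa, w_dir=0.7, w_dif=0.3):
    """Weighted sum of direct-path and scattering contributions."""
    return (w_dir * plane_wave_kernel(x1, x2, f, doa)
            + w_dif * diffuse_kernel(x1, x2, f))
```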

[335] IS${}^3$ : Generic Impulsive–Stationary Sound Separation in Acoustic Scenes using Deep Filtering

Clémentine Berger, Paraskevas Stamadiatis, Roland Badeau, Slim Essid

Main category: eess.AS

TL;DR: IS³ neural network for separating impulsive acoustic events from stationary backgrounds, outperforming existing methods on audio separation tasks.

DetailsMotivation: Need for audio systems that can differentiate between stationary backgrounds and isolated acoustic events for applications like adaptive audio rendering, noise suppression, and acoustic event classification.

Method: Deep filtering neural network (IS³) trained with sophisticated data generation pipeline that curates and adapts existing datasets for impulsive-stationary sound separation.

Result: Learning-based approach with lightweight neural architecture successfully addresses this previously unaddressed task, outperforming Harmonic-Percussive Sound Separation and wavelet filtering methods.

Conclusion: Well-designed neural network architecture with varied training data can effectively separate impulsive acoustic events from stationary backgrounds, serving as effective pre-processing for various audio applications.

Abstract: We are interested in audio systems capable of performing a differentiated processing of stationary backgrounds and isolated acoustic events within an acoustic scene, whether for applying specific processing methods to each part or for focusing solely on one while ignoring the other. Such systems have applications in real-world scenarios, including robust adaptive audio rendering systems (e.g., EQ or compression), plosive attenuation in voice mixing, noise suppression or reduction, robust acoustic event classification or even bioacoustics. To this end, we introduce IS${}^3$, a neural network designed for Impulsive–Stationary Sound Separation, that isolates impulsive acoustic events from the stationary background using a deep filtering approach that can act as a pre-processing stage for the above-mentioned tasks. To ensure optimal training, we propose a sophisticated data generation pipeline that curates and adapts existing datasets for this task. We demonstrate that a learning-based approach, built on a relatively lightweight neural architecture and trained with well-designed and varied data, is successful in this previously unaddressed task, outperforming the Harmonic–Percussive Sound Separation masking method, adapted from music signal processing research, and wavelet filtering on objective separation metrics.

[336] Speech Intelligibility Assessment with Uncertainty-Aware Whisper Embeddings and sLSTM

Ryandhimas E. Zezario, Dyah A. M. G. Wisnu, Hsin-Min Wang, Yu Tsao

Main category: eess.AS

TL;DR: iMTI-Net improves speech intelligibility prediction using uncertainty-aware Whisper embeddings and CNN-sLSTM architecture in multitask learning framework.

DetailsMotivation: Non-intrusive speech intelligibility prediction is challenging due to variability in speakers, noise conditions, and subjective perception.

Method: Uses uncertainty-aware Whisper embeddings with statistical features (mean, std, entropy) and scalar LSTM for sequential modeling. Proposes iMTI-Net with CNN-sLSTM architecture for multitask learning of human intelligibility scores and machine WER.

Result: iMTI-Net outperforms original MTI-Net across multiple evaluation metrics.

Conclusion: Incorporating uncertainty-aware features and CNN-sLSTM architecture effectively improves speech intelligibility prediction performance.

Abstract: Non-intrusive speech intelligibility prediction remains challenging due to variability in speakers, noise conditions, and subjective perception. We propose an uncertainty-aware approach that leverages Whisper embeddings in combination with statistical features, specifically the mean, standard deviation, and entropy computed across the embedding dimensions. The entropy, computed via a softmax over the feature dimension, serves as a proxy for uncertainty, complementing global information captured by the mean and standard deviation. To model the sequential structure of speech, we adopt a scalar long short-term memory (sLSTM) network, which efficiently captures long-range dependencies. Building on this foundation, we propose iMTI-Net, an improved multi-target intelligibility prediction network that integrates convolutional neural network (CNN) and sLSTM components within a multitask learning framework. It jointly predicts human intelligibility scores and machine-based word error rates (WER) from Google ASR and Whisper. Experimental results show that iMTI-Net outperforms the original MTI-Net across multiple evaluation metrics, demonstrating the effectiveness of incorporating uncertainty-aware features and the CNN-sLSTM architecture.
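
The statistical features described above are simple to compute. A minimal sketch (shapes assumed: T frames by D embedding dimensions, with random stand-ins for Whisper features):

```python
# Sketch: mean, standard deviation, and softmax-entropy (uncertainty proxy)
# over frame-level embeddings.
import torch
import torch.nn.functional as F

def uncertainty_features(emb):              # emb: (T, D) frame embeddings
    mean = emb.mean(dim=0)                  # (D,) global average
    std = emb.std(dim=0)                    # (D,) global spread
    probs = F.softmax(emb, dim=-1)          # softmax over the feature dimension
    entropy = -(probs * probs.clamp_min(1e-12).log()).sum(dim=-1)  # (T,) per frame
    return mean, std, entropy

emb = torch.randn(300, 512)                 # stand-in for Whisper features
mean, std, ent = uncertainty_features(emb)
```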

[337] An Overview of Recent Progress in Non-Intrusive Speech Intelligibility Prediction for Hearing Aids

Ryandhimas E. Zezario

Main category: eess.AS

TL;DR: Overview of recent advances in non-intrusive speech intelligibility prediction for hearing aids, covering feature extraction, hearing loss modeling, sequence processing architectures, adaptation strategies, and current challenges.

DetailsMotivation: To provide a comprehensive perspective on current trends and challenges in developing practical and reliable hearing aid-oriented intelligibility prediction systems, addressing the need for robust performance in various acoustic environments.

Method: Summarizes developments in robust acoustic feature extraction, hearing loss modeling, emerging architectures for long-sequence processing, listener-specific adaptation strategies, and domain generalization approaches.

Result: The paper offers a systematic overview of the field’s progress while identifying remaining challenges such as the need for large-scale diverse datasets and reliable cross-profile generalization.

Conclusion: The review highlights both current achievements and ongoing challenges, pointing toward future directions for developing more practical and reliable intelligibility prediction systems for hearing aids.

Abstract: This paper provides an overview of recent progress in non-intrusive speech intelligibility prediction for hearing aids (HA). We summarize developments in robust acoustic feature extraction, hearing loss modeling, and the use of emerging architectures for long-sequence processing. Listener-specific adaptation strategies and domain generalization approaches that aim to improve robustness in unseen acoustic environments are also discussed. Remaining challenges, such as the need for large-scale, diverse datasets and reliable cross-profile generalization, are acknowledged. Our goal is to offer a perspective on current trends, ongoing challenges, and possible future directions toward practical and reliable HA-oriented intelligibility prediction systems.

[338] A Study on Zero-Shot Non-Intrusive Speech Intelligibility for Hearing Aids Using Large Language Models

Ryandhimas E. Zezario, Dyah A. M. G. Wisnu, Hsin-Min Wang, Yu Tsao

Main category: eess.AS

TL;DR: GPT-Whisper-HA is a zero-shot non-intrusive speech assessment model for hearing aids that extends GPT-Whisper, using LLMs with hearing loss simulations and achieves 2.59% RMSE improvement.

DetailsMotivation: To develop a zero-shot non-intrusive speech assessment system specifically designed for hearing aids that can predict subjective intelligibility for HA users without requiring labeled training data.

Method: Extends GPT-Whisper by incorporating MSBG hearing loss and NAL-R simulations based on individual audiograms, uses two ASR modules for audio-to-text representation, employs GPT-4o to predict scores, and averages them for final estimation.

Result: Achieves 2.59% relative root mean square error (RMSE) improvement over the baseline GPT-Whisper model.

Conclusion: Demonstrates the potential of large language models for zero-shot speech assessment in predicting subjective intelligibility for hearing aid users.

Abstract: This work focuses on zero-shot non-intrusive speech assessment for hearing aids (HA) using large language models (LLMs). Specifically, we introduce GPT-Whisper-HA, an extension of GPT-Whisper, a zero-shot non-intrusive speech assessment model based on LLMs. GPT-Whisper-HA is designed for speech assessment for HA, incorporating MSBG hearing loss and NAL-R simulations to process audio input based on each individual’s audiogram, two automatic speech recognition (ASR) modules for audio-to-text representation, and GPT-4o to predict two corresponding scores, followed by score averaging for the final estimated score. Experimental results indicate that GPT-Whisper-HA achieves a 2.59% relative root mean square error (RMSE) improvement over GPT-Whisper, confirming the potential of LLMs for zero-shot speech assessment in predicting subjective intelligibility for HA users.

[339] Improving Perceptual Audio Aesthetic Assessment via Triplet Loss and Self-Supervised Embeddings

Dyah A. M. G. Wisnu, Ryandhimas E. Zezario, Stefano Rini, Hsin-Min Wang, Yu Tsao

Main category: eess.AS

TL;DR: A system for multi-axis audio quality prediction using BEATs transformer with multi-branch LSTM and triplet loss, achieving domain-robust assessment without synthetic training data.

DetailsMotivation: Address domain shift between natural training data and synthetic evaluation data in AudioMOS Challenge 2025 Track 2 for predicting four audio aesthetic scores across TTS, TTA, and TTM systems.

Method: Combine BEATs pretrained transformer audio representation with multi-branch LSTM predictor and triplet loss with buffer-based sampling to structure embedding space by perceptual similarity.

Result: Improved embedding discriminability and generalization, enabling domain-robust audio quality assessment without requiring synthetic training data.

Conclusion: The proposed approach effectively handles domain shift challenges and provides robust multi-axis perceptual quality prediction for generative audio systems.

Abstract: We present a system for automatic multi-axis perceptual quality prediction of generative audio, developed for Track 2 of the AudioMOS Challenge 2025. The task is to predict four Audio Aesthetic Scores–Production Quality, Production Complexity, Content Enjoyment, and Content Usefulness–for audio generated by text-to-speech (TTS), text-to-audio (TTA), and text-to-music (TTM) systems. A main challenge is the domain shift between natural training data and synthetic evaluation data. To address this, we combine BEATs, a pretrained transformer-based audio representation model, with a multi-branch long short-term memory (LSTM) predictor and use a triplet loss with buffer-based sampling to structure the embedding space by perceptual similarity. Our results show that this improves embedding discriminability and generalization, enabling domain-robust audio quality assessment without synthetic training data.
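
A minimal sketch of the triplet objective with a buffer (the sampling policy here, nearest/farthest by perceptual score, is our illustrative assumption, not necessarily the challenge system's):

```python
# Sketch: triplet loss with a buffer of past embeddings; positives share a
# similar perceptual score, negatives a dissimilar one.
import torch

triplet = torch.nn.TripletMarginLoss(margin=0.3)
buffer_emb, buffer_score = [], []            # tensors appended during training

def buffer_triplet_loss(anchor_emb, anchor_score, min_gap=0.25):
    """Return a triplet loss for one anchor, or None if the buffer is unusable."""
    if len(buffer_emb) < 2:
        return None
    embs = torch.stack(buffer_emb)            # (N, D)
    gaps = (torch.stack(buffer_score) - anchor_score).abs()
    if gaps.max() < min_gap:                  # no sufficiently dissimilar negative
        return None
    pos = embs[gaps.argmin()]                 # perceptually closest -> positive
    neg = embs[gaps.argmax()]                 # perceptually farthest -> negative
    return triplet(anchor_emb.unsqueeze(0), pos.unsqueeze(0), neg.unsqueeze(0))
```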

[340] An Effective Strategy for Modeling Score Ordinality and Non-uniform Intervals in Automated Speaking Assessment

Tien-Hong Lo, Szu-Yu Chen, Yao-Ting Sung, Berlin Chen

Main category: eess.AS

TL;DR: Proposes an automated speaking assessment approach combining SSL representations with handcrafted features using multi-margin ordinal loss to address ordinal structure and non-uniform intervals in proficiency labels.

DetailsMotivation: Existing SSL-based ASA methods have limitations - speech SSL models overlook linguistic content, text SSL models miss prosodic nuances, and most approaches ignore the ordinal nature and non-uniform intervals of proficiency levels.

Method: Combines SSL representations with handcrafted indicator features through a novel modeling paradigm. Introduces multi-margin ordinal loss to jointly model score ordinality and non-uniform intervals between proficiency labels.

Result: Extensive experiments on TEEMI corpus show the method consistently outperforms strong baselines and generalizes well to unseen prompts.

Conclusion: The proposed approach effectively addresses limitations of previous SSL-based ASA methods by combining multiple feature types and properly modeling the ordinal structure of proficiency assessments.

Abstract: A recent line of research on automated speaking assessment (ASA) has benefited from self-supervised learning (SSL) representations, which capture rich acoustic and linguistic patterns in non-native speech without underlying assumptions of feature curation. However, speech-based SSL models capture acoustic-related traits but overlook linguistic content, while text-based SSL models rely on ASR output and fail to encode prosodic nuances. Moreover, most prior works treat proficiency levels as nominal classes, ignoring their ordinal structure and non-uniform intervals between proficiency labels. To address these limitations, we propose an effective ASA approach combining SSL with handcrafted indicator features via a novel modeling paradigm. We further introduce a multi-margin ordinal loss that jointly models both the score ordinality and non-uniform intervals of proficiency labels. Extensive experiments on the TEEMI corpus show that our method consistently outperforms strong baselines and generalizes well to unseen prompts.
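
One way to realize such a loss (a sketch of the idea, not the authors' exact formulation): place per-level thresholds whose spacing follows the non-uniform margins, and apply a hinge penalty at each threshold so that predictions respect the label ordering.

```python
# Sketch: ordinal hinge loss with non-uniform margins between adjacent levels.
import torch

def multi_margin_ordinal_loss(scores, labels, margins):
    """scores: (B,) scalar predictions; labels: (B,) integer levels 0..K-1;
    margins: (K-1,) required gap between adjacent levels (non-uniform)."""
    thresholds = torch.cumsum(margins, dim=0)         # implied level boundaries
    loss = scores.new_zeros(())
    for k, t in enumerate(thresholds):
        sign = (labels > k).float() * 2.0 - 1.0       # above/below boundary k
        loss = loss + torch.relu(1.0 - sign * (scores - t)).mean()
    return loss / len(thresholds)

scores = torch.randn(8)
labels = torch.randint(0, 4, (8,))
margins = torch.tensor([1.0, 0.5, 2.0])               # unequal label intervals
print(multi_margin_ordinal_loss(scores, labels, margins))
```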

[341] Binaural Target Speaker Extraction using HRTFs

Yoav Ellinson, Sharon Gannot

Main category: eess.AS

TL;DR: Novel binaural target speaker extraction method using HRTF without speaker embeddings, achieving speaker-independent performance with excellent binaural cue preservation in both anechoic and reverberant conditions.

DetailsMotivation: To imitate human selective attention to single speakers in multi-talker environments by leveraging HRTF for speaker extraction without relying on speaker-specific embeddings.

Method: Fully complex-valued neural network operating directly on complex-valued STFT of mixed audio signals, compared with Real-Imaginary based network. Uses listener’s HRTF to isolate target speaker without speaker embeddings.

Result: Excellent extraction performance in anechoic conditions with preserved binaural cues. Robust performance in reverberant conditions maintaining speech clarity and directionality while reducing reverberation. Performance on par with existing methods in noise reduction and perceptual quality with better binaural cue preservation.

Conclusion: The HRTF-based approach provides speaker-independent target extraction with strong generalization across datasets and languages, offering superior binaural cue preservation compared to existing methods while maintaining competitive performance metrics.

Abstract: In this work, we aim to imitate the human ability to selectively attend to a single speaker, even in the presence of multiple simultaneous talkers. To achieve this, we propose a novel approach for binaural target speaker extraction that leverages the listener’s Head-Related Transfer Function (HRTF) to isolate the desired speaker. Notably, our method does not rely on speaker embeddings, making it speaker-independent and enabling strong generalization across multiple speech datasets and languages. We employ a fully complex-valued neural network that operates directly on the complex-valued Short-Time Fourier transform (STFT) of the mixed audio signals, and compare it to a Real-Imaginary (RI)-based neural network, demonstrating the advantages of the former. We first evaluate the method in an anechoic, noise-free scenario, achieving excellent extraction performance while preserving the binaural cues of the target signal. We then extend the evaluation to reverberant conditions. Our method proves robust, maintaining speech clarity and source directionality while simultaneously reducing reverberation. A comparative analysis with existing binaural Target Speaker Extraction (TSE) methods demonstrates that our approach attains performance on par with competing techniques in terms of noise reduction and perceptual quality, while offering a clear advantage in preserving binaural cues. Demo page: https://bi-ctse-hrtf.github.io
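
A minimal sketch of operating directly on the complex-valued STFT, as the proposed network does (the network itself is omitted; an identity mask stands in for the predicted complex mask):

```python
# Sketch: complex STFT analysis/synthesis around a (placeholder) complex mask.
import torch

x = torch.randn(2, 16000)                 # two-channel (binaural) waveform
win = torch.hann_window(512)
spec = torch.stft(x, n_fft=512, hop_length=128, window=win,
                  return_complex=True)    # (2, freq, time), complex dtype
mask = torch.ones_like(spec)              # placeholder for the network output
y = torch.istft(spec * mask, n_fft=512, hop_length=128, window=win,
                length=x.shape[-1])       # back to the waveform domain
```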

eess.IV

[342] Pan-Cancer mitotic figures detection and domain generalization: MIDOG 2025 Challenge

Zhuoyan Shen, Esther Bär, Maria Hawkins, Konstantin Bräutigam, Charles-Antoine Collins-Fekete

Main category: eess.IV

TL;DR: Submission to MIDOG 2025 challenge using data-centric approach with new datasets for mitotic figure detection, achieving F1-Score 0.8407 for conventional mitoses and balanced accuracy 0.9107 for atypical mitoses.

DetailsMotivation: Address the critical task of mitotic figure detection in histopathology for cancer prognostication by following the 'Bitter Lesson' principle that emphasizes data scale over algorithmic novelty.

Method: Publicly released two new datasets for conventional and atypical mitoses training, implemented up-to-date training methodologies for both tracks of the challenge.

Result: Achieved Track-1 F1-Score of 0.8407 on test set for conventional mitotic detection, and Track-2 balanced accuracy of 0.9107 for atypical mitotic cell classification.

Conclusion: Data-centric approach with expanded training datasets effectively improves performance in mitotic figure detection tasks for cancer prognostication.

Abstract: This report details our submission to the Mitotic Domain Generalization (MIDOG) 2025 challenge, which addresses the critical task of mitotic figure detection in histopathology for cancer prognostication. Following the “Bitter Lesson” principle that emphasizes data scale over algorithmic novelty, we have publicly released two new datasets to bolster training data for both conventional and atypical mitoses. In addition, we implement up-to-date training methodologies for both tracks and reach a Track-1 F1-Score of 0.8407 on our test set, as well as a Track-2 balanced accuracy of 0.9107 for atypical mitotic cell classification.

[343] MitoDetect++: A Domain-Robust Pipeline for Mitosis Detection and Atypical Subtyping

Esha Sadia Nasir, Jiaqi Lv, Mostafa Jahanifer, Shan E Ahmed Raza

Main category: eess.IV

TL;DR: MitoDetect++ is a deep learning pipeline for mitosis detection and atypical mitosis classification that combines U-Net architecture with EfficientNetV2-L backbone for detection and Virchow2 transformer with LoRA for classification, achieving strong performance across validation domains.

DetailsMotivation: Automated detection and classification of mitotic figures, especially distinguishing atypical from normal mitoses, remains a critical challenge in computational pathology that needs to be addressed for clinical applications.

Method: Uses U-Net-based encoder-decoder with EfficientNetV2-L backbone and attention modules for detection; Virchow2 vision transformer with LoRA fine-tuning for classification; incorporates strong augmentations, focal loss, group-aware stratified 5-fold cross-validation, and test-time augmentation.

Result: Achieves a balanced accuracy of 0.892 across validation domains, demonstrating strong performance and generalization capabilities.

Conclusion: The method shows high clinical applicability and scalability across both mitosis detection and atypical mitosis classification tasks, making it suitable for computational pathology applications.

Abstract: Automated detection and classification of mitotic figures especially distinguishing atypical from normal remain critical challenges in computational pathology. We present MitoDetect++, a unified deep learning pipeline designed for the MIDOG 2025 challenge, addressing both mitosis detection and atypical mitosis classification. For detection (Track 1), we employ a U-Net-based encoder-decoder architecture with EfficientNetV2-L as the backbone, enhanced with attention modules, and trained via combined segmentation losses. For classification (Track 2), we leverage the Virchow2 vision transformer, fine-tuned efficiently using Low-Rank Adaptation (LoRA) to minimize resource consumption. To improve generalization and mitigate domain shifts, we integrate strong augmentations, focal loss, and group-aware stratified 5-fold cross-validation. At inference, we deploy test-time augmentation (TTA) to boost robustness. Our method achieves a balanced accuracy of 0.892 across validation domains, highlighting its clinical applicability and scalability across tasks.
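
LoRA fine-tuning of a ViT backbone can be sketched with Hugging Face's peft library; here a generic timm ViT stands in for Virchow2, and targeting the "qkv" attention projections is our assumption rather than the authors' documented choice.

```python
# Sketch: parameter-efficient LoRA fine-tuning of a ViT classifier.
import timm
from peft import LoraConfig, get_peft_model

backbone = timm.create_model("vit_base_patch16_224", pretrained=True,
                             num_classes=2)           # atypical vs normal
config = LoraConfig(r=8, lora_alpha=16, lora_dropout=0.1,
                    target_modules=["qkv"],           # attention projections
                    modules_to_save=["head"])         # also train the classifier head
model = get_peft_model(backbone, config)
model.print_trainable_parameters()                    # small trainable fraction
```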

[344] Sequential Hard Mining: a data-centric approach for Mitosis Detection

Maxime W. Lafarge, Viktor H. Koelzer

Main category: eess.IV

TL;DR: The paper addresses efficient training data sampling for mitotic figure detection using boosting-inspired techniques, presenting solutions for MIDOG 2025 challenge tracks.

DetailsMotivation: With the growing availability of annotated datasets for mitotic figures in histology images, there's a need to optimally use this unprecedented amount of data to train deep learning models effectively.

Method: Building on previously proposed approaches with a focus on efficient sampling of training data inspired by boosting techniques.

Result: The authors present their candidate solutions for the two tracks of the MIDOG 2025 challenge.

Conclusion: Efficient data sampling methods inspired by boosting techniques can help optimize the use of large annotated datasets for training deep learning models in mitotic figure detection.

Abstract: With the continuously growing availability of annotated datasets of mitotic figures in histology images, finding the best way to use this unprecedented amount of data to optimally train deep learning models has become a new challenge. Here, we build upon previously proposed approaches with a focus on efficient sampling of training data inspired by boosting techniques and present our candidate solutions for the two tracks of the MIDOG 2025 challenge.
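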

[345] Normal and Atypical Mitosis Image Classifier using Efficient Vision Transformer

Xuan Qi, Dominic Labella, Thomas Sanford, Maxwell Lee

Main category: eess.IV

TL;DR: EfficientViT-L2 model achieves strong performance (0.859 balanced accuracy, 0.942 AUC) in atypical vs normal mitosis classification using cross-cancer-type validation and stain augmentation.

DetailsMotivation: To develop an accurate and efficient method for classifying atypical versus normal mitoses in cancer pathology images, addressing the MIDOG 2025 challenge requirements.

Method: Used EfficientViT-L2 hybrid CNN-ViT architecture on unified dataset of 13,938 nuclei from 7 cancer types. Applied leave-one-cancer-type-out cross-validation with 5-fold ensembles and stain-deconvolution for image augmentation.

Result: Achieved balanced accuracy of 0.859, ROC AUC of 0.942, and raw accuracy of 0.85 in preliminary evaluation, demonstrating competitive and well-balanced performance across metrics.

Conclusion: The hybrid EfficientViT-L2 architecture with domain generalization techniques (cross-cancer-type validation and stain augmentation) provides effective and balanced performance for mitosis classification in diverse cancer types.

Abstract: We tackle atypical versus normal mitosis classification in the MIDOG 2025 challenge using EfficientViT-L2, a hybrid CNN–ViT architecture optimized for accuracy and efficiency. A unified dataset of 13,938 nuclei from seven cancer types (MIDOG++ and AMi-Br) was used, with atypical mitoses comprising ~15%. To assess domain generalization, we applied leave-one-cancer-type-out cross-validation with 5-fold ensembles, using stain-deconvolution for image augmentation. For challenge submissions, we trained an ensemble with the same 5-fold split but on all cancer types. In the preliminary evaluation phase, this model achieved balanced accuracy of 0.859, ROC AUC of 0.942, and raw accuracy of 0.85, demonstrating competitive and well-balanced performance across metrics.
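
The leave-one-cancer-type-out protocol maps directly onto scikit-learn's LeaveOneGroupOut; a minimal sketch with stand-in features:

```python
# Sketch: leave-one-cancer-type-out cross-validation.
import numpy as np
from sklearn.model_selection import LeaveOneGroupOut

rng = np.random.default_rng(0)
X = rng.random((100, 64))                  # stand-in nucleus features
y = rng.integers(0, 2, 100)                # 0 = normal, 1 = atypical
groups = rng.integers(0, 7, 100)           # seven cancer types

for fold, (tr, te) in enumerate(LeaveOneGroupOut().split(X, y, groups)):
    held_out = np.unique(groups[te])       # the cancer type unseen in training
    # fit on X[tr], y[tr]; evaluate on X[te], y[te] ...
    print(f"fold {fold}: held-out cancer type {held_out}")
```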

[346] Ensemble of Pathology Foundation Models for MIDOG 2025 Track 2: Atypical Mitosis Classification

Mieko Ochi, Bae Yuan

Main category: eess.IV

TL;DR: Using pathology foundation models with parameter-efficient fine-tuning and ensembling to accurately classify typical vs atypical mitotic figures for cancer prognosis.

DetailsMotivation: Atypical mitotic figures strongly correlate with tumor aggressiveness but are challenging to differentiate even for expert pathologists, making accurate classification essential for patient prognostication and resource allocation.

Method: Leveraged pre-trained Pathology Foundation Models with parameter-efficient fine-tuning via low-rank adaptation, employed fisheye transform to emphasize mitoses, used Fourier Domain Adaptation with ImageNet target images, and ensembled multiple PFMs to integrate complementary insights.

Result: Achieved high balanced accuracy on the Preliminary Evaluation Phase dataset.

Conclusion: The ensemble approach combining multiple pathology foundation models with specialized adaptations provides an effective solution for accurate mitotic figure classification, which is crucial for cancer prognosis and treatment planning.

Abstract: Mitotic figures are classified into typical and atypical variants, with atypical counts correlating strongly with tumor aggressiveness. Accurate differentiation is therefore essential for patient prognostication and resource allocation, yet remains challenging even for expert pathologists. Here, we leveraged Pathology Foundation Models (PFMs) pre-trained on large histopathology datasets and applied parameter-efficient fine-tuning via low-rank adaptation. During training, we employ a fisheye transform to emphasize mitoses and Fourier Domain Adaptation using ImageNet target images. Finally, we ensembled multiple PFMs to integrate complementary morphological insights, achieving a high balanced accuracy on the Preliminary Evaluation Phase dataset.
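
Fourier Domain Adaptation itself is compact. A minimal sketch following the usual FDA recipe (swap the low-frequency amplitude spectrum of the source image with the target's while keeping the source phase; `beta` sets the swapped band):

```python
# Sketch: Fourier Domain Adaptation between a source and a target image.
import numpy as np

def fda(source, target, beta=0.05):
    """source, target: float arrays of shape (H, W, C); beta sets the band."""
    fs = np.fft.fftshift(np.fft.fft2(source, axes=(0, 1)), axes=(0, 1))
    ft = np.fft.fftshift(np.fft.fft2(target, axes=(0, 1)), axes=(0, 1))
    amp, phase = np.abs(fs), np.angle(fs)
    amp_t = np.abs(ft)
    h, w = source.shape[:2]
    bh, bw = max(1, int(h * beta)), max(1, int(w * beta))
    cy, cx = h // 2, w // 2
    amp[cy - bh:cy + bh, cx - bw:cx + bw] = amp_t[cy - bh:cy + bh, cx - bw:cx + bw]
    out = np.fft.ifft2(np.fft.ifftshift(amp * np.exp(1j * phase), axes=(0, 1)),
                       axes=(0, 1))
    return np.real(out)
```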

[347] Beyond Feature Mapping GAP: Integrating Real HDRTV Priors for Superior SDRTV-to-HDRTV Conversion

Gang He, Kepeng Xu, Li Xu, Wenxin Yu, Xianyun Wu

Main category: eess.IV

TL;DR: A novel two-stage method for SDRTV to HDRTV conversion using real HDRTV priors to guide the transformation, significantly improving accuracy and reliability compared to single-style mapping approaches.

DetailsMotivation: The rise of HDR-WCG displays creates demand for SDRTV to HDRTV conversion, but existing methods struggle with limited SDR information and diverse style requirements, making it an ill-posed problem.

Method: Two-stage approach: 1) Vector Quantized Generative Adversarial Network captures HDRTV priors, 2) Matching these priors to input SDRTV content to recover realistic HDRTV outputs using reference-based selection.

Result: Significant improvements in both objective and subjective metrics across real and synthetic datasets, demonstrating enhanced accuracy and reliability in conversion.

Conclusion: Using real HDRTV priors as references transforms the problem from unreferenced prediction to referenced selection, effectively constraining the solution space and improving conversion quality.

Abstract: The rise of HDR-WCG display devices has highlighted the need to convert SDRTV to HDRTV, as most video sources are still in SDR. Existing methods primarily focus on designing neural networks to learn a single-style mapping from SDRTV to HDRTV. However, the limited information in SDRTV and the diversity of styles in real-world conversions render this process an ill-posed problem, thereby constraining the performance and generalization of these methods. Inspired by generative approaches, we propose a novel method for SDRTV to HDRTV conversion guided by real HDRTV priors. Despite the limited information in SDRTV, introducing real HDRTV as reference priors significantly constrains the solution space of the originally high-dimensional ill-posed problem. This shift transforms the task from solving an unreferenced prediction problem to making a referenced selection, thereby markedly enhancing the accuracy and reliability of the conversion process. Specifically, our approach comprises two stages: the first stage employs a Vector Quantized Generative Adversarial Network to capture HDRTV priors, while the second stage matches these priors to the input SDRTV content to recover realistic HDRTV outputs. We evaluate our method on public datasets, demonstrating its effectiveness with significant improvements in both objective and subjective metrics across real and synthetic datasets.

[348] Robust Pan-Cancer Mitotic Figure Detection with YOLOv12

Raphaël Bourgade, Guillaume Balezo, Thomas Walter

Main category: eess.IV

TL;DR: YOLOv12-based mitosis detection method achieves 0.801 F1-score on MIDOG 2025 preliminary test set without external data

DetailsMotivation: Mitotic figures are crucial for tumor pathology assessment but suffer from high inter-observer variability among pathologists, requiring automated detection solutions

Method: Uses YOLOv12 object detection architecture for mitosis detection without external training data

Result: Achieved F1-score of 0.801 on MIDOG 2025 challenge preliminary test set

Conclusion: YOLOv12 provides effective mitosis detection performance, contributing to robust automated pathology assessment tools

Abstract: Mitotic figures represent a key histoprognostic feature in tumor pathology, providing crucial insights into tumor aggressiveness and proliferation. However, their identification remains challenging, subject to significant inter-observer variability, even among experienced pathologists. To address this issue, the MItosis DOmain Generalization (MIDOG) 2025 challenge marks the third edition of an international competition aiming to develop robust mitosis detection algorithms. In this paper, we present a mitotic figure detection approach based on the YOLOv12 object detection architecture, achieving an $F_1$-score of 0.801 on the preliminary test set of the MIDOG 2025 challenge, without relying on external data.

[349] ConvNeXt with Histopathology-Specific Augmentations for Mitotic Figure Classification

Hana Feki, Alice Blondel, Thomas Walter

Main category: eess.IV

TL;DR: Lightweight ConvNeXt model with histopathology-specific augmentation achieves 89.61% balanced accuracy in classifying atypical vs normal mitotic figures, ranking top in MIDOG 2025 Challenge.

DetailsMotivation: Accurate mitotic figure classification is crucial for cancer grading and prognosis, but distinguishing atypical from normal mitotic figures is challenging due to subtle morphological differences, domain shifts, limited annotations, and class imbalance.

Method: Used lightweight ConvNeXt architecture trained on all available datasets (AMi-Br, AtNorM-Br, AtNorM-MD, OMG-Octo) with histopathology-specific augmentation pipeline including elastic and stain transformations, balanced sampling, and grouped 5-fold cross-validation.

Result: Achieved balanced accuracy of 0.8961 on preliminary leaderboard, ranking among top entries in MIDOG 2025 Challenge Track 2.

Conclusion: Broad domain exposure combined with targeted augmentation strategies is key to building accurate and generalizable mitotic figure classifiers that can handle domain shifts and class imbalance.

Abstract: Accurate mitotic figure classification is crucial in computational pathology, as mitotic activity informs cancer grading and patient prognosis. Distinguishing atypical mitotic figures (AMFs), which indicate higher tumor aggressiveness, from normal mitotic figures (NMFs) remains challenging due to subtle morphological differences and high intra-class variability. This task is further complicated by domain shifts, including variations in organ, tissue type, and scanner, as well as limited annotations and severe class imbalance. To address these challenges in Track 2 of the MIDOG 2025 Challenge, we propose a solution based on the lightweight ConvNeXt architecture, trained on all available datasets (AMi-Br, AtNorM-Br, AtNorM-MD, and OMG-Octo) to maximize domain coverage. Robustness is enhanced through a histopathology-specific augmentation pipeline, including elastic and stain-specific transformations, and balanced sampling to mitigate class imbalance. A grouped 5-fold cross-validation strategy ensures reliable evaluation. On the preliminary leaderboard, our model achieved a balanced accuracy of 0.8961, ranking among the top entries. These results highlight that broad domain exposure combined with targeted augmentation strategies is key to building accurate and generalizable mitotic figure classifiers.
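
A stain-specific transformation of the kind mentioned above can be sketched via HED color deconvolution; this is one common realization (the challenge pipeline's exact augmentations are not spelled out in the abstract):

```python
# Sketch: stain-specific augmentation via HED color deconvolution.
import numpy as np
from skimage.color import rgb2hed, hed2rgb

def hed_jitter(img, sigma=0.05):
    """img: float RGB in [0, 1]; perturb each stain channel multiplicatively."""
    hed = rgb2hed(img)                                   # H, E, DAB components
    factors = 1.0 + np.random.uniform(-sigma, sigma, size=3)
    return np.clip(hed2rgb(hed * factors), 0.0, 1.0)
```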

[350] Solutions for Mitotic Figure Detection and Atypical Classification in MIDOG 2025

Shuting Xu, Runtong Liu, Zhixuan Chen, Junlin Hou, Hao Chen

Main category: eess.IV

TL;DR: Two-stage framework for mitotic figure detection and ensemble approach for atypical mitosis classification in MIDOG 2025 Challenge

DetailsMotivation: Advance deep learning approaches for mitotic figure analysis in computational pathology, addressing domain generalization challenges

Method: Two-stage detection-classification framework for mitotic figure localization and refinement, plus ensemble strategy with multiple deep learning architectures for atypical mitosis classification

Result: Extensive experiments demonstrate effectiveness across both detection and classification tasks

Conclusion: Proposed methods show strong performance in mitotic figure detection and atypical mitosis classification for domain generalization challenges

Abstract: Deep learning has driven significant advances in mitotic figure analysis within computational pathology. In this paper, we present our approach to the Mitosis Domain Generalization (MIDOG) 2025 Challenge, which consists of two distinct tasks, i.e., mitotic figure detection and atypical mitosis classification. For the mitotic figure detection task, we propose a two-stage detection-classification framework that first localizes candidate mitotic figures and subsequently refines the predictions using a dedicated classification module. For the atypical mitosis classification task, we employ an ensemble strategy that integrates predictions from multiple state-of-the-art deep learning architectures to improve robustness and accuracy. Extensive experiments demonstrate the effectiveness of our proposed methods across both tasks.

[351] MIDOG 2025: Mitotic Figure Detection with Attention-Guided False Positive Correction

Andrew Broad, Jason Keighley, Lucy Godson, Alex Wright

Main category: eess.IV

TL;DR: Novel approach combining FCOS object detector with FAL-CNN classifier to improve mitotic figure detection by reducing false positives and enhancing bounding box accuracy.

DetailsMotivation: To reduce the false positive rate of the FCOS object detector and improve accuracy and generalizability for mitotic figure detection in medical imaging.

Method: Extends FCOS with a Feedback Attention Ladder CNN for classification of normal vs abnormal mitotic figures, feeding into a fusion network that generates adjustments to FCOS-predicted bounding boxes.

Result: Achieved an F1 score of 0.655 for mitosis detection on the preliminary evaluation dataset.

Conclusion: The composite model successfully improves mitotic figure detection performance by combining object detection with specialized classification and bounding box refinement.

Abstract: We present a novel approach which extends the existing Fully Convolutional One-Stage Object Detector (FCOS) for mitotic figure detection. Our composite model adds a Feedback Attention Ladder CNN (FAL-CNN) model for classification of normal versus abnormal mitotic figures, feeding into a fusion network that is trained to generate adjustments to bounding boxes predicted by FCOS. Our network aims to reduce the false positive rate of the FCOS object detector, to improve the accuracy of object detection and enhance the generalisability of the network. Our model achieved an F1 score of 0.655 for mitosis detection on the preliminary evaluation dataset.

[352] RF-DETR for Robust Mitotic Figure Detection: A MIDOG 2025 Track 1 Approach

Piotr Giedziun, Jan Sołtysik, Mateusz Górczany, Norbert Ropiak, Marcin Przymus, Piotr Krajewski, Jarosław Kwiecień, Artur Bartczak, Izabela Wasiak, Mateusz Maniewski

Main category: eess.IV

TL;DR: Single-stage RF-DETR with hard negative mining for mitotic figure detection in histopathology images, achieving 0.789 F1 score on MIDOG 2025 challenge despite domain shifts.

DetailsMotivation: Address domain shift challenges in mitotic figure detection caused by different scanners, staining protocols, and tissue types across histopathology images.

Method: Employed RF-DETR (Roboflow Detection Transformer) with hard negative mining, trained on MIDOG++ dataset. Initially planned two-stage approach but focused on optimized single-stage detection pipeline due to time constraints.

Result: Achieved F1 score of 0.789 with recall of 0.839 and precision of 0.746 on preliminary test set, demonstrating effective generalization across unseen domains.

Conclusion: Training data balance and hard negative mining are crucial for addressing domain shift challenges in mitotic figure detection, providing valuable insights for robust histopathology image analysis.

Abstract: Mitotic figure detection in histopathology images remains challenging due to significant domain shifts across different scanners, staining protocols, and tissue types. This paper presents our approach for the MIDOG 2025 challenge Track 1, focusing on robust mitotic figure detection across diverse histological contexts. While we initially planned a two-stage approach combining high-recall detection with subsequent classification refinement, time constraints led us to focus on optimizing a single-stage detection pipeline. We employed RF-DETR (Roboflow Detection Transformer) with hard negative mining, trained on MIDOG++ dataset. On the preliminary test set, our method achieved an F1 score of 0.789 with a recall of 0.839 and precision of 0.746, demonstrating effective generalization across unseen domains. The proposed solution offers insights into the importance of training data balance and hard negative mining for addressing domain shift challenges in mitotic figure detection.
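Hard negative mining is the key training ingredient named above. A minimal sketch of the general idea follows, with hypothetical helper names; this is not the RF-DETR training code.

```python
# Sketch of hard negative mining for detector training: predictions that score
# high but overlap no ground-truth mitosis are recycled as explicit negatives
# in the next training round. Illustrative only.
import numpy as np

def iou(box, boxes):
    """IoU of one [x1, y1, x2, y2] box against an (N, 4) array."""
    if len(boxes) == 0:
        return np.zeros(0)
    x1 = np.maximum(box[0], boxes[:, 0]); y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2]); y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    a = (box[2] - box[0]) * (box[3] - box[1])
    b = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (a + b - inter + 1e-9)

def mine_hard_negatives(pred_boxes, pred_scores, gt_boxes,
                        score_thr=0.5, iou_thr=0.3):
    """Keep confident detections that match no ground-truth box."""
    hard = []
    for box, score in zip(pred_boxes, pred_scores):
        if score >= score_thr and (iou(box, gt_boxes) < iou_thr).all():
            hard.append(box)
    return np.array(hard)

# Example: one confident detection far from the only annotated mitosis.
preds = np.array([[10, 10, 40, 40], [100, 100, 130, 130]], dtype=float)
scores = np.array([0.9, 0.95])
gts = np.array([[12, 12, 38, 38]], dtype=float)
print(mine_hard_negatives(preds, scores, gts))  # -> the second box only
```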

[353] Team Westwood Solution for MIDOG 2025 Challenge

Tengyou Xu, Haochen Yang, Xiang ‘Anthony’ Chen, Hongyan Gu, Mohammad Haeri

Main category: eess.IV

TL;DR: Team Westwood’s solution for MIDOG 2025 challenge uses nnUNetV2 for initial mitosis detection followed by random forest ensemble of EfficientNet models, achieving F1 score 0.7450 for detection and balanced accuracy 0.8722 for atypical classification.

DetailsMotivation: To develop an effective solution for mitosis detection and atypical mitosis classification in the MIDOG 2025 challenge, addressing the need for accurate automated analysis of cell division in medical imaging.

Method: Used nnUNetV2 for initial mitosis candidate screening with high sensitivity, followed by random forest classifier ensemble of three CNNs (EfficientNet-b3, EfficientNet-b5, EfficientNetV2-s) for detection. For atypical classification, used random forest ensemble of EfficientNet-b3, EfficientNet-b5, and InceptionV3.

Result: Achieved F1 score of 0.7450 for mitosis detection (track 1) and balanced accuracy of 0.8722 for atypical mitosis classification (track 2) on preliminary test set.

Conclusion: The ensemble approach combining nnUNetV2 with multiple CNN models via random forest classifiers proved effective for both mitosis detection and atypical classification tasks in the MIDOG challenge.

Abstract: This abstract presents our solution (Team Westwood) for mitosis detection and atypical mitosis classification in the MItosis DOmain Generalization (MIDOG) 2025 challenge. For mitosis detection, we trained an nnUNetV2 for initial mitosis candidate screening with high sensitivity, followed by a random forest classifier ensembling predictions of three convolutional neural networks (CNNs): EfficientNet-b3, EfficientNet-b5, and EfficientNetV2-s. For the atypical mitosis classification, we trained another random forest classifier ensembling the predictions of three CNNs: EfficientNet-b3, EfficientNet-b5, and InceptionV3. On the preliminary test set, our solution achieved an F1 score of 0.7450 for track 1 mitosis detection, and a balanced accuracy of 0.8722 for track 2 atypical mitosis classification.
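The ensembling step above amounts to stacking: per-candidate class probabilities from several CNNs become the feature vector of a random forest. A minimal sketch, with random numbers standing in for the EfficientNet/Inception outputs:

```python
# Sketch of random-forest ensembling over CNN predictions. The three columns
# stand in for P(mitosis) from three backbones; data here is synthetic.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n = 500
cnn_probs = rng.uniform(size=(n, 3))   # placeholder per-model probabilities
# Placeholder labels loosely correlated with the mean probability:
y = (cnn_probs.mean(axis=1) + 0.1 * rng.normal(size=n) > 0.5).astype(int)

rf = RandomForestClassifier(n_estimators=200, random_state=0)
rf.fit(cnn_probs[:400], y[:400])
print("held-out accuracy:", rf.score(cnn_probs[400:], y[400:]))
```

Compared with simple probability averaging, the forest can learn non-linear agreement patterns between the backbones.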

[354] Foundation Model-Driven Classification of Atypical Mitotic Figures with Domain-Aware Training Strategies

Piotr Giedziun, Jan Sołtysik, Mateusz Górczany, Norbert Ropiak, Marcin Przymus, Piotr Krajewski, Jarosław Kwiecień, Artur Bartczak, Izabela Wasiak, Mateusz Maniewski

Main category: eess.IV

TL;DR: A solution for MIDOG 2025 Track 2 using H-optimus-0 foundation model with LoRA fine-tuning and MixUp augmentation for binary classification of normal vs atypical mitotic figures.

DetailsMotivation: To address the complex binary classification challenge of distinguishing normal mitotic figures (NMFs) from atypical mitotic figures (AMFs) in pathology images.

Method: Leverages pathology-specific foundation model H-optimus-0 with Low-Rank Adaptation (LoRA) fine-tuning, MixUp augmentation, soft labels from multi-expert consensus, hard negative mining, adaptive focal loss, metric learning, and domain adaptation techniques.

Result: Achieved reasonable performance in the preliminary evaluation phase, demonstrating both promise and challenges of applying foundation models to this classification task.

Conclusion: The approach shows potential for foundation model applications in complex medical image classification tasks, though challenges remain in achieving optimal performance for distinguishing NMFs from AMFs.

Abstract: We present a solution for the MIDOG 2025 Challenge Track~2, addressing binary classification of normal mitotic figures (NMFs) versus atypical mitotic figures (AMFs). The approach leverages pathology-specific foundation model H-optimus-0, selected based on recent cross-domain generalization benchmarks and our empirical testing, with Low-Rank Adaptation (LoRA) fine-tuning and MixUp augmentation. Implementation includes soft labels based on multi-expert consensus, hard negative mining, and adaptive focal loss, metric learning and domain adaptation. The method demonstrates both the promise and challenges of applying foundation models to this complex classification task, achieving reasonable performance in the preliminary evaluation phase.
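Of the ingredients listed, MixUp is easy to make concrete. A minimal sketch of batch-level MixUp (Zhang et al.), independent of the H-optimus-0 backbone and LoRA adapters, which are not reproduced here:

```python
# Sketch of MixUp augmentation on a batch of mitosis patches.
import torch

def mixup(x, y, alpha=0.2):
    """Convex-combine a batch with a shuffled copy of itself."""
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    perm = torch.randperm(x.size(0))
    x_mix = lam * x + (1.0 - lam) * x[perm]
    return x_mix, y, y[perm], lam

x = torch.randn(8, 3, 224, 224)   # a batch of image patches
y = torch.randint(0, 2, (8,))     # 0 = normal, 1 = atypical
x_mix, y_a, y_b, lam = mixup(x, y)
# The loss is interpolated the same way:
# loss = lam * criterion(logits, y_a) + (1 - lam) * criterion(logits, y_b)
```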

[355] Masked Autoencoder Pretraining and BiXLSTM ResNet Architecture for PET/CT Tumor Segmentation

Moona Mazher, Steven A Niederer, Abdul Qayyum

Main category: eess.IV

TL;DR: Two-stage lesion segmentation framework using self-supervised MAE pretraining and bidirectional XLSTM with ResNet blocks achieves improved PET/CT segmentation accuracy (Dice 0.582) for AutoPET Challenge.

DetailsMotivation: Manual lesion segmentation in whole-body PET/CT is labor-intensive and variable. Current automated methods are limited by modality specificity, isolated time points, and insufficient expert knowledge integration.

Method: Two-stage approach: 1) Masked Autoencoder for self-supervised pretraining on unlabeled PET/CT and CT scans, 2) Fine-tuning with bidirectional XLSTM architecture with ResNet blocks and convolutional decoder using PET/CT as complementary input channels.

Result: Self-supervised pretraining significantly improves segmentation accuracy, achieving Dice score of 0.582 compared to 0.543 without pretraining on AutoPET Task 1 dataset.

Conclusion: Combining self-supervised learning with multimodal fusion shows strong potential for robust and generalizable PET/CT lesion segmentation.

Abstract: The accurate segmentation of lesions in whole-body PET/CT imaging is essential for tumor characterization, treatment planning, and response assessment, yet current manual workflows are labor-intensive and prone to inter-observer variability. Automated deep learning methods have shown promise but often remain limited by modality specificity, isolated time points, or insufficient integration of expert knowledge. To address these challenges, we present a two-stage lesion segmentation framework developed for the fourth AutoPET Challenge. In the first stage, a Masked Autoencoder (MAE) is employed for self-supervised pretraining on unlabeled PET/CT and longitudinal CT scans, enabling the extraction of robust modality-specific representations without manual annotations. In the second stage, the pretrained encoder is fine-tuned with a bidirectional XLSTM architecture augmented with ResNet blocks and a convolutional decoder. By jointly leveraging anatomical (CT) and functional (PET) information as complementary input channels, the model achieves improved temporal and spatial feature integration. Evaluation on the AutoPET Task 1 dataset demonstrates that self-supervised pretraining significantly enhances segmentation accuracy, achieving a Dice score of 0.582 compared to 0.543 without pretraining. These findings highlight the potential of combining self-supervised learning with multimodal fusion for robust and generalizable PET/CT lesion segmentation. Code will be available at https://github.com/RespectKnowledge/AutoPet_2025_BxLSTM_UNET_Segmentation
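The core of MAE pretraining is random patch masking. A minimal, shape-only sketch of the masking step, with the encoder and decoder omitted:

```python
# Sketch of MAE-style random masking: hide a large fraction of patch tokens
# and train the model to reconstruct them. Shapes are illustrative.
import torch

def random_masking(tokens, mask_ratio=0.75):
    """tokens: (B, N, D) patch embeddings -> visible subset + shuffle order."""
    B, N, D = tokens.shape
    n_keep = int(N * (1 - mask_ratio))
    noise = torch.rand(B, N)            # one random score per patch
    ids_shuffle = noise.argsort(dim=1)  # lowest scores are kept visible
    ids_keep = ids_shuffle[:, :n_keep]
    visible = torch.gather(tokens, 1, ids_keep.unsqueeze(-1).expand(-1, -1, D))
    return visible, ids_shuffle

tokens = torch.randn(2, 196, 768)       # e.g., 14x14 patches per image
visible, _ = random_masking(tokens)
print(visible.shape)                    # torch.Size([2, 49, 768])
```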

[356] Towards Digital Twins for Optimal Radioembolization

Nisanth Kumar Panneerselvam, Guneet Mummaneni, Emilie Roncali

Main category: eess.IV

TL;DR: A framework combining CFD and physics-informed AI methods to create patient-specific digital twins for optimizing liver radioembolization treatment planning and improving clinical outcomes.

DetailsMotivation: Radioembolization treatment optimization is challenging due to complex hepatic artery anatomy, variable blood flow, and uncertainty in microsphere transport, requiring personalized approaches to maximize therapeutic efficacy while minimizing damage to healthy tissue.

Method: Uses high-fidelity computational fluid dynamics (CFD) for microsphere transport calculations and physics-informed machine learning approaches (PINNs, PI-GANs, PI-DMs, transformer-based architectures) to accelerate simulations while maintaining physical fidelity.

Result: The framework enables mesh-free, data-efficient approximation of blood flow and microsphere transport, supports uncertainty-aware predictions, and facilitates real-time decision support through rapid sampling of diverse flow scenarios.

Conclusion: The integration of CFD and physics-informed AI methods provides a transformative solution for creating dynamic, patient-specific digital twins that can optimize radioembolization planning and ultimately improve clinical outcomes.

Abstract: Radioembolization is a localized liver cancer treatment that delivers radioactive microspheres (30 micron) to tumors via a catheter inserted in the hepatic arterial tree. The goal is to maximize therapeutic efficacy while minimizing damage to healthy liver tissue. However, optimization is challenging due to complex hepatic artery anatomy, variable blood flow, and uncertainty in microsphere transport. The creation of dynamic, patient-specific digital twins may provide a transformative solution to these challenges. This work outlines a framework for a liver radioembolization digital twin using high-fidelity computational fluid dynamics (CFD) and/or recent physics-informed machine learning approaches. The CFD approach involves microsphere transport calculations in the hepatic arterial tree with individual patient data, which enables personalized treatment planning. Although accurate, traditional CFD is computationally expensive and limits clinical applicability. To accelerate simulations, physics-informed neural networks (PINNs) and their generative extensions play an increasingly important role. PINNs integrate governing equations, such as the Navier-Stokes equations, directly into the neural network training process, enabling mesh-free, data-efficient approximation of blood flow and microsphere transport. Physics-informed generative adversarial networks (PI-GANs), diffusion models (PI-DMs), and transformer-based architectures further enable uncertainty-aware, temporally resolved predictions with reduced computational cost. These AI surrogates not only maintain physical fidelity but also support rapid sampling of diverse flow scenarios, facilitating real-time decision support. Together, CFD and physics-informed AI methods form the foundation of dynamic, patient-specific digital twin to optimize radioembolization planning and ultimately improve clinical outcomes.
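The PINN idea referenced above can be made concrete in a few lines: a network maps coordinates to flow variables, and autograd supplies the PDE residual to penalize during training. A minimal sketch for steady 2D incompressible Navier-Stokes, far simpler than the hepatic-flow setting with its geometry, boundary conditions, and microsphere transport:

```python
# Sketch of a physics-informed loss for steady 2D incompressible flow.
import torch

net = torch.nn.Sequential(
    torch.nn.Linear(2, 64), torch.nn.Tanh(),
    torch.nn.Linear(64, 64), torch.nn.Tanh(),
    torch.nn.Linear(64, 3),             # outputs: u, v, p
)

def grad(out, inp):
    return torch.autograd.grad(out, inp, torch.ones_like(out),
                               create_graph=True)[0]

def ns_residual(xy, nu=0.04):
    xy = xy.requires_grad_(True)
    u, v, p = net(xy).unbind(dim=1)
    du, dv, dp = grad(u, xy), grad(v, xy), grad(p, xy)
    u_x, u_y = du[:, 0], du[:, 1]
    v_x, v_y = dv[:, 0], dv[:, 1]
    u_xx = grad(u_x, xy)[:, 0]; u_yy = grad(u_y, xy)[:, 1]
    v_xx = grad(v_x, xy)[:, 0]; v_yy = grad(v_y, xy)[:, 1]
    mom_x = u * u_x + v * u_y + dp[:, 0] - nu * (u_xx + u_yy)  # x-momentum
    mom_y = u * v_x + v * v_y + dp[:, 1] - nu * (v_xx + v_yy)  # y-momentum
    cont = u_x + v_y                                           # continuity
    return (mom_x**2 + mom_y**2 + cont**2).mean()

loss = ns_residual(torch.rand(256, 2))  # random collocation points
loss.backward()                         # push the net toward the PDE
```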

[357] Is Synthetic Image Augmentation Useful for Imbalanced Classification Problems? Case-Study on the MIDOG2025 Atypical Cell Detection Competition

Leire Benito-Del-Valle, Pedro A. Moreno-Sánchez, Itziar Egusquiza, Itsaso Vitoria, Artzai Picón, Cristina López-Saratxaga, Adrian Galdran

Main category: eess.IV

TL;DR: The MIDOG 2025 challenge introduces atypical mitosis classification using ConvNeXt and Lunit ViT backbones, achieving ~95% AUROC with no consistent benefit from synthetic data balancing.

DetailsMotivation: To address the clinically relevant but highly imbalanced problem of distinguishing normal from atypical mitotic figures in histopathology images across domains.

Method: Used ConvNeXt-Small (ImageNet-pretrained) and Lunit ViT (histopathology-specific self-supervised) backbones, compared real-only vs real+synthetic data training with five-fold cross-validation.

Result: Both models achieved ~95% mean AUROC - ConvNeXt had slightly higher peaks while Lunit showed better stability. Synthetic balancing provided no consistent improvement. ConvNeXt achieved 95.4% AUROC on hidden test set.

Conclusion: Both ImageNet and domain-pretrained backbones work well for atypical mitosis classification, with domain-pretraining offering robustness and ImageNet pretraining reaching higher peaks, while synthetic balancing has limited benefit.

Abstract: The MIDOG 2025 challenge extends prior work on mitotic figure detection by introducing a new Track 2 on atypical mitosis classification. This task aims to distinguish normal from atypical mitotic figures in histopathology images, a clinically relevant but highly imbalanced and cross-domain problem. We investigated two complementary backbones: (i) ConvNeXt-Small, pretrained on ImageNet, and (ii) a histopathology-specific ViT from Lunit trained via self-supervision. To address the strong prevalence imbalance (9408 normal vs. 1741 atypical), we synthesized additional atypical examples to approximate class balance and compared models trained with real-only vs. real+synthetic data. Using five-fold cross-validation, both backbones reached strong performance (mean AUROC approximately 95 percent), with ConvNeXt achieving slightly higher peaks while Lunit exhibited greater fold-to-fold stability. Synthetic balancing, however, did not lead to consistent improvements. On the organizers’ preliminary hidden test set, explicitly designed as an out-of-distribution debug subset, ConvNeXt attained the highest AUROC (95.4 percent), whereas Lunit remained competitive on balanced accuracy. These findings suggest that both ImageNet and domain-pretrained backbones are viable for atypical mitosis classification, with domain-pretraining conferring robustness and ImageNet pretraining reaching higher peaks, while naive synthetic balancing has limited benefit. Full hidden test set results will be reported upon challenge completion.

[358] A Two-Stage Strategy for Mitosis Detection Using Improved YOLO11x Proposals and ConvNeXt Classification

Jie Xiao, Mengye Lyu, Shaojun Liu

Main category: eess.IV

TL;DR: Two-stage mitosis detection framework combining improved YOLO11x for candidate generation and ConvNeXt-Tiny classifier for false positive filtering, achieving F1-score of 0.882 on MIDOG 2025 Track 1 dataset.

DetailsMotivation: Mitosis detection in whole-slide images is challenging due to complicated heterogeneous contexts, artifacts, and non-tumor regions that cause false positives and false negatives, degrading F1-score performance.

Method: Two-stage framework: 1) Improved YOLO11x with EMA attention and LSConv for mitosis candidate generation using low confidence threshold to ensure recall, 2) ConvNeXt-Tiny classifier to filter false positives and ensure precision.

Result: Achieved F1-score of 0.882 on fused dataset (MIDOG++, MITOS_WSI_CCMCT, MITOS_WSI_CMC), 0.035 higher than single-stage YOLO11x baseline. Precision improved from 0.762 to 0.839 while maintaining comparable recall.

Conclusion: The proposed two-stage framework effectively addresses mitosis detection challenges by balancing recall and precision, demonstrating significant performance improvement over single-stage approaches in complex whole-slide image analysis.

Abstract: MIDOG 2025 Track 1 requires mitosis detection in whole-slide images (WSIs) containing non-tumor, inflamed, and necrotic regions. Due to the complicated and heterogeneous context, as well as possible artifacts, there are often false positives and false negatives, thus degrading the detection F1-score. To address this problem, we propose a two-stage framework. Firstly, an improved YOLO11x, integrated with EMA attention and LSConv, is employed to generate mitosis candidates. We use a low confidence threshold to generate as many proposals as possible, ensuring the detection recall. Then, a ConvNeXt-Tiny classifier is employed to filter out the false positives, ensuring the detection precision. Consequently, the proposed two-stage framework can generate a high detection F1-score. Evaluated on a fused dataset comprising MIDOG++, MITOS_WSI_CCMCT, and MITOS_WSI_CMC, our framework achieves an F1-score of 0.882, which is 0.035 higher than the single-stage YOLO11x baseline. This performance gain is produced by a significant precision improvement, from 0.762 to 0.839, and a comparable recall. The code is available at https://github.com/xxiao0304/MIDOG-2025-Track-1-of-SZTU.
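The recall-then-precision logic of the two-stage pipeline can be sketched as follows, with placeholder models standing in for the improved YOLO11x and the ConvNeXt-Tiny classifier:

```python
# Sketch of detect-then-filter: keep every low-confidence proposal for recall,
# then let a patch classifier veto false positives for precision.
import torch

def two_stage_filter(proposals, det_scores, patches, classifier,
                     det_thr=0.05, cls_thr=0.5):
    """proposals: (N, 4) boxes; det_scores: (N,); patches: (N, 3, H, W) crops."""
    keep = det_scores >= det_thr                  # permissive stage-1 gate
    with torch.no_grad():
        cls_probs = classifier(patches[keep]).softmax(dim=1)[:, 1]
    final = cls_probs >= cls_thr                  # strict stage-2 gate
    return proposals[keep][final]

# Placeholder classifier standing in for ConvNeXt-Tiny:
classifier = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.LazyLinear(2))
boxes = two_stage_filter(torch.rand(20, 4), torch.rand(20),
                         torch.randn(20, 3, 64, 64), classifier)
print(boxes.shape)
```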

[359] Challenges and Lessons from MIDOG 2025: A Two-Stage Approach to Domain-Robust Mitotic Figure Detection

Euiseop Song, Jaeyoung Park, Jaewoo Park

Main category: eess.IV

TL;DR: Two-stage mitotic figure detection pipeline with Faster R-CNN and ensemble classifiers achieved high recall but critically low precision in MIDOG 2025 challenge, revealing domain generalization challenges.

DetailsMotivation: Address domain variability and morphological complexity in mitotic figure detection across diverse tissue domains for computational pathology applications.

Method: Two-stage pipeline: Faster R-CNN for candidate detection followed by ensemble of three classifiers (DenseNet-121, EfficientNet-v2, InceptionResNet-v2) for false positive reduction.

Result: Best submission achieved F1-score 0.2237 with high recall (0.9528) but critically low precision (0.1267). Optimization attempts were counterproductive, highlighting domain generalization challenges.

Conclusion: The work reveals fundamental challenges in distinguishing true mitoses from imposters across domains and emphasizes the importance of effective false positive suppression strategies in histopathology.

Abstract: Mitotic figure detection remains a challenging task in computational pathology due to domain variability and morphological complexity. This paper describes our participation in the MIDOG 2025 challenge, focusing on robust mitotic figure detection across diverse tissue domains. We developed a two-stage pipeline combining Faster R-CNN for candidate detection with an ensemble of three classifiers (DenseNet-121, EfficientNet-v2, InceptionResNet-v2) for false positive reduction. Our best submission achieved F1-score 0.2237 (Recall: 0.9528, Precision: 0.1267) using a Faster R-CNN trained solely on the MIDOG++ dataset. While our high recall demonstrates effective mitotic figure detection, the critically low precision (12.67%) reveals fundamental challenges in distinguishing true mitoses from morphologically similar imposters across diverse domains. Analysis of six submission variants showed that subsequent optimization attempts were counterproductive, highlighting the complexity of domain generalization in histopathology. This work provides valuable insights into the practical challenges of developing robust mitotic figure detection algorithms and emphasizes the importance of effective false positive suppression strategies.
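The reported F1 follows directly from the precision/recall pair, which makes the failure mode easy to see:

```python
# Quick check that the reported F1 follows from the precision/recall pair.
p, r = 0.1267, 0.9528
f1 = 2 * p * r / (p + r)
print(round(f1, 4))  # 0.2237 — high recall cannot rescue very low precision
```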

[360] A Single Detect Focused YOLO Framework for Robust Mitotic Figure Detection

Yasemin Topuz, M. Taha Gökcan, Serdar Yıldız, Songül Varlı

Main category: eess.IV

TL;DR: SDF-YOLO is a lightweight domain-robust detection framework for mitotic figures that achieves competitive performance across diverse datasets with computational efficiency.

DetailsMotivation: Mitotic figure detection is crucial for tumor prognosis but faces challenges from domain variability in scanners, tissue types, and staining protocols that affect automated detection robustness.

Method: Builds on YOLOv11 with task-specific modifications: single detection head aligned with mitotic figure scale, coordinate attention for positional sensitivity, and improved cross-channel feature mixing.

Result: Achieved AP of 0.799, precision 0.758, recall 0.775, F1 score 0.766, and FROC-AUC of 5.793 on MIDOG2025 challenge test set across human and canine tumor datasets.

Conclusion: SDF-YOLO provides a reliable and efficient framework for robust mitotic figure detection across diverse domains with both competitive accuracy and computational efficiency.

Abstract: Mitotic figure detection is a crucial task in computational pathology, as mitotic activity serves as a strong prognostic marker for tumor aggressiveness. However, domain variability that arises from differences in scanners, tissue types, and staining protocols poses a major challenge to the robustness of automated detection methods. In this study, we introduce SDF-YOLO (Single Detect Focused YOLO), a lightweight yet domain-robust detection framework designed specifically for small, rare targets such as mitotic figures. The model builds on YOLOv11 with task-specific modifications, including a single detection head aligned with mitotic figure scale, coordinate attention to enhance positional sensitivity, and improved cross-channel feature mixing. Experiments were conducted on three datasets that span human and canine tumors: MIDOG++, canine cutaneous mast cell tumor (CCMCT), and canine mammary carcinoma (CMC). When submitted to the preliminary test set for the MIDOG2025 challenge, SDF-YOLO achieved an average precision (AP) of 0.799, with a precision of 0.758, a recall of 0.775, an F1 score of 0.766, and an FROC-AUC of 5.793, demonstrating both competitive accuracy and computational efficiency. These results indicate that SDF-YOLO provides a reliable and efficient framework for robust mitotic figure detection across diverse domains.
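Coordinate attention, the positional-sensitivity block cited above, pools along each spatial axis separately so the attention map retains coordinate information. A minimal sketch with illustrative channel sizes, not the SDF-YOLO implementation:

```python
# Sketch of a coordinate attention block (Hou et al.): directional pooling
# keeps one spatial axis of position information in each attention branch.
import torch
import torch.nn as nn

class CoordAttention(nn.Module):
    def __init__(self, ch, reduction=8):
        super().__init__()
        mid = max(ch // reduction, 4)
        self.conv1 = nn.Conv2d(ch, mid, 1)
        self.act = nn.ReLU()
        self.conv_h = nn.Conv2d(mid, ch, 1)
        self.conv_w = nn.Conv2d(mid, ch, 1)

    def forward(self, x):
        b, c, h, w = x.shape
        x_h = x.mean(dim=3, keepdim=True)                       # (B,C,H,1)
        x_w = x.mean(dim=2, keepdim=True).permute(0, 1, 3, 2)   # (B,C,W,1)
        y = self.act(self.conv1(torch.cat([x_h, x_w], dim=2)))
        y_h, y_w = torch.split(y, [h, w], dim=2)
        a_h = self.conv_h(y_h).sigmoid()                        # height gate
        a_w = self.conv_w(y_w.permute(0, 1, 3, 2)).sigmoid()    # width gate
        return x * a_h * a_w

x = torch.randn(2, 64, 40, 40)
print(CoordAttention(64)(x).shape)   # torch.Size([2, 64, 40, 40])
```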

[361] Adaptive Learning Strategies for Mitotic Figure Classification in MIDOG2025 Challenge

Biwen Meng, Xi Long, Jingxin Liu

Main category: eess.IV

TL;DR: The paper presents a method for detecting atypical mitotic figures in pathology images using UNI2-h foundation model adaptation with visual prompt tuning and test-time augmentation, achieving top-10 performance in the MIDOG2025 challenge.

DetailsMotivation: Atypical mitotic figures are clinically important indicators of abnormal cell division but are challenging to detect reliably due to morphological ambiguity and scanner variability.

Method: Three adaptation variants of UNI2-h pathology foundation model: LoRA baseline, visual prompt tuning (VPT), and VPT combined with test-time augmentation using Vahadane and Macenko stain normalization.

Result: Final submission achieved balanced accuracy of 0.8837 and ROC-AUC of 0.9513 on preliminary leaderboard, ranking within top 10 teams.

Conclusion: Prompt-based adaptation combined with stain-normalization test-time augmentation provides an effective strategy for atypical mitosis classification under diverse imaging conditions.

Abstract: Atypical mitotic figures (AMFs) are clinically relevant indicators of abnormal cell division, yet their reliable detection remains challenging due to morphological ambiguity and scanner variability. In this work, we investigated three variants of adapting the pathology foundation model UNI2-h for the MIDOG2025 Track 2 challenge. Starting from a LoRA-based baseline, we found that visual prompt tuning (VPT) substantially improved generalization, and that further integrating test-time augmentation (TTA) with Vahadane and Macenko stain normalization provided the best robustness. Our final submission achieved a balanced accuracy of 0.8837 and an ROC-AUC of 0.9513 on the preliminary leaderboard, ranking within the top 10 teams. These results demonstrate that prompt-based adaptation combined with stain-normalization TTA offers an effective strategy for atypical mitosis classification under diverse imaging conditions.
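Stain-normalization TTA reduces to averaging the model's probabilities over re-stained views of the same patch. A minimal sketch in which the Vahadane/Macenko normalizers are abstracted as callables; the stand-in transforms below are assumptions, not the authors' pipeline:

```python
# Sketch of test-time augmentation over stain-normalized views: run the model
# on each view and average the class probabilities.
import torch

def tta_predict(model, patch, transforms):
    """patch: (3, H, W); transforms: list of img->img callables."""
    views = torch.stack([t(patch) for t in transforms])
    with torch.no_grad():
        probs = model(views).softmax(dim=1)
    return probs.mean(dim=0)            # averaged over augmented views

identity = lambda x: x                  # stand-in for "no normalization"
flipped = lambda x: torch.flip(x, dims=[-1])  # stand-in for a stain transform
model = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.LazyLinear(2))
print(tta_predict(model, torch.rand(3, 224, 224), [identity, flipped]))
```

In practice the transform list would hold actual Vahadane and Macenko normalizers (implementations exist in stain-normalization libraries) plus the identity.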

[362] Ensemble YOLO Framework for Multi-Domain Mitotic Figure Detection in Histopathology Images

Navya Sri Kelam, Akash Parekh, Saikiran Bonthu, Nitin Singhal

Main category: eess.IV

TL;DR: Ensemble of YOLOv5 and YOLOv8 detectors with stain-invariant augmentations improves mitosis detection in histopathology images, achieving better sensitivity without major precision loss.

DetailsMotivation: Mitotic figure detection is challenging due to scarcity, morphological heterogeneity, and staining variability. MIDOG competition provides standardized benchmarks to develop generalizable deep learning models for digital pathology.

Method: Trained YOLOv5 and YOLOv8 detectors on MIDOG++, CMC, and CCMCT datasets with stain-invariant color perturbations and texture preserving augmentations. Employed ensemble strategy combining both models.

Result: YOLOv5 achieved superior precision while YOLOv8 provided improved recall. The ensemble approach enhanced sensitivity without significant precision reduction.

Conclusion: Ensemble strategies built on contemporary object detectors effectively advance automated mitosis detection in digital pathology by leveraging complementary strengths of different architectures.

Abstract: Accurate detection of mitotic figures in whole slide histopathological images remains a challenging task due to their scarcity, morphological heterogeneity, and the variability introduced by tissue preparation and staining protocols. The MIDOG competition series provides standardized benchmarks for evaluating detection approaches across diverse domains, thus motivating the development of generalizable deep learning models. In this work, we investigate the performance of two modern one-stage detectors, YOLOv5 and YOLOv8, trained on MIDOG++, CMC, and CCMCT datasets. To enhance robustness, training incorporated stain-invariant color perturbations and texture preserving augmentations. In internal validation, YOLOv5 achieved superior precision, while YOLOv8 provided improved recall, reflecting architectural trade-offs between anchor-based and anchor-free detection. To capitalize on these complementary strengths, we employed an ensemble of the two models, which improved sensitivity without a major reduction in precision. These findings highlight the effectiveness of ensemble strategies built upon contemporary object detectors to advance automated mitosis detection in digital pathology.

[363] Deep Self-knowledge Distillation: A hierarchical supervised learning for coronary artery segmentation

Mingfeng Lin

Main category: eess.IV

TL;DR: Novel Deep Self-knowledge Distillation method for coronary artery segmentation that uses hierarchical outputs and dual loss functions to improve performance and generalizability.

DetailsMotivation: Manual coronary artery segmentation from X-ray angiography is time-consuming, and existing automated methods suffer from poor performance and limited generalizability. Current knowledge distillation approaches don't fully utilize hierarchical model knowledge.

Method: Deep Self-knowledge Distillation combining Deep Distribution Loss and Pixel-wise Self-knowledge Distillation Loss, using hierarchical outputs for supervision with probabilistic distribution vectors and pixel-wise constraints.

Result: Outperforms other models on XCAD and DCA1 datasets in dice coefficient, accuracy, sensitivity, and IoU metrics.

Conclusion: The proposed hierarchical knowledge distillation approach effectively enhances segmentation performance and model robustness for coronary artery segmentation tasks.

Abstract: Coronary artery disease is a leading cause of mortality, underscoring the critical importance of precise diagnosis through X-ray angiography. Manual coronary artery segmentation from these images is time-consuming and inefficient, prompting the development of automated models. However, existing methods, whether rule-based or deep learning models, struggle with issues like poor performance and limited generalizability. Moreover, current knowledge distillation methods applied in this field have not fully exploited the hierarchical knowledge of the model, leading to certain information waste and insufficient enhancement of the model’s performance capabilities for segmentation tasks. To address these issues, this paper introduces Deep Self-knowledge Distillation, a novel approach for coronary artery segmentation that leverages hierarchical outputs for supervision. By combining Deep Distribution Loss and Pixel-wise Self-knowledge Distillation Loss, our method enhances the student model’s segmentation performance through a hierarchical learning strategy, effectively transferring knowledge from the teacher model. Our method combines a loosely constrained probabilistic distribution vector with tightly constrained pixel-wise supervision, providing dual regularization for the segmentation model while also enhancing its generalization and robustness. Extensive experiments on the XCAD and DCA1 datasets demonstrate that our approach outperforms other models in Dice coefficient, accuracy, sensitivity, and IoU in comparative evaluations.
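A common form of self-distillation between hierarchical heads is a softened KL term from the deepest output to each auxiliary output, on top of the usual supervised loss. A generic sketch of that pattern; the paper's exact Deep Distribution and Pixel-wise losses may differ:

```python
# Sketch of hierarchical self-distillation for segmentation: the deepest head
# supervises a shallower auxiliary head via a temperature-softened KL term.
import torch
import torch.nn.functional as F

def self_distill_loss(aux_logits, final_logits, target, T=2.0, alpha=0.5):
    """aux_logits / final_logits: (B, C, H, W); target: (B, H, W) class ids."""
    ce = F.cross_entropy(aux_logits, target)        # supervised term
    kd = F.kl_div(                                  # distill from deepest head
        F.log_softmax(aux_logits / T, dim=1),
        F.softmax(final_logits.detach() / T, dim=1),
        reduction="batchmean",
    ) * T * T
    return alpha * ce + (1 - alpha) * kd

aux = torch.randn(2, 2, 64, 64)
fin = torch.randn(2, 2, 64, 64)
mask = torch.randint(0, 2, (2, 64, 64))
print(self_distill_loss(aux, fin, mask))
```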

[364] Prompt-Guided Patch UNet-VAE with Adversarial Supervision for Adrenal Gland Segmentation in Computed Tomography Medical Images

Hania Ghouse, Muzammil Behzad

Main category: eess.IV

TL;DR: A unified framework combining variational reconstruction, supervised segmentation, and adversarial feedback for small abdominal organ segmentation in CT imaging, addressing class imbalance and limited data through hybrid generative-discriminative training.

DetailsMotivation: Segmentation of small and irregularly shaped abdominal organs like adrenal glands in CT imaging faces challenges including severe class imbalance, poor spatial context, and limited annotated data.

Method: VAE-UNet backbone that jointly reconstructs input patches and generates voxel-level segmentation masks, with patch-based training using synthetic patches from learned latent space, perceptual reconstruction loss with VGG features, and PatchGAN-style discriminator for adversarial supervision.

Result: Comprehensive experiments on BTCV dataset show improved segmentation accuracy, particularly in boundary-sensitive regions, while maintaining strong reconstruction quality.

Conclusion: Hybrid generative-discriminative training regimes are effective for small-organ segmentation, providing insights into balancing realism, diversity, and anatomical consistency in data-scarce scenarios.

Abstract: Segmentation of small and irregularly shaped abdominal organs, such as the adrenal glands in CT imaging, remains a persistent challenge due to severe class imbalance, poor spatial context, and limited annotated data. In this work, we propose a unified framework that combines variational reconstruction, supervised segmentation, and adversarial patch-based feedback to address these limitations in a principled and scalable manner. Our architecture is built upon a VAE-UNet backbone that jointly reconstructs input patches and generates voxel-level segmentation masks, allowing the model to learn disentangled representations of anatomical structure and appearance. We introduce a patch-based training pipeline that selectively injects synthetic patches generated from the learned latent space, and systematically study the effects of varying synthetic-to-real patch ratios during training. To further enhance output fidelity, the framework incorporates perceptual reconstruction loss using VGG features, as well as a PatchGAN-style discriminator for adversarial supervision over spatial realism. Comprehensive experiments on the BTCV dataset demonstrate that our approach improves segmentation accuracy, particularly in boundary-sensitive regions, while maintaining strong reconstruction quality. Our findings highlight the effectiveness of hybrid generative-discriminative training regimes for small-organ segmentation and provide new insights into balancing realism, diversity, and anatomical consistency in data-scarce scenarios.

[365] Generalist versus Specialist Vision Foundation Models for Ocular Disease and Oculomics

Yukun Zhou, Paul Nderitu, Jocelyn Hui Lin Goh, Justin Engelmann, Siegfried K. Wagner, Anran Ran, Hongyang Jiang, Lie Ju, Ke Zou, Sahana Srinivasan, Hyunmin Kim, Takahiro Ninomiya, Zheyuan Wang, Gabriel Dawei Yang, Eden Ruffell, Dominic Williamson, Rui Santos, Gabor Mark Somfai, Carol Y. Cheung, Tien Yin Wong, Daniel C. Alexander, Yih Chung Tham, Pearse A. Keane

Main category: eess.IV

TL;DR: Generalist foundation models like DINOv2/DINOv3 show strong adaptability in retinal applications, but specialist model RETFound-DINOv2 still outperforms them in ocular disease detection and oculomics tasks with better generalization and data efficiency.

DetailsMotivation: To investigate whether domain-specific pre-training remains essential given the emergence of increasingly powerful generalist foundation models, and to identify what performance gaps persist between specialist and generalist approaches in retinal image applications.

Method: Systematic evaluation comparing DINOv2 and DINOv3 generalist models against two specialist RETFound models (RETFound-MAE and RETFound-DINOv2) on ocular disease detection and systemic disease prediction using fine-tuning and linear probing adaptation strategies, with analysis of data efficiency and adaptation efficiency.

Result: RETFound-DINOv2 consistently outperformed generalist foundation models in ocular-disease detection and oculomics tasks, demonstrating stronger generalizability and data efficiency, although generalist models showed strong adaptability across diverse tasks.

Conclusion: Specialist retinal foundation models remain the most effective choice for clinical applications, but the narrowing gap with generalist models suggests continued data and model scaling can deliver domain-relevant gains, positioning them as strong foundations for future medical foundation models.

Abstract: Medical foundation models, pre-trained with large-scale clinical data, demonstrate strong performance in diverse clinically relevant applications. RETFound, trained on nearly one million retinal images, exemplifies this approach in applications with retinal images. However, the emergence of increasingly powerful and multifold larger generalist foundation models such as DINOv2 and DINOv3 raises the question of whether domain-specific pre-training remains essential, and if so, what gap persists. To investigate this, we systematically evaluated the adaptability of DINOv2 and DINOv3 in retinal image applications, compared to two specialist RETFound models, RETFound-MAE and RETFound-DINOv2. We assessed performance on ocular disease detection and systemic disease prediction using two adaptation strategies: fine-tuning and linear probing. Data efficiency and adaptation efficiency were further analysed to characterise trade-offs between predictive performance and computational cost. Our results show that although scaling generalist models yields strong adaptability across diverse tasks, RETFound-DINOv2 consistently outperforms these generalist foundation models in ocular-disease detection and oculomics tasks, demonstrating stronger generalisability and data efficiency. These findings suggest that specialist retinal foundation models remain the most effective choice for clinical applications, while the narrowing gap with generalist foundation models suggests that continued data and model scaling can deliver domain-relevant gains and position them as strong foundations for future medical foundation models.
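Linear probing, one of the two adaptation strategies compared, freezes the backbone and trains only a linear head on its embeddings. A minimal sketch with a placeholder encoder standing in for DINOv2/DINOv3/RETFound:

```python
# Sketch of linear probing: the backbone is frozen and only a linear
# classifier on top of its features is trained.
import torch

backbone = torch.nn.Sequential(               # frozen stand-in encoder
    torch.nn.Conv2d(3, 32, 7, stride=4), torch.nn.ReLU(),
    torch.nn.AdaptiveAvgPool2d(1), torch.nn.Flatten(),
)
for p in backbone.parameters():
    p.requires_grad_(False)

probe = torch.nn.Linear(32, 2)                 # the only trainable part
opt = torch.optim.Adam(probe.parameters(), lr=1e-3)

x, y = torch.rand(8, 3, 224, 224), torch.randint(0, 2, (8,))
with torch.no_grad():
    feats = backbone(x)                        # cached features, no backprop
loss = torch.nn.functional.cross_entropy(probe(feats), y)
loss.backward()
opt.step()
print(float(loss))
```

Because features can be precomputed once, linear probing is far cheaper than fine-tuning, which is exactly the adaptation-efficiency trade-off the study analyses.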

[366] Embedding Similarity Guided License Plate Super Resolution

Abderrezzaq Sendjasni, Mohamed-Chaker Larabi

Main category: eess.IV

TL;DR: Novel framework combining pixel-based loss with embedding similarity learning for license plate super-resolution, achieving state-of-the-art results on benchmark datasets.

DetailsMotivation: Enhance license plate recognition quality in security/surveillance applications by improving super-resolution techniques for better perceptual and structural fidelity.

Method: Proposes pixel and embedding consistency loss (PECL) using Siamese network with contrastive loss to balance pixel-wise accuracy and embedding-level consistency.

Result: Superior performance on CCPD and PKU datasets with improvements in PSNR, SSIM, LPIPS, and OCR accuracy over state-of-the-art methods.

Conclusion: Embedding similarity learning effectively advances both perceptual quality and task-specific performance in extreme super-resolution scenarios.

Abstract: Super-resolution (SR) techniques play a pivotal role in enhancing the quality of low-resolution images, particularly for applications such as security and surveillance, where accurate license plate recognition is crucial. This study proposes a novel framework that combines pixel-based loss with embedding similarity learning to address the unique challenges of license plate super-resolution (LPSR). The introduced pixel and embedding consistency loss (PECL) integrates a Siamese network and applies contrastive loss to enforce embedding similarity, improving perceptual and structural fidelity. By effectively balancing pixel-wise accuracy with embedding-level consistency, the framework achieves superior alignment of fine-grained features between high-resolution (HR) and super-resolved (SR) license plates. Extensive experiments on the CCPD and PKU datasets validate the efficacy of the proposed framework, demonstrating consistent improvements over state-of-the-art methods in terms of PSNR, SSIM, LPIPS, and optical character recognition (OCR) accuracy. These results highlight the potential of embedding similarity learning to advance both perceptual quality and task-specific performance in extreme super-resolution scenarios.
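The pixel-plus-embedding idea behind PECL can be sketched as a reconstruction term plus a contrastive term over Siamese embeddings. The encoder, weighting, and contrastive form below are illustrative stand-ins, not the paper's exact formulation:

```python
# Sketch of a combined pixel + embedding-consistency loss: L1 keeps the SR
# plate close to the HR target pixel-wise, while a contrastive term pulls
# their embeddings (from a shared encoder) together.
import torch
import torch.nn.functional as F

encoder = torch.nn.Sequential(                 # stand-in Siamese branch
    torch.nn.Conv2d(3, 16, 3, stride=2, padding=1), torch.nn.ReLU(),
    torch.nn.AdaptiveAvgPool2d(1), torch.nn.Flatten(),
)

def pecl_loss(sr, hr, lambda_emb=0.1, margin=1.0, positive=True):
    pixel = F.l1_loss(sr, hr)                  # pixel-wise fidelity
    z_sr, z_hr = encoder(sr), encoder(hr)
    d = F.pairwise_distance(z_sr, z_hr)
    # Contrastive form: pull matching pairs together, push mismatches apart.
    emb = d.pow(2).mean() if positive else F.relu(margin - d).pow(2).mean()
    return pixel + lambda_emb * emb

sr = torch.rand(4, 3, 96, 192)                 # super-resolved plates
hr = torch.rand(4, 3, 96, 192)                 # ground-truth HR plates
print(pecl_loss(sr, hr))
```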

[367] Semantic Communication with Entropy-and-Channel-Adaptive Rate Control over Multi-User MIMO Fading Channels

Weixuan Chen, Qianqian Yang, Yuhao Chen, Chongwen Huang, Qian Wang, Zehui Xiong, Zhaoyang Zhang

Main category: eess.IV

TL;DR: A semantic communication system with adaptive rate control for wireless image transmission that dynamically adjusts rates based on feature entropy and channel conditions, outperforming traditional methods.

DetailsMotivation: Existing semantic communication methods use fixed transmission rates regardless of varying channel conditions and content, causing performance degradation in harsh environments.

Method: Integrates entropy-and-channel-adaptive rate control mechanism using feature map entropy, CSI, and SNR. Employs feature map pruning, channel attention, spatial attention, and multi-head self-attention to prioritize critical semantic features.

Result: Outperforms separated source and channel coding and Deep JSCC in rate-distortion performance, flexibility, and robustness, especially in low SNR, imperfect CSI, and inter-user interference scenarios.

Conclusion: The proposed adaptive rate control system effectively optimizes communication resource usage and maintains superior performance under challenging wireless channel conditions.

Abstract: Although significant improvements in transmission efficiency have been achieved, existing semantic communication (SemCom) methods typically use a fixed transmission rate for varying channel conditions and transmission contents, leading to performance degradation under harsh channel conditions. To address these challenges, we propose a novel SemCom method for wireless image transmission that integrates an entropy-and-channel-adaptive rate control mechanism, specifically designed for multi-user multiple-input multiple-output (MU-MIMO) fading channels. Unlike existing methods, our system dynamically adjusts transmission rates by leveraging the entropy of feature maps, channel state information (CSI), and signal-to-noise ratio (SNR), ensuring optimal communication resource usage. It incorporates feature map pruning, channel attention, spatial attention, and multi-head self-attention (MHSA) to effectively prioritize critical semantic features while minimizing unnecessary transmission overhead. Experimental results demonstrate that the proposed system outperforms separated source and channel coding and deep joint source and channel coding (Deep JSCC), in terms of rate-distortion performance, flexibility, and robustness, particularly in challenging scenarios such as low SNR, imperfect CSI, and inter-user interference.

[368] Towards Cardiac MRI Foundation Models: Comprehensive Visual-Tabular Representations for Whole-Heart Assessment and Beyond

Yundi Zhang, Paul Hager, Che Liu, Suprosanna Shit, Chen Chen, Daniel Rueckert, Jiazhen Pan

Main category: eess.IV

TL;DR: ViTa is a multi-modal foundation model that integrates 3D+T cardiac MRI with patient-level health factors to provide comprehensive cardiac health assessment and disease risk prediction.

DetailsMotivation: Current cardiac MRI analysis often ignores important patient-level health factors like demographics and lifestyle, and existing multi-modal approaches are limited to specific tasks with restricted data.

Method: Integrates 3D+T cine MRI stacks from multiple views with detailed tabular patient data from 42,000 UK Biobank participants, learning a shared latent representation for multi-task analysis.

Result: Enables comprehensive cardiac analysis including phenotype prediction, physiological feature estimation, segmentation, and disease classification within a unified framework.

Conclusion: ViTa represents a step toward universal foundation models for cardiac health that provide patient-specific insights by bridging imaging features with clinical context, advancing clinical utility and scalability.

Abstract: Cardiac magnetic resonance imaging is the gold standard for non-invasive cardiac assessment, offering rich spatio-temporal views of the cardiac anatomy and physiology. Patient-level health factors, such as demographics, metabolic, and lifestyle, are known to substantially influence cardiovascular health and disease risk, yet remain uncaptured by CMR alone. To holistically understand cardiac health and to enable the best possible interpretation of an individual’s disease risk, CMR and patient-level factors must be jointly exploited within an integrated framework. Recent multi-modal approaches have begun to bridge this gap, yet they often rely on limited spatio-temporal data and focus on isolated clinical tasks, thereby hindering the development of a comprehensive representation for cardiac health evaluation. To overcome these limitations, we introduce ViTa, a step toward foundation models that delivers a comprehensive representation of the heart and a precise interpretation of individual disease risk. Leveraging data from 42,000 UK Biobank participants, ViTa integrates 3D+T cine stacks from short-axis and long-axis views, enabling a complete capture of the cardiac cycle. These imaging data are then fused with detailed tabular patient-level factors, enabling context-aware insights. This multi-modal paradigm supports a wide spectrum of downstream tasks, including cardiac phenotype and physiological feature prediction, segmentation, and classification of cardiac and metabolic diseases within a single unified framework. By learning a shared latent representation that bridges rich imaging features and patient context, ViTa moves beyond traditional, task-specific models toward a universal, patient-specific understanding of cardiac health, highlighting its potential to advance clinical utility and scalability in cardiac analysis.

[369] Multimodal Medical Image Binding via Shared Text Embeddings

Yunhao Liu, Suyang Xi, Shiqi Liu, Hong Ding, Chicheng Jin, Chong Zhong, Junjun He, Catherine C. Liu, Yiqing Shen

Main category: eess.IV

TL;DR: M^3Bind is a novel pre-training framework that enables alignment of multiple medical imaging modalities through a shared text representation space without requiring explicit paired data between modalities.

DetailsMotivation: Medical image analysis requires integration of multiple imaging modalities for accurate diagnosis, but existing methods like CLIP need explicitly paired data which is difficult to acquire in medical contexts.

Method: Fine-tunes pre-trained CLIP-like models to align modality-specific text embedding spaces while preserving original image-text alignments, then distills these into a unified model with shared text embedding space.

Result: Achieves state-of-the-art performance in zero-shot, few-shot classification and cross-modal retrieval tasks across X-ray, CT, retina, ECG, and pathological images.

Conclusion: M^3Bind effectively achieves cross-image-modal alignment for medical analysis without requiring explicit paired data between modalities.

Abstract: Medical image analysis increasingly relies on the integration of multiple imaging modalities to capture complementary anatomical and functional information, enabling more accurate diagnosis and treatment planning. Achieving aligned feature representations across these diverse modalities is therefore important for effective multimodal analysis. While contrastive language-image pre-training (CLIP) and its variant have enabled image-text alignments, they require explicitly paired data between arbitrary two modalities, which is difficult to acquire in medical contexts. To address the gap, we present Multimodal Medical Image Binding with Text (M^3Bind), a novel pre-training framework that enables seamless alignment of multiple medical imaging modalities through a shared text representation space without requiring explicit paired data between any two medical image modalities. Specifically, based on the insight that different images can naturally bind with text, M^3Bind first fine-tunes pre-trained CLIP-like image-text models to align their modality-specific text embedding space while preserving their original image-text alignments. Subsequently, we distill these modality-specific text encoders into a unified model, creating a shared text embedding space. Experiments on X-ray, CT, retina, ECG, and pathological images on multiple downstream tasks demonstrate that M^3Bind achieves state-of-the-art performance in zero-shot, few-shot classification and cross-modal retrieval tasks compared to its CLIP-like counterparts. These results validate M^3Bind’s effectiveness in achieving cross-image-modal alignment for medical analysis.

[370] Grid-Reg: Detector-Free Gridized Feature Learning and Matching for Large-Scale SAR-Optical Image Registration

Xiaochen Wei, Weiwei Guo, Zenghui Zhang, Wenxian Yu

Main category: eess.IV

TL;DR: Grid-Reg is a grid-based multimodal registration framework for SAR-optical image registration that uses a hybrid Siamese network with correlation metric learning and a grid-based solver to handle large modality gaps and geometric differences.

DetailsMotivation: Existing methods struggle with large-scale heterogeneous SAR and optical image registration due to significant geometric, radiometric, and temporal differences across platforms.

Method: Proposes Grid-Reg framework with HSCMLNet (Hybrid Siamese Correlation Metric Learning Network) using equiangular unit basis vectors and manifold consistency loss, plus Grid-Solver with progressive dual-loop search strategy for transformation estimation.

Result: Extensive experiments show superior performance over state-of-the-art methods on a challenging benchmark dataset using real-world UAV MiniSAR and Google Earth optical imagery.

Conclusion: The proposed approach effectively addresses the challenges of cross-platform SAR-optical image registration with large modality gaps through robust feature learning and grid-based correspondence matching.

Abstract: It is highly challenging to register large-scale, heterogeneous SAR and optical images, particularly across platforms, due to significant geometric, radiometric, and temporal differences, which most existing methods struggle to address. To overcome these challenges, we propose Grid-Reg, a grid-based multimodal registration framework comprising a domain-robust descriptor extraction network, Hybrid Siamese Correlation Metric Learning Network (HSCMLNet), and a grid-based solver (Grid-Solver) for transformation parameter estimation. In heterogeneous imagery with large modality gaps and geometric differences, obtaining accurate correspondences is inherently difficult. To robustly measure similarity between gridded patches, HSCMLNet integrates a hybrid Siamese module with a correlation metric learning module (CMLModule) based on equiangular unit basis vectors (EUBVs), together with a manifold consistency loss to promote modality-invariant, discriminative feature learning. The Grid-Solver estimates transformation parameters by minimizing a global grid matching loss through a progressive dual-loop search strategy to reliably find patch correspondences across entire images. Furthermore, we curate a challenging benchmark dataset for SAR-to-optical registration using real-world UAV MiniSAR data and Google Earth optical imagery. Extensive experiments demonstrate that our proposed approach achieves superior performance over state-of-the-art methods.

Last updated: 2025-09-15
Built with Hugo; theme modified from Stack