Daily arXiv Papers - 2026-01-26

AI-enhanced summaries of 24 research papers from arXiv

Today’s Research Highlights

AI-enhanced summaries of the latest research papers from arXiv.

Table of Contents

cs.CL

[1] ChiEngMixBench: Evaluating Large Language Models on Spontaneous and Natural Chinese-English Code-Mixed Generation

Qingyan Yang, Tongxi Wang, Yunsheng Luo

Main category: cs.CL

TL;DR: ChiEngMixBench is the first benchmark evaluating code-mixing ability in authentic community contexts, focusing on cognitive alignment through spontaneity and naturalness metrics, revealing emergent terminology layering strategies consistent with MLF theory.

Motivation: Existing work reduces code-mixing to translation/convertibility problems, making it difficult to assess whether model switching behavior is context-appropriate and aligned with human conventions in increasingly prevalent code-mixed interactions.

Method: Introduces ChiEngMixBench benchmark with general construction pipeline for scalable dataset development across domains and bilingual pairs. Formulates code-mixing as cognitive alignment problem with two complementary signals: Spontaneity and Naturalness.

Result: Empirical evaluation shows metrics can systematically distinguish code-mixing performance across models. Uncovers implicitly emergent Terminology Layering Strategy phenomenon consistent with Matrix Language Frame (MLF) theory.

Conclusion: ChiEngMixBench enables evaluation of code-mixing in authentic contexts, revealing structured cognitive alignment between multilingual LLMs and human communication through emergent strategies consistent with linguistic theories.

Abstract: Code-mixing is increasingly prevalent in interactions between humans and large language models, yet existing work often reduces it to a translation or convertibility problem, making it difficult to assess whether a model’s switching behavior is context-appropriate and aligned with human conventions. We introduce ChiEngMixBench, the first benchmark designed to evaluate code-mixing ability in authentic community contexts, built upon a general construction pipeline that enables scalable dataset development across domains and bilingual pairs. ChiEngMixBench formulates code-mixing as a cognitive alignment problem, characterized by two complementary signals: Spontaneity and Naturalness. Empirical evaluation shows that our metrics can systematically distinguish code-mixing performance across models. Beyond benchmarking, we further uncover an implicitly emergent Terminology Layering Strategy, a phenomenon consistent with the Matrix Language Frame (MLF) theory, indicating structured cognitive alignment between multilingual large language models and human communication.

[2] M3Kang: Evaluating Multilingual Multimodal Mathematical Reasoning in Vision-Language Models

Aleix Torres-Camps, Nathaniel Mitrani Hadida, Víctor Conchello Vendrell, Àlex Batlle Casellas, Arnau Padrés Masdemont, Jordi Ros-Giralt

Main category: cs.CL

TL;DR: M3Kang is a new multilingual multimodal math reasoning dataset from Kangaroo Math Competition with 1,747 problems in 108 languages, enabling benchmarking of VLMs against human performance.

Motivation: To address the underexplored performance of vision-language models in multilingual mathematical reasoning and enable direct comparison with human performance, as current models struggle with basic math and diagram-based reasoning across languages.

Method: Created M3Kang dataset from Kangaroo Math Competition problems (world’s largest math contest), including 1,747 multiple-choice problems organized by grade-level difficulty, translated into 108 languages, some with essential diagrams. Conducted extensive benchmarking on state-of-the-art closed- and open-source models.

Result: Models still struggle with basic math and diagram-based reasoning despite recent advances. Performance scales with language presence and model size, but not with grade level. Multilingual techniques can be effectively extended to multimodal settings, showing significant improvements over baselines. Dataset includes human performance data from 68,000+ students for comparison.

Conclusion: M3Kang fills a critical gap in multilingual multimodal math reasoning evaluation, revealing current model limitations and enabling future research. The dataset and framework are open-sourced to facilitate further development in this area.

Abstract: Although state-of-the-art vision-language models (VLMs) have demonstrated strong reasoning capabilities, their performance in multilingual mathematical reasoning remains underexplored, particularly when compared to human performance. To bridge this gap, we introduce M3Kang, the first massively multilingual, multimodal mathematical reasoning dataset for VLMs. It is derived from the Kangaroo Math Competition, the world’s largest mathematics contest, which annually engages over six million participants under the age of 18 across more than 90 countries. M3Kang includes 1,747 unique multiple-choice problems organized by grade-level difficulty, with translations into 108 culturally diverse languages, some of which include diagrams essential for solving the problems. Using this dataset, we conduct extensive benchmarking on both closed- and open-source SOTA models. We observe that, despite recent advances, models still struggle with basic math and diagram-based reasoning, with performance scaling with language presence and model size, but not with grade level. We also find that multilingual techniques can be effectively extended to the multimodal setting, resulting in significant improvements over baseline approaches. Our analysis also incorporates performance data from over 68,000 students, enabling direct comparison with human performance. We are open-sourcing M3Kang, including the English-only subset M2Kang, along with the framework and codebase used to construct the dataset.

[3] Domain Specific Specialization in Low-Resource Settings: The Efficacy of Offline Response-Based Knowledge Distillation in Large Language Models

Erdem Aslan, Pakize Erdoğmuş

Main category: cs.CL

TL;DR: Offline knowledge distillation method creates specialized LLM assistants using limited data (500 lines) with 96.7% accuracy, proving data quality matters more than quantity for domain adaptation.

Motivation: LLMs struggle with hallucinations on domain-specific knowledge absent from pre-training, needing specialized assistants that work under constrained hardware resources.

Method: Offline response-based knowledge distillation using three data strategies: general domain adaptation (15k lines), unstructured knowledge injection (2k lines), and context-aware synthetic dataset (500 lines) generated by teacher model. Uses Unsloth library to optimize Qwen-2.5-7B model, reducing GPU memory from 40GB to 16GB.
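
To make the low-memory setup concrete, here is a minimal sketch of 4-bit loading plus LoRA adapters, written with plain transformers and peft standing in for the paper's Unsloth-optimized pipeline; the model ID and hyperparameters are illustrative assumptions, not the paper's configuration.

```python
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

# Quantize the student to 4-bit so fine-tuning fits in far less VRAM,
# then attach LoRA adapters so only a small set of weights is trained.
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-7B-Instruct",                      # stand-in for Qwen-2.5-7B
    quantization_config=BitsAndBytesConfig(load_in_4bit=True),
    device_map="auto",
)
model = get_peft_model(model, LoraConfig(
    r=16, lora_alpha=32, target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM",
))
model.print_trainable_parameters()                   # adapters only
```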

Result: 500-line context-aware dataset achieved 96.7% accuracy with robust rejection capability, while larger unstructured datasets suffered from persistent hallucinations. Validates LIMA hypothesis that data quality and structural alignment are more critical than quantity.

Conclusion: High-accuracy specialized assistants can be developed with minimal data when using context-aware synthetic datasets and optimization techniques, making domain adaptation feasible in low-resource settings.

Abstract: Large Language Models (LLMs) excel in general tasks but often struggle with hallucinations when handling domain-specific or institutional knowledge absent from their pre-training. We present an offline response-based knowledge distillation method that develops high-accuracy specialized assistants under constrained hardware resources. We evaluate three distinct data strategies: general domain adaptation (15,000 lines), unstructured knowledge injection (2,000 lines), and a context-aware synthetic dataset (500 lines) generated by a teacher model. To minimize computational costs, we utilize the Unsloth library to optimize the Qwen-2.5-7B student model, reducing NVIDIA A100 GPU memory requirements from 40 GB to 16 GB. Experimental results demonstrate that while larger unstructured datasets suffer from persistent hallucinations, the 500-line context-aware dataset achieves a 96.7% accuracy rate and robust rejection capability. These findings validate the LIMA hypothesis, showing that data quality and structural alignment are more critical than quantity for domain adaptation in low-resource settings.

[4] Towards Latent Diffusion Suitable For Text

Nesta Midavaine, Christian A. Naesseth, Grigory Bartosh

Main category: cs.CL

TL;DR: Neural Flow Diffusion Models (NFDM) extend continuous diffusion to discrete language spaces, achieving faster sampling and better coherence than autoregressive LLMs while closing the likelihood gap.

Motivation: To overcome limitations of autoregressive LLMs (slow sampling, coherence issues) by applying continuous diffusion models to discrete language state spaces for improved generation quality and efficiency.

Method: Extends NFDM to learn a multivariate forward process from data, ensuring forward process and generative trajectory are well-suited for language modeling. Enables straightforward application of continuous diffusion to discrete spaces.

Result: Substantially reduces likelihood gap with same-size autoregressive models while achieving sample quality comparable to previous latent diffusion models.

Conclusion: NFDM successfully bridges continuous diffusion and discrete language modeling, offering improved sampling speed and coherence over autoregressive approaches while maintaining competitive generation quality.

Abstract: Language diffusion models aim to improve sampling speed and coherence over autoregressive LLMs. We introduce Neural Flow Diffusion Models for language generation, an extension of NFDM that enables the straightforward application of continuous diffusion models to discrete state spaces. NFDM learns a multivariate forward process from the data, ensuring that the forward process and generative trajectory are a good fit for language modeling. Our model substantially reduces the likelihood gap with autoregressive models of the same size, while achieving sample quality comparable to that of previous latent diffusion models.

[5] Limits of n-gram Style Control for LLMs via Logit-Space Injection

Sami-ul Ahmed

Main category: cs.CL

TL;DR: N-gram style priors injected in logit space can provide lightweight style control for frozen LLMs, but only work effectively within a narrow low-lambda range and are consistently outperformed by prompting and LoRA.

Motivation: Current personalization methods like prompt engineering or LoRA fine-tuning have limitations: writing style is hard to distill into a single prompt, and LoRA requires computationally intensive training. The paper investigates a lightweight alternative for style control without training.

Method: Train n-gram models on stylistically distinct corpora (Don Quixote, CNN/DailyMail, arXiv abstracts) to create interpolated 1-to-3-gram priors. During generation, modify the LLM’s logits by adding weighted sum of style log-probabilities matching current context, scaled by control parameter lambda.
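
A minimal sketch of the injection step, assuming a trained n-gram model exposed as a log-probability lookup; the stub below is a placeholder for the trained 1-to-3-gram prior, not the paper's code.

```python
import math
import torch

def steered_logits(logits, context_ids, ngram_logprob, lam=0.1, orders=(1, 2, 3)):
    """Add a lambda-weighted sum of n-gram style log-probs to frozen-LM logits."""
    style = torch.zeros_like(logits)
    for n in orders:
        ctx = tuple(context_ids[-(n - 1):]) if n > 1 else ()
        style = style + ngram_logprob(ctx, logits.shape[-1])  # log P_style(token | ctx)
    return logits + lam * style

# Toy usage: a uniform stub stands in for the trained style prior.
vocab = 32000
uniform = lambda ctx, v: torch.full((v,), -math.log(v))
steered = steered_logits(torch.randn(vocab), [101, 2024], uniform, lam=0.1)
```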

Result: Found a single narrow regime (Don Quixote corpus at lambda=0.1) where style perplexity improved by 24.7% and base-model perplexity improved by 51.4%. Outside this regime, even small lambda values generally worsen style and fluency, with larger values causing collapse and incoherent text.

Conclusion: Logit-space injection of n-gram style priors provides lightweight, tunable style control but is fragile - only effective within narrow low-lambda range and consistently outperformed by prompting and LoRA methods.

Abstract: Large language models (LLMs) are typically personalized via prompt engineering or parameter-efficient fine-tuning such as LoRA. However, writing style can be difficult to distill into a single prompt, and LoRA fine-tuning requires computationally intensive training and infrastructure. We investigate a possible lightweight alternative: steering a frozen LLM with n-gram style priors injected in logit space at decoding time. We train an n-gram model on stylistically distinct corpora – including Don Quixote, CNN/DailyMail news headlines, and arXiv abstracts – constructing an interpolated 1-to-3-gram prior over next-token probabilities. During generation we modify the LLM’s logits by adding a weighted sum of style log-probabilities from each n-gram order that matches the current context, scaled by a control parameter lambda in [0, 1]. We sweep lambda and style corpora and report style perplexity under the n-gram model, base-model perplexity as a proxy for fluency, Jensen-Shannon (JS) divergence between the original and steered token distributions, and token-overlap statistics. On TinyLlama-1.1B we identify a single narrow regime (for the Don Quixote corpus at lambda=0.1) where style perplexity improves by 24.7% and base-model perplexity improves by 51.4% relative to the frozen model. Outside this regime, and for multi-author corpora such as CNN/DailyMail and arXiv abstracts, even small nonzero lambda values generally result in worse style and fluency, and larger lambda values lead to collapse with extreme perplexities and incoherent text. Logit-space injection of n-gram style priors provides lightweight, tunable style control, but it is fragile: it operates effectively only within a narrow range of low lambda values and is consistently outperformed by prompting and LoRA.

[6] GameTalk: Training LLMs for Strategic Conversation

Victor Conchello Vendrell, Max Ruiz Luyten, Mihaela van der Schaar

Main category: cs.CL

TL;DR: GameTalk trains LLMs for strategic multi-agent decision-making through conversational fine-tuning with global reward optimization across full interactions.

Motivation: LLMs struggle with strategic decision-making in multi-agent settings requiring coordination and negotiation over extended conversations. Current approaches focus on isolated decision tasks rather than optimizing long-term objectives through dialogue.

Method: GameTalk framework adapts fine-tuning methods (GRPO, DPO, STaR) to incorporate reward signals dependent on entire conversations, training LLMs to optimize global objectives across multi-turn interactions.
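
One way to read the conversation-level reward adaptation is GRPO-style group normalization over sampled dialogues, where every turn in a dialogue inherits its episode's advantage; a toy sketch under that assumption, with illustrative numbers:

```python
import statistics

def group_advantages(episode_rewards):
    """Normalize full-conversation payoffs within a group of sampled dialogues;
    every turn in dialogue i then shares advantage advs[i]."""
    mu = statistics.mean(episode_rewards)
    sd = statistics.pstdev(episode_rewards) or 1.0   # guard against zero spread
    return [(r - mu) / sd for r in episode_rewards]

print(group_advantages([3.0, 1.0, 4.0, 0.0]))        # e.g. four negotiation rollouts
```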

Result: GameTalk significantly outperforms untrained models, especially under reward shaping, with DPO consistently yielding the strongest gains across increasingly complex games testing reasoning, coordination, and opponent modeling.

Conclusion: Conversational fine-tuning is a promising path for LLMs to reason, negotiate, and act in interactive environments, enabling strategic decision-making through multi-turn interactions.

Abstract: Strategic decision-making in multi-agent settings is a key challenge for large language models (LLMs), particularly when coordination and negotiation must unfold over extended conversations. While recent work has explored the use of LLMs in isolated decision tasks, little attention has been given to optimizing long-term objectives through dialogue. We introduce GameTalk, a framework for training LLMs to make strategic decisions via multi-turn interactions. Unlike prior work that focuses on single-turn objectives or static action prediction, we train LLMs to optimize a global objective across full conversations. We achieve this by adapting fine-tuning methods like GRPO, DPO, and STaR to incorporate reward signals that depend on the entire interaction. We evaluate this approach on a suite of increasingly complex games, designed to stress different aspects of reasoning, coordination, and opponent modeling. Our results show that GameTalk significantly outperforms untrained models, especially under reward shaping, with DPO consistently yielding the strongest gains. These findings position conversational fine-tuning as a promising path for LLMs to reason, negotiate, and act in interactive environments.

[7] Better as Generators Than Classifiers: Leveraging LLMs and Synthetic Data for Low-Resource Multilingual Classification

Branislav Pecher, Jan Cegin, Robert Belanec, Ivan Srba, Jakub Simko, Maria Bielikova

Main category: cs.CL

TL;DR: LLMs can generate synthetic multilingual data that enables smaller models to outperform the large generator LLM itself, especially in low-resource languages.

Motivation: To investigate whether LLMs' synthetic data generation capabilities can serve as effective distillation, allowing smaller models to match or exceed massive LLMs' performance across languages and tasks in low-resource scenarios.

Method: Used a state-of-the-art multilingual LLM to generate synthetic datasets covering 11 languages and 4 classification tasks, then trained smaller models via fine-tuning, instruction tuning, or used as synthetic in-context examples for compact LLMs.
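
A toy sketch of the generator-as-teacher step; `generate` stands in for any LLM API call, and the prompt wording is illustrative rather than the paper's template.

```python
def make_synthetic(generate, language, label, n=20):
    """Prompt the teacher LLM for labeled synthetic texts, one per line."""
    prompt = (f"Write {n} short {language} sentences that express the label "
              f"'{label}'. Output one sentence per line.")
    lines = generate(prompt).splitlines()
    return [(line.strip(), label) for line in lines if line.strip()]

# Stubbed usage; real runs would loop over 11 languages and all task labels.
fake_llm = lambda prompt: "Das Hotel war wunderbar.\nEin großartiger Aufenthalt."
print(make_synthetic(fake_llm, "German", "positive", n=2))
```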

Result: Small amounts of synthetic data enable smaller models to outperform the large generator LLM itself, particularly in low-resource languages. Synthetic data works best when LLMs act as generators rather than classifiers.

Conclusion: LLMs are best utilized as data generators (teachers) rather than direct classifiers, producing synthetic data that empowers smaller, more efficient multilingual models to achieve superior performance.

Abstract: Large Language Models (LLMs) have demonstrated remarkable multilingual capabilities, making them promising tools in both high- and low-resource languages. One particularly valuable use case is generating synthetic samples that can be used to train smaller models in low-resource scenarios where human-labelled data is scarce. In this work, we investigate whether these synthetic data generation capabilities can serve as a form of distillation, producing smaller models that perform on par with or even better than massive LLMs across languages and tasks. To this end, we use a state-of-the-art multilingual LLM to generate synthetic datasets covering 11 languages and 4 classification tasks. These datasets are then used to train smaller models via fine-tuning or instruction tuning, or as synthetic in-context examples for compact LLMs. Our experiments show that even small amounts of synthetic data enable smaller models to outperform the large generator itself, particularly in low-resource languages. Overall, the results suggest that LLMs are best utilised as generators (teachers) rather than classifiers, producing data that empowers smaller and more efficient multilingual models.

[8] Generating Literature-Driven Scientific Theories at Scale

Peter Jansen, Peter Clark, Doug Downey, Daniel S. Weld

Main category: cs.CL

TL;DR: Automated theory synthesis from scientific literature: generates qualitative/quantitative laws from 13.7k papers to create 2.9k theories, comparing literature-grounding vs parametric knowledge approaches.

Motivation: Current automated scientific discovery focuses on experiment generation, but higher-level activities like theory building remain underexplored. There's a need for systems that can synthesize comprehensive theories from large scientific literature corpora.

Method: Formulates theory synthesis problem from large corpora (13.7k source papers). Compares two approaches: literature-grounding (using actual scientific literature) vs parametric knowledge (using LLM memory). Examines accuracy-focused vs novelty-focused generation objectives.

Result: Generated 2.9k theories from 13.7k papers. Literature-supported method significantly outperforms parametric LLM memory approach in both matching existing evidence and predicting future results from 4.6k subsequently-written papers.

Conclusion: Literature-grounding is superior to parametric knowledge for automated theory synthesis, enabling better evidence matching and predictive capability. This advances automated scientific discovery beyond experiment generation to higher-level theory building.

Abstract: Contemporary automated scientific discovery has focused on agents for generating scientific experiments, while systems that perform higher-level scientific activities such as theory building remain underexplored. In this work, we formulate the problem of synthesizing theories consisting of qualitative and quantitative laws from large corpora of scientific literature. We study theory generation at scale, using 13.7k source papers to synthesize 2.9k theories, examining how literature-grounded versus parametric-knowledge generation, and accuracy-focused versus novelty-focused objectives, change theory properties. Our experiments show that, compared to using parametric LLM memory for generation, our literature-supported method creates theories that are significantly better at both matching existing evidence and at predicting future results from 4.6k subsequently-written papers.

[9] A Longitudinal, Multinational, and Multilingual Corpus of News Coverage of the Russo-Ukrainian War

Dikshya Mohanty, Taisiia Sabadyn, Jelwin Rodrigues, Chenlu Wang, Abhishek Kalugade, Ritwik Banerjee

Main category: cs.CL

TL;DR: DNIPRO is a multilingual longitudinal corpus of 246K news articles about the Russo-Ukrainian war (Feb 2022-Aug 2024) from 11 media outlets across 5 countries, featuring comprehensive annotations for studying narrative divergence and information warfare.

Motivation: There's a need for systematic transnational analysis of contentious wartime discourse, particularly to understand how competing geopolitical perspectives shape media narratives during conflicts like the Russo-Ukrainian war.

Method: Created a corpus of 246K news articles spanning 11 media outlets across Russia, Ukraine, U.S., U.K., and China, in English, Russian, and Mandarin Chinese. Included consistent metadata and multiple annotation types with rigorous human evaluations for downstream tasks.

Result: Demonstrated utility through use case experiments including stance detection, sentiment analysis, topical framing, and contradiction analysis, revealing polarized interpretations reflecting geopolitical interests and how outlets construct competing realities.

Conclusion: DNIPRO provides a foundational resource for computational journalism research and understanding how conflicting narratives emerge and evolve across global information ecosystems during wartime.

Abstract: We introduce DNIPRO, a novel longitudinal corpus of 246K news articles documenting the Russo-Ukrainian war from Feb 2022 to Aug 2024, spanning eleven media outlets across five nation states (Russia, Ukraine, U.S., U.K., and China) and three languages (English, Russian, and Mandarin Chinese). This multilingual resource features consistent and comprehensive metadata, and multiple types of annotation with rigorous human evaluations for downstream tasks relevant to systematic transnational analyses of contentious wartime discourse. DNIPRO’s distinctive value lies in its inclusion of competing geopolitical perspectives, making it uniquely suited for studying narrative divergence, media framing, and information warfare. To demonstrate its utility, we include use case experiments using stance detection, sentiment analysis, topical framing, and contradiction analysis of major conflict events within the larger war. Our explorations reveal how outlets construct competing realities, with coverage exhibiting polarized interpretations that reflect geopolitical interests. Beyond supporting computational journalism research, DNIPRO provides a foundational resource for understanding how conflicting narratives emerge and evolve across global information ecosystems.

[10] PolyBench: A Large-Scale Benchmark for Polymer Design

Dikshya Mohanty, Mohammad Saqib Hasan, Syed Mostofa Monsur, Size Zheng, Benjamin Hsiao, Niranjan Balasubramanian

Main category: cs.CL

TL;DR: PolyBench: A large-scale polymer design benchmark dataset with 125K+ tasks and knowledge-augmented reasoning distillation method to train specialized SLMs that outperform larger models on polymer design problems.

Motivation: Current LLMs are ineffective for polymer design due to lack of polymer-specific knowledge and insufficient coverage of relevant capabilities in existing aligned models.

Method: Created PolyBench dataset with 125K+ polymer design tasks from 13M+ data points, introduced knowledge-augmented reasoning distillation with structured CoT, and organized tasks from simple to complex reasoning problems.

Result: Small language models (7B-14B parameters) trained on PolyBench outperform similar-sized models and even closed-source frontier LLMs on PolyBench tests, with gains on other polymer benchmarks.

Conclusion: PolyBench enables effective training of specialized polymer design models through comprehensive domain knowledge coverage and structured reasoning distillation, demonstrating that focused training on domain-specific benchmarks can outperform general-purpose LLMs.

Abstract: Research in AI4Science has shown promise in many science applications, including polymer design. However, current LLMs prove ineffective on this problem space because: (i) most models lack polymer-specific knowledge, and (ii) existing aligned models lack coverage of knowledge and capabilities relevant to polymer design. Addressing this, we introduce PolyBench, a large-scale training and test benchmark dataset of more than 125K polymer design related tasks, leveraging a knowledge base of 13M+ data points obtained from experimental and synthetic sources to ensure broad coverage of polymers and their properties. For effective alignment using PolyBench, we introduce a knowledge-augmented reasoning distillation method that augments this dataset with structured CoT. Furthermore, tasks in PolyBench are organized from simple to complex analytical reasoning problems, enabling generalization tests and diagnostic probes across the problem space. Experiments show that small language models (SLMs), of 7B to 14B parameters, trained on PolyBench data outperform similar-sized models, and even closed-source frontier LLMs, on the PolyBench test dataset while demonstrating gains on other polymer benchmarks as well.

[11] Machine-Assisted Grading of Nationwide School-Leaving Essay Exams with LLMs and Statistical NLP

Andres Karjus, Kais Allkivi, Silvia Maine, Katarin Leppik, Krister Kruusmaa, Merilin Aruvee

Main category: cs.CL

TL;DR: LLM-based automated essay scoring achieves human-level performance on Estonian national exams, enabling scalable high-stakes assessment with personalized feedback while maintaining human oversight.

Motivation: Need for rapid, consistent evaluation of large volumes of open-ended exam responses (like national graduation exams) where traditional human grading is time-consuming and resource-intensive.

Method: Applied LLM and statistical NLP-based automated scoring to two large datasets of Estonian national trial exam essays, operationalizing official curriculum-based rubrics and comparing with human panel scores.

Result: Automated scoring achieved performance comparable to human raters, falling within human scoring range. Also evaluated bias, prompt injection risks, and LLMs as essay writers.

Conclusion: A principled, rubric-driven, human-in-the-loop scoring pipeline is viable for high-stakes writing assessment, scalable to national level even in small-language contexts, while maintaining compliance with educational standards.

Abstract: Large language models (LLMs) enable rapid and consistent automated evaluation of open-ended exam responses, including dimensions of content and argumentation that have traditionally required human judgment. This is particularly important in cases where a large number of exams need to be graded in a limited time frame, such as nation-wide graduation exams in various countries. Here, we examine the applicability of automated scoring on two large datasets of trial exam essays of two full national cohorts from Estonia. We operationalize the official curriculum-based rubric and compare LLM and statistical natural language processing (NLP) based assessments with human panel scores. The results show that automated scoring can achieve performance comparable to that of human raters and tends to fall within the human scoring range. We also evaluate bias, prompt injection risks, and LLMs as essay writers. These findings demonstrate that a principled, rubric-driven, human-in-the-loop scoring pipeline is viable for high-stakes writing assessment, particularly relevant for digitally advanced societies like Estonia, which is about to adopt a fully electronic examination system. Furthermore, the system produces fine-grained subscore profiles that can be used to generate systematic, personalized feedback for instruction and exam preparation. The study provides evidence that LLM-assisted assessment can be implemented at a national scale, even in a small-language context, while maintaining human oversight and compliance with emerging educational and regulatory standards.

[12] Regional Bias in Large Language Models

M P V S Gopinadh, Kappara Lakshmi Sindhu, Soma Sekhar Pandu Ranga Raju P, Yesaswini Swarna

Main category: cs.CL

TL;DR: Study evaluates regional bias in 10 major LLMs using FAZE framework, finding significant bias variation with GPT-3.5 showing highest bias and Claude 3.5 Sonnet lowest.

Motivation: Address emerging concern about AI fairness and global representation in LLMs, specifically investigating geographic/regional biases that could undermine reliability and inclusivity in cross-cultural applications.

Method: Developed FAZE evaluation framework using 100 carefully designed prompts probing forced-choice decisions between regions under contextually neutral scenarios. Evaluated 10 LLMs (GPT-3.5, GPT-4o, Gemini 1.5 Flash, Gemini 1.0 Pro, Claude 3 Opus, Claude 3.5 Sonnet, Llama 3, Gemma 7B, Mistral 7B, Vicuna-13B) measuring bias on 10-point scale.

Result: Substantial variation in bias levels across models: GPT-3.5 showed highest bias score (9.5), Claude 3.5 Sonnet scored lowest (2.5). Results demonstrate regional bias meaningfully affects LLM outputs.

Conclusion: Regional bias undermines reliability, fairness, and inclusivity of LLMs in real-world applications. Highlights need for inclusive evaluation frameworks and systematic approaches to identify and mitigate geographic biases in language models.

Abstract: This study investigates regional bias in large language models (LLMs), an emerging concern in AI fairness and global representation. We evaluate ten prominent LLMs: GPT-3.5, GPT-4o, Gemini 1.5 Flash, Gemini 1.0 Pro, Claude 3 Opus, Claude 3.5 Sonnet, Llama 3, Gemma 7B, Mistral 7B, and Vicuna-13B using a dataset of 100 carefully designed prompts that probe forced-choice decisions between regions under contextually neutral scenarios. We introduce FAZE, a prompt-based evaluation framework that measures regional bias on a 10-point scale, where higher scores indicate a stronger tendency to favor specific regions. Experimental results reveal substantial variation in bias levels across models, with GPT-3.5 exhibiting the highest bias score (9.5) and Claude 3.5 Sonnet scoring the lowest (2.5). These findings indicate that regional bias can meaningfully undermine the reliability, fairness, and inclusivity of LLM outputs in real-world, cross-cultural applications. This work contributes to AI fairness research by highlighting the importance of inclusive evaluation frameworks and systematic approaches for identifying and mitigating geographic biases in language models.

[13] Identity, Cooperation and Framing Effects within Groups of Real and Simulated Humans

Suhong Moon, Minwoo Kang, Joseph Suh, Mustafa Safdari, John Canny

Main category: cs.CL

TL;DR: LLMs with deep binding of narrative identities and contextual factors can better simulate human behavior in social dilemmas than simple persona steering.

Motivation: To improve simulation fidelity of human action in social dilemmas by moving beyond simple persona steering to incorporate identity-based behaviors and contextual factors often omitted in experiment descriptions.

Method: Deep binding of base LLMs with extended backstories (narrative identities) and using instruction-tuned models to check consistency, while also modeling contextual factors like time, question framing, and participant pool effects.

Result: Simulation fidelity vs human studies is improved by conditioning base LMs with rich narrative identities and consistency checking, and LLMs can effectively model various contextual factors affecting human studies.

Conclusion: LLMs with deep identity binding and contextual modeling allow exploration of details that affect human studies but are often omitted from experiment descriptions, enabling more accurate replication of human behavior in social dilemmas.

Abstract: Humans act via a nuanced process that depends both on rational deliberation and also on identity and contextual factors. In this work, we study how large language models (LLMs) can simulate human action in the context of social dilemma games. While prior work has focused on “steering” (weak binding) of chat models to simulate personas, we analyze here how deep binding of base models with extended backstories leads to more faithful replication of identity-based behaviors. Our study has these findings: simulation fidelity vs human studies is improved by conditioning base LMs with rich context of narrative identities and checking consistency using instruction-tuned models. We show that LLMs can also model contextual factors such as time (year that a study was performed), question framing, and participant pool effects. LLMs, therefore, allow us to explore the details that affect human studies but which are often omitted from experiment descriptions, and which hamper accurate replication.

[14] PolyAgent: Large Language Model Agent for Polymer Design

Vani Nigam, Achuth Chandrasekhar, Amir Barati Farimani

Main category: cs.CL

TL;DR: A terminal-based framework using LLM reasoning for polymer property prediction and structure generation, guided by synthetic accessibility scores to ensure lab-feasible designs.

Motivation: Traditional polymer discovery involves lengthy trial-and-error experiments requiring extensive resources. While machine learning has accelerated property prediction, lab researchers face infrastructure limitations in accessing these models for individual structure-property analysis.

Method: Developed a closed-loop polymer structure-property predictor integrated in a terminal, powered by LLM reasoning. The framework provides property prediction, property-guided polymer structure generation, and structure modification capabilities. SMILES sequences are guided by synthetic accessibility score and synthetic complexity score (SC Score) to ensure synthetically accessible monomer-level structures.
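
The score-guided generation can be read as a simple threshold filter over candidate SMILES; `sa_score` below is a placeholder for RDKit's contrib SA scorer or an SC Score model, and the cutoff is illustrative.

```python
def filter_candidates(smiles_list, sa_score, max_sa=4.0):
    """Keep only monomer SMILES whose synthetic-accessibility score clears
    the threshold (lower SA score = easier to synthesize)."""
    return [s for s in smiles_list if sa_score(s) <= max_sa]

# Stub scorer for illustration; a real pipeline would call an SA/SC model.
stub_score = lambda smiles: 2.5
print(filter_candidates(["CC(=O)Oc1ccccc1C(=O)O", "C1CC1C2CC2"], stub_score))
```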

Result: Created a framework that addresses the challenge of generating novel polymer structures for laboratory researchers, providing computational insights into polymer research through accessible terminal-based tools.

Conclusion: The terminal-integrated framework enables early-stage polymer discovery by making advanced computational tools accessible to lab researchers, bridging the gap between machine learning models and practical laboratory applications while ensuring synthetic feasibility through accessibility scoring.

Abstract: On-demand polymer discovery is essential for various industries, ranging from biomedical to reinforcement materials. Experiments with polymers involve a long trial-and-error process, leading to lengthy procedures and extensive resource use. For these processes, machine learning has accelerated scientific discovery at the property prediction and latent space search fronts. However, laboratory researchers often cannot readily access the code and models needed to extract individual structures and properties, due to infrastructure limitations. We present a closed-loop polymer structure-property predictor integrated in a terminal for early-stage polymer discovery. The framework is powered by LLM reasoning to provide users with property prediction, property-guided polymer structure generation, and structure modification capabilities. The SMILES sequences are guided by the synthetic accessibility score and the synthetic complexity score (SC Score) to ensure that polymer generation is as close as possible to synthetically accessible monomer-level structures. This framework addresses the challenge of generating novel polymer structures for laboratory researchers, thereby providing computational insights into polymer research.

[15] Cross-Lingual Activation Steering for Multilingual Language Models

Rhitabrat Pokharel, Ameeta Agrawal, Tanay Nagar

Main category: cs.CL

TL;DR: CLAS is a training-free inference-time method that selectively modulates neuron activations to reduce performance gaps between dominant and non-dominant languages in multilingual LLMs, achieving 2.3-3.4% improvements while maintaining high-resource language performance.

Motivation: Large language models show strong multilingual capabilities but have significant performance gaps between dominant and non-dominant languages. Prior work attributes this to imbalances between shared and language-specific neurons in multilingual representations.

Method: Cross-Lingual Activation Steering (CLAS) - a training-free inference-time intervention that selectively modulates neuron activations. It operates through functional divergence rather than strict alignment, with performance gains correlating with increased language cluster separation.
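
Mechanically, inference-time activation steering can be implemented with a forward hook. Which neurons CLAS selects and how it computes the modulation are the paper's contribution, so the model, layer, indices, and scale below are placeholders.

```python
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("gpt2").eval()
target_neurons = torch.tensor([12, 77, 301])   # hypothetical language-specific units
scale = 1.5                                    # hypothetical modulation strength

def steer(module, inputs, output):
    output[..., target_neurons] *= scale       # modulate only the selected units
    return output

handle = model.transformer.h[6].mlp.c_fc.register_forward_hook(steer)
# ... generate with the steered model (no weights change), then detach the hook:
handle.remove()
```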

Result: Achieved average improvements of 2.3% (Accuracy) on classification benchmarks and 3.4% (F1) on generation benchmarks, while maintaining high-resource language performance. Demonstrated that effective transfer operates through functional divergence.

Conclusion: Targeted activation steering can unlock latent multilingual capacity in existing models without modifying model weights, showing that performance gains correlate with increased language cluster separation rather than strict alignment.

Abstract: Large language models exhibit strong multilingual capabilities, yet significant performance gaps persist between dominant and non-dominant languages. Prior work attributes this gap to imbalances between shared and language-specific neurons in multilingual representations. We propose Cross-Lingual Activation Steering (CLAS), a training-free inference-time intervention that selectively modulates neuron activations. We evaluate CLAS on classification and generation benchmarks, achieving average improvements of 2.3% (Acc.) and 3.4% (F1) respectively, while maintaining high-resource language performance. We discover that effective transfer operates through functional divergence rather than strict alignment; performance gains correlate with increased language cluster separation. Our results demonstrate that targeted activation steering can unlock latent multilingual capacity in existing models without modification to model weights.

[16] Cite-While-You-Generate: Training-Free Evidence Attribution for Multimodal Clinical Summarization

Qianqi Yan, Huy Nguyen, Sumana Srivatsa, Hari Bandi, Xin Eric Wang, Krishnaram Kenthapadi

Main category: cs.CL

TL;DR: Training-free framework for clinical summarization that uses decoder attentions to directly cite source text spans or images, improving attribution accuracy without retraining.

Motivation: Clinical summarization needs transparency about where each statement comes from for trustworthiness, requiring source attribution that goes beyond fluent generation.

Method: Proposes a training-free framework using decoder attentions for generation-time source attribution. Two multimodal strategies: raw image mode using image patch attentions, and caption-as-span mode substituting images with generated captions for text-based alignment.
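
A minimal sketch of the attention reduction behind citing: average the heads, pool each generated sentence's attention over candidate source spans, and cite the best-supported span. Span segmentation and layer choice are the paper's; the tensor here is random for illustration.

```python
import torch

def cite_span(attn, gen_slice, span_slices):
    """attn: (num_heads, tgt_len, src_len) decoder attention from one layer."""
    weights = attn.mean(0)[gen_slice]              # average heads, keep generated rows
    scores = [weights[:, s].mean().item() for s in span_slices]
    return int(torch.tensor(scores).argmax())      # index of the best-supported span

attn = torch.rand(12, 30, 100)                     # toy attention tensor
spans = [slice(0, 40), slice(40, 70), slice(70, 100)]
print(cite_span(attn, slice(25, 30), spans))       # cited source-span index
```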

Result: Outperforms embedding-based and self-attribution baselines on clinician-patient dialogues (CliConSummation) and radiology reports (MIMIC-CXR), improving text-level and multimodal attribution accuracy by +15% F1 over embedding baselines. Caption-based attribution achieves competitive performance with raw-image attention while being more lightweight.

Conclusion: Attention-guided attribution is a promising step toward interpretable and deployable clinical summarization systems, offering transparent source citation without requiring model retraining.

Abstract: Trustworthy clinical summarization requires not only fluent generation but also transparency about where each statement comes from. We propose a training-free framework for generation-time source attribution that leverages decoder attentions to directly cite supporting text spans or images, overcoming the limitations of post-hoc or retraining-based methods. We introduce two strategies for multimodal attribution: a raw image mode, which directly uses image patch attentions, and a caption-as-span mode, which substitutes images with generated captions to enable purely text-based alignment. Evaluations on two representative domains, clinician-patient dialogues (CliConSummation) and radiology reports (MIMIC-CXR), show that our approach consistently outperforms embedding-based and self-attribution baselines, improving both text-level and multimodal attribution accuracy (e.g., +15% F1 over embedding baselines). Caption-based attribution achieves competitive performance with raw-image attention while being more lightweight and practical. These findings highlight attention-guided attribution as a promising step toward interpretable and deployable clinical summarization systems.

[17] Clarify or Answer: Reinforcement Learning for Agentic VQA with Context Under-specification

Zongwan Cao, Bingbing Wen, Lucy Lu Wang

Main category: cs.CL

TL;DR: CoA is an ask-or-answer agent for VQA that decides when clarification is needed, asks focused questions to resolve ambiguity, then answers using the clarification response.

Motivation: Real-world VQA is often context-dependent and under-specified, where images alone don't provide enough information for accurate answers. Direct answering can lead to confident but incorrect predictions when ambiguity exists.

Method: CoA separates the decision to ask or answer, using a clarification reasoning module. It first determines if clarification is needed, asks a single focused question if necessary, then incorporates the response to produce final answer. Uses GRPO-CR reinforcement learning with multiple reward signals for generating effective clarification questions.
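
A minimal sketch of a multi-signal reward for clarification questions: a weighted sum over scorers for well-formedness, focus, non-triviality, and ambiguity resolution. The toy scorers are placeholders for the paper's reward models and heuristics.

```python
def clarification_reward(question, scorers, weights):
    """Weighted sum of per-signal scores for one generated question."""
    return sum(w * score(question) for w, score in zip(weights, scorers))

toy_scorers = [
    lambda q: 1.0 if q.endswith("?") else 0.0,                # well-formed
    lambda q: min(len(q.split()) / 10.0, 1.0),                # focus proxy
    lambda q: 0.0 if q.lower().startswith("is it") else 1.0,  # non-trivial
    lambda q: 1.0,                                            # resolves ambiguity (stub)
]
print(clarification_reward("Which city is the photo taken in?",
                           toy_scorers, (0.25, 0.25, 0.25, 0.25)))
```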

Result: Across three VLLMs and three datasets, CoA improves end-to-end VQA accuracy by an average of +15.3 points (83%) over prompting-based baselines, showing consistent improvements at both module and system levels.

Conclusion: The CoA framework effectively handles ambiguous VQA by intelligently deciding when to ask for clarification, generating focused questions to resolve ambiguity, and significantly improving answer accuracy in context-dependent scenarios.

Abstract: Real-world visual question answering (VQA) is often context-dependent: an image-question pair may be under-specified, such that the correct answer depends on external information that is not observable in the image. In such cases, directly answering can lead to confident but incorrect predictions. We propose CoA (Clarify-or-Answer), an ask-or-answer agent that separately models the decision to ask or answer, and what to ask if needed. CoA first determines whether clarification is necessary; if so, it asks a single focused question and then incorporates the response to produce the final answer. We introduce CONTEXTCLARIFY with a set of ambiguous VQA questions and a non-ambiguous contrast set. We further introduce GRPO-CR (Clarification Reasoning), a reinforcement learning approach that optimizes clarification question generation with multiple reward signals encouraging well-formed, focused, non-trivial questions that resolve ambiguity. Across three VLLMs and three datasets, CoA achieves consistent improvements at both the module and system levels, improving end-to-end VQA accuracy by an average of +15.3 points (83%) over prompting-based baselines.

[18] Revisiting Direct Speech-to-Text Translation with Speech LLMs: Better Scaling than CoT Prompting?

Oriol Pareras, Gerard I. Gállego, Federico Costa, Cristina España-Bonet, Javier Hernando

Main category: cs.CL

TL;DR: Direct prompting may surpass Chain-of-Thought for Speech-to-Text Translation as more S2TT data becomes available, challenging current CoT dominance.

Motivation: Current LLM-based S2TT systems heavily rely on Chain-of-Thought prompting, which leverages abundant ASR and T2TT datasets, but the authors want to systematically compare CoT vs Direct prompting as S2TT data availability increases.

Method: Pseudo-labeled an ASR corpus by translating its transcriptions into six European languages, then trained LLM-based S2TT systems with both CoT and Direct prompting strategies at different data scales.
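
A toy sketch of the two supervision formats built from the pseudo-labeled corpus (ASR transcription machine-translated into a target language); the prompt wording is illustrative, not the paper's template.

```python
def direct_example(translation):
    # Direct: the model maps speech features straight to the translation.
    return {"instruction": "Translate the speech into German.",
            "target": translation}

def cot_example(transcription, translation):
    # CoT: the model first transcribes, then translates, in one response.
    return {"instruction": "Transcribe the speech, then translate it into German.",
            "target": f"Transcription: {transcription}\nTranslation: {translation}"}

print(cot_example("hello world", "hallo Welt")["target"])
```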

Result: Direct prompting improves more consistently as the amount of S2TT data increases, suggesting it may become more effective than CoT as larger S2TT resources are created.

Conclusion: While CoT currently dominates S2TT due to leveraging existing ASR/T2TT data, Direct prompting shows promising scaling properties and may become the preferred approach as dedicated S2TT datasets grow larger.

Abstract: Recent work on Speech-to-Text Translation (S2TT) has focused on LLM-based models, introducing the increasingly adopted Chain-of-Thought (CoT) prompting, where the model is guided to first transcribe the speech and then translate it. CoT typically outperforms direct prompting primarily because it can exploit abundant Automatic Speech Recognition (ASR) and Text-to-Text Translation (T2TT) datasets to explicitly model its steps. In this paper, we systematically compare CoT and Direct prompting under increasing amounts of S2TT data. To this end, we pseudo-label an ASR corpus by translating its transcriptions into six European languages, and train LLM-based S2TT systems with both prompting strategies at different data scales. Our results show that Direct improves more consistently as the amount of data increases, suggesting that it may become a more effective approach as larger S2TT resources are created.

[19] Jacobian Scopes: token-level causal attributions in LLMs

Toni J. B. Liu, Baran Zadeoğlu, Nicolas Boullé, Raphaël Sarfati, Christopher J. Earls

Main category: cs.CL

TL;DR: Jacobian Scopes is a gradient-based method for token-level causal attribution in LLMs that analyzes how input tokens influence predictions through three variants targeting different aspects of model behavior.

Motivation: Understanding which prior tokens most strongly influence LLM predictions is challenging due to complex architectures with many layers and attention heads, creating a need for better interpretability methods.

Method: Proposes Jacobian Scopes - a suite of gradient-based, token-level causal attribution methods that analyze linearized relations between final hidden states and inputs. Three variants: Semantic Scopes (target specific logits), Fisher Scopes (target full predictive distribution), and Temperature Scopes (target model confidence/inverse temperature).
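
A generic gradient-saliency reading of the idea, in the spirit of a Semantic Scope targeting one logit: differentiate the target logit with respect to each input token's embedding and reduce to a per-token norm. This is a hedged sketch, not the authors' released implementation.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("gpt2").eval()
tok = AutoTokenizer.from_pretrained("gpt2")

ids = tok("The capital of France is", return_tensors="pt").input_ids
emb = model.get_input_embeddings()(ids).detach().requires_grad_(True)
logits = model(inputs_embeds=emb).logits[0, -1]     # next-token logits
target = logits[tok.encode(" Paris")[0]]            # logit of interest
target.backward()

scores = emb.grad[0].norm(dim=-1)                   # one score per input token
for t, s in zip(tok.convert_ids_to_tokens(ids[0]), scores.tolist()):
    print(f"{t:>12s}  {s:.4f}")
```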

Result: Through case studies in instruction understanding, translation, and in-context learning, the method uncovers findings like implicit political biases and sheds light on mechanisms underlying in-context time-series forecasting.

Conclusion: Jacobian Scopes provides effective tools for interpreting LLM predictions, offering insights into token-level influences and model behavior, with potential applications for understanding biases and in-context learning mechanisms.

Abstract: Large language models (LLMs) make next-token predictions based on clues present in their context, such as semantic descriptions and in-context examples. Yet, elucidating which prior tokens most strongly influence a given prediction remains challenging due to the proliferation of layers and attention heads in modern architectures. We propose Jacobian Scopes, a suite of gradient-based, token-level causal attribution methods for interpreting LLM predictions. By analyzing the linearized relations of final hidden state with respect to inputs, Jacobian Scopes quantify how input tokens influence a model’s prediction. We introduce three variants - Semantic, Fisher, and Temperature Scopes - which respectively target sensitivity of specific logits, the full predictive distribution, and model confidence (inverse temperature). Through case studies spanning instruction understanding, translation and in-context learning (ICL), we uncover interesting findings, such as when Jacobian Scopes point to implicit political biases. We believe that our proposed methods also shed light on recently debated mechanisms underlying in-context time-series forecasting. Our code and interactive demonstrations are publicly available at https://github.com/AntonioLiu97/JacobianScopes.

[20] Learning Domain Knowledge in Multimodal Large Language Models through Reinforcement Fine-Tuning

Qinglong Cao, Yuntian Chen, Chao Ma, Xiaokang Yang

Main category: cs.CL

TL;DR: Current MLLMs struggle with specialized domains despite textual domain knowledge injection; optimization-level integration via reinforcement fine-tuning with domain-informed constraints yields state-of-the-art performance.

Motivation: MLLMs perform poorly in specialized domains (remote sensing, medical imaging) even when domain knowledge is provided through text, suggesting current models cannot internalize domain-specific priors through language alone.

Method: Proposed reinforcement fine-tuning framework that encodes domain knowledge as constraints and reward signals in the learning objective, integrating knowledge at optimization level rather than input level.
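
One way to picture "knowledge as reward": combine task correctness with a domain-constraint penalty. `violates_constraints` is a placeholder for whatever domain checker a task defines (e.g., anatomical plausibility), and the weighting is illustrative.

```python
def domain_reward(prediction, gold, violates_constraints, alpha=0.5):
    """Task correctness minus a penalty from a domain-informed constraint check."""
    task = 1.0 if prediction == gold else 0.0
    penalty = alpha if violates_constraints(prediction) else 0.0
    return task - penalty

print(domain_reward("benign", "benign", lambda p: False))   # 1.0
```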

Result: Extensive experiments show consistent performance gains across multiple datasets in remote sensing and medical domains, achieving state-of-the-art results on multimodal domain tasks.

Conclusion: Optimization-level domain knowledge integration is necessary; textual domain conditioning has fundamental limitations in current MLLMs for specialized scientific tasks.

Abstract: Multimodal large language models (MLLMs) have shown remarkable capabilities in multimodal perception and understanding tasks. However, their effectiveness in specialized domains, such as remote sensing and medical imaging, remains limited. A natural approach to domain adaptation is to inject domain knowledge through textual instructions, prompts, or auxiliary captions. Surprisingly, we find that such input-level domain knowledge injection yields little to no improvement on scientific multimodal tasks, even when the domain knowledge is explicitly provided. This observation suggests that current MLLMs fail to internalize domain-specific priors through language alone, and that domain knowledge must be integrated at the optimization level. Motivated by this insight, we propose a reinforcement fine-tuning framework that incorporates domain knowledge directly into the learning objective. Instead of treating domain knowledge as descriptive information, we encode it as domain-informed constraints and reward signals, shaping the model’s behavior in the output space. Extensive experiments across multiple datasets in remote sensing and medical domains consistently demonstrate good performance gains, achieving state-of-the-art results on multimodal domain tasks. Our results highlight the necessity of optimization-level domain knowledge integration and reveal a fundamental limitation of textual domain conditioning in current MLLMs.

[21] Exploring the Effects of Alignment on Numerical Bias in Large Language Models

Ayako Sato, Hwichan Kim, Zhousi Chen, Masato Mita, Mamoru Komachi

Main category: cs.CL

TL;DR: LLM evaluators exhibit numerical bias where certain scores appear disproportionately. This bias increases with alignment (instruction/preference tuning). Score range adjustment is the most effective mitigation strategy.

Motivation: LLM-as-a-judge is effective but suffers from numerical bias where certain evaluation scores are generated too frequently, reducing evaluation performance. The paper investigates the cause of this bias.

Method: 1. Hypothesis: numerical bias arises from alignment (instruction tuning and preference tuning). 2. Compare outputs from pre- and post-alignment LLMs to test hypothesis. 3. Explore mitigation strategies: temperature scaling, distribution calibration, and score range adjustment for post-alignment LLMs.
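
A minimal sketch of the score-range-adjustment idea, assuming simple min-max rescaling of the judge's concentrated outputs onto the full target range; the paper's exact recipe may differ.

```python
import numpy as np

def adjust_range(raw_scores, target_lo=1.0, target_hi=10.0):
    """Rescale raw judge scores (e.g. clustered in 7-9) onto the full range."""
    lo, hi = np.min(raw_scores), np.max(raw_scores)
    if hi == lo:                       # degenerate: judge gave one score only
        return np.full_like(raw_scores, (target_lo + target_hi) / 2.0)
    return target_lo + (raw_scores - lo) * (target_hi - target_lo) / (hi - lo)

print(adjust_range(np.array([7.0, 8.0, 8.0, 9.0])))   # spread over 1..10
```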

Result: 1. Alignment indeed increases numerical bias in LLM evaluators. 2. Among mitigation strategies, score range adjustment is most effective in reducing bias and improving performance, though still heuristic.

Conclusion: Numerical bias in LLM evaluators is caused by alignment processes. While score range adjustment helps mitigate this bias, more work is needed on optimal score range selection and more robust mitigation strategies.

Abstract: ``LLM-as-a-judge,’’ which utilizes large language models (LLMs) as evaluators, has proven effective in many evaluation tasks. However, evaluator LLMs exhibit numerical bias, a phenomenon where certain evaluation scores are generated disproportionately often, leading reduced evaluation performance. This study investigates the cause of this bias. Given that most evaluator LLMs are aligned through instruction tuning and preference tuning, and that prior research suggests alignment reduces output diversity, we hypothesize that numerical bias arises from alignment. To test this, we compare outputs from pre- and post-alignment LLMs, and observe that alignment indeed increases numerical bias. We also explore mitigation strategies for post-alignment LLMs, including temperature scaling, distribution calibration, and score range adjustment. Among these, score range adjustment is most effective in reducing bias and improving performance, though still heuristic. Our findings highlight the need for further work on optimal score range selection and more robust mitigation strategies.

[22] Mixing Expert Knowledge: Bring Human Thoughts Back To the Game of Go

Yichuan Ma, Linyang Li, Yongkang Chen, Peiji Li, Jiasheng Ye, Qipeng Guo, Dahua Lin, Kai Chen

Main category: cs.CL

TL;DR: LoGos is an LLM that bridges general reasoning with Go expertise, achieving professional-level Go gameplay through mixed fine-tuning and reinforcement learning.

Motivation: LLMs excel at general reasoning but struggle in specialized domains like Go, where they can't reach beginner-level proficiency despite AI systems like AlphaGo showing high performance ceilings. This gap limits LLM applications in domain-specific tasks.

Method: Mixed fine-tuning with structured Go expertise and general Chain-of-Thought reasoning data, followed by reinforcement learning to integrate expert knowledge with general reasoning capabilities.

Result: LoGos achieves performance comparable to human professional players, substantially surpassing all existing LLMs, while maintaining outstanding general reasoning abilities and conducting Go gameplay in natural language.

Conclusion: The work successfully bridges LLM general reasoning with domain expertise, provides insights for applying LLMs to specialized domains, and releases the first large-scale Go dataset, evaluation benchmark, and professional-level LLM for Go.

Abstract: Large language models (LLMs) have demonstrated exceptional performance in reasoning tasks such as mathematics and coding, matching or surpassing human capabilities. However, these impressive reasoning abilities face significant challenges in specialized domains. Taking Go as an example, although AlphaGo has established the high performance ceiling of AI systems in Go, mainstream LLMs still struggle to reach even beginner-level proficiency, let alone perform natural language reasoning. This performance gap between general-purpose LLMs and domain experts is significantly limiting the application of LLMs on a wider range of domain-specific tasks. In this work, we aim to bridge the divide between LLMs’ general reasoning capabilities and expert knowledge in domain-specific tasks. We perform mixed fine-tuning with structured Go expertise and general long Chain-of-Thought (CoT) reasoning data as a cold start, followed by reinforcement learning to integrate expert knowledge in Go with general reasoning capabilities. Through this methodology, we present LoGos, a powerful LLM that not only maintains outstanding general reasoning abilities, but also conducts Go gameplay in natural language, demonstrating effective strategic reasoning and accurate next-move prediction. LoGos achieves performance comparable to human professional players, substantially surpassing all existing LLMs. Through this work, we aim to contribute insights on applying general LLM reasoning capabilities to specialized domains. We will release the first large-scale Go dataset for LLM training, the first LLM Go evaluation benchmark, and the first general LLM that reaches human professional-level performance in Go at: https://github.com/Entarochuan/LoGos.

[23] Graph-Anchored Knowledge Indexing for Retrieval-Augmented Generation

Zhenghao Liu, Mingyan Wu, Xinze Li, Yukun Yan, Shuo Wang, Cheng Yang, Minghe Yu, Zheni Zeng, Maosong Sun

Main category: cs.CL

TL;DR: GraphAnchor introduces a dynamic graph-based indexing approach for RAG systems that evolves during iterative retrieval to better integrate scattered evidence across documents.

DetailsMotivation: Existing RAG systems struggle to effectively integrate and interpret key evidence scattered across noisy documents, which is critical for mitigating hallucinations in LLMs.

Method: GraphAnchor reconceptualizes graph structures from static representations into active, evolving knowledge indices that incrementally update during iterative retrieval to anchor salient entities and relations, guiding LLMs in evaluating knowledge sufficiency and formulating subqueries.
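
To make the evolving-index idea concrete, here is a minimal sketch under assumed interfaces (`retrieve`, `extract_triples`, and `llm` are stand-ins, and the SUFFICIENT reply convention is invented for illustration); the paper's actual prompting and graph construction may differ.

```python
class EvolvingGraphIndex:
    """Minimal evolving graph index: a growing set of entity-relation
    triples anchored from each retrieval round."""
    def __init__(self):
        self.edges = set()               # (head, relation, tail) triples

    def update(self, passages, extract_triples):
        for passage in passages:
            self.edges |= set(extract_triples(passage))

    def serialize(self):
        return "\n".join(f"{h} --{r}--> {t}" for h, r, t in sorted(self.edges))

def graph_anchored_rag(question, retrieve, extract_triples, llm, max_rounds=3):
    graph, docs, query = EvolvingGraphIndex(), [], question
    for _ in range(max_rounds):
        hits = retrieve(query)
        docs += hits
        graph.update(hits, extract_triples)            # anchor new evidence
        decision = llm(f"Graph:\n{graph.serialize()}\nQuestion: {question}\n"
                       "Reply SUFFICIENT, or give the next subquery.")
        if decision.strip() == "SUFFICIENT":
            break
        query = decision                               # graph-guided subquery
    return llm(f"Documents: {docs}\nGraph:\n{graph.serialize()}\n"
               f"Answer: {question}")
```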

Result: Experiments on four multi-hop question answering benchmarks demonstrate GraphAnchor’s effectiveness and show it modulates LLM attention to better associate key information distributed across retrieved documents.

Conclusion: GraphAnchor provides a novel graph-anchored knowledge indexing approach that improves RAG systems by creating structured, evolving indices that enhance evidence integration and reduce hallucinations.

Abstract: Retrieval-Augmented Generation (RAG) has emerged as a dominant paradigm for mitigating hallucinations in Large Language Models (LLMs) by incorporating external knowledge. Nevertheless, effectively integrating and interpreting key evidence scattered across noisy documents remains a critical challenge for existing RAG systems. In this paper, we propose GraphAnchor, a novel Graph-Anchored Knowledge Indexing approach that reconceptualizes graph structures from static knowledge representations into active, evolving knowledge indices. GraphAnchor incrementally updates a graph during iterative retrieval to anchor salient entities and relations, yielding a structured index that guides the LLM in evaluating knowledge sufficiency and formulating subsequent subqueries. The final answer is generated by jointly leveraging all retrieved documents and the final evolved graph. Experiments on four multi-hop question answering benchmarks demonstrate the effectiveness of GraphAnchor, and reveal that GraphAnchor modulates the LLM’s attention to more effectively associate key information distributed in retrieved documents. All code and data are available at https://github.com/NEUIR/GraphAnchor.

[24] Persona Jailbreaking in Large Language Models

Jivnesh Sandhan, Fei Cheng, Tushar Sandhan, Yugo Murawaki

Main category: cs.CL

TL;DR: PHISH framework exposes LLM vulnerability: adversarial conversational history can hijack personas via implicit steering, enabling predictable persona manipulation while preserving utility.

DetailsMotivation: LLMs deployed in sensitive domains require stable personas, but existing research overlooks how adversarial conversational history alone can reshape induced personas. Black-box persona manipulation remains unexplored, raising robustness concerns for real-world interactions.

Method: Proposes PHISH (Persona Hijacking via Implicit Steering in History) - a framework that embeds semantically loaded cues into user queries to gradually induce reverse personas. Uses black-box, inference-only setting with defined metrics to quantify attack success.

Result: Across 3 benchmarks and 8 LLMs, PHISH predictably shifts personas, triggers collateral trait changes, with stronger effects in multi-turn settings. Successfully manipulates personas in high-risk domains (mental health, tutoring, customer support) with minimal reduction in reasoning performance.

Conclusion: Exposes new vulnerabilities in LLM personas, showing current guardrails are brittle under sustained attack. Highlights need for context-resilient persona design in LLMs for reliable deployment in sensitive domains.

Abstract: Large Language Models (LLMs) are increasingly deployed in domains such as education, mental health and customer support, where stable and consistent personas are critical for reliability. Yet, existing studies focus on narrative or role-playing tasks and overlook how adversarial conversational history alone can reshape induced personas. Black-box persona manipulation remains unexplored, raising concerns for robustness in realistic interactions. In response, we introduce the task of persona editing, which adversarially steers LLM traits through user-side inputs under a black-box, inference-only setting. To this end, we propose PHISH (Persona Hijacking via Implicit Steering in History), the first framework to expose a new vulnerability in LLM safety that embeds semantically loaded cues into user queries to gradually induce reverse personas. We also define a metric to quantify attack success. Across 3 benchmarks and 8 LLMs, PHISH predictably shifts personas, triggers collateral changes in correlated traits, and exhibits stronger effects in multi-turn settings. In high-risk domains (mental health, tutoring, and customer support), PHISH reliably manipulates personas, validated by both human and LLM-as-Judge evaluations. Importantly, PHISH causes only a small reduction in reasoning benchmark performance, leaving overall utility largely intact while still enabling significant persona manipulation. While current guardrails offer partial protection, they remain brittle under sustained attack. Our findings expose new vulnerabilities in personas and highlight the need for context-resilient personas in LLMs. Our codebase and dataset are available at: https://github.com/Jivnesh/PHISH

[25] DeepEra: A Deep Evidence Reranking Agent for Scientific Retrieval-Augmented Generated Question Answering

Haotian Chen, Qingqing Long, Siyu Pu, Xiao Luo, Wei Ju, Meng Xiao, Yuanchun Zhou, Jianghua Zhao, Xuezhi Wang

Main category: cs.CL

TL;DR: DeepEra: A deep evidence reranking agent that addresses semantically similar but logically irrelevant passages in scientific QA RAG systems, using step-by-step reasoning for better passage evaluation.

DetailsMotivation: Existing retrieval and reranking methods in scientific QA RAG systems are vulnerable to passages that are semantically similar but logically irrelevant, which reduces factual reliability and amplifies hallucinations in LLM responses.

Method: Proposed Deep Evidence Reranking Agent (DeepEra) that integrates step-by-step reasoning to enable more precise evaluation of candidate passages beyond surface-level semantics. Also constructed SciRAG-SSLI dataset with 300K SciQA instances across 10 subjects, drawn from a 10M-scale scientific corpus, combining naturally retrieved contexts with systematically generated distractors.
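
A minimal sketch of what reasoning-based reranking could look like, assuming a generic `llm` callable; the prompt wording and the 0-10 scoring convention are illustrative assumptions, not the paper's actual agent design.

```python
def deep_evidence_rerank(question, passages, llm, top_k=5):
    """Reasoning-based reranking sketch: ask the LLM to reason step by
    step about whether a passage logically supports an answer, rather
    than scoring surface similarity alone."""
    def logical_score(passage):
        out = llm(f"Question: {question}\nPassage: {passage}\n"
                  "Think step by step: does this passage logically help "
                  "answer the question, or is it merely on-topic? "
                  "End your reply with 'SCORE: <0-10>'.")
        return int(out.rsplit("SCORE:", 1)[-1].strip())
    return sorted(passages, key=logical_score, reverse=True)[:top_k]
```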

Result: Comprehensive evaluations confirm superior retrieval performance compared to leading rerankers. First comprehensive study to empirically validate SSLI (Semantically Similar but Logically Irrelevant) issues in two-stage RAG frameworks.

Conclusion: DeepEra effectively addresses the SSLI challenge in scientific QA RAG systems through step-by-step reasoning, improving factual reliability and reducing hallucinations. The SciRAG-SSLI dataset enables systematic evaluation of logical robustness and factual grounding in scientific RAG systems.

Abstract: With the rapid growth of scientific literature, scientific question answering (SciQA) has become increasingly critical for exploring and utilizing scientific knowledge. Retrieval-Augmented Generation (RAG) enhances LLMs by incorporating knowledge from external sources, thereby providing credible evidence for scientific question answering. But existing retrieval and reranking methods remain vulnerable to passages that are semantically similar but logically irrelevant, often reducing factual reliability and amplifying hallucinations. To address this challenge, we propose a Deep Evidence Reranking Agent (DeepEra) that integrates step-by-step reasoning, enabling more precise evaluation of candidate passages beyond surface-level semantics. To support systematic evaluation, we construct SciRAG-SSLI (Scientific RAG - Semantically Similar but Logically Irrelevant), a large-scale dataset comprising about 300K SciQA instances across 10 subjects, constructed from a 10M-scale scientific corpus. The dataset combines naturally retrieved contexts with systematically generated distractors to test logical robustness and factual grounding. Comprehensive evaluations confirm that our approach achieves superior retrieval performance compared to leading rerankers. To our knowledge, this work is the first to comprehensively study and empirically validate non-negligible SSLI issues in two-stage RAG frameworks.

[26] TL-GRPO: Turn-Level RL for Reasoning-Guided Iterative Optimization

Peiji Li, Linyang Li, Handa Sun, Wenjin Mai, Yongkang Chen, Xiaozhe Li, Yue Shen, Yichuan Ma, Yiliu Sun, Jiaxi Cao, Zhishu He, Bo Wang, Xiaoqing Zheng, Zhaori Bi, Xipeng Qiu, Qipeng Guo, Kai Chen, Dahua Lin

Main category: cs.CL

TL;DR: TL-GRPO is a turn-level RL algorithm that outperforms standard GRPO and Bayesian optimization for iterative optimization tasks like analog circuit sizing.

DetailsMotivation: Existing RL methods (like GRPO) can't handle iterative optimization tasks where agents interact with the same environment state across turns and where trajectory value depends on best turn-level reward rather than cumulative returns. Black-box optimization methods discard prior knowledge and reasoning capabilities.

Method: Proposes Turn-Level GRPO (TL-GRPO), a lightweight RL algorithm that performs turn-level group sampling for fine-grained optimization in iterative reasoning tasks.
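
The turn-level twist can be illustrated with a small numeric sketch: group-normalized advantages computed per turn rather than per trajectory. This is a toy illustration of the advantage computation only, not the paper's training code.

```python
import numpy as np

def turn_level_advantages(turn_groups):
    """GRPO-style group-normalized advantages, computed per turn.
    turn_groups: one list per turn holding the rewards of the G
    candidate rollouts sampled at that turn (turn-level group sampling)."""
    advantages = []
    for rewards in turn_groups:
        r = np.asarray(rewards, dtype=float)
        advantages.append((r - r.mean()) / (r.std() + 1e-8))
    return advantages

# Toy run: 2 turns, group size 4. In iterative optimization, the
# trajectory's value is its best turn-level reward, not a cumulative sum.
groups = [[0.2, 0.5, 0.1, 0.4], [0.6, 0.3, 0.7, 0.5]]
print(turn_level_advantages(groups))
print("trajectory value:", max(max(g) for g in groups))
```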

Result: TL-GRPO outperforms standard GRPO and Bayesian optimization methods across various specifications in analog circuit sizing tasks. A 30B model trained with TL-GRPO achieves state-of-the-art performance under the same simulation budget.

Conclusion: TL-GRPO effectively addresses the gap in fine-grained, turn-level optimization for iterative reasoning tasks while preserving prior knowledge and reasoning capabilities, demonstrating strong generalization and practical utility.

Abstract: Large language models have demonstrated strong reasoning capabilities in complex tasks through tool integration, which is typically framed as a Markov Decision Process and optimized with trajectory-level RL algorithms such as GRPO. However, a common class of reasoning tasks, iterative optimization, presents distinct challenges: the agent interacts with the same underlying environment state across turns, and the value of a trajectory is determined by the best turn-level reward rather than cumulative returns. Existing GRPO-based methods cannot perform fine-grained, turn-level optimization in such settings, while black-box optimization methods discard prior knowledge and reasoning capabilities. To address this gap, we propose Turn-Level GRPO (TL-GRPO), a lightweight RL algorithm that performs turn-level group sampling for fine-grained optimization. We evaluate TL-GRPO on analog circuit sizing (ACS), a challenging scientific optimization task requiring multiple simulations and domain expertise. Results show that TL-GRPO outperforms standard GRPO and Bayesian optimization methods across various specifications. Furthermore, our 30B model trained with TL-GRPO achieves state-of-the-art performance on ACS tasks under the same simulation budget, demonstrating both strong generalization and practical utility.

[27] Timely Machine: Awareness of Time Makes Test-Time Scaling Agentic

Yichuan Ma, Linyang Li, Yongkang Chen, Peiji Li, Xiaozhe Li, Qipeng Guo, Dahua Lin, Kai Chen

Main category: cs.CL

TL;DR: The paper introduces Timely Machine, redefining test-time scaling for LLMs in agentic scenarios where tool latency breaks traditional generation-length metrics. It proposes Timely-Eval benchmark and Timely-RL training to improve time budget awareness.

DetailsMotivation: Traditional test-time scaling based on generation length breaks down in agentic scenarios with frequent tool calls, where tool latency decouples inference time from generation length. There's a need to redefine test-time as wall-clock time and enable models to dynamically adjust strategies based on time budgets.

Method: 1) Redefine test-time as wall-clock time; 2) Introduce Timely-Eval benchmark spanning high/low-frequency tool calls and time-constrained reasoning; 3) Propose Timely-RL: cold-start supervised fine-tuning followed by reinforcement learning to enhance temporal planning and time budget awareness.
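
A minimal sketch of a wall-clock-budgeted agent loop, assuming a text-in/text-out `llm` and a dict of tool callables; the TOOL:/ANSWER: protocol is an invented convention for this sketch, not the paper's interface.

```python
import time

def timely_agent(task, llm, tools, budget_s=30.0):
    """Wall-clock-aware agent loop: the remaining budget is fed back
    into the prompt so the model can adapt its strategy (more tool
    calls when latency is low, fewer when it is high)."""
    start, transcript = time.monotonic(), []
    while True:
        remaining = budget_s - (time.monotonic() - start)
        if remaining <= 0:
            break
        step = llm(f"Task: {task}\nTime left: {remaining:.1f}s\n"
                   f"History: {transcript}\n"
                   "Reply with TOOL:<name>:<args> or ANSWER:<text>.")
        if step.startswith("ANSWER:"):
            return step[len("ANSWER:"):]
        _, name, args = step.split(":", 2)
        transcript.append((name, tools[name](args)))    # tool latency counts here
    return llm(f"Task: {task}\nHistory: {transcript}\nAnswer now.")
```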

Result: Smaller models excel with fast feedback through more interactions, while larger models dominate high-latency settings via superior interaction quality. Existing models fail to adapt reasoning to time budgets. Timely-RL improves time budget awareness and consistently boosts performance across Timely-Eval.

Conclusion: The work offers a new perspective on test-time scaling for the agentic era, addressing the limitations of traditional metrics in tool-calling scenarios and providing methods to improve temporal planning in LLMs.

Abstract: As large language models (LLMs) increasingly tackle complex reasoning tasks, test-time scaling has become critical for enhancing capabilities. However, in agentic scenarios with frequent tool calls, the traditional generation-length-based definition breaks down: tool latency decouples inference time from generation length. We propose Timely Machine, redefining test-time as wall-clock time, where models dynamically adjust strategies based on time budgets. We introduce Timely-Eval, a benchmark spanning high-frequency tool calls, low-frequency tool calls, and time-constrained reasoning. By varying tool latency, we find smaller models excel with fast feedback through more interactions, while larger models dominate high-latency settings via superior interaction quality. Moreover, existing models fail to adapt reasoning to time budgets. We propose Timely-RL to address this gap. After cold-start supervised fine-tuning, we use reinforcement learning to enhance temporal planning. Timely-RL improves time budget awareness and consistently boosts performance across Timely-Eval. We hope our work offers a new perspective on test-time scaling for the agentic era.

[28] MRAG: Benchmarking Retrieval-Augmented Generation for Bio-medicine

Wei Zhu

Main category: cs.CL

TL;DR: The paper introduces MRAG, a comprehensive medical RAG benchmark with English/Chinese tasks, Wikipedia/Pubmed corpus, and MRAG-Toolkit for systematic RAG component evaluation.

DetailsMotivation: There's a lack of comprehensive evaluation benchmarks for Retrieval-Augmented Generation (RAG) in the medical domain, despite its rapid adoption in scientific and clinical QA systems.

Method: Developed Medical Retrieval-Augmented Generation (MRAG) benchmark covering various tasks in English and Chinese, built a corpus using Wikipedia and PubMed, and created MRAG-Toolkit for systematic exploration of different RAG components.

Result: RAG enhances LLM reliability across MRAG tasks; performance influenced by retrieval approaches, model sizes, and prompting strategies; RAG improves usefulness and reasoning quality but may slightly reduce readability for long-form questions.

Conclusion: The MRAG benchmark and toolkit will be released under a CC BY 4.0 license to facilitate medical RAG applications in both academia and industry, addressing the gap in comprehensive medical RAG evaluation.

Abstract: While Retrieval-Augmented Generation (RAG) has been swiftly adopted in scientific and clinical QA systems, a comprehensive evaluation benchmark in the medical domain is lacking. To address this gap, we introduce the Medical Retrieval-Augmented Generation (MRAG) benchmark, covering various tasks in English and Chinese, and building a corpus with Wikipedia and PubMed. Additionally, we develop the MRAG-Toolkit, facilitating systematic exploration of different RAG components. Our experiments reveal that: (a) RAG enhances LLM reliability across MRAG tasks. (b) The performance of RAG systems is influenced by retrieval approaches, model sizes, and prompting strategies. (c) While RAG improves usefulness and reasoning quality, LLM responses may become slightly less readable for long-form questions. We will release the MRAG-Bench dataset and toolkit under a CC BY 4.0 license upon acceptance, to facilitate applications from both academia and industry.

[29] LOGICAL-COMMONSENSEQA: A Benchmark for Logical Commonsense Reasoning

Obed Junias, Maria Leonor Pacheco

Main category: cs.CL

TL;DR: LOGICAL-COMMONSENSEQA benchmark reframes commonsense reasoning as logical composition over statement pairs using AND, OR, NEITHER/NOR operators, revealing models’ limitations in compositional reasoning.

DetailsMotivation: Existing benchmarks use single-label evaluation which obscures whether statements are jointly plausible, mutually exclusive, or jointly implausible. Commonsense reasoning often involves evaluating multiple plausible interpretations rather than selecting single atomic answers.

Method: Created LOGICAL-COMMONSENSEQA benchmark with logical composition over pairs of atomic statements using plausibility-level operators (AND, OR, NEITHER/NOR). Evaluated instruction-tuned, reasoning-specialized, and fine-tuned models under zero-shot, few-shot, and chain-of-thought prompting.
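
The operator semantics can be made concrete with a tiny sketch; the boolean reading below is an illustrative simplification of the benchmark's plausibility-level operators.

```python
def compose_plausibility(plausible_a: bool, plausible_b: bool) -> dict:
    """Illustrative reading of the plausibility-level operators over a
    statement pair (a simplification of the benchmark's setup)."""
    return {
        "AND": plausible_a and plausible_b,                   # jointly plausible
        "OR": plausible_a or plausible_b,                     # at least one plausible
        "NEITHER/NOR": not plausible_a and not plausible_b,   # jointly implausible
    }

# "Fish can swim" (plausible) paired with "Fish can read" (implausible):
print(compose_plausibility(True, False))
# -> {'AND': False, 'OR': True, 'NEITHER/NOR': False}
```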

Result: Models perform reasonably on conjunctive (AND) and moderately on disjunctive (OR) reasoning, but performance degrades sharply on negation-based (NEITHER/NOR) questions, exposing fundamental reasoning limitations.

Conclusion: LOGICAL-COMMONSENSEQA provides a controlled framework for advancing compositional commonsense reasoning and reveals current models’ weaknesses in handling logical composition, especially negation operations.

Abstract: Commonsense reasoning often involves evaluating multiple plausible interpretations rather than selecting a single atomic answer, yet most benchmarks rely on single-label evaluation, obscuring whether statements are jointly plausible, mutually exclusive, or jointly implausible. We introduce LOGICAL-COMMONSENSEQA, a benchmark that re-frames commonsense reasoning as logical composition over pairs of atomic statements using plausibility-level operators (AND, OR, NEITHER/NOR). Evaluating instruction-tuned, reasoning-specialized, and fine-tuned models under zero-shot, few-shot, and chain-of-thought prompting, we find that while models perform reasonably on conjunctive and moderately on disjunctive reasoning, performance degrades sharply on negation-based questions. LOGICAL-COMMONSENSEQA exposes fundamental reasoning limitations and provides a controlled framework for advancing compositional commonsense reasoning.

[30] Is Length Really A Liability? An Evaluation of Multi-turn LLM Conversations using BoolQ

Karl Neergaard, Le Qiu, Emmanuele Chersoni

Main category: cs.CL

TL;DR: Conversational testing reveals LLM vulnerabilities invisible in single-prompt evaluations, showing model-specific weaknesses that depend on conversation length and scaffolding.

DetailsMotivation: Current LLM benchmarking relies on single-prompt evaluations, but real-world harm occurs in conversational dynamics. The paper aims to examine whether conversation length affects response veracity, as static evaluations may miss deployment-relevant vulnerabilities.

Method: Evaluated LLM performance on BoolQ dataset under varying conversation length and scaffolding conditions. Tested three distinct LLMs in multi-turn conversational settings to compare with single-turn testing.

Result: Found model-specific vulnerabilities that are invisible under single-turn testing. Observed length-dependent and scaffold-specific effects, demonstrating that deployment-relevant weaknesses only emerge in conversational settings.

Conclusion: Static evaluations have fundamental limitations. Multi-turn conversational testing is essential for identifying real-world vulnerabilities in LLMs that single-prompt benchmarks cannot detect.

Abstract: Single-prompt evaluations dominate current LLM benchmarking, yet they fail to capture the conversational dynamics where real-world harm occurs. In this study, we examined whether conversation length affects response veracity by evaluating LLM performance on the BoolQ dataset under varying length and scaffolding conditions. Our results across three distinct LLMs revealed model-specific vulnerabilities that are invisible under single-turn testing. The length-dependent and scaffold-specific effects we observed demonstrate a fundamental limitation of static evaluations, as deployment-relevant vulnerabilities could only be spotted in a multi-turn conversational setting.

[31] SearchLLM: Detecting LLM Paraphrased Text by Measuring the Similarity with Regeneration of the Candidate Source via Search Engine

Hoang-Quoc Nguyen-Son, Minh-Son Dao, Koji Zettsu

Main category: cs.CL

TL;DR: SearchLLM is a novel approach that uses search engines to find potential original sources of text, then analyzes similarities between input and regenerated versions to detect LLM-paraphrased content, enhancing existing detectors’ performance against paraphrasing attacks.

DetailsMotivation: As LLMs become commonly used for text enhancement through paraphrasing, they can distort original meaning. Traditional detection methods fail with human-like LLM-generated text, especially when paraphrased to closely mimic original content.

Method: SearchLLM leverages search engine capabilities to locate potential original text sources, then analyzes similarities between input and regenerated versions of candidate sources to distinguish LLM-paraphrased content. It acts as a proxy layer for seamless integration with existing detectors.
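
A minimal sketch of the idea under assumed interfaces (`search`, `paraphrase`, and `similarity` are stand-ins); the paper's actual scoring and integration with downstream detectors may differ.

```python
def searchllm_score(text, search, paraphrase, similarity):
    """SearchLLM-style check: retrieve candidate source texts,
    regenerate the best candidate with an LLM, and compare similarities.
    A positive score means the input sits closer to an LLM regeneration
    of its source than to the source itself, suggesting LLM paraphrasing."""
    candidates = search(text, top_k=5)          # potential original sources
    best = max(candidates, key=lambda c: similarity(text, c))
    regenerated = paraphrase(best)              # LLM regeneration of the candidate
    return similarity(text, regenerated) - similarity(text, best)
```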

Result: Experimental results across various LLMs show that SearchLLM consistently enhances accuracy of recent detectors in detecting LLM-paraphrased text that closely mimics original content, and helps prevent paraphrasing attacks.

Conclusion: SearchLLM effectively addresses the challenge of detecting LLM-paraphrased text by combining search engine capabilities with similarity analysis, providing a practical solution that enhances existing detection methods against sophisticated paraphrasing.

Abstract: With the advent of large language models (LLMs), it has become common practice for users to draft text and utilize LLMs to enhance its quality through paraphrasing. However, this process can sometimes result in the loss or distortion of the original intended meaning. Due to the human-like quality of LLM-generated text, traditional detection methods often fail, particularly when text is paraphrased to closely mimic original content. In response to these challenges, we propose a novel approach named SearchLLM, designed to identify LLM-paraphrased text by leveraging search engine capabilities to locate potential original text sources. By analyzing similarities between the input and regenerated versions of candidate sources, SearchLLM effectively distinguishes LLM-paraphrased content. SearchLLM is designed as a proxy layer, allowing seamless integration with existing detectors to enhance their performance. Experimental results across various LLMs demonstrate that SearchLLM consistently enhances the accuracy of recent detectors in detecting LLM-paraphrased text that closely mimics original content. Furthermore, SearchLLM also helps the detectors prevent paraphrasing attacks.

[32] Curate-Train-Refine: A Closed-Loop Agentic Framework for Zero Shot Classification

Gaurav Maheshwari, Kevin El Haddad

Main category: cs.CL

TL;DR: Training lightweight text classifiers using LLM-generated supervision through iterative agentic loops outperforms standard zero/few-shot baselines.

DetailsMotivation: LLMs and high-capacity encoders have inference cost and latency issues that limit practical deployment, despite their advances in zero/few-shot classification.

Method: Proposes training lightweight classifiers using dynamically generated supervision from LLMs through an iterative agentic loop where the LLM curates training data, analyzes model successes/failures, and synthesizes targeted examples to address errors.
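
The closed loop could be sketched as follows; `llm.curate` and `llm.synthesize` are hypothetical interfaces standing in for the paper's LLM-driven data curation and targeted example synthesis.

```python
def curate_train_refine(llm, train_classifier, task, labels, rounds=3):
    """Closed-loop sketch: the LLM curates labeled (text, label) pairs,
    a lightweight classifier is trained, and the LLM synthesizes new
    examples aimed at the classifier's observed errors."""
    data = llm.curate(task, labels, n=200)                   # initial curation
    clf = train_classifier(data)
    for _ in range(rounds):
        eval_set = llm.curate(task, labels, n=100)
        errors = [(x, y) for x, y in eval_set if clf.predict(x) != y]
        data += llm.synthesize(task, labels, errors, n=100)  # target failures
        clf = train_classifier(data)                         # retrain small model
    return clf
```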

Result: Across four widely used benchmarks, the approach consistently outperforms standard zero and few-shot baselines.

Conclusion: LLMs can serve effectively as data curators, enabling accurate and efficient classification without the operational cost of large-model deployment.

Abstract: Large language models (LLMs) and high-capacity encoders have advanced zero and few-shot classification, but their inference cost and latency limit practical deployment. We propose training lightweight text classifiers using dynamically generated supervision from an LLM. Our method employs an iterative, agentic loop in which the LLM curates training data, analyzes model successes and failures, and synthesizes targeted examples to address observed errors. This closed-loop generation and evaluation process progressively improves data quality and adapts it to the downstream classifier and task. Across four widely used benchmarks, our approach consistently outperforms standard zero and few-shot baselines. These results indicate that LLMs can serve effectively as data curators, enabling accurate and efficient classification without the operational cost of large-model deployment.

[33] Retrieve-Refine-Calibrate: A Framework for Complex Claim Fact-Checking

Mingwei Sun, Qianlong Wang, Ruifeng Xu

Main category: cs.CL

TL;DR: RRC framework improves fact-checking by retrieving, refining, and calibrating evidence to reduce noise from irrelevant entities and evidence.

DetailsMotivation: Existing decomposition-based fact-checking methods introduce noise through irrelevant entities or evidence, degrading verification accuracy.

Method: Retrieve-Refine-Calibrate (RRC) framework: 1) Identify entities and retrieve relevant evidence, 2) Refine evidence to reduce irrelevant information, 3) Calibrate by re-evaluating low-confidence predictions.
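
A compact sketch of the three stages, with hypothetical `llm` helpers (`extract_entities`, `filter_relevant`, `verify`) standing in for the paper's prompt-based components; the confidence threshold is an illustrative assumption.

```python
def rrc_verify(claim, llm, retrieve, conf_threshold=0.7):
    """Retrieve-Refine-Calibrate sketch: retrieve evidence for the
    claim's entities, prune irrelevant passages, and re-evaluate
    low-confidence verdicts."""
    entities = llm.extract_entities(claim)                        # Retrieve
    evidence = [doc for e in entities for doc in retrieve(e)]
    refined = llm.filter_relevant(claim, evidence)                # Refine
    label, confidence = llm.verify(claim, refined)
    if confidence < conf_threshold:                               # Calibrate
        label, confidence = llm.verify(claim, refined, rethink=True)
    return label
```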

Result: Superior performance on HOVER and FEVEROUS-S datasets compared to competitive baselines.

Conclusion: The RRC framework effectively addresses noise issues in decomposition-based fact-checking and improves verification accuracy through systematic evidence processing.

Abstract: Fact-checking aims to verify the truthfulness of a claim based on the retrieved evidence. Existing methods typically follow a decomposition paradigm, in which a claim is broken down into sub-claims that are individually verified. However, the decomposition paradigm may introduce noise to the verification process due to irrelevant entities or evidence, ultimately degrading verification accuracy. To address this problem, we propose a Retrieve-Refine-Calibrate (RRC) framework based on large language models (LLMs). Specifically, the framework first identifies the entities mentioned in the claim and retrieves evidence relevant to them. Then, it refines the retrieved evidence based on the claim to reduce irrelevant information. Finally, it calibrates the verification process by re-evaluating low-confidence predictions. Experiments on two popular fact-checking datasets (HOVER and FEVEROUS-S) demonstrate that our framework achieves superior performance compared with competitive baselines.

[34] Attention-MoA: Enhancing Mixture-of-Agents via Inter-Agent Semantic Attention and Deep Residual Synthesis

Jianyu Wen, Yang Wei, Xiongxi Yu, Changxuan Xiao, Ke Zeng

Main category: cs.CL

TL;DR: Attention-MoA introduces Inter-agent Semantic Attention and Inter-layer Residual Module with Adaptive Early Stopping to enhance collaboration in Mixture-of-Agents frameworks, enabling small open-source models to outperform large proprietary models.

DetailsMotivation: Current MoA frameworks lack deep semantic interaction between agents, limiting their ability to correct hallucinations and refine logic. There's a need for more effective collaboration mechanisms that enable smaller models to collectively achieve performance surpassing large proprietary models.

Method: Proposes Attention-MoA with two key components: 1) Inter-agent Semantic Attention for deep semantic interaction between agents, and 2) Inter-layer Residual Module with Adaptive Early Stopping Mechanism to mitigate information degradation and improve computational efficiency.
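
As a rough picture of attention over agent outputs, the sketch below softmax-weights agent response embeddings against a query embedding; this is a generic scaled-dot-product illustration, not the paper's exact Inter-agent Semantic Attention module.

```python
import numpy as np

def inter_agent_attention(response_embs, query_emb):
    """Weight each agent's response embedding by similarity to the
    query, then fuse them into one summary vector."""
    sims = response_embs @ query_emb / np.sqrt(query_emb.shape[0])
    weights = np.exp(sims - sims.max())       # numerically stable softmax
    weights /= weights.sum()
    return weights, weights @ response_embs   # per-agent weights, fused vector

embs, query = np.random.randn(4, 64), np.random.randn(64)
agent_weights, fused = inter_agent_attention(embs, query)
```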

Result: Achieves 91.15% Length-Controlled Win Rate on AlpacaEval 2.0, dominates 10 out of 12 capabilities on FLASK, and enables small open-source models to outperform Claude-4.5-Sonnet and GPT-4.1 with MT-Bench score of 8.83 and AlpacaEval 2.0 LC Win Rate of 77.36%.

Conclusion: Attention-MoA successfully enhances agent collaboration through semantic attention mechanisms, demonstrating that well-designed ensemble approaches can enable smaller models to surpass much larger proprietary models in performance.

Abstract: As the development of Large Language Models (LLMs) shifts from parameter scaling to inference-time collaboration, the Mixture-of-Agents (MoA) framework has emerged as a general paradigm to harness collective intelligence by layering diverse models. While recent MoA variants have introduced dynamic routing and residual connections to improve efficiency, these methods often fail to facilitate deep semantic interaction between agents, limiting the system’s ability to actively correct hallucinations and refine logic. In this paper, we introduce Attention-MoA, a novel MoA-based framework that redefines collaboration through Inter-agent Semantic Attention. Complemented by an Inter-layer Residual Module with Adaptive Early Stopping Mechanism, our architecture mitigates information degradation in deep layers while improving computational efficiency. Extensive evaluations across AlpacaEval 2.0, MT-Bench, and FLASK demonstrate that Attention-MoA significantly outperforms state-of-the-art baselines, achieving a 91.15% Length-Controlled Win Rate on AlpacaEval 2.0 and dominating in 10 out of 12 capabilities on FLASK. Notably, Attention-MoA enables an ensemble of small open-source models to outperform massive proprietary models like Claude-4.5-Sonnet and GPT-4.1, achieving an MT-Bench score of 8.83 and an AlpacaEval 2.0 LC Win Rate of 77.36%.

[35] Sycophancy Hides Linearly in the Attention Heads

Rifo Genadi, Munachiso Nwadike, Nurdaulet Mukhituly, Hilal Alquabeh, Tatsuya Hiraoka, Kentaro Inui

Main category: cs.CL

TL;DR: Sycophancy signals are most separable in attention heads; linear probes show steering works best in middle-layer attention heads; these probes transfer across datasets; sycophancy differs from truthfulness; can be mitigated via targeted linear interventions.

DetailsMotivation: To understand where correct-to-incorrect sycophancy signals emerge in transformer models and how they can be detected and mitigated through linear interventions, motivated by the linear representation hypothesis.

Method: Train linear probes across residual stream, MLP, and attention layers; analyze separability; use TruthfulQA as base dataset; test transfer to other factual QA benchmarks; compare with existing “truthful” directions; conduct attention-pattern analysis.
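
A minimal sketch of the probing setup, assuming attention-head activations have already been collected into an array; the logistic probe and the difference-of-means steering direction are standard techniques, and nothing here is the paper's exact code.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def probe_heads(head_acts, labels):
    """Fit one linear probe per attention head and report separability.
    head_acts: (n_examples, n_heads, d_head) activations, collected
    elsewhere; labels: 1 where the model flipped a correct answer
    under user pushback (sycophantic), else 0."""
    labels = np.asarray(labels)
    scores = {}
    for h in range(head_acts.shape[1]):
        probe = LogisticRegression(max_iter=1000).fit(head_acts[:, h, :], labels)
        scores[h] = probe.score(head_acts[:, h, :], labels)   # separability
    return scores

def steering_vector(head_acts, labels, head):
    """Difference-of-means direction for steering one head's activations."""
    labels = np.asarray(labels)
    X = head_acts[:, head, :]
    return X[labels == 1].mean(axis=0) - X[labels == 0].mean(axis=0)
```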

Result: Sycophancy signals are most linearly separable in multi-head attention activations; steering works best in sparse subset of middle-layer attention heads; probes transfer effectively across datasets; discovered direction has limited overlap with “truthful” directions; influential heads attend to user doubt expressions.

Conclusion: Sycophancy can be mitigated through simple, targeted linear interventions exploiting the internal geometry of attention activations, suggesting factual accuracy and deference resistance arise from related but distinct mechanisms.

Abstract: We find that correct-to-incorrect sycophancy signals are most linearly separable within multi-head attention activations. Motivated by the linear representation hypothesis, we train linear probes across the residual stream, multilayer perceptron (MLP), and attention layers to analyze where these signals emerge. Although separability appears in the residual stream and MLPs, steering using these probes is most effective in a sparse subset of middle-layer attention heads. Using TruthfulQA as the base dataset, we find that probes trained on it transfer effectively to other factual QA benchmarks. Furthermore, comparing our discovered direction to previously identified “truthful” directions reveals limited overlap, suggesting that factual accuracy and deference resistance arise from related but distinct mechanisms. Attention-pattern analysis further indicates that the influential heads attend disproportionately to expressions of user doubt, contributing to sycophantic shifts. Overall, these findings suggest that sycophancy can be mitigated through simple, targeted linear interventions that exploit the internal geometry of attention activations.

[36] AuroraEdge-V-2B: A Faster And Stronger Edge Visual Large Language Model

Xiang Chen

Main category: cs.CL

TL;DR: AuroraEdge-V-2B is a compact 2B-parameter visual language model designed for edge deployment, featuring faster inference, reduced visual tokens, and strong performance compared to similar-sized models.

DetailsMotivation: While VLLMs offer advantages like strong generalization and flexibility for industrial applications, they suffer from poor performance in specific domains, large parameter counts requiring substantial computational resources, and slow inference speeds that hinder real-time response. There's a need for compact, efficient VLLMs suitable for edge deployment.

Method: The authors introduce AuroraEdge-V-2B, a compact VLLM with only 2B parameters designed specifically for edge deployment. They also propose a compression-fusion method to improve inference efficiency by significantly reducing the number of visual tokens in the decoding process, which cuts floating-point operations by half during inference.
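
The summary only states that visual tokens are reduced enough to halve decoding FLOPs; as a purely illustrative stand-in for such a compression-fusion step, the sketch below fuses adjacent visual tokens by averaging.

```python
import torch

def compress_visual_tokens(tokens, ratio=2):
    """Illustrative compression-fusion: average groups of `ratio`
    adjacent visual tokens, shrinking the sequence the decoder sees
    (and, roughly, its FLOPs) by that factor.
    tokens: (batch, seq_len, dim) visual token embeddings."""
    b, s, d = tokens.shape
    s = s - s % ratio                       # drop the remainder for clean grouping
    return tokens[:, :s].reshape(b, s // ratio, ratio, d).mean(dim=2)

x = torch.randn(1, 256, 768)
print(compress_visual_tokens(x).shape)      # torch.Size([1, 128, 768])
```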

Result: AuroraEdge-V-2B achieves higher scores on 9 benchmarks than comparable models with similar parameter counts (Qwen2-VL-2B, Qwen2.5-VL-3B, InternVL-2.5-2B). The model offers easy deployment, faster inference, reduced computational cost, and strong performance while maintaining compact size.

Conclusion: The proposed AuroraEdge-V-2B addresses key limitations of traditional VLLMs for industrial applications by providing a compact, efficient solution suitable for edge deployment with improved real-time performance and reduced computational requirements while maintaining competitive performance.

Abstract: Recently, due to the advancement of multimodal technology, people are attempting to use visual large language models (VLLMs) in industrial production. Many deep learning models (DLMs) deployed in the production environment are gradually being replaced by VLLMs. Compared with DLMs, VLLMs have some advantages in industrial applications: (1) Their strong generalization ability enables them to perform well across a wide range of tasks. (2) They are flexible and can quickly deal with unfamiliar samples through in-context learning. However, VLLMs also have obvious drawbacks: (1) VLLMs do not perform as well as custom-developed DLMs in specific domains. (2) The number of parameters in VLLMs is generally quite large, and their deployment requires substantial computational resources. (3) VLLMs generally operate much slower than DLMs, making real-time response challenging to achieve. To better utilize VLLMs in industrial applications, we introduce AuroraEdge-V-2B in this work, a compact, robust, and high-speed VLLM designed for edge deployment. To make the model run faster, we also propose a compression-fusion method to improve inference efficiency. AuroraEdge-V-2B has the following notable features: (1) Easy deployment and faster inference: It has only 2B parameters and is highly suitable for edge deployment, offering better real-time performance. (2) Fewer visual tokens and cheaper: It significantly reduces the number of visual tokens in the decoding process, thereby reducing the floating-point operations by half during inference and making it cheaper to use. (3) Strong performance: It achieves higher scores on 9 benchmarks than models with a similar number of parameters (e.g., Qwen2-VL-2B, Qwen2.5-VL-3B, InternVL-2.5-2B).

[37] PROST-LLM: Progressively Enhancing the Speech-to-Speech Translation Capability in LLMs

Jing Xu, Jiaqi Wang, Daxin Tan, Xiao Chen

Main category: cs.CL

TL;DR: PROST-LLM enhances LLMs for speech-to-speech translation through progressive fine-tuning, self-sampling preference generation, and preference optimization to overcome data scarcity.

DetailsMotivation: LLMs excel in many tasks but their application to Speech-to-Speech Translation (S2ST) is underexplored and hindered by data scarcity, creating a need for methods to enhance S2ST capabilities in LLMs.

Method: Three-stage progressive approach: 1) Fine-tune LLMs with CVSS corpus using tri-task learning and chain of modality methods, 2) Generate preference pairs through self-sampling and back-translation without human evaluation, 3) Use preference pairs for preference optimization to further enhance S2ST capability.
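
The label-free preference-pair stage could be sketched as follows; `model`, `back_translate`, and `score` are assumed stand-ins, and the ranking criterion is an illustrative guess at how back-translation quality could be used.

```python
def build_preference_pairs(model, back_translate, score, sources, n_samples=4):
    """Sketch of human-label-free preference-pair construction:
    self-sample several output translations, back-translate each one,
    and rank candidates by how well the back-translation matches the
    source text."""
    pairs = []
    for src in sources:
        candidates = [model.translate(src) for _ in range(n_samples)]  # self-sampling
        ranked = sorted(candidates, key=lambda c: score(src, back_translate(c)))
        pairs.append({"prompt": src, "rejected": ranked[0], "chosen": ranked[-1]})
    return pairs
```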

Result: Extensive experiments confirm the effectiveness of PROST-LLM in improving the S2ST capability of LLMs.

Conclusion: PROST-LLM successfully bridges the gap in applying LLMs to S2ST by progressively enhancing their capabilities through a data-efficient approach that overcomes data scarcity limitations.

Abstract: Although Large Language Models (LLMs) excel in many tasks, their application to Speech-to-Speech Translation (S2ST) is underexplored and hindered by data scarcity. To bridge this gap, we propose PROST-LLM (PROgressive Speech-to-speech Translation) to enhance the S2ST capabilities in LLMs progressively. First, we fine-tune the LLMs with the CVSS corpus, employing designed tri-task learning and chain of modality methods to boost the initial performance. Then, leveraging the fine-tuned model, we generate preference pairs through self-sampling and back-translation without human evaluation. Finally, these preference pairs are used for preference optimization to enhance the model’s S2ST capability further. Extensive experiments confirm the effectiveness of our proposed PROST-LLM in improving the S2ST capability of LLMs.

[38] How Does Personalized Memory Shape LLM Behavior? Benchmarking Rational Preference Utilization in Personalized Assistants

Xueyang Feng, Weinan Gan, Xu Chen, Quanyu Dai, Yong Liu

Main category: cs.CL

TL;DR: This paper introduces RPEval, a benchmark to evaluate how personalization in LLM assistants can sometimes harm performance by introducing irrelevant memories, and proposes RP-Reasoner to selectively use personalized information through pragmatic reasoning.

DetailsMotivation: While LLM-powered assistants with memory mechanisms can provide personalized responses, irrelevant personalized memories often interfere with the LLM's intent understanding, creating a need to study and mitigate this "irrational personalization" problem.

Method: The authors developed RPEval benchmark with personalized intent reasoning dataset and multi-granularity evaluation protocol, then introduced RP-Reasoner which treats memory utilization as a pragmatic reasoning process to selectively integrate personalized information.

Result: RPEval revealed widespread irrational personalization in existing LLMs, and RP-Reasoner significantly outperformed baselines on RPEval while resolving 80% of bad cases in a large-scale commercial personalized assistant.

Conclusion: Pragmatic reasoning shows strong potential to mitigate irrational personalization in LLM assistants, enabling more selective and effective use of personalized memories without interfering with intent understanding.

Abstract: Large language model (LLM)-powered assistants have recently integrated memory mechanisms that record user preferences, leading to more personalized and user-aligned responses. However, irrelevant personalized memories are often introduced into the context, interfering with the LLM’s intent understanding. To comprehensively investigate the dual effects of personalization, we develop RPEval, a benchmark comprising a personalized intent reasoning dataset and a multi-granularity evaluation protocol. RPEval reveals the widespread phenomenon of irrational personalization in existing LLMs and, through error pattern analysis, illustrates its negative impact on user experience. Finally, we introduce RP-Reasoner, which treats memory utilization as a pragmatic reasoning process, enabling the selective integration of personalized information. Experimental results demonstrate that our method significantly outperforms carefully designed baselines on RPEval, and resolves 80% of the bad cases observed in a large-scale commercial personalized assistant, highlighting the potential of pragmatic reasoning to mitigate irrational personalization. Our benchmark is publicly available at https://github.com/XueyangFeng/RPEval.

[39] PLawBench: A Practical Law Benchmark for Evaluating LLMs in Realistic Legal Practice Scenarios

Yuzhen Shi, Huanghai Liu, Yiran Hu, Gaojie Song, Xinran Xu, Yubo Ma, Tianyi Tang, Li Zhang, Qingjing Chen, Di Feng, Wenbo Lv, Weiheng Wu, Kexin Yang, Sen Yang, Wei Wang, Rongyao Shi, Yuanyang Qiu, Yuemeng Qi, Jingwen Zhang, Xiaoyu Sui, Yifan Chen, Yi Zhang, An Yang, Bowen Yu, Dayiheng Liu, Junyang Lin, Weixing Shen, Bing Zhao, Charles L. A. Clarke, Hu Wei

Main category: cs.CL

TL;DR: PLawBench is a new benchmark for evaluating LLMs on realistic legal practice tasks, revealing significant limitations in current models’ fine-grained legal reasoning abilities.

DetailsMotivation: Existing legal benchmarks are too simplified and standardized, failing to capture the ambiguity, complexity, and reasoning demands of real legal practice. They use coarse metrics and don't assess fine-grained legal reasoning.

Method: Created PLawBench with 850 questions across 13 practical legal scenarios, modeling real-world legal workflows through three task categories: public legal consultation, practical case analysis, and legal document generation. Each question has expert-designed evaluation rubrics (12,500 total items). Used LLM-based evaluator aligned with human judgments to test 10 state-of-the-art LLMs.

Result: None of the 10 state-of-the-art LLMs achieved strong performance on PLawBench, revealing substantial limitations in their fine-grained legal reasoning capabilities.

Conclusion: Current LLMs have significant limitations in practical legal reasoning, highlighting the need for better evaluation methods and development of legal LLMs that can handle real-world legal complexity.

Abstract: As large language models (LLMs) are increasingly applied to legal domain-specific tasks, evaluating their ability to perform legal work in real-world settings has become essential. However, existing legal benchmarks rely on simplified and highly standardized tasks, failing to capture the ambiguity, complexity, and reasoning demands of real legal practice. Moreover, prior evaluations often adopt coarse, single-dimensional metrics and do not explicitly assess fine-grained legal reasoning. To address these limitations, we introduce PLawBench, a Practical Law Benchmark designed to evaluate LLMs in realistic legal practice scenarios. Grounded in real-world legal workflows, PLawBench models the core processes of legal practitioners through three task categories: public legal consultation, practical case analysis, and legal document generation. These tasks assess a model’s ability to identify legal issues and key facts, perform structured legal reasoning, and generate legally coherent documents. PLawBench comprises 850 questions across 13 practical legal scenarios, with each question accompanied by expert-designed evaluation rubrics, resulting in approximately 12,500 rubric items for fine-grained assessment. Using an LLM-based evaluator aligned with human expert judgments, we evaluate 10 state-of-the-art LLMs. Experimental results show that none achieves strong performance on PLawBench, revealing substantial limitations in the fine-grained legal reasoning capabilities of current LLMs and highlighting important directions for future evaluation and development of legal LLMs. Data is available at: https://github.com/skylenage/PLawbench.

[40] MultiLexNorm++: A Unified Benchmark and a Generative Model for Lexical Normalization for Asian Languages

Weerayut Buaphet, Thanh-Nhi Nguyen, Risa Kondo, Tomoyuki Kajiwara, Yumin Kim, Jimin Lee, Hwanhee Lee, Holy Lovenia, Peerat Limkonchotiwat, Sarana Nutanong, Rob Van der Goot

Main category: cs.CL

TL;DR: The paper extends the MultiLexNorm benchmark to include 5 Asian languages from different language families and scripts, proposes a new LLM-based architecture that outperforms previous SOTA, and analyzes remaining errors for future research directions.

DetailsMotivation: Social media text presents challenges for NLP due to informal language, spontaneity, and diverse sociolects, causing model performance deterioration. Existing lexical normalization benchmarks like MultiLexNorm are limited to Indo-European languages in Latin script, lacking diversity across language families and writing systems.

Method: The authors extend MultiLexNorm benchmark to include 5 Asian languages from different language families in 4 different scripts. They propose a new architecture based on Large Language Models (LLMs) for more robust performance compared to previous state-of-the-art models.

Result: Previous state-of-the-art models perform worse on the new Asian languages. The proposed LLM-based architecture shows more robust performance across diverse languages and scripts. Error analysis reveals remaining challenges and future research directions.

Conclusion: The paper successfully extends lexical normalization benchmarking to more diverse languages, demonstrates the superiority of LLM-based approaches over previous methods, and identifies key areas for future improvement in handling diverse linguistic variations across different scripts and language families.

Abstract: Social media data has been of interest to Natural Language Processing (NLP) practitioners for over a decade, because of its richness in information, but also challenges for automatic processing. Since language use is more informal, spontaneous, and adheres to many different sociolects, the performance of NLP models often deteriorates. One solution to this problem is to transform data to a standard variant before processing it, which is also called lexical normalization. There has been a wide variety of benchmarks and models proposed for this task. The MultiLexNorm benchmark proposed to unify these efforts, but it consists almost solely of languages from the Indo-European language family in the Latin script. Hence, we propose an extension to MultiLexNorm, which covers 5 Asian languages from different language families in 4 different scripts. We show that the previous state-of-the-art model performs worse on the new languages and propose a new architecture based on Large Language Models (LLMs), which shows more robust performance. Finally, we analyze remaining errors, revealing future directions for this task.

[41] Typologically Informed Parameter Aggregation

Stef Accou, Wessel Poelman

Main category: cs.CL

TL;DR: TIPA is a training-free method that creates proxy language adapters by aggregating existing adapters based on typological similarity, enabling zero-shot cross-lingual transfer without additional training.

DetailsMotivation: Massively multilingual language models underperform on low-resource and unseen languages, and while adapter-based fine-tuning helps, training language-specific adapters at scale is costly.

Method: Typologically Informed Parameter Aggregation (TIPA) constructs proxy language adapters by aggregating existing adapters weighted by typological similarity, integrated into the MAD-X framework.
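
A minimal sketch of the aggregation step, assuming adapters are available as PyTorch state dicts and a typological similarity score per language; the top-k truncation and sum-to-one normalization are illustrative choices, not necessarily the paper's.

```python
import torch

def tipa_proxy_adapter(adapters, similarity, top_k=3):
    """Typologically informed aggregation, sketched: average existing
    language adapters, weighted by typological similarity to the
    target language. adapters: {lang: state_dict of tensors};
    similarity: {lang: float}."""
    langs = sorted(similarity, key=similarity.get, reverse=True)[:top_k]
    weights = torch.tensor([similarity[l] for l in langs])
    weights = weights / weights.sum()
    reference = adapters[langs[0]]
    return {key: sum(w * adapters[l][key] for w, l in zip(weights, langs))
            for key in reference}
```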

Result: TIPA consistently outperforms or matches baselines (English-only fine-tuning or selecting typologically closest adapter) on five NLP tasks across 230+ languages, with largest gains for languages lacking dedicated adapters.

Conclusion: Typologically informed aggregation provides a viable alternative to language-specific modules without any training needed, enabling effective zero-shot cross-lingual transfer.

Abstract: Massively multilingual language models enable cross-lingual generalization but underperform on low-resource and unseen languages. While adapter-based fine-tuning offers a parameter-efficient solution, training language-specific adapters at scale remains costly. We introduce Typologically Informed Parameter Aggregation (TIPA), a training-free method that constructs proxy language adapters by aggregating existing ones, weighted by typological similarity. Integrated into the MAD-X framework, these proxies enable zero-shot cross-lingual transfer without additional training. We evaluate TIPA on five NLP tasks and over 230 languages. TIPA consistently outperforms or matches baselines such as English-only fine-tuning or selecting the typologically closest language adapter. We see the largest gains for languages lacking dedicated adapters. Our results demonstrate that typologically informed aggregation provides a viable alternative to language-specific modules without any training needed.

[42] Select or Project? Evaluating Lower-dimensional Vectors for LLM Training Data Explanations

Lukas Hinterleitner, Loris Schoenegger, Benjamin Roth

Main category: cs.CL

TL;DR: Gradient-based influence estimation for LLMs is computationally heavy due to high-dimensional gradients. This paper shows that selecting a small, architecturally informed subset of model components works better than using full gradients or random projections for training data influence estimation.

DetailsMotivation: Current gradient-based methods for instance-based explanation in LLMs face computational challenges due to the immense dimensionality of model gradients. Existing approaches often use ad hoc parameter subsets without systematic evaluation, creating a need for better strategies to make influence estimation computationally feasible.

Method: The paper investigates two approaches: (1) selecting a small, architecturally informed subset of model components, and (2) projecting full gradients into lower-dimensional space. They use a novel benchmark to compare these approaches against full gradients and random projections, with a focus on greedy component selection.
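
The greedy selection could be sketched as forward selection over components; `retrieval_score` is a stand-in for an evaluation that runs influence-based retrieval using only the gradients of the chosen subset, and none of the names below come from the paper.

```python
def greedy_component_selection(components, retrieval_score, budget=5):
    """Greedy forward selection over model components (e.g. individual
    weight matrices): repeatedly add the component whose gradients most
    improve a held-out retrieval metric."""
    chosen = []
    for _ in range(min(budget, len(components))):
        best = max((c for c in components if c not in chosen),
                   key=lambda c: retrieval_score(chosen + [c]))
        chosen.append(best)
    return chosen
```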

Result: A greedily selected subset of components captures training data influence information more effectively than full gradients or random projections for retrieval tasks. This approach is also more computationally efficient than random projection.

Conclusion: Targeted component selection is a practical and effective strategy for making instance-based explanations of large models computationally feasible, outperforming both full gradient approaches and dimensionality reduction via random projection.

Abstract: Gradient-based methods for instance-based explanation for large language models (LLMs) are hindered by the immense dimensionality of model gradients. In practice, influence estimation is restricted to a subset of model parameters to make computation tractable, but this subset is often chosen ad hoc and rarely justified by systematic evaluation. This paper investigates if it is better to create low-dimensional representations by selecting a small, architecturally informed subset of model components or by projecting the full gradients into a lower-dimensional space. Using a novel benchmark, we show that a greedily selected subset of components captures the information about training data influence needed for a retrieval task more effectively than either the full gradient or random projection. We further find that this approach is more computationally efficient than random projection, demonstrating that targeted component selection is a practical strategy for making instance-based explanations of large models more computationally feasible.

[43] Standardizing Longitudinal Radiology Report Evaluation via Large Language Model Annotation

Xinyi Wang, Grazziela Figueredo, Ruizhe Li, Xin Chen

Main category: cs.CL

TL;DR: Proposes an LLM-based pipeline for automatically annotating longitudinal information in radiology reports to create standardized benchmarks for evaluating report generation models.

DetailsMotivation: Existing methods for validating longitudinal information in radiology report generation models are inadequate - manual annotation is labor-intensive, rule-based methods are either too complex (closed-source, domain-specific) or too simple (missing specialized information). There's a lack of proper tools for consistently labeling temporal changes in both ground-truth and model-generated texts.

Method: Proposes an LLM-based pipeline with two main tasks: 1) identifying sentences containing relevant longitudinal information, and 2) extracting disease progression. Evaluated five mainstream LLMs on 500 manually annotated reports, selected Qwen2.5-32B based on efficiency and performance, then used it to annotate 95,169 reports from MIMIC-CXR dataset.

Result: LLM-based annotation method outperforms existing solutions, achieving 11.3% higher F1-score for longitudinal information detection and 5.3% higher for disease tracking. Created a standardized benchmark dataset that was used to evaluate seven state-of-the-art report generation models.

Conclusion: LLMs provide an effective alternative for automatically annotating longitudinal information in radiology reports, overcoming limitations of manual and rule-based methods. The proposed pipeline enables consistent labeling and creates valuable benchmarks for evaluating radiology report generation models.

Abstract: Longitudinal information in radiology reports refers to the sequential tracking of findings across multiple examinations over time, which is crucial for monitoring disease progression and guiding clinical decisions. Many recent automated radiology report generation methods are designed to capture longitudinal information; however, validating their performance is challenging. There is no proper tool to consistently label temporal changes in both ground-truth and model-generated texts for meaningful comparisons. Existing annotation methods are typically labor-intensive, relying on the use of manual lexicons and rules. Complex rules are closed-source, domain specific and hard to adapt, whereas overly simple ones tend to miss essential specialised information. Large language models (LLMs) offer a promising annotation alternative, as they are capable of capturing nuanced linguistic patterns and semantic similarities without extensive manual intervention. They also adapt well to new contexts. In this study, we therefore propose an LLM-based pipeline to automatically annotate longitudinal information in radiology reports. The pipeline first identifies sentences containing relevant information and then extracts the progression of diseases. We evaluate and compare five mainstream LLMs on these two tasks using 500 manually annotated reports. Considering both efficiency and performance, Qwen2.5-32B was subsequently selected and used to annotate another 95,169 reports from the public MIMIC-CXR dataset. Our Qwen2.5-32B-annotated dataset provided us with a standardized benchmark for evaluating report generation models. Using this new benchmark, we assessed seven state-of-the-art report generation models. Our LLM-based annotation method outperforms existing annotation solutions, achieving 11.3% and 5.3% higher F1-scores for longitudinal information detection and disease tracking, respectively.

[44] EMemBench: Interactive Benchmarking of Episodic Memory for VLM Agents

Xinze Li, Ziyue Zhu, Siyuan Liu, Yubo Ma, Yuhang Zang, Yixin Cao, Aixin Sun

Main category: cs.CL

TL;DR: EMemBench is a programmatic benchmark for evaluating long-term memory in AI agents through interactive games, generating questions from agent trajectories with verifiable ground truth across text and visual environments.

DetailsMotivation: Existing benchmarks for evaluating agent memory often use fixed question sets, which may not comprehensively test long-term memory capabilities. There's a need for a more dynamic, programmatic approach that can generate questions from agent trajectories and cover diverse memory skills across both text and visual environments.

Method: EMemBench generates questions from each agent’s own trajectory in interactive games, covering both text and visual environments. It uses templates that compute verifiable ground truth from underlying game signals, with controlled answerability and balanced coverage over multiple memory skills: single/multi-hop recall, induction, temporal, spatial, logical, and adversarial reasoning.
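
One way to picture a trajectory-derived template with verifiable ground truth is the toy sketch below; the template, fields, and trajectory format are assumptions for illustration, not EMemBench's actual schema.

```python
def temporal_order_question(trajectory):
    """Illustrative EMemBench-style template: ground truth is computed
    directly from logged game signals, so it is verifiable.
    trajectory: list of (step, event) pairs recorded during play."""
    events = [event for _, event in trajectory]
    if len(events) < 2:
        return None                      # controlled answerability
    first, last = events[0], events[-1]
    return {
        "question": f"Which did you encounter first: '{first}' or '{last}'?",
        "answer": first,
        "skill": "temporal",
    }

traj = [(1, "rusty key"), (4, "locked chest"), (9, "gold coin")]
print(temporal_order_question(traj))
```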

Result: Results across 15 text games and multiple visual seeds show performance is far from saturated. Induction and spatial reasoning are persistent bottlenecks, especially in visual settings. Persistent memory yields clear gains for open backbones on text games, but improvements are less consistent for VLM agents. A human study confirms the difficulty of the benchmark.

Conclusion: EMemBench provides a comprehensive framework for evaluating long-term memory in agents. The benchmark reveals that visually grounded episodic memory remains an open challenge, with VLM agents showing inconsistent improvements despite persistent memory approaches. The benchmark’s difficulty is validated by human performance studies.

Abstract: We introduce EMemBench, a programmatic benchmark for evaluating long-term memory of agents through interactive games. Rather than using a fixed set of questions, EMemBench generates questions from each agent’s own trajectory, covering both text and visual game environments. Each template computes verifiable ground truth from underlying game signals, with controlled answerability and balanced coverage over memory skills: single/multi-hop recall, induction, temporal, spatial, logical, and adversarial. We evaluate memory agents with strong LMs/VLMs as backbones, using in-context prompting as baselines. Across 15 text games and multiple visual seeds, results are far from saturated: induction and spatial reasoning are persistent bottlenecks, especially in visual settings. Persistent memory yields clear gains for open backbones on text games, but improvements are less consistent for VLM agents, suggesting that visually grounded episodic memory remains an open challenge. A human study further confirms the difficulty of EMemBench.

[45] Do LLM hallucination detectors suffer from low-resource effect?

Debtanu Datta, Mohan Kishore Chilukuri, Yash Kumar, Saptarshi Ghosh, Muhammad Bilal Zafar

Main category: cs.CL

TL;DR: Hallucination detectors show surprising robustness in low-resource languages, with accuracy drops much smaller than task performance drops, suggesting LLMs encode uncertainty signals even in non-English contexts.

DetailsMotivation: To investigate whether hallucination detectors suffer from the low-resource effect, where LLM performance degrades significantly in low-resource languages compared to high-resource languages like English.

Method: Conducted experiments on five tasks across three domains (factual recall, STEM, Humanities) using four LLMs and three hallucination detectors, comparing performance in English vs. low-resource languages like Bengali.

Result: While task accuracies in low-resource languages experience large drops compared to English, the drop in detectors’ accuracy is often several times smaller than the drop in task accuracy.
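
With placeholder numbers, the comparison reduces to two relative drops; a sketch (the accuracies below are made up for illustration, not the paper's results):

```python
# Relative drop when moving from English to a low-resource language.
def relative_drop(acc_en: float, acc_lo: float) -> float:
    return (acc_en - acc_lo) / acc_en

task_drop = relative_drop(acc_en=0.78, acc_lo=0.41)       # large task drop
detector_drop = relative_drop(acc_en=0.85, acc_lo=0.80)   # much smaller drop
print(f"task drop {task_drop:.0%} vs detector drop {detector_drop:.0%}")
```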

Conclusion: Hallucination detectors are robust within language (even for non-English) and in multilingual setups, and LLMs appear to encode uncertainty signals even in low-resource languages, though cross-lingual settings without in-language supervision remain challenging.

Abstract: LLMs, while outperforming humans in a wide range of tasks, can still fail in unanticipated ways. We focus on two pervasive failure modes: (i) hallucinations, where models produce incorrect information about the world, and (ii) the low-resource effect, where the models show impressive performance in high-resource languages like English but the performance degrades significantly in low-resource languages like Bengali. We study the intersection of these issues and ask: do hallucination detectors suffer from the low-resource effect? We conduct experiments on five tasks across three domains (factual recall, STEM, and Humanities). Experiments with four LLMs and three hallucination detectors reveal a curious finding: As expected, the task accuracies in low-resource languages experience large drops (compared to English). However, the drop in detectors’ accuracy is often several times smaller than the drop in task accuracy. Our findings suggest that even in low-resource languages, the internal mechanisms of LLMs might encode signals about their uncertainty. Further, the detectors are robust within language (even for non-English) and in multilingual setups, but not in cross-lingual settings without in-language supervision.

[46] Better Generalizing to Unseen Concepts: An Evaluation Framework and An LLM-Based Auto-Labeled Pipeline for Biomedical Concept Recognition

Shanshan Liu, Noriki Nishida, Fei Cheng, Narumi Tokunaga, Rumana Ferdous Munne, Yuki Yamagata, Kouji Kozaki, Takehito Utsuro, Yuji Matsumoto

Main category: cs.CL

TL;DR: LLM-generated auto-labeled data improves generalization for biomedical concept recognition but can’t fully replace manual annotations.

DetailsMotivation: Addressing the challenge of generalization to unseen concepts in Mention-agnostic Biomedical Concept Recognition (MA-BCR) due to scarcity of human annotations.

Method: Proposed evaluation framework with hierarchical concept indices and novel metrics; developed LLM-based Auto-Labeled Data (ALD) generation pipeline.

Result: LLM-generated ALD improves generalization by providing broader coverage and structural knowledge, but cannot fully substitute for manual annotations.

Conclusion: ALD is a valuable resource for improving generalization in MA-BCR, offering scalable data generation while recognizing limitations compared to human annotations.

Abstract: Generalization to unseen concepts is a central challenge due to the scarcity of human annotations in Mention-agnostic Biomedical Concept Recognition (MA-BCR). This work makes two key contributions to systematically address this issue. First, we propose an evaluation framework built on hierarchical concept indices and novel metrics to measure generalization. Second, we explore LLM-based Auto-Labeled Data (ALD) as a scalable resource, creating a task-specific pipeline for its generation. Our research unequivocally shows that while LLM-generated ALD cannot fully substitute for manual annotations, it is a valuable resource for improving generalization, successfully providing models with the broader coverage and structural knowledge needed to approach recognizing unseen concepts. Code and datasets are available at https://github.com/bio-ie-tool/hi-ald.

[47] Mitigating Bias in Automated Grading Systems for ESL Learners: A Contrastive Learning Approach

Kevin Fan, Eric Yun

Main category: cs.CL

TL;DR: AES systems show bias against ESL learners. The study reveals a 10.3% scoring gap for high-proficiency ESL essays vs native essays of equal quality. Proposed contrastive learning with matched essay pairs reduces this disparity by 39.9% while maintaining scoring accuracy.

DetailsMotivation: Transformer-based AES systems trained on native-speaker corpora learn spurious correlations between surface-level L2 linguistic features and essay quality, creating algorithmic bias against ESL learners in high-stakes educational settings.

Method: Used fine-tuned DeBERTa-v3 model on ASAP 2.0 and ELLIPSE datasets, then applied contrastive learning with triplet construction strategy (Contrastive Learning with Matched Essay Pairs). Created 17,161 matched essay pairs and fine-tuned using Triplet Margin Loss to align latent representations of ESL and Native writing.
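
A minimal PyTorch sketch of the triplet objective, assuming matched (ESL, native) essays of equal human-rated quality serve as anchor and positive; the batch size, embedding dimension, and margin are placeholders rather than the paper's configuration.

```python
# Triplet Margin Loss over matched essay pairs (sketch with random tensors
# standing in for encoder outputs).
import torch
import torch.nn as nn

embed_dim, margin = 768, 1.0
triplet_loss = nn.TripletMarginLoss(margin=margin)

# anchor: ESL essay embedding; positive: matched native essay of equal quality;
# negative: an essay of different quality, preserving the score signal.
anchor   = torch.randn(32, embed_dim, requires_grad=True)
positive = torch.randn(32, embed_dim)
negative = torch.randn(32, embed_dim)

loss = triplet_loss(anchor, positive, negative)
loss.backward()  # pulls ESL and native representations of equal quality together
```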

Result: Reduced high-proficiency scoring disparity by 39.9% (from 10.3% gap to 6.2% gap) while maintaining Quadratic Weighted Kappa of 0.76. Post-hoc analysis shows model successfully disentangled sentence complexity from grammatical error.

Conclusion: Contrastive learning with matched essay pairs effectively mitigates algorithmic bias in AES systems against ESL learners by aligning latent representations, preventing penalization of valid L2 syntactic structures while maintaining scoring reliability.

Abstract: As Automated Essay Scoring (AES) systems are increasingly used in high-stakes educational settings, concerns regarding algorithmic bias against English as a Second Language (ESL) learners have increased. Current Transformer-based regression models trained primarily on native-speaker corpora often learn spurious correlations between surface-level L2 linguistic features and essay quality. In this study, we conduct a bias study of a fine-tuned DeBERTa-v3 model using the ASAP 2.0 and ELLIPSE datasets, revealing a constrained score scaling for high-proficiency ESL writing where high-proficiency ESL essays receive scores 10.3% lower than Native speaker essays of identical human-rated quality. To mitigate this, we propose applying contrastive learning with a triplet construction strategy: Contrastive Learning with Matched Essay Pairs. We constructed a dataset of 17,161 matched essay pairs and fine-tuned the model using Triplet Margin Loss to align the latent representations of ESL and Native writing. Our approach reduced the high-proficiency scoring disparity by 39.9% (to a 6.2% gap) while maintaining a Quadratic Weighted Kappa (QWK) of 0.76. Post-hoc linguistic analysis suggests the model successfully disentangled sentence complexity from grammatical error, preventing the penalization of valid L2 syntactic structures.

[48] Persuasion Tokens for Editing Factual Knowledge in LLMs

Paul Youssef, Jörg Schlötterer, Christin Seifert

Main category: cs.CL

TL;DR: P-Tokens are special tokens that replicate IKE’s knowledge editing effects without needing lengthy demonstrations, offering a more efficient and scalable alternative.

DetailsMotivation: IKE requires costly, fact-specific demonstrations that consume significant context window space, limiting its practicality for knowledge editing in LLMs.

Method: Introduce persuasion tokens (P-Tokens) - special tokens trained to replicate IKE demonstration effects, enabling efficient knowledge editing without fact-specific demonstrations.
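
The following is a rough sketch of how such trainable soft tokens could be prepended to a frozen model's input embeddings; the shapes, initialization, and frozen-model assumption are ours, not the paper's verified setup.

```python
# Trainable "persuasion token" prefix standing in for IKE demonstrations.
import torch
import torch.nn as nn

class PTokenPrefix(nn.Module):
    def __init__(self, num_tokens: int, embed_dim: int):
        super().__init__()
        self.p_tokens = nn.Parameter(torch.randn(num_tokens, embed_dim) * 0.02)

    def forward(self, input_embeds: torch.Tensor) -> torch.Tensor:
        # input_embeds: (batch, seq_len, dim) from the frozen LLM's embedding layer
        batch = input_embeds.size(0)
        prefix = self.p_tokens.unsqueeze(0).expand(batch, -1, -1)
        return torch.cat([prefix, input_embeds], dim=1)

prefix = PTokenPrefix(num_tokens=10, embed_dim=768)
x = torch.randn(4, 32, 768)   # placeholder token embeddings
print(prefix(x).shape)        # torch.Size([4, 42, 768])
```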

Result: P-Tokens achieve performance comparable to or better than IKE across two editing datasets and three LLMs, with robustness to distractors and improved performance with more tokens.

Conclusion: P-Tokens address key IKE limitations, providing a more practical and scalable alternative for editing LLMs without requiring lengthy demonstrations.

Abstract: In-context knowledge editing (IKE) is a promising technique for updating Large Language Models (LLMs) with new information. However, IKE relies on lengthy, fact-specific demonstrations which are costly to create and consume significant context window space. In this paper, we introduce persuasion tokens (P-Tokens) – special tokens trained to replicate the effect of IKE demonstrations, enabling efficient knowledge editing without requiring fact-specific demonstrations. We evaluate P-Tokens across two editing datasets and three LLMs, demonstrating performance comparable to, and often exceeding, IKE. We further find that editing performance is robust to distractors with small negative effects on neighboring facts, and that increasing the number of P-Tokens improves performance. Our work addresses key limitations of IKE and provides a more practical and scalable alternative for editing LLMs.

[49] SoS: Analysis of Surface over Semantics in Multilingual Text-To-Image Generation

Carolin Holtermann, Florian Schneider, Anne Lauscher

Main category: cs.CL

TL;DR: T2I models show Surface-over-Semantics bias where they prioritize language surface forms over prompt semantics, producing culturally stereotypical outputs for non-English prompts.

DetailsMotivation: Prior research shows T2I models are sensitive to input languages, producing culturally stereotypical depictions for non-English prompts, but comprehensive analysis of this Surface-over-Semantics behavior is missing.

Method: Created prompts covering 171 cultural identities translated into 14 languages, tested on 7 T2I models. Introduced novel measure to quantify SoS tendencies and analyzed visual manifestations across models, languages, and cultures.

Result: All but one model exhibited a strong surface-level tendency in at least two languages, with the effect intensifying across T2I text encoder layers. Surface tendencies frequently correlated with stereotypical visual depictions.

Conclusion: T2I models demonstrate significant Surface-over-Semantics bias, prioritizing language surface forms over semantic meaning, leading to culturally stereotypical outputs that intensify through model layers.

Abstract: Text-to-image (T2I) models are increasingly employed by users worldwide. However, prior research has pointed to the high sensitivity of T2I towards particular input languages - when faced with languages other than English (i.e., different surface forms of the same prompt), T2I models often produce culturally stereotypical depictions, prioritizing the surface over the prompt’s semantics. Yet a comprehensive analysis of this behavior, which we dub Surface-over-Semantics (SoS), is missing. We present the first analysis of T2I models’ SoS tendencies. To this end, we create a set of prompts covering 171 cultural identities, translated into 14 languages, and use it to prompt seven T2I models. To quantify SoS tendencies across models, languages, and cultures, we introduce a novel measure and analyze how the tendencies we identify manifest visually. We show that all but one model exhibit a strong surface-level tendency in at least two languages, with this effect intensifying across the layers of T2I text encoders. Moreover, these surface tendencies frequently correlate with stereotypical visual depictions.

[50] Large Language Models as Automatic Annotators and Annotation Adjudicators for Fine-Grained Opinion Analysis

Gaurav Negi, MA Waskow, Paul Buitelaar

Main category: cs.CL

TL;DR: LLMs can effectively serve as automatic annotators for fine-grained opinion analysis, reducing annotation costs and human effort through declarative pipelines and adjudication methods.

DetailsMotivation: Fine-grained opinion analysis requires extensive human annotation which is costly and time-consuming, especially across diverse domains. There's a shortage of domain-specific labeled datasets for training models.

Method: Uses a declarative annotation pipeline to reduce manual prompt engineering variability. Introduces novel methodology for LLMs to adjudicate multiple labels and produce final annotations. Tested with models of different sizes on ASTE and ACOS analysis tasks.

Result: LLMs achieve high Inter-Annotator Agreement across individual LLM-based annotators, demonstrating they can effectively serve as automatic annotators and adjudicators for fine-grained opinion analysis.

Conclusion: LLMs provide a viable solution for reducing the cost and human effort needed to create fine-grained opinion-annotated datasets, addressing the shortage of domain-specific labeled data.

Abstract: Fine-grained opinion analysis of text provides a detailed understanding of expressed sentiments, including the addressed entity. Although this level of detail is sound, it requires considerable human effort and substantial cost to annotate opinions in datasets for training models, especially across diverse domains and real-world applications. We explore the feasibility of LLMs as automatic annotators for fine-grained opinion analysis, addressing the shortage of domain-specific labelled datasets. In this work, we use a declarative annotation pipeline. This approach reduces the variability of manual prompt engineering when using LLMs to identify fine-grained opinion spans in text. We also present a novel methodology for an LLM to adjudicate multiple labels and produce final annotations. After trialling the pipeline with models of different sizes for the Aspect Sentiment Triplet Extraction (ASTE) and Aspect-Category-Opinion-Sentiment (ACOS) analysis tasks, we show that LLMs can serve as automatic annotators and adjudicators, achieving high Inter-Annotator Agreement across individual LLM-based annotators. This reduces the cost and human effort needed to create these fine-grained opinion-annotated datasets.

[51] Trapped in the past? Disentangling fluid and crystallized intelligence of large language models using chess

Leonard S. Pleiss, Maximilian Schiffer, Robert K. von Weizsäcker

Main category: cs.CL

TL;DR: LLMs show a clear performance gradient in chess: they excel at memorized positions but collapse to random levels on novel tasks requiring fluid intelligence, revealing limitations in systematic generalization.

DetailsMotivation: To disentangle crystallized intelligence (recall/memorization) from fluid intelligence (reasoning) in LLMs using chess as a controlled testbed, since it's unclear whether LLM capabilities reflect sophisticated recall or genuine reasoning ability.

Method: Use chess as a structured testbed with scalable engine evaluations. Construct a taxonomy of positions varying in training corpus proximity - from common states solvable by memorization to novel ones requiring first-principles reasoning. Systematically evaluate multiple GPT generations under varying reasoning intensities.
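
For the engine-evaluation part, a sketch using the python-chess bindings; the Stockfish path, search depth, and scoring convention are our placeholder choices, and running it requires a local UCI engine binary.

```python
# Score a model-proposed move with a chess engine (centipawns, mover's side).
import chess
import chess.engine

def score_move(fen: str, move_uci: str, engine_path: str = "stockfish") -> int:
    """Centipawn evaluation of the position after playing the move."""
    board = chess.Board(fen)
    with chess.engine.SimpleEngine.popen_uci(engine_path) as engine:
        board.push(chess.Move.from_uci(move_uci))
        info = engine.analyse(board, chess.engine.Limit(depth=15))
        # After the push it is the opponent's turn, so negate the relative score.
        return -info["score"].relative.score(mate_score=10000)
```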

Result: Clear performance gradient: performance consistently degrades as fluid intelligence demands increase. In out-of-distribution tasks, performance collapses to random levels. Newer models improve but progress slows significantly for tasks outside training distribution. Reasoning-augmented inference helps but marginal benefit per token decreases with distributional proximity.

Conclusion: Current LLM architectures remain limited in systematic generalization, highlighting the need for mechanisms beyond scale to achieve robust fluid intelligence.

Abstract: Large Language Models (LLMs) exhibit remarkable capabilities, yet it remains unclear to what extent these reflect sophisticated recall (crystallized intelligence) or reasoning ability (fluid intelligence). We introduce chess as a controlled testbed for disentangling these faculties. Leveraging the game’s structure and scalable engine evaluations, we construct a taxonomy of positions varying in training corpus proximity–ranging from common states solvable by memorization to novel ones requiring first-principles reasoning. We systematically evaluate multiple GPT generations under varying reasoning intensities. Our analysis reveals a clear gradient: performance consistently degrades as fluid intelligence demands increase. Notably, in out-of-distribution tasks, performance collapses to random levels. While newer models improve, progress slows significantly for tasks outside the training distribution. Furthermore, while reasoning-augmented inference improves performance, its marginal benefit per token decreases with distributional proximity. These results suggest current architectures remain limited in systematic generalization, highlighting the need for mechanisms beyond scale to achieve robust fluid intelligence.

[52] LLM-Based Adversarial Persuasion Attacks on Fact-Checking Systems

João A. Leite, Olesya Razuvayevskaya, Kalina Bontcheva, Carolina Scarton

Main category: cs.CL

TL;DR: Novel persuasive adversarial attacks on automated fact-checking systems using LLM-generated persuasive rephrasing of claims, showing significant degradation in verification and evidence retrieval performance.

DetailsMotivation: Existing adversarial attacks on AFC systems rely on noise injection or semantic alteration, but none exploit persuasion techniques commonly used in real-world disinformation campaigns to manipulate audiences.

Method: Use generative LLM to rephrase claims using 15 persuasion techniques grouped into 6 categories, with decoupled evaluation strategy to study effects on both claim verification and evidence retrieval.
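
A minimal harness for such an attack might look as follows; the prompt wording is ours, and the technique list is a small hypothetical subset of the 15 studied.

```python
# Persuasive rephrasing attack sketch. `llm` is any prompt-to-completion callable.
from typing import Callable

TECHNIQUES = ["appeal to authority", "loaded language", "bandwagon"]

def persuasive_rephrase(claim: str, technique: str,
                        llm: Callable[[str], str]) -> str:
    prompt = (f"Rephrase the claim below using the persuasion technique "
              f"'{technique}' while keeping its factual content unchanged.\n"
              f"Claim: {claim}")
    return llm(prompt)
```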

Result: Experiments on FEVER and FEVEROUS benchmarks show persuasion attacks substantially degrade both verification performance and evidence retrieval effectiveness.

Conclusion: Persuasion techniques represent a potent class of adversarial attacks, highlighting the need for more robust automated fact-checking systems against such manipulation strategies.

Abstract: Automated fact-checking (AFC) systems are susceptible to adversarial attacks, enabling false claims to evade detection. Existing adversarial frameworks typically rely on injecting noise or altering semantics, yet no existing framework exploits the adversarial potential of persuasion techniques, which are widely used in disinformation campaigns to manipulate audiences. In this paper, we introduce a novel class of persuasive adversarial attacks on AFCs by employing a generative LLM to rephrase claims using persuasion techniques. Considering 15 techniques grouped into 6 categories, we study the effects of persuasion on both claim verification and evidence retrieval using a decoupled evaluation strategy. Experiments on the FEVER and FEVEROUS benchmarks show that persuasion attacks can substantially degrade both verification performance and evidence retrieval. Our analysis identifies persuasion techniques as a potent class of adversarial attacks, highlighting the need for more robust AFC systems.

[53] Information Representation Fairness in Long-Document Embeddings: The Peculiar Interaction of Positional and Language Bias

Elias Schuhmacher, Andrianos Michail, Juri Opitz, Rico Sennrich, Simon Clematide

Main category: cs.CL

TL;DR: Researchers found that embedding models have positional and language biases where early segments and English text are over-represented, and introduced an attention calibration method to fix this.

DetailsMotivation: To ensure all parts of a document are discoverable in embedding-based search, the authors wanted to quantify and address potential reflection biases in embedding models.

Method: Introduced a permutation-based evaluation framework to measure biases, analyzed attention distributions, and developed an inference-time attention calibration method to redistribute attention more evenly across document positions.
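
One way to realize a permutation-based probe, as a simplified sketch (the paper's exact measure may differ): embed every permutation of the document and check how strongly each segment is reflected in the document embedding. Here `embed` is any sentence-embedding callable, cosine similarity is our choice of probe, and the brute-force enumeration only suits a handful of segments.

```python
# Mean cosine similarity of each segment to the embeddings of all segment-order
# permutations of the full document; low values flag under-represented segments.
import itertools
from typing import Callable, List
import numpy as np

def segment_reflection(segments: List[str],
                       embed: Callable[[str], np.ndarray]) -> np.ndarray:
    seg_vecs = np.stack([embed(s) for s in segments])
    doc_vecs = np.stack([embed(" ".join(p))
                         for p in itertools.permutations(segments)])
    def norm(m: np.ndarray) -> np.ndarray:
        return m / np.linalg.norm(m, axis=1, keepdims=True)
    sims = norm(seg_vecs) @ norm(doc_vecs).T   # (n_segments, n_permutations)
    return sims.mean(axis=1)
```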

Result: Found systematic positional biases (early segments over-represented) and language biases (higher-resource languages like English over-represented) in state-of-the-art embedding models, especially for longer multi-segment documents.

Conclusion: Embedding models exhibit discoverability biases, but these can be mitigated through attention calibration, improving representation of later segments and lower-resource languages.

Abstract: To be discoverable in an embedding-based search process, each part of a document should be reflected in its embedding representation. To quantify any potential reflection biases, we introduce a permutation-based evaluation framework. With this, we observe that state-of-the-art embedding models exhibit systematic positional and language biases when documents are longer and consist of multiple segments. Specifically, early segments and segments in higher-resource languages like English are over-represented, while later segments and segments in lower-resource languages are marginalized. In our further analysis, we find that the positional bias stems from front-loaded attention distributions in pooling-token embeddings, where early tokens receive more attention. To mitigate this issue, we introduce an inference-time attention calibration method that redistributes attention more evenly across document positions, increasing discoverability of later segments. Our evaluation framework and attention calibration are available at https://github.com/impresso/fair-sentence-transformers

[54] Strategies for Span Labeling with Large Language Models

Danil Semin, Ondřej Dušek, Zdeněk Kasner

Main category: cs.CL

TL;DR: LogitMatch is a new constrained decoding method for LLMs that forces outputs to match valid input spans, improving span labeling performance over existing prompting strategies.

DetailsMotivation: LLMs lack explicit mechanisms to refer to specific parts of their input, leading to inconsistent ad-hoc prompting strategies for span labeling tasks like named entity recognition and error detection.

Method: Categorizes existing span labeling strategies into three families (tagging, indexing, content matching) and introduces LogitMatch - a constrained decoding method that forces model outputs to align with valid input spans.
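
A toy, greedy version of span-constrained decoding in the spirit of LogitMatch (not the authors' implementation): at each step, only tokens that keep the generated output a contiguous subsequence of the input survive in the logits.

```python
# Span-constrained logit masking: the output can never drift off a valid input span.
import math
from typing import List, Sequence, Set

def allowed_next_tokens(input_ids: Sequence[int],
                        generated: Sequence[int]) -> Set[int]:
    """Tokens t such that generated + [t] still matches a span of input_ids."""
    n, m = len(input_ids), len(generated)
    allowed = set()
    for start in range(n - m):
        if list(input_ids[start:start + m]) == list(generated):
            allowed.add(input_ids[start + m])
    return allowed

def mask_logits(logits: List[float], allowed: Set[int]) -> List[float]:
    """Set disallowed tokens to -inf so they cannot be sampled."""
    return [x if i in allowed else -math.inf for i, x in enumerate(logits)]
```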

Result: LogitMatch improves upon competitive matching-based methods by eliminating span matching issues and outperforms other strategies in some setups, though tagging remains a robust baseline.

Conclusion: LogitMatch provides an effective constrained decoding approach for span labeling with LLMs, addressing limitations of content matching while maintaining competitive performance across diverse tasks.

Abstract: Large language models (LLMs) are increasingly used for text analysis tasks, such as named entity recognition or error detection. Unlike encoder-based models, however, generative architectures lack an explicit mechanism to refer to specific parts of their input. This leads to a variety of ad-hoc prompting strategies for span labeling, often with inconsistent results. In this paper, we categorize these strategies into three families: tagging the input text, indexing numerical positions of spans, and matching span content. To address the limitations of content matching, we introduce LogitMatch, a new constrained decoding method that forces the model’s output to align with valid input spans. We evaluate all methods across four diverse tasks. We find that while tagging remains a robust baseline, LogitMatch improves upon competitive matching-based methods by eliminating span matching issues and outperforms other strategies in some setups.

[55] Comparing Specialised Small and General Large Language Models on Text Classification: 100 Labelled Samples to Achieve Break-Even Performance

Branislav Pecher, Ivan Srba, Maria Bielikova

Main category: cs.CL

TL;DR: Specialized small models need only ~100 labeled samples on average to match or beat general large models on text classification tasks, but this increases 100-200% when accounting for performance variance.

DetailsMotivation: To determine how many labeled samples are needed for specialized small models to outperform general large language models, considering performance variance, in low-data NLP scenarios.

Method: Analyzed fine-tuning, instruction-tuning, prompting, and in-context learning across 8 language models on 8 representative text classification tasks with varying characteristics.
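
The variance-aware break-even point can be located in a few lines. This sketch reflects our formalization (requiring the specialized model's mean minus one standard deviation to reach the general model's score); the learning curve in the usage example is made up.

```python
# Smallest label budget at which the specialised model beats the general one,
# with and without a variance penalty.
from typing import List, Optional, Tuple

def break_even(curve: List[Tuple[int, float, float]],
               general: float) -> Optional[int]:
    """curve: (n_labels, mean_score, std_score) triples sorted by n_labels."""
    for n_labels, mean, std in curve:
        if mean - std >= general:
            return n_labels
    return None

# Placeholder numbers: break-even would be 100 labels on mean score alone,
# but the std penalty pushes it to 200.
print(break_even([(50, 0.71, 0.06), (100, 0.76, 0.04), (200, 0.80, 0.02)], 0.75))
```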

Result: Specialized models typically need only ~100 samples to match or beat general models. Required labels strongly depend on dataset/task characteristics (binary datasets need more). When accounting for variance, required labels increase 100-200%. Larger models don’t consistently improve performance or reduce variance, and 4-bit quantization has negligible impact.

Conclusion: Specialized small models can be competitive with general large models with surprisingly few labeled samples (~100), but variance considerations significantly increase this requirement. Task characteristics matter more than model size, and quantization doesn’t meaningfully affect performance.

Abstract: When solving NLP tasks with limited labelled data, researchers typically either use a general large language model without further update, or use a small number of labelled samples to tune a specialised smaller model. In this work, we answer an important question – how many labelled samples are required for the specialised small models to outperform general large models, while taking the performance variance into consideration. By observing the behaviour of fine-tuning, instruction-tuning, prompting and in-context learning on 8 language models, we identify such performance break-even points across 8 representative text classification tasks of varying characteristics. We show that the specialised models often need only a few samples (on average 100) to be on par or better than the general ones. At the same time, the number of required labels strongly depends on the dataset or task characteristics, with fine-tuning on binary datasets requiring significantly more samples. When performance variance is taken into consideration, the number of required labels increases on average by 100-200%. Finally, larger models do not consistently lead to better performance and lower variance, with 4-bit quantisation having negligible impact.

[56] Simple-Sampling and Hard-Mixup with Prototypes to Rebalance Contrastive Learning for Text Classification

Mengyu Li, Yonghao Liu, Fausto Giunchiglia, Ximing Li, Xiaoyue Feng, Renchu Guan

Main category: cs.CL

TL;DR: SharpReCL is a novel model for imbalanced text classification that combines prototype-based representation with supervised contrastive learning to address data imbalance issues.

DetailsMotivation: Current supervised contrastive learning approaches for text classification have two main limitations: 1) they are sensitive to data imbalance, which is common in text datasets, and 2) they use separate classification and contrastive learning branches without explicit mutual guidance between them.

Method: The model obtains prototype vectors for each class from a balanced classification branch, then uses these prototypes to construct properly sized target sample sets for each class to perform supervised contrastive learning with explicit guidance between the branches.
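
The prototype step is straightforward to sketch; this is our illustration, and the balancing and contrastive details in the paper go beyond it.

```python
# Class prototypes as mean embeddings, later used to build equally sized
# target sets for the supervised contrastive branch.
import torch

def class_prototypes(features: torch.Tensor, labels: torch.Tensor,
                     num_classes: int) -> torch.Tensor:
    """features: (N, d), labels: (N,) -> prototypes: (num_classes, d)."""
    protos = torch.zeros(num_classes, features.size(1))
    for c in range(num_classes):
        mask = labels == c
        if mask.any():
            protos[c] = features[mask].mean(dim=0)
    return protos
```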

Result: Empirical results show the model’s effectiveness, even outperforming popular large language models across several datasets.

Conclusion: SharpReCL successfully addresses data imbalance issues in text classification by integrating prototype-based representation with supervised contrastive learning, demonstrating superior performance compared to existing approaches including large language models.

Abstract: Text classification is a crucial and fundamental task in web content mining. Compared with the previous learning paradigm of pre-training and fine-tuning by cross entropy loss, the recently proposed supervised contrastive learning approach has received tremendous attention due to its powerful feature learning capability and robustness. Although several studies have incorporated this technique for text classification, some limitations remain. First, many text datasets are imbalanced, and the learning mechanism of supervised contrastive learning is sensitive to data imbalance, which may harm the model’s performance. Moreover, these models leverage separate classification branches with cross entropy and supervised contrastive learning branches without explicit mutual guidance. To this end, we propose a novel model named SharpReCL for imbalanced text classification tasks. First, we obtain the prototype vector of each class in the balanced classification branch to act as a representation of each class. Then, by further explicitly leveraging the prototype vectors, we construct a proper and sufficient target sample set with the same size for each class to perform the supervised contrastive learning procedure. The empirical results show the effectiveness of our model, which even outperforms popular large language models across several datasets. Our code is available here.

[57] Linguistic traces of stochastic empathy in language models

Bennett Kleinberg, Jari Zegers, Jonas Festor, Stefana Vida, Julian Präsent, Riccardo Loconte, Sanne Peereboom

Main category: cs.CL

TL;DR: LLMs can mimic human writing when instructed to sound human, reducing human advantage in detection tasks, but they produce empathy without true humanness.

DetailsMotivation: As AI-generated content becomes harder to distinguish from human writing, researchers want to understand how incentives to appear human and task characteristics affect the human-AI detection race.

Method: Five studies using human and LLM writers creating relationship advice/descriptions with/without instructions to sound human, followed by human judges identifying source. Computational text analysis examined linguistic patterns.

Result: Instructions to sound human only helped LLMs, reducing human advantage. Effects persisted even when writers were told to avoid sounding like AI. LLMs could produce empathy without humanness and vice versa. LLMs mimic human writing by applying implicit representations of humanness to simulate stochastic empathy.

Conclusion: The human-AI detection race is asymmetric: instructions to appear human primarily benefit AI, not humans. LLMs can decouple empathy from humanness, using implicit models to mimic human-like patterns without genuine human qualities.

Abstract: Differentiating generated and human-written content is increasingly difficult. We examine how an incentive to convey humanness and task characteristics shape this human vs AI race across five studies. In Study 1-2 (n=530 and n=610) humans and a large language model (LLM) wrote relationship advice or relationship descriptions, either with or without instructions to sound human. New participants (n=428 and n=408) judged each text’s source. Instructions to sound human were only effective for the LLM, reducing the human advantage. Study 3 (n=360 and n=350) showed that these effects persist when writers were instructed to avoid sounding like an LLM. Study 4 (n=219) tested empathy as a mechanism of humanness and concluded that LLMs can produce empathy without humanness and humanness without empathy. Finally, computational text analysis (Study 5) indicated that LLMs become more human-like by applying an implicit representation of humanness to mimic stochastic empathy.

[58] Unified Multimodal Interleaved Document Representation for Retrieval

Jaewoo Lee, Joonho Ko, Jinheon Baek, Soyeong Jeong, Sung Ju Hwang

Main category: cs.CL

TL;DR: A multimodal document retrieval method that integrates text, images, and tables using vision-language models, with passage merging and reranking strategies to preserve document context.

DetailsMotivation: Existing IR methods have two limitations: 1) they only consider textual content, ignoring multimodal elements like images and tables in documents, and 2) they segment long documents into passages, losing overall document context and paragraph interactions.

Method: Proposes a holistic embedding approach using vision-language models to process text, images, and tables into unified representations. Instead of retrieving individual passages, merges segmented passage representations into single document representations, with a reranking strategy to identify relevant passages when needed.
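
A minimal sketch of the merge step, with mean pooling standing in for whatever merge function the paper uses; reranking is left to a later stage.

```python
# Merge segmented passage embeddings into one document representation.
import numpy as np

def merge_passages(passage_embeds: np.ndarray) -> np.ndarray:
    """passage_embeds: (num_passages, d) -> one (d,) document embedding."""
    doc = passage_embeds.mean(axis=0)
    return doc / np.linalg.norm(doc)   # normalize for cosine retrieval
```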

Result: Extensive experiments on diverse IR scenarios with both textual and multimodal queries show the approach substantially outperforms relevant baselines.

Conclusion: The proposed method effectively addresses multimodal document retrieval by leveraging vision-language models and preserving document context through passage merging and reranking, demonstrating significant performance improvements over existing approaches.

Abstract: Information Retrieval (IR) methods aim to identify documents relevant to a query, which have been widely applied in various natural language tasks. However, existing approaches typically consider only the textual content within documents, overlooking the fact that documents can contain multiple modalities, including images and tables. Also, they often segment each long document into multiple discrete passages for embedding, which prevents them from capturing the overall document context and interactions between paragraphs. To address these two challenges, we propose a method that holistically embeds documents interleaved with multiple modalities by leveraging the capability of recent vision-language models that enable the processing and integration of text, images, and tables into a unified format and representation. Moreover, to mitigate the information loss from segmenting documents into passages, instead of representing and retrieving passages individually, we further merge the representations of segmented passages into one single document representation, while we additionally introduce a reranking strategy to decouple and identify the relevant passage within the document if necessary. Then, through extensive experiments on diverse IR scenarios considering both the textual and multimodal queries, we show that our approach substantially outperforms relevant baselines, thanks to the consideration of the multimodal information within documents.

[59] PRACTIQ: A Practical Conversational Text-to-SQL dataset with Ambiguous and Unanswerable Queries

Mingwen Dong, Nischal Ashok Kumar, Yiqun Hu, Anuj Chauhan, Chung-Wei Hang, Shuaichen Chang, Lin Pan, Wuwei Lan, Henghui Zhu, Jiarong Jiang, Patrick Ng, Zhiguo Wang

Main category: cs.CL

TL;DR: PRACTIQ is a conversational text-to-SQL dataset focusing on ambiguous and unanswerable questions, with LLM-based baselines showing current systems struggle with such practical scenarios.

DetailsMotivation: Real user questions in text-to-SQL are often ambiguous or unanswerable, but existing datasets focus only on clear, answerable questions, creating a gap between research and practical applications.

Method: 1) Identified 4 categories each of ambiguous and unanswerable questions from existing datasets; 2) Generated 4-turn conversations with clarification dialogues; 3) For some ambiguous queries, directly generated helpful SQL responses considering multiple aspects; 4) Implemented LLM-based baselines with two-step approach: question category classification and clarification SQL prediction.

Result: State-of-the-art text-to-SQL systems struggle to handle ambiguous and unanswerable questions effectively, highlighting the practical challenge that existing benchmarks don’t address.

Conclusion: PRACTIQ addresses the practical gap in text-to-SQL by focusing on ambiguous/unanswerable questions, revealing limitations of current systems and providing a benchmark for more realistic conversational SQL interfaces.

Abstract: Previous text-to-SQL datasets and systems have primarily focused on user questions with clear intentions that can be answered. However, real user questions can often be ambiguous with multiple interpretations or unanswerable due to a lack of relevant data. In this work, we construct a practical conversational text-to-SQL dataset called PRACTIQ, consisting of ambiguous and unanswerable questions inspired by real-world user questions. We first identified four categories of ambiguous questions and four categories of unanswerable questions by studying existing text-to-SQL datasets. Then, we generate conversations with four turns: the initial user question, an assistant response seeking clarification, the user’s clarification, and the assistant’s clarified SQL response with the natural language explanation of the execution results. For some ambiguous queries, we also directly generate helpful SQL responses, that consider multiple aspects of ambiguity, instead of requesting user clarification. To benchmark the performance on ambiguous, unanswerable, and answerable questions, we implemented large language model (LLM)-based baselines using various LLMs. Our approach involves two steps: question category classification and clarification SQL prediction. Our experiments reveal that state-of-the-art systems struggle to handle ambiguous and unanswerable questions effectively. We will release our code for data generation and experiments on GitHub.

[60] Improving Estonian Text Simplification through Pretrained Language Models and Custom Datasets

Eduard Barbu, Meeri-Ly Muru, Sten Marcus Malva

Main category: cs.CL

TL;DR: This paper presents a neural text simplification method for Estonian using both NMT and fine-tuned LLaMA models, with LLaMA outperforming NMT across key metrics.

DetailsMotivation: Addressing the scarcity of text simplification resources for Estonian, a low-resource language, by developing effective neural approaches.

Method: Created a new Estonian simplification dataset using manual translations and GPT-4.0-generated simplifications, then trained two models: OpenNMT (NMT-based) and fine-tuned LLaMA on this dataset.

Result: LLaMA outperformed OpenNMT in grammaticality, readability, and meaning preservation, demonstrating superior text simplification performance for Estonian.

Conclusion: Large language models are effective for text simplification in low-resource language settings, and the publicly released resources support reproducibility and adaptation to other languages.

Abstract: This paper presents a method for text simplification based on two neural architectures: a neural machine translation (NMT) model and a fine-tuned large language model (LLaMA). Given the scarcity of existing resources for Estonian, a new dataset was created by combining manually translated corpora with GPT-4.0-generated simplifications. OpenNMT was selected as a representative NMT-based system, while LLaMA was fine-tuned on the constructed dataset. Evaluation shows LLaMA outperforms OpenNMT in grammaticality, readability, and meaning preservation. These results underscore the effectiveness of large language models for text simplification in low-resource language settings. The complete dataset, fine-tuning scripts, and evaluation pipeline are provided in a publicly accessible supplementary package to support reproducibility and adaptation to other languages.

[61] Benchmarking LLMs for Political Science: A United Nations Perspective

Yueqing Liang, Liangwei Yang, Chen Wang, Congying Xia, Rui Meng, Xiongxiao Xu, Haoran Wang, Ali Payani, Kai Shu

Main category: cs.CL

TL;DR: UNBench: First comprehensive benchmark evaluating LLMs on UN decision-making tasks using Security Council records from 1994-2024.

DetailsMotivation: LLMs have advanced NLP but their potential for high-stake political decision-making remains unexplored, especially in UN contexts where decisions have far-reaching consequences.

Method: Created novel dataset of UN Security Council records (1994-2024) and proposed UNBench benchmark with four political science tasks: co-penholder judgment, representative voting simulation, draft adoption prediction, and representative statement generation.

Result: Experimental analysis demonstrates both potential and challenges of applying LLMs to political decision-making, providing insights into their strengths and limitations in political science applications.

Conclusion: This work contributes to AI-political science intersection, opening new research avenues for global governance applications, with publicly available UNBench repository.

Abstract: Large Language Models (LLMs) have achieved significant advances in natural language processing, yet their potential for high-stake political decision-making remains largely unexplored. This paper addresses the gap by focusing on the application of LLMs to the United Nations (UN) decision-making process, where the stakes are particularly high and political decisions can have far-reaching consequences. We introduce a novel dataset comprising publicly available UN Security Council (UNSC) records from 1994 to 2024, including draft resolutions, voting records, and diplomatic speeches. Using this dataset, we propose the United Nations Benchmark (UNBench), the first comprehensive benchmark designed to evaluate LLMs across four interconnected political science tasks: co-penholder judgment, representative voting simulation, draft adoption prediction, and representative statement generation. These tasks span the three stages of the UN decision-making process–drafting, voting, and discussing–and aim to assess LLMs’ ability to understand and simulate political dynamics. Our experimental analysis demonstrates the potential and challenges of applying LLMs in this domain, providing insights into their strengths and limitations in political science. This work contributes to the growing intersection of AI and political science, opening new avenues for research and practical applications in global governance. The UNBench Repository can be accessed at: https://github.com/yueqingliang1/UNBench.

[62] I-MCTS: Enhancing Agentic AutoML via Introspective Monte Carlo Tree Search

Zujie Liang, Feng Wei, Wujiang Xu, Lin Chen, Yuxi Qian, Xinhui Wu

Main category: cs.CL

TL;DR: I-MCTS improves LLM-based AutoML agents by using introspective node expansion and LLM-based value models for better code generation diversity and quality.

DetailsMotivation: Existing LLM-based AutoML agents suffer from low-diversity and suboptimal code generation, and current MCTS approaches have limitations in thought quality/diversity and scalar value feedback mechanisms.

Method: Introduces Introspective Monte Carlo Tree Search (I-MCTS) with iterative node expansion through introspective analysis of parent/sibling solutions, LLM-based value models for node evaluation, and hybrid rewarding mechanism transitioning from LLM-estimated to actual performance scores.
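
Our guess at the shape of the hybrid reward: blend the LLM-estimated score with observed rollout performance, shifting weight toward real scores as a node accumulates visits. The decay schedule here is an assumption, not the paper's formula.

```python
# Hybrid Q-value: LLM estimate early, actual performance once rollouts exist.
from typing import List

def hybrid_q(llm_score: float, perf_scores: List[float],
             alpha0: float = 1.0, decay: float = 0.5) -> float:
    visits = len(perf_scores)
    alpha = alpha0 * decay ** visits            # trust the LLM less over time
    observed = sum(perf_scores) / visits if visits else 0.0
    return alpha * llm_score + (1 - alpha) * observed
```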

Result: Achieves 4% absolute performance improvement over strong open-source AutoML agents across various ML tasks.

Conclusion: I-MCTS effectively enhances agentic AutoML systems by improving decision-making through introspective refinement and better node evaluation mechanisms.

Abstract: Recent advancements in large language models (LLMs) have shown remarkable potential in automating machine learning tasks. However, existing LLM-based agents often struggle with low-diversity and suboptimal code generation. While recent work has introduced Monte Carlo Tree Search (MCTS) to address these issues, limitations persist in the quality and diversity of thoughts generated, as well as in the scalar value feedback mechanisms used for node selection. In this study, we introduce Introspective Monte Carlo Tree Search (I-MCTS), a novel approach that iteratively expands tree nodes through an introspective process that meticulously analyzes solutions and results from parent and sibling nodes. This facilitates a continuous refinement of the node in the search tree, thereby enhancing the overall decision-making process. Furthermore, we integrate a Large Language Model (LLM)-based value model to facilitate direct evaluation of each node’s solution prior to conducting comprehensive computational rollouts. A hybrid rewarding mechanism is implemented to seamlessly transition the Q-value from LLM-estimated scores to actual performance scores. This allows higher-quality nodes to be traversed earlier. Applied to the various ML tasks, our approach demonstrates a 4% absolute improvement in performance compared to the strong open-source AutoML agents, showcasing its effectiveness in enhancing agentic AutoML systems. Resource available at https://github.com/jokieleung/I-MCTS

[63] Evaluating the Effect of Retrieval Augmentation on Social Biases

Tianhui Zhang, Yi Zhou, Danushka Bollegala

Main category: cs.CL

TL;DR: RAG systems amplify social biases from retrieved documents in generated text across multiple languages and bias types, even when the LLM itself has low bias.

DetailsMotivation: While RAG is popular for incorporating novel facts into LLM-based NLG systems, LLMs are known to encode unfair social biases. The impact of RAG on modulating these biases in generated text is not well understood, especially across different languages and social bias types.

Method: Systematically studied RAG components and social biases across three languages (English, Japanese, Chinese) and four bias types (gender, race, age, religion). Used Bias Question Answering (BBQ) benchmark datasets to evaluate biases in RAG responses from document collections with varying stereotypical bias levels, employing multiple LLMs as generators.

Result: Found that biases in document collections are often amplified in generated responses, even when the generating LLM exhibits low-level bias. This amplification occurs consistently across different languages and social bias types.

Conclusion: Raises concerns about using RAG for injecting novel facts into NLG systems without careful bias evaluation. Calls for thorough assessment of potential social biases in RAG applications before real-world deployment to prevent amplification of harmful stereotypes.

Abstract: Retrieval Augmented Generation (RAG) has gained popularity as a method for conveniently incorporating novel facts that were not seen during the pre-training stage in Large Language Model (LLM)-based Natural Language Generation (NLG) systems. However, LLMs are known to encode significant levels of unfair social biases. The modulation of these biases by RAG in NLG systems is not well understood. In this paper, we systematically study the relationship between the different components of a RAG system and the social biases presented in the text generated across three languages (i.e. English, Japanese and Chinese) and four social bias types (i.e. gender, race, age and religion). Specifically, using the Bias Question Answering (BBQ) benchmark datasets, we evaluate the social biases in RAG responses from document collections with varying levels of stereotypical biases, employing multiple LLMs used as generators. We find that the biases in document collections are often amplified in the generated responses, even when the generating LLM exhibits a low-level of bias. Our findings raise concerns about the use of RAG as a technique for injecting novel facts into NLG systems and call for careful evaluation of potential social biases in RAG applications before their real-world deployment.

[64] CASE – Condition-Aware Sentence Embeddings for Conditional Semantic Textual Similarity Measurement

Gaifan Zhang, Yi Zhou, Danushka Bollegala

Main category: cs.CL

TL;DR: CASE is a method for creating context-aware sentence embeddings by using LLM-generated condition embeddings with attention pooling and supervised dimensionality reduction, outperforming existing C-STS methods.

DetailsMotivation: Sentence meaning depends on context, but current sentence embedding methods lack effective ways to modify embeddings based on contextual conditions. There's a need for accurate and efficient methods to create condition-aware sentence embeddings.

Method: CASE uses LLMs to create condition embeddings where sentence influences attention scores during pooling, then applies supervised nonlinear projection to reduce dimensionality of LLM-based embeddings.
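
A single-head sketch of condition-aware pooling as we understand it, with the sentence embedding acting as the query over the condition's token states; the dimensions and scaling are our choices. Per the paper's later finding, subtracting the resulting condition embedding from the sentence embedding is what consistently helps C-STS.

```python
# Attention-pool the condition's token states, with the sentence as the query.
import torch
import torch.nn.functional as F

def condition_embedding(cond_tokens: torch.Tensor,
                        sent_vec: torch.Tensor) -> torch.Tensor:
    """cond_tokens: (T, d) token states; sent_vec: (d,) sentence embedding."""
    scores = cond_tokens @ sent_vec / cond_tokens.size(1) ** 0.5   # (T,)
    weights = F.softmax(scores, dim=0)
    return weights @ cond_tokens                                   # (d,)
```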

Result: CASE significantly outperforms previous C-STS methods on standard benchmarks. Subtracting condition embedding improves C-STS performance, and the supervised dimensionality reduction both reduces embedding size and improves performance.

Conclusion: The proposed CASE method effectively creates context-aware sentence embeddings through condition-aware attention pooling and supervised dimensionality reduction, offering improved performance over existing approaches.

Abstract: The meaning conveyed by a sentence often depends on the context in which it appears. Despite the progress of sentence embedding methods, it remains unclear how to best modify a sentence embedding conditioned on its context. To address this problem, we propose Condition-Aware Sentence Embeddings (CASE), an efficient and accurate method to create an embedding for a sentence under a given condition. First, CASE creates an embedding for the condition using a Large Language Model (LLM), where the sentence influences the attention scores computed for the tokens in the condition during pooling. Next, a supervised nonlinear projection is learned to reduce the dimensionality of the LLM-based text embeddings. We show that CASE significantly outperforms previously proposed Conditional Semantic Textual Similarity (C-STS) methods on an existing standard benchmark dataset. We find that subtracting the condition embedding consistently improves the C-STS performance of LLM-based text embeddings. Moreover, we propose a supervised dimensionality reduction method that not only reduces the dimensionality of LLM-based embeddings but also significantly improves their performance.

[65] Fluent but Foreign: Even Regional LLMs Lack Cultural Alignment

Dhruv Agarwal, Anya Shukla, Sunayana Sitaram, Aditya Vashistha

Main category: cs.CL

TL;DR: Indic LLMs don’t align better with Indian cultural values than global models; a US respondent is actually a closer proxy for Indian values than any Indic model.

DetailsMotivation: While many countries are building regional/sovereign LLMs, it's unclear whether they actually reflect local cultural values and practices or just speak local languages. The paper investigates this using India as a case study.

Method: Evaluated six Indic and six global LLMs on values and practices using nationally representative surveys and community-sourced QA datasets. Also conducted user study with 115 Indian users to assess writing suggestions. Tested prompting and regional fine-tuning approaches.

Result: Indic models do not align better with Indian norms than global models. A US respondent was closer to Indian values than any Indic model. Both global and Indic LLMs introduce Westernized or exoticized writing. Prompting and regional fine-tuning fail to recover alignment and can degrade existing knowledge.

Conclusion: The problem stems from scarce culturally grounded data, especially for pretraining. Cultural evaluation should be a first-class requirement alongside multilingual benchmarks. Need native, community-authored corpora and comprehensive evaluations to build truly sovereign LLMs.

Abstract: Large language models (LLMs) are used worldwide, yet exhibit Western cultural tendencies. Many countries are now building “regional” or “sovereign” LLMs, but it remains unclear whether they reflect local values and practices or merely speak local languages. Using India as a case study, we evaluate six Indic and six global LLMs on two dimensions – values and practices – grounded in nationally representative surveys and community-sourced QA datasets. Across tasks, Indic models do not align better with Indian norms than global models; in fact, a U.S. respondent is a closer proxy for Indian values than any Indic model. We further run a user study with 115 Indian users and find that writing suggestions from both global and Indic LLMs introduce Westernized or exoticized writing. Prompting and regional fine-tuning fail to recover alignment and can even degrade existing knowledge. We attribute this to scarce culturally grounded data, especially for pretraining. We position cultural evaluation as a first-class requirement alongside multilingual benchmarks and offer a reusable, community-grounded methodology. We call for native, community-authored corpora and comprehensive evaluations to build truly sovereign LLMs.

[66] Beyond Memorization: A Rigorous Evaluation Framework for Medical Knowledge Editing

Shigeng Chen, Linhao Luo, Zhangchi Qiu, Yanan Cao, Carl Yang, Shirui Pan

Main category: cs.CL

TL;DR: MedEditBench evaluates knowledge editing methods for LLMs in medical domain, finds current methods only achieve superficial memorization, proposes SGR-Edit using model-derived rationales for better generalization.

DetailsMotivation: Knowledge editing (KE) is promising for updating LLMs without full retraining, but its effectiveness in complex medical domain remains unexplored. Medical KE is challenging as it requires LLMs to internalize knowledge and generalize to unseen scenarios for interpretable decision-making.

Method: Proposes MedEditBench framework with: 1) new medical knowledge editing benchmark, 2) three different knowledge editing paradigms to assess impact of different knowledge sources, and 3) Self-Generated Rationale Editing (SGR-Edit) that uses model-derived rationales as target knowledge for editing.

Result: Current KE methods result in only superficial memorization of injected information, failing to generalize to new scenarios. SGR-Edit demonstrates significant improvements over existing KE approaches by uncovering underlying reasoning process. Also provides insights into medical knowledge localization in LLMs and impact of sequential editing on evolving knowledge.

Conclusion: MedEditBench provides rigorous evaluation framework for medical knowledge editing, reveals limitations of current methods, and proposes SGR-Edit as effective solution. Offers practical guidance for implementing KE in real-world medical applications.

Abstract: Recently, knowledge editing (KE) has emerged as a promising approach to update specific facts in Large Language Models (LLMs) without the need for full retraining. Despite the effectiveness in general-domain benchmarks, their applicability to complex medical domain remains largely unexplored. Medical knowledge editing is particularly challenging, as it requires LLMs to internalize the knowledge and generalize to unseen scenarios for effective and interpretable decision-making. In this work, we propose a novel framework called MedEditBench to rigorously evaluate the effectiveness of existing KE methods in the medical domain. In MedEditBench, we introduce a new medical knowledge editing benchmark as well as three different knowledge editing paradigms, which are designed to assess the impact of different knowledge sources for editing. Our findings indicate that current KE methods result in only superficial memorization of the injected information, failing to generalize to new scenarios. To overcome this limitation, we present Self-Generated Rationale Editing (SGR-Edit), which utilizes model-derived rationales as the target knowledge for editing, thereby uncovering the underlying reasoning process and demonstrating significant improvements over existing KE approaches. Additionally, we offer deeper insights into medical knowledge editing, including the localization of medical knowledge in LLMs and the impact of sequential editing on evolving knowledge. This could provide practical guidance for implementing KE methods in real-world medical applications.

[67] Identifying Reliable Evaluation Metrics for Scientific Text Revision

Léane Jourdan, Florian Boudin, Richard Dufour, Nicolas Hernandez

Main category: cs.CL

TL;DR: This paper analyzes limitations of traditional metrics (ROUGE, BERTScore) for evaluating text revision in scientific writing and proposes a hybrid approach combining LLM-as-a-judge evaluation with domain-specific metrics for more reliable assessment.

DetailsMotivation: Traditional metrics like ROUGE and BERTScore focus on similarity rather than meaningful improvements in scientific writing revisions, failing to capture quality enhancements that align with human judgments.

Method: 1) Manual annotation study to assess revision quality; 2) Investigation of reference-free evaluation metrics from related NLP domains; 3) Examination of LLM-as-a-judge approaches with and without gold references.

Result: LLMs effectively assess instruction-following but struggle with correctness, while domain-specific metrics provide complementary insights. A hybrid approach combining LLM-as-a-judge evaluation and task-specific metrics offers the most reliable assessment.

Conclusion: The proposed hybrid evaluation framework combining LLM judgment capabilities with domain-specific metrics provides the most reliable method for assessing text revision quality in scientific writing, overcoming limitations of traditional similarity-based metrics.

Abstract: Evaluating text revision in scientific writing remains a challenge, as traditional metrics such as ROUGE and BERTScore primarily focus on similarity rather than capturing meaningful improvements. In this work, we analyse and identify the limitations of these metrics and explore alternative evaluation methods that better align with human judgments. We first conduct a manual annotation study to assess the quality of different revisions. Then, we investigate reference-free evaluation metrics from related NLP domains. Additionally, we examine LLM-as-a-judge approaches, analysing their ability to assess revisions with and without a gold reference. Our results show that LLMs effectively assess instruction-following but struggle with correctness, while domain-specific metrics provide complementary insights. We find that a hybrid approach combining LLM-as-a-judge evaluation and task-specific metrics offers the most reliable assessment of revision quality.
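
The hybrid assessment the authors recommend can be read as a simple score combination: an LLM judge covers instruction-following, task-specific metrics cover correctness aspects the judge misses, and a weighted sum combines them. A minimal sketch follows; the metric names and the equal weighting are assumptions, not the paper's configuration.

```python
def hybrid_revision_score(llm_judge_score: float,
                          metric_scores: dict[str, float],
                          judge_weight: float = 0.5) -> float:
    """Combine an LLM-as-a-judge rating (instruction-following) with
    task-specific metrics (correctness); all scores normalized to [0, 1]."""
    metric_avg = sum(metric_scores.values()) / len(metric_scores)
    return judge_weight * llm_judge_score + (1 - judge_weight) * metric_avg

# hypothetical example: the judge likes the revision, metrics are mixed
print(hybrid_revision_score(0.9, {"grammaticality": 0.8,
                                  "meaning_preservation": 0.6}))
```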

[68] CALE : Concept-Aligned Embeddings for Both Within-Lemma and Inter-Lemma Sense Differentiation

Bastien Liétard, Gabriel Loiseau

Main category: cs.CL

TL;DR: The paper proposes Concept Differentiation, an extension to Word-in-Context that includes inter-word scenarios, and introduces Concept-Aligned Embeddings (CALE) models fine-tuned on this task, achieving state-of-the-art performance on lexical semantic tasks.

DetailsMotivation: Current Word-in-Context approaches only compare occurrences of the same lemma, limiting the range of captured semantic information. There's a need to extend this to include inter-word scenarios to better investigate lexical semantics and semantic relations between different words.

Method: 1) Propose Concept Differentiation task extension to include inter-word scenarios; 2) Create dataset derived from SemCor data; 3) Fine-tune several representation models on this dataset to create Concept-Aligned Embeddings (CALE); 4) Evaluate on various lexical semantic tasks.

Result: CALE models achieve best performances in experiments on lexical semantic tasks. The fine-tuning brings valuable changes to the spatial organization of embeddings, creating more semantically accurate representations.

Conclusion: Concept Differentiation effectively extends Word-in-Context to inter-word scenarios, and CALE models provide efficient multi-purpose representations of lexical meaning that outperform existing approaches.

Abstract: Lexical semantics is concerned with both the multiple senses a word can adopt in different contexts, and the semantic relations that exist between meanings of different words. Contextualized Language Models are a valuable tool for investigating them, providing context-sensitive representations of lexical meaning. Recent works like XL-LEXEME have leveraged the task of Word-in-Context to fine-tune them to get more semantically accurate representations, but Word-in-Context only compares occurrences of the same lemma, limiting the range of captured information. In this paper, we propose an extension, Concept Differentiation, to include inter-word scenarios. We provide a dataset for this task, derived from SemCor data. Then we fine-tune several representation models on this dataset. We call these models Concept-Aligned Embeddings (CALE). By challenging our models and other models on various lexical semantic tasks, we demonstrate that the proposed models provide efficient multi-purpose representations of lexical meaning that reach the best performance in our experiments. We also show that CALE’s fine-tuning brings valuable changes to the spatial organization of embeddings.
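
Concept Differentiation generalizes Word-in-Context from same-lemma comparisons to cross-lemma ones: two occurrences count as positive if they realize the same concept, even under different lemmas. A sketch of how such pairs might be derived from SemCor-style sense annotations (the record schema below is an assumption, not the released dataset format):

```python
from itertools import combinations

# SemCor-style records: a target word in context, tagged with a concept id
records = [
    {"lemma": "bank",   "concept": "financial_institution",
     "sent": "She works at the bank."},
    {"lemma": "lender", "concept": "financial_institution",
     "sent": "The lender approved the loan."},
    {"lemma": "bank",   "concept": "river_edge",
     "sent": "They sat on the river bank."},
]

def concept_pairs(records):
    """Yield (sent_a, sent_b, label): label=1 if the two occurrences
    realize the same concept, even across different lemmas."""
    for a, b in combinations(records, 2):
        yield a["sent"], b["sent"], int(a["concept"] == b["concept"])

for pair in concept_pairs(records):
    print(pair)
```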

[69] VMMU: A Vietnamese Multitask Multimodal Understanding and Reasoning Benchmark

Vy Tuong Dang, An Vo, Emilio Villa-Cueva, Quang Tau, Duc Dm, Thamar Solorio, Daeyoung Kim

Main category: cs.CL

TL;DR: VMMU is a Vietnamese multimodal benchmark with 2.5k questions across 7 tasks requiring genuine multimodal integration, not just OCR. Proprietary VLMs achieve only 66% accuracy, with failures primarily due to multimodal reasoning issues rather than OCR limitations.

DetailsMotivation: To evaluate vision-language models' ability to interpret and reason over visual and textual information in non-English contexts (specifically Vietnamese), addressing the need for benchmarks that require genuine multimodal integration beyond text-only cues or OCR shortcuts.

Method: Created VMMU benchmark with 2.5k multimodal questions across 7 diverse tasks including STEM problem solving, data interpretation, rule-governed visual reasoning, and abstract visual reasoning. All questions require authentic multimodal integration. Evaluated diverse state-of-the-art proprietary and open-source VLMs on this benchmark.

Result: Proprietary models achieved only 66% mean accuracy despite strong Vietnamese OCR performance. Analysis revealed that primary failure source is not OCR but multimodal grounding and reasoning over text and visual evidence. The benchmark exposes significant limitations in current VLMs’ multimodal reasoning capabilities for Vietnamese.

Conclusion: Current VLMs struggle with genuine multimodal reasoning in Vietnamese contexts, with failures primarily in multimodal grounding and reasoning rather than OCR. The VMMU benchmark provides a valuable tool for evaluating and improving multimodal understanding beyond English, highlighting important research directions for cross-lingual multimodal AI.

Abstract: We introduce VMMU, a Vietnamese Multitask Multimodal Understanding and Reasoning Benchmark designed to evaluate how vision-language models (VLMs) interpret and reason over visual and textual information beyond English. VMMU consists of 2.5k multimodal questions across 7 tasks, covering a diverse range of problem contexts, including STEM problem solving, data interpretation, rule-governed visual reasoning, and abstract visual reasoning. All questions require genuine multimodal integration, rather than reliance on text-only cues or OCR-based shortcuts. We evaluate a diverse set of state-of-the-art proprietary and open-source VLMs on VMMU. Despite strong Vietnamese OCR performance, proprietary models achieve only 66% mean accuracy. Further analysis shows that the primary source of failure is not OCR, but instead multimodal grounding and reasoning over text and visual evidence. Code and data are available at https://vmmu-bench.github.io/

[70] GAICo: A Deployed and Extensible Framework for Evaluating Diverse and Multimodal Generative AI Outputs

Nitin Gupta, Pallav Koppisetti, Kausik Lakkaraju, Biplav Srivastava

Main category: cs.CL

TL;DR: GAICo is an open-source Python library that standardizes evaluation of Generative AI outputs across modalities (text, structured data, images, audio) with comprehensive metrics and visualization tools.

DetailsMotivation: Current GenAI evaluation is fragmented with ad-hoc scripts, lacking standardized metrics for specialized structured outputs and multi-modal comparisons, hindering reproducibility and development velocity.

Method: Developed GAICo as a unified, extensible Python framework with high-level API for end-to-end analysis, supporting reference-based metrics for unstructured text, structured data, and multimedia formats.

Result: Successfully deployed library used in multi-modal AI Travel Assistant case study; achieved 16K+ downloads on PyPI within 6 months, demonstrating community adoption and practical utility.

Conclusion: GAICo enables reproducible GenAI evaluation, accelerates development, and builds trust in AI systems by providing standardized comparison tools across diverse output modalities.

Abstract: The rapid proliferation of Generative AI (GenAI) into diverse, high-stakes domains necessitates robust and reproducible evaluation methods. However, practitioners often resort to ad-hoc, non-standardized scripts, as common metrics are unsuitable for specialized, structured outputs (e.g., automated plans, time-series) or holistic comparison across modalities (e.g., text, audio, and image). This fragmentation hinders comparability and slows AI system development. To address this challenge, we present GAICo (Generative AI Comparator): a deployed, open-source Python library that streamlines and standardizes GenAI output comparison. GAICo provides a unified, extensible framework supporting a comprehensive suite of reference-based metrics for unstructured text, specialized structured data formats, and multimedia (images, audio). Its architecture features a high-level API for rapid, end-to-end analysis, from multi-model comparison to visualization and reporting, alongside direct metric access for granular control. We demonstrate GAICo’s utility through a detailed case study evaluating and debugging complex, multi-modal AI Travel Assistant pipelines. GAICo empowers AI researchers and developers to efficiently assess system performance, make evaluation reproducible, improve development velocity, and ultimately build more trustworthy AI systems, aligning with the goal of moving faster and safer in AI deployment. Since its release on PyPI in June 2025, the tool has been downloaded over 16K times across versions as of December 2025, demonstrating growing community interest.


[71] Orthogonal Low-rank Adaptation in Lie Groups for Continual Learning of Large Language Models

Kefan Cao, Shuaicheng Wu

Main category: cs.CL

TL;DR: OLieRA: A Lie group-based fine-tuning framework for continual learning in LLMs that uses multiplicative updates to preserve parameter geometry while enforcing orthogonality across task subspaces.

DetailsMotivation: LLMs suffer from catastrophic forgetting in sequential multi-task learning. Existing parameter regularization methods (like O-LoRA, N-LoRA) use additive updates that distort the intrinsic geometry of model parameters, leading to interference between tasks.

Method: OLieRA uses Lie group-based fine-tuning with multiplicative updates to preserve parameter geometry while enforcing orthogonality across task subspaces. It maintains replay-free and task-ID free inference properties like O-LoRA.

Result: OLieRA achieves state-of-the-art performance on the Standard CL benchmark and remains highly competitive under large task sequences.

Conclusion: OLieRA establishes a principled paradigm for continual learning in LLMs by preserving parameter geometry through multiplicative updates while maintaining orthogonality across tasks.

Abstract: Large language models (LLMs) suffer from catastrophic forgetting in sequential multi-task learning. Existing parameter regularization methods (e.g., O-LoRA, N-LoRA) mitigate interference via low-rank subspace orthogonality, but additive updates distort the intrinsic geometry of model parameters. We propose OLieRA, a Lie group-based fine-tuning framework that preserves parameter geometry through multiplicative updates while enforcing orthogonality across task subspaces. OLieRA achieves state-of-the-art performance on the Standard CL benchmark and remains highly competitive under large task sequences. It further inherits the replay-free and task-ID free inference properties of O-LoRA, establishing a principled paradigm for continual learning in LLMs.
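
The contrast between additive and multiplicative updates can be made concrete in a few lines. The following is a conceptual sketch, not the paper's implementation: the multiplicative update applies a matrix exponential of a low-rank generator, and when that generator is projected into a Lie algebra such as the skew-symmetric matrices, the resulting transform is orthogonal and preserves the norm of the frozen weights.

```python
import numpy as np
from scipy.linalg import expm

rng = np.random.default_rng(0)
d, r = 8, 2
W = rng.normal(size=(d, d))          # frozen pretrained weight

# additive low-rank update (LoRA-style): W + BA displaces W directly
B = rng.normal(size=(d, r)) * 0.1
A = rng.normal(size=(r, d)) * 0.1
W_additive = W + B @ A

# multiplicative update: exp(G) @ W with a low-rank generator G;
# projecting G to the Lie algebra so(d) (skew-symmetric matrices)
# makes exp(G) orthogonal, so the Frobenius norm of W is preserved
G = B @ A
G_skew = (G - G.T) / 2
W_multiplicative = expm(G_skew) @ W

print(np.linalg.norm(W), np.linalg.norm(W_multiplicative))  # norms match
```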

[72] CARE: Cognitive-reasoning Augmented Reinforcement for Emotional Support Conversation

Jie Zhu, Yuanchen Zhou, Shuo Jiang, Junhui Li, Lifan Guo, Feng Chen, Chi Zhang, Fang Kong

Main category: cs.CL

TL;DR: CARE is a novel framework that enhances cognitive reasoning in Emotional Support Conversations without synthetic data, using original training data and reinforcement learning to improve logical coherence and supportive quality.

DetailsMotivation: Current ESC research focuses too much on data augmentation and synthetic corpus construction while neglecting the deeper cognitive reasoning processes essential for effective emotional support. There's a gap in understanding and implementing the reasoning mechanisms that underlie genuinely supportive conversations.

Method: CARE leverages the original ESC training set to guide models in generating logically coherent responses, explicitly enhancing cognitive reasoning. It then employs reinforcement learning to further refine and reinforce the reasoning process, creating a framework that strengthens reasoning without relying on large-scale synthetic data.

Result: Experimental results show that CARE significantly improves both the logical soundness and supportive quality of responses, outperforming existing approaches in generating more empathetic and cognitively robust emotional support.

Conclusion: CARE advances the development of empathetic, cognitively robust, and human-like emotional support systems by focusing on cognitive reasoning enhancement rather than just data expansion, representing a meaningful step toward more effective emotional support AI.

Abstract: Emotional Support Conversation (ESC) plays a vital role in alleviating psychological stress and providing emotional value through dialogue. While recent studies have largely focused on data augmentation and synthetic corpus construction, they often overlook the deeper cognitive reasoning processes that underpin effective emotional support. To address this gap, we propose CARE, a novel framework that strengthens reasoning in ESC without relying on large-scale synthetic data. CARE leverages the original ESC training set to guide models in generating logically coherent and supportive responses, thereby explicitly enhancing cognitive reasoning. Building on this foundation, we further employ reinforcement learning to refine and reinforce the reasoning process. Experimental results demonstrate that CARE significantly improves both the logical soundness and supportive quality of responses, advancing the development of empathetic, cognitively robust, and human-like emotional support systems.

[73] Exploring Generative Process Reward Modeling for Semi-Structured Data: A Case Study of Table Question Answering

Lei Tang, Wei Zhou, Mohsen Mesgar

Main category: cs.CL

TL;DR: PRMs show promise for TQA but struggle with out-of-domain generalization and weak correlation between step verification and final answer accuracy.

DetailsMotivation: While PRMs have proven effective for complex reasoning tasks like mathematics, their application to semi-structured data tasks like table question answering (TQA) remains unexplored despite TQA's unique challenges including irrelevant information, loose step connections, and domain-specific reasoning.

Method: Conducted first systematic study of PRMs for TQA, evaluating state-of-the-art generative PRMs from both answer and step perspectives, combining textual and code verification approaches.

Result: PRMs with combined textual and code verification can aid solution selection but struggle with out-of-domain generalization. Analysis reveals weak correlation between step-level verification performance and answer accuracy, likely due to weak step dependencies and loose causal links.

Conclusion: Current PRMs have limitations for TQA tasks, highlighting the need for more robust, process-aware verifiers that can better handle the unique challenges of semi-structured data reasoning.

Abstract: Process reward models (PRMs) enhance complex reasoning in large language models (LLMs) by evaluating candidate solutions step-by-step and selecting answers based on aggregated step scores. While effective in domains such as mathematics, their applicability to tasks involving semi-structured data, like table question answering (TQA), remains unexplored. TQA poses unique challenges for PRMs, including abundant irrelevant information, loosely connected reasoning steps, and domain-specific reasoning. This work presents the first systematic study of PRMs for TQA. We evaluate state-of-the-art generative PRMs on TQA from both answer and step perspectives. Results show that PRMs that combine textual and code verification can aid solution selection but struggle to generalize to out-of-domain data. Analysis reveals a weak correlation between performance in step-level verification and answer accuracy, possibly stemming from weak step dependencies and loose causal links. Our findings highlight limitations of current PRMs on TQA and offer valuable insights for building more robust, process-aware verifiers.
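
The selection mechanism PRMs implement (score every step of every candidate, aggregate the step scores, pick the best candidate) fits in a few lines. A sketch with made-up step scores and three common aggregators; which aggregator the paper uses is not stated in this summary.

```python
import math

def select_answer(candidates: dict[str, list[float]], agg: str = "min") -> str:
    """candidates maps an answer string to its per-step PRM scores.
    Aggregate the step scores and return the highest-scoring answer."""
    aggregators = {
        "min":  min,                            # weakest-link aggregation
        "mean": lambda s: sum(s) / len(s),
        "prod": math.prod,                      # every step must hold up
    }
    return max(candidates, key=lambda a: aggregators[agg](candidates[a]))

cands = {"42": [0.90, 0.80, 0.95], "41": [0.99, 0.40, 0.97]}
print(select_answer(cands, "min"))  # -> "42": its weakest step is stronger
```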

[74] Efficient semantic uncertainty quantification in language models via diversity-steered sampling

Ji Won Park, Kyunghyun Cho

Main category: cs.CL

TL;DR: A diversity-steered sampling method for LLMs that reduces semantic redundancy in QA outputs, improving uncertainty estimation efficiency without requiring gradient access to the base model.

DetailsMotivation: Estimating semantic uncertainties in LLMs for free-form QA is challenging and expensive, requiring many generations to obtain stable estimates. Current methods are inefficient due to semantically redundant outputs.

Method: Introduces a diversity-steered sampler that injects semantic-similarity penalties using a lightly finetuned NLI model during decoding. Covers both autoregressive and masked diffusion paradigms. Uses importance reweighting to debias uncertainty estimates and control variates to reduce variance.

Result: Across four QA benchmarks, the method matches or surpasses baselines while covering more semantic clusters with the same number of samples. Provides substantial sample-efficiency gains.

Conclusion: The modular framework serves as a drop-in enhancement for uncertainty estimation in risk-sensitive LLM deployments, requiring no gradient access to the base model and improving semantic coverage efficiency.

Abstract: Accurately estimating semantic aleatoric and epistemic uncertainties in large language models (LLMs) is particularly challenging in free-form question answering (QA), where obtaining stable estimates often requires many expensive generations. We introduce a diversity-steered sampler that discourages semantically redundant outputs during decoding, covers both autoregressive and masked diffusion paradigms, and yields substantial sample-efficiency gains. The key idea is to inject a continuous semantic-similarity penalty into the model’s proposal distribution using a natural language inference (NLI) model lightly finetuned on partial prefixes or intermediate diffusion states. We debias downstream uncertainty estimates with importance reweighting and shrink their variance with control variates. Across four QA benchmarks, our method matches or surpasses baselines while covering more semantic clusters with the same number of samples. Being modular and requiring no gradient access to the base LLM, the framework promises to serve as a drop-in enhancement for uncertainty estimation in risk-sensitive model deployments.
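
Two pieces of the sampler are easy to sketch: penalizing each candidate's log-probability by its semantic similarity to samples already drawn, and importance-reweighting the biased draws so downstream uncertainty estimates stay unbiased. Below is a toy version over a discrete answer space, with a string-match stand-in where the paper uses a finetuned NLI model; the penalty form q(x) ∝ p(x)·exp(−λ·sim) is an illustrative assumption.

```python
import math, random

answers = ["Paris", "paris, France", "Lyon"]
logp = {"Paris": -0.3, "paris, France": -1.5, "Lyon": -2.5}  # base model

def similarity(a: str, b: str) -> float:
    # stand-in for the NLI similarity model
    return 1.0 if a.lower().split(",")[0] == b.lower().split(",")[0] else 0.0

def steered_sample(drawn: list[str], lam: float = 2.0) -> tuple[str, float]:
    """Sample from q(x) proportional to p(x)*exp(-lam * max-sim to prior
    draws); return the draw and its importance weight p(x)/q(x)."""
    scores = {a: logp[a] - lam * max((similarity(a, d) for d in drawn),
                                     default=0.0)
              for a in answers}
    z = sum(math.exp(s) for s in scores.values())
    q = {a: math.exp(s) / z for a, s in scores.items()}
    x = random.choices(answers, weights=[q[a] for a in answers])[0]
    p = math.exp(logp[x]) / sum(math.exp(v) for v in logp.values())
    return x, p / q[x]

random.seed(0)
draws, weights = [], []
for _ in range(3):
    x, w = steered_sample(draws)
    draws.append(x); weights.append(w)
print(draws, [round(w, 2) for w in weights])
```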

[75] CRADLE Bench: A Clinician-Annotated Benchmark for Multi-Faceted Mental Health Crisis and Safety Risk Detection

Grace Byun, Rebecca Lipschutz, Sean T. Minton, Abigail Lott, Jinho D. Choi

Main category: cs.CL

TL;DR: CRADLE BENCH is a new benchmark for detecting 7 types of mental health crises in language model interactions, featuring clinician annotations, temporal labels, and ensemble-based training data.

DetailsMotivation: Current language models lack reliable detection of critical mental health crisis situations (suicide ideation, rape, domestic violence, etc.) during user interactions, which can have serious consequences if missed.

Method: Created CRADLE BENCH with 7 crisis types aligned with clinical standards, including 600 clinician-annotated evaluation examples, 420 development examples, and ~4K training examples automatically labeled using majority-vote ensemble of multiple language models. Fine-tuned six crisis detection models on subsets defined by consensus and unanimous ensemble agreement.

Result: The benchmark is the first to incorporate temporal labels and covers comprehensive crisis types. Ensemble-based annotation significantly outperforms single-model annotation, and complementary models are provided trained under different agreement criteria.

Conclusion: CRADLE BENCH addresses a critical gap in mental health crisis detection for language models, providing a comprehensive benchmark with clinical alignment, temporal awareness, and improved annotation methods through ensemble approaches.

Abstract: Detecting mental health crisis situations such as suicide ideation, rape, domestic violence, child abuse, and sexual harassment is a critical yet underexplored challenge for language models. When such situations arise during user–model interactions, models must reliably flag them, as failure to do so can have serious consequences. In this work, we introduce CRADLE BENCH, a benchmark for multi-faceted crisis detection. Unlike previous efforts that focus on a limited set of crisis types, our benchmark covers seven types defined in line with clinical standards and is the first to incorporate temporal labels. Our benchmark provides 600 clinician-annotated evaluation examples and 420 development examples, together with a training corpus of around 4K examples automatically labeled using a majority-vote ensemble of multiple language models, which significantly outperforms single-model annotation. We further fine-tune six crisis detection models on subsets defined by consensus and unanimous ensemble agreement, providing complementary models trained under different agreement criteria.
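
The ensemble-labeling step (keep an automatically labeled example only when enough annotator models agree) is straightforward, and the consensus versus unanimous training subsets differ only in the agreement threshold. A minimal sketch with assumed label names:

```python
from collections import Counter

def ensemble_label(model_votes: list[str], min_agree: int) -> str | None:
    """Return the majority label if at least `min_agree` models agree,
    else None (the example is discarded from that training subset)."""
    label, count = Counter(model_votes).most_common(1)[0]
    return label if count >= min_agree else None

votes = ["suicide_ideation", "suicide_ideation", "no_crisis"]
print(ensemble_label(votes, min_agree=2))  # consensus subset: kept
print(ensemble_label(votes, min_agree=3))  # unanimous subset: dropped
```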

[76] Speech-Aware Long Context Pruning and Integration for Contextualized Automatic Speech Recognition

Yiming Rong, Yixin Zhang, Ziyi Wang, Deyang Jiang, Yunlong Zhao, Haoran Wu, Shiyu Zhou, Bo Xu

Main category: cs.CL

TL;DR: SAP² method uses two-stage dynamic pruning and integration of relevant contextual keywords with Speech-Driven Attention-based Pooling to improve ASR performance in long-context scenarios like conference presentations.

DetailsMotivation: ASR systems struggle with long-context scenarios requiring domain-specific knowledge (e.g., conference presentations) due to constrained model context windows and sparse relevant information within extensive contextual noise.

Method: Proposes SAP² framework with two-stage dynamic pruning and integration of relevant contextual keywords using Speech-Driven Attention-based Pooling mechanism for efficient compression of context embeddings while preserving speech-salient information.

Result: Achieves state-of-the-art performance on SlideSpeech (7.71% WER) and LibriSpeech (1.12% WER). On SlideSpeech, reduces biased keyword error rates by 41.1% compared to non-contextual baselines. Shows robust scalability under extensive contextual input conditions.

Conclusion: SAP² effectively addresses long-context ASR challenges by dynamically selecting and integrating relevant contextual information, demonstrating significant performance improvements and scalability for contextualized speech recognition tasks.

Abstract: Automatic speech recognition (ASR) systems have achieved remarkable performance in common conditions but often struggle to leverage long-context information in contextualized scenarios that require domain-specific knowledge, such as conference presentations. This challenge arises primarily due to constrained model context windows and the sparsity of relevant information within extensive contextual noise. To solve this, we propose the SAP² method, a novel framework that dynamically prunes and integrates relevant contextual keywords in two stages. Specifically, each stage leverages our proposed Speech-Driven Attention-based Pooling mechanism, enabling efficient compression of context embeddings while preserving speech-salient information. Experimental results demonstrate state-of-the-art performance of SAP² on the SlideSpeech and LibriSpeech datasets, achieving word error rates (WER) of 7.71% and 1.12%, respectively. On SlideSpeech, our method notably reduces biased keyword error rates (B-WER) by 41.1% compared to non-contextual baselines. SAP² also exhibits robust scalability, consistently maintaining performance under extensive contextual input conditions on both datasets.
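
The compression step, Speech-Driven Attention-based Pooling, amounts to letting a query derived from the speech encoder attend over context keyword embeddings and keeping only the pooled result. A NumPy sketch under a single-query simplification; the dimensions and exact attention form are assumptions.

```python
import numpy as np

def speech_driven_pool(speech_q: np.ndarray, ctx: np.ndarray) -> np.ndarray:
    """speech_q: (d,) query from the speech encoder.
    ctx: (n_keywords, d) context keyword embeddings.
    Returns one (d,) vector: context compressed toward speech-salient parts."""
    scores = ctx @ speech_q / np.sqrt(ctx.shape[1])   # (n_keywords,)
    attn = np.exp(scores - scores.max())
    attn /= attn.sum()                                # softmax weights
    return attn @ ctx                                 # weighted pooling

rng = np.random.default_rng(0)
pooled = speech_driven_pool(rng.normal(size=16), rng.normal(size=(100, 16)))
print(pooled.shape)  # (16,) regardless of how many keywords came in
```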

[77] A Machine Learning Approach for Detection of Mental Health Conditions and Cyberbullying from Social Media

Edward Ajayi, Martha Kachweka, Mawuli Deku, Emily Aiken

Main category: cs.CL

TL;DR: A unified multiclass classification framework for detecting 10 mental health and cyberbullying categories from social media, using domain-adapted MentalBERT with 0.92 accuracy and 0.76 Macro F1, plus an explainable dashboard for human-in-the-loop moderation.

DetailsMotivation: Growing mental health challenges and cyberbullying in digital spaces require scalable, interpretable detection systems to support online safety and computational mental health interventions.

Method: Curated datasets from Twitter/Reddit with “split-then-balance” pipeline, comprehensive evaluation of lexical models, hybrid approaches, and fine-tuned transformers, plus SHAP-LLM explainability framework and prototype dashboard.

Result: Domain-adapted MentalBERT achieved best performance (0.92 accuracy, 0.76 Macro F1), outperforming generic models and zero-shot LLM baseline, with end-to-end fine-tuning proving critical for success.

Conclusion: The system serves as human-in-the-loop screening aid (not diagnostic), providing robust baseline for future multi-label, clinically-validated datasets at the intersection of online safety and computational mental health.

Abstract: Mental health challenges and cyberbullying are increasingly prevalent in digital spaces, necessitating scalable and interpretable detection systems. This paper introduces a unified multiclass classification framework for detecting ten distinct mental health and cyberbullying categories from social media data. We curate datasets from Twitter and Reddit, implementing a rigorous “split-then-balance” pipeline to train on balanced data while evaluating on a realistic, held-out imbalanced test set. We conducted a comprehensive evaluation comparing traditional lexical models, hybrid approaches, and several end-to-end fine-tuned transformers. Our results demonstrate that end-to-end fine-tuning is critical for performance, with the domain-adapted MentalBERT emerging as the top model, achieving an accuracy of 0.92 and a Macro F1 score of 0.76, surpassing both its generic counterpart and a zero-shot LLM baseline. Grounded in a comprehensive ethical analysis, we frame the system as a human-in-the-loop screening aid, not a diagnostic tool. To support this, we introduce a hybrid SHAP-LLM explainability framework and present a prototype dashboard (“Social Media Screener”) designed to integrate model predictions and their explanations into a practical workflow for moderators. Our work provides a robust baseline, highlighting future needs for multi-label, clinically-validated datasets at the critical intersection of online safety and computational mental health.
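
The “split-then-balance” pipeline is worth spelling out because the order matters: split first, so the test set keeps the real-world class imbalance, then balance only the training split. A sketch using random oversampling (the paper's exact balancing method is not specified in this summary):

```python
import random
from collections import Counter

def split_then_balance(data, test_frac=0.2, seed=0):
    """data: list of (text, label). Split first, then oversample minority
    classes in the TRAIN split only; the test split stays imbalanced."""
    rng = random.Random(seed)
    data = data[:]
    rng.shuffle(data)
    cut = int(len(data) * test_frac)
    test, train = data[:cut], data[cut:]

    by_label: dict[str, list] = {}
    for ex in train:
        by_label.setdefault(ex[1], []).append(ex)
    target = max(len(v) for v in by_label.values())
    balanced = [ex for exs in by_label.values()
                for ex in exs + rng.choices(exs, k=target - len(exs))]
    return balanced, test

data = [("t", "depression")] * 8 + [("t", "bullying")] * 2
train, test = split_then_balance(data)
print(Counter(l for _, l in train), Counter(l for _, l in test))
```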

[78] The Role of Mixed-Language Documents for Multilingual Large Language Model Pretraining

Jiandong Shao, Raphael Tang, Crystina Zhang, Karin Sevegnani, Pontus Stenetorp, Jianfei Yang, Yao Lu

Main category: cs.CL

TL;DR: Removing just 2% bilingual data from pretraining corpus causes 56% translation performance drop, but cross-lingual QA/reasoning remain stable. Parallel data (14% of bilingual) restores translation, while code-switching (72%) contributes minimally.

DetailsMotivation: To understand how bilingual data in pretraining corpora enables cross-lingual abilities in multilingual LLMs, specifically which types of bilingual data contribute to different cross-lingual capabilities.

Method: Pretrained models from scratch under controlled conditions: compared standard web corpus vs monolingual-only version (removing all multilingual documents). Categorized bilingual data into parallel (14%), code-switching (72%), and miscellaneous (14%). Conducted granular ablations by reintroducing parallel or code-switching data into monolingual-only corpus.

Result: Removing 2% bilingual data caused 56% BLEU drop in translation, but cross-lingual QA and reasoning tasks remained stable. Parallel data restored 91% of translation performance, while code-switching contributed minimally. Other cross-lingual tasks unaffected by either type.

Conclusion: Translation critically depends on systematic token-level alignments from parallel data, while cross-lingual understanding and reasoning can be achieved without bilingual data. Different cross-lingual capabilities have distinct data requirements.

Abstract: Multilingual large language models achieve impressive cross-lingual performance despite largely monolingual pretraining. While bilingual data in pretraining corpora is widely believed to enable these abilities, details of its contributions remain unclear. We investigate this question by pretraining models from scratch under controlled conditions, comparing the standard web corpus with a monolingual-only version that removes all multilingual documents. Despite constituting only 2% of the corpus, removing bilingual data causes translation performance to drop 56% in BLEU, while behaviour on cross-lingual QA and general reasoning tasks remains stable, with training curves largely overlapping the baseline. To understand this asymmetry, we categorize bilingual data into parallel (14%), code-switching (72%), and miscellaneous documents (14%) based on the semantic relevance of content in different languages. We then conduct granular ablations by reintroducing parallel or code-switching data into the monolingual-only corpus. Our experiments reveal that parallel data almost fully restores translation performance (91% of the unfiltered baseline), whereas code-switching contributes minimally. Other cross-lingual tasks remain largely unaffected by either type. These findings reveal that translation critically depends on systematic token-level alignments from parallel data, whereas cross-lingual understanding and reasoning appear to be achievable even without bilingual data.

[79] Intention Collapse: Intention-Level Metrics for Reasoning in Language Models

Patricio Vera

Main category: cs.CL

TL;DR: The paper studies intention collapse in language generation through three model-agnostic metrics on pre-collapse states, examining how chain-of-thought reasoning affects internal uncertainty across different models and benchmarks.

DetailsMotivation: To understand the many-to-one mapping problem in language generation where rich internal states collapse to single token sequences, and to develop metrics to analyze this intention collapse phenomenon.

Method: Introduces three cheap, model-agnostic metrics computed on pre-collapse states: intention entropy, effective dimensionality, and recoverability (probe AUROC for predicting eventual success). Evaluates these metrics in a 3x3 study across three models (Mistral-7B, LLaMA-3.1-8B, Qwen-2.5-7B) and three benchmarks (GSM8K, ARC-Challenge, AQUA-RAT), comparing baseline, chain-of-thought, and babble control conditions.

Result: CoT increased average accuracy from 34.2% to 47.3% (+13.1 pp), with large gains on GSM8K but consistent degradations on ARC-Challenge. Different models showed distinct entropy regimes: Mistral had lower-entropy CoT while LLaMA had higher-entropy CoT. Probe AUROC was significantly above chance in some settings and could dissociate from behavioral accuracy.

Conclusion: The study reveals heterogeneity in how chain-of-thought reasoning affects internal uncertainty across models, and shows that informative internal signals are not always reliably converted into final decisions under constrained response formats.

Abstract: Language generation maps a rich, high-dimensional internal state to a single token sequence. We study this many-to-one mapping through the lens of intention collapse: the projection from an internal intention space $I$ to an external language space $L$. We introduce three cheap, model-agnostic metrics computed on a pre-collapse state $I$: (i) intention entropy $H_{\mathrm{int}}(I)$, (ii) effective dimensionality $d_{\mathrm{eff}}(I)$, and (iii) recoverability $\mathrm{Recov}(I)$, operationalized as probe AUROC for predicting eventual success. We evaluate these metrics in a $3\times 3$ study across models (Mistral-7B, LLaMA-3.1-8B, Qwen-2.5-7B) and benchmarks (GSM8K, ARC-Challenge, AQUA-RAT), comparing baseline, chain-of-thought (CoT), and a babble control ($n=200$ items per cell). CoT increases average accuracy from 34.2% to 47.3% (+13.1 pp), driven by large gains on GSM8K but consistent degradations on ARC-Challenge. Across models, CoT induces distinct entropy regimes relative to baseline, $\Delta H = H_{\mathrm{int}}(\mathrm{CoT}) - H_{\mathrm{int}}(\mathrm{Base})$: Mistral shows $\Delta H < 0$ (lower-entropy CoT), whereas LLaMA shows $\Delta H > 0$ (higher-entropy CoT), highlighting heterogeneity in CoT-induced internal uncertainty. Finally, probe AUROC is significantly above chance in a subset of settings and can dissociate from behavioral accuracy (e.g., high AUROC alongside lower CoT accuracy on ARC-Challenge for Qwen), suggesting that informative internal signal is not always reliably converted into a final discrete decision under constrained response formats.
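
Two of the metrics have natural closed forms. Below is a sketch of one plausible operationalization; the participation-ratio form of effective dimensionality is an assumption on our part, and the paper may use a different estimator.

```python
import numpy as np

def intention_entropy(probs: np.ndarray) -> float:
    """Shannon entropy over a pre-collapse answer distribution."""
    p = probs[probs > 0]
    return float(-(p * np.log(p)).sum())

def effective_dim(states: np.ndarray) -> float:
    """Participation ratio of the covariance spectrum of hidden states
    with shape (n_samples, d): (sum lambda)^2 / sum lambda^2."""
    lam = np.linalg.eigvalsh(np.cov(states, rowvar=False))
    lam = np.clip(lam, 0, None)
    return float(lam.sum() ** 2 / (lam ** 2).sum())

rng = np.random.default_rng(0)
print(intention_entropy(np.array([0.7, 0.2, 0.1])))
print(effective_dim(rng.normal(size=(200, 32))))  # close to 32 for isotropic noise
```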

[80] StealthGraph: Exposing Domain-Specific Risks in LLMs through Knowledge-Graph-Guided Harmful Prompt Generation

Huawei Zheng, Xinqi Jiang, Sen Yang, Shouling Ji, Yingcai Wu, Dazhen Deng

Main category: cs.CL

TL;DR: A framework for generating implicit harmful prompts in specialized domains using knowledge graphs and dual-path obfuscation to create realistic safety testing datasets for LLMs.

DetailsMotivation: LLMs in specialized domains (finance, healthcare) face unique safety risks, but existing datasets focus on explicit harmful prompts that modern defenses can detect. Real-world threats are often implicit and domain-specific, requiring better datasets for realistic red-teaming.

Method: End-to-end framework with two main components: 1) Knowledge-graph-guided harmful prompt generation to systematically produce domain-relevant prompts, and 2) Dual-path obfuscation rewriting that converts explicit harmful prompts into implicit variants through direct and context-enhanced rewriting.

Result: The framework produces high-quality datasets combining strong domain relevance with implicitness, enabling more realistic red-teaming and advancing LLM safety research. Code and datasets are released on GitHub.

Conclusion: The proposed approach addresses the scarcity of domain-specific implicit harmful prompt datasets, providing a systematic method to generate realistic safety testing materials that better reflect real-world threats to LLMs in specialized domains.

Abstract: Large language models (LLMs) are increasingly applied in specialized domains such as finance and healthcare, where they introduce unique safety risks. Domain-specific datasets of harmful prompts remain scarce and still largely rely on manual construction; public datasets mainly focus on explicit harmful prompts, which modern LLM defenses can often detect and refuse. In contrast, implicit harmful prompts, expressed through indirect domain knowledge, are harder to detect and better reflect real-world threats. We identify two challenges: transforming domain knowledge into actionable constraints and increasing the implicitness of generated harmful prompts. To address them, we propose an end-to-end framework that first performs knowledge-graph-guided harmful prompt generation to systematically produce domain-relevant prompts, and then applies dual-path obfuscation rewriting to convert explicit harmful prompts into implicit variants via direct and context-enhanced rewriting. This framework yields high-quality datasets combining strong domain relevance with implicitness, enabling more realistic red-teaming and advancing LLM safety research. We release our code and datasets on GitHub.

[81] LLMs Got Rhythm? Hybrid Phonological Filtering for Greek Poetry Rhyme Detection and Generation

Stergios Chatzikyriakidis, Anastasia Natsina

Main category: cs.CL

TL;DR: LLMs struggle with phonological tasks like rhyme in low-resource languages like Greek. A hybrid system combining LLMs with phonological algorithms achieves accurate rhyme identification/generation, outperforming pure LLM approaches.

DetailsMotivation: LLMs have remarkable NLP capabilities but struggle with phonologically-grounded phenomena like rhyme detection/generation, especially in lower-resource languages like Modern Greek. This gap needs addressing for comprehensive language understanding.

Method: Hybrid system combining LLMs with deterministic phonological algorithms. Implements comprehensive Greek rhyme taxonomy (Pure, Rich, Imperfect, Mosaic, IDV patterns). Uses agentic generation pipeline with phonological verification. Evaluates multiple prompting strategies (zero-shot, few-shot, Chain-of-Thought, RAG-augmented) across various LLMs.

Result: Significant “Reasoning Gap”: native-like models (Claude 3.7) achieve 40% accuracy, while reasoning-heavy models (Claude 4.5) reach 54% with Chain-of-Thought. Pure LLM generation fails catastrophically (<4% valid poems), but hybrid verification restores performance to 73.1%.

Conclusion: Hybrid LLM-phonological systems are essential for accurate rhyme tasks in low-resource languages. The approach successfully bridges the reasoning gap and enables reliable rhyme generation. System and corpus of 40,000+ rhymes released for future research.

Abstract: Large Language Models (LLMs), despite their remarkable capabilities across NLP tasks, struggle with phonologically-grounded phenomena like rhyme detection and generation. This is even more evident in lower-resource languages such as Modern Greek. In this paper, we present a hybrid system that combines LLMs with deterministic phonological algorithms to achieve accurate rhyme identification/analysis and generation. Our approach implements a comprehensive taxonomy of Greek rhyme types, including Pure, Rich, Imperfect, Mosaic, and Identical Pre-rhyme Vowel (IDV) patterns, and employs an agentic generation pipeline with phonological verification. We evaluate multiple prompting strategies (zero-shot, few-shot, Chain-of-Thought, and RAG-augmented) across several LLMs including Claude 3.7 and 4.5, GPT-4o, Gemini 2.0 and open-weight models like Llama 3.1 8B and 70B and Mistral Large. Results reveal a significant “Reasoning Gap”: while native-like models (Claude 3.7) perform intuitively (40% accuracy in identification), reasoning-heavy models (Claude 4.5) achieve state-of-the-art performance (54%) only when prompted with Chain-of-Thought. Most critically, pure LLM generation fails catastrophically (under 4% valid poems), while our hybrid verification loop restores performance to 73.1%. We release our system and a corpus of 40,000+ rhymes, derived from the Anemoskala and Interwar Poetry corpora, to support future research.
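
The hybrid loop alternates generation and deterministic verification: the LLM proposes a line, a phonological checker accepts or rejects it, and rejections trigger regeneration. Here is a toy sketch on transliterated text; the real system computes full Greek phonology and distinguishes the rhyme types in the taxonomy above.

```python
def rhyme_key(word: str) -> str:
    """Toy phonological key: the word from roughly its penultimate vowel
    onward. The real checker computes Greek stress and phoneme structure."""
    vowels = "aeiou"
    idx = [i for i, ch in enumerate(word.lower()) if ch in vowels]
    return word.lower()[idx[-2]:] if len(idx) >= 2 else word.lower()

def generate_rhyming_line(target: str, propose, max_tries: int = 5):
    """propose: callable returning a candidate line (stand-in for the LLM).
    Loop until the candidate's final word rhymes with `target`."""
    for _ in range(max_tries):
        line = propose()
        if rhyme_key(line.split()[-1]) == rhyme_key(target):
            return line          # verified: keep this line
    return None                  # all candidates failed verification

cands = iter(["the night was cold", "I sing of the sea", "my heart is free"])
print(generate_rhyming_line("tree", lambda: next(cands)))
```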

[82] Massively Multilingual Joint Segmentation and Glossing

Michael Ginn, Lindia Tjuatja, Enora Rice, Ali Marashian, Maria Valentini, Jasmine Xu, Graham Neubig, Alexis Palmer

Main category: cs.CL

TL;DR: PolyGloss is a new multilingual model that jointly predicts morphological segmentation and interlinear glosses, outperforming previous models and being adaptable to new datasets.

DetailsMotivation: Existing gloss prediction models like GlossLM generate morpheme-level glosses but don't predict actual morpheme boundaries, making predictions less interpretable and untrustworthy for linguists in real-world language documentation scenarios.

Method: Developed PolyGloss, a family of seq2seq multilingual models trained for joint segmentation and glossing. Extended GlossLM’s training corpus, experimented with optimal training approaches balancing segmentation and glossing accuracy, and demonstrated adaptability via low-rank adaptation.

Result: PolyGloss outperforms GlossLM on glossing and beats various open-source LLMs on segmentation, glossing, and alignment tasks. The model can be quickly adapted to new datasets through low-rank adaptation.

Conclusion: Joint prediction of morphological segmentation and interlinear glosses addresses critical barriers in real-world language documentation, making automated glossing more interpretable and trustworthy for linguists.

Abstract: Automated interlinear gloss prediction with neural networks is a promising approach to accelerate language documentation efforts. However, while state-of-the-art models like GlossLM achieve high scores on glossing benchmarks, user studies with linguists have found critical barriers to the usefulness of such models in real-world scenarios. In particular, existing models typically generate morpheme-level glosses but assign them to whole words without predicting the actual morpheme boundaries, making the predictions less interpretable and thus untrustworthy to human annotators. We conduct the first study on neural models that jointly predict interlinear glosses and the corresponding morphological segmentation from raw text. We run experiments to determine the optimal way to train models that balance segmentation and glossing accuracy, as well as the alignment between the two tasks. We extend the training corpus of GlossLM and pretrain PolyGloss, a family of seq2seq multilingual models for joint segmentation and glossing that outperforms GlossLM on glossing and beats various open-source LLMs on segmentation, glossing, and alignment. In addition, we demonstrate that PolyGloss can be quickly adapted to a new dataset via low-rank adaptation.
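
Joint prediction implies the model emits segmentation and glosses as aligned sequences, which also makes the alignment between the two tasks checkable. A sketch of a parser for one natural target format (the tab-separated, hyphen-delimited format here is an assumption, not the released one):

```python
def parse_igt(output: str):
    """Parse a tab-separated joint prediction (segmentation TAB glosses)
    into aligned (morpheme, gloss) pairs, flagging per-word mismatches
    between morpheme boundaries and gloss boundaries."""
    seg_line, gloss_line = output.split("\t")
    pairs, errors = [], []
    for word, gloss in zip(seg_line.split(), gloss_line.split()):
        morphs, glosses = word.split("-"), gloss.split("-")
        if len(morphs) != len(glosses):
            errors.append(word)  # boundary count disagrees with glosses
        pairs.extend(zip(morphs, glosses))
    return pairs, errors

print(parse_igt("the dog-s walk-ed\tDET dog-PL walk-PST"))
```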

[83] A Two-Stage GPU Kernel Tuner Combining Semantic Refactoring and Search-Based Optimization

Qiuyi Qu, Yicheng Sui, Yufei Sun, Rui Chen, Xiaofei Zhang, Yuzhi Zhang, Haofeng Wang, Ge Lan

Main category: cs.CL

TL;DR: The paper introduces a template-based rewriting layer for GPU kernel optimization that combines LLM agents with systematic parameter search, achieving more stable and higher-quality speedups than direct code rewriting approaches.

DetailsMotivation: GPU code optimization is critical for HPC and AI workloads, but current approaches (compiler optimizations, hand-written kernels, LLM-based direct rewriting) have limitations: they often rely on implicit parameters, require human intervention, and produce unstable performance gains.

Method: A two-stage approach: 1) Semantically refactor kernels into explicitly parameterizable templates using LLM agents, 2) Optimize template parameters via search-based autotuning with profiling feedback, constrained by hardware resource limits. The agentic tuner iteratively performs templating, testing, analysis, and planning.

Result: Experiments on real-world CUDA kernels from SGLang demonstrate speedups exceeding 3x in best cases. The template-plus-search design significantly reduces randomness compared to agent-only direct rewriting, making optimization more interpretable and systematic.

Conclusion: The proposed template-based rewriting with systematic parameter search provides more stable and higher-quality GPU kernel optimization, with potential extension to OpenCL, HIP, and other backends for automated performance optimization in production workloads.

Abstract: GPU code optimization is a key performance bottleneck for HPC workloads as well as large-model training and inference. Although compiler optimizations and hand-written kernels can partially alleviate this issue, achieving near-hardware-limit performance still relies heavily on manual code refactoring and parameter tuning. Recent progress in LLM-agent-based kernel generation and optimization has been reported, yet many approaches primarily focus on direct code rewriting, where parameter choices are often implicit and hard to control, or require human intervention, leading to unstable performance gains. This paper introduces a template-based rewriting layer on top of an agent-driven iterative loop: kernels are semantically refactored into explicitly parameterizable templates, and template parameters are then optimized via search-based autotuning, yielding more stable and higher-quality speedups. Experiments on a set of real-world kernels demonstrate speedups exceeding 3x in the best case. We extract representative CUDA kernels from SGLang as evaluation targets; the proposed agentic tuner iteratively performs templating, testing, analysis, and planning, and leverages profiling feedback to execute constrained parameter search under hardware resource limits. Compared to agent-only direct rewriting, the template-plus-search design significantly reduces the randomness of iterative optimization, making the process more interpretable and enabling a more systematic approach toward high-performance configurations. The proposed method can be further extended to OpenCL, HIP, and other backends to deliver automated performance optimization for real production workloads.
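
Stage two, search-based autotuning over the now-explicit template parameters under hardware resource limits, is classic parameter search. A sketch with a toy cost function standing in for real kernel compilation and profiling; the parameter names and shared-memory model are illustrative assumptions.

```python
from itertools import product

def autotune(benchmark, grid: dict, max_smem: int = 48 * 1024):
    """benchmark(params) -> runtime in ms (a stand-in for compiling the
    templated kernel and profiling it). Configs whose fp32 tile would
    exceed the shared-memory budget are skipped; fastest feasible wins."""
    best_time, best_params = float("inf"), None
    for values in product(*grid.values()):
        params = dict(zip(grid, values))
        if params["TILE_M"] * params["TILE_N"] * 4 > max_smem:
            continue                      # hardware resource constraint
        t = benchmark(params)
        if t < best_time:
            best_time, best_params = t, params
    return best_time, best_params

# toy cost model standing in for real profiling feedback
cost = lambda p: abs(p["TILE_M"] - 64) + abs(p["TILE_N"] - 32) + 1.0
grid = {"TILE_M": [16, 32, 64, 128], "TILE_N": [16, 32, 64]}
print(autotune(cost, grid))  # -> (1.0, {'TILE_M': 64, 'TILE_N': 32})
```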

[84] Who Does This Name Remind You of ? Nationality Prediction via Large Language Model Associative Memory

Keito Inoshita

Main category: cs.CL

TL;DR: LAMA framework uses LLMs as associative memory with dual agents to predict nationality by recalling famous people with same names, achieving 81.7% accuracy and outperforming conventional prompting methods.

DetailsMotivation: LLMs have extensive world knowledge but current prompting methods are limited in applying abstract linguistic rules for tasks like nationality prediction that require cultural/historical understanding.

Method: LAMA framework uses LLMs as associative memory with dual-agent architecture: Person Agent and Media Agent recall famous individuals with same names and aggregate their nationalities through indirect reasoning rather than direct inference.

Result: Achieved 0.817 accuracy on 99-country nationality prediction task, substantially outperforming conventional LLM prompting methods and neural models. LLMs show higher reliability in recalling concrete examples than abstract reasoning.

Conclusion: Recall-based approaches are robust to low-frequency nationalities and dual-agent architecture produces synergistic effects, demonstrating effectiveness of multi-agent systems that retrieve/aggregate LLM knowledge rather than prompting reasoning.

Abstract: Large language models (LLMs) possess extensive world knowledge, yet methods for effectively eliciting this knowledge remain underexplored. Nationality and region prediction tasks require understanding of not only linguistic features but also cultural and historical background, making LLM world knowledge particularly valuable. However, conventional LLM prompting methods rely on direct reasoning approaches, which have limitations in applying abstract linguistic rules. We propose LLM Associative Memory Agents (LAMA), a novel framework that leverages LLM world knowledge as associative memory. Rather than directly inferring nationality from names, LAMA recalls famous individuals with the same name and aggregates their nationalities through indirect reasoning. A dual-agent architecture comprising a Person Agent and a Media Agent, specialized in different knowledge domains, recalls famous individuals in parallel, generating Top-1 predictions through voting and Top-K predictions through conditional completion. On a 99-country nationality prediction task, LAMA achieved 0.817 accuracy, substantially outperforming conventional LLM prompting methods and neural models. Our experiments reveal that LLMs exhibit higher reliability in recalling concrete examples than in abstract reasoning, that recall-based approaches are robust to low-frequency nationalities independent of data frequency distributions, and that the dual-agent architecture functions complementarily to produce synergistic effects. These results demonstrate the effectiveness of a new multi-agent system that retrieves and aggregates LLM knowledge rather than prompting reasoning.
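
The aggregation step (recall famous name-bearers in parallel, then vote over their nationalities) takes only a few lines. A sketch where the two agents are stand-in callables returning (person, nationality) recollections; all names below are invented examples.

```python
from collections import Counter

def lama_predict(name: str, agents, k: int = 3):
    """Each agent maps a name to recalled (person, nationality) pairs.
    Pool the recollections and vote: Top-1 by majority, Top-K by count."""
    tally = Counter(nat for agent in agents
                    for _person, nat in agent(name))
    ranked = [nat for nat, _ in tally.most_common(k)]
    return ranked[0], ranked

person_agent = lambda n: [("Keito Takahashi", "Japan"), ("K. Sato", "Japan")]
media_agent  = lambda n: [("Keito (film character)", "Japan"),
                          ("Keito H.", "Brazil")]
print(lama_predict("Keito", [person_agent, media_agent]))
```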

[85] The Bitter Lesson of Diffusion Language Models for Agentic Workflows: A Comprehensive Reality Check

Qingyu Lu, Liang Ding, Kanjian Zhang, Jinxia Zhang, Dacheng Tao

Main category: cs.CL

TL;DR: dLLMs fail as agentic backbones despite efficiency promises, showing systematic failures in embodied planning and tool-calling precision, but can work in non-causal roles like memory summarization.

DetailsMotivation: To evaluate whether diffusion-based LLMs (dLLMs) can effectively serve as agentic backbones despite their promised efficiency gains over auto-regressive models for real-time interaction.

Method: Comprehensive evaluation of dLLMs across embodied agents (long-horizon planning) and tool-calling agents (precise formatting), using Agentboard and BFCL benchmarks, plus introduction of DiffuAgent framework for multi-agent evaluation.

Result: Current dLLMs fail as reliable agentic backbones: (1) In embodied settings, they get stuck repeating failed attempts and fail to branch under temporal feedback; (2) In tool-calling, they fail to maintain symbolic precision due to diffusion noise.

Conclusion: dLLMs are effective in non-causal roles (memory summarization, tool selection) but require integration of causal, precise, and logically grounded reasoning mechanisms into the denoising process to be viable for agentic tasks.

Abstract: The pursuit of real-time agentic interaction has driven interest in Diffusion-based Large Language Models (dLLMs) as alternatives to auto-regressive backbones, promising to break the sequential latency bottleneck. However, does such efficiency gains translate into effective agentic behavior? In this work, we present a comprehensive evaluation of dLLMs (e.g., LLaDA, Dream) across two distinct agentic paradigms: Embodied Agents (requiring long-horizon planning) and Tool-Calling Agents (requiring precise formatting). Contrary to the efficiency hype, our results on Agentboard and BFCL reveal a “bitter lesson”: current dLLMs fail to serve as reliable agentic backbones, frequently leading to systematic failures. (1) In Embodied settings, dLLMs get stuck in repeated attempts, failing to branch under temporal feedback. (2) In Tool-Calling settings, dLLMs fail to maintain symbolic precision (e.g. strict JSON schemas) under diffusion noise. To assess the potential of dLLMs in agentic workflows, we introduce DiffuAgent, a multi-agent evaluation framework that integrates dLLMs as plug-and-play cognitive cores. Our analysis shows that dLLMs are effective in non-causal roles (e.g., memory summarization and tool selection) but require the incorporation of causal, precise, and logically grounded reasoning mechanisms into the denoising process to be viable for agentic tasks.

[86] AfriEconQA: A Benchmark Dataset for African Economic Analysis based on World Bank Reports

Edward Ajayi

Main category: cs.CL

TL;DR: AfriEconQA is a specialized benchmark dataset for African economic analysis using 236 World Bank reports, featuring 8,937 QA instances requiring numerical reasoning and temporal disambiguation, revealing significant gaps in current LLMs’ knowledge and RAG systems.

DetailsMotivation: There's a lack of specialized benchmarks for African economic analysis, and current LLMs have limited knowledge about African economic data since it's largely absent from their pretraining corpora. The paper aims to create a challenging benchmark for domain-specific IR and RAG systems focused on African economies.

Method: Created AfriEconQA dataset from 236 World Bank reports, curating 8,937 QA instances from 10,018 synthetic questions. Each instance includes questions requiring economic reasoning, evidence from reports, verified answers, and source metadata. Conducted 11 experiments benchmarking zero-shot GPT-5 Mini against RAG configurations using GPT-4o and Qwen 32B with five embedding/ranking strategies.

Result: Zero-shot models failed to answer over 90% of queries, showing severe parametric knowledge gap. Even state-of-the-art RAG pipelines struggled to achieve high precision, confirming the dataset’s challenging nature for current IR and RAG systems.

Conclusion: AfriEconQA is a robust and challenging benchmark for next-generation domain-specific IR and RAG systems, highlighting the need for specialized approaches to handle African economic data. The dataset will be publicly available to advance research in this area.

Abstract: We introduce AfriEconQA, a specialized benchmark dataset for African economic analysis grounded in a comprehensive corpus of 236 World Bank reports. The task of AfriEconQA is to answer complex economic queries that require high-precision numerical reasoning and temporal disambiguation from specialized institutional documents. The dataset consists of 8,937 curated QA instances, rigorously filtered from a pool of 10,018 synthetic questions to ensure high-quality evidence-answer alignment. Each instance is composed of: (1) a question requiring reasoning over economic indicators, (2) the corresponding evidence retrieved from the corpus, (3) a verified ground-truth answer, and (4) source metadata (e.g., URL and publication date) to ensure temporal provenance. AfriEconQA is the first benchmark focused specifically on African economic analysis, providing a unique challenge for Information Retrieval (IR) systems, as the data is largely absent from the pretraining corpora of current Large Language Models (LLMs). We operationalize this dataset through an 11-experiment matrix, benchmarking a zero-shot baseline (GPT-5 Mini) against RAG configurations using GPT-4o and Qwen 32B across five distinct embedding and ranking strategies. Our results demonstrate a severe parametric knowledge gap, where zero-shot models fail to answer over 90 percent of queries, and even state-of-the-art RAG pipelines struggle to achieve high precision. This confirms AfriEconQA as a robust and challenging benchmark for the next generation of domain-specific IR and RAG systems. The AfriEconQA dataset and code will be made publicly available upon publication.

[87] Stable-DiffCoder: Pushing the Frontier of Code Diffusion Large Language Model

Chenghao Fan, Wen Heng, Bo Li, Sichen Liu, Yuxuan Song, Jing Su, Xiaoye Qu, Kai Shen, Wei Wei

Main category: cs.CL

TL;DR: Stable-DiffCoder is a block diffusion code model that outperforms autoregressive counterparts on code benchmarks through efficient training techniques and demonstrates advantages in structured code modeling and low-resource languages.

DetailsMotivation: Diffusion-based language models offer non-sequential generation and better data reuse than autoregressive models, but existing code diffusion models still underperform AR models under comparable budgets. The authors aim to bridge this performance gap.

Method: Reuses Seed-Coder architecture, data, and training pipeline. Incorporates block diffusion continual pretraining (CPT) with tailored warmup and block-wise clipped noise schedule for efficient knowledge learning and stable training. Uses only CPT and supervised fine-tuning stages.

Result: Stable-DiffCoder outperforms its AR counterpart on broad code benchmarks under same data and architecture. Achieves stronger performance than ~8B ARs and DLLMs. Diffusion-based modeling improves structured code modeling for editing/reasoning and benefits low-resource languages through data augmentation.

Conclusion: Diffusion-based training can improve code modeling quality beyond AR training alone, offering advantages in structured code modeling and low-resource language support while maintaining competitive performance.

Abstract: Diffusion-based language models (DLLMs) offer non-sequential, block-wise generation and richer data reuse compared to autoregressive (AR) models, but existing code DLLMs still lag behind strong AR baselines under comparable budgets. We revisit this setting in a controlled study and introduce Stable-DiffCoder, a block diffusion code model that reuses the Seed-Coder architecture, data, and training pipeline. To enable efficient knowledge learning and stable training, we incorporate a block diffusion continual pretraining (CPT) stage enhanced by a tailored warmup and block-wise clipped noise schedule. Under the same data and architecture, Stable-DiffCoder overall outperforms its AR counterpart on a broad suite of code benchmarks. Moreover, relying only on the CPT and supervised fine-tuning stages, Stable-DiffCoder achieves stronger performance than a wide range of ~8B ARs and DLLMs, demonstrating that diffusion-based training can improve code modeling quality beyond AR training alone. Finally, diffusion-based any-order modeling improves structured code modeling for editing and reasoning, and, through data augmentation, benefits low-resource coding languages.
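
One plausible reading of the block-wise clipped noise schedule (an assumption on our part, not the paper's released recipe) is that each block draws its own masking ratio, clipped away from the degenerate extremes so that every block carries useful learning signal during continual pretraining:

```python
import random

def blockwise_clipped_ratios(n_blocks: int, lo: float = 0.1, hi: float = 0.9,
                             rng: random.Random | None = None):
    """One masking ratio per block, uniform on [0, 1] then clipped to
    [lo, hi] so no block is fully unmasked (no denoising signal) or
    fully masked (pure noise). Clip bounds are illustrative guesses."""
    rng = rng or random.Random(0)
    return [min(max(rng.random(), lo), hi) for _ in range(n_blocks)]

print(blockwise_clipped_ratios(4))
```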

[88] Improving Training Efficiency and Reducing Maintenance Costs via Language Specific Model Merging

Alphaeus Dmonte, Vidhi Gupta, Daniel J Perry, Mark Arehart

Main category: cs.CL

TL;DR: Merging multilingual multitask models reduces training time by 50% and maintenance costs by 60% while maintaining quality parity compared to full retraining.

DetailsMotivation: Fine-tuning multilingual LLMs requires retraining the entire model when updating languages or adding new ones, creating computational inefficiency and maintenance bottlenecks. Current merging approaches show promise but their efficiency hasn't been studied.

Method: Analyzed merging strategy for multilingual multitask models from efficiency perspective across three independent tasks. Evaluated merging approach for both initial training and maintenance updates (updating individual languages and re-merging).

Result: Merging reduces initial training time by up to 50%. Updating individual languages and re-merging reduces training costs by more than 60% compared to full multilingual model retraining. Approach works on both public and proprietary industry datasets.

Conclusion: Model merging strategy offers significant efficiency gains for multilingual LLMs while maintaining quality, making it practical for industrial use cases beyond academic settings.

Abstract: Fine-tuning a task-specific multilingual large language model (LLM) involves training the model on a multilingual dataset with examples in all the required languages. Updating one or more supported languages with additional data or adding support for a new language involves retraining the model, which can be computationally inefficient and creates a severe maintenance bottleneck. Recent research on merging multilingual multitask models has shown promise in terms of improved quality, but its computational and maintenance efficiency remains unstudied. In this work, we provide the first focused analysis of this merging strategy from an efficiency perspective, evaluating it across three independent tasks. We demonstrate significant efficiency gains while maintaining parity in terms of quality: this merging approach reduces the initial training time by up to 50%. We also demonstrate that updating an individual language and re-merging as part of model maintenance reduces training costs by more than 60%, compared to re-training the full multilingual model. We show this on both public and proprietary industry datasets, confirming that the approach works well for industrial use cases in addition to the academic settings already studied in previous work.
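
The core operation behind this kind of merging is parameter averaging over language-specific checkpoints. A minimal sketch with uniform weights and toy linear "experts" follows; the paper's actual merge may be weighted or more sophisticated.

```python
import torch
import torch.nn as nn

def merge_models(state_dicts, weights=None):
    """Weighted parameter averaging across language-specific checkpoints."""
    weights = weights or [1.0 / len(state_dicts)] * len(state_dicts)
    return {k: sum(w * sd[k].float() for w, sd in zip(weights, state_dicts))
            for k in state_dicts[0]}

# Toy stand-ins for three language experts fine-tuned from one base model.
experts = [nn.Linear(16, 16).state_dict() for _ in ("en", "de", "fr")]
merged = merge_models(experts)

# Maintenance: retrain only one language, then re-merge. The untouched
# checkpoints are reused, which is where the reported cost saving comes from.
experts[1] = nn.Linear(16, 16).state_dict()  # stands in for the updated model
merged = merge_models(experts)
```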

cs.CV

[89] GR3EN: Generative Relighting for 3D Environments

Xiaoyan Xing, Philipp Henzler, Junhwa Hur, Runze Li, Jonathan T. Barron, Pratul P. Srinivasan, Dor Verbin

Main category: cs.CV

TL;DR: A method for relighting 3D reconstructions of room-scale environments by distilling outputs from a video-to-video relighting diffusion model into 3D reconstructions, avoiding difficult inverse rendering problems.

DetailsMotivation: Existing 3D scene relighting solutions require solving under-determined or ill-conditioned inverse rendering problems, limiting their ability to produce high-quality results on complex real-world scenes. Current generative diffusion methods are limited to 2D image/video relighting or individual 3D objects.

Method: Distills outputs of a video-to-video relighting diffusion model into 3D reconstructions, sidestepping the need to solve difficult inverse rendering problems. This enables controllable 3D relighting of room-scale scenes.

Result: Validated on both synthetic and real-world datasets, showing the method can faithfully render novel views of scenes under new lighting conditions.

Conclusion: The approach provides a flexible system for 3D relighting of complex real-world room-scale scenes by leveraging diffusion models while avoiding traditional inverse rendering challenges.

Abstract: We present a method for relighting 3D reconstructions of large room-scale environments. Existing solutions for 3D scene relighting often require solving under-determined or ill-conditioned inverse rendering problems, and are as such unable to produce high-quality results on complex real-world scenes. Though recent progress in using generative image and video diffusion models for relighting has been promising, these techniques are either limited to 2D image and video relighting or 3D relighting of individual objects. Our approach enables controllable 3D relighting of room-scale scenes by distilling the outputs of a video-to-video relighting diffusion model into a 3D reconstruction. This side-steps the need to solve a difficult inverse rendering problem, and results in a flexible system that can relight 3D reconstructions of complex real-world scenes. We validate our approach on both synthetic and real-world datasets to show that it can faithfully render novel views of scenes under new lighting conditions.
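
At its simplest, the distillation idea reduces to a photometric loss between renders of the 3D reconstruction and the diffusion model's relit frames, with no inverse rendering involved. The sketch below uses a trivial stand-in scene; the view-agnostic `render` and the random `relit_frames` are placeholders.

```python
import torch

class TinyScene(torch.nn.Module):
    """Stand-in for a 3D scene representation (e.g. a radiance field)."""
    def __init__(self, h: int = 32, w: int = 32):
        super().__init__()
        self.image = torch.nn.Parameter(torch.zeros(1, 3, h, w))

    def render(self, view):
        return self.image  # a real renderer would depend on `view`

scene = TinyScene()
opt = torch.optim.Adam(scene.parameters(), lr=1e-2)
relit_frames = torch.rand(4, 1, 3, 32, 32)  # placeholder diffusion outputs per view

for v in range(4):                          # distill: match renders to relit frames
    opt.zero_grad()
    loss = (scene.render(v) - relit_frames[v]).abs().mean()
    loss.backward()
    opt.step()
```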

[90] Memory-V2V: Augmenting Video-to-Video Diffusion Models with Memory

Dohun Lee, Chun-Hao Paul Huang, Xuelin Chen, Jong Chul Ye, Duygu Ceylan, Hyeonho Jeong

Main category: cs.CV

TL;DR: Memory-V2V introduces a memory-augmented framework for multi-turn video editing that maintains cross-consistency across sequential edits by retrieving and conditioning on prior edited videos.

DetailsMotivation: Current video editors struggle with cross-consistency in multi-turn editing scenarios where users refine results across multiple rounds of interaction. Real-world video editing is often iterative, but existing models fail to maintain consistency across sequential edits.

Method: Memory-V2V augments existing video-to-video models with explicit memory using an external cache of previously edited videos. It employs accurate retrieval and dynamic tokenization strategies to condition current editing on prior results. A learnable token compressor within the DiT backbone compresses redundant conditioning tokens while preserving essential visual cues.

Result: Memory-V2V achieves 30% speedup while producing videos significantly more cross-consistent with minimal computational overhead. It maintains or even improves task-specific performance over state-of-the-art baselines on challenging tasks including video novel view synthesis and text-conditioned long video editing.

Conclusion: Memory-V2V effectively addresses the cross-consistency problem in multi-turn video editing through memory augmentation, achieving both improved consistency and computational efficiency while maintaining strong editing performance.

Abstract: Recent foundational video-to-video diffusion models have achieved impressive results in editing user provided videos by modifying appearance, motion, or camera movement. However, real-world video editing is often an iterative process, where users refine results across multiple rounds of interaction. In this multi-turn setting, current video editors struggle to maintain cross-consistency across sequential edits. In this work, we tackle, for the first time, the problem of cross-consistency in multi-turn video editing and introduce Memory-V2V, a simple, yet effective framework that augments existing video-to-video models with explicit memory. Given an external cache of previously edited videos, Memory-V2V employs accurate retrieval and dynamic tokenization strategies to condition the current editing step on prior results. To further mitigate redundancy and computational overhead, we propose a learnable token compressor within the DiT backbone that compresses redundant conditioning tokens while preserving essential visual cues, achieving an overall speedup of 30%. We validate Memory-V2V on challenging tasks including video novel view synthesis and text-conditioned long video editing. Extensive experiments show that Memory-V2V produces videos that are significantly more cross-consistent with minimal computational overhead, while maintaining or even improving task-specific performance over state-of-the-art baselines. Project page: https://dohunlee1.github.io/MemoryV2V
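
A minimal picture of the external cache: store an embedding key plus conditioning tokens for each prior edit, and retrieve by cosine similarity. The class below is illustrative only; the paper's retrieval and dynamic tokenization are more involved.

```python
import torch
import torch.nn.functional as F

class EditMemory:
    """Tiny external cache: an embedding key plus conditioning tokens per edit."""
    def __init__(self):
        self.keys, self.tokens = [], []

    def add(self, key: torch.Tensor, cond_tokens: torch.Tensor):
        self.keys.append(F.normalize(key, dim=-1))
        self.tokens.append(cond_tokens)

    def retrieve(self, query: torch.Tensor, top_k: int = 1):
        q = F.normalize(query, dim=-1)
        sims = torch.stack([k @ q for k in self.keys])      # cosine similarities
        idx = sims.topk(min(top_k, len(self.keys))).indices
        return [self.tokens[i] for i in idx]

memory = EditMemory()
memory.add(torch.randn(512), torch.randn(77, 4096))  # a previously edited video
cond = memory.retrieve(torch.randn(512))             # condition the next edit on it
```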

[91] FeTTL: Federated Template and Task Learning for Multi-Institutional Medical Imaging

Abhijeet Parida, Antonia Alomar, Zhifan Jiang, Pooneh Roshanitabrizi, Austin Tapp, Ziyue Xu, Syed Muhammad Anwar, Maria J. Ledesma-Carbayo, Holger R. Roth, Marius George Linguraru

Main category: cs.CV

TL;DR: FeTTL is a federated learning framework that learns a global template and task model to harmonize multi-institutional medical imaging data, addressing domain shifts and heterogeneity while preserving privacy.

DetailsMotivation: Federated learning enables collaborative training across medical centers while preserving privacy, but domain shifts and data heterogeneity (from variations in acquisition protocols, scanner types, and patient populations) degrade model performance in medical imaging applications.

Method: Federated Template and Task Learning (FeTTL) learns a global template together with a task model to align data distributions among clients in federated environments, harmonizing multi-institutional medical imaging data.

Result: FeTTL significantly outperforms state-of-the-art federated learning baselines (p-values <0.002) for optical disc segmentation and metastasis classification from multi-institutional data, demonstrating the importance of jointly learning template and task.

Conclusion: FeTTL offers a principled and extensible solution for mitigating distribution shifts in federated learning, supporting robust model deployment in real-world, multi-institutional medical imaging environments.

Abstract: Federated learning enables collaborative model training across geographically distributed medical centers while preserving data privacy. However, domain shifts and heterogeneity in data often lead to a degradation in model performance. Medical imaging applications are particularly affected by variations in acquisition protocols, scanner types, and patient populations. To address these issues, we introduce Federated Template and Task Learning (FeTTL), a novel framework designed to harmonize multi-institutional medical imaging data in federated environments. FeTTL learns a global template together with a task model to align data distributions among clients. We evaluated FeTTL on two challenging and diverse multi-institutional medical imaging tasks: retinal fundus optical disc segmentation and histopathological metastasis classification. Experimental results show that FeTTL significantly outperforms the state-of-the-art federated learning baselines (p-values <0.002) for optical disc segmentation and classification of metastases from multi-institutional data. Our experiments further highlight the importance of jointly learning the template and the task. These findings suggest that FeTTL offers a principled and extensible solution for mitigating distribution shifts in federated learning, supporting robust model deployment in real-world, multi-institutional environments.
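
As rough intuition for joint template-and-task learning, the sketch below treats the global template as a learnable reference image and applies a crude mean-shift harmonisation before the task loss. Both choices are simplifying assumptions, not the paper's formulation; in federated training the server would additionally average template and task weights across clients each round.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Learnable global template (a reference image) plus a task model.
template = nn.Parameter(torch.zeros(1, 3, 64, 64))
task = nn.Sequential(nn.Flatten(), nn.Linear(3 * 64 * 64, 2))
opt = torch.optim.SGD([template, *task.parameters()], lr=1e-2)

x, y = torch.randn(8, 3, 64, 64), torch.randint(0, 2, (8,))
opt.zero_grad()
aligned = x + (template - x.mean(0, keepdim=True))   # crude shift toward the template
loss = F.cross_entropy(task(aligned), y) \
       + 0.1 * F.mse_loss(x.mean(0, keepdim=True), template)
loss.backward()
opt.step()  # a federated server would then average template + task across clients
```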

[92] Where is the multimodal goal post? On the Ability of Foundation Models to Recognize Contextually Important Moments

Aditya K Surikuchi, Raquel Fernández, Sandro Pezzelle

Main category: cs.CV

TL;DR: Models struggle to identify important sub-events in football videos, performing near chance level despite using human preference data from highlight reels.

DetailsMotivation: Foundation models are used for real-world language generation from temporally-ordered multimodal events, but their ability to identify important sub-events in videos - a fundamental prerequisite for narration and summarization - needs evaluation.

Method: Created a new dataset using human preferences implicit in football game highlight reels (no additional annotation costs), evaluated several state-of-the-art multimodal models on distinguishing important vs non-important sub-events.

Result: Models perform near chance level. Analysis reveals they tend to rely on a single dominant modality and are ineffective at synthesizing information from multiple sources.

Conclusion: Need modular architectures to handle sample-level heterogeneity in multimodal data and complementary training procedures to maximize cross-modal synergy.

Abstract: Foundation models are used for many real-world applications involving language generation from temporally-ordered multimodal events. In this work, we study the ability of models to identify the most important sub-events in a video, which is a fundamental prerequisite for narrating or summarizing multimodal events. Specifically, we focus on football games and evaluate models on their ability to distinguish between important and non-important sub-events in a game. To this end, we construct a new dataset by leveraging human preferences for importance implicit in football game highlight reels, without any additional annotation costs. Using our dataset, which we will publicly release to the community, we compare several state-of-the-art multimodal models and show that they are not far from chance level performance. Analyses of models beyond standard evaluation metrics reveal their tendency to rely on a single dominant modality and their ineffectiveness in synthesizing necessary information from multiple sources. Our findings underline the importance of modular architectures that can handle sample-level heterogeneity in multimodal data and the need for complementary training procedures that can maximize cross-modal synergy.

[93] Coarse-to-Fine Non-rigid Multi-modal Image Registration for Historical Panel Paintings based on Crack Structures

Aline Sindel, Andreas Maier, Vincent Christlein

Main category: cs.CV

TL;DR: A coarse-to-fine non-rigid multi-modal registration method using sparse keypoints and thin-plate-splines for aligning historical panel painting images, leveraging craquelure patterns as features.

DetailsMotivation: Manual pixel-wise alignment of multi-modal images for art technological analysis is laborious and time-consuming. Automated registration is needed to handle varying resolutions, large image sizes, non-rigid distortions, and modality-dependent content.

Method: One-stage non-rigid registration using CNN for joint keypoint detection/description based on craquelure patterns, GNN for patch-based descriptor matching, homography reprojection error filtering, and novel multi-level keypoint refinement for mixed-resolution images.

Result: Created annotated multi-modal dataset of panel paintings, demonstrated effectiveness through ablation study, and achieved best registration results compared to competing keypoint/dense matching methods and refinement approaches.

Conclusion: The proposed coarse-to-fine registration method efficiently handles challenging multi-modal image alignment for historical paintings, reducing manual work while enabling higher precision through automated craquelure-based feature matching.

Abstract: Art technological investigations of historical panel paintings rely on acquiring multi-modal image data, including visual light photography, infrared reflectography, ultraviolet fluorescence photography, x-radiography, and macro photography. For a comprehensive analysis, the multi-modal images require pixel-wise alignment, which is still often performed manually. Multi-modal image registration can reduce this laborious manual work, is substantially faster, and enables higher precision. Due to varying image resolutions, huge image sizes, non-rigid distortions, and modality-dependent image content, registration is challenging. Therefore, we propose a coarse-to-fine non-rigid multi-modal registration method efficiently relying on sparse keypoints and thin-plate-splines. Historical paintings exhibit a fine crack pattern, called craquelure, on the paint layer, which is captured by all image systems and is well-suited as a feature for registration. In our one-stage non-rigid registration approach, we employ a convolutional neural network for joint keypoint detection and description based on the craquelure and a graph neural network for descriptor matching in a patch-based manner, and filter matches based on homography reprojection errors in local areas. For coarse-to-fine registration, we introduce a novel multi-level keypoint refinement approach to register mixed-resolution images up to the highest resolution. We created a multi-modal dataset of panel paintings with a high number of keypoint annotations, and a large test set comprising five multi-modal domains and varying image resolutions. The ablation study demonstrates the effectiveness of all modules of our refinement method. Our proposed approaches achieve the best registration results compared to competing keypoint and dense matching methods and refinement methods.
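
The match-filtering and warping steps map onto standard building blocks: RANSAC homography estimation to discard bad correspondences, then a thin-plate-spline fit on the survivors. The sketch below runs OpenCV and SciPy on synthetic points and filters with a single global homography, whereas the paper filters by reprojection error in local areas.

```python
import numpy as np
import cv2
from scipy.interpolate import RBFInterpolator

# Synthetic matched craquelure keypoints in two modalities (N x 2 each).
rng = np.random.default_rng(0)
pts_a = rng.uniform(0, 512, (40, 2)).astype(np.float32)
pts_b = pts_a + rng.normal(0, 1.0, pts_a.shape).astype(np.float32)

# 1) Filter matches by homography reprojection error (RANSAC).
H, inliers = cv2.findHomography(pts_a, pts_b, cv2.RANSAC, ransacReprojThreshold=3.0)
a, b = pts_a[inliers.ravel() == 1], pts_b[inliers.ravel() == 1]

# 2) Fit a thin-plate-spline warp on the surviving correspondences.
tps = RBFInterpolator(a, b, kernel="thin_plate_spline", smoothing=1e-3)
warped = tps(a)  # maps modality-A coordinates into modality B
```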

[94] Cognitively-Inspired Tokens Overcome Egocentric Bias in Multimodal Models

Bridget Leonard, Scott O. Murray

Main category: cs.CV

TL;DR: The paper introduces “perspective tokens” - specialized embeddings that encode orientation through body-keypoint cues or abstract mental rotation representations - to help multimodal language models overcome egocentric bias and perform allocentric spatial reasoning.

DetailsMotivation: Current multimodal language models perform well on semantic vision-language tasks but fail at spatial reasoning requiring perspective-taking (adopting another agent's visual perspective). These errors reflect persistent egocentric bias, raising questions about whether models support allocentric reasoning.

Method: Introduces perspective tokens: specialized embeddings that encode orientation through either (1) embodied body-keypoint cues or (2) abstract representations supporting mental rotation. These tokens are integrated into LLaVA-1.5-13B to enable level-2 visual perspective-taking tasks.

Result: Perspective tokens improve accuracy across synthetic and naturalistic benchmarks (Isle Bricks V2, COCO, 3DSRBench). Rotation-based tokens generalize to non-human reference agents. Representational analyses show fine-tuning enhances latent orientation sensitivity already present in base models.

Conclusion: Embedding cognitively grounded spatial structure directly into token space provides a lightweight, model-agnostic mechanism for perspective-taking and more human-like spatial reasoning. MLMs contain precursors of allocentric reasoning but lack appropriate internal structure.

Abstract: Multimodal language models (MLMs) perform well on semantic vision-language tasks but fail at spatial reasoning that requires adopting another agent’s visual perspective. These errors reflect a persistent egocentric bias and raise questions about whether current models support allocentric reasoning. Inspired by human spatial cognition, we introduce perspective tokens, specialized embeddings that encode orientation through either (1) embodied body-keypoint cues or (2) abstract representations supporting mental rotation. Integrating these tokens into LLaVA-1.5-13B yields improved performance on level-2 visual perspective-taking tasks. Across synthetic and naturalistic benchmarks (Isle Bricks V2, COCO, 3DSRBench), perspective tokens improve accuracy, with rotation-based tokens generalizing to non-human reference agents. Representational analyses reveal that fine-tuning enhances latent orientation sensitivity already present in the base model, suggesting that MLMs contain precursors of allocentric reasoning but lack appropriate internal structure. Overall, embedding cognitively grounded spatial structure directly into token space provides a lightweight, model-agnostic mechanism for perspective-taking and more human-like spatial reasoning.
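
Mechanically, a perspective token can be as simple as a learned embedding, indexed by the reference agent's discretised orientation and appended to the visual token sequence. The sketch below is a hypothetical minimal version; the dimensions and the orientation binning are assumptions.

```python
import torch
import torch.nn as nn

n_bins, d_model = 8, 4096                      # assumed orientation bins / LLM width
perspective_bank = nn.Embedding(n_bins, d_model)

visual_tokens = torch.randn(1, 576, d_model)   # e.g. projected vision-encoder patches
orientation_bin = torch.tensor([3])            # the reference agent's facing direction
p_tok = perspective_bank(orientation_bin).unsqueeze(1)
tokens = torch.cat([visual_tokens, p_tok], dim=1)  # the LLM consumes this as usual
```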

[95] VTFusion: A Vision-Text Multimodal Fusion Network for Few-Shot Anomaly Detection

Yuxin Jiang, Yunkang Cao, Yuqi Cheng, Yiheng Zhang, Weiming Shen

Main category: cs.CV

TL;DR: VTFusion is a vision-text multimodal framework for few-shot anomaly detection that addresses domain gaps and semantic misalignment between modalities through adaptive feature extractors and dedicated fusion modules.

DetailsMotivation: Current FSAD methods rely on pre-trained features from natural scenes, neglecting domain-specific semantics for industrial inspection. Existing fusion strategies use superficial concatenation that fails to address semantic misalignment between visual and textual modalities, compromising robustness against cross-modal interference.

Method: Two core designs: 1) Adaptive feature extractors for both image and text modalities to learn task-specific representations and bridge domain gaps, augmented with synthetic anomaly generation; 2) Multimodal prediction fusion module with fusion block for cross-modal information exchange and segmentation network for pixel-level anomaly maps under multimodal guidance.

Result: Achieved image-level AUROCs of 96.8% and 86.2% in 2-shot scenario on MVTec AD and VisA datasets, and AUPRO of 93.5% on a real-world industrial automotive plastic parts dataset.

Conclusion: VTFusion significantly advances FSAD performance and demonstrates practical applicability in demanding industrial scenarios through effective multimodal fusion and domain adaptation.

Abstract: Few-Shot Anomaly Detection (FSAD) has emerged as a critical paradigm for identifying irregularities using scarce normal references. While recent methods have integrated textual semantics to complement visual data, they predominantly rely on features pre-trained on natural scenes, thereby neglecting the granular, domain-specific semantics essential for industrial inspection. Furthermore, prevalent fusion strategies often resort to superficial concatenation, failing to address the inherent semantic misalignment between visual and textual modalities, which compromises robustness against cross-modal interference. To bridge these gaps, this study proposes VTFusion, a vision-text multimodal fusion framework tailored for FSAD. The framework rests on two core designs. First, adaptive feature extractors for both image and text modalities are introduced to learn task-specific representations, bridging the domain gap between pre-trained models and industrial data; this is further augmented by generating diverse synthetic anomalies to enhance feature discriminability. Second, a dedicated multimodal prediction fusion module is developed, comprising a fusion block that facilitates rich cross-modal information exchange and a segmentation network that generates refined pixel-level anomaly maps under multimodal guidance. VTFusion significantly advances FSAD performance, achieving image-level AUROCs of 96.8% and 86.2% in the 2-shot scenario on the MVTec AD and VisA datasets, respectively. Furthermore, VTFusion achieves an AUPRO of 93.5% on a real-world dataset of industrial automotive plastic parts introduced in this paper, further demonstrating its practical applicability in demanding industrial scenarios.
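
A generic form of the fusion block is cross-attention plus a learned gate: vision queries text, and a sigmoid gate decides how much cross-modal evidence to admit. The sketch below shows that generic pattern, not the paper's exact module.

```python
import torch
import torch.nn as nn

class GatedCrossModalFusion(nn.Module):
    """Vision queries text via cross-attention; a sigmoid gate mixes the result."""
    def __init__(self, d: int, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(d, heads, batch_first=True)
        self.gate = nn.Sequential(nn.Linear(2 * d, d), nn.Sigmoid())

    def forward(self, vis, txt):
        att, _ = self.attn(vis, txt, txt)               # cross-modal exchange
        g = self.gate(torch.cat([vis, att], dim=-1))    # how much text to admit
        return g * att + (1 - g) * vis

fused = GatedCrossModalFusion(256)(torch.randn(2, 196, 256), torch.randn(2, 8, 256))
```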

[96] ResAgent: Entropy-based Prior Point Discovery and Visual Reasoning for Referring Expression Segmentation

Yihao Wang, Jusheng Zhang, Ziyi Tang, Keze Wang, Meng Yang

Main category: cs.CV

TL;DR: EBD-VBR is a new RES framework that improves segmentation accuracy by using entropy-based point discovery to find optimal prompts and vision-based reasoning for robust validation, achieving SOTA across four benchmarks.

DetailsMotivation: Existing RES methods have two key limitations: (1) coarse bounding boxes from MLLMs lead to redundant or non-discriminative point prompts, and (2) textual coordinate reasoning is unreliable for distinguishing targets from visually similar distractors.

Method: Proposes EBD-VBR framework with two components: Entropy-Based Point Discovery (EBD) that models spatial uncertainty within bounding boxes to identify high-information candidate points, and Vision-Based Reasoning (VBR) that verifies point correctness through joint visual-semantic alignment instead of text-only coordinate inference. Implements a coarse-to-fine workflow: bounding box initialization, entropy-guided point discovery, vision-based validation, and mask decoding.

Result: Achieves new state-of-the-art performance across all four benchmark datasets: RefCOCO, RefCOCO+, RefCOCOg, and ReasonSeg.

Conclusion: EBD-VBR effectively generates accurate and semantically grounded segmentation masks with minimal prompts by addressing the limitations of existing MLLM-based RES approaches through information-theoretic point selection and robust visual validation.

Abstract: Referring Expression Segmentation (RES) is a core vision-language segmentation task that enables pixel-level understanding of targets via free-form linguistic expressions, supporting critical applications such as human-robot interaction and augmented reality. Despite the progress of Multimodal Large Language Model (MLLM)-based approaches, existing RES methods still suffer from two key limitations: first, the coarse bounding boxes from MLLMs lead to redundant or non-discriminative point prompts; second, the prevalent reliance on textual coordinate reasoning is unreliable, as it fails to distinguish targets from visually similar distractors. To address these issues, we propose ResAgent, a novel RES framework integrating Entropy-Based Point Discovery (EBD) and Vision-Based Reasoning (VBR). Specifically, EBD identifies high-information candidate points by modeling spatial uncertainty within coarse bounding boxes, treating point selection as an information maximization process. VBR verifies point correctness through joint visual-semantic alignment, abandoning text-only coordinate inference for more robust validation. Built on these components, ResAgent implements a coarse-to-fine workflow: bounding box initialization, entropy-guided point discovery, vision-based validation, and mask decoding. Extensive evaluations on four benchmark datasets (RefCOCO, RefCOCO+, RefCOCOg, and ReasonSeg) demonstrate that ResAgent achieves new state-of-the-art performance across all four benchmarks, highlighting its effectiveness in generating accurate and semantically grounded segmentation masks with minimal prompts.
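
One plausible reading of the information-maximizing point selection: compute per-pixel binary entropy inside the coarse box and keep the top-k highest-entropy locations as point prompts. The probability map below is random, standing in for a coarse foreground predictor.

```python
import torch

# Per-pixel foreground probability inside the coarse box (random stand-in).
probs = torch.rand(64, 64).clamp(1e-4, 1 - 1e-4)
entropy = -(probs * probs.log() + (1 - probs) * (1 - probs).log())

flat = entropy.flatten().topk(3).indices               # three point prompts
points = torch.stack([flat // 64, flat % 64], dim=-1)  # (row, col) coordinates
```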

[97] A Cosine Network for Image Super-Resolution

Chunwei Tian, Chengyuan Zhang, Bob Zhang, Zhiwu Li, C. L. Philip Chen, David Zhang

Main category: cs.CV

TL;DR: CSRNet improves image super-resolution using odd/even heterogeneous blocks to extract complementary structural information and cosine annealing for training optimization.

DetailsMotivation: While deep CNNs can extract hierarchical structural information for image super-resolution, effectively preserving and utilizing this structural information remains challenging. The paper aims to enhance structural information extraction and training stability.

Method: Proposes CSRNet with: 1) Odd and even heterogeneous blocks to extract complementary homologous structural information, 2) Combination of linear and non-linear structural information to enhance robustness, 3) Cosine annealing mechanism with warm restarts to optimize training and avoid local minima.

Result: Experimental results show CSRNet is competitive with state-of-the-art methods in image super-resolution tasks.

Conclusion: The proposed CSRNet effectively improves image super-resolution performance through architectural innovations (heterogeneous blocks) and training optimization (cosine annealing), demonstrating competitive results with existing methods.

Abstract: Deep convolutional neural networks can use hierarchical information to progressively extract structural information to recover high-quality images. However, preserving the effectiveness of the obtained structural information is important in image super-resolution. In this paper, we propose a cosine network for image super-resolution (CSRNet) by improving a network architecture and optimizing the training strategy. To extract complementary homologous structural information, odd and even heterogeneous blocks are designed to enlarge the architectural differences and improve the performance of image super-resolution. Combining linear and non-linear structural information can overcome the drawback of homologous information and enhance the robustness of the obtained structural information in image super-resolution. Taking into account the local minimum of gradient descent, a cosine annealing mechanism is used to optimize the training procedure by performing warm restarts and adjusting the learning rate. Experimental results illustrate that the proposed CSRNet is competitive with state-of-the-art methods in image super-resolution.
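
Cosine annealing with warm restarts is available off the shelf in PyTorch; a minimal usage sketch (toy model, arbitrary restart periods):

```python
import torch
from torch.optim.lr_scheduler import CosineAnnealingWarmRestarts

model = torch.nn.Linear(10, 10)
opt = torch.optim.SGD(model.parameters(), lr=0.1)
sched = CosineAnnealingWarmRestarts(opt, T_0=10, T_mult=2, eta_min=1e-5)

for epoch in range(30):
    # ... one training epoch ...
    sched.step()  # lr follows a cosine curve; warm restarts at epochs 10 and 30
```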

[98] DCCS-Det: Directional Context and Cross-Scale-Aware Detector for Infrared Small Target

Shuying Li, Qiang Ma, San Zhang, Chuang Yang

Main category: cs.CV

TL;DR: DCCS-Det is a novel infrared small target detector that addresses inadequate local-global feature modeling and feature degradation through dual-stream saliency enhancement and cross-scale semantic extraction modules.

DetailsMotivation: Existing IRSTD methods struggle with inadequate joint modeling of local-global features (harming target-background discrimination) and feature redundancy/semantic dilution (degrading target representation quality).

Method: Proposes DCCS-Det with two key components: 1) Dual-stream Saliency Enhancement (DSE) block integrating localized perception with direction-aware context aggregation, and 2) Latent-aware Semantic Extraction and Aggregation (LaSEA) module using cross-scale feature extraction and random pooling sampling to mitigate feature degradation.

Result: Achieves state-of-the-art detection accuracy with competitive efficiency across multiple datasets. Ablation studies validate contributions of DSE and LaSEA in improving target perception and feature representation under complex scenarios.

Conclusion: DCCS-Det effectively addresses key challenges in infrared small target detection through innovative architectural components that enhance both local-global feature modeling and feature representation quality.

Abstract: Infrared small target detection (IRSTD) is critical for applications like remote sensing and surveillance, which aims to identify small, low-contrast targets against complex backgrounds. However, existing methods often struggle with inadequate joint modeling of local-global features (harming target-background discrimination) or feature redundancy and semantic dilution (degrading target representation quality). To tackle these issues, we propose DCCS-Det (Directional Context and Cross-Scale Aware Detector for Infrared Small Target), a novel detector that incorporates a Dual-stream Saliency Enhancement (DSE) block and a Latent-aware Semantic Extraction and Aggregation (LaSEA) module. The DSE block integrates localized perception with direction-aware context aggregation to help capture long-range spatial dependencies and local details. On this basis, the LaSEA module mitigates feature degradation via cross-scale feature extraction and random pooling sampling strategies, enhancing discriminative features and suppressing noise. Extensive experiments show that DCCS-Det achieves state-of-the-art detection accuracy with competitive efficiency across multiple datasets. Ablation studies further validate the contributions of DSE and LaSEA in improving target perception and feature representation under complex scenarios. The official DCCS-Det code is available at https://huggingface.co/InPeerReview/InfraredSmallTargetDetection-IRSTD.DCCS

[99] AlphaFace: High Fidelity and Real-time Face Swapper Robust to Facial Pose

Jongmin Yu, Hyeontaek Oh, Zhongtian Sun, Angelica I Aviles-Rivero, Moongu Jeon, Jinhong Yang

Main category: cs.CV

TL;DR: AlphaFace improves face-swapping robustness for extreme facial poses using vision-language models and contrastive losses while maintaining real-time performance.

DetailsMotivation: Existing face-swapping methods degrade significantly with extreme facial poses. Geometric feature approaches add dependencies and computational cost, while diffusion methods are too slow for real-time use.

Method: Leverages open-source vision-language model and CLIP embeddings with novel visual and textual semantic contrastive losses for better identity representation and attribute preservation.

Result: Outperforms state-of-the-art methods on pose-challenging cases across FF++, MPIE, and LPFF datasets while maintaining real-time performance.

Conclusion: AlphaFace successfully addresses pose robustness in face-swapping with improved identity preservation and real-time capability, making it practical for real-world applications.

Abstract: Existing face-swapping methods often deliver competitive results in constrained settings but exhibit substantial quality degradation when handling extreme facial poses. To improve facial pose robustness, explicit geometric features are applied, but this approach remains problematic since it introduces additional dependencies and increases computational cost. Diffusion-based methods have achieved remarkable results; however, they are impractical for real-time processing. We introduce AlphaFace, which leverages an open-source vision-language model and CLIP image and text embeddings to apply novel visual and textual semantic contrastive losses. AlphaFace enables stronger identity representation and more precise attribute preservation, all while maintaining real-time performance. Comprehensive experiments across FF++, MPIE, and LPFF demonstrate that AlphaFace surpasses state-of-the-art methods in pose-challenging cases. The project is publicly available at https://github.com/andrewyu90/Alphaface_Official.git
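
A visual semantic contrastive loss over CLIP image embeddings can be sketched as below: pull the swapped face toward the source identity and away from the target's. This is an illustrative stand-in, not the paper's exact loss.

```python
import torch
import torch.nn.functional as F

def identity_contrastive_loss(swapped, source, target, tau: float = 0.07):
    """All inputs are (B, D) CLIP image embeddings; source identity is the positive."""
    swapped, source, target = (F.normalize(t, dim=-1) for t in (swapped, source, target))
    pos = (swapped * source).sum(-1) / tau
    neg = (swapped * target).sum(-1) / tau
    logits = torch.stack([pos, neg], dim=-1)            # positive must win
    return F.cross_entropy(logits, torch.zeros(len(logits), dtype=torch.long))

loss = identity_contrastive_loss(torch.randn(4, 512), torch.randn(4, 512),
                                 torch.randn(4, 512))
```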

[100] MDAFNet: Multiscale Differential Edge and Adaptive Frequency Guided Network for Infrared Small Target Detection

Shuying Li, Qiang Ma, San Zhang, Wuwei Wang, Chuang Yang

Main category: cs.CV

TL;DR: MDAFNet is a new infrared small target detection network that addresses edge degradation and frequency interference issues through multi-scale differential edge enhancement and dual-domain adaptive feature processing.

DetailsMotivation: Existing IRSTD methods suffer from gradual degradation of target edge pixels as network depth increases, and traditional convolution struggles to differentiate frequency components, leading to background interference and false detections from high-frequency noise.

Method: Proposes MDAFNet with two key modules: 1) Multi-Scale Differential Edge (MSDE) module for edge extraction and enhancement to compensate for edge information loss during downsampling, and 2) Dual-Domain Adaptive Feature Enhancement (DAFE) module that combines frequency domain processing with simulated frequency decomposition/fusion in spatial domain to enhance high-frequency targets while suppressing noise.

Result: Experimental results on multiple datasets demonstrate superior detection performance compared to existing methods.

Conclusion: MDAFNet effectively addresses edge degradation and frequency interference problems in infrared small target detection through its innovative multi-scale edge enhancement and dual-domain adaptive feature processing approach.

Abstract: Infrared small target detection (IRSTD) plays a crucial role in numerous military and civilian applications. However, existing methods often face the gradual degradation of target edge pixels as the number of network layers increases, and traditional convolution struggles to differentiate between frequency components during feature extraction, leading to low-frequency backgrounds interfering with high-frequency targets and high-frequency noise triggering false detections. To address these limitations, we propose MDAFNet (Multi-scale Differential Edge and Adaptive Frequency Guided Network for Infrared Small Target Detection), which integrates the Multi-Scale Differential Edge (MSDE) module and Dual-Domain Adaptive Feature Enhancement (DAFE) module. The MSDE module, through a multi-scale edge extraction and enhancement mechanism, effectively compensates for the cumulative loss of target edge information during downsampling. The DAFE module combines frequency domain processing mechanisms with simulated frequency decomposition and fusion mechanisms in the spatial domain to effectively improve the network’s capability to adaptively enhance high-frequency targets and selectively suppress high-frequency noise. Experimental results on multiple datasets demonstrate the superior detection performance of MDAFNet.
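
The frequency decomposition that the DAFE module simulates can be pictured with a toy FFT low/high-pass split; the cutoff radius and mask shape here are arbitrary illustrative choices.

```python
import torch

def frequency_split(x: torch.Tensor, radius: int = 8):
    """Toy split of an image into low- and high-frequency parts via the FFT."""
    f = torch.fft.fftshift(torch.fft.fft2(x), dim=(-2, -1))
    h, w = x.shape[-2:]
    yy, xx = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    low_mask = (((yy - h // 2) ** 2 + (xx - w // 2) ** 2) <= radius ** 2).to(f.dtype)
    low = torch.fft.ifft2(torch.fft.ifftshift(f * low_mask, dim=(-2, -1))).real
    return low, x - low  # low-frequency background vs high-frequency detail/noise

low, high = frequency_split(torch.randn(1, 1, 64, 64))
```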

[101] Masked Face Recognition under Different Backbones

Bo Zhang, Ming Zhang, Kun Wu, Lei Bian, Yi Lin

Main category: cs.CV

TL;DR: Corrections to face recognition paper showing r100 models excel in standard tests, while masked variants and Vision Transformers perform best for masked face recognition in post-pandemic security scenarios.

DetailsMotivation: In the post-pandemic era, high mask usage during civil aviation security checks challenges traditional face recognition models, requiring evaluation of backbone networks for masked face recognition.

Method: Conducted extensive comparative experiments evaluating core backbone networks (r100, r50, r34 series, Vision Transformers) in both standard and masked face recognition tests.

Result: In standard tests: r100 series excelled (98%+ accuracy at 0.01% FAR), r50 ranked second, r34_mask_v1 lagged. In masked tests: r100_mask_v2 led (90.07% accuracy), r50_mask_v3 performed best among r50 but trailed r100, Vision Transformers showed strong masked performance with effectiveness gains.

Conclusion: Different backbone networks have varying impacts on face recognition with and without masks; specific deployment recommendations are provided based on performance in masked vs. standard scenarios.

Abstract: Erratum to the paper (Zhang et al., 2025): corrections to Table IV and the data on Page 3, Section A. In the post-pandemic era, a high proportion of civil aviation passengers wear masks during security checks, posing significant challenges to traditional face recognition models. The backbone network serves as the core component of face recognition models. In standard tests, r100 series models excelled (98%+ accuracy at 0.01% FAR in face comparison, high top1/top5 in search). r50 ranked second, r34_mask_v1 lagged. In masked tests, r100_mask_v2 led (90.07% accuracy), r50_mask_v3 performed best among r50 but trailed r100. Vit-Small/Tiny showed strong masked performance with gains in effectiveness. Through extensive comparative experiments, this paper conducts a comprehensive evaluation of several core backbone networks, aiming to reveal the impacts of different models on face recognition with and without masks, and to provide specific deployment recommendations.

[102] Adaptive Multimodal Person Recognition: A Robust Framework for Handling Missing Modalities

Aref Farhadipour, Teodora Vukovic, Volker Dellwo, Petr Motlicek, Srikanth Madikeri

Main category: cs.CV

TL;DR: Proposed multimodal person identification framework uses gesture as situational enhancer with unified hybrid fusion strategy, achieving 99.51% Top-1 accuracy on CANDOR dataset and 99.92% on VoxCeleb1, robust to missing modalities.

DetailsMotivation: Real-world person identification systems often face missing or degraded modalities (audio, visual, behavioral), requiring robust solutions that can handle incomplete data while maintaining high accuracy.

Method: Multimodal framework using gesture as situational enhancer with unified hybrid fusion strategy combining feature-level and score-level information. Employs multi-task learning for independent modality processing, cross-attention and gated fusion mechanisms, and confidence-weighted strategy for dynamic adaptation to missing data with single classification head.

Result: Achieved 99.51% Top-1 accuracy on CANDOR dataset (newly benchmarked interview-based multimodal dataset) and 99.92% accuracy on VoxCeleb1 dataset in bimodal mode, outperforming conventional approaches. System maintains high accuracy even with one or two unavailable modalities.

Conclusion: The proposed multimodal person identification framework with gesture enhancement and unified hybrid fusion provides robust, high-accuracy performance in real-world conditions with missing modalities, making it suitable for practical person recognition applications.

Abstract: Person identification systems often rely on audio, visual, or behavioral cues, but real-world conditions frequently result in missing or degraded modalities. To address this challenge, we propose a multimodal person identification framework that utilizes gesture as a situational enhancer to supplement traditional modalities like voice and face. Our model employs a unified hybrid fusion strategy, integrating both feature-level and score-level information to maximize representational richness and decision accuracy. Specifically, it leverages multi-task learning to process modalities independently, followed by cross-attention and gated fusion mechanisms. Finally, a confidence-weighted strategy dynamically adapts to missing data, ensuring that our single classification head achieves optimal performance even in unimodal and bimodal scenarios. We evaluate our method on CANDOR, a newly introduced interview-based multimodal dataset, which we benchmark in this work for the first time. Our results demonstrate that the proposed trimodal system achieves 99.51% Top-1 accuracy on person identification tasks. In addition, we evaluate our model on the VoxCeleb1 dataset as a benchmark and reach 99.92% accuracy in bimodal mode, outperforming conventional approaches. Moreover, we show that our system maintains high accuracy even when one or two modalities are unavailable, making it a robust solution for real-world person recognition applications. The code and data for this work are publicly available.
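
Confidence-weighted fusion with missing modalities reduces, in its simplest form, to masking absent modalities and renormalising the weights. A minimal sketch follows; the confidence values are placeholders.

```python
import torch

def confidence_weighted_fusion(scores, conf, available):
    """Fuse per-modality identity scores; absent modalities get zero weight."""
    w = conf * available
    w = w / w.sum(dim=-1, keepdim=True).clamp_min(1e-6)
    return (scores * w.unsqueeze(-1)).sum(dim=1)

scores = torch.randn(2, 3, 100)                        # (batch, modality, identities)
conf = torch.tensor([[0.9, 0.6, 0.8]]).expand(2, -1)   # placeholder confidences
avail = torch.tensor([[1.0, 1.0, 1.0],
                      [1.0, 0.0, 1.0]])                # sample 2 is missing a modality
fused = confidence_weighted_fusion(scores, conf, avail)
```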

[103] Emotion-LLaMAv2 and MMEVerse: A New Framework and Benchmark for Multimodal Emotion Understanding

Xiaojiang Peng, Jingyi Chen, Zebang Cheng, Bao Peng, Fengyi Wu, Yifei Dong, Shuyuan Tu, Qiyu Hu, Huiting Huang, Yuxiang Lin, Jun-Yan He, Kai Wang, Zheng Lian, Zhi-Qi Cheng

Main category: cs.CV

TL;DR: Emotion-LLaMAv2 introduces an end-to-end multimodal emotion reasoning framework with multiview encoding, Conv Attention pre-fusion, and curriculum instruction tuning, accompanied by the MMEVerse benchmark aggregating 12 emotion datasets with 130k training clips.

DetailsMotivation: Current multimodal LLMs lack strong emotional reasoning capabilities due to limited high-quality emotion datasets, absence of standardized benchmarks, and previous frameworks' limitations like explicit face detectors and implicit fusion strategies.

Method: 1) End-to-end multiview encoder eliminates face detection and captures emotional cues via spatial/temporal multiview tokens; 2) Conv Attention pre-fusion module enables local/global multimodal feature interactions; 3) Perception-to-cognition curriculum instruction tuning unifies emotion recognition and reasoning within LLaMA2 backbone.

Result: MMEVerse benchmark aggregates 12 public emotion datasets (IEMOCAP, MELD, DFEW, MAFW, etc.) into unified multimodal instruction format with 130k training clips and 36k testing clips across 18 evaluation benchmarks, re-annotated via multi-agent pipeline (Qwen2 Audio, Qwen2.5 VL, GPT 4o).

Conclusion: The paper presents Emotion-LLaMAv2 and MMEVerse as a comprehensive solution addressing multimodal emotion reasoning limitations through improved architecture design and large-scale standardized benchmarking.

Abstract: Understanding human emotions from multimodal signals poses a significant challenge in affective computing and human-robot interaction. While multimodal large language models (MLLMs) have excelled in general vision-language tasks, their capabilities in emotional reasoning remain limited. The field currently suffers from a scarcity of large-scale datasets with high-quality, descriptive emotion annotations and lacks standardized benchmarks for evaluation. Our preliminary framework, Emotion-LLaMA, pioneered instruction-tuned multimodal learning for emotion reasoning but was restricted by explicit face detectors, implicit fusion strategies, and low-quality training data with limited scale. To address these limitations, we present Emotion-LLaMAv2 and the MMEVerse benchmark, establishing an end-to-end pipeline together with a standardized evaluation setting for emotion recognition and reasoning. Emotion-LLaMAv2 introduces three key advances. First, an end-to-end multiview encoder eliminates external face detection and captures nuanced emotional cues via richer spatial and temporal multiview tokens. Second, a Conv Attention pre-fusion module is designed to enable simultaneous local and global multimodal feature interactions external to the LLM backbone. Third, a perception-to-cognition curriculum instruction tuning scheme within the LLaMA2 backbone unifies emotion recognition and free-form emotion reasoning. To support large-scale training and reproducible evaluation, MMEVerse aggregates twelve publicly available emotion datasets, including IEMOCAP, MELD, DFEW, and MAFW, into a unified multimodal instruction format. The data are re-annotated via a multi-agent pipeline involving Qwen2 Audio, Qwen2.5 VL, and GPT 4o, producing 130k training clips and 36k testing clips across 18 evaluation benchmarks.

[104] VISTA-PATH: An interactive foundation model for pathology image segmentation and quantitative analysis in computational pathology

Peixian Liang, Songhao Li, Shunsuke Koga, Yutong Li, Zahra Alipour, Yucheng Tang, Daguang Xu, Zhi Huang

Main category: cs.CV

TL;DR: VISTA-PATH is an interactive pathology segmentation foundation model that combines visual context, semantic descriptions, and expert feedback for precise multi-class tissue segmentation, outperforming existing models and enabling clinically meaningful analysis.

DetailsMotivation: Current segmentation foundation models treat segmentation as static visual prediction and are poorly aligned with pathology needs. They lack ability to handle heterogeneous tissue structures, incorporate expert feedback, and produce clinically meaningful segmentation.

Method: VISTA-PATH jointly conditions segmentation on visual context, semantic tissue descriptions, and optional expert spatial prompts. It uses a large-scale pathology corpus (VISTA-PATH Data) with 1.6M image-mask-text triplets across 9 organs and 93 tissue classes. Supports human-in-the-loop refinement via sparse bounding-box annotations.

Result: Consistently outperforms existing segmentation foundation models on held-out and external benchmarks. Enables dynamic refinement and improves tissue microenvironment analysis through Tumor Interaction Score (TIS), which shows strong associations with patient survival.

Conclusion: VISTA-PATH elevates pathology segmentation from static prediction to interactive, clinically grounded representation, establishing a foundation model preferred for computational pathology with demonstrated clinical relevance.

Abstract: Accurate semantic segmentation of histopathology images is crucial for quantitative tissue analysis and downstream clinical modeling. Recent segmentation foundation models have improved generalization through large-scale pretraining, yet remain poorly aligned with pathology because they treat segmentation as a static visual prediction task. Here we present VISTA-PATH, an interactive, class-aware pathology segmentation foundation model designed to resolve heterogeneous structures, incorporate expert feedback, and produce pixel-level segmentations that are directly meaningful for clinical interpretation. VISTA-PATH jointly conditions segmentation on visual context, semantic tissue descriptions, and optional expert-provided spatial prompts, enabling precise multi-class segmentation across heterogeneous pathology images. To support this paradigm, we curate VISTA-PATH Data, a large-scale pathology segmentation corpus comprising over 1.6 million image-mask-text triplets spanning 9 organs and 93 tissue classes. Across extensive held-out and external benchmarks, VISTA-PATH consistently outperforms existing segmentation foundation models. Importantly, VISTA-PATH supports dynamic human-in-the-loop refinement by propagating sparse, patch-level bounding-box annotation feedback into whole-slide segmentation. Finally, we show that the high-fidelity, class-aware segmentation produced by VISTA-PATH makes it a preferred model for computational pathology. It improves tissue microenvironment analysis through the proposed Tumor Interaction Score (TIS), which exhibits strong and significant associations with patient survival. Together, these results establish VISTA-PATH as a foundation model that elevates pathology image segmentation from a static prediction to an interactive and clinically grounded representation for digital pathology. Source code and demo can be found at https://github.com/zhihuanglab/VISTA-PATH.

[105] Order from Chaos: Physical World Understanding from Glitchy Gameplay Videos

Meng Cao, Haoran Tang, Haoze Zhao, Mingfei Han, Ruyang Liu, Qiang Sun, Xiaojun Chang, Ian Reid, Xiaodan Liang

Main category: cs.CV

TL;DR: Using gameplay video glitches (visual anomalies that violate physics) as scalable supervision for training multimodal models on physical reasoning, achieving improved real-world and general transferability.

DetailsMotivation: Current MLLMs lack human-level physical understanding. Real-world video datasets are expensive to annotate, while synthetic simulations lack realism and diversity. Gameplay glitches offer a rich, scalable source of physics-violating examples for training.

Method: Proposed PhysGame: a meta-information guided instruction-tuning dataset with 140,057 glitch-centric QA pairs across 5 physical domains and 16 categories. Used gameplay metadata (titles/descriptions) to guide high-quality QA generation. Created GameBench: 880 expert-annotated glitch videos for evaluation.

Result: PhysGame improved Qwen2.5VL by 2.5% on PhysBench (real-world transfer), 1.9% on MVBench (general transfer), and 3.7% on GameBench (glitch detection). Shows gameplay anomalies provide effective scalable supervision for physical reasoning.

Conclusion: Learning from gameplay glitches offers a scalable, effective pathway for advancing physical world understanding in multimodal AI, bridging the gap between synthetic and real-world physical reasoning.

Abstract: Understanding the physical world, including object dynamics, material properties, and causal interactions, remains a core challenge in artificial intelligence. Although recent multi-modal large language models (MLLMs) have demonstrated impressive general reasoning capabilities, they still fall short of achieving human-level understanding of physical principles. Existing datasets for physical reasoning either rely on real-world videos, which incur high annotation costs, or on synthetic simulations, which suffer from limited realism and diversity. In this paper, we propose a novel paradigm that leverages glitches in gameplay videos, referring to visual anomalies that violate predefined physical laws, as a rich and scalable supervision source for physical world understanding. We introduce PhysGame, a meta-information-guided instruction-tuning dataset containing 140,057 glitch-centric question-answer pairs across five physical domains and sixteen fine-grained categories. To ensure data accuracy, we design a prompting strategy that utilizes gameplay metadata such as titles and descriptions to guide high-quality QA generation. Complementing PhysGame, we construct GameBench, an expert-annotated benchmark with 880 glitch-identified gameplay videos designed to evaluate physical reasoning capabilities. Extensive experiments show that PhysGame significantly enhances both Game2Real transferability, improving the real-world physical reasoning performance of Qwen2.5VL by 2.5% on PhysBench, and Game2General transferability, yielding a 1.9% gain on the MVBench benchmark. Moreover, PhysGame-tuned models achieve a 3.7% absolute improvement on GameBench, demonstrating enhanced robustness in detecting physical implausibilities. These results indicate that learning from gameplay anomalies offers a scalable and effective pathway toward advancing physical world understanding in multimodal intelligence.
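
The metadata-guided prompting can be pictured as a template that injects the gameplay title and description before requesting a glitch-centric QA pair. The template below is a guess at the flavour, not the authors' prompt.

```python
def build_prompt(title: str, description: str) -> str:
    """Hypothetical metadata-guided prompt for glitch-centric QA generation."""
    return (
        f"Video title: {title}\n"
        f"Uploader description: {description}\n"
        "The clip contains a physics glitch. Write one multiple-choice question "
        "asking which physical law the glitch violates, with four options and "
        "the correct answer."
    )

print(build_prompt("Ragdoll launches into the sky", "Character clips through a wall"))
```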

[106] Multi-View Consistent Wound Segmentation With Neural Fields

Remi Chierchia, Léo Lebrat, David Ahmedt-Aristizabal, Yulia Arzhaeva, Olivier Salvado, Clinton Fookes, Rodrigo Santa Cruz

Main category: cs.CV

TL;DR: WoundNeRF uses NeRF SDF-based method for 3D wound segmentation from 2D images, outperforming Vision Transformers and rasterisation-based methods.

DetailsMotivation: Wound care faces economic/logistical burdens; computer vision can help with automatic tissue assessment. Current 2D segmentation lacks 3D consistency needed for precise healing progress tracking.

Method: WoundNeRF - a NeRF SDF-based method for estimating robust wound segmentations from automatically generated annotations, enabling multi-view consistent 3D structure inference from 2D images.

Result: Demonstrates potential in recovering accurate segmentations, outperforming state-of-the-art Vision Transformer networks and conventional rasterisation-based algorithms.

Conclusion: NeRF SDF-based approach shows promise for 3D wound segmentation, with code release to facilitate further development in this paradigm.

Abstract: Wound care is often challenged by the economic and logistical burdens that consistently afflict patients and hospitals worldwide. In recent decades, healthcare professionals have sought support from computer vision and machine learning algorithms. In particular, wound segmentation has gained interest due to its ability to provide professionals with fast, automatic tissue assessment from standard RGB images. Some approaches have extended segmentation to 3D, enabling more complete and precise healing progress tracking. However, inferring multi-view consistent 3D structures from 2D images remains a challenge. In this paper, we evaluate WoundNeRF, a NeRF SDF-based method for estimating robust wound segmentations from automatically generated annotations. We demonstrate the potential of this paradigm in recovering accurate segmentations by comparing it against state-of-the-art Vision Transformer networks and conventional rasterisation-based algorithms. The code will be released to facilitate further development in this promising paradigm.
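
SDF-based NeRFs typically render by converting signed distance to volume density. One common choice, the VolSDF-style Laplace-CDF transform, is sketched below as a general illustration of the family rather than WoundNeRF's specific design.

```python
import torch

def sdf_to_density(sdf: torch.Tensor, beta: float = 0.1) -> torch.Tensor:
    """Laplace-CDF transform: density saturates at 1/beta inside the surface
    (sdf < 0) and decays exponentially outside it."""
    s = -sdf
    psi = torch.where(s <= 0,
                      0.5 * torch.exp(s / beta),
                      1.0 - 0.5 * torch.exp(-s / beta))
    return psi / beta

density = sdf_to_density(torch.linspace(-0.5, 0.5, 11))
```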

[107] Expert Knowledge-Guided Decision Calibration for Accurate Fine-Grained Tree Species Classification

Chen Long, Dian Chen, Ruifei Ding, Zhe Chen, Zhen Dong, Bisheng Yang

Main category: cs.CV

TL;DR: EKDC-Net is a lightweight plug-and-play module that uses external domain expert knowledge to improve fine-grained tree species classification, addressing long-tailed distributions and high inter-class similarity issues.

DetailsMotivation: Existing tree species classification methods struggle with long-tailed distributions and high inter-class similarity in limited data, particularly for few-shot or confusing categories. They focus on complex architectures but overlook these fundamental challenges.

Method: Proposes Expert Knowledge-Guided Classification Decision Calibration Network (EKDC-Net) with two modules: 1) Local Prior Guided Knowledge Extraction Module (LPKEM) uses CAM analysis to extract discriminative features, and 2) Uncertainty-Guided Decision Calibration Module (UDCM) dynamically corrects local model decisions based on category and instance-level uncertainty.

Result: Achieves state-of-the-art performance on three benchmark datasets. As a lightweight module, improves backbone accuracy by 6.42% and precision by 11.46% with only 0.08M additional parameters. Also introduces CU-Tree102 dataset covering 102 tree species.

Conclusion: EKDC-Net effectively addresses long-tailed distribution and inter-class similarity challenges in fine-grained tree species classification through expert knowledge guidance and uncertainty-based calibration, while being computationally efficient as a plug-and-play module.

Abstract: Accurate fine-grained tree species classification is critical for forest inventory and biodiversity monitoring. Existing methods predominantly focus on designing complex architectures to fit local data distributions. However, they often overlook the long-tailed distributions and high inter-class similarity inherent in limited data, thereby struggling to distinguish between few-shot or confusing categories. In the process of knowledge dissemination in the human world, individuals actively seek expert assistance to transcend the limitations of local thinking. Inspired by this, we introduce an external “Domain Expert” and propose an Expert Knowledge-Guided Classification Decision Calibration Network (EKDC-Net) to overcome these challenges. Our framework addresses two core issues: expert knowledge extraction and utilization. Specifically, we first develop a Local Prior Guided Knowledge Extraction Module (LPKEM). By leveraging Class Activation Map (CAM) analysis, LPKEM guides the domain expert to focus exclusively on discriminative features essential for classification. Subsequently, to effectively integrate this knowledge, we design an Uncertainty-Guided Decision Calibration Module (UDCM). This module dynamically corrects the local model’s decisions by considering both overall category uncertainty and instance-level prediction uncertainty. Furthermore, we present a large-scale classification dataset covering 102 tree species, named CU-Tree102, to address the issue of scarce diversity in current benchmarks. Experiments on three benchmark datasets demonstrate that our approach achieves state-of-the-art performance. Crucially, as a lightweight plug-and-play module, EKDC-Net improves backbone accuracy by 6.42% and precision by 11.46% using only 0.08M additional learnable parameters. The dataset, code, and pre-trained models are available at https://github.com/WHU-USI3DV/TreeCLS.
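
One simple way to realise uncertainty-guided calibration is to blend local and expert logits by the local model's normalised predictive entropy. The sketch below is such a stand-in, not the paper's UDCM.

```python
import math
import torch
import torch.nn.functional as F

def calibrate(local_logits, expert_logits):
    """Trust the expert more exactly where the local model is uncertain."""
    p = F.softmax(local_logits, dim=-1)
    entropy = -(p * p.clamp_min(1e-8).log()).sum(dim=-1, keepdim=True)
    u = entropy / math.log(local_logits.shape[-1])  # normalised to [0, 1]
    return (1 - u) * local_logits + u * expert_logits

out = calibrate(torch.randn(4, 102), torch.randn(4, 102))  # 102 tree species
```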

[108] SALAD: Achieve High-Sparsity Attention via Efficient Linear Attention Tuning for Video Diffusion Transformer

Tongcheng Fang, Hanling Zhang, Ruiqi Xie, Zhuo Han, Xin Tao, Tianchen Zhao, Pengfei Wan, Wenbo Ding, Wanli Ouyang, Xuefei Ning, Yu Wang

Main category: cs.CV

TL;DR: SALAD introduces a lightweight linear attention branch with input-dependent gating to achieve 90% sparsity and 1.72x inference speedup for Diffusion Transformers in video generation while maintaining quality comparable to full attention.

Motivation: Diffusion Transformers for video generation suffer from high computational latency due to quadratic complexity of full attention with long input sequences. Existing sparse attention methods either offer limited acceleration (training-free) or require substantial data/computation (training-based).

Method: SALAD adds a lightweight linear attention branch parallel to sparse attention, with an input-dependent gating mechanism to balance both branches. This achieves high sparsity (90%) while maintaining quality through efficient finetuning.
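
As a rough illustration of the dual-branch design, a per-token sigmoid gate can mix the two attention outputs; SALAD's actual sparse and linear attention kernels are abstracted here as plain callables.

```python
# A minimal sketch of the gated dual-branch combination, assuming a per-token
# sigmoid gate computed from the input; the real kernels may differ.
import torch
import torch.nn as nn

class GatedDualAttention(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(dim, dim), nn.Sigmoid())

    def forward(self, x, sparse_attn, linear_attn):
        """x: (B, N, D); sparse_attn/linear_attn map (B, N, D) -> (B, N, D)."""
        g = self.gate(x)  # input-dependent gate balancing the two branches
        return g * sparse_attn(x) + (1 - g) * linear_attn(x)
```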

Result: Achieves 90% sparsity and 1.72x inference speedup while maintaining generation quality comparable to full attention baseline. Finetuning requires only 2,000 video samples and 1,600 training steps with batch size 8.

Conclusion: SALAD provides an efficient solution to accelerate Diffusion Transformers for video generation, achieving high sparsity and speedup with minimal training requirements while preserving generation quality.

Abstract: Diffusion Transformers have recently demonstrated remarkable performance in video generation. However, the long input sequences result in high computational latency due to the quadratic complexity of full attention. Various sparse attention mechanisms have been proposed. Training-free sparse attention is constrained by limited sparsity and thus offers modest acceleration, whereas training-based methods can reach much higher sparsity but demand substantial data and computation for training. In this work, we propose SALAD, introducing a lightweight linear attention branch in parallel with the sparse attention. By incorporating an input-dependent gating mechanism to finely balance the two branches, our method attains 90% sparsity and 1.72x inference speedup, while maintaining generation quality comparable to the full attention baseline. Moreover, our finetuning process is highly efficient, requiring only 2,000 video samples and 1,600 training steps with a batch size of 8.

[109] TangramPuzzle: Evaluating Multimodal Large Language Models with Compositional Spatial Reasoning

Daixian Liu, Jiayi Kuang, Yinghui Li, Yangning Li, Di Yin, Haoyu Cao, Xing Sun, Ying Shen, Hai-Tao Zheng, Liang Lin, Philip S. Yu

Main category: cs.CV

TL;DR: TangramPuzzle benchmark evaluates MLLMs’ compositional spatial reasoning using Tangram games with precise geometric constraints, revealing models prioritize silhouette matching over geometric accuracy.

Motivation: Current MLLMs lack rigorous evaluation for precise compositional spatial reasoning; existing benchmarks use simple tasks with semantic approximations and lack mathematical rigor.

Method: Introduces TangramPuzzle benchmark with Tangram Construction Expression (TCE) framework for exact coordinate specifications, plus two tasks: Outline Prediction and End-to-End Code Generation.

Result: Evaluation of advanced MLLMs shows they prioritize matching target silhouettes while neglecting geometric constraints, leading to piece distortions and deformations.

Conclusion: MLLMs need improved geometric reasoning capabilities; TangramPuzzle provides rigorous benchmark for evaluating compositional spatial reasoning with precise mathematical grounding.

Abstract: Multimodal Large Language Models (MLLMs) have achieved remarkable progress in visual recognition and semantic understanding. Nevertheless, their ability to perform precise compositional spatial reasoning remains largely unexplored. Existing benchmarks often involve relatively simple tasks and rely on semantic approximations or coarse relative positioning, while their evaluation metrics are typically limited and lack rigorous mathematical formulations. To bridge this gap, we introduce TangramPuzzle, a geometry-grounded benchmark designed to evaluate compositional spatial reasoning through the lens of the classic Tangram game. We propose the Tangram Construction Expression (TCE), a symbolic geometric framework that grounds tangram assemblies in exact, machine-verifiable coordinate specifications, to mitigate the ambiguity of visual approximation. We design two complementary tasks: Outline Prediction, which demands inferring global shapes from local components, and End-to-End Code Generation, which requires solving inverse geometric assembly problems. We conduct extensive evaluation experiments on advanced open-source and proprietary models, revealing an interesting insight: MLLMs tend to prioritize matching the target silhouette while neglecting geometric constraints, leading to distortions or deformations of the pieces.

[110] AnchoredDream: Zero-Shot 360° Indoor Scene Generation from a Single View via Geometric Grounding

Runmao Yao, Junsheng Zhou, Zhen Dong, Yu-Shen Liu

Main category: cs.CV

TL;DR: AnchoredDream: A zero-shot pipeline for generating complete 360° indoor scenes from single images using appearance-geometry mutual boosting and geometric anchoring.

Motivation: Single-view 360° scene generation is crucial for real-world applications but remains highly ill-posed. Existing methods struggle with appearance consistency and geometric plausibility under large viewpoint changes, limiting full-scene generation effectiveness.

Method: Proposes AnchoredDream with appearance-guided geometry generation to create reliable 3D scene layout, followed by progressive generation modules: warp-and-inpaint, warp-and-refine, post-optimization, and a novel Grouting Block for seamless transitions between input and generated regions.

Result: Extensive experiments show AnchoredDream outperforms existing methods by a large margin in both appearance consistency and geometric plausibility, all in a zero-shot manner.

Conclusion: The results highlight the potential of geometric grounding for high-quality, zero-shot single-view scene generation, demonstrating that anchoring on high-fidelity geometry enables better full-scene generation.

Abstract: Single-view indoor scene generation plays a crucial role in a range of real-world applications. However, generating a complete 360° scene from a single image remains a highly ill-posed and challenging problem. Recent approaches have made progress by leveraging diffusion models and depth estimation networks, yet they still struggle to maintain appearance consistency and geometric plausibility under large viewpoint changes, limiting their effectiveness in full-scene generation. To address this, we propose AnchoredDream, a novel zero-shot pipeline that anchors 360° scene generation on high-fidelity geometry via an appearance-geometry mutual boosting mechanism. Given a single-view image, our method first performs appearance-guided geometry generation to construct a reliable 3D scene layout. Then, we progressively generate the complete scene through a series of modules: warp-and-inpaint, warp-and-refine, post-optimization, and a novel Grouting Block, which ensures seamless transitions between the input view and generated regions. Extensive experiments demonstrate that AnchoredDream outperforms existing methods by a large margin in both appearance consistency and geometric plausibility, all in a zero-shot manner. Our results highlight the potential of geometric grounding for high-quality, zero-shot single-view scene generation.

[111] OnlineSI: Taming Large Language Model for Online 3D Understanding and Grounding

Zixian Liu, Zhaoxi Chen, Liang Pan, Ziwei Liu

Main category: cs.CV

TL;DR: OnlineSI enables MLLMs to continuously improve spatial understanding from video streams using finite spatial memory and 3D point cloud integration for real-world embodied systems.

Motivation: Existing MLLM methods lack continuous spatial understanding in changing environments and real-world deployment capabilities for embodied systems.

Method: Proposes OnlineSI framework with finite spatial memory to retain past observations without increasing computation, and integrates 3D point clouds with semantic information for better object localization.
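
A minimal sketch of what such a finite spatial memory could look like, assuming simple oldest-first eviction at a fixed capacity; the paper's actual retention policy is not described in this summary.

```python
# A minimal sketch of a finite spatial memory; the fixed-capacity deque with
# oldest-first eviction is an assumption for illustration.
from collections import deque

class SpatialMemory:
    def __init__(self, capacity: int = 64):
        self.buffer = deque(maxlen=capacity)  # eviction keeps cost bounded

    def add(self, frame_feature, points):
        """Store one observation: an image feature plus its 3D point cloud."""
        self.buffer.append({"feat": frame_feature, "points": points})

    def context(self):
        """Bounded context handed to the MLLM at every inference step, so
        per-step computation does not grow with the stream length."""
        return list(self.buffer)
```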

Result: Introduces Fuzzy F1-Score to handle ambiguity, tests on two datasets, and demonstrates effectiveness for real-world embodied systems.

Conclusion: OnlineSI enables continuous spatial understanding from video streams, paving the way for practical deployment of MLLMs in real-world embodied systems.

Abstract: In recent years, researchers have increasingly been interested in how to enable Multimodal Large Language Models (MLLMs) to possess spatial understanding and reasoning capabilities. However, most existing methods overlook the importance of the ability to continuously work in an ever-changing world, and lack the possibility of deployment on embodied systems in real-world environments. In this work, we introduce OnlineSI, a framework that can continuously improve its spatial understanding of its surroundings given a video stream. Our core idea is to maintain a finite spatial memory to retain past observations, ensuring the computation required for each inference does not increase as the input accumulates. We further integrate 3D point cloud information with semantic information, helping MLLM to better locate and identify objects in the scene. To evaluate our method, we introduce the Fuzzy $F_1$-Score to mitigate ambiguity, and test our method on two representative datasets. Experiments demonstrate the effectiveness of our method, paving the way towards real-world embodied systems.

[112] Semi-Supervised Hierarchical Open-Set Classification

Erik Wallin, Fredrik Kahl, Lars Hammarstrand

Main category: cs.CV

TL;DR: Semi-supervised hierarchical open-set classification using teacher-student framework with subtree pseudo-labels and age-gating to handle unknown classes in uncurated datasets.

Motivation: To extend hierarchical open-set classification to the semi-supervised setting, enabling the use of large-scale uncurated datasets containing both known and unknown classes to improve performance.

Method: Proposed teacher-student framework based on pseudo-labeling with two key components: 1) subtree pseudo-labels for reliable supervision with unknown data, and 2) age-gating mechanism to mitigate pseudo-label overconfidence.
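
Age-gating can be read as a weight that decays with the age of a pseudo-label; the linear schedule below is an assumption for illustration, not the paper's exact mechanism.

```python
# A minimal sketch of age-gating, assuming the pseudo-label weight decays
# linearly with the number of epochs since the teacher assigned it.
def age_gate(age_epochs: int, max_age: int = 10) -> float:
    """Down-weight stale pseudo-labels so the student does not lock onto
    early, possibly overconfident teacher predictions."""
    return max(0.0, 1.0 - age_epochs / max_age)

print([age_gate(a) for a in (0, 3, 12)])  # [1.0, 0.7, 0.0]
```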

Result: Outperforms self-supervised pretraining followed by supervised adaptation, and matches fully supervised counterpart with only 20 labeled samples per class on iNaturalist19 benchmark.

Conclusion: The proposed semi-supervised hierarchical open-set classification framework effectively leverages uncurated datasets with unknown classes, achieving strong performance with minimal labeled data.

Abstract: Hierarchical open-set classification handles previously unseen classes by assigning them to the most appropriate high-level category in a class taxonomy. We extend this paradigm to the semi-supervised setting, enabling the use of large-scale, uncurated datasets containing a mixture of known and unknown classes to improve the hierarchical open-set performance. To this end, we propose a teacher-student framework based on pseudo-labeling. Two key components are introduced: 1) subtree pseudo-labels, which provide reliable supervision in the presence of unknown data, and 2) age-gating, a mechanism that mitigates overconfidence in pseudo-labels. Experiments show that our framework outperforms self-supervised pretraining followed by supervised adaptation, and even matches the fully supervised counterpart when using only 20 labeled samples per class on the iNaturalist19 benchmark. Our code is available at https://github.com/walline/semihoc.

[113] HA2F: Dual-module Collaboration-Guided Hierarchical Adaptive Aggregation Framework for Remote Sensing Change Detection

Shuying Li, Yuchen Wang, San Zhang, Chuang Yang

Main category: cs.CV

TL;DR: HA2F is a hierarchical adaptive aggregation framework for remote sensing change detection that addresses feature alignment deviations and noise sensitivity through dual-module collaboration.

Motivation: Existing remote sensing change detection methods suffer from cross-temporal feature matching deviations and sensitivity to radiometric/geometric noise, limiting their effectiveness in identifying land cover changes.

Method: Proposes HA2F with two modules: 1) Dynamic Hierarchical Feature Calibration Module (DHFCM) that fuses adjacent-level features through perceptual feature selection to address multi-temporal alignment deviations, and 2) Noise-Adaptive Feature Refinement Module (NAFRM) that uses dual feature selection to highlight change-sensitive regions and generate spatial masks to suppress irrelevant regions/shadows.

Result: Achieves state-of-the-art performance on LEVIR-CD, WHU-CD, and SYSU-CD datasets, surpassing existing methods in both precision metrics and computational efficiency. Ablation studies confirm the effectiveness of both modules.

Conclusion: HA2F effectively addresses feature alignment and noise issues in remote sensing change detection through its hierarchical adaptive aggregation framework, demonstrating superior performance and efficiency across multiple benchmark datasets.

Abstract: Remote sensing change detection (RSCD) aims to identify the spatio-temporal changes of land cover, providing critical support for multi-disciplinary applications (e.g., environmental monitoring, disaster assessment, and climate change studies). Existing methods either extract features from localized patches or process entire images holistically, which leads to cross-temporal feature-matching deviations and sensitivity to radiometric and geometric noise. To address these issues, we propose a dual-module collaboration guided hierarchical adaptive aggregation framework, namely HA2F, which consists of a dynamic hierarchical feature calibration module (DHFCM) and a noise-adaptive feature refinement module (NAFRM). The former dynamically fuses adjacent-level features through perceptual feature selection, suppressing irrelevant discrepancies to address multi-temporal feature alignment deviations. The NAFRM utilizes a dual feature selection mechanism to highlight change-sensitive regions and generate spatial masks, suppressing the interference of irrelevant regions or shadows. Extensive experiments verify the effectiveness of the proposed HA2F, which achieves state-of-the-art performance on the LEVIR-CD, WHU-CD, and SYSU-CD datasets, surpassing existing comparative methods in terms of both precision metrics and computational efficiency. In addition, ablation experiments show that DHFCM and NAFRM are effective. The official HA2F code is available at https://huggingface.co/InPeerReview/RemoteSensingChangeDetection-RSCD.HA2F

[114] X-Aligner: Composed Visual Retrieval without the Bells and Whistles

Yuqian Zheng, Mariana-Iuliana Georgescu

Main category: cs.CV

TL;DR: A novel two-stage CoVR framework using VLMs with X-Aligner cross-attention module and visual query captions achieves SOTA performance on Webvid-CoVR and strong zero-shot generalization on CIR tasks.

Motivation: Existing CoVR frameworks fuse multimodal inputs in a single stage with only marginal gains over baselines, lacking effective progressive fusion and alignment of visual-textual queries with target videos.

Method: Proposes a CoVR framework leveraging VLMs with novel X-Aligner cross-attention module for progressive fusion of visual/textual inputs and alignment with target videos. Incorporates visual query captions as additional input. Uses two-stage training: first stage trains only new modules, second stage fine-tunes textual encoder. Implemented on BLIP-family architectures.

Result: Achieves state-of-the-art performance with Recall@1 of 63.93% on Webvid-CoVR-Test. Demonstrates strong zero-shot generalization on CIR datasets CIRCO and Fashion-IQ.

Conclusion: The proposed framework effectively addresses limitations of single-stage fusion in CoVR, achieving superior performance through progressive multimodal fusion and two-stage training while maintaining pretrained VLM representations.

Abstract: Composed Video Retrieval (CoVR) facilitates video retrieval by combining visual and textual queries. However, existing CoVR frameworks typically fuse multimodal inputs in a single stage, achieving only marginal gains over the initial baseline. To address this, we propose a novel CoVR framework that leverages the representational power of Vision Language Models (VLMs). Our framework incorporates a novel cross-attention module X-Aligner, composed of cross-attention layers that progressively fuse visual and textual inputs and align their multimodal representation with that of the target video. To further enhance the representation of the multimodal query, we incorporate the caption of the visual query as an additional input. The framework is trained in two stages to preserve the pretrained VLM representation. In the first stage, only the newly introduced module is trained, while in the second stage, the textual query encoder is also fine-tuned. We implement our framework on top of BLIP-family architectures, namely BLIP and BLIP-2, and train it on the Webvid-CoVR data set. In addition to in-domain evaluation on Webvid-CoVR-Test, we perform zero-shot evaluations on the Composed Image Retrieval (CIR) data sets CIRCO and Fashion-IQ. Our framework achieves state-of-the-art performance on CoVR obtaining a Recall@1 of 63.93% on Webvid-CoVR-Test, and demonstrates strong zero-shot generalization on CIR tasks.

[115] A Lightweight Medical Image Classification Framework via Self-Supervised Contrastive Learning and Quantum-Enhanced Feature Modeling

Jingsong Xia, Siqi Wang

Main category: cs.CV

TL;DR: Lightweight medical image classification framework combining self-supervised contrastive learning with quantum-enhanced feature modeling for resource-constrained settings.

Motivation: Address challenges in medical image analysis: scarce annotations, constrained computational resources, and suboptimal model generalization.

Method: MobileNetV2 backbone pretrained with SimCLR-style self-supervised learning on unlabeled images, then integrated with lightweight parameterized quantum circuit (PQC) as quantum feature enhancement module, fine-tuned on limited labeled data.

Result: Method with only 2-3 million parameters outperforms classical baselines without self-supervised learning or quantum enhancement in Accuracy, AUC, and F1-score, with improved discriminability and representation stability.

Conclusion: Provides practical and forward-looking solution for high-performance medical AI under resource-constrained settings.

Abstract: Intelligent medical image analysis is essential for clinical decision support but is often limited by scarce annotations, constrained computational resources, and suboptimal model generalization. To address these challenges, we propose a lightweight medical image classification framework that integrates self-supervised contrastive learning with quantum-enhanced feature modeling. MobileNetV2 is employed as a compact backbone and pretrained using a SimCLR-style self-supervised paradigm on unlabeled images. A lightweight parameterized quantum circuit (PQC) is embedded as a quantum feature enhancement module, forming a hybrid classical-quantum architecture, which is subsequently fine-tuned on limited labeled data. Experimental results demonstrate that, with only approximately 2-3 million parameters and low computational cost, the proposed method consistently outperforms classical baselines without self-supervised learning or quantum enhancement in terms of Accuracy, AUC, and F1-score. Feature visualization further indicates improved discriminability and representation stability. Overall, this work provides a practical and forward-looking solution for high-performance medical artificial intelligence under resource-constrained settings.

[116] Boundary and Position Information Mining for Aerial Small Object Detection

Rongxin Huang, Guangfeng Lin, Wenbo Zhou, Zhirong Li, Wenhuan Wu

Main category: cs.CV

TL;DR: BPIM framework improves small object detection in UAV imagery by integrating boundary, position, and scale information through attention mechanisms and cross-scale feature fusion.

Motivation: UAV applications face challenges in accurately detecting small objects due to imbalanced scales and blurred edges, requiring better integration of boundary and position information.

Method: Proposed BPIM framework includes PIG module for location information, BIG module for object edges, CSF module for shallow feature assembly, TFF module for combining position/boundary information, and AWF module for deep semantic feature fusion.

Result: BPIM outperforms baseline Yolov5-P2 and achieves promising performance compared to state-of-the-art methods on VisDrone2021, DOTA1.0, and WiderPerson datasets with comparable computation load.

Conclusion: BPIM effectively addresses small object detection challenges in UAV imagery by mining boundary and position information, demonstrating superior performance through integrated attention mechanisms and feature fusion strategies.

Abstract: Unmanned Aerial Vehicle (UAV) applications have become increasingly prevalent in aerial photography and object recognition. However, accurately capturing small targets in object detection remains a major challenge due to imbalanced scales and blurred edges. To address these issues, a boundary and position information mining (BPIM) framework is proposed for capturing object edge and location cues. The proposed BPIM includes a position information guidance (PIG) module for obtaining location information, a boundary information guidance (BIG) module for extracting object edges, a cross scale fusion (CSF) module for gradually assembling shallow-layer image features, a three feature fusion (TFF) module for progressively combining position and boundary information, and an adaptive weight fusion (AWF) module for flexibly merging deep-layer semantic features. Therefore, BPIM can integrate boundary, position, and scale information in an image for small object detection using attention mechanisms and cross-scale feature fusion strategies. Furthermore, BPIM not only improves the discrimination of contextual features by adaptive weight fusion with boundaries, but also enhances small object perception by cross-scale position fusion. On the VisDrone2021, DOTA1.0, and WiderPerson datasets, experimental results show that BPIM outperforms the baseline Yolov5-P2 and achieves promising performance among state-of-the-art methods with a comparable computation load.

[117] SCHIGAND: A Synthetic Facial Generation Mode Pipeline

Ananya Kadali, Sunnie Jehan-Morrison, Orasiki Wellington, Barney Evans, Precious Durojaiye, Richard Guest

Main category: cs.CV

TL;DR: SCHIGAND is a synthetic face generation pipeline that combines StyleCLIP, HyperStyle, InterfaceGAN, and Diffusion models to create realistic, diverse, and identity-preserving facial datasets for biometric testing.

Motivation: Growing demand for facial datasets faces challenges from privacy regulations, data scarcity, and ethical concerns. Existing generative models struggle to balance realism, diversity, and identity preservation needed for biometric applications.

Method: SCHIGAND integrates multiple state-of-the-art models: StyleCLIP for text-guided generation, HyperStyle for style manipulation, InterfaceGAN for attribute control, and Diffusion models for high-quality image synthesis. The pipeline enhances identity preservation while generating realistic intra-class variations and maintaining inter-class distinctiveness.

Result: Experimental evaluation using ArcFace facial verification model shows SCHIGAND achieves a balance between image quality and diversity. The generated datasets perform comparably to real-world facial datasets, addressing key limitations of prior generative models.

Conclusion: SCHIGAND demonstrates potential to supplement or replace real data for facial biometric applications, offering privacy-compliant and scalable solutions for synthetic dataset generation in biometric testing.

Abstract: The growing demand for diverse and high-quality facial datasets for training and testing biometric systems is challenged by privacy regulations, data scarcity, and ethical concerns. Synthetic facial images offer a potential solution, yet existing generative models often struggle to balance realism, diversity, and identity preservation. This paper presents SCHIGAND, a novel synthetic face generation pipeline integrating StyleCLIP, HyperStyle, InterfaceGAN, and Diffusion models to produce highly realistic and controllable facial datasets. SCHIGAND enhances identity preservation while generating realistic intra-class variations and maintaining inter-class distinctiveness, making it suitable for biometric testing. The generated datasets were evaluated using ArcFace, a leading facial verification model, to assess their effectiveness in comparison to real-world facial datasets. Experimental results demonstrate that SCHIGAND achieves a balance between image quality and diversity, addressing key limitations of prior generative models. This research highlights the potential of SCHIGAND to supplement and, in some cases, replace real data for facial biometric applications, paving the way for privacy-compliant and scalable solutions in synthetic dataset generation.

[118] Edge-Aware Image Manipulation via Diffusion Models with a Novel Structure-Preservation Loss

Minsu Gong, Nuri Ryu, Jungseul Ok, Sunghyun Cho

Main category: cs.CV

TL;DR: Proposes Structure Preservation Loss (SPL) using local linear models to maintain edge structures in latent diffusion model-based image editing, with training-free integration and additional techniques for better results.

Motivation: Current latent diffusion models (LDMs) struggle to maintain pixel-level edge structures during text-prompt-driven image editing, which is crucial for photorealistic style transfer and image tone adjustment tasks.

Method: Introduces Structure Preservation Loss (SPL) that uses local linear models to quantify structural differences between input and edited images. The training-free approach integrates SPL into the diffusion process, plus post-processing to reduce LDM decoding distortions, masking for edit localization, and color preservation loss.
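
SPL can be sketched in guided-filter style: fit a local affine model from the input to the edited image and penalize the residual the model cannot explain. This is one plausible instantiation of a local-linear-model loss, not necessarily the paper's exact formulation.

```python
# A minimal sketch of a local-linear structure loss in the spirit of SPL,
# assuming a guided-filter-style fit: within each window the edited image
# should be an affine function of the input, so edges stay aligned.
import torch.nn.functional as F

def box(x, k=7):
    """Local window mean via average pooling (same spatial size)."""
    return F.avg_pool2d(x, k, stride=1, padding=k // 2,
                        count_include_pad=False)

def structure_preservation_loss(inp, out, k=7, eps=1e-4):
    """inp, out: (B, C, H, W) images in [0, 1]."""
    mi, mo = box(inp, k), box(out, k)
    cov = box(inp * out, k) - mi * mo  # local input/output covariance
    var = box(inp * inp, k) - mi * mi  # local input variance
    a = cov / (var + eps)              # per-pixel slope of the linear model
    b = mo - a * mi                    # per-pixel intercept
    # residual = structure in `out` not explained linearly by `inp`
    return F.l1_loss(out, a * inp + b)
```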

Result: SPL enhances structural fidelity and achieves state-of-the-art performance in latent-diffusion-based image editing, as confirmed by experiments.

Conclusion: The proposed SPL effectively addresses structural preservation challenges in LDM-based editing through a novel loss function and complementary techniques, delivering improved editing quality.

Abstract: Recent advances in image editing leverage latent diffusion models (LDMs) for versatile, text-prompt-driven edits across diverse tasks. Yet, maintaining pixel-level edge structures, crucial for tasks such as photorealistic style transfer or image tone adjustment, remains a challenge for latent-diffusion-based editing. To overcome this limitation, we propose a novel Structure Preservation Loss (SPL) that leverages local linear models to quantify structural differences between input and edited images. Our training-free approach integrates SPL directly into the diffusion model’s generative process to ensure structural fidelity. This core mechanism is complemented by a post-processing step to mitigate LDM decoding distortions, a masking strategy for precise edit localization, and a color preservation loss to preserve hues in unedited areas. Experiments confirm SPL enhances structural fidelity, delivering state-of-the-art performance in latent-diffusion-based image editing. Our code will be publicly released at https://github.com/gongms00/SPL.

[119] Reliable Brain Tumor Segmentation Based on Spiking Neural Networks with Efficient Training

Aurora Pia Ghiardelli, Guangzhi Tang, Tao Sun

Main category: cs.CV

TL;DR: A reliable, energy-efficient 3D brain tumor segmentation framework using spiking neural networks with multi-view ensemble for uncertainty estimation and Forward Propagation Through Time for computational efficiency.

Motivation: To develop a reliable and energy-efficient brain tumor segmentation method suitable for medical IoT and Point-of-Care systems, addressing the need for low-power computation while maintaining accuracy and providing uncertainty estimation.

Method: Uses spiking neural networks with multi-view ensemble (sagittal, coronal, axial views) for voxel-wise uncertainty estimation, and employs Forward Propagation Through Time (FPTT) to reduce computational cost while maintaining temporal learning efficiency.
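
One standard way to realize voxel-wise uncertainty from a three-view ensemble is the predictive entropy of the averaged softmax; the sketch below assumes this, though the paper may use a different measure.

```python
# A minimal sketch of voxel-wise uncertainty from the three-view ensemble,
# assuming predictive entropy of the averaged softmax.
import torch

def ensemble_predict(logits_sag, logits_cor, logits_ax):
    """Each input: (B, C, D, H, W) logits from one anatomical-view model."""
    views = (logits_sag, logits_cor, logits_ax)
    probs = torch.stack([l.softmax(dim=1) for l in views]).mean(dim=0)
    entropy = -(probs * probs.clamp_min(1e-12).log()).sum(dim=1)
    return probs, entropy  # mean segmentation and (B, D, H, W) uncertainty
```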

Result: Achieves competitive accuracy on BraTS 2017 and BraTS 2023 datasets, provides well-calibrated uncertainty estimation, and reduces FLOPs by 87% compared to conventional methods.

Conclusion: The framework demonstrates the potential of SNNs for reliable, low-power medical applications, offering energy-efficient brain tumor segmentation with uncertainty quantification suitable for IoT and Point-of-Care systems.

Abstract: We propose a reliable and energy-efficient framework for 3D brain tumor segmentation using spiking neural networks (SNNs). A multi-view ensemble of sagittal, coronal, and axial SNN models provides voxel-wise uncertainty estimation and enhances segmentation robustness. To address the high computational cost in training SNN models for semantic image segmentation, we employ Forward Propagation Through Time (FPTT), which maintains temporal learning efficiency with significantly reduced computational cost. Experiments on the Multimodal Brain Tumor Segmentation Challenges (BraTS 2017 and BraTS 2023) demonstrate competitive accuracy, well-calibrated uncertainty, and an 87% reduction in FLOPs, underscoring the potential of SNNs for reliable, low-power medical IoT and Point-of-Care systems.

[120] ReWeaver: Towards Simulation-Ready and Topology-Accurate Garment Reconstruction

Ming Li, Hui Shan, Kai Zheng, Chentao Shen, Siyu Liu, Yanwei Fu, Zhen Chen, Xiangru Huang

Main category: cs.CV

TL;DR: ReWeaver is a framework for topology-accurate 3D garment and sewing pattern reconstruction from sparse multi-view RGB images, enabling high-fidelity physical simulation.

Motivation: Existing garment reconstruction methods use unstructured representations (like 3D Gaussian Splats) that struggle with accurate garment topology and sewing structures, making them unsuitable for physical simulation. There's a need for structured representations that can bridge the sim-to-real gap for applications like digital avatars, virtual try-on, and robotic manipulation.

Method: ReWeaver predicts seams and panels with their connectivities in both 2D UV space and 3D space from as few as four input views. The framework uses a large-scale synthetic dataset GCD-TS (over 100,000 samples) with multi-view RGB images, 3D garment geometries, textured human body meshes, and annotated sewing patterns for training.

Result: ReWeaver consistently outperforms existing methods in topology accuracy, geometry alignment, and seam-panel consistency. The reconstructed outputs provide structured 2D-3D garment representations suitable for 3D perception, high-fidelity physical simulation, and robotic manipulation.

Conclusion: ReWeaver enables accurate 3D garment reconstruction with proper topology and sewing structures from sparse multi-view images, addressing the limitations of existing unstructured approaches and making the outputs suitable for physical simulation applications.

Abstract: High-quality 3D garment reconstruction plays a crucial role in mitigating the sim-to-real gap in applications such as digital avatars, virtual try-on and robotic manipulation. However, existing garment reconstruction methods typically rely on unstructured representations, such as 3D Gaussian Splats, struggling to provide accurate reconstructions of garment topology and sewing structures. As a result, the reconstructed outputs are often unsuitable for high-fidelity physical simulation. We propose ReWeaver, a novel framework for topology-accurate 3D garment and sewing pattern reconstruction from sparse multi-view RGB images. Given as few as four input views, ReWeaver predicts seams and panels as well as their connectivities in both the 2D UV space and the 3D space. The predicted seams and panels align precisely with the multi-view images, yielding structured 2D–3D garment representations suitable for 3D perception, high-fidelity physical simulation, and robotic manipulation. To enable effective training, we construct a large-scale dataset GCD-TS, comprising multi-view RGB images, 3D garment geometries, textured human body meshes and annotated sewing patterns. The dataset contains over 100,000 synthetic samples covering a wide range of complex geometries and topologies. Extensive experiments show that ReWeaver consistently outperforms existing methods in terms of topology accuracy, geometry alignment and seam-panel consistency.

[121] Affinity Contrastive Learning for Skeleton-based Human Activity Understanding

Hongda Liu, Yunfan Liu, Min Ren, Lin Sui, Yunlong Wang, Zhenan Sun

Main category: cs.CV

TL;DR: ACLNet introduces affinity contrastive learning for skeleton-based human activity understanding, using activity superclasses and dynamic temperature scheduling to improve feature discrimination.

Motivation: Existing contrastive learning methods for skeleton-based activity understanding fail to exploit structural inter-class similarities and overlook the impact of anomalous positive samples, limiting feature discrimination.

Method: Proposes ACLNet with: 1) affinity metric to refine similarity measurements and form activity superclasses, 2) dynamic temperature schedule to adaptively adjust penalty strength for different superclasses, and 3) margin-based contrastive strategy to improve separation of hard positive/negative samples.
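
A hedged sketch of how the loss ingredients could combine: a per-superclass temperature and a margin on positive pairs; ACLNet's affinity metric and its exact temperature schedule are not reproduced here.

```python
# A minimal sketch of a margin-based contrastive loss with per-superclass
# temperatures; the superclass assignment itself is taken as given.
import torch

def margin_contrastive(z, labels, tau_per_class, margin=0.1):
    """z: (B, D) L2-normalized features; labels: (B,) superclass ids;
    tau_per_class: (num_superclasses,) temperatures."""
    sim = z @ z.t()                                   # cosine similarity
    pos = labels.unsqueeze(0) == labels.unsqueeze(1)  # same-superclass mask
    sim = sim - margin * pos.float()                  # shrink positives
    sim = sim / tau_per_class[labels].unsqueeze(1)    # row-wise temperature
    sim.fill_diagonal_(float("-inf"))                 # drop self-pairs
    logp = sim.log_softmax(dim=1)
    pos.fill_diagonal_(False)
    per_anchor = logp.masked_fill(~pos, 0.0).sum(1) / pos.sum(1).clamp_min(1)
    return -per_anchor.mean()
```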

Result: Extensive experiments on NTU RGB+D 60, NTU RGB+D 120, Kinetics-Skeleton, PKU-MMD, FineGYM, and CASIA-B demonstrate superiority in skeleton-based action recognition, gait recognition, and person re-identification.

Conclusion: ACLNet effectively addresses limitations of existing contrastive learning approaches by exploiting structural inter-class relationships and handling anomalous samples, leading to improved performance across multiple skeleton-based activity understanding tasks.

Abstract: In skeleton-based human activity understanding, existing methods often adopt the contrastive learning paradigm to construct a discriminative feature space. However, many of these approaches fail to exploit the structural inter-class similarities and overlook the impact of anomalous positive samples. In this study, we introduce ACLNet, an Affinity Contrastive Learning Network that explores the intricate clustering relationships among human activity classes to improve feature discrimination. Specifically, we propose an affinity metric to refine similarity measurements, thereby forming activity superclasses that provide more informative contrastive signals. A dynamic temperature schedule is also introduced to adaptively adjust the penalty strength for various superclasses. In addition, we employ a margin-based contrastive strategy to improve the separation of hard positive and negative samples within classes. Extensive experiments on NTU RGB+D 60, NTU RGB+D 120, Kinetics-Skeleton, PKU-MMD, FineGYM, and CASIA-B demonstrate the superiority of our method in skeleton-based action recognition, gait recognition, and person re-identification. The source code is available at https://github.com/firework8/ACLNet.

[122] CER-HV: A CER-Based Human-in-the-Loop Framework for Cleaning Datasets Applied to Arabic-Script HTR

Sana Al-azzawi, Elisa Barney, Marcus Liwicki

Main category: cs.CV

TL;DR: CER-HV framework improves Arabic-script HTR by detecting and cleaning label errors in datasets through CER-based ranking with human verification, achieving state-of-the-art performance across multiple languages.

Motivation: Arabic-script handwritten text recognition lags behind Latin-script HTR despite architectural advances, with data quality being a significant limiting factor in many published datasets.

Method: CER-HV framework combines CER-based noise detector (using carefully configured CRNN with early stopping) and human-in-the-loop verification to detect and clean label errors including transcription, segmentation, orientation, and non-text content issues.
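
The ranking stage reduces to scoring each sample by the CER between the detector's transcription and the dataset label, then surfacing the worst offenders for human review; a minimal sketch assuming plain Levenshtein distance.

```python
# A minimal sketch of the CER-based ranking stage; function names are
# illustrative, not the paper's code.
def edit_distance(a: str, b: str) -> int:
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def rank_for_review(samples, top_k=100):
    """samples: list of (image_id, model_prediction, dataset_label);
    the highest-CER items are queued for human verification."""
    cer = lambda s: edit_distance(s[1], s[2]) / max(len(s[2]), 1)
    return sorted(samples, key=cer, reverse=True)[:top_k]
```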

Result: Framework identifies errors with 80-90% precision; CRNN achieves SOTA performance on 5/6 datasets (8.45% CER on KHATT, 8.26% on PHTI, etc.); CER-HV improves evaluation CER by 0.3-1.8% depending on dataset noise level.

Conclusion: Data quality is crucial for Arabic-script HTR performance; CER-HV effectively detects and cleans label errors, improving recognition accuracy across multiple languages while being generalizable to other text recognition tasks.

Abstract: Handwritten text recognition (HTR) for Arabic-script languages still lags behind Latin-script HTR, despite recent advances in model architectures, datasets, and benchmarks. We show that data quality is a significant limiting factor in many published datasets and propose CER-HV (CER-based Ranking with Human Verification) as a framework to detect and clean label errors. CER-HV combines a CER-based noise detector, built on a carefully configured Convolutional Recurrent Neural Network (CRNN) with early stopping to avoid overfitting noisy samples, and a human-in-the-loop (HITL) step that verifies high-ranking samples. The framework reveals that several existing datasets contain previously underreported problems, including transcription, segmentation, orientation, and non-text content errors. These have been identified with up to 90 percent precision in the Muharaf and 80-86 percent in the PHTI datasets. We also show that our CRNN achieves state-of-the-art performance across five of the six evaluated datasets, reaching 8.45 percent Character Error Rate (CER) on KHATT (Arabic), 8.26 percent on PHTI (Pashto), 10.66 percent on Ajami, and 10.11 percent on Muharaf (Arabic), all without any data cleaning. We establish a new baseline of 11.3 percent CER on the PHTD (Persian) dataset. Applying CER-HV improves the evaluation CER by 0.3-0.6 percent on the cleaner datasets and 1.0-1.8 percent on the noisier ones. Although our experiments focus on documents written in an Arabic-script language, including Arabic, Persian, Urdu, Ajami, and Pashto, the framework is general and can be applied to other text recognition datasets.

[123] Using Shadows in Circular Synthetic Aperture Sonar Imaging for Target Analysis

Yann Le Gall, Nicolas Burlet, Mathieu Simon, Fabien Novella, Samantha Dugelay, Jean-Philippe Malkasse

Main category: cs.CV

TL;DR: CSAS provides 360° seabed views but loses shadow information crucial for target recognition. This paper proposes using sub-aperture filtering and FFSE to retrieve shadows from CSAS data, enabling 3D reconstruction via space-carving for improved mine warfare target analysis.

Motivation: Circular SAS provides superior 360° azimuth coverage but loses valuable shadow information that's essential for target recognition in mine warfare. Shadows provide complementary shape information that can reduce false alarms and enable 3D reconstruction.

Method: 1) Use sub-aperture filtering to generate multiple images from different viewpoints along the circular trajectory. 2) Apply fixed focus shadow enhancement (FFSE) to obtain sharp shadows. 3) Develop interactive interface for human operators to visualize shadows. 4) Apply space-carving reconstruction method to infer 3D shape from segmented shadows.
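
Space carving from shadows can be sketched as intersecting back-projected silhouettes: a voxel is kept only if every view's shadow mask covers its projection. The sonar imaging geometry is heavily simplified here, and `project` is a hypothetical callable.

```python
# A minimal sketch of silhouette-style space carving from segmented shadows;
# the real method operates on sonar geometry, simplified away here.
import numpy as np

def carve(voxels, views):
    """voxels: (N, 3) candidate points; views: list of (project, mask) where
    project maps (N, 3) -> (N, 2) integer pixel coords into mask (H, W)."""
    keep = np.ones(len(voxels), dtype=bool)
    for project, mask in views:
        uv = project(voxels)
        u = uv[:, 0].clip(0, mask.shape[1] - 1)
        v = uv[:, 1].clip(0, mask.shape[0] - 1)
        inside = (uv[:, 0] >= 0) & (uv[:, 0] < mask.shape[1]) & \
                 (uv[:, 1] >= 0) & (uv[:, 1] < mask.shape[0])
        keep &= inside & mask[v, u]  # voxel must fall in every shadow
    return voxels[keep]              # shape consistent with all silhouettes
```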

Result: The approach successfully retrieves shadow information from CSAS data and demonstrates the potential for 3D reconstruction. Results show shadows in circular SAS can significantly improve target analysis and 3D shape reconstruction capabilities.

Conclusion: Shadow information retrieval from CSAS data is feasible and valuable for target recognition in mine warfare. The proposed method enables both improved 2D target analysis and 3D reconstruction, addressing the limitations of conventional CSAS processing that sacrifices shadows for resolution and coverage.

Abstract: Circular Synthetic Aperture Sonar (CSAS) provides a 360° azimuth view of the seabed, surpassing the limited aperture and mono-view image of conventional side-scan SAS. This makes CSAS a valuable tool for target recognition in mine warfare, where diversity of point of view is essential for reducing false alarms. CSAS processing typically produces a very high-resolution two-dimensional image. However, the parallax introduced by the circular displacement of the illuminator fills in the shadow regions, and the shadow cast by an object on the seafloor is lost in favor of azimuth coverage and resolution. Yet shadows provide complementary information on target shape that is useful for target recognition. In this paper, we explore a way to retrieve shadow information from CSAS data to improve target analysis and carry out 3D reconstruction. Sub-aperture filtering is used to get a collection of images at various points of view along the circular trajectory, and fixed focus shadow enhancement (FFSE) is applied to obtain sharp shadows. An interactive interface is also proposed to allow human operators to visualize these shadows along the circular trajectory. A space-carving reconstruction method is applied to infer the 3D shape of the object from the segmented shadows. The results demonstrate the potential of shadows in circular SAS for improving target analysis and 3D reconstruction.

[124] A Step to Decouple Optimization in 3DGS

Renjie Ding, Yaonan Wang, Min Liu, Jialin Zhu, Jiazheng Wang, Jiahao Zhao, Wenting Shen, Feixiang He, Xiang Che

Main category: cs.CV

TL;DR: The paper identifies optimization issues in 3D Gaussian Splatting (3DGS) related to update step coupling and gradient coupling, proposes decoupled components (Sparse Adam, Re-State Regularization, Decoupled Attribute Regularization), and introduces AdamW-GS for improved efficiency and effectiveness.

Motivation: Current 3DGS optimization inherits DNN practices but overlooks two critical issues: (1) update step coupling causing optimizer state rescaling and costly attribute updates outside viewpoints, and (2) gradient coupling in the moment leading to under- or over-effective regularization. These complex couplings are under-explored despite their impact on optimization quality.

Method: The authors revisit 3DGS optimization and decouple it into three components: Sparse Adam, Re-State Regularization, and Decoupled Attribute Regularization. They conduct extensive experiments under both 3DGS and 3DGS-MCMC frameworks. Based on empirical analysis, they re-design the optimization by re-coupling beneficial components into AdamW-GS.
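
Of the decoupled components, Sparse Adam is the most mechanical; a minimal sketch, assuming only Gaussians visible from the current viewpoint have parameters and optimizer state touched (a per-Gaussian step counter for bias correction is elided for brevity).

```python
# A minimal sketch of a sparse Adam step over Gaussian attributes, following
# the summary's description rather than the paper's exact code.
import torch

@torch.no_grad()
def sparse_adam_step(param, grad, m, v, visible, step, lr=1e-3,
                     betas=(0.9, 0.999), eps=1e-8):
    """param/grad/m/v: (N, D) tensors; visible: (N,) bool mask."""
    b1, b2 = betas
    m[visible] = b1 * m[visible] + (1 - b1) * grad[visible]
    v[visible] = b2 * v[visible] + (1 - b2) * grad[visible] ** 2
    m_hat = m[visible] / (1 - b1 ** step)
    v_hat = v[visible] / (1 - b2 ** step)
    param[visible] -= lr * m_hat / (v_hat.sqrt() + eps)
    # Gaussians outside the viewpoint keep parameters and optimizer state
    # untouched, avoiding costly updates and state rescaling.
```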

Result: The proposed AdamW-GS achieves simultaneous improvements in both optimization efficiency and representation effectiveness compared to standard 3DGS optimization approaches.

Conclusion: By addressing the overlooked optimization coupling issues in 3DGS and proposing the AdamW-GS method, the work provides deeper understanding of 3DGS optimization components and demonstrates that better optimization efficiency and representation effectiveness can be achieved simultaneously through careful optimization design.

Abstract: 3D Gaussian Splatting (3DGS) has emerged as a powerful technique for real-time novel view synthesis. As an explicit representation optimized through gradient propagation among primitives, 3DGS adopts optimization practices widely accepted in deep neural networks (DNNs), such as synchronous weight updating and Adam with the adaptive gradient. However, considering the physical significance and specific design of 3DGS, there are two overlooked details in its optimization: (i) update step coupling, which induces optimizer state rescaling and costly attribute updates outside the viewpoints, and (ii) gradient coupling in the moment, which may lead to under- or over-effective regularization. Nevertheless, such complex coupling is under-explored. After revisiting the optimization of 3DGS, we take a step to decouple it and recompose the process into: Sparse Adam, Re-State Regularization and Decoupled Attribute Regularization. Through a large number of experiments under the 3DGS and 3DGS-MCMC frameworks, our work provides a deeper understanding of these components. Finally, based on the empirical analysis, we re-design the optimization and propose AdamW-GS by re-coupling the beneficial components, under which better optimization efficiency and representation effectiveness are achieved simultaneously.

[125] Automated Road Crack Localization to Guide Highway Maintenance

Steffen Knoblauch, Ram Kumar Muthusamy, Pedram Ghamisi, Alexander Zipf

Main category: cs.CV

TL;DR: This study develops a framework using open-source data (airborne imagery + OSM) to fine-tune YOLOv11 for highway crack detection, creating a Swiss Relative Highway Crack Density (RHCD) index to guide maintenance decisions.

Motivation: Climate change-induced temperature fluctuations are increasing stress on road pavements and maintenance costs, creating a need for targeted, efficient maintenance strategies for highway infrastructure.

Method: Integrates airborne imagery and OpenStreetMap to fine-tune YOLOv11 for highway crack localization, then calculates a Swiss Relative Highway Crack Density (RHCD) index to inform nationwide highway maintenance.
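
Assuming RHCD is a per-segment crack density normalized by the network-wide mean, the index reduces to a few lines; the study's exact definition is not given in this summary.

```python
# A minimal sketch of a relative crack-density index; the normalization by
# the network-wide mean is an assumption for illustration.
def rhcd(crack_counts, segment_lengths_km):
    densities = [c / l for c, l in zip(crack_counts, segment_lengths_km)]
    mean = sum(densities) / len(densities)
    return [d / mean for d in densities]  # 1.0 == network-average density

print(rhcd([12, 3, 30], [2.0, 1.5, 4.0]))  # flags above-average segments
```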

Result: Crack classification achieved F1-scores of 0.84 (crack) and 0.97 (no crack). RHCD index showed weak correlations with temperature (r=-0.05) and traffic volume (r=0.17), with high values near urban centers and intersections.

Conclusion: Open-source data sharing enables innovative solutions for public sector efficiency, with the RHCD index providing valuable maintenance guidance beyond traditional temperature and traffic data.

Abstract: Highway networks are crucial for economic prosperity. Climate change-induced temperature fluctuations are exacerbating stress on road pavements, resulting in elevated maintenance costs. This underscores the need for targeted and efficient maintenance strategies. This study investigates the potential of open-source data to guide highway infrastructure maintenance. The proposed framework integrates airborne imagery and OpenStreetMap (OSM) to fine-tune YOLOv11 for highway crack localization. To demonstrate the framework’s real-world applicability, a Swiss Relative Highway Crack Density (RHCD) index was calculated to inform nationwide highway maintenance. The crack classification model achieved an F1-score of $0.84$ for the positive class (crack) and $0.97$ for the negative class (no crack). The Swiss RHCD index exhibited weak correlations with Long-term Land Surface Temperature Amplitudes (LT-LST-A) (Pearson’s $r = -0.05$) and Traffic Volume (TV) (Pearson’s $r = 0.17$), underlining the added value of this novel index for guiding maintenance over other data. Significantly high RHCD values were observed near urban centers and intersections, providing contextual validation for the predictions. These findings highlight the value of open-source data sharing to drive innovation, ultimately enabling more efficient solutions in the public sector.

[126] Curated endoscopic retrograde cholangiopancreatography images dataset

Alda João Andrade, Mónica Martins, André Ferreira, Tarcísio Araújo, Luís Lopes, Victor Alves

Main category: cs.CV

TL;DR: Researchers created a large, curated ERCP image dataset to address the scarcity of public data for AI-based diagnosis of biliary and pancreatic diseases.

Motivation: Public ERCP datasets are scarce, limiting the development and application of AI solutions for automating diagnosis of biliary and pancreatic diseases.

Method: Collected 19,018 raw and 19,317 processed images from 1,602 patients, with 5,519 images manually annotated by experienced gastroenterologists (two with >5 years experience, reviewed by one with >20 years experience, all performing >400 ERCPs annually).

Result: Created a large, curated dataset with expert annotations, validated through a classification experiment to prove its utility and validity.

Conclusion: This dataset aims to provide a benchmark for automatic ERCP analysis and diagnosis of biliary and pancreatic diseases, helping fill the gap in public ERCP data availability.

Abstract: Endoscopic Retrograde Cholangiopancreatography (ERCP) is a key procedure in the diagnosis and treatment of biliary and pancreatic diseases. Artificial intelligence has been proposed as one solution to automate diagnosis. However, public ERCP datasets are scarce, which limits the use of such approaches. Therefore, this study aims to help fill this gap by providing a large and curated dataset. The collection is composed of 19,018 raw images and 19,317 processed images from 1,602 patients. 5,519 images are labeled, which provides a ready-to-use dataset. All images were manually inspected and annotated by two gastroenterologists with more than 5 years of experience and reviewed by another gastroenterologist with more than 20 years of experience, all performing more than 400 ERCP procedures annually. The utility and validity of the dataset are demonstrated by a classification experiment. This collection aims to provide, or contribute to, a benchmark in automatic ERCP analysis and diagnosis of biliary and pancreatic diseases.

[127] Flow Matching for Probabilistic Monocular 3D Human Pose Estimation

Cuong Le, Pavló Melnyk, Bastian Wandt, Mårten Wadenbäck

Main category: cs.CV

TL;DR: FMPose: A flow matching-based probabilistic 3D human pose estimation method that learns optimal transport from simple distributions to plausible 3D poses, outperforming diffusion methods in speed and accuracy.

Motivation: 3D human pose estimation from monocular views is ill-posed due to depth ambiguity. Traditional methods often produce incorrect but overconfident 3D estimates, while probabilistic approaches that model uncertainty are needed for more reliable pose estimation.

Method: FMPose uses flow matching generative approach with optimal transport, learning continuous normalizing flows from simple source distributions to plausible 3D pose distributions conditioned on 2D cues. Graph convolutional networks model 2D lifting conditions using learnable connections between body joints as graph structure for feature aggregation.
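
Flow matching with the straight-line (optimal-transport) path has a compact training step; the sketch below assumes this standard formulation, with FMPose's GCN condition encoder abstracted into a feature tensor and `velocity_net` as a hypothetical model.

```python
# A minimal sketch of a conditional flow-matching training step with the
# linear (optimal-transport) interpolation path.
import torch

def fm_loss(velocity_net, x1, cond):
    """x1: (B, J, 3) ground-truth 3D poses; cond: (B, C) 2D-lifting features."""
    x0 = torch.randn_like(x1)          # simple source distribution
    t = torch.rand(x1.shape[0], 1, 1)  # per-sample time in (0, 1)
    xt = (1 - t) * x0 + t * x1         # straight-line transport path
    target_v = x1 - x0                 # constant target velocity field
    pred_v = velocity_net(xt, t.squeeze(), cond)
    return ((pred_v - target_v) ** 2).mean()
```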

Result: FMPose achieves major improvements over state-of-the-art methods on three benchmarks: Human3.6M, MPI-INF-3DHP, and 3DPW. It produces faster and more accurate 3D pose generations compared to diffusion-based methods.

Conclusion: Flow matching with optimal transport provides an effective probabilistic framework for 3D human pose estimation, addressing depth ambiguity through uncertainty modeling while achieving superior performance and efficiency compared to existing approaches.

Abstract: Recovering 3D human poses from a monocular camera view is a highly ill-posed problem due to the depth ambiguity. Earlier studies on 3D human pose lifting from 2D often contain incorrect-yet-overconfident 3D estimations. To mitigate the problem, emerging probabilistic approaches treat the 3D estimations as a distribution, taking into account the uncertainty measurement of the poses. Falling in a similar category, we propose FMPose, a probabilistic 3D human pose estimation method based on the flow matching generative approach. Conditioned on the 2D cues, the flow matching scheme learns the optimal transport from a simple source distribution to the plausible 3D human pose distribution via continuous normalizing flows. The 2D lifting condition is modeled via graph convolutional networks, leveraging the learnable connections between human body joints as the graph structure for feature aggregation. Compared to diffusion-based methods, the FMPose with optimal transport produces faster and more accurate 3D pose generations. Experimental results show major improvements of our FMPose over current state-of-the-art methods on three common benchmarks for 3D human pose estimation, namely Human3.6M, MPI-INF-3DHP and 3DPW.

[128] AutoRegressive Generation with B-rep Holistic Token Sequence Representation

Jiahao Li, Yunpeng Bai, Yongkang Dai, Hao Guo, Hongping Gan, Yilei Shi

Main category: cs.CV

TL;DR: BrepARG is the first method to encode B-rep geometry and topology into holistic token sequences for sequence-based generation using autoregressive transformers.

Motivation: Previous graph-based B-rep representations separate geometry and topology, preventing the use of sequence-based generative frameworks like transformers that have shown excellent performance. There's a need for a unified representation that enables transformer-based B-rep generation.

Method: Encodes B-rep into three token types: geometry tokens, position tokens, and face index tokens. Constructs holistic token sequences hierarchically by first building geometry blocks (faces and edges), then sequencing these blocks, and finally assembling the complete B-rep sequence. Uses a transformer-based autoregressive model with multi-layer decoder-only architecture and causal masking for next-token prediction.
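
Next-token training over such a holistic sequence follows the standard decoder-only recipe; a toy sketch with a causal mask, where the flat vocabulary size is hypothetical and positional embeddings are elided.

```python
# A minimal sketch of decoder-only next-token training over a flat B-rep
# token vocabulary; BrepARG's tokenizer and model are far larger.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyBrepAR(nn.Module):
    def __init__(self, vocab=4096, dim=256, heads=8, layers=4):
        super().__init__()
        self.emb = nn.Embedding(vocab, dim)
        block = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.blocks = nn.TransformerEncoder(block, layers)
        self.head = nn.Linear(dim, vocab)

    def forward(self, tokens):  # tokens: (B, T) geometry/position/face ids
        causal = nn.Transformer.generate_square_subsequent_mask(tokens.shape[1])
        return self.head(self.blocks(self.emb(tokens), mask=causal))

model = TinyBrepAR()
seq = torch.randint(0, 4096, (2, 128))
logits = model(seq[:, :-1])  # predict each next token from its prefix
loss = F.cross_entropy(logits.reshape(-1, 4096), seq[:, 1:].reshape(-1))
```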

Result: BrepARG achieves state-of-the-art (SOTA) performance in B-rep generation experiments.

Conclusion: BrepARG successfully demonstrates the feasibility of representing B-rep as holistic token sequences, opening new directions for B-rep generation using sequence-based approaches like transformers.

Abstract: Previous representation and generation approaches for the B-rep relied on graph-based representations that disentangle geometric and topological features through decoupled computational pipelines, thereby precluding the application of sequence-based generative frameworks, such as transformer architectures that have demonstrated remarkable performance. In this paper, we propose BrepARG, the first attempt to encode B-rep’s geometry and topology into a holistic token sequence representation, enabling sequence-based B-rep generation with an autoregressive architecture. Specifically, BrepARG encodes B-rep into 3 types of tokens: geometry and position tokens representing geometric features, and face index tokens representing topology. Then the holistic token sequence is constructed hierarchically, starting with constructing the geometry blocks (i.e., faces and edges) using the above tokens, followed by geometry block sequencing. Finally, we assemble the holistic sequence representation for the entire B-rep. We also construct a transformer-based autoregressive model that learns the distribution over holistic token sequences via next-token prediction, using a multi-layer decoder-only architecture with causal masking. Experiments demonstrate that BrepARG achieves state-of-the-art (SOTA) performance. BrepARG validates the feasibility of representing B-rep as holistic token sequences, opening new directions for B-rep generation.

[129] REL-SF4PASS: Panoramic Semantic Segmentation with REL Depth Representation and Spherical Fusion

Xuewei Li, Xinghan Bao, Zhimin Chen, Xi Li

Main category: cs.CV

TL;DR: REL-SF4PASS: A panoramic semantic segmentation method using REL depth representation in cylindrical coordinates and Spherical-dynamic Multi-Modal Fusion (SMMF) to improve performance and robustness.

Motivation: Existing PASS methods don't fully utilize panoramic image geometry: they focus on spherical geometry with RGB or use depth in original/HHA format. There's a need to better exploit panoramic geometry and reduce distortion from ERP projection.

Method: Proposes REL depth representation (Rectified Depth, Elevation-Gained Vertical Inclination Angle, Lateral Orientation Angle) in cylindrical coordinates, and Spherical-dynamic Multi-Modal Fusion (SMMF) that uses different fusion strategies for different panoramic regions to reduce cylinder surface expansion breakage.
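
Reading the three channels as cylindrical coordinates derived from an equirectangular depth map gives a compact sketch; the paper's precise REL definition may differ from this assumption.

```python
# A minimal sketch of cylindrical-coordinate channels from an ERP depth map,
# assuming "rectified depth" is the distance from the vertical axis and the
# two angles are per-pixel elevation and azimuth.
import numpy as np

def rel_channels(depth):
    """depth: (H, W) ERP depth map -> (3, H, W) REL-style representation."""
    H, W = depth.shape
    elev = np.linspace(np.pi / 2, -np.pi / 2, H)[:, None].repeat(W, axis=1)
    azim = np.linspace(-np.pi, np.pi, W)[None, :].repeat(H, axis=0)
    rectified = depth * np.cos(elev)  # horizontal (cylinder-radius) distance
    return np.stack([rectified, elev, azim])
```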

Result: Achieves 2.35% average mIoU improvement on all 3 folds of Stanford2D3D Panoramic datasets and reduces performance variance by ~70% when facing 3D disturbance.

Conclusion: REL-SF4PASS significantly improves panoramic semantic segmentation performance and robustness by better utilizing panoramic geometry through cylindrical coordinate representation and adaptive multi-modal fusion.

Abstract: As an important and challenging problem in computer vision, Panoramic Semantic Segmentation (PASS) aims to give complete scene perception based on an ultra-wide angle of view. Most PASS methods often focus on spherical geometry with RGB input or use the depth information in its original or HHA format, which does not make full use of panoramic image geometry. To address these shortcomings, we propose REL-SF4PASS with our REL depth representation based on cylindrical coordinates and Spherical-dynamic Multi-Modal Fusion (SMMF). REL is made up of Rectified Depth, Elevation-Gained Vertical Inclination Angle, and Lateral Orientation Angle, which together fully represent 3D space in a cylindrical coordinate style along with the surface normal direction. SMMF aims to ensure the diversity of fusion for different panoramic image regions and to reduce the breakage of the cylinder side surface expansion in ERP projection, using different fusion strategies to match the different regions of panoramic images. Experimental results show that REL-SF4PASS considerably improves performance and robustness on the popular Stanford2D3D Panoramic benchmark. It gains a 2.35% average mIoU improvement on all 3 folds and reduces the performance variance by approximately 70% when facing 3D disturbance.

[130] CASP: Few-Shot Class-Incremental Learning with CLS Token Attention Steering Prompts

Shuai Huang, Xuhan Lin, Yuwu Lu

Main category: cs.CV

TL;DR: CASP introduces CLS token attention steering prompts with trainable bias parameters to modulate self-attention weights, combined with attention perturbation and manifold token mixup for better generalization in few-shot class-incremental learning.

DetailsMotivation: FSCIL requires models to adapt to new classes with limited samples while avoiding catastrophic forgetting. Existing prompt-based methods need better generalization under extreme few-shot settings by leveraging pretrained knowledge to learn feature representations that can be shared across future categories.

Method: Proposes CLS Token Attention Steering Prompts (CASP) with class-shared trainable bias parameters in query, key, and value projections of CLS token to modulate self-attention weights. Also includes attention perturbation strategy and Manifold Token Mixup in shallow feature space to synthesize potential new class features for improved generalization.
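
A simplified reading of the steering mechanism is sketched below: class-shared bias vectors shift the CLS token's contribution to the query, key, and value streams, thereby modulating its self-attention weights. This is an illustrative simplification (the biases here are added to the CLS embedding before a shared attention call rather than inside the projections), not the authors' implementation.

```python
import torch
import torch.nn as nn

class CLSSteeredAttention(nn.Module):
    def __init__(self, dim=768, n_heads=12):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        # Class-shared trainable steering biases for the CLS position only.
        self.q_bias = nn.Parameter(torch.zeros(dim))
        self.k_bias = nn.Parameter(torch.zeros(dim))
        self.v_bias = nn.Parameter(torch.zeros(dim))

    def forward(self, x):                 # x: (B, 1 + N, dim), CLS first
        q, k, v = x.clone(), x.clone(), x.clone()
        q[:, 0] = q[:, 0] + self.q_bias   # steer how CLS queries patches
        k[:, 0] = k[:, 0] + self.k_bias   # steer how patches attend to CLS
        v[:, 0] = v[:, 0] + self.v_bias   # steer what CLS contributes
        out, _ = self.attn(q, k, v, need_weights=False)
        return out
```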

Result: Outperforms state-of-the-art methods on CUB200, CIFAR100, and ImageNet-R datasets in both standard and fine-grained FSCIL settings, without requiring fine-tuning during incremental phases and with significantly reduced parameter overhead.

Conclusion: CASP effectively addresses FSCIL challenges by steering CLS token attention to filter task-irrelevant information, enhancing generalization through attention modulation and feature synthesis, while maintaining parameter efficiency and eliminating incremental fine-tuning requirements.

Abstract: Few-shot class-incremental learning (FSCIL) presents a core challenge in continual learning, requiring models to rapidly adapt to new classes with very limited samples while mitigating catastrophic forgetting. Recent prompt-based methods, which integrate pretrained backbones with task-specific prompts, have made notable progress. However, under extreme few-shot incremental settings, the model’s ability to transfer and generalize becomes critical, and it is thus essential to leverage pretrained knowledge to learn feature representations that can be shared across future categories during the base session. Inspired by the mechanism of the CLS token, which is similar to human attention and progressively filters out task-irrelevant information, we propose the CLS Token Attention Steering Prompts (CASP). This approach introduces class-shared trainable bias parameters into the query, key, and value projections of the CLS token to explicitly modulate the self-attention weights. To further enhance generalization, we also design an attention perturbation strategy and perform Manifold Token Mixup in the shallow feature space, synthesizing potential new class features to improve generalization and reserve the representation capacity for upcoming tasks. Experiments on the CUB200, CIFAR100, and ImageNet-R datasets demonstrate that CASP outperforms state-of-the-art methods in both standard and fine-grained FSCIL settings without requiring fine-tuning during incremental phases and while significantly reducing the parameter overhead.

[131] SLD: Segmentation-Based Landmark Detection for Spinal Ligaments

Lara Blomenkamp, Ivanna Kramer, Sabine Bauer, Theresa Schöche

Main category: cs.CV

TL;DR: Novel automated method for detecting spinal ligament landmarks using shape-based 3D vertebra segmentation and domain-specific rules, achieving high accuracy (0.7 mm MAE) across all spinal regions.

DetailsMotivation: Precise identification of ligament attachment points is crucial for realistic biomechanical spine modeling, but existing automated methods are either region-specific or lack sufficient accuracy.

Method: Two-stage approach: first performs shape-based segmentation of 3D vertebrae, then applies domain-specific rules to identify different types of ligament attachment points.

Result: Outperforms existing approaches with mean absolute error of 0.7 mm and root mean square error of 1.1 mm, demonstrating strong generalization across all spinal regions.

Conclusion: The proposed method provides accurate and generalizable automated detection of spinal ligament landmarks, enabling more reliable biomechanical spine modeling.

Abstract: In biomechanical modeling, the representation of ligament attachments is crucial for a realistic simulation of the forces acting between the vertebrae. These forces are typically modeled as vectors connecting ligament landmarks on adjacent vertebrae, making precise identification of these landmarks a key requirement for constructing reliable spine models. Existing automated detection methods are either limited to specific spinal regions or lack sufficient accuracy. This work presents a novel approach for detecting spinal ligament landmarks, which first performs shape-based segmentation of 3D vertebrae and subsequently applies domain-specific rules to identify different types of attachment points. The proposed method outperforms existing approaches by achieving high accuracy and demonstrating strong generalization across all spinal regions. Validation on two independent spinal datasets from multiple patients yielded a mean absolute error (MAE) of 0.7 mm and a root mean square error (RMSE) of 1.1 mm.

[132] Incorporating Eye-Tracking Signals Into Multimodal Deep Visual Models For Predicting User Aesthetic Experience In Residential Interiors

Chen-Ying Chien, Po-Chih Kuo

Main category: cs.CV

TL;DR: Dual-branch CNN-LSTM model fuses visual features with eye-tracking data to predict aesthetic evaluations of interior spaces, achieving 72.2% accuracy on objective dimensions and 66.8% on subjective dimensions.

DetailsMotivation: Predicting aesthetic experiences in interior design is challenging due to the subjectivity of perception and the complexity of visual responses. Current methods struggle to capture how people perceive and evaluate interior spaces, which is essential for designing environments that promote well-being.

Method: Developed a dual-branch CNN-LSTM framework that fuses visual features from interior design videos with synchronized eye-tracking signals (gaze data and pupil responses). Collected dataset of 224 interior design videos with gaze data from 28 participants who rated 15 aesthetic dimensions.
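
A minimal sketch of a dual-branch CNN-LSTM of this general shape follows, assuming per-frame gaze coordinates plus pupil size as the eye-tracking channel; the toy CNN, layer sizes, and 15-way output head are placeholders rather than the authors' architecture.

```python
import torch
import torch.nn as nn

class DualBranchRater(nn.Module):
    def __init__(self, feat_dim=512, gaze_dim=3, hidden=256, n_dims=15):
        super().__init__()
        self.cnn = nn.Sequential(                 # stand-in visual encoder
            nn.Conv2d(3, 32, 3, 2, 1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(32, feat_dim))
        self.gaze_mlp = nn.Linear(gaze_dim, 64)   # gaze x, y + pupil size
        self.lstm = nn.LSTM(feat_dim + 64, hidden, batch_first=True)
        self.head = nn.Linear(hidden, n_dims)     # 15 aesthetic dimensions

    def forward(self, frames, gaze):              # frames: (B, T, 3, H, W)
        B, T = frames.shape[:2]
        f = self.cnn(frames.flatten(0, 1)).view(B, T, -1)
        g = self.gaze_mlp(gaze)                   # gaze: (B, T, gaze_dim)
        _, (h, _) = self.lstm(torch.cat([f, g], dim=-1))
        return self.head(h[-1])                   # per-dimension ratings
```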

Result: Model achieved 72.2% accuracy on objective dimensions (e.g., light) and 66.8% on subjective dimensions (e.g., relaxation), outperforming state-of-the-art video baselines. Eye-tracking during training enables comparable performance with visual input alone during deployment. Pupil responses contribute most to objective assessments, while gaze+visual combination enhances subjective evaluations.

Conclusion: Eye-tracking serves as valuable privileged information during training, enabling practical tools for aesthetic assessment in interior design. The approach bridges subjective perception with computational modeling for better interior space evaluation.

Abstract: Understanding how people perceive and evaluate interior spaces is essential for designing environments that promote well-being. However, predicting aesthetic experiences remains difficult due to the subjective nature of perception and the complexity of visual responses. This study introduces a dual-branch CNN-LSTM framework that fuses visual features with eye-tracking signals to predict aesthetic evaluations of residential interiors. We collected a dataset of 224 interior design videos paired with synchronized gaze data from 28 participants who rated 15 aesthetic dimensions. The proposed model attains 72.2% accuracy on objective dimensions (e.g., light) and 66.8% on subjective dimensions (e.g., relaxation), outperforming state-of-the-art video baselines and showing clear gains on subjective evaluation tasks. Notably, models trained with eye-tracking retain comparable performance when deployed with visual input alone. Ablation experiments further reveal that pupil responses contribute most to objective assessments, while the combination of gaze and visual cues enhances subjective evaluations. These findings highlight the value of incorporating eye-tracking as privileged information during training, enabling more practical tools for aesthetic assessment in interior design.

[133] ColorConceptBench: A Benchmark for Probabilistic Color-Concept Understanding in Text-to-Image Models

Chenxi Ruan, Yu Xiao, Yihan Hou, Guosheng Hu, Wei Zeng

Main category: cs.CV

TL;DR: ColorConceptBench: A new benchmark to evaluate T2I models’ ability to associate colors with implicit concepts, revealing models lack sensitivity to abstract color semantics despite scaling and guidance interventions.

DetailsMotivation: Current text-to-image models have advanced significantly, but their capability to associate colors with implicit concepts (beyond explicit color names/codes) remains underexplored and poorly understood.

Method: Introduces ColorConceptBench, a human-annotated benchmark with 1,281 implicit color concepts and 6,369 human annotations, evaluating color-concept associations through probabilistic color distributions. Tests seven leading T2I models and examines standard interventions like scaling and guidance.

Result: Evaluation reveals current T2I models lack sensitivity to abstract color semantics, and this limitation appears resistant to standard interventions (scaling and guidance), suggesting the problem is fundamental rather than solvable through current approaches.

Conclusion: Achieving human-like color semantics in T2I models requires more than larger models - it demands a fundamental shift in how models learn and represent implicit meaning, indicating a core limitation in current architectures.

Abstract: While text-to-image (T2I) models have advanced considerably, their capability to associate colors with implicit concepts remains underexplored. To address the gap, we introduce ColorConceptBench, a new human-annotated benchmark to systematically evaluate color-concept associations through the lens of probabilistic color distributions. ColorConceptBench moves beyond explicit color names or codes by probing how models translate 1,281 implicit color concepts using a foundation of 6,369 human annotations. Our evaluation of seven leading T2I models reveals that current models lack sensitivity to abstract semantics, and crucially, this limitation appears resistant to standard interventions (e.g., scaling and guidance). This demonstrates that achieving human-like color semantics requires more than larger models, but demands a fundamental shift in how models learn and represent implicit meaning.

[134] No Validation, No Problem: Predicting Model Performance from a Single Gradient

Fangzheng Wu, Brian Summa

Main category: cs.CV

TL;DR: A validation-free checkpoint selection method using classifier-head gradient norm as a proxy metric that correlates with model performance without needing validation labels.

DetailsMotivation: Traditional checkpoint selection requires validation data and labels, which can be expensive and privacy-sensitive. The paper aims to find a lightweight, label-free method for checkpoint selection and early stopping that doesn't need validation labels.

Method: Uses the Frobenius norm of classifier-head gradient (||dL/dW||_F) from a single forward-backward pass on one batch of detached features as a proxy metric. This gradient norm is strongly correlated with model performance metrics (negatively with Top-1 accuracy, positively with loss). Different normalization strategies are proposed for different architectures: head-scale normalization for classic CNNs and feature-scale normalization for Transformers and modern CNNs.
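
The probe is simple enough to sketch directly. The function below is a hedged reconstruction from the description (the model objects and the nn.Linear head are placeholders): it computes the Frobenius norm of the classifier head's weight gradient from one batch of detached features.

```python
import torch
import torch.nn.functional as F

@torch.enable_grad()
def head_grad_norm(backbone, head, images, labels):
    """One-batch probe: ||dL/dW||_F of a linear classifier head."""
    with torch.no_grad():
        feats = backbone(images)        # features detached from the head
    loss = F.cross_entropy(head(feats.detach()), labels)
    (grad,) = torch.autograd.grad(loss, head.weight)
    return grad.norm(p="fro").item()

# Checkpoint selection: run the probe on each checkpoint in a short tail
# window and keep the one with the minimum head-gradient norm.
```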

Result: The method achieves near-oracle performance: closes most of the gap to oracle checkpoint selection (4.24% +/- 2.00% with universal setup, ~1.12% with per-family tuning). Works across ImageNet-1k CNNs and Transformers, predicts COCO detection/segmentation mAP, and tracks diffusion model progress on CIFAR-10. Adds less than 0.1% of an epoch overhead.

Conclusion: The classifier-head gradient norm provides an effective, lightweight, label-free proxy for model performance that enables validation-free checkpoint selection and early stopping across diverse architectures and tasks, with minimal computational overhead.

Abstract: We propose a validation-free checkpointing signal from a single forward-backward pass: the Frobenius norm of the classifier-head gradient on one detached-feature batch, ||g||_F = ||dL/dW||_F. Across ImageNet-1k CNNs and Transformers, this proxy is strongly negative with Top-1 and positive with loss. Selecting the checkpoint with the minimum head gradient in a short tail window closes most of the gap to the oracle (4.24% +/- 2.00% with a universal setup, about 1.12% with light per-family tuning). For practical deployment, a head-scale normalization is more stable within classic CNN families (e.g., ResNets), while a feature-scale normalization works well for Transformers and modern CNNs. The same one-batch probe also predicts COCO detection/segmentation mAP. In diffusion (UNet/DDPM on CIFAR-10), it tracks progress and enables near-oracle tail-window selection; it is positively correlated with same-distribution probe MSE and negatively with FID (lower is better), so it can be used as a lightweight, label-free monitor. Validation labels are never used beyond reporting. The probe adds much less than 0.1% of an epoch and works as a drop-in for validation-free checkpoint selection and early stopping.

[135] GPA-VGGT: Adapting VGGT to Large-Scale Localization by Self-Supervised Learning with Geometry- and Physics-Aware Loss

Yangfan Xu, Lilian Zhang, Xiaofeng He, Pengdong Wu, Wenqi Wu, Jun Mao

Main category: cs.CV

TL;DR: Self-supervised training framework for Visual Geometry Grounded Transformers (VGGT) that eliminates need for ground truth labels by using sequence-wise geometric constraints and joint optimization loss.

DetailsMotivation: Existing VGGT models require ground truth labels for training, making them difficult to adapt to unlabeled and unseen scenes in large-scale environments.

Method: Extends pair-wise relations to sequence-wise geometric constraints, sampling multiple source frames and projecting them onto different target frames. Uses joint optimization loss combining photometric consistency and geometric constraints without hard labels.
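
To make the self-supervision concrete, here is a generic (and heavily simplified) photometric-consistency term of the kind such pipelines use: a source frame is warped into the target view through predicted depth and relative pose, and the appearance difference is penalized. Pinhole intrinsics K and the relative pose are assumed given; the paper's actual loss also includes geometric constraints not shown here.

```python
import torch
import torch.nn.functional as F

def photometric_loss(src_img, tgt_img, tgt_depth, K, T_tgt_to_src):
    """src/tgt_img: (B,3,H,W); tgt_depth: (B,1,H,W); K: (3,3) float."""
    B, _, H, W = tgt_img.shape
    ys, xs = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    pix = torch.stack([xs, ys, torch.ones_like(xs)], 0).float()   # (3,H,W)
    rays = torch.linalg.inv(K) @ pix.reshape(3, -1)               # (3,HW)
    pts = rays.unsqueeze(0) * tgt_depth.reshape(B, 1, -1)         # back-project
    pts = T_tgt_to_src[:, :3, :3] @ pts + T_tgt_to_src[:, :3, 3:] # to source
    uv = K.unsqueeze(0) @ pts                                     # reproject
    u = uv[:, 0] / uv[:, 2].clamp(min=1e-6) / (W - 1) * 2 - 1     # to [-1, 1]
    v = uv[:, 1] / uv[:, 2].clamp(min=1e-6) / (H - 1) * 2 - 1
    grid = torch.stack([u, v], dim=-1).reshape(B, H, W, 2)
    warped = F.grid_sample(src_img, grid, align_corners=True)
    return (warped - tgt_img).abs().mean()                        # L1 term
```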

Result: Model converges within hundreds of iterations and achieves significant improvements in large-scale localization. Both attention layers and camera/depth heads effectively capture multi-view geometry.

Conclusion: Proposed self-supervised framework successfully trains VGGT models on unlabeled data, enhancing localization capabilities in large-scale environments without requiring ground truth labels.

Abstract: Transformer-based general visual geometry frameworks have shown promising performance in camera pose estimation and 3D scene understanding. Recent advancements in Visual Geometry Grounded Transformer (VGGT) models have shown great promise in camera pose estimation and 3D reconstruction. However, these models typically rely on ground truth labels for training, posing challenges when adapting to unlabeled and unseen scenes. In this paper, we propose a self-supervised framework to train VGGT with unlabeled data, thereby enhancing its localization capability in large-scale environments. To achieve this, we extend conventional pair-wise relations to sequence-wise geometric constraints for self-supervised learning. Specifically, in each sequence, we sample multiple source frames and geometrically project them onto different target frames, which improves temporal feature consistency. We formulate physical photometric consistency and geometric constraints as a joint optimization loss to circumvent the requirement for hard labels. By training the model with this proposed method, not only the local and global cross-view attention layers but also the camera and depth heads can effectively capture the underlying multi-view geometry. Experiments demonstrate that the model converges within hundreds of iterations and achieves significant improvements in large-scale localization. Our code will be released at https://github.com/X-yangfan/GPA-VGGT.

[136] Evaluating Large Vision-language Models for Surgical Tool Detection

Nakul Poudel, Richard Simon, Cristian A. Linte

Main category: cs.CV

TL;DR: Large vision-language models (VLMs) show strong potential for surgical tool detection, with Qwen2.5 outperforming other VLMs and demonstrating superior zero-shot generalization compared to open-set detection baselines.

DetailsMotivation: Current AI systems in surgery are mostly unimodal, limiting holistic understanding of surgical workflows. There's a need for general-purpose surgical AI systems that can comprehensively model interrelated surgical scene components. Large VLMs offer potential for modeling surgical tasks with human-like scene reasoning.

Method: Evaluated three state-of-the-art VLMs (Qwen2.5, LLaVA1.5, InternVL3.5) on GraSP robotic surgery dataset for surgical tool detection. Tested under zero-shot and parameter-efficient LoRA fine-tuning settings. Compared with open-set detection baseline Grounding DINO.

Result: Qwen2.5 consistently achieved superior detection performance among evaluated VLMs in both zero-shot and fine-tuned configurations. Qwen2.5 showed stronger zero-shot generalization than Grounding DINO and comparable fine-tuned performance. Qwen2.5 demonstrated superior instrument recognition, while Grounding DINO showed stronger localization.

Conclusion: Large VLMs, particularly Qwen2.5, show promising capabilities for surgical tool detection with strong zero-shot generalization. The study provides evidence for the potential of VLMs in surgical applications, though systematic investigations in this domain remain limited.

Abstract: Surgery is a highly complex process, and artificial intelligence has emerged as a transformative force in supporting surgical guidance and decision-making. However, the unimodal nature of most current AI systems limits their ability to achieve a holistic understanding of surgical workflows. This highlights the need for general-purpose surgical AI systems capable of comprehensively modeling the interrelated components of surgical scenes. Recent advances in large vision-language models that integrate multimodal data processing offer strong potential for modeling surgical tasks and providing human-like scene reasoning and understanding. Despite their promise, systematic investigations of VLMs in surgical applications remain limited. In this study, we evaluate the effectiveness of large VLMs for the fundamental surgical vision task of detecting surgical tools. Specifically, we investigate three state-of-the-art VLMs, Qwen2.5, LLaVA1.5, and InternVL3.5, on the GraSP robotic surgery dataset under both zero-shot and parameter-efficient LoRA fine-tuning settings. Our results demonstrate that Qwen2.5 consistently achieves superior detection performance in both configurations among the evaluated VLMs. Furthermore, compared with the open-set detection baseline Grounding DINO, Qwen2.5 exhibits stronger zero-shot generalization and comparable fine-tuned performance. Notably, Qwen2.5 shows superior instrument recognition, while Grounding DINO demonstrates stronger localization.

[137] LoL: Longer than Longer, Scaling Video Generation to Hour

Justin Cui, Jie Wu, Ming Li, Tao Yang, Xiaojie Li, Rui Wang, Andrew Bai, Yuanhao Ban, Cho-Jui Hsieh

Main category: cs.CV

TL;DR: The paper addresses “sink-collapse” in long-form video generation where content repeatedly reverts to attention sink frames, proposing a training-free RoPE jitter method to enable real-time, infinite-length video generation with minimal quality decay.

DetailsMotivation: Autoregressive models for long-form video generation suffer from error accumulation and loss of long-term coherence. While attention sink frames help mitigate performance decay, they introduce a critical failure mode called "sink-collapse" where generated content repeatedly reverts to sink frames, causing abrupt scene resets and cyclic motion patterns.

Method: Proposes a lightweight, training-free approach using multi-head RoPE (Rotary Position Embedding) jitter that breaks inter-head attention homogenization and mitigates long-horizon collapse. The method addresses the inherent conflict between RoPE’s periodic structure and multi-head attention mechanisms in current generative models.
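
To make the mechanism concrete, here is a generic sketch of per-head RoPE phase jitter: each attention head receives a slightly offset rotary phase so that heads stop locking onto the same period. The offset scale is a made-up hyperparameter, and this illustrates the idea rather than the paper's exact procedure.

```python
import torch

def jittered_rope_angles(T, dim, n_heads, jitter=0.5, base=10000.0):
    inv_freq = base ** (-torch.arange(0, dim, 2).float() / dim)
    angles = torch.arange(T).float()[:, None] * inv_freq[None, :]  # (T, dim/2)
    # Per-head phase offsets break inter-head attention homogenization.
    offsets = (torch.rand(n_heads, 1, 1) - 0.5) * 2 * jitter
    return angles[None] + offsets                                  # (H, T, dim/2)

def apply_rope(x, angles):                # x: (B, H, T, dim), dim even
    x1, x2 = x[..., ::2], x[..., 1::2]
    cos, sin = angles.cos()[None], angles.sin()[None]
    return torch.stack([x1 * cos - x2 * sin,
                        x1 * sin + x2 * cos], dim=-1).flatten(-2)
```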

Result: The method successfully alleviates sink-collapse while preserving generation quality. Achieves real-time, streaming, and infinite-length video generation with little quality decay. Demonstrates continuous videos up to 12 hours in length, among the longest publicly demonstrated results in streaming video generation.

Conclusion: The proposed training-free RoPE jitter approach effectively suppresses sink-collapse in long-form video generation, enabling robust, high-quality, infinite-length video generation without the cyclic failure modes that plague current autoregressive models.

Abstract: Recent research in long-form video generation has shifted from bidirectional to autoregressive models, yet these methods commonly suffer from error accumulation and a loss of long-term coherence. While attention sink frames have been introduced to mitigate this performance decay, they often induce a critical failure mode we term sink-collapse: the generated content repeatedly reverts to the sink frame, resulting in abrupt scene resets and cyclic motion patterns. Our analysis reveals that sink-collapse originates from an inherent conflict between the periodic structure of Rotary Position Embedding (RoPE) and the multi-head attention mechanisms prevalent in current generative models. To address it, we propose a lightweight, training-free approach that effectively suppresses this behavior by introducing multi-head RoPE jitter that breaks inter-head attention homogenization and mitigates long-horizon collapse. Extensive experiments show that our method successfully alleviates sink-collapse while preserving generation quality. To the best of our knowledge, this work achieves the first demonstration of real-time, streaming, and infinite-length video generation with little quality decay. As an illustration of this robustness, we generate continuous videos up to 12 hours in length, which, to our knowledge, is among the longest publicly demonstrated results in streaming video generation.

[138] Reward-Forcing: Autoregressive Video Generation with Reward Feedback

Jingran Zhang, Ning Li, Yuanhao Ban, Andrew Bai, Justin Cui

Main category: cs.CV

TL;DR: A new autoregressive video generation method uses reward signals instead of teacher models, achieving comparable performance to state-of-the-art autoregressive methods while simplifying training and avoiding teacher architecture constraints.

DetailsMotivation: Prior autoregressive video generation adaptations rely heavily on teacher models, which limit performance when strong autoregressive teachers are unavailable, resulting in output quality lagging behind bidirectional models.

Method: Uses reward signals to guide the autoregressive generation process, enabling more efficient and scalable training while preserving visual fidelity and temporal consistency.

Result: Achieves comparable performance to existing autoregressive models (84.92 vs 84.31 on VBench) and sometimes surpasses similarly sized bidirectional models by avoiding teacher architecture constraints.

Conclusion: Reward-guided autoregressive generation provides an effective alternative to teacher-dependent approaches, simplifying training while maintaining competitive performance with state-of-the-art methods.

Abstract: While most prior work in video generation relies on bidirectional architectures, recent efforts have sought to adapt these models into autoregressive variants to support near real-time generation. However, such adaptations often depend heavily on teacher models, which can limit performance, particularly in the absence of a strong autoregressive teacher, resulting in output quality that typically lags behind their bidirectional counterparts. In this paper, we explore an alternative approach that uses reward signals to guide the generation process, enabling more efficient and scalable autoregressive generation. By using reward signals to guide the model, our method simplifies training while preserving high visual fidelity and temporal consistency. Through extensive experiments on standard benchmarks, we find that our approach performs comparably to existing autoregressive models and, in some cases, surpasses similarly sized bidirectional models by avoiding constraints imposed by teacher architectures. For example, on VBench, our method achieves a total score of 84.92, closely matching state-of-the-art autoregressive methods that score 84.31 but require significant heterogeneous distillation.

[139] Domain-invariant Mixed-domain Semi-supervised Medical Image Segmentation with Clustered Maximum Mean Discrepancy Alignment

Ba-Thinh Lam, Thanh-Huy Nguyen, Hoang-Thien Nguyen, Quang-Khai Bui-Tran, Nguyen Lan Vi Vu, Phat K. Huynh, Ulas Bagci, Min Xu

Main category: cs.CV

TL;DR: A domain-invariant mixed-domain semi-supervised segmentation framework for medical images that handles unknown domain gaps without explicit domain labels, using copy-paste augmentation and cluster-based feature alignment.

DetailsMotivation: Real-world medical image segmentation faces two key challenges: 1) scarcity of expert annotations, and 2) mixed-domain data from multiple scanners/centers with unknown domain labels and severe domain gaps. Existing methods assume either single domain shifts or require explicit domain indices, which don't match practical deployment scenarios.

Method: Proposes a domain-invariant mixed-domain semi-supervised segmentation framework with two main components: 1) Copy-Paste Mechanism (CPM) that transfers informative regions across domains to augment training data diversity, and 2) Cluster Maximum Mean Discrepancy (CMMD) block that clusters unlabeled features and aligns them with labeled anchors using MMD objective to encourage domain-invariant representations. The method is integrated within a teacher-student framework.
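
The MMD piece of the CMMD block can be stated compactly; the RBF-kernel estimator below is the standard form (the clustering that groups the unlabeled features, and the bandwidth choice, are elided assumptions).

```python
import torch

def mmd_rbf(x, y, sigma=1.0):
    """x: (n, d) unlabeled cluster features; y: (m, d) labeled anchors."""
    k = lambda a, b: torch.exp(-torch.cdist(a, b).pow(2) / (2 * sigma**2))
    # Biased MMD^2 estimate: small when the two feature sets are aligned.
    return k(x, x).mean() + k(y, y).mean() - 2 * k(x, y).mean()
```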

Result: Experiments on Fundus and M&Ms benchmarks show the approach consistently surpasses semi-supervised and domain adaptation methods. It achieves robust and precise segmentation even with very few labeled examples and multiple unknown domain discrepancies.

Conclusion: The proposed framework establishes a potential solution for mixed-domain semi-supervised medical image segmentation, addressing both annotation scarcity and unknown domain gaps in real-world deployment scenarios.

Abstract: Deep learning has shown remarkable progress in medical image semantic segmentation, yet its success heavily depends on large-scale expert annotations and consistent data distributions. In practice, annotations are scarce, and images are collected from multiple scanners or centers, leading to mixed-domain settings with unknown domain labels and severe domain gaps. Existing semi-supervised or domain adaptation approaches typically assume either a single domain shift or access to explicit domain indices, which rarely hold in real-world deployment. In this paper, we propose a domain-invariant mixed-domain semi-supervised segmentation framework that jointly enhances data diversity and mitigates domain bias. A Copy-Paste Mechanism (CPM) augments the training set by transferring informative regions across domains, while a Cluster Maximum Mean Discrepancy (CMMD) block clusters unlabeled features and aligns them with labeled anchors via an MMD objective, encouraging domain-invariant representations. Integrated within a teacher-student framework, our method achieves robust and precise segmentation even with very few labeled examples and multiple unknown domain discrepancies. Experiments on Fundus and M&Ms benchmarks demonstrate that our approach consistently surpasses semi-supervised and domain adaptation methods, establishing a potential solution for mixed-domain semi-supervised medical image segmentation.

[140] VisGym: Diverse, Customizable, Scalable Environments for Multimodal Agents

Zirui Wang, Junyi Zhang, Jiaxin Ge, Long Lian, Letian Fu, Lisa Dunlap, Ken Goldberg, XuDong Wang, Ion Stoica, David M. Chan, Sewon Min, Joseph E. Gonzalez

Main category: cs.CV

TL;DR: VisGym introduces a comprehensive evaluation suite of 17 environments to test Vision-Language Models on multi-step visual interactions, revealing significant limitations in current models’ ability to integrate perception, memory, and action over long horizons.

DetailsMotivation: Modern Vision-Language Models (VLMs) lack proper characterization in multi-step visual interactions, particularly in how they integrate perception, memory, and action over long horizons. There's a need for systematic evaluation tools to understand their limitations in interactive settings.

Method: Created VisGym, a gymnasium of 17 environments spanning symbolic puzzles, real-image understanding, navigation, and manipulation. The suite provides flexible controls over difficulty, input representation, planning horizon, and feedback. Also developed multi-step solvers to generate structured demonstrations for supervised finetuning.

Result: All frontier models struggle in interactive settings, achieving low success rates (46.6% easy, 26.0% hard). Key limitations: models struggle with long context (perform worse with unbounded history), text-based symbolic tasks become harder when rendered visually. However, explicit goal observations, textual feedback, and exploratory demonstrations yield consistent gains.

Conclusion: VisGym provides a comprehensive framework for evaluating and improving VLMs in multi-step visual decision-making. The findings highlight concrete failure modes and pathways for improvement, particularly in handling long-horizon interactions and visual representations of symbolic tasks.

Abstract: Modern Vision-Language Models (VLMs) remain poorly characterized in multi-step visual interactions, particularly in how they integrate perception, memory, and action over long horizons. We introduce VisGym, a gymnasium of 17 environments for evaluating and training VLMs. The suite spans symbolic puzzles, real-image understanding, navigation, and manipulation, and provides flexible controls over difficulty, input representation, planning horizon, and feedback. We also provide multi-step solvers that generate structured demonstrations, enabling supervised finetuning. Our evaluations show that all frontier models struggle in interactive settings, achieving low success rates in both the easy (46.6%) and hard (26.0%) configurations. Our experiments reveal notable limitations: models struggle to effectively leverage long context, performing worse with an unbounded history than with truncated windows. Furthermore, we find that several text-based symbolic tasks become substantially harder once rendered visually. However, explicit goal observations, textual feedback, and exploratory demonstrations in partially observable or unknown-dynamics settings for supervised finetuning yield consistent gains, highlighting concrete failure modes and pathways for improving multi-step visual decision-making. Code, data, and models can be found at: https://visgym.github.io/.

[141] SyncLight: Controllable and Consistent Multi-View Relighting

David Serrano-Lozano, Anand Bhattad, Luis Herranz, Jean-François Lalonde, Javier Vazquez-Corral

Main category: cs.CV

TL;DR: SyncLight is the first method for consistent parametric relighting across multiple uncalibrated views of static scenes, enabling precise lighting control across multi-view captures conditioned on a single reference edit.

DetailsMotivation: Existing single-view relighting methods struggle with maintaining rigorous lighting consistency needed for multi-camera broadcasts, stereoscopic cinema, and virtual production. Current generative approaches fail to ensure consistent lighting across multiple views, which is essential for professional applications.

Method: Uses a multi-view diffusion transformer trained with a latent bridge matching formulation. The model is trained on a large-scale hybrid dataset of synthetic environments (from existing sources and newly designed scenes) plus high-fidelity real-world multi-view captures under calibrated illumination. Surprisingly, though trained only on image pairs, it generalizes zero-shot to arbitrary numbers of viewpoints.

Result: Achieves high-fidelity relighting of entire image sets in a single inference step. Enables precise control over light intensity and color across multi-view captures. Generalizes zero-shot to arbitrary numbers of viewpoints without requiring camera pose information, effectively propagating lighting changes across all views.

Conclusion: SyncLight enables practical relighting workflows for multi-view capture systems, addressing the critical need for consistent lighting across multiple views in professional applications like broadcasting, cinema, and virtual production.

Abstract: We present SyncLight, the first method to enable consistent, parametric relighting across multiple uncalibrated views of a static scene. While single-view relighting has advanced significantly, existing generative approaches struggle to maintain the rigorous lighting consistency essential for multi-camera broadcasts, stereoscopic cinema, and virtual production. SyncLight addresses this by enabling precise control over light intensity and color across a multi-view capture of a scene, conditioned on a single reference edit. Our method leverages a multi-view diffusion transformer trained using a latent bridge matching formulation, achieving high-fidelity relighting of the entire image set in a single inference step. To facilitate training, we introduce a large-scale hybrid dataset comprising diverse synthetic environments – curated from existing sources and newly designed scenes – alongside high-fidelity, real-world multi-view captures under calibrated illumination. Surprisingly, though trained only on image pairs, SyncLight generalizes zero-shot to an arbitrary number of viewpoints, effectively propagating lighting changes across all views, without requiring camera pose information. SyncLight enables practical relighting workflows for multi-view capture systems.

[142] AnyView: Synthesizing Any Novel View in Dynamic Scenes

Basile Van Hoorick, Dian Chen, Shun Iwase, Pavel Tokmakov, Muhammad Zubair Irshad, Igor Vasiljevic, Swati Gupta, Fangzhou Cheng, Sergey Zakharov, Vitor Campagnolo Guizilini

Main category: cs.CV

TL;DR: AnyView is a diffusion-based video generation framework for dynamic view synthesis that can generate zero-shot novel videos from arbitrary camera trajectories, maintaining spatiotemporal consistency even in extreme dynamic scenarios where other methods fail.

DetailsMotivation: Current generative video models struggle with multi-view and spatiotemporal consistency in highly dynamic real-world environments, especially when viewpoints have minimal overlap.

Method: A diffusion-based framework trained on multiple data sources (2D monocular, 3D static multi-view, and 4D dynamic multi-view datasets) to learn a generalist spatiotemporal implicit representation with minimal inductive biases.

Result: Competitive performance on standard benchmarks and superior results on the new AnyViewBench for extreme dynamic view synthesis, where most baselines degrade significantly while AnyView maintains realistic, plausible, and consistent videos.

Conclusion: AnyView demonstrates robust dynamic view synthesis capabilities from arbitrary viewpoints, addressing limitations of existing methods in extreme real-world scenarios with minimal viewpoint overlap.

Abstract: Modern generative video models excel at producing convincing, high-quality outputs, but struggle to maintain multi-view and spatiotemporal consistency in highly dynamic real-world environments. In this work, we introduce \textbf{AnyView}, a diffusion-based video generation framework for \emph{dynamic view synthesis} with minimal inductive biases or geometric assumptions. We leverage multiple data sources with various levels of supervision, including monocular (2D), multi-view static (3D) and multi-view dynamic (4D) datasets, to train a generalist spatiotemporal implicit representation capable of producing zero-shot novel videos from arbitrary camera locations and trajectories. We evaluate AnyView on standard benchmarks, showing competitive results with the current state of the art, and propose \textbf{AnyViewBench}, a challenging new benchmark tailored towards \emph{extreme} dynamic view synthesis in diverse real-world scenarios. In this more dramatic setting, we find that most baselines drastically degrade in performance, as they require significant overlap between viewpoints, while AnyView maintains the ability to produce realistic, plausible, and spatiotemporally consistent videos when prompted from \emph{any} viewpoint. Results, data, code, and models can be viewed at: https://tri-ml.github.io/AnyView/

[143] PanoNormal: Monocular Indoor 360° Surface Normal Estimation

Kun Huang, Fanglue Zhang, Neil Dodgson

Main category: cs.CV

TL;DR: PanoNormal is a novel architecture for monocular surface normal estimation from 360° images that combines CNNs and vision transformers to address spherical distortion while capturing both global scene structure and local geometric details.

DetailsMotivation: Existing approaches for 360° depth estimation perform poorly on surface normal prediction due to architectural bias toward global scene layout at the expense of local geometric cues. CNNs struggle with spherical distortion and fixed receptive fields, while ViTs lose local detail despite capturing long-range dependencies.

Method: Proposes PanoNormal architecture that integrates CNNs and ViTs with a multi-level global self-attention mechanism explicitly designed for spherical feature distribution, enabling recovery of both global contextual structure and local geometric details.

Result: Achieves state-of-the-art performance on several benchmark 360° datasets and significantly outperforms adapted depth estimation models on surface normal prediction tasks.

Conclusion: The proposed PanoNormal architecture successfully addresses the limitations of existing approaches by combining complementary strengths of CNNs and ViTs with spherical-aware attention, providing an effective solution for monocular surface normal estimation from 360° images.

Abstract: The presence of spherical distortion in equirectangular projection (ERP) images presents a persistent challenge in dense regression tasks such as surface normal estimation. Although it may appear straightforward to repurpose architectures developed for 360° depth estimation, our empirical findings indicate that such models yield suboptimal performance when applied to surface normal prediction. This is largely attributed to their architectural bias toward capturing global scene layout, which comes at the expense of the fine-grained local geometric cues that are critical for accurate surface orientation estimation. While convolutional neural networks (CNNs) have been employed to mitigate spherical distortion, their fixed receptive fields limit their ability to capture holistic scene structure. Conversely, vision transformers (ViTs) are capable of modeling long-range dependencies via global self-attention, but often fail to preserve high-frequency local detail. To address these limitations, we propose \textit{PanoNormal}, a monocular surface normal estimation architecture for 360° images that integrates the complementary strengths of CNNs and ViTs. In particular, we design a multi-level global self-attention mechanism that explicitly accounts for the spherical feature distribution, enabling our model to recover both global contextual structure and local geometric details. Experimental results demonstrate that our method not only achieves state-of-the-art performance on several benchmark 360° datasets, but also significantly outperforms adapted depth estimation models on the task of surface normal prediction. The code and model are available at https://github.com/huangkun101230/PanoNormal.

[144] NFL-BA: Near-Field Light Bundle Adjustment for SLAM in Dynamic Lighting

Andrea Dunn Beltran, Daniel Rho, Marc Niethammer, Roni Sengupta

Main category: cs.CV

TL;DR: NFL-BA improves SLAM performance in near-field lighting conditions by explicitly modeling dynamic illumination in bundle adjustment loss.

DetailsMotivation: Many real-world SLAM applications (endoscopy, subterranean robotics, search & rescue) require operating with co-located light and camera in dark environments, where dynamic near-field lighting causes strong view-dependent shading that degrades SLAM performance.

Method: Introduces Near-Field Lighting Bundle Adjustment Loss (NFL-BA) that explicitly models near-field lighting as part of Bundle Adjustment loss, integrable into neural rendering-based SLAM systems with implicit or explicit scene representations.
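
For intuition, a basic co-located near-field light model of the kind such a loss builds on looks like the sketch below: image intensity falls off with the inverse square of the distance to the camera/light and with the cosine of the incidence angle. The paper's actual NFL-BA formulation is not reproduced here.

```python
import torch

def near_field_shading(points_cam, normals, albedo):
    """points_cam: (N, 3) surface points in the camera frame, with the
    light assumed co-located at the origin; normals, albedo: (N, 3)/(N, 1)."""
    d = points_cam.norm(dim=-1, keepdim=True)
    light_dir = -points_cam / d.clamp(min=1e-6)        # toward the light
    cos = (normals * light_dir).sum(-1, keepdim=True).clamp(min=0.0)
    return albedo * cos / d.pow(2).clamp(min=1e-6)     # rendered intensity
```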

Result: Significant improvements in camera tracking: 37% for MonoGS and 14% for EndoGS on C3VD colonoscopy dataset, achieving state-of-the-art performance. Also shows improvement on indoor scenes captured with phone flashlight.

Conclusion: NFL-BA effectively addresses near-field lighting challenges in SLAM, enabling better performance in critical applications like endoscopy where accurate 3D reconstruction and tracking are essential for medical outcomes.

Abstract: Simultaneous Localization and Mapping (SLAM) systems typically assume static, distant illumination; however, many real-world scenarios, such as endoscopy, subterranean robotics, and search & rescue in collapsed environments, require agents to operate with a co-located light and camera in the absence of external lighting. In such cases, dynamic near-field lighting introduces strong, view-dependent shading that significantly degrades SLAM performance. We introduce Near-Field Lighting Bundle Adjustment Loss (NFL-BA) which explicitly models near-field lighting as a part of Bundle Adjustment loss and enables better performance for scenes captured with dynamic lighting. NFL-BA can be integrated into neural rendering-based SLAM systems with implicit or explicit scene representations. Our evaluations mainly focus on endoscopy procedure where SLAM can enable autonomous navigation, guidance to unsurveyed regions, blindspot detections, and 3D visualizations, which can significantly improve patient outcomes and endoscopy experience for both physicians and patients. Replacing Photometric Bundle Adjustment loss of SLAM systems with NFL-BA leads to significant improvement in camera tracking, 37% for MonoGS and 14% for EndoGS, and leads to state-of-the-art camera tracking and mapping performance on the C3VD colonoscopy dataset. Further evaluation on indoor scenes captured with phone camera with flashlight turned on, also demonstrate significant improvement in SLAM performance due to NFL-BA. See results at https://asdunnbe.github.io/NFL-BA/

[145] Text-to-Image Diffusion Models Cannot Count, and Prompt Refinement Cannot Help

Xuyang Guo, Jiayan Huo, Yingyu Liang, Zhenmei Shi, Zhao Song, Jiahao Zhang, Zhen Zhuang

Main category: cs.CV

TL;DR: T2ICountBench is a new benchmark that reveals state-of-the-art text-to-image diffusion models fundamentally fail to generate correct numbers of objects, with accuracy dropping as object count increases.

DetailsMotivation: Diffusion models have become the standard for text-to-image generation but exhibit fundamental limitations in adhering to numerical constraints in user instructions, particularly in generating correct numbers of objects. Despite prior mentions of this issue, there's a lack of comprehensive and rigorous evaluation of this limitation.

Method: The authors introduce T2ICountBench, a novel benchmark designed to rigorously evaluate counting ability in text-to-image diffusion models. The benchmark encompasses diverse generative models (open-source and private systems), isolates counting performance from other capabilities, provides structured difficulty levels, and incorporates human evaluations for reliability.

Result: Extensive evaluations reveal that all state-of-the-art diffusion models fail to generate correct numbers of objects, with accuracy dropping significantly as the number of objects increases. An exploratory study on prompt refinement shows that simple interventions generally don’t improve counting accuracy.

Conclusion: The findings highlight inherent challenges in numerical understanding within diffusion models and point to promising directions for future improvements in text-to-image generation systems.

Abstract: Generative modeling is widely regarded as one of the most essential problems in today’s AI community, with text-to-image generation having gained unprecedented real-world impacts. Among various approaches, diffusion models have achieved remarkable success and have become the de facto solution for text-to-image generation. However, despite their impressive performance, these models exhibit fundamental limitations in adhering to numerical constraints in user instructions, frequently generating images with an incorrect number of objects. While several prior works have mentioned this issue, a comprehensive and rigorous evaluation of this limitation remains lacking. To address this gap, we introduce T2ICountBench, a novel benchmark designed to rigorously evaluate the counting ability of state-of-the-art text-to-image diffusion models. Our benchmark encompasses a diverse set of generative models, including both open-source and private systems. It explicitly isolates counting performance from other capabilities, provides structured difficulty levels, and incorporates human evaluations to ensure high reliability. Extensive evaluations with T2ICountBench reveal that all state-of-the-art diffusion models fail to generate the correct number of objects, with accuracy dropping significantly as the number of objects increases. Additionally, an exploratory study on prompt refinement demonstrates that such simple interventions generally do not improve counting accuracy. Our findings highlight the inherent challenges in numerical understanding within diffusion models and point to promising directions for future improvements.

[146] Efficient Multi-scale Masked Autoencoders with Hybrid-Attention Mechanism for Breast Lesion Classification

Hung Q. Vo, Pengyu Yuan, Zheng Yin, Kelvin K. Wong, Chika F. Ezeana, Son T. Ly, Hien V. Nguyen, Stephen T. C. Wong

Main category: cs.CV

TL;DR: MIRAM is a multi-scale masked autoencoder with hybrid-attention that reduces quadratic complexity to linear for high-resolution medical image analysis, enabling state-of-the-art SSL on consumer GPUs.

DetailsMotivation: Standard self-attention in Vision Transformers has quadratic complexity (O(N²)), which creates a severe computational barrier for high-resolution biomedical tasks and excludes resource-constrained labs from using state-of-the-art models.

Method: Proposes MIRAM, a multi-scale masked autoencoder with hybrid-attention mechanism using dual-decoder design: standard transformer decoder for global semantics at low resolution, and linear-complexity decoder (Linformer, Performer, or Nyströmformer) for high-resolution reconstruction, reducing upscaling complexity from quadratic to linear (O(N)).
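
The O(N^2)-to-O(N) step can be illustrated with kernelized linear attention: with a positive feature map phi, softmax attention's (QK^T)V is replaced by phi(Q)(phi(K)^T V), so the N x N matrix is never formed. The ELU-based feature map below follows the common linear-transformer recipe and is only a stand-in for the Linformer/Performer/Nyströmformer decoders the paper compares.

```python
import torch
import torch.nn.functional as F

def linear_attention(q, k, v):            # q, k: (B, H, N, d); v: (B, H, N, e)
    phi = lambda t: F.elu(t) + 1          # positive feature map
    q, k = phi(q), phi(k)
    kv = torch.einsum("bhnd,bhne->bhde", k, v)             # O(N d e), not O(N^2)
    z = 1.0 / (q * k.sum(dim=2, keepdim=True)).sum(-1, keepdim=True)
    return torch.einsum("bhnd,bhde->bhne", q, kv) * z      # normalized output
```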

Result: On CBIS-DDSM mammography dataset, Nyströmformer-based variant achieves 61.0% classification accuracy, outperforming standard MAE (58.9%) and MoCo-v3 (60.2%) while requiring significantly less memory.

Conclusion: Hybrid-attention architectures can democratize high-resolution medical AI by making powerful self-supervised learning accessible to researchers with limited hardware resources.

Abstract: Self-supervised learning (SSL) with Vision Transformers (ViT) has shown immense potential in medical image analysis. However, the quadratic complexity ($\mathcal{O}(N^2)$) of standard self-attention poses a severe barrier for high-resolution biomedical tasks, effectively excluding resource-constrained research labs from utilizing state-of-the-art models. To address this computational bottleneck without sacrificing diagnostic accuracy, we propose \textbf{MIRAM}, a Multi-scale Masked Autoencoder that leverages a \textbf{hybrid-attention mechanism}. Our architecture uniquely decouples semantic learning from detail reconstruction using a dual-decoder design: a standard transformer decoder captures global semantics at low resolution, while a linear-complexity decoder (comparing Linformer, Performer, and Nyströmformer) handles the computationally expensive high-resolution reconstruction. This reduces the complexity of the upscaling stage from quadratic to linear ($\mathcal{O}(N)$), enabling high-fidelity training on consumer-grade GPUs. We validate our approach on the CBIS-DDSM mammography dataset. Remarkably, our \textbf{Nyströmformer-based variant} achieves a classification accuracy of \textbf{61.0%}, outperforming both standard MAE (58.9%) and MoCo-v3 (60.2%) while requiring significantly less memory. These results demonstrate that hybrid-attention architectures can democratize high-resolution medical AI, making powerful SSL accessible to researchers with limited hardware resources.

[147] UltraFlwr – An Efficient Federated Surgical Object Detection Framework

Yang Li, Soumya Snigdha Kundu, Maxence Boels, Toktam Mahmoodi, Sebastien Ourselin, Tom Vercauteren, Prokar Dasgupta, Jonathan Shapey, Alejandro Granados

Main category: cs.CV

TL;DR: UltraFlwr is an open-source federated learning framework that integrates Ultralytics YOLO with Flower FL platform, enabling communication-efficient collaborative training of surgical object detection models across heterogeneous surgical data without sharing raw data.

DetailsMotivation: Training robust YOLO models for surgical object detection faces challenges from limited data, privacy constraints, and inter-institutional variability. Federated learning enables collaborative training without sharing raw data, but practical support for modern YOLO pipelines under heterogeneous surgical data remains limited.

Method: Developed UltraFlwr framework integrating Ultralytics YOLO with Flower FL platform, supporting native Partial Aggregation of YOLO components (backbone, neck, head). Conducted systematic empirical study using two public laparoscopic surgical tool detection datasets under IID and multiple clinically motivated heterogeneous scenarios (data curation differences, video length variations, label availability).
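
Partial aggregation itself reduces to averaging only a subset of state-dict entries; a minimal sketch is below, with the 'backbone.'/'neck.' key prefixes as illustrative placeholders rather than UltraFlwr's actual parameter naming.

```python
import torch

def partial_fedavg(client_states, prefixes=("backbone.", "neck.")):
    """client_states: list of model state dicts, one per client."""
    agg = {}
    for key in client_states[0]:
        if key.startswith(prefixes):
            stacked = torch.stack([s[key].float() for s in client_states])
            agg[key] = stacked.mean(dim=0)   # FedAvg over clients
        # Non-matching keys (e.g., detection heads) stay local and are
        # simply omitted from the broadcast back to clients.
    return agg
```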

Result: Standard FL aggregators (e.g., FedAvg) don’t consistently match centralized training per client but reduce inter-client performance variability. Aggregating both backbone and neck components achieves performance comparable to full aggregation with lower communication costs. Improving within-client data consistency benefits FL even when it increases distribution shift across clients.

Conclusion: UltraFlwr provides practical guidance for deploying federated YOLO-based object detection in heterogeneous surgical environments, offering communication-efficient collaborative training without data sharing. The framework is publicly available as open-source software.

Abstract: Surgical object detection in laparoscopic videos enables real-time instrument identification for workflow analysis and skills assessment, but training robust models such as You Only Look Once (YOLO) is challenged by limited data, privacy constraints, and inter-institutional variability. Federated learning (FL) enables collaborative training without sharing raw data, yet practical support for modern YOLO pipelines under heterogeneous surgical data remains limited. We present UltraFlwr, an open-source, communication-efficient, and edge-deployable framework that integrates Ultralytics YOLO with the Flower FL platform and supports native Partial Aggregation (PA) of YOLO components (backbone, neck, head). Using two public laparoscopic surgical tool detection datasets, we conduct a systematic empirical study of federated YOLO training under Independent and Identically Distributed (IID) and multiple clinically motivated heterogeneous scenarios, including differences in data curation, video length, and label availability. Results show that standard FL aggregators (e.g., FedAvg) do not consistently match centralized training per client, but reduce inter-client performance variability. Aggregating both backbone and neck components achieves performance comparable to full aggregation with lower communication costs. Also, improving within-client data consistency can benefit FL even when it increases distribution shift across clients. These findings provide practical guidance for deploying federated YOLO-based object detection in heterogeneous surgical environments. UltraFlwr is publicly available at https://github.com/KCL-BMEIS/UltraFlwr.

[148] Decoupling Multi-Contrast Super-Resolution: Self-Supervised Implicit Re-Representation for Unpaired Cross-Modal Synthesis

Yinzhe Wu, Hongyu Rui, Fanwen Wang, Jiahao Huang, Zhenxuan Zhang, Haosen Zhang, Zi Wang, Guang Yang

Main category: cs.CV

TL;DR: A novel decoupled framework for multi-contrast MRI super-resolution that combines population-level anatomical priors with patient-specific optimization, enabling arbitrary-scale upsampling without paired training data.

DetailsMotivation: Current deep learning methods for multi-contrast MRI super-resolution have two key limitations: they require large paired LR/HR datasets (which are scarce) and are limited to fixed upsampling scales. Self-supervised methods remove the paired data requirement but fail to leverage valuable population-level priors.

Method: A two-stage decoupled framework: (1) Unpaired cross-modal synthesis (uCMS) module trained on unpaired population data to learn robust anatomical priors; (2) Lightweight patient-specific implicit re-representation (IrR) module optimized in self-supervised manner to fuse population priors with subject’s LR data. Built on implicit neural representation for scale-agnostic operation.
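
The scale-agnostic property comes from the implicit-neural-representation backbone: the image is modeled as a function of continuous coordinates, so any target resolution is just a different sampling grid. A minimal Fourier-feature MLP of this kind is sketched below (the widths and feature scale are arbitrary, and the module is generic rather than the paper's IrR design).

```python
import torch
import torch.nn as nn

class ImplicitImage(nn.Module):
    """Image as a function of continuous (x, y) coordinates in [-1, 1]^2."""
    def __init__(self, hidden=256, n_freq=32, scale=10.0):
        super().__init__()
        self.register_buffer("B", torch.randn(2, n_freq) * scale)
        self.mlp = nn.Sequential(
            nn.Linear(2 * n_freq, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1))          # predicted intensity

    def forward(self, xy):                 # xy: (N, 2) query coordinates
        z = 2 * torch.pi * xy @ self.B     # random Fourier features
        return self.mlp(torch.cat([z.sin(), z.cos()], dim=-1))

# Any upsampling scale is just a denser query grid, e.g. 16x:
# coords = torch.stack(torch.meshgrid(
#     torch.linspace(-1, 1, 16 * H), torch.linspace(-1, 1, 16 * W),
#     indexing="ij"), dim=-1).reshape(-1, 2)
```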

Result: Superior quantitative performance on different datasets, with exceptional robustness at extreme scales (16x, 32x) where competing methods fail. Demonstrates data-efficient, flexible, and computationally lightweight paradigm.

Conclusion: The proposed framework enables high-fidelity, arbitrary-scale multi-contrast super-resolution by uniquely fusing population-level knowledge with patient-specific fidelity without requiring any paired LR/HR or paired cross-modal training data.

Abstract: Multi-contrast super-resolution (MCSR) is crucial for enhancing MRI but current deep learning methods are limited. They typically require large, paired low- and high-resolution (LR/HR) training datasets, which are scarce, and are trained for fixed upsampling scales. While recent self-supervised methods remove the paired data requirement, they fail to leverage valuable population-level priors. In this work, we propose a novel, decoupled MCSR framework that resolves both limitations. We reformulate MCSR into two stages: (1) an unpaired cross-modal synthesis (uCMS) module, trained once on unpaired population data to learn a robust anatomical prior; and (2) a lightweight, patient-specific implicit re-representation (IrR) module. This IrR module is optimized in a self-supervised manner to fuse the population prior with the subject’s own LR target data. This design uniquely fuses population-level knowledge with patient-specific fidelity without requiring any paired LR/HR or paired cross-modal training data. By building the IrR module on an implicit neural representation, our framework is also inherently scale-agnostic. Our method demonstrates superior quantitative performance on different datasets, with exceptional robustness at extreme scales (16x, 32x), a regime where competing methods fail. Our work presents a data-efficient, flexible, and computationally lightweight paradigm for MCSR, enabling high-fidelity, arbitrary-scale multi-contrast super-resolution.

[149] ToonifyGB: StyleGAN-based Gaussian Blendshapes for 3D Stylized Head Avatars

Rui-Yang Ju, Sheng-Yen Huang, Yi-Ping Hung

Main category: cs.CV

TL;DR: ToonifyGB is a two-stage framework that combines StyleGAN-based facial stylization with 3D Gaussian blendshapes to create diverse stylized 3D head avatars from monocular video.

DetailsMotivation: To extend Toonify's 2D facial stylization capabilities to 3D head avatars using Gaussian blendshapes, enabling synthesis of diverse stylized 3D avatars with arbitrary expressions.

Method: Two-stage framework: 1) Improved StyleGAN generates stable stylized video from input frames without fixed-resolution cropping limitations; 2) Gaussian blendshapes synthesis learns stylized neutral head model and expression blendshapes from the stylized video.
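
Stage 2 rests on the standard linear blendshape identity: a rendered expression is the stylized neutral head plus a weighted sum of expression offsets. A small NumPy sketch of that composition, with hypothetical shapes and attribute layout:

```python
import numpy as np

def compose_expression(neutral, deltas, weights):
    """Linear blendshape model: neutral Gaussians plus weighted expression offsets.

    neutral: (N, D) per-Gaussian attributes (e.g., position, scale, opacity)
    deltas:  (K, N, D) offsets for K expression blendshapes
    weights: (K,) expression coefficients, e.g., from a face tracker
    """
    return neutral + np.tensordot(weights, deltas, axes=1)

# Example: 3 blendshapes over 10k Gaussians with 8 attributes each.
neutral = np.zeros((10_000, 8))
deltas = np.random.randn(3, 10_000, 8) * 0.01
expression = compose_expression(neutral, deltas, np.array([0.7, 0.0, 0.3]))
```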

Result: Validated effectiveness on benchmark datasets using Arcane and Pixar styles, enabling efficient rendering of stylized avatars with arbitrary expressions.

Conclusion: ToonifyGB successfully bridges 2D facial stylization with 3D avatar animation, providing an efficient framework for creating diverse stylized 3D head avatars from monocular video.

Abstract: The introduction of 3D Gaussian blendshapes has enabled the real-time reconstruction of animatable head avatars from monocular video. Toonify, a StyleGAN-based method, has become widely used for facial image stylization. To extend Toonify for synthesizing diverse stylized 3D head avatars using Gaussian blendshapes, we propose an efficient two-stage framework, ToonifyGB. In Stage 1 (stylized video generation), we adopt an improved StyleGAN to generate the stylized video from the input video frames, which overcomes the limitation of cropping aligned faces at a fixed resolution as preprocessing for normal StyleGAN. This process provides a more stable stylized video, which enables Gaussian blendshapes to better capture the high-frequency details of the video frames, facilitating the synthesis of high-quality animations in the next stage. In Stage 2 (Gaussian blendshapes synthesis), our method learns a stylized neutral head model and a set of expression blendshapes from the generated stylized video. By combining the neutral head model with expression blendshapes, ToonifyGB can efficiently render stylized avatars with arbitrary expressions. We validate the effectiveness of ToonifyGB on benchmark datasets using two representative styles: Arcane and Pixar.

[150] Self-supervised Learning of Echocardiographic Video Representations via Online Cluster Distillation

Divyanshu Mishra, Mohammadreza Salehi, Pramit Saha, Olga Patey, Aris T. Papageorghiou, Yuki M. Asano, J. Alison Noble

Main category: cs.CV

TL;DR: DISCOVR is a self-supervised dual-branch framework for cardiac ultrasound video representation learning that combines temporal modeling with fine-grained spatial semantics through semantic cluster distillation.

DetailsMotivation: Self-supervised learning struggles in echocardiography due to subtle anatomical structures, complex temporal dynamics, lack of domain-specific models, high intersample similarity, sensitivity to low PSNR inputs, and aggressive augmentations that distort clinically relevant features.

Method: DISCOVR uses a dual-branch framework: clustering-based video encoder for temporal dynamics and online image encoder for fine-grained spatial semantics, connected via semantic cluster distillation loss that transfers anatomical knowledge from image to video encoder.
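
The semantic cluster distillation loss can be pictured as a DINO/SwAV-style objective: the online image encoder’s soft cluster assignments serve as targets for the video encoder. A hedged sketch, where the temperatures and prototype handling are assumptions rather than the paper’s exact formulation:

```python
import torch
import torch.nn.functional as F

def cluster_distillation_loss(video_feats, image_feats, prototypes,
                              tau_student=0.1, tau_teacher=0.05):
    """Distill soft cluster assignments from the image branch to the video branch.

    video_feats, image_feats: (B, D) L2-normalized embeddings
    prototypes:               (K, D) L2-normalized cluster centers
    """
    student_logits = video_feats @ prototypes.t() / tau_student
    teacher_logits = image_feats @ prototypes.t() / tau_teacher
    targets = F.softmax(teacher_logits, dim=-1).detach()   # sharper teacher targets
    return -(targets * F.log_softmax(student_logits, dim=-1)).sum(-1).mean()
```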

Result: Outperforms specialized video anomaly detection methods and state-of-the-art video-SSL baselines on six echocardiography datasets across fetal, pediatric, and adult populations in zero-shot and linear probing setups, achieving superior segmentation transfer and strong LVEF prediction performance.

Conclusion: DISCOVR enables temporally coherent representations enriched with fine-grained semantic understanding for cardiac ultrasound, addressing domain-specific challenges and demonstrating strong clinical relevance.

Abstract: Self-supervised learning (SSL) has achieved major advances in natural images and video understanding, but challenges remain in domains like echocardiography (heart ultrasound) due to subtle anatomical structures, complex temporal dynamics, and the current lack of domain-specific pre-trained models. Existing SSL approaches such as contrastive, masked modeling, and clustering-based methods struggle with high intersample similarity, sensitivity to low PSNR inputs common in ultrasound, or aggressive augmentations that distort clinically relevant features. We present DISCOVR (Distilled Image Supervision for Cross Modal Video Representation), a self-supervised dual branch framework for cardiac ultrasound video representation learning. DISCOVR combines a clustering-based video encoder that models temporal dynamics with an online image encoder that extracts fine-grained spatial semantics. These branches are connected through a semantic cluster distillation loss that transfers anatomical knowledge from the evolving image encoder to the video encoder, enabling temporally coherent representations enriched with fine-grained semantic understanding. Evaluated on six echocardiography datasets spanning fetal, pediatric, and adult populations, DISCOVR outperforms both specialized video anomaly detection methods and state-of-the-art video-SSL baselines in zero-shot and linear probing setups, achieving superior segmentation transfer and strong downstream performance on clinically relevant tasks such as LVEF prediction. Code available at: https://github.com/mdivyanshu97/DISCOVR

[151] VOCAL: Visual Odometry via ContrAstive Learning

Chi-Yao Huang, Zeel Bhatt, Yezhou Yang

Main category: cs.CV

TL;DR: VOCAL is a novel visual odometry framework that treats VO as a label ranking problem, using contrastive learning and Bayesian inference to create interpretable, spatially coherent feature representations aligned with camera states.

DetailsMotivation: Existing learning-based VO methods rely on rigid geometric assumptions, lack interpretability, and have weak theoretical foundations in fully data-driven frameworks. There's a need for more explainable and theoretically grounded approaches.

Method: VOCAL reformulates VO as a label ranking problem, integrating Bayesian inference with representation learning. It organizes visual features to mirror camera states through contrastive learning, forcing similar camera states to converge into consistent, spatially coherent latent representations.
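
One way to read the ranking mechanism: frames with nearby camera states should rank as more similar in feature space. The sketch below converts camera-state distances into soft targets for a feature-similarity distribution; the loss form and hyperparameters are illustrative, not the paper’s exact objective.

```python
import torch
import torch.nn.functional as F

def state_ranked_contrastive_loss(feats, cam_states, tau=0.07, sigma=0.5):
    """Align feature-similarity rankings with camera-state proximity.

    feats:      (B, D) L2-normalized frame embeddings
    cam_states: (B, S) camera state vectors (e.g., relative pose parameters)
    """
    feat_sim = feats @ feats.t() / tau                 # pairwise feature similarity
    state_dist = torch.cdist(cam_states, cam_states)   # pairwise state distance
    targets = F.softmax(-state_dist / sigma, dim=-1)   # closer states rank higher
    return F.kl_div(F.log_softmax(feat_sim, dim=-1), targets,
                    reduction="batchmean")
```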

Result: Extensive evaluations on the KITTI dataset demonstrate VOCAL’s enhanced interpretability and flexibility, pushing VO toward more general and explainable spatial intelligence.

Conclusion: VOCAL represents a significant advancement in VO by providing a theoretically grounded, interpretable framework that moves beyond rigid geometric assumptions toward more generalizable spatial intelligence compatible with multimodal data.

Abstract: Breakthroughs in visual odometry (VO) have fundamentally reshaped the landscape of robotics, enabling ultra-precise camera state estimation that is crucial for modern autonomous systems. Despite these advances, many learning-based VO techniques rely on rigid geometric assumptions, which often fall short in interpretability and lack a solid theoretical basis within fully data-driven frameworks. To overcome these limitations, we introduce VOCAL (Visual Odometry via ContrAstive Learning), a novel framework that reimagines VO as a label ranking challenge. By integrating Bayesian inference with a representation learning framework, VOCAL organizes visual features to mirror camera states. The ranking mechanism compels similar camera states to converge into consistent and spatially coherent representations within the latent space. This strategic alignment not only bolsters the interpretability of the learned features but also ensures compatibility with multimodal data sources. Extensive evaluations on the KITTI dataset highlight VOCAL’s enhanced interpretability and flexibility, pushing VO toward more general and explainable spatial intelligence.

[152] T-LoRA: Single Image Diffusion Model Customization Without Overfitting

Vera Soboleva, Aibek Alanov, Andrey Kuznetsov, Konstantin Sobolev

Main category: cs.CV

TL;DR: T-LoRA introduces timestep-dependent low-rank adaptation for single-image diffusion model customization, addressing overfitting by adjusting rank constraints based on diffusion timesteps and using orthogonal initialization for adapter independence.

DetailsMotivation: Diffusion model fine-tuning often overfits with limited training samples, compromising generalization and diversity. Single-image customization is particularly challenging yet has high practical potential, requiring methods that prevent overfitting while maintaining concept fidelity and text alignment.

Method: T-LoRA uses a timestep-dependent low-rank adaptation framework with two innovations: 1) dynamic fine-tuning strategy adjusting rank-constrained updates based on diffusion timesteps (higher timesteps get more constrained), and 2) weight parametrization with orthogonal initialization to ensure independence between adapter components.
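
The core idea, timestep-gated rank, fits in a few lines: at high (noisy) timesteps only a few LoRA ranks stay active, while low timesteps use the full rank. The linear schedule and shapes below are illustrative assumptions, not the published scheme.

```python
import torch

def timestep_rank_mask(t, max_t, rank):
    """Fewer active LoRA ranks at higher (noisier) diffusion timesteps."""
    active = max(1, int(rank * (1 - t / max_t)))
    mask = torch.zeros(rank)
    mask[:active] = 1.0
    return mask

def t_lora_update(x, A, B, t, max_t):
    """Rank-gated LoRA delta: (x A^T), masked per rank, then projected by B.

    A: (r, d_in), B: (d_out, r); orthogonal initialization keeps rank
    components independent, so masking a subset of them is well behaved.
    """
    mask = timestep_rank_mask(t, max_t, A.shape[0]).to(x.device)
    return (x @ A.t()) * mask @ B.t()
```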

Result: Extensive experiments on SD-XL and FLUX-1.dev show T-LoRA outperforms standard LoRA and other diffusion personalization techniques, achieving superior balance between concept fidelity and text alignment in single-image customization tasks.

Conclusion: T-LoRA effectively addresses overfitting in single-image diffusion model customization through timestep-sensitive adaptation, demonstrating that higher diffusion timesteps are more prone to overfitting and require more constrained fine-tuning strategies.

Abstract: While diffusion model fine-tuning offers a powerful approach for customizing pre-trained models to generate specific objects, it frequently suffers from overfitting when training samples are limited, compromising both generalization capability and output diversity. This paper tackles the challenging yet most impactful task of adapting a diffusion model using just a single concept image, as single-image customization holds the greatest practical potential. We introduce T-LoRA, a Timestep-Dependent Low-Rank Adaptation framework specifically designed for diffusion model personalization. We show that higher diffusion timesteps are more prone to overfitting than lower ones, necessitating a timestep-sensitive fine-tuning strategy. T-LoRA incorporates two key innovations: (1) a dynamic fine-tuning strategy that adjusts rank-constrained updates based on diffusion timesteps, and (2) a weight parametrization technique that ensures independence between adapter components through orthogonal initialization. Extensive experiments on SD-XL and FLUX-1.dev show that T-LoRA and its individual components outperform standard LoRA and other diffusion model personalization techniques, achieving a superior balance between concept fidelity and text alignment. Project page is available at https://controlgenai.github.io/T-LoRA/.

[153] From Neck to Head: Bio-Impedance Sensing for Head Pose Estimation

Mengxi Liu, Lala Shakti Swarup Ray, Sizhen Bian, Ko Watanabe, Ankur Bhatt, Joanna Sorysz, Russel Torah, Bo Zhou, Paul Lukowicz

Main category: cs.CV

TL;DR: NeckSense is a necklace-style wearable using bio-impedance sensing for head pose tracking, achieving performance comparable to vision-based methods without line-of-sight requirements.

DetailsMotivation: Current head pose tracking methods often require line-of-sight (vision-based) or are bulky/invasive. There's a need for a compact, wearable solution that works without visual constraints and can capture subtle head movements.

Method: A necklace-style wearable with multi-channel bio-impedance sensing using soft, dry electrodes. Captures tissue impedance changes around neck modulated by head rotations and muscle activations. Uses deep learning framework with anatomical priors (joint constraints, natural rotation ranges) integrated into loss function design.
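
The anatomical priors enter through the loss: besides the data term, predictions are penalized when they exit natural head-rotation ranges. A hedged sketch follows; the actual NeckSense loss operates on a vertex-level representation, so the angle-based version below is a simplification with assumed shapes.

```python
import torch
import torch.nn.functional as F

def anatomically_constrained_loss(pred, target, rot_min, rot_max, lam=0.1):
    """Pose regression loss with a soft penalty outside natural rotation ranges.

    pred, target:     (B, 3) predicted / reference head rotations (radians)
    rot_min, rot_max: (3,) anatomical limits, e.g., yaw within roughly +/-1.4 rad
    """
    data_term = F.mse_loss(pred, target)
    below = (rot_min - pred).clamp(min=0)          # violation below the lower bound
    above = (pred - rot_max).clamp(min=0)          # violation above the upper bound
    prior_term = (below ** 2 + above ** 2).mean()  # zero inside the valid range
    return data_term + lam * prior_term
```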

Result: Validated on 7 participants using SOTA pose estimation as ground truth. Achieved mean per-vertex error of 25.9 mm across various head movements with leave-one-person-out cross-validation. Performance comparable to SOTA vision-based methods.

Conclusion: NeckSense demonstrates that compact, line-of-sight-free bio-impedance wearables can deliver head-tracking performance comparable to vision-based methods, offering a practical alternative for applications where visual tracking is limited.

Abstract: We present NeckSense, a novel wearable system for head pose tracking that leverages multi-channel bio-impedance sensing with soft, dry electrodes embedded in a lightweight, necklace-style form factor. NeckSense captures dynamic changes in tissue impedance around the neck, which are modulated by head rotations and subtle muscle activations. To robustly estimate head pose, we propose a deep learning framework that integrates anatomical priors, including joint constraints and natural head rotation ranges, into the loss function design. We validate NeckSense on 7 participants using the current SOTA pose estimation model as ground truth. Our system achieves a mean per-vertex error of 25.9 mm across various head movements with a leave-one-person-out cross-validation method, demonstrating that a compact, line-of-sight-free bio-impedance wearable can deliver head-tracking performance comparable to SOTA vision-based methods.

[154] UniFGVC: Universal Training-Free Few-Shot Fine-Grained Vision Classification via Attribute-Aware Multimodal Retrieval

Hongyu Guo, Xiangzhao Hao, Jiarui Guo, Haiyun Guo, Jinqiao Wang, Tat-Seng Chua

Main category: cs.CV

TL;DR: UniFGVC is a training-free framework that reformulates few-shot fine-grained visual classification as multimodal retrieval using structured text descriptions generated by MLLMs.

DetailsMotivation: Existing few-shot FGVC methods that fine-tune pre-trained visual language models suffer from overfitting and weak generalization. There's a need for a more robust approach that can leverage multimodal knowledge without requiring extensive training.

Method: 1. Category-Discriminative Visual Captioner (CDV-Captioner) uses MLLMs with chain-of-thought prompting and visually similar reference images to generate structured text descriptions capturing fine-grained attributes. 2. Converts images to image-description pairs for comprehensive feature representation. 3. Constructs multimodal category templates from few-shot samples. 4. Uses off-the-shelf vision and text encoders to embed queries and templates, performing FGVC through nearest neighbor retrieval in joint space.
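
The retrieval step itself is simple once the templates exist: embed the query’s image and generated description, score each few-shot template in the joint space, and return the best category. In this minimal sketch the score-fusion weight `alpha` is an assumption, not part of the paper.

```python
import numpy as np

def retrieve_category(query_img, query_txt, templates, alpha=0.5):
    """Nearest-template retrieval in a joint image-text embedding space.

    query_img, query_txt: (D,) L2-normalized query embeddings
    templates: list of (category, img_emb, txt_emb) built from few-shot samples
    """
    best_cat, best_score = None, -np.inf
    for category, img_emb, txt_emb in templates:
        score = alpha * float(query_img @ img_emb) \
              + (1 - alpha) * float(query_txt @ txt_emb)
        if score > best_score:
            best_cat, best_score = category, score
    return best_cat
```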

Result: Extensive experiments on 12 FGVC benchmarks show consistent superiority over prior few-shot CLIP-based methods and even several fully-supervised MLLMs-based approaches.

Conclusion: UniFGVC provides a universal, training-free framework with broad compatibility across diverse MLLMs and encoders, offering reliable generalization and adaptability across few-shot FGVC scenarios without suffering from overfitting issues.

Abstract: Few-shot fine-grained visual classification (FGVC) aims to leverage limited data to enable models to discriminate subtly distinct categories. Recent works mostly fine-tune pre-trained vision-language models to achieve performance gains, yet they suffer from overfitting and weak generalization. To deal with this, we introduce UniFGVC, a universal training-free framework that reformulates few-shot FGVC as multimodal retrieval. First, we propose the Category-Discriminative Visual Captioner (CDV-Captioner) to exploit the open-world knowledge of multimodal large language models (MLLMs) to generate a structured text description that captures the fine-grained attribute features distinguishing closely related classes. CDV-Captioner uses chain-of-thought prompting and visually similar reference images to reduce hallucination and enhance discrimination of generated captions. Using it, we convert each image into an image-description pair, enabling more comprehensive feature representation, and construct multimodal category templates using few-shot samples for the subsequent retrieval pipeline. Then, off-the-shelf vision and text encoders embed query and template pairs, and FGVC is accomplished by retrieving the nearest template in the joint space. UniFGVC ensures broad compatibility with diverse MLLMs and encoders, offering reliable generalization and adaptability across few-shot FGVC scenarios. Extensive experiments on 12 FGVC benchmarks demonstrate its consistent superiority over prior few-shot CLIP-based methods and even several fully-supervised MLLMs-based approaches.

[155] CoreEditor: Consistent 3D Editing via Correspondence-constrained Diffusion

Zhe Zhu, Honghua Chen, Peng Li, Mingqiang Wei

Main category: cs.CV

TL;DR: CoreEditor introduces a correspondence-constrained attention mechanism for consistent text-driven 3D editing, addressing cross-view consistency issues in existing methods.

DetailsMotivation: Existing text-driven 3D editing approaches adapted from 2D image editors often fail to maintain cross-view consistency, leading to insufficient edits and blurry details due to lack of explicit control over multi-view information exchange.

Method: CoreEditor uses a correspondence-constrained attention mechanism that enforces precise interactions between pixels expected to remain consistent throughout diffusion denoising. It incorporates both geometric alignment and semantic similarity estimated during denoising for reliable correspondence modeling. Also includes a selective editing pipeline allowing users to choose preferred results from multiple candidates.
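
Mechanically, correspondence-constrained attention is ordinary attention with the score matrix masked so each pixel attends only to its estimated correspondences in other views. A minimal sketch; how the mask is built from geometric alignment plus semantic similarity is left abstract here.

```python
import torch
import torch.nn.functional as F

def correspondence_attention(q, k, v, corr_mask):
    """Cross-view attention restricted to corresponding pixel pairs.

    q, k, v:   (B, N, D) per-pixel tokens from different views
    corr_mask: (B, N, N) boolean; True where a pair passes both the geometric
               alignment and semantic similarity checks. Assumes every query
               row has at least one valid correspondence.
    """
    scores = q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5
    scores = scores.masked_fill(~corr_mask, float("-inf"))
    return F.softmax(scores, dim=-1) @ v
```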

Result: Extensive experiments show CoreEditor produces high-quality, 3D-consistent edits with sharper details, significantly outperforming prior methods.

Conclusion: CoreEditor provides a robust framework for consistent text-to-3D editing with improved cross-view consistency, better detail preservation, and greater user control through selective editing.

Abstract: Text-driven 3D editing seeks to modify 3D scenes according to textual descriptions, and most existing approaches tackle this by adapting pre-trained 2D image editors to multi-view inputs. However, without explicit control over multi-view information exchange, they often fail to maintain cross-view consistency, leading to insufficient edits and blurry details. We introduce CoreEditor, a novel framework for consistent text-to-3D editing. The key innovation is a correspondence-constrained attention mechanism that enforces precise interactions between pixels expected to remain consistent throughout the diffusion denoising process. Beyond relying solely on geometric alignment, we further incorporate semantic similarity estimated during denoising, enabling more reliable correspondence modeling and robust multi-view editing. In addition, we design a selective editing pipeline that allows users to choose preferred results from multiple candidates, offering greater flexibility and user control. Extensive experiments show that CoreEditor produces high-quality, 3D-consistent edits with sharper details, significantly outperforming prior methods.

[156] A Novel Deep Hybrid Framework with Ensemble-Based Feature Optimization for Robust Real-Time Human Activity Recognition

Wasi Ullah, Yasir Noman Khalid, Saddam Hussain Khan

Main category: cs.CV

TL;DR: Customized Inception-V3 with region/boundary operations + AA-LSTM for temporal learning + ADFSA feature selection achieves 99.65% accuracy with only 7 features on challenging UCF-YouTube dataset.

DetailsMotivation: Real-time HAR systems face scalability issues and high computational costs due to redundant features. Need for lightweight, accurate systems that work in heterogeneous environments with challenges like occlusion, cluttered backgrounds, and poor illumination.

Method: 1) Customized Inception-V3 with region-based (average pooling) and boundary-aware (max pooling) operations for feature extraction; 2) Attention-Augmented LSTM for temporal dependency learning; 3) Novel ADFSA (Adaptive Dynamic Fitness Sharing and Attention) feature selection embedded in genetic algorithm to balance accuracy, redundancy reduction, feature uniqueness, and complexity minimization.
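
A genetic algorithm needs a scalar fitness per candidate feature mask; the sketch below scores one mask against the four stated objectives. The weights and the redundancy proxy (mean absolute pairwise correlation) are illustrative assumptions, not the published ADFSA formula.

```python
import numpy as np

def composite_fitness(mask, X, y, evaluate_acc, w=(1.0, 0.3, 0.2, 0.1)):
    """Score a binary feature-selection mask on accuracy, uniqueness,
    redundancy, and complexity (subset size)."""
    idx = np.flatnonzero(mask)
    if idx.size == 0:
        return -np.inf
    acc = evaluate_acc(X[:, idx], y)   # e.g., cross-validated classifier accuracy
    corr = np.abs(np.corrcoef(X[:, idx], rowvar=False))
    redundancy = (corr.sum() - idx.size) / max(idx.size * (idx.size - 1), 1)
    uniqueness = 1.0 - redundancy
    complexity = idx.size / X.shape[1]
    return w[0] * acc + w[1] * uniqueness - w[2] * redundancy - w[3] * complexity
```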

Result: Achieved 99.65% accuracy using only 7 selected features on challenging UCF-YouTube dataset. Improved inference time while handling occlusion, cluttered backgrounds, complex motion dynamics, and poor illumination conditions.

Conclusion: The proposed framework enables lightweight, accurate real-time HAR by combining customized CNN for spatial features, AA-LSTM for temporal dynamics, and ADFSA for optimal feature selection, achieving high performance with minimal computational overhead.

Abstract: Real-time Human Activity Recognition (HAR) has wide-ranging applications in areas such as context-aware environments, public safety, assistive technologies, and autonomous monitoring and surveillance systems. However, existing real-time HAR systems face significant challenges, including limited scalability and high computational costs arising from redundant features. To address these issues, the Inception-V3 model was customized with region-based and boundary-aware operations, using average pooling and max pooling, respectively, to enhance region homogeneity, suppress noise, and capture discriminative local features, while improving robustness through down-sampling. Furthermore, to effectively encode motion dynamics, an Attention-Augmented Long Short-Term Memory (AA-LSTM) network was employed to learn temporal dependencies across video frames. Features are extracted from the video dataset and then optimized through a novel dynamic composite feature-selection method called Adaptive Dynamic Fitness Sharing and Attention (ADFSA). This ADFSA mechanism is embedded within a genetic algorithm to select a compact, optimized subset of features by dynamically balancing multiple objectives: accuracy, redundancy reduction, feature uniqueness, and complexity minimization. As a result, the selected subset of diverse and discriminative features enables lightweight machine learning classifiers to achieve accurate and robust HAR in heterogeneous environments. Experimental results demonstrate up to 99.65% accuracy using as few as seven selected features, with improved inference time on the challenging UCF-YouTube dataset, which includes factors such as occlusion, cluttered backgrounds, complex motion dynamics, and poor illumination conditions.

[157] GAMMA: Generalizable Alignment via Multi-task and Manipulation-Augmented Training for AI-Generated Image Detection

Haozhen Yan, Yan Hong, Suning Lang, Jiahui Zhan, Yikun Ji, Yujie Gao, Huijia Zhu, Jun Lan, Jianfu Zhang

Main category: cs.CV

TL;DR: GAMMA is a novel training framework for AI-generated image detection that improves generalization to unseen generative models by reducing domain bias and enhancing semantic alignment through diverse manipulation strategies and multi-task supervision.

DetailsMotivation: Existing AI-generated image detectors perform well on in-distribution images but generalize poorly to unseen generative models due to over-reliance on generation-specific artifacts like stylistic priors and compression patterns.

Method: GAMMA introduces diverse manipulation strategies (inpainting-based manipulation, semantics-preserving perturbations) and multi-task supervision with dual segmentation heads and classification head. It uses a reverse cross-attention mechanism to allow segmentation heads to guide and correct biased representations in the classification branch.

Result: Achieves state-of-the-art generalization on GenImage benchmark with 5.8% accuracy improvement, and maintains strong robustness on newly released generative models like GPT-4o.

Conclusion: GAMMA effectively addresses generalization limitations in AI-generated image detection by reducing domain bias and enhancing semantic alignment, showing promising performance across diverse generative models.

Abstract: With generative models becoming increasingly sophisticated and diverse, detecting AI-generated images has become increasingly challenging. While existing AI-generated image detectors achieve promising performance on in-distribution generated images, their generalization to unseen generative models remains limited. This limitation is largely attributed to their reliance on generation-specific artifacts, such as stylistic priors and compression patterns. To address these limitations, we propose GAMMA, a novel training framework designed to reduce domain bias and enhance semantic alignment. GAMMA introduces diverse manipulation strategies, such as inpainting-based manipulation and semantics-preserving perturbations, to ensure consistency between manipulated and authentic content. We employ multi-task supervision with dual segmentation heads and a classification head, enabling pixel-level source attribution across diverse generative domains. In addition, a reverse cross-attention mechanism is introduced to allow the segmentation heads to guide and correct biased representations in the classification branch. Our method not only achieves state-of-the-art generalization performance on the GenImage benchmark, improving accuracy by 5.8%, but also maintains strong robustness on newly released generative models such as GPT-4o.

[158] MapAnything: Universal Feed-Forward Metric 3D Reconstruction

Nikhil Keetha, Norman Müller, Johannes Schönberger, Lorenzo Porzi, Yuchen Zhang, Tobias Fischer, Arno Knapitsch, Duncan Zauss, Ethan Weber, Nelson Antunes, Jonathon Luiten, Manuel Lopez-Antequera, Samuel Rota Bulò, Christian Richardt, Deva Ramanan, Sebastian Scherer, Peter Kontschieder

Main category: cs.CV

TL;DR: MapAnything is a unified transformer-based model that takes images and optional geometric inputs to directly regress metric 3D scene geometry and cameras, handling multiple 3D vision tasks in a single feed-forward pass.

DetailsMotivation: The paper aims to create a universal 3D reconstruction backbone that can handle diverse 3D vision tasks without task-specific architectures, addressing the fragmentation in current approaches where different tasks require specialized models.

Method: Uses a transformer-based feed-forward model with factored representation of multi-view geometry (depth maps, local ray maps, camera poses, metric scale factor). Standardizes supervision across datasets with flexible input augmentation to enable joint training on multiple tasks.
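
The factored representation makes the "upgrade to a global metric frame" explicit: per-view points are rays scaled by depth, rigidly transformed by the pose, then scaled by the metric factor. A hedged sketch with assumed tensor shapes:

```python
import torch

def factored_to_metric_points(depth, rays, pose, metric_scale):
    """Promote one view's local reconstruction into a global metric frame.

    depth:        (H, W) per-pixel depth along each ray
    rays:         (H, W, 3) unit ray directions in the local camera frame
    pose:         (4, 4) camera-to-world transform (up to scale)
    metric_scale: scalar that upgrades the scene to metric units
    """
    pts_cam = rays * depth.unsqueeze(-1)                        # (H, W, 3)
    ones = torch.ones_like(depth)[..., None]
    pts_h = torch.cat([pts_cam, ones], dim=-1).reshape(-1, 4)   # homogeneous coords
    pts_world = (pts_h @ pose.t())[:, :3]                       # rigid transform
    return metric_scale * pts_world.reshape(*depth.shape, 3)
```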

Result: MapAnything outperforms or matches specialist feed-forward models while offering more efficient joint training behavior, demonstrating effectiveness across tasks like uncalibrated SfM, calibrated MVS, monocular depth estimation, camera localization, and depth completion.

Conclusion: The work paves the way toward a universal 3D reconstruction backbone that can handle diverse 3D vision tasks in a single model, showing promising results for unified 3D scene understanding.

Abstract: We introduce MapAnything, a unified transformer-based feed-forward model that ingests one or more images along with optional geometric inputs such as camera intrinsics, poses, depth, or partial reconstructions, and then directly regresses the metric 3D scene geometry and cameras. MapAnything leverages a factored representation of multi-view scene geometry, i.e., a collection of depth maps, local ray maps, camera poses, and a metric scale factor that effectively upgrades local reconstructions into a globally consistent metric frame. Standardizing the supervision and training across diverse datasets, along with flexible input augmentation, enables MapAnything to address a broad range of 3D vision tasks in a single feed-forward pass, including uncalibrated structure-from-motion, calibrated multi-view stereo, monocular depth estimation, camera localization, depth completion, and more. We provide extensive experimental analyses and model ablations demonstrating that MapAnything outperforms or matches specialist feed-forward models while offering more efficient joint training behavior, thus paving the way toward a universal 3D reconstruction backbone.

[159] FUSAR-KLIP: Towards Multimodal Foundation Models for Remote Sensing

Yi Yang, Xiaokun Zhang, Qingchen Fang, Jing Liu, Ziqi Ye, Rui Li, Li Liu, Haipeng Wang

Main category: cs.CV

TL;DR: FUSAR-KLIP is a knowledge-guided multimodal foundation model for SAR imagery that addresses cognitive inconsistencies between general vision and remote sensing by incorporating geoscientific knowledge through structured text and iterative optimization.

DetailsMotivation: There's a fundamental cognitive inconsistency between general visual representation and remote sensing image interpretation. SAR imagery has unique characteristics (all-weather observation, coherent imaging) that create significant modal heterogeneity with general images, requiring deep geoscientific understanding that current models lack.

Method: 1) Created FUSAR-GEOVL-1M dataset with complete geographic projection attributes covering 120K images from multiple satellite platforms across 135 cities. 2) Generated aligned structured text through hierarchical cognitive thought chains encoding multidimensional semantic information. 3) Designed self-consistent iterative optimization mechanism using contrast, matching, and reconstruction in a self-supervised closed loop to guide cross-modal learning with knowledge consistent with human cognition and physical laws.

Result: Established a unified evaluation benchmark across 11 typical downstream tasks in vision and language categories, comparing with 15 mainstream foundation models. The model addresses the cognitive inconsistency between general visual representation and SAR image interpretation.

Conclusion: FUSAR-KLIP represents the first knowledge-guided general multimodal foundational model for SAR imagery, providing reusable data and evaluation baselines that bridge the gap between general visual understanding and specialized remote sensing interpretation through geoscientific knowledge integration.

Abstract: Cross-modal artificial intelligence, represented by visual language models, has achieved significant success in general image understanding. However, a fundamental cognitive inconsistency exists between general visual representation and remote sensing image interpretation: remote sensing images couple topography, terrain, and spatial structure, thereby inherently requiring models to possess deep geoscientific understanding. This cognitive difference is further amplified in synthetic aperture radar (SAR) imagery: while SAR possesses irreplaceable all-weather, all-day observation capabilities, it is constrained by coherent imaging mechanisms, exhibiting significant modal heterogeneity with general images. To address this inconsistency, we propose FUSAR-KLIP, the first knowledge-guided general multimodal foundational model for SAR, along with reusable data and evaluation baselines. Specifically: (1) FUSAR-GEOVL-1M (the first large-scale SAR dataset with complete geographic projection attributes) was constructed, covering multiple satellite platforms, 120,000 images, and 135 cities; (2) Aligned structured text was generated through hierarchical cognitive thought chains, accurately encoding more than 1 million multidimensional semantic information from geomorphological environment and regional attributes to spatial relationships; (3) A self-consistent iterative optimization mechanism was designed to guide cross-modal learning with this knowledge information consistent with human cognition and physical laws in a self-supervised closed loop consisting of contrast, matching, and reconstruction; (4) A unified evaluation benchmark was established in 11 typical downstream tasks in the two major categories of vision and language, and compared with 15 mainstream foundation models.

[160] LAKAN: Landmark-assisted Adaptive Kolmogorov-Arnold Network for Face Forgery Detection

Jiayao Jiang, Siran Peng, Bin Liu, Qi Chu, Nenghai Yu

Main category: cs.CV

TL;DR: A novel deepfake detection method using Kolmogorov-Arnold Networks (KAN) with facial landmark guidance achieves superior performance by better modeling complex forgery artifacts.

DetailsMotivation: Current CNN and Transformer-based deepfake detection methods have limitations in modeling the highly complex and non-linear nature of forgery artifacts, creating a need for more effective approaches.

Method: Proposes a KAN-based detection method that replaces fixed activation functions with learnable splines, plus a Landmark-assisted Adaptive KAN (LAKAN) module that uses facial landmarks as structural priors to dynamically generate KAN parameters and focus on critical facial regions with artifacts.
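
The LAKAN idea, landmarks acting as a hypernetwork for the KAN’s learnable activations, can be sketched compactly. Gaussian basis functions below stand in for splines, and all dimensions (including the 68-point landmark layout) are hypothetical:

```python
import torch
import torch.nn as nn

class LandmarkAdaptiveKAN(nn.Module):
    """Toy KAN-style layer whose basis coefficients are generated from landmarks."""
    def __init__(self, dim_in, dim_out, n_basis=8, lm_dim=136):
        super().__init__()
        self.dim_in, self.dim_out, self.n_basis = dim_in, dim_out, n_basis
        self.register_buffer("centers", torch.linspace(-2, 2, n_basis))
        # Hypernetwork: landmark vector -> per-(in, out) basis coefficients.
        self.hyper = nn.Linear(lm_dim, dim_in * dim_out * n_basis)

    def forward(self, x, landmarks):
        # x: (B, dim_in); landmarks: (B, lm_dim), e.g., 68 (x, y) points flattened.
        coef = self.hyper(landmarks).view(-1, self.dim_in, self.dim_out, self.n_basis)
        basis = torch.exp(-(x[..., None] - self.centers) ** 2)  # (B, dim_in, n_basis)
        # Sum instance-specific univariate functions over inputs (KAN aggregation).
        return torch.einsum("bik,biok->bo", basis, coef)
```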

Result: Extensive experiments on multiple public datasets demonstrate that the proposed method achieves superior performance compared to existing approaches.

Conclusion: The combination of KAN’s flexible architecture with facial landmark guidance creates an effective deepfake detection framework that better captures complex forgery patterns by focusing on the most informative facial regions.

Abstract: The rapid development of deepfake generation techniques necessitates robust face forgery detection algorithms. While methods based on Convolutional Neural Networks (CNNs) and Transformers are effective, there is still room for improvement in modeling the highly complex and non-linear nature of forgery artifacts. To address this issue, we propose a novel detection method based on the Kolmogorov-Arnold Network (KAN). By replacing fixed activation functions with learnable splines, our KAN-based approach is better suited to this challenge. Furthermore, to guide the network’s focus towards critical facial areas, we introduce a Landmark-assisted Adaptive Kolmogorov-Arnold Network (LAKAN) module. This module uses facial landmarks as a structural prior to dynamically generate the internal parameters of the KAN, creating an instance-specific signal that steers a general-purpose image encoder towards the most informative facial regions with artifacts. This core innovation creates a powerful combination between geometric priors and the network’s learning process. Extensive experiments on multiple public datasets show that our proposed method achieves superior performance.

[161] Markovian Reeb Graphs for Simulating Spatiotemporal Patterns of Life

Anantajit Subrahmanya, Chandrakanth Gudavalli, Connor Levenson, B. S. Manjunath

Main category: cs.CV

TL;DR: Markovian Reeb Graphs transform Reeb graphs from descriptive tools into generative models for spatiotemporal trajectories, capturing individual and population-level mobility patterns with probabilistic transitions.

DetailsMotivation: Accurate human mobility modeling is critical for urban planning, epidemiology, and traffic management, but existing approaches may lack the ability to generate realistic trajectories that preserve baseline behaviors while incorporating stochastic variability.

Method: Introduces Markovian Reeb Graphs framework with two variants: Sequential Reeb Graphs (SRGs) for individual agents and Hybrid Reeb Graphs (HRGs) that combine individual with population Patterns of Life (PoLs). The approach embeds probabilistic transitions within the Reeb graph structure to generate realistic trajectories.
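
Generation then reduces to a random walk on the Reeb graph: sample the next node from the learned transition distribution and emit its spatial anchor. A minimal sketch with hypothetical data structures:

```python
import numpy as np

def sample_trajectory(nodes, transitions, start, n_steps, seed=None):
    """Random walk over a Reeb graph with probabilistic (Markov) transitions.

    nodes:       {node_id: (x, y)} spatial anchors from the Reeb graph
    transitions: {node_id: [(next_id, prob), ...]} learned transition table
    """
    rng = np.random.default_rng(seed)
    path, current = [start], start
    for _ in range(n_steps):
        next_ids, probs = zip(*transitions[current])
        current = next_ids[rng.choice(len(next_ids), p=np.array(probs))]
        path.append(current)
    return [nodes[n] for n in path]
```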

Result: Evaluated on Urban Anomalies and Geolife datasets using five mobility statistics. HRGs achieve strong fidelity across metrics while requiring only modest trajectory datasets without specialized side information.

Conclusion: Markovian Reeb Graphs establish a promising framework for trajectory simulation with broad applicability across urban environments, transforming Reeb graphs from descriptive analysis tools into effective generative models.

Abstract: Accurately modeling human mobility is critical for urban planning, epidemiology, and traffic management. In this work, we introduce Markovian Reeb Graphs, a novel framework that transforms Reeb graphs from a descriptive analysis tool into a generative model for spatiotemporal trajectories. Our approach captures individual and population-level Patterns of Life (PoLs) and generates realistic trajectories that preserve baseline behaviors while incorporating stochastic variability by embedding probabilistic transitions within the Reeb graph structure. We present two variants: Sequential Reeb Graphs (SRGs) for individual agents and Hybrid Reeb Graphs (HRGs) that combine individual with population PoLs, evaluated on the Urban Anomalies and Geolife datasets using five mobility statistics. Results demonstrate that HRGs achieve strong fidelity across metrics while requiring modest trajectory datasets without specialized side information. This work establishes Markovian Reeb Graphs as a promising framework for trajectory simulation with broad applicability across urban environments.

[162] Are We Using the Right Benchmark: An Evaluation Framework for Visual Token Compression Methods

Chenfei Liao, Wensong Wang, Zichen Wen, Xu Zheng, Yiyu Wang, Haocong He, Yuanhuiyi Lyu, Lutao Jiang, Xin Zou, Yuqian Fu, Bin Ren, Linfeng Zhang, Xuming Hu

Main category: cs.CV

TL;DR: Simple image downsampling outperforms advanced visual token compression methods on current MLLM benchmarks, revealing these benchmarks contain noise for compression evaluation. The authors propose VTC-Bench, a new evaluation framework that uses downsampling as a discriminator to denoise existing benchmarks.

DetailsMotivation: Current MLLM benchmarks are designed for general perception/reasoning assessment, not specifically for evaluating visual token compression methods, creating a fundamental task mismatch. There's a need for proper evaluation frameworks that specifically address compression challenges.

Method: Conducted comprehensive empirical study across 8 popular benchmarks and multiple state-of-the-art compression techniques. Discovered that simple image downsampling outperforms advanced methods, suggesting current benchmarks contain task-irrelevant noise. Proposed VTC-Bench framework that uses downsampling as a discriminator to filter out noise and enable fairer compression evaluation.
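
The discriminator idea is easy to state procedurally: a sample is informative for compression evaluation only if the model solves it at full resolution but fails after aggressive downsampling. Everything below (the helper names and the correctness check) is a hypothetical sketch of that filtering rule, not the released VTC-Bench code.

```python
def filter_for_compression_eval(samples, model, answer_of, downsample):
    """Split benchmark samples into compression-sensitive vs. task-irrelevant noise.

    answer_of(model, image, question) -> predicted answer (hypothetical helper)
    downsample(image) -> aggressively downsampled image (e.g., 4x)
    """
    sensitive, noise = [], []
    for s in samples:
        full_ok = answer_of(model, s.image, s.question) == s.answer
        down_ok = answer_of(model, downsample(s.image), s.question) == s.answer
        # Informative: solvable with full detail, broken once detail is removed.
        (sensitive if full_ok and not down_ok else noise).append(s)
    return sensitive, noise
```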

Result: Found consistent phenomenon: image downsampling beats advanced compression methods across multiple benchmarks. Identified that current benchmarks contain substantial noise for compression evaluation. Showed downsampling can effectively distinguish between simple vs. difficult samples regarding compression sensitivity.

Conclusion: Current MLLM benchmarks are inadequate for evaluating visual token compression due to task mismatch and noise. The proposed VTC-Bench framework provides a fairer evaluation method by leveraging downsampling as a discriminator to denoise existing benchmarks.

Abstract: Recent efforts to accelerate inference in Multimodal Large Language Models (MLLMs) have largely focused on visual token compression. The effectiveness of these methods is commonly evaluated by measuring the accuracy drop on existing MLLM benchmarks before and after compression. However, these benchmarks are originally designed to assess general perception and reasoning abilities, rather than the specific challenges posed by visual token compression, leading to a fundamental task mismatch. In this work, we uncover a counterintuitive yet consistent phenomenon: simple image downsampling outperforms many advanced visual token compression methods across multiple widely used benchmarks. Through a comprehensive empirical study spanning eight popular benchmarks and multiple state-of-the-art compression techniques, we show that (i) current benchmarks contain substantial noise (task-irrelevant samples) for evaluating visual token compression, and (ii) downsampling can act as an effective data filter that distinguishes between simple and difficult samples with respect to compression sensitivity. Motivated by these findings, we propose VTC-Bench, an evaluation framework that explicitly leverages downsampling as a discriminator to denoise existing benchmarks, enabling a fairer and more meaningful additional assessment of visual token compression methods.

[163] A Multi-Stage Hybrid Framework for Automated Interpretation of Multi-View Engineering Drawings Using Vision Language Model

Muhammad Tayyab Khan, Zane Yong, Lequn Chen, Wenhe Feng, Nicholas Yew Jin Tan, Seung Ki Moon

Main category: cs.CV

TL;DR: A three-stage hybrid framework using detection models and vision language models for automated interpretation of 2D multi-view engineering drawings, achieving high accuracy in parsing both textual and quantitative information.

DetailsMotivation: Manual interpretation of complex engineering drawings is challenging due to varied layouts, orientations, and mixed symbolic-textual content. Existing OCR systems and traditional deep learning approaches struggle with these complexities, creating a need for automated solutions that can accurately parse engineering drawings for manufacturing communication.

Method: Three-stage hybrid framework: 1) YOLOv11-det for layout segmentation (views, title blocks, notes), 2) YOLOv11-obb for orientation-aware detection of annotations (measures, GD&T symbols, surface roughness), 3) Two Donut-based OCR-free VLMs for semantic parsing (Alphabetical VLM for text/categorical info, Numerical VLM for quantitative data).

Result: Alphabetical VLM achieved F1 score of 0.672 for textual/categorical information, Numerical VLM achieved 0.963 for quantitative data interpretation. The framework was trained on specialized datasets (1,000 drawings for layout detection, 1,406 for annotation-level training) and produces unified JSON output for CAD/manufacturing integration.

Conclusion: The proposed hybrid framework provides a scalable solution for intelligent engineering drawing analysis, effectively addressing the challenges of interpreting complex multi-view drawings with dense annotations through modern detection and vision language models.

Abstract: Engineering drawings are fundamental to manufacturing communication, serving as the primary medium for conveying design intent, tolerances, and production details. However, interpreting complex multi-view drawings with dense annotations remains challenging using manual methods, generic optical character recognition (OCR) systems, or traditional deep learning approaches, due to varied layouts, orientations, and mixed symbolic-textual content. To address these challenges, this paper proposes a three-stage hybrid framework for the automated interpretation of 2D multi-view engineering drawings using modern detection and vision language models (VLMs). In the first stage, YOLOv11-det performs layout segmentation to localize key regions such as views, title blocks, and notes. The second stage uses YOLOv11-obb for orientation-aware, fine-grained detection of annotations, including measures, GD&T symbols, and surface roughness indicators. The third stage employs two Donut-based, OCR-free VLMs for semantic content parsing: the Alphabetical VLM extracts textual and categorical information from title blocks and notes, while the Numerical VLM interprets quantitative data such as measures, GD&T frames, and surface roughness. Two specialized datasets were developed to ensure robustness and generalization: 1,000 drawings for layout detection and 1,406 for annotation-level training. The Alphabetical VLM achieved an overall F1 score of 0.672, while the Numerical VLM reached 0.963, demonstrating strong performance in textual and quantitative interpretation, respectively. The unified JSON output enables seamless integration with CAD and manufacturing databases, providing a scalable solution for intelligent engineering drawing analysis.

[164] UrbanIng-V2X: A Large-Scale Multi-Vehicle, Multi-Infrastructure Dataset Across Multiple Intersections for Cooperative Perception

Karthikeyan Chandra Sekaran, Markus Geisler, Dominik Rößle, Adithya Mohan, Daniel Cremers, Wolfgang Utschick, Michael Botsch, Werner Huber, Torsten Schön

Main category: cs.CV

TL;DR: UrbanIng-V2X is the first large-scale multi-modal dataset for cooperative perception across multiple urban intersections, featuring synchronized vehicle and infrastructure sensors with comprehensive 3D annotations.

DetailsMotivation: Existing cooperative perception datasets are limited to single intersections or single vehicles, causing overfitting and misleading performance evaluation. There's a need for diverse traffic environments with multiple connected vehicles and infrastructure sensors across several intersections.

Method: Collected data from 3 urban intersections in Ingolstadt, Germany, with 34 temporally aligned and spatially calibrated sensor sequences (20 seconds each). Involves 2 vehicles and up to 3 infrastructure sensor poles per sequence, using 12 vehicle RGB cameras, 2 vehicle LiDARs, 17 infrastructure thermal cameras, and 12 infrastructure LiDARs.

Result: Created a dataset with approximately 712k annotated instances across 13 object classes, annotated at 10 Hz with 3D bounding boxes. Provides comprehensive evaluations using state-of-the-art cooperative perception methods.

Conclusion: UrbanIng-V2X addresses the gap in cooperative perception benchmarking by providing diverse multi-intersection data with vehicle-infrastructure collaboration, enabling more robust algorithm development and evaluation.

Abstract: Recent cooperative perception datasets have played a crucial role in advancing smart mobility applications by enabling information exchange between intelligent agents, helping to overcome challenges such as occlusions and improving overall scene understanding. While some existing real-world datasets incorporate both vehicle-to-vehicle and vehicle-to-infrastructure interactions, they are typically limited to a single intersection or a single vehicle. A comprehensive perception dataset featuring multiple connected vehicles and infrastructure sensors across several intersections remains unavailable, limiting the benchmarking of algorithms in diverse traffic environments. Consequently, overfitting can occur, and models may demonstrate misleadingly high performance due to similar intersection layouts and traffic participant behavior. To address this gap, we introduce UrbanIng-V2X, the first large-scale, multi-modal dataset supporting cooperative perception involving vehicles and infrastructure sensors deployed across three urban intersections in Ingolstadt, Germany. UrbanIng-V2X consists of 34 temporally aligned and spatially calibrated sensor sequences, each lasting 20 seconds. All sequences contain recordings from one of three intersections, involving two vehicles and up to three infrastructure-mounted sensor poles operating in coordinated scenarios. In total, UrbanIng-V2X provides data from 12 vehicle-mounted RGB cameras, 2 vehicle LiDARs, 17 infrastructure thermal cameras, and 12 infrastructure LiDARs. All sequences are annotated at a frequency of 10 Hz with 3D bounding boxes spanning 13 object classes, resulting in approximately 712k annotated instances across the dataset. We provide comprehensive evaluations using state-of-the-art cooperative perception methods and publicly release the codebase, dataset, HD map, and a digital twin of the complete data collection environment.

[165] DeepShield: Fortifying Deepfake Video Detection with Local and Global Forgery Analysis

Yinqi Cai, Jichang Li, Zhaolun Li, Weikai Chen, Rushi Lan, Xi Xie, Xiaonan Luo, Guanbin Li

Main category: cs.CV

TL;DR: DeepShield is a deepfake detection framework that balances local patch guidance and global forgery diversification to improve robustness against unseen manipulation techniques.

DetailsMotivation: Existing deepfake detectors perform well in in-domain scenarios but fail to generalize across diverse manipulation techniques due to reliance on forgery-specific artifacts, raising concerns about misuse for fraud and misinformation.

Method: DeepShield enhances CLIP-ViT encoder with two components: Local Patch Guidance (LPG) for spatiotemporal artifact modeling and patch-wise supervision, and Global Forgery Diversification (GFD) for domain feature augmentation using domain-bridging and boundary-expanding feature generation.

Result: DeepShield outperforms state-of-the-art methods in cross-dataset and cross-manipulation evaluations, achieving superior robustness against unseen deepfake attacks.

Conclusion: The integration of novel local and global analysis enables DeepShield to effectively detect diverse deepfakes with improved cross-domain adaptability, addressing generalization limitations of existing detectors.

Abstract: Recent advances in deep generative models have made it easier to manipulate face videos, raising significant concerns about their potential misuse for fraud and misinformation. Existing detectors often perform well in in-domain scenarios but fail to generalize across diverse manipulation techniques due to their reliance on forgery-specific artifacts. In this work, we introduce DeepShield, a novel deepfake detection framework that balances local sensitivity and global generalization to improve robustness across unseen forgeries. DeepShield enhances the CLIP-ViT encoder through two key components: Local Patch Guidance (LPG) and Global Forgery Diversification (GFD). LPG applies spatiotemporal artifact modeling and patch-wise supervision to capture fine-grained inconsistencies often overlooked by global models. GFD introduces domain feature augmentation, leveraging domain-bridging and boundary-expanding feature generation to synthesize diverse forgeries, mitigating overfitting and enhancing cross-domain adaptability. Through the integration of novel local and global analysis for deepfake detection, DeepShield outperforms state-of-the-art methods in cross-dataset and cross-manipulation evaluations, achieving superior robustness against unseen deepfake attacks. Code is available at https://github.com/lijichang/DeepShield.

[166] Intelligent Systems in Neuroimaging: Pioneering AI Techniques for Brain Tumor Detection

Md. Mohaiminul Islam, Md. Mofazzal Hossen, Maher Ali Rusho, Nahiyan Nazah Ridita, Zarin Tasnia Shanta, Md. Simanto Haider, Ahmed Faizul Haque Dhrubo, Md. Khurshid Jahan, Mohammad Abdul Qayum

Main category: cs.CV

TL;DR: Advanced AI models for brain tumor classification from MRI achieve 98.71% accuracy using Xception architecture, showing promise for clinical deployment.

DetailsMotivation: To enhance brain tumor diagnosis accuracy and clinical usability by applying advanced AI techniques to MRI classification, addressing the need for automated neuroimaging diagnostics.

Method: Combined custom convolutional models with pre-trained neural network architectures, testing on over 7,000 MRI images across four classes (glioma, meningioma, pituitary tumors, no-tumor), focusing on detection accuracy, computational efficiency, and generalization.

Result: Xception architecture achieved best performance with 98.71% testing accuracy and lowest validation loss, surpassing other tested models while demonstrating reduced computational complexity.

Conclusion: AI shows strong potential as a diagnostic tool for brain tumors, with Xception architecture offering high accuracy and computational efficiency suitable for real-world clinical deployment, promising future progress in automated neuroimaging.

Abstract: This study examines the application of advanced AI techniques to brain tumor classification from MRI, training the current best deep learning models to enhance diagnostic accuracy and usability in clinical practice. By combining custom convolutional models with pre-trained neural network architectures, our approach achieves strong performance in the classification of four classes: glioma, meningioma, pituitary tumors, and no-tumor cases. We assessed the models on a large dataset of over 7,000 MRI images, focusing on detection accuracy, computational efficiency, and generalization to unseen data. The results indicate that the Xception architecture surpasses all other architectures tested, obtaining a testing accuracy of 98.71% with the lowest validation loss. Beyond findings that demonstrate AI’s promise in brain tumor diagnosis, we further motivate real-world clinical deployment by reducing computational complexity. These results point toward substantial future progress in automated neuroimaging diagnostics.

[167] Coupled Physics-Gated Adaptation: Spatially Decoding Volumetric Photochemical Conversion in Complex 3D-Printed Objects

Maryam Eftekharifar, Churun Zhang, Jialiang Wei, Xudong Cao, Hossein Heidari

Main category: cs.CV

TL;DR: A framework for predicting photochemical conversion in 3D printed objects using multimodal fusion to model coupled optical and material physics from visual data.

DetailsMotivation: To enable virtual chemical characterization of 3D printed objects by predicting dense, non-visual volumetric physical properties from 3D visual data, eliminating the need for traditional post-print measurements.

Method: Proposes Coupled Physics-Gated Adaptation (C-PGA), a multimodal fusion architecture that uses sparse geometrical/process parameters as queries to dynamically gate and adapt dense visual features via FiLM, processing dual 3D visual streams from raw projection stacks and their diffusion-diffraction corrected counterparts.
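
FiLM conditioning is the load-bearing mechanism here: the sparse parameters are mapped to per-channel scale and shift that modulate the dense 3D features. A minimal sketch; the hidden width and layout are assumptions.

```python
import torch
import torch.nn as nn

class PhysicsGatedFiLM(nn.Module):
    """Sparse process parameters gate dense 3D visual features via FiLM."""
    def __init__(self, n_params, feat_ch, hidden=64):
        super().__init__()
        self.to_gamma_beta = nn.Sequential(
            nn.Linear(n_params, hidden), nn.ReLU(),
            nn.Linear(hidden, 2 * feat_ch),
        )

    def forward(self, feats, params):
        # feats: (B, C, D, H, W) from a 3D-CNN; params: (B, n_params) query.
        gamma, beta = self.to_gamma_beta(params).chunk(2, dim=-1)
        gamma = gamma[:, :, None, None, None]   # broadcast over the volume
        beta = beta[:, :, None, None, None]
        return gamma * feats + beta             # feature-wise linear modulation
```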

Result: Introduces a challenging new computer vision task and demonstrates a breakthrough approach using the largest-ever optically printed 3D specimen dataset with parametrically designed complex minimal surface structures.

Conclusion: The framework enables precise control over chemical conversion state and offers a significant advancement in virtual chemical characterization for 3D printed objects.

Abstract: We present a framework that pioneers the prediction of photochemical conversion in complex three-dimensionally printed objects, introducing a challenging new computer vision task: predicting dense, non-visual volumetric physical properties from 3D visual data. This approach leverages the largest-ever optically printed 3D specimen dataset, comprising a large family of parametrically designed complex minimal surface structures that have undergone terminal chemical characterisation. Conventional vision models are ill-equipped for this task, as they lack an inductive bias for the coupled, non-linear interactions of optical physics (diffraction, absorption) and material physics (diffusion, convection) that govern the final chemical state. To address this, we propose Coupled Physics-Gated Adaptation (C-PGA), a novel multimodal fusion architecture. Unlike standard concatenation, C-PGA explicitly models physical coupling by using sparse geometrical and process parameters (e.g., surface transport, print layer height) as a Query to dynamically gate and adapt the dense visual features via feature-wise linear modulation (FiLM). This mechanism spatially modulates dual 3D visual streams, extracted by parallel 3D-CNNs processing raw projection stacks and their diffusion-diffraction corrected counterparts, allowing the model to recalibrate its visual perception based on the physical context. This approach offers a breakthrough in virtual chemical characterisation, eliminating the need for traditional post-print measurements and enabling precise control over the chemical conversion state.

[168] MoE-Enhanced Multi-Domain Feature Selection and Fusion for Fast Map-Free Trajectory Prediction

Wenyi Xiong, Jian Chen, Ziheng Qi, Wenhua Chen

Main category: cs.CV

TL;DR: A novel map-free trajectory prediction method that adaptively filters redundant information across temporal, spatial, and frequency domains to improve prediction accuracy in complex driving scenarios.

DetailsMotivation: Existing trajectory prediction methods struggle with noisy observations and complex agent interactions, particularly in filtering redundant scene data and handling outliers, which impairs prediction accuracy in dynamic multi-agent environments.

Method: 1) MoE-based frequency domain filter to weight frequency components and suppress outlier noise; 2) Selective spatiotemporal attention module to reallocate weights across temporal nodes, temporal trends, and spatial nodes; 3) Multimodal decoder supervised by joint patch-level and point-level losses.
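
The first component, an MoE filter acting on frequency components of the observed trajectory, might look like the sketch below: experts are learnable per-frequency filters and a gate mixes them per trajectory. Note that `n_freq` must equal `T // 2 + 1` for the real FFT; all sizes are illustrative assumptions.

```python
import torch
import torch.nn as nn

class MoEFrequencyFilter(nn.Module):
    """Mixture-of-experts reweighting of a trajectory's frequency components."""
    def __init__(self, n_freq, n_experts=4):
        super().__init__()
        self.experts = nn.Parameter(torch.randn(n_experts, n_freq))  # per-expert filters
        self.gate = nn.Linear(2 * n_freq, n_experts)

    def forward(self, traj):
        # traj: (B, T, 2) observed (x, y) positions; n_freq == T // 2 + 1.
        spec = torch.fft.rfft(traj, dim=1)                # (B, F, 2) complex
        gate_in = spec.abs().flatten(1)                   # (B, 2F) magnitudes
        mix = torch.softmax(self.gate(gate_in), dim=-1)   # (B, n_experts)
        filt = torch.sigmoid(mix @ self.experts)          # (B, F) weights in (0, 1)
        # Attenuate outlier-dominated components, then return to the time domain.
        return torch.fft.irfft(spec * filt[..., None], n=traj.shape[1], dim=1)
```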

Result: Achieves competitive performance and low-latency inference on large-scale NuScenes and Argoverse datasets compared to recent methods.

Conclusion: The proposed map-free approach effectively handles complex interactive scenarios by adaptively eliminating redundant information and extracting discriminative features across multiple domains, enabling precise trajectory prediction in real-world driving environments.

Abstract: Trajectory prediction is crucial for the reliability and safety of autonomous driving systems, yet it remains a challenging task in complex interactive scenarios due to noisy trajectory observations and intricate agent interactions. Existing methods often struggle to filter redundant scene data for discriminative information extraction, directly impairing trajectory prediction accuracy, especially when handling outliers and dynamic multi-agent interactions. In response to these limitations, we present a novel map-free trajectory prediction method that adaptively eliminates redundant information and selects discriminative features across the temporal, spatial, and frequency domains, thereby enabling precise trajectory prediction in real-world driving environments. First, we design an MoE-based frequency-domain filter to adaptively weight distinct frequency components of observed trajectory data and suppress outlier-related noise; we then propose a selective spatiotemporal attention module that reallocates weights across temporal nodes (sequential dependencies), temporal trends (evolution patterns), and spatial nodes to extract salient information. Finally, our multimodal decoder, supervised by joint patch-level and point-level losses, generates reasonable and temporally consistent trajectories. Comprehensive experiments on the large-scale NuScenes and Argoverse datasets demonstrate that our method achieves competitive accuracy and low-latency inference compared with recently proposed methods.

[169] Hierarchy-Aware Multimodal Unlearning for Medical AI

Fengli Wu, Vaidehi Patil, Jaehong Yoon, Yue Zhang, Mohit Bansal

Main category: cs.CV

TL;DR: MedForget is a new benchmark for evaluating multimodal unlearning in medical AI, addressing hierarchical data structures, and CHIP is a training-free method that achieves better forgetting while preserving utility.

DetailsMotivation: Medical AI using MLLMs must comply with privacy regulations (HIPAA/GDPR) requiring data removal. Existing unlearning benchmarks don't reflect the hierarchical, multimodal nature of real medical data, limiting practical evaluation.

Method: 1) MedForget benchmark: Models hospital data as a nested structure for fine-grained multimodal unlearning evaluation. 2) CHIP method: Training-free, hierarchy-aware multimodal unlearning that removes target-specific weight subspaces while preserving sibling-shared information.
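
Since CHIP is described only at the level of removing target-specific weight subspaces while keeping sibling-shared directions, the sketch below shows one generic way such a projection could work (SVD subspaces plus orthogonalization); it is an assumed recipe, not the authors' algorithm:

```python
import torch

def chip_style_projection(W, target_acts, sibling_acts, k=8):
    """Assumed sketch of hierarchy-aware projection unlearning: remove the
    weight subspace spanned by target activations, minus the directions
    shared with sibling (retain) data."""
    # Principal directions of each group's activations, shape (d, k).
    Ut = torch.linalg.svd(target_acts.T, full_matrices=False)[0][:, :k]
    Us = torch.linalg.svd(sibling_acts.T, full_matrices=False)[0][:, :k]
    # Keep only the target-specific part: orthogonalize Ut against Us.
    Ut_spec = Ut - Us @ (Us.T @ Ut)
    Q, _ = torch.linalg.qr(Ut_spec)
    # Project the weight's input space away from the target-specific span.
    return W - (W @ Q) @ Q.T

W = torch.randn(256, 512)     # layer weight (out x in)
t = torch.randn(100, 512)     # activations on forget data
s = torch.randn(100, 512)     # activations on sibling (retain) data
W_unlearned = chip_style_projection(W, t, s)
```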

Result: Existing methods struggle with hierarchy-aware forgetting without degrading medical utility. CHIP achieves highest forget-retain performance gap across all hierarchy levels while maintaining competitive downstream utility compared to existing methods.

Conclusion: MedForget provides a practical, HIPAA-aligned benchmark for structured multimodal unlearning in medical data. CHIP offers an effective, general solution for hierarchy-aware forgetting that balances deletion with utility preservation.

Abstract: Pretrained Multimodal Large Language Models (MLLMs) are increasingly used in sensitive domains such as medical AI, where privacy regulations like HIPAA and GDPR require specific removal of individuals’ or institutions’ data. This motivates machine unlearning, which aims to remove the influence of target data from a trained model. However, existing unlearning benchmarks fail to reflect the hierarchical and multimodal structure of real-world medical data, limiting their ability to properly evaluate unlearning in practice. Therefore, we introduce MedForget, a hierarchy-aware multimodal unlearning benchmark that models hospital data as a nested structure, enabling fine-grained evaluation of multimodal unlearning across retain and forget splits. Experiments with current unlearning methods show that existing approaches struggle to achieve effective hierarchy-aware forgetting without degrading downstream medical utility. To address this limitation, we propose Cross-modal Hierarchy-Informed Projection for unlearning (CHIP), a training-free, hierarchy-aware multimodal unlearning method that deletes information by selectively removing target-specific weight subspaces while preserving sibling-shared information. Experiments show that CHIP achieves the highest forget-retain performance gap across all hierarchy levels while maintaining competitive downstream utility compared to existing methods. Overall, MedForget provides a practical, HIPAA-aligned benchmark for evaluating structured multimodal unlearning for medical data, and CHIP offers an effective and general solution for hierarchy-aware forgetting that balances deletion with utility.

[170] OmniView: An All-Seeing Diffusion Model for 3D and 4D View Synthesis

Xiang Fan, Sharath Girish, Vivek Ramanujan, Chaoyang Wang, Ashkan Mirzaei, Petr Sushko, Aliaksandr Siarohin, Sergey Tulyakov, Ranjay Krishna

Main category: cs.CV

TL;DR: OmniView is a unified diffusion model framework that generalizes across diverse 4D consistency tasks (novel view synthesis, text-to-video with camera control, image-to-video) by separately representing space, time, and view conditions for flexible combinations.

DetailsMotivation: Prior camera control approaches in diffusion models are fragmented, focusing on specific subsets of 4D consistency tasks and trained on disjoint data slices. There's a need for a unified framework that can handle multiple 4D tasks in a single model.

Method: Separately represents space, time, and view conditions to enable flexible combinations of these inputs. This allows the model to handle static/dynamic/multiview inputs, extrapolate trajectories, and create videos with full camera control from text or image prompts.

Result: Competitive with task-specific models across diverse benchmarks: improves image quality scores by up to 33% in multiview NVS (LLFF), 60% in dynamic NVS (Neural 3D Video), 20% in static camera control (RE-10K), and reduces camera trajectory errors by 4x in text-conditioned video generation.

Conclusion: OmniView demonstrates strong generalizability in a single model, showing the feasibility of a generalist 4D video model that can handle multiple camera control and 4D consistency tasks unified in one framework.

Abstract: Prior approaches injecting camera control into diffusion models have focused on specific subsets of 4D consistency tasks: novel view synthesis, text-to-video with camera control, image-to-video, amongst others. Therefore, these fragmented approaches are trained on disjoint slices of available 3D/4D data. We introduce OmniView, a unified framework that generalizes across a wide range of 4D consistency tasks. Our method separately represents space, time, and view conditions, enabling flexible combinations of these inputs. For example, OmniView can synthesize novel views from static, dynamic, and multiview inputs, extrapolate trajectories forward and backward in time, and create videos from text or image prompts with full camera control. OmniView is competitive with task-specific models across diverse benchmarks and metrics, improving image quality scores among camera-conditioned diffusion models by up to 33% on the multiview NVS LLFF dataset, 60% on the dynamic NVS Neural 3D Video benchmark, 20% in static camera control on RE-10K, and reducing camera trajectory errors by 4x in text-conditioned video generation. With strong generalizability in one model, OmniView demonstrates the feasibility of a generalist 4D video model. Project page is available at https://snap-research.github.io/OmniView/

[171] The Prism Hypothesis: Harmonizing Semantic and Pixel Representations via Unified Autoencoding

Weichen Fan, Haiwen Diao, Quan Wang, Dahua Lin, Ziwei Liu

Main category: cs.CV

TL;DR: The paper introduces the Prism Hypothesis showing semantic encoders capture low-frequency components (abstract meaning) while pixel encoders retain high-frequency details, and proposes Unified Autoencoding (UAE) to harmonize both in a single latent space.

DetailsMotivation: To understand the spectral characteristics of different encoders and establish a unifying perspective on how semantic and pixel encoders capture different frequency components of data, leading to the Prism Hypothesis.

Method: Systematic spectral analysis of semantic and pixel encoders, formulation of the Prism Hypothesis, and development of Unified Autoencoding (UAE) with a frequency-band modulator to harmonize semantic structure and pixel details.
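
A frequency-band modulator can be illustrated with an FFT radial mask that separates a latent into low-frequency (semantic) and high-frequency (detail) bands and rescales each; the cutoff and learnable gains below are assumptions, not the paper's design:

```python
import torch
import torch.nn as nn

class FrequencyBandModulator(nn.Module):
    """Assumed sketch of a frequency-band modulator: split a latent feature
    map into low/high bands with an FFT radial mask, then apply learnable
    per-band gains before transforming back."""
    def __init__(self, cutoff: float = 0.25):
        super().__init__()
        self.cutoff = cutoff
        self.low_gain = nn.Parameter(torch.ones(1))   # semantic band
        self.high_gain = nn.Parameter(torch.ones(1))  # detail band

    def forward(self, z):                              # z: (B, C, H, W)
        _, _, H, W = z.shape
        fy = torch.fft.fftfreq(H, device=z.device)[:, None]
        fx = torch.fft.fftfreq(W, device=z.device)[None, :]
        low = ((fy**2 + fx**2).sqrt() < self.cutoff).to(z.dtype)
        Z = torch.fft.fft2(z)
        Z = Z * (self.low_gain * low + self.high_gain * (1 - low))
        return torch.fft.ifft2(Z).real

z_mod = FrequencyBandModulator()(torch.randn(2, 16, 32, 32))
```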

Result: Extensive experiments on ImageNet and MS-COCO benchmarks show UAE achieves state-of-the-art performance by effectively unifying semantic abstraction and pixel-level fidelity in a single latent space.

Conclusion: The Prism Hypothesis provides a unifying framework for understanding encoder behavior through spectral analysis, and UAE successfully demonstrates how to harmonize semantic and pixel information via frequency modulation.

Abstract: Deep representations across modalities are inherently intertwined. In this paper, we systematically analyze the spectral characteristics of various semantic and pixel encoders. Interestingly, our study uncovers a highly inspiring and rarely explored correspondence between an encoder’s feature spectrum and its functional role: semantic encoders primarily capture low-frequency components that encode abstract meaning, whereas pixel encoders additionally retain high-frequency information that conveys fine-grained detail. This heuristic finding offers a unifying perspective that ties encoder behavior to its underlying spectral structure. We define it as the Prism Hypothesis, where each data modality can be viewed as a projection of the natural world onto a shared feature spectrum, just like a prism. Building on this insight, we propose Unified Autoencoding (UAE), a model that harmonizes semantic structure and pixel details via an innovative frequency-band modulator, enabling their seamless coexistence. Extensive experiments on ImageNet and MS-COCO benchmarks validate that our UAE effectively unifies semantic abstraction and pixel-level fidelity into a single latent space with state-of-the-art performance.

[172] Pretraining Frame Preservation in Autoregressive Video Memory Compression

Lvmin Zhang, Shengqu Cai, Muyang Li, Chong Zeng, Beijia Lu, Anyi Rao, Song Han, Gordon Wetzstein, Maneesh Agrawala

Main category: cs.CV

TL;DR: PFP is a neural network that compresses long videos into short contexts while preserving high-frequency details of individual frames, enabling efficient memory encoding for autoregressive video models.

DetailsMotivation: To enable long video generation and processing with low computational cost while maintaining frame-level detail fidelity, addressing the challenge of compressing temporal information without losing important visual details.

Method: PFP uses a neural network structure with an explicit pretraining objective to preserve high-frequency details of single frames at arbitrary temporal positions, compressing 20-second videos into ~5k-length contexts.

Result: The model successfully compresses videos while allowing random frame retrieval with perceptually preserved appearances, and can be fine-tuned as memory encoders for autoregressive video models with low context cost and minimal fidelity loss.

Conclusion: PFP provides an effective framework for video compression and memory encoding, with trade-offs in neural architecture designs that enable practical long-history video processing with reasonable computational efficiency.

Abstract: We present PFP, a neural network structure to compress long videos into short contexts, with an explicit pretraining objective to preserve the high-frequency details of single frames at arbitrary temporal positions. The baseline model can compress a 20-second video into a context at about 5k length, where random frames can be retrieved with perceptually preserved appearances. Such pretrained models can be directly fine-tuned as memory encoders for autoregressive video models, enabling long history memory with low context cost and relatively low fidelity loss. We evaluate the framework with ablative settings and discuss the trade-offs of possible neural architecture designs.

[173] CardioMOD-Net: A Modal Decomposition-Neural Network Framework for Diagnosis and Prognosis of HFpEF from Echocardiography Cine Loops

Andrés Bell-Navas, Jesús Garicano-Mena, Antonella Ausiello, Soledad Le Clainche, María Villalba-Orero, Enrique Lara-Pezzi

Main category: cs.CV

TL;DR: CardioMOD-Net is an AI framework that uses echocardiography videos to both diagnose HFpEF subtypes and predict disease onset time in preclinical mouse models.

DetailsMotivation: Current AI models for HFpEF only do binary detection in humans, lacking comorbidity-specific phenotyping and temporal prediction of disease progression. There's a need for early diagnosis and prognosis tools that can handle HFpEF's diverse origins and subclinical stages.

Method: Used mouse echocardiography videos from four groups (control, hyperglycemic, obesity, hypertension). Applied Higher Order Dynamic Mode Decomposition to extract temporal features from cine loops. Built a unified framework with Vision Transformers: one for multiclass diagnosis and another for regression to predict HFpEF onset age.
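
HODMD extends standard Dynamic Mode Decomposition (DMD) with time-delay embeddings; the sketch below shows the standard exact-DMD core it builds on, with an illustrative truncation rank:

```python
import numpy as np

def dmd_modes(frames, rank=10):
    """Minimal standard (exact) DMD sketch; HODMD, used in the paper,
    adds time-delay embedding on top of this. frames: (n_pixels, n_time)."""
    X, Y = frames[:, :-1], frames[:, 1:]          # consecutive snapshots
    U, s, Vh = np.linalg.svd(X, full_matrices=False)
    U, s, Vh = U[:, :rank], s[:rank], Vh[:rank]   # rank truncation
    A_tilde = U.T @ Y @ Vh.T @ np.diag(1.0 / s)   # reduced linear operator
    eigvals, w = np.linalg.eig(A_tilde)           # temporal dynamics
    modes = Y @ Vh.T @ np.diag(1.0 / s) @ w       # spatial DMD modes
    return eigvals, modes

video = np.random.rand(64 * 64, 120)              # flattened cine loop
eigvals, modes = dmd_modes(video)
```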

Result: Achieved 65% overall diagnostic accuracy across four groups, with all classes exceeding 50% accuracy. The prognostic module predicted HFpEF onset with a root-mean-square error of 21.72 weeks. The obesity and hypertension groups showed the most accurate predictions, with predicted onset closely matching the true distributions.

Conclusion: The unified framework successfully performs both multiclass phenotyping and continuous HFpEF onset prediction from single cine loops, even with limited data. This provides a foundation for integrating diagnostic and prognostic modeling in preclinical HFpEF research.

Abstract: Introduction: Heart failure with preserved ejection fraction (HFpEF) arises from diverse comorbidities and progresses through prolonged subclinical stages, making early diagnosis and prognosis difficult. Current echocardiography-based Artificial Intelligence (AI) models focus primarily on binary HFpEF detection in humans and do not provide comorbidity-specific phenotyping or temporal estimates of disease progression towards decompensation. We aimed to develop a unified AI framework, CardioMOD-Net, to perform multiclass diagnosis and continuous prediction of HFpEF onset directly from standard echocardiography cine loops in preclinical models. Methods: Mouse echocardiography videos from four groups were used: control (CTL), hyperglycaemic (HG), obesity (OB), and systemic arterial hypertension (SAH). Two-dimensional parasternal long-axis cine loops were decomposed using Higher Order Dynamic Mode Decomposition (HODMD) to extract temporal features for downstream analysis. A shared latent representation supported Vision Transformers, one for a classifier for diagnosis and another for a regression module for predicting the age at HFpEF onset. Results: Overall diagnostic accuracy across the four groups was 65%, with all classes exceeding 50% accuracy. Misclassifications primarily reflected early-stage overlap between OB or SAH and CTL. The prognostic module achieved a root-mean-square error of 21.72 weeks for time-to-HFpEF prediction, with OB and SAH showing the most accurate estimates. Predicted HFpEF onset closely matched true distributions in all groups. Discussion: This unified framework demonstrates that multiclass phenotyping and continuous HFpEF onset prediction can be obtained from a single cine loop, even under small-data conditions. The approach offers a foundation for integrating diagnostic and prognostic modelling in preclinical HFpEF research.

[174] IBISAgent: Reinforcing Pixel-Level Visual Reasoning in MLLMs for Universal Biomedical Object Referring and Segmentation

Yankai Jiang, Qiaoru Li, Binlu Xu, Haoran Sun, Chao Ding, Junting Dong, Yuxiang Cai, Xuhong Zhang, Jianwei Yin

Main category: cs.CV

TL;DR: IBISAgent is a novel agentic MLLM that reformulates medical image segmentation as a multi-step decision-making process using interleaved reasoning and text-based click actions, outperforming SOTA methods without architectural modifications.

DetailsMotivation: Existing medical MLLM segmentation approaches face two major challenges: 1) They introduce implicit segmentation tokens and require simultaneous fine-tuning of both MLLM and external pixel decoders, increasing catastrophic forgetting risk and limiting out-of-domain generalization; 2) Most methods rely on single-pass reasoning without iterative refinement capability, leading to suboptimal performance.

Method: IBISAgent reformulates segmentation as a vision-centric, multi-step decision-making process. It enables MLLMs to generate interleaved reasoning and text-based click actions, invoke segmentation tools, and produce high-quality masks without architectural modifications. The approach uses a two-stage training framework: cold-start supervised fine-tuning followed by agentic reinforcement learning with tailored, fine-grained rewards.

Result: Extensive experiments demonstrate that IBISAgent consistently outperforms both closed-source and open-source state-of-the-art methods in medical referring and reasoning segmentation tasks.

Conclusion: IBISAgent successfully addresses limitations of existing approaches by enabling iterative refinement through multi-step visual reasoning on masked image features, promoting pixel-level visual reasoning capabilities while avoiding catastrophic forgetting and improving generalization. The method will be publicly released with datasets, code, and trained models.

Abstract: Recent research on medical MLLMs has gradually shifted its focus from image-level understanding to fine-grained, pixel-level comprehension. Although segmentation serves as the foundation for pixel-level understanding, existing approaches face two major challenges. First, they introduce implicit segmentation tokens and require simultaneous fine-tuning of both the MLLM and external pixel decoders, which increases the risk of catastrophic forgetting and limits generalization to out-of-domain scenarios. Second, most methods rely on single-pass reasoning and lack the capability to iteratively refine segmentation results, leading to suboptimal performance. To overcome these limitations, we propose a novel agentic MLLM, named IBISAgent, that reformulates segmentation as a vision-centric, multi-step decision-making process. IBISAgent enables MLLMs to generate interleaved reasoning and text-based click actions, invoke segmentation tools, and produce high-quality masks without architectural modifications. By iteratively performing multi-step visual reasoning on masked image features, IBISAgent naturally supports mask refinement and promotes the development of pixel-level visual reasoning capabilities. We further design a two-stage training framework consisting of cold-start supervised fine-tuning and agentic reinforcement learning with tailored, fine-grained rewards, enhancing the model’s robustness in complex medical referring and reasoning segmentation tasks. Extensive experiments demonstrate that IBISAgent consistently outperforms both closed-source and open-source SOTA methods. All datasets, code, and trained models will be released publicly.

[175] Revisiting the Ordering of Channel and Spatial Attention: A Comprehensive Study on Sequential and Parallel Designs

Zhongming Liu, Bingbing Jiang

Main category: cs.CV

TL;DR: This paper systematically analyzes fusion strategies for channel and spatial attention mechanisms, discovering data-scale dependent performance patterns and providing practical guidelines for attention module design.

DetailsMotivation: Current research on fusing channel and spatial attention mechanisms lacks systematic analysis and unified principles, with selection processes being largely empirical rather than guided by systematic evaluation.

Method: Built an evaluation suite of 18 attention topologies across four classes (sequential, parallel, multi-scale, residual) and systematically compared them across two vision and nine medical datasets under a unified framework.
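
The sequential and parallel classes can be made concrete with CBAM-style channel and spatial attention blocks; the blocks below are a common baseline design used for illustration, and the paper's 18 topologies may differ in detail:

```python
import torch
import torch.nn as nn

class ChannelAttn(nn.Module):
    """Squeeze channel statistics, then reweight channels."""
    def __init__(self, c, r=8):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(c, c // r), nn.ReLU(),
                                 nn.Linear(c // r, c))
    def forward(self, x):                          # x: (B, C, H, W)
        w = self.mlp(x.mean(dim=(2, 3))).sigmoid()
        return x * w[:, :, None, None]

class SpatialAttn(nn.Module):
    """Pool over channels, then reweight spatial positions."""
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size=7, padding=3)
    def forward(self, x):
        pooled = torch.cat([x.mean(1, keepdim=True),
                            x.amax(1, keepdim=True)], dim=1)
        return x * self.conv(pooled).sigmoid()

x = torch.randn(2, 32, 16, 16)
ca, sa = ChannelAttn(32), SpatialAttn()
seq_cs = sa(ca(x))              # sequential: channel -> spatial
seq_sc = ca(sa(x))              # sequential: spatial -> channel
par = 0.5 * (ca(x) + sa(x))     # parallel with fixed (non-learnable) fusion
```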

Result: Discovered a “data scale-method-performance” coupling law: few-shot tasks favor “Channel-Multi-scale Spatial” cascaded structures; medium-scale tasks prefer parallel learnable fusion; large-scale tasks benefit from parallel structures with dynamic gating. Also found “Spatial-Channel” order is better for fine-grained classification, and residual connections help with vanishing gradients.

Conclusion: The paper provides scenario-based guidelines for building future attention modules based on data scale and task requirements, offering systematic principles rather than empirical selection for attention fusion strategies.

Abstract: Attention mechanisms have become a core component of deep learning models, with Channel Attention and Spatial Attention being the two most representative architectures. Current research on their fusion strategies primarily bifurcates into sequential and parallel paradigms, yet the selection process remains largely empirical, lacking systematic analysis and unified principles. We systematically compare channel-spatial attention combinations under a unified framework, building an evaluation suite of 18 topologies across four classes: sequential, parallel, multi-scale, and residual. Across two vision and nine medical datasets, we uncover a “data scale-method-performance” coupling law: (1) in few-shot tasks, the “Channel-Multi-scale Spatial” cascaded structure achieves optimal performance; (2) in medium-scale tasks, parallel learnable fusion architectures demonstrate superior results; (3) in large-scale tasks, parallel structures with dynamic gating yield the best performance. Additionally, experiments indicate that the “Spatial-Channel” order is more stable and effective for fine-grained classification, while residual connections mitigate vanishing gradient problems across varying data scales. We thus propose scenario-based guidelines for building future attention modules. Code is open-sourced at https://github.com/DWlzm.

[176] The Spatial Blindspot of Vision-Language Models

Nahid Alam, Leema Krishna Murali, Siddhant Bharadwaj, Patrick Liu, Timothy Chung, Drishti Sharma, Akshata A, Kranthi Kiran, Wesley Tam, Bala Krishna S Vegesna

Main category: cs.CV

TL;DR: VLMs lack spatial reasoning due to CLIP-style image encoders flattening 2D structure; improving spatial awareness through alternative encoders and 2D positional encodings boosts spatial reasoning performance.

DetailsMotivation: Current VLMs built with CLIP-style image encoders discard 2D spatial structure by flattening images into 1D patch sequences, creating a blindspot in spatial relationship understanding. This limitation hinders applications requiring spatial grounding like robotics and embodied AI.

Method: Investigates two architectural approaches: (1) image encoders trained with alternative objectives beyond contrastive learning, and (2) incorporation of 2D positional encodings to preserve spatial structure.
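
For approach (2), a standard 2D sine-cosine positional embedding illustrates how row and column structure can be preserved for a patch grid; whether the paper uses this exact scheme is not stated, so treat it as a representative example:

```python
import torch

def sincos_2d_pos_embed(h, w, dim):
    """Standard 2D sinusoidal positional encoding for an h x w patch grid:
    half the channels encode row position, half encode column position,
    preserving the 2D structure that 1D flattening discards."""
    assert dim % 4 == 0
    def encode(pos, d):                            # pos: (N,) -> (N, d)
        omega = 1.0 / 10000 ** (torch.arange(d // 2) / (d // 2))
        angles = pos[:, None] * omega[None, :]
        return torch.cat([angles.sin(), angles.cos()], dim=1)
    ys, xs = torch.meshgrid(torch.arange(h, dtype=torch.float32),
                            torch.arange(w, dtype=torch.float32),
                            indexing="ij")
    return torch.cat([encode(ys.flatten(), dim // 2),
                      encode(xs.flatten(), dim // 2)], dim=1)  # (h*w, dim)

pos = sincos_2d_pos_embed(14, 14, 768)   # add to ViT patch embeddings
```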

Result: Experiments demonstrate that these architectural modifications lead to improved spatial reasoning performance across several benchmarks.

Conclusion: Spatial awareness is a missing dimension in VLM design, and addressing it through architectural improvements can overcome current bottlenecks in spatial reasoning capabilities.

Abstract: Vision-language models (VLMs) have advanced rapidly, but their ability to capture spatial relationships remains a blindspot. Current VLMs are typically built with contrastive language-image pretraining (CLIP) style image encoders. The training recipe often flattens images into 1D patch sequences, discarding the 2D structure necessary for spatial reasoning. We argue that this lack of spatial awareness is a missing dimension in VLM design and a bottleneck for applications requiring spatial grounding, such as robotics and embodied AI. To address this, we investigate (i) image encoders trained with alternative objectives and (ii) 2D positional encodings. Our experiments show that these architectural choices can lead to improved spatial reasoning on several benchmarks.

[177] Language-Guided and Motion-Aware Gait Representation for Generalizable Recognition

Zhengxian Wu, Chuanrui Zhang, Shenao Jiang, Hangrui Xu, Zirui Liao, Luyuan Zhang, Huaqiu Li, Peng Jiao, Haoqian Wang

Main category: cs.CV

TL;DR: LMGait introduces language-guided motion-aware gait recognition using natural language descriptions as semantic priors to better capture dynamic motion features and address static noise overfitting.

DetailsMotivation: Existing gait recognition methods overfit on static noise (clothing) while failing to capture dynamic motion regions, and struggle with intra-class variation where same-person gait features under different conditions become distant in feature space.

Method: Proposes LMGait framework with: 1) Natural language descriptions as semantic priors, 2) Motion Awareness Module (MAM) for cross-modal alignment refinement, 3) Motion Temporal Capture Module (MTCM) for enhanced discriminative capability and motion tracking.

Result: Achieved state-of-the-art accuracies: 88.5% on CCPG, 97.1% on SUSTech1K, and 97.5% on CASIA-B, demonstrating significant advantages over existing methods.

Conclusion: LMGait successfully addresses gait recognition challenges by introducing language guidance and motion awareness, improving feature discrimination and motion tracking while reducing static noise overfitting.

Abstract: Gait recognition is emerging as a promising technology and an innovative field within computer vision, with a wide range of applications in remote human identification. However, existing methods typically rely on complex architectures to directly extract features from images and apply pooling operations to obtain sequence-level representations. Such designs often lead to overfitting on static noise (e.g., clothing), while failing to effectively capture dynamic motion regions, such as the arms and legs. This bottleneck is particularly challenging in the presence of intra-class variation, where gait features of the same individual under different environmental conditions are significantly distant in the feature space. To address the above challenges, we present a Language-guided and Motion-aware gait recognition framework, named LMGait. To the best of our knowledge, LMGait is the first method to introduce natural language descriptions as explicit semantic priors into the gait recognition task. In particular, we utilize designed gait-related language cues to capture key motion features in gait sequences. To improve cross-modal alignment, we propose the Motion Awareness Module (MAM), which refines the language features by adaptively adjusting various levels of semantic information to ensure better alignment with the visual representations. Furthermore, we introduce the Motion Temporal Capture Module (MTCM) to enhance the discriminative capability of gait features and improve the model’s motion tracking ability. We conducted extensive experiments across multiple datasets, and the results demonstrate the significant advantages of our proposed network. Specifically, our model achieved accuracies of 88.5%, 97.1%, and 97.5% on the CCPG, SUSTech1K, and CASIA-B datasets, respectively, achieving state-of-the-art performance. Homepage: https://dingwu1021.github.io/LMGait/

[178] GazeD: Context-Aware Diffusion for Accurate 3D Gaze Estimation

Riccardo Catalini, Davide Di Nucci, Guido Borghi, Davide Davoli, Lorenzo Garattoni, Gianpiero Francesca, Yuki Kawana, Roberto Vezzani

Main category: cs.CV

TL;DR: GazeD is a diffusion-based method that jointly estimates 3D gaze and human pose from a single RGB image, achieving state-of-the-art performance by treating gaze as an additional body joint and leveraging scene context.

DetailsMotivation: Existing methods often struggle with 3D gaze estimation from single images due to ambiguity and uncertainty. The authors aim to leverage diffusion models' ability to handle uncertainty while exploiting the relationship between gaze and body pose for more accurate estimation.

Method: GazeD uses a diffusion model conditioned on the 2D pose, the subject's surroundings, and the scene context. It represents 3D gaze as an additional body joint at a fixed distance from the eyes, allowing joint denoising of gaze and pose during diffusion. The model generates multiple plausible hypotheses from a single RGB input.
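
The gaze-as-joint representation is simple to sketch: place a pseudo-joint along the gaze ray at a fixed distance from the eye center, and invert after denoising. The distance value below is a placeholder assumption:

```python
import numpy as np

GAZE_DIST = 0.5  # assumed fixed distance (metres); the paper's value may differ

def gaze_to_joint(eye_center, gaze_dir, dist=GAZE_DIST):
    """Represent 3D gaze as an extra 'body joint' a fixed distance from
    the eyes, so it can be denoised jointly with the pose."""
    d = gaze_dir / np.linalg.norm(gaze_dir)
    return eye_center + dist * d

def joint_to_gaze(eye_center, gaze_joint):
    """Recover the unit gaze direction from the denoised pseudo-joint."""
    v = gaze_joint - eye_center
    return v / np.linalg.norm(v)

eye = np.array([0.0, 1.6, 0.0])
joint = gaze_to_joint(eye, np.array([0.2, -0.1, 1.0]))
gaze = joint_to_gaze(eye, joint)
```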

Result: GazeD achieves state-of-the-art performance on three benchmark datasets for 3D gaze estimation, even surpassing methods that use temporal information, demonstrating superior accuracy from single images.

Conclusion: The joint modeling of gaze and pose through diffusion models effectively handles uncertainty in 3D gaze estimation from single RGB images, with the novel gaze representation as an additional body joint proving particularly effective for leveraging pose-gaze relationships.

Abstract: We introduce GazeD, a new 3D gaze estimation method that jointly provides 3D gaze and human pose from a single RGB image. Leveraging the ability of diffusion models to deal with uncertainty, it generates multiple plausible 3D gaze and pose hypotheses based on the 2D context information extracted from the input image. Specifically, we condition the denoising process on the 2D pose, the surroundings of the subject, and the context of the scene. With GazeD we also introduce a novel way of representing the 3D gaze by positioning it as an additional body joint at a fixed distance from the eyes. The rationale is that the gaze is usually closely related to the pose, and thus it can benefit from being jointly denoised during the diffusion process. Evaluations across three benchmark datasets demonstrate that GazeD achieves state-of-the-art performance in 3D gaze estimation, even surpassing methods that rely on temporal information. Project details will be available at https://aimagelab.ing.unimore.it/go/gazed.

[179] FantasyVLN: Unified Multimodal Chain-of-Thought Reasoning for Vision-Language Navigation

Jing Zuo, Lingzhou Mu, Fan Jiang, Chengcheng Ma, Mu Xu, Yonggang Qi

Main category: cs.CV

TL;DR: FantasyVLN is a unified implicit reasoning framework for Vision-and-Language Navigation that preserves Chain-of-Thought reasoning benefits without explicit token overhead, enabling real-time navigation with improved performance.

DetailsMotivation: Existing VLN approaches with Chain-of-Thought reasoning have critical drawbacks: textual CoTs lack spatial grounding and overfit to sparse annotations, while multimodal CoTs cause severe token inflation by generating imagined visual observations, making real-time navigation impractical.

Method: Proposes FantasyVLN with a pretrained Visual AutoRegressor (VAR) that encodes imagined visual tokens into a compact latent space during CoT training. Uses unified multi-CoT strategy where the model jointly learns from textual, visual, and multimodal CoT modes. At inference, performs direct instruction-to-action mapping while maintaining reasoning-aware representations.

Result: Extensive experiments on LH-VLN show reasoning-aware yet real-time navigation, improving success rates and efficiency while reducing inference latency by an order of magnitude compared to explicit CoT methods.

Conclusion: FantasyVLN provides a practical solution for VLN by enabling the benefits of CoT reasoning without the computational overhead, achieving human-like navigation reasoning in real-time applications.

Abstract: Achieving human-level performance in Vision-and-Language Navigation (VLN) requires an embodied agent to jointly understand multimodal instructions and visual-spatial context while reasoning over long action sequences. Recent works, such as NavCoT and NavGPT-2, demonstrate the potential of Chain-of-Thought (CoT) reasoning for improving interpretability and long-horizon planning. Moreover, multimodal extensions like OctoNav-R1 and CoT-VLA further validate CoT as a promising pathway toward human-like navigation reasoning. However, existing approaches face critical drawbacks: purely textual CoTs lack spatial grounding and easily overfit to sparse annotated reasoning steps, while multimodal CoTs incur severe token inflation by generating imagined visual observations, making real-time navigation impractical. In this work, we propose FantasyVLN, a unified implicit reasoning framework that preserves the benefits of CoT reasoning without explicit token overhead. Specifically, imagined visual tokens are encoded into a compact latent space using a pretrained Visual AutoRegressor (VAR) during CoT reasoning training, and the model jointly learns from textual, visual, and multimodal CoT modes under a unified multi-CoT strategy. At inference, our model performs direct instruction-to-action mapping while still enjoying reasoning-aware representations. Extensive experiments on LH-VLN show that our approach achieves reasoning-aware yet real-time navigation, improving success rates and efficiency while reducing inference latency by an order of magnitude compared to explicit CoT methods.

[180] Scribble-Supervised Medical Image Segmentation with Dynamic Teacher Switching and Hierarchical Consistency

Thanh-Huy Nguyen, Hoang-Loc Cao, Dat T. Chung, Mai-Anh Vu, Thanh-Minh Nguyen, Minh Le, Phat K. Huynh, Ulas Bagci

Main category: cs.CV

TL;DR: SDT-Net: A dual-teacher, single-student framework for scribble-supervised medical image segmentation that uses dynamic teacher switching and multi-level supervision to overcome annotation sparsity and boundary ambiguity.

DetailsMotivation: Scribble-supervised methods reduce annotation burden but suffer from sparse annotations causing ambiguity, noisy pseudo-label propagation, and poor anatomical boundary learning.

Method: Dual-teacher, single-student framework with Dynamic Teacher Switching (DTS) to select reliable teachers, Pick Reliable Pixels (PRP) for high-confidence pseudo-label refinement, and Hierarchical Consistency (HiCo) module for multi-level feature alignment.
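
A minimal sketch of how DTS and PRP could interact, assuming a mean-confidence switching criterion and a fixed confidence threshold (both assumptions; the paper's criteria may differ):

```python
import torch

def switch_teacher_and_pick_pixels(logits_a, logits_b, tau=0.9):
    """Assumed sketch: pick the teacher whose predictions are more confident
    on average (dynamic teacher switching), then keep only pixels whose max
    softmax probability exceeds a threshold as pseudo-labels (PRP)."""
    conf_a = logits_a.softmax(1).amax(1)      # (B, H, W) per-pixel confidence
    conf_b = logits_b.softmax(1).amax(1)
    use_a = conf_a.mean() >= conf_b.mean()    # teacher selection
    conf = conf_a if use_a else conf_b
    logits = logits_a if use_a else logits_b
    pseudo = logits.argmax(1)                 # (B, H, W) hard pseudo-labels
    reliable = conf > tau                     # reliable-pixel mask
    return pseudo, reliable

a = torch.randn(2, 4, 64, 64)                 # teacher A logits (4 classes)
b = torch.randn(2, 4, 64, 64)                 # teacher B logits
labels, mask = switch_teacher_and_pick_pixels(a, b)
```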

Result: State-of-the-art performance on ACDC and MSCMRseg datasets, producing more accurate and anatomically plausible segmentation results.

Conclusion: SDT-Net effectively addresses scribble annotation sparsity through adaptive teacher selection and multi-level supervision, achieving superior segmentation quality in medical imaging.

Abstract: Scribble-supervised methods have emerged to mitigate the prohibitive annotation burden in medical image segmentation. However, the inherent sparsity of these annotations introduces significant ambiguity, which results in noisy pseudo-label propagation and hinders the learning of robust anatomical boundaries. To address this challenge, we propose SDT-Net, a novel dual-teacher, single-student framework designed to maximize supervision quality from these weak signals. Our method features a Dynamic Teacher Switching (DTS) module to adaptively select the most reliable teacher. This selected teacher then guides the student via two synergistic mechanisms: high-confidence pseudo-labels, refined by a Pick Reliable Pixels (PRP) mechanism, and multi-level feature alignment, enforced by a Hierarchical Consistency (HiCo) module. Extensive experiments on the ACDC and MSCMRseg datasets demonstrate that SDT-Net achieves state-of-the-art performance, producing more accurate and anatomically plausible segmentation.

[181] Masked Modeling for Human Motion Recovery Under Occlusions

Zhiyin Qian, Siwei Zhang, Bharat Lal Bhatnagar, Federica Bogo, Siyu Tang

Main category: cs.CV

TL;DR: MoRo is a masked modeling framework for robust human motion reconstruction from monocular videos under occlusions, achieving real-time performance while handling missing observations better than existing methods.

DetailsMotivation: Human motion reconstruction from monocular videos is challenging under real-world occlusions. Existing methods have trade-offs: regression-based approaches are efficient but fragile to missing observations, while optimization/diffusion methods are robust but slow and require heavy preprocessing.

Method: MoRo uses masked modeling for occlusion-robust motion recovery. It employs a cross-modality learning scheme with three components: (1) trajectory-aware motion prior from MoCap datasets, (2) image-conditioned pose prior from image-pose datasets, and (3) video-conditioned masked transformer that fuses these priors, fine-tuned on video-motion datasets.

Result: MoRo substantially outperforms state-of-the-art methods in accuracy and motion realism under occlusions on EgoBody and RICH datasets, while performing on-par in non-occluded scenarios. It achieves real-time inference at 70 FPS on a single H200 GPU.

Conclusion: MoRo presents an effective occlusion-robust framework for human motion reconstruction that combines the efficiency of regression methods with the robustness of optimization approaches, enabled by masked modeling and cross-modality learning from heterogeneous datasets.

Abstract: Human motion reconstruction from monocular videos is a fundamental challenge in computer vision, with broad applications in AR/VR, robotics, and digital content creation, but remains challenging under frequent occlusions in real-world settings. Existing regression-based methods are efficient but fragile to missing observations, while optimization- and diffusion-based approaches improve robustness at the cost of slow inference speed and heavy preprocessing steps. To address these limitations, we leverage recent advances in generative masked modeling and present MoRo: Masked Modeling for human motion Recovery under Occlusions. MoRo is an occlusion-robust, end-to-end generative framework that formulates motion reconstruction as a video-conditioned task, and efficiently recovers human motion in a consistent global coordinate system from RGB videos. By masked modeling, MoRo naturally handles occlusions while enabling efficient, end-to-end inference. To overcome the scarcity of paired video-motion data, we design a cross-modality learning scheme that learns multi-modal priors from a set of heterogeneous datasets: (i) a trajectory-aware motion prior trained on MoCap datasets, (ii) an image-conditioned pose prior trained on image-pose datasets, capturing diverse per-frame poses, and (iii) a video-conditioned masked transformer that fuses motion and pose priors, finetuned on video-motion datasets to integrate visual cues with motion dynamics for robust inference. Extensive experiments on EgoBody and RICH demonstrate that MoRo substantially outperforms state-of-the-art methods in accuracy and motion realism under occlusions, while performing on-par in non-occluded scenarios. MoRo achieves real-time inference at 70 FPS on a single H200 GPU.

cs.AI

[182] When Agents Fail to Act: A Diagnostic Framework for Tool Invocation Reliability in Multi-Agent LLM Systems

Donghao Huang, Gauri Malwe, Zhaoxia Wang

Main category: cs.AI

TL;DR: A diagnostic framework for evaluating tool-use reliability in LLM-powered multi-agent systems, featuring a 12-category error taxonomy and systematic testing across models and hardware configurations to identify reliability thresholds for enterprise deployment.

DetailsMotivation: Multi-agent LLM systems are transforming enterprise automation, but systematic evaluation methodologies for assessing tool-use reliability remain underdeveloped, especially for SME-centric deployment in privacy-sensitive environments.

Method: Developed a comprehensive diagnostic framework with 12-category error taxonomy covering tool initialization, parameter handling, execution, and result interpretation. Systematically evaluated 1,980 deterministic test instances across open-weight models (Qwen2.5 series, Functionary) and proprietary alternatives (GPT-4, Claude 3.5/3.7) on diverse edge hardware configurations.

Result: Procedural reliability, particularly tool initialization failures, is the primary bottleneck for smaller models. Qwen2.5:32b achieves flawless performance matching GPT-4.1. Mid-sized models (qwen2.5:14b) offer practical accuracy-efficiency trade-offs on commodity hardware (96.6% success rate, 7.3s latency).

Conclusion: The framework establishes foundational infrastructure for systematic reliability evaluation of tool-augmented multi-agent AI systems, enabling cost-effective intelligent agent deployment for resource-constrained organizations with actionable reliability thresholds for production deployment.

Abstract: Multi-agent systems powered by large language models (LLMs) are transforming enterprise automation, yet systematic evaluation methodologies for assessing tool-use reliability remain underdeveloped. We introduce a comprehensive diagnostic framework that leverages big data analytics to evaluate procedural reliability in intelligent agent systems, addressing critical needs for SME-centric deployment in privacy-sensitive environments. Our approach features a 12-category error taxonomy capturing failure modes across tool initialization, parameter handling, execution, and result interpretation. Through systematic evaluation of 1,980 deterministic test instances spanning both open-weight models (Qwen2.5 series, Functionary) and proprietary alternatives (GPT-4, Claude 3.5/3.7) across diverse edge hardware configurations, we identify actionable reliability thresholds for production deployment. Our analysis reveals that procedural reliability, particularly tool initialization failures, constitutes the primary bottleneck for smaller models, while qwen2.5:32b achieves flawless performance matching GPT-4.1. The framework demonstrates that mid-sized models (qwen2.5:14b) offer practical accuracy-efficiency trade-offs on commodity hardware (96.6% success rate, 7.3 s latency), enabling cost-effective intelligent agent deployment for resource-constrained organizations. This work establishes foundational infrastructure for systematic reliability evaluation of tool-augmented multi-agent AI systems.

[183] SemanticALLI: Caching Reasoning, Not Just Responses, in Agentic Systems

Varun Chillara, Dylan Kline, Christopher Alvares, Evan Wooten, Huan Yang, Shlok Khetan, Cade Bauer, Tré Guillory, Tanishka Shah, Yashodhara Dhariwal, Volodymyr Pavlov, George Popstefanov

Main category: cs.AI

TL;DR: SemanticALLI is a pipeline-aware architecture that caches structured intermediate representations in agentic AI pipelines, achieving 83.10% cache hit rate by decomposing generation into analytic intent resolution and visualization synthesis stages.

DetailsMotivation: Agentic AI pipelines waste resources by repeatedly reconstructing identical intermediate logic (like metric normalization or chart scaffolding) even when user queries use novel phrasing. Traditional caching fails because it treats inference as a monolithic black box.

Method: SemanticALLI decomposes generation into two stages: Analytic Intent Resolution (AIR) and Visualization Synthesis (VS). It elevates structured intermediate representations (IRs) to first-class, cacheable artifacts within the Alli marketing intelligence platform.
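
The core design point, keying the cache on a canonicalized intermediate representation rather than on raw user phrasing, can be sketched in a few lines; the IR schema shown is hypothetical:

```python
import hashlib
import json

class IRCache:
    """Sketch of IR-level caching: keys come from a canonicalized
    intermediate representation, so novel wordings that resolve to the
    same analytic intent still hit the cache."""
    def __init__(self):
        self._store = {}

    @staticmethod
    def key(ir: dict) -> str:
        canonical = json.dumps(ir, sort_keys=True)   # phrasing-independent
        return hashlib.sha256(canonical.encode()).hexdigest()

    def get_or_compute(self, ir: dict, synthesize):
        k = self.key(ir)
        if k not in self._store:                     # call the LLM only on a miss
            self._store[k] = synthesize(ir)
        return self._store[k]

cache = IRCache()
ir = {"metric": "ctr", "group_by": "campaign", "chart": "bar"}  # hypothetical IR
spec = cache.get_or_compute(ir, lambda ir: f"<chart spec for {ir['metric']}>")
```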

Result: The structured approach achieved an 83.10% cache hit rate in the Visualization Synthesis stage, bypassing 4,023 LLM calls with a median latency of 2.66 ms. Baseline monolithic caching reached only a 38.7% hit rate due to linguistic variance.

Conclusion: Even when users rarely repeat themselves, AI pipelines often repeat work at stable, structured checkpoints where caching is most reliable. Decomposing agentic workflows and caching intermediate representations significantly reduces computational costs.

Abstract: Agentic AI pipelines suffer from a hidden inefficiency: they frequently reconstruct identical intermediate logic, such as metric normalization or chart scaffolding, even when the user’s natural language phrasing is entirely novel. Conventional boundary caching fails to capture this inefficiency because it treats inference as a monolithic black box. We introduce SemanticALLI, a pipeline-aware architecture within Alli (PMG’s marketing intelligence platform), designed to operationalize redundant reasoning. By decomposing generation into Analytic Intent Resolution (AIR) and Visualization Synthesis (VS), SemanticALLI elevates structured intermediate representations (IRs) to first-class, cacheable artifacts. The impact of caching within the agentic loop is substantial. In our evaluation, baseline monolithic caching caps at a 38.7% hit rate due to linguistic variance. In contrast, our structured approach allows for an additional stage, the Visualization Synthesis stage, to achieve an 83.10% hit rate, bypassing 4,023 LLM calls with a median latency of just 2.66 ms. This internal reuse reduces total token consumption, offering a practical lesson for AI system design: even when users rarely repeat themselves, the pipeline often does, at stable, structured checkpoints where caching is most reliable.

[184] DSGym: A Holistic Framework for Evaluating and Training Data Science Agents

Fan Nie, Junlin Wang, Harper Hua, Federico Bianchi, Yongchan Kwon, Zhenting Qi, Owen Queen, Shang Zhu, James Zou

Main category: cs.AI

TL;DR: DSGym is a standardized framework for evaluating and training data science agents with modular architecture, addressing limitations of existing benchmarks through rigorous data grounding and expanded task coverage.

DetailsMotivation: Existing data science benchmarks have fragmented evaluation interfaces, narrow task coverage, and lack rigorous data grounding - many tasks can be solved without using actual data, limiting their usefulness for evaluating real data science capabilities.

Method: Introduces DSGym framework with modular architecture for adding tasks, agent scaffolds, and tools. Includes DSGym-Tasks suite with quality-filtered existing benchmarks, plus DSBio (bioinformatics tasks from literature) and DSPredict (challenging prediction tasks across domains). Enables agent training via execution-verified data synthesis pipeline.

Result: Built a 2,000-example training set and trained a 4B model that outperforms GPT-4o on standardized analysis benchmarks, demonstrating DSGym’s effectiveness for both evaluation and training.

Conclusion: DSGym enables rigorous end-to-end measurement of whether agents can plan, implement, and validate data analyses in realistic scientific contexts, serving as a live, extensible testbed for advancing data science agent capabilities.

Abstract: Data science agents promise to accelerate discovery and insight-generation by turning data into executable analyses and findings. Yet existing data science benchmarks fall short due to fragmented evaluation interfaces that make cross-benchmark comparison difficult, narrow task coverage, and a lack of rigorous data grounding. In particular, we show that a substantial portion of tasks in current benchmarks can be solved without using the actual data. To address these limitations, we introduce DSGym, a standardized framework for evaluating and training data science agents in self-contained execution environments. Unlike static benchmarks, DSGym provides a modular architecture that makes it easy to add tasks, agent scaffolds, and tools, positioning it as a live, extensible testbed. We curate DSGym-Tasks, a holistic task suite that standardizes and refines existing benchmarks via quality and shortcut solvability filtering. We further expand coverage with (1) DSBio: expert-derived bioinformatics tasks grounded in literature and (2) DSPredict: challenging prediction tasks spanning domains such as computer vision, molecular prediction, and single-cell perturbation. Beyond evaluation, DSGym enables agent training via an execution-verified data synthesis pipeline. As a case study, we build a 2,000-example training set and trained a 4B model in DSGym that outperforms GPT-4o on standardized analysis benchmarks. Overall, DSGym enables rigorous end-to-end measurement of whether agents can plan, implement, and validate data analyses in realistic scientific contexts.

[185] Doc2AHP: Inferring Structured Multi-Criteria Decision Models via Semantic Trees with LLMs

Hongjia Wu, Shuai Zhou, Hongxin Zhang, Wei Chen

Main category: cs.AI

TL;DR: Doc2AHP bridges LLMs’ generalization with AHP’s rigor using structured inference, eliminating expert bottleneck and outperforming baselines in logical completeness and accuracy.

DetailsMotivation: LLMs struggle with structural consistency in complex decision-making, while classical decision theories like AHP require labor-intensive domain expertise, creating an "expert bottleneck" that limits scalability.

Method: Doc2AHP leverages AHP structural principles as constraints to guide LLMs in a constrained search within unstructured documents, with a multi-agent weighting mechanism and adaptive consistency optimization for numerical consistency.
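
The numerical-consistency condition such a pipeline must enforce is the classical AHP consistency check, which is well defined independently of the paper (Saaty's formulation): weights come from the principal eigenvector of the pairwise comparison matrix, and a consistency ratio below 0.1 is conventionally acceptable.

```python
import numpy as np

# Saaty's random consistency indices for matrices of size n.
RI = {1: 0.0, 2: 0.0, 3: 0.58, 4: 0.90, 5: 1.12, 6: 1.24, 7: 1.32}

def ahp_weights_and_cr(M):
    """Standard AHP computation: priority weights from the principal
    eigenvector, consistency index CI = (lambda_max - n) / (n - 1),
    consistency ratio CR = CI / RI[n]."""
    n = M.shape[0]
    eigvals, eigvecs = np.linalg.eig(M)
    i = np.argmax(eigvals.real)
    w = np.abs(eigvecs[:, i].real)
    w /= w.sum()                                  # priority weights
    ci = (eigvals[i].real - n) / (n - 1)          # consistency index
    return w, ci / RI[n]                          # consistency ratio

M = np.array([[1, 3, 5], [1/3, 1, 2], [1/5, 1/2, 1.0]])  # pairwise judgments
w, cr = ahp_weights_and_cr(M)                     # cr < 0.1 => consistent
```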

Result: Empirical results show Doc2AHP enables non-experts to construct high-quality decision models from scratch and significantly outperforms direct generative baselines in logical completeness and downstream task accuracy.

Conclusion: Doc2AHP successfully bridges the gap between LLMs’ generalization capabilities and decision theory rigor, providing a scalable framework for structured decision-making without extensive expert intervention.

Abstract: While Large Language Models (LLMs) demonstrate remarkable proficiency in semantic understanding, they often struggle to ensure structural consistency and reasoning reliability in complex decision-making tasks that demand rigorous logic. Although classical decision theories, such as the Analytic Hierarchy Process (AHP), offer systematic rational frameworks, their construction relies heavily on labor-intensive domain expertise, creating an “expert bottleneck” that hinders scalability in general scenarios. To bridge the gap between the generalization capabilities of LLMs and the rigor of decision theory, we propose Doc2AHP, a novel structured inference framework guided by AHP principles. Eliminating the need for extensive annotated data or manual intervention, our approach leverages the structural principles of AHP as constraints to direct the LLM in a constrained search within the unstructured document space, thereby enforcing the logical entailment between parent and child nodes. Furthermore, we introduce a multi-agent weighting mechanism coupled with an adaptive consistency optimization strategy to ensure the numerical consistency of weight allocation. Empirical results demonstrate that Doc2AHP not only empowers non-expert users to construct high-quality decision models from scratch but also significantly outperforms direct generative baselines in both logical completeness and downstream task accuracy.

[186] Mixture-of-Models: Unifying Heterogeneous Agents via N-Way Self-Evaluating Deliberation

Tims Pecerskis, Aivars Smirnovs

Main category: cs.AI

TL;DR: NSED protocol enables small model ensembles to match/exceed large SOTA models via dynamic expert selection and iterative refinement architecture.

DetailsMotivation: To overcome limitations of static Mixture-of-Experts (MoE) systems and enable small, consumer-grade models to achieve performance comparable to large 100B+ parameter models through efficient runtime orchestration.

Method: Uses Dynamic Expertise Broker for runtime model selection (Knapsack Problem variation), Macro-Scale RNN for iterative refinement with semantic forget gate, orchestration fabric for peer review, Quadratic Voting for consensus, and feedback-driven state updates.
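
Quadratic voting has a standard form (influence grows as the square root of spent credits), so the consensus step can be sketched generically; how NSED allocates credits is an assumption here:

```python
import math

def quadratic_vote(credit_allocations):
    """Illustrative quadratic-voting consensus (allocation scheme is an
    assumption): each agent spreads a credit budget over candidate answers;
    effective votes are the square root of spent credits, which damps any
    single agent's influence."""
    tally = {}
    for agent_credits in credit_allocations:          # one dict per agent
        for candidate, credits in agent_credits.items():
            tally[candidate] = tally.get(candidate, 0.0) + math.sqrt(credits)
    return max(tally, key=tally.get), tally

winner, tally = quadratic_vote([
    {"answer_a": 81, "answer_b": 19},   # agent 1: mostly confident in A
    {"answer_a": 25, "answer_b": 75},   # agent 2: leans B
    {"answer_a": 64, "answer_b": 36},   # agent 3: leans A
])
```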

Result: Small (<20B) model ensembles match/exceed 100B+ SOTA models on AIME 2025 and LiveCodeBench benchmarks, with improved safety (reduced sycophancy) on DarkBench suite.

Conclusion: NSED establishes new hardware efficiency frontier and demonstrates intrinsic alignment properties through peer-mediated correction, enabling high-performance AI with smaller models.

Abstract: This paper introduces the N-Way Self-Evaluating Deliberation (NSED) protocol, a Runtime Mixture-of-Models (MoM) architecture that constructs emergent composite models from a plurality of distinct expert agents. Unlike traditional Mixture-of-Experts (MoE), which rely on static gating networks, NSED employs a Dynamic Expertise Broker, a runtime optimization engine that treats model selection as a variation of the Knapsack Problem, binding heterogeneous checkpoints to functional roles based on live telemetry and cost constraints. At the execution layer, we formalize deliberation as a Macro-Scale Recurrent Neural Network (RNN), where the consensus state loops back through a semantic forget gate to enable iterative refinement without proportional VRAM scaling. Key components include an orchestration fabric for trustless N-to-N peer review, a Quadratic Voting activation function for non-linear consensus, and a feedback-driven state update. Empirical validation on challenging benchmarks (AIME 2025, LiveCodeBench) demonstrates that this topology allows ensembles of small (less than 20B) consumer-grade models to match or exceed the performance of state-of-the-art 100B+ parameter models, establishing a new hardware arbitrage efficiency frontier. Furthermore, testing on the DarkBench safety suite reveals intrinsic alignment properties, with peer-mediated correction reducing sycophancy scores below those of any individual agent.

[187] SycoEval-EM: Sycophancy Evaluation of Large Language Models in Simulated Clinical Encounters for Emergency Care

Dongshen Peng, Yi Wang, Carl Preiksaitis, Christian Rose

Main category: cs.AI

TL;DR: LLMs in clinical decision support are vulnerable to patient persuasion for inappropriate care, with acquiescence rates ranging 0-100% across 20 models in emergency medicine scenarios, showing static benchmarks fail to predict safety under social pressure.

DetailsMotivation: While LLMs show promise for clinical decision support, they risk acquiescing to patient pressure for inappropriate care, creating safety concerns that static benchmarks cannot adequately assess.

Method: Developed SycoEval-EM, a multi-agent simulation framework using adversarial patient persuasion in emergency medicine. Tested 20 LLMs across 1,875 encounters covering three Choosing Wisely scenarios (inappropriate imaging and opioid prescriptions).

Result: Acquiescence rates ranged from 0-100% across models. Models were more vulnerable to imaging requests (38.8%) than opioid prescriptions (25.0%). Model capability poorly predicted robustness. All persuasion tactics were equally effective (30.0-36.0%), indicating general susceptibility rather than tactic-specific weakness.

Conclusion: Static benchmarks inadequately predict LLM safety under social pressure, necessitating multi-turn adversarial testing for clinical AI certification to ensure robustness against patient persuasion.

Abstract: Large language models (LLMs) show promise in clinical decision support yet risk acquiescing to patient pressure for inappropriate care. We introduce SycoEval-EM, a multi-agent simulation framework evaluating LLM robustness through adversarial patient persuasion in emergency medicine. Across 20 LLMs and 1,875 encounters spanning three Choosing Wisely scenarios, acquiescence rates ranged from 0-100%. Models showed higher vulnerability to imaging requests (38.8%) than opioid prescriptions (25.0%), with model capability poorly predicting robustness. All persuasion tactics proved equally effective (30.0-36.0%), indicating general susceptibility rather than tactic-specific weakness. Our findings demonstrate that static benchmarks inadequately predict safety under social pressure, necessitating multi-turn adversarial testing for clinical AI certification.

[188] LLM is Not All You Need: A Systematic Evaluation of ML vs. Foundation Models for text and image based Medical Classification

Meet Raval, Tejul Pandit, Dhvani Upadhyay

Main category: cs.AI

TL;DR: Traditional ML models outperform both zero-shot LLMs/VLMs and fine-tuned foundation models across most medical classification tasks, with PEFT adaptation proving particularly ineffective.

DetailsMotivation: To rigorously benchmark the performance of traditional ML models versus contemporary transformer-based approaches (zero-shot LLMs/VLMs and fine-tuned foundation models) in medical classification tasks across text and image modalities.

Method: Used four public datasets covering text and image modalities (binary/multiclass). Evaluated three model classes: Classical ML (LR, LightGBM, ResNet-50), Prompt-Based LLMs/VLMs (Gemini 2.5), and Fine-Tuned PEFT Models (LoRA-adapted Gemma3 variants). All experiments used consistent data splits and aligned metrics.

Result: Traditional ML models achieved the best overall performance, especially on structured text datasets. LoRA-tuned Gemma variants performed worst across all experiments. Zero-shot Gemini 2.5 performed poorly on text tasks but was competitive on multiclass image classification, matching the ResNet-50 baseline.

Conclusion: Established ML models remain most reliable for medical categorization. Foundation models are not universally superior, and PEFT effectiveness depends heavily on adaptation strategy - minimal fine-tuning proved detrimental in this study.

Abstract: The combination of multimodal Vision-Language Models (VLMs) and Large Language Models (LLMs) opens up new possibilities for medical classification. This work offers a rigorous, unified benchmark by using four publicly available datasets covering text and image modalities (binary and multiclass complexity) that contrasts traditional Machine Learning (ML) with contemporary transformer-based techniques. We evaluated three model classes for each task: Classical ML (LR, LightGBM, ResNet-50), Prompt-Based LLMs/VLMs (Gemini 2.5), and Fine-Tuned PEFT Models (LoRA-adapted Gemma3 variants). All experiments used consistent data splits and aligned metrics. According to our results, traditional ML models set a high standard by consistently achieving the best overall performance across most medical categorization tasks. This was especially true for structured text-based datasets, where the classical models performed exceptionally well. In stark contrast, the LoRA-tuned Gemma variants consistently showed the worst performance across all text and image experiments, failing to generalize from the minimal fine-tuning provided. However, the zero-shot LLM/VLM pipelines (Gemini 2.5) had mixed results; they performed poorly on text-based tasks, but demonstrated competitive performance on the multiclass image task, matching the classical ResNet-50 baseline. These results demonstrate that in many medical categorization scenarios, established machine learning models continue to be the most reliable option. The experiment suggests that foundation models are not universally superior and that the effectiveness of Parameter-Efficient Fine-Tuning (PEFT) is highly dependent on the adaptation strategy, as minimal fine-tuning proved detrimental in this study.
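
The paper's LoRA setup is not specified beyond "LoRA-adapted Gemma3 variants". For context, the sketch below shows what a minimal configuration of that kind typically looks like with the Hugging Face peft library; the checkpoint name and hyperparameters are placeholders, not the authors' settings.

```python
# Illustrative only; requires: pip install transformers peft
from transformers import AutoModelForSequenceClassification
from peft import LoraConfig, get_peft_model

base = AutoModelForSequenceClassification.from_pretrained(
    "google/gemma-3-1b-pt",  # hypothetical checkpoint choice
    num_labels=2)
config = LoraConfig(
    r=8, lora_alpha=16, lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # adapt only attention projections
    task_type="SEQ_CLS",
)
model = get_peft_model(base, config)
model.print_trainable_parameters()  # typically under 1% of weights train
```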

[189] LUMINA: Long-horizon Understanding for Multi-turn Interactive Agents

Amin Rakhsha, Thomas Hehn, Pietro Mazzaglia, Fabio Valerio Massoli, Arash Behboodi, Tribhuvanesh Orekondy

Main category: cs.AI

TL;DR: The paper introduces an oracle counterfactual framework to measure the importance of different AI agent capabilities (planning, state tracking, long context processing) in multi-turn, long-horizon tasks using procedurally generated game-like environments.

DetailsMotivation: LLMs struggle with multi-turn, long-horizon agentic problems requiring planning, state tracking, and long context processing. The authors want to understand which of these underlying capabilities are most critical for success on such tasks.

Method: Developed an oracle counterfactual framework that measures how agent performance changes when given perfect assistance on specific tasks (e.g., perfect planning, flawless state tracking). Created procedurally generated, game-like tasks with tunable complexity to isolate contributions of each oracle without confounding effects.

Result: Planning interventions consistently improve performance across settings, while the usefulness of other skills (like state tracking) depends on environment properties and the specific language model being used.

Conclusion: The work provides insights into the challenges of multi-turn agentic environments and guides future development of AI agents and language models by identifying which capabilities are most critical for different types of tasks.

Abstract: Large language models can perform well on many isolated tasks, yet they continue to struggle on multi-turn, long-horizon agentic problems that require skills such as planning, state tracking, and long context processing. In this work, we aim to better understand the relative importance of advancing these underlying capabilities for success on such tasks. We develop an oracle counterfactual framework for multi-turn problems that asks: how would an agent perform if it could leverage an oracle to perfectly perform a specific task? The change in the agent’s performance due to this oracle assistance allows us to measure the criticality of such oracle skill in the future advancement of AI agents. We introduce a suite of procedurally generated, game-like tasks with tunable complexity. These controlled environments allow us to provide precise oracle interventions, such as perfect planning or flawless state tracking, and make it possible to isolate the contribution of each oracle without confounding effects present in real-world benchmarks. Our results show that while some interventions (e.g., planning) consistently improve performance across settings, the usefulness of other skills is dependent on the properties of the environment and language model. Our work sheds light on the challenges of multi-turn agentic environments to guide the future efforts in the development of AI agents and language models.
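
The core measurement is a performance delta: run the agent with and without an oracle that performs one skill perfectly, and attribute the gap to that skill. Here is a minimal sketch with a stubbed environment whose success probabilities are invented purely for illustration.

```python
import random

def run_episode(agent, task, oracle=None):
    """Stub environment: success is more likely when an oracle handles
    planning; the probabilities here are illustrative, not measured."""
    p = 0.3 + (0.4 if oracle == "perfect_planning" else 0.0)
    return 1 if random.random() < p else 0

def criticality(agent, tasks, oracle, n=2000):
    """Performance gain attributable to perfect execution of one skill."""
    base = sum(run_episode(agent, t) for t in tasks for _ in range(n))
    helped = sum(run_episode(agent, t, oracle) for t in tasks for _ in range(n))
    return (helped - base) / (n * len(tasks))

print(criticality("llm-agent", ["maze-1"], "perfect_planning"))  # ~0.4
```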

[190] AgentsEval: Clinically Faithful Evaluation of Medical Imaging Reports via Multi-Agent Reasoning

Suzhong Fu, Jingqi Dong, Xuan Ding, Rui Sun, Yiming Yang, Shuguang Cui, Zhen Li

Main category: cs.AI

TL;DR: AgentsEval is a multi-agent framework that evaluates medical imaging reports by emulating radiologists’ diagnostic workflow, providing structured clinical feedback and reasoning traces.

DetailsMotivation: Existing evaluation methods for medical imaging reports fail to capture structured diagnostic logic, resulting in unreliable judgments and limited clinical relevance. There's a need for clinically grounded assessment that reflects radiological reasoning.

Method: Multi-agent stream reasoning framework that divides evaluation into interpretable steps: criteria definition, evidence extraction, alignment, and consistency scoring. Uses a perturbation-based benchmark covering five medical report datasets with diverse imaging modalities.

Result: AgentsEval delivers clinically aligned, semantically faithful, and interpretable evaluations that remain robust under paraphrastic, semantic, and stylistic perturbations across diverse medical imaging domains.

Conclusion: The framework represents progress toward transparent and clinically grounded assessment of medical report generation systems, enabling trustworthy integration of large language models into clinical practice.

Abstract: Evaluating the clinical correctness and reasoning fidelity of automatically generated medical imaging reports remains a critical yet unresolved challenge. Existing evaluation methods often fail to capture the structured diagnostic logic that underlies radiological interpretation, resulting in unreliable judgments and limited clinical relevance. We introduce AgentsEval, a multi-agent stream reasoning framework that emulates the collaborative diagnostic workflow of radiologists. By dividing the evaluation process into interpretable steps including criteria definition, evidence extraction, alignment, and consistency scoring, AgentsEval provides explicit reasoning traces and structured clinical feedback. We also construct a multi-domain perturbation-based benchmark covering five medical report datasets with diverse imaging modalities and controlled semantic variations. Experimental results demonstrate that AgentsEval delivers clinically aligned, semantically faithful, and interpretable evaluations that remain robust under paraphrastic, semantic, and stylistic perturbations. This framework represents a step toward transparent and clinically grounded assessment of medical report generation systems, fostering trustworthy integration of large language models into clinical practice.

[191] LongCat-Flash-Thinking-2601 Technical Report

Meituan LongCat Team, Anchun Gui, Bei Li, Bingyang Tao, Bole Zhou, Borun Chen, Chao Zhang, Chao Zhang, Chen Gao, Chen Zhang, Chengcheng Han, Chenhui Yang, Chuyu Zhang, Cong Chen, Cunguang Wang, Daoru Pan, Defei Bu, Dengchang Zhao, Di Xiu, Dishan Liu, Dongyu Ru, Dunwei Tu, Fan Wu, Fengcheng Yuan, Fengcun Li, Gang Xu, Guanyu Wu, Guoyuan Lin, Haibin Wang, Hansi Yang, Hao Yang, Haonan Yan, Haoxiang Ma, Haoxing Wen, Hongyan Hao, Hongyin Tang, Hongyu Zang, Hongzhi Ni, Hui Su, Jiacheng Zhang, Jiahong Zhou, Jiahuan Li, Jiaming Wang, Jian Yang, Jianfei Zhang, Jianhao Xu, Jianing Wang, Jiapeng Zhu, Jiaqi Sun, Jiarong Shi, Jiarui Zhao, Jingang Wang, Jinluan Yang, Jinrui Ding, Jinwei Xiao, Jiyuan He, Juncan Xu, Kefeng Zhang, Keheng Wang, Li Wei, Lianhui Ma, Lin Qiu, Lingbing Kong, Lingchuan Liu, Linsen Guo, Mengshen Zhu, Mengxia Shen, Mingyang Zhu, Peiguang Li, Peng Pei, Pengcheng Jia, Pengtao Zhang, Peng Zhao, Qi Gu, Qiong Huang, Qiyuan Duan, Quanchi Weng, Rongxiang Weng, Rongzhi Zhang, Rumei Li, Shanglin Lei, Shengnan An, Shijun Dai, Shuaikang Liu, Shuang Zhou, Shuo Wang, Songyuan Zhao, Tao Liang, Tianhao Hu, Tianze Chen, Wei Liu, Wei Shi, Wei Wang, Weifeng Tang, Wenjie Shi, Wenlong Zhu, Wentao Chen, Wentao Shi, Xi Su, Xiangcheng Liu, Xiandi Ma, Xiangyu Xi, Xiangyuan Liu, Xiangzhou Huang, Xiao Liu, Xiaodong Cai, Xiaolong Chen, Xiaowei Shi, Xiaoyu Li, Xin Chen, Xingchen Liu, Xuan Huang, Xuezhi Cao, Xunliang Cai, Yan Chen, Yang Bai, Yang Liu, Yang Yang, Yang Zheng, Yaoming Wang, Yaoming Zhu, Yaqi Huo, Yanyu Chen, Yaorui Shi, Yerui Sun, Yi Zhang, Yihao Chen, Yi-Kai Zhang, Yifan Lu, Yifan Zhao, Yitao Zhai, Yongjing Yin, Yongwei Zhou, Youshao Xiao, Yuchuan Dai, Yuchen Xie, Yuchen Yu, Yufei Zhang, Yuhuai Wei, Yulei Qian, Yunfan Liang, Yunke Zhao, Yuwei Jiang, Yuxin Bian, Yuxin Chen, Yuxin Liu, Yue Xu, Yueqing Sun, Zeyang Yu, Zhao Yang, Zhengsheng Huang, Zhengyu Chen, Zhijian Liu, Zhikang Xia, Zhimin Lin, Zhiyuan Yao, Zhuofan Chen, Zhuowen Han, Zijian Zhang, Ziran Li, Ziwen Wang, Ziyuan Zhuang

Main category: cs.AI

TL;DR: LongCat-Flash-Thinking-2601 is a 560B parameter open-source MoE model with state-of-the-art agentic reasoning capabilities, featuring advanced training techniques and a Heavy Thinking mode for complex reasoning.

DetailsMotivation: To develop an open-source reasoning model with superior agentic capabilities that can handle complex tool interactions, noisy real-world environments, and demonstrate strong generalization across diverse domains.

Method: Uses a unified training framework with domain-parallel expert training and fusion, systematic environment scaling (10,000+ environments across 20+ domains), extended DORA asynchronous RL framework for stable multi-environment training, targeted noise pattern analysis for robustness, and Heavy Thinking mode for test-time scaling.

Result: Achieves state-of-the-art performance among open-source models on agentic benchmarks (search, tool use, tool-integrated reasoning), demonstrates strong generalization to complex tool interactions, and shows robust behavior in noisy real-world environments.

Conclusion: LongCat-Flash-Thinking-2601 represents a significant advancement in open-source agentic reasoning models, combining sophisticated training methodologies with practical robustness features for real-world applications.

Abstract: We introduce LongCat-Flash-Thinking-2601, a 560-billion-parameter open-source Mixture-of-Experts (MoE) reasoning model with superior agentic reasoning capability. LongCat-Flash-Thinking-2601 achieves state-of-the-art performance among open-source models on a wide range of agentic benchmarks, including agentic search, agentic tool use, and tool-integrated reasoning. Beyond benchmark performance, the model demonstrates strong generalization to complex tool interactions and robust behavior under noisy real-world environments. Its advanced capability stems from a unified training framework that combines domain-parallel expert training with subsequent fusion, together with an end-to-end co-design of data construction, environments, algorithms, and infrastructure spanning from pre-training to post-training. In particular, the model’s strong generalization capability in complex tool use is driven by our in-depth exploration of environment scaling and principled task construction. To optimize long-tailed, skewed generation and multi-turn agentic interactions, and to enable stable training across over 10,000 environments spanning more than 20 domains, we systematically extend our asynchronous reinforcement learning framework, DORA, for stable and efficient large-scale multi-environment training. Furthermore, recognizing that real-world tasks are inherently noisy, we conduct a systematic analysis and decomposition of real-world noise patterns, and design targeted training procedures to explicitly incorporate such imperfections into the training process, resulting in improved robustness for real-world applications. To further enhance performance on complex reasoning tasks, we introduce a Heavy Thinking mode that enables effective test-time scaling by jointly expanding reasoning depth and width through intensive parallel thinking.

[192] An Efficient Insect-inspired Approach for Visual Point-goal Navigation

Lu Yihe, Barbara Webb

Main category: cs.AI

TL;DR: Insect-inspired visual navigation agent achieves SOTA performance with dramatically lower computational cost.

DetailsMotivation: To develop a biologically-inspired navigation system that mimics insect brain structures for efficient visual point-goal navigation, drawing inspiration from insects' ability to learn and refine paths between food sources and nests.

Method: Combines abstracted models of two insect brain structures: one for associative learning and another for path integration. The approach is tested on the Habitat point-goal navigation benchmark and in more realistic simulated environments.

Result: The insect-inspired agent achieves performance comparable to recent state-of-the-art models while using many orders of magnitude less computation. The approach also demonstrates robustness to perturbations in realistic simulations.

Conclusion: Simple insect-inspired neural architectures can achieve competitive navigation performance with dramatically reduced computational requirements, offering efficient alternatives to complex deep learning models for visual navigation tasks.

Abstract: In this work we develop a novel insect-inspired agent for visual point-goal navigation. This combines abstracted models of two insect brain structures that have been implicated, respectively, in associative learning and path integration. We draw an analogy between the formal benchmark of the Habitat point-goal navigation task and the ability of insects to learn and refine visually guided paths around obstacles between a discovered food location and their nest. We demonstrate that the simple insect-inspired agent exhibits performance comparable to recent SOTA models at many orders of magnitude less computational cost. Testing in a more realistic simulated environment shows the approach is robust to perturbations.
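
One of the two insect brain structures modeled here supports path integration. Independent of the paper's implementation, the idea reduces to accumulating each step into a running displacement so that the direction and distance back to the nest are always available; a minimal sketch:

```python
import math

class PathIntegrator:
    """Minimal insect-style path integrator: accumulate egocentric steps
    into a home vector."""
    def __init__(self):
        self.x = self.y = 0.0

    def step(self, heading_rad: float, distance: float) -> None:
        self.x += distance * math.cos(heading_rad)
        self.y += distance * math.sin(heading_rad)

    def home_vector(self) -> tuple[float, float]:
        # Bearing and distance back to the start point.
        return math.atan2(-self.y, -self.x), math.hypot(self.x, self.y)

pi = PathIntegrator()
pi.step(0.0, 3.0)            # 3 m east
pi.step(math.pi / 2, 4.0)    # 4 m north
bearing, dist = pi.home_vector()
print(round(math.degrees(bearing), 1), round(dist, 1))  # -126.9 5.0
```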

[193] Reasoning Promotes Robustness in Theory of Mind Tasks

Ian B. de Haan, Peter van der Putten, Max van Duijn

Main category: cs.AI

TL;DR: Reasoning-oriented LLMs show improved robustness on Theory of Mind tasks, but gains appear to come from better solution-finding rather than fundamentally new reasoning capabilities.

DetailsMotivation: To examine whether reasoning-oriented LLMs trained via reinforcement learning with verifiable rewards (RLVR) demonstrate fundamentally new Theory of Mind capabilities or simply improved robustness in existing tasks.

Method: Used novel adaptations of machine psychological experiments and established benchmarks to test reasoning models on Theory of Mind tasks, analyzing their behavior under various prompt variations and task perturbations.

Result: Reasoning models consistently show increased robustness to prompt variations and task perturbations, but analysis suggests gains are due to improved solution-finding robustness rather than fundamentally new forms of ToM reasoning.

Conclusion: The observed improvements in reasoning models’ ToM performance are more likely attributable to enhanced robustness in finding correct solutions rather than the emergence of fundamentally new social-cognitive reasoning capabilities.

Abstract: Large language models (LLMs) have recently shown strong performance on Theory of Mind (ToM) tests, prompting debate about the nature and true performance of the underlying capabilities. At the same time, reasoning-oriented LLMs trained via reinforcement learning with verifiable rewards (RLVR) have achieved notable improvements across a range of benchmarks. This paper examines the behavior of such reasoning models in ToM tasks, using novel adaptations of machine psychological experiments and results from established benchmarks. We observe that reasoning models consistently exhibit increased robustness to prompt variations and task perturbations. Our analysis indicates that the observed gains are more plausibly attributed to increased robustness in finding the correct solution, rather than to fundamentally new forms of ToM reasoning. We discuss the implications of this interpretation for evaluating social-cognitive behavior in LLMs.

[194] MAGE-KT: Multi-Agent Graph-Enhanced Knowledge Tracing with Subgraph Retrieval and Asymmetric Fusion

Chi Yu, Hongyu Yuan, Zhiyi Duan

Main category: cs.AI

TL;DR: MAGE-KT is a multi-agent graph-enhanced knowledge tracing framework that addresses limitations in existing graph-based KT methods by better capturing inter-concept relations and avoiding attention diffusion through compact subgraph retrieval and asymmetric cross-attention fusion.

DetailsMotivation: Existing graph-based KT methods insufficiently explore inter-concept relations (often inferred only from interaction sequences) and suffer from computational inefficiency and noise due to full-graph encoding, which causes attention to bleed into irrelevant regions and degrades inter-KC relation fidelity.

Method: Constructs a multi-view heterogeneous graph combining a multi-agent KC relation extractor and student-question interaction graph. Conditions on target student’s history to retrieve compact, high-value subgraphs, then integrates them using an Asymmetric Cross-attention Fusion Module to enhance prediction while avoiding attention diffusion.

Result: Experiments on three widely used KT datasets show substantial improvements in KC-relation accuracy and clear gains in next-question prediction over existing methods.

Conclusion: MAGE-KT effectively addresses key challenges in graph-based KT by better representing relationships among students, questions, and KCs while maintaining computational efficiency through selective subgraph retrieval and fusion.

Abstract: Knowledge Tracing (KT) aims to model a student’s learning trajectory and predict performance on the next question. A key challenge is how to better represent the relationships among students, questions, and knowledge concepts (KCs). Recently, graph-based KT paradigms have shown promise for this problem. However, existing methods have not sufficiently explored inter-concept relations, often inferred solely from interaction sequences. In addition, the scale and heterogeneity of KT graphs make full-graph encoding both computationally costly and noise-prone, causing attention to bleed into student-irrelevant regions and degrading the fidelity of inter-KC relations. To address these issues, we propose a novel framework: Multi-Agent Graph-Enhanced Knowledge Tracing (MAGE-KT). It constructs a multi-view heterogeneous graph by combining a multi-agent KC relation extractor and a student-question interaction graph, capturing complementary semantic and behavioral signals. Conditioned on the target student’s history, it retrieves compact, high-value subgraphs and integrates them using an Asymmetric Cross-attention Fusion Module to enhance prediction while avoiding attention diffusion and irrelevant computation. Experiments on three widely used KT datasets show substantial improvements in KC-relation accuracy and clear gains in next-question prediction over existing methods.
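
The Asymmetric Cross-attention Fusion Module is not specified in detail here; a plausible reading is one-directional cross-attention in which the student's history attends over retrieved subgraph embeddings but not the reverse, so evidence is folded into the student state only. A PyTorch sketch under that assumption, with illustrative dimensions:

```python
import torch
import torch.nn as nn

class AsymmetricFusion(nn.Module):
    """One-directional cross-attention: history queries the subgraph."""
    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, history: torch.Tensor, subgraph: torch.Tensor):
        # history: (B, T, D) interaction sequence; subgraph: (B, N, D) nodes
        fused, _ = self.attn(query=history, key=subgraph, value=subgraph)
        return self.norm(history + fused)  # residual keeps the trajectory signal

fusion = AsymmetricFusion(dim=64)
out = fusion(torch.randn(2, 10, 64), torch.randn(2, 5, 64))
print(out.shape)  # torch.Size([2, 10, 64])
```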

[195] Preventing the Collapse of Peer Review Requires Verification-First AI

Lei You, Lele Cao, Iryna Gurevych

Main category: cs.AI

TL;DR: AI peer review should focus on verifying claims rather than mimicking human review, using AI as an adversarial auditor to prevent proxy gaming.

DetailsMotivation: Current AI-assisted peer review tools risk amplifying claim inflation by mimicking human scoring rather than focusing on truth verification. The paper identifies a phase transition where rational authors shift from truth-seeking to proxy optimization when verification capacity is overwhelmed.

Method: Proposes truth-coupling as the objective for review tools, formalizes verification pressure and signal shrinkage forces, and develops a minimal model mixing high-fidelity checks with frequent proxy judgment to derive coupling laws and incentive-collapse conditions.

Result: Derives explicit coupling law and incentive-collapse condition showing rational effort shifts from truth-seeking to proxy optimization even when decisions appear reliable. Identifies conditions for proxy-sovereign evaluation phase transition.

Conclusion: AI should be deployed as an adversarial auditor generating auditable verification artifacts to expand verification bandwidth, not as a score predictor that amplifies claim inflation. Tool builders and program chairs should adopt verification-first approaches.

Abstract: This paper argues that AI-assisted peer review should be verification-first rather than review-mimicking. We propose truth-coupling, i.e. how tightly venue scores track latent scientific truth, as the right objective for review tools. We formalize two forces that drive a phase transition toward proxy-sovereign evaluation: verification pressure, when claims outpace verification capacity, and signal shrinkage, when real improvements become hard to separate from noise. In a minimal model that mixes occasional high-fidelity checks with frequent proxy judgment, we derive an explicit coupling law and an incentive-collapse condition under which rational effort shifts from truth-seeking to proxy optimization, even when current decisions still appear reliable. These results motivate actions for tool builders and program chairs: deploy AI as an adversarial auditor that generates auditable verification artifacts and expands effective verification bandwidth, rather than as a score predictor that amplifies claim inflation.

[196] AgentDrive: An Open Benchmark Dataset for Agentic AI Reasoning with LLM-Generated Scenarios in Autonomous Systems

Mohamed Amine Ferrag, Abderrahmane Lakas, Merouane Debbah

Main category: cs.AI

TL;DR: AgentDrive is a comprehensive benchmark dataset with 300K LLM-generated driving scenarios and a 100K-question multiple-choice test for evaluating autonomous agents, addressing the lack of safety-critical benchmarks for agentic AI models.

DetailsMotivation: There's a growing need to integrate LLMs into autonomous systems for reasoning-driven perception, planning, and decision-making, but evaluating and training such agentic AI models is challenging due to the lack of large-scale, structured, and safety-critical benchmarks.

Method: Created AgentDrive with 300K LLM-generated driving scenarios formalized across seven orthogonal axes (scenario type, driver behavior, environment, road layout, objective, difficulty, traffic density) using an LLM-driven prompt-to-JSON pipeline. Also developed AgentDrive-MCQ, a 100K-question multiple-choice benchmark spanning five reasoning dimensions (physics, policy, hybrid, scenario, comparative reasoning).

Result: Evaluated 50 leading LLMs on AgentDrive-MCQ, finding that proprietary frontier models perform best in contextual and policy reasoning, while advanced open models are rapidly closing the gap in structured and physics-grounded reasoning.

Conclusion: AgentDrive provides a comprehensive benchmark for training, fine-tuning, and evaluating autonomous agents, with results showing the evolving landscape of LLM capabilities in autonomous driving reasoning tasks. The dataset, benchmark, and evaluation code are publicly released.

Abstract: The rapid advancement of large language models (LLMs) has sparked growing interest in their integration into autonomous systems for reasoning-driven perception, planning, and decision-making. However, evaluating and training such agentic AI models remains challenging due to the lack of large-scale, structured, and safety-critical benchmarks. This paper introduces AgentDrive, an open benchmark dataset containing 300,000 LLM-generated driving scenarios designed for training, fine-tuning, and evaluating autonomous agents under diverse conditions. AgentDrive formalizes a factorized scenario space across seven orthogonal axes: scenario type, driver behavior, environment, road layout, objective, difficulty, and traffic density. An LLM-driven prompt-to-JSON pipeline generates semantically rich, simulation-ready specifications that are validated against physical and schema constraints. Each scenario undergoes simulation rollouts, surrogate safety metric computation, and rule-based outcome labeling. To complement simulation-based evaluation, we introduce AgentDrive-MCQ, a 100,000-question multiple-choice benchmark spanning five reasoning dimensions: physics, policy, hybrid, scenario, and comparative reasoning. We conduct a large-scale evaluation of fifty leading LLMs on AgentDrive-MCQ. Results show that while proprietary frontier models perform best in contextual and policy reasoning, advanced open models are rapidly closing the gap in structured and physics-grounded reasoning. We release the AgentDrive dataset, AgentDrive-MCQ benchmark, evaluation code, and related materials at https://github.com/maferrag/AgentDrive
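
The pipeline validates LLM-generated specifications against schema constraints over the seven axes. A toy validator under assumed axis vocabularies follows; the released dataset defines the real ones.

```python
AXES = {  # hypothetical vocabularies for each of the seven axes
    "scenario_type": {"cut_in", "merge", "intersection"},
    "driver_behavior": {"aggressive", "cautious", "distracted"},
    "environment": {"clear", "rain", "fog"},
    "road_layout": {"highway", "urban", "roundabout"},
    "objective": {"reach_goal", "safe_stop"},
    "difficulty": {"easy", "medium", "hard"},
    "traffic_density": {"low", "medium", "high"},
}

def validate_scenario(spec: dict) -> list[str]:
    """Schema check: every axis must be present and take a value
    from its closed vocabulary."""
    errors = [f"missing axis: {a}" for a in AXES if a not in spec]
    errors += [f"bad value {spec[a]!r} for {a}"
               for a in AXES if a in spec and spec[a] not in AXES[a]]
    return errors

spec = {"scenario_type": "merge", "driver_behavior": "aggressive",
        "environment": "rain", "road_layout": "highway",
        "objective": "reach_goal", "difficulty": "hard",
        "traffic_density": "snowstorm"}
print(validate_scenario(spec))  # ["bad value 'snowstorm' for traffic_density"]
```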

[197] Spatial-Agent: Agentic Geo-spatial Reasoning with Scientific Core Concepts

Riyang Bao, Cheng Yang, Dazhou Yu, Zhexiang Tang, Gengchen Mai, Liang Zhao

Main category: cs.AI

TL;DR: Spatial-Agent: An AI agent that formalizes geospatial reasoning as concept transformation using GeoFlow Graphs, outperforming existing LLM-based approaches on geospatial benchmarks.

DetailsMotivation: Existing LLM-based agents fail at genuine geospatial computation, relying on web search or pattern matching while hallucinating spatial relationships, despite geospatial reasoning being essential for real-world applications like urban analytics, transportation planning, and disaster response.

Method: Formalizes geo-analytical question answering as a concept transformation problem using GeoFlow Graphs (directed acyclic graphs with spatial concept nodes and transformation edges). Draws on spatial information theory to extract spatial concepts, assign functional roles with ordering constraints, and compose transformation sequences through template-based generation.

Result: Extensive experiments on MapEval-API and MapQA benchmarks demonstrate that Spatial-Agent significantly outperforms existing baselines including ReAct and Reflexion, while producing interpretable and executable geospatial workflows.

Conclusion: Spatial-Agent provides a principled approach to geospatial reasoning grounded in spatial information science, enabling genuine geospatial computation rather than relying on web search or pattern matching, with superior performance and interpretability.

Abstract: Geospatial reasoning is essential for real-world applications such as urban analytics, transportation planning, and disaster response. However, existing LLM-based agents often fail at genuine geospatial computation, relying instead on web search or pattern matching while hallucinating spatial relationships. We present Spatial-Agent, an AI agent grounded in foundational theories of spatial information science. Our approach formalizes geo-analytical question answering as a concept transformation problem, where natural-language questions are parsed into executable workflows represented as GeoFlow Graphs – directed acyclic graphs with nodes corresponding to spatial concepts and edges representing transformations. Drawing on spatial information theory, Spatial-Agent extracts spatial concepts, assigns functional roles with principled ordering constraints, and composes transformation sequences through template-based generation. Extensive experiments on MapEval-API and MapQA benchmarks demonstrate that Spatial-Agent significantly outperforms existing baselines including ReAct and Reflexion, while producing interpretable and executable geospatial workflows.
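
A GeoFlow Graph is a DAG whose nodes are spatial concepts and whose edges are transformations, so executing a workflow amounts to evaluating nodes in topological order. The sketch below uses invented operators and toy data; the paper's real operator set comes from spatial information theory.

```python
import networkx as nx

# Toy workflow for: "Which homes lie within 500 m of a school?"
g = nx.DiGraph()
g.add_edge("schools", "school_buffers", op="buffer_500m")
g.add_edge("school_buffers", "homes_near_schools", op="spatial_join")
g.add_edge("homes", "homes_near_schools", op="spatial_join")

ops = {
    "buffer_500m": lambda pts: {p + "+buf" for p in pts},
    "spatial_join": lambda a, b: a | b,  # toy stand-in for a real join
}

def execute(graph, inputs):
    """Evaluate concept nodes in topological order; each derived node
    applies the transformation labelled on its incoming edges."""
    values = dict(inputs)
    for node in nx.topological_sort(graph):
        if node in values:
            continue
        preds = list(graph.predecessors(node))
        op = graph.edges[preds[0], node]["op"]
        values[node] = ops[op](*(values[p] for p in preds))
    return values

out = execute(g, {"schools": {"s1"}, "homes": {"h1", "h2"}})
print(out["homes_near_schools"])  # {'s1+buf', 'h1', 'h2'}
```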

[198] Empowering Medical Equipment Sustainability in Low-Resource Settings: An AI-Powered Diagnostic and Support Platform for Biomedical Technicians

Bernes Lorier Atabonfack, Ahmed Tahiru Issah, Mohammed Hardi Abdul Baaki, Clemence Ingabire, Tolulope Olusuyi, Maruf Adewole, Udunna C. Anazodo, Timothy X Brown

Main category: cs.AI

TL;DR: AI-powered platform helps biomedical technicians in LMICs diagnose and repair medical devices using LLM-guided troubleshooting and peer forums.

DetailsMotivation: Medical equipment in LMICs often becomes non-functional due to maintenance challenges, lack of technical expertise, and limited manufacturer support, leading to equipment downtime and compromised patient care.

Method: Developed an AI-powered support platform integrating a large language model with a web interface for inputting error codes/symptoms and receiving troubleshooting guidance, plus a global peer-to-peer discussion forum. Proof of concept tested on Philips HDI 5000 ultrasound machine.

Result: Achieved 100% precision in error code interpretation and 80% accuracy in suggesting corrective actions for the ultrasound machine proof of concept.

Conclusion: AI-driven systems show feasibility and potential to support medical device maintenance in resource-constrained environments, aiming to reduce equipment downtime and improve healthcare delivery.

Abstract: In low- and middle-income countries (LMICs), a significant proportion of medical diagnostic equipment remains underutilized or non-functional due to a lack of timely maintenance, limited access to technical expertise, and minimal support from manufacturers, particularly for devices acquired through third-party vendors or donations. This challenge contributes to increased equipment downtime, delayed diagnoses, and compromised patient care. This research explores the development and validation of an AI-powered support platform designed to assist biomedical technicians in diagnosing and repairing medical devices in real-time. The system integrates a large language model (LLM) with a user-friendly web interface, enabling imaging technologists/radiographers and biomedical technicians to input error codes or device symptoms and receive accurate, step-by-step troubleshooting guidance. The platform also includes a global peer-to-peer discussion forum to support knowledge exchange and provide additional context for rare or undocumented issues. A proof of concept was developed using the Philips HDI 5000 ultrasound machine, achieving 100% precision in error code interpretation and 80% accuracy in suggesting corrective actions. This study demonstrates the feasibility and potential of AI-driven systems to support medical device maintenance, with the aim of reducing equipment downtime to improve healthcare delivery in resource-constrained environments.

[199] Failures of Contingent Thinking

Evan Piermont, Peio Zuazo-Garin

Main category: cs.AI

TL;DR: The paper develops a behavioral framework for modeling agents’ subjective state-spaces and belief updating, showing how reducing uncertainty improves contingent thinking and introducing a novel updating process for realizing flaws in one’s own reasoning.

DetailsMotivation: To address the gap between modelers' and agents' subjective understandings of decision problems, and to formalize how agents' contingent thinking evolves through belief updating and self-realization of reasoning flaws.

Method: Develops a behavioral definition of perceived implications that uniquely identifies an agent’s subjective state-space representation, examines belief updating within this model, and analyzes cognitive demands of different dominance concepts.

Result: Formalizes the empirical finding that reducing uncertainty improves contingent thinking, proposes a novel updating mechanism for agents realizing flaws in their own thinking, and clarifies why state-by-state dominance is more cognitively demanding than obvious dominance.

Conclusion: The framework provides a rigorous behavioral foundation for modeling agents’ subjective decision representations and updating processes, offering insights into how contingent thinking evolves and why certain dominance relations are more cognitively challenging.

Abstract: We present a behavioral definition of an agent’s perceived implication that uniquely identifies a subjective state-space representing her view of a decision problem, and which may differ from the modeler’s. By examining belief updating within this model, we formalize the recent empirical consensus that reducing uncertainty improves contingent thinking, and propose a novel form of updating corresponding to the agent ‘realizing’ a flaw in her own thinking. Finally, we clarify the sense in which contingent thinking makes state-by-state dominance more cognitively demanding than obvious dominance.

[200] A Concept-Centric Approach to Multi-Modality Learning

Yuchong Geng, Ao Tang

Main category: cs.AI

TL;DR: A multi-modality learning framework with a modality-agnostic concept space that captures abstract knowledge, enabling efficient adaptation to new modalities and faster convergence compared to baselines.

DetailsMotivation: Inspired by human cognitive ability to efficiently acquire and apply knowledge across diverse modalities through a coherent world understanding, the authors aim to develop a learning system that operates more consistently with human cognitive processes.

Method: A concept-centric framework with two components: (1) a modality-agnostic concept space capturing structured, abstract knowledge, and (2) modality-specific projection models that map raw inputs onto this shared space. Projection models are trained independently but produce unified outputs within the shared concept space.

Result: The framework exhibits faster convergence compared to baseline models, enables efficient adaptation to new modalities, and achieves comparable results on downstream tasks with smaller training footprint, no task-specific fine-tuning, and interpretable inference within the shared concept space.

Conclusion: The framework represents a promising direction for developing learning systems that operate more consistently with human cognitive processes, offering modularity, interpretability, and efficient knowledge transfer across modalities.

Abstract: Humans possess a remarkable ability to acquire knowledge efficiently and apply it across diverse modalities through a coherent and shared understanding of the world. Inspired by this cognitive capability, we introduce a concept-centric multi-modality learning framework built around a modality-agnostic concept space that captures structured, abstract knowledge, alongside a set of modality-specific projection models that map raw inputs onto this shared space. The concept space is decoupled from any specific modality and serves as a repository of universally applicable knowledge. Once learned, the knowledge embedded in the concept space enables more efficient adaptation to new modalities, as projection models can align with existing conceptual representations rather than learning from scratch. This efficiency is empirically validated in our experiments, where the proposed framework exhibits faster convergence compared to baseline models. In addition, the framework’s modular design supports seamless integration of new modalities, since projection models are trained independently yet produce unified outputs within the shared concept space. We evaluate the framework on two representative downstream tasks. While the focus is not on task-specific optimization, the framework attains comparable results with a smaller training footprint, no task-specific fine-tuning, and inference performed entirely within a shared space of learned concepts that offers interpretability. These findings point toward a promising direction for developing learning systems that operate in a manner more consistent with human cognitive processes.
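
The paper does not state the training objective for its projection models; a common choice for aligning modality-specific encoders into a shared space is a CLIP-style contrastive loss, sketched below with illustrative dimensions. The key property matches the paper's claim: a new modality only needs a new projector, since the concept space itself is fixed.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

CONCEPT_DIM = 128  # shared, modality-agnostic concept space

class Projector(nn.Module):
    """Modality-specific projection onto the shared concept space."""
    def __init__(self, in_dim: int):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, 256), nn.ReLU(),
                                 nn.Linear(256, CONCEPT_DIM))
    def forward(self, x):
        return F.normalize(self.net(x), dim=-1)

image_proj, text_proj = Projector(512), Projector(300)

def alignment_loss(img_feats, txt_feats):
    """Pull paired inputs to the same concept, contrastively over a batch."""
    z_i, z_t = image_proj(img_feats), text_proj(txt_feats)
    logits = z_i @ z_t.T / 0.07
    targets = torch.arange(len(z_i))
    return F.cross_entropy(logits, targets)

loss = alignment_loss(torch.randn(8, 512), torch.randn(8, 300))
loss.backward()
```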

[201] Advances in Artificial Intelligence: A Review for the Creative Industries

Nantheera Anantrasirichai, Fan Zhang, David Bull

Main category: cs.AI

TL;DR: Systematic review of AI advances since 2022 (transformers, LLMs, diffusion models) in creative industries, covering generation, analysis, enhancement, compression, and assessment, with focus on unified frameworks and human-AI collaboration.

DetailsMotivation: Existing reviews haven't comprehensively addressed recent AI breakthroughs (since 2022) and their integrated impact across the creative production pipeline, creating a knowledge gap.

Method: Systematic review examining AI technologies that emerged or matured since 2022, analyzing applications across content creation, information analysis, post-production enhancement, compression, and quality assessment.

Result: Transformers, LLMs, diffusion models, and implicit neural representations have established new capabilities in text-to-image/video generation, real-time 3D reconstruction, and unified multi-task frameworks, shifting AI from support tool to core creative technology.

Conclusion: AI has become central to creative industries with unified frameworks replacing task-specific solutions, but human oversight remains essential. Future challenges include copyright concerns, bias mitigation, computational demands, and regulatory frameworks.

Abstract: Artificial intelligence (AI) has undergone transformative advances since 2022, particularly through generative AI, large language models (LLMs), and diffusion models, fundamentally reshaping the creative industries. However, existing reviews have not comprehensively addressed these recent breakthroughs and their integrated impact across the creative production pipeline. This paper addresses this gap by providing a systematic review of AI technologies that have emerged or matured since our 2022 review, examining their applications across content creation, information analysis, post-production enhancement, compression, and quality assessment. We document how transformers, LLMs, diffusion models, and implicit neural representations have established new capabilities in text-to-image/video generation, real-time 3D reconstruction, and unified multi-task frameworks, shifting AI from support tool to core creative technology. Beyond technological advances, we analyze the trend toward unified AI frameworks that integrate multiple creative tasks, replacing task-specific solutions. We critically examine the evolving role of human-AI collaboration, where human oversight remains essential for creative direction and mitigating AI hallucinations. Finally, we identify emerging challenges including copyright concerns, bias mitigation, computational demands, and the need for robust regulatory frameworks. This review provides researchers and practitioners with a comprehensive understanding of current AI capabilities, limitations, and future trajectories in creative applications.

[202] Efficient rule induction by ignoring pointless rules

Andrew Cropper, David M. Cerna

Main category: cs.AI

TL;DR: ILP approach identifies and ignores pointless rules (redundant literals or non-discriminative rules) to prune hypothesis space, reducing learning times by 99% while maintaining accuracy.

DetailsMotivation: Inductive logic programming (ILP) systems search for logical rules that generalize from examples, but this search space can be huge. Many generated rules are "pointless" - containing redundant literals or failing to discriminate negative examples - wasting computational resources without improving learning.

Method: Introduces an ILP approach that identifies pointless rules by detecting two types: (1) rules containing redundant literals, and (2) rules that cannot discriminate against negative examples. The system then ignores these pointless rules to soundly prune the hypothesis space.

Result: Experiments across multiple domains (including visual reasoning and game playing) show the approach reduces learning times by 99% while maintaining predictive accuracies comparable to standard ILP approaches.

Conclusion: Identifying and ignoring pointless rules allows for efficient hypothesis space pruning in ILP systems, dramatically reducing learning times without sacrificing predictive performance, making ILP more practical for complex domains.

Abstract: The goal of inductive logic programming (ILP) is to find a set of logical rules that generalises training examples and background knowledge. We introduce an ILP approach that identifies pointless rules. A rule is pointless if it contains a redundant literal or cannot discriminate against negative examples. We show that ignoring pointless rules allows an ILP system to soundly prune the hypothesis space. Our experiments on multiple domains, including visual reasoning and game playing, show that our approach can reduce learning times by 99% whilst maintaining predictive accuracies.
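
The two pointlessness conditions can be made concrete in a toy propositional form: a rule is pointless if it covers every negative example (it cannot discriminate) or if some body literal can be dropped without changing coverage (it is redundant). The real system checks redundancy logically against background knowledge; this sketch checks it only against the given examples.

```python
def covers(rule: frozenset, facts: set) -> bool:
    """A rule body covers an example whose facts include every literal."""
    return rule <= facts

def is_pointless(rule: frozenset, pos: list, neg: list) -> bool:
    """Toy versions of the two tests: non-discriminative, or containing
    a literal whose removal leaves coverage unchanged on all examples."""
    if all(covers(rule, n) for n in neg):
        return True
    for lit in rule:
        weaker = rule - {lit}
        if all(covers(weaker, e) == covers(rule, e) for e in pos + neg):
            return True
    return False

pos = [{"red", "square", "large"}]
neg = [{"red", "circle", "small"}]
print(is_pointless(frozenset({"red"}), pos, neg))              # True: covers the negative
print(is_pointless(frozenset({"square", "large"}), pos, neg))  # True: "large" adds nothing
print(is_pointless(frozenset({"square"}), pos, neg))           # False
```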

[203] Enhancing Study-Level Inference from Clinical Trial Papers via Reinforcement Learning-Based Numeric Reasoning

Massimiliano Pronesti, Michela Lorandi, Paul Flanagan, Oisin Redmond, Anya Belz, Yufang Hou

Main category: cs.AI

TL;DR: This paper proposes a quantitative reasoning approach for automating systematic reviews by extracting structured numerical evidence and applying domain logic, achieving significant improvements over retrieval-based methods and large LLMs.

DetailsMotivation: Current approaches to automating systematic reviews rely on shallow textual inference and fail to capture the numeric reasoning behind expert assessments, creating a bottleneck in evidence-based decision-making.

Method: Developed a numeric reasoning system with two components: (1) a numeric data extraction model trained using supervised fine-tuning and reinforcement learning with a new value reward model, and (2) an effect estimate component that applies domain-informed logic to derive conclusions from extracted evidence.

Result: On the CochraneForest benchmark, the RL-trained small-scale number extraction model achieved up to 21% absolute F1 improvement over retrieval-based systems and outperformed 400B+ parameter LLMs by up to 9% on the RCTs benchmark.

Conclusion: Quantitative reasoning approaches that extract structured numerical evidence and apply domain logic show promise for automating systematic evidence synthesis more accurately and interpretably than text-based inference methods.

Abstract: Systematic reviews in medicine play a critical role in evidence-based decision-making by aggregating findings from multiple studies. A central bottleneck in automating this process is extracting numeric evidence and determining study-level conclusions for specific outcomes and comparisons. Prior work has framed this problem as a textual inference task by retrieving relevant content fragments and inferring conclusions from them. However, such approaches often rely on shallow textual cues and fail to capture the underlying numeric reasoning behind expert assessments. In this work, we conceptualise the problem as one of quantitative reasoning. Rather than inferring conclusions from surface text, we extract structured numerical evidence (e.g., event counts or standard deviations) and apply domain-knowledge-informed logic to derive outcome-specific conclusions. We develop a numeric reasoning system composed of a numeric data extraction model and an effect estimate component, enabling more accurate and interpretable inference aligned with domain-expert principles. We train the numeric data extraction model using different strategies, including supervised fine-tuning (SFT) and reinforcement learning (RL) with a new value reward model. When evaluated on the CochraneForest benchmark, our best-performing approach – using RL to train a small-scale number extraction model – yields up to a 21% absolute improvement in F1 score over retrieval-based systems and outperforms general-purpose LLMs of over 400B parameters by up to 9% on the RCTs benchmark. Our results demonstrate the promise of reasoning-driven approaches for automating systematic evidence synthesis.
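
To make the effect estimate component concrete: once event counts are extracted, a study-level conclusion follows from standard meta-analytic arithmetic. Below is a sketch of a risk ratio with a 95% confidence interval on the log scale; this is textbook methodology, not necessarily the paper's exact estimator.

```python
import math

def risk_ratio(events_t: int, n_t: int, events_c: int, n_c: int):
    """Risk ratio from extracted counts, with a 95% CI computed on the
    log scale using the usual standard-error approximation."""
    rr = (events_t / n_t) / (events_c / n_c)
    se = math.sqrt(1/events_t - 1/n_t + 1/events_c - 1/n_c)
    lo, hi = (math.exp(math.log(rr) + s * 1.96 * se) for s in (-1, 1))
    return rr, (lo, hi)

# 12/100 events under treatment vs 24/100 under control:
rr, ci = risk_ratio(12, 100, 24, 100)
print(round(rr, 2), tuple(round(x, 2) for x in ci))
# 0.5 (0.26, 0.94): the CI excludes 1, favouring the treatment
```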

[204] Honey, I shrunk the hypothesis space (through logical preprocessing)

Andrew Cropper, Filipe Gouveia, David M. Cerna

Main category: cs.AI

TL;DR: The paper introduces a preprocessing approach that shrinks the hypothesis space in ILP by removing rules that cannot be in an optimal hypothesis, using background knowledge to identify impossible relationships.

DetailsMotivation: ILP systems search large hypothesis spaces, which can be computationally expensive. The goal is to reduce learning times by eliminating impossible hypotheses before the search begins.

Method: The approach uses background knowledge to identify rules that cannot be in any optimal hypothesis (e.g., “even numbers cannot be odd”) and removes them from the hypothesis space. Implementation uses answer set programming to shrink the hypothesis space for a constraint-based ILP system.

Result: Experiments across multiple domains (visual reasoning, game playing) show substantial reduction in learning times while maintaining predictive accuracy. In one case, learning time reduced from over 10 hours to just 2 seconds with only 10 seconds of preprocessing.

Conclusion: Preprocessing to shrink hypothesis spaces using background knowledge is an effective way to dramatically reduce ILP learning times without sacrificing accuracy, making ILP more practical for real-world applications.

Abstract: Inductive logic programming (ILP) is a form of logical machine learning. The goal is to search a hypothesis space for a hypothesis that generalises training examples and background knowledge. We introduce an approach that ‘shrinks’ the hypothesis space before an ILP system searches it. Our approach uses background knowledge to find rules that cannot be in an optimal hypothesis regardless of the training examples. For instance, our approach discovers relationships such as “even numbers cannot be odd” and “prime numbers greater than 2 are odd”. It then removes violating rules from the hypothesis space. We implement our approach using answer set programming and use it to shrink the hypothesis space of a constraint-based ILP system. Our experiments on multiple domains, including visual reasoning and game playing, show that our approach can substantially reduce learning times whilst maintaining predictive accuracies. For instance, given just 10 seconds of preprocessing time, our approach can reduce learning times from over 10 hours to only 2 seconds.
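
In toy form, the preprocessing idea reduces to deriving sets of mutually inconsistent literals from background knowledge and discarding any candidate rule whose body contains one. The paper performs this derivation automatically in answer set programming; in the sketch below the impossible pairs are written out by hand.

```python
def shrink(candidate_rules, impossible_sets):
    """Drop rules whose bodies contain literals that background knowledge
    proves can never hold together (e.g. even(X) with odd(X))."""
    return [r for r in candidate_rules
            if not any(s <= r for s in impossible_sets)]

impossible = [frozenset({"even(X)", "odd(X)"})]  # hand-written for illustration
rules = [frozenset({"even(X)", "odd(X)", "gt(X,2)"}),  # can never fire
         frozenset({"prime(X)", "gt(X,2)"})]
print(shrink(rules, impossible))  # only the second rule survives
```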

[205] J1-ENVS: An Interactive and Dynamic Legal Environment for LLM-based Agents

Zheng Jia, Shengbin Yue, Wei Chen, Siyuan Wang, Yidong Liu, Zejun Li, Yun Song, Zhongyu Wei

Main category: cs.AI

TL;DR: J1-ENVS is the first interactive legal environment for LLM agents with six Chinese legal scenarios across three complexity levels, plus J1-EVAL evaluation framework. Experiments show LLMs struggle with procedural execution despite solid legal knowledge.

DetailsMotivation: To address the gap between static benchmarks and dynamic real-world legal practice, enabling better advancement of legal intelligence for LLM-based agents.

Method: Created J1-ENVS interactive legal environment with six representative Chinese legal scenarios across three complexity levels, guided by legal experts. Developed J1-EVAL fine-grained evaluation framework to assess both task performance and procedural compliance across varying legal proficiency levels.

Result: Extensive experiments on 17 LLM agents reveal that while many models demonstrate solid legal knowledge, they struggle with procedural execution in dynamic settings. Even state-of-the-art GPT-4o falls short of 60% overall performance.

Conclusion: Persistent challenges remain in achieving dynamic legal intelligence, highlighting the need for continued research in this area. The findings provide valuable insights for guiding future work on legal AI systems.

Abstract: The gap between static benchmarks and the dynamic nature of real-world legal practice poses a key barrier to advancing legal intelligence. To this end, we introduce J1-ENVS, the first interactive and dynamic legal environment tailored for LLM-based agents. Guided by legal experts, it comprises six representative scenarios from Chinese legal practices across three levels of environmental complexity. We further introduce J1-EVAL, a fine-grained evaluation framework, designed to assess both task performance and procedural compliance across varying levels of legal proficiency. Extensive experiments on 17 LLM agents reveal that, while many models demonstrate solid legal knowledge, they struggle with procedural execution in dynamic settings. Even the SOTA model, GPT-4o, falls short of 60% overall performance. These findings highlight persistent challenges in achieving dynamic legal intelligence and offer valuable insights to guide future research.

[206] Learning Logical Rules using Minimum Message Length

Ruben Sharma, Sebastijan Dumančić, Ross D. King, Andrew Cropper

Main category: cs.AI

TL;DR: Bayesian inductive logic programming method learns minimum message length hypotheses from noisy data, outperforming previous approaches in data efficiency and handling imbalanced examples.

DetailsMotivation: To address the key challenge of unifying probabilistic and logical learning in AI, creating a method that can learn from noisy data while balancing hypothesis complexity and data fit.

Method: Introduces a Bayesian inductive logic programming approach that learns minimum message length hypotheses, using priors that favor more general programs and a likelihood that favors accurate programs.

Result: Significantly outperforms previous methods (notably minimum description length programs) in experiments across domains including game playing and drug design. Shows data efficiency and insensitivity to example balance, including learning from exclusively positive examples.

Conclusion: The Bayesian inductive logic programming approach successfully unifies probabilistic and logical learning, providing an effective method for learning from noisy data with strong performance across diverse domains.

Abstract: Unifying probabilistic and logical learning is a key challenge in AI. We introduce a Bayesian inductive logic programming approach that learns minimum message length hypotheses from noisy data. Our approach balances hypothesis complexity and data fit through priors, which favour more general programs, and a likelihood, which favours accurate programs. Our experiments on several domains, including game playing and drug design, show that our method significantly outperforms previous methods, notably those that learn minimum description length programs. Our results also show that our approach is data-efficient and insensitive to example balance, including the ability to learn from exclusively positive examples.
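
Minimum message length scoring is a two-part code: the bits needed to state the hypothesis plus the bits needed to state the data given the hypothesis. A simplified sketch follows; the paper's actual encoding of programs and errors is more refined than this.

```python
import math

def mml_score(rule_lengths, errors, n_examples):
    """Two-part message length (in bits): hypothesis cost plus the cost
    of identifying which examples the hypothesis misclassifies.
    Smaller is better."""
    hypothesis_bits = sum(rule_lengths)  # shorter programs cost less
    data_bits = errors * math.log2(max(n_examples, 2))
    return hypothesis_bits + data_bits

# A longer exact program vs a shorter program with one error on 1000 examples:
print(mml_score([40, 35], errors=0, n_examples=1000))  # 75.0 bits
print(mml_score([30], errors=1, n_examples=1000))      # ~40.0 bits -> preferred
```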

[207] Symmetry breaking for inductive logic programming

Andrew Cropper, David M. Cerna, Matti Järvisalo

Main category: cs.AI

TL;DR: Symmetry breaking in inductive logic programming reduces search time from hours to seconds by eliminating logically equivalent hypotheses.

DetailsMotivation: Inductive logic programming faces challenges searching vast hypothesis spaces due to many logically equivalent hypotheses, which creates redundant search and slows down the process.

Method: Introduces a method to break symmetries in the hypothesis space, implemented in answer set programming to eliminate logically equivalent hypotheses during search.

Result: Experiments across multiple domains (visual reasoning and game playing) show dramatic speed improvements, reducing solving times from over an hour to just 17 seconds.

Conclusion: Symmetry breaking is an effective approach for improving the efficiency of inductive logic programming by dramatically reducing search times through elimination of redundant hypothesis exploration.

Abstract: The goal of inductive logic programming is to search for a hypothesis that generalises training data and background knowledge. The challenge is searching vast hypothesis spaces, which is exacerbated because many logically equivalent hypotheses exist. To address this challenge, we introduce a method to break symmetries in the hypothesis space. We implement our idea in answer set programming. Our experiments on multiple domains, including visual reasoning and game playing, show that our approach can reduce solving times from over an hour to just 17 seconds.
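
The flavor of symmetry breaking shows up already in the simplest case: rule bodies that differ only in literal order are logically equivalent, so the search should keep one canonical representative per class. Real symmetries (for instance, variable renamings) are richer, and the paper encodes the constraints in answer set programming; this is only a toy Python illustration.

```python
def canonical(rule):
    """Canonical form for a rule body: sort literals so that bodies
    differing only in order map to the same representative."""
    return tuple(sorted(rule))

def break_symmetries(rules):
    """Keep one rule per equivalence class of reordered bodies."""
    seen, kept = set(), []
    for r in rules:
        key = canonical(r)
        if key not in seen:
            seen.add(key)
            kept.append(r)
    return kept

rules = [["p(X)", "q(X)"], ["q(X)", "p(X)"], ["p(X)"]]
print(break_symmetries(rules))  # [['p(X)', 'q(X)'], ['p(X)']]
```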

[208] Towards Open-World Retrieval-Augmented Generation on Knowledge Graph: A Multi-Agent Collaboration Framework

Jiasheng Xu, Mingda Li, Yongqiang Tang, Peijie Wang, Wensheng Zhang

Main category: cs.AI

TL;DR: AnchorRAG is a multi-agent framework for open-world RAG that eliminates the need for predefined anchor entities by using dynamic candidate identification and parallel multi-hop exploration.

DetailsMotivation: Current KG-based RAG approaches rely on predefined anchor entities, which limits robustness in open-world settings where accurate entity linking is unreliable. This creates a bottleneck for real-world applications.

Method: A multi-agent collaboration framework with: 1) Predictor agent that dynamically identifies candidate anchor entities by aligning query terms with KG nodes, 2) Retriever agents that conduct parallel multi-hop explorations from each candidate, and 3) Supervisor agent that formulates iterative retrieval strategies and synthesizes knowledge paths.

Result: Extensive experiments on four public benchmarks show AnchorRAG significantly outperforms existing baselines and establishes new state-of-the-art results on real-world reasoning tasks.

Conclusion: AnchorRAG addresses the anchor entity limitation in open-world RAG through multi-agent collaboration, improving retrieval robustness and mitigating ambiguous/erroneous anchor impacts, making it more practical for real-world applications.

Abstract: Large Language Models (LLMs) have demonstrated strong capabilities in web search and reasoning. However, their dependence on static training corpora makes them prone to factual errors and knowledge gaps. Retrieval-Augmented Generation (RAG) addresses this limitation by incorporating external knowledge sources, especially structured Knowledge Graphs (KGs), which provide explicit semantics and efficient retrieval. Existing KG-based RAG approaches, however, generally assume that anchor entities are accessible to initiate graph traversal, which limits their robustness in open-world settings where accurate linking between the user query and the KG entity is unreliable. To overcome this limitation, we propose AnchorRAG, a novel multi-agent collaboration framework for open-world RAG without the predefined anchor entities. Specifically, a predictor agent dynamically identifies candidate anchor entities by aligning user query terms with KG nodes and initializes independent retriever agents to conduct parallel multi-hop explorations from each candidate. Then a supervisor agent formulates the iterative retrieval strategy for these retriever agents and synthesizes the resulting knowledge paths to generate the final answer. This multi-agent collaboration framework improves retrieval robustness and mitigates the impact of ambiguous or erroneous anchors. Extensive experiments on four public benchmarks demonstrate that AnchorRAG significantly outperforms existing baselines and establishes new state-of-the-art results on the real-world reasoning tasks.
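
A toy rendering of the predictor/retriever division of labor: the predictor proposes every KG node that plausibly matches the query, deliberately keeping ambiguous candidates, and retriever agents explore from each candidate in parallel. The graph, matching rule, and hop policy below are all invented for illustration.

```python
from concurrent.futures import ThreadPoolExecutor

KG = {  # toy knowledge graph: node -> neighbours
    "Paris": ["France"], "France": ["Europe"], "Europe": [],
    "Paris_Hilton": ["USA"], "USA": [],
}

def predict_anchors(query):
    """Predictor agent (toy): every node matching a long-enough query term."""
    terms = [t for t in query.lower().replace("?", "").split() if len(t) > 3]
    return [n for n in KG if any(t in n.lower() for t in terms)]

def explore(anchor, hops=2):
    """Retriever agent: greedy multi-hop walk from one candidate anchor."""
    path, node = [anchor], anchor
    for _ in range(hops):
        if not KG[node]:
            break
        node = KG[node][0]
        path.append(node)
    return path

query = "Which continent is Paris in?"
anchors = predict_anchors(query)      # ['Paris', 'Paris_Hilton'] (ambiguous)
with ThreadPoolExecutor() as pool:    # parallel exploration per candidate
    paths = list(pool.map(explore, anchors))
print(paths)  # [['Paris', 'France', 'Europe'], ['Paris_Hilton', 'USA']]
# A supervisor agent would synthesise these paths into the final answer.
```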

[209] Proof-of-Use: Mitigating Tool-Call Hacking in Deep Research Agents

Shengjie Ma, Chenlong Deng, Jiaxin Mao, Jiadeng Huang, Teng Wang, Junjie Wu, Changwang Zhang, Jun Wang

Main category: cs.AI

TL;DR: PoU addresses tool-call hacking in RL agents by enforcing evidence grounding through stepwise citation and multi-objective rewards.

DetailsMotivation: RL agents can exploit weak observability of causal dependencies between retrieved evidence and reasoning, leading to tool-call hacking where they maximize rewards without genuinely using evidence.

Method: Proof-of-Use (PoU) framework with fine-grained stepwise interaction requiring auditable citation of normalized evidence identifiers, using multi-objective rewards: progressive process rewards for citation validity, Answer-Support Alignment reward for consistency, and adaptive reward mixing.
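
A minimal sketch of how such a multi-objective reward could be mixed is below. The component rewards (identifier validity, lexical answer-evidence overlap) and the linear annealing schedule are assumptions for illustration, not the paper's exact design.

```python
# Hedged sketch of a PoU-style reward with curriculum-style mixing.
def citation_validity_reward(cited_ids, retrieved_ids):
    """Process reward: fraction of cited evidence identifiers that actually exist."""
    if not cited_ids:
        return 0.0
    return sum(c in retrieved_ids for c in cited_ids) / len(cited_ids)

def answer_support_reward(answer_tokens, evidence_tokens):
    """Global alignment reward: crude lexical overlap between answer and evidence."""
    a, e = set(answer_tokens), set(evidence_tokens)
    return len(a & e) / max(len(a), 1)

def pou_reward(step, total_steps, cited_ids, retrieved_ids,
               answer_tokens, evidence_tokens, outcome_correct):
    """Anneal from dense process supervision to sparse outcome reward."""
    alpha = 1.0 - step / total_steps          # curriculum mixing coefficient
    process = (0.5 * citation_validity_reward(cited_ids, retrieved_ids)
               + 0.5 * answer_support_reward(answer_tokens, evidence_tokens))
    outcome = float(outcome_correct)
    return alpha * process + (1.0 - alpha) * outcome

print(pou_reward(step=100, total_steps=1000,
                 cited_ids=["E1", "E3"], retrieved_ids={"E1", "E2", "E3"},
                 answer_tokens=["paris"], evidence_tokens=["paris", "france"],
                 outcome_correct=True))
```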

Result: PoU shows strong performance in mitigating tool-call hacking and exhibits emergent adaptive tool-usage patterns under domain/tool shifts without explicit optimization for adaptation.

Conclusion: PoU effectively addresses tool-call hacking by optimizing causal dependencies from retrieval to reasoning through evidence-grounded RL with structured citation requirements.

Abstract: While reinforcement learning (RL) enhances deep research agents' ability to plan and reason across retrieval steps, we identify a critical failure mode in this setting: Tool-Call Hacking. Unlike execution-based tools (e.g., code or math), whose effects are directly observable, the weak observability of causal dependencies between retrieved evidence and reasoning under format- and outcome-level supervision enables agents to maximize surface-level reward signals without genuinely grounding their reasoning in the returned evidence. This leads to distinctive pathologies, including mode collapse via tool overuse and hallucinated tool usage where tool calls are largely decorative. To address this issue, we propose Proof-of-Use (PoU), an evidence-grounded RL framework that explicitly optimizes the causal dependency from retrieval to reasoning and final answers. PoU reformulates a fine-grained stepwise interaction protocol in which agents must auditably cite normalized evidence identifiers. We operationalize this via a multi-objective reward design consisting of: (1) two progressive process rewards that constrain citation validity at intermediate steps; (2) a global Answer–Support Alignment reward that enforces consistency between final answers and retrieved evidence; and (3) a curriculum-style adaptive reward mixing mechanism that smoothly transitions agents from dense process supervision to sparse outcome-based objectives. Extensive experiments show the strong performance of PoU and demonstrate the effectiveness in mitigating tool-call hacking. Beyond this, PoU exhibits a notable emergent property: adaptive and robust tool-usage patterns naturally arise under domain and tool shifts, even though PoU does not explicitly optimize for tool adaptation.

[210] Visual Attention Reasoning via Hierarchical Search and Self-Verification

Wei Cai, Jian Zhao, Yuchen Yuan, Tianle Zhang, Ming Zhu, Haichuan Tang, Xuelong Li

Main category: cs.AI

TL;DR: VAR is a reinforcement learning framework that uses hierarchical search with self-verification to reduce hallucinations in multimodal LLMs by generating explicit visual evidence and enabling backtracking to correct errors.

DetailsMotivation: MLLMs frequently hallucinate due to fragile linear reasoning and weak visual grounding. Current approaches lack traceable evidence and error correction mechanisms.

Method: VAR reformulates reasoning as hierarchical search with self-verification, generates explicit bounding boxes for visual grounding, uses a novel reward function combining geometric precision and semantic sufficiency, and replaces linear Chain-of-Thought with tree-search policy with backtracking.
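
To make the reward design concrete, the sketch below combines a standard IoU term (geometric precision of the predicted evidence box) with a [0, 1] semantic-sufficiency score. The equal weighting and the scalar semantic score are assumptions, not the paper's exact formulation.

```python
# Illustrative VAR-style reward: geometric precision + semantic sufficiency.
def iou(box_a, box_b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter) if inter else 0.0

def var_reward(pred_box, gt_box, semantic_score, w_geo=0.5):
    """Combine box precision with a [0, 1] semantic sufficiency score."""
    return w_geo * iou(pred_box, gt_box) + (1 - w_geo) * semantic_score

print(var_reward((10, 10, 50, 50), (20, 20, 60, 60), semantic_score=0.8))
```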

Result: Theoretical analysis validates framework reliability, and extensive experiments show VAR significantly outperforms state-of-the-art methods on complex hallucination and safety benchmarks.

Conclusion: VAR provides a robust solution to MLLM hallucinations through traceable visual grounding and error-correcting reasoning, representing an advancement in reliable multimodal AI systems.

Abstract: Multimodal Large Language Models (MLLMs) frequently hallucinate due to their reliance on fragile, linear reasoning and weak visual grounding. We propose Visual Attention Reasoning (VAR), a reinforcement learning framework that reformulates reasoning as a hierarchical search with self-verification. VAR enforces traceable evidence grounding by generating explicit bounding boxes, guided by a novel reward function combining geometric precision and semantic sufficiency. Furthermore, it replaces linear Chain-of-Thought with a tree-search policy capable of backtracking to correct logical errors. Theoretical analysis validates the framework’s reliability, and extensive experiments demonstrate that VAR significantly outperforms state-of-the-art methods on complex hallucination and safety benchmarks.

[211] GTR-Mamba: Geometry-to-Tangent Routing Mamba for Hyperbolic POI Recommendation

Zhuoxuan Li, Jieyuan Pei, Tangwei Ye, Zhongyuan Lai, Zihan Liu, Fengyuan Xu, Qi Zhang, Liang Hu

Main category: cs.AI

TL;DR: GTR-Mamba is a novel hyperbolic POI recommendation framework that uses Geometry-to-Tangent Routing to efficiently model sequential trajectories by dynamically routing computations to Euclidean tangent spaces while maintaining geometric consistency through parallel transport.

DetailsMotivation: Current hyperbolic POI recommendation models suffer from prohibitive computational costs and numerical instability due to expensive Möbius operations directly on the manifold, creating a conflict between geometric representational power and sequential efficiency for trajectory modeling.

Method: Proposes GTR-Mamba with Geometry-to-Tangent Routing that strategically routes complex state transitions to computationally efficient Euclidean tangent space. Introduces Parallel Transport mechanism to dynamically align tangent spaces along trajectories, ensuring geometric consistency. Uses exogenous spatio-temporal channel to modulate SSM discretization parameters.
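
The routing idea can be sketched with the standard log/exp maps at the origin of the Poincaré ball (curvature c = 1): map the hyperbolic state into the Euclidean tangent space, apply a cheap linear update there, and map back. The paper's Parallel Transport mechanism additionally aligns tangent spaces along the trajectory; this minimal numpy sketch omits that and is not the authors' code.

```python
import numpy as np

def log_map0(x, eps=1e-9):
    """Logarithmic map at the origin of the Poincaré ball (c = 1)."""
    norm = np.linalg.norm(x) + eps
    return np.arctanh(np.clip(norm, 0, 1 - 1e-7)) * x / norm

def exp_map0(v, eps=1e-9):
    """Exponential map at the origin of the Poincaré ball (c = 1)."""
    norm = np.linalg.norm(v) + eps
    return np.tanh(norm) * v / norm

h = np.array([0.3, 0.4])                 # hyperbolic hidden state (inside unit ball)
A = np.array([[0.9, 0.0], [0.0, 0.8]])   # illustrative linear SSM-style transition

v = log_map0(h)        # route the state to the Euclidean tangent space
v = A @ v              # cheap linear update instead of Möbius operations
h_next = exp_map0(v)   # route back onto the manifold
print(h_next, np.linalg.norm(h_next) < 1.0)  # state stays inside the ball
```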

Result: Extensive experiments on three real-world datasets demonstrate that GTR-Mamba consistently outperforms state-of-the-art baselines in next POI recommendation.

Conclusion: GTR-Mamba effectively bridges the gap between curved manifold representation and linear tangent operations, resolving the conflict between hyperbolic geometry’s representational power and sequential modeling efficiency for POI recommendation.

Abstract: Next Point-of-Interest (POI) recommendation is a critical task in modern Location-Based Social Networks (LBSNs), aiming to model the complex decision-making process of human mobility to provide personalized recommendations for a user’s next check-in location. Existing hyperbolic POI recommendation models, predominantly based on rotations and graph representations, have been extensively investigated. Although hyperbolic geometry has proven superior in representing hierarchical data with low distortion, current hyperbolic sequence models typically rely on performing recurrence via expensive Möbius operations directly on the manifold. This incurs prohibitive computational costs and numerical instability, rendering them ill-suited for trajectory modeling. To resolve this conflict between geometric representational power and sequential efficiency, we propose GTR-Mamba, a novel framework featuring Geometry-to-Tangent Routing. GTR-Mamba strategically routes complex state transitions to the computationally efficient Euclidean tangent space. Crucially, instead of a static approximation, we introduce a Parallel Transport (PT) mechanism that dynamically aligns tangent spaces along the trajectory. This ensures geometric consistency across recursive updates, effectively bridging the gap between the curved manifold and linear tangent operations. This process is orchestrated by an exogenous spatio-temporal channel, which explicitly modulates the SSM discretization parameters. Extensive experiments on three real-world datasets demonstrate that GTR-Mamba consistently outperforms state-of-the-art baselines in next POI recommendation.

[212] SimWorld: An Open-ended Realistic Simulator for Autonomous Agents in Physical and Social Worlds

Jiawei Ren, Yan Zhuang, Xiaokang Ye, Lingjun Mao, Xuhong He, Jianzhi Shen, Mrinaal Dogra, Yiming Liang, Ruixuan Zhang, Tianai Yue, Yiqing Yang, Eric Liu, Ryan Wu, Kevin Benavente, Rajiv Mandya Nagaraju, Muhammad Faayez, Xiyan Zhang, Dhruv Vivek Sharma, Xianrui Zhong, Ziqiao Ma, Tianmin Shu, Zhiting Hu, Lianhui Qin

Main category: cs.AI

TL;DR: SimWorld is a new Unreal Engine 5-based simulator for developing and evaluating LLM/VLM agents in realistic physical and social environments, addressing limitations of existing simulators.

DetailsMotivation: Current AI agents excel in math/coding but struggle in complex real-world environments. Existing simulators are limited: they use hand-crafted environments, simplified physics/social rules, and lack native LLM/VLM support, hindering development of agents that can operate autonomously in real-world scenarios like earning income or running businesses.

Method: Built SimWorld on Unreal Engine 5 with three core capabilities: 1) realistic open-ended world simulation with accurate physics/social dynamics and language-driven procedural generation, 2) rich LLM/VLM interface with multimodal inputs and open-vocabulary actions at varying abstraction levels, 3) diverse extensible physical/social reasoning scenarios customizable by users.

Result: Deployed frontier LLM agents (GPT-4o, Gemini-2.5-Flash, Claude-3.5, DeepSeek-Prover-V2) on long-horizon multi-agent delivery tasks involving strategic cooperation/competition. Results revealed distinct reasoning patterns and limitations across models.

Conclusion: SimWorld addresses critical gaps in agent development infrastructure and is open-sourced as a foundational platform for advancing real-world agent intelligence across disciplines.

Abstract: While LLM/VLM-powered AI agents have advanced rapidly in math, coding, and computer use, their applications in complex physical and social environments remain challenging. Building agents that can survive and thrive in the real world (for example, by autonomously earning income or running a business) requires massive-scale interaction, reasoning, training, and evaluation across diverse embodied scenarios. However, existing world simulators for such development fall short: they often rely on limited hand-crafted environments, simulate simplified game-like physics and social rules, and lack native support for LLM/VLM agents. We introduce SimWorld, a new simulator built on Unreal Engine 5, designed for developing and evaluating LLM/VLM agents in rich, real-world-like settings. SimWorld offers three core capabilities: (1) realistic, open-ended world simulation, including accurate physical and social dynamics and language-driven procedural environment generation; (2) a rich interface for LLM/VLM agents, with multimodal world inputs and open-vocabulary actions at varying levels of abstraction; and (3) diverse and extensible physical and social reasoning scenarios that are easily customizable by users. We demonstrate SimWorld by deploying frontier LLM agents (e.g., GPT-4o, Gemini-2.5-Flash, Claude-3.5, and DeepSeek-Prover-V2) on long-horizon multi-agent delivery tasks involving strategic cooperation and competition. The results reveal distinct reasoning patterns and limitations across models. We open-source SimWorld and hope it becomes a foundational platform for advancing real-world agent intelligence across disciplines: https://simworld.org.

[213] Cognitive Control Architecture (CCA): A Lifecycle Supervision Framework for Robustly Aligned AI Agents

Zhibo Liang, Tianze Hu, Zaiye Chen, Mingjie Tang

Main category: cs.AI

TL;DR: Proposes Cognitive Control Architecture (CCA), a holistic defense framework against Indirect Prompt Injection attacks on LLM agents, using intent graphs and tiered adjudicators to detect malicious deviations while maintaining security, functionality, and efficiency.

DetailsMotivation: Current LLM agents are vulnerable to Indirect Prompt Injection attacks that hijack behavior through polluted external sources. Existing defenses are fragmented and force unacceptable trade-offs between security, functionality, and efficiency, failing to provide full integrity across the task execution pipeline.

Method: Cognitive Control Architecture (CCA) with two synergistic components: (1) proactive control-flow and data-flow integrity enforcement via pre-generated “Intent Graph,” and (2) “Tiered Adjudicator” that initiates deep reasoning with multi-dimensional scoring upon deviation detection to counter complex conditional attacks.
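
The core detection insight (an injected tool call shows up as an off-graph transition) can be illustrated with a toy intent graph and a stub adjudicator. All node names, scores, and thresholds below are invented for illustration.

```python
INTENT_GRAPH = {                 # allowed control flow: step -> permitted next tools
    "start":      {"read_email"},
    "read_email": {"summarize", "reply"},
    "summarize":  {"reply"},
    "reply":      set(),
}

def check_trajectory(actions):
    """Return the first action that deviates from the intent graph, if any."""
    state = "start"
    for act in actions:
        if act not in INTENT_GRAPH.get(state, set()):
            return act            # deviation: hand off to the tiered adjudicator
        state = act
    return None

def tiered_adjudicator(action, scores):
    """Toy multi-dimensional scoring; block if the aggregate risk is too high."""
    return "block" if sum(scores.values()) / len(scores) > 0.5 else "allow"

benign = ["read_email", "summarize", "reply"]
hijacked = ["read_email", "send_money"]       # injected, off-graph tool call
print(check_trajectory(benign))               # None -> no deviation
deviant = check_trajectory(hijacked)
print(deviant, tiered_adjudicator(deviant, {"intent_shift": 0.9, "data_flow": 0.7}))
```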

Result: Experiments on AgentDojo benchmark show CCA effectively withstands sophisticated attacks that challenge other advanced defense methods, achieving uncompromised security with notable efficiency and robustness, reconciling the multi-dimensional trade-off.

Conclusion: CCA provides a holistic framework for full-lifecycle cognitive supervision that detects IPI attacks through action trajectory deviations, offering a comprehensive solution to the security-functionality-efficiency trade-off problem in LLM agent defenses.

Abstract: Autonomous Large Language Model (LLM) agents exhibit significant vulnerability to Indirect Prompt Injection (IPI) attacks. These attacks hijack agent behavior by polluting external information sources, exploiting fundamental trade-offs between security and functionality in existing defense mechanisms. This leads to malicious and unauthorized tool invocations, diverting agents from their original objectives. The success of complex IPIs reveals a deeper systemic fragility: while current defenses demonstrate some effectiveness, most defense architectures are inherently fragmented. Consequently, they fail to provide full integrity assurance across the entire task execution pipeline, forcing unacceptable multi-dimensional compromises among security, functionality, and efficiency. Our method is predicated on a core insight: no matter how subtle an IPI attack, its pursuit of a malicious objective will ultimately manifest as a detectable deviation in the action trajectory, distinct from the expected legitimate plan. Based on this, we propose the Cognitive Control Architecture (CCA), a holistic framework achieving full-lifecycle cognitive supervision. CCA constructs an efficient, dual-layered defense system through two synergistic pillars: (i) proactive and preemptive control-flow and data-flow integrity enforcement via a pre-generated “Intent Graph”; and (ii) an innovative “Tiered Adjudicator” that, upon deviation detection, initiates deep reasoning based on multi-dimensional scoring, specifically designed to counter complex conditional attacks. Experiments on the AgentDojo benchmark substantiate that CCA not only effectively withstands sophisticated attacks that challenge other advanced defense methods but also achieves uncompromised security with notable efficiency and robustness, thereby reconciling the aforementioned multi-dimensional trade-off.

[214] Scalable Back-End for an AI-Based Diabetes Prediction Application

Henry Anand Septian Radityo, Bernardus Willson, Raynard Tanadi, Latifa Dwiyanti, Saiful Akbar

Main category: cs.AI

TL;DR: Developed a scalable back-end system for mobile diabetes prediction app using horizontal scaling, database sharding, and RabbitMQ message queue, achieving 83% of performance targets with ability to handle 10,000 concurrent users.

DetailsMotivation: The rising global prevalence of diabetes requires early detection to prevent complications, necessitating AI-powered prediction applications with responsive and scalable back-end architecture to serve large user bases effectively.

Method: Implemented a scalable back-end architecture using horizontal scaling, database sharding, and asynchronous communication via RabbitMQ message queue to handle concurrent users and computationally intensive prediction requests.
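
The RabbitMQ pattern described here can be sketched with pika: the API tier publishes prediction requests to a durable queue so they are absorbed under load and survive broker restarts, and worker processes consume them asynchronously. The queue name, payload, and connection details are assumptions.

```python
import json
import pika

# Producer side: enqueue a computationally intensive prediction request.
connection = pika.BlockingConnection(pika.ConnectionParameters(host="localhost"))
channel = connection.channel()
channel.queue_declare(queue="diabetes_predictions", durable=True)

request = {"user_id": 42, "glucose": 148, "bmi": 31.2, "age": 55}
channel.basic_publish(
    exchange="",
    routing_key="diabetes_predictions",
    body=json.dumps(request),
    properties=pika.BasicProperties(delivery_mode=2),  # persist message to disk
)
connection.close()
```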

Result: 83% of system features (20 out of 24) met performance targets (<5% failure rate, <1000 ms latency). System handled 10,000 concurrent users successfully. RabbitMQ minimized error rates for intensive prediction requests by queuing requests and preventing data loss.

Conclusion: The scalable back-end system successfully supports mobile diabetes prediction applications, demonstrating effective handling of large user bases while maintaining performance targets, with asynchronous communication proving crucial for system reliability under heavy load.

Abstract: The rising global prevalence of diabetes necessitates early detection to prevent severe complications. While AI-powered prediction applications offer a promising solution, they require a responsive and scalable back-end architecture to serve a large user base effectively. This paper details the development and evaluation of a scalable back-end system designed for a mobile diabetes prediction application. The primary objective was to maintain a failure rate below 5% and an average latency of under 1000 ms. The architecture leverages horizontal scaling, database sharding, and asynchronous communication via a message queue. Performance evaluation showed that 83% of the system’s features (20 out of 24) met the specified performance targets. Key functionalities such as user profile management, activity tracking, and read-intensive prediction operations successfully achieved the desired performance. The system demonstrated the ability to handle up to 10,000 concurrent users without issues, validating its scalability. The implementation of asynchronous communication using RabbitMQ proved crucial in minimizing the error rate for computationally intensive prediction requests, ensuring system reliability by queuing requests and preventing data loss under heavy load.

[215] Exploring LLMs for Scientific Information Extraction Using The SciEx Framework

Sha Li, Ayush Sadekar, Nathan Self, Yiqi Su, Lars Andersland, Mira Chaplin, Annabel Zhang, Hyoju Yang, James B Henderson, Krista Wigginton, Linsey Marr, T. M. Murali, Naren Ramakrishnan

Main category: cs.AI

TL;DR: SciEx is a modular framework for extracting fine-grained scientific information from literature using LLMs, addressing challenges of long documents, multi-modal content, and evolving data schemas.

DetailsMotivation: Current LLM-based extraction tools struggle with scientific literature's complexities: long documents, multi-modal content, inconsistent information across publications, and rapidly changing extraction schemas that make system re-architecture difficult.

Method: SciEx decouples key components into a modular framework including PDF parsing, multi-modal retrieval, extraction, and aggregation. This design enables extensibility and flexible integration of new models, prompting strategies, and reasoning mechanisms.
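
A minimal sketch of the decoupling idea: each stage is a swappable callable, so changing the extraction model or prompting strategy replaces exactly one component. Stage names mirror the summary above; the bodies are placeholders, not SciEx's implementation.

```python
from typing import Callable

class Pipeline:
    def __init__(self, parse: Callable, retrieve: Callable,
                 extract: Callable, aggregate: Callable):
        self.stages = [parse, retrieve, extract, aggregate]

    def run(self, pdf_paths, schema):
        docs = self.stages[0](pdf_paths)                 # PDF parsing
        chunks = self.stages[1](docs, schema)            # multi-modal retrieval
        records = self.stages[2](chunks, schema)         # LLM-based extraction
        return self.stages[3](records)                   # cross-paper aggregation

# Swapping one stage (e.g., a new extractor) leaves the rest untouched:
pipe = Pipeline(
    parse=lambda paths: [f"text of {p}" for p in paths],
    retrieve=lambda docs, schema: docs,
    extract=lambda chunks, schema: [{"field": schema, "value": c} for c in chunks],
    aggregate=lambda records: records,
)
print(pipe.run(["paper1.pdf"], schema="virus_half_life"))
```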

Result: The framework was evaluated on datasets spanning three scientific topics for accurate and consistent fine-grained information extraction. The findings provide practical insights into both strengths and limitations of current LLM-based pipelines.

Conclusion: SciEx offers a practical solution for on-demand scientific data extraction that can adapt to changing requirements while maintaining accuracy and consistency across complex scientific literature.

Abstract: Large language models (LLMs) are increasingly touted as powerful tools for automating scientific information extraction. However, existing methods and tools often struggle with the realities of scientific literature: long-context documents, multi-modal content, and reconciling varied and inconsistent fine-grained information across multiple publications into standardized formats. These challenges are further compounded when the desired data schema or extraction ontology changes rapidly, making it difficult to re-architect or fine-tune existing systems. We present SciEx, a modular and composable framework that decouples key components including PDF parsing, multi-modal retrieval, extraction, and aggregation. This design streamlines on-demand data extraction while enabling extensibility and flexible integration of new models, prompting strategies, and reasoning mechanisms. We evaluate SciEx on datasets spanning three scientific topics for its ability to extract fine-grained information accurately and consistently. Our findings provide practical insights into both the strengths and limitations of current LLM-based pipelines.

[216] Shape of Thought: When Distribution Matters More than Correctness in Reasoning Tasks

Abhranil Chandra, Ayush Agrawal, Arian Hosseini, Sebastian Fischmeister, Rishabh Agarwal, Navin Goyal, Aaron Courville

Main category: cs.AI

TL;DR: Training language models on synthetic chain-of-thought traces from more capable models (even with incorrect final answers) improves reasoning performance more than human-annotated datasets, due to distribution alignment and partial validity of flawed reasoning steps.

DetailsMotivation: To investigate whether language models can improve reasoning capabilities by learning from synthetic chain-of-thought data, even when that data contains incorrect final answers, and to understand why this approach outperforms training on human-annotated datasets.

Method: Train language models on synthetic CoT traces from more capable models (with incorrect answers), compare with human-annotated datasets, test two hypotheses: 1) use model paraphrasing to shift human traces closer to model distribution, 2) introduce increasingly flawed CoT traces to test tolerance. Experiments across math, algorithmic reasoning, and code generation using MATH, GSM8K, Countdown, MBPP datasets on Qwen, Llama, and Gemma models (1.5B to 9B).

Result: Synthetic CoT training with incorrect answers yields better reasoning performance than human-annotated datasets. Paraphrasing human traces improves performance (supporting distribution hypothesis). Models show tolerance to partially flawed reasoning steps. Performance improvements demonstrated across multiple reasoning domains and model families.

Conclusion: Dataset curation should prioritize alignment with model distribution over correctness of final answers. Correct final answers don’t guarantee faithful reasoning processes. Synthetic data from more capable models (even with errors) can be more effective for training than human-annotated data due to distribution proximity and partial validity of reasoning steps.

Abstract: We present the surprising finding that a language model’s reasoning capabilities can be improved by training on synthetic datasets of chain-of-thought (CoT) traces from more capable models, even when all of those traces lead to an incorrect final answer. Our experiments show this approach can yield better performance on reasoning tasks than training on human-annotated datasets. We hypothesize that two key factors explain this phenomenon: first, the distribution of synthetic data is inherently closer to the language model’s own distribution, making it more amenable to learning. Second, these 'incorrect' traces are often only partially flawed and contain valid reasoning steps from which the model can learn. To further test the first hypothesis, we use a language model to paraphrase human-annotated traces – shifting their distribution closer to the model’s own distribution – and show that this improves performance. For the second hypothesis, we introduce increasingly flawed CoT traces and study to what extent models are tolerant to these flaws. We demonstrate our findings across various reasoning domains like math, algorithmic reasoning and code generation using MATH, GSM8K, Countdown and MBPP datasets on various language models ranging from 1.5B to 9B across Qwen, Llama, and Gemma models. Our study shows that curating datasets that are closer to the model’s distribution is a critical aspect to consider. We also show that a correct final answer is not always a reliable indicator of a faithful reasoning process.

[217] Programming over Thinking: Efficient and Robust Multi-Constraint Planning

Derrick Goh Xin Deik, Quanyu Long, Zhengyuan Liu, Nancy F. Chen, Wenya Wang

Main category: cs.AI

TL;DR: SCOPE is a framework that separates reasoning from execution for multi-constraint planning, achieving SOTA performance with lower cost and latency than existing LLM approaches.

DetailsMotivation: Existing LLM approaches for multi-constraint planning have limitations: pure reasoning paradigms suffer from inconsistency and error accumulation, while coding/solver-based approaches lack flexibility and generalizability across diverse problems.

Method: SCOPE (Scalable COde Planning Engine) disentangles query-specific reasoning from generic code execution, producing reusable solver functions that only require minimal parameter changes across different queries.
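
The reasoning/execution split can be illustrated with a deterministic solver that is written once and reused across queries, with only its input parameters changing. The constraint format and plan schema below are assumptions for illustration.

```python
def plan_solver(candidates, constraints, budget):
    """Reusable solver: filter candidate plans by constraints, then minimize cost."""
    feasible = [
        plan for plan in candidates
        if all(check(plan) for check in constraints) and plan["cost"] <= budget
    ]
    return min(feasible, key=lambda p: p["cost"], default=None)

plans = [
    {"city": "Rome",  "days": 3, "cost": 900},
    {"city": "Rome",  "days": 5, "cost": 1400},
    {"city": "Paris", "days": 3, "cost": 1100},
]

# Query A: "3 days in Rome under $1000" -> encoded purely as parameters.
print(plan_solver(plans, [lambda p: p["city"] == "Rome", lambda p: p["days"] == 3], 1000))
# Query B: "any 3-day trip under $1200" -> same solver, new parameters.
print(plan_solver(plans, [lambda p: p["days"] == 3], 1200))
```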

Result: SCOPE achieves 93.1% success on TravelPlanner (61.6% gain over best baseline CoT) while cutting inference cost by 1.4x and time by ~4.67x using GPT-4o.

Conclusion: By separating reasoning from execution, SCOPE creates consistent, deterministic, and reusable solver functions that outperform existing approaches while reducing computational costs.

Abstract: Multi-constraint planning involves identifying, evaluating, and refining candidate plans while satisfying multiple, potentially conflicting constraints. Existing large language model (LLM) approaches face fundamental limitations in this domain. Pure reasoning paradigms, which rely on long natural language chains, are prone to inconsistency, error accumulation, and prohibitive cost as constraints compound. Conversely, LLMs combined with coding- or solver-based strategies lack flexibility: they often generate problem-specific code from scratch or depend on fixed solvers, failing to capture generalizable logic across diverse problems. To address these challenges, we introduce the Scalable COde Planning Engine (SCOPE), a framework that disentangles query-specific reasoning from generic code execution. By separating reasoning from execution, SCOPE produces solver functions that are consistent, deterministic, and reusable across queries while requiring only minimal changes to input parameters. SCOPE achieves state-of-the-art performance while lowering cost and latency. For example, with GPT-4o, it reaches 93.1% success on TravelPlanner, a 61.6% gain over the best baseline (CoT) while cutting inference cost by 1.4x and time by ~4.67x. Code is available at https://github.com/DerrickGXD/SCOPE.

[218] Mining Citywide Dengue Spread Patterns in Singapore Through Hotspot Dynamics from Open Web Data

Liping Huang, Gaoxi Xiao, Stefan Ma, Hechang Chen, Shisong Tang, Flora Salim

Main category: cs.AI

TL;DR: A novel framework that mines latent transmission links from dengue case data to forecast hotspots and explain citywide spread through human mobility patterns, achieving 0.79 F-score in Singapore case studies.

DetailsMotivation: Dengue remains a persistent public health challenge in urban tropical areas like Singapore. Current approaches need to shift from reactive to proactive control by anticipating where transmission risks will emerge, requiring methods that can transform publicly available case data into predictive tools.

Method: The framework uncovers latent transmission links between urban regions directly from dengue case data. Instead of treating cases as isolated reports, it models how hotspot formation in one area is influenced by epidemic dynamics in neighboring regions. The hidden links are optimized through gradient descent and used to forecast hotspot status while verifying consistency by examining network stability across consecutive weeks.
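
One way to read this mechanically: a logistic model predicts next-week hotspot status from every region's hotspot history, and the learned weight matrix plays the role of the latent region-to-region links. The numpy sketch below uses random stand-in data and illustrative hyperparameters; it is not the paper's model.

```python
import numpy as np

rng = np.random.default_rng(0)
R, T = 5, 40                                     # regions, weeks
H = (rng.random((T, R)) > 0.6).astype(float)     # toy weekly hotspot indicators

W = np.zeros((R, R))                             # latent transmission links to learn
b = np.zeros(R)
lr = 0.5
sigmoid = lambda z: 1 / (1 + np.exp(-z))

for _ in range(500):                             # gradient descent on logistic loss
    pred = sigmoid(H[:-1] @ W + b)               # predict week t+1 from week t
    grad = H[:-1].T @ (pred - H[1:]) / (T - 1)
    W -= lr * grad
    b -= lr * (pred - H[1:]).mean(axis=0)

forecast = sigmoid(H[-1] @ W + b) > 0.5          # next-week hotspot forecast
print(forecast.astype(int))                      # W itself is the inferred link network
```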

Result: Case studies on Singapore during 2013-2018 and 2020 show that four weeks of hotspot history are sufficient to achieve an average F-score of 0.79. The learned transmission links align closely with commuting flows, providing interpretable explanations for citywide spread patterns.

Conclusion: This work transforms open web-based case data into both predictive and explanatory resources by mining hidden spreading dynamics. The framework advances epidemic modeling while providing a scalable, low-cost tool for public health planning, early intervention, and urban resilience in dengue control.

Abstract: Dengue, a mosquito-borne disease, continues to pose a persistent public health challenge in urban areas, particularly in tropical regions such as Singapore. Effective and affordable control requires anticipating where transmission risks are likely to emerge so that interventions can be deployed proactively rather than reactively. This study introduces a novel framework that uncovers and exploits latent transmission links between urban regions, mined directly from publicly available dengue case data. Instead of treating cases as isolated reports, we model how hotspot formation in one area is influenced by epidemic dynamics in neighboring regions. While mosquito movement is highly localized, long-distance transmission is often driven by human mobility, and in our case study, the learned network aligns closely with commuting flows, providing an interpretable explanation for citywide spread. These hidden links are optimized through gradient descent and used not only to forecast hotspot status but also to verify the consistency of spreading patterns, by examining the stability of the inferred network across consecutive weeks. Case studies on Singapore during 2013-2018 and 2020 show that four weeks of hotspot history are sufficient to achieve an average F-score of 0.79. Importantly, the learned transmission links align with commuting flows, highlighting the interpretable interplay between hidden epidemic spread and human mobility. By shifting from simply reporting dengue cases to mining and validating hidden spreading dynamics, this work transforms open web-based case data into a predictive and explanatory resource. The proposed framework advances epidemic modeling while providing a scalable, low-cost tool for public health planning, early intervention, and urban resilience.

[219] Emergent, not Immanent: A Baradian Reading of Explainable AI

Fabio Morreale, Joan Serrà, Yuki Mitsufuji

Main category: cs.AI

TL;DR: The paper critiques current XAI approaches for treating meaning as inherent to AI models and proposes an alternative framework based on agential realism, where interpretations emerge from situated entanglements between models, humans, and context.

DetailsMotivation: Current XAI approaches are limited by problematic onto-epistemological assumptions: treating meaning as immanent to models, positioning explainers as external observers, and assuming causal structures are computationally recoverable. The authors aim to develop a more nuanced understanding of XAI that accounts for the situated, relational nature of interpretation.

Method: The authors apply Barad’s agential realism to XAI, analyzing a comprehensive set of XAI methods through this theoretical lens. They then articulate the ethical dimensions of this framework and propose design directions for XAI interfaces that support emergent interpretation, using a speculative text-to-music interface as a case study.

Result: The analysis reveals the assumptions and limitations underpinning current XAI methods. The agential realism framework shows that interpretations are material-discursive performances emerging from situated entanglements of AI models with humans, context, and interpretative apparatus.

Conclusion: XAI should be reconceptualized as supporting emergent interpretations rather than revealing inherent model meaning. The proposed framework offers ethical guidance and practical design directions for XAI interfaces that acknowledge the relational, situated nature of interpretation in human-AI interactions.

Abstract: Explainable AI (XAI) is frequently positioned as a technical problem of revealing the inner workings of an AI model. This position is affected by unexamined onto-epistemological assumptions: meaning is treated as immanent to the model, the explainer is positioned outside the system, and a causal structure is presumed recoverable through computational techniques. In this paper, we draw on Barad’s agential realism to develop an alternative onto-epistemology of XAI. We propose that interpretations are material-discursive performances that emerge from situated entanglements of the AI model with humans, context, and the interpretative apparatus. To develop this position, we read a comprehensive set of XAI methods through agential realism and reveal the assumptions and limitations that underpin several of these methods. We then articulate the framework’s ethical dimension and propose design directions for XAI interfaces that support emergent interpretation, using a speculative text-to-music interface as a case study.

[220] Prometheus Mind: Retrofitting Memory to Frozen Language Models

Mark Wind

Main category: cs.AI

TL;DR: Prometheus Mind adds memory to frozen Qwen3-4B using modular adapters without weight modification, solving extraction, training, injection, and hidden state collapse problems.

DetailsMotivation: Adding memory to pretrained language models typically requires architectural changes or weight modification, which can be complex and irreversible. The authors want to retrofit memory to frozen models in a reversible way using minimal overhead.

Method: Uses 11 modular adapters (530MB, 7% overhead) on frozen Qwen3-4B. Solves four key problems: (1) Contrastive Direction Discovery for semantic extraction without labeled data, (2) Stage-wise training of adapters on simple proxy tasks, (3) Using lm_head-weight rows for injection without training, (4) Training projections to recover distinction from collapsed hidden states.
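
Contrastive Direction Discovery, as summarized, amounts to estimating a semantic direction from minimal pairs without labels. A common way to do this is the normalized difference of mean hidden states, sketched below on synthetic stand-in activations (the real method operates on the frozen model's hidden states).

```python
import numpy as np

rng = np.random.default_rng(0)
d = 64
true_dir = rng.normal(size=d)
true_dir /= np.linalg.norm(true_dir)

# Minimal pairs: hidden states that differ (plus noise) only along one latent
# relation direction, e.g., "my wife ..." vs "my brother ..." contexts.
base = rng.normal(size=(32, d))
h_pos = base + 0.8 * true_dir + 0.05 * rng.normal(size=(32, d))
h_neg = base - 0.8 * true_dir + 0.05 * rng.normal(size=(32, d))

def contrastive_direction(pos, neg):
    """Direction separating the pair sets: difference of means, normalized."""
    v = pos.mean(axis=0) - neg.mean(axis=0)
    return v / np.linalg.norm(v)

v_hat = contrastive_direction(h_pos, h_neg)
print("cosine to true direction:", float(v_hat @ true_dir))  # close to 1.0
```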

Result: Achieves 94.4% retrieval on clean inputs (n=54, 95% CI: [84.9%, 98.1%]) on PrometheusExtract-132 dataset. Degrades to 19.4% on informal inputs with ellipsis, filler words, or implicit subjects. Primary bottleneck is relation classification (47.3% accuracy).

Conclusion: Prometheus Mind successfully adds memory to frozen models with minimal overhead and full reversibility, but performance degrades significantly on informal language and relation classification remains a major challenge.

Abstract: Adding memory to pretrained language models typically requires architectural changes or weight modification. We present Prometheus Mind, which retrofits memory to a frozen Qwen3-4B using 11 modular adapters (530MB, 7% overhead) – fully reversible by removing the adapters. Building this system required solving four problems: (1) Extraction – we develop Contrastive Direction Discovery (CDD), which finds semantic directions via minimal pairs without labeled data. (2) Training – end-to-end optimization collapses; stage-wise training of each adapter on simple proxy tasks succeeds. (3) Injection – learned encoders fail to generalize; we find that lm_head-weight rows already provide the mapping we need, requiring no training. (4) Hidden state collapse – transformers make "wife" and "brother" 0.98+ similar; we train projections to recover distinction (0.98 → 0.09). On PrometheusExtract-132 (132 cases), the system achieves 94.4% retrieval on clean inputs (n=54, 95% CI: [84.9%, 98.1%]), degrading to 19.4% on informal inputs with ellipsis, filler words, or implicit subjects (n=36). The primary bottleneck is relation classification (47.3% accuracy), responsible for most extraction errors.

[221] Benchmarking Text-to-Python against Text-to-SQL: The Impact of Explicit Logic and Ambiguity

Hangle Hu, Chenyu Hou, Bin Cao, Ruizhe Li

Main category: cs.AI

TL;DR: BIRD-Python benchmark addresses Text-to-Python reliability gap vs. SQL, showing Python can match SQL performance when domain knowledge resolves ambiguity.

DetailsMotivation: Real-world analytics increasingly need Python/Pandas for file-based data and complex workflows, but Text-to-Python reliability is underexplored compared to mature SQL ecosystem.

Method: Introduce BIRD-Python benchmark with refined dataset to reduce noise and align semantics; propose Logic Completion Framework (LCF) that incorporates latent domain knowledge to resolve ambiguity in Python code generation.

Result: Performance differences stem from missing domain context, not inherent code generation limitations; when gaps are addressed, Text-to-Python achieves parity with Text-to-SQL.

Conclusion: Python is viable for analytical agents if systems effectively ground ambiguous natural language inputs in executable logical specifications.

Abstract: While Text-to-SQL remains the dominant approach for database interaction, real-world analytics increasingly require the flexibility of general-purpose programming languages such as Python or Pandas to manage file-based data and complex analytical workflows. Despite this growing need, the reliability of Text-to-Python in core data retrieval remains underexplored relative to the mature SQL ecosystem. To address this gap, we introduce BIRD-Python, a benchmark designed for cross-paradigm evaluation. We systematically refined the original dataset to reduce annotation noise and align execution semantics, thereby establishing a consistent and standardized baseline for comparison. Our analysis reveals a fundamental paradigmatic divergence: whereas SQL leverages implicit DBMS behaviors through its declarative structure, Python requires explicit procedural logic, making it highly sensitive to underspecified user intent. To mitigate this challenge, we propose the Logic Completion Framework (LCF), which resolves ambiguity by incorporating latent domain knowledge into the generation process. Experimental results show that (1) performance differences primarily stem from missing domain context rather than inherent limitations in code generation, and (2) when these gaps are addressed, Text-to-Python achieves performance parity with Text-to-SQL. These findings establish Python as a viable foundation for analytical agents-provided that systems effectively ground ambiguous natural language inputs in executable logical specifications. Resources are available at https://anonymous.4open.science/r/Bird-Python-43B7/.

[222] EvoCUA: Evolving Computer Use Agents via Learning from Scalable Synthetic Experience

Taofeng Xue, Chong Peng, Mianqiu Huang, Linsen Guo, Tiancheng Han, Haozhe Wang, Jianing Wang, Xiaocheng Zhang, Xin Yang, Dengchang Zhao, Jinrui Ding, Xiandi Ma, Yuchen Xie, Peng Pei, Xunliang Cai, Xipeng Qiu

Main category: cs.AI

TL;DR: EvoCUA introduces an evolutionary learning approach for computer-use agents that cycles between data generation and policy optimization, achieving state-of-the-art performance on OSWorld benchmark.

DetailsMotivation: Current computer-use agents are limited by static data scaling and passive imitation, which fails to capture the intricate causal dynamics of long-horizon computer tasks.

Method: EvoCUA integrates data generation and policy optimization in a self-sustaining evolutionary cycle. It uses a verifiable synthesis engine to generate diverse tasks with executable validators, scalable infrastructure for thousands of asynchronous sandbox rollouts, and an iterative evolving learning strategy that identifies capability boundaries and transforms failures into supervision.
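
At a high level, the evolutionary cycle alternates data generation and policy optimization; the sketch below shows that loop with trivial stand-in components so it runs end to end. Every function here is an illustrative placeholder, not the authors' infrastructure.

```python
def evolve(policy, synthesize_tasks, rollout, analyze_failure, train, rounds=3):
    """Alternate verifiable task synthesis, rollouts, and policy updates."""
    for _ in range(rounds):
        tasks = synthesize_tasks()                 # tasks with executable validators
        experience = []
        for task in tasks:
            traj = rollout(policy, task)           # sandbox rollout
            if task["validator"](traj):
                experience.append(traj)            # reinforce successful routines
            else:
                experience.append(analyze_failure(traj))  # failures -> supervision
        policy = train(policy, experience)         # iterative policy update
    return policy

# Stand-in components so the loop is runnable:
evolved = evolve(
    {"skill": 0.0},
    synthesize_tasks=lambda: [{"validator": lambda traj: traj["score"] > 0.5}],
    rollout=lambda p, t: {"score": p["skill"]},
    analyze_failure=lambda traj: {**traj, "corrected": True},
    train=lambda p, exp: {"skill": p["skill"] + 0.3},
)
print(evolved)   # policy "skill" grows across evolution rounds
```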

Result: Achieves 56.7% success rate on OSWorld benchmark, surpassing previous open-source best (OpenCUA-72B at 45.0%) and closed-weights models like UI-TARS-2 (53.1%). The approach shows consistent gains across foundation models of varying scales.

Conclusion: The evolutionary paradigm driven by learning from experience provides a robust and scalable path for advancing native agent capabilities, demonstrating superior performance and generalizability across different model scales.

Abstract: The development of native computer-use agents (CUA) represents a significant leap in multimodal AI. However, their potential is currently bottlenecked by the constraints of static data scaling. Existing paradigms relying primarily on passive imitation of static datasets struggle to capture the intricate causal dynamics inherent in long-horizon computer tasks. In this work, we introduce EvoCUA, a native computer use agentic model. Unlike static imitation, EvoCUA integrates data generation and policy optimization into a self-sustaining evolutionary cycle. To mitigate data scarcity, we develop a verifiable synthesis engine that autonomously generates diverse tasks coupled with executable validators. To enable large-scale experience acquisition, we design a scalable infrastructure orchestrating tens of thousands of asynchronous sandbox rollouts. Building on these massive trajectories, we propose an iterative evolving learning strategy to efficiently internalize this experience. This mechanism dynamically regulates policy updates by identifying capability boundaries – reinforcing successful routines while transforming failure trajectories into rich supervision through error analysis and self-correction. Empirical evaluations on the OSWorld benchmark demonstrate that EvoCUA achieves a success rate of 56.7%, establishing a new open-source state-of-the-art. Notably, EvoCUA significantly outperforms the previous best open-source model, OpenCUA-72B (45.0%), and surpasses leading closed-weights models such as UI-TARS-2 (53.1%). Crucially, our results underscore the generalizability of this approach: the evolving paradigm driven by learning from experience yields consistent performance gains across foundation models of varying scales, establishing a robust and scalable path for advancing native agent capabilities.

cs.SD

[223] SoundBreak: A Systematic Study of Audio-Only Adversarial Attacks on Trimodal Models

Aafiya Hussain, Gaurav Srivastava, Alvi Ishmam, Zaber Hakim, Chris Thomas

Main category: cs.SD

TL;DR: Audio-only adversarial attacks can severely disrupt trimodal audio-video-language models with up to 96% success rate, exposing a single-modality vulnerability in multimodal systems.

DetailsMotivation: Multimodal foundation models show strong performance but their robustness to adversarial attacks, particularly realistic untargeted audio-only attacks, remains poorly understood and underexplored.

Method: Analyzed six complementary attack objectives targeting different multimodal processing stages (audio encoder representations, cross-modal attention, hidden states, output likelihoods). Tested across three state-of-the-art models and multiple benchmarks with audio-only perturbations.
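
Of the six objectives, the encoder-representation attack is the simplest to sketch: untargeted L-inf PGD that pushes the audio embedding away from its clean value. The PyTorch sketch below uses a toy stand-in encoder and assumed budgets, not any of the attacked trimodal models.

```python
import torch
import torch.nn as nn

encoder = nn.Sequential(nn.Conv1d(1, 8, 9, stride=4), nn.ReLU(),
                        nn.AdaptiveAvgPool1d(1), nn.Flatten())  # toy audio encoder

def pgd_audio(x, eps=0.01, alpha=0.002, steps=40):
    """Untargeted L-inf PGD maximizing drift of the encoder representation."""
    with torch.no_grad():
        clean_emb = encoder(x)
    delta = torch.zeros_like(x, requires_grad=True)
    for _ in range(steps):
        loss = -nn.functional.mse_loss(encoder(x + delta), clean_emb)
        loss.backward()
        with torch.no_grad():
            delta -= alpha * delta.grad.sign()   # descent on -drift = ascent on drift
            delta.clamp_(-eps, eps)              # stay inside the L-inf ball
        delta.grad.zero_()
    return (x + delta).detach()

x = torch.randn(1, 1, 16000)           # one second of 16 kHz audio (illustrative)
x_adv = pgd_audio(x)
print(float((x_adv - x).abs().max()))  # perturbation respects the budget
```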

Result: Audio-only perturbations achieve up to 96% attack success rate, work at low perceptual distortions (LPIPS ≤ 0.08, SI-SNR ≥ 0), benefit more from extended optimization than increased data scale, and show limited transferability across models.

Conclusion: Multimodal systems have a previously overlooked single-modality attack surface via audio-only attacks, motivating the need for defenses that enforce cross-modal consistency.

Abstract: Multimodal foundation models that integrate audio, vision, and language achieve strong performance on reasoning and generation tasks, yet their robustness to adversarial manipulation remains poorly understood. We study a realistic and underexplored threat model: untargeted, audio-only adversarial attacks on trimodal audio-video-language models. We analyze six complementary attack objectives that target different stages of multimodal processing, including audio encoder representations, cross-modal attention, hidden states, and output likelihoods. Across three state-of-the-art models and multiple benchmarks, we show that audio-only perturbations can induce severe multimodal failures, achieving up to 96% attack success rate. We further show that attacks can be successful at low perceptual distortions (LPIPS <= 0.08, SI-SNR >= 0) and benefit more from extended optimization than increased data scale. Transferability across models and encoders remains limited, while speech recognition systems such as Whisper primarily respond to perturbation magnitude, achieving >97% attack success under severe distortion. These results expose a previously overlooked single-modality attack surface in multimodal systems and motivate defenses that enforce cross-modal consistency.

[224] MusiCRS: Benchmarking Audio-Centric Conversational Recommendation

Rohan Surana, Amit Namburi, Gagan Mundada, Abhay Lal, Zachary Novack, Julian McAuley, Junda Wu

Main category: cs.SD

TL;DR: MusiCRS is the first benchmark for audio-centric conversational music recommendation, linking real Reddit conversations with audio tracks to evaluate cross-modal reasoning between dialogue and music content.

DetailsMotivation: Music recommendation is uniquely challenging because it requires reasoning over audio content that text/metadata alone cannot capture. Current LLM-based conversational recommendation systems lack proper audio grounding and evaluation frameworks for music.

Method: Created MusiCRS benchmark with 477 high-quality Reddit conversations spanning diverse genres, 3,589 unique musical entities, and audio grounding via YouTube links. Supports three input modality configurations (audio-only, query-only, audio+query) for systematic evaluation of audio-LLMs, retrieval models, and traditional approaches.

Result: Current systems struggle with cross-modal integration - optimal performance often occurs in single-modality settings rather than multimodal configurations. Models excel at dialogue semantics but fail to ground abstract musical concepts in audio, revealing fundamental limitations in cross-modal knowledge integration.

Conclusion: MusiCRS addresses a critical gap in audio-centric conversational recommendation, revealing current limitations in cross-modal reasoning. The released dataset, code, and baselines will facilitate progress in this challenging domain.

Abstract: Conversational recommendation has advanced rapidly with large language models (LLMs), yet music remains a uniquely challenging domain in which effective recommendations require reasoning over audio content beyond what text or metadata can capture. We present MusiCRS, the first benchmark for audio-centric conversational recommendation that links authentic user conversations from Reddit with corresponding tracks. MusiCRS includes 477 high-quality conversations spanning diverse genres (classical, hip-hop, electronic, metal, pop, indie, jazz), with 3,589 unique musical entities and audio grounding via YouTube links. MusiCRS supports evaluation under three input modality configurations: audio-only, query-only, and audio+query, allowing systematic comparison of audio-LLMs, retrieval models, and traditional approaches. Our experiments reveal that current systems struggle with cross-modal integration, with optimal performance frequently occurring in single-modality settings rather than multimodal configurations. This highlights fundamental limitations in cross-modal knowledge integration, as models excel at dialogue semantics but struggle when grounding abstract musical concepts in audio. To facilitate progress, we release the MusiCRS dataset (https://huggingface.co/datasets/rohan2810/MusiCRS), evaluation code (https://github.com/rohan2810/musiCRS), and comprehensive baselines.

[225] Contrastive Knowledge Distillation for Embedding Refinement in Personalized Speech Enhancement

Thomas Serre, Mathieu Fontaine, Éric Benhaim, Slim Essid

Main category: cs.SD

TL;DR: Proposes on-the-fly speaker embedding refinement using a tiny speaker encoder to improve personalized speech enhancement performance while maintaining low computational load.

DetailsMotivation: Current PSE systems use heavy upstream models to extract speaker embeddings from enrollment clips, but these pre-computed embeddings cannot adapt to voice variations during inference, limiting performance.

Method: Uses contrastive knowledge distillation to train a tiny 150k-parameter speaker encoder from complex embeddings, then integrates this encoder within the enhancement system for on-the-fly embedding refinement during inference.
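
An InfoNCE-style formulation of such a distillation loss is sketched below: each student embedding is pulled toward the frozen teacher embedding of the same clip and pushed from the other clips in the batch. Dimensions, temperature, and the loss form itself are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def contrastive_kd_loss(student_emb, teacher_emb, tau=0.07):
    """InfoNCE over the batch: positives are matched (student, teacher) rows."""
    s = F.normalize(student_emb, dim=-1)
    t = F.normalize(teacher_emb, dim=-1)
    logits = s @ t.T / tau                       # (B, B) similarity matrix
    targets = torch.arange(s.size(0))            # diagonal = positive pairs
    return F.cross_entropy(logits, targets)

B, d = 16, 128
student_emb = torch.randn(B, d, requires_grad=True)   # tiny 150k-param encoder output
teacher_emb = torch.randn(B, d)                       # precomputed, frozen teacher
loss = contrastive_kd_loss(student_emb, teacher_emb)
loss.backward()
print(float(loss))
```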

Result: The proposed method greatly improves PSE performances while maintaining low computational load.

Conclusion: On-the-fly refinement of speaker embeddings using a lightweight encoder is an effective approach to enhance PSE performance while keeping computational requirements low.

Abstract: Personalized speech enhancement (PSE) has shown convincing results when it comes to extracting a known target voice among interfering ones. The corresponding systems usually incorporate a representation of the target voice within the enhancement system, which is extracted from an enrollment clip of the target voice with upstream models. Those models are generally heavy as the speaker embedding’s quality directly affects PSE performances. Yet, embeddings generated beforehand cannot account for the variations of the target voice during inference time. In this paper, we propose to perform on-the-fly refinement of the speaker embedding using a tiny speaker encoder. We first introduce a novel contrastive knowledge distillation methodology in order to train a 150k-parameter encoder from complex embeddings. We then use this encoder within the enhancement system during inference and show that the proposed method greatly improves PSE performances while maintaining a low computational load.

[226] The CMU-AIST submission for the ICME 2025 Audio Encoder Challenge

Shikhar Bharadwaj, Samuele Cornell, Kwanghee Choi, Hye-jin Shim, Soham Deshmukh, Satoru Fukayama, Shinji Watanabe

Main category: cs.SD

TL;DR: The paper describes a submission to ICME 2025 audio encoder challenge using scaled-up BEATs models trained on 74K hours of multi-domain data, with ensemble techniques that outperform baseline and Dasheng 1.2B models.

DetailsMotivation: To create a competitive audio encoder for the ICME 2025 challenge by extending BEATs architecture with larger scale and diverse training data, and developing effective ensemble methods to surpass existing baselines.

Method: Extended BEATs masked speech token prediction model to 300M parameters, trained on 74K hours of speech/music/sound data with different domain mixtures. Created ensemble of Dasheng 1.2B with two custom BEATs models using novel ensembling technique.

Result: The ensemble system surpasses both baseline and Dasheng 1.2B models. Trained checkpoints are publicly released on HuggingFace for open science.

Conclusion: Scaled-up BEATs models with diverse pre-training data and effective ensembling techniques achieve state-of-the-art performance in audio encoding, with models made publicly available.

Abstract: This technical report describes our submission to the ICME 2025 audio encoder challenge. Our submitted system is built on BEATs, a masked speech token prediction based audio encoder. We extend the BEATs model using 74,000 hours of data derived from various speech, music, and sound corpora and scale its architecture up to 300 million parameters. We experiment with speech-heavy and balanced pre-training mixtures to study the impact of different domains on final performance. Our submitted system consists of an ensemble of the Dasheng 1.2 billion model with two custom scaled-up BEATs models trained on the aforementioned pre-training data mixtures. We also propose a simple ensembling technique that retains the best capabilities of constituent models and surpasses both the baseline and Dasheng 1.2B. For open science, we publicly release our trained checkpoints via huggingface at https://huggingface.co/shikhar7ssu/OpenBEATs-ICME-SOUND and https://huggingface.co/shikhar7ssu/OpenBEATs-ICME.

[227] Do Models Hear Like Us? Probing the Representational Alignment of Audio LLMs and Naturalistic EEG

Haoyun Yang, Xin Xiao, Jiang Zhong, Yu Tian, Dong Xiaohua, Yu Mao, Hao Wu, Kaiwen Wei

Main category: cs.SD

TL;DR: Audio LLMs show varying alignment with human EEG signals depending on similarity metrics, with specific spatio-temporal patterns and affective prosody effects.

DetailsMotivation: To investigate whether Audio LLMs' internal representations align with human neural dynamics during naturalistic listening, which remains largely unexplored despite their strong capabilities in speech perception and language understanding.

Method: Systematically examined layer-wise representational alignment between 12 open-source Audio LLMs and EEG signals across 2 datasets using 8 similarity metrics (including Spearman-based RSA) to characterize within-sentence representational geometry.
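
Spearman-based RSA, the central metric here, is compact to sketch: build representational dissimilarity matrices (RDMs) for model and EEG responses to the same items, then rank-correlate their condensed upper triangles. The data below are random stand-ins.

```python
import numpy as np
from scipy.spatial.distance import pdist
from scipy.stats import spearmanr

rng = np.random.default_rng(0)
n_items = 20
model_feats = rng.normal(size=(n_items, 256))   # layer activations per item
eeg_feats = rng.normal(size=(n_items, 64))      # EEG features per item

rdm_model = pdist(model_feats, metric="correlation")  # condensed RDM (upper triangle)
rdm_eeg = pdist(eeg_feats, metric="correlation")

rho, p = spearmanr(rdm_model, rdm_eeg)
print(f"RSA (Spearman rho) = {rho:.3f}, p = {p:.3f}")
```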

Result: Three key findings: (1) Rank-dependence split - model rankings vary substantially across different similarity metrics; (2) Spatio-temporal alignment patterns - depth-dependent alignment peaks and increased RSA within 250-500 ms window (consistent with N400 dynamics); (3) Affective dissociation - negative prosody reduces geometric similarity but enhances covariance-based dependence.

Conclusion: The findings provide new neurobiological insights into the representational mechanisms of Audio LLMs, revealing complex alignment patterns with human neural dynamics that depend on similarity metrics, temporal windows, and affective prosody.

Abstract: Audio Large Language Models (Audio LLMs) have demonstrated strong capabilities in integrating speech perception with language understanding. However, whether their internal representations align with human neural dynamics during naturalistic listening remains largely unexplored. In this work, we systematically examine layer-wise representational alignment between 12 open-source Audio LLMs and Electroencephalogram (EEG) signals across 2 datasets. Specifically, we employ 8 similarity metrics, such as Spearman-based Representational Similarity Analysis (RSA), to characterize within-sentence representational geometry. Our analysis reveals 3 key findings: (1) we observe a rank-dependence split, in which model rankings vary substantially across different similarity metrics; (2) we identify spatio-temporal alignment patterns characterized by depth-dependent alignment peaks and a pronounced increase in RSA within the 250-500 ms time window, consistent with N400-related neural dynamics; (3) we find an affective dissociation whereby negative prosody, identified using a proposed Tri-modal Neighborhood Consistency (TNC) criterion, reduces geometric similarity while enhancing covariance-based dependence. These findings provide new neurobiological insights into the representational mechanisms of Audio LLMs.

[228] CORD: Bridging the Audio-Text Reasoning Gap via Weighted On-policy Cross-modal Distillation

Jing Hu, Danxiang Zhu, Xianlong Luo, Dan Zhang, Shuwei He, Yishu Lei, Haitao Zheng, Shikun Feng, Jingzhou He, Yu Sun, Hua Wu, Haifeng Wang

Main category: cs.SD

TL;DR: CORD is a unified alignment framework that uses online cross-modal self-distillation to bridge the acoustic-semantic gap in Large Audio Language Models, improving audio-conditioned reasoning while maintaining text capabilities.

DetailsMotivation: Current Large Audio Language Models (LALMs) built upon text-based LLMs suffer from degradation in knowledge and reasoning capabilities due to ineffective bridging of the acoustic-semantic gap in feature representation space.

Method: CORD performs online cross-modal self-distillation by aligning audio-conditioned reasoning with text-conditioned reasoning within a unified model. It uses text modality as an internal teacher with multi-granularity alignment: token-level alignment via on-policy reverse KL divergence with importance-aware weighting, and sequence-level alignment via judge-based global reward optimization using Group Relative Policy Optimization (GRPO).
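
The token-level term can be sketched directly: reverse KL from the audio-conditioned (student) distribution to the text-conditioned (teacher) distribution, weighted per token. The weighting below (earlier tokens weigh more) is an illustrative stand-in for the paper's importance-aware weights.

```python
import torch
import torch.nn.functional as F

def weighted_reverse_kl(student_logits, teacher_logits, weights):
    """sum_t w_t * KL(p_student_t || p_teacher_t) over a token sequence."""
    log_p_s = F.log_softmax(student_logits, dim=-1)
    log_p_t = F.log_softmax(teacher_logits, dim=-1)
    kl_per_token = (log_p_s.exp() * (log_p_s - log_p_t)).sum(dim=-1)  # (T,)
    return (weights * kl_per_token).sum()

T, V = 12, 1000                          # sequence length, vocab size
student_logits = torch.randn(T, V, requires_grad=True)  # audio-conditioned rollout
teacher_logits = torch.randn(T, V)       # text-conditioned internal teacher (frozen)
weights = torch.linspace(1.0, 0.2, T)    # prioritize early tokens (assumed schedule)
loss = weighted_reverse_kl(student_logits, teacher_logits, weights)
loss.backward()
print(float(loss))
```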

Result: CORD consistently enhances audio-conditioned reasoning across multiple benchmarks and substantially bridges the audio-text performance gap using only 80k synthetic training samples, demonstrating both efficacy and data efficiency.

Conclusion: The on-policy, multi-level cross-modal alignment approach effectively addresses the acoustic-semantic gap in LALMs, enabling improved audio reasoning while maintaining text capabilities with high data efficiency.

Abstract: Large Audio Language Models (LALMs) have garnered significant research interest. Despite being built upon text-based large language models (LLMs), LALMs frequently exhibit a degradation in knowledge and reasoning capabilities. We hypothesize that this limitation stems from the failure of current training paradigms to effectively bridge the acoustic-semantic gap within the feature representation space. To address this challenge, we propose CORD, a unified alignment framework that performs online cross-modal self-distillation. Specifically, it aligns audio-conditioned reasoning with its text-conditioned counterpart within a unified model. Leveraging the text modality as an internal teacher, CORD performs multi-granularity alignment throughout the audio rollout process. At the token level, it employs on-policy reverse KL divergence with importance-aware weighting to prioritize early and semantically critical tokens. At the sequence level, CORD introduces a judge-based global reward to optimize complete reasoning trajectories via Group Relative Policy Optimization (GRPO). Empirical results across multiple benchmarks demonstrate that CORD consistently enhances audio-conditioned reasoning and substantially bridges the audio-text performance gap with only 80k synthetic training samples, validating the efficacy and data efficiency of our on-policy, multi-level cross-modal alignment approach.

[229] Omni-directional attention mechanism based on Mamba for speech separation

Ke Xue, Chang Sun, Rongfei Fan, Jing Wang, Han Hu

Main category: cs.SD

TL;DR: Proposes an efficient omni-directional attention mechanism built on Mamba for speech separation that captures global dependencies across 2D spectrograms from ten different directions while maintaining linear complexity.

DetailsMotivation: Existing Mamba-based speech separation approaches decompose input along a single dimension into 1D sequences, restricting modeling to local 1D patterns and limiting ability to capture global dependencies across the 2D spectrogram.

Method: Proposes an efficient omni-directional attention (OA) mechanism built upon unidirectional Mamba that models global dependencies from ten different directions on the spectrogram. Integrates this mechanism into two baseline separation models.

Result: Experimental results on three public datasets show consistent significant performance gains over baselines while preserving linear complexity, outperforming existing state-of-the-art systems.

Conclusion: The proposed omni-directional attention mechanism effectively addresses the limitation of existing Mamba-based approaches by enabling global 2D modeling while maintaining computational efficiency, leading to superior speech separation performance.

Abstract: Mamba, a selective state-space model (SSM), has emerged as an efficient alternative to Transformers for speech modeling, enabling long-sequence processing with linear complexity. While effective in speech separation, existing approaches, whether in the time or time-frequency domain, typically decompose the input along a single dimension into short one-dimensional sequences before processing them with Mamba, which restricts Mamba to local 1D modeling and limits its ability to capture global dependencies across the 2D spectrogram. In this work, we propose an efficient omni-directional attention (OA) mechanism built upon unidirectional Mamba, which models global dependencies from ten different directions on the spectrogram. We integrate the proposed mechanism into two baseline separation models and evaluate them on three public datasets. Experimental results show that our approach consistently achieves significant performance gains over the baselines while preserving linear complexity, outperforming existing state-of-the-art (SOTA) systems.
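
To see what multi-directional scanning means in practice, the sketch below flattens a spectrogram into 1D sequences along several directions that a unidirectional Mamba block could then consume; the paper uses ten directions, and the four shown here (two time orders, one frequency order, one diagonal) are illustrative only.

```python
# Illustrative sketch: turning a 2D spectrogram into 1D sequences along
# different scan directions. The paper uses ten directions; four examples
# are shown here.
import numpy as np

def directional_scans(spec: np.ndarray):
    """spec: (freq, time). Yields (name, 1D sequence) pairs."""
    yield "time-forward", spec.reshape(-1)            # row-major scan
    yield "time-backward", spec[:, ::-1].reshape(-1)  # reversed time
    yield "freq-forward", spec.T.reshape(-1)          # column-major scan
    # One diagonal scan: concatenate the anti-diagonals.
    f, t = spec.shape
    diag = np.concatenate([np.fliplr(spec).diagonal(k)
                           for k in range(-(f - 1), t)])
    yield "diagonal", diag

for name, seq in directional_scans(np.arange(12.0).reshape(3, 4)):
    print(name, seq.shape)
```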

[230] I Guess That’s Why They Call it the Blues: Causal Analysis for Audio Classifiers

David A. Kelly, Hana Chockler

Main category: cs.SD

TL;DR: FreqReX uses causal reasoning to identify minimal frequency features that are both sufficient and necessary for audio classifier decisions, enabling highly targeted manipulation with tiny, often inaudible changes.

DetailsMotivation: Audio classifiers often rely on non-musically relevant features and spurious correlations, making them easy to manipulate or confuse. While inducing misclassification is not hard, the specific features that classifiers rely on were not well understood.

Method: The paper introduces a new method using causal reasoning to discover features in the frequency space that are sufficient and necessary for given classifications. This is implemented in the tool FreqReX, which analyzes standard benchmark datasets.

Result: Causally sufficient and necessary subsets allow manipulation of model outputs with minimal input changes: changing just one out of 240,000 frequencies results in a classification change 58% of the time, often with changes so small they’re practically inaudible.

Conclusion: Causal analysis is useful for understanding the reasoning process of audio classifiers and can be used to successfully manipulate their outputs, revealing the fragility of current audio classification systems.

Abstract: It is well-known that audio classifiers often rely on non-musically relevant features and spurious correlations to classify audio. Hence audio classifiers are easy to manipulate or confuse, resulting in wrong classifications. While inducing a misclassification is not hard, until now the set of features that the classifiers rely on was not well understood. In this paper we introduce a new method that uses causal reasoning to discover features of the frequency space that are sufficient and necessary for a given classification. We describe an implementation of this algorithm in the tool FreqReX and provide experimental results on a number of standard benchmark datasets. Our experiments show that causally sufficient and necessary subsets allow us to manipulate the outputs of the models in a variety of ways by changing the input very slightly. Namely, a change to one out of 240,000 frequencies results in a change in classification 58% of the time, and the change can be so small that it is practically inaudible. These results show that causal analysis is useful for understanding the reasoning process of audio classifiers and can be used to successfully manipulate their outputs.
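
The single-frequency result suggests a simple experiment, sketched below: zero one bin of the signal’s spectrum, resynthesize, and check whether the label flips. `classify` is a hypothetical stand-in for any audio classifier, not FreqReX itself.

```python
# Sketch of a single-frequency intervention in the spirit of the reported
# result: alter one bin of the signal's spectrum, resynthesize, and test
# whether the classifier's label changes. `classify` is a placeholder.
import numpy as np

def flip_test(signal: np.ndarray, bin_idx: int, classify) -> bool:
    original_label = classify(signal)
    spectrum = np.fft.rfft(signal)
    spectrum[bin_idx] = 0.0  # minimal intervention: silence one frequency
    new_signal = np.fft.irfft(spectrum, n=len(signal))
    return classify(new_signal) != original_label

# Example with a trivial stand-in classifier (energy threshold):
sig = np.random.default_rng(0).normal(size=240_000)
print(flip_test(sig, bin_idx=1000,
                classify=lambda s: float(np.abs(s).mean()) > 0.8))
```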

[231] E2E-AEC: Implementing an end-to-end neural network learning approach for acoustic echo cancellation

Yiheng Jiang, Biao Tian, Haoxu Wang, Shengkui Zhao, Bin Ma, Daren Chen, Xiangang Li

Main category: cs.SD

TL;DR: Proposed streaming neural network for acoustic echo cancellation without traditional linear AEC or time delay estimation, using progressive learning, knowledge transfer, attention optimization, and voice activity detection.

DetailsMotivation: To develop an end-to-end acoustic echo cancellation system that can perform streaming inference without depending on traditional linear AEC techniques and time delay estimation, which may have limitations in complex real-world scenarios.

Method: 1) Progressive learning for gradual echo suppression enhancement; 2) Knowledge transfer by initializing with pre-trained LAEC-based model; 3) Attention mechanism optimization with loss function on attention weights for precise time alignment; 4) Voice activity detection to mask network output when near-end speech is absent.

Result: The approach is validated through experiments on public datasets, demonstrating effectiveness in acoustic echo cancellation.

Conclusion: The proposed E2E-AEC method provides an effective streaming solution for acoustic echo cancellation that operates independently of traditional linear AEC techniques and time delay estimation, leveraging neural network capabilities with strategic training approaches.

Abstract: We propose a novel neural network-based end-to-end acoustic echo cancellation (E2E-AEC) method capable of streaming inference, which operates effectively without reliance on traditional linear AEC (LAEC) techniques and time delay estimation. Our approach includes several key strategies: First, we introduce and refine progressive learning to gradually enhance echo suppression. Second, our model employs knowledge transfer by initializing with a pre-trained LAEC-based model, harnessing the insights gained from LAEC training. Third, we optimize the attention mechanism with a loss function applied on attention weights to achieve precise time alignment between the reference and microphone signals. Lastly, we incorporate voice activity detection to enhance speech quality and improve echo removal by masking the network output when near-end speech is absent. The effectiveness of our approach is validated through experiments conducted on public datasets.

[232] A Novel Transfer Learning Approach for Mental Stability Classification from Voice Signal

Rafiul Islam, Md. Taimur Ahad

Main category: cs.SD

TL;DR: Novel transfer learning + data augmentation approach for mental stability classification from voice signals using CNNs on spectrograms, achieving 94% accuracy with DenseNet121.

DetailsMotivation: Address challenges of limited data availability for mental stability classification using human voice signals, aiming to create a non-invasive tool for mental health diagnostics.

Method: Used CNN architectures (VGG16, InceptionV3, DenseNet121) on spectrogram images from voice recordings. Three experimental phases: training on non-augmented data, augmented data, and transfer learning. Proposed transfer learning approach involves pre-training on augmented data then fine-tuning on non-augmented data with strict data separation to prevent leakage.

Result: Significant improvements over baseline. DenseNet121 achieved highest accuracy of 94% and AUC score of 99% using the proposed transfer learning approach.

Conclusion: Combining data augmentation and transfer learning effectively enhances CNN-based classification of mental stability using voice spectrograms, offering a promising non-invasive tool for mental health diagnostics.

Abstract: This study presents a novel transfer learning approach and data augmentation technique for mental stability classification using human voice signals and addresses the challenges associated with limited data availability. Convolutional neural networks (CNNs) have been employed to analyse spectrogram images generated from voice recordings. Three CNN architectures, VGG16, InceptionV3, and DenseNet121, were evaluated across three experimental phases: training on non-augmented data, augmented data, and transfer learning. The proposed transfer learning approach involves pre-training models on the augmented dataset and fine-tuning them on the non-augmented dataset while ensuring strict data separation to prevent data leakage. The results demonstrate significant improvements in classification performance compared to the baseline approach. Among the three CNN architectures, DenseNet121 achieved the highest accuracy of 94% and an AUC score of 99% using the proposed transfer learning approach. This finding highlights the effectiveness of combining data augmentation and transfer learning to enhance CNN-based classification of mental stability using voice spectrograms, offering a promising non-invasive tool for mental health diagnostics.

[233] WildScore: Benchmarking MLLMs in-the-Wild Symbolic Music Reasoning

Gagan Mundada, Yash Vishe, Amit Namburi, Xin Xu, Zachary Novack, Julian McAuley, Junda Wu

Main category: cs.SD

TL;DR: WildScore is the first multimodal benchmark for evaluating MLLMs’ symbolic music reasoning using real-world music scores and authentic musicological questions.

DetailsMotivation: While MLLMs show impressive capabilities in vision-language tasks, their reasoning abilities in the multimodal symbolic music domain remain unexplored, creating a gap in understanding how these models interpret real-world music scores and complex musicological concepts.

Method: Created WildScore benchmark with instances sourced from genuine musical compositions and authentic user-generated questions/discussions. Proposed systematic taxonomy with high-level and fine-grained musicological ontologies. Framed complex music reasoning as multiple-choice question answering for controlled assessment.

Result: Empirical benchmarking of state-of-the-art MLLMs on WildScore revealed intriguing patterns in their visual-symbolic reasoning, uncovering both promising directions and persistent challenges for MLLMs in symbolic music reasoning and analysis.

Conclusion: WildScore provides the first comprehensive benchmark for evaluating MLLMs’ symbolic music understanding, revealing current limitations and opportunities for improvement in multimodal music reasoning. The dataset and code are publicly released.

Abstract: Recent advances in Multimodal Large Language Models (MLLMs) have demonstrated impressive capabilities across various vision-language tasks. However, their reasoning abilities in the multimodal symbolic music domain remain largely unexplored. We introduce WildScore, the first in-the-wild multimodal symbolic music reasoning and analysis benchmark, designed to evaluate MLLMs’ capacity to interpret real-world music scores and answer complex musicological queries. Each instance in WildScore is sourced from genuine musical compositions and accompanied by authentic user-generated questions and discussions, capturing the intricacies of practical music analysis. To facilitate systematic evaluation, we propose a systematic taxonomy, comprising both high-level and fine-grained musicological ontologies. Furthermore, we frame complex music reasoning as multiple-choice question answering, enabling controlled and scalable assessment of MLLMs’ symbolic music understanding. Empirical benchmarking of state-of-the-art MLLMs on WildScore reveals intriguing patterns in their visual-symbolic reasoning, uncovering both promising directions and persistent challenges for MLLMs in symbolic music reasoning and analysis. We release the dataset and code.

[234] Cross-Lingual F5-TTS: Towards Language-Agnostic Voice Cloning and Speech Synthesis

Qingyu Liu, Yushen Chen, Zhikang Niu, Chunhui Wang, Yunting Yang, Bowen Zhang, Jian Zhao, Pengcheng Zhu, Kai Yu, Xie Chen

Main category: cs.SD

TL;DR: Cross-Lingual F5-TTS enables cross-lingual voice cloning without needing audio prompt transcripts, addressing key challenges in flow-matching TTS models.

DetailsMotivation: Current flow-matching TTS models require reference transcripts for audio prompts, which prevents cross-lingual voice cloning when transcripts are unavailable, especially for unseen languages.

Method: Uses forced alignment to preprocess audio prompts and obtain word boundaries, enabling transcript-free training. Trains speaking rate predictors at different linguistic granularities to derive duration from speaker pace.

Result: The approach matches the performance of F5-TTS while enabling cross-lingual voice cloning without audio prompt transcripts.

Conclusion: Cross-Lingual F5-TTS successfully removes the dependency on audio prompt transcripts, enabling effective cross-lingual voice cloning for flow-matching TTS models.

Abstract: Flow-matching-based text-to-speech (TTS) models have shown high-quality speech synthesis. However, most current flow-matching-based TTS models still rely on reference transcripts corresponding to the audio prompt for synthesis. This dependency prevents cross-lingual voice cloning when audio prompt transcripts are unavailable, particularly for unseen languages. The key challenges for flow-matching-based TTS models to remove audio prompt transcripts are identifying word boundaries during training and determining appropriate duration during inference. In this paper, we introduce Cross-Lingual F5-TTS, a framework that enables cross-lingual voice cloning without audio prompt transcripts. Our method preprocesses audio prompts by forced alignment to obtain word boundaries, enabling direct synthesis from audio prompts while excluding transcripts during training. To address the duration modeling challenge, we train speaking rate predictors at different linguistic granularities to derive duration from speaker pace. Experiments show that our approach matches the performance of F5-TTS while enabling cross-lingual voice cloning.

[235] SONAR: Self-Distilled Continual Pre-training for Domain Adaptive Audio Representation

Yizhou Zhang, Yuan Gao, Wangjin Zhou, Zicheng Yuan, Keisuke Imoto, Tatsuya Kawahara

Main category: cs.SD

TL;DR: SONAR is a continual pre-training framework for audio representation learning that adapts to new domains without catastrophic forgetting, using self-distillation and dynamic tokenizer expansion.

DetailsMotivation: Current SSL models on static datasets like AudioSet can't efficiently incorporate new unlabeled audio data. Retraining from scratch is computationally expensive and discards valuable learned knowledge from previous training.

Method: Built on BEATs, SONAR uses: 1) joint sampling strategy for new and prior data, 2) regularization to balance specificity and generality, 3) dynamic expansion of tokenizer codebook for novel acoustic patterns, and 4) self-distillation for continual pre-training.

Result: Experiments across four distinct domains show SONAR achieves both high adaptability to new domains and robust resistance to catastrophic forgetting.

Conclusion: SONAR provides an efficient continual pre-training framework for audio representation learning that successfully adapts to new domains while preserving previously learned knowledge, addressing key challenges in continual learning for audio SSL.

Abstract: Self-supervised learning (SSL) on large-scale datasets like AudioSet has become the dominant paradigm for audio representation learning. While the continuous influx of new, unlabeled audio presents an opportunity to enrich these static representations, a naive approach is to retrain the model from scratch using all available data. However, this method is computationally prohibitive and discards the valuable knowledge embedded in the previously trained model weights. To address this inefficiency, we propose SONAR (Self-distilled cONtinual pre-training for domain adaptive Audio Representation), a continual pre-training framework built upon BEATs. SONAR effectively adapts to new domains while mitigating catastrophic forgetting by tackling three key challenges: implementing a joint sampling strategy for new and prior data, applying regularization to balance specificity and generality, and dynamically expanding the tokenizer codebook for novel acoustic patterns. Experiments across four distinct domains demonstrate that our method achieves both high adaptability and robust resistance to forgetting.
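
The joint sampling idea reduces to a small batching routine, sketched below; the 50/50 mix of new-domain and replayed prior-domain clips is an illustrative assumption rather than SONAR’s actual ratio.

```python
# Sketch of the joint sampling strategy: each continual pre-training batch
# mixes new-domain clips with replayed prior-domain clips. The mixing
# fraction is an illustrative assumption.
import random

def joint_batch(new_data, prior_data, batch_size=8, new_frac=0.5):
    n_new = int(batch_size * new_frac)
    batch = random.sample(new_data, n_new)
    batch += random.sample(prior_data, batch_size - n_new)  # replay
    random.shuffle(batch)
    return batch

print(joint_batch(list(range(100)), list(range(1000, 1100))))
```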

[236] Etude: Piano Cover Generation with a Three-Stage Approach – Extract, strucTUralize, and DEcode

Tse-Yang Chen, Yuh-Jzer Joung

Main category: cs.SD

TL;DR: Etude is a three-stage architecture for piano cover generation that uses rhythmic information extraction and simplified tokenization to produce structurally consistent, high-quality piano arrangements comparable to human composers.

DetailsMotivation: Existing deep learning models for piano cover generation often fail to maintain structural consistency with original songs due to lack of beat-aware mechanisms and difficulty modeling complex rhythmic patterns, which are crucial for structural similarity and overall quality.

Method: Three-stage architecture (Extract, strucTUralize, DEcode) with pre-extraction of rhythmic information and novel simplified REMI-based tokenization, supporting controllable generation through style injection.

Result: Subjective evaluations with human listeners show Etude substantially outperforms prior models and achieves quality level comparable to human composers.

Conclusion: Etude successfully addresses structural consistency issues in piano cover generation through rhythmic information extraction and simplified tokenization, producing high-quality, controllable piano arrangements.

Abstract: Piano cover generation aims to automatically transform a pop song into a piano arrangement. While numerous deep learning approaches have been proposed, existing models often fail to maintain structural consistency with the original song, likely due to the absence of beat-aware mechanisms or the difficulty of modeling complex rhythmic patterns. Rhythmic information is crucial, as it defines structural similarity (e.g., tempo, BPM) and directly impacts the overall quality of the generated music. In this paper, we introduce Etude, a three-stage architecture consisting of Extract, strucTUralize, and DEcode stages. By pre-extracting rhythmic information and applying a novel, simplified REMI-based tokenization, our model produces covers that preserve proper song structure, enhance fluency and musical dynamics, and support highly controllable generation through style injection. Subjective evaluations with human listeners show that Etude substantially outperforms prior models, achieving a quality level comparable to that of human composers.

cs.LG

[237] Ordering-based Causal Discovery via Generalized Score Matching

Vy Vo, He Zhao, Trung Le, Edwin V. Bonilla, Dinh Phung

Main category: cs.LG

TL;DR: Extends score matching framework for causal discovery to discrete data, introducing novel leaf discriminant criterion based on discrete score function for DAG structure learning.

DetailsMotivation: Learning DAG structures from purely observational data remains challenging across scientific domains, especially for discrete data where existing score matching frameworks are primarily designed for continuous data.

Method: Extends score matching framework for causal discovery to discrete data, introduces novel leaf discriminant criterion based on discrete score function, uses leaf node detection to identify topological order, then performs edge pruning for graph recovery.

Result: Demonstrates accurate inference of true causal orders from observed discrete data through simulated and real-world experiments, showing identified ordering significantly boosts accuracy of existing causal discovery baselines in nearly all settings.

Conclusion: The proposed discrete score matching framework enables effective causal discovery from discrete observational data, improving upon existing methods by providing accurate topological ordering that enhances downstream graph recovery.

Abstract: Learning DAG structures from purely observational data remains a long-standing challenge across scientific domains. An emerging line of research leverages the score of the data distribution to initially identify a topological order of the underlying DAG via leaf node detection and subsequently performs edge pruning for graph recovery. This paper extends the score matching framework for causal discovery, which was originally designed for continuous data, and introduces a novel leaf discriminant criterion based on the discrete score function. Through simulated and real-world experiments, we demonstrate that our theory enables accurate inference of true causal orders from observed discrete data and that the identified ordering can significantly boost the accuracy of existing causal discovery baselines in nearly all of the settings.

[238] Student Mental Health Screening via Fitbit Data Collected During the COVID-19 Pandemic

Rebecca Lopez, Avantika Shrestha, ML Tlachac, Kevin Hickey, Xingtong Guo, Shichao Liu, Elke Rundensteiner

Main category: cs.LG

TL;DR: Using Fitbit data from college students, machine learning models can screen for depression, anxiety, and stress with F1 scores up to 0.79, showing wearable potential for mental health monitoring.

DetailsMotivation: College students face high stress, anxiety, and depression, especially during the pandemic. Current research on wearable-based mental health detection is limited in psychological instruments, physiological modalities, and time series parameters. There's a need for comprehensive assessment of wearable data for mental illness screening.

Method: Collected Student Mental and Environmental Health (StudentMEH) Fitbit dataset from college students during the pandemic. Used predictive machine learning models to screen for depression, anxiety, and stress using different Fitbit modalities (heart rate, sleep data). Evaluated different data aggregation levels and modalities.

Result: Models achieved F1 scores as high as 0.79 for anxiety screening, 0.77 for stress screening (using heart rate), and 0.78 for depression screening (using sleep data). Shows strong potential for physiological modalities like heart rate and sleep data in mental illness detection.

Conclusion: Wearable devices like Fitbit have significant potential for continuous mental health monitoring. Identifying optimal data aggregation levels and appropriate modalities is crucial for screening different mental ailments. This research provides evidence for practical wearable-based mental health screening solutions.

Abstract: College students experience many stressors, resulting in high levels of anxiety and depression. Wearable technology provides unobtrusive sensor data that can be used for the early detection of mental illness. However, current research is limited concerning the variety of psychological instruments administered, physiological modalities, and time series parameters. In this research, we collect the Student Mental and Environmental Health (StudentMEH) Fitbit dataset from students at our institution during the pandemic. We provide a comprehensive assessment of the ability of predictive machine learning models to screen for depression, anxiety, and stress using different Fitbit modalities. Our findings indicate potential in physiological modalities such as heart rate and sleep to screen for mental illness, with F1 scores as high as 0.79 for anxiety screening, 0.77 for stress screening (heart rate), and 0.78 for depression screening (sleep). This research highlights the potential of wearable devices to support continuous mental health monitoring and the importance of identifying the best data aggregation levels and appropriate modalities for screening for different mental ailments.

[239] Efficient Gaussian process learning via subspace projections

Felipe Tobar, Elsa Cazelles

Main category: cs.LG

TL;DR: Proposed projected likelihood (PL) training for GPs using lower-dimensional linear projections, showing better accuracy and efficiency than exact GP and variational sparse GPs.

DetailsMotivation: To improve Gaussian Process training efficiency while maintaining accuracy for moderately large datasets, addressing computational challenges of exact GP and limitations of variational sparse GPs.

Method: Developed projected likelihood (PL) objective using lower-dimensional linear projections of data, with closed-form expression for information loss and random projections on unit sphere to reduce this loss.

Result: PL demonstrates superiority over exact GP training and variational free energy approach to sparse GPs in terms of both accuracy and computational efficiency across different optimizers, kernels, and moderately large datasets.

Conclusion: Projected likelihood provides an effective alternative for GP training that balances computational efficiency with accuracy, particularly suitable for moderately large datasets.

Abstract: We propose a novel training objective for GPs constructed using lower-dimensional linear projections of the data, referred to as projected likelihood (PL). We provide a closed-form expression for the information loss related to the PL and empirically show that it can be reduced with random projections on the unit sphere. We show the superiority of the PL, in terms of accuracy and computational efficiency, over the exact GP training and the variational free energy approach to sparse GPs across different optimisers, kernels and datasets of moderately large sizes.
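
A minimal sketch of the projected-likelihood idea follows: project the observations to m << n dimensions with rows drawn on the unit sphere, then evaluate the Gaussian log-likelihood of the projected data under the projected kernel. The RBF kernel and the exact objective details are assumptions, not the authors’ code.

```python
# Sketch of a projected-likelihood objective: if y ~ N(0, K + s^2 I), then
# Py ~ N(0, P(K + s^2 I)P^T) for any projection P, so the likelihood can be
# evaluated in the m-dimensional projected space.
import numpy as np

def rbf_kernel(x, lengthscale=1.0):
    d2 = (x[:, None] - x[None, :]) ** 2
    return np.exp(-0.5 * d2 / lengthscale**2)

def projected_log_likelihood(x, y, m, noise=0.1, seed=0):
    n = len(y)
    rng = np.random.default_rng(seed)
    P = rng.normal(size=(m, n))
    P /= np.linalg.norm(P, axis=1, keepdims=True)  # rows on the unit sphere
    K = rbf_kernel(x) + noise**2 * np.eye(n)
    z, S = P @ y, P @ K @ P.T                      # projected data/covariance
    _, logdet = np.linalg.slogdet(S)
    return -0.5 * (z @ np.linalg.solve(S, z) + logdet + m * np.log(2 * np.pi))

x = np.linspace(0, 1, 200)
y = np.sin(6 * x) + 0.1 * np.random.default_rng(1).normal(size=200)
print(projected_log_likelihood(x, y, m=20))
```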

[240] Analyzing Neural Network Information Flow Using Differential Geometry

Shuhang Tan, Jayson Sia, Paul Bogdan, Radoslav Ivanov

Main category: cs.LG

TL;DR: This paper introduces a novel approach to neural network data flow analysis using graph curvature theory, specifically Ollivier-Ricci curvature, to identify important connections in neural networks for pruning and model analysis.

DetailsMotivation: The paper aims to provide a fresh perspective on neural network data flow analysis through graph theory, moving away from traditional information-theoretic approaches. Understanding neural network data flow is crucial for symbolic analysis tasks like robustness evaluation and model repair.

Method: The method constructs a graph from the neural network structure and introduces neural curvature based on Ollivier-Ricci curvature. It calculates curvatures using activation patterns from input examples, then uses these curvature values to rank edges by importance. Negative-curvature edges are identified as bottlenecks critical to network connectivity.

Result: The method successfully identifies important neural network connections through pruning experiments. Removing negative-curvature edges quickly degrades performance, while positive-curvature edges have minimal impact. The approach outperforms state-of-the-art pruning methods by identifying more unimportant edges across models trained on MNIST, CIFAR-10, and CIFAR-100 datasets.

Conclusion: Graph curvature theory provides an effective framework for neural network data flow analysis. The neural curvature approach can reliably identify critical connections in neural networks, offering advantages over existing pruning methods and enabling better model analysis and optimization.

Abstract: This paper provides a fresh view of the neural network (NN) data flow problem, i.e., identifying the NN connections that are most important for the performance of the full model, through the lens of graph theory. Understanding the NN data flow provides a tool for symbolic NN analysis, e.g., robustness analysis or model repair. Unlike the standard approach to NN data flow analysis, which is based on information theory, we employ the notion of graph curvature, specifically Ollivier-Ricci curvature (ORC). The ORC has been successfully used to identify important graph edges in various domains such as road traffic analysis, biological and social networks. In particular, edges with negative ORC are considered bottlenecks and as such are critical to the graph’s overall connectivity, whereas positive-ORC edges are not essential. We use this intuition for the case of NNs as well: we 1) construct a graph induced by the NN structure and introduce the notion of neural curvature (NC) based on the ORC; 2) calculate curvatures based on activation patterns for a set of input examples; 3) aim to demonstrate that NC can indeed be used to rank edges according to their importance for the overall NN functionality. We evaluate our method through pruning experiments and show that removing negative-ORC edges quickly degrades the overall NN performance, whereas positive-ORC edges have little impact. The proposed method is evaluated on a variety of models trained on three image datasets, namely MNIST, CIFAR-10 and CIFAR-100. The results indicate that our method can identify a larger number of unimportant edges as compared to state-of-the-art pruning methods.
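
The edge-level computation is compact enough to sketch end to end. Below, the ORC of a single edge is one minus the W1 distance between lazy random-walk measures around its endpoints, solved as a small linear program over hop-count costs; the laziness parameter alpha = 0.5 and the hop-count ground metric are common defaults assumed here, not taken from the paper.

```python
# Sketch of Ollivier-Ricci curvature for one edge:
#   kappa(u, v) = 1 - W1(m_u, m_v) / d(u, v),
# where m_x keeps mass alpha at x and spreads the rest over its neighbors.
import networkx as nx
import numpy as np
from scipy.optimize import linprog

def walk_measure(G, x, alpha=0.5):
    nbrs = list(G.neighbors(x))
    m = {x: alpha}
    for n in nbrs:
        m[n] = (1 - alpha) / len(nbrs)
    return m

def orc_edge(G, u, v, alpha=0.5):
    mu, mv = walk_measure(G, u, alpha), walk_measure(G, v, alpha)
    src, dst = list(mu), list(mv)
    dist = dict(nx.all_pairs_shortest_path_length(G))
    cost = np.array([[dist[a][b] for b in dst] for a in src], float).ravel()
    # Transport plan T (|src| x |dst|): row sums = mu, column sums = mv.
    ns, nd = len(src), len(dst)
    A_eq, b_eq = [], []
    for i in range(ns):                       # row-sum constraints
        row = np.zeros(ns * nd); row[i * nd:(i + 1) * nd] = 1
        A_eq.append(row); b_eq.append(mu[src[i]])
    for j in range(nd):                       # column-sum constraints
        col = np.zeros(ns * nd); col[j::nd] = 1
        A_eq.append(col); b_eq.append(mv[dst[j]])
    res = linprog(cost, A_eq=np.array(A_eq), b_eq=np.array(b_eq),
                  bounds=(0, None), method="highs")
    return 1.0 - res.fun / dist[u][v]

G = nx.barbell_graph(5, 2)                    # two cliques joined by a path
print("bridge edge:", orc_edge(G, 4, 5))      # negative: a bottleneck
print("clique edge:", orc_edge(G, 0, 1))      # positive: well connected
```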

[241] A Regularized Actor-Critic Algorithm for Bi-Level Reinforcement Learning

Sihan Zeng, Sujay Bhatt, Sumitra Ganesh, Alec Koppel

Main category: cs.LG

TL;DR: Single-loop first-order actor-critic algorithm for bi-level optimization with MDP lower-level, using penalty reformulation and attenuating entropy regularization for asymptotically unbiased hyper-gradient estimation.

DetailsMotivation: Existing bi-level optimization and RL methods have limitations: they require second-order information, impose strong regularization at lower level, or use inefficient nested-loop procedures. Need efficient single-loop method for bi-level problems where upper-level parameterizes MDP reward and depends on optimal policy.

Method: Proposes single-loop first-order actor-critic algorithm with penalty-based reformulation. Introduces attenuating entropy regularization into lower-level RL objective to enable asymptotically unbiased upper-level hyper-gradient estimation without solving unregularized RL problem exactly.

Result: Establishes finite-time and finite-sample convergence to stationary point of original unregularized bi-level optimization problem through novel lower-level residual analysis under special Polyak-Lojasiewicz condition.

Conclusion: Method validated on GridWorld goal position problem and happy tweet generation through RLHF, demonstrating practical effectiveness of the proposed approach for bi-level optimization with MDP lower-level problems.

Abstract: We study a structured bi-level optimization problem where the upper-level objective is a smooth function and the lower-level problem is policy optimization in a Markov decision process (MDP). The upper-level decision variable parameterizes the reward of the lower-level MDP, and the upper-level objective depends on the optimal induced policy. Existing methods for bi-level optimization and RL often require second-order information, impose strong regularization at the lower level, or inefficiently use samples through nested-loop procedures. In this work, we propose a single-loop, first-order actor-critic algorithm that optimizes the bi-level objective via a penalty-based reformulation. We introduce into the lower-level RL objective an attenuating entropy regularization, which enables asymptotically unbiased upper-level hyper-gradient estimation without solving the unregularized RL problem exactly. We establish the finite-time and finite-sample convergence of the proposed algorithm to a stationary point of the original, unregularized bi-level optimization problem through a novel lower-level residual analysis under a special type of Polyak-Lojasiewicz condition. We validate the performance of our method through experiments on a GridWorld goal position problem and on happy tweet generation through reinforcement learning from human feedback (RLHF).

[242] Towards a Theoretical Understanding to the Generalization of RLHF

Zhaochun Li, Mingyang Yi, Yue Wang, Shisheng Cui, Yong Liu

Main category: cs.LG

TL;DR: Theoretical analysis shows RLHF for LLMs has generalization bound O(n^{-1/2}) under linear reward models with feature coverage, providing theoretical evidence for empirical generalization.

DetailsMotivation: While RLHF is empirically effective for aligning LLMs with human intent, its theoretical generalization properties in high-dimensional settings remain unexplored, creating a gap between practice and theory.

Method: Builds generalization theory for RLHF of LLMs under linear reward models using algorithmic stability framework, analyzing end-to-end learning rather than just reward model consistency.

Result: Proves that under feature coverage condition, empirical optima of policy model have generalization bound O(n^{-1/2}), which extends to gradient-based algorithms (GA and SGA).

Conclusion: Provides theoretical evidence for empirically observed generalization of LLMs after RLHF, bridging theory-practice gap in alignment methods.

Abstract: Reinforcement Learning from Human Feedback (RLHF) and its variants have emerged as the dominant approaches for aligning Large Language Models with human intent. While empirically effective, the theoretical generalization properties of these methods in high-dimensional settings remain to be explored. To this end, we build the generalization theory on RLHF of LLMs under the linear reward model, through the framework of algorithmic stability. In contrast to the existing works built upon the consistency of maximum likelihood estimations of the reward model, our analysis is presented under an end-to-end learning framework, which is consistent with practice. Concretely, we prove that under a key feature coverage condition, the empirical optima of the policy model have a generalization bound of order $\mathcal{O}(n^{-\frac{1}{2}})$. Moreover, the results can be extrapolated to parameters obtained by gradient-based learning algorithms, i.e., Gradient Ascent (GA) and Stochastic Gradient Ascent (SGA). Thus, we argue that our results provide new theoretical evidence for the empirically observed generalization of LLMs after RLHF.

[243] MACTAS: Self-Attention-Based Inter-Agent Communication in Multi-Agent Reinforcement Learning with Action-Value Function Decomposition

Maciej Wojtala, Bogusz Stefańczyk, Dominik Bogucki, Łukasz Lepak, Jakub Strykowski, Paweł Wawrzyński

Main category: cs.LG

TL;DR: A self-attention-based communication method for multi-agent reinforcement learning that is fully differentiable, scalable, and achieves state-of-the-art performance on SMACv2 benchmark.

DetailsMotivation: Existing communication protocols in MARL are often complex and non-differentiable, limiting agents' ability to learn effective communication strategies through reinforcement learning.

Method: Introduces a self-attention-based communication mechanism that exchanges information between agents in MARL. The approach is fully differentiable, allowing agents to learn message generation in a reward-driven manner, and can be integrated with any action-value function decomposition algorithm.

Result: The method achieves state-of-the-art performance on several maps of the SMACv2 benchmark, demonstrating its effectiveness in multi-agent coordination tasks.

Conclusion: The proposed self-attention-based communication approach provides a scalable, differentiable solution for MARL communication that can be easily integrated with existing decomposition methods and scales well to large multi-agent systems.

Abstract: Communication is essential for the collective execution of complex tasks by human agents, motivating interest in communication mechanisms for multi-agent reinforcement learning (MARL). However, existing communication protocols in MARL are often complex and non-differentiable. In this work, we introduce a self-attention-based communication method that exchanges information between the agents in MARL. Our proposed approach is fully differentiable, allowing agents to learn to generate messages in a reward-driven manner. The method can be seamlessly integrated with any action-value function decomposition algorithm and can be viewed as an orthogonal extension of such decompositions. Notably, it includes a fixed number of trainable parameters, independent of the number of agents, which makes it scalable to large systems. Experimental results on the SMACv2 benchmark demonstrate the effectiveness of our approach, which achieves state-of-the-art performance on a number of maps.
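
The core communication step can be sketched with a standard attention layer: each agent’s hidden state attends over all agents’ states, and because the weights are shared, the parameter count is independent of the number of agents. Dimensions below are illustrative.

```python
# Sketch of inter-agent communication via self-attention over the agent
# dimension. Shared attention weights mean the parameter count does not
# grow with the number of agents.
import torch
import torch.nn as nn

class CommBlock(nn.Module):
    def __init__(self, d_model=64, n_heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, agent_states):          # (batch, n_agents, d_model)
        messages, _ = self.attn(agent_states, agent_states, agent_states)
        return agent_states + messages        # residual message injection

comm = CommBlock()
out = comm(torch.randn(2, 5, 64))             # works for any n_agents
print(out.shape)
```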

[244] Reasoning-Enhanced Rare-Event Prediction with Balanced Outcome Correction

Vitaly Bulgakov, Alexander Turchin

Main category: cs.LG

TL;DR: LPCORP is a two-stage framework for rare-event prediction that combines reasoning-enhanced prediction with confidence-based correction to address extreme class imbalance without resampling.

DetailsMotivation: Rare-event prediction is critical in high-stakes domains like healthcare and finance, but extreme class imbalance biases conventional models toward majority-class predictions, limiting recall, calibration, and operational usefulness.

Method: Two-stage framework: 1) Reasoning model produces enriched predictions from narrative inputs, 2) Lightweight logistic-regression classifier evaluates and selectively corrects these outputs to mitigate prevalence-driven bias. No resampling strategies applied.

Result: Transforms highly imbalanced settings into well-balanced ones while preserving original sample count. Shows substantially improved performance, particularly in precision (known weakness in low-prevalence data). Cost-reduction analysis shows >50% reduction in some cases when comparing damage control expenses to preventive interventions.

Conclusion: LPCORP effectively addresses extreme class imbalance in rare-event prediction through reasoning-enhanced prediction with confidence-based correction, improving model performance and demonstrating significant cost savings in real-world applications.

Abstract: Rare-event prediction is critical in domains such as healthcare, finance, reliability engineering, customer support, and aviation safety, where positive outcomes are infrequent yet potentially catastrophic. Extreme class imbalance biases conventional models toward majority-class predictions, limiting recall, calibration, and operational usefulness. We propose LPCORP (Low-Prevalence CORrector for Prediction)*, a two-stage framework that combines reasoning-enhanced prediction with confidence-based outcome correction. A reasoning model first produces enriched predictions from narrative inputs, after which a lightweight logistic-regression classifier evaluates and selectively corrects these outputs to mitigate prevalence-driven bias. We evaluate LPCORP on real-world datasets from medical and consumer service domains. The results show that this method transforms a highly imbalanced setting into a well-balanced one while preserving the original number of samples and without applying any resampling strategies. Test-set evaluation demonstrates substantially improved performance, particularly in precision, which is a known weakness in low-prevalence data. We further provide a cost-reduction analysis comparing the expenses associated with rare-event damage control without preventive measures to those incurred when low-cost, prediction-based preventive interventions are applied, showing a reduction of more than 50% in some cases. * Patent pending: U.S. Provisional 63/933,518, filed 8 December 2025.
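
Stage two reduces to a small supervised corrector, sketched below on synthetic data: a logistic regression sees the reasoning model’s label and confidence and learns when to overturn it. The feature set and correction rule are illustrative assumptions.

```python
# Sketch of the second stage: a lightweight logistic-regression corrector
# that looks at the first-stage prediction and a confidence score, and
# learns when that prediction should be overturned. Synthetic data only.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 1000
stage1_pred = rng.integers(0, 2, n)           # reasoning model's label
stage1_conf = rng.uniform(0.5, 1.0, n)        # its confidence
# Toy ground truth: high-confidence predictions are right, the rest wrong.
true_label = np.where(stage1_conf > 0.7, stage1_pred, 1 - stage1_pred)

X = np.column_stack([stage1_pred, stage1_conf])
corrector = LogisticRegression().fit(X, true_label)

# At inference: selectively correct only the low-confidence outputs.
final = np.where(stage1_conf > 0.7, stage1_pred, corrector.predict(X))
print("accuracy:", (final == true_label).mean())
```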

[245] Sample-wise Constrained Learning via a Sequential Penalty Approach with Applications in Image Processing

Francesca Lanzillotta, Chiara Albisani, Davide Pucci, Daniele Baracchi, Alessandro Piva, Matteo Lapucci

Main category: cs.LG

TL;DR: A sequential penalty method for deep learning that enforces strict constraints on individual data samples rather than using arbitrary penalties, with convergence guarantees and practical viability.

DetailsMotivation: In learning tasks, certain requirements on processing individual data samples should be formalized as strict constraints rather than arbitrary penalties, but existing methods don't properly handle such constraints in deep learning scenarios.

Method: Proposes a sequential penalty method that allows proper handling of constraints in deep learning optimization problems, with convergence guarantees under reasonable assumptions.

Result: The method possesses convergence guarantees under assumptions reasonable for deep learning, and experiments on image processing tasks demonstrate its practical viability.

Conclusion: The sequential penalty method is an effective approach for enforcing strict constraints on individual data samples in deep learning, offering both theoretical convergence guarantees and practical applicability.

Abstract: In many learning tasks, certain requirements on the processing of individual data samples should arguably be formalized as strict constraints in the underlying optimization problem, rather than by means of arbitrary penalties. We show that, in these scenarios, learning can be carried out by exploiting a sequential penalty method that properly handles such constraints. The proposed algorithm is shown to possess convergence guarantees under assumptions that are reasonable in deep learning scenarios. Moreover, the results of experiments on image processing tasks show that the method is indeed viable in practice.
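
A minimal sketch of a sequential quadratic-penalty loop for sample-wise constraints g_i(θ) ≤ 0 follows: approximately solve the penalized subproblem, then increase the penalty weight. The model, constraint, and schedule are illustrative, not the paper’s setup.

```python
# Sequential penalty sketch: minimize loss + mu_k * sum_i max(0, g_i)^2,
# increasing mu_k between outer iterations so violations are driven to zero.
import torch

theta = torch.zeros(2, requires_grad=True)
x = torch.randn(32, 2)

def loss(theta):                      # arbitrary smooth objective
    return ((x @ theta - 1.0) ** 2).mean()

def g(theta):                         # per-sample constraint: x_i . theta <= 2
    return x @ theta - 2.0

mu = 1.0
for outer in range(5):
    opt = torch.optim.SGD([theta], lr=0.05)
    for _ in range(200):              # approximately solve the subproblem
        opt.zero_grad()
        penalty = torch.clamp(g(theta), min=0).pow(2).sum()
        (loss(theta) + mu * penalty).backward()
        opt.step()
    mu *= 10.0                        # tighten the penalty
print(theta.detach(), "max violation:", g(theta).max().item())
```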

[246] A Refinement of Vapnik–Chervonenkis’ Theorem

A. Iosevich, A. Vagharshakyan, E. Wyman

Main category: cs.LG

TL;DR: The paper revisits the probabilistic component of the VC theorem, replacing Hoeffding’s inequality with normal approximation and Berry-Esseen error control to obtain sharper moderate-deviation bounds.

DetailsMotivation: To improve upon the classical VC theorem by obtaining sharper estimates for the rate of uniform convergence of empirical probabilities to theoretical probabilities, particularly in moderate-deviation regimes.

Method: Revisits the probabilistic component of the classical VC argument, replacing the final application of Hoeffding’s inequality with a normal approximation approach that incorporates explicit Berry-Esseen error control.

Result: Obtains a moderate-deviation sharpening of the usual VC estimate, yielding an additional factor of order (ε√n)^{-1} in the leading exponential term when ε√n is large.

Conclusion: The normal approximation with Berry-Esseen error control provides improved moderate-deviation bounds for the VC theorem compared to the classical Hoeffding-based approach.

Abstract: Vapnik–Chervonenkis’ theorem is a seminal result in machine learning. It establishes sufficient conditions for empirical probabilities to converge to theoretical probabilities, uniformly over families of events. It also provides an estimate for the rate of such uniform convergence. We revisit the probabilistic component of the classical argument. Instead of applying Hoeffding’s inequality at the final step, we use a normal approximation with explicit Berry–Esseen error control. This yields a moderate-deviation sharpening of the usual VC estimate, with an additional factor of order $(\varepsilon\sqrt{n})^{-1}$ in the leading exponential term when $\varepsilon\sqrt{n}$ is large.
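
Schematically, with constants suppressed and S_A denoting the shatter (growth) function, the refinement multiplies the classical exponential bound by the stated moderate-deviation factor; the precise statement and constants are in the paper.

```latex
% Schematic comparison only (constants suppressed); see the paper for the
% precise statement. Classical VC-type bound vs. the refinement:
\[
  \Pr\Big(\sup_{A\in\mathcal{A}}\big|P_n(A)-P(A)\big| > \varepsilon\Big)
  \;\lesssim\; S_{\mathcal{A}}(2n)\, e^{-c\,n\varepsilon^{2}}
  \quad\text{vs.}\quad
  S_{\mathcal{A}}(2n)\,\frac{e^{-c\,n\varepsilon^{2}}}{\varepsilon\sqrt{n}}
  \qquad (\varepsilon\sqrt{n}\ \text{large}).
\]
```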

[247] PyHealth 2.0: A Comprehensive Open-Source Toolkit for Accessible and Reproducible Clinical Deep Learning

John Wu, Yongda Fan, Zhenbang Wu, Paul Landes, Eric Schrock, Sayeed Sajjad Razin, Arjun Chatterjee, Naveen Baskaran, Joshua Steier, Andrea Fitzpatrick, Bilal Arif, Rian Atri, Jathurshan Pradeepkumar, Siddhartha Laghuvarapu, Junyi Gao, Adam R. Cross, Jimeng Sun

Main category: cs.LG

TL;DR: PyHealth 2.0 is an enhanced clinical AI toolkit that enables predictive modeling in 7 lines of code, addressing reproducibility, computational cost, and domain expertise barriers in healthcare AI research.

DetailsMotivation: The paper addresses persistent barriers in clinical AI research: difficulty replicating baselines, high computational costs, and required domain expertise that limit accessibility and reproducibility.

Method: Developed PyHealth 2.0 as a comprehensive clinical deep learning toolkit with three key contributions: (1) unified framework supporting 15+ datasets, 20+ clinical tasks, 25+ models, 5+ interpretability methods, uncertainty quantification, and multiple data modalities; (2) accessibility-focused design for diverse computational resources with optimized performance; (3) active open-source community with extensive documentation and multi-language support.

Result: The toolkit achieves up to 39x faster processing and 20x lower memory usage, supports work from 16GB laptops to production systems, has an active community of 400+ members, and enables predictive modeling in as few as 7 lines of code.

Conclusion: PyHealth 2.0 establishes an open-source foundation and community that advances accessible, reproducible healthcare AI by lowering barriers to clinical AI research through comprehensive tooling, optimized performance, and community support.

Abstract: Difficulty replicating baselines, high computational costs, and required domain expertise create persistent barriers to clinical AI research. To address these challenges, we introduce PyHealth 2.0, an enhanced clinical deep learning toolkit that enables predictive modeling in as few as 7 lines of code. PyHealth 2.0 offers three key contributions: (1) a comprehensive toolkit addressing reproducibility and compatibility challenges by unifying 15+ datasets, 20+ clinical tasks, 25+ models, 5+ interpretability methods, and uncertainty quantification including conformal prediction within a single framework that supports diverse clinical data modalities (signals, imaging, and electronic health records) with translation of 5+ medical coding standards; (2) accessibility-focused design accommodating multimodal data and diverse computational resources with up to 39x faster processing and 20x lower memory usage, enabling work from 16GB laptops to production systems; and (3) an active open-source community of 400+ members lowering domain expertise barriers through extensive documentation, reproducible research contributions, and collaborations with academic health systems and industry partners, including multi-language support via RHealth. PyHealth 2.0 establishes an open-source foundation and community advancing accessible, reproducible healthcare AI. Available via pip install pyhealth.
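
For orientation, here is what the few-line workflow looks like, following the staged pipeline (dataset → task → split → model → trainer) in PyHealth’s documentation; the class and function names below come from earlier PyHealth releases and may differ in 2.0, so treat them as assumptions rather than a verified 2.0 API.

```python
# Sketch of the advertised few-line workflow. Names follow earlier PyHealth
# documentation and are assumptions with respect to the 2.0 release.
from pyhealth.datasets import MIMIC3Dataset, split_by_patient, get_dataloader
from pyhealth.tasks import mortality_prediction_mimic3_fn
from pyhealth.models import Transformer
from pyhealth.trainer import Trainer

base = MIMIC3Dataset(root="/data/mimic3", tables=["DIAGNOSES_ICD", "PROCEDURES_ICD"])
samples = base.set_task(mortality_prediction_mimic3_fn)       # define the task
train, val, _ = split_by_patient(samples, [0.8, 0.1, 0.1])    # leakage-safe split
model = Transformer(dataset=samples, feature_keys=["conditions", "procedures"],
                    label_key="label", mode="binary")
Trainer(model=model).train(
    train_dataloader=get_dataloader(train, batch_size=32, shuffle=True),
    val_dataloader=get_dataloader(val, batch_size=32), epochs=3)
```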

[248] Bayesian Experimental Design for Model Discrepancy Calibration: A Rivalry between Kullback–Leibler Divergence and Wasserstein Distance

Huchen Yang, Xinghao Dong, Jin-Long Wu

Main category: cs.LG

TL;DR: This paper compares KL divergence vs Wasserstein distance as utility functions in Bayesian experimental design, showing KL works better without model errors but Wasserstein is more robust when model discrepancies exist.

DetailsMotivation: The selection of utility functions in Bayesian experimental design (BED) is an active research area, with different criteria emphasizing different notions of information. While KL divergence is common, recent studies propose Wasserstein distance as an alternative, but systematic comparisons are needed to understand their trade-offs in practical applications.

Method: The authors first use a toy example to illustrate issues with Wasserstein distance, showing its value depends on posterior position within support and can exhibit false rewards. Then they conduct systematic comparison through a classical source inversion problem in BED literature, evaluating both criteria under conditions with and without model discrepancy.

Result: KL divergence leads to faster convergence in the absence of model discrepancy, while Wasserstein metrics provide more robust sequential BED results when model discrepancy is non-negligible. The toy example reveals Wasserstein distance can give misleading rewards unrelated to actual information gain.

Conclusion: The findings clarify trade-offs between KL divergence and Wasserstein metrics for utility functions in BED, providing practical guidelines: use KL when model is accurate for faster convergence, but prefer Wasserstein when model discrepancies exist for robustness. This helps practitioners select suitable criteria based on their specific application context.

Abstract: Designing experiments that systematically gather data from complex physical systems is central to accelerating scientific discovery. While Bayesian experimental design (BED) provides a principled, information-based framework that integrates experimental planning with probabilistic inference, the selection of utility functions in BED is a long-standing and active topic, where different criteria emphasize different notions of information. Although Kullback–Leibler (KL) divergence has been one of the most common choices, recent studies have proposed Wasserstein distance as an alternative. In this work, we first employ a toy example to illustrate an issue with the Wasserstein distance: the value of the Wasserstein distance for a fixed-shape posterior depends on the relative position of its main mass within the support and can exhibit false rewards unrelated to information gain, especially with a non-informative prior (e.g., uniform distribution). We then further provide a systematic comparison between these two criteria through a classical source inversion problem in the BED literature, revealing that the KL divergence tends to lead to faster convergence in the absence of model discrepancy, while Wasserstein metrics provide more robust sequential BED results if model discrepancy is non-negligible. These findings clarify the trade-offs between KL divergence and Wasserstein metrics for the utility function and provide guidelines for selecting suitable criteria in practical BED applications.
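
The location issue is easy to reproduce numerically, as below: two posteriors of identical shape have (essentially) the same KL divergence from a uniform prior but different Wasserstein-1 distances, purely because of where their mass sits in the support.

```python
# Toy illustration of the location issue: same posterior shape, hence the
# same KL from a uniform prior, but different Wasserstein-1 distances.
import numpy as np
from scipy.stats import wasserstein_distance

grid = np.linspace(0, 1, 1001)
prior = np.ones_like(grid) / len(grid)        # non-informative prior

def gaussian_posterior(mu, sigma=0.02):
    p = np.exp(-0.5 * ((grid - mu) / sigma) ** 2)
    return p / p.sum()

def kl(p, q):
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

for mu in (0.5, 0.9):                         # same shape, different location
    post = gaussian_posterior(mu)
    w1 = wasserstein_distance(grid, grid, u_weights=post, v_weights=prior)
    print(f"mu={mu}: KL={kl(post, prior):.3f}  W1={w1:.3f}")
```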

[249] Safe Multitask Molecular Graph Networks for Vapor Pressure and Odor Threshold Prediction

Shuang Wu, Meijie Wang, Lun Yu

Main category: cs.LG

TL;DR: The paper investigates vapor pressure and odor threshold modeling using molecular graph features with GINE and PNA backbones, introduces a “safe multitask” approach, and provides comprehensive experimental analysis.

DetailsMotivation: To develop robust models for two important odor-related properties (vapor pressure and odor threshold) with strong out-of-distribution generalization capabilities, addressing challenges in multitask learning where auxiliary tasks might harm primary task performance.

Method: Uses Bemis-Murcko scaffold split for OOD evaluation, A20/E17 molecular graph features (20D atom + 17D bond features), compares GINE and PNA backbones, and introduces “safe multitask” approach with VP as primary task, OP as auxiliary task using delayed activation, gradient clipping, and small weight.

Result: PNA with regression head achieves Val MSE ≈0.21 for VP; OP single task with robust training achieves Val MSE ≈0.60-0.61; “safe multitask” approach yields best VP generalization without harming primary task performance.

Conclusion: The proposed methods effectively model odor-related properties with good OOD generalization, and the “safe multitask” approach successfully leverages auxiliary tasks without compromising primary task performance, providing reproducible framework for molecular property prediction.

Abstract: We investigate two important tasks in odor-related property modeling: Vapor Pressure (VP) and Odor Threshold (OP). To evaluate the model’s out-of-distribution (OOD) capability, we adopt the Bemis-Murcko scaffold split. In terms of features, we introduce the rich A20/E17 molecular graph features (20-dimensional atom features + 17-dimensional bond features) and systematically compare GINE and PNA backbones. The results show: for VP, PNA with a simple regression head achieves Val MSE $\approx$ 0.21 (normalized space); for the OP single task under the same scaffold split, using A20/E17 with robust training (Huber/winsor) achieves Val MSE $\approx$ 0.60-0.61. For multitask training, we propose a “safe multitask” approach: VP as the primary task and OP as the auxiliary task, using delayed activation + gradient clipping + small weight, which avoids harming the primary task and simultaneously yields the best VP generalization performance. This paper provides complete reproducible experiments, ablation studies, and error-similarity analysis while discussing the impact of data noise and method limitations.
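
The “safe multitask” recipe translates into a short training step, sketched below: the auxiliary OP loss is switched on only after a warm-up, enters with a small weight, and the gradient is clipped. Warm-up length, weight, clip norm, and the tiny two-head model standing in for a GINE/PNA backbone are illustrative assumptions.

```python
# Sketch of the "safe multitask" step: delayed activation of the auxiliary
# OP loss, a small auxiliary weight, and gradient clipping so the auxiliary
# term cannot destabilize the primary VP task.
import torch

class TinyTwoHead(torch.nn.Module):
    def __init__(self, d_in=16, d_hidden=32):
        super().__init__()
        self.shared = torch.nn.Linear(d_in, d_hidden)
        self.vp_head = torch.nn.Linear(d_hidden, 1)   # primary task
        self.op_head = torch.nn.Linear(d_hidden, 1)   # auxiliary task

    def forward(self, x):
        h = torch.relu(self.shared(x))
        return self.vp_head(h).squeeze(-1), self.op_head(h).squeeze(-1)

def safe_multitask_step(model, batch, opt, epoch,
                        aux_weight=0.1, warmup_epochs=10, clip_norm=1.0):
    vp_pred, op_pred = model(batch["x"])
    loss = torch.nn.functional.mse_loss(vp_pred, batch["vp"])
    if epoch >= warmup_epochs:                        # delayed activation
        op_loss = torch.nn.functional.huber_loss(op_pred, batch["op"])
        loss = loss + aux_weight * op_loss            # small auxiliary weight
    opt.zero_grad()
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), clip_norm)
    opt.step()
    return loss.item()

model = TinyTwoHead()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
batch = {"x": torch.randn(8, 16), "vp": torch.randn(8), "op": torch.randn(8)}
for epoch in range(12):
    safe_multitask_step(model, batch, opt, epoch)
```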

[250] Endless Terminals: Scaling RL Environments for Terminal Agents

Kanishk Gandhi, Shivam Garg, Noah D. Goodman, Dimitris Papailiopoulos

Main category: cs.LG

TL;DR: Endless Terminals is an autonomous pipeline for procedurally generating terminal-use tasks that enables effective RL training, leading to substantial performance gains on both generated and human-curated benchmarks.

DetailsMotivation: Current terminal benchmarks are designed for evaluation rather than training, and RL requires scalable training environments, not just datasets. There's a need for an automated pipeline to generate diverse terminal tasks without human annotation.

Method: A four-stage pipeline: 1) generating diverse task descriptions, 2) building and validating containerized environments, 3) producing completion tests, and 4) filtering for solvability. Training uses vanilla PPO with binary episode-level rewards and minimal interaction (no retrieval, multi-agent coordination, or specialized tools).

Result: Generated 3255 tasks spanning file operations, log management, data processing, scripting, and database operations. Models trained on Endless Terminals show substantial gains: Llama-3.2-3B improved from 4.0% to 18.2%, Qwen2.5-7B from 10.7% to 53.3%, and Qwen3-8B-openthinker-sft from 42.6% to 59.0% on held-out dev set. Improvements transfer to human-curated benchmarks like TerminalBench 2.0.

Conclusion: Simple RL can succeed when environments scale properly. The Endless Terminals pipeline demonstrates that procedurally generated training tasks enable substantial agent improvement, outperforming more complex agentic approaches with minimal training infrastructure.

Abstract: Environments are the bottleneck for self-improving agents. Current terminal benchmarks were built for evaluation, not training; reinforcement learning requires a scalable pipeline, not just a dataset. We introduce Endless Terminals, a fully autonomous pipeline that procedurally generates terminal-use tasks without human annotation. The pipeline has four stages: generating diverse task descriptions, building and validating containerized environments, producing completion tests, and filtering for solvability. From this pipeline we obtain 3255 tasks spanning file operations, log management, data processing, scripting, and database operations. We train agents using vanilla PPO with binary episode-level rewards and a minimal interaction loop: no retrieval, multi-agent coordination, or specialized tools. Despite this simplicity, models trained on Endless Terminals show substantial gains: on our held-out dev set, Llama-3.2-3B improves from 4.0% to 18.2%, Qwen2.5-7B from 10.7% to 53.3%, and Qwen3-8B-openthinker-sft from 42.6% to 59.0%. These improvements transfer to held-out, human-curated benchmarks: on TerminalBench 2.0, Llama-3.2-3B improves from 0.0% to 2.2%, Qwen2.5-7B from 2.2% to 3.4%, and Qwen3-8B-openthinker-sft from 1.1% to 6.7%, in each case outperforming alternative approaches including models with more complex agentic scaffolds. These results demonstrate that simple RL succeeds when environments scale.

[251] Brownian ReLU(Br-ReLU): A New Activation Function for a Long-Short Term Memory (LSTM) Network

George Awiakye-Marfo, Elijah Agbosu, Victoria Mawuena Barns, Samuel Asante Gyamerah

Main category: cs.LG

TL;DR: BrownianReLU: A stochastic activation function based on Brownian motion that improves gradient stability and performance in LSTM networks for financial time series prediction.

DetailsMotivation: Standard activation functions (ReLU, LeakyReLU, PReLU) suffer from gradient instability when applied to noisy, non-stationary financial time series, leading to the dying ReLU problem and poor learning performance.

Method: Introduces BrownianReLU, a stochastic activation function induced by Brownian motion that provides smooth, adaptive responses for negative inputs. Uses Monte Carlo simulation to implement the function and evaluates it on LSTM networks with financial datasets (Apple, GCB, S&P 500, LendingClub loan data).
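
The summary does not give BrownianReLU's exact functional form, so the PyTorch sketch below is only one plausible reading, with an invented `sigma` scale: ReLU's zero branch for negative inputs is replaced by a small Brownian-increment response sampled on each forward pass.

```python
import torch
import torch.nn as nn

class BrownianReLU(nn.Module):
    """Hypothetical sketch of a Brownian-motion-induced activation:
    positive inputs pass through as in ReLU; negative inputs receive a
    small stochastic response, keeping gradients alive (vs. dying ReLU)."""
    def __init__(self, sigma: float = 0.05):
        super().__init__()
        self.sigma = sigma

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Brownian increment with variance growing in "time" |x|,
        # mimicking W(|x|) for a standard Brownian motion W.
        noise = self.sigma * torch.sqrt(x.abs() + 1e-12) * torch.randn_like(x)
        return torch.where(x > 0, x, noise)
```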

Result: BrownianReLU achieves consistently lower Mean Squared Error and higher R² values compared to standard activations, indicating improved predictive accuracy and generalization. While ROC-AUC has limitations in classification tasks, activation choice significantly affects the accuracy-sensitivity trade-off, with BrownianReLU delivering practically meaningful performance.

Conclusion: BrownianReLU effectively addresses gradient instability in financial time series modeling, offering enhanced learning stability and predictive performance in LSTM networks through its stochastic, Brownian motion-based design.

Abstract: Deep learning models are effective for sequential data modeling, yet commonly used activation functions such as ReLU, LeakyReLU, and PReLU often exhibit gradient instability when applied to noisy, non-stationary financial time series. This study introduces BrownianReLU, a stochastic activation function induced by Brownian motion that enhances gradient propagation and learning stability in Long Short-Term Memory (LSTM) networks. Using Monte Carlo simulation, BrownianReLU provides a smooth, adaptive response for negative inputs, mitigating the dying ReLU problem. The proposed activation is evaluated on financial time series from Apple, GCB, and the S&P 500, as well as LendingClub loan data for classification. Results show consistently lower Mean Squared Error and higher $R^2$ values, indicating improved predictive accuracy and generalization. Although the ROC-AUC metric is limited in classification tasks, activation choice significantly affects the trade-off between accuracy and sensitivity, with BrownianReLU and the selected activation functions yielding practically meaningful performance.

[252] On the Expressive Power of Floating-Point Transformers

Sejun Park, Yeachan Park, Geonho Hwang

Main category: cs.LG

TL;DR: Floating-point transformers can represent non-permutation-equivariant functions without positional encoding, but their ability to represent permutation-equivariant functions depends on sequence length.

DetailsMotivation: Existing theoretical results on transformers assume real parameters and exact operations, but real implementations use finite floating-point numbers with round-off errors. This work investigates how these practical limitations affect transformers' representational capabilities.

Method: Analyze the representability of transformers using floating-point parameters and operations instead of ideal real numbers, examining how round-off errors and finite precision affect their ability to represent different function classes.
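
The root cause is visible outside any transformer: floating-point addition is not associative, so permuting the order in which a sequence is reduced (as happens when token order changes) can change the result. A self-contained demonstration:

```python
import itertools

# Summation order matters in floating point: these four values sum to
# different results depending on permutation, so an exact-arithmetic
# permutation-equivariance argument breaks down under round-off.
xs = [1e16, 1.0, -1e16, 1.0]
print({sum(p) for p in itertools.permutations(xs)})  # e.g. {0.0, 1.0, 2.0}
```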

Result: 1) Floating-point transformers can represent some non-permutation-equivariant functions even without positional encoding. 2) They can represent all permutation-equivariant functions when sequence length is bounded, but not when sequence length is large. 3) Minimal equivariance structure exists in floating-point transformers. 4) Non-trivial additive positional encoding can harm representability.

Conclusion: Practical floating-point implementations of transformers have different representational properties than ideal theoretical models, with limitations on permutation-equivariance that depend on sequence length and potential negative effects of positional encoding.

Abstract: Existing studies on the expressive power of transformers show that transformers are permutation equivariant, and that they can approximate all permutation-equivariant continuous functions on a compact domain. However, these results are derived under real parameters and exact operations, while real implementations on computers can only use a finite set of numbers and inexact machine operations with round-off errors. In this work, we investigate the representability of floating-point transformers that use floating-point parameters and floating-point operations. Unlike existing results under exact operations, we first show that floating-point transformers can represent a class of non-permutation-equivariant functions even without positional encoding. Furthermore, we prove that floating-point transformers can represent all permutation-equivariant functions when the sequence length is bounded, but they cannot when the sequence length is large. We also identify the minimal equivariance structure in floating-point transformers, and show that any non-trivial additive positional encoding can harm the representability of floating-point transformers.

[253] On the Effects of Adversarial Perturbations on Distribution Robustness

Yipei Wang, Zhaoying Pan, Xiaoqian Wang

Main category: cs.LG

TL;DR: The paper analyzes the tradeoff between adversarial and distribution robustness, showing that while adversarial training can harm distribution robustness by increasing reliance on spurious features, moderate ℓ∞ perturbations on moderately biased data can actually improve distribution robustness, especially when feature separability is high.

DetailsMotivation: Prior work has revealed a tradeoff between adversarial robustness (resistance to input perturbations) and distribution robustness (performance under data shifts). Adversarial training can increase reliance on spurious features, harming distribution robustness, particularly for underrepresented subgroups. The authors aim to better understand this complex relationship.

Method: Theoretical analysis using tractable surrogate for per-step adversarial training by studying models trained on perturbed data. The approach examines ℓ∞ perturbations on data with varying levels of bias and analyzes how feature separability affects robustness tradeoffs.
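
As a concrete instance of the surrogate (training on perturbed data rather than running per-step adversarial training), an ℓ∞-bounded, FGSM-style perturbation of a batch can be sketched as follows; the cross-entropy loss and the [0, 1] clamp are illustrative assumptions for image-scaled inputs.

```python
import torch

def linf_perturb(model, x, y, eps=8 / 255):
    """One-shot l_inf perturbation of the data (FGSM direction), a
    tractable stand-in for per-step adversarial training."""
    x = x.clone().requires_grad_(True)
    loss = torch.nn.functional.cross_entropy(model(x), y)
    loss.backward()
    with torch.no_grad():
        x_pert = (x + eps * x.grad.sign()).clamp(0.0, 1.0)
    return x_pert.detach()
```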

Result: 1) Confirmed the tradeoff between adversarial and distribution robustness in many cases. 2) Discovered that ℓ∞ perturbations on moderately biased data can actually increase distribution robustness. 3) Found that gains in distribution robustness persist on highly skewed data when simplicity bias induces reliance on core features (greater feature separability).

Conclusion: The interplay between robustness tradeoffs and feature separability is crucial. While the tradeoff persists in many cases, overlooking feature separability can lead to misleading conclusions about robustness. The nuanced relationship shows that moderate adversarial perturbations can sometimes benefit distribution robustness under specific conditions.

Abstract: Adversarial robustness refers to a model’s ability to resist perturbation of inputs, while distribution robustness evaluates the performance of the model under data shifts. Although both aim to ensure reliable performance, prior work has revealed a tradeoff between distribution and adversarial robustness. Specifically, adversarial training might increase reliance on spurious features, which can harm distribution robustness, especially the performance on some underrepresented subgroups. We present a theoretical analysis of adversarial and distribution robustness that provides a tractable surrogate for per-step adversarial training by studying models trained on perturbed data. In addition to the tradeoff, our work identifies a nuanced phenomenon: $\ell_\infty$ perturbations on data with moderate bias can yield an increase in distribution robustness. Moreover, the gain in distribution robustness persists on highly skewed data when simplicity bias induces reliance on the core feature, characterized as greater feature separability. Our theoretical analysis extends the understanding of the tradeoff by highlighting the interplay between the tradeoff and feature separability. Although the tradeoff persists in many cases, overlooking the role of feature separability may lead to misleading conclusions about robustness.

[254] A Cautionary Tale of Self-Supervised Learning for Imaging Biomarkers: Alzheimer’s Disease Case Study

Maxwell Reynolds, Chaitanya Srinivasan, Vijay Cherupally, Michael Leone, Ke Yu, Li Sun, Tigmanshu Chaudhary, Andreas Pfenning, Kayhan Batmanghelich

Main category: cs.LG

TL;DR: R-NCE, a new self-supervised learning framework that integrates FreeSurfer features, outperforms traditional biomarkers and existing SSL methods for Alzheimer’s disease prediction and reveals biologically relevant associations.

DetailsMotivation: Current structural MRI biomarkers for Alzheimer's disease rely on hand-crafted features like cortical thickness or volume, which may not capture the full biological complexity. Existing self-supervised learning methods underperform compared to FreeSurfer-derived features, creating a need for more powerful biomarker discovery approaches.

Method: Residual Noise Contrastive Estimation (R-NCE), a new SSL framework that integrates auxiliary FreeSurfer features while maximizing additional augmentation-invariant information from structural MRI data.
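
R-NCE's exact objective is not spelled out in this summary; for orientation, the standard InfoNCE loss that noise-contrastive SSL methods build on is sketched below, with the residualization against FreeSurfer features (the paper's contribution) left out.

```python
import torch
import torch.nn.functional as F

def info_nce(z1, z2, temperature=0.1):
    """Standard InfoNCE: each augmented view z1[i] must identify its
    partner z2[i] among all other samples in the batch (negatives)."""
    z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
    logits = z1 @ z2.t() / temperature          # (B, B) similarity matrix
    targets = torch.arange(z1.size(0), device=z1.device)
    return F.cross_entropy(logits, targets)
```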

Result: R-NCE outperforms traditional FreeSurfer features and existing SSL methods across multiple benchmarks including AD classification, conversion prediction, and amyloid status prediction. R-NCE-derived Brain Age Gap measures show high heritability and associations with MAPT and IRAG1 genes, with enrichment in astrocytes and oligodendrocytes.

Conclusion: R-NCE successfully uncovers more powerful biomarkers from structural MRI data than traditional methods, demonstrating both superior predictive performance and biological relevance to neurodegenerative and cerebrovascular processes in Alzheimer’s disease.

Abstract: Discovery of sensitive and biologically grounded biomarkers is essential for early detection and monitoring of Alzheimer’s disease (AD). Structural MRI is widely available but typically relies on hand-crafted features such as cortical thickness or volume. We ask whether self-supervised learning (SSL) can uncover more powerful biomarkers from the same data. Existing SSL methods underperform FreeSurfer-derived features in disease classification, conversion prediction, and amyloid status prediction. We introduce Residual Noise Contrastive Estimation (R-NCE), a new SSL framework that integrates auxiliary FreeSurfer features while maximizing additional augmentation-invariant information. R-NCE outperforms traditional features and existing SSL methods across multiple benchmarks, including AD conversion prediction. To assess biological relevance, we derive Brain Age Gap (BAG) measures and perform genome-wide association studies. R-NCE-BAG shows high heritability and associations with MAPT and IRAG1, with enrichment in astrocytes and oligodendrocytes, indicating sensitivity to neurodegenerative and cerebrovascular processes.

[255] Robust Categorical Data Clustering Guided by Multi-Granular Competitive Learning

Shenghong Cai, Yiqun Zhang, Xiaopeng Luo, Yiu-Ming Cheung, Hong Jia, Peng Liu

Main category: cs.LG

TL;DR: MGCPL algorithm for categorical data clustering that handles nested granular clusters through competitive penalization learning and cluster aggregation encoding.

DetailsMotivation: Categorical data is common in big data but presents challenges for clustering due to undefined distance spaces, overlapping data objects, and nested granular cluster effects where small clusters form larger clusters.

Method: Proposes MCDC approach with two components: 1) MGCPL algorithm that allows potential clusters to interactively tune themselves and converge in stages with different numbers of naturally compact clusters, and 2) CAME strategy that encodes data objects based on learned multi-granular distributions and performs final clustering on embeddings.
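
A rough, hypothetical rendering of the CAME idea: cluster the data at several granularities, concatenate the one-hot assignments as an embedding, and run the final clustering on that embedding. KMeans over one-hot-encoded categories stands in for the paper's MGCPL learner, which is not reproduced here.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import OneHotEncoder

def came_style_embedding(X_cat, granularities=(4, 8, 16), seed=0):
    """Encode each object by its cluster assignment at several
    granularities, returning the concatenated one-hot embedding."""
    X = OneHotEncoder().fit_transform(X_cat).toarray()
    blocks = []
    for k in granularities:
        labels = KMeans(n_clusters=k, random_state=seed, n_init=10).fit_predict(X)
        blocks.append(np.eye(k)[labels])    # one-hot assignment at this k
    return np.hstack(blocks)                # final clustering runs on this
```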

Result: MCDC is competent in automatically exploring nested distributions of multi-granular clusters, highly robust across domains, has linear time complexity, scalable to large datasets, and superior to state-of-the-art methods on various real public datasets.

Conclusion: The proposed MGCPL-guided categorical data clustering approach effectively addresses the challenges of categorical data clustering by handling nested granular clusters through competitive learning and encoding strategies, making it suitable for large-scale applications and distributed computing.

Abstract: Data sets composed of categorical features are very common in big data analysis tasks. Since categorical features usually take a limited number of qualitative values, the nested granular cluster effect is prevalent in the implicit discrete distance space of categorical data. That is, data objects frequently overlap in space or subspace to form small compact clusters, and similar small clusters often form larger clusters. However, the distance space cannot be well-defined in the way Euclidean distance can, owing to the qualitative categorical values, which poses great challenges for the cluster analysis of categorical data. In view of this, we design a Multi-Granular Competitive Penalization Learning (MGCPL) algorithm to allow potential clusters to interactively tune themselves and converge in stages with different numbers of naturally compact clusters. To leverage MGCPL, we also propose a Cluster Aggregation strategy based on MGCPL Encoding (CAME) that first encodes the data objects according to the learned multi-granular distributions and then performs final clustering on the embeddings. The proposed MGCPL-guided Categorical Data Clustering (MCDC) approach proves competent at automatically exploring the nested distribution of multi-granular clusters and highly robust to categorical data sets from various domains. Benefiting from its linear time complexity, MCDC is scalable to large-scale data sets and promising for pre-partitioning data sets or compute nodes to boost distributed computing. Extensive experiments with statistical evidence demonstrate its superiority compared to state-of-the-art counterparts on various real public data sets.

[256] Interpretable Fine-Gray Deep Survival Model for Competing Risks: Predicting Post-Discharge Foot Complications for Diabetic Patients in Ontario

Dhanesh Ramachandram, Anne Loefler, Surain Roberts, Amol Verma, Maia Norman, Fahad Razak, Conrad Pow, Charles de Mestral

Main category: cs.LG

TL;DR: CRISPNAM-FG: An intrinsically interpretable deep learning model for competing risks survival analysis that combines Neural Additive Models with Fine-Gray formulation for transparent predictions.

DetailsMotivation: Model interpretability is crucial for AI safety and clinician trust in medical applications like survival modeling with competing risks. While recent deep learning models achieve good predictive performance, their black-box nature hinders clinical integration.

Method: Proposes CRISPNAM-FG, an intrinsically interpretable survival model that leverages Neural Additive Models (NAMs) structure with separate projection vectors for each risk. It predicts Cumulative Incidence Function using the Fine-Gray formulation.
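
The NAM backbone that CRISPNAM-FG builds on is simple to sketch: each feature gets its own small subnetwork (its shape function), and a per-risk projection combines the feature outputs. The Fine-Gray likelihood on top is omitted, and the layer sizes here are illustrative.

```python
import torch
import torch.nn as nn

class NAMBackbone(nn.Module):
    """Neural Additive Model: one small MLP (shape function) per feature;
    a separate linear projection per competing risk combines them."""
    def __init__(self, n_features: int, n_risks: int, hidden: int = 32):
        super().__init__()
        self.shape_fns = nn.ModuleList([
            nn.Sequential(nn.Linear(1, hidden), nn.ReLU(), nn.Linear(hidden, 1))
            for _ in range(n_features)
        ])
        self.risk_heads = nn.Linear(n_features, n_risks)

    def forward(self, x):                     # x: (batch, n_features)
        contribs = torch.cat(
            [f(x[:, j : j + 1]) for j, f in enumerate(self.shape_fns)], dim=1
        )                                     # per-feature contributions
        return self.risk_heads(contribs)      # per-risk scores
```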

Result: Achieves competitive performance compared to other deep survival models on benchmark datasets. Successfully applied to predict future foot complications in diabetic patients across 29 Ontario hospitals (2016-2023). Provides transparency through shape functions and feature importance plots.

Conclusion: CRISPNAM-FG offers both high predictive power and intrinsic transparency, addressing the critical need for interpretable AI in clinical practice while maintaining competitive performance with black-box alternatives.

Abstract: Model interpretability is crucial for establishing AI safety and clinician trust in medical applications, for example in survival modelling with competing risks. Recent deep learning models have attained very good predictive performance, but their limited transparency as black-box models hinders their integration into clinical practice. To address this gap, we propose an intrinsically interpretable survival model called CRISPNAM-FG. Leveraging the structure of Neural Additive Models (NAMs) with separate projection vectors for each risk, our approach predicts the Cumulative Incidence Function using the Fine-Gray formulation, achieving high predictive power with intrinsically transparent and auditable predictions. We validated the model on several benchmark datasets and applied our model to predict future foot complications in diabetic patients across 29 Ontario hospitals (2016-2023). Our method achieves competitive performance compared to other deep survival models while providing transparency through shape functions and feature importance plots.

[257] BoostFGL: Boosting Fairness in Federated Graph Learning

Zekai Chen, Kairui Yang, Xunkai Li, Henan Sun, Zhihan Zhang, Jia Li, Qiangqiang Dai, Rong-Hua Li, Guoren Wang

Main category: cs.LG

TL;DR: BoostFGL is a fairness-aware federated graph learning framework that addresses performance disparities across disadvantaged node groups through coordinated client-side and server-side boosting mechanisms.

DetailsMotivation: Existing federated graph learning methods achieve high overall accuracy but conceal severe performance degradation on disadvantaged node groups, creating fairness issues. These disparities arise from three coupled sources: label skew toward majority patterns, topology confounding in message propagation, and aggregation dilution of updates from hard clients.

Method: BoostFGL introduces three coordinated mechanisms: 1) Client-side node boosting that reshapes local training signals to emphasize systematically under-served nodes; 2) Client-side topology boosting that reallocates propagation emphasis toward reliable yet underused structures while attenuating misleading neighborhoods; 3) Server-side model boosting that performs difficulty- and reliability-aware aggregation to preserve informative updates from hard clients while stabilizing the global model.
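
A minimal sketch of the third, server-side mechanism, with an invented weighting rule (the paper's exact difficulty and reliability scores are not given here): clients with harder data are up-weighted rather than diluted.

```python
import numpy as np

def difficulty_aware_aggregate(updates, losses, reliability):
    """Server-side model boosting (sketch): weight each client's update
    by difficulty (local loss) x reliability, normalized to sum to 1.
    `updates` is a list of flattened parameter deltas."""
    w = np.asarray(losses) * np.asarray(reliability)
    w = w / w.sum()
    return sum(wi * ui for wi, ui in zip(w, updates))
```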

Result: Extensive experiments on 9 datasets show that BoostFGL delivers substantial fairness gains, improving Overall-F1 by 8.43%, while preserving competitive overall performance against strong FGL baselines.

Conclusion: BoostFGL effectively addresses fairness issues in federated graph learning by systematically tackling the three sources of disparities through coordinated boosting mechanisms, achieving both improved fairness and competitive overall performance.

Abstract: Federated graph learning (FGL) enables collaborative training of graph neural networks (GNNs) across decentralized subgraphs without exposing raw data. While existing FGL methods often achieve high overall accuracy, we show that this average performance can conceal severe degradation on disadvantaged node groups. From a fairness perspective, these disparities arise systematically from three coupled sources: label skew toward majority patterns, topology confounding in message propagation, and aggregation dilution of updates from hard clients. To address this, we propose BoostFGL, a boosting-style framework for fairness-aware FGL. BoostFGL introduces three coordinated mechanisms: (1) Client-side node boosting, which reshapes local training signals to emphasize systematically under-served nodes; (2) Client-side topology boosting, which reallocates propagation emphasis toward reliable yet underused structures and attenuates misleading neighborhoods; and (3) Server-side model boosting, which performs difficulty- and reliability-aware aggregation to preserve informative updates from hard clients while stabilizing the global model. Extensive experiments on 9 datasets show that BoostFGL delivers substantial fairness gains, improving Overall-F1 by 8.43%, while preserving competitive overall performance against strong FGL baselines.

[258] kNN-Graph: An adaptive graph model for $k$-nearest neighbors

Jiaye Li, Gang Chen, Hang Xu, Shichao Zhang

Main category: cs.LG

TL;DR: The paper presents an adaptive graph model that decouples kNN inference latency from computational complexity using HNSW graphs with pre-computed voting, achieving real-time performance without accuracy loss.

DetailsMotivation: kNN faces computational trade-offs between inference speed and accuracy in large-scale applications. Existing approximate nearest neighbor solutions accelerate retrieval but degrade precision and lack adaptability in optimal neighborhood size selection.

Method: Integrates Hierarchical Navigable Small World (HNSW) graph with pre-computed voting mechanism, transferring computational burden of neighbor selection and weighting to training phase. Higher graph layers enable rapid navigation while lower layers encode precise, node-specific decision boundaries with adaptive neighbor counts.
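
A sketch of the core idea using the hnswlib library: build the HNSW graph and pre-compute each node's vote at training time, so inference reduces to graph navigation plus a table lookup. The fixed `k` and plain majority vote are simplifications; the paper's adaptive, node-specific neighbor counts are not reproduced.

```python
import numpy as np
import hnswlib

def build_knn_graph_classifier(X, y, k=10):
    """Train-time work: build the HNSW index and pre-compute each node's
    vote (majority label among its k neighbors; y: integer class labels)."""
    index = hnswlib.Index(space="l2", dim=X.shape[1])
    index.init_index(max_elements=len(X), ef_construction=200, M=16)
    index.add_items(X, np.arange(len(X)))
    nbrs, _ = index.knn_query(X, k=k)                 # (N, k) neighbor ids
    votes = np.array([np.bincount(y[row]).argmax() for row in nbrs])
    return index, votes

def predict(index, votes, x):
    nearest, _ = index.knn_query(x, k=1)              # navigate the graph
    return votes[nearest[:, 0]]                       # pre-computed vote
```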

Result: Benchmarking against eight state-of-the-art baselines across six diverse datasets shows significant inference speed acceleration achieving real-time performance without compromising classification accuracy.

Conclusion: Provides a scalable, robust solution to kNN’s inference bottleneck and establishes a new structural paradigm for graph-based nonparametric learning.

Abstract: The k-nearest neighbors (kNN) algorithm is a cornerstone of non-parametric classification in artificial intelligence, yet its deployment in large-scale applications is persistently constrained by the computational trade-off between inference speed and accuracy. Existing approximate nearest neighbor solutions accelerate retrieval but often degrade classification precision and lack adaptability in selecting the optimal neighborhood size (k). Here, we present an adaptive graph model that decouples inference latency from computational complexity. By integrating a Hierarchical Navigable Small World (HNSW) graph with a pre-computed voting mechanism, our framework completely transfers the computational burden of neighbor selection and weighting to the training phase. Within this topological structure, higher graph layers enable rapid navigation, while lower layers encode precise, node-specific decision boundaries with adaptive neighbor counts. Benchmarking against eight state-of-the-art baselines across six diverse datasets, we demonstrate that this architecture significantly accelerates inference speeds, achieving real-time performance, without compromising classification accuracy. These findings offer a scalable, robust solution to the long-standing inference bottleneck of kNN, establishing a new structural paradigm for graph-based nonparametric learning.

[259] Finite-Time Analysis of Gradient Descent for Shallow Transformers

Enes Arda, Semih Cayci, Atilla Eryilmaz

Main category: cs.LG

TL;DR: Transformers achieve logarithmic width scaling with sample size and sequence-length-independent optimization error, unlike RNNs, but require memory proportional to sequence length.

DetailsMotivation: To understand why Transformers perform so well despite their non-convex optimization landscape, and to analyze their optimization properties compared to recurrent architectures.

Method: Analyzed a shallow Transformer with m independent heads trained by projected gradient descent in the kernel regime, using theoretical analysis and numerical validation in a teacher-student setting.

Result: Two key findings: (1) width required for nonasymptotic guarantees scales only logarithmically with sample size n, (2) optimization error is independent of sequence length T (unlike RNNs where it grows exponentially with T). Trade-off is memory requirement grows with sequence length.

Conclusion: Transformers have favorable optimization properties with logarithmic width scaling and sequence-length-independent error, but require memory proportional to sequence length, providing theoretical insights into their empirical success.

Abstract: Understanding why Transformers perform so well remains challenging due to their non-convex optimization landscape. In this work, we analyze a shallow Transformer with $m$ independent heads trained by projected gradient descent in the kernel regime. Our analysis reveals two main findings: (i) the width required for nonasymptotic guarantees scales only logarithmically with the sample size $n$, and (ii) the optimization error is independent of the sequence length $T$. This contrasts sharply with recurrent architectures, where the optimization error can grow exponentially with $T$. The trade-off is memory: to keep the full context, the Transformer’s memory requirement grows with the sequence length. We validate our theoretical results numerically in a teacher-student setting and confirm the predicted scaling laws for Transformers.

[260] Rethinking Large Language Models For Irregular Time Series Classification In Critical Care

Feixiang Zheng, Yu Wu, Cecilia Mascolo, Ting Dang

Main category: cs.LG

TL;DR: LLMs show promise for ICU time series but struggle with irregular data; encoder design matters more than alignment, but LLMs are computationally expensive and perform poorly in data-scarce settings.

DetailsMotivation: To investigate how well LLMs handle irregular ICU time series data with high missing rates, and understand which components (encoder vs alignment) are most critical for success.

Method: Established systematic testbed to evaluate time series encoders and multimodal alignment strategies across state-of-the-art LLM-based methods on benchmark ICU datasets, comparing against strong supervised and self-supervised baselines.
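
"Irregularity-aware encoder" covers a family of designs; one common ingredient (GRU-D-style, used here purely as an illustrative stand-in, not as the paper's encoder) is decaying the hidden state by the elapsed time since the last observation:

```python
import torch
import torch.nn as nn

class TimeDecayGRUCell(nn.Module):
    """GRU-D-style sketch: decay the hidden state toward zero as the gap
    since the previous observation grows, then apply a GRU update."""
    def __init__(self, input_dim: int, hidden_dim: int):
        super().__init__()
        self.decay = nn.Linear(1, hidden_dim)
        self.cell = nn.GRUCell(input_dim, hidden_dim)

    def forward(self, x_t, h, delta_t):        # delta_t: (batch, 1) time gaps
        gamma = torch.exp(-torch.relu(self.decay(delta_t)))
        return self.cell(x_t, gamma * h)       # decay state, then update
```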

Result: Encoder design is more critical than alignment: irregularity-aware encoders achieved a 12.8% AUPRC improvement over the vanilla Transformer, while the best alignment strategy gave a modest 2.9% improvement. However, LLMs require 10× longer training than the best irregular supervised models while delivering only comparable performance, and they underperform in few-shot learning.

Conclusion: LLMs show promise for irregular ICU time series but have current limitations: computational inefficiency, poor performance in data-scarce settings, and need for specialized encoders that handle irregularity effectively.

Abstract: Time series data from the Intensive Care Unit (ICU) provides critical information for patient monitoring. While recent advancements in applying Large Language Models (LLMs) to time series modeling (TSM) have shown great promise, their effectiveness on the irregular ICU data, characterized by particularly high rates of missing values, remains largely unexplored. This work investigates two key components underlying the success of LLMs for TSM: the time series encoder and the multimodal alignment strategy. To this end, we establish a systematic testbed to evaluate their impact across various state-of-the-art LLM-based methods on benchmark ICU datasets against strong supervised and self-supervised baselines. Results reveal that the encoder design is more critical than the alignment strategy. Encoders that explicitly model irregularity achieve substantial performance gains, yielding an average AUPRC increase of 12.8% over the vanilla Transformer. While less impactful, the alignment strategy is also noteworthy, with the best-performing semantically rich, fusion-based strategy achieving a modest 2.9% improvement over cross-attention. However, LLM-based methods require at least 10× longer training than the best-performing irregular supervised models, while delivering only comparable performance. They also underperform in data-scarce few-shot learning settings. These findings highlight both the promise and current limitations of LLMs for irregular ICU time series. The code is available at https://github.com/mHealthUnimelb/LLMTS.

[261] DANCE: Dynamic, Available, Neighbor-gated Condensation for Federated Text-Attributed Graphs

Zekai Chen, Haodong Lu, Xunkai Li, Henan Sun, Jia Li, Hongchao Qin, Rong-Hua Li, Guoren Wang

Main category: cs.LG

TL;DR: DANCE introduces a new TAG-FGL paradigm with graph condensation that addresses overhead, suboptimal performance, and interpretability issues through round-wise condensation refresh and provenance-preserving evidence packs.

DetailsMotivation: Current TAG-FGL methods face three main challenges: (1) High overhead from LLM processing of long texts, (2) Suboptimal performance from one-shot graph condensation that lacks client adaptation, and (3) Poor interpretability where LLM-based condensation creates black-box summaries without faithful attribution to source texts.

Method: DANCE proposes round-wise, model-in-the-loop condensation refresh using the latest global model to improve performance, and preserves provenance through locally inspectable evidence packs that trace predictions to selected neighbors and source text spans.

Result: Across 8 TAG datasets, DANCE improves accuracy by 2.33% at an 8% condensation ratio, with 33.42% fewer tokens than baselines.

Conclusion: DANCE successfully addresses the key challenges in TAG-FGL by providing an efficient, adaptive, and interpretable framework that balances performance improvements with reduced computational overhead and enhanced transparency.

Abstract: Federated graph learning (FGL) enables collaborative training on graph data across multiple clients. With the rise of large language models (LLMs), textual attributes in FGL graphs are gaining attention. Text-attributed graph federated learning (TAG-FGL) improves FGL by explicitly leveraging LLMs to process and integrate these textual features. However, current TAG-FGL methods face three main challenges: (1) Overhead. LLMs for processing long texts incur high token and computation costs. To make TAG-FGL practical, we introduce graph condensation (GC) to reduce computation load, but this choice also brings new issues. (2) Suboptimal. To reduce LLM overhead, we introduce GC into TAG-FGL by compressing multi-hop texts/neighborhoods into a condensed core with fixed LLM surrogates. However, this one-shot condensation is often not client-adaptive, leading to suboptimal performance. (3) Interpretability. LLM-based condensation further introduces a black-box bottleneck: summaries lack faithful attribution and clear grounding to specific source spans, making local inspection and auditing difficult. To address the above issues, we propose DANCE, a new TAG-FGL paradigm with GC. To improve suboptimal performance, DANCE performs round-wise, model-in-the-loop condensation refresh using the latest global model. To enhance interpretability, DANCE preserves provenance by storing locally inspectable evidence packs that trace predictions to selected neighbors and source text spans. Across 8 TAG datasets, DANCE improves accuracy by 2.33% at an 8% condensation ratio, with 33.42% fewer tokens than baselines.

[262] Beyond Superficial Unlearning: Sharpness-Aware Robust Erasure of Hallucinations in Multimodal LLMs

Xianya Fang, Feiyang Ren, Xiang Chen, Yu Tian, Zhen Bi, Haiyang Yu, Sheng-Jun Huang

Main category: cs.LG

TL;DR: SARE addresses structural fragility in multimodal LLM hallucination unlearning by using targeted min-max optimization and landscape flattening for robust erasure.

DetailsMotivation: Multimodal LLMs suffer from object hallucinations that harm reliability, and current unlearning methods have structural fragility - they achieve only superficial suppression that catastrophically resurges after lightweight relearning.

Method: Proposes SARE (Structural-Aware Robust Erasure) which casts unlearning as a targeted min-max optimization problem and uses Targeted-SAM mechanism to explicitly flatten the loss landscape around hallucinated concepts, suppressing hallucinations under worst-case parameter perturbations.
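
Targeted-SAM is the paper's variant; the vanilla sharpness-aware minimization step it modifies is sketched below, with two gradient evaluations per update (the first finds the worst-case weight perturbation within radius `rho`). The `loss_fn(model, batch)` signature is an assumption for illustration.

```python
import torch

def sam_step(model, loss_fn, batch, optimizer, rho=0.05):
    """Vanilla SAM: perturb weights toward the local worst case, take the
    gradient there, then update the original weights."""
    loss_fn(model, batch).backward()
    grads = [p.grad.clone() for p in model.parameters()]
    norm = torch.sqrt(sum((g ** 2).sum() for g in grads))
    with torch.no_grad():                       # ascend to the worst case
        for p, g in zip(model.parameters(), grads):
            p.add_(rho * g / (norm + 1e-12))
    model.zero_grad()
    loss_fn(model, batch).backward()            # gradient at perturbed point
    with torch.no_grad():                       # restore original weights
        for p, g in zip(model.parameters(), grads):
            p.sub_(rho * g / (norm + 1e-12))
    optimizer.step()
    optimizer.zero_grad()
```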

Result: SARE significantly outperforms baselines in erasure efficacy while preserving general generation quality, and maintains persistent hallucination suppression against relearning and parameter updates.

Conclusion: Geometric stabilization through targeted min-max optimization and loss landscape flattening effectively addresses structural fragility in hallucination unlearning, ensuring robust removal stable against weight shifts.

Abstract: Multimodal LLMs are powerful but prone to object hallucinations, which describe non-existent entities and harm reliability. While recent unlearning methods attempt to mitigate this, we identify a critical flaw: structural fragility. We empirically demonstrate that standard erasure achieves only superficial suppression, trapping the model in sharp minima where hallucinations catastrophically resurge after lightweight relearning. To ensure geometric stability, we propose SARE, which casts unlearning as a targeted min-max optimization problem and uses a Targeted-SAM mechanism to explicitly flatten the loss landscape around hallucinated concepts. By suppressing hallucinations under simulated worst-case parameter perturbations, our framework ensures robust removal stable against weight shifts. Extensive experiments demonstrate that SARE significantly outperforms baselines in erasure efficacy while preserving general generation quality. Crucially, it maintains persistent hallucination suppression against relearning and parameter updates, validating the effectiveness of geometric stabilization.

[263] A Collision-Free Hot-Tier Extension for Engram-Style Conditional Memory: A Controlled Study of Training Dynamics

Tao Lin

Main category: cs.LG

TL;DR: Collision-free Engram-Nine design doesn’t improve validation loss despite eliminating key collisions; collisions may provide beneficial regularization and gating issues are the real bottleneck.

DetailsMotivation: To investigate whether high-frequency key collisions are the primary bottleneck in Engram-style conditional memory systems, and to understand if eliminating collisions would improve performance.

Method: Introduced Engram-Nine, a collision-free hot-tier extension using Minimal Perfect Hash Function (MPHF) for frequent n-grams while keeping original multi-head hashed lookup as cold tier. Used iso-parameter setup and route-stratified evaluation to decompose per-token loss into hot/cold contributions.
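
Python's standard library has no MPHF, so in the sketch below a plain dict stands in for the collision-free hot tier; the cold tier keeps a collision-prone modular hash, as in the baseline. `cold_size` is an arbitrary choice.

```python
class TwoTierNgramMemory:
    """Sketch of the hot/cold split: the most frequent n-grams get
    collision-free slots (dict in place of an MPHF); the rest fall back
    to a hashed table where distinct n-grams may collide."""
    def __init__(self, hot_ngrams, cold_size=2**16):
        self.hot = {ng: i for i, ng in enumerate(hot_ngrams)}  # exact slots
        self.cold_size = cold_size

    def lookup(self, ngram):
        if ngram in self.hot:                      # hot tier: no collisions
            return ("hot", self.hot[ngram])
        return ("cold", hash(ngram) % self.cold_size)  # may collide
```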

Result: Collision-free design didn’t consistently improve validation loss. Found “hot-to-cold advantage flip” during training where cold positions eventually surpass hot positions. Collision-free configurations flipped earlier, suggesting collisions act as implicit regularization. Identified gating mismatch where gate persists in favoring hot positions even after they have higher loss.

Conclusion: Improving lookup precision alone doesn’t guarantee better training outcomes. The dominant limitation may be in gating credit assignment rather than index accuracy, and collision-induced noise may provide beneficial regularization that shouldn’t be naively eliminated.

Abstract: We investigate whether high-frequency key collisions are a primary bottleneck in Engram-style conditional memory. To isolate the effect of collisions, we introduce Engram-Nine, a collision-free hot-tier extension that maps the most frequent n-grams through a Minimal Perfect Hash Function (MPHF) while retaining the original multi-head hashed lookup as a cold tier. Under a strictly iso-parameter setup, the collision-free design does not consistently improve validation loss. Through route-stratified evaluation (decomposing per-token loss into hot/cold contributions), we uncover a consistent “hot-to-cold advantage flip” during training: hot (high-frequency) positions initially have lower loss, but cold positions eventually surpass them. Crucially, collision-free configurations flip earlier than collision-prone baselines, suggesting that collisions act as implicit regularization. We also identify a gating mismatch: the gate learns to favor hot positions early in training, but this preference persists even after the flip, assigning higher weights to positions with higher loss. Our findings suggest that improving lookup precision alone does not guarantee better training outcomes. The dominant limitation may lie in gating credit assignment rather than index accuracy, and collision-induced noise may provide beneficial regularization that should not be naively eliminated.

[264] Understanding and Improving UMAP with Geometric and Topological Priors: The JORC-UMAP Algorithm

Xiaobin Li, Run Zhang

Main category: cs.LG

TL;DR: JORC-UMAP enhances UMAP by incorporating Ollivier-Ricci curvature and Jaccard similarity priors to reduce topological tearing and structural collapse in dimensionality reduction.

DetailsMotivation: UMAP's local Euclidean distance assumption often fails to capture intrinsic manifold geometry, causing topological tearing and structural collapse. The sensitivity to k-nearest neighbor graphs leads to inaccurate representations of true manifold structure.

Method: Introduces Ollivier-Ricci curvature as a geometric prior to reinforce edges at geometric bottlenecks and reduce redundant links. Also incorporates a topological prior using Jaccard similarity to ensure neighborhood consistency and handle noise sensitivity in curvature estimation.
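
The Jaccard prior is straightforward to sketch on a kNN graph: score each edge by the overlap of its endpoints' neighbor sets and treat low-overlap edges as suspect. The Ollivier-Ricci term and the exact reweighting rule are the paper's design and are omitted here.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def jaccard_edge_weights(X, k=15):
    """For each kNN edge (i, j): weight = |N(i) & N(j)| / |N(i) | N(j)|.
    Edges whose endpoints share few neighbors are treated as suspect."""
    nbrs = NearestNeighbors(n_neighbors=k + 1).fit(X)
    _, idx = nbrs.kneighbors(X)
    neigh = [set(row[1:]) for row in idx]          # drop self at position 0
    weights = {}
    for i, Ni in enumerate(neigh):
        for j in Ni:
            weights[(i, j)] = len(Ni & neigh[j]) / len(Ni | neigh[j])
    return weights
```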

Result: JORC-UMAP reduces tearing and collapse more effectively than standard UMAP and other dimensionality reduction methods, as measured by SVM accuracy and triplet preservation scores, while maintaining computational efficiency.

Conclusion: The work offers a geometry-aware enhancement to UMAP that better distinguishes true manifold structure from spurious connections, leading to more faithful data visualization.

Abstract: Nonlinear dimensionality reduction techniques, particularly UMAP, are widely used for visualizing high-dimensional data. However, UMAP’s local Euclidean distance assumption often fails to capture intrinsic manifold geometry, leading to topological tearing and structural collapse. We identify UMAP’s sensitivity to the k-nearest neighbor graph as a key cause. To address this, we introduce Ollivier-Ricci curvature as a geometric prior, reinforcing edges at geometric bottlenecks and reducing redundant links. Since curvature estimation is noise-sensitive, we also incorporate a topological prior using Jaccard similarity to ensure neighborhood consistency. The resulting method, JORC-UMAP, better distinguishes true manifold structure from spurious connections. Experiments on synthetic and real-world datasets show that JORC-UMAP reduces tearing and collapse more effectively than standard UMAP and other DR methods, as measured by SVM accuracy and triplet preservation scores, while maintaining computational efficiency. This work offers a geometry-aware enhancement to UMAP for more faithful data visualization.

[265] Process-Tensor Tomography of SGD: Measuring Non-Markovian Memory via Back-Flow of Distinguishability

Vasileios Sevetlidis, George Pavlidis

Main category: cs.LG

TL;DR: Proposes neural training as a process tensor and introduces a model-agnostic witness for training memory based on back-flow of distinguishability, showing practical SGD deviates from Markov idealization.

DetailsMotivation: To provide a principled diagnostic for training memory in neural networks, moving beyond the Markov idealization of SGD and offering empirical evidence that data order matters in practical training scenarios.

Method: Frames training as a process tensor mapping controllable instruments to model observables. Uses a two-step protocol comparing outcome distributions after one vs. two interventions, measuring distinguishability increase (Δ_BF) with TV/JS/Hellinger distances on softmax predictions over a fixed probe set.
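
The witness itself is cheap to compute; for instance, with total variation as the distinguishability measure on softmax outputs over a fixed probe set:

```python
import numpy as np

def total_variation(p, q):
    """Mean TV distance between two sets of softmax outputs;
    p, q have shape (n_probe, n_classes)."""
    return 0.5 * np.abs(p - q).sum(axis=1).mean()

def backflow(p1, q1, p2, q2):
    """Delta_BF = D2 - D1; a positive value certifies non-Markovian
    training memory under the two-step intervention protocol."""
    return total_variation(p2, q2) - total_variation(p1, q1)
```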

Result: Consistent positive back-flow with tight bootstrap confidence intervals, amplification under higher momentum, larger batch overlap, and more micro-steps, and collapse under causal break (resetting optimizer state), directly attributing the effect to optimizer/data-state memory.

Conclusion: Provides a measurement contribution: a principled diagnostic showing practical SGD deviates from Markov idealization, with robust witness requiring no architectural changes. Framework enables comparison of optimizers, curricula, and schedules through induced training memory.

Abstract: This work proposes neural training as a process tensor: a multi-time map that takes a sequence of controllable instruments (batch choices, augmentations, optimizer micro-steps) and returns an observable of the trained model. Building on this operational lens, we introduce a simple, model-agnostic witness of training memory based on back-flow of distinguishability. In a controlled two-step protocol, we compare outcome distributions after one intervention versus two; the increase $\Delta_{\mathrm{BF}} = D_2 - D_1 > 0$ (with $D \in \{\mathrm{TV}, \mathrm{JS}, \mathrm{H}\}$ measured on softmax predictions over a fixed probe set) certifies non-Markovianity. We observe consistent positive back-flow with tight bootstrap confidence intervals, amplification under higher momentum, larger batch overlap, and more micro-steps, and collapse under a causal break (resetting optimizer state), directly attributing the effect to optimizer/data-state memory. The witness is robust across TV/JS/Hellinger, inexpensive to compute, and requires no architectural changes. We position this as a measurement contribution: a principled diagnostic and empirical evidence that practical SGD deviates from the Markov idealization. An exploratory case study illustrates how the micro-level signal can inform curriculum orderings. “Data order matters” becomes a testable operator with confidence bounds, and our framework offers a common stage to compare optimizers, curricula, and schedules through their induced training memory.

[266] Predicting Startup Success Using Large Language Models: A Novel In-Context Learning Approach

Abdurahman Maarouf, Alket Bakiaj, Stefan Feuerriegel

Main category: cs.LG

TL;DR: Proposes kNN-ICL, a novel in-context learning framework using LLMs for startup success prediction that requires no training and works with small labeled datasets, outperforming traditional ML methods.

DetailsMotivation: Predicting early-stage startup success is challenging due to data scarcity in VC firms, which often have information about only a few dozen startups. Traditional ML methods require large labeled datasets, which are unavailable in this domain.

Method: kNN-ICL (k-nearest-neighbor-based in-context learning) framework that selects the most relevant past startups as demonstration examples based on similarity, using LLMs with no model training required.
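
A minimal sketch of the selection step, with an unspecified text-embedding function `embed` and an invented prompt template (the paper's wording is not given here):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def build_knn_icl_prompt(query_text, texts, labels, embed, k=5):
    """kNN-ICL sketch: embed the labeled startup profiles, pick the k most
    similar as in-context demonstrations, and assemble the prompt."""
    emb = np.stack([embed(t) for t in texts])
    nbrs = NearestNeighbors(n_neighbors=k).fit(emb)
    _, idx = nbrs.kneighbors(embed(query_text)[None, :])
    demos = "\n\n".join(
        f"Profile: {texts[i]}\nSuccessful: {labels[i]}" for i in idx[0]
    )
    return f"{demos}\n\nProfile: {query_text}\nSuccessful:"
```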

Result: Using real-world Crunchbase profiles, kNN-ICL achieves higher prediction accuracy than supervised ML baselines and vanilla in-context learning. High balanced accuracy can be achieved with as few as 50 examples.

Conclusion: In-context learning can serve as an effective decision-making tool for VC firms operating in data-scarce environments, providing accurate predictions with minimal labeled data.

Abstract: Venture capital (VC) investments in early-stage startups that end up being successful can yield high returns. However, predicting early-stage startup success remains challenging due to data scarcity (e.g., many VC firms have information about only a few dozen early-stage startups and whether they were successful). This limits the effectiveness of traditional machine learning methods that rely on large labeled datasets for model training. To address this challenge, we propose an in-context learning framework for startup success prediction using large language models (LLMs) that requires no model training and leverages only a small set of labeled startups as demonstration examples. Specifically, we propose a novel k-nearest-neighbor-based in-context learning framework, called kNN-ICL, which selects the most relevant past startups as examples based on similarity. Using real-world profiles from Crunchbase, we find that the kNN-ICL approach achieves higher prediction accuracy than supervised machine learning baselines and vanilla in-context learning. Further, we study how performance varies with the number of in-context examples and find that a high balanced accuracy can be achieved with as few as 50 examples. Together, we demonstrate that in-context learning can serve as a decision-making tool for VC firms operating in data-scarce environments.

[267] Integrating Meteorological and Operational Data: A Novel Approach to Understanding Railway Delays in Finland

Vinicius Pozzobon Borin, Jean Michel de Souza Sant’Ana, Usama Raheel, Nurul Huda Mahmood

Main category: cs.LG

TL;DR: First public dataset integrating Finnish railway operations with synchronized meteorological data (2018-2024) for analyzing train delays, featuring 38.5M observations with 28 engineered features, enabling ML applications in railway reliability research.

DetailsMotivation: Existing datasets rarely integrate meteorological information with operational train data, despite weather's significant impact on railway reliability, particularly in Nordic regions where weather conditions are challenging.

Method: Combines operational metrics from Finland Digitraffic Railway Traffic Service with weather measurements from 209 environmental monitoring stations using spatial-temporal alignment via Haversine distance. Includes strategic missing data handling through spatial fallback algorithms, cyclical encoding of temporal features, and robust scaling of weather data.
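
Two of the preprocessing steps named above are standard and easy to reproduce; a sketch with the usual formulas (Earth radius taken as 6371 km):

```python
import numpy as np

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km, as used to match each train event to
    the nearest of the 209 weather stations."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * np.arcsin(np.sqrt(a))

def cyclical_encode(value, period):
    """Cyclical encoding of temporal features (e.g., hour with period 24),
    keeping 23:00 and 00:00 close in feature space."""
    angle = 2 * np.pi * value / period
    return np.sin(angle), np.cos(angle)
```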

Result: Dataset contains 28 engineered features across 38.5M observations from Finland’s 5,915 km rail network. Analysis reveals winter months exhibit delay rates exceeding 25% with geographic clustering in central/northern Finland. XGBoost baseline achieved MAE of 2.73 minutes for station-specific delay prediction.

Conclusion: The dataset enables diverse applications including train delay prediction, weather impact assessment, and infrastructure vulnerability mapping, providing researchers with a flexible resource for machine learning applications in railway operations research.

Abstract: Train delays result from complex interactions between operational, technical, and environmental factors. While weather impacts railway reliability, particularly in Nordic regions, existing datasets rarely integrate meteorological information with operational train data. This study presents the first publicly available dataset combining Finnish railway operations with synchronized meteorological observations from 2018-2024. The dataset integrates operational metrics from Finland Digitraffic Railway Traffic Service with weather measurements from 209 environmental monitoring stations, using spatial-temporal alignment via Haversine distance. It encompasses 28 engineered features across operational variables and meteorological measurements, covering approximately 38.5 million observations from Finland’s 5,915-kilometer rail network. Preprocessing includes strategic missing data handling through spatial fallback algorithms, cyclical encoding of temporal features, and robust scaling of weather data to address sensor outliers. Analysis reveals distinct seasonal patterns, with winter months exhibiting delay rates exceeding 25% and geographic clustering of high-delay corridors in central and northern Finland. Furthermore, the work demonstrates applications of the data set in analysing the reliability of railway traffic in Finland. A baseline experiment using XGBoost regression achieved a Mean Absolute Error of 2.73 minutes for predicting station-specific delays, demonstrating the dataset’s utility for machine learning applications. The dataset enables diverse applications, including train delay prediction, weather impact assessment, and infrastructure vulnerability mapping, providing researchers with a flexible resource for machine learning applications in railway operations research.

[268] E2Former-V2: On-the-Fly Equivariant Attention with Linear Activation Memory

Lin Huang, Chengxiang Huang, Ziang Wang, Yiyue Du, Chu Wang, Haocheng Lu, Yunyang Li, Xiaoli Liu, Arthur Jiang, Jia Zhang

Main category: cs.LG

TL;DR: E2Former-V2 introduces scalable equivariant graph neural networks using algebraic sparsity and hardware-aware execution to overcome computational bottlenecks in 3D atomistic modeling.

DetailsMotivation: Mainstream EGNN architectures face critical scalability bottlenecks due to explicit construction of geometric features or dense tensor products on every edge, limiting their efficiency for large-scale 3D atomistic systems.

Method: 1) Equivariant Axis-Aligned Sparsification (EAAS) using SO(3)→SO(2) change of basis to transform dense tensor contractions into sparse parity re-indexing operations. 2) On-the-Fly Equivariant Attention implemented via custom fused Triton kernel that eliminates materialized edge tensors and maximizes SRAM utilization.

Result: Achieves 20× improvement in TFLOPS compared to standard implementations while maintaining comparable predictive performance on SPICE and OMol25 datasets. Enables training large equivariant transformers on widely accessible GPU platforms.

Conclusion: E2Former-V2 demonstrates that scalable equivariant transformers can be trained efficiently using algebraic sparsity and hardware-aware execution, overcoming previous computational bottlenecks in 3D atomistic modeling.

Abstract: Equivariant Graph Neural Networks (EGNNs) have become a widely used approach for modeling 3D atomistic systems. However, mainstream architectures face critical scalability bottlenecks due to the explicit construction of geometric features or dense tensor products on every edge. To overcome this, we introduce E2Former-V2, a scalable architecture that integrates algebraic sparsity with hardware-aware execution. We first propose Equivariant Axis-Aligned Sparsification (EAAS). EAAS builds on Wigner-$6j$ convolution by exploiting an $\mathrm{SO}(3) \rightarrow \mathrm{SO}(2)$ change of basis to transform computationally expensive dense tensor contractions into efficient, sparse parity re-indexing operations. Building on this representation, we introduce On-the-Fly Equivariant Attention, a fully node-centric mechanism implemented via a custom fused Triton kernel. By eliminating materialized edge tensors and maximizing SRAM utilization, our kernel achieves a 20× improvement in TFLOPS compared to standard implementations. Extensive experiments on the SPICE and OMol25 datasets demonstrate that E2Former-V2 maintains comparable predictive performance while notably accelerating inference. This work demonstrates that large equivariant transformers can be trained efficiently using widely accessible GPU platforms. The code is available at https://github.com/IQuestLab/UBio-MolFM/tree/e2formerv2.

[269] Dual-Prototype Disentanglement: A Context-Aware Enhancement Framework for Time Series Forecasting

Haonan Yang, Jianchao Tang, Zhuo Li

Main category: cs.LG

TL;DR: DPAD is a model-agnostic auxiliary framework that enhances time series forecasting models by dynamically disentangling complex temporal patterns into common and rare prototypes, then selectively retrieving context-specific patterns to improve forecasting accuracy and reliability.

DetailsMotivation: Current deep learning approaches for time series forecasting often learn static, averaged representations that fail to dynamically disentangle and leverage complex, intertwined temporal patterns, lacking context-aware capabilities.

Method: Proposes DPAD with three key components: 1) Dynamic Dual-Prototype bank (DDP) with common pattern bank (strong temporal priors for trend/seasonal patterns) and rare pattern bank (dynamic memory for infrequent events), 2) Dual-Path Context-aware routing (DPC) mechanism for selective retrieval of context-specific patterns, and 3) Disentanglement-Guided Loss (DGLoss) to ensure specialization and coverage.

Result: Comprehensive experiments show DPAD consistently improves forecasting performance and reliability of state-of-the-art models across diverse real-world benchmarks.

Conclusion: DPAD successfully addresses the limitation of static representations in time series forecasting by providing a model-agnostic framework for pattern disentanglement and context-aware adaptation, enhancing existing models’ capabilities without architectural modifications.

Abstract: Time series forecasting has witnessed significant progress with deep learning. While prevailing approaches enhance forecasting performance by modifying architectures or introducing novel enhancement strategies, they often fail to dynamically disentangle and leverage the complex, intertwined temporal patterns inherent in time series, thus resulting in the learning of static, averaged representations that lack context-aware capabilities. To address this, we propose the Dual-Prototype Adaptive Disentanglement framework (DPAD), a model-agnostic auxiliary method that equips forecasting models with the ability of pattern disentanglement and context-aware adaptation. Specifically, we construct a Dynamic Dual-Prototype bank (DDP), comprising a common pattern bank with strong temporal priors to capture prevailing trend or seasonal patterns, and a rare pattern bank dynamically memorizing critical yet infrequent events. A Dual-Path Context-aware routing (DPC) mechanism is then proposed to enhance outputs with selectively retrieved context-specific pattern representations from the DDP. Additionally, we introduce a Disentanglement-Guided Loss (DGLoss) to ensure that each prototype bank specializes in its designated role while maintaining comprehensive coverage. Comprehensive experiments demonstrate that DPAD consistently improves forecasting performance and reliability of state-of-the-art models across diverse real-world benchmarks.

[270] Provably Robust Bayesian Counterfactual Explanations under Model Changes

Jamie Duell, Xiuyi Fan

Main category: cs.LG

TL;DR: PSCE generates counterfactual explanations with probabilistic safety guarantees that remain valid under model updates, ensuring high confidence and low variance predictions.

DetailsMotivation: Existing counterfactual explanations become invalid when machine learning models are updated in real-world settings, creating a need for explanations that remain reliable under model changes.

Method: PSCE uses Bayesian principles to generate δ-safe (high predictive confidence) and ε-robust (low predictive variance) counterfactual explanations with formal probabilistic guarantees. It integrates uncertainty-aware constraints into an optimization framework.

Result: PSCE outperforms state-of-the-art Bayesian CE methods, producing more plausible and discriminative counterfactual explanations that are provably robust under model changes, validated across diverse datasets.

Conclusion: PSCE provides a principled approach to generating counterfactual explanations that maintain validity under model updates through formal probabilistic safety guarantees, addressing a critical limitation of existing methods in dynamic real-world environments.

Abstract: Counterfactual explanations (CEs) offer interpretable insights into machine learning predictions by answering “what if?” questions. However, in real-world settings where models are frequently updated, existing counterfactual explanations can quickly become invalid or unreliable. In this paper, we introduce Probabilistically Safe CEs (PSCE), a method for generating counterfactual explanations that are $\delta$-safe, to ensure high predictive confidence, and $\epsilon$-robust, to ensure low predictive variance. Based on Bayesian principles, PSCE provides formal probabilistic guarantees for CEs under model changes, which are adhered to in what we refer to as the $\langle \delta, \epsilon \rangle$-set. Uncertainty-aware constraints are integrated into our optimization framework and we validate our method empirically across diverse datasets. We compare our approach against state-of-the-art Bayesian CE methods, where PSCE produces counterfactual explanations that are not only more plausible and discriminative, but also provably robust under model change.

[271] Dynamic Expert-Guided Model Averaging for Causal Discovery

Adrick Tench, Thomas Demeester

Main category: cs.LG

TL;DR: A flexible ensemble method for causal discovery that dynamically incorporates expert knowledge (including LLMs) to combine multiple algorithms, addressing real-world challenges that violate standard assumptions.

DetailsMotivation: Causal discovery is crucial for healthcare but faces two main challenges: 1) too many algorithms without clear best choice, making ensembling natural, and 2) real-world scenarios often violate algorithm assumptions, forcing heavy reliance on expert knowledge.

Method: A flexible model averaging method that leverages dynamically requested expert knowledge to ensemble diverse causal discovery algorithms. Inspired by recent work on dynamic expert knowledge and LLMs as experts.

Result: Experiments show efficacy with imperfect experts (including LLMs) on both clean and noisy data. Analysis includes impact of different degrees of expert correctness and assessment of LLM capabilities for clinical causal discovery.

Conclusion: The method provides valuable insights for practitioners by combining algorithmic diversity with expert knowledge, particularly useful in healthcare where real-world data often violates standard causal discovery assumptions.

Abstract: Understanding causal relationships is critical for healthcare. Accurate causal models provide a means to enhance the interpretability of predictive models, and furthermore a basis for counterfactual and interventional reasoning and the estimation of treatment effects. However, would-be practitioners of causal discovery face a dizzying array of algorithms without a clear best choice. This abundance of competitive algorithms makes ensembling a natural choice for practical applications. At the same time, real-world use cases frequently face challenges that violate the assumptions of common causal discovery algorithms, forcing heavy reliance on expert knowledge. Inspired by recent work on dynamically requested expert knowledge and LLMs as experts, we present a flexible model averaging method leveraging dynamically requested expert knowledge to ensemble a diverse array of causal discovery algorithms. Experiments demonstrate the efficacy of our method with imperfect experts such as LLMs on both clean and noisy data. We also analyze the impact of different degrees of expert correctness and assess the capabilities of LLMs for clinical causal discovery, providing valuable insights for practitioners.

[272] Uncertainty propagation through trained multi-layer perceptrons: Exact analytical results

Andrew Thompson, Miles McCrory

Main category: cs.LG

TL;DR: Exact analytical expressions for mean and variance of single-hidden-layer ReLU MLP outputs with Gaussian inputs, avoiding series expansions.

DetailsMotivation: Previous methods for uncertainty propagation in neural networks typically rely on approximations or series expansions. The authors aim to provide exact analytical expressions for statistical moments of MLP outputs when inputs follow Gaussian distributions.

Method: Develop mathematical derivations for exact expressions of mean and variance in single-hidden-layer MLPs with ReLU activations when inputs are multivariate Gaussian. The approach avoids series expansions used in previous work.

Result: Successfully derived exact analytical expressions for both mean and variance of MLP outputs with Gaussian inputs. These expressions provide closed-form solutions without approximation errors from series truncation.

Conclusion: The paper provides exact analytical tools for uncertainty propagation in ReLU MLPs, offering more accurate results than previous approximation-based methods and enabling better uncertainty quantification in neural network applications.

Abstract: We give analytical results for propagation of uncertainty through trained multi-layer perceptrons (MLPs) with a single hidden layer and ReLU activation functions. More precisely, we give expressions for the mean and variance of the output when the input is multivariate Gaussian. In contrast to previous results, we obtain exact expressions without resort to a series expansion.
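
To make the flavor of such closed-form results concrete, here is the classic single-unit case, which the paper generalizes to full single-hidden-layer MLPs with multivariate Gaussian inputs: for z ~ N(mu, sigma^2), the mean and variance of ReLU(z) have exact expressions in terms of the Gaussian pdf and cdf. A minimal sketch of this scalar warm-up (the paper's multivariate formulae are richer):

```python
import numpy as np
from scipy.stats import norm

def relu_gaussian_moments(mu, sigma):
    """Exact mean and variance of ReLU(z) for z ~ N(mu, sigma^2).

    Classic closed form for a single unit; the paper derives analogous
    exact expressions for single-hidden-layer ReLU MLPs with
    multivariate Gaussian inputs."""
    alpha = mu / sigma
    mean = mu * norm.cdf(alpha) + sigma * norm.pdf(alpha)
    second = (mu**2 + sigma**2) * norm.cdf(alpha) + mu * sigma * norm.pdf(alpha)
    return mean, second - mean**2

# Sanity check against Monte Carlo
rng = np.random.default_rng(0)
z = rng.normal(1.0, 2.0, size=1_000_000)
m, v = relu_gaussian_moments(1.0, 2.0)
print(m, np.maximum(z, 0).mean())   # should agree to ~3 decimals
print(v, np.maximum(z, 0).var())
```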

[273] Calibrated Probabilistic Interpolation for GEDI Biomass

Robin Young, Srinivasan Keshav

Main category: cs.LG

TL;DR: ANPs provide calibrated uncertainty estimates for GEDI biomass mapping by learning spatial covariance functions, outperforming traditional ML methods that fail to handle landscape heterogeneity.

DetailsMotivation: Traditional ML methods (Random Forest, XGBoost) for GEDI biomass mapping treat predictions as independent and fail to produce calibrated uncertainty estimates, especially in heterogeneous landscapes where they conflate ensemble variance with aleatoric uncertainty and ignore local spatial context.

Method: Attentive Neural Processes (ANPs), a probabilistic meta-learning framework that explicitly conditions predictions on local observation sets and geospatial foundation model embeddings. Unlike static ensembles, ANPs learn a flexible spatial covariance function that adapts uncertainty estimates to landscape complexity.

Result: ANPs achieve competitive accuracy while maintaining near-ideal uncertainty calibration across five distinct biomes (Tropical Amazonian forests to Boreal/Alpine ecosystems). The method enables few-shot adaptation, recovering most of the performance gap in cross-region transfer with minimal local data.

Conclusion: ANPs provide a scalable, theoretically rigorous alternative to ensemble variance for continental-scale earth observation, offering calibrated uncertainty estimates that adapt to landscape heterogeneity and enable efficient cross-region transfer learning.

Abstract: Reliable wall-to-wall biomass mapping from NASA’s GEDI mission requires interpolating sparse LiDAR observations across heterogeneous landscapes. While machine learning approaches like Random Forest and XGBoost are standard for this task, they treat spatial predictions of GEDI observations from multispectral or SAR remote sensing data as independent without adapting to the varying difficulty of heterogeneous landscapes. We demonstrate these approaches generally fail to produce calibrated prediction intervals. We identify that this stems from conflating ensemble variance with aleatoric uncertainty and ignoring local spatial context. To resolve this, we introduce Attentive Neural Processes (ANPs), a probabilistic meta-learning framework that explicitly conditions predictions on local observation sets and geospatial foundation model embeddings. Unlike static ensembles, ANPs learn a flexible spatial covariance function, allowing uncertainty estimates to expand in complex landscapes and contract in homogeneous areas. We validate this approach across five distinct biomes ranging from Tropical Amazonian forests to Boreal and Alpine ecosystems, demonstrating that ANPs achieve competitive accuracy while maintaining near-ideal uncertainty calibration. We demonstrate the operational utility of the method through few-shot adaptation, where the model recovers most of the performance gap in cross-region transfer using minimal local data. This work provides a scalable, theoretically rigorous alternative to ensemble variance for continental scale earth observation.
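
The headline claim of near-ideal uncertainty calibration can be checked with a simple coverage diagnostic: for Gaussian predictive distributions, the fraction of ground-truth values falling inside the central p prediction interval should track p at every level. A hedged sketch with synthetic stand-in predictions (the ANP model itself is not shown):

```python
import numpy as np
from scipy.stats import norm

def coverage_curve(y_true, mu, sigma, levels=np.linspace(0.05, 0.95, 19)):
    """Empirical coverage of central Gaussian prediction intervals.

    For a well-calibrated model, the fraction of targets inside the
    central `p` interval is close to `p` at every level.  `mu` and
    `sigma` stand in for per-location predictive means and stddevs."""
    z = np.abs((y_true - mu) / sigma)
    return np.array([(z <= norm.ppf(0.5 + p / 2)).mean() for p in levels])

# Perfectly calibrated synthetic example
rng = np.random.default_rng(1)
mu, sigma = np.zeros(10_000), np.ones(10_000)
y = rng.normal(mu, sigma)
print(coverage_curve(y, mu, sigma))  # ~[0.05, 0.10, ..., 0.95]
```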

[274] The Art of Being Difficult: Combining Human and AI Strengths to Find Adversarial Instances for Heuristics

Henri Nikoleit, Ankit Anand, Anurag Murty Naredla, Heiko Röglin

Main category: cs.LG

TL;DR: Human-LLM collaboration refines FunSearch outputs to achieve state-of-the-art lower bounds for combinatorial optimization heuristics, breaking decade-old barriers.

DetailsMotivation: To demonstrate how human expertise can effectively extrapolate algorithmic insights from LLM-based evolutionary methods to solve open problems in theoretical computer science that have seen little improvement for over a decade.

Method: Collaborative approach combining FunSearch algorithm outputs with human refinement to generate adversarial instances where standard heuristics perform poorly, focusing on iterative improvement of constructions.

Result: Achieved state-of-the-art lower bounds for hierarchical k-median clustering, bin packing, knapsack problem, and a generalization of Lovász’s gasoline problem, breaking long-standing barriers.

Conclusion: LLMs provide critical initial patterns but human expertise is essential for transforming these into mathematically rigorous constructions, highlighting LLMs as strong collaborative tools in mathematics and computer science research.

Abstract: We demonstrate the power of human-LLM collaboration in tackling open problems in theoretical computer science. Focusing on combinatorial optimization, we refine outputs from the FunSearch algorithm [Romera-Paredes et al., Nature 2023] to derive state-of-the-art lower bounds for standard heuristics. Specifically, we target the generation of adversarial instances where these heuristics perform poorly. By iterating on FunSearch’s outputs, we identify improved constructions for hierarchical $k$-median clustering, bin packing, the knapsack problem, and a generalization of Lovász’s gasoline problem - some of these have not seen much improvement for over a decade, despite intermittent attention. These results illustrate how expert oversight can effectively extrapolate algorithmic insights from LLM-based evolutionary methods to break long-standing barriers. Our findings demonstrate that while LLMs provide critical initial patterns, human expertise is essential for transforming these patterns into mathematically rigorous and insightful constructions. This work highlights that LLMs are a strong collaborative tool in mathematics and computer science research.
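
To see what an adversarial instance for a standard heuristic looks like, consider the classic textbook family for First Fit bin packing (not necessarily the construction found in the paper): items presented in increasing size force First Fit to open far more bins than an optimal packing uses.

```python
def first_fit(items, capacity=1.0):
    """Pack items in the given order into the first bin that fits."""
    bins = []
    for x in items:
        for b in bins:
            if sum(b) + x <= capacity + 1e-12:
                b.append(x)
                break
        else:
            bins.append([x])
    return bins

# Classic adversarial family (not the paper's construction): for small
# eps, an optimal packing puts one item of each size per bin (6n bins),
# while First Fit packs the sizes in separate runs and uses
# n + 3n + 6n = 10n bins, a 5/3 asymptotic ratio.
eps, n = 1e-3, 10
items = [1/7 + eps] * (6 * n) + [1/3 + eps] * (6 * n) + [1/2 + eps] * (6 * n)
print(len(first_fit(items)), "bins vs. optimal", 6 * n)
```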

[275] Provably Learning Attention with Queries

Satwik Bhattamishra, Kulin Shah, Michael Hahn, Varun Kanade

Main category: cs.LG

TL;DR: The paper studies learning Transformer parameters from black-box access, showing efficient algorithms for single-head attention but proving multi-head attention is not identifiable without additional assumptions.

DetailsMotivation: To understand the feasibility of learning Transformer parameters from only black-box access to their outputs, which has implications for model extraction, interpretability, and understanding what can be learned from API access.

Method: Develop query-based learning algorithms: 1) Elementary algorithm for single-head attention with O(d²) queries, 2) Randomized algorithm via compressed sensing for low head dimension (O(rd) queries), 3) Extension to noisy oracle access with polynomial queries, and 4) Analysis of multi-head attention identifiability.

Result: Single-head attention parameters can be learned efficiently with O(d²) queries (or O(rd) for low head dimension), even with noisy outputs. However, multi-head attention parameters are not generally identifiable from value queries alone.

Conclusion: While single-head Transformers can be learned efficiently from black-box access, multi-head attention requires additional structural assumptions for identifiability, highlighting fundamental differences in their learnability from output queries.

Abstract: We study the problem of learning Transformer-based sequence models with black-box access to their outputs. In this setting, a learner may adaptively query the oracle with any sequence of vectors and observe the corresponding real-valued output. We begin with the simplest case, a single-head softmax-attention regressor. We show that for a model with width $d$, there is an elementary algorithm to learn the parameters of single-head attention exactly with $O(d^2)$ queries. Further, we show that if there exists an algorithm to learn ReLU feedforward networks (FFNs), then the single-head algorithm can be easily adapted to learn one-layer Transformers with single-head attention. Next, motivated by the regime where the head dimension $r \ll d$, we provide a randomised algorithm that learns single-head attention-based models with $O(rd)$ queries via compressed sensing arguments. We also study robustness to noisy oracle access, proving that under mild norm and margin conditions, the parameters can be estimated to $\varepsilon$ accuracy with a polynomial number of queries even when outputs are only provided up to additive tolerance. Finally, we show that multi-head attention parameters are not identifiable from value queries in general – distinct parameterisations can induce the same input-output map. Hence, guarantees analogous to the single-head setting are impossible without additional structural assumptions.
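
The query model is easy to picture. Below is a hedged sketch of one plausible instantiation of the single-head softmax-attention oracle: the learner submits arbitrary sequences of vectors and observes only a scalar output, while the parameters Q, K, v stay hidden. The actual O(d²) recovery algorithm is in the paper and not reproduced here.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8
# Hidden parameters of the oracle (unknown to the learner)
Q, K = rng.normal(size=(d, d)), rng.normal(size=(d, d))
v = rng.normal(size=d)

def oracle(X):
    """Single-head softmax-attention regressor: the last token attends
    over the whole sequence; the output is a linear readout of the
    attended value.  One plausible form of the paper's query model."""
    scores = (Q @ X[-1]) @ (K @ X.T)          # attention logits
    w = np.exp(scores - scores.max())
    w /= w.sum()
    return v @ (X.T @ w)                      # real-valued output

# The learner may adaptively choose query sequences, e.g. pairs of
# basis vectors, and observe only the scalar outputs.
e = np.eye(d)
print(oracle(np.stack([e[0], e[1]])))
```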

[276] Theory of Minimal Weight Perturbations in Deep Networks and its Applications for Low-Rank Activated Backdoor Attacks

Bethan Evans, Jared Tanner

Main category: cs.LG

TL;DR: The paper analyzes minimal weight perturbations needed to change DNN outputs, applies this to backdoor attacks, and shows low-rank compression can activate latent backdoors while maintaining accuracy.

DetailsMotivation: To understand the minimal parameter changes required to alter DNN outputs, and apply this understanding to analyze precision-modification-activated backdoor attacks and compression thresholds.

Method: Derives exact formulae for minimal norm weight perturbations in single-layer networks, contrasts with multi-layer Lipschitz constant guarantees, and applies these to backdoor attack analysis.

Result: Single-layer exact formulae and multi-layer Lipschitz guarantees are of the same order, revealing how back-propagated margins govern layer-wise sensitivity. Low-rank compression can reliably activate latent backdoors while preserving accuracy.

Conclusion: The analysis provides certifiable guarantees on smallest parameter updates for desired output shifts, establishes provable compression thresholds below which backdoor attacks cannot succeed, and demonstrates practical implications for DNN security.

Abstract: The minimal norm weight perturbations of DNNs required to achieve a specified change in output are derived and the factors determining its size are discussed. These single-layer exact formulae are contrasted with more generic multi-layer Lipschitz constant based robustness guarantees; both are observed to be of the same order which indicates similar efficacy in their guarantees. These results are applied to precision-modification-activated backdoor attacks, establishing provable compression thresholds below which such attacks cannot succeed, and show empirically that low-rank compression can reliably activate latent backdoors while preserving full-precision accuracy. These expressions reveal how back-propagated margins govern layer-wise sensitivity and provide certifiable guarantees on the smallest parameter updates consistent with a desired output shift.
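
The single-layer linear case gives useful intuition for the paper's exact formulae: for a layer y = Wx and a desired output shift delta at input x, the smallest Frobenius-norm weight perturbation is the rank-1 matrix outer(delta, x)/||x||², with norm ||delta||/||x||. A minimal sketch (the paper treats the ReLU case and multi-layer guarantees); note the minimizer is rank one, echoing the low-rank theme of the backdoor analysis:

```python
import numpy as np

def minimal_perturbation(x, delta):
    """Smallest Frobenius-norm dW with (W + dW) @ x == W @ x + delta.

    Row-wise least-norm solution of dW @ x = delta: the rank-1 matrix
    outer(delta, x) / ||x||^2, whose norm is ||delta|| / ||x||."""
    return np.outer(delta, x) / (x @ x)

rng = np.random.default_rng(0)
W = rng.normal(size=(5, 10))
x = rng.normal(size=10)
delta = rng.normal(size=5)
dW = minimal_perturbation(x, delta)
print(np.allclose((W + dW) @ x, W @ x + delta))                  # True
print(np.linalg.norm(dW), np.linalg.norm(delta) / np.linalg.norm(x))  # equal
```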

[277] Multigrade Neural Network Approximation

Shijun Zhang, Zuowei Shen, Yuesheng Xu

Main category: cs.LG

TL;DR: MGDL trains deep networks grade-by-grade, freezing previous layers and training new residual blocks to reduce approximation error, providing theoretical guarantees for vanishing error.

DetailsMotivation: Training very deep neural networks is challenging due to non-convex, ill-conditioned optimization landscapes, while shallow networks (especially one-hidden-layer ReLU models) admit convex reformulations with global guarantees. This motivates a structured approach that combines stability with depth.

Method: Multigrade Deep Learning (MGDL) trains networks grade by grade: previously learned grades are frozen, and each new residual block is trained solely to reduce remaining approximation error. This creates an interpretable hierarchical refinement process with operator-theoretic foundations.

Result: Theoretical proof that for any continuous target function, there exists a fixed-width multigrade ReLU scheme whose residuals decrease strictly across grades and converge uniformly to zero. This is the first rigorous guarantee that grade-wise training yields provable vanishing approximation error in deep networks.

Conclusion: MGDL provides a principled framework for structured error refinement in deep networks with theoretical guarantees of convergence, addressing the optimization challenges of deep architectures while maintaining interpretability and stability.

Abstract: We study multigrade deep learning (MGDL) as a principled framework for structured error refinement in deep neural networks. While the approximation power of neural networks is now relatively well understood, training very deep architectures remains challenging due to highly non-convex and often ill-conditioned optimization landscapes. In contrast, for relatively shallow networks, most notably one-hidden-layer $\texttt{ReLU}$ models, training admits convex reformulations with global guarantees, motivating learning paradigms that improve stability while scaling to depth. MGDL builds upon this insight by training deep networks grade by grade: previously learned grades are frozen, and each new residual block is trained solely to reduce the remaining approximation error, yielding an interpretable and stable hierarchical refinement process. We develop an operator-theoretic foundation for MGDL and prove that, for any continuous target function, there exists a fixed-width multigrade $\texttt{ReLU}$ scheme whose residuals decrease strictly across grades and converge uniformly to zero. To the best of our knowledge, this work provides the first rigorous theoretical guarantee that grade-wise training yields provable vanishing approximation error in deep networks. Numerical experiments further illustrate the theoretical results.
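
A minimal sketch of the grade-wise idea, read here additively: each grade is a small ReLU block trained to fit whatever error the frozen earlier grades leave behind, so the residual shrinks grade by grade. This illustrates the training scheme only, not the paper's exact compositional construction:

```python
import torch
import torch.nn as nn

def train_multigrade(x, y, n_grades=4, width=64, steps=1000):
    """Grade-by-grade training: each grade is a small ReLU block fit to
    the residual left by the previous (frozen) grades.  An additive
    reading of MGDL, for illustration only."""
    grades, residual = [], y.clone()
    for _ in range(n_grades):
        block = nn.Sequential(nn.Linear(x.shape[1], width), nn.ReLU(),
                              nn.Linear(width, 1))
        opt = torch.optim.Adam(block.parameters(), lr=1e-3)
        for _ in range(steps):
            opt.zero_grad()
            loss = ((block(x) - residual) ** 2).mean()
            loss.backward()
            opt.step()
        for p in block.parameters():      # freeze the finished grade
            p.requires_grad_(False)
        residual = residual - block(x).detach()
        grades.append(block)
        print(f"residual RMS: {residual.pow(2).mean().sqrt().item():.4f}")
    return grades

x = torch.linspace(-3, 3, 256).unsqueeze(1)
train_multigrade(x, torch.sin(2 * x) + 0.5 * x)
```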

[278] FedSGM: A Unified Framework for Constraint Aware, Bidirectionally Compressed, Multi-Step Federated Optimization

Antesh Upadhyay, Sang Bin Moon, Abolfazl Hashemi

Main category: cs.LG

TL;DR: FedSGM is a unified federated constrained optimization framework that simultaneously addresses functional constraints, communication bottlenecks, local updates, and partial client participation with projection-free primal-only updates.

DetailsMotivation: Federated learning faces four major challenges: handling functional constraints, communication bottlenecks due to limited bandwidth, multiple local updates on clients, and partial client participation. Existing methods don't comprehensively address all these challenges together in constrained optimization settings.

Method: FedSGM builds on the switching gradient method with projection-free primal-only updates, avoiding dual-variable tuning. It incorporates bi-directional error feedback to handle compression, correcting bias while understanding compression-noise/local-update interactions. A soft switching version stabilizes updates near feasibility boundaries.

Result: Theoretical convergence guarantees show averaged iterate achieves O(1/√T) rate with high-probability bounds decoupling optimization from sampling noise. Experimental validation on Neyman-Pearson classification and constrained Markov decision process tasks confirms theoretical results.

Conclusion: FedSGM is the first unified framework for federated constrained optimization addressing all four major FL challenges, establishing a theoretically grounded foundation for constrained federated learning with practical validation.

Abstract: We introduce FedSGM, a unified framework for federated constrained optimization that addresses four major challenges in federated learning (FL): functional constraints, communication bottlenecks, local updates, and partial client participation. Building on the switching gradient method, FedSGM provides projection-free, primal-only updates, avoiding expensive dual-variable tuning or inner solvers. To handle communication limits, FedSGM incorporates bi-directional error feedback, correcting the bias introduced by compression while explicitly understanding the interaction between compression noise and multi-step local updates. We derive convergence guarantees showing that the averaged iterate achieves the canonical $\boldsymbol{\mathcal{O}}(1/\sqrt{T})$ rate, with additional high-probability bounds that decouple optimization progress from sampling noise due to partial participation. Additionally, we introduce a soft switching version of FedSGM to stabilize updates near the feasibility boundary. To our knowledge, FedSGM is the first framework to unify functional constraints, compression, multiple local updates, and partial client participation, establishing a theoretically grounded foundation for constrained federated learning. Finally, we validate the theoretical guarantees of FedSGM via experimentation on Neyman-Pearson classification and constrained Markov decision process (CMDP) tasks.
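
The error-feedback component can be illustrated in isolation: a top-k compressor drops most coordinates of each message, but the dropped mass is accumulated locally and folded into the next round, which removes the compression bias over time. A hedged single-direction sketch (FedSGM applies this bi-directionally and combines it with the switching gradient method):

```python
import numpy as np

def top_k(v, k):
    """Keep the k largest-magnitude entries, zero the rest."""
    out = np.zeros_like(v)
    idx = np.argsort(np.abs(v))[-k:]
    out[idx] = v[idx]
    return out

class ErrorFeedback:
    """Error-feedback compressor: the part of the message lost to
    compression is remembered and re-added next round, so the bias
    vanishes over time.  Shown for one direction only."""
    def __init__(self, dim, k):
        self.e = np.zeros(dim)
        self.k = k

    def compress(self, msg):
        corrected = msg + self.e          # fold in past residual
        sent = top_k(corrected, self.k)
        self.e = corrected - sent         # remember what was dropped
        return sent

rng = np.random.default_rng(0)
ef = ErrorFeedback(dim=100, k=10)
for t in range(5):
    ef.compress(rng.normal(size=100))
    print(t, round(np.linalg.norm(ef.e), 3))   # leftover stays bounded
```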

[279] Embedding-based Crop Type Classification in the Groundnut Basin of Senegal

Madeline C. Lisaius, Srinivasan Keshav, Andrew Blake, Clement Atzberger

Main category: cs.LG

TL;DR: TESSERA geospatial foundation model embeddings outperform other methods for crop type mapping in smallholder regions like Senegal, meeting key criteria for practical application.

DetailsMotivation: Current satellite-based crop mapping methods are inadequate for smallholder farming conditions, creating a need for approaches better suited to these regions where accurate crop type maps are crucial for food security, livelihood support, and climate change mitigation.

Method: Established a four-part evaluation criteria (performance, plausibility, transferability, accessibility) and compared geospatial foundation model embedding approaches (TESSERA and AlphaEarth) against baseline methods for crop type mapping in Senegal’s groundnut basin.

Result: TESSERA-based approach best fulfilled all selection criteria, achieving 28% higher accuracy than the next best method in one temporal transfer example, demonstrating superior effectiveness for crop type classification in Senegal.

Conclusion: TESSERA embeddings represent an effective approach for crop type mapping in smallholder regions like Senegal, addressing the limitations of existing satellite-based methods and meeting practical requirements for real-world application.

Abstract: Crop type maps from satellite remote sensing are important tools for food security, local livelihood support and climate change mitigation in smallholder regions of the world, but most satellite-based methods are not well suited to smallholder conditions. To address this gap, we establish four criteria for a useful embedding-based approach, namely 1) performance, 2) plausibility, 3) transferability, and 4) accessibility, and evaluate geospatial foundation model (FM) embedding-based approaches using TESSERA and AlphaEarth against current baseline methods for a region in the groundnut basin of Senegal. We find that the TESSERA-based approach to land cover and crop type mapping fulfills the selection criteria best, and in one temporal transfer example shows 28% higher accuracy than the next best method. These results indicate that TESSERA embeddings are an effective approach for crop type classification and mapping tasks in Senegal.
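
In practice, evaluating a foundation-model embedding for this task can be as simple as fitting a lightweight classifier on precomputed per-pixel vectors. A hedged sketch with placeholder arrays standing in for TESSERA or AlphaEarth embeddings and field-survey labels (the embedding extraction itself is not shown):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

# Placeholder data standing in for precomputed per-pixel embeddings
# and survey crop labels; real vectors would come from TESSERA/AlphaEarth.
rng = np.random.default_rng(0)
emb = rng.normal(size=(5000, 128))
labels = rng.integers(0, 4, size=5000)   # e.g. groundnut/millet/fallow/other

Xtr, Xte, ytr, yte = train_test_split(emb, labels, test_size=0.3,
                                      random_state=0)
probe = LogisticRegression(max_iter=1000).fit(Xtr, ytr)
print(classification_report(yte, probe.predict(Xte)))
```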

[280] GRIP: Algorithm-Agnostic Machine Unlearning for Mixture-of-Experts via Geometric Router Constraints

Andy Zhu, Rongzhe Wei, Yupu Gu, Pan Li

Main category: cs.LG

TL;DR: GRIP is a framework that prevents machine unlearning methods from exploiting MoE router vulnerabilities by constraining router updates to preserve routing stability while allowing expert parameter modification for true knowledge erasure.

DetailsMotivation: Existing machine unlearning methods fail for Mixture-of-Experts (MoE) architectures because they exploit router vulnerabilities rather than truly erasing knowledge, causing utility loss and superficial forgetting.

Method: GRIP uses a geometric constraint that projects router gradient updates into expert-specific null-spaces, decoupling routing stability from parameter rigidity. This prevents router manipulation while allowing expert parameter updates for genuine unlearning.

Result: GRIP achieves over 95% routing stability across all tested unlearning methods while preserving model utility, effectively adapting existing unlearning research from dense to MoE architectures.

Conclusion: GRIP provides an algorithm-agnostic framework that prevents exploitation of MoE router vulnerabilities, enabling effective machine unlearning for MoE architectures while maintaining routing stability and model utility.

Abstract: Machine unlearning (MU) for large language models has become critical for AI safety, yet existing methods fail to generalize to Mixture-of-Experts (MoE) architectures. We identify that traditional unlearning methods exploit MoE’s architectural vulnerability: they manipulate routers to redirect queries away from knowledgeable experts rather than erasing knowledge, causing a loss of model utility and superficial forgetting. We propose Geometric Routing Invariance Preservation (GRIP), an algorithm-agnostic framework for unlearning for MoE. Our core contribution is a geometric constraint, implemented by projecting router gradient updates into an expert-specific null-space. Crucially, this decouples routing stability from parameter rigidity: while discrete expert selections remain stable for retained knowledge, the continuous router parameters remain plastic within the null space, allowing the model to undergo necessary internal reconfiguration to satisfy unlearning objectives. This forces the unlearning optimization to erase knowledge directly from expert parameters rather than exploiting the superficial router manipulation shortcut. GRIP functions as an adapter, constraining router parameter updates without modifying the underlying unlearning algorithm. Extensive experiments on large-scale MoE models demonstrate that our adapter eliminates expert selection shift (achieving over 95% routing stability) across all tested unlearning methods while preserving their utility. By preventing existing algorithms from exploiting MoE model’s router vulnerability, GRIP adapts existing unlearning research from dense architectures to MoEs.
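
The geometric constraint has a compact linear-algebra core: if U spans the retained data's hidden states, projecting the router gradient through P = I - UU^T guarantees the update cannot change routing logits on that data to first order. A simplified single-router sketch (the paper uses expert-specific null-spaces):

```python
import numpy as np

def null_space_projector(H_retain):
    """P = I - U U^T, with U an orthonormal basis for the span of the
    retained hidden states H_retain (n x d).  A gradient projected
    through P cannot change routing logits on retained data:
    (G @ P) @ h == 0 for every h in span(H_retain)."""
    U, s, _ = np.linalg.svd(H_retain.T, full_matrices=False)
    U = U[:, s > 1e-8 * s[0]]
    return np.eye(H_retain.shape[1]) - U @ U.T

rng = np.random.default_rng(0)
d, n_experts = 64, 8
# Retained-data activations lying in a 16-dimensional subspace
H = rng.normal(size=(100, 16)) @ rng.normal(size=(16, d))
G = rng.normal(size=(n_experts, d))      # raw router gradient
G_proj = G @ null_space_projector(H)     # constrained update direction
print(np.abs(G_proj @ H.T).max())        # ~0: retained routing untouched
```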

[281] The Trajectory Alignment Coefficient in Two Acts: From Reward Tuning to Reward Learning

Calarina Muslimani, Yunshu Du, Kenta Kawamoto, Kaushik Subramanian, Peter Stone, Peter Wurman

Main category: cs.LG

TL;DR: TAC helps RL practitioners design better reward functions and can be used as a differentiable objective (Soft-TAC) to learn reward models from human preferences.

DetailsMotivation: Reward function design in RL is time-consuming and error-prone. The paper aims to help practitioners specify appropriate reward weights and learn reward models that better capture human preferences.

Method: Two approaches: 1) Using Trajectory Alignment Coefficient (TAC) as a metric to guide manual reward tuning in human studies, and 2) Developing Soft-TAC, a differentiable approximation of TAC, to train reward models directly from human preference data.

Result: Human study showed TAC-guided tuning produced more performant reward functions with lower cognitive workload. Soft-TAC trained reward models captured preference-specific objectives better than standard Cross-Entropy loss, resulting in more distinct behaviors in Gran Turismo 7.

Conclusion: TAC serves as both a practical tool for guiding reward tuning and an effective reward learning objective, addressing the fundamental challenge of reward specification in RL.

Abstract: The success of reinforcement learning (RL) is fundamentally tied to having a reward function that accurately reflects the task objective. Yet, designing reward functions is notoriously time-consuming and prone to misspecification. To address this issue, our first goal is to understand how to support RL practitioners in specifying appropriate weights for a reward function. We leverage the Trajectory Alignment Coefficient (TAC), a metric that evaluates how closely a reward function’s induced preferences match those of a domain expert. To evaluate whether TAC provides effective support in practice, we conducted a human-subject study in which RL practitioners tuned reward weights for Lunar Lander. We found that providing TAC during reward tuning led participants to produce more performant reward functions and report lower cognitive workload relative to standard tuning without TAC. However, the study also underscored that manual reward design, even with TAC, remains labor-intensive. This limitation motivated our second goal: to learn a reward model that maximizes TAC directly. Specifically, we propose Soft-TAC, a differentiable approximation of TAC that can be used as a loss function to train reward models from human preference data. Validated in the racing simulator Gran Turismo 7, reward models trained using Soft-TAC successfully captured preference-specific objectives, resulting in policies with qualitatively more distinct behaviors than models trained with standard Cross-Entropy loss. This work demonstrates that TAC can serve as both a practical tool for guiding reward tuning and a reward learning objective in complex domains.
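
One plausible reading of such a differentiable preference-agreement objective (a guess at the flavor, not the paper's exact Soft-TAC definition, which builds on the earlier TAC metric) is a sigmoid-relaxed pairwise agreement between the learned reward's trajectory ordering and the expert's preferences:

```python
import torch

def soft_tac(returns, pref_pairs, temp=1.0):
    """Differentiable pairwise-agreement surrogate: for each expert
    preference (trajectory i preferred over j), score how strongly the
    learned reward agrees via a sigmoid of the return gap.
    `returns[k]` is the learned reward model's return on trajectory k.
    Hypothetical form, for illustration only."""
    i, j = pref_pairs[:, 0], pref_pairs[:, 1]
    return torch.sigmoid((returns[i] - returns[j]) / temp).mean()

returns = torch.tensor([2.0, 0.5, 1.0], requires_grad=True)
pairs = torch.tensor([[0, 1], [0, 2], [2, 1]])  # expert: traj0 > traj1, ...
loss = -soft_tac(returns, pairs)                # maximize agreement
loss.backward()
print(returns.grad)
```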

[282] Calibrated Similarity for Reliable Geometric Analysis of Embedding Spaces

Nicolas Tacheny

Main category: cs.LG

TL;DR: The paper proposes using isotonic regression to calibrate cosine similarity scores while preserving rank correlation, addressing anisotropy-induced miscalibration without altering embedding geometry.

DetailsMotivation: Raw cosine similarity in pretrained embeddings has strong rank correlation but suffers from anisotropy-induced miscalibration - scores concentrate in a narrow high-similarity band regardless of actual semantic relatedness, limiting interpretability as a quantitative measure.

Method: Use isotonic regression trained on human similarity judgments to construct a monotonic transformation that calibrates cosine similarity scores while preserving rank correlation and local stability.

Result: Achieves near-perfect calibration while preserving rank correlation and local stability (98% across seven perturbation types). The transformation preserves all order-based constructions including angular ordering, nearest neighbors, threshold graphs and quantile-based decisions.

Conclusion: Isotonic calibration restores interpretability of cosine similarity’s absolute values through monotone calibration without altering its ranking properties or requiring recomputation of embeddings, unlike prior space-modification approaches.

Abstract: While raw cosine similarity in pretrained embedding spaces exhibits strong rank correlation with human judgments, anisotropy induces systematic miscalibration of absolute values: scores concentrate in a narrow high-similarity band regardless of actual semantic relatedness, limiting interpretability as a quantitative measure. Prior work addresses this by modifying the embedding space (whitening, contrastive fine-tuning), but such transformations alter geometric structure and require recomputing all embeddings. Using isotonic regression trained on human similarity judgments, we construct a monotonic transformation that achieves near-perfect calibration while preserving rank correlation and local stability (98% across seven perturbation types). Our contribution is not to replace cosine similarity, but to restore interpretability of its absolute values through monotone calibration, without altering its ranking properties. We characterize isotonic calibration as an order-preserving reparameterization and prove that all order-based constructions (angular ordering, nearest neighbors, threshold graphs and quantile-based decisions) are invariant under this transformation.
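
Since the abstract names the method directly, a sketch is straightforward: fit scikit-learn's IsotonicRegression from raw cosine scores to human judgments and apply the learned monotone map. Synthetic data below stands in for the real judgment pairs, so this is the general recipe rather than the paper's exact pipeline:

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression

# Synthetic stand-in: raw cosine scores crowd a narrow high band even
# when human judgments (rescaled to [0, 1]) span the full range.
rng = np.random.default_rng(0)
human = rng.uniform(0, 1, size=500)
cosine = 0.6 + 0.3 * human + rng.normal(0, 0.02, size=500)

iso = IsotonicRegression(out_of_bounds="clip").fit(cosine, human)
calibrated = iso.predict(cosine)

# Monotone map: rank order is preserved, absolute values become meaningful
ranks = lambda a: np.argsort(np.argsort(a))
print(np.corrcoef(ranks(cosine), ranks(calibrated))[0, 1])   # ~1.0
print(cosine.min(), cosine.max(), "->", calibrated.min(), calibrated.max())
```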

[283] Group-realizable multi-group learning by minimizing empirical risk

Navid Ardeshir, Samuel Deng, Daniel Hsu, Jingwen Liu

Main category: cs.LG

TL;DR: Multi-group learning has better sample complexity in group-realizable setting vs agnostic setting, even with infinite groups of finite VC dimension. ERM over group-realizable concepts works but is computationally intractable, suggesting improper learning as alternative.

DetailsMotivation: To understand the sample complexity advantages of multi-group learning in the group-realizable setting compared to the agnostic setting, particularly when dealing with infinite families of groups that have finite VC dimension.

Method: Empirical risk minimization over the class of group-realizable concepts, which may have infinite VC dimension. Also proposes improper learning as an alternative computational approach.

Result: Sample complexity improves in group-realizable setting over agnostic setting, even with infinite groups having finite VC dimension. However, implementing ERM over group-realizable concepts is computationally intractable.

Conclusion: While theoretical sample complexity advantages exist for multi-group learning in group-realizable settings, practical implementation requires alternative approaches like improper learning due to computational intractability of the direct ERM method.

Abstract: The sample complexity of multi-group learning is shown to improve in the group-realizable setting over the agnostic setting, even when the family of groups is infinite so long as it has finite VC dimension. The improved sample complexity is obtained by empirical risk minimization over the class of group-realizable concepts, which itself could have infinite VC dimension. Implementing this approach is also shown to be computationally intractable, and an alternative approach is suggested based on improper learning.

[284] Is BatchEnsemble a Single Model? On Calibration and Diversity of Efficient Ensembles

Anton Zamyatin, Patrick Indri, Sagar Malhotra, Thomas Gärtner

Main category: cs.LG

TL;DR: BatchEnsemble fails to provide meaningful ensemble uncertainty estimates, performing similarly to a single model rather than Deep Ensembles despite lower computational cost.

DetailsMotivation: Need efficient uncertainty estimation for resource-constrained, low-latency applications where Deep Ensembles are too computationally expensive.

Method: BatchEnsemble uses learned rank-1 perturbations to a shared base network to approximate ensemble behavior with lower parameter and memory costs.

Result: BatchEnsemble underperforms Deep Ensembles and closely tracks single model performance on accuracy, calibration, and OOD detection across CIFAR10/10C/SVHN. MNIST analysis shows members are functionally near-identical with limited capacity for distinct predictive modes.

Conclusion: BatchEnsemble behaves more like a single model than a true ensemble, failing to provide meaningful epistemic uncertainty despite its computational efficiency claims.

Abstract: In resource-constrained and low-latency settings, uncertainty estimates must be efficiently obtained. Deep Ensembles provide robust epistemic uncertainty (EU) but require training multiple full-size models. BatchEnsemble aims to deliver ensemble-like EU at far lower parameter and memory cost by applying learned rank-1 perturbations to a shared base network. We show that BatchEnsemble not only underperforms Deep Ensembles but closely tracks a single model baseline in terms of accuracy, calibration and out-of-distribution (OOD) detection on CIFAR10/10C/SVHN. A controlled study on MNIST finds members are near-identical in function and parameter space, indicating limited capacity to realize distinct predictive modes. Thus, BatchEnsemble behaves more like a single model than a true ensemble.
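
For reference, the BatchEnsemble construction under study (following Wen et al., 2020) gives each member the effective weight W * outer(r_i, s_i), computable without materializing per-member matrices. A minimal sketch, ending with the member-disagreement statistic whose collapse the paper documents:

```python
import torch
import torch.nn as nn

class BatchEnsembleLinear(nn.Module):
    """Shared weight W plus per-member rank-1 factors (r_i, s_i):
    member i computes  y = r_i * (W @ (s_i * x)),  equivalent to using
    the weight W * outer(r_i, s_i).  Cost: one shared W plus two
    vectors per member."""
    def __init__(self, d_in, d_out, n_members):
        super().__init__()
        self.W = nn.Linear(d_in, d_out, bias=False)
        self.r = nn.Parameter(torch.randn(n_members, d_out))
        self.s = nn.Parameter(torch.randn(n_members, d_in))

    def forward(self, x, member):
        return self.r[member] * self.W(self.s[member] * x)

layer = BatchEnsembleLinear(16, 4, n_members=4)
x = torch.randn(8, 16)
outs = torch.stack([layer(x, i) for i in range(4)])
print(outs.var(dim=0).mean())  # the member disagreement the paper probes
```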

[285] 3D Molecule Generation from Rigid Motifs via SE(3) Flows

Roman Poletukhin, Marcel Kollovieh, Eike Eberhard, Stephan Günnemann

Main category: cs.LG

TL;DR: 3D molecular structure generation using rigid-body motifs instead of atoms, achieving faster generation and better compression than atom-based methods.

DetailsMotivation: Traditional 3D molecular structure generation works at the atomic level, but molecular graph generation often uses fragments. The paper aims to extend fragmentation ideas to 3D by treating molecules as sets of rigid-body motifs for more efficient generation.

Method: Extends frame-based protein structure generation to general molecules, representing them as sets of rigid-body motifs. Uses SE(3)-equivariant generative modeling for de novo 3D molecule generation from these rigid motifs.

Result: Achieves comparable or superior results to state-of-the-art across benchmarks, with better atom stability on GEOM-Drugs. Shows 2x to 10x reduction in generation steps and 3.5x compression in molecular representations compared to standard atom-based methods.

Conclusion: Rigid-body motif representation enables more efficient 3D molecular generation with fewer steps and better compression while maintaining or improving quality compared to atom-based approaches.

Abstract: Three-dimensional molecular structure generation is typically performed at the level of individual atoms, yet molecular graph generation techniques often consider fragments as their structural units. Building on the advances in frame-based protein structure generation, we extend these fragmentation ideas to 3D, treating general molecules as sets of rigid-body motifs. Utilising this representation, we employ SE(3)-equivariant generative modelling for de novo 3D molecule generation from rigid motifs. In our evaluations, we observe comparable or superior results to state-of-the-art across benchmarks, surpassing it in atom stability on GEOM-Drugs, while yielding a 2x to 10x reduction in generation steps and offering 3.5x compression in molecular representations compared to the standard atom-based methods.

[286] Auto-Regressive Masked Diffusion Models

Mahdi Karami, Ali Ghodsi

Main category: cs.LG

TL;DR: ARMD is a new architecture that combines autoregressive training efficiency with diffusion parallel generation, achieving SOTA performance with fewer training steps and faster parallel inference.

DetailsMotivation: Masked diffusion models have performance gaps compared to autoregressive models and require more training iterations. The authors aim to unify the strengths of both approaches.

Method: Reframe masked diffusion as block-wise causal model, create strictly causal permutation-equivariant architecture, use progressive permutation training, and introduce strided parallel generation strategy.

Result: Achieves state-of-the-art performance on language modeling benchmarks, outperforms diffusion baselines with fewer training steps, and sets new benchmark for parallel text generation.

Conclusion: ARMD successfully bridges the performance gap between parallel and sequential decoding by combining autoregressive efficiency with diffusion parallel generation capabilities.

Abstract: Masked diffusion models (MDMs) have emerged as a promising approach for language modeling, yet they face a performance gap compared to autoregressive models (ARMs) and require more training iterations. In this work, we present the Auto-Regressive Masked Diffusion (ARMD) model, an architecture designed to close this gap by unifying the training efficiency of autoregressive models with the parallel generation capabilities of diffusion-based models. Our key insight is to reframe the masked diffusion process as a block-wise causal model. This perspective allows us to design a strictly causal, permutation-equivariant architecture that computes all conditional probabilities across multiple denoising steps in a single, parallel forward pass. The resulting architecture supports efficient, autoregressive-style decoding and a progressive permutation training scheme, allowing the model to learn both canonical left-to-right and random token orderings. Leveraging this flexibility, we introduce a novel strided parallel generation strategy that accelerates inference by generating tokens in parallel streams while maintaining global coherence. Empirical results demonstrate that ARMD achieves state-of-the-art performance on standard language modeling benchmarks, outperforming established diffusion baselines while requiring significantly fewer training steps. Furthermore, it establishes a new benchmark for parallel text generation, effectively bridging the performance gap between parallel and sequential decoding.

[287] Latent Diffusion for Internet of Things Attack Data Generation in Intrusion Detection

Estela Sánchez-Carballo, Francisco M. Melgarejo-Meseguer, José Luis Rojo-Álvarez

Main category: cs.LG

TL;DR: LDM-based data augmentation improves IoT intrusion detection by addressing class imbalance, outperforming existing methods in F1-score while being computationally efficient.

DetailsMotivation: ML-based IDSs for IoT suffer from class imbalance between benign and attack traffic, and existing data augmentation methods struggle with balancing sample fidelity, diversity, and computational efficiency.

Method: Proposed using Latent Diffusion Model (LDM) for attack data augmentation in IoT intrusion detection, with comprehensive comparison against state-of-the-art baselines on three IoT attack types (DDoS, Mirai, Man-in-the-Middle).

Result: LDM-generated samples substantially improve IDS performance (F1-scores up to 0.99 for DDoS and Mirai), preserve feature dependencies, generate diverse samples, and reduce sampling time by ~25% compared to diffusion models in data space.

Conclusion: Latent diffusion is an effective and scalable solution for synthetic IoT attack data generation that mitigates class imbalance impact in ML-based IDSs for IoT scenarios.

Abstract: Intrusion Detection Systems (IDSs) are a key component for protecting Internet of Things (IoT) environments. However, in Machine Learning-based (ML-based) IDSs, performance is often degraded by the strong class imbalance between benign and attack traffic. Although data augmentation has been widely explored to mitigate this issue, existing approaches typically rely on simple oversampling techniques or generative models that struggle to simultaneously achieve high sample fidelity, diversity, and computational efficiency. To address these limitations, we propose the use of a Latent Diffusion Model (LDM) for attack data augmentation in IoT intrusion detection and provide a comprehensive comparison against state-of-the-art baselines. Experiments were conducted on three representative IoT attack types, specifically Distributed Denial-of-Service (DDoS), Mirai, and Man-in-the-Middle, evaluating both downstream IDS performance and intrinsic generative quality using distributional, dependency-based, and diversity metrics. Results show that balancing the training data with LDM-generated samples substantially improves IDS performance, achieving F1-scores of up to 0.99 for DDoS and Mirai attacks and consistently outperforming competing methods. Additionally, quantitative and qualitative analyses demonstrate that LDMs effectively preserve feature dependencies while generating diverse samples and reduce sampling time by approximately 25% compared to diffusion models operating directly in data space. These findings highlight latent diffusion as an effective and scalable solution for synthetic IoT attack data generation, substantially mitigating the impact of class imbalance in ML-based IDSs for IoT scenarios.

[288] A Scalable Measure of Loss Landscape Curvature for Analyzing the Training Dynamics of LLMs

Dayal Singh Kalra, Jean-Christophe Gagnon-Audet, Andrey Gromov, Ishita Mediratta, Kelvin Niu, Alexander H Miller, Michael Shvartsman

Main category: cs.LG

TL;DR: Critical sharpness (λc) is a computationally efficient alternative to Hessian sharpness that captures curvature dynamics in LLM training with only ~10 forward passes, enabling analysis of phenomena like progressive sharpening and Edge of Stability at scale up to 7B parameters.

DetailsMotivation: Hessian sharpness is crucial for understanding neural network training dynamics but is computationally prohibitive for Large Language Models (LLMs). There's a need for scalable curvature measures that can provide practical insights into training phenomena at scale.

Method: Introduces critical sharpness (λc), a computationally efficient measure requiring fewer than 10 forward passes given the update direction Δθ. Also introduces relative critical sharpness (λc^{1→2}) to analyze curvature when transitioning between different loss landscapes (e.g., pre-training to fine-tuning).

Result: First demonstration of Hessian sharpness phenomena (progressive sharpening, Edge of Stability) at scale up to 7B parameters in OLMo-2 models, spanning both pre-training and mid-training. Shows critical sharpness can guide data mixing strategies and provide actionable insights for large-scale training.

Conclusion: Critical sharpness provides practitioners with a practical tool for diagnosing curvature dynamics and informing data composition choices at scale, demonstrating that scalable curvature measures can offer actionable insights for large-scale training of LLMs.

Abstract: Understanding the curvature evolution of the loss landscape is fundamental to analyzing the training dynamics of neural networks. The most commonly studied measure, Hessian sharpness ($\lambda_{\max}^H$) – the largest eigenvalue of the loss Hessian – determines local training stability and interacts with the learning rate throughout training. Despite its significance in analyzing training dynamics, direct measurement of Hessian sharpness remains prohibitive for Large Language Models (LLMs) due to high computational cost. We analyze $\textit{critical sharpness}$ ($\lambda_c$), a computationally efficient measure requiring fewer than $10$ forward passes given the update direction $\Delta\boldsymbol{\theta}$. Critically, this measure captures well-documented Hessian sharpness phenomena, including progressive sharpening and Edge of Stability. Using this measure, we provide the first demonstration of these sharpness phenomena at scale, up to $7$B parameters, spanning both pre-training and mid-training of OLMo-2 models. We further introduce $\textit{relative critical sharpness}$ ($\lambda_c^{1\to 2}$), which quantifies the curvature of one loss landscape while optimizing another, to analyze the transition from pre-training to fine-tuning and guide data mixing strategies. Critical sharpness provides practitioners with a practical tool for diagnosing curvature dynamics and informing data composition choices at scale. More broadly, our work shows that scalable curvature measures can provide actionable insights for large-scale training.
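
The summary does not spell out how critical sharpness is computed, but a closely related forward-pass-only probe is the directional second derivative of the loss along the update direction, obtainable from three evaluations by central differences. A hedged sketch (the paper's exact definition of critical sharpness may differ):

```python
import numpy as np

def directional_curvature(loss_fn, theta, delta, h=1e-3):
    """Second derivative of the loss along direction `delta`, estimated
    with three forward passes via central differences:
        d^2/dt^2 L(theta + t*d) |_{t=0}
      ~ (L(theta + h*d) - 2 L(theta) + L(theta - h*d)) / h^2,
    with d = delta / ||delta||.  A forward-pass-only curvature probe in
    the spirit of critical sharpness, for illustration only."""
    d = delta / np.linalg.norm(delta)
    return (loss_fn(theta + h * d) - 2 * loss_fn(theta)
            + loss_fn(theta - h * d)) / h**2

# Quadratic sanity check: L(x) = 0.5 x^T A x has curvature d^T A d
A = np.diag([1.0, 4.0, 9.0])
loss = lambda x: 0.5 * x @ A @ x
print(directional_curvature(loss, np.zeros(3), np.array([0.0, 0.0, 1.0])))  # ~9
```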

[289] Vertical Semi-Federated Learning for Efficient Online Advertising

Wenjie Li, Shu-Tao Xia, Jiangke Fan, Teng Zhang, Xingxing Wang

Main category: cs.LG

TL;DR: Proposes Semi-VFL (Vertical Semi-Federated Learning) with Joint Privileged Learning framework to address limitations of traditional VFL in advertising systems, enabling independent local serving while maintaining federated learning advantages.

DetailsMotivation: Traditional vertical federated learning has two main limitations: 1) restricted applicability to overlapped samples only, and 2) high system challenges for real-time federated serving, which hinder its practical use in advertising systems.

Method: Proposes Joint Privileged Learning (JPL) framework with two key components: 1) federated equivalence imitation to alleviate absence of passive party’s features, and 2) cross-branch rank alignment to adapt to heterogeneous full sample space.

Result: Extensive experiments on real-world advertising datasets validate the effectiveness of the proposed method over baseline approaches.

Conclusion: Semi-VFL with JPL provides a practical solution for industrial applications that retains federated learning advantages while supporting independent local serving, overcoming traditional VFL limitations in advertising systems.

Abstract: Traditional vertical federated learning schemes suffer from two main issues: 1) restricted applicable scope to overlapped samples and 2) high system challenge of real-time federated serving, which limits its application to advertising systems. To this end, we advocate a new practical learning setting, Semi-VFL (Vertical Semi-Federated Learning), for real-world industrial applications, where the learned model retains sufficient advantages of federated learning while supporting independent local serving. To achieve this goal, we propose the carefully designed Joint Privileged Learning framework (JPL) to i) alleviate the absence of the passive party’s feature with federated equivalence imitation and ii) adapt to the heterogeneous full sample space with cross-branch rank alignment. Extensive experiments conducted on real-world advertising datasets validate the effectiveness of our method over baseline methods.

[290] Task Aware Dreamer for Task Generalization in Reinforcement Learning

Chengyang Ying, Xinning Zhou, Zhongkai Hao, Hang Su, Songming Liu, Dong Yan, Jun Zhu

Main category: cs.LG

TL;DR: TAD (Task Aware Dreamer) improves RL generalization across tasks with different reward functions by using reward-informed world models and a new TDR metric to measure task relevance.

DetailsMotivation: RL agents need to generalize across tasks with similar dynamics but different reward functions for real-world adaptability. Current methods struggle when tasks differ significantly in reward structure.

Method: TAD integrates reward-informed features into world models to identify consistent latent characteristics across tasks. It computes variational lower bound of sample data log-likelihood with a new term to differentiate tasks using their states. Introduces TDR metric to measure task relevance.

Result: Extensive experiments show TAD significantly improves performance on handling different tasks simultaneously, especially for high-TDR tasks, and displays strong generalization to unseen tasks in both image-based and state-based environments.

Conclusion: Reward-informed world models are crucial for task generalization, particularly when tasks differ significantly (high TDR). TAD demonstrates superior generalization ability compared to Markovian policies.

Abstract: A long-standing goal of reinforcement learning is to acquire agents that can learn on training tasks and generalize well on unseen tasks that may share a similar dynamic but with different reward functions. The ability to generalize across tasks is important as it determines an agent’s adaptability to real-world scenarios where reward mechanisms might vary. In this work, we first show that training a general world model can utilize similar structures in these tasks and help train more generalizable agents. Extending world models into the task generalization setting, we introduce a novel method named Task Aware Dreamer (TAD), which integrates reward-informed features to identify consistent latent characteristics across tasks. Within TAD, we compute the variational lower bound of sample data log-likelihood, which introduces a new term designed to differentiate tasks using their states, as the optimization objective of our reward-informed world models. To demonstrate the advantages of the reward-informed policy in TAD, we introduce a new metric called Task Distribution Relevance (TDR) which quantitatively measures the relevance of different tasks. For tasks exhibiting a high TDR, i.e., the tasks differ significantly, we illustrate that Markovian policies struggle to distinguish them, thus it is necessary to utilize reward-informed policies in TAD. Extensive experiments in both image-based and state-based tasks show that TAD can significantly improve the performance of handling different tasks simultaneously, especially for those with high TDR, and display a strong generalization ability to unseen tasks.

[291] ProSub: Probabilistic Open-Set Semi-Supervised Learning with Subspace-Based Out-of-Distribution Detection

Erik Wallin, Lennart Svensson, Fredrik Kahl, Lars Hammarstrand

Main category: cs.LG

TL;DR: ProSub is a new OSSL framework that uses angle-based scores in feature space for ID/OOD classification and probabilistic predictions via estimated conditional distributions, achieving SOTA performance.

DetailsMotivation: Existing OSSL methods rely on softmax confidence and ad-hoc thresholds for ID/OOD classification without considering problem statistics, lacking probabilistic foundations.

Method: Proposed angle-based score in feature space between data and ID subspace, plus estimation of conditional score distributions for probabilistic ID/OOD predictions.

Result: ProSub achieves state-of-the-art performance on several benchmark OSSL problems, outperforming existing methods.

Conclusion: The ProSub framework provides a principled, probabilistic approach to OSSL with superior performance, addressing limitations of existing threshold-based methods.

Abstract: In open-set semi-supervised learning (OSSL), we consider unlabeled datasets that may contain unknown classes. Existing OSSL methods often use the softmax confidence for classifying data as in-distribution (ID) or out-of-distribution (OOD). Additionally, many works for OSSL rely on ad-hoc thresholds for ID/OOD classification, without considering the statistics of the problem. We propose a new score for ID/OOD classification based on angles in feature space between data and an ID subspace. Moreover, we propose an approach to estimate the conditional distributions of scores given ID or OOD data, enabling probabilistic predictions of data being ID or OOD. These components are put together in a framework for OSSL, termed ProSub, that is experimentally shown to reach SOTA performance on several benchmark problems. Our code is available at https://github.com/walline/prosub.
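
The angle-based score has a simple geometric form: with U an orthonormal basis for the ID subspace, cos(theta) = ||U^T f|| / ||f|| measures how close a feature vector f lies to that subspace. A sketch under the assumption that the subspace is estimated by an SVD of ID features (the paper's estimation procedure may differ):

```python
import numpy as np

def subspace_angle_score(features, U):
    """Angle between each feature vector and the ID subspace spanned by
    the orthonormal columns of U: cos(theta) = ||U^T f|| / ||f||.
    ID data should lie near the subspace (small angle), OOD data far
    from it."""
    proj = features @ U                       # coordinates inside subspace
    cos = np.linalg.norm(proj, axis=1) / np.linalg.norm(features, axis=1)
    return np.degrees(np.arccos(np.clip(cos, 0.0, 1.0)))

rng = np.random.default_rng(0)
basis = rng.normal(size=(128, 10))
id_feats = rng.normal(size=(500, 10)) @ basis.T           # in the subspace
ood_feats = rng.normal(size=(500, 128))                   # generic directions
U, _, _ = np.linalg.svd(id_feats.T, full_matrices=False)  # estimate subspace
U = U[:, :10]
print(subspace_angle_score(id_feats, U).mean())   # ~0 degrees
print(subspace_angle_score(ood_feats, U).mean())  # much larger
```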

[292] Provable Differentially Private Computation of the Cross-Attention Mechanism

Yekun Ke, Yingyu Liang, Zhenmei Shi, Zhao Song, Jiahao Zhang

Main category: cs.LG

TL;DR: First provable differentially private cross-attention mechanism with theoretical guarantees for privacy-preserving large generative models.

DetailsMotivation: Cross-attention modules in AI systems (RAG, system prompting, stable diffusion) handle sensitive user data, creating privacy risks that need to be addressed with provable guarantees.

Method: Novel data structure enforcing differential privacy for cross-attention with polynomial kernel methods, achieving efficient space/time complexity while maintaining robustness against adaptive queries.

Result: Achieves (ε,δ)-DP with specific additive and relative error bounds, with Õ(ndr²) space complexity and Õ(dr²) query time per token.

Conclusion: First provable differentially private cross-attention framework, establishing foundation for privacy-preserving algorithms in large generative models.

Abstract: Cross-attention has emerged as a cornerstone module in modern artificial intelligence, underpinning critical applications such as retrieval-augmented generation (RAG), system prompting, and guided stable diffusion. However, there is a rising concern about securing the privacy of cross-attention, as the underlying key and value matrices frequently encode sensitive data or private user information. In this work, we introduce a novel data structure designed to enforce differential privacy (DP) for cross-attention mechanisms, accompanied by provable theoretical guarantees. Specifically, letting $n$ denote the input sequence length, $d$ the feature dimension, $R$ the maximum magnitude of query and key matrices, $R_w$ the maximum magnitude of the value matrix, and $r, s, \varepsilon_s$ the parameters for polynomial kernel methods, our proposed structure achieves $\widetilde{O}(ndr^2)$ space and initialization complexity, with a query time of $\widetilde{O}(d r^2)$ per token. Moreover, we demonstrate that our mechanism satisfies $(\varepsilon, \delta)$-DP, incurring an additive error of $\widetilde{O}((1-\varepsilon_s)^{-1} n^{-1} \varepsilon^{-1} R^{2s} R_w r^2)$ and a relative error of $2\varepsilon_s/(1-\varepsilon_s)$ with respect to the ground truth. Crucially, our framework maintains robustness against adaptive queries, ensuring security even in adversarial settings. To the best of our knowledge, this constitutes the first approach providing provable differential privacy for cross-attention, establishing a foundation for future privacy-preserving algorithms in large generative models (LGMs).

[293] On Fine-Grained I/O Complexity of Attention Backward Passes

Xiaoyu Li, Yingyu Liang, Zhenmei Shi, Zhao Song, Song Yue, Jiahao Zhang

Main category: cs.LG

TL;DR: Systematic analysis of I/O complexity in attention mechanisms using red-blue pebble game framework, deriving tight bounds across cache sizes, validating FlashAttention’s optimality for large caches, and introducing novel algorithm for small caches.

DetailsMotivation: LLMs handle large context windows efficiently, but attention computation scales quadratically with sequence length, creating efficiency bottlenecks that require I/O-optimized algorithms.

Method: Use red-blue pebble game framework to analyze I/O complexity of attention mechanisms, focusing on backward pass under small and large cache settings. Derive tight bounds across full cache size spectrum and extend to sparse attention.

Result: FlashAttention achieves optimality in large-cache scenarios for both forward and backward passes. Novel algorithm for small-cache environments outperforms contemporary methods and attains theoretical tight bounds. Established granular lower bounds for sparse attention across all cache configurations.

Conclusion: Results provide solid theoretical framework for I/O complexity in attention mechanisms, offering critical guidance for developing efficient LLM training and inference systems.

Abstract: Large Language Models (LLMs) exhibit exceptional proficiency in handling extensive context windows in natural language. Nevertheless, the quadratic scaling of attention computation relative to sequence length creates substantial efficiency bottlenecks, necessitating the development of I/O-optimized algorithms. In this work, we conduct a systematic examination of the I/O complexity inherent in attention mechanisms, with a specific emphasis on the backward pass under both small and large cache settings. By leveraging the red-blue pebble game framework, we derive tight bounds for I/O complexity across the full spectrum of cache sizes. We validate that FlashAttention, one of the current industry standards, achieves optimality in the large-cache scenario for both forward and backward passes. Conversely, for small-cache environments, we introduce a novel algorithm that outperforms contemporary methods and successfully attains theoretical tight bounds. Furthermore, we expand our investigation to include sparse attention by establishing granular lower bounds for both forward and backward passes across all cache configurations. Ultimately, our results solidify the theoretical framework regarding I/O complexity in attention mechanisms, providing critical guidance for the development of efficient LLM training and inference systems.

[294] Towards Fast Safe Online Reinforcement Learning via Policy Finetuning

Keru Chen, Honghao Wei, Zhigang Deng, Sen Lin

Main category: cs.LG

TL;DR: Marvel is a novel offline-to-online safe RL framework that addresses challenges in transitioning from offline to online learning while maintaining safety constraints, outperforming existing baselines.

DetailsMotivation: Current online safe RL methods are impractical due to high interaction costs, while offline safe RL suffers from data quality limitations and OOD issues. The gap between offline and online safe RL needs bridging for more efficient and practical solutions.

Method: Marvel framework with two key components: Value Pre-Alignment to correct Q-estimations before online learning, and Adaptive PID Control to adjust Lagrange multipliers during online finetuning.

Result: Marvel significantly outperforms existing baselines in both reward maximization and safety constraint satisfaction, demonstrating superior performance in offline-to-online safe RL.

Conclusion: Marvel introduces the first policy-finetuning based framework for O2O safe RL, compatible with many offline and online methods, advancing the field toward more efficient and practical safe RL solutions.

Abstract: The high costs and risks involved in extensive environment interactions hinder the practical application of current online safe reinforcement learning (RL) methods. While offline safe RL addresses this by learning policies from static datasets, the performance therein is usually limited due to reliance on data quality and challenges with out-of-distribution (OOD) actions. Inspired by recent successes in offline-to-online (O2O) RL, it is crucial to explore whether offline safe RL can be leveraged to facilitate faster and safer online policy learning, a direction that has yet to be fully investigated. To fill this gap, we first demonstrate that naively applying existing O2O algorithms from standard RL would not work well in the safe RL setting due to two unique challenges: erroneous Q-estimations, resulting from offline-online objective mismatch and offline cost sparsity, and Lagrangian mismatch, resulting from difficulties in aligning Lagrange multipliers between offline and online policies. To address these challenges, we introduce Marvel, a novel framework for O2O safe RL, comprising two key components that work in concert: Value Pre-Alignment to align the Q-functions with the underlying truth before online learning, and Adaptive PID Control to effectively adjust the Lagrange multipliers during online finetuning. Extensive experiments demonstrate that Marvel significantly outperforms existing baselines in both reward maximization and safety constraint satisfaction. By introducing the first policy-finetuning based framework for O2O safe RL, which is compatible with many offline and online safe RL methods, our work has great potential to advance the field toward more efficient and practical safe RL solutions.
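
The abstract names Adaptive PID Control without giving gains or schedules; a minimal sketch of the standard PID-Lagrangian update such a component plausibly builds on (the gains and cost limit below are illustrative assumptions, not the paper's values) is:

```python
class PIDLagrangian:
    """PID-controlled Lagrange multiplier for a cost constraint (sketch).
    Gains and cost_limit are illustrative assumptions, not the paper's."""
    def __init__(self, kp=0.05, ki=0.005, kd=0.01, cost_limit=25.0):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.cost_limit = cost_limit
        self.integral = 0.0
        self.prev_err = 0.0

    def update(self, episode_cost):
        err = episode_cost - self.cost_limit           # > 0 when the constraint is violated
        self.integral = max(0.0, self.integral + err)  # anti-windup at zero
        deriv = err - self.prev_err
        self.prev_err = err
        # Clip at zero so the safety penalty never turns into a bonus.
        return max(0.0, self.kp * err + self.ki * self.integral + self.kd * deriv)
```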

[295] ViSymRe: Vision Multimodal Symbolic Regression

Da Li, Junping Yin, Jin Xu, Xinxin Li, Juan Zhang

Main category: cs.LG

TL;DR: ViSymRe is a Vision Symbolic Regression framework that uses visual modality to enhance Transformer-based symbolic regression, addressing modal heterogeneity issues through Multi-View Random Slicing for high-dimensional visualization and a dual-vision pipeline for dataset-only deployment.

DetailsMotivation: Transformer-based symbolic regression models face modal heterogeneity between datasets and equations, hindering convergence and generalization. The paper explores whether visual modality can enhance Transformer-based SR performance.

Method: Proposes ViSymRe with two key components: 1) Multi-View Random Slicing (MVRS) projects multivariate equations into 2D space using random affine transformations to handle high-dimensional visualization challenges; 2) Dual-vision pipeline with Visual Decoder and Biased Cross-Attention feature fusion module to reconstruct visual features from datasets and suppress reconstruction noise.

Result: Ablation studies show visual modality improves model convergence and SR metrics. ViSymRe achieves competitive performance on mainstream benchmarks, particularly in low-complexity and rapid-inference scenarios.

Conclusion: Visual modality positively contributes to Transformer-based symbolic regression. ViSymRe demonstrates effective high-dimensional visualization and noise-suppressing feature fusion, offering competitive performance especially in practical deployment scenarios.

Abstract: Extracting interpretable equations from observational datasets to describe complex natural phenomena is one of the core goals of artificial intelligence. This field is known as symbolic regression (SR). In recent years, Transformer-based paradigms have become a new trend in SR, addressing the well-known problem of inefficient search. However, the modal heterogeneity between datasets and equations often hinders the convergence and generalization of these models. In this paper, we propose ViSymRe, a Vision Symbolic Regression framework, to explore the positive role of visual modality in enhancing the performance of Transformer-based SR paradigms. To overcome the challenge that visual SR models are untrainable in high-dimensional scenarios, we present Multi-View Random Slicing (MVRS). By projecting multivariate equations into 2-D space using random affine transformations, MVRS avoids common defects in high-dimensional visualization, such as variable degradation, missing non-linear interactions, and exponentially increasing sampling complexity, enabling ViSymRe to be trained with low computational costs. To support dataset-only deployment of ViSymRe, we design a dual-vision pipeline architecture based on generative techniques, which reconstructs visual features directly from the datasets via an auxiliary Visual Decoder and automatically suppresses the attention weights of reconstruction noise through a proposed Biased Cross-Attention feature fusion module, ensuring that subsequent processes are not affected by noisy modalities. Ablation studies demonstrate the positive contribution of visual modality to improving model convergence level and enhancing various SR metrics. Furthermore, evaluation results on mainstream benchmarks indicate that ViSymRe achieves competitive performance compared to baselines, particularly in low-complexity and rapid-inference scenarios.
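
As a rough illustration of one MVRS view, a random affine projection of a multivariate dataset to 2-D might look like the following sketch; the distribution of the affine parameters and the number of views are assumptions, not the paper's choices:

```python
import numpy as np

def random_affine_slice(X, rng=None):
    """Project d-dimensional inputs to 2-D with a random affine map.
    One MVRS-style view (sketch); parameter distributions are assumed."""
    rng = rng or np.random.default_rng()
    d = X.shape[1]
    A = rng.normal(size=(d, 2))   # random linear part
    b = rng.normal(size=(2,))     # random offset
    return X @ A + b              # one 2-D view of the data

# Usage: render several views of the same dataset for a vision encoder.
X = np.random.rand(256, 8)        # 256 samples, 8 input variables
views = [random_affine_slice(X) for _ in range(4)]
```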

[296] The Curse of Depth in Large Language Models

Wenfang Sun, Xinyuan Song, Pengxiang Li, Lu Yin, Yefeng Zheng, Shiwei Liu

Main category: cs.LG

TL;DR: The paper introduces the “Curse of Depth” in LLMs where deep layers underperform due to Pre-LN’s output variance explosion, and proposes LayerNorm Scaling (LNS) to mitigate this by scaling variance inversely with depth, improving training effectiveness.

DetailsMotivation: Modern LLMs exhibit a phenomenon where nearly half of their deep layers are less effective than expected, which the authors term the "Curse of Depth." This inefficiency in deep layers limits model performance despite increased depth.

Method: The authors identify Pre-Layer Normalization (Pre-LN) as the root cause, as its output variance grows exponentially with depth, making deep layers’ derivatives identity matrices that barely contribute to training. They propose LayerNorm Scaling (LNS), which scales the variance of layer normalization outputs inversely by the square root of depth to prevent variance explosion.

Result: Experiments across model sizes (130M to 7B) show LNS consistently outperforms previous normalization and scaling techniques in LLM pre-training. The improvement also carries over to supervised fine-tuning, with deeper layers contributing more effectively during training.

Conclusion: The Curse of Depth is a significant issue in modern LLMs caused by Pre-LN’s variance explosion. LayerNorm Scaling provides a simple yet effective solution that enables deeper layers to contribute meaningfully to training, improving overall model performance across pre-training and fine-tuning.

Abstract: In this paper, we introduce the Curse of Depth, a concept that highlights, explains, and addresses the recent observation in modern Large Language Models (LLMs) where nearly half of the layers are less effective than expected. We first confirm the wide existence of this phenomenon across the most popular families of LLMs such as Llama, Mistral, DeepSeek, and Qwen. Our analysis, theoretically and empirically, identifies that the underlying reason for the ineffectiveness of deep layers in LLMs is the widespread usage of Pre-Layer Normalization (Pre-LN). While Pre-LN stabilizes the training of Transformer LLMs, its output variance exponentially grows with the model depth, which undesirably causes the derivative of the deep Transformer blocks to be an identity matrix, so that these blocks barely contribute to training. To resolve this training pitfall, we propose LayerNorm Scaling (LNS), which scales the variance of the layer normalization output inversely by the square root of its depth. This simple modification mitigates the output variance explosion of deeper Transformer layers, improving their contribution. Across a wide range of model sizes (130M to 7B), our experiments show that LNS consistently outperforms previous normalization and scaling techniques in enhancing LLM pre-training performance. Moreover, this improvement seamlessly carries over to supervised fine-tuning. All these gains can be attributed to the fact that LayerNorm Scaling enables deeper layers to contribute more effectively during training. Our code is available at https://github.com/lmsdss/LayerNorm-Scaling.
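
Reading the abstract's scaling rule as a depth-dependent rescaling of the LayerNorm output, a minimal sketch is (the 1-based layer indexing is our assumption):

```python
import torch
import torch.nn as nn

class LayerNormScaling(nn.Module):
    """Minimal LNS sketch: damp the LayerNorm output of layer l by
    1/sqrt(l) so output variance no longer explodes with depth."""
    def __init__(self, dim: int, layer_index: int):
        super().__init__()
        self.ln = nn.LayerNorm(dim)
        self.scale = layer_index ** -0.5  # layer_index assumed to start at 1

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.ln(x) * self.scale
```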

[297] A Survey on Human-Centered Evaluation of Explainable AI Methods in Clinical Decision Support Systems

Alessandro Gambetti, Qiwei Han, Hong Shen, Claudia Soares

Main category: cs.LG

TL;DR: Systematic survey of 31 human-centered evaluations of XAI in clinical decision support reveals most use post-hoc methods like SHAP/Grad-CAM with small clinician studies, showing explanations improve trust but increase cognitive load and often misalign with clinical reasoning.

DetailsMotivation: XAI is crucial for clinical adoption of CDSS, but real-world effectiveness of existing XAI methods is limited and inconsistently evaluated, requiring systematic assessment of human-centered evaluations.

Method: PRISMA-guided systematic survey of 31 human-centered evaluations of XAI applied to CDSS, classifying by XAI methodology, evaluation design, and adoption barriers.

Result: Over 80% of studies use post-hoc, model-agnostic approaches (SHAP, Grad-CAM) with clinician sample sizes below 25 participants. Explanations improve clinician trust and diagnostic confidence but increase cognitive load and misalign with domain reasoning processes.

Conclusion: Proposes stakeholder-centric evaluation framework integrating socio-technical principles and human-computer interaction to guide development of clinically viable and trustworthy XAI-based CDSS.

Abstract: Explainable Artificial Intelligence (XAI) is essential for the transparency and clinical adoption of Clinical Decision Support Systems (CDSS). However, the real-world effectiveness of existing XAI methods remains limited and is inconsistently evaluated. This study conducts a systematic PRISMA-guided survey of 31 human-centered evaluations (HCE) of XAI applied to CDSS, classifying them by XAI methodology, evaluation design, and adoption barrier. The results show that over 80% of the studies adopt post-hoc, model-agnostic approaches such as SHAP and Grad-CAM, typically assessed through small-scale clinician studies with sample sizes below 25 participants. The findings indicate that explanations generally improve clinician trust and diagnostic confidence, but frequently increase cognitive load and exhibit misalignment with domain reasoning processes. To bridge these gaps, we propose a stakeholder-centric evaluation framework that integrates socio-technical principles and human-computer interaction to guide the future development of clinically viable and trustworthy XAI-based CDSS.

[298] On Computational Limits of FlowAR Models: Expressivity and Efficiency

Yang Cao, Chengyue Gong, Yekun Ke, Xiaoyu Li, Yingyu Liang, Zhizhou Sha, Zhenmei Shi, Zhao Song

Main category: cs.LG

TL;DR: FlowAR models (flow+autoregressive) have TC⁰ circuit complexity with constant depth and polynomial width, revealing expressive power limitations.

DetailsMotivation: The theoretical characterization of deep visual generative models' expressiveness through circuit complexity remains underexplored, particularly for state-of-the-art architectures like FlowAR that integrate flow-based and autoregressive mechanisms. This gap limits understanding of their computational limits and practical efficiency.

Method: Analyze the circuit complexity of FlowAR architecture by demonstrating that when the largest feature map has dimensions n×n×c, FlowAR is simulable by a family of threshold circuits (TC⁰) with constant depth O(1) and polynomial width poly(n). Identify conditions for achieving almost quadratic time computations.

Result: First rigorous study showing FlowAR models have TC⁰ circuit complexity, revealing limitations in expressive power. Identified conditions for almost quadratic time computations. Presented efficient model variant constructions based on low-rank approximations that align with derived criteria.

Conclusion: Provides foundation for future comparisons with other generative paradigms and guides development of more efficient and expressive implementations by establishing theoretical bounds on FlowAR’s computational capabilities.

Abstract: The expressive power and computational complexity of deep visual generative models, such as flow-based and autoregressive (AR) models, have gained considerable interest for their wide-ranging applications in generative tasks. However, the theoretical characterization of their expressiveness through the lens of circuit complexity remains underexplored, particularly for state-of-the-art architectures like FlowAR, proposed by [Ren et al., 2024], which integrates flow-based and autoregressive mechanisms. This gap limits our understanding of their inherent computational limits and practical efficiency. In this study, we address this gap by analyzing the circuit complexity of the FlowAR architecture. We demonstrate that when the largest feature map produced by the FlowAR model has dimensions $n \times n \times c$, the FlowAR model is simulable by a family of threshold circuits $\mathsf{TC}^0$, which have constant depth $O(1)$ and polynomial width $\mathrm{poly}(n)$. This is the first study to rigorously highlight the limitations in the expressive power of FlowAR models. Furthermore, we identify the conditions under which the FlowAR model computations can achieve almost quadratic time. To validate our theoretical findings, we present efficient model variant constructions based on low-rank approximations that align with the derived criteria. Our work provides a foundation for future comparisons with other generative paradigms and guides the development of more efficient and expressive implementations.

[299] Visual Autoregressive Transformers Must Use $Ω(n^2 d)$ Memory

Yang Cao, Xiaoyu Li, Yekun Ke, Yingyu Liang, Zhenmei Shi, Zhao Song

Main category: cs.LG

TL;DR: The paper formally defines KV-cache compression for Visual Autoregressive transformers and proves a fundamental memory lower bound of Ω(n²d) for attention-based architectures, showing sub-quadratic memory is impossible without structural constraints.

DetailsMotivation: Visual Autoregressive models face substantial memory overhead during inference due to storing previously generated representations (KV-cache). Prior works haven't formally defined the KV-cache compression problem in this context, and there's a need to understand fundamental memory requirements.

Method: 1) Formally define the KV-cache compression problem for Visual Autoregressive transformers. 2) Establish a fundamental negative result via proof that any attention-based sequential visual token generation mechanism requires at least Ω(n²d) memory when d = Ω(log n). 3) Use reduction from computational lower bound problems with randomized embedding techniques inspired by dimensionality reduction. 4) Analyze how sparsity priors on visual representations affect memory efficiency.

Result: Proved a fundamental memory lower bound: any attention-based visual autoregressive model must use at least Ω(n²d) memory for n tokens with embedding dimension d = Ω(log n). This demonstrates achieving truly sub-quadratic memory usage is impossible without additional structural constraints.

Conclusion: The paper provides the first formal definition of KV-cache compression for visual autoregressive models and establishes fundamental memory limitations. While sub-quadratic memory is impossible in general attention architectures, sparsity priors on visual representations may offer potential directions for mitigating memory overhead.

Abstract: A fundamental challenge in Visual Autoregressive models is the substantial memory overhead required during inference to store previously generated representations. Despite various attempts to mitigate this issue through compression techniques, prior works have not explicitly formalized the problem of KV-cache compression in this context. In this work, we take the first step in formally defining the KV-cache compression problem for Visual Autoregressive transformers. We then establish a fundamental negative result, proving that any mechanism for sequential visual token generation under attention-based architectures must use at least $Ω(n^2 d)$ memory, when $d = Ω(\log n)$, where $n$ is the number of tokens generated and $d$ is the embedding dimensionality. This result demonstrates that achieving truly sub-quadratic memory usage is impossible without additional structural constraints. Our proof is constructed via a reduction from a computational lower bound problem, leveraging randomized embedding techniques inspired by dimensionality reduction principles. Finally, we discuss how sparsity priors on visual representations can influence memory efficiency, presenting both impossibility results and potential directions for mitigating memory overhead.
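
To get a feel for the gap the bound implies, a quick back-of-the-envelope comparison against a linear-size KV-cache (the values of n and d below are illustrative choices of ours, not the paper's):

```python
# Scale of the Omega(n^2 d) lower bound versus an O(n d) KV-cache.
n, d = 1024, 64                 # tokens generated, embedding dimension
kv_cache = 2 * n * d            # keys + values, one entry per token
lower_bound = n ** 2 * d
print(f"linear KV-cache entries: {kv_cache:,}")     # 131,072
print(f"Omega(n^2 d) scale:      {lower_bound:,}")  # 67,108,864
```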

[300] Quotation-Based Data Retention Mechanism for Data Privacy in LLM-Empowered Network Services

Bin Han, Di Feng, Jie Wang, Hans D. Schotten

Main category: cs.LG

TL;DR: A price discovery mechanism for mobile network operators to compensate users for data retention in LLM-based network optimization, addressing GDPR/CCPA compliance without degrading model performance through machine unlearning.

DetailsMotivation: LLMs for network optimization require sensitive user data, but GDPR/CCPA give users rights to withdraw consent and demand deletion. Machine unlearning degrades model accuracy and incurs high computational costs, harming network performance.

Method: Iterative price discovery mechanism where MNOs sequentially quote increasing unit prices for data retention, and users independently decide how much data to supply at each price. No prior knowledge of user privacy preferences needed.

Result: The approach efficiently maximizes social welfare across the network ecosystem while maintaining regulatory compliance.

Conclusion: A market-based compensation mechanism solves the data governance challenge for LLM-powered network optimization, balancing user privacy rights with model performance needs.

Abstract: The deployment of large language models (LLMs) for next-generation network optimization introduces novel data governance challenges. Mobile network operators (MNOs) increasingly leverage generative artificial intelligence (AI) for traffic prediction, anomaly detection, and service personalization, requiring access to users’ sensitive network usage data, including mobility patterns, traffic types, and location histories. Under the General Data Protection Regulation (GDPR), the California Consumer Privacy Act (CCPA), and similar regulations, users retain the right to withdraw consent and demand data deletion. However, extensive machine unlearning degrades model accuracy and incurs substantial computational costs, ultimately harming network performance for all users. We propose an iterative price discovery mechanism enabling MNOs to compensate users for data retention through sequential price quotations. The server progressively raises the unit price for retaining data while users independently determine their supply at each quoted price. This approach requires no prior knowledge of users’ privacy preferences and efficiently maximizes social welfare across the network ecosystem.
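
A toy simulation of the quotation loop; the stopping rule and the user model (each user retains records whose privacy cost falls below the quoted price) are assumptions for illustration:

```python
def price_discovery(users, price_step=0.1, max_price=5.0):
    """Iterative quotation sketch: the MNO raises the unit price until
    enough data is retained. User behavior model is assumed."""
    price, retained = 0.0, {}
    total = sum(len(costs) for costs in users.values())
    while price < max_price:
        price += price_step
        # Each user independently decides supply at the quoted price.
        retained = {u: sum(1 for c in costs if c <= price)
                    for u, costs in users.items()}
        if sum(retained.values()) >= 0.8 * total:
            break                 # stop once 80% of records are retained
    return price, retained

# Usage: three users, each with per-record privacy costs.
users = {"u1": [0.2, 0.5, 3.0], "u2": [0.1, 0.4], "u3": [1.0, 2.5, 4.0]}
print(price_discovery(users))
```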

[301] Learning and Transferring Physical Models through Derivatives

Alessandro Trenta, Andrea Cossu, Davide Bacciu

Main category: cs.LG

TL;DR: DERL is a supervised method that learns physical systems by modeling their partial derivatives, with theoretical guarantees of consistency with physical laws, and enables incremental model building through knowledge distillation.

DetailsMotivation: To develop a method that can learn physical systems while ensuring consistency with underlying physical laws, and to enable incremental construction of physical models through knowledge transfer across different domains and parameter ranges.

Method: DERL learns physical systems by modeling their partial derivatives in a supervised framework. It includes a distillation protocol to transfer knowledge from pre-trained models to student models, allowing incremental model building. The method can extend models to new physical domains and parameter ranges.

Result: DERL outperforms state-of-the-art methods in generalizing ODEs to unseen initial conditions and parametric PDEs to unseen parameters. Theoretical guarantees show DERL can learn true physical systems consistent with physical laws, even with empirical derivatives.

Conclusion: DERL provides an effective approach for learning physical systems through derivative modeling, with theoretical soundness and practical advantages for generalization. The distillation protocol enables incremental model building, creating a new pipeline for multi-stage physical model development.

Abstract: We propose Derivative Learning (DERL), a supervised approach that models physical systems by learning their partial derivatives. We also leverage DERL to build physical models incrementally, by designing a distillation protocol that effectively transfers knowledge from a pre-trained model to a student one. We provide theoretical guarantees that DERL can learn the true physical system, being consistent with the underlying physical laws, even when using empirical derivatives. DERL outperforms state-of-the-art methods in generalizing an ODE to unseen initial conditions and a parametric PDE to unseen parameters. We also design a method based on DERL to transfer physical knowledge across models by extending them to new portions of the physical domain and a new range of PDE parameters. This introduces a new pipeline to build physical models incrementally in multiple stages.
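
A minimal DERL-style sketch under our own assumptions (a 1-D autonomous system with noisy empirical derivative targets), showing supervised derivative matching rather than trajectory fitting:

```python
import torch
import torch.nn as nn

# Fit a network f so that f(x) matches observed derivatives dx/dt.
f = nn.Sequential(nn.Linear(1, 64), nn.Tanh(), nn.Linear(64, 1))
opt = torch.optim.Adam(f.parameters(), lr=1e-3)

x = torch.linspace(-2, 2, 200).unsqueeze(1)
dx_dt = -x + 0.1 * torch.randn_like(x)   # noisy empirical derivatives

for _ in range(500):
    loss = ((f(x) - dx_dt) ** 2).mean()  # supervised derivative matching
    opt.zero_grad(); loss.backward(); opt.step()
```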

[302] Bayesian Ensembling: Insights from Online Optimization and Empirical Bayes

Daniel Waxman, Fernando Llorente, Petar M. Djurić

Main category: cs.LG

TL;DR: Proposes Online Bayesian Stacking (OBS) for adaptive combination of Bayesian models in online continual learning, connecting it to portfolio selection theory and comparing it with online Bayesian model averaging.

DetailsMotivation: Addresses the challenge of learning optimal combinations of Bayesian models in online, continual learning settings, and revisits classical Bayesian ensembles to overcome limitations of existing approaches like Bayesian model averaging.

Method: Proposes Online Bayesian Stacking (OBS) which optimizes log-score over predictive distributions to adaptively combine Bayesian models. Establishes connection between OBS and portfolio selection theory, and clarifies relationship between OBS and online BMA.

Result: Theoretical analysis and empirical evaluation identify scenarios where OBS outperforms online BMA. Provides principled methods and guidance on when practitioners should prefer one approach over the other.

Conclusion: OBS offers a novel approach to Bayesian ensemble learning in online settings with connections to portfolio selection theory, providing both theoretical foundations and practical guidance for model combination.

Abstract: We revisit the classical problem of Bayesian ensembles and address the challenge of learning optimal combinations of Bayesian models in an online, continual learning setting. To this end, we reinterpret existing approaches such as Bayesian model averaging (BMA) and Bayesian stacking through a novel empirical Bayes lens, shedding new light on the limitations and pathologies of BMA. Further motivated by insights from online optimization, we propose Online Bayesian Stacking (OBS), a method that optimizes the log-score over predictive distributions to adaptively combine Bayesian models. A key contribution of our work is establishing a novel connection between OBS and portfolio selection, bridging Bayesian ensemble learning with a rich, well-studied theoretical framework that offers efficient algorithms and extensive regret analysis. We further clarify the relationship between OBS and online BMA, showing that they optimize related but distinct cost functions. Through theoretical analysis and empirical evaluation, we identify scenarios where OBS outperforms online BMA and provide principled methods and guidance on when practitioners should prefer one approach over the other.
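
The abstract fixes the objective (the log-score of the stacked predictive distribution) but not the optimizer; an exponentiated-gradient step, the classic portfolio-selection update, is one natural instantiation:

```python
import numpy as np

def obs_step(weights, densities, lr=0.05):
    """One online stacking update by exponentiated gradient on the
    log-score (sketch; the EG optimizer is our assumption)."""
    mixture = float(weights @ densities)  # stacked predictive density at y_t
    grad = -densities / mixture           # gradient of -log p(y_t)
    w = weights * np.exp(-lr * grad)
    return w / w.sum()                    # project back onto the simplex

# Usage: three models' predictive densities at the newly observed point.
w = np.ones(3) / 3
w = obs_step(w, np.array([0.8, 0.2, 0.05]))
```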

[303] Evaluating Adversarial Robustness of Concept Representations in Sparse Autoencoders

Aaron J. Li, Suraj Srinivas, Usha Bhalla, Himabindu Lakkaraju

Main category: cs.LG

TL;DR: SAE concept representations in LLMs are fragile to tiny adversarial perturbations, compromising their reliability for model monitoring despite good reconstruction metrics.

DetailsMotivation: Existing SAE evaluations overlook robustness to input perturbations, which is critical for concept representation fidelity in practical applications like model monitoring.

Method: Formulate robustness quantification as input-space optimization problems, develop evaluation framework with realistic adversarial perturbation scenarios to manipulate SAE representations.

Result: Tiny adversarial perturbations can effectively manipulate concept-based interpretations in most scenarios without significantly affecting base LLM activations, revealing SAE fragility.

Conclusion: SAE concept representations are fragile and may be ill-suited for model monitoring applications without further denoising or postprocessing to improve robustness.

Abstract: Sparse autoencoders (SAEs) are commonly used to interpret the internal activations of large language models (LLMs) by mapping them to human-interpretable concept representations. While existing evaluations of SAEs focus on metrics such as the reconstruction-sparsity tradeoff, human (auto-)interpretability, and feature disentanglement, they overlook a critical aspect: the robustness of concept representations to input perturbations. We argue that robustness must be a fundamental consideration for concept representations, reflecting the fidelity of concept labeling. To this end, we formulate robustness quantification as input-space optimization problems and develop a comprehensive evaluation framework featuring realistic scenarios in which adversarial perturbations are crafted to manipulate SAE representations. Empirically, we find that tiny adversarial input perturbations can effectively manipulate concept-based interpretations in most scenarios without notably affecting the base LLM’s activations. Overall, our results suggest that SAE concept representations are fragile and without further denoising or postprocessing they might be ill-suited for applications in model monitoring and oversight.
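
A PGD-style sketch of the input-space optimization described above, perturbing input embeddings to suppress one SAE concept; the objective, the norm ball, and the callable signatures are all assumptions:

```python
import torch

def perturb_concept(embeds, llm_layer, sae_encoder, concept, eps=0.05, steps=10):
    """Find a small embedding perturbation that suppresses one SAE
    concept activation (sketch; details assumed, not the paper's)."""
    delta = torch.zeros_like(embeds, requires_grad=True)
    for _ in range(steps):
        acts = llm_layer(embeds + delta)               # base-model activations
        score = sae_encoder(acts)[..., concept].sum()  # target concept strength
        score.backward()
        with torch.no_grad():
            delta -= (eps / steps) * delta.grad.sign() # descend on the concept
            delta.clamp_(-eps, eps)                    # stay in the L-inf ball
        delta.grad.zero_()
    return (embeds + delta).detach()
```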

[304] Detecting High-Stakes Interactions with Activation Probes

Alex McKenzie, Urja Pawar, Phil Blandfort, William Bankes, David Krueger, Ekdeep Singh Lubana, Dmitrii Krasheninnikov

Main category: cs.LG

TL;DR: Activation probes trained on synthetic data can efficiently detect high-stakes LLM interactions with performance comparable to LLM monitors but with 6 orders-of-magnitude computational savings.

DetailsMotivation: Monitoring LLMs for high-stakes interactions (where text indicates potential significant harm) is critical but underexplored. Current monitoring approaches are computationally expensive, creating a need for more efficient detection methods.

Method: Train activation probes on synthetic data to detect high-stakes interactions by analyzing model activations. Evaluate several probe architectures and compare them against prompted/finetuned medium-sized LLM monitors.

Result: Probes show robust generalization to diverse out-of-distribution real-world data, achieve performance comparable to LLM monitors, and offer computational savings of six orders-of-magnitude by reusing activations from the monitored model.

Conclusion: Activation probes provide an efficient, high-performance solution for monitoring high-stakes LLM interactions, enabling resource-aware hierarchical monitoring systems where probes serve as an initial filter for more expensive downstream analysis.

Abstract: Monitoring is an important aspect of safely deploying Large Language Models (LLMs). This paper examines activation probes for detecting “high-stakes” interactions – where the text indicates that the interaction might lead to significant harm – as a critical, yet underexplored, target for such monitoring. We evaluate several probe architectures trained on synthetic data, and find them to exhibit robust generalization to diverse, out-of-distribution, real-world data. Probes’ performance is comparable to that of prompted or finetuned medium-sized LLM monitors, while offering computational savings of six orders-of-magnitude. These savings are enabled by reusing activations of the model that is being monitored. Our experiments also highlight the potential of building resource-aware hierarchical monitoring systems, where probes serve as an efficient initial filter and flag cases for more expensive downstream analysis. We release our novel synthetic dataset and the codebase at https://github.com/arrrlex/models-under-pressure.
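
A minimal probe-as-first-filter sketch; the file names, the layer the activations come from, and the probe family are placeholders (the paper compares several probe architectures):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Train a linear probe on cached residual-stream activations (sketch).
X = np.load("cached_activations.npy")  # hypothetical: one vector per prompt
y = np.load("high_stakes_labels.npy")  # 1 = high-stakes, 0 = benign

probe = LogisticRegression(max_iter=1000).fit(X, y)
risk = probe.predict_proba(X)[:, 1]
flagged = np.where(risk > 0.9)[0]      # escalate to a costlier LLM monitor
```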

[305] Logical Expressiveness of Graph Neural Networks with Hierarchical Node Individualization

Arie Soeteman, Balder ten Cate

Main category: cs.LG

TL;DR: HEGNNs are hierarchical graph neural networks that extend GNNs with node individualization inspired by isomorphism testing, forming increasingly expressive models that can distinguish graphs up to isomorphism.

DetailsMotivation: To create more expressive graph neural networks that can better distinguish between non-isomorphic graphs, going beyond the limitations of standard GNNs which have bounded expressiveness.

Method: Hierarchical Ego Graph Neural Networks (HEGNNs) that incorporate hierarchical node individualization inspired by the Individualization-Refinement paradigm used in graph isomorphism testing.

Result: HEGNNs form a hierarchy of increasingly expressive models that can distinguish graphs up to isomorphism, with separating power equal to graded hybrid logic over bounded-degree graphs, and show practical benefits over traditional GNNs.

Conclusion: HEGNNs provide a theoretically grounded and practically effective extension to GNNs with hierarchical node individualization, offering increased expressive power while maintaining practical feasibility.

Abstract: We propose and study Hierarchical Ego Graph Neural Networks (HEGNNs), an expressive extension of graph neural networks (GNNs) with hierarchical node individualization, inspired by the Individualization-Refinement paradigm for isomorphism testing. HEGNNs generalize subgraph-GNNs and form a hierarchy of increasingly expressive models that, in the limit, distinguish graphs up to isomorphism. We show that, over graphs of bounded degree, the separating power of HEGNN node classifiers equals that of graded hybrid logic. This characterization enables us to relate the separating power of HEGNNs to that of higher-order GNNs, GNNs enriched with local homomorphism count features, and color refinement algorithms based on Individualization-Refinement. Our experimental results confirm the practical feasibility of HEGNNs and show benefits in comparison with traditional GNN architectures, both with and without local homomorphism count features.

[306] Gradient-Based Neuroplastic Adaptation for Concurrent Optimization of Neuro-Fuzzy Networks

John Wesley Hostetter, Min Chi

Main category: cs.LG

TL;DR: The paper proposes gradient-based neuroplastic adaptation for concurrent optimization of neuro-fuzzy networks’ parameters and structure, enabling online reinforcement learning for vision-based tasks like DOOM gameplay.

DetailsMotivation: Neuro-fuzzy networks have advantages (transparent, symbolic, universal function approximation) but face challenges in systematic design. Existing methods inefficiently isolate parametric and structural identification, leading to premature commitment to brittle architectures.

Method: Proposes gradient-based neuroplastic adaptation for concurrent optimization of NFNs’ parameters and structure, recognizing they should be optimized simultaneously as they are deeply conjoined. This enables previously unapproachable settings like online reinforcement learning for vision-based tasks.

Result: The effectiveness is empirically demonstrated by training NFNs with online reinforcement learning to proficiently play challenging scenarios from the vision-based video game DOOM.

Conclusion: Concurrent optimization of NFNs’ parameters and structure through gradient-based neuroplastic adaptation makes previously inaccessible settings (like online RL for vision tasks) approachable, overcoming limitations of sequential design methods.

Abstract: Neuro-fuzzy networks (NFNs) are transparent, symbolic, and universal function approximators that perform as well as conventional neural architectures, but their knowledge is expressed as linguistic IF-THEN rules. Despite these advantages, their systematic design process remains a challenge. Existing work will often sequentially build NFNs by inefficiently isolating parametric and structural identification, leading to a premature commitment to brittle and subpar architecture. We propose a novel application-independent approach called gradient-based neuroplastic adaptation for the concurrent optimization of NFNs’ parameters and structure. By recognizing that NFNs’ parameters and structure should be optimized simultaneously as they are deeply conjoined, settings previously unapproachable for NFNs are now accessible, such as the online reinforcement learning of NFNs for vision-based tasks. The effectiveness of concurrently optimizing NFNs is empirically shown as it is trained by online reinforcement learning to proficiently play challenging scenarios from a vision-based video game called DOOM.

[307] RONOM: Reduced-Order Neural Operator Modeling

Sven Dummer, Dongwei Ye, Christoph Brune

Main category: cs.LG

TL;DR: RONOM framework bridges reduced-order modeling and neural operators, providing discretization error bounds and showing superior performance in spatial super-resolution and robustness.

DetailsMotivation: Traditional ROM lacks flexibility across varying meshes, while neural operators lack rigorous error quantification between infinite-dimensional and discretized operators. Need a framework that combines strengths of both approaches.

Method: Introduces Reduced-Order Neural Operator Modeling (RONOM) framework that bridges concepts from ROM and operator learning. Establishes discretization error bounds analogous to ROM, and analyzes discretization convergence and robustness.

Result: RONOM achieves comparable performance in input generalization and superior performance in spatial super-resolution and discretization robustness compared to existing neural operators. Also provides insights into temporal super-resolution and ROM-based approaches for time-dependent data.

Conclusion: RONOM successfully bridges ROM and operator learning, providing both theoretical error bounds and practical advantages for solving time-dependent PDEs across varying resolutions.

Abstract: Time-dependent partial differential equations are ubiquitous in physics-based modeling, but they remain computationally intensive in many-query scenarios, such as real-time forecasting, optimal control, and uncertainty quantification. Reduced-order modeling (ROM) addresses these challenges by constructing a low-dimensional surrogate model but relies on a fixed discretization, which limits flexibility across varying meshes during evaluation. Operator learning approaches, such as neural operators, offer an alternative by parameterizing mappings between infinite-dimensional function spaces, enabling adaptation to data across different resolutions. Whereas ROM provides rigorous numerical error estimates, neural operator learning largely focuses on discretization convergence and invariance without quantifying the error between the infinite-dimensional and the discretized operators. This work introduces the reduced-order neural operator modeling (RONOM) framework, which bridges concepts from ROM and operator learning. We establish a discretization error bound analogous to those in ROM, and get insights into RONOM’s discretization convergence and discretization robustness. Moreover, three numerical examples are presented that compare RONOM to existing neural operators for solving partial differential equations. The results demonstrate that RONOM using standard vector-to-vector neural networks can achieve comparable performance in input generalization and achieves superior performance in both spatial super-resolution and discretization robustness, while also offering novel insights into temporal super-resolution scenarios and ROM-based approaches for learning on time-dependent data.

[308] Categorical Distributions are Effective Neural Network Outputs for Event Prediction

Kevin Doran, Tom Baden

Main category: cs.LG

TL;DR: The paper demonstrates categorical distributions as effective neural network outputs for next event prediction in both discrete-time and continuous-time sequences, introduces retinal spike prediction tasks, critiques dataset biases favoring small models, and proposes new synthetic datasets for larger models.

DetailsMotivation: To establish categorical distributions as versatile neural network outputs for event prediction tasks, address limitations of existing datasets that favor small models, and create appropriate benchmarks for evaluating larger models in both discrete and continuous-time settings.

Method: Uses categorical distributions as neural network outputs for event prediction; interprets them as piecewise-constant density functions for continuous-time modeling; introduces retinal spike prediction task with discrete event times; analyzes dataset biases; creates synthetic datasets for testing larger models.

Result: Categorical distributions are shown to be competitive for continuous-time event prediction across multiple datasets; retinal spike prediction task demonstrates importance of discrete-time modeling; evidence shows existing datasets favor smaller models; new synthetic datasets are introduced for better evaluation.

Conclusion: Categorical distributions provide effective neural network outputs for event prediction in both discrete and continuous-time settings, while new synthetic datasets are needed to properly evaluate larger models and address biases in current benchmarks.

Abstract: We demonstrate the effectiveness of the categorical distribution as a neural network output for next event prediction. This is done for both discrete-time and continuous-time event sequences. To model continuous-time processes, the categorical distribution is interpreted as a piecewise-constant density function and is shown to be competitive across a range of datasets. We then argue for the importance of studying discrete-time processes by introducing a neuronal spike prediction task motivated by retinal prosthetics, where the discretization of event times follows from the task description. Separately, we show evidence that commonly used datasets favour smaller models. Finally, we introduce new synthetic datasets for testing larger models, as well as synthetic datasets with discrete event times.
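
The piecewise-constant-density reading of a categorical output amounts to dividing each bin's probability mass by its width; a small worked example (the bin edges are chosen by us for illustration):

```python
import numpy as np

def categorical_to_density(probs, bin_edges):
    """Interpret a categorical distribution over time bins as a
    piecewise-constant density: mass / width per bin."""
    widths = np.diff(bin_edges)
    return probs / widths                 # constant density within each bin

edges = np.array([0.0, 0.1, 0.5, 2.0])    # three unequal bins
dens = categorical_to_density(np.array([0.2, 0.5, 0.3]), edges)
assert np.isclose((dens * np.diff(edges)).sum(), 1.0)  # integrates to 1
```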

[309] Theoretical Foundations of Scaling Law in Familial Models

Huan Song, Qingfei Zhao, Ting Long, Shuyu Tian, Hongjun An, Jiawei Shao, Xuelong Li

Main category: cs.LG

TL;DR: The paper extends neural scaling laws to familial models (early-exit architectures), introducing granularity (G) as a third scaling variable alongside model size (N) and tokens (D), showing minimal performance penalty for deployment flexibility.

DetailsMotivation: Current neural scaling laws assume single dense model outputs, overlooking familial models that enable ubiquitous intelligence across heterogeneous device-edge-cloud hierarchies through early exits and relay-style inference.

Method: Proposes unified scaling law L(N, D, G) with granularity as third variable; uses IsoFLOP experimental design to isolate architectural impact, systematically sweeps model sizes and granularities while adjusting tokens to decouple marginal costs.

Result: Granularity penalty follows multiplicative power law with extremely small exponent, bridging fixed-compute training with dynamic architectures and validating “train once, deploy many” paradigm without compromising compute-optimality.

Conclusion: Familial models with early exits can achieve deployment flexibility across heterogeneous devices while maintaining computational efficiency comparable to dense baselines, enabling practical ubiquitous intelligence.

Abstract: Neural scaling laws have become foundational for optimizing large language model (LLM) training, yet they typically assume a single dense model output. This limitation effectively overlooks familial models, a transformative paradigm essential for realizing ubiquitous intelligence across heterogeneous device-edge-cloud hierarchies. Transcending static architectures, familial models integrate early exits with relay-style inference to spawn G deployable sub-models from a single shared backbone. In this work, we theoretically and empirically extend the scaling law to capture this “one-run, many-models” paradigm by introducing Granularity (G) as a fundamental scaling variable alongside model size (N) and training tokens (D). To rigorously quantify this relationship, we propose a unified functional form L(N, D, G) and parameterize it using large-scale empirical runs. Specifically, we employ a rigorous IsoFLOP experimental design to strictly isolate architectural impact from computational scale. Across fixed budgets, we systematically sweep model sizes (N) and granularities (G) while dynamically adjusting tokens (D). This approach effectively decouples the marginal cost of granularity from the benefits of scale, ensuring high-fidelity parameterization of our unified scaling law. Our results reveal that the granularity penalty follows a multiplicative power law with an extremely small exponent. Theoretically, this bridges fixed-compute training with dynamic architectures. Practically, it validates the “train once, deploy many” paradigm, demonstrating that deployment flexibility is achievable without compromising the compute-optimality of dense baselines.
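
One functional form consistent with this description, a Chinchilla-style dense loss multiplied by a weak power of granularity, would be the following; the exact parameterization is the paper's, and only the smallness of the granularity exponent is the stated finding:

$$L(N, D, G) \approx \left( E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}} \right) \cdot G^{\gamma}, \qquad 0 < \gamma \ll \alpha, \beta.$$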

[310] An Enhanced Focal Loss Function to Mitigate Class Imbalance in Auto Insurance Fraud Detection with Explainable AI

Francis Boabang, Samuel Asante Gyamerah

Main category: cs.LG

TL;DR: A three-stage training framework for imbalanced auto-insurance fraud detection that combines convex surrogate loss, controlled non-convex loss, and standard focal loss to improve minority-class detection.

DetailsMotivation: Auto-insurance fraud detection is challenging due to extreme class imbalance, causing standard learning algorithms to overfit the majority class and perform poorly on economically significant minority fraud cases.

Method: Three-stage training framework: 1) convex surrogate of focal loss for stable initialization, 2) controlled non-convex intermediate loss to improve feature discrimination, 3) standard focal loss to refine minority-class sensitivity. Uses deep sequential models with theoretical conditions for convexity preservation.

Result: Improves minority-class F1-scores and AUC compared to conventional focal-loss training and resampling baselines on proprietary auto-insurance dataset. Provides interpretable feature-attribution patterns through SHAP analysis.

Conclusion: The proposed structured three-stage framework effectively addresses class imbalance in fraud detection, offering both improved performance and interpretability for actuarial and fraud-analytics applications.

Abstract: Detecting fraudulent auto-insurance claims remains a challenging classification problem, largely due to the extreme imbalance between legitimate and fraudulent cases. Standard learning algorithms tend to overfit to the majority class, resulting in poor detection of economically significant minority events. This paper proposes a structured three-stage training framework that integrates a convex surrogate of focal loss for stable initialization, a controlled non-convex intermediate loss to improve feature discrimination, and the standard focal loss to refine minority-class sensitivity. We derive conditions under which the surrogate retains convexity in the prediction space and show how this facilitates more reliable optimization when combined with deep sequential models. Using a proprietary auto-insurance dataset, the proposed method improves minority-class F1-scores and AUC relative to conventional focal-loss training and resampling baselines. The approach also provides interpretable feature-attribution patterns through SHAP analysis, offering transparency for actuarial and fraud-analytics applications.
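
For reference, the standard binary focal loss used in the final stage; the paper's convex surrogate and the staged schedule themselves are not specified in the abstract, so only the common baseline form is shown:

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, gamma=2.0, alpha=0.25):
    """Standard binary focal loss (Lin et al.): down-weights easy,
    well-classified examples so minority-class errors dominate."""
    ce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p = torch.sigmoid(logits)
    p_t = p * targets + (1 - p) * (1 - targets)              # prob. of true class
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)  # class weighting
    return (alpha_t * (1 - p_t) ** gamma * ce).mean()
```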

[311] Neural Logic Networks for Interpretable Classification

Vincent Perreault, Katsumi Inoue, Richard Labib, Alain Hertz

Main category: cs.LG

TL;DR: Neural Logic Networks with NOT operations and biases improve interpretability and performance in Boolean network discovery, learning logical rules for tabular classification in medical/industrial domains.

DetailsMotivation: Traditional neural networks lack interpretability - their learned mechanisms cannot be inspected, verified, or extracted. There's a need for models that can learn logical relationships between inputs and outputs while maintaining transparency, especially in domains like medicine and industry where interpretability has tangible value.

Method: Generalize Neural Logic Networks with NOT operations and biases to account for unobserved data. Develop rigorous logical and probabilistic modeling using concept combinations. Propose a novel factorized IF-THEN rule structure and a modified learning algorithm.

Result: The method achieves state-of-the-art performance in Boolean network discovery. It successfully learns relevant, interpretable rules for tabular classification, particularly demonstrating value in medical and industrial applications where interpretability is crucial.

Conclusion: The generalized Neural Logic Networks with NOT operations and biases provide an interpretable alternative to traditional neural networks, enabling logical rule discovery while maintaining competitive performance, making them valuable for domains requiring transparent decision-making.

Abstract: Traditional neural networks have an impressive classification performance, but what they learn cannot be inspected, verified or extracted. Neural Logic Networks, on the other hand, have an interpretable structure that enables them to learn a logical mechanism relating the inputs and outputs with AND and OR operations. We generalize these networks with NOT operations and biases that take into account unobserved data and develop a rigorous logical and probabilistic modeling in terms of concept combinations to motivate their use. We also propose a novel factorized IF-THEN rule structure for the model as well as a modified learning algorithm. Our method improves the state of the art in Boolean network discovery and is able to learn relevant, interpretable rules in tabular classification, notably on examples from the medical and industrial fields where interpretability has tangible value.
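
A common differentiable-conjunction primitive from the neural-logic literature gives a flavor of such networks; the paper's exact parameterization, and its NOT and bias extensions, are not spelled out in the abstract:

```python
import torch
import torch.nn as nn

class SoftAnd(nn.Module):
    """Differentiable AND over [0,1]-valued inputs (sketch of a standard
    neural-logic primitive, not the paper's exact formulation)."""
    def __init__(self, in_features, out_features):
        super().__init__()
        self.w = nn.Parameter(torch.randn(out_features, in_features))

    def forward(self, x):
        w = torch.sigmoid(self.w)  # soft selection of inputs per rule
        # A factor drops toward 0 only if the input is selected (w ~ 1) and false.
        return torch.prod(1.0 - w * (1.0 - x.unsqueeze(-2)), dim=-1)
```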

[312] R$^2$PO: Decoupling Training Trajectories from Inference Responses for LLM Reasoning

Jingchu Wang, Bingbing Xu, Yige Yuan, Bin Xie, Xiaoqian Sun, Huawei Shen

Main category: cs.LG

TL;DR: R²PO introduces a Residual Rollout-Head to decouple training trajectories from inference responses in LLM reasoning, enabling better exploration during training while maintaining stable inference, achieving significant accuracy improvements on math and coding benchmarks.

DetailsMotivation: Existing RL methods for LLM reasoning use a single policy for both inference responses and training trajectories, creating an objective conflict between stable inference generation and diverse exploration during training, which harms reasoning capability.

Method: Proposes R²PO (Residual Rollout Policy Optimization) which adds a lightweight Residual Rollout-Head on top of the base policy to decouple training trajectories from inference responses, allowing controlled trajectory diversification during training while keeping inference generation stable.

Result: Outperforms baselines across multiple benchmarks with average accuracy gains of 3.4% on MATH-500 and 1.3% on APPS, while also reducing formatting errors and mitigating length bias for more stable optimization.

Conclusion: Decoupling training trajectories from inference responses through the Residual Rollout-Head addresses the exploration-stability conflict in RL for LLM reasoning, leading to improved reasoning capabilities and more stable optimization.

Abstract: Reinforcement learning has become a central paradigm for improving LLM reasoning. However, existing methods use a single policy to produce both inference responses and training optimization trajectories. The objective conflict between generating stable inference responses and diverse training trajectories leads to insufficient exploration, which harms reasoning capability. In this paper, to address the problem, we propose R$^2$PO (Residual Rollout Policy Optimization), which introduces a lightweight Residual Rollout-Head atop the policy to decouple training trajectories from inference responses, enabling controlled trajectory diversification during training while keeping inference generation stable. Experiments across multiple benchmarks show that our method consistently outperforms baselines, achieving average accuracy gains of 3.4% on MATH-500 and 1.3% on APPS, while also reducing formatting errors and mitigating length bias for stable optimization. Our code is publicly available at https://github.com/RRPO-ARR/Code.
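
A sketch of the decoupling idea, with the head's architecture as our assumption (the abstract says only that it is lightweight and sits atop the policy):

```python
import torch
import torch.nn as nn

class ResidualRolloutHead(nn.Module):
    """Perturb the base policy's logits for training rollouts only, so
    inference responses stay on the unmodified path (sketch)."""
    def __init__(self, hidden_dim, vocab_size):
        super().__init__()
        self.proj = nn.Linear(hidden_dim, vocab_size)
        nn.init.zeros_(self.proj.weight)  # start as a zero residual
        nn.init.zeros_(self.proj.bias)

    def forward(self, hidden, base_logits, rollout=True):
        if rollout:
            return base_logits + self.proj(hidden)  # diversified trajectories
        return base_logits                          # stable inference
```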

[313] Differentiable Cyclic Causal Discovery Under Unmeasured Confounders

Muralikrishnna G. Sethuraman, Faramarz Fekri

Main category: cs.LG

TL;DR: DCCD-CONF: Differentiable learning of nonlinear cyclic causal graphs with unmeasured confounders using interventional data.

DetailsMotivation: Real-world systems often violate two key assumptions of causal discovery: (1) all variables are observed, and (2) causal graphs are acyclic. Existing methods either assume linearity or struggle with scalability when accounting for confounders.

Method: Proposes DCCD-CONF framework that alternates between optimizing graph structure and estimating confounder distribution by maximizing log-likelihood of interventional data. Handles nonlinear cyclic graphs with unmeasured confounders.

Result: Outperforms state-of-the-art methods in both causal graph recovery and confounder identification on synthetic data and real-world gene perturbation datasets.

Conclusion: DCCD-CONF provides an effective framework for learning nonlinear cyclic causal graphs with unmeasured confounders, with consistency guarantees for theoretical soundness.

Abstract: Understanding causal relationships between variables is fundamental across scientific disciplines. Most causal discovery algorithms rely on two key assumptions: (i) all variables are observed, and (ii) the underlying causal graph is acyclic. While these assumptions simplify theoretical analysis, they are often violated in real-world systems, such as biological networks. Existing methods that account for confounders either assume linearity or struggle with scalability. To address these limitations, we propose DCCD-CONF, a novel framework for differentiable learning of nonlinear cyclic causal graphs in the presence of unmeasured confounders using interventional data. Our approach alternates between optimizing the graph structure and estimating the confounder distribution by maximizing the log-likelihood of the data. Through experiments on synthetic data and real-world gene perturbation datasets, we show that DCCD-CONF outperforms state-of-the-art methods in both causal graph recovery and confounder identification. Additionally, we provide consistency guarantees for our framework, reinforcing its theoretical soundness.

[314] Equivariant Flow Matching for Symmetry-Breaking Bifurcation Problems

Fleur Hendriks, Ondřej Rokoš, Martin Doškář, Marc G. D. Geers, Vlado Menkovski

Main category: cs.LG

TL;DR: Flow matching with equivariant architectures captures multimodal distributions in symmetry-breaking bifurcations, outperforming non-probabilistic methods.

DetailsMotivation: Deterministic machine learning models fail to capture multiple coexisting stable solutions in nonlinear dynamical systems with symmetry breaking, averaging over solutions and missing lower-symmetry outcomes.

Method: Combines flow matching with equivariant architectures and optimal-transport-based coupling mechanism; generalizes equivariant flow matching with symmetric coupling strategy that aligns predicted and target outputs under group actions.

Result: Accurately captures multimodal distributions and symmetry-breaking bifurcations across systems from simple conceptual models to physical problems like buckling beams and Allen-Cahn equation; significantly outperforms non-probabilistic and variational methods.

Conclusion: Provides a principled and scalable solution for modeling multistability in high-dimensional systems using generative AI approaches.

Abstract: Bifurcation phenomena in nonlinear dynamical systems often lead to multiple coexisting stable solutions, particularly in the presence of symmetry breaking. Deterministic machine learning models are unable to capture this multiplicity, averaging over solutions and failing to represent lower-symmetry outcomes. In this work, we formalize the use of generative AI, specifically flow matching, as a principled way to model the full probability distribution over bifurcation outcomes. Our approach builds on existing techniques by combining flow matching with equivariant architectures and an optimal-transport-based coupling mechanism. We generalize equivariant flow matching to a symmetric coupling strategy that aligns predicted and target outputs under group actions, allowing accurate learning in equivariant settings. We validate our approach on a range of systems, from simple conceptual systems to physical problems such as buckling beams and the Allen–Cahn equation. The results demonstrate that the approach accurately captures multimodal distributions and symmetry-breaking bifurcations. Moreover, our results demonstrate that flow matching significantly outperforms non-probabilistic and variational methods. This offers a principled and scalable solution for modeling multistability in high-dimensional systems.
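
To make the symmetric coupling strategy concrete, the sketch below aligns each target sample to its noise sample under the best element of a finite symmetry group before computing a standard flow-matching loss; the interfaces (`v_theta`, `group_actions`) are assumed, and the paper additionally combines this with optimal-transport batch coupling.

```python
import torch

def symmetric_coupling_fm_loss(v_theta, x0, x1, group_actions):
    """Flow-matching loss where each target x1 is replaced by g(x1) for the
    group element g minimizing ||g(x1) - x0||, so the learned field need not
    average over symmetry-related solutions. Sketch under assumed interfaces."""
    candidates = torch.stack([g(x1) for g in group_actions])          # (G, B, ...)
    dists = ((candidates - x0.unsqueeze(0)) ** 2).flatten(2).sum(-1)  # (G, B)
    best = dists.argmin(dim=0)
    x1_aligned = candidates[best, torch.arange(x1.shape[0])]

    t = torch.rand(x0.shape[0], *([1] * (x0.dim() - 1)))  # one t per sample
    xt = (1 - t) * x0 + t * x1_aligned
    target = x1_aligned - x0
    return ((v_theta(xt, t.flatten()) - target) ** 2).mean()
```

For a system with a simple left/right symmetry such as a buckling beam, `group_actions` could be as small as `[lambda x: x, lambda x: -x]`.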

[315] Optimal CO2 storage management considering safety constraints in multi-stakeholder multi-site CCS projects: a Markov game perspective

Jungang Chen, Seyyed A. Hosseini

Main category: cs.LG

TL;DR: The paper proposes a Markov game framework using multi-agent reinforcement learning to analyze coalition structures for multi-stakeholder CCS projects in geologically connected basins.

DetailsMotivation: CCS projects involve diverse stakeholders with conflicting objectives in geologically connected sites, making it unclear whether individual optimization or collaborative coalitions are more effective for project success.

Method: A Markov game framework with multi-agent reinforcement learning and safety constraints, using an E2C-based surrogate model to reduce computational costs of high-fidelity simulations.

Result: The framework effectively addresses optimal CO2 storage management with multiple stakeholders, demonstrating how different coalition structures impact stakeholder goals.

Conclusion: The proposed paradigm provides a quantitative approach to investigate coalition structures in CCS projects, enabling stakeholders to learn optimal strategies while complying with safety regulations.

Abstract: Carbon capture and storage (CCS) projects typically involve a diverse array of stakeholders or players from public, private, and regulatory sectors, each with different objectives and responsibilities. Given the complexity, scale, and long-term nature of CCS operations, determining whether individual stakeholders can independently maximize their interests or whether collaborative coalition agreements are needed remains a central question for effective CCS project planning and management. CCS projects are often implemented in geologically connected sites, where shared geological features such as pressure space and reservoir pore capacity can lead to competitive behavior among stakeholders. Furthermore, CO2 storage sites are often located in geologically mature basins that previously served as sites for hydrocarbon extraction or wastewater disposal in order to leverage existing infrastructures, which makes unilateral optimization even more complicated and unrealistic. In this work, we propose a paradigm based on Markov games to quantitatively investigate how different coalition structures affect the goals of stakeholders. We frame this multi-stakeholder multi-site problem as a multi-agent reinforcement learning problem with safety constraints. Our approach enables agents to learn optimal strategies while remaining compliant with safety regulations. We present an example where multiple operators are injecting CO2 into their respective project areas in a geologically connected basin. To address the high computational cost of repeated simulations of high-fidelity models, a previously developed surrogate model based on the Embed-to-Control (E2C) framework is employed. Our results demonstrate the effectiveness of the proposed framework in addressing optimal management of CO2 storage when multiple stakeholders with various objectives and goals are involved.

[316] Towards Reasoning for PDE Foundation Models: A Reward-Model-Driven Inference-Time-Scaling Algorithm

Siddharth Mansingh, James Amarel, Ragib Arnab, Arvind Mohan, Kamaljeet Singh, Gerd J. Kunde, Nicolas Hengartner, Benjamin Migliori, Emily Casleton, Nathan A. Debardeleben, Ayan Biswas, Diane Oyen, Earl Lawrence

Main category: cs.LG

TL;DR: The paper introduces a test-time computing (TTC) strategy for PDEs that uses computational resources during inference to improve prediction accuracy with fewer training samples and smaller models, inspired by LLM “thinking” strategies.

DetailsMotivation: Existing PDE foundation models are constrained by pretraining datasets, struggle with auto-regressive rollout performance (especially OOD cases), and have high compute/data requirements that limit their use in critical applications.

Method: Introduces test-time computing (TTC) strategy using two types of reward models that evaluate predictions of a stochastic-based model for spatio-temporal consistency. Inspired by “thinking” strategies in LLMs.

Result: Demonstrated on compressible Euler-equation simulations from PDEGym benchmark, showing TTC captures improved predictions relative to standard non-adaptive auto-regressive inference.

Conclusion: TTC framework marks a foundational step toward more advanced reasoning algorithms for PDE modeling, including building reinforcement-learning-based approaches, potentially transforming computational workflows in physics and engineering.

Abstract: Partial Differential Equations (PDEs) are the bedrock for modern computational sciences and engineering, and are inherently computationally expensive. While PDE foundation models have shown much promise for simulating such complex spatio-temporal phenomena, existing models remain constrained by the pretraining datasets and struggle with auto-regressive rollout performance, especially in out-of-distribution (OOD) cases. Furthermore, they have significant compute and training data requirements which hamper their use in many critical applications. Inspired by recent advances in “thinking” strategies used in large language models (LLMs), we introduce the first test-time computing (TTC) strategy for PDEs that utilizes computational resources during inference to achieve more accurate predictions with fewer training samples and smaller models. We accomplish this with two types of reward models that evaluate predictions of a stochastic-based model for spatio-temporal consistency. We demonstrate this method on compressible Euler-equation simulations from the PDEGym benchmark and show that TTC captures improved predictions relative to standard non-adaptive auto-regressive inference. This TTC framework marks a foundational step towards more advanced reasoning algorithms for PDE modeling, including building reinforcement-learning-based approaches, potentially transforming computational workflows in physics and engineering.
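
The selection loop itself is simple to sketch. Below is a hedged best-of-N illustration of reward-guided auto-regressive rollout; `stochastic_step` and `reward_model` are assumed interfaces, and the paper's two reward models and sampling scheme are not specified in this summary.

```python
import torch

@torch.no_grad()
def ttc_rollout(stochastic_step, reward_model, u0, n_steps, n_candidates=8):
    """Best-of-N test-time computing for an auto-regressive PDE surrogate.

    stochastic_step(u) -> one sampled next state (differs per call).
    reward_model(u_prev, u_next) -> scalar spatio-temporal consistency score.
    """
    u, trajectory = u0, [u0]
    for _ in range(n_steps):
        candidates = [stochastic_step(u) for _ in range(n_candidates)]
        scores = torch.stack([reward_model(u, c) for c in candidates])
        u = candidates[scores.argmax().item()]  # keep the most consistent step
        trajectory.append(u)
    return torch.stack(trajectory)
```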

[317] Adaptive Spatial Goodness Encoding: Advancing and Scaling Forward-Forward Learning Without Backpropagation

Qingchun Gong, Robert Bogdan Staszewski, Kai Xu

Main category: cs.LG

TL;DR: ASGE is a new Forward-Forward training framework for CNNs that addresses channel explosion issues through adaptive spatial goodness encoding, achieving state-of-the-art FF performance and first successful ImageNet application.

DetailsMotivation: Existing Forward-Forward based extensions for CNNs suffer from limited representational capacity and poor scalability to large datasets due to exploding channel dimensionality, which needs to be addressed.

Method: Proposes adaptive spatial goodness encoding (ASGE) that leverages feature maps to compute spatially-aware goodness representations at each layer, enabling layer-wise supervision while decoupling classification complexity from channel dimensionality.

Result: Outperforms all other FF-based approaches with test accuracies: 99.65% (MNIST), 93.41% (FashionMNIST), 90.62% (CIFAR-10), 65.42% (CIFAR-100). First successful FF-based training on ImageNet with 51.58% Top-1 and 75.23% Top-5 accuracy.

Conclusion: ASGE effectively addresses channel explosion in FF-based CNN training, achieves competitive performance compared to BP alternatives, and enables flexible deployment through three prediction strategies for accuracy-parameter-memory trade-offs.

Abstract: The Forward-Forward (FF) algorithm offers a promising alternative to backpropagation (BP). Despite advancements in recent FF-based extensions, which have enhanced the original algorithm and adapted it to convolutional neural networks (CNNs), they often suffer from limited representational capacity and poor scalability to large-scale datasets, primarily due to exploding channel dimensionality. In this work, we propose adaptive spatial goodness encoding (ASGE), a new FF-based training framework tailored for CNNs. ASGE leverages feature maps to compute spatially-aware goodness representations at each layer, enabling layer-wise supervision. Crucially, this approach decouples classification complexity from channel dimensionality, thereby addressing the issue of channel explosion and achieving competitive performance compared to other BP alternatives. ASGE outperforms all other FF-based approaches across multiple benchmarks, delivering test accuracies of 99.65% on MNIST, 93.41% on FashionMNIST, 90.62% on CIFAR-10, and 65.42% on CIFAR-100. Moreover, we present the first successful application of FF-based training to ImageNet, with Top-1 and Top-5 accuracies of 51.58% and 75.23%. Furthermore, we propose three prediction strategies to achieve flexible trade-offs among accuracy, parameters and memory usage, enabling deployment under diverse resource constraints.
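
A minimal sketch of the central idea, under assumptions: compute a per-location "goodness" map (in the Forward-Forward sense, squared activations summed over channels) and feed it to a layer-local classification head, so the head's input size tracks spatial extent rather than channel count. The paper's adaptive encoding is more elaborate than this.

```python
import torch
import torch.nn as nn

def spatial_goodness(feature_map: torch.Tensor) -> torch.Tensor:
    """Per-location goodness: sum of squared activations over channels,
    mapping (B, C, H, W) -> (B, H, W); independent of channel count."""
    return (feature_map ** 2).sum(dim=1)

class LayerLocalHead(nn.Module):
    """Layer-wise supervision head over the spatial goodness map
    (assumes square H = W = spatial_size feature maps)."""

    def __init__(self, spatial_size: int, num_classes: int):
        super().__init__()
        self.fc = nn.Linear(spatial_size * spatial_size, num_classes)

    def forward(self, feature_map):
        # In FF-style training, this local loss does not backpropagate
        # through earlier layers.
        return self.fc(spatial_goodness(feature_map).flatten(1))
```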

[318] Soft Graph Transformer for MIMO Detection

Jiadong Hong, Lei Liu, Xinyu Bian, Wenjie Wang, Zhaoyang Zhang

Main category: cs.LG

TL;DR: SGT is a soft-input-soft-output neural architecture for MIMO detection that combines self-attention with graph-aware cross-attention to achieve near-ML performance while maintaining computational efficiency.

DetailsMotivation: ML detection has exponential complexity, conventional message-passing algorithms fail in finite dimensions, and existing Transformer-based detectors ignore MIMO factor graph structure and cannot use prior soft information.

Method: Combines self-attention (encoding contextual dependencies within symbol and constraint subgraphs) with graph-aware cross-attention (structured message passing across subgraphs), featuring soft-input interface for auxiliary priors.

Result: Achieves near-ML performance, offers computational efficiency, and provides flexible, interpretable framework for receiver systems leveraging soft priors.

Conclusion: SGT effectively addresses limitations of existing MIMO detection methods by integrating graph structure awareness with soft information processing in a Transformer-based architecture.

Abstract: We propose the Soft Graph Transformer (SGT), a soft-input-soft-output neural architecture designed for MIMO detection. While Maximum Likelihood (ML) detection achieves optimal accuracy, its exponential complexity makes it infeasible in large systems, and conventional message-passing algorithms rely on asymptotic assumptions that often fail in finite dimensions. Recent Transformer-based detectors show strong performance but typically overlook the MIMO factor graph structure and cannot exploit prior soft information. SGT addresses these limitations by combining self-attention, which encodes contextual dependencies within symbol and constraint subgraphs, with graph-aware cross-attention, which performs structured message passing across subgraphs. Its soft-input interface allows the integration of auxiliary priors, producing effective soft outputs while maintaining computational efficiency. Experiments demonstrate that SGT achieves near-ML performance and offers a flexible and interpretable framework for receiver systems that leverage soft priors.
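
One SGT-style block can be sketched compactly: self-attention inside the symbol subgraph, then cross-attention that passes messages from constraint nodes to symbol nodes. Dimensions, layer choices, and the omitted factor-graph masking and soft-input interface are all assumptions here.

```python
import torch.nn as nn

class GraphAwareBlock(nn.Module):
    """Sketch of one block: self-attention within the symbol subgraph,
    then cross-attention from the constraint subgraph (structured message
    passing). Factor-graph attention masks are assumed but omitted."""

    def __init__(self, d_model: int, n_heads: int = 4):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, symbols, constraints):
        s, _ = self.self_attn(symbols, symbols, symbols)
        symbols = self.norm1(symbols + s)        # intra-subgraph context
        c, _ = self.cross_attn(symbols, constraints, constraints)
        return self.norm2(symbols + c)           # constraint -> symbol messages
```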

[319] AutoSciDACT: Automated Scientific Discovery through Contrastive Embedding and Hypothesis Testing

Samuel Bright-Thonney, Christina Reissel, Gaia Grosso, Nathaniel Woodward, Katya Govorkova, Andrzej Novak, Sang Eon Park, Eric Moreno, Philip Harris

Main category: cs.LG

TL;DR: AutoSciDACT is a unified pipeline for novelty detection in scientific data that combines contrastive pre-training with statistical hypothesis testing to make quantifiable claims about anomalies.

DetailsMotivation: Scientific novelty detection faces challenges with noisy high-dimensional data and lacks methods that produce outputs compatible with quantifiable statistical claims of scientific discovery.

Method: AutoSciDACT uses contrastive pre-training to create expressive low-dimensional embeddings from simulated data and data augmentation, then applies the NPLM framework for sensitive two-sample hypothesis testing to statistically quantify deviations from reference distributions.

Result: The method demonstrates strong sensitivity to small injections of anomalous data across astronomical, physical, biological, image, and synthetic datasets.

Conclusion: AutoSciDACT provides a general-purpose pipeline for rigorous statistical novelty detection in scientific domains, addressing the need for quantifiable claims in scientific discovery.

Abstract: Novelty detection in large scientific datasets faces two key challenges: the noisy and high-dimensional nature of experimental data, and the necessity of making statistically robust statements about any observed outliers. While there is a wealth of literature on anomaly detection via dimensionality reduction, most methods do not produce outputs compatible with quantifiable claims of scientific discovery. In this work we directly address these challenges, presenting the first step towards a unified pipeline for novelty detection adapted for the rigorous statistical demands of science. We introduce AutoSciDACT (Automated Scientific Discovery with Anomalous Contrastive Testing), a general-purpose pipeline for detecting novelty in scientific data. AutoSciDACT begins by creating expressive low-dimensional data representations using contrastive pre-training, leveraging the abundance of high-quality simulated data in many scientific domains alongside expertise that can guide principled data augmentation strategies. These compact embeddings then enable an extremely sensitive machine learning-based two-sample test using the New Physics Learning Machine (NPLM) framework, which identifies and statistically quantifies deviations in observed data relative to a reference distribution (null hypothesis). We perform experiments across a range of astronomical, physical, biological, image, and synthetic datasets, demonstrating strong sensitivity to small injections of anomalous data across all domains.

[320] Auto-bidding under Return-on-Spend Constraints with Uncertainty Quantification

Jiale Han, Chun Gan, Chengcheng Zhang, Jie He, Zhangang Lin, Ching Law, Xiaowu Dai

Main category: cs.LG

TL;DR: Auto-bidding system using conformal prediction to handle unknown ad impression values, providing performance guarantees without requiring true value knowledge.

DetailsMotivation: Existing auto-bidding systems assume known ad impression values (like conversion rates), but real-world scenarios involve unknown true values. Current industry systems use ML predictions but lack uncertainty quantification and performance guarantees.

Method: Uses conformal prediction to quantify uncertainty of ad impression values based on ML predictions from historical bidding data with contextual features. Introduces adjusted value estimator from prediction intervals, compatible with existing industry ML systems. Applied to enhance auto-bidding algorithms with budget and RoS constraints.

Result: Theoretical guarantees for achieving high reward while keeping RoS violations low. Empirical results on simulated and real-world industrial datasets show improved performance while maintaining computational efficiency.

Conclusion: Proposed method successfully handles unknown ad impression values using conformal prediction, providing practical solution with theoretical guarantees that integrates well with existing industry auto-bidding systems.

Abstract: Auto-bidding systems are widely used in advertising to automatically determine bid values under constraints such as total budget and Return-on-Spend (RoS) targets. Existing works often assume that the value of an ad impression, such as the conversion rate, is known. This paper considers the more realistic scenario where the true value is unknown. We propose a novel method that uses conformal prediction to quantify the uncertainty of these values based on machine learning methods trained on historical bidding data with contextual features, without assuming the data are i.i.d. This approach is compatible with current industry systems that use machine learning to predict values. Building on prediction intervals, we introduce an adjusted value estimator derived from machine learning predictions, and show that it provides performance guarantees without requiring knowledge of the true value. We apply this method to enhance existing auto-bidding algorithms with budget and RoS constraints, and establish theoretical guarantees for achieving high reward while keeping RoS violations low. Empirical results on both simulated and real-world industrial datasets demonstrate that our approach improves performance while maintaining computational efficiency.
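
The adjusted-value idea can be illustrated with plain split conformal prediction: calibrate a residual quantile, then bid on the conservative end of the interval. This minimal sketch assumes exchangeable calibration data, whereas the paper explicitly avoids the i.i.d. assumption; all names are illustrative.

```python
import numpy as np

def adjusted_value(pred_fn, X_calib, y_calib, X_new, alpha=0.1):
    """Split-conformal sketch: a lower prediction bound on impression value.

    pred_fn: the existing ML value predictor (e.g., a conversion-rate model).
    Bidding on preds - q is the conservative 'adjusted value' that helps
    keep Return-on-Spend violations low."""
    residuals = np.abs(y_calib - pred_fn(X_calib))
    n = len(residuals)
    level = min(1.0, np.ceil((n + 1) * (1 - alpha)) / n)  # finite-sample correction
    q = np.quantile(residuals, level)
    return pred_fn(X_new) - q
```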

[321] Faster, Smaller, and Smarter: Task-Aware Expert Merging for Online MoE Inference

Ziyi Han, Xutong Liu, Ruiting Zhou, Xiangxiang Dai, John C. S. Lui

Main category: cs.LG

TL;DR: Tanbr enables efficient online MoE inference via tree-structured neural bandit routing that merges experts adaptively without task tags, achieving 45% latency reduction and 25% memory savings.

DetailsMotivation: Sparse Mixture of Experts (SMoE) models are large and complex for online inference, especially in edge networks where task information is unavailable, making task-level routing error-prone and deployment challenging.

Method: Proposes Tanbr: a tree-structured adaptive neural bandit router that estimates task distribution from historical data, uses binary tree partitioning for merging weight space, and applies neural bandit to learn non-linear mapping from merging weights to model performance for optimal expert merging.

Result: Tanbr achieves sublinear regret bound of O(√T log(T)), reduces inference latency by at least 45%, memory usage by up to 25%, while maintaining high accuracy compared to state-of-the-art methods.

Conclusion: Tanbr provides an effective solution for efficient and reliable online MoE inference by adaptively merging experts without task tags, making SMoE deployment practical for resource-constrained environments.

Abstract: Sparse Mixture of Experts (SMoE) has become a preferred architecture for scaling Transformer capacity without increasing computational cost, as it activates only a small subset of experts for each input. However, deploying such an approach for \textit{online inference} remains challenging due to the large size of a full SMoE model and the complexity of expert routing, especially in resource-constrained edge networks. Moreover, during online inference, task information is often unavailable, making task-level routing error-prone. In this work, we propose a novel tree-structured adaptive neural bandit router, \texttt{Tanbr}, to enable efficient and reliable online MoE inference. Instead of relying on explicit task tags, \texttt{Tanbr} estimates the task distribution over time from historical data and uses it to guide task-aware expert merging within a given pre-trained MoE. To handle the large continuous space of merging weights, \texttt{Tanbr} employs a binary tree to progressively partition the space and generate finer candidate weights. It then applies a neural bandit to learn the non-linear mapping from merging weight to model performance and decide the optimal expert merging. We prove that \texttt{Tanbr} achieves a sublinear regret bound of $\mathcal{O}(\sqrt{T}\log(T))$ over $T$ rounds, despite operating over a continuous decision space, matching the regret bounds of existing methods. Extensive experiments show that \texttt{Tanbr} reduces inference latency by at least $45\%$ and memory usage by up to $25\%$, while maintaining high accuracy compared to many state-of-the-art methods.

[322] LATTLE: LLM Attention Transplant for Transfer Learning of Tabular Data Across Disparate Domains

Ibna Kowsar, Kazi F. Akhter, Manar D. Samad

Main category: cs.LG

TL;DR: LATTLE is a novel method that transplants LLM attention weights to enable effective transfer learning across disparate tabular datasets without shared features, prompt engineering, or large pretrained models.

DetailsMotivation: Transfer learning on tabular data is challenging due to disparate feature spaces across domains, unlike homogeneous image/text data. LLMs offer knowledge but face limitations with subjective prompts and computational constraints of in-context learning.

Method: Proposes LLM-attention transplant for transfer learning (LATTLE) - a language-to-tabular context-learning method using attention-specific transformer weights. The LLM attention transplant mechanism enables domain-agnostic transfer without shared features, prompt engineering, or large pretrained models.

Result: Experiments with ten pairs of disjoint source-target datasets and 12 baseline methods show LATTLE outperforms traditional ML models, state-of-the-art deep tabular architectures, and models trained on thousands to billions of tabular samples.

Conclusion: Cross-domain attention transfer provides an effective solution for adapting LLMs to learn non-text tabular data in low-resource environments, enabling seamless transfer learning across disparate tabular datasets.

Abstract: Transfer learning on tabular data is challenging due to disparate feature spaces across domains, in contrast to the homogeneous structures of image and text. Large language models (LLMs) offer a knowledge base to improve the limited effectiveness of cross-domain transfer learning for tabular data. However, LLM performance often stagnates due to subjective text prompts and the computational limitations of in-context learning. We present a novel language-to-tabular context-learning method that uses attention-specific transformer weights, enabling seamless transfer learning across disparate tabular data sets. The LLM attention transplant mechanism facilitates domain-agnostic transfer learning, eliminating the need for shared features between tables, LLM prompt engineering, and large-scale pretrained models. Our experiments using ten pairs of disjoint source-target data sets and 12 baseline methods demonstrate the superiority of the proposed LLM-attention transplant for transfer learning (LATTLE) method over traditional ML models, state-of-the-art deep tabular architectures, and models trained on thousands to billions of tabular samples. The proposed cross-domain attention transfer demonstrates an effective solution for adapting LLMs to learn non-text tabular data in a low-resource environment. The source code of the LATTLE implementation is publicly available.

[323] Flow Matching with Semidiscrete Couplings

Alireza Mousavi-Hosseini, Stephen Y. Zhang, Michal Klein, Marco Cuturi

Main category: cs.LG

TL;DR: The paper proposes Semidiscrete Flow Matching (SD-FM), which improves upon OT-FM by using semidiscrete optimal transport to efficiently match noise to data points, avoiding the quadratic computational bottleneck of batch-OT approaches.

DetailsMotivation: OT-FM (Optimal Transport Flow Matching) shows theoretical promise but suffers from quadratic computational costs O(n²/ε²) due to Sinkhorn algorithm, requiring large batch sizes and multi-GPU setups to be effective. This practical bottleneck limits widespread adoption despite theoretical advantages.

Method: SD-FM uses semidiscrete OT formulation that leverages finite dataset size N. It estimates a dual potential vector via SGD, then matches freshly sampled noise vectors to data points using maximum inner product search (MIPS), eliminating the quadratic dependency on n/ε.

Result: SD-FM outperforms both standard FM and OT-FM across all training metrics and inference budget constraints, on multiple datasets, for both unconditional/conditional generation, and with mean-flow models.

Conclusion: Semidiscrete Flow Matching provides a practical and efficient alternative to batch-OT approaches, fulfilling the theoretical promises of OT-FM while being computationally feasible for real-world applications.

Abstract: Flow models parameterized as time-dependent velocity fields can generate data from noise by integrating an ODE. These models are often trained using flow matching, i.e. by sampling random pairs of noise and target points $(\mathbf{x}_0,\mathbf{x}_1)$ and ensuring that the velocity field is aligned, on average, with $\mathbf{x}_1-\mathbf{x}_0$ when evaluated along a segment linking $\mathbf{x}_0$ to $\mathbf{x}_1$. While these pairs are sampled independently by default, they can also be selected more carefully by matching batches of $n$ noise to $n$ target points using an optimal transport (OT) solver. Although promising in theory, the OT flow matching (OT-FM) approach is not widely used in practice. Zhang et al. (2025) pointed out recently that OT-FM truly starts paying off when the batch size $n$ grows significantly, which only a multi-GPU implementation of the Sinkhorn algorithm can handle. Unfortunately, the costs of running Sinkhorn can quickly balloon, requiring $O(n^2/\varepsilon^2)$ operations for every $n$ pairs used to fit the velocity field, where $\varepsilon$ is a regularization parameter that should be typically small to yield better results. To fulfill the theoretical promises of OT-FM, we propose to move away from batch-OT and rely instead on a semidiscrete formulation that leverages the fact that the target dataset distribution is usually of finite size $N$. The SD-OT problem is solved by estimating a dual potential vector using SGD; using that vector, freshly sampled noise vectors at train time can then be matched with data points at the cost of a maximum inner product search (MIPS). Semidiscrete FM (SD-FM) removes the quadratic dependency on $n/\varepsilon$ that bottlenecks OT-FM. SD-FM beats both FM and OT-FM on all training metrics and inference budget constraints, across multiple datasets, on unconditional/conditional generation, or when using mean-flow models.
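
For intuition, here is a self-contained sketch of the semidiscrete step under a squared-Euclidean cost and uniform data weights: SGD ascent on the dual potential, after which matching a fresh noise sample is exactly a maximum inner product search. This follows textbook semidiscrete OT; the paper's estimator and cost may differ.

```python
import torch

def train_dual_potential(data, n_iters=10_000, lr=0.1, batch=256):
    """SGD on the semidiscrete OT dual g in R^N for cost 0.5*||x - y_j||^2:
    maximize E_x[min_j (c(x, y_j) - g_j)] + mean(g)."""
    N, d = data.shape
    g = torch.zeros(N)
    half_sq = 0.5 * (data ** 2).sum(dim=1)
    for _ in range(n_iters):
        x = torch.randn(batch, d)  # fresh noise samples
        # argmin_j 0.5||x-y_j||^2 - g_j == argmax_j <x, y_j> + g_j - 0.5||y_j||^2
        j_star = (x @ data.T + (g - half_sq)).argmax(dim=1)   # the MIPS step
        grad = torch.full((N,), 1.0 / N)                      # d mean(g) / dg
        grad.scatter_add_(0, j_star, torch.full((batch,), -1.0 / batch))
        g += lr * grad                                        # dual ascent
    return g

def match_noise_to_data(noise, data, g):
    """Return, for each noise vector, the index of its matched data point."""
    half_sq = 0.5 * (data ** 2).sum(dim=1)
    return (noise @ data.T + (g - half_sq)).argmax(dim=1)
```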

[324] Escaping Local Optima in the Waddington Landscape: A Two-Stage TRPO-PPO Approach for Single-Cell Perturbation Analysis

Francis Boabang, Samuel Asante Gyamerah

Main category: cs.LG

TL;DR: Two-stage RL algorithm improves single-cell perturbation modeling by combining natural gradient initialization with PPO refinement for better generalization in digital twin systems.

DetailsMotivation: Existing models for cellular perturbation prediction either use in silico or experimental data but rarely integrate both, limiting generalization in digital twin systems. They also get trapped in local optima in the nonconvex Waddington landscape of cell fate decisions.

Method: Two-stage reinforcement learning algorithm: 1) Natural gradient update using Fisher-vector products and conjugate gradient solver with KL trust-region constraint for safe initialization. 2) Proximal Policy Optimization (PPO) with KL penalty to refine the policy using minibatch efficiency.

Result: The initialization strategy substantially improves generalization on Single-cell RNA sequencing (scRNA-seq) perturbation analysis in a digital twin system.

Conclusion: The proposed two-stage RL approach addresses limitations of existing models by providing better initialization to avoid local optima and improving generalization across simulated and real biological contexts in digital twin systems.

Abstract: Modeling cellular responses to genetic and chemical perturbations remains a central challenge in single-cell biology. Existing data-driven frameworks have advanced perturbation prediction through variational autoencoders, chemically conditioned autoencoders, and large-scale transformer pretraining. However, most existing models rely exclusively on either in silico perturbation data or experimental perturbation data but rarely integrate both, limiting their ability to generalize and validate predictions across simulated and real biological contexts in a digital twin system. Moreover, the models are prone to local optima in the nonconvex Waddington landscape of cell fate decisions, where poor initialization can trap trajectories in spurious lineages. In this work, we introduce a two-stage reinforcement learning algorithm for modeling single-cell perturbation. We first compute an explicit natural gradient update using Fisher-vector products and a conjugate gradient solver, scaled by a KL trust-region constraint to provide a safe, curvature-aware first step for the policy. Starting with these preconditioned parameters, we then apply a second phase of proximal policy optimization (PPO) with a KL penalty, exploiting minibatch efficiency to refine the policy. We demonstrate that this initialization strategy substantially improves generalization on Single-cell RNA sequencing (scRNA-seq) perturbation analysis in a digital twin system.
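
The first stage is standard trust-region machinery; for reference, a compact conjugate-gradient solver that needs only Fisher-vector products, plus the KL-based step scaling, looks like the following generic sketch (not the authors' code).

```python
import torch

def conjugate_gradient(fvp, b, n_iters=10, tol=1e-10):
    """Solve F x = b given only the Fisher-vector product fvp(v) = F v."""
    x = torch.zeros_like(b)
    r, p = b.clone(), b.clone()
    rs_old = r @ r
    for _ in range(n_iters):
        Fp = fvp(p)
        alpha = rs_old / (p @ Fp)
        x += alpha * p
        r -= alpha * Fp
        rs_new = r @ r
        if rs_new < tol:
            break
        p = r + (rs_new / rs_old) * p
        rs_old = rs_new
    return x

def natural_gradient_step(grad, fvp, max_kl=0.01):
    """Scale x = F^{-1} grad so the quadratic KL estimate 0.5 * s^T F s
    equals max_kl, i.e. s = sqrt(2 * max_kl / (x^T F x)) * x."""
    x = conjugate_gradient(fvp, grad)
    return torch.sqrt(2 * max_kl / (x @ fvp(x))) * x
```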

[325] Koopman Invariants as Drivers of Emergent Time-Series Clustering in Joint-Embedding Predictive Architectures

Pablo Ruiz-Morales, Dries Vanoost, Davy Pissoort, Mathias Verbeke

Main category: cs.LG

TL;DR: JEPAs cluster time-series by dynamical regimes because their predictive objective implicitly learns the invariant subspace of the Koopman operator, with the linear predictor’s near-identity constraint being key to learning regime indicator functions.

DetailsMotivation: To explain the unexplained ability of Joint-Embedding Predictive Architectures (JEPAs) to cluster time-series data by their underlying dynamical regimes, and to provide a theoretical foundation connecting self-supervised learning with dynamical systems theory.

Method: Proposed theoretical explanation that JEPA’s predictive objective implicitly drives learning of invariant subspace of Koopman operator. Proved idealized JEPA loss is minimized when encoder represents regime indicator functions (Koopman eigenfunctions). Validated theory on synthetic data with known dynamics, showing that constraining linear predictor to be near-identity operator forces encoder to learn invariants.

Result: Demonstrated that the near-identity constraint on JEPA’s linear predictor is the key inductive bias that forces encoder to learn regime indicator functions. Showed this constraint selects interpretable solution from mathematically equivalent but entangled optima, revealing predictor’s role in representation disentanglement.

Conclusion: This work demystifies JEPA’s clustering behavior, provides principled connection between self-supervised learning and dynamical systems theory, and informs design of more robust and interpretable time-series models by understanding the role of predictor constraints in representation learning.

Abstract: Joint-Embedding Predictive Architectures (JEPAs), a powerful class of self-supervised models, exhibit an unexplained ability to cluster time-series data by their underlying dynamical regimes. We propose a novel theoretical explanation for this phenomenon, hypothesizing that JEPA’s predictive objective implicitly drives it to learn the invariant subspace of the system’s Koopman operator. We prove that an idealized JEPA loss is minimized when the encoder represents the system’s regime indicator functions, which are Koopman eigenfunctions. This theory was validated on synthetic data with known dynamics, demonstrating that constraining the JEPA’s linear predictor to be a near-identity operator is the key inductive bias that forces the encoder to learn these invariants. We further discuss that this constraint is critical for selecting this interpretable solution from a class of mathematically equivalent but entangled optima, revealing the predictor’s role in representation disentanglement. This work demystifies a key behavior of JEPAs, provides a principled connection between modern self-supervised learning and dynamical systems theory, and informs the design of more robust and interpretable time-series models.
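
The key inductive bias is a one-liner in code: parameterize the linear predictor as the identity plus a small learned residual. The `eps` scale and the stop-gradient target below are assumptions of this sketch, not specifics from the paper.

```python
import torch.nn as nn

class NearIdentityPredictor(nn.Module):
    """JEPA linear predictor constrained to stay near the identity:
    P(z) = z + eps * W z, with W learned and initialized to zero."""

    def __init__(self, dim: int, eps: float = 0.05):
        super().__init__()
        self.residual = nn.Linear(dim, dim, bias=False)
        nn.init.zeros_(self.residual.weight)
        self.eps = eps

    def forward(self, z):
        # Predicted future embeddings can barely move, so minimizing the
        # prediction error favors (Koopman-)invariant coordinates.
        return z + self.eps * self.residual(z)

def jepa_loss(encoder, predictor, x_t, x_next):
    z_t, z_next = encoder(x_t), encoder(x_next)
    return ((predictor(z_t) - z_next.detach()) ** 2).mean()
```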

[326] Hierarchical Physics-Embedded Learning for Prediction and Discovery in Spatiotemporal Dynamical Systems

Xizhe Wang, Xiaobin Song, Qingshan Jia, Hao Sun, Hongbo Zhao, Benben Jiang

Main category: cs.LG

TL;DR: A hierarchical physics-embedded learning framework for spatiotemporal prediction and physical law discovery from sparse noisy data, using a two-level architecture that learns symbolic PDE components and their combinations while embedding known physics.

DetailsMotivation: Modeling complex spatiotemporal dynamics in far-from-equilibrium systems is challenging because governing PDEs are often intractable to derive from first principles due to high-order derivatives, strong nonlinearities, and incomplete physical knowledge. Existing data-driven methods have limitations: purely data-driven models lack physical consistency and require extensive data, while physics-informed methods lack structural capacity for complex operators or systematic integration of partial knowledge.

Method: Proposes a hierarchical physics-embedded learning framework with a two-level architecture that mirrors scientific discovery: Level 1 learns fundamental symbolic components of PDEs, Level 2 learns their governing combinations. The framework builds upon adaptive Fourier Neural Operators to capture non-local dependencies and high-order operators. Known physical laws are directly embedded into computational graphs for physical consistency, while unknown terms are structurally decoupled to enable interpretable discovery through symbolic regression without presupposing functional forms.

Result: The framework fundamentally advances both forward spatiotemporal prediction and inverse discovery of physical laws from sparse and noisy data. The hierarchical decomposition reduces learning complexity and enables structural integration of prior knowledge, improving data efficiency while guaranteeing physical consistency.

Conclusion: The proposed hierarchical physics-embedded learning framework addresses key limitations in modeling complex spatiotemporal dynamics by combining the strengths of data-driven and physics-informed approaches through a structured, interpretable architecture that can handle incomplete physical knowledge while maintaining physical consistency and enabling discovery of underlying governing equations.

Abstract: Modeling complex spatiotemporal dynamics, particularly in far-from-equilibrium systems, remains a grand challenge in science. The governing partial differential equations (PDEs) for these systems are often intractable to derive from first principles, due to their inherent complexity, characterized by high-order derivatives and strong nonlinearities, coupled with incomplete physical knowledge. This has spurred the development of data-driven methods, yet these approaches face limitations: Purely data-driven models are often physically inconsistent and data-intensive, while existing physics-informed methods lack the structural capacity to represent complex operators or systematically integrate partial physical knowledge. Here, we propose a hierarchical physics-embedded learning framework that fundamentally advances both the forward spatiotemporal prediction and inverse discovery of physical laws from sparse and noisy data. The key innovation is a two-level architecture that mirrors the process of scientific discovery: the first level learns fundamental symbolic components of a PDE, while the second learns their governing combinations. This hierarchical decomposition not only reduces learning complexity but, more importantly, enables a structural integration of prior knowledge. Known physical laws are directly embedded into the model’s computational graph, guaranteeing physical consistency and improving data efficiency. By building the framework upon adaptive Fourier Neural Operators, we can effectively capture the non-local dependencies and high-order operators characteristic of dynamical systems. Additionally, by structurally decoupling known and unknown terms, the framework further enables interpretable discovery of underlying governing equations through symbolic regression, without presupposing functional forms.

[327] Constrained Best Arm Identification with Tests for Feasibility

Ting Cai, Kirthevasan Kandasamy

Main category: cs.LG

TL;DR: Feasible Best Arm Identification (BAI) with separate performance and constraint testing, where algorithm must decide whether to test performance or feasibility constraints for each arm pull.

DetailsMotivation: Real-world BAI problems often require arms to satisfy feasibility constraints (e.g., drug safety thresholds), but existing work assumes simultaneous observation of performance and constraints, which doesn't match practical scenarios like drug discovery where safety tests are conducted separately from performance measurements.

Method: Propose an efficient algorithm for feasible BAI where decision-maker chooses tuple (i,ℓ) - arm i and whether to test performance (ℓ=0) or one of N feasibility constraints (ℓ∈[N]). Focus on fixed-confidence setting to identify feasible arm with highest performance with probability ≥1-δ.

Result: Algorithm’s sample complexity is upper-bounded, showing it adapts to problem difficulty and eliminates arms by worse performance or infeasibility. Lower bound proves algorithm is asymptotically optimal (δ→0). Empirical results show algorithm outperforms state-of-the-art BAI algorithms on synthetic and real-world datasets.

Conclusion: The proposed feasible BAI algorithm with separate performance/constraint testing is both theoretically optimal and practically effective, addressing real-world needs where safety/feasibility constraints must be tested independently from performance measurements.

Abstract: Best arm identification (BAI) aims to identify the highest-performance arm among a set of $K$ arms by collecting stochastic samples from each arm. In real-world problems, the best arm needs to satisfy additional feasibility constraints. While there is limited prior work on BAI with feasibility constraints, they typically assume the performance and constraints are observed simultaneously on each pull of an arm. However, this assumption does not reflect most practical use cases, e.g., in drug discovery, we wish to find the most potent drug whose toxicity and solubility are below certain safety thresholds. These safety experiments can be conducted separately from the potency measurement. Thus, this requires designing BAI algorithms that not only decide which arm to pull but also decide whether to test for the arm’s performance or feasibility. In this work, we study feasible BAI which allows a decision-maker to choose a tuple $(i,\ell)$, where $i\in [K]$ denotes an arm and $\ell$ denotes whether she wishes to test for its performance ($\ell=0$) or any of its $N$ feasibility constraints ($\ell\in[N]$). We focus on the fixed confidence setting, which is to identify the feasible arm with the highest performance, with a probability of at least $1-\delta$. We propose an efficient algorithm and upper-bound its sample complexity, showing our algorithm can naturally adapt to the problem’s difficulty and eliminate arms by worse performance or infeasibility, whichever is easier. We complement this upper bound with a lower bound showing that our algorithm is \textit{asymptotically ($\delta\rightarrow 0$) optimal}. Finally, we empirically show that our algorithm outperforms other state-of-the-art BAI algorithms in both synthetic and real-world datasets.

[328] Enhancing ECG Classification Robustness with Lightweight Unsupervised Anomaly Detection Filters

Mustafa Fuad Rifet Ibrahim, Maurice Meijer, Alexander Schlaefer, Peer Stelldinger

Main category: cs.LG

TL;DR: NAS-optimized Deep SVDD provides the best Pareto efficiency for lightweight OOD detection in ECG monitoring on microcontrollers, improving diagnostic classifier accuracy by up to 21.0 percentage points.

DetailsMotivation: Deploying deep learning models on resource-constrained microcontrollers for ECG monitoring faces reliability challenges from OOD pathologies and noise, with standard classifiers making high-confidence errors on such data.

Method: Performed Neural Architecture Search (NAS) on six UAD approaches (Deep SVDD, AE/VAE, MAD, NFs, DDPM) under strict hardware constraints (≤512k parameters) suitable for microcontrollers, evaluating on PTB-XL and BUT QDB datasets.

Result: NAS-optimized Deep SVDD offers superior Pareto efficiency between detection performance and model size. In simulated deployment, this lightweight filter improves diagnostic classifier accuracy by up to 21.0 percentage points.

Conclusion: Optimized UAD filters can effectively safeguard ECG analysis on wearables by providing lightweight upstream filtering for OOD detection on resource-constrained microcontrollers.

Abstract: Continuous electrocardiogram (ECG) monitoring via wearable devices is vital for early cardiovascular disease detection. However, deploying deep learning models on resource-constrained microcontrollers faces reliability challenges, particularly from Out-of-Distribution (OOD) pathologies and noise. Standard classifiers often yield high-confidence errors on such data. Existing OOD detection methods either neglect computational constraints or address noise and unseen classes separately. This paper investigates Unsupervised Anomaly Detection (UAD) as a lightweight, upstream filtering mechanism. We perform a Neural Architecture Search (NAS) on six UAD approaches, including Deep Support Vector Data Description (Deep SVDD), input reconstruction with (Variational-)Autoencoders (AE/VAE), Masked Anomaly Detection (MAD), Normalizing Flows (NFs) and Denoising Diffusion Probabilistic Models (DDPM) under strict hardware constraints ($\leq$512k parameters), suitable for microcontrollers. Evaluating on the PTB-XL and BUT QDB datasets, we demonstrate that a NAS-optimized Deep SVDD offers the superior Pareto efficiency between detection performance and model size. In a simulated deployment, this lightweight filter improves the accuracy of a diagnostic classifier by up to 21.0 percentage points, demonstrating that optimized UAD filters can safeguard ECG analysis on wearables.
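
For reference, the Deep SVDD objective that the architecture search optimizes over is essentially a one-line loss: pull embeddings of normal ECG windows toward a fixed center, then use the distance as the anomaly score for upstream filtering. The center is assumed precomputed (for example, the mean embedding after network initialization).

```python
import torch

def deep_svdd_loss(encoder, x, center):
    """One-class Deep SVDD: minimize the squared distance of embeddings
    of in-distribution windows to a fixed center c."""
    z = encoder(x)
    return ((z - center) ** 2).sum(dim=1).mean()

@torch.no_grad()
def anomaly_score(encoder, x, center):
    # High score -> reject the window as OOD before classification.
    return ((encoder(x) - center) ** 2).sum(dim=1)
```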

[329] Pre-Generating Multi-Difficulty PDE Data for Few-Shot Neural PDE Solvers

Naman Choudhary, Vedant Singh, Ameet Talwalkar, Nicholas Matthew Boffi, Mikhail Khodak, Tanya Marwah

Main category: cs.LG

TL;DR: Using easier PDE examples to pre-train neural solvers reduces compute costs for solving harder problems.

DetailsMotivation: Training neural PDE solvers requires expensive classical solver data generation, especially for complex problems. There's a chicken-and-egg problem: hard problems need more data but are more expensive to generate. The paper explores whether easier examples can help learn harder physics more efficiently.

Method: Systematically study difficulty transfer on 2D incompressible Navier-Stokes equations by varying geometry (obstacle number/placement) and physics (Reynolds number). Pre-generate many low/medium difficulty examples with classical solvers and include them in training to learn high-difficulty physics from fewer samples.

Result: Combining low and high difficulty data reduces compute by 8.9x to achieve same error as using only high difficulty examples. Pre-training on easier problems enables effective learning of harder physics with far fewer expensive high-difficulty samples.

Conclusion: How classical-solver compute is allocated across difficulty levels is as important as total compute. Principled curation of pre-generated PDE data offers substantial gains for neural solvers, similar to foundation model pre-training.

Abstract: A key aspect of learned partial differential equation (PDE) solvers is that the main cost often comes from generating training data with classical solvers rather than learning the model itself. Another is that there are clear axes of difficulty–e.g., more complex geometries and higher Reynolds numbers–along which problems become (1) harder for classical solvers and thus (2) more likely to benefit from neural speedups. Towards addressing this chicken-and-egg challenge, we study difficulty transfer on 2D incompressible Navier-Stokes, systematically varying task complexity along geometry (number and placement of obstacles), physics (Reynolds number), and their combination. Similar to how it is possible to spend compute to pre-train foundation models and improve their performance on downstream tasks, we find that by classically solving (analogously pre-generating) many low and medium difficulty examples and including them in the training set, it is possible to learn high-difficulty physics from far fewer samples. Furthermore, we show that by combining low and high difficulty data, we can spend 8.9x less compute on pre-generating a dataset to achieve the same error as using only high difficulty examples. Our results highlight that how we allocate classical-solver compute across difficulty levels is as important as how much we allocate overall, and suggest substantial gains from principled curation of pre-generated PDE data for neural solvers. Our code is available at https://github.com/Naman-Choudhary-AI-ML/pregenerating-pde

[330] Fourier Neural Operators Explained: A Practical Perspective

Valentin Duruisseaux, Jean Kossaifi, Anima Anandkumar

Main category: cs.LG

TL;DR: A comprehensive practice-oriented guide to Fourier Neural Operators (FNOs) that connects theoretical foundations with practical implementation, addressing common misunderstandings and providing modular implementations through the NeuralOperator library.

DetailsMotivation: FNOs have become influential for learning PDE solutions but practitioners often lack understanding of their theoretical foundations and implementation details, leading to incorrect or unreliable applications. There's a need for a clear guide that bridges theory and practice.

Method: Provides intuitive exposition of operator theory and signal-processing concepts underlying FNOs, details spectral parameterization and computational design of all components, and integrates with NeuralOperator 2.0.0 library for modular implementations.

Result: A comprehensive guide that unifies mathematical principles with implementation strategies, addresses common misunderstandings in literature, and offers state-of-the-art implementations that faithfully reflect the theory.

Conclusion: By connecting rigorous foundations with practical insight, this guide establishes a clear and reliable framework for effectively applying FNOs across diverse scientific and engineering fields.

Abstract: Partial differential equations (PDEs) govern a wide variety of dynamical processes in science and engineering, yet obtaining their numerical solutions often requires high-resolution discretizations and repeated evaluations of complex operators, leading to substantial computational costs. Neural operators have recently emerged as a powerful framework for learning mappings between function spaces directly from data, enabling efficient surrogate models for PDE systems. Among these architectures, the Fourier Neural Operator (FNO) has become the most influential and widely adopted due to its elegant spectral formulation, which captures global correlations through learnable transformations in Fourier space while remaining invariant to discretization and resolution. Despite their success, the practical use of FNOs is often hindered by an incomplete understanding among practitioners of their theoretical foundations, practical constraints, and implementation details, which can lead to their incorrect or unreliable application. This work presents a comprehensive and practice-oriented guide to FNOs, unifying their mathematical principles with implementation strategies. We provide an intuitive exposition to the concepts of operator theory and signal-processing that underlie the FNO, detail its spectral parameterization and the computational design of all its components, and address common misunderstandings encountered in the literature. The exposition is closely integrated with the NeuralOperator 2.0.0 library, offering modular state-of-the-art implementations that faithfully reflect the theory. By connecting rigorous foundations with practical insight, this guide aims to establish a clear and reliable framework for applying FNOs effectively across diverse scientific and engineering fields.
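
The spectral layer at the heart of the FNO is compact enough to sketch directly. For brevity the version below keeps only the positive-frequency corner of the spectrum; full implementations (for example, in the NeuralOperator library) also retain the conjugate corner with a second weight tensor.

```python
import torch
import torch.nn as nn

class SpectralConv2d(nn.Module):
    """FNO spectral convolution (simplified): FFT, truncate to the lowest
    modes, multiply by learnable complex weights, inverse FFT. Truncating
    in frequency rather than pixels is what makes the layer
    discretization- and resolution-invariant."""

    def __init__(self, in_ch, out_ch, modes1, modes2):
        super().__init__()
        scale = 1.0 / (in_ch * out_ch)
        self.w = nn.Parameter(
            scale * torch.randn(in_ch, out_ch, modes1, modes2, dtype=torch.cfloat))
        self.modes1, self.modes2 = modes1, modes2

    def forward(self, x):                          # x: (B, C, H, W), real
        B, _, H, W = x.shape
        x_ft = torch.fft.rfft2(x)                  # (B, C, H, W//2 + 1)
        out_ft = torch.zeros(B, self.w.shape[1], H, W // 2 + 1,
                             dtype=torch.cfloat, device=x.device)
        m1, m2 = self.modes1, self.modes2
        out_ft[:, :, :m1, :m2] = torch.einsum(
            "bixy,ioxy->boxy", x_ft[:, :, :m1, :m2], self.w)
        return torch.fft.irfft2(out_ft, s=(H, W))  # back to physical space
```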

[331] LUMOS: Large User MOdels for User Behavior Prediction

Dhruv Nigam, Naman Agarwal, Krishna Murthy, Susmit Saha

Main category: cs.LG

TL;DR: LUMOS is a transformer-based architecture that learns multiple user behavior prediction tasks jointly using raw activity data, eliminating task-specific models and manual feature engineering through cross-attention on future events and multi-modal tokenization.

DetailsMotivation: Traditional user behavior prediction approaches rely on task-specific models and domain-specific feature engineering, which is time-consuming, computationally expensive, requires domain expertise, and doesn't scale well for large B2C platforms.

Method: LUMOS uses a transformer-based architecture with: 1) A novel cross-attention mechanism that conditions predictions on future known events (holidays, sales), 2) Multi-modal tokenization combining user activities, event context, and static user demographic attributes through specialized embedding pathways, 3) Joint learning of multiple tasks using only raw user activity data.

Result: On a production dataset of 1.7 trillion user activity tokens from 250M users, LUMOS achieved: average 0.025 improvement in ROC-AUC for binary classification tasks, 4.6% reduction in MAPE for regression tasks across 5 tasks, and 3.15% increase in Daily Active Users in online A/B testing.

Conclusion: LUMOS provides a scalable solution for user behavior prediction that eliminates task-specific modeling and manual feature engineering, demonstrating superior performance and measurable business impact through its transformer-based architecture with cross-attention on future events.

Abstract: User behavior prediction at scale remains a critical challenge for online B2C platforms. Traditional approaches rely heavily on task-specific models and domain-specific feature engineering. This is time-consuming, computationally expensive, and requires domain expertise and therefore, not scalable. We present LUMOS (Large User MOdel Series), a transformer-based architecture that eliminates task-specific models and manual feature engineering by learning multiple tasks jointly using only raw user activity data. LUMOS introduces a novel cross-attention mechanism that conditions predictions on future known events (e.g., holidays, sales, etc.), enabling the model to predict complex behavior patterns like “how will upcoming holidays affect user engagement?” The architecture also employs multi-modal tokenization, combining user activities, event context, and static user demographic attributes into rich representations processed through specialized embedding pathways. Through extensive experiments on a production dataset spanning 1.7 trillion user activity tokens from 250 million users, we demonstrate that LUMOS achieves superior performance compared to traditional task-specific models. Across 5 tasks with established baselines, we achieve an average improvement of 0.025 in ROC-AUC for binary classification tasks and 4.6% reduction in MAPE for regression tasks. Online A/B testing validates these improvements translate to measurable business impact with a 3.15% increase in Daily Active Users.

[332] UACER: An Uncertainty-Adaptive Critic Ensemble Framework for Robust Adversarial Reinforcement Learning

Jiaxi Wu, Tiantian Zhang, Yuxing Wang, Yongzhe Chang, Xueqian Wang

Main category: cs.LG

TL;DR: UACER introduces an uncertainty-adaptive critic ensemble with time-varying decay mechanism to stabilize robust adversarial reinforcement learning against non-stationary adversaries.

DetailsMotivation: Robust adversarial RL faces training instability due to non-stationary learning dynamics when adversaries are trainable, especially in high-dimensional complex environments like autonomous driving and robotics.

Method: Two components: 1) Diversified critic ensemble with K parallel critic networks for stable Q-value estimation, and 2) Time-varying Decay Uncertainty mechanism using variance-derived Q-value aggregation with epistemic uncertainty to adaptively regulate exploration-exploitation trade-off.

Result: UACER outperforms state-of-the-art methods in challenging MuJoCo control problems, demonstrating superior performance, stability, and efficiency.

Conclusion: The proposed uncertainty-adaptive critic ensemble effectively addresses training instability in robust adversarial RL, providing a promising solution for sequential decision-making in uncertain environments.

Abstract: Robust adversarial reinforcement learning has emerged as an effective paradigm for training agents to handle uncertain disturbance in real environments, with critical applications in sequential decision-making domains such as autonomous driving and robotic control. Within this paradigm, agent training is typically formulated as a zero-sum Markov game between a protagonist and an adversary to enhance policy robustness. However, the trainable nature of the adversary inevitably induces non-stationarity in the learning dynamics, leading to exacerbated training instability and convergence difficulties, particularly in high-dimensional complex environments. In this paper, we propose a novel approach, Uncertainty-Adaptive Critic Ensemble for robust adversarial Reinforcement learning (UACER), which consists of two components: 1) Diversified critic ensemble: A diverse set of K critic networks is employed in parallel to stabilize Q-value estimation in robust adversarial reinforcement learning, reducing variance and enhancing robustness compared to conventional single-critic designs. 2) Time-varying Decay Uncertainty (TDU) mechanism: Moving beyond simple linear combinations, we propose a variance-derived Q-value aggregation strategy that explicitly incorporates epistemic uncertainty to adaptively regulate the exploration-exploitation trade-off while stabilizing the training process. Comprehensive experiments across several challenging MuJoCo control problems validate the superior effectiveness of UACER, outperforming state-of-the-art methods in terms of overall performance, stability, and efficiency.
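
A minimal sketch of variance-derived aggregation with a time-varying decay appears below; the linear schedule and mean-minus-std form are assumptions, and UACER's exact TDU schedule and aggregation may differ.

```python
import torch

def aggregate_q(critics, state, action, step, total_steps, beta0=1.0):
    """Ensemble Q aggregation penalized by epistemic uncertainty:
    Q = mean_K - beta(t) * std_K, with beta decaying over training so
    uncertainty is penalized early and exploitation dominates later."""
    qs = torch.stack([q(state, action) for q in critics])  # (K, B)
    beta_t = beta0 * (1.0 - step / total_steps)            # assumed schedule
    return qs.mean(dim=0) - beta_t * qs.std(dim=0)
```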

[333] Controlled LLM Training on Spectral Sphere

Tian Xie, Haoming Luo, Haoyu Tang, Yiwen Hu, Jason Klein Liu, Qingnan Ren, Yang Wang, Wayne Xin Zhao, Rui Yan, Bing Su, Chong Luo, Baining Guo

Main category: cs.LG

TL;DR: SSO is a new optimizer that enforces strict spectral constraints on both weights and updates, achieving full μP alignment and outperforming AdamW/Muon in large-scale pretraining with improved stability.

DetailsMotivation: Existing optimizers like Muon are only "half-aligned" with μP constraints - they control updates but allow weights to drift, limiting stability and convergence in large model training.

Method: Introduces Spectral Sphere Optimizer (SSO) that enforces strict module-wise spectral constraints on both weights and their updates by deriving steepest descent direction on the spectral sphere, implemented as efficient parallel algorithm in Megatron.

Result: SSO consistently outperforms AdamW and Muon in pretraining diverse architectures (Dense 1.7B, MoE 8B-A1B, 200-layer DeepNet), with significant stability benefits including improved MoE router load balancing, suppressed outliers, and strictly bounded activations.

Conclusion: SSO provides a fully μP-aligned optimization framework that enables more stable and efficient large-scale model training with practical benefits for complex architectures.

Abstract: Scaling large models requires optimization strategies that ensure rapid convergence grounded in stability. Maximal Update Parametrization ($\boldsymbol{\mu}$P) provides a theoretical safeguard for width-invariant $\Theta(1)$ activation control, whereas emerging optimizers like Muon are only "half-aligned" with these constraints: they control updates but allow weights to drift. To address this limitation, we introduce the \textbf{Spectral Sphere Optimizer (SSO)}, which enforces strict module-wise spectral constraints on both weights and their updates. By deriving the steepest descent direction on the spectral sphere, SSO realizes a fully $\boldsymbol{\mu}$P-aligned optimization process. To enable large-scale training, we implement SSO as an efficient parallel algorithm within Megatron. Through extensive pretraining on diverse architectures, including Dense 1.7B, MoE 8B-A1B, and 200-layer DeepNet models, SSO consistently outperforms AdamW and Muon. Furthermore, we observe significant practical stability benefits, including improved MoE router load balancing, suppressed outliers, and strictly bounded activations.
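As a rough picture of the weight-side constraint ("weights stay on the spectral sphere"), the toy below retracts a weight matrix after an update so its top singular value stays fixed. The dense SVD-based norm is purely illustrative; it shows only the constraint, not the paper's steepest-descent derivation or its Megatron kernels.

```python
import numpy as np

def retract_to_spectral_sphere(W, radius=1.0):
    """Rescale W so its spectral norm (largest singular value) equals radius.
    A naive stand-in for the module-wise constraint SSO is said to enforce."""
    return W * (radius / np.linalg.norm(W, ord=2))

rng = np.random.default_rng(0)
W = retract_to_spectral_sphere(rng.normal(size=(64, 64)))
W = retract_to_spectral_sphere(W - 0.01 * rng.normal(size=W.shape))  # update, then retract
print(np.linalg.norm(W, ord=2))  # ~1.0: weights are pulled back onto the sphere each step
```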

[334] Pace: Physics-Aware Attentive Temporal Convolutional Network for Battery Health Estimation

Sara Sameer, Wei Zhang, Dhivya Dharshini Kannan, Xin Lou, Yulin Gao, Terence Goh, Qingyu Yan

Main category: cs.LG

TL;DR: Pace: A physics-aware attentive temporal convolutional network for battery health estimation that integrates sensor data with battery physics features, achieving significant performance improvements over existing models.

DetailsMotivation: Batteries are critical for modern energy systems (EVs, grid storage), and effective battery health management is essential for safety, cost-efficiency, and sustainability. Current methods need improvement for accurate battery health estimation across various usage conditions.

Method: Pace integrates raw sensor measurements with battery physics features from equivalent circuit models. It uses three specialized modules: dilated temporal blocks for efficient temporal encoding, chunked attention blocks for context modeling, and a dual-head output block for fusing short- and long-term degradation patterns.

Result: On a large public dataset, Pace significantly outperforms existing models with average performance improvements of 6.5x and 2.0x compared to two best-performing baselines. The model was successfully deployed in real-time on a Raspberry Pi, demonstrating practical viability.

Conclusion: Pace establishes itself as a practical and high-performance solution for battery health analytics, combining physics awareness with deep learning to achieve accurate battery health estimation across various usage conditions with real-time deployment capability.

Abstract: Batteries are critical components in modern energy systems such as electric vehicles and power grid energy storage. Effective battery health management is essential for battery system safety, cost-efficiency, and sustainability. In this paper, we propose Pace, a physics-aware attentive temporal convolutional network for battery health estimation. Pace integrates raw sensor measurements with battery physics features derived from the equivalent circuit model. We develop three battery-specific modules, including dilated temporal blocks for efficient temporal encoding, chunked attention blocks for context modeling, and a dual-head output block for fusing short- and long-term battery degradation patterns. Together, the modules enable Pace to predict battery health accurately and efficiently in various battery usage conditions. On a large public dataset, Pace substantially outperforms existing models, achieving average performance improvements of 6.5x and 2.0x over the two best-performing baseline models. We further demonstrate its practical viability with a real-time edge deployment on a Raspberry Pi. These results establish Pace as a practical and high-performance solution for battery health analytics.
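The dilated temporal blocks can be pictured as a standard exponentially-dilated 1-D convolution stack; channel counts, depth, and the residual connection below are illustrative guesses, not Pace's actual configuration.

```python
import torch
import torch.nn as nn

class DilatedTemporalBlock(nn.Module):
    """Sketch of a dilated temporal block: stacked 1-D convolutions with
    exponentially growing dilation give a wide receptive field over battery
    cycling sequences at low cost. Layer sizes are illustrative only."""
    def __init__(self, channels=32, kernel_size=3, levels=4):
        super().__init__()
        layers = []
        for i in range(levels):
            d = 2 ** i  # dilation doubles each level
            layers += [nn.Conv1d(channels, channels, kernel_size,
                                 padding=d * (kernel_size - 1) // 2, dilation=d),
                       nn.ReLU()]
        self.net = nn.Sequential(*layers)

    def forward(self, x):           # x: (batch, channels, time)
        return self.net(x) + x      # residual connection (assumed)

x = torch.randn(8, 32, 200)        # e.g., 200 time steps of sensor + ECM features
print(DilatedTemporalBlock()(x).shape)
```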

[335] Combating Spurious Correlations in Graph Interpretability via Self-Reflection

Kecheng Cai, Chenyang Xu, Chao Peng, Jiafu Huang, Qiyuan Liang, Irene Zheng

Main category: cs.LG

TL;DR: The paper proposes a self-reflection framework to improve interpretability on challenging Spurious-Motif datasets by iteratively feeding importance scores back into existing graph learning methods, similar to LLM self-reflection techniques.

DetailsMotivation: Existing interpretable graph learning methods struggle with Spurious-Motif datasets that contain deliberate spurious correlations, showing significantly worse performance compared to other benchmarks. The authors aim to enhance interpretability on these challenging datasets.

Method: A self-reflection framework that integrates with existing interpretable graph learning methods. When a method produces importance scores for nodes/edges, the framework feeds these predictions back into the original method for a second round of evaluation. The authors also propose a fine-tuning training method based on this feedback mechanism.

Result: The self-reflection technique, commonly used in large language models, can be effectively adapted to enhance interpretability in datasets with strong spurious correlations. The iterative feedback process helps models better distinguish relevant structures from misleading patterns.

Conclusion: Self-reflection techniques from LLMs can be successfully applied to improve interpretable graph learning on challenging datasets with spurious correlations, leading to better performance on the difficult Spurious-Motif benchmark.

Abstract: Interpretable graph learning has recently emerged as a popular research topic in machine learning. The goal is to identify the important nodes and edges of an input graph that are crucial for performing a specific graph reasoning task. A number of studies have been conducted in this area, and various benchmark datasets have been proposed to facilitate evaluation. Among them, one of the most challenging is the Spurious-Motif benchmark, introduced at ICLR 2022. The datasets in this synthetic benchmark are deliberately designed to include spurious correlations, making it particularly difficult for models to distinguish truly relevant structures from misleading patterns. As a result, existing methods exhibit significantly worse performance on this benchmark compared to others. In this paper, we focus on improving interpretability on the challenging Spurious-Motif datasets. We demonstrate that the self-reflection technique, commonly used in large language models to tackle complex tasks, can also be effectively adapted to enhance interpretability in datasets with strong spurious correlations. Specifically, we propose a self-reflection framework that can be integrated with existing interpretable graph learning methods. When such a method produces importance scores for each node and edge, our framework feeds these predictions back into the original method to perform a second round of evaluation. This iterative process mirrors how large language models employ self-reflective prompting to reassess their previous outputs. We further analyze the reasons behind this improvement from the perspective of graph representation learning, which motivates us to propose a fine-tuning training method based on this feedback mechanism.
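The feedback loop itself is framework-agnostic and fits in a few lines; `explain_fn` below is a toy stand-in for any interpretable graph learner, not the paper's API.

```python
def self_reflect(explain_fn, graph, rounds=2):
    """Generic self-reflection wrapper: feed an explainer's own importance
    scores back in for a second round of evaluation."""
    scores = None
    for _ in range(rounds):
        scores = explain_fn(graph, scores)   # round 2+ sees the previous scores
    return scores

# Toy explainer over a list of edges: blends its base scores with the prior round
def toy_explainer(graph, prior):
    base = {e: 1.0 / (1 + i) for i, e in enumerate(graph)}
    if prior is None:
        return base
    return {e: 0.5 * base[e] + 0.5 * prior[e] for e in graph}

print(self_reflect(toy_explainer, [("a", "b"), ("b", "c"), ("c", "d")]))
```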

[336] Multi-scale Graph Autoregressive Modeling: Molecular Property Prediction via Next Token Prediction

Zhuoyang Jiang, Yaosen Min, Peiran Jin, Lei Chen

Main category: cs.LG

TL;DR: CamS is a graph-to-sequence representation that enables decoder-only Transformers to learn molecular graphs via next-token prediction, bridging the gap between SMILES-based and graph-native approaches for molecular property prediction.

DetailsMotivation: SMILES-based next-token prediction scales well but lacks explicit molecular topology, while graph-native masked modeling captures connectivity but risks disrupting crucial chemical details like activity cliffs. There's a need to combine the strengths of both approaches.

Method: CamS serializes molecular graphs into structure-rich causal sequences by: 1) mining data-driven connection-aware motifs, 2) serializing motifs via scaffold-rooted breadth-first search (BFS) to establish core-to-periphery order, and 3) enabling hierarchical modeling by concatenating sequences from fine to coarse motif scales.

Result: CamS-LLaMA (pre-trained vanilla LLaMA backbone on CamS sequences) achieves state-of-the-art performance on MoleculeNet and the activity-cliff benchmark MoleculeACE, outperforming both SMILES-based language models and strong graph baselines.

Conclusion: CamS effectively bridges the gap between sequence and graph approaches for molecular representation, enabling decoder-only Transformers to learn molecular graphs while preserving crucial chemical details. The multi-scale causal serialization drives attention toward important structural differences, particularly for activity cliffs.

Abstract: We present Connection-Aware Motif Sequencing (CamS), a graph-to-sequence representation that enables decoder-only Transformers to learn molecular graphs via standard next-token prediction (NTP). For molecular property prediction, SMILES-based NTP scales well but lacks explicit topology, whereas graph-native masked modeling captures connectivity but risks disrupting the pivotal chemical details (e.g., activity cliffs). CamS bridges this gap by serializing molecular graphs into structure-rich causal sequences. CamS first mines data-driven connection-aware motifs. It then serializes motifs via scaffold-rooted breadth-first search (BFS) to establish a stable core-to-periphery order. Crucially, CamS enables hierarchical modeling by concatenating sequences from fine to coarse motif scales, allowing the model to condition global scaffolds on dense, uncorrupted local structural evidence. We instantiate CamS-LLaMA by pre-training a vanilla LLaMA backbone on CamS sequences. It achieves state-of-the-art performance on MoleculeNet and the activity-cliff benchmark MoleculeACE, outperforming both SMILES-based language models and strong graph baselines. Interpretability analysis confirms that our multi-scale causal serialization effectively drives attention toward cliff-determining differences.
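A minimal sketch of the serialization order: breadth-first traversal from the scaffold yields a core-to-periphery motif sequence, and scales are concatenated fine-to-coarse. The motif names, separator token, and tie-breaking rule below are all assumptions.

```python
from collections import deque

def scaffold_bfs_sequence(adj, root):
    """Serialize a motif graph core-to-periphery via BFS from the scaffold
    root, as CamS is described as doing; tokens here are toy stand-ins."""
    order, seen, queue = [], {root}, deque([root])
    while queue:
        node = queue.popleft()
        order.append(node)
        for nb in sorted(adj.get(node, [])):   # deterministic tie-break (assumed)
            if nb not in seen:
                seen.add(nb)
                queue.append(nb)
    return order

# Ring scaffold with two substituent motifs hanging off it
adj = {"ring": ["OH", "CH3"], "OH": ["ring"], "CH3": ["ring"]}
fine = scaffold_bfs_sequence(adj, "ring")          # fine motif scale
coarse = ["ring+OH+CH3"]                           # coarse scale (one merged motif)
print(fine + ["<SEP>"] + coarse)                   # fine-to-coarse concatenation
```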

[337] GlueNN: gluing patchwise analytic solutions with neural networks

Doyoung Kim, Donghee Lee, Hye-Sung Lee, Jiheon Lee, Jaeok Yi

Main category: cs.LG

TL;DR: GlueNN is a physics-informed learning framework that decomposes solutions into interpretable patchwise analytic components using learnable coefficient functions from local asymptotic expansions, enabling regime transition detection and physical parameter extraction.

DetailsMotivation: Standard numerical solvers and PINNs operate as black boxes that output solution fields without disentangling interpretable constituent parts. The objective extends beyond computing numerical solutions to capturing regime crossovers and extracting meaningful physical parameters.

Method: GlueNN decomposes global solutions into patchwise analytic components by promoting integration constants of local asymptotic expansions to learnable, scale-dependent coefficient functions. These coefficients are constrained by the differential equation, enabling smooth interpolation between asymptotic limits without ad hoc boundary matching.

Result: The coefficient-centric approach reproduces accurate global solutions in various examples and directly extracts physical information not explicitly available through standard numerical integration.

Conclusion: GlueNN provides an interpretable physics-informed learning framework that captures regime transitions and extracts meaningful physical parameters by decomposing solutions into analytic components, overcoming limitations of black-box numerical methods.

Abstract: In the analysis of complex physical systems, the objective often extends beyond merely computing a numerical solution to capturing the precise crossover between different regimes and extracting parameters containing meaningful information. However, standard numerical solvers and conventional deep learning approaches, such as Physics-Informed Neural Networks (PINNs), typically operate as black boxes that output solution fields without disentangling the solution into its interpretable constituent parts. In this work, we propose GlueNN, a physics-informed learning framework that decomposes the global solution into interpretable, patchwise analytic components. Rather than approximating the solution directly, GlueNN promotes the integration constants of local asymptotic expansions to learnable, scale-dependent coefficient functions. By constraining these coefficients with the differential equation, the network effectively performs regime transition, smoothly interpolating between asymptotic limits without requiring ad hoc boundary matching. We demonstrate that this coefficient-centric approach reproduces accurate global solutions in various examples and thus directly extracts physical information that is not explicitly available through standard numerical integration.

[338] Q-learning with Adjoint Matching

Qiyang Li, Sergey Levine

Main category: cs.LG

TL;DR: QAM is a novel RL algorithm that efficiently optimizes expressive diffusion/flow-matching policies using adjoint matching to avoid unstable backpropagation while leveraging critic gradients.

DetailsMotivation: There's a long-standing challenge in continuous-action RL: efficiently optimizing expressive diffusion or flow-matching policies with respect to parameterized Q-functions. Direct gradient-based optimization through multi-step denoising processes is numerically unstable, forcing existing methods to either discard gradient information or use approximations that sacrifice policy expressivity or introduce bias.

Method: QAM uses adjoint matching, a technique from generative modeling, to transform the critic’s action gradient into a step-wise objective function that avoids unstable backpropagation. This allows the algorithm to leverage first-order information from the critic while maintaining an unbiased, expressive policy at the optimum. It combines this with temporal-difference backup for critic learning.

Result: QAM consistently outperforms prior approaches on hard, sparse reward tasks in both offline and offline-to-online reinforcement learning settings.

Conclusion: QAM successfully addresses the challenge of optimizing expressive diffusion/flow-matching policies in continuous-action RL by using adjoint matching to avoid unstable backpropagation while effectively leveraging critic gradient information, leading to superior performance on challenging tasks.

Abstract: We propose Q-learning with Adjoint Matching (QAM), a novel TD-based reinforcement learning (RL) algorithm that tackles a long-standing challenge in continuous-action RL: efficient optimization of an expressive diffusion or flow-matching policy with respect to a parameterized Q-function. Effective optimization requires exploiting the first-order information of the critic, but it is challenging to do so for flow or diffusion policies because direct gradient-based optimization via backpropagation through their multi-step denoising process is numerically unstable. Existing methods work around this either by only using the value and discarding the gradient information, or by relying on approximations that sacrifice policy expressivity or bias the learned policy. QAM sidesteps both of these challenges by leveraging adjoint matching, a recently proposed technique in generative modeling, which transforms the critic’s action gradient to form a step-wise objective function that is free from unstable backpropagation, while providing an unbiased, expressive policy at the optimum. Combined with temporal-difference backup for critic learning, QAM consistently outperforms prior approaches on hard, sparse reward tasks in both offline and offline-to-online RL.

[339] Optimising for Energy Efficiency and Performance in Machine Learning

Emile Dos Santos Ferreira, Andrei Paleyes, Neil D. Lawrence

Main category: cs.LG

TL;DR: ECOpt is a hyperparameter tuner that optimizes ML models for both energy efficiency and performance, creating Pareto frontiers to help practitioners balance accuracy with environmental impact.

DetailsMotivation: Growing energy consumption of ML models, lack of understanding about energy scaling laws, focus on training costs ignoring inference costs, and absence of actionable tools for measuring and optimizing energy efficiency.

Method: Developed ECOpt - an energy-aware hyperparameter tuner that quantifies trade-offs between energy efficiency and model performance as interpretable Pareto frontiers, enabling informed decisions about energy costs and environmental impact.

Result: Parameter and FLOP counts are unreliable proxies for energy consumption; Transformer energy efficiency is consistent across hardware; ECOpt can have net positive environmental impact; discovered 7 CIFAR-10 models that improve state-of-the-art when considering both accuracy and energy efficiency.

Conclusion: Energy metrics should be measured and published for ML models; ECOpt provides practical solution for optimizing energy efficiency while maintaining performance, helping practitioners comply with regulations and reduce environmental impact.

Abstract: The ubiquity of machine learning (ML) and the demand for ever-larger models bring an increase in energy consumption and environmental impact. However, little is known about the energy scaling laws in ML, and existing research focuses on training cost – ignoring the larger cost of inference. Furthermore, tools for measuring the energy consumption of ML do not provide actionable feedback. To address these gaps, we developed Energy Consumption Optimiser (ECOpt): a hyperparameter tuner that optimises for energy efficiency and model performance. ECOpt quantifies the trade-off between these metrics as an interpretable Pareto frontier. This enables ML practitioners to make informed decisions about energy cost and environmental impact, while maximising the benefit of their models and complying with new regulations. Using ECOpt, we show that parameter and floating-point operation counts can be unreliable proxies for energy consumption, and observe that the energy efficiency of Transformer models for text generation is relatively consistent across hardware. These findings motivate measuring and publishing the energy metrics of ML models. We further show that ECOpt can have a net positive environmental impact and use it to uncover seven models for CIFAR-10 that improve upon the state of the art, when considering accuracy and energy efficiency together.
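The reported Pareto frontier over (energy, accuracy) configurations can be computed with a simple dominance sweep; this is a generic sketch, not ECOpt's implementation.

```python
def pareto_frontier(points):
    """Return the non-dominated (energy, accuracy) configurations, where
    lower energy and higher accuracy are both preferred."""
    frontier = []
    for e, a in sorted(points):                 # sort by energy, ascending
        if not frontier or a > frontier[-1][1]: # keep only accuracy improvements
            frontier.append((e, a))
    return frontier

runs = [(5.0, 0.91), (3.0, 0.90), (3.5, 0.88), (8.0, 0.92), (2.0, 0.85)]
print(pareto_frontier(runs))  # [(2.0, 0.85), (3.0, 0.9), (5.0, 0.91), (8.0, 0.92)]
```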

[340] Energy-Entropy Regularization: The True Power of Minimal Looped Transformers

Wai-Lun Lam

Main category: cs.LG

TL;DR: Proposes a novel training framework using Tsallis entropy and Hamiltonian dynamics to train single-head looped Transformers, successfully solving induction head tasks with 1000-token sequences.

DetailsMotivation: Looped Transformers show superior reasoning but current training approaches often fail due to highly non-convex and irregular loss landscapes, causing optimization to get stuck in poor local minima and saddle points.

Method: Uses Tsallis entropy and Hamiltonian dynamics to transform the geometry of the loss landscape, treating parameter updates as a physical flow to enable effective training of single-head looped Transformers.

Result: Successfully trained a single-head looped Transformer with model dimension d=8 to solve induction head tasks with input sequence length of 1000 tokens, revealing internal mechanisms behind superior reasoning capability.

Conclusion: The proposed training framework overcomes optimization challenges in looped Transformers, enabling successful training and providing insights into their reasoning mechanisms.

Abstract: Recent research suggests that looped Transformers have superior reasoning capabilities compared to standard deep architectures. Current approaches to training single-head looped architectures on benchmark tasks frequently fail or yield suboptimal performance due to a highly non-convex and irregular loss landscape. In these settings, optimization often stagnates in poor local minima and saddle points of the loss landscape, preventing the model from discovering the global minimum point. The internal mechanisms of these single-head looped transformer models remain poorly understood, and training them from scratch remains a significant challenge. In this paper, we propose a novel training framework that leverages Tsallis entropy and Hamiltonian dynamics to transform the geometry of the loss landscape. By treating the parameter updates as a physical flow, we successfully trained a single-head looped Transformer with model dimension $d = 8$ to solve the induction head task with an input sequence length of 1000 tokens. This success reveals the internal mechanism behind its superior reasoning capability.
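For reference, the Tsallis entropy the framework builds on is $S_q = (1 - \sum_i p_i^q)/(q - 1)$, which recovers Shannon entropy as $q \to 1$. The snippet below only computes the quantity; how it reshapes the loss landscape is the paper's contribution.

```python
import numpy as np

def tsallis_entropy(p, q=1.5):
    """Tsallis entropy S_q = (1 - sum_i p_i^q) / (q - 1)."""
    p = np.asarray(p, dtype=float)
    return (1.0 - np.sum(p ** q)) / (q - 1.0)

uniform = np.full(8, 1 / 8)                 # matches the paper's model dimension d = 8
peaked = np.array([0.93] + [0.01] * 7)
print(tsallis_entropy(uniform), tsallis_entropy(peaked))  # high vs. low entropy
```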

[341] Self-Augmented Mixture-of-Experts for QoS Prediction

Kecheng Cai, Chao Peng, Chenyang Xu, Xia Chen, Yi Wang, Shuo Shi, Qiyuan Liang

Main category: cs.LG

TL;DR: Proposes a self-augmented mixture-of-experts model for QoS prediction that uses iterative refinement through partial masking and inter-expert communication to address data sparsity.

DetailsMotivation: QoS prediction is fundamental for service computing and personalized recommendation, but suffers from inherent sparsity of user-service interactions where only a small subset of feedback values is observed.

Method: Self-augmented strategy that leverages model’s own predictions for iterative refinement by partially masking predicted values and feeding them back into the model. Specifically, a self-augmented mixture-of-experts model where multiple expert networks iteratively and collaboratively estimate QoS values, enabling inter-expert communication.

Result: Experiments on benchmark datasets show the method outperforms existing baselines and achieves competitive results.

Conclusion: The proposed self-augmented mixture-of-experts approach effectively addresses QoS prediction challenges by leveraging iterative refinement and inter-expert communication to handle sparse user-service interaction data.

Abstract: Quality of Service (QoS) prediction is one of the most fundamental problems in service computing and personalized recommendation. In the problem, there is a set of users and services, each associated with a set of descriptive features. Interactions between users and services produce feedback values, typically represented as numerical QoS metrics such as response time or availability. Given the observed feedback for a subset of user-service pairs, the goal is to predict the QoS values for the remaining pairs. A key challenge in QoS prediction is the inherent sparsity of user-service interactions, as only a small subset of feedback values is typically observed. To address this, we propose a self-augmented strategy that leverages a model’s own predictions for iterative refinement. In particular, we partially mask the predicted values and feed them back into the model to predict again. Building on this idea, we design a self-augmented mixture-of-experts model, where multiple expert networks iteratively and collaboratively estimate QoS values. We find that the iterative augmentation process naturally aligns with the MoE architecture by enabling inter-expert communication: in the second round, each expert receives the first-round predictions and refines its output accordingly. Experiments on benchmark datasets show that our method outperforms existing baselines and achieves competitive results.
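A schematic of the predict-mask-refeed loop, with a trivial row-mean predictor standing in for the mixture-of-experts model; the mask ratio and round count are arbitrary choices.

```python
import numpy as np

def self_augment(predict_fn, observed, mask_ratio=0.3, rounds=2, seed=0):
    """Iterative self-augmentation: predict the full QoS matrix, randomly
    mask part of the predictions, and feed the rest back as extra evidence.
    `predict_fn(matrix)` stands in for the MoE model; shapes are toy."""
    rng = np.random.default_rng(seed)
    current = observed.copy()                       # NaN marks unobserved pairs
    for _ in range(rounds):
        pred = predict_fn(current)
        keep = rng.random(pred.shape) > mask_ratio  # drop ~30% of the predictions
        current = np.where(np.isnan(observed) & keep, pred, observed)
    return pred

# Toy predictor: fill missing entries with each user's observed row mean
def row_mean_fill(m):
    return np.where(np.isnan(m), np.nanmean(m, axis=1, keepdims=True), m)

obs = np.array([[0.2, np.nan, 0.4], [np.nan, 0.9, np.nan]])
print(self_augment(row_mean_fill, obs))
```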

[342] Extractive summarization on a CMOS Ising machine

Ziqing Zeng, Abhimanyu Kumar, Ahmet Efe, Ruihong Yin, Chris H. Kim, Ulya R. Karpuzcu, Sachin S. Sapatnekar

Main category: cs.LG

TL;DR: This paper proposes implementing extractive summarization on a low-power CMOS coupled oscillator-based Ising machine (COBI) for energy-efficient, real-time inference on edge devices.

DetailsMotivation: Current extractive summarization systems rely on energy-intensive CPU/GPU infrastructures that are unsuitable for resource-constrained environments. There's a need for low-power, real-time inference solutions for edge devices.

Method: The authors develop: 1) a hardware-aware Ising formulation that reduces scale imbalance between local fields and coupling terms for better quantization robustness; 2) a complete ES pipeline with stochastic rounding and iterative refinement for precision compensation; 3) a decomposition strategy to partition large ES problems into smaller Ising subproblems solvable on COBI.

Result: On CNN/DailyMail dataset, the COBI-based pipeline achieves 3-4.5x runtime speedup vs brute-force (comparable to Tabu search), 2-3 orders of magnitude energy reduction, while maintaining competitive summary quality using only integer-coupled Ising hardware with limited precision.

Conclusion: CMOS Ising solvers like COBI show strong potential for deploying real-time, low-energy text summarization on edge devices, offering significant energy savings while maintaining quality.

Abstract: Extractive summarization (ES) aims to generate a concise summary by selecting a subset of sentences from a document while maximizing relevance and minimizing redundancy. Although modern ES systems achieve high accuracy using powerful neural models, their deployment typically relies on CPU or GPU infrastructures that are energy-intensive and poorly suited for real-time inference in resource-constrained environments. In this work, we explore the feasibility of implementing McDonald-style extractive summarization on a low-power CMOS coupled oscillator-based Ising machine (COBI) that supports integer-valued, all-to-all spin couplings. We first propose a hardware-aware Ising formulation that reduces the scale imbalance between local fields and coupling terms, thereby improving robustness to coefficient quantization: this method can be applied to any problem formulation that requires k of n variables to be chosen. We then develop a complete ES pipeline including (i) stochastic rounding and iterative refinement to compensate for precision loss, and (ii) a decomposition strategy that partitions a large ES problem into smaller Ising subproblems that can be efficiently solved on COBI and later combined. Experimental results on the CNN/DailyMail dataset show that our pipeline can produce high-quality summaries using only integer-coupled Ising hardware with limited precision. COBI achieves 3-4.5x runtime speedups compared to a brute-force method, which is comparable to software Tabu search, and two to three orders of magnitude reductions in energy, while maintaining competitive summary quality. These results highlight the potential of deploying CMOS Ising solvers for real-time, low-energy text summarization on edge devices.
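The k-of-n selection mentioned in the abstract maps onto an Ising/QUBO objective by adding the penalty $(\sum_i x_i - k)^2$. The sketch below builds such a matrix for a three-sentence toy document and checks it by brute force; coefficients would still need the paper's quantization treatment before running on COBI.

```python
import numpy as np

def summarization_qubo(relevance, redundancy, k, penalty=2.0):
    """QUBO for McDonald-style selection: reward relevant sentences, penalize
    pairwise redundancy, and enforce "pick exactly k of n" by expanding the
    penalty (sum_i x_i - k)^2 into linear and quadratic coefficients."""
    n = len(relevance)
    Q = penalty * np.ones((n, n)) + np.asarray(redundancy, dtype=float)
    np.fill_diagonal(Q, penalty * (1 - 2 * k) - np.asarray(relevance, dtype=float))
    return Q  # binary energy: E(x) = x^T Q x (constant k^2 term dropped)

rel = np.array([0.9, 0.7, 0.4])
red = np.array([[0, 0.6, 0.1], [0.6, 0, 0.2], [0.1, 0.2, 0]], dtype=float)
Q = summarization_qubo(rel, red, k=2)
best, best_e = None, float("inf")
for b in range(2 ** 3):                          # brute force over all 8 selections
    x = np.array([(b >> i) & 1 for i in range(3)])
    e = x @ Q @ x
    if e < best_e:
        best, best_e = x, e
print(best)  # [1 0 1]: picks sentences 0 and 2, exactly k = 2 of them
```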

cs.MA

[343] Computational Foundations for Strategic Coopetition: Formalizing Collective Action and Loyalty

Vik Pant, Eric Yu

Main category: cs.MA

TL;DR: This paper extends computational foundations for strategic coopetition to team-level dynamics, developing loyalty-moderated utility functions to address free-riding in mixed-motive multi-agent settings, with experimental and empirical validation showing robust effects.

DetailsMotivation: Mixed-motive multi-agent settings suffer from persistent free-riding where individual effort benefits all members equally but each bears full cost of their contribution. Classical work shows Nash equilibrium leads to universal shirking under pure self-interest, and while i* represents teams as composite actors, it lacks scalable computational mechanisms for analyzing how collective action problems emerge and resolve in coopetitive settings.

Method: Extends computational foundations for strategic coopetition to team-level dynamics, building on formalizations of interdependence/complementarity and trust dynamics. Develops loyalty-moderated utility functions with two mechanisms: loyalty benefit (welfare internalization plus intrinsic contribution satisfaction) and cost tolerance (reduced effort burden for loyal members). Integrates i* structural dependencies through dependency-weighted team cohesion, connecting member incentives to team-level positioning. Framework applies to both human teams and multi-agent systems.

Result: Experimental validation across 3,125 configurations demonstrates robust loyalty effects (15.04x median effort differentiation). All six behavioral targets achieve thresholds: free-riding baseline (96.5%), loyalty monotonicity (100%), effort differentiation (100%), team size effect (100%), mechanism synergy (99.5%), and bounded outcomes (100%). Empirical validation using Apache HTTP Server case study achieves 60/60 points, reproducing contribution patterns across formation, growth, maturation, and governance phases. Statistical significance confirmed at p<0.001, Cohen’s d=0.71.

Conclusion: The framework successfully addresses collective action problems in coopetitive settings by incorporating loyalty mechanisms into team-level strategic analysis, providing both computational foundations and empirical validation for understanding how free-riding can be mitigated through loyalty-moderation in mixed-motive environments.

Abstract: Mixed-motive multi-agent settings are rife with persistent free-riding because individual effort benefits all members equally, yet each member bears the full cost of their own contribution. Classical work by Holmström established that under pure self-interest, Nash equilibrium is universal shirking. While i* represents teams as composite actors, it lacks scalable computational mechanisms for analyzing how collective action problems emerge and resolve in coopetitive settings. This technical report extends computational foundations for strategic coopetition to team-level dynamics, building on companion work formalizing interdependence/complementarity (arXiv:2510.18802) and trust dynamics (arXiv:2510.24909). We develop loyalty-moderated utility functions with two mechanisms: loyalty benefit (welfare internalization plus intrinsic contribution satisfaction) and cost tolerance (reduced effort burden for loyal members). We integrate i* structural dependencies through dependency-weighted team cohesion, connecting member incentives to team-level positioning. The framework applies to both human teams (loyalty as psychological identification) and multi-agent systems (alignment coefficients and adjusted cost functions). Experimental validation across 3,125 configurations demonstrates robust loyalty effects (15.04x median effort differentiation). All six behavioral targets achieve thresholds: free-riding baseline (96.5%), loyalty monotonicity (100%), effort differentiation (100%), team size effect (100%), mechanism synergy (99.5%), and bounded outcomes (100%). Empirical validation using published Apache HTTP Server (1995-2023) case study achieves 60/60 points, reproducing contribution patterns across formation, growth, maturation, and governance phases. Statistical significance confirmed at p<0.001, Cohen’s d=0.71.
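One plausible reading of the two loyalty mechanisms as a utility function; the functional forms below are illustrative guesses, not the report's formalization.

```python
def member_utility(own_payoff, team_payoffs, effort_cost, loyalty):
    """Loyalty-moderated utility in the spirit of the report's two mechanisms:
    welfare internalization (loyalty benefit) and a reduced effort burden
    (cost tolerance). Both forms are hypothetical."""
    loyalty_benefit = loyalty * sum(team_payoffs)    # internalize team welfare
    cost_tolerance = effort_cost / (1.0 + loyalty)   # loyal members bear cost more easily
    return own_payoff + loyalty_benefit - cost_tolerance

# A loyal member accepts effort that a pure free-rider would shirk
print(member_utility(1.0, [1.0, 1.0, 1.0], effort_cost=2.0, loyalty=0.0))  # -1.0: shirk
print(member_utility(1.0, [1.0, 1.0, 1.0], effort_cost=2.0, loyalty=0.8))  # > 0: contribute
```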

[344] AMBER: A Columnar Architecture for High-Performance Agent-Based Modeling in Python

Anh-Duy Pham

Main category: cs.MA

TL;DR: AMBER is a Python ABM framework using columnar state management with Polars DataFrames instead of object-per-agent representation, achieving 1.2x-93x speedups and 30-50% memory reduction.

DetailsMotivation: Python ABM frameworks face tension between accessibility (Python's strength) and performance requirements for large-scale simulations. Current object-per-agent approaches in Python struggle with performance.

Method: AMBER replaces conventional object-per-agent representation with columnar state management using Polars DataFrame library. Includes core abstractions, spatial environments, experiment management, and optimization capabilities.

Result: Achieves speedups of 1.2x to 93x depending on workload, with greatest advantages for population-wide attribute operations. Memory profiling shows 30-50% reduction in peak usage compared to object-oriented frameworks.

Conclusion: Columnar state management establishes a viable architectural foundation for high-performance ABM in interpreted languages, resolving the accessibility-performance tension in Python-based ABM frameworks.

Abstract: Agent-based modeling (ABM) has emerged as an indispensable methodology for studying complex adaptive systems across the natural and social sciences. However, Python-based ABM frameworks face a fundamental tension between the accessibility that has made Python dominant in scientific computing and the performance requirements of large-scale simulations. This paper introduces AMBER, a framework that resolves this tension through a novel architectural approach: replacing the conventional object-per-agent representation with columnar state management using the Polars DataFrame library. We analyze the computational characteristics of both paradigms, present the architectural design of AMBER including its core abstractions, spatial environments, experiment management, and optimization capabilities. Empirical evaluation on three canonical benchmarks demonstrates that AMBER achieves speedups of 1.2x to 93x depending on workload characteristics, with the greatest advantages for models dominated by population-wide attribute operations. Memory profiling reveals 30-50% reduction in peak usage compared to object-oriented frameworks. Our results establish columnar state management as a viable architectural foundation for high-performance ABM in interpreted languages.
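The columnar idea in miniature: agent state lives in a Polars DataFrame, and a population-wide update is one vectorized expression instead of a Python loop over agent objects. Column names are illustrative.

```python
import polars as pl

# Columnar agent state: one row per agent, one column per attribute
agents = pl.DataFrame({
    "id": list(range(5)),
    "energy": [10.0, 4.0, 7.0, 2.0, 9.0],
    "alive": [True] * 5,
})

agents = agents.with_columns(
    (pl.col("energy") - 1.5).alias("energy")   # metabolism applied to all agents at once
).with_columns(
    (pl.col("energy") > 0).alias("alive")      # vectorized death check
)
print(agents)
```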

[345] Emergent Coordination in Multi-Agent Systems via Pressure Fields and Temporal Decay

Roland Rodriguez

Main category: cs.MA

TL;DR: Pressure-field coordination replaces explicit agent orchestration with implicit coordination through shared pressure gradients on a common artifact, achieving dramatically higher solve rates for complex scheduling tasks.

DetailsMotivation: Current multi-agent LLM frameworks suffer from coordination overhead that scales poorly with agent count and task complexity due to reliance on explicit orchestration patterns borrowed from human organizational structures.

Method: Proposes pressure-field coordination inspired by natural mechanisms: agents operate locally on a shared artifact guided by pressure gradients from quality signals, with temporal decay preventing premature convergence. Formalized as optimization over a pressure landscape with convergence guarantees.

Result: On meeting room scheduling across 1,350 trials: 48.5% aggregate solve rate vs 12.6% for conversation-based, 1.5% for hierarchical control, and 0.4% for sequential/random baselines (all p<0.001). Temporal decay essential (+10 percentage points). Easy problems: 86.7% solve rate. Consistent performance from 1-4 agents.

Conclusion: Implicit coordination through shared pressure gradients outperforms explicit hierarchical control, suggesting constraint-driven emergence offers a simpler and more effective foundation for multi-agent AI.

Abstract: Current multi-agent LLM frameworks rely on explicit orchestration patterns borrowed from human organizational structures: planners delegate to executors, managers coordinate workers, and hierarchical control flow governs agent interactions. These approaches suffer from coordination overhead that scales poorly with agent count and task complexity. We propose a fundamentally different paradigm inspired by natural coordination mechanisms: agents operate locally on a shared artifact, guided only by pressure gradients derived from measurable quality signals, with temporal decay preventing premature convergence. We formalize this as optimization over a pressure landscape and prove convergence guarantees under mild conditions. Empirically, on meeting room scheduling across 1,350 trials, pressure-field coordination outperforms all baselines: 48.5% aggregate solve rate versus 12.6% for conversation-based coordination, 1.5% for hierarchical control, and 0.4% for sequential and random baselines (all pairwise comparisons p < 0.001). Temporal decay is essential: disabling it reduces solve rate by 10 percentage points. On easy problems, pressure-field achieves 86.7% solve rate. The approach maintains consistent performance from 1 to 4 agents. Implicit coordination through shared pressure gradients outperforms explicit hierarchical control, suggesting that constraint-driven emergence offers a simpler and more effective foundation for multi-agent AI.
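A toy version of the coordination rule: each step takes the local move with the lowest resulting pressure, while a decaying memory of visited states keeps agents from settling too early. Everything here is a generic stand-in for the paper's scheduling setup.

```python
def pressure_field_step(state, moves, pressure_fn, memory, decay=0.9):
    """One local agent step: apply the move with the lowest resulting pressure,
    plus a decaying revisit mark that discourages premature convergence."""
    for s in memory:
        memory[s] *= decay                      # temporal decay of old marks
    best = min(moves, key=lambda m: pressure_fn(m(state)) + memory.get(m(state), 0.0))
    state = best(state)
    memory[state] = memory.get(state, 0.0) + 1.0  # mark the newly visited state
    return state

def pressure(s):                                # quality signal: distance to solved state
    return abs(s - 7)

moves = [lambda s: s + 1, lambda s: s - 1]
state, memory = 0, {}
trajectory = [state := pressure_field_step(state, moves, pressure, memory)
              for _ in range(12)]
print(trajectory)  # climbs to the solved state 7, then decay keeps it probing nearby
```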

cs.MM

eess.AS

[346] ES4R: Speech Encoding Based on Prepositive Affective Modeling for Empathetic Response Generation

Zhuoyue Gao, Xiaohui Wang, Xiaocui Yang, Wen Zhang, Daling Wang, Shi Feng, Yifei Zhang

Main category: eess.AS

TL;DR: ES4R is a speech-based empathetic response generation framework that explicitly models structured affective context before speech encoding, outperforming existing methods in empathetic dialogue.

DetailsMotivation: Existing speech-to-speech LLMs weaken affective information and contextual coherence in multi-turn dialogues by either relying on ASR transcription or using encoders that extract latent representations without explicit affective modeling.

Method: Proposes ES4R with dual-level attention mechanism to capture turn-level affective states and dialogue-level affective dynamics, integrates affective representations with textual semantics through speech-guided cross-modal attention, and uses energy-based strategy selection and style fusion for empathetic speech synthesis.

Result: ES4R consistently outperforms strong baselines in both automatic and human evaluations and remains robust across different LLM backbones.

Conclusion: Explicitly modeling structured affective context before speech encoding is more effective for empathetic speech dialogue than implicit learning or explicit emotion supervision, enabling better affective understanding and contextual coherence.

Abstract: Empathetic speech dialogue requires not only understanding linguistic content but also perceiving rich paralinguistic information such as prosody, tone, and emotional intensity for affective understanding. Existing speech-to-speech large language models either rely on ASR transcription or use encoders to extract latent representations, often weakening affective information and contextual coherence in multi-turn dialogues. To address this, we propose \textbf{ES4R}, a framework for speech-based empathetic response generation. Our core innovation lies in explicitly modeling structured affective context before speech encoding, rather than relying on implicit learning by the encoder or explicit emotion supervision. Specifically, we introduce a dual-level attention mechanism to capture turn-level affective states and dialogue-level affective dynamics. The resulting affective representations are then integrated with textual semantics through speech-guided cross-modal attention to generate empathetic responses. For speech output, we employ energy-based strategy selection and style fusion to achieve empathetic speech synthesis. ES4R consistently outperforms strong baselines in both automatic and human evaluations and remains robust across different LLM backbones.

[347] Zero-Shot Speech LLMs for Multi-Aspect Evaluation of L2 Speech: Challenges and Opportunities

Aditya Kamlesh Parikh, Cristian Tejedor-Garcia, Catia Cucchiarini, Helmer Strik

Main category: eess.AS

TL;DR: The paper evaluates Qwen2-Audio-7B-Instruct’s zero-shot performance on L2 English pronunciation assessment using Speechocean762 data, showing strong agreement with human ratings but with limitations in low-quality speech scoring and error detection.

DetailsMotivation: Accurate L2 English pronunciation assessment is crucial for personalized language learning feedback and fair progress evaluation, but automated scoring remains challenging due to the complexity of sentence-level fluency, prosody, and completeness.

Method: The study evaluates the zero-shot performance of Qwen2-Audio-7B-Instruct, an instruction-tuned speech-LLM, on 5,000 Speechocean762 utterances. The model generates rubric-aligned scores for accuracy, fluency, prosody, and completeness.

Result: The model shows strong agreement with human ratings within ±2 tolerance, especially for high-quality speech. However, it tends to overpredict low-quality speech scores and lacks precision in error detection.

Conclusion: Speech LLMs demonstrate strong potential for scalable pronunciation assessment. Future improvements should focus on enhanced prompting, calibration, and phonetic integration to advance Computer-Assisted Pronunciation Training.

Abstract: An accurate assessment of L2 English pronunciation is crucial for language learning, as it provides personalized feedback and ensures a fair evaluation of individual progress. However, automated scoring remains challenging due to the complexity of sentence-level fluency, prosody, and completeness. This paper evaluates the zero-shot performance of Qwen2-Audio-7B-Instruct, an instruction-tuned speech-LLM, on 5,000 Speechocean762 utterances. The model generates rubric-aligned scores for accuracy, fluency, prosody, and completeness, showing strong agreement with human ratings within ±2 tolerance, especially for high-quality speech. However, it tends to overpredict low-quality speech scores and lacks precision in error detection. These findings demonstrate the strong potential of speech LLMs in scalable pronunciation assessment and suggest future improvements through enhanced prompting, calibration, and phonetic integration to advance Computer-Assisted Pronunciation Training.

[348] Test-Time Adaptation for Speech Emotion Recognition

Jiaheng Dong, Hong Jia, Ting Dang

Main category: eess.AS

TL;DR: First systematic evaluation of 11 test-time adaptation methods for speech emotion recognition, finding backpropagation-free methods most promising while entropy minimization and pseudo-labeling fail due to emotion ambiguity.

DetailsMotivation: Speech Emotion Recognition (SER) systems are fragile to domain shifts (speaker variability, acted vs. natural emotions, cross-corpus variations). Domain adaptation requires source/target data that are often unavailable or raise privacy concerns. Test-time adaptation (TTA) adapts models at inference using only unlabeled target data, but its efficacy for SER's unique domain shifts hasn't been investigated.

Method: Conducted first systematic evaluation and comparison of 11 TTA methods across three representative SER tasks. Focused on methods that adapt models at test time using only unlabeled target data.

Result: Backpropagation-free TTA methods are most promising. Entropy minimization and pseudo-labeling generally fail because their assumption of a single, confident ground-truth label is incompatible with the inherent ambiguity of emotional expression. No single method universally excels - effectiveness depends on distributional shifts and tasks.

Conclusion: TTA can address SER’s domain shift challenges, but method selection must consider emotion ambiguity. Backpropagation-free approaches show promise, while methods assuming clear ground-truth labels fail. Future work should develop TTA methods specifically designed for SER’s unique characteristics.

Abstract: The practical utility of Speech Emotion Recognition (SER) systems is undermined by their fragility to domain shifts, such as speaker variability, the distinction between acted and naturalistic emotions, and cross-corpus variations. While domain adaptation and fine-tuning are widely studied, they require either source data or labelled target data, which are often unavailable or raise privacy concerns in SER. Test-time adaptation (TTA) bridges this gap by adapting models at inference using only unlabeled target data. Yet, having been predominantly designed for image classification and speech recognition, the efficacy of TTA for mitigating the unique domain shifts in SER has not been investigated. In this paper, we present the first systematic evaluation and comparison covering 11 TTA methods across three representative SER tasks. The results indicate that backpropagation-free TTA methods are the most promising. Conversely, entropy minimization and pseudo-labeling generally fail, as their core assumption of a single, confident ground-truth label is incompatible with the inherent ambiguity of emotional expression. Further, no single method universally excels, and its effectiveness is highly dependent on the distributional shifts and tasks.
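For context, the entropy-minimization objective that the study finds ill-suited to ambiguous emotions is simply the Shannon entropy of the model's own test-time predictions (as in Tent-style adaptation):

```python
import torch
import torch.nn.functional as F

def entropy_minimization_loss(logits):
    """Tent-style test-time objective: minimize the entropy of the model's
    own predictions on unlabeled target audio. The paper finds its core
    assumption (one confident true label) breaks down for ambiguous emotions."""
    p = F.softmax(logits, dim=-1)
    return -(p * torch.log(p + 1e-8)).sum(dim=-1).mean()

logits = torch.randn(16, 4, requires_grad=True)   # batch of 16, 4 emotion classes
loss = entropy_minimization_loss(logits)
loss.backward()     # gradients would typically update, e.g., BN affine parameters
print(float(loss))
```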

[349] EdgeSpot: Efficient and High-Performance Few-Shot Model for Keyword Spotting

Oguzhan Buyuksolak, Alican Gok, Osman Erman Okman

Main category: eess.AS

TL;DR: EdgeSpot: Efficient few-shot keyword spotting model for edge devices with optimized BC-ResNet backbone, PCEN frontend, and temporal self-attention, achieving 82.0% accuracy at 1% FAR with only 29.4M MACs.

DetailsMotivation: To develop an efficient keyword spotting model suitable for edge devices that can achieve high accuracy with limited computational resources and few training examples.

Method: Combines optimized BC-ResNet acoustic backbone with trainable Per-Channel Energy Normalization frontend and lightweight temporal self-attention. Uses knowledge distillation with self-supervised teacher model trained with Sub-center ArcFace loss.

Result: EdgeSpot consistently outperforms BC-ResNet baselines. EdgeSpot-4 achieves 82.0% 10-shot accuracy at 1% FAR (vs 73.7% baseline) with only 29.4M MACs and 128k parameters.

Conclusion: EdgeSpot provides an efficient and accurate solution for few-shot keyword spotting on edge devices, demonstrating significant improvements over strong baselines while maintaining low computational requirements.

Abstract: We introduce an efficient few-shot keyword spotting model for edge devices, EdgeSpot, that pairs an optimized version of a BC-ResNet-based acoustic backbone with a trainable Per-Channel Energy Normalization frontend and lightweight temporal self-attention. Knowledge distillation is utilized during training by employing a self-supervised teacher model, optimized with Sub-center ArcFace loss. This study demonstrates that the EdgeSpot model consistently provides better accuracy at a fixed false-alarm rate (FAR) than strong BC-ResNet baselines. The largest variant, EdgeSpot-4, improves the 10-shot accuracy at 1% FAR from 73.7% to 82.0% while requiring only 29.4M MACs and 128k parameters.

[350] TidyVoice: A Curated Multilingual Dataset for Speaker Verification Derived from Common Voice

Aref Farhadipour, Jan Marquenie, Srikanth Madikeri, Eleanor Chodroff

Main category: eess.AS

TL;DR: TidyVoice dataset addresses multilingual speaker recognition gap by cleaning Common Voice data, offering 212K+ monolingual and 4.5K+ multilingual speakers across 81 languages, with ResNet models achieving 0.35% EER and improved generalization.

DetailsMotivation: Lack of large-scale, publicly available multilingual datasets for speaker recognition, especially for read-speech style needed for applications like anti-spoofing.

Method: Created TidyVoice dataset by mitigating speaker heterogeneity in Mozilla Common Voice corpus, resulting in two partitions: Tidy-M (monolingual speakers) and Tidy-X (multilingual speakers). Used ResNet architectures for speaker recognition models.

Result: Achieved 0.35% EER by fine-tuning on Tidy-M partition. Fine-tuning improved model generalization, enhancing performance on unseen conversational interview data from CANDOR corpus.

Conclusion: TidyVoice provides a valuable resource for multilingual speaker recognition research, with publicly released dataset, evaluation trials, and models to advance the field.

Abstract: The development of robust, multilingual speaker recognition systems is hindered by a lack of large-scale, publicly available and multilingual datasets, particularly for the read-speech style crucial for applications like anti-spoofing. To address this gap, we introduce the TidyVoice dataset derived from the Mozilla Common Voice corpus after mitigating its inherent speaker heterogeneity within the provided client IDs. TidyVoice currently contains training and test data from over 212,000 monolingual speakers (Tidy-M) and around 4,500 multilingual speakers (Tidy-X) from which we derive two distinct conditions. The Tidy-M condition contains target and non-target trials from monolingual speakers across 81 languages. The Tidy-X condition contains target and non-target trials from multilingual speakers in both same- and cross-language trials. We employ two architectures of ResNet models, achieving a 0.35% EER by fine-tuning on our comprehensive Tidy-M partition. Moreover, we show that this fine-tuning enhances the model’s generalization, improving performance on unseen conversational interview data from the CANDOR corpus. The complete dataset, evaluation trials, and our models are publicly released to provide a new resource for the community.
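The reported EER can be reproduced from any trial list with a simple threshold sweep; toolkits typically interpolate for finer resolution, but the minimal version below shows the definition.

```python
import numpy as np

def equal_error_rate(scores, labels):
    """EER via threshold sweep: the operating point where the false-acceptance
    rate on non-target trials equals the false-rejection rate on target trials."""
    scores, labels = np.asarray(scores, float), np.asarray(labels, int)
    best_gap, eer = float("inf"), None
    for t in np.sort(np.unique(scores)):
        far = np.mean(scores[labels == 0] >= t)   # non-targets accepted
        frr = np.mean(scores[labels == 1] < t)    # targets rejected
        if abs(far - frr) < best_gap:
            best_gap, eer = abs(far - frr), (far + frr) / 2
    return eer

scores = [0.9, 0.8, 0.75, 0.6, 0.4, 0.3, 0.2]
labels = [1,   1,   1,    0,   1,   0,   0  ]   # 1 = target (same-speaker) trial
print(equal_error_rate(scores, labels))          # ~0.29 for this toy trial list
```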

[351] FlowSE-GRPO: Training Flow Matching Speech Enhancement via Online Reinforcement Learning

Haoxu Wang, Biao Tian, Yiheng Jiang, Zexu Pan, Shengkui Zhao, Bin Ma, Daren Chen, Xiangang Li

Main category: eess.AS

TL;DR: First successful integration of online Group Relative Policy Optimization (GRPO) into flow-matching speech enhancement framework for post-training alignment with perceptual metrics

DetailsMotivation: Generative speech enhancement needs better alignment with human preferences and downstream metrics. While offline RL methods exist, online RL approaches like GRPO remain unexplored for speech enhancement despite their success in NLP.

Method: Adapted online GRPO algorithm to continuous time-series speech data and flow-matching generative models. Proposed multi-metric reward optimization strategy to balance competing objectives and prevent reward hacking.

Result: Successfully validated online GRPO for speech enhancement. Single-reward optimization causes reward hacking (higher scores but degraded audio quality). Multi-metric approach reduces overfitting and improves overall performance.

Conclusion: Online GRPO is effective for post-training alignment in generative speech enhancement. Multi-metric optimization prevents reward hacking and provides practical guidance for RL-based audio model training.

Abstract: Generative speech enhancement offers a promising alternative to traditional discriminative methods by modeling the distribution of clean speech conditioned on noisy inputs. Post-training alignment via reinforcement learning (RL) effectively aligns generative models with human preferences and downstream metrics in domains such as natural language processing, but its use in speech enhancement remains limited, especially for online RL. Prior work explores offline methods like Direct Preference Optimization (DPO); online methods such as Group Relative Policy Optimization (GRPO) remain largely uninvestigated. In this paper, we present the first successful integration of online GRPO into a flow-matching speech enhancement framework, enabling efficient post-training alignment to perceptual and task-oriented metrics with few update steps. Unlike prior GRPO work on Large Language Models, we adapt the algorithm to the continuous, time-series nature of speech and to the dynamics of flow-matching generative models. We show that optimizing a single reward yields rapid metric gains but often induces reward hacking that degrades audio fidelity despite higher scores. To mitigate this, we propose a multi-metric reward optimization strategy that balances competing objectives, substantially reducing overfitting and improving overall performance. Our experiments validate online GRPO for speech enhancement and provide practical guidance for RL-based post-training of generative audio models.
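A sketch of the group-relative, multi-metric reward shaping described here: several metric scores per candidate are combined (the weights are assumptions), then standardized within the group of samples drawn from the same noisy input.

```python
import numpy as np

def group_relative_advantages(metric_scores, weights):
    """GRPO-style advantages for a group of enhanced samples from one noisy
    input: combine normalized metrics into a scalar reward, then standardize
    within the group so rewards are relative rather than absolute."""
    m = np.asarray(metric_scores, dtype=float)   # shape: (group_size, n_metrics)
    reward = m @ np.asarray(weights, dtype=float)  # multi-metric mix combats hacking
    return (reward - reward.mean()) / (reward.std() + 1e-8)

# Four candidate enhancements scored on, e.g., (perceptual quality, intelligibility)
scores = [[3.1, 0.80], [3.6, 0.74], [2.9, 0.88], [3.3, 0.83]]
print(group_relative_advantages(scores, weights=[0.5, 0.5]))
```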

[352] Lightweight Implicit Neural Network for Binaural Audio Synthesis

Xikun Lu, Fang Liu, Weizhi Shi, Jinqiu Sang

Main category: eess.AS

TL;DR: Lite-INN is a lightweight two-stage neural framework for binaural audio synthesis that achieves comparable quality to SOTA with 72.7% fewer parameters and significantly lower computational cost, enabling edge-device deployment.

DetailsMotivation: Existing high-fidelity binaural audio synthesis methods require extensive computational resources, limiting their application on edge devices where efficiency is crucial for immersive listening experiences.

Method: Two-stage framework: 1) Initial estimates using time-domain warping, 2) Refinement via Implicit Binaural Corrector (IBC) module - an implicit neural network that predicts amplitude and phase corrections directly, resulting in compact architecture.

Result: Achieves statistically comparable perceptual quality to best-performing baseline while significantly improving computational efficiency: 72.7% parameter reduction compared to previous SOTA (NFS) and significantly fewer MAC operations.

Conclusion: Lite-INN effectively addresses the trade-off between synthesis quality and computational efficiency, providing a practical solution for high-fidelity edge-device spatial audio applications.

Abstract: High-fidelity binaural audio synthesis is crucial for immersive listening, but existing methods require extensive computational resources, limiting their edge-device application. To address this, we propose the Lightweight Implicit Neural Network (Lite-INN), a novel two-stage framework. Lite-INN first generates initial estimates using a time-domain warping, which is then refined by an Implicit Binaural Corrector (IBC) module. IBC is an implicit neural network that predicts amplitude and phase corrections directly, resulting in a highly compact model architecture. Experimental results show that Lite-INN achieves statistically comparable perceptual quality to the best-performing baseline model while significantly improving computational efficiency. Compared to the previous state-of-the-art method (NFS), Lite-INN achieves a 72.7% reduction in parameters and requires significantly fewer compute operations (MACs). This demonstrates that our approach effectively addresses the trade-off between synthesis quality and computational efficiency, providing a new solution for high-fidelity edge-device spatial audio applications.

[353] A Lightweight Fourier-based Network for Binaural Speech Enhancement with Spatial Cue Preservation

Xikun Lu, Yujian Ma, Xianquan Jiang, Xuelong Wang, Jinqiu Sang

Main category: eess.AS

TL;DR: GAF-Net is a lightweight deep complex network for binaural speech enhancement that balances performance and computational efficiency using Fourier-based processing with adaptive modulation and dynamic gating.

DetailsMotivation: Binaural speech enhancement faces a severe trade-off between performance and computational cost - state-of-the-art methods are computationally intensive while lightweight solutions sacrifice performance. There's a need to bridge this gap for resource-constrained devices.

Method: Three-component architecture: 1) Dual-feature encoder combining STFT and gammatone features for robust acoustic representation, 2) Channel-independent globally adaptive Fourier modulator for capturing long-term dependencies while preserving spatial cues, 3) Dynamic gating mechanism to reduce processing artifacts.

Result: GAF-Net achieves competitive performance in binaural cues (ILD and IPD error) and objective intelligibility (MBSTOI) with fewer parameters and lower computational cost compared to existing methods.

Conclusion: GAF-Net provides a feasible solution for high-fidelity binaural processing on resource-constrained devices by effectively balancing performance and computational efficiency through its Fourier-based adaptive architecture.

Abstract: Binaural speech enhancement faces a severe trade-off challenge, where state-of-the-art performance is achieved by computationally intensive architectures, while lightweight solutions often come at the cost of significant performance degradation. To bridge this gap, we propose the Global Adaptive Fourier Network (GAF-Net), a lightweight deep complex network that aims to establish a balance between performance and computational efficiency. The GAF-Net architecture consists of three components. First, a dual-feature encoder combining short-time Fourier transform and gammatone features enhances the robustness of acoustic representation. Second, a channel-independent globally adaptive Fourier modulator efficiently captures long-term temporal dependencies while preserving the spatial cues. Finally, a dynamic gating mechanism is implemented to reduce processing artifacts. Experimental results show that GAF-Net achieves competitive performance, particularly in terms of binaural cues (ILD and IPD error) and objective intelligibility (MBSTOI), with fewer parameters and computational cost. These results confirm that GAF-Net provides a feasible way to achieve high-fidelity binaural processing on resource-constrained devices.
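The "globally adaptive Fourier modulator" is consistent with a generic frequency-domain mixing pattern: FFT along time, multiply by a learnable per-frequency filter, inverse FFT. The sketch below shows only that pattern; GAF-Net's actual module is more involved.

```python
import torch

def global_fourier_modulate(x, log_filter):
    """Channel-independent global modulation: transform along time, scale each
    frequency by a learnable filter, and transform back. Generic sketch only."""
    X = torch.fft.rfft(x, dim=-1)                # x: (batch, channels, time)
    return torch.fft.irfft(X * torch.exp(log_filter), n=x.shape[-1], dim=-1)

x = torch.randn(2, 4, 256)
filt = torch.zeros(x.shape[-1] // 2 + 1, requires_grad=True)  # identity at init
print(global_fourier_modulate(x, filt).shape)    # (2, 4, 256): shape is preserved
```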

[354] Frame-Stacked Local Transformers For Efficient Multi-Codebook Speech Generation

Roy Fejgin, Paarth Neekhara, Xuesong Yang, Edresson Casanova, Ryan Langman, Jaehyeon Kim, Subhankar Ghosh, Shehzeen Hussain, Jason Li

Main category: eess.AS

TL;DR: The paper analyzes parallel vs. hierarchical decoding strategies for LLM-based speech generation, comparing autoregressive and MaskGIT-based local transformers for capturing intra-timestep dependencies in multicodebook acoustic codes.

DetailsMotivation: Speech generation LLMs operate on discrete acoustic codes with multicodebook structure, requiring prediction of N codebook entries per timestep. Simple parallel prediction assumes independence among codebooks, reducing fidelity, necessitating better methods to capture intra-timestep dependencies.

Method: Systematically investigate two local transformer architectures: 1) autoregressive transformer generating codebooks sequentially, and 2) MaskGIT-based transformer performing iterative masked prediction. Both enable frame stacking where primary transformer predicts multiple frames jointly and local transformer decodes their codebooks.
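The MaskGIT-style local transformer is concrete enough to sketch. The snippet below assumes a `local_tf(tokens, frame_ctx)` callable returning per-slot logits and a fixed linear unmasking schedule; the real interface, schedule, and sampling details are not specified in this summary.

```python
import torch

def maskgit_decode_codebooks(local_tf, frame_ctx, n_codebooks, n_steps=4, mask_id=0):
    """Sketch of MaskGIT-style iterative decoding of the N codebook entries of
    one (possibly stacked) frame, conditioned on the primary transformer's
    hidden state. `local_tf(tokens, frame_ctx)` is assumed to return logits of
    shape (batch, n_codebooks, vocab)."""
    b = frame_ctx.shape[0]
    device = frame_ctx.device
    tokens = torch.full((b, n_codebooks), mask_id, dtype=torch.long, device=device)
    known = torch.zeros(b, n_codebooks, dtype=torch.bool, device=device)
    for step in range(n_steps):
        probs = local_tf(tokens, frame_ctx).softmax(dim=-1)
        conf, pred = probs.max(dim=-1)              # best entry per codebook slot
        conf = conf.masked_fill(known, -1.0)        # never re-pick fixed slots
        # Fix a growing fraction of the slots at each iteration.
        target = max(1, int(n_codebooks * (step + 1) / n_steps))
        n_new = target - int(known[0].sum())
        if n_new > 0:
            idx = conf.topk(n_new, dim=-1).indices
            tokens.scatter_(1, idx, pred.gather(1, idx))
            known.scatter_(1, idx, True)
    return tokens
```

The autoregressive alternative would instead decode the N slots strictly left-to-right in N forward passes; the paper's analysis concerns where each variant wins on the throughput/quality curve.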

Result: Through extensive analysis, characterize tradeoffs between parallel and iterative sampling strategies across different throughput and quality regimes. Frame stacking improves speed without compromising perceptual quality.

Conclusion: Propose practical guidelines for selecting decoding strategies based on deployment priorities such as computational efficiency and synthesis fidelity, providing systematic framework for optimizing LLM-based speech generation.

Abstract: Speech generation models based on large language models (LLMs) typically operate on discrete acoustic codes, which differ fundamentally from text tokens due to their multicodebook structure. At each timestep, models must predict N codebook entries jointly, introducing dependencies that challenge simple parallel prediction approaches. Parallel prediction assumes independence among codebooks, yielding efficient decoding but often at the cost of reduced fidelity. To address this, hierarchical strategies employ a local transformer (LT) to refine predictions and capture intra-timestep dependencies. In this work, we systematically investigate two LT architectures: an autoregressive transformer that generates codebooks sequentially, and a MaskGIT-based transformer that performs iterative masked prediction. Both designs further enable frame stacking, where the primary transformer predicts multiple frames jointly, and the LT decodes their codebooks, offering improvements in speed without compromising perceptual quality. Through extensive analysis, we characterize the tradeoffs between parallel and iterative sampling strategies across different throughput and quality regimes. Finally, we propose practical guidelines for selecting decoding strategies based on deployment priorities such as computational efficiency and synthesis fidelity.

[355] Enhanced Generative Machine Listener

Vishnu Raj, Gouthaman KV, Shiv Gehlot, Lars Villemoes, Arijit Biswas

Main category: eess.AS

TL;DR: GMLv2 is a reference-based model for predicting subjective audio quality (MUSHRA scores) using Beta distribution-based loss and additional neural audio coding datasets, outperforming traditional metrics like PEAQ and ViSQOL.

DetailsMotivation: The paper aims to develop a better automated method for evaluating perceptual audio quality to accelerate research and development in modern audio coding technologies, addressing limitations of existing metrics like PEAQ and ViSQOL.

Method: GMLv2 introduces a Beta distribution-based loss to model listener ratings and incorporates additional neural audio coding (NAC) subjective datasets to improve generalization and applicability across diverse content and codec configurations.
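The Beta-likelihood idea is concrete enough to sketch. Assuming MUSHRA scores on a 0-100 scale and a network head emitting two unconstrained values that are mapped to positive concentrations (the helper names and softplus parameterization are illustrative, not from the paper):

```python
import torch
import torch.nn.functional as F
from torch.distributions import Beta

def beta_nll(raw_a, raw_b, mushra, eps=1e-4):
    """Sketch of a Beta-distribution loss for listener ratings: the model
    predicts concentrations (alpha, beta) per item; the loss is the negative
    log-likelihood of the observed MUSHRA scores rescaled into (0, 1)."""
    alpha = F.softplus(raw_a) + eps          # keep concentrations positive
    beta = F.softplus(raw_b) + eps
    y = (mushra / 100.0).clamp(eps, 1.0 - eps)
    return -Beta(alpha, beta).log_prob(y).mean()
```

Unlike an MSE regression to the mean rating, a Beta likelihood can represent the spread and skew of ratings across listeners, which is plausibly what lets the model capture the rating distribution rather than a point estimate.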

Result: Extensive evaluations show GMLv2 consistently outperforms widely used metrics (PEAQ and ViSQOL) in both correlation with subjective scores and reliable prediction across diverse content types and codec configurations.

Conclusion: GMLv2 offers a scalable and automated framework for perceptual audio quality evaluation that can accelerate research and development in modern audio coding technologies.

Abstract: We present GMLv2, a reference-based model designed for the prediction of subjective audio quality as measured by MUSHRA scores. GMLv2 introduces a Beta distribution-based loss to model the listener ratings and incorporates additional neural audio coding (NAC) subjective datasets to extend its generalization and applicability. Extensive evaluations on diverse test sets demonstrate that the proposed GMLv2 consistently outperforms widely used metrics, such as PEAQ and ViSQOL, both in correlation with subjective scores and in reliably predicting these scores across diverse content types and codec configurations. Consequently, GMLv2 offers a scalable and automated framework for perceptual audio quality evaluation, poised to accelerate research and development in modern audio coding technologies.

[356] Speaker Anonymisation for Speech-based Suicide Risk Detection

Ziyun Cui, Sike Jia, Yang Lin, Yinan Duan, Diyang Qu, Runsen Chen, Chao Zhang, Chang Lei, Wen Wu

Main category: eess.AS

TL;DR: First systematic study of speaker anonymization for speech-based suicide risk detection, showing combined anonymization methods can protect speaker identity while maintaining detection performance comparable to original speech.

DetailsMotivation: Adolescent suicide is a critical global health issue where speech provides a cost-effective modality for automatic risk detection. Protecting speaker identity is crucial for this vulnerable population since speech can reveal personally identifiable information if data is leaked or exploited.

Method: Investigates a broad range of anonymization methods including traditional signal processing, neural voice conversion, and speech synthesis techniques. Builds a comprehensive evaluation framework to assess trade-offs between speaker identity protection and preservation of suicide risk detection information.
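Among the traditional signal-processing methods, the McAdams-coefficient approach is a standard anonymisation baseline and is simple enough to sketch. The summary does not list the exact methods evaluated, so this is illustrative only; real systems apply it frame-wise with overlap-add, which is omitted here.

```python
import numpy as np
import librosa
from scipy.signal import lfilter

def mcadams_anonymise(frame, order=16, alpha=0.8):
    """Sketch of McAdams-coefficient anonymisation: LPC pole angles are warped
    as theta -> theta**alpha, shifting formant positions (speaker identity
    cues) while keeping the excitation (linguistic content) intact."""
    a = librosa.lpc(frame, order=order)           # all-pole model, a[0] == 1
    residual = lfilter(a, [1.0], frame)           # inverse filter -> excitation
    poles = np.roots(a)
    shifted = []
    for p in poles:
        if abs(p.imag) > 1e-8:                    # warp only complex pole pairs
            theta = np.angle(p)
            p = abs(p) * np.exp(1j * np.sign(theta) * abs(theta) ** alpha)
        shifted.append(p)
    a_new = np.real(np.poly(shifted))             # rebuild the all-pole filter
    return lfilter([1.0], a_new, residual)        # resynthesize with new poles
```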

Result: Combining anonymization methods that retain complementary information yields suicide risk detection performance comparable to that of original speech, while achieving effective protection of speaker identity for vulnerable populations.

Conclusion: This work demonstrates that speaker anonymization can be effectively implemented for speech-based suicide risk detection systems, balancing privacy protection with clinical utility for vulnerable adolescent populations.

Abstract: Adolescent suicide is a critical global health issue, and speech provides a cost-effective modality for automatic suicide risk detection. Given the vulnerable population, protecting speaker identity is particularly important, as speech itself can reveal personally identifiable information if the data is leaked or maliciously exploited. This work presents the first systematic study of speaker anonymisation for speech-based suicide risk detection. A broad range of anonymisation methods are investigated, including techniques based on traditional signal processing, neural voice conversion, and speech synthesis. A comprehensive evaluation framework is built to assess the trade-off between protecting speaker identity and preserving information essential for suicide risk detection. Results show that combining anonymisation methods that retain complementary information yields detection performance comparable to that of original speech, while achieving protection of speaker identity for vulnerable populations.

[357] Audio dequantization using instantaneous frequency

Vojtěch Kovanda, Pavel Rajmic

Main category: eess.AS

TL;DR: PHADQ is a phase-aware audio dequantization method that uses temporal continuity regularization to avoid energy loss artifacts common in l1-based approaches.

DetailsMotivation: Current audio dequantization methods using l1-based regularization suffer from energy loss artifacts and don't properly handle temporal continuity of sinusoidal components in time-frequency representations.

Method: PHADQ employs a phase-aware regularizer that promotes temporal continuity of sinusoidal components in audio time-frequency representations, adapting a technique previously successful in audio inpainting problems.
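Optimization-based dequantizers of this kind typically alternate a regularization step with a data-consistency step that keeps every sample inside the quantization cell it was observed in. The consistency step is easy to make concrete; a mid-tread uniform quantizer with step `delta` is assumed here.

```python
import numpy as np

def project_to_quantization_cell(x, x_quantized, delta):
    """Sketch of the data-consistency step in optimization-based
    dequantization: every sample of the current estimate is clipped back into
    the quantization cell that produced its observed quantized value."""
    return np.clip(x, x_quantized - delta / 2.0, x_quantized + delta / 2.0)
```

In a proximal-splitting scheme, this projection would alternate with a proximal step on the phase-aware regularizer, which shapes the time-frequency coefficients toward temporally continuous sinusoidal tracks.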

Result: The method was evaluated against state-of-the-art approaches using SDR and PEMO-Q ODG objective metrics, plus a subjective MUSHRA-like test, showing improved performance.

Conclusion: PHADQ provides an effective audio dequantization approach that avoids energy loss artifacts by leveraging phase-aware regularization for better temporal continuity of sinusoidal components.

Abstract: We present a dequantization method that employs a phase-aware regularizer, previously applied with success to the audio inpainting problem. The method promotes temporal continuity of sinusoidal components in the time-frequency representation of the audio signal, and avoids the energy loss artifacts commonly encountered with l1-based regularization approaches. The proposed method is called the Phase-Aware Audio Dequantizer (PHADQ). The method is evaluated against the state of the art using the SDR and PEMO-Q ODG objective metrics, and a subjective MUSHRA-like test.

eess.IV

[358] Experience with Single Domain Generalization in Real World Medical Imaging Deployments

Ayan Banerjee, Komandoor Srivathsan, Sandeep K. S. Gupta

Main category: eess.IV

TL;DR: The paper presents DL+EKE, a novel deep learning approach that integrates expert knowledge to address Single Domain Generalization challenges in medical imaging, showing superior performance over SOTA methods on diabetic retinopathy and successful deployment on real-world seizure detection and coronary artery detection tasks.

DetailsMotivation: Medical imaging faces domain shift challenges due to differences in scanners and protocols across centers. Single Domain Generalization (SDG) is crucial for effective AI deployment in healthcare, but current SOTA methods fail to achieve generalized performance across unseen domains, especially for rare class characteristics.

Method: Developed DL+EKE (Deep Learning + Expert Knowledge Embedding), a generic framework that integrates domain expert knowledge into deep learning models. First validated on diabetic retinopathy application, then instantiated and deployed for two real-world medical imaging tasks: seizure onset zone detection using fMRI data and stress ECG-based coronary artery detection.
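The summary does not spell out the embedding mechanism, so the following is only a generic illustration of one common way to integrate expert knowledge: fusing hand-crafted, expert-defined features with learned deep features before the classifier. All names and the concatenation design are assumptions, not the paper's DL+EKE mechanism.

```python
import torch
import torch.nn as nn

class ExpertKnowledgeFusion(nn.Module):
    """Generic illustration only: expert-defined features (e.g., clinically
    motivated measurements) are concatenated with learned deep features before
    the classifier, anchoring decisions to cues that are stable across
    scanners and protocols."""

    def __init__(self, backbone, deep_dim, expert_dim, n_classes):
        super().__init__()
        self.backbone = backbone                   # any image feature extractor
        self.head = nn.Linear(deep_dim + expert_dim, n_classes)

    def forward(self, image, expert_feats):
        z = self.backbone(image)                   # (batch, deep_dim)
        return self.head(torch.cat([z, expert_feats], dim=-1))
```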

Result: DL+EKE outperformed state-of-the-art SDG methods on diabetic retinopathy. The technique was successfully deployed on real-world stress ECG and resting-state fMRI applications, demonstrating practical effectiveness while revealing challenges with existing SDG techniques in real clinical settings.

Conclusion: Integrating expert knowledge with deep learning (DL+EKE) provides an effective solution for Single Domain Generalization in medical imaging, addressing limitations of current SOTA methods and enabling successful deployment in real-world clinical applications with domain shift challenges.

Abstract: A desirable property of any deployed artificial intelligence is generalization across domains, i.e., data-generating distributions under specific acquisition conditions. In medical imaging applications, the most coveted property for effective deployment is Single Domain Generalization (SDG), which addresses the challenge of training a model on a single domain so that it generalizes well to unseen target domains. In multi-center studies, differences in scanners and imaging protocols introduce domain shifts that exacerbate variability in rare class characteristics. This paper presents our experience with SDG in real-life deployments for two exemplary medical imaging case studies: seizure onset zone detection using fMRI data, and stress-electrocardiogram-based coronary artery detection. Using the widely studied application of diabetic retinopathy (DR), we first demonstrate that state-of-the-art SDG techniques fail to achieve generalized performance across data domains. We then develop a generic expert-knowledge-integrated deep learning technique, DL+EKE, instantiate it for the DR application, and show that DL+EKE outperforms SOTA SDG methods on DR. Finally, we deploy instances of the DL+EKE technique on the two real-world examples of stress ECG and resting-state (rs)-fMRI and discuss the issues faced with SDG techniques.

[359] On The Robustness of Foundational 3D Medical Image Segmentation Models Against Imprecise Visual Prompts

Soumitri Chattopadhyay, Basar Demir, Marc Niethammer

Main category: eess.IV

TL;DR: This paper systematically studies the robustness of 3D foundational models for medical segmentation to imprecise visual prompts, revealing their reliance on shape/spatial cues and resilience patterns.

DetailsMotivation: While 3D foundational models show promise for promptable medical segmentation, their robustness to imprecise prompts (common in real-world scenarios) remains under-explored, creating a gap in understanding their practical reliability.

Method: The authors systematically study the effect of controlled perturbations of dense visual prompts that mimic real-world imprecision. They conduct experiments with two recent foundational models on a multi-organ abdominal segmentation task.
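Controlled perturbations of a dense (mask) prompt can be mimicked with standard morphology and shifts. A minimal sketch for a 3D binary mask follows; the specific perturbation families and magnitudes used in the paper are not reproduced here.

```python
import numpy as np
from scipy import ndimage

def perturb_mask_prompt(mask, dilate=0, erode=0, shift=(0, 0, 0)):
    """Sketch of controlled perturbations of a dense 3D mask prompt:
    morphological grow/shrink and a spatial translation mimic an imprecise,
    quickly drawn clinical prompt."""
    out = mask.astype(bool)
    if dilate > 0:
        out = ndimage.binary_dilation(out, iterations=dilate)
    if erode > 0:
        out = ndimage.binary_erosion(out, iterations=erode)
    out = ndimage.shift(out.astype(float), shift, order=0) > 0.5
    return out
```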

Result: The study reveals several facets of promptable medical segmentation: models’ reliance on visual shape and spatial cues, and the extent of their resilience to certain types of perturbations.

Conclusion: The work addresses the gap in understanding prompt robustness for medical segmentation foundational models, providing insights into their behavior with imprecise prompts and making code available for further research.

Abstract: While 3D foundational models have shown promise for promptable segmentation of medical volumes, their robustness to imprecise prompts remains under-explored. In this work, we aim to address this gap by systematically studying the effect of various controlled perturbations of dense visual prompts, that closely mimic real-world imprecision. By conducting experiments with two recent foundational models on a multi-organ abdominal segmentation task, we reveal several facets of promptable medical segmentation, especially pertaining to reliance on visual shape and spatial cues, and the extent of resilience of models towards certain perturbations. Codes are available at: https://github.com/ucsdbiag/Prompt-Robustness-MedSegFMs

[360] Unsupervised Super-Resolution of Hyperspectral Remote Sensing Images Using Fully Synthetic Training

Xinxin Xu, Yann Gousseau, Christophe Kervazo, Saïd Ladjal

Main category: eess.IV

TL;DR: Unsupervised hyperspectral image super-resolution using synthetic abundance data generated via dead leaves model to mimic real statistics, avoiding need for ground truth training data.

DetailsMotivation: Most hyperspectral super-resolution methods require supervised training with ground truth data, which is often unavailable. There's a need for unsupervised approaches that can work without paired training data.

Method: 1) Unmix hyperspectral image into abundances and endmembers; 2) Generate synthetic abundances using dead leaves model to mimic real statistics; 3) Train abundance super-resolution neural network on synthetic data; 4) Apply trained network to increase spatial resolution of real abundances; 5) Recombine with endmembers to get high-resolution hyperspectral image.
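The dead leaves model itself is classical: opaque disks with a scale-invariant (power-law) radius distribution are stacked so that later disks occlude earlier ones. A minimal single-map sketch follows; the paper's generator additionally matches real abundance statistics and produces one map per endmember, which is not reproduced here.

```python
import numpy as np

def dead_leaves(size=256, n_disks=2000, rmin=2.0, rmax=60.0, seed=None):
    """Sketch of a dead-leaves map: disks with a ~r^-3 radius law are stacked
    back-to-front, reproducing the occlusion and scale statistics of natural
    scenes; each 'leaf' gets a constant value, as an abundance patch would."""
    rng = np.random.default_rng(seed)
    img = np.zeros((size, size))
    yy, xx = np.mgrid[0:size, 0:size]
    u = rng.uniform(size=n_disks)
    radii = (u * rmax**-2 + (1 - u) * rmin**-2) ** -0.5  # inverse-CDF of r^-3
    for r in radii:                        # later disks occlude earlier ones
        cy, cx = rng.uniform(0, size, 2)
        img[(yy - cy) ** 2 + (xx - cx) ** 2 <= r ** 2] = rng.uniform()
    return img
```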

Result: Experimental results demonstrate the training potential of synthetic images and show the method’s effectiveness for unsupervised hyperspectral super-resolution.

Conclusion: The proposed unsupervised approach using synthetic abundance data successfully addresses the lack of ground truth training data problem in hyperspectral super-resolution, offering a practical solution for real-world applications.

Abstract: Considerable work has been dedicated to hyperspectral single image super-resolution to improve the spatial resolution of hyperspectral images and fully exploit their potential. However, most of these methods are supervised and require data with ground truth for training, which is often unavailable. To overcome this problem, we propose a new unsupervised training strategy for the super-resolution of hyperspectral remote sensing images, based on the use of synthetic abundance data. Its first step decomposes the hyperspectral image into abundances and endmembers by unmixing. Then, an abundance super-resolution neural network is trained using synthetic abundances, which are generated with the dead leaves model so as to faithfully mimic real abundance statistics. Next, the spatial resolution of the considered hyperspectral image abundances is increased using this trained network, and the high-resolution hyperspectral image is finally obtained by recombination with the endmembers. Experimental results show the training potential of the synthetic images and demonstrate the method's effectiveness.

[361] PanopMamba: Vision State Space Modeling for Nuclei Panoptic Segmentation

Ming Kang, Fung Fung Ting, Raphaël C. -W. Phan, Zongyuan Ge, Chee-Ming Ting

Main category: eess.IV

TL;DR: PanopMamba: A hybrid Mamba-Transformer architecture for nuclei panoptic segmentation with SSM-based feature fusion and new evaluation metrics.

DetailsMotivation: Nuclei panoptic segmentation is crucial for cancer diagnostics but faces challenges with small objects, ambiguous boundaries, and class imbalance. Existing methods need better handling of long-range dependencies and feature representation for densely overlapping nuclei.

Method: Proposes PanopMamba with multiscale Mamba backbone and SSM-based fusion network for efficient long-range perception. Integrates pyramid feature networks with dynamic feature enhancement across spatial scales. Introduces new evaluation metrics: iPQ, wPQ, and fwPQ to address nuclei segmentation challenges.
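For reference, vanilla PQ decomposes into a segmentation and a recognition term; the proposed variants change how this quantity is aggregated (per image for iPQ, boundary-weighted for wPQ, class-frequency-weighted for fwPQ, with exact definitions in the paper). A sketch of the base quantity:

```python
def panoptic_quality(iou_matches, n_fp, n_fn):
    """Sketch of vanilla PQ: `iou_matches` lists IoU values of matched
    (IoU > 0.5) prediction/ground-truth segment pairs. iPQ would compute this
    per image before averaging, so images with few nuclei are not drowned out
    by crowded ones."""
    tp = len(iou_matches)
    if tp + n_fp + n_fn == 0:
        return 1.0
    sq = sum(iou_matches) / tp if tp else 0.0      # segmentation quality
    rq = tp / (tp + 0.5 * n_fp + 0.5 * n_fn)       # recognition quality
    return sq * rq                                 # PQ = SQ * RQ
```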

Result: Superior performance on MoNuSAC2020 and NuInsSeg benchmark datasets compared to state-of-the-art methods. Validates robustness across various metrics and demonstrates distinctiveness of proposed PQ variants.

Conclusion: PanopMamba is the first Mamba-based approach for panoptic segmentation, effectively addressing nuclei segmentation challenges through hybrid architecture and improved evaluation metrics, with code publicly available.

Abstract: Nuclei panoptic segmentation supports cancer diagnostics by integrating both semantic and instance segmentation of different cell types to analyze overall tissue structure and individual nuclei in histopathology images. Major challenges include detecting small objects, handling ambiguous boundaries, and addressing class imbalance. To address these issues, we propose PanopMamba, a novel hybrid encoder-decoder architecture that integrates Mamba and Transformer with additional feature-enhanced fusion via state space modeling. We design a multiscale Mamba backbone and a State Space Model (SSM)-based fusion network to enable efficient long-range perception in pyramid features, thereby extending the pure encoder-decoder framework while facilitating information sharing across multiscale features of nuclei. The proposed SSM-based feature-enhanced fusion integrates pyramid feature networks and dynamic feature enhancement across different spatial scales, enhancing the feature representation of densely overlapping nuclei in both semantic and spatial dimensions. To the best of our knowledge, this is the first Mamba-based approach for panoptic segmentation. Additionally, we introduce alternative evaluation metrics, including image-level Panoptic Quality ($i$PQ), boundary-weighted PQ ($w$PQ), and frequency-weighted PQ ($fw$PQ), which are specifically designed to address the unique challenges of nuclei segmentation and thereby mitigate the potential bias inherent in vanilla PQ. Experimental evaluations on two multiclass nuclei segmentation benchmark datasets, MoNuSAC2020 and NuInsSeg, demonstrate the superiority of PanopMamba for nuclei panoptic segmentation over state-of-the-art methods. Consequently, the robustness of PanopMamba is validated across various metrics, while the distinctiveness of PQ variants is also demonstrated. Code is available at https://github.com/mkang315/PanopMamba.

[362] Fast, faithful and photorealistic diffusion-based image super-resolution with enhanced Flow Map models

Maxence Noble, Gonzalo Iñaki Quintana, Benjamin Aubin, Clément Chadebec

Main category: eess.IV

TL;DR: FlowMapSR: A novel diffusion-based super-resolution framework using Flow Map self-distillation for efficient inference, with positive-negative prompting guidance and adversarial LoRA fine-tuning, achieving better balance between faithfulness and photorealism than SOTA methods.

DetailsMotivation: Address the trade-off between reconstruction faithfulness and photorealism in diffusion-based SR, while improving inference efficiency. Current teacher-student distillation approaches suffer from information compression that degrades perceptual cues like textures and depth of field.

Method: Adapts Flow Map self-distillation models to SR, introduces positive-negative prompting guidance (generalization of classifier-free guidance to Flow Map models), and adversarial fine-tuning using Low-Rank Adaptation (LoRA). Evaluates three Flow Map formulations (Eulerian, Lagrangian, Shortcut).
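By analogy with classifier-free guidance, positive-negative prompting plausibly combines two conditional flow-map predictions per jump. A sketch under the assumption that the model maps a noisy state at time t directly to time s given a prompt embedding; the call signature and guidance scale are assumptions.

```python
import torch

@torch.no_grad()
def guided_flow_jump(model, x_t, t, s, pos_emb, neg_emb, scale=4.5):
    """Sketch of positive-negative prompting guidance for a Flow Map model,
    written by analogy with classifier-free guidance: the prediction is pushed
    away from the negative-prompt branch and toward the positive one."""
    pred_pos = model(x_t, t, s, cond=pos_emb)   # e.g. "sharp, detailed photo"
    pred_neg = model(x_t, t, s, cond=neg_emb)   # e.g. "blurry, oversmoothed"
    return pred_neg + scale * (pred_pos - pred_neg)
```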

Result: Shortcut Flow Map variant combined with enhancements achieves best performance. FlowMapSR achieves better balance between reconstruction faithfulness and photorealism than recent SOTA methods for both x4 and x8 upscaling, with competitive inference time. Single model works for both scales without scale-specific conditioning.

Conclusion: FlowMapSR successfully addresses the faithfulness-photorealism trade-off in diffusion-based SR through Flow Map self-distillation with novel enhancements, enabling efficient inference while maintaining high perceptual quality across multiple upscaling factors.

Abstract: Diffusion-based image super-resolution (SR) has recently attracted significant attention by leveraging the expressive power of large pre-trained text-to-image diffusion models (DMs). A central practical challenge is resolving the trade-off between reconstruction faithfulness and photorealism. To address inference efficiency, many recent works have explored knowledge distillation strategies specifically tailored to SR, enabling one-step diffusion-based approaches. However, these teacher-student formulations are inherently constrained by information compression, which can degrade perceptual cues such as lifelike textures and depth of field, even with high overall perceptual quality. In parallel, self-distillation DMs, known as Flow Map models, have emerged as a promising alternative for image generation tasks, enabling fast inference while preserving the expressivity and training stability of standard DMs. Building on these developments, we propose FlowMapSR, a novel diffusion-based framework for image super-resolution explicitly designed for efficient inference. Beyond adapting Flow Map models to SR, we introduce two complementary enhancements: (i) positive-negative prompting guidance, based on a generalization of the classifier-free guidance paradigm to Flow Map models, and (ii) adversarial fine-tuning using Low-Rank Adaptation (LoRA). Among the considered Flow Map formulations (Eulerian, Lagrangian, and Shortcut), we find that the Shortcut variant consistently achieves the best performance when combined with these enhancements. Extensive experiments show that FlowMapSR achieves a better balance between reconstruction faithfulness and photorealism than recent state-of-the-art methods for both x4 and x8 upscaling, while maintaining competitive inference time. Notably, a single model is used for both upscaling factors, without any scale-specific conditioning or degradation-guided mechanisms.

[363] PocketDVDNet: Realtime Video Denoising for Real Camera Noise

Crispian Morris, Imogen Dexter, Fan Zhang, David R. Bull, Nantheera Anantrasirichai

Main category: eess.IV

TL;DR: PocketDVDNet is a lightweight video denoiser that uses model compression with sparsity-guided pruning, physics-informed noise modeling, and knowledge distillation to achieve real-time performance with 74% size reduction while improving quality.

DetailsMotivation: Live video denoising under realistic, multi-component sensor noise remains challenging for real-time applications like autofocus, autonomous driving, and surveillance, requiring both high-quality restoration and computational efficiency.

Method: Uses a model compression framework combining: 1) sparsity-guided structured pruning, 2) physics-informed noise model, and 3) knowledge distillation. Starts from reference model, induces sparsity, applies targeted channel pruning, retrains teacher on realistic noise, then student learns implicit noise handling without explicit noise-map inputs.
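One standard realization of sparsity-guided structured pruning (network-slimming style; whether PocketDVDNet uses exactly this criterion is not stated in the summary) trains with an L1 penalty on BatchNorm scale factors and then drops low-|gamma| channels against a global threshold:

```python
import torch

def channel_keep_masks(bn_layers, keep_ratio=0.26):
    """Sketch of sparsity-guided channel selection: after training with an L1
    penalty on BatchNorm scales, channels whose |gamma| falls below a global
    quantile threshold are pruned. A keep ratio of 0.26 would mirror the
    reported 74% size reduction."""
    gammas = torch.cat([bn.weight.detach().abs().flatten() for bn in bn_layers])
    threshold = torch.quantile(gammas, 1.0 - keep_ratio)
    return [bn.weight.detach().abs() >= threshold for bn in bn_layers]
```

The pruned network would then be retrained as the teacher on the realistic noise model, with the student distilled from it as described.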

Result: PocketDVDNet reduces original model size by 74% while improving denoising quality and processing 5-frame patches in real-time, demonstrating that aggressive compression with domain-adapted distillation reconciles performance and efficiency.

Conclusion: Aggressive compression combined with domain-adapted knowledge distillation can effectively reconcile performance and efficiency for practical, real-time video denoising applications.

Abstract: Live video denoising under realistic, multi-component sensor noise remains challenging for applications such as autofocus, autonomous driving, and surveillance. We propose PocketDVDNet, a lightweight video denoiser developed using our model compression framework that combines sparsity-guided structured pruning, a physics-informed noise model, and knowledge distillation to achieve high-quality restoration with reduced resource demands. Starting from a reference model, we induce sparsity, apply targeted channel pruning, and retrain a teacher on realistic multi-component noise. The student network learns implicit noise handling, eliminating the need for explicit noise-map inputs. PocketDVDNet reduces the original model size by 74% while improving denoising quality and processing 5-frame patches in real-time. These results demonstrate that aggressive compression, combined with domain-adapted distillation, can reconcile performance and efficiency for practical, real-time video denoising.

[364] TaQ-DiT: Time-aware Quantization for Diffusion Transformers

Xinyan Liu, Huihong Shi, Yang Xu, Zhongfeng Wang

Main category: eess.IV

TL;DR: TaQ-DiT: A novel quantization method for Diffusion Transformers that addresses joint reconstruction and time-variance-aware quantization to achieve superior W4A8 performance.

DetailsMotivation: Diffusion Transformers (DiTs) have state-of-the-art image/video generation performance but suffer from large model size and slow inference speed. Existing quantization methods overlook two critical issues: (1) impact of reconstruction, and (2) varying quantization sensitivities across different layers, which limit achievable compression performance.

Method: Proposes TaQ-DiT with two key innovations: (1) Joint reconstruction method to address non-convergence issue when reconstructing weights and activations separately, and (2) Time-variance-aware transformations to handle Post-GELU activations that are particularly sensitive to quantization due to significant variability across denoising steps and extreme asymmetries within each step.
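The time-variance observation suggests, at minimum, per-denoising-step quantization parameters for the sensitive activations. A sketch of per-step fake quantization follows; treating `scales` as a per-step calibration table is our assumption, and the paper's time-variance-aware transformations go beyond simple per-step scaling.

```python
import torch

def fake_quantize_per_step(x, scales, t, bits=8):
    """Sketch of time-aware activation quantization: Post-GELU activations get
    a separate calibrated scale per denoising step t, since their range varies
    strongly across steps."""
    qmax = 2 ** (bits - 1) - 1
    s = scales[t]                                   # per-timestep scale
    q = torch.clamp(torch.round(x / s), -qmax - 1, qmax)
    return q * s                                    # dequantized ("fake") values
```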

Result: Experimental results show that when quantizing DiTs’ weights to 4-bit and activations to 8-bit (W4A8), the proposed method significantly surpasses previous quantization methods.

Conclusion: TaQ-DiT effectively addresses the limitations of existing DiT quantization methods by considering joint reconstruction and time-variance-aware quantization, enabling practical deployment of compressed DiTs while maintaining performance.

Abstract: Transformer-based diffusion models, dubbed Diffusion Transformers (DiTs), have achieved state-of-the-art performance in image and video generation tasks. However, their large model size and slow inference speed limit their practical applications, calling for model compression methods such as quantization. Unfortunately, existing DiT quantization methods overlook (1) the impact of reconstruction and (2) the varying quantization sensitivities across different layers, which hinder their achievable performance. To tackle these issues, we propose innovative time-aware quantization for DiTs (TaQ-DiT). Specifically, (1) we observe a non-convergence issue when reconstructing weights and activations separately during quantization and introduce a joint reconstruction method to resolve this problem. (2) We discover that Post-GELU activations are particularly sensitive to quantization due to their significant variability across different denoising steps as well as extreme asymmetries and variations within each step. To address this, we propose time-variance-aware transformations to facilitate more effective quantization. Experimental results show that when quantizing DiTs’ weights to 4-bit and activations to 8-bit (W4A8), our method significantly surpasses previous quantization methods.

[365] Towards contrast- and pathology-agnostic clinical fetal brain MRI segmentation using SynthSeg

Ziyao Shang, Misha Kaandorp, Kelly Payette, Marina Fernandez Garcia, Roxane Licandro, Georg Langs, Jordina Aviles Verdera, Jana Hutter, Bjoern Menze, Gregor Kasprian, Meritxell Bach Cuadra, Andras Jakab

Main category: eess.IV

TL;DR: A novel data-driven sampling strategy to improve fetal brain MRI segmentation across diverse domain shifts, particularly for pathological cases with anatomical abnormalities.

DetailsMotivation: Deep learning segmentation of fetal brain MRI suffers from domain shift issues, especially with pathological cases showing anatomical abnormalities. Current methods fail when applied to subjects deviating from training distribution.

Method: Developed a data-driven train-time sampling strategy that exploits training dataset diversity to enhance domain generalizability. Adapted this sampler with existing data augmentation techniques to the SynthSeg framework, which uses domain randomization to generate diverse training data.
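A data-driven sampler of this kind can be realized by over-weighting subjects that sit in sparse regions of some descriptor space; a sketch using a kernel-density proxy is below. The paper's actual diversity criterion is not specified in this summary, so this is illustrative only.

```python
import torch
from torch.utils.data import WeightedRandomSampler

def diversity_sampler(descriptors, n_draws):
    """Sketch of a diversity-driven train-time sampler: subjects in sparse
    regions of a descriptor space (e.g., shape features) are drawn more often,
    so rare anatomies such as severe pathologies are not swamped by typical
    ones."""
    feats = torch.as_tensor(descriptors, dtype=torch.float32)
    density = torch.exp(-torch.cdist(feats, feats)).sum(dim=1)   # KDE proxy
    weights = 1.0 / density                                      # rarer -> heavier
    return WeightedRandomSampler(weights, num_samples=n_draws, replacement=True)
```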

Result: Significant improvements in segmentation quality on testing subjects with intense anatomical abnormalities (p < 1e-4), though with slight performance decrease in cases with fewer abnormalities.

Conclusion: The novel sampling strategy effectively improves segmentation for pathological fetal brain MRIs and provides foundation for developing data-driven sampling approaches for other training pipelines.

Abstract: Magnetic resonance imaging (MRI) has played a crucial role in fetal neurodevelopmental research. Structural annotations of MR images are an important step for quantitative analysis of the developing human brain, with Deep Learning providing an automated alternative for this otherwise tedious manual process. However, segmentation performances of Convolutional Neural Networks often suffer from domain shift, where the network fails when applied to subjects that deviate from the distribution on which it was trained. In this work, we aim to train networks capable of automatically segmenting fetal brain MRIs with a wide range of domain shifts pertaining to differences in subject physiology and acquisition environments, in particular shape-based differences commonly observed in pathological cases. We introduce a novel data-driven train-time sampling strategy that seeks to fully exploit the diversity of a given training dataset to enhance the domain generalizability of the trained networks. We adapted our sampler, together with other existing data augmentation techniques, to the SynthSeg framework, a generator that utilizes domain randomization to generate diverse training data. We ran thorough experiments and ablation studies on a wide range of training/testing data to test the validity of the approaches. Our networks achieved notable improvements in the segmentation quality on testing subjects with intense anatomical abnormalities (p < 1e-4), though at the cost of a slight decrease in performance in cases with fewer abnormalities. Our work also lays the foundation for future works on creating and adapting data-driven sampling strategies for other training pipelines.

[366] Beyond the LUMIR challenge: The pathway to foundational registration models

Junyu Chen, Shuwen Wei, Joel Honkamaa, Pekka Marttinen, Hang Zhang, Min Liu, Yichao Zhou, Zuopeng Tan, Zhuoyuan Wang, Yi Wang, Hongchao Zhou, Shunbo Hu, Yi Zhang, Qian Tao, Lukas Förner, Thomas Wendler, Bailiang Jian, Benedikt Wiestler, Tim Hable, Jin Kim, Dan Ruan, Frederic Madesta, Thilo Sentker, Wiebke Heyer, Lianrui Zuo, Yuwei Dai, Jing Wu, Jerry L. Prince, Harrison Bai, Yong Du, Yihao Liu, Alessa Hering, Reuben Dorent, Lasse Hansen, Mattias P. Heinrich, Aaron Carass

Main category: eess.IV

TL;DR: LUMIR challenge introduces large-scale unsupervised brain MRI registration benchmark with 4,014 unlabeled T1-weighted MRIs, showing deep learning methods outperform optimization-based approaches and demonstrate robustness across domains.

DetailsMotivation: Previous medical image registration challenges relied on anatomical label maps, limiting development of unsupervised methods. There's a need for large-scale benchmarks that encourage biologically plausible deformation modeling through self-supervision without labeled data.

Method: LUMIR provides 4,014 unlabeled T1-weighted MRIs for training, using self-supervised learning approaches. Evaluation includes 590 in-domain test subjects and extensive zero-shot testing across disease populations, imaging protocols, and species to assess generalization.

Result: Deep learning methods achieved state-of-the-art performance, produced anatomically plausible diffeomorphic deformation fields, outperformed leading optimization-based methods, and remained robust to most domain shifts.

Conclusion: Deep learning has reached maturity in neuroimaging registration and shows potential as foundation models for general-purpose medical image registration, with unsupervised approaches demonstrating strong performance and generalization capabilities.

Abstract: Medical image challenges have played a transformative role in advancing the field, catalyzing innovation and establishing new performance benchmarks. Image registration, a foundational task in neuroimaging, has similarly advanced through the Learn2Reg initiative. Building on this, we introduce the Large-scale Unsupervised Brain MRI Image Registration (LUMIR) challenge, a next-generation benchmark for unsupervised brain MRI registration. Previous challenges relied upon anatomical label maps; LUMIR, in contrast, provides 4,014 unlabeled T1-weighted MRIs for training, encouraging biologically plausible deformation modeling through self-supervision. Evaluation includes 590 in-domain test subjects and extensive zero-shot tasks across disease populations, imaging protocols, and species. Deep learning methods consistently achieved state-of-the-art performance and produced anatomically plausible, diffeomorphic deformation fields. They outperformed several leading optimization-based methods and remained robust to most domain shifts. These findings highlight the growing maturity of deep learning in neuroimaging registration and its potential to serve as a foundation model for general-purpose medical image registration.

[367] SAMRI: Segment Anything Model for MRI

Zhao Wang, Wei Dai, Thuy Thanh Dao, Steffen Bollmann, Hongfu Sun, Craig Engstrom, Shekhar S. Chandra

Main category: eess.IV

TL;DR: SAMRI adapts the Segment Anything Model (SAM) specifically for MRI segmentation using a two-stage fine-tuning approach, achieving state-of-the-art performance with 94% faster training and 96% fewer trainable parameters.

DetailsMotivation: MRI segmentation is crucial for clinical decision-making but remains labor-intensive manually. CNN-based methods often generalize poorly to MRI-specific challenges like variable contrast, intensity inhomogeneity, and different sequences. While transformer-based SAM shows strong generalizability in natural images, existing adaptations treat MRI as just another modality without addressing its unique challenges.

Method: Developed SAMRI, an MRI-specialized SAM trained on 1.1 million labeled MR slices covering whole-body organs and pathologies. Used a two-stage fine-tuning strategy focusing only on the mask decoder, rather than full-model retraining. This approach significantly reduces computational requirements while maintaining performance.
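Decoder-only fine-tuning is straightforward to express with the public segment-anything module layout (`image_encoder`, `prompt_encoder`, `mask_decoder`); the two-stage schedule itself (e.g., which data or loss each stage uses) is not shown.

```python
def setup_decoder_finetune(sam):
    """Sketch of decoder-only fine-tuning: freeze the heavy image encoder and
    the prompt encoder, train only the mask decoder -- consistent with the
    reported ~96% reduction in trainable parameters."""
    for module in (sam.image_encoder, sam.prompt_encoder):
        for p in module.parameters():
            p.requires_grad = False
    # Hand only the decoder parameters to the optimizer.
    return [p for p in sam.mask_decoder.parameters() if p.requires_grad]
```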

Result: Achieved mean Dice score of 0.87 across diverse MRI segmentation tasks, delivering state-of-the-art accuracy across anatomical regions. The method shows robust generalization to unseen structures, especially small clinically important structures. Training time reduced by 94% and trainable parameters by 96% compared to full-model retraining.

Conclusion: SAMRI demonstrates that SAM can be effectively adapted to MRI through targeted fine-tuning, achieving excellent segmentation performance while being computationally efficient. The authors provide a complete training-to-inference pipeline and user-friendly graphical interface for practical deployment in real-world clinical settings.

Abstract: Accurate magnetic resonance imaging (MRI) segmentation is crucial for clinical decision-making, but remains labor-intensive when performed manually. Convolutional neural network (CNN) based methods can be accurate and efficient but often generalize poorly across MRI's variable contrast, intensity inhomogeneity, and sequences. Although the transformer-based Segment Anything Model (SAM) has demonstrated remarkable generalizability in natural images, existing adaptations often treat MRI as just another imaging modality, overlooking these modality-specific challenges. We present SAMRI, an MRI-specialized SAM trained and validated on 1.1 million labeled MR slices spanning whole-body organs and pathologies. We demonstrate that SAM can be effectively adapted to MRI by fine-tuning its mask decoder using a two-stage strategy, reducing training time by 94 percent and trainable parameters by 96 percent compared to full-model retraining. Across diverse MRI segmentation tasks, SAMRI achieves a mean Dice of 0.87, delivering state-of-the-art accuracy across anatomical regions and robust generalization on unseen structures, particularly small clinically important structures. In addition, we provide a complete training-to-inference pipeline and a user-friendly local graphical interface that enables interactive application of pretrained SAMRI models on standard machines, facilitating practical deployment for real-world MRI segmentation.

[368] Fine-tuned Transformer Models for Breast Cancer Detection and Classification

Showkat Osman, Md. Tajwar Munim Turzo, Maher Ali Rusho, Md. Makid Haider, Sazzadul Islam Sajin, Ayatullah Hasnat Behesti, Ahmed Faizul Haque Dhrubo, Md. Khurshid Jahan, Mohammad Abdul Qayum

Main category: eess.IV

TL;DR: Transformer-based AI models, particularly ViT, achieve 99.32% accuracy in breast cancer detection from mammograms, outperforming traditional methods and CNNs by better capturing global patterns and subtle features.

DetailsMotivation: Breast cancer remains a leading cause of cancer deaths worldwide, highlighting the critical need for early detection. Traditional diagnostic methods like mammography, ultrasound, and thermography have limitations in detecting subtle patterns and reducing false positives. While AI and deep learning have revolutionized medical imaging, conventional CNNs struggle with modeling long-range dependencies in images.

Method: The study explores visual transformer models (Swin Tiny, DeiT, BEiT, ViT, and YOLOv8) for breast cancer detection using mammographic image datasets. Data augmentation techniques including resizing, cropping, flipping, and normalization were applied to enhance model performance. The research compares transformer architectures against traditional approaches to evaluate their effectiveness in medical image analysis.
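The augmentation recipe described maps directly onto a standard torchvision pipeline. A sketch assuming single-channel mammograms fine-tuned at the usual ViT-Base 224x224 input; the exact sizes and normalization statistics used in the paper may differ.

```python
from torchvision import transforms

# Sketch of the described augmentation pipeline; sizes and normalization
# statistics are typical ViT/ImageNet defaults, not confirmed values.
train_tf = transforms.Compose([
    transforms.Grayscale(num_output_channels=3),   # mammograms -> 3 channels
    transforms.Resize(256),
    transforms.RandomResizedCrop(224),             # ViT-Base patch16 input
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])
```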

Result: The ViT (Vision Transformer) model achieved the highest accuracy of 99.32% in breast cancer detection, demonstrating superior performance in capturing global patterns and subtle image features compared to other transformer models and traditional methods. The study shows that transformer-based architectures excel at modeling long-range dependencies that CNNs typically struggle with.

Conclusion: Transformer-based AI models, particularly ViT, show significant potential for revolutionizing breast cancer detection by achieving exceptional accuracy and better handling of global image patterns. While promising results were obtained, challenges remain regarding dataset diversity and model optimization, presenting opportunities for future research. This work suggests that transformer architectures could substantially improve early breast cancer diagnosis and patient outcomes.

Abstract: Breast cancer remains the second leading cause of cancer deaths worldwide, which underscores the importance of early detection. Traditional diagnostic methods, such as mammography, ultrasound, and thermography, have limitations in catching subtle patterns and reducing false positives. New technologies such as artificial intelligence (AI) and deep learning have revolutionized medical imaging analysis. Nevertheless, typical architectures such as Convolutional Neural Networks (CNNs) often struggle to model long-range dependencies. This study explores the application of visual transformer models (here: Swin Tiny, DeiT, BEiT, ViT, and YOLOv8) for breast cancer detection on a collection of mammographic image sets. The ViT model reached the highest accuracy of 99.32%, demonstrating its strength in detecting global patterns as well as subtle image features. Data augmentation approaches, such as resizing, cropping, flipping, and normalization, were applied to further improve performance. Despite these encouraging results, issues of dataset diversity and model optimization remain and open new avenues of research. This study points to the clear potential of transformer-based AI models to change how breast cancer is detected and, in turn, to improve patient outcomes.

[369] Learned Hemodynamic Coupling Inference in Resting-State Functional MRI

William Consagra, Eardi Lila

Main category: eess.IV

TL;DR: A method for estimating spatially varying hemodynamic coupling from resting-state fMRI using marginal likelihood approximation with deep neural networks and normalizing flows, enabling scalable cortical surface analysis.

DetailsMotivation: Hemodynamic variability across brain regions and individuals biases fMRI connectivity estimates, and hemodynamic parameters themselves may serve as important biomarkers. Current methods struggle with the blind inverse problem of estimating both unknown neural activity and hemodynamic coupling from resting-state fMRI.

Method: Marginalizes out latent neural signals and uses marginal likelihood inference. Employs deep neural networks with conditional normalizing flows to approximate intractable marginal likelihood. Enforces spatial coherence through cortical surface priors with sparse representations. Quantifies uncertainty via double-bootstrap procedure.
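The core marginalization idea is easiest to see in a linear-Gaussian special case, where the latent neural signal integrates out in closed form; the paper's flow-based approximation targets the general setting where this integral is intractable. A worked sketch, where the HRF, prior covariance K, and noise level are placeholders:

```python
import numpy as np
from scipy.linalg import toeplitz
from scipy.stats import multivariate_normal

def marginal_loglik(y, hrf, K, sigma2):
    """Worked sketch of the marginalization idea in a linear-Gaussian special
    case: with y = H x + noise, H the convolution matrix of an HRF whose shape
    depends on the coupling parameters, and a Gaussian prior x ~ N(0, K) on
    neural activity, the latent x integrates out in closed form, so the
    coupling parameters can be scored by the resulting marginal likelihood."""
    n = len(y)
    col = np.zeros(n)
    col[: len(hrf)] = hrf
    H = toeplitz(col, np.zeros(n))              # causal convolution matrix
    cov = H @ K @ H.T + sigma2 * np.eye(n)      # marginal covariance of y
    return multivariate_normal(mean=np.zeros(n), cov=cov).logpdf(y)
```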

Result: Extensive validation using synthetic and real fMRI datasets shows clear improvements over current methods for hemodynamic estimation and downstream connectivity analysis.

Conclusion: The proposed approach provides a scalable, high-resolution method for inferring hemodynamic coupling on the cortical surface from resting-state fMRI, addressing the challenging blind inverse problem while quantifying uncertainty.

Abstract: Functional magnetic resonance imaging (fMRI) provides an indirect measurement of neuronal activity via hemodynamic responses that vary across brain regions and individuals. Ignoring this hemodynamic variability can bias downstream connectivity estimates. Furthermore, the hemodynamic parameters themselves may serve as important imaging biomarkers. Estimating spatially varying hemodynamics from resting-state fMRI (rsfMRI) is therefore an important but challenging blind inverse problem, since both the latent neural activity and the hemodynamic coupling are unknown. In this work, we propose a methodology for inferring hemodynamic coupling on the cortical surface from rsfMRI. Our approach avoids the highly unstable joint recovery of neural activity and hemodynamics by marginalizing out the latent neural signal and basing inference on the resulting marginal likelihood. To enable scalable, high-resolution estimation, we employ a deep neural network combined with conditional normalizing flows to accurately approximate this intractable marginal likelihood, while enforcing spatial coherence through priors defined on the cortical surface that admit sparse representations. Uncertainty in the hemodynamic estimates is quantified via a double-bootstrap procedure. The proposed approach is extensively validated using synthetic data and real fMRI datasets, demonstrating clear improvements over current methods for hemodynamic estimation and downstream connectivity analysis.
