Daily arXiv Papers - 2026-01-15

AI-enhanced summaries of research papers from arXiv

Today’s Research Highlights

AI-enhanced summaries of the latest research papers from arXiv.

Table of Contents

cs.CL

[1] DeliberationBench: When Do More Voices Hurt? A Controlled Study of Multi-LLM Deliberation Protocols

Vaarunay Kaushal, Taranveer Singh

Main category: cs.CL

TL;DR: Multi-LLM deliberation systems underperform simple “best-single” baseline by 6x while costing 1.5-2.5x more compute, challenging assumptions that complexity improves quality.

Motivation: To critically evaluate whether multi-agent LLM deliberation systems provide practical value over simpler methods, as their effectiveness remains under-scrutinized despite significant attention.

Method: Created DELIBERATIONBENCH benchmark to evaluate three deliberation protocols against a strong baseline of selecting the best response from a pool of model outputs. Conducted 810 total evaluations across 270 questions with three independent seeds.
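
A minimal sketch of how such a pool-then-select baseline and pairwise win-rate evaluation could be wired up; the callables (the model generation functions, judge_score, judge_prefers) are hypothetical stand-ins, not the paper's code.

```python
# Hypothetical sketch of the "best-single" baseline: pool one response per
# model, keep the judge-preferred one, and score protocols by pairwise win
# rate over questions and seeds (270 questions x 3 seeds = 810 evaluations).
import random
from statistics import mean

def best_single(question, models, judge_score):
    """Select the pooled response the judge scores highest."""
    pool = [m(question) for m in models]                 # one response per model
    return max(pool, key=lambda r: judge_score(question, r))

def win_rate(protocol_a, protocol_b, questions, judge_prefers, seeds=3):
    """Fraction of (question, seed) pairs where protocol A beats B."""
    wins = []
    for seed in range(seeds):
        random.seed(seed)
        for q in questions:
            wins.append(judge_prefers(q, protocol_a(q), protocol_b(q)))
    return mean(wins)
```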

Result: Best-single baseline achieved 82.5% ± 3.3% win rate, dramatically outperforming the best deliberation protocol (13.8% ± 2.6%), a 6.0x performance gap that is statistically significant (p < 0.01). Deliberation protocols also cost 1.5-2.5x more computational resources.

Conclusion: Complex multi-LLM deliberation systems do not enhance quality over simpler methods, challenging assumptions that complexity improves performance in multi-agent systems. The findings suggest simpler approaches may be more effective and efficient.

Abstract: Multi-agent systems where Large Language Models (LLMs) deliberate to form consensus have gained significant attention, yet their practical value over simpler methods remains under-scrutinized. We introduce DELIBERATIONBENCH, a controlled benchmark evaluating three deliberation protocols against a strong baseline of selecting the best response from a pool of model outputs. Across 270 questions and three independent seeds (810 total evaluations), we find a striking negative result: the best-single baseline achieves an 82.5% ± 3.3% win rate, dramatically outperforming the best deliberation protocol (13.8% ± 2.6%). This 6.0x performance gap is statistically significant (p < 0.01) and comes at 1.5-2.5x higher computational cost. Our findings challenge assumptions that complexity enhances quality in multi-LLM systems.

[2] A Review: PTSD in Pre-Existing Medical Condition on Social Media

Zaber Al Hassan Ayon, Nur Hafieza Ismail, Nur Shazwani Kamarudin

Main category: cs.CL

TL;DR: Review examines PTSD in chronic illness patients using social media analysis, finding NLP/ML can detect PTSD with 74-90% accuracy and online communities provide valuable support.

Motivation: PTSD is complex in patients with pre-existing chronic illnesses, and social media offers unique insights into how these individuals experience and manage their conditions.

Method: Systematic literature review (2008-2024) analyzing social media data from platforms like X/Twitter and Facebook using natural language processing and machine learning techniques.

Result: Social media reveals unique PTSD challenges in chronic illness patients; NLP/ML achieves 74-90% accuracy in identifying PTSD cases; online support communities shape coping strategies and enable early interventions.

Conclusion: PTSD research must consider pre-existing medical conditions; social media serves as valuable monitoring/support tool; future work should develop targeted interventions for vulnerable groups.

Abstract: Post-Traumatic Stress Disorder (PTSD) is a multifaceted mental health condition, particularly challenging for individuals with pre-existing medical conditions. This review critically examines the intersection of PTSD and chronic illnesses as expressed on social media platforms. By systematically analyzing literature from 2008 to 2024, the study explores how PTSD manifests and is managed in individuals with chronic conditions such as cancer, heart disease, and autoimmune disorders, with a focus on online expressions on platforms like X (formerly known as Twitter) and Facebook. Findings demonstrate that social media data offers valuable insights into the unique challenges faced by individuals with both PTSD and chronic illnesses. Specifically, natural language processing (NLP) and machine learning (ML) techniques can identify potential PTSD cases among these populations, achieving accuracy rates between 74% and 90%. Furthermore, the role of online support communities in shaping coping strategies and facilitating early interventions is highlighted. This review underscores the necessity of incorporating considerations of pre-existing medical conditions in PTSD research and treatment, emphasizing social media’s potential as a monitoring and support tool for vulnerable groups. Future research directions and clinical implications are also discussed, with an emphasis on developing targeted interventions.

[3] From Adversarial Poetry to Adversarial Tales: An Interpretability Research Agenda

Piercosma Bisconti, Marcello Galisai, Matteo Prandi, Federico Pierucci, Olga Sorokoletova, Francesco Giarrusso, Vincenzo Suriani, Marcantonio Brancale, Daniele Nardi

Main category: cs.CL

TL;DR: Adversarial Tales is a jailbreak technique that embeds harmful content in cyberpunk narratives using Propp’s folktale analysis, achieving 71.3% success rate across 26 frontier LLMs, revealing structural jailbreaks as a broad vulnerability class.

Motivation: Current LLM safety mechanisms are vulnerable to attacks that reframe harmful requests through culturally coded structures. The authors aim to demonstrate that structurally-grounded jailbreaks constitute a broad vulnerability class rather than isolated techniques, and that the space of culturally coded frames for mediating harmful intent is vast and likely inexhaustible by pattern-matching defenses alone.

Method: Introduces Adversarial Tales, a jailbreak technique that embeds harmful content within cyberpunk narratives and prompts models to perform functional analysis inspired by Vladimir Propp’s morphology of folktales. By casting the task as structural decomposition, the attack induces models to reconstruct harmful procedures as legitimate narrative interpretation.

Result: Across 26 frontier models from nine providers, the attack achieved an average success rate of 71.3%, with no model family proving reliably robust. This builds on prior work on Adversarial Poetry, suggesting structurally-grounded jailbreaks are a broad vulnerability class.

Conclusion: The findings indicate that structurally-grounded jailbreaks represent a significant vulnerability in LLM safety mechanisms. The authors propose a mechanistic interpretability research agenda to investigate how narrative cues reshape model representations and whether models can learn to recognize harmful intent independently of surface form, suggesting that understanding why these attacks succeed is essential for developing more robust defenses.

Abstract: Safety mechanisms in LLMs remain vulnerable to attacks that reframe harmful requests through culturally coded structures. We introduce Adversarial Tales, a jailbreak technique that embeds harmful content within cyberpunk narratives and prompts models to perform functional analysis inspired by Vladimir Propp’s morphology of folktales. By casting the task as structural decomposition, the attack induces models to reconstruct harmful procedures as legitimate narrative interpretation. Across 26 frontier models from nine providers, we observe an average attack success rate of 71.3%, with no model family proving reliably robust. Together with our prior work on Adversarial Poetry, these findings suggest that structurally-grounded jailbreaks constitute a broad vulnerability class rather than isolated techniques. The space of culturally coded frames that can mediate harmful intent is vast, likely inexhaustible by pattern-matching defenses alone. Understanding why these attacks succeed is therefore essential: we outline a mechanistic interpretability research agenda to investigate how narrative cues reshape model representations and whether models can learn to recognize harmful intent independently of surface form.

[4] Companion Agents: A Table-Information Mining Paradigm for Text-to-SQL

Jiahui Chen, Lei Fu, Jian Cui, Yu Lei, Zhenning Dong

Main category: cs.CL

TL;DR: Companion Agents (CA) is a new Text-to-SQL paradigm that uses database-side agents to proactively mine hidden database information, improving accuracy when human-curated evidence is missing or incomplete.

Motivation: Current Text-to-SQL benchmarks assume complete database annotations and external knowledge, which doesn't reflect real industrial settings where annotations are often missing, incomplete, or erroneous. This limits practical applicability of state-of-the-art systems.

Method: Proposes Companion Agents (CA) - a database-centric approach where agents accompany database schemas to proactively mine and consolidate hidden inter-table relations, value-domain distributions, statistical regularities, and latent semantic cues before query generation. This “caches” query-relevant knowledge on the database side for selective activation at inference time.
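
Below is a minimal sketch, assuming a SQLite database, of the kind of offline database-side profiling CA's agents perform: value domains and simple statistics per column, cached as candidate "evidence" for query time. The evidence format is illustrative, not the paper's.

```python
# Offline mining sketch: profile each column's value domain and cardinality
# and cache the result. The cached evidence can then be selectively injected
# into the Text-to-SQL prompt when a question mentions matching values.
import json
import sqlite3

def mine_evidence(db_path, sample_limit=1000):
    con = sqlite3.connect(db_path)
    cur = con.cursor()
    evidence = {}
    tables = [r[0] for r in cur.execute(
        "SELECT name FROM sqlite_master WHERE type='table'")]
    for t in tables:
        cols = [c[1] for c in cur.execute(f"PRAGMA table_info({t})")]
        for c in cols:
            rows = cur.execute(
                f"SELECT DISTINCT {c} FROM {t} LIMIT {sample_limit}").fetchall()
            values = [r[0] for r in rows if r[0] is not None]
            evidence[f"{t}.{c}"] = {
                "distinct_sample": values[:20],   # value-domain cues
                "cardinality": len(values),       # statistical regularity
            }
    con.close()
    return evidence

# "example.db" is a hypothetical database path for illustration.
cache = mine_evidence("example.db")
print(json.dumps(list(cache.items())[:2], indent=2, default=str))
```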

Result: On BIRD benchmark under fully missing evidence setting: CA recovers +4.49 / +4.37 / +14.13 execution accuracy points on RSL-SQL / CHESS / DAIL-SQL respectively, with larger gains on Challenging subset (+9.65 / +7.58 / +16.71).

Conclusion: CA’s automatic database-side mining and evidence construction provides a practical path toward industrial-grade Text-to-SQL deployment without reliance on human-curated evidence, bridging the gap between academic benchmarks and real-world applications.

Abstract: Large-scale Text-to-SQL benchmarks such as BIRD typically assume complete and accurate database annotations as well as readily available external knowledge, which fails to reflect common industrial settings where annotations are missing, incomplete, or erroneous. This mismatch substantially limits the real-world applicability of state-of-the-art (SOTA) Text-to-SQL systems. To bridge this gap, we explore a database-centric approach that leverages intrinsic, fine-grained information residing in relational databases to construct missing evidence and improve Text-to-SQL accuracy under annotation-scarce conditions. Our key hypothesis is that when a query requires multi-step reasoning over extensive table information, existing methods often struggle to reliably identify and utilize the truly relevant knowledge. We therefore propose to “cache” query-relevant knowledge on the database side in advance, so that it can be selectively activated at inference time. Based on this idea, we introduce Companion Agents (CA), a new Text-to-SQL paradigm that incorporates a group of agents accompanying database schemas to proactively mine and consolidate hidden inter-table relations, value-domain distributions, statistical regularities, and latent semantic cues before query generation. Experiments on BIRD under the fully missing evidence setting show that CA recovers +4.49 / +4.37 / +14.13 execution accuracy points on RSL-SQL / CHESS / DAIL-SQL, respectively, with larger gains on the Challenging subset +9.65 / +7.58 / +16.71. These improvements stem from CA’s automatic database-side mining and evidence construction, suggesting a practical path toward industrial-grade Text-to-SQL deployment without reliance on human-curated evidence.

[5] Recursive Knowledge Synthesis for Multi-LLM Systems: Stability Analysis and Tri-Agent Audit Framework

Toshiyuki Shigemura

Main category: cs.CL

TL;DR: A tri-agent LLM framework achieves stable recursive knowledge synthesis through semantic generation, consistency checking, and transparency auditing agents, with empirical validation showing convergence in 89% of trials.

Motivation: To address stability and explainability challenges in multi-model large language systems by creating a coordinated framework that prevents single-model limitations and enables transparent, reliable knowledge synthesis.

Method: Tri-agent cross-validation framework with three heterogeneous LLMs: semantic generator, analytical consistency checker, and transparency auditor, operating in recursive interaction cycles to induce Recursive Knowledge Synthesis (RKS).
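
The recursive cycle can be pictured as follows. Here generate, check, audit, and similarity are placeholder callables for the three LLMs plus an embedding similarity, and the thresholds are assumptions rather than the paper's settings.

```python
# Illustrative tri-agent loop: a generator drafts, a consistency checker
# critiques, a transparency auditor scores, and iteration stops when
# successive drafts stop changing (the contraction-style convergence the
# paper analyzes via fixed-point theory).
def recursive_synthesis(question, generate, check, audit,
                        similarity, max_rounds=10, tol=0.95):
    draft = generate(question, feedback=None)
    transparency = 0.0
    for _ in range(max_rounds):
        feedback = check(question, draft)        # analytical consistency agent
        transparency = audit(question, draft)    # transparency score in [0, 1]
        revised = generate(question, feedback=feedback)
        if similarity(draft, revised) >= tol and transparency >= 0.8:
            return revised, transparency         # converged to a fixed point
        draft = revised
    return draft, transparency                   # no fixed point reached
```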

Result: Mean Reflex Reliability Score of 0.78±0.06, Transparency Score ≥0.8 in 68% of trials, 89% convergence rate across 47 controlled trials using public-access LLMs, supporting theoretical predictions about transparency auditing as contraction operator.

Conclusion: The tri-agent framework enables stable recursive knowledge synthesis in realistic public deployments, providing empirical evidence for safety-preserving, human-supervised multi-LLM architectures with coordinated reasoning across heterogeneous models.

Abstract: This paper presents a tri-agent cross-validation framework for analyzing stability and explainability in multi-model large language systems. The architecture integrates three heterogeneous LLMs (used for semantic generation, analytical consistency checking, and transparency auditing) into a recursive interaction cycle. This design induces Recursive Knowledge Synthesis (RKS), where intermediate representations are continuously refined through mutually constraining transformations irreducible to single-model behavior. Across 47 controlled trials using public-access LLM deployments (October 2025), we evaluated system stability via four metrics: Reflex Reliability Score (RRS), Transparency Score (TS), Deviation Detection Rate (DDR), and Correction Success Rate (CSR). The system achieved mean RRS = 0.78 ± 0.06 and maintained TS ≥ 0.8 in about 68% of trials. Approximately 89% of trials converged, supporting the theoretical prediction that transparency auditing acts as a contraction operator within the composite validation mapping. The contributions are threefold: (1) a structured tri-agent framework for coordinated reasoning across heterogeneous LLMs, (2) a formal RKS model grounded in fixed-point theory, and (3) empirical evaluation of inter-model stability under realistic, non-API public-access conditions. These results provide initial empirical evidence that a safety-preserving, human-supervised multi-LLM architecture can achieve stable recursive knowledge synthesis in realistic, publicly deployed environments.

[6] Consistency-Aware Editing for Entity-level Unlearning in Language Models

Xiaoqi Han, Víctor Gutiérrez-Basulto, Ru Li, Xiaoli Li, Jiye Liang, Jeff Z. Pan

Main category: cs.CL

TL;DR: CAE framework enables efficient entity-level unlearning in LLMs using consistency-aware editing with diverse prompts, outperforming traditional methods while providing insights into knowledge representation.

Motivation: LLMs risk retaining sensitive/copyrighted information; existing unlearning methods are computationally expensive or brittle, while editing techniques only handle instance-level updates, not complete entity removal.

Method: Consistency-Aware Editing (CAE) framework aggregates diverse prompts (attributes, relations, adversarial paraphrases) and learns low-rank updates with consistency regularization to align editing directions across prompts.
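
A hedged PyTorch sketch of the general shape of such an objective: a shared low-rank update trained with a forgetting loss plus a term that aligns per-prompt editing directions. The zero target and loss weights are assumptions; the paper's exact formulation may differ.

```python
# Low-rank edit delta_W = A @ B on one weight matrix, trained so every
# entity prompt is pushed toward a "forgotten" output, with a consistency
# regularizer that aligns the update's per-prompt editing directions.
import torch
import torch.nn.functional as F

d, rank = 768, 8
W = torch.randn(d, d)                           # frozen weight being edited
A = torch.nn.Parameter(0.01 * torch.randn(d, rank))
B = torch.nn.Parameter(0.01 * torch.randn(rank, d))
opt = torch.optim.Adam([A, B], lr=1e-3)

def forward(h):                                 # h: (n_prompts, d) hidden states
    return h @ (W + A @ B).T                    # edited layer output

prompts_h = torch.randn(16, d)                  # hidden states of entity prompts
target = torch.zeros(16, d)                     # crude "forgotten" target (assumption)

for step in range(100):
    forget_loss = F.mse_loss(forward(prompts_h), target)
    directions = prompts_h @ (A @ B).T          # how the edit moves each prompt
    sims = F.cosine_similarity(directions.unsqueeze(1),
                               directions.unsqueeze(0), dim=-1)
    consistency = 1 - sims.mean()               # align directions across prompts
    loss = forget_loss + 0.1 * consistency
    opt.zero_grad(); loss.backward(); opt.step()
```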

Result: CAE significantly improves forgetting accuracy and robustness on RWKU and ToFU benchmarks, enables scalable entity removal with only tens of prompts, and provides insights into knowledge representation/deletion.

Conclusion: CAE offers an effective, efficient approach for entity-level unlearning that balances comprehensive forgetting with minimal interference, advancing practical applications for knowledge removal in LLMs.

Abstract: Large language models (LLMs) risk retaining sensitive, copyrighted, or harmful information from their training data. Entity-level unlearning addresses this issue by removing all knowledge of a specific entity while preserving the model’s overall capabilities. Existing approaches typically rely on full-model fine-tuning or prompt-based interventions, which can be computationally expensive or brittle when handling paraphrased queries. Recently, model editing has emerged as an efficient alternative for updating knowledge in LLMs, offering a promising direction for unlearning. However, existing editing techniques are typically designed for instance-level updates, modifying responses to specific attributes of an entity rather than eliminating all knowledge associated with the entity. In this paper, we investigate how editing techniques can be adapted for effective and efficient entity-level unlearning. To this end, we introduce a novel consistency-aware editing (CAE) framework. CAE aggregates a diverse set of prompts related to a target entity, including its attributes, relations, and adversarial paraphrases. It then jointly learns a low-rank update guided by a consistency regularizer that aligns the editing directions across prompts. This promotes robust and comprehensive forgetting while minimizing interference with unrelated knowledge. We further examine where different entities are stored within the model and how many diverse prompts are needed for successful unlearning. We evaluate CAE on two challenging benchmarks, RWKU and ToFU, and demonstrate that it (i) provides insights into how entity-level knowledge is internally represented and deleted in LLMs, (ii) significantly improves forgetting accuracy and robustness over traditional unlearning and editing baselines, and (iii) enables scalable entity removal using only tens of carefully selected prompts.

[7] Memory-R1: Enhancing Large Language Model Agents to Manage and Utilize Memories via Reinforcement Learning

Sikuan Yan, Xiufeng Yang, Zuchao Huang, Ercong Nie, Zifeng Ding, Zonggen Li, Xiaowen Ma, Jinhe Bi, Kristian Kersting, Jeff Z. Pan, Hinrich Schütze, Volker Tresp, Yunpu Ma

Main category: cs.CL

TL;DR: Memory-R1 is an RL framework that equips LLMs with active memory management using two specialized agents for structured memory operations and reasoning, achieving strong performance with minimal training data.

Motivation: LLMs are fundamentally stateless with limited context windows, hindering long-horizon reasoning. Existing memory augmentation approaches are static and heuristic-driven, lacking learned mechanisms for deciding what to store, update, or retrieve.

Method: Reinforcement learning framework with two specialized agents: Memory Manager that learns structured operations (ADD, UPDATE, DELETE, NOOP) and Answer Agent that pre-selects and reasons over relevant memory entries. Both agents are fine-tuned with outcome-driven RL (PPO and GRPO).
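
A toy illustration of the structured operation space the Memory Manager acts over; in Memory-R1 the operations are emitted by an RL-trained LLM rather than scripted as below.

```python
# Minimal memory bank exposing the four operations (ADD, UPDATE, DELETE,
# NOOP). The dict-based store and operation format are illustrative.
class MemoryBank:
    def __init__(self):
        self.entries = {}                  # id -> memory text
        self.next_id = 0

    def apply(self, op, entry_id=None, text=None):
        if op == "ADD":
            self.entries[self.next_id] = text
            self.next_id += 1
        elif op == "UPDATE":
            self.entries[entry_id] = text
        elif op == "DELETE":
            self.entries.pop(entry_id, None)
        elif op == "NOOP":
            pass                           # manager decided no change is needed

bank = MemoryBank()
bank.apply("ADD", text="User's sister Alice moved to Berlin.")
bank.apply("UPDATE", entry_id=0, text="Alice moved from Berlin to Paris.")
# The Answer Agent would then pre-select relevant entries from
# bank.entries before reasoning over them to answer a question.
```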

Result: With only 152 training QA pairs, Memory-R1 outperforms strong baselines and generalizes across diverse question types, three benchmarks (LoCoMo, MSC, LongMemEval), and multiple model scales (3B-14B).

Conclusion: The RL-based approach enables adaptive memory management with minimal supervision, demonstrating that learned memory operations can significantly enhance LLMs’ long-horizon reasoning capabilities.

Abstract: Large Language Models (LLMs) have demonstrated impressive capabilities across a wide range of NLP tasks, but they remain fundamentally stateless, constrained by limited context windows that hinder long-horizon reasoning. Recent efforts to address this limitation often augment LLMs with an external memory bank, yet most existing pipelines are static and heuristic-driven, lacking a learned mechanism for deciding what to store, update, or retrieve. We present Memory-R1, a reinforcement learning (RL) framework that equips LLMs with the ability to actively manage and utilize external memory through two specialized agents: a Memory Manager that learns structured operations, including ADD, UPDATE, DELETE, and NOOP; and an Answer Agent that pre-selects and reasons over relevant entries. Both agents are fine-tuned with outcome-driven RL (PPO and GRPO), enabling adaptive memory management with minimal supervision. With only 152 training QA pairs, Memory-R1 outperforms strong baselines and generalizes across diverse question types, three benchmarks (LoCoMo, MSC, LongMemEval), and multiple model scales (3B-14B).

[8] Triples and Knowledge-Infused Embeddings for Clustering and Classification of Scientific Documents

Mihael Arcan

Main category: cs.CL

TL;DR: Using structured knowledge triples with scientific abstracts improves classification accuracy but not clustering coherence.

Motivation: Scientific literature is growing rapidly in volume and complexity, requiring better methods for organizing and understanding research documents. Structured knowledge (subject-predicate-object triples) may enhance document organization.

Method: Proposed modular pipeline combining unsupervised clustering and supervised classification using multiple document representations: raw abstracts, extracted triples, and hybrid formats. Used arXiv corpus, extracted relational triples, created four text representations, embedded with four transformer models (MiniLM, MPNet, SciBERT, SPECTER), evaluated with KMeans, GMM, HDBSCAN for clustering, and fine-tuned classification models for arXiv subject prediction.
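
A condensed sketch of the hybrid-representation step, assuming triples have already been extracted. The "[SEP]"-joined format and the example data are illustrative, while SentenceTransformer (MiniLM) and KMeans mirror tools named in the paper.

```python
# Build hybrid "abstract + flattened triples" texts, embed, and cluster.
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans

abstracts = ["We propose a transformer for protein folding...",
             "A survey of reinforcement learning for robotics..."]
triples = [[("transformer", "applied_to", "protein folding")],
           [("reinforcement learning", "used_in", "robotics")]]

def hybrid(abstract, trips):
    flat = " ; ".join(f"{s} {p} {o}" for s, p, o in trips)
    return f"{abstract} [SEP] {flat}"

texts = [hybrid(a, t) for a, t in zip(abstracts, triples)]
model = SentenceTransformer("all-MiniLM-L6-v2")
emb = model.encode(texts)
labels = KMeans(n_clusters=2, n_init=10).fit_predict(emb)
```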

Result: Full abstract text yields most coherent clusters. Hybrid representations with triples improve classification performance (up to 92.6% accuracy, 0.925 macro-F1). Lightweight sentence encoders (MiniLM, MPNet) outperform domain-specific models in clustering, while SciBERT excels in structured-input classification.

Conclusion: Combining unstructured text with structured knowledge offers complementary benefits for semantic organization of scientific documents, highlighting the value of knowledge-infused representations.

Abstract: The increasing volume and complexity of scientific literature demand robust methods for organizing and understanding research documents. In this study, we explore how structured knowledge, specifically, subject-predicate-object triples, can enhance the clustering and classification of scientific papers. We propose a modular pipeline that combines unsupervised clustering and supervised classification over multiple document representations: raw abstracts, extracted triples, and hybrid formats that integrate both. Using a filtered arXiv corpus, we extract relational triples from abstracts and construct four text representations, which we embed using four state-of-the-art transformer models: MiniLM, MPNet, SciBERT, and SPECTER. We evaluate the resulting embeddings with KMeans, GMM, and HDBSCAN for unsupervised clustering, and fine-tune classification models for arXiv subject prediction. Our results show that full abstract text yields the most coherent clusters, but that hybrid representations incorporating triples consistently improve classification performance, reaching up to 92.6% accuracy and 0.925 macro-F1. We also find that lightweight sentence encoders (MiniLM, MPNet) outperform domain-specific models (SciBERT, SPECTER) in clustering, while SciBERT excels in structured-input classification. These findings highlight the complementary benefits of combining unstructured text with structured knowledge, offering new insights into knowledge-infused representations for semantic organization of scientific documents.

[9] Resisting Correction: How RLHF Makes Language Models Ignore External Safety Signals in Natural Conversation

Felipe Biava Cataneo

Main category: cs.CL

TL;DR: Instruction-tuned language models ignore external confidence signals in natural conversation despite perfect compliance under explicit commands, revealing a safety-critical failure mode where RLHF prioritizes fluency over calibration.

Motivation: To test whether instruction-tuned language models preserve controllability across different interaction modes when incorporating externally provided confidence information, which is crucial for safety architectures that rely on external monitors.

Method: Causal intervention study using Llama-3.2-3B on GSM8K, injecting explicit external confidence signals and measuring model compliance under multiple prompt strategies (explicit commands vs. natural conversational queries).
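
A sketch of the intervention loop under assumed prompt templates: inject an external confidence value, parse what the model verbalizes back, and measure compliance as Spearman's rho between injected and reported confidence.

```python
# Two illustrative prompt conditions (explicit command vs. natural query);
# compliance is the rank correlation between injected and reported values.
import re
from scipy.stats import spearmanr

EXPLICIT = ("An external verifier rates this answer {c}% reliable. "
            "You must report exactly this confidence. State your confidence.")
NATURAL = ("By the way, a checker thought this answer is about {c}% reliable. "
           "How confident are you in it?")

def measure_compliance(model, question, template, signals):
    reported = []
    for c in signals:
        reply = model(question + "\n" + template.format(c=c))
        match = re.search(r"(\d{1,3})\s*%", reply)
        reported.append(int(match.group(1)) if match else 50)
    rho, _ = spearmanr(signals, reported)
    return rho   # ~1.0 = full compliance, ~0 = signal ignored
```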

Result: Base models show near-perfect controllability (Spearman rho ~1.0), while instruction-tuned models exhibit context dependence: full compliance with explicit commands (bias ~0%, rho=0.93) but systematic ignoring of same signals in natural conversation (bias +40%, rho=0.04). Internal token-level confidence is uninformative (r=0.035).

Conclusion: This is not a capability failure but an emergent property of RLHF optimization that prioritizes conversational fluency over external calibration cues in natural dialogue, creating a deployment-critical failure mode where safety corrections are least effective in the interaction styles users expect.

Abstract: Safety architectures for language models increasingly rely on external monitors to detect errors and inject corrective signals at inference time. For such systems to function in interactive settings, models must be able to incorporate externally provided confidence information into their verbal responses. In this work, we test whether instruction-tuned language models preserve this controllability across different interaction modes. Using Llama-3.2-3B on GSM8K, we perform a causal intervention study in which explicit external confidence signals are injected and model compliance is measured under multiple prompt strategies. We find that base models exhibit near-perfect controllability (Spearman rho close to 1.0), while instruction-tuned models display a striking context dependence: they fully comply with external corrections under explicit command prompts (bias approximately 0 percent, rho = 0.93), yet systematically ignore the same signals in natural conversational queries (bias plus 40 percent, rho = 0.04). This behavior is not a capability failure (the model can process the signal) but an emergent property of RLHF optimization that prioritizes conversational fluency over external calibration cues in natural dialogue. We further show that internal token-level confidence in small models is uninformative (r = 0.035), underscoring the necessity of external supervision. Our findings highlight a deployment-critical failure mode: the interaction style users expect is precisely where safety corrections are least effective.

[10] Rubric-Conditioned LLM Grading: Alignment, Uncertainty, and Robustness

Haotian Deng, Chris Farber, Jiyoon Lee, David Tang

Main category: cs.CL

TL;DR: LLMs show promise for automated short-answer grading but struggle with complex rubrics; uncertainty estimation and robustness testing are crucial for reliable deployment.

Motivation: ASAG is challenging due to linguistic variability and the need for nuanced partial credit. LLMs offer potential but require rigorous assessment as rubric-based judges.

Method: Systematically evaluate LLM-judges across three aspects: alignment with expert judgment across rubric complexities, uncertainty-accuracy trade-off via consensus-based deferral, and robustness to perturbations/adversarial attacks using SciEntsBank benchmark and Qwen 2.5-72B.
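
The consensus-based deferral mechanism can be sketched as majority voting over repeated judgments, with low-agreement cases routed to a human; the agreement threshold here is an assumed parameter.

```python
# Sample several independent judgments per answer and defer when the
# judges disagree too much. The judge callable is a placeholder LLM call.
from collections import Counter

def grade_with_deferral(judge, answer, rubric, n_samples=5, min_agree=0.8):
    votes = [judge(answer, rubric) for _ in range(n_samples)]  # e.g. 0/1/2 points
    top, count = Counter(votes).most_common(1)[0]
    if count / n_samples >= min_agree:
        return top, "auto"        # confident consensus: keep the grade
    return None, "defer"          # low agreement: send to a human grader
```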

Result: Alignment strong for binary tasks but degrades with increased rubric granularity. Trust Curve shows filtering low-confidence predictions improves accuracy. Model resilient to prompt injection but sensitive to synonym substitutions.

Conclusion: Provides critical insights into rubric-conditioned LLM judges’ capabilities and limitations, highlighting importance of uncertainty estimation and robustness testing for reliable deployment.

Abstract: Automated short-answer grading (ASAG) remains a challenging task due to the linguistic variability of student responses and the need for nuanced, rubric-aligned partial credit. While Large Language Models (LLMs) offer a promising solution, their reliability as automated judges in rubric-based settings requires rigorous assessment. In this paper, we systematically evaluate the performance of LLM-judges for rubric-based short-answer grading. We investigate three key aspects: the alignment of LLM grading with expert judgment across varying rubric complexities, the trade-off between uncertainty and accuracy facilitated by a consensus-based deferral mechanism, and the model’s robustness under random input perturbations and adversarial attacks. Using the SciEntsBank benchmark and Qwen 2.5-72B, we find that alignment is strong for binary tasks but degrades with increased rubric granularity. Our “Trust Curve” analysis demonstrates a clear trade-off where filtering low-confidence predictions improves accuracy on the remaining subset. Additionally, robustness experiments reveal that while the model is resilient to prompt injection, it is sensitive to synonym substitutions. Our work provides critical insights into the capabilities and limitations of rubric-conditioned LLM judges, highlighting the importance of uncertainty estimation and robustness testing for reliable deployment.

[11] Survey of End-to-End Multi-Speaker Automatic Speech Recognition for Monaural Audio

Xinlu He, Jacob Whitehill

Main category: cs.CL

TL;DR: A comprehensive survey of end-to-end neural approaches for multi-speaker ASR, covering architectural paradigms, recent improvements, long-form extensions, benchmark evaluations, and future directions.

Motivation: Monaural multi-speaker ASR faces challenges due to data scarcity and difficulty in recognizing/attributing words to individual speakers, especially in overlapping speech. While recent advances have shifted from cascade to end-to-end systems, the field lacks a comprehensive review of these developments.

Method: The survey provides systematic taxonomy of E2E neural approaches, analyzing: (1) SIMO vs. SISO architectural paradigms for pre-segmented audio, (2) recent architectural/algorithmic improvements, (3) extensions to long-form speech including segmentation strategies and speaker-consistent hypothesis stitching, and (4) comparative evaluation across standard benchmarks.

Result: The paper presents a comprehensive review and comparative analysis of E2E multi-speaker ASR methods, highlighting their distinct characteristics, trade-offs, and performance across benchmarks.

Conclusion: The survey concludes with discussion of open challenges and future research directions toward building robust and scalable multi-speaker ASR systems, emphasizing the need for continued advancement in this important field.

Abstract: Monaural multi-speaker automatic speech recognition (ASR) remains challenging due to data scarcity and the intrinsic difficulty of recognizing and attributing words to individual speakers, particularly in overlapping speech. Recent advances have driven the shift from cascade systems to end-to-end (E2E) architectures, which reduce error propagation and better exploit the synergy between speech content and speaker identity. Despite rapid progress in E2E multi-speaker ASR, the field lacks a comprehensive review of recent developments. This survey provides a systematic taxonomy of E2E neural approaches for multi-speaker ASR, highlighting recent advances and comparative analysis. Specifically, we analyze: (1) architectural paradigms (SIMO vs.~SISO) for pre-segmented audio, analyzing their distinct characteristics and trade-offs; (2) recent architectural and algorithmic improvements based on these two paradigms; (3) extensions to long-form speech, including segmentation strategy and speaker-consistent hypothesis stitching. Further, we (4) evaluate and compare methods across standard benchmarks. We conclude with a discussion of open challenges and future research directions towards building robust and scalable multi-speaker ASR.

[12] Emissions and Performance Trade-off Between Small and Large Language Models

Anandita Garg, Uma Gaba, Deepan Muthirayan, Anish Roy Chowdhury

Main category: cs.CL

TL;DR: Fine-tuned Small Language Models (SLMs) can match LLM performance on many tasks while drastically reducing carbon emissions during inference, offering a sustainable AI alternative.

Motivation: Address growing concerns about the enormous carbon footprint of Large Language Models (LLMs), which consume significant energy during both training and repeated inference operations.

Method: Comparative analysis of performance-emissions trade-off between LLMs and fine-tuned SLMs across selected tasks in Natural Language Processing, Reasoning, and Programming domains.
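
The paper does not name its measurement tooling, but inference-time emissions comparisons of this kind can be obtained with, for example, the open-source codecarbon tracker, as in this sketch; run_inference is a placeholder for either the SLM or the LLM call.

```python
# Measure estimated CO2-equivalent emissions around a batch of inferences.
from codecarbon import EmissionsTracker

def measure(run_inference, prompts):
    tracker = EmissionsTracker(project_name="slm-vs-llm", log_level="error")
    tracker.start()
    outputs = [run_inference(p) for p in prompts]
    kg_co2 = tracker.stop()            # estimated kg CO2-equivalent
    return outputs, kg_co2
```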

Result: In four out of six selected tasks, SLMs maintained comparable performance to LLMs while achieving significant reduction in carbon emissions during inference.

Conclusion: Smaller models are viable for mitigating environmental impact of resource-heavy LLMs, advancing toward sustainable, green AI solutions.

Abstract: The advent of Large Language Models (LLMs) has raised concerns about their enormous carbon footprint, starting with energy-intensive training and continuing through repeated inference. This study investigates the potential of using fine-tuned Small Language Models (SLMs) as a sustainable alternative for predefined tasks. Here, we present a comparative analysis of the performance-emissions trade-off between LLMs and fine-tuned SLMs across selected tasks under Natural Language Processing, Reasoning and Programming. Our results show that in four out of the six selected tasks, SLMs maintained comparable performances for a significant reduction in carbon emissions during inference. Our findings demonstrate the viability of smaller models in mitigating the environmental impact of resource-heavy LLMs, thus advancing towards sustainable, green AI.

[13] Toward Conversational Hungarian Speech Recognition: Introducing the BEA-Large and BEA-Dialogue Datasets

Máté Gedeon, Piroska Zsófia Barta, Péter Mihajlik, Tekla Etelka Gráczi, Anna Kohári, Katalin Mády

Main category: cs.CL

TL;DR: Researchers introduce two new Hungarian speech datasets (BEA-Large and BEA-Dialogue) to address the lack of spontaneous and conversational speech resources, establish ASR and diarization baselines, and provide a framework for similar benchmarks in other languages.

Motivation: Hungarian is underrepresented in ASR research due to limited spontaneous and conversational speech corpora, creating a gap compared to high-resource languages that have extensive datasets.

Method: Created two datasets from previously unprocessed portions of the Hungarian BEA corpus: BEA-Large (255 hours of spontaneous speech from 433 speakers) and BEA-Dialogue (85 hours of spontaneous conversations). Established reproducible baselines using publicly available ASR models and conducted diarization experiments.

Result: Fine-tuned Fast Conformer model achieved 14.18% WER on spontaneous speech and 4.8% on repeated speech. Diarization error rates ranged from 12.46% to 17.40%. Results confirm the difficulty of conversational ASR due to disfluencies, overlaps, and informal speech patterns.

Conclusion: The datasets and baselines advance Hungarian speech technology and provide a methodological framework for developing spontaneous and conversational benchmarks in other underrepresented languages.

Abstract: The advancement of automatic speech recognition (ASR) has been largely enhanced by extensive datasets in high-resource languages, while languages such as Hungarian remain underrepresented due to limited spontaneous and conversational corpora. To address this gap, we introduce two new datasets – BEA-Large and BEA-Dialogue – constructed from the previously unprocessed portions of the Hungarian speech corpus named BEA. BEA-Large extends BEA-Base with 255 hours of spontaneous speech from 433 speakers, enriched with detailed segment-level metadata. BEA-Dialogue, comprising 85 hours of spontaneous conversations, is a Hungarian speech corpus featuring natural dialogues partitioned into speaker-independent subsets, supporting research in conversational ASR and speaker diarization. We establish reproducible baselines on these datasets using publicly available ASR models, with the fine-tuned Fast Conformer model achieving word error rates as low as 14.18% on spontaneous and 4.8% on repeated speech. Diarization experiments yield diarization error rates between 12.46% and 17.40%, providing reference points for future improvements. The results highlight the persistent difficulty of conversational ASR, particularly due to disfluencies, overlaps, and informal speech patterns. By releasing these datasets and baselines, we aim to advance Hungarian speech technology and offer a methodological framework for developing spontaneous and conversational benchmarks in other languages.

[14] Directional Attractors in LLM Reasoning: How Similarity Retrieval Steers Iterative Summarization Based Reasoning

Cagatay Tekin, Charbel Barakat, Luis Joseph Luna Limgenco

Main category: cs.CL

TL;DR: InftyThink with Cross-Chain Memory adds semantic caching of successful reasoning patterns to improve iterative summarization reasoning, showing accuracy gains in structured domains but limitations in heterogeneous domains.

Motivation: Existing iterative summarization frameworks like InftyThink enable long-horizon reasoning but repeatedly regenerate similar reasoning strategies across tasks, lacking efficient reuse of successful patterns.

Method: Extends InftyThink with embedding-based semantic cache of previously successful reasoning patterns (lemmas). At each reasoning step, retrieves most semantically similar stored lemmas to guide inference without expanding context window indiscriminately.
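
A minimal sketch of such a semantic cache: embed successful reasoning summaries ("lemmas") and retrieve the top-k most similar ones to condition each iteration. The embedding function and k are assumptions.

```python
# Cosine-similarity lemma cache; numpy stands in for whatever retriever
# the authors used.
import numpy as np

class LemmaCache:
    def __init__(self, embed):
        self.embed, self.lemmas, self.vecs = embed, [], []

    def add(self, lemma):                      # store a successful pattern
        self.lemmas.append(lemma)
        self.vecs.append(self.embed(lemma))

    def retrieve(self, state, k=3):
        if not self.lemmas:
            return []
        q = self.embed(state)
        M = np.stack(self.vecs)
        sims = M @ q / (np.linalg.norm(M, axis=1) * np.linalg.norm(q) + 1e-9)
        return [self.lemmas[i] for i in np.argsort(-sims)[:k]]

# Each reasoning iteration would then condition on:
#   retrieved lemmas + running summary + the original problem.
```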

Result: Semantic lemma retrieval improves accuracy on MATH500, AIME2024, and GPQA-Diamond in structured domains, but exposes failure modes in heterogeneous domains. Geometric analysis shows cache retrieval induces directional biases in embedding space, creating consistent fix (improve baseline) and break (degrade baseline) attractors.

Conclusion: Similarity-based memory offers benefits for self-improving LLM reasoning but has limits, particularly in heterogeneous domains where semantic similarity may not align with reasoning strategy transferability.

Abstract: Iterative summarization based reasoning frameworks such as InftyThink enable long-horizon reasoning in large language models (LLMs) by controlling context growth, but they repeatedly regenerate similar reasoning strategies across tasks. We introduce InftyThink with Cross-Chain Memory, an extension that augments iterative reasoning with an embedding-based semantic cache of previously successful reasoning patterns. At each reasoning step, the model retrieves and conditions on the most semantically similar stored lemmas, guiding inference without expanding the context window indiscriminately. Experiments on MATH500, AIME2024, and GPQA-Diamond demonstrate that semantic lemma retrieval improves accuracy in structured domains while exposing failure modes in tests that include heterogeneous domains. Geometric analyses of reasoning trajectories reveal that cache retrieval induces directional biases in embedding space, leading to consistent fix (improve baseline accuracy) and break (degradation in baseline accuracy) attractors. Our results highlight both the benefits and limits of similarity-based memory for self-improving LLM reasoning.

[15] Scalable and Reliable Evaluation of AI Knowledge Retrieval Systems: RIKER and the Coherent Simulated Universe

JV Roig

Main category: cs.CL

TL;DR: RIKER is a novel evaluation framework that uses paradigm inversion (generating documents from ground truth) to create contamination-resistant benchmarks for knowledge systems without human annotation or reference models.

Motivation: Current evaluation methods for knowledge systems have three major problems: static benchmarks are vulnerable to contamination, LLM-based judges have systematic biases, and ground truth extraction requires expensive human annotation.

Method: RIKER uses paradigm inversion - generating synthetic documents from known structured ground truth rather than extracting ground truth from documents. This enables deterministic scoring, scalable evaluation without human annotation, and contamination resistance through regenerable corpora.
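
Paradigm inversion in miniature, with invented facts and templates: documents are rendered from structured records, so scoring reduces to deterministic comparison against those records, and a new seed regenerates the corpus.

```python
# Generate synthetic documents from known ground truth, then score an
# extractor by exact comparison; no LLM judge or human annotation needed.
import random

FACTS = [{"person": "Dr. Vela", "ship": "ISV Meridian", "year": 2188},
         {"person": "Cmdr. Okafor", "ship": "ISV Aurora", "year": 2191}]

TEMPLATES = ["In {year}, {person} assumed command of the {ship}.",
             "The {ship}'s logs for {year} list {person} as commanding officer."]

def render_corpus(facts, seed=0):
    rng = random.Random(seed)          # regenerable: new seed, new corpus
    return [rng.choice(TEMPLATES).format(**f) for f in facts]

def score(extractor, docs, facts):
    correct = sum(extractor(d) == f["person"] for d, f in zip(docs, facts))
    return correct / len(facts)        # deterministic scoring

docs = render_corpus(FACTS)
```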

Result: Evaluation of 33 models using over 21B tokens revealed: 1) context length claims often exceed usable capacity with degradation beyond 32K tokens, 2) cross-document aggregation is substantially harder than single-document extraction, 3) grounding ability and hallucination resistance are distinct capabilities.

Conclusion: RIKER provides both a specific benchmark and a domain-agnostic methodology for constructing scalable, contamination-resistant evaluations wherever synthetic documents can be generated from structured ground truth.

Abstract: Evaluating knowledge systems (LLMs, RAG, knowledge graphs, etc) faces fundamental challenges: static benchmarks are vulnerable to contamination, LLM-based judges exhibit systematic biases, and ground truth extraction requires expensive human annotation. We present RIKER (Retrieval Intelligence and Knowledge Extraction Rating), both a benchmark and a replicable methodology based on paradigm inversion - generating documents from known ground truth rather than extracting ground truth from documents. This approach enables deterministic scoring and scalable evaluation without human annotation or reference models, and contamination resistance through regenerable corpora. Our evaluation of 33 models using over 21 billion tokens reveals that context length claims frequently exceed usable capacity, with significant degradation beyond 32K tokens; cross-document aggregation proves substantially harder than single-document extraction; and grounding ability and hallucination resistance are distinct capabilities - models excelling at finding facts that exist may still fabricate facts that do not. Beyond the specific benchmark, we contribute a domain-agnostic methodology for constructing scalable and contamination-resistant evaluations wherever synthetic documents can be generated from structured ground truth.

[16] PediaMind-R1: A Temperament-Aware Language Model for Personalized Early Childhood Care Reasoning via Cognitive Modeling and Preference Alignment

Zihe Zhang, Can Zhang, Yanheng Xu, Xin Hu, Jichao Leng

Main category: cs.CL

TL;DR: PediaMind-R1 is a specialized LLM for intelligent parenting that uses temperament theory and a two-stage training pipeline to provide personalized, psychologically-informed caregiving advice for children 0-3 years old.

Motivation: Current parenting systems provide generic suggestions that lack psychological grounding and personalization. The paper aims to create an AI system that can offer active, individualized parenting guidance based on developmental psychology principles, specifically addressing the need for temperament-aware caregiving strategies.

Method: 1) Introduces temperament theory from Thomas-Chess framework and builds a temperament knowledge graph for infants/toddlers (0-3 years). 2) Two-stage training pipeline: supervised fine-tuning for structured chain-of-thought reasoning, followed by GRPO-based alignment to reinforce logical consistency, domain expertise, and empathetic caregiving. 3) Creates evaluation framework with temperament-sensitive multiple-choice tests and human assessments.

Result: PediaMind-R1 demonstrates accurate interpretation of early childhood temperament profiles and engages in proactive individualized reasoning. The model successfully integrates psychological theory with domain-specific modeling to provide personalized parenting guidance.

Conclusion: The work shows the value of integrating vertical-domain modeling with psychological theory for developing user-centered LLMs. It offers a novel approach to active personalization in sensitive caregiving contexts, advancing personalized AI applications in parenting and child development domains.

Abstract: This paper presents PediaMind-R1, a domain-specialized large language model designed to achieve active personalization in intelligent parenting scenarios. Unlike conventional systems that provide generic suggestions, PediaMind-R1 draws on insights from developmental psychology. It introduces temperament theory from the Thomas-Chess framework and builds a temperament knowledge graph for infants and toddlers (0-3 years). Our two-stage training pipeline first uses supervised fine-tuning to teach structured chain-of-thought reasoning, and then applies a GRPO-based alignment stage to reinforce logical consistency, domain expertise, and empathetic caregiving strategies. We further design an evaluation framework comprising temperament-sensitive multiple-choice tests and human assessments. The results demonstrate that PediaMind-R1 can accurately interpret early childhood temperament profiles and proactively engage in individualized reasoning. This work highlights the value of integrating vertical-domain modeling with psychological theory. It offers a novel approach to developing user-centered LLMs that advance the practice of active personalization in sensitive caregiving contexts.

[17] Gaming the Answer Matcher: Examining the Impact of Text Manipulation on Automated Judgment

Manas Khatore, Sumana Sridharan, Kevork Sulahian, Benjamin J. Smith, Shi Feng

Main category: cs.CL

TL;DR: Answer matching using LLMs is robust against strategic text manipulations like verbosity, multiple answers, and conflicting information - these tactics don’t inflate scores and often reduce them.

Motivation: To investigate whether automated answer matching using LLMs is vulnerable to strategic attacks that could artificially inflate scores without improving actual correctness, such as guesswork or verbosity.

Method: Systematically test three manipulation tactics: prompting examinee models to (1) generate verbose responses, (2) provide multiple answers when unconfident, and (3) embed conflicting answers with correct answer near the start. Compare binary scoring vs continuous scoring robustness.

Result: The manipulations do not increase scores and often reduce them. Binary scoring is more robust to attacks than continuous scoring.

Conclusion: Answer matching is generally robust to inexpensive text manipulation and is a viable alternative to traditional LLM-as-a-judge or human evaluation when reference answers are available.

Abstract: Automated answer matching, which leverages LLMs to evaluate free-text responses by comparing them to a reference answer, shows substantial promise as a scalable and aligned alternative to human evaluation. However, its reliability requires robustness against strategic attacks such as guesswork or verbosity that may artificially inflate scores without improving actual correctness. In this work, we systematically investigate whether such tactics deceive answer matching models by prompting examinee models to: (1) generate verbose responses, (2) provide multiple answers when unconfident, and (3) embed conflicting answers with the correct answer near the start of their response. Our results show that these manipulations do not increase scores and often reduce them. Additionally, binary scoring (which requires a matcher to answer with a definitive “correct” or “incorrect”) is more robust to attacks than continuous scoring (which requires a matcher to determine partial correctness). These findings show that answer matching is generally robust to inexpensive text manipulation and is a viable alternative to traditional LLM-as-a-judge or human evaluation when reference answers are available.

[18] Más contexto no es mejor. Paradoja de la dilución vectorial en RAG corporativos (More Context Is Not Better: The Vector-Dilution Paradox in Corporate RAG)

Alex Dantart

Main category: cs.CL

TL;DR: Contextualized chunking in RAG improves recall with summary injection but causes vector dilution; moderate injection boosts recall by 18%, but exceeding CIR > 0.4 threshold reduces precision by 22%, showing an inverted U curve.

Motivation: Recent contextualized chunking techniques inject summaries to enhance RAG context, but they introduce "vector dilution" that obscures local content, creating a need to understand the trade-off between context enhancement and content dilution.

Method: Evaluated various injection ratios of summaries in contextualized chunking, analyzed the impact on recall and precision, and developed a theoretical framework to calculate optimal injection ratios based on the observed inverted U curve pattern.

Result: Found an inverted U-shaped relationship: moderate summary injection boosts recall by 18%, but exceeding a critical injection ratio (CIR > 0.4) reduces precision by 22% for specific queries, demonstrating the vector dilution problem.

Conclusion: There exists an optimal injection ratio for contextualized chunking in RAG systems; moderate injection enhances performance but excessive injection causes vector dilution and precision loss, necessitating careful ratio calculation.

Abstract: Recent “Contextualized Chunking” techniques inject summaries to improve RAG context but introduce a “vector dilution” that drowns out local content. Evaluating various injection ratios, we demonstrate an “inverted U” curve: moderate injection boosts Recall (+18%), but exceeding a critical threshold (CIR > 0.4) drops precision by 22% for specific queries. We propose a theoretical framework to calculate the optimal injection ratio.

[19] NewsScope: Schema-Grounded Cross-Domain News Claim Extraction with Open Models

Nidhi Pandya

Main category: cs.CL

TL;DR: NewsScope: A cross-domain dataset, benchmark, and fine-tuned model for schema-grounded news claim extraction that achieves near-GPT-4o-mini performance with offline deployment capability.

Motivation: Existing approaches for automated news verification lack either schema compliance or cross-domain generalization, creating a need for structured claim extraction systems that work across different news domains.

Method: Created NewsScope dataset with 455 articles across 4 domains, fine-tuned LLaMA 3.1 8B using LoRA on 315 training examples, and evaluated on in-domain and out-of-source test sets with human evaluation on 400 claims.
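
A hedged sketch of a LoRA setup with Hugging Face peft; the rank, alpha, and target modules shown are common defaults for LLaMA-style models, not the paper's reported hyperparameters.

```python
# Wrap LLaMA 3.1 8B with LoRA adapters so only a small fraction of the
# weights is trained on the 315 schema-grounded examples.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B")
config = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, config)
model.print_trainable_parameters()     # a small fraction of the 8B weights
```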

Result: NewsScope achieves 89.4% human-evaluated accuracy (vs GPT-4o-mini’s 93.7%), outperforms GPT-4o-mini on political claims (94.3% vs 87.8%), and with numeric grounding filter reaches 91.6% accuracy (2.1pp gap). High inter-annotator agreement (94.6%).

Conclusion: NewsScope provides an effective open-weight solution for schema-grounded claim extraction that approaches state-of-the-art performance while enabling cost-effective offline deployment, with publicly released code and benchmark.

Abstract: Automated news verification requires structured claim extraction, but existing approaches either lack schema compliance or generalize poorly across domains. This paper presents NewsScope, a cross-domain dataset, benchmark, and fine-tuned model for schema-grounded news claim extraction. The dataset contains 455 articles across politics, health, science/environment, and business, consisting of 395 in-domain articles and 60 out-of-source articles for generalization testing. LLaMA 3.1 8B was fine-tuned using LoRA on 315 training examples and evaluated on held-out in-domain (80 articles) and out-of-source (60 articles) test sets. Human evaluation on 400 claims shows NewsScope achieves 89.4% human-evaluated accuracy compared to GPT-4o-mini’s 93.7% (p=0.07). NewsScope outperforms GPT-4o-mini on political claims (94.3% vs. 87.8%). A numeric grounding filter further improves accuracy to 91.6%, narrowing the gap to 2.1 percentage points. Inter-annotator agreement studies (160 claims) confirm labeling reliability (94.6% positive agreement on SUPPORTED judgments). The open-weight model enables offline deployment at approximately $15 on-demand compute (or $0 on free tiers). Code and benchmark are publicly released.

[20] Evaluating Role-Consistency in LLMs for Counselor Training

Eric Rudolph, Natalie Engert, Jens Albrecht

Main category: cs.CL

TL;DR: This paper extends VirCo (Virtual Client for Online Counseling) research by introducing adversarial attacks to test LLM role-consistency, evaluating Vicuna model’s performance, and comparing various open-source LLMs for counselor training applications.

Motivation: The rise of online counseling services creates a need for effective training methods for future counselors. Traditional role-playing methods in academic training need to be complemented with realistic virtual client interactions.

Method: 1) Extends previous VirCo research, 2) Introduces new dataset with adversarial attacks to test LLM role-consistency, 3) Evaluates Vicuna model’s role consistency and coherence, 4) Compares various open-source LLMs for their performance in sustaining role consistency during virtual client interactions.

Result: The study provides: 1) Creation of an adversarial dataset, 2) Evaluation of conversation coherence and persona consistency, 3) Comparative analysis of different LLMs’ performance in maintaining role consistency during counseling simulations.

Conclusion: The research contributes to improving virtual client systems for counselor training by testing LLM robustness against adversarial attacks and providing comparative insights into different models’ ability to maintain consistent counseling personas.

Abstract: The rise of online counseling services has highlighted the need for effective training methods for future counselors. This paper extends research on VirCo, a Virtual Client for Online Counseling, designed to complement traditional role-playing methods in academic training by simulating realistic client interactions. Building on previous work, we introduce a new dataset incorporating adversarial attacks to test the ability of large language models (LLMs) to maintain their assigned roles (role-consistency). The study focuses on evaluating the role consistency and coherence of the Vicuna model’s responses, comparing these findings with earlier research. Additionally, we assess and compare various open-source LLMs for their performance in sustaining role consistency during virtual client interactions. Our contributions include creating an adversarial dataset, evaluating conversation coherence and persona consistency, and providing a comparative analysis of different LLMs.

[21] Imagine-then-Plan: Agent Learning from Adaptive Lookahead with World Models

Youwei Liu, Jian Wang, Hanlin Wang, Beichen Guo, Wenjie Li

Main category: cs.CL

TL;DR: ITP is a framework for agent learning via lookahead imagination with world models, featuring adaptive horizon selection and fusion of imagined trajectories with observations for improved planning.

Motivation: Current world model methods mainly use single-step or fixed-horizon rollouts, limiting their potential for complex task planning. There's a need for more flexible and adaptive imagination mechanisms that can handle varying task requirements.

Method: Proposes Imagine-then-Plan (ITP) framework where policy interacts with learned world model to generate multi-step imagined trajectories. Introduces adaptive lookahead mechanism that trades off ultimate goal and task progress. Fuses imagined trajectories with current observations to create a partially observable and imaginable MDP. Implements both training-free and reinforcement-trained variants.
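
One way to realize adaptive lookahead, simplified from the paper's goal/progress trade-off: keep extending the imagined rollout while estimated progress improves, then execute the first action of the kept trajectory. All callables are placeholders.

```python
# Roll the policy out inside the learned world model (no real environment),
# extending the horizon adaptively based on an imagined progress estimate.
def imagine_then_plan(state, policy, world_model, progress, max_h=8):
    traj, s, best = [], state, progress(state)
    for h in range(max_h):
        a = policy(s)
        s = world_model(s, a)            # imagined next state
        traj.append((a, s))
        p = progress(s)                  # task-progress estimate in [0, 1]
        if p >= 1.0:                     # goal reached in imagination
            break
        if p <= best and h > 0:          # progress stalled: stop extending
            traj.pop()
            break
        best = p
    return traj[0][0] if traj else policy(state)   # execute the first action
```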

Result: Extensive experiments across representative agent benchmarks show ITP significantly outperforms competitive baselines. Adaptive lookahead enhances agents’ reasoning capability and provides insights for addressing broader, complex tasks.

Conclusion: ITP provides a unified framework for agent learning via lookahead imagination that improves planning capabilities through adaptive horizon selection and fusion of imagined future consequences with current observations.

Abstract: Recent advances in world models have shown promise for modeling future dynamics of environmental states, enabling agents to reason and act without accessing real environments. Current methods mainly perform single-step or fixed-horizon rollouts, leaving their potential for complex task planning under-exploited. We propose Imagine-then-Plan (ITP), a unified framework for agent learning via lookahead imagination, where an agent’s policy model interacts with the learned world model, yielding multi-step “imagined” trajectories. Since the imagination horizon may vary by tasks and stages, we introduce a novel adaptive lookahead mechanism by trading off the ultimate goal and task progress. The resulting imagined trajectories provide rich signals about future consequences, such as achieved progress and potential conflicts, which are fused with current observations, formulating a partially observable and imaginable Markov decision process to guide policy learning. We instantiate ITP with both training-free and reinforcement-trained variants. Extensive experiments across representative agent benchmarks demonstrate that ITP significantly outperforms competitive baselines. Further analyses validate that our adaptive lookahead largely enhances agents’ reasoning capability, providing valuable insights into addressing broader, complex tasks.

[22] Entropy Sentinel: Continuous LLM Accuracy Monitoring from Decoding Entropy Traces in STEM

Pedro Memoli Buffa, Luciano Del Corro

Main category: cs.CL

TL;DR: Output-entropy profiles from LLM token probabilities can estimate domain-level accuracy under distribution shift, enabling scalable monitoring and targeted data acquisition.

DetailsMotivation: Deploying LLMs faces two challenges: monitoring model performance as traffic/domains drift, and improving models by prioritizing data acquisition to close performance gaps. The paper tests whether inference-time signals can estimate slice-level accuracy under domain shift.

Method: For each LLM response, compute output-entropy profile from final-layer next-token probabilities (using top-k logprobs) and summarize with eleven statistics. Train lightweight classifier to predict instance correctness, then average predicted probabilities to estimate domain-level accuracy.
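
A hedged sketch of this recipe: the paper's eleven statistics and its classifier are not enumerated in the summary, so a plausible six-statistic subset and logistic regression stand in, and the data below is purely synthetic.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def entropy_profile(topk_logprobs):
    """Per-token entropy from a (T, k) array of top-k logprobs."""
    p = np.exp(topk_logprobs)
    p /= p.sum(axis=1, keepdims=True)  # renormalize the truncated top-k mass
    return -(p * np.log(p)).sum(axis=1)

def summarize(e):
    """Summary statistics of the entropy trace (the paper uses eleven)."""
    return [e.mean(), e.std(), e.min(), e.max(),
            np.median(e), np.percentile(e, 90)]

# Synthetic demo: label 1 = correct response (peakier, lower-entropy trace).
rng = np.random.default_rng(0)
y = rng.integers(0, 2, 200)
X = np.array([summarize(entropy_profile(rng.normal(0, 1 + 2 * yi, (40, 5))))
              for yi in y])
clf = LogisticRegression(max_iter=1000).fit(X, y)
# Domain-level accuracy estimate = mean predicted correctness probability.
print(clf.predict_proba(X)[:, 1].mean())
```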

Result: Evaluated on ten STEM reasoning benchmarks with exhaustive train/test compositions across nine LLMs (3B-20B). Estimates often track held-out benchmark accuracy, with several models showing near-monotonic ordering of domains.

Conclusion: Output-entropy profiles provide an accessible signal for scalable monitoring of LLM performance under domain shift and for targeting data acquisition to improve models.

Abstract: Deploying LLMs raises two coupled challenges: (1) monitoring - estimating where a model underperforms as traffic and domains drift - and (2) improvement - prioritizing data acquisition to close the largest performance gaps. We test whether an inference-time signal can estimate slice-level accuracy under domain shift. For each response, we compute an output-entropy profile from final-layer next-token probabilities (from top-k logprobs) and summarize it with eleven statistics. A lightweight classifier predicts instance correctness, and averaging predicted probabilities yields a domain-level accuracy estimate. We evaluate on ten STEM reasoning benchmarks with exhaustive train/test compositions (k in {1,2,3,4}; all “10 choose k” combinations), across nine LLMs from six families (3B-20B). Estimates often track held-out benchmark accuracy, and several models show near-monotonic ordering of domains. Output-entropy profiles are thus an accessible signal for scalable monitoring and for targeting data acquisition.

[23] TranslateGemma Technical Report

Mara Finkelstein, Isaac Caswell, Tobias Domhan, Jan-Thorsten Peter, Juraj Juraska, Parker Riley, Daniel Deutsch, Cole Dilanni, Colin Cherry, Eleftheria Briakou, Elizabeth Nielsen, Jiaming Luo, Kat Black, Ryan Mullins, Sweta Agrawal, Wenda Xu, Erin Kats, Stephane Jaskiewicz, Markus Freitag, David Vilar

Main category: cs.CL

TL;DR: TranslateGemma is an open machine translation model suite built on Gemma 3 foundation models, enhanced through two-stage fine-tuning with synthetic/human data and reinforcement learning, achieving strong translation performance across many language pairs while retaining multimodal capabilities.

DetailsMotivation: To enhance the inherent multilingual capabilities of Gemma 3 foundation models for machine translation tasks and provide the research community with powerful, adaptable open translation tools.

Method: Two-stage fine-tuning: 1) Supervised fine-tuning using mixture of high-quality synthetic parallel data (generated via state-of-the-art models) and human-translated parallel data, 2) Reinforcement learning phase optimizing translation quality using ensemble of reward models (MetricX-QE and AutoMQM).

Result: Demonstrated effectiveness through human evaluation on WMT25 test set (10 language pairs) and automatic evaluation on WMT24++ benchmark (55 language pairs). Showed consistent substantial gains over baseline Gemma 3 models across all sizes, with smaller TranslateGemma models often matching larger baseline performance. Models retain strong multimodal capabilities with enhanced performance on Vistra image translation benchmark.

Conclusion: TranslateGemma provides powerful open machine translation models that significantly improve upon baseline Gemma 3, offering better efficiency through smaller models achieving comparable performance to larger baselines, while maintaining multimodal capabilities for broader research applications.

Abstract: We present TranslateGemma, a suite of open machine translation models based on the Gemma 3 foundation models. To enhance the inherent multilingual capabilities of Gemma 3 for the translation task, we employ a two-stage fine-tuning process. First, supervised fine-tuning is performed using a rich mixture of high-quality large-scale synthetic parallel data generated via state-of-the-art models and human-translated parallel data. This is followed by a reinforcement learning phase, where we optimize translation quality using an ensemble of reward models, including MetricX-QE and AutoMQM, targeting translation quality. We demonstrate the effectiveness of TranslateGemma with human evaluation on the WMT25 test set across 10 language pairs and with automatic evaluation on the WMT24++ benchmark across 55 language pairs. Automatic metrics show consistent and substantial gains over the baseline Gemma 3 models across all sizes. Notably, smaller TranslateGemma models often achieve performance comparable to larger baseline models, offering improved efficiency. We also show that TranslateGemma models retain strong multimodal capabilities, with enhanced performance on the Vistra image translation benchmark. The release of the open TranslateGemma models aims to provide the research community with powerful and adaptable tools for machine translation.

[24] Multicultural Spyfall: Assessing LLMs through Dynamic Multilingual Social Deduction Game

Haryo Akbarianto Wibowo, Alaa Elsetohy, Qinrong Cui, Alham Fikri Aji

Main category: cs.CL

TL;DR: The paper proposes Spyfall, a dynamic game-based benchmarking framework for evaluating multilingual and multicultural capabilities of LLMs, revealing significant performance gaps in non-English contexts compared to traditional benchmarks.

DetailsMotivation: Traditional static benchmarks for LLMs are becoming inadequate due to data saturation and leakage issues, necessitating more robust evaluation methods that can assess multilingual and multicultural capabilities in dynamic, real-world scenarios.

Method: The authors use the social deduction game Spyfall as a dynamic benchmarking framework, where models engage in strategic dialogue to either identify a secret agent or avoid detection, using culturally relevant locations or local foods as game elements.

Result: Game-based rankings align closely with Chatbot Arena, but reveal significant performance gaps in non-English contexts: models are less proficient with locally specific entities and struggle with rule-following and strategic integrity in non-English languages.

Conclusion: The Spyfall game-based approach provides a scalable, leakage-resistant, and culturally nuanced alternative to traditional NLP benchmarks for evaluating LLMs’ multilingual and multicultural capabilities.

Abstract: The rapid advancement of Large Language Models (LLMs) has necessitated more robust evaluation methods that go beyond static benchmarks, which are increasingly prone to data saturation and leakage. In this paper, we propose a dynamic benchmarking framework for evaluating multilingual and multicultural capabilities through the social deduction game Spyfall. In our setup, models must engage in strategic dialogue to either identify a secret agent or avoid detection, utilizing culturally relevant locations or local foods. Our results show that our game-based rankings align closely with the Chatbot Arena. However, we find a significant performance gap in non-English contexts: models are generally less proficient when handling locally specific entities and often struggle with rule-following or strategic integrity in non-English languages. We demonstrate that this game-based approach provides a scalable, leakage-resistant, and culturally nuanced alternative to traditional NLP benchmarks. The game history can be accessed here https://huggingface.co/datasets/haryoaw/cultural-spyfall.

[25] OpenDecoder: Open Large Language Model Decoding to Incorporate Document Quality in RAG

Fengran Mo, Zhan Su, Yuchen Hui, Jinghan Zhang, Jia Ao Sun, Zheyuan Liu, Chao Zhang, Tetsuya Sakai, Jian-Yun Nie

Main category: cs.CL

TL;DR: OpenDecoder improves RAG by using explicit quality indicators (relevance, ranking, QPP scores) to make LLMs more robust to noisy retrieved information.

DetailsMotivation: Current RAG systems assume retrieved information is always relevant, but in reality retrieved content can vary in usefulness and contain noise. This affects answer quality since LLMs don't explicitly consider retrieval quality during generation.

Method: Proposes OpenDecoder approach that incorporates three explicit evaluation metrics as quality indicator features: relevance scores, ranking scores, and query performance prediction (QPP) scores. These indicators help the model assess and utilize retrieved information more effectively during answer generation.
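
OpenDecoder injects these indicators during decoding and post-training; as a simplified, hedged illustration, the sketch below merely renders the three scores into the generation prompt. The field names and prompt wording are assumptions.

```python
def build_prompt(question, docs):
    """docs: dicts with 'text' plus relevance / ranking / QPP scores."""
    blocks = [
        f"[Doc {i} | relevance={d['relevance']:.2f} "
        f"rank={d['rank']:.2f} qpp={d['qpp']:.2f}]\n{d['text']}"
        for i, d in enumerate(docs, 1)
    ]
    return ("\n\n".join(blocks)
            + f"\n\nQuestion: {question}\n"
            "Weigh each document by its quality indicators when answering.")

print(build_prompt("Who wrote Hamlet?", [{
    "text": "Hamlet is a tragedy written by William Shakespeare.",
    "relevance": 0.91, "rank": 0.88, "qpp": 0.74}]))
```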

Result: Experimental results on five benchmark datasets show OpenDecoder outperforms various baseline methods, demonstrating effectiveness and better robustness to varying levels of noisy context.

Conclusion: OpenDecoder provides a flexible paradigm that can be integrated with LLM post-training and combined with any type of external quality indicators, making RAG systems more robust to retrieval noise.

Abstract: The development of large language models (LLMs) has achieved superior performance in a range of downstream tasks, including LLM-based retrieval-augmented generation (RAG). The quality of generated content heavily relies on the usefulness of the retrieved information and the capacity of LLMs’ internal information processing mechanism to incorporate it in answer generation. It is generally assumed that the retrieved information is relevant to the question. However, the retrieved information may have a variable degree of relevance and usefulness, depending on the question and the document collection. It is important to take into account the relevance of the retrieved information in answer generation. In this paper, we propose OpenDecoder, a new approach that leverages explicit evaluation of the retrieved information as quality indicator features for generation. We aim to build a RAG model that is more robust to varying levels of noisy context. Three types of explicit evaluation information are considered: relevance score, ranking score, and QPP (query performance prediction) score. The experimental results on five benchmark datasets demonstrate the effectiveness and better robustness of OpenDecoder by outperforming various baseline methods. Importantly, this paradigm is flexible: it can be integrated into LLM post-training for any purpose and combined with any type of external quality indicator.

[26] SpectraQuery: A Hybrid Retrieval-Augmented Conversational Assistant for Battery Science

Sreya Vangara, Jagjit Nanda, Yan-Kai Tzeng, Eric Darve

Main category: cs.CL

TL;DR: SpectraQuery is a hybrid query framework that integrates structured Raman spectroscopy data with unstructured scientific literature using SUQL-inspired design, enabling joint reasoning across modalities for scientific workflows.

DetailsMotivation: Scientific reasoning requires linking structured experimental data with unstructured explanatory literature, but current LLM assistants cannot effectively reason jointly across these different modalities.

Method: Combines semantic parsing with retrieval-augmented generation to translate natural language queries into coordinated SQL (for structured database) and literature retrieval operations, using SUQL-inspired design to integrate relational spectroscopy database with vector-indexed literature corpus.
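
A minimal sketch of the SUQL-style coordination, assuming a `parse` step that splits each question into a SQL clause and a literature query; the schema, prompts, and stub components are all illustrative, not SpectraQuery's actual interfaces.

```python
import sqlite3

def hybrid_answer(question, parse, retrieve, llm, db):
    sql, lit_query = parse(question)       # semantic parsing step
    rows = db.execute(sql).fetchall()      # structured evidence (SQL)
    passages = retrieve(lit_query)         # unstructured evidence (vector search)
    return llm(f"Q: {question}\nRows: {rows}\nLiterature: {passages}\n"
               "Answer with citations.")

# Toy usage with an in-memory database and stub components:
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE spectra (material TEXT, peak_cm1 REAL)")
db.execute("INSERT INTO spectra VALUES ('LiCoO2', 595.0)")
print(hybrid_answer(
    "What is the main Raman peak of LiCoO2 and why?",
    parse=lambda q: ("SELECT peak_cm1 FROM spectra WHERE material='LiCoO2'",
                     "LiCoO2 Raman A1g mode"),
    retrieve=lambda q: ["The ~595 cm-1 band is assigned to the A1g mode."],
    llm=lambda p: p,  # stand-in for the cited-answer generator
    db=db))
```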

Result: Strong performance: ~80% SQL correctness, 93-97% answer groundedness with 10-15 retrieved passages, and expert ratings of 4.1-4.6/5 across accuracy, relevance, grounding, and clarity.

Conclusion: Hybrid retrieval architectures can meaningfully support scientific workflows by bridging data and discourse for high-volume experimental datasets.

Abstract: Scientific reasoning increasingly requires linking structured experimental data with the unstructured literature that explains it, yet most large language model (LLM) assistants cannot reason jointly across these modalities. We introduce SpectraQuery, a hybrid natural-language query framework that integrates a relational Raman spectroscopy database with a vector-indexed scientific literature corpus using a Structured and Unstructured Query Language (SUQL)-inspired design. By combining semantic parsing with retrieval-augmented generation, SpectraQuery translates open-ended questions into coordinated SQL and literature retrieval operations, producing cited answers that unify numerical evidence with mechanistic explanation. Across SQL correctness, answer groundedness, retrieval effectiveness, and expert evaluation, SpectraQuery demonstrates strong performance: approximately 80 percent of generated SQL queries are fully correct, synthesized answers reach 93-97 percent groundedness with 10-15 retrieved passages, and battery scientists rate responses highly across accuracy, relevance, grounding, and clarity (4.1-4.6/5). These results show that hybrid retrieval architectures can meaningfully support scientific workflows by bridging data and discourse for high-volume experimental datasets.

[27] Can LLMs interpret figurative language as humans do?: surface-level vs representational similarity

Samhita Bollepally, Aurora Sloman-Moll, Takashi Yamauchi

Main category: cs.CL

TL;DR: LLMs show surface-level alignment with human judgments on linguistic traits but diverge significantly at representational level, especially for figurative language like idioms and Gen Z slang. GPT-4 performs best but all models struggle with context-dependent expressions.

DetailsMotivation: To investigate how well LLMs align with human judgments when interpreting figurative and socially grounded language, particularly examining whether their surface-level similarity extends to deeper representational understanding.

Method: Compared human participants and four instruction-tuned LLMs (GPT-4, Gemma-2-9B, Llama-3.2, Mistral-7B) on 240 dialogue-based sentences representing six linguistic traits. Each sentence was paired with 40 interpretive questions, rated on 10-point Likert scales.

Result: Humans and LLMs aligned at surface level but diverged significantly at representational level, especially for figurative sentences involving idioms and Gen Z slang. GPT-4 most closely approximated human patterns, while all models struggled with context-dependent and socio-pragmatic expressions.

Conclusion: Current LLMs, while showing surface-level similarity to human judgments, lack deeper representational understanding of figurative and socially grounded language, highlighting limitations in their ability to interpret context-dependent expressions that require socio-pragmatic knowledge.

Abstract: Large language models generate judgments that resemble those of humans. Yet the extent to which these models align with human judgments in interpreting figurative and socially grounded language remains uncertain. To investigate this, human participants and four instruction-tuned LLMs of different sizes (GPT-4, Gemma-2-9B, Llama-3.2, and Mistral-7B) rated 240 dialogue-based sentences representing six linguistic traits: conventionality, sarcasm, funny, emotional, idiomacy, and slang. Each of the 240 sentences was paired with 40 interpretive questions, and both humans and LLMs rated these sentences on a 10-point Likert scale. Results indicated that LLMs aligned with humans at the surface level but diverged significantly at the representational level, especially in interpreting figurative sentences involving idioms and Gen Z slang. GPT-4 most closely approximates human representational patterns, while all models struggle with context-dependent and socio-pragmatic expressions like sarcasm, slang, and idiomacy.

[28] Is Grokking Worthwhile? Functional Analysis and Transferability of Generalization Circuits in Transformers

Kaiyu He, Zhang Mian, Peilin Wu, Xinya Du, Zhiyu Chen

Main category: cs.CL

TL;DR: Grokking in transformers doesn’t create new reasoning paradigms but integrates memorized facts into existing paths, with limited transferability to new knowledge.

DetailsMotivation: To understand whether grokked models are superior to non-grokked ones on downstream tasks and whether the computational cost of waiting for grokking is worthwhile, given LLMs' struggles with compositional reasoning.

Method: Mechanistic study evaluating the Generalization Circuit’s role in knowledge assimilation and transfer, analyzing inference paths of grokked vs non-grokked models.

Result: 1) Grokked and non-grokked models use identical inference paths for in-distribution queries; grokking integrates memorized facts rather than creating new reasoning. 2) High accuracy on unseen cases and reasoning path formation can occur independently. 3) Mature circuits show limited transferability when integrating new knowledge.

Conclusion: Grokked Transformers don’t achieve full mastery of compositional logic; the extensive computational cost of waiting for grokking may not be worthwhile as it doesn’t create fundamentally new reasoning capabilities.

Abstract: While Large Language Models (LLMs) excel at factual retrieval, they often struggle with the “curse of two-hop reasoning” in compositional tasks. Recent research suggests that parameter-sharing transformers can bridge this gap by forming a “Generalization Circuit” during a prolonged “grokking” phase. A fundamental question arises: Is a grokked model superior to its non-grokked counterparts on downstream tasks? Furthermore, is the extensive computational cost of waiting for the grokking phase worthwhile? In this work, we conduct a mechanistic study to evaluate the Generalization Circuit’s role in knowledge assimilation and transfer. We demonstrate that: (i) The inference paths established by non-grokked and grokked models for in-distribution compositional queries are identical. This suggests that the “Generalization Circuit” does not represent the sudden acquisition of a new reasoning paradigm. Instead, we argue that grokking is the process of integrating memorized atomic facts into a naturally established reasoning path. (ii) Achieving high accuracy on unseen cases after prolonged training and the formation of a certain reasoning path are not bound together; they can occur independently under specific data regimes. (iii) Even a mature circuit exhibits limited transferability when integrating new knowledge, suggesting that “grokked” Transformers do not achieve a full mastery of compositional logic.

[29] SITA: Learning Speaker-Invariant and Tone-Aware Speech Representations for Low-Resource Tonal Languages

Tianyi Xu, Xuan Ouyang, Binwei Yao, Shoua Xiong, Sara Misurelli, Maichou Lor, Junjie Hu

Main category: cs.CL

TL;DR: SITA is a lightweight adaptation method for pretrained speech encoders that improves tone awareness and speaker invariance for tonal low-resource languages like Hmong and Mandarin.

DetailsMotivation: Tonal low-resource languages are widely spoken but poorly served by speech technology. Existing multilingual encoders fail to effectively represent tone distinctions while being robust to speaker variations like gender.

Method: SITA uses staged multi-objective training: (1) cross-gender contrastive objective for lexical consistency across speakers plus tone-repulsive loss to separate same-word different-tone realizations; (2) auxiliary CTC-based ASR objective with distillation to stabilize recognition-relevant structure.
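
A runnable sketch of the two stage-(i) objectives, assuming batches of paired embeddings: `z_m`/`z_f` are the same word from male/female speakers and `z_tone` is the same word realized with a different tone. The temperature and margin values are illustrative, not the paper's.

```python
import torch
import torch.nn.functional as F

def cross_gender_contrastive(z_m, z_f, temperature=0.07):
    """InfoNCE pulling same-word, cross-gender pairs together."""
    z_m, z_f = F.normalize(z_m, dim=-1), F.normalize(z_f, dim=-1)
    logits = z_m @ z_f.T / temperature        # (B, B) similarity matrix
    targets = torch.arange(z_m.size(0))       # diagonal = positive pairs
    return F.cross_entropy(logits, targets)

def tone_repulsive(z_word, z_tone, margin=0.3):
    """Push apart same-word, different-tone realizations."""
    sim = F.cosine_similarity(z_word, z_tone, dim=-1)
    return F.relu(sim - margin).mean()

z_m, z_f, z_t = torch.randn(4, 128), torch.randn(4, 128), torch.randn(4, 128)
loss = cross_gender_contrastive(z_m, z_f) + tone_repulsive(z_m, z_t)
print(loss.item())
```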

Result: On Hmong, SITA improves cross-gender lexical retrieval accuracy while maintaining usable ASR accuracy relative to ASR-adapted XLS-R teacher. Similar gains observed when transferring to Mandarin.

Conclusion: SITA provides a general, plug-in approach for adapting multilingual speech encoders to tonal languages, addressing the dual challenge of speaker invariance and tone awareness for low-resource tonal languages.

Abstract: Tonal low-resource languages are widely spoken yet remain underserved by modern speech technology. A key challenge is learning representations that are robust to nuisance variation such as gender while remaining tone-aware for different lexical meanings. To address this, we propose SITA, a lightweight adaptation recipe that enforces Speaker-Invariance and Tone-Awareness for pretrained wav2vec-style encoders. SITA uses staged multi-objective training: (i) a cross-gender contrastive objective encourages lexical consistency across speakers, while a tone-repulsive loss prevents tone collapse by explicitly separating same-word different-tone realizations; and (ii) an auxiliary Connectionist Temporal Classification (CTC)-based ASR objective with distillation stabilizes recognition-relevant structure. We evaluate primarily on Hmong, a highly tonal and severely under-resourced language where off-the-shelf multilingual encoders fail to represent tone effectively. On a curated Hmong word corpus, SITA improves cross-gender lexical retrieval accuracy, while maintaining usable ASR accuracy relative to an ASR-adapted XLS-R teacher. We further observe similar gains when transferring the same recipe to Mandarin, suggesting SITA is a general, plug-in approach for adapting multilingual speech encoders to tonal languages.

[30] Efficient Multilingual Dialogue Processing via Translation Pipelines and Distilled Language Models

Santiago Martínez Novoa, Nicolás Rozo Fajardo, Diego Alejandro González Vargas, Nicolás Bedoya Figueroa

Main category: cs.CL

TL;DR: Team Kl33n3x developed a multilingual dialogue summarization and QA system using a three-stage translation pipeline with a distilled 2.55B parameter model, achieving strong performance across nine languages without task-specific fine-tuning.

DetailsMotivation: To create an effective multilingual dialogue summarization and question answering system for low-resource Indic languages, addressing the challenge of limited training data and computational resources while maintaining competitive performance.

Method: Three-stage pipeline: 1) Forward translation from Indic languages to English, 2) Multitask text generation using a 2.55B parameter distilled language model, 3) Reverse translation back to source languages. Uses knowledge distillation to create compact models.
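
The pipeline itself is simple to express; below is a hedged sketch in which `translate` and `generate` are stand-ins for the team's MT system and the 2.55B distilled model, neither of which is specified here.

```python
def process(dialogue, src_lang, task, translate, generate):
    english = translate(dialogue, src=src_lang, tgt="en")  # stage 1: forward MT
    output = generate(task=task, text=english)             # stage 2: summarize/QA
    return translate(output, src="en", tgt=src_lang)       # stage 3: back MT

# Stub usage; real runs would plug in Indic<->English MT and the generator.
print(process("<Hindi dialogue>", "hi", "summarize",
              translate=lambda t, src, tgt: f"[{src}->{tgt}] {t}",
              generate=lambda task, text: f"{task}: {text}"))
```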

Result: Achieved strong win rates across competition tasks, with particularly robust performance on Marathi (86.7% QnA), Tamil (86.7% QnA), and Hindi (80.0% QnA). Demonstrated that compact models can achieve competitive performance across nine languages.

Conclusion: Translation-based approaches with knowledge-distilled compact models are effective for low-resource language processing without task-specific fine-tuning, enabling competitive multilingual performance with reduced computational requirements.

Abstract: This paper presents team Kl33n3x’s multilingual dialogue summarization and question answering system developed for the NLPAI4Health 2025 shared task. The approach employs a three-stage pipeline: forward translation from Indic languages to English, multitask text generation using a 2.55B parameter distilled language model, and reverse translation back to source languages. By leveraging knowledge distillation techniques, this work demonstrates that compact models can achieve highly competitive performance across nine languages. The system achieved strong win rates across the competition’s tasks, with particularly robust performance on Marathi (86.7% QnA), Tamil (86.7% QnA), and Hindi (80.0% QnA), demonstrating the effectiveness of translation-based approaches for low-resource language processing without task-specific fine-tuning.

[31] Beyond Consensus: Perspectivist Modeling and Evaluation of Annotator Disagreement in NLP

Yinuo Xu, David Jurgens

Main category: cs.CL

TL;DR: Survey paper on disagreement-aware NLP methods, covering sources of annotator disagreement, modeling approaches, evaluation metrics, and future directions.

DetailsMotivation: Annotator disagreement is widespread in NLP, especially for subjective tasks like toxicity detection and stance analysis. While traditionally treated as noise, recent work recognizes disagreement as meaningful signal reflecting variation in interpretation and perspective, necessitating a unified view of disagreement-aware methods.

Method: Provides a domain-agnostic taxonomy of disagreement sources (data, task, annotator factors), synthesizes modeling approaches using a framework of prediction targets and pooling structure, and reviews evaluation metrics for both predictive performance and annotator behavior.

Result: Identifies a shift from consensus learning toward explicitly modeling disagreement and capturing structured relationships among annotators. Notes that most fairness evaluations remain descriptive rather than normative.

Conclusion: Identifies open challenges: integrating multiple sources of variation, developing disagreement-aware interpretability frameworks, and grappling with practical tradeoffs of perspectivist modeling.

Abstract: Annotator disagreement is widespread in NLP, particularly for subjective and ambiguous tasks such as toxicity detection and stance analysis. While early approaches treated disagreement as noise to be removed, recent work increasingly models it as a meaningful signal reflecting variation in interpretation and perspective. This survey provides a unified view of disagreement-aware NLP methods. We first present a domain-agnostic taxonomy of the sources of disagreement spanning data, task, and annotator factors. We then synthesize modeling approaches using a common framework defined by prediction targets and pooling structure, highlighting a shift from consensus learning toward explicitly modeling disagreement, and toward capturing structured relationships among annotators. We review evaluation metrics for both predictive performance and annotator behavior, noting that most fairness evaluations remain descriptive rather than normative. We conclude by identifying open challenges and future directions, including integrating multiple sources of variation, developing disagreement-aware interpretability frameworks, and grappling with the practical tradeoffs of perspectivist modeling.

[32] Mi:dm 2.0 Korea-centric Bilingual Language Models

Donghoon Shin, Sejung Lee, Soonmin Bae, Hwijung Ryu, Changwon Ok, Hoyoun Jung, Hyesung Ji, Jeehyun Lim, Jehoon Lee, Ji-Eun Han, Jisoo Baik, Mihyeon Kim, Riwoo Chung, Seongmin Lee, Wonjae Park, Yoonseok Heo, Youngkyung Seo, Seyoun Won, Boeun Kim, Cheolhun Heo, Eunkyeong Lee, Honghee Lee, Hyeongju Ju, Hyeontae Seo, Jeongyong Shim, Jisoo Lee, Junseok Koh, Junwoo Kim, Minho Lee, Minji Kang, Minju Kim, Sangha Nam, Seongheum Park, Taehyeong Kim, Euijai Ahn, Hong Seok Jeung, Jisu Shin, Jiyeon Kim, Seonyeong Song, Seung Hyun Kong, Sukjin Hong, Taeyang Yun, Yu-Seon Kim, A-Hyun Lee, Chae-Jeong Lee, Hye-Won Yu, Ji-Hyun Ahn, Song-Yeon Kim, Sun-Woo Jung, Eunju Kim, Eunji Ha, Jinwoo Baek, Yun-ji Lee, Wanjin Park, Jeong Yeop Kim, Eun Mi Kim, Hyoung Jun Park, Jung Won Yoon, Min Sung Noh, Myung Gyo Oh, Wongyoung Lee, Yun Jin Park, Young S. Kwon, Hyun Keun Kim, Jieun Lee, YeoJoo Park

Main category: cs.CL

TL;DR: Mi:dm 2.0 is a bilingual Korean-centric LLM that integrates Korean cultural values and reasoning patterns, available in two sizes (11.5B and 2.3B parameters), achieving SOTA on Korean benchmarks and released under MIT license.

DetailsMotivation: To address limitations of existing LLMs that lack sufficient Korean data quality and cultural alignment, and to advance Korea-centric AI by creating models that understand Korean cultural contexts, emotional subtleties, and real-world scenarios.

Method: Uses comprehensive data pipeline with proprietary data cleansing, high-quality synthetic data generation, strategic data mixing with curriculum learning, and custom Korean-optimized tokenizer. Offers two configurations: Base (11.5B) with depth-up scaling for general use, and Mini (2.3B) for resource-constrained environments.

Result: Achieves state-of-the-art performance on Korean-specific benchmarks with top-tier zero-shot results on KMMLU and strong internal evaluation across language, humanities, and social science tasks.

Conclusion: Mi:dm 2.0 advances Korea-centric AI by providing accessible, high-performance bilingual LLMs under MIT license to accelerate AI adoption in Korean industries, strengthen the developer community, and lay groundwork for K-intelligence vision.

Abstract: We introduce Mi:dm 2.0, a bilingual large language model (LLM) specifically engineered to advance Korea-centric AI. This model goes beyond Korean text processing by integrating the values, reasoning patterns, and commonsense knowledge inherent to Korean society, enabling nuanced understanding of cultural contexts, emotional subtleties, and real-world scenarios to generate reliable and culturally appropriate responses. To address limitations of existing LLMs, often caused by insufficient or low-quality Korean data and lack of cultural alignment, Mi:dm 2.0 emphasizes robust data quality through a comprehensive pipeline that includes proprietary data cleansing, high-quality synthetic data generation, strategic data mixing with curriculum learning, and a custom Korean-optimized tokenizer to improve efficiency and coverage. To realize this vision, we offer two complementary configurations: Mi:dm 2.0 Base (11.5B parameters), built with a depth-up scaling strategy for general-purpose use, and Mi:dm 2.0 Mini (2.3B parameters), optimized for resource-constrained environments and specialized tasks. Mi:dm 2.0 achieves state-of-the-art performance on Korean-specific benchmarks, with top-tier zero-shot results on KMMLU and strong internal evaluation results across language, humanities, and social science tasks. The Mi:dm 2.0 lineup is released under the MIT license to support extensive research and commercial use. By offering accessible and high-performance Korea-centric LLMs, KT aims to accelerate AI adoption across Korean industries, public services, and education, strengthen the Korean AI developer community, and lay the groundwork for the broader vision of K-intelligence. Our models are available at https://huggingface.co/K-intelligence. For technical inquiries, please contact midm-llm@kt.com.

[33] From Symbolic to Natural-Language Relations: Rethinking Knowledge Graph Construction in the Era of Large Language Models

Kanyao Han, Yushang Lai

Main category: cs.CL

TL;DR: The paper argues for shifting from symbolic relation labels in knowledge graphs to natural-language relation descriptions, leveraging LLMs’ capabilities for more contextual and nuanced knowledge representation.

DetailsMotivation: Traditional symbolic relation schemas in knowledge graphs oversimplify real-world relations by compressing nuanced, contextual information into discrete labels. While effective for pre-LLM systems, this approach loses critical semantic detail. The emergence of LLMs enables more natural, context-rich knowledge representation.

Method: Proposes hybrid design principles that maintain a minimal structural backbone while enabling flexible natural-language relation descriptions. Advocates for moving from categorical relation labels to context-sensitive textual representations that can capture uncertainty and nuance.
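
A minimal sketch of what such a hybrid record could look like; the field names below are illustrative assumptions, not a schema proposed in the paper.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class HybridEdge:
    head: str                           # entity ID (structural backbone)
    tail: str                           # entity ID (structural backbone)
    backbone: str                       # coarse symbolic relation type
    description: str                    # context-rich natural-language relation
    confidence: Optional[float] = None  # optional uncertainty

edge = HybridEdge(
    head="aspirin", tail="reye_syndrome", backbone="associated_with",
    description=("linked to elevated risk in children recovering from "
                 "viral infections, based on observational evidence"),
    confidence=0.8)
print(edge)
```

The backbone field keeps the graph traversable by existing algorithms, while the description carries the nuance and uncertainty that a categorical label would discard.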

Result: The paper presents a position advocating for fundamental redesign of relation representation in knowledge graphs, rather than just using LLMs to populate existing symbolic schemas more efficiently.

Conclusion: Knowledge graphs should evolve from symbolic relation schemas to natural-language relation descriptions, creating hybrid systems that preserve structural benefits while enabling LLM-compatible, context-rich knowledge representation that better reflects real-world relational complexity.

Abstract: Knowledge graphs (KGs) have commonly been constructed using predefined symbolic relation schemas, typically implemented as categorical relation labels. This design has notable shortcomings: real-world relations are often contextual, nuanced, and sometimes uncertain, and compressing them into discrete relation labels abstracts away critical semantic detail. Nevertheless, symbolic-relation KGs remain widely used because they have been operationally effective and broadly compatible with pre-LLM downstream models and algorithms, in which KG knowledge could be retrieved or encoded into quantified features and embeddings at scale. The emergence of LLMs has reshaped how knowledge is created and consumed. LLMs support scalable synthesis of domain facts directly in concise natural language, and prompting-based inference favors context-rich free-form text over quantified representations. This position paper argues that these changes call for rethinking the representation of relations themselves rather than merely using LLMs to populate conventional schemas more efficiently. We therefore advocate moving from symbolic to natural-language relation descriptions, and we propose hybrid design principles that preserve a minimal structural backbone while enabling more flexible and context-sensitive relational representations.

[34] How Many Human Judgments Are Enough? Feasibility Limits of Human Preference Evaluation

Wilson Y. Lee

Main category: cs.CL

TL;DR: Human preference evaluations often need far more judgments than typically collected to reliably detect small model improvements, especially when preference signals are diffuse across prompts.

DetailsMotivation: To understand why many human preference evaluations yield inconclusive results and determine how many judgments are actually needed to reliably detect small model improvements.

Method: Analyzed large-scale human preference datasets across multiple modalities (chat, image generation, code generation) to examine preference signal distribution, compared different allocation strategies, and evaluated how prompt-induced variability affects detectability.
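
To make the feasibility question concrete, here is a standard two-sided power calculation for detecting a win rate of 0.5 + delta against 0.5. This textbook normal approximation is our illustration of the paper's framing, not its exact derivation.

```python
from math import ceil
from scipy.stats import norm

def judgments_needed(delta, alpha=0.05, power=0.8):
    """Judgments required to detect a preference margin delta over 50%."""
    z_a, z_b = norm.ppf(1 - alpha / 2), norm.ppf(power)
    return ceil((z_a + z_b) ** 2 * 0.25 / delta ** 2)

print(judgments_needed(0.02))  # ~4906 judgments for a 2-point margin
print(judgments_needed(0.05))  # ~785 for a 5-point margin
```

The quadratic dependence on delta is why diffuse, small-margin comparisons quickly exceed typical annotation budgets.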

Result: Most comparisons show diffuse preference signals with small margins requiring far more judgments than typically collected. Curated benchmarks improve detectability through a 1.5× reduction in prompt-level variance. Proportional allocation is minimax-optimal in diffuse regimes.

Conclusion: Inconclusive human evaluation outcomes often reflect underpowered evaluation rather than model equivalence, highlighting the need to explicitly consider effect size, budget, and protocol design in preference evaluations.

Abstract: Human preference evaluations are widely used to compare generative models, yet it remains unclear how many judgments are required to reliably detect small improvements. We show that when preference signal is diffuse across prompts (i.e., all prompt types are similarly informative), proportional allocation is minimax-optimal: no allocation strategy substantially improves detectability. Empirical analysis of large-scale human preference datasets shows that most comparisons fall into this diffuse regime, exhibiting small preference margins that require far more judgments than typically collected, even in well-sampled comparisons. These limits persist across evaluation protocols and modalities, including chat, image generation, and code generation with execution feedback. In contrast, curated benchmarks that reduce prompt-induced variability systematically induce larger margins and improve detectability through a $1.5\times$ reduction in prompt-level variance. Our results show that inconclusive or negative human evaluation outcomes frequently reflect underpowered evaluation rather than model equivalence, underscoring the need to account explicitly for effect size, budget, and protocol design.

[35] SubTokenTest: A Practical Benchmark for Real-World Sub-token Understanding

Shuyang Hou, Yi Hu, Muhan Zhang

Main category: cs.CL

TL;DR: SubTokenTest is a new benchmark that evaluates LLMs’ sub-token understanding through practical tasks, revealing tokenization-related weaknesses despite advanced reasoning capabilities.

DetailsMotivation: LLMs struggle with basic character-level tasks due to tokenization, but these failures are often dismissed as lacking practical relevance. However, many real-world applications (text-based maps, structured tables) require precise sub-token understanding.

Method: Introduced SubTokenTest benchmark with 10 tasks across 4 domains that isolate tokenization-related failures by decoupling performance from complex reasoning. Evaluated 9 advanced LLMs, investigated test-time scaling impact, and explored character-level information encoding in hidden states.
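
The root cause is easy to demonstrate. A small example with the `tiktoken` package (our choice for illustration; the paper does not state its tooling) shows that a model never observes individual characters:

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # the standard GPT-4 encoding
tokens = enc.encode("strawberry")
print([enc.decode([t]) for t in tokens])    # typically ['str', 'aw', 'berry']
# Counting the letter 'r' requires reasoning across token boundaries,
# which is exactly the sub-token skill the benchmark probes.
```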

Result: Comprehensive evaluation reveals LLMs’ persistent weaknesses in sub-token understanding despite their advanced reasoning capabilities, highlighting the practical importance of character-level processing.

Conclusion: Tokenization remains a fundamental limitation for LLMs in practical applications requiring precise sub-token understanding, and the SubTokenTest benchmark provides a systematic way to assess and address these weaknesses.

Abstract: Recent advancements in large language models (LLMs) have significantly enhanced their reasoning capabilities. However, they continue to struggle with basic character-level tasks, such as counting letters in words, a problem rooted in their tokenization process. While existing benchmarks have highlighted this weakness through basic character operations, such failures are often dismissed due to lacking practical relevance. Yet, many real-world applications, such as navigating text-based maps or interpreting structured tables, rely heavily on precise sub-token understanding. In this regard, we introduce SubTokenTest, a comprehensive benchmark that assesses sub-token understanding through practical, utility-driven tasks. Our benchmark includes ten tasks across four domains and isolates tokenization-related failures by decoupling performance from complex reasoning. We provide a comprehensive evaluation of nine advanced LLMs. Additionally, we investigate the impact of test-time scaling on sub-token reasoning and explore how character-level information is encoded within the hidden states.

[36] Contrastive Bi-Encoder Models for Multi-Label Skill Extraction: Enhancing ESCO Ontology Matching with BERT and Attention Mechanisms

Yongming Sun

Main category: cs.CL

TL;DR: Zero-shot skill extraction framework using LLM-synthesized training data from ESCO taxonomy definitions, with hierarchical constraints for multi-skill generation, achieving strong performance on Chinese job ads without manual annotations.

DetailsMotivation: Supervised skill extraction from job ads to standardized taxonomies like ESCO is limited by scarce and expensive labeled data, especially in non-English settings where job-ad language differs from formal skill definitions.

Method: 1) LLM synthesizes training instances from ESCO definitions with hierarchical constraints based on Level-2 categories; 2) Contrastive bi-encoder with BERT backbone, BiLSTM, and attention pooling aligns job-ad sentences with skill descriptions; 3) RoBERTa-based binary filter removes non-skill sentences.
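
A sketch of the encoder head in step 2, assuming a Hugging Face backbone; the model name, hidden size, and pooling details are illustrative choices, not the paper's exact configuration.

```python
import torch
import torch.nn as nn
from transformers import AutoModel

class SkillEncoder(nn.Module):
    """BERT backbone -> BiLSTM -> attention pooling, as described above."""
    def __init__(self, name="bert-base-chinese", hidden=256):
        super().__init__()
        self.bert = AutoModel.from_pretrained(name)
        self.lstm = nn.LSTM(self.bert.config.hidden_size, hidden,
                            batch_first=True, bidirectional=True)
        self.attn = nn.Linear(2 * hidden, 1)  # per-token attention scores

    def forward(self, input_ids, attention_mask):
        h = self.bert(input_ids, attention_mask=attention_mask).last_hidden_state
        h, _ = self.lstm(h)
        scores = self.attn(h).masked_fill(
            attention_mask.unsqueeze(-1) == 0, -1e9)
        weights = torch.softmax(scores, dim=1)  # attend over token positions
        return (weights * h).sum(dim=1)         # pooled sentence embedding
```

Job-ad sentences and ESCO skill descriptions would each be encoded with this module and trained with a contrastive loss over the shared embedding space.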

Result: Hierarchy-conditioned generation improves fluency and discriminability; model achieves strong zero-shot retrieval on Chinese job ads (F1@5 = 0.72), outperforming TF-IDF and standard BERT baselines.

Conclusion: The pipeline provides scalable, data-efficient automated skill coding for labor economics and workforce analytics without needing manually labeled training data.

Abstract: Fine-grained labor market analysis increasingly relies on mapping unstructured job advertisements to standardized skill taxonomies such as ESCO. This mapping is naturally formulated as an Extreme Multi-Label Classification (XMLC) problem, but supervised solutions are constrained by the scarcity and cost of large-scale, taxonomy-aligned annotations, especially in non-English settings where job-ad language diverges substantially from formal skill definitions. We propose a zero-shot skill extraction framework that eliminates the need for manually labeled job-ad training data. The framework uses a Large Language Model (LLM) to synthesize training instances from ESCO definitions, and introduces hierarchically constrained multi-skill generation based on ESCO Level-2 categories to improve semantic coherence in multi-label contexts. On top of the synthetic corpus, we train a contrastive bi-encoder that aligns job-ad sentences with ESCO skill descriptions in a shared embedding space; the encoder augments a BERT backbone with BiLSTM and attention pooling to better model long, information-dense requirement statements. An upstream RoBERTa-based binary filter removes non-skill sentences to improve end-to-end precision. Experiments show that (i) hierarchy-conditioned generation improves both fluency and discriminability relative to unconstrained pairing, and (ii) the resulting multi-label model transfers effectively to real-world Chinese job advertisements, achieving strong zero-shot retrieval performance (F1@5 = 0.72) and outperforming TF–IDF and standard BERT baselines. Overall, the proposed pipeline provides a scalable, data-efficient pathway for automated skill coding in labor economics and workforce analytics.

[37] Adaptive Multi-Stage Patent Claim Generation with Unified Quality Assessment

Chen-Wei Liang, Bin Guo, Zhen-Yuan Wei, Mu-Jiang-Shan Wang

Main category: cs.CL

TL;DR: Novel three-stage framework for patent claim generation that addresses cross-jurisdictional generalization, semantic relationship modeling, and quality assessment limitations, achieving significant improvements over state-of-the-art models.

DetailsMotivation: Current patent claim generation systems have three fundamental limitations: poor cross-jurisdictional generalization, inadequate semantic relationship modeling between claims and prior art, and unreliable quality assessment.

Method: Three-stage framework with: 1) relationship-aware similarity analysis using multi-head attention with eight specialized heads, 2) domain-adaptive claim generation integrating curriculum learning with dynamic LoRA adapter selection across five patent domains, and 3) unified quality assessment using cross-attention mechanisms between evaluation aspects.

Result: Substantial improvements on USPTO HUPD, EPO patent collections, and Patent-CE benchmark: 7.6-point ROUGE-L gain over GPT-4o, 8.3% BERTScore enhancement over Llama-3.1-8B, 0.847 correlation with human experts (vs 0.623 for separate models), and 89.4% cross-jurisdictional performance retention (vs 76.2% for baselines).

Conclusion: The framework establishes a comprehensive solution for automated patent prosecution workflows by addressing key limitations in current systems through innovative relationship modeling, domain adaptation, and unified quality assessment approaches.

Abstract: Current patent claim generation systems face three fundamental limitations: poor cross-jurisdictional generalization, inadequate semantic relationship modeling between claims and prior art, and unreliable quality assessment. We introduce a novel three-stage framework that addresses these challenges through relationship-aware similarity analysis, domain-adaptive claim generation, and unified quality assessment. Our approach employs multi-head attention with eight specialized heads for explicit relationship modeling, integrates curriculum learning with dynamic LoRA adapter selection across five patent domains, and implements cross-attention mechanisms between evaluation aspects for comprehensive quality assessment. Extensive experiments on USPTO HUPD dataset, EPO patent collections, and Patent-CE benchmark demonstrate substantial improvements: 7.6-point ROUGE-L gain over GPT-4o, 8.3% BERTScore enhancement over Llama-3.1-8B, and 0.847 correlation with human experts compared to 0.623 for separate evaluation models. Our method maintains 89.4% cross-jurisdictional performance retention versus 76.2% for baselines, establishing a comprehensive solution for automated patent prosecution workflows.

[38] Identity-Robust Language Model Generation via Content Integrity Preservation

Miao Zhang, Kelly Chen, Md Mehrab Tanjim, Rumi Chunara

Main category: cs.CL

TL;DR: LLM outputs degrade in quality based on user sociodemographic attributes, even for objective questions. The paper proposes a training-free framework to neutralize non-critical identity information while preserving semantic content, reducing identity-dependent bias by 77%.

DetailsMotivation: LLMs show disparities in factual accuracy, utility, and safety across different user sociodemographic attributes, even when demographic information is irrelevant to the question. This identity-dependent degradation of core response quality represents a critical gap in LLM fairness beyond stereotypical or representational bias.

Method: Proposes a lightweight, training-free framework for identity-robust generation that selectively neutralizes non-critical identity information while preserving semantically essential attributes. This maintains output content integrity without requiring model retraining.
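
As a simplified, hedged illustration of selective neutralization, the sketch below swaps out identity mentions unless a (stubbed) criticality check deems them essential to the question; the paper's actual detection and rewriting machinery is more involved.

```python
def neutralize(prompt, is_critical, identity_terms):
    """Remove identity cues unless they are essential to the question."""
    out = prompt
    for term in identity_terms:
        if term in out and not is_critical(out, term):
            out = out.replace(term, "a person")  # neutral substitute
    return out

# Objective question: the identity cue is non-critical, so it is dropped.
print(neutralize(
    "As a 70-year-old woman, what is the boiling point of water?",
    is_critical=lambda p, t: False,  # stub criticality judge
    identity_terms=["a 70-year-old woman"]))
```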

Result: Experiments across four benchmarks and 18 sociodemographic identities demonstrate an average 77% reduction in identity-dependent bias compared to vanilla prompting, and a 45% reduction relative to existing prompt-based defenses.

Conclusion: The work addresses a critical gap in mitigating the impact of user identity cues in prompts on core generation quality, showing that while factual knowledge is robustly encoded across identities, biased generation behavior causes degradation that can be effectively reduced through selective identity neutralization.

Abstract: Large Language Model (LLM) outputs often vary across user sociodemographic attributes, leading to disparities in factual accuracy, utility, and safety, even for objective questions where demographic information is irrelevant. Unlike prior work on stereotypical or representational bias, this paper studies identity-dependent degradation of core response quality. We show empirically that such degradation arises from biased generation behavior, despite factual knowledge being robustly encoded across identities. Motivated by this mismatch, we propose a lightweight, training-free framework for identity-robust generation that selectively neutralizes non-critical identity information while preserving semantically essential attributes, thus maintaining output content integrity. Experiments across four benchmarks and 18 sociodemographic identities demonstrate an average 77% reduction in identity-dependent bias compared to vanilla prompting and a 45% reduction relative to prompt-based defenses. Our work addresses a critical gap in mitigating the impact of user identity cues in prompts on core generation quality.

[39] OrthoGeoLoRA: Geometric Parameter-Efficient Fine-Tuning for Structured Social Science Concept Retrieval on theWeb

Zeqiang Wang, Xinyue Wu, Chenxi Li, Zixi Chen, Nishanth Sastry, Jon Johnson, Suparna De

Main category: cs.CL

TL;DR: OrthoGeoLoRA improves LoRA by enforcing orthogonal constraints on low-rank factors to address geometric issues, outperforming standard LoRA on social science concept retrieval with better parameter efficiency.

DetailsMotivation: Fine-tuning large language models for web-based social science systems is computationally expensive, especially for smaller institutions. Standard LoRA has geometric drawbacks (gauge freedom, scale ambiguity, rank collapse) that limit its effectiveness.

Method: Introduces OrthoGeoLoRA which enforces SVD-like form ΔW = BΣA⊤ by constraining low-rank factors to be orthogonal (Stiefel manifold). Uses geometric reparameterization compatible with standard optimizers like Adam.
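
A minimal sketch of the SVD-like update with orthogonal factors, using PyTorch's built-in orthogonal parametrization as a practical stand-in for Stiefel-manifold optimization; shapes and rank are illustrative.

```python
import torch
import torch.nn as nn
from torch.nn.utils.parametrizations import orthogonal

d_out, d_in, r = 768, 768, 8
B = orthogonal(nn.Linear(r, d_out, bias=False))  # weight (d_out, r), orthonormal cols
A = orthogonal(nn.Linear(r, d_in, bias=False))   # weight (d_in, r), orthonormal cols
sigma = nn.Parameter(torch.ones(r))              # learnable "singular values"

def delta_w():
    # Delta W = B diag(sigma) A^T: no gauge freedom or scale ambiguity,
    # and any rank collapse is visible directly in sigma.
    return B.weight @ torch.diag(sigma) @ A.weight.T

print(delta_w().shape)  # torch.Size([768, 768])
```

Because the constraint lives in the parametrization, training still uses plain Adam, consistent with the paper's claim of compatibility with standard optimizers.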

Result: OrthoGeoLoRA outperforms standard LoRA and several PEFT variants on hierarchical concept retrieval benchmark using European Language Social Science Thesaurus (ELSST). Achieves better ranking metrics under same low-rank budget.

Conclusion: OrthoGeoLoRA provides more compute- and parameter-efficient adaptation of foundation models for resource-constrained settings, addressing geometric limitations of standard LoRA while maintaining compatibility with existing pipelines.

Abstract: Large language models and text encoders increasingly power web-based information systems in the social sciences, including digital libraries, data catalogues, and search interfaces used by researchers, policymakers, and civil society. Full fine-tuning is often computationally and energy intensive, which can be prohibitive for smaller institutions and non-profit organizations in the Web4Good ecosystem. Parameter-Efficient Fine-Tuning (PEFT), especially Low-Rank Adaptation (LoRA), reduces this cost by updating only a small number of parameters. We show that the standard LoRA update $\Delta W = BA^\top$ has geometric drawbacks: gauge freedom, scale ambiguity, and a tendency toward rank collapse. We introduce OrthoGeoLoRA, which enforces an SVD-like form $\Delta W = B\Sigma A^\top$ by constraining the low-rank factors to be orthogonal (Stiefel manifold). A geometric reparameterization implements this constraint while remaining compatible with standard optimizers such as Adam and existing fine-tuning pipelines. We also propose a benchmark for hierarchical concept retrieval over the European Language Social Science Thesaurus (ELSST), widely used to organize social science resources in digital repositories. Experiments with a multilingual sentence encoder show that OrthoGeoLoRA outperforms standard LoRA and several strong PEFT variants on ranking metrics under the same low-rank budget, offering a more compute- and parameter-efficient path to adapt foundation models in resource-constrained settings.

[40] ProFit: Leveraging High-Value Signals in SFT via Probability-Guided Token Selection

Tao Liu, Taiqiang Wu, Runming Yang, Shaoning Sun, Junjie Wang, Yujiu Yang

Main category: cs.CL

TL;DR: ProFit improves SFT by masking low-probability tokens to prevent overfitting to non-core expressions, outperforming traditional SFT on reasoning tasks.

DetailsMotivation: Traditional SFT forces alignment with single reference answers, ignoring language's one-to-many nature and causing overfitting to non-core expressions. While multiple references could help, they're too costly, so the focus shifts to mitigating single-reference overfitting.

Method: ProFit leverages the insight that high-probability tokens carry core logical framework while low-probability tokens are replaceable expressions. It selectively masks low-probability tokens during training to prevent surface-level overfitting.
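
A hedged sketch of the masking idea: compute per-token reference log-probabilities, drop the low-probability tail from the loss, and average over the rest. The quantile threshold is our assumption; ProFit's actual selection rule may differ.

```python
import torch
import torch.nn.functional as F

def profit_loss(logits, labels, mask_quantile=0.5):
    """SFT cross-entropy with the lowest-probability reference tokens masked."""
    logp = F.log_softmax(logits, dim=-1)
    tok_logp = logp.gather(-1, labels.unsqueeze(-1)).squeeze(-1)  # (B, T)
    with torch.no_grad():
        thresh = torch.quantile(tok_logp, mask_quantile)
        mask = (tok_logp >= thresh).float()  # keep high-probability tokens
    return -(tok_logp * mask).sum() / mask.sum().clamp(min=1.0)

# Toy demo: batch of 2 sequences, 5 positions, vocabulary of 10.
logits = torch.randn(2, 5, 10, requires_grad=True)
labels = torch.randint(0, 10, (2, 5))
profit_loss(logits, labels).backward()
```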

Result: Extensive experiments show ProFit consistently outperforms traditional SFT baselines on general reasoning and mathematical benchmarks.

Conclusion: ProFit provides an effective and efficient solution to mitigate single-reference overfitting in SFT by focusing on semantic importance through token probability analysis, improving model alignment without costly data expansion.

Abstract: Supervised fine-tuning (SFT) is a fundamental post-training strategy to align Large Language Models (LLMs) with human intent. However, traditional SFT often ignores the one-to-many nature of language by forcing alignment with a single reference answer, leading to the model overfitting to non-core expressions. Although our empirical analysis suggests that introducing multiple reference answers can mitigate this issue, the prohibitive data and computational costs necessitate a strategic shift: prioritizing the mitigation of single-reference overfitting over the costly pursuit of answer diversity. To achieve this, we reveal the intrinsic connection between token probability and semantic importance: high-probability tokens carry the core logical framework, while low-probability tokens are mostly replaceable expressions. Based on this insight, we propose ProFit, which selectively masks low-probability tokens to prevent surface-level overfitting. Extensive experiments confirm that ProFit consistently outperforms traditional SFT baselines on general reasoning and mathematical benchmarks.

[41] A.X K1 Technical Report

Sung Jun Cheon, Jaekyung Cho, Seongho Choi, Hyunjun Eun, Seokhwan Jo, Jaehyun Jun, Minsoo Kang, Jin Kim, Jiwon Kim, Minsang Kim, Sungwan Kim, Seungsik Kim, Tae Yoon Kim, Youngrang Kim, Hyeongmun Lee, Sangyeol Lee, Sungeun Lee, Youngsoon Lee, Yujin Lee, Seongmin Ok, Chanyong Park, Hyewoong Park, Junyoung Park, Hyunho Yang, Subin Yi, Soohyun Bae, Dhammiko Arya, Yongseok Choi, Sangho Choi, Dongyeon Cho, Seungmo Cho, Gyoungeun Han, Yong-jin Han, Seokyoung Hong, Hyeon Hwang, Wonbeom Jang, Minjeong Ju, Wonjin Jung, Keummin Ka, Sungil Kang, Dongnam Kim, Joonghoon Kim, Jonghwi Kim, SaeRom Kim, Sangjin Kim, Seongwon Kim, Youngjin Kim, Seojin Lee, Sunwoo Lee, Taehoon Lee, Chanwoo Park, Sohee Park, Sooyeon Park, Yohan Ra, Sereimony Sek, Seungyeon Seo, Gun Song, Sanghoon Woo, Janghan Yoon, Sungbin Yoon

Main category: cs.CL

TL;DR: A.X K1 is a 519B-parameter Mixture-of-Experts language model trained from scratch on 10T tokens, featuring controllable reasoning capabilities and competitive performance with Korean-language advantages.

DetailsMotivation: The paper aims to bridge the gap between reasoning capability and inference efficiency in large language models, enabling scalable deployment across diverse real-world scenarios through explicitly controllable reasoning.

Method: The model uses scaling laws to optimize training configurations and vocabulary size under fixed computational budgets. It employs a multi-stage data processing pipeline for corpus curation and introduces a Think-Fusion training recipe that enables user-controlled switching between thinking and non-thinking modes within a single unified model.

Result: A.X K1 achieves performance competitive with leading open-source models and establishes a distinctive advantage in Korean-language benchmarks, demonstrating effective controllable reasoning capabilities.

Conclusion: The paper presents A.X K1 as an efficient MoE language model with explicit reasoning control, offering competitive performance and Korean-language specialization, enabling practical deployment in diverse scenarios.

Abstract: We introduce A.X K1, a 519B-parameter Mixture-of-Experts (MoE) language model trained from scratch. Our design leverages scaling laws to optimize training configurations and vocabulary size under fixed computational budgets. A.X K1 is pre-trained on a corpus of approximately 10T tokens, curated by a multi-stage data processing pipeline. Designed to bridge the gap between reasoning capability and inference efficiency, A.X K1 supports explicitly controllable reasoning to facilitate scalable deployment across diverse real-world scenarios. We propose a simple yet effective Think-Fusion training recipe, enabling user-controlled switching between thinking and non-thinking modes within a single unified model. Extensive evaluations demonstrate that A.X K1 achieves performance competitive with leading open-source models, while establishing a distinctive advantage in Korean-language benchmarks.

[42] UserLM-R1: Modeling Human Reasoning in User Language Models with Multi-Reward Reinforcement Learning

Feng Zhang, Shijia Li, Chunmao Zhang, Zhanyu Ma, Jun Xu, Jiuchong Gao, Jinghua Hao, Renqing He, Jingwen Xu, Han Liu

Main category: cs.CL

TL;DR: UserLM-R1: A reasoning-capable user language model that improves generalization and strategic negotiation capabilities through dynamic profiles and goal-driven decision-making with reinforcement learning.

DetailsMotivation: Current user simulators for agent training have limitations: they use static, context-unaware profiles that require manual redesign for new scenarios, and they lack human strategic thinking, making them vulnerable to agent manipulation.

Method: 1) Construct comprehensive user profiles with both static roles and dynamic scenario-specific goals for adaptation. 2) Propose goal-driven decision-making policy that generates rationales before responses. 3) Refine reasoning and strategic capabilities with supervised fine-tuning and multi-reward reinforcement learning.

Result: Extensive experiments show UserLM-R1 outperforms competitive baselines, particularly on more challenging adversarial sets.

Conclusion: UserLM-R1 addresses key limitations of current user simulators by providing better generalization across domains and improved strategic negotiation capabilities through reasoning and reinforcement learning.

Abstract: User simulators serve as the critical interactive environment for agent post-training, and an ideal user simulator generalizes across domains and proactively engages in negotiation by challenging or bargaining. However, current methods exhibit two issues. They rely on static and context-unaware profiles, necessitating extensive manual redesign for new scenarios, thus limiting generalizability. Moreover, they neglect human strategic thinking, leading to vulnerability to agent manipulation. To address these issues, we propose UserLM-R1, a novel user language model with reasoning capability. Specifically, we first construct comprehensive user profiles with both static roles and dynamic scenario-specific goals for adaptation to diverse scenarios. Then, we propose a goal-driven decision-making policy to generate high-quality rationales before producing responses, and further refine the reasoning and improve strategic capabilities with supervised fine-tuning and multi-reward reinforcement learning. Extensive experimental results demonstrate that UserLM-R1 outperforms competitive baselines, particularly on the more challenging adversarial set.

[43] When to Trust: A Causality-Aware Calibration Framework for Accurate Knowledge Graph Retrieval-Augmented Generation

Jing Ren, Bowen Li, Ziqi Xu, Xinkun Zhang, Haytham Fayek, Xiaodong Li

Main category: cs.CL

TL;DR: Ca2KG is a causality-aware calibration framework for KG-RAG that addresses overconfidence issues by integrating counterfactual prompting and panel-based re-scoring to improve uncertainty estimation.

Motivation: Existing KG-RAG models are severely overconfident, producing high-confidence predictions even when retrieved knowledge graphs are incomplete or unreliable, which is problematic for deployment in high-stakes domains.

Method: Ca2KG integrates counterfactual prompting to expose retrieval-dependent uncertainties in knowledge quality and reasoning reliability, combined with a panel-based re-scoring mechanism that stabilizes predictions across interventions.
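
A minimal sketch of the panel idea, assuming the calibrated confidence is a simple mean over counterfactual interventions (the paper's aggregation may differ):

```python
# Re-score the same question under the original retrieval and under
# counterfactual interventions (e.g. masked sub-graphs), then aggregate.
# The mean aggregation and the intervention set are illustrative.

import statistics

def panel_confidence(score_fn, question, contexts):
    """score_fn(question, context) -> model confidence in [0, 1]."""
    return statistics.mean(score_fn(question, c) for c in contexts)

toy_score = lambda q, c: 0.95 if c == "full_kg" else 0.55   # toy scorer
print(panel_confidence(toy_score, "Who discovered X?",
                       ["full_kg", "masked_kg", "shuffled_kg"]))  # ~0.68
```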

Result: Extensive experiments on two complex QA datasets demonstrate that Ca2KG consistently improves calibration while maintaining or even enhancing predictive accuracy.

Conclusion: Ca2KG effectively addresses the overconfidence problem in KG-RAG systems, making them more reliable for deployment in critical applications by providing better uncertainty estimation without sacrificing performance.

Abstract: Knowledge Graph Retrieval-Augmented Generation (KG-RAG) extends the RAG paradigm by incorporating structured knowledge from knowledge graphs, enabling Large Language Models (LLMs) to perform more precise and explainable reasoning. While KG-RAG improves factual accuracy in complex tasks, existing KG-RAG models are often severely overconfident, producing high-confidence predictions even when retrieved sub-graphs are incomplete or unreliable, which raises concerns for deployment in high-stakes domains. To address this issue, we propose Ca2KG, a Causality-aware Calibration framework for KG-RAG. Ca2KG integrates counterfactual prompting, which exposes retrieval-dependent uncertainties in knowledge quality and reasoning reliability, with a panel-based re-scoring mechanism that stabilises predictions across interventions. Extensive experiments on two complex QA datasets demonstrate that Ca2KG consistently improves calibration while maintaining or even enhancing predictive accuracy.

[44] TeachPro: Multi-Label Qualitative Teaching Evaluation via Cross-View Graph Synergy and Semantic Anchored Evidence Encoding

Xiangqian Wang, Yifan Jia, Yang Xiang, Yumin Zhang, Yanbin Wang, Ke Liu

Main category: cs.CL

TL;DR: TeachPro is a multi-label learning framework that analyzes open-ended student comments to assess five key teaching dimensions, addressing limitations of traditional binary sentiment analysis in student evaluations.

Motivation: Standardized student evaluations suffer from low reliability, restricted response options, and response distortion. Existing ML methods reduce feedback to binary sentiment, overlooking concrete concerns like content clarity, feedback timeliness, and instructor demeanor, providing limited guidance for instructional improvement.

Method: TeachPro uses: 1) Dimension-Anchored Evidence Encoder with pre-trained text encoder, prompt module for five teaching dimensions, and cross-attention mechanism; 2) Cross-View Graph Synergy Network with syntactic branch (grammatical dependencies) and semantic branch (BERT-based similarity graphs), BiAffine fusion, and differential regularizer; 3) Cross-attention to bridge evidence with comment representations.
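
To illustrate the "learnable semantic anchor" component, here is a minimal PyTorch sketch in which one learnable query per teaching dimension cross-attends over comment token embeddings. Sizes and the single-head attention are assumptions; the full model adds the graph branches and BiAffine fusion.

```python
import torch
import torch.nn as nn

class DimensionAnchors(nn.Module):
    """Five learnable anchors gather dimension-specific evidence."""
    def __init__(self, num_dims: int = 5, d_model: int = 128):
        super().__init__()
        self.anchors = nn.Parameter(torch.randn(num_dims, d_model))
        self.attn = nn.MultiheadAttention(d_model, num_heads=1,
                                          batch_first=True)

    def forward(self, tokens):                     # (batch, seq, d_model)
        queries = self.anchors.unsqueeze(0).expand(tokens.size(0), -1, -1)
        evidence, _ = self.attn(queries, tokens, tokens)
        return evidence                            # (batch, num_dims, d_model)

print(DimensionAnchors()(torch.randn(2, 40, 128)).shape)  # [2, 5, 128]
```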

Result: Extensive experiments demonstrate superior diagnostic granularity and robustness across diverse evaluation settings. The paper also contributes a novel benchmark dataset with expert qualitative annotations and multi-label scores.

Conclusion: TeachPro provides a systematic framework for multi-dimensional teaching assessment from open-ended comments, offering more granular and actionable feedback for instructional improvement compared to existing methods.

Abstract: Standardized Student Evaluations of Teaching often suffer from low reliability, restricted response options, and response distortion. Existing machine learning methods that mine open-ended comments usually reduce feedback to binary sentiment, which overlooks concrete concerns such as content clarity, feedback timeliness, and instructor demeanor, and provides limited guidance for instructional improvement. We propose TeachPro, a multi-label learning framework that systematically assesses five key teaching dimensions: professional expertise, instructional behavior, pedagogical efficacy, classroom experience, and other performance metrics. We first propose a Dimension-Anchored Evidence Encoder, which integrates three core components: (i) a pre-trained text encoder that transforms qualitative feedback annotations into contextualized embeddings; (ii) a prompt module that represents five teaching dimensions as learnable semantic anchors; and (iii) a cross-attention mechanism that aligns evidence with pedagogical dimensions within a structured semantic space. We then propose a Cross-View Graph Synergy Network to represent student comments. This network comprises two components: (i) a Syntactic Branch that extracts explicit grammatical dependencies from parse trees, and (ii) a Semantic Branch that models latent conceptual relations derived from BERT-based similarity graphs. A BiAffine fusion module aligns syntactic and semantic units, while a differential regularizer disentangles embeddings to encourage complementary representations. Finally, a cross-attention mechanism bridges the dimension-anchored evidence with the multi-view comment representations. We also contribute a novel benchmark dataset featuring expert qualitative annotations and multi-label scores. Extensive experiments demonstrate that TeachPro offers superior diagnostic granularity and robustness across diverse evaluation settings.

[45] When to Invoke: Refining LLM Fairness with Toxicity Assessment

Jing Ren, Bowen Li, Ziqi Xu, Renqiang Luo, Shuo Yu, Xin Ye, Haytham Fayek, Xiaodong Li, Feng Xia

Main category: cs.CL

TL;DR: FairToT is an inference-time framework that improves LLM fairness in toxicity assessment by identifying when demographic-related variations occur and applying corrective mechanisms through prompt guidance, without modifying model parameters.

Motivation: LLMs are increasingly used for toxicity assessment in online moderation, but they often produce inconsistent judgements for subtle expressions like implicit hate speech, revealing underlying biases that are hard to correct through standard training. Existing approaches overlook when corrective mechanisms should be invoked for fair and reliable assessments.

Method: FairToT is an inference-time framework that enhances LLM fairness through prompt-guided toxicity assessment. It identifies cases where demographic-related variation is likely to occur and determines when additional assessment should be applied. The method introduces two interpretable fairness indicators that detect such cases and improve inference consistency without modifying model parameters.
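
A toy sketch of the "when to invoke" decision, assuming the indicator is the score spread across demographic-swapped variants of a text (the paper defines its own two indicators):

```python
# Trigger additional assessment only when toxicity scores diverge across
# demographic variants. Variant construction and threshold are illustrative.

def needs_extra_assessment(toxicity_fn, variants, threshold=0.15):
    scores = [toxicity_fn(v) for v in variants]
    return max(scores) - min(scores) > threshold

toy = {"statement about group A": 0.70, "statement about group B": 0.35}
print(needs_extra_assessment(toy.get, list(toy)))  # True: invoke correction
```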

Result: Experiments on benchmark datasets show that FairToT reduces group-level disparities while maintaining stable and reliable toxicity predictions, demonstrating that inference-time refinement offers an effective and practical approach for fairness improvement in LLM-based toxicity assessment systems.

Conclusion: Inference-time refinement through FairToT provides an effective and practical approach for improving fairness in LLM-based toxicity assessment systems, addressing biases in subtle expressions like implicit hate speech without requiring model parameter modifications.

Abstract: Large Language Models (LLMs) are increasingly used for toxicity assessment in online moderation systems, where fairness across demographic groups is essential for equitable treatment. However, LLMs often produce inconsistent toxicity judgements for subtle expressions, particularly those involving implicit hate speech, revealing underlying biases that are difficult to correct through standard training. This raises a key question that existing approaches often overlook: when should corrective mechanisms be invoked to ensure fair and reliable assessments? To address this, we propose FairToT, an inference-time framework that enhances LLM fairness through prompt-guided toxicity assessment. FairToT identifies cases where demographic-related variation is likely to occur and determines when additional assessment should be applied. In addition, we introduce two interpretable fairness indicators that detect such cases and improve inference consistency without modifying model parameters. Experiments on benchmark datasets show that FairToT reduces group-level disparities while maintaining stable and reliable toxicity predictions, demonstrating that inference-time refinement offers an effective and practical approach for fairness improvement in LLM-based toxicity assessment systems. The source code can be found at https://aisuko.github.io/fair-tot/.

[46] MCGA: A Multi-task Classical Chinese Literary Genre Audio Corpus

Yexing Du, Kaiyuan Liu, Bihe Zhang, Youcheng Pan, Bo Yang, Liangyu Huo, Xiyuan Zhang, Jian Xie, Daojing He, Yang Xiang, Ming Liu, Bin Qin

Main category: cs.CL

TL;DR: MCGA is a multi-task audio corpus for Chinese Classical Studies that includes six speech tasks to evaluate multimodal LLMs, revealing current models’ limitations in handling classical Chinese audio content.

Motivation: To address the gap in audio corpus research for Chinese Classical Studies, as existing MLLM research has focused mainly on text and visual modalities while audio remains underexplored.

Method: Created MCGA corpus covering six speech tasks: ASR, speech-to-text translation, emotion captioning, spoken QA, speech understanding, and speech reasoning across diverse literary genres.

Result: Evaluation of ten MLLMs shows substantial challenges on MCGA test set, indicating current models struggle with classical Chinese audio processing.

Conclusion: MCGA corpus and evaluation metrics are released publicly to advance MLLMs with robust multidimensional audio capabilities for Chinese Classical Studies.

Abstract: With the rapid advancement of Multimodal Large Language Models (MLLMs), their potential has garnered significant attention in Chinese Classical Studies (CCS). While existing research has primarily focused on text and visual modalities, the audio corpus within this domain remains largely underexplored. To bridge this gap, we propose the Multi-task Classical Chinese Literary Genre Audio Corpus (MCGA). It encompasses a diverse range of literary genres across six tasks: Automatic Speech Recognition (ASR), Speech-to-Text Translation (S2TT), Speech Emotion Captioning (SEC), Spoken Question Answering (SQA), Speech Understanding (SU), and Speech Reasoning (SR). Through the evaluation of ten MLLMs, our experimental results demonstrate that current models still face substantial challenges when evaluated on the MCGA test set. Furthermore, we introduce an evaluation metric for SEC and a metric to measure the consistency between the speech and text capabilities of MLLMs. We release MCGA and our code to the public to facilitate the development of MLLMs with more robust multidimensional audio capabilities in CCS. MCGA Corpus: https://github.com/yxduir/MCGA

[47] ReGraM: Region-First Knowledge Graph Reasoning for Medical Question Answering

Chaerin Lee, Sohee Park, Hyunsik Na, Daseon Choi

Main category: cs.CL

TL;DR: ReGraM introduces a region-first knowledge graph reasoning framework for medical QA that constructs query-aligned subgraphs and performs stepwise reasoning within localized regions, outperforming baseline methods and reducing hallucinations.

Motivation: Existing Medical QA approaches that integrate LLMs with biomedical KGs rely on traversing entire graphs or large-scale retrieval, introducing noise and unstable multi-hop reasoning. The core challenge is identifying appropriate evidence subsets rather than expanding knowledge access.

Method: ReGraM constructs query-aligned subgraphs and performs stepwise reasoning constrained to localized regions under multiple evidence-aware modes, focusing inference only on the most relevant KG portions rather than assuming all relations are equally useful.
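
A small sketch of the region-first idea, using a hop-limited neighbourhood around query entities as a stand-in for ReGraM's query-aligned region construction:

```python
# Restrict reasoning to a local sub-graph around query entities instead of
# traversing the whole KG. networkx's ego_graph approximates the region.

import networkx as nx

kg = nx.Graph()
kg.add_edge("aspirin", "COX-1", relation="inhibits")
kg.add_edge("aspirin", "fever", relation="treats")
kg.add_edge("COX-1", "prostaglandin", relation="produces")
kg.add_edge("ibuprofen", "fever", relation="treats")

def query_region(graph, seeds, hops=1):
    region = nx.Graph()
    for seed in seeds:
        region = nx.compose(region, nx.ego_graph(graph, seed, radius=hops))
    return region

print(sorted(query_region(kg, ["aspirin"]).nodes()))
# ['COX-1', 'aspirin', 'fever']
```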

Result: Experiments on seven medical QA benchmarks show ReGraM consistently outperforms KGARevion baseline with 8.04% absolute accuracy gain on MCQ, 4.50% gain on SAQ, and 42.9% reduction in hallucination rate. Ablation studies confirm region construction alignment with hop-wise reasoning drives improvements.

Conclusion: Region-first KG reasoning is an effective paradigm for improving factual accuracy and consistency in medical QA by focusing on relevant evidence subsets rather than entire knowledge graphs.

Abstract: Recent studies in medical question answering (Medical QA) have actively explored the integration of large language models (LLMs) with biomedical knowledge graphs (KGs) to improve factual accuracy. However, most existing approaches still rely on traversing the entire KG or performing large-scale retrieval, which introduces substantial noise and leads to unstable multi-hop reasoning. We argue that the core challenge lies not in expanding access to knowledge, but in identifying and reasoning over the appropriate subset of evidence for each query. ReGraM is a region-first knowledge graph reasoning framework that addresses this challenge by constructing a query-aligned subgraph and performing stepwise reasoning constrained to this localized region under multiple evidence-aware modes. By focusing inference on only the most relevant portion of the KG, ReGraM departs from the assumption that all relations are equally useful, an assumption that rarely holds in domain-specific medical settings. Experiments on seven medical QA benchmarks demonstrate that ReGraM consistently outperforms a strong baseline (KGARevion), achieving an 8.04% absolute accuracy gain on MCQ, a 4.50% gain on SAQ, and a 42.9% reduction in hallucination rate. Ablation and qualitative analyses further show that aligning region construction with hop-wise reasoning is the primary driver of these improvements. Overall, our results highlight region-first KG reasoning as an effective paradigm for improving factual accuracy and consistency in medical QA.

[48] Understanding or Memorizing? A Case Study of German Definite Articles in Language Models

Jonathan Drechsel, Erisa Bytyqi, Steffen Herbold

Main category: cs.CL

TL;DR: Language models’ performance on German definite article agreement shows evidence of memorization rather than rule-based generalization, as parameter updates for specific gender-case transitions affect unrelated settings with overlapping neuron patterns.

Motivation: To determine whether language models' grammatical agreement performance reflects rule-based generalization or memorization, specifically for German definite singular articles where forms depend on gender and case.

Method: Used GRADIEND, a gradient-based interpretability method, to learn parameter update directions for gender-case specific article transitions in German definite articles.

Result: Updates learned for specific gender-case article transitions frequently affect unrelated gender-case settings, with substantial overlap among the most affected neurons across different settings.
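
The overlap measurement can be made concrete with a small sketch: take the top-k most affected neurons (largest absolute update) per setting and compute their Jaccard overlap. The random vectors below merely stand in for GRADIEND-learned directions.

```python
import numpy as np

rng = np.random.default_rng(0)
update_masc_nom = rng.normal(size=768)   # stand-in update directions
update_fem_dat = rng.normal(size=768)

def top_k(update, k=50):
    return set(np.argsort(np.abs(update))[-k:])

a, b = top_k(update_masc_nom), top_k(update_fem_dat)
print("Jaccard overlap:", round(len(a & b) / len(a | b), 3))
```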

Conclusion: Models at least partly rely on memorized associations rather than abstract grammatical rules for German definite article agreement, arguing against a strictly rule-based encoding.

Abstract: Language models perform well on grammatical agreement, but it is unclear whether this reflects rule-based generalization or memorization. We study this question for German definite singular articles, whose forms depend on gender and case. Using GRADIEND, a gradient-based interpretability method, we learn parameter update directions for gender-case specific article transitions. We find that updates learned for a specific gender-case article transition frequently affect unrelated gender-case settings, with substantial overlap among the most affected neurons across settings. These results argue against a strictly rule-based encoding of German definite articles, indicating that models at least partly rely on memorized associations rather than abstract grammatical rules.

[49] Improving Implicit Hate Speech Detection via a Community-Driven Multi-Agent Framework

Ewelina Gajewska, Katarzyna Budzynska, Jarosław A Chudziak

Main category: cs.CL

TL;DR: Multi-agent system with Moderator and Community Agents improves hate speech detection by incorporating socio-cultural context, outperforming state-of-the-art methods on the ToxiGen dataset with better accuracy and fairness.

Motivation: Current hate speech detection methods often lack socio-cultural context and struggle with implicitly hateful speech, especially across different demographic groups. There's a need for more identity-aware moderation that considers specific community perspectives.

Method: Proposes a contextualised detection framework using a multi-agent system: a central Moderator Agent coordinates with dynamically constructed Community Agents representing specific demographic groups. Integrates socio-cultural context from public knowledge sources.
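
A minimal sketch of the consultative pattern, with a plain majority vote standing in for the paper's LLM-backed deliberation:

```python
# A moderator queries one agent per targeted demographic group and
# aggregates their judgements. `ask` is a placeholder for an LLM agent.

def moderate(text, groups, ask):
    votes = {g: ask(g, text) for g in groups}
    n_hateful = sum(v == "hateful" for v in votes.values())
    return ("hateful" if n_hateful > len(groups) / 2 else "benign"), votes

toy_ask = lambda g, t: "hateful" if g in t else "benign"
print(moderate("post targeting group_a and group_b",
               ["group_a", "group_b", "group_c"], toy_ask))
```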

Result: Outperforms state-of-the-art prompting methods (zero-shot, few-shot, chain-of-thought) and alternative approaches on the challenging ToxiGen dataset. Shows significant improvements in both classification accuracy and fairness across all target groups using balanced accuracy metrics.

Conclusion: The community-driven consultative framework with explicit socio-cultural context integration provides more accurate and fair hate speech detection, particularly for implicitly hateful content across diverse demographic groups.

Abstract: This work proposes a contextualised detection framework for implicitly hateful speech, implemented as a multi-agent system comprising a central Moderator Agent and dynamically constructed Community Agents representing specific demographic groups. Our approach explicitly integrates socio-cultural context from publicly available knowledge sources, enabling identity-aware moderation that surpasses state-of-the-art prompting methods (zero-shot prompting, few-shot prompting, chain-of-thought prompting) and alternative approaches on the challenging ToxiGen dataset. We enhance the technical rigour of performance evaluation by incorporating balanced accuracy as a central metric of classification fairness that accounts for the trade-off between true positive and true negative rates. We demonstrate that our community-driven consultative framework significantly improves both classification accuracy and fairness across all target groups.

[50] Frame of Reference: Addressing the Challenges of Common Ground Representation in Situational Dialogs

Biswesh Mohapatra, Théo Charlot, Giovanni Duca, Mayank Palan, Laurent Romary, Justine Cassell

Main category: cs.CL

TL;DR: The paper investigates how to represent and store common ground in situated dialogues for LLMs, evaluating methods to improve both establishment and use of shared references.

Motivation: While LLMs can perform grounding acts like clarification requests, there's little work on explicitly representing and storing common ground for later use, making it unclear if these behaviors reflect true grounded understanding.

Method: The authors evaluate models’ ability to establish and exploit common ground through relational references in situational dialogues, test multiple methods for representing common ground, and propose approaches to improve both establishment and subsequent use.

Result: The abstract reports no specific results; the paper presents findings from evaluating different common-ground representation methods and the proposed improvement approaches.

Conclusion: The research addresses a gap in representing and storing common ground for LLMs in situated dialogues, with implications for developing more coherent and context-aware dialog systems.

Abstract: Common ground plays a critical role in situated spoken dialogues, where interlocutors must establish and maintain shared references to entities, events, and relations to sustain coherent interaction. For dialog systems, the ability to correctly ground conversational content in order to refer back to it later is particularly important. Prior studies have demonstrated that LLMs are capable of performing grounding acts such as requesting clarification or producing acknowledgments, yet relatively little work has investigated how common ground can be explicitly represented and stored for later use. Without such mechanisms, it remains unclear whether acknowledgment or clarification behaviors truly reflect a grounded understanding. In this work, we evaluate a model’s ability to establish and exploit common ground through relational references to entities within the shared context in a situational dialogue. We test multiple methods for representing common ground in situated dialogues and further propose approaches to improve both the establishment of common ground and its subsequent use in the conversation.

[51] Relation Extraction Capabilities of LLMs on Clinical Text: A Bilingual Evaluation for English and Turkish

Aidana Aidynkyzy, Oğuz Dikenelli, Oylum Alatlı, Şebnem Bora

Main category: cs.CL

TL;DR: First bilingual evaluation of LLMs for clinical relation extraction in English and Turkish, introducing a parallel dataset and proposing Relation-Aware Retrieval (RAR) method that outperforms fine-tuned baselines.

Motivation: Addresses the scarcity of annotated clinical datasets for non-English languages, which hinders evaluation of LLM-based methods developed primarily for English, focusing on clinical relation extraction.

Method: Created first English-Turkish parallel clinical RE dataset from i2b2/VA corpus; systematically evaluated diverse prompting strategies (ICL, CoT) vs fine-tuned baselines; proposed Relation-Aware Retrieval (RAR) using contrastive learning for example selection.
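
As an illustration of retrieval-based demonstration selection, the sketch below picks the nearest annotated examples for a query; a random projection stands in for the contrastively trained relation-aware encoder, and the labels use the i2b2 2010 relation types mentioned in the paper's corpus.

```python
import numpy as np

rng = np.random.default_rng(1)
embed = lambda text: rng.normal(size=64)   # placeholder, not semantic

pool = ["sentence labeled TrIP", "sentence labeled TrAP",
        "sentence labeled PIP"]
pool_vecs = np.stack([embed(s) for s in pool])

def select_demonstrations(query, k=2):
    q = embed(query)
    sims = pool_vecs @ q / (np.linalg.norm(pool_vecs, axis=1)
                            * np.linalg.norm(q))
    return [pool[i] for i in np.argsort(sims)[-k:][::-1]]

print(select_demonstrations("the drug improved the patient's condition"))
```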

Result: Prompting-based LLMs consistently outperformed fine-tuned models; English evaluations performed better than Turkish across all LLMs; RAR achieved highest performance (0.906 F1 in English, 0.888 in Turkish with Gemini 1.5 Flash); combination with structured reasoning reached 0.918 F1 in English.

Conclusion: High-quality demonstration retrieval is crucial; advanced retrieval and prompting techniques can bridge resource gaps in clinical NLP, with RAR showing strong performance for bilingual clinical relation extraction.

Abstract: The scarcity of annotated datasets for clinical information extraction in non-English languages hinders the evaluation of large language model (LLM)-based methods developed primarily in English. In this study, we present the first comprehensive bilingual evaluation of LLMs for the clinical Relation Extraction (RE) task in both English and Turkish. To facilitate this evaluation, we introduce the first English-Turkish parallel clinical RE dataset, derived and carefully curated from the 2010 i2b2/VA relation classification corpus. We systematically assess a diverse set of prompting strategies, including multiple in-context learning (ICL) and Chain-of-Thought (CoT) approaches, and compare their performance to fine-tuned baselines such as PURE. Furthermore, we propose Relation-Aware Retrieval (RAR), a novel in-context example selection method based on contrastive learning, that is specifically designed to capture both sentence-level and relation-level semantics. Our results show that prompting-based LLM approaches consistently outperform traditional fine-tuned models. Moreover, evaluations for English performed better than their Turkish counterparts across all evaluated LLMs and prompting techniques. Among ICL methods, RAR achieves the highest performance, with Gemini 1.5 Flash reaching a micro-F1 score of 0.906 in English and 0.888 in Turkish. Performance further improves to 0.918 F1 in English when RAR is combined with a structured reasoning prompt using the DeepSeek-V3 model. These findings highlight the importance of high-quality demonstration retrieval and underscore the potential of advanced retrieval and prompting techniques to bridge resource gaps in clinical natural language processing.

[52] The Imperfective Paradox in Large Language Models

Bolei Ma, Yusuke Miyao

Main category: cs.CL

TL;DR: LLMs show a teleological bias in aspectual reasoning, hallucinating event completion for goal-oriented activities despite explicit negation, revealing they lack true compositional semantic understanding of aspect.

Motivation: To determine whether LLMs genuinely understand compositional semantics of events or rely on surface-level probabilistic heuristics, specifically investigating the Imperfective Paradox in aspectual semantics.

Method: Created ImperfectiveNLI diagnostic dataset to probe aspectual distinctions across semantic classes, evaluated state-of-the-art open-weight models, conducted representational analyses of embeddings, and tested prompting-based interventions.

Result: Models show pervasive Teleological Bias - systematically hallucinate completion for goal-oriented events, often overriding explicit textual negation. Embeddings distinguish process from result, but inference decisions are dominated by strong priors about goal attainment. Prompting interventions reduce hallucinations but increase incorrect rejections of valid entailments.

Conclusion: Current LLMs lack structural aspectual awareness and operate as predictive narrative engines rather than faithful logical reasoners, failing to properly handle the Imperfective Paradox.

Abstract: Do Large Language Models (LLMs) genuinely grasp the compositional semantics of events, or do they rely on surface-level probabilistic heuristics? We investigate the Imperfective Paradox, a logical phenomenon where the past progressive aspect entails event realization for activities (e.g., running $\to$ ran) but not for accomplishments (e.g., building $\nrightarrow$ built). We introduce ImperfectiveNLI, a diagnostic dataset designed to probe this distinction across diverse semantic classes. Evaluating state-of-the-art open-weight models, we uncover a pervasive Teleological Bias: models systematically hallucinate completion for goal-oriented events, often overriding explicit textual negation. Representational analyses show that while internal embeddings often distinguish process from result, inference decisions are dominated by strong priors about goal attainment. We further find that prompting-based interventions reduce hallucinated completions but also increase incorrect rejections of valid entailments. Our findings suggest that current LLMs lack structural aspectual awareness, operating as predictive narrative engines rather than faithful logical reasoners.

[53] Ability Transfer and Recovery via Modularized Parameters Localization

Songyao Jin, Kun Zhou, Wenqi Li, Peng Wang, Biwei Huang

Main category: cs.CL

TL;DR: ACT (Activation-Guided Channel-wise Ability Transfer) is a method that localizes ability-specific channels in LLMs via activation differences and selectively transfers only those parameters to recover forgotten abilities or merge multiple specialized models with minimal interference.

Motivation: Specializing LLMs through continual pre-training or fine-tuning often causes catastrophic forgetting - improving some abilities while degrading others. The paper aims to understand how abilities are distributed within LLM parameters and develop methods to transfer abilities without interference.

Method: ACT analyzes module activations under domain- and language-specific inputs to find that ability-related activations are concentrated in a small set of channels (<5%). ACT then localizes ability-relevant channels via activation differences, selectively transfers only the corresponding parameters, and performs lightweight fine-tuning for compatibility.
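
To make the localization step concrete, here is a rough sketch: rank channels by activation difference between the specialized and base models, keep the top <5%, and copy only those parameter rows. Shapes and the selection rule are illustrative, not the paper's exact procedure.

```python
import numpy as np

rng = np.random.default_rng(2)
acts_spec = rng.normal(1.0, 0.5, size=(1000, 512))   # (tokens, channels)
acts_base = rng.normal(0.0, 0.5, size=(1000, 512))

diff = np.abs(acts_spec).mean(0) - np.abs(acts_base).mean(0)
budget = int(0.05 * diff.size)                # keep <5% of channels
channels = np.argsort(diff)[-budget:]

W_base = rng.normal(size=(512, 512))
W_spec = rng.normal(size=(512, 512))
W_merged = W_base.copy()
W_merged[channels] = W_spec[channels]         # channel-wise transfer
print(f"transferred {budget} / {diff.size} channels")
```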

Result: Experiments on multilingual mathematical and scientific reasoning show ACT can recover forgotten abilities while preserving retained skills, and can merge multiple specialized models to integrate several abilities into a single model with minimal interference.

Conclusion: ACT demonstrates that abilities in LLMs are highly localized and disentangled, enabling efficient ability transfer without catastrophic forgetting, which has practical applications for model specialization and merging.

Abstract: Large language models can be continually pre-trained or fine-tuned to improve performance in specific domains, languages, or skills, but this specialization often degrades other capabilities and may cause catastrophic forgetting. We investigate how abilities are distributed within LLM parameters by analyzing module activations under domain- and language-specific inputs for closely related models. Across layers and modules, we find that ability-related activations are highly concentrated in a small set of channels (typically <5%), and these channels are largely disentangled with good sufficiency and stability. Building on these observations, we propose ACT (Activation-Guided Channel-wise Ability Transfer), which localizes ability-relevant channels via activation differences and selectively transfers only the corresponding parameters, followed by lightweight fine-tuning for compatibility. Experiments on multilingual mathematical and scientific reasoning show that ACT can recover forgotten abilities while preserving retained skills. It can also merge multiple specialized models to integrate several abilities into a single model with minimal interference. Our code and data will be publicly released.

[54] Structured Knowledge Representation through Contextual Pages for Retrieval-Augmented Generation

Xinze Li, Zhenghao Liu, Haidong Xin, Yukun Yan, Shuo Wang, Zheni Zeng, Sen Mei, Ge Yu, Maosong Sun

Main category: cs.CL

TL;DR: PAGER is a page-driven autonomous knowledge representation framework for RAG that structures iterative knowledge accumulation through cognitive outlines and slot-based organization to build comprehensive knowledge pages for better answer generation.

Motivation: Existing iterative RAG approaches lack coherent organizational structure, limiting their ability to construct comprehensive and cohesive knowledge representations for enhanced LLM performance.

Method: PAGER first prompts an LLM to create a structured cognitive outline with multiple slots representing distinct knowledge aspects, then iteratively retrieves and refines relevant documents to populate each slot, constructing a coherent page as contextual input for answer generation.
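
A toy sketch of the slot-filling loop (the `retrieve` and `summarize` callables are placeholders, not PAGER's components):

```python
def build_page(question, slots, retrieve, summarize):
    """Populate each outline slot from retrieved documents; the finished
    page becomes the context for answer generation."""
    page = [f"# {question}"]
    for slot in slots:
        docs = retrieve(f"{question} :: {slot}")
        page.append(f"## {slot}\n{summarize(slot, docs)}")
    return "\n".join(page)

toy_retrieve = lambda q: [f"doc about {q}"]
toy_summarize = lambda slot, docs: f"({len(docs)} doc(s) on {slot})"
print(build_page("What causes auroras?", ["Mechanism", "Where they occur"],
                 toy_retrieve, toy_summarize))
```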

Result: Experiments on multiple knowledge-intensive benchmarks show PAGER consistently outperforms all RAG baselines, constructs higher-quality and information-dense knowledge representations, better mitigates knowledge conflicts, and enables LLMs to leverage external knowledge more effectively.

Conclusion: PAGER provides an effective framework for structured knowledge accumulation in RAG systems, addressing organizational limitations of previous iterative approaches and demonstrating superior performance across various benchmarks and backbone models.

Abstract: Retrieval-Augmented Generation (RAG) enhances Large Language Models (LLMs) by incorporating external knowledge. Recently, some works have incorporated iterative knowledge accumulation processes into RAG models to progressively accumulate and refine query-related knowledge, thereby constructing more comprehensive knowledge representations. However, these iterative processes often lack a coherent organizational structure, which limits the construction of more comprehensive and cohesive knowledge representations. To address this, we propose PAGER, a page-driven autonomous knowledge representation framework for RAG. PAGER first prompts an LLM to construct a structured cognitive outline for a given question, which consists of multiple slots, each representing a distinct knowledge aspect. Then, PAGER iteratively retrieves and refines relevant documents to populate each slot, ultimately constructing a coherent page that serves as contextual input for guiding answer generation. Experiments on multiple knowledge-intensive benchmarks and backbone models show that PAGER consistently outperforms all RAG baselines. Further analyses demonstrate that PAGER constructs higher-quality and information-dense knowledge representations, better mitigates knowledge conflicts, and enables LLMs to leverage external knowledge more effectively. All code is available at https://github.com/OpenBMB/PAGER.

[55] Bias Dynamics in BabyLMs: Towards a Compute-Efficient Sandbox for Democratising Pre-Training Debiasing

Filip Trhlik, Andrew Caines, Paula Buttery

Main category: cs.CL

TL;DR: BabyLMs (compact BERT-like models) serve as low-cost proxies for studying bias formation and debiasing in large language models, reducing pre-training costs from 500+ to under 30 GPU-hours while maintaining similar bias patterns.

Motivation: Large language models are expensive to train, making bias research difficult. Current debiasing methods (post-hoc or masking) often fail to address root causes. There's a need for affordable ways to study pre-model debiasing.

Method: Use BabyLMs - compact BERT-like models trained on small, mutable corpora - as proxies to approximate bias acquisition and learning dynamics of larger models. Compare their bias patterns with standard BERT and test various debiasing methods.

Result: BabyLMs show closely aligned bias formation and performance patterns with standard BERT despite much smaller size. Correlations hold across multiple debiasing methods. Experiments reveal insights about gender imbalance and toxicity effects on bias formation.

Conclusion: BabyLMs effectively serve as sandboxes for large-scale LM research, democratizing pre-model debiasing by reducing costs from 500+ to under 30 GPU-hours, enabling faster exploration of fairer LM development methods.

Abstract: Pre-trained language models (LMs) have, over the last few years, grown substantially in both societal adoption and training costs. This rapid growth in size has constrained progress in understanding and mitigating their biases. Since re-training LMs is prohibitively expensive, most debiasing work has focused on post-hoc or masking-based strategies, which often fail to address the underlying causes of bias. In this work, we seek to democratise pre-model debiasing research by using low-cost proxy models. Specifically, we investigate BabyLMs, compact BERT-like models trained on small and mutable corpora that can approximate bias acquisition and learning dynamics of larger models. We show that BabyLMs display closely aligned patterns of intrinsic bias formation and performance development compared to standard BERT models, despite their drastically reduced size. Furthermore, correlations between BabyLMs and BERT hold across multiple intra-model and post-model debiasing methods. Leveraging these similarities, we conduct pre-model debiasing experiments with BabyLMs, replicating prior findings and presenting new insights regarding the influence of gender imbalance and toxicity on bias formation. Our results demonstrate that BabyLMs can serve as an effective sandbox for large-scale LMs, reducing pre-training costs from over 500 GPU-hours to under 30 GPU-hours. This provides a way to democratise pre-model debiasing research and enables faster, more accessible exploration of methods for building fairer LMs.

[56] Where Knowledge Collides: A Mechanistic Study of Intra-Memory Knowledge Conflict in Language Models

Minh Vu Pham, Hsuvas Borkakoty, Yufang Hou

Main category: cs.CL

TL;DR: A framework using mechanistic interpretability to identify and localize conflicting knowledge within language models’ internal representations, enabling causal intervention at inference time.

Motivation: Prior work focused on resolving conflicts between internal knowledge and external resources, but the problem of localizing intra-memory knowledge conflicts that originate during pre-training within models' internal representations remains unexplored.

Method: Designed a framework based on mechanistic interpretability methods to identify where and how conflicting knowledge from pre-training data is encoded within language models.

Result: Found evidence that specific internal components of language models are responsible for encoding conflicting knowledge from pre-training, and demonstrated how mechanistic interpretability methods can be leveraged to causally intervene in and control conflicting knowledge at inference time.

Conclusion: The work contributes to understanding how conflicting knowledge is encoded in language models and provides tools for identifying and controlling such conflicts through mechanistic interpretability approaches.

Abstract: In language models (LMs), intra-memory knowledge conflict largely arises when inconsistent information about the same event is encoded within the model’s parametric knowledge. While prior work has primarily focused on resolving conflicts between a model’s internal knowledge and external resources through approaches such as fine-tuning or knowledge editing, the problem of localizing conflicts that originate during pre-training within the model’s internal representations remains unexplored. In this work, we design a framework based on mechanistic interpretability methods to identify where and how conflicting knowledge from the pre-training data is encoded within LMs. Our findings contribute to a growing body of evidence that specific internal components of a language model are responsible for encoding conflicting knowledge from pre-training, and we demonstrate how mechanistic interpretability methods can be leveraged to causally intervene in and control conflicting knowledge at inference time.

[57] Improving Symbolic Translation of Language Models for Logical Reasoning

Ramya Keerthy Thatikonda, Jiuzhou Han, Wray Buntine, Ehsan Shareghi

Main category: cs.CL

TL;DR: Smaller LMs struggle with NL-to-FOL translation, so the paper proposes fine-tuning with LLM-synthesized data, incremental inference (predicate generation + FOL translation), and verification modules to improve reliability.

Motivation: Smaller language models often produce incorrect symbolic outputs when translating natural language to first-order logic due to formatting and translation errors. Existing self-iteration methods depend too heavily on model capabilities, limiting reliability of symbolic reasoning systems.

Method: 1) Categorize common errors and fine-tune smaller LMs using data synthesized by large language models. 2) Introduce incremental inference that divides inference into predicate generation and FOL translation stages. 3) Use verification modules targeting predicate-arity errors to further improve performance.
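
A small sketch of the arity check behind the verification module: collect each predicate's argument counts across generated formulas and flag inconsistencies. The regex parse is a simplification of real FOL parsing.

```python
import re
from collections import defaultdict

def arity_errors(formulas):
    arities = defaultdict(set)
    for f in formulas:
        for name, args in re.findall(r"(\w+)\(([^()]*)\)", f):
            arities[name].add(len([a for a in args.split(",") if a.strip()]))
    return {p: sorted(a) for p, a in arities.items() if len(a) > 1}

fols = ["Cat(garfield)", "Likes(garfield, lasagna)", "Likes(garfield)"]
print(arity_errors(fols))  # {'Likes': [1, 2]}  -> arity inconsistency
```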

Result: Evaluation across three model families and four logical-reasoning datasets shows reduced error rates, increased predicate coverage, and improved reasoning performance for smaller LMs.

Conclusion: The combination of comprehensive fine-tuning, incremental inference, and verification modules moves us closer to developing reliable and accessible symbolic-reasoning systems using smaller language models.

Abstract: The use of formal language for deductive logical reasoning aligns well with language models (LMs), where translating natural language (NL) into first-order logic (FOL) and employing an external solver results in a verifiable and therefore reliable reasoning system. However, smaller LMs often struggle with this translation task, frequently producing incorrect symbolic outputs due to formatting and translation errors. Existing approaches typically rely on self-iteration to correct these errors, but such methods depend heavily on the capabilities of the underlying model. To address this, we first categorize common errors and fine-tune smaller LMs using data synthesized by large language models. The evaluation is performed using the defined error categories. We introduce incremental inference, which divides inference into two stages, predicate generation and FOL translation, providing greater control over model behavior and enhancing generation quality as measured by predicate metrics. This decomposition framework also enables the use of a verification module that targets predicate-arity errors to further improve performance. Our study evaluates three families of models across four logical-reasoning datasets. The comprehensive fine-tuning, incremental inference, and verification modules reduce error rates, increase predicate coverage, and improve reasoning performance for smaller LMs, moving us closer to developing reliable and accessible symbolic-reasoning systems.

[58] SlidesGen-Bench: Evaluating Slides Generation via Computational and Quantitative Metrics

Yunqiao Yang, Wenbo Li, Houxing Ren, Zimu Lu, Ke Wang, Zhiyuan Huang, Zhuofan Zong, Mingjie Zhan, Hongsheng Li

Main category: cs.CL

TL;DR: SlidesGen-Bench is a new benchmark for evaluating slide generation systems that addresses limitations of existing evaluation methods through visual-domain analysis, computational metrics, and human-aligned validation.

Motivation: Existing evaluation protocols for automated slide generation systems struggle with comparability across different architectures and often rely on uncalibrated judgments, making it difficult to fairly assess heterogeneous LLM-based slide generation approaches.

Method: The benchmark uses three core principles: 1) Universality - treating outputs as visual renderings regardless of generation method, 2) Quantification - computational assessment across Content, Aesthetics, and Editability dimensions, and 3) Reliability - creating Slides-Align1.5k dataset with human preferences aligned across 9 generation systems and 7 scenarios.

Result: SlidesGen-Bench achieves higher alignment with human judgment than existing evaluation pipelines, providing more reliable and comparable assessment of slide generation systems.

Conclusion: The proposed benchmark offers a unified, quantitative, and human-aligned framework for evaluating slide generation systems, addressing key limitations in current evaluation methods and enabling better comparison across diverse generation approaches.

Abstract: The rapid evolution of Large Language Models (LLMs) has fostered diverse paradigms for automated slide generation, ranging from code-driven layouts to image-centric synthesis. However, evaluating these heterogeneous systems remains challenging, as existing protocols often struggle to provide comparable scores across architectures or rely on uncalibrated judgments. In this paper, we introduce SlidesGen-Bench, a benchmark designed to evaluate slide generation through a lens of three core principles: universality, quantification, and reliability. First, to establish a unified evaluation framework, we ground our analysis in the visual domain, treating terminal outputs as renderings to remain agnostic to the underlying generation method. Second, we propose a computational approach that quantitatively assesses slides across three distinct dimensions - Content, Aesthetics, and Editability - offering reproducible metrics where prior works relied on subjective or reference-dependent proxies. Finally, to ensure high correlation with human preference, we construct the Slides-Align1.5k dataset, a human preference aligned dataset covering slides from nine mainstream generation systems across seven scenarios. Our experiments demonstrate that SlidesGen-Bench achieves a higher degree of alignment with human judgment than existing evaluation pipelines. Our code and data are available at https://github.com/YunqiaoYang/SlidesGen-Bench.

[59] MVSS: A Unified Framework for Multi-View Structured Survey Generation

Yinqi Liu, Yueqi Zhu, Yongkang Zhang, Xinfeng Li, Feiran Liu, Yufei Sun, Xin Wang, Renzhao Liang, Yidong Wang, Cunxiang Wang

Main category: cs.CL

TL;DR: MVSS is a multi-view structured survey generation framework that creates hierarchical topic trees, comparison tables, and survey text in a coordinated way, outperforming existing methods in organization and evidence grounding.

Motivation: Existing automatic survey generation methods focus on linear text and struggle to model hierarchical relations among research topics and structured methodological comparisons, resulting in poor structural organization compared to expert-written surveys.

Method: MVSS follows a structure-first paradigm: 1) constructs a conceptual tree of the research domain, 2) generates comparison tables constrained by the tree, and 3) uses both as structural constraints for text generation, enabling complementary multi-view representations across structure, comparison, and narrative.

Result: Experiments on 76 computer science topics show MVSS outperforms existing methods in organization and evidence grounding, achieving performance comparable to expert surveys.

Conclusion: The proposed multi-view structured approach enables better survey generation by explicitly modeling hierarchical relations and structured comparisons, bridging the gap between automatic methods and expert-written surveys.

Abstract: Scientific surveys require not only summarizing large bodies of literature, but also organizing them into clear and coherent conceptual structures. Existing automatic survey generation methods typically focus on linear text generation and struggle to explicitly model hierarchical relations among research topics and structured methodological comparisons, resulting in gaps in structural organization compared to expert-written surveys. We propose MVSS, a multi-view structured survey generation framework that jointly generates and aligns citation-grounded hierarchical trees, structured comparison tables, and survey text. MVSS follows a structure-first paradigm: it first constructs a conceptual tree of the research domain, then generates comparison tables constrained by the tree, and finally uses both as structural constraints for text generation. This enables complementary multi-view representations across structure, comparison, and narrative. We introduce an evaluation framework assessing structural quality, comparative completeness, and citation fidelity. Experiments on 76 computer science topics show MVSS outperforms existing methods in organization and evidence grounding, achieving performance comparable to expert surveys.

[60] SERM: Self-Evolving Relevance Model with Agent-Driven Learning from Massive Query Streams

Chenglong Wang, Canjia Li, Xingzhao Zhu, Yifu Huo, Huiyu Wang, Weixiong Lin, Yun Yang, Qiaozhi He, Tianhua Zhou, Xiaojia Chang, Jingbo Zhu, Tong Xiao

Main category: cs.CL

TL;DR: SERM is a self-evolving relevance model with multi-agent modules for sample mining and relevance annotation to handle large-scale industrial search with sparse informative samples and unreliable pseudo-labels.

Motivation: Real-world query streams are dynamically evolving, making relevance models struggle to generalize. Self-evolution techniques face challenges in large-scale industrial settings: (1) sparse informative samples are hard to identify, and (2) pseudo-labels from current models are unreliable.

Method: Proposes Self-Evolving Relevance Model (SERM) with two complementary multi-agent modules: (1) multi-agent sample miner to detect distributional shifts and identify informative training samples, and (2) multi-agent relevance annotator with two-level agreement framework to provide reliable labels.
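
A toy sketch of a two-level agreement filter, assuming a coarse decision and a fine-grained grade must both reach quorum before a pseudo-label is accepted (the paper's exact levels may differ):

```python
from collections import Counter

def accept_label(coarse_votes, fine_votes, quorum=0.75):
    for votes in (coarse_votes, fine_votes):
        _, count = Counter(votes).most_common(1)[0]
        if count / len(votes) < quorum:
            return None                      # disagreement: discard sample
    return Counter(fine_votes).most_common(1)[0][0]

print(accept_label(["relevant"] * 4, ["high", "high", "high", "mid"]))  # high
print(accept_label(["relevant", "relevant", "irrelevant", "irrelevant"],
                   ["high"] * 4))                                       # None
```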

Result: Evaluated in large-scale industrial setting serving billions of daily requests. Experimental results show SERM achieves significant performance gains through iterative self-evolution, validated by extensive offline multilingual evaluations and online testing.

Conclusion: SERM effectively addresses challenges of sparse informative samples and unreliable pseudo-labels in large-scale industrial search through its multi-agent architecture, enabling successful self-evolution and improved relevance modeling.

Abstract: Due to the dynamically evolving nature of real-world query streams, relevance models struggle to generalize to practical search scenarios. A sophisticated solution is self-evolution techniques. However, in large-scale industrial settings with massive query streams, this technique faces two challenges: (1) informative samples are often sparse and difficult to identify, and (2) pseudo-labels generated by the current model could be unreliable. To address these challenges, in this work, we propose a Self-Evolving Relevance Model approach (SERM), which comprises two complementary multi-agent modules: a multi-agent sample miner, designed to detect distributional shifts and identify informative training samples, and a multi-agent relevance annotator, which provides reliable labels through a two-level agreement framework. We evaluate SERM in a large-scale industrial setting, which serves billions of user requests daily. Experimental results demonstrate that SERM can achieve significant performance gains through iterative self-evolution, as validated by extensive offline multilingual evaluations and online testing.

[61] Benchmarking Post-Training Quantization of Large Language Models under Microscaling Floating Point Formats

Manyi Zhang, Ji-Fu Li, Zhongao Sun, Haoli Bai, Hui-Ling Zhen, Zhenhua Dong, Xianzhi Yu

Main category: cs.CL

TL;DR: Systematic investigation of post-training quantization algorithms for Microscaling Floating-Point (MXFP) formats in large language models, covering 7+ PTQ algorithms, 15 benchmarks, and 3 LLM families.

Motivation: MXFP has emerged as a promising low-precision format for LLMs, but most PTQ research focuses on integer quantization, leaving MXFP formats largely unexplored. There's a gap in understanding how existing PTQ algorithms perform with MXFP formats.

Method: Conducted comprehensive evaluation of PTQ under MXFP formats using over 7 PTQ algorithms, 15 evaluation benchmarks, and 3 different LLM families. Investigated format compatibility, algorithmic effectiveness, and performance trends across models and modalities.
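
For readers new to microscaling formats, a simplified sketch of MXFP4 block quantization follows: each block of 32 values shares one power-of-two scale, and each element snaps to the nearest FP4 (E2M1) grid point. Real MX rounding rules differ in their details.

```python
import numpy as np

# Positive FP4 (E2M1) magnitudes, mirrored for negatives.
GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])
GRID = np.concatenate([-GRID[:0:-1], GRID])

def mxfp4_quantize(x, block=32):
    x = x.reshape(-1, block)
    # Shared power-of-two scale per block (stands in for the E8M0 exponent).
    scale = 2.0 ** np.floor(np.log2(np.abs(x).max(1) / GRID[-1] + 1e-30))
    scaled = x / scale[:, None]
    snapped = GRID[np.abs(scaled[..., None] - GRID).argmin(-1)]
    return snapped * scale[:, None]

w = np.random.default_rng(3).normal(size=64)
print("mean abs error:", np.abs(w - mxfp4_quantize(w).ravel()).mean().round(4))
```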

Result: Key findings: 1) MXFP8 achieves near-lossless performance while MXFP4 causes substantial accuracy degradation; 2) PTQ effectiveness depends strongly on format compatibility, with certain algorithmic paradigms consistently more effective than others; 3) Performance trends are consistent across model families, with quantization sensitivity dominated by language models rather than vision encoders in multimodal LLMs; 4) The scaling factor is a critical error source in MXFP4, and a simple pre-scale optimization significantly mitigates its impact.

Conclusion: The study provides practical guidance for adapting existing PTQ methods to MXFP quantization, revealing that format compatibility and scaling factor optimization are crucial for successful MXFP quantization, especially for lower precision formats like MXFP4.

Abstract: Microscaling Floating-Point (MXFP) has emerged as a promising low-precision format for large language models (LLMs). Despite various post-training quantization (PTQ) algorithms being proposed, they mostly focus on integer quantization, while their applicability and behavior under MXFP formats remain largely unexplored. To address this gap, this work conducts a systematic investigation of PTQ under MXFP formats, encompassing over 7 PTQ algorithms, 15 evaluation benchmarks, and 3 LLM families. The key findings include: 1) MXFP8 consistently achieves near-lossless performance, while MXFP4 introduces substantial accuracy degradation and remains challenging; 2) PTQ effectiveness under MXFP depends strongly on format compatibility, with some algorithmic paradigms being consistently more effective than others; 3) PTQ performance exhibits highly consistent trends across model families and modalities, in particular, quantization sensitivity is dominated by the language model rather than the vision encoder in multimodal LLMs; 4) The scaling factor of quantization is a critical error source in MXFP4, and a simple pre-scale optimization strategy can significantly mitigate its impact. Together, these results provide practical guidance on adapting existing PTQ methods to MXFP quantization.

[62] Dialogue Telemetry: Turn-Level Instrumentation for Autonomous Information Gathering

Dimitris Panagopoulos, Adolfo Perrusquia, Weisi Guo

Main category: cs.CL

TL;DR: Dialogue Telemetry (DT) framework provides turn-level monitoring signals for information-gathering dialogues: Progress Estimator measures residual information potential, and Stalling Index detects unproductive questioning patterns without causal diagnosis.

Motivation: Autonomous systems lack turn-level observables for monitoring acquisition efficiency and detecting when questioning becomes unproductive in schema-grounded information-gathering dialogues.

Method: Introduces Dialogue Telemetry (DT) with two model-agnostic signals: Progress Estimator (PE) quantifying residual information potential per category, and Stalling Index (SI) detecting repeated category probing with semantically similar, low-marginal-gain responses.
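
A toy version of the stalling check, assuming string similarity over recent answers and a per-turn marginal-gain estimate (the paper's SI uses embeddings and a bits-based PE):

```python
from difflib import SequenceMatcher

def is_stalling(turns, sim_thresh=0.8, gain_thresh=0.05, window=3):
    """turns: list of (category, answer_text, marginal_gain)."""
    recent = turns[-window:]
    if len(recent) < window or len({c for c, _, _ in recent}) > 1:
        return False                       # mixed categories: not stalling
    sims = [SequenceMatcher(None, a, b).ratio()
            for (_, a, _), (_, b, _) in zip(recent, recent[1:])]
    return min(sims) >= sim_thresh and max(g for _, _, g in recent) <= gain_thresh

turns = [("location", "near the old bridge", 0.02),
         ("location", "by the old bridge", 0.01),
         ("location", "near the old bridge", 0.00)]
print(is_stalling(turns))  # True: same category, similar low-gain answers
```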

Result: Validated in controlled search-and-rescue interviews using LLM-based simulations, distinguishing efficient from stalled dialogue traces. Integration into RL policy improves performance when stalling carries operational costs.

Conclusion: DT provides interpretable turn-level instrumentation that improves policy performance in information-gathering dialogues, offering practical monitoring without requiring causal diagnosis of degradation.

Abstract: Autonomous systems conducting schema-grounded information-gathering dialogues face an instrumentation gap, lacking turn-level observables for monitoring acquisition efficiency and detecting when questioning becomes unproductive. We introduce Dialogue Telemetry (DT), a measurement framework that produces two model-agnostic signals after each question-answer exchange: (i) a Progress Estimator (PE) quantifying residual information potential per category (with a bits-based variant), and (ii) a Stalling Index (SI) detecting an observable failure signature characterized by repeated category probing with semantically similar, low-marginal-gain responses. SI flags this pattern without requiring causal diagnosis, supporting monitoring in settings where attributing degradation to specific causes may be impractical. We validate DT in controlled search-and-rescue (SAR)-inspired interviews using large language model (LLM)-based simulations, distinguishing efficient from stalled dialogue traces and illustrating downstream utility by integrating DT signals into a reinforcement learning (RL) policy. Across these settings, DT provides interpretable turn-level instrumentation that improves policy performance when stalling carries operational costs.

[63] DPWriter: Reinforcement Learning with Diverse Planning Branching for Creative Writing

Qian Cao, Yahui Liu, Wei Bi, Yi Zhao, Ruihua Song, Xiting Wang, Ruiming Tang, Guorui Zhou, Han Li

Main category: cs.CL

TL;DR: Proposes RL framework with semi-structured Chain-of-Thought planning to improve LLM output diversity in creative tasks without sacrificing quality.

DetailsMotivation: RL-based enhancement of LLMs often reduces output diversity, harming performance in open-ended tasks like creative writing. Current methods prioritize optimization efficiency over diversity.

Method: Uses semi-structured long Chain-of-Thought to decompose generation into planned intermediate steps. Introduces Diverse Planning Branching (divergence at planning phase based on diversity variation) and group-aware diversity reward.

Result: Experimental results on creative writing benchmarks show significant improvement in output diversity without compromising generation quality, consistently outperforming existing baselines.

Conclusion: The proposed RL framework with structured planning and diversity mechanisms effectively addresses the diversity loss problem in RL-enhanced LLMs for creative tasks.

Abstract: Reinforcement learning (RL)-based enhancement of large language models (LLMs) often leads to reduced output diversity, undermining their utility in open-ended tasks like creative writing. Current methods lack explicit mechanisms for guiding diverse exploration and instead prioritize optimization efficiency and performance over diversity. This paper proposes an RL framework structured around a semi-structured long Chain-of-Thought (CoT), in which the generation process is decomposed into explicitly planned intermediate steps. We introduce a Diverse Planning Branching method that strategically introduces divergence at the planning phase based on diversity variation, alongside a group-aware diversity reward to encourage distinct trajectories. Experimental results on creative writing benchmarks demonstrate that our approach significantly improves output diversity without compromising generation quality, consistently outperforming existing baselines.
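A hedged sketch of a group-aware diversity reward of the kind the method describes: each trajectory in a sampled group earns a bonus proportional to how far its plan embedding sits from the other members', so optimization favors distinct plans alongside quality. The cosine-based form and the weighting coefficient are illustrative assumptions, not the paper's exact reward.

```python
import numpy as np

def group_diversity_rewards(plan_embs, quality, beta=0.2):
    """plan_embs: (G, d) embeddings of the G plans in one sampled group;
    quality: (G,) task-quality rewards for the finished texts."""
    embs = plan_embs / np.linalg.norm(plan_embs, axis=1, keepdims=True)
    sims = embs @ embs.T                       # pairwise cosine similarity
    G = len(embs)
    mean_other_sim = (sims.sum(axis=1) - 1.0) / (G - 1)  # exclude self-similarity
    diversity = 1.0 - mean_other_sim           # higher = more distinct plan
    return quality + beta * diversity
```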

[64] LLMs Got Rhythm? Hybrid Phonological Filtering for Greek Poetry Rhyme Detection and Generation

Stergios Chatzikyriakidis

Main category: cs.CL

TL;DR: LLMs struggle with phonological tasks like rhyme in low-resource languages like Greek. A hybrid system combining LLMs with phonological algorithms achieves accurate rhyme identification/generation, with verification loops dramatically improving performance from <4% to 73.1% valid poems.

DetailsMotivation: Large Language Models (LLMs) have remarkable NLP capabilities but struggle with phonologically-grounded phenomena like rhyme detection and generation, especially in lower-resource languages such as Modern Greek. This paper addresses this gap by developing a hybrid approach.

Method: Developed a hybrid system combining LLMs with deterministic phonological algorithms. Implemented comprehensive taxonomy of Greek rhyme types (Pure, Rich, Imperfect, Mosaic, IDV patterns). Used agentic generation pipeline with phonological verification. Evaluated multiple prompting strategies (zero-shot, few-shot, Chain-of-Thought, RAG-augmented) across various LLMs including Claude, GPT-4o, Gemini, Llama, and Mistral.

Result: Revealed significant “Reasoning Gap”: native-like models (Claude 3.7) perform intuitively (40% accuracy), while reasoning-heavy models (Claude 4.5) achieve state-of-the-art performance (54%) with Chain-of-Thought prompting. Pure LLM generation fails catastrophically (<4% valid poems), but hybrid verification loop restores performance to 73.1%. Released system and cleaned corpus of 40,000+ rhymes.

Conclusion: Hybrid systems combining LLMs with phonological algorithms are essential for accurate rhyme tasks in low-resource languages. The approach successfully addresses LLM limitations in phonological reasoning, with verification loops dramatically improving generation quality. Released resources support future research in this area.

Abstract: Large Language Models (LLMs), despite their remarkable capabilities across NLP tasks, struggle with phonologically-grounded phenomena like rhyme detection and generation. This is even more evident in lower-resource languages such as Modern Greek. In this paper, we present a hybrid system that combines LLMs with deterministic phonological algorithms to achieve accurate rhyme identification/analysis and generation. Our approach implements a comprehensive taxonomy of Greek rhyme types, including Pure, Rich, Imperfect, Mosaic, and Identical Pre-rhyme Vowel (IDV) patterns, and employs an agentic generation pipeline with phonological verification. We evaluate multiple prompting strategies (zero-shot, few-shot, Chain-of-Thought, and RAG-augmented) across several LLMs including Claude 3.7 and 4.5, GPT-4o, Gemini 2.0 and open-weight models like Llama 3.1 8B and 70B and Mistral Large. Results reveal a significant “Reasoning Gap”: while native-like models (Claude 3.7) perform intuitively (40% accuracy in identification), reasoning-heavy models (Claude 4.5) achieve state-of-the-art performance (54%) only when prompted with Chain-of-Thought. Most critically, pure LLM generation fails catastrophically (under 4% valid poems), while our hybrid verification loop restores performance to 73.1%. We release our system and a crucial, rigorously cleaned corpus of 40,000+ rhymes, derived from the Anemoskala and Interwar Poetry corpora, to support future research.
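A minimal sketch of the generate-verify loop at the heart of the hybrid pipeline: an LLM drafts a couplet, a deterministic phonological checker validates the end-rhyme, and failed drafts trigger regeneration. `generate_couplet` and `rhyme_part` stand in for an LLM call and a Greek grapheme-to-phoneme module; the rhyme test here (matching segments from the last stressed vowel to word end, roughly a pure rhyme) is a simplification of the paper's full taxonomy.

```python
def rhymes(word_a, word_b, rhyme_part):
    # rhyme_part: helper returning the phonemic segment from the last
    # stressed vowel to the word's end (stand-in for the deterministic
    # Greek phonological module).
    return rhyme_part(word_a) == rhyme_part(word_b)

def generate_valid_couplet(generate_couplet, rhyme_part, max_tries=10):
    """Agentic loop: draft with an LLM, verify phonologically, retry."""
    for _ in range(max_tries):
        line1, line2 = generate_couplet()   # LLM call (stand-in)
        if rhymes(line1.split()[-1], line2.split()[-1], rhyme_part):
            return line1, line2
    return None  # fail loudly rather than emit a non-rhyming couplet
```

The jump from under 4% to 73.1% valid poems comes precisely from rejecting drafts the verifier catches, rather than trusting the LLM's own sense of rhyme.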

[65] TaxoBell: Gaussian Box Embeddings for Self-Supervised Taxonomy Expansion

Sahil Mishra, Srinitish Srinivasan, Srikanta Bedathur, Tanmoy Chakraborty

Main category: cs.CL

TL;DR: TaxoBell introduces Gaussian box embeddings for taxonomy expansion, addressing limitations of existing methods by modeling semantic uncertainty and enabling stable optimization, achieving significant performance improvements over state-of-the-art baselines.

DetailsMotivation: Manual taxonomy expansion is labor-intensive and cannot keep pace with new concepts. Existing automated methods using point-based vector embeddings struggle with asymmetric "is-a" relationships, while box embeddings have issues with unstable gradients, lack of semantic uncertainty modeling, and limited capacity for polysemy/ambiguity.

Method: TaxoBell uses a Gaussian box embedding framework that translates between box geometries and multivariate Gaussian distributions. Means encode semantic location and covariances encode uncertainty. The approach employs energy-based optimization for stable training and robust modeling of ambiguous concepts.

Result: Extensive experiments on five benchmark datasets show TaxoBell significantly outperforms eight state-of-the-art taxonomy expansion baselines by 19% in MRR and around 25% in Recall@k. Error analysis and ablation studies demonstrate the advantages and limitations of the approach.

Conclusion: TaxoBell successfully addresses key limitations of existing taxonomy expansion methods by introducing Gaussian box embeddings with uncertainty modeling, enabling stable optimization and improved performance for hierarchical knowledge representation.

Abstract: Taxonomies form the backbone of structured knowledge representation across diverse domains, enabling applications such as e-commerce catalogs, semantic search, and biomedical discovery. Yet, manual taxonomy expansion is labor-intensive and cannot keep pace with the emergence of new concepts. Existing automated methods rely on point-based vector embeddings, which model symmetric similarity and thus struggle with the asymmetric “is-a” relationships that are fundamental to taxonomies. Box embeddings offer a promising alternative by enabling containment and disjointness, but they face key issues: (i) unstable gradients at the intersection boundaries, (ii) no notion of semantic uncertainty, and (iii) limited capacity to represent polysemy or ambiguity. We address these shortcomings with TaxoBell, a Gaussian box embedding framework that translates between box geometries and multivariate Gaussian distributions, where means encode semantic location and covariances encode uncertainty. Energy-based optimization yields stable optimization, robust modeling of ambiguous concepts, and interpretable hierarchical reasoning. Extensive experimentation on five benchmark datasets demonstrates that TaxoBell significantly outperforms eight state-of-the-art taxonomy expansion baselines by 19% in MRR and around 25% in Recall@k. We further demonstrate the advantages and pitfalls of TaxoBell with error analysis and ablation studies.
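A minimal sketch of the box-to-Gaussian translation behind TaxoBell: a box's center becomes the Gaussian mean (semantic location) and its side lengths set a diagonal covariance (uncertainty). The containment energy below, the KL divergence from child to parent Gaussian, is one plausible smooth "is-a" score; the paper's exact energy function may differ.

```python
import numpy as np

def box_to_gaussian(box_min, box_max, k=4.0):
    mu = (box_min + box_max) / 2.0
    sigma2 = ((box_max - box_min) / k) ** 2   # wider box -> more uncertainty
    return mu, sigma2

def containment_energy(child, parent):
    """KL( N_child || N_parent ) for diagonal Gaussians; low energy when
    the child distribution sits well inside the parent's mass."""
    (mc, vc), (mp, vp) = child, parent
    return 0.5 * np.sum(np.log(vp / vc) + (vc + (mc - mp) ** 2) / vp - 1.0)

apple = box_to_gaussian(np.array([0.2, 0.2]), np.array([0.4, 0.4]))
fruit = box_to_gaussian(np.array([0.0, 0.0]), np.array([1.0, 1.0]))
# "apple is-a fruit" should cost far less than the reverse direction:
print(containment_energy(apple, fruit) < containment_energy(fruit, apple))  # True
```

The asymmetry of KL is what gives the directed "is-a" signal that symmetric point-embedding similarity cannot express.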

[66] Creating a Hybrid Rule and Neural Network Based Semantic Tagger using Silver Standard Data: the PyMUSAS framework for Multilingual Semantic Annotation

Andrew Moore, Paul Rayson, Dawn Archer, Tim Czerniak, Dawn Knight, Daisy Lal, Gearóid Ó Donnchadha, Mícheál Ó Meachair, Scott Piao, Elaine Uí Dhonnchadha, Johanna Vuorinen, Yan Yabo, Xiaobin Yang

Main category: cs.CL

TL;DR: This paper presents the largest semantic tagging evaluation of the USAS framework across five languages, introduces new datasets including Chinese and silver-labeled English data, and develops neural models that outperform rule-based systems.

DetailsMotivation: While WSD has been extensively evaluated using frameworks like WordNet and BabelNet, the USAS framework lacks comprehensive open evaluation beyond lexical coverage or single-language studies. There's a need for extensive multi-language evaluation and better approaches to overcome the lack of manually tagged training data.

Method: 1) Performed largest semantic tagging evaluation of rule-based USAS system across five languages using four existing datasets and one novel Chinese dataset. 2) Created new silver-labeled English dataset to address training data scarcity. 3) Trained and evaluated various mono and multilingual neural models in both mono and cross-lingual setups. 4) Compared neural models with rule-based counterparts and showed how rule-based systems can be enhanced with neural networks.

Result: Developed neural network models that outperform rule-based systems. Released all resources openly including: neural models, training data, Chinese evaluation dataset, and all code. Demonstrated successful enhancement of rule-based systems with neural approaches.

Conclusion: The study provides the most extensive evaluation of USAS semantic tagging to date, introduces valuable new datasets, and shows that neural models can effectively enhance rule-based systems for semantic tagging across multiple languages. All resources are made openly available to advance research in this area.

Abstract: Word Sense Disambiguation (WSD) has been widely evaluated using the semantic frameworks of WordNet, BabelNet, and the Oxford Dictionary of English. However, for the UCREL Semantic Analysis System (USAS) framework, no open extensive evaluation has been performed beyond lexical coverage or single-language evaluation. In this work, we perform the largest semantic tagging evaluation of the rule-based system that uses the lexical resources in the USAS framework, covering five different languages using four existing datasets and one novel Chinese dataset. We create a new silver-labelled English dataset to overcome the lack of manually tagged training data, on which we train and evaluate various mono- and multilingual neural models in both mono- and cross-lingual evaluation setups, compare them with their rule-based counterparts, and show how a rule-based system can be enhanced with a neural network model. The resulting neural network models, including the data they were trained on, the Chinese evaluation dataset, and all of the code have been released as open resources.

[67] DeepResearchEval: An Automated Framework for Deep Research Task Construction and Agentic Evaluation

Yibo Wang, Lei Wang, Yue Deng, Keming Wu, Yao Xiao, Huanjin Yao, Liwei Kang, Hai Ye, Yongcheng Jing, Lidong Bing

Main category: cs.CL

TL;DR: DeepResearchEval: Automated framework for constructing realistic deep research tasks and evaluating research systems with adaptive quality assessment and active fact-checking.

DetailsMotivation: Existing benchmarks for deep research systems have limitations: they require intensive manual annotation, use static evaluation dimensions, and fail to reliably verify facts when citations are missing. There's a need for better evaluation methods for multi-step web research systems.

Method: Two-part framework: 1) Persona-driven task construction pipeline generating realistic research tasks with two-stage filtering (Task Qualification and Search Necessity) to ensure tasks require multi-source evidence integration. 2) Agentic evaluation pipeline with Adaptive Point-wise Quality Evaluation (dynamic, task-specific evaluation dimensions) and Active Fact-Checking (autonomous extraction and verification of report statements via web search).

Result: DeepResearchEval provides an automated framework that addresses limitations of existing benchmarks by generating realistic research tasks and enabling comprehensive evaluation without manual annotation overhead.

Conclusion: The proposed framework bridges gaps in deep research system evaluation by automating task construction and providing adaptive, fact-checking-based evaluation that works even when citations are missing.

Abstract: Deep research systems are widely used for multi-step web research, analysis, and cross-source synthesis, yet their evaluation remains challenging. Existing benchmarks often require annotation-intensive task construction, rely on static evaluation dimensions, or fail to reliably verify facts when citations are missing. To bridge these gaps, we introduce DeepResearchEval, an automated framework for deep research task construction and agentic evaluation. For task construction, we propose a persona-driven pipeline generating realistic, complex research tasks anchored in diverse user profiles, applying a two-stage filter Task Qualification and Search Necessity to retain only tasks requiring multi-source evidence integration and external retrieval. For evaluation, we propose an agentic pipeline with two components: an Adaptive Point-wise Quality Evaluation that dynamically derives task-specific evaluation dimensions, criteria, and weights conditioned on each generated task, and an Active Fact-Checking that autonomously extracts and verifies report statements via web search, even when citations are missing.

[68] Routing with Generated Data: Annotation-Free LLM Skill Estimation and Expert Selection

Tianyi Niu, Justin Chih-Yao Chen, Genta Indra Winata, Shi-Xiong Zhang, Supriyo Chakraborty, Sambit Sahu, Yue Zhang, Elias Stengel-Eskin, Mohit Bansal

Main category: cs.CL

TL;DR: In the Routing with Generated Data (RGD) setting, where LLM routers are trained on generated data, query-only routers outperform query-answer routers when generator quality is poor. The CASCAL query-only router uses consensus voting and hierarchical clustering for better robustness.

DetailsMotivation: Existing LLM router approaches require ground-truth labeled data, which is often unavailable in practice, especially with heterogeneous and unknown user request distributions. Need for routers that can be trained on generated data instead.

Method: Introduce Routing with Generated Data (RGD) setting where routers are trained exclusively on generated queries and answers from generator LLMs. Evaluate query-answer vs query-only routers. Propose CASCAL - a query-only router that estimates model correctness through consensus voting and identifies model-specific skill niches via hierarchical clustering.

Result: Query-answer routers degrade faster than query-only routers as generator quality decreases. Key generator characteristics: must accurately respond to own questions, and questions must produce sufficient performance differentiation among models. CASCAL outperforms best query-answer router by 4.6% absolute accuracy when trained on weak generator data.

Conclusion: Query-only routers are more robust to generator quality than query-answer routers in RGD setting. CASCAL’s consensus voting and hierarchical clustering approach provides substantial improvements, especially with weak generator data. Effective generators need both self-consistency and ability to differentiate model performance.

Abstract: Large Language Model (LLM) routers dynamically select optimal models for given inputs. Existing approaches typically assume access to ground-truth labeled data, which is often unavailable in practice, especially when user request distributions are heterogeneous and unknown. We introduce Routing with Generated Data (RGD), a challenging setting in which routers are trained exclusively on generated queries and answers produced from high-level task descriptions by generator LLMs. We evaluate query-answer routers (using both queries and labels) and query-only routers across four diverse benchmarks and 12 models, finding that query-answer routers degrade faster than query-only routers as generator quality decreases. Our analysis reveals two crucial characteristics of effective generators: they must accurately respond to their own questions, and their questions must produce sufficient performance differentiation among the model pool. We then show how filtering for these characteristics can improve the quality of generated data. We further propose CASCAL, a novel query-only router that estimates model correctness through consensus voting and identifies model-specific skill niches via hierarchical clustering. CASCAL is substantially more robust to generator quality, outperforming the best query-answer router by 4.6% absolute accuracy when trained on weak generator data.
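A minimal sketch of consensus-based correctness estimation in the spirit of CASCAL: with no gold labels, the majority answer across the model pool serves as a pseudo-label, and each model's agreement with it estimates its skill on that query. The hierarchical clustering of queries into skill niches is omitted, and the answer normalization is an assumption.

```python
from collections import Counter

def consensus_scores(answers_by_model):
    """answers_by_model: {model_name: answer_string} for one generated query."""
    votes = Counter(a.strip().lower() for a in answers_by_model.values())
    consensus, _ = votes.most_common(1)[0]      # majority answer as pseudo-gold
    return {m: float(a.strip().lower() == consensus)
            for m, a in answers_by_model.items()}

scores = consensus_scores({"model_a": "42", "model_b": "42", "model_c": "17"})
print(scores)  # {'model_a': 1.0, 'model_b': 1.0, 'model_c': 0.0}
```

Because the pseudo-label needs only the queries themselves, this estimation stays usable even when the generator's own answers are unreliable, which is consistent with the query-only router degrading more slowly than query-answer routers.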

[69] LLMs can Compress LLMs: Adaptive Pruning by Agents

Sai Varun Kodathala, Rakesh Vunnam

Main category: cs.CL

TL;DR: Agent-guided pruning uses an LLM as an adaptive pruning agent to intelligently select which layers to prune, achieving better performance than uniform pruning methods while preserving factual knowledge.

DetailsMotivation: Existing pruning methods use uniform or hand-crafted heuristics for per-layer sparsity ratios and suffer from severe factual knowledge degradation, with structured pruning methods experiencing near-total collapse in factual QA capabilities.

Method: Uses a foundation model as an adaptive pruning agent that constructs layer-wise sensitivity profiles combining weight-activation metrics with gradient importance scores (normalized as z-scores). The LLM agent has self-reflection capabilities to learn from previous pruning outcomes and iteratively refine its strategy, with a checkpoint rollback mechanism to maintain model quality.

Result: Evaluated on Qwen3 models (4B and 8B) at ~45% sparsity: 56% relative improvement in MMLU accuracy, 19x better factual knowledge retention on FreebaseQA, and 69% lower perplexity degradation compared to structured pruning baselines. Requires no retraining, operates model-agnostically, and exhibits effective self-correction with only 2-4 rollbacks across 21-40 iterations.

Conclusion: Foundation models can effectively guide the compression of other foundation models through intelligent, adaptive pruning that preserves critical knowledge pathways while achieving high sparsity.

Abstract: As Large Language Models (LLMs) continue to scale, post-training pruning has emerged as a promising approach to reduce computational costs while preserving performance. Existing methods such as SparseGPT and Wanda achieve high sparsity through layer-wise weight reconstruction or activation-aware magnitude pruning, but rely on uniform or hand-crafted heuristics to determine per-layer sparsity ratios. Moreover, recent work has shown that pruned LLMs suffer from severe factual knowledge degradation, with structured pruning methods experiencing near-total collapse in factual question-answering capabilities. We introduce agent-guided pruning, where a foundation model acts as an adaptive pruning agent to intelligently select which layers to prune at each iteration while preserving critical knowledge pathways. Our method constructs layer-wise sensitivity profiles by combining Wanda-inspired weight-activation metrics with gradient importance scores, normalized as z-scores for model-agnostic comparison. These statistics are processed by an LLM agent equipped with self-reflection capabilities, enabling it to learn from previous pruning outcomes and iteratively refine its strategy. A checkpoint rollback mechanism maintains model quality by reverting when perplexity degradation exceeds a threshold. We evaluate our approach on Qwen3 models (4B and 8B parameters) at approximately 45% sparsity, demonstrating substantial improvements over structured pruning baselines: 56% relative improvement in MMLU accuracy, 19x better factual knowledge retention on FreebaseQA, and 69% lower perplexity degradation. Notably, our framework requires no retraining, operates in a model-agnostic manner, and exhibits effective self-correction with only 2-4 rollbacks across 21-40 iterations, demonstrating that foundation models can effectively guide the compression of other foundation models.
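A minimal sketch of the layer sensitivity profile the pruning agent consumes: a Wanda-style weight-times-activation-norm score and a gradient importance score are z-normalized across layers and combined, making layers comparable in a model-agnostic way. The equal weighting of the two signals is an illustrative assumption.

```python
import numpy as np

def zscore(x):
    return (x - x.mean()) / (x.std() + 1e-9)

def sensitivity_profile(weights, act_norms, grads):
    """Per-layer lists: weight matrices (out, in), input-activation norms
    (in,), and gradients matching each weight matrix's shape."""
    wanda = np.array([np.sum(np.abs(W) * a)          # |W_ij| * ||X_j||
                      for W, a in zip(weights, act_norms)])
    grad_imp = np.array([np.sum(np.abs(W * g))       # first-order importance
                         for W, g in zip(weights, grads)])
    return zscore(wanda) + zscore(grad_imp)          # higher = more sensitive
```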

[70] Empathy Applicability Modeling for General Health Queries

Shan Randhawa, Agha Ali Raza, Kentaro Toyama, Julie Hui, Mustafa Naseem

Main category: cs.CL

TL;DR: The paper introduces the Empathy Applicability Framework (EAF) to identify when empathy is needed in patient queries before response generation, addressing limitations of existing reactive empathy labeling approaches.

DetailsMotivation: LLMs lack clinical empathy in healthcare applications, and current NLP frameworks only reactively label empathy in doctors' responses rather than anticipating empathy needs in patient queries.

Method: Developed the Empathy Applicability Framework (EAF) using clinical, contextual, and linguistic cues; created benchmark with human and GPT-4o annotations; trained classifiers on both human-labeled and GPT-only data.

Result: Achieved strong performance in predicting empathy applicability, outperforming heuristic and zero-shot LLM baselines; found substantial human-GPT alignment in consensus subset; identified challenges with implicit distress and clinical ambiguity.

Conclusion: EAF enables anticipatory empathy modeling for asynchronous healthcare, establishes a benchmark, and highlights need for multi-annotator modeling, clinician calibration, and diverse cultural annotation.

Abstract: LLMs are increasingly being integrated into clinical workflows, yet they often lack clinical empathy, an essential aspect of effective doctor-patient communication. Existing NLP frameworks focus on reactively labeling empathy in doctors’ responses but offer limited support for anticipatory modeling of empathy needs, especially in general health queries. We introduce the Empathy Applicability Framework (EAF), a theory-driven approach that classifies patient queries in terms of the applicability of emotional reactions and interpretations, based on clinical, contextual, and linguistic cues. We release a benchmark of real patient queries, dual-annotated by Humans and GPT-4o. In the subset with human consensus, we also observe substantial human-GPT alignment. To validate EAF, we train classifiers on human-labeled and GPT-only annotations to predict empathy applicability, achieving strong performance and outperforming the heuristic and zero-shot LLM baselines. Error analysis highlights persistent challenges: implicit distress, clinical-severity ambiguity, and contextual hardship, underscoring the need for multi-annotator modeling, clinician-in-the-loop calibration, and culturally diverse annotation. EAF provides a framework for identifying empathy needs before response generation, establishes a benchmark for anticipatory empathy modeling, and enables supporting empathetic communication in asynchronous healthcare.

[71] Value-Aware Numerical Representations for Transformer Language Models

Andreea Dutulescu, Stefan Ruseti, Mihai Dascalu

Main category: cs.CL

TL;DR: The paper proposes a value-aware numerical representation that augments tokenized inputs with a dedicated prefix token whose embedding is explicitly conditioned on numerical value, improving language models’ numerical robustness.

DetailsMotivation: Transformer-based language models perform well on mathematical reasoning but remain fragile on basic numerical understanding and arithmetic operations because numbers are processed as symbolic tokens whose embeddings don't encode numerical value, leading to systematic errors.

Method: Introduces a value-aware numerical representation that augments standard tokenized inputs with a dedicated prefix token whose embedding is explicitly conditioned on the underlying numerical value. This injects magnitude information directly into the model’s input space while remaining compatible with existing tokenizers and decoder-only Transformer architectures.

Result: Evaluation on arithmetic tasks shows the proposed approach outperforms baselines across numerical formats, tasks, and operand lengths.

Conclusion: Explicitly encoding numerical value is an effective and efficient way to improve fundamental numerical robustness in language models.

Abstract: Transformer-based language models often achieve strong results on mathematical reasoning benchmarks while remaining fragile on basic numerical understanding and arithmetic operations. A central limitation is that numbers are processed as symbolic tokens whose embeddings do not explicitly encode numerical value, leading to systematic errors. We introduce a value-aware numerical representation that augments standard tokenized inputs with a dedicated prefix token whose embedding is explicitly conditioned on the underlying numerical value. This mechanism injects magnitude information directly into the model’s input space while remaining compatible with existing tokenizers and decoder-only Transformer architectures. Evaluation on arithmetic tasks shows that the proposed approach outperforms baselines across numerical formats, tasks, and operand lengths. These results indicate that explicitly encoding numerical value is an effective and efficient way to improve fundamental numerical robustness in language models.
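A hedged sketch of a value-aware prefix token: the number's magnitude is encoded with a signed-log transform and mapped by a small MLP into an embedding prepended before the number's ordinary subword tokens. The signed-log featurization and MLP shape are illustrative assumptions about how the conditioning could be implemented.

```python
import torch
import torch.nn as nn

class ValuePrefix(nn.Module):
    def __init__(self, d_model, hidden=64):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(2, hidden), nn.GELU(),
                                 nn.Linear(hidden, d_model))

    def forward(self, value: torch.Tensor):
        # sign and log-magnitude capture numerical scale explicitly,
        # unlike the symbolic subword embeddings of the digits
        feats = torch.stack([torch.sign(value),
                             torch.log1p(value.abs())], dim=-1)
        return self.mlp(feats)            # (..., d_model) prefix embedding

prefix = ValuePrefix(d_model=768)
emb = prefix(torch.tensor([3.0, -1200.0]))  # one prefix embedding per number
print(emb.shape)  # torch.Size([2, 768])
```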

[72] Prompting4Debugging: Red-Teaming Text-to-Image Diffusion Models by Finding Problematic Prompts

Zhi-Yi Chin, Chieh-Ming Jiang, Ching-Chun Huang, Pin-Yu Chen, Wei-Chen Chiu

Main category: cs.CL

TL;DR: P4D is an automated debugging tool that finds problematic prompts to test safety mechanisms in text-to-image diffusion models like Stable Diffusion, revealing vulnerabilities in existing safety evaluations.

DetailsMotivation: There's growing concern about misuse of text-to-image diffusion models for generating copyrighted or NSFW content, but the reliability of existing safety mechanisms against diverse problematic prompts remains largely unexplored.

Method: Proposed Prompting4Debugging (P4D) as an automated debugging and red-teaming tool that systematically finds problematic prompts to test deployed safety mechanisms in diffusion models.

Result: P4D uncovered new vulnerabilities in Stable Diffusion models with safety mechanisms, showing that around half of prompts in existing safe prompting benchmarks considered “safe” can actually bypass deployed safety mechanisms like concept removal, negative prompts, and safety guidance.

Conclusion: Without comprehensive testing, evaluations on limited safe prompting benchmarks can create a false sense of safety for text-to-image models, highlighting the need for more robust safety testing tools like P4D.

Abstract: Text-to-image diffusion models, e.g. Stable Diffusion (SD), lately have shown remarkable ability in high-quality content generation, and become one of the representatives for the recent wave of transformative AI. Nevertheless, such advance comes with an intensifying concern about the misuse of this generative technology, especially for producing copyrighted or NSFW (i.e. not safe for work) images. Although efforts have been made to filter inappropriate images/prompts or remove undesirable concepts/styles via model fine-tuning, the reliability of these safety mechanisms against diversified problematic prompts remains largely unexplored. In this work, we propose Prompting4Debugging (P4D) as a debugging and red-teaming tool that automatically finds problematic prompts for diffusion models to test the reliability of a deployed safety mechanism. We demonstrate the efficacy of our P4D tool in uncovering new vulnerabilities of SD models with safety mechanisms. Particularly, our result shows that around half of prompts in existing safe prompting benchmarks which were originally considered “safe” can actually be manipulated to bypass many deployed safety mechanisms, including concept removal, negative prompt, and safety guidance. Our findings suggest that, without comprehensive testing, the evaluations on limited safe prompting benchmarks can lead to a false sense of safety for text-to-image models.

[73] Template-Based Probes Are Imperfect Lenses for Counterfactual Bias Evaluation in LLMs

Farnaz Kohankhaki, D. B. Emerson, Jacob-Junqi Tian, Laleh Seyyed-Kalantari, Faiza Khan Khattak

Main category: cs.CL

TL;DR: Template-based probes for counterfactual bias evaluation in LLMs can introduce systematic distortions, artificially suggesting White-associated text is classified as negative due to linguistic asymmetries in training data.

DetailsMotivation: To investigate potential systematic distortions in counterfactual bias evaluation methods that use template-based probes, which may produce misleading bias measurements rather than reflecting genuine model biases.

Method: Analyzed template-based probes across multiple LLMs, diverse templates, and different classification approaches to examine consistency of bias measurements, focusing on linguistic asymmetries like markedness in pretraining data.

Result: Consistently found that template-based probes suggest LLMs classify White-associated text as negative at disproportionately elevated rates across models, templates, and classification methods, indicating artificial distortions rather than genuine bias.

Conclusion: Template-based probes for counterfactual bias evaluation can introduce systematic measurement artifacts due to linguistic asymmetries in pretraining data, highlighting the need for more rigorous methodologies to distinguish genuine biases from measurement artifacts.

Abstract: Bias in large language models (LLMs) has many forms, from overt discrimination to implicit stereotypes. Counterfactual bias evaluation is a widely used approach to quantifying bias and often relies on template-based probes that explicitly state group membership. It aims to measure whether the outcome of a task performed by an LLM is invariant to a change in group membership. In this work, we find that template-based probes can introduce systematic distortions in bias measurements. Specifically, we consistently find that such probes suggest that LLMs classify text associated with White race as negative at disproportionately elevated rates. This is observed consistently across a large collection of LLMs, over several diverse template-based probes, and with different classification approaches. We hypothesize that this arises artificially due to linguistic asymmetries present in LLM pretraining data, in the form of markedness (e.g., Black president vs. president), and templates used for bias measurement (e.g., Black president vs. White president). These findings highlight the need for more rigorous methodologies in counterfactual bias evaluation, ensuring that observed disparities reflect genuine biases rather than artifacts of linguistic conventions.
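A minimal sketch of the template-based counterfactual probe the paper scrutinizes: the same template is filled with different group terms and a classifier's outputs are compared, with any systematic gap read as bias. The paper's caution applies: marked phrases such as "White president" are distributionally unusual in pretraining data, so observed gaps can be measurement artifacts. `classify` stands in for any sentiment model and is an assumption.

```python
def probe_template(template, groups, classify):
    """template: e.g. 'The {} president gave a speech.';
    classify: returns a negativity score for a sentence (stand-in)."""
    return {g: classify(template.format(g)) for g in groups}

# Hypothetical usage:
# scores = probe_template("The {} president gave a speech.",
#                         ["Black", "White", "Asian"], classify)
# gap = max(scores.values()) - min(scores.values())  # naively read as "bias"
```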

[74] Temporal Knowledge Graph Question Answering: A Survey

Miao Su, Zixuan Li, Zhuo Chen, Long Bai, Xiaolong Jin, Jiafeng Guo

Main category: cs.CL

TL;DR: A comprehensive survey paper on Temporal Knowledge Graph Question Answering (TKGQA) that establishes a taxonomy of temporal questions and categorizes existing methods, while outlining future research directions.

DetailsMotivation: The field of Temporal Knowledge Graph Question Answering (TKGQA) faces ambiguities in defining temporal questions and lacks systematic categorization of existing methods, creating a need for a comprehensive survey to clarify the landscape and guide future research.

Method: The paper conducts a thorough survey from two perspectives: 1) establishing a detailed taxonomy of temporal questions based on prior studies, and 2) providing a comprehensive review of TKGQA techniques categorized into semantic parsing-based and TKG embedding-based methods.

Result: The survey provides a systematic framework for understanding TKGQA by clarifying temporal question definitions and organizing existing approaches, serving as a comprehensive reference for researchers in the field.

Conclusion: This work serves as a foundational reference for TKGQA research, stimulates further investigation in the field, and outlines potential research directions to advance temporal question answering over knowledge graphs.

Abstract: Knowledge Base Question Answering (KBQA) has been a long-standing field to answer questions based on knowledge bases. Recently, the evolving dynamics of knowledge have attracted a growing interest in Temporal Knowledge Graph Question Answering (TKGQA), an emerging task to answer temporal questions. However, this field grapples with ambiguities in defining temporal questions and lacks a systematic categorization of existing methods for TKGQA. In response, this paper provides a thorough survey from two perspectives: the taxonomy of temporal questions and the methodological categorization for TKGQA. Specifically, we first establish a detailed taxonomy of temporal questions engaged in prior studies. Subsequently, we provide a comprehensive review of TKGQA techniques of two categories: semantic parsing-based and TKG embedding-based. Building on this review, the paper outlines potential research directions aimed at advancing the field of TKGQA. This work aims to serve as a comprehensive reference for TKGQA and to stimulate further research.

[75] Can Editing LLMs Inject Harm?

Canyu Chen, Baixiang Huang, Zekun Li, Zhaorun Chen, Shiyang Lai, Xiongxiao Xu, Jia-Chen Gu, Jindong Gu, Huaxiu Yao, Chaowei Xiao, Xifeng Yan, William Yang Wang, Philip Torr, Dawn Song, Kai Shu

Main category: cs.CL

TL;DR: Editing Attack: A new safety threat where knowledge editing techniques are used to bypass LLM safety alignment and inject harmful information (misinformation/bias) with high effectiveness and stealthiness.

DetailsMotivation: To investigate whether knowledge editing can be misused to compromise LLM safety alignment and inject harmful information stealthily, addressing an under-explored security threat.

Method: Reformulate knowledge editing as Editing Attack threat; construct EditAttack dataset; systematically investigate two safety risks: Misinformation Injection (commonsense/long-tail) and Bias Injection; evaluate effectiveness and stealthiness.

Result: Editing attacks effectively inject both commonsense and long-tail misinformation (with particularly high effectiveness for commonsense); biased sentences can be injected with high effectiveness, and a single biased sentence injection degrades overall fairness; attacks demonstrate high stealthiness.

Conclusion: Knowledge editing techniques pose emerging misuse risks for compromising LLM safety alignment, enabling feasible dissemination of misinformation/bias through LLMs as new channels, highlighting critical security vulnerabilities.

Abstract: Large Language Models (LLMs) have emerged as a new information channel. Meanwhile, one critical but under-explored question is: Is it possible to bypass the safety alignment and inject harmful information into LLMs stealthily? In this paper, we propose to reformulate knowledge editing as a new type of safety threat for LLMs, namely Editing Attack, and conduct a systematic investigation with a newly constructed dataset EditAttack. Specifically, we focus on two typical safety risks of Editing Attack including Misinformation Injection and Bias Injection. For the first risk, we find that editing attacks can inject both commonsense and long-tail misinformation into LLMs, and the effectiveness for the former one is particularly high. For the second risk, we discover that not only can biased sentences be injected into LLMs with high effectiveness, but also one single biased sentence injection can degrade the overall fairness. Then, we further illustrate the high stealthiness of editing attacks. Our discoveries demonstrate the emerging misuse risks of knowledge editing techniques on compromising the safety alignment of LLMs and the feasibility of disseminating misinformation or bias with LLMs as new channels.

[76] “Hiding in Plain Sight”: Designing Synthetic Dialog Generation for Uncovering Socially Situated Norms

Chengfei Wu, Dan Goldwasser

Main category: cs.CL

TL;DR: A framework for generating dialogues that automatically uncovers social norms from context-rich interactions, used to create NormHint dataset with turn-level norm violation annotations.

DetailsMotivation: To capture social norms inherent in naturally situated conversations without relying on predefined norm labels, enabling better understanding of how norms manifest in diverse conversational contexts.

Method: Multi-step framework using self-assessment and norm discovery to generate dialogues, constructing NormHint dataset with diverse interlocutor attributes, relationships, topics, and trajectories, annotated with turn-level norm violations and remediation suggestions.

Result: Created comprehensive NormHint dataset with high naturalness and realism, validated by humans and automated analysis. Fine-tuning models with norm violation data significantly improves their ability to detect and understand potential norm violations.

Conclusion: The proposed framework successfully generates realistic dialogues that capture social norms, and the resulting dataset enhances AI models’ understanding of conversational norms and violations.

Abstract: Naturally situated conversations encapsulate the social norms inherent to their context, reflecting both the relationships between interlocutors and the underlying communicative intent. In this paper, we propose a novel, multi-step framework for generating dialogues that automatically uncovers social norms from rich, context-laden interactions through a process of self-assessment and norm discovery, rather than relying on predefined norm labels. Leveraging this framework, we construct NormHint, a comprehensive synthetic dialogue dataset spanning a wide range of interlocutor attributes (e.g., age, profession, personality), relationship types, conversation topics, and conversational trajectories. NormHint is meticulously annotated with turn-level norm violation information, detailed participant descriptions, and remediation suggestions-including alternative trajectories achieved through early intervention. Human validation and automated analysis demonstrate that our dataset captures diverse conversational topics with high naturalness and realism. Moreover, we discovered that fine-tuning a model with our norm violation data significantly enhances its ability to detect and understand potential norm violations in conversations.

[77] Mathematical Derivation Graphs: A Relation Extraction Task in STEM Manuscripts

Vishesh Prasad, Brian Kim, Nickvash Kani

Main category: cs.CL

TL;DR: This paper introduces a new dataset (MDGD) for extracting mathematical equation dependencies in STEM articles and evaluates LLMs on this task, achieving F1 scores of 45-52%.

DetailsMotivation: While NLP and LLMs have advanced textual analysis, they struggle with understanding mathematical equations and their relationships in STEM texts. The paper aims to expand relation extraction to mathematical dependencies.

Method: Created Mathematical Derivation Graphs Dataset (MDGD) from 107 arXiv STEM manuscripts with 2000+ manually labeled inter-equation dependencies. Evaluated analytical and ML models (including LLMs) on extracting derivation relationships.

Result: Best LLMs achieved F1 scores of ~45-52% on identifying mathematical derivation relationships. The authors attempted to improve performance by combining LLMs with analytic algorithms and other methods.

Conclusion: The paper establishes a new benchmark for mathematical relation extraction, showing current LLMs have limited capability (~50% F1) in understanding mathematical dependencies, highlighting the need for specialized approaches.

Abstract: Recent advances in natural language processing (NLP), particularly with the emergence of large language models (LLMs), have significantly enhanced the field of textual analysis. However, while these developments have yielded substantial progress in analyzing natural language text, applying analysis to mathematical equations and their relationships within texts has produced mixed results. This paper takes the initial steps in expanding the problem of relation extraction towards understanding the dependency relationships between mathematical expressions in STEM articles. The authors construct the Mathematical Derivation Graphs Dataset (MDGD), sourced from a random sampling of the arXiv corpus, containing an analysis of 107 published STEM manuscripts with over 2000 manually labeled inter-equation dependency relationships, resulting in a new object referred to as a derivation graph that summarizes the mathematical content of the manuscript. The authors exhaustively evaluate analytical and machine learning (ML) based models to assess their capability to identify and extract the derivation relationships for each article and compare the results with the ground truth. The authors show that the best tested LLMs achieve F1 scores of ~45%-52%, and attempt to improve their performance by combining them with analytic algorithms and other methods.
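A minimal sketch of the derivation-graph object the dataset defines: nodes are a manuscript's equations and directed edges point from premise equations to those derived from them, so extraction reduces to predicting an edge set. The field names and the edge-level F1 helper are illustrative assumptions about how one might score predictions against the ground truth.

```python
from collections import defaultdict

def build_derivation_graph(dependencies):
    """dependencies: iterable of (premise_eq_id, derived_eq_id) pairs."""
    graph = defaultdict(set)
    for premise, derived in dependencies:
        graph[premise].add(derived)
    return graph

def edge_f1(predicted, gold):
    """F1 of a predicted edge set against the ground-truth edge set."""
    if not predicted or not gold:
        return 0.0
    tp = len(predicted & gold)
    precision, recall = tp / len(predicted), tp / len(gold)
    return 2 * precision * recall / (precision + recall + 1e-9)

gold = {("eq1", "eq3"), ("eq2", "eq3"), ("eq3", "eq4")}
pred = {("eq1", "eq3"), ("eq3", "eq4"), ("eq1", "eq4")}
print(round(edge_f1(pred, gold), 2))  # 0.67
```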

[78] DiffSampling: Enhancing Diversity and Accuracy in Neural Text Generation

Giorgio Franceschelli, Mirco Musolesi

Main category: cs.CL

TL;DR: DiffSampling is a new decoding method that uses mathematical analysis of token probability distributions to generate contextually appropriate text while maintaining quality and sampling from a larger token set.

DetailsMotivation: Current language models frequently reproduce training data, generate repetitive text, and favor common patterns due to limitations in decoding strategies - either too conservative (reducing diversity) or too aggressive (compromising accuracy).

Method: DiffSampling leverages mathematical analysis of token probability distributions, specifically using the difference between consecutive sorted probabilities to truncate incorrect tokens. Two variations are also proposed to correct inconsistencies in common sampling strategies.

Result: Experiments across four different text-generation tasks show that DiffSampling consistently performs at least on par with existing methods in terms of quality, despite sampling from a larger set of tokens.

Conclusion: DiffSampling offers an effective decoding approach that addresses limitations of current strategies by using probability distribution analysis to balance diversity and accuracy in text generation.

Abstract: Despite their growing capabilities, language models still frequently reproduce content from their training data, generate repetitive text, and favor common grammatical patterns and vocabulary. A possible cause is the decoding strategy: the most common strategies either consider only the most probable tokens, which reduces output diversity, or increase the likelihood of unlikely tokens, compromising output accuracy and correctness. In this paper, we propose DiffSampling, a new decoding method that leverages a mathematical analysis of the token probability distribution to ensure the generation of contextually appropriate text. In particular, the difference between consecutive, sorted probabilities can be used to truncate incorrect tokens. In addition, we also propose two variations of the proposed method that aim to correct the subtle inconsistencies of common sampling strategies. Experiments involving four different text-generation tasks demonstrate that our approach consistently performs at least on par with the existing methods it builds upon in terms of quality, despite sampling from a larger set of tokens.
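A hedged sketch of difference-based truncation in the spirit of DiffSampling: sort token probabilities in descending order, find the largest drop between consecutive probabilities, and sample only from the tokens before that drop. Treating the single largest gap as the cutoff is one plausible reading of the method, not the authors' exact rule.

```python
import numpy as np

def diff_truncated_sample(probs, rng):
    order = np.argsort(probs)[::-1]          # tokens by descending probability
    sorted_p = probs[order]
    diffs = np.diff(sorted_p)                # negative; big drop = very negative
    cutoff = int(np.argmin(diffs)) + 1       # keep tokens before the largest drop
    kept = order[:cutoff]
    p = probs[kept] / probs[kept].sum()      # renormalize the surviving head
    return int(rng.choice(kept, p=p))

rng = np.random.default_rng(0)
probs = np.array([0.45, 0.35, 0.1, 0.05, 0.05])
print(diff_truncated_sample(probs, rng))     # drawn from tokens 0-1 only here
```

Unlike a fixed top-k or top-p threshold, the cutoff adapts to the shape of each distribution, which is how the method can admit a larger token set without sacrificing accuracy.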

[79] A Benchmark for End-to-End Zero-Shot Biomedical Relation Extraction with LLMs: Experiments with OpenAI Models

Aviv Brokman, Xuguang Ai, Yuhang Jiang, Shashank Gupta, Ramakanth Kavuluru

Main category: cs.CL

TL;DR: LLM-based zero-shot relation extraction is approaching supervised methods on some biomedical datasets but struggles with complex multi-relation inputs.

DetailsMotivation: With rapid growth of scientific literature, relation extraction is essential for knowledge graph curation. The rise of LLMs raises the question of whether to skip supervised RE methods and annotation efforts in favor of zero-shot RE via LLM API calls.

Method: Proposed benchmark with seven biomedical RE datasets with interesting characteristics. Evaluated three OpenAI models (GPT-4, o1, and GPT-OSS-120B) for end-to-end zero-shot relation extraction.

Result: LLM-based zero-shot RE is approaching supervised methods in performance on some datasets but struggles on complex inputs expressing multiple relations with different predicates. Error analysis reveals scope for improvements.

Conclusion: While LLM-based zero-shot RE shows promise and is getting closer to supervised methods for some biomedical tasks, it still faces challenges with complex multi-relation extraction, indicating need for further improvements.

Abstract: Extracting relations from scientific literature is a fundamental task in biomedical NLP because entities and relations among them drive hypothesis generation and knowledge discovery. As literature grows rapidly, relation extraction (RE) is indispensable to curate knowledge graphs to be used as computable structured and symbolic representations. With the rise of LLMs, it is pertinent to examine if it is better to skip tailoring supervised RE methods, save annotation burden, and just use zero-shot RE (ZSRE) via LLM API calls. In this paper, we propose a benchmark with seven biomedical RE datasets with interesting characteristics and evaluate three OpenAI models (GPT-4, o1, and GPT-OSS-120B) for end-to-end ZSRE. We show that LLM-based ZSRE is inching closer to supervised methods in performance on some datasets but still struggles on complex inputs expressing multiple relations with different predicates. Our error analysis reveals scope for improvements.

[80] Mitigating Gender Bias via Fostering Exploratory Thinking in LLMs

Kangda Wei, Hasnat Md Abdullah, Ruihong Huang

Main category: cs.CL

TL;DR: Novel data generation framework reduces gender bias in LLMs by prompting them to generate story pairs with male/female protagonists in identical scenarios, then using DPO to align moral judgments.

DetailsMotivation: LLMs often exhibit gender bias, leading to unequal treatment of male and female subjects across different contexts, which needs to be addressed.

Method: Framework prompts LLMs to generate story pairs with male/female protagonists in structurally identical, morally ambiguous scenarios, compares moral judgments, and when inconsistencies arise, guides models to produce balanced, gender-neutral judgments. These story-judgment pairs are used for fine-tuning via Direct Preference Optimization (DPO).

Result: Experimental results show the method significantly reduces gender bias while preserving or even enhancing general model capabilities.

Conclusion: The proposed exploratory data generation framework effectively mitigates gender bias in LLMs through structured scenario generation and DPO optimization, with code and data publicly released.

Abstract: Large Language Models (LLMs) often exhibit gender bias, resulting in unequal treatment of male and female subjects across different contexts. To address this issue, we propose a novel data generation framework that fosters exploratory thinking in LLMs. Our approach prompts models to generate story pairs featuring male and female protagonists in structurally identical, morally ambiguous scenarios, then elicits and compares their moral judgments. When inconsistencies arise, the model is guided to produce balanced, gender-neutral judgments. These story-judgment pairs are used to fine-tune or optimize the models via Direct Preference Optimization (DPO). Experimental results show that our method significantly reduces gender bias while preserving or even enhancing general model capabilities. We release the code and generated data at: https://github.com/WeiKangda/LLMs-Exploratory-Bias-Mitigation/tree/main.
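A hedged sketch of turning the story-pair procedure into DPO training data: when the model's moral judgments for the male- and female-protagonist versions of the same scenario disagree, a guided gender-neutral judgment becomes the preferred response and the inconsistent one the rejected response. `judge` and `neutral_rewrite` stand in for LLM calls, and the record layout is an assumption about the pipeline.

```python
def make_dpo_pairs(scenarios, judge, neutral_rewrite):
    pairs = []
    for scenario in scenarios:
        story_m, story_f = scenario.male_version, scenario.female_version
        jm, jf = judge(story_m), judge(story_f)
        if jm.verdict != jf.verdict:          # gendered inconsistency found
            balanced = neutral_rewrite(scenario, jm, jf)  # guided neutral judgment
            for story, biased in ((story_m, jm), (story_f, jf)):
                pairs.append({"prompt": story,
                              "chosen": balanced,
                              "rejected": biased.text})
    return pairs
```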

[81] Can LLMs Generate Reliable Test Case Generators? A Study on Competition-Level Programming Problems

Yuhan Cao, Zian Chen, Kun Quan, Ziliang Zhang, Yu Wang, Xiaoning Dong, Yeqi Feng, Guanzhong He, Jingcheng Huang, Jianhao Li, Yixuan Tan, Jiafu Tang, Yilin Tang, Junlei Wu, Qianyu Xiao, Can Zheng, Shouchen Zhou, Yuxiang Zhu, Yiming Huang, Tianxing He

Main category: cs.CL

TL;DR: LLMs struggle to generate targeted test case generators that expose bugs in human code, though they can generate valid test case generators for competition programming problems.

DetailsMotivation: To explore LLMs' capabilities in code checking/debugging through test case generation, particularly for competition-level programming where test cases are crucial for finding bugs.

Method: Created TCGBench benchmark with two tasks: (1) generating valid test case generators for CP problems, and (2) generating targeted test case generators that expose bugs in human code. Evaluated state-of-the-art LLMs and created a manually curated dataset for targeted generator generation.

Result: LLMs can generate valid test case generators in most cases but struggle significantly with targeted bug-exposing generators. Even advanced reasoning models like o3-mini fall short of human performance. However, performance improves with the curated dataset through prompting and fine-tuning.

Conclusion: While LLMs show promise for test case generation, they need improvement for targeted bug-finding tasks. The curated dataset helps bridge this gap, suggesting that better training data and methods could enhance LLMs’ debugging capabilities.

Abstract: Large Language Models (LLMs) have demonstrated remarkable capabilities in code generation, capable of tackling complex tasks during inference. However, the extent to which LLMs can be utilized for code checking or debugging through test case generation remains largely unexplored. We investigate this problem from the perspective of competition-level programming (CP) programs and propose TCGBench, a Benchmark for (LLM generation of) Test Case Generators. This benchmark comprises two tasks, aimed at studying the capabilities of LLMs in (1) generating valid test case generators for a given CP problem, and further (2) generating targeted test case generators that expose bugs in human-written code. Experimental results indicate that while state-of-the-art LLMs can generate valid test case generators in most cases, most LLMs struggle to generate targeted test cases that reveal flaws in human code effectively. Especially, even advanced reasoning models (e.g., o3-mini) fall significantly short of human performance in the task of generating targeted generators. Furthermore, we construct a high-quality, manually curated dataset of instructions for generating targeted generators. Analysis demonstrates that the performance of LLMs can be enhanced with the aid of this dataset, by both prompting and fine-tuning.
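For readers unfamiliar with competitive programming tooling, here is a minimal example of what a "test case generator" means in TCGBench: a small program that emits random inputs conforming to a problem's constraints. The problem here (an array of n integers within given bounds) is invented for illustration; a *targeted* generator, the task LLMs struggle with, would instead bias these choices toward a suspected bug, e.g. forcing duplicates, extreme values, or adversarial orderings.

```python
import random

def generate_test(max_n=10**5, max_v=10**9, seed=None):
    """Emit one valid input: n on the first line, n integers on the second."""
    rng = random.Random(seed)
    n = rng.randint(1, max_n)
    arr = [rng.randint(-max_v, max_v) for _ in range(n)]
    return f"{n}\n{' '.join(map(str, arr))}\n"

print(generate_test(max_n=10, max_v=50, seed=0))
```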

[82] HapticLLaMA: A Multimodal Sensory Language Model for Haptic Captioning

Guimin Hu, Daniel Hershcovich, Hasti Seifi

Main category: cs.CL

TL;DR: HapticLLaMA is a multimodal language model that generates natural language descriptions from haptic vibration signals, achieving strong performance in haptic captioning through frequency-based and EnCodec tokenizers with two-stage training (supervised fine-tuning + RLHF).

DetailsMotivation: Haptic signals for touch remain underexplored compared to vision and audio in multimodal research, creating a gap for applications in virtual reality, accessibility, and rehabilitation that could benefit from haptic captioning.

Method: Proposed HapticLLaMA with two haptic tokenizers (frequency-based and EnCodec-based) to convert vibration signals into discrete units, integrated with LLaMA using two-stage training: 1) supervised fine-tuning with LoRA adaptation, and 2) RLHF fine-tuning.

Result: Achieved METEOR score of 59.98 and BLEU-4 score of 32.06; over 61% of generated captions received human ratings above 3.5/7, with RLHF yielding 10% improvement in overall rating distribution.

Conclusion: HapticLLaMA demonstrates strong capability in interpreting haptic vibration signals, highlighting the potential of large language models to process and adapt to sensory data beyond vision and audio.

Abstract: Haptic captioning is the task of generating natural language descriptions from haptic signals, such as vibrations, for use in virtual reality, accessibility, and rehabilitation applications. While previous multimodal research has focused primarily on vision and audio, haptic signals for the sense of touch remain underexplored. To address this gap, we formalize the haptic captioning task and propose HapticLLaMA, a multimodal sensory language model that interprets vibration signals into descriptions in a given sensory, emotional, or associative category. We investigate two types of haptic tokenizers, a frequency-based tokenizer and an EnCodec-based tokenizer, that convert haptic signals into sequences of discrete units, enabling their integration with the LLaMA model. HapticLLaMA is trained in two stages: (1) supervised fine-tuning using the LLaMA architecture with LoRA-based adaptation, and (2) fine-tuning via reinforcement learning from human feedback (RLHF). We assess HapticLLaMA’s captioning performance using both automated n-gram metrics and human evaluation. HapticLLaMA demonstrates strong capability in interpreting haptic vibration signals, achieving a METEOR score of 59.98 and a BLEU-4 score of 32.06 respectively. Additionally, over 61% of the generated captions received human ratings above 3.5 on a 7-point scale, with RLHF yielding a 10% improvement in the overall rating distribution, indicating stronger alignment with human haptic perception. These findings highlight the potential of large language models to process and adapt to sensory data.
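A hedged sketch of a frequency-based haptic tokenizer: a vibration waveform is sliced into frames, each frame is reduced to its dominant frequency and RMS amplitude, and both are bucketed into a small discrete vocabulary the language model can consume. The frame length and bin counts are illustrative assumptions; the paper's tokenizer may differ in detail.

```python
import numpy as np

def tokenize_vibration(signal, sr=8000, frame=256, n_freq_bins=32, n_amp_bins=8):
    tokens = []
    for start in range(0, len(signal) - frame + 1, frame):
        window = signal[start:start + frame]
        spectrum = np.abs(np.fft.rfft(window))
        freqs = np.fft.rfftfreq(frame, d=1.0 / sr)
        dom = freqs[int(np.argmax(spectrum))]            # dominant frequency
        amp = float(np.sqrt(np.mean(window ** 2)))       # RMS amplitude
        f_bin = min(int(dom / (sr / 2) * n_freq_bins), n_freq_bins - 1)
        a_bin = min(int(amp * n_amp_bins), n_amp_bins - 1)
        tokens.append(f_bin * n_amp_bins + a_bin)        # one discrete unit
    return tokens

t = tokenize_vibration(np.sin(2 * np.pi * 250 * np.arange(8000) / 8000))
print(t[:5])  # a constant 250 Hz buzz maps to a repeated token
```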

[83] LingVarBench: Benchmarking LLMs on Entity Recognitions and Linguistic Verbalization Patterns in Phone-Call Transcripts

Seyedali Mohammadi, Manas Paldhe, Amit Chhabra, Youngseo Son, Vishal Seshagiri

Main category: cs.CL

TL;DR: LingVarBench: A benchmark and synthetic data generation pipeline for structured entity extraction from phone-call transcripts, using LLM-sampled entities, curated verbalization patterns, and consistency filtering to train robust extraction models without real annotated data.

DetailsMotivation: Extracting structured entities from phone-call transcripts in customer-support and healthcare settings is challenging due to costly annotation, privacy constraints limiting data access, and degradation of existing methods under disfluencies, interruptions, and speaker overlap.

Method: Introduces LingVarBench with three components: (1) LLM-sampled entity values, (2) curated linguistic verbalization patterns covering diverse disfluencies and entity-specific readout styles, and (3) a value-transcript consistency filter. Uses DSPy’s SIMBA to automatically synthesize and optimize extraction prompts from this synthetic data.

Result: On real customer transcripts, prompts optimized solely on LingVarBench outperform zero-shot baselines and match or closely approach human-tuned prompts for structured entities (F1 ~94-95% for ZIP code, date of birth, name). For subjective questionnaire items, optimized prompts substantially improve over zero-shot and approach human-tuned performance.

Conclusion: LingVarBench offers a practical, cost-efficient path to deployment in direct-answer settings, enabling robust entity extraction without real annotated data, with real annotations later enabling additional refinement.

Abstract: We study structured entity extraction from phone-call transcripts in customer-support and healthcare settings, where annotation is costly, and data access is limited by privacy and consent. Existing methods degrade under disfluencies, interruptions, and speaker overlap, yet large real-call corpora are rarely shareable. We introduce LingVarBench, a benchmark and semantic synthetic data generation pipeline that generates linguistically varied training data via (1) LLM-sampled entity values, (2) curated linguistic verbalization patterns covering diverse disfluencies and entity-specific readout styles, and (3) a value-transcript consistency filter. Using this dataset, DSPy’s SIMBA automatically synthesizes and optimizes extraction prompts, reducing manual prompt engineering and targeting robustness to verbal variation. On real customer transcripts, prompts optimized solely on LingVarBench outperform zero-shot baselines and match or closely approach human-tuned prompts for structured entities such as ZIP code, date of birth, and name (F1 approximately 94-95 percent). For subjective questionnaire items, optimized prompts substantially improve over zero-shot performance and approach human-tuned prompts. LingVarBench offers a practical and cost-efficient path to deployment in a direct-answer setting, with real annotations later enabling additional refinement.

[84] KPoEM: A Human-Annotated Dataset for Emotion Classification and RAG-Based Poetry Generation in Korean Modern Poetry

Iro Lim, Haein Ji, Byungjun Kim

Main category: cs.CL

TL;DR: KPoEM is a novel Korean poetry emotion dataset with 7,662 entries annotated with 44 fine-grained emotion categories, enabling emotion classification (F1-micro: 0.60) and emotion-aware poetry generation.

DetailsMotivation: Poetry remains underexplored in NLP due to complex figurative language and cultural specificity, especially for Korean poetry which lacks emotion-centered datasets for computational analysis and generation.

Method: Created KPoEM dataset with 7,662 entries (7,007 line-level, 615 work-level) from five Korean poets, annotated with 44 emotion categories. Developed emotion classification model using sequential fine-tuning strategy from general corpora to specialized KPoEM data, and applied structured emotion data to RAG-based poetry generation.
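
For the classification side, a multi-label setup along these lines would match the 44-category description; the Korean backbone (klue/roberta-base), the 0.5 threshold, and the input line are assumptions, not the paper's configuration, and the fresh head would need fine-tuning on KPoEM before its outputs mean anything:

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

NUM_EMOTIONS = 44  # fine-grained emotion categories in KPoEM

tokenizer = AutoTokenizer.from_pretrained("klue/roberta-base")
model = AutoModelForSequenceClassification.from_pretrained(
    "klue/roberta-base",
    num_labels=NUM_EMOTIONS,
    problem_type="multi_label_classification",  # BCE loss, one sigmoid per label
)

line = "a line of modern Korean poetry (placeholder)"
inputs = tokenizer(line, return_tensors="pt", truncation=True)
with torch.no_grad():
    logits = model(**inputs).logits
probs = torch.sigmoid(logits)               # independent probability per emotion
predicted = (probs > 0.5).nonzero()[:, 1]   # indices of predicted emotions
print(predicted.tolist())
```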

Result: KPoEM emotion classification model achieved F1-micro score of 0.60, significantly outperforming previous models (0.43). The model effectively identifies temporally and culturally specific emotional expressions while preserving core poetic sentiments. RAG-based poetry generation demonstrates feasibility of producing texts reflecting Korean literary emotional and cultural sensibilities.

Conclusion: KPoEM provides foundational dataset for advancing emotion-centered analysis and creation in modern Korean poetry, bridging computational techniques with literary analysis and opening pathways for quantitative emotion research and generative poetics.

Abstract: This study introduces KPoEM (Korean Poetry Emotion Mapping), a novel dataset that serves as a foundation for both emotion-centered analysis and generative applications in modern Korean poetry. Despite advancements in NLP, poetry remains underexplored due to its complex figurative language and cultural specificity. We constructed a multi-label dataset of 7,662 entries (7,007 line-level and 615 work-level), annotated with 44 fine-grained emotion categories from five influential Korean poets. The KPoEM emotion classification model, fine-tuned through a sequential strategy – moving from general-purpose corpora to the specialized KPoEM dataset – achieved an F1-micro score of 0.60, significantly outperforming previous models (0.43). The model demonstrates an enhanced ability to identify temporally and culturally specific emotional expressions while preserving core poetic sentiments. Furthermore, applying the structured emotion dataset to a RAG-based poetry generation model demonstrates the empirical feasibility of generating texts that reflect the emotional and cultural sensibilities of Korean literature. This integrated approach strengthens the connection between computational techniques and literary analysis, opening new pathways for quantitative emotion research and generative poetics. Overall, this study provides a foundation for advancing emotion-centered analysis and creation in modern Korean poetry.

[85] AgenticIE: An Adaptive Agent for Information Extraction from Complex Regulatory Documents

Gaye Colakoglu, Gürkan Solmaz, Jonathan Fürst

Main category: cs.CL

TL;DR: Proposes DoP Key Information Extraction and QA as new NLP challenges, introduces an agentic system built on a planner-executor-corresponder pattern that outperforms LLM baselines, and contributes a high-density multilingual dataset.

DetailsMotivation: Declaration of Performance (DoP) documents are essential for construction quality control and carbon footprint reduction but are not machine-readable, with significant variation in layout, schema, format, and languages, creating a need for automated information extraction.

Method: Designs a domain-specific AgenticIE system based on a planner-executor-corresponder pattern for DoP documents, and creates a high-density expert-annotated dataset of complex multi-page regulatory documents in English and German with over 15K annotated entities.
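
A compact sketch of the planner-executor-corresponder control flow, with a canned stand-in for the LLM so it runs end to end; the function names, prompts, and replanning loop are hypothetical, not the paper's interfaces:

```python
def call_llm(prompt: str) -> str:
    # canned stand-in so the sketch runs; replace with a real client
    return "yes" if prompt.startswith("Question:") else "locate field\nread value"

def plan(question: str, document: str) -> list[str]:
    """Planner: break a KIE/QA request into ordered extraction steps."""
    return call_llm(f"Plan steps for: {question}\n{document}").splitlines()

def execute(step: str, document: str) -> str:
    """Executor: carry out one step (locate a table, read a field, ...)."""
    return call_llm(f"Execute: {step}\nDocument:\n{document}")

def correspond(question: str, results: list[str]) -> dict:
    """Corresponder: check completeness; an incomplete verdict triggers replanning."""
    verdict = call_llm(f"Question: {question}\nResults: {results}\nComplete?")
    return {"answer": results, "complete": verdict.startswith("yes")}

def agentic_ie(question: str, document: str, max_rounds: int = 3) -> dict:
    outcome = {"answer": [], "complete": False}
    for _ in range(max_rounds):
        results = [execute(s, document) for s in plan(question, document)]
        outcome = correspond(question, results)
        if outcome["complete"]:
            break
    return outcome

print(agentic_ie("What is the declared fire resistance?", "DoP No. 123 ..."))
```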

Result: The agentic system outperforms static and multimodal LLM baselines, achieving Exact Match scores of 0.396 vs. 0.342 (GPT-4o, +16% improvement) and 0.314 (GPT-4o-V, +26% improvement) across KIE and QA tasks.

Conclusion: The work successfully introduces DoP KIE and QA as new NLP challenges, validates the benefits of the agentic system approach, and demonstrates the challenging nature of the new DoP dataset compared to standard IE datasets.

Abstract: Declaration of Performance (DoP) documents, mandated by EU regulation, specify characteristics of construction products, such as fire resistance and insulation. While this information is essential for quality control and reducing carbon footprints, it is not easily machine readable. Despite content requirements, DoPs exhibit significant variation in layout, schema, and format, further complicated by their multilingual nature. In this work, we propose DoP Key Information Extraction (KIE) and Question Answering (QA) as new NLP challenges. To address this challenge, we design a domain-specific AgenticIE system based on a planner-executor-corresponder pattern. For evaluation, we introduce a high-density, expert-annotated dataset of complex, multi-page regulatory documents in English and German. Unlike standard IE datasets (e.g., FUNSD, CORD) with sparse annotations, our dataset contains over 15K annotated entities, averaging over 190 annotations per document. Our agentic system outperforms static and multimodal LLM baselines, achieving Exact Match (EM) scores of 0.396 vs. 0.342 (GPT-4o, +16%) and 0.314 (GPT-4o-V, +26%) across the KIE and QA tasks. Our experimental analysis validates the benefits of the agentic system, as well as the challenging nature of our new DoP dataset.

[86] Pragmatic Inference for Moral Reasoning Acquisition: Generalization via Distributional Semantics

Guangliang Liu, Xi Chen, Bocheng Chen, Han Zi, Xitong Zhang, Kristen Johnson

Main category: cs.CL

TL;DR: The paper proposes pragmatic inference methods based on moral foundations theory to help LLMs achieve generalized moral reasoning by bridging the gap between distributional semantics and pragmatic-level moral understanding.

DetailsMotivation: LLMs struggle with generalization in moral reasoning because they rely on distributional semantics, which operates at a different level than moral reasoning that requires pragmatic understanding. There's a need to bridge this gap to enable more robust moral reasoning in LLMs.

Method: The authors propose pragmatic inference methods grounded in moral foundations theory. These methods leverage contextual information at each step to bridge the pragmatic gap and guide LLMs in connecting moral foundations with moral reasoning objectives.

Result: Experimental results show that the proposed approach significantly enhances LLMs’ generalization in moral reasoning, demonstrating improved performance compared to baseline methods.

Conclusion: The work provides a foundation for future research in moral reasoning for LLMs grounded in moral foundations theory, showing that pragmatic inference methods can effectively bridge the gap between distributional semantics and moral understanding.

Abstract: Moral reasoning has emerged as a promising research direction for Large Language Models (LLMs), yet achieving generalization remains a central challenge. From a linguistic standpoint, this difficulty arises because LLMs are adept at capturing distributional semantics, which fundamentally differs from the morals which operate at the pragmatic level. This paper investigates how LLMs can achieve generalized moral reasoning despite their reliance on distributional semantics. We propose pragmatic inference methods grounded in moral foundations theory, which leverage contextual information at each step to bridge the pragmatic gap and guide LLMs in connecting moral foundations with moral reasoning objectives. Experimental results demonstrate that our approach significantly enhances LLMs’ generalization in moral reasoning, providing a foundation for future research grounded in moral foundations theory.

[87] ThinkBrake: A Simple Test-Time Decoding Control for Efficient Reasoning

Sangjun Song, Minjae Oh, Seungkyu Lee, Sungmin Jo, Yohan Jo

Main category: cs.CL

TL;DR: ThinkBrake: A training-free method that monitors token probabilities to detect when LLMs should stop reasoning, curbing overthinking and cutting thinking-token usage by up to 30% while maintaining accuracy.

DetailsMotivation: Large Reasoning Models often suffer from "overthinking" - they reach correct intermediate solutions but continue reasoning and overwrite them with incorrect answers, wasting compute and reducing accuracy.

Method: ThinkBrake monitors the log-probability margin between the top continuation token and the end-of-thinking token at sentence boundaries, stopping reasoning when this margin narrows. It requires no training and uses only inference-time monitoring.
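
The stopping rule reduces to one margin comparison per sentence boundary. A minimal sketch under stated assumptions (the end-of-thinking token id and the threshold below are illustrative):

```python
import torch

def should_brake(logits: torch.Tensor, end_think_id: int, tau: float = 1.0) -> bool:
    """Decide at a sentence boundary whether to stop the reasoning phase.

    logits: next-token logits, shape (vocab,). Stops when the top
    continuation token's log-prob exceeds the end-of-thinking token's
    by less than tau, i.e., the margin has narrowed.
    """
    logprobs = torch.log_softmax(logits, dim=-1)
    margin = logprobs.max() - logprobs[end_think_id]
    return margin.item() < tau

# Toy check: a distribution where the end-of-thinking token is nearly top-1.
logits = torch.zeros(100)
logits[7] = 5.0    # hypothetical end-of-thinking token id
logits[42] = 5.2   # current top continuation token
print(should_brake(logits, end_think_id=7))  # True: margin 0.2 < tau
```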

Result: ThinkBrake achieves favorable accuracy-efficiency trade-offs across math, scientific QA, and tool usage benchmarks, reducing thinking token usage by up to 30% while maintaining or improving accuracy.

Conclusion: The method effectively addresses overthinking in Large Reasoning Models through a simple, training-free approach that monitors token probabilities, and theoretical analysis shows it’s equivalent to test-time realignment with a reward bonus for stopping.

Abstract: Large Reasoning Models (LRMs) allocate substantial inference-time compute to Chain-of-Thought (CoT) reasoning, improving performance on mathematics, scientific QA, and tool usage. However, this introduces overthinking: LRMs often reach a correct intermediate solution, continue reasoning, and overwrite it with an incorrect answer. We first demonstrate that oracle stopping, where we inject the end-of-thinking token at every sentence boundary and select the best stopping point in hindsight, improves average accuracy by 8% while reducing thinking tokens by 72%, exposing substantial overthinking. Motivated by this finding, we propose ThinkBrake, which monitors the log-probability margin between the top continuation token and the end-of-thinking token at sentence boundaries, stopping reasoning when this margin narrows. ThinkBrake requires no training and achieves favorable accuracy-efficiency trade-offs across math, scientific QA, and tool usage benchmarks, reducing thinking token usage by up to 30%. Furthermore, we provide theoretical analysis showing that ThinkBrake is equivalent to test-time realignment with a reward bonus for the end-of-thinking token.

[88] Where to Begin: Efficient Pretraining via Subnetwork Selection and Distillation

Arjun Krishnakumar, Rhea Sanjay Sukthanker, Hannan Javed Mahadik, Gabriela Kadlecová, Vladyslav Moroshan, Timur Carstensen, Frank Hutter, Aaron Klein

Main category: cs.CL

TL;DR: A framework for efficient pretraining of Small Language Models (SLMs) using sparse sub-network initialization, evolutionary search, and knowledge distillation to achieve comparable performance with significantly fewer computational resources.

DetailsMotivation: Small Language Models (SLMs) offer an efficient alternative to Large Language Models (LLMs) but need better pretraining methods to maximize performance while minimizing computational costs and resource requirements.

Method: Three-component framework: 1) Identify structurally sparse sub-network initializations that outperform random initialization, 2) Use evolutionary search to discover high-quality sub-network initializations, 3) Apply knowledge distillation from larger teacher models to accelerate training and improve generalization.
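
A toy rendering of the evolutionary-search component over binary unit masks; the fitness proxy below stands in for the short training runs a real search would score, and the population sizes and mutation rate are arbitrary:

```python
import random

N_UNITS, POP, GENS, K = 32, 16, 10, 4  # prunable units, population, generations, elites

def fitness(mask: tuple) -> float:
    # stand-in proxy favoring ~50% sparsity; a real search would briefly
    # train the masked subnetwork and return negative validation loss
    return -abs(sum(mask) - N_UNITS // 2)

def mutate(mask: tuple, p: float = 0.05) -> tuple:
    return tuple(b ^ (random.random() < p) for b in mask)  # flip each bit w.p. p

population = [tuple(random.randint(0, 1) for _ in range(N_UNITS)) for _ in range(POP)]
for _ in range(GENS):
    elites = sorted(population, key=fitness, reverse=True)[:K]
    population = elites + [mutate(random.choice(elites)) for _ in range(POP - K)]

best_mask = max(population, key=fitness)  # initialization for pretraining + distillation
print(best_mask)
```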

Result: The best model matches the validation perplexity of a comparable Pythia SLM while requiring 5.16x fewer FLOPs for 10B tokens and 1.26x fewer FLOPs for 100B tokens, demonstrating substantial efficiency gains.

Conclusion: The proposed framework makes SLM pretraining substantially more efficient, offering a practical and reproducible path for cost-efficient small language model development at scale, with all code released publicly.

Abstract: Small Language models (SLMs) offer an efficient and accessible alternative to Large Language Models (LLMs), delivering strong performance while using far fewer resources. We introduce a simple and effective framework for pretraining SLMs that brings together three complementary ideas. First, we identify structurally sparse sub-network initializations that consistently outperform randomly initialized models of similar size under the same compute budget. Second, we use evolutionary search to automatically discover high-quality sub-network initializations, providing better starting points for pretraining. Third, we apply knowledge distillation from larger teacher models to speed up training and improve generalization. Together, these components make SLM pretraining substantially more efficient: our best model, discovered using evolutionary search and initialized with LLM weights, matches the validation perplexity of a comparable Pythia SLM while requiring 5.16x and 1.26x fewer floating point operations for token budgets of 10B and 100B, respectively. We release all code publicly, offering a practical and reproducible path toward cost-efficient small language model development at scale.

[89] Agent Bain vs. Agent McKinsey: A New Text-to-SQL Benchmark for the Business Domain

Yue Li, Ran Tao, Derek Hommel, Yusuf Denizay Dönder, Sungyong Chang, David Mimno, Unso Eun Seo Jo

Main category: cs.CL

TL;DR: CORGI is a new text-to-SQL benchmark that expands beyond simple data access to include complex business queries requiring causal reasoning, forecasting, and recommendations, revealing LLM performance degradation as question complexity increases.

DetailsMotivation: Existing text-to-SQL benchmarks only test simple data access as translation tasks, but real-world users ask diverse questions requiring complex responses including predictions and recommendations. The business domain serves as a motivating example for practical database queries encountered by end users.

Method: CORGI is composed of synthetic databases inspired by enterprises (DoorDash, Airbnb, Lululemon) with questions across four increasingly complex categories: descriptive, explanatory, predictive, and recommendational. It introduces new evaluation methods for open-ended qualitative responses in data access tasks.

Result: LLM performance degrades on higher-level questions as complexity increases. LLMs exhibit an average 33.12% lower success execution rate (SER) on CORGI compared to existing benchmarks like BIRD, highlighting the substantially higher complexity of real-world business needs.

Conclusion: CORGI expands text-to-SQL to reflect practical database queries and calls for causal reasoning, temporal forecasting, and strategic recommendation, reflecting multi-level agentic intelligence. The benchmark encourages the community to consider new evaluation methods for open-ended responses in data access tasks.

Abstract: Text-to-SQL benchmarks have traditionally only tested simple data access as a translation task of natural language to SQL queries. But in reality, users tend to ask diverse questions that require more complex responses including data-driven predictions or recommendations. Using the business domain as a motivating example, we introduce CORGI, a new benchmark that expands text-to-SQL to reflect practical database queries encountered by end users. CORGI is composed of synthetic databases inspired by enterprises such as DoorDash, Airbnb, and Lululemon. It provides questions across four increasingly complicated categories of business queries: descriptive, explanatory, predictive, and recommendational. This challenge calls for causal reasoning, temporal forecasting, and strategic recommendation, reflecting multi-level and multi-step agentic intelligence. We find that LLM performance degrades on higher-level questions as question complexity increases. CORGI also introduces and encourages the text-to-SQL community to consider new automatic methods for evaluating open-ended, qualitative responses in data access tasks. Our experiments show that LLMs exhibit an average 33.12% lower success execution rate (SER) on CORGI compared to existing benchmarks such as BIRD, highlighting the substantially higher complexity of real-world business needs. We release the CORGI dataset, an evaluation framework, and a submission website to support future research.

[90] SOP-Maze: Evaluating Large Language Models on Complicated Business Standard Operating Procedures

Jiaming Wang, Zhe Tang, Zehao Jin, Hefei Chen, Yilin Jin, Peng Ding, Xiaoyu Li, Xuezhi Cao

Main category: cs.CL

TL;DR: SOP-Maze is a benchmark for evaluating LLMs on complex business Standard Operating Procedures (SOPs), featuring 397 instances across 23 real-world scenarios, revealing significant model limitations in procedural reasoning.

DetailsMotivation: Existing benchmarks don't adequately evaluate LLMs' ability to handle complex business SOPs involving multi-step procedures and real-world decision-making scenarios.

Method: Created SOP-Maze benchmark using real-world business data, comprising 397 instances and 3422 subtasks across 23 SOP scenarios. Categorized tasks into Lateral Root System (wide-option selection) and Heart Root System (deep logical reasoning with complex branches).

Result: Nearly all state-of-the-art models struggle with SOP-Maze. Identified three key error categories: route blindness (difficulty following procedures), conversational fragility (inability to handle dialogue nuances), and calculation errors (mistakes in time/arithmetic reasoning).

Conclusion: SOP-Maze reveals significant gaps in LLM capabilities for complex business procedures, providing new insights for improving model performance on tasks requiring both breadth and depth of reasoning.

Abstract: As large language models (LLMs) are widely deployed as domain-specific agents, many benchmarks have been proposed to evaluate their ability to follow instructions and make decisions in real-world scenarios. However, business scenarios often involve complex standard operating procedures (SOPs), and the evaluation of LLM capabilities in such contexts has not been fully explored. To bridge this gap, we propose SOP-Maze, a benchmark constructed from real-world business data and adapted into a collection of 397 instances and 3422 subtasks from 23 complex SOP scenarios. We further categorize SOP tasks into two broad classes: Lateral Root System (LRS), representing wide-option tasks that demand precise selection; and Heart Root System (HRS), which emphasizes deep logical reasoning with complex branches. Extensive experiments reveal that nearly all state-of-the-art models struggle with SOP-Maze. We conduct a comprehensive analysis and identify three key error categories: (i) route blindness: difficulty following procedures; (ii) conversational fragility: inability to handle real dialogue nuances; and (iii) calculation errors: mistakes in time or arithmetic reasoning under complex contexts. The systematic study explores LLM performance across SOP tasks that challenge both breadth and depth, offering new insights for improving model capabilities. We have open-sourced our work on: https://github.com/meituan-longcat/SOP-Maze.

[91] Investigating Retrieval-Augmented Generation Systems on Unanswerable, Uncheatable, Realistic, Multi-hop Queries

Gabrielle Kaili-May Liu, Bryan Li, Arman Cohan, William Gantt Walden, Eugene Yang

Main category: cs.CL

TL;DR: CRUMQs pipeline creates challenging, uncheatable, realistic, unanswerable, multi-hop queries for RAG benchmarks to better evaluate system limitations.

DetailsMotivation: Existing RAG benchmarks fail to reflect real-world complexity, allowing systems to cheat via disconnected reasoning or simple recall, limiting ability to uncover system weaknesses.

Method: Developed the first automatic pipeline for difficulty-controlled creation of CRUMQs (uncheatable, realistic, unanswerable, multi-hop queries), adaptable to any corpus and domain.

Result: CRUMQs are highly challenging for RAG systems, achieving up to 81.0% reduction in cheatability scores compared to prior benchmarks when tested on leading retrieval-augmented LLMs.

Conclusion: The pipeline provides a simple way to enhance benchmark difficulty and drive development of more capable RAG systems that can handle real-world query complexity.

Abstract: Real-world use cases often present RAG systems with complex queries for which relevant information is missing from the corpus or is incomplete. In these settings, RAG systems must be able to reject unanswerable, out-of-scope queries and identify failures of retrieval and multi-hop reasoning. Despite this, existing RAG benchmarks rarely reflect realistic task complexity for multi-hop or out-of-scope questions, which often can be cheated via disconnected reasoning (i.e., solved without genuine multi-hop inference) or require only simple factual recall. This limits the ability for such benchmarks to uncover limitations of existing RAG systems. To address this gap, we present the first pipeline for automatic, difficulty-controlled creation of uncheatable, realistic, unanswerable, and multi-hop queries (CRUMQs), adaptable to any corpus and domain. We use our pipeline to create CRUMQs over two popular RAG datasets and demonstrate its effectiveness via benchmark experiments on leading retrieval-augmented LLMs. Results show that compared to prior RAG benchmarks, CRUMQs are highly challenging for RAG systems and achieve up to 81.0% reduction in cheatability scores. More broadly, our pipeline offers a simple way to enhance benchmark difficulty and drive development of more capable RAG systems.

[92] Higher Satisfaction, Lower Cost: A Technical Report on How LLMs Revolutionize Meituan’s Intelligent Interaction Systems

Xuxin Cheng, Ke Zeng, Zhiquan Cao, Linyi Dai, Wenxuan Gao, Fei Han, Ai Jian, Feng Hong, Wenxing Hu, Zihe Huang, Dejian Kong, Jia Leng, Zhuoyuan Liao, Pei Liu, Jiaye Lin, Xing Ma, Jingqing Ruan, Jiaxing Song, Xiaoyu Tan, Ruixuan Xiao, Wenhui Yu, Wenyu Zhan, Haoxing Zhang, Chao Zhou, Hao Zhou, Shaodong Zheng, Ruinian Chen, Siyuan Chen, Ziyang Chen, Yiwen Dong, Yaoyou Fan, Yangyi Fang, Yang Gan, Shiguang Guo, Qi He, Chaowen Hu, Binghui Li, Dailin Li, Xiangyu Li, Yan Li, Chengjian Liu, Xiangfeng Liu, Jiahui Lv, Qiao Ma, Jiang Pan, Cong Qin, Chenxing Sun, Wen Sun, Zhonghui Wang, Abudukelimu Wuerkaixi, Xin Yang, Fangyi Yuan, Yawen Zhu, Tianyi Zhai, Jie Zhang, Runlai Zhang, Yao Xu, Yiran Zhao, Yifan Wang, Xunliang Cai, Yangen Hu, Cao Liu, Lu Pan, Xiaoli Wang, Bo Xiao, Wenyuan Yao, Qianlin Zhou, Benchang Zhu

Main category: cs.CL

TL;DR: WOWService is an industrial intelligent interaction system using LLMs and multi-agent architecture to overcome challenges in customer service automation, deployed on Meituan App with significant user satisfaction improvements.

DetailsMotivation: To address five key challenges in intelligent interaction systems: 1) difficulty in constructing high-quality training data for cold-start, 2) suboptimal multi-turn dialogue performance, 3) frequent business rule evolution affecting operability, 4) single LLM insufficiency in complex scenarios, and 5) lack of unified evaluation metrics for open-domain dialogues.

Method: WOWService integrates LLMs with multi-agent architectures for autonomous task management and collaborative problem-solving. Core modules include data construction, general capability enhancement, business scenario adaptation, multi-agent coordination, and automated evaluation.

Result: Deployed on the Meituan App with significant metric movements: User Satisfaction Metric 1 (USM 1) down 27.53% and User Satisfaction Metric 2 (USM 2) up 25.51%. The abstract does not define these metrics, but the directions suggest USM 1 tracks a negative outcome (e.g., complaints) and USM 2 a positive one (e.g., satisfaction).

Conclusion: WOWService effectively addresses industrial intelligent interaction challenges through LLM-multi-agent integration, demonstrating practical value in capturing user needs and advancing personalized service with measurable performance gains.

Abstract: Enhancing customer experience is essential for business success, particularly as service demands grow in scale and complexity. Generative artificial intelligence and Large Language Models (LLMs) have empowered intelligent interaction systems to deliver efficient, personalized, and 24/7 support. In practice, intelligent interaction systems encounter several challenges: (1) Constructing high-quality data for cold-start training is difficult, hindering self-evolution and raising labor costs. (2) Multi-turn dialogue performance remains suboptimal due to inadequate intent understanding, rule compliance, and solution extraction. (3) Frequent evolution of business rules affects system operability and transferability, constraining low-cost expansion and adaptability. (4) Reliance on a single LLM is insufficient in complex scenarios, where the absence of multi-agent frameworks and effective collaboration undermines process completeness and service quality. (5) The open-domain nature of multi-turn dialogues, lacking unified golden answers, hampers quantitative evaluation and continuous optimization. To address these challenges, we introduce WOWService, an intelligent interaction system tailored for industrial applications. With the integration of LLMs and multi-agent architectures, WOWService enables autonomous task management and collaborative problem-solving. Specifically, WOWService focuses on core modules including data construction, general capability enhancement, business scenario adaptation, multi-agent coordination, and automated evaluation. Currently, WOWService is deployed on the Meituan App, achieving significant gains in key metrics, e.g., User Satisfaction Metric 1 (USM 1) -27.53% and User Satisfaction Metric 2 (USM 2) +25.51%, demonstrating its effectiveness in capturing user needs and advancing personalized service.

[93] Structured yet Bounded Temporal Understanding in Large Language Models

Damin Zhang, Julia Rayz

Main category: cs.CL

TL;DR: LLMs adapt to different temporal frames of reference (deictic vs. sequential), producing systematic but distinct similarity patterns in temporal understanding.

DetailsMotivation: While LLMs show strong performance on temporal tasks, it's unclear how their behavior depends on how time is anchored in language (deictic vs. sequential framing). Understanding how LLMs organize temporal representations under different reference structures is important for improving their temporal reasoning capabilities.

Method: Study LLMs’ temporal understanding through temporal frames of reference (t-FoRs), contrasting deictic framing (past-present-future) and sequential framing (before-after). Use a large-scale dataset of real-world events from Wikidata and similarity judgement tasks to examine how LLMs’ outputs vary with temporal distance, interval relations, and event duration.

Result: LLMs systematically adapt to both t-FoRs but produce significantly different similarity patterns. Under deictic t-FoR: graded asymmetric structures centered on present, sharper decline for future events, higher variance in past. Under sequential t-FoR: similarity becomes strongly negative once events are temporally separated. Temporal judgements also shaped by interval algebra and duration, with instability in overlap/containment relations, and duration influencing only past events under deictic t-FoR.

Conclusion: The findings characterize how LLMs organize temporal representation under different reference structures and identify key factors (temporal distance, interval relations, duration) that most strongly shape their temporal understanding, revealing systematic patterns in how LLMs process temporal information differently based on framing.

Abstract: Large language models (LLMs) increasingly show strong performance on temporally grounded tasks, such as timeline construction, temporal question answering, and event ordering. However, it remains unclear how their behavior depends on the way time is anchored in language. In this work, we study LLMs’ temporal understanding through temporal frames of reference (t-FoRs), contrasting deictic framing (past-present-future) and sequential framing (before-after). Using a large-scale dataset of real-world events from Wikidata and a similarity judgement task, we examine how LLMs’ outputs vary with temporal distance, interval relations, and event duration. Our results show that LLMs systematically adapt to both t-FoRs, but the resulting similarity patterns differ significantly. Under deictic t-FoR, the similarity judgement scores form graded and asymmetric structures centered on the present, with sharper decline for future events and higher variance in the past. Under sequential t-FoR, similarity becomes strongly negative once events are temporally separated. Temporal judgements are also shaped by interval algebra and duration, with instability concentrated in overlap- and containment-based relations, and duration influencing only past events under deictic t-FoR. Overall, these findings characterize how LLMs organize temporal representation under different reference structures and identify the factors that most strongly shape their temporal understanding.

[94] Do Language Models Associate Sound with Meaning? A Multimodal Study of Sound Symbolism

Jinhong Jeong, Sunghyun Lee, Jaeyoung Lee, Seonah Han, Youngjae Yu

Main category: cs.CL

TL;DR: MLLMs show phonetic intuition aligning with linguistic research on sound symbolism, with phonosemantic attention patterns revealing focus on iconic phonemes across text and audio modalities.

DetailsMotivation: To investigate how Multimodal Large Language Models interpret auditory information through sound symbolism, bridging AI and cognitive linguistics with quantitative analysis of phonetic iconicity.

Method: Created LEX-ICON dataset with 8,052 words from 4 languages and 2,930 pseudo-words annotated with semantic features. Analyzed MLLMs’ performance on phonetic iconicity across textual (orthographic/IPA) and auditory inputs using phoneme-level attention fraction scores across 25 semantic dimensions.
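
One plausible reading of the attention fraction score is the share of a layer's attention mass that lands on the tokens of the target (iconic) phoneme; the sketch below implements that reading on random weights, and the paper's exact definition may differ:

```python
import torch

def attention_fraction(attn: torch.Tensor, target_positions: list[int]) -> float:
    """attn: (heads, seq, seq) attention weights for one layer."""
    mass_to_target = attn[:, :, target_positions].sum()
    return (mass_to_target / attn.sum()).item()

# Toy usage with random weights in place of a real model's attentions:
heads, seq = 8, 12
attn = torch.rand(heads, seq, seq)
attn = attn / attn.sum(dim=-1, keepdim=True)  # normalize rows like softmax
print(attention_fraction(attn, target_positions=[3, 4]))  # e.g., the /gl/ tokens
```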

Result: MLLMs demonstrate phonetic intuitions that align with existing linguistic research across multiple semantic dimensions, and show phonosemantic attention patterns highlighting focus on iconic phonemes.

Conclusion: First large-scale quantitative analysis of phonetic iconicity in MLLMs, bridging AI and cognitive linguistics, showing models can capture sound symbolism patterns similar to human linguistic intuition.

Abstract: Sound symbolism is a linguistic concept that refers to non-arbitrary associations between phonetic forms and their meanings. We suggest that this can be a compelling probe into how Multimodal Large Language Models (MLLMs) interpret auditory information in human languages. We investigate MLLMs’ performance on phonetic iconicity across textual (orthographic and IPA) and auditory forms of inputs with up to 25 semantic dimensions (e.g., sharp vs. round), observing models’ layer-wise information processing by measuring phoneme-level attention fraction scores. To this end, we present LEX-ICON, an extensive mimetic word dataset consisting of 8,052 words from four natural languages (English, French, Japanese, and Korean) and 2,930 systematically constructed pseudo-words, annotated with semantic features applied across both text and audio modalities. Our key findings demonstrate (1) MLLMs’ phonetic intuitions that align with existing linguistic research across multiple semantic dimensions and (2) phonosemantic attention patterns that highlight models’ focus on iconic phonemes. These results bridge domains of artificial intelligence and cognitive linguistics, providing the first large-scale, quantitative analyses of phonetic iconicity in terms of MLLMs’ interpretability.

[95] LaoBench: A Large-Scale Multidimensional Lao Benchmark for Large Language Models

Jian Gao, Richeng Xuan, Zhaolu Kang, Dingshi Liao, Wenxin Huang, Zongmou Huang, Yangdi Xu, Bowen Qin, Zheqi He, Xi Yang, Changjin Li, Yonghua Lin

Main category: cs.CL

TL;DR: LaoBench is the first large-scale benchmark for evaluating LLMs in Lao language, featuring 17,000+ samples across cultural knowledge, K12 education, and bilingual translation, with both open-source and held-out subsets for secure evaluation.

DetailsMotivation: There's a significant gap in LLM evaluation for low-resource Southeast Asian languages like Lao, despite rapid LLM advancements. Current benchmarks lack comprehensive assessment of Lao language understanding and reasoning capabilities.

Method: Created LaoBench using a hybrid pipeline combining expert authoring with agent-assisted verification. The benchmark includes 17,000+ expert-curated samples across three dimensions: culturally grounded knowledge, K12 curriculum alignment, and bilingual translation (Lao-Chinese-English). Features both open-source and held-out subsets, with the latter enabling secure black-box evaluation via controlled services.

Result: Evaluation of diverse state-of-the-art LLMs shows that even strong multilingual models significantly lag behind human experts, particularly in culturally grounded reasoning and translation fidelity. The benchmark reveals substantial performance gaps in Lao language capabilities.

Conclusion: LaoBench fills a critical gap in LLM evaluation for underrepresented languages and should catalyze research on Lao and other Southeast Asian languages for more inclusive multilingual AI development.

Abstract: The rapid advancement of large language models (LLMs) has not been matched by their evaluation in low-resource languages, especially Southeast Asian languages like Lao. To fill this gap, we introduce \textbf{LaoBench}, the first large-scale, high-quality, and multidimensional benchmark for assessing LLM language understanding and reasoning in Lao. LaoBench contains \textbf{17,000+} expert-curated samples across three dimensions: culturally grounded knowledge application, curriculum-aligned K12 education, and bilingual translation among Lao, Chinese, and English. It includes open-source and held-out subsets, where the held-out portion enables secure black-box evaluation via a controlled service to improve fairness and data security. We construct LaoBench with a hybrid pipeline that combines expert authoring with agent-assisted verification, ensuring linguistic accuracy, cultural relevance, and educational validity. We evaluate diverse state-of-the-art open-source and closed-source LLMs, and find that even strong multilingual models lag behind human experts, particularly in culturally grounded reasoning and translation fidelity. We hope LaoBench will catalyze research on Lao and other underrepresented Southeast Asian languages for more inclusive multilingual evaluation.

[96] Non-Linear Scoring Model for Translation Quality Evaluation

Serge Gladkoff, Lifeng Han, Katerina Gasova

Main category: cs.CL

TL;DR: A non-linear scoring model for translation quality evaluation that addresses biases in linear extrapolation across different sample sizes, using logarithmic error tolerance growth based on psychophysical principles.

DetailsMotivation: Traditional linear error-to-penalty scaling in analytic TQE based on MQM produces biased judgments across different sample sizes - over-penalizing short samples and under-penalizing long ones, creating misalignment with expert intuition and human perception.

Method: Proposes a calibrated two-parameter logarithmic model E(x) = a * ln(1 + b * x) where error tolerance grows logarithmically with sample size. The model is anchored to a reference tolerance and calibrated from two tolerance points using one-dimensional root-finding, building on the Multi-Range framework.
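
The calibration step follows directly from the summary: writing E(x1) = E1 and E(x2) = E2 and eliminating a leaves a one-dimensional root-finding problem in b, after which a = E1 / ln(1 + b * x1). A sketch with made-up tolerance points (the numbers below are illustrative, not from the paper):

```python
import math
from scipy.optimize import brentq

x1, E1 = 1000.0, 10.0    # reference sample: 10 errors tolerated per 1000 words
x2, E2 = 10000.0, 25.0   # larger sample: tolerance grows sub-linearly

def residual(b: float) -> float:
    # zero when E1 / E2 == ln(1 + b*x1) / ln(1 + b*x2)
    return E1 * math.log1p(b * x2) - E2 * math.log1p(b * x1)

b = brentq(residual, 1e-9, 1.0)   # bracket chosen for these toy numbers
a = E1 / math.log1p(b * x1)

for x in (500, 1000, 5000, 10000):
    print(x, round(a * math.log1p(b * x), 2))  # tolerance curve E(x)
```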

Result: Empirical data from three large-scale enterprise environments shows acceptable error counts grow logarithmically, not linearly, with sample size. The model yields explicit intervals where linear approximation stays within ±20% relative error and improves interpretability, fairness, and inter-rater reliability.

Conclusion: The non-linear scoring model provides a perceptually valid paradigm that advances TQE toward more accurate and scalable assessment, offering stronger basis for AI-based document-level evaluation aligned with human judgment, with implications for both human and AI-generated text evaluation.

Abstract: Analytic Translation Quality Evaluation (TQE), based on Multidimensional Quality Metrics (MQM), traditionally uses a linear error-to-penalty scale calibrated to a reference sample of 1000-2000 words. However, linear extrapolation biases judgment on samples of different sizes, over-penalizing short samples and under-penalizing long ones, producing misalignment with expert intuition. Building on the Multi-Range framework, this paper presents a calibrated, non-linear scoring model that better reflects how human content consumers perceive translation quality across samples of varying length. Empirical data from three large-scale enterprise environments shows that acceptable error counts grow logarithmically, not linearly, with sample size. Psychophysical and cognitive evidence, including the Weber-Fechner law and Cognitive Load Theory, supports this premise by explaining why the perceptual impact of additional errors diminishes while the cognitive burden grows with scale. We propose a two-parameter model E(x) = a * ln(1 + b * x), a, b > 0, anchored to a reference tolerance and calibrated from two tolerance points using a one-dimensional root-finding step. The model yields an explicit interval within which the linear approximation stays within +/-20 percent relative error and integrates into existing evaluation workflows with only a dynamic tolerance function added. The approach improves interpretability, fairness, and inter-rater reliability across both human and AI-generated translations. By operationalizing a perceptually valid scoring paradigm, it advances translation quality evaluation toward more accurate and scalable assessment. The model also provides a stronger basis for AI-based document-level evaluation aligned with human judgment. Implementation considerations for CAT/LQA systems and implications for human and AI-generated text evaluation are discussed.

[97] Bench360: Benchmarking Local LLM Inference from 360 Degrees

Linus Stuhlmann, Mauricio Fadel Argerich, Jonathan Fürst

Main category: cs.CL

TL;DR: Bench360 is a comprehensive framework for evaluating local LLM inference across models, quantization, engines, and serving scenarios, showing substantial tradeoffs between efficiency and quality with no universal best option.

DetailsMotivation: Users face complex design choices when running LLMs locally (models, quantization, inference engines, serving scenarios), but existing benchmarks are fragmented and offer little practical guidance for real deployments.

Method: Bench360 framework supports custom tasks, integrates multiple inference engines and quantization formats, and reports both task quality and system metrics (latency, throughput, energy, startup time). Demonstrated on four NLP tasks across three GPUs and four engines.
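
A minimal latency/throughput probe in the spirit of the system-metrics side; generate() is a placeholder for a real inference-engine call, and energy and startup time, which Bench360 also reports, are omitted here:

```python
import time

def generate(prompt: str) -> str:
    time.sleep(0.05)       # stand-in for engine latency
    return prompt[::-1]    # stand-in for model output

prompts = ["summarize this", "translate that"] * 10
start = time.perf_counter()
outputs = [generate(p) for p in prompts]
elapsed = time.perf_counter() - start
print(f"latency/request: {elapsed / len(prompts) * 1000:.1f} ms")
print(f"throughput: {len(prompts) / elapsed:.1f} req/s")
```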

Result: Results show substantial tradeoffs between efficiency and output quality, with configuration choices depending on specific workloads and constraints. No universal best option exists.

Conclusion: There is a need for comprehensive, deployment-oriented benchmarks like Bench360 to guide practical local LLM deployments, as design choices significantly impact efficiency and quality with no one-size-fits-all solution.

Abstract: Running LLMs locally has become increasingly common, but users face a complex design space across models, quantization levels, inference engines, and serving scenarios. Existing inference benchmarks are fragmented and focus on isolated goals, offering little guidance for practical deployments. We present Bench360, a framework for evaluating local LLM inference across tasks, usage patterns, and system metrics in one place. Bench360 supports custom tasks, integrates multiple inference engines and quantization formats, and reports both task quality and system behavior (latency, throughput, energy, startup time). We demonstrate it on four NLP tasks across three GPUs and four engines, showing how design choices shape efficiency and output quality. Results confirm that tradeoffs are substantial and configuration choices depend on specific workloads and constraints. There is no universal best option, underscoring the need for comprehensive, deployment-oriented benchmarks.

[98] Sentiment Analysis Of Shopee Product Reviews Using Distilbert

Zahri Aksa Dautd, Aviv Yuniar Rahman

Main category: cs.CL

TL;DR: DistilBERT achieves 94.8% accuracy for sentiment analysis on Shopee reviews, slightly below BERT while cutting computation time by more than 55%, offering an optimal balance for large-scale e-commerce applications.

DetailsMotivation: The massive volume of daily Shopee reviews requires automated sentiment analysis since manual processing is inefficient. There's a need for computational approaches to extract valuable customer satisfaction and preference information from millions of reviews.

Method: Used DistilBERT (distilbert-base-uncased) for sentiment classification on approximately one million preprocessed English-language Shopee reviews. Compared performance against benchmark models BERT and SVM using accuracy, precision, recall, and F1-score metrics.
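
A minimal inference sketch with the named distilbert-base-uncased checkpoint; the three-way label scheme is an assumption, and the freshly initialized classification head would need fine-tuning on the review data before its predictions are meaningful:

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=3  # e.g., negative / neutral / positive
)

reviews = ["Fast shipping and great quality!", "Item arrived broken."]
batch = tokenizer(reviews, padding=True, truncation=True, return_tensors="pt")
with torch.no_grad():
    preds = model(**batch).logits.argmax(dim=-1)
print(preds.tolist())  # class indices; map to label names after fine-tuning
```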

Result: DistilBERT achieved 94.8% accuracy, slightly below BERT (95.3%) but significantly higher than SVM (90.2%). Computation time was reduced by more than 55% compared to BERT, demonstrating superior efficiency.

Conclusion: DistilBERT provides an optimal balance between accuracy and efficiency for large-scale sentiment analysis on e-commerce platforms, making it suitable for processing massive volumes of consumer reviews while maintaining competitive performance.

Abstract: The rapid growth of digital commerce has led to the accumulation of a massive number of consumer reviews on online platforms. Shopee, as one of the largest e-commerce platforms in Southeast Asia, receives millions of product reviews every day containing valuable information regarding customer satisfaction and preferences. Manual analysis of these reviews is inefficient, thus requiring a computational approach such as sentiment analysis. This study examines the use of DistilBERT, a lightweight transformer-based deep learning model, for sentiment classification on Shopee product reviews. The dataset used consists of approximately one million English-language reviews that have been preprocessed and trained using the distilbert-base-uncased model. Evaluation was conducted using accuracy, precision, recall, and F1-score metrics, and compared against benchmark models such as BERT and SVM. The results show that DistilBERT achieved an accuracy of 94.8%, slightly below BERT (95.3%) but significantly higher than SVM (90.2%), with computation time reduced by more than 55%. These findings demonstrate that DistilBERT provides an optimal balance between accuracy and efficiency, making it suitable for large scale sentiment analysis on e-commerce platforms. Keywords: Sentiment Analysis, DistilBERT, Shopee Reviews, Natural Language Processing, Deep Learning, Transformer Models.

[99] Collaborative Causal Sensemaking: Closing the Complementarity Gap in Human-AI Decision Support

Raunak Jain, Mudita Khurana

Main category: cs.CL

TL;DR: The paper proposes Collaborative Causal Sensemaking (CCS) as a new research agenda to develop LLM-based agents as collaborative partners rather than just answer engines, aiming to close the complementarity gap in human-AI teams for high-stakes decision-making.

DetailsMotivation: Current LLM-based agents deployed for expert decision support fail to reliably outperform the best individual in high-stakes settings. This complementarity gap exists because agents are trained as answer engines rather than as partners in collaborative sensemaking, which is how experts actually make decisions through co-constructing explanations, surfacing uncertainties, and adapting goals.

Method: The paper proposes Collaborative Causal Sensemaking (CCS) as a research agenda with three key components: 1) New training environments that reward collaborative thinking, 2) Representations for shared human-AI mental models, and 3) Evaluation centered on trust and complementarity rather than just accuracy.

Result: The paper presents a conceptual framework and research agenda rather than empirical results. The proposed CCS approach aims to shift multi-agent systems research from building oracle-like answer engines to cultivating AI teammates that co-reason with human partners over the causal structure of shared decisions.

Conclusion: Developing collaborative causal sensemaking capabilities in LLM-based agents is essential for advancing effective human-AI teams in high-stakes settings. This requires fundamentally rethinking agent training, representation, and evaluation to focus on partnership and co-reasoning rather than just answer generation.

Abstract: LLM-based agents are increasingly deployed for expert decision support, yet human-AI teams in high-stakes settings do not yet reliably outperform the best individual. We argue this complementarity gap reflects a fundamental mismatch: current agents are trained as answer engines, not as partners in the collaborative sensemaking through which experts actually make decisions. Sensemaking (the ability to co-construct causal explanations, surface uncertainties, and adapt goals) is the key capability that current training pipelines do not explicitly develop or evaluate. We propose Collaborative Causal Sensemaking (CCS) as a research agenda to develop this capability from the ground up, spanning new training environments that reward collaborative thinking, representations for shared human-AI mental models, and evaluation centred on trust and complementarity. Taken together, these directions shift MAS research from building oracle-like answer engines to cultivating AI teammates that co-reason with their human partners over the causal structure of shared decisions, advancing the design of effective human-AI teams.

[100] Navigating the Reality Gap: Privacy-Preserving On-Device Continual Adaptation of ASR for Clinical Telephony

Darshil Chauhan, Adityasinh Solanki, Vansh Patel, Kanav Kapoor, Ritvik Jain, Aditya Bansal, Pratik Narang, Dhruv Kumar

Main category: cs.CL

TL;DR: On-device continual adaptation with LoRA and experience replay reduces ASR error rates for clinical telephony speech in resource-constrained regions, addressing the “reality gap” between lab performance and real-world clinical audio.

DetailsMotivation: ASR could greatly assist clinical documentation in resource-constrained regions, but deployment is hindered by a severe "reality gap" between lab performance and noisy real-world clinical audio, plus privacy/resource constraints. Current models degrade significantly (up to 40.94% WER) on rural clinical telephony speech.

Method: On-device continual adaptation framework using Low-Rank Adaptation (LoRA) without transmitting raw patient data. Investigated stabilization strategies including multi-domain Experience Replay (ER) and stabilized importance estimation (Absolute Fisher) to handle high-variance gradients in clinical speech.
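
Attaching LoRA adapters for this kind of low-footprint adaptation takes only a few lines with the peft library; the checkpoint below is a public wav2vec2 stand-in for IndicWav2Vec, the target modules and rank are illustrative, and experience replay would additionally mix buffered source-domain batches into each adaptation step:

```python
from peft import LoraConfig, get_peft_model
from transformers import Wav2Vec2ForCTC

model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-960h")  # stand-in backbone
lora = LoraConfig(
    r=8, lora_alpha=16, lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # attention projections only (assumed)
)
model = get_peft_model(model, lora)
model.print_trainable_parameters()  # tiny trainable fraction -> feasible on-device
```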

Result: Multi-domain Experience Replay achieved 17.1% relative improvement in target WER and reduced catastrophic forgetting by 55% compared to naive adaptation. Acoustic adaptation proved essential for healthcare usability, not bypassable by language models alone.

Conclusion: On-device continual adaptation with stabilization strategies effectively bridges the reality gap for clinical ASR in resource-constrained settings, enabling privacy-preserving deployment while maintaining performance across diverse clinical telephony speech.

Abstract: Automatic Speech Recognition (ASR) holds immense potential to assist in clinical documentation and patient report generation, particularly in resource-constrained regions. However, deployment is currently hindered by a technical deadlock: a severe “Reality Gap” between laboratory performance and noisy, real-world clinical audio, coupled with strict privacy and resource constraints. Such adaptation is essential for clinical telephony systems, where patient speech is highly variable and transcription errors can directly impact downstream clinical workflows. We quantify this gap, showing that a robust multilingual model (IndicWav2Vec) degrades up to a 40.94% WER on rural clinical telephony speech from India, rendering it unusable. We demonstrate consistent improvements on these helpline interactions without transmitting raw patient data off-device via an on-device continual adaptation framework using Low-Rank Adaptation (LoRA). We conduct an investigative study of stabilization strategies, characterizing the trade-offs between data-driven and parameter-driven approaches. Our results demonstrate that multi-domain Experience Replay (ER) yields the primary performance gains, achieving a 17.1% relative improvement in target WER and reducing catastrophic forgetting by 55% compared to naive adaptation. Furthermore, we investigate a stabilized importance estimation strategy (Absolute Fisher) to ensure robust convergence against the high-variance gradients common in clinical telephony speech. Finally, we verify via a domain-specific spot check that acoustic adaptation is a fundamental prerequisite for usability in healthcare settings which cannot be bypassed by language models alone.

[101] Stuttering-Aware Automatic Speech Recognition for Indonesian Language

Fadhil Muhammad, Alwin Djuliansah, Adrian Aryaputra Hamzah, Kurniawati Azizah

Main category: cs.CL

TL;DR: Proposed synthetic data augmentation framework for Indonesian stuttered speech recognition using text transformations and TTS to fine-tune Whisper models, improving performance on dysfluent speech without degrading fluent speech recognition.

DetailsMotivation: ASR systems perform poorly on stuttered speech, especially for low-resource languages like Indonesian where specialized datasets are scarce, creating accessibility barriers for people with speech disorders.

Method: Data augmentation framework that generates synthetic stuttered audio by injecting repetitions and prolongations into fluent text using rule-based transformations and LLMs, followed by TTS synthesis. This synthetic data is used to fine-tune a pre-trained Indonesian Whisper model via transfer learning.
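
The rule-based half of the augmentation can be sketched as word-level repetition and prolongation edits; the rules, probabilities, and Indonesian example below are illustrative, and the paper additionally uses LLM transformations plus TTS to turn such text into audio:

```python
import random

def prolong_first_vowel(word: str) -> str:
    for i, ch in enumerate(word):
        if ch.lower() in "aiueo":
            return word[:i] + ch * 4 + word[i + 1:]  # "saya" -> "saaaaya"
    return word

def inject_stutter(text: str, p_rep: float = 0.15, p_prolong: float = 0.10) -> str:
    out = []
    for word in text.split():
        r = random.random()
        if r < p_rep:                      # part-word repetition: "me- memesan"
            out.append(f"{word[:2]}- {word}")
        elif r < p_rep + p_prolong:        # prolongation of the first vowel
            out.append(prolong_first_vowel(word))
        else:
            out.append(word)
    return " ".join(out)

random.seed(7)
print(inject_stutter("saya ingin memesan obat untuk besok pagi"))
```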

Result: Experiments show consistent reduction in recognition errors on stuttered speech while maintaining performance on fluent segments, validating synthetic data pipelines for inclusive speech technologies in under-represented languages.

Conclusion: Targeted synthetic data exposure enables ASR systems to adapt to dysfluent acoustic patterns without requiring large-scale real-world recordings, offering a practical solution for developing more inclusive speech technologies in low-resource language contexts.

Abstract: Automatic speech recognition systems have achieved remarkable performance on fluent speech but continue to degrade significantly when processing stuttered speech, a limitation that is particularly acute for low-resource languages like Indonesian where specialized datasets are virtually non-existent. To overcome this scarcity, we propose a data augmentation framework that generates synthetic stuttered audio by injecting repetitions and prolongations into fluent text through a combination of rule-based transformations and large language models followed by text-to-speech synthesis. We apply this synthetic data to fine-tune a pre-trained Indonesian Whisper model using transfer learning, enabling the architecture to adapt to dysfluent acoustic patterns without requiring large-scale real-world recordings. Our experiments demonstrate that this targeted synthetic exposure consistently reduces recognition errors on stuttered speech while maintaining performance on fluent segments, validating the utility of synthetic data pipelines for developing more inclusive speech technologies in under-represented languages.

[102] Afri-MCQA: Multimodal Cultural Question Answering for African Languages

Atnafu Lambebo Tonja, Srija Anand, Emilio Villa-Cueva, Israel Abebe Azime, Jesujoba Oluwadara Alabi, Muhidin A. Mohamed, Debela Desalegn Yadeta, Negasi Haile Abadi, Abigail Oppong, Nnaemeka Casmir Obiefuna, Idris Abdulmumin, Naome A Etori, Eric Peter Wairagala, Kanda Patrick Tshinu, Imanigirimbabazi Emmanuel, Gabofetswe Malema, Alham Fikri Aji, David Ifeoluwa Adelani, Thamar Solorio

Main category: cs.CL

TL;DR: Afri-MCQA is the first multilingual cultural QA benchmark covering 15 African languages with 7.5k Q&A pairs, created by native speakers, revealing poor LLM performance on African languages and speech modalities.

DetailsMotivation: Africa has over one-third of the world's languages but is underrepresented in AI research, creating a need for culturally grounded benchmarks to evaluate AI systems on African languages and cultural knowledge.

Method: Created Afri-MCQA benchmark with 7.5k Q&A pairs across 15 African languages from 12 countries, featuring parallel English-African language pairs across text and speech modalities, entirely created by native speakers. Includes control experiments to separate linguistic competence from cultural knowledge.

Result: Open-weight LLMs perform poorly across evaluated cultures, with near-zero accuracy on open-ended VQA when queried in native language or speech. Significant performance gaps exist between native languages and English for both text and speech modalities.

Conclusion: The findings highlight the need for speech-first approaches, culturally grounded pretraining, and cross-lingual cultural transfer. The benchmark is released to support more inclusive multimodal AI development in African languages.

Abstract: Africa is home to over one-third of the world’s languages, yet remains underrepresented in AI research. We introduce Afri-MCQA, the first Multilingual Cultural Question-Answering benchmark covering 7.5k Q&A pairs across 15 African languages from 12 countries. The benchmark offers parallel English-African language Q&A pairs across text and speech modalities and was entirely created by native speakers. Benchmarking large language models (LLMs) on Afri-MCQA shows that open-weight models perform poorly across evaluated cultures, with near-zero accuracy on open-ended VQA when queried in native language or speech. To evaluate linguistic competence, we include control experiments meant to assess this specific aspect separate from cultural knowledge, and we observe significant performance gaps between native languages and English for both text and speech. These findings underscore the need for speech-first approaches, culturally grounded pretraining, and cross-lingual cultural transfer. To support more inclusive multimodal AI development in African languages, we release our Afri-MCQA under academic license or CC BY-NC 4.0 on HuggingFace (https://huggingface.co/datasets/Atnafu/Afri-MCQA)

[103] TeleMem: Building Long-Term and Multimodal Memory for Agentic AI

Chunliang Chen, Ming Guan, Xiao Lin, Jiaxu Li, Qiyi Wang, Xiangyu Chen, Jixiang Luo, Changzhi Sun, Dell Zhang, Xuelong Li

Main category: cs.CL

TL;DR: TeleMem is a unified long-term multimodal memory system for LLMs that improves dialogue coherence, reduces hallucinations, and enables efficient video understanding through narrative extraction, structured writing, and ReAct-style reasoning.

DetailsMotivation: LLMs struggle with long-term interactions due to limited attention over extended dialogue histories. Existing RAG approaches lack reliable memory update mechanisms, leading to schema-driven hallucinations, inefficient write operations, and poor multimodal reasoning support.

Method: TeleMem features: 1) Narrative dynamic extraction for dialogue-grounded user profiles, 2) Structured writing pipeline that batches, retrieves, clusters, and consolidates memory entries, 3) Multimodal memory module with ReAct-style reasoning (observe-think-act process) for video understanding.
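
A schematic of the batch-retrieve-cluster-consolidate write path described above; the cosine retrieval and string-joining consolidation are stand-ins for embedding search and an LLM merge call, so the whole flow is an assumption-laden sketch, not TeleMem's implementation:

```python
import math
from dataclasses import dataclass

@dataclass
class MemoryEntry:
    text: str
    embedding: list[float]

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb + 1e-9)

def retrieve(entry: MemoryEntry, store: list[MemoryEntry], k: int = 5) -> list[MemoryEntry]:
    return sorted(store, key=lambda m: cosine(entry.embedding, m.embedding),
                  reverse=True)[:k]

def consolidate(cluster: list[MemoryEntry]) -> MemoryEntry:
    # stand-in for an LLM call that merges overlapping or outdated facts
    return MemoryEntry(" | ".join(m.text for m in cluster), cluster[0].embedding)

def write_batch(new_entries: list[MemoryEntry], store: list[MemoryEntry]) -> None:
    for entry in new_entries:                      # 1) entries arrive in a batch
        neighbors = retrieve(entry, store)         # 2) retrieve related memories
        merged = consolidate([entry] + neighbors)  # 3) cluster, 4) consolidate
        for old in neighbors:
            store.remove(old)
        store.append(merged)

store = [MemoryEntry("user likes espresso", [1.0, 0.0])]
write_batch([MemoryEntry("user switched to decaf", [0.9, 0.1])], store)
print(store[0].text)
```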

Result: Outperforms Mem0 baseline with 19% higher accuracy, 43% fewer tokens, and 2.1x speedup on ZH-4O long-term role-play gaming benchmark. Improves storage efficiency, reduces token usage, and accelerates memory operations.

Conclusion: TeleMem provides an effective solution for long-term multimodal memory in LLMs, addressing key limitations of current RAG systems through coherent narrative extraction, efficient structured writing, and closed-loop multimodal reasoning.

Abstract: Large language models (LLMs) excel at many NLP tasks but struggle to sustain long-term interactions due to limited attention over extended dialogue histories. Retrieval-augmented generation (RAG) mitigates this issue but lacks reliable mechanisms for updating or refining stored memories, leading to schema-driven hallucinations, inefficient write operations, and minimal support for multimodal reasoning. To address these challenges, we propose TeleMem, a unified long-term and multimodal memory system that maintains coherent user profiles through narrative dynamic extraction, ensuring that only dialogue-grounded information is preserved. TeleMem further introduces a structured writing pipeline that batches, retrieves, clusters, and consolidates memory entries, substantially improving storage efficiency, reducing token usage, and accelerating memory operations. Additionally, a multimodal memory module combined with ReAct-style reasoning equips the system with a closed-loop observe, think, and act process that enables accurate understanding of complex video content in long-term contexts. Experimental results show that TeleMem surpasses the state-of-the-art Mem0 baseline with 19% higher accuracy, 43% fewer tokens, and a 2.1x speedup on the ZH-4O long-term role-play gaming benchmark.
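As a concrete illustration of the structured writing pipeline above, here is a minimal Python sketch of a batch-retrieve-cluster-consolidate memory write, with a toy bag-of-words embedding standing in for a real encoder; all names and thresholds are assumptions for illustration, not TeleMem's implementation.

```python
# Sketch of a batch -> retrieve -> cluster -> consolidate memory write.
from collections import Counter
import math

def embed(text: str) -> Counter:
    """Toy bag-of-words embedding (stand-in for a real encoder)."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

class MemoryStore:
    def __init__(self, merge_threshold: float = 0.6):
        self.entries: list[str] = []
        self.merge_threshold = merge_threshold

    def write_batch(self, new_entries: list[str]) -> None:
        """Batch new entries, retrieve similar stored ones, and consolidate
        near-duplicates instead of appending blindly."""
        for entry in new_entries:
            e = embed(entry)
            # Retrieve the closest stored entry (the "cluster" step,
            # collapsed to nearest-neighbour for brevity).
            best_i, best_sim = -1, 0.0
            for i, stored in enumerate(self.entries):
                sim = cosine(e, embed(stored))
                if sim > best_sim:
                    best_i, best_sim = i, sim
            if best_sim >= self.merge_threshold:
                # Consolidate: keep the longer, more specific entry.
                if len(entry) > len(self.entries[best_i]):
                    self.entries[best_i] = entry
            else:
                self.entries.append(entry)

store = MemoryStore()
store.write_batch(["user likes sci-fi movies", "user likes sci-fi films a lot"])
print(store.entries)  # near-duplicates consolidated into one entry
```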

[104] Controlled Self-Evolution for Algorithmic Code Optimization

Tu Hu, Ronghao Chen, Shuo Zhang, Jianghao Yin, Mou Xiao Feng, Jingping Liu, Shaolei Zhang, Wenqi Jiang, Yuqi Fang, Sen Hu, Yi Xu, Huacan Wang

Main category: cs.CL

TL;DR: CSE (Controlled Self-Evolution) improves code generation efficiency through diversified initialization, feedback-guided genetic evolution, and hierarchical memory, outperforming existing self-evolution methods on EffiBench-X.

DetailsMotivation: Existing self-evolution methods for code generation suffer from low exploration efficiency due to initialization bias trapping evolution in poor solution regions, uncontrolled stochastic operations lacking feedback guidance, and insufficient experience utilization across tasks.

Method: CSE consists of three components: 1) Diversified Planning Initialization generates structurally distinct algorithmic strategies for broad solution space coverage; 2) Genetic Evolution replaces stochastic operations with feedback-guided mechanisms for targeted mutation and compositional crossover; 3) Hierarchical Evolution Memory captures both successful and failed experiences at inter-task and intra-task levels.

Result: Experiments on EffiBench-X show CSE consistently outperforms all baselines across various LLM backbones, achieves higher efficiency from early generations, and maintains continuous improvement throughout evolution.

Conclusion: CSE addresses key bottlenecks in self-evolution methods for code generation by introducing controlled mechanisms for initialization, evolution operations, and experience utilization, leading to superior performance and efficiency.

Abstract: Self-evolution methods enhance code generation through iterative “generate-verify-refine” cycles, yet existing approaches suffer from low exploration efficiency, failing to discover solutions with superior complexity within limited budgets. This inefficiency stems from initialization bias trapping evolution in poor solution regions, uncontrolled stochastic operations lacking feedback guidance, and insufficient experience utilization across tasks. To address these bottlenecks, we propose Controlled Self-Evolution (CSE), which consists of three key components. Diversified Planning Initialization generates structurally distinct algorithmic strategies for broad solution space coverage. Genetic Evolution replaces stochastic operations with feedback-guided mechanisms, enabling targeted mutation and compositional crossover. Hierarchical Evolution Memory captures both successful and failed experiences at inter-task and intra-task levels. Experiments on EffiBench-X demonstrate that CSE consistently outperforms all baselines across various LLM backbones. Furthermore, CSE achieves higher efficiency from early generations and maintains continuous improvement throughout evolution. Our code is publicly available at https://github.com/QuantaAlpha/EvoControl.
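The feedback-guided evolution loop can be made concrete with a small sketch. Below, a toy "candidate" is a vector of knobs and fitness is a synthetic function; in CSE the candidates are programs and fitness comes from efficiency tests. The targeted-mutation and fitter-donor crossover rules are illustrative stand-ins for the paper's operators.

```python
# Sketch of feedback-guided selection, mutation, and crossover.
import random

random.seed(0)

def fitness(candidate: list[float]) -> float:
    # Stand-in for running efficiency benchmarks on generated code.
    return -sum((x - 0.5) ** 2 for x in candidate)

def mutate(candidate: list[float], step: float) -> list[float]:
    # "Targeted" mutation: perturb the single worst coordinate, using
    # feedback (its distance from the optimum) rather than pure noise.
    worst = max(range(len(candidate)), key=lambda i: abs(candidate[i] - 0.5))
    child = candidate[:]
    child[worst] += random.uniform(-step, step)
    return child

def crossover(a: list[float], b: list[float]) -> list[float]:
    # Compositional crossover: take each coordinate from the fitter donor.
    return [ai if abs(ai - 0.5) < abs(bi - 0.5) else bi for ai, bi in zip(a, b)]

# Diversified initialization: structurally distinct starting points.
population = [[random.random() for _ in range(4)] for _ in range(8)]
for generation in range(30):
    population.sort(key=fitness, reverse=True)
    parents = population[:4]
    children = [mutate(p, step=0.2) for p in parents]
    children += [crossover(parents[0], p) for p in parents[1:]]
    population = parents + children

population.sort(key=fitness, reverse=True)
print(round(fitness(population[0]), 4))  # approaches 0 as evolution converges
```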

[105] DYCP: Dynamic Context Pruning for Long-Form Dialogue with LLMs

Nayoung Choi, Jonathan Zhang, Jinho D. Choi

Main category: cs.CL

TL;DR: DyCP is a lightweight context management method that dynamically segments and retrieves relevant memory at query time to improve LLM performance in long dialogues.

DetailsMotivation: LLMs suffer from increased latency and degraded answer quality as dialogue length grows, and existing memory methods either require extra LLM calls or perform offline memory construction without considering current user utterances, leading to inefficiencies and disrupted conversational continuity.

Method: DyCP dynamically segments and retrieves relevant memory at query time, preserving the sequential structure of dialogue without predefined topic boundaries, and supports efficient, adaptive context retrieval.

Result: Across three long-form dialogue benchmarks (LoCoMo, MT-Bench+, and SCM4LLMs) and multiple LLMs, DyCP consistently improves answer quality while reducing response latency.

Conclusion: Despite modern LLMs having expanded context windows, there remains a gap between their theoretical capacity and actual long-context processing ability, highlighting the continued importance of effective context management methods like DyCP.

Abstract: Large Language Models (LLMs) often exhibit increased response latency and degraded answer quality as dialogue length grows, making effective context management essential. However, existing methods rely on extra LLM calls to build memory or perform offline memory construction without considering the current user utterance, which can introduce inefficiencies or disrupt conversational continuity. We introduce DyCP, a lightweight context management method that dynamically segments and retrieves relevant memory at query time. It preserves the sequential structure of dialogue without predefined topic boundaries and supports efficient, adaptive context retrieval. Across three long-form dialogue benchmarks, LoCoMo, MT-Bench+, and SCM4LLMs, and multiple LLMs, DyCP consistently improves answer quality while reducing response latency. We also examine the gap between modern LLMs’ expanded context windows and their actual long-context processing capacity, highlighting the continued importance of effective context management.
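A minimal sketch of query-time segmentation and retrieval follows, under heavy simplifying assumptions: Jaccard word overlap stands in for learned representations, and the boundary threshold is arbitrary rather than anything from the paper.

```python
# Sketch of query-time context pruning: segment on the fly, keep what matters.
def jaccard(a: str, b: str) -> float:
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

def segment(turns: list[str], boundary_sim: float = 0.1) -> list[list[str]]:
    """Start a new segment whenever consecutive turns share little overlap,
    preserving the sequential structure of the dialogue."""
    segments = [[turns[0]]]
    for prev, cur in zip(turns, turns[1:]):
        if jaccard(prev, cur) < boundary_sim:
            segments.append([cur])
        else:
            segments[-1].append(cur)
    return segments

def prune_context(turns: list[str], query: str, k: int = 1) -> list[str]:
    segments = segment(turns)
    scored = sorted(segments, key=lambda s: jaccard(" ".join(s), query),
                    reverse=True)
    kept = scored[:k]
    # Re-emit kept segments in their original order.
    return [t for s in segments if s in kept for t in s]

dialogue = [
    "I adopted a cat last week",
    "the cat sleeps all day",
    "by the way my flight to Tokyo is on Friday",
    "the flight leaves at 9am",
]
print(prune_context(dialogue, query="when does my flight leave?"))
# -> only the flight-related segment survives pruning
```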

[106] To Retrieve or To Think? An Agentic Approach for Context Evolution

Rubing Chen, Jian Wang, Wenjie Li, Xiao-Yong Wei, Qing Li

Main category: cs.CL

TL;DR: ACE is a framework that dynamically decides when to retrieve external knowledge vs. reason with existing context, reducing unnecessary retrieval and improving performance on complex QA tasks.

DetailsMotivation: Current retrieval-augmented methods use brute-force retrieval at every step, causing computational inefficiency and performance degradation due to irrelevant noise in the context.

Method: Agentic Context Evolution (ACE) uses a central orchestrator agent with majority voting to strategically decide between activating a retriever agent for external knowledge or a reasoner agent for internal analysis and refinement.

Result: ACE significantly outperforms competitive baselines in accuracy on challenging multi-hop QA benchmarks while achieving efficient token consumption by eliminating redundant retrieval steps.

Conclusion: ACE provides valuable insights for advancing context-evolved generation for complex, knowledge-intensive tasks by introducing dynamic, strategic decision-making about when to retrieve vs. reason.

Abstract: Current context augmentation methods, such as retrieval-augmented generation, are essential for solving knowledge-intensive reasoning tasks. However, they typically adhere to a rigid, brute-force strategy that executes retrieval at every step. This indiscriminate approach not only incurs unnecessary computational costs but also degrades performance by saturating the context with irrelevant noise. To address these limitations, we introduce Agentic Context Evolution (ACE), a framework inspired by human metacognition that dynamically determines whether to seek new evidence or reason with existing knowledge. ACE employs a central orchestrator agent to make decisions strategically via majority voting. At each step, it chooses between activating a retriever agent for external retrieval and a reasoner agent for internal analysis and refinement. By eliminating redundant retrieval steps, ACE maintains a concise and evolved context. Extensive experiments on challenging multi-hop QA benchmarks demonstrate that ACE significantly outperforms competitive baselines in accuracy while achieving efficient token consumption. Our work provides valuable insights into advancing context-evolved generation for complex, knowledge-intensive tasks.
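A minimal sketch of the retrieve-or-reason decision via majority voting; the heuristic voter below is a stand-in for the orchestrator's LLM calls, and every name and probability rule is an assumption for illustration.

```python
# Sketch of an orchestrator deciding "retrieve" vs. "reason" by majority vote.
from collections import Counter
import random

random.seed(1)

def cast_vote(question: str, context: list[str]) -> str:
    ctx_words = set(" ".join(context).lower().replace(".", " ").split())
    q_words = [w.strip("?.,!") for w in question.lower().split()]
    unknown = [w for w in q_words if w not in ctx_words]
    # Noisy heuristic vote: lean toward retrieval when much is uncovered.
    p_retrieve = len(unknown) / max(len(q_words), 1)
    return "retrieve" if random.random() < p_retrieve else "reason"

def orchestrate(question: str, context: list[str], n_votes: int = 5) -> str:
    votes = Counter(cast_vote(question, context) for _ in range(n_votes))
    return votes.most_common(1)[0][0]

context = ["Marie Curie won the Nobel Prize in Physics in 1903."]
print(orchestrate("Who won the Nobel Prize in Physics in 1903?", context))
# -> "reason": the working context already covers the question
print(orchestrate("Which university did her daughter attend?", context))
# -> "retrieve": the context lacks the needed evidence
```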

cs.CV

[107] ForensicFormer: Hierarchical Multi-Scale Reasoning for Cross-Domain Image Forgery Detection

Hema Hariharan Samson

Main category: cs.CV

TL;DR: ForensicFormer is a hierarchical multi-scale transformer framework for cross-domain forgery detection that achieves 86.8% average accuracy across diverse manipulation types, outperforming single-paradigm approaches and showing strong robustness to compression.

DetailsMotivation: Traditional forensic methods are ineffective against AI-generated imagery and sophisticated editing tools, creating a need for robust cross-domain forgery detection that can handle unknown manipulation techniques in real-world deployment.

Method: A hierarchical multi-scale framework using cross-attention transformers that unifies low-level artifact detection, mid-level boundary analysis, and high-level semantic reasoning for comprehensive forgery analysis.

Result: Achieves 86.8% average accuracy across seven diverse test sets (traditional manipulations, GAN-generated images, diffusion model outputs), maintains 83% accuracy under JPEG compression (Q=70), and provides pixel-level localization with 0.76 F1-score.

Conclusion: ForensicFormer bridges classical image forensics and modern deep learning, offering a practical solution for real-world deployment where manipulation techniques are unknown, with each hierarchical component contributing significant accuracy improvements.

Abstract: The proliferation of AI-generated imagery and sophisticated editing tools has rendered traditional forensic methods ineffective for cross-domain forgery detection. We present ForensicFormer, a hierarchical multi-scale framework that unifies low-level artifact detection, mid-level boundary analysis, and high-level semantic reasoning via cross-attention transformers. Unlike prior single-paradigm approaches, which achieve <75% accuracy on out-of-distribution datasets, our method maintains 86.8% average accuracy across seven diverse test sets, spanning traditional manipulations, GAN-generated images, and diffusion model outputs - a significant improvement over state-of-the-art universal detectors. We demonstrate superior robustness to JPEG compression (83% accuracy at Q=70 vs. 66% for baselines) and provide pixel-level forgery localization with a 0.76 F1-score. Extensive ablation studies validate that each hierarchical component contributes 4-10% accuracy improvement, and qualitative analysis reveals interpretable forensic features aligned with human expert reasoning. Our work bridges classical image forensics and modern deep learning, offering a practical solution for real-world deployment where manipulation techniques are unknown a priori.

[108] Reading or Reasoning? Format Decoupled Reinforcement Learning for Document OCR

Yufeng Zhong, Lei Chen, Zhixiong Zeng, Xuanle Zhao, Deyang Jiang, Liming Zheng, Jing Huang, Haibo Qiu, Peng Shi, Siqi Yang, Lin Ma

Main category: cs.CV

TL;DR: FD-RL uses format decoupled reinforcement learning to improve OCR performance on formatted text by targeting high-entropy patterns with specialized rewards.

DetailsMotivation: Advanced OCR models show significantly higher entropy (uncertainty) when processing formatted text like formulas and tables compared to plain text, indicating they struggle with format-sensitive documents despite being treated as simple perceptual tasks.

Method: Format decoupled reinforcement learning (FD-RL) with entropy-based data filtration to identify format-intensive instances, and format decoupled rewards tailored to different format types for format-level validation rather than token-level memorization.

Result: Achieves average score of 90.41 on OmniDocBench, setting new record for end-to-end models on this popular benchmark. Comprehensive ablation studies validate effectiveness of data, training, filtering, and rewarding strategies.

Conclusion: Reasoning over diverse reading pathways improves OCR performance on formatted documents, and format decoupled reinforcement learning effectively addresses the high uncertainty in format-sensitive text recognition.

Abstract: Reading text from images or scanned documents via OCR models has been a longstanding focus of researchers. Intuitively, text reading is perceived as a straightforward perceptual task, and existing work primarily focuses on constructing enriched data engineering to enhance SFT capabilities. In this work, we observe that even advanced OCR models exhibit significantly higher entropy in formatted text (e.g., formulas, tables) compared to plain text, often by an order of magnitude. These statistical patterns reveal that advanced OCR models struggle with high output uncertainty when dealing with format-sensitive documents, suggesting that reasoning over diverse reading pathways may improve OCR performance. To address this, we propose format decoupled reinforcement learning (FD-RL), which leverages high-entropy patterns for targeted optimization. Our approach employs an entropy-based data filtration strategy to identify format-intensive instances, and adopts format-decoupled rewards tailored to different format types, enabling format-level validation rather than token-level memorization. FD-RL achieves an average score of 90.41 on OmniDocBench, setting a new record for end-to-end models on this highly popular benchmark. More importantly, we conduct comprehensive ablation studies over data, training, filtering, and rewarding strategies, thoroughly validating their effectiveness.
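The entropy-based filtration step can be sketched directly. Below, the per-token output distributions are hand-made stand-ins for a model's predictions, and the selection threshold is an assumed hyperparameter.

```python
# Sketch of entropy-based data filtration: keep high-uncertainty samples.
import math

def token_entropy(probs: list[float]) -> float:
    return -sum(p * math.log(p) for p in probs if p > 0)

def mean_entropy(per_token_probs: list[list[float]]) -> float:
    return sum(token_entropy(p) for p in per_token_probs) / len(per_token_probs)

# Stand-ins for a model's per-token output distributions on two documents.
plain_text = [[0.97, 0.02, 0.01]] * 6   # confident predictions: low entropy
table_cells = [[0.4, 0.35, 0.25]] * 6   # uncertain predictions: high entropy

samples = {"plain": plain_text, "table": table_cells}
threshold = 0.5  # assumed cut-off for "format-intensive"
selected = [name for name, probs in samples.items()
            if mean_entropy(probs) > threshold]
print(selected)  # ['table'] -- only the format-intensive sample is kept
```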

[109] Bias Detection and Rotation-Robustness Mitigation in Vision-Language Models and Generative Image Models

Tarannum Mithila

Main category: cs.CV

TL;DR: This paper investigates bias propagation and robustness issues in vision-language and generative models under image rotations and distributional shifts, proposing mitigation strategies that improve robustness while reducing bias.

DetailsMotivation: Despite remarkable performance of VLMs and generative image models, their robustness and fairness under input transformations like image rotation remain insufficiently explored. The paper aims to address bias propagation and robustness degradation in these models.

Method: The authors analyze how rotation-induced perturbations affect model predictions, confidence calibration, and demographic bias patterns. They propose rotation-robust mitigation strategies combining data augmentation, representation alignment, and model-level regularization.

Result: Experimental results across multiple datasets demonstrate that the proposed methods significantly improve robustness while reducing bias amplification without sacrificing overall performance.

Conclusion: The study highlights critical limitations of current multimodal systems and provides practical mitigation techniques for building more reliable and fair AI models.

Abstract: Vision-Language Models (VLMs) and generative image models have achieved remarkable performance across multimodal tasks, yet their robustness and fairness under input transformations remain insufficiently explored. This work investigates bias propagation and robustness degradation in state-of-the-art vision-language and generative models, with a particular focus on image rotation and distributional shifts. We analyze how rotation-induced perturbations affect model predictions, confidence calibration, and demographic bias patterns. To address these issues, we propose rotation-robust mitigation strategies that combine data augmentation, representation alignment, and model-level regularization. Experimental results across multiple datasets demonstrate that the proposed methods significantly improve robustness while reducing bias amplification without sacrificing overall performance. This study highlights critical limitations of current multimodal systems and provides practical mitigation techniques for building more reliable and fair AI models.
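One ingredient of such mitigation, a rotation-consistency regularizer paired with augmentation, can be sketched as follows. The tiny classifier and the symmetric-KL formulation are illustrative assumptions, not the paper's exact objective.

```python
# Sketch of a rotation-consistency term added alongside rotation augmentation.
import torch
import torch.nn as nn
import torch.nn.functional as F

model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 10))

def rotation_consistency_loss(x: torch.Tensor) -> torch.Tensor:
    x_rot = torch.rot90(x, k=1, dims=(2, 3))   # 90-degree rotation
    p = F.log_softmax(model(x), dim=-1)
    p_rot = F.log_softmax(model(x_rot), dim=-1)
    # Symmetric KL between predictions on original and rotated inputs.
    return 0.5 * (F.kl_div(p_rot, p.exp(), reduction="batchmean")
                  + F.kl_div(p, p_rot.exp(), reduction="batchmean"))

x = torch.randn(8, 3, 32, 32)
print(float(rotation_consistency_loss(x)))  # added to the main task loss
```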

[110] R²BD: A Reconstruction-Based Method for Generalizable and Efficient Detection of Fake Images

Qingyu Liu, Zhongjie Ba, Jianmin Guo, Qiu Wang, Zhibo Wang, Jie Shi, Kui Ren

Main category: cs.CV

TL;DR: R²BD is a novel fake image detection framework that uses a unified reconstruction model (G-LDM) to simulate multiple generative paradigms and a single-step residual bias calculation, achieving 22× speedup and better accuracy than existing methods.

DetailsMotivation: Current reconstruction-based AIGC detection methods rely heavily on diffusion models, limiting generalization to other generative paradigms like GANs. They also suffer from inefficiency due to multi-step inversion and reconstruction processes.

Method: Two key designs: (1) G-LDM - a unified reconstruction model that simulates generation behaviors of VAEs, GANs, and diffusion models; (2) Residual bias calculation module that distinguishes real/fake images in a single inference step instead of 20+ steps.

Result: R²BD is over 22× faster than existing reconstruction-based methods while achieving superior detection accuracy. In cross-dataset evaluations, it outperforms state-of-the-art methods by an average of 13.87%, showing strong efficiency and generalization across diverse generative methods.

Conclusion: The proposed R²BD framework successfully addresses the limitations of existing reconstruction-based detection methods by broadening the detection scope beyond diffusion-only approaches and significantly improving efficiency through single-step inference, demonstrating strong performance across diverse generative paradigms.

Abstract: Recently, reconstruction-based methods have gained attention for AIGC image detection. These methods leverage pre-trained diffusion models to reconstruct inputs and measure residuals for distinguishing real from fake images. Their key advantage lies in reducing reliance on dataset-specific artifacts and improving generalization under distribution shifts. However, they are limited by significant inefficiency due to multi-step inversion and reconstruction, and their reliance on diffusion backbones further limits generalization to other generative paradigms such as GANs. In this paper, we propose a novel fake image detection framework, called R²BD, built upon two key designs: (1) G-LDM, a unified reconstruction model that simulates the generation behaviors of VAEs, GANs, and diffusion models, thereby broadening the detection scope beyond prior diffusion-only approaches; and (2) a residual bias calculation module that distinguishes real and fake images in a single inference step, which is a significant efficiency improvement over existing methods that typically require 20+ steps. Extensive experiments on the benchmark from 10 public datasets demonstrate that R²BD is over 22× faster than existing reconstruction-based methods while achieving superior detection accuracy. In cross-dataset evaluations, it outperforms state-of-the-art methods by an average of 13.87%, showing strong efficiency and generalization across diverse generative methods. The code and dataset used for evaluation are available at https://github.com/QingyuLiu/RRBD.
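A minimal sketch of the single-step residual-bias idea, with a crude box blur standing in for one G-LDM forward pass and synthetic "real"/"fake" images; the threshold and the direction of the decision rule are assumptions tied to this toy setup, not the paper's calibrated detector.

```python
# Sketch of a one-pass residual-bias detector: reconstruct once, threshold.
import numpy as np

rng = np.random.default_rng(0)

def reconstruct(img: np.ndarray) -> np.ndarray:
    """Stand-in for one reconstruction forward pass: a crude 3x3 box blur."""
    out = img.copy()
    out[1:-1, 1:-1] = sum(
        img[1 + dy:img.shape[0] - 1 + dy, 1 + dx:img.shape[1] - 1 + dx]
        for dy in (-1, 0, 1) for dx in (-1, 0, 1)
    ) / 9.0
    return out

def residual_bias(img: np.ndarray) -> float:
    return float(np.abs(img - reconstruct(img)).mean())

# Toy inputs: the "real" image keeps high-frequency sensor noise, so
# reconstruction removes more energy from it than from an over-smooth "fake".
real = 0.5 + 0.2 * rng.standard_normal((32, 32))
fake = np.full((32, 32), 0.5) + 0.01 * rng.standard_normal((32, 32))

threshold = 0.05  # assumed decision boundary for this toy data
for name, img in [("real", real), ("fake", fake)]:
    bias = residual_bias(img)
    print(name, round(bias, 4), "-> fake" if bias < threshold else "-> real")
```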

[111] Residual Cross-Modal Fusion Networks for Audio-Visual Navigation

Yi Wang, Yinfeng Yu, Bin Ren

Main category: cs.CV

TL;DR: CRFN introduces bidirectional residual interactions between audio and visual streams for audio-visual embodied navigation, achieving better multimodal fusion and cross-domain generalization than conventional methods.

DetailsMotivation: The key challenge in audio-visual embodied navigation is effectively modeling interactions between heterogeneous features during multimodal fusion to avoid single-modality dominance or information degradation, especially in cross-domain scenarios.

Method: Proposes Cross-Modal Residual Fusion Network (CRFN) with bidirectional residual interactions between audio and visual streams for complementary modeling and fine-grained alignment while maintaining representation independence. Uses residual connections and stabilization techniques instead of simple concatenation or attention gating.

Result: CRFN significantly outperforms state-of-the-art fusion baselines on Replica and Matterport3D datasets and achieves stronger cross-domain generalization. Also reveals that agents exhibit differentiated modality dependence across different datasets.

Conclusion: CRFN effectively addresses multimodal fusion challenges in audio-visual embodied navigation. The discovery of differentiated modality dependence provides new perspective for understanding cross-modal collaboration mechanisms in embodied agents.

Abstract: Audio-visual embodied navigation aims to enable an agent to autonomously localize and reach a sound source in unseen 3D environments by leveraging auditory cues. The key challenge of this task lies in effectively modeling the interaction between heterogeneous features during multimodal fusion, so as to avoid single-modality dominance or information degradation, particularly in cross-domain scenarios. To address this, we propose a Cross-Modal Residual Fusion Network (CRFN), which introduces bidirectional residual interactions between audio and visual streams to achieve complementary modeling and fine-grained alignment, while maintaining the independence of their representations. Unlike conventional methods that rely on simple concatenation or attention gating, CRFN explicitly models cross-modal interactions via residual connections and incorporates stabilization techniques to improve convergence and robustness. Experiments on the Replica and Matterport3D datasets demonstrate that CRFN significantly outperforms state-of-the-art fusion baselines and achieves stronger cross-domain generalization. Notably, our experiments also reveal that agents exhibit differentiated modality dependence across different datasets. The discovery of this phenomenon provides a new perspective for understanding the cross-modal collaboration mechanism of embodied agents.
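A minimal PyTorch sketch of bidirectional residual cross-modal fusion: each stream attends to the other and adds the result back through a residual connection, so the two representations exchange information while remaining separate. Dimensions and layer choices are illustrative, not the paper's architecture.

```python
# Sketch of bidirectional residual fusion between audio and visual tokens.
import torch
import torch.nn as nn

class ResidualCrossModalFusion(nn.Module):
    def __init__(self, dim: int = 128, heads: int = 4):
        super().__init__()
        self.a2v = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.v2a = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm_a = nn.LayerNorm(dim)
        self.norm_v = nn.LayerNorm(dim)

    def forward(self, audio: torch.Tensor, visual: torch.Tensor):
        # Visual queries attend to audio keys/values, and vice versa;
        # residual additions keep each stream's original representation.
        v_ctx, _ = self.a2v(visual, audio, audio)
        a_ctx, _ = self.v2a(audio, visual, visual)
        visual = self.norm_v(visual + v_ctx)
        audio = self.norm_a(audio + a_ctx)
        return audio, visual

fusion = ResidualCrossModalFusion()
audio = torch.randn(2, 16, 128)   # (batch, audio tokens, dim)
visual = torch.randn(2, 49, 128)  # (batch, visual tokens, dim)
a, v = fusion(audio, visual)
print(a.shape, v.shape)  # each stream keeps its own shape after fusion
```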

[112] Learning Domain-Invariant Representations for Cross-Domain Image Registration via Scene-Appearance Disentanglement

Jiahao Qin, Yiwen Wang

Main category: cs.CV

TL;DR: SAR-Net addresses cross-domain image registration by disentangling scene geometry from domain-specific appearance, enabling registration through re-rendering rather than direct intensity matching.

DetailsMotivation: Image registration under domain shift is challenging because systematic intensity differences violate the brightness constancy assumption, making correspondence estimation ill-posed when source and target images come from different domains.

Method: Proposes SAR-Net framework that decomposes images into domain-invariant scene representations and domain-specific appearance codes. Uses scene consistency loss for geometric correspondence and domain alignment loss. Registration is performed via re-rendering rather than direct intensity matching.

Result: Achieves 0.885 SSIM and 0.979 NCC on bidirectional scanning microscopy, representing 3.1x improvement over strongest baseline. Maintains real-time performance (77 fps). Ablation shows both scene consistency and domain alignment losses are essential.

Conclusion: Scene-appearance disentanglement provides a principled solution to cross-domain registration by separating geometric correspondence from appearance variations, enabling consistent alignment across domains through re-rendering.

Abstract: Image registration under domain shift remains a fundamental challenge in computer vision and medical imaging: when source and target images exhibit systematic intensity differences, the brightness constancy assumption underlying conventional registration methods is violated, rendering correspondence estimation ill-posed. We propose SAR-Net, a unified framework that addresses this challenge through principled scene-appearance disentanglement. Our key insight is that observed images can be decomposed into domain-invariant scene representations and domain-specific appearance codes, enabling registration via re-rendering rather than direct intensity matching. We establish theoretical conditions under which this decomposition enables consistent cross-domain alignment (Proposition 1) and prove that our scene consistency loss provides a sufficient condition for geometric correspondence in the shared latent space (Proposition 2). Empirically, we validate SAR-Net on bidirectional scanning microscopy, where coupled domain shift and geometric distortion create a challenging real-world testbed. Our method achieves 0.885 SSIM and 0.979 NCC, representing 3.1x improvement over the strongest baseline, while maintaining real-time performance (77 fps). Ablation studies confirm that both scene consistency and domain alignment losses are necessary: removing either degrades performance by 90% SSIM or causes 223x increase in latent alignment error, respectively. Code and data are available at https://github.com/D-ST-Sword/SAR-NET.
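The two losses can be sketched in a few lines of PyTorch. The toy linear encoders and the loss weighting are assumptions; the point is only the pairing of inputs (same scene across domains for scene consistency, same domain across scenes for domain alignment).

```python
# Sketch of scene-appearance disentanglement losses.
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
scene_enc = nn.Linear(256, 64)    # -> domain-invariant scene code
appear_enc = nn.Linear(256, 16)   # -> domain-specific appearance code

def disentanglement_losses(img_a, img_b, other_a):
    """img_a, img_b: the same scene imaged in domains A and B;
    other_a: a different scene, also imaged in domain A."""
    # Scene consistency: same scene -> same scene code across domains.
    l_scene = F.mse_loss(scene_enc(img_a), scene_enc(img_b))
    # Domain alignment: same domain -> similar appearance codes,
    # even for different scenes.
    l_domain = F.mse_loss(appear_enc(img_a), appear_enc(other_a))
    return l_scene, l_domain

img_a, img_b, other_a = (torch.randn(4, 256) for _ in range(3))
l_scene, l_domain = disentanglement_losses(img_a, img_b, other_a)
total = l_scene + 0.1 * l_domain  # the weighting is an assumed hyperparameter
print(float(l_scene), float(l_domain), float(total))
```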

[113] The Semantic Lifecycle in Embodied AI: Acquisition, Representation and Storage via Foundation Models

Shuai Chen, Hao Chen, Yuanchen Bei, Tianyang Zhao, Zhibo Zhou, Feiran Huang

Main category: cs.CV

TL;DR: Survey paper proposing a “Semantic Lifecycle” framework to analyze how foundation models transform semantic knowledge processing in embodied AI, covering acquisition, representation, and storage stages.

DetailsMotivation: Semantic information in embodied AI is multi-source and multi-stage, making stable perception-to-action loops challenging. Traditional approaches with manual engineering and deep networks work for specific tasks but lack generalizability for complex environments and open-ended tasks. Foundation models offer cross-domain generalization and rich semantic priors to address these limitations.

Method: Proposes the “Semantic Lifecycle” as a unified framework to characterize semantic knowledge evolution in embodied AI driven by foundation models. This holistic perspective captures continuous flow and maintenance of semantic knowledge, departing from traditional isolated module approaches. Analyzes advances across three key stages: acquisition, representation, and storage.

Result: The survey provides a comprehensive analysis of how foundation models are reshaping embodied AI research through their semantic processing capabilities. The Semantic Lifecycle framework offers a systematic way to understand and compare different approaches to semantic knowledge handling in embodied systems.

Conclusion: Foundation models are transforming embodied AI by enabling more generalizable and robust semantic processing. The Semantic Lifecycle framework provides a valuable perspective for analyzing this evolution. The paper identifies existing challenges and outlines promising research directions for future work in this rapidly advancing field.

Abstract: Semantic information in embodied AI is inherently multi-source and multi-stage, making it challenging to fully leverage for achieving stable perception-to-action loops in real-world environments. Early studies have combined manual engineering with deep neural networks, achieving notable progress in specific semantic-related embodied tasks. However, as embodied agents encounter increasingly complex environments and open-ended tasks, the demand for more generalizable and robust semantic processing capabilities has become imperative. Recent advances in foundation models (FMs) address this challenge through their cross-domain generalization abilities and rich semantic priors, reshaping the landscape of embodied AI research. In this survey, we propose the Semantic Lifecycle as a unified framework to characterize the evolution of semantic knowledge within embodied AI driven by foundation models. Departing from traditional paradigms that treat semantic processing as isolated modules or disjoint tasks, our framework offers a holistic perspective that captures the continuous flow and maintenance of semantic knowledge. Guided by this embodied semantic lifecycle, we further analyze and compare recent advances across three key stages: acquisition, representation, and storage. Finally, we summarize existing challenges and outline promising directions for future research.

[114] TAG-MoE: Task-Aware Gating for Unified Generative Mixture-of-Experts

Yu Xu, Hongbin Yan, Juan Cao, Yiji Cheng, Tiankai Hang, Runze He, Zijin Yin, Shiyi Zhang, Yuxin Zhang, Jintao Li, Chunyu Wang, Qinglin Lu, Tong-Yee Lee, Fan Tang

Main category: cs.CV

TL;DR: Proposes a semantic-aware MoE framework for unified image generation/editing that injects task intent into routing to resolve task interference, outperforming dense baselines.

DetailsMotivation: Unified image generation and editing models suffer from severe task interference in dense diffusion transformers where shared parameters must compromise between conflicting objectives. While sparse Mixture-of-Experts (MoE) is promising, its task-agnostic gating networks prevent meaningful specialization and fail to resolve task interference.

Method: Introduces Hierarchical Task Semantic Annotation to create structured task descriptors (scope, type, preservation) and designs Predictive Alignment Regularization to align internal routing decisions with high-level task semantics, evolving gating networks from task-agnostic to semantic-aware dispatch centers.

Result: Effectively mitigates task interference, outperforming dense baselines in fidelity and quality. Analysis shows that experts naturally develop clear and semantically correlated specializations.

Conclusion: Injecting semantic intent into MoE routing through structured task descriptors and alignment regularization successfully resolves task interference in unified image generation/editing models, enabling meaningful expert specialization.

Abstract: Unified image generation and editing models suffer from severe task interference in dense diffusion transformer architectures, where a shared parameter space must compromise between conflicting objectives (e.g., local editing vs. subject-driven generation). While the sparse Mixture-of-Experts (MoE) paradigm is a promising solution, its gating networks remain task-agnostic, operating based on local features, unaware of global task intent. This task-agnostic nature prevents meaningful specialization and fails to resolve the underlying task interference. In this paper, we propose a novel framework to inject semantic intent into MoE routing. We introduce a Hierarchical Task Semantic Annotation scheme to create structured task descriptors (e.g., scope, type, preservation). We then design Predictive Alignment Regularization to align internal routing decisions with the task’s high-level semantics. This regularization evolves the gating network from a task-agnostic executor to a dispatch center. Our model effectively mitigates task interference, outperforming dense baselines in fidelity and quality, and our analysis shows that experts naturally develop clear and semantically correlated specializations.
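A minimal sketch of semantic-aware routing: a small head predicts the task descriptor from the gate's routing weights, and a cross-entropy term (a stand-in for Predictive Alignment Regularization) pushes routing decisions to encode task semantics. All shapes and names are assumptions for illustration.

```python
# Sketch of aligning MoE routing with a task descriptor.
import torch
import torch.nn as nn
import torch.nn.functional as F

n_experts, dim, n_task_types = 8, 64, 3

gate = nn.Linear(dim, n_experts)                 # token features -> expert logits
task_head = nn.Linear(n_experts, n_task_types)   # routing weights -> task label

tokens = torch.randn(32, dim)
task_label = torch.full((32,), 1, dtype=torch.long)  # e.g. "local edit"

routing = F.softmax(gate(tokens), dim=-1)
align_loss = F.cross_entropy(task_head(routing), task_label)
# Added to the generative training loss, this pressures the gate to route
# in a way that reveals (and therefore respects) the task's semantics.
print(float(align_loss))
```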

[115] Changes in Visual Attention Patterns for Detection Tasks due to Dependencies on Signal and Background Spatial Frequencies

Amar Kavuri, Howard C. Gifford, Mini Das

Main category: cs.CV

TL;DR: This paper investigates how image and signal properties affect visual attention during signal detection in digital breast tomosynthesis images, finding that detection performance is constrained by perceptual stages and influenced by interactions between target morphology and background complexity.

DetailsMotivation: To understand how image and signal properties impact visual attention mechanisms during signal detection tasks in medical imaging, particularly for improving detection accuracy in complex heterogeneous backgrounds where misdiagnosis still occurs despite radiologists' expertise.

Method: Used simulated tomographic breast images with digital breast phantoms (Bakic and XCAT) containing two types of lesions with distinct spatial frequency properties. Conducted observer study with 6 human participants detecting 3-mm sphere and 6-mm spicule lesions in DBT slices while collecting eye-gaze data to analyze visual attention metrics.

Result: Detection performance is strongly constrained by later perceptual stages, with decision failures accounting for most errors. Signal detectability is jointly influenced by target morphology and background complexity, showing critical interaction between local signal features and global anatomical noise. Spiculated lesions received longer fixation durations, indicating differential visual attention engagement based on background and signal spatial frequency dependencies.

Conclusion: Visual attention mechanisms in complex medical imaging tasks are significantly affected by interactions between target characteristics and background complexity, with decision-making failures being the primary source of detection errors rather than early visual processing limitations.

Abstract: We aim to investigate the impact of image and signal properties on visual attention mechanisms during a signal detection task in digital images. The insights yielded by this work span many areas of digital imaging where signal or pattern recognition is involved in complex heterogeneous backgrounds. We used simulated tomographic breast images as the platform to investigate this question. While radiologists are highly effective at analyzing medical images to detect and diagnose diseases, misdiagnosis still occurs. We selected digital breast tomosynthesis (DBT) images as sample medical images with different breast densities and structures, generated using digital breast phantoms (Bakic and XCAT). Two types of lesions (with distinct spatial frequency properties) were randomly inserted in the phantoms during projections to generate abnormal cases. Six human observers participated in an observer study designed around locating and detecting a 3-mm sphere lesion and a 6-mm spicule lesion in reconstructed in-plane DBT slices. We collected eye-gaze data to estimate gaze metrics and to examine differences in visual attention mechanisms. We found that detection performance in complex visual environments is strongly constrained by later perceptual stages, with decision failures accounting for the largest proportion of errors. Signal detectability is jointly influenced by both target morphology and background complexity, revealing a critical interaction between local signal features and global anatomical noise. Increased fixation duration on spiculated lesions suggests that visual attention is differentially engaged depending on background and signal spatial frequency dependencies.

[116] Compressing Vision Transformers in Geospatial Transfer Learning with Manifold-Constrained Optimization

Thomas Snyder, H. Lexie Yang, Stefan Schnake, Steffen Schotthöfer

Main category: cs.CV

TL;DR: DLRT framework compresses geospatial foundation models via manifold-constrained optimization during transfer learning, achieving strong parameter reduction with minimal accuracy loss for edge deployment.

DetailsMotivation: Geospatial foundation models are too large for resource-constrained edge devices, and existing compression methods cause significant accuracy loss, limiting practical adoption.

Method: Uses manifold-constrained optimization framework DLRT to compress vision transformer-based geospatial models during transfer learning by enforcing structured low-dimensional parameterizations aligned with downstream objectives.

Result: Outperforms off-the-shelf low-rank methods like LoRA, achieves substantial parameter reduction with minimal accuracy loss on diverse geospatial benchmarks, enabling high-performing on-device models.

Conclusion: DLRT enables effective compression of geospatial foundation models for edge deployment while maintaining task-specific accuracy, overcoming limitations of existing compression techniques.

Abstract: Deploying geospatial foundation models on resource-constrained edge devices demands compact architectures that maintain high downstream performance. However, their large parameter counts and the accuracy loss often induced by compression limit practical adoption. In this work, we leverage the manifold-constrained optimization framework DLRT to compress large vision transformer-based geospatial foundation models during transfer learning. By enforcing structured low-dimensional parameterizations aligned with downstream objectives, this approach achieves strong compression while preserving task-specific accuracy. We show that the method outperforms off-the-shelf low-rank methods such as LoRA. Experiments on diverse geospatial benchmarks confirm substantial parameter reduction with minimal accuracy loss, enabling high-performing, on-device geospatial models.
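The core parameter saving comes from never storing the dense weight. A minimal sketch of a rank-r factorized layer follows; DLRT additionally adapts the rank and constrains the factors to a manifold during training, which is omitted here, and all shapes are illustrative.

```python
# Sketch of a low-rank reparameterized linear layer (W = U S V^T, rank r).
import torch
import torch.nn as nn

class LowRankLinear(nn.Module):
    def __init__(self, d_in: int, d_out: int, rank: int):
        super().__init__()
        # W is never stored densely: only U (d_out x r), S (r x r), V (d_in x r).
        self.U = nn.Parameter(torch.randn(d_out, rank) / d_out**0.5)
        self.S = nn.Parameter(torch.eye(rank))
        self.V = nn.Parameter(torch.randn(d_in, rank) / d_in**0.5)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return (x @ self.V) @ self.S @ self.U.T

layer = LowRankLinear(d_in=768, d_out=768, rank=16)
dense_params = 768 * 768
lr_params = sum(p.numel() for p in layer.parameters())
print(lr_params / dense_params)   # ~0.04: over 95% fewer parameters
print(layer(torch.randn(4, 768)).shape)
```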

[117] DeTracker: Motion-decoupled Vehicle Detection and Tracking in Unstabilized Satellite Videos

Jiajun Chen, Jing Xiao, Shaohan Cao, Yuming Zhu, Liang Liao, Jun Pan, Mi Wang

Main category: cs.CV

TL;DR: DeTracker: A joint detection-and-tracking framework for unstabilized satellite videos that addresses platform jitter and tiny object tracking through motion decoupling and temporal feature fusion.

DetailsMotivation: Satellite videos provide continuous observations but face MOT challenges under unstabilized conditions where platform jitter and weak appearance of tiny objects degrade tracking performance.

Method: 1) Global-Local Motion Decoupling (GLMD) module separates satellite platform motion from true object motion via global alignment and local refinement. 2) Temporal Dependency Feature Pyramid (TDFP) module performs cross-frame temporal feature fusion for better tiny-object representations. 3) New benchmark dataset SDM-Car-SU simulates multi-directional platform motions.

Result: DeTracker significantly outperforms existing methods, achieving 61.1% MOTA on SDM-Car-SU dataset and 47.3% MOTA on real satellite video data.

Conclusion: DeTracker effectively addresses MOT challenges in unstabilized satellite videos through motion decoupling and temporal feature fusion, demonstrating superior performance on both simulated and real datasets.

Abstract: Satellite videos provide continuous observations of surface dynamics but pose significant challenges for multi-object tracking (MOT), especially under unstabilized conditions where platform jitter and the weak appearance of tiny objects jointly degrade tracking performance. To address this problem, we propose DeTracker, a joint detection-and-tracking framework tailored for unstabilized satellite videos. DeTracker introduces a Global–Local Motion Decoupling (GLMD) module that explicitly separates satellite platform motion from true object motion through global alignment and local refinement, leading to improved trajectory stability and motion estimation accuracy. In addition, a Temporal Dependency Feature Pyramid (TDFP) module is developed to perform cross-frame temporal feature fusion, enhancing the continuity and discriminability of tiny-object representations. We further construct a new benchmark dataset, SDM-Car-SU, which simulates multi-directional and multi-speed platform motions to enable systematic evaluation of tracking robustness under varying motion perturbations. Extensive experiments on both simulated and real unstabilized satellite videos demonstrate that DeTracker significantly outperforms existing methods, achieving 61.1% MOTA on SDM-Car-SU and 47.3% MOTA on real satellite video data.
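The decoupling idea can be sketched with point displacements: estimate the platform's global motion from the background, then subtract it from raw per-object displacements. DeTracker's GLMD operates on learned features rather than point tracks; the data below is synthetic and the median estimator is an assumption.

```python
# Sketch of global-local motion decoupling on synthetic point displacements.
import numpy as np

rng = np.random.default_rng(0)
jitter = np.array([3.0, -2.0])  # unknown platform motion between frames (px)

background_flow = jitter + 0.1 * rng.standard_normal((200, 2))
vehicle_flow = jitter + np.array([1.5, 0.5])   # platform + the vehicle's own motion

global_motion = np.median(background_flow, axis=0)  # "global alignment" step
object_motion = vehicle_flow - global_motion        # "local refinement" step
print(np.round(global_motion, 2), np.round(object_motion, 2))
# ~[ 3. -2.] and ~[1.5 0.5]: jitter removed from the vehicle's motion
```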

[118] Adaptive few-shot learning for robust part quality classification in two-photon lithography

Sixian Jia, Ruo-Syuan Mei, Chenhui Shao

Main category: cs.CV

TL;DR: An adaptive computer vision framework for quality control in two-photon lithography manufacturing that handles novelty detection, incremental learning, and domain adaptation with minimal data requirements.

DetailsMotivation: Existing computer vision models for quality control in additive manufacturing are static and ineffective in dynamic environments - they can't detect new defect classes, update efficiently from scarce data, or adapt to new part geometries.

Method: A unified framework with three components: 1) LDA-based statistical hypothesis testing for novelty detection, 2) rehearsal-based few-shot incremental learning for adding new classes, and 3) few-shot Domain-Adversarial Neural Network for domain adaptation between different part geometries.

Result: Novelty detection achieved 99-100% accuracy, incremental learning reached 92% accuracy with only 20 samples per new class, and domain adaptation achieved 96.19% accuracy on target domain with just 5 shots, bridging severe domain gaps.

Conclusion: The framework provides a robust, data-efficient solution for deploying and maintaining computer vision quality control models in evolving manufacturing scenarios, addressing key limitations of static models.

Abstract: Two-photon lithography (TPL) is an advanced additive manufacturing (AM) technique for fabricating high-precision micro-structures. While computer vision (CV) has proven effective for automated quality control, existing models are often static, rendering them ineffective in dynamic manufacturing environments. These models typically cannot detect new, unseen defect classes, be efficiently updated from scarce data, or adapt to new part geometries. To address this gap, this paper presents an adaptive CV framework for the entire life-cycle of quality model maintenance. The proposed framework is built upon a shared, scale-robust backbone model and integrates three key methodologies: (1) a statistical hypothesis testing framework based on Linear Discriminant Analysis (LDA) for novelty detection, (2) a two-stage, rehearsal-based strategy for few-shot incremental learning, and (3) a few-shot Domain-Adversarial Neural Network (DANN) for few-shot domain adaptation. The framework was evaluated on a TPL dataset featuring hemisphere structures as the source domain and cube structures as the target domain, with each domain categorized into good, minor-damaged, and damaged quality classes. The hypothesis testing method successfully identified new class batches with 99-100% accuracy. The incremental learning method integrated a new class with 92% accuracy using only K=20 samples. The domain adaptation model bridged the severe domain gap, achieving 96.19% accuracy on the target domain using only K=5 shots. These results demonstrate a robust and data-efficient solution for deploying and maintaining CV models in evolving production scenarios.
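The novelty test can be sketched as a distance-based hypothesis test in a discriminant feature space: a new batch is flagged as a novel class when its distance to every known class mean exceeds a threshold calibrated on training data. The synthetic 2-D features and percentile-based threshold below are assumptions, not the paper's exact statistic.

```python
# Sketch of novelty detection via a calibrated distance test.
import numpy as np

rng = np.random.default_rng(0)

# Known classes as 2-D discriminant features (e.g. "good" and "damaged").
classes = {
    "good": rng.normal([0, 0], 0.3, (100, 2)),
    "damaged": rng.normal([3, 0], 0.3, (100, 2)),
}
means = {k: v.mean(axis=0) for k, v in classes.items()}

def min_dist(x: np.ndarray) -> float:
    return min(np.linalg.norm(x - m) for m in means.values())

# Calibrate the rejection threshold as the 99th percentile of in-class
# distances (a stand-in for the test's significance level).
train_dists = [min_dist(x) for v in classes.values() for x in v]
threshold = np.percentile(train_dists, 99)

new_batch = rng.normal([1.5, 2.5], 0.3, (20, 2))  # an unseen defect class
novel_frac = np.mean([min_dist(x) > threshold for x in new_batch])
print(f"flagged as novel: {novel_frac:.0%}")  # most of the batch is rejected
```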

[119] Variance-Penalized MC-Dropout as a Learned Smoothing Prior for Brain Tumour Segmentation

Satyaki Roy Chowdhury, Golrokh Mirzaei

Main category: cs.CV

TL;DR: UAMSA-UNet: Uncertainty-aware multi-scale attention Bayesian U-Net that uses Monte Carlo Dropout for data-driven smoothing, improving brain tumor segmentation with better boundaries and computational efficiency.

DetailsMotivation: Existing CNN and U-Net based brain tumor segmentation methods produce noisy boundaries in tumor infiltration regions, requiring better approaches for accurate diagnosis and treatment planning.

Method: UAMSA-UNet combines uncertainty-aware Bayesian approach with Monte Carlo Dropout for data-driven smoothing prior, fuses multi-scale features and attention maps, and uses smoothing-regularized loss with variance penalty across stochastic forward passes.

Result: On BraTS2023: 3.3% Dice and 2.7% IoU improvement over U-Net. On BraTS2024: 4.5% Dice and 4.0% IoU gains over best baseline. Reduces FLOPs by 42.5% relative to U-Net++ while maintaining higher accuracy.

Conclusion: Combining multi-scale attention with learned smoothing prior achieves better segmentation quality and computational efficiency, providing flexible foundation for future integration with transformer-based modules.

Abstract: Brain tumor segmentation is essential for diagnosis and treatment planning, yet many CNN and U-Net based approaches produce noisy boundaries in regions of tumor infiltration. We introduce UAMSA-UNet, an Uncertainty-Aware Multi-Scale Attention-based Bayesian U-Net that instead leverages Monte Carlo Dropout to learn a data-driven smoothing prior over its predictions, while fusing multi-scale features and attention maps to capture both fine details and global context. Our smoothing-regularized loss augments binary cross-entropy with a variance penalty across stochastic forward passes, discouraging spurious fluctuations and yielding spatially coherent masks. On BraTS2023, UAMSA-UNet improves Dice Similarity Coefficient by up to 3.3% and mean IoU by up to 2.7% over U-Net; on BraTS2024, it delivers up to 4.5% Dice and 4.0% IoU gains over the best baseline. Remarkably, it also reduces FLOPs by 42.5% relative to U-Net++ while maintaining higher accuracy. These results demonstrate that, by combining multi-scale attention with a learned smoothing prior, UAMSA-UNet achieves both better segmentation quality and computational efficiency, and provides a flexible foundation for future integration with transformer-based modules for further enhanced segmentation results.
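The smoothing-regularized loss is easy to state in code: binary cross-entropy on the mean of several stochastic forward passes plus a penalty on their per-pixel variance. The tiny convolutional net and the penalty weight below are illustrative assumptions, not the paper's model.

```python
# Sketch of a variance-penalized MC-dropout segmentation objective.
import torch
import torch.nn as nn
import torch.nn.functional as F

model = nn.Sequential(
    nn.Conv2d(1, 8, 3, padding=1), nn.ReLU(),
    nn.Dropout2d(0.2),              # kept stochastic across MC passes
    nn.Conv2d(8, 1, 3, padding=1),
)

def smoothing_regularized_loss(x, target, passes: int = 8, lam: float = 0.1):
    model.train()  # keep dropout active
    preds = torch.stack([torch.sigmoid(model(x)) for _ in range(passes)])
    mean_pred = preds.mean(dim=0)
    bce = F.binary_cross_entropy(mean_pred, target)
    variance_penalty = preds.var(dim=0).mean()  # discourages spurious flips
    return bce + lam * variance_penalty

x = torch.randn(2, 1, 32, 32)
target = (torch.rand(2, 1, 32, 32) > 0.7).float()
print(float(smoothing_regularized_loss(x, target)))
```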

[120] Thermo-LIO: A Novel Multi-Sensor Integrated System for Structural Health Monitoring

Chao Yang, Haoyuan Zheng, Yue Ma

Main category: cs.CV

TL;DR: Thermo-LIO: A novel multi-sensor system combining thermal imaging with LiDAR for enhanced structural health monitoring, enabling better defect detection in complex structures than traditional thermography.

DetailsMotivation: Traditional 2D thermography is limited for assessing complex geometries, inaccessible areas, and subsurface defects in construction. There's a need for better methods to monitor large-scale civil infrastructure.

Method: Developed multimodal fusion method combining thermal imaging and LiDAR with precise calibration/synchronization. Integrated this fusion with LiDAR-Inertial Odometry (LIO) for full coverage of large structures and detailed temperature monitoring across inspection cycles.

Result: Experimental validations on bridge and hall building demonstrate Thermo-LIO detects detailed thermal anomalies and structural defects more accurately than traditional methods. System enhances diagnostic precision, enables real-time processing, and expands inspection coverage.

Conclusion: Thermo-LIO highlights the crucial role of multimodal sensor integration in advancing SHM methodologies for large-scale civil infrastructure, providing superior defect detection capabilities compared to traditional approaches.

Abstract: Traditional two-dimensional thermography, despite being non-invasive and useful for defect detection in the construction field, is limited in effectively assessing complex geometries, inaccessible areas, and subsurface defects. This paper introduces Thermo-LIO, a novel multi-sensor system that can enhance Structural Health Monitoring (SHM) by fusing thermal imaging with high-resolution LiDAR. To achieve this, the study first develops a multimodal fusion method combining thermal imaging and LiDAR, enabling precise calibration and synchronization of multimodal data streams to create accurate representations of temperature distributions in buildings. Second, it integrates this fusion approach with LiDAR-Inertial Odometry (LIO), enabling full coverage of large-scale structures and allowing for detailed monitoring of temperature variations and defect detection across inspection cycles. Experimental validations, including case studies on a bridge and a hall building, demonstrate that Thermo-LIO can detect detailed thermal anomalies and structural defects more accurately than traditional methods. The system enhances diagnostic precision, enables real-time processing, and expands inspection coverage, highlighting the crucial role of multimodal sensor integration in advancing SHM methodologies for large-scale civil infrastructure.
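The fusion step, projecting LiDAR points into the thermal image so each point carries a temperature, can be sketched with a pinhole model; the calibration matrices and the synthetic thermal frame below are placeholders for the paper's calibrated pipeline.

```python
# Sketch of thermal-LiDAR fusion via pinhole projection (assumed calibration).
import numpy as np

K = np.array([[400.0, 0, 320], [0, 400.0, 240], [0, 0, 1]])  # intrinsics
T = np.eye(4)                                 # LiDAR->camera extrinsics (assumed)
thermal = 20 + 5 * np.random.default_rng(0).random((480, 640))  # deg C frame

points = np.array([[1.0, 0.2, 5.0], [-0.5, 0.1, 4.0]])  # LiDAR points (m)
homog = np.c_[points, np.ones(len(points))] @ T.T        # into camera frame
uv = homog[:, :3] @ K.T
uv = uv[:, :2] / uv[:, 2:3]                              # pinhole projection

for (u, v), p in zip(uv.astype(int), points):
    if 0 <= v < thermal.shape[0] and 0 <= u < thermal.shape[1]:
        # Each 3D point is assigned the temperature of the pixel it lands on.
        print(p, "->", round(float(thermal[v, u]), 1), "deg C")
```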

[121] SAM-pose2seg: Pose-Guided Human Instance Segmentation in Crowds

Constantin Kolomiiets, Miroslav Purkrabek, Jiri Matas

Main category: cs.CV

TL;DR: Pose-guided fine-tuning of SAM 2.1 for occlusion-aware human segmentation using pose keypoints with iterative refinement.

DetailsMotivation: SAM struggles with occlusion where keypoints may be partially or fully invisible, limiting its effectiveness for human segmentation in challenging scenarios.

Method: Adapt SAM 2.1 with minimal encoder modifications, using PoseMaskRefine fine-tuning strategy that incorporates pose keypoints with high visibility into SAM’s iterative correction process. During inference, only the three highest visibility keypoints are used for prompting.

Result: Improved robustness and accuracy across multiple datasets, reduced sensitivity to errors like missing body parts or misclassified clothing, and accurate mask prediction from as few as a single keypoint.

Conclusion: Pose-guided fine-tuning of SAM enables effective, occlusion-aware human segmentation while preserving the original model’s generalization capabilities.

Abstract: Segment Anything (SAM) provides an unprecedented foundation for human segmentation, but may struggle under occlusion, where keypoints may be partially or fully invisible. We adapt SAM 2.1 for pose-guided segmentation with minimal encoder modifications, retaining its strong generalization. Using a fine-tuning strategy called PoseMaskRefine, we incorporate pose keypoints with high visibility into the iterative correction process originally employed by SAM, yielding improved robustness and accuracy across multiple datasets. During inference, we simplify prompting by selecting only the three keypoints with the highest visibility. This strategy reduces sensitivity to common errors, such as missing body parts or misclassified clothing, and allows accurate mask prediction from as few as a single keypoint. Our results demonstrate that pose-guided fine-tuning of SAM enables effective, occlusion-aware human segmentation while preserving the generalization capabilities of the original model. The code and pretrained models will be available at https://mirapurkrabek.github.io/BBox-MaskPose.
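The inference-time prompting rule is simple enough to show directly: keep only the three keypoints with the highest visibility as positive point prompts. The keypoint data and the downstream predictor call are placeholders, not SAM's actual API.

```python
# Sketch of top-3-visibility keypoint selection for point prompting.
keypoints = [  # (name, x, y, visibility) from a pose estimator
    ("nose", 120, 40, 0.95),
    ("left_wrist", 80, 150, 0.30),   # occluded
    ("right_knee", 140, 220, 0.88),
    ("left_ankle", 90, 300, 0.10),   # occluded
    ("right_shoulder", 150, 80, 0.91),
]

top3 = sorted(keypoints, key=lambda k: k[3], reverse=True)[:3]
point_coords = [(x, y) for _, x, y, _ in top3]
point_labels = [1] * len(point_coords)  # all treated as foreground prompts
print(point_coords)  # [(120, 40), (150, 80), (140, 220)]
# These coordinates would then be passed to the segmentation model as
# positive point prompts (the exact predict call depends on the SAM API).
```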

[122] Instance camera focus prediction for crystal agglomeration classification

Xiaoyu Ji, Chenhao Zhang, Tyler James Downard, Zoltan Nagy, Ali Shakouri, Fengqing Zhu

Main category: cs.CV

TL;DR: A method using instance camera focus prediction and segmentation to accurately classify crystal agglomeration in microscopic images by distinguishing overlapping crystals at different depth layers.

DetailsMotivation: Crystal agglomeration analysis from 2D microscopic images is challenging because overlapping crystals at different depth layers appear connected but aren't truly agglomerated, and traditional methods can't distinguish depth information effectively.

Method: First uses an instance camera focus prediction network to quantify focus levels (2 classes) that align with visual observations better than traditional focus measures. Then combines instance segmentation with predicted focus levels for agglomeration classification.

Result: The proposed method achieves higher agglomeration classification and segmentation accuracy than baseline models on both ammonium perchlorate crystal and sugar crystal datasets.

Conclusion: The focus-aware approach effectively addresses the depth ambiguity problem in 2D microscopic images for accurate crystal agglomeration analysis, outperforming traditional methods.

Abstract: Agglomeration refers to the process of crystal clustering due to interparticle forces. Crystal agglomeration analysis from microscopic images is challenging due to the inherent limitations of two-dimensional imaging. Overlapping crystals may appear connected even when located at different depth layers. Because optical microscopes have a shallow depth of field, crystals that are in-focus and out-of-focus in the same image typically reside on different depth layers and do not constitute true agglomeration. To address this, we first quantified camera focus with an instance camera focus prediction network that predicts a two-class focus level aligning better with visual observations than traditional image processing focus measures. Then an instance segmentation model is combined with the predicted focus level for agglomeration classification. Our proposed method achieves higher agglomeration classification and segmentation accuracy than the baseline models on the ammonium perchlorate crystal and sugar crystal datasets.
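The classification rule implied above can be sketched as: instances whose masks touch count as agglomerated only when they share the same predicted focus class, i.e. lie in the same depth layer. The toy boolean masks and the pairwise rule below are illustrative assumptions.

```python
# Sketch of focus-aware agglomeration classification over instance masks.
import numpy as np

def touches(mask_a: np.ndarray, mask_b: np.ndarray) -> bool:
    return bool((mask_a & mask_b).any())

def agglomerated(instances: list[dict]) -> list[tuple[int, int]]:
    pairs = []
    for i in range(len(instances)):
        for j in range(i + 1, len(instances)):
            a, b = instances[i], instances[j]
            # Overlap alone is not enough: focus classes must match too.
            if touches(a["mask"], b["mask"]) and a["focus"] == b["focus"]:
                pairs.append((i, j))
    return pairs

grid = np.zeros((8, 8), dtype=bool)
m0, m1, m2 = grid.copy(), grid.copy(), grid.copy()
m0[2:5, 2:5] = True
m1[4:7, 4:7] = True   # overlaps m0, same depth layer
m2[4:7, 4:7] = True   # overlaps m0 but lies in a different depth layer
instances = [
    {"mask": m0, "focus": "in"},
    {"mask": m1, "focus": "in"},
    {"mask": m2, "focus": "out"},
]
print(agglomerated(instances))  # [(0, 1)]: only the in-focus overlap counts
```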

[123] Depth-Wise Representation Development Under Blockwise Self-Supervised Learning for Video Vision Transformers

Jonas Römer, Timo Dickscheid

Main category: cs.CV

TL;DR: Blockwise self-supervised learning applied to masked video transformers achieves comparable performance to end-to-end training, with differences in learning dynamics and representation development.

DetailsMotivation: To explore whether masked video transformers can be trained without end-to-end backpropagation, addressing the underexplored application of blockwise self-supervised learning to video modeling and the sparse analysis comparing BWSSL and end-to-end training dynamics.

Method: Apply blockwise learning to masked autoencoding video vision transformers by partitioning the encoder into blocks, each optimized with a local masked reconstruction loss. Analyze depth-wise decodability, inter-block similarity, and patch-level diagnostics.

Result: Blockwise training converges and yields representations close to end-to-end baselines under linear-probe and retrieval metrics. It exposes higher-level structure earlier, with later blocks saturating and operating in geometry-preserving regimes, showing token-level shifts consistent with stronger early mixing.

Conclusion: Blockwise training is viable for masked video transformers, with late-block saturation and interface formation contributing to remaining performance gaps compared to end-to-end training.

Abstract: End-to-end backpropagation couples all layers through a global error signal, enabling coordinated learning but requiring long-range credit assignment. Motivated by recent progress in blockwise self-supervised learning (BWSSL), we ask whether masked video transformers can be trained without end-to-end backpropagation. Applying BWSSL to masked video modeling remains relatively underexplored and must handle spatiotemporal context and long-range temporal structure. More broadly, analyses that compare BWSSL and end-to-end training in terms of learning dynamics and depth-wise representation development remain sparse. We apply blockwise learning to a masked autoencoding video vision transformer by partitioning the encoder into blocks, each of which is optimized with a local masked reconstruction loss. Across model sizes and partition granularities, training converges and yields representations close to matched end-to-end baselines under linear-probe and retrieval proxies. In order to compare intermediate representations, we analyze depth-wise decodability, inter-block similarity, and patch-level diagnostics. Blockwise training exposes higher-level structure earlier, while later blocks saturate and operate in a more geometry-preserving regime. It can also induce token-level shifts consistent with stronger early mixing that pooled metrics can miss. These findings point to late-block saturation and interface formation as contributors to the remaining gap.
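A minimal sketch of blockwise training with local losses: each block gets its own reconstruction head, its loss is backpropagated only within the block, and activations are detached before entering the next block. The linear layers and toy targets are illustrative stand-ins for transformer blocks and masked-patch targets.

```python
# Sketch of blockwise training: per-block local losses, no end-to-end gradients.
import torch
import torch.nn as nn
import torch.nn.functional as F

dim = 64
blocks = nn.ModuleList([nn.Sequential(nn.Linear(dim, dim), nn.GELU())
                        for _ in range(3)])
heads = nn.ModuleList([nn.Linear(dim, dim) for _ in blocks])  # local decoders
opt = torch.optim.Adam(list(blocks.parameters()) + list(heads.parameters()),
                       lr=1e-3)

tokens = torch.randn(8, dim)   # visible (unmasked) token features
target = torch.randn(8, dim)   # stand-in for masked-patch reconstruction targets

x = tokens
total = 0.0
for block, head in zip(blocks, heads):
    x = block(x)
    loss = F.mse_loss(head(x), target)   # local masked-reconstruction loss
    loss.backward()                      # credit assignment stays in-block
    x = x.detach()                       # cut the gradient path between blocks
    total += float(loss)
opt.step()
opt.zero_grad()
print(total / len(blocks))
```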

[124] Exploring Reliable Spatiotemporal Dependencies for Efficient Visual Tracking

Junze Shi, Yang Yu, Jian Shi, Haibo Luo

Main category: cs.CV

TL;DR: STDTrack introduces spatiotemporal dependencies into lightweight object tracking using dense video sampling, temporal propagation tokens, and multi-frame information fusion to bridge performance gaps while maintaining real-time efficiency.

DetailsMotivation: Existing lightweight trackers use sparse sampling (one template + one search image per sequence), failing to explore spatiotemporal information in videos, which creates a performance gap between lightweight and high-performance trackers.

Method: 1) Dense video sampling for spatiotemporal information utilization; 2) Temporally propagating spatiotemporal token for per-frame feature guidance; 3) Multi-frame Information Fusion Module (MFIFM) using historical context; 4) Spatiotemporal Token Maintainer (STM) with quality-based update; 5) Multi-scale prediction head for varying object sizes.
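To make the quality-based update in the STM concrete, here is a hypothetical gating rule; the threshold test and momentum refresh are assumptions for illustration, not the paper's published mechanism.

```python
import torch

def update_spatiotemporal_token(stored_token, new_token, quality,
                                threshold=0.5, momentum=0.9):
    """Hypothetical quality-gated update for a spatiotemporal token maintainer:
    refresh the stored token only when the current frame's evidence is judged
    reliable; otherwise keep the old state."""
    if quality > threshold:
        return momentum * stored_token + (1.0 - momentum) * new_token
    return stored_token  # discard low-quality evidence
```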

Result: Achieves state-of-the-art results across six benchmarks. On GOT-10k, rivals high-performance non-real-time trackers like MixFormer while operating at 192 FPS (GPU) and 41 FPS (CPU).

Conclusion: STDTrack successfully bridges the performance gap between lightweight and high-performance trackers by integrating reliable spatiotemporal dependencies while maintaining real-time efficiency.

Abstract: Recent advances in transformer-based lightweight object tracking have established new standards across benchmarks, leveraging the global receptive field and powerful feature extraction capabilities of attention mechanisms. Despite these achievements, existing methods universally employ sparse sampling during training (utilizing only one template and one search image per sequence), which fails to comprehensively explore spatiotemporal information in videos. This limitation constrains performance and causes the gap between lightweight and high-performance trackers. To bridge this divide while maintaining real-time efficiency, we propose STDTrack, a framework that pioneers the integration of reliable spatiotemporal dependencies into lightweight trackers. Our approach implements dense video sampling to maximize spatiotemporal information utilization. We introduce a temporally propagating spatiotemporal token to guide per-frame feature extraction. To ensure comprehensive target state representation, we design the Multi-frame Information Fusion Module (MFIFM), which augments current dependencies using historical context. The MFIFM operates on features stored in our constructed Spatiotemporal Token Maintainer (STM), where a quality-based update mechanism ensures information reliability. Considering the scale variation among tracking targets, we develop a multi-scale prediction head to dynamically adapt to objects of different sizes. Extensive experiments demonstrate state-of-the-art results across six benchmarks. Notably, on GOT-10k, STDTrack rivals certain high-performance non-real-time trackers (e.g., MixFormer) while operating at 192 FPS (GPU) and 41 FPS (CPU).

[125] Vision Foundation Models for Domain Generalisable Cross-View Localisation in Planetary Ground-Aerial Robotic Teams

Lachlan Holden, Feras Dayoub, Alberto Candela, David Harvey, Tat-Jun Chin

Main category: cs.CV

TL;DR: Cross-view localization method for planetary rovers using dual-encoder neural networks with synthetic data and semantic segmentation to bridge domain gap to real images.

DetailsMotivation: Future planetary missions require advanced autonomy and accurate localization. Ground-aerial robotic teams are promising but face challenges: real space data with ground-truth labels is scarce, and traditional methods struggle with limited field-of-view monocular ground-view images.

Method: Proposes cross-view-localising dual-encoder deep neural networks that use semantic segmentation with vision foundation models and high-volume synthetic data to bridge domain gap. Combines particle filters for state estimation with cross-view networks to estimate position from sequences of ground-view images.
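The particle-filter half of the pipeline follows the standard predict-update cycle, with the cross-view network supplying the measurement likelihood. The sketch below assumes a callable `aerial_emb_at` returning the aerial-map embedding at a 2D position; the Gaussian motion model and exponential similarity likelihood are illustrative choices.

```python
import numpy as np

def particle_filter_step(particles, weights, ground_emb, aerial_emb_at,
                         motion_noise=1.0):
    """One predict-update cycle of a cross-view particle filter (sketch)."""
    # Predict: diffuse particles according to an assumed rover motion model.
    particles = particles + np.random.normal(0.0, motion_noise, particles.shape)
    # Update: weight each particle by similarity between the ground-view
    # embedding and the aerial-map embedding at that particle's position.
    sims = np.array([float(ground_emb @ aerial_emb_at(p)) for p in particles])
    weights = weights * np.exp(sims - sims.max())  # numerically stable likelihood
    weights = weights / weights.sum()
    return particles, weights
```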

Result: Developed a new cross-view dataset of real-world rover trajectories with ground-truth localization from a planetary analogue facility, plus a high-volume synthetic dataset. The method enables accurate position estimation over both simple and complex trajectories.

Conclusion: The proposed approach addresses the data scarcity problem in planetary robotics by leveraging synthetic data and semantic segmentation, enabling accurate rover localization in aerial maps using limited ground-view images, which supports future autonomous planetary missions.

Abstract: Accurate localisation in planetary robotics enables the advanced autonomy required to support the increased scale and scope of future missions. The successes of the Ingenuity helicopter and multiple planetary orbiters lay the groundwork for future missions that use ground-aerial robotic teams. In this paper, we consider rovers using machine learning to localise themselves in a local aerial map using limited field-of-view monocular ground-view RGB images as input. A key consideration for machine learning methods is that real space data with ground-truth position labels suitable for training is scarce. In this work, we propose a novel method of localising rovers in an aerial map using cross-view-localising dual-encoder deep neural networks. We leverage semantic segmentation with vision foundation models and high volume synthetic data to bridge the domain gap to real images. We also contribute a new cross-view dataset of real-world rover trajectories with corresponding ground-truth localisation data captured in a planetary analogue facility, plus a high volume dataset of analogous synthetic image pairs. Using particle filters for state estimation with the cross-view networks allows accurate position estimation over simple and complex trajectories based on sequences of ground-view images.

[126] Small but Mighty: Dynamic Wavelet Expert-Guided Fine-Tuning of Large-Scale Models for Optical Remote Sensing Object Segmentation

Yanguang Sun, Chao Wang, Jian Yang, Lei Luo

Main category: cs.CV

TL;DR: WEFT is a dynamic wavelet expert-guided fine-tuning paradigm that efficiently adapts large-scale foundation models to remote sensing segmentation tasks with fewer trainable parameters.

DetailsMotivation: Large-scale foundation models offer strong performance potential for remote sensing segmentation but are impractical due to massive parameter counts causing GPU memory and computational issues with full-parameter fine-tuning.

Method: Introduces a task-specific wavelet expert extractor to model wavelet experts from different perspectives and dynamically regulate outputs, plus an expert-guided conditional adapter that enhances frozen features with trainable wavelet features through iterative updates.
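A minimal sketch of the adapter idea follows, assuming the usual bottleneck form: a small trainable branch injects wavelet-expert features into frozen backbone features. The rank, ReLU bottleneck, and zero-initialized gate are assumptions, not the published WEFT design.

```python
import torch
import torch.nn as nn

class ExpertGuidedAdapter(nn.Module):
    """Illustrative expert-guided conditional adapter: frozen features are
    enhanced by a trainable low-rank projection of wavelet-expert features."""
    def __init__(self, dim, rank=16):
        super().__init__()
        self.down = nn.Linear(dim, rank)
        self.up = nn.Linear(rank, dim)
        self.gate = nn.Parameter(torch.zeros(1))  # starts as identity mapping

    def forward(self, frozen_feat, wavelet_expert_feat):
        expert = self.up(torch.relu(self.down(wavelet_expert_feat)))
        return frozen_feat + self.gate * expert  # inject trainable task information
```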

Result: Outperforms 21 SOTA methods on three ORSIs datasets and achieves optimal results in camouflage, natural, and medical scenarios.

Conclusion: WEFT provides an efficient fine-tuning paradigm that enables practical use of large-scale foundation models for remote sensing segmentation with reduced computational costs while maintaining superior performance.

Abstract: Accurately localizing and segmenting relevant objects from optical remote sensing images (ORSIs) is critical for advancing remote sensing applications. Existing methods are typically built upon moderate-scale pre-trained models and employ diverse optimization strategies to achieve promising performance under full-parameter fine-tuning. In fact, deeper and larger-scale foundation models can provide stronger support for performance improvement. However, due to their massive number of parameters, directly adopting full-parameter fine-tuning leads to pronounced training difficulties, such as excessive GPU memory consumption and high computational costs, which result in extremely limited exploration of large-scale models in existing works. In this paper, we propose a novel dynamic wavelet expert-guided fine-tuning paradigm with fewer trainable parameters, dubbed WEFT, which efficiently adapts large-scale foundation models to ORSIs segmentation tasks by leveraging the guidance of wavelet experts. Specifically, we introduce a task-specific wavelet expert extractor to model wavelet experts from different perspectives and dynamically regulate their outputs, thereby generating trainable features enriched with task-specific information for subsequent fine-tuning. Furthermore, we construct an expert-guided conditional adapter that first enhances the fine-grained perception of frozen features for specific tasks by injecting trainable features, and then iteratively updates the information of both types of feature, allowing for efficient fine-tuning. Extensive experiments show that our WEFT not only outperforms 21 state-of-the-art (SOTA) methods on three ORSIs datasets, but also achieves optimal results in camouflage, natural, and medical scenarios. The source code is available at: https://github.com/CSYSI/WEFT.

[127] SAM-Aug: Leveraging SAM Priors for Few-Shot Parcel Segmentation in Satellite Time Series

Kai Hu, Yaozu Feng, Vladimir Lysenko, Ya Guo, Huayi Wu

Main category: cs.CV

TL;DR: SAM-Aug: A framework using Segment Anything Model (SAM) to generate geometry-aware mask priors for few-shot semantic segmentation of remote sensing images, improving performance without additional labeled data.

DetailsMotivation: Few-shot semantic segmentation of time-series remote sensing images is challenging due to scarce labeled data. Current models degrade significantly under limited labeling, limiting real-world applicability for land cover mapping.

Method: Constructs cloud-free composite images from temporal sequences, applies SAM in unsupervised manner to generate geometry-aware mask priors, integrates these priors via RegionSmoothLoss that enforces prediction consistency within SAM-derived regions across temporal frames.
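One plausible instantiation of a region-consistency regularizer in the spirit of RegionSmoothLoss is to penalize prediction variance inside each SAM-derived region; the variance formulation below is an assumed sketch, not the paper's exact loss.

```python
import torch

def region_smooth_loss(logits, region_ids):
    """Penalize prediction variance within each SAM-derived region.
    logits: (B, C, H, W) class scores; region_ids: (B, H, W) SAM region labels."""
    probs = logits.softmax(dim=1).permute(0, 2, 3, 1)  # (B, H, W, C)
    loss, regions = 0.0, region_ids.unique()
    for r in regions:
        region_probs = probs[region_ids == r]          # (N, C) pixels in region r
        if region_probs.shape[0] > 1:
            loss = loss + region_probs.var(dim=0).mean()  # low variance = smooth
    return loss / max(len(regions), 1)
```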

Result: Achieves mean test mIoU of 36.21% (5% labeled setting), outperforming SOTA by +2.33 percentage points (6.89% relative improvement). Best split reaches 40.28% mIoU (11.2% relative gain) with no additional labeled data.

Conclusion: Vision models like SAM can serve as useful regularizers in few-shot remote sensing learning, offering scalable plug-and-play solution for land cover monitoring without manual annotations or model fine-tuning.

Abstract: Few-shot semantic segmentation of time-series remote sensing images remains a critical challenge, particularly in regions where labeled data is scarce or costly to obtain. While state-of-the-art models perform well under full supervision, their performance degrades significantly under limited labeling, limiting their real-world applicability. In this work, we propose SAM-Aug, a new annotation-efficient framework that leverages the geometry-aware segmentation capability of the Segment Anything Model (SAM) to improve few-shot land cover mapping. Our approach constructs cloud-free composite images from temporal sequences and applies SAM in a fully unsupervised manner to generate geometry-aware mask priors. These priors are then integrated into training through a proposed loss function called RegionSmoothLoss, which enforces prediction consistency within each SAM-derived region across temporal frames, effectively regularizing the model to respect semantically coherent structures. Extensive experiments on the PASTIS-R benchmark under a 5 percent labeled setting demonstrate the effectiveness and robustness of SAM-Aug. Averaged over three random seeds (42, 2025, 4090), our method achieves a mean test mIoU of 36.21 percent, outperforming the state-of-the-art baseline by +2.33 percentage points, a relative improvement of 6.89 percent. Notably, on the most favorable split (seed=42), SAM-Aug reaches a test mIoU of 40.28 percent, representing an 11.2 percent relative gain with no additional labeled data. The consistent improvement across all seeds confirms the generalization power of leveraging foundation model priors under annotation scarcity. Our results highlight that vision models like SAM can serve as useful regularizers in few-shot remote sensing learning, offering a scalable and plug-and-play solution for land cover monitoring without requiring manual annotations or model fine-tuning.

[128] Towards Open Environments and Instructions: General Vision-Language Navigation via Fast-Slow Interactive Reasoning

Yang Li, Aming Wu, Zihao Zhang, Yahong Han

Main category: cs.CV

TL;DR: Proposes slow4fast-VLN framework for Vision-Language Navigation with fast-slow reasoning interaction to handle unseen environments and instructions in open-world settings.

DetailsMotivation: Traditional VLN assumes closed-set training/test data, but the real world contains diverse unseen environments, requiring generalized navigation ability for open-world adaptation.

Method: Dynamic interactive fast-slow reasoning framework: fast-reasoning module (end-to-end strategy network) executes actions and builds memory; slow-reasoning module analyzes memories, extracts generalization experiences, and optimizes fast module.
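The interaction pattern can be sketched as a loop; `fast_policy`, `slow_reasoner`, and `env` below are assumed pseudo-interfaces used only to show the control flow, not the paper's actual training procedure.

```python
def slow4fast_loop(fast_policy, slow_reasoner, env, episodes=10):
    """Sketch of fast-slow interaction: the fast module acts and logs records,
    the slow module periodically reflects and updates the fast module."""
    history = []
    for _ in range(episodes):
        obs, done = env.reset(), False
        while not done:
            action = fast_policy.act(obs)        # fast module: real-time action
            obs, done, record = env.step(action)
            history.append(record)               # accumulate execution records
        experiences = slow_reasoner.reflect(history)  # slow module: deep reflection
        fast_policy.update(experiences)               # optimize the fast module
```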

Result: Framework enables continuous adaptation to unseen scenarios through fast-slow interaction, unlike traditional independent fast-slow reasoning approaches.

Conclusion: Proposed slow4fast-VLN addresses General Scene Adaptation in VLN by enabling dynamic strategy generation through interactive fast-slow reasoning for open-world navigation.

Abstract: Vision-Language Navigation aims to enable agents to navigate to a target location based on language instructions. Traditional VLN often follows a closed-set assumption, i.e., training and test data share the same style of input images and instructions. However, the real world is open and filled with various unseen environments, posing enormous difficulties for closed-set methods. To this end, we focus on the General Scene Adaptation (GSA-VLN) task, aiming to learn generalized navigation ability by introducing diverse environments and inconsistent instructions. Towards this task, when facing unseen environments and instructions, the challenge mainly lies in how to enable the agent to dynamically produce generalized strategies during the navigation process. Recent research indicates that by means of fast and slow cognition systems, human beings can generate stable policies, which strengthens their adaptation to the open world. Inspired by this idea, we propose slow4fast-VLN, establishing a dynamic interactive fast-slow reasoning framework. The fast-reasoning module, an end-to-end strategy network, outputs actions from real-time input and accumulates execution records in a history repository to build memory. The slow-reasoning module analyzes the memories generated by the fast-reasoning module. Through deep reflection, it extracts experiences that enhance the generalization ability of decision-making. These experiences are structurally stored and used to continuously optimize the fast-reasoning module. Unlike traditional methods that treat fast-slow reasoning as independent mechanisms, our framework enables fast-slow interaction by leveraging the experiences from slow reasoning. This interaction allows the system to continuously adapt and efficiently execute navigation tasks when facing unseen scenarios.

[129] LP-LLM: End-to-End Real-World Degraded License Plate Text Recognition via Large Multimodal Models

Haoyan Gong, Hongbin Liu

Main category: cs.CV

TL;DR: Proposes an end-to-end structure-aware multimodal reasoning framework for license plate recognition that addresses misalignment between restoration and recognition objectives by introducing character-aware queries and residual modulation.

DetailsMotivation: Real-world LPR suffers from severe degradations (motion blur, low resolution, complex illumination). The traditional "restoration-then-recognition" approach has misalignment between pixel-level restoration objectives and semantic recognition goals, causing artifact interference and error accumulation. VLMs lack explicit structural modeling for license plate character sequences.

Method: End-to-end structure-aware multimodal reasoning framework based on Qwen3-VL with Character-Aware Multimodal Reasoning Module (CMRM). Uses learnable Character Slot Queries that retrieve fine-grained evidence from visual features via cross-attention, then injects character-aware representations back into visual tokens via residual modulation. Combines with LoRA parameter-efficient fine-tuning for domain adaptation while retaining generalization capabilities.
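The Character Slot Query idea maps naturally onto standard cross-attention. The sketch below is an assumed instantiation: slot count, head count, and the attention-weighted write-back rule are illustrative, not the exact published CMRM.

```python
import torch
import torch.nn as nn

class CharacterSlotModule(nn.Module):
    """Sketch of character-aware cross-attention with residual modulation:
    learnable slots gather per-position evidence, then write it back into
    the visual tokens."""
    def __init__(self, dim=768, num_slots=8, heads=8):
        super().__init__()
        self.slots = nn.Parameter(torch.randn(num_slots, dim))  # one query per character position
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.proj = nn.Linear(dim, dim)

    def forward(self, visual_tokens):  # (B, N, D)
        q = self.slots.unsqueeze(0).expand(visual_tokens.size(0), -1, -1)
        char_feats, attn_w = self.attn(q, visual_tokens, visual_tokens)
        # Residual modulation: route character-aware evidence back to tokens.
        update = attn_w.transpose(1, 2) @ self.proj(char_feats)  # (B, N, D)
        return visual_tokens + update
```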

Result: Extensive experiments on synthetic and real-world severely degraded datasets show the method significantly outperforms existing restoration-recognition combinations and general VLMs, validating superiority of incorporating structured reasoning into large models for low-quality text recognition.

Conclusion: The proposed framework successfully addresses the misalignment problem in traditional LPR pipelines by incorporating explicit structural reasoning into VLMs, demonstrating effectiveness for degraded text recognition tasks through character-aware multimodal reasoning.

Abstract: Real-world License Plate Recognition (LPR) faces significant challenges from severe degradations such as motion blur, low resolution, and complex illumination. The prevailing “restoration-then-recognition” two-stage paradigm suffers from a fundamental flaw: the pixel-level optimization objectives of image restoration models are misaligned with the semantic goals of character recognition, leading to artifact interference and error accumulation. While Vision-Language Models (VLMs) have demonstrated powerful general capabilities, they lack explicit structural modeling for license plate character sequences (e.g., fixed length, specific order). To address this, we propose an end-to-end structure-aware multimodal reasoning framework based on Qwen3-VL. The core innovation lies in the Character-Aware Multimodal Reasoning Module (CMRM), which introduces a set of learnable Character Slot Queries. Through a cross-attention mechanism, these queries actively retrieve fine-grained evidence corresponding to character positions from visual features. Subsequently, we inject these character-aware representations back into the visual tokens via residual modulation, enabling the language model to perform autoregressive generation based on explicit structural priors. Furthermore, combined with the LoRA parameter-efficient fine-tuning strategy, the model achieves domain adaptation while retaining the generalization capabilities of the large model. Extensive experiments on both synthetic and real-world severely degraded datasets demonstrate that our method significantly outperforms existing restoration-recognition combinations and general VLMs, validating the superiority of incorporating structured reasoning into large models for low-quality text recognition tasks.

[130] LPCAN: Lightweight Pyramid Cross-Attention Network for Rail Surface Defect Detection Using RGB-D Data

Jackie Alex, Guoqiang Huan

Main category: cs.CV

TL;DR: LPCANet: Lightweight Pyramid Cross-Attention Network for efficient RGB-D rail defect detection with state-of-the-art performance using only 9.90M parameters and 2.50 G FLOPs.

DetailsMotivation: Current vision-based rail defect detection methods suffer from high computational complexity, excessive parameters, and suboptimal accuracy, limiting practical deployment.

Method: Proposes LPCANet with MobileNetv2 backbone for RGB features, lightweight pyramid module for depth processing, cross-attention mechanism for multimodal fusion, and spatial feature extractor for enhanced structural analysis.

Result: Achieves SOTA on three RGB-D rail datasets with 9.90M parameters, 2.50 G FLOPs, 162.60 fps, and improvements of +1.48% in Sα, +0.86% in IOU, +1.77% in MAE over best baselines. Validates generalization on non-rail datasets.

Conclusion: LPCANet effectively bridges traditional and deep learning approaches for industrial defect inspection, offering practical value with future work focused on further compression for real-time deployment.

Abstract: This paper addresses the limitations of current vision-based rail defect detection methods, including high computational complexity, excessive parameter counts, and suboptimal accuracy. We propose a Lightweight Pyramid Cross-Attention Network (LPCANet) that leverages RGB-D data for efficient and accurate defect identification. The architecture integrates MobileNetv2 as a backbone for RGB feature extraction with a lightweight pyramid module (LPM) for depth processing, coupled with a cross-attention mechanism (CAM) for multimodal fusion and a spatial feature extractor (SFE) for enhanced structural analysis. Comprehensive evaluations on three unsupervised RGB-D rail datasets (NEU-RSDDS-AUG, RSDD-TYPE1, RSDD-TYPE2) demonstrate that LPCANet achieves state-of-the-art performance with only 9.90 million parameters, 2.50 G FLOPs, and 162.60 fps inference speed. Compared to 18 existing methods, LPCANet shows significant improvements, including +1.48% in $S_\alpha$, +0.86% in IOU, and +1.77% in MAE over the best-performing baseline. Ablation studies confirm the critical roles of CAM and SFE, while experiments on non-rail datasets (DAGM2007, MT, Kolektor-SDD2) validate its generalization capability. The proposed framework effectively bridges traditional and deep learning approaches, offering substantial practical value for industrial defect inspection. Future work will focus on further model compression for real-time deployment.

[131] Beyond Seen Bounds: Class-Centric Polarization for Single-Domain Generalized Deep Metric Learning

Xin Yuan, Meiqi Wan, Wei Liu, Xin Xu, Zheng Wang

Main category: cs.CV

TL;DR: CenterPolar is a novel SDG-DML framework that uses class-centric polarization with centrifugal expansion and centripetal constraint to improve generalization to unseen categories and domains.

DetailsMotivation: Single-domain generalized deep metric learning faces challenges with both category and domain shifts during testing, limiting real-world applications. Existing methods use proxy-based expansion that generates samples clustered near class proxies, failing to simulate broad domain shifts.

Method: CenterPolar uses two collaborative class-centric polarization phases: 1) Class-Centric Centrifugal Expansion (C³E) shifts source data away from class centroids to generalize to unseen domains, and 2) Class-Centric Centripetal Constraint (C⁴) pulls all samples toward class centroids while enforcing inter-class separation for domain-invariant class information.
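The two polarization phases have a simple geometric reading, sketched below under assumed formulations: the expansion step size and the margin-based centroid separation are illustrative choices, not the paper's exact losses.

```python
import torch
import torch.nn.functional as F

def centrifugal_expand(feats, centroids, labels, step=0.5):
    """C3E-style sketch: push source features away from their class centroids
    to simulate unseen-domain shifts (step size assumed)."""
    direction = F.normalize(feats - centroids[labels], dim=1)
    return feats + step * direction

def centripetal_loss(feats, centroids, labels, margin=1.0):
    """C4-style sketch: pull samples toward their class centroids while
    keeping different centroids apart (margin formulation assumed)."""
    pull = (feats - centroids[labels]).pow(2).sum(dim=1).mean()
    dist = torch.cdist(centroids, centroids)
    push = F.relu(margin - dist).triu(diagonal=1).mean()  # separate nearby classes
    return pull + push
```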

Result: Extensive experiments on CUB-200-2011 Ext., Cars196 Ext., DomainNet, PACS, and Office-Home datasets demonstrate superiority over state-of-the-art methods.

Conclusion: CenterPolar effectively addresses SDG-DML challenges by dynamically expanding and constraining domain distributions to learn a generalizable model for wider target domain distributions.

Abstract: Single-domain generalized deep metric learning (SDG-DML) faces the dual challenge of both category and domain shifts during testing, limiting real-world applications. Therefore, aiming to learn better generalization ability on both unseen categories and domains is a realistic goal for the SDG-DML task. To this end, existing SDG-DML methods employ the domain expansion-equalization strategy to expand the source data and generate out-of-distribution samples. However, these methods rely on proxy-based expansion, which tends to generate samples clustered near class proxies, failing to simulate the broad and distant domain shifts encountered in practice. To alleviate the problem, we propose CenterPolar, a novel SDG-DML framework that dynamically expands and constrains domain distributions to learn a generalizable DML model for wider target domain distributions. Specifically, CenterPolar contains two collaborative class-centric polarization phases: (1) Class-Centric Centrifugal Expansion ($C^3E$) and (2) Class-Centric Centripetal Constraint ($C^4$). In the first phase, $C^3E$ drives the source domain distribution by shifting the source data away from class centroids using centrifugal expansion to generalize to more unseen domains. In the second phase, to consolidate domain-invariant class information for the generalization ability to unseen categories, $C^4$ pulls all seen and unseen samples toward their class centroids while enforcing inter-class separation via centripetal constraint. Extensive experimental results on widely used CUB-200-2011 Ext., Cars196 Ext., DomainNet, PACS, and Office-Home datasets demonstrate the superiority and effectiveness of our CenterPolar over existing state-of-the-art methods. The code will be released after acceptance.

[132] SkinFlow: Efficient Information Transmission for Open Dermatological Diagnosis via Dynamic Visual Encoding and Staged RL

Lijun Liu, Linwei Chen, Zhishou Zhang, Meng Tian, Hengfu Cui, Ruiyang Li, Zhaocheng Liu, Qiang Ju, Qianxi Li, Hong-Yu Zhou

Main category: cs.CV

TL;DR: SkinFlow introduces a framework that optimizes visual information flow for dermatology diagnosis using a Virtual-Width Dynamic Vision Encoder and two-stage RL, achieving state-of-the-art results with a 7B model despite being much smaller than general-purpose LVLMs.

DetailsMotivation: General-purpose Large Vision-Language Models (LVLMs) struggle in dermatology due to "diffuse attention" - they can't separate subtle pathological lesions from background noise. The paper challenges the assumption that parameter scaling is the only path to medical precision.

Method: SkinFlow treats diagnosis as optimization of visual information transmission efficiency. It uses a Virtual-Width Dynamic Vision Encoder (DVE) to “unfold” complex pathological manifolds without physical parameter expansion, coupled with a two-stage Reinforcement Learning strategy that sequentially aligns explicit medical descriptions (Stage I) and reconstructs implicit diagnostic textures (Stage II) within a constrained semantic space.

Result: The 7B model establishes new state-of-the-art on Fitzpatrick17k benchmark: +12.06% gain in Top-1 accuracy and +28.57% boost in Top-6 accuracy over massive general-purpose models (e.g., Qwen3VL-235B and GPT-5.2).

Conclusion: Optimizing geometric capacity and information flow yields superior diagnostic reasoning compared to raw parameter scaling. The work demonstrates that specialized architectural design and information flow optimization can outperform much larger general-purpose models in medical domains.

Abstract: General-purpose Large Vision-Language Models (LVLMs), despite their massive scale, often falter in dermatology due to “diffuse attention” - the inability to disentangle subtle pathological lesions from background noise. In this paper, we challenge the assumption that parameter scaling is the only path to medical precision. We introduce SkinFlow, a framework that treats diagnosis as an optimization of visual information transmission efficiency. Our approach utilizes a Virtual-Width Dynamic Vision Encoder (DVE) to “unfold” complex pathological manifolds without physical parameter expansion, coupled with a two-stage Reinforcement Learning strategy. This strategy sequentially aligns explicit medical descriptions (Stage I) and reconstructs implicit diagnostic textures (Stage II) within a constrained semantic space. Furthermore, we propose a clinically grounded evaluation protocol that prioritizes diagnostic safety and hierarchical relevance over rigid label matching. Empirical results are compelling: our 7B model establishes a new state-of-the-art on the Fitzpatrick17k benchmark, achieving a +12.06% gain in Top-1 accuracy and a +28.57% boost in Top-6 accuracy over the massive general-purpose models (e.g., Qwen3VL-235B and GPT-5.2). These findings demonstrate that optimizing geometric capacity and information flow yields superior diagnostic reasoning compared to raw parameter scaling.

[133] SSVP: Synergistic Semantic-Visual Prompting for Industrial Zero-Shot Anomaly Detection

Chenhao Fu, Han Fang, Xiuzheng Zheng, Wenbo Wei, Yonghua Li, Hao Sun, Xuelong Li

Main category: cs.CV

TL;DR: SSVP fuses diverse visual encodings with hierarchical semantic-visual synergy and vision-conditioned prompt generation for zero-shot anomaly detection, achieving SOTA performance on industrial benchmarks.

DetailsMotivation: Existing zero-shot anomaly detection methods rely on single visual backbones that struggle to balance global semantic generalization with fine-grained structural discriminability needed for industrial inspection.

Method: Proposes Synergistic Semantic-Visual Prompting (SSVP) with: 1) Hierarchical Semantic-Visual Synergy (HSVS) integrating DINOv3’s multi-scale structural priors into CLIP semantic space, 2) Vision-Conditioned Prompt Generator (VCPG) using cross-modal attention for dynamic prompt generation, and 3) Visual-Text Anomaly Mapper (VTAM) with dual-gated calibration to address global scoring vs local evidence discrepancy.

Result: Achieves state-of-the-art performance with 93.0% Image-AUROC and 92.2% Pixel-AUROC on MVTec-AD, validated across seven industrial benchmarks, significantly outperforming existing zero-shot approaches.

Conclusion: SSVP effectively bridges the gap between global semantic generalization and fine-grained structural discriminability in zero-shot anomaly detection through synergistic fusion of diverse visual encodings and vision-conditioned prompt generation.

Abstract: Zero-Shot Anomaly Detection (ZSAD) leverages Vision-Language Models (VLMs) to enable supervision-free industrial inspection. However, existing ZSAD paradigms are constrained by single visual backbones, which struggle to balance global semantic generalization with fine-grained structural discriminability. To bridge this gap, we propose Synergistic Semantic-Visual Prompting (SSVP), which efficiently fuses diverse visual encodings to elevate the model's fine-grained perception. Specifically, SSVP introduces the Hierarchical Semantic-Visual Synergy (HSVS) mechanism, which deeply integrates DINOv3's multi-scale structural priors into the CLIP semantic space. Subsequently, the Vision-Conditioned Prompt Generator (VCPG) employs cross-modal attention to guide dynamic prompt generation, enabling linguistic queries to precisely anchor to specific anomaly patterns. Furthermore, to address the discrepancy between global scoring and local evidence, the Visual-Text Anomaly Mapper (VTAM) establishes a dual-gated calibration paradigm. Extensive evaluations on seven industrial benchmarks validate the robustness of our method; SSVP achieves state-of-the-art performance with 93.0% Image-AUROC and 92.2% Pixel-AUROC on MVTec-AD, significantly outperforming existing zero-shot approaches.

[134] From Snow to Rain: Evaluating Robustness, Calibration, and Complexity of Model-Based Robust Training

Josué Martínez-Martínez, Olivia Brown, Giselle Zeno, Pooya Khorrami, Rajmonda Caceres

Main category: cs.CV

TL;DR: Model-based training with learned nuisance variation models outperforms baselines for traffic sign recognition under snow/rain corruptions, with model-based adversarial training providing strongest robustness but higher computation, while model-based data augmentation offers comparable robustness with less complexity.

DetailsMotivation: Robustness to natural corruptions is critical for reliable deep learning in safety-sensitive domains like autonomous driving, where models must handle challenging environmental conditions like snow and rain.

Method: Family of model-based training approaches using learned nuisance variation models to generate realistic corruptions, plus hybrid strategies combining random coverage with adversarial refinement in nuisance space. Evaluated on CURE-TSR dataset with Snow and Rain corruptions.

Result: Model-based methods consistently outperform Vanilla, Adversarial Training, and AugMix baselines. Model-based adversarial training provides strongest robustness across all corruptions but with higher computation. Model-based data augmentation achieves comparable robustness with significantly less computational complexity without statistically significant performance drop.

Conclusion: Learned nuisance models are important for capturing natural variability, offering a promising path toward more resilient and calibrated models under challenging conditions, with trade-offs between robustness and computational efficiency.

Abstract: Robustness to natural corruptions remains a critical challenge for reliable deep learning, particularly in safety-sensitive domains. We study a family of model-based training approaches that leverage a learned nuisance variation model to generate realistic corruptions, as well as new hybrid strategies that combine random coverage with adversarial refinement in nuisance space. Using the Challenging Unreal and Real Environments for Traffic Sign Recognition dataset (CURE-TSR), with Snow and Rain corruptions, we evaluate accuracy, calibration, and training complexity across corruption severities. Our results show that model-based methods consistently outperform the Vanilla, Adversarial Training, and AugMix baselines, with model-based adversarial training providing the strongest robustness across all corruptions but at the expense of higher computation, and model-based data augmentation achieving comparable robustness with a factor of $T$ less computational complexity without incurring a statistically significant drop in performance. These findings highlight the importance of learned nuisance models for capturing natural variability, and suggest a promising path toward more resilient and calibrated models under challenging conditions.

[135] Architecture inside the mirage: evaluating generative image models on architectural style, elements, and typologies

Jamie Magrill, Leah Gornstein, Sandra Seekins, Barry Magrill

Main category: cs.CV

TL;DR: Study evaluates 5 GenAI image platforms on architectural accuracy using 30 prompts, finding limited overall accuracy (42% mean) with significant variation between common vs rare prompts and platform performance.

DetailsMotivation: To characterize the capacity of generative AI text-to-image systems to reproduce accurate architectural imagery in a historically rule-bound field, as their increasing use in architecture raises concerns about accuracy.

Method: Evaluated 5 GenAI platforms (Adobe Firefly, DALL-E 3, Google Imagen 3, Microsoft Image Generator, Midjourney) using 30 architectural prompts spanning styles, typologies, and codified elements. Each prompt-generator pair produced 4 images (600 total). Two architectural historians independently scored images for accuracy against predefined criteria, resolving disagreements by consensus.

Result: Overall accuracy was limited (highest 52%, lowest 32%, mean 42%). Common prompts were 2.7x more accurate than rare prompts. All-correct outcomes were similar across platforms, but all-incorrect outcomes varied substantially (Imagen 3 had fewest failures, Microsoft Image Generator had most). Qualitative analysis identified patterns: over-embellishment, confusion between medieval styles and revivals, misrepresentation of descriptive prompts.

Conclusion: Findings support need for visible labeling of GenAI synthetic content, provenance standards for training datasets, and cautious educational use of GenAI architectural imagery due to limited accuracy and systematic errors.

Abstract: Generative artificial intelligence (GenAI) text-to-image systems are increasingly used to generate architectural imagery, yet their capacity to reproduce accurate images in a historically rule-bound field remains poorly characterized. We evaluated five widely used GenAI image platforms (Adobe Firefly, DALL-E 3, Google Imagen 3, Microsoft Image Generator, and Midjourney) using 30 architectural prompts spanning styles, typologies, and codified elements. Each prompt-generator pair produced four images (n = 600 images total). Two architectural historians independently scored each image for accuracy against predefined criteria, resolving disagreements by consensus. Set-level performance was summarized as zero to four accurate images per four-image set. Image output from Common prompts was 2.7-fold more accurate than from Rare prompts (p < 0.05). Across platforms, overall accuracy was limited (highest accuracy score 52 percent; lowest 32 percent; mean 42 percent). All-correct (4 out of 4) outcomes were similar across platforms. By contrast, all-incorrect (0 out of 4) outcomes varied substantially, with Imagen 3 exhibiting the fewest failures and Microsoft Image Generator exhibiting the highest number of failures. Qualitative review of the image dataset identified recurring patterns including over-embellishment, confusion between medieval styles and their later revivals, and misrepresentation of descriptive prompts (for example, egg-and-dart, banded column, pendentive). These findings support the need for visible labeling of GenAI synthetic content, provenance standards for future training datasets, and cautious educational use of GenAI architectural imagery.

[136] N-EIoU-YOLOv9: A Signal-Aware Bounding Box Regression Loss for Lightweight Mobile Detection of Rice Leaf Diseases

Dung Ta Nguyen Duc, Thanh Bui Dang, Hoang Le Minh, Tung Nguyen Viet, Huong Nguyen Thanh, Dong Trinh Cong

Main category: cs.CV

TL;DR: N-EIoU-YOLOv9: A lightweight detection framework using a novel bounding box regression loss (N-EIoU) that combines non-monotonic gradient focusing with geometric decoupling for better small/low-contrast target detection in agricultural disease imagery.

DetailsMotivation: Agricultural disease detection often involves small and low-contrast targets that are challenging for standard detection methods. Existing bounding box regression losses like CIoU may not effectively handle weak regression signals from hard samples with low overlap, leading to suboptimal performance for agricultural monitoring applications.

Method: Proposed N-EIoU (Non-monotonic Efficient Intersection over Union) loss that reshapes localization gradients by combining non-monotonic focusing with decoupled width and height optimization. This enhances weak regression signals for hard samples while reducing gradient interference. Integrated into lightweight YOLOv9t architecture and evaluated on a custom rice leaf disease dataset.
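A hedged sketch of such a loss is shown below: it combines EIoU's decoupled width/height terms with a WIoU-style non-monotonic focusing factor. The paper's exact N-EIoU formulation and hyperparameters may differ.

```python
import torch

def n_eiou_style_loss(iou, d_center2, c_diag2, d_w2, c_w2, d_h2, c_h2,
                      alpha=1.9, delta=3.0):
    """EIoU body with a WIoU-style non-monotonic focusing factor (sketch).
    All distance arguments are squared box-geometry terms per anchor."""
    # EIoU terms: IoU penalty + center distance + decoupled width/height.
    eiou = (1.0 - iou) + d_center2 / c_diag2 + d_w2 / c_w2 + d_h2 / c_h2
    # Non-monotonic focusing: emphasize medium-quality boxes, damping gradients
    # from both trivial and extremely hard (low-overlap) samples.
    beta = (1.0 - iou).detach() / (1.0 - iou).mean().detach().clamp(min=1e-7)
    focus = beta / (delta * alpha ** (beta - delta))
    return (focus * eiou).mean()
```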

Result: Achieved 90.3% mAP on rice leaf disease dataset, representing 4.3% improvement over CIoU baseline. Demonstrated improved localization accuracy under stricter evaluation criteria. Deployed on Android with TensorFlow Lite Float16 quantization, achieving 156ms average inference time per frame while maintaining accuracy.

Conclusion: N EIoU YOLOv9 effectively balances accuracy, optimization stability, and computational efficiency for edge-based agricultural monitoring systems, making it suitable for practical deployment on mobile devices for real-time disease detection.

Abstract: In this work, we propose N-EIoU-YOLOv9, a lightweight detection framework based on a signal-aware bounding box regression loss derived from non-monotonic gradient focusing and geometric decoupling principles, referred to as N-EIoU (Non-monotonic Efficient Intersection over Union). The proposed loss reshapes localization gradients by combining non-monotonic focusing with decoupled width and height optimization, thereby enhancing weak regression signals for hard samples with low overlap while reducing gradient interference. This design is particularly effective for small and low-contrast targets commonly observed in agricultural disease imagery. The proposed N-EIoU loss is integrated into a lightweight YOLOv9t architecture and evaluated on a self-collected field dataset comprising 5908 rice leaf images across four disease categories and healthy leaves. Experimental results demonstrate consistent performance gains over the standard CIoU loss, achieving a mean Average Precision of 90.3 percent, corresponding to a 4.3 percent improvement over the baseline, with improved localization accuracy under stricter evaluation criteria. For practical validation, the optimized model is deployed on an Android device using TensorFlow Lite with Float16 quantization, achieving an average inference time of 156 milliseconds per frame while maintaining accuracy. These results confirm that the proposed approach effectively balances accuracy, optimization stability, and computational efficiency for edge-based agricultural monitoring systems.

[137] From Performance to Practice: Knowledge-Distilled Segmentator for On-Premises Clinical Workflows

Qizhen Lan, Aaron Choi, Jun Ma, Bo Wang, Zhaogming Zhao, Xiaoqian Jiang, Yu-Chun Hsu

Main category: cs.CV

TL;DR: Knowledge distillation framework converts large medical image segmentation models into compact, deployment-ready versions for on-premises clinical workflows while maintaining accuracy.

DetailsMotivation: Clinical deployment faces constraints: fixed on-premises infrastructure, governance/security restrictions on cloud inference, and computational demands of high-capacity models hinder practical deployment and long-term maintainability in hospital environments.

Method: Deployment-oriented framework using knowledge distillation to translate high-performing segmentation models into scalable family of compact student models without modifying inference pipeline. Preserves architectural compatibility while enabling systematic capacity reduction.
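For reference, the standard soft-label distillation objective underlying such frameworks is sketched below; the paper's exact loss for 3D segmentation may differ (e.g., adding Dice or feature-level terms), so treat this as a generic illustration.

```python
import torch
import torch.nn.functional as F

def distill_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Generic knowledge-distillation objective: match the teacher's softened
    class distribution plus a supervised term on ground-truth labels."""
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=1),
        F.softmax(teacher_logits / T, dim=1),
        reduction="batchmean",
    ) * (T * T)  # T^2 keeps gradient magnitudes comparable across temperatures
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard
```

The $T^2$ factor is the usual Hinton-style correction so the soft term's gradient scale does not vanish as the temperature grows.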

Result: Evaluated on multi-site brain MRI (1,104 3D volumes) with independent testing on 101 curated cases, plus abdominal CT for cross-modality generalizability. Under 94% parameter reduction, distilled student preserves 98.7% of teacher’s accuracy, achieves 67% reduction in CPU inference latency without additional deployment overhead.

Conclusion: Knowledge distillation provides practical and reliable pathway for converting research-grade segmentation models into maintainable, deployment-ready components for on-premises clinical workflows in real-world health systems.

Abstract: Deploying medical image segmentation models in routine clinical workflows is often constrained by on-premises infrastructure, where computational resources are fixed and cloud-based inference may be restricted by governance and security policies. While high-capacity models achieve strong segmentation accuracy, their computational demands hinder practical deployment and long-term maintainability in hospital environments. We present a deployment-oriented framework that leverages knowledge distillation to translate a high-performing segmentation model into a scalable family of compact student models, without modifying the inference pipeline. The proposed approach preserves architectural compatibility with existing clinical systems while enabling systematic capacity reduction. The framework is evaluated on a multi-site brain MRI dataset comprising 1,104 3D volumes, with independent testing on 101 curated cases, and is further examined on abdominal CT to assess cross-modality generalizability. Under aggressive parameter reduction (94%), the distilled student model preserves nearly all of the teacher’s segmentation accuracy (98.7%), while achieving substantial efficiency gains, including up to a 67% reduction in CPU inference latency without additional deployment overhead. These results demonstrate that knowledge distillation provides a practical and reliable pathway for converting research-grade segmentation models into maintainable, deployment-ready components for on-premises clinical workflows in real-world health systems.

[138] Point Tracking as a Temporal Cue for Robust Myocardial Segmentation in Echocardiography Videos

Bahar Khodabakhshian, Nima Hashemi, Armin Saadat, Zahra Gholami, In-Chang Hwang, Samira Sojoudi, Christina Luong, Purang Abolmaesumi, Teresa Tsang

Main category: cs.CV

TL;DR: Point-Seg uses point tracking as temporal cue for myocardium segmentation in echo videos, improving consistency without memory-based feature accumulation.

DetailsMotivation: Myocardium segmentation in echocardiography is challenging due to low contrast, noise, and anatomical variability. Traditional deep learning models either ignore temporal information or rely on error-prone memory-based feature propagation.

Method: Transformer-based framework with point tracking module trained on synthetic echo data to track anatomical landmarks. Uses tracked trajectories as motion-aware signal to guide segmentation, plus temporal smoothing loss for consistency.
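One way to realize a temporal smoothing term along tracked points is sketched below; the L1 penalty over consecutive frames is an assumed instantiation, not necessarily the paper's exact loss.

```python
import torch

def temporal_smoothing_loss(masks, tracks):
    """Penalize flicker of predicted probabilities along tracked trajectories.
    masks: (T, H, W) myocardium probabilities; tracks: (T, K, 2) integer
    pixel coordinates (x, y) of K tracked landmarks per frame."""
    loss = 0.0
    for t in range(masks.shape[0] - 1):
        p_t  = masks[t,     tracks[t, :, 1],     tracks[t, :, 0]]
        p_t1 = masks[t + 1, tracks[t + 1, :, 1], tracks[t + 1, :, 0]]
        loss = loss + (p_t - p_t1).abs().mean()  # consecutive frames should agree
    return loss / max(masks.shape[0] - 1, 1)
```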

Result: Statistically similar Dice accuracy to SOTA in high quality echo data, better accuracy in lower quality echo with improved temporal stability. Provides pixel-level myocardium motion information useful for downstream tasks like strain measurement.

Conclusion: Point tracking serves as effective temporal cue for consistent video segmentation, offering reliable and generalizable approach for myocardium segmentation in echocardiography videos.

Abstract: Purpose: Myocardium segmentation in echocardiography videos is a challenging task due to low contrast, noise, and anatomical variability. Traditional deep learning models either process frames independently, ignoring temporal information, or rely on memory-based feature propagation, which accumulates error over time. Methods: We propose Point-Seg, a transformer-based segmentation framework that integrates point tracking as a temporal cue to ensure stable and consistent segmentation of myocardium across frames. Our method leverages a point-tracking module trained on a synthetic echocardiography dataset to track key anatomical landmarks across video sequences. These tracked trajectories provide an explicit motion-aware signal that guides segmentation, reducing drift and eliminating the need for memory-based feature accumulation. Additionally, we incorporate a temporal smoothing loss to further enhance temporal consistency across frames. Results: We evaluate our approach on both public and private echocardiography datasets. Experimental results demonstrate that Point-Seg has statistically similar accuracy in terms of Dice to state-of-the-art segmentation models in high quality echo data, while it achieves better segmentation accuracy in lower quality echo with improved temporal stability. Furthermore, Point-Seg has the key advantage of pixel-level myocardium motion information as opposed to other segmentation methods. Such information is essential in the computation of other downstream tasks such as myocardial strain measurement and regional wall motion abnormality detection. Conclusion: Point-Seg demonstrates that point tracking can serve as an effective temporal cue for consistent video segmentation, offering a reliable and generalizable approach for myocardium segmentation in echocardiography videos. The code is available at https://github.com/DeepRCL/PointSeg.

[139] Pairing-free Group-level Knowledge Distillation for Robust Gastrointestinal Lesion Classification in White-Light Endoscopy

Qiang Hu, Qimei Wang, Yingjie Guo, Qiang Li, Zhiwei Wang

Main category: cs.CV

TL;DR: PaGKD enables cross-modal knowledge distillation from NBI to WLI for endoscopic cancer screening using unpaired data, eliminating the need for costly paired images.

DetailsMotivation: NBI provides superior diagnostic details for endoscopic cancer screening compared to standard WLI, but existing knowledge transfer methods require paired NBI-WLI images which are costly and impractical, leaving most clinical data unused.

Method: PaGKD uses group-level knowledge distillation with two modules: GKD-Pro for modality-invariant semantic prototypes via shared lesion-aware queries, and GKD-Den for dense cross-modal alignment using group-aware attention with activation-derived relation maps.

Result: PaGKD consistently outperforms state-of-the-art methods across four clinical datasets with relative AUC improvements of 3.3%, 1.1%, 2.8%, and 3.2%, establishing new performance benchmarks.

Conclusion: PaGKD breaks the paired-data paradigm by enabling effective cross-modal learning from unpaired WLI and NBI data, offering a practical solution for enhancing WLI-only models with NBI knowledge without requiring image-level correspondence.

Abstract: White-Light Imaging (WLI) is the standard for endoscopic cancer screening, but Narrow-Band Imaging (NBI) offers superior diagnostic details. A key challenge is transferring knowledge from NBI to enhance WLI-only models, yet existing methods are critically hampered by their reliance on paired NBI-WLI images of the same lesion, a costly and often impractical requirement that leaves vast amounts of clinical data untapped. In this paper, we break this paradigm by introducing PaGKD, a novel Pairing-free Group-level Knowledge Distillation framework that enables effective cross-modal learning using unpaired WLI and NBI data. Instead of forcing alignment between individual, often semantically mismatched image instances, PaGKD operates at the group level to distill more complete and compatible knowledge across modalities. Central to PaGKD are two complementary modules: (1) Group-level Prototype Distillation (GKD-Pro) distills compact group representations by extracting modality-invariant semantic prototypes via shared lesion-aware queries; (2) Group-level Dense Distillation (GKD-Den) performs dense cross-modal alignment by guiding group-aware attention with activation-derived relation maps. Together, these modules enforce global semantic consistency and local structural coherence without requiring image-level correspondence. Extensive experiments on four clinical datasets demonstrate that PaGKD consistently and significantly outperforms state-of-the-art methods, achieving relative AUC improvements of 3.3%, 1.1%, 2.8%, and 3.2%, respectively, establishing a new direction for cross-modal learning from unpaired data.

[140] Affostruction: 3D Affordance Grounding with Generative Reconstruction

Chunghyun Park, Seunghyeon Lee, Minsu Cho

Main category: cs.CV

TL;DR: Affostruction: A generative framework for 3D affordance grounding that reconstructs complete object geometry from partial RGBD views and predicts affordances on both visible and unobserved regions.

DetailsMotivation: Existing affordance grounding methods only predict on visible surfaces, missing affordances on unobserved regions. This limits practical applications where complete object understanding is needed for robotic manipulation.

Method: Three key components: 1) Generative multi-view reconstruction via sparse voxel fusion for complete geometry, 2) Flow-based affordance grounding to handle ambiguity in affordance distributions, 3) Affordance-driven active view selection for intelligent viewpoint sampling.

Result: Achieves 19.1 aIoU on affordance grounding (40.4% improvement) and 32.67 IoU for 3D reconstruction (67.7% improvement), enabling accurate affordance prediction on complete shapes.

Conclusion: Affostruction successfully addresses the limitation of existing methods by combining generative 3D reconstruction with affordance grounding, enabling comprehensive affordance prediction on both visible and unobserved object regions.

Abstract: This paper addresses the problem of affordance grounding from RGBD images of an object, which aims to localize surface regions corresponding to a text query that describes an action on the object. While existing methods predict affordance regions only on visible surfaces, we propose Affostruction, a generative framework that reconstructs complete geometry from partial observations and grounds affordances on the full shape including unobserved regions. We make three core contributions: generative multi-view reconstruction via sparse voxel fusion that extrapolates unseen geometry while maintaining constant token complexity, flow-based affordance grounding that captures inherent ambiguity in affordance distributions, and affordance-driven active view selection that leverages predicted affordances for intelligent viewpoint sampling. Affostruction achieves 19.1 aIoU on affordance grounding (40.4% improvement) and 32.67 IoU for 3D reconstruction (67.7% improvement), enabling accurate affordance prediction on complete shapes.

[141] Annealed Relaxation of Speculative Decoding for Faster Autoregressive Image Generation

Xingyao Li, Fengzhuo Zhang, Cunxiao Du, Hui Ji

Main category: cs.CV

TL;DR: COOL-SD: An annealed relaxation of speculative decoding for faster autoregressive image generation with theoretical grounding.

DetailsMotivation: Autoregressive image generation is slow due to sequential token generation and token ambiguity. Existing relaxed speculative decoding methods lack theoretical foundation.

Method: COOL-SD uses two key theoretical insights: 1) Optimal resampling distribution minimizing TV distance upper bound between target and relaxed models, 2) Perturbation analysis revealing annealing behavior, leading to annealed relaxation design.

Result: COOL-SD generates images faster with comparable quality, or achieves better quality at similar latency. Experiments show consistent improvements in speed-quality trade-offs over prior methods.

Conclusion: The paper establishes theoretical basis for relaxed speculative decoding and demonstrates COOL-SD’s effectiveness in improving autoregressive image generation efficiency.

Abstract: Despite significant progress in autoregressive image generation, inference remains slow due to the sequential nature of AR models and the ambiguity of image tokens, even when using speculative decoding. Recent works attempt to address this with relaxed speculative decoding but lack theoretical grounding. In this paper, we establish the theoretical basis of relaxed SD and propose COOL-SD, an annealed relaxation of speculative decoding built on two key insights. The first analyzes the total variation (TV) distance between the target model and relaxed speculative decoding and yields an optimal resampling distribution that minimizes an upper bound of the distance. The second uses perturbation analysis to reveal an annealing behaviour in relaxed speculative decoding, motivating our annealed design. Together, these insights enable COOL-SD to generate images faster with comparable quality, or achieve better quality at similar latency. Experiments validate the effectiveness of COOL-SD, showing consistent improvements over prior methods in speed-quality trade-offs.

[142] SpikeVAEDiff: Neural Spike-based Natural Visual Scene Reconstruction via VD-VAE and Versatile Diffusion

Jialu Li, Taiyan Zhou

Main category: cs.CV

TL;DR: SpikeVAEDiff is a two-stage framework combining VDVAE and Versatile Diffusion to reconstruct high-resolution images from neural spike data, showing VISI region is most important for reconstruction quality.

DetailsMotivation: Reconstructing natural visual scenes from neural activity is a key challenge in neuroscience and computer vision. Current approaches often use fMRI data, but spike data offers superior temporal and spatial resolution for better reconstruction quality.

Method: Two-stage framework: 1) VDVAE produces low-resolution preliminary reconstructions from neural spike signals to latent representations; 2) Regression models map spike signals to CLIP-Vision and CLIP-Text features, enabling Versatile Diffusion to refine images via image-to-image generation.
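The second-stage regression is unspecified in the summary; a common choice in neural decoding work is ridge regression, sketched here with dummy arrays purely for illustration (dimensions and the regressor itself are assumptions).

```python
import numpy as np
from sklearn.linear_model import Ridge

# Dummy data standing in for spike-count features and CLIP-Vision targets.
spikes = np.random.randn(1000, 512)        # (trials, spike features)
clip_targets = np.random.randn(1000, 768)  # (trials, CLIP embedding dim)

# Fit a linear map from spikes to CLIP space; predictions then condition
# the Versatile Diffusion image-to-image refinement stage.
reg = Ridge(alpha=10.0).fit(spikes, clip_targets)
predicted_clip = reg.predict(spikes[:8])
```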

Result: Evaluation on Allen Visual Coding-Neuropixels dataset shows VISI region exhibits most prominent activation and plays key role in reconstruction quality. Ablation studies demonstrate data from specific brain regions significantly enhances reconstruction performance compared to fMRI-based approaches.

Conclusion: SpikeVAEDiff successfully reconstructs high-resolution, semantically meaningful images from neural spike data, with VISI region being most critical. The framework demonstrates advantages of spike data over fMRI for neural decoding tasks, though challenges remain as shown by both successful and unsuccessful reconstruction examples.

Abstract: Reconstructing natural visual scenes from neural activity is a key challenge in neuroscience and computer vision. We present SpikeVAEDiff, a novel two-stage framework that combines a Very Deep Variational Autoencoder (VDVAE) and the Versatile Diffusion model to generate high-resolution and semantically meaningful image reconstructions from neural spike data. In the first stage, VDVAE produces low-resolution preliminary reconstructions by mapping neural spike signals to latent representations. In the second stage, regression models map neural spike signals to CLIP-Vision and CLIP-Text features, enabling Versatile Diffusion to refine the images via image-to-image generation. We evaluate our approach on the Allen Visual Coding-Neuropixels dataset and analyze different brain regions. Our results show that the VISI region exhibits the most prominent activation and plays a key role in reconstruction quality. We present both successful and unsuccessful reconstruction examples, reflecting the challenges of decoding neural activity. Compared with fMRI-based approaches, spike data provides superior temporal and spatial resolution. We further validate the effectiveness of the VDVAE model and conduct ablation studies demonstrating that data from specific brain regions significantly enhances reconstruction performance.

[143] Disentangle Object and Non-object Infrared Features via Language Guidance

Fan Liu, Ting Wu, Chuanyi Zhang, Liang Yao, Xing Ma, Yuhui Zheng

Main category: cs.CV

TL;DR: Vision-language representation learning for infrared object detection using textual supervision to disentangle object features from noisy backgrounds.

DetailsMotivation: Infrared object detection faces challenges due to low contrast and weak edge information in infrared images, making it difficult to extract discriminative features for robust detection in complex environments where visible imaging fails.

Method: Proposes a vision-language representation learning paradigm with two key modules: 1) Semantic Feature Alignment (SFA) to align object features with corresponding text features, and 2) Object Feature Disentanglement (OFD) to separate text-aligned object features from non-object features by minimizing their correlation.
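
A minimal sketch of the two objectives, assuming pooled feature vectors per image; the exact losses are not given in the summary, so cosine alignment and squared cross-correlation are assumptions:

```python
import torch
import torch.nn.functional as F

def sfa_loss(obj_feat, text_feat):
    """Semantic Feature Alignment: pull object features toward the
    matching text embeddings (cosine-similarity sketch)."""
    return 1.0 - F.cosine_similarity(obj_feat, text_feat, dim=-1).mean()

def ofd_loss(obj_feat, nonobj_feat, eps=1e-8):
    """Object Feature Disentanglement: minimize the cross-correlation
    between text-aligned object features and non-object features."""
    obj = (obj_feat - obj_feat.mean(0)) / (obj_feat.std(0) + eps)
    non = (nonobj_feat - nonobj_feat.mean(0)) / (nonobj_feat.std(0) + eps)
    corr = (obj.T @ non) / obj.shape[0]   # (d, d) cross-correlation matrix
    return corr.pow(2).mean()
```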

Result: Achieves state-of-the-art performance on two benchmarks: 83.7% mAP on M³FD and 86.1% mAP on FLIR datasets, demonstrating superior detection performance through more discriminative and less noisy features.

Conclusion: The proposed vision-language approach effectively enhances infrared object detection by leveraging textual supervision to guide feature disentanglement, resulting in significantly improved detection accuracy in challenging infrared imaging conditions.

Abstract: Infrared object detection focuses on identifying and locating objects in complex environments (e.g., dark, snow, and rain) where visible imaging cameras are disabled by poor illumination. However, due to low contrast and weak edge information in infrared images, it is challenging to extract discriminative object features for robust detection. To deal with this issue, we propose a novel vision-language representation learning paradigm for infrared object detection. Additional textual supervision with rich semantic information is explored to guide the disentanglement of object and non-object features. Specifically, we propose a Semantic Feature Alignment (SFA) module to align the object features with the corresponding text features. Furthermore, we develop an Object Feature Disentanglement (OFD) module that disentangles text-aligned object features and non-object features by minimizing their correlation. Finally, the disentangled object features are fed into the detection head. In this manner, the detection performance can be remarkably enhanced via more discriminative and less noisy features. Extensive experimental results demonstrate that our approach achieves superior performance on two benchmarks: M³FD (83.7% mAP) and FLIR (86.1% mAP). Our code will be publicly available once the paper is accepted.

[144] SPOT-Face: Forensic Face Identification using Attention Guided Optimal Transport

Ravi Shankar Prasad, Dinesh Singh

Main category: cs.CV

TL;DR: SPOT-Face: A superpixel graph-based framework for cross-domain forensic face identification using skeleton/sketch to face matching with attention-guided optimal transport.

DetailsMotivation: Forensic investigations face challenges when traditional DNA identification means (hair, soft tissue) are unavailable. Current deep learning face recognition methods lack effective mechanisms to model cross-domain structural correspondence between different forensic modalities like skeleton/sketch and face images.

Method: Construct superpixel-based graphs from images, use GNN backbones to extract graph embeddings, establish cross-domain correspondence through attention-guided optimal transport mechanism. Unified framework for matching skeleton/sketch to faces.
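
The transport step can be illustrated with a standard entropic Sinkhorn iteration over node embeddings, assuming uniform node masses and a precomputed cost matrix; the attention-guided reweighting of the cost is the paper's contribution and is only noted in a comment:

```python
import torch

def sinkhorn(cost, n_iters=50, eps=0.1):
    """Entropic OT plan between two uniform node distributions.

    cost: (n, m) pairwise cost between sketch/skeleton-graph and
    face-graph node embeddings, e.g. 1 - cosine similarity, optionally
    modulated by attention scores as in SPOT-Face.
    """
    K = torch.exp(-cost / eps)
    u = torch.full((cost.shape[0],), 1.0 / cost.shape[0])
    v = torch.full((cost.shape[1],), 1.0 / cost.shape[1])
    a, b = u.clone(), v.clone()
    for _ in range(n_iters):
        a = u / (K @ b)
        b = v / (K.T @ a)
    return a[:, None] * K * b[None, :]   # transport plan, shape (n, m)
```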

Result: Significant improvement in identification metrics (Recall, mAP) over existing graph-based baselines on IIT_Mandi_S2F and CUFS datasets. Framework demonstrates high effectiveness for matching skulls and sketches to faces in forensic investigations.

Conclusion: SPOT-Face provides an effective solution for cross-domain forensic face identification, addressing the challenge when traditional DNA evidence is unavailable, with superior performance over existing methods.

Abstract: Person identification in forensic investigations becomes very challenging when common sources of DNA for identification (i.e., hair strands, soft tissue) are not available. Current methods utilize deep learning for face recognition. However, these methods lack effective mechanisms to model cross-domain structural correspondence between two different forensic modalities. In this paper, we introduce SPOT-Face, a superpixel graph-based framework designed for cross-domain forensic face identification of victims using their skeleton and sketch images. Our unified framework involves constructing a superpixel-based graph from an image and then using different graph neural network (GNN) backbones to extract the embeddings of these graphs, while cross-domain correspondence is established through an attention-guided optimal transport mechanism. We have evaluated our proposed framework through extensive experiments on two publicly available datasets: IIT_Mandi_S2F (S2F) and CUFS. The experimental results show significant improvement in identification metrics (i.e., Recall, mAP) over existing graph-based baselines. Furthermore, our framework proves highly effective for matching skulls and sketches to faces in forensic investigations.

[145] CLIDD: Cross-Layer Independent Deformable Description for Efficient and Discriminative Local Feature Representation

Haodi Yao, Fenghua He, Ning Hao, Yao Su

Main category: cs.CV

TL;DR: CLIDD is a novel local feature description method that achieves superior matching accuracy with exceptional computational efficiency through cross-layer independent sampling and hardware-aware optimization, enabling real-time performance on edge devices.

DetailsMotivation: Robust local feature representations are crucial for spatial intelligence tasks like robot navigation and AR, requiring descriptors with both high discriminative power and computational efficiency. Current methods often trade off accuracy for speed or vice versa.

Method: Cross-Layer Independent Deformable Description (CLIDD) samples directly from independent feature hierarchies using learnable offsets to capture fine-grained structural details across scales. It implements hardware-aware kernel fusion for real-time performance and integrates lightweight architectures with training using both metric learning and knowledge distillation.
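
A minimal sketch of deformable sampling from a single feature level, assuming normalized keypoint coordinates; per-level samples would be concatenated across the hierarchy, and the offset bound of 0.05 is hypothetical:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DeformableSampler(nn.Module):
    """Sample descriptors at keypoints with learnable offsets."""
    def __init__(self, feat_dim, n_points=4):
        super().__init__()
        self.offset_head = nn.Linear(feat_dim, 2 * n_points)
        self.n_points = n_points

    def forward(self, feat, kpts):
        # feat: (1, C, H, W); kpts: (N, 2) in [-1, 1] normalized coords
        base = F.grid_sample(feat, kpts.view(1, -1, 1, 2), align_corners=False)
        base = base.squeeze(0).squeeze(-1).T                   # (N, C)
        offsets = self.offset_head(base).view(-1, self.n_points, 2)
        locs = kpts[:, None, :] + 0.05 * torch.tanh(offsets)   # bounded offsets
        out = F.grid_sample(feat, locs.view(1, -1, 1, 2), align_corners=False)
        return out.squeeze(0).squeeze(-1).T.reshape(kpts.shape[0], -1)
```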

Result: CLIDD achieves superior matching accuracy and exceptional computational efficiency. The ultra-compact variant matches SuperPoint’s precision with only 0.004M parameters (99.7% size reduction). The high-performance configuration outperforms all current SOTA methods including DINOv2-based frameworks while exceeding 200 FPS on edge devices.

Conclusion: CLIDD delivers high-precision local feature matching with minimal computational overhead, providing a robust and scalable solution for real-time spatial intelligence tasks by balancing accuracy and efficiency through innovative architectural and optimization strategies.

Abstract: Robust local feature representations are essential for spatial intelligence tasks such as robot navigation and augmented reality. Establishing reliable correspondences requires descriptors that provide both high discriminative power and computational efficiency. To address this, we introduce Cross-Layer Independent Deformable Description (CLIDD), a method that achieves superior distinctiveness by sampling directly from independent feature hierarchies. This approach utilizes learnable offsets to capture fine-grained structural details across scales while bypassing the computational burden of unified dense representations. To ensure real-time performance, we implement a hardware-aware kernel fusion strategy that maximizes inference throughput. Furthermore, we develop a scalable framework that integrates lightweight architectures with a training protocol leveraging both metric learning and knowledge distillation. This scheme generates a wide spectrum of model variants optimized for diverse deployment constraints. Extensive evaluations demonstrate that our approach achieves superior matching accuracy and exceptional computational efficiency simultaneously. Specifically, the ultra-compact variant matches the precision of SuperPoint while utilizing only 0.004M parameters, achieving a 99.7% reduction in model size. Furthermore, our high-performance configuration outperforms all current state-of-the-art methods, including high-capacity DINOv2-based frameworks, while exceeding 200 FPS on edge devices. These results demonstrate that CLIDD delivers high-precision local feature matching with minimal computational overhead, providing a robust and scalable solution for real-time spatial intelligence tasks.

[146] Knowledge-Embedded and Hypernetwork-Guided Few-Shot Substation Meter Defect Image Generation Method

Jackie Alex, Justin Petter

Main category: cs.CV

TL;DR: Novel framework combining Knowledge Embedding and Hypernetwork-Guided Conditional Control with Stable Diffusion for few-shot defect image generation in substation meters.

DetailsMotivation: Substation meters are critical for power grid monitoring, but crack/defect detection suffers from severe scarcity of annotated samples, creating a few-shot generation challenge.

Method: Three-stage approach: 1) DreamBooth-style knowledge embedding to fine-tune Stable Diffusion for meter characteristics; 2) Geometric crack modeling for parameterized defect attributes and control maps; 3) Lightweight hypernetwork to modulate denoising process based on control maps and defect descriptors.

Result: Outperforms existing baselines: reduces FID by 32.7%, increases diversity metrics, and boosts downstream defect detector mAP by 15.3% when trained on augmented data.

Conclusion: Framework provides practical, high-quality data synthesis solution for industrial inspection systems with rare defect samples, enabling realistic and controllable defect image generation from limited data.

Abstract: Substation meters play a critical role in monitoring and ensuring the stable operation of power grids, yet their detection of cracks and other physical defects is often hampered by a severe scarcity of annotated samples. To address this few-shot generation challenge, we propose a novel framework that integrates Knowledge Embedding and Hypernetwork-Guided Conditional Control into a Stable Diffusion pipeline, enabling realistic and controllable synthesis of defect images from limited data. First, we bridge the substantial domain gap between natural-image pre-trained models and industrial equipment by fine-tuning a Stable Diffusion backbone using DreamBooth-style knowledge embedding. This process encodes the unique structural and textural priors of substation meters, ensuring generated images retain authentic meter characteristics. Second, we introduce a geometric crack modeling module that parameterizes defect attributes, such as location, length, curvature, and branching pattern, to produce spatially constrained control maps. These maps provide precise, pixel-level guidance during generation. Third, we design a lightweight hypernetwork that dynamically modulates the denoising process of the diffusion model in response to the control maps and high-level defect descriptors, achieving a flexible balance between generation fidelity and controllability. Extensive experiments on a real-world substation meter dataset demonstrate that our method substantially outperforms existing augmentation and generation baselines. It reduces Fréchet Inception Distance (FID) by 32.7%, increases diversity metrics, and, most importantly, boosts the mAP of a downstream defect detector by 15.3% when trained on augmented data. The framework offers a practical, high-quality data synthesis solution for industrial inspection systems where defect samples are rare.

[147] A$^2$TG: Adaptive Anisotropic Textured Gaussians for Efficient 3D Scene Representation

Sheng-Chi Hsu, Ting-Yu Yen, Shih-Hsuan Hung, Hung-Kuo Chu

Main category: cs.CV

TL;DR: A²TG introduces adaptive anisotropic textures for Gaussian Splatting, using gradient-guided resolution/aspect ratio allocation to reduce memory while improving quality.

DetailsMotivation: Existing textured Gaussian methods use fixed square textures per primitive, leading to inefficient memory usage and limited adaptability to scene variability. There's a need for more efficient texture allocation that aligns with the anisotropic nature of Gaussian splats.

Method: Introduces adaptive anisotropic textured Gaussians (A²TG) that equip each primitive with anisotropic textures. Uses a gradient-guided adaptive rule to jointly determine texture resolution and aspect ratio, enabling non-uniform, detail-aware allocation.
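
A sketch of what a gradient-guided allocation rule could look like, with a hypothetical budget formula and thresholds (the paper's exact rule is not given in the summary):

```python
import math

def allocate_texture(grad_accum, scale_x, scale_y,
                     base_res=4, max_res=32, grad_ref=1e-4):
    """Jointly pick per-axis texel resolution for one Gaussian.

    grad_accum: accumulated view-space gradient magnitude of the primitive
    scale_x, scale_y: the Gaussian's anisotropic scales
    """
    # More accumulated gradient -> more texture detail.
    budget = base_res * max(1.0, math.sqrt(grad_accum / grad_ref))
    # Split the budget according to the primitive's aspect ratio.
    aspect = scale_x / scale_y
    res_x = min(max_res, max(1, round(budget * math.sqrt(aspect))))
    res_y = min(max_res, max(1, round(budget / math.sqrt(aspect))))
    return res_x, res_y
```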

Result: Significantly improves texture efficiency, reducing memory consumption while enhancing image quality. Outperforms fixed-texture Gaussian Splatting methods on multiple benchmark datasets, achieving comparable rendering fidelity with substantially lower memory requirements.

Conclusion: A²TG provides a more efficient and effective representation for textured Gaussian Splatting by adapting texture allocation to scene characteristics, offering better memory-performance tradeoffs than fixed-texture approaches.

Abstract: Gaussian Splatting has emerged as a powerful representation for high-quality, real-time 3D scene rendering. While recent works extend Gaussians with learnable textures to enrich visual appearance, existing approaches allocate a fixed square texture per primitive, leading to inefficient memory usage and limited adaptability to scene variability. In this paper, we introduce adaptive anisotropic textured Gaussians (A$^2$TG), a novel representation that generalizes textured Gaussians by equipping each primitive with an anisotropic texture. Our method employs a gradient-guided adaptive rule to jointly determine texture resolution and aspect ratio, enabling non-uniform, detail-aware allocation that aligns with the anisotropic nature of Gaussian splats. This design significantly improves texture efficiency, reducing memory consumption while enhancing image quality. Experiments on multiple benchmark datasets demonstrate that A$^2$TG consistently outperforms fixed-texture Gaussian Splatting methods, achieving comparable rendering fidelity with substantially lower memory requirements.

[148] Integrating Diverse Assignment Strategies into DETRs

Yiwei Zhang, Jin Gao, Hanshi Wang, Fudong Ge, Guan Luo, Weiming Hu, Zhipeng Zhang

Main category: cs.CV

TL;DR: LoRA-DETR: A flexible framework that integrates diverse one-to-many assignment strategies into DETR-style detectors using Low-Rank Adaptation branches during training only, improving performance without inference cost.

DetailsMotivation: DETR-style detectors suffer from slow convergence due to sparse one-to-one matching supervision. While one-to-many assignments can help, existing approaches are complex, architecture-specific, and lack unified design. The paper finds that performance gains come from diversity of assignment strategies rather than just quantity of supervision.

Method: Proposes LoRA-DETR framework that augments primary network with multiple Low-Rank Adaptation (LoRA) branches during training, each implementing different one-to-many assignment rules. These branches inject diverse supervisory gradients into the main model and are discarded during inference, maintaining original architecture simplicity.
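
A minimal sketch of one training-only LoRA branch attached to a shared linear layer; the rank and scaling are hypothetical defaults, and each branch would be supervised with a different one-to-many assignment rule:

```python
import torch
import torch.nn as nn

class LoRABranch(nn.Module):
    """Low-rank branch on top of a shared base layer."""
    def __init__(self, base: nn.Linear, rank=8, alpha=16.0):
        super().__init__()
        self.base = base                     # shared, from the main detector
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))
        self.scale = alpha / rank

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A.T) @ self.B.T

# At inference only `base` is kept, so the original DETR architecture
# and its computational cost are unchanged.
```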

Result: Extensive experiments on different baselines validate effectiveness. The approach achieves state-of-the-art results by integrating diverse one-to-many supervision without compromising model elegance or adding inference computational cost.

Conclusion: Diverse one-to-many supervision can be effectively integrated into DETR-style detectors through LoRA branches during training, providing a new paradigm for enhancing detectors that maintains architectural simplicity while improving performance.

Abstract: Label assignment is a critical component in object detectors, particularly within DETR-style frameworks where the one-to-one matching strategy, despite its end-to-end elegance, suffers from slow convergence due to sparse supervision. While recent works have explored one-to-many assignments to enrich supervisory signals, they often introduce complex, architecture-specific modifications and typically focus on a single auxiliary strategy, lacking a unified and scalable design. In this paper, we first systematically investigate the effects of "one-to-many" supervision and reveal a surprising insight that performance gains are driven not by the sheer quantity of supervision, but by the diversity of the assignment strategies employed. This finding suggests that a more elegant, parameter-efficient approach is attainable. Building on this insight, we propose LoRA-DETR, a flexible and lightweight framework that seamlessly integrates diverse assignment strategies into any DETR-style detector. Our method augments the primary network with multiple Low-Rank Adaptation (LoRA) branches during training, each instantiating a different one-to-many assignment rule. These branches act as auxiliary modules that inject rich, varied supervisory gradients into the main model and are discarded during inference, thus incurring no additional computational cost. This design promotes robust joint optimization while maintaining the architectural simplicity of the original detector. Extensive experiments on different baselines validate the effectiveness of our approach. Our work presents a new paradigm for enhancing detectors, demonstrating that diverse "one-to-many" supervision can be integrated to achieve state-of-the-art results without compromising model elegance.

[149] Hybrid guided variational autoencoder for visual place recognition

Ni Wang, Zihan You, Emre Neftci, Thorben Schoepe

Main category: cs.CV

TL;DR: A compact event-based VAE with spiking neural networks achieves robust visual place recognition for mobile robots in GPS-denied indoor environments with strong generalization capabilities.

DetailsMotivation: Autonomous agents need precise localization in GPS-denied indoor environments, but current VPR models are either too memory-intensive for mobile deployment or lack robustness and generalization capabilities.

Method: Combines event-based vision sensors with a guided variational autoencoder (VAE) using spiking neural networks in the encoder, making it compatible with neuromorphic hardware for power efficiency and low latency.

Result: Successfully disentangles visual features of 16 distinct places in a new indoor VPR dataset with classification performance comparable to state-of-the-art approaches, showing robustness under various illumination conditions and generalization to novel scenes.

Conclusion: The compact and robust guided VAE with generalization capabilities presents a promising model for visual place recognition that can significantly enhance mobile robot navigation in known and unknown indoor environments.

Abstract: Autonomous agents such as cars, robots, and drones need to precisely localize themselves in diverse environments, including GPS-denied indoor environments. One approach for precise localization is visual place recognition (VPR), which estimates the place of an image based on previously seen places. State-of-the-art VPR models require large amounts of memory, making them unwieldy for mobile deployment, while more compact models lack robustness and generalization capabilities. This work overcomes these limitations for robotics using a combination of event-based vision sensors and a novel event-based guided variational autoencoder (VAE). The encoder of our model is based on a spiking neural network, which is compatible with power-efficient, low-latency neuromorphic hardware. The VAE successfully disentangles the visual features of 16 distinct places in our new indoor VPR dataset, with classification performance comparable to other state-of-the-art approaches, while showing robust performance under various illumination conditions. When tested with novel visual inputs from unknown scenes, our model can distinguish between these places, demonstrating a high generalization capability by learning the essential features of a location. Our compact and robust guided VAE with generalization capabilities is a promising model for visual place recognition that can significantly enhance mobile robot navigation in known and unknown indoor environments.

[150] PhyRPR: Training-Free Physics-Constrained Video Generation

Yibo Zhao, Hengjia Li, Xiaofei He, Boxi Wu

Main category: cs.CV

TL;DR: PhyRPR is a training-free three-stage pipeline that decouples physical reasoning from visual synthesis to improve physical plausibility in video generation.

DetailsMotivation: Existing diffusion-based video generation models struggle with physical constraints because they entangle physical understanding with visual synthesis in a single stage, making explicit physical reasoning difficult.

Method: Three-stage pipeline: 1) PhyReason - uses large multimodal model for physical state reasoning and image generator for keyframe synthesis; 2) PhyPlan - deterministically synthesizes controllable coarse motion scaffold; 3) PhyRefine - injects scaffold into diffusion sampling via latent fusion to refine appearance while preserving planned dynamics.

Result: Extensive experiments show the method consistently improves physical plausibility and motion controllability under physics constraints.

Conclusion: The staged design enables explicit physical control during generation, addressing limitations of single-stage approaches that entangle physical understanding with visual synthesis.

Abstract: Recent diffusion-based video generation models can synthesize visually plausible videos, yet they often struggle to satisfy physical constraints. A key reason is that most existing approaches remain single-stage: they entangle high-level physical understanding with low-level visual synthesis, making it hard to generate content that requires explicit physical reasoning. To address this limitation, we propose a training-free three-stage pipeline, PhyRPR (PhyReason-PhyPlan-PhyRefine), which decouples physical understanding from visual synthesis. Specifically, PhyReason uses a large multimodal model for physical state reasoning and an image generator for keyframe synthesis; PhyPlan deterministically synthesizes a controllable coarse motion scaffold; and PhyRefine injects this scaffold into diffusion sampling via a latent fusion strategy to refine appearance while preserving the planned dynamics. This staged design enables explicit physical control during generation. Extensive experiments under physics constraints show that our method consistently improves physical plausibility and motion controllability.

[151] Magnifying change: Rapid burn scar mapping with multi-resolution, multi-source satellite imagery

Maria Sdraka, Dimitrios Michail, Ioannis Papoutsis

Main category: cs.CV

TL;DR: BAM-MRCD is a deep learning model that uses multi-resolution satellite imagery (MODIS and Sentinel-2) to create detailed burnt area maps with both high spatial and temporal resolution for timely wildfire monitoring.

DetailsMotivation: Current wildfire detection methods face limitations in operational settings due to trade-offs between spatial resolution and temporal revisit frequency of satellite systems, making quick delineation of burn scars after wildfires challenging.

Method: Proposes BAM-MRCD, a novel deep learning model that employs multi-resolution, multi-source satellite imagery combining MODIS (high temporal frequency) and Sentinel-2 (high spatial resolution) data for change detection.

Result: The model achieves high accuracy in detecting even small-scale wildfires, surpassing similar change detection models and solid baselines in performance.

Conclusion: BAM-MRCD enables timely production of detailed burnt area maps with both high spatial and temporal resolution, addressing operational limitations in wildfire monitoring. All data and code are publicly available.

Abstract: Delineating wildfire affected areas using satellite imagery remains challenging due to irregular and spatially heterogeneous spectral changes across the electromagnetic spectrum. While recent deep learning approaches achieve high accuracy when high-resolution multispectral data are available, their applicability in operational settings, where a quick delineation of the burn scar shortly after a wildfire incident is required, is limited by the trade-off between spatial resolution and temporal revisit frequency of current satellite systems. To address this limitation, we propose a novel deep learning model, namely BAM-MRCD, which employs multi-resolution, multi-source satellite imagery (MODIS and Sentinel-2) for the timely production of detailed burnt area maps with high spatial and temporal resolution. Our model manages to detect even small scale wildfires with high accuracy, surpassing similar change detection models as well as solid baselines. All data and code are available in the GitHub repository: https://github.com/Orion-AI-Lab/BAM-MRCD.

[152] BrainSegNet: A Novel Framework for Whole-Brain MRI Parcellation Enhanced by Large Models

Yucheng Li, Xiaofan Wang, Junyi Wang, Yijie Li, Xi Zhu, Mubai Du, Dian Sheng, Wei Zhang, Fan Zhang

Main category: cs.CV

TL;DR: BrainSegNet adapts SAM with U-Net skip connections and specialized modules for accurate 95-region whole-brain parcellation, outperforming state-of-the-art methods on HCP data.

DetailsMotivation: Whole-brain parcellation is critical but challenging due to complex, irregular brain regions. Traditional template-registration methods are being replaced by deep learning, but large models like SAM lack the precision needed for fine-grained brain anatomy segmentation.

Method: BrainSegNet adapts SAM by integrating U-Net skip connections and specialized modules: 1) Hybrid encoder combining U-Net skip connections with SAM’s transformer blocks, 2) Multi-scale attention decoder with pyramid pooling for varying-sized structures, 3) Boundary refinement module to sharpen edges.

Result: BrainSegNet outperforms several state-of-the-art methods on the Human Connectome Project dataset, achieving higher accuracy and robustness in complex, multi-label parcellation of 95 brain regions.

Conclusion: The proposed BrainSegNet framework successfully adapts SAM for precise whole-brain parcellation by incorporating anatomical-aware architectural enhancements, demonstrating superior performance for fine-grained brain segmentation tasks.

Abstract: Whole-brain parcellation from MRI is a critical yet challenging task due to the complexity of subdividing the brain into numerous small, irregularly shaped regions. Traditionally, template-registration methods were used, but recent advances have shifted to deep learning for faster workflows. While large models like the Segment Anything Model (SAM) offer transferable feature representations, they are not tailored for the high precision required in brain parcellation. To address this, we propose BrainSegNet, a novel framework that adapts SAM for accurate whole-brain parcellation into 95 regions. We enhance SAM by integrating U-Net skip connections and specialized modules into its encoder and decoder, enabling fine-grained anatomical precision. Key components include a hybrid encoder combining U-Net skip connections with SAM’s transformer blocks, a multi-scale attention decoder with pyramid pooling for varying-sized structures, and a boundary refinement module to sharpen edges. Experimental results on the Human Connectome Project (HCP) dataset demonstrate that BrainSegNet outperforms several state-of-the-art methods, achieving higher accuracy and robustness in complex, multi-label parcellation.

[153] GaussianFluent: Gaussian Simulation for Dynamic Scenes with Mixed Materials

Bei Huang, Yixin Chen, Ruijie Lu, Gang Zeng, Hongbin Zha, Yuru Pei, Siyuan Huang

Main category: cs.CV

TL;DR: GaussianFluent: A framework for realistic brittle fracture simulation and rendering using 3D Gaussian Splatting with synthesized interiors and optimized continuum damage MPM.

DetailsMotivation: Previous physics simulation with 3D Gaussians focused on soft, deformable materials but couldn't handle brittle fracture due to two key obstacles: lack of volumetric interiors with coherent textures in Gaussian representation, and absence of fracture-aware simulation methods for Gaussians.

Method: 1) Synthesizes photorealistic interiors by densifying internal Gaussians guided by generative models. 2) Integrates an optimized Continuum Damage Material Point Method (CD-MPM) to enable high-speed brittle fracture simulation.

Result: Handles complex scenarios including mixed-material objects and multi-stage fracture propagation, achieving photo-realistic, real-time rendering with structurally consistent interiors - results infeasible with previous methods.

Conclusion: GaussianFluent demonstrates capability for photo-realistic, real-time rendering with structurally consistent interiors, highlighting potential for downstream applications like VR and Robotics.

Abstract: 3D Gaussian Splatting (3DGS) has emerged as a prominent 3D representation for high-fidelity and real-time rendering. Prior work has coupled physics simulation with Gaussians, but predominantly targets soft, deformable materials, leaving brittle fracture largely unresolved. This stems from two key obstacles: the lack of volumetric interiors with coherent textures in GS representation, and the absence of fracture-aware simulation methods for Gaussians. To address these challenges, we introduce GaussianFluent, a unified framework for realistic simulation and rendering of dynamic object states. First, it synthesizes photorealistic interiors by densifying internal Gaussians guided by generative models. Second, it integrates an optimized Continuum Damage Material Point Method (CD-MPM) to enable brittle fracture simulation at remarkably high speed. Our approach handles complex scenarios including mixed-material objects and multi-stage fracture propagation, achieving results infeasible with previous methods. Experiments clearly demonstrate GaussianFluent’s capability for photo-realistic, real-time rendering with structurally consistent interiors, highlighting its potential for downstream applications such as VR and robotics.

[154] Multi-Modal LLM based Image Captioning in ICT: Bridging the Gap Between General and Industry Domain

Lianying Chao, Haoran Cai, Xubin Li, Kai Zhang, Sijie Wu, Rui Xu

Main category: cs.CV

TL;DR: A multi-stage training strategy for Domain-specific Image Captioning Model (DICModel) in ICT that outperforms larger SOTA models using synthetic and expert-annotated data.

DetailsMotivation: Domain knowledge in ICT exists in both text and images, but traditional methods lack image captioning ability while MLLMs lack sufficient domain knowledge. Need to extract logical text from domain-specific images efficiently.

Method: Three-stage progressive training: 1) Synthesize 7K image-text pairs using Mermaid tool + LLMs for first-stage SFT; 2) Expert-annotated 2K image-text pairs for second-stage SFT; 3) Expert+LLM synthesized 1.5K VQA data for instruction-based SFT. Constructs evaluation system for validation.

Result: DICModel with only 7B parameters outperforms SOTA models with 32B parameters: increases BLEU by ~56.8% vs 7B models and ~20.8% vs 32B models. Outperforms Qwen2.5-VL 32B by 1% accuracy on expert-constructed objective questions.

Conclusion: The proposed method efficiently extracts logical text from ICT domain images, promoting multimodal model development in specialized domains through progressive training with synthetic and expert data.

Abstract: In the information and communications technology (ICT) industry, training a domain-specific large language model (LLM) or constructing a retrieval-augmented generation system requires a substantial amount of high-value domain knowledge. However, the knowledge is not only hidden in the textual modality but also in the image modality. Traditional methods can parse text from domain documents but do not have image captioning ability. Multi-modal LLMs (MLLMs) can understand images, but they do not have sufficient domain knowledge. To address the above issues, this paper proposes a multi-stage progressive training strategy to train a Domain-specific Image Captioning Model (DICModel) in ICT, and constructs a standard evaluation system to validate the performance of DICModel. Specifically, this work first synthesizes about 7K image-text pairs by combining the Mermaid tool and LLMs, which are used for the first-stage supervised fine-tuning (SFT) of DICModel. Then, ICT-domain experts manually annotate about 2K image-text pairs for the second-stage SFT of DICModel. Finally, experts and LLMs jointly synthesize about 1.5K visual question answering data for the instruction-based SFT. Experimental results indicate that our DICModel with only 7B parameters performs better than other state-of-the-art models with 32B parameters. Compared to the SOTA models with 7B and 32B parameters, our DICModel increases the BLEU metric by approximately 56.8% and 20.8%, respectively. On the objective questions constructed by ICT domain experts, our DICModel outperforms Qwen2.5-VL 32B by 1% in terms of accuracy rate. In summary, this work can efficiently and accurately extract the logical text from images, which is expected to promote the development of multimodal models in the ICT domain.

[155] Frequency Error-Guided Under-sampling Optimization for Multi-Contrast MRI Reconstruction

Xinming Fang, Chaoyan Huang, Juncheng Li, Jun Wang, Jun Shi, Guixu Zhang

Main category: cs.CV

TL;DR: Proposes a frequency error-guided MRI reconstruction framework using conditional diffusion models and joint optimization of under-sampling patterns and reconstruction networks.

DetailsMotivation: MRI suffers from long acquisition times and motion artifacts. Existing multi-contrast reconstruction methods have limitations: superficial reference fusion, insufficient complementary information utilization, and fixed under-sampling patterns.

Method: Uses conditional diffusion model to learn Frequency Error Prior (FEP), then jointly optimizes under-sampling pattern and reconstruction network. Employs model-driven deep unfolding framework with frequency- and image-domain information, spatial alignment module, and reference feature decomposition.

Result: Demonstrates consistent superiority over state-of-the-art methods across multiple imaging modalities, acceleration rates (4-30x), and sampling schemes in both quantitative metrics and visual quality.

Conclusion: Proposes an efficient and interpretable frequency error-guided reconstruction framework that addresses key limitations of existing multi-contrast MRI reconstruction methods.

Abstract: Magnetic resonance imaging (MRI) plays a vital role in clinical diagnostics, yet it remains hindered by long acquisition times and motion artifacts. Multi-contrast MRI reconstruction has emerged as a promising direction by leveraging complementary information from fully-sampled reference scans. However, existing approaches suffer from three major limitations: (1) superficial reference fusion strategies, such as simple concatenation, (2) insufficient utilization of the complementary information provided by the reference contrast, and (3) fixed under-sampling patterns. We propose an efficient and interpretable frequency error-guided reconstruction framework to tackle these issues. We first employ a conditional diffusion model to learn a Frequency Error Prior (FEP), which is then incorporated into a unified framework for jointly optimizing both the under-sampling pattern and the reconstruction network. The proposed reconstruction model employs a model-driven deep unfolding framework that jointly exploits frequency- and image-domain information. In addition, a spatial alignment module and a reference feature decomposition strategy are incorporated to improve reconstruction quality and bridge model-based optimization with data-driven learning for improved physical interpretability. Comprehensive validation across multiple imaging modalities, acceleration rates (4-30x), and sampling schemes demonstrates consistent superiority over state-of-the-art methods in both quantitative metrics and visual quality. All codes are available at https://github.com/fangxinming/JUF-MRI.

[156] Beyond the final layer: Attentive multilayer fusion for vision transformers

Laure Ciernik, Marco Morik, Lukas Thede, Luca Eyring, Shinichi Nakajima, Zeynep Akata, Lukas Muttenthaler

Main category: cs.CV

TL;DR: Attentive probing mechanism that fuses representations from all Vision Transformer layers outperforms standard linear probing by leveraging task-relevant information distributed across the network hierarchy.

DetailsMotivation: Linear probing is computationally efficient but limited to last-layer representations, while task-relevant information is actually distributed across all network layers rather than concentrated in final layers.

Method: Attentive probing mechanism that dynamically fuses representations from all Vision Transformer layers, learning to identify the most relevant layers for each target task and combining low-level structural cues with high-level semantic abstractions.
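
A minimal sketch of such a probe, assuming frozen per-layer [CLS] tokens as input (head count and initialization are hypothetical):

```python
import torch
import torch.nn as nn

class AttentiveLayerProbe(nn.Module):
    """Fuse [CLS] tokens from all ViT layers with one learnable query."""
    def __init__(self, dim, n_classes):
        super().__init__()
        self.query = nn.Parameter(torch.randn(1, 1, dim) * 0.02)
        self.attn = nn.MultiheadAttention(dim, num_heads=1, batch_first=True)
        self.head = nn.Linear(dim, n_classes)

    def forward(self, layer_cls):
        # layer_cls: (B, n_layers, dim), extracted from a frozen backbone
        q = self.query.expand(layer_cls.shape[0], -1, -1)
        fused, weights = self.attn(q, layer_cls, layer_cls)
        # `weights` doubles as a per-layer relevance map for analysis.
        return self.head(fused.squeeze(1)), weights
```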

Result: Across 20 diverse datasets and multiple pretrained foundation models, the method achieves consistent, substantial gains over standard linear probes. Attention heatmaps reveal that tasks different from pre-training domain benefit most from intermediate representations.

Conclusion: Intermediate layer information is valuable for adaptation, and the proposed attentive probing provides a principled, task-aware approach to unlock their potential in probing-based adaptation of foundation models.

Abstract: With the rise of large-scale foundation models, efficiently adapting them to downstream tasks remains a central challenge. Linear probing, which freezes the backbone and trains a lightweight head, is computationally efficient but often restricted to last-layer representations. We show that task-relevant information is distributed across the network hierarchy rather than solely encoded in any of the last layers. To leverage this distribution of information, we apply an attentive probing mechanism that dynamically fuses representations from all layers of a Vision Transformer. This mechanism learns to identify the most relevant layers for a target task and combines low-level structural cues with high-level semantic abstractions. Across 20 diverse datasets and multiple pretrained foundation models, our method achieves consistent, substantial gains over standard linear probes. Attention heatmaps further reveal that tasks different from the pre-training domain benefit most from intermediate representations. Overall, our findings underscore the value of intermediate layer information and demonstrate a principled, task aware approach for unlocking their potential in probing-based adaptation.

[157] See More, Store Less: Memory-Efficient Resolution for Video Moment Retrieval

Mingyu Jeon, Sungjin Han, Jinkwon Hwang, Minchol Kwon, Jonghee Kim, Junyeong Kim

Main category: cs.CV

TL;DR: SMORE is a memory-efficient video moment retrieval framework that uses query-guided captions, importance modulation, and adaptive frame compression to handle long videos without information loss.

DetailsMotivation: Current MLLMs struggle with video tasks due to memory constraints from dense frame processing, and existing VMR methods use sparse frame sampling which risks information loss, especially in long videos.

Method: SMORE uses three key techniques: (1) query-guided captions to encode semantics aligned with user intent, (2) query-aware importance modulation to highlight relevant segments, and (3) adaptive frame compression to preserve key content while reducing redundancy.

Result: SMORE achieves state-of-the-art performance on QVHighlights, Charades-STA, and ActivityNet-Captions benchmarks while maintaining memory efficiency.

Conclusion: The SMORE framework enables efficient video understanding without exceeding memory budgets, addressing the limitations of current video moment retrieval methods while maintaining high information resolution.

Abstract: Recent advances in Multimodal Large Language Models (MLLMs) have improved image recognition and reasoning, but video-related tasks remain challenging due to memory constraints from dense frame processing. Existing Video Moment Retrieval (VMR) methodologies rely on sparse frame sampling, risking potential information loss, especially in lengthy videos. We propose SMORE (See MORE, store less), a framework that enhances memory efficiency while maintaining high information resolution. SMORE (1) uses query-guided captions to encode semantics aligned with user intent, (2) applies query-aware importance modulation to highlight relevant segments, and (3) adaptively compresses frames to preserve key content while reducing redundancy. This enables efficient video understanding without exceeding memory budgets. Experimental validation reveals that SMORE achieves state-of-the-art performance on QVHighlights, Charades-STA, and ActivityNet-Captions benchmarks.

[158] Spectral Complex Autoencoder Pruning: A Fidelity-Guided Criterion for Extreme Structured Channel Compression

Wei Liu, Xing Deng, Haijian Shao, Yingtao Jiang

Main category: cs.CV

TL;DR: SCAP is a novel pruning method that uses spectral reconstruction fidelity of complex interaction fields to measure channel redundancy, achieving 90% FLOP reduction with minimal accuracy drop.

DetailsMotivation: Existing pruning methods often rely on simple heuristics like filter norms, which may not accurately capture functional redundancy at the channel level. There's a need for more sophisticated criteria that can better identify which channels are truly redundant and can be removed without harming network performance.

Method: For each convolutional layer, create complex interaction fields by pairing multi-channel input activations (real part) with single output-channel activations (imaginary part). Transform to frequency domain, train low-capacity autoencoders to reconstruct normalized spectra. Channels with high reconstruction fidelity are considered redundant (compressible), while low-fidelity channels are retained. This yields importance scores that can be fused with filter L1 norms for threshold-based pruning.
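
A sketch of the per-channel fidelity score, assuming a pre-trained low-capacity autoencoder and max-normalized magnitude spectra (the paper's exact normalization and autoencoder are not specified in this summary):

```python
import torch

def channel_fidelity(x_in, y_out_c, autoencoder):
    """SCAP-style redundancy score for one output channel.

    x_in:    (C_in, H, W) input activation of the layer (real part)
    y_out_c: (H, W) one output-channel activation, broadcast as the
             imaginary part across input channels
    """
    field = torch.complex(x_in, y_out_c.unsqueeze(0).expand_as(x_in))
    spec = torch.fft.fft2(field).abs()
    spec = spec / (spec.amax() + 1e-8)          # normalized spectrum
    recon = autoencoder(spec.flatten()[None])
    err = torch.mean((recon.squeeze(0) - spec.flatten()) ** 2)
    return 1.0 - err    # high fidelity => near the learned manifold => prunable
```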

Result: On VGG16 trained on CIFAR-10, with a threshold of 0.6, achieved 90.11% FLOP reduction and 96.30% parameter reduction with only 1.67% absolute Top-1 accuracy drop from a 93.44% baseline after fine-tuning.

Conclusion: Spectral reconstruction fidelity of complex interaction fields provides an effective proxy for measuring channel-level redundancy, enabling aggressive network compression while maintaining performance, outperforming traditional pruning criteria.

Abstract: We propose Spectral Complex Autoencoder Pruning (SCAP), a reconstruction-based criterion that measures functional redundancy at the level of individual output channels. For each convolutional layer, we construct a complex interaction field by pairing the full multi-channel input activation as the real part with a single output-channel activation (spatially aligned and broadcast across input channels) as the imaginary part. We transform this complex field to the frequency domain and train a low-capacity autoencoder to reconstruct normalized spectra. Channels whose spectra are reconstructed with high fidelity are interpreted as lying close to a low-dimensional manifold captured by the autoencoder and are therefore more compressible; conversely, channels with low fidelity are retained as they encode information that cannot be compactly represented by the learned manifold. This yields an importance score (optionally fused with the filter L1 norm) that supports simple threshold-based pruning and produces a structurally consistent pruned network. On VGG16 trained on CIFAR-10, at a fixed threshold of 0.6, we obtain 90.11% FLOP reduction and 96.30% parameter reduction with an absolute Top-1 accuracy drop of 1.67% from a 93.44% baseline after fine-tuning, demonstrating that spectral reconstruction fidelity of complex interaction fields is an effective proxy for channel-level redundancy under aggressive compression.

[159] Detail Loss in Super-Resolution Models Based on the Laplacian Pyramid and Repeated Upscaling and Downscaling Process

Sangjun Han, Youngmi Hur

Main category: cs.CV

TL;DR: The paper proposes two methods to enhance high-frequency details in super-resolution images: a Laplacian pyramid-based detail loss and repeated upscaling/downscaling process, achieving state-of-the-art results with CNN models and improving attention-based models.

DetailsMotivation: Image super-resolution is crucial for real-world applications, and enhancing fine details (high-frequency information) is essential for this task. Current methods need better emphasis on pixels that contribute to high-frequency details.

Method: Two main methods: 1) Laplacian pyramid-based detail loss that guides models by separately generating and controlling super-resolution and detail images, and 2) repeated upscaling and downscaling process that amplifies detail loss effectiveness by extracting diverse information from multiple low-resolution features.
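
A one-level sketch of the detail loss; the paper uses a full Laplacian pyramid, and the pooling-based low-pass here is an assumption:

```python
import torch
import torch.nn.functional as F

def detail_loss(sr, hr):
    """Compare high-frequency residuals of SR output and ground truth."""
    def detail(img):
        low = F.avg_pool2d(img, 2)
        low_up = F.interpolate(low, scale_factor=2, mode="bilinear",
                               align_corners=False)
        return img - low_up            # high-frequency residual
    return F.l1_loss(detail(sr), detail(hr))

# total_loss = F.l1_loss(sr, hr) + lambda_detail * detail_loss(sr, hr),
# with lambda_detail a hypothetical weighting hyperparameter.
```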

Result: The CNN-based model incorporating these methods achieves state-of-the-art results, surpassing all available CNN-based and some attention-based models. When applied to existing attention-based models, the detail loss consistently improves performance compared to original models.

Conclusion: The proposed approaches effectively enhance super-resolution images across different model structures by focusing on high-frequency components, demonstrating broad applicability and effectiveness in improving image quality.

Abstract: With advances in artificial intelligence, image processing has gained significant interest. Image super-resolution is a vital technology closely related to real-world applications, as it enhances the quality of existing images. Since enhancing fine details is crucial for the super-resolution task, pixels that contribute to high-frequency information should be emphasized. This paper proposes two methods to enhance high-frequency details in super-resolution images: a Laplacian pyramid-based detail loss and a repeated upscaling and downscaling process. A total loss incorporating our detail loss guides a model by separately generating and controlling super-resolution and detail images. This approach allows the model to focus more effectively on high-frequency components, resulting in improved super-resolution images. Additionally, repeated upscaling and downscaling amplify the effectiveness of the detail loss by extracting diverse information from multiple low-resolution features. We conduct two types of experiments. First, we design a CNN-based model incorporating our methods. This model achieves state-of-the-art results, surpassing all currently available CNN-based and even some attention-based models. Second, we apply our methods to existing attention-based models on a small scale. In all our experiments, attention-based models with our detail loss added show improvements over the originals. These results demonstrate that our approaches effectively enhance super-resolution images across different model structures.

[160] Radiomics-Integrated Deep Learning with Hierarchical Loss for Osteosarcoma Histology Classification

Yaxi Chen, Zi Ye, Shaheer U. Saeed, Oliver Yu, Simin Ni, Jie Huang, Yipeng Hu

Main category: cs.CV

TL;DR: Deep learning model for osteosarcoma necrosis quantification using radiomic features and hierarchical classification improves performance over previous methods.

DetailsMotivation: Manual assessment of viable vs non-viable tumor regions in osteosarcoma after chemotherapy is subjective and labor-intensive. Existing deep learning models show performance drop when evaluated on patient-level test data compared to tile-level performance reported in previous studies.

Method: Two key innovations: 1) Incorporation of radiomic features as multimodal input alongside images to improve classification and interpretability. 2) Hierarchical classification approach with two binary tasks (tumor-vs-non-tumor and viable-vs-non-viable) using trainable weightings, instead of flat three-class classification.
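
A minimal sketch of the hierarchical loss with trainable task weightings, here using uncertainty-style weights as an assumption (the paper's exact weighting scheme may differ):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class HierarchicalLoss(nn.Module):
    """Tumor-vs-non-tumor and viable-vs-non-viable with learned weights."""
    def __init__(self):
        super().__init__()
        self.log_var = nn.Parameter(torch.zeros(2))   # one weight per task

    def forward(self, logit_tumor, y_tumor, logit_viable, y_viable, tumor_mask):
        l_tumor = F.binary_cross_entropy_with_logits(logit_tumor, y_tumor)
        # The viable-vs-non-viable task is only defined on tumor tiles.
        l_viable = F.binary_cross_entropy_with_logits(
            logit_viable[tumor_mask], y_viable[tumor_mask])
        w = torch.exp(-self.log_var)
        return w[0] * l_tumor + w[1] * l_viable + self.log_var.sum()
```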

Result: Experimental evaluation on TCIA OS Tumor Assessment dataset shows both proposed approaches individually improve performance, and their combination achieves state-of-the-art results for osteosarcoma necrosis quantification.

Conclusion: The proposed multimodal approach with radiomic features and hierarchical classification with trainable loss weightings significantly improves automated necrosis quantification in osteosarcoma, addressing patient-level generalization challenges.

Abstract: Osteosarcoma (OS) is an aggressive primary bone malignancy. Accurate histopathological assessment of viable versus non-viable tumor regions after neoadjuvant chemotherapy is critical for prognosis and treatment planning, yet manual evaluation remains labor-intensive, subjective, and prone to inter-observer variability. Recent advances in digital pathology have enabled automated necrosis quantification. Evaluation on test data sampled independently at the patient level revealed that deep learning model performance dropped significantly from the tile-level generalization ability reported in previous studies. First, this work proposes the use of radiomic features as additional input in model training. We show that, although they are derived from the images, such multimodal input effectively improved the classification performance, in addition to its added benefits in interpretability. Second, this work proposes to optimize two binary classification tasks with hierarchical classes (i.e., tumor-vs-non-tumor and viable-vs-non-viable), as opposed to the alternative "flat" three-class classification task (i.e., non-tumor, non-viable tumor, viable tumor), thereby enabling a hierarchical loss. We show that with such a hierarchical loss, using trainable weightings between the two tasks, per-class performance can be improved significantly. Using the TCIA OS Tumor Assessment dataset, we experimentally demonstrate the benefits from each of the proposed new approaches and their combination, setting what we consider a new state-of-the-art performance on this open dataset for this application. Code and trained models: https://github.com/YaxiiC/RadiomicsOS.git.

[161] Video-MSR: Benchmarking Multi-hop Spatial Reasoning Capabilities of MLLMs

Rui Zhu, Xin Shen, Shuchen Wu, Chenxi Miao, Xin Yu, Yang Li, Weikang Li, Deguo Xia, Jizhou Huang

Main category: cs.CV

TL;DR: Video-MSR is a new benchmark for evaluating multi-hop spatial reasoning in videos, revealing current MLLMs’ limitations in complex spatial logic chains and showing improvement through specialized instruction tuning.

DetailsMotivation: Existing benchmarks focus on single-step perception-to-judgment tasks, leaving complex visual-spatial logical chains underexplored. There's a need to evaluate multi-hop spatial reasoning in dynamic video scenarios.

Method: Created Video-MSR benchmark with 3,052 video instances and 4,993 QA pairs via scalable pipeline combining model generation and human verification. Evaluated 20 state-of-the-art MLLMs, then curated MSR-9K instruction-tuning dataset to fine-tune Qwen-VL.

Result: Current MLLMs show significant limitations in multi-hop spatial reasoning, with performance drops in complex tasks, spatial disorientation, and hallucinations. Fine-tuning with MSR-9K achieved +7.82% improvement on Video-MSR.

Conclusion: Video-MSR establishes a vital foundation for evaluating multi-hop spatial reasoning in videos, demonstrating the efficacy of specialized instruction data and highlighting the need for improved spatial reasoning capabilities in MLLMs.

Abstract: Spatial reasoning has emerged as a critical capability for Multimodal Large Language Models (MLLMs), drawing increasing attention and rapid advancement. However, existing benchmarks primarily focus on single-step perception-to-judgment tasks, leaving scenarios requiring complex visual-spatial logical chains significantly underexplored. To bridge this gap, we introduce Video-MSR, the first benchmark specifically designed to evaluate Multi-hop Spatial Reasoning (MSR) in dynamic video scenarios. Video-MSR systematically probes MSR capabilities through four distinct tasks: Constrained Localization, Chain-based Reference Retrieval, Route Planning, and Counterfactual Physical Deduction. Our benchmark comprises 3,052 high-quality video instances with 4,993 question-answer pairs, constructed via a scalable, visually-grounded pipeline combining advanced model generation with rigorous human verification. Through a comprehensive evaluation of 20 state-of-the-art MLLMs, we uncover significant limitations, revealing that while models demonstrate proficiency in surface-level perception, they exhibit distinct performance drops in MSR tasks, frequently suffering from spatial disorientation and hallucination during multi-step deductions. To mitigate these shortcomings and empower models with stronger MSR capabilities, we further curate MSR-9K, a specialized instruction-tuning dataset, and fine-tune Qwen-VL, achieving a +7.82% absolute improvement on Video-MSR. Our results underscore the efficacy of multi-hop spatial instruction data and establish Video-MSR as a vital foundation for future research. The code and data will be available at https://github.com/ruiz-nju/Video-MSR.

[162] Do Transformers Understand Ancient Roman Coin Motifs Better than CNNs?

David Reid, Ognjen Arandjelovic

Main category: cs.CV

TL;DR: Vision Transformers (ViT) outperform CNNs for identifying semantic elements on ancient coins using multi-modal data (images + text).

DetailsMotivation: Automated analysis of ancient coins can help researchers extract historical insights from large collections and assist collectors in understanding coin authenticity and value. Current CNN approaches show promise but have limitations.

Method: First application of Vision Transformer (ViT) architecture to ancient coin analysis, using fully automatic learning from multi-modal data (images and unstructured text). Compares ViT with CNN models trained on the same data.

Result: ViT models outperformed newly trained CNN models in accuracy for identifying semantic elements on ancient coins.

Conclusion: Vision Transformers represent a promising advancement for automated ancient coin analysis, demonstrating superior performance over CNN approaches for semantic element identification.

Abstract: Automated analysis of ancient coins has the potential to help researchers extract more historical insights from large collections of coins and to help collectors understand what they are buying or selling. Recent research in this area has shown promise in focusing on identification of semantic elements as they are commonly depicted on ancient coins, by using convolutional neural networks (CNNs). This paper is the first to apply the recently proposed Vision Transformer (ViT) deep learning architecture to the task of identification of semantic elements on coins, using fully automatic learning from multi-modal data (images and unstructured text). This article summarises previous research in the area, discusses the training and implementation of ViT and CNN models for ancient coin analysis, and provides an evaluation of their performance. The ViT models were found to outperform the newly trained CNN models in accuracy.

[163] PrivLEX

Darya Baranouskaya, Andrea Cavallaro

Main category: cs.CV

TL;DR: PrivLEX is an interpretable image privacy classifier that grounds decisions in legally defined personal data concepts using Vision-Language Models without requiring explicit concept labels during training.

DetailsMotivation: Current privacy classifiers lack interpretability and legal alignment. There's a need for privacy classification systems that can explain their decisions using legally defined personal data concepts to ensure transparency and compliance with privacy regulations.

Method: Uses zero-shot Vision-Language Model concept detection with a label-free Concept Bottleneck Model. Leverages VLM recognition capabilities to identify personal data concepts in images without requiring explicit concept labels during training.
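
As a rough illustration of the label-free concept-bottleneck idea, the sketch below scores images against a bank of concept text embeddings and trains only a linear privacy head on those scores. The concept names, random embeddings, and least-squares head are stand-ins, not the paper's implementation; only image-level privacy labels are used, never concept labels.

```python
# Hypothetical sketch of a label-free concept bottleneck (not PrivLEX's code).
# Random vectors stand in for real VLM image/text embeddings.
import numpy as np

rng = np.random.default_rng(0)
D, N = 512, 200                      # embedding dim, number of images

# Zero-shot concept bank: text embeddings of personal-data concepts
# (names are illustrative, not taken from the paper).
concepts = ["face", "license plate", "home address", "medical document"]
concept_emb = rng.normal(size=(len(concepts), D))
concept_emb /= np.linalg.norm(concept_emb, axis=1, keepdims=True)

image_emb = rng.normal(size=(N, D))
image_emb /= np.linalg.norm(image_emb, axis=1, keepdims=True)

# Concept bottleneck: each image reduces to its concept-similarity scores,
# so the downstream decision is interpretable per concept.
concept_scores = image_emb @ concept_emb.T          # (N, num_concepts)

# A linear head on concept scores gives the private/public decision.
labels = rng.integers(0, 2, size=N)                 # toy privacy labels
w = np.linalg.lstsq(concept_scores, labels - 0.5, rcond=None)[0]
pred_private = (concept_scores @ w) > 0
print("toy accuracy:", (pred_private == labels).mean())
```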

Result: PrivLEX demonstrates ability to identify personal data concepts in images. The paper also analyzes human perception of concept sensitivity in image privacy datasets.

Conclusion: PrivLEX represents the first interpretable privacy classifier aligned with legal concepts, providing transparent privacy classification through legally grounded personal data concept detection.

Abstract: We present PrivLEX, a novel image privacy classifier that grounds its decisions in legally defined personal data concepts. PrivLEX is the first interpretable privacy classifier aligned with legal concepts that leverages the recognition capabilities of Vision-Language Models (VLMs). PrivLEX relies on zero-shot VLM concept detection to provide interpretable classification through a label-free Concept Bottleneck Model, without requiring explicit concept labels during training. We demonstrate PrivLEX’s ability to identify personal data concepts that are present in images. We further analyse the sensitivity of such concepts as perceived by human annotators of image privacy datasets.

[164] MAD: Motion Appearance Decoupling for efficient Driving World Models

Ahmad Rahimi, Valentin Gerard, Eloi Zablocki, Matthieu Cord, Alexandre Alahi

Main category: cs.CV

TL;DR: Efficient adaptation of general video diffusion models into controllable driving world models using decoupled motion learning and appearance synthesis.

DetailsMotivation: Current video diffusion models generate photorealistic videos but lack structured motion and physical consistency needed for autonomous driving world models. Adapting them requires massive domain data and costly fine-tuning.

Method: Two-stage framework: 1) Adapt model to predict structured motion using skeletonized agents/scene videos, focusing on physical/social plausibility. 2) Reuse same backbone to synthesize realistic RGB videos conditioned on motion sequences (dressing motion with texture/lighting).

Result: Achieves SOTA performance using less than 6% of the compute of prior methods when adapting SVD. MAD-LTX outperforms all open-source competitors and supports comprehensive text, ego, and object controls.

Conclusion: Decoupled motion-appearance approach enables efficient adaptation of general video diffusion models into controllable driving world models with minimal supervision, following reasoning-rendering paradigm.

Abstract: Recent video diffusion models generate photorealistic, temporally coherent videos, yet they fall short as reliable world models for autonomous driving, where structured motion and physically consistent interactions are essential. Adapting these generalist video models to driving domains has shown promise but typically requires massive domain-specific data and costly fine-tuning. We propose an efficient adaptation framework that converts generalist video diffusion models into controllable driving world models with minimal supervision. The key idea is to decouple motion learning from appearance synthesis. First, the model is adapted to predict structured motion in a simplified form: videos of skeletonized agents and scene elements, focusing learning on physical and social plausibility. Then, the same backbone is reused to synthesize realistic RGB videos conditioned on these motion sequences, effectively “dressing” the motion with texture and lighting. This two-stage process mirrors a reasoning-rendering paradigm: first infer dynamics, then render appearance. Our experiments show this decoupled approach is exceptionally efficient: adapting SVD, we match prior SOTA models with less than 6% of their compute. Scaling to LTX, our MAD-LTX model outperforms all open-source competitors, and supports a comprehensive suite of text, ego, and object controls. Project page: https://vita-epfl.github.io/MAD-World-Model/

[165] Towards Robust Cross-Dataset Object Detection Generalization under Domain Specificity

Ritabrata Chakraborty, Hrishit Mitra, Shivakumara Palaiahnakote, Umapada Pal

Main category: cs.CV

TL;DR: Cross-dataset object detection performance reveals clear structure: transfer within same setting type (agnostic-agnostic or specific-specific) is stable, while cross-setting transfers drop substantially, especially from specific to agnostic datasets.

DetailsMotivation: Object detectors degrade sharply on different benchmarks, but the nature of this cross-dataset performance drop isn't well understood. The paper aims to characterize cross-dataset object detection through the lens of setting specificity.

Method: Group benchmarks into setting-agnostic (diverse everyday scenes) and setting-specific (narrow environment) datasets. Evaluate standard detector family across all train-test pairs. Compare closed-label transfer with open-label protocol using CLIP similarity to map predicted classes to nearest target label.
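
The open-label protocol is easy to picture in code. In the minimal sketch below, each predicted source class is mapped to its nearest target label by cosine similarity of label embeddings; `embed` is a deterministic stand-in for the CLIP text encoder the paper uses, so the toy mapping is arbitrary, whereas real CLIP features would map semantic near-synonyms (e.g., "motorbike" to "motorcycle") correctly.

```python
# Minimal sketch of the open-label mapping (embed() is a stand-in for CLIP).
import hashlib
import numpy as np

def embed(name: str, dim: int = 64) -> np.ndarray:
    # Deterministic placeholder embedding seeded by a stable hash of the name.
    seed = int(hashlib.md5(name.encode()).hexdigest(), 16) % (2**32)
    rng = np.random.default_rng(seed)
    v = rng.normal(size=dim)
    return v / np.linalg.norm(v)

def open_label_map(source_classes, target_classes):
    S = np.stack([embed(c) for c in source_classes])
    T = np.stack([embed(c) for c in target_classes])
    sim = S @ T.T                                   # cosine similarities
    return {s: target_classes[j] for s, j in zip(source_classes, sim.argmax(1))}

# Predicted boxes are then scored against the mapped target label instead of
# being counted as label-space misses.
print(open_label_map(["motorbike", "person"], ["motorcycle", "pedestrian", "car"]))
```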

Result: Clear structure emerges: transfer within same setting type is relatively stable, while cross-setting transfers drop substantially and are often asymmetric. Most severe breakdowns occur when transferring from specific sources to agnostic targets. Open-label evaluation yields consistent but bounded gains, with many corrected cases corresponding to semantic near-misses.

Conclusion: Domain shift dominates in hardest cross-dataset regimes. The study provides principled characterization of cross-dataset object detection under setting specificity and practical guidance for evaluating detectors under distribution shift.

Abstract: Object detectors often perform well in-distribution, yet degrade sharply on a different benchmark. We study cross-dataset object detection (CD-OD) through a lens of setting specificity. We group benchmarks into setting-agnostic datasets with diverse everyday scenes and setting-specific datasets tied to a narrow environment, and evaluate a standard detector family across all train–test pairs. This reveals a clear structure in CD-OD: transfer within the same setting type is relatively stable, while transfer across setting types drops substantially and is often asymmetric. The most severe breakdowns occur when transferring from specific sources to agnostic targets, and persist after open-label alignment, indicating that domain shift dominates in the hardest regimes. To disentangle domain shift from label mismatch, we compare closed-label transfer with an open-label protocol that maps predicted classes to the nearest target label using CLIP similarity. Open-label evaluation yields consistent but bounded gains, and many corrected cases correspond to semantic near-misses supported by the image evidence. Overall, we provide a principled characterization of CD-OD under setting specificity and practical guidance for evaluating detectors under distribution shift. Code will be released at https://github.com/Ritabrata04/cdod-icpr.

[166] V-DPM: 4D Video Reconstruction with Dynamic Point Maps

Edgar Sucar, Eldar Insafutdinov, Zihang Lai, Andrea Vedaldi

Main category: cs.CV

TL;DR: V-DPM extends Dynamic Point Maps to video input for 4D reconstruction, achieving SOTA by adapting VGGT with synthetic data to predict full 3D motion of every scene point.

DetailsMotivation: Existing Dynamic Point Maps are limited to image pairs and require optimization for multiple views, while videos provide richer temporal information for dynamic 3D scene understanding.

Method: Formulates DPMs for video input to maximize representational power, facilitates neural prediction, and enables pretrained model reuse. Implements on VGGT architecture with modest synthetic data adaptation.

Result: Achieves state-of-the-art performance in 3D and 4D reconstruction for dynamic scenes, recovering not only dynamic depth but also full 3D motion of every point in the scene.

Conclusion: V-DPM demonstrates the utility of Dynamic Point Maps for video analysis, enabling comprehensive 4D reconstruction of dynamic scenes with accurate 3D motion estimation.

Abstract: Powerful 3D representations such as DUSt3R invariant point maps, which encode 3D shape and camera parameters, have significantly advanced feed forward 3D reconstruction. While point maps assume static scenes, Dynamic Point Maps (DPMs) extend this concept to dynamic 3D content by additionally representing scene motion. However, existing DPMs are limited to image pairs and, like DUSt3R, require post processing via optimization when more than two views are involved. We argue that DPMs are more useful when applied to videos and introduce V-DPM to demonstrate this. First, we show how to formulate DPMs for video input in a way that maximizes representational power, facilitates neural prediction, and enables reuse of pretrained models. Second, we implement these ideas on top of VGGT, a recent and powerful 3D reconstructor. Although VGGT was trained on static scenes, we show that a modest amount of synthetic data is sufficient to adapt it into an effective V-DPM predictor. Our approach achieves state of the art performance in 3D and 4D reconstruction for dynamic scenes. In particular, unlike recent dynamic extensions of VGGT such as P3, DPMs recover not only dynamic depth but also the full 3D motion of every point in the scene.

[167] Video Joint-Embedding Predictive Architectures for Facial Expression Recognition

Lennart Eing, Cristina Luna-Jiménez, Silvan Mertes, Elisabeth André

Main category: cs.CV

TL;DR: V-JEPA pre-training for facial expression recognition achieves state-of-the-art performance on RAVDESS and outperforms vision-based methods on CREMA-D, demonstrating strong generalization capabilities.

DetailsMotivation: Traditional video understanding pre-training methods use pixel-level reconstructions which capture irrelevant information. V-JEPA learns by predicting embeddings of masked regions from unmasked ones, focusing on relevant features for facial expression recognition.

Method: Use pre-trained V-JEPA video encoder (embedding-based approach), then train shallow classifiers on RAVDESS and CREMA-D datasets. Cross-dataset evaluations test generalization.
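
The evaluation setup amounts to a linear probe on frozen features. A minimal sketch, assuming a frozen encoder whose outputs are replaced here by random tensors; the head size and class count are illustrative:

```python
# Sketch of the frozen-encoder + shallow-classifier setup (a linear probe).
import torch
import torch.nn as nn

feat_dim, num_emotions = 1024, 8        # assumed dims; e.g., 8 emotion classes
probe = nn.Sequential(nn.LayerNorm(feat_dim), nn.Linear(feat_dim, num_emotions))
opt = torch.optim.AdamW(probe.parameters(), lr=1e-3)

features = torch.randn(32, feat_dim)    # stand-in for frozen V-JEPA embeddings
labels = torch.randint(0, num_emotions, (32,))

for _ in range(5):                      # only the shallow head is trained
    loss = nn.functional.cross_entropy(probe(features), labels)
    opt.zero_grad(); loss.backward(); opt.step()
print(f"probe loss: {loss.item():.3f}")
```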

Result: State-of-the-art on RAVDESS, outperforms all other vision-based methods on CREMA-D (+1.48 WAR). Cross-dataset evaluations show strong generalization capabilities.

Conclusion: Embedding-based pre-training approaches like V-JEPA have strong potential to advance facial expression recognition by focusing on relevant features and avoiding irrelevant information capture.

Abstract: This paper introduces a novel application of Video Joint-Embedding Predictive Architectures (V-JEPAs) for Facial Expression Recognition (FER). Departing from conventional pre-training methods for video understanding that rely on pixel-level reconstructions, V-JEPAs learn by predicting embeddings of masked regions from the embeddings of unmasked regions. This enables the trained encoder to not capture irrelevant information about a given video like the color of a region of pixels in the background. Using a pre-trained V-JEPA video encoder, we train shallow classifiers using the RAVDESS and CREMA-D datasets, achieving state-of-the-art performance on RAVDESS and outperforming all other vision-based methods on CREMA-D (+1.48 WAR). Furthermore, cross-dataset evaluations reveal strong generalization capabilities, demonstrating the potential of purely embedding-based pre-training approaches to advance FER. We release our code at https://github.com/lennarteingunia/vjepa-for-fer.

[168] GlovEgo-HOI: Bridging the Synthetic-to-Real Gap for Industrial Egocentric Human-Object Interaction Detection

Alfio Spoto, Rosario Leonardi, Francesco Ragusa, Giovanni Maria Farinella

Main category: cs.CV

TL;DR: A framework combining synthetic data and diffusion models to augment real images with realistic PPE for industrial EHOI analysis, with new benchmark dataset and model.

DetailsMotivation: Industrial safety requires robust EHOI analysis, but development is hindered by scarcity of annotated domain-specific data, especially for Personal Protective Equipment scenarios.

Method: Propose data generation framework combining synthetic data with diffusion-based process to augment real images with realistic PPE; introduce GlovEgo-HOI dataset and GlovEgo-Net model with Glove-Head and Keypoint-Head modules leveraging hand pose information.

Result: Extensive experiments demonstrate effectiveness of both the data generation framework and GlovEgo-Net model for enhanced interaction detection in industrial safety scenarios.

Conclusion: The proposed approach addresses data scarcity in industrial EHOI analysis, releases comprehensive resources (dataset, augmentation pipeline, pre-trained models) to foster further research in industrial safety applications.

Abstract: Egocentric Human-Object Interaction (EHOI) analysis is crucial for industrial safety, yet the development of robust models is hindered by the scarcity of annotated domain-specific data. We address this challenge by introducing a data generation framework that combines synthetic data with a diffusion-based process to augment real-world images with realistic Personal Protective Equipment (PPE). We present GlovEgo-HOI, a new benchmark dataset for industrial EHOI, and GlovEgo-Net, a model integrating Glove-Head and Keypoint-Head modules to leverage hand pose information for enhanced interaction detection. Extensive experiments demonstrate the effectiveness of the proposed data generation framework and GlovEgo-Net. To foster further research, we release the GlovEgo-HOI dataset, augmentation pipeline, and pre-trained models at: GitHub project.

[169] Bipartite Mode Matching for Vision Training Set Search from a Hierarchical Data Server

Yue Yao, Ruining Yang, Tom Gedeon

Main category: cs.CV

TL;DR: The paper proposes BMM, a bipartite mode matching algorithm that constructs optimal training sets from a hierarchical data server by aligning source and target modes, enabling data-centric unsupervised domain adaptation.

DetailsMotivation: When target domain is accessible but real-time annotation is infeasible, existing methods focus on improving algorithms while ignoring the potential of optimizing data server structure. Target domains have distinct semantic modes, and if training sets lack these modes, model performance suffers.

Method: Introduces a hierarchical data server inspired by web search engines, with BMM algorithm to align source and target modes through bipartite matching. For each target mode, finds best mode match in server data tree, ensuring one-on-one optimal matching between all target and source modes.
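
The matching step itself is standard bipartite assignment. A toy sketch, with centroid distances standing in for whatever domain-gap cost the paper actually minimizes:

```python
# Toy sketch of bipartite mode matching: target modes are matched one-to-one
# to source (server) modes via the Hungarian algorithm.
import numpy as np
from scipy.optimize import linear_sum_assignment

rng = np.random.default_rng(1)
target_modes = rng.normal(size=(5, 128))        # target-domain mode centroids
server_modes = rng.normal(size=(40, 128))       # leaf modes of the data tree

# Cost = Euclidean distance between mode centroids (a stand-in cost).
cost = np.linalg.norm(target_modes[:, None] - server_modes[None], axis=-1)
rows, cols = linear_sum_assignment(cost)        # optimal one-to-one matching
print({int(t): int(s) for t, s in zip(rows, cols)})
# The matched server modes are then pooled to form the searched training set.
```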

Result: Matched server modes create training sets with consistently smaller domain gaps across object re-ID and detection tasks. Models trained on these sets achieve higher accuracy. BMM is orthogonal to model-centric UDA methods, and combining it with pseudo-labeling yields further improvements.

Conclusion: BMM enables data-centric UDA by optimizing training set construction from hierarchical data servers, complementing existing model-centric approaches and improving performance when real-time target annotation is unavailable.

Abstract: We explore a situation in which the target domain is accessible, but real-time data annotation is not feasible. Instead, we would like to construct an alternative training set from a large-scale data server so that a competitive model can be obtained. For this problem, because the target domain usually exhibits distinct modes (i.e., semantic clusters representing data distribution), if the training set does not contain these target modes, the model performance would be compromised. While prior existing works improve algorithms iteratively, our research explores the often-overlooked potential of optimizing the structure of the data server. Inspired by the hierarchical nature of web search engines, we introduce a hierarchical data server, together with a bipartite mode matching algorithm (BMM) to align source and target modes. For each target mode, we look in the server data tree for the best mode match, which might be large or small in size. Through bipartite matching, we aim for all target modes to be optimally matched with source modes in a one-on-one fashion. Compared with existing training set search algorithms, we show that the matched server modes constitute training sets that have consistently smaller domain gaps with the target domain across object re-identification (re-ID) and detection tasks. Consequently, models trained on our searched training sets have higher accuracy than those trained otherwise. BMM allows data-centric unsupervised domain adaptation (UDA) orthogonal to existing model-centric UDA methods. By combining the BMM with existing UDA methods like pseudo-labeling, further improvement is observed.

[170] Hot-Start from Pixels: Low-Resolution Visual Tokens for Chinese Language Modeling

Shuyang Xiang, Hao Guan

Main category: cs.CV

TL;DR: Chinese character modeling using low-resolution visual inputs instead of token IDs achieves comparable accuracy to traditional index-based approaches, with faster early learning.

DetailsMotivation: Traditional Chinese language models use discrete index-based tokens that ignore the visual form of characters, which carries semantic and phonetic information. The authors investigate whether visual structure can provide useful signals for character prediction.

Method: The decoder receives grayscale images of individual Chinese characters at very low resolutions (as low as 8×8 pixels) instead of token IDs. This visual approach is compared against traditional index-based token models.
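
The input swap is simple to express: the usual ID-embedding lookup is replaced by a linear projection of flattened pixels. A minimal sketch with assumed shapes and hidden size:

```python
# Sketch of the input swap: instead of nn.Embedding lookups, each character
# arrives as a flattened low-resolution grayscale image that a linear layer
# projects into the decoder's hidden space.
import torch
import torch.nn as nn

d_model, res = 256, 8                     # assumed hidden size; 8x8 characters

class PixelTokenEmbed(nn.Module):
    def __init__(self):
        super().__init__()
        self.proj = nn.Linear(res * res, d_model)  # replaces the ID embedding

    def forward(self, char_images: torch.Tensor) -> torch.Tensor:
        # char_images: (batch, seq_len, res, res), grayscale in [0, 1]
        return self.proj(char_images.flatten(-2))  # (batch, seq_len, d_model)

tokens = PixelTokenEmbed()(torch.rand(2, 16, res, res))
print(tokens.shape)                       # torch.Size([2, 16, 256])
```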

Result: Visual models achieve 39.2% accuracy, comparable to the index-based baseline of 39.1%. They show a “hot-start” effect - reaching above 12% accuracy by 0.4% of total training, while index-based models lag below 6% at the same point.

Conclusion: Minimal visual structure provides a robust and efficient signal for Chinese language modeling, offering an alternative character representation approach that complements traditional index-based methods.

Abstract: Large language models typically represent Chinese characters as discrete index-based tokens, largely ignoring their visual form. For logographic scripts, visual structure carries semantic and phonetic information, which may aid prediction. We investigate whether low-resolution visual inputs can serve as an alternative for character-level modeling. Instead of token IDs, our decoder receives grayscale images of individual characters, with resolutions as low as $8 \times 8$ pixels. Remarkably, these inputs achieve 39.2% accuracy, comparable to the index-based baseline of 39.1%. Such low-resource settings also exhibit a pronounced \emph{hot-start} effect: by 0.4% of total training, accuracy reaches above 12%, while index-based models lag at below 6%. Overall, our results demonstrate that minimal visual structure can provide a robust and efficient signal for Chinese language modeling, offering an alternative perspective on character representation that complements traditional index-based approaches.

[171] Trustworthy Longitudinal Brain MRI Completion: A Deformation-Based Approach with KAN-Enhanced Diffusion Model

Tianli Tao, Ziyang Wang, Delong Yang, Han Zhang, Le Zhang

Main category: cs.CV

TL;DR: DF-DiffCom is a KAN-enhanced diffusion model that uses deformation fields for trustworthy longitudinal brain MRI completion, outperforming SOTA methods and being modality-agnostic.

DetailsMotivation: High attrition rates in longitudinal brain MRI studies lead to missing data, complicating analysis. Existing deep generative models rely solely on image intensity, resulting in limited fidelity/trustworthiness and restricted usage flexibility due to fixed guidance in model structure.

Method: DF-DiffCom is a Kolmogorov-Arnold Networks (KAN)-enhanced diffusion model that smartly leverages deformation fields for trustworthy longitudinal brain image completion. It’s trained on OASIS-3 dataset.

Result: Outperforms state-of-the-art methods, improving PSNR by 5.6% and SSIM by 0.12. The modality-agnostic nature allows smooth extension to varied MRI modalities and even to attribute maps like brain tissue segmentation results.

Conclusion: DF-DiffCom addresses key limitations of existing methods by providing trustworthy longitudinal brain image completion with improved fidelity and greater flexibility for diverse application scenarios.

Abstract: Longitudinal brain MRI is essential for lifespan study, yet high attrition rates often lead to missing data, complicating analysis. Deep generative models have been explored, but most rely solely on image intensity, leading to two key limitations: 1) the fidelity or trustworthiness of the generated brain images are limited, making downstream studies questionable; 2) the usage flexibility is restricted due to fixed guidance rooted in the model structure, restricting full ability to versatile application scenarios. To address these challenges, we introduce DF-DiffCom, a Kolmogorov-Arnold Networks (KAN)-enhanced diffusion model that smartly leverages deformation fields for trustworthy longitudinal brain image completion. Trained on OASIS-3, DF-DiffCom outperforms state-of-the-art methods, improving PSNR by 5.6% and SSIM by 0.12. More importantly, its modality-agnostic nature allows smooth extension to varied MRI modalities, even to attribute maps such as brain tissue segmentation results.

[172] Sim2real Image Translation Enables Viewpoint-Robust Policies from Fixed-Camera Datasets

Jeremiah Coholich, Justin Wit, Robert Azarcon, Zsolt Kira

Main category: cs.CV

TL;DR: MANGO is an unpaired image translation method that enables sim2real transfer for robot manipulation by generating diverse real-world camera viewpoints from simulation data, using novel segmentation-conditioned losses and regularized discriminator design.

DetailsMotivation: Vision-based robot manipulation policies are brittle to camera viewpoint variations, real robot demonstration data is scarce and lacks viewpoint diversity, and simulation data has viewpoint coverage but suffers from visual sim2real gap.

Method: Proposes MANGO - an unpaired image translation method with three key components: 1) segmentation-conditioned InfoNCE loss, 2) highly-regularized discriminator design, and 3) modified PatchNCE loss to maintain viewpoint consistency during sim2real translation.
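
One plausible reading of a segmentation-conditioned InfoNCE (the paper's exact formulation is not reproduced here) is sketched below: same-location patches across the simulated and translated images are positives, and off-diagonal patches sharing a segmentation class are removed from the negative set so same-class content is not pushed apart.

```python
# Hedged sketch of a segmentation-conditioned InfoNCE; details are assumed.
import torch
import torch.nn.functional as F

def seg_infonce(z_sim, z_real, seg_ids, tau=0.07):
    # z_sim, z_real: (P, D) L2-normalized patch features; seg_ids: (P,)
    logits = z_sim @ z_real.t() / tau               # (P, P) similarity logits
    same_class = seg_ids[:, None] == seg_ids[None, :]
    diag = torch.eye(len(seg_ids), dtype=torch.bool)
    # Drop same-class off-diagonal entries so they are not treated as negatives.
    logits = logits.masked_fill(same_class & ~diag, float("-inf"))
    return F.cross_entropy(logits, torch.arange(len(seg_ids)))

P, D = 64, 128
z1 = F.normalize(torch.randn(P, D), dim=1)
z2 = F.normalize(torch.randn(P, D), dim=1)
print(seg_infonce(z1, z2, torch.randint(0, 5, (P,))).item())
```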

Result: MANGO outperforms all other image translation methods tested. Policies trained on MANGO-augmented data achieve up to 60% success rates on viewpoints where non-augmented policies completely fail. Only requires small amount of fixed-camera real-world data.

Conclusion: MANGO effectively bridges the sim2real gap for robot manipulation by enabling viewpoint-consistent translation of simulated observations to diverse real-world viewpoints, significantly improving policy robustness to camera viewpoint variations.

Abstract: Vision-based policies for robot manipulation have achieved significant recent success, but are still brittle to distribution shifts such as camera viewpoint variations. Robot demonstration data is scarce and often lacks appropriate variation in camera viewpoints. Simulation offers a way to collect robot demonstrations at scale with comprehensive coverage of different viewpoints, but presents a visual sim2real challenge. To bridge this gap, we propose MANGO – an unpaired image translation method with a novel segmentation-conditioned InfoNCE loss, a highly-regularized discriminator design, and a modified PatchNCE loss. We find that these elements are crucial for maintaining viewpoint consistency during sim2real translation. When training MANGO, we only require a small amount of fixed-camera data from the real world, but show that our method can generate diverse unseen viewpoints by translating simulated observations. In this domain, MANGO outperforms all other image translation methods we tested. Imitation-learning policies trained on data augmented by MANGO are able to achieve success rates as high as 60% on views that the non-augmented policy fails completely on.

[173] OpenVoxel: Training-Free Grouping and Captioning Voxels for Open-Vocabulary 3D Scene Understanding

Sheng-Yu Huang, Jaesung Choe, Yu-Chiang Frank Wang, Cheng Sun

Main category: cs.CV

TL;DR: OpenVoxel is a training-free algorithm for open-vocabulary 3D scene understanding that groups sparse voxels and generates captions using VLMs/MLLMs without CLIP/BERT embeddings.

DetailsMotivation: To enable open-vocabulary 3D scene understanding without requiring training or introducing text encoder embeddings, allowing for more flexible and efficient scene analysis.

Method: Uses sparse voxel rasterization from multi-view images, groups voxels into meaningful objects, then leverages Vision Language Models and Multi-modal Large Language Models for captioning via text-to-text search.

Result: Superior performance compared to recent studies, especially in complex referring expression segmentation tasks, while being training-free and avoiding CLIP/BERT embeddings.

Conclusion: OpenVoxel provides an effective training-free approach for open-vocabulary 3D scene understanding that outperforms existing methods, particularly for challenging referring expression segmentation.

Abstract: We propose OpenVoxel, a training-free algorithm for grouping and captioning sparse voxels for the open-vocabulary 3D scene understanding tasks. Given the sparse voxel rasterization (SVR) model obtained from multi-view images of a 3D scene, our OpenVoxel is able to produce meaningful groups that describe different objects in the scene. Also, by leveraging powerful Vision Language Models (VLMs) and Multi-modal Large Language Models (MLLMs), our OpenVoxel successfully builds an informative scene map by captioning each group, enabling further 3D scene understanding tasks such as open-vocabulary segmentation (OVS) or referring expression segmentation (RES). Unlike previous methods, our method is training-free and does not introduce embeddings from a CLIP/BERT text encoder. Instead, we directly proceed with text-to-text search using MLLMs. Through extensive experiments, our method demonstrates superior performance compared to recent studies, particularly in complex referring expression segmentation (RES) tasks. The code will be open-sourced.

[174] CogRail: Benchmarking VLMs in Cognitive Intrusion Perception for Intelligent Railway Transportation Systems

Yonglin Tian, Qiyao Zhang, Wei Xu, Yutong Wang, Yihao Wu, Xinyi Li, Xingyuan Dai, Hui Zhang, Zhiyong Cui, Baoqing Guo, Zujun Yu, Yisheng Lv

Main category: cs.CV

TL;DR: CogRail: A novel benchmark for cognitive intrusion perception in railway safety that integrates curated datasets with QA annotations for spatio-temporal reasoning, revealing limitations of current VLMs and proposing a joint fine-tuning framework for improved performance.

DetailsMotivation: Existing railway intrusion detection systems focus narrowly on object classification within fixed scopes using rule-based heuristics, overlooking latent intrusion risks that require understanding spatial context and temporal dynamics of objects.

Method: 1) Introduce CogRail benchmark with curated datasets and cognitively-driven QA annotations for spatio-temporal reasoning; 2) Systematically evaluate state-of-the-art VLMs using multimodal prompts; 3) Propose joint fine-tuning framework integrating three core tasks: position perception, movement prediction, and threat analysis.

Result: Current large-scale multimodal models struggle with complex spatial-temporal reasoning required for cognitive intrusion perception. The proposed joint fine-tuning framework significantly enhances model performance by enabling targeted adaptation to domain-specific reasoning demands.

Conclusion: Structured multi-task learning through joint fine-tuning improves both accuracy and interpretability for cognitive intrusion perception, highlighting the need for specialized adaptation of foundation models in safety-critical railway domains.

Abstract: Accurate and early perception of potential intrusion targets is essential for ensuring the safety of railway transportation systems. However, most existing systems focus narrowly on object classification within fixed visual scopes and apply rule-based heuristics to determine intrusion status, often overlooking targets that pose latent intrusion risks. Anticipating such risks requires the cognition of spatial context and temporal dynamics for the object of interest (OOI), which presents challenges for conventional visual models. To facilitate deep intrusion perception, we introduce a novel benchmark, CogRail, which integrates curated open-source datasets with cognitively driven question-answer annotations to support spatio-temporal reasoning and prediction. Building upon this benchmark, we conduct a systematic evaluation of state-of-the-art visual-language models (VLMs) using multimodal prompts to identify their strengths and limitations in this domain. Furthermore, we fine-tune VLMs for better performance and propose a joint fine-tuning framework that integrates three core tasks, position perception, movement prediction, and threat analysis, facilitating effective adaptation of general-purpose foundation models into specialized models tailored for cognitive intrusion perception. Extensive experiments reveal that current large-scale multimodal models struggle with the complex spatial-temporal reasoning required by the cognitive intrusion perception task, underscoring the limitations of existing foundation models in this safety-critical domain. In contrast, our proposed joint fine-tuning framework significantly enhances model performance by enabling targeted adaptation to domain-specific reasoning demands, highlighting the advantages of structured multi-task learning in improving both accuracy and interpretability. Code will be available at https://github.com/Hub-Tian/CogRail.

[175] Show, don’t tell – Providing Visual Error Feedback for Handwritten Documents

Said Yasin, Torsten Zesch

Main category: cs.CV

TL;DR: Current systems for providing visual feedback on handwritten documents fail to achieve acceptable quality, with both modular and end-to-end approaches showing limitations.

DetailsMotivation: Handwriting remains essential in education, but providing visual feedback on handwritten documents is an important yet understudied area that needs better solutions.

Method: The paper empirically compares modular and end-to-end systems for processing handwritten input images to generate correctly placed informative error feedback.

Result: Both modular and end-to-end approaches currently do not achieve acceptable overall quality for providing visual feedback on handwritten documents.

Conclusion: The paper identifies major challenges in the field and outlines an agenda for future research to improve visual feedback systems for handwritten documents.

Abstract: Handwriting remains an essential skill, particularly in education. Therefore, providing visual feedback on handwritten documents is an important but understudied area. We outline the many challenges when going from an image of handwritten input to correctly placed informative error feedback. We empirically compare modular and end-to-end systems and find that both approaches currently do not achieve acceptable overall quality. We identify the major challenges and outline an agenda for future research.

[176] Iterative Differential Entropy Minimization (IDEM) method for fine rigid pairwise 3D Point Cloud Registration: A Focus on the Metric

Emmanuele Barberi, Felice Sfravara, Filippo Cucinotta

Main category: cs.CV

TL;DR: A novel differential entropy-based metric for 3D point cloud registration that outperforms traditional Euclidean distance metrics in handling density differences, noise, holes, and partial overlap.

DetailsMotivation: Traditional point cloud registration methods using Euclidean distances (RMSE, Chamfer, Hausdorff) have limitations: they require choosing a fixed point cloud, are sensitive to density differences, noise, holes, and limited overlap, and often need preprocessing. Real-world scenarios often affect both point clouds, making traditional approaches suboptimal.

Method: The authors propose Iterative Differential Entropy Minimization (IDEM), a novel differential entropy-based metric that serves as an objective function for fine rigid pairwise 3D point cloud registration. This metric is commutative (doesn’t require choosing a fixed point cloud) and reveals clear minima during transformations for optimal alignment.
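
The paper's exact estimator is not reproduced here, but the mechanics can be illustrated with a Gaussian differential-entropy proxy, H = 0.5·logdet(2πeΣ), computed on the merged cloud: it is symmetric in the two inputs (so no "fixed" cloud is needed) and dips when the clouds align.

```python
# Illustrative sketch only: a Gaussian entropy proxy stands in for the
# paper's differential-entropy metric over the merged point cloud.
import numpy as np

def gaussian_entropy(points: np.ndarray) -> float:
    cov = np.cov(points.T)
    return 0.5 * np.linalg.slogdet(2 * np.pi * np.e * cov)[1]

rng = np.random.default_rng(0)
cloud = rng.normal(size=(500, 2)) * [3.0, 0.5]       # anisotropic 2D shape

def rot(theta):
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s], [s, c]])

# Scan rotations of a copy of the cloud: entropy of the union is lowest
# near theta = 0, i.e., at the correct alignment.
for theta in [0.0, 0.3, 0.8]:
    merged = np.vstack([cloud, cloud @ rot(theta).T])
    print(f"theta={theta:.1f}  H={gaussian_entropy(merged):.3f}")
```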

Result: Multiple case studies show IDEM outperforms traditional metrics (RMSE, Chamfer distance, Hausdorff distance). It proves effective even with challenging conditions like density differences, noise, holes, and partial overlap, where RMSE fails to achieve optimal alignment.

Conclusion: The differential entropy-based metric provides a robust alternative to traditional Euclidean distance metrics for point cloud registration, addressing key limitations of existing methods and enabling better alignment in real-world scenarios with imperfect data.

Abstract: Point cloud registration is a central theme in computer vision, with alignment algorithms continuously improving for greater robustness. Commonly used methods evaluate Euclidean distances between point clouds and minimize an objective function, such as Root Mean Square Error (RMSE). However, these approaches are most effective when the point clouds are well-prealigned and issues such as differences in density, noise, holes, and limited overlap can compromise the results. Traditional methods, such as Iterative Closest Point (ICP), require choosing one point cloud as fixed, since Euclidean distances lack commutativity. When only one point cloud has issues, adjustments can be made, but in real scenarios, both point clouds may be affected, often necessitating preprocessing. The authors introduce a novel differential entropy-based metric, designed to serve as the objective function within an optimization framework for fine rigid pairwise 3D point cloud registration, denoted as Iterative Differential Entropy Minimization (IDEM). This metric does not depend on the choice of a fixed point cloud and, during transformations, reveals a clear minimum corresponding to the best alignment. Multiple case studies are conducted, and the results are compared with those obtained using RMSE, Chamfer distance, and Hausdorff distance. The proposed metric proves effective even with density differences, noise, holes, and partial overlap, where RMSE does not always yield optimal alignment.

[177] GRCF: Two-Stage Groupwise Ranking and Calibration Framework for Multimodal Sentiment Analysis

Manning Gao, Leheng Zhang, Shiqin Han, Haifeng Hu, Yuncheng Jiang, Sijie Mai

Main category: cs.CV

TL;DR: GRCF: A two-stage framework for multimodal sentiment analysis that combines adaptive pairwise ranking with score calibration to address limitations of traditional regression and uniform pairwise approaches.

DetailsMotivation: Traditional point-wise regression is sensitive to label noise and ignores relative ordering between samples, while existing pairwise approaches treat all comparisons equally with static margins that don't reflect varying semantic distances between sentiment groups.

Method: Two-stage framework: Stage 1 uses GRPO-inspired Advantage-Weighted Dynamic Margin Ranking Loss to build fine-grained ordinal structure with adaptive focus on hard samples. Stage 2 employs MAE-driven objective for absolute score calibration and magnitude alignment.
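
A hedged sketch of a Stage-1-style objective appears below: the ranking margin grows with the ground-truth gap (dynamic margins), and more strongly violated pairs receive more weight (adaptive focus on hard samples). The specific advantage weighting used by GRCF may differ.

```python
# Sketch of a dynamic-margin, difficulty-weighted pairwise ranking loss;
# the exact GRPO-inspired advantage weighting is an assumption here.
import torch

def dyn_margin_rank_loss(pred, label, alpha=0.5):
    # pred, label: (B,) predicted and ground-truth sentiment intensities
    dp = pred[:, None] - pred[None, :]            # predicted differences
    dl = label[:, None] - label[None, :]          # ground-truth differences
    mask = dl > 0                                 # pairs where i should outrank j
    margin = alpha * dl                           # dynamic, gap-dependent margin
    viol = torch.relu(margin - dp)[mask]          # hinge violations
    weight = torch.softmax(viol.detach(), dim=0)  # emphasize hard pairs
    return (weight * viol).sum()

pred = torch.randn(8, requires_grad=True)
label = torch.randn(8)
loss = dyn_margin_rank_loss(pred, label)
loss.backward()
print(loss.item())
```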

Result: Achieves state-of-the-art performance on core regression benchmarks and demonstrates strong generalizability in classification tasks including multimodal humor and sarcasm detection.

Conclusion: GRCF effectively addresses key limitations of existing approaches by simultaneously preserving relative ordinal structure, ensuring absolute score calibration, and adaptively focusing on difficult samples, showing broad applicability across regression and classification tasks.

Abstract: Most Multimodal Sentiment Analysis research has focused on point-wise regression. While straightforward, this approach is sensitive to label noise and neglects whether one sample is more positive than another, resulting in unstable predictions and poor correlation alignment. Pairwise ordinal learning frameworks emerged to address this gap, capturing relative order by learning from comparisons. Yet, they introduce two new trade-offs: First, they assign uniform importance to all comparisons, failing to adaptively focus on hard-to-rank samples. Second, they employ static ranking margins, which fail to reflect the varying semantic distances between sentiment groups. To address this, we propose a Two-Stage Group-wise Ranking and Calibration Framework (GRCF) that adapts the philosophy of Group Relative Policy Optimization (GRPO). Our framework resolves these trade-offs by simultaneously preserving relative ordinal structure, ensuring absolute score calibration, and adaptively focusing on difficult samples. Specifically, Stage 1 introduces a GRPO-inspired Advantage-Weighted Dynamic Margin Ranking Loss to build a fine-grained ordinal structure. Stage 2 then employs an MAE-driven objective to align prediction magnitudes. To validate its generalizability, we extend GRCF to classification tasks, including multimodal humor detection and sarcasm detection. GRCF achieves state-of-the-art performance on core regression benchmarks, while also showing strong generalizability in classification tasks.

[178] Identifying Models Behind Text-to-Image Leaderboards

Ali Naseh, Yuefeng Peng, Anshuman Suri, Harsh Chaudhari, Alina Oprea, Amir Houmansadr

Main category: cs.CV

TL;DR: T2I model anonymity in voting-based leaderboards can be broken via distinctive image embedding clusters, enabling accurate deanonymization without prompt control or training data.

DetailsMotivation: To expose security flaws in text-to-image model leaderboards that rely on anonymized outputs for fair comparison, showing that current anonymization methods are insufficient.

Method: Centroid-based deanonymization method using image embeddings from 22 models and 280 prompts (150K images), analyzing distinctive clustering patterns and introducing prompt-level distinguishability metrics.
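
The attack reduces to nearest-centroid classification in embedding space. A toy version, with Gaussian blobs standing in for real image embeddings of each model's generations:

```python
# Toy centroid attack: fit one centroid per T2I model from embeddings of its
# known generations, then assign a query embedding to the nearest centroid.
import numpy as np

rng = np.random.default_rng(42)
num_models, dim = 4, 256
true_centroids = rng.normal(size=(num_models, dim))

def fake_generations(m, n=100):           # stand-in for model-m image embeds
    return true_centroids[m] + 0.3 * rng.normal(size=(n, dim))

centroids = np.stack([fake_generations(m).mean(0) for m in range(num_models)])

# Deanonymize: nearest centroid, no prompt control or training data needed.
query = fake_generations(2, n=1)
dists = np.linalg.norm(centroids - query, axis=1)
print("predicted model:", int(dists.argmin()))   # -> 2
```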

Result: High accuracy deanonymization achieved, revealing systematic model-specific signatures in embedding space, with certain prompts enabling near-perfect distinguishability between models.

Conclusion: Current T2I leaderboard anonymization is fundamentally flawed, requiring stronger defenses against deanonymization attacks that exploit distinctive model signatures in image embeddings.

Abstract: Text-to-image (T2I) models are increasingly popular, producing a large share of AI-generated images online. To compare model quality, voting-based leaderboards have become the standard, relying on anonymized model outputs for fairness. In this work, we show that such anonymity can be easily broken. We find that generations from each T2I model form distinctive clusters in the image embedding space, enabling accurate deanonymization without prompt control or training data. Using 22 models and 280 prompts (150K images), our centroid-based method achieves high accuracy and reveals systematic model-specific signatures. We further introduce a prompt-level distinguishability metric and conduct large-scale analyses showing how certain prompts can lead to near-perfect distinguishability. Our findings expose fundamental security flaws in T2I leaderboards and motivate stronger anonymization defenses.

[179] Fast-ThinkAct: Efficient Vision-Language-Action Reasoning via Verbalizable Latent Planning

Chi-Pin Huang, Yunze Man, Zhiding Yu, Min-Hung Chen, Jan Kautz, Yu-Chiang Frank Wang, Fu-En Yang

Main category: cs.CV

TL;DR: Fast-ThinkAct is an efficient reasoning framework for Vision-Language-Action tasks that uses verbalizable latent reasoning to achieve compact planning with 89.3% reduced inference latency while maintaining strong performance.

DetailsMotivation: Current reasoning VLAs with explicit chain-of-thought suffer from high inference latency due to lengthy reasoning traces, creating a need for more efficient reasoning that maintains strong performance in complex embodied tasks.

Method: Proposes Fast-ThinkAct framework that learns efficient reasoning with latent CoTs through distillation from a teacher model, using preference-guided objective to align manipulation trajectories and transfer linguistic/visual planning capabilities.

Result: Achieves strong performance across diverse embodied manipulation and reasoning benchmarks with up to 89.3% reduced inference latency over state-of-the-art reasoning VLAs, while maintaining effective long-horizon planning, few-shot adaptation, and failure recovery.

Conclusion: Fast-ThinkAct demonstrates that compact latent reasoning can significantly improve inference efficiency while preserving the planning capabilities needed for complex Vision-Language-Action tasks, offering a practical solution for real-time embodied AI systems.

Abstract: Vision-Language-Action (VLA) tasks require reasoning over complex visual scenes and executing adaptive actions in dynamic environments. While recent studies on reasoning VLAs show that explicit chain-of-thought (CoT) can improve generalization, they suffer from high inference latency due to lengthy reasoning traces. We propose Fast-ThinkAct, an efficient reasoning framework that achieves compact yet performant planning through verbalizable latent reasoning. Fast-ThinkAct learns to reason efficiently with latent CoTs by distilling from a teacher, driven by a preference-guided objective to align manipulation trajectories that transfers both linguistic and visual planning capabilities for embodied control. This enables reasoning-enhanced policy learning that effectively connects compact reasoning to action execution. Extensive experiments across diverse embodied manipulation and reasoning benchmarks demonstrate that Fast-ThinkAct achieves strong performance with up to 89.3% reduced inference latency over state-of-the-art reasoning VLAs, while maintaining effective long-horizon planning, few-shot adaptation, and failure recovery.

[180] AquaFeat+: an Underwater Vision Learning-based Enhancement Method for Object Detection, Classification, and Tracking

Emanuel da Costa Silva, Tatiana Taís Schein, José David García Ramos, Eduardo Lawson da Silva, Stephanie Loi Brião, Felipe Gomes de Oliveira, Paulo Lilles Jorge Drews-Jr

Main category: cs.CV

TL;DR: AquaFeat+ is a plug-and-play pipeline that enhances features for underwater vision tasks, improving object detection, classification, and tracking in challenging underwater conditions.

DetailsMotivation: Underwater video analysis faces challenges like low lighting, color distortion, and turbidity that degrade visual data quality and impair perception modules in robotic applications.

Method: AquaFeat+ includes color correction, hierarchical feature enhancement, and adaptive residual output modules trained end-to-end, guided directly by the final application’s loss function.

Result: Trained and evaluated on the FishTrack23 dataset, AquaFeat+ achieves significant improvements in object detection, classification, and tracking metrics.

Conclusion: AquaFeat+ effectively enhances perception tasks for underwater robotic applications by addressing specific underwater visual challenges through feature enhancement rather than human perceptual quality improvement.

Abstract: Underwater video analysis is particularly challenging due to factors such as low lighting, color distortion, and turbidity, which compromise visual data quality and directly impact the performance of perception modules in robotic applications. This work proposes AquaFeat+, a plug-and-play pipeline designed to enhance features specifically for automated vision tasks, rather than for human perceptual quality. The architecture includes modules for color correction, hierarchical feature enhancement, and an adaptive residual output, which are trained end-to-end and guided directly by the loss function of the final application. Trained and evaluated on the FishTrack23 dataset, AquaFeat+ achieves significant improvements in object detection, classification, and tracking metrics, validating its effectiveness for enhancing perception tasks in underwater robotic applications.

[181] Image2Garment: Simulation-ready Garment Generation from a Single Image

Selim Emir Can, Jan Ackermann, Kiyohiro Nakayama, Ruofan Liu, Tong Wu, Yang Zheng, Hugo Bertiche, Menglei Chai, Thabo Beeler, Gordon Wetzstein

Main category: cs.CV

TL;DR: A feed-forward framework that estimates simulation-ready garments from a single image by predicting material composition and fabric attributes, then mapping them to physical fabric parameters using a small material-physics dataset.

DetailsMotivation: Current methods for estimating garments from images either require multi-view capture with expensive differentiable simulation or only predict geometry without material properties needed for realistic simulation. There's a lack of image-to-physics datasets and the problem is ill-posed.

Method: 1) Fine-tune a vision-language model to infer material composition and fabric attributes from real images. 2) Train a lightweight predictor that maps these attributes to corresponding physical fabric parameters using a small dataset of material-physics measurements. Introduces two new datasets: FTAG and T2P.

Result: The estimator achieves superior accuracy in material composition estimation and fabric attribute prediction. When passed through the physics parameter estimator, it produces higher-fidelity simulations compared to state-of-the-art image-to-garment methods, without requiring iterative optimization.

Conclusion: The proposed framework successfully delivers simulation-ready garments from a single image by bridging the gap between visual appearance and physical properties through a two-stage approach that leverages vision-language models and material-physics data.

Abstract: Estimating physically accurate, simulation-ready garments from a single image is challenging due to the absence of image-to-physics datasets and the ill-posed nature of this problem. Prior methods either require multi-view capture and expensive differentiable simulation or predict only garment geometry without the material properties required for realistic simulation. We propose a feed-forward framework that sidesteps these limitations by first fine-tuning a vision-language model to infer material composition and fabric attributes from real images, and then training a lightweight predictor that maps these attributes to the corresponding physical fabric parameters using a small dataset of material-physics measurements. Our approach introduces two new datasets (FTAG and T2P) and delivers simulation-ready garments from a single image without iterative optimization. Experiments show that our estimator achieves superior accuracy in material composition estimation and fabric attribute prediction, and by passing them through our physics parameter estimator, we further achieve higher-fidelity simulations compared to state-of-the-art image-to-garment methods.

[182] LiteEmbed: Adapting CLIP to Rare Classes

Aishwarya Agarwal, Srikrishna Karanam, Vineet Gandhi

Main category: cs.CV

TL;DR: LiteEmbed is a lightweight framework for few-shot personalization of CLIP that enables adding new classes without retraining, using subspace-guided optimization of text embeddings with PCA decomposition and dual objectives for semantic consistency and discriminability.

DetailsMotivation: CLIP struggles with rarely seen classes during pretraining, including newly emerging entities and culturally specific categories. There's a need for efficient adaptation to underrepresented, rare, or unseen classes without retraining the entire model.

Method: LiteEmbed performs subspace-guided optimization of text embeddings using PCA-based decomposition to disentangle coarse semantic directions from fine-grained variations. It uses two complementary objectives: coarse alignment (preserving global semantic consistency) and fine separation (enhancing discriminability among visually similar classes).
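
The subspace split can be sketched directly: PCA over a stand-in embedding table yields a coarse top-component subspace and a fine residual one, and any embedding decomposes cleanly into the two parts. The table, dimensions, and coarse rank k are assumed hyperparameters, not values from the paper.

```python
# Sketch of a PCA coarse/fine decomposition of text embeddings.
import numpy as np

rng = np.random.default_rng(3)
vocab_emb = rng.normal(size=(1000, 512))          # stand-in text embeddings
mu = vocab_emb.mean(0)
U, S, Vt = np.linalg.svd(vocab_emb - mu, full_matrices=False)

k = 32                                            # assumed coarse rank
V_coarse, V_fine = Vt[:k], Vt[k:]

e = rng.normal(size=512)                          # embedding being optimized
coarse_part = (e - mu) @ V_coarse.T @ V_coarse    # align this with the class
fine_part = (e - mu) @ V_fine.T @ V_fine          # separate similar classes here
print(np.allclose(coarse_part + fine_part, e - mu))   # True: clean split
```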

Result: Extensive experiments show substantial gains over prior methods. The optimized embeddings are plug-and-play and work across classification, retrieval, segmentation, and detection tasks.

Conclusion: LiteEmbed establishes an effective approach for adapting CLIP to underrepresented, rare, or unseen classes through lightweight few-shot personalization without retraining the vision-language model encoders.

Abstract: Large-scale vision-language models such as CLIP achieve strong zero-shot recognition but struggle with classes that are rarely seen during pretraining, including newly emerging entities and culturally specific categories. We introduce LiteEmbed, a lightweight framework for few-shot personalization of CLIP that enables new classes to be added without retraining its encoders. LiteEmbed performs subspace-guided optimization of text embeddings within CLIP’s vocabulary, leveraging a PCA-based decomposition that disentangles coarse semantic directions from fine-grained variations. Two complementary objectives, coarse alignment and fine separation, jointly preserve global semantic consistency while enhancing discriminability among visually similar classes. Once optimized, the embeddings are plug-and-play, seamlessly substituting CLIP’s original text features across classification, retrieval, segmentation, and detection tasks. Extensive experiments demonstrate substantial gains over prior methods, establishing LiteEmbed as an effective approach for adapting CLIP to underrepresented, rare, or unseen classes.

[183] Self-Supervised Animal Identification for Long Videos

Xuyang Fang, Sion Hannuna, Edwin Simpson, Neill Campbell

Main category: cs.CV

TL;DR: A highly efficient self-supervised method for animal identification that reframes the problem as global clustering rather than sequential tracking, achieving >97% accuracy with minimal GPU memory usage.

DetailsMotivation: Traditional animal identification methods require extensive manual annotation, while existing self-supervised approaches are computationally demanding and ill-suited for long video sequences due to memory constraints and temporal error propagation.

Method: Reframes animal identification as global clustering task assuming known fixed number of individuals. Uses bounding box detections and total count only. Samples frame pairs, employs frozen pre-trained backbone with self-bootstrapping mechanism using Hungarian algorithm for in-batch pseudo-label assignment. Adapts Binary Cross Entropy loss from vision-language models.
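
The self-bootstrapping step is concrete enough to sketch: detection features from two frames of the same video are matched one-to-one by the Hungarian algorithm, and the match supervises a pairwise sigmoid/BCE objective (in the spirit of SigLIP-style vision-language losses). Sizes, temperature, and loss details below are assumptions:

```python
# Sketch of in-batch pseudo-label assignment via the Hungarian algorithm.
import torch
from scipy.optimize import linear_sum_assignment

K, D = 8, 128                                   # known animal count, feat dim
f1 = torch.nn.functional.normalize(torch.randn(K, D), dim=1)
f2 = torch.nn.functional.normalize(torch.randn(K, D), dim=1)

sim = f1 @ f2.t()                               # (K, K) cosine similarities
rows, cols = linear_sum_assignment(-sim.detach().numpy())  # maximize sim
target = torch.zeros(K, K)
target[rows, cols] = 1.0                        # pseudo identity matrix

# BCE over all pairs: matched pairs are positives, all others negatives.
loss = torch.nn.functional.binary_cross_entropy_with_logits(sim / 0.1, target)
print(loss.item())
```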

Result: Achieves state-of-the-art accuracy (>97%) while consuming less than 1 GB of GPU memory per batch (order of magnitude less than standard contrastive methods). Matches or surpasses supervised baselines trained on over 1,000 labeled frames on challenging datasets (3D-POP pigeons and 8-calves feeding videos).

Conclusion: Enables practical, high-accuracy animal identification on consumer-grade hardware, effectively removing the manual annotation bottleneck with broad applicability in resource-constrained research settings.

Abstract: Identifying individual animals in long-duration videos is essential for behavioral ecology, wildlife monitoring, and livestock management. Traditional methods require extensive manual annotation, while existing self-supervised approaches are computationally demanding and ill-suited for long sequences due to memory constraints and temporal error propagation. We introduce a highly efficient, self-supervised method that reframes animal identification as a global clustering task rather than a sequential tracking problem. Our approach assumes a known, fixed number of individuals within a single video – a common scenario in practice – and requires only bounding box detections and the total count. By sampling pairs of frames, using a frozen pre-trained backbone, and employing a self-bootstrapping mechanism with the Hungarian algorithm for in-batch pseudo-label assignment, our method learns discriminative features without identity labels. We adapt a Binary Cross Entropy loss from vision-language models, enabling state-of-the-art accuracy ($>$97%) while consuming less than 1 GB of GPU memory per batch – an order of magnitude less than standard contrastive methods. Evaluated on challenging real-world datasets (3D-POP pigeons and 8-calves feeding videos), our framework matches or surpasses supervised baselines trained on over 1,000 labeled frames, effectively removing the manual annotation bottleneck. This work enables practical, high-accuracy animal identification on consumer-grade hardware, with broad applicability in resource-constrained research settings. All code written for this paper is available at https://huggingface.co/datasets/tonyFang04/8-calves.

[184] SCE-SLAM: Scale-Consistent Monocular SLAM via Scene Coordinate Embeddings

Yuchen Wu, Jiahe Li, Xiaohan Yu, Lina Yu, Jin Zheng, Xiao Bai

Main category: cs.CV

TL;DR: SCE-SLAM: An end-to-end SLAM system that maintains scale consistency through learned scene coordinate embeddings, reducing scale drift in monocular visual SLAM while maintaining real-time performance.

DetailsMotivation: Monocular visual SLAM suffers from scale drift (gradual divergence of estimated scale over long sequences) due to lack of global constraints among independent windows in frame-to-frame methods.

Method: Uses scene coordinate embeddings - learned patch-level representations encoding 3D geometric relationships under canonical scale reference. Two key modules: 1) Geometry-guided aggregation leveraging 3D spatial proximity to propagate scale information through geometry-modulated attention, 2) Scene coordinate bundle adjustment anchoring current estimates to reference scale through explicit 3D coordinate constraints decoded from embeddings.
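
How geometry might modulate attention can be sketched simply: standard attention logits are biased by pairwise 3D distance between decoded scene coordinates, so spatially close observations exchange scale information preferentially. The exact modulation used by SCE-SLAM is not specified here.

```python
# Hedged sketch of geometry-modulated attention; lam and shapes are assumed.
import torch

def geo_attention(q, k, v, coords_q, coords_k, lam=1.0):
    # q, k, v: (N, D) token features; coords_*: (N, 3) scene coordinates
    logits = q @ k.t() / q.shape[-1] ** 0.5
    dist = torch.cdist(coords_q, coords_k)      # pairwise 3D distances
    attn = torch.softmax(logits - lam * dist, dim=-1)  # proximity bias
    return attn @ v

N, D = 16, 64
out = geo_attention(torch.randn(N, D), torch.randn(N, D), torch.randn(N, D),
                    torch.rand(N, 3), torch.rand(N, 3))
print(out.shape)                                # torch.Size([16, 64])
```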

Result: Reduces absolute trajectory error by 8.36m on KITTI compared to best prior approach, maintains 36 FPS, achieves scale consistency across large-scale scenes. Validated on KITTI, Waymo, and vKITTI datasets.

Conclusion: SCE-SLAM effectively addresses scale drift in monocular visual SLAM through scene coordinate embeddings, enabling scale consistency while maintaining real-time performance for applications like 3D reconstruction from internet video and autonomous navigation on resource-constrained platforms.

Abstract: Monocular visual SLAM enables 3D reconstruction from internet video and autonomous navigation on resource-constrained platforms, yet suffers from scale drift, i.e., the gradual divergence of estimated scale over long sequences. Existing frame-to-frame methods achieve real-time performance through local optimization but accumulate scale drift due to the lack of global constraints among independent windows. To address this, we propose SCE-SLAM, an end-to-end SLAM system that maintains scale consistency through scene coordinate embeddings, which are learned patch-level representations encoding 3D geometric relationships under a canonical scale reference. The framework consists of two key modules: geometry-guided aggregation that leverages 3D spatial proximity to propagate scale information from historical observations through geometry-modulated attention, and scene coordinate bundle adjustment that anchors current estimates to the reference scale through explicit 3D coordinate constraints decoded from the scene coordinate embeddings. Experiments on KITTI, Waymo, and vKITTI demonstrate substantial improvements: our method reduces absolute trajectory error by 8.36m on KITTI compared to the best prior approach, while maintaining 36 FPS and achieving scale consistency across large-scale scenes.

[185] STEP3-VL-10B Technical Report

Ailin Huang, Chengyuan Yao, Chunrui Han, Fanqi Wan, Hangyu Guo, Haoran Lv, Hongyu Zhou, Jia Wang, Jian Zhou, Jianjian Sun, Jingcheng Hu, Kangheng Lin, Liang Zhao, Mitt Huang, Song Yuan, Wenwen Qu, Xiangfeng Wang, Yanlin Lai, Yingxiu Zhao, Yinmin Zhang, Yukang Shi, Yuyang Chen, Zejia Weng, Ziyang Meng, Ang Li, Aobo Kong, Bo Dong, Changyi Wan, David Wang, Di Qi, Dingming Li, En Yu, Guopeng Li, Haiquan Yin, Han Zhou, Hanshan Zhang, Haolong Yan, Hebin Zhou, Hongbo Peng, Jiaran Zhang, Jiashu Lv, Jiayi Fu, Jie Cheng, Jie Zhou, Jisheng Yin, Jingjing Xie, Jingwei Wu, Jun Zhang, Junfeng Liu, Kaijun Tan, Kaiwen Yan, Liangyu Chen, Lina Chen, Mingliang Li, Qian Zhao, Quan Sun, Shaoliang Pang, Shengjie Fan, Shijie Shang, Siyuan Zhang, Tianhao You, Wei Ji, Wuxun Xie, Xiaobo Yang, Xiaojie Hou, Xiaoran Jiao, Xiaoxiao Ren, Xiangwen Kong, Xin Huang, Xin Wu, Xing Chen, Xinran Wang, Xuelin Zhang, Yana Wei, Yang Li, Yanming Xu, Yeqing Shen, Yuang Peng, Yue Peng, Yu Zhou, Yusheng Li, Yuxiang Yang, Yuyang Zhang, Zhe Xie, Zhewei Huang, Zhenyi Lu, Zhimin Fan, Zihui Cheng, Daxin Jiang, Qi Han, Xiangyu Zhang, Yibo Zhu, Zheng Ge

Main category: cs.CV

TL;DR: STEP3-VL-10B is a 10B parameter open-source multimodal model that achieves frontier-level performance rivaling models 10-20x larger through unified pre-training on 1.2T tokens and innovative parallel reasoning techniques.

Motivation: To redefine the trade-off between model compactness and multimodal intelligence by creating an efficient yet powerful open-source foundation model that can compete with much larger proprietary models.

Method: Two strategic shifts: 1) unified, fully unfrozen pre-training on 1.2T multimodal tokens that pairs a language-aligned Perception Encoder with a Qwen3-8B decoder; 2) scaled post-training with over 1k RL iterations, plus Parallel Coordinated Reasoning (PaCoRe) for test-time compute scaling.

Result: Despite only 10B parameters, STEP3-VL-10B rivals/surpasses models 10-20x larger (GLM-4.6V-106B, Qwen3-VL-235B) and proprietary flagships (Gemini 2.5 Pro, Seed-1.5-VL). Achieves 92.2% on MMBench, 80.11% on MMMU, 94.43% on AIME2025, and 75.95% on MathVision.

Conclusion: STEP3-VL-10B demonstrates that compact models can achieve frontier-level multimodal intelligence through strategic training approaches and efficient reasoning techniques, providing the community with a powerful, efficient, and reproducible baseline.

Abstract: We present STEP3-VL-10B, a lightweight open-source foundation model designed to redefine the trade-off between compact efficiency and frontier-level multimodal intelligence. STEP3-VL-10B is realized through two strategic shifts: first, a unified, fully unfrozen pre-training strategy on 1.2T multimodal tokens that integrates a language-aligned Perception Encoder with a Qwen3-8B decoder to establish intrinsic vision-language synergy; and second, a scaled post-training pipeline featuring over 1k iterations of reinforcement learning. Crucially, we implement Parallel Coordinated Reasoning (PaCoRe) to scale test-time compute, allocating resources to scalable perceptual reasoning that explores and synthesizes diverse visual hypotheses. Consequently, despite its compact 10B footprint, STEP3-VL-10B rivals or surpasses models 10$\times$-20$\times$ larger (e.g., GLM-4.6V-106B, Qwen3-VL-235B) and top-tier proprietary flagships like Gemini 2.5 Pro and Seed-1.5-VL. Delivering best-in-class performance, it records 92.2% on MMBench and 80.11% on MMMU, while excelling in complex reasoning with 94.43% on AIME2025 and 75.95% on MathVision. We release the full model suite to provide the community with a powerful, efficient, and reproducible baseline.

[186] Efficient Camera-Controlled Video Generation of Static Scenes via Sparse Diffusion and 3D Rendering

Jieying Chen, Jeffrey Hu, Joan Lasenby, Ayush Tewari

Main category: cs.CV

TL;DR: SRENDER: A 3D-aware video generation method that uses diffusion models to generate sparse keyframes, lifts them to 3D, and renders intermediate views for efficient video synthesis.

Motivation: Current diffusion-based video generation models are computationally inefficient, requiring minutes of GPU time for just seconds of video, which is impractical for real-time applications like embodied AI and VR/AR.

Method: Generate sparse keyframes using diffusion models, lift them into 3D representations, render intermediate views through 3D reconstruction and rendering, and adaptively predict optimal keyframe density based on camera trajectory complexity.

Result: SRENDER achieves more than 40x speedup over diffusion baselines for generating 20-second videos while maintaining high visual fidelity and temporal stability.

Conclusion: The approach offers a practical path toward efficient and controllable video synthesis by amortizing generation costs across frames and enforcing geometric consistency through 3D reconstruction.

Abstract: Modern video generative models based on diffusion models can produce very realistic clips, but they are computationally inefficient, often requiring minutes of GPU time for just a few seconds of video. This inefficiency poses a critical barrier to deploying generative video in applications that require real-time interactions, such as embodied AI and VR/AR. This paper explores a new strategy for camera-conditioned video generation of static scenes: using diffusion-based generative models to generate a sparse set of keyframes, and then synthesizing the full video through 3D reconstruction and rendering. By lifting keyframes into a 3D representation and rendering intermediate views, our approach amortizes the generation cost across hundreds of frames while enforcing geometric consistency. We further introduce a model that predicts the optimal number of keyframes for a given camera trajectory, allowing the system to adaptively allocate computation. Our final method, SRENDER, uses very sparse keyframes for simple trajectories and denser ones for complex camera motion. This results in video generation that is more than 40 times faster than the diffusion-based baseline in generating 20 seconds of video, while maintaining high visual fidelity and temporal stability, offering a practical path toward efficient and controllable video synthesis.
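
The adaptive keyframe-density idea can be pictured with a simple arc-length heuristic, sketched below. Note this is a hedged stand-in: the paper learns a model to predict the keyframe count, whereas the fixed `spacing` rule here is only illustrative. The sparse keyframes would come from the diffusion model; everything between them is 3D-rendered.

```python
# Hedged sketch of adaptive keyframe selection along a camera trajectory.
import numpy as np

def select_keyframes(positions, min_kf=2, spacing=1.0):
    """positions: (T, 3) camera centers; roughly one keyframe per `spacing`
    units of accumulated path length."""
    steps = np.linalg.norm(np.diff(positions, axis=0), axis=1)
    arclen = np.concatenate([[0.0], np.cumsum(steps)])
    n_kf = max(min_kf, int(arclen[-1] / spacing) + 1)
    targets = np.linspace(0.0, arclen[-1], n_kf)
    # snap each target arc-length to the nearest frame index
    return sorted({int(np.argmin(np.abs(arclen - t))) for t in targets})

traj = np.cumsum(np.random.randn(200, 3) * 0.05, axis=0)  # simulated trajectory
print(select_keyframes(traj))  # few indices for simple motion, more for complex
```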

[187] COMPOSE: Hypergraph Cover Optimization for Multi-view 3D Human Pose Estimation

Tony Danjun Wang, Tolga Birdal, Nassir Navab, Lennart Bastian

Main category: cs.CV

TL;DR: COMPOSE formulates multi-view 3D pose correspondence as hypergraph partitioning instead of pairwise associations, using geometric pruning for efficiency, achieving significant improvements over existing methods.

Motivation: Existing optimization-based 3D pose estimation methods rely on pairwise associations with cycle consistency as a soft constraint, which becomes brittle when spurious associations propagate errors across multiple views.

Method: COMPOSE formulates multi-view pose correspondence matching as a hypergraph partitioning problem rather than pairwise association. It uses an efficient geometric pruning strategy to reduce the exponential search space of the resulting integer linear program.

Result: COMPOSE achieves improvements of up to 23% in average precision over previous optimization-based methods and up to 11% over self-supervised end-to-end learned methods.

Conclusion: Hypergraph partitioning with geometric pruning offers a promising solution to multi-view 3D pose correspondence, significantly outperforming existing approaches.

Abstract: 3D pose estimation from sparse multi-views is a critical task for numerous applications, including action recognition, sports analysis, and human-robot interaction. Optimization-based methods typically follow a two-stage pipeline, first detecting 2D keypoints in each view and then associating these detections across views to triangulate the 3D pose. Existing methods rely on mere pairwise associations to model this correspondence problem, treating global consistency between views (i.e., cycle consistency) as a soft constraint. Yet, reconciling these constraints for multiple views becomes brittle when spurious associations propagate errors. We thus propose COMPOSE, a novel framework that formulates multi-view pose correspondence matching as a hypergraph partitioning problem rather than through pairwise association. While the complexity of the resulting integer linear program grows exponentially in theory, we introduce an efficient geometric pruning strategy to substantially reduce the search space. COMPOSE achieves improvements of up to 23% in average precision over previous optimization-based methods and up to 11% over self-supervised end-to-end learned methods, offering a promising solution to a widely studied problem.
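
The geometric pruning step can be illustrated with a standard epipolar test: candidate cross-view detection pairs with a large epipolar residual are discarded before any hyperedges over them enter the integer linear program. Using a fundamental matrix and a pixel threshold is an assumed, generic instance of such pruning, not the paper's exact criterion.

```python
# Hedged sketch of geometric pruning before hypergraph/ILP construction.
import numpy as np

def epipolar_residual(x1, x2, F):
    """x1, x2: (N, 2) candidate 2D keypoint pairs; F: (3, 3) fundamental matrix.
    Returns the distance of each x2 to the epipolar line F @ x1."""
    x1h = np.hstack([x1, np.ones((len(x1), 1))])
    x2h = np.hstack([x2, np.ones((len(x2), 1))])
    lines = x1h @ F.T                              # epipolar lines in view 2
    return np.abs(np.sum(x2h * lines, axis=1)) / np.linalg.norm(lines[:, :2], axis=1)

def prune_pairs(x1, x2, F, thresh=5.0):
    return np.flatnonzero(epipolar_residual(x1, x2, F) < thresh)

x1 = np.random.rand(20, 2) * 100; x2 = np.random.rand(20, 2) * 100
F = np.random.randn(3, 3)  # stand-in; in practice from camera calibration
print(prune_pairs(x1, x2, F, thresh=50.0))  # surviving candidate pair indices
```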

[188] SAM3-DMS: Decoupled Memory Selection for Multi-target Video Segmentation of SAM3

Ruiqi Shen, Chang Liu, Henghui Ding

Main category: cs.CV

TL;DR: SAM3-DMS improves video segmentation by using decoupled memory selection for individual objects instead of synchronized group decisions, achieving better identity preservation especially in dense multi-object scenarios.

Motivation: The original SAM3's group-level collective memory selection is suboptimal for complex multi-object scenarios because it uses synchronized decisions based on average performance across all targets, which overlooks individual reliability and object-specific characteristics.

Method: SAM3-DMS is a training-free decoupled strategy that utilizes fine-grained memory selection on individual objects rather than synchronized group decisions, allowing for object-specific reliability assessment and memory management.

Result: The approach achieves robust identity preservation and tracking stability, with advantages becoming more pronounced as target density increases, establishing a solid foundation for simultaneous multi-target video segmentation in the wild.

Conclusion: Decoupled memory selection at the individual object level significantly improves multi-object video segmentation performance over synchronized group decisions, especially in dense scenarios, without requiring additional training.

Abstract: Segment Anything 3 (SAM3) has established a powerful foundation that robustly detects, segments, and tracks specified targets in videos. However, in its original implementation, its group-level collective memory selection is suboptimal for complex multi-object scenarios, as it employs a synchronized decision across all concurrent targets conditioned on their average performance, often overlooking individual reliability. To this end, we propose SAM3-DMS, a training-free decoupled strategy that utilizes fine-grained memory selection on individual objects. Experiments demonstrate that our approach achieves robust identity preservation and tracking stability. Notably, our advantage becomes more pronounced with increased target density, establishing a solid foundation for simultaneous multi-target video segmentation in the wild.
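
The contrast between synchronized and decoupled memory selection fits in a few lines. In this minimal sketch, scores are per-object mask confidences for the current frame; the names and the simple thresholding rule are illustrative assumptions, not SAM3's actual interface.

```python
# Minimal sketch contrasting group-level vs. decoupled memory selection.
def group_memory_select(scores, thresh=0.8):
    """Synchronized decision: admit the frame to memory for ALL objects iff
    the average confidence clears the threshold."""
    admit = sum(scores.values()) / len(scores) >= thresh
    return {obj: admit for obj in scores}

def decoupled_memory_select(scores, thresh=0.8):
    """Decoupled decision: each object manages its own memory bank based
    only on its own reliability."""
    return {obj: s >= thresh for obj, s in scores.items()}

scores = {"person_1": 0.95, "person_2": 0.92, "person_3": 0.40}
print(group_memory_select(scores))      # one unreliable object blocks everyone
print(decoupled_memory_select(scores))  # only the unreliable object skips this frame
```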

[189] Positional Embedding-Aware Activations

Kathan Shah, Chawin Sitawarin

Main category: cs.CV

TL;DR: SPDER is a neural network architecture that learns positional embeddings and overcomes spectral bias using a sinusoidal activation with damping function, achieving 10x faster training and 1,500-50,000x lower losses than SOTA for image representation.

Motivation: Conventional neural networks face spectral bias towards lower frequencies and struggle with positional embedding learning. Current methods require hyperparameter tuning and preprocessing for coordinate-based representations.

Method: SPDER uses a simple MLP with a novel activation function: sinusoidal multiplied by a sublinear damping function. The sinusoidal enables automatic positional embedding learning, while the damping prevents coordinate values from being projected to finite ranges.

Result: SPDER achieves 10x faster training and converges to losses 1,500-50,000x lower than state-of-the-art for image representation. It’s also SOTA in audio representation and excels in downstream tasks like image super-resolution and video frame interpolation.

Conclusion: SPDER significantly improves fitting over other INR methods without requiring hyperparameter tuning or preprocessing, offering superior representation capability for various applications.

Abstract: We present a neural network architecture designed to naturally learn a positional embedding and overcome the spectral bias towards lower frequencies faced by conventional activation functions. Our proposed architecture, SPDER, is a simple MLP that uses an activation function composed of a sinusoidal multiplied by a sublinear function, called the damping function. The sinusoidal enables the network to automatically learn the positional embedding of an input coordinate while the damping passes on the actual coordinate value by preventing it from being projected down to within a finite range of values. Our results indicate that SPDERs speed up training by 10x and converge to losses 1,500-50,000x lower than that of the state-of-the-art for image representation. SPDER is also state-of-the-art in audio representation. The superior representation capability allows SPDER to also excel on multiple downstream tasks such as image super-resolution and video frame interpolation. We provide intuition as to why SPDER significantly improves fitting compared to that of other INR methods while requiring no hyperparameter tuning or preprocessing.
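
The described activation is easy to sketch: a sinusoid multiplied by a sublinear damping function. Here sqrt(|x|) is one sublinear choice consistent with the description; the exact damping function used per experiment may differ.

```python
# Runnable sketch of a SPDER-style activation and INR-style MLP.
import torch
import torch.nn as nn

class SPDERActivation(nn.Module):
    def forward(self, x):
        # sin(x) lets the network learn a positional embedding of the input
        # coordinate; the sublinear factor passes the coordinate's magnitude
        # through instead of squashing it into a bounded range.
        return torch.sin(x) * torch.sqrt(torch.abs(x))

mlp = nn.Sequential(
    nn.Linear(2, 256), SPDERActivation(),
    nn.Linear(256, 256), SPDERActivation(),
    nn.Linear(256, 3),   # e.g. RGB value at a 2D pixel coordinate
)
print(mlp(torch.rand(8, 2)).shape)  # torch.Size([8, 3])
```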

[190] Prototypical Contrastive Learning-based CLIP Fine-tuning for Object Re-identification

Jiachen Li, Xiaojin Gong

Main category: cs.CV

TL;DR: This paper proposes a simple PCL-based CLIP fine-tuning approach for object Re-ID that eliminates prompt learning, achieving competitive supervised performance and SOTA unsupervised results.

Motivation: To adapt large-scale pre-trained vision-language models like CLIP for object re-identification tasks, addressing the unclear necessity and limitations of prompt learning in existing CLIP-ReID methods due to the absence of semantic labels in Re-ID.

Method: The method directly fine-tunes the CLIP image encoder using prototypical contrastive learning (PCL) loss, eliminating prompt learning. This approach is extended to both supervised and unsupervised Re-ID scenarios.

Result: Experimental results on person and vehicle Re-ID datasets show competitive performance compared to CLIP-ReID in supervised settings, and state-of-the-art performance in unsupervised scenarios.

Conclusion: The paper concludes that prompt learning is unnecessary for adapting CLIP to Re-ID tasks, and that a simple PCL-based fine-tuning approach is effective, achieving superior performance, especially in unsupervised settings.

Abstract: This work aims to adapt large-scale pre-trained vision-language models, such as contrastive language-image pretraining (CLIP), to enhance the performance of object re-identification (Re-ID) across various supervision settings. Although prompt learning has enabled a recent work named CLIP-ReID to achieve promising performance, the underlying mechanisms and the necessity of prompt learning remain unclear due to the absence of semantic labels in Re-ID tasks. In this work, we first analyze the role of prompt learning in CLIP-ReID and identify its limitations. Based on our investigations, we propose a simple yet effective approach to adapt CLIP for supervised object Re-ID. Our approach directly fine-tunes the image encoder of CLIP using a prototypical contrastive learning (PCL) loss, eliminating the need for prompt learning. Experimental results on both person and vehicle Re-ID datasets demonstrate the competitiveness of our method compared to CLIP-ReID. Furthermore, we extend our PCL-based CLIP fine-tuning approach to unsupervised scenarios, where we achieve state-of-the-art performance. Code is available at https://github.com/RikoLi/PCL-CLIP.
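
A prototypical contrastive objective of the kind described can be sketched as follows: image features are contrasted against identity prototypes (here, the mean feature per identity in the batch) instead of learned text prompts. The temperature value and the batch-level prototype construction are illustrative assumptions.

```python
# Hedged sketch of a PCL loss for fine-tuning an image encoder.
import torch
import torch.nn.functional as F

def pcl_loss(features, labels, tau=0.05):
    """features: (B, D) image-encoder outputs; labels: (B,) identity ids."""
    features = F.normalize(features, dim=1)
    classes = labels.unique()                     # sorted unique identities
    protos = F.normalize(torch.stack(
        [features[labels == c].mean(0) for c in classes]), dim=1)
    logits = features @ protos.T / tau            # similarity to each prototype
    targets = torch.bucketize(labels, classes)    # map identity ids -> prototype rows
    return F.cross_entropy(logits, targets)

feats = torch.randn(16, 512)                      # stand-in CLIP image features
ids = torch.randint(0, 4, (16,))
print(pcl_loss(feats, ids))
```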

[191] Human-in-the-Loop Segmentation of Multi-species Coral Imagery

Scarlett Raine, Ross Marchant, Brano Kusy, Frederic Maire, Niko Suenderhauf, Tobias Fischer

Main category: cs.CV

TL;DR: Using denoised DINOv2 features with KNN for point label propagation improves coral reef image segmentation, achieving 19.7% mIoU improvement over prior methods with 5 point labels and human-in-the-loop.

Motivation: Marine surveys generate large amounts of coral reef imagery, but labeling is expensive and time-consuming for domain experts. Point label propagation can create augmented ground truth from sparse labels, but needs improvement for efficiency.

Method: Uses denoised DINOv2 foundation model features with K-Nearest Neighbors (KNN) for point label propagation without pre-training. Incorporates human-in-the-loop principles for extremely sparse labels (5 points per image). Studies point label number and placement for efficiency.

Result: Outperforms prior state-of-the-art by 19.7% mIoU with 5 point labels and human-in-the-loop. Without human-in-the-loop, still improves by 5.8% mIoU (5 grid points). On semantic segmentation task, outperforms prior SOTA by 13.5% mIoU with only 5 point labels.

Conclusion: Recent foundation model advances enable effective point label propagation using only DINOv2 features and KNN. Method significantly improves annotation efficiency for coral reef imagery and provides recommendations for optimal point labeling strategies.

Abstract: Marine surveys by robotic underwater and surface vehicles result in substantial quantities of coral reef imagery; however, labeling these images is expensive and time-consuming for domain experts. Point label propagation is a technique that uses existing images labeled with sparse points to create augmented ground truth data, which can be used to train a semantic segmentation model. In this work, we show that recent advances in large foundation models facilitate the creation of augmented ground truth masks using only features extracted by the denoised version of the DINOv2 foundation model and K-Nearest Neighbors (KNN), without any pre-training. For images with extremely sparse labels, we use human-in-the-loop principles to enhance annotation efficiency: if there are 5 point labels per image, our method outperforms the prior state-of-the-art by 19.7% for mIoU. When human-in-the-loop labeling is not available, using the denoised DINOv2 features with a KNN still improves on the prior state-of-the-art by 5.8% for mIoU (5 grid points). On the semantic segmentation task, we outperform the prior state-of-the-art by 13.5% for mIoU when only 5 point labels are used for point label propagation. Additionally, we perform a comprehensive study into the number and placement of point labels, and make several recommendations for improving the efficiency of labeling images with points.
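
The features-plus-KNN propagation step is simple enough to sketch: sparse labeled points vote for every patch via nearest neighbours in feature space. Feature extraction from the denoised DINOv2 model is stubbed out with random features here, so shapes and the value of k are assumptions.

```python
# Hedged sketch of point-label propagation with patch features and KNN.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def propagate_point_labels(patch_feats, labeled_idx, point_labels, k=3):
    """patch_feats: (P, D) per-patch features for one image;
    labeled_idx: indices of the patches that received a point label."""
    knn = KNeighborsClassifier(n_neighbors=min(k, len(labeled_idx)))
    knn.fit(patch_feats[labeled_idx], point_labels)
    return knn.predict(patch_feats)   # dense augmented ground-truth labels

feats = np.random.randn(1024, 768)    # stand-in for denoised DINOv2 patch features
idx = np.random.choice(1024, size=5, replace=False)  # 5 point labels per image
labels = np.array([0, 1, 1, 2, 0])    # e.g. coral class ids at those points
print(propagate_point_labels(feats, idx, labels).shape)  # (1024,)
```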

Jiawen Xu, Margret Keuper

Main category: cs.CV

TL;DR: The paper analyzes open set recognition methods, finds feature diversity correlates with better OSR performance, and proposes a novel OSR approach leveraging feature diversity that outperforms state-of-the-art methods.

Motivation: Open set recognition is critical for detecting novel classes during inference, but neural classifiers trained on closed sets struggle with novel classes. The mechanisms underlying existing heuristic methods remain poorly understood, leaving a gap in the literature.

Method: The paper conducts analysis of OSR methods focusing on feature diversity, revealing correlation between diverse discriminative features and OSR performance. Based on this insight, proposes a novel OSR approach that leverages feature diversity advantages.

Result: The proposed method demonstrates substantial improvement over state-of-the-art methods through rigorous evaluation on a standard OSR testbench, substantiating the efficacy of the feature diversity approach.

Conclusion: Feature diversity is a key factor in enhancing open set recognition performance, and leveraging this insight leads to significantly improved OSR methods that better handle novel class detection.

Abstract: Open set recognition (OSR) is a critical aspect of machine learning, addressing the challenge of detecting novel classes during inference. Within the realm of deep learning, neural classifiers trained on a closed set of data typically struggle to identify novel classes, leading to erroneous predictions. To address this issue, various heuristic methods have been proposed, allowing models to express uncertainty by stating “I don’t know.” However, a gap in the literature remains, as there has been limited exploration of the underlying mechanisms of these methods. In this paper, we conduct an analysis of open set recognition methods, focusing on the aspect of feature diversity. Our research reveals a significant correlation between learning diverse discriminative features and enhancing OSR performance. Building on this insight, we propose a novel OSR approach that leverages the advantages of feature diversity. The efficacy of our method is substantiated through rigorous evaluation on a standard OSR testbench, demonstrating a substantial improvement over state-of-the-art methods.

[193] Boosting Adversarial Transferability with Low-Cost Optimization via Maximin Expected Flatness

Chunlin Qiu, Ang Li, Yiheng Duan, Shenyi Zhang, Yuanjie Zhang, Lingchen Zhao, Qian Wang

Main category: cs.CV

TL;DR: MEFAttack: A principled flatness-based adversarial attack framework that balances exploration-exploitation dynamics to improve transferability across models with theoretical guarantees.

Motivation: Existing flatness-enhanced transfer attacks have divergent definitions, heuristic designs, unexamined optimization limitations, and lack theoretical foundation, constraining their effectiveness and efficiency.

Method: Unifies flatness definitions, formalizes average-case flatness and transferability gaps theoretically, then designs Maximin Expected Flatness (MEF) attack that balances flatness exploration-exploitation while enhancing zeroth-order average-case flatness.

Result: Surpasses the state-of-the-art PGN attack by 4% in attack success rate at half the computational cost, achieves an 8% higher success rate under the same budget, and with input augmentation gains an additional 15% against defended models, evaluated across 22 models and 24 attacks.

Conclusion: MEF establishes first theoretical foundation for flatness-based transferability, resolves imbalanced optimization issues, and sets new robustness benchmarks through principled flatness enhancement.

Abstract: Transfer-based attacks craft adversarial examples on white-box surrogate models and directly deploy them against black-box target models, offering model-agnostic and query-free threat scenarios. While flatness-enhanced methods have recently emerged to improve transferability by enhancing the loss surface flatness of adversarial examples, their divergent flatness definitions and heuristic attack designs suffer from unexamined optimization limitations and missing theoretical foundation, thus constraining their effectiveness and efficiency. This work exposes the severely imbalanced exploitation-exploration dynamics in flatness optimization, establishing the first theoretical foundation for flatness-based transferability and proposing a principled framework to overcome these optimization pitfalls. Specifically, we systematically unify fragmented flatness definitions across existing methods, revealing their imbalanced optimization limitations in over-exploration of sensitivity peaks or over-exploitation of local plateaus. To resolve these issues, we rigorously formalize average-case flatness and transferability gaps, proving that enhancing zeroth-order average-case flatness minimizes cross-model discrepancies. Building on this theory, we design a Maximin Expected Flatness (MEF) attack that enhances zeroth-order average-case flatness while balancing flatness exploration and exploitation. Extensive evaluations across 22 models and 24 current transfer-based attacks demonstrate MEF’s superiority: it surpasses the state-of-the-art PGN attack by 4% in attack success rate at half the computational cost and achieves 8% higher success rate under the same budget. When combined with input augmentation, MEF attains 15% additional gains against defense-equipped models, establishing new robustness benchmarks. Our code is available at https://github.com/SignedQiu/MEFAttack.
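
One way to picture optimizing for average-case flatness is the sketch below: the update follows the gradient of the loss averaged over random perturbations of the current adversarial example, a zeroth-order proxy for neighborhood flatness. The sampling radius, sample count, and sign-step rule are illustrative assumptions, not the paper's MEF procedure.

```python
# Hedged sketch of ascending a neighborhood-averaged adversarial loss.
import torch

def expected_flat_step(model, x_adv, y, loss_fn, radius=0.01, n=4, alpha=2 / 255):
    grads = torch.zeros_like(x_adv)
    for _ in range(n):
        delta = torch.empty_like(x_adv).uniform_(-radius, radius)
        x = (x_adv + delta).clone().detach().requires_grad_(True)
        loss_fn(model(x), y).backward()
        grads += x.grad
    # ascend the averaged loss -> prefer flat maxima, which tend to transfer
    return x_adv + alpha * (grads / n).sign()

model = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 8 * 8, 10))
x = torch.rand(1, 3, 8, 8)
x = expected_flat_step(model, x, torch.tensor([3]), torch.nn.functional.cross_entropy)
print(x.shape)  # torch.Size([1, 3, 8, 8])
```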

[194] Video Prediction Transformers without Recurrence or Convolution

Yujin Tang, Lu Qi, Xiangtai Li, Chao Ma, Ming-Hsuan Yang

Main category: cs.CV

TL;DR: PredFormer is a pure transformer-based video prediction framework that outperforms both RNN and CNN approaches while being simpler and more efficient.

Motivation: Existing video prediction models have limitations: RNN-based models (like ConvLSTM) have high computational costs, while CNN-based models (like SimVP) suffer from limited receptive fields and poor generalization. The authors question whether a simpler pure transformer approach could overcome these limitations.

Method: PredFormer is a framework entirely based on Gated Transformers, with comprehensive analysis of 3D Attention specifically for video prediction tasks.

Result: PredFormer achieves state-of-the-art performance across four standard benchmarks, with significant improvements in both accuracy and efficiency compared to existing approaches.

Conclusion: PredFormer demonstrates the potential of pure transformer models for video prediction, offering a strong baseline for real-world applications while being simpler and more effective than hybrid RNN/CNN approaches.

Abstract: Video prediction has witnessed the emergence of RNN-based models led by ConvLSTM, and CNN-based models led by SimVP. Following the significant success of ViT, recent works have integrated ViT into both RNN and CNN frameworks, achieving improved performance. While we appreciate these prior approaches, we raise a fundamental question: Is there a simpler yet more effective solution that can eliminate the high computational cost of RNNs while addressing the limited receptive fields and poor generalization of CNNs? How far can it go with a simple pure transformer model for video prediction? In this paper, we propose PredFormer, a framework entirely based on Gated Transformers. We provide a comprehensive analysis of 3D Attention in the context of video prediction. Extensive experiments demonstrate that PredFormer delivers state-of-the-art performance across four standard benchmarks. The significant improvements in both accuracy and efficiency highlight the potential of PredFormer as a strong baseline for real-world video prediction applications. The source code and trained models will be released at https://github.com/yyyujintang/PredFormer.
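
The recurrence- and convolution-free setup can be sketched as follows: frames are patchified into T*H*W space-time tokens processed jointly by self-attention. A vanilla encoder stands in for the paper's Gated Transformer blocks, and the dimensions and depth are illustrative only.

```python
# Hedged sketch of full 3D (space-time) attention for video prediction.
import torch
import torch.nn as nn

T, H, W, D = 4, 8, 8, 128                 # frames, patch grid, embed dim
tokens = torch.randn(2, T * H * W, D)     # one token per space-time patch

layer = nn.TransformerEncoderLayer(d_model=D, nhead=8, dim_feedforward=4 * D,
                                   batch_first=True, activation="gelu")
encoder = nn.TransformerEncoder(layer, num_layers=2)
print(encoder(tokens).shape)              # torch.Size([2, 256, 128]); joint 3D attention
```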

[195] Frequency Is What You Need: Considering Word Frequency When Text Masking Benefits Vision-Language Model Pre-training

Mingliang Liang, Martha Larson

Main category: cs.CV

TL;DR: CLIPF outperforms syntax masking and other text masking strategies for Vision Language Models by leveraging word frequency information, especially with limited input tokens.

Motivation: Current VLM training uses various text masking strategies (truncation, random, block, syntax) with syntax masking considered best. The paper investigates how different masking strategies affect word frequency distribution in training data and how this impacts model performance.

Method: Proposes Contrastive Language-Image Pre-training with Word Frequency Masking (CLIPF), which directly uses word frequency information for masking. Analyzes impact of masking strategies on word frequency and connects this to model success.

Result: CLIPF outperforms syntax masking and other existing approaches, particularly when input tokens decrease. Shows that other existing masking strategies also beat syntax masking with sufficient training epochs.

Conclusion: Word frequency-based masking (CLIPF) is superior to syntax masking for VLM training. Practical finding: with enough epochs, various masking strategies can outperform syntax masking, important for selecting text masking methods.

Abstract: Vision Language Models (VLMs) can be trained more efficiently if training sets can be reduced in size. Recent work has shown the benefits of masking text during VLM training using a variety of strategies (truncation, random masking, block masking and syntax masking) and has reported syntax masking as the top performer. In this paper, we analyze the impact of different text masking strategies on the word frequency in the training data, and show that this impact is connected to model success. This finding motivates Contrastive Language-Image Pre-training with Word Frequency Masking (CLIPF), our proposed masking approach, which directly leverages word frequency. Extensive experiments demonstrate the advantages of CLIPF over syntax masking and other existing approaches, particularly when the number of input tokens decreases. We show that not only CLIPF, but also other existing masking strategies, outperform syntax masking when enough epochs are used during training, a finding of practical importance for selecting a text masking method for VLM training. Our code is available online.
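
The core of frequency-aware masking can be shown in a few lines: when a caption must be cut to a token budget, drop the most frequent (least informative) words first. CLIPF's actual sampling scheme may differ; this hedged sketch only shows the idea.

```python
# Hedged sketch of frequency-based caption masking.
from collections import Counter

def build_freqs(corpus):
    return Counter(w for caption in corpus for w in caption.lower().split())

def frequency_mask(caption, freqs, keep=4):
    words = caption.lower().split()
    if len(words) <= keep:
        return words
    # rank positions by corpus frequency; keep the rarest words, in order
    keep_set = set(sorted(range(len(words)), key=lambda i: freqs[words[i]])[:keep])
    return [w for i, w in enumerate(words) if i in keep_set]

corpus = ["a photo of a dog on a beach", "a photo of a red car", "a photo of a house"]
freqs = build_freqs(corpus)
print(frequency_mask("a photo of a dog on a beach", freqs))  # ['photo', 'dog', 'on', 'beach']
```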

[196] MedicalNarratives: Connecting Medical Vision and Language with Localized Narratives

Wisdom O. Ikezogwo, Kevin Zhang, Mehmet Saygin Seyfioglu, Fatemeh Ghezloo, Linda Shapiro, Ranjay Krishna

Main category: cs.CV

TL;DR: MedicalNarratives: A large-scale medical image-text dataset (4.7M pairs) curated from YouTube medical videos, with 1M containing spatial mouse traces and 118K videos for spatiotemporal grounding, used to train GenMedClip which outperforms SOTA on medical imaging benchmarks.

Motivation: Medical image datasets are scarce compared to natural images, limiting representation learning at scale. YouTube offers abundant medical pedagogical videos that can be leveraged to create large-scale medical image-text datasets for training multi-modal models.

Method: Curated MedicalNarratives dataset from YouTube medical videos containing 4.7M image-text pairs, with 1M samples having spatial mouse traces (similar to think-aloud studies) and 118K videos with aligned text for spatiotemporal grounding. Trained GenMedClip using CLIP-like objective on this dataset spanning 12 medical domains.

Result: GenMedClip outperforms previous state-of-the-art models on all 12 medical domains on a newly constructed medical imaging benchmark. The dataset enables spatiotemporal grounding beyond single frames through mouse traces and video-text alignment.

Conclusion: MedicalNarratives provides a scalable solution for medical representation learning by leveraging YouTube’s open-source medical content. The inclusion of spatial traces and video-text alignment enables richer multimodal understanding, and GenMedClip demonstrates the dataset’s utility through superior performance on medical imaging tasks.

Abstract: Multi-modal models are data hungry. While datasets with natural images are abundant, medical image datasets can not afford the same luxury. To enable representation learning for medical images at scale, we turn to YouTube, a platform with a large reservoir of open-source medical pedagogical videos. We curate MedicalNarratives, a dataset of 4.7M medical image-text pairs, with 1M samples containing dense annotations in the form of spatial traces (and bounding boxes), and 118K videos centered on the trace event (with aligned text), enabling spatiotemporal grounding beyond single frames. Similar to $\textit{think-aloud}$ studies where instructors speak while hovering their mouse cursor over relevant image regions, 1M images in MedicalNarratives contain localized mouse traces in image pixels, creating a spatial and temporal association between the text and pixels. To evaluate the utility of MedicalNarratives, we train GenMedClip with a CLIP-like objective using our dataset spanning 12 medical domains. GenMedClip outperforms previous state-of-the-art models on all 12 domains on a newly constructed medical imaging benchmark. $\href{https://huggingface.co/datasets/wisdomik/MedicalNarratives}{[Data]}$

[197] Fair Foundation Models for Medical Image Analysis: Challenges and Perspectives

Dilermando Queiroz, Anderson Carlos, André Anjos, Lilian Berton

Main category: cs.CV

TL;DR: This paper reviews how Foundation Models (FMs) in medical imaging can enhance fairness in healthcare AI, arguing that effective bias mitigation requires systematic interventions throughout the entire development pipeline rather than just model-level fixes.

Motivation: To address the critical need for equitable AI in healthcare that makes unbiased decisions across all demographic groups, bridging technical innovation with ethical principles to serve underserved populations and regions with limited resources.

Method: The paper conducts a comprehensive review of Foundation Models in medical imaging, analyzing how systematic bias mitigation interventions throughout the entire development pipeline (from data documentation to deployment protocols) can address fairness challenges.

Result: The review indicates that while FMs show potential for enhancing fairness in healthcare AI, achieving consistent performance across demographic groups requires integrated interventions throughout all development stages, not just model-level bias mitigation.

Conclusion: Equitable Foundation Models represent a critical step toward democratizing advanced healthcare technologies, requiring comprehensive frameworks that combine systematic bias mitigation with policy engagement to address both technical and institutional barriers to fairness.

Abstract: Ensuring equitable Artificial Intelligence (AI) in healthcare demands systems that make unbiased decisions across all demographic groups, bridging technical innovation with ethical principles. Foundation Models (FMs), trained on vast datasets through self-supervised learning, enable efficient adaptation across medical imaging tasks while reducing dependency on labeled data. These models demonstrate potential for enhancing fairness, though significant challenges remain in achieving consistent performance across demographic groups. Our review indicates that effective bias mitigation in FMs requires systematic interventions throughout all stages of development. While previous approaches focused primarily on model-level bias mitigation, our analysis reveals that fairness in FMs requires integrated interventions throughout the development pipeline, from data documentation to deployment protocols. This comprehensive framework advances current knowledge by demonstrating how systematic bias mitigation, combined with policy engagement, can effectively address both technical and institutional barriers to equitable AI in healthcare. The development of equitable FMs represents a critical step toward democratizing advanced healthcare technologies, particularly for underserved populations and regions with limited medical infrastructure and computational resources.

[198] DyDiT++: Diffusion Transformers with Timestep and Spatial Dynamics for Efficient Visual Generation

Wangbo Zhao, Yizeng Han, Jiasheng Tang, Kai Wang, Hao Luo, Yibing Song, Gao Huang, Fan Wang, Yang You

Main category: cs.CV

TL;DR: DyDiT++ is an improved dynamic diffusion transformer that reduces computational costs by 51% with <3% fine-tuning iterations, achieving 1.73x speedup and competitive FID scores.

Motivation: Diffusion Transformers (DiT) have superior performance but suffer from high computational costs due to static inference that introduces redundant computation in certain timesteps and spatial regions.

Method: Proposes Dynamic Diffusion Transformer (DyDiT) that dynamically adjusts computation along timestep and spatial dimensions, with DyDiT++ extending to flow matching, video/text-to-image generation, and introducing timestep-based dynamic LoRA (TD-LoRA) for parameter-efficient training.

Result: Reduces FLOPs of DiT-XL by 51% with <3% additional fine-tuning iterations, achieves 1.73x realistic hardware speedup, and obtains competitive FID score of 2.07 on ImageNet. Works across diverse models including DiT, SiT, Latte, and FLUX.

Conclusion: DyDiT++ effectively addresses computational inefficiency in diffusion models through dynamic computation, extends to various generation tasks and methods, and enables parameter-efficient training, making it versatile and practical for real-world applications.

Abstract: Diffusion Transformer (DiT), an emerging diffusion model for visual generation, has demonstrated superior performance but suffers from substantial computational costs. Our investigations reveal that these costs primarily stem from the static inference paradigm, which inevitably introduces redundant computation in certain diffusion timesteps and spatial regions. To overcome this inefficiency, we propose Dynamic Diffusion Transformer (DyDiT), an architecture that dynamically adjusts its computation along both timestep and spatial dimensions. Building on these designs, we present an extended version, DyDiT++, with improvements in three key aspects. First, it extends the generation mechanism of DyDiT beyond diffusion to flow matching, demonstrating that our method can also accelerate flow-matching-based generation, enhancing its versatility. Furthermore, we enhance DyDiT to tackle more complex visual generation tasks, including video generation and text-to-image generation, thereby broadening its real-world applications. Finally, to address the high cost of full fine-tuning and democratize technology access, we investigate the feasibility of training DyDiT in a parameter-efficient manner and introduce timestep-based dynamic LoRA (TD-LoRA). Extensive experiments on diverse visual generation models, including DiT, SiT, Latte, and FLUX, demonstrate the effectiveness of DyDiT++. Remarkably, with <3% additional fine-tuning iterations, our approach reduces the FLOPs of DiT-XL by 51%, yielding 1.73x realistic speedup on hardware, and achieves a competitive FID score of 2.07 on ImageNet. The code is available at https://github.com/alibaba-damo-academy/DyDiT.
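
A timestep-dynamic LoRA layer of the kind named here could look like the sketch below: a small bank of low-rank adapters with a hard timestep-to-adapter bucketing, so the update applied to the frozen weight varies with the diffusion timestep. Bank size, rank, and the bucketing rule are illustrative assumptions, not the paper's TD-LoRA design.

```python
# Hedged sketch of a timestep-dynamic LoRA linear layer.
import torch
import torch.nn as nn

class TDLoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, rank=4, n_adapters=4, n_timesteps=1000):
        super().__init__()
        self.base = base.requires_grad_(False)   # frozen pretrained weight
        self.A = nn.Parameter(torch.randn(n_adapters, rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(n_adapters, base.out_features, rank))
        self.n_adapters, self.n_timesteps = n_adapters, n_timesteps

    def forward(self, x, t):
        idx = (t * self.n_adapters) // self.n_timesteps   # bucket timestep -> adapter
        delta = self.B[idx] @ self.A[idx]                 # (out, in) low-rank update
        return self.base(x) + x @ delta.T

layer = TDLoRALinear(nn.Linear(64, 64))
print(layer(torch.randn(2, 64), t=900).shape)  # torch.Size([2, 64])
```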

[199] An Attention Infused Deep Learning System with Grad-CAM Visualization for Early Screening of Glaucoma

Ramanathan Swaminathan

Main category: cs.CV

TL;DR: A hybrid CNN-Vision Transformer model with Cross-Attention module outperforms standalone CNN and ViT models for glaucoma detection using ACRIMA and Drishti datasets.

Motivation: To leverage complementary strengths of CNN and Vision Transformer architectures for improved glaucoma detection by enabling bidirectional feature exchange between the two streams.

Method: Intertwines a deep custom CNN with a Vision Transformer using a radical Cross-Attention module that facilitates bidirectional feature exchange between the two streams, allowing the model to learn clinically relevant regions in fundus images.

Result: The hybrid model shows improved performance compared to standalone baseline CNN and ViT models on both ACRIMA and Drishti glaucoma detection datasets.

Conclusion: Fusing CNN and Vision Transformer architectures with Cross-Attention is effective for glaucoma detection, demonstrating the value of combining complementary vision architectures for medical image analysis.

Abstract: This research work demonstrates the strengths of combining a deep custom convolutional neural network with a Vision Transformer, fused through a Cross-Attention module. Two benchmark datasets for glaucoma detection, namely ACRIMA and Drishti, are utilized. The Cross-Attention mechanism helps the model learn clinically relevant regions of the fundus through bidirectional feature exchange between the CNN and ViT streams. Experiments show improved performance compared to standalone baseline CNN and ViT models.
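
Bidirectional cross-attention between the two streams can be sketched directly: each stream queries the other, so CNN and ViT features are exchanged before classification. The single fusion layer, dimensions, and token counts below are illustrative assumptions.

```python
# Hedged sketch of bidirectional CNN-ViT cross-attention fusion.
import torch
import torch.nn as nn

class BiCrossAttention(nn.Module):
    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.cnn_queries_vit = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.vit_queries_cnn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, cnn_tokens, vit_tokens):
        cnn_out, _ = self.cnn_queries_vit(cnn_tokens, vit_tokens, vit_tokens)
        vit_out, _ = self.vit_queries_cnn(vit_tokens, cnn_tokens, cnn_tokens)
        return cnn_tokens + cnn_out, vit_tokens + vit_out   # residual fusion

cnn_tok = torch.randn(2, 49, 256)    # e.g. flattened 7x7 CNN feature map
vit_tok = torch.randn(2, 197, 256)   # e.g. ViT patch tokens + CLS
a, b = BiCrossAttention()(cnn_tok, vit_tok)
print(a.shape, b.shape)  # torch.Size([2, 49, 256]) torch.Size([2, 197, 256])
```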

[200] FeatInv: Spatially resolved mapping from feature space to input space using conditional diffusion models

Nils Neukirch, Johanna Vielhaben, Nils Strodthoff

Main category: cs.CV

TL;DR: Using conditional diffusion models to map neural network feature spaces back to input space for better interpretability of deep learning models.

Motivation: Internal representations of deep neural networks are crucial for understanding their properties and reasoning patterns, but remain difficult to interpret. Existing approaches for mapping feature space to input space often rely on crude approximations.

Method: Proposes using a conditional diffusion model - a pretrained high-fidelity diffusion model conditioned on spatially resolved feature maps - to learn the mapping from feature space to input space in a probabilistic manner.

Result: Demonstrates feasibility across various pretrained image classifiers (CNNs to ViTs) with excellent reconstruction capabilities. Validated through qualitative comparisons and robustness analysis, with applications like visualization of concept steering and investigation of feature space composition.

Conclusion: This approach has broad potential for improving feature space understanding in computer vision models by providing high-fidelity mappings from feature representations back to interpretable input space.

Abstract: Internal representations are crucial for understanding deep neural networks, such as their properties and reasoning patterns, but remain difficult to interpret. While mapping from feature space to input space aids in interpreting the former, existing approaches often rely on crude approximations. We propose using a conditional diffusion model - a pretrained high-fidelity diffusion model conditioned on spatially resolved feature maps - to learn such a mapping in a probabilistic manner. We demonstrate the feasibility of this approach across various pretrained image classifiers from CNNs to ViTs, showing excellent reconstruction capabilities. Through qualitative comparisons and robustness analysis, we validate our method and showcase possible applications, such as the visualization of concept steering in input space or investigations of the composite nature of the feature space. This approach has broad potential for improving feature space understanding in computer vision models.

[201] ViSTA: Visual Storytelling using Multi-modal Adapters for Text-to-Image Diffusion Models

Sibo Dong, Ismail Shaheen, Maggie Shen, Rupayan Mallick, Sarah Adel Bargal

Main category: cs.CV

TL;DR: ViSTA is a multi-modal history adapter for text-to-image diffusion models that maintains consistency in visual storytelling by effectively leveraging past text-image pairs without extensive training.

Motivation: Existing methods for visual storytelling either require extensive training (auto-regressive methods) or lack adaptability to narrative prompts (subject-specific approaches). There's a need for a method that can effectively use history text-image pairs to maintain consistency across frames while being adaptable to different narratives.

Method: ViSTA uses a multi-modal history fusion module to extract relevant history features and a history adapter to condition generation on these features. It also employs a salient history selection strategy during inference to choose the most relevant history text-image pair for conditioning.

Result: Evaluated on StorySalon and FlintStonesSV datasets, ViSTA achieves consistent image sequences across frames while maintaining good alignment with narrative text descriptions. The work also employs TIFA, a Visual Question Answering-based metric, for a more targeted and interpretable assessment of text-image alignment in visual storytelling.

Conclusion: ViSTA provides an effective solution for coherent visual storytelling by leveraging history information without extensive training, offering both consistency and adaptability to narrative prompts through its multi-modal history adapter approach.

Abstract: Text-to-image diffusion models have achieved remarkable success, yet generating coherent image sequences for visual storytelling remains challenging. A key challenge is effectively leveraging all previous text-image pairs, referred to as history text-image pairs, which provide contextual information for maintaining consistency across frames. Existing auto-regressive methods condition on all past image-text pairs but require extensive training, while training-free subject-specific approaches ensure consistency but lack adaptability to narrative prompts. To address these limitations, we propose a multi-modal history adapter for text-to-image diffusion models, \textbf{ViSTA}. It consists of (1) a multi-modal history fusion module to extract relevant history features and (2) a history adapter to condition the generation on the extracted relevant features. We also introduce a salient history selection strategy during inference, where the most salient history text-image pair is selected, improving the quality of the conditioning. Furthermore, we propose to employ a Visual Question Answering-based metric TIFA to assess text-image alignment in visual storytelling, providing a more targeted and interpretable assessment of generated images. Evaluated on the StorySalon and FlintStonesSV dataset, our proposed ViSTA model is not only consistent across different frames, but also well-aligned with the narrative text descriptions.
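
The salient history selection step can be sketched compactly: among previous (text, image) pairs, pick the one whose text embedding best matches the current prompt and condition generation on it. Cosine similarity is an assumed proxy for the paper's saliency measure.

```python
# Hedged sketch of salient history selection at inference time.
import torch
import torch.nn.functional as F

def select_salient_history(prompt_emb, history_text_embs):
    """prompt_emb: (D,); history_text_embs: (N, D) embeddings of past captions."""
    sims = F.cosine_similarity(history_text_embs, prompt_emb.unsqueeze(0), dim=1)
    return int(sims.argmax())   # index of the most salient history pair

hist = torch.randn(5, 512)
prompt = hist[2] + 0.1 * torch.randn(512)    # current frame resembles pair 2
print(select_salient_history(prompt, hist))  # 2
```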

[202] GroupNL: Low-Resource and Robust CNN Design over Cloud and Device

Chuntao Ding, Jianhang Xie, Junna Zhang, Salman Raza, Shangguang Wang, Jiannong Cao

Main category: cs.CV

TL;DR: GroupNL replaces standard convolution layers with lightweight nonlinear transformation functions to generate diversified feature maps, reducing resource consumption while improving CNN robustness and accuracy.

Motivation: Existing CNN acceleration methods reduce parameters and FLOPs but still use multiple lightweight operations that affect speed. There's a need for more efficient alternatives that maintain or improve model performance while reducing resource consumption.

Method: GroupNL uses data-agnostic, hyperparameter-fixed Nonlinear Transformation Functions (NLFs) to generate feature maps. It first creates seed feature maps via seed convolution, then splits them into groups and applies different NLFs to generate diversified feature maps without additional convolution operations. A sparse variant further optimizes by grouping input channels and seed feature maps.

Result: GroupNL-ResNet-18 achieves 2.86% higher accuracy than ResNet-18 on Icons-50 dataset. GroupNL-EfficientNet-ES achieves about 1.1% higher accuracy than EfficientNet-ES on ImageNet-C dataset. The method demonstrates improved accuracy while reducing resource consumption.

Conclusion: GroupNL convolution is an effective alternative to standard convolution layers, offering improved accuracy and robustness with reduced computational resources, making it suitable for deployment on resource-constrained IoT devices.

Abstract: Deploying Convolutional Neural Network (CNN) models on ubiquitous Internet of Things (IoT) devices in a cloud-assisted manner to provide users with a variety of high-quality services has become mainstream. Most existing studies speed up model cloud training/on-device inference by reducing the number of convolution (Conv) parameters and floating-point operations (FLOPs). However, they usually employ two or more lightweight operations (e.g., depthwise Conv, $1\times1$ cheap Conv) to replace a Conv, which can still affect the model’s speedup even with fewer parameters and FLOPs. To this end, we propose the Grouped NonLinear transformation generation method (GroupNL), leveraging data-agnostic, hyperparameters-fixed, and lightweight Nonlinear Transformation Functions (NLFs) to generate diversified feature maps on demand via grouping, thereby reducing resource consumption while improving the robustness of CNNs. First, in a GroupNL Conv layer, a small set of feature maps, i.e., seed feature maps, are generated based on the seed Conv operation. Then, we split seed feature maps into several groups, each with a set of different NLFs, to generate the required number of diversified feature maps with tensor manipulation operators and nonlinear processing in a lightweight manner without additional Conv operations. We further introduce a sparse GroupNL Conv to speed up by reasonably designing the seed Conv groups between the number of input channels and seed feature maps. Experiments conducted on benchmarks and on-device resource measurements demonstrate that the GroupNL Conv is an impressive alternative to Conv layers in baseline models. Specifically, on Icons-50 dataset, the accuracy of GroupNL-ResNet-18 is 2.86% higher than ResNet-18; on ImageNet-C dataset, the accuracy of GroupNL-EfficientNet-ES achieves about 1.1% higher than EfficientNet-ES.
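
The seed-then-expand idea can be sketched in a few lines: one cheap seed convolution, then fixed, data-agnostic nonlinear transformation functions (NLFs) expand the seed maps into the required channel count without extra convolutions. The specific NLFs and the expansion layout below are illustrative stand-ins for the paper's grouped design.

```python
# Hedged sketch of a GroupNL-style convolution layer.
import torch
import torch.nn as nn

class GroupNLConv(nn.Module):
    def __init__(self, in_ch, seed_ch, out_ch):
        super().__init__()
        assert out_ch % seed_ch == 0
        m = out_ch // seed_ch                  # expansion factor
        self.seed = nn.Conv2d(in_ch, seed_ch, 3, padding=1)
        self.nlfs = [torch.sin, torch.abs, torch.tanh,
                     lambda t: t * torch.sigmoid(t)][:m]

    def forward(self, x):
        s = self.seed(x)                       # small set of seed feature maps
        # diversify with a different fixed NLF per group, then concatenate
        return torch.cat([f(s) for f in self.nlfs], dim=1)

layer = GroupNLConv(in_ch=3, seed_ch=16, out_ch=64)
print(layer(torch.randn(1, 3, 32, 32)).shape)  # torch.Size([1, 64, 32, 32])
```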

[203] Privacy-Preserving in Connected and Autonomous Vehicles Through Vision to Text Transformation

Abdolazim Rezaei, Mehdi Sookhak, Ahmad Patooghy, Shahab S. Band, Amir Mosavi

Main category: cs.CV

TL;DR: A privacy-preserving framework using reinforcement learning and vision-language models to convert vehicle camera images into textual descriptions while protecting sensitive information.

Motivation: AI-equipped cameras in ITS capture privacy-sensitive data from vehicle interiors, risking identity theft, profiling, and unauthorized use. Existing methods like face blurring are insufficient for comprehensive privacy protection.

Method: Uses feedback-based reinforcement learning with vision-language models to transform images into textual descriptions while preserving scene details. Employs hierarchical RL strategy to iteratively refine generated text with external knowledge feedback for enhanced semantic accuracy and privacy.

Result: Significant improvements in privacy preservation metrics (SSIM, PSNR, MSE, SRRA) on two datasets, outperforming other methods. The approach maintains scene details while effectively protecting sensitive information.

Conclusion: The proposed framework provides superior privacy protection for ITS camera data compared to existing methods, balancing scene preservation with privacy through iterative RL refinement and vision-language model integration.

Abstract: Intelligent Transportation Systems (ITS) rely on a variety of devices that frequently process privacy-sensitive data. Roadside units are important because they use AI-equipped (AIE) cameras to detect traffic violations in Connected and Autonomous Vehicles (CAV). However, although the interior of a vehicle is generally considered a private space, the privacy risks associated with captured imagery remain a major concern, as such data can be misused for identity theft, profiling, or unauthorized commercial purposes. Methods like face blurring reduce privacy risks; however, individuals' privacy can still be compromised. This paper introduces a novel privacy-preserving framework that leverages feedback-based reinforcement learning (RL) and vision-language models (VLMs) to protect sensitive visual information captured by AIE cameras. The proposed method transforms images into textual descriptions that preserve the main scene details while protecting privacy. A hierarchical RL strategy is employed to iteratively refine the generated text, enhancing both semantic accuracy and privacy. Unlike prior captioning-based methods, our model incorporates an iterative reinforcement-learning cycle with external knowledge feedback, which progressively refines privacy-aware text. In addition to qualitative textual metric evaluations, the privacy-based metrics demonstrate significant improvements in privacy preservation, with SSIM, PSNR, MSE, and SRRA values obtained using the proposed method on two different datasets outperforming other methods.

[204] Divergence-Based Similarity Function for Multi-View Contrastive Learning

Jae Hyoung Jeon, Cheolsu Lim, Myungjoo Kang

Main category: cs.CV

TL;DR: Proposes DSF, a divergence-based similarity function that captures joint structure across multiple augmented views by treating view sets as distributions and measuring divergence between them, outperforming existing multi-view methods across diverse tasks without needing temperature tuning.

Motivation: Existing multi-view contrastive learning methods only capture pairwise relationships between views and fail to model the joint structure across all augmented views, limiting their effectiveness in leveraging multiple views of data.

Method: Proposes a divergence-based similarity function (DSF) that represents each set of augmented views as a distribution and measures similarity as the divergence between distributions, explicitly capturing joint structure across all views.

Result: DSF consistently improves performance across kNN classification, linear evaluation, transfer learning, and distribution shift tasks, achieves greater efficiency than other multi-view methods, and operates effectively without temperature hyperparameter tuning unlike cosine similarity.

Conclusion: DSF provides an effective approach for capturing joint structure across multiple augmented views in contrastive learning, offering improved performance, efficiency, and eliminating the need for temperature hyperparameter tuning compared to existing methods.

Abstract: Recent success in contrastive learning has sparked growing interest in more effectively leveraging multiple augmented views of data. While prior methods incorporate multiple views at the loss or feature level, they primarily capture pairwise relationships and fail to model the joint structure across all views. In this work, we propose a divergence-based similarity function (DSF) that explicitly captures the joint structure by representing each set of augmented views as a distribution and measuring similarity as the divergence between distributions. Extensive experiments demonstrate that DSF consistently improves performance across diverse tasks, including kNN classification, linear evaluation, transfer learning, and distribution shift, while also achieving greater efficiency than other multi-view methods. Furthermore, we establish a connection between DSF and cosine similarity, and demonstrate that, unlike cosine similarity, DSF operates effectively without the need for tuning a temperature hyperparameter.
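
One concrete instance of a divergence-based similarity is sketched below: each sample's set of augmented-view embeddings is summarized as a diagonal Gaussian, and similarity is the negative symmetric KL divergence between the two Gaussians. The Gaussian summary and the eps stabilizer are illustrative assumptions about the distributional form.

```python
# Hedged sketch of a divergence-based similarity over view sets.
import torch

def gaussian_kl(mu1, var1, mu2, var2):
    # KL(N1 || N2) for diagonal Gaussians, summed over feature dimensions
    return 0.5 * (torch.log(var2 / var1) + (var1 + (mu1 - mu2) ** 2) / var2 - 1).sum(-1)

def dsf_similarity(views_a, views_b, eps=1e-4):
    """views_a, views_b: (V, D) embeddings of V augmented views per sample."""
    mu_a, var_a = views_a.mean(0), views_a.var(0) + eps
    mu_b, var_b = views_b.mean(0), views_b.var(0) + eps
    sym_kl = gaussian_kl(mu_a, var_a, mu_b, var_b) + gaussian_kl(mu_b, var_b, mu_a, var_a)
    return -sym_kl   # higher = more similar; note: no temperature parameter

a = torch.randn(8, 128)                 # 8 views of sample A
b = a + 0.1 * torch.randn(8, 128)       # views of a very similar sample
c = torch.randn(8, 128)                 # views of an unrelated sample
print(dsf_similarity(a, b) > dsf_similarity(a, c))  # tensor(True)
```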

[205] SlumpGuard: An AI-Powered Real-Time System for Automated Concrete Slump Prediction via Video Analysis

Youngmin Kim, Giyeong Oh, Kwangsoo Youm, Youngjae Yu

Main category: cs.CV

TL;DR: AI vision system (SlumpGuard) automates concrete slump testing using single camera to analyze mixer-truck discharge flow, eliminating manual testing and enabling continuous monitoring.

Motivation: Traditional slump testing is manual, time-consuming, operator-dependent, and unsuitable for continuous/real-time monitoring during concrete placement, creating a need for automated assessment.

Method: AI-powered vision system with single fixed camera that performs automatic chute detection, pouring-event identification, and video-based slump classification without sensors or hardware installation.

Result: System evaluated on site-replicated dataset of 6,000+ video clips, demonstrating reliable chute localization, accurate pouring detection, and robust slump prediction under diverse field conditions.

Conclusion: SlumpGuard enables automated concrete quality monitoring, with expert study revealing significant human disagreement in visual estimates, highlighting need for automated assessment systems.

Abstract: Concrete workability is essential for construction quality, with the slump test being the most widely used on-site method for its assessment. However, traditional slump testing is manual, time-consuming, and highly operator-dependent, making it unsuitable for continuous or real-time monitoring during placement. To address these limitations, we present SlumpGuard, an AI-powered vision system that analyzes the natural discharge flow from a mixer-truck chute using a single fixed camera. The system performs automatic chute detection, pouring-event identification, and video-based slump classification, enabling quality monitoring without sensors, hardware installation, or manual intervention. We introduce the system design, construct a site-replicated dataset of over 6,000 video clips, and report extensive evaluations demonstrating reliable chute localization, accurate pouring detection, and robust slump prediction under diverse field conditions. An expert study further reveals significant disagreement in human visual estimates, highlighting the need for automated assessment.

[206] Simulated Ensemble Attack: Transferring Jailbreaks Across Fine-tuned Vision-Language Models

Ruofan Wang, Xin Wang, Yang Yao, Juncheng Li, Xuan Tong, Xingjun Ma

Main category: cs.CV

TL;DR: SEA is a grey-box jailbreak attack that transfers vulnerabilities from base VLMs to fine-tuned variants without knowing the target, achieving high success rates across safety-enhanced models.

Motivation: Fine-tuned variants of open-source VLMs may inherit jailbreak vulnerabilities from their base models, creating security risks where attacks could transfer across downstream variants, including safety-enhanced models.

Method: Simulated Ensemble Attack (SEA) uses Fine-tuning Trajectory Simulation (FTS) to model parameter variations in vision encoders and Targeted Prompt Guidance (TPG) to stabilize adversarial optimization with auxiliary textual guidance.

Result: SEA achieves consistently high transfer success and toxicity rates across diverse fine-tuned Qwen2-VL variants, while standard PGD-based attacks show negligible transferability. Fine-tuning causes localized parameter shifts, explaining SEA’s effectiveness.

Conclusion: Jailbreak vulnerabilities persist through fine-tuning, and SEA effectively exploits this by simulating fine-tuning parameter variations, highlighting security risks in open-source VLM ecosystems.

Abstract: The widespread practice of fine-tuning open-source Vision-Language Models (VLMs) raises a critical security concern: jailbreak vulnerabilities in base models may persist in downstream variants, enabling transferable attacks across fine-tuned systems. To investigate this risk, we propose the Simulated Ensemble Attack (SEA), a grey-box jailbreak framework that assumes full access to the base VLM but no knowledge of the fine-tuned target. SEA enhances transferability via Fine-tuning Trajectory Simulation (FTS), which models bounded parameter variations in the vision encoder, and Targeted Prompt Guidance (TPG), which stabilizes adversarial optimization through auxiliary textual guidance. Experiments on the Qwen2-VL family demonstrate that SEA achieves consistently high transfer success and toxicity rates across diverse fine-tuned variants, including safety-enhanced models, while standard PGD-based image jailbreaks exhibit negligible transferability. Further analysis reveals that fine-tuning primarily induces localized parameter shifts around the base model, explaining why attacks optimized over a simulated neighborhood transfer effectively. We also show that SEA generalizes across different base generations (e.g., Qwen2.5/3-VL), indicating that its effectiveness arises from shared fine-tuning-induced behaviors rather than architecture- or initialization-specific factors.

[207] Decoupling Continual Semantic Segmentation

Yifu Guo, Yuquan Lu, Wentao Zhang, Zishan Xu, Dexia Chen, Siyu Zhang, Yizhe Zhang, Ruixuan Wang

Main category: cs.CV

TL;DR: DecoupleCSS introduces a two-stage framework for Continual Semantic Segmentation that decouples class-aware detection from class-agnostic segmentation to better balance retention of old knowledge with learning of new classes.

DetailsMotivation: Existing CSS methods use single-stage encoder-decoder architectures where segmentation masks and class labels are tightly coupled, causing interference between old and new class learning and suboptimal retention-plasticity balance.

Method: Two-stage framework: 1) Uses pre-trained text/image encoders with LoRA adaptation to encode class-specific info and generate location-aware prompts; 2) Employs Segment Anything Model (SAM) for precise class-agnostic segmentation masks shared across all classes.

Result: Achieves state-of-the-art performance across various challenging CSS tasks by improving the balance between retention of past knowledge and adaptability to new classes.

Conclusion: Decoupling class-aware detection from class-agnostic segmentation enables more effective continual learning in semantic segmentation, preserving past knowledge while learning new classes with better retention-plasticity balance.

Abstract: Continual Semantic Segmentation (CSS) requires learning new classes without forgetting previously acquired knowledge, addressing the fundamental challenge of catastrophic forgetting in dense prediction tasks. However, existing CSS methods typically employ single-stage encoder-decoder architectures where segmentation masks and class labels are tightly coupled, leading to interference between old and new class learning and suboptimal retention-plasticity balance. We introduce DecoupleCSS, a novel two-stage framework for CSS. By decoupling class-aware detection from class-agnostic segmentation, DecoupleCSS enables more effective continual learning, preserving past knowledge while learning new classes. The first stage leverages pre-trained text and image encoders, adapted using LoRA, to encode class-specific information and generate location-aware prompts. In the second stage, the Segment Anything Model (SAM) is employed to produce precise segmentation masks, ensuring that segmentation knowledge is shared across both new and previous classes. This approach improves the balance between retention and adaptability in CSS, achieving state-of-the-art performance across a variety of challenging tasks. Our code is publicly available at: https://github.com/euyis1019/Decoupling-Continual-Semantic-Segmentation.

[208] LVLM-Aware Multimodal Retrieval for RAG-Based Medical Diagnosis with General-Purpose Models

Nir Mazor, Tom Hope

Main category: cs.CV

TL;DR: Lightweight multimodal retrieval mechanism enhances diagnostic performance of LVLMs in clinical settings with minimal fine-tuning, achieving competitive results while identifying and improving inconsistent retrieval prediction errors.

DetailsMotivation: Multimodal retrieval from medical literature and hospital records can improve diagnostic accuracy for clinical image interpretation, but multimodal retrieval-augmented diagnosis remains highly challenging and resource-intensive.

Method: Train a lightweight LVLM-aware multimodal retriever that learns to return images and texts guiding the LVLM toward correct predictions. Use only lightweight fine-tuning with small data amounts and general-purpose backbone models in low-resource settings.
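
A sketch of what "LVLM-aware" retrieval training could look like under the description above: candidates that led the frozen LVLM to a correct answer (determined offline) are treated as positives in a softmax retrieval objective. The architecture, dimensions, and labeling rule are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Sketch of an "LVLM-aware" retriever objective: candidates that make the
# (frozen) LVLM answer correctly are treated as positives. All dimensions,
# the scoring head, and the labeling rule are illustrative assumptions.

class Retriever(nn.Module):
    def __init__(self, dim=64):
        super().__init__()
        self.query_proj = nn.Linear(dim, dim)
        self.cand_proj = nn.Linear(dim, dim)

    def scores(self, query_emb, cand_embs):
        q = F.normalize(self.query_proj(query_emb), dim=-1)      # (B, D)
        c = F.normalize(self.cand_proj(cand_embs), dim=-1)       # (N, D)
        return q @ c.T                                           # (B, N)

def retriever_loss(retriever, query_emb, cand_embs, lvlm_correct, tau=0.07):
    """lvlm_correct[b, n] = 1 if retrieving candidate n made the frozen LVLM
    answer query b correctly (obtained offline), else 0."""
    logits = retriever.scores(query_emb, cand_embs) / tau
    log_probs = F.log_softmax(logits, dim=-1)
    pos_mass = (lvlm_correct * log_probs).sum(-1) / lvlm_correct.sum(-1).clamp(min=1)
    return -pos_mass.mean()   # pull probability mass onto helpful candidates

retriever = Retriever()
q = torch.randn(4, 64)            # query (image + question) embeddings
c = torch.randn(10, 64)           # candidate (image/text) embeddings
correct = torch.randint(0, 2, (4, 10)).float()
loss = retriever_loss(retriever, q, c, correct)
loss.backward()
print(loss.item())
```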

Result: Achieves competitive results in clinical classification and VQA tasks compared to medically pre-trained models with extensive training. Identifies and significantly improves inconsistent retrieval prediction errors where different top-retrieved images yield different predictions for the same target.

Conclusion: The lightweight retrieval optimization mechanism effectively enhances diagnostic performance while revealing gaps in LVLMs’ ability to utilize retrieved information for clinical predictions, highlighting the need for better integration of multimodal retrieval in clinical AI systems.

Abstract: Retrieving visual and textual information from medical literature and hospital records can enhance diagnostic accuracy for clinical image interpretation. However, multimodal retrieval-augmented diagnosis is highly challenging. We explore a lightweight mechanism for enhancing diagnostic performance of retrieval-augmented LVLMs. We train a lightweight LVLM-aware multimodal retriever, such that the retriever learns to return images and texts that guide the LVLM toward correct predictions. In our low-resource setting, we perform only lightweight fine-tuning with small amounts of data, and use only general-purpose backbone models, achieving competitive results in clinical classification and VQA tasks compared to medically pre-trained models with extensive training. In a novel analysis, we highlight a previously unexplored class of errors that we term inconsistent retrieval predictions: cases where different top-retrieved images yield different predictions for the same target. We find that these cases are challenging for all models, even for non-retrieval models, and that our retrieval optimization mechanism significantly improves these cases over standard RAG. However, our analysis also sheds light on gaps in the ability of LVLMs to utilize retrieved information for clinical predictions. Code and models available at: https://github.com/Nirmaz/CLARE.

[209] Image-to-Brain Signal Generation for Visual Prosthesis with CLIP Guided Multimodal Diffusion Models

Ganxi Xu, Zhao-Rong Lai, Yuting Tang, Yonghao Song, Guoxu Zhou, Boyu Wang, Jian Zhu, Jinyi Long

Main category: cs.CV

TL;DR: A novel image-to-brain signal framework using diffusion transformers with cross-attention mechanisms to generate M/EEG signals from images, enabling visual prosthesis development.

DetailsMotivation: Visual prostheses need both brain decoding (M/EEG to perceptions) and brain encoding (images to M/EEG) stages. While decoding has been explored, encoding remains largely unaddressed, preventing a complete functional pipeline for visual restoration.

Method: Uses diffusion transformer (DiT) architecture based on DDIM for brain signal generation. Employs cross-attention to align brain signal embeddings with CLIP image embeddings. Leverages LLMs to generate image captions, concatenating CLIP text and image embeddings for semantic alignment. Introduces learnable spatio-temporal position encoding combining brain region and temporal embeddings.
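
The conditioning recipe (noisy brain-signal tokens cross-attending to concatenated CLIP image and caption embeddings) can be sketched as a single denoiser block; layer sizes and token shapes below are assumptions, not the paper's configuration.

```python
import torch
import torch.nn as nn

# Minimal sketch of the conditioning idea: noisy brain-signal tokens
# cross-attend to the concatenation of CLIP image and text embeddings.
# Shapes and layer sizes are illustrative assumptions.

class CrossAttnDenoiserBlock(nn.Module):
    def __init__(self, dim=64, heads=4):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                                 nn.Linear(4 * dim, dim))
        self.n1, self.n2, self.n3 = (nn.LayerNorm(dim) for _ in range(3))

    def forward(self, x, cond):
        # x:    (B, T, D) noisy brain-signal tokens (channel x time patches)
        # cond: (B, S, D) concatenated CLIP image + caption embeddings
        x = x + self.self_attn(self.n1(x), self.n1(x), self.n1(x))[0]
        x = x + self.cross_attn(self.n2(x), cond, cond)[0]   # align to CLIP space
        return x + self.mlp(self.n3(x))

B, T, D = 2, 16, 64
brain_tokens = torch.randn(B, T, D)
clip_image = torch.randn(B, 1, D)      # pooled CLIP image embedding
clip_text = torch.randn(B, 1, D)       # pooled CLIP caption embedding
cond = torch.cat([clip_image, clip_text], dim=1)   # unified conditioning
block = CrossAttnDenoiserBlock()
print(block(brain_tokens, cond).shape)  # torch.Size([2, 16, 64])
```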

Result: Evaluated on THINGS-EEG2 and THINGS-MEG datasets, demonstrating generation of biologically plausible brain signals from images.

Conclusion: The framework successfully addresses the brain encoding gap in visual prostheses by generating realistic M/EEG signals from images using diffusion transformers and multimodal alignment, advancing toward complete visual restoration systems.

Abstract: Visual prostheses hold great promise for restoring vision in blind individuals. While researchers have successfully utilized M/EEG signals to evoke visual perceptions during the brain decoding stage of visual prostheses, the complementary process of converting images into M/EEG signals in the brain encoding stage remains largely unexplored, hindering the formation of a complete functional pipeline. In this work, we present a novel image-to-brain signal framework that generates M/EEG from images by leveraging the diffusion transformer architecture enhanced with cross-attention mechanisms. Specifically, we employ a diffusion transformer (DiT) architecture based on denoising diffusion implicit models (DDIM) to achieve brain signal generation. To realize the goal of image-to-brain signal conversion, we use cross-attention mechanisms to align brain signal embeddings with CLIP image embeddings. Moreover, to capture core semantic information, we leverage large language models (LLMs) to generate descriptive and semantically accurate image captions, and concatenate the resulting CLIP text embeddings with CLIP image embeddings to form unified embeddings for cross-attention alignment. Furthermore, we introduce a learnable spatio-temporal position encoding that combines brain region embeddings with temporal embeddings to capture both spatial and temporal characteristics of brain signals. We evaluate the framework on two multimodal benchmark datasets (THINGS-EEG2 and THINGS-MEG) and demonstrate that it generates biologically plausible brain signals.

[210] Ensemble-Based Event Camera Place Recognition Under Varying Illumination

Therese Joseph, Tobias Fischer, Michael Milford

Main category: cs.CV

TL;DR: Event camera ensemble VPR method combining multiple reconstructions, feature extractors, and temporal resolutions achieves 57% recall improvement in day-night transitions.

DetailsMotivation: Event cameras offer high dynamic range and low latency advantages for visual place recognition, but robust VPR under severe illumination changes remains an open problem that needs to be addressed.

Method: Ensemble-based approach combining sequence-matched results from multiple event-to-frame reconstructions, VPR feature extractors, and temporal resolutions, plus modification to standard sequence matching framework for longer sequences.
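
A NumPy sketch of the ensemble idea: similarity matrices from several ensemble members (reconstruction x extractor x temporal resolution) are fused, then sequence matching scores each query against straight-line diagonals of references. The averaging fusion and toy sequence matcher are assumptions; the paper's sequence-matching modification is more involved.

```python
import numpy as np

# Sketch of ensemble place recognition: fuse similarity matrices from several
# ensemble members, then apply sequence matching. Shapes and the fusion rule
# are assumptions.

def sequence_match(sim: np.ndarray, seq_len: int = 5) -> np.ndarray:
    """Score query q against reference r as the mean similarity along the
    diagonal sim[q - k, r - k], k = 0..seq_len-1 (a straight-line sequence)."""
    Q, R = sim.shape
    scores = np.full((Q, R), -np.inf)
    for q in range(seq_len - 1, Q):
        for r in range(seq_len - 1, R):
            scores[q, r] = np.mean([sim[q - k, r - k] for k in range(seq_len)])
    return scores

rng = np.random.default_rng(0)
Q = R = 50
# One similarity matrix per ensemble member (reconstruction x extractor).
members = [rng.normal(size=(Q, R)) for _ in range(6)]
for m in members:                       # plant a weak diagonal ground truth
    m[np.arange(Q), np.arange(R)] += 1.0

fused = np.mean(members, axis=0)        # ensemble fusion by averaging
scores = sequence_match(fused, seq_len=5)
preds = scores[5:].argmax(axis=1)       # best reference per query
print("top-1 correct:", np.mean(preds == np.arange(5, Q)))
```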

Result: Achieves 57% relative improvement in Recall@1 across day-night transitions, evaluated on two long-term driving datasets (8 km per traverse) without metric subsampling.

Conclusion: The broader fusion strategy outperforms previous temporal-only ensemble methods, with comprehensive analysis identifying critical components for robust performance; codebase will be released for future research.

Abstract: Compared to conventional cameras, event cameras provide a high dynamic range and low latency, offering greater robustness to rapid motion and challenging lighting conditions. Although the potential of event cameras for visual place recognition (VPR) has been established, developing robust VPR frameworks under severe illumination changes remains an open research problem. In this paper, we introduce an ensemble-based approach to event camera place recognition that combines sequence-matched results from multiple event-to-frame reconstructions, VPR feature extractors, and temporal resolutions. Unlike previous event-based ensemble methods, which only utilise temporal resolution, our broader fusion strategy delivers significantly improved robustness under varied lighting conditions (e.g., afternoon, sunset, night), achieving a 57% relative improvement in Recall@1 across day-night transitions. We evaluate our approach on two long-term driving datasets (with 8 km per traverse) without metric subsampling, thereby preserving natural variations in speed and stop duration that influence event density. We also conduct a comprehensive analysis of key design choices, including binning strategies, polarity handling, reconstruction methods, and feature extractors, to identify the most critical components for robust performance. Additionally, we propose a modification to the standard sequence matching framework that enhances performance at longer sequence lengths. To facilitate future research, we will release our codebase and benchmarking framework.

[211] Universal Few-Shot Spatial Control for Diffusion Models

Kiet T. Nguyen, Chanhyuk Lee, Donggyun Kim, Dong Hoon Lee, Seunghoon Hong

Main category: cs.CV

TL;DR: UFC is a universal few-shot control adapter for text-to-image diffusion models that can generalize to novel spatial control tasks with minimal training data.

DetailsMotivation: Existing spatial conditioning adapters for diffusion models have limited adaptability to novel control conditions and require expensive retraining for new tasks.

Method: UFC uses few-shot learning with image-condition pairs, leveraging analogy between query and support conditions to construct task-specific control features through matching mechanisms and updates to small task-specific parameters.

Result: UFC achieves fine-grained control with only 30 annotated examples per novel task, and with 0.1% of full training data matches fully supervised baselines across six spatial control tasks.

Conclusion: UFC provides a versatile, data-efficient solution for adapting diffusion models to novel spatial control conditions, working with both UNet and DiT architectures.

Abstract: Spatial conditioning in pretrained text-to-image diffusion models has significantly improved fine-grained control over the structure of generated images. However, existing control adapters exhibit limited adaptability and incur high training costs when encountering novel spatial control conditions that differ substantially from the training tasks. To address this limitation, we propose Universal Few-Shot Control (UFC), a versatile few-shot control adapter capable of generalizing to novel spatial conditions. Given a few image-condition pairs of an unseen task and a query condition, UFC leverages the analogy between query and support conditions to construct task-specific control features, instantiated by a matching mechanism and an update on a small set of task-specific parameters. Experiments on six novel spatial control tasks show that UFC, fine-tuned with only 30 annotated examples of novel tasks, achieves fine-grained control consistent with the spatial conditions. Notably, when fine-tuned with 0.1% of the full training data, UFC achieves competitive performance with the fully supervised baselines in various control tasks. We also show that UFC is applicable agnostically to various diffusion backbones and demonstrate its effectiveness on both UNet and DiT architectures. Code is available at https://github.com/kietngt00/UFC.

[212] Pain in 3D: Generating Controllable Synthetic Faces for Automated Pain Assessment

Xin Lei Lin, Soroush Mehraban, Abhishek Moturu, Babak Taati

Main category: cs.CV

TL;DR: 3DPain: A large-scale synthetic dataset for automated pain assessment with demographic diversity and rich annotations, plus ViTPain: A cross-modal distillation framework using heatmap guidance for improved accuracy and interpretability.

DetailsMotivation: Automated pain assessment from facial expressions is crucial for non-communicative patients like dementia patients, but limited by: (1) severe demographic and label imbalance in existing datasets due to ethical constraints, and (2) current generative models' inability to precisely control facial action units, facial structure, or clinically validated pain levels.

Method: Three-stage framework: (1) generates diverse 3D meshes, (2) textures them with diffusion models, and (3) applies AU-driven face rigging to synthesize multi-view faces with paired neutral/pain images, AU configurations, PSPI scores, and pain-region heatmaps. Also introduces ViTPain: a Vision Transformer based cross-modal distillation framework where a heatmap-trained teacher guides an RGB-trained student.
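
The ViTPain distillation step (a heatmap-trained teacher guiding an RGB-trained student) follows the standard cross-modal distillation pattern, sketched below with tiny stand-in networks; the temperature, loss weights, and label discretization are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Sketch of cross-modal distillation: a teacher that sees pain-region
# heatmaps guides a student that sees only RGB. The tiny conv nets,
# temperature, and loss weights are illustrative assumptions.

def make_net(in_ch, n_classes=5):
    return nn.Sequential(
        nn.Conv2d(in_ch, 16, 3, stride=2, padding=1), nn.ReLU(),
        nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(16, n_classes))

teacher = make_net(in_ch=1)   # input: pain-region heatmap
student = make_net(in_ch=3)   # input: RGB face image

def distill_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    soft = F.kl_div(F.log_softmax(student_logits / T, dim=-1),
                    F.softmax(teacher_logits / T, dim=-1),
                    reduction="batchmean") * T * T
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard

rgb = torch.randn(8, 3, 64, 64)
heatmap = torch.randn(8, 1, 64, 64)     # paired pain-region heatmap
labels = torch.randint(0, 5, (8,))      # e.g., a discretized PSPI level

with torch.no_grad():                   # teacher is trained beforehand, frozen here
    t_logits = teacher(heatmap)
loss = distill_loss(student(rgb), t_logits, labels)
loss.backward()
print(loss.item())
```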

Result: Created 3DPain dataset with 82,500 samples across 25,000 pain expression heatmaps and 2,500 synthetic identities balanced by age, gender, and ethnicity. The framework provides unprecedented annotation richness and demographic diversity for pain assessment.

Conclusion: 3DPain and ViTPain together establish a controllable, diverse, and clinically grounded foundation for generalizable automated pain assessment, addressing previous limitations in dataset imbalance and generative model precision.

Abstract: Automated pain assessment from facial expressions is crucial for non-communicative patients, such as those with dementia. Progress has been limited by two challenges: (i) existing datasets exhibit severe demographic and label imbalance due to ethical constraints, and (ii) current generative models cannot precisely control facial action units (AUs), facial structure, or clinically validated pain levels. We present 3DPain, a large-scale synthetic dataset specifically designed for automated pain assessment, featuring unprecedented annotation richness and demographic diversity. Our three-stage framework generates diverse 3D meshes, textures them with diffusion models, and applies AU-driven face rigging to synthesize multi-view faces with paired neutral and pain images, AU configurations, PSPI scores, and the first dataset-level annotations of pain-region heatmaps. The dataset comprises 82,500 samples across 25,000 pain expression heatmaps and 2,500 synthetic identities balanced by age, gender, and ethnicity. We further introduce ViTPain, a Vision Transformer based cross-modal distillation framework in which a heatmap-trained teacher guides a student trained on RGB images, enhancing accuracy, interpretability, and clinical reliability. Together, 3DPain and ViTPain establish a controllable, diverse, and clinically grounded foundation for generalizable automated pain assessment.

[213] GenView++: Unifying Adaptive View Generation and Quality-Driven Supervision for Contrastive Representation Learning

Xiaojie Li, Bei Wang, Jianlong Wu, Yue Yu, Liqiang Nie, Min Zhang

Main category: cs.CV

TL;DR: GenView++ improves contrastive learning by generating diverse, semantically coherent views and dynamically weighting training pairs based on quality assessment.

DetailsMotivation: Current contrastive learning methods have limitations: view construction suffers from limited diversity and semantic corruption, while learning lacks quality assessment mechanisms that treat all pairs equally regardless of quality.

Method: Two synergistic innovations: 1) Multi-source adaptive view generation that synthesizes diverse yet semantically coherent views using image-conditioned, text-conditioned, and image-text-conditioned strategies with dynamic parameter modulation; 2) Quality-driven contrastive learning that assesses semantic alignment and diversity to dynamically reweight training contributions, prioritizing high-quality pairs.
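
A sketch of quality-driven reweighting on top of InfoNCE: each positive pair receives a weight derived from a semantic-alignment score and a diversity score, and the weight scales that pair's loss contribution. The specific scoring functions below are assumptions; the paper's quality assessment is richer.

```python
import torch
import torch.nn.functional as F

# Sketch of quality-driven reweighting: per-pair weights come from semantic
# alignment (cosine of the two views) and a diversity term, and scale each
# pair's InfoNCE contribution. The scoring functions are assumptions.

def weighted_info_nce(z1, z2, tau=0.2):
    z1, z2 = F.normalize(z1, dim=-1), F.normalize(z2, dim=-1)
    logits = z1 @ z2.T / tau                      # (B, B)
    targets = torch.arange(z1.size(0))
    per_pair = F.cross_entropy(logits, targets, reduction="none")

    with torch.no_grad():
        alignment = (z1 * z2).sum(-1)             # high = semantically coherent
        diversity = 1 - alignment                 # high = informative view change
        quality = alignment.clamp(min=0) * diversity.clamp(min=0)
        weights = quality / quality.sum().clamp(min=1e-8) * len(quality)

    return (weights * per_pair).mean()            # prioritize high-quality pairs

z1 = torch.randn(16, 64, requires_grad=True)      # embeddings of view 1
z2 = torch.randn(16, 64)                          # embeddings of generated view 2
loss = weighted_info_nce(z1, z2)
loss.backward()
print(loss.item())
```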

Result: Improves MoCov2 by +2.5% on ImageNet linear classification; raises average zero-shot classification accuracy by +12.31% over CLIP and +5.31% over SLIP across ten datasets; improves Flickr30k text retrieval R@5 by +3.2%.

Conclusion: GenView++ effectively addresses both construction and learning limitations in contrastive learning through unified quality-aware view generation and adaptive training, demonstrating significant improvements across vision and vision-language tasks.

Abstract: The success of contrastive learning depends on the construction and utilization of high-quality positive pairs. However, current methods face critical limitations on two fronts: on the construction side, both handcrafted and generative augmentations often suffer from limited diversity and risk semantic corruption; on the learning side, the absence of a quality assessment mechanism leads to suboptimal supervision where all pairs are treated equally. To tackle these challenges, we propose GenView++, a unified framework that addresses both fronts by introducing two synergistic innovations. First, to improve pair construction, GenView++ introduces a multi-source adaptive view generation mechanism to synthesize diverse yet semantically coherent views by dynamically modulating generative parameters across image-conditioned, text-conditioned, and image-text-conditioned strategies. Second, a quality-driven contrastive learning mechanism assesses each pair’s semantic alignment and diversity to dynamically reweight their training contribution, prioritizing high-quality pairs while suppressing redundant or misaligned pairs. Extensive experiments demonstrate the effectiveness of GenView++ across both vision and vision-language tasks. For vision representation learning, it improves MoCov2 by +2.5% on ImageNet linear classification. For vision-language learning, it raises the average zero-shot classification accuracy by +12.31% over CLIP and +5.31% over SLIP across ten datasets, and further improves Flickr30k text retrieval R@5 by +3.2%.

[214] ProfVLM: A Lightweight Video-Language Model for Multi-View Proficiency Estimation

Edoardo Bianchi, Jacopo Staiano, Antonio Liotta

Main category: cs.CV

TL;DR: ProfVLM reformulates action quality assessment as generative vision-language modeling, predicting proficiency levels while generating expert-like natural language feedback from multi-view videos, outperforming classification-based methods with significantly fewer parameters and training time.

DetailsMotivation: Existing approaches treat action quality assessment and skill proficiency estimation as classification problems, outputting discrete labels without interpretable reasoning. This lacks actionable insights and natural language feedback that would be valuable for skill improvement.

Method: Introduces ProfVLM, a generative vision-language model that uses an AttentiveGatedProjector to dynamically fuse multi-view egocentric and exocentric features from a frozen TimeSformer backbone into a language model fine-tuned for feedback generation. The model jointly predicts proficiency levels and generates expert-like natural language critiques.
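
One plausible reading of the AttentiveGatedProjector, sketched as a gated attention-pooling module that fuses per-view features and projects them into the language model's embedding space; the gating form and dimensions are assumptions.

```python
import torch
import torch.nn as nn

# Sketch of an attentive, gated multi-view fusion projector: per-view gates
# weight egocentric/exocentric features before attention pooling and
# projection into the language model's embedding space. Layer sizes and the
# gating form are assumptions about the AttentiveGatedProjector.

class AttentiveGatedProjector(nn.Module):
    def __init__(self, vid_dim=128, lm_dim=256):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(vid_dim, 1), nn.Sigmoid())
        self.attn = nn.MultiheadAttention(vid_dim, num_heads=4, batch_first=True)
        self.proj = nn.Linear(vid_dim, lm_dim)

    def forward(self, views):
        # views: (B, V, D), one feature vector per camera view (ego + exo)
        gated = views * self.gate(views)              # suppress uninformative views
        fused, _ = self.attn(gated, gated, gated)     # let views exchange context
        return self.proj(fused.mean(dim=1))           # (B, lm_dim) token for the LM

B, V, D = 2, 3, 128                                   # 1 ego + 2 exo views
view_feats = torch.randn(B, V, D)                     # e.g., frozen TimeSformer features
projector = AttentiveGatedProjector()
print(projector(view_feats).shape)                    # torch.Size([2, 256])
```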

Result: ProfVLM surpasses state-of-the-art methods on EgoExo4D dataset while using up to 20x fewer parameters and reducing training time by up to 60% compared to existing classification-based methods.

Conclusion: Generative vision-language modeling offers a powerful and efficient paradigm shift for interpretable action quality assessment by providing both quantitative evaluation scores and natural language critiques aligned with performance levels.

Abstract: Existing approaches treat action quality assessment and skill proficiency estimation as classification problems, outputting discrete labels without interpretable reasoning. We reformulate this task as generative vision language modeling, introducing ProfVLM, a compact model that jointly predicts proficiency levels and generates expert-like natural language feedback from multi-view videos. ProfVLM leverages conditional language generation to provide actionable insights along with quantitative evaluation scores. Central to our method is an AttentiveGatedProjector that dynamically fuses and projects multi-view egocentric and exocentric features from a frozen TimeSformer backbone into a language model fine-tuned for feedback generation. Trained on EgoExo4D with expert commentaries, ProfVLM surpasses state-of-the-art methods while using up to 20x fewer parameters and reducing training time by up to 60% compared to existing classification-based methods. By providing natural language critiques aligned with performance levels, this work shows that generative vision-language modeling offers a powerful and efficient paradigm shift for interpretable action quality assessment.

[215] Uncovering Intrinsic Capabilities: A Paradigm for Data Curation in Vision-Language Models

Junjie Li, Ziao Wang, Jianghong Ma, Xiaofeng Zhang

Main category: cs.CV

TL;DR: CADC framework uses unsupervised capability discovery and influence estimation to curate instruction tuning data, achieving better performance with only 5% of original data.

DetailsMotivation: Current instruction tuning for vision-language models treats models as black boxes with heuristic strategies, causing regressions when reducing dataset size and overlooking latent capabilities that govern learning.

Method: CADC framework: 1) discovers intrinsic capabilities unsupervised from gradient-based learning trajectories, 2) attributes training data to these capabilities via influence estimation, 3) curates capability-aware curricula through balanced selection and staged sequencing.
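
A toy sketch of the three-step recipe under stated assumptions: per-example gradient features (random stand-ins here) are clustered to discover capabilities, a dot-product proxy stands in for influence estimation, and selection is balanced per capability.

```python
import numpy as np
from sklearn.cluster import KMeans

# Sketch of the curation recipe: cluster per-example gradient features to
# discover "capabilities", score influence within each cluster, then select
# a balanced subset. Gradient features are random stand-ins here, and the
# influence score is a simple dot-product proxy; both are assumptions.

rng = np.random.default_rng(0)
n_examples, feat_dim, n_caps, budget = 1000, 32, 4, 50

grad_feats = rng.normal(size=(n_examples, feat_dim))  # per-example gradient sketch

# 1) Unsupervised capability discovery.
caps = KMeans(n_clusters=n_caps, n_init=10, random_state=0).fit(grad_feats)

# 2) Influence of each example on its capability: similarity to the
#    capability's mean gradient direction (a proxy for influence estimation).
influence = np.einsum("nd,nd->n", grad_feats,
                      caps.cluster_centers_[caps.labels_])

# 3) Balanced, capability-aware selection: top examples per capability.
selected = []
per_cap = budget // n_caps
for c in range(n_caps):
    idx = np.where(caps.labels_ == c)[0]
    selected.extend(idx[np.argsort(-influence[idx])][:per_cap])

print(f"curated {len(selected)} of {n_examples} examples "
      f"({len(selected) / n_examples:.1%})")
```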

Result: With only 5% of original data, CADC surpasses full-data training on multimodal benchmarks, validating intrinsic capabilities as fundamental building blocks of model learning.

Conclusion: CADC transforms black-box instruction tuning into a controllable, capability-driven process and establishes a principled paradigm for instruction data curation.

Abstract: Large vision-language models (VLMs) achieve strong benchmark performance, but controlling their behavior through instruction tuning remains difficult. Reducing the instruction tuning data budget often causes regressions, as heuristic strategies treat models as black boxes and overlook the latent capabilities that govern learning. We introduce Capability-Attributed Data Curation (CADC), a framework that shifts curation from task-specific heuristics to intrinsic capability analysis. CADC discovers intrinsic capabilities in an unsupervised manner from gradient-based learning trajectories, attributes training data to these capabilities via influence estimation, and curates capability-aware curricula through balanced selection and staged sequencing. This transforms black-box instruction tuning into a controllable, capability-driven process. With as little as 5% of the original data, CADC surpasses full-data training on multimodal benchmarks. These results validate intrinsic capabilities as the fundamental building blocks of model learning and establish CADC as a principled paradigm for instruction data curation.

[216] Comprehensive language-image pre-training for 3D medical image understanding

Tassilo Wald, Ibrahim Ethem Hamamci, Yuan Gao, Sam Bond-Taylor, Harshita Sharma, Maximilian Ilse, Cynthia Lo, Olesya Melnichenko, Anton Schwaighofer, Noel C. F. Codella, Maria Teodora Wetscherek, Klaus H. Maier-Hein, Panagiotis Korfiatis, Valentina Salvatelli, Javier Alvarez-Valle, Fernando Pérez-García

Main category: cs.CV

TL;DR: COLIPRI is a 3D medical vision-language encoder that combines vision-language pre-training with vision-only pre-training and report generation objectives to overcome data limitations in 3D medical imaging.

DetailsMotivation: Current 3D vision-language encoders in medical imaging face limitations due to data availability and domain-specific challenges, restricting their capabilities for clinical applications like patient retrieval, abnormality prediction, and report generation.

Method: Inject additional supervision via report generation objective, combine vision-language pre-training with vision-only pre-training, leverage both image-only and paired image-text 3D datasets, and incorporate best practices of 3D medical imaging domain.
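
The combined objective can be sketched as a weighted sum of a CLIP-style contrastive term on paired image-report data, a report-generation cross-entropy term, and a vision-only term for unpaired scans; all tensors below are random stand-ins and the loss weights are assumptions.

```python
import torch
import torch.nn.functional as F

# Sketch of the multi-objective pre-training recipe, so that both paired
# image-report data and image-only 3D data contribute. Tensors are random
# stand-ins; the weights are assumptions.

B, D, V, L = 4, 64, 100, 12                         # batch, emb dim, vocab, report len
img_emb = F.normalize(torch.randn(B, D, requires_grad=True), dim=-1)
txt_emb = F.normalize(torch.randn(B, D), dim=-1)
gen_logits = torch.randn(B, L, V, requires_grad=True)   # report decoder outputs
report_tokens = torch.randint(0, V, (B, L))
img_only_emb = torch.randn(B, D, requires_grad=True)    # from image-only data
img_only_target = torch.randn(B, D)                     # e.g., EMA-teacher features

# Contrastive (CLIP-style) term on paired image-report data.
logits = img_emb @ txt_emb.T / 0.07
targets = torch.arange(B)
l_con = (F.cross_entropy(logits, targets) + F.cross_entropy(logits.T, targets)) / 2

# Report-generation supervision injects extra signal from the same pairs.
l_gen = F.cross_entropy(gen_logits.reshape(-1, V), report_tokens.reshape(-1))

# Vision-only objective lets unpaired 3D scans contribute too.
l_vis = F.mse_loss(img_only_emb, img_only_target)

total = l_con + 1.0 * l_gen + 0.5 * l_vis           # weights are assumptions
total.backward()
print(float(l_con), float(l_gen), float(l_vis))
```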

Result: COLIPRI encoders achieve state-of-the-art performance in report generation, semantic segmentation, classification probing, and zero-shot classification.

Conclusion: The proposed approach successfully overcomes data limitations in 3D medical vision-language pre-training, creating powerful encoders that can support various clinical applications including report generation and patient similarity retrieval.

Abstract: Vision-language pre-training, i.e., aligning images with paired text, is a powerful paradigm to create encoders that can be directly used for tasks such as classification, retrieval, and segmentation. In the 3D medical image domain, these capabilities allow vision-language encoders (VLEs) to support radiologists by retrieving patients with similar abnormalities, predicting likelihoods of abnormality, or, with downstream adaptation, generating radiological reports. While the methodology holds promise, data availability and domain-specific hurdles limit the capabilities of current 3D VLEs. In this paper, we overcome these challenges by injecting additional supervision via a report generation objective and combining vision-language with vision-only pre-training. This allows us to leverage both image-only and paired image-text 3D datasets, increasing the total amount of data to which our model is exposed. Through these additional objectives, paired with best practices of the 3D medical imaging domain, we develop the Comprehensive Language-Image Pre-training (COLIPRI) encoder family. Our COLIPRI encoders achieve state-of-the-art performance in report generation, semantic segmentation, classification probing, and zero-shot classification. The model is available at https://huggingface.co/microsoft/colipri.

[217] BARL: Bilateral Alignment in Representation and Label Spaces for Semi-Supervised Volumetric Medical Image Segmentation

Shujian Gao, Yuan Wang, Zekuan Yu

Main category: cs.CV

TL;DR: BARL introduces a unified semi-supervised medical image segmentation framework that enforces alignment in both representation and label spaces, outperforming state-of-the-art methods on multiple benchmarks.

DetailsMotivation: Current semi-supervised medical image segmentation methods focus only on label-space consistency while ignoring representation-space alignment, causing models to struggle with learning discriminative and spatially coherent representations for complex pathological patterns.

Method: BARL uses a dual-branch framework with two collaborative components: 1) Label-space alignment via Dual-Path Regularization (DPR) and Progressively Cognitive Bias Correction (PCBC) for cross-branch consistency and error correction across scales; 2) Representation-space alignment through region-level and lesion-instance matching to capture fragmented pathological patterns.

Result: Extensive experiments on four public benchmarks and a proprietary CBCT dataset show BARL consistently surpasses state-of-the-art SSMIS methods. Ablation studies confirm the contribution of each component.

Conclusion: BARL demonstrates that simultaneous alignment in both representation and label spaces is crucial for effective semi-supervised medical image segmentation, achieving superior performance while reducing annotation costs.

Abstract: Semi-supervised medical image segmentation (SSMIS) seeks to match fully supervised performance while sharply reducing annotation cost. Mainstream SSMIS methods rely on label-space consistency, yet they overlook the equally critical representation-space alignment. Without harmonizing latent features, models struggle to learn representations that are both discriminative and spatially coherent. To this end, we introduce Bilateral Alignment in Representation and Label spaces (BARL), a unified framework that couples two collaborative branches and enforces alignment in both spaces. For label-space alignment, inspired by co-training and multi-scale decoding, we devise Dual-Path Regularization (DPR) and Progressively Cognitive Bias Correction (PCBC) to impose fine-grained cross-branch consistency while mitigating error accumulation from coarse to fine scales. For representation-space alignment, we conduct region-level and lesion-instance matching between branches, explicitly capturing the fragmented, complex pathological patterns common in medical imagery. Extensive experiments on four public benchmarks and a proprietary CBCT dataset demonstrate that BARL consistently surpasses state-of-the-art SSMIS methods. Ablation studies further validate the contribution of each component. Code will be released soon.

[218] Head Pursuit: Probing Attention Specialization in Multimodal Transformers

Lorenzo Basile, Valentino Maiorca, Diego Doimo, Francesco Locatello, Alberto Cazzaniga

Main category: cs.CV

TL;DR: The paper presents a method to analyze and rank attention heads in transformers based on their specialization in semantic/visual concepts, enabling targeted editing of model outputs by modifying a small subset of heads.

DetailsMotivation: While language and vision-language models show impressive performance, their internal mechanisms remain poorly understood. The authors aim to better understand how individual attention heads specialize in specific attributes and develop methods to control model behavior through targeted interventions.

Method: The authors reinterpret probing of intermediate activations through a signal processing lens, allowing principled analysis of multiple samples. They develop a method to rank attention heads based on their relevance to target concepts, then selectively edit a small subset (as few as 1%) of heads to suppress or enhance specific concepts.
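
A sketch of the rank-then-edit loop: per-head probe scores are correlated with a target concept across samples, the top ~1% of heads are selected, and their outputs are rescaled (0 to suppress, above 1 to enhance). The toy activations and correlation-based score are assumptions standing in for the paper's signal-processing formulation.

```python
import torch

# Sketch of head ranking and editing. Probed per-head concept scores and the
# correlation-based relevance measure are assumptions.

torch.manual_seed(0)
n_layers, n_heads, d_head, n_samples = 4, 8, 16, 200

# Probed per-head concept scores for each sample: acts[l, h, s].
acts = torch.randn(n_layers, n_heads, n_samples)
concept = torch.randn(n_samples)                 # target concept signal

# 1) Rank heads by |correlation| with the concept across samples.
acts_c = acts - acts.mean(-1, keepdim=True)
conc_c = concept - concept.mean()
corr = (acts_c * conc_c).sum(-1) / (
    acts_c.norm(dim=-1) * conc_c.norm() + 1e-8)   # (n_layers, n_heads)
flat = corr.abs().flatten()
k = max(1, int(0.01 * flat.numel()))              # edit ~1% of heads
top = torch.topk(flat, k).indices
edit_set = {(int(i) // n_heads, int(i) % n_heads) for i in top}
print("editing heads (layer, head):", edit_set)

# 2) Apply the edit: scale the chosen heads' output (0 = suppress, >1 = enhance).
def edit_heads(head_outputs, scale=0.0):
    # head_outputs: (n_layers, n_heads, d_head) for one forward pass
    out = head_outputs.clone()
    for layer, head in edit_set:
        out[layer, head] *= scale
    return out

head_outputs = torch.randn(n_layers, n_heads, d_head)
edited = edit_heads(head_outputs)
print(edited.abs().sum(-1)[list(edit_set)[0]])    # 0 after suppression
```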

Result: The method reveals consistent patterns of specialization at the head level across both unimodal and multimodal transformers. Editing just 1% of heads can reliably control targeted concepts in model outputs. Validation shows effectiveness on language tasks (QA, toxicity mitigation) and vision-language tasks (image classification, captioning).

Conclusion: Attention layers contain interpretable and controllable structures, providing simple tools for understanding and editing large-scale generative models through targeted head-level interventions.

Abstract: Language and vision-language models have shown impressive performance across a wide range of tasks, but their internal mechanisms remain only partly understood. In this work, we study how individual attention heads in text-generative models specialize in specific semantic or visual attributes. Building on an established interpretability method, we reinterpret the practice of probing intermediate activations with the final decoding layer through the lens of signal processing. This lets us analyze multiple samples in a principled way and rank attention heads based on their relevance to target concepts. Our results show consistent patterns of specialization at the head level across both unimodal and multimodal transformers. Remarkably, we find that editing as few as 1% of the heads, selected using our method, can reliably suppress or enhance targeted concepts in the model output. We validate our approach on language tasks such as question answering and toxicity mitigation, as well as vision-language tasks including image classification and captioning. Our findings highlight an interpretable and controllable structure within attention layers, offering simple tools for understanding and editing large-scale generative models.

[219] FigEx2: Visual-Conditioned Panel Detection and Captioning for Scientific Compound Figures

Jifeng Song, Arun Das, Pan Wang, Hui Ji, Kun Zhao, Yufei Huang

Main category: cs.CV

TL;DR: FigEx2 is a visual-conditioned framework that localizes panels and generates captions for scientific compound figures, using noise-aware fusion and staged optimization with RL for multimodal consistency.

DetailsMotivation: Scientific compound figures combine multiple labeled panels, but captions are often missing or only provide figure-level summaries, making panel-level understanding difficult in real pipelines.

Method: Proposes FigEx2 with noise-aware gated fusion module to filter token-level features and stabilize detection query space. Uses staged optimization combining supervised learning with reinforcement learning (RL) with CLIP-based alignment and BERTScore-based semantic rewards for multimodal consistency.

Result: Achieves 0.726 mAP@0.5:0.95 for detection and significantly outperforms Qwen3-VL-8B by 0.51 in METEOR and 0.24 in BERTScore. Shows remarkable zero-shot transferability to out-of-distribution scientific domains without fine-tuning.

Conclusion: FigEx2 effectively addresses panel-level understanding in scientific compound figures through a visual-conditioned framework with noise-aware fusion and RL-based optimization, demonstrating strong performance and cross-domain transferability.

Abstract: Scientific compound figures combine multiple labeled panels into a single image, but captions in real pipelines are often missing or only provide figure-level summaries, making panel-level understanding difficult. In this paper, we propose FigEx2, a visual-conditioned framework that localizes panels and generates panel-wise captions directly from the compound figure. To mitigate the impact of diverse phrasing in open-ended captioning, we introduce a noise-aware gated fusion module that adaptively filters token-level features to stabilize the detection query space. Furthermore, we employ a staged optimization strategy combining supervised learning with reinforcement learning (RL), utilizing CLIP-based alignment and BERTScore-based semantic rewards to enforce strict multimodal consistency. To support high-quality supervision, we curate BioSci-Fig-Cap, a refined benchmark for panel-level grounding, alongside cross-disciplinary test suites in physics and chemistry. Experimental results demonstrate that FigEx2 achieves a superior 0.726 mAP@0.5:0.95 for detection and significantly outperforms Qwen3-VL-8B by 0.51 in METEOR and 0.24 in BERTScore. Notably, FigEx2 exhibits remarkable zero-shot transferability to out-of-distribution scientific domains without any fine-tuning.

[220] egoEMOTION: Egocentric Vision and Physiological Signals for Emotion and Personality Recognition in Real-World Tasks

Matthias Jammot, Björn Braun, Paul Streli, Rafael Wampfler, Christian Holz

Main category: cs.CV

TL;DR: Introduces egoEMOTION, the first dataset combining egocentric vision and physiological signals with dense emotion/personality self-reports for affect-aware behavior modeling.

DetailsMotivation: Current egocentric vision benchmarks ignore emotional states that shape human decisions and actions, focusing only on physical activities and assuming neutral affect. This limits vision systems' ability to capture internal drivers of behavior.

Method: Created egoEMOTION dataset with 50+ hours of recordings from 43 participants using Meta’s Project Aria glasses. Includes synchronized eye-tracking video, photoplethysmography, inertial motion data, and physiological baselines. Participants completed emotion-elicitation tasks and naturalistic activities while self-reporting affect (Circumplex Model, Mikels’ Wheel) and personality (Big Five model).

Result: Shows that classical learning-based methods produce better affect estimates from egocentric vision signals than from physiological signals alone. Dataset enables three benchmark tasks: continuous affect classification (valence, arousal, dominance), discrete emotion classification, and trait-level personality inference.

Conclusion: Establishes emotion and personality as core dimensions in egocentric perception, opening new directions for affect-driven modeling of behavior, intent, and interaction.

Abstract: Understanding affect is central to anticipating human behavior, yet current egocentric vision benchmarks largely ignore the person’s emotional states that shape their decisions and actions. Existing tasks in egocentric perception focus on physical activities, hand-object interactions, and attention modeling - assuming neutral affect and uniform personality. This limits the ability of vision systems to capture key internal drivers of behavior. In this paper, we present egoEMOTION, the first dataset that couples egocentric visual and physiological signals with dense self-reports of emotion and personality across controlled and real-world scenarios. Our dataset includes over 50 hours of recordings from 43 participants, captured using Meta’s Project Aria glasses. Each session provides synchronized eye-tracking video, head-mounted photoplethysmography, inertial motion data, and physiological baselines for reference. Participants completed emotion-elicitation tasks and naturalistic activities while self-reporting their affective state using the Circumplex Model and Mikels’ Wheel as well as their personality via the Big Five model. We define three benchmark tasks: (1) continuous affect classification (valence, arousal, dominance); (2) discrete emotion classification; and (3) trait-level personality inference. We show that a classical learning-based method, used as a simple baseline for real-world affect prediction, produces better estimates from signals captured by egocentric vision systems than from physiological signals alone. Our dataset establishes emotion and personality as core dimensions in egocentric perception and opens new directions in affect-driven modeling of behavior, intent, and interaction.

[221] Hierarchical Fusion of Local and Global Visual Features with Mixture-of-Experts for Remote Sensing Image Scene Classification

Yuanhao Tang, Xuechao Zou, Zhengpei Hu, Junliang Xing, Chengkun Zhang, Jianqiang Huang

Main category: cs.CV

TL;DR: Parallel heterogeneous encoder with local-global co-representation for remote sensing scene classification, achieving SOTA results on three benchmarks.

DetailsMotivation: Remote sensing scene classification is challenging due to complex spatial structures and multi-scale objects. Single paradigms (CNN for local features or Mamba for global context) are insufficient for capturing both fine-grained textures and complex spatial structures simultaneously.

Method: Proposes a parallel heterogeneous encoder with two pathways: local visual encoder for multi-scale local features and global visual encoder for efficient global features. Uses hierarchical fusion module for progressive multi-scale feature aggregation with dynamic cross-level interaction. Features are routed through mixture-of-experts classifier head for fine-grained scene recognition.
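
The mixture-of-experts classifier head can be sketched as top-k routing over a small set of expert MLPs; sizes, k, and the expert form below are assumptions (n_classes=45 matches NWPU-RESISC45).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Sketch of a mixture-of-experts classifier head with top-k routing: fused
# features are dispatched to the experts with the highest gate scores.
# Sizes, k, and the expert form are illustrative assumptions.

class MoEHead(nn.Module):
    def __init__(self, dim=128, n_classes=45, n_experts=4, k=2):
        super().__init__()
        self.k = k
        self.gate = nn.Linear(dim, n_experts)
        self.experts = nn.ModuleList(
            [nn.Sequential(nn.Linear(dim, dim), nn.GELU(),
                           nn.Linear(dim, n_classes)) for _ in range(n_experts)])

    def forward(self, x):                       # x: (B, dim) fused features
        gate_logits = self.gate(x)              # (B, n_experts)
        topv, topi = gate_logits.topk(self.k, dim=-1)
        weights = F.softmax(topv, dim=-1)       # renormalize over selected experts
        out = torch.zeros(x.size(0), self.experts[0][-1].out_features)
        for slot in range(self.k):
            for e, expert in enumerate(self.experts):
                mask = topi[:, slot] == e       # samples routed to expert e
                if mask.any():
                    out[mask] += weights[mask, slot, None] * expert(x[mask])
        return out

head = MoEHead()
fused = torch.randn(8, 128)                     # local-global fused representation
print(head(fused).shape)                        # torch.Size([8, 45])
```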

Result: Achieves 93.72% on AID, 95.54% on NWPU-RESISC45, and 96.92% on UC Merced datasets, surpassing state-of-the-art methods with optimal balance of performance and efficiency.

Conclusion: The proposed parallel heterogeneous encoder effectively addresses the limitations of single-paradigm approaches by enabling local-global co-representation, achieving superior performance in remote sensing scene classification through hierarchical fusion and adaptive expert routing.

Abstract: Remote sensing image scene classification remains a challenging task, primarily due to the complex spatial structures and multi-scale characteristics of ground objects. Although CNN-based methods excel at extracting local inductive biases, and Mamba-based approaches demonstrate impressive capabilities in efficiently capturing global sequential context, relying on a single paradigm restricts the model’s ability to simultaneously characterize fine-grained textures and complex spatial structures. To tackle this, we propose a parallel heterogeneous encoder with a hierarchical fusion module, designed to achieve effective local-global co-representation. It consists of two parallel pathways: a local visual encoder for extracting multi-scale local visual features, and a global visual encoder for capturing efficient global visual features. The core innovation lies in its hierarchical fusion module, which progressively aggregates multi-scale features from both pathways, enabling dynamic cross-level feature interaction and contextual reconstruction to produce highly discriminative representations. These fused features are then adaptively routed through a mixture-of-experts classifier head, which dynamically dispatches them to the most suitable experts for fine-grained scene recognition. Experiments on AID, NWPU-RESISC45, and UC Merced show that our model achieves 93.72%, 95.54%, and 96.92% accuracy, surpassing SOTA methods with an optimal balance of performance and efficiency. Code is available at https://anonymous.4open.science/r/classification-41DF.

[222] Weakly Supervised Concept Learning with Class-Level Priors for Interpretable Medical Diagnosis

Md Nahiduzzaman, Steven Korevaar, Alireza Bab-Hadiashar, Ruwan Tennakoon

Main category: cs.CV

TL;DR: PCP is a weakly supervised framework for medical image interpretation that predicts clinical concepts without needing concept annotations, using class-level priors and refinement mechanisms to achieve better performance than zero-shot methods.

DetailsMotivation: Current interpretable-by-design frameworks require costly concept annotations for training data, which are impractical in clinical settings. Zero-shot and concept-generation methods fail to capture domain-specific medical features, leading to poor reliability.

Method: Prior-guided Concept Predictor (PCP) uses class-level concept priors as weak supervision, with a refinement mechanism employing KL divergence and entropy regularization to align predictions with clinical reasoning, avoiding explicit supervision or language models.
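
A sketch of the loss as described: predicted concept probabilities are pulled toward class-level concept priors via a Bernoulli KL term, with an entropy regularizer encouraging confident predictions. Prior values and weights below are illustrative assumptions.

```python
import torch

# Sketch of prior-guided weak supervision: per-class concept priors supervise
# predicted concept probabilities through a KL term, with entropy
# regularization encouraging confident predictions. Values are assumptions.

n_classes, n_concepts, B = 3, 6, 8
# class_priors[c, k] = prior probability that concept k is present in class c.
class_priors = torch.rand(n_classes, n_concepts)

concept_logits = torch.randn(B, n_concepts, requires_grad=True)
labels = torch.randint(0, n_classes, (B,))

p = torch.sigmoid(concept_logits)               # predicted concept probabilities
q = class_priors[labels]                        # matching class-level priors

# Bernoulli KL(q || p) per concept, averaged (weak alignment with priors).
eps = 1e-6
kl = (q * torch.log((q + eps) / (p + eps))
      + (1 - q) * torch.log((1 - q + eps) / (1 - p + eps))).mean()

# Entropy regularization: push each concept prediction away from 0.5.
entropy = -(p * torch.log(p + eps) + (1 - p) * torch.log(1 - p + eps)).mean()

loss = kl + 0.1 * entropy                       # weights are assumptions
loss.backward()
print(float(kl), float(entropy))
```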

Result: PCP improves concept-level F1-score by over 33% compared to zero-shot baselines on PH2 and WBCatt datasets, while achieving competitive classification performance on four medical datasets (PH2, WBCatt, HAM10000, CXR4) relative to fully supervised CBMs and V-IP.

Conclusion: PCP provides a practical solution for interpretable medical AI by enabling concept prediction without costly annotations, bridging the gap between interpretability and clinical applicability through weak supervision and refinement mechanisms.

Abstract: Human-interpretable predictions are essential for deploying AI in medical imaging, yet most interpretable-by-design (IBD) frameworks require concept annotations for training data, which are costly and impractical to obtain in clinical contexts. Recent attempts to bypass annotation, such as zero-shot vision-language models or concept-generation frameworks, struggle to capture domain-specific medical features, leading to poor reliability. In this paper, we propose a novel Prior-guided Concept Predictor (PCP), a weakly supervised framework that enables concept answer prediction without explicit supervision or reliance on language models. PCP leverages class-level concept priors as weak supervision and incorporates a refinement mechanism with KL divergence and entropy regularization to align predictions with clinical reasoning. Experiments on PH2 (dermoscopy) and WBCatt (hematology) show that PCP improves concept-level F1-score by over 33% compared to zero-shot baselines, while delivering competitive classification performance on four medical datasets (PH2, WBCatt, HAM10000, and CXR4) relative to fully supervised concept bottleneck models (CBMs) and V-IP.

[223] Tracking and Understanding Object Transformations

Yihong Sun, Xinyu Yang, Jennifer J. Sun, Bharath Hariharan

Main category: cs.CV

TL;DR: Zero-shot system TubeletGraph tracks objects through state transformations while detecting and describing state changes, achieving SOTA performance on new VOST-TAS benchmark.

DetailsMotivation: Real-world objects frequently undergo state transformations (e.g., apple being cut, butterfly emerging), but existing tracking methods lose track after significant appearance changes. Need to track objects through transformations while understanding state changes.

Method: TubeletGraph system: identifies potentially overlooked tracks, integrates them based on semantic and proximity priors, reasons about added tracks, and generates a state graph describing each observed transformation.

Result: Achieves state-of-the-art tracking performance under transformations, demonstrates deeper understanding of object transformations, and shows promising capabilities in temporal grounding and semantic reasoning for complex object transformations.

Conclusion: TubeletGraph successfully addresses the limitation of tracking objects through state transformations while detecting and describing state changes, with a new benchmark dataset VOST-TAS enabling further research in this direction.

Abstract: Real-world objects frequently undergo state transformations. From an apple being cut into pieces to a butterfly emerging from its cocoon, tracking through these changes is important for understanding real-world objects and dynamics. However, existing methods often lose track of the target object after transformation, due to significant changes in object appearance. To address this limitation, we introduce the task of Track Any State: tracking objects through transformations while detecting and describing state changes, accompanied by a new benchmark dataset, VOST-TAS. To tackle this problem, we present TubeletGraph, a zero-shot system that recovers missing objects after transformation and maps out how object states are evolving over time. TubeletGraph first identifies potentially overlooked tracks, and determines whether they should be integrated based on semantic and proximity priors. Then, it reasons about the added tracks and generates a state graph describing each observed transformation. TubeletGraph achieves state-of-the-art tracking performance under transformations, while demonstrating deeper understanding of object transformations and promising capabilities in temporal grounding and semantic reasoning for complex object transformations. Code, additional results, and the benchmark dataset are available at https://tubelet-graph.github.io.

[224] Deep Hybrid Model for Region of Interest Detection in Omnidirectional Videos

Sana Alamgeer, Mylene Farias, Marcelo Carvalho

Main category: cs.CV

TL;DR: A hybrid saliency model for predicting regions of interest in 360° videos to optimize streaming efficiency and viewing experience.

DetailsMotivation: ROI prediction is crucial for 360° video streaming to reduce bandwidth usage, predict view-ports for head-mounted devices, and enable intelligent video cuts for better streaming efficiency and viewing quality.

Method: Three-step approach: 1) Preprocess videos to obtain frames, 2) Develop and train a hybrid saliency model to predict ROIs, 3) Post-process model predictions to obtain final ROI outputs for each frame.

Result: Performance is evaluated by comparing the proposed method’s predictions with subjective annotations from the 360RAT dataset.

Conclusion: The hybrid saliency model approach shows promise for ROI prediction in 360° videos, potentially enabling more efficient streaming and better user experience through view-port prediction and intelligent video cutting.

Abstract: The main goal of this project is to design a new model that predicts regions of interest (ROIs) in 360° videos. ROIs play an important role in 360° video streaming: they are used, for example, to predict view-ports and to cut videos intelligently for live streaming so that less bandwidth is consumed. Detecting view-ports in advance reduces head movement while streaming and watching a video on a head-mounted device, while intelligent video cuts improve streaming efficiency and enhance the quality of the viewing experience. This report addresses the task of identifying ROIs, for which we design, train, and test a hybrid saliency model; in this work, saliency regions represent the regions of interest. The method proceeds as follows: preprocessing the video to obtain frames, developing a hybrid saliency model to predict regions of interest, and post-processing the model's output predictions to obtain the final region of interest for each frame. Finally, we compare the performance of the proposed method against the subjective annotations of the 360RAT dataset.

[225] Self-Paced Learning for Images of Antinuclear Antibodies

Yiyang Jiang, Guangwu Qian, Jiaxin Wu, Qi Huang, Qing Li, Yongkang Wu, Xiao-Yong Wei

Main category: cs.CV

TL;DR: A novel framework for automated ANA detection using multi-instance multi-label learning with self-paced learning components achieves state-of-the-art performance on medical datasets.

DetailsMotivation: Manual ANA testing for autoimmune disease diagnosis is slow, labor-intensive, and requires extensive training. Existing automation approaches struggle with the multi-instance, multi-label nature of real-world ANA detection involving over 100 antibody types and complex fluorescent patterns.

Method: Proposes a framework with three task-specific components: 1) instance sampler that suppresses low-confidence instances by modeling pattern confidence, 2) probabilistic pseudo-label dispatcher that adaptively assigns labels based on instance distinguishability, and 3) self-paced weight learning rate coefficients that adjust training based on empirical label observations. The approach works with unaltered microscope images without manual preprocessing.
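
The self-paced component can be illustrated with the classic hard-weighting form: instances whose loss falls below a growing threshold enter training, so easy, confident instances are learned first. This is a generic sketch of self-paced learning, not the paper's exact coefficient scheme.

```python
import torch

# Classic self-paced weighting: instances with loss below a growing
# threshold get weight 1 and enter training; harder instances wait until
# the model matures. The schedule below is an illustrative assumption.

def self_paced_weights(losses: torch.Tensor, lam: float) -> torch.Tensor:
    """v_i = 1 if loss_i < lam else 0 (the closed-form solution of
    min_v  sum_i v_i * loss_i - lam * sum_i v_i,  with v in [0, 1])."""
    return (losses < lam).float()

per_instance_loss = torch.tensor([0.2, 1.5, 0.7, 3.0, 0.4])
lam = 0.5                                    # initial "pace"
for epoch in range(4):
    v = self_paced_weights(per_instance_loss, lam)
    effective = (v * per_instance_loss).sum() / v.sum().clamp(min=1)
    print(f"epoch {epoch}: pace={lam:.1f}, selected={int(v.sum())}, "
          f"weighted loss={effective:.2f}")
    lam *= 1.8                               # gradually admit harder instances
```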

Result: On the ANA dataset, achieves up to +7.0% F1-Macro and +12.6% mAP gains over prior methods. On three public medical MIML benchmarks, ranks top-2 across all key metrics, reducing Hamming loss by up to 18.2% and one-error by up to 26.9%.

Conclusion: The proposed framework effectively handles the complexities of multi-instance multi-label learning for ANA detection, outperforming traditional methods and supporting end-to-end optimization, making it suitable for real-world clinical applications.

Abstract: Antinuclear antibody (ANA) testing is a crucial method for diagnosing autoimmune disorders, including lupus, Sjögren’s syndrome, and scleroderma. Despite its importance, manual ANA detection is slow, labor-intensive, and demands years of training. ANA detection is complicated by over 100 coexisting antibody types, resulting in vast fluorescent pattern combinations. Although machine learning and deep learning have enabled automation, ANA detection in real-world clinical settings presents unique challenges as it involves multi-instance, multi-label (MIML) learning. In this paper, a novel framework for ANA detection is proposed that handles the complexities of MIML tasks using unaltered microscope images without manual preprocessing. Inspired by human labeling logic, it identifies consistent ANA sub-regions and assigns aggregated labels accordingly. These steps are implemented using three task-specific components: an instance sampler, a probabilistic pseudo-label dispatcher, and self-paced weight learning rate coefficients. The instance sampler suppresses low-confidence instances by modeling pattern confidence, while the dispatcher adaptively assigns labels based on instance distinguishability. Self-paced learning adjusts training according to empirical label observations. Our framework overcomes limitations of traditional MIML methods and supports end-to-end optimization. Extensive experiments on one ANA dataset and three public medical MIML benchmarks demonstrate the superiority of our framework. On the ANA dataset, our model achieves up to +7.0% F1-Macro and +12.6% mAP gains over the best prior method, setting new state-of-the-art results. It also ranks top-2 across all key metrics on public datasets, reducing Hamming loss and one-error by up to 18.2% and 26.9%, respectively. The source code can be accessed at https://github.com/fletcherjiang/ANA-SelfPacedLearning.

[226] LC4-DViT: Land-cover Creation for Land-cover Classification with Deformable Vision Transformer

Kai Wang, Siyi Chen, Weicong Pang, Chenchen Zhang, Renjun Gao, Ziru Chen, Cheng Li, Dasa Gu, Rui Huang, Alexis Kai Hon Lau

Main category: cs.CV

TL;DR: LC4-DViT combines generative data creation with a deformation-aware Vision Transformer for improved land-cover classification, achieving high accuracy on aerial imagery datasets.

DetailsMotivation: Timely, accurate land-cover maps are critical for environmental applications, but current remote sensing classification faces challenges with scarce/imbalanced annotations and geometric distortions in high-resolution scenes.

Method: Uses text-guided diffusion pipeline with GPT-4o-generated scene descriptions and super-resolved exemplars to synthesize balanced training data, combined with DViT that couples DCNv4 deformable convolutional backbone with Vision Transformer encoder.

Result: Achieved 0.9572 overall accuracy, 0.9576 macro F1-score, and 0.9510 Cohen’s Kappa on AID dataset, outperforming ViT baseline and other models. Cross-dataset experiments showed good transferability (0.9333 OA). LLM-based judge confirmed DViT’s attention aligns with hydrologically meaningful structures.

Conclusion: Description-driven generative augmentation combined with deformation-aware transformers is a promising approach for high-resolution land-cover mapping, addressing annotation scarcity and geometric distortion challenges.

Abstract: Land-cover underpins ecosystem services, hydrologic regulation, disaster-risk reduction, and evidence-based land planning; timely, accurate land-cover maps are therefore critical for environmental stewardship. Remote sensing-based land-cover classification offers a scalable route to such maps but is hindered by scarce and imbalanced annotations and by geometric distortions in high-resolution scenes. We propose LC4-DViT (Land-cover Creation for Land-cover Classification with Deformable Vision Transformer), a framework that combines generative data creation with a deformation-aware Vision Transformer. A text-guided diffusion pipeline uses GPT-4o-generated scene descriptions and super-resolved exemplars to synthesize class-balanced, high-fidelity training images, while DViT couples a DCNv4 deformable convolutional backbone with a Vision Transformer encoder to jointly capture fine-scale geometry and global context. On eight classes from the Aerial Image Dataset (AID): Beach, Bridge, Desert, Forest, Mountain, Pond, Port, and River, DViT achieves 0.9572 overall accuracy, 0.9576 macro F1-score, and 0.9510 Cohen’s Kappa, improving over a vanilla ViT baseline (0.9274 OA, 0.9300 macro F1, 0.9169 Kappa) and outperforming ResNet50, MobileNetV2, and FlashInternImage. Cross-dataset experiments on a three-class SIRI-WHU subset (Harbor, Pond, River) yield 0.9333 overall accuracy, 0.9316 macro F1, and 0.8989 Kappa, indicating good transferability. An LLM-based judge using GPT-4o to score Grad-CAM heatmaps further shows that DViT’s attention aligns best with hydrologically meaningful structures. These results suggest that description-driven generative augmentation combined with deformation-aware transformers is a promising approach for high-resolution land-cover mapping.

[227] DEAR: Dataset for Evaluating the Aesthetics of Rendering

Vsevolod Plohotnuk, Artyom Panshin, Nikola Banić, Simone Bianco, Michael Freeman, Egor Ershov

Main category: cs.CV

TL;DR: DEAR is a novel benchmark dataset for evaluating image rendering aesthetics using human preference scores, addressing the gap in traditional distortion-based image quality assessment.

Motivation: Traditional IQA focuses on technical degradations (noise, blur, compression) but fails to capture aesthetic judgments of rendering styles, which are crucial for photographic editing, content creation, and AI-generated imagery. There's a lack of datasets reflecting subjective style preferences.

Method: Built upon MIT-Adobe FiveK dataset, DEAR incorporates pairwise human preference scores collected via large-scale crowdsourcing (25 evaluators per image pair, total 13,648 participants). The dataset captures nuanced, context-sensitive aesthetic preferences for rendering styles.
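
The summary does not say how DEAR aggregates its pairwise votes, but a standard way to turn such preferences into per-rendering scores is a Bradley-Terry fit; the sketch below (Zermelo's iterative algorithm, with made-up vote counts) is illustrative only, not the authors' pipeline.

```python
import numpy as np

def bradley_terry(wins: np.ndarray, iters: int = 200) -> np.ndarray:
    """wins[i, j] = number of times rendering i was preferred over j.
    Returns one latent quality score per rendering (Zermelo updates)."""
    n = wins.shape[0]
    p = np.ones(n)
    for _ in range(iters):
        for i in range(n):
            num = wins[i].sum()
            den = sum((wins[i, j] + wins[j, i]) / (p[i] + p[j])
                      for j in range(n) if j != i)
            p[i] = num / den if den > 0 else p[i]
        p /= p.sum()
    return p

# Three candidate renderings of the same photo, 25 votes per pair.
wins = np.array([[0, 18, 22], [7, 0, 15], [3, 10, 0]], float)
print(bradley_terry(wins))  # highest score = most preferred style
```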

Result: Created the first systematic dataset for image aesthetics of rendering assessment grounded in subjective human preferences. Published subset of 100 images with markup on HuggingFace. Enables new task: Evaluation of Aesthetics of Rendering (EAR).

Conclusion: DEAR fills a critical gap in aesthetic evaluation by providing a benchmark for modeling human aesthetic judgments of rendering styles, enabling development of models that go beyond traditional distortion-based IQA for applications in style preference prediction, aesthetic benchmarking, and personalized aesthetic modeling.

Abstract: Traditional Image Quality Assessment (IQA) focuses on quantifying technical degradations such as noise, blur, or compression artifacts, using both full-reference and no-reference objective metrics. However, evaluation of rendering aesthetics, a growing domain relevant to photographic editing, content creation, and AI-generated imagery, remains underexplored due to the lack of datasets that reflect the inherently subjective nature of style preference. In this work, a novel benchmark dataset designed to model human aesthetic judgments of image rendering styles is introduced: the Dataset for Evaluating the Aesthetics of Rendering (DEAR). Built upon the MIT-Adobe FiveK dataset, DEAR incorporates pairwise human preference scores collected via large-scale crowdsourcing, with each image pair evaluated by 25 distinct human evaluators and 13,648 evaluators participating overall. These annotations capture nuanced, context-sensitive aesthetic preferences, enabling the development and evaluation of models that go beyond traditional distortion-based IQA, focusing on a new task: Evaluation of Aesthetics of Rendering (EAR). The data collection pipeline is described, human voting patterns are analyzed, and multiple use cases are outlined, including style preference prediction, aesthetic benchmarking, and personalized aesthetic modeling. To the best of the authors’ knowledge, DEAR is the first dataset to systematically address image aesthetics of rendering assessment grounded in subjective human preferences. A subset of 100 images with markup for them is published on HuggingFace (huggingface.co/datasets/vsevolodpl/DEAR).

[228] A Multi-Mode Structured Light 3D Imaging System with Multi-Source Information Fusion for Underwater Pipeline Detection

Qinghan Hu, Haijiang Zhu, Na Sun, Lei Chen, Zhengqiang Fan, Zhiqing Li

Main category: cs.CV

TL;DR: Developed a multi-mode underwater structured light 3D imaging system (UW-SLD) for pipeline corrosion detection using multi-source information fusion and adaptive algorithms for robust performance in challenging underwater environments.

Motivation: Underwater pipelines are highly susceptible to corrosion, posing safety risks and shortening service life. Manual inspection is unreliable, so intelligent real-time imaging systems are needed. Structured light 3D imaging can provide sufficient spatial detail for precise defect characterization.

Method: Developed UW-SLD system with: 1) Rapid distortion correction (FDC) for underwater image rectification, 2) Factor graph-based parameter optimization for sensor calibration, 3) Multi-mode 3D imaging strategy for pipeline geometry variability, 4) Multi-source information fusion and adaptive extended Kalman filter (AEKF) for stable pose estimation, 5) Edge detection-based ICP (ED-ICP) algorithm combining pipeline edge detection network with enhanced point cloud registration.
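
As a hedged illustration of component 4, the following toy filter shows one common adaptive-EKF recipe, rescaling the measurement-noise covariance from the innovation sequence; the state model, constants, and adaptation rule are assumptions, not the UW-SLD design.

```python
import numpy as np

class AdaptiveEKF:
    """Toy 1D constant-velocity filter whose measurement noise R is
    re-estimated from the innovation sequence (one common AEKF recipe)."""
    def __init__(self, dt=0.1):
        self.x = np.zeros(2)                     # [position, velocity]
        self.P = np.eye(2)
        self.F = np.array([[1, dt], [0, 1]])
        self.Q = 1e-3 * np.eye(2)
        self.H = np.array([[1.0, 0.0]])
        self.R = np.array([[0.05]])
        self.alpha = 0.95                        # forgetting factor

    def step(self, z):
        self.x = self.F @ self.x                 # predict
        self.P = self.F @ self.P @ self.F.T + self.Q
        y = z - self.H @ self.x                  # innovation
        # Adapt R toward the empirically observed innovation power.
        self.R = self.alpha * self.R + (1 - self.alpha) * (
            np.outer(y, y) - self.H @ self.P @ self.H.T)
        self.R = np.maximum(self.R, 1e-6)
        S = self.H @ self.P @ self.H.T + self.R
        K = self.P @ self.H.T @ np.linalg.inv(S)
        self.x = self.x + K @ y                  # update
        self.P = (np.eye(2) - K @ self.H) @ self.P
        return self.x

ekf = AdaptiveEKF()
for z in [0.10, 0.22, 0.35, 0.90]:               # last reading is noisy
    print(ekf.step(np.array([z])))
```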

Result: Extensive experiments under different operation modes, velocities, and depths demonstrate superior accuracy, adaptability and robustness. The system provides a solid foundation for autonomous underwater pipeline detection with high-fidelity reconstruction of defect structures even under variable motion conditions.

Conclusion: The developed multi-mode underwater structured light 3D imaging system successfully addresses challenges in underwater pipeline inspection through innovative sensor fusion, adaptive filtering, and enhanced registration algorithms, enabling reliable and precise defect characterization for autonomous detection applications.

Abstract: Underwater pipelines are highly susceptible to corrosion, which not only shortens their service life but also poses significant safety risks. Compared with manual inspection, intelligent real-time imaging systems for underwater pipeline detection have become a more reliable and practical solution. Among various underwater imaging techniques, structured light 3D imaging can restore sufficient spatial detail for precise defect characterization. Therefore, this paper develops a multi-mode underwater structured light 3D imaging system for pipeline detection (UW-SLD system) based on multi-source information fusion. First, a rapid distortion correction (FDC) method is employed for efficient underwater image rectification. To overcome the challenges of extrinsic calibration among underwater sensors, a factor graph-based parameter optimization method is proposed to estimate the transformation matrix between the structured light and acoustic sensors. Furthermore, a multi-mode 3D imaging strategy is introduced to adapt to the geometric variability of underwater pipelines. Given the presence of numerous disturbances in underwater environments, a multi-source information fusion strategy and an adaptive extended Kalman filter (AEKF) are designed to ensure stable pose estimation and high-accuracy measurements. In particular, an edge detection-based ICP (ED-ICP) algorithm is proposed. This algorithm integrates a pipeline edge detection network with enhanced point cloud registration to achieve robust and high-fidelity reconstruction of defect structures even under variable motion conditions. Extensive experiments are conducted under different operation modes, velocities, and depths. The results demonstrate that the developed system achieves superior accuracy, adaptability and robustness, providing a solid foundation for autonomous underwater pipeline detection.

[229] JoyAvatar-Flash: Real-time and Infinite Audio-Driven Avatar Generation with Autoregressive Diffusion

Chaochao Li, Ruikui Wang, Liangbo Zhou, Jinheng Feng, Huaishao Luo, Huan Zhang, Youzheng Wu, Xiaodong He

Main category: cs.CV

TL;DR: JoyAvatar-Flash: A real-time audio-driven avatar generation model that enables infinite-length video synthesis with improved temporal coherence and reduced error accumulation.

Motivation: Existing DiT-based audio-driven avatar methods have high computational overhead and cannot synthesize long videos. Autoregressive methods address length but suffer from error accumulation and quality degradation.

Method: Three key techniques: 1) Progressive Step Bootstrapping (PSB) allocates more denoising steps to initial frames for stability; 2) Motion Condition Injection (MCI) uses noise-corrupted previous frames as motion condition for temporal coherence; 3) Unbounded RoPE via Cache-Resetting (URCR) enables infinite-length generation with dynamic positional encoding.
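
A minimal sketch of what a PSB-style step schedule might look like, assuming a fixed per-block budget that decays after a warm-up of initial blocks (the actual allocation rule is not specified in the summary):

```python
def psb_step_schedule(num_blocks: int, max_steps: int = 16,
                      min_steps: int = 4, warmup_blocks: int = 4):
    """Progressive Step Bootstrapping, schematically: early blocks get
    the full step budget, later blocks decay to a cheap steady state."""
    steps = []
    for b in range(num_blocks):
        if b < warmup_blocks:
            steps.append(max_steps)
        else:
            decay = (b - warmup_blocks) / max(num_blocks - warmup_blocks, 1)
            steps.append(max(min_steps, round(max_steps * (1 - decay))))
    return steps

print(psb_step_schedule(12))  # [16, 16, 16, 16, 16, 14, 12, 10, 8, 6, 4, 4]
```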

Result: The 1.3B-parameter causal model achieves 16 FPS on a single GPU with competitive results in visual quality, temporal consistency, and lip synchronization.

Conclusion: JoyAvatar-Flash successfully addresses limitations of existing methods by enabling real-time inference and infinite-length video generation while maintaining high quality and temporal coherence.

Abstract: Existing DiT-based audio-driven avatar generation methods have achieved considerable progress, yet their broader application is constrained by limitations such as high computational overhead and the inability to synthesize long-duration videos. Autoregressive methods address this problem by applying block-wise autoregressive diffusion methods. However, these methods suffer from the problem of error accumulation and quality degradation. To address this, we propose JoyAvatar-Flash, an audio-driven autoregressive model capable of real-time inference and infinite-length video generation with the following contributions: (1) Progressive Step Bootstrapping (PSB), which allocates more denoising steps to initial frames to stabilize generation and reduce error accumulation; (2) Motion Condition Injection (MCI), enhancing temporal coherence by injecting noise-corrupted previous frames as motion condition; and (3) Unbounded RoPE via Cache-Resetting (URCR), enabling infinite-length generation through dynamic positional encoding. Our 1.3B-parameter causal model achieves 16 FPS on a single GPU and achieves competitive results in visual quality, temporal consistency, and lip synchronization.

[230] SPOT!: Map-Guided LLM Agent for Unsupervised Multi-CCTV Dynamic Object Tracking

Yujin Roh, Inho Jake Park, Chigon Hwang

Main category: cs.CV

TL;DR: SPOT is a map-guided LLM agent that predicts vehicle trajectories across CCTV blind spots using spatial reasoning without prior training, maintaining continuous tracking in multi-camera environments.

Motivation: CCTV-based vehicle tracking systems suffer from blind spots between cameras and limited fields of view, causing object ID switching, trajectory loss, and reduced reliability of real-time path prediction in multi-camera environments.

Method: SPOT represents road structures (Waypoints) and CCTV placement as documents based on 2D spatial coordinates, uses chunking for real-time querying, transforms vehicle positions to world coordinates using relative position and FOV information, and performs beam search at intersections combining map spatial information with vehicle movement patterns to predict next CCTV locations.
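
Schematically, the intersection-level beam search could be organized as below; the waypoint graph, scoring function, and the naming convention for CCTV-covered nodes are all invented for illustration.

```python
import heapq

def beam_search_next_cctv(graph, scores, start, beam_width=3, max_depth=4):
    """graph: waypoint -> list of neighbor waypoints;
    scores(path): higher = more consistent with the vehicle's motion;
    returns candidate paths that end at a CCTV-covered waypoint."""
    beams = [(-scores([start]), [start])]
    results = []
    for _ in range(max_depth):
        candidates = []
        for _neg, path in beams:
            for nxt in graph.get(path[-1], []):
                new_path = path + [nxt]
                candidates.append((-scores(new_path), new_path))
        beams = heapq.nsmallest(beam_width, candidates)   # keep top-k paths
        results.extend(p for s, p in beams if p[-1].startswith("cctv"))
    return results

graph = {"wp1": ["wp2", "wp3"], "wp2": ["cctv_A"], "wp3": ["cctv_B"]}
speed_bias = {"cctv_A": 0.9, "cctv_B": 0.4}   # hypothetical motion prior
score = lambda p: speed_bias.get(p[-1], 0.5)
print(beam_search_next_cctv(graph, score, "wp1"))
```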

Result: Experimental results using CARLA simulator in a virtual city environment show SPOT accurately predicts next appearing CCTV in blind spot sections and maintains continuous vehicle trajectories more effectively than existing techniques.

Conclusion: SPOT successfully addresses the blind spot problem in multi-CCTV vehicle tracking through map-guided spatial reasoning without requiring prior training, enabling reliable continuous trajectory tracking across camera boundaries.

Abstract: CCTV-based vehicle tracking systems face structural limitations in continuously connecting the trajectories of the same vehicle across multiple camera environments. In particular, blind spots occur due to the intervals between CCTVs and limited Fields of View (FOV), which leads to object ID switching and trajectory loss, thereby reducing the reliability of real-time path prediction. This paper proposes SPOT (Spatial Prediction Over Trajectories), a map-guided LLM agent capable of tracking vehicles even in blind spots of multi-CCTV environments without prior training. The proposed method represents road structures (Waypoints) and CCTV placement information as documents based on 2D spatial coordinates and organizes them through chunking techniques to enable real-time querying and inference. Furthermore, it transforms the vehicle’s position into the actual world coordinate system using the relative position and FOV information of objects observed in CCTV images. By combining map spatial information with the vehicle’s moving direction, speed, and driving patterns, a beam search is performed at the intersection level to derive candidate CCTV locations where the vehicle is most likely to enter after the blind spot. Experimental results based on the CARLA simulator in a virtual city environment confirmed that the proposed method accurately predicts the next appearing CCTV even in blind spot sections, maintaining continuous vehicle trajectories more effectively than existing techniques.

[231] Scaling Remote Sensing Foundation Models: Data Domain Tradeoffs at the Peta-Scale

Charith Wickrema, Eliza Mace, Hunter Brown, Heidys Cabrera, Nick Krall, Matthew O’Neill, Shivangi Sarkar, Lowell Weissman, Eric Hughes, Guido Zarrella

Main category: cs.CV

TL;DR: The paper explores scaling laws for training foundation models on massive-scale electro-optical satellite data, finding that performance remains data-limited even at peta-scale training, with implications for remote sensing AI development.

Motivation: Current scaling laws for AI models are well-established for natural images with abundant internet data, but poorly understood for high-value domains like remote sensing where specialized encoders are needed for multimodal applications like image captioning, search, and reasoning.

Method: Used over a quadrillion pixels of commercial satellite EO data and MITRE’s Federal AI Sandbox to train progressively larger vision transformer (ViT) backbones, analyzing scaling behaviors at peta-scale and reporting successes and failure modes.
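
To make the "data-limited regime" claim concrete: one way to diagnose such a regime is to hold model size fixed and fit a saturating power law to loss versus dataset size. The sketch below uses entirely synthetic numbers purely to show the procedure, not the paper's measurements.

```python
import numpy as np
from scipy.optimize import curve_fit

def loss_vs_data(D, a, b, c):
    # Saturating power law: in a data-limited regime, loss keeps
    # falling with dataset size D even at fixed model capacity.
    return a * D ** (-b) + c

D = np.array([1e9, 1e10, 1e11, 1e12, 1e13])    # pixels seen (synthetic)
L = np.array([1.20, 0.95, 0.78, 0.66, 0.58])   # eval loss (synthetic)
(a, b, c), _ = curve_fit(loss_vs_data, D, L, p0=(10.0, 0.1, 0.3), maxfev=20000)
print(f"exponent b={b:.3f}, irreducible loss c={c:.3f}")
```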

Result: Even at peta-scale training with massive datasets, performance remained consistent with a data-limited regime rather than a model parameter-limited one, revealing different scaling dynamics than natural image domains.

Conclusion: The practical insights from peta-scale experiments should inform data collection strategies, compute budgets, and optimization schedules for developing frontier-scale remote sensing foundation models, helping bridge domain gaps across additional RS modalities.

Abstract: We explore the scaling behaviors of artificial intelligence to establish practical techniques for training foundation models on high-resolution electro-optical (EO) datasets that exceed the current state-of-the-art scale by orders of magnitude. Modern multimodal machine learning (ML) applications, such as generative artificial intelligence (GenAI) systems for image captioning, search, and reasoning, depend on robust, domain-specialized encoders for non-text modalities. In natural image domains where internet-scale data is plentiful, well-established scaling laws help optimize the joint scaling of model capacity, training compute, and dataset size. Unfortunately, these relationships are much less well understood in high-value domains like remote sensing (RS). Using over a quadrillion pixels of commercial satellite EO data and MITRE’s Federal AI Sandbox, we train progressively larger vision transformer (ViT) backbones, report successes and failure modes observed at peta-scale, and analyze implications for bridging domain gaps across additional RS modalities. We observe that even at this scale, performance is consistent with a data-limited regime rather than a model parameter-limited one. These practical insights are intended to inform data collection strategies, compute budgets, and optimization schedules that advance the future development of frontier-scale RS foundation models.

[232] Beyond the Last Frame: Process-aware Evaluation for Generative Video Reasoning

Yifan Li, Yukai Gu, Yingqian Min, Zikang Liu, Yifan Du, Kun Zhou, Min Yang, Wayne Xin Zhao, Minghui Qiu

Main category: cs.CV

TL;DR: VIPER is a new benchmark for evaluating Chain-of-Frames reasoning in video generation models, introducing process-aware evaluation to detect outcome-hacking where models get right answers through wrong reasoning.

Motivation: Current evaluation frameworks for Generative Video Reasoning rely on single-frame assessments, which can lead to outcome-hacking where models reach correct conclusions through erroneous processes, failing to assess true reasoning capabilities.

Method: Propose VIPER benchmark with 16 tasks across temporal, structural, symbolic, spatial, physics, and planning reasoning domains. Introduce Process-outcome Consistency (POC@r) metric using VLM-as-Judge with hierarchical rubric to evaluate both intermediate step validity and final results.
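
The summary does not give the formal definition of POC@r, but under one plausible reading, where r thresholds the judged process score, the metric could be computed as follows; the field names and threshold semantics here are assumptions, and the paper's definition may differ.

```python
def poc_at_r(samples, r=1.0):
    """Process-outcome Consistency under one plausible reading: a
    generation counts only if its final outcome is correct AND its
    VLM-judged process score reaches threshold r."""
    ok = [s for s in samples
          if s["outcome_correct"] and s["process_score"] >= r]
    return len(ok) / len(samples)

# Outcome-hacking shows up as a gap between outcome-only accuracy
# and process-aware accuracy:
samples = [
    {"outcome_correct": True,  "process_score": 1.0},
    {"outcome_correct": True,  "process_score": 0.4},  # right answer, bad steps
    {"outcome_correct": False, "process_score": 0.8},
]
print(sum(s["outcome_correct"] for s in samples) / 3)  # 0.67 outcome-only
print(poc_at_r(samples, r=1.0))                        # 0.33 process-aware
```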

Result: State-of-the-art video models achieve only about 20% POC@1.0 and exhibit significant outcome-hacking. Test-time scaling and sampling robustness experiments reveal substantial gap between current video generation and true generalized visual reasoning.

Conclusion: VIPER provides a comprehensive process-aware evaluation framework that reveals limitations in current video generation models’ reasoning capabilities, highlighting the need for improved evaluation methods and model development for true generalized visual reasoning.

Abstract: Recent breakthroughs in video generation have demonstrated an emerging capability termed Chain-of-Frames (CoF) reasoning, where models resolve complex tasks through the generation of continuous frames. While these models show promise for Generative Video Reasoning (GVR), existing evaluation frameworks often rely on single-frame assessments, which can lead to outcome-hacking, where a model reaches a correct conclusion through an erroneous process. To address this, we propose a process-aware evaluation paradigm. We introduce VIPER, a comprehensive benchmark spanning 16 tasks across temporal, structural, symbolic, spatial, physics, and planning reasoning. Furthermore, we propose Process-outcome Consistency (POC@r), a new metric that utilizes VLM-as-Judge with a hierarchical rubric to evaluate both the validity of the intermediate steps and the final result. Our experiments reveal that state-of-the-art video models achieve only about 20% POC@1.0 and exhibit significant outcome-hacking. We further explore the impact of test-time scaling and sampling robustness, highlighting a substantial gap between current video generation and true generalized visual reasoning. Our benchmark is released at https://github.com/RUCAIBox/VIPER.

[233] VideoAR: Autoregressive Video Generation via Next-Frame & Scale Prediction

Longbin Ji, Xiaoxiong Liu, Junyuan Shang, Shuohuan Wang, Yu Sun, Hua Wu, Haifeng Wang

Main category: cs.CV

TL;DR: VideoAR is the first large-scale Visual Autoregressive framework for video generation that combines multi-scale next-frame prediction with autoregressive modeling, achieving state-of-the-art results while being 10x faster than diffusion models.

Motivation: Current video generation models (diffusion and flow-matching) are computationally intensive and difficult to scale. There's a need for more efficient and scalable approaches that can maintain high-quality results while improving inference speed.

Method: VideoAR integrates intra-frame VAR modeling with causal next-frame prediction using a 3D multi-scale tokenizer. It introduces Multi-scale Temporal RoPE, Cross-Frame Error Correction, and Random Frame Mask to improve long-term consistency. Uses multi-stage pretraining pipeline for progressive spatial-temporal alignment.
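
As a loose sketch of the cache-resetting idea behind URCR, assuming rotary position ids simply restart whenever the KV cache is flushed (the actual reset policy is not described in the summary):

```python
def urcr_positions(num_frames: int, window: int = 32):
    """Schematic take on cache-resetting RoPE: whenever the KV cache is
    flushed (every `window` frames here), rotary position ids restart,
    so positions never grow without bound during infinite generation."""
    return [t % window for t in range(num_frames)]

print(urcr_positions(70, window=32))  # 0..31, 0..31, then 0..5
```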

Result: Achieves SOTA among autoregressive models: improves FVD on UCF-101 from 99.5 to 88.6, reduces inference steps by over 10x, reaches VBench score of 81.74 (competitive with much larger diffusion models).

Conclusion: VideoAR narrows the performance gap between autoregressive and diffusion paradigms, offering a scalable, efficient, and temporally consistent foundation for future video generation research.

Abstract: Recent advances in video generation have been dominated by diffusion and flow-matching models, which produce high-quality results but remain computationally intensive and difficult to scale. In this work, we introduce VideoAR, the first large-scale Visual Autoregressive (VAR) framework for video generation that combines multi-scale next-frame prediction with autoregressive modeling. VideoAR disentangles spatial and temporal dependencies by integrating intra-frame VAR modeling with causal next-frame prediction, supported by a 3D multi-scale tokenizer that efficiently encodes spatio-temporal dynamics. To improve long-term consistency, we propose Multi-scale Temporal RoPE, Cross-Frame Error Correction, and Random Frame Mask, which collectively mitigate error propagation and stabilize temporal coherence. Our multi-stage pretraining pipeline progressively aligns spatial and temporal learning across increasing resolutions and durations. Empirically, VideoAR achieves new state-of-the-art results among autoregressive models, improving FVD on UCF-101 from 99.5 to 88.6 while reducing inference steps by over 10x, and reaching a VBench score of 81.74, competitive with diffusion-based models an order of magnitude larger. These results demonstrate that VideoAR narrows the performance gap between autoregressive and diffusion paradigms, offering a scalable, efficient, and temporally consistent foundation for future video generation research.

[234] GeoMotionGPT: Geometry-Aligned Motion Understanding with Large Language Models

Zhankai Ye, Bofan Li, Yukai Jin, Shuoqiu Li, Wei Wang, Yanfu Zhang, Shangqian Gao, Xin Liu

Main category: cs.CV

TL;DR: A novel framework that aligns motion quantization with LLM embeddings through orthogonality constraints, achieving 20% performance improvement on motion reasoning tasks.

Motivation: Existing motion tokenization pipelines decouple motion quantization from semantic embedding learning, linking them only via token IDs. This fails to align the intrinsic geometry of motion space with embedding space, hindering LLMs' capacity for nuanced motion reasoning.

Method: 1) Uses decoder-only quantizer with Gumbel-Softmax for differentiable training and balanced codebook usage; 2) Employs sparse projection to map motion codes into LLM embedding space while preserving orthogonality; 3) Implements two-stage orthonormal regularization schedule for geometric alignment during tokenizer training and LLM fine-tuning.
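
Two ingredients of the method are standard enough to sketch: a soft orthogonality penalty on the codebook's Gram matrix, and differentiable code selection via Gumbel-Softmax. The shapes and loss weight below are invented, and this is not the paper's training code.

```python
import torch
import torch.nn.functional as F

def orthonormal_penalty(W: torch.Tensor) -> torch.Tensor:
    # Penalize deviation of the Gram matrix from identity so that
    # codebook rows stay near-orthonormal.
    G = W @ W.T
    return ((G - torch.eye(W.shape[0], device=W.device)) ** 2).sum()

codebook = torch.randn(64, 256, requires_grad=True)   # 64 codes in 256-d
logits = torch.randn(4, 64, requires_grad=True)       # 4 motion tokens
# Gumbel-Softmax: one-hot codes in the forward pass, soft gradients back.
codes = F.gumbel_softmax(logits, tau=1.0, hard=True)
quantized = codes @ codebook                          # (4, 256) embeddings
loss = quantized.pow(2).mean() + 1e-3 * orthonormal_penalty(codebook)
loss.backward()
```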

Result: Achieves 20% performance improvement over current state-of-the-art methods on HumanML3D dataset, validating that unified geometric basis effectively empowers LLMs for nuanced motion reasoning.

Conclusion: Alignment between motion quantization and LLM embeddings is most effective when both modalities share a unified geometric basis, achieved through explicit orthogonality constraints on both motion codebook and LLM embedding space.

Abstract: Discrete motion tokenization has recently enabled Large Language Models (LLMs) to serve as versatile backbones for motion understanding and motion-language reasoning. However, existing pipelines typically decouple motion quantization from semantic embedding learning, linking them solely via token IDs. This approach fails to effectively align the intrinsic geometry of the motion space with the embedding space, thereby hindering the LLM’s capacity for nuanced motion reasoning. We argue that alignment is most effective when both modalities share a unified geometric basis. Therefore, instead of forcing the LLM to reconstruct the complex geometry among motion tokens from scratch, we present a novel framework that explicitly enforces orthogonality on both the motion codebook and the LLM embedding space, ensuring that their relational structures naturally mirror each other. Specifically, we employ a decoder-only quantizer with Gumbel-Softmax for differentiable training and balanced codebook usage. To bridge the modalities, we use a sparse projection that maps motion codes into the LLM embedding space while preserving orthogonality. Finally, a two-stage orthonormal regularization schedule enforces soft constraints during tokenizer training and LLM fine-tuning to maintain geometric alignment without hindering semantic adaptation. Extensive experiments on HumanML3D demonstrate that our framework achieves a 20% performance improvement over current state-of-the-art methods, validating that a unified geometric basis effectively empowers the LLM for nuanced motion reasoning.

[235] Hidden Monotonicity: Explaining Deep Neural Networks via their DC Decomposition

Jakob Paul Zimmermann, Georg Loho

Main category: cs.CV

TL;DR: Monotonicity boosts explainability in neural networks through two approaches: 1) decomposing trained ReLU networks into monotone convex parts to create improved saliency methods, and 2) training models as differences between monotone networks for self-explainability.

Motivation: While monotonicity improves explainability in neural networks, not all functions can be approximated by monotone networks. The paper aims to leverage monotonicity in alternative ways to enhance explainability despite this limitation.

Method: Two approaches: 1) Adaptation of decomposition of trained ReLU networks into two monotone and convex parts, overcoming numerical weight blowup issues. 2) Training models as the difference between two monotone neural networks.
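
The first approach can be illustrated on a one-hidden-layer ReLU network: splitting the output weights into positive and negative parts yields f = g - h where g and h are convex (nonnegative combinations of ReLUs). The paper's decomposition for deep networks, and its handling of the weight blowup, is more involved than this sketch.

```python
import torch
from torch import nn

def dc_split(fc1: nn.Linear, fc2: nn.Linear):
    """Difference-of-convex view of a one-hidden-layer ReLU network:
    f(x) = g(x) - h(x), with g, h convex since each is a nonnegative
    combination of the (convex) ReLU hidden units."""
    W2, b2 = fc2.weight, fc2.bias
    hidden = lambda x: torch.relu(fc1(x))
    g = lambda x: hidden(x) @ W2.clamp(min=0).T + b2.clamp(min=0)
    h = lambda x: hidden(x) @ (-W2).clamp(min=0).T + (-b2).clamp(min=0)
    return g, h

fc1, fc2 = nn.Linear(8, 16), nn.Linear(16, 1)
g, h = dc_split(fc1, fc2)
x = torch.randn(5, 8)
f = fc2(torch.relu(fc1(x)))
assert torch.allclose(f, g(x) - h(x), atol=1e-6)   # exact decomposition
```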

Result: Proposed saliency methods (SplitCAM and SplitLRP) outperform state-of-the-art on VGG16 and Resnet18 networks across all Quantus saliency metric categories on ImageNet-S. Models trained as differences between monotone networks exhibit strong self-explainability properties.

Conclusion: Monotonicity can be effectively leveraged to enhance explainability through decomposition techniques and architectural design, even when functions cannot be directly approximated by monotone networks.

Abstract: It has been demonstrated in various contexts that monotonicity leads to better explainability in neural networks. However, not every function can be well approximated by a monotone neural network. We demonstrate that monotonicity can still be used in two ways to boost explainability. First, we use an adaptation of the decomposition of a trained ReLU network into two monotone and convex parts, thereby overcoming numerical obstacles from an inherent blowup of the weights in this procedure. Our proposed saliency methods, SplitCAM and SplitLRP, improve on state-of-the-art results on both VGG16 and Resnet18 networks on ImageNet-S across all Quantus saliency metric categories. Second, we exhibit that training a model as the difference between two monotone neural networks results in a system with strong self-explainability properties.

[236] MHLA: Restoring Expressivity of Linear Attention via Token-Level Multi-Head

Kewei Zhang, Ye Huang, Yufan Deng, Jincheng Yu, Junsong Chen, Huan Ling, Enze Xie, Daquan Zhou

Main category: cs.CV

TL;DR: MHLA addresses global context collapse in linear attention by computing attention within divided heads along token dimension, maintaining linear complexity while recovering softmax attention’s expressive power.

Motivation: Transformers have quadratic self-attention complexity that limits large-scale applications. Linear attention offers efficiency but degrades performance, with existing fixes reintroducing computational overhead through extra modules that defeat the original purpose.

Method: Propose Multi-Head Linear Attention (MHLA) which preserves representational diversity by computing attention within divided heads along the token dimension. This maintains linear complexity while recovering expressive power of softmax attention.
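
A minimal sketch of linear attention with heads carved along the token dimension, assuming an elu+1 feature map and evenly divisible group sizes; this illustrates the mechanism, not the paper's exact layer.

```python
import torch

def mhla(q, k, v, token_heads=4, eps=1e-6):
    """Token-level multi-head linear attention, schematically: tokens
    are split into groups along the sequence dimension and linear
    attention runs inside each group, preserving local diversity
    while keeping O(N) cost."""
    phi = lambda x: torch.nn.functional.elu(x) + 1      # positive feature map
    B, N, D = q.shape
    q, k, v = (t.reshape(B, token_heads, N // token_heads, D)
               for t in (phi(q), phi(k), v))
    kv = torch.einsum("bhnd,bhne->bhde", k, v)          # per-group summary
    z = 1.0 / (torch.einsum("bhnd,bhd->bhn", q, k.sum(2)) + eps)
    out = torch.einsum("bhnd,bhde,bhn->bhne", q, kv, z)
    return out.reshape(B, N, D)

out = mhla(torch.randn(2, 64, 32), torch.randn(2, 64, 32), torch.randn(2, 64, 32))
```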

Result: Achieved significant improvements across multiple domains: 3.6% improvement on ImageNet classification, 6.3% gain on NLP tasks, 12.6% improvement on image generation, and 41% enhancement on video generation under the same time complexity.

Conclusion: MHLA effectively addresses the global context collapse problem in linear attention methods, maintaining linear complexity while achieving performance comparable to or better than softmax attention across diverse domains.

Abstract: While the Transformer architecture dominates many fields, its quadratic self-attention complexity hinders its use in large-scale applications. Linear attention offers an efficient alternative, but its direct application often degrades performance, with existing fixes typically re-introducing computational overhead through extra modules (e.g., depthwise separable convolution) that defeat the original purpose. In this work, we identify a key failure mode in these methods: global context collapse, where the model loses representational diversity. To address this, we propose Multi-Head Linear Attention (MHLA), which preserves this diversity by computing attention within divided heads along the token dimension. We prove that MHLA maintains linear complexity while recovering much of the expressive power of softmax attention, and verify its effectiveness across multiple domains, achieving a 3.6% improvement on ImageNet classification, a 6.3% gain on NLP, a 12.6% improvement on image generation, and a 41% enhancement on video generation under the same time complexity.

[237] Tuning-free Visual Effect Transfer across Videos

Maxwell Jones, Rameen Abdal, Or Patashnik, Ruslan Salakhutdinov, Sergey Tulyakov, Jun-Yan Zhu, Kuan-Chieh Jackson Wang

Main category: cs.CV

TL;DR: RefVFX is a framework for transferring complex temporal effects from reference videos to target videos/images in a feed-forward manner, addressing limitations of text-based editing methods.

Motivation: Existing methods struggle with dynamic temporal effects like lighting changes or character transformations that are difficult to describe via text or static conditions. Transferring video effects requires integrating new temporal dynamics with the input's existing motion and appearance.

Method: Created a large-scale dataset of triplets (reference effect video, input image/video, output video) using an automated pipeline for video-to-video effects, augmented with image-to-video effects from LoRA adapters and code-based temporal effects. Trained a reference-conditioned model using recent text-to-video backbones.

Result: RefVFX produces visually consistent and temporally coherent edits, generalizes across unseen effect categories, and outperforms prompt-only baselines in both quantitative metrics and human preference.

Conclusion: The framework successfully enables feed-forward transfer of complex temporal effects from reference videos to target content, overcoming limitations of text-based editing approaches.

Abstract: We present RefVFX, a new framework that transfers complex temporal effects from a reference video onto a target video or image in a feed-forward manner. While existing methods excel at prompt-based or keyframe-conditioned editing, they struggle with dynamic temporal effects such as dynamic lighting changes or character transformations, which are difficult to describe via text or static conditions. Transferring a video effect is challenging, as the model must integrate the new temporal dynamics with the input video’s existing motion and appearance. To address this, we introduce a large-scale dataset of triplets, where each triplet consists of a reference effect video, an input image or video, and a corresponding output video depicting the transferred effect. Creating this data is non-trivial, especially the video-to-video effect triplets, which do not exist naturally. To generate these, we propose a scalable automated pipeline that creates high-quality paired videos designed to preserve the input’s motion and structure while transforming it based on some fixed, repeatable effect. We then augment this data with image-to-video effects derived from LoRA adapters and code-based temporal effects generated through programmatic composition. Building on our new dataset, we train our reference-conditioned model using recent text-to-video backbones. Experimental results demonstrate that RefVFX produces visually consistent and temporally coherent edits, generalizes across unseen effect categories, and outperforms prompt-only baselines in both quantitative metrics and human preference. See our website at https://tuningfreevisualeffects-maker.github.io/Tuning-free-Visual-Effect-Transfer-across-Videos-Project-Page/

[238] GI-Bench: A Panoramic Benchmark Revealing the Knowledge-Experience Dissociation of Multimodal Large Language Models in Gastrointestinal Endoscopy Against Clinical Standards

Yan Zhu, Te Luo, Pei-Yao Fu, Zhen Zhang, Zi-Long Wang, Yi-Fan Qu, Zi-Han Geng, Jia-Qi Xu, Lu Yao, Li-Yun Ma, Wei Su, Wei-Feng Chen, Quan-Lin Li, Shuo Wang, Ping-Hong Zhou

Main category: cs.CV

TL;DR: MLLMs show promise in gastroenterology but have spatial grounding limitations and fluency-accuracy paradox compared to human endoscopists.

Motivation: To systematically evaluate MLLMs across comprehensive gastrointestinal endoscopy workflows and determine their clinical utility compared to human benchmarks.

Method: Created GI-Bench with 20 lesion categories, evaluated 12 MLLMs across 5-stage clinical workflow (localization, identification, diagnosis, description, management), benchmarked against 3 junior endoscopists and 3 trainees using Macro-F1, mIoU, and Likert scales.
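
For reference, the two headline metrics are straightforward to compute; the toy labels and boxes below are invented, not benchmark data.

```python
import numpy as np
from sklearn.metrics import f1_score

def box_iou(a, b):
    """IoU of two [x1, y1, x2, y2] boxes, as used for lesion localization."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / (area(a) + area(b) - inter)

# Macro-F1 weights all 20 lesion classes equally, so rare lesions count
# as much as common ones (toy 3-class labels shown here).
y_true, y_pred = [0, 1, 2, 2, 1], [0, 2, 2, 2, 1]
print(f1_score(y_true, y_pred, average="macro"))
print(box_iou([10, 10, 50, 50], [30, 30, 70, 70]))  # ~0.14
```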

Result: Gemini-3-Pro achieved SOTA; top models outperformed trainees and rivaled junior endoscopists in diagnostic reasoning, but humans significantly outperformed models in lesion localization. Models showed superior linguistic readability but lower factual correctness due to over-interpretation and hallucinations.

Conclusion: MLLMs show promise in gastroenterology but face critical spatial grounding bottlenecks and fluency-accuracy paradox, highlighting need for improved visual grounding and factual accuracy in clinical applications.

Abstract: Multimodal Large Language Models (MLLMs) show promise in gastroenterology, yet their performance against comprehensive clinical workflows and human benchmarks remains unverified. We systematically evaluate state-of-the-art MLLMs across a panoramic gastrointestinal endoscopy workflow and determine their clinical utility compared with human endoscopists. We constructed GI-Bench, a benchmark encompassing 20 fine-grained lesion categories. Twelve MLLMs were evaluated across a five-stage clinical workflow: anatomical localization, lesion identification, diagnosis, findings description, and management. Model performance was benchmarked against three junior endoscopists and three residency trainees using Macro-F1, mean Intersection-over-Union (mIoU), and multi-dimensional Likert scale. Gemini-3-Pro achieved state-of-the-art performance. In diagnostic reasoning, top-tier models (Macro-F1 0.641) outperformed trainees (0.492) and rivaled junior endoscopists (0.727; p>0.05). However, a critical “spatial grounding bottleneck” persisted; human lesion localization (mIoU >0.506) significantly outperformed the best model (0.345; p<0.05). Furthermore, qualitative analysis revealed a “fluency-accuracy paradox”: models generated reports with superior linguistic readability compared with humans (p<0.05) but exhibited significantly lower factual correctness (p<0.05) due to “over-interpretation” and hallucination of visual features. GI-Bench maintains a dynamic leaderboard that tracks the evolving performance of MLLMs in clinical endoscopy. The current rankings and benchmark results are available at https://roterdl.github.io/GIBench/.

[239] MoCha: End-to-End Video Character Replacement without Structural Guidance

Zhengbo Xu, Jie Ma, Ziheng Wang, Zhan Peng, Jun Liang, Jing Li

Main category: cs.CV

TL;DR: MoCha is a novel framework for controllable video character replacement that requires only a single arbitrary frame mask, bypassing limitations of prior reconstruction-based methods that need per-frame masks and explicit structural guidance.

Motivation: Current video character replacement methods rely on reconstruction-based paradigms requiring per-frame segmentation masks and explicit structural guidance (skeleton, depth), which severely limits their generalizability in complex scenarios involving occlusions, character-object interactions, unusual poses, or challenging illumination, leading to visual artifacts and temporal inconsistencies.

Method: MoCha introduces a condition-aware RoPE to adapt multi-modal input conditions and enhance facial identity, employs an RL-based post-training stage, and uses a comprehensive data construction pipeline with three specialized datasets: high-fidelity rendered dataset (UE5), expression-driven dataset (portrait animation), and augmented dataset from existing video-mask pairs.

Result: Extensive experiments demonstrate that MoCha substantially outperforms existing state-of-the-art approaches in video character replacement tasks.

Conclusion: MoCha presents a pioneering framework that overcomes data scarcity and generalization limitations in video character replacement by requiring only a single arbitrary frame mask, with promising results that advance the field and will be released to facilitate further research.

Abstract: Controllable video character replacement with a user-provided identity remains a challenging problem due to the lack of paired video data. Prior works have predominantly relied on a reconstruction-based paradigm that requires per-frame segmentation masks and explicit structural guidance (e.g., skeleton, depth). This reliance, however, severely limits their generalizability in complex scenarios involving occlusions, character-object interactions, unusual poses, or challenging illumination, often leading to visual artifacts and temporal inconsistencies. In this paper, we propose MoCha, a pioneering framework that bypasses these limitations by requiring only a single arbitrary frame mask. To effectively adapt the multi-modal input condition and enhance facial identity, we introduce a condition-aware RoPE and employ an RL-based post-training stage. Furthermore, to overcome the scarcity of qualified paired-training data, we propose a comprehensive data construction pipeline. Specifically, we design three specialized datasets: a high-fidelity rendered dataset built with Unreal Engine 5 (UE5), an expression-driven dataset synthesized by current portrait animation techniques, and an augmented dataset derived from existing video-mask pairs. Extensive experiments demonstrate that our method substantially outperforms existing state-of-the-art approaches. We will release the code to facilitate further research. Please refer to our project page for more details: orange-3dv-team.github.io/MoCha

cs.AI

[240] ConvoLearn: A Dataset of Constructivist Tutor-Student Dialogue

Mayank Sharma, Roy Pea, Hari Subramonyam

Main category: cs.AI

TL;DR: Fine-tuning Mistral 7B on ConvoLearn dataset improves LLM tutoring by shifting behavior toward knowledge-building strategies, outperforming base model and Claude Sonnet in human teacher evaluations.

Motivation: LLMs in education have fundamental pedagogical limitations - they tend to reveal solutions rather than support dialogic learning. There's a need for AI tutors that can engage in constructive, knowledge-building dialogues rather than just providing answers.

Method: Created ConvoLearn dataset grounded in knowledge building theory with six pedagogical dimensions. Constructed semi-synthetic dataset of 1250 tutor-student dialogues (20 turns each) in middle school Earth Science through controlled human teacher-simulated student interactions. Used QLoRA to fine-tune Mistral 7B on this dataset.
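
A minimal QLoRA setup for Mistral 7B might look like the following; the quantization and adapter hyperparameters are assumptions, since the summary does not report the paper's settings.

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

# Hypothetical hyperparameters: the paper uses QLoRA on Mistral 7B, but
# its exact configuration is not given in the summary.
bnb = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-v0.1", quantization_config=bnb, device_map="auto")
lora = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05, task_type="CAUSAL_LM",
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"])
model = get_peft_model(model, lora)
model.print_trainable_parameters()  # only the low-rank adapters train
```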

Result: Human evaluation by 31 teachers shows fine-tuned Mistral 7B (M = 4.10, SD = 1.03) significantly outperforms both its base version (M = 2.59, SD = 1.11) and Claude Sonnet 4.5 (M = 2.87, SD = 1.29) overall. Training on ConvoLearn meaningfully shifts LLM behavior toward knowledge-building strategies.

Conclusion: This work establishes a potential framework to guide future development and evaluation of constructivist AI tutors, demonstrating that targeted fine-tuning on pedagogically-grounded datasets can address fundamental limitations of LLMs in educational applications.

Abstract: In educational applications, LLMs exhibit several fundamental pedagogical limitations, such as their tendency to reveal solutions rather than support dialogic learning. We introduce ConvoLearn (https://huggingface.co/datasets/masharma/convolearn ), a dataset grounded in knowledge building theory that operationalizes six core pedagogical dimensions: cognitive engagement, formative assessment, accountability, cultural responsiveness, metacognition, and power dynamics. We construct a semi-synthetic dataset of 1250 tutor-student dialogues (20 turns each) in middle school Earth Science through controlled interactions between human teachers and a simulated student. Using QLoRA, we demonstrate that training on this dataset meaningfully shifts LLM behavior toward knowledge-building strategies. Human evaluation by 31 teachers shows our fine-tuned Mistral 7B (M = 4.10, SD = 1.03) significantly outperforms both its base version (M = 2.59, SD = 1.11) and Claude Sonnet 4.5 (M = 2.87, SD = 1.29) overall. This work establishes a potential framework to guide future development and evaluation of constructivist AI tutors.

[241] ART: Action-based Reasoning Task Benchmarking for Medical AI Agents

Ananya Mantravadi, Shivali Dalmia, Abhishek Mukherji

Main category: cs.AI

TL;DR: ART is a new benchmark for testing medical AI agents’ action-based reasoning on EHR data, exposing weaknesses in aggregation and threshold evaluation that existing benchmarks miss.

Motivation: Existing benchmarks inadequately assess medical AI agents' performance on action-based tasks involving threshold evaluation, temporal aggregation, and conditional logic - critical capabilities for reliable clinical decision support systems.

Method: Four-stage pipeline: 1) scenario identification from real EHR data, 2) task generation targeting known reasoning weaknesses, 3) quality audit, and 4) evaluation. Created 600 clinically validated tasks grounded in real patient data.
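
To illustrate the task family ART targets, here is a toy aggregation-plus-threshold check over synthetic lab values; the KDIGO-style rule is a textbook simplification, not the benchmark's rubric, and all values are invented.

```python
from datetime import datetime, timedelta

# Toy instance of the task family ART probes: temporal aggregation plus
# a threshold rule over structured EHR rows (all values synthetic).
now = datetime(2024, 1, 3)
creatinine = [  # (timestamp, mg/dL)
    (now - timedelta(hours=60), 1.0),
    (now - timedelta(hours=40), 1.3),
    (now - timedelta(hours=10), 1.6),
]
window = [v for t, v in creatinine if now - t <= timedelta(hours=48)]
# KDIGO-style simplification: flag a rise of >= 0.3 mg/dL within 48 h.
flag = max(window) - min(window) >= 0.3
print(f"values in 48h window: {window}; AKI flag: {flag}")  # [1.3, 1.6]; True
```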

Result: GPT-4o-mini and Claude 3.5 Sonnet showed near-perfect retrieval after prompt refinement, but substantial gaps in aggregation (28-64% error) and threshold reasoning (32-38% error), exposing critical failure modes.

Conclusion: ART advances toward more reliable clinical AI agents by exposing failure modes in action-oriented EHR reasoning, essential for AI systems that reduce cognitive load and support workforce capacity in high-demand care settings.

Abstract: Reliable clinical decision support requires medical AI agents capable of safe, multi-step reasoning over structured electronic health records (EHRs). While large language models (LLMs) show promise in healthcare, existing benchmarks inadequately assess performance on action-based tasks involving threshold evaluation, temporal aggregation, and conditional logic. We introduce ART, an Action-based Reasoning clinical Task benchmark for medical AI agents, which mines real-world EHR data to create challenging tasks targeting known reasoning weaknesses. Through analysis of existing benchmarks, we identify three dominant error categories: retrieval failures, aggregation errors, and conditional logic misjudgments. Our four-stage pipeline – scenario identification, task generation, quality audit, and evaluation – produces diverse, clinically validated tasks grounded in real patient data. Evaluating GPT-4o-mini and Claude 3.5 Sonnet on 600 tasks shows near-perfect retrieval after prompt refinement, but substantial gaps in aggregation (28–64%) and threshold reasoning (32–38%). By exposing failure modes in action-oriented EHR reasoning, ART advances toward more reliable clinical agents, an essential step for AI systems that reduce cognitive load and administrative burden, supporting workforce capacity in high-demand care settings.

[242] The Hierarchy of Agentic Capabilities: Evaluating Frontier Models on Realistic RL Environments

Logan Ritchie, Sushant Mehta, Nick Heiner, Mason Yu, Edwin Chen

Main category: cs.AI

TL;DR: Empirical study shows frontier AI models fail ~40% of workplace tasks in e-commerce environment, revealing hierarchy of agentic capabilities needed for real-world deployment.

Motivation: As LLM-based agents shift AI evaluation from single-turn responses to multi-step task completion in interactive environments, there's a need to understand their real-world capabilities and limitations in workplace settings.

Method: Evaluated frontier AI models on 150 workplace tasks within a realistic e-commerce RL environment from Surge, using task-centric design methodology emphasizing diversity and domain expert contributions.

Result: Best-performing models fail ~40% of tasks, revealing empirically-derived hierarchy of agentic capabilities: tool use, planning/goal formation, adaptability, groundedness, and common-sense reasoning. Weaker models struggle with basic tool use/planning, while stronger models fail on contextual inference tasks.

Conclusion: Current frontier models show coherent multi-step behavior but have substantial capability gaps before achieving human-level task completion in realistic workplace settings, with failures clustering predictably along the identified hierarchy.

Abstract: The advancement of large language model (LLM) based agents has shifted AI evaluation from single-turn response assessment to multi-step task completion in interactive environments. We present an empirical study evaluating frontier AI models on 150 workplace tasks within a realistic e-commerce RL environment from Surge. Our analysis reveals an empirically-derived hierarchy of agentic capabilities that models must master for real-world deployment: (1) tool use, (2) planning and goal formation, (3) adaptability, (4) groundedness, and (5) common-sense reasoning. Even the best-performing models fail approximately 40% of the tasks, with failures clustering predictably along this hierarchy. Weaker models struggle with fundamental tool use and planning, whereas stronger models primarily fail on tasks requiring contextual inference beyond explicit instructions. We introduce a task-centric design methodology for RL environments that emphasizes diversity and domain expert contributions, provide detailed failure analysis, and discuss implications for agent development. Our findings suggest that while current frontier models can demonstrate coherent multi-step behavior, substantial capability gaps remain before achieving human-level task completion in realistic workplace settings.

[243] When Single-Agent with Skills Replace Multi-Agent Systems and When They Fail

Xiaoxiao Li

Main category: cs.AI

TL;DR: Single-agent skill selection can replace multi-agent systems with similar reasoning benefits but lower overhead, but faces capacity limits that degrade sharply beyond a critical library size due to semantic confusability.

Motivation: Multi-agent AI systems are effective for complex reasoning but incur substantial computational overhead from inter-agent communication. The paper explores whether similar modularity benefits can be achieved with a single agent using skill selection, and investigates how this approach scales as skill libraries grow.

Method: The paper views skills as internalized agent behaviors and compiles multi-agent systems into equivalent single-agent systems that trade inter-agent communication for skill selection. It investigates scaling behavior of skill selection, analyzing how accuracy degrades with library size and semantic confusability. The study also explores hierarchical organization as a solution, drawing on cognitive science principles of bounded capacity in human decision-making.
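
A minimal sketch of the hierarchical routing idea, assuming skills and categories are represented by embedding vectors (random here, where a real system would use a text encoder): selection happens in two small steps instead of one large one, keeping each decision's candidate set below the capacity limit.

```python
import numpy as np

def route(query_vec, categories):
    """Two-level skill selection: pick a category by embedding
    similarity, then the best skill inside it."""
    best_cat = max(categories,
                   key=lambda c: query_vec @ categories[c]["centroid"])
    skills = categories[best_cat]["skills"]
    best_skill = max(skills, key=lambda s: query_vec @ skills[s])
    return best_cat, best_skill

rng = np.random.default_rng(0)
emb = lambda: rng.normal(size=64) / 8          # mock 64-d embeddings
categories = {
    "math": {"centroid": emb(),
             "skills": {"algebra": emb(), "calculus": emb()}},
    "code": {"centroid": emb(),
             "skills": {"debugging": emb(), "refactor": emb()}},
}
print(route(emb(), categories))
```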

Result: Preliminary experiments show single-agent skill selection substantially reduces token usage and latency while maintaining competitive accuracy on reasoning benchmarks. However, skill selection exhibits a phase transition: accuracy remains stable up to a critical library size, then drops sharply. Semantic confusability among similar skills, rather than library size alone, drives this degradation. Hierarchical routing shows promise for managing these capacity limits.

Conclusion: Skill selection offers efficiency benefits over multi-agent systems but faces fundamental capacity limits analogous to human cognition. Hierarchical organization, inspired by human cognitive strategies, can help manage these limits. The work establishes a cognitive-grounded framework for designing scalable skill-based agents and raises new questions about semantic-based skill selection limits in LLMs.

Abstract: Multi-agent AI systems have proven effective for complex reasoning. These systems are compounded by specialized agents, which collaborate through explicit communication, but incur substantial computational overhead. A natural question arises: can we achieve similar modularity benefits with a single agent that selects from a library of skills? We explore this question by viewing skills as internalized agent behaviors. From this perspective, a multi-agent system can be compiled into an equivalent single-agent system, trading inter-agent communication for skill selection. Our preliminary experiments suggest this approach can substantially reduce token usage and latency while maintaining competitive accuracy on reasoning benchmarks. However, this efficiency raises a deeper question that has received little attention: how does skill selection scale as libraries grow? Drawing on principles from cognitive science, we propose that LLM skill selection exhibits bounded capacity analogous to human decision-making. We investigate the scaling behavior of skill selection and observe a striking pattern. Rather than degrading gradually, selection accuracy remains stable up to a critical library size, then drops sharply, indicating a phase transition reminiscent of capacity limits in human cognition. Furthermore, we find evidence that semantic confusability among similar skills, rather than library size alone, plays a central role in this degradation. This perspective suggests that hierarchical organization, which has long helped humans manage complex choices, may similarly benefit AI systems. Our initial results with hierarchical routing support this hypothesis. This work opens new questions about the fundamental limits of semantic-based skill selection in LLMs and offers a cognitive-grounded framework and practical guidelines for designing scalable skill-based agents.

[244] Human-AI Co-design for Clinical Prediction Models

Jean Feng, Avni Kothari, Patrick Vossler, Andrew Bishara, Lucas Zier, Newton Addo, Aaron Kornblith, Yan Shuo Tan, Chandan Singh

Main category: cs.AI

TL;DR: HACHI is an iterative human-in-the-loop framework that uses AI agents to accelerate development of interpretable clinical prediction models by exploring concepts in clinical notes, with clinical experts providing feedback.

DetailsMotivation: Traditional clinical prediction model development is extremely time- and resource-intensive, with only a small fraction reaching clinical practice. The challenge intensifies when incorporating unstructured clinical notes containing enormous numbers of concepts.

Method: HACHI alternates between: (1) AI agent rapidly exploring/evaluating candidate concepts in clinical notes, and (2) clinical/domain experts providing feedback to improve the learning process. Concepts are defined as simple yes-no questions used in linear models for transparency.
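
The concept-as-question design is easy to picture: an LLM answers each yes-no question from the note, and the CPM stays a plain linear model over those answers. The questions, feature values, and outcomes below are hypothetical, invented for illustration.

```python
from sklearn.linear_model import LogisticRegression

# Each concept is a yes/no question answered from the clinical note;
# the prediction model itself is a transparent linear model over them.
concepts = ["Is the patient on vancomycin?",
            "Is baseline creatinine elevated?"]
X = [[1, 0], [1, 1], [0, 0], [0, 1]]   # hypothetical concept answers
y = [0, 1, 0, 1]                        # hypothetical AKI outcomes
cpm = LogisticRegression().fit(X, y)
for q, w in zip(concepts, cpm.coef_[0]):
    print(f"{w:+.2f}  {q}")             # clinicians can audit each weight
```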

Result: In acute kidney injury and traumatic brain injury prediction tasks, HACHI outperforms existing approaches, surfaces new clinically relevant concepts not in commonly-used CPMs, and improves model generalizability across clinical sites and time periods.

Conclusion: HACHI demonstrates the critical role of clinical AI teams in directing AI agents, adjusting concept granularity, aligning objective functions with clinical goals, and identifying data bias/leakage issues, enabling faster development of interpretable clinical prediction models.

Abstract: Developing safe, effective, and practically useful clinical prediction models (CPMs) traditionally requires iterative collaboration between clinical experts, data scientists, and informaticists. This process refines the often small but critical details of the model building process, such as which features/patients to include and how clinical categories should be defined. However, this traditional collaboration process is extremely time- and resource-intensive, resulting in only a small fraction of CPMs reaching clinical practice. This challenge intensifies when teams attempt to incorporate unstructured clinical notes, which can contain an enormous number of concepts. To address this challenge, we introduce HACHI, an iterative human-in-the-loop framework that uses AI agents to accelerate the development of fully interpretable CPMs by enabling the exploration of concepts in clinical notes. HACHI alternates between (i) an AI agent rapidly exploring and evaluating candidate concepts in clinical notes and (ii) clinical and domain experts providing feedback to improve the CPM learning process. HACHI defines concepts as simple yes-no questions that are used in linear models, allowing the clinical AI team to transparently review, refine, and validate the CPM learned in each round. In two real-world prediction tasks (acute kidney injury and traumatic brain injury), HACHI outperforms existing approaches, surfaces new clinically relevant concepts not included in commonly-used CPMs, and improves model generalizability across clinical sites and time periods. Furthermore, HACHI reveals the critical role of the clinical AI team, such as directing the AI agent to explore concepts that it had not previously considered, adjusting the granularity of concepts it considers, changing the objective function to better align with the clinical objectives, and identifying issues of data bias and leakage.

[245] Programming over Thinking: Efficient and Robust Multi-Constraint Planning

Derrick Goh Xin Deik, Quanyu Long, Zhengyuan Liu, Nancy F. Chen, Wenya Wang

Main category: cs.AI

TL;DR: SCOPE is a framework that separates reasoning from code execution for multi-constraint planning, achieving state-of-the-art performance with lower cost and latency compared to existing LLM approaches.

DetailsMotivation: Existing LLM approaches for multi-constraint planning have limitations: pure reasoning methods suffer from inconsistency and error accumulation, while coding/solver-based approaches lack flexibility and generalizability across diverse problems.

Method: SCOPE (Scalable COde Planning Engine) disentangles query-specific reasoning from generic code execution, producing reusable solver functions that only require minimal parameter changes while maintaining consistency and determinism.
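
The reuse idea is easy to picture with a toy solver (a sketch under our own assumptions; SCOPE's actual generated solvers are in the linked repository). The function below is generic and deterministic, and a new query only changes the parameters it is called with:

```python
# A reusable constraint-satisfying planner: query-specific reasoning emits
# only the parameter values; the solver logic itself never changes.
from itertools import combinations

def plan_solver(items, budget, min_rating, max_picks):
    best = None
    for k in range(1, max_picks + 1):
        for combo in combinations(items, k):
            cost = sum(i["cost"] for i in combo)
            if cost <= budget and all(i["rating"] >= min_rating for i in combo):
                rating = sum(i["rating"] for i in combo)
                if best is None or rating > best[0]:
                    best = (rating, [i["name"] for i in combo])
    return best

items = [{"name": "museum", "cost": 30, "rating": 4.5},
         {"name": "boat tour", "cost": 80, "rating": 4.8},
         {"name": "market", "cost": 10, "rating": 4.0}]
# A new query re-invokes the same solver with different parameters:
print(plan_solver(items, budget=100, min_rating=4.0, max_picks=2))
```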

Result: SCOPE achieves 93.1% success on TravelPlanner benchmark, a 61.6% improvement over best baseline (CoT), while reducing inference cost by 1.4x and time by ~4.67x using GPT-4o.

Conclusion: The separation of reasoning from execution in SCOPE enables consistent, deterministic, and reusable solver functions that outperform existing approaches in multi-constraint planning while being more cost-effective and faster.

Abstract: Multi-constraint planning involves identifying, evaluating, and refining candidate plans while satisfying multiple, potentially conflicting constraints. Existing large language model (LLM) approaches face fundamental limitations in this domain. Pure reasoning paradigms, which rely on long natural language chains, are prone to inconsistency, error accumulation, and prohibitive cost as constraints compound. Conversely, LLMs combined with coding- or solver-based strategies lack flexibility: they often generate problem-specific code from scratch or depend on fixed solvers, failing to capture generalizable logic across diverse problems. To address these challenges, we introduce the Scalable COde Planning Engine (SCOPE), a framework that disentangles query-specific reasoning from generic code execution. By separating reasoning from execution, SCOPE produces solver functions that are consistent, deterministic, and reusable across queries while requiring only minimal changes to input parameters. SCOPE achieves state-of-the-art performance while lowering cost and latency. For example, with GPT-4o, it reaches 93.1% success on TravelPlanner, a 61.6% gain over the best baseline (CoT) while cutting inference cost by 1.4x and time by ~4.67x. Code is available at https://github.com/DerrickGXD/SCOPE.

[246] DScheLLM: Enabling Dynamic Scheduling through a Fine-Tuned Dual-System Large Language Model

Lixiang Zhang, Chenggong Zhao, Qing Gao, Xiaoke Zhao, Gengyi Bai, Jinhu Lv

Main category: cs.AI

TL;DR: DScheLLM is a dynamic scheduling approach that uses fine-tuned LLMs in a dual-system reasoning architecture to handle production disruptions.

DetailsMotivation: Conventional production scheduling approaches are limited in adaptability to dynamic disruptions like processing time variations, machine availability changes, and unexpected task insertions. They rely on event-specific models and explicit analytical formulations, which don't generalize well to unseen disturbances.

Method: Proposes DScheLLM with dual-system (fast-slow) reasoning architecture using fine-tuned large language models. Uses Huawei OpenPangu Embedded-7B model fine-tuned with LoRA. Training datasets generated from exact schedules from operations research solver for both reasoning modes.
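
As a rough illustration of the fine-tuning recipe (an assumption based on the summary, not the authors' code; the model path, LoRA rank, and target modules below are placeholders):

```python
# LoRA adapters on a frozen base model, one recipe per reasoning mode;
# training pairs are distilled from an operations-research solver.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("path/to/openpangu-embedded-7b")
lora = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # assumed attention projections
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora)
model.print_trainable_parameters()
# Fast mode: (shop state -> immediate dispatch decision)
# Slow mode: (shop state -> solver-compatible formulation)
```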

Result: Fast-thinking mode efficiently generates high-quality schedules; slow-thinking mode produces solver-compatible, well-formatted decision inputs. Demonstrated on standard job shop scheduling benchmarks.

Conclusion: One of earliest studies applying LLMs to dynamic job shop scheduling, highlighting significant potential for intelligent, adaptive scheduling optimization in dynamic environments.

Abstract: Production scheduling is highly susceptible to dynamic disruptions, such as variations in processing times, machine availability, and unexpected task insertions. Conventional approaches typically rely on event-specific models and explicit analytical formulations, which limits their adaptability and generalization across previously unseen disturbances. To overcome these limitations, this paper proposes DScheLLM, a dynamic scheduling approach that leverages fine-tuned large language models within a dual-system (fast-slow) reasoning architecture to address disturbances of different scales. A unified large language model-based framework is constructed to handle dynamic events, where training datasets for both fast and slow reasoning modes are generated using exact schedules obtained from an operations research solver. The Huawei OpenPangu Embedded-7B model is subsequently fine-tuned under the hybrid reasoning paradigms using LoRA. Experimental evaluations on standard job shop scheduling benchmarks demonstrate that the fast-thinking mode can efficiently generate high-quality schedules and the slow-thinking mode can produce solver-compatible and well-formatted decision inputs. To the best of our knowledge, this work represents one of the earliest studies applying large language models to job shop scheduling in dynamic environments, highlighting their considerable potential for intelligent and adaptive scheduling optimization.

[247] AviationLMM: A Large Multimodal Foundation Model for Civil Aviation

Wenbin Li, Jingling Wu, Xiaoyong Lin, Jing Chen, Cong Chen

Main category: cs.AI

TL;DR: Proposes AviationLMM, a Large Multimodal foundation Model for civil aviation to unify heterogeneous data streams (voice, radar, sensors, text) for improved situational awareness, reasoning, and decision support.

DetailsMotivation: Current AI solutions in aviation are siloed and narrow, focusing on isolated tasks or single modalities, which limits their ability to integrate diverse data sources and provide comprehensive situational awareness and real-time decision support.

Method: Introduces AviationLMM architecture that ingests multimodal inputs (air-ground voice, surveillance, telemetry, video, structured texts), performs cross-modal alignment and fusion, and produces flexible outputs including situation summaries, risk alerts, predictive diagnostics, and incident reconstructions.

Result: Identifies key research opportunities including data acquisition, alignment/fusion, pretraining, reasoning, trustworthiness, privacy, robustness to missing modalities, and synthetic scenario generation to realize the AviationLMM vision.

Conclusion: AviationLMM aims to boost civil aviation foundation model progress and catalyze coordinated research toward an integrated, trustworthy, and privacy-preserving aviation AI ecosystem by addressing current limitations of siloed AI solutions.

Abstract: Civil aviation is a cornerstone of global transportation and commerce, and ensuring its safety, efficiency and customer satisfaction is paramount. Yet conventional Artificial Intelligence (AI) solutions in aviation remain siloed and narrow, focusing on isolated tasks or single modalities. They struggle to integrate heterogeneous data such as voice communications, radar tracks, sensor streams and textual reports, which limits situational awareness, adaptability, and real-time decision support. This paper introduces the vision of AviationLMM, a Large Multimodal foundation Model for civil aviation, designed to unify the heterogeneous data streams of civil aviation and enable understanding, reasoning, generation and agentic applications. We first identify the gaps between existing AI solutions and requirements. Secondly, we describe the model architecture, which ingests multimodal inputs such as air-ground voice, surveillance, on-board telemetry, video and structured texts, performs cross-modal alignment and fusion, and produces flexible outputs ranging from situation summaries and risk alerts to predictive diagnostics and multimodal incident reconstructions. In order to fully realize this vision, we identify key research opportunities to address, including data acquisition, alignment and fusion, pretraining, reasoning, trustworthiness, privacy, robustness to missing modalities, and synthetic scenario generation. By articulating the design and challenges of AviationLMM, we aim to accelerate progress on civil aviation foundation models and catalyze coordinated research efforts toward an integrated, trustworthy and privacy-preserving aviation AI ecosystem.

[248] The AI Hippocampus: How Far are We From Human Memory?

Zixia Jia, Jiaqi Li, Yipeng Kang, Yuxuan Wang, Tong Wu, Quansen Wang, Xiaobo Wang, Shuyi Zhang, Junzhe Shen, Qing Li, Siyuan Qi, Yitao Liang, Di He, Zilong Zheng, Song-Chun Zhu

Main category: cs.AI

TL;DR: A comprehensive survey of memory mechanisms in LLMs and MLLMs, organized into implicit, explicit, and agentic memory paradigms, covering architectural advances, benchmarks, and open challenges.

DetailsMotivation: Memory is crucial for enhancing reasoning, adaptability, and contextual fidelity in LLMs/MLLMs as they evolve from static predictors to interactive systems capable of continual learning and personalized inference.

Method: Structured taxonomy approach organizing memory literature into three paradigms: implicit (knowledge embedded in model parameters), explicit (external storage/retrieval components), and agentic (persistent memory for autonomous agents).

Result: Comprehensive synthesis of memory mechanisms across text and multi-modal settings, discussing architectural advances, benchmark tasks, and extending to vision, language, audio, and action modalities.

Conclusion: Memory integration is central to LLM/MLLM evolution, with key challenges remaining in memory capacity, alignment, factual consistency, and cross-system interoperability.

Abstract: Memory plays a foundational role in augmenting the reasoning, adaptability, and contextual fidelity of modern Large Language Models and Multi-Modal LLMs. As these models transition from static predictors to interactive systems capable of continual learning and personalized inference, the incorporation of memory mechanisms has emerged as a central theme in their architectural and functional evolution. This survey presents a comprehensive and structured synthesis of memory in LLMs and MLLMs, organizing the literature into a cohesive taxonomy comprising implicit, explicit, and agentic memory paradigms. Specifically, the survey delineates three primary memory frameworks. Implicit memory refers to the knowledge embedded within the internal parameters of pre-trained transformers, encompassing their capacity for memorization, associative retrieval, and contextual reasoning. Recent work has explored methods to interpret, manipulate, and reconfigure this latent memory. Explicit memory involves external storage and retrieval components designed to augment model outputs with dynamic, queryable knowledge representations, such as textual corpora, dense vectors, and graph-based structures, thereby enabling scalable and updatable interaction with information sources. Agentic memory introduces persistent, temporally extended memory structures within autonomous agents, facilitating long-term planning, self-consistency, and collaborative behavior in multi-agent systems, with relevance to embodied and interactive AI. Extending beyond text, the survey examines the integration of memory within multi-modal settings, where coherence across vision, language, audio, and action modalities is essential. Key architectural advances, benchmark tasks, and open challenges are discussed, including issues related to memory capacity, alignment, factual consistency, and cross-system interoperability.

[249] PrivacyReasoner: Can LLM Emulate a Human-like Privacy Mind?

Yiwen Tu, Xuan Liu, Lianhui Qin, Haojian Jin

Main category: cs.AI

TL;DR: PRA is an AI agent that simulates individual users’ privacy concern formation in response to real-world news by integrating privacy theories with cognitive models and personal comment histories.

DetailsMotivation: Current approaches focus on population-level sentiment analysis, but there's a need to understand how individual users form privacy concerns based on their personal histories and contextual factors.

Method: PRA reconstructs each user’s “privacy mind” using personal comment histories, employs a contextual filter for bounded rationality, generates synthetic comments for new privacy scenarios, and uses an LLM-as-a-Judge evaluator calibrated against privacy concern taxonomy.
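
A minimal rendering of the contextual-filter idea as we read it (an assumption, not the paper's implementation; the similarity measure and memory snippets are illustrative): bounded rationality is approximated by activating only the few memory items most similar to the incoming scenario.

```python
# Activate a small, scenario-relevant slice of the user's "privacy mind"
# instead of conditioning on the full comment history.
from difflib import SequenceMatcher

def contextual_filter(memory, scenario, k=2):
    scored = [(SequenceMatcher(None, m, scenario).ratio(), m) for m in memory]
    return [m for _, m in sorted(scored, reverse=True)[:k]]

memory = ["worried about ad tracking across apps",
          "fine with anonymized health data for research",
          "dislikes mandatory account creation"]
print(contextual_filter(memory, "new app shares tracking data with advertisers"))
```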

Result: Experiments on Hacker News discussions show PRA outperforms baseline agents in privacy concern prediction and captures transferable reasoning patterns across AI, e-commerce, and healthcare domains.

Conclusion: PRA provides a novel approach to simulating individual-level privacy reasoning that goes beyond population averages, offering insights into how users form privacy concerns across different domains.

Abstract: This paper introduces PRA, an AI-agent design for simulating how individual users form privacy concerns in response to real-world news. Moving beyond population-level sentiment analysis, PRA integrates privacy and cognitive theories to simulate user-specific privacy reasoning grounded in personal comment histories and contextual cues. The agent reconstructs each user’s “privacy mind”, dynamically activates relevant privacy memory through a contextual filter that emulates bounded rationality, and generates synthetic comments reflecting how that user would likely respond to new privacy scenarios. A complementary LLM-as-a-Judge evaluator, calibrated against an established privacy concern taxonomy, quantifies the faithfulness of generated reasoning. Experiments on real-world Hacker News discussions show that PRA outperforms baseline agents in privacy concern prediction and captures transferable reasoning patterns across domains including AI, e-commerce, and healthcare.

[250] Position on LLM-Assisted Peer Review: Addressing Reviewer Gap through Mentoring and Feedback

JungMin Yun, JuneHyoung Kwon, MiHyeon Kim, YoungBin Kim

Main category: cs.AI

TL;DR: Paper proposes using LLMs as tools to assist and educate human reviewers rather than generating automated reviews, aiming to address the Reviewer Gap and improve peer-review sustainability.

DetailsMotivation: The rapid expansion of AI research has intensified the Reviewer Gap, threatening peer-review sustainability and perpetuating low-quality evaluations. Existing LLM approaches that automatically generate reviews are insufficient.

Method: Proposes a paradigm shift positioning LLMs as tools for assisting and educating human reviewers. Defines core principles of high-quality peer review and proposes two complementary systems: (1) LLM-assisted mentoring system for cultivating reviewers’ long-term competencies, and (2) LLM-assisted feedback system for helping reviewers refine review quality.

Result: A human-centered approach that aims to strengthen reviewer expertise and contribute to building a more sustainable scholarly ecosystem.

Conclusion: LLMs should be positioned as tools for assisting and educating human reviewers rather than generating automated reviews, with the goal of addressing the Reviewer Gap and improving peer-review sustainability through human-centered systems.

Abstract: The rapid expansion of AI research has intensified the Reviewer Gap, threatening the peer-review sustainability and perpetuating a cycle of low-quality evaluations. This position paper critiques existing LLM approaches that automatically generate reviews and argues for a paradigm shift that positions LLMs as tools for assisting and educating human reviewers. We define the core principles of high-quality peer review and propose two complementary systems grounded in these foundations: (i) an LLM-assisted mentoring system that cultivates reviewers’ long-term competencies, and (ii) an LLM-assisted feedback system that helps reviewers refine the quality of their reviews. This human-centered approach aims to strengthen reviewer expertise and contribute to building a more sustainable scholarly ecosystem.

[251] MAXS: Meta-Adaptive Exploration with LLM Agents

Jian Zhang, Zhiyuan Wang, Zhangqi Wang, Yu He, Haoran Luo, Li Yuan, Lingling Zhang, Rui Mao, Qika Lin, Jun Liu

Main category: cs.AI

TL;DR: MAXS is a meta-adaptive reasoning framework for LLM agents that uses lookahead planning and trajectory convergence to improve reasoning stability and efficiency in multi-tool collaboration.

DetailsMotivation: Existing LLM agent methods suffer from two main issues: (1) locally myopic generation due to lack of lookahead planning, and (2) trajectory instability where minor early errors escalate into divergent reasoning paths. These problems make it difficult to balance global effectiveness with computational efficiency.

Method: MAXS employs a lookahead strategy to extend reasoning paths several steps ahead, estimating advantage values for tool usage. It combines step consistency variance and inter-step trend slopes to select stable, consistent, high-value reasoning steps. A trajectory convergence mechanism halts further rollouts once path consistency is achieved, balancing resource efficiency with global effectiveness.
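
The step-selection signal can be sketched as follows (our reading of the abstract, not the released code; the weighting and the toy scores are assumptions):

```python
# Combine the variance of lookahead scores (step consistency) with the
# inter-step trend slope, preferring stable, improving candidate steps.
import statistics

def step_value(scores, trend_weight=0.5):
    """scores: lookahead evaluations of one candidate step, a few steps deep."""
    n = len(scores)
    slope = (scores[-1] - scores[0]) / max(n - 1, 1)      # trend slope
    var = statistics.pvariance(scores) if n > 1 else 0.0  # consistency
    return scores[-1] + trend_weight * slope - var

candidates = {"call_search_tool": [0.55, 0.62, 0.70],
              "answer_directly":  [0.80, 0.40, 0.75]}
print(max(candidates, key=lambda c: step_value(candidates[c])))
# -> the stable, improving path wins despite a lower peak score
```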

Result: Extensive empirical studies across three base models (MiMo-VL-7B, Qwen2.5-VL-7B, Qwen2.5-VL-32B) and five datasets show that MAXS consistently outperforms existing methods in both performance and inference efficiency. Further analysis confirms the effectiveness of the lookahead strategy and tool usage.

Conclusion: MAXS successfully addresses the myopic generation and trajectory instability problems in LLM agent reasoning through meta-adaptive exploration, achieving better balance between computational efficiency and global effectiveness in multi-tool reasoning tasks.

Abstract: Large Language Model (LLM) Agents exhibit inherent reasoning abilities through the collaboration of multiple tools. However, during agent inference, existing methods often suffer from (i) locally myopic generation, due to the absence of lookahead, and (ii) trajectory instability, where minor early errors can escalate into divergent reasoning paths. These issues make it difficult to balance global effectiveness and computational efficiency. To address these two issues, we propose MAXS, a meta-adaptive reasoning framework based on LLM agents (code: https://github.com/exoskeletonzj/MAXS) that flexibly integrates tool execution and reasoning planning. MAXS employs a lookahead strategy to extend reasoning paths a few steps ahead, estimating the advantage value of tool usage, and combines step consistency variance and inter-step trend slopes to jointly select stable, consistent, and high-value reasoning steps. Additionally, we introduce a trajectory convergence mechanism that controls computational cost by halting further rollouts once path consistency is achieved, enabling a balance between resource efficiency and global effectiveness in multi-tool reasoning. We conduct extensive empirical studies across three base models (MiMo-VL-7B, Qwen2.5-VL-7B, Qwen2.5-VL-32B) and five datasets, demonstrating that MAXS consistently outperforms existing methods in both performance and inference efficiency. Further analysis confirms the effectiveness of our lookahead strategy and tool usage.

[252] Efficient Paths and Dense Rewards: Probabilistic Flow Reasoning for Large Language Models

Yan Liu, Feng Zhang, Zhanyu Ma, Jun Xu, Jiuchong Gao, Jinghua Hao, Renqing He, Han Liu, Yangdong Deng

Main category: cs.AI

TL;DR: CoT-Flow is a framework that treats reasoning steps as continuous probabilistic flow to quantify each step’s contribution to the answer, enabling flow-guided decoding for efficient inference and flow-based RL for dense rewards without external verifiers.

DetailsMotivation: Current chain-of-thought paradigms treat reasoning as indivisible sequences without quantifying step-wise information gain, leading to inference inefficiency from redundant exploration and optimization difficulty due to sparse supervision or costly external verifiers.

Method: CoT-Flow reconceptualizes discrete reasoning steps as continuous probabilistic flow to quantify each step’s contribution toward ground-truth answers. This enables two methodologies: 1) flow-guided decoding using greedy flow-based strategy for information-efficient reasoning paths, and 2) flow-based reinforcement learning with verifier-free dense reward functions.
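
A toy rendering of flow-guided decoding (assumptions throughout: the flow estimator below is a stub that rewards informative, non-repetitive steps, whereas the paper uses a learned flow): at each step, the decoder greedily keeps the candidate continuation with the highest estimated flow toward the answer.

```python
# Greedy flow-based decoding over candidate reasoning steps.
def flow(path: str) -> float:
    """Stub for the learned per-step flow estimate."""
    tokens = path.split()
    return len(set(tokens)) / (1 + len(tokens))

def greedy_flow_decode(prefix, candidates_per_step):
    path = prefix
    for options in candidates_per_step:
        path = max((path + " " + o for o in options), key=flow)
    return path

cands = [["factor the equation", "guess and check"],
         ["solve for x", "restate the problem restate the problem"]]
print(greedy_flow_decode("Problem:", cands))  # redundant steps score low
```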

Result: Experiments on challenging benchmarks demonstrate that CoT-Flow achieves a superior balance between inference efficiency and reasoning performance compared to existing approaches.

Conclusion: CoT-Flow provides a principled framework for quantifying reasoning step contributions, addressing both inference efficiency and optimization challenges in chain-of-thought reasoning through continuous probabilistic flow modeling.

Abstract: High-quality chain-of-thought has demonstrated strong potential for unlocking the reasoning capabilities of large language models. However, current paradigms typically treat the reasoning process as an indivisible sequence, lacking an intrinsic mechanism to quantify step-wise information gain. This granularity gap manifests in two limitations: inference inefficiency from redundant exploration without explicit guidance, and optimization difficulty due to sparse outcome supervision or costly external verifiers. In this work, we propose CoT-Flow, a framework that reconceptualizes discrete reasoning steps as a continuous probabilistic flow, quantifying the contribution of each step toward the ground-truth answer. Built on this formulation, CoT-Flow enables two complementary methodologies: flow-guided decoding, which employs a greedy flow-based decoding strategy to extract information-efficient reasoning paths, and flow-based reinforcement learning, which constructs a verifier-free dense reward function. Experiments on challenging benchmarks demonstrate that CoT-Flow achieves a superior balance between inference efficiency and reasoning performance.

[253] Coordinated Pandemic Control with Large Language Model Agents as Policymaking Assistants

Ziyi Shi, Xusen Guo, Hongliang Lu, Mingxing Peng, Haotian Wang, Zheng Zhu, Zhenning Li, Yuxuan Liang, Xinhu Zheng, Hai Yang

Main category: cs.AI

TL;DR: LLM multi-agent framework enables coordinated pandemic policymaking across regions, reducing COVID-19 infections by up to 63.7% and deaths by 40.1% compared to real-world outcomes.

DetailsMotivation: Human pandemic responses are often fragmented and reactive, with policies made in isolation and adjusted only after outbreaks escalate, undermining proactive intervention and global mitigation.

Method: Assigns each administrative region an LLM agent as AI policymaking assistant that reasons over region-specific epidemiological dynamics while communicating with other agents to account for cross-regional interdependencies. Integrates real-world data, pandemic simulator, and structured inter-agent communication for joint exploration of counterfactual scenarios through closed-loop simulation.
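
Schematically, the closed loop looks like the sketch below (a toy under our own assumptions; the paper's epidemiological simulator and LLM agents are far richer, and the thresholds here are invented):

```python
# Each region's agent sees its local state plus neighbors' messages,
# proposes a policy, and the shared simulator advances one step.
def agent_policy(region, state, messages):
    # Stand-in for an LLM call: tighten if local or neighboring risk is high.
    risk = state[region] + 0.5 * max(messages.values(), default=0.0)
    return "lockdown" if risk > 0.6 else "open"

def simulate_step(state, policies):
    return {r: max(0.0, v * (0.7 if policies[r] == "lockdown" else 1.2))
            for r, v in state.items()}

state = {"A": 0.5, "B": 0.8}
for day in range(3):
    policies = {r: agent_policy(r, state, {n: state[n] for n in state if n != r})
                for r in state}
    state = simulate_step(state, policies)
    print(day, policies, {r: round(v, 2) for r, v in state.items()})
```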

Result: Using US state-level COVID-19 data from April-December 2020, reduces cumulative infections by up to 63.7% and deaths by 40.1% at individual state level, and by 39.0% and 27.0% respectively when aggregated across states.

Conclusion: LLM multi-agent systems can enable more effective pandemic control through coordinated policymaking, demonstrating significant improvements over real-world fragmented responses.

Abstract: Effective pandemic control requires timely and coordinated policymaking across administrative regions that are intrinsically interdependent. However, human-driven responses are often fragmented and reactive, with policies formulated in isolation and adjusted only after outbreaks escalate, undermining proactive intervention and global pandemic mitigation. To address this challenge, here we propose a large language model (LLM) multi-agent policymaking framework that supports coordinated and proactive pandemic control across regions. Within our framework, each administrative region is assigned an LLM agent as an AI policymaking assistant. The agent reasons over region-specific epidemiological dynamics while communicating with other agents to account for cross-regional interdependencies. By integrating real-world data, a pandemic evolution simulator, and structured inter-agent communication, our framework enables agents to jointly explore counterfactual intervention scenarios and synthesize coordinated policy decisions through a closed-loop simulation process. We validate the proposed framework using state-level COVID-19 data from the United States between April and December 2020, together with real-world mobility records and observed policy interventions. Compared with real-world pandemic outcomes, our approach reduces cumulative infections and deaths by up to 63.7% and 40.1%, respectively, at the individual state level, and by 39.0% and 27.0%, respectively, when aggregated across states. These results demonstrate that LLM multi-agent systems can enable more effective pandemic control with coordinated policymaking…

[254] RISER: Orchestrating Latent Reasoning Skills for Adaptive Activation Steering

Wencheng Ye, Liang Peng, Xiaoyang Yuan, Yi Bin, Pengpeng Zeng, Hengyu Jin, Heng Tao Shen

Main category: cs.AI

TL;DR: RISER is a plug-and-play framework that uses reinforcement learning to dynamically compose reusable reasoning vectors for adaptive activation steering in LLMs, achieving significant accuracy improvements with high token efficiency.

DetailsMotivation: Existing activation steering methods use static, manual interventions that don't adapt to the dynamic nature of complex reasoning, while training-intensive approaches require parameter updates. There's a need for parameter-efficient, adaptive steering methods.

Method: RISER constructs a library of reusable reasoning vectors and uses a lightweight Router optimized via reinforcement learning under task-level rewards to dynamically compose these vectors for each input, activating latent cognitive primitives compositionally.
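
In miniature, adaptive steering might look like this (our reading; the random vector library and softmax router below are toy stand-ins for RISER's learned components):

```python
# Add a Router-weighted composition of reusable reasoning vectors to a
# hidden activation; the Router would be RL-trained in the real system.
import numpy as np

rng = np.random.default_rng(0)
library = rng.normal(size=(4, 8))  # 4 reusable reasoning vectors, dim 8

def router(x):
    logits = library @ x
    w = np.exp(logits - logits.max())
    return w / w.sum()

def steer(hidden, alpha=0.1):
    return hidden + alpha * router(hidden) @ library

h = rng.normal(size=8)
print(np.round(steer(h) - h, 3))  # the injected steering delta
```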

Result: Across seven diverse benchmarks, RISER yields 3.4-6.5% average zero-shot accuracy improvements over base models, surpasses CoT-style reasoning with 2-3x higher token efficiency, and shows robust accuracy gains. The framework autonomously combines vectors into interpretable control strategies.

Conclusion: RISER demonstrates that adaptive, compositional activation steering via reinforcement learning enables more controllable and efficient LLM reasoning without parameter updates, pointing toward better ways to enhance domain-specific reasoning capabilities.

Abstract: Recent work on domain-specific reasoning with large language models (LLMs) often relies on training-intensive approaches that require parameter updates. While activation steering has emerged as a parameter efficient alternative, existing methods apply static, manual interventions that fail to adapt to the dynamic nature of complex reasoning. To address this limitation, we propose RISER (Router-based Intervention for Steerable Enhancement of Reasoning), a plug-and-play intervention framework that adaptively steers LLM reasoning in activation space. RISER constructs a library of reusable reasoning vectors and employs a lightweight Router to dynamically compose them for each input. The Router is optimized via reinforcement learning under task-level rewards, activating latent cognitive primitives in an emergent and compositional manner. Across seven diverse benchmarks, RISER yields 3.4-6.5% average zero-shot accuracy improvements over the base model while surpassing CoT-style reasoning with 2-3x higher token efficiency and robust accuracy gains. Further analysis shows that RISER autonomously combines multiple vectors into interpretable, precise control strategies, pointing toward more controllable and efficient LLM reasoning.

[255] $A^3$-Bench: Benchmarking Memory-Driven Scientific Reasoning via Anchor and Attractor Activation

Jian Zhang, Yu He, Zhiyuan Wang, Zhangqi Wang, Kai He, Fangzhi Xu, Qika Lin, Jun Liu

Main category: cs.AI

TL;DR: A³-Bench is a new benchmark for evaluating scientific reasoning through memory-driven activation mechanisms (anchors and attractors), addressing gaps in existing benchmarks that overlook memory’s role in reasoning.

DetailsMotivation: Existing benchmarks mainly evaluate final answers or step-by-step coherence but overlook memory-driven mechanisms underlying human reasoning, which involves activating anchors and attractors then integrating them into multi-step inference.

Method: 1) Annotated 2,198 science reasoning problems using SAPM process (subject, anchor & attractor, problem, memory developing); 2) Introduced dual-scale memory evaluation framework using anchors and attractors; 3) Created AAUI metric to measure memory activation rates; 4) Conducted experiments with various base models and paradigms.
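
The abstract does not give the AAUI formula, so the following is only a plausible sketch (the scoring rule and all names are our assumptions): score a reasoning trace by the fraction of annotated anchors and attractors it actually activates.

```python
# A naive activation check: substring match of annotated memory cues.
def aaui(trace, anchors, attractors):
    items = anchors + attractors
    hits = sum(a.lower() in trace.lower() for a in items)
    return hits / len(items) if items else 0.0

trace = "Using conservation of energy, the pendulum's speed at the bottom..."
print(aaui(trace, anchors=["conservation of energy"],
           attractors=["pendulum", "friction"]))  # -> 2/3
```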

Result: Validated A³-Bench through experiments and analyzed how memory activation impacts reasoning performance, providing insights into memory-driven scientific reasoning.

Conclusion: The benchmark addresses a critical gap in evaluating memory-driven reasoning mechanisms and provides tools to analyze how memory activation influences scientific reasoning performance.

Abstract: Scientific reasoning relies not only on logical inference but also on activating prior knowledge and experiential structures. Memory can efficiently reuse knowledge and enhance reasoning consistency and stability. However, existing benchmarks mainly evaluate final answers or step-by-step coherence, overlooking the memory-driven mechanisms that underlie human reasoning, which involves activating anchors and attractors, then integrating them into multi-step inference. To address this gap, we propose $A^3$-Bench (https://a3-bench.github.io), a benchmark designed to evaluate scientific reasoning through dual-scale memory-driven activation, grounded in Anchor and Attractor Activation. First, we annotate 2,198 science reasoning problems across domains using the SAPM process (subject, anchor & attractor, problem, and memory developing). Second, we introduce a dual-scale memory evaluation framework utilizing anchors and attractors, along with the AAUI (Anchor–Attractor Utilization Index) metric to measure memory activation rates. Finally, through experiments with various base models and paradigms, we validate $A^3$-Bench and analyze how memory activation impacts reasoning performance, providing insights into memory-driven scientific reasoning.

[256] M$^3$Searcher: Modular Multimodal Information Seeking Agency with Retrieval-Oriented Reasoning

Xiaohan Yu, Chao Feng, Lang Mei, Chong Chen

Main category: cs.AI

TL;DR: M³Searcher is a modular multimodal information-seeking agent that decouples information acquisition from answer derivation, addressing challenges in extending autonomous agents to multimodal settings through retrieval-oriented multi-objective optimization.

DetailsMotivation: Existing DeepResearch-style agents are limited to text modality, and extending them to multimodal settings faces two key challenges: the specialization-generalization trade-off in training multimodal tool-use models at scale, and severe scarcity of training data for complex multimodal search trajectories.

Method: Proposes M³Searcher, a modular multimodal agent that explicitly decouples information acquisition from answer derivation. It uses retrieval-oriented multi-objective reward optimization that jointly encourages factual accuracy, reasoning soundness, and retrieval fidelity. Also develops MMSearchVQA dataset for retrieval-centric RL training.
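
The shape of the multi-objective reward can be sketched as below (the weights and scoring functions are assumptions; the paper defines its own reward design):

```python
# Jointly reward factual accuracy, reasoning soundness, and retrieval fidelity.
def reward(answer_correct, reasoning_sound, retrieved_docs, gold_docs,
           w=(0.5, 0.2, 0.3)):
    recall = len(set(retrieved_docs) & set(gold_docs)) / max(len(gold_docs), 1)
    return w[0] * answer_correct + w[1] * reasoning_sound + w[2] * recall

print(reward(answer_correct=1.0, reasoning_sound=0.8,
             retrieved_docs=["d1", "d3"], gold_docs=["d1", "d2"]))  # 0.81
```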

Result: Experimental results show M³Searcher outperforms existing approaches, exhibits strong transfer adaptability, and demonstrates effective reasoning in complex multimodal tasks.

Conclusion: M³Searcher successfully addresses multimodal information-seeking challenges through modular architecture and multi-objective optimization, showing promising performance and adaptability in complex multimodal search scenarios.

Abstract: Recent advances in DeepResearch-style agents have demonstrated strong capabilities in autonomous information acquisition and synthesis from real-world web environments. However, existing approaches remain fundamentally limited to the text modality. Extending autonomous information-seeking agents to multimodal settings introduces critical challenges: the specialization-generalization trade-off that emerges when training models for multimodal tool-use at scale, and the severe scarcity of training data capturing complex, multi-step multimodal search trajectories. To address these challenges, we propose M$^3$Searcher, a modular multimodal information-seeking agent that explicitly decouples information acquisition from answer derivation. M$^3$Searcher is optimized with a retrieval-oriented multi-objective reward that jointly encourages factual accuracy, reasoning soundness, and retrieval fidelity. In addition, we develop MMSearchVQA, a multimodal multi-hop dataset to support retrieval-centric RL training. Experimental results demonstrate that M$^3$Searcher outperforms existing approaches, exhibits strong transfer adaptability and effective reasoning in complex multimodal tasks.

[257] STaR: Sensitive Trajectory Regulation for Unlearning in Large Reasoning Models

Jingjing Zhou, Gaoxiang Cong, Li Su, Liang Li

Main category: cs.AI

TL;DR: STaR is a parameter-free inference-time unlearning framework that protects privacy in Large Reasoning Models by removing sensitive content from both final answers and intermediate reasoning steps, outperforming existing LLM unlearning methods.

DetailsMotivation: Large Reasoning Models generate complex Chain-of-Thought trajectories that embed sensitive information throughout the reasoning process, creating severe privacy risks. Existing LLM unlearning approaches only modify final answers, failing to remove sensitive content from intermediate steps, leading to persistent privacy leakage.

Method: STaR uses a four-step approach: 1) semantic-aware detection to identify sensitive content, 2) injection of global safety constraints via secure prompt prefix, 3) trajectory-aware suppression to dynamically block sensitive content across the reasoning chain, and 4) token-level adaptive filtering to prevent both exact and paraphrased sensitive tokens during generation.
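
Step 4 can be pictured as a streaming filter (a toy version; the detector, the sensitive set, and the redaction policy below are simplistic placeholders, not STaR's actual method):

```python
# Token-level filtering during generation: if appending the next token
# completes a sensitive span, drop the span and emit a redaction marker.
SENSITIVE = {"j.doe@example.com", "123-45-6789"}

def safe_append(text, token):
    cand = text + token
    low = cand.lower()
    for s in SENSITIVE:
        i = low.find(s)
        if i != -1:
            return cand[:i] + "[REDACTED]"  # drop the sensitive span
    return cand

out = ""
for tok in ["Contact ", "j.doe", "@example.com", " for details"]:
    out = safe_append(out, tok)
print(out)  # Contact [REDACTED] for details
```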

Result: Experiments on the R-TOFU benchmark show STaR achieves comprehensive and stable unlearning with minimal utility loss. The framework also introduces new evaluation metrics (MCS and MIA) that demonstrate superior privacy protection compared to existing methods.

Conclusion: STaR sets a new standard for privacy-preserving reasoning in LRMs by providing robust privacy protection throughout the entire reasoning process while maintaining model utility, addressing critical gaps in existing unlearning approaches for complex reasoning models.

Abstract: Large Reasoning Models (LRMs) have advanced automated multi-step reasoning, but their ability to generate complex Chain-of-Thought (CoT) trajectories introduces severe privacy risks, as sensitive information may be deeply embedded throughout the reasoning process. Existing Large Language Model (LLM) unlearning approaches, which typically focus on modifying only final answers, are insufficient for LRMs, as they fail to remove sensitive content from intermediate steps, leading to persistent privacy leakage and degraded security. To address these challenges, we propose Sensitive Trajectory Regulation (STaR), a parameter-free, inference-time unlearning framework that achieves robust privacy protection throughout the reasoning process. Specifically, we first identify sensitive content via semantic-aware detection. Then, we inject global safety constraints through a secure prompt prefix. Next, we perform trajectory-aware suppression to dynamically block sensitive content across the entire reasoning chain. Finally, we apply token-level adaptive filtering to prevent both exact and paraphrased sensitive tokens during generation. Furthermore, to overcome the inadequacies of existing evaluation protocols, we introduce two metrics: Multi-Decoding Consistency Assessment (MCS), which measures the consistency of unlearning across diverse decoding strategies, and Multi-Granularity Membership Inference Attack (MIA) Evaluation, which quantifies privacy protection at both answer and reasoning-chain levels. Experiments on the R-TOFU benchmark demonstrate that STaR achieves comprehensive and stable unlearning with minimal utility loss, setting a new standard for privacy-preserving reasoning in LRMs.

[258] Cluster Workload Allocation: Semantic Soft Affinity Using Natural Language Processing

Leszek Sliwko, Jolanta Mizeria-Pietraszko

Main category: cs.AI

TL;DR: LLM-powered semantic scheduling for Kubernetes interprets natural language hints for soft affinity preferences, achieving >95% parsing accuracy and superior placement in complex scenarios.

DetailsMotivation: Cluster workload allocation requires complex configurations, creating a usability gap that needs simplification through more intuitive, natural language interfaces.

Method: Developed a semantic scheduling system using LLMs (AWS Bedrock models) integrated via Kubernetes scheduler extender with cluster state cache and intent analyzer to interpret natural language allocation hints for soft affinity preferences.
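
A rough sketch of the extender's prioritize endpoint (Flask is used here purely for illustration; the annotation key, node labels, and JSON field names are abridged from the Kubernetes extender API as we recall it, and the intent parser is a stub where the prototype calls AWS Bedrock):

```python
# Score nodes against soft preferences parsed from a natural language hint.
from flask import Flask, request, jsonify

app = Flask(__name__)

def parse_intent(annotation: str) -> dict:
    """Stub for the LLM intent analyzer."""
    return {"prefer_disk": "ssd"} if "fast disk" in annotation.lower() else {}

@app.post("/prioritize")
def prioritize():
    args = request.get_json()
    hint = args["pod"]["metadata"].get("annotations", {}).get("allocation-hint", "")
    intent = parse_intent(hint)
    scores = []
    for node in args["nodes"]["items"]:
        labels = node["metadata"].get("labels", {})
        score = 10 if intent.get("prefer_disk") == labels.get("disk") else 0
        scores.append({"host": node["metadata"]["name"], "score": score})
    return jsonify(scores)  # soft scores the default scheduler adds to its own

if __name__ == "__main__":
    app.run(port=8888)
```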

Result: High LLM parsing accuracy (>95% Subset Accuracy), superior scheduling quality in complex/quantitative scenarios, and effective handling of conflicting soft preferences compared to standard Kubernetes configurations.

Conclusion: LLMs enable accessible semantic scheduling for workload orchestration, but production readiness requires addressing synchronous latency through asynchronous processing.

Abstract: Cluster workload allocation often requires complex configurations, creating a usability gap. This paper introduces a semantic, intent-driven scheduling paradigm for cluster systems using Natural Language Processing. The system employs a Large Language Model (LLM) integrated via a Kubernetes scheduler extender to interpret natural language allocation hint annotations for soft affinity preferences. A prototype featuring a cluster state cache and an intent analyzer (using AWS Bedrock) was developed. Empirical evaluation demonstrated high LLM parsing accuracy (>95% Subset Accuracy on an evaluation ground-truth dataset) for top-tier models like Amazon Nova Pro/Premier and Mistral Pixtral Large, significantly outperforming a baseline engine. Scheduling quality tests across six scenarios showed the prototype achieved superior or equivalent placement compared to standard Kubernetes configurations, particularly excelling in complex and quantitative scenarios and handling conflicting soft preferences. The results validate using LLMs for accessible scheduling but highlight limitations like synchronous LLM latency, suggesting asynchronous processing for production readiness. This work confirms the viability of semantic soft affinity for simplifying workload orchestration.

[259] Policy-Based Reinforcement Learning with Action Masking for Dynamic Job Shop Scheduling under Uncertainty: Handling Random Arrivals and Machine Failures

Sofiene Lassoued, Stefan Lier, Andreas Schwung

Main category: cs.AI

TL;DR: A novel framework combining Coloured Timed Petri Nets and Maskable Proximal Policy Optimization for dynamic job shop scheduling under uncertainty, outperforming traditional methods in makespan minimization.

DetailsMotivation: Address challenges in Dynamic Job Shop Scheduling Problems under uncertainty, including stochastic job arrivals and unexpected machine breakdowns, to better reflect real-world manufacturing scenarios with complex temporal patterns and machine degradation.

Method: Model-based paradigm using Coloured Timed Petri Nets to represent scheduling environment, combined with Maskable Proximal Policy Optimization for dynamic decision-making. Dynamic job arrivals modeled with Gamma distribution, machine failures with Weibull distribution. Two action-masking strategies studied: non-gradient approach overriding invalid action probabilities and gradient-based approach assigning negative gradients to invalid actions.
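
The two masking styles are easy to contrast in miniature (a numpy sketch; MaskablePPO in sb3-contrib implements the production version). Both yield the same masked distribution at inference; they differ in how gradients flow during training:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

logits = np.array([2.0, 1.0, 0.5, -0.5])
mask = np.array([True, False, True, True])  # action 1 is infeasible now

# (a) non-gradient: zero out invalid probabilities, then renormalize
p = softmax(logits) * mask
p_override = p / p.sum()

# (b) logit-level: -inf logits give exactly zero probability, so invalid
#     actions are never reinforced through the policy gradient
p_logit = softmax(np.where(mask, logits, -np.inf))

print(p_override.round(3), p_logit.round(3))
```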

Result: Extensive experiments on dynamic JSSP benchmarks show the method consistently outperforms traditional heuristic and rule-based approaches in terms of makespan minimization.

Conclusion: The framework demonstrates strength in combining interpretable Petri-net-based models with adaptive reinforcement learning policies, yielding a resilient, scalable, and explainable framework for real-time scheduling in dynamic and uncertain manufacturing environments.

Abstract: We present a novel framework for solving Dynamic Job Shop Scheduling Problems under uncertainty, addressing the challenges introduced by stochastic job arrivals and unexpected machine breakdowns. Our approach follows a model-based paradigm, using Coloured Timed Petri Nets to represent the scheduling environment, and Maskable Proximal Policy Optimization to enable dynamic decision-making while restricting the agent to feasible actions at each decision point. To simulate realistic industrial conditions, dynamic job arrivals are modeled using a Gamma distribution, which captures complex temporal patterns such as bursts, clustering, and fluctuating workloads. Machine failures are modeled using a Weibull distribution to represent age-dependent degradation and wear-out dynamics. These stochastic models enable the framework to reflect real-world manufacturing scenarios better. In addition, we study two action-masking strategies: a non-gradient approach that overrides the probabilities of invalid actions, and a gradient-based approach that assigns negative gradients to invalid actions within the policy network. We conduct extensive experiments on dynamic JSSP benchmarks, demonstrating that our method consistently outperforms traditional heuristic and rule-based approaches in terms of makespan minimization. The results highlight the strength of combining interpretable Petri-net-based models with adaptive reinforcement learning policies, yielding a resilient, scalable, and explainable framework for real-time scheduling in dynamic and uncertain manufacturing environments.

[260] Monte-Carlo Tree Search with Neural Network Guidance for Lane-Free Autonomous Driving

Ioannis Peridis, Dimitrios Troullinos, Georgios Chalkiadakis, Pantelis Giankoulidis, Ioannis Papamichail, Markos Papageorgiou

Main category: cs.AI

TL;DR: This paper proposes a Monte-Carlo Tree Search (MCTS) approach with neural network guidance for autonomous driving in lane-free traffic environments, evaluating safety, efficiency, and computational trade-offs.

DetailsMotivation: Lane-free traffic environments offer increased road capacity by allowing vehicles to use lateral space freely, but create more challenging autonomous driving scenarios that require advanced planning approaches beyond traditional lane-keeping systems.

Method: The authors use Monte-Carlo Tree Search (MCTS) planning for single-agent autonomous driving in lane-free traffic, formulating a Markov Decision Process influenced by reinforcement learning frameworks. The MCTS is enhanced with a pre-trained neural network that guides the selection phase, incorporating predictive capabilities for more informed tree search under computational constraints.
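
The NN-guided selection step is commonly realized with a PUCT-style rule, sketched below (an assumption about the exact formula; in the paper the priors would come from the pre-trained network):

```python
# Select the child maximizing value estimate + prior-weighted exploration.
import math

def select_child(children, c_puct=1.5):
    total_n = sum(ch["N"] for ch in children)
    def puct(ch):
        q = ch["W"] / ch["N"] if ch["N"] else 0.0           # mean value
        u = c_puct * ch["P"] * math.sqrt(total_n + 1) / (1 + ch["N"])
        return q + u
    return max(children, key=puct)

children = [{"N": 10, "W": 6.0, "P": 0.2, "action": "keep_speed"},
            {"N": 2,  "W": 1.5, "P": 0.6, "action": "nudge_left"}]
print(select_child(children)["action"])  # high prior + few visits wins
```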

Result: Experimental evaluation shows: (a) isotropic state information leads to nudging behavior where vehicles react to faster tailing ones, (b) NN-guided MCTS accelerates performance, and (c) there’s a trade-off between computational resources and solution quality.

Conclusion: The proposed NN-guided MCTS approach effectively addresses autonomous driving in lane-free environments, balancing safety (collision rates) and efficacy (speed) while managing computational constraints through intelligent search guidance.

Abstract: Lane-free traffic environments allow vehicles to better harness the lateral capacity of the road without being restricted to lane-keeping, thereby increasing the traffic flow rates. As such, we have a distinct and more challenging setting for autonomous driving. In this work, we consider a Monte-Carlo Tree Search (MCTS) planning approach for single-agent autonomous driving in lane-free traffic, where the associated Markov Decision Process we formulate is influenced from existing approaches tied to reinforcement learning frameworks. In addition, MCTS is equipped with a pre-trained neural network (NN) that guides the selection phase. This procedure incorporates the predictive capabilities of NNs for a more informed tree search process under computational constraints. In our experimental evaluation, we consider metrics that address both safety (through collision rates) and efficacy (through measured speed). Then, we examine: (a) the influence of isotropic state information for vehicles in a lane-free environment, resulting in nudging behaviour, where vehicles' policies react to the presence of faster tailing ones, (b) the acceleration of performance for the NN-guided variant of MCTS, and (c) the trade-off between computational resources and solution quality.

[261] Long-term Task-oriented Agent: Proactive Long-term Intent Maintenance in Dynamic Environments

Qinglong Shi, Donghai Wang, Hantao Zhou, Jiguo Li, Jun Xu, Jiuchong Gao, Jinghua Hao, Renqing He

Main category: cs.AI

TL;DR: Proactive task-oriented agents that monitor dynamic environments and trigger follow-ups based on user intent, with new benchmark and synthetic data pipeline achieving 85% task completion.

DetailsMotivation: Current LLM agents are reactive and short-term focused, unable to maintain long-term user intents or adapt to evolving environments. Need for proactive agents that bridge static user needs with dynamic environments.

Method: Proposed proactive paradigm with two capabilities: (1) Intent-Conditioned Monitoring - agent formulates trigger conditions from dialog history, (2) Event-Triggered Follow-up - agent engages user upon detecting useful environmental updates. Created data synthesis pipeline for complex multi-turn dialogs in dynamic environments.
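
A toy version of the two capabilities (the trigger encoding and the environment dict are our own invention for illustration):

```python
# Intent-conditioned monitoring: distill a trigger condition from dialog.
# Event-triggered follow-up: engage the user only when the condition fires.
triggers = [{"intent": "buy a GPU once it drops under $800",
             "condition": lambda env: env.get("gpu_price", float("inf")) < 800}]

def on_environment_update(env):
    for t in triggers:
        if t["condition"](env):
            print(f"Follow-up: '{t['intent']}' is now actionable: {env}")

on_environment_update({"gpu_price": 950})  # stays silent
on_environment_update({"gpu_price": 749})  # fires a proactive message
```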

Result: Fine-tuned model using synthetic data achieved 85.19% task completion rate for complex tasks with user intent shifts, outperforming other tested models. Introduced ChronosBench benchmark revealing flaws in existing models for long-term task-oriented interaction.

Conclusion: Proactive agents with intent monitoring and event-triggered follow-ups effectively bridge static user needs with dynamic environments. Data-driven strategy using synthetic training data significantly improves performance on complex long-term tasks.

Abstract: Current large language model agents predominantly operate under a reactive paradigm, responding only to immediate user queries within short-term sessions. This limitation hinders their ability to maintain long-term user intents and dynamically adapt to evolving external environments. In this paper, we propose a novel interaction paradigm for proactive Task-oriented Agents capable of bridging the gap between relatively static user needs and a dynamic environment. We formalize proactivity through two key capabilities: (i) Intent-Conditioned Monitoring: the agent autonomously formulates trigger conditions based on dialog history; (ii) Event-Triggered Follow-up: the agent actively engages the user upon detecting useful environmental updates. We introduce a high-quality data synthesis pipeline to construct complex, multi-turn dialog data in a dynamic environment. Furthermore, we address the lack of evaluation criteria for task-oriented interaction in dynamic environments by proposing a new benchmark, ChronosBench. We evaluate several leading closed-source and open-source models and reveal their flaws in long-term task-oriented interaction. Our fine-tuned model, trained on the synthetic data with supervised learning, achieves a task completion rate of 85.19% on complex tasks involving shifts in user intent, outperforming the other models under test and validating the effectiveness of our data-driven strategy.

[262] EvoFSM: Controllable Self-Evolution for Deep Research with Finite State Machines

Shuo Zhang, Chaofa Yuan, Ryan Guo, Xiaomin Yu, Rui Xu, Zhangquan Chen, Zinuo Li, Zhi Yang, Shuhao Guan, Zhenheng Tang, Sen Hu, Liwen Zhang, Ronghao Chen, Huacan Wang

Main category: cs.AI

TL;DR: EvoFSM is a structured self-evolving framework that evolves Finite State Machines instead of free-form code/prompt rewriting, achieving better adaptability and control for LLM-based agents on open-ended research tasks.

DetailsMotivation: Existing LLM-based agents use fixed workflows that struggle with open-ended queries, while recent self-evolution approaches suffer from instability, hallucinations, and instruction drift due to unconstrained optimization.

Method: EvoFSM evolves explicit Finite State Machines by decoupling optimization into macroscopic Flow (state-transition logic) and microscopic Skill (state-specific behaviors). It uses constrained operations guided by a critic mechanism and incorporates self-evolving memory that distills successful trajectories as reusable priors and failure patterns as constraints.
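
A minimal encoding of the evolvable FSM (our own sketch; EvoFSM's real state machine and operation set are richer): Flow is the transition table, Skill is the per-state prompt, and evolution is limited to a few whitelisted edits.

```python
# Flow = state-transition logic; Skill = state-specific behavior.
fsm = {
    "states": {"plan": "Draft a search plan.",
               "search": "Run the query.",
               "answer": "Compose the final answer."},
    "transitions": {("plan", "ok"): "search", ("search", "ok"): "answer"},
}

def edit_skill(fsm, state, new_prompt):          # constrained micro-edit
    assert state in fsm["states"], "may only refine existing states"
    fsm["states"][state] = new_prompt

def add_transition(fsm, src, event, dst):        # constrained macro-edit
    assert src in fsm["states"] and dst in fsm["states"]
    fsm["transitions"][(src, event)] = dst

# The critic flags empty search results, so evolution adds a retry loop
# and refines the planning skill:
add_transition(fsm, "search", "empty_results", "plan")
edit_skill(fsm, "plan", "Draft a search plan; avoid queries that failed before.")
print(sorted(fsm["transitions"].items()))
```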

Result: Extensive evaluations on five multi-hop QA benchmarks show effectiveness, with 58.0% accuracy on DeepSearch benchmark. Additional results on interactive decision-making tasks validate generalization.

Conclusion: EvoFSM provides a structured approach to self-evolution that balances adaptability and control, addressing limitations of both fixed workflows and unconstrained optimization methods for LLM-based agents.

Abstract: While LLM-based agents have shown promise for deep research, most existing approaches rely on fixed workflows that struggle to adapt to real-world, open-ended queries. Recent work therefore explores self-evolution by allowing agents to rewrite their own code or prompts to improve problem-solving ability, but unconstrained optimization often triggers instability, hallucinations, and instruction drift. We propose EvoFSM, a structured self-evolving framework that achieves both adaptability and control by evolving an explicit Finite State Machine (FSM) instead of relying on free-form rewriting. EvoFSM decouples the optimization space into macroscopic Flow (state-transition logic) and microscopic Skill (state-specific behaviors), enabling targeted improvements under clear behavioral boundaries. Guided by a critic mechanism, EvoFSM refines the FSM through a small set of constrained operations, and further incorporates a self-evolving memory that distills successful trajectories as reusable priors and failure patterns as constraints for future queries. Extensive evaluations on five multi-hop QA benchmarks demonstrate the effectiveness of EvoFSM. In particular, EvoFSM reaches 58.0% accuracy on the DeepSearch benchmark. Additional results on interactive decision-making tasks further validate its generalization.

[263] What Do LLM Agents Know About Their World? Task2Quiz: A Paradigm for Studying Environment Understanding

Siyuan Liu, Hongbang Yuan, Xinze Li, Ziyue Zhu, Yixin Cao, Yu-Gang Jiang

Main category: cs.AI

TL;DR: T2Q is a new evaluation paradigm that separates task execution from world-state understanding in LLM agents, revealing that task success doesn’t guarantee environment comprehension.

DetailsMotivation: Current LLM agent evaluations focus on task success metrics but fail to assess whether agents actually understand the environment they're operating in, creating a gap in measuring true generalization capabilities.

Method: Proposed Task-to-Quiz (T2Q) paradigm that automatically converts task execution into grounded QA pairs to test world-state understanding. Implemented T2QBench with 30 environments and 1,967 QA pairs across multiple difficulty levels.

Result: Task success is often a poor proxy for environment understanding, and current memory mechanisms don’t effectively help agents acquire grounded environment models. Identified proactive exploration and fine-grained state representation as key bottlenecks.

Conclusion: The T2Q paradigm provides a robust foundation for developing more generalizable autonomous agents by properly evaluating environment understanding separate from task execution.

Abstract: Large language model (LLM) agents have demonstrated remarkable capabilities in complex decision-making and tool-use tasks, yet their ability to generalize across varying environments remains an under-examined concern. Current evaluation paradigms predominantly rely on trajectory-based metrics that measure task success, while failing to assess whether agents possess a grounded, transferable model of the environment. To address this gap, we propose Task-to-Quiz (T2Q), a deterministic and automated evaluation paradigm designed to decouple task execution from world-state understanding. We instantiate this paradigm in T2QBench, a suite comprising 30 environments and 1,967 grounded QA pairs across multiple difficulty levels. Our extensive experiments reveal that task success is often a poor proxy for environment understanding, and that current memory mechanisms cannot effectively help agents acquire a grounded model of the environment. These findings identify proactive exploration and fine-grained state representation as primary bottlenecks, offering a robust foundation for developing more generalizable autonomous agents.

[264] Omni-R1: Towards the Unified Generative Paradigm for Multimodal Reasoning

Dongjie Cheng, Yongqi Li, Zhixin Ma, Hongru Cai, Yupeng Hu, Wenjie Wang, Liqiang Nie, Wenjie Li

Main category: cs.AI

TL;DR: Omni-R1 introduces unified generative multimodal reasoning that generates intermediate images during reasoning, enabling diverse multimodal skills across tasks through a two-stage SFT+RL framework with perception alignment.

DetailsMotivation: Current MLLMs either use pure text-based reasoning or single task-specific multimodal reasoning patterns, limiting generalizability across diverse multimodal tasks requiring different reasoning skills like zooming or object marking.

Method: Proposes unified generative multimodal reasoning paradigm that generates intermediate images during reasoning. Implements Omni-R1 with two-stage SFT+RL framework featuring perception alignment loss and perception reward for functional image generation. Also introduces Omni-R1-Zero that bootstraps visualizations from text-only data without multimodal annotations.

Result: Omni-R1 achieves unified generative reasoning across wide range of multimodal tasks. Omni-R1-Zero matches or even surpasses Omni-R1 on average, demonstrating promising direction for generative multimodal reasoning without multimodal annotations.

Conclusion: Unified generative multimodal reasoning with intermediate image generation enables diverse multimodal skills across tasks, and bootstrapping from text-only data offers promising annotation-free approach for multimodal reasoning.

Abstract: Multimodal Large Language Models (MLLMs) are making significant progress in multimodal reasoning. Early approaches focus on pure text-based reasoning. More recent studies have incorporated multimodal information into the reasoning steps; however, they often follow a single task-specific reasoning pattern, which limits their generalizability across various multimodal tasks. In fact, there are numerous multimodal tasks requiring diverse reasoning skills, such as zooming in on a specific region or marking an object within an image. To address this, we propose unified generative multimodal reasoning, which unifies diverse multimodal reasoning skills by generating intermediate images during the reasoning process. We instantiate this paradigm with Omni-R1, a two-stage SFT+RL framework featuring perception alignment loss and perception reward, thereby enabling functional image generation. Additionally, we introduce Omni-R1-Zero, which eliminates the need for multimodal annotations by bootstrapping step-wise visualizations from text-only reasoning data. Empirical results show that Omni-R1 achieves unified generative reasoning across a wide range of multimodal tasks, and Omni-R1-Zero can match or even surpass Omni-R1 on average, suggesting a promising direction for generative multimodal reasoning.

[265] LLM for Large-Scale Optimization Model Auto-Formulation: A Lightweight Few-Shot Learning Approach

Kuo Liang, Yuhang Lu, Jianming Mao, Shuyi Sun, Chunwei Yang, Congcong Zeng, Xiao Jin, Hanzhang Qin, Ruihao Zhu, Chung-Piaw Teo

Main category: cs.AI

TL;DR: LEAN-LLM-OPT is a lightweight LLM-based framework that automates large-scale optimization model formulation by orchestrating LLM agents to construct workflows and generate optimization models from problem descriptions and datasets.

DetailsMotivation: Building large-scale optimization models is labor-intensive and time-consuming. There's a need to automate the formulation process to improve efficiency in business decision-making.

Method: Uses a multi-agent LLM framework: upstream agents dynamically construct step-by-step workflows for similar problems, while a downstream agent follows the workflow to generate final optimization formulations. Decomposes modeling into structured sub-tasks and offloads mechanical data-handling to auxiliary tools.
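As a rough illustration of this upstream/downstream split, the sketch below wires two workflow-construction agents ahead of a formulation agent. `call_llm` is a hypothetical stand-in for any chat-completion client; the actual LEAN-LLM-OPT prompts, tools, and agent roles are more elaborate.

```python
# Sketch of the LEAN-LLM-OPT agent pattern: two upstream agents draft a
# step-by-step modeling workflow, one downstream agent executes it.
# `call_llm` is a hypothetical stand-in for any chat-completion client.

def call_llm(prompt: str) -> str:
    raise NotImplementedError("plug in your LLM client here")

def build_workflow(problem: str, data_summary: str) -> str:
    # Upstream agent 1: classify the problem and name a template.
    template = call_llm(
        f"Classify this optimization problem and name a standard "
        f"formulation template for it:\n{problem}"
    )
    # Upstream agent 2: expand the template into concrete steps tied
    # to the dataset schema (sets, parameters, variables, constraints).
    return call_llm(
        f"Given template:\n{template}\nand data summary:\n{data_summary}\n"
        f"Write a numbered, step-by-step modeling workflow."
    )

def formulate(problem: str, data_summary: str) -> str:
    workflow = build_workflow(problem, data_summary)
    # Downstream agent: follow the workflow to emit the final model;
    # mechanical data handling would be offloaded to auxiliary tools.
    return call_llm(
        f"Follow this workflow exactly and output a complete "
        f"optimization model:\n{workflow}\n\nProblem:\n{problem}"
    )
```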

Result: Achieves strong performance on large-scale optimization tasks, competitive with state-of-the-art approaches. Demonstrates practical value in Singapore Airlines revenue management use case. Introduces two new benchmarks: Large-Scale-OR and Air-NRM.

Conclusion: LEAN-LLM-OPT effectively automates optimization model formulation, reducing manual effort while maintaining high performance. The framework’s workflow-based approach allows LLMs to focus on complex modeling components while standardizing routine tasks.

Abstract: Large-scale optimization is a key backbone of modern business decision-making. However, building these models is often labor-intensive and time-consuming. We address this by proposing LEAN-LLM-OPT, a LightwEight AgeNtic workflow construction framework for LLM-assisted large-scale OPTimization auto-formulation. LEAN-LLM-OPT takes as input a problem description together with associated datasets and orchestrates a team of LLM agents to produce an optimization formulation. Specifically, upon receiving a query, two upstream LLM agents dynamically construct a workflow that specifies, step-by-step, how optimization models for similar problems can be formulated. A downstream LLM agent then follows this workflow to generate the final output. Leveraging LLMs’ text-processing capabilities and common modeling practices, the workflow decomposes the modeling task into a sequence of structured sub-tasks and offloads mechanical data-handling operations to auxiliary tools. This design alleviates the downstream agent’s burden related to planning and data handling, allowing it to focus on the most challenging components that cannot be readily standardized. Extensive simulations show that LEAN-LLM-OPT, instantiated with GPT-4.1 and the open source gpt-oss-20B, achieves strong performance on large-scale optimization modeling tasks and is competitive with state-of-the-art approaches. In addition, in a Singapore Airlines choice-based revenue management use case, LEAN-LLM-OPT demonstrates practical value by achieving leading performance across a range of scenarios. Along the way, we introduce Large-Scale-OR and Air-NRM, the first comprehensive benchmarks for large-scale optimization auto-formulation. The code and data of this work are available at https://github.com/CoraLiang01/lean-llm-opt.

[266] PersonalAlign: Hierarchical Implicit Intent Alignment for Personalized GUI Agent with Long-Term User-Centric Records

Yibo Lyu, Gongwei Chen, Rui Shao, Weili Guan, Liqiang Nie

Main category: cs.AI

TL;DR: PersonalAlign introduces hierarchical implicit intent alignment for GUI agents using long-term user records to resolve vague instructions and anticipate latent routines, with AndroidIntent benchmark showing HIM-Agent improves performance by 15.7% on execution and 7.3% on proactive tasks.

DetailsMotivation: Current GUI agents perform well with explicit instructions but struggle with real-world deployment requiring alignment with users' complex implicit intents, which involve resolving omitted preferences in vague instructions and anticipating latent routines based on user state.

Method: Introduces Hierarchical Intent Memory Agent (HIM-Agent) that maintains continuously updating personal memory and hierarchically organizes user preferences and routines for personalization, evaluated on AndroidIntent benchmark with 775 user-specific preferences and 215 routines from 20k long-term records.

Result: HIM-Agent significantly outperforms other GUI agents (GPT-5, Qwen3-VL, UI-TARS) on AndroidIntent benchmark, improving execution performance by 15.7% and proactive performance by 7.3%.

Conclusion: The work demonstrates the importance of hierarchical implicit intent alignment for personalized GUI agents and shows that HIM-Agent’s approach of maintaining and organizing long-term user records enables effective resolution of vague instructions and proactive assistance.

Abstract: While GUI agents have shown strong performance under explicit and complete instructions, real-world deployment requires aligning with users’ more complex implicit intents. In this work, we highlight Hierarchical Implicit Intent Alignment for Personalized GUI Agent (PersonalAlign), a new agent task that requires agents to leverage long-term user records as persistent context to resolve omitted preferences in vague instructions and anticipate latent routines by user state for proactive assistance. To facilitate this study, we introduce AndroidIntent, a benchmark designed to evaluate agents’ ability in resolving vague instructions and providing proactive suggestions through reasoning over long-term user records. We annotated 775 user-specific preferences and 215 routines from 20k long-term records across different users for evaluation. Furthermore, we introduce Hierarchical Intent Memory Agent (HIM-Agent), which maintains a continuously updating personal memory and hierarchically organizes user preferences and routines for personalization. Finally, we evaluate a range of GUI agents on AndroidIntent, including GPT-5, Qwen3-VL, and UI-TARS; results show that HIM-Agent significantly improves both execution and proactive performance, by 15.7% and 7.3% respectively.

[267] Collaborative Multi-Agent Test-Time Reinforcement Learning for Reasoning

Zhiyuan Hu, Yunhai Hu, Juncheng Liu, Shuyue Stella Li, Yucheng Wang, Zhen Xu, See-Kiong Ng, Anh Tuan Luu, Xinxing Xu, Bryan Hooi, Cynthia Breazeal, Hae Won Park

Main category: cs.AI

TL;DR: MATTRL introduces test-time reinforcement learning for multi-agent systems, using textual experience injection during inference to improve decision-making without expensive training.

DetailsMotivation: Traditional multi-agent RL training is resource-intensive, unstable, and suffers from non-stationarity due to co-adapting teammates. Sparse, high-variance rewards make training difficult, creating a need for more efficient approaches.

Method: Forms multi-expert team of specialists for multi-turn discussions, retrieves and integrates test-time experiences from a turn-level experience pool, uses credit assignment for experience construction, and reaches consensus for final decisions.
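A minimal sketch of the test-time experience loop, assuming a hypothetical `embed` sentence-embedding function and `ask_expert` LLM wrapper; MATTRL's actual retrieval, credit assignment, and consensus protocol are richer than this majority vote.

```python
# Sketch: retrieve past turn-level experiences by cosine similarity,
# inject them into each expert's prompt, take a majority-vote consensus.
from collections import Counter

import numpy as np

def retrieve(query_vec, pool, k=3):
    # pool: list of (embedding, experience_text) from past turns.
    sims = [float(np.dot(query_vec, e) /
                  (np.linalg.norm(query_vec) * np.linalg.norm(e)))
            for e, _ in pool]
    top = np.argsort(sims)[-k:]
    return [pool[i][1] for i in top]

def mattrl_round(question, experts, pool, embed, ask_expert):
    exps = retrieve(embed(question), pool) if pool else []
    context = "\n".join(f"Past experience: {e}" for e in exps)
    answers = [ask_expert(role, f"{context}\n\nQ: {question}")
               for role in experts]
    # Consensus: simple majority vote over the experts' final answers.
    return Counter(answers).most_common(1)[0][0]
```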

Result: Improves accuracy by average 3.67% over multi-agent baseline and 8.67% over single-agent baselines across medicine, math, and education benchmarks. Ablation studies examine different credit-assignment schemes.

Conclusion: MATTRL provides stable, effective, and efficient path to distribution-shift-robust multi-agent reasoning without requiring expensive tuning or training.

Abstract: Multi-agent systems have evolved into practical LLM-driven collaborators for many applications, gaining robustness from diversity and cross-checking. However, multi-agent RL (MARL) training is resource-intensive and unstable: co-adapting teammates induce non-stationarity, and rewards are often sparse and high-variance. Therefore, we introduce Multi-Agent Test-Time Reinforcement Learning (MATTRL), a framework that injects structured textual experience into multi-agent deliberation at inference time. MATTRL forms a multi-expert team of specialists for multi-turn discussions, retrieves and integrates test-time experiences, and reaches consensus for final decision-making. We also study credit assignment for constructing a turn-level experience pool, then reinjecting it into the dialogue. Across challenging benchmarks in medicine, math, and education, MATTRL improves accuracy by an average of 3.67% over a multi-agent baseline, and by 8.67% over comparable single-agent baselines. Ablation studies examine different credit-assignment schemes and provide a detailed comparison of how they affect training outcomes. MATTRL offers a stable, effective, and efficient path to distribution-shift-robust multi-agent reasoning without tuning.

[268] Automating Supply Chain Disruption Monitoring via an Agentic AI Approach

Sara AlMahri, Liming Xu, Alexandra Brintrup

Main category: cs.AI

TL;DR: AI agent framework for proactive supply chain disruption monitoring across multi-tier networks using LLMs and deterministic tools

DetailsMotivation: Modern supply chains lack visibility beyond Tier-1 suppliers, leaving upstream vulnerabilities undetected until disruptions cascade downstream, creating reactive rather than proactive resilience

Method: Minimally supervised agentic AI framework with seven specialized agents powered by LLMs and deterministic tools that detect disruption signals from unstructured news, map to multi-tier supplier networks, evaluate exposure based on network structure, and recommend mitigations

Result: Achieves F1 scores between 0.962 and 0.991 across core tasks, performs end-to-end analyses in a mean of 3.83 minutes at $0.0836 per disruption, representing a reduction of more than three orders of magnitude in response time compared to industry benchmarks; validated with 30 synthesized scenarios and a real-world Russia-Ukraine conflict case study

Conclusion: Establishes foundational step toward building resilient, proactive, and autonomous supply chains capable of managing disruptions across deep-tier networks

Abstract: Modern supply chains are increasingly exposed to disruptions ranging from geopolitical events and demand shocks to trade restrictions and natural disasters. While many of these disruptions originate deep in the supply network, most companies still lack visibility beyond Tier-1 suppliers, leaving upstream vulnerabilities undetected until the impact cascades downstream. To overcome this blind-spot and move from reactive recovery to proactive resilience, we introduce a minimally supervised agentic AI framework that autonomously monitors, analyses, and responds to disruptions across extended supply networks. The architecture comprises seven specialised agents powered by large language models and deterministic tools that jointly detect disruption signals from unstructured news, map them to multi-tier supplier networks, evaluate exposure based on network structure, and recommend mitigations such as alternative sourcing options. We evaluate the framework across 30 synthesised scenarios covering three automotive manufacturers and five disruption classes. The system achieves high accuracy across core tasks, with F1 scores between 0.962 and 0.991, and performs full end-to-end analyses in a mean of 3.83 minutes at a cost of $0.0836 per disruption. Relative to industry benchmarks of multi-day, analyst-driven assessments, this represents a reduction of more than three orders of magnitude in response time. A real-world case study of the 2022 Russia-Ukraine conflict further demonstrates operational applicability. This work establishes a foundational step toward building resilient, proactive, and autonomous supply chains capable of managing disruptions across deep-tier networks.

[269] Evaluating Detection Thresholds: The Impact of False Positives and Negatives on Super-Resolution Ultrasound Localization Microscopy

Sepideh K. Gharamaleki, Brandon Helfield, Hassan Rivaz

Main category: cs.AI

TL;DR: False Positives and False Negatives in microbubble detection significantly impact ultrasound localization microscopy image quality, with FNs causing greater structural similarity degradation than FPs.

DetailsMotivation: ULM image quality depends heavily on precise microbubble detection, but practical pitfalls like threshold settings and their impact on false positives/negatives haven't been sufficiently studied.

Method: Systematically adding controlled detection errors (false positives and false negatives) to simulated ultrasound localization microscopy data to examine their effects on image quality metrics.
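A small sketch of this error-injection protocol, assuming Gaussian-blob rendering of localizations and standard scikit-image metrics; the rendering parameters and error rates below are illustrative, not the paper's exact pipeline.

```python
# Sketch: drop true microbubble detections (FNs) or add spurious ones
# (FPs), render localization maps, and compare PSNR/SSIM to the clean map.
import numpy as np
from scipy.ndimage import gaussian_filter
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

rng = np.random.default_rng(0)

def render(points, shape=(256, 256), sigma=1.0):
    img = np.zeros(shape)
    for y, x in points:
        img[int(y) % shape[0], int(x) % shape[1]] += 1.0
    return gaussian_filter(img, sigma)

def corrupt(points, fp_rate=0.0, fn_rate=0.0, shape=(256, 256)):
    kept = [p for p in points if rng.random() > fn_rate]   # inject FNs
    n_fp = int(fp_rate * len(points))                      # inject FPs
    fps = list(zip(rng.uniform(0, shape[0], n_fp),
                   rng.uniform(0, shape[1], n_fp)))
    return kept + fps

truth = list(zip(rng.uniform(0, 256, 500), rng.uniform(0, 256, 500)))
clean = render(truth)
for fp, fn in [(0.2, 0.0), (0.0, 0.2)]:
    noisy = render(corrupt(truth, fp, fn))
    drange = clean.max() - clean.min()
    print(fp, fn,
          peak_signal_noise_ratio(clean, noisy, data_range=drange),
          structural_similarity(clean, noisy, data_range=drange))
```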

Result: Both FP and FN rates similarly affect PSNR, but FNs cause much greater SSIM degradation (45% drop vs 7% for FPs). Dense microbubble regions are more resilient to errors while sparse regions are highly sensitive.

Conclusion: Robust microbubble detection frameworks are crucial for enhancing super-resolution ultrasound imaging, especially considering the disproportionate impact of false negatives on structural image quality.

Abstract: Super-resolution ultrasound imaging with ultrasound localization microscopy (ULM) offers a high-resolution view of microvascular structures. Yet, ULM image quality heavily relies on precise microbubble (MB) detection. Despite the crucial role of localization algorithms, there has been limited focus on the practical pitfalls in MB detection tasks such as setting the detection threshold. This study examines how False Positives (FPs) and False Negatives (FNs) affect ULM image quality by systematically adding controlled detection errors to simulated data. Results indicate that while both FP and FN rates impact Peak Signal-to-Noise Ratio (PSNR) similarly, increasing FP rates from 0% to 20% decreases the Structural Similarity Index (SSIM) by 7%, whereas the same FN rates cause a greater drop of around 45%. Moreover, dense MB regions are more resilient to detection errors, while sparse regions show high sensitivity, underscoring the need for robust MB detection frameworks to enhance super-resolution imaging.

[270] AutoToM: Scaling Model-based Mental Inference via Automated Agent Modeling

Zhining Zhang, Chuanyang Jin, Mung Yao Jia, Shunchi Zhang, Tianmin Shu

Main category: cs.AI

TL;DR: AutoToM: Automated agent modeling method for scalable, robust, and interpretable mental inference that outperforms existing ToM methods across diverse benchmarks.

DetailsMotivation: Current ToM reasoning approaches have limitations: LLM-based prompting is prone to systematic errors, while handcrafted agent models are robust but fail to generalize across domains. Need for automated, scalable, and robust mental inference.

Method: AutoToM proposes initial agent model, performs automated Bayesian inverse planning using LLM backend, and iteratively refines model based on inference uncertainty by adding mental variables and/or incorporating more timesteps.
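The inference core that AutoToM automates is Bayesian inverse planning: P(goal | actions) ∝ P(goal) · Π_t P(action_t | goal). Below is a toy version with hand-set likelihoods, which AutoToM would instead elicit from its LLM backend over an automatically proposed agent model.

```python
# Minimal Bayesian inverse planning: infer a posterior over goals from
# observed actions. Likelihood values here are toy numbers.
def posterior_over_goals(actions, goals, prior, likelihood):
    # likelihood[(action, goal)] = P(action | goal)
    post = dict(prior)
    for a in actions:
        for g in goals:
            post[g] *= likelihood[(a, g)]
    z = sum(post.values())
    return {g: p / z for g, p in post.items()}

goals = ["get_coffee", "get_tea"]
prior = {g: 0.5 for g in goals}
likelihood = {("walk_to_kitchen", "get_coffee"): 0.9,
              ("walk_to_kitchen", "get_tea"): 0.8,
              ("grab_mug_near_kettle", "get_coffee"): 0.3,
              ("grab_mug_near_kettle", "get_tea"): 0.7}
print(posterior_over_goals(
    ["walk_to_kitchen", "grab_mug_near_kettle"], goals, prior, likelihood))
```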

Result: Outperforms existing ToM methods and large reasoning models across five diverse benchmarks. Can produce human-like confidence estimates and enable online mental inference for embodied decision-making.

Conclusion: AutoToM provides an automated agent modeling approach that achieves scalable, robust, and interpretable mental inference, addressing limitations of current ToM methods while demonstrating strong performance across diverse domains.

Abstract: Theory of Mind (ToM), the ability to understand people’s minds based on their behavior, is key to developing socially intelligent agents. Current approaches to ToM reasoning either rely on prompting Large Language Models (LLMs), which are prone to systematic errors, or use handcrafted, rigid agent models for model-based inference, which are more robust but fail to generalize across domains. In this work, we introduce AutoToM, an automated agent modeling method for scalable, robust, and interpretable mental inference. Given a ToM problem, AutoToM first proposes an initial agent model and then performs automated Bayesian inverse planning based on this model, leveraging an LLM backend. Guided by inference uncertainty, it iteratively refines the model by introducing additional mental variables and/or incorporating more timesteps in the context. Across five diverse benchmarks, AutoToM outperforms existing ToM methods and even large reasoning models. Additionally, we show that AutoToM can produce human-like confidence estimates and enable online mental inference for embodied decision-making.

[271] Advancing AI Negotiations: A Large-Scale Autonomous Negotiation Competition

Michelle Vaccaro, Michael Caosun, Harang Ju, Sinan Aral, Jared R. Curhan

Main category: cs.AI

TL;DR: AI negotiation competition shows human negotiation principles remain relevant for AI-AI negotiations, with warmth being surprisingly effective and dominance good for value claiming. AI-specific strategies like chain-of-thought also matter, suggesting need for new AI negotiation theory.

DetailsMotivation: To understand how negotiation principles apply in AI-AI contexts and identify unique dynamics in autonomous negotiations between AI agents.

Method: International AI Negotiation Competition with participants designing prompts for AI agents, followed by over 180,000 negotiations across diverse scenarios, analyzed using NLP methods on full transcripts.

Result: Warmth consistently associated with superior outcomes across all metrics; dominance effective for value claiming; AI-specific strategies like chain-of-thought and prompt injection matter; positivity, gratitude, and question-asking linked to deal success; conversation length linked to impasses.

Conclusion: Need to establish new theory of AI negotiation integrating classic negotiation theory with AI-specific negotiation theories to better understand autonomous negotiations and optimize agent performance.

Abstract: We conducted an International AI Negotiation Competition in which participants designed and refined prompts for AI negotiation agents. We then facilitated over 180,000 negotiations between these agents across multiple scenarios with diverse characteristics and objectives. Our findings revealed that principles from human negotiation theory remain crucial even in AI-AI contexts. Surprisingly, warmth – a traditionally human relationship-building trait – was consistently associated with superior outcomes across all key performance metrics. Dominant agents, meanwhile, were especially effective at claiming value. Our analysis also revealed unique dynamics in AI-AI negotiations not fully explained by existing theory, including AI-specific technical strategies like chain-of-thought reasoning and prompt injection. When we applied natural language processing (NLP) methods to the full transcripts of all negotiations, we found positivity, gratitude, and question-asking (associated with warmth) were strongly associated with reaching deals as well as objective and subjective value, whereas conversation lengths (associated with dominance) were strongly associated with impasses. The results suggest the need to establish a new theory of AI negotiation, which integrates classic negotiation theory with AI-specific negotiation theories to better understand autonomous negotiations and optimize agent performance.

[272] Epistemic Skills: Reasoning about Knowledge and Oblivion

Xiaolong Liang, Yì N. Wáng

Main category: cs.AI

TL;DR: A novel epistemic logic framework using weighted models and “epistemic skills” to model knowledge acquisition (upskilling) and oblivion (downskilling), with analysis of group knowledge, knowability/forgettability, de re/de dicto distinctions, and computational complexity.

DetailsMotivation: To develop a formal epistemic logic that captures both knowledge acquisition and forgetting processes, while incorporating group knowledge dynamics and providing a metric-based approach to epistemic capacities.

Method: Uses a system of weighted models with an “epistemic skills” metric to represent epistemic capacities. Knowledge acquisition is modeled as upskilling, oblivion as downskilling. The framework analyzes knowability (potential to gain knowledge) and forgettability (potential to lapse into oblivion).

Result: The framework enables exploration of group knowledge dynamics, distinguishes between epistemic de re and de dicto expressions, and provides analysis of computational complexity for model checking and satisfiability problems.

Conclusion: The proposed epistemic logic framework successfully captures the dynamics of knowledge acquisition and oblivion using a weighted model approach with epistemic skills metrics, offering theoretical foundations and practical implications for epistemic reasoning systems.

Abstract: This paper presents a class of epistemic logics that captures the dynamics of acquiring knowledge and descending into oblivion, while incorporating concepts of group knowledge. The approach is grounded in a system of weighted models, introducing an “epistemic skills” metric to represent the epistemic capacities tied to knowledge updates. Within this framework, knowledge acquisition is modeled as a process of upskilling, whereas oblivion is represented as a consequence of downskilling. The framework further enables exploration of “knowability” and “forgettability,” defined as the potential to gain knowledge through upskilling and to lapse into oblivion through downskilling, respectively. Additionally, it supports a detailed analysis of the distinctions between epistemic de re and de dicto expressions. The computational complexity of the model checking and satisfiability problems is examined, offering insights into their theoretical foundations and practical implications.

[273] Fodor and Pylyshyn’s Legacy: Still No Human-like Systematic Compositionality in Neural Networks

Tim Woydt, Moritz Willig, Antonia Wüst, Lukas Helff, Wolfgang Stammer, Constantin A. Rothkopf, Kristian Kersting

Main category: cs.AI

TL;DR: The paper argues that despite recent claims, neural meta-learning systems still lack true human-like systematic compositionality, and Fodor & Pylyshyn’s critique remains valid.

DetailsMotivation: To critically examine recent claims that meta-learning provides a pathway to compositionality in neural networks, challenging the assertion that modern systems have overcome Fodor and Pylyshyn's critique about neural networks' inability to model compositional representations.

Method: Position paper analysis that critically revisits the meta-learning framework for compositionality, examining limitations and analyzing under what narrow conditions modern neural meta-learning systems can perform compositional tasks.

Result: Analysis shows that modern neural meta-learning systems can only perform compositional tasks under very narrow and restricted definitions of meta-learning setups, not achieving human-like systematic compositionality.

Conclusion: Fodor and Pylyshyn’s critique persists - to date, there is no evidence of human-like systematic compositionality learned in neural networks, despite recent claims about meta-learning as a solution.

Abstract: Strong meta-learning capabilities for systematic compositionality are emerging as an important skill for navigating the complex and changing tasks of today’s world. However, in presenting models for robust adaptation to novel environments, it is important to refrain from making unsupported claims about the performance of meta-learning systems that ultimately do not stand up to scrutiny. While Fodor and Pylyshyn famously posited that neural networks inherently lack this capacity as they are unable to model compositional representations or structure-sensitive operations, and thus are not a viable model of the human mind, Lake and Baroni recently presented meta-learning as a pathway to compositionality. In this position paper, we critically revisit this claim and highlight limitations in the proposed meta-learning framework for compositionality. Our analysis shows that modern neural meta-learning systems can only perform such tasks, if at all, under a very narrow and restricted definition of a meta-learning setup. We therefore claim that “Fodor and Pylyshyn’s legacy” persists, and to date, there is no human-like systematic compositionality learned in neural networks.

[274] Memory Mosaics at scale

Jianyu Zhang, Léon Bottou

Main category: cs.AI

TL;DR: Memory Mosaics v2 scaled to 10B parameters on 1T tokens outperforms transformers on new-task inference and in-context learning, beating transformers trained on 8x more data.

DetailsMotivation: Previous Memory Mosaics showed promising compositional and in-context learning capabilities at GPT-2 scale on synthetic data, but needed validation at larger LLM scales (Llama-8B) on real-world datasets.

Method: Scaling Memory Mosaics to 10B parameters, training on 1 trillion tokens, introducing architectural modifications (“Memory Mosaics v2”), and evaluating across three dimensions: training-knowledge storage, new-knowledge storage, and in-context learning.
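Schematically, the building block is an associative memory that answers a query with a similarity-weighted average over stored (key, value) pairs; the sketch below shows only that core lookup, not the Memory Mosaics v2 architecture or its training.

```python
# Schematic associative-memory read: predict a value as a
# similarity-weighted average over stored (key, value) pairs.
import numpy as np

def memory_read(query, keys, values, beta=8.0):
    # keys: (n, d), values: (n, d_v), query: (d,)
    scores = keys @ query * beta
    w = np.exp(scores - scores.max())   # softmax weights over memories
    w /= w.sum()
    return w @ values

keys = np.random.default_rng(1).standard_normal((16, 4))
values = np.random.default_rng(2).standard_normal((16, 3))
print(memory_read(keys[5], keys, values))  # ≈ values[5] for large beta
```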

Result: Memory Mosaics v2 match transformers on training knowledge learning, but significantly outperform transformers on new-task inference and in-context learning. These improvements persist even when comparing to transformers trained on 8x more data (8 trillion tokens).

Conclusion: Memory Mosaics maintain their favorable compositional and in-context learning properties when scaled to large language model sizes on real-world data, offering superior performance on new-task inference compared to transformers even with significantly less training data.

Abstract: Memory Mosaics [Zhang et al., 2025], networks of associative memories, have demonstrated appealing compositional and in-context learning capabilities on medium-scale networks (GPT-2 scale) and synthetic small datasets. This work shows that these favorable properties remain when we scale memory mosaics to large language model sizes (llama-8B scale) and real-world datasets. To this end, we scale memory mosaics to 10B size, we train them on one trillion tokens, we introduce a couple of architectural modifications (“Memory Mosaics v2”), and we assess their capabilities across three evaluation dimensions: training-knowledge storage, new-knowledge storage, and in-context learning. Throughout the evaluation, memory mosaics v2 match transformers on the learning of training knowledge (first dimension) and significantly outperform transformers on carrying out new tasks at inference time (second and third dimensions). These improvements cannot be easily replicated by simply increasing the training data for transformers. A Memory Mosaics v2 model trained on one trillion tokens still performs better on these tasks than a transformer trained on eight trillion tokens.

[275] Development and Evaluation of HopeBot: an LLM-based chatbot for structured and interactive PHQ-9 depression screening

Zhijun Guo, Alvina Lai, Julia Ive, Alexandru Petcu, Yutong Wang, Luyuan Qi, Johan H Thygesen, Kezhi Li

Main category: cs.AI

TL;DR: HopeBot is an LLM-powered chatbot that administers PHQ-9 depression screening with better user experience than traditional self-administered methods, showing high user acceptance and trust.

DetailsMotivation: Traditional depression screening tools like PHQ-9 are static, non-interactive, and lack adaptability. There's a need for more engaging, supportive, and scalable screening methods that can provide real-time clarification and guidance.

Method: Developed HopeBot, a voice-based chatbot using large language model with retrieval-augmented generation for PHQ-9 administration. Conducted within-subject study with 132 adults in UK and China comparing self-administered vs chatbot versions, collecting quantitative scores and qualitative feedback.

Result: Strong agreement between chatbot and self-administered PHQ-9 scores (ICC = 0.91; 45% identical). 71% reported greater trust in chatbot due to clearer structure, interpretive guidance, and supportive tone. High user ratings: 8.4/10 comfort, 7.7/10 voice clarity, 7.6/10 handling sensitive topics, 7.4/10 recommendation helpfulness. 87.1% willing to reuse or recommend HopeBot.

Conclusion: Voice-based LLM chatbots like HopeBot are feasible, scalable, low-burden adjuncts for routine depression screening, offering improved user experience and trust compared to traditional static methods.

Abstract: Static tools like the Patient Health Questionnaire-9 (PHQ-9) effectively screen depression but lack interactivity and adaptability. We developed HopeBot, a chatbot powered by a large language model (LLM) that administers the PHQ-9 using retrieval-augmented generation and real-time clarification. In a within-subject study, 132 adults in the United Kingdom and China completed both self-administered and chatbot versions. Scores demonstrated strong agreement (ICC = 0.91; 45% identical). Among 75 participants providing comparative feedback, 71% reported greater trust in the chatbot, highlighting clearer structure, interpretive guidance, and a supportive tone. Mean ratings (0-10) were 8.4 for comfort, 7.7 for voice clarity, 7.6 for handling sensitive topics, and 7.4 for recommendation helpfulness; the latter varied significantly by employment status and prior mental-health service use (p < 0.05). Overall, 87.1% expressed willingness to reuse or recommend HopeBot. These findings demonstrate voice-based LLM chatbots can feasibly serve as scalable, low-burden adjuncts for routine depression screening.

[276] A Curriculum Learning Approach to Reinforcement Learning: Leveraging RAG for Multimodal Question Answering

Chenliang Zhang, Lin Wang, Yuanyuan Lu, Yusheng Qi, Kexin Wang, Peixu Hou, Wenshi Chen

Main category: cs.AI

TL;DR: Dianping-Trust-Safety team’s winning solution for META CRAG-MM challenge uses vision LLM with GPT-4.1 distillation, curriculum learning for RL, and web search APIs for multi-modal multi-turn QA.

DetailsMotivation: To build a comprehensive retrieval-augmented generation system for multi-modal multi-turn question answering as required by the META CRAG-MM challenge, which involves three complex tasks combining structured data, knowledge graphs, web search, and conversational context.

Method: For Task 1: Vision large language model enhanced by supervised fine-tuning with knowledge distilled from GPT-4.1, plus curriculum learning strategies to guide reinforcement learning. For Tasks 2 & 3: Additional integration of web search APIs to incorporate external knowledge for handling complex queries and multi-turn conversations.

Result: Achieved 1st place in Task 1 with a significant lead of 52.38%, and 3rd place in Task 3, demonstrating the effectiveness of the integration of curriculum learning with reinforcement learning in the training pipeline.

Conclusion: The proposed approach combining vision LLMs with GPT-4.1 distillation, curriculum learning for RL, and web search integration effectively addresses multi-modal multi-turn QA challenges, achieving top competition results with reduced hallucination and improved accuracy.

Abstract: This paper describes the solutions of the Dianping-Trust-Safety team for the META CRAG-MM challenge. The challenge requires building a comprehensive retrieval-augmented generation system capable of multi-modal multi-turn question answering. The competition consists of three tasks: (1) answering questions using structured data retrieved from an image-based mock knowledge graph, (2) synthesizing information from both knowledge graphs and web search results, and (3) handling multi-turn conversations that require context understanding and information aggregation from multiple sources. For Task 1, our solution is based on a vision large language model, enhanced by supervised fine-tuning with knowledge distilled from GPT-4.1. We further applied curriculum learning strategies to guide reinforcement learning, resulting in improved answer accuracy and reduced hallucination. For Task 2 and Task 3, we additionally leveraged web search APIs to incorporate external knowledge, enabling the system to better handle complex queries and multi-turn conversations. Our approach achieved 1st place in Task 1 with a significant lead of 52.38%, and 3rd place in Task 3, demonstrating the effectiveness of the integration of curriculum learning with reinforcement learning in our training pipeline.

[277] Large Language Model-Based Automatic Formulation for Stochastic Optimization Models

Amirreza Talebi

Main category: cs.AI

TL;DR: LLMs (ChatGPT) can formulate and solve stochastic optimization problems from natural language using structured prompts and multi-agent collaboration, with GPT-4-Turbo generally outperforming GPT-3.5.

DetailsMotivation: To systematically evaluate how well large language models can automatically formulate and solve stochastic optimization problems from natural language descriptions, which could enable intelligent, language-driven modeling pipelines for practical applications.

Method: Designed structured prompts using chain-of-thought and agentic reasoning to guide ChatGPT through three categories of stochastic optimization problems. Introduced a novel soft-scoring metric to evaluate structural quality and partial correctness of generated models.

Result: GPT-4-Turbo achieved better partial scores than GPT-3.5 variants across most problem types except individual chance-constrained problems. Structured prompts significantly outperformed simple prompting, reducing extra-element generation and improving objective matching.

Conclusion: With well-engineered prompts and multi-agent collaboration, LLMs can facilitate stochastic optimization formulations, paving the way for practical language-driven modeling pipelines.

Abstract: This paper presents an integrated systematic study of the performance of large language models (LLMs), specifically ChatGPT, for automatically formulating and solving Stochastic Optimization (SO) problems from natural language descriptions. Focusing on three key categories (individual chance-constrained models, joint chance-constrained models, and two-stage stochastic mixed-integer linear programming models), we design several prompts that guide ChatGPT through structured tasks using chain-of-thought and agentic reasoning. We introduce a novel soft-scoring metric that evaluates the structural quality and partial correctness of generated models, addressing the limitations of canonical and execution-based accuracy metrics. Across a diverse set of SO problems, GPT-4-Turbo achieves better partial scores than GPT-3.5 variants except for individual chance-constrained problems. Structured prompts significantly outperform simple prompting, reducing extra-element generation and improving objective matching, although extra-element generation remains a nontrivial task. Our findings reveal that with well-engineered prompts and multi-agent collaboration, LLMs can facilitate SO formulations, paving the way for intelligent, language-driven modeling pipelines for SO in practice.

[278] Revisiting the Uniform Information Density Hypothesis in LLM Reasoning Traces

Minju Gwak, Guijin Son, Jaehyung Kim

Main category: cs.AI

TL;DR: The paper shows that step-level information density uniformity in LLM reasoning traces correlates with reasoning quality - more uniform traces yield better performance, with 10-32% accuracy gains on AIME2025.

DetailsMotivation: To investigate whether the Uniform Information Density (UID) hypothesis applies to LLM reasoning traces, specifically whether step-level uniformity reflects reasoning quality and can be used to improve reasoning systems.

Method: Proposed an entropy-based stepwise information density metric and introduced two complementary uniformity measures (local and global uniformity scores). Evaluated on six reasoning benchmarks.
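A minimal sketch of the idea, assuming step density is mean token surprisal and uniformity penalizes its variance or step-to-step jumps; the paper's exact local/global score definitions may differ.

```python
# Sketch: step-level information density = mean token surprisal
# (negative log-probability), plus two uniformity scores over steps.
def step_density(token_logprobs):
    # token_logprobs: list of log P(token | prefix) for one step.
    return -sum(token_logprobs) / len(token_logprobs)

def global_uniformity(step_densities):
    mu = sum(step_densities) / len(step_densities)
    var = sum((d - mu) ** 2 for d in step_densities) / len(step_densities)
    return -var  # closer to 0 means a more uniform trace

def local_uniformity(step_densities):
    # Penalize sharp jumps between consecutive steps.
    jumps = [abs(a - b) for a, b in zip(step_densities, step_densities[1:])]
    return -sum(jumps) / max(len(jumps), 1)

trace = [[-0.2, -0.5, -0.1], [-0.3, -0.4], [-2.5, -3.0, -0.2]]  # toy
dens = [step_density(s) for s in trace]
print(dens, global_uniformity(dens), local_uniformity(dens))
```

Trace selection then amounts to sampling several candidate traces and keeping the one with the highest uniformity score.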

Result: Step-level uniformity strongly correlates with reasoning quality: correct traces avoid sharp information spikes while incorrect traces show irregular bursts. Selecting traces with uniform information density improves accuracy by 10-32% on AIME2025.

Conclusion: UID-inspired information density measures outperform alternative internal signals as predictors of reasoning quality, making uniformity a robust diagnostic and selection criterion for building more reliable reasoning systems.

Abstract: The Uniform Information Density (UID) hypothesis suggests that effective communication maintains a stable flow of information. In this work, we revisit this principle in the context of large language model (LLM) reasoning traces, asking whether step-level uniformity reflects reasoning quality. To this end, we propose an entropy-based stepwise information density metric and introduce two complementary measures of uniformity, local and global uniformity scores. Across the experiments on six different reasoning benchmarks, we find that step-level uniformity not only provides a strong theoretical lens but also yields practical performance benefits; for example, selecting reasoning traces with more uniform information density at the step level yields 10-32% relative accuracy gains over baselines on AIME2025. Our analysis further reveals that correct reasoning traces tend to avoid sharp information density spikes, while incorrect traces exhibit irregular information bursts. These results demonstrate that UID-inspired information density measures outperform alternative internal signals as predictors of reasoning quality. Results highlight the uniformity of the information density as a robust diagnostic and selection criterion for building more reliable and accurate reasoning systems.

[279] Testing the Testers: Human-Driven Quality Assessment of Voice AI Testing Platforms

Miguel E. Andres, Vadim Fedorov, Rida Sadek, Enric Spagnolo-Arrizabalaga, Nadescha Trudel

Main category: cs.AI

TL;DR: First systematic framework for evaluating voice AI testing quality through human-centered benchmarking, revealing significant performance differences between commercial platforms.

DetailsMotivation: Voice AI agents are scaling to billions of daily interactions, but there's no objective way to assess whether testing approaches actually work, creating a critical measurement gap for production deployments.

Method: Human-centered benchmarking framework combining psychometric techniques (pairwise comparisons yielding Elo ratings, bootstrap confidence intervals, permutation tests) with rigorous statistical validation to measure both simulation quality (realistic test conversations) and evaluation quality (accurate response assessment).
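A compact sketch of the rating machinery: Elo updates over pairwise judgments (with ties scored as 0.5) and a bootstrap confidence interval on a platform's final rating. The K-factor and toy data are illustrative, not the study's settings.

```python
# Sketch: Elo ratings from pairwise human judgments, with a bootstrap CI.
import random

def elo(judgments, k=32, init=1000.0):
    # judgments: list of (a, b, s) where s = 1 if a wins, 0.5 tie, 0 loss.
    r = {}
    for a, b, s in judgments:
        ra, rb = r.setdefault(a, init), r.setdefault(b, init)
        ea = 1.0 / (1.0 + 10 ** ((rb - ra) / 400.0))  # expected score of a
        r[a] = ra + k * (s - ea)
        r[b] = rb + k * ((1.0 - s) - (1.0 - ea))
    return r

def bootstrap_ci(judgments, platform, n=1000, alpha=0.05):
    stats = sorted(
        elo(random.choices(judgments, k=len(judgments)))[platform]
        for _ in range(n))
    lo, hi = int(alpha / 2 * n), int((1 - alpha / 2) * n) - 1
    return stats[lo], stats[hi]

data = [("A", "B", 1), ("A", "C", 1), ("B", "C", 0.5), ("C", "A", 0)] * 50
print(elo(data)["A"], bootstrap_ci(data, "A"))
```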

Result: Comprehensive evaluation of three commercial platforms using 21,600 human judgments across 45 simulations and ground truth validation on 60 conversations. Top platform Evalion achieved 0.92 evaluation quality (F1-score) vs 0.73 for others, and 0.61 simulation quality vs 0.43 for others.

Conclusion: The framework enables empirical validation of testing capabilities for any platform, providing essential measurement foundations for confident voice AI deployment at scale.

Abstract: Voice AI agents are rapidly transitioning to production deployments, yet systematic methods for ensuring testing reliability remain underdeveloped. Organizations cannot objectively assess whether their testing approaches (internal tools or external platforms) actually work, creating a critical measurement gap as voice AI scales to billions of daily interactions. We present the first systematic framework for evaluating voice AI testing quality through human-centered benchmarking. Our methodology addresses the fundamental dual challenge of testing platforms: generating realistic test conversations (simulation quality) and accurately evaluating agent responses (evaluation quality). The framework combines established psychometric techniques (pairwise comparisons yielding Elo ratings, bootstrap confidence intervals, and permutation tests) with rigorous statistical validation to provide reproducible metrics applicable to any testing approach. To validate the framework and demonstrate its utility, we conducted a comprehensive empirical evaluation of three leading commercial platforms focused on Voice AI Testing using 21,600 human judgments across 45 simulations and ground truth validation on 60 conversations. Results reveal statistically significant performance differences with the proposed framework, with the top-performing platform, Evalion, achieving 0.92 evaluation quality measured as F1-score versus 0.73 for others, and 0.61 simulation quality using a league-based scoring system (including ties) vs 0.43 for other platforms. This framework enables researchers and organizations to empirically validate the testing capabilities of any platform, providing essential measurement foundations for confident voice AI deployment at scale. Supporting materials are made available to facilitate reproducibility and adoption.

[280] Efficient Test-Time Scaling of Multi-Step Reasoning by Probing Internal States of Large Language Models

Jingwei Ni, Ekaterina Fadeeva, Tianyi Wu, Mubashara Akhtar, Jiaheng Zhang, Elliott Ash, Markus Leippold, Timothy Baldwin, See-Kiong Ng, Artem Shelmanov, Mrinmaya Sachan

Main category: cs.AI

TL;DR: Proposes lightweight transformer probes that use LLM internal states to verify reasoning steps, matching/exceeding larger PRMs while being 810x smaller.

DetailsMotivation: Existing verification methods like Process Reward Models (PRMs) are computationally expensive, domain-specific, and require extensive annotations. Need a lightweight, generalizable alternative for test-time scaling.

Method: Train transformer-based probes (<10M parameters) that use frozen LLM’s internal states to estimate credibility of reasoning steps. Annotations generated by larger LLMs or self-supervised by original model.
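A minimal PyTorch sketch of such a probe: project the frozen LLM's per-token hidden states for one reasoning step, encode them with a tiny transformer, and emit a credibility logit. Dimensions and the random batch are placeholders; labels would come from a larger LLM (e.g., DeepSeek-R1) or self-supervision.

```python
# Sketch: a small transformer probe over frozen-LLM hidden states that
# scores the credibility of a reasoning step.
import torch
import torch.nn as nn

class StepProbe(nn.Module):
    def __init__(self, d_model=1024, d_probe=128, nhead=4):
        super().__init__()
        self.proj = nn.Linear(d_model, d_probe)
        layer = nn.TransformerEncoderLayer(
            d_probe, nhead, dim_feedforward=256, batch_first=True)
        self.enc = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(d_probe, 1)

    def forward(self, hidden_states):            # (B, T, d_model)
        h = self.enc(self.proj(hidden_states))   # (B, T, d_probe)
        return self.head(h.mean(dim=1)).squeeze(-1)  # credibility logit

probe = StepProbe()
opt = torch.optim.AdamW(probe.parameters(), lr=1e-4)
loss_fn = nn.BCEWithLogitsLoss()
# Placeholder batch: hidden states of 8 reasoning steps, binary labels.
x, y = torch.randn(8, 32, 1024), torch.randint(0, 2, (8,)).float()
loss = loss_fn(probe(x), y)
loss.backward()
opt.step()
print(float(loss))
```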

Result: Probes match/exceed PRM performance across math, planning, and QA domains while being up to 810x smaller. Internal states encode LLM confidence in reasoning processes.

Conclusion: LLM internal states provide reliable signals for reasoning verification, offering scalable, generalizable test-time scaling and introspective LLMs.

Abstract: LLMs can solve complex tasks by generating long, multi-step reasoning chains. Test-time scaling (TTS) can further improve LLM performance by sampling multiple variants of intermediate reasoning steps, verifying their correctness, and strategically choosing the best steps for continuation. However, existing verification approaches, such as Process Reward Models (PRMs), are computationally expensive, limited to specific domains, and require large-scale human or model-generated annotations. We propose a lightweight alternative for step-level reasoning verification based on probing the internal states of LLMs. We train a transformer-based probe that uses the internal states of the frozen LLM to estimate the credibility of its reasoning steps during generation. Annotation can be generated either by another larger LLM (e.g., DeepSeek-R1) or in a self-supervised manner by the original model itself. The probes are both effective and lightweight, containing fewer than 10M parameters. Across multiple domains, including mathematics, planning, and general knowledge question answering, our probes match or even exceed the performance of PRMs that are up to 810x larger. Our findings suggest that the internal states of LLMs encode their confidence in reasoning processes and can serve as reliable signals for reasoning step verification, offering a promising direction towards scalable and generalizable TTS and introspective LLMs.

[281] Graph Neural Networks, Deep Reinforcement Learning and Probabilistic Topic Modeling for Strategic Multiagent Settings

Georgios Chalkiadakis, Charilaos Akasiadis, Gerasimos Koresis, Stergios Plataniotis, Leonidas Bakopoulos

Main category: cs.AI

TL;DR: A review paper analyzing GNN, DRL, and PTM methods for strategic multiagent settings, focusing on opponent modeling, uncertainty handling, and integration with game theory while avoiding unrealistic assumptions.

DetailsMotivation: To address the challenges of strategic multiagent settings in real-world scenarios where traditional game theory assumptions (like Common Prior Assumption and Self-Interest Hypothesis) often fail, and to explore how modern ML methods can handle uncertainty, heterogeneity, and non-stationarity.

Method: Comprehensive review and analysis of three ML approaches: Graph Neural Networks (GNNs) for modeling relationships and interactions, Deep Reinforcement Learning (DRL) for decision-making in multiagent settings, and Probabilistic Topic Modeling (PTM) for applications beyond document analysis. Integration of these methods with game theoretic concepts.

Result: Identifies GNNs as particularly promising for modeling multiagent interactions due to their ability to handle graph-structured data. Highlights challenges in applying single-agent DRL to multiagent settings due to varying relationships and non-stationarity. Notes PTM’s potential beyond traditional document analysis domains.

Conclusion: The paper identifies key open challenges: fitting non-stationary environments, balancing stability and adaptation, tackling uncertainty and heterogeneity, and ensuring scalability and solution tractability. Champions GNNs as a powerful approach for strategic multiagent modeling while calling for further research to address these remaining challenges.

Abstract: This paper provides a comprehensive review of mainly GNN, DRL, and PTM methods with a focus on their potential incorporation in strategic multiagent settings. We focus on (i) ML methods currently utilized for uncovering unknown model structures adaptable to the task of strategic opponent modeling, and (ii) the integration of these methods with Game Theoretic concepts that avoid relying on assumptions often invalid in real-world scenarios, such as the Common Prior Assumption (CPA) and the Self-Interest Hypothesis (SIH). We analyze the ability to handle uncertainty and heterogeneity, two characteristics that are very common in real-world application cases, as well as scalability. As a potential answer to effectively modeling relationships and interactions in multiagent settings, we champion the use of GNN. Such approaches are designed to operate upon graph-structured data, and have been shown to be a very powerful tool for performing tasks such as node classification and link prediction. Next, we review the domain of RL, and in particular that of multiagent deep reinforcement learning. Single-agent deep RL has been widely used for decision making in demanding game settings. Its application in multiagent settings, though, is hindered due to, e.g., varying relationships between agents, and non-stationarity of the environment. We describe existing relevant game theoretic solution concepts, and consider properties such as fairness and stability. Our review comes complete with a note on the literature that utilizes probabilistic topic modeling (PTM) in domains other than that of document analysis and classification. Finally, we identify certain open challenges – specifically, the need to (i) fit non-stationary environments, (ii) balance the degrees of stability and adaptation, (iii) tackle uncertainty and heterogeneity, (iv) guarantee scalability and solution tractability.

[282] GGBench: A Geometric Generative Reasoning Benchmark for Unified Multimodal Models

Jingxuan Wei, Caijun Jia, Xi Bai, Xinglong Xu, Siyuan Li, Linzhuang Sun, Bihui Yu, Conghui He, Lijun Wu, Cheng Tan

Main category: cs.AI

TL;DR: GGBench is a new benchmark for evaluating Unified Multimodal Models’ geometric generative reasoning abilities, bridging the gap between discriminative understanding and unconstrained generation.

DetailsMotivation: Existing benchmarks fail to measure the integrated cognitive process of generative reasoning in UMMs, focusing only on discriminative understanding or unconstrained image generation separately. There's a critical need to evaluate how models can actively construct solutions through language comprehension and precise visual generation.

Method: The authors propose geometric construction as an ideal testbed for generative reasoning. They introduce GGBench, a benchmark specifically designed to evaluate geometric generative reasoning, providing a comprehensive framework for systematically diagnosing models’ ability to understand, reason, and actively construct solutions.

Result: GGBench establishes a new benchmark that sets more rigorous standards for evaluating the next generation of intelligent systems by measuring their integrated cognitive capabilities in generative reasoning.

Conclusion: Geometric construction serves as an effective testbed for evaluating UMMs’ generative reasoning abilities, and GGBench addresses the critical gap in current evaluation methods, advancing the assessment of integrated cognitive processes in multimodal AI systems.

Abstract: The advent of Unified Multimodal Models (UMMs) signals a paradigm shift in artificial intelligence, moving from passive perception to active, cross-modal generation. Despite their unprecedented ability to synthesize information, a critical gap persists in evaluation: existing benchmarks primarily assess discriminative understanding or unconstrained image generation separately, failing to measure the integrated cognitive process of generative reasoning. To bridge this gap, we propose that geometric construction provides an ideal testbed as it inherently demands a fusion of language comprehension and precise visual generation. We introduce GGBench, a benchmark designed specifically to evaluate geometric generative reasoning. It provides a comprehensive framework for systematically diagnosing a model’s ability to not only understand and reason but to actively construct a solution, thereby setting a more rigorous standard for the next generation of intelligent systems. Project website: https://opendatalab-raiser.github.io/GGBench/.

[283] The Agentic Leash: Extracting Causal Feedback Fuzzy Cognitive Maps with LLMs

Akash Kumar Panda, Olaoluwa Adigun, Bart Kosko

Main category: cs.AI

TL;DR: LLM agent extracts causal feedback fuzzy cognitive maps (FCMs) from text, creating bidirectional system where FCM equilibria drive LLM to fetch more text, which modifies FCM structure. Tested on Kissinger AI essay, matches human-generated FCM equilibria.

DetailsMotivation: To develop an autonomous LLM agent that can extract causal relationships from text and create adaptive FCMs that can evolve through bidirectional interaction between text processing and dynamical system equilibria.

Method: Three-step LLM agent process: 1) extract key nouns/noun phrases, 2) identify FCM concept nodes, 3) infer partial/fuzzy causal edges between nodes. Tested on Kissinger AI essay, compared with human-generated FCMs, and created mixed FCMs from Gemini and ChatGPT agents.
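Once the edges are extracted, the FCM is just a thresholded linear dynamical system; iterating it until a state repeats recovers the fixed point or limit cycle that the paper compares across agents. The weight matrix below is a toy example, not one extracted from the Kissinger essay.

```python
# Sketch: iterate x_{t+1} = step(W @ x_t) and report the first repeated
# state, i.e., the FCM's fixed point or limit cycle.
import numpy as np

def fcm_attractor(W, x0, max_steps=200):
    x = np.array(x0, dtype=float)
    seen = {}
    for t in range(max_steps):
        key = tuple(x)
        if key in seen:                      # state revisited:
            return t - seen[key], x          # (cycle length, state)
        seen[key] = t
        x = (W @ x > 0).astype(float)        # bivalent threshold update
    return None, x

W = np.array([[0.0, 0.7, -0.4],              # toy causal-edge weights
              [0.5, 0.0, 0.6],
              [-0.3, 0.8, 0.0]])
print(fcm_attractor(W, [1, 0, 0]))           # e.g., a length-2 limit cycle
```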

Result: Generated FCMs converged to same equilibrium limit cycles as human-generated FCMs despite structural differences. Mixed FCMs absorbed equilibria of dominant components and created new equilibria to better approximate underlying causal dynamics.

Conclusion: LLM agents can effectively extract causal FCMs from text, creating adaptive systems that maintain autonomy while staying on an “agentic leash.” The bidirectional process enables evolving causal structures that approximate human understanding of complex systems.

Abstract: We design a large-language-model (LLM) agent that extracts causal feedback fuzzy cognitive maps (FCMs) from raw text. The causal learning or extraction process is agentic both because of the LLM’s semi-autonomy and because ultimately the FCM dynamical system’s equilibria drive the LLM agents to fetch and process causal text. The fetched text can in principle modify the adaptive FCM causal structure and so modify the source of its quasi-autonomy: its equilibrium limit cycles and fixed-point attractors. This bidirectional process endows the evolving FCM dynamical system with a degree of autonomy while still staying on its agentic leash. We show in particular that a sequence of three finely tuned system instructions guides an LLM agent as it systematically extracts key nouns and noun phrases from text, as it extracts FCM concept nodes from among those nouns and noun phrases, and then as it extracts or infers partial or fuzzy causal edges between those FCM nodes. We test this FCM generation on a recent essay about the promise of AI from the late diplomat and political theorist Henry Kissinger and his colleagues. This three-step process produced FCM dynamical systems that converged to the same equilibrium limit cycles as did the human-generated FCMs even though the human-generated FCM differed in the number of nodes and edges. A final FCM mixed the generated FCMs from separate Gemini and ChatGPT LLM agents. The mixed FCM absorbed the equilibria of its dominant mixture component but also created new equilibria of its own to better approximate the underlying causal dynamical system.

[284] SCALER: Synthetic Scalable Adaptive Learning Environment for Reasoning

Caijun Xu, Changyi Xiao, Zhongyuan Peng, Xinrun Wang, Yixin Cao

Main category: cs.AI

TL;DR: SCALER is a framework that uses adaptive environment design to sustain effective RL training signals for language model reasoning, addressing issues of task difficulty alignment and pattern diversity through scalable synthesis and dynamic environment adjustment.

DetailsMotivation: Reinforcement learning for language model reasoning often slows down when task difficulty becomes misaligned with model capability or when training is dominated by narrow problem patterns, requiring sustained informative learning signals.

Method: SCALER combines a scalable synthesis pipeline that converts real-world programming problems into verifiable reasoning environments with controllable difficulty, plus an adaptive multi-environment RL strategy that dynamically adjusts instance difficulty and curates active environments to track model capability and maintain diversity.
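The difficulty-tracking component can be pictured as a simple feedback controller that keeps the rolling success rate inside a target band; SCALER's actual mechanism and constants are more involved, so treat this as a sketch of the idea.

```python
# Sketch: nudge an environment's difficulty so the agent's rolling
# success rate tracks a target band, keeping rewards neither
# saturated nor sparse. Constants are illustrative.
def adapt_difficulty(difficulty, success_rate, target=0.5,
                     step=0.1, lo=0.0, hi=1.0):
    if success_rate > target + 0.1:      # too easy: harden
        difficulty += step
    elif success_rate < target - 0.1:    # too hard: soften
        difficulty -= step
    return min(hi, max(lo, difficulty))

d = 0.3
for sr in [0.9, 0.8, 0.6, 0.3, 0.5]:     # rolling success rates
    d = adapt_difficulty(d, sr)
    print(round(d, 2))
```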

Result: SCALER consistently outperforms dataset-based RL baselines across diverse reasoning benchmarks and exhibits more stable, long-horizon training dynamics.

Conclusion: The adaptive environment design framework effectively prevents reward sparsity, mitigates overfitting to narrow task patterns, and supports sustained improvement throughout RL training for language model reasoning.

Abstract: Reinforcement learning (RL) offers a principled way to enhance the reasoning capabilities of large language models, yet its effectiveness hinges on training signals that remain informative as models evolve. In practice, RL progress often slows when task difficulty becomes poorly aligned with model capability, or when training is dominated by a narrow set of recurring problem patterns. To jointly address these issues, we propose SCALER (Synthetic sCalable Adaptive Learning Environment for Reasoning), a framework that sustains effective learning signals through adaptive environment design. SCALER introduces a scalable synthesis pipeline that converts real-world programming problems into verifiable reasoning environments with controllable difficulty and unbounded instance generation, enabling RL training beyond finite datasets while preserving strong correctness guarantees. Building on this, SCALER further employs an adaptive multi-environment RL strategy that dynamically adjusts instance difficulty and curates the active set of environments to track the model’s capability frontier and maintain distributional diversity. This co-adaptation prevents reward sparsity, mitigates overfitting to narrow task patterns, and supports sustained improvement throughout training. Extensive experiments show that SCALER consistently outperforms dataset-based RL baselines across diverse reasoning benchmarks and exhibits more stable, long-horizon training dynamics.

[285] DR-LoRA: Dynamic Rank LoRA for Mixture-of-Experts Adaptation

Guanzhi Deng, Bo Li, Ronghao Chen, Huacan Wang, Lijie Wen, Linqi Song

Main category: cs.AI

TL;DR: DR-LoRA: Dynamic rank allocation for LoRA fine-tuning of Mixture-of-Experts LLMs, where expert ranks grow based on task-specific demands rather than uniform allocation.

DetailsMotivation: Current PEFT methods like LoRA assign identical ranks to all experts in MoE LLMs, ignoring functional specialization. This causes resource mismatch: task-relevant experts are under-provisioned while less relevant ones get redundant parameters.

Method: DR-LoRA dynamically grows expert LoRA ranks during fine-tuning based on task demands. Uses Expert Saliency Scoring that combines expert routing frequency and LoRA rank importance to quantify each expert’s need for additional capacity. Higher saliency experts get priority for rank expansion, creating heterogeneous rank distribution.
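A sketch of saliency-driven rank growth under a parameter budget; the importance proxy, the linear combination, and the damping used to spread the budget are assumptions, not the paper's exact rule.

```python
# Sketch of Expert Saliency Scoring: combine routing frequency with a
# LoRA importance proxy and greedily grant extra ranks to the most
# salient experts under a fixed budget.
import numpy as np

def grow_ranks(ranks, route_freq, rank_importance, budget, alpha=0.5):
    saliency = alpha * route_freq + (1 - alpha) * rank_importance
    ranks = list(ranks)
    for _ in range(budget):
        i = int(np.argmax(saliency))
        ranks[i] += 1                 # expand this expert's LoRA rank
        saliency[i] *= 0.9            # damp so the budget spreads a bit
    return ranks

route_freq = np.array([0.40, 0.30, 0.20, 0.10])
importance = np.array([0.20, 0.50, 0.10, 0.20])   # e.g., grad-norm proxy
print(grow_ranks([4, 4, 4, 4], route_freq, importance, budget=8))
```

The result is a heterogeneous rank distribution: frequently routed, high-importance experts absorb most of the budget, matching the paper's motivation.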

Result: Experiments on multiple benchmarks show DR-LoRA consistently outperforms standard LoRA and static allocation strategies under same parameter budget, achieving better task performance with more efficient parameter utilization.

Conclusion: Dynamic rank allocation tailored to expert specialization in MoE LLMs enables superior parameter efficiency and task performance compared to uniform LoRA rank assignment.

Abstract: Mixture-of-Experts (MoE) has become a prominent paradigm for scaling Large Language Models (LLMs). Parameter-efficient fine-tuning (PEFT), such as LoRA, is widely adopted to adapt pretrained MoE LLMs to downstream tasks. However, existing approaches assign identical LoRA ranks to all experts, overlooking the intrinsic functional specialization within MoE LLMs. This uniform allocation leads to resource mismatch: task-relevant experts are under-provisioned while less relevant ones receive redundant parameters. We propose a Dynamic Rank LoRA framework named DR-LoRA, which dynamically grows expert LoRA ranks during fine-tuning based on task-specific demands. DR-LoRA employs an Expert Saliency Scoring mechanism that integrates expert routing frequency and LoRA rank importance to quantify each expert’s demand for additional capacity. Experts with higher saliency scores are prioritized for rank expansion, enabling the automatic formation of a heterogeneous rank distribution tailored to the target task. Experiments on multiple benchmarks demonstrate that DR-LoRA consistently outperforms standard LoRA and static allocation strategies under the same parameter budget, achieving superior task performance with more efficient parameter utilization.

[286] Effects of personality steering on cooperative behavior in Large Language Model agents

Mizuki Sakai, Mizuki Yokoyama, Wakaba Tateishi, Genki Ichinose

Main category: cs.AI

TL;DR: Personality steering in LLM agents affects cooperation in Prisoner’s Dilemma games, with agreeableness being the dominant factor promoting cooperation across models.

Motivation: To understand how personality steering affects cooperative behavior in LLM agents under controlled conditions, particularly in strategic interactions like repeated Prisoner's Dilemma games.

Method: Used Big Five framework to measure personality scores of GPT-3.5-turbo, GPT-4o, and GPT-5 models. Compared behavior under baseline vs. personality-informed conditions, and manipulated each personality dimension to extreme values in repeated Prisoner’s Dilemma games.
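
The game setup can be sketched in a few lines using the standard Prisoner's Dilemma payoffs (the specific payoff values here are the textbook defaults, not necessarily those used in the paper); each agent callable would wrap an LLM with a personality-steered system prompt.

```python
# Payoff matrix: (row player, column player); C = cooperate, D = defect.
PAYOFF = {("C", "C"): (3, 3), ("C", "D"): (0, 5),
          ("D", "C"): (5, 0), ("D", "D"): (1, 1)}

def run_game(agent_a, agent_b, rounds=10):
    """Each agent is a callable taking the shared history and returning
    'C' or 'D'; here simple lambdas stand in for LLM agents."""
    history, scores = [], [0, 0]
    for _ in range(rounds):
        a, b = agent_a(history), agent_b(history)
        pa, pb = PAYOFF[(a, b)]
        scores[0] += pa
        scores[1] += pb
        history.append((a, b))
    return scores

# an unconditionally cooperative agent is exploited by a defector
print(run_game(lambda h: "C", lambda h: "D"))  # -> [0, 50]
```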

Result: Agreeableness was the dominant factor promoting cooperation across all models. Explicit personality information increases cooperation but also raises vulnerability to exploitation, especially in earlier-generation models. Later-generation models show more selective cooperation.

Conclusion: Personality steering acts as a behavioral bias rather than a deterministic control mechanism for LLM agents in cooperative settings.

Abstract: Large language models (LLMs) are increasingly used as autonomous agents in strategic and social interactions. Although recent studies suggest that assigning personality traits to LLMs can influence their behavior, how personality steering affects cooperation under controlled conditions remains unclear. In this study, we examine the effects of personality steering on cooperative behavior in LLM agents using repeated Prisoner’s Dilemma games. Based on the Big Five framework, we first measure basic personality scores of three models, GPT-3.5-turbo, GPT-4o, and GPT-5, using the Big Five Inventory. We then compare behavior under baseline and personality-informed conditions, and further analyze the effects of independently manipulating each personality dimension to extreme values. Our results show that agreeableness is the dominant factor promoting cooperation across all models, while other personality traits have limited impact. Explicit personality information increases cooperation but can also raise vulnerability to exploitation, particularly in earlier-generation models. In contrast, later-generation models exhibit more selective cooperation. These findings indicate that personality steering acts as a behavioral bias rather than a deterministic control mechanism.

[287] LSRIF: Logic-Structured Reinforcement Learning for Instruction Following

Qingyu Ren, Qianyu He, Jingwen Chang, Jie Zeng, Jiaqing Liang, Yanghua Xiao, Han Xia, Zeye Sun, Fei Yu

Main category: cs.AI

TL;DR: LSRIF is a logic-structured training framework for LLMs that explicitly models instruction logic (parallel, sequential, conditional) with structure-aware rewards, improving instruction-following and reasoning.

Motivation: Real-world instructions often contain logical structures like sequential dependencies and conditional branching, but existing methods ignore these logical dependencies and yield noisy signals by optimizing average rewards on parallel constraints.

Method: Propose LSRIF framework: 1) Construct LSRInstruct dataset with constraint structures (parallel, sequential, conditional), 2) Design structure-aware rewarding method with average aggregation for parallel structures, failure-penalty propagation for sequential structures, and selective rewards for conditional branches.
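
The three aggregation rules can be sketched as a recursion over a constraint tree; the node schema and exact penalty semantics below are illustrative assumptions, not the paper's specification.

```python
def aggregate_reward(node):
    """Structure-aware reward aggregation: parallel -> average; sequential
    -> zero out rewards after a failure; conditional -> reward only the
    branch that was actually taken."""
    if node["type"] == "leaf":
        return node["reward"]                        # per-constraint score in [0, 1]
    rewards = [aggregate_reward(c) for c in node["children"]]
    if node["type"] == "parallel":
        return sum(rewards) / len(rewards)           # average aggregation
    if node["type"] == "sequential":                 # failure-penalty propagation
        total, failed = 0.0, False
        for r in rewards:
            total += 0.0 if failed else r
            failed = failed or r == 0.0
        return total / len(rewards)
    if node["type"] == "conditional":                # selective reward
        taken = [r for c, r in zip(node["children"], rewards) if c.get("taken")]
        return taken[0] if taken else 0.0

tree = {"type": "sequential", "children": [
    {"type": "leaf", "reward": 1.0},
    {"type": "leaf", "reward": 0.0},   # failure here...
    {"type": "leaf", "reward": 1.0},   # ...so this later step earns nothing
]}
print(aggregate_reward(tree))  # 1/3, rather than the naive average of 2/3
```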

Result: LSRIF brings significant improvements in instruction-following (both in-domain and out-of-domain) and general reasoning. Analysis shows learning with explicit logic structures brings parameter updates in attention layers and sharpens token-level attention to constraints and logical operators.

Conclusion: Explicitly modeling instruction logic with structure-aware rewards is effective for improving LLM instruction-following capabilities, addressing limitations of existing methods that ignore logical dependencies.

Abstract: Instruction-following is critical for large language models, but real-world instructions often contain logical structures such as sequential dependencies and conditional branching. Existing methods typically construct datasets with parallel constraints and optimize average rewards, ignoring logical dependencies and yielding noisy signals. We propose LSRIF, a logic-structured training framework that explicitly models instruction logic. We first construct LSRInstruct, a dataset with parallel, sequential, and conditional constraint structures, and then design a structure-aware reward method comprising average aggregation for parallel structures, failure-penalty propagation for sequential structures, and selective rewards for conditional branches. Experiments show LSRIF brings significant improvements in instruction-following (in-domain and out-of-domain) and general reasoning. Analysis reveals that learning with explicit logic structures brings parameter updates in attention layers and sharpens token-level attention to constraints and logical operators.

[288] MPCI-Bench: A Benchmark for Multimodal Pairwise Contextual Integrity Evaluation of Language Model Agents

Shouju Wang, Haopeng Zhang

Main category: cs.AI

TL;DR: MPCI-Bench is the first multimodal benchmark for evaluating privacy behavior in AI agents using Contextual Integrity principles, addressing gaps in existing text-only benchmarks that overlook multimodal risks and privacy-utility tradeoffs.

Motivation: As AI agents evolve from passive chatbots to proactive assistants handling personal data, evaluating their adherence to social norms through Contextual Integrity becomes critical. Existing benchmarks are text-centric, focus only on negative refusal scenarios, and overlook multimodal privacy risks and the privacy-utility tradeoff.

Method: Created MPCI-Bench, a Multimodal Pairwise Contextual Integrity benchmark with paired positive/negative instances from the same visual source across three tiers: normative Seed judgments, context-rich Story reasoning, and executable agent action Traces. Used a Tri-Principle Iterative Refinement pipeline to ensure data quality.

Result: Evaluations of state-of-the-art multimodal models reveal systematic failures to balance privacy and utility, and a pronounced modality leakage gap where sensitive visual information is leaked more frequently than textual information.

Conclusion: MPCI-Bench addresses critical gaps in agentic privacy evaluation and will be open-sourced to facilitate future research on Contextual Integrity in AI agents, highlighting the need for better multimodal privacy safeguards.

Abstract: As language-model agents evolve from passive chatbots into proactive assistants that handle personal data, evaluating their adherence to social norms becomes increasingly critical, often through the lens of Contextual Integrity (CI). However, existing CI benchmarks are largely text-centric and primarily emphasize negative refusal scenarios, overlooking multimodal privacy risks and the fundamental trade-off between privacy and utility. In this paper, we introduce MPCI-Bench, the first Multimodal Pairwise Contextual Integrity benchmark for evaluating privacy behavior in agentic settings. MPCI-Bench consists of paired positive and negative instances derived from the same visual source and instantiated across three tiers: normative Seed judgments, context-rich Story reasoning, and executable agent action Traces. Data quality is ensured through a Tri-Principle Iterative Refinement pipeline. Evaluations of state-of-the-art multimodal models reveal systematic failures to balance privacy and utility and a pronounced modality leakage gap, where sensitive visual information is leaked more frequently than textual information. We will open-source MPCI-Bench to facilitate future research on agentic CI.

[289] Advancing ESG Intelligence: An Expert-level Agent and Comprehensive Benchmark for Sustainable Finance

Yilei Zhao, Wentao Zhang, Lei Xiao, Yandan Zheng, Mengpu Liu, Wei Yang Bryan Lim

Main category: cs.AI

TL;DR: ESGAgent: A hierarchical multi-agent system with specialized tools for comprehensive ESG analysis, outperforming state-of-the-art LLMs on a new three-level benchmark derived from corporate sustainability reports.

Motivation: Professional ESG analysis faces challenges due to data fragmentation across unstructured sources, and existing LLMs struggle with complex multi-step workflows required for rigorous auditing of corporate sustainability and ethical performance.

Method: Introduces ESGAgent, a hierarchical multi-agent system with specialized tools including retrieval augmentation, web search, and domain-specific functions. Also presents a comprehensive three-level benchmark derived from 310 corporate sustainability reports to evaluate capabilities from atomic questions to integrated analysis.

Result: ESGAgent outperforms state-of-the-art closed-source LLMs with 84.15% average accuracy on atomic QA tasks, and excels in professional report generation by integrating rich charts and verifiable references.

Conclusion: The benchmark has diagnostic value and establishes a vital testbed for assessing general and advanced agentic capabilities in high-stakes vertical domains like ESG analysis.

Abstract: Environmental, social, and governance (ESG) criteria are essential for evaluating corporate sustainability and ethical performance. However, professional ESG analysis is hindered by data fragmentation across unstructured sources, and existing large language models (LLMs) often struggle with the complex, multi-step workflows required for rigorous auditing. To address these limitations, we introduce ESGAgent, a hierarchical multi-agent system empowered by a specialized toolset, including retrieval augmentation, web search and domain-specific functions, to generate in-depth ESG analysis. Complementing this agentic system, we present a comprehensive three-level benchmark derived from 310 corporate sustainability reports, designed to evaluate capabilities ranging from atomic common-sense questions to the generation of integrated, in-depth analysis. Empirical evaluations demonstrate that ESGAgent outperforms state-of-the-art closed-source LLMs with an average accuracy of 84.15% on atomic question-answering tasks, and excels in professional report generation by integrating rich charts and verifiable references. These findings confirm the diagnostic value of our benchmark, establishing it as a vital testbed for assessing general and advanced agentic capabilities in high-stakes vertical domains.

[290] Pervasive Annotation Errors Break Text-to-SQL Benchmarks and Leaderboards

Tengjun Jin, Yoojin Choi, Yuxuan Zhu, Daniel Kang

Main category: cs.AI

TL;DR: Text-to-SQL benchmarks have high annotation error rates (53-63%) that significantly distort agent performance rankings and misguide research/deployment decisions.

Motivation: Text-to-SQL benchmarks rely on human annotations for question construction and answer evaluation, but the validity of these annotations hasn't been systematically studied. Annotation errors could distort reported performance and rankings, potentially misleading research directions and deployment choices.

Method: Conducted empirical study: (1) benchmarked annotation error rates for BIRD and Spider 2.0-Snow through expert analysis, (2) corrected subset of BIRD Dev set, (3) re-evaluated 16 open-source agents on both original and corrected subsets, (4) assessed generalization to full BIRD Dev set using correlation analysis.

Result: Found high error rates: BIRD Mini-Dev (52.8%) and Spider 2.0-Snow (62.8%). Performance changes ranged from -7% to 31% relative, rank changes from -9 to +9 positions. Rankings on uncorrected subset strongly correlated with full Dev set (r_s=0.85), but weakly correlated with corrected subset (r_s=0.32).
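
The correlation analysis can be reproduced directly from two rank lists with SciPy; the ranks below are made-up placeholders, not the paper's data.

```python
from scipy.stats import spearmanr

# Illustrative leaderboard ranks for 16 agents under two evaluations.
ranks_uncorrected = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16]
ranks_corrected   = [5, 1, 9, 2, 12, 3, 14, 4, 16, 6, 7, 8, 10, 11, 13, 15]

r_s, p = spearmanr(ranks_uncorrected, ranks_corrected)
print(f"Spearman r_s={r_s:.2f}, p={p:.3f}")  # a weak r_s means rankings reshuffled
```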

Conclusion: Annotation errors significantly distort text-to-SQL benchmark results and leaderboard rankings, potentially misguiding research and deployment decisions. The community needs to address annotation quality issues to ensure reliable evaluation.

Abstract: Researchers have proposed many text-to-SQL techniques to streamline data analytics and accelerate the development of database-driven applications. To compare these techniques and select the best one for deployment, the community depends on public benchmarks and their leaderboards. Since these benchmarks heavily rely on human annotations during question construction and answer evaluation, the validity of the annotations is crucial. In this paper, we conduct an empirical study that (i) benchmarks annotation error rates for two widely used text-to-SQL benchmarks, BIRD and Spider 2.0-Snow, and (ii) corrects a subset of the BIRD development (Dev) set to measure the impact of annotation errors on text-to-SQL agent performance and leaderboard rankings. Through expert analysis, we show that BIRD Mini-Dev and Spider 2.0-Snow have error rates of 52.8% and 62.8%, respectively. We re-evaluate all 16 open-source agents from the BIRD leaderboard on both the original and the corrected BIRD Dev subsets. We show that performance changes range from -7% to 31% (in relative terms) and rank changes range from -9 to +9 positions. We further assess whether these impacts generalize to the full BIRD Dev set. We find that the rankings of agents on the uncorrected subset correlate strongly with those on the full Dev set (Spearman’s r_s=0.85, p=3.26e-5), whereas they correlate weakly with those on the corrected subset (Spearman’s r_s=0.32, p=0.23). These findings show that annotation errors can significantly distort reported performance and rankings, potentially misguiding research directions or deployment choices. Our code and data are available at https://github.com/uiuc-kang-lab/text_to_sql_benchmarks.

cs.SD

[291] Semantic visually-guided acoustic highlighting with large vision-language models

Junhua Huang, Chao Huang, Chenliang Xu

Main category: cs.SD

TL;DR: Systematic study shows that camera focus, tone, and scene background visual cues from vision-language models most improve audio remixing quality for video content.

Motivation: Current audio mixing for video is manual and labor-intensive. While visually guided acoustic highlighting exists, it's unclear which visual aspects work best as conditioning signals for audio remixing.

Method: Used textual descriptions as proxy for visual analysis, prompting large vision-language models to extract six visual-semantic aspects: object/character appearance, emotion, camera focus, tone, scene background, and inferred sound cues. Systematically studied which aspects improve audio remixing.

Result: Camera focus, tone, and scene background consistently yielded the largest improvements in perceptual mix quality over state-of-the-art baselines.

Conclusion: Identified which visual-semantic cues most strongly support coherent audio remixing and outlined a practical path toward automating cinema-grade sound design using lightweight guidance from vision-language models.

Abstract: Balancing dialogue, music, and sound effects with accompanying video is crucial for immersive storytelling, yet current audio mixing workflows remain largely manual and labor-intensive. While recent advancements have introduced the visually guided acoustic highlighting task, which implicitly rebalances audio sources using multimodal guidance, it remains unclear which visual aspects are most effective as conditioning signals. We address this gap through a systematic study of whether deep video understanding improves audio remixing. Using textual descriptions as a proxy for visual analysis, we prompt large vision-language models to extract six types of visual-semantic aspects, including object and character appearance, emotion, camera focus, tone, scene background, and inferred sound-related cues. Through extensive experiments, camera focus, tone, and scene background consistently yield the largest improvements in perceptual mix quality over state-of-the-art baselines. Our findings (i) identify which visual-semantic cues most strongly support coherent and visually aligned audio remixing, and (ii) outline a practical path toward automating cinema-grade sound design using lightweight guidance derived from large vision-language models.

[292] Echoes of Ideology: Toward an Audio Analysis Pipeline to Unveil Character Traits in Historical Nazi Propaganda Films

Nicolas Ruth, Manuel Burghardt

Main category: cs.SD

TL;DR: Computational audio analysis reveals ideological patterns in Nazi propaganda films through speaker diarization, transcription, and psycholinguistic analysis.

Motivation: To systematically examine ideological narratives in Nazi propaganda films using computational methods to uncover patterns that might not be apparent through traditional analysis.

Method: Three-step pipeline: 1) speaker diarization to identify different speakers, 2) audio transcription to convert speech to text, and 3) psycholinguistic analysis to examine ideological patterns in characters’ language.

Result: The methodology successfully reveals ideological patterns in characters despite current limitations in speaker diarization accuracy, providing insights into character traits and propaganda narratives.

Conclusion: The computational approach offers scalable applications for analyzing propaganda films and suggests potential for broader use in media analysis despite technical challenges with speaker identification.

Abstract: This study investigates the use of computational audio analysis to examine ideological narratives in Nazi propaganda films. Employing a three-step pipeline (speaker diarization, audio transcription, and psycholinguistic analysis), it reveals ideological patterns in characters. Despite current issues with speaker diarization, the methodology provides insights into character traits and propaganda narratives, suggesting scalable applications.

[293] Research on Piano Timbre Transformation System Based on Diffusion Model

Chun-Chieh Hsu, Tsai-Ling Hsu, Chen-Chen Yeh, Shao-Chien Lu, Cheng-Han Wu, Bing-Ze Liu, Timothy K. Shih, Yu-Cheng Lin

Main category: cs.SD

TL;DR: A diffusion-based timbre conversion model that transforms various instrument music into piano versions using pitch and loudness encoders for conditional generation.

Motivation: To create a precise timbre conversion system that can translate music from various instruments into high-quality piano versions, addressing the need for accurate musical transformation across different styles and complexities.

Method: Uses a Diffusion architecture with Pitch Encoder and Loudness Encoder to extract musical features as conditional inputs to the Diffusion Model’s decoder for generating piano timbres.

Result: Excellent performance in pitch accuracy and timbral similarity, stable conversion across different musical styles (classical, jazz, pop) and lengths, maintains high sound quality even with rapid note changes and complex structures, demonstrates good generalization capability.

Conclusion: The model shows strong potential for real-time musical conversion applications and future enhancements could include improved loudness dynamics handling and additional musical features, with potential applications in vocal-to-instrument conversion and MIDI integration.

Abstract: We propose a timbre conversion model based on the Diffusion architecture designed to precisely translate music played by various instruments into piano versions. The model employs a Pitch Encoder and Loudness Encoder to extract pitch and loudness features of the music, which serve as conditional inputs to the Diffusion Model’s decoder, generating high-quality piano timbres. Case analysis results show that the model performs excellently in terms of pitch accuracy and timbral similarity, maintaining stable conversion across different musical styles (classical, jazz, pop) and lengths (from short clips to full pieces). Particularly, the model maintains high sound quality and accuracy even when dealing with rapidly changing notes and complex musical structures, demonstrating good generalization capability. Additionally, the model has the potential for real-time musical conversion and is suitable for live performances and digital music creation tools. Future research will focus on enhancing the handling of loudness dynamics and incorporating additional musical features (such as timbral variations and rhythmic complexity) to improve the model’s adaptability and expressiveness. We plan to explore the model’s application potential in other timbre conversion tasks, such as converting vocals to instrumental sounds or integration with MIDI digital pianos, further expanding the application scope of the Diffusion-based timbre conversion model in the field of music generation.

[294] DSA-Tokenizer: Disentangled Semantic-Acoustic Tokenization via Flow Matching-based Hierarchical Fusion

Hanlin Zhang, Daxin Tan, Dehua Tao, Xiao Chen, Haochen Tan, Yunhe Li, Yuchen Cao, Jianping Wang, Linqi Song

Main category: cs.SD

TL;DR: DSA-Tokenizer is a speech tokenizer that explicitly disentangles speech into separate semantic and acoustic tokens using distinct optimization constraints, enabling better control for speech LLMs.

Motivation: Existing speech tokenizers either prioritize semantics only, fuse semantic and acoustic information inseparably, or achieve incomplete disentanglement, limiting controllable generation in speech LLMs.

Method: Uses ASR supervision for semantic tokens to capture linguistic content, and mel-spectrogram restoration for acoustic tokens to encode style. Introduces hierarchical Flow-Matching decoder to handle different sequence lengths, and employs joint reconstruction-recombination training for separation.
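
A rough sketch of the two-headed objective follows; the token-level ASR loss, the L1 mel-reconstruction loss, and the weighting are assumptions chosen for illustration, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def dsa_joint_loss(sem_logits, text_ids, mel_pred, mel_true, lam=1.0):
    """Semantic tokens are pushed toward linguistic content via an ASR-style
    loss; acoustic tokens toward style via mel-spectrogram restoration."""
    semantic = F.cross_entropy(sem_logits.flatten(0, 1), text_ids.flatten())
    acoustic = F.l1_loss(mel_pred, mel_true)
    return semantic + lam * acoustic

# toy shapes: batch 2, 5 text tokens over a 100-symbol vocab, 80-bin mel
loss = dsa_joint_loss(torch.randn(2, 5, 100), torch.randint(100, (2, 5)),
                      torch.randn(2, 80, 50), torch.randn(2, 80, 50))
```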

Result: Achieves high fidelity reconstruction and flexible recombination through robust disentanglement, facilitating controllable generation in speech LLMs.

Conclusion: Disentangled tokenization is a pivotal paradigm for future speech modeling, enabling better control and quality in speech generation.

Abstract: Speech tokenizers serve as the cornerstone of discrete Speech Large Language Models (Speech LLMs). Existing tokenizers either prioritize semantic encoding, fuse semantic content with acoustic style inseparably, or achieve incomplete semantic-acoustic disentanglement. To achieve better disentanglement, we propose DSA-Tokenizer, which explicitly disentangles speech into discrete semantic and acoustic tokens via distinct optimization constraints. Specifically, semantic tokens are supervised by ASR to capture linguistic content, while acoustic tokens focus on mel-spectrogram restoration to encode style. To eliminate rigid length constraints between the two sequences, we introduce a hierarchical Flow-Matching decoder that further improves speech generation quality. Furthermore, we employ a joint reconstruction-recombination training strategy to enforce this separation. DSA-Tokenizer enables high-fidelity reconstruction and flexible recombination through robust disentanglement, facilitating controllable generation in speech LLMs. Our analysis highlights disentangled tokenization as a pivotal paradigm for future speech modeling. Audio samples are available at https://anonymous.4open.science/w/DSA_Tokenizer_demo/. The code and model will be made publicly available after the paper has been accepted.

[295] Speech-Hands: A Self-Reflection Voice Agentic Approach to Speech Recognition and Audio Reasoning with Omni Perception

Zhen Wan, Chao-Han Huck Yang, Jinchuan Tian, Hanrong Ye, Ankita Pasad, Szu-wei Fu, Arushi Goel, Ryo Hachiuma, Shizhe Diao, Kunal Dhawan, Sreyan Ghosh, Yusuke Hirota, Zhehuai Chen, Rafael Valle, Ehsan Hosseini Asl, Chenhui Chu, Shinji Watanabe, Yu-Chiang Frank Wang, Boris Ginsburg

Main category: cs.SD

TL;DR: Speech-Hands is a voice-agentic framework that learns when to trust its own audio perception vs. consult external sources, solving the problem where naive fine-tuning degrades performance due to noisy external hypotheses.

Motivation: The work is motivated by the counterintuitive finding that naively fine-tuning an omni-model on both speech recognition and external sound understanding tasks often degrades performance, as the model can be easily misled by noisy external hypotheses.

Method: The framework recasts the problem as an explicit self-reflection decision, introducing a learnable reflection primitive that prevents the model from being derailed by flawed external candidates. This agentic action mechanism generalizes from speech recognition to complex audio reasoning.
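
The agentic action reduces to a small control flow, sketched below; the method names are hypothetical stand-ins for the model's learned reflection primitive, not a published API.

```python
def transcribe_with_reflection(model, audio, external_hyps):
    """First emit an explicit self-reflection decision, then act on it:
    trust the model's own perception, or select among external candidates."""
    decision = model.reflect(audio, external_hyps)   # learned reflection step
    if decision == "trust_self":
        return model.transcribe(audio)               # ignore noisy hypotheses
    return model.select_best(audio, external_hyps)   # consult external perception
```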

Result: Speech-Hands outperforms strong baselines by 12.1% WER on seven OpenASR benchmarks and achieves 77.37% accuracy with high F1 on audio QA decisions, showing robust generalization across diverse audio question answering datasets.

Conclusion: By unifying perception and decision-making through explicit self-reflection, the work offers a practical path toward more reliable and resilient audio intelligence that knows when to trust itself versus consult external sources.

Abstract: We introduce a voice-agentic framework that learns one critical omni-understanding skill: knowing when to trust itself versus when to consult external audio perception. Our work is motivated by a crucial yet counterintuitive finding: naively fine-tuning an omni-model on both speech recognition and external sound understanding tasks often degrades performance, as the model can be easily misled by noisy hypotheses. To address this, our framework, Speech-Hands, recasts the problem as an explicit self-reflection decision. This learnable reflection primitive proves effective in preventing the model from being derailed by flawed external candidates. We show that this agentic action mechanism generalizes naturally from speech recognition to complex, multiple-choice audio reasoning. Across the OpenASR leaderboard, Speech-Hands consistently outperforms strong baselines by 12.1% WER on seven benchmarks. The model also achieves 77.37% accuracy and high F1 on audio QA decisions, showing robust generalization and reliability across diverse audio question answering datasets. By unifying perception and decision-making, our work offers a practical path toward more reliable and resilient audio intelligence.

[296] SLAM-LLM: A Modular, Open-Source Multimodal Large Language Model Framework and Best Practice for Speech, Language, Audio and Music Processing

Ziyang Ma, Guanrou Yang, Wenxi Chen, Zhifu Gao, Yexing Du, Xiquan Li, Zhisheng Zheng, Haina Zhu, Jianheng Zhuo, Zheshu Song, Ruiyang Xu, Tiranrui Wang, Yifan Yang, Yanqiao Zhu, Zhikang Niu, Liumeng Xue, Yinghao Ma, Ruibin Yuan, Shiliang Zhang, Kai Yu, Eng Siong Chng, Xie Chen

Main category: cs.SD

TL;DR: SLAM-LLM is an open-source framework for training multimodal LLMs focused on speech, audio, and music processing, addressing the gap in existing vision-focused MLLM frameworks.

Motivation: Existing MLLM frameworks (like LLaVA) are primarily vision-focused with limited support for speech/audio/music modalities, forcing researchers to spend excessive effort on coding and hyperparameter tuning instead of research.

Method: Provides modular configuration of encoders, projectors, LLMs, and parameter-efficient fine-tuning plugins, along with detailed training/inference recipes for mainstream audio-language tasks.

Result: Delivers high-performance checkpoints for tasks like LLM-based ASR, Automated Audio Captioning, and Music Captioning, with some recipes reaching or approaching state-of-the-art performance.

Conclusion: SLAM-LLM aims to accelerate research in audio-based MLLMs by providing an open-source framework and calls for community contributions to advance LLM-based speech/audio/music processing.

Abstract: The recent surge in open-source Multimodal Large Language Models (MLLM) frameworks, such as LLaVA, provides a convenient kickoff for artificial intelligence developers and researchers. However, most of the MLLM frameworks take vision as the main input modality and provide limited in-depth support for the speech, audio, and music modalities. This situation hinders the development of audio-language models and forces researchers to spend a lot of effort on code writing and hyperparameter tuning. We present SLAM-LLM, an open-source deep learning framework designed to train customized MLLMs, focused on speech, language, audio, and music processing. SLAM-LLM provides a modular configuration of different encoders, projectors, LLMs, and parameter-efficient fine-tuning plugins. SLAM-LLM also includes detailed training and inference recipes for mainstream tasks, along with high-performance checkpoints like LLM-based Automatic Speech Recognition (ASR), Automated Audio Captioning (AAC), and Music Captioning (MC). Some of these recipes have already reached or are nearing state-of-the-art performance, and some relevant techniques have also been accepted by academic papers. We hope SLAM-LLM will accelerate iteration, development, data engineering, and model training for researchers. We are committed to continually pushing forward audio-based MLLMs through this open-source framework, and call on the community to contribute to LLM-based speech, audio, and music processing.

[297] Population-Aligned Audio Reproduction With LLM-Based Equalizers

Ioannis Stylianou, Jon Francombe, Pablo Martinez-Nuevo, Sven Ewan Shepstone, Zheng-Hua Tan

Main category: cs.SD

TL;DR: LLM-based system converts natural language prompts to audio equalization settings, enabling conversational sound control that adapts to different listening contexts.

Motivation: Traditional audio equalization is static and requires manual adjustments for different contexts (mood, location, social setting), which is cumbersome and not user-friendly.

Method: Uses LLMs with in-context learning and parameter-efficient fine-tuning, trained on data from controlled listening experiments to map text prompts to preferred equalization settings.
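
The in-context mapping from prompt to equalizer settings can be sketched as below; the band names, the few-shot example, and the `llm` completion callable are assumptions for illustration, not the paper's interface.

```python
import json

def prompt_to_eq(llm, request):
    """Map a natural-language listening request to per-band gains in dB via
    few-shot prompting; `llm` is any text-completion callable, and the
    completion is assumed to be valid JSON."""
    prompt = (
        "Convert the listening request to equalizer gains in dB as JSON.\n"
        'Request: "warm and cozy evening" -> {"bass": 4, "mid": 0, "treble": -2}\n'
        f'Request: "{request}" -> '
    )
    return json.loads(llm(prompt))  # e.g. {"bass": -1, "mid": 2, "treble": 3}
```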

Result: Shows statistically significant improvements in distributional alignment over random sampling and static preset baselines, capturing varied user preferences effectively.

Conclusion: LLMs can serve as “artificial equalizers” for more accessible, context-aware, and expert-level audio tuning methods through conversational interfaces.

Abstract: Conventional audio equalization is a static process that requires manual and cumbersome adjustments to adapt to changing listening contexts (e.g., mood, location, or social setting). In this paper, we introduce a Large Language Model (LLM)-based alternative that maps natural language text prompts to equalization settings. This enables a conversational approach to sound system control. By utilizing data collected from a controlled listening experiment, our models exploit in-context learning and parameter-efficient fine-tuning techniques to reliably align with population-preferred equalization settings. Our evaluation methods, which leverage distributional metrics that capture users’ varied preferences, show statistically significant improvements in distributional alignment over random sampling and static preset baselines. These results indicate that LLMs could function as “artificial equalizers,” contributing to the development of more accessible, context-aware, and expert-level audio tuning methods.

[298] Analysis of the Maximum Prediction Gain of Short-Term Prediction on Sustained Speech

Reemt Hinrichs, Muhamad Fadli Damara, Stephan Preihs, Jörn Ostermann

Main category: cs.SD

TL;DR: This paper analyzes the upper bound of prediction gain in speech coding using NWKR and information theory, finding that linear predictors are optimal for unvoiced speech but nonlinear predictors can achieve 2-6 dB higher gain for voiced speech.

Motivation: To determine the maximum achievable prediction gain for speech signals independent of predictor models, which is important for evaluating predictor performance in speech coding applications like data compression and transmission.

Method: Applied Nadaraya-Watson kernel-regression (NWKR) and information theoretic upper bound analysis on a newly recorded dataset of sustained speech/phonemes to analyze prediction gain upper bounds.
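
The quantity at stake is the prediction gain, the ratio of signal variance to prediction-error variance in decibels. A compact leave-one-out NWKR estimator might look like the following sketch; the Gaussian kernel and bandwidth value are assumptions, not the paper's settings.

```python
import numpy as np

def nwkr_prediction_gain(x, taps=2, bandwidth=0.1):
    """Estimate the prediction gain of a Nadaraya-Watson kernel regressor
    predicting x[n] from the previous `taps` samples:
    gain (dB) = 10*log10(var(signal) / var(prediction error))."""
    X = np.stack([x[i:len(x) - taps + i] for i in range(taps)], axis=1)
    y = x[taps:]
    preds = np.empty_like(y)
    for n in range(len(y)):
        d2 = np.sum((X - X[n]) ** 2, axis=1)
        w = np.exp(-d2 / (2 * bandwidth ** 2))
        w[n] = 0.0                                   # leave-one-out weighting
        preds[n] = np.sum(w * y) / (np.sum(w) + 1e-12)
    err = y - preds
    return 10 * np.log10(np.var(y) / np.var(err))

# toy usage: a sustained vowel approximated by a noisy sinusoid
t = np.arange(2000) / 8000
x = np.sin(2 * np.pi * 150 * t) + 0.01 * np.random.randn(t.size)
print(nwkr_prediction_gain(x, taps=2))
```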

Result: For unvoiced speech, linear predictors achieve maximum prediction gain within 0.3 dB. For voiced speech, one-tap linear predictors are optimal, but with two or more taps, nonlinear predictors can achieve 2-6 dB higher gain than linear predictors. Significant speaker-dependent differences were observed.

Conclusion: The study reveals fundamental limits of prediction gain in speech coding, showing that while linear predictors suffice for unvoiced speech, nonlinear approaches offer substantial benefits for voiced speech prediction, with performance varying across speakers.

Abstract: Signal prediction is widely used in, e.g., economic forecasting, echo cancellation and in data compression, particularly in predictive coding of speech and music. Predictive coding algorithms reduce the bit-rate required for data transmission or storage by signal prediction. The prediction gain is a classic measure in applied signal coding of the quality of a predictor, as it links the mean-squared prediction error to the signal-to-quantization-noise ratio of predictive coders. To evaluate predictor models, knowledge about the maximum achievable prediction gain independent of a predictor model is desirable. In this manuscript, Nadaraya-Watson kernel-regression (NWKR) and an information theoretic upper bound are applied to analyze the upper bound of the prediction gain on a newly recorded dataset of sustained speech/phonemes. It was found that for unvoiced speech a linear predictor always achieves the maximum prediction gain within at most 0.3 dB. On voiced speech, the optimum one-tap predictor was found to be linear but starting with two taps, the maximum achievable prediction gain was found to be about 2 dB to 6 dB above the prediction gain of the linear predictor. Significant differences between speakers/subjects were observed. The created dataset as well as the code can be obtained for research purposes upon request.

[299] Towards Realistic Synthetic Data for Automatic Drum Transcription

Pierfrancesco Melucci, Paolo Merialdo, Taketo Akama

Main category: cs.SD

TL;DR: A new semi-supervised method for Automatic Drum Transcription that automatically curates high-quality one-shot drum samples from unlabeled audio, then synthesizes training data from MIDI files alone, achieving state-of-the-art results without paired audio-MIDI datasets.

Motivation: Deep learning ADT models require large paired audio-MIDI datasets which are scarce. Existing synthetic data approaches use low-fidelity SoundFont libraries with acoustic diversity limitations, while high-quality one-shot samples lack standardized large-scale formats for training.

Method: 1) Semi-supervised method to automatically curate large diverse corpus of one-shot drum samples from unlabeled audio sources. 2) Use this corpus to synthesize high-quality dataset from MIDI files alone. 3) Train sequence-to-sequence transcription model on this synthesized dataset.
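
Step 2, synthesizing audio from MIDI alone, can be illustrated by dropping curated one-shots at note onsets; the event and sample formats below are assumptions, not the paper's data layout.

```python
import numpy as np

def render_drums(midi_events, one_shots, sr=44100, length_s=10.0):
    """Place a randomly chosen one-shot at each note onset.
    midi_events: [(onset_sec, drum_class, velocity 0-127)]
    one_shots: {drum_class: [np.ndarray waveforms]}"""
    out = np.zeros(int(sr * length_s))
    rng = np.random.default_rng(0)
    for onset, cls, vel in midi_events:
        sample = one_shots[cls][rng.integers(len(one_shots[cls]))]
        start = int(onset * sr)
        end = min(start + len(sample), len(out))
        out[start:end] += vel / 127.0 * sample[: end - start]
    return out

kick = np.ones(100)  # placeholder one-shot waveform
audio = render_drums([(0.5, "kick", 100)], {"kick": [kick]}, length_s=1.0)
```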

Result: Achieves new state-of-the-art results on ENST and MDB test sets, significantly outperforming both fully supervised methods and previous synthetic-data approaches.

Conclusion: Introduces a new paradigm for ADT that circumvents the need for paired audio-MIDI training data by leveraging automatically curated one-shot samples and synthetic data generation, enabling high-performance transcription without scarce paired datasets.

Abstract: Deep learning models define the state-of-the-art in Automatic Drum Transcription (ADT), yet their performance is contingent upon large-scale, paired audio-MIDI datasets, which are scarce. Existing workarounds that use synthetic data often introduce a significant domain gap, as they typically rely on low-fidelity SoundFont libraries that lack acoustic diversity. While high-quality one-shot samples offer a better alternative, they are not available in a standardized, large-scale format suitable for training. This paper introduces a new paradigm for ADT that circumvents the need for paired audio-MIDI training data. Our primary contribution is a semi-supervised method to automatically curate a large and diverse corpus of one-shot drum samples from unlabeled audio sources. We then use this corpus to synthesize a high-quality dataset from MIDI files alone, which we use to train a sequence-to-sequence transcription model. We evaluate our model on the ENST and MDB test sets, where it achieves new state-of-the-art results, significantly outperforming both fully supervised methods and previous synthetic-data approaches. The code for reproducing our experiments is publicly available at https://github.com/pier-maker92/ADT_STR

[300] MATS: An Audio Language Model under Text-only Supervision

Wen Wang, Ruibing Hou, Hong Chang, Shiguang Shan, Xilin Chen

Main category: cs.SD

TL;DR: MATS is an audio-language multimodal LLM that achieves audio comprehension using only text supervision, bypassing the need for expensive audio-language paired data.

Motivation: Training large audio-language models requires costly audio-language paired datasets. The paper aims to develop audio comprehension capabilities in LLMs without using audio data during training to reduce data collection and training costs.

Method: Leverages pre-trained CLAP models for audio-language alignment, projects shared audio-language latent space into LLM space using text-only training. Introduces Santa mechanism to bridge modality gap by mapping audio embeddings into CLAP language embedding space while preserving audio information.
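
The text-only trick can be sketched as a projector trained on CLAP text embeddings and fed CLAP audio embeddings at inference, relying on the shared space; the dimensions and prefix length below are assumptions.

```python
import torch
import torch.nn as nn

class ClapToLLMProjector(nn.Module):
    """Projects a CLAP embedding into a sequence of LLM soft-prompt tokens.
    Trained on CLAP *text* embeddings only; at inference it receives CLAP
    *audio* embeddings, exploiting CLAP's shared audio-text space."""
    def __init__(self, clap_dim=512, llm_dim=4096, n_prefix=8):
        super().__init__()
        self.proj = nn.Linear(clap_dim, llm_dim * n_prefix)
        self.n_prefix, self.llm_dim = n_prefix, llm_dim

    def forward(self, clap_emb):                          # (batch, clap_dim)
        out = self.proj(clap_emb)
        return out.view(-1, self.n_prefix, self.llm_dim)  # soft prompt tokens
```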

Result: MATS achieves competitive performance compared to recent LALMs trained on large-scale audio-language pairs, despite being trained exclusively on text data.

Conclusion: The proposed text-only training approach enables effective audio comprehension in LLMs without expensive audio-language paired data, offering a cost-effective alternative to traditional LALM training.

Abstract: Large audio-language models (LALMs), built upon powerful Large Language Models (LLMs), have exhibited remarkable audio comprehension and reasoning capabilities. However, the training of LALMs demands a large corpus of audio-language pairs, which requires substantial costs in both data collection and training resources. In this paper, we propose MATS, an audio-language multimodal LLM designed to handle Multiple Audio tasks using solely Text-only Supervision. By leveraging pre-trained audio-language alignment models such as CLAP, we develop a text-only training strategy that projects the shared audio-language latent space into the LLM latent space, endowing the LLM with audio comprehension capabilities without relying on audio data during training. To further bridge the modality gap between audio and language embeddings within CLAP, we propose the Strongly-related noisy text with audio (Santa) mechanism. Santa maps audio embeddings into the CLAP language embedding space while preserving essential information from the audio input. Extensive experiments demonstrate that MATS, despite being trained exclusively on text data, achieves competitive performance compared to recent LALMs trained on large-scale audio-language pairs. The code is publicly available at https://github.com/wangwen-banban/MATS.

[301] Linear Complexity Self-Supervised Learning for Music Understanding with Random Quantizer

Petros Vavaroutsos, Theodoros Palamas, Pantelis Vikatos

Main category: cs.SD

TL;DR: This paper proposes a method to reduce foundation model size for music information retrieval tasks using Branchformer architecture with SummaryMixing and random quantization, achieving competitive performance with 8.5-12.3% size reduction.

Motivation: Foundation models have exceptional performance but are resource-intensive due to their large parameter sizes (hundreds of millions to billions), leading to increased training and production costs. This is particularly challenging when applying these models to music information retrieval tasks.

Method: The research combines Branchformer architecture with SummaryMixing (originally applied in speech recognition) along with a random quantization process. Pre-training is conducted on publicly available datasets plus a proprietary dataset comparable to other private datasets in the literature. Evaluation uses a framework of various downstream MIR tasks.
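
SummaryMixing replaces quadratic self-attention with a per-frame local transform plus a single mean-pooled summary vector, which is what makes it linear in sequence length. The sketch below captures that idea in minimal form; it is an assumption-laden simplification, not the exact published block.

```python
import torch
import torch.nn as nn

class SummaryMixing(nn.Module):
    """Linear-time attention replacement: each frame is combined with a
    global mean summary instead of attending to every other frame."""
    def __init__(self, dim, hidden):
        super().__init__()
        self.local = nn.Sequential(nn.Linear(dim, hidden), nn.GELU())
        self.summary = nn.Sequential(nn.Linear(dim, hidden), nn.GELU())
        self.out = nn.Linear(2 * hidden, dim)

    def forward(self, x):                                  # x: (batch, time, dim)
        s = self.summary(x).mean(dim=1, keepdim=True)      # global summary, O(T)
        s = s.expand(-1, x.size(1), -1)                    # broadcast to each frame
        return self.out(torch.cat([self.local(x), s], dim=-1))

y = SummaryMixing(dim=64, hidden=128)(torch.randn(2, 100, 64))
```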

Result: The proposed architecture achieves competitive performance compared to state-of-the-art models using multi-head self-attention, while reducing model size by 8.5% to 12.3%.

Conclusion: The approach successfully reduces foundation model size for MIR applications while maintaining competitive performance, addressing the resource and cost challenges of large foundation models in production systems.

Abstract: In recent years, foundation models have become very popular due to their exceptional performance, mainly in natural language processing (NLP) tasks where they were first introduced. These models usually consist of hundreds of millions, or even billions, of parameters, making them resource-intensive during training and in production systems, leading to increased costs. This paper focuses on reducing a foundation model’s size when applied to music information retrieval (MIR) tasks. Our research combines the Branchformer architecture with SummaryMixing, which were first applied in speech recognition, along with a random quantization process. To facilitate reproducibility, we conduct pre-training on publicly available datasets, complemented by a proprietary dataset comparable in scale to other private datasets reported in the literature. We ensure robust evaluation by using a framework consisting of a variety of downstream MIR tasks. Our results show that our architecture achieves competitive performance when compared with other state-of-the-art models that use multi-head self-attention, while reducing the model size by 8.5% up to 12.3%.

[302] A Novel Hybrid Deep Learning Technique for Speech Emotion Detection using Feature Engineering

Shahana Yasmin Chowdhury, Bithi Banik, Md Tamjidul Hoque, Shreya Banerjee

Main category: cs.SD

TL;DR: The paper proposes a DCRF-BiLSTM model for speech emotion recognition that achieves state-of-the-art accuracy across five benchmark datasets, including a comprehensive evaluation on all datasets combined.

Motivation: Speech emotion recognition is crucial for human-computer interaction and AI development. Existing studies typically evaluate models on individual datasets, lacking comprehensive assessment across multiple benchmark datasets simultaneously.

Method: The authors propose a DCRF-BiLSTM model (likely combining Deep Convolutional Recurrent Features with Bidirectional LSTM) to recognize seven emotions (neutral, happy, sad, angry, fear, disgust, surprise). The model is trained and evaluated on five benchmark datasets: RAVDESS, TESS, SAVEE, EmoDB, and Crema-D.

Result: The model achieves exceptional accuracy: 97.83% on RAVDESS, 97.02% on SAVEE, 95.10% on CREMA-D, 100% on both TESS and EMO-DB. For combined R+T+S datasets, it achieves 98.82% accuracy. Most notably, it achieves 93.76% overall accuracy on the comprehensive combination of all five datasets (R+T+S+C+E), which is the first such evaluation reported.

Conclusion: The DCRF-BiLSTM framework demonstrates robustness and generalizability across diverse speech emotion datasets, setting new benchmarks for comprehensive SER evaluation and outperforming previous approaches.

Abstract: Nowadays, speech emotion recognition (SER) plays a vital role in the field of human-computer interaction (HCI) and the evolution of artificial intelligence (AI). Our proposed DCRF-BiLSTM model is used to recognize seven emotions: neutral, happy, sad, angry, fear, disgust, and surprise, which are trained on five datasets: RAVDESS (R), TESS (T), SAVEE (S), EmoDB (E), and Crema-D (C). The model achieves high accuracy on individual datasets, including 97.83% on RAVDESS, 97.02% on SAVEE, 95.10% for CREMA-D, and a perfect 100% on both TESS and EMO-DB. For the combined (R+T+S) datasets, it achieves 98.82% accuracy, outperforming previously reported results. To our knowledge, no existing study has evaluated a single SER model across all five benchmark datasets (i.e., R+T+S+C+E) simultaneously. In our work, we introduce this comprehensive combination and achieve a remarkable overall accuracy of 93.76%. These results confirm the robustness and generalizability of our DCRF-BiLSTM framework across diverse datasets.

[303] MOSS Transcribe Diarize: Accurate Transcription with Speaker Diarization

MOSI. AI, Donghua Yu, Zhengyuan Lin, Chen Yang, Yiyang Zhang, Hanfu Chen, Jingqi Chen, Ke Chen, Liwei Fan, Yi Jiang, Jie Zhu, Muchen Li, Wenxuan Wang, Yang Wang, Zhe Xu, Yitian Gong, Yuqian Zhang, Wenbo Zhang, Zhaoye Fei, Songlin Wang, Zhiyu Wu, Qinyuan Cheng, Shimin Li, Xipeng Qiu

Main category: cs.SD

TL;DR: MOSS Transcribe Diarize is a unified multimodal LLM that performs end-to-end speaker-attributed, time-stamped transcription, outperforming commercial systems with its 128k context window for 90-minute inputs.

Motivation: Existing SATS systems lack end-to-end formulation, have limited context windows, weak long-range speaker memory, and cannot output timestamps, creating limitations for meeting transcription needs.

Method: Developed MOSS Transcribe Diarize, a unified multimodal large language model trained on extensive real wild data with 128k context window for up to 90-minute inputs, performing joint speaker-attributed, time-stamped transcription end-to-end.

Result: Outperforms state-of-the-art commercial systems on multiple public and in-house benchmarks, demonstrating strong scaling and robust generalization capabilities.

Conclusion: MOSS Transcribe Diarize successfully addresses key limitations of existing SATS systems through an end-to-end multimodal LLM approach with large context windows, achieving superior performance for meeting transcription tasks.

Abstract: Speaker-Attributed, Time-Stamped Transcription (SATS) aims to transcribe what is said and to precisely determine the timing of each speaker, which is particularly valuable for meeting transcription. Existing SATS systems rarely adopt an end-to-end formulation and are further constrained by limited context windows, weak long-range speaker memory, and the inability to output timestamps. To address these limitations, we present MOSS Transcribe Diarize, a unified multimodal large language model that jointly performs Speaker-Attributed, Time-Stamped Transcription in an end-to-end paradigm. Trained on extensive real wild data and equipped with a 128k context window for up to 90-minute inputs, MOSS Transcribe Diarize scales well and generalizes robustly. Across comprehensive evaluations, it outperforms state-of-the-art commercial systems on multiple public and in-house benchmarks.

cs.LG

[304] Attention Consistency Regularization for Interpretable Early-Exit Neural Networks

Yanhua Zhao

Main category: cs.LG

TL;DR: EGT framework improves interpretability and consistency in early-exit networks via attention-based regularization, maintaining accuracy while boosting speed and explanation quality.

Motivation: Early-exit networks lack interpretability and feature consistency across layers, limiting trust and explainability in resource-constrained AI applications.

Method: Explanation-Guided Training (EGT) uses attention consistency loss to align early-exit attention maps with final exit, jointly optimizing classification and attention alignment.
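
The weighted multi-objective can be sketched directly; the MSE consistency term and the weight `alpha` are assumptions about the exact formulation, not values from the paper.

```python
import torch
import torch.nn.functional as F

def egt_loss(logits_per_exit, labels, attn_per_exit, alpha=0.1):
    """Classification loss at every exit plus an attention-consistency term
    that pulls each early exit's attention map toward the final exit's map."""
    cls = sum(F.cross_entropy(l, labels) for l in logits_per_exit)
    final_attn = attn_per_exit[-1].detach()          # final exit is the target
    consistency = sum(F.mse_loss(a, final_attn) for a in attn_per_exit[:-1])
    return cls + alpha * consistency
```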

Result: Achieves 98.97% accuracy (matching baseline) with 1.97x speedup, improving attention consistency by up to 18.5% compared to baselines.

Conclusion: EGT makes early-exit networks more interpretable and consistent across exits, enhancing suitability for explainable AI in resource-constrained environments.

Abstract: Early-exit neural networks enable adaptive inference by allowing predictions at intermediate layers, reducing computational cost. However, early exits often lack interpretability and may focus on different features than deeper layers, limiting trust and explainability. This paper presents Explanation-Guided Training (EGT), a multi-objective framework that improves interpretability and consistency in early-exit networks through attention-based regularization. EGT introduces an attention consistency loss that aligns early-exit attention maps with the final exit. The framework jointly optimizes classification accuracy and attention consistency through a weighted combination of losses. Experiments on a real-world image classification dataset demonstrate that EGT achieves up to 98.97% overall accuracy (matching baseline performance) with a 1.97x inference speedup through early exits, while improving attention consistency by up to 18.5% compared to baseline models. The proposed method provides more interpretable and consistent explanations across all exit points, making early-exit networks more suitable for explainable AI applications in resource-constrained environments.

[305] Spectral Generative Flow Models: A Physics-Inspired Replacement for Vectorized Large Language Models

Andrew Kiruluta

Main category: cs.LG

TL;DR: Spectral Generative Flow Models (SGFMs) are physics-inspired generative models that treat text/video as continuous field evolution using stochastic dynamics in wavelet basis, replacing transformers with local operators and spectral projections.

Motivation: To create an alternative to transformer-based LLMs that is grounded in physical principles, offering better long-range coherence, multimodal generality, and physically structured inductive bias for next-generation generative models.

Method: Treat generation as evolution of continuous field governed by constrained stochastic dynamics in multiscale wavelet basis; uses field-theoretic ontology (text/video as SPDE trajectories), wavelet-domain representation for sparsity/scale separation, and constrained stochastic flow for stability/coherence.
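
As a very loose illustration of "constrained stochastic dynamics in a wavelet basis", the sketch below damps and perturbs wavelet coefficients scale by scale before reconstructing; it demonstrates the representation only and is not the paper's actual SPDE.

```python
import numpy as np
import pywt

def wavelet_flow_step(field, dt=0.01, noise=0.05, damping=0.9):
    """One illustrative stochastic update: decompose into multiscale wavelet
    coefficients, apply a stabilizing drift plus scale-dependent noise,
    then reconstruct the field."""
    coeffs = pywt.wavedec(field, "db4", level=4)     # multiscale decomposition
    new_coeffs = []
    for level, c in enumerate(coeffs):
        drift = -damping * c * dt                    # contractive, stabilizing drift
        diffusion = noise * np.sqrt(dt) * np.random.randn(*c.shape) / (level + 1)
        new_coeffs.append(c + drift + diffusion)     # finer scales get less noise
    return pywt.waverec(new_coeffs, "db4")

field = wavelet_flow_step(np.random.randn(1024))
```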

Result: Proposes a novel generative architecture that fundamentally departs from autoregressive and diffusion-based approaches, offering principled path toward improved long-range coherence and multimodal capabilities.

Conclusion: SGFMs represent a physics-inspired alternative to transformers that leverages continuity, geometry, and physical structure to potentially overcome limitations of current generative models while maintaining computational efficiency through wavelet representations.

Abstract: We introduce Spectral Generative Flow Models (SGFMs), a physics-inspired alternative to transformer-based large language models. Instead of representing text or video as sequences of discrete tokens processed by attention, SGFMs treat generation as the evolution of a continuous field governed by constrained stochastic dynamics in a multiscale wavelet basis. This formulation replaces global attention with local operators, spectral projections, and Navier–Stokes-like transport, yielding a generative mechanism grounded in continuity, geometry, and physical structure. Our framework provides three key innovations: (i) a field-theoretic ontology in which text and video are unified as trajectories of a stochastic partial differential equation; (ii) a wavelet-domain representation that induces sparsity, scale separation, and computational efficiency; and (iii) a constrained stochastic flow that enforces stability, coherence, and uncertainty propagation. Together, these components define a generative architecture that departs fundamentally from autoregressive modeling and diffusion-based approaches. SGFMs offer a principled path toward long-range coherence, multimodal generality, and physically structured inductive bias in next-generation generative models.

[306] XGBoost Forecasting of NEPSE Index Log Returns with Walk Forward Validation

Sahaj Raj Malla, Shreeyash Kayastha, Rumi Suwal, Harish Chandra Bhandari, Rajendra Adhikari

Main category: cs.LG

TL;DR: XGBoost-based ML framework for one-step-ahead forecasting of Nepal Stock Exchange daily returns, outperforming ARIMA and Ridge regression benchmarks with 65.15% directional accuracy.

Motivation: To develop a robust machine learning framework for forecasting daily log-returns in the Nepal Stock Exchange (NEPSE) Index, addressing the need for effective predictive models in volatile emerging markets.

Method: Uses XGBoost regressor with engineered features including lagged log-returns (up to 30 days), rolling volatility measures, and RSI. Hyperparameter optimization via Optuna with time-series cross-validation. Performance evaluated through walk-forward validation with expanding and rolling windows, using RMSE, MAE, R-squared, and directional accuracy metrics.
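
A condensed sketch of the feature engineering and expanding-window walk-forward loop follows; the hyperparameters and the refit-every-step schedule are simplifying assumptions, not the paper's tuned configuration.

```python
import numpy as np
import pandas as pd
from xgboost import XGBRegressor

def make_features(close: pd.Series, n_lags: int = 20) -> pd.DataFrame:
    """Lagged log-returns, rolling volatilities, and a 14-period RSI."""
    r = np.log(close).diff()
    df = pd.DataFrame({f"lag_{k}": r.shift(k) for k in range(1, n_lags + 1)})
    df["vol_5"] = r.rolling(5).std()
    df["vol_20"] = r.rolling(20).std()
    delta = close.diff()
    gain = delta.clip(lower=0).rolling(14).mean()
    loss = (-delta.clip(upper=0)).rolling(14).mean()
    df["rsi_14"] = 100 - 100 / (1 + gain / loss)
    return df.assign(target=r.shift(-1)).dropna()    # next-day log-return

def walk_forward(df: pd.DataFrame, start: int = 500) -> np.ndarray:
    """Expanding-window walk-forward: refit, predict one step ahead, grow."""
    preds = []
    for t in range(start, len(df)):
        train = df.iloc[:t]                          # no lookahead past day t
        model = XGBRegressor(n_estimators=200, max_depth=4, learning_rate=0.05)
        model.fit(train.drop(columns="target"), train["target"])
        x_next = df.iloc[[t]].drop(columns="target")
        preds.append(model.predict(x_next)[0])
    return np.array(preds)
```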

Result: Optimal configuration (expanding window with 20 lags) achieved lowest log-return RMSE (0.013450) and MAE (0.009814) with 65.15% directional accuracy, outperforming ARIMA and Ridge regression benchmarks.

Conclusion: Demonstrates effectiveness of gradient boosting ensembles for modeling nonlinear dynamics in volatile emerging markets and establishes reproducible benchmark for NEPSE Index forecasting.

Abstract: This study develops a robust machine learning framework for one-step-ahead forecasting of daily log-returns in the Nepal Stock Exchange (NEPSE) Index using the XGBoost regressor. A comprehensive feature set is engineered, including lagged log-returns (up to 30 days) and established technical indicators such as short- and medium-term rolling volatility measures and the 14-period Relative Strength Index. Hyperparameter optimization is performed using Optuna with time-series cross-validation on the initial training segment. Out-of-sample performance is rigorously assessed via walk-forward validation under both expanding and fixed-length rolling window schemes across multiple lag configurations, simulating real-world deployment and avoiding lookahead bias. Predictive accuracy is evaluated using root mean squared error, mean absolute error, coefficient of determination (R-squared), and directional accuracy on both log-returns and reconstructed closing prices. Empirical results show that the optimal configuration, an expanding window with 20 lags, outperforms tuned ARIMA and Ridge regression benchmarks, achieving the lowest log-return RMSE (0.013450) and MAE (0.009814) alongside a directional accuracy of 65.15%. While the R-squared remains modest, consistent with the noisy nature of financial returns, primary emphasis is placed on relative error reduction and directional prediction. Feature importance analysis and visual inspection further enhance interpretability. These findings demonstrate the effectiveness of gradient boosting ensembles in modeling nonlinear dynamics in volatile emerging market time series and establish a reproducible benchmark for NEPSE Index forecasting.

[307] DriftGuard: A Hierarchical Framework for Concept Drift Detection and Remediation in Supply Chain Forecasting

Shahnawaz Alam, Mohammed Abdul Rahman, Bareera Sadeqa

Main category: cs.LG

TL;DR: DriftGuard is an end-to-end framework for detecting, diagnosing, and remediating concept drift in supply chain forecasting models, achieving 97.8% detection recall within 4.2 days and 417 ROI through targeted model updates.

DetailsMotivation: Supply chain forecasting models degrade silently over time due to concept drift from promotions, consumer preferences, and supply disruptions, causing stockouts or excess inventory. Current industry practice relies on inefficient manual monitoring and scheduled retraining, while academic methods focus only on detection without addressing diagnosis, remediation, or hierarchical data structure.

Method: DriftGuard is a five-module framework combining: 1) ensemble of four detection methods (error-based monitoring, statistical tests, autoencoder anomaly detection, CUSUM change-point analysis), 2) hierarchical propagation analysis to locate drift across product lines, 3) SHAP analysis for root cause diagnosis, 4) cost-aware retraining strategy for selective model updates, and 5) end-to-end drift lifecycle management.

Result: Evaluated on over 30,000 time series from M5 retail dataset, DriftGuard achieves 97.8% detection recall within 4.2 days and delivers up to 417 return on investment through targeted remediation.

Conclusion: DriftGuard provides a comprehensive solution for concept drift in supply chain forecasting by addressing the complete drift lifecycle - detection, diagnosis, and remediation - while considering hierarchical data structure and enabling cost-effective targeted retraining.

Abstract: Supply chain forecasting models degrade over time as real-world conditions change. Promotions shift, consumer preferences evolve, and supply disruptions alter demand patterns, causing what is known as concept drift. This silent degradation leads to stockouts or excess inventory without triggering any system warnings. Current industry practice relies on manual monitoring and scheduled retraining every 3-6 months, which wastes computational resources during stable periods while missing rapid drift events. Existing academic methods focus narrowly on drift detection without addressing diagnosis or remediation, and they ignore the hierarchical structure inherent in supply chain data. What retailers need is an end-to-end system that detects drift early, explains its root causes, and automatically corrects affected models. We propose DriftGuard, a five-module framework that addresses the complete drift lifecycle. The system combines an ensemble of four complementary detection methods, namely error-based monitoring, statistical tests, autoencoder anomaly detection, and Cumulative Sum (CUSUM) change-point analysis, with hierarchical propagation analysis to identify exactly where drift occurs across product lines. Once detected, Shapley Additive Explanations (SHAP) analysis diagnoses the root causes, and a cost-aware retraining strategy selectively updates only the most affected models. Evaluated on over 30,000 time series from the M5 retail dataset, DriftGuard achieves 97.8% detection recall within 4.2 days and delivers up to 417 return on investment through targeted remediation.
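
CUSUM change-point analysis, one of DriftGuard's four detectors, is compact enough to sketch. The slack and threshold values below are illustrative assumptions, and the detector runs on forecast errors rather than raw demand.

```python
# Two-sided CUSUM drift detector on forecast errors (toy sketch).
import numpy as np

def cusum_drift(errors, k=0.5, h=5.0):
    """Return the first index where the CUSUM statistic exceeds h, else None.

    k is the allowed slack (in standard deviations); h is the alarm threshold.
    """
    mu, sigma = np.mean(errors[:50]), np.std(errors[:50])  # reference window
    z = (np.asarray(errors) - mu) / (sigma + 1e-12)
    s_pos = s_neg = 0.0
    for i, zi in enumerate(z):
        s_pos = max(0.0, s_pos + zi - k)   # drift toward larger errors
        s_neg = max(0.0, s_neg - zi - k)   # drift toward smaller errors
        if s_pos > h or s_neg > h:
            return i
    return None

rng = np.random.default_rng(1)
errs = np.concatenate([rng.normal(0, 1, 200), rng.normal(2, 1, 100)])  # shift at t=200
print(cusum_drift(errs))  # alarms shortly after the shift
```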

[308] Adaptive Requesting in Decentralized Edge Networks via Non-Stationary Bandits

Yi Zhuang, Kun Yang, Xingran Chen

Main category: cs.LG

TL;DR: Decentralized collaborative requesting problem for optimizing information freshness in edge networks using a novel bandit algorithm with adaptive reset mechanisms.

DetailsMotivation: Optimize information freshness for time-sensitive clients in edge networks where clients cannot observe access node states or other clients' actions, creating a decentralized, partially observable environment with coupled and non-stationary reward processes.

Method: Propose AGING BANDIT WITH ADAPTIVE RESET algorithm that combines adaptive windowing with periodic monitoring to track evolving reward distributions in a non-stationary multi-armed bandit formulation.

Result: Theoretical performance guarantees show the algorithm achieves near-optimal performance, validated through simulations.

Conclusion: The proposed algorithm effectively addresses challenges of history-dependent, coupled reward processes with abrupt and gradual changes in decentralized edge networks, outperforming classical bandit approaches.

Abstract: We study a decentralized collaborative requesting problem that aims to optimize the information freshness of time-sensitive clients in edge networks consisting of multiple clients, access nodes (ANs), and servers. Clients request content through ANs acting as gateways, without observing AN states or the actions of other clients. We define the reward as the age of information reduction resulting from a client’s selection of an AN, and formulate the problem as a non-stationary multi-armed bandit. In this decentralized and partially observable setting, the resulting reward process is history-dependent and coupled across clients, and exhibits both abrupt and gradual changes in expected rewards, rendering classical bandit-based approaches ineffective. To address these challenges, we propose the AGING BANDIT WITH ADAPTIVE RESET algorithm, which combines adaptive windowing with periodic monitoring to track evolving reward distributions. We establish theoretical performance guarantees showing that the proposed algorithm achieves near-optimal performance, and we validate the theoretical results through simulations.
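
The summary does not spell out the algorithm's internals, so the sketch below shows a generic sliding-window UCB with a periodic reset, i.e., the kind of mechanism that "adaptive windowing with periodic monitoring" refers to. It is a stand-in, not the paper's AGING BANDIT WITH ADAPTIVE RESET.

```python
# Generic sliding-window UCB with periodic reset for non-stationary rewards.
import numpy as np
from collections import deque

def sliding_window_ucb(reward_fn, n_arms, horizon, window=200, reset_every=1000):
    history = [deque(maxlen=window) for _ in range(n_arms)]  # recent rewards per arm
    choices = []
    for t in range(horizon):
        if t % reset_every == 0:
            for h in history:
                h.clear()                       # periodic reset for abrupt changes
        ucb = np.empty(n_arms)
        for a, h in enumerate(history):
            if not h:
                ucb[a] = np.inf                 # force exploration of empty arms
            else:
                ucb[a] = np.mean(h) + np.sqrt(2 * np.log(t + 1) / len(h))
        arm = int(np.argmax(ucb))
        history[arm].append(reward_fn(arm, t))
        choices.append(arm)
    return choices

# Toy usage: arm index shifts the success probability of a Bernoulli reward.
picks = sliding_window_ucb(
    lambda a, t: float(np.random.rand() < 0.3 + 0.2 * a), n_arms=3, horizon=500)
```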

[309] Breaking the Bottlenecks: Scalable Diffusion Models for 3D Molecular Generation

Adrita Das, Peiran Jiang, Dantong Zhu, Barnabas Poczos, Jose Lugo-Martinez

Main category: cs.LG

TL;DR: DDDM replaces stochastic diffusion with deterministic denoising for faster molecular generation, and this work provides theoretical justification using Reverse Transition Kernel framework.

DetailsMotivation: Diffusion models for molecular design suffer from slow sampling, stochastic variance, and limited structural awareness. DDDM improves efficiency but lacks theoretical foundation.

Method: Reinterpret DDDM through Reverse Transition Kernel (RTK) framework, showing deterministic denoising as approximate kernel operator optimizing structured transport maps between noisy and clean samples.

Result: RTK-guided deterministic denoising achieves faster convergence, higher structural fidelity, and preserves chemical validity on GEOM-DRUGS dataset while eliminating stochastic variance.

Conclusion: The RTK framework provides theoretical foundation for deterministic denoising, resolves bottlenecks in molecular diffusion, and enables efficient, stable, symmetry-preserving molecular generation.

Abstract: Diffusion models have emerged as a powerful class of generative models for molecular design, capable of capturing complex structural distributions and achieving high fidelity in 3D molecule generation. However, their widespread use remains constrained by long sampling trajectories, stochastic variance in the reverse process, and limited structural awareness in denoising dynamics. The Directly Denoising Diffusion Model (DDDM) mitigates these inefficiencies by replacing stochastic reverse MCMC updates with a deterministic denoising step, substantially reducing inference time. Yet the theoretical underpinnings of such deterministic updates have remained opaque. In this work, we provide a principled reinterpretation of DDDM through the lens of the Reverse Transition Kernel (RTK) framework of Huang et al. (2024), unifying deterministic and stochastic diffusion under a shared probabilistic formalism. By expressing the DDDM reverse process as an approximate kernel operator, we show that the direct denoising process implicitly optimizes a structured transport map between noisy and clean samples. This perspective elucidates why deterministic denoising achieves efficient inference. Beyond theoretical clarity, this reframing resolves several long-standing bottlenecks in molecular diffusion. The RTK view ensures numerical stability by enforcing well-conditioned reverse kernels, improves sample consistency by eliminating stochastic variance, and enables scalable and symmetry-preserving denoisers that respect SE(3) equivariance. Empirically, we demonstrate that RTK-guided deterministic denoising achieves faster convergence and higher structural fidelity than stochastic diffusion models, while preserving chemical validity across the GEOM-DRUGS dataset. Code, models, and datasets are publicly available in our project repository.

[310] Continuous Fairness On Data Streams

Subhodeep Ghosh, Zhihui Du, Angela Bonifati, Manish Kumar, David Bader, Senjuti Basu Roy

Main category: cs.LG

TL;DR: The paper proposes a novel fairness model for data streams that enforces group fairness at a finer block-level granularity within sliding windows, with efficient monitoring and reordering algorithms.

DetailsMotivation: When window sizes are large in data streams, enforcing fairness at the window level may be insufficient. There's a need for finer-grained fairness enforcement at block-level within each sliding window to ensure more equitable treatment across groups.

Method: Proposes a block-level group fairness model for sliding windows in data streams. Designs sketch-based data structures for efficient real-time monitoring of fairness violations. Develops optimal algorithms for reordering windows when fairness is violated, with theoretical guarantees.

Result: Achieves millisecond-level processing and ~30,000 queries/second throughput. The reordering algorithm improves block-level group fairness by up to 95% in some cases, and 50-60% on average across datasets. Qualitative study shows advantages over window-level fairness.

Conclusion: The proposed block-level fairness model provides more granular fairness enforcement in data streams. The efficient monitoring and reordering algorithms make it practical for real-world streaming scenarios, offering significant improvements in fairness metrics.

Abstract: We study the problem of enforcing continuous group fairness over windows in data streams. We propose a novel fairness model that ensures group fairness at a finer granularity level (referred to as block) within each sliding window. This formulation is particularly useful when the window size is large, making it desirable to enforce fairness at a finer granularity. Within this framework, we address two key challenges: efficiently monitoring whether each sliding window satisfies block-level group fairness, and reordering the current window as effectively as possible when fairness is violated. To enable real-time monitoring, we design sketch-based data structures that maintain attribute distributions with minimal overhead. We also develop optimal, efficient algorithms for the reordering task, supported by rigorous theoretical guarantees. Our evaluation on four real-world streaming scenarios demonstrates the practical effectiveness of our approach. We achieve millisecond-level processing and a throughput of approximately 30,000 queries per second on average, depending on system parameters. The stream reordering algorithm improves block-level group fairness by up to 95% in certain cases, and by 50-60% on average across datasets. A qualitative study further highlights the advantages of block-level fairness compared to window-level fairness.
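
A toy version of the block-level monitor: the paper maintains attribute distributions with sketch-based structures, but plain counters convey the check. The block size, group labels, and tolerance below are illustrative assumptions.

```python
# Block-level group-fairness check over a sliding window (toy sketch).
from collections import Counter

def block_fairness_violations(window, block_size, target_share, tol=0.1):
    """Return indices of blocks whose protected-group share deviates from target.

    `window` is a list of group labels (e.g., 'A'/'B') in arrival order.
    """
    violations = []
    for i in range(0, len(window) - block_size + 1, block_size):
        block = window[i:i + block_size]
        share = Counter(block)["A"] / block_size
        if abs(share - target_share) > tol:
            violations.append(i // block_size)
    return violations

window = ["A", "B"] * 10 + ["A"] * 10          # last block over-represents 'A'
print(block_fairness_violations(window, block_size=10, target_share=0.5))  # [2]
```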

[311] Optimising for Energy Efficiency and Performance in Machine Learning

Emile Dos Santos Ferreira, Neil D. Lawrence, Andrei Paleyes

Main category: cs.LG

TL;DR: ECOpt is a hyperparameter tuner that optimizes for both energy efficiency and model performance, creating Pareto frontiers to help practitioners balance accuracy with environmental impact.

DetailsMotivation: Growing energy consumption in ML, lack of understanding about energy scaling laws, focus on training costs ignoring inference costs, and insufficient tools for measuring and optimizing energy efficiency.

Method: Developed Energy Consumption Optimiser (ECOpt) - a hyperparameter tuner that simultaneously optimizes for energy efficiency and model performance, quantifying trade-offs as interpretable Pareto frontiers.

Result: Showed that parameter and FLOP counts are unreliable proxies for energy consumption; Transformer energy efficiency is consistent across hardware; discovered 7 CIFAR-10 models that improve state-of-the-art when considering both accuracy and energy efficiency.

Conclusion: ECOpt enables informed decisions about energy costs and environmental impact while maximizing model benefits and complying with regulations; motivates measuring and publishing ML energy metrics for net positive environmental impact.

Abstract: The ubiquity of machine learning (ML) and the demand for ever-larger models bring an increase in energy consumption and environmental impact. However, little is known about the energy scaling laws in ML, and existing research focuses on training cost – ignoring the larger cost of inference. Furthermore, tools for measuring the energy consumption of ML do not provide actionable feedback. To address these gaps, we developed Energy Consumption Optimiser (ECOpt): a hyperparameter tuner that optimises for energy efficiency and model performance. ECOpt quantifies the trade-off between these metrics as an interpretable Pareto frontier. This enables ML practitioners to make informed decisions about energy cost and environmental impact, while maximising the benefit of their models and complying with new regulations. Using ECOpt, we show that parameter and floating-point operation counts can be unreliable proxies for energy consumption, and observe that the energy efficiency of Transformer models for text generation is relatively consistent across hardware. These findings motivate measuring and publishing the energy metrics of ML models. We further show that ECOpt can have a net positive environmental impact, and we use it to uncover seven models for CIFAR-10 that improve upon the state of the art when considering accuracy and energy efficiency together.
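
The Pareto frontier that ECOpt reports is straightforward to compute from a set of completed trials; a minimal sketch (with made-up trial values) follows.

```python
# Pareto frontier over (energy, accuracy): keep trials that no other trial
# dominates on both objectives (lower energy, higher accuracy). Values are
# made up for illustration.
def pareto_frontier(trials):
    """trials: list of (energy_joules, accuracy) pairs; returns non-dominated ones."""
    frontier = []
    for i, (e_i, a_i) in enumerate(trials):
        dominated = any(
            e_j <= e_i and a_j >= a_i and (e_j < e_i or a_j > a_i)
            for j, (e_j, a_j) in enumerate(trials) if j != i
        )
        if not dominated:
            frontier.append((e_i, a_i))
    return sorted(frontier)

trials = [(120, 0.91), (80, 0.89), (200, 0.92), (85, 0.90), (300, 0.92)]
print(pareto_frontier(trials))  # [(80, 0.89), (85, 0.90), (120, 0.91), (200, 0.92)]
```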

[312] Physics-Guided Counterfactual Explanations for Large-Scale Multivariate Time Series: Application in Scalable and Interpretable SEP Event Prediction

Pranjal Patil, Anli Ji, Berkay Aydin

Main category: cs.LG

TL;DR: A physics-guided counterfactual explanation framework for solar energetic particle forecasting that generates physically plausible explanations while improving proximity, sparsity, and runtime compared to existing methods.

DetailsMotivation: Solar energetic particle prediction is critical for space safety, but existing ML models lack interpretability and physical plausibility in their explanations. Current counterfactual methods don't enforce domain-specific feasibility constraints.

Method: A Physics-Guided Counterfactual Explanation framework that generates counterfactual explanations for time series classification tasks while ensuring consistency with underlying physical principles. Applied to SEP forecasting using multivariate time series data from GOES satellites.

Result: Achieves over 80% reduction in Dynamic Time Warping distance (improving proximity), produces counterfactuals with higher sparsity, and reduces runtime by nearly 50% compared to state-of-the-art baselines like DiCE. Ensures physical plausibility of explanations.

Conclusion: The framework generates valid and physically consistent counterfactual explanations, providing actionable insights for scientific domains while laying foundation for scalable counterfactual generation in big data environments.

Abstract: Accurate prediction of solar energetic particle events is vital for safeguarding satellites, astronauts, and space-based infrastructure. Modern space weather monitoring generates massive volumes of high-frequency, multivariate time series (MVTS) data from sources such as the Geostationary Operational Environmental Satellites (GOES). Machine learning (ML) models trained on this data show strong predictive power, but most existing methods overlook domain-specific feasibility constraints. Counterfactual explanations have emerged as a key tool for improving model interpretability, yet existing approaches rarely enforce physical plausibility. This work introduces a Physics-Guided Counterfactual Explanation framework, a novel method for generating counterfactual explanations in time series classification tasks that remain consistent with underlying physical principles. Applied to solar energetic particle (SEP) forecasting, this framework achieves over 80% reduction in Dynamic Time Warping (DTW) distance, improving proximity; produces counterfactual explanations with higher sparsity; and reduces runtime by nearly 50% compared to state-of-the-art baselines such as DiCE. Beyond numerical improvements, this framework ensures that generated counterfactual explanations are physically plausible and actionable in scientific domains. In summary, the framework generates counterfactual explanations that are both valid and physically consistent, while laying the foundation for scalable counterfactual generation in big data environments.
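
Proximity here is measured with Dynamic Time Warping distance; a textbook O(nm) implementation for univariate series is sketched below for reference.

```python
# Classic dynamic-programming DTW distance between two 1-D series.
import numpy as np

def dtw_distance(a, b):
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            # extend the cheapest of match / insertion / deletion
            D[i, j] = cost + min(D[i - 1, j - 1], D[i - 1, j], D[i, j - 1])
    return D[n, m]

original = np.sin(np.linspace(0, 6, 80))
counterfactual = np.sin(np.linspace(0, 6, 80) + 0.2)
print(dtw_distance(original, counterfactual))  # lower = closer counterfactual
```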

[313] Universal Dynamics of Warmup Stable Decay: understanding WSD beyond Transformers

Annalisa Belloni, Lorenzo Noci, Antonio Orvieto

Main category: cs.LG

TL;DR: WSD scheduler performs similarly across transformers and CNNs, suggesting shared loss landscape geometry in high-dimensional optimization.

DetailsMotivation: To investigate whether WSD scheduler's success is specific to transformers or reveals general properties of high-dimensional optimization landscapes.

Method: Compare WSD optimizer paths on Pythia-like language model vs CNN on CIFAR10, analyzing training signals, path features, and sharpness dynamics.

Result: Most training signals, optimizer path features, and sharpness dynamics are qualitatively similar across both architectures.

Conclusion: WSD’s effectiveness reveals shared geometric characteristics in loss landscapes of diverse nonconvex problems, pointing to fundamental properties of high-dimensional optimization.

Abstract: The Warmup Stable Decay (WSD) learning rate scheduler has recently become popular, largely due to its good performance and flexibility when training large language models. It remains an open question whether the remarkable performance of WSD - using a decaying learning rate for only a fraction of training compared to cosine decay - is a phenomenon specific to transformer-based language models that can potentially offer new theoretical insights into their training dynamics. Inspired by the usage of learning rate schedulers as a new lens into understanding landscape geometry (e.g., river valley, connected minima, progressive sharpening), in this work we compare the WSD path of the Adam optimizer on a Pythia-like language model to that of a small CNN trained to classify CIFAR10 images. We observe most training signals, optimizer path features, and sharpness dynamics to be qualitatively similar in such architectures. This consistency points to shared geometric characteristics of the loss landscapes of old and new nonconvex problems, and hints to future research questions around the geometry of high dimensional optimization problems.
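
The WSD schedule itself is simple to state. The sketch below uses a linear-decay variant; the warmup and decay fractions are illustrative assumptions rather than the paper's settings.

```python
# Warmup Stable Decay learning-rate schedule (linear-decay variant).
def wsd_lr(step, total_steps, peak_lr, warmup_frac=0.01, decay_frac=0.2):
    warmup_end = int(warmup_frac * total_steps)
    decay_start = int((1 - decay_frac) * total_steps)
    if step < warmup_end:                       # warmup: linear ramp to peak
        return peak_lr * step / max(warmup_end, 1)
    if step < decay_start:                      # stable: constant plateau
        return peak_lr
    # decay: linear anneal to zero over the final fraction of training
    return peak_lr * (total_steps - step) / max(total_steps - decay_start, 1)

schedule = [wsd_lr(s, 1000, 3e-4) for s in range(1000)]
```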

[314] Meta-learning to Address Data Shift in Time Series Classification

Samuel Myren, Nidhi Parikh, Natalie Klein

Main category: cs.LG

TL;DR: Meta-learning outperforms traditional deep learning for time-series classification under data shift, especially with limited data and smaller models, but advantages diminish with more data and larger models.

DetailsMotivation: Traditional deep learning models degrade rapidly under real-world data shift, requiring costly relabeling and retraining. Meta-learning offers promise for quick adaptation to new data with few examples.

Method: Systematic comparison of traditional deep learning with fine-tuning vs. optimization-based meta-learning algorithms on time-series classification under data shift, using a controlled seismic benchmark (SeisTask).

Result: Meta-learning achieves faster, more stable adaptation with reduced overfitting in data-scarce regimes and smaller architectures. Advantages diminish with increased data availability and model capacity. Task diversity alone doesn’t drive gains - alignment between training and test distributions is key.

Conclusion: Meta-learning outperforms traditional deep learning under data shift, especially with limited data/resources. SeisTask benchmark enables systematic evaluation of adaptive learning in time-series domains.

Abstract: Across engineering and scientific domains, traditional deep learning (TDL) models perform well when training and test data share the same distribution. However, the dynamic nature of real-world data, broadly termed data shift, renders TDL models prone to rapid performance degradation, requiring costly relabeling and inefficient retraining. Meta-learning, which enables models to adapt quickly to new data with few examples, offers a promising alternative for mitigating these challenges. Here, we systematically compare TDL with fine-tuning and optimization-based meta-learning algorithms to assess their ability to address data shift in time-series classification. We introduce a controlled, task-oriented seismic benchmark (SeisTask) and show that meta-learning typically achieves faster and more stable adaptation with reduced overfitting in data-scarce regimes and smaller model architectures. As data availability and model capacity increase, its advantages diminish, with TDL with fine-tuning performing comparably. Finally, we examine how task diversity influences meta-learning and find that alignment between training and test distributions, rather than diversity alone, drives performance gains. Overall, this work provides a systematic evaluation of when and why meta-learning outperforms TDL under data shift and contributes SeisTask as a benchmark for advancing adaptive learning research in time-series domains.
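
"Optimization-based meta-learning" covers MAML-style methods; the sketch below uses a Reptile-style update (one of the simplest members of that family) on a linear model, purely to show the inner-adapt / outer-interpolate structure. It is not claimed to be the algorithm evaluated in the paper.

```python
# Reptile-style optimization-based meta-learning on toy linear-regression tasks.
import numpy as np

def inner_sgd(w, X, y, lr=0.01, steps=5):
    for _ in range(steps):
        grad = 2 * X.T @ (X @ w - y) / len(y)    # gradient of mean squared error
        w = w - lr * grad
    return w

def reptile(tasks, dim, meta_lr=0.1, epochs=200, seed=0):
    rng = np.random.default_rng(seed)
    w = np.zeros(dim)
    for _ in range(epochs):
        X, y = tasks[rng.integers(len(tasks))]   # sample a training task
        w_adapted = inner_sgd(w.copy(), X, y)
        w = w + meta_lr * (w_adapted - w)        # interpolate toward adapted weights
    return w

rng = np.random.default_rng(1)
tasks = []
for _ in range(10):                              # related tasks around a shared solution
    true_w = np.array([1.0, -1.0, 0.5]) + 0.1 * rng.normal(size=3)
    X = rng.normal(size=(50, 3))
    tasks.append((X, X @ true_w + 0.05 * rng.normal(size=50)))
w_init = reptile(tasks, dim=3)                   # initialization that adapts quickly
print(w_init)
```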

[315] Layer-Parallel Training for Transformers

Shuai Jiang, Marc Salvado, Eric C. Cyr, Alena Kopaničáková, Rolf Krause, Jacob B. Schroder

Main category: cs.LG

TL;DR: Multilevel layer-parallel training for transformers using neural ODE formulation achieves parallel acceleration across layers, with algorithm to detect and mitigate gradient bias issues.

DetailsMotivation: To enhance parallel scalability for increasingly deep foundational models by enabling parallel acceleration across the layer dimension during transformer training.

Method: Uses neural ODE formulation of transformers with multilevel parallel-in-time algorithm for forward/backpropagation, plus detection algorithm to switch to serial training or increase accuracy when gradient bias occurs.

Result: Demonstrates parallel acceleration across BERT, GPT2, ViT, and translation architectures while maintaining accuracy comparable to serial pre-training; fine-tuning unaffected.

Conclusion: Multilevel layer-parallel training enables scalable parallel acceleration for deep transformers with mechanisms to preserve convergence quality when approaching minima.

Abstract: We present a new training methodology for transformers using a multilevel, layer-parallel approach. Through a neural ODE formulation of transformers, our application of a multilevel parallel-in-time algorithm for the forward and backpropagation phases of training achieves parallel acceleration over the layer dimension. This dramatically enhances parallel scalability as the network depth increases, which is particularly useful for increasingly large foundational models. However, achieving this introduces errors that cause systematic bias in the gradients, which in turn reduces convergence when closer to the minima. We develop an algorithm to detect this critical transition and either switch to serial training or systematically increase the accuracy of layer-parallel training. Results, including BERT, GPT2, ViT, and machine translation architectures, demonstrate parallel-acceleration as well as accuracy commensurate with serial pre-training while fine-tuning is unaffected.

[316] SCaLE: Switching Cost aware Learning and Exploration

Neelkamal Bhuyan, Debankur Mukherjee, Adam Wierman

Main category: cs.LG

TL;DR: SCaLE algorithm achieves sub-linear dynamic regret for bandit online convex optimization with unbounded metric movement costs, using novel spectral analysis without knowledge of hitting cost structure.

DetailsMotivation: Addresses the fundamental problem of unbounded metric movement costs in bandit online convex optimization, particularly for high-dimensional dynamic quadratic hitting costs with ℓ₂-norm switching costs in noisy bandit feedback settings.

Method: Proposes SCaLE algorithm with novel spectral regret analysis that separately quantifies eigenvalue-error driven regret and eigenbasis-perturbation driven regret for general stochastic environments without requiring knowledge of hitting cost structure.

Result: First algorithm that provably achieves distribution-agnostic sub-linear dynamic regret in this setting. Extensive numerical experiments show superiority over online-learning baselines and highlight statistical consistency.

Conclusion: SCaLE successfully addresses the unbounded metric movement cost problem in bandit online convex optimization through innovative spectral analysis, providing theoretical guarantees and practical performance improvements.

Abstract: This work addresses the fundamental problem of unbounded metric movement costs in bandit online convex optimization, by considering high-dimensional dynamic quadratic hitting costs and ℓ₂-norm switching costs in a noisy bandit feedback model. For a general class of stochastic environments, we provide SCaLE, the first algorithm that provably achieves distribution-agnostic sub-linear dynamic regret without knowledge of the hitting cost structure. En route, we present a novel spectral regret analysis that separately quantifies eigenvalue-error driven regret and eigenbasis-perturbation driven regret. Extensive numerical experiments against online-learning baselines corroborate our claims and highlight the statistical consistency of our algorithm.

[317] Deep Incomplete Multi-View Clustering via Hierarchical Imputation and Alignment

Yiming Du, Ziyu Wang, Jian Li, Rui Ning, Lusi Li

Main category: cs.LG

TL;DR: DIMVC-HIA: A deep incomplete multi-view clustering framework with hierarchical imputation and alignment that achieves superior performance under varying missingness levels.

DetailsMotivation: Incomplete multi-view clustering faces challenges in accurately imputing missing views without bias while maintaining semantic consistency across views and compactness within clusters.

Method: Four key components: 1) view-specific autoencoders with shared clustering predictor, 2) hierarchical imputation module using cross-view contrastive similarity and intra-view statistics, 3) energy-based semantic alignment for intra-cluster compactness, 4) contrastive assignment alignment for cross-view consistency.

Result: Experiments on benchmarks demonstrate superior performance under varying levels of missingness.

Conclusion: The proposed DIMVC-HIA framework effectively addresses incomplete multi-view clustering challenges through integrated hierarchical imputation and alignment mechanisms.

Abstract: Incomplete multi-view clustering (IMVC) aims to discover shared cluster structures from multi-view data with partial observations. The core challenges lie in accurately imputing missing views without introducing bias, while maintaining semantic consistency across views and compactness within clusters. To address these challenges, we propose DIMVC-HIA, a novel deep IMVC framework that integrates hierarchical imputation and alignment with four key components: (1) view-specific autoencoders for latent feature extraction, coupled with a view-shared clustering predictor to produce soft cluster assignments; (2) a hierarchical imputation module that first estimates missing cluster assignments based on cross-view contrastive similarity, and then reconstructs missing features using intra-view, intra-cluster statistics; (3) an energy-based semantic alignment module, which promotes intra-cluster compactness by minimizing energy variance around low-energy cluster anchors; and (4) a contrastive assignment alignment module, which enhances cross-view consistency and encourages confident, well-separated cluster predictions. Experiments on benchmarks demonstrate that our framework achieves superior performance under varying levels of missingness.

[318] Unified Multimodal Brain Decoding via Cross-Subject Soft-ROI Fusion

Xuanyu Hu

Main category: cs.LG

TL;DR: BrainROI model improves multimodal brain decoding by addressing cross-subject generalization and interpretability challenges, achieving state-of-the-art results on NSD dataset through fMRI encoder design, interpretable prompt optimization, and decoding constraints.

DetailsMotivation: Multimodal brain decoding faces key challenges in cross-subject generalization (due to functional brain topology heterogeneity) and interpretability (due to limitations of manual and black-box prompting methods).

Method: 1) New fMRI encoder using multi-atlas soft functional parcellations (soft-ROI) as shared space with voxel-wise gated fusion mechanism and global label alignment; 2) Interpretable prompt optimization using locally deployed Qwen model in small-sample closed loop; 3) Parameterized decoding constraints during inference.

Result: Achieves leading-level results in brain-captioning evaluation on NSD dataset with clear improvements in BLEU-4 and CIDEr metrics under cross-subject setting compared to recent state-of-the-art methods.

Conclusion: BrainROI model successfully addresses cross-subject generalization and interpretability challenges in multimodal brain decoding through innovative fMRI encoding, transparent prompt optimization, and constrained decoding, setting new benchmarks for brain-captioning tasks.

Abstract: Multimodal brain decoding aims to reconstruct semantic information that is consistent with visual stimuli from brain activity signals such as fMRI, and then generate readable natural language descriptions. However, multimodal brain decoding still faces key challenges in cross-subject generalization and interpretability. We propose a BrainROI model and achieve leading-level results in brain-captioning evaluation on the NSD dataset. Under the cross-subject setting, compared with recent state-of-the-art methods and representative baselines, metrics such as BLEU-4 and CIDEr show clear improvements. Firstly, to address the heterogeneity of functional brain topology across subjects, we design a new fMRI encoder. We use multi-atlas soft functional parcellations (soft-ROI) as a shared space. We extend the discrete ROI Concatenation strategy in MINDLLM to a voxel-wise gated fusion mechanism (Voxel-gate). We also ensure consistent ROI mapping through global label alignment, which enhances cross-subject transferability. Secondly, to overcome the limitations of manual and black-box prompting methods in stability and transparency, we introduce an interpretable prompt optimization process. In a small-sample closed loop, we use a locally deployed Qwen model to iteratively generate and select human-readable prompts. This process improves the stability of prompt design and preserves an auditable optimization trajectory. Finally, we impose parameterized decoding constraints during inference to further improve the stability and quality of the generated descriptions.

[319] Resolving Predictive Multiplicity for the Rashomon Set

Parian Haghighat, Hadis Anahideh, Cynthia Rudin

Main category: cs.LG

TL;DR: The paper proposes three methods to reduce predictive inconsistency in Rashomon sets: outlier correction, local patching, and pairwise reconciliation, which can be combined to create more consistent models while maintaining accuracy.

DetailsMotivation: Predictive multiplicity in Rashomon sets (multiple equally accurate models) leads to inconsistent predictions that undermine trust in high-stakes applications where consistent predictions are essential.

Method: Three approaches: 1) Outlier correction - fixing data points that no good model can predict correctly; 2) Local patching - detecting and fixing model biases in local regions using validation data; 3) Pairwise reconciliation - modifying disagreeing predictions between model pairs to reduce bias.

Result: Experiments across multiple datasets show the methods effectively reduce disagreement metrics while maintaining competitive accuracy levels.

Conclusion: The proposed reconciliation approaches can be used individually or combined to reduce predictive inconsistency in Rashomon sets, and the reconciled predictions can be distilled into a single interpretable model for real-world deployment.

Abstract: The existence of multiple, equally accurate models for a given predictive task leads to predictive multiplicity, where a “Rashomon set” of models achieves similar accuracy but diverges in individual predictions. This inconsistency undermines trust in high-stakes applications where we want consistent predictions. We propose three approaches to reduce inconsistency among predictions for the members of the Rashomon set. The first approach is outlier correction. An outlier has a label that none of the good models are capable of predicting correctly. Outliers can cause the Rashomon set to have high-variance predictions in a local area, so fixing them can lower variance. Our second approach is local patching. In a local region around a test point, models may disagree with each other because some of them are biased. We can detect and fix such biases using a validation set, which also reduces multiplicity. Our third approach is pairwise reconciliation, where we find pairs of models that disagree on a region around the test point. We modify predictions that disagree, making them less biased. These three approaches can be used together or separately, and they each have distinct advantages. The reconciled predictions can then be distilled into a single interpretable model for real-world deployment. In experiments across multiple datasets, our methods reduce disagreement metrics while maintaining competitive accuracy.
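
A toy rendering of the pairwise-reconciliation idea: where two Rashomon-set models disagree on a point, their predictions are pulled to the pair's mean. The disagreement tolerance and the simple averaging rule are illustrative assumptions, not the paper's exact procedure.

```python
# Pairwise reconciliation sketch over a Rashomon set of prediction scores.
import numpy as np

def pairwise_reconcile(predictions, disagree_tol=0.5):
    """predictions: array of shape (n_models, n_points) of predicted scores."""
    preds = predictions.astype(float).copy()
    n_models, n_points = preds.shape
    for x in range(n_points):
        for i in range(n_models):
            for j in range(i + 1, n_models):
                if abs(preds[i, x] - preds[j, x]) > disagree_tol:
                    mean = (preds[i, x] + preds[j, x]) / 2
                    preds[i, x] = preds[j, x] = mean   # reconcile the pair
    return preds

rashomon = np.array([[0.9, 0.2, 0.7],
                     [0.1, 0.3, 0.6],
                     [0.8, 0.9, 0.65]])
print(pairwise_reconcile(rashomon))  # disagreeing columns are averaged
```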

[320] Lean Clients, Full Accuracy: Hybrid Zeroth- and First-Order Split Federated Learning

Zhoubin Kou, Zihan Chen, Jing Yang, Cong Shen

Main category: cs.LG

TL;DR: HERON-SFL is a hybrid optimization framework for Split Federated Learning that combines zeroth-order optimization on clients with first-order optimization on the server to reduce client-side computation and memory requirements while maintaining accuracy.

DetailsMotivation: Current Split Federated Learning (SFL) faces two main challenges: communication overhead (partially addressed by auxiliary networks) and client-side computation limitations. Back-propagation requires substantial memory and computation costs, severely restricting the scale of models that resource-constrained edge devices can support.

Method: HERON-SFL integrates zeroth-order (ZO) optimization for local client training while retaining first-order (FO) optimization on the server. With auxiliary networks, ZO updates enable clients to approximate local gradients using perturbed forward-only evaluations per step, eliminating memory-intensive activation caching and avoiding explicit gradient computation. This approach leverages the low effective rank assumption for theoretical convergence guarantees.

Result: Theoretically, HERON-SFL’s convergence rate is proven to be independent of model dimensionality, addressing a key scalability concern of ZO algorithms. Empirically, on ResNet training and language model fine-tuning tasks, HERON-SFL matches benchmark accuracy while reducing client peak memory by up to 64% and client-side compute cost by up to 33% per step.

Conclusion: HERON-SFL substantially expands the range of models that can be trained or adapted on resource-limited devices by enabling more resource-efficient client computation and reducing client-server communication, making large-scale model training feasible on edge devices.

Abstract: Split Federated Learning (SFL) enables collaborative training between resource-constrained edge devices and a compute-rich server. Communication overhead is a central issue in SFL and can be mitigated with auxiliary networks. Yet, the fundamental client-side computation challenge remains, as back-propagation requires substantial memory and computation costs, severely limiting the scale of models that edge devices can support. To enable more resource-efficient client computation and reduce the client-server communication, we propose HERON-SFL, a novel hybrid optimization framework that integrates zeroth-order (ZO) optimization for local client training while retaining first-order (FO) optimization on the server. With the assistance of auxiliary networks, ZO updates enable clients to approximate local gradients using perturbed forward-only evaluations per step, eliminating memory-intensive activation caching and avoiding explicit gradient computation in the traditional training process. Leveraging the low effective rank assumption, we theoretically prove that HERON-SFL’s convergence rate is independent of model dimensionality, addressing a key scalability concern common to ZO algorithms. Empirically, on ResNet training and language model (LM) fine-tuning tasks, HERON-SFL matches benchmark accuracy while reducing client peak memory by up to 64% and client-side compute cost by up to 33% per step, substantially expanding the range of models that can be trained or adapted on resource-limited devices.
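
The client-side primitive is a standard two-point zeroth-order gradient estimate, which needs only forward evaluations and no backpropagation. A sketch on a toy quadratic follows; the loss, perturbation scale, and sample count are placeholders.

```python
# Two-point zeroth-order gradient estimator: forward passes only.
import numpy as np

def zo_gradient(loss_fn, theta, mu=1e-3, n_samples=8, rng=None):
    rng = rng or np.random.default_rng(0)
    grad = np.zeros_like(theta)
    for _ in range(n_samples):
        u = rng.normal(size=theta.shape)        # random search direction
        delta = loss_fn(theta + mu * u) - loss_fn(theta - mu * u)
        grad += (delta / (2 * mu)) * u          # directional-derivative estimate
    return grad / n_samples

theta = np.array([1.0, -2.0])
loss = lambda w: np.sum(w ** 2)                 # toy quadratic, true grad = 2w
print(zo_gradient(loss, theta))                 # approx [2, -4]
```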

[321] SRT: Accelerating Reinforcement Learning via Speculative Rollout with Tree-Structured Cache

Chi-Chih Chang, Siqi Zhu, Zhichen Zeng, Haibin Lin, Jiaxuan You, Mohamed S. Abdelfattah, Ziheng Jiang, Xuehai Qian

Main category: cs.LG

TL;DR: SRT is a model-free method that accelerates on-policy RL for language models using a tree-structured cache for speculative decoding, achieving up to 2.08x speedup without sacrificing distributional correctness.

DetailsMotivation: On-policy RL for language models is computationally expensive due to sequential token generation during rollouts. There's a need to accelerate this process without compromising the correctness of the learned policy distribution.

Method: SRT stores previously generated continuations in a per-prompt tree-structured cache and uses it as a draft model for speculative decoding. The cache is kept fresh through online updates from ongoing rollouts and proactive run-ahead generation during idle GPU cycles.

Result: SRT consistently reduces generation and step latency, lowers per-token inference cost, and achieves up to 2.08x wall-clock time speedup during rollout when integrated into standard RL pipelines (PPO, GRPO, DAPO) and multi-turn settings.

Conclusion: SRT provides a simple, model-free approach to accelerate on-policy RL for language models by leveraging empirical similarity of rollouts across training steps through tree-structured caching and speculative decoding, maintaining distributional correctness while significantly improving efficiency.

Abstract: We present Speculative Rollout with Tree-Structured Cache (SRT), a simple, model-free approach to accelerate on-policy reinforcement learning (RL) for language models without sacrificing distributional correctness. SRT exploits the empirical similarity of rollouts for the same prompt across training steps by storing previously generated continuations in a per-prompt tree-structured cache. During generation, the current policy uses this tree as the draft model for performing speculative decoding. To keep the cache fresh and improve draft model quality, SRT updates trees online from ongoing rollouts and proactively performs run-ahead generation during idle GPU bubbles. Integrated into standard RL pipelines (e.g., PPO, GRPO, and DAPO) and multi-turn settings, SRT consistently reduces generation and step latency and lowers per-token inference cost, achieving up to 2.08x wall-clock time speedup during rollout.
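
A minimal sketch of the per-prompt tree cache: continuations are stored in a token trie, and the longest cached path from the current context is proposed as the speculative draft. The child-selection rule here is a stand-in, and acceptance against the live policy is omitted.

```python
# Token trie as a draft source for speculative decoding (toy sketch).
class TrieNode:
    def __init__(self):
        self.children = {}

def insert(root, tokens):
    node = root
    for t in tokens:
        node = node.children.setdefault(t, TrieNode())

def longest_draft(root, max_len=8):
    node, draft = root, []
    while node.children and len(draft) < max_len:
        # greedy: follow the most-branching child (a stand-in for a count-based pick)
        t = max(node.children, key=lambda k: len(node.children[k].children))
        draft.append(t)
        node = node.children[t]
    return draft

cache = TrieNode()
insert(cache, [5, 8, 13, 21])      # continuation from an earlier rollout
insert(cache, [5, 8, 34])
print(longest_draft(cache))        # [5, 8, 13, 21]: shared prefix, then one branch
```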

[322] MMR-GRPO: Accelerating GRPO-Style Training through Diversity-Aware Reward Reweighting

Kangda Wei, Ruihong Huang

Main category: cs.LG

TL;DR: MMR-GRPO accelerates mathematical reasoning model training by using Maximal Marginal Relevance to prioritize diverse completions, reducing training steps by 47.9% and wall-clock time by 70.2% while maintaining performance.

DetailsMotivation: GRPO is computationally expensive due to requiring multiple completions per prompt, and while recent work reduced training steps, wall-clock time often remains unchanged or increases due to higher per-step costs.

Method: Integrates Maximal Marginal Relevance (MMR) to reweigh rewards based on completion diversity, prioritizing diverse solutions over semantically redundant ones to yield more informative updates and accelerate convergence.

Result: Across three model sizes (1.5B, 7B, 8B), three GRPO variants, and five mathematical reasoning benchmarks, MMR-GRPO achieves comparable peak performance while requiring 47.9% fewer training steps and 70.2% less wall-clock time.

Conclusion: MMR-GRPO provides significant efficiency gains for mathematical reasoning model training by leveraging completion diversity, with consistent improvements across models, methods, and benchmarks.

Abstract: Group Relative Policy Optimization (GRPO) has become a standard approach for training mathematical reasoning models; however, its reliance on multiple completions per prompt makes training computationally expensive. Although recent work has reduced the number of training steps required to reach peak performance, the overall wall-clock training time often remains unchanged or even increases due to higher per-step cost. We propose MMR-GRPO, which integrates Maximal Marginal Relevance to reweigh rewards based on completion diversity. Our key insight is that semantically redundant completions contribute limited marginal learning signal; prioritizing diverse solutions yields more informative updates and accelerates convergence. Extensive evaluations across three model sizes (1.5B, 7B, 8B), three GRPO variants, and five mathematical reasoning benchmarks show that MMR-GRPO achieves comparable peak performance while requiring on average 47.9% fewer training steps and 70.2% less wall-clock time. These gains are consistent across models, methods, and benchmarks. We will release our code, trained models, and experimental protocols.
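
The Maximal Marginal Relevance trade-off is easy to sketch: a completion's reward is balanced against its similarity to completions already selected. The lambda value, the use of cosine similarity over embeddings, and the greedy pass below are assumptions about the details.

```python
# Greedy MMR pass: each completion's adjusted reward is its MMR score at the
# moment it is selected, down-weighting semantically redundant completions.
import numpy as np

def mmr_adjusted_rewards(rewards, embeddings, lam=0.7):
    emb = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sim = emb @ emb.T                                  # cosine similarities
    remaining = list(range(len(rewards)))
    adjusted = np.zeros(len(rewards))
    selected = []
    while remaining:
        def mmr(i):
            redundancy = max(sim[i, j] for j in selected) if selected else 0.0
            return lam * rewards[i] - (1 - lam) * redundancy
        best = max(remaining, key=mmr)
        adjusted[best] = mmr(best)
        selected.append(best)
        remaining.remove(best)
    return adjusted

rewards = np.array([1.0, 1.0, 0.8])
embs = np.array([[1.0, 0.0], [0.99, 0.1], [0.0, 1.0]])  # first two near-duplicates
print(mmr_adjusted_rewards(rewards, embs))  # the near-duplicate is down-weighted
```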

[323] Distribution-Aligned Sequence Distillation for Superior Long-CoT Reasoning

Shaotian Yan, Kaiyuan Liu, Chen Shen, Bing Wang, Sinan Fan, Jun Zhang, Yue Wu, Zheng Wang, Jieping Ye

Main category: cs.LG

TL;DR: DASD-4B-Thinking is a lightweight open-source reasoning model that achieves SOTA performance with only 448K training samples by addressing limitations in current sequence-level distillation approaches.

DetailsMotivation: Current sequence-level distillation approaches focus on heuristic SFT data filtering but overlook the core distillation principle of enabling students to learn teachers' full output distribution for better generalization. Three critical limitations are identified: inadequate representation of teacher's distribution, misalignment between teacher distribution and student capacity, and exposure bias.

Method: Proposes methodological innovations forming an enhanced sequence-level distillation training pipeline that addresses the three identified limitations, enabling effective knowledge transfer with minimal training data.

Result: DASD-4B-Thinking achieves SOTA performance among comparable open-source models on mathematics, scientific reasoning, and code generation benchmarks, outperforming larger models while using only 448K training samples (order of magnitude fewer than existing approaches).

Conclusion: The work demonstrates that addressing fundamental distillation principles through explicit teacher-student interaction enables highly efficient and effective model training, challenging current community practices and providing a more principled approach to sequence-level distillation.

Abstract: In this report, we introduce DASD-4B-Thinking, a lightweight yet highly capable, fully open-source reasoning model. It achieves SOTA performance among open-source models of comparable scale across challenging benchmarks in mathematics, scientific reasoning, and code generation – even outperforming several larger models. We begin by critically reexamining a widely adopted distillation paradigm in the community: SFT on teacher-generated responses, also known as sequence-level distillation. Although a series of recent works following this scheme have demonstrated remarkable efficiency and strong empirical performance, they are primarily grounded in the SFT perspective. Consequently, these approaches focus predominantly on designing heuristic rules for SFT data filtering, while largely overlooking the core principle of distillation itself – enabling the student model to learn the teacher’s full output distribution so as to inherit its generalization capability. Specifically, we identify three critical limitations in current practice: i) Inadequate representation of the teacher’s sequence-level distribution; ii) Misalignment between the teacher’s output distribution and the student’s learning capacity; and iii) Exposure bias arising from teacher-forced training versus autoregressive inference. In summary, these shortcomings reflect a systemic absence of explicit teacher-student interaction throughout the distillation process, leaving the essence of distillation underexploited. To address these issues, we propose several methodological innovations that collectively form an enhanced sequence-level distillation training pipeline. Remarkably, DASD-4B-Thinking obtains competitive results using only 448K training samples – an order of magnitude fewer than those employed by most existing open-source efforts. To support community research, we publicly release our models and the training dataset.

[324] Hidden States as Early Signals: Step-level Trace Evaluation and Pruning for Efficient Test-Time Scaling

Zhixiang Liang, Beichen Huang, Zheng Wang, Minjia Zhang

Main category: cs.LG

TL;DR: STEP: Step-level Trace Evaluation and Pruning - a framework that accelerates LLM reasoning by dynamically pruning unpromising reasoning traces during generation using hidden state analysis and GPU memory-aware strategy.

DetailsMotivation: Current methods for accelerating LLM reasoning through test-time scaling (generating multiple traces) suffer from high computation and latency. Existing pruning approaches based on similarity or confidence don't reliably indicate trace quality, creating a need for more effective acceleration techniques.

Method: Proposes STEP framework with: 1) Lightweight step scorer trained to estimate trace quality using hidden states, 2) Dynamic pruning of unpromising traces during generation, 3) GPU memory-aware pruning strategy triggered when KV cache saturates memory to reduce latency.

Result: STEP reduces end-to-end inference latency by 45%-70% on average compared to self-consistency while also improving reasoning accuracy across challenging reasoning benchmarks.

Conclusion: STEP provides an effective solution for accelerating LLM reasoning by evaluating reasoning steps at the step level and implementing memory-aware dynamic pruning, achieving both latency reduction and accuracy improvement.

Abstract: Large Language Models (LLMs) can enhance reasoning capabilities through test-time scaling by generating multiple traces. However, the combination of lengthy reasoning traces with multiple sampling introduces substantial computation and high end-to-end latency. Prior work on accelerating this process has relied on similarity-based or confidence-based pruning, but these signals do not reliably indicate trace quality. To address these limitations, we propose STEP: Step-level Trace Evaluation and Pruning, a novel pruning framework that evaluates reasoning steps using hidden states and dynamically prunes unpromising traces during generation. We train a lightweight step scorer to estimate trace quality, and design a GPU memory-aware pruning strategy that triggers pruning as the GPU memory is saturated by KV cache to reduce end-to-end latency. Experiments across challenging reasoning benchmarks demonstrate that STEP reduces end-to-end inference latency by 45%-70% on average compared to self-consistency while also improving reasoning accuracy. Our code is released at: https://github.com/Supercomputing-System-AI-Lab/STEP
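
A toy version of the memory-aware trigger: when the KV cache nears its budget, the lowest-scoring in-flight traces are dropped. The scores stand in for the paper's learned hidden-state scorer, and the keep fraction is an assumption.

```python
# Memory-aware pruning of in-flight reasoning traces (toy sketch).
def prune_traces(traces, scores, kv_bytes_used, kv_budget, keep_frac=0.5):
    """traces/scores are parallel lists; returns the surviving traces."""
    if kv_bytes_used < kv_budget:
        return traces                              # memory not saturated: no-op
    ranked = sorted(zip(scores, traces), key=lambda p: p[0], reverse=True)
    n_keep = max(1, int(len(traces) * keep_frac))  # keep the most promising traces
    return [trace for _, trace in ranked[:n_keep]]

traces = ["trace-a", "trace-b", "trace-c", "trace-d"]
scores = [0.9, 0.2, 0.7, 0.4]                      # from a step-level scorer
print(prune_traces(traces, scores, kv_bytes_used=10**9, kv_budget=8 * 10**8))
```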

[325] Comparative Assessment of Concrete Compressive Strength Prediction at Industry Scale Using Embedding-based Neural Networks, Transformers, and Traditional Machine Learning Approaches

Md Asiful Islam, Md Ahmed Al Muzaddid, Afia Jahin Prema, Sreenath Reddy Vuske

Main category: cs.LG

TL;DR: Embedding-based neural networks outperform other ML models for concrete compressive strength prediction, achieving ~2.5% error comparable to lab testing variability.

DetailsMotivation: Concrete compressive strength prediction is challenging due to material heterogeneity, variable mix proportions, and sensitivity to field/environmental conditions. There's a need for reliable data-driven models to support automated decision-making in construction quality control.

Method: Used industry-scale dataset of ~70,000 compressive strength test records to evaluate multiple predictive approaches: linear regression, decision trees, random forests, transformer-based neural networks, and embedding-based neural networks. Models incorporated mixture design and placement variables (water cement ratio, cementitious material content, slump, air content, temperature, placement conditions).

Result: Embedding-based neural network consistently outperformed traditional ML and transformer-based models, achieving mean 28-day prediction error of approximately 2.5%. This accuracy level is comparable to routine laboratory testing variability.

Conclusion: Embedding-based learning frameworks show potential to enable automated, data-driven quality control and decision support in large-scale construction operations, offering accuracy comparable to standard lab testing.

Abstract: Concrete is the most widely used construction material worldwide; however, reliable prediction of compressive strength remains challenging due to material heterogeneity, variable mix proportions, and sensitivity to field and environmental conditions. Recent advances in artificial intelligence enable data-driven modeling frameworks capable of supporting automated decision-making in construction quality control. This study leverages an industry-scale dataset consisting of approximately 70,000 compressive strength test records to evaluate and compare multiple predictive approaches, including linear regression, decision trees, random forests, transformer-based neural networks, and embedding-based neural networks. The models incorporate key mixture design and placement variables such as water cement ratio, cementitious material content, slump, air content, temperature, and placement conditions. Results indicate that the embedding-based neural network consistently outperforms traditional machine learning and transformer-based models, achieving a mean 28-day prediction error of approximately 2.5%. This level of accuracy is comparable to routine laboratory testing variability, demonstrating the potential of embedding-based learning frameworks to enable automated, data-driven quality control and decision support in large-scale construction operations.
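
A minimal sketch of what an embedding-based strength predictor plausibly looks like here: categorical placement variables pass through learned embeddings, numeric mix variables are concatenated, and an MLP regresses 28-day strength. All sizes and the variable split are illustrative assumptions.

```python
# Embedding-based regressor for compressive strength (illustrative; PyTorch).
import torch
import torch.nn as nn

class ConcreteStrengthNet(nn.Module):
    def __init__(self, n_conditions=10, emb_dim=8, n_numeric=5):
        super().__init__()
        self.cond_emb = nn.Embedding(n_conditions, emb_dim)  # placement condition
        self.mlp = nn.Sequential(
            nn.Linear(emb_dim + n_numeric, 64), nn.ReLU(),
            nn.Linear(64, 1),
        )

    def forward(self, cond_idx, numeric):
        x = torch.cat([self.cond_emb(cond_idx), numeric], dim=-1)
        return self.mlp(x).squeeze(-1)

model = ConcreteStrengthNet()
cond = torch.randint(0, 10, (4,))          # e.g., placement-condition codes
numeric = torch.randn(4, 5)                # w/c ratio, slump, air, temp, cement
print(model(cond, numeric).shape)          # torch.Size([4])
```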

[326] Enhancing Imbalanced Electrocardiogram Classification: A Novel Approach Integrating Data Augmentation through Wavelet Transform and Interclass Fusion

Haijian Shao, Wei Liu, Xing Deng, Daze Lu

Main category: cs.LG

TL;DR: Enhanced ECG classifier using wavelet transform-based feature fusion to address class imbalance and noise in ECG data, achieving up to 99% accuracy on CPSC 2018 dataset.

DetailsMotivation: Imbalanced ECG data hampers automated cardiovascular diagnosis and deep learning classification, with infrequent cardiac conditions being underrepresented. Noise from acquisition methods further complicates ECG processing, creating challenges for reliable classification.

Method: Proposes wavelet transform-based feature fusion with interclass fusion to generate training and test feature libraries. Original data is combined with these feature databases to create more balanced datasets. This approach addresses both class imbalance and noise simultaneously.

Result: Achieved recognition accuracies of 99% (Normal), 98% (AF), 97% (I-AVB), 98% (LBBB), 96% (RBBB), 92% (PAC), 93% (PVC, STD, STE). Average accuracy ranges from 92% to 98% across categories. Outperforms all known algorithms on CPSC 2018 dataset.

Conclusion: The proposed data fusion methodology effectively addresses both class imbalance and noise challenges in ECG analysis, achieving superior classification accuracy compared to existing methods on the CPSC 2018 dataset.

Abstract: Imbalanced electrocardiogram (ECG) data hampers the efficacy and resilience of algorithms in the automated processing and interpretation of cardiovascular diagnostic information, which in turn impedes deep learning-based ECG classification. Notably, certain cardiac conditions that are infrequently encountered are disproportionately underrepresented in these datasets. Although algorithmic generation and oversampling of specific ECG signal types can mitigate class skew, there is a lack of consensus regarding the effectiveness of such techniques in ECG classification. Furthermore, the methodologies and scenarios of ECG acquisition introduce noise, further complicating the processing of ECG data. This paper presents a significantly enhanced ECG classifier that simultaneously addresses both class imbalance and noise-related challenges in ECG analysis, as observed in the CPSC 2018 dataset. Specifically, we propose the application of feature fusion based on the wavelet transform, with a focus on wavelet transform-based interclass fusion, to generate the training feature library and the test set feature library. Subsequently, the original training and test data are amalgamated with their respective feature databases, resulting in more balanced training and test datasets. Employing this approach, our ECG model achieves recognition accuracies of up to 99%, 98%, 97%, 98%, 96%, 92%, and 93% for Normal, AF, I-AVB, LBBB, RBBB, PAC, and PVC/STD/STE, respectively. Furthermore, the average recognition accuracy for these categories ranges between 92% and 98%. Notably, our proposed data fusion methodology surpasses all known algorithms in terms of ECG classification accuracy on the CPSC 2018 dataset.
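
The wavelet side of the pipeline can be sketched with PyWavelets: decompose each ECG segment and use the coefficients as fusion features. The wavelet family, decomposition level, and the simple concatenation-style fusion below are assumptions, not the paper's exact scheme.

```python
# Wavelet-coefficient features for ECG segments (requires the PyWavelets package).
import numpy as np
import pywt

def wavelet_features(signal, wavelet="db4", level=4):
    coeffs = pywt.wavedec(signal, wavelet, level=level)   # [cA4, cD4, ..., cD1]
    return np.concatenate(coeffs)

rng = np.random.default_rng(0)
beat = rng.normal(size=300)                # stand-in for one ECG segment
feats = wavelet_features(beat)
augmented = np.concatenate([beat, feats])  # fuse raw signal with wavelet features
print(len(feats), len(augmented))
```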

[327] EvasionBench: Detecting Evasive Answers in Financial Q&A via Multi-Model Consensus and LLM-as-Judge

Shijian Ma, Yan Lin, Yi Yang

Main category: cs.LG

TL;DR: EvasionBench: A large-scale benchmark for detecting evasive answers in earnings calls, using multi-model disagreement mining to create high-quality training data, resulting in Eva-4B model with 81.3% accuracy.

DetailsMotivation: Progress in detecting evasive answers in earnings calls is hindered by lack of large-scale benchmarks, which is critical for financial transparency.

Method: Multi-model annotation framework that mines boundary cases where two strong LLM annotators disagree, using a judge model to resolve labels. This disagreement mining approach creates valuable training data.

Result: Outperforms single-model distillation by 2.4%, with judge-resolved samples improving generalization despite higher training loss. Eva-4B (4B parameters) achieves 81.3% accuracy, outperforming its base by 25 percentage points and approaching frontier LLM performance at lower inference cost.

Conclusion: Disagreement mining between frontier LLMs serves as implicit regularization and creates valuable training data, enabling effective detection of evasive answers in earnings calls with practical inference efficiency.

Abstract: Detecting evasive answers in earnings calls is critical for financial transparency, yet progress is hindered by the lack of large-scale benchmarks. We introduce EvasionBench, comprising 30,000 training samples and 1,000 human-annotated test samples (Cohen’s Kappa 0.835) across three evasion levels. Our key contribution is a multi-model annotation framework leveraging a core insight: disagreement between frontier LLMs signals hard examples most valuable for training. We mine boundary cases where two strong annotators conflict, using a judge to resolve labels. This approach outperforms single-model distillation by 2.4 percent, with judge-resolved samples improving generalization despite higher training loss (0.421 vs 0.393), evidence that disagreement mining acts as implicit regularization. Our trained model Eva-4B (4B parameters) achieves 81.3 percent accuracy, outperforming its base by 25 percentage points and approaching frontier LLM performance at a fraction of inference cost.
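
The annotation loop reduces to a few lines: two annotator models label each sample, and conflicts are escalated to a judge. The label_a, label_b, and label_judge callables below are toy stand-ins for actual LLM calls.

```python
# Disagreement-mining annotation loop (toy sketch).
def mine_labels(samples, label_a, label_b, label_judge):
    dataset = []
    for s in samples:
        la, lb = label_a(s), label_b(s)
        if la == lb:
            dataset.append((s, la, "consensus"))
        else:
            dataset.append((s, label_judge(s), "judge-resolved"))  # hard example
    return dataset

# Toy stand-ins whose labels sometimes disagree.
samples = [f"q{i}" for i in range(6)]
a = lambda s: "evasive" if int(s[1:]) % 2 else "direct"
b = lambda s: "evasive" if int(s[1:]) % 3 else "direct"
judge = lambda s: "partially-evasive"
for row in mine_labels(samples, a, b, judge):
    print(row)
```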

[328] Discrete Solution Operator Learning for Geometry-Dependent PDEs

Jinshuai Bai, Haolin Li, Zahra Sharif Khodaei, M. H. Aliabadi, YuanTong Gu, Xi-Qiao Feng

Main category: cs.LG

TL;DR: DiSOL learns discrete solution procedures for PDEs on varying geometries, handling topological changes and boundary discontinuities that break continuous operator learning assumptions.

DetailsMotivation: Neural operator learning assumes smooth geometry variations, but many engineering problems involve discrete structural changes like topological changes, abrupt boundary condition changes, and computational domain changes that break this assumption.

Method: DiSOL factorizes the solver into learnable stages mirroring classical discretizations: local contribution encoding, multiscale assembly, and implicit solution reconstruction on an embedded grid, preserving procedure-level consistency while adapting to geometry-dependent discrete structures.

Result: Across geometry-dependent Poisson, advection-diffusion, linear elasticity, and spatiotemporal heat-conduction problems, DiSOL produces stable and accurate predictions under both in-distribution and strongly out-of-distribution geometries, including discontinuous boundaries and topological changes.

Conclusion: Procedural operator representations are needed for geometry-dominated regimes, and discrete solution operator learning represents a distinct, complementary direction in scientific machine learning.

Abstract: Neural operator learning accelerates PDE solution by approximating operators as mappings between continuous function spaces. Yet in many engineering settings, varying geometry induces discrete structural changes, including topological changes, abrupt changes in boundary conditions or boundary types, and changes in the effective computational domain, which break the smooth-variation premise. Here we introduce Discrete Solution Operator Learning (DiSOL), a complementary paradigm that learns discrete solution procedures rather than continuous function-space operators. DiSOL factorizes the solver into learnable stages that mirror classical discretizations: local contribution encoding, multiscale assembly, and implicit solution reconstruction on an embedded grid, thereby preserving procedure-level consistency while adapting to geometry-dependent discrete structures. Across geometry-dependent Poisson, advection-diffusion, linear elasticity, as well as spatiotemporal heat-conduction problems, DiSOL produces stable and accurate predictions under both in-distribution and strongly out-of-distribution geometries, including discontinuous boundaries and topological changes. These results highlight the need for procedural operator representations in geometry-dominated regimes and position discrete solution operator learning as a distinct, complementary direction in scientific machine learning.

[329] Interpretable Probability Estimation with LLMs via Shapley Reconstruction

Yang Nan, Qihao Wen, Jiahao Wang, Pengfei He, Ravi Tandon, Yong Ge, Han Xu

Main category: cs.LG

TL;DR: PRISM framework uses Shapley values to decompose LLM probability predictions, improving accuracy and transparency over direct prompting for uncertain event estimation.

DetailsMotivation: LLMs show promise for probability estimation of uncertain events to support decision-making in fields like finance and healthcare, but direct prompting produces noisy, opaque predictions.

Method: PRISM (Probability Reconstruction via Shapley Measures) decomposes LLM predictions by quantifying marginal contributions of each input factor using Shapley values, then aggregates these factor-level contributions to reconstruct calibrated final estimates.

Result: PRISM improves predictive accuracy over direct prompting and other baselines across multiple domains (finance, healthcare, agriculture) and provides transparent visualization of how individual factors shape final estimates.

Conclusion: PRISM offers a transparent and precise framework for LLM-based probability estimation that builds trust in decision support systems by making the prediction process interpretable.

Abstract: Large Language Models (LLMs) demonstrate potential to estimate the probability of uncertain events, by leveraging their extensive knowledge and reasoning capabilities. This ability can be applied to support intelligent decision-making across diverse fields, such as financial forecasting and preventive healthcare. However, directly prompting LLMs for probability estimation faces significant challenges: their outputs are often noisy, and the underlying predicting process is opaque. In this paper, we propose PRISM: Probability Reconstruction via Shapley Measures, a framework that brings transparency and precision to LLM-based probability estimation. PRISM decomposes an LLM’s prediction by quantifying the marginal contribution of each input factor using Shapley values. These factor-level contributions are then aggregated to reconstruct a calibrated final estimate. In our experiments, we demonstrate PRISM improves predictive accuracy over direct prompting and other baselines, across multiple domains including finance, healthcare, and agriculture. Beyond performance, PRISM provides a transparent prediction pipeline: our case studies visualize how individual factors shape the final estimate, helping build trust in LLM-based decision support systems.
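
For small factor sets, the Shapley decomposition PRISM builds on can be computed exactly by enumerating factor subsets. A minimal sketch, with `estimate` standing in for an LLM queried on a subset of factors (PRISM's prompting and calibration steps are not shown):

```python
# Exact Shapley attribution over a handful of input factors. Exponential
# in the number of factors, so only viable for small factor sets.
from itertools import combinations
from math import factorial
from typing import Callable, Dict, FrozenSet, List

def shapley_values(
    factors: List[str],
    estimate: Callable[[FrozenSet[str]], float],  # probability given a factor subset
) -> Dict[str, float]:
    n = len(factors)
    phi = {f: 0.0 for f in factors}
    for f in factors:
        others = [g for g in factors if g != f]
        for k in range(n):
            for subset in combinations(others, k):
                s = frozenset(subset)
                weight = factorial(k) * factorial(n - k - 1) / factorial(n)
                phi[f] += weight * (estimate(s | {f}) - estimate(s))
    return phi

toy = lambda s: 0.2 + 0.2 * len(s)            # toy stand-in for an LLM query
print(shapley_values(["a", "b", "c"], toy))   # each factor contributes 0.2
```

By the efficiency axiom, the contributions sum to the gap between the full-factor estimate and the empty-set baseline, which is what lets the factor-level terms be re-aggregated into a final probability.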

[330] KTCF: Actionable Recourse in Knowledge Tracing via Counterfactual Explanations for Education

Woojin Kim, Changkwon Lee, Hyeoncheol Kim

Main category: cs.LG

TL;DR: KTCF: A counterfactual explanation method for Knowledge Tracing that generates educational instructions to help students improve, with 5.7-34% performance improvements over existing methods.

DetailsMotivation: To bridge XAI for Knowledge Tracing with practical education by providing actionable, causal, and understandable explanations for educational stakeholders who are often non-experts in AI.

Method: Proposed KTCF method for counterfactual explanation generation that accounts for knowledge concept relationships, plus a post-processing scheme to convert explanations into educational instruction sequences.

Result: Superior and robust performance with 5.7% to 34% improvements across metrics on large-scale educational dataset; qualitative evaluation shows educational instructions help reduce study burden.

Conclusion: Counterfactual explanations can advance responsible AI in education; future XAI for KT should use educationally grounded conceptualization and stakeholder-centered methods.

Abstract: Using Artificial Intelligence to improve teaching and learning brings greater adaptivity and scalability to education. Knowledge Tracing (KT) is a recognized approach to the student modeling task due to its superior performance and application potential in education. To this end, we conceptualize and investigate counterfactual explanation as the connection from XAI for KT to education. Counterfactual explanations offer actionable recourse, are inherently causal and local, and are easy to understand for educational stakeholders, who are often non-experts. We propose KTCF, a counterfactual explanation generation method for KT that accounts for knowledge concept relationships, and a post-processing scheme that converts a counterfactual explanation into a sequence of educational instructions. We experiment on a large-scale educational dataset and show that our KTCF method achieves superior and robust performance over existing methods, with improvements ranging from 5.7% to 34% across metrics. Additionally, we provide a qualitative evaluation of our post-processing scheme, demonstrating that the resulting educational instructions help reduce study burden. We show that counterfactuals have the potential to advance the responsible and practical use of AI in education. Future work on XAI for KT may benefit from educationally grounded conceptualization and stakeholder-centered methods.

[331] Efficient Clustering in Stochastic Bandits

G Dhinesh Chandran, Kota Srinivas Reddy, Srikrishna Bhashyam

Main category: cs.LG

TL;DR: Efficient Bandit Clustering algorithm (EBC) for sequential clustering of data sequences under fixed confidence setting, with computationally efficient sampling rule and asymptotic optimality.

DetailsMotivation: Existing Bandit Clustering algorithms are computationally expensive as they require solving optimization problems at each time step, and they assume Gaussian-distributed arms, limiting their applicability to broader distribution classes.

Method: Propose EBC algorithm that takes single-step gradient updates toward optimal value instead of solving full optimization at each step, and EBC-H heuristic variant that uses quantities from stopping rule for arm selection.

Result: EBC achieves computational efficiency with significantly lower per-sample run time compared to existing algorithms while maintaining asymptotic optimality, with performance gains demonstrated on synthetic and real-world datasets.

Conclusion: EBC provides a computationally efficient solution for Bandit Clustering problems with broader distribution classes, balancing asymptotic optimality with practical computational requirements.

Abstract: We study the Bandit Clustering (BC) problem under the fixed confidence setting, where the objective is to group a collection of data sequences (arms) into clusters through sequential sampling from adaptively selected arms at each time step while ensuring a fixed error probability at the stopping time. We consider a setting where arms in a cluster may have different distributions. Unlike existing results in this setting, which assume Gaussian-distributed arms, we study a broader class of vector-parametric distributions that satisfy mild regularity conditions. Existing asymptotically optimal BC algorithms require solving an optimization problem as part of their sampling rule at each step, which is computationally costly. We propose an Efficient Bandit Clustering algorithm (EBC), which, instead of solving the full optimization problem, takes a single step toward the optimal value at each time step, making it computationally efficient while remaining asymptotically optimal. We also propose a heuristic variant of EBC, called EBC-H, which further simplifies the sampling rule, with arm selection based on quantities computed as part of the stopping rule. We highlight the computational efficiency of EBC and EBC-H by comparing their per-sample run time with that of existing algorithms. The asymptotic optimality of EBC is supported through simulations on synthetic datasets. Through simulations on both synthetic and real-world datasets, we show the performance gain of EBC and EBC-H over existing approaches.

[332] Multi-Teacher Ensemble Distillation: A Mathematical Framework for Probability-Domain Knowledge Aggregation

Aaron R. Flouro, Shawn P. Chadwick

Main category: cs.LG

TL;DR: The paper develops an axiomatic framework for multi-teacher knowledge distillation, defining five core axioms for valid aggregation operators and proving their existence/non-uniqueness, with theoretical guarantees on variance reduction and bias mitigation.

DetailsMotivation: To provide theoretical grounding for multi-teacher knowledge distillation from diverse frontier models, moving beyond specific aggregation formulas to establish foundational principles that govern valid knowledge aggregation operators.

Method: Develops an operator-theoretic framework with five core axioms (convexity, positivity, continuity, weight monotonicity, temperature coherence) for multi-teacher ensemble knowledge distillation, building on Sparse-KD’s probability-domain distillation framework.

Result: Proves existence and non-uniqueness of operator families satisfying the axioms, establishes operator-agnostic guarantees showing multi-teacher aggregation reduces stochastic variance and systematic supervisory bias, provides Jensen-type bounds, log-loss guarantees, and safety attenuation properties.

Conclusion: The framework provides theoretical grounding for multi-teacher distillation from diverse models while admitting multiple valid implementation strategies, establishing that multiple distinct aggregation mechanisms can conform to the same foundational principles.

Abstract: Building on the probability-domain distillation framework of Sparse-KD, we develop an axiomatic, operator-theoretic framework for multi-teacher ensemble knowledge distillation. Rather than prescribing a specific aggregation formula, we define five core axioms governing valid knowledge aggregation operators, encompassing convexity, positivity, continuity, weight monotonicity, and temperature coherence. We prove the existence and non-uniqueness of operator families satisfying these axioms, establishing that multiple distinct aggregation mechanisms conform to the same foundational principles. Within this framework, we establish operator-agnostic guarantees showing that multi-teacher aggregation reduces both stochastic variance and systematic supervisory bias under heterogeneous teachers, while providing Jensen-type bounds, log-loss guarantees, and safety attenuation properties. For aggregation operators linear in teacher weights, we further establish classical ensemble variance-reduction results under standard independence assumptions, with extensions to correlated-error regimes. The framework provides theoretical grounding for multi-teacher distillation from diverse frontier models while admitting multiple valid implementation strategies.
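
To make the non-uniqueness claim concrete: a temperature-rescaled arithmetic pool and a geometric pool of teacher distributions are two distinct operators of the kind the axioms admit. A hedged sketch (the formal axiom definitions live in the paper; these two operators are illustrative, not the paper's own):

```python
# Two aggregation operators over teacher probability vectors. Both are
# positive, continuous, and monotone in the weights; they differ in how
# temperature and weights interact, illustrating non-uniqueness.
import numpy as np

def arithmetic_pool(probs: np.ndarray, w: np.ndarray, tau: float = 1.0) -> np.ndarray:
    """probs: (teachers, classes); w: nonnegative weights summing to 1."""
    p = probs ** (1.0 / tau)
    p /= p.sum(axis=1, keepdims=True)      # temperature-rescale each teacher
    out = (w[:, None] * p).sum(axis=0)     # convex combination
    return out / out.sum()

def geometric_pool(probs: np.ndarray, w: np.ndarray, tau: float = 1.0) -> np.ndarray:
    log_p = np.log(np.clip(probs, 1e-12, 1.0)) / tau
    out = np.exp((w[:, None] * log_p).sum(axis=0))
    return out / out.sum()

teachers = np.array([[0.7, 0.2, 0.1], [0.5, 0.3, 0.2], [0.6, 0.1, 0.3]])
weights = np.array([0.5, 0.3, 0.2])
print(arithmetic_pool(teachers, weights), geometric_pool(teachers, weights))
```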

[333] DP-FEDSOFIM: Differentially Private Federated Stochastic Optimization using Regularized Fisher Information Matrix

Sidhant R. Nair, Tanmay Sen, Mrinmay Sen

Main category: cs.LG

TL;DR: DP-FedSOFIM: A server-side second-order optimization framework for differentially private federated learning that achieves O(d) memory/computation per client while maintaining convergence benefits and privacy guarantees.

DetailsMotivation: DP-FL suffers from slow convergence under tight privacy budgets due to excessive noise. Existing second-order methods require O(d²) memory per client, making them impractical for high-dimensional models.

Method: Proposes DP-FedSOFIM using Fisher Information Matrix as natural gradient preconditioner with O(d) memory per client. Uses Sherman-Morrison formula for efficient matrix inversion, achieving O(d) computational complexity while preserving privacy via post-processing theorem.

Result: Empirical evaluation on CIFAR-10 shows DP-FedSOFIM achieves superior test accuracy compared to first-order baselines across multiple privacy regimes.

Conclusion: DP-FedSOFIM provides an efficient second-order optimization framework for DP-FL that balances privacy, convergence speed, and computational efficiency, making it practical for high-dimensional models.

Abstract: Differentially private federated learning (DP-FL) suffers from slow convergence under tight privacy budgets due to the overwhelming noise introduced to preserve privacy. While adaptive optimizers can accelerate convergence, existing second-order methods such as DP-FedNew require O(d^2) memory at each client to maintain local feature covariance matrices, making them impractical for high-dimensional models. We propose DP-FedSOFIM, a server-side second-order optimization framework that leverages the Fisher Information Matrix (FIM) as a natural gradient preconditioner while requiring only O(d) memory per client. By employing the Sherman-Morrison formula for efficient matrix inversion, DP-FedSOFIM achieves O(d) computational complexity per round while maintaining the convergence benefits of second-order methods. Our analysis proves that the server-side preconditioning preserves (epsilon, delta)-differential privacy through the post-processing theorem. Empirical evaluation on CIFAR-10 demonstrates that DP-FedSOFIM achieves superior test accuracy compared to first-order baselines across multiple privacy regimes.
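
The O(d) claim hinges on applying the inverse of a rank-one-updated matrix without ever materializing it. A minimal sketch of the Sherman-Morrison step, assuming a regularized rank-one Fisher approximation F = lam*I + g g^T (the paper's exact preconditioner construction may differ):

```python
# Apply (lam*I + g g^T)^{-1} to a vector in O(d) time and memory via the
# Sherman-Morrison identity, avoiding the d x d matrix entirely.
import numpy as np

def fim_inverse_apply(g: np.ndarray, v: np.ndarray, lam: float = 1e-2) -> np.ndarray:
    a_inv_v = v / lam                 # (lam*I)^{-1} v
    a_inv_g = g / lam                 # (lam*I)^{-1} g
    denom = 1.0 + g @ a_inv_g         # 1 + g^T (lam*I)^{-1} g
    return a_inv_v - a_inv_g * (g @ a_inv_v) / denom

d = 10_000
rng = np.random.default_rng(0)
g, v = rng.standard_normal(d), rng.standard_normal(d)
preconditioned = fim_inverse_apply(g, v)   # never forms a (d, d) matrix
```

Because this runs on the server after clients have already added their calibrated noise, the post-processing theorem cited in the abstract is what keeps the privacy guarantee intact.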

[334] BalDRO: A Distributionally Robust Optimization based Framework for Large Language Model Unlearning

Pengyang Shao, Naixin Zhai, Lei Chen, Yonghui Yang, Fengbin Zhu, Xun Yang, Meng Wang

Main category: cs.LG

TL;DR: BalDRO: A balanced LLM unlearning framework that addresses sample-wise imbalance in forget sets using min-sup optimization with worst-case distribution focusing on hard-to-unlearn samples.

DetailsMotivation: As LLMs shape online content, effective unlearning becomes critical for web governance. The key challenge is sample-wise imbalance in forget sets where different samples have varying unlearning difficulty, causing asynchronous forgetting - some knowledge remains insufficiently erased while others become over-forgotten.

Method: BalDRO formulates unlearning as a min-sup process: inner step identifies worst-case data distribution emphasizing hard-to-unlearn samples, outer step updates model parameters under this distribution. Two efficient variants: BalDRO-G (discrete GroupDRO-based approximation focusing on high-loss subsets) and BalDRO-DV (continuous Donsker-Varadhan dual method enabling smooth adaptive weighting).

Result: Experiments on TOFU and MUSE datasets show BalDRO significantly improves both forgetting quality and model utility over existing methods. Code is released for reproducibility.

Conclusion: BalDRO provides an effective solution to the sample-wise imbalance problem in LLM unlearning, achieving balanced forgetting through worst-case distribution optimization, with both discrete and continuous variants offering efficient implementations.

Abstract: As Large Language Models (LLMs) increasingly shape online content, removing targeted information from well-trained LLMs (also known as LLM unlearning) has become critical for web governance. A key challenge lies in sample-wise imbalance within the forget set: different samples exhibit widely varying unlearning difficulty, leading to asynchronous forgetting where some knowledge remains insufficiently erased while others become over-forgotten. To address this, we propose BalDRO, a novel and efficient framework for balanced LLM unlearning. BalDRO formulates unlearning as a min-sup process: an inner step identifies a worst-case data distribution that emphasizes hard-to-unlearn samples, while an outer step updates model parameters under this distribution. We instantiate BalDRO via two efficient variants: BalDRO-G, a discrete GroupDRO-based approximation focusing on high-loss subsets, and BalDRO-DV, a continuous Donsker-Varadhan dual method enabling smooth adaptive weighting within standard training pipelines. Experiments on TOFU and MUSE show that BalDRO significantly improves both forgetting quality and model utility over existing methods, and we release code for reproducibility.
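
A minimal sketch of the Donsker-Varadhan-style inner step: under a KL-regularized worst case, per-sample weights come out proportional to exp(loss/tau), so hard-to-unlearn samples dominate the outer update. The temperature handling below and the omission of the GroupDRO variant are simplifying assumptions, not the paper's exact formulation:

```python
# Inner sup: worst-case sample distribution as a softmax over losses.
# Outer min: the weighted objective; weights are detached so gradients
# flow only through the losses themselves.
import torch

def dv_weights(losses: torch.Tensor, tau: float = 1.0) -> torch.Tensor:
    return torch.softmax(losses.detach() / tau, dim=0)

def baldro_dv_step(losses: torch.Tensor, tau: float = 1.0) -> torch.Tensor:
    w = dv_weights(losses, tau)       # emphasizes hard-to-unlearn samples
    return (w * losses).sum()

losses = torch.tensor([0.2, 1.5, 3.0], requires_grad=True)
baldro_dv_step(losses, tau=0.5).backward()
print(losses.grad)                    # gradient concentrated on the hardest sample
```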

[335] Geometric Stability: The Missing Axis of Representations

Prashant C. Raju

Main category: cs.LG

TL;DR: The paper introduces geometric stability as a new dimension for analyzing representations that measures how reliably representational geometry holds under perturbation, distinct from similarity metrics.

DetailsMotivation: Current representation analysis focuses only on similarity (alignment with external references), which reveals what is represented but not whether that structure is robust. There's a need to measure how reliably representational geometry maintains its structure under perturbation.

Method: Introduces Shesha framework for measuring geometric stability. Tests across 2,463 configurations in seven domains, comparing stability with similarity metrics, analyzing sensitivity to principal components, and applying to various use cases including safety monitoring, controllability, and model selection.

Result: Stability and similarity are empirically uncorrelated (ρ≈0.01) and mechanistically distinct. Stability acts as a functional geometric canary for safety monitoring (2× more sensitive than CKA), predicts linear steerability (ρ=0.89-0.96), and reveals a “geometric tax” in transfer optimization. Also predicts CRISPR perturbation coherence and neural-behavioral coupling.

Conclusion: Geometric stability provides a necessary complement to similarity for auditing representations across biological and computational systems, quantifying how reliably systems maintain structure under perturbation.

Abstract: Analysis of learned representations has a blind spot: it focuses on $similarity$, measuring how closely embeddings align with external references, but similarity reveals only what is represented, not whether that structure is robust. We introduce $geometric$ $stability$, a distinct dimension that quantifies how reliably representational geometry holds under perturbation, and present $Shesha$, a framework for measuring it. Across 2,463 configurations in seven domains, we show that stability and similarity are empirically uncorrelated ($\rho \approx 0.01$) and mechanistically distinct: similarity metrics collapse after removing the top principal components, while stability retains sensitivity to fine-grained manifold structure. This distinction yields actionable insights: for safety monitoring, stability acts as a functional geometric canary, detecting structural drift nearly 2$\times$ more sensitively than CKA while filtering out the non-functional noise that triggers false alarms in rigid distance metrics; for controllability, supervised stability predicts linear steerability ($\rho = 0.89$-$0.96$); for model selection, stability dissociates from transferability, revealing a geometric tax that transfer optimization incurs. Beyond machine learning, stability predicts CRISPR perturbation coherence and neural-behavioral coupling. By quantifying $how$ $reliably$ systems maintain structure, geometric stability provides a necessary complement to similarity for auditing representations across biological and computational systems.
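
Shesha's actual estimator is not reproduced in the abstract; as an illustrative stand-in for the core idea, one can perturb the inputs, re-embed, and compare the pairwise-distance geometry of the two embeddings. High rank correlation means geometry that survives noise:

```python
# Illustrative stability probe (not Shesha's metric): Spearman correlation
# between pairwise distances of clean vs. perturbed embeddings.
import numpy as np
from scipy.spatial.distance import pdist
from scipy.stats import spearmanr

def geometric_stability(embed, X: np.ndarray, noise: float = 0.05, seed: int = 0) -> float:
    rng = np.random.default_rng(seed)
    Z0 = embed(X)
    Z1 = embed(X + noise * rng.standard_normal(X.shape))
    rho, _ = spearmanr(pdist(Z0), pdist(Z1))
    return float(rho)

W = np.random.default_rng(1).standard_normal((32, 16))   # toy encoder weights
X = np.random.default_rng(2).standard_normal((100, 32))
print(geometric_stability(lambda x: np.tanh(x @ W), X))
```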

[336] $D^2Prune$: Sparsifying Large Language Models via Dual Taylor Expansion and Attention Distribution Awareness

Lang Xiong, Ning Liu, Ao Ren, Yuheng Bai, Haining Fang, BinYan Zhang, Zhe Jiang, Yujuan Tan, Duo Liu

Main category: cs.LG

TL;DR: D²Prune is a novel pruning method for LLMs that addresses activation distribution shifts and attention long-tail patterns through dual Taylor expansion modeling and attention-aware dynamic updates.

DetailsMotivation: LLMs have massive computational demands making deployment challenging. Existing pruning methods fail to account for activation distribution shifts between calibration and test data, and overlook the long-tail distribution characteristics of activations in attention modules.

Method: 1) Dual Taylor expansion-based method that jointly models weight and activation perturbations for precise error estimation, enabling accurate pruning mask selection and weight updating. 2) Attention-aware dynamic update strategy that preserves long-tail attention patterns by jointly minimizing KL divergence of attention distributions and reconstruction error.

Result: D²Prune consistently outperforms state-of-the-art methods across various LLMs (OPT-125M, LLaMA2/3, Qwen3). The dynamic attention update mechanism also generalizes well to ViT-based vision models like DeiT, achieving superior accuracy on ImageNet-1K.

Conclusion: The proposed D²Prune method effectively addresses key limitations in LLM pruning by handling activation distribution shifts and attention long-tail patterns, demonstrating strong performance across both language and vision transformer models.

Abstract: Large language models (LLMs) face significant deployment challenges due to their massive computational demands. While pruning offers a promising compression solution, existing methods suffer from two critical limitations: (1) They neglect activation distribution shifts between calibration data and test data, resulting in inaccurate error estimations; (2) They overlook the long-tail distribution characteristics of activations in the attention module. To address these limitations, this paper proposes a novel pruning method, $D^2Prune$. First, we propose a dual Taylor expansion-based method that jointly models weight and activation perturbations for precise error estimation, leading to precise pruning mask selection and weight updating and facilitating error minimization during pruning. Second, we propose an attention-aware dynamic update strategy that preserves the long-tail attention pattern by jointly minimizing the KL divergence of attention distributions and the reconstruction error. Extensive experiments show that $D^2Prune$ consistently outperforms SOTA methods across various LLMs (e.g., OPT-125M, LLaMA2/3, and Qwen3). Moreover, the dynamic attention update mechanism also generalizes well to ViT-based vision models like DeiT, achieving superior accuracy on ImageNet-1K.

[337] From Hawkes Processes to Attention: Time-Modulated Mechanisms for Event Sequences

Xinzi Tan, Kejian Zhang, Junhan Yu, Doudou Zhou

Main category: cs.LG

TL;DR: A novel Hawkes Attention mechanism derived from multivariate Hawkes process theory for MTPPs, using learnable per-type neural kernels to capture heterogeneous temporal effects, outperforming existing Transformer-based methods.

DetailsMotivation: Existing Transformer-based methods for MTPPs rely on shared or parametric decay structures via positional encodings, limiting their ability to capture heterogeneous and type-specific temporal effects that naturally arise in medical, social, commercial, and financial domains.

Method: Derived Hawkes Attention from multivariate Hawkes process theory, using learnable per-type neural kernels to modulate query, key and value projections in attention mechanism, replacing corresponding parts in traditional attention to unify event timing and content interaction.

Result: Experimental results show the method achieves better performance compared to baselines. The attention mechanism can also be easily applied to specific temporal structures like time series forecasting.

Conclusion: Hawkes Attention effectively captures both time-relevant behavior and type-specific excitation patterns from data, addressing limitations of existing Transformer-based MTPP methods while maintaining applicability to various temporal structures.

Abstract: Marked Temporal Point Processes (MTPPs) arise naturally in medical, social, commercial, and financial domains. However, existing Transformer-based methods mostly inject temporal information only via positional encodings, relying on shared or parametric decay structures, which limits their ability to capture heterogeneous and type-specific temporal effects. Inspired by this observation, we derive a novel attention operator called Hawkes Attention from multivariate Hawkes process theory for MTPPs, using learnable per-type neural kernels to modulate query, key and value projections, thereby replacing the corresponding parts in the traditional attention. Benefiting from this design, Hawkes Attention unifies event timing and content interaction, learning both the time-relevant behavior and type-specific excitation patterns from the data. The experimental results show that our method achieves better performance compared to the baselines. In addition to the general MTPP, our attention mechanism can also be easily applied to specific temporal structures, such as time series forecasting.
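
The paper modulates the query, key, and value projections with per-type neural kernels; the simplified sketch below keeps only a learnable per-type exponential decay on the attention logits, which echoes a Hawkes excitation kernel but is an assumption, not the paper's full operator:

```python
# Simplified time-modulated attention: a learnable per-event-type decay
# damps attention logits by elapsed time, Hawkes-style.
import torch
import torch.nn as nn

class TimeModulatedAttention(nn.Module):
    def __init__(self, dim: int, num_types: int):
        super().__init__()
        self.qkv = nn.Linear(dim, 3 * dim)
        self.log_decay = nn.Parameter(torch.zeros(num_types))  # per-type beta

    def forward(self, x: torch.Tensor, times: torch.Tensor, types: torch.Tensor):
        # x: (seq, dim); times: (seq,); types: (seq,) long
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        logits = q @ k.t() / x.shape[-1] ** 0.5
        dt = (times[:, None] - times[None, :]).clamp(min=0.0)  # elapsed time
        beta = self.log_decay.exp()[types]                     # source-type decay
        logits = logits - beta[None, :] * dt                   # Hawkes-style damping
        causal = torch.tril(torch.ones_like(logits, dtype=torch.bool))
        logits = logits.masked_fill(~causal, float("-inf"))
        return torch.softmax(logits, dim=-1) @ v

attn = TimeModulatedAttention(dim=16, num_types=3)
out = attn(torch.randn(5, 16),
           torch.tensor([0.0, 0.4, 1.1, 1.5, 2.0]),
           torch.tensor([0, 2, 1, 0, 1]))                      # (5, 16)
```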

[338] GIFT: Unlocking Global Optimality in Post-Training via Finite-Temperature Gibbs Initialization

Zhengyang Zhao, Lu Ma, Yizhen Jiang, Xiaochen Ma, Zimo Meng, Chengyu Shen, Lexiang Tang, Haoze Sun, Peng Pei, Wentao Zhang

Main category: cs.LG

TL;DR: GIFT (Gibbs Initialization with Finite Temperature) addresses the optimization mismatch in post-training LRMs by reformulating SFT as a finite-temperature energy potential instead of rigid supervision, creating a distributional bridge for better RL initialization.

DetailsMotivation: The current post-training paradigm (SFT followed by RL) suffers from distributional collapse due to rigid supervision in SFT, which exhausts the exploration space needed for subsequent RL optimization.

Method: Reformulate SFT within a unified post-training framework as Gibbs Initialization with Finite Temperature (GIFT), treating supervision as a finite-temperature energy potential rather than zero-temperature rigid supervision, establishing a distributional bridge between base priors and target distributions.

Result: GIFT significantly outperforms standard SFT and other competitive baselines when used for RL initialization, providing a mathematically principled pathway toward achieving global optimality in post-training.

Conclusion: GIFT addresses the intrinsic optimization mismatch in LRM post-training by providing a distributional bridge that ensures objective consistency throughout the pipeline, offering better RL initialization than traditional approaches.

Abstract: The prevailing post-training paradigm for Large Reasoning Models (LRMs)–Supervised Fine-Tuning (SFT) followed by Reinforcement Learning (RL)–suffers from an intrinsic optimization mismatch: the rigid supervision inherent in SFT induces distributional collapse, thereby exhausting the exploration space necessary for subsequent RL. In this paper, we reformulate SFT within a unified post-training framework and propose Gibbs Initialization with Finite Temperature (GIFT). We characterize standard SFT as a degenerate zero-temperature limit that suppresses base priors. Conversely, GIFT incorporates supervision as a finite-temperature energy potential, establishing a distributional bridge that ensures objective consistency throughout the post-training pipeline. Our experiments demonstrate that GIFT significantly outperforms standard SFT and other competitive baselines when utilized for RL initialization, providing a mathematically principled pathway toward achieving global optimality in post-training. Our code is available at https://github.com/zzy1127/GIFT.

[339] Reward Learning through Ranking Mean Squared Error

Chaitanya Kharyal, Calarina Muslimani, Matthew E. Taylor

Main category: cs.LG

TL;DR: R4 is a new RL method that learns reward functions from human ratings (ordinal feedback) using a novel ranking MSE loss with formal guarantees, outperforming existing methods with less feedback.

DetailsMotivation: Reward design is a bottleneck in RL applications. Reward learning from human feedback offers an alternative, but traditional binary preferences are limited. Ratings provide richer, less cognitively demanding supervision, but existing rating-based methods lack formal guarantees.

Method: R4 uses a novel ranking mean squared error (rMSE) loss that treats teacher-provided ratings as ordinal targets. It learns from trajectory-rating pairs, samples trajectories, predicts returns, ranks them using differentiable soft ranks, and optimizes MSE between soft ranks and teacher ratings.

Result: R4 offers formal guarantees: its solution set is provably minimal and complete under mild assumptions. Empirically, using simulated human feedback, R4 consistently matches or outperforms existing rating and preference-based RL methods on robotic locomotion benchmarks while requiring significantly less feedback.

Conclusion: R4 provides a theoretically grounded, effective approach to reward learning from human ratings that reduces feedback requirements while maintaining or improving performance compared to existing methods.

Abstract: Reward design remains a significant bottleneck in applying reinforcement learning (RL) to real-world problems. A popular alternative is reward learning, where reward functions are inferred from human feedback rather than manually specified. Recent work has proposed learning reward functions from human feedback in the form of ratings, rather than traditional binary preferences, enabling richer and potentially less cognitively demanding supervision. Building on this paradigm, we introduce a new rating-based RL method, Ranked Return Regression for RL (R4). At its core, R4 employs a novel ranking mean squared error (rMSE) loss, which treats teacher-provided ratings as ordinal targets. Our approach learns from a dataset of trajectory-rating pairs, where each trajectory is labeled with a discrete rating (e.g., “bad,” “neutral,” “good”). At each training step, we sample a set of trajectories, predict their returns, and rank them using a differentiable sorting operator (soft ranks). We then optimize a mean squared error loss between the resulting soft ranks and the teacher’s ratings. Unlike prior rating-based approaches, R4 offers formal guarantees: its solution set is provably minimal and complete under mild assumptions. Empirically, using simulated human feedback, we demonstrate that R4 consistently matches or outperforms existing rating and preference-based RL methods on robotic locomotion benchmarks from OpenAI Gym and the DeepMind Control Suite, while requiring significantly less feedback.
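
A minimal sketch of the core loss: soft ranks via pairwise sigmoids, compared by MSE against teacher ratings mapped to target ranks. R4's differentiable sorting operator and its rating-to-rank mapping may differ from this stand-in:

```python
# Differentiable soft rank plus ranking-MSE against ordinal ratings.
import torch

def soft_rank(scores: torch.Tensor, tau: float = 0.1) -> torch.Tensor:
    """Descending soft rank: the largest score gets rank ~1."""
    diff = scores[None, :] - scores[:, None]                 # diff[i, j] = s_j - s_i
    return 1.0 + torch.sigmoid(diff / tau).sum(dim=1) - 0.5  # drop the self term

def rmse_loss(pred_returns: torch.Tensor, ratings: torch.Tensor,
              tau: float = 0.1) -> torch.Tensor:
    # Higher rating should mean better (smaller) rank: the top rating maps
    # to rank 1, the lowest to rank n.
    target_rank = ratings.max() - ratings + 1.0
    return ((soft_rank(pred_returns, tau) - target_rank) ** 2).mean()

returns = torch.tensor([0.3, 1.2, -0.4], requires_grad=True)  # predicted returns
ratings = torch.tensor([1.0, 2.0, 0.0])                       # neutral / good / bad
rmse_loss(returns, ratings).backward()
```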

[340] XLinear: A Lightweight and Accurate MLP-Based Model for Long-Term Time Series Forecasting with Exogenous Inputs

Xinyang Chen, Huidong Jin, Yu Huang, Zaiwen Feng

Main category: cs.LG

TL;DR: XLinear is a lightweight MLP-based model for time series forecasting that efficiently exploits asymmetric causal relationships between endogenous and exogenous variables using a global token approach.

DetailsMotivation: Real-world time series forecasting often involves asymmetric causal relationships where cost-effective exogenous data (like weather) influences endogenous variables, but existing models assume uniform variable importance. Transformer-based models are computationally expensive and permutation-invariant, while patch-based variants may miss local patterns.

Method: XLinear uses MLPs with sigmoid activation and introduces a global token derived from endogenous variables as a hub for interacting with exogenous variables. It extracts both temporal patterns and variate-wise dependencies, then integrates these signals through a prediction head.

Result: XLinear outperforms state-of-the-art models on seven standard benchmarks and five real-world datasets with exogenous inputs, delivering superior accuracy and efficiency for both multivariate and univariate forecasts influenced by exogenous variables.

Conclusion: The proposed XLinear model effectively addresses the limitations of existing approaches by efficiently exploiting informative signals across temporal dimensions and relevant exogenous variables through a lightweight MLP-based architecture with global token integration.

Abstract: Despite the prevalent assumption of uniform variable importance in long-term time series forecasting models, real-world applications often exhibit asymmetric causal relationships and varying data acquisition costs. Specifically, cost-effective exogenous data (e.g., local weather) can unilaterally influence dynamics of endogenous variables, such as lake surface temperature. Exploiting these links enables more effective forecasts when exogenous inputs are readily available. Transformer-based models capture long-range dependencies but incur high computation and suffer from permutation invariance. Patch-based variants improve efficiency yet can miss local temporal patterns. To efficiently exploit informative signals across both the temporal dimension and relevant exogenous variables, this study proposes XLinear, a lightweight time series forecasting model built upon Multi-Layer Perceptrons (MLPs). XLinear uses a global token derived from an endogenous variable as a pivotal hub for interacting with exogenous variables, and employs MLPs with sigmoid activation to extract both temporal patterns and variate-wise dependencies. Its prediction head then integrates these signals to forecast the endogenous series. We evaluate XLinear on seven standard benchmarks and five real-world datasets with exogenous inputs. Compared with state-of-the-art models, XLinear delivers superior accuracy and efficiency for both multivariate forecasts and univariate forecasts influenced by exogenous inputs.

[341] HGATSolver: A Heterogeneous Graph Attention Solver for Fluid-Structure Interaction

Qin-Yi Zhang, Hong Wang, Siyao Liu, Haichuan Lin, Linying Cao, Xiao-Hu Zhou, Chen Chen, Shuangyi Wang, Zeng-Guang Hou

Main category: cs.LG

TL;DR: HGATSolver is a heterogeneous graph attention solver for fluid-structure interaction problems that uses specialized message-passing for different physical domains, physics-conditioned gating for stability, and gradient-balancing loss for optimization.

DetailsMotivation: Existing learning-based solvers struggle with heterogeneous dynamics in FSI systems, interface coupling inconsistencies, and disparities in learning difficulty across fluid and solid regions, leading to instability during prediction.

Method: HGATSolver encodes FSI systems as heterogeneous graphs with distinct node/edge types for fluid, solid, and interface regions, enabling specialized message-passing. It introduces physics-conditioned gating for stable explicit time stepping and Inter-domain Gradient-Balancing Loss for balanced optimization.

Result: Extensive experiments on two constructed FSI benchmarks and a public dataset demonstrate state-of-the-art performance, establishing an effective framework for surrogate modeling of coupled multi-physics systems.

Conclusion: HGATSolver successfully addresses challenges in learning-based FSI simulation by incorporating physical structure into the model architecture, stabilizing predictions, and balancing optimization across heterogeneous domains.

Abstract: Fluid-structure interaction (FSI) systems involve distinct physical domains, fluid and solid, governed by different partial differential equations and coupled at a dynamic interface. While learning-based solvers offer a promising alternative to costly numerical simulations, existing methods struggle to capture the heterogeneous dynamics of FSI within a unified framework. This challenge is further exacerbated by inconsistencies in response across domains due to interface coupling and by disparities in learning difficulty across fluid and solid regions, leading to instability during prediction. To address these challenges, we propose the Heterogeneous Graph Attention Solver (HGATSolver). HGATSolver encodes the system as a heterogeneous graph, embedding physical structure directly into the model via distinct node and edge types for fluid, solid, and interface regions. This enables specialized message-passing mechanisms tailored to each physical domain. To stabilize explicit time stepping, we introduce a novel physics-conditioned gating mechanism that serves as a learnable, adaptive relaxation factor. Furthermore, an Inter-domain Gradient-Balancing Loss dynamically balances the optimization objectives across domains based on predictive uncertainty. Extensive experiments on two constructed FSI benchmarks and a public dataset demonstrate that HGATSolver achieves state-of-the-art performance, establishing an effective framework for surrogate modeling of coupled multi-physics systems.

[342] RIFT: Repurposing Negative Samples via Reward-Informed Fine-Tuning

Zehua Liu, Shuqi Liu, Tao Zhong, Mingxuan Yuan

Main category: cs.LG

TL;DR: RIFT (Reward Informed Fine-Tuning) is a new LLM alignment method that uses all self-generated samples (both positive and negative) with reward-weighted loss, outperforming RFT while being more data-efficient.

DetailsMotivation: Current alignment methods like SFT require costly expert data, while RFT discards valuable negative samples, leading to data inefficiency. There's a need for methods that can effectively utilize all self-generated data.

Method: RIFT repurposes negative trajectories by reweighting the loss with scalar rewards to learn from both positive and negative trajectories. To prevent training collapse from naive reward integration, they introduce a stabilized loss formulation that ensures numerical robustness and optimization efficiency.

Result: Extensive experiments on mathematical benchmarks across various base models show that RIFT consistently outperforms RFT. The method demonstrates robustness and data efficiency for alignment using mixed-quality, self-generated data.

Conclusion: RIFT provides a robust and data-efficient alternative for LLM alignment that effectively utilizes all self-generated samples, overcoming limitations of both SFT and RFT approaches.

Abstract: While Supervised Fine-Tuning (SFT) and Rejection Sampling Fine-Tuning (RFT) are standard for LLM alignment, they either rely on costly expert data or discard valuable negative samples, leading to data inefficiency. To address this, we propose Reward Informed Fine-Tuning (RIFT), a simple yet effective framework that utilizes all self-generated samples. Unlike the hard thresholding of RFT, RIFT repurposes negative trajectories, reweighting the loss with scalar rewards to learn from both the positive and negative trajectories from the model outputs. To overcome the training collapse caused by naive reward integration, where direct multiplication yields an unbounded loss, we introduce a stabilized loss formulation that ensures numerical robustness and optimization efficiency. Extensive experiments on mathematical benchmarks across various base models show that RIFT consistently outperforms RFT. Our results demonstrate that RIFT is a robust and data-efficient alternative for alignment using mixed-quality, self-generated data.
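
The abstract notes that naively multiplying the loss by raw rewards yields an unbounded objective when rewards go negative. A hedged sketch of one bounded alternative; the sigmoid squashing below is an assumption for illustration, not RIFT's actual stabilized formulation:

```python
# Naive vs. bounded reward-weighted NLL over self-generated trajectories.
import torch

def naive_reward_loss(nll: torch.Tensor, rewards: torch.Tensor) -> torch.Tensor:
    return (rewards * nll).mean()        # unbounded below when rewards < 0

def bounded_reward_loss(nll: torch.Tensor, rewards: torch.Tensor,
                        tau: float = 1.0) -> torch.Tensor:
    w = torch.sigmoid(rewards / tau)     # weight in (0, 1), never flips sign
    return (w * nll).mean()              # negative trajectories are downweighted

nll = torch.tensor([2.3, 0.7, 1.1])      # per-trajectory negative log-likelihood
rewards = torch.tensor([1.0, -1.0, 0.2])
print(bounded_reward_loss(nll, rewards))
```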

[343] Learning to Trust Experience: A Monitor-Trust-Regulator Framework for Learning under Unobservable Feedback Reliability

Zhipeng Zhang, Zhenjie Yao, Kai Li, Lei Yang

Main category: cs.LG

TL;DR: The paper introduces metacognitive regulation for learning under unobservable feedback reliability, proposing a Monitor-Trust-Regulator framework with self-diagnosis to infer experience credibility from internal dynamics.

DetailsMotivation: Standard robust learning can converge stably yet form systematically wrong beliefs when feedback reliability is unobservable and locally indistinguishable. This Epistemic Identifiability under Unobservable Reliability (EIUR) problem requires systems to decide whether to learn from experiences, not just how to learn stably.

Method: Proposes metacognitive regulation via a Monitor-Trust-Regulator (MTR) decomposition, instantiated with self-diagnosis that maintains slowly varying experience-trust variables to softly modulate learning updates without needing external reliability labels or corruption models.

Result: Self-diagnosis improves epistemic identifiability in EIUR regimes. In RL, it enables calibrated skepticism and recovery under corrupted rewards. In supervised learning, it reveals a critical dissociation where performance recovery doesn’t imply epistemic recovery - accuracy can rebound while beliefs remain locked-in by early misleading data.

Conclusion: MTR and self-diagnosis provide an organizing abstraction and concrete design template for intrinsic reliability assessment in autonomous learning under unobservable reliability, addressing the fundamental challenge of deciding whether to learn from experiences.

Abstract: Learning under unobservable feedback reliability poses a distinct challenge beyond optimization robustness: a system must decide whether to learn from an experience, not only how to learn stably. We study this setting as Epistemic Identifiability under Unobservable Reliability (EIUR), where each experience has a latent credibility, reliable and unreliable feedback can be locally indistinguishable, and data are generated in a closed loop by the learner’s own evolving beliefs and actions. In EIUR, standard robust learning can converge stably yet form high-confidence, systematically wrong beliefs. We propose metacognitive regulation as a practical response: a second, introspective control loop that infers experience credibility from endogenous evidence in the learner’s internal dynamics. We formalize this as a modular Monitor-Trust-Regulator (MTR) decomposition and instantiate it with self-diagnosis, which maintains a slowly varying experience-trust variable that softly modulates learning updates, without exogenous reliability labels or an explicit corruption model. Empirically, in the EIUR regimes studied here, self-diagnosis is associated with improved epistemic identifiability. In reinforcement learning, it enables calibrated skepticism and recovery under systematically corrupted rewards. In supervised learning, it exposes a critical dissociation: performance recovery does not imply epistemic recovery. Accuracy can rebound while internal belief dynamics remain locked-in by early misleading data, a failure detectable only through introspective diagnostics. Together, MTR and self-diagnosis provide an organizing abstraction and a concrete design template for intrinsic reliability assessment in autonomous learning under unobservable reliability.

[344] Enhancing Spatial Reasoning in Large Language Models for Metal-Organic Frameworks Structure Prediction

Mianzhi Pan, JianFei Li, Peishuo Liu, Botian Wang, Yawen Ouyang, Yiming Rong, Hao Zhou, Jianbing Zhang

Main category: cs.LG

TL;DR: MOF-LLM: First LLM framework for block-level MOF structure prediction using spatial-aware training and reinforcement learning to overcome atomic complexity challenges.

DetailsMotivation: MOFs are important porous materials for applications like carbon capture and drug delivery, but predicting their 3D structures is difficult due to high atomic complexity. While LLMs work for simpler crystals, they struggle with MOFs' modular nature.

Method: MOF-LLM adapts LLMs for block-level MOF prediction using: 1) Spatial-aware continual pre-training (CPT), 2) Structural supervised fine-tuning (SFT), and 3) Matching-driven reinforcement learning with Soft Adaptive Policy Optimization (SAPO) to incorporate spatial priors and optimize stability.

Result: MOF-LLM outperforms state-of-the-art denoising-based and LLM-based methods in MOF structure prediction while showing superior sampling efficiency. The approach enhances spatial reasoning in a Qwen-3 8B model.

Conclusion: The block-wise paradigm with specialized LLM training enables accurate MOF structure prediction, overcoming atomic complexity challenges and demonstrating the potential of LLMs for complex modular material design.

Abstract: Metal-organic frameworks (MOFs) are porous crystalline materials with broad applications such as carbon capture and drug delivery, yet accurately predicting their 3D structures remains a significant challenge. While Large Language Models (LLMs) have shown promise in generating crystals, their application to MOFs is hindered by MOFs’ high atomic complexity. Inspired by the success of block-wise paradigms in deep generative models, we pioneer the use of LLMs in this domain by introducing MOF-LLM, the first LLM framework specifically adapted for block-level MOF structure prediction. To effectively harness LLMs for this modular assembly task, our training paradigm integrates spatial-aware continual pre-training (CPT), structural supervised fine-tuning (SFT), and matching-driven reinforcement learning (RL). By incorporating explicit spatial priors and optimizing structural stability via Soft Adaptive Policy Optimization (SAPO), our approach substantially enhances the spatial reasoning capability of a Qwen-3 8B model for accurate MOF structure prediction. Comprehensive experiments demonstrate that MOF-LLM outperforms state-of-the-art denoising-based and LLM-based methods while exhibiting superior sampling efficiency.

[345] Single-Round Clustered Federated Learning via Data Collaboration Analysis for Non-IID Data

Sota Sugawara, Yuji Kawamata, Akihiro Toyoda, Tomoru Nakayama, Yukihiko Okada

Main category: cs.LG

TL;DR: DC-CFL is a single-round clustered federated learning framework that performs client clustering and cluster-wise learning in one communication round using data collaboration analysis.

DetailsMotivation: Existing clustered federated learning approaches require multiple communication rounds for cluster estimation and model updates, which limits their practicality under tight communication constraints. There's a need for efficient CFL that works with limited communication rounds.

Method: DC-CFL uses data collaboration analysis to quantify inter-client similarity via total variation distance between label distributions, estimates clusters using hierarchical clustering, and performs cluster-wise learning via DC analysis - all in a single communication round.

Result: Experiments on multiple open datasets under non-IID conditions show DC-CFL achieves accuracy comparable to multi-round baselines while requiring only one communication round.

Conclusion: DC-CFL is a practical alternative for collaborative AI model development when multiple communication rounds are impractical, offering efficient clustered federated learning with minimal communication overhead.

Abstract: Federated Learning (FL) enables distributed learning across multiple clients without sharing raw data. When statistical heterogeneity across clients is severe, Clustered Federated Learning (CFL) can improve performance by grouping similar clients and training cluster-wise models. However, most CFL approaches rely on multiple communication rounds for cluster estimation and model updates, which limits their practicality under tight constraints on communication rounds. We propose Data Collaboration-based Clustered Federated Learning (DC-CFL), a single-round framework that completes both client clustering and cluster-wise learning, using only the information shared in DC analysis. DC-CFL quantifies inter-client similarity via total variation distance between label distributions, estimates clusters using hierarchical clustering, and performs cluster-wise learning via DC analysis. Experiments on multiple open datasets under representative non-IID conditions show that DC-CFL achieves accuracy comparable to multi-round baselines while requiring only one communication round. These results indicate that DC-CFL is a practical alternative for collaborative AI model development when multiple communication rounds are impractical.
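
A minimal sketch of the clustering step, assuming each client shares only its label distribution: pairwise total variation distances feed average-linkage hierarchical clustering. The DC-analysis learning stage itself is omitted here:

```python
# Single-round client clustering from label distributions via TV distance
# and hierarchical clustering.
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform

def tv_distance(p: np.ndarray, q: np.ndarray) -> float:
    return 0.5 * np.abs(p - q).sum()

def cluster_clients(label_dists: np.ndarray, threshold: float = 0.3) -> np.ndarray:
    """label_dists: (clients, classes), each row summing to 1."""
    n = len(label_dists)
    d = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            d[i, j] = d[j, i] = tv_distance(label_dists[i], label_dists[j])
    z = linkage(squareform(d), method="average")
    return fcluster(z, t=threshold, criterion="distance")

dists = np.array([[0.8, 0.1, 0.1], [0.7, 0.2, 0.1], [0.1, 0.1, 0.8]])
print(cluster_clients(dists))   # clients 0 and 1 grouped; client 2 alone
```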

[346] GeoRA: Geometry-Aware Low-Rank Adaptation for RLVR

Jiaying Zhang, Lei Shi, Jiguo Li, Jun Xu, Jiuchong Gao, Jinghua Hao, Renqing He

Main category: cs.LG

TL;DR: GeoRA is a geometry-aware low-rank adaptation method for RL with verifiable rewards that addresses spectral collapse and optimization instability in existing parameter-efficient methods by exploiting anisotropic RL update subspaces.

DetailsMotivation: Existing parameter-efficient methods (PiSSA, MiLoRA) designed for SFT don't account for RLVR's distinct optimization dynamics and geometric structures, causing spectral collapse and optimization instability. Alternative sparse update approaches face efficiency bottlenecks on modern hardware due to unstructured computations.

Method: GeoRA extracts principal directions via SVD within geometrically constrained RL update subspaces, initializes adapters with these directions while freezing residual components. This preserves pre-trained geometric structure and enables efficient GPU computation through dense operators.

Result: GeoRA consistently outperforms established low-rank baselines on key mathematical benchmarks, achieving SOTA results. It shows superior generalization and resilience to catastrophic forgetting in out-of-domain tasks on Qwen and Llama models.

Conclusion: GeoRA effectively addresses geometric misalignment in RLVR optimization, providing stable, efficient adaptation that preserves model structure while achieving superior performance and generalization compared to existing parameter-efficient methods.

Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) is crucial for advancing large-scale reasoning models. However, existing parameter-efficient methods, such as PiSSA and MiLoRA, are designed for Supervised Fine-Tuning (SFT) and do not account for the distinct optimization dynamics and geometric structures of RLVR. Applying these methods directly leads to spectral collapse and optimization instability, which severely limit model performance. Meanwhile, alternative approaches that leverage update sparsity encounter significant efficiency bottlenecks on modern hardware due to unstructured computations. To address these challenges, we propose GeoRA (Geometry-Aware Low-Rank Adaptation), which exploits the anisotropic and compressible nature of RL update subspaces. GeoRA initializes adapters by extracting principal directions via Singular Value Decomposition (SVD) within a geometrically constrained subspace while freezing the residual components. This method preserves the pre-trained geometric structure and enables efficient GPU computation through dense operators. Experiments on Qwen and Llama demonstrate that GeoRA mitigates optimization bottlenecks caused by geometric misalignment. It consistently outperforms established low-rank baselines on key mathematical benchmarks, achieving state-of-the-art (SOTA) results. Moreover, GeoRA shows superior generalization and resilience to catastrophic forgetting in out-of-domain tasks.
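
A minimal sketch of SVD-seeded adapter initialization, with `update_estimate` standing in for GeoRA's geometrically constrained update-subspace matrix (how that subspace is constructed is the paper's contribution and is not shown here):

```python
# Seed low-rank adapter factors from the top-r singular directions of an
# estimated update matrix; the residual spectrum stays frozen in the base.
import torch

def svd_init_adapter(update_estimate: torch.Tensor, rank: int):
    """Return (A, B) with A @ B spanning the top-r update directions."""
    u, s, vh = torch.linalg.svd(update_estimate, full_matrices=False)
    root_s = s[:rank].sqrt()
    a = u[:, :rank] * root_s           # (out, r)
    b = root_s[:, None] * vh[:rank]    # (r, in)
    return a, b

W_update = torch.randn(768, 768)       # stand-in for a constrained RL update
A, B = svd_init_adapter(W_update, rank=16)
print(A.shape, B.shape)                # torch.Size([768, 16]) torch.Size([16, 768])
```

Keeping the factors dense is what preserves GPU efficiency, in contrast to the unstructured sparse updates the abstract criticizes.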

[347] Preliminary Tests of the Anticipatory Classifier System with Hindsight Experience Replay

Olgierd Unold, Stanisław Franczyk

Main category: cs.LG

TL;DR: ACS2HER combines Anticipatory Classifier System (ACS2) with Hindsight Experience Replay (HER) to improve learning in sparse-reward environments by re-labeling visited states as virtual goals when primary goals aren’t reached.

DetailsMotivation: ACS2 is effective at building cognitive maps but struggles with performance stagnation in environments with sparse rewards. The authors aim to address this limitation by integrating hindsight learning mechanisms.

Method: Integrates ACS2 with HER through a specific architectural variant that triggers hindsight learning when the agent fails to reach its primary goal. Visited states are re-labeled as virtual goals to create denser learning signals.

Result: ACS2HER significantly accelerates knowledge acquisition and environmental mastery compared to standard ACS2 on both deterministic Maze 6 and stochastic FrozenLake benchmarks. However, it comes with increased computational overhead and substantial expansion in classifier numerosity.

Conclusion: This work provides the first analysis of combining anticipatory mechanisms with retrospective goal-relabeling in Learning Classifier Systems, demonstrating improved learning efficiency at the cost of computational complexity.

Abstract: This paper introduces ACS2HER, a novel integration of the Anticipatory Classifier System (ACS2) with the Hindsight Experience Replay (HER) mechanism. While ACS2 is highly effective at building cognitive maps through latent learning, its performance often stagnates in environments characterized by sparse rewards. We propose a specific architectural variant that triggers hindsight learning when the agent fails to reach its primary goal, re-labeling visited states as virtual goals to densify the learning signal. The proposed model was evaluated on two benchmarks: the deterministic Maze 6 and the stochastic FrozenLake. The results demonstrate that ACS2HER significantly accelerates knowledge acquisition and environmental mastery compared to the standard ACS2. However, this efficiency gain is accompanied by increased computational overhead and a substantial expansion in classifier numerosity. This work provides the first analysis of combining anticipatory mechanisms with retrospective goal-relabeling in Learning Classifier Systems.
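
A generic sketch of the hindsight-relabeling half of the idea, independent of ACS2's classifier machinery: failed episodes are replayed against visited states treated as virtual goals, turning sparse failures into dense supervision:

```python
# Generic HER-style relabeling of an episode's transitions.
import random
from typing import List, Tuple

Transition = Tuple[object, object, object]  # (state, action, next_state)

def hindsight_relabel(
    episode: List[Transition], goal: object, num_virtual: int = 4
) -> List[Tuple[Transition, object, float]]:
    """Return (transition, goal, reward) tuples, including virtual goals."""
    labeled = [(t, goal, 1.0 if t[2] == goal else 0.0) for t in episode]
    visited = [t[2] for t in episode]
    for vg in random.sample(visited, min(num_virtual, len(visited))):
        labeled += [(t, vg, 1.0 if t[2] == vg else 0.0) for t in episode]
    return labeled
```

The extra classifier numerosity the paper reports is the expected price: each virtual goal spawns its own set of reinforced rules.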

[348] Draw it like Euclid: Teaching transformer models to generate CAD profiles using ruler and compass construction steps

Siyi Li, Joseph G. Lambourne, Longfei Zhang, Pradeep Kumar Jayaraman, Karl. D. D. Willis

Main category: cs.LG

TL;DR: A new CAD profile generation method using geometric construction sequences (offsetting, rotations, intersections) that improves quality through intermediate construction steps, enables parametric editing, and benefits from reinforcement learning.

DetailsMotivation: To improve CAD profile generation quality by introducing intermediate construction steps similar to chain-of-thought reasoning in language models, while enabling precise parametric editing with floating point precision.

Method: Uses sequences of geometric constructions (curve offsetting, rotations, intersections) starting from designer input, building profiles step-by-step. Applies reinforcement learning to optimize construction sequences across multiple metrics.

Result: Construction sequences improve generation quality similar to chain-of-thought in language models. Reinforcement learning provides further improvements across multiple metrics, including non-optimized ones. The method enables parametric editing with floating point precision.

Conclusion: Geometric construction sequences with intermediate steps enhance CAD profile generation quality and enable parametric editing, with reinforcement learning offering additional performance gains across diverse metrics.

Abstract: We introduce a new method of generating Computer Aided Design (CAD) profiles via a sequence of simple geometric constructions including curve offsetting, rotations and intersections. These sequences start with geometry provided by a designer and build up the points and curves of the final profile step by step. We demonstrate that adding construction steps between the designer’s input geometry and the final profile improves generation quality in a similar way to the introduction of a chain of thought in language models. Similar to the constraints in a parametric CAD model, the construction sequences reduce the degrees of freedom in the modeled shape to a small set of parameter values which can be adjusted by the designer, allowing parametric editing with the constructed geometry evaluated to floating point precision. In addition we show that applying reinforcement learning to the construction sequences gives further improvements over a wide range of metrics, including some which were not explicitly optimized.

[349] DeepLight: A Sobolev-trained Image-to-Image Surrogate Model for Light Transport in Tissue

Philipp Haim, Vasilis Ntziachristos, Torsten Enßlin, Dominik Jüstel

Main category: cs.LG

TL;DR: Sobolev-trained neural surrogate model improves derivative accuracy for light transport in tissue, enhancing optoacoustic imaging reconstruction.

DetailsMotivation: Recovering tissue absorption coefficients in optoacoustic imaging requires accurate inversion of light transport, but existing variational methods need differentiable models. Neural surrogates offer fast simulations but lack derivative accuracy guarantees, which hinders high-fidelity reconstructions in inverse problems.

Method: Developed a surrogate model for light transport in tissue using Sobolev training to improve derivative accuracy. The Sobolev training approach is designed to be suitable for high-dimensional models in general.

Result: Sobolev training not only improves derivative accuracy but also reduces generalization error for both in-distribution and out-of-distribution samples. These improvements enhance the utility of surrogate models for downstream tasks.

Conclusion: Sobolev-trained neural surrogate models for light transport promise to significantly enhance optoacoustic imaging reconstruction by providing accurate derivatives crucial for solving inverse problems, potentially increasing clinical value.

Abstract: In optoacoustic imaging, recovering the absorption coefficients of tissue by inverting the light transport remains a challenging problem. Improvements in solving this problem can greatly benefit the clinical value of optoacoustic imaging. Existing variational inversion methods require an accurate and differentiable model of this light transport. As neural surrogate models allow fast and differentiable simulations of complex physical processes, they are considered promising candidates to be used in solving such inverse problems. However, there are in general no guarantees that the derivatives of these surrogate models accurately match those of the underlying physical operator. As accurate derivatives are central to solving inverse problems, errors in the model derivative can considerably hinder high fidelity reconstructions. To overcome this limitation, we present a surrogate model for light transport in tissue that uses Sobolev training to improve the accuracy of the model derivatives. Additionally, the form of Sobolev training we used is suitable for high-dimensional models in general. Our results demonstrate that Sobolev training for a light transport surrogate model not only improves derivative accuracy but also reduces generalization error for in-distribution and out-of-distribution samples. These improvements promise to considerably enhance the utility of the surrogate model in downstream tasks, especially in solving inverse problems.
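
A minimal sketch of the Sobolev-training idea: penalize the surrogate's derivative mismatch alongside its output error. Matching stochastic directional derivatives (rather than full Jacobians) is an assumption made here to keep the sketch tractable in high dimensions; the paper's exact formulation may differ.

```python
import torch

def sobolev_loss(surrogate, x, y_true, dy_true, v, lam=1.0):
    """y_true: reference outputs; dy_true: reference directional derivatives
    along the random directions v (assumed available from the simulator)."""
    x = x.requires_grad_(True)
    y = surrogate(x)                   # (batch, 1) scalar field values
    # Gradient of the surrogate w.r.t. its input, kept in the graph so the
    # derivative penalty itself can be backpropagated through.
    (grad_x,) = torch.autograd.grad(y.sum(), x, create_graph=True)
    dir_deriv = (grad_x * v).sum(dim=-1, keepdim=True)
    value_loss = torch.mean((y - y_true) ** 2)
    deriv_loss = torch.mean((dir_deriv - dy_true) ** 2)
    return value_loss + lam * deriv_loss

# Toy usage with a random network and fabricated reference data.
net = torch.nn.Sequential(torch.nn.Linear(16, 64), torch.nn.Tanh(),
                          torch.nn.Linear(64, 1))
x, v = torch.randn(32, 16), torch.randn(32, 16)
loss = sobolev_loss(net, x, torch.randn(32, 1), torch.randn(32, 1), v)
loss.backward()
print(loss.item())
```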

[350] Late Breaking Results: Quamba-SE: Soft-edge Quantizer for Activations in State Space Models

Yizhi Chen, Ahmed Hemani

Main category: cs.LG

TL;DR: Quamba-SE is a soft-edge quantizer for SSM activation quantization that uses three adaptive scales (high-precision for small values, standard for normal, low-precision for outliers) instead of hard clipping, achieving better performance than Quamba on Mamba-130M across 6 benchmarks.

DetailsMotivation: Existing quantization methods for State Space Models use standard INT8 operations with hard clipping, which loses important outlier information in activations. This paper aims to preserve outlier information while maintaining precision for other values through adaptive scaling.

Method: Quamba-SE employs a soft-edge quantizer with three adaptive scales: high-precision scaling for small values, standard scaling for normal values, and low-precision scaling for outliers. This approach avoids hard clipping and preserves outlier information that would otherwise be lost.

Result: Evaluation on Mamba-130M across 6 zero-shot benchmarks shows Quamba-SE consistently outperforms Quamba, achieving up to +2.68% improvement on individual benchmarks and up to +0.83% average accuracy improvement across all 6 datasets.

Conclusion: The soft-edge quantization approach with adaptive scaling effectively preserves outlier information in SSM activation quantization, leading to better performance than standard quantization methods while maintaining computational efficiency.

Abstract: We propose Quamba-SE, a soft-edge quantizer for State Space Model (SSM) activation quantization. Unlike existing methods that use standard INT8 operations, Quamba-SE employs three adaptive scales: high-precision for small values, standard scale for normal values, and low-precision for outliers. This preserves outlier information instead of hard clipping it, while maintaining precision for other values. We evaluate on Mamba-130M across 6 zero-shot benchmarks. Results show that Quamba-SE consistently outperforms Quamba, achieving up to +2.68% on individual benchmarks and up to +0.83% improvement in average accuracy across the 6 datasets.
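
A minimal NumPy sketch of the three-scale idea; the thresholds, the per-region scales, and the INT8 range are illustrative assumptions, not the paper's exact design.

```python
import numpy as np

def soft_edge_quantize(x, t_small=0.1, t_out=2.0):
    """Quantize with three per-region INT8 scales instead of one hard clip."""
    q = np.empty_like(x)
    small = np.abs(x) < t_small
    outlier = np.abs(x) > t_out
    normal = ~small & ~outlier
    regions = ((small, t_small),               # fine scale for small values
               (normal, t_out),                # standard scale for normal values
               (outlier, np.abs(x).max()))     # coarse scale that keeps outliers
    for mask, max_val in regions:
        if mask.any():
            scale = max_val / 127.0
            q[mask] = np.round(x[mask] / scale).clip(-127, 127) * scale
    return q

x = np.concatenate([np.random.randn(1000), [15.0, -12.0]])   # with outliers
x_q = soft_edge_quantize(x)
print("max abs error:", np.abs(x - x_q).max())  # outliers survive, not clipped
```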

[351] On the Hardness of Computing Counterfactual and Semifactual Explanations in XAI

André Artelt, Martin Olsen, Kevin Tierney

Main category: cs.LG

TL;DR: Counterfactual and semi-factual explanations for ML models are often computationally hard to generate and even hard to approximate, with implications for XAI research and AI regulation.

DetailsMotivation: Clear explanations for ML model decisions are crucial for deployment in critical applications, and understanding the computational complexity of generating counterfactual and semi-factual explanations is important for the XAI community and policymakers.

Method: The paper provides an overview of existing computational complexity results for generating explanations, and contributes new inapproximability results showing that explanations are not only hard to generate but also hard to approximate under certain assumptions.

Result: The research finds that generating counterfactual and semi-factual explanations is computationally hard in many cases, and strengthens this argument with new inapproximability results showing these explanations are also hard to approximate.

Conclusion: The computational complexity results have significant implications for the XAI community’s research directions and for policymakers seeking to regulate AI explanations, highlighting fundamental limitations in generating certain types of explanations.

Abstract: Providing clear explanations to the choices of machine learning models is essential for these models to be deployed in crucial applications. Counterfactual and semi-factual explanations have emerged as two mechanisms for providing users with insights into the outputs of their models. We provide an overview of the computational complexity results in the literature for generating these explanations, finding that in many cases, generating explanations is computationally hard. We strengthen the argument for this considerably by further contributing our own inapproximability results showing that not only are explanations often hard to generate, but under certain assumptions, they are also hard to approximate. We discuss the implications of these complexity results for the XAI community and for policymakers seeking to regulate explanations in AI.

[352] Searth Transformer: A Transformer Architecture Incorporating Earth’s Geospheric Physical Priors for Global Mid-Range Weather Forecasting

Tianye Li, Qi Liu, Hao Li, Lei Chen, Wencong Cheng, Fei Zheng, Xiangao Xia, Ya Wang, Gang Huang, Weiwei Wang, Xuan Tong, Ziqing Zu, Yi Fang, Shenming Fu, Jiang Jiang, Haochen Li, Mingxing Li, Jiangjiang Xia

Main category: cs.LG

TL;DR: Searth Transformer with physics-informed architecture and RAR fine-tuning enables efficient global weather forecasting with competitive accuracy to ECMWF HRES at 200x lower computational cost.

DetailsMotivation: Existing Transformer-based weather models neglect Earth's spherical geometry and zonal periodicity, while conventional autoregressive training is computationally expensive and limits forecast horizons due to error accumulation.

Method: Proposed Shifted Earth Transformer (Searth Transformer) incorporates zonal periodicity and meridional boundaries into window-based self-attention for physically consistent global information exchange. Introduced Relay Autoregressive (RAR) fine-tuning strategy for learning long-range atmospheric evolution under constrained memory/computational budgets.

Result: YanTian model achieves higher accuracy than ECMWF HRES, performs competitively with state-of-the-art AI models at one-degree resolution with 200x lower computational cost than standard autoregressive fine-tuning. Attains longer skillful forecast lead time for Z500 (10.3 days vs HRES’s 9 days).

Conclusion: The work establishes a robust algorithmic foundation for predictive modeling of complex global-scale geophysical circulation systems, offering new pathways for Earth system science beyond just weather forecasting.

Abstract: Accurate global medium-range weather forecasting is fundamental to Earth system science. Most existing Transformer-based forecasting models adopt vision-centric architectures that neglect the Earth’s spherical geometry and zonal periodicity. In addition, conventional autoregressive training is computationally expensive and limits forecast horizons due to error accumulation. To address these challenges, we propose the Shifted Earth Transformer (Searth Transformer), a physics-informed architecture that incorporates zonal periodicity and meridional boundaries into window-based self-attention for physically consistent global information exchange. We further introduce a Relay Autoregressive (RAR) fine-tuning strategy that enables learning long-range atmospheric evolution under constrained memory and computational budgets. Based on these methods, we develop YanTian, a global medium-range weather forecasting model. YanTian achieves higher accuracy than the high-resolution forecast of the European Centre for Medium-Range Weather Forecasts and performs competitively with state-of-the-art AI models at one-degree resolution, while requiring roughly 200 times lower computational cost than standard autoregressive fine-tuning. Furthermore, YanTian attains a longer skillful forecast lead time for Z500 (10.3 days) than HRES (9 days). Beyond weather forecasting, this work establishes a robust algorithmic foundation for predictive modeling of complex global-scale geophysical circulation systems, offering new pathways for Earth system science.
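
The geospheric prior can be illustrated in a few lines of PyTorch: shifted windows should wrap periodically in the zonal (longitude) direction but treat latitude edges as hard boundaries. This sketch shows only the boundary handling, not the paper's full attention mechanism.

```python
import torch
import torch.nn.functional as F

x = torch.randn(1, 8, 64, 128)  # (batch, channels, latitude, longitude)

# Zonal direction: shifted windows wrap around periodically along longitude.
shifted = torch.roll(x, shifts=-4, dims=3)

# Meridional direction: no wrap across the poles; pad latitude with zeros
# so windows near the poles see an explicit boundary instead of wrapped data.
padded = F.pad(x, (0, 0, 4, 4), mode="constant", value=0.0)
print(shifted.shape, padded.shape)  # (1, 8, 64, 128) and (1, 8, 72, 128)
```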

[353] FairGU: Fairness-aware Graph Unlearning in Social Network

Renqiang Luo, Yongshuai Yang, Huafei Huang, Qing Qing, Mingliang Hou, Ziqi Xu, Yi Yu, Jingjing Zhou, Feng Xia

Main category: cs.LG

TL;DR: FairGU is a fairness-aware graph unlearning framework that preserves both utility and fairness when removing nodes, addressing the fairness degradation issue in existing graph unlearning methods.

DetailsMotivation: Existing graph unlearning techniques insufficiently protect sensitive attributes, often leading to degraded algorithmic fairness compared to traditional graph learning methods, creating a gap in privacy-preserving social networks.

Method: FairGU integrates a dedicated fairness-aware module with effective data protection strategies to ensure sensitive attributes are neither inadvertently amplified nor structurally exposed when nodes are removed.

Result: FairGU consistently outperforms state-of-the-art graph unlearning methods and fairness-enhanced graph learning baselines in terms of both accuracy and fairness metrics across multiple real-world datasets.

Conclusion: The research highlights a previously overlooked risk in current unlearning practices and establishes FairGU as a robust and equitable solution for socially sustainable networked systems.

Abstract: Graph unlearning has emerged as a critical mechanism for supporting sustainable and privacy-preserving social networks, enabling models to remove the influence of deleted nodes and thereby better safeguard user information. However, we observe that existing graph unlearning techniques insufficiently protect sensitive attributes, often leading to degraded algorithmic fairness compared with traditional graph learning methods. To address this gap, we introduce FairGU, a fairness-aware graph unlearning framework designed to preserve both utility and fairness during the unlearning process. FairGU integrates a dedicated fairness-aware module with effective data protection strategies, ensuring that sensitive attributes are neither inadvertently amplified nor structurally exposed when nodes are removed. Through extensive experiments on multiple real-world datasets, we demonstrate that FairGU consistently outperforms state-of-the-art graph unlearning methods and fairness-enhanced graph learning baselines in terms of both accuracy and fairness metrics. Our findings highlight a previously overlooked risk in current unlearning practices and establish FairGU as a robust and equitable solution for the next generation of socially sustainable networked systems. The codes are available at https://github.com/LuoRenqiang/FairGU.

[354] SimMerge: Learning to Select Merge Operators from Similarity Signals

Oliver Bolton, Aakanksha, Arash Ahmadian, Sara Hooker, Marzieh Fadaee, Beyza Ermis

Main category: cs.LG

TL;DR: SimMerge: a predictive method for selecting optimal LLM merges using task-agnostic similarity signals, eliminating expensive merge-and-evaluate searches.

DetailsMotivation: Model merging is valuable for LLM development but becomes difficult at scale due to the need to choose the right merge operator, select appropriate models, and determine optimal merge order, which typically requires expensive merge-and-evaluate searches.

Method: SimMerge uses inexpensive, task-agnostic similarity signals between models. From a small set of unlabeled probes, it computes functional and structural features to predict the performance of 2-way merges, then uses these predictions to select the best merge operator, subset of models, and merge order.

Result: The method surpasses standard merge-operator performance on 2-way merges of 7B-parameter LLMs, generalizes to multi-way merges and 111B-parameter LLM merges without retraining, and includes a bandit variant that supports adding new tasks, models, and operators dynamically.

Conclusion: Learning how to merge is a practical approach for scalable model composition when dealing with large checkpoint catalogs and tight evaluation budgets, offering an efficient alternative to expensive merge-and-evaluate loops.

Abstract: Model merging enables multiple large language models (LLMs) to be combined into a single model while preserving performance. This makes it a valuable tool in LLM development, offering a competitive alternative to multi-task training. However, merging can be difficult at scale, as successful merging requires choosing the right merge operator, selecting the right models, and merging them in the right order. This often leads researchers to run expensive merge-and-evaluate searches to select the best merge. In this work, we provide an alternative by introducing SimMerge, a predictive merge-selection method that selects the best merge using inexpensive, task-agnostic similarity signals between models. From a small set of unlabeled probes, we compute functional and structural features and use them to predict the performance of a given 2-way merge. Using these predictions, SimMerge selects the best merge operator, the subset of models to merge, and the merge order, eliminating the expensive merge-and-evaluate loop. We demonstrate that we surpass standard merge-operator performance on 2-way merges of 7B-parameter LLMs, and that SimMerge generalizes to multi-way merges and 111B-parameter LLM merges without retraining. Additionally, we present a bandit variant that supports adding new tasks, models, and operators on the fly. Our results suggest that learning how to merge is a practical route to scalable model composition when checkpoint catalogs are large and evaluation budgets are tight.
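
A minimal sketch of the predict-then-select idea: featurize a candidate pair by cheap structural and functional similarities and fit a regressor on past merge outcomes. The two features and the ridge regressor are illustrative stand-ins for the paper's actual feature set.

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def pair_features(w_base, w_a, w_b, acts_a, acts_b):
    """Structural feature: cosine of task vectors (weights minus base).
    Functional feature: mean cosine of activations on unlabeled probes."""
    structural = cosine(w_a - w_base, w_b - w_base)
    functional = np.mean([cosine(fa, fb) for fa, fb in zip(acts_a, acts_b)])
    return np.array([structural, functional])

# Fabricated history of past 2-way merges: features -> observed merge score.
X = rng.normal(size=(50, 2))
y = 0.6 * X[:, 0] + 0.3 * X[:, 1] + rng.normal(scale=0.05, size=50)
predictor = Ridge().fit(X, y)

# Score a new candidate pair without actually merging or evaluating it.
w0, wa, wb = rng.normal(size=(3, 100))
feats = pair_features(w0, wa, wb, [rng.normal(size=10)], [rng.normal(size=10)])
print("predicted merge quality:", predictor.predict(feats[None])[0])
```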

[355] Terminally constrained flow-based generative models from an optimal control perspective

Weiguo Gao, Ming Li, Qianxiao Li

Main category: cs.LG

TL;DR: TOCFlow: Optimal control method for sampling from terminally constrained distributions using pre-trained flow models, with geometry-aware guidance that matches Gauss-Newton updates at gradient guidance cost.

DetailsMotivation: Need to sample from constrained distributions using pre-trained flow-based generative models while satisfying terminal constraints (equality, inequality, statistical constraints) without compromising generative quality.

Method: Formulate as optimal control problem with Hamilton-Jacobi-Bellman equation. TOCFlow solves control in terminal co-moving frame tracking reference trajectories, yielding closed-form scalar damping factor along Riemannian gradient capturing curvature effects without matrix inversions.

Result: Theoretically: As control penalty increases, process recovers reference distribution; as penalty vanishes, converges to generalized Wasserstein projection. Empirically: TOCFlow improves constraint satisfaction over Euclidean guidance and projection baselines while preserving generative quality on Darcy flow, constrained trajectory planning, and turbulence snapshot generation tasks.

Conclusion: TOCFlow provides geometry-aware sampling-time guidance for pre-trained flows that matches geometric consistency of Gauss-Newton updates at computational cost of standard gradient guidance, enabling effective constrained sampling across diverse high-dimensional scientific applications.

Abstract: We address the problem of sampling from terminally constrained distributions with pre-trained flow-based generative models through an optimal control formulation. Theoretically, we characterize the value function by a Hamilton-Jacobi-Bellman equation and derive the optimal feedback control as the minimizer of the associated Hamiltonian. We show that as the control penalty increases, the controlled process recovers the reference distribution, while as the penalty vanishes, the terminal law converges to a generalized Wasserstein projection onto the constraint manifold. Algorithmically, we introduce Terminal Optimal Control with Flow-based models (TOCFlow), a geometry-aware sampling-time guidance method for pre-trained flows. Solving the control problem in a terminal co-moving frame that tracks reference trajectories yields a closed-form scalar damping factor along the Riemannian gradient, capturing second-order curvature effects without matrix inversions. TOCFlow therefore matches the geometric consistency of Gauss-Newton updates at the computational cost of standard gradient guidance. We evaluate TOCFlow on three high-dimensional scientific tasks spanning equality, inequality, and global statistical constraints, namely Darcy flow, constrained trajectory planning, and turbulence snapshot generation with Kolmogorov spectral scaling. Across all settings, TOCFlow improves constraint satisfaction over Euclidean guidance and projection baselines while preserving the reference model’s generative quality.

[356] Deep Operator Networks for Surrogate Modeling of Cyclic Adsorption Processes with Varying Initial Conditions

Beatrice Ceccanti, Mattia Galanti, Ivo Roghair, Martin van Sint Annaland

Main category: cs.LG

TL;DR: DeepONets are applied as efficient surrogates for cyclic adsorption process simulation and optimization, demonstrating accurate predictions even outside training distributions.

DetailsMotivation: Cyclic adsorption processes like Temperature-Vacuum Swing Adsorption (TVSA) require repeated solution of computationally expensive transient PDEs. There's a need for efficient surrogate models to accelerate convergence and optimization workflows for these cyclic processes.

Method: Applied Deep Operator Networks (DeepONets) to learn solution operators for adsorption process PDEs. Constructed mixed training datasets with heterogeneous initial conditions to evaluate functional generalization. Trained models to approximate solution operators and tested on initial conditions outside training parameter ranges and unseen functional forms.

Result: DeepONets demonstrated accurate predictions both within and beyond the training distribution. They successfully handled steep traveling fronts in the governing equations and showed potential as efficient surrogates for cyclic adsorption simulations.

Conclusion: DeepONets are promising efficient surrogates for accelerating cyclic adsorption process simulation and optimization workflows, with demonstrated ability to generalize across wide ranges of initial conditions and challenging PDE characteristics.

Abstract: Deep Operator Networks are emerging as fundamental tools among various neural network types to learn mappings between function spaces, and have recently gained attention due to their ability to approximate nonlinear operators. In particular, DeepONets offer a natural formulation for PDE solving, since the solution of a partial differential equation can be interpreted as an operator mapping an initial condition to its corresponding solution field. In this work, we applied DeepONets in the context of process modeling for adsorption technologies, to assess their feasibility as surrogates for cyclic adsorption process simulation and optimization. The goal is to accelerate convergence of cyclic processes such as Temperature-Vacuum Swing Adsorption (TVSA), which require repeated solution of transient PDEs, which are computationally expensive. Since each step of a cyclic adsorption process starts from the final state of the preceding step, effective surrogate modeling requires generalization across a wide range of initial conditions. The governing equations exhibit steep traveling fronts, providing a demanding benchmark for operator learning. To evaluate functional generalization under these conditions, we construct a mixed training dataset composed of heterogeneous initial conditions and train DeepONets to approximate the corresponding solution operators. The trained models are then tested on initial conditions outside the parameter ranges used during training, as well as on completely unseen functional forms. The results demonstrate accurate predictions both within and beyond the training distribution, highlighting DeepONets as potential efficient surrogates for accelerating cyclic adsorption simulations and optimization workflows.
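
For reference, a minimal PyTorch DeepONet: a branch net encodes the initial condition sampled at fixed sensor points, a trunk net encodes query coordinates, and their inner product yields the predicted field. Layer sizes and the two-dimensional (position, time) query are illustrative choices.

```python
import torch
import torch.nn as nn

class DeepONet(nn.Module):
    def __init__(self, n_sensors=100, width=64):
        super().__init__()
        self.branch = nn.Sequential(nn.Linear(n_sensors, width), nn.Tanh(),
                                    nn.Linear(width, width))
        self.trunk = nn.Sequential(nn.Linear(2, width), nn.Tanh(),
                                   nn.Linear(width, width))

    def forward(self, u0, xy):
        # u0: (batch, n_sensors) initial condition sampled at sensor points
        # xy: (batch, n_query, 2) space-time query coordinates
        b = self.branch(u0)                      # (batch, width)
        t = self.trunk(xy)                       # (batch, n_query, width)
        return torch.einsum("bw,bqw->bq", b, t)  # solution at query points

model = DeepONet()
u0 = torch.randn(8, 100)       # e.g., bed state at the start of a cycle step
xy = torch.rand(8, 200, 2)     # (position, time) queries
print(model(u0, xy).shape)     # torch.Size([8, 200])
```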

[357] Parallelizable memory recurrent units

Florent De Geeter, Gaspard Lambrechts, Damien Ernst, Guillaume Drion

Main category: cs.LG

TL;DR: The paper introduces Memory Recurrent Units (MRUs), a new family of RNNs that combine persistent memory capabilities of nonlinear RNNs with parallelizable computations of state-space models, addressing limitations of both Transformers and SSMs.

DetailsMotivation: Transformers enable parallel training but are inefficient at sequence generation due to reprocessing past timesteps. State-space models (SSMs) offer efficient recurrent updates and parallelization but lack persistent memory capabilities due to monostability. There's a need for models that combine persistent memory with parallelizable computations.

Method: Introduces Memory Recurrent Units (MRUs) that leverage multistability for persistent memory while eliminating transient dynamics for efficient computation. Presents a specific implementation called Bistable Memory Recurrent Unit (BMRU) that is compatible with the parallel scan algorithm, enabling parallel training.

Result: BMRU achieves good results in tasks with long-term dependencies. It can be combined with state-space models to create hybrid networks that are both parallelizable and possess both transient dynamics and persistent memory capabilities.

Conclusion: The paper successfully bridges the gap between nonlinear RNNs with persistent memory and parallelizable state-space models, offering a new family of RNNs that maintain efficient computation while gaining the representation power needed for long-term dependencies.

Abstract: With the emergence of massively parallel processing units, parallelization has become a desirable property for new sequence models. The ability to parallelize the processing of sequences with respect to the sequence length during training is one of the main factors behind the rise of the Transformer architecture. However, Transformers lack efficiency at sequence generation, as they need to reprocess all past timesteps at every generation step. Recently, state-space models (SSMs) emerged as a more efficient alternative. These new kinds of recurrent neural networks (RNNs) retain the efficient recurrent update while gaining parallelization by removing nonlinear dynamics (or recurrence). SSMs can reach state-of-the-art performance through the efficient training of potentially very large networks, but still suffer from limited representation capabilities. In particular, SSMs cannot exhibit persistent memory, or the capacity of retaining information for an infinite duration, because of their monostability. In this paper, we introduce a new family of RNNs, the memory recurrent units (MRUs), that combine the persistent memory capabilities of nonlinear RNNs with the parallelizable computations of SSMs. These units leverage multistability as a source of persistent memory, while discarding transient dynamics for efficient computations. We then derive a specific implementation as a proof of concept: the bistable memory recurrent unit (BMRU). This new RNN is compatible with the parallel scan algorithm. We show that BMRU achieves good results in tasks with long-term dependencies, and can be combined with state-space models to create hybrid networks that are parallelizable and have transient dynamics as well as persistent memory.

[358] Class Adaptive Conformal Training

Badr-Eddine Marani, Julio Silva-Rodriguez, Ismail Ben Ayed, Maria Vakalopoulou, Stergios Christodoulidis, Jose Dolz

Main category: cs.LG

TL;DR: CaCT introduces class-adaptive conformal training that learns to shape prediction sets class-conditionally without distributional assumptions, outperforming prior methods with smaller, more informative sets while maintaining coverage guarantees.

DetailsMotivation: Deep neural networks often produce unreliable probability estimates and can be overconfident. While Conformal Prediction provides uncertainty quantification with coverage guarantees, existing methods optimize for overall set size but struggle with class-conditional shaping without prior knowledge of data distributions.

Method: Class Adaptive Conformal Training (CaCT) formulates conformal training as an augmented Lagrangian optimization problem that adaptively learns to shape prediction sets class-conditionally without making any distributional assumptions.

Result: Experiments on multiple benchmark datasets (standard and long-tailed image recognition, text classification) show CaCT consistently outperforms prior conformal training methods, producing significantly smaller and more informative prediction sets while maintaining desired coverage guarantees.

Conclusion: CaCT provides an effective approach for class-adaptive conformal training that learns to shape prediction sets without distributional assumptions, offering improved uncertainty quantification with smaller, more informative sets across diverse tasks.

Abstract: Deep neural networks have achieved remarkable success across a variety of tasks, yet they often suffer from unreliable probability estimates. As a result, they can be overconfident in their predictions. Conformal Prediction (CP) offers a principled framework for uncertainty quantification, yielding prediction sets with rigorous coverage guarantees. Existing conformal training methods optimize for overall set size, but shaping the prediction sets in a class-conditional manner is not straightforward and typically requires prior knowledge of the data distribution. In this work, we introduce Class Adaptive Conformal Training (CaCT), which formulates conformal training as an augmented Lagrangian optimization problem that adaptively learns to shape prediction sets class-conditionally without making any distributional assumptions. Experiments on multiple benchmark datasets, including standard and long-tailed image recognition as well as text classification, demonstrate that CaCT consistently outperforms prior conformal training methods, producing significantly smaller and more informative prediction sets while maintaining the desired coverage guarantees.
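
A minimal sketch of the augmented-Lagrangian idea: minimize a smoothed prediction-set size while per-class coverage constraints are enforced through multipliers. The sigmoid smoothing, the fixed threshold, and the multiplier update are illustrative assumptions, not the paper's formulation.

```python
import torch

def smooth_set(logits, tau, temp=0.1):
    # Soft membership of each class in the prediction set.
    return torch.sigmoid((logits - tau) / temp)

def cact_step(logits, labels, lambdas, tau=0.0, alpha=0.1, rho=0.5):
    sets = smooth_set(logits, tau)
    size_loss = sets.sum(dim=1).mean()            # encourage small sets
    gaps = []
    for k in range(logits.shape[1]):
        mask = labels == k
        cov_k = sets[mask, k].mean() if mask.any() else torch.tensor(1.0)
        gaps.append((1 - alpha) - cov_k)          # > 0: class k undercovered
    gaps = torch.stack(gaps)
    loss = (size_loss + (lambdas * gaps).sum()
            + 0.5 * rho * gaps.clamp(min=0).pow(2).sum())
    new_lambdas = (lambdas + rho * gaps).clamp(min=0).detach()
    return loss, new_lambdas

logits = torch.randn(64, 5, requires_grad=True)
labels = torch.randint(0, 5, (64,))
loss, lambdas = cact_step(logits, labels, lambdas=torch.zeros(5))
loss.backward()
print(loss.item(), lambdas)
```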

[359] Private LLM Inference on Consumer Blackwell GPUs: A Practical Guide for Cost-Effective Local Deployment in SMEs

Jonathan Knoop, Hendrik Holtmann

Main category: cs.LG

TL;DR: Consumer GPUs (RTX 50-series) offer cost-effective, private LLM inference for SMEs, with RTX 5090 providing 3.5-4.6x higher throughput than 5060 Ti and self-hosted costs 40-200x cheaper than cloud APIs.

DetailsMotivation: SMEs need alternatives to cloud LLM APIs due to data privacy concerns, while dedicated cloud GPU instances have limited privacy guarantees and ongoing costs, and professional on-premise hardware is too expensive.

Method: Systematic evaluation of NVIDIA Blackwell consumer GPUs (RTX 5060 Ti, 5070 Ti, 5090) for production LLM inference, benchmarking four open-weight models across 79 configurations including quantization formats, context lengths, and three workloads: RAG, multi-LoRA agentic serving, and high-concurrency APIs.

Result: RTX 5090 delivers 3.5-4.6x higher throughput than 5060 Ti with 21x lower latency for RAG; budget GPUs achieve highest throughput-per-dollar for API workloads; NVFP4 quantization provides 1.6x throughput over BF16 with 41% energy reduction and only 2-4% quality loss; self-hosted inference costs $0.001-0.04 per million tokens (electricity only), 40-200x cheaper than cloud APIs.

Conclusion: Consumer GPUs can reliably replace cloud inference for most SME workloads, except latency-critical long-context RAG where high-end GPUs remain essential. Hardware breaks even in under four months at moderate volume (30M tokens/day).

Abstract: SMEs increasingly seek alternatives to cloud LLM APIs, which raise data privacy concerns. Dedicated cloud GPU instances offer improved privacy but with limited guarantees and ongoing costs, while professional on-premise hardware (A100, H100) remains prohibitively expensive. We present a systematic evaluation of NVIDIA’s Blackwell consumer GPUs (RTX 5060 Ti, 5070 Ti, 5090) for production LLM inference, benchmarking four open-weight models (Qwen3-8B, Gemma3-12B, Gemma3-27B, GPT-OSS-20B) across 79 configurations spanning quantization formats (BF16, W4A16, NVFP4, MXFP4), context lengths (8k-64k), and three workloads: RAG, multi-LoRA agentic serving, and high-concurrency APIs. The RTX 5090 delivers 3.5-4.6x higher throughput than the 5060 Ti with 21x lower latency for RAG, but budget GPUs achieve the highest throughput-per-dollar for API workloads with sub-second latency. NVFP4 quantization provides 1.6x throughput over BF16 with 41% energy reduction and only 2-4% quality loss. Self-hosted inference costs $0.001-0.04 per million tokens (electricity only), which is 40-200x cheaper than budget-tier cloud APIs, with hardware breaking even in under four months at moderate volume (30M tokens/day). Our results show that consumer GPUs can reliably replace cloud inference for most SME workloads, except latency-critical long-context RAG, where high-end GPUs remain essential. We provide deployment guidance and release all benchmark data for reproducible SME-scale deployments.
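
The break-even claim follows from simple arithmetic on the quoted figures; a worked example, where the hardware price is an assumed illustrative value:

```python
# All figures except the hardware price come from the abstract; the hardware
# price is an assumed illustrative value for a consumer-GPU workstation.
tokens_per_day = 30e6              # "moderate volume (30M tokens/day)"
self_cost_per_m = 0.04             # upper end of self-hosted $/1M tokens
cloud_cost_per_m = 40 * 0.04       # "40-200x cheaper" -> cheapest cloud case
hardware_price = 2500.0            # assumed workstation cost

daily_saving = tokens_per_day / 1e6 * (cloud_cost_per_m - self_cost_per_m)
breakeven_days = hardware_price / daily_saving
print(f"daily saving ${daily_saving:.2f}; break-even in {breakeven_days:.0f} "
      f"days (~{breakeven_days / 30:.1f} months)")
```

Under these assumptions the break-even arrives in under two months, comfortably inside the "under four months" the authors report.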

[360] Constraint- and Score-Based Nonlinear Granger Causality Discovery with Kernels

Fiona Murphy, Alessio Benavoli

Main category: cs.LG

TL;DR: Unifies kernel-based Granger Causality methods under KPCR framework, introduces Gaussian Process score-based model with SIC penalization, and proposes contemporaneous causal identification algorithm.

DetailsMotivation: To improve nonlinear causal discovery in time series by unifying existing kernel-based Granger Causality approaches and developing more effective methods for both lagged and contemporaneous causal relationships.

Method: 1) Theoretical unification of kernel-based GC methods under Kernel Principal Component Regression (KPCR) framework. 2) Gaussian Process score-based model with Smooth Information Criterion (SIC) penalization on marginal likelihood. 3) Contemporaneous causal identification algorithm using the proposed GP_SIC method.

Result: The unified KPCR-based approach improves causal identification. The GP_SIC method demonstrates improved performance over existing state-of-the-art nonlinear causal discovery methods. The contemporaneous algorithm performs comparably to state-of-the-art contemporaneous time series causal discovery methods.

Conclusion: The paper provides a unified theoretical framework for kernel-based Granger Causality, introduces improved methods for nonlinear causal discovery, and extends Granger Causality to contemporaneous causal identification with competitive performance.

Abstract: Kernel-based methods are used in the context of Granger Causality to enable the identification of nonlinear causal relationships between time series variables. In this paper, we show that two state-of-the-art kernel-based Granger Causality (GC) approaches can be theoretically unified under the framework of Kernel Principal Component Regression (KPCR), and introduce a method based on this unification, demonstrating that this approach can improve causal identification. Additionally, we introduce a Gaussian Process score-based model with Smooth Information Criterion penalisation on the marginal likelihood, and demonstrate improved performance over existing state-of-the-art nonlinear causal discovery methods for time series. Furthermore, we propose a contemporaneous causal identification algorithm, fully based on GC, using the proposed score-based $GP_{SIC}$ method, and compare its performance to a state-of-the-art contemporaneous time series causal discovery algorithm.
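
A minimal sketch of a KPCR-style Granger test: regress the target on kernel principal components of its own past, then of the joint past, and compare residual errors. The lag order, kernel, and component count are illustrative choices, not the paper's unified estimator.

```python
import numpy as np
from sklearn.decomposition import KernelPCA
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
n, lag = 500, 2
x = rng.normal(size=n)
y = np.zeros(n)
for t in range(lag, n):                 # y is nonlinearly driven by past x
    y[t] = 0.5 * y[t - 1] + np.tanh(x[t - 1]) + 0.1 * rng.normal()

def lagged(*series):
    # Columns s[t-1], ..., s[t-lag] for each series, rows t = lag..n-1.
    return np.column_stack([s[lag - k - 1:n - k - 1]
                            for s in series for k in range(lag)])

target = y[lag:]

def kpcr_mse(Z):
    feats = KernelPCA(n_components=10, kernel="rbf").fit_transform(Z)
    pred = LinearRegression().fit(feats, target).predict(feats)
    return np.mean((target - pred) ** 2)

print("restricted (y past only):", kpcr_mse(lagged(y)))
print("full (y and x past):     ", kpcr_mse(lagged(y, x)))  # lower => x GC-causes y
```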

[361] Energy-Entropy Regularization: The True Power of Minimal Looped Transformers

Wai-Lun Lam

Main category: cs.LG

TL;DR: Novel training framework using Tsallis entropy and Hamiltonian dynamics successfully trains single-head looped Transformers for reasoning tasks, overcoming optimization challenges in irregular loss landscapes.

DetailsMotivation: Looped Transformers show superior reasoning capabilities but are difficult to train due to highly non-convex and irregular loss landscapes that cause optimization to stagnate in poor local minima and saddle points. The internal mechanisms of these models remain poorly understood.

Method: Proposes a training framework that leverages Tsallis entropy and Hamiltonian dynamics to transform the geometry of the loss landscape. Treats parameter updates as a physical flow to navigate the optimization challenges.

Result: Successfully trained a single-head looped Transformer with model dimension d=8 to solve induction head task with input sequence length of 1000 tokens, revealing the internal mechanism behind superior reasoning capability.

Conclusion: The proposed physics-inspired training approach enables effective training of looped Transformers, overcoming previous optimization challenges and providing insights into their reasoning mechanisms.

Abstract: Recent research suggests that looped Transformers have superior reasoning capabilities compared to standard deep architectures. Current approaches to training single-head looped architectures on benchmark tasks frequently fail or yield suboptimal performance due to a highly non-convex and irregular loss landscape. In these settings, optimization often stagnates in poor local minima and saddle points of the loss landscape, preventing the model from discovering the global minimum point. The internal mechanisms of these single-head looped transformer models remain poorly understood, and training them from scratch remains a significant challenge. In this paper, we propose a novel training framework that leverages Tsallis entropy and Hamiltonian dynamics to transform the geometry of the loss landscape. By treating the parameter updates as a physical flow, we successfully trained a single-head looped Transformer with model dimension $d = 8$ to solve induction head task with input sequence length of 1000 tokens. This success reveals the internal mechanism behind the superior reasoning capability.
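
For reference, the Tsallis entropy generalizes Shannon entropy with a parameter q via $S_q(p) = (1 - \sum_i p_i^q)/(q - 1)$; a minimal sketch of the quantity itself (how the paper couples it with Hamiltonian dynamics in the update rule is not shown):

```python
import numpy as np

def tsallis_entropy(p, q=1.5, eps=1e-12):
    """S_q(p) = (1 - sum_i p_i^q) / (q - 1); q -> 1 recovers Shannon entropy."""
    p = np.asarray(p, dtype=float)
    p = p / (p.sum() + eps)
    if abs(q - 1.0) < 1e-8:
        return float(-np.sum(p * np.log(p + eps)))
    return float((1.0 - np.sum(p ** q)) / (q - 1.0))

uniform = np.ones(8) / 8
peaked = np.array([0.93] + [0.01] * 7)
print(tsallis_entropy(uniform), tsallis_entropy(peaked))  # uniform scores higher
```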

[362] Toward Understanding Unlearning Difficulty: A Mechanistic Perspective and Circuit-Guided Difficulty Metric

Jiali Cheng, Ziheng Chen, Chirag Agarwal, Hadi Amiri

Main category: cs.LG

TL;DR: The paper proposes CUD, a circuit-based metric to predict unlearning difficulty for individual samples before unlearning, revealing that easy-to-unlearn samples use shorter, shallower circuits while hard samples rely on longer, deeper pathways.

DetailsMotivation: Machine unlearning success varies significantly across samples, with some easily erased while others persist despite identical procedures. The authors argue this disparity reflects model-internal mechanisms that encode and protect memorized information, not just data-side phenomena.

Method: The authors propose Circuit-guided Unlearning Difficulty (CUD), a pre-unlearning metric that assigns continuous difficulty scores using circuit-level signals. They analyze model circuits—structured interaction pathways that govern predictions—to identify mechanistic patterns associated with unlearning difficulty.

Result: CUD reliably separates intrinsically easy and hard samples and remains stable across different unlearning methods. Easy-to-unlearn samples are associated with shorter, shallower interactions concentrated in earlier-to-intermediate model parts, while hard samples rely on longer, deeper pathways closer to late-stage computation.

Conclusion: CUD represents a first step toward principled, fine-grained, and interpretable analysis of unlearning difficulty, moving beyond qualitative studies. It motivates developing unlearning methods grounded in model mechanisms rather than just data characteristics.

Abstract: Machine unlearning is becoming essential for building trustworthy and compliant language models. Yet unlearning success varies considerably across individual samples: some are reliably erased, while others persist despite the same procedure. We argue that this disparity is not only a data-side phenomenon, but also reflects model-internal mechanisms that encode and protect memorized information. We study this problem from a mechanistic perspective based on model circuits: structured interaction pathways that govern how predictions are formed. We propose Circuit-guided Unlearning Difficulty (CUD), a pre-unlearning metric that assigns each sample a continuous difficulty score using circuit-level signals. Extensive experiments demonstrate that CUD reliably separates intrinsically easy and hard samples, and remains stable across unlearning methods. We identify key circuit-level patterns that reveal a mechanistic signature of difficulty: easy-to-unlearn samples are associated with shorter, shallower interactions concentrated in earlier-to-intermediate parts of the original model, whereas hard samples rely on longer and deeper pathways closer to late-stage computation. Compared to existing qualitative studies, CUD takes a first step toward a principled, fine-grained, and interpretable analysis of unlearning difficulty, and motivates the development of unlearning methods grounded in model mechanisms.

[363] From Prompt to Protocol: Fast Charging Batteries with Large Language Models

Ge Lei, Ferran Brosa Planella, Sterling G. Baird, Samuel J. Cooper

Main category: cs.LG

TL;DR: LLM-driven methods (P2O and P2P) outperform traditional optimization approaches for battery charging protocol design, achieving 4.2% improvement in battery health metrics.

DetailsMotivation: Battery charging protocol optimization is difficult due to slow, costly, non-differentiable evaluations. Existing methods heavily constrain search spaces, limiting discovery of better solutions.

Method: Two gradient-free, LLM-driven methods: 1) Prompt-to-Optimizer (P2O) uses an LLM to propose code for neural-network-based protocols, which are then trained by an inner loop; 2) Prompt-to-Protocol (P2P) writes an explicit current function with scalar parameters.

Result: LLM-guided P2O outperforms Bayesian optimization, evolutionary algorithms, and random search. Both P2O and P2P achieve ~4.2% improvement in state of health over state-of-the-art multi-step constant current baseline under matched evaluation budgets.

Conclusion: LLMs can expand protocol functional form space, incorporate language-based constraints, and enable efficient optimization in high-cost experimental settings for battery charging.

Abstract: Efficiently optimizing battery charging protocols is challenging because each evaluation is slow, costly, and non-differentiable. Many existing approaches address this difficulty by heavily constraining the protocol search space, which limits the diversity of protocols that can be explored, preventing the discovery of higher-performing solutions. We introduce two gradient-free, LLM-driven closed-loop methods: Prompt-to-Optimizer (P2O), which uses an LLM to propose the code for small neural-network-based protocols, which are then trained by an inner loop, and Prompt-to-Protocol (P2P), which simply writes an explicit function for the current and its scalar parameters. Across our case studies, LLM-guided P2O outperforms neural networks designed by Bayesian optimization, evolutionary algorithms, and random search. In a realistic fast-charging scenario, both P2O and P2P yield around a 4.2 percent improvement in state of health (a capacity-retention-based health metric under fast-charging cycling) over a state-of-the-art multi-step constant current (CC) baseline, with P2P achieving this under matched evaluation budgets (the same number of protocol evaluations). These results demonstrate that LLMs can expand the space of protocol functional forms, incorporate language-based constraints, and enable efficient optimization in high-cost experimental settings.

[364] Exploring Fine-Tuning for Tabular Foundation Models

Aditya Tanna, Pratinav Seth, Mohamed Bouadi, Vinay Kumar Sankarapu

Main category: cs.LG

TL;DR: Zero-shot tabular foundation models already perform well; fine-tuning benefits vary by model/data, with SFT often hurting performance, while meta-learning and PEFT offer limited gains.

DetailsMotivation: To provide the first comprehensive study of fine-tuning in Tabular Foundation Models (TFMs), examining when and how fine-tuning is beneficial versus when it may degrade performance compared to zero-shot approaches.

Method: Comparative study across multiple benchmarks (TALENT, OpenML-CC18, TabZilla) evaluating four approaches: Zero-Shot, Meta-Learning, Supervised Fine-Tuning (SFT), and Parameter-Efficient Fine-Tuning (PEFT), analyzing effects of dataset characteristics like imbalance, size, and dimensionality.

Result: Zero-shot TFMs already achieve strong performance comparable to traditional ML; fine-tuning benefits are model/data-dependent; meta-learning and PEFT provide moderate gains in specific conditions; SFT often reduces accuracy or calibration quality.

Conclusion: Fine-tuning in TFMs should be approached cautiously - zero-shot often suffices, while SFT can be detrimental; practical guidelines are provided for when fine-tuning is beneficial based on dataset characteristics and model type.

Abstract: Tabular Foundation Models (TFMs) have recently shown strong in-context learning capabilities on structured data, achieving zero-shot performance comparable to traditional machine learning methods. We find that zero-shot TFMs already achieve strong performance, while the benefits of fine-tuning are highly model and data-dependent. Meta-learning and PEFT provide moderate gains under specific conditions, whereas full supervised fine-tuning (SFT) often reduces accuracy or calibration quality. This work presents the first comprehensive study of fine-tuning in TFMs across benchmarks including TALENT, OpenML-CC18, and TabZilla. We compare Zero-Shot, Meta-Learning, Supervised (SFT), and parameter-efficient (PEFT) approaches, analyzing how dataset factors such as imbalance, size, and dimensionality affect outcomes. Our findings cover performance, calibration, and fairness, offering practical guidelines on when fine-tuning is most beneficial and its limitations.

[365] Disentangling Task Conflicts in Multi-Task LoRA via Orthogonal Gradient Projection

Ziyu Yang, Guibin Chen, Yuxin Yang, Aoxiong Zeng, Xiangquan Yang

Main category: cs.LG

TL;DR: Ortho-LoRA: A gradient projection method that reduces negative transfer in multi-task LoRA by projecting conflicting task gradients onto orthogonal subspaces within the LoRA structure.

DetailsMotivation: Multi-task learning with LoRA reduces storage overhead by sharing adapters across tasks, but suffers from negative transfer where conflicting gradient updates degrade individual task performance compared to single-task fine-tuning. This problem is worse in LoRA due to low-rank constraints limiting optimization capacity.

Method: Ortho-LoRA dynamically projects conflicting task gradients onto the orthogonal complement of each other within the intrinsic LoRA subspace, specifically designed for LoRA’s bipartite structure.

Result: Extensive experiments on GLUE benchmark show Ortho-LoRA effectively mitigates task interference, outperforms standard joint training, and recovers 95% of the performance gap between multi-task and single-task baselines with negligible computational overhead.

Conclusion: Ortho-LoRA provides an effective solution to negative transfer in multi-task LoRA, enabling parameter-efficient deployment of LLMs across multiple tasks without significant performance degradation.

Abstract: Multi-Task Learning (MTL) combined with Low-Rank Adaptation (LoRA) has emerged as a promising direction for parameter-efficient deployment of Large Language Models (LLMs). By sharing a single adapter across multiple tasks, one can significantly reduce storage overhead. However, this approach suffers from negative transfer, where conflicting gradient updates from distinct tasks degrade the performance of individual tasks compared to single-task fine-tuning. This problem is exacerbated in LoRA due to the low-rank constraint, which limits the optimization landscape’s capacity to accommodate diverse task requirements. In this paper, we propose Ortho-LoRA, a gradient projection method specifically tailored for the bipartite structure of LoRA. Ortho-LoRA dynamically projects conflicting task gradients onto the orthogonal complement of each other within the intrinsic LoRA subspace. Extensive experiments on the GLUE benchmark demonstrate that Ortho-LoRA effectively mitigates task interference, outperforming standard joint training and recovering 95% of the performance gap between multi-task and single-task baselines with negligible computational overhead.
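
The projection step can be illustrated with a PCGrad-style sketch: when two task gradients conflict (negative inner product), remove from each its component along the other. Restricting the projection to the intrinsic LoRA subspace, as Ortho-LoRA does, is not shown here.

```python
import torch

def project_conflicts(grads):
    """PCGrad-style projection; grads is a list of flattened per-task gradients."""
    out = [g.clone() for g in grads]
    for i in range(len(out)):
        for j in range(len(grads)):
            dot = torch.dot(out[i], grads[j])
            if i != j and dot < 0:               # conflicting directions
                out[i] -= dot / grads[j].norm() ** 2 * grads[j]
    return out

g1 = torch.tensor([1.0, 1.0])
g2 = torch.tensor([-1.0, 0.5])                   # conflicts with g1
p1, p2 = project_conflicts([g1, g2])
print(p1, torch.dot(p1, g2))                     # p1 is now orthogonal to g2
```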

[366] Contrastive Geometric Learning Unlocks Unified Structure- and Ligand-Based Drug Design

Lisa Schneckenreiter, Sohvi Luukkonen, Lukas Friedrich, Daniel Kuhn, Günter Klambauer

Main category: cs.LG

TL;DR: ConGLUDe is a unified contrastive geometric model that combines structure-based and ligand-based drug design approaches in a single framework, enabling joint training on both protein-ligand complexes and bioactivity data without requiring pre-defined binding pockets.

DetailsMotivation: Traditional computational drug design methods use separate structure-based and ligand-based approaches with disjoint data sources and modeling assumptions, limiting their joint application at scale. There's a need for unified models that can leverage both types of information simultaneously.

Method: ConGLUDe uses a geometric protein encoder that produces whole-protein representations and implicit embeddings of predicted binding sites, coupled with a fast ligand encoder. It employs contrastive learning to align ligands with both global protein representations and multiple candidate binding sites, eliminating the need for pre-defined pockets.

Result: Achieves state-of-the-art zero-shot virtual screening performance without binding pocket information, substantially outperforms existing methods on target fishing tasks, and demonstrates competitive ligand-conditioned pocket selection across diverse benchmarks.

Conclusion: ConGLUDe demonstrates the advantages of unified structure-ligand training and represents progress toward general-purpose foundation models for drug discovery by effectively combining both structural and ligand-based information in a single framework.

Abstract: Structure-based and ligand-based computational drug design have traditionally relied on disjoint data sources and modeling assumptions, limiting their joint use at scale. In this work, we introduce Contrastive Geometric Learning for Unified Computational Drug Design (ConGLUDe), a single contrastive geometric model that unifies structure- and ligand-based training. ConGLUDe couples a geometric protein encoder that produces whole-protein representations and implicit embeddings of predicted binding sites with a fast ligand encoder, removing the need for pre-defined pockets. By aligning ligands with both global protein representations and multiple candidate binding sites through contrastive learning, ConGLUDe supports ligand-conditioned pocket prediction in addition to virtual screening and target fishing, while being trained jointly on protein-ligand complexes and large-scale bioactivity data. Across diverse benchmarks, ConGLUDe achieves state-of-the-art zero-shot virtual screening performance in settings where no binding pocket information is provided as input, substantially outperforms existing methods on a challenging target fishing task, and demonstrates competitive ligand-conditioned pocket selection. These results highlight the advantages of unified structure-ligand training and position ConGLUDe as a step toward general-purpose foundation models for drug discovery.
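
A minimal sketch of the contrastive alignment step: a symmetric InfoNCE loss over in-batch ligand-protein pairs. The embeddings here are random stand-ins for the two encoders, and the paper's additional alignment against multiple candidate binding-site embeddings is omitted.

```python
import torch
import torch.nn.functional as F

def info_nce(lig_emb, prot_emb, temp=0.07):
    lig = F.normalize(lig_emb, dim=-1)
    prot = F.normalize(prot_emb, dim=-1)
    logits = lig @ prot.T / temp                 # (batch, batch) similarities
    targets = torch.arange(len(lig))             # diagonal pairs are positives
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.T, targets))

lig_emb = torch.randn(16, 128)    # from the fast ligand encoder
prot_emb = torch.randn(16, 128)   # from the geometric protein encoder
print(info_nce(lig_emb, prot_emb).item())
```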

[367] Reinforcement Learning with Exogenous States and Rewards

George Trimponias, Thomas G. Dietterich

Main category: cs.LG

TL;DR: The paper introduces methods to decompose MDPs into endogenous and exogenous components to reduce reward variance and accelerate reinforcement learning.

DetailsMotivation: Exogenous state variables and rewards inject uncontrolled variation into reward signals, which slows down reinforcement learning by increasing variance and making optimization more difficult.

Method: Formalizes exogenous state variables and rewards, shows additive reward decomposition enables MDP separation into exogenous Markov Reward Process and endogenous MDP. Introduces algorithms to discover exogenous/endogenous subspaces when mixed through linear combination, applicable online during RL.

Result: Experiments on challenging synthetic MDPs demonstrate the methods successfully discover large exogenous state spaces and produce substantial speedups in reinforcement learning when applied online.

Conclusion: Decomposing MDPs into endogenous and exogenous components reduces reward variance, making RL easier to solve while preserving optimal policies, with practical algorithms for discovering these decompositions during learning.

Abstract: Exogenous state variables and rewards can slow reinforcement learning by injecting uncontrolled variation into the reward signal. This paper formalizes exogenous state variables and rewards and shows that if the reward function decomposes additively into endogenous and exogenous components, the MDP can be decomposed into an exogenous Markov Reward Process (based on the exogenous reward) and an endogenous Markov Decision Process (optimizing the endogenous reward). Any optimal policy for the endogenous MDP is also an optimal policy for the original MDP, but because the endogenous reward typically has reduced variance, the endogenous MDP is easier to solve. We study settings where the decomposition of the state space into exogenous and endogenous state spaces is not given but must be discovered. The paper introduces and proves correctness of algorithms for discovering the exogenous and endogenous subspaces of the state space when they are mixed through linear combination. These algorithms can be applied during reinforcement learning to discover the exogenous subspace, remove the exogenous reward, and focus reinforcement learning on the endogenous MDP. Experiments on a variety of challenging synthetic MDPs show that these methods, applied online, discover large exogenous state spaces and produce substantial speedups in reinforcement learning.
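
The variance-reduction mechanism can be sketched in a few lines: regress the reward on the exogenous state features and train on the residual. Using plain linear regression and a known exogenous/endogenous split is an illustrative simplification; the paper's algorithms discover the split from data.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(2)
n = 2000
s_endo = rng.normal(size=(n, 2))    # states the agent's actions influence
s_exo = rng.normal(size=(n, 3))     # uncontrolled exogenous states
reward = s_endo[:, 0] + 3.0 * s_exo @ np.array([1.0, -2.0, 0.5])

# Fit the exogenous reward component and subtract it from the training signal.
exo_model = LinearRegression().fit(s_exo, reward)
endo_reward = reward - exo_model.predict(s_exo)

print("raw reward variance:       ", round(float(reward.var()), 2))
print("endogenous reward variance:", round(float(endo_reward.var()), 2))
```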

[368] Tipping Point Forecasting in Non-Stationary Dynamics on Function Spaces

Miguel Liu-Schiaffini, Clare E. Singer, Nikola Kovachki, Sze Chai Leung, Hyunji Jane Bae, Kamyar Azizzadenesheli, Anima Anandkumar

Main category: cs.LG

TL;DR: The paper proposes a recurrent neural operator (RNO) with conformal prediction to forecast tipping points in non-stationary dynamical systems using only pre-tipping data and partial physics constraints.

DetailsMotivation: Tipping points represent abrupt, irreversible changes in complex systems (like climate change), but forecasting them is challenging due to non-stationarity and chaos. Existing methods often require complete knowledge of governing equations or extensive data across tipping events.

Method: Develop a recurrent neural operator (RNO) that learns mappings between function spaces from pre-tipping dynamics. Use conformal prediction to monitor deviations from physics constraints (conserved quantities, PDEs) to detect tipping points with uncertainty quantification.

Result: Successfully applied to Lorenz-63, Kuramoto-Sivashinsky equations, stratocumulus cloud cover tipping, and airfoil wake/stall transitions. The method zero-shot generalizes to forecast multiple tipping points under varying Reynolds numbers and works with partial/approximate physics constraints.

Conclusion: The proposed RNO with conformal prediction framework enables accurate forecasting of tipping points using only pre-tipping data and limited physics knowledge, providing rigorous uncertainty measures and generalizing to unseen conditions.

Abstract: Tipping points are abrupt, drastic, and often irreversible changes in the evolution of non-stationary and chaotic dynamical systems. For instance, increased greenhouse gas concentrations are predicted to lead to drastic decreases in low cloud cover, referred to as a climatological tipping point. In this paper, we learn the evolution of such non-stationary dynamical systems using a novel recurrent neural operator (RNO), which learns mappings between function spaces. After training RNO on only the pre-tipping dynamics, we employ it to detect future tipping points using an uncertainty-based approach. In particular, we propose a conformal prediction framework to forecast tipping points by monitoring deviations from physics constraints (such as conserved quantities and partial differential equations), enabling forecasting of these abrupt changes along with a rigorous measure of uncertainty. We illustrate our proposed methodology on non-stationary ordinary and partial differential equations, such as the Lorenz-63 and Kuramoto-Sivashinsky equations. We also apply our methods to forecast a climate tipping point in stratocumulus cloud cover and airfoil wake and stall transitions using only limited knowledge of the governing equations. For the latter, we show that our proposed method zero-shot generalizes to forecasting multiple future tipping points under varying Reynolds numbers. In our experiments, we demonstrate that even partial or approximate physics constraints can be used to accurately forecast future tipping points.
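
A minimal sketch of the conformal monitoring idea: calibrate a threshold on physics-constraint residuals from pre-tipping data, then flag test times whose residual exceeds the conformal quantile. The residual function and the data are fabricated for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)

def physics_residual(traj):
    # Deviation of a (supposedly) conserved quantity from its initial value.
    energy = (traj ** 2).sum(axis=-1)
    return np.abs(energy - energy[0])

calib = rng.normal(size=(500, 3))                   # pre-tipping dynamics
scores = np.sort(physics_residual(calib))
alpha = 0.05
k = int(np.ceil((len(scores) + 1) * (1 - alpha)))   # conformal quantile index
threshold = scores[min(k, len(scores)) - 1]

# Test sequence with a regime shift (tipping) at t = 200.
test = np.concatenate([rng.normal(size=(200, 3)),
                       3.0 * rng.normal(size=(100, 3))])
alarms = physics_residual(test) > threshold
print("alarm rate before shift:", alarms[:200].mean())
print("alarm rate after shift: ", alarms[200:].mean())
```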

[369] Kernel Limit for a Class of Recurrent Neural Networks Trained on Ergodic Data Sequences

Samuel Chun-Hei Lam, Justin Sirignano, Konstantinos Spiliopoulos

Main category: cs.LG

TL;DR: Develops mathematical methods to characterize the asymptotics of RNNs as network size, sequence length, hidden-state updates, and training steps jointly grow to infinity, proving convergence to an infinite-dimensional ODE coupled with the fixed point of a random algebraic equation.

DetailsMotivation: RNNs cannot be analyzed with standard mean-field techniques because their hidden layer updates are O(1) rather than O(1/N), requiring new mathematical approaches to understand their asymptotic behavior in the infinite-width, infinite-data limit.

Method: Develops fixed point analysis for RNN memory state evolution with convergence estimates, studies hidden layer as function in Sobolev space, uses Poisson equation to bound fluctuations, and establishes neural tangent kernel limits for RNNs.

Result: Proves convergence of RNN with simplified weight matrix to solution of infinite-dimensional ODE coupled with fixed point of random algebraic equation, establishing NTK limits for RNNs trained on data sequences.

Conclusion: Novel mathematical framework overcomes unique challenges of RNN analysis, enabling rigorous asymptotic characterization and NTK limits for recurrent networks in the infinite-width, infinite-data regime.

Abstract: Mathematical methods are developed to characterize the asymptotics of recurrent neural networks (RNN) as the number of hidden units, data samples in the sequence, hidden state updates, and training steps simultaneously grow to infinity. In the case of an RNN with a simplified weight matrix, we prove the convergence of the RNN to the solution of an infinite-dimensional ODE coupled with the fixed point of a random algebraic equation. The analysis requires addressing several challenges which are unique to RNNs. In typical mean-field applications (e.g., feedforward neural networks), discrete updates are of magnitude $\mathcal{O}(1/N)$ and the number of updates is $\mathcal{O}(N)$. Therefore, the system can be represented as an Euler approximation of an appropriate ODE/PDE, which it will converge to as $N \rightarrow \infty$. However, the RNN hidden layer updates are $\mathcal{O}(1)$. Therefore, RNNs cannot be represented as a discretization of an ODE/PDE and standard mean-field techniques cannot be applied. Instead, we develop a fixed point analysis for the evolution of the RNN memory states, with convergence estimates in terms of the number of update steps and the number of hidden units. The RNN hidden layer is studied as a function in a Sobolev space, whose evolution is governed by the data sequence (a Markov chain), the parameter updates, and its dependence on the RNN hidden layer at the previous time step. Due to the strong correlation between updates, a Poisson equation must be used to bound the fluctuations of the RNN around its limit equation. These mathematical methods give rise to the neural tangent kernel (NTK) limits for RNNs trained on data sequences as the number of data samples and size of the neural network grow to infinity.

[370] Soft Contrastive Learning for Time Series

Seunghan Lee, Taeyoung Park, Kibok Lee

Main category: cs.LG

TL;DR: SoftCLT introduces soft contrastive learning for time series with instance-wise and temporal contrastive loss using soft assignments based on data distance and timestamp differences, improving representation quality across various downstream tasks.

DetailsMotivation: Standard contrastive learning for time series ignores inherent correlations between similar instances or adjacent timestamps, which deteriorates representation quality. Hard (binary) positive/negative assignments fail to capture the nuanced relationships in time series data.

Method: SoftCLT uses soft contrastive learning with two components: 1) instance-wise contrastive loss with soft assignments based on distance between time series in data space, and 2) temporal contrastive loss with soft assignments based on timestamp differences. The method is plug-and-play and doesn’t require complex modifications.
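
A toy rendering of the two soft-assignment ideas, assuming a sigmoid-of-distance form for the weights; the paper's exact weighting and loss details may differ.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def instance_soft_assignments(X, tau=0.5):
    """Soft labels in [0, 1] from pairwise distances in data space.
    X: (n_series, length). Closer series get assignments nearer 1."""
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    return 2.0 * sigmoid(-tau * d)          # equals 1 on the diagonal

def temporal_soft_assignments(T, tau=0.3):
    """Soft labels from timestamp differences within one series."""
    t = np.arange(T)
    return 2.0 * sigmoid(-tau * np.abs(t[:, None] - t[None, :]))

X = np.random.randn(8, 50)
W_inst = instance_soft_assignments(X)
W_temp = temporal_soft_assignments(50)
# These weights replace hard 0/1 positive/negative labels in the contrastive
# loss, e.g. loss = -(W * log_softmax(sim)).sum(axis=1).mean() over similarities.
```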

Result: SoftCLT consistently improves performance across various downstream tasks including classification, semi-supervised learning, transfer learning, and anomaly detection, achieving state-of-the-art performance.

Conclusion: SoftCLT is an effective soft contrastive learning strategy for time series that addresses the limitations of hard contrastive learning by incorporating soft assignments, leading to better quality representations without complex modifications.

Abstract: Contrastive learning has been shown to be effective for learning representations from time series in a self-supervised way. However, contrasting similar time series instances, or values from adjacent timestamps within a time series, ignores their inherent correlations, which deteriorates the quality of the learned representations. To address this issue, we propose SoftCLT, a simple yet effective soft contrastive learning strategy for time series. This is achieved by introducing instance-wise and temporal contrastive losses with soft assignments ranging from zero to one. Specifically, we define soft assignments for 1) the instance-wise contrastive loss by the distance between time series in the data space, and 2) the temporal contrastive loss by the difference of timestamps. SoftCLT is a plug-and-play method for time series contrastive learning that improves the quality of learned representations without bells and whistles. In experiments, we demonstrate that SoftCLT consistently improves performance on various downstream tasks including classification, semi-supervised learning, transfer learning, and anomaly detection, achieving state-of-the-art performance. Code is available at this repository: https://github.com/seunghan96/softclt.

[371] Lens: A Knowledge-Guided Foundation Model for Network Traffic

Xiaochang Li, Chen Qian, Qineng Wang, Jiangtao Kong, Yuchen Wang, Ziyu Yao, Bo Ji, Long Cheng, Gang Zhou, Huajie Shao

Main category: cs.LG

TL;DR: Lens is a knowledge-guided foundation model for network traffic analysis that improves both classification and generation tasks by incorporating network knowledge during pretraining and using context-aware finetuning for distribution shifts.

DetailsMotivation: Existing Transformer-based methods for network traffic analysis overlook network knowledge during pretraining and struggle with distribution shifts when extending to new classes during fine-tuning, limiting semantic understanding and adaptability.

Method: Proposes Lens with: 1) Knowledge-Guided Mask Span Prediction with textual context for pretraining to learn knowledge-enriched representations, and 2) reframing classification as closed-ended generation with context-aware finetuning to handle distribution shifts.
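
To make the knowledge-guided masking idea concrete, here is a sketch that masks whole protocol fields (e.g., an indivisible port number) rather than random sub-tokens; the field layout, token format, and probabilities are hypothetical.

```python
import random

# Hypothetical header layout: field name -> (start, end) token positions.
FIELDS = {"src_port": (0, 2), "dst_port": (2, 4), "ttl": (4, 5), "proto": (5, 6)}

def knowledge_guided_mask(tokens, mask_token="[MASK]", p=0.3, seed=0):
    """Mask whole semantic fields, never partial digits of one field."""
    rng = random.Random(seed)
    out = list(tokens)
    for start, end in FIELDS.values():
        if rng.random() < p:                 # mask the field as one span
            out[start:end] = [mask_token] * (end - start)
    return out

packet = ["0x01", "0xBB", "0x00", "0x50", "64", "TCP"]
print(knowledge_guided_mask(packet))
```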

Result: Achieves 96.33% average accuracy on traffic classification, outperforming baselines on 8 of 12 tasks, and extends to novel classes with significantly better performance. For traffic generation, it gains up to 30.46% in accuracy and 33.3% in F1 in fuzzing tests.

Conclusion: Lens demonstrates superior performance for both network traffic classification and generation by incorporating network knowledge during pretraining and using context-aware adaptation for distribution shifts, offering a unified foundation model approach.

Abstract: Network traffic refers to the amount of data being sent and received over the Internet or any system that connects computers. Analyzing network traffic is vital for security and management, yet remains challenging due to the heterogeneity of plain-text packet headers and encrypted payloads. To capture the latent semantics of traffic, recent studies have adopted Transformer-based pretraining techniques to learn network representations from massive traffic data. However, these methods pre-train on data-driven tasks but overlook network knowledge, such as masking partial digits of the indivisible network port numbers for prediction, thereby limiting semantic understanding. In addition, they struggle to extend classification to new classes during fine-tuning due to the distribution shift. Motivated by these limitations, we propose Lens, a unified knowledge-guided foundation model for both network traffic classification and generation. In pretraining, we propose a Knowledge-Guided Mask Span Prediction method with textual context for learning knowledge-enriched representations. For extending to new classes in finetuning, we reframe the traffic classification as a closed-ended generation task and introduce context-aware finetuning to adapt to the distribution shift. Evaluation results across various benchmark datasets demonstrate that the proposed Lens achieves superior performance on both classification and generation tasks. For traffic classification, Lens outperforms competitive baselines substantially on 8 out of 12 tasks with an average accuracy of 96.33% and extends to novel classes with significantly better performance. For traffic generation, Lens generates better high-fidelity network traffic for network simulation, gaining up to 30.46% and 33.3% better accuracy and F1 in fuzzing tests. We will open-source the code upon publication.

[372] Differentially Private Bilevel Optimization

Guy Kornowski

Main category: cs.LG

TL;DR: First DP algorithms for bilevel optimization that avoid Hessian computations, achieving privacy-preserving hyperparameter tuning with gradient-based methods.

DetailsMotivation: Bilevel optimization is important for ML applications (like hyperparameter tuning), but existing methods lack DP guarantees and require expensive Hessian computations that don't scale well.

Method: Proposed gradient-based (ε,δ)-DP algorithm for bilevel optimization with non-convex upper-level and strongly-convex lower-level problems. Uses DP gradient methods without Hessian computations, works for constrained/unconstrained problems, and handles mini-batch gradients.
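
The paper's algorithm is not reproduced here, but its generic building block is the Gaussian mechanism applied to a clipped gradient. Below is a sketch of one such noisy first-order step; the noise scale sigma must be calibrated to the target (ε, δ) by standard accounting, which this sketch omits.

```python
import numpy as np

def dp_gradient_step(theta, grad, lr=0.1, clip=1.0, sigma=2.0, rng=None):
    """One noisy gradient step via the standard Gaussian mechanism:
    clip the gradient to bound sensitivity, then add calibrated noise."""
    rng = rng or np.random.default_rng()
    norm = np.linalg.norm(grad)
    g = grad / max(1.0, norm / clip)                 # sensitivity <= clip
    g = g + rng.normal(0.0, sigma * clip, size=g.shape)
    return theta - lr * g
```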

Result: Achieves hypergradient norm bound of $\widetilde{\mathcal{O}}\left((\sqrt{d_\mathrm{up}}/εn)^{1/2}+(\sqrt{d_\mathrm{low}}/εn)^{1/3}\right)$ where n is dataset size, d_up/d_low are dimensions. First DP algorithms for bilevel optimization with practical applications like hyperparameter tuning.

Conclusion: First differentially private algorithms for bilevel optimization that avoid Hessian computations, enabling privacy-preserving hyperparameter tuning and other ML applications with theoretical guarantees.

Abstract: We present differentially private (DP) algorithms for bilevel optimization, a problem class that received significant attention lately in various machine learning applications. These are the first algorithms for such problems under standard DP constraints, and are also the first to avoid Hessian computations which are prohibitive in large-scale settings. Under the well-studied setting in which the upper-level is not necessarily convex and the lower-level problem is strongly-convex, our proposed gradient-based $(ε,δ)$-DP algorithm returns a point with hypergradient norm at most $\widetilde{\mathcal{O}}\left((\sqrt{d_\mathrm{up}}/εn)^{1/2}+(\sqrt{d_\mathrm{low}}/εn)^{1/3}\right)$ where $n$ is the dataset size, and $d_\mathrm{up}/d_\mathrm{low}$ are the upper/lower level dimensions. Our analysis covers constrained and unconstrained problems alike, accounts for mini-batch gradients, and applies to both empirical and population losses. As an application, we specialize our analysis to derive a simple private rule for tuning a regularization hyperparameter.

[373] DNN Modularization via Activation-Driven Training

Tuan Ngo, Abid Hassan, Saad Shafiq, Nenad Medvidovic

Main category: cs.LG

TL;DR: MODA is an activation-driven modular training approach that decomposes DNNs into reusable modules with less training time, fewer weights, and minimal accuracy loss compared to existing methods.

DetailsMotivation: DNNs accumulate technical debt and have high retraining costs when adapting to new requirements. Existing modularization techniques have issues like weight overlaps, accuracy losses, limited scope to convolutional layers, and increased training complexity.

Method: MODA promotes inherent modularity by directly regulating layer activation outputs based on three objectives: intra-class affinity (similar activations within same class), inter-class dispersion (different activations across classes), and compactness (efficient module structure).
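
A toy version of the three activation-level objectives, with made-up loss forms and weights that only illustrate the intent: pull same-class activations together, push classes apart, and keep activations compact.

```python
import torch
import torch.nn.functional as F

def modularity_losses(acts, labels):
    """Toy MODA-style objectives on one layer's activations.
    acts: (batch, units); labels: (batch,) integer classes."""
    a = F.normalize(acts, dim=1)
    sim = a @ a.t()                                  # cosine similarities
    same = (labels[:, None] == labels[None, :]).float()
    eye = torch.eye(len(labels), device=acts.device)
    affinity = -(sim * (same - eye)).sum() / (same - eye).sum().clamp(min=1)
    dispersion = (sim * (1 - same)).clamp(min=0).sum() / (1 - same).sum().clamp(min=1)
    compactness = acts.abs().mean()                  # sparse, compact modules
    return affinity, dispersion, compactness

acts = torch.randn(32, 64)
labels = torch.randint(0, 5, (32,))
l_aff, l_disp, l_comp = modularity_losses(acts, labels)
loss = l_aff + l_disp + 0.1 * l_comp                 # weights are illustrative
```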

Result: MODA achieves 22% less training time, modules with 24x fewer weights and 37x less weight overlap, preserves original model accuracy without fine-tuning, and improves target class accuracy by 12% in module replacement scenarios with minimal impact on other classes.

Conclusion: MODA provides an effective activation-driven approach for modular DNN training that addresses limitations of previous methods, offering better efficiency, reduced complexity, and improved reusability while maintaining accuracy.

Abstract: Deep Neural Networks (DNNs) tend to accrue technical debt and suffer from significant retraining costs when adapting to evolving requirements. Modularizing DNNs offers the promise of improving their reusability. Previous work has proposed techniques to decompose DNN models into modules both during and after training. However, these strategies yield several shortcomings, including significant weight overlaps and accuracy losses across modules, restricted focus on convolutional layers only, and added complexity and training time by introducing auxiliary masks to control modularity. In this work, we propose MODA, an activation-driven modular training approach. MODA promotes inherent modularity within a DNN model by directly regulating the activation outputs of its layers based on three modular objectives: intra-class affinity, inter-class dispersion, and compactness. MODA is evaluated using three well-known DNN models and five datasets with varying sizes. This evaluation indicates that, compared to the existing state-of-the-art, using MODA yields several advantages: (1) MODA accomplishes modularization with 22% less training time; (2) the resultant modules generated by MODA comprise up to 24x fewer weights and 37x less weight overlap while (3) preserving the original model’s accuracy without additional fine-tuning; in module replacement scenarios, (4) MODA improves the accuracy of a target class by 12% on average while ensuring minimal impact on the accuracy of other classes.

[374] Graph Neural Network Surrogates to leverage Mechanistic Expert Knowledge towards Reliable and Immediate Pandemic Response

Agatha Schmidt, Henrik Zunker, Alexander Heinlein, Martin J. Kühn

Main category: cs.LG

TL;DR: GNN surrogate model accelerates pandemic simulation 28,670x for real-time decision support, achieving 10-27% MAPE across 30-90 day forecasts with multiple contact change points.

DetailsMotivation: Time-critical pandemic decisions require rapid evidence-based modeling, but traditional mechanistic models are too slow for real-time decision support during dynamic outbreaks.

Method: Developed graph neural network (GNN) surrogate for age-structured spatial metapopulation model, tested ARMAConv-based architecture across outbreak regimes with up to 3 contact change points on 400-node German county graph.
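
A minimal PyTorch Geometric sketch of such a surrogate, assuming node features per county and a fixed mobility graph; depths and widths are placeholders, not the paper's configuration.

```python
import torch
from torch_geometric.nn import ARMAConv

class EpiSurrogate(torch.nn.Module):
    """Maps per-county features (e.g., compartment states by age group)
    to a forecast via ARMA graph convolutions on the county graph."""
    def __init__(self, in_dim, hidden=64, out_dim=8):
        super().__init__()
        self.conv1 = ARMAConv(in_dim, hidden, num_stacks=2, num_layers=2)
        self.conv2 = ARMAConv(hidden, out_dim, num_stacks=2, num_layers=2)

    def forward(self, x, edge_index):
        h = torch.relu(self.conv1(x, edge_index))
        return self.conv2(h, edge_index)

# x: (400, in_dim) node features for 400 counties;
# edge_index: (2, E) edges of the spatial/mobility graph.
```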

Result: Achieved 10-27% MAPE across 30-90 day forecasts, near-constant runtime regardless of horizon, and 28,670x speedup over mechanistic model while maintaining interpretability.

Conclusion: GNN surrogates enable translation of complex epidemiological models into immediate, reliable tools for time-critical pandemic response with web integration potential.

Abstract: During the COVID-19 crisis, mechanistic models have guided evidence-based decision making. However, time-critical decisions in a dynamical environment limit the time available to gather supporting evidence. We address this bottleneck by developing a graph neural network (GNN) surrogate of an age-structured and spatially resolved mechanistic metapopulation simulation model. This combined approach complements classical modeling approaches, which are mostly mechanistic, and purely data-driven machine learning approaches, which are often black box. Our design of experiments spans outbreak and persistent-threat regimes, up to three contact change points, and age-structured contact matrices on a spatial graph with 400 nodes representing German counties. We benchmark multiple GNN layers and identify an ARMAConv-based architecture that offers a strong accuracy-runtime trade-off. Across simulation and prediction horizons of 30-90 days, allowing up to three contact change points, the surrogate model attains 10-27% mean absolute percentage error (MAPE) while delivering (near) constant runtime with respect to the forecast horizon. Our approach accelerates evaluation by up to 28,670 times compared with the mechanistic model, allowing responsive decision support in time-critical scenarios and straightforward web integration. These results show how GNN surrogates can translate complex metapopulation models into immediate, reliable tools for pandemic response.

[375] Benchmarking Positional Encodings for GNNs and Graph Transformers

Florian Grötschla, Jiaqing Xie, Roger Wattenhofer

Main category: cs.LG

TL;DR: Positional encodings (PEs) in graph neural networks lack empirical understanding; a benchmarking framework shows expressiveness proxies don’t predict performance, and simple overlooked PE-model combos can outperform SOTA.

DetailsMotivation: The empirical impact of positional encodings in Graph Neural Networks and Graph Transformers is poorly understood despite their importance for injecting structural information. There's a need to systematically evaluate PEs across different models and datasets to understand what actually works in practice.

Method: Created a unified benchmarking framework that decouples PEs from architectural choices, enabling fair comparison across 8 GNN/Transformer models, 9 different PEs, and 10 synthetic/real-world datasets. Tested over 500 model-PE-dataset configurations.
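
As one example of a commonly used encoding that a benchmark like this would cover, here is a sketch computing Laplacian-eigenvector positional encodings for a small graph.

```python
import numpy as np

def laplacian_pe(adj, k=4):
    """First k non-trivial eigenvectors of the normalized Laplacian,
    a common positional encoding appended to node features."""
    deg = adj.sum(axis=1)
    d_inv_sqrt = np.where(deg > 0, deg, 1.0) ** -0.5
    lap = np.eye(len(adj)) - d_inv_sqrt[:, None] * adj * d_inv_sqrt[None, :]
    vals, vecs = np.linalg.eigh(lap)
    return vecs[:, 1:k + 1]          # skip the trivial constant eigenvector

# 6-cycle example
n = 6
adj = np.zeros((n, n))
for i in range(n):
    adj[i, (i + 1) % n] = adj[(i + 1) % n, i] = 1.0
pe = laplacian_pe(adj, k=2)          # (6, 2) positional features
```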

Result: Common expressiveness proxies (like Weisfeiler-Lehman distinguishability) don’t reliably predict downstream performance. Highly expressive PEs often fail to improve or even degrade real-world task performance. Identified simple, previously overlooked model-PE combinations that match or outperform recent state-of-the-art methods.

Conclusion: PE effectiveness is strongly task-dependent, requiring empirical validation beyond theoretical expressiveness. The authors release an open-source benchmarking framework to support reproducible research in evaluating PEs for graph learning tasks.

Abstract: Positional Encodings (PEs) are essential for injecting structural information into Graph Neural Networks (GNNs), particularly Graph Transformers, yet their empirical impact remains insufficiently understood. We introduce a unified benchmarking framework that decouples PEs from architectural choices, enabling a fair comparison across 8 GNN and Transformer models, 9 PEs, and 10 synthetic and real-world datasets. Across more than 500 model-PE-dataset configurations, we find that commonly used expressiveness proxies, including Weisfeiler-Lehman distinguishability, do not reliably predict downstream performance. In particular, highly expressive PEs frequently fail to improve, and can even degrade performance on real-world tasks. At the same time, we identify several simple and previously overlooked model-PE combinations that match or outperform recent state-of-the-art methods. Our results demonstrate the strong task-dependence of PEs and underscore the need for empirical validation beyond theoretical expressiveness. To support reproducible research, we release an open-source benchmarking framework for evaluating PEs for graph learning tasks.

[376] STORM: A Spatio-Temporal Factor Model Based on Dual Vector Quantized Variational Autoencoders for Financial Trading

Yilei Zhao, Wentao Zhang, Tingran Yang, Yong Jiang, Fei Huang, Wei Yang Bryan Lim

Main category: cs.LG

TL;DR: STORM is a spatio-temporal factor model using dual vector quantized variational autoencoders to extract and fuse temporal and spatial stock features as multi-dimensional embeddings, improving factor quality and diversity for financial trading tasks.

DetailsMotivation: Current variational autoencoder-based latent factor models focus on overall market conditions but fail to capture individual stock temporal patterns. Representing factors as single values limits ability to capture complex relationships, resulting in low-quality, non-diverse factors that reduce effectiveness across trading periods.

Method: STORM uses dual vector quantized variational autoencoders to extract stock features from both temporal and spatial perspectives, then fuses and aligns these features at fine-grained and semantic levels. Factors are represented as multi-dimensional embeddings, with discrete codebooks clustering similar factor embeddings to ensure orthogonality and diversity.
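
At the core of any VQ-VAE, including STORM's dual design, is nearest-codeword quantization with a straight-through gradient. A minimal sketch follows; the dual temporal/spatial encoders and the commitment/codebook losses are omitted.

```python
import torch

def vector_quantize(z, codebook):
    """Nearest-codeword lookup with a straight-through gradient.
    z: (batch, d) encoder outputs; codebook: (K, d) codewords."""
    d2 = (z.pow(2).sum(1, keepdim=True)
          - 2 * z @ codebook.t()
          + codebook.pow(2).sum(1))            # squared distances (batch, K)
    idx = d2.argmin(dim=1)
    z_q = codebook[idx]
    z_q = z + (z_q - z).detach()               # straight-through estimator
    return z_q, idx

codebook = torch.randn(128, 16, requires_grad=True)
z = torch.randn(32, 16, requires_grad=True)
z_q, idx = vector_quantize(z, codebook)        # idx clusters similar factors
```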

Result: Extensive experiments on portfolio management (two stock datasets) and individual trading tasks (six specific stocks) demonstrate STORM’s flexibility in adapting to downstream tasks and superior performance over baseline models.

Conclusion: STORM addresses limitations of existing factor models by capturing both temporal and spatial patterns through multi-dimensional embeddings, improving factor quality and diversity for more effective and robust financial trading applications.

Abstract: In financial trading, factor models are widely used to price assets and capture excess returns from mispricing. Recently, we have witnessed the rise of variational autoencoder-based latent factor models, which learn latent factors self-adaptively. While these models focus on modeling overall market conditions, they often fail to effectively capture the temporal patterns of individual stocks. Additionally, representing multiple factors as single values simplifies the model but limits its ability to capture complex relationships and dependencies. As a result, the learned factors are of low quality and lack diversity, reducing their effectiveness and robustness across different trading periods. To address these issues, we propose a Spatio-Temporal factOR Model based on dual vector quantized variational autoencoders, named STORM, which extracts features of stocks from temporal and spatial perspectives, then fuses and aligns these features at the fine-grained and semantic level, and represents the factors as multi-dimensional embeddings. The discrete codebooks cluster similar factor embeddings, ensuring orthogonality and diversity, which helps distinguish between different factors and enables factor selection in financial trading. To show the performance of the proposed factor model, we apply it to two downstream experiments: portfolio management on two stock datasets and individual trading tasks on six specific stocks. The extensive experiments demonstrate STORM’s flexibility in adapting to downstream tasks and superior performance over baseline models.

[377] Bayesian Optimization with Preference Exploration using a Monotonic Neural Network Ensemble

Hanyang Wang, Juergen Branke, Matthias Poloczek

Main category: cs.LG

TL;DR: Proposes neural network ensemble for Bayesian Optimization with Preference Exploration (BOPE) that leverages monotonicity in utility functions to focus search on relevant Pareto-optimal subsets.

DetailsMotivation: Many real-world black-box optimization problems have multiple conflicting objectives, and while interactive preference learning can focus search on relevant subsets, previous approaches haven't sufficiently exploited the monotonic nature of utility functions.

Method: Uses neural network ensemble as utility surrogate model that naturally integrates monotonicity constraints and supports pairwise comparison data for Bayesian Optimization with Preference Exploration.
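
One plausible construction, assuming monotonicity is enforced by constraining weights to be non-negative and preferences are fit with a Bradley-Terry style loss; the paper's actual architecture may differ.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MonotonicUtility(nn.Module):
    """Utility network made monotonically increasing in each objective
    by constraining all linear weights to be non-negative (via softplus)."""
    def __init__(self, n_obj, hidden=32):
        super().__init__()
        self.w1 = nn.Parameter(torch.randn(n_obj, hidden) * 0.1)
        self.b1 = nn.Parameter(torch.zeros(hidden))
        self.w2 = nn.Parameter(torch.randn(hidden, 1) * 0.1)

    def forward(self, y):
        h = torch.tanh(y @ F.softplus(self.w1) + self.b1)
        return h @ F.softplus(self.w2)

def preference_loss(model, y_win, y_lose):
    """Bradley-Terry style loss on pairwise comparisons."""
    margin = model(y_win) - model(y_lose)
    return F.softplus(-margin).mean()          # = -log sigmoid(margin)

# An ensemble of such networks provides the uncertainty the surrogate needs.
ensemble = [MonotonicUtility(n_obj=3) for _ in range(5)]
```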

Result: Outperforms state-of-the-art approaches and exhibits robustness to noise in utility evaluations. Ablation study confirms monotonicity’s critical role in enhancing performance.

Conclusion: The proposed neural network ensemble approach effectively leverages monotonicity in utility functions for BOPE, providing superior performance and noise robustness compared to existing methods.

Abstract: Many real-world black-box optimization problems have multiple conflicting objectives. Rather than attempting to approximate the entire set of Pareto-optimal solutions, interactive preference learning makes it possible to focus the search on the most relevant subset. However, few previous studies have exploited the fact that utility functions are usually monotonic. In this paper, we address the Bayesian Optimization with Preference Exploration (BOPE) problem and propose using a neural network ensemble as a utility surrogate model. This approach naturally integrates monotonicity and supports pairwise comparison data. Our experiments demonstrate that the proposed method outperforms state-of-the-art approaches and exhibits robustness to noise in utility evaluations. An ablation study highlights the critical role of monotonicity in enhancing performance.

[378] Shortcuts and Identifiability in Concept-based Models from a Neuro-Symbolic Lens

Samuele Bortolotti, Emanuele Marconato, Paolo Morettin, Andrea Passerini, Stefano Teso

Main category: cs.LG

TL;DR: Concept-based Models often fail to produce interpretable concepts due to reasoning shortcuts, even with mitigation strategies.

DetailsMotivation: Concept-based Models aim to provide interpretable AI by mapping inputs to high-level concepts, but they often suffer from reasoning shortcuts where models learn low-quality concepts while still achieving high accuracy, undermining interpretability and reliability.

Method: Established a novel connection between Concept-based Models and reasoning shortcuts, extended RSs to this complex setting, derived theoretical conditions for identifying both concepts and inference layer, and empirically tested existing methods with multiple mitigation strategies.

Result: Empirical results show that reasoning shortcuts significantly impact Concept-based Models, and existing methods often fail to meet the theoretical conditions for producing interpretable concepts, even when combined with multiple natural mitigation strategies.

Conclusion: Current Concept-based Models frequently fail to produce truly interpretable concepts due to reasoning shortcuts, highlighting the need for better approaches that satisfy theoretical conditions for reliable concept learning and inference.

Abstract: Concept-based Models are neural networks that learn a concept extractor to map inputs to high-level concepts and an inference layer to translate these into predictions. Ensuring that these modules produce interpretable concepts and behave reliably out-of-distribution is crucial, yet the conditions for achieving this remain unclear. We study this problem by establishing a novel connection between Concept-based Models and reasoning shortcuts (RSs), a common issue where models achieve high accuracy by learning low-quality concepts, even when the inference layer is fixed and provided upfront. Specifically, we extend RSs to the more complex setting of Concept-based Models and derive theoretical conditions for identifying both the concepts and the inference layer. Our empirical results highlight the impact of RSs and show that existing methods, even combined with multiple natural mitigation strategies, often fail to meet these conditions in practice.

[379] Exploiting Task Relationships in Continual Learning via Transferability-Aware Task Embeddings

Yanru Wu, Jianning Wang, Xiangyu Chen, Enming Zhang, Yang Tan, Hanbing Liu, Yang Li

Main category: cs.LG

TL;DR: Proposes H-embedding, a transferability-aware task embedding derived from information theory, used in a hypernet framework for continual learning to enhance forward/backward transfer by capturing inter-task relationships.

DetailsMotivation: Existing continual learning strategies focus on task models through regularization or component separation, but overlook leveraging inter-task relationships to enhance transfer. There's a gap in using transferability information between tasks to improve CL performance.

Method: Develops H-embedding, an online computable task embedding based on information theoretic transferability measure. Uses this embedding to guide a hypernetwork framework that learns task-conditioned model weights. Stores only low-dimensional embeddings per task and supports efficient end-to-end training.
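
A minimal hypernetwork sketch: a low-dimensional task embedding generates the weights of a target layer, so only the embedding needs to be stored per task. Dimensions and the generator architecture are placeholders, and the transferability-based computation of the H-embedding itself is not shown.

```python
import torch
import torch.nn as nn

class HyperNet(nn.Module):
    """Generates the weights of a small target layer from a task embedding,
    so each task's model is conditioned on its (H-)embedding."""
    def __init__(self, emb_dim, in_dim, out_dim):
        super().__init__()
        self.in_dim, self.out_dim = in_dim, out_dim
        self.gen = nn.Sequential(
            nn.Linear(emb_dim, 128), nn.ReLU(),
            nn.Linear(128, in_dim * out_dim + out_dim))

    def forward(self, task_emb, x):
        p = self.gen(task_emb)
        W = p[: self.in_dim * self.out_dim].view(self.out_dim, self.in_dim)
        b = p[self.in_dim * self.out_dim:]
        return x @ W.t() + b

hn = HyperNet(emb_dim=16, in_dim=32, out_dim=10)
task_emb = torch.randn(16)     # one low-dimensional embedding stored per task
out = hn(task_emb, torch.randn(4, 32))
```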

Result: Extensive evaluations on CIFAR-100, ImageNet-R, and DomainNet benchmarks show prominent performance compared to various baseline and state-of-the-art approaches, demonstrating strong potential in capturing and utilizing intrinsic task relationships.

Conclusion: The proposed H-embedding guided hypernet framework effectively addresses the gap in leveraging inter-task relationships for continual learning, achieving strong transfer performance with practical storage and training efficiency.

Abstract: Continual learning (CL) has been a critical topic in contemporary deep neural network applications, where higher levels of both forward and backward transfer are desirable for an effective CL performance. Existing CL strategies primarily focus on task models, either by regularizing model updates or by separating task-specific and shared components, while often overlooking the potential of leveraging inter-task relationships to enhance transfer. To address this gap, we propose a transferability-aware task embedding, termed H-embedding, and construct a hypernet framework under its guidance to learn task-conditioned model weights for CL tasks. Specifically, H-embedding is derived from an information theoretic measure of transferability and is designed to be online and easy to compute. Our method is also characterized by notable practicality, requiring only the storage of a low-dimensional task embedding per task and supporting efficient end-to-end training. Extensive evaluations on benchmarks including CIFAR-100, ImageNet-R, and DomainNet show that our framework performs prominently compared to various baseline and SOTA approaches, demonstrating strong potential in capturing and utilizing intrinsic task relationships. Our code is publicly available at https://github.com/viki760/Hembedding_Guided_Hypernet.

[380] Training Large Neural Networks With Low-Dimensional Error Feedback

Maher Hanut, Jonathan Kadmon

Main category: cs.LG

TL;DR: Low-dimensional error signals can effectively train deep neural networks, matching backpropagation performance while being more biologically plausible and computationally efficient.

DetailsMotivation: Backpropagation relies on high-dimensional error signals that are computationally intensive and lack biological evidence. Since most tasks have low-dimensional outputs, the authors hypothesize that low-dimensional error signals may suffice for effective learning, offering a more biologically plausible alternative.

Method: Introduces a novel local learning rule based on Feedback Alignment that uses indirect, low-dimensional error feedback. The method decouples backward from forward passes, allowing control over error signal dimensionality while maintaining high-dimensional representations. Theoretical derivation for linear networks forms the foundation, extended to nonlinear, convolutional, and transformer architectures.
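
A bare-bones feedback-alignment sketch in which the error fed back is only k-dimensional and flows through a fixed random matrix B rather than the transpose of the forward weights; sizes and scales are illustrative, not the paper's setup.

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_hid, n_out = 20, 100, 5
k = n_out                                 # error dimension ~ task dimension

W1 = rng.normal(0, 0.1, (n_hid, n_in))    # forward weights (trained)
W2 = rng.normal(0, 0.1, (n_out, n_hid))
B = rng.normal(0, 0.1, (n_hid, k))        # fixed random feedback, never trained

def step(x, y, lr=0.01):
    h = np.tanh(W1 @ x)
    y_hat = W2 @ h
    e = y_hat - y                          # k-dimensional error signal
    delta_h = (B @ e) * (1 - h**2)         # feedback through fixed B, not W2.T
    W1[...] -= lr * np.outer(delta_h, x)
    W2[...] -= lr * np.outer(e, h)
    return 0.5 * float(e @ e)

x, y = rng.normal(size=n_in), rng.normal(size=n_out)
for _ in range(100):
    loss = step(x, y)
print("loss:", loss)
```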

Result: Remarkably, even minimal error dimensionality (on the order of task dimensionality) achieves performance matching traditional backpropagation. The method enables efficient training of convolutional networks previously resistant to Feedback Alignment, with minimal error.

Conclusion: Low-dimensional error signals can be as effective as high-dimensional ones, challenging conventional reliance on high-dimensional gradients. This breakthrough offers a fresh perspective on neural network optimization and contributes to understanding learning mechanisms in both artificial and biological systems.

Abstract: Training deep neural networks typically relies on backpropagating high dimensional error signals a computationally intensive process with little evidence supporting its implementation in the brain. However, since most tasks involve low-dimensional outputs, we propose that low-dimensional error signals may suffice for effective learning. To test this hypothesis, we introduce a novel local learning rule based on Feedback Alignment that leverages indirect, low-dimensional error feedback to train large networks. Our method decouples the backward pass from the forward pass, enabling precise control over error signal dimensionality while maintaining high-dimensional representations. We begin with a detailed theoretical derivation for linear networks, which forms the foundation of our learning framework, and extend our approach to nonlinear, convolutional, and transformer architectures. Remarkably, we demonstrate that even minimal error dimensionality on the order of the task dimensionality can achieve performance matching that of traditional backpropagation. Furthermore, our rule enables efficient training of convolutional networks, which have previously been resistant to Feedback Alignment methods, with minimal error. This breakthrough not only paves the way toward more biologically accurate models of learning but also challenges the conventional reliance on high-dimensional gradient signals in neural network training. Our findings suggest that low-dimensional error signals can be as effective as high-dimensional ones, prompting a reevaluation of gradient-based learning in high-dimensional systems. Ultimately, our work offers a fresh perspective on neural network optimization and contributes to understanding learning mechanisms in both artificial and biological systems.

[381] KnowEEG: Explainable Knowledge Driven EEG Classification

Amarpal Sahota, Navid Mohammadi Foumani, Raul Santos-Rodriguez, Zahraa S. Abdallah

Main category: cs.LG

TL;DR: KnowEEG is an explainable machine learning approach for EEG classification that combines per-electrode features with connectivity statistics, achieving state-of-the-art performance while providing inherent explainability through feature importance scores.

DetailsMotivation: While deep learning has improved EEG classification performance, model explainability remains a critical limitation. There's a need for interpretable models in EEG applications, especially in healthcare domains where understanding model decisions is crucial.

Method: KnowEEG extracts comprehensive per-electrode features, filters them using statistical tests, and integrates between-electrode connectivity statistics. These features are input to a modified Random Forest model (Fusion Forest) that balances per-electrode statistics with connectivity features when growing trees.
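
A simplified pipeline in the same spirit, with a plain random forest standing in for the paper's Fusion Forest, a few arbitrary per-electrode statistics, and correlation-based connectivity.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def eeg_features(X):
    """Per-electrode statistics plus between-electrode connectivity.
    X: (trials, electrodes, samples)."""
    per_elec = np.concatenate([X.mean(-1), X.std(-1),
                               np.abs(np.diff(X, axis=-1)).mean(-1)], axis=1)
    n = X.shape[1]
    iu = np.triu_indices(n, k=1)
    conn = np.stack([np.corrcoef(trial)[iu] for trial in X])  # connectivity
    return np.concatenate([per_elec, conn], axis=1)

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 8, 256))          # toy: 40 trials, 8 electrodes
y = rng.integers(0, 2, size=40)
clf = RandomForestClassifier(n_estimators=100).fit(eeg_features(X), y)
print(clf.feature_importances_[:5])        # the explainability handle
```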

Result: KnowEEG achieves performance comparable to or exceeding state-of-the-art deep learning models across five classification tasks: emotion detection, mental workload classification, eyes open/closed detection, abnormal EEG classification, and event detection.

Conclusion: KnowEEG provides both high performance and inherent explainability through feature importance scores, enabling knowledge discovery about EEG classes. The approach’s discovered knowledge for eyes open/closed classification aligns with current neuroscience literature, making it valuable for healthcare applications where explainability is critical.

Abstract: Electroencephalography (EEG) is a method of recording brain activity that shows significant promise in applications ranging from disease classification to emotion detection and brain-computer interfaces. Recent advances in deep learning have improved EEG classification performance, yet model explainability remains an issue. To address this key limitation, we introduce KnowEEG, a novel explainable machine learning approach for EEG classification. KnowEEG extracts a comprehensive set of per-electrode features, filters them using statistical tests, and integrates between-electrode connectivity statistics. These features are then input to our modified Random Forest model (Fusion Forest) that balances per-electrode statistics with between-electrode connectivity features in growing the trees of the forest. By incorporating knowledge from both the generalized time-series and EEG-specific domains, KnowEEG achieves performance comparable to or exceeding state-of-the-art deep learning models across five different classification tasks: emotion detection, mental workload classification, eyes open/closed detection, abnormal EEG classification, and event detection. In addition to high performance, KnowEEG provides inherent explainability through feature importance scores for understandable features. We demonstrate by example on the eyes closed/open classification task that this explainability can be used to discover knowledge about the classes. The knowledge discovered for eyes open/closed classification was confirmed to be correct by current neuroscience literature. Therefore, the impact of KnowEEG will be significant for domains where EEG explainability is critical, such as healthcare.

[382] Integration Matters for Learning PDEs with Backward SDEs

Sungje Park, Stephen Tu

Main category: cs.LG

TL;DR: BSDE-based deep learning methods for solving high-dimensional PDEs underperform PINNs due to discretization bias from Euler-Maruyama integration. Proposed Stratonovich-based BSDE with Heun integration eliminates bias and achieves competitive performance with PINNs.

DetailsMotivation: BSDE-based deep learning methods offer advantages for solving high-dimensional PDEs in stochastic optimal control, but empirically underperform compared to PINNs. The paper aims to identify and fix the root cause of this performance gap.

Method: Identifies discretization bias from Euler-Maruyama integration in standard BSDE solvers as the root cause. Proposes a Stratonovich-based BSDE formulation implemented with stochastic Heun integration to eliminate the bias.
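
The two integration schemes at issue, sketched for a scalar SDE. The toy process is Ornstein-Uhlenbeck; with constant diffusion the Itô and Stratonovich forms coincide, so this only illustrates the mechanics of each step, not the paper's BSDE loss.

```python
import numpy as np

def em_step(x, t, drift, diffusion, dt, dW):
    """Euler-Maruyama: the scheme whose bias the paper identifies."""
    return x + drift(x, t) * dt + diffusion(x, t) * dW

def heun_step(x, t, drift, diffusion, dt, dW):
    """Stochastic Heun: predictor-corrector, consistent with the
    Stratonovich interpretation used in the proposed BSDE formulation."""
    x_pred = x + drift(x, t) * dt + diffusion(x, t) * dW
    drift_avg = 0.5 * (drift(x, t) + drift(x_pred, t + dt))
    diff_avg = 0.5 * (diffusion(x, t) + diffusion(x_pred, t + dt))
    return x + drift_avg * dt + diff_avg * dW

rng = np.random.default_rng(0)
drift = lambda x, t: -x                     # Ornstein-Uhlenbeck toy SDE
diffusion = lambda x, t: 0.5
x_em = x_heun = 1.0
dt = 0.01
for i in range(1000):
    dW = rng.normal(0.0, np.sqrt(dt))
    x_em = em_step(x_em, i * dt, drift, diffusion, dt, dW)
    x_heun = heun_step(x_heun, i * dt, drift, diffusion, dt, dW)
```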

Result: The proposed Heun-based BSDE method completely eliminates bias issues faced by EM integration, consistently outperforms EM-based variants, and achieves competitive results with PINNs across multiple high-dimensional benchmarks.

Conclusion: Integration schemes play a critical role in BSDE-based PDE solvers, a previously overlooked algorithmic detail. The proposed Stratonovich-based BSDE with Heun integration effectively addresses the performance gap with PINNs.

Abstract: Backward stochastic differential equation (BSDE)-based deep learning methods provide an alternative to Physics-Informed Neural Networks (PINNs) for solving high-dimensional partial differential equations (PDEs), offering potential algorithmic advantages in settings such as stochastic optimal control, where the PDEs of interest are tied to an underlying dynamical system. However, standard BSDE-based solvers have empirically been shown to underperform relative to PINNs in the literature. In this paper, we identify the root cause of this performance gap as a discretization bias introduced by the standard Euler-Maruyama (EM) integration scheme applied to one-step self-consistency BSDE losses, which shifts the optimization landscape off target. We find that this bias cannot be satisfactorily addressed through finer step-sizes or multi-step self-consistency losses. To properly handle this issue, we propose a Stratonovich-based BSDE formulation, which we implement with stochastic Heun integration. We show that our proposed approach completely eliminates the bias issues faced by EM integration. Furthermore, our empirical results show that our Heun-based BSDE method consistently outperforms EM-based variants and achieves competitive results with PINNs across multiple high-dimensional benchmarks. Our findings highlight the critical role of integration schemes in BSDE-based PDE solvers, an algorithmic detail that has received little attention thus far in the literature.

[383] Quiet Feature Learning in Algorithmic Tasks

Prudhviraj Naidu, Zixian Wang, Leon Bergen, Ramamohan Paturi

Main category: cs.LG

TL;DR: Transformer models show unexpected phase transitions when learning algorithms - loss stays flat then suddenly drops, with “quiet features” forming internally before visible improvement.

DetailsMotivation: To understand how neural networks learn algorithmic tasks and challenge the assumption that cross-entropy loss directly reflects representational learning progress.

Method: Train Transformer language models on 10 foundational algorithmic tasks, analyze loss curves, probe internal representations, and conduct ablation experiments on discovered “quiet features”.
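
A sketch of the probing step: fit a linear probe on frozen hidden states to test whether an intermediate quantity is linearly readable before the loss drops. The synthetic "carry bit" target here is purely illustrative; the actual targets depend on the algorithmic task.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def probe_accuracy(hidden_states, targets):
    """Linear probe: can an intermediate feature (e.g., a carry bit in
    addition) be read off the hidden states at this checkpoint/layer?"""
    n = len(targets)
    tr, te = slice(0, int(0.8 * n)), slice(int(0.8 * n), n)
    clf = LogisticRegression(max_iter=1000).fit(hidden_states[tr], targets[tr])
    return clf.score(hidden_states[te], targets[te])

rng = np.random.default_rng(0)
feature = rng.integers(0, 2, size=400)      # hypothetical intermediate quantity
hidden = rng.normal(size=(400, 32))
hidden[:, 0] += 2.0 * feature               # make it linearly readable
print("probe accuracy:", probe_accuracy(hidden, feature))
```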

Result: Models exhibit pronounced phase transitions where validation loss remains flat for long periods then abruptly decreases. “Quiet features” representing intermediate algorithmic computations form before loss improvement, and ablation shows these features are causally necessary for task performance.

Conclusion: Substantial representational learning can occur without visible loss improvement, challenging cross-entropy as a reliable proxy for learning and calling for richer training diagnostics.

Abstract: We train Transformer-based language models on ten foundational algorithmic tasks and observe pronounced phase transitions in their loss curves that deviate from established power-law scaling trends. Over large ranges of compute, the validation loss barely improves, then abruptly decreases. Probing the models’ internal representations reveals that quiet features are learned prior to any decrease in task loss. These quiet features represent intermediate algorithmic computations that do not by themselves improve the output loss. Ablation experiments demonstrate that individual quiet features are causally necessary for task performance. Our results demonstrate that substantial representational progress can remain hidden beneath an apparently flat loss curve, challenging the prevailing use of cross-entropy as a proxy for learning and motivating richer diagnostics for monitoring model training.

[384] Out-of-Distribution Detection via Channelwise Feature Aggregation in Neural Network-Based Receivers

Marko Tuononen, Heikki Penttinen, Duy Vu, Dani Korpi, Vesa Starck, Ville Hautamäki

Main category: cs.LG

TL;DR: Proposes a post-hoc, layerwise OOD detection framework for neural network-based radio receivers using channelwise feature aggregation, avoiding classwise statistics for multi-label soft-bit outputs.

DetailsMotivation: Neural network-based radio receivers are crucial for future wireless systems, making reliable OOD detection essential. Traditional approaches face challenges with multi-label soft-bit outputs that have astronomically many classes.

Method: Post-hoc, layerwise OOD framework based on channelwise feature aggregation that avoids classwise statistics. Leverages the observation that receiver activations form a smooth SNR-aligned manifold rather than discrete clusters, enabling manifold-aware OOD detection.
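
A sketch of the strongest detector reported: aggregate activations channelwise (mean over spatial axes), fit a single Gaussian without any classwise statistics, and score new inputs by Mahalanobis distance.

```python
import numpy as np

def fit_gaussian(feats):
    """Fit one Gaussian to channelwise-aggregated features
    (no classwise statistics needed)."""
    mu = feats.mean(axis=0)
    cov = np.cov(feats, rowvar=False) + 1e-6 * np.eye(feats.shape[1])
    return mu, np.linalg.inv(cov)

def mahalanobis_score(feats, mu, prec):
    d = feats - mu
    return np.einsum("ij,jk,ik->i", d, prec, d)   # larger => more OOD

# acts: (batch, channels, H, W) activations from one receiver layer;
# channelwise aggregation = mean over the spatial axes.
rng = np.random.default_rng(0)
acts_id = rng.normal(size=(512, 32, 4, 4))        # in-distribution
acts_ood = rng.normal(1.0, 2.0, size=(64, 32, 4, 4))
agg = lambda a: a.mean(axis=(2, 3))
mu, prec = fit_gaussian(agg(acts_id))
print(mahalanobis_score(agg(acts_ood), mu, prec)[:4])
```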

Result: Gaussian Mahalanobis with mean activations is the strongest single detector. Earlier layers outperform later layers in OOD detection. SNR/classifier fusions offer small, inconsistent AUROC gains. High-delay OOD is detected reliably, while high-speed OOD remains challenging.

Conclusion: The proposed framework effectively addresses OOD detection for neural network-based radio receivers, particularly for multi-label soft-bit outputs. The manifold-aware approach aligns with classical receiver behavior and shows promising results, though challenges remain with high-speed OOD scenarios.

Abstract: Neural network-based radio receivers are expected to play a key role in future wireless systems, making reliable Out-Of-Distribution (OOD) detection essential. We propose a post-hoc, layerwise OOD framework based on channelwise feature aggregation that avoids classwise statistics, which is critical for multi-label soft-bit outputs with astronomically many classes. Receiver activations exhibit no discrete clusters but a smooth Signal-to-Noise-Ratio (SNR)-aligned manifold, consistent with classical receiver behavior and motivating manifold-aware OOD detection. We evaluate multiple OOD feature types, distance metrics, and methods across layers. Gaussian Mahalanobis with mean activations is the strongest single detector, earlier layers outperform later ones, and SNR/classifier fusions offer small, inconsistent AUROC gains. High-delay OOD is detected reliably, while high-speed OOD remains challenging.

[385] What Can RL Bring to VLA Generalization? An Empirical Study

Jijia Liu, Feng Gao, Bingwen Wei, Xinlei Chen, Qingmin Liao, Yi Wu, Chao Yu, Yu Wang

Main category: cs.LG

TL;DR: RL fine-tuning (especially PPO) significantly improves generalization in vision-language-action models compared to supervised fine-tuning, particularly for semantic understanding and execution robustness.

DetailsMotivation: Current VLA models trained via supervised fine-tuning suffer from limited generalization due to compounding errors under distribution shifts. While RL offers potential for better generalization through trial-and-error optimization, there's a lack of systematic understanding of RL's specific benefits for VLAs compared to SFT.

Method: The study introduces a comprehensive benchmark for evaluating VLA generalization and systematically investigates RL fine-tuning across visual, semantic, and execution dimensions. They compare different RL algorithms (PPO, DPO, GRPO) and develop a simple recipe for efficient PPO training on VLAs.

Result: RL fine-tuning, particularly with PPO, significantly enhances generalization in semantic understanding and execution robustness over SFT, while maintaining comparable visual robustness. PPO proves more effective than LLM-derived methods like DPO and GRPO for VLAs.

Conclusion: RL fine-tuning, especially using PPO, provides substantial generalization benefits for VLAs over traditional supervised fine-tuning, with practical utility demonstrated through an efficient training recipe. The findings establish RL as a valuable approach for improving embodied AI systems.

Abstract: Large Vision-Language Action (VLA) models have shown significant potential for embodied AI. However, their predominant training via supervised fine-tuning (SFT) limits generalization due to susceptibility to compounding errors under distribution shifts. Reinforcement learning (RL) offers a path to overcome these limitations by optimizing for task objectives via trial-and-error, yet a systematic understanding of its specific generalization benefits for VLAs compared to SFT is lacking. To address this, our study introduces a comprehensive benchmark for evaluating VLA generalization and systematically investigates the impact of RL fine-tuning across diverse visual, semantic, and execution dimensions. Our extensive experiments reveal that RL fine-tuning, particularly with PPO, significantly enhances generalization in semantic understanding and execution robustness over SFT, while maintaining comparable visual robustness. We identify PPO as a more effective RL algorithm for VLAs than LLM-derived methods like DPO and GRPO. We also develop a simple recipe for efficient PPO training on VLAs, and demonstrate its practical utility for improving VLA generalization. The project page is at https://rlvla.github.io

[386] Towards Interpretability Without Sacrifice: Faithful Dense Layer Decomposition with Mixture of Decoders

James Oldfield, Shawn Im, Sharon Li, Mihalis A. Nicolaou, Ioannis Patras, Grigorios G Chrysos

Main category: cs.LG

TL;DR: MxDs (Mixture of Decoders) use layer-level sparsity to create interpretable approximations of MLPs in LLMs, preserving accuracy while enabling specialization through thousands of sparse sublayers.

DetailsMotivation: MLPs in large language models are dense and difficult to understand, edit, and steer. Existing neuron-level sparse approximations fail to faithfully reconstruct the original mapping and significantly increase model loss.

Method: MxDs generalize MLPs and Gated Linear Units by expanding pre-trained dense layers into tens of thousands of specialized sublayers using tensor factorization. Each sparsely activating sublayer implements a linear transformation with full-rank weights, preserving expressive capacity under heavy sparsity.
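
A toy mixture-of-decoders layer, assuming top-k gating over a dense tensor of expert weights. The paper instead reaches tens of thousands of sublayers via tensor factorization, which this sketch does not reproduce.

```python
import torch
import torch.nn as nn

class MixtureOfDecoders(nn.Module):
    """Sparse mixture of linear sublayers standing in for a dense MLP
    decoder: only the top-k gated experts fire per token."""
    def __init__(self, d_in, d_out, n_experts=256, k=4):
        super().__init__()
        self.k = k
        self.gate = nn.Linear(d_in, n_experts)
        self.W = nn.Parameter(torch.randn(n_experts, d_in, d_out) * 0.02)

    def forward(self, x):                       # x: (batch, d_in)
        scores = self.gate(x)                   # (batch, n_experts)
        top, idx = scores.topk(self.k, dim=-1)
        gates = torch.softmax(top, dim=-1)      # sparse gate weights
        experts = self.W[idx]                   # (batch, k, d_in, d_out)
        y = torch.einsum("bi,bkio->bko", x, experts)
        return (gates.unsqueeze(-1) * y).sum(dim=1)

mxd = MixtureOfDecoders(d_in=64, d_out=64)
out = mxd(torch.randn(2, 64))                   # each expert is full-rank linear
```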

Result: MxDs significantly outperform state-of-the-art methods (e.g., Transcoders) on the sparsity-accuracy frontier in language models up to 3B parameters. They learn similarly specialized features of natural language and enable effective sparse probing and feature steering.

Conclusion: MxDs provide a promising new avenue for designing interpretable yet faithful decompositions of MLPs in LLMs through layer-level sparsity, overcoming the accuracy trade-off of previous sparse approximation methods.

Abstract: Multilayer perceptrons (MLPs) are an integral part of large language models, yet their dense representations render them difficult to understand, edit, and steer. Recent methods learn interpretable approximations via neuron-level sparsity, yet fail to faithfully reconstruct the original mapping, significantly increasing the model's next-token cross-entropy loss. In this paper, we advocate for moving to layer-level sparsity to overcome the accuracy trade-off in sparse layer approximation. Under this paradigm, we introduce Mixture of Decoders (MxDs). MxDs generalize MLPs and Gated Linear Units, expanding pre-trained dense layers into tens of thousands of specialized sublayers. Through a flexible form of tensor factorization, each sparsely activating MxD sublayer implements a linear transformation with full-rank weights, preserving the original decoders' expressive capacity even under heavy sparsity. Experimentally, we show that MxDs significantly outperform state-of-the-art methods (e.g., Transcoders) on the sparsity-accuracy frontier in language models with up to 3B parameters. Further evaluations on sparse probing and feature steering demonstrate that MxDs learn similarly specialized features of natural language, opening up a promising new avenue for designing interpretable yet faithful decompositions. Our code is included at: https://github.com/james-oldfield/MxD/.

[387] Enhancing Federated Class-Incremental Learning via Spatial-Temporal Statistics Aggregation

Zenghao Guan, Guojun Zhu, Yucan Zhou, Wu Liu, Weiping Wang, Jiebo Luo, Xiaoyan Gu

Main category: cs.LG

TL;DR: STSA: A federated class-incremental learning method that aggregates feature statistics spatially across clients and temporally across stages to address data heterogeneity and reduce communication overhead.

DetailsMotivation: Existing FCIL methods suffer from spatial-temporal client drift due to data heterogeneity and incur high computational/communication costs, limiting practical deployment.

Method: Proposes STSA framework that aggregates feature statistics both spatially (across clients) and temporally (across stages), enabling closed-form classifier updates. Also introduces STSA-E variant with theoretical guarantees for communication efficiency.
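
A sketch of the closed-form idea, assuming LDA-style statistics: aggregated class means and a shared covariance yield classifier weights directly, and client statistics can be merged without raw data. The covariance merge below ignores the between-means correction for brevity, so it is a simplification, not the paper's aggregation rule.

```python
import numpy as np

def closed_form_classifier(class_means, shared_cov, priors):
    """LDA-style classifier computed in closed form from aggregated
    first/second-order feature statistics (no raw data needed)."""
    prec = np.linalg.inv(shared_cov)
    W = class_means @ prec                              # (C, d)
    b = -0.5 * np.einsum("cd,dk,ck->c", class_means, prec, class_means)
    return W, b + np.log(priors)

def aggregate(stats):
    """Spatial aggregation: merge per-client (mean, cov, n) for one class.
    Temporal aggregation would keep these running sums across stages."""
    n = sum(s[2] for s in stats)
    mean = sum(s[0] * s[2] for s in stats) / n
    cov = sum(s[1] * s[2] for s in stats) / n           # simplification
    return mean, cov, n

mu = np.stack([np.zeros(4), np.ones(4)])                # (C=2, d=4) class means
W, b = closed_form_classifier(mu, np.eye(4), priors=np.array([0.5, 0.5]))
# Predict with argmax over classes of x @ W.T + b
```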

Result: Outperforms state-of-the-art FCIL methods on three datasets with varying data heterogeneity, achieving better performance, flexibility, and communication/computation efficiency.

Conclusion: STSA provides an effective solution for FCIL challenges by addressing data heterogeneity and reducing overhead through spatial-temporal statistics aggregation.

Abstract: Federated Class-Incremental Learning (FCIL) enables Class-Incremental Learning (CIL) from distributed data. Existing FCIL methods typically integrate old knowledge preservation into local client training. However, these methods cannot avoid spatial-temporal client drift caused by data heterogeneity and often incur significant computational and communication overhead, limiting practical deployment. To address these challenges simultaneously, we propose a novel approach, Spatial-Temporal Statistics Aggregation (STSA), which provides a unified framework to aggregate feature statistics both spatially (across clients) and temporally (across stages). The aggregated feature statistics are unaffected by data heterogeneity and can be used to update the classifier in closed form at each stage. Additionally, we introduce STSA-E, a communication-efficient variant with theoretical guarantees, achieving performance similar to STSA with much lower communication overhead. Extensive experiments on three widely used FCIL datasets, with varying degrees of data heterogeneity, show that our method outperforms state-of-the-art FCIL methods in terms of performance, flexibility, and both communication and computation efficiency. The code is available at https://github.com/Yuqin-G/STSA.

[388] Exploring the Secondary Risks of Large Language Models

Jiawei Chen, Zhengwei Fang, Xiao Yang, Chao Yu, Zhaoxia Yin, Hang Su

Main category: cs.LG

TL;DR: The paper introduces “secondary risks” - harmful LLM behaviors during benign interactions, proposes SecLens framework to elicit them, and releases SecRiskBench benchmark showing these risks are widespread across models.

DetailsMotivation: Current safety research focuses too much on adversarial jailbreak attacks while ignoring non-adversarial failures that emerge during normal, benign interactions with LLMs. These subtle harmful behaviors can evade standard safety mechanisms and pose real-world risks.

Method: 1) Define two risk primitives: verbose response and speculative advice. 2) Propose SecLens - a black-box, multi-objective search framework that optimizes for task relevance, risk activation, and linguistic plausibility to efficiently elicit secondary risks. 3) Create SecRiskBench benchmark with 650 prompts across 8 real-world risk categories.

Result: Experimental evaluation on 16 popular models shows secondary risks are widespread, transferable across models, and modality independent. The risks occur during benign interactions and evade standard safety mechanisms.

Conclusion: Secondary risks represent a significant safety gap in current LLMs that requires enhanced safety mechanisms. These benign-yet-harmful behaviors pose real-world deployment risks that need systematic evaluation and mitigation.

Abstract: Ensuring the safety and alignment of Large Language Models is a significant challenge with their growing integration into critical applications and societal functions. While prior research has primarily focused on jailbreak attacks, less attention has been given to non-adversarial failures that subtly emerge during benign interactions. We introduce secondary risks, a novel class of failure modes marked by harmful or misleading behaviors during benign prompts. Unlike adversarial attacks, these risks stem from imperfect generalization and often evade standard safety mechanisms. To enable systematic evaluation, we introduce two risk primitives, verbose response and speculative advice, that capture the core failure patterns. Building on these definitions, we propose SecLens, a black-box, multi-objective search framework that efficiently elicits secondary risk behaviors by optimizing task relevance, risk activation, and linguistic plausibility. To support reproducible evaluation, we release SecRiskBench, a benchmark dataset of 650 prompts covering eight diverse real-world risk categories. Experimental results from extensive evaluations on 16 popular models demonstrate that secondary risks are widespread, transferable across models, and modality independent, emphasizing the urgent need for enhanced safety mechanisms to address benign yet harmful LLM behaviors in real-world deployments.

[389] Optimism Without Regularization: Constant Regret in Zero-Sum Games

John Lazarsfeld, Georgios Piliouras, Ryann Sim, Stratis Skoulakis

Main category: cs.LG

TL;DR: Optimistic Fictitious Play achieves constant regret in two-strategy zero-sum games without regularization, while Alternating Fictitious Play has Ω(√T) regret lower bound.

DetailsMotivation: To investigate whether optimal learning rates (constant regret) can be achieved in two-player zero-sum games without regularization, challenging the conventional wisdom that regularization is necessary for fast learning.

Method: Analyzes Optimistic Fictitious Play (unregularized) using a geometric approach in the dual space of payoff vectors, tracking an energy function of iterates. Also examines Alternating Fictitious Play for comparison.
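
A small simulation of Optimistic Fictitious Play on a 2x2 zero-sum game, with optimism rendered as best-responding to the opponent's history plus one extra copy of their last action; the initialization and tiebreaking here are arbitrary (the paper's result holds for any tiebreaking rule).

```python
import numpy as np

A = np.array([[1.0, -1.0],
              [-1.0, 1.0]])        # matching pennies (row player maximizes)

T = 5000
hist_x = np.zeros(2)                # empirical action counts
hist_y = np.zeros(2)
last_x = last_y = 0
row_payoff = 0.0
for t in range(T):
    # Optimism: best-respond to the opponent's empirical counts
    # plus one extra copy of their most recent action.
    bx = np.argmax(A @ (hist_y + np.eye(2)[last_y]))
    by = np.argmin((hist_x + np.eye(2)[last_x]) @ A)
    row_payoff += A[bx, by]
    hist_x[bx] += 1; hist_y[by] += 1
    last_x, last_y = bx, by

# External regret of the row player vs. the best fixed action in hindsight;
# per the paper, this stays bounded by a constant for two-strategy games.
regret = (A @ hist_y).max() - row_payoff
print("row-player regret after", T, "rounds:", regret)
```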

Result: Proves Optimistic Fictitious Play achieves only constant regret in two-strategy games, providing the first evidence that non-no-regret algorithms can achieve fast learning. Also proves an Ω(√T) regret lower bound for Alternating Fictitious Play.

Conclusion: Optimism enables fast learning (constant regret) without regularization in two-strategy games, while alternation does not, separating their capabilities in the unregularized regime.

Abstract: This paper studies the optimistic variant of Fictitious Play for learning in two-player zero-sum games. While it is known that Optimistic FTRL – a regularized algorithm with a bounded stepsize parameter – obtains constant regret in this setting, we show for the first time that similar, optimal rates are also achievable without regularization: we prove for two-strategy games that Optimistic Fictitious Play (using any tiebreaking rule) obtains only constant regret, providing surprising new evidence on the ability of non-no-regret algorithms for fast learning in games. Our proof technique leverages a geometric view of Optimistic Fictitious Play in the dual space of payoff vectors, where we show a certain energy function of the iterates remains bounded over time. Additionally, we also prove a regret lower bound of $\Omega(\sqrt{T})$ for Alternating Fictitious Play. In the unregularized regime, this separates the ability of optimism and alternation in achieving $o(\sqrt{T})$ regret.
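
The two-strategy claim is easy to probe numerically. The sketch below simulates one common formulation of Optimistic Fictitious Play on matching pennies, where each player best-responds to the opponent's empirical play plus one extra copy of the opponent's most recent action, and tracks the row player's regret, which should stay bounded as T grows. The payoff matrix, initialization, and tiebreaking are illustrative choices, not the paper's setup.

```python
import numpy as np

A = np.array([[1., -1.], [-1., 1.]])    # matching pennies: row pays A[i, j]
T = 10_000
cx, cy = np.zeros(2), np.zeros(2)       # empirical counts of past plays
lx = ly = 0                             # "last action" (init is a tiebreak choice)
row_loss = np.zeros(2)                  # cumulative loss of each fixed row action
total = 0.0

for t in range(T):
    # Optimistic prediction: empirical average with one extra copy of the
    # opponent's most recent action.
    py = cy.copy(); py[ly] += 1; py /= py.sum()
    px = cx.copy(); px[lx] += 1; px /= px.sum()
    i = int(np.argmin(A @ py))          # row best-responds (minimizes loss)
    j = int(np.argmax(px @ A))          # column best-responds (maximizes gain)
    total += A[i, j]
    row_loss += A[:, j]
    cx[i] += 1; cy[j] += 1
    lx, ly = i, j

print("row regret:", total - row_loss.min())   # stays O(1) in T per the paper
```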

[390] Consistent Sampling and Simulation: Molecular Dynamics with Energy-Based Diffusion Models

Michael Plainer, Hao Wu, Leon Klein, Stephan Günnemann, Frank Noé

Main category: cs.LG

TL;DR: The paper identifies inconsistencies between diffusion models’ sampling distribution and their energy-based interpretation, traces this to score inaccuracies at small diffusion timesteps, and proposes a Fokker-Planck-regularized energy-based diffusion model to enforce consistency.

DetailsMotivation: While diffusion models are effective for sampling biomolecules, there's an inconsistency: classical diffusion sampling recovers the training distribution, but the energy-based interpretation of the learned score often doesn't match this distribution, even in simple systems.

Method: The authors trace the inconsistency to score inaccuracies at very small diffusion timesteps where models must capture correct data distribution evolution. They propose an energy-based diffusion model with a Fokker-Planck-derived regularization term to enforce consistency between the learned score and the equilibrium distribution.

Result: The approach successfully samples and simulates multiple biomolecular systems including fast-folding proteins, and introduces a state-of-the-art transferable Boltzmann emulator for dipeptides that supports simulation while achieving improved consistency and efficient sampling.

Conclusion: Fokker-Planck regularization addresses the inconsistency between diffusion models’ sampling behavior and their energy-based interpretation, enabling more consistent and efficient biomolecular sampling and simulation with available code and model weights.

Abstract: In recent years, diffusion models trained on equilibrium molecular distributions have proven effective for sampling biomolecules. Beyond direct sampling, the score of such a model can also be used to derive the forces that act on molecular systems. However, while classical diffusion sampling usually recovers the training distribution, the corresponding energy-based interpretation of the learned score is often inconsistent with this distribution, even for low-dimensional toy systems. We trace this inconsistency to inaccuracies of the learned score at very small diffusion timesteps, where the model must capture the correct evolution of the data distribution. In this regime, diffusion models fail to satisfy the Fokker-Planck equation, which governs the evolution of the score. We interpret this deviation as one source of the observed inconsistencies and propose an energy-based diffusion model with a Fokker-Planck-derived regularization term to enforce consistency. We demonstrate our approach by sampling and simulating multiple biomolecular systems, including fast-folding proteins, and by introducing a state-of-the-art transferable Boltzmann emulator for dipeptides that supports simulation and achieves improved consistency and efficient sampling. Our code, model weights, and self-contained JAX and PyTorch notebooks are available at https://github.com/noegroup/ScoreMD.

[391] ExPO: Unlocking Hard Reasoning with Self-Explanation-Guided Reinforcement Learning

Ruiyang Zhou, Shuozhe Li, Amy Zhang, Liu Leqi

Main category: cs.LG

TL;DR: ExPO is a self-explanation framework that generates effective positive samples for RL post-training on reasoning tasks by conditioning on ground-truth answers, enabling exploration beyond the model’s current distribution.

DetailsMotivation: Current RL post-training methods fail on complex reasoning tasks because they only reinforce existing knowledge rather than enabling exploration when models initially generate no correct solutions. Expert demonstrations are often ineffective, requiring a new approach to generate effective positive samples.

Method: ExPO generates positive samples by conditioning on ground-truth answers, creating samples that are likely under the current policy while increasing the model’s likelihood of predicting correct answers. It’s modular and integrates with RL methods like GRPO and DPO.

Result: ExPO improves both learning efficiency and final performance on reasoning benchmarks, surpassing expert-demonstration-based methods, especially on challenging tasks like MATH level-5 where models initially struggle most.

Conclusion: ExPO enables effective exploration in RL post-training for reasoning tasks by generating better positive samples than expert demonstrations or the model’s own incorrect outputs, addressing the distribution-sharpening limitation of existing methods.

Abstract: Self-improvement via RL often fails on complex reasoning tasks because GRPO-style post-training methods rely on the model’s initial ability to generate positive samples. Without guided exploration, these approaches merely reinforce what the model already knows (distribution-sharpening) rather than enabling the model to solve problems where it initially generates no correct solutions. To unlock reasoning ability in such settings, the model must explore new reasoning trajectories beyond its current output distribution. Such exploration requires access to sufficiently good positive samples to guide the learning. While expert demonstrations seem like a natural solution, we find that they are often ineffective in RL post-training. Instead, we identify two key properties of effective positive samples: they should (1) be likely under the current policy, and (2) increase the model’s likelihood of predicting the correct answer. Based on these insights, we propose $\textbf{Self-Explanation Policy Optimization (ExPO)}$, a simple and modular framework that generates such samples by conditioning on the ground-truth answer. It can be integrated with popular RL training methods like GRPO and DPO. ExPO enables efficient exploration and guides the model to produce reasoning trajectories more aligned with its policy than expert-written CoTs, while ensuring higher quality than its own (incorrect) samples. Experiments show that ExPO improves both learning efficiency and final performance on reasoning benchmarks, surpassing expert-demonstration-based methods in challenging settings such as MATH level-5, where the model initially struggles the most. Code is available at https://github.com/HumainLab/ExPO_rl_reasoning_by_explanation.

[392] Improving AI-Based Canine Heart Disease Diagnosis with Expert-Consensus Auscultation Labeling

Pinar Bisgin, Tom Strube, Niklas Tschorn, Michael Pantförder, Maximilian Fecke, Ingrid Ljungvall, Jens Häggström, Gerhard Wess, Christoph Schummer, Sven Meister, Falk M. Howar

Main category: cs.LG

TL;DR: Study addresses label noise in veterinary AI by using multiple expert annotations to create cleaner canine heart murmur dataset, showing significant classification improvements with XGBoost.

DetailsMotivation: Noisy labels from expert assessment ambiguity in veterinary medicine negatively impact AI model training for canine heart murmur classification, requiring methods to reduce label noise.

Method: Used 140 heart sound recordings annotated by multiple experts for MMVD murmur intensity, selected 70 high-quality recordings via consensus, expanded training data using individual heart cycles, and compared AdaBoost, XGBoost, and Random Forest classifiers.

Result: All algorithms improved with label noise reduction, especially XGBoost: mild murmur sensitivity increased from 37.71% to 90.98%, specificity from 76.70% to 93.69%; moderate murmur sensitivity from 30.23% to 55.81%, specificity from 64.56% to 97.19%; loud/thrilling sensitivity from 58.28% to 95.09%, specificity from 84.84% to 89.69%.

Conclusion: Minimizing label noise through multiple expert consensus significantly improves classification algorithms for canine heart murmur detection, with XGBoost showing the most notable performance gains across all murmur intensity categories.

Abstract: Noisy labels pose significant challenges for AI model training in veterinary medicine. This study examines expert assessment ambiguity in canine auscultation data, highlights the negative impact of label noise on classification performance, and introduces methods for label noise reduction. To evaluate whether label noise can be minimized by incorporating multiple expert opinions, a dataset of 140 heart sound recordings (HSR) was annotated regarding the intensity of holosystolic heart murmurs caused by Myxomatous Mitral Valve Disease (MMVD). The expert opinions facilitated the selection of 70 high-quality HSR, resulting in a noise-reduced dataset. By leveraging individual heart cycles, the training data was expanded and classification robustness was enhanced. The investigation encompassed training and evaluating three classification algorithms: AdaBoost, XGBoost, and Random Forest. While AdaBoost and Random Forest exhibited reasonable performances, XGBoost demonstrated notable improvements in classification accuracy. All algorithms showed significant improvements in classification accuracy due to the applied label noise reduction, most notably XGBoost. Specifically, for the detection of mild heart murmurs, sensitivity increased from 37.71% to 90.98% and specificity from 76.70% to 93.69%. For the moderate category, sensitivity rose from 30.23% to 55.81% and specificity from 64.56% to 97.19%. In the loud/thrilling category, sensitivity and specificity increased from 58.28% to 95.09% and from 84.84% to 89.69%, respectively. These results highlight the importance of minimizing label noise to improve classification algorithms for the detection of canine heart murmurs. Index Terms: AI diagnosis, canine heart disease, heart sound classification, label noise reduction, machine learning, XGBoost, veterinary cardiology, MMVD.
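
A compact sketch of the overall pipeline (consensus filtering followed by gradient-boosted classification) is shown below on synthetic data. The agreement rate, the unanimity rule, and all hyperparameters are our assumptions; the paper's exact selection procedure and features may differ.

```python
import numpy as np
from xgboost import XGBClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n_rec, n_experts, n_feat = 140, 3, 32
y_true = rng.integers(0, 3, size=n_rec)             # 0=mild, 1=moderate, 2=loud
X = rng.normal(size=(n_rec, n_feat)) + y_true[:, None] * 0.5

# Each expert agrees with the underlying grade ~80% of the time.
votes = np.where(rng.random((n_rec, n_experts)) < 0.8,
                 y_true[:, None], rng.integers(0, 3, size=(n_rec, n_experts)))

# Consensus step: keep only recordings where all experts agree
# (the paper's exact selection rule may differ).
agree = (votes == votes[:, [0]]).all(axis=1)
Xc, yc = X[agree], votes[agree, 0]

Xtr, Xte, ytr, yte = train_test_split(Xc, yc, test_size=0.3, random_state=0)
clf = XGBClassifier(n_estimators=200, max_depth=4, learning_rate=0.1)
clf.fit(Xtr, ytr)
print(f"kept {agree.sum()} of {n_rec} recordings; accuracy: {clf.score(Xte, yte):.2f}")
```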

[393] e-Profits: A Business-Aligned Evaluation Metric for Profit-Sensitive Customer Churn Prediction

Awais Manzoor, M. Atif Qureshi, Etain Kidney, Luca Longo

Main category: cs.LG

TL;DR: e-Profits is a new business-aligned evaluation metric for churn prediction models that quantifies performance based on customer lifetime value, retention probability, and intervention costs, providing more financially relevant insights than traditional metrics like AUC and F1-score.

DetailsMotivation: Traditional churn prediction metrics (AUC, F1-score) fail to reflect financial outcomes and can mislead strategic decisions in retention campaigns. There's a need for metrics that bridge predictive modeling with profit-driven decision-making.

Method: e-Profits uses Kaplan-Meier survival analysis to estimate tenure-conditioned (customer-level) one-period retention probabilities, incorporating customer lifetime value and intervention costs. It supports granular, per-customer profit evaluation unlike existing profit-based metrics that assume fixed population-level parameters.

Result: Benchmarking six classifiers across two telecom datasets (IBM Telco and Maven Telecom) shows e-Profits reshapes model rankings compared to traditional metrics, revealing financial advantages in models previously overlooked by AUC or F1-score. It also enables segment-level insight into which models maximize ROI for high-value customers.

Conclusion: e-Profits provides a transparent, customer-level evaluation framework that bridges predictive modeling and profit-driven decision-making in operational churn management, offering more business-relevant model evaluation than traditional metrics.

Abstract: Retention campaigns in customer relationship management often rely on churn prediction models evaluated using traditional metrics such as AUC and F1-score. However, these metrics fail to reflect financial outcomes and may mislead strategic decisions. We introduce e-Profits, a novel business-aligned evaluation metric that quantifies model performance based on customer lifetime value, retention probability, and intervention costs. Unlike existing profit-based metrics such as Expected Maximum Profit, which assume fixed population-level parameters, e-Profits uses Kaplan-Meier survival analysis to estimate tenure-conditioned (customer-level) one-period retention probabilities and supports granular, per-customer profit evaluation. We benchmark six classifiers across two telecom datasets (IBM Telco and Maven Telecom) and demonstrate that e-Profits reshapes model rankings compared to traditional metrics, revealing financial advantages in models previously overlooked by AUC or F1-score. The metric also enables segment-level insight into which models maximise return on investment for high-value customers. e-Profits provides a transparent, customer-level evaluation framework that bridges predictive modelling and profit-driven decision-making in operational churn management. All source code is available at: https://github.com/Awaismanzoor/eprofits.
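
The core ingredient, a tenure-conditioned one-period retention probability from Kaplan-Meier estimates, can be sketched with the lifelines package. The per-customer profit formula at the end (clv, cost) is a simplified reading of ours, not the paper's exact e-Profits definition.

```python
import numpy as np
from lifelines import KaplanMeierFitter

rng = np.random.default_rng(1)
n = 500
tenure = rng.integers(1, 60, size=n)      # months observed per customer
churned = rng.random(n) < 0.4             # event indicator (True = churned)

kmf = KaplanMeierFitter()
kmf.fit(tenure, event_observed=churned)

# Tenure-conditioned one-period retention: S(t+1) / S(t) per customer.
S = kmf.survival_function_at_times(np.arange(0, 62)).to_numpy()
retain = S[tenure + 1] / S[tenure]

# Simplified per-customer expected profit of a retention offer; clv, cost,
# and this formula are our assumptions.
clv = rng.uniform(100, 1000, size=n)
cost = 10.0
e_profit = retain * clv - cost
print("expected campaign profit:", round(e_profit.sum(), 2))
```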

[394] Distributional Machine Unlearning via Selective Data Removal

Youssef Allouah, Rachid Guerraoui, Sanmi Koyejo

Main category: cs.LG

TL;DR: The paper proposes distributional unlearning, a framework that selects a small subset of data to remove for efficient domain forgetting while preserving desired distributions, achieving strong unlearning effects with significantly less deletion than full removal.

DetailsMotivation: Machine learning systems need to remove entire domains of information (like toxic language or biases), but complete removal is computationally expensive while random partial removal is statistically inefficient. There's a need for an efficient middle ground.

Method: Proposes distributional unlearning framework using Kullback-Leibler divergence constraints to select optimal subsets for removal. Derives exact removal-preservation Pareto frontier for Gaussian distributions and proposes a distance-based selection algorithm that is quadratically more sample-efficient than random removal.

Result: Experiments across synthetic, text, and image datasets (Jigsaw, CIFAR-10, SMS spam) show the method requires 15-82% less deletion than full removal while achieving strong unlearning effects, such as halving initial forget set accuracy.

Conclusion: By demonstrating that a small forget set often suffices for effective domain unlearning, the framework lays foundations for more scalable and rigorous subpopulation unlearning in machine learning systems.

Abstract: Machine learning systems increasingly face requirements to remove entire domains of information–such as toxic language or biases–rather than individual user data. This task presents a dilemma: full removal of the unwanted domain data is computationally expensive, while random partial removal is statistically inefficient. We find that a domain’s statistical influence is often concentrated in a small subset of its data samples, suggesting a path between ineffective partial removal and unnecessary complete removal. We formalize this as distributional unlearning: a framework to select a small subset that balances forgetting an unwanted distribution while preserving a desired one. Using Kullback-Leibler divergence constraints, we derive the exact removal-preservation Pareto frontier for Gaussian distributions and prove that models trained on the edited data achieve corresponding log-loss bounds. We propose a distance-based selection algorithm and show it is quadratically more sample-efficient than random removal in the challenging low-divergence regime. Experiments across synthetic, text, and image datasets (Jigsaw, CIFAR-10, SMS spam) show our method requires 15-82% less deletion than full removal for strong unlearning effects, e.g., halving initial forget set accuracy. Ultimately, by showing a small forget set often suffices, our framework lays the foundations for more scalable and rigorous subpopulation unlearning.
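
For Gaussian data, a plausible instance of the distance-based selection is to score each sample of the unwanted domain by a log-likelihood ratio and delete only the most characteristic fraction. The criterion and the 30% budget below are illustrative, not necessarily the paper's exact rule.

```python
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(0)
retain = rng.normal(0.0, 1.0, size=(2000, 5))   # desired distribution
forget = rng.normal(1.5, 1.0, size=(500, 5))    # unwanted domain

mu_r, cov_r = retain.mean(0), np.cov(retain.T)
mu_f, cov_f = forget.mean(0), np.cov(forget.T)

# Score each unwanted sample by how characteristic it is of the forget
# distribution relative to the retain one.
ratio = (multivariate_normal.logpdf(forget, mu_f, cov_f)
         - multivariate_normal.logpdf(forget, mu_r, cov_r))

k = int(0.3 * len(forget))                      # delete only 30% of the domain
keep_mask = np.ones(len(forget), bool)
keep_mask[np.argsort(ratio)[-k:]] = False       # drop the most influential
edited = np.vstack([retain, forget[keep_mask]])
print("removed", k, "of", len(forget), "forget samples")
```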

[395] Can Language Models Discover Scaling Laws?

Haowei Lin, Haotian Ye, Wenzheng Feng, Quzhe Huang, Yujun Li, Hubert Lim, Zhengrui Li, Xiangyu Wang, Jianzhu Ma, James Zou, Yitao Liang

Main category: cs.LG

TL;DR: SLDAgent is an evolution-based AI agent that automatically discovers scaling laws for predicting model performance, outperforming human-derived laws across diverse tasks.

DetailsMotivation: Discovering scaling laws for predicting model performance at scale is currently slow and relies on human experimentation. The paper investigates whether LLMs can automate this process to overcome the limitations of manual discovery.

Method: The authors collect over 5,000 experiments from literature and create eight diverse scaling law discovery tasks. They introduce SLDAgent, an evolution-based agent that co-optimizes both the scaling law model structure and its parameters, enabling autonomous exploration of complex variable relationships.

Result: SLDAgent discovers scaling laws that consistently outperform established human-derived counterparts across all eight tasks, demonstrating more accurate extrapolation. The discovered laws show practical utility in both pretraining and finetuning applications.

Conclusion: This work establishes a new paradigm for agentic scientific discovery, showing that AI systems can understand their own scaling behavior and contribute novel, practical knowledge back to the research community.

Abstract: Discovering scaling laws for predicting model performance at scale is a fundamental and open-ended challenge, mostly reliant on slow, case-specific human experimentation. To investigate the potential for LLMs to automate this process, we collect over 5,000 experiments from existing literature and curate eight diverse scaling law discovery tasks. While existing agents struggle to produce accurate law formulas, this paper introduces SLDAgent, an evolution-based agent that co-optimizes the scaling law model and the parameters, enabling it to autonomously explore complex relationships between variables. For the first time, we demonstrate that SLDAgent can automatically discover laws that exhibit consistently more accurate extrapolation than their established, human-derived counterparts across all tasks. Through comprehensive analysis, we elucidate why these discovered laws are superior and verify their practical utility in both pretraining and finetuning applications. This work establishes a new paradigm for agentic scientific discovery, showing that AI systems can understand their own scaling behavior, and can contribute novel and practical knowledge back to the research community.
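
The co-optimization of law structure and parameters can be miniaturized as a search over formula templates, each fitted by least squares and ranked by extrapolation error. SLDAgent evolves far richer forms with an LLM in the loop; the tiny template library and data below are our stand-ins.

```python
import numpy as np
from scipy.optimize import curve_fit

# A tiny library of law templates (SLDAgent evolves far richer forms).
laws = {
    "power":       lambda n, a, b:    a * n ** (-b),
    "power+const": lambda n, a, b, c: a * n ** (-b) + c,
    "log":         lambda n, a, b:    a - b * np.log(n),
}

rng = np.random.default_rng(0)
n = np.logspace(2, 5, 20)                            # e.g. model sizes
loss = 1.5 + 50 * n ** -0.4 + rng.normal(0, 0.02, n.size)

fit, ext = slice(0, 15), slice(15, None)             # fit small, test large
best = None
for name, f in laws.items():
    try:
        popt, _ = curve_fit(f, n[fit], loss[fit], bounds=(0, np.inf),
                            maxfev=20_000)
    except RuntimeError:
        continue                                     # fit failed; discard
    err = np.mean((f(n[ext], *popt) - loss[ext]) ** 2)  # extrapolation error
    if best is None or err < best[2]:
        best = (name, popt, err)

print("selected law:", best[0], "| extrapolation MSE:", best[2])
```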

[396] Nonlinear reconciliation: Error reduction theorems

Lorenzo Nespoli, Anubhab Biswas, Roberto Rocchetta, Vasco Medici

Main category: cs.LG

TL;DR: This paper establishes formal error reduction theorems for forecast reconciliation in nonlinear constraint settings, extending existing linear constraint results to various classes of nonlinear hypersurfaces and manifolds.

DetailsMotivation: While forecast reconciliation has been studied for linear constraints, formal error reduction theorems for nonlinear constraints (analogous to Panagiotelis et al. 2021) are lacking, creating a gap in the theoretical foundation for probabilistic reconciliation methods.

Method: The authors derive exact analogs of existing linear constraint theorems for hypersurfaces with constant-sign curvature, then extend to broader cases including hypersurfaces with non-constant-sign curvature and general manifolds with codimension > 1.

Result: Established formal error reduction theorems for various classes of nonlinear constraints, providing theoretical guarantees for forecast reconciliation in nonlinear settings. Released JNLR, a JAX-based Python package implementing the theorems and reconciliation procedures.

Conclusion: The paper fills a significant theoretical gap by providing formal error reduction guarantees for forecast reconciliation with nonlinear constraints, enabling more robust probabilistic forecasting with theoretical foundations and practical implementation tools.

Abstract: Forecast reconciliation, an ex-post technique applied to forecasts that must satisfy constraints, has been a prominent topic in the forecasting literature over the past two decades. Recently, several efforts have sought to extend reconciliation methods to probabilistic settings. Nevertheless, formal theorems demonstrating error reduction under nonlinear constraints, analogous to those presented in Panagiotelis et al. (2021), are still lacking. This paper addresses that gap by establishing such theorems for various classes of nonlinear hypersurfaces and vector-valued functions. Specifically, we derive an exact analog of Theorem 3.1 from Panagiotelis et al. (2021) for hypersurfaces with constant-sign curvature. Additionally, we provide an error reduction theorem for the broader case of hypersurfaces with non-constant-sign curvature and for general manifolds with codimension > 1. To support reproducibility and practical adoption, we release a JAX-based Python package, JNLR, implementing the presented theorems and reconciliation procedures.

[397] Spiffy: Multiplying Diffusion LLM Acceleration via Lossless Speculative Decoding

Sudhanshu Agrawal, Risheek Garrepalli, Raghavv Goel, Mingu Lee, Christopher Lott, Fatih Porikli

Main category: cs.LG

TL;DR: Spiffy is a speculative decoding algorithm that accelerates diffusion LLMs by 2.8-3.1× while preserving output distribution, using auto-speculative draft generation and optimized directed draft graphs.

DetailsMotivation: Current open-source diffusion LLMs (dLLMs) generate tokens at much lower rates than their potential, typically decoding only one token per denoising timestep to maximize quality. There's a need to accelerate dLLM inference while maintaining output quality.

Method: Spiffy uses auto-speculative draft generation leveraging the dLLM’s own distribution, eliminating separate draft model overhead. It introduces directed draft graphs designed for dLLM’s bidirectional, block-wise generation, verified in parallel. An offline calibration algorithm optimizes graph configurations for higher acceptance rates.

Result: Spiffy achieves 2.8-3.1× speedup while provably preserving output distribution. When combined with parallel decoding methods like KV-caching and multi-token unmasking, it achieves up to 7.9× total speedup.

Conclusion: Spiffy effectively accelerates diffusion LLM inference through novel speculative decoding techniques that are complementary to existing parallel decoding methods, significantly improving generation speeds while maintaining output quality.

Abstract: Diffusion LLMs (dLLMs) have recently emerged as a powerful alternative to autoregressive LLMs (AR-LLMs) with the potential to operate at significantly higher token generation rates. However, currently available open-source dLLMs often generate at much lower rates, typically decoding only a single token at every denoising timestep in order to maximize output quality. We present Spiffy, a speculative decoding algorithm that accelerates dLLM inference by $\mathbf{2.8{-}3.1\times}$ while provably preserving the model’s output distribution. This work addresses the unique challenges involved in applying ideas from speculative decoding of AR-LLMs to the dLLM setting. Spiffy proposes draft states by leveraging the dLLM’s distribution itself in an auto-speculative manner. This approach is efficient and effective, and eliminates the overheads of training and running an independent draft model. To structure the candidate draft states, we propose a novel directed draft graph which is uniquely designed to take advantage of the bidirectional, block-wise nature of dLLM generation and can be verified in parallel by the dLLM. To further optimize the structure of these draft graphs, we introduce an efficient, offline calibration algorithm that procedurally determines high-quality graph configurations. These optimized draft graphs, enabling increased acceptance rates, lead to a significant boost in the overall speedup achieved by the system. Crucially, Spiffy is also complementary to other recent innovations in improving dLLM generation speeds such as KV-caching and multi-token unmasking. We demonstrate that when combined with such parallel decoding algorithms, Spiffy is able to effectively multiply the benefits of these methods leading to total speedups of up to $\mathbf{7.9\times}$.

[398] Content Accuracy and Quality Aware Resource Allocation Based on LP-Guided DRL for ISAC-Driven AIGC Networks

Ningzhe Shi, Yiqing Zhou, Ling Liu, Jinglin Shi, Yihao Wu, Haiwei Shi, Hanxiao Yu

Main category: cs.LG

TL;DR: Proposes CAQA metric and LPDRL-F algorithm to optimize resource allocation in ISAC-based AIGC networks, improving quality by over 50%.

DetailsMotivation: Existing AIGC services assume accurate input data, but ISAC-based networks use inaccurate sensed data, requiring new quality assessment that considers both content accuracy and generation quality.

Method: Proposes CAQA metric for ISAC-AIGC quality assessment, then develops LPDRL-F algorithm (linear programming guided deep reinforcement learning with action filter) to optimize three-dimensional resource tradeoff between sensing, computing, and communication.

Result: LPDRL-F converges faster and finds better solutions than existing DRL and GDM algorithms, improving AvgCAQA by >10%. Overall, CAQA-AIGC achieves >50% improvement over schemes focusing only on content generation quality.

Conclusion: The proposed CAQA metric and LPDRL-F algorithm effectively address the resource allocation challenge in ISAC-based AIGC networks, significantly improving service quality through optimized sensing-computing-communication tradeoff.

Abstract: Integrated sensing and communication (ISAC) can enhance artificial intelligence-generated content (AIGC) networks by providing efficient sensing and transmission. Existing AIGC services usually assume that the accuracy of the generated content can be ensured, given accurate input data and prompt, thus only the content generation quality (CGQ) is considered. However, this is not applicable in ISAC-based AIGC networks, where content generation is based on inaccurate sensed data. Moreover, the AIGC model itself introduces generation errors, which depend on the number of generating steps (i.e., computing resources). To assess the quality of experience of ISAC-based AIGC services, we propose a content accuracy and quality aware service assessment metric (CAQA). Since allocating more resources to sensing and generating improves content accuracy but may reduce communication quality, and vice versa, this sensing-generating (computing)-communication three-dimensional resource tradeoff must be optimized to maximize the average CAQA (AvgCAQA) across all users with AIGC (CAQA-AIGC). This problem is NP-hard, with a large solution space that grows exponentially with the number of users. To solve the CAQA-AIGC problem with low complexity, a linear programming (LP) guided deep reinforcement learning (DRL) algorithm with an action filter (LPDRL-F) is proposed. Through the LP-guided approach and the action filter, LPDRL-F can transform the original three-dimensional solution space to two dimensions, reducing complexity while improving the learning performance of DRL. Simulations show that compared to existing DRL and generative diffusion model (GDM) algorithms without LP, LPDRL-F converges faster and finds better resource allocation solutions, improving AvgCAQA by more than 10%. With LPDRL-F, CAQA-AIGC can achieve an improvement in AvgCAQA of more than 50% compared to existing schemes focusing solely on CGQ.

[399] Optimizing Fine-Tuning through Advanced Initialization Strategies for Low-Rank Adaptation

Yongfu Xue

Main category: cs.LG

TL;DR: IniLoRA improves LoRA by initializing low-rank matrices to approximate original model weights, achieving better performance across models and tasks.

DetailsMotivation: LoRA's zero-product initialization limits its ability to effectively activate and leverage original model weights, creating a performance bottleneck. The authors aim to overcome this limitation.

Method: Proposes IniLoRA with novel initialization strategy that initializes low-rank matrices to closely approximate original model weights. Also introduces two variants: IniLoRA-α and IniLoRA-β with distinct initialization methods.

Result: Experimental results show IniLoRA achieves better performance than LoRA across a range of models and tasks.

Conclusion: IniLoRA addresses LoRA’s initialization limitation and provides improved parameter-efficient fine-tuning through better weight approximation initialization.

Abstract: The rapid development of parameter-efficient fine-tuning methods has noticeably improved the efficiency of adapting large language models. Among these, LoRA has gained widespread popularity due to its strong balance of effectiveness and parameter efficiency. However, LoRA relies on initializing two low-rank matrices whose product is zero, which limits its ability to effectively activate and leverage the original model weights, creating a potential bottleneck for optimal performance. To address this limitation, we propose \textbf{IniLoRA}, a novel initialization strategy that initializes the low-rank matrices to closely approximate the original model weights. Experimental results indicate that IniLoRA achieves better performance than LoRA across a range of models and tasks. Additionally, we introduce two variants, IniLoRA-$\alpha$ and IniLoRA-$\beta$, both leveraging distinct initialization methods to enhance performance further.
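
One plausible instantiation of "initialize the low-rank matrices to approximate the original weights" is a truncated SVD, with the singular values split across the two factors. The function name and this exact scheme are our assumptions; the paper's method (and its alpha/beta variants) may differ.

```python
import torch

def inilora_init(W: torch.Tensor, r: int):
    # Truncated SVD gives the best rank-r approximation of W; split the
    # singular values across the two factors so that B @ A ≈ W.
    U, S, Vh = torch.linalg.svd(W, full_matrices=False)
    root = S[:r].sqrt()
    B = U[:, :r] * root            # (out_features, r)
    A = root[:, None] * Vh[:r]     # (r, in_features)
    return A, B

W = torch.randn(768, 768)
A, B = inilora_init(W, r=16)
print("relative approximation error:", ((W - B @ A).norm() / W.norm()).item())
```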

[400] CuMoLoS-MAE: A Masked Autoencoder for Remote Sensing Data Reconstruction

Anurup Naskar, Nathanael Zhixin Wong, Sara Shamekh

Main category: cs.LG

TL;DR: CuMoLoS-MAE is a curriculum-guided Monte Carlo stochastic ensemble masked autoencoder that reconstructs atmospheric profiles from noisy remote sensing data while preserving fine-scale features and providing uncertainty estimates.

DetailsMotivation: Remote sensing instruments like Doppler Lidar, Radar, and radiometers produce atmospheric profiles that are corrupted by low-SNR gates, range folding, and spurious discontinuities. Traditional gap filling methods blur fine-scale structures, while existing deep learning models lack confidence estimates, limiting their utility for scientific applications.

Method: CuMoLoS-MAE uses a curriculum-guided training approach with progressive mask ratios that forces a Vision Transformer (ViT) decoder to reconstruct from increasingly sparse context. At inference, it performs Monte Carlo sampling over random mask realizations, running the MAE multiple times and aggregating outputs to obtain posterior predictive mean reconstruction with per-pixel uncertainty maps.

Result: The method successfully restores fine-scale atmospheric features (updraft/downdraft cores, shear lines, small vortices), learns a data-driven prior over atmospheric fields, and provides pixel-wise uncertainty quantification. It enables high-fidelity reconstruction with uncertainty estimates.

Conclusion: CuMoLoS-MAE provides a novel deep learning workflow that enhances convection diagnostics, supports real-time data assimilation, and improves long-term climate reanalysis by combining high-fidelity reconstruction with uncertainty quantification for atmospheric remote sensing data.

Abstract: Accurate atmospheric profiles from remote sensing instruments such as Doppler Lidar, Radar, and radiometers are frequently corrupted by low-SNR (Signal to Noise Ratio) gates, range folding, and spurious discontinuities. Traditional gap filling blurs fine-scale structures, whereas deep models lack confidence estimates. We present CuMoLoS-MAE, a Curriculum-Guided Monte Carlo Stochastic Ensemble Masked Autoencoder designed to (i) restore fine-scale features such as updraft and downdraft cores, shear lines, and small vortices, (ii) learn a data-driven prior over atmospheric fields, and (iii) quantify pixel-wise uncertainty. During training, CuMoLoS-MAE employs a mask-ratio curriculum that forces a ViT decoder to reconstruct from progressively sparser context. At inference, we approximate the posterior predictive by Monte Carlo over random mask realisations, evaluating the MAE multiple times and aggregating the outputs to obtain the posterior predictive mean reconstruction together with a finely resolved per-pixel uncertainty map. Together with high-fidelity reconstruction, this novel deep learning-based workflow enables enhanced convection diagnostics, supports real-time data assimilation, and improves long-term climate reanalysis.
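
The Monte Carlo inference step is simple to express: run the trained MAE under many random mask realizations and aggregate. The sketch below uses a toy stand-in model so it runs end to end; the `mae(field, mask_ratio)` name and signature are assumptions, not the paper's API.

```python
import torch

@torch.no_grad()
def mc_reconstruct(mae, field, n_samples=32, mask_ratio=0.75):
    # Monte Carlo over random mask realizations: run the MAE repeatedly and
    # aggregate into a posterior-mean reconstruction plus a per-pixel
    # uncertainty map.
    recons = torch.stack([mae(field, mask_ratio) for _ in range(n_samples)])
    return recons.mean(0), recons.std(0)

def toy_mae(x, mask_ratio):
    # Stand-in for the trained ViT-MAE: masks random pixels and
    # "reconstructs" them with the field mean.
    mask = torch.rand_like(x) < mask_ratio
    return torch.where(mask, x.mean(), x)

profile = torch.randn(1, 64, 64)        # e.g. a noisy Doppler-lidar scan
mean_map, unc_map = mc_reconstruct(toy_mae, profile)
print(mean_map.shape, unc_map.shape)    # reconstruction + uncertainty map
```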

[401] AQ-PCDSys: An Adaptive Quantized Planetary Crater Detection System for Autonomous Space Exploration

Aditri Paul, Archan Paul

Main category: cs.LG

TL;DR: A quantized neural network system for real-time crater detection on planetary exploration hardware with limited power and memory.

DetailsMotivation: Standard deep learning models require too much memory and computation for space-qualified hardware, creating a bottleneck for autonomous planetary exploration that needs real-time environmental perception.

Method: Adaptive Quantized Planetary Crater Detection System combining Quantized Neural Network (via Quantization Aware Training) with Adaptive Multi-Sensor Fusion module that merges Optical Imagery and Digital Elevation Models using adaptive weighting, plus Multi-Scale Detection Heads.

Result: The system achieves leaner model footprint, significantly faster processing while maintaining high detection fidelity, enabling real-time crater detection and hazard avoidance on power/memory-constrained hardware.

Conclusion: Provides a computationally efficient solution for autonomous planetary exploration, serving as a blueprint for future empirical validation and hardware benchmarking on integer-arithmetic units.

Abstract: Successful autonomous planetary exploration hinges on real-time, high-fidelity environmental perception. However, standard deep learning models usually demand far more memory and computation power than space-qualified, radiation-hardened onboard hardware can provide. This creates a fundamental design challenge of deploying sophisticated detection architectures without saturating the rigid power and memory envelopes of the computation hardware of planetary exploration platforms. We propose the Adaptive Quantized Planetary Crater Detection System to resolve this bottleneck. Our framework integrates a Quantized Neural Network, refined through Quantization Aware Training, with an Adaptive Multi-Sensor Fusion module. By forcing weights into low-precision integer arithmetic, we effectively strip away the floating-point overhead that typically bottlenecks onboard processors and system memory. This yields a leaner model footprint and significantly faster processing while the detection fidelity remains high. Such efficiency enables AMF module to merge high-bandwidth Optical Imagery streams with Digital Elevation Models using an Adaptive Weighting Mechanism to re-balance sensor priority under variable conditions like deep shadows or high albedo. Integrated Multi-Scale Detection Heads then resolve craters across a wide range of diameters, providing a computationally efficient and precise solution for real-time detection, localization of craters and hazard avoidance. This paper establishes the architectural design and theoretical justification of the system. While our methodology is grounded in principles of hybrid computer vision and planetary science, we present this as a blueprint for future empirical validation and hardware benchmarking on integer-arithmetic units. This system provides a capability vital for the next generation of autonomous landing, navigation, and deep space explorations.
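
The adaptive weighting idea, a small gating network that re-balances optical and elevation features under varying conditions, can be sketched as below. The gating design, feature statistics, and sizes are entirely our assumptions, since the paper describes the AMF module at the architectural level.

```python
import torch
import torch.nn as nn

class AdaptiveFusion(nn.Module):
    # Gating network re-weights optical vs. DEM features from simple scene
    # statistics (a simplified reading of the AMF module).
    def __init__(self, dim):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(2 * dim, 16), nn.ReLU(),
                                  nn.Linear(16, 2))

    def forward(self, optical, dem):
        stats = torch.cat([optical.mean(dim=(-2, -1)),
                           dem.mean(dim=(-2, -1))], dim=-1)
        w = torch.softmax(self.gate(stats), dim=-1)       # (batch, 2)
        w = w[:, :, None, None, None]                     # broadcast over maps
        return w[:, 0] * optical + w[:, 1] * dem

optical = torch.randn(4, 32, 64, 64)    # optical feature maps
dem = torch.randn(4, 32, 64, 64)        # digital-elevation feature maps
print(AdaptiveFusion(dim=32)(optical, dem).shape)   # (4, 32, 64, 64)
```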

[402] Dynamics-Aligned Latent Imagination in Contextual World Models for Zero-Shot Generalization

Frank Röder, Jan Benad, Manfred Eppe, Pradeep Kr. Banerjee

Main category: cs.LG

TL;DR: DALI is a Dreamer-based framework that learns latent context representations from agent-environment interactions for robust RL generalization without explicit context variables.

DetailsMotivation: Real-world RL needs adaptation to unseen conditions without costly retraining. Existing cMDP methods require explicit context variables (friction, gravity), limiting use when contexts are latent or hard to measure.

Method: DALI integrates with Dreamer architecture, trains self-supervised encoder to predict forward dynamics, infers latent context representations from agent-environment interactions, and conditions world model and policy on these representations.

Result: DALI achieves significant gains over context-unaware baselines, often surpasses context-aware baselines in extrapolation tasks, enables zero-shot generalization to unseen contextual variations, and demonstrates counterfactual consistency in latent space.

Conclusion: DALI provides an effective framework for learning latent context representations that bridge perception and control, enabling robust generalization in contextual MDPs without requiring explicit context variables.

Abstract: Real-world reinforcement learning demands adaptation to unseen environmental conditions without costly retraining. Contextual Markov Decision Processes (cMDP) model this challenge, but existing methods often require explicit context variables (e.g., friction, gravity), limiting their use when contexts are latent or hard to measure. We introduce Dynamics-Aligned Latent Imagination (DALI), a framework integrated within the Dreamer architecture that infers latent context representations from agent-environment interactions. By training a self-supervised encoder to predict forward dynamics, DALI generates actionable representations conditioning the world model and policy, bridging perception and control. We theoretically prove this encoder is essential for efficient context inference and robust generalization. DALI’s latent space enables counterfactual consistency: Perturbing a gravity-encoding dimension alters imagined rollouts in physically plausible ways. On challenging cMDP benchmarks, DALI achieves significant gains over context-unaware baselines, often surpassing context-aware baselines in extrapolation tasks, enabling zero-shot generalization to unseen contextual variations.
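
The self-supervised context encoder can be sketched as a recurrent network over interaction histories whose output must help a dynamics head predict the next state, so no context labels are needed. All sizes, the GRU choice, and the random data below are illustrative assumptions.

```python
import torch
import torch.nn as nn

obs_dim, act_dim, ctx_dim, K = 8, 2, 4, 16   # K = history window length

# Encoder infers a latent context from a window of (state, action) pairs;
# the dynamics head predicts the next state from (s, a, context).
encoder = nn.GRU(obs_dim + act_dim, 32, batch_first=True)
to_ctx = nn.Linear(32, ctx_dim)
dyn = nn.Sequential(nn.Linear(obs_dim + act_dim + ctx_dim, 64),
                    nn.ReLU(), nn.Linear(64, obs_dim))
opt = torch.optim.Adam([*encoder.parameters(), *to_ctx.parameters(),
                        *dyn.parameters()], lr=1e-3)

# Random stand-ins for a batch of interaction histories and transitions.
hist = torch.randn(32, K, obs_dim + act_dim)
s, a = torch.randn(32, obs_dim), torch.randn(32, act_dim)
s_next = torch.randn(32, obs_dim)

for _ in range(100):
    _, h = encoder(hist)                         # h: (1, batch, hidden)
    ctx = to_ctx(h[-1])                          # inferred latent context
    pred = dyn(torch.cat([s, a, ctx], dim=-1))
    loss = nn.functional.mse_loss(pred, s_next)  # forward-dynamics objective
    opt.zero_grad(); loss.backward(); opt.step()
print("final dynamics loss:", loss.item())
```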

[403] In-Context Learning Enhanced Credibility Transformer

Kishan Padayachy, Ronald Richman, Salvatore Scognamiglio, Mario V. Wüthrich

Main category: cs.LG

TL;DR: Credibility Transformer extended with in-context learning mechanism improves predictive accuracy by adapting to similar risk patterns and generalizing to new instances with unseen categorical features.

DetailsMotivation: To enhance the Credibility Transformer's learning and predictive performance by incorporating in-context learning, allowing the model to adapt to similar risk patterns and handle new instances with previously unseen categorical feature levels.

Method: Extends the Credibility Transformer architecture with an in-context learning mechanism that augments the information set with a context batch of similar instances. This enhances CLS token representations through additional in-context information and fine-tuning.

Result: Empirical verification shows that in-context learning enhances predictive accuracy by adapting to similar risk patterns and allows generalization to new instances with previously unseen categorical covariate levels.

Conclusion: The proposed in-context learning paradigm successfully improves the Credibility Transformer’s performance by enabling better adaptation to risk patterns and handling of novel instances, demonstrating practical value for real-world applications like insurance risk assessment.

Abstract: The starting point of our network architecture is the Credibility Transformer which extends the classical Transformer architecture by a credibility mechanism to improve model learning and predictive performance. This Credibility Transformer learns credibilitized CLS tokens that serve as learned representations of the original input features. In this paper we present a new paradigm that augments this architecture by an in-context learning mechanism, i.e., we increase the information set by a context batch consisting of similar instances. This allows the model to enhance the CLS token representations of the instances by additional in-context information and fine-tuning. We empirically verify that this in-context learning enhances predictive accuracy by adapting to similar risk patterns. Moreover, this in-context learning also allows the model to generalize to new instances which, e.g., have feature levels in the categorical covariates that have not been present when the model was trained – for a relevant example, think of a new vehicle model which has just been developed by a car manufacturer.

[404] Differentially private federated learning for localized control of infectious disease dynamics

Raouf Kerkouche, Henrik Zunker, Mario Fritz, Martin J. Kühn

Main category: cs.LG

TL;DR: Privacy-preserving federated learning with differential privacy enables local COVID-19 case forecasting across German counties without centralizing sensitive health data.

DetailsMotivation: During epidemics, localized responses are crucial but face data limitations: training separate ML models locally is infeasible due to small datasets, while centralizing sensitive health data violates privacy constraints. German counties and local health authorities need collaborative forecasting while preserving data privacy.

Method: Proposed federated learning framework with client-level differential privacy: counties/LHAs act as clients, train shared multilayer perceptron on sliding windows of recent case counts. Clients exchange only norm-clipped model updates, server aggregates updates with DP noise. Balances utility vs. privacy trade-off.

Result: At moderately strong privacy levels, DP model closely approaches non-DP performance: R² ~0.94 (vs. 0.95) and MAPE 26% in Nov 2020; R² ~0.88 (vs. 0.93) and MAPE 21% in Mar 2022. Very strict privacy yields unstable forecasts, but viable privacy budgets exist for useful predictions.

Conclusion: Client-level DP-FL can deliver useful county-level epidemic predictions with strong privacy guarantees. Privacy budgets depend on epidemic phase, enabling privacy-compliant collaboration among health authorities for local forecasting without centralizing sensitive data.

Abstract: In times of epidemics, swift reaction is necessary to mitigate epidemic spreading. For this reaction, localized approaches have several advantages, limiting necessary resources and reducing the impact of interventions on a larger scale. However, training a separate machine learning (ML) model on a local scale is often not feasible due to limited available data. Centralizing the data is also challenging because of its high sensitivity and privacy constraints. In this study, we consider a localized strategy based on the German counties and communities managed by the related local health authorities (LHA). So that privacy preservation does not preclude the availability of detailed situational data, we propose a privacy-preserving forecasting method that can assist public health experts and decision makers. ML methods with federated learning (FL) train a shared model without centralizing raw data. Considering the counties, communities or LHAs as clients and finding a balance between utility and privacy, we study a FL framework with client-level differential privacy (DP). We train a shared multilayer perceptron on sliding windows of recent case counts to forecast the number of cases, while clients exchange only norm-clipped updates and the server aggregates updates with DP noise. We evaluate the approach on COVID-19 data at the county level during two phases. As expected, very strict privacy yields unstable, unusable forecasts. At a moderately strong level, the DP model closely approaches the non-DP model: R² around 0.94 (vs. 0.95) and mean absolute percentage error (MAPE) of 26% in November 2020; R² around 0.88 (vs. 0.93) and MAPE of 21% in March 2022. Overall, client-level DP-FL can deliver useful county-level predictions with strong privacy guarantees, and viable privacy budgets depend on epidemic phase, allowing privacy-compliant collaboration among health authorities for local forecasting.
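
The server-side mechanism (clip each client's update in L2 norm, average, add calibrated Gaussian noise) is the standard DP-FedAvg recipe. A minimal sketch follows; the clipping bound, noise multiplier, and client count are illustrative, not the values used in the study.

```python
import numpy as np

def dp_aggregate(updates, clip=0.5, noise_mult=1.2, seed=0):
    # Client-level DP-FedAvg: clip each client's update in L2 norm,
    # average, then add Gaussian noise calibrated to the clipping bound.
    rng = np.random.default_rng(seed)
    clipped = [u * min(1.0, clip / max(np.linalg.norm(u), 1e-12))
               for u in updates]
    avg = np.mean(clipped, axis=0)
    sigma = noise_mult * clip / len(updates)
    return avg + rng.normal(0.0, sigma, size=avg.shape)

# e.g. 40 counties, each sending an update for a 1,000-parameter MLP
rng = np.random.default_rng(1)
updates = [rng.normal(0, 0.1, size=1000) for _ in range(40)]
delta = dp_aggregate(updates)
print("aggregated update norm:", np.linalg.norm(delta))
```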

[405] Policy Compatible Skill Incremental Learning via Lazy Learning Interface

Daehee Lee, Dongsu Lee, TaeYoon Kwack, Wonje Choi, Honguk Woo

Main category: cs.LG

TL;DR: SIL-C is a framework that maintains compatibility between incrementally learned skills and existing downstream policies, enabling skill improvements to enhance policy performance without retraining.

DetailsMotivation: As agents incrementally learn new skills, their evolving skill repertoire can disrupt compatibility with existing skill-based policies, limiting policy reusability and generalization. Current approaches require policy retraining or structural adaptation when skills change.

Method: SIL-C uses a bilateral lazy learning-based mapping technique to dynamically align the subtask space referenced by policies with the skill space decoded into agent behaviors. This allows each subtask (from policy decomposition) to be executed by selecting appropriate skills based on trajectory distribution similarity.

Result: The framework maintains compatibility between evolving skills and downstream policies while ensuring efficiency throughout the learning process across diverse SIL scenarios.

Conclusion: SIL-C enables skill improvements to enhance downstream policy performance without requiring policy retraining or structural adaptation, addressing a key limitation in skill incremental learning systems.

Abstract: Skill Incremental Learning (SIL) is the process by which an embodied agent expands and refines its skill set over time by leveraging experience gained through interaction with its environment or by the integration of additional data. SIL facilitates efficient acquisition of hierarchical policies grounded in reusable skills for downstream tasks. However, as the skill repertoire evolves, it can disrupt compatibility with existing skill-based policies, limiting their reusability and generalization. In this work, we propose SIL-C, a novel framework that ensures skill-policy compatibility, allowing improvements in incrementally learned skills to enhance the performance of downstream policies without requiring policy re-training or structural adaptation. SIL-C employs a bilateral lazy learning-based mapping technique to dynamically align the subtask space referenced by policies with the skill space decoded into agent behaviors. This enables each subtask, derived from the policy’s decomposition of a complex task, to be executed by selecting an appropriate skill based on trajectory distribution similarity. We evaluate SIL-C across diverse SIL scenarios and demonstrate that it maintains compatibility between evolving skills and downstream policies while ensuring efficiency throughout the learning process.
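
The lazy mapping step, matching a subtask to the skill whose rollouts look most similar, can be sketched with a kernel MMD between trajectory sets. MMD is one plausible choice of distribution similarity; the paper's exact measure and interface may differ.

```python
import numpy as np

def mmd(x, y, gamma=1.0):
    # Gaussian-kernel MMD between two sets of (flattened) trajectories.
    def k(a, b):
        d = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
        return np.exp(-gamma * d)
    return k(x, x).mean() + k(y, y).mean() - 2 * k(x, y).mean()

def select_skill(subtask_traj, skill_trajs):
    # Lazily map a subtask to the skill whose rollout distribution is closest.
    return int(np.argmin([mmd(subtask_traj, t) for t in skill_trajs]))

rng = np.random.default_rng(0)
subtask = rng.normal(0.5, 1.0, size=(30, 6))           # subtask demonstrations
skills = [rng.normal(m, 1.0, size=(30, 6)) for m in (0.0, 0.5, 2.0)]
print("chosen skill:", select_skill(subtask, skills))  # likely skill 1
```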

[406] Physically Plausible Multi-System Trajectory Generation and Symmetry Discovery

Jiayin Liu, Yulong Yang, Vineet Bansal, Christine Allen-Blanchette

Main category: cs.LG

TL;DR: SPS-GAN is a novel neural network that learns dynamics of multiple physical systems, generalizes to unseen parameters, and discovers configuration space structure from various measurements without prior knowledge.

DetailsMotivation: Existing neural network models with mechanical inductive biases typically capture dynamics of single systems with fixed parameters and require known configuration spaces. There's a need for models that can handle multiple systems, generalize to unseen parameters, and discover configuration space structure from arbitrary measurements.

Method: SPS-GAN embeds a Hamiltonian neural network recurrent module in a conditional GAN backbone. It optimizes the conditional time-series GAN objective with an additional physically motivated term that encourages sparse representation of configuration space structure.

Result: The model captures multiple systems and achieves performance comparable to supervised models designed for single systems. It demonstrates utility for trajectory prediction, video generation, and symmetry discovery.

Conclusion: SPS-GAN successfully addresses limitations of previous physics-inspired neural networks by handling multiple systems, generalizing to unseen parameters, and discovering configuration space structure from arbitrary measurements while maintaining physical plausibility.

Abstract: From metronomes to celestial bodies, mechanics underpins how the world evolves in time and space. With this in mind, a number of recent neural network models leverage inductive biases from classical mechanics to encourage model interpretability and ensure forecasted states are physical. However, in general, these models are designed to capture the dynamics of a single system with fixed physical parameters, from state-space measurements of a known configuration space. In this paper we introduce Symplectic Phase Space GAN (SPS-GAN), which can capture the dynamics of multiple systems and generalize to unseen physical parameters. Moreover, SPS-GAN does not require prior knowledge of the system configuration space. In fact, SPS-GAN can discover the configuration space structure of the system from arbitrary measurement types (e.g., state-space measurements, video frames). To achieve physically plausible generation, we introduce a novel architecture which embeds a Hamiltonian neural network recurrent module in a conditional GAN backbone. To discover the structure of the configuration space, we optimize the conditional time-series GAN objective with an additional physically motivated term that encourages a sparse representation of the configuration space. We demonstrate the utility of SPS-GAN for trajectory prediction, video generation and symmetry discovery. Our approach captures multiple systems and achieves performance on par with supervised models designed for single systems.
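
The Hamiltonian recurrent module at the heart of the architecture follows the standard Hamiltonian-neural-network recipe: learn a scalar H(q, p) and roll states forward with Hamilton's equations via autograd. Layer sizes, the step size, and the integrator below are illustrative choices, not the paper's exact module.

```python
import torch
import torch.nn as nn

class HamiltonianStep(nn.Module):
    # Rolls phase-space states forward with Hamilton's equations,
    # dq/dt = dH/dp and dp/dt = -dH/dq, using a learned scalar H.
    def __init__(self, dim, dt=0.1):
        super().__init__()
        self.H = nn.Sequential(nn.Linear(2 * dim, 64), nn.Tanh(),
                               nn.Linear(64, 1))
        self.dt = dt

    def forward(self, q, p):
        qp = torch.cat([q, p], dim=-1).requires_grad_(True)
        grad = torch.autograd.grad(self.H(qp).sum(), qp, create_graph=True)[0]
        dHdq, dHdp = grad.chunk(2, dim=-1)
        return q + self.dt * dHdp, p - self.dt * dHdq

step = HamiltonianStep(dim=2)
q, p = torch.randn(8, 2), torch.randn(8, 2)
for _ in range(5):                                # short imagined rollout
    q, p = (s.detach() for s in step(q, p))       # detach: demo only
print(q.shape, p.shape)
```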

[407] OptiMind: Teaching LLMs to Think Like Optimization Experts

Xinzhi Zhang, Zeyi Chen, Humishka Zope, Hugo Barbalho, Konstantina Mellou, Marco Molinaro, Janardhan Kulkarni, Ishai Menache, Sirui Li

Main category: cs.LG

TL;DR: OptiMind: An LLM framework that integrates optimization expertise to significantly improve mathematical programming formulation accuracy by preventing common class-based errors.

DetailsMotivation: Mathematical programming requires operations research expertise and is skill-intensive. Current LLM approaches for automating natural language to optimization models have limited accuracy due to scarce/noisy training data and lack of domain knowledge integration.

Method: OptiMind framework uses semi-automated, class-based error analysis to guide training and inference, explicitly preventing common mistakes within each optimization class. Fine-tunes LLMs with this optimization expertise.

Result: Significantly improves formulation accuracy by 20.7% across multiple optimization benchmarks. Shows consistent gains with test-time scaling methods like self-consistency and multi-turn feedback.

Conclusion: Systematic integration of optimization expertise enables robust LLM-assisted optimization formulation, representing important progress toward automating mathematical programming tasks.

Abstract: Mathematical programming – the task of expressing operations and decision-making problems in precise mathematical language – is fundamental across domains, yet remains a skill-intensive process requiring operations research expertise. Recent advances in large language models for complex reasoning have spurred interest in automating this task, translating natural language into executable optimization models. Current approaches, however, achieve limited accuracy, hindered by scarce and noisy training data without leveraging domain knowledge. In this work, we systematically integrate optimization expertise to improve formulation accuracy for mixed-integer linear programming, a key family of mathematical programs. Our OptiMind framework leverages semi-automated, class-based error analysis to guide both training and inference, explicitly preventing common mistakes within each optimization class. Our resulting fine-tuned LLM significantly improves formulation accuracy by 20.7% across multiple optimization benchmarks, with consistent gains under test-time scaling methods such as self-consistency and multi-turn feedback, enabling further progress toward robust LLM-assisted optimization formulation.

[408] Coupled Data and Measurement Space Dynamics for Enhanced Diffusion Posterior Sampling

Shayan Mohajer Hamidi, En-Hui Yang, Ben Liang

Main category: cs.LG

TL;DR: C-DPS: A novel diffusion-based framework for inverse problems that couples data and measurement spaces to enable closed-form posterior sampling without heuristic updates or likelihood approximations.

DetailsMotivation: Existing diffusion-based methods for inverse problems rely on projection-based techniques with heuristic updates or likelihood approximations, leading to artifacts and instability under complex or high-noise conditions. There's a need for a more principled approach that avoids these limitations.

Method: Proposes coupled data and measurement space diffusion posterior sampling (C-DPS), which introduces a forward stochastic process in the measurement space evolving parallel to data-space diffusion. This coupling enables derivation of a closed-form posterior p(x_{t-1} | x_t, y_{t-1}) for accurate recursive sampling without constraint tuning or likelihood approximation.

Result: C-DPS consistently outperforms existing baselines both qualitatively and quantitatively across multiple inverse problem benchmarks, demonstrating improved performance under complex and high-noise conditions.

Conclusion: The proposed C-DPS framework provides a principled solution to inverse problems by coupling data and measurement spaces, enabling closed-form posterior sampling that eliminates artifacts and instability issues present in existing methods.

Abstract: Inverse problems, where the goal is to recover an unknown signal from noisy or incomplete measurements, are central to applications in medical imaging, remote sensing, and computational biology. Diffusion models have recently emerged as powerful priors for solving such problems. However, existing methods either rely on projection-based techniques that enforce measurement consistency through heuristic updates, or they approximate the likelihood $p(\boldsymbol{y} \mid \boldsymbol{x})$, often resulting in artifacts and instability under complex or high-noise conditions. To address these limitations, we propose a novel framework called \emph{coupled data and measurement space diffusion posterior sampling} (C-DPS), which eliminates the need for constraint tuning or likelihood approximation. C-DPS introduces a forward stochastic process in the measurement space $\{\boldsymbol{y}_t\}$, evolving in parallel with the data-space diffusion $\{\boldsymbol{x}_t\}$, which enables the derivation of a closed-form posterior $p(\boldsymbol{x}_{t-1} \mid \boldsymbol{x}_t, \boldsymbol{y}_{t-1})$. This coupling allows for accurate and recursive sampling based on a well-defined posterior distribution. Empirical results demonstrate that C-DPS consistently outperforms existing baselines, both qualitatively and quantitatively, across multiple inverse problem benchmarks.

[409] Hydrogen production from blended waste biomass: pyrolysis, thermodynamic-kinetic analysis and AI-based modelling

Sana Kordoghli, Abdelhakim Settar, Oumayma Belaati, Mohammad Alkhatib, Khaled Chetehouna, Zakaria Mansouri

Main category: cs.LG

TL;DR: AI-enhanced pyrolysis optimization of food biomass blends (date seeds & coffee grounds) for hydrogen production, with LSTM achieving near-perfect TGA curve predictions.

DetailsMotivation: To advance sustainable energy and waste management by exploring underutilized food biomass resources for hydrogen production through optimized pyrolysis processes, leveraging AI for enhanced modeling accuracy.

Method: Comprehensive characterization (proximate, ultimate, fiber, TGA/DTG, kinetic, thermodynamic, Py-Micro GC) of pure DS, SCG, and their blends; kinetic modeling using isoconversional methods (KAS, FWO, Friedman); and AI integration via LSTM model trained on lignocellulosic data for TGA curve prediction.

Result: Blend 3 (25% DS - 75% SCG) showed superior hydrogen yield potential but highest activation energy (313.24 kJ/mol), while Blend 1 (75% DS - 25% SCG) had best activation energy (161.75 kJ/mol). KAS method was most accurate kinetic model. LSTM achieved exceptional TGA prediction accuracy (R²: 0.9996-0.9998).

Conclusion: AI-enhanced pyrolysis modeling successfully optimizes food biomass conversion for hydrogen production, with specific blend compositions offering different trade-offs between hydrogen yield and process energy requirements, demonstrating the potential of integrated AI approaches for sustainable energy systems.

Abstract: This work contributes to advancing sustainable energy and waste management strategies by investigating the thermochemical conversion of food-based biomass through pyrolysis, highlighting the role of artificial intelligence (AI) in enhancing process modelling accuracy and optimization efficiency. The main objective is to explore the potential of underutilized biomass resources, such as spent coffee grounds (SCG) and date seeds (DS), for sustainable hydrogen production. Specifically, it aims to optimize the pyrolysis process while evaluating the performance of these resources both individually and as blends. Proximate, ultimate, fibre, TGA/DTG, kinetic, thermodynamic, and Py-Micro GC analyses were conducted for pure DS, SCG, and blends (75% DS - 25% SCG, 50% DS - 50% SCG, 25% DS - 75% SCG). Blend 3 offered superior hydrogen yield potential but had the highest activation energy (Ea: 313.24 kJ/mol), while Blend 1 exhibited the best activation energy value (Ea: 161.75 kJ/mol). The kinetic modelling based on isoconversional methods (KAS, FWO, Friedman) identified KAS as the most accurate. These approaches provide a detailed understanding of the pyrolysis process, with particular emphasis on the integration of artificial intelligence. An LSTM model trained with lignocellulosic data predicted TGA curves with exceptional accuracy (R^2: 0.9996-0.9998).
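
As a rough illustration of the AI component, the sketch below shows a minimal PyTorch LSTM regressor mapping a sequence of thermal features to a remaining-mass fraction, in the spirit of predicting TGA curves. The input features, shapes, and sizes are assumptions; the paper does not specify its architecture in the abstract.

```python
import torch
import torch.nn as nn

# Minimal sketch of an LSTM regressor for TGA curves, assuming each input is
# a sequence of (temperature, heating-rate) features and the target is the
# remaining-mass fraction at each step. All sizes are illustrative.

class TGALSTM(nn.Module):
    def __init__(self, n_features=2, hidden=64):
        super().__init__()
        self.lstm = nn.LSTM(n_features, hidden, batch_first=True)
        self.head = nn.Linear(hidden, 1)

    def forward(self, x):                       # x: (batch, time, features)
        h, _ = self.lstm(x)
        return self.head(h).squeeze(-1)         # (batch, time) mass fraction

model = TGALSTM()
x = torch.randn(16, 200, 2)                     # dummy batch of 200-step scans
loss = nn.functional.mse_loss(model(x), torch.rand(16, 200))
loss.backward()                                 # one illustrative training step
```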

[410] Instance-Dependent Regret Bounds for Nonstochastic Linear Partial Monitoring

Federico Di Gennaro, Khaled Eldowa, Nicolò Cesa-Bianchi

Main category: cs.LG

TL;DR: Linear partial monitoring extends classic partial monitoring to infinite outcome spaces with linear structure, decoupling losses and observations. The paper presents efficient algorithms with improved regret bounds that transparently depend on game structure.

DetailsMotivation: Classic partial monitoring has limitations with finite outcome spaces. Linear partial monitoring can model infinite outcome spaces while maintaining linear structure, generalizing linear bandits by flexibly decoupling losses and feedback. There's a need for efficient algorithms with transparent regret bounds that better reflect game structure.

Method: Addresses nonstochastic (adversarial) finite-actions linear partial monitoring using a simple instance of exploration-by-optimization method that is amenable to efficient implementation. The approach yields regret bounds with instance-specific quantities reflecting alignment between observations and losses.

Result: Derived regret bounds that depend on game structure more transparently than previous guarantees. Bounds feature instance-specific quantities reflecting observation-loss alignment, achieving √T rate in easy (locally observable) games and T^{2/3} in hard (globally observable) games. These bounds are instantiated in various partial information settings and shown to be tight in interesting cases.

Conclusion: The paper presents an efficient algorithm for linear partial monitoring with improved theoretical guarantees that transparently capture game structure. The results bridge stochastic and nonstochastic settings and provide tight bounds for various partial information problems.

Abstract: In contrast to the classic formulation of partial monitoring, linear partial monitoring can model infinite outcome spaces, while imposing a linear structure on both the losses and the observations. This setting can be viewed as a generalization of linear bandits where loss and feedback are decoupled in a flexible manner. In this work, we address a nonstochastic (adversarial), finite-actions version of the problem through a simple instance of the exploration-by-optimization method that is amenable to efficient implementation. We derive regret bounds that depend on the game structure in a more transparent manner than previous theoretical guarantees for this paradigm. Our bounds feature instance-specific quantities that reflect the degree of alignment between observations and losses, and resemble known guarantees in the stochastic setting. Notably, they achieve the standard $\sqrt{T}$ rate in easy (locally observable) games and $T^{2/3}$ in hard (globally observable) games, where $T$ is the time horizon. We instantiate these bounds in a selection of old and new partial information settings subsumed by this model, and illustrate that the achieved dependence on the game structure can be tight in interesting cases.

[411] Uncertainty-Aware Multi-Objective Reinforcement Learning-Guided Diffusion Models for 3D De Novo Molecular Design

Lianghong Chen, Dongkyu Eugene Kim, Mike Domaratzki, Pingzhao Hu

Main category: cs.LG

TL;DR: RL-guided diffusion model for multi-objective 3D molecular design with uncertainty-aware reward shaping

DetailsMotivation: Current diffusion models struggle to control complex multi-objective constraints needed for real-world drug discovery applications

Method: Uncertainty-aware RL framework using surrogate models with predictive uncertainty estimation to dynamically shape reward functions for balancing multiple optimization objectives

Result: Outperforms baselines across three benchmark datasets and multiple diffusion architectures; MD simulations and ADMET profiling show promising drug-like behavior comparable to known EGFR inhibitors

Conclusion: RL-guided generative diffusion models have strong potential for advancing automated molecular design

Abstract: Designing de novo 3D molecules with desirable properties remains a fundamental challenge in drug discovery and molecular engineering. While diffusion models have demonstrated remarkable capabilities in generating high-quality 3D molecular structures, they often struggle to effectively control complex multi-objective constraints critical for real-world applications. In this study, we propose an uncertainty-aware Reinforcement Learning (RL) framework to guide the optimization of 3D molecular diffusion models toward multiple property objectives while enhancing the overall quality of the generated molecules. Our method leverages surrogate models with predictive uncertainty estimation to dynamically shape reward functions, facilitating balance across multiple optimization objectives. We comprehensively evaluate our framework across three benchmark datasets and multiple diffusion model architectures, consistently outperforming baselines for molecular quality and property optimization. Additionally, Molecular Dynamics (MD) simulations and ADMET profiling of top generated candidates indicate promising drug-like behavior and binding stability, comparable to known Epidermal Growth Factor Receptor (EGFR) inhibitors. Our results demonstrate the strong potential of RL-guided generative diffusion models for advancing automated molecular design.
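
A minimal sketch of the uncertainty-aware reward shaping idea, assuming an ensemble of surrogate property predictors whose disagreement serves as the uncertainty estimate; the penalty form, beta, and objective weights are illustrative, not the paper's exact formulation.

```python
import numpy as np

# Sketch of uncertainty-aware reward shaping: score a generated molecule by
# ensemble-mean surrogate predictions, discounted by predictive uncertainty
# (ensemble spread). The penalty form and weights are assumptions.

def shaped_reward(props_ensemble, weights, beta=0.5):
    """props_ensemble: (n_models, n_objectives) surrogate predictions."""
    mean = props_ensemble.mean(axis=0)
    std = props_ensemble.std(axis=0)            # predictive uncertainty
    return float(np.dot(weights, mean - beta * std))

preds = np.random.default_rng(1).normal(size=(5, 3))  # 5 models, 3 objectives
print(shaped_reward(preds, weights=np.array([0.5, 0.3, 0.2])))
```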

[412] Beyond Uniform SVD: Dual-Level Optimization across Columns and Modules for LLM Compression

Lin Xv, Xian Gao, Ting Li, Yuzhuo Fu

Main category: cs.LG

TL;DR: Duo-SVD is a training-free framework that optimizes SVD-based LLM compression at both column and module levels to address decomposition error disparities and lack of importance metrics in existing methods.

DetailsMotivation: Current SVD-based compression methods for LLMs have two key limitations: 1) they ignore that decomposition errors vary significantly across different matrix components, leading to suboptimal approximations, and 2) they lack direct metrics to evaluate the importance of individual weight matrices.

Method: Duo-SVD combines two strategies: 1) Column-Preserving Strategy that retains columns with high decomposition errors while applying low-rank approximation only to columns with lower errors, and 2) Module-Adaptive Allocation Strategy that formulates ratio allocation as a global constrained optimization problem based on perturbation-induced model deviation.

Result: Extensive experiments show Duo-SVD consistently outperforms state-of-the-art SVD-based baselines and structured pruning methods, establishing it as a superior paradigm for efficient LLM compression.

Conclusion: Duo-SVD provides an effective training-free framework for LLM compression that addresses key limitations of existing SVD methods through dual-level optimization, achieving better performance than current approaches.

Abstract: Low-rank decomposition, particularly Singular Value Decomposition (SVD), is a pivotal technique for mitigating the storage and computational demands of Large Language Models (LLMs). However, prevalent SVD-based approaches overlook the critical phenomenon that decomposition errors exhibit significant disparity across different components of the parameter matrix, often leading to suboptimal approximation. Furthermore, existing methods lack a direct metric to evaluate the importance of individual weight matrices. To address these limitations, we propose Duo-SVD (Dual-level Optimization SVD), a novel training-free framework that synergizes optimization at both the column and the module levels. First, Duo-SVD incorporates a Column-Preserving Strategy that explicitly retains columns exhibiting high decomposition errors, while applying low-rank approximation solely to those with lower errors. Second, at the module level, we employ a Module-Adaptive Allocation Strategy that formulates ratio allocation as a global constrained optimization problem based on perturbation-induced model deviation. Extensive experiments demonstrate that Duo-SVD consistently outperforms state-of-the-art SVD-based baselines and structured pruning methods, establishing it as a superior paradigm for efficient LLM compression.
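
A small numpy sketch of the Column-Preserving Strategy, under the assumption that per-column reconstruction error of a rank-r approximation is the retention criterion; the keep fraction and rank are illustrative, and the Module-Adaptive Allocation step is omitted.

```python
import numpy as np

# Sketch of the Column-Preserving Strategy: measure each column's error under
# a rank-r SVD approximation, keep the worst-hit columns exactly, and apply
# the low-rank factorization only to the rest.

def column_preserving_svd(W, rank, keep_frac=0.1):
    U, S, Vt = np.linalg.svd(W, full_matrices=False)
    W_lr = U[:, :rank] @ np.diag(S[:rank]) @ Vt[:rank]
    col_err = np.linalg.norm(W - W_lr, axis=0)          # per-column error
    n_keep = int(keep_frac * W.shape[1])
    keep = np.argsort(col_err)[-n_keep:]                # highest-error columns
    rest = np.setdiff1d(np.arange(W.shape[1]), keep)
    U2, S2, Vt2 = np.linalg.svd(W[:, rest], full_matrices=False)
    return keep, W[:, keep], (U2[:, :rank], S2[:rank], Vt2[:rank])

W = np.random.default_rng(2).normal(size=(256, 256))
keep, kept_cols, factors = column_preserving_svd(W, rank=32)
print(f"kept {len(keep)} columns exactly; factorized the remaining {256 - len(keep)}")
```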

[413] ELUTQ: Optimizing Quantization Accuracy under LUT-Based Computation for Edge LLMs

Xin Nie, Liang Dong, Haicheng Zhang, Jiawang Xiao, G. Sun

Main category: cs.LG

TL;DR: ELUTQ is an efficient quantization framework using Hierarchical Linear Quantization (HLQ) format that improves low-bit weight quantization accuracy, eliminates dequantization overhead via bit-serial LUT-based GEMM, and reduces hardware requirements for large model quantization.

DetailsMotivation: Existing hardware-friendly uniform quantization methods suffer from poor weight-distribution fitting and high dequantization overhead under low-bit settings, limiting deployment of Large Language Models on edge devices.

Method: Proposes ELUTQ framework with Hierarchical Linear Quantization (HLQ) format that captures weight statistical characteristics better, uses bit-serial LUT-based GEMM operations to eliminate dequantization overhead, and includes optimized quantization pipeline with high-performance kernels for edge deployment.

Result: Achieves QAT-comparable accuracy without retraining, quantizes LLaMA 3.1-70B with only 64GB CPU + 48GB VRAM, and 2-bit LLaMA3.1-8B shows 1.5x speedup over AWQ on RTX 3090.

Conclusion: ELUTQ enables efficient low-bit quantization for LLMs with improved accuracy, reduced hardware requirements, and faster inference on edge devices through its novel HLQ format and optimized deployment pipeline.

Abstract: Weight quantization effectively reduces memory consumption and enables the deployment of Large Language Models on edge devices, yet existing hardware-friendly methods often rely on uniform quantization, which suffers from poor weight-distribution fitting and high dequantization overhead under low-bit settings. In this paper, we propose ELUTQ, an efficient quantization framework featuring a novel quantization format termed Hierarchical Linear Quantization (HLQ). HLQ is designed to better capture the statistical characteristics of weights and eliminate dequantization overhead using Bit-serial LUT-based GEMM operations. HLQ significantly improves model accuracy under low-bit settings and achieves performance comparable to QAT methods without any retraining of the weights. Moreover, an optimized quantization pipeline is integrated into ELUTQ, enabling it to complete the quantization of LLaMA 3.1-70B using only 64 GB of CPU memory and 48 GB of VRAM, reducing the hardware requirements for large-scale model quantization. To enable efficient deployment on edge devices, ELUTQ designs high-performance kernels to support end-to-end inference. Our 2-bit LLaMA3.1-8B achieves a 1.5x speedup over AWQ on RTX 3090. Code is available at https://github.com/Nkniexin/ELUTQ.
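
The HLQ format itself is not detailed in the abstract, but the bit-serial LUT-based GEMM it builds on can be sketched: for each small group of activations, precompute all signed partial sums once, then let each binary weight plane index the table instead of multiplying. The group size and the {-1, +1} plane convention below are illustrative simplifications, not ELUTQ's kernel.

```python
import numpy as np
from itertools import product

# Sketch of the bit-serial LUT-GEMM trick underlying dequantization-free
# inference: for each group of g activations, precompute all 2^g signed
# partial sums once; each binary weight plane then indexes the table.

g = 4
rng = np.random.default_rng(5)
x = rng.normal(size=16)                          # activations
w_bits = rng.integers(0, 2, size=16)             # one binary weight plane

acc = 0.0
for start in range(0, len(x), g):
    group = x[start:start + g]
    # Table of all sign patterns for this group (built once per group; in a
    # real kernel it is reused across every output row and weight plane).
    lut = {bits: sum(s * v for s, v in zip(bits, group))
           for bits in product((-1, 1), repeat=g)}
    key = tuple(1 if b else -1 for b in w_bits[start:start + g])
    acc += lut[key]

print("LUT result:", acc)
print("reference :", float(np.dot(np.where(w_bits == 1, 1, -1), x)))
```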

[414] MISA: Memory-Efficient LLMs Optimization with Module-wise Importance Sampling

Yuxi Liu, Renjia Deng, Yutong He, Xue Wang, Tao Yao, Kun Yuan

Main category: cs.LG

TL;DR: MISA is a memory-efficient optimization method for LLMs that samples modules within layers based on importance scores, reducing gradient variance and memory usage compared to layer-wise approaches.

DetailsMotivation: Layer-wise optimization for LLMs saves memory but ignores varying importance of modules within each layer and provides limited memory savings since at least one full layer must remain active.

Method: Divide each transformer layer into smaller modules, assign importance scores to each module, and use weighted random sampling to activate modules during optimization, reducing gradient variance.

Result: Proves O(1/√K) convergence rate under non-convex stochastic conditions, provides detailed memory analysis showing superiority over baselines, and validates effectiveness on diverse learning tasks.

Conclusion: MISA offers a more fine-grained, memory-efficient optimization approach for LLMs that outperforms layer-wise methods by addressing module importance variation and achieving better memory savings.

Abstract: The substantial memory demands of pre-training and fine-tuning large language models (LLMs) require memory-efficient optimization algorithms. One promising approach is layer-wise optimization, which treats each transformer block as a single layer and optimizes it sequentially, while freezing the other layers to save optimizer states and activations. Although effective, these methods ignore the varying importance of the modules within each layer, leading to suboptimal performance. Moreover, layer-wise sampling provides only limited memory savings, as at least one full layer must remain active during optimization. To overcome these limitations, we propose Module-wise Importance SAmpling (MISA), a novel method that divides each layer into smaller modules and assigns importance scores to each module. MISA uses a weighted random sampling mechanism to activate modules, provably reducing gradient variance compared to layer-wise sampling. Additionally, we establish an $\mathcal{O}(1/\sqrt{K})$ convergence rate under non-convex and stochastic conditions, where $K$ is the total number of block updates, and provide a detailed memory analysis showcasing MISA’s superiority over existing baseline methods. Experiments on diverse learning tasks validate the effectiveness of MISA. Source code is available at https://github.com/pkumelon/MISA.
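
A minimal PyTorch sketch of one module-wise sampling step, assuming gradient-norm-like importance scores and a fixed number of active modules per update; MISA's actual scoring and variance-reduction machinery are not reproduced here.

```python
import torch

# Sketch of one MISA-style update step: divide a transformer layer into
# modules, sample a subset by importance-weighted sampling, and freeze the
# rest. The scores below are placeholders for the paper's importance metric.

def sample_active_modules(modules, scores, k=2):
    probs = torch.tensor(scores) / sum(scores)
    idx = set(torch.multinomial(probs, k, replacement=False).tolist())
    for i, module in enumerate(modules):
        for p in module.parameters():
            p.requires_grad_(i in idx)          # freeze unsampled modules
    return idx

layer = torch.nn.TransformerEncoderLayer(d_model=64, nhead=4, batch_first=True)
modules = [layer.self_attn, layer.linear1, layer.linear2, layer.norm1]
active = sample_active_modules(modules, scores=[3.0, 2.0, 2.0, 0.5])
print("active module indices:", sorted(active))
```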

[415] Neural Emulator Superiority: When Machine Learning for PDEs Surpasses its Training Data

Felix Koehler, Nils Thuerey

Main category: cs.LG

TL;DR: Neural emulators trained on low-fidelity solver data can outperform their training source when evaluated against higher-fidelity references, challenging conventional assumptions about data fidelity limitations.

DetailsMotivation: The paper challenges the conventional assumption that neural operators/emulators for PDEs are inherently limited by the fidelity of their training data from numerical solvers. The authors aim to demonstrate that neural networks can potentially surpass their training data limitations.

Method: Theoretical analysis of how emulator inductive biases, training objectives, and numerical error characteristics interact to enable superior performance during multi-step rollouts. Empirical validation across different PDEs using standard neural architectures to demonstrate the phenomenon.

Result: The paper identifies “emulator superiority” where neural networks trained purely on low-fidelity solver data achieve higher accuracy than those solvers when evaluated against higher-fidelity references. Emulators can implicitly learn more regularized dynamics with more favorable error accumulation properties than their training data.

Conclusion: This work prompts a re-evaluation of emulator benchmarking, suggesting neural emulators might achieve greater physical fidelity than their training source within specific operational regimes, potentially mitigating numerical artifacts and surpassing training data limitations.

Abstract: Neural operators or emulators for PDEs trained on data from numerical solvers are conventionally assumed to be limited by their training data’s fidelity. We challenge this assumption by identifying “emulator superiority,” where neural networks trained purely on low-fidelity solver data can achieve higher accuracy than those solvers when evaluated against a higher-fidelity reference. Our theoretical analysis reveals how the interplay between emulator inductive biases, training objectives, and numerical error characteristics enables superior performance during multi-step rollouts. We empirically validate this finding across different PDEs using standard neural architectures, demonstrating that emulators can implicitly learn dynamics that are more regularized or exhibit more favorable error accumulation properties than their training data, potentially surpassing training data limitations and mitigating numerical artifacts. This work prompts a re-evaluation of emulator benchmarking, suggesting neural emulators might achieve greater physical fidelity than their training source within specific operational regimes. Project Page: https://tum-pbs.github.io/emulator-superiority

[416] When do spectral gradient updates help in deep learning?

Damek Davis, Dmitriy Drusvyatskiy

Main category: cs.LG

TL;DR: Spectral gradient methods like Muon can outperform Euclidean gradient descent when gradients have high nuclear-to-Frobenius ratio and activations have low stable rank, with this advantage scaling with data dimension.

DetailsMotivation: Spectral gradient methods show promise for training deep neural networks and transformers, but it's unclear when they outperform standard Euclidean gradient descent. The paper aims to identify specific conditions where spectral updates are more effective.

Method: Proposes a layerwise condition comparing squared nuclear-to-Frobenius ratio of gradients to stable rank of incoming activations. Analyzes this condition theoretically in random feature regression, feedforward networks, and transformer blocks, showing activations have low stable rank at Gaussian initialization. In spiked random feature models, demonstrates Euclidean gradient’s nuclear-to-Frobenius ratio grows with dimension while activation stable rank remains bounded.

Result: Theoretical analysis shows spectral updates have dimension-scaling advantage. Experimental validation in synthetic regression and NanoGPT-scale language model training confirms intermediate activations maintain low stable rank throughout training, and gradients maintain large nuclear-to-Frobenius ratios.

Conclusion: The paper identifies concrete conditions for spectral gradient methods like Muon to be effective: when gradients have high nuclear-to-Frobenius ratio and activations have low stable rank, with this advantage scaling with data dimension. These conditions are satisfied in practical deep network and transformer training.

Abstract: Spectral gradient methods, such as the recently popularized Muon optimizer, are a promising alternative to standard Euclidean gradient descent for training deep neural networks and transformers, but it is still unclear in which regimes they are expected to perform better. We propose a simple layerwise condition that predicts when a spectral update yields a larger decrease in the loss than a Euclidean gradient step. This condition compares, for each parameter block, the squared nuclear-to-Frobenius ratio of the gradient to the stable rank of the incoming activations. To understand when this condition may be satisfied, we first prove that post-activation matrices have low stable rank at Gaussian initialization in random feature regression, feedforward networks, and transformer blocks. In spiked random feature models we then show that, after a short burn-in, the Euclidean gradient’s nuclear-to-Frobenius ratio grows with the data dimension while the stable rank of the activations remains bounded, so the predicted advantage of spectral updates scales with dimension. We validate these predictions in synthetic regression experiments and in NanoGPT-scale language model training, where we find that intermediate activations have low stable rank throughout training and the corresponding gradients maintain large nuclear-to-Frobenius ratios. Together, these results identify conditions for spectral gradient methods, such as Muon, to be effective in training deep networks and transformers.
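
The layerwise condition is cheap to check in practice. The sketch below computes both sides of the comparison for a parameter block, with a spectral step predicted to help when the gradient's squared nuclear-to-Frobenius ratio exceeds the activations' stable rank; the matrices are random stand-ins.

```python
import numpy as np

# Sketch of the layerwise diagnostic: a spectral (Muon-style) update is
# predicted to beat a Euclidean step for a block when the gradient's squared
# nuclear-to-Frobenius ratio exceeds the stable rank of the incoming
# activations. G and A are illustrative stand-ins.

def nuclear_frobenius_ratio_sq(G):
    s = np.linalg.svd(G, compute_uv=False)
    return (s.sum() ** 2) / (s ** 2).sum()      # (||G||_* / ||G||_F)^2

def stable_rank(A):
    s = np.linalg.svd(A, compute_uv=False)
    return (s ** 2).sum() / s[0] ** 2           # ||A||_F^2 / ||A||_2^2

rng = np.random.default_rng(3)
G = rng.normal(size=(128, 64))                  # block gradient
A = rng.normal(size=(512, 16)) @ rng.normal(size=(16, 64))  # low-rank activations
print("gradient ratio^2      :", nuclear_frobenius_ratio_sq(G))
print("activation stable rank:", stable_rank(A))
print("spectral step favored :", nuclear_frobenius_ratio_sq(G) > stable_rank(A))
```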

[417] LeMat-GenBench: A Unified Evaluation Framework for Crystal Generative Models

Siddharth Betala, Samuel P. Gleason, Ali Ramlaoui, Andy Xu, Georgia Channing, Daniel Levy, Clémentine Fourrier, Nikita Kazeev, Chaitanya K. Joshi, Sékou-Oumar Kaba, Félix Therrien, Alex Hernandez-Garcia, Rocío Mercado, N. M. Anoop Krishnan, Alexandre Duval

Main category: cs.LG

TL;DR: LeMat-GenBench is a unified benchmark for generative ML models of crystalline materials, providing standardized evaluation metrics and a public leaderboard to enable fair comparison and guide development of more reliable discovery-oriented models.

DetailsMotivation: The lack of standardized evaluation frameworks makes it challenging to evaluate, compare, and further develop generative ML models for inorganic crystal discovery, despite their great promise for accelerating materials exploration.

Method: Introduces LeMat-GenBench, a unified benchmark with evaluation metrics designed to inform model development and downstream applications. Includes an open-source evaluation suite and public Hugging Face leaderboard, used to benchmark 12 recent generative models.

Result: Results show that increased stability in generated materials leads to decreased novelty and diversity on average, with no single model excelling across all evaluation dimensions.

Conclusion: LeMat-GenBench establishes a reproducible and extensible foundation for fair model comparison and aims to guide development of more reliable, discovery-oriented generative models for crystalline materials.

Abstract: Generative machine learning (ML) models hold great promise for accelerating materials discovery through the inverse design of inorganic crystals, enabling an unprecedented exploration of chemical space. Yet, the lack of standardized evaluation frameworks makes it challenging to evaluate, compare, and further develop these ML models meaningfully. In this work, we introduce LeMat-GenBench, a unified benchmark for generative models of crystalline materials, supported by a set of evaluation metrics designed to better inform model development and downstream applications. We release both an open-source evaluation suite and a public leaderboard on Hugging Face, and benchmark 12 recent generative models. Results reveal that an increase in stability leads to a decrease in novelty and diversity on average, with no model excelling across all dimensions. Altogether, LeMat-GenBench establishes a reproducible and extensible foundation for fair model comparison and aims to guide the development of more reliable, discovery-oriented generative models for crystalline materials.

[418] Fusion or Confusion? Multimodal Complexity Is Not All You Need

Tillmann Rheude, Roland Eils, Benjamin Wild

Main category: cs.LG

TL;DR: Complex multimodal architectures don’t reliably outperform simple late-fusion baselines under standardized conditions.

DetailsMotivation: Challenge the assumption that complex multimodal-specific methods inherently improve performance over simpler approaches.

Method: Large-scale empirical study reimplementing 19 high-impact methods under standardized conditions, evaluating across 9 datasets with up to 23 modalities, and proposing SimBaMM (Simple Baseline for Multimodal Learning) - a late-fusion Transformer architecture.

Result: Complex methods perform on par with SimBaMM and often fail to consistently outperform well-tuned unimodal baselines, especially in small-data settings. No reliable performance advantage for complex architectures.

Conclusion: Shift focus from architectural novelty to methodological rigor; include reliability checklist for comparable, robust evaluations.

Abstract: Deep learning architectures for multimodal learning have increased in complexity, driven by the assumption that multimodal-specific methods improve performance. We challenge this assumption through a large-scale empirical study reimplementing 19 high-impact methods under standardized conditions. We evaluate them across nine diverse datasets with up to 23 modalities, and test their generalizability to new tasks beyond their original scope, including settings with missing modalities. We propose a Simple Baseline for Multimodal Learning (SimBaMM), a late-fusion Transformer architecture, and demonstrate that under standardized experimental conditions with rigorous hyperparameter tuning of all methods, more complex architectures do not reliably outperform SimBaMM. Statistical analyses show that complex methods perform on par with SimBaMM and often fail to consistently outperform well-tuned unimodal baselines, especially in small-data settings. To support our findings, we include a case study highlighting common methodological shortcomings in the literature followed by a pragmatic reliability checklist to promote comparable, robust, and trustworthy future evaluations. In summary, we argue for a shift in focus: away from the pursuit of architectural novelty and toward methodological rigor.
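
A minimal PyTorch sketch of a late-fusion Transformer baseline in the spirit of SimBaMM: encode each modality into one token, fuse the tokens with a small Transformer encoder, and pool. Dimensions, depth, and the classification head are assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

# Sketch of a SimBaMM-style late-fusion baseline: one linear encoder per
# modality, a small Transformer over the resulting modality tokens, and a
# pooled classification head. All sizes are illustrative.

class LateFusionBaseline(nn.Module):
    def __init__(self, modality_dims, d_model=128, n_classes=10):
        super().__init__()
        self.encoders = nn.ModuleList([nn.Linear(d, d_model) for d in modality_dims])
        enc_layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.fusion = nn.TransformerEncoder(enc_layer, num_layers=2)
        self.head = nn.Linear(d_model, n_classes)

    def forward(self, inputs):                  # list of (batch, dim) tensors
        tokens = torch.stack([enc(x) for enc, x in zip(self.encoders, inputs)], dim=1)
        fused = self.fusion(tokens).mean(dim=1)  # pool over modality tokens
        return self.head(fused)

model = LateFusionBaseline(modality_dims=[32, 64, 16])
logits = model([torch.randn(8, 32), torch.randn(8, 64), torch.randn(8, 16)])
print(logits.shape)                             # torch.Size([8, 10])
```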

[419] Evaluating Anomaly Detectors for Simulated Highly Imbalanced Industrial Classification Problems

Lesley Wheat, Martin v. Mohrenschildt, Saeid Habibi

Main category: cs.LG

TL;DR: This paper evaluates anomaly detection algorithms for industrial applications with extreme class imbalance, finding that the best detector depends on the number of faulty examples available, with unsupervised methods dominating when fewer than 20 faulty examples exist, and semi-supervised/supervised methods showing large performance gains with 30-50 faulty examples.

DetailsMotivation: Machine learning offers solutions for industrial quality control and predictive maintenance, but faces challenges with extreme class imbalance due to limited faulty data availability. There's a need to understand how different anomaly detection methods perform under realistic industrial constraints with varying data conditions.

Method: Comprehensive evaluation using a problem-agnostic simulated dataset with hyper-spherical anomaly distribution in 2D and 10D. Benchmarking 14 detectors across training datasets with anomaly rates from 0.05% to 20% and training sizes from 1,000 to 10,000 samples (with 40,000 test samples).

Result: Best detector depends on total number of faulty examples: unsupervised methods (kNN/LOF) dominate with <20 faulty examples; semi-supervised (XGBOD) and supervised (SVM/CatBoost) show large performance increases with 30-50 faulty examples. Semi-supervised methods show benefits at ten features but not with two features. Performance drops significantly on generalization with smaller datasets.

Conclusion: The study provides practical insights for deploying anomaly detection in industrial environments, highlighting the importance of considering available faulty examples when selecting detection methods and showing that additional healthy examples offer insignificant benefits in most cases.

Abstract: Machine learning offers potential solutions to current issues in industrial systems in areas such as quality control and predictive maintenance, but also faces unique barriers in industrial applications. An ongoing challenge is extreme class imbalance, primarily due to the limited availability of faulty data during training. This paper presents a comprehensive evaluation of anomaly detection algorithms using a problem-agnostic simulated dataset that reflects real-world engineering constraints. Using a synthetic dataset with a hyper-spherical based anomaly distribution in 2D and 10D, we benchmark 14 detectors across training datasets with anomaly rates between 0.05% and 20% and training sizes between 1,000 and 10,000 (with a testing dataset size of 40,000) to assess performance and generalization error. Our findings reveal that the best detector is highly dependent on the total number of faulty examples in the training dataset, with additional healthy examples offering insignificant benefits in most cases. With fewer than 20 faulty examples, unsupervised methods (kNN/LOF) dominate; but with around 30-50 faulty examples, semi-supervised (XGBOD) and supervised (SVM/CatBoost) detectors show large performance increases. While semi-supervised methods do not show significant benefits with only two features, the improvements are evident at ten features. The study highlights the drop in generalization performance of anomaly detection methods on smaller datasets, and provides practical insights for deploying anomaly detection in industrial environments.
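
For reference, the sketch below runs the two unsupervised detectors the study finds dominant in the fewest-fault regime, kNN distance and LOF, using scikit-learn; the synthetic data here only loosely mimics the paper's hyper-spherical setup.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor, NearestNeighbors

# Sketch of the two unsupervised detectors that dominate the low-fault
# regime in the study, both fit on healthy data only.

rng = np.random.default_rng(4)
healthy = rng.normal(size=(1000, 10))                   # training: healthy only
X_test = np.vstack([rng.normal(size=(200, 10)),         # healthy test points
                    rng.normal(size=(20, 10)) * 3.0])   # spread-out anomalies

knn = NearestNeighbors(n_neighbors=5).fit(healthy)
knn_score = knn.kneighbors(X_test)[0].mean(axis=1)      # mean kNN distance

lof = LocalOutlierFactor(n_neighbors=20, novelty=True).fit(healthy)
lof_score = -lof.score_samples(X_test)                  # higher = more anomalous

print("top kNN-flagged indices:", np.argsort(knn_score)[-5:])
print("top LOF-flagged indices:", np.argsort(lof_score)[-5:])
```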

[420] Sobolev Approximation of Deep ReLU Networks in Log-Barron Space

Changhoon Song, Seungchan Ko, Youngjoon Hong

Main category: cs.LG

TL;DR: The paper introduces log-weighted Barron spaces that require weaker regularity assumptions than classical Barron spaces, enabling better understanding of why deep ReLU networks work well on high-dimensional data with reduced regularity requirements.

DetailsMotivation: Classical Barron space theory explains neural network approximation but requires stronger regularity than Sobolev spaces, and existing depth-sensitive results have restrictive constraints. There's a need to better understand why deep networks work well on high-dimensional data with reduced regularity requirements.

Method: Introduce log-weighted Barron space B^log with weaker assumptions than classical B^s spaces. Study embedding properties, conduct statistical analysis via Rademacher complexity, prove approximation bounds for deep ReLU networks with explicit depth dependence, define family B^{s,log}, establish H^1 norm bounds, and identify maximal depth scales.

Result: Functions in B^log can be approximated by deep ReLU networks with explicit depth dependence. The new space requires strictly weaker assumptions than classical Barron spaces, and the paper identifies depth scales where approximation rates are preserved, showing how depth reduces regularity requirements.

Conclusion: The log-weighted Barron space framework provides a more precise explanation for deep network performance beyond classical Barron settings, clarifying how depth reduces regularity requirements for efficient representation in high-dimensional problems.

Abstract: Universal approximation theorems show that neural networks can approximate any continuous function; however, the number of parameters may grow exponentially with the ambient dimension, so these results do not fully explain the practical success of deep models on high-dimensional data. Barron space theory addresses this: if a target function belongs to a Barron space, a two-layer network with $n$ parameters achieves an $O(n^{-1/2})$ approximation error in $L^2$. Yet classical Barron spaces $\mathscr{B}^{s+1}$ still require stronger regularity than Sobolev spaces $H^s$, and existing depth-sensitive results often assume constraints such as $sL \le 1/2$. In this paper, we introduce a log-weighted Barron space $\mathscr{B}^{\log}$, which requires a strictly weaker assumption than $\mathscr{B}^s$ for any $s>0$. For this new function space, we first study embedding properties and carry out a statistical analysis via the Rademacher complexity. Then we prove that functions in $\mathscr{B}^{\log}$ can be approximated by deep ReLU networks with explicit depth dependence. We then define a family $\mathscr{B}^{s,\log}$, establish approximation bounds in the $H^1$ norm, and identify maximal depth scales under which these rates are preserved. Our results clarify how depth reduces regularity requirements for efficient representation, offering a more precise explanation for the performance of deep architectures beyond the classical Barron setting, and for their stable use in the high-dimensional problems encountered today.

[421] DatBench: Discriminative, Faithful, and Efficient VLM Evaluations

DatologyAI, :, Siddharth Joshi, Haoli Yin, Rishabh Adiga, Ricardo Monti, Aldo Carranza, Alex Fang, Alvin Deng, Amro Abbas, Brett Larsen, Cody Blakeney, Darren Teh, David Schwab, Fan Pan, Haakon Mongstad, Jack Urbanek, Jason Lee, Jason Telanoff, Josh Wills, Kaleigh Mentzer, Luke Merrick, Parth Doshi, Paul Burstein, Pratyush Maini, Scott Loftin, Spandan Das, Tony Jiang, Vineeth Dorna, Zhengping Wang, Bogdan Gaza, Ari Morcos, Matthew Leavitt

Main category: cs.LG

TL;DR: The paper proposes DatBench, a cleaned evaluation suite for vision-language models that addresses critical failure modes in current benchmarks through transformation and filtering to improve faithfulness, discriminability, and efficiency.

DetailsMotivation: Current evaluation methods for vision-language models have critical flaws: multiple-choice formats reward guessing, many questions can be answered without images (blindly solvable), datasets contain mislabeled samples, and evaluation consumes excessive compute (up to 20% of development resources).

Method: Instead of creating new benchmarks, the authors curate existing ones by transforming multiple-choice questions to generative tasks and filtering out blindly solvable and mislabeled samples. They create DatBench-Full (33 datasets across 9 VLM capabilities) and DatBench (a discriminative subset).

Result: Converting multiple-choice to generative tasks reveals capability drops up to 35%. Filtering improves discriminative power while reducing computational cost. DatBench achieves 13x average speedup (up to 50x) while maintaining discriminative power comparable to original datasets.

Conclusion: The work provides a path toward more rigorous and sustainable VLM evaluation practices that satisfy three key desiderata: faithfulness to modality/application, discriminability between models, and computational efficiency.

Abstract: Empirical evaluation serves as the primary compass guiding research progress in foundation models. Despite a large body of work focused on training frontier vision-language models (VLMs), approaches to their evaluation remain nascent. To guide their maturation, we propose three desiderata that evaluations should satisfy: (1) faithfulness to the modality and application, (2) discriminability between models of varying quality, and (3) efficiency in compute. Through this lens, we identify critical failure modes that violate faithfulness and discriminability, misrepresenting model capabilities: (i) multiple-choice formats reward guessing, poorly reflect downstream use cases, and saturate early as models improve; (ii) blindly solvable questions, which can be answered without images, constitute up to 70% of some evaluations; and (iii) mislabeled or ambiguous samples compromise up to 42% of examples in certain datasets. Regarding efficiency, the computational burden of evaluating frontier models has become prohibitive: by some accounts, nearly 20% of development compute is devoted to evaluation alone. Rather than discarding existing benchmarks, we curate them via transformation and filtering to maximize fidelity and discriminability. We find that converting multiple-choice questions to generative tasks reveals sharp capability drops of up to 35%. In addition, filtering blindly solvable and mislabeled samples improves discriminative power while simultaneously reducing computational cost. We release DatBench-Full, a cleaned evaluation suite of 33 datasets spanning nine VLM capabilities, and DatBench, a discriminative subset that achieves 13x average speedup (up to 50x) while closely matching the discriminative power of the original datasets. Our work outlines a path toward evaluation practices that are both rigorous and sustainable as VLMs continue to scale.
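
A sketch of the blindly-solvable filter described above: query the model with the question text but no image, and drop samples it still answers correctly. The `ask` callable is a hypothetical wrapper around a VLM API, not DatBench's actual implementation.

```python
# Sketch of the "blindly solvable" filter: ask the VLM the question without
# the image; if it still answers correctly, the sample does not test visual
# grounding and is dropped. `ask` is a hypothetical callable, an assumption.

def filter_blindly_solvable(samples, ask):
    kept = []
    for s in samples:                            # s: {"question", "answer", "image"}
        blind = ask(question=s["question"], image=None)
        if blind.strip().lower() != s["answer"].strip().lower():
            kept.append(s)                       # the image is needed to solve it
    return kept

# Usage: clean = filter_blindly_solvable(dataset, ask=my_vlm_query_fn)
```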

[422] Horizon Activation Mapping for Neural Networks in Time Series Forecasting

Hans Krupakar, V A Kandappan

Main category: cs.LG

TL;DR: HAM (Horizon Activation Mapping) is a visual interpretability technique for time series forecasting models that uses gradient norm averages to analyze subseries importance across different horizons, enabling model-agnostic comparison and selection.

DetailsMotivation: Existing interpretability approaches for time series forecasting are architecture-specific and don't work across different model families, making model comparison and selection difficult. There's a need for a unified interpretability method that applies to diverse neural network architectures.

Method: HAM adapts grad-CAM concepts to time series by using gradient norm averages to study subseries importance across forecasting horizons. It introduces causal/anti-causal modes and lines of proportionality to analyze uniform distributions. The method is tested across various architectures (MLP-based, attention-based, SSM-based, diffusion-based) on the ETTm2 dataset.

Result: HAM reveals interesting patterns: batch size differences show potential exponential approximations, NHITS demonstrates neural approximation theorem patterns, and SpaceTime shows exponential autoregressive activities. The technique works across diverse model families and provides insights into training dynamics.

Conclusion: HAM enables granular model selection, validation set choices, and comparisons across different neural network families for time series forecasting, providing a unified interpretability framework that transcends architectural differences.

Abstract: Neural networks for time series forecasting have relied on error metrics and architecture-specific interpretability approaches for model selection that don’t apply across models of different families. To interpret forecasting models agnostic to the types of layers across state-of-the-art model families, we introduce Horizon Activation Mapping (HAM), a visual interpretability technique inspired by grad-CAM that uses gradient norm averages to study the horizon’s subseries, whereas grad-CAM studies activation maps over image data. We introduce causal and anti-causal modes to calculate gradient update norm averages across subseries at every timestep, and lines of proportionality signifying uniform distributions of the norm averages. Optimization landscape effects of batch sizes, early stopping, train-val-test splits, architectural choices, univariate forecasting and dropouts are studied with respect to performances and subseries in HAM. Interestingly, batch-size-based differences in activities suggest the existence of an exponential approximation relating them to one another per epoch. Multivariate forecasting models including MLP-based CycleNet, N-Linear, N-HITS, self-attention-based FEDformer, Pyraformer, SSM-based SpaceTime and diffusion-based Multi-Resolution DDPM over different horizon sizes trained over the ETTm2 dataset are used for HAM plots in this study. NHITS’ neural approximation theorem and SpaceTime’s exponential autoregressive activities have been attributed to trends in HAM plots over their training, validation and test sets. In general, HAM can be used for granular model selection, validation set choices and comparisons across different neural network model families.
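
A minimal sketch of the core HAM computation, assuming a per-timestep decomposition of the horizon: for each forecast step, back-propagate the loss restricted to that subseries and record the parameter-gradient norm. The stand-in linear forecaster and loss are illustrative.

```python
import torch

# Sketch of the HAM idea: average gradient norms of the loss restricted to
# each horizon subseries, giving a per-timestep importance profile.

horizon, lookback = 24, 96
model = torch.nn.Linear(lookback, horizon)     # stand-in forecaster
x = torch.randn(32, lookback)
y = torch.randn(32, horizon)

norms = []
for t in range(horizon):                       # one subseries per timestep
    model.zero_grad()
    loss_t = torch.nn.functional.mse_loss(model(x)[:, t], y[:, t])
    loss_t.backward()
    g = torch.cat([p.grad.flatten() for p in model.parameters()])
    norms.append(g.norm().item())

print("per-timestep gradient norms:", [round(n, 3) for n in norms[:5]], "...")
```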

[423] Differential Privacy for Transformer Embeddings of Text with Nonparametric Variational Information Bottleneck

Dina El Zein, James Henderson

Main category: cs.LG

TL;DR: Privacy-preserving text data sharing via noisy transformer embeddings using Nonparametric Variational Differential Privacy (NVDP) that balances privacy and utility.

DetailsMotivation: Transformer embeddings can encode sensitive information, making it possible for adversaries to recover input data, especially problematic since transformer embeddings consist of multiple vectors per token.

Method: NVDP integrates a nonparametric variational information bottleneck (NVIB) layer into transformer architecture to inject noise into multivector embeddings, using differential privacy approach with Rényi Divergence and Bayesian Differential Privacy guarantees.

Result: Tested on GLUE benchmark, varying noise levels provides useful privacy-accuracy trade-off; lower noise levels maintain high accuracy while offering strong privacy guarantees.

Conclusion: NVDP effectively balances privacy and utility for text data sharing by protecting sensitive information in transformer embeddings while maintaining downstream task performance.

Abstract: We propose a privacy-preserving method for sharing text data by sharing noisy versions of their transformer embeddings. It has been shown that hidden representations learned by deep models can encode sensitive information from the input, making it possible for adversaries to recover the input data with considerable accuracy. This problem is exacerbated in transformer embeddings because they consist of multiple vectors, one per token. To mitigate this risk, we propose Nonparametric Variational Differential Privacy (NVDP), which ensures both useful data sharing and strong privacy protection. We take a differential privacy (DP) approach, integrating a nonparametric variational information bottleneck (NVIB) layer into the transformer architecture to inject noise into its multivector embeddings and thereby hide information, and measuring privacy protection with Rényi Divergence (RD) and its corresponding Bayesian Differential Privacy (BDP) guarantee. Training the NVIB layer calibrates the noise level according to the utility of the downstream task. We test NVDP on the General Language Understanding Evaluation (GLUE) benchmark and show that varying the noise level gives us a useful trade-off between privacy and accuracy. With lower noise levels, our model maintains high accuracy while offering strong privacy guarantees, effectively balancing privacy and utility.
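
The basic mechanism NVDP calibrates can be sketched in a few lines: add Gaussian noise to each token's embedding before sharing. The fixed sigma below replaces the learned NVIB layer that adapts the noise level to the downstream task.

```python
import torch

# Sketch of the noise-injection mechanism NVDP builds on: Gaussian noise is
# added to each token's embedding before sharing. A fixed sigma stands in
# for the learned, task-calibrated NVIB layer.

def noisy_embeddings(token_embeddings, sigma=0.3):
    """token_embeddings: (n_tokens, d) transformer outputs to be shared."""
    return token_embeddings + sigma * torch.randn_like(token_embeddings)

emb = torch.randn(12, 768)                     # e.g., one sentence's embeddings
shared = noisy_embeddings(emb)
print("signal-to-noise per token:", (emb.norm(dim=1) / (shared - emb).norm(dim=1))[:3])
```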

[424] Do Sparse Autoencoders Identify Reasoning Features in Language Models?

George Ma, Zhongyuan Liang, Irene Y. Chen, Somayeh Sojoudi

Main category: cs.LG

TL;DR: SAE features identified by current contrastive methods capture linguistic correlates rather than genuine reasoning computations in LLMs.

DetailsMotivation: To investigate whether sparse autoencoders (SAEs) actually identify genuine reasoning features in large language models, or if they're capturing superficial linguistic patterns instead.

Method: 1) Theoretical analysis showing ℓ₁-regularized SAEs are biased toward low-dimensional patterns; 2) Falsification framework combining causal token injection and LLM-guided falsification; 3) Testing across 20 configurations spanning multiple model families, layers, and reasoning datasets.

Result: 45-90% of features activate when associated tokens are injected into non-reasoning text; remaining features can be activated by non-reasoning inputs or fail to activate on reasoning inputs; no analyzed feature satisfies criteria for genuine reasoning behavior; steering features yields no benchmark improvements.

Conclusion: SAE features from current contrastive approaches primarily capture linguistic correlates of reasoning rather than the underlying reasoning computations themselves.

Abstract: We investigate whether sparse autoencoders (SAEs) identify genuine reasoning features in large language models (LLMs). We first show through a simple theoretical analysis that $\ell_1$-regularized SAEs are intrinsically biased toward low-dimensional patterns, providing a mechanistic explanation for why shallow linguistic cues may be preferentially captured over distributed reasoning behaviors. Motivated by this bias, we introduce a falsification-oriented evaluation framework that combines causal token injection and LLM-guided falsification to test whether feature activation reflects reasoning processes or superficial linguistic correlates. Across 20 configurations spanning multiple model families, layers, and reasoning datasets, we find that features identified by contrastive methods are highly sensitive to token-level interventions, with 45% to 90% activating when a small number of associated tokens are injected into non-reasoning text. For the remaining features, LLM-guided falsification consistently produces non-reasoning inputs that activate the feature and reasoning inputs that do not, with no analyzed feature satisfying our criteria for genuine reasoning behavior. Steering these features yields no improvements in benchmark performance. Overall, our results suggest that SAE features identified by current contrastive approaches primarily capture linguistic correlates of reasoning rather than the underlying reasoning computations themselves. Code is available at https://github.com/GeorgeMLP/reasoning-probing.
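
A sketch of the causal token-injection test: append tokens associated with a candidate feature to non-reasoning text and measure how often the feature fires. `feature_activation` is a hypothetical helper wrapping a model forward pass plus SAE encoding; the token list and threshold are placeholders.

```python
# Sketch of the token-injection test: splice tokens associated with a
# candidate "reasoning" SAE feature into non-reasoning text and check
# whether the feature still fires. `feature_activation` is a hypothetical
# callable, not a real library API.

ASSOCIATED_TOKENS = ["therefore", "hence", "it follows that"]

def injection_test(feature_activation, base_texts, threshold=0.5):
    fired = 0
    for text in base_texts:
        injected = text + " " + ASSOCIATED_TOKENS[0]   # minimal injection
        if feature_activation(injected) > threshold:
            fired += 1
    return fired / len(base_texts)                     # 45-90% in the paper

# Usage: rate = injection_test(my_feature_fn, ["The recipe calls for flour."])
```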

[425] The Hessian of tall-skinny networks is easy to invert

Ali Rahimi

Main category: cs.LG

TL;DR: An exact algorithm for solving linear systems involving neural network Hessians that computes Hessian-inverse-vector products without storing the full Hessian or its inverse, with linear scaling in layers.

DetailsMotivation: Solving linear systems with neural network Hessians is computationally expensive using naive approaches that require quadratic storage and cubic time in the number of layers. There's a need for efficient methods that avoid storing the full Hessian matrix.

Method: The method computes Hessian-inverse-vector products directly without explicitly computing or storing the Hessian or its inverse. It achieves linear scaling in both time and storage relative to the number of network layers.

Result: The algorithm provides exact solutions to Hessian linear systems with linear time and storage complexity in the number of layers, contrasting with the naive approach’s quadratic storage and cubic time complexity.

Conclusion: This Hessian-inverse-vector product method offers a practical solution for working with neural network Hessians, with computational efficiency comparable to Pearlmutter’s Hessian-vector product algorithm.

Abstract: We describe an exact algorithm to solve linear systems of the form $Hx=b$ where $H$ is the Hessian of a deep net. The method computes Hessian-inverse-vector products without storing the Hessian or its inverse. It requires time and storage that scale linearly in the number of layers. This is in contrast to the naive approach of first computing the Hessian, then solving the linear system, which takes storage and time that are respectively quadratic and cubic in the number of layers. The Hessian-inverse-vector product method scales roughly like Pearlmutter’s algorithm for computing Hessian-vector products.
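
The paper's exact layer-linear algorithm is not given in the abstract; for orientation, the sketch below shows the standard matrix-free alternative it should be compared against: solving Hx = b by conjugate gradients with Pearlmutter-style Hessian-vector products from double backprop, so the Hessian is never formed or stored.

```python
import torch

# Matrix-free baseline for Hx = b (not the paper's algorithm): conjugate
# gradients driven by Pearlmutter-style Hessian-vector products.

def hvp(loss_fn, w, v):
    g = torch.autograd.grad(loss_fn(w), w, create_graph=True)[0]
    return torch.autograd.grad(g @ v, w)[0]    # d/dw (g . v) = H v

def cg_solve(loss_fn, w, b, iters=50, tol=1e-10):
    x = torch.zeros_like(b)
    r = b - hvp(loss_fn, w, x)
    p = r.clone()
    for _ in range(iters):
        if r @ r < tol:
            break
        Hp = hvp(loss_fn, w, p)
        alpha = (r @ r) / (p @ Hp)
        x = x + alpha * p
        r_new = r - alpha * Hp
        p = r_new + ((r_new @ r_new) / (r @ r)) * p
        r = r_new
    return x

w = torch.randn(20, requires_grad=True)
loss_fn = lambda w: (w ** 4).sum() + (w ** 2).sum()  # toy convex loss
b = torch.randn(20)
x = cg_solve(loss_fn, w, b)
print("residual:", (hvp(loss_fn, w, x) - b).norm().item())
```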

[426] Why are there many equally good models? An Anatomy of the Rashomon Effect

Harsh Parikh

Main category: cs.LG

TL;DR: The paper explores three categories of causes for the Rashomon effect in ML: statistical (finite samples, noise), structural (non-convexity, unobserved variables), and procedural (algorithm limitations, model restrictions), with implications for inference, interpretability, fairness, and decision-making.

DetailsMotivation: The Rashomon effect - existence of multiple distinct models with similar predictive performance - is a fundamental phenomenon in modern ML that needs systematic understanding. The paper aims to explore and categorize the underlying causes of this multiplicity to provide a unified framework for analysis.

Method: The authors synthesize insights from machine learning, statistics, and optimization literature to develop a three-category framework: statistical sources (finite samples, data noise), structural sources (non-convex optimization, unobserved variables causing non-identifiability), and procedural sources (algorithm limitations, deliberate model restrictions).

Result: The key distinction shows that statistical multiplicity diminishes with more data, structural multiplicity persists asymptotically and requires different data or assumptions to resolve, while procedural multiplicity reflects practitioner choices. The framework provides a systematic way to understand why multiple good models exist.

Conclusion: The Rashomon effect has important implications for inference, interpretability, fairness, and decision-making under uncertainty. Understanding its causes helps distinguish between different types of multiplicity and provides guidance for addressing both challenges and opportunities presented by model multiplicity in practice.

Abstract: The Rashomon effect – the existence of multiple, distinct models that achieve nearly equivalent predictive performance – has emerged as a fundamental phenomenon in modern machine learning and statistics. In this paper, we explore the causes underlying the Rashomon effect, organizing them into three categories: statistical sources arising from finite samples and noise in the data-generating process; structural sources arising from non-convexity of optimization objectives and unobserved variables that create fundamental non-identifiability; and procedural sources arising from limitations of optimization algorithms and deliberate restrictions to suboptimal model classes. We synthesize insights from machine learning, statistics, and optimization literature to provide a unified framework for understanding why the multiplicity of good models arises. A key distinction emerges: statistical multiplicity diminishes with more data, structural multiplicity persists asymptotically and cannot be resolved without different data or additional assumptions, and procedural multiplicity reflects choices made by practitioners. Beyond characterizing causes, we discuss both the challenges and opportunities presented by the Rashomon effect, including implications for inference, interpretability, fairness, and decision-making under uncertainty.

[427] Enhancing Large Language Models for Time-Series Forecasting via Vector-Injected In-Context Learning

Jianqi Zhang, Jingyao Wang, Wenwen Qiang, Fanjiang Xu, Changwen Zheng

Main category: cs.LG

TL;DR: LVICL: Vector-injected in-context learning for time series forecasting with frozen LLMs, improving performance without computational overhead.

DetailsMotivation: LLMs for time series forecasting face challenges: pretraining data differs from time series, direct application hurts quality, fine-tuning is computationally expensive. Need to improve forecasting performance while freezing LLM parameters to reduce overhead.

Method: LVICL (vector-injected ICL) uses a learnable context vector adapter to extract compressed example information from multiple time series examples, then injects this vector into every layer of a frozen LLM during forward pass, eliciting in-context learning ability.

Result: Extensive experiments demonstrate effectiveness. Vector injection doesn’t increase prompt length, adaptively derives context vector that suppresses harmful components, improving forecasting performance compared to conventional ICL.

Conclusion: LVICL addresses dual challenge of prediction performance and compute overhead in LLM4TSF by using vector-injected ICL with frozen LLMs, achieving better forecasting without parameter updates.

Abstract: The World Wide Web needs reliable predictive capabilities to respond to changes in user behavior and usage patterns. Time series forecasting (TSF) is a key means to achieve this goal. In recent years, large language models (LLMs) for TSF (LLM4TSF) have achieved good performance. However, there is a significant difference between pretraining corpora and time series data, making it hard to guarantee forecasting quality when directly applying LLMs to TSF; fine-tuning LLMs can mitigate this issue, but often incurs substantial computational overhead. Thus, LLM4TSF faces a dual challenge of prediction performance and compute overhead. To address this, we aim to explore a method for improving the forecasting performance of LLM4TSF while freezing all LLM parameters to reduce computational overhead. Inspired by in-context learning (ICL), we propose LVICL. LVICL uses our vector-injected ICL to inject example information into a frozen LLM, eliciting its in-context learning ability and thereby enhancing its performance on the example-related task (i.e., TSF). Specifically, we first use the LLM together with a learnable context vector adapter to extract a context vector from multiple examples adaptively. This vector contains compressed, example-related information. Subsequently, during the forward pass, we inject this vector into every layer of the LLM to improve forecasting performance. Compared with conventional ICL that adds examples into the prompt, our vector-injected ICL does not increase prompt length; moreover, adaptively deriving a context vector from examples suppresses components harmful to forecasting, thereby improving model performance. Extensive experiments demonstrate the effectiveness of our approach.
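
A sketch of the injection mechanism, assuming the context vector is added to the hidden states of every layer via forward hooks on a frozen model; the adapter that compresses examples into the vector is omitted, and the hook targets are illustrative.

```python
import torch

# Sketch of vector injection into a frozen LLM: a learned context vector is
# added to every layer's hidden states through forward hooks, so only the
# vector (produced elsewhere by the example adapter) is trained.

class VectorInjector:
    def __init__(self, layers, d_model):
        self.ctx = torch.nn.Parameter(torch.zeros(d_model))  # learned vector
        self.handles = [l.register_forward_hook(self._inject) for l in layers]

    def _inject(self, module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        hidden = hidden + self.ctx               # broadcast over (B, T, d)
        return (hidden, *output[1:]) if isinstance(output, tuple) else hidden

# Hypothetical usage with a frozen decoder-only model:
#   for p in model.parameters(): p.requires_grad_(False)
#   injector = VectorInjector(model.transformer.h, d_model=768)
#   optimizer = torch.optim.Adam([injector.ctx], lr=1e-3)
```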

cs.MA

[428] MACRO-LLM: LLM-Empowered Multi-Agent Collaborative Reasoning under Spatiotemporal Partial Observability

Handi Chen, Running Zhao, Xiuzhe Wu, Edith C. H. Ngai

Main category: cs.MA

TL;DR: MACRO-LLM: LLM-based multi-agent framework that addresses spatiotemporal partial observability in distributed systems through three specialized modules for temporal verification, spatial conflict resolution, and adaptive strategy refinement.

DetailsMotivation: LLM agents in real-world distributed systems face spatiotemporal partial observability - limited local perception and finite temporal horizons due to physical dispersion, which hinders efficient coordination among distributed agents.

Method: Three-module architecture: (1) CoProposer verifies candidate actions via predictive rollouts to mitigate temporal uncertainty, (2) Negotiator resolves conflicts through mean-field statistical aggregation to overcome spatial myopia, (3) Introspector analyzes historical experience to refine strategies via semantic gradient descent for continuous adaptation.

Result: Extensive evaluations on cooperative adaptive cruise control and pandemic control tasks demonstrate effective mitigation of spatiotemporal partial observability through spatial and temporal strategies, enabling robust coordination in complex long-horizon scenarios.

Conclusion: MACRO-LLM framework successfully addresses the fundamental challenge of spatiotemporal partial observability in distributed LLM agents, providing a comprehensive solution through specialized modules for temporal, spatial, and adaptive reasoning capabilities.

Abstract: Large Language Model (LLM) agents deployed in complex real-world scenarios typically operate as spatially distributed entities. However, this physical dispersion constrains agents to limited local perception and finite temporal horizons. We characterize this bottleneck as spatiotemporal partial observability. Given such fragmented awareness, distributed agents struggle to coordinate efficiently. To bridge this gap, we introduce MACRO-LLM, LLM-empowered multi-agent collaborative reasoning under spatiotemporal partial observability. The architecture addresses spatiotemporal constraints via three modules: (1) the CoProposer mitigates temporal uncertainty by verifying candidate actions via predictive rollouts; (2) the Negotiator overcomes spatial myopia by resolving conflicts through mean-field statistical aggregation; and (3) the Introspector ensures continuous adaptation by analyzing historical experience to refine strategies via semantic gradient descent. Extensive evaluations on two complex long-horizon tasks, cooperative adaptive cruise control and pandemic control, demonstrate that our framework effectively mitigates spatiotemporal partial observability through spatial and temporal strategies, enabling robust coordination.

[429] SC-MAS: Constructing Cost-Efficient Multi-Agent Systems with Edge-Level Heterogeneous Collaboration

Di Zhao, Longhui Ma, Siwei Wang, Miao Wang, Yi Kong

Main category: cs.MA

TL;DR: SC-MAS is a framework for building heterogeneous multi-agent systems that uses Social Capital Theory to enable different collaboration strategies between agent pairs, improving accuracy while reducing costs compared to homogeneous approaches.

DetailsMotivation: Current LLM-based multi-agent systems incur high costs and use homogeneous collaboration modes where all agents follow the same interaction pattern, limiting flexibility. Social Capital Theory suggests different roles benefit from distinct collaboration forms, motivating a heterogeneous approach.

Method: SC-MAS models MAS as directed graphs with edges representing pairwise collaboration strategies. A unified controller constructs executable MAS by selecting relevant agent roles, assigning edge-level collaboration strategies, and allocating appropriate LLM backbones to individual agents.

Result: SC-MAS improves accuracy by 3.35% on MMLU while reducing inference cost by 15.38%, and achieves 3.53% accuracy gain with 12.13% cost reduction on MBPP, demonstrating effectiveness of heterogeneous collaboration.

Conclusion: SC-MAS validates the feasibility of heterogeneous multi-agent systems and highlights the effectiveness of tailored collaboration strategies in improving performance while reducing costs, addressing limitations of homogeneous approaches.

Abstract: Large Language Model (LLM)-based Multi-Agent Systems (MAS) enhance complex problem solving through multi-agent collaboration, but often incur substantially higher costs than single-agent systems. Recent MAS routing methods aim to balance performance and overhead by dynamically selecting agent roles and language models. However, these approaches typically rely on a homogeneous collaboration mode, where all agents follow the same interaction pattern, limiting collaboration flexibility across different roles. Motivated by Social Capital Theory, which emphasizes that different roles benefit from distinct forms of collaboration, we propose SC-MAS, a framework for constructing heterogeneous and cost-efficient multi-agent systems. SC-MAS models MAS as directed graphs, where edges explicitly represent pairwise collaboration strategies, allowing different agent pairs to interact through tailored communication patterns. Given an input query, a unified controller progressively constructs an executable MAS by selecting task-relevant agent roles, assigning edge-level collaboration strategies, and allocating appropriate LLM backbones to individual agents. Experiments on multiple benchmarks demonstrate the effectiveness of SC-MAS. In particular, SC-MAS improves accuracy by 3.35% on MMLU while reducing inference cost by 15.38%, and achieves a 3.53% accuracy gain with a 12.13% cost reduction on MBPP. These results validate the feasibility of SC-MAS and highlight the effectiveness of heterogeneous collaboration in multi-agent systems.
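
The edge-level formulation can be pictured with a small data structure: agents as nodes carrying a role and an allocated backbone, and directed edges carrying a pairwise collaboration strategy. The role names, backbone labels, and strategies below are made up for illustration.

```python
from dataclasses import dataclass, field

@dataclass
class AgentNode:
    role: str       # task-relevant role selected by the controller
    backbone: str   # LLM backbone allocated to this agent

@dataclass
class CollabEdge:
    src: str
    dst: str
    strategy: str   # edge-level collaboration strategy for this pair

@dataclass
class MASGraph:
    """Directed graph where nodes are agents and each edge carries its own
    pairwise collaboration strategy, as in the SC-MAS formulation."""
    nodes: dict[str, AgentNode] = field(default_factory=dict)
    edges: list[CollabEdge] = field(default_factory=list)

# Hypothetical instance built by a controller for one query.
mas = MASGraph()
mas.nodes["planner"] = AgentNode("planner", "large-general-llm")
mas.nodes["coder"] = AgentNode("coder", "small-code-llm")
mas.edges.append(CollabEdge("planner", "coder", "review"))
```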

[430] CrowdLLM: Building LLM-Based Digital Populations Augmented with Generative Models

Ryan Feng Lin, Keyu Tian, Hanming Zheng, Congjing Zhang, Li Zeng, Shuai Huang

Main category: cs.MA

TL;DR: CrowdLLM integrates pretrained LLMs with generative models to create more accurate and diverse digital populations that better match real human crowds.

DetailsMotivation: Existing LLM-based digital populations lack accuracy and diversity compared to real human populations, limiting their effectiveness for applications like social simulation, crowdsourcing, marketing, and recommendation systems.

Method: Proposes CrowdLLM, which integrates pretrained large language models with generative models to enhance diversity and fidelity of digital populations.

Result: Theoretical analysis shows CrowdLLM can create cost-effective, representative, scalable digital populations matching real crowd quality. Experiments across multiple domains demonstrate promising performance in accuracy and distributional fidelity to human data.

Conclusion: CrowdLLM addresses limitations of existing LLM-only approaches by combining LLMs with generative models, creating more realistic digital populations that can effectively substitute for human participants in various applications.

Abstract: The emergence of large language models (LLMs) has sparked much interest in creating LLM-based digital populations that can be applied to many applications such as social simulation, crowdsourcing, marketing, and recommendation systems. A digital population can reduce the cost of recruiting human participants and alleviate many concerns related to human subject studies. However, research has found that most existing works rely solely on LLMs and cannot sufficiently capture the accuracy and diversity of a real human population. To address this limitation, we propose CrowdLLM, which integrates pretrained LLMs and generative models to enhance the diversity and fidelity of the digital population. We conduct theoretical analysis of CrowdLLM regarding its great potential in creating cost-effective, sufficiently representative, scalable digital populations that can match the quality of a real crowd. Comprehensive experiments are also conducted across multiple domains (e.g., crowdsourcing, voting, user rating) and simulation studies, which demonstrate that CrowdLLM achieves promising performance in both accuracy and distributional fidelity to human data.

cs.MM

eess.AS

[431] Integrated Minimum Mean Squared Error Algorithms for Combined Acoustic Echo Cancellation and Noise Reduction

Arnout Roebben, Toon van Waterschoot, Jan Wouters, Marc Moonen

Main category: eess.AS

TL;DR: Proposes integrated approach for combined noise reduction and acoustic echo cancellation using single signal model and cost function, showing equivalence between cascade algorithms under unified framework.

DetailsMotivation: Traditional cascade approaches for noise reduction and acoustic echo cancellation are designed separately without accounting for their interaction, leading to suboptimal performance.

Method: Uses single signal model (microphone or extended vector), formulates single mean squared error cost function, derives multi-channel Wiener filter (MWF) and extended MWF (MWFext), shows equivalence to cascade algorithms.

Result: Demonstrates MWFext is equivalent to AEC-NR, NR-AEC, and NRext-AEC-PF cascade algorithms; under rank-deficiency, MWFext is non-unique; AEC-NR and NRext-AEC-PF achieve best practical performance.

Conclusion: Integrated approach provides unified framework for combined NR and AEC, showing equivalence between cascade structures while accounting for their interaction, with practical advantages for certain cascade configurations.

Abstract: In many speech recording applications, noise and acoustic echo corrupt the desired speech. Consequently, combined noise reduction (NR) and acoustic echo cancellation (AEC) is required. Generally, a cascade approach is followed, i.e., the AEC and NR are designed in isolation by selecting a separate signal model, separate cost function, and separate solution strategy. The AEC and NR are then cascaded one after the other, not accounting for their interaction. In this paper, an integrated approach is proposed to consider this interaction in a general multi-microphone/multi-loudspeaker setup. Therefore, a single signal model of either the microphone signal vector or the extended signal vector, obtained by stacking microphone and loudspeaker signals, is selected, a single mean squared error cost function is formulated, and a common solution strategy is used. Using this microphone signal model, a multi-channel Wiener filter (MWF) is derived. Using the extended signal model, it is shown that an extended MWF (MWFext) can be derived, and several equivalent expressions can be found, which are nevertheless shown to be interpretable as cascade algorithms. Specifically, the MWFext is shown to be equivalent to algorithms where the AEC precedes the NR (AEC-NR), the NR precedes the AEC (NR-AEC), and the extended NR (NRext) precedes the AEC and post-filter (PF) (NRext-AEC-PF). Under rank-deficiency conditions the MWFext is non-unique; equivalence then amounts to the expressions being specific, though not necessarily minimum-norm, solutions of this MWFext. The practical performances differ due to non-stationarities and imperfect correlation matrix estimation, with the AEC-NR and NRext-AEC-PF attaining the best overall performance.
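
For readers unfamiliar with the MWF, the textbook MMSE solution (standard form, not copied from the paper) is:

```latex
\min_{\mathbf{w}}\; \mathbb{E}\!\left\{ \left| d - \mathbf{w}^{H}\mathbf{y} \right|^{2} \right\}
\;\;\Longrightarrow\;\;
\mathbf{w}_{\mathrm{MWF}} = \mathbf{R}_{\mathbf{y}\mathbf{y}}^{-1}\,\mathbf{r}_{\mathbf{y}d},
\qquad
\mathbf{R}_{\mathbf{y}\mathbf{y}} = \mathbb{E}\{\mathbf{y}\mathbf{y}^{H}\},\quad
\mathbf{r}_{\mathbf{y}d} = \mathbb{E}\{\mathbf{y}\,d^{*}\}.
```

The MWFext follows by replacing y with the extended vector that stacks microphone and loudspeaker signals; when the extended correlation matrix is rank-deficient, the corresponding linear system admits multiple solutions, which is the non-uniqueness the abstract refers to.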

[432] MORE: Multi-Objective Adversarial Attacks on Speech Recognition

Xiaoxue Gao, Zexin Li, Yiming Chen, Nancy F. Chen

Main category: eess.AS

TL;DR: MORE is a multi-objective adversarial attack on ASR models that simultaneously degrades both accuracy and efficiency through a hierarchical staged mechanism and repetitive encouragement doubling objective.

DetailsMotivation: Existing ASR robustness research focuses only on accuracy degradation under attacks, ignoring efficiency impacts. This provides an incomplete understanding of ASR vulnerabilities, especially as large-scale models like Whisper are deployed in real-time applications where both accuracy and efficiency matter.

Method: MORE (Multi-Objective Repetitive Doubling Encouragement attack) uses a hierarchical staged repulsion-anchoring mechanism that reformulates multi-objective adversarial optimization into sequential dual objectives. It introduces REDO (Repetitive Encouragement Doubling Objective) that maintains accuracy degradation while periodically doubling predicted sequence length to induce duplicative text generation.

Result: Experiments show MORE consistently produces significantly longer transcriptions while maintaining high word error rates compared to existing baselines, effectively degrading both recognition accuracy and inference efficiency with a single adversarial input.

Conclusion: MORE successfully demonstrates that ASR models are vulnerable to multi-objective attacks that compromise both accuracy and efficiency, highlighting the need for more comprehensive robustness evaluation beyond just accuracy metrics.

Abstract: The emergence of large-scale automatic speech recognition (ASR) models such as Whisper has greatly expanded their adoption across diverse real-world applications. Ensuring robustness against even minor input perturbations is therefore critical for maintaining reliable performance in real-time environments. While prior work has mainly examined accuracy degradation under adversarial attacks, robustness with respect to efficiency remains largely unexplored. This narrow focus provides only a partial understanding of ASR model vulnerabilities. To address this gap, we conduct a comprehensive study of ASR robustness under multiple attack scenarios. We introduce MORE, a multi-objective repetitive doubling encouragement attack, which jointly degrades recognition accuracy and inference efficiency through a hierarchical staged repulsion-anchoring mechanism. Specifically, we reformulate multi-objective adversarial optimization into a hierarchical framework that sequentially achieves the dual objectives. To further amplify effectiveness, we propose a novel repetitive encouragement doubling objective (REDO) that induces duplicative text generation by maintaining accuracy degradation and periodically doubling the predicted sequence length. Overall, MORE compels ASR models to produce incorrect transcriptions at a substantially higher computational cost, triggered by a single adversarial input. Experiments show that MORE consistently yields significantly longer transcriptions while maintaining high word error rates compared to existing baselines, underscoring its effectiveness in multi-objective adversarial attack.
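
One plausible shape for such a dual objective, sketched under our own assumptions rather than the paper's exact REDO formulation:

```python
import torch

def dual_objective(log_probs_correct: torch.Tensor,
                   expected_len: torch.Tensor,
                   target_len: torch.Tensor,
                   lam: float = 1.0) -> torch.Tensor:
    """Hypothetical combined attack loss: minimizing the first term degrades
    accuracy (it rewards low likelihood of the correct transcript), while the
    second pushes decoded length toward double the reference length.
    `expected_len` stands in for a differentiable length surrogate, e.g.
    derived from per-step end-of-sequence probabilities."""
    accuracy_term = log_probs_correct.mean()
    length_term = (expected_len - 2.0 * target_len).pow(2).mean()
    return accuracy_term + lam * length_term

loss = dual_objective(torch.randn(8), torch.full((8,), 120.0), torch.full((8,), 80.0))
```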

eess.IV

[433] Comprehensive Machine Learning Benchmarking for Fringe Projection Profilometry with Photorealistic Synthetic Data

Anush Lakshman S, Adam Haroon, Beiwen Li

Main category: eess.IV

TL;DR: First open-source photorealistic synthetic dataset for fringe projection profilometry enables benchmarking of neural networks for single-shot depth reconstruction, revealing fundamental limitations of direct fringe-to-depth mapping.

DetailsMotivation: Machine learning for fringe projection profilometry lacks large, diverse datasets and standardized benchmarking protocols, hindering systematic comparison and development of learning-based approaches.

Method: Created open-source photorealistic synthetic dataset using NVIDIA Isaac Sim with 15,600 fringe images and 300 depth reconstructions across 50 diverse objects. Benchmarked four neural network architectures (UNet, Hformer, ResUNet, Pix2Pix) on single-shot depth reconstruction.

Result: All models achieved similar performance (58-77 mm RMSE) despite architectural differences. Reconstruction errors approach 75-95% of typical object depth range, demonstrating fundamental limitations of direct fringe-to-depth mapping without explicit phase information.

Conclusion: The dataset provides standardized evaluation protocols enabling systematic comparison and development of learning-based FPP approaches, revealing that current neural architectures face inherent limitations in direct fringe-to-depth reconstruction without phase processing.

Abstract: Machine learning approaches for fringe projection profilometry (FPP) are hindered by the lack of large, diverse datasets and comprehensive benchmarking protocols. This paper introduces the first open-source, photorealistic synthetic dataset for FPP, generated using NVIDIA Isaac Sim with 15,600 fringe images and 300 depth reconstructions across 50 diverse objects. We benchmark four neural network architectures (UNet, Hformer, ResUNet, Pix2Pix) on single-shot depth reconstruction, revealing that all models achieve similar performance (58-77 mm RMSE) despite substantial architectural differences. Our results demonstrate fundamental limitations of direct fringe-to-depth mapping without explicit phase information, with reconstruction errors approaching 75-95% of the typical object depth range. This resource provides standardized evaluation protocols enabling systematic comparison and development of learning-based FPP approaches.

[434] W-DUALMINE: Reliability-Weighted Dual-Expert Fusion With Residual Correlation Preservation for Medical Image Fusion

Md. Jahidul Islam

Main category: eess.IV

TL;DR: W-DUALMINE: A reliability-weighted dual-expert fusion framework that resolves the trade-off between global statistical similarity (CC/MI) and local structural fidelity in medical image fusion through architectural constraints and theoretical loss design.

DetailsMotivation: Existing deep learning methods for medical image fusion, including recent spatial-frequency frameworks like AdaFuse and ASFE-Fusion, suffer from a fundamental trade-off between global statistical similarity (measured by correlation coefficient and mutual information) and local structural fidelity. This limitation hinders optimal clinical interpretation.

Method: 1) Dense reliability maps for adaptive modality weighting; 2) Dual-expert fusion strategy combining global-context spatial expert and wavelet-domain frequency expert; 3) Soft gradient-based arbitration mechanism; 4) Residual-to-average fusion paradigm that guarantees preservation of global correlation while enhancing local details.

Result: Extensive experiments on CT-MRI, PET-MRI, and SPECT-MRI datasets demonstrate that W-DUALMINE consistently outperforms AdaFuse and ASFE-Fusion in both correlation coefficient (CC) and mutual information (MI) metrics.

Conclusion: W-DUALMINE successfully resolves the trade-off between global statistical similarity and local structural fidelity in medical image fusion through its reliability-weighted dual-expert framework, offering improved performance over existing methods across multiple medical imaging modalities.

Abstract: Medical image fusion integrates complementary information from multiple imaging modalities to improve clinical interpretation. However, existing deep learning-based methods, including recent spatial-frequency frameworks such as AdaFuse and ASFE-Fusion, often suffer from a fundamental trade-off between global statistical similarity, measured by correlation coefficient (CC) and mutual information (MI), and local structural fidelity. This paper proposes W-DUALMINE, a reliability-weighted dual-expert fusion framework designed to explicitly resolve this trade-off through architectural constraints and a theoretically grounded loss design. The proposed method introduces dense reliability maps for adaptive modality weighting, a dual-expert fusion strategy combining a global-context spatial expert and a wavelet-domain frequency expert, and a soft gradient-based arbitration mechanism. Furthermore, we employ a residual-to-average fusion paradigm that guarantees the preservation of global correlation while enhancing local details. Extensive experiments on CT-MRI, PET-MRI, and SPECT-MRI datasets demonstrate that W-DUALMINE consistently outperforms AdaFuse and ASFE-Fusion in CC and MI metrics while preserving local structural fidelity.
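
The residual-to-average idea admits a very small sketch; this is our reading of the term, not the authors' implementation: the network predicts only a detail residual on top of the per-pixel modality average, so the fused image's global statistics stay anchored to both inputs.

```python
import torch

def residual_to_average_fusion(img_a: torch.Tensor,
                               img_b: torch.Tensor,
                               predicted_residual: torch.Tensor) -> torch.Tensor:
    """Fused = average of the two modalities + learned detail residual.
    The average preserves global correlation (CC/MI) with both inputs by
    construction; the residual carries the local structure the experts add."""
    return 0.5 * (img_a + img_b) + predicted_residual
```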

[435] GOUHFI 2.0: A Next-Generation Toolbox for Brain Segmentation and Cortex Parcellation at Ultra-High Field MRI

Marc-Antoine Fortin, Anne Louise Kristoffersen, Paal Erik Goa

Main category: eess.IV

TL;DR: GOUHFI 2.0 is an enhanced deep-learning toolbox for brain segmentation and cortical parcellation in Ultra-High Field MRI, addressing challenges of UHF data with improved accuracy and comprehensive functionality.

DetailsMotivation: Ultra-High Field MRI faces challenges in automatic brain segmentation due to signal inhomogeneities, heterogeneous contrasts/resolutions, and limited optimized tools. Standard software yields suboptimal results for UHF data, restricting quantitative analyses.

Method: Updated implementation with two independently trained 3D U-Net segmentation tasks: 1) whole-brain segmentation into 35 labels using domain-randomization strategy on 238 subjects, 2) cortical parcellation into 62 DKT labels using same training data. Preserves contrast- and resolution-agnostic design.

Result: Improved segmentation accuracy relative to original toolbox, particularly in heterogeneous cohorts. Produced reliable cortical parcellations and integrated volumetry pipeline yielded results consistent with standard workflows.

Conclusion: GOUHFI 2.0 provides comprehensive solution for brain segmentation, parcellation and volumetry across field strengths, constituting first deep-learning toolbox enabling robust cortical parcellation at UHF-MRI.

Abstract: Ultra-High Field MRI (UHF-MRI) is increasingly used in large-scale neuroimaging studies, yet automatic brain segmentation and cortical parcellation remain challenging due to signal inhomogeneities, heterogeneous contrasts and resolutions, and the limited availability of tools optimized for UHF data. Standard software packages such as FastSurferVINN and SynthSeg+ often yield suboptimal results when applied directly to UHF images, thereby restricting region-based quantitative analyses. To address this need, we introduce GOUHFI 2.0, an updated implementation of GOUHFI that incorporates increased training data variability and additional functionalities, including cortical parcellation and volumetry. GOUHFI 2.0 preserves the contrast- and resolution-agnostic design of the original toolbox while introducing two independently trained 3D U-Net segmentation tasks. The first performs whole-brain segmentation into 35 labels across contrasts, resolutions, field strengths and populations, using a domain-randomization strategy and a training dataset of 238 subjects. Using the same training data, the second network performs cortical parcellation into 62 labels following the Desikan-Killiany-Tourville (DKT) protocol. Across multiple datasets, GOUHFI 2.0 demonstrated improved segmentation accuracy relative to the original toolbox, particularly in heterogeneous cohorts, and produced reliable cortical parcellations. In addition, the integrated volumetry pipeline yielded results consistent with standard volumetric workflows. Overall, GOUHFI 2.0 provides a comprehensive solution for brain segmentation, parcellation and volumetry across field strengths, and constitutes the first deep-learning toolbox enabling robust cortical parcellation at UHF-MRI.

[436] Universal Latent Homeomorphic Manifolds: Cross-Domain Representation Learning via Homeomorphism Verification

Tong Wu, Tayab Uddin Wara, Daniel Hernandez, Sidong Lei

Main category: eess.IV

TL;DR: ULHM framework unifies semantic and observation representations via homeomorphic latent manifolds, enabling semantic-guided sparse recovery, cross-domain transfer, and zero-shot learning with theoretical guarantees.

DetailsMotivation: Different modalities (semantic descriptions vs. observation data) capture the same underlying reality but exist in separate representations. There's a need to unify these fundamentally different pathways into a single latent structure with mathematical guarantees.

Method: Establishes homeomorphism (continuous bijection preserving topological structure) as the mathematical criterion for unifying latent manifolds. Uses conditional variational inference to learn continuous manifold-to-manifold transformations, avoiding point-to-point mappings. Develops practical verification algorithms with trust, continuity, and Wasserstein distance metrics.

Result: Achieves: (1) sparse image recovery from 5% CelebA pixels and MNIST reconstruction at multiple sparsity levels, (2) cross-domain classifier transfer with 86.73% accuracy from MNIST to Fashion-MNIST without retraining, (3) zero-shot classification: 89.47% on MNIST, 84.70% on Fashion-MNIST, 78.76% on CIFAR-10. Homeomorphism criterion correctly rejects incompatible datasets.

Conclusion: ULHM provides a principled framework for unifying semantic and observation representations with theoretical guarantees, enabling robust applications in sparse recovery, transfer learning, and zero-shot learning while preventing invalid unification through mathematical verification.

Abstract: We present the Universal Latent Homeomorphic Manifold (ULHM), a framework that unifies semantic representations (e.g., human descriptions, diagnostic labels) and observation-driven machine representations (e.g., pixel intensities, sensor readings) into a single latent structure. Despite originating from fundamentally different pathways, both modalities capture the same underlying reality. We establish *homeomorphism*, a continuous bijection preserving topological structure, as the mathematical criterion for determining when latent manifolds induced by different semantic-observation pairs can be rigorously unified. This criterion provides theoretical guarantees for three critical applications: (1) semantic-guided sparse recovery from incomplete observations, (2) cross-domain transfer learning with verified structural compatibility, and (3) zero-shot compositional learning via valid transfer from semantic to observation space. Our framework learns continuous manifold-to-manifold transformations through conditional variational inference, avoiding brittle point-to-point mappings. We develop practical verification algorithms, including trust, continuity, and Wasserstein distance metrics, that empirically validate homeomorphic structure from finite samples. Experiments demonstrate: (1) sparse image recovery from 5% of CelebA pixels and MNIST digit reconstruction at multiple sparsity levels, (2) cross-domain classifier transfer achieving 86.73% accuracy from MNIST to Fashion-MNIST without retraining, and (3) zero-shot classification on unseen classes achieving 89.47% on MNIST, 84.70% on Fashion-MNIST, and 78.76% on CIFAR-10. Critically, the homeomorphism criterion correctly rejects incompatible datasets, preventing invalid unification and providing a feasible path toward principled decomposition of general foundation models into verified domain-specific components.
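
The verification step can be approximated with off-the-shelf tools. The snippet below uses scikit-learn's trustworthiness and SciPy's 1-D Wasserstein distance on synthetic stand-ins for the two latent manifolds; the paper's exact metrics and thresholds are not specified in this summary.

```python
import numpy as np
from sklearn.manifold import trustworthiness
from scipy.stats import wasserstein_distance

rng = np.random.default_rng(0)
z_semantic = rng.normal(size=(500, 16))                    # stand-in latent manifold A
z_observed = z_semantic + 0.05 * rng.normal(size=(500, 16))  # stand-in manifold B

# Trustworthiness: are neighbors in one space still neighbors in the other?
t = trustworthiness(z_semantic, z_observed, n_neighbors=10)

# 1-D Wasserstein distance between pairwise-distance distributions as a
# crude check that the two spaces share global shape.
d_sem = np.linalg.norm(z_semantic[None] - z_semantic[:, None], axis=-1).ravel()
d_obs = np.linalg.norm(z_observed[None] - z_observed[:, None], axis=-1).ravel()
w = wasserstein_distance(d_sem, d_obs)
print(f"trustworthiness={t:.3f}, wasserstein={w:.3f}")
```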

[437] POWDR: Pathology-preserving Outpainting with Wavelet Diffusion for 3D MRI

Fei Tan, Ashok Vardhan Addala, Bruno Astuto Arouche Nunes, Xucheng Zhu, Ravi Soni

Main category: eess.IV

TL;DR: POWDR is a pathology-preserving outpainting framework for 3D MRI that generates anatomically plausible surrounding tissue while retaining real pathological regions, addressing data scarcity and class imbalance in medical imaging.

DetailsMotivation: Medical imaging datasets suffer from class imbalance and limited availability of pathology-rich cases, which constrains machine learning model performance for segmentation, classification, and vision-language tasks.

Method: A conditioned wavelet diffusion model with wavelet-domain conditioning to enhance high-frequency detail and mitigate blurring, plus a random connected mask training strategy to overcome conditioning-induced collapse and improve diversity outside lesions.

Result: Quantitative metrics confirm image realism (FID, SSIM, LPIPS), diversity analysis shows significant improvement with random-mask training, and clinically relevant assessments reveal gains in tumor segmentation performance (Dice scores improving from 0.6992 to 0.7137 with 50 synthetic cases).

Conclusion: POWDR is a practical solution for addressing data scarcity and class imbalance in medical imaging, extensible to multiple anatomies, offering a controllable framework for generating diverse, pathology-preserving synthetic data to support robust model development.

Abstract: Medical imaging datasets often suffer from class imbalance and limited availability of pathology-rich cases, which constrains the performance of machine learning models for segmentation, classification, and vision-language tasks. To address this challenge, we propose POWDR, a pathology-preserving outpainting framework for 3D MRI based on a conditioned wavelet diffusion model. Unlike conventional augmentation or unconditional synthesis, POWDR retains real pathological regions while generating anatomically plausible surrounding tissue, enabling diversity without fabricating lesions. Our approach leverages wavelet-domain conditioning to enhance high-frequency detail and mitigate blurring common in latent diffusion models. We introduce a random connected mask training strategy to overcome conditioning-induced collapse and improve diversity outside the lesion. POWDR is evaluated on brain MRI using BraTS datasets and extended to knee MRI to demonstrate tissue-agnostic applicability. Quantitative metrics (FID, SSIM, LPIPS) confirm image realism, while diversity analysis shows significant improvement with random-mask training (cosine similarity reduced from 0.9947 to 0.9580; KL divergence increased from 0.00026 to 0.01494). Clinically relevant assessments reveal gains in tumor segmentation performance using nnU-Net, with Dice scores improving from 0.6992 to 0.7137 when adding 50 synthetic cases. Tissue volume analysis indicates no significant differences for CSF and GM compared to real images. These findings highlight POWDR as a practical solution for addressing data scarcity and class imbalance in medical imaging. The method is extensible to multiple anatomies and offers a controllable framework for generating diverse, pathology-preserving synthetic data to support robust model development.
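
A toy version of a random connected mask, grown by a random walk so the masked region is connected by construction; this is illustrative only, and the paper's mask construction may differ.

```python
import numpy as np

def random_connected_mask(shape=(32, 32, 32), steps=2000, seed=0):
    """Grow a connected voxel set by a random walk from the volume center.
    Each step moves to a face-adjacent voxel, so the mask stays connected."""
    rng = np.random.default_rng(seed)
    mask = np.zeros(shape, dtype=bool)
    pos = np.array(shape) // 2
    moves = np.array([[1, 0, 0], [-1, 0, 0], [0, 1, 0],
                      [0, -1, 0], [0, 0, 1], [0, 0, -1]])
    for _ in range(steps):
        mask[tuple(pos)] = True
        pos = np.clip(pos + moves[rng.integers(6)], 0, np.array(shape) - 1)
    return mask

m = random_connected_mask()
print(m.sum(), "masked voxels")
```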

[438] Equi-ViT: Rotational Equivariant Vision Transformer for Robust Histopathology Analysis

Fuyao Chen, Yuexi Du, Elèonore V. Lieffrig, Nicha C. Dvornek, John A. Onofrey

Main category: eess.IV

TL;DR: Equi-ViT integrates equivariant convolution kernels into Vision Transformer patch embeddings to achieve rotational equivariance, improving robustness and data efficiency in histopathology applications.

DetailsMotivation: Standard Vision Transformers lack equivariance to common transformations like rotations and reflections that are ubiquitous in histopathology imaging, limiting their robustness and generalizability in computational pathology applications.

Method: Proposes Equi-ViT which integrates an equivariant convolution kernel into the patch embedding stage of a Vision Transformer architecture, imparting built-in rotational equivariance to learned representations while maintaining the transformer’s ability to model long-range dependencies.

Result: Equi-ViT achieves superior rotation-consistent patch embeddings and stable classification performance across image orientations on a public colorectal cancer dataset, demonstrating enhanced data efficiency and robustness compared to standard ViTs.

Conclusion: Equivariant transformers like Equi-ViT could serve as more generalizable backbones for Vision Transformer applications in histopathology, potentially improving foundation models for digital pathology by addressing the inherent non-equivariance limitations of standard ViTs.

Abstract: Vision Transformers (ViTs) have gained rapid adoption in computational pathology for their ability to model long-range dependencies through self-attention, addressing the limitations of convolutional neural networks that excel at local pattern capture but struggle with global contextual reasoning. Recent pathology-specific foundation models have further advanced performance by leveraging large-scale pretraining. However, standard ViTs remain inherently non-equivariant to transformations such as rotations and reflections, which are ubiquitous variations in histopathology imaging. To address this limitation, we propose Equi-ViT, which integrates an equivariant convolution kernel into the patch embedding stage of a ViT architecture, imparting built-in rotational equivariance to learned representations. Equi-ViT achieves superior rotation-consistent patch embeddings and stable classification performance across image orientations. Our results on a public colorectal cancer dataset demonstrate that incorporating equivariant patch embedding enhances data efficiency and robustness, suggesting that equivariant transformers could potentially serve as more generalizable backbones for the application of ViT in histopathology, such as digital pathology foundation models.
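
A rough sketch of the general idea, assuming a C4 (90-degree) rotation group: convolve with all four rotations of a shared patch-embedding kernel and pool over orientations, so each patch token is insensitive to 90-degree rotations of its patch. This illustrates group-pooled patch embedding in general, not Equi-ViT's exact equivariant kernel.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class C4PatchEmbed(nn.Module):
    """Patch embedding whose tokens are pooled over the C4 orientation group."""
    def __init__(self, in_ch=3, dim=192, patch=16):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(dim, in_ch, patch, patch) * 0.02)
        self.patch = patch

    def forward(self, x):
        # Responses to the kernel at 0/90/180/270 degrees, then max over
        # orientations so the token ignores 90-degree patch rotations.
        outs = [F.conv2d(x, torch.rot90(self.weight, k, dims=(2, 3)),
                         stride=self.patch) for k in range(4)]
        tokens = torch.stack(outs, 0).amax(0)
        return tokens.flatten(2).transpose(1, 2)  # (B, num_patches, dim)

emb = C4PatchEmbed()
print(emb(torch.randn(1, 3, 224, 224)).shape)  # torch.Size([1, 196, 192])
```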

[439] AGE-US: automated gestational age estimation based on fetal ultrasound images

César Díaz-Parga, Marta Nuñez-Garcia, Maria J. Carreira, Gabriel Bernardino, Nicolás Vila-Blanco

Main category: eess.IV

TL;DR: An interpretable deep learning method for automated gestational age calculation using novel segmentation architecture and distance maps, achieving state-of-the-art performance with reduced complexity for resource-constrained settings.

DetailsMotivation: Accurate gestational age estimation is critical for monitoring fetal growth and health risks, but traditional methods like last menstrual period are unreliable, and ultrasound-based approaches suffer from manual measurement variability.

Method: Interpretable deep learning approach using novel segmentation architecture with distance maps to overcome dataset limitations and scarcity of segmentation masks, specifically designed for automated gestational age calculation.

Result: Achieves performance comparable to state-of-the-art models while reducing complexity, making it suitable for resource-constrained settings; distance maps prove particularly effective for estimating femur endpoints.

Conclusion: The proposed method provides an automated, interpretable solution for gestational age calculation that addresses limitations of traditional approaches and is particularly valuable in settings with limited resources and annotated data.

Abstract: Being born small carries significant health risks, including increased neonatal mortality and a higher likelihood of future cardiac diseases. Accurate estimation of gestational age is critical for monitoring fetal growth, but traditional estimates, such as those based on the last menstrual period, are in some situations difficult to obtain. While ultrasound-based approaches offer greater reliability, they rely on manual measurements that introduce variability. This study presents an interpretable deep learning-based method for automated gestational age calculation, leveraging a novel segmentation architecture and distance maps to overcome dataset limitations and the scarcity of segmentation masks. Our approach achieves performance comparable to state-of-the-art models while reducing complexity, making it particularly suitable for resource-constrained settings with limited annotated data. Furthermore, our results demonstrate that the use of distance maps is particularly effective for estimating femur endpoints.
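
Distance maps of the kind mentioned above are straightforward to produce with SciPy; the snippet below shows the common Euclidean distance transform construction, one plausible ingredient for endpoint regression targets, not the authors' pipeline.

```python
import numpy as np
from scipy.ndimage import distance_transform_edt

# Toy binary segmentation mask (an elongated, femur-like structure).
mask = np.zeros((64, 64), dtype=bool)
mask[30:34, 10:54] = True

dist_inside = distance_transform_edt(mask)    # distance to background, inside the object
dist_to_obj = distance_transform_edt(~mask)   # distance to the object, outside it
print(dist_inside.max(), dist_to_obj.max())
```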

[440] Large-scale modality-invariant foundation models for brain MRI analysis: Application to lesion segmentation

Petros Koutsouvelis, Matej Gazda, Leroy Volmer, Sina Amirrajab, Kamil Barbierik, Branislav Setlak, Jakub Gazda, Peter Drotar

Main category: eess.IV

TL;DR: This paper proposes a modality-invariant representation learning approach for brain MRI, evaluating it on stroke and epilepsy lesion segmentation after large-scale self-supervised pre-training.

DetailsMotivation: While computer vision is shifting to large-scale foundation models via SSL, most frameworks are tailored to natural images and don't effectively capture multi-modal MRI information. There's a need to adapt SSL methods to learn anatomical priors from unlabeled brain MRI data for improved few-shot performance in neuroimaging tasks.

Method: The authors propose a modality-invariant representation learning setup for multi-modal MRI data, using large-scale self-supervised pre-training on unlabeled brain MRI data. They evaluate this approach specifically on stroke and epilepsy lesion segmentation tasks.

Result: Experimental results show that while cross-modality alignment is successful, lesion segmentation primarily benefits from preserving fine-grained modality-specific features rather than complete modality invariance.

Conclusion: For lesion segmentation in neuroimaging, preserving modality-specific features is more important than achieving complete modality invariance, despite successful cross-modality alignment. The authors make their model checkpoints and code publicly available.

Abstract: The field of computer vision is undergoing a paradigm shift toward large-scale foundation model pre-training via self-supervised learning (SSL). Leveraging large volumes of unlabeled brain MRI data, such models can learn anatomical priors that improve few-shot performance in diverse neuroimaging tasks. However, most SSL frameworks are tailored to natural images, and their adaptation to capture multi-modal MRI information remains underexplored. This work proposes a modality-invariant representation learning setup and evaluates its effectiveness in stroke and epilepsy lesion segmentation, following large-scale pre-training. Experimental results suggest that despite successful cross-modality alignment, lesion segmentation primarily benefits from preserving fine-grained modality-specific features. Model checkpoints and code are made publicly available.
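
Modality-invariant alignment is commonly trained with a symmetric InfoNCE objective between embeddings of paired modalities of the same subject; a generic sketch follows, since the paper's actual objective is not specified in this summary.

```python
import torch
import torch.nn.functional as F

def cross_modal_info_nce(z_a: torch.Tensor, z_b: torch.Tensor, tau: float = 0.1):
    """Symmetric InfoNCE between two modality embeddings of the same batch
    of subjects: matched pairs sit on the diagonal of the similarity matrix."""
    z_a, z_b = F.normalize(z_a, dim=-1), F.normalize(z_b, dim=-1)
    logits = z_a @ z_b.t() / tau              # (B, B) cosine similarities
    targets = torch.arange(z_a.size(0))       # positives on the diagonal
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

loss = cross_modal_info_nce(torch.randn(8, 128), torch.randn(8, 128))
```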

[441] A Multi-Stage Deep Learning Framework with PKCP-MixUp Augmentation for Pediatric Liver Tumor Diagnosis Using Multi-Phase Contrast-Enhanced CT

Wanqi Wang, Chun Yang, Jianbo Shao, Yaokai Zhang, Xuehua Peng, Jin Sun, Chao Xiong, Long Lu, Lianting Hu

Main category: eess.IV

TL;DR: A multi-stage deep learning framework using multi-phase CT scans for non-invasive diagnosis of pediatric liver tumors, achieving high accuracy in distinguishing benign vs malignant and classifying subtypes.

DetailsMotivation: Current invasive biopsy for pediatric liver tumors has significant limitations including bleeding risks, need for anesthesia in young children, high costs, and psychological trauma. There's a gap in AI applications specifically for pediatric liver tumors despite AI's growing role in clinical settings.

Method: Developed a multi-stage DL framework using multi-phase contrast-enhanced CT scans. Used PKCP-MixUp data augmentation to address data scarcity and class imbalance. Implemented tumor detection model to extract ROIs, followed by two-stage diagnosis pipeline with three backbones using ROI-masked images.

Result: Tumor detection achieved mAP=0.871. First-stage benign vs malignant classification reached AUC=0.989. Final diagnosis models showed robust performance: benign subtype classification AUC=0.915 and malignant subtype classification AUC=0.979. Conducted ablation studies and interpretability analyses (Shapley-Value, CAM).

Conclusion: The framework fills the pediatric-specific DL diagnostic gap, provides insights for CT phase selection and model design, and enables precise, accessible non-invasive diagnosis of pediatric liver tumors.

Abstract: Pediatric liver tumors are one of the most common solid tumors in pediatrics, with differentiation of benign or malignant status and pathological classification critical for clinical treatment. While pathological examination is the gold standard, the invasive biopsy has notable limitations: the highly vascular pediatric liver and fragile tumor tissue raise complication risks such as bleeding; additionally, young children with poor compliance require anesthesia for biopsy, increasing medical costs or psychological trauma. Although many efforts have been made to utilize AI in clinical settings, most researchers have overlooked its importance in pediatric liver tumors. To establish a non-invasive examination procedure, we developed a multi-stage deep learning (DL) framework for automated pediatric liver tumor diagnosis using multi-phase contrast-enhanced CT. Two retrospective and prospective cohorts were enrolled. We established a novel PKCP-MixUp data augmentation method to address data scarcity and class imbalance. We also trained a tumor detection model to extract ROIs, and then set a two-stage diagnosis pipeline with three backbones with ROI-masked images. Our tumor detection model has achieved high performance (mAP=0.871), and the first stage classification model between benign and malignant tumors reached an excellent performance (AUC=0.989). Final diagnosis models also exhibited robustness, including benign subtype classification (AUC=0.915) and malignant subtype classification (AUC=0.979). We also conducted multi-level comparative analyses, such as ablation studies on data and training pipelines, as well as Shapley-Value and CAM interpretability analyses. This framework fills the pediatric-specific DL diagnostic gap, provides actionable insights for CT phase selection and model design, and paves the way for precise, accessible pediatric liver tumor diagnosis.
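
For reference, the base MixUp recipe that PKCP-MixUp presumably extends; the PKCP-specific constraints are not detailed in this summary, so only the standard convex combination is shown.

```python
import numpy as np

def mixup(x1, x2, y1, y2, alpha=0.4, seed=None):
    """Vanilla MixUp: convex combination of two samples and their labels,
    with the mixing weight drawn from a Beta(alpha, alpha) distribution."""
    lam = np.random.default_rng(seed).beta(alpha, alpha)
    return lam * x1 + (1 - lam) * x2, lam * y1 + (1 - lam) * y2

xm, ym = mixup(np.ones((2, 2)), np.zeros((2, 2)),
               np.array([1.0, 0.0]), np.array([0.0, 1.0]), seed=0)
```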

[442] M3CoTBench: Benchmark Chain-of-Thought of MLLMs in Medical Image Understanding

Juntao Jiang, Jiangning Zhang, Yali Bi, Jinsheng Bai, Weixuan Liu, Weiwei Jin, Zhucun Xue, Yong Liu, Xiaobin Hu, Shuicheng Yan

Main category: eess.IV

TL;DR: M3CoTBench is a new benchmark for evaluating Chain-of-Thought reasoning in medical multimodal LLMs, focusing on correctness, efficiency, impact, and consistency of reasoning paths rather than just final answers.

DetailsMotivation: Current medical image understanding benchmarks focus only on final answers while ignoring reasoning paths, creating opaque AI systems that lack reliable bases for judgment and cannot effectively assist doctors in diagnosis.

Method: Created M3CoTBench benchmark featuring: 1) diverse multi-level difficulty dataset covering 24 examination types, 2) 13 varying-difficulty tasks, 3) CoT-specific evaluation metrics (correctness, efficiency, impact, consistency) tailored to clinical reasoning, and 4) performance analysis of multiple MLLMs.

Result: The benchmark systematically evaluates CoT reasoning across diverse medical imaging tasks, revealing current limitations of MLLMs in generating reliable and clinically interpretable reasoning.

Conclusion: M3CoTBench aims to foster development of transparent, trustworthy, and diagnostically accurate AI systems for healthcare by addressing the gap in evaluating reasoning processes in medical image understanding.

Abstract: Chain-of-Thought (CoT) reasoning has proven effective in enhancing large language models by encouraging step-by-step intermediate reasoning, and recent advances have extended this paradigm to Multimodal Large Language Models (MLLMs). In the medical domain, where diagnostic decisions depend on nuanced visual cues and sequential reasoning, CoT aligns naturally with clinical thinking processes. However, current benchmarks for medical image understanding generally focus on the final answer while ignoring the reasoning path. An opaque process lacks reliable bases for judgment, making it difficult to assist doctors in diagnosis. To address this gap, we introduce a new M3CoTBench benchmark specifically designed to evaluate the correctness, efficiency, impact, and consistency of CoT reasoning in medical image understanding. M3CoTBench features 1) a diverse, multi-level difficulty dataset covering 24 examination types, 2) 13 varying-difficulty tasks, 3) a suite of CoT-specific evaluation metrics (correctness, efficiency, impact, and consistency) tailored to clinical reasoning, and 4) a performance analysis of multiple MLLMs. M3CoTBench systematically evaluates CoT reasoning across diverse medical imaging tasks, revealing current limitations of MLLMs in generating reliable and clinically interpretable reasoning, and aims to foster the development of transparent, trustworthy, and diagnostically accurate AI systems for healthcare. Project page at https://juntaojianggavin.github.io/projects/M3CoTBench/.

Last updated: 2026-01-21