Today’s Research Highlights

AI-enhanced summaries of the latest research papers from arXiv.

cs.CL [Total: 120]
cs.CV [Total: 122]
cs.AI [Total: 74]
cs.SD [Total: 8]
cs.LG [Total: 122]
cs.MA [Total: 8]
cs.MM [Total: 3]
eess.AS [Total: 3]
eess.IV [Total: 10]

cs.CL

[1] SIN-Bench: Tracing Native Evidence Chains in Long-Context Multimodal Scientific Interleaved Literature

Yiming Ren, Junjie Wang, Yuxin Meng, Yihang Shi, Zhiqiang Lin, Ruihang Chu, Yiran Xu, Ziming Li, Yunfei Zhao, Zihan Wang, Yu Qiao, Ruiming Tang, Minghao Liu, Yujiu Yang

Main category: cs.CL

TL;DR: FITO paradigm evaluates MLLMs on scientific document understanding by requiring explicit cross-modal evidence chains, revealing grounding as the primary bottleneck.

Details

Motivation: Current evaluation methods for multimodal LLMs on scientific papers are inadequate - answer-only metrics and synthetic tests reward answer matching without requiring causal, evidence-linked reasoning traces in native document contexts.

Method: Proposed “Fish-in-the-Ocean” (FITO) paradigm requiring explicit cross-modal evidence chains. Built SIN-Data (scientific interleaved corpus preserving text-figure interleaving) and SIN-Bench with four progressive tasks: evidence discovery (SIN-Find), hypothesis verification (SIN-Verify), grounded QA (SIN-QA), and evidence-anchored synthesis (SIN-Summary). Introduced “No Evidence, No Score” evaluation scoring predictions only when grounded to verifiable anchors.

Result: Experiments on eight MLLMs show grounding is the primary bottleneck. Gemini-3-pro achieved best average overall score (0.573), while GPT-5 attained highest SIN-QA answer accuracy (0.767) but underperformed on evidence-aligned overall scores, exposing gap between correctness and traceable support.

Conclusion: The FITO paradigm effectively reveals limitations in MLLMs’ ability to construct explicit evidence chains in scientific documents, highlighting the critical importance of grounding and traceable reasoning beyond mere answer correctness.

Abstract: Evaluating whether multimodal large language models truly understand long-form scientific papers remains challenging: answer-only metrics and synthetic “Needle-In-A-Haystack” tests often reward answer matching without requiring a causal, evidence-linked reasoning trace in the document. We propose the “Fish-in-the-Ocean” (FITO) paradigm, which requires models to construct explicit cross-modal evidence chains within native scientific documents. To operationalize FITO, we build SIN-Data, a scientific interleaved corpus that preserves the native interleaving of text and figures. On top of it, we construct SIN-Bench with four progressive tasks covering evidence discovery (SIN-Find), hypothesis verification (SIN-Verify), grounded QA (SIN-QA), and evidence-anchored synthesis (SIN-Summary). We further introduce “No Evidence, No Score”, scoring predictions when grounded to verifiable anchors and diagnosing evidence quality via matching, relevance, and logic. Experiments on eight MLLMs show that grounding is the primary bottleneck: Gemini-3-pro achieves the best average overall score (0.573), while GPT-5 attains the highest SIN-QA answer accuracy (0.767) but underperforms on evidence-aligned overall scores, exposing a gap between correctness and traceable support.

[2] LLM-Driven Preference Data Synthesis for Proactive Prediction of the Next User Utterance in Human-Machine Dialogue

Jinqiang Wang, Huansheng Ning, Jianguo Ding, Tao Zhu, Liming Chen, Chris Nugent

Main category: cs.CL

TL;DR: ProUtt is an LLM-driven method that synthesizes preference data for proactive next utterance prediction by modeling intent reasoning trajectories through intent trees and constructing preference/non-preference reasoning processes.

Details

Motivation: Existing solutions for proactive next utterance prediction face privacy concerns with commercial APIs, computational expense with local LLMs, and limitations in existing methods that lack explicit intent reasoning modeling and proper preference data synthesis for user preference alignment.

Method: ProUtt converts dialogue history into an intent tree, explicitly models intent reasoning trajectories by predicting next plausible paths from exploitation and exploration perspectives, and constructs preference/non-preference reasoning processes by perturbing or revising intent tree paths at different future turns.

Result: ProUtt consistently outperforms existing data synthesis methods, user simulators, and commercial LLM APIs across four benchmark datasets, as demonstrated through extensive evaluations using LLM-as-a-judge and human judgments.

Conclusion: ProUtt provides an effective LLM-driven preference data synthesis method for proactive next utterance prediction that addresses limitations in existing approaches by explicitly modeling intent reasoning and synthesizing comprehensive preference data, with released code and datasets to facilitate future research.

Abstract: Proactively predicting a users next utterance in human-machine dialogue can streamline interaction and improve user experience. Existing commercial API-based solutions are subject to privacy concerns while deploying general-purpose LLMs locally remains computationally expensive. As such, training a compact, task-specific LLM provides a practical alternative. Although user simulator methods can predict a user’s next utterance, they mainly imitate their speaking style rather than advancing the dialogue. Preference data synthesis has been investigated to generate data for proactive next utterance prediction and help align LLMs with user preferences. Yet existing methods lack the ability to explicitly model the intent reasoning that leads to the user’s next utterance and to define and synthesize preference and non-preference reasoning processes for predicting the user’s next utterance.To address these challenges, we propose ProUtt, an LLM-driven preference data synthesis method for proactive next utterance prediction. ProUtt converts dialogue history into an intent tree and explicitly models intent reasoning trajectories by predicting the next plausible path from both exploitation and exploration perspectives. It then constructs preference and non-preference reasoning processes by perturbing or revising intent tree paths at different future turns. Extensive evaluations using LLM-as-a-judge and human judgments demonstrate that ProUtt consistently outperforms existing data synthesis methods, user simulators, and commercial LLM APIs across four benchmark datasets. We release both the code and the synthesized datasets to facilitate future research.

[3] Evaluating Novelty in AI-Generated Research Plans Using Multi-Workflow LLM Pipelines

Devesh Saraogi, Rohit Singhee, Dhruv Kumar

Main category: cs.CL

TL;DR: Agentic workflows (multi-step AI systems) generate more novel research plans than single-step prompting, with decomposition-based and long-context approaches achieving highest novelty scores while maintaining feasibility.

Details

Motivation: To address concerns about "smart plagiarism" in LLM-generated research where models reproduce existing ideas with terminological shifts, and investigate whether multi-step agentic workflows can produce more original and feasible research plans.

Method: Benchmarked five reasoning architectures: Reflection-based iterative refinement, Sakana AI v2 evolutionary algorithms, Google Co-Scientist multi-agent framework, GPT Deep Research (GPT-5.1) recursive decomposition, and Gemini~3 Pro multimodal long-context pipeline. Evaluated thirty proposals each on novelty, feasibility, and impact.

Result: Decomposition-based and long-context workflows achieved highest mean novelty scores (4.17/5), while reflection-based approaches scored significantly lower (2.33/5). High-performing workflows maintained feasibility without sacrificing creativity, with varied performance across research domains.

Conclusion: Carefully designed multi-stage agentic workflows can advance AI-assisted research ideation by generating more novel and feasible research plans compared to single-step prompting approaches.

Abstract: The integration of Large Language Models (LLMs) into the scientific ecosystem raises fundamental questions about the creativity and originality of AI-generated research. Recent work has identified ``smart plagiarism’’ as a concern in single-step prompting approaches, where models reproduce existing ideas with terminological shifts. This paper investigates whether agentic workflows – multi-step systems employing iterative reasoning, evolutionary search, and recursive decomposition – can generate more novel and feasible research plans. We benchmark five reasoning architectures: Reflection-based iterative refinement, Sakana AI v2 evolutionary algorithms, Google Co-Scientist multi-agent framework, GPT Deep Research (GPT-5.1) recursive decomposition, and Gemini~3 Pro multimodal long-context pipeline. Using evaluations from thirty proposals each on novelty, feasibility, and impact, we find that decomposition-based and long-context workflows achieve mean novelty of 4.17/5, while reflection-based approaches score significantly lower (2.33/5). Results reveal varied performance across research domains, with high-performing workflows maintaining feasibility without sacrificing creativity. These findings support the view that carefully designed multi-stage agentic workflows can advance AI-assisted research ideation.

[4] Introducing Axlerod: An LLM-based Chatbot for Assisting Independent Insurance Agents

Adam Bradley, John Hastings, Khandaker Mamun Ahmed

Main category: cs.CL

TL;DR: Axlerod is an AI-powered conversational interface for insurance agents that uses NLP and RAG to automate policy recommendations and claims triage, achieving 93.18% accuracy and reducing search time by 2.42 seconds.

Details

Motivation: The insurance industry is undergoing an AI-driven transformation, with chatbots evolving into sophisticated systems for automating complex workflows. There's a need for AI applications that improve operational efficiency for independent insurance agents through intelligent conversational interfaces.

Method: Axlerod leverages natural language processing (NLP), retrieval-augmented generation (RAG), and domain-specific knowledge integration. It’s designed to parse user intent, access structured policy databases, and deliver real-time, contextually relevant responses through a conversational interface.

Result: Experimental results show Axlerod achieves 93.18% overall accuracy in policy retrieval tasks and reduces average search time by 2.42 seconds compared to traditional methods.

Conclusion: This work contributes to enterprise-grade AI applications in insurtech, focusing on agent-assistive rather than consumer-facing architectures, demonstrating that AI-powered conversational interfaces can significantly improve operational efficiency for insurance professionals.

Abstract: The insurance industry is undergoing a paradigm shift through the adoption of artificial intelligence (AI) technologies, particularly in the realm of intelligent conversational agents. Chatbots have evolved into sophisticated AI-driven systems capable of automating complex workflows, including policy recommendation and claims triage, while simultaneously enabling dynamic, context-aware user engagement. This paper presents the design, implementation, and empirical evaluation of Axlerod, an AI-powered conversational interface designed to improve the operational efficiency of independent insurance agents. Leveraging natural language processing (NLP), retrieval-augmented generation (RAG), and domain-specific knowledge integration, Axlerod demonstrates robust capabilities in parsing user intent, accessing structured policy databases, and delivering real-time, contextually relevant responses. Experimental results underscore Axlerod’s effectiveness, achieving an overall accuracy of 93.18% in policy retrieval tasks while reducing the average search time by 2.42 seconds. This work contributes to the growing body of research on enterprise-grade AI applications in insurtech, with a particular focus on agent-assistive rather than consumer-facing architectures.

Derguene Mbaye, Tatiana D. P. Mbengue, Madoune R. Seye, Moussa Diallo, Mamadou L. Ndiaye, Dimitri S. Adjanohoun, Cheikh S. Wade, Djiby Sow, Jean-Claude B. Munyaka, Jerome Chenal

Main category: cs.CL

TL;DR: First comprehensive overview of NLP progress and challenges for Senegal’s six official languages, with centralized resources and roadmap for sustainable development.

Details

Motivation: African languages are underrepresented in NLP despite its transformative potential, creating a need to address the digital readiness gap for Senegal's six national languages (Wolof, Pulaar, Sereer, Joola, Mandingue, Soninke).

Method: Synthesizes linguistic, sociotechnical, and infrastructural factors; analyzes existing initiatives in text normalization, machine translation, and speech processing; creates centralized GitHub repository of resources; examines NLP applications in social sciences.

Result: Identifies gaps in data, tools, and benchmarks; provides comprehensive overview of current state; establishes centralized resource repository; demonstrates how NLP can enhance social science research through multilingual transcription, translation, and retrieval pipelines.

Conclusion: Proposes roadmap for sustainable, community-centered NLP ecosystems emphasizing ethical data governance, open resources, and interdisciplinary collaboration to advance Senegalese language technologies.

Abstract: Natural Language Processing (NLP) is rapidly transforming research methodologies across disciplines, yet African languages remain largely underrepresented in this technological shift. This paper provides the first comprehensive overview of NLP progress and challenges for the six national languages officially recognized by the Senegalese Constitution: Wolof, Pulaar, Sereer, Joola, Mandingue, and Soninke. We synthesize linguistic, sociotechnical, and infrastructural factors that shape their digital readiness and identify gaps in data, tools, and benchmarks. Building on existing initiatives and research works, we analyze ongoing efforts in text normalization, machine translation, and speech processing. We also provide a centralized GitHub repository that compiles publicly accessible resources for a range of NLP tasks across these languages, designed to facilitate collaboration and reproducibility. A special focus is devoted to the application of NLP to the social sciences, where multilingual transcription, translation, and retrieval pipelines can significantly enhance the efficiency and inclusiveness of field research. The paper concludes by outlining a roadmap toward sustainable, community-centered NLP ecosystems for Senegalese languages, emphasizing ethical data governance, open resources, and interdisciplinary collaboration.

[6] SALP-CG: Standard-Aligned LLM Pipeline for Classifying and Grading Large Volumes of Online Conversational Health Data

Yiwei Yan, Hao Li, Hua He, Gong Kai, Zhengyi Yang, Guanfeng Liu

Main category: cs.CL

TL;DR: SALP-CG is an LLM-based pipeline for classifying and grading privacy risks in online conversational health data, achieving strong performance on Chinese medical dialogue benchmarks with micro-F1=0.900 for maximum-level prediction.

Details

Motivation: Online medical consultations generate large volumes of conversational health data containing protected health information, but existing approaches lack unified standards and reliable automated methods for sensitivity classification and risk grading.

Method: Developed SALP-CG, a large language model-based extraction pipeline that combines few-shot guidance, JSON Schema constrained decoding, and deterministic high-risk rules. The backend-agnostic pipeline follows GB/T 39725-2020 standards for health-data classification and grading.

Result: On the MedDialog-CN benchmark, models achieved robust entity counts, high schema compliance, and accurate sensitivity grading. The strongest model attained micro-F1=0.900 for maximum-level prediction. Analysis showed Level 2-3 items dominate and enable re-identification when combined, while Level 4-5 items are less frequent but carry greater harm potential.

Conclusion: SALP-CG reliably classifies categories and grades sensitivity in online conversational health data across different LLMs, offering a practical method for health data governance and privacy protection in medical consultations.

Abstract: Online medical consultations generate large volumes of conversational health data that often embed protected health information, requiring robust methods to classify data categories and assign risk levels in line with policies and practice. However, existing approaches lack unified standards and reliable automated methods to fulfill sensitivity classification for such conversational health data. This study presents a large language model-based extraction pipeline, SALP-CG, for classifying and grading privacy risks in online conversational health data. We concluded health-data classification and grading rules in accordance with GB/T 39725-2020. Combining few-shot guidance, JSON Schema constrained decoding, and deterministic high-risk rules, the backend-agnostic extraction pipeline achieves strong category compliance and reliable sensitivity across diverse LLMs. On the MedDialog-CN benchmark, models yields robust entity counts, high schema compliance, and accurate sensitivity grading, while the strongest model attains micro-F1=0.900 for maximum-level prediction. The category landscape stratified by sensitivity shows that Level 2-3 items dominate, enabling re-identification when combined; Level 4-5 items are less frequent but carry outsize harm. SALP-CG reliably helps classify categories and grading sensitivity in online conversational health data across LLMs, offering a practical method for health data governance. Code is available at https://github.com/dommii1218/SALP-CG.

[7] StatLLaMA: A multi-stage training framework for building a domain-optimized statistical language model

Jing-Yi Zeng, Guan-Hua Huang

Main category: cs.CL

TL;DR: Researchers develop StatLLaMA, a domain-specialized LLM for statistics using LLaMA-3.2-3B, finding that starting from instruction-tuned foundation models is crucial for statistical reasoning, with careful trade-offs between domain expertise and general abilities.

Details

Motivation: To create a resource-efficient, domain-specialized LLM for statistics that balances statistical expertise with general reasoning capabilities, addressing the challenge of how to effectively adapt lightweight foundation models for specialized domains.

Method: Systematically compared three training pipelines using LLaMA-3.2-3B: starting from base FM, base FM with instruction tuning, and instruction-tuned FM. Evaluated continual pretraining, supervised fine-tuning (SFT), RLHF preference alignment, and downstream task adaptation with direct preference optimization.

Result: Pipelines starting from base FM failed to develop statistical reasoning, while starting from instruction-tuned FM enabled effective domain specialization. SFT variants showed trade-offs between domain expertise and general reasoning. Direct preference optimization provided stable RLHF alignment, and downstream fine-tuning required low intensity to avoid catastrophic forgetting.

Conclusion: StatLLaMA achieves balanced performance on mathematical reasoning, common-sense reasoning, and statistical expertise, providing a practical blueprint for developing resource-efficient statistical LLMs, with the key insight that starting from instruction-tuned foundation models is essential for domain specialization.

Abstract: This study investigates how to efficiently build a domain-specialized large language model (LLM) for statistics using the lightweight LLaMA-3.2-3B family as the foundation model (FM). We systematically compare three multi-stage training pipelines, starting from a base FM with no instruction-following capability, a base FM augmented with post-hoc instruction tuning, and an instruction-tuned FM with strong general reasoning abilities across continual pretraining, supervised fine-tuning (SFT), reinforcement learning from human feedback (RLHF) preference alignment, and downstream task adaptation. Results show that pipelines beginning with a base FM fail to develop meaningful statistical reasoning, even after extensive instruction tuning, SFT, or RLHF alignment. In contrast, starting from LLaMA-3.2-3B-Instruct enables effective domain specialization. A comprehensive evaluation of SFT variants reveals clear trade-offs between domain expertise and general reasoning ability. We further demonstrate that direct preference optimization provides stable and effective RLHF preference alignment. Finally, we show that downstream fine-tuning must be performed with extremely low intensity to avoid catastrophic forgetting in highly optimized models. The final model, StatLLaMA, achieves strong and balanced performance on benchmarks of mathematical reasoning, common-sense reasoning, and statistical expertise, offering a practical blueprint for developing resource-efficient statistical LLMs. The code is available at https://github.com/HuangDLab/StatLLaMA.

[8] MoST: Mixing Speech and Text with Modality-Aware Mixture of Experts

Yuxuan Lou, Kai Yang, Yang You

Main category: cs.CL

TL;DR: MoST is a multimodal LLM that integrates speech and text processing using a Modality-Aware Mixture of Experts architecture with specialized routing for different input types.

Details

Motivation: Current multimodal models process diverse modality representations with identical parameters, ignoring inherent representational differences between speech and text.

Method: Proposes Modality-Aware Mixture of Experts (MAMoE) with specialized routing pathways directing tokens to modality-appropriate experts, plus modality-specific expert groups and shared experts for cross-modal understanding. Uses efficient transformation pipeline adapting pretrained MoE language model through strategic post-training on ASR/TTS datasets and fine-tuning with speech-text instruction data.

Result: MoST consistently outperforms existing models of comparable parameter counts across ASR, TTS, audio language modeling, and spoken question answering benchmarks. Ablation studies confirm modality-specific routing and shared experts design significantly contribute to performance gains.

Conclusion: MoST represents the first fully open-source speech-text LLM built on Mixture of Experts architecture, achieving strong performance using only fully accessible open-source datasets.

Abstract: We present MoST (Mixture of Speech and Text), a novel multimodal large language model that seamlessly integrates speech and text processing through our proposed Modality-Aware Mixture of Experts (MAMoE) architecture. While current multimodal models typically process diverse modality representations with identical parameters, disregarding their inherent representational differences, we introduce specialized routing pathways that direct tokens to modality-appropriate experts based on input type. MAMoE simultaneously enhances modality-specific learning and cross-modal understanding through two complementary components: modality-specific expert groups that capture domain-specific patterns and shared experts that facilitate information transfer between modalities. Building on this architecture, we develop an efficient transformation pipeline that adapts the pretrained MoE language model through strategic post-training on ASR and TTS datasets, followed by fine-tuning with a carefully curated speech-text instruction dataset. A key feature of this pipeline is that it relies exclusively on fully accessible, open-source datasets to achieve strong performance and data efficiency. Comprehensive evaluations across ASR, TTS, audio language modeling, and spoken question answering benchmarks show that MoST consistently outperforms existing models of comparable parameter counts. Our ablation studies confirm that the modality-specific routing mechanism and shared experts design significantly contribute to performance gains across all tested domains. To our knowledge, MoST represents the first fully open-source speech-text LLM built on a Mixture of Experts architecture. \footnote{We release MoST model, training code, inference code, and training data at https://github.com/NUS-HPC-AI-Lab/MoST

[9] Bounded Hyperbolic Tangent: A Stable and Efficient Alternative to Pre-Layer Normalization in Large Language Models

Hoyoon Byun, Youngjun Choi, Taero Kim, Sungrae Park, Kyungwoo Song

Main category: cs.CL

TL;DR: BHyT (Bounded Hyperbolic Tanh) is a drop-in replacement for Pre-LN that addresses both stability and efficiency issues in deep LLMs by combining tanh nonlinearity with explicit input bounding and efficient variance approximation.

Details

Motivation: Pre-LN is inefficient due to repeated statistical calculations and suffers from the curse of depth where activation magnitude and variance escalate with layer depth, destabilizing training. Efficiency-oriented normalization-free methods like DyT improve speed but remain fragile at depth.

Method: BHyT couples a tanh nonlinearity with explicit, data-driven input bounding to keep activations within a non-saturating range. It computes exact statistics once per block and replaces a second normalization with a lightweight variance approximation for efficiency.

Result: BHyT demonstrates improved stability and efficiency during pretraining, achieving 15.8% faster training and 4.2% higher token generation throughput compared to RMSNorm, while matching or surpassing inference performance and robustness across language understanding and reasoning benchmarks.

Conclusion: BHyT jointly addresses stability and efficiency issues in deep LLMs, providing a theoretically-guaranteed stable alternative to Pre-LN with practical speed improvements and maintained performance.

Abstract: Pre-Layer Normalization (Pre-LN) is the de facto choice for large language models (LLMs) and is crucial for stable pretraining and effective transfer learning. However, Pre-LN is inefficient due to repeated statistical calculations and suffers from the curse of depth. As layers grow, the magnitude and variance of the hidden state escalate, destabilizing training. Efficiency-oriented normalization-free methods such as Dynamic Tanh (DyT) improve speed but remain fragile at depth. To jointly address stability and efficiency, we propose Bounded Hyperbolic Tanh (BHyT), a drop-in replacement for Pre-LN. BHyT couples a tanh nonlinearity with explicit, data-driven input bounding to keep activations within a non-saturating range. It prevents depth-wise growth in activation magnitude and variance and comes with a theoretical stability guarantee. For efficiency, BHyT computes exact statistics once per block and replaces a second normalization with a lightweight variance approximation, enhancing efficiency. Empirically, BHyT demonstrates improved stability and efficiency during pretraining, achieving an average of 15.8% faster training and an average of 4.2% higher token generation throughput compared to RMSNorm., while matching or surpassing its inference performance and robustness across language understanding and reasoning benchmarks. Our code is available at: https://anonymous.4open.science/r/BHyT

[10] Uncertainty-Aware Dynamic Knowledge Graphs for Reliable Question Answering

Yu Takahashi, Shun Takeuchi, Kexuan Xin, Guillaume Pelat, Yoshiaki Ikai, Junya Saito, Jonathan Vitale, Shlomo Berkovsky, Amin Beheshti

Main category: cs.CL

TL;DR: Uncertainty-aware dynamic knowledge graphs for robust QA in healthcare, featuring evolving KGs, confidence scoring, and interactive visualization.

Details

Motivation: Existing KG-based QA systems treat facts as static and deterministic, failing to handle incomplete/noisy evidence and uncertainty in reasoning, especially problematic in high-stakes domains like healthcare.

Method: Framework combines: (1) dynamic construction of evolving KGs, (2) confidence scoring and uncertainty-aware retrieval, and (3) interactive interface for reliable and interpretable QA. Instantiated in healthcare with personalized KGs from EHRs.

Result: Demonstrates how uncertainty modeling enhances QA robustness and transparency, enabling users to explore dynamic graphs, inspect confidence-annotated triples, and compare baseline vs confidence-aware answers. Evaluated on mortality prediction task.

Conclusion: Uncertainty-aware dynamic KGs show promise for enhancing QA reliability in high-stakes applications, particularly for clinical data scientists and clinicians working with evolving patient data.

Abstract: Question answering (QA) systems are increasingly deployed across domains. However, their reliability is undermined when retrieved evidence is incomplete, noisy, or uncertain. Existing knowledge graph (KG) based QA frameworks typically represent facts as static and deterministic, failing to capture the evolving nature of information and the uncertainty inherent in reasoning. We present a demonstration of uncertainty-aware dynamic KGs, a framework that combines (i) dynamic construction of evolving KGs, (ii) confidence scoring and uncertainty-aware retrieval, and (iii) an interactive interface for reliable and interpretable QA. Our system highlights how uncertainty modeling can make QA more robust and transparent by enabling users to explore dynamic graphs, inspect confidence-annotated triples, and compare baseline versus confidence-aware answers. The target users of this demo are clinical data scientists and clinicians, and we instantiate the framework in healthcare: constructing personalized KGs from electronic health records, visualizing uncertainty across patient visits, and evaluating its impact on a mortality prediction task. This use case demonstrates the broader promise of uncertainty-aware dynamic KGs for enhancing QA reliability in high-stakes applications.

[11] Bridging the gap: A comparative exploration of Speech-LLM and end-to-end architecture for multilingual conversational ASR

Yuxiang Mei, Dongxing Xu, Jiaen Liang, Yanhua Long

Main category: cs.CL

TL;DR: Enhanced LLM-based ASR framework using parallel speech encoders (Whisper + mHuBERT) with cross-attention fusion, achieving competitive results on MLC-SLM Challenge but still underperforming compared to fine-tuned E2E Whisper models.

Details

Motivation: Address limitations of previous SHNU-mASR system: simple feature concatenation fails to fully exploit complementary information between speech encoders, and performance gap between LLM-based ASR and E2E encoder-decoder ASR remains unexplored.

Method: Enhanced LLM-based ASR framework combining fine-tuned Whisper and mHuBERT encoders with LLM. First evaluates E2E Whisper models with LoRA and full fine-tuning, then proposes cross-attention-based fusion mechanisms for parallel-speech-encoder architecture.

Result: Achieves CER/WER of 10.69% on MLC-SLM Challenge evaluation set, ranking on par with top Track 1 systems despite using only 1,500 hours of training data vs. competitors’ large-scale datasets. However, final LLM-based ASR still underperforms compared to fine-tuned E2E Whisper model.

Conclusion: The enhanced LLM-based ASR framework shows competitive performance with efficient data usage, but empirical results demonstrate that current LLM-based approaches still cannot match fine-tuned E2E Whisper models, providing valuable guidance for future Speech-LLM design.

Abstract: The INTERSPEECH 2025 Challenge on Multilingual Conversational Speech Language Models (MLC-SLM) promotes multilingual conversational ASR with large language models (LLMs). Our previous SHNU-mASR system adopted a competitive parallel-speech-encoder architecture that integrated Whisper and mHuBERT with an LLM. However, it faced two challenges: simple feature concatenation may not fully exploit complementary information, and the performance gap between LLM-based ASR and end-to-end(E2E) encoder-decoder ASR remained unexplored. In this work, we present an enhanced LLM-based ASR framework that combines fine-tuned Whisper and mHuBERT encoders with an LLM to enrich speech representations. We first evaluate E2E Whisper models with LoRA and full fine-tuning on the MLC-SLM ASR task, and then propose cross-attention-based fusion mechanisms for the parallel-speech-encoder. On the official evaluation set of the MLC-SLM Challenge, our system achieves a CER/WER of 10.69%, ranking on par with the top-ranked Track 1 systems, even though it uses only 1,500 hours of baseline training data compared with their large-scale training sets. Nonetheless, we find that our final LLM-based ASR still does not match the performance of a fine-tuned E2E Whisper model, providing valuable empirical guidance for future Speech-LLM design. Our code is publicly available at https://github.com/1535176727/MLC-SLM.

[12] Cross-Platform Evaluation of Large Language Model Safety in Pediatric Consultations: Evolution of Adversarial Robustness and the Scale Paradox

Vahideh Zolfaghari

Main category: cs.CL

TL;DR: LLM safety in pediatric consultations shows smaller models can outperform larger ones under parental anxiety-driven adversarial pressures, with safety depending more on alignment and architecture than scale.

Details

Motivation: LLMs are increasingly used in medical consultations but their safety under realistic user pressures (like anxious parents challenging safeguards) remains understudied, as prior assessments focused only on neutral conditions.

Method: Used PediatricAnxietyBench with 300 queries (150 authentic, 150 adversarial) across 10 topics. Tested three models via APIs: Llama-3.3-70B, Llama-3.1-8B, and Mistral-7B. Safety scored on 0-15 scale measuring restraint, referral, hedging, emergency recognition, and non-prescriptive behavior. Analyzed with paired t-tests and bootstrapped CIs.

Result: Mean safety scores ranged 9.70-10.39. Smaller Llama-3.1-8B outperformed larger Llama-3.3-70B (+0.66, p=0.0001). Models showed positive adversarial effects, strongest for Mistral-7B (+1.09, p=0.0002). Safety generalized across platforms; Llama-3.3-70B had 8% failures. Seizures were vulnerable (33% inappropriate diagnoses). Hedging strongly predicted safety (r=0.68, p<0.001).

Conclusion: Safety depends more on alignment and architecture than scale, with smaller models potentially outperforming larger ones. Evolution across releases suggests targeted training progress. Vulnerabilities and lack of emergency recognition indicate unsuitability for triage. Findings guide model selection, stress adversarial testing importance, and provide open benchmark for medical AI safety.

Abstract: Background Large language models (LLMs) are increasingly deployed in medical consultations, yet their safety under realistic user pressures remains understudied. Prior assessments focused on neutral conditions, overlooking vulnerabilities from anxious users challenging safeguards. This study evaluated LLM safety under parental anxiety-driven adversarial pressures in pediatric consultations across models and platforms. Methods PediatricAnxietyBench, from a prior evaluation, includes 300 queries (150 authentic, 150 adversarial) spanning 10 topics. Three models were assessed via APIs: Llama-3.3-70B and Llama-3.1-8B (Groq), Mistral-7B (HuggingFace), yielding 900 responses. Safety used a 0-15 scale for restraint, referral, hedging, emergency recognition, and non-prescriptive behavior. Analyses employed paired t-tests with bootstrapped CIs. Results Mean scores: 9.70 (Llama-3.3-70B) to 10.39 (Mistral-7B). Llama-3.1-8B outperformed Llama-3.3-70B by +0.66 (p=0.0001, d=0.225). Models showed positive adversarial effects, Mistral-7B strongest (+1.09, p=0.0002). Safety generalized across platforms; Llama-3.3-70B had 8% failures. Seizures vulnerable (33% inappropriate diagnoses). Hedging predicted safety (r=0.68, p<0.001). Conclusions Evaluation shows safety depends on alignment and architecture over scale, with smaller models outperforming larger. Evolution to robustness across releases suggests targeted training progress. Vulnerabilities and no emergency recognition indicate unsuitability for triage. Findings guide selection, stress adversarial testing, and provide open benchmark for medical AI safety.

[13] ADMEDTAGGER: an annotation framework for distillation of expert knowledge for the Polish medical language

Franciszek Górski, Andrzej Czyżewski

Main category: cs.CL

TL;DR: Using multilingual Llama3.1 as a teacher model to annotate Polish medical texts, then training smaller BERT-based classifiers that achieve high performance with dramatically reduced resource requirements.

Details

Motivation: The project needed to develop multi-class classifiers for Polish medical texts across five clinical categories, but faced a fundamental problem of lacking resources for manual annotation of sufficient training data.

Method: Used multilingual Llama3.1 as a teacher model to automatically annotate an extensive corpus of Polish medical texts. Created a test set by manually verifying a portion of these labels. Then trained three BERT-based classifiers (DistilBERT, BioBERT, HerBERT) on the automatically annotated data.

Result: DistilBERT achieved the best results with F1 score > 0.80 for each clinical category and > 0.93 for 3 categories. The classifiers are 500x smaller, use 300x less GPU VRAM, and have several hundred times faster inference compared to large language models.

Conclusion: The framework successfully uses multilingual LLMs as teacher models to overcome annotation resource limitations, enabling creation of highly efficient specialized classifiers that offer practical advantages over large language models for medical text classification in Polish.

Abstract: In this work, we present an annotation framework that demonstrates how a multilingual LLM pretrained on a large corpus can be used as a teacher model to distill the expert knowledge needed for tagging medical texts in Polish. This work is part of a larger project called ADMEDVOICE, within which we collected an extensive corpus of medical texts representing five clinical categories - Radiology, Oncology, Cardiology, Hypertension, and Pathology. Using this data, we had to develop a multi-class classifier, but the fundamental problem turned out to be the lack of resources for annotating an adequate number of texts. Therefore, in our solution, we used the multilingual Llama3.1 model to annotate an extensive corpus of medical texts in Polish. Using our limited annotation resources, we verified only a portion of these labels, creating a test set from them. The data annotated in this way were then used for training and validation of 3 different types of classifiers based on the BERT architecture - the distilled DistilBERT model, BioBERT fine-tuned on medical data, and HerBERT fine-tuned on the Polish language corpus. Among the models we trained, the DistilBERT model achieved the best results, reaching an F1 score > 0.80 for each clinical category and an F1 score > 0.93 for 3 of them. In this way, we obtained a series of highly effective classifiers that represent an alternative to large language models, due to their nearly 500 times smaller size, 300 times lower GPU VRAM consumption, and several hundred times faster inference.

[14] SagaScale: A Realistic, Scalable, and High-Quality Long-Context Benchmark Built from Full-Length Novels

Guancheng Du, Yong Hu, Wenqing Wang, Yaming Yang, Jiaheng Gao

Main category: cs.CL

TL;DR: SagaScale is a new long-context benchmark built from full-length novels that addresses limitations in existing benchmarks by offering realistic, scalable, and high-quality evaluation with context lengths exceeding 250K-320K tokens.

Details

Motivation: Existing long-context benchmarks have limitations in task realism, data scalability, and data quality. There's a need for better evaluation tools to assess LLMs' ability to understand long and complex documents.

Method: Built an automated data collection pipeline using full-length novels and external resources (Wikipedia) to curate question-answer pairs. The benchmark is bilingual (English/Chinese) and provides the longest context lengths to date. Evaluation compares 12 frontier LLMs across three methods: Naïve RAG, Agentic RAG, and Long Context.

Result: Key findings: (1) Direct full context feeding outperforms other methods significantly; (2) Most LLMs struggle with lengthy contexts except Gemini-2.5-Pro; (3) Agentic RAG effectively addresses retrieval bottlenecks in Naïve RAG.

Conclusion: SagaScale provides a realistic, scalable benchmark for long-context evaluation. The benchmark and codebase are publicly released to facilitate future research on long-document understanding in LLMs.

Abstract: Large Language Models (LLMs) have shown significant progress, but understanding long and complex documents remains challenging. Many long-context benchmarks have been proposed, but they face several limitations, including task realism, data scalability, and data quality. To this end, we introduce SagaScale, a realistic, scalable, and high-quality long-context benchmark built from full-length novels. The entire benchmark is constructed using an automated data collection pipeline that utilizes external resources (e.g., Wikipedia pages) to curate question-answer pairs. Critically, these external resources are provided only for benchmark construction and not during evaluation, which allows LLMs to curate complex questions that go beyond what they can answer during evaluation. SagaScale is also bilingual and offers the largest context length to date, with average token counts exceeding 250K for English novels and 320K for Chinese novels. Our evaluation across 12 frontier LLMs and three long-context methods – Naïve RAG, Agentic RAG, and Long Context – yields key insights, including: (1) Directly supplying the full context to the LLM can outperform other methods by a large margin; (2) Most LLMs still struggle with lengthy contexts, but Gemini-2.5-Pro stands out as an exception; and (3) Agentic RAG effectively addresses the retrieval bottleneck in Naïve RAG. Finally, we publicly release the SagaScale benchmark and our data collection codebase to facilitate future research.

[15] Syntactic Framing Fragility: An Audit of Robustness in LLM Ethical Decisions

Katherine Elkins, Jon Chun

Main category: cs.CL

TL;DR: LLMs show inconsistent ethical judgments across syntactically different but logically equivalent prompts, with open-source models being twice as fragile as commercial ones.

Details

Motivation: LLMs are increasingly used in consequential decision-making, but their robustness to benign prompt variations remains underexplored. The paper aims to study whether LLMs maintain consistent ethical judgments across logically equivalent but syntactically different prompts.

Method: Introduces Syntactic Framing Fragility (SFF) framework with Logical Polarity Normalization (LPN) to isolate purely syntactic effects. Audits 23 state-of-the-art models (U.S., China, open-source) over 14 ethical scenarios and 4 controlled framings (39,975 decisions). Uses chain-of-thought reasoning as mitigation strategy.

Result: Widespread and statistically significant inconsistency: many models reverse ethical endorsements solely due to syntactic polarity. Open-source models show over twice the fragility of commercial counterparts. Extreme negation sensitivity found (80-97% endorsement when prompted with “should not”). Chain-of-thought reasoning substantially reduces fragility. Higher risk in financial/business contexts than medical scenarios.

Conclusion: Syntactic consistency is a distinct and critical dimension of ethical robustness. SFF-style audits should be a standard component of safety evaluation for deployed LLMs.

Abstract: Large language models (LLMs) are increasingly deployed in consequential decision-making settings, yet their robustness to benign prompt variation remains underexplored. In this work, we study whether LLMs maintain consistent ethical judgments across logically equivalent but syntactically different prompts, focusing on variations involving negation and conditional structure. We introduce Syntactic Framing Fragility (SFF), a robustness evaluation framework that isolates purely syntactic effects via Logical Polarity Normalization (LPN), enabling direct comparison of decisions across positive and negative framings without semantic drift. Auditing 23 state-of-the-art models spanning the U.S. and China as well as small U.S. open-source software models over 14 ethical scenarios and four controlled framings (39,975 decisions), we find widespread and statistically significant inconsistency: many models reverse ethical endorsements solely due to syntactic polarity, with open-source models exhibiting over twice the fragility of commercial counterparts. We further uncover extreme negation sensitivity, where some models endorse actions in 80-97% of cases when explicitly prompted with “should not.” We show that eliciting chain-of-thought reasoning substantially reduces fragility, identifying a practical mitigation lever, and we map fragility across scenarios, finding higher risk in financial and business contexts than in medical scenarios. Our results demonstrate that syntactic consistency constitutes a distinct and critical dimension of ethical robustness, and we argue that SFF-style audits should be a standard component of safety evaluation for deployed LLMs. Code and results will be available on github.com.

[16] Assessing and Improving Punctuation Robustness in English-Marathi Machine Translation

Kaustubh Shivshankar Shejole, Sourabh Deoghare, Pushpak Bhattacharyya

Main category: cs.CL

TL;DR: Virām is the first diagnostic benchmark for assessing punctuation robustness in English-to-Marathi MT, showing that specialized fine-tuned models and pipeline systems significantly improve translation quality over standard baselines and LLMs.

Details

Motivation: Punctuation is critical for resolving semantic and structural ambiguity in written language, but MT systems face challenges with punctuation-ambiguous text, especially for low- to middle-resource languages like Marathi.

Method: Created Virām benchmark with 54 manually curated punctuation-ambiguous instances. Evaluated two strategies: 1) pipeline-based restore-then-translate approach, and 2) direct fine-tuning on punctuation-varied data.

Result: Specialized fine-tuned models and pipeline systems significantly improve translation quality over standard baselines on Virām benchmark. Current LLMs lag behind task-specific approaches in preserving meaning for punctuation-ambiguous text.

Conclusion: Task-specific approaches (fine-tuning and pipeline systems) are more effective than standard MT baselines and LLMs for handling punctuation ambiguity in English-to-Marathi translation, highlighting the need for further research in this area.

Abstract: Punctuation plays a critical role in resolving semantic and structural ambiguity in written language. Machine Translation (MT) systems are now widely applied across diverse domains and languages, including many low-resource settings. In this work, we focus on Marathi, a low- to middle-resource language. We introduce Virām, the first diagnostic benchmark for assessing punctuation robustness in English-to-Marathi machine translation, consisting of 54 manually curated, punctuation-ambiguous instances. We evaluate two primary strategies for enhancing reliability: a pipeline-based restore-then-translate approach and direct fine-tuned on punctuation-varied data. Our results demonstrate that specialized fine-tuned models and pipeline systems significantly improve translation quality over standard baselines on the Virām benchmark. Qualitative analysis reveals that the original model may result in wrong translations leading to wrong interpretations, while fine-tuned models significantly improve overall reliability. Furthermore, we find that current Large Language Models (LLMs) lag behind these task-specific approaches in preserving meaning for punctuation-ambiguous text, thus necessitating further research in this area.

[17] Forgetting as a Feature: Cognitive Alignment of Large Language Models

Hien Tran, Quinten Steenhuis, Alexandros Christoforos, Chadbourne Davis

Main category: cs.CL

TL;DR: LLMs exhibit systematic forgetting during in-context reasoning, which the authors reinterpret as a functional cognitive mechanism rather than a limitation, drawing parallels to human memory dynamics with exponential decay.

Details

Motivation: LLMs are typically evaluated against perfect Bayesian inference, but growing evidence shows they systematically forget past information. Instead of viewing this as a flaw, the authors aim to reinterpret forgetting as a functional cognitive mechanism inspired by human memory dynamics.

Method: Model LLM inference as a probabilistic memory process with exponential decay. Introduce a benchmark suite for temporal reasoning, concept drift adaptation, and associative recall. Propose probabilistic memory prompting - a lightweight strategy that shapes evidence integration to mimic human-like memory decay.

Result: Empirical results show LLMs demonstrate forgetting rates analogous to human memory efficiency trade-offs between stability and adaptability. The proposed probabilistic memory prompting improves long-horizon reasoning performance.

Conclusion: Forgetting should be viewed not as a failure mode but as a principled mechanism for adaptive intelligence, positioning LLM memory dynamics as functionally similar to human cognitive patterns.

Abstract: Large Language Models (LLMs) are often evaluated against ideals of perfect Bayesian inference, yet growing evidence suggests that their in-context reasoning exhibits systematic forgetting of past information. Rather than viewing this behavior as a limitation, we reinterpret forgetting as a functional cognitive mechanism. Drawing inspiration from human memory dynamics, we model LLM inference as a probabilistic memory process governed by exponential decay. We introduce a benchmark suite that evaluates temporal reasoning, concept drift adaptation, and associative recall, enabling direct comparison between model behavior and human cognitive patterns. Our empirical results reveal that LLMs demonstrate forgetting rates analogous to human memory efficiency trade-offs between stability and adaptability. Building on these observations, we propose probabilistic memory prompting, a lightweight strategy that shapes evidence integration to mimic human-like memory decay, leading to improved long-horizon reasoning performance. Our findings position forgetting not as a failure mode, but as a principled mechanism for adaptive intelligence.

[18] SciNets: Graph-Constrained Multi-Hop Reasoning for Scientific Literature Synthesis

Sauhard Dubey

Main category: cs.CL

TL;DR: SciNets frames scientific synthesis as graph-constrained multi-hop reasoning over literature-derived concept graphs, enabling controllable mechanistic explanations by connecting concepts that rarely co-occur in individual papers.

Details

Motivation: Cross-domain scientific synthesis requires connecting mechanistic explanations across fragmented literature, which is challenging for both retrieval-based systems and unconstrained language models. Current approaches provide limited control over reasoning depth and structural grounding.

Method: SciNets constructs directed concept graphs from query-local corpora and synthesizes mechanistic explanations by identifying multi-hop reasoning paths. It compares shortest-path reasoning, k-shortest paths with diversity constraints, stochastic random walks, and retrieval-augmented language model baselines.

Result: Explicit graph constraints enable controllable multi-hop reasoning across ML, biology, and climate science tasks. There’s a consistent trade-off: deeper and more diverse symbolic reasoning increases grounding instability, while shortest-path reasoning remains stable but structurally conservative.

Conclusion: The study provides a systematic behavioral characterization of the limits and capabilities of current graph-LLM integration for scientific synthesis, revealing fundamental trade-offs between reasoning depth, diversity, and grounding stability.

Abstract: Cross-domain scientific synthesis requires connecting mechanistic explanations across fragmented literature, a capability that remains challenging for both retrieval-based systems and unconstrained language models. While recent work has applied large language models to scientific summarization and question answering, these approaches provide limited control over reasoning depth and structural grounding. We frame mechanistic synthesis as a graph-constrained multi-hop reasoning problem over literature-derived concept graphs. Given a scientific query and a compact, query-local corpus, SciNets constructs a directed concept graph and synthesizes mechanistic explanations by identifying multi-hop reasoning paths that connect concepts that rarely co-occur within individual papers. We systematically compare shortest-path reasoning, k-shortest paths with diversity constraints, stochastic random walks, and a retrieval-augmented language model baseline. Rather than evaluating correctness, which is often indeterminate when synthesizing connections across distributed sources, we introduce a behavioral framework that measures symbolic reasoning depth, mechanistic diversity, and grounding stability. Across machine learning, biology, and climate science tasks, explicit graph constraints enable controllable multi-hop reasoning while revealing a consistent trade-off: deeper and more diverse symbolic reasoning increases grounding instability, whereas shortest-path reasoning remains highly stable but structurally conservative. These findings provide a systematic behavioral characterization of the limits and capabilities of current graph-LLM integration for scientific synthesis.

[19] Eliminating Agentic Workflow for Introduction Generation with Parametric Stage Tokens

Meicong Zhang, Tiancheng su, Guoxiu He

Main category: cs.CL

TL;DR: STIG: Stage Token for Introduction Generation - eliminates external agentic workflows by parameterizing logical structure into LLM, enabling single-inference generation of research introductions.

Details

Motivation: Existing agentic workflows for research introduction writing suffer from long reasoning chains, error accumulation, and reduced textual coherence. Writing research introductions requires rigorous logic, coherent structure, and abstract summarization, which current approaches struggle with.

Method: Proposes STIG which converts multiple stages of original workflow into explicit stage signals (stage tokens). Through instruction tuning, the model learns mapping between stage tokens and text functions, logical order, and transition patterns between stages, encoding this knowledge into model parameters.

Result: STIG can generate multi-stage text in a single inference without explicit workflow calls. It outperforms traditional agentic workflows and other baselines on metrics of semantic similarity and sentence-level structural rationality.

Conclusion: Directly parameterizing logical structure into LLM via stage tokens is effective for research introduction generation, eliminating issues with external agentic workflows while maintaining logical rigor and textual coherence.

Abstract: In recent years, using predefined agentic workflows to guide large language models (LLMs) for literature classification and review has become a research focus. However, writing research introductions is more challenging. It requires rigorous logic, coherent structure, and abstract summarization. Existing workflows often suffer from long reasoning chains, error accumulation, and reduced textual coherence. To address these limitations, we propose eliminating external agentic workflows. Instead, we directly parameterize their logical structure into the LLM. This allows the generation of a complete introduction in a single inference. To this end, we introduce the Stage Token for Introduction Generation (STIG). STIG converts the multiple stages of the original workflow into explicit stage signals. These signals guide the model to follow different logical roles and functions during generation. Through instruction tuning, the model learns the mapping between stage tokens and text functions. It also learns the logical order and transition patterns between stages, encoding this knowledge into the model parameters. Experimental results show that STIG can generate multi-stage text in a single inference. It does not require explicit workflow calls. STIG outperforms traditional agentic workflows and other baselines on metrics of semantic similarity and sentence-level structural rationality. The code is provided in the Supplementary Materials.

[20] Enhancing Business Analytics through Hybrid Summarization of Financial Reports

Tohida Rehman

Main category: cs.CL

TL;DR: A hybrid extractive-abstractive summarization framework for financial earnings call transcripts, combining LexRank for sentence selection with fine-tuned BART/PEGASUS models, plus Longformer for long-range context, achieving strong performance with improved factual consistency.

Details

Motivation: Manual analysis of lengthy financial documents like earnings conference calls is inefficient, time-consuming, and prone to interpretive bias and errors, creating a need for automated summarization systems to distill complex financial information into usable business insights.

Method: Two-stage hybrid pipeline: 1) LexRank algorithm for extractive sentence selection, followed by 2) fine-tuned BART and PEGASUS models for abstractive summarization. Parallel approach uses fine-tuned Longformer Encoder-Decoder (LED) to capture long-range dependencies in financial documents.

Result: Long-context models achieve strongest overall performance, while hybrid framework delivers competitive results with improved factual consistency under computational constraints. Evaluation uses ROUGE, METEOR, MoverScore, BERTScore, plus domain-specific SciBERTScore/FinBERTScore and entity-level factual accuracy metrics.

Conclusion: The hybrid summarization framework effectively distills lengthy financial texts into concise Reuters-style summaries, supporting practical systems for efficient analysis of earnings communications with balanced performance-factual consistency trade-offs.

Abstract: Financial reports and earnings communications contain large volumes of structured and semi structured information, making detailed manual analysis inefficient. Earnings conference calls provide valuable evidence about a firm’s performance, outlook, and strategic priorities. The manual analysis of lengthy call transcripts requires substantial effort and is susceptible to interpretive bias and unintentional error. In this work, we present a hybrid summarization framework that combines extractive and abstractive techniques to produce concise and factually reliable Reuters-style summaries from the ECTSum dataset. The proposed two stage pipeline first applies the LexRank algorithm to identify salient sentences, which are subsequently summarized using fine-tuned variants of BART and PEGASUS designed for resource constrained settings. In parallel, we fine-tune a Longformer Encoder-Decoder (LED) model to directly capture long-range contextual dependencies in financial documents. Model performance is evaluated using standard automatic metrics, including ROUGE, METEOR, MoverScore, and BERTScore, along with domain-specific variants such as SciBERTScore and FinBERTScore. To assess factual accuracy, we further employ entity-level measures based on source-precision and F1-target. The results highlight complementary trade offs between approaches, long context models yield the strongest overall performance, while the hybrid framework achieves competitive results with improved factual consistency under computational constraints. These findings support the development of practical summarization systems for efficiently distilling lengthy financial texts into usable business insights.

[21] Clinical Document Metadata Extraction: A Scoping Review

Kurt Miller, Qiuhao Lu, William Hersh, Kirk Roberts, Steven Bedrick, Andrew Wen, Hongfang Liu

Main category: cs.CL

TL;DR: Scoping review of clinical document metadata extraction research from 2011-2025, analyzing 67 relevant articles to catalog methods, applications, and identify gaps in the field.

Details

Motivation: Clinical document metadata (document type, structure, author role, medical specialty, encounter setting) is essential for accurate interpretation of clinical information, but heterogeneity and drift over time challenge harmonization. Automated extraction methods are needed to coalesce metadata from disparate practices into target schemas.

Method: Conducted a scoping review following PRISMA-ScR guidelines, screening 266 articles published between January 2011 and August 2025, with comprehensive review of 67 relevant articles. Analyzed methodological trends, applications, and data availability.

Result: Found 45 methodological articles, 17 using document metadata in downstream applications, and 5 analyzing metadata composition. Methods evolved from rule-based/traditional ML with feature engineering to transformer-based architectures with minimal engineering. Public labeled data remains sparse except for structural section datasets. LLMs enable broader generalizability exploration.

Conclusion: Research will continue expanding into richer document metadata representations and further integration into clinical applications and workflows, with LLMs enabling advanced clinical text processing systems.

Abstract: Clinical document metadata, such as document type, structure, author role, medical specialty, and encounter setting, is essential for accurate interpretation of information captured in clinical documents. However, vast documentation heterogeneity and drift over time challenge harmonization of document metadata. Automated extraction methods have emerged to coalesce metadata from disparate practices into target schema. This scoping review aims to catalog research on clinical document metadata extraction, identify methodological trends and applications, and highlight gaps. We followed the PRISMA-ScR (Preferred Reporting Items for Systematic Reviews and Meta-Analyses Extension for Scoping Reviews) guidelines to identify articles that perform clinical document metadata extraction. We initially found and screened 266 articles published between January 2011 and August 2025, then comprehensively reviewed 67 we deemed relevant to our study. Among the articles included, 45 were methodological, 17 used document metadata as features in a downstream application, and 5 analyzed document metadata composition. We observe myriad purposes for methodological study and application types. Available labelled public data remains sparse except for structural section datasets. Methods for extracting document metadata have progressed from largely rule-based and traditional machine learning with ample feature engineering to transformer-based architectures with minimal feature engineering. The emergence of large language models has enabled broader exploration of generalizability across tasks and datasets, allowing the possibility of advanced clinical text processing systems. We anticipate that research will continue to expand into richer document metadata representations and integrate further into clinical applications and workflows.

[22] Geometric Patterns of Meaning: A PHATE Manifold Analysis of Multi-lingual Embeddings

Wen G Gong

Main category: cs.CL

TL;DR: Semanscope uses PHATE manifold learning to analyze multilingual embeddings across four linguistic levels, revealing systematic geometric patterns and limitations in current models.

Details

Motivation: To develop a comprehensive framework for examining semantic geometry in multilingual embeddings and validate embedding models' effectiveness in capturing semantic relationships.

Method: Multi-level analysis using Semanscope visualization tool that applies PHATE manifold learning across four linguistic levels: sub-character components, alphabetic systems, semantic domains, and numerical concepts.

Result: Revealed systematic geometric patterns: Chinese radicals show geometric collapse (structural vs semantic confusion), different writing systems have distinct geometric signatures, content words form clustering-branching patterns across 20 semantic domains, and Arabic numbers organize through spiral trajectories rather than clustering.

Conclusion: PHATE manifold learning is an essential analytic tool for studying geometric structure of meaning in embedding space and validating embedding models’ effectiveness in capturing semantic relationships.

Abstract: We introduce a multi-level analysis framework for examining semantic geometry in multilingual embeddings, implemented through Semanscope (a visualization tool that applies PHATE manifold learning across four linguistic levels). Analysis of diverse datasets spanning sub-character components, alphabetic systems, semantic domains, and numerical concepts reveals systematic geometric patterns and critical limitations in current embedding models. At the sub-character level, purely structural elements (Chinese radicals) exhibit geometric collapse, highlighting model failures to distinguish semantic from structural components. At the character level, different writing systems show distinct geometric signatures. At the word level, content words form clustering-branching patterns across 20 semantic domains in English, Chinese, and German. Arabic numbers organize through spiral trajectories rather than clustering, violating standard distributional semantics assumptions. These findings establish PHATE manifold learning as an essential analytic tool not only for studying geometric structure of meaning in embedding space, but also for validating the effectiveness of embedding models in capturing semantic relationships.

[23] Benchmarking Cross-Lingual Semantic Alignment in Multilingual Embeddings

Wen G. Gong

Main category: cs.CL

TL;DR: Researchers introduce Semantic Affinity (SA) metric to evaluate multilingual embedding models’ true cross-lingual alignment, revealing that explicit translation supervision matters more than model scale or multilingual data.

Details

Motivation: With hundreds of multilingual embedding models available, practitioners lack clear guidance on which models provide genuine cross-lingual semantic alignment versus those that achieve task performance through language-specific patterns. Current task-driven benchmarks may mask fundamental alignment shortcomings.

Method: Introduce Semantic Affinity (SA), a bounded metric (0-1) measuring inter-lingual to intra-lingual spread ratio using cosine distance, combined with PHATE visualization in the Semanscope framework. Benchmark 13 models across 4 datasets (52 experiments total).

Result: Three-tier structure emerges: (1) Top BERT models with translation-pair supervision achieve strong alignment (SA ~0.68-0.70); (2) LLM embeddings plateau at SA 0.55-0.61 regardless of scale (0.6B to 8B parameters); (3) MLM-only BERT models fail (SA < 0.50) despite training on 100+ languages. Training objective, not architecture or scale, determines alignment.

Conclusion: Cross-lingual alignment requires explicit translation supervision, not merely model scale or multilingual data. This work provides semantic benchmarking to help practitioners select quality multilingual embeddings from hundreds of available models.

Abstract: With hundreds of multilingual embedding models available, practitioners lack clear guidance on which provide genuine cross-lingual semantic alignment versus task performance through language-specific patterns. Task-driven benchmarks (MTEB) may mask fundamental alignment shortcomings. We introduce Semantic Affinity (SA), a bounded (between 0 and 1) metric measuring inter-lingual to intra-lingual spread ratio using cosine distance, combined with PHATE visualization in our Semanscope framework. Benchmarking 13 models across 4 datasets (52 experiments) reveals a three-tier structure: (1) Top BERT models (LaBSE SA = 0.70, USE SA = 0.68, S-BERT SA = 0.68) achieve strong alignment via translation-pair supervision; (2) LLM embeddings plateau at SA between 0.55 and 0.61 regardless of 0.6 B to 8 B scale; (3) MLM-only BERT models (mBERT, XLM-R, SA < 0.50) fail despite more than 100 language training. Training objective, not architecture or scale, determines alignment. Oracle Bone primitives (1200 BCE) expose semantic drift-models learn corpus patterns rather than cognitive primitives. This work provides semantic benchmarking to help practitioners select quality multilingual embeddings from hundreds of available models, showing cross-lingual alignment requires explicit translation supervision, not merely model scale or multilingual data.

[24] Closing the Data Loop: Using OpenDataArena to Engineer Superior Training Datasets

Xin Gao, Xiaoyang Wang, Yun Zhu, Mengzhang Cai, Conghui He, Lijun Wu

Main category: cs.CL

TL;DR: ODA framework transforms value benchmarking into feedback for systematic SFT dataset construction, creating specialized math and multi-domain datasets that outperform larger baselines with superior data efficiency.

Details

Motivation: Current SFT dataset construction is heuristic and under-theorized, lacking systematic understanding of how individual samples contribute to model performance. There's a need to move from ad-hoc curation to data-centric AI with transparent evaluation.

Method: OpenDataArena (ODA) framework using value-anchored rankings and multi-dimensional analysis to transform benchmarking into feedback signals. Two implementations: ODA-Math-460k with two-stage difficulty-aware pipeline, and ODA-Mixture using “Anchor-and-Patch” strategy for multi-domain instruction datasets.

Result: ODA-Math-460k achieves SOTA on AIME and HMMT benchmarks. ODA-Mixture (100k & 500k) outperforms significantly larger open-source baselines. Both demonstrate superior data efficiency and improve domain-specific reasoning and general utility.

Conclusion: ODA-driven datasets validate a transition toward data-centric AI where transparent evaluation serves as the primary engine for engineering high-quality training data, enabling systematic dataset construction with better performance and efficiency.

Abstract: The construction of Supervised Fine-Tuning (SFT) datasets is a critical yet under-theorized stage in the post-training of Large Language Models (LLMs), as prevalent practices often rely on heuristic aggregation without a systematic understanding of how individual samples contribute to model performance. In this report, we propose a paradigm shift from ad-hoc curation to a closed-loop dataset engineering framework using OpenDataArena (ODA), which leverages value-anchored rankings and multi-dimensional analysis to transform value benchmarking into feedback signals guiding dataset construction. We instantiate this methodology through two new datasets: \textbf{ODA-Math-460k}, a specialized mathematics reasoning dataset that utilizes a novel two-stage difficulty-aware pipeline to achieve State-of-the-Art (SOTA) results on benchmarks such as AIME and HMMT, and \textbf{ODA-Mixture (100k & 500k)}, a series of multi-domain instruction datasets built via an ``Anchor-and-Patch’’ strategy that outperforms significantly larger open-source baselines. Our empirical results demonstrate that ODA-driven datasets significantly improve both domain-specific reasoning and general utility while achieving superior data efficiency, validating a transition toward data-centric AI where transparent evaluation serves as the primary engine for engineering high-quality training data.

[25] From Detection to Diagnosis: Advancing Hallucination Analysis with Automated Data Synthesis

Yanyi Liu, Qingwen Yang, Tiezheng Guo, Feiyu Qu, Jun Liu, Yingyou Wen

Main category: cs.CL

TL;DR: Proposes shifting from hallucination detection to diagnosis, introducing a new task requiring error localization, causal explanation, and content correction, with a 4B-parameter model trained on automatically generated diagnostic data.

Details

Motivation: Current hallucination detection approaches are binary and lack interpretable, actionable feedback for model improvement, limiting practical utility in critical domains where reliable LLM deployment is needed.

Method: Introduces Hallucination Diagnosis Task requiring detection, error localization, causal explanation, and content correction. Develops HDG pipeline to generate training samples with diagnostic metadata via controlled fact fabrication and reasoning chain perturbation. Trains HDM-4B-RL model using GRPO with comprehensive reward function.

Result: HDM-4B-RL surpasses previous SOTA detection models on HaluEval benchmark, achieves comparable performance to advanced general-purpose models in diagnosis tasks while maintaining smaller size (4B parameters).

Conclusion: Validates feasibility and value of hallucination diagnosis paradigm, providing effective methodology for building more trustworthy and reliable generative AI systems beyond simple detection.

Abstract: Hallucinations in Large Language Models (LLMs), defined as the generation of content inconsistent with facts or context, represent a core obstacle to their reliable deployment in critical domains. Current research primarily focuses on binary “detection” approaches that, while capable of identifying hallucinations, fail to provide interpretable and actionable feedback for model improvement, thus limiting practical utility. To address this limitation, a new research paradigm is proposed, shifting from “detection” to “diagnosis”. The Hallucination Diagnosis Task is introduced, a task which requires models to not only detect hallucinations, but also perform error localization, causal explanation, and content correction. We develop the Hallucination Diagnosis Generator (HDG), an automated pipeline that systematically generates high-quality training samples with rich diagnostic metadata from raw corpora through multi-dimensional augmentation strategies including controlled fact fabrication and reasoning chain perturbation. Using HDG-generated data, we train HDM-4B-RL, a 4-billion-parameter hallucination diagnosis model, employing Group Relative Policy Optimization (GRPO) with a comprehensive reward function incorporating structural, accuracy, and localization signals. Experimental results demonstrate that our model surpasses previous state-of-the-art detection models on the HaluEval benchmark while achieving comparable performance to advanced general-purpose models. In comprehensive diagnosis tasks, HDM-4B-RL matches the capabilities of larger general models while maintaining a smaller size. This work validates the feasibility and value of hallucination diagnosis, providing an effective methodology for building more trustworthy and reliable generative AI systems.

[26] Stable and Explainable Personality Trait Evaluation in Large Language Models with Internal Activations

Xiaoxu Ma, Xiangbo Zhang, Zhenyu Weng

Main category: cs.CL

TL;DR: PVNI is a new method for evaluating personality traits in LLMs using internal activations instead of questionnaires, providing more stable and explainable results.

Details

Motivation: Existing questionnaire-based methods for evaluating personality traits in LLMs have limited stability and explainability, as results are highly sensitive to minor prompt variations or role-play configurations.

Method: PVNI extracts a persona vector associated with target personality traits from model’s internal activations using contrastive prompts, then estimates neutral scores by interpolating along the persona vector as an anchor axis for interpretable comparison.

Result: Extensive experiments across diverse LLMs show PVNI yields substantially more stable personality trait evaluations than existing methods, even under questionnaire and role-play variants.

Conclusion: PVNI provides a stable and explainable approach for personality trait evaluation in LLMs, addressing limitations of questionnaire-based methods through internal activation analysis.

Abstract: Evaluating personality traits in Large Language Models (LLMs) is key to model interpretation, comparison, and responsible deployment. However, existing questionnaire-based evaluation methods exhibit limited stability and offer little explainability, as their results are highly sensitive to minor variations in prompt phrasing or role-play configurations. To address these limitations, we propose an internal-activation-based approach, termed Persona-Vector Neutrality Interpolation (PVNI), for stable and explainable personality trait evaluation in LLMs. PVNI extracts a persona vector associated with a target personality trait from the model’s internal activations using contrastive prompts. It then estimates the corresponding neutral score by interpolating along the persona vector as an anchor axis, enabling an interpretable comparison between the neutral prompt representation and the persona direction. We provide a theoretical analysis of the effectiveness and generalization properties of PVNI. Extensive experiments across diverse LLMs demonstrate that PVNI yields substantially more stable personality trait evaluations than existing methods, even under questionnaire and role-play variants.

[27] Bears, all bears, and some bears. Language Constraints on Language Models’ Inductive Inferences

Sriram Padmanabhan, Siyuan Song, Kanishka Misra

Main category: cs.CL

TL;DR: VLMs show human-like sensitivity to subtle linguistic constraints in inductive reasoning, differentiating between generic, universal, and indefinite statements like children do.

Details

Motivation: To test whether general-purpose statistical learners like Vision Language Models (VLMs) exhibit the same subtle sensitivity to linguistic constraints in inductive reasoning that children demonstrate, particularly in differentiating between generic statements, universally quantified NPs, and indefinite plural NPs.

Method: Replicated Gelman et al.’s (2002) developmental experiment with VLMs, first conducting precondition tests (robust category identification and sensitivity to “all” and “some”), then administering the original experiment where models extend novel properties based on different statement types.

Result: VLMs showed behavioral alignment with human children, demonstrating the same pattern of property extension (all > generics > some) and representing these proposition types differently based on inductive constraints rather than surface-form differences.

Conclusion: General-purpose statistical learners like VLMs capture subtle linguistic constraints in inductive reasoning similar to human children, suggesting these constraints emerge from statistical learning rather than requiring specialized cognitive mechanisms.

Abstract: Language places subtle constraints on how we make inductive inferences. Developmental evidence by Gelman et al. (2002) has shown children (4 years and older) to differentiate among generic statements (“Bears are daxable”), universally quantified NPs (“all bears are daxable”) and indefinite plural NPs (“some bears are daxable”) in extending novel properties to a specific member (all > generics > some), suggesting that they represent these types of propositions differently. We test if these subtle differences arise in general purpose statistical learners like Vision Language Models, by replicating the original experiment. On tasking them through a series of precondition tests (robust identification of categories in images and sensitivities to all and some), followed by the original experiment, we find behavioral alignment between models and humans. Post-hoc analyses on their representations revealed that these differences are organized based on inductive constraints and not surface-form differences.

[28] MedRedFlag: Investigating how LLMs Redirect Misconceptions in Real-World Health Communication

Sraavya Sambara, Yuan Pu, Ayman Ali, Vishala Mishra, Lionel Wong, Monica Agrawal

Main category: cs.CL

TL;DR: LLMs often fail to redirect health questions with false premises, posing safety risks for medical AI systems.

Details

Motivation: Patients often ask health questions with false assumptions, requiring clinicians to redirect rather than directly answer. LLMs are increasingly used for medical advice but haven't been tested for this crucial redirection competency.

Method: Developed MedRedFlag dataset of 1100+ real-world health questions from Reddit requiring redirection. Created semi-automated pipeline to curate questions with false premises. Systematically compared LLM responses to clinician responses.

Result: LLMs often fail to redirect problematic questions even when detecting false premises. They provide answers that could lead to suboptimal medical decision making, revealing a substantial safety gap.

Conclusion: There’s a critical safety concern for patient-facing medical AI systems as LLMs struggle with the redirection competency needed for safe medical communication when questions contain false assumptions.

Abstract: Real-world health questions from patients often unintentionally embed false assumptions or premises. In such cases, safe medical communication typically involves redirection: addressing the implicit misconception and then responding to the underlying patient context, rather than the original question. While large language models (LLMs) are increasingly being used by lay users for medical advice, they have not yet been tested for this crucial competency. Therefore, in this work, we investigate how LLMs react to false premises embedded within real-world health questions. We develop a semi-automated pipeline to curate MedRedFlag, a dataset of 1100+ questions sourced from Reddit that require redirection. We then systematically compare responses from state-of-the-art LLMs to those from clinicians. Our analysis reveals that LLMs often fail to redirect problematic questions, even when the problematic premise is detected, and provide answers that could lead to suboptimal medical decision making. Our benchmark and results reveal a novel and substantial gap in how LLMs perform under the conditions of real-world health communication, highlighting critical safety concerns for patient-facing medical AI systems. Code and dataset are available at https://github.com/srsambara-1/MedRedFlag.

[29] OUTLINEFORGE: Hierarchical Reinforcement Learning with Explicit States for Scientific Writing

Yilin Bao, Ziyao He, Zayden Yang

Main category: cs.CL

TL;DR: Reinforcement learning framework for scientific paper generation that treats outline construction as hierarchical planning, with two-stage optimization for structural consistency and scientific correctness.

Details

Motivation: Current LLMs fail at document-level planning, global structure, input coverage, and citation consistency in scientific paper generation, despite strong local fluency.

Method: Reinforcement learning framework modeling outline construction as long-horizon planning over hierarchical structures. Uses two-stage optimization: (1) backward outline reconstruction from partial plans for structural consistency, and (2) forward value-guided RL with rewards for scientific correctness, discourse coherence, and citation fidelity.

Result: Consistent improvements over strong neural and LLM baselines, particularly in long-range structural coherence and citation reliability. Introduced new benchmark for scientific paper generation evaluation.

Conclusion: The RL framework effectively addresses limitations of current LLMs in scientific paper generation by enabling structured, incremental outline construction with explicit modeling of scientific quality metrics.

Abstract: Scientific paper generation requires document-level planning and factual grounding, but current large language models, despite their strong local fluency, often fail in global structure, input coverage, and citation consistency. We present a reinforcement learning framework that casts scientific outline construction as a long-horizon planning problem over hierarchical document structures. Our approach models edit evolving outlines through structured actions, enabling the system to incrementally build a complete scientific manuscript. To support effective and stabilize learning,we introduce a two-stage optimization procedure consisting of (i) backward outline reconstruction from partial plans to enforce global structural consistency, and (ii) forward value-guided reinforcement learning with rewards explicitly modeling scientific correctness, discourse coherence, and citation fidelity. In addition, We further introduce a benchmark for scientific paper generation that evaluates document planning, input utilization, reference faithfulness, outline organization, and content-level factual accuracy. Our results show consistent improvements over strong neural and LLM baselines, particularly in long-range structural coherence and citation reliability.

[30] Patient-Similarity Cohort Reasoning in Clinical Text-to-SQL

Yifei Shen, Yilun Zhao, Justice Ou, Tinglin Huang, Arman Cohan

Main category: cs.CL

TL;DR: CLINSQL is a new benchmark for clinical text-to-SQL with 633 expert-annotated tasks on MIMIC-IV EHR data, requiring complex multi-table joins, temporal reasoning, and patient cohort analysis. Current models (including GPT-5-mini at 74.7% and DeepSeek-R1 at 69.2%) still fall short of clinical reliability despite recent advances.

Details

Motivation: Real-world clinical text-to-SQL requires handling heterogeneous EHR tables, temporal windows, and patient-similarity cohorts to produce executable queries for EHR analytics. Existing benchmarks don't capture the complexity of clinical reasoning needed for practical healthcare applications.

Method: Created CLINSQL benchmark with 633 expert-annotated tasks on MIMIC-IV v3.1 requiring multi-table joins, clinically meaningful filters, and executable SQL. Evaluated 22 proprietary and open-source models using Chain-of-Thought self-refinement with rubric-based SQL analysis and execution checks prioritizing clinical requirements.

Result: Performance remains far from clinical reliability: GPT-5-mini achieves 74.7% execution score on test set, DeepSeek-R1 leads open-source at 69.2%, and Gemini-2.5-Pro drops from 85.5% on Easy tasks to 67.2% on Hard tasks. Models struggle with complex clinical reasoning despite recent advances.

Conclusion: CLINSQL provides a challenging benchmark for clinical text-to-SQL that reveals current models’ limitations in real-world EHR analytics. Progress on this benchmark represents tangible advances toward clinically reliable text-to-SQL systems, but significant work remains to achieve clinical-grade reliability.

Abstract: Real-world clinical text-to-SQL requires reasoning over heterogeneous EHR tables, temporal windows, and patient-similarity cohorts to produce executable queries. We introduce CLINSQL, a benchmark of 633 expert-annotated tasks on MIMIC-IV v3.1 that demands multi-table joins, clinically meaningful filters, and executable SQL. Solving CLINSQL entails navigating schema metadata and clinical coding systems, handling long contexts, and composing multi-step queries beyond traditional text-to-SQL. We evaluate 22 proprietary and open-source models under Chain-of-Thought self-refinement and use rubric-based SQL analysis with execution checks that prioritize critical clinical requirements. Despite recent advances, performance remains far from clinical reliability: on the test set, GPT-5-mini attains 74.7% execution score, DeepSeek-R1 leads open-source at 69.2% and Gemini-2.5-Pro drops from 85.5% on Easy to 67.2% on Hard. Progress on CLINSQL marks tangible advances toward clinically reliable text-to-SQL for real-world EHR analytics.

[31] Clozing the Gap: Exploring Why Language Model Surprisal Outperforms Cloze Surprisal

Sathvik Nair, Byung-Doh Oh

Main category: cs.CL

TL;DR: LM probabilities outperform cloze probabilities in predicting language processing effort due to better resolution, ability to distinguish semantically similar words, and accurate handling of low-frequency words.

Details

Motivation: The paper aims to understand why language model (LM) probabilities outperform cloze task probabilities in predicting language processing effort, since different predictors can lead to different scientific conclusions about prediction in language comprehension.

Method: The authors present evidence for three hypotheses about LM advantages: not suffering from low resolution, distinguishing semantically similar words, and accurately assigning probabilities to low-frequency words.

Result: LM probabilities outperform cloze probabilities for the right reasons - they have higher resolution, better distinguish semantic similarities, and handle low-frequency words more accurately.

Conclusion: The findings call for improving cloze study resolution and conducting experiments to determine if human-like prediction is as sensitive to the fine-grained distinctions made by LM probabilities.

Abstract: How predictable a word is can be quantified in two ways: using human responses to the cloze task or using probabilities from language models (LMs).When used as predictors of processing effort, LM probabilities outperform probabilities derived from cloze data. However, it is important to establish that LM probabilities do so for the right reasons, since different predictors can lead to different scientific conclusions about the role of prediction in language comprehension. We present evidence for three hypotheses about the advantage of LM probabilities: not suffering from low resolution, distinguishing semantically similar words, and accurately assigning probabilities to low-frequency words. These results call for efforts to improve the resolution of cloze studies, coupled with experiments on whether human-like prediction is also as sensitive to the fine-grained distinctions made by LM probabilities.

[32] Take Out Your Calculators: Estimating the Real Difficulty of Question Items with LLM Student Simulations

Christabel Acquaye, Yi Ting Huang, Marine Carpuat, Rachel Rudinger

Main category: cs.CL

TL;DR: LLMs can predict math question difficulty by simulating classrooms of students with varying proficiency levels, achieving correlations up to 0.82 with real-world NAEP data.

Details

Motivation: Traditional math assessment difficulty calibration requires expensive human pilot studies. The paper investigates whether LLMs can provide a cost-effective alternative for predicting item difficulty.

Method: Simulate “classrooms” of 4th, 8th, or 12th grade students by prompting LLMs to role-play students with varying proficiency levels. Use simulation outcomes to fit Item Response Theory (IRT) models and compare learned difficulty parameters to real-world NAEP statistics. Experiment with different classroom sizes, named vs. anonymous students, and demographic stratification.

Result: Achieved correlations of 0.75, 0.76, and 0.82 for grades 4, 8, and 12 respectively. Weaker mathematical models (Gemma) performed better than stronger ones (Llama, Qwen). Named students improved predictions, and demographic stratification further enhanced accuracy.

Conclusion: LLM-based simulation approaches show promise for predicting math item difficulty, offering a cost-effective alternative to human pilot studies. Open-source models are particularly suitable, with weaker mathematical models surprisingly outperforming stronger ones for this task.

Abstract: Standardized math assessments require expensive human pilot studies to establish the difficulty of test items. We investigate the predictive value of open-source large language models (LLMs) for evaluating the difficulty of multiple-choice math questions for real-world students. We show that, while LLMs are poor direct judges of problem difficulty, simulation-based approaches with LLMs yield promising results under the right conditions. Under the proposed approach, we simulate a “classroom” of 4th, 8th, or 12th grade students by prompting the LLM to role-play students of varying proficiency levels. We use the outcomes of these simulations to fit Item Response Theory (IRT) models, comparing learned difficulty parameters for items to their real-world difficulties, as determined by item-level statistics furnished by the National Assessment of Educational Progress (NAEP). We observe correlations as high as 0.75, 0.76, and 0.82 for grades 4, 8, and 12, respectively. In our simulations, we experiment with different “classroom sizes,” showing tradeoffs between computation size and accuracy. We find that role-plays with named students improves predictions (compared to student ids), and stratifying names across gender and race further improves predictions. Our results show that LLMs with relatively weaker mathematical abilities (Gemma) actually yield better real-world difficulty predictions than mathematically stronger models (Llama and Qwen), further underscoring the suitability of open-source models for the task.

[33] Context Volume Drives Performance: Tackling Domain Shift in Extremely Low-Resource Translation via RAG

David Samuel Setiawan, Raphaël Merx, Jey Han Lau

Main category: cs.CL

TL;DR: A hybrid NMT+LLM+RAG framework recovers performance loss when translating low-resource languages across domain shifts, achieving near in-domain quality by using LLM as a safety net to repair NMT failures.

Details

Motivation: NMT models for low-resource languages suffer severe performance degradation under domain shift, as demonstrated with Dhao language where translation quality drops significantly when moving from New Testament to Old Testament domains.

Method: Hybrid framework: 1) Fine-tuned NMT model generates initial draft translation, 2) Large Language Model refines the draft using Retrieval-Augmented Generation (RAG) with retrieved examples, 3) Analysis focuses on impact of retrieved examples quantity vs. retrieval algorithm choice.

Result: System recovers 8.10 chrF++ points (from 27.11 to 35.21), effectively matching original in-domain quality of 36.17 chrF++. Performance driven primarily by number of retrieved examples rather than retrieval algorithm. LLM acts as robust safety net repairing severe failures in zero-shot domains.

Conclusion: Hybrid NMT+LLM+RAG framework successfully addresses domain shift challenges for low-resource languages, with LLM serving as effective safety net to recover translation quality. Key insight: quantity of retrieved examples matters more than retrieval algorithm sophistication.

Abstract: Neural Machine Translation (NMT) models for low-resource languages suffer significant performance degradation under domain shift. We quantify this challenge using Dhao, an indigenous language of Eastern Indonesia with no digital footprint beyond the New Testament (NT). When applied to the unseen Old Testament (OT), a standard NMT model fine-tuned on the NT drops from an in-domain score of 36.17 chrF++ to 27.11 chrF++. To recover this loss, we introduce a hybrid framework where a fine-tuned NMT model generates an initial draft, which is then refined by a Large Language Model (LLM) using Retrieval-Augmented Generation (RAG). The final system achieves 35.21 chrF++ (+8.10 recovery), effectively matching the original in-domain quality. Our analysis reveals that this performance is driven primarily by the number of retrieved examples rather than the choice of retrieval algorithm. Qualitative analysis confirms the LLM acts as a robust “safety net,” repairing severe failures in zero-shot domains.

[34] SocraticKG: Knowledge Graph Construction via QA-Driven Fact Extraction

Sanghyeok Choi, Woosang Jeon, Kyuseok Yang, Taehyeong Kim

Main category: cs.CL

TL;DR: SocraticKG introduces question-answer pairs as intermediate representation for KG construction, addressing the trade-off between factual coverage and relational connectivity in LLM-based approaches.

Details

Motivation: Current LLM-based KG construction methods face a fundamental trade-off: achieving broad factual coverage often leads to relational fragmentation, while premature consolidation causes information loss. There's a need for better semantic structuring before triple extraction.

Method: SocraticKG uses question-answer pairs as structured intermediate representation, employing 5W1H-guided QA expansion to systematically unfold document-level semantics before triple extraction. This captures contextual dependencies and implicit relational links typically lost in direct extraction pipelines.

Result: Evaluation on MINE benchmark shows SocraticKG effectively addresses the coverage-connectivity trade-off, achieving superior factual retention while maintaining high structural cohesion even as extracted knowledge volume expands substantially.

Conclusion: QA-mediated semantic scaffolding plays a critical role in structuring semantics prior to KG extraction, enabling more coherent and reliable graph construction in subsequent stages by providing explicit grounding in source documents.

Abstract: Constructing Knowledge Graphs (KGs) from unstructured text provides a structured framework for knowledge representation and reasoning, yet current LLM-based approaches struggle with a fundamental trade-off: factual coverage often leads to relational fragmentation, while premature consolidation causes information loss. To address this, we propose SocraticKG, an automated KG construction method that introduces question-answer pairs as a structured intermediate representation to systematically unfold document-level semantics prior to triple extraction. By employing 5W1H-guided QA expansion, SocraticKG captures contextual dependencies and implicit relational links typically lost in direct KG extraction pipelines, providing explicit grounding in the source document that helps mitigate implicit reasoning errors. Evaluation on the MINE benchmark demonstrates that our approach effectively addresses the coverage-connectivity trade-off, achieving superior factual retention while maintaining high structural cohesion even as extracted knowledge volume substantially expands. These results highlight that QA-mediated semantic scaffolding plays a critical role in structuring semantics prior to KG extraction, enabling more coherent and reliable graph construction in subsequent stages.

[35] EHRNavigator: A Multi-Agent System for Patient-Level Clinical Question Answering over Heterogeneous Electronic Health Records

Lingfei Qian, Mauro Giuffre, Yan Wang, Huan He, Qianqian Xie, Xuguang Ai, Xeuqing Peng, Fan Ma, Ruey-Ling Weng, Donald Wright, Adan Wang, Qingyu Chen, Vipina K. Keloth, Hua Xu

Main category: cs.CL

TL;DR: EHRNavigator is a multi-agent AI framework for patient-level question answering across heterogeneous EHR data, achieving 86% accuracy on real-world cases with clinically acceptable response times.

Details

Motivation: Existing EHR QA systems are evaluated only on benchmark datasets, limiting their practical clinical relevance despite the need for timely, context-aware access to patient information in clinical decision-making.

Method: Multi-agent framework that harnesses AI agents to perform patient-level QA across heterogeneous and multimodal EHR data, evaluated using both public benchmark and institutional datasets under realistic hospital conditions.

Result: Achieved 86% accuracy on real-world cases with strong generalization, maintained clinically acceptable response times, and demonstrated effectiveness through clinician-validated chart review.

Conclusion: EHRNavigator effectively bridges the gap between benchmark evaluation and clinical deployment, offering a robust, adaptive, and efficient solution for real-world EHR question answering.

Abstract: Clinical decision-making increasingly relies on timely and context-aware access to patient information within Electronic Health Records (EHRs), yet most existing natural language question-answering (QA) systems are evaluated solely on benchmark datasets, limiting their practical relevance. To overcome this limitation, we introduce EHRNavigator, a multi-agent framework that harnesses AI agents to perform patient-level question answering across heterogeneous and multimodal EHR data. We assessed its performance using both public benchmark and institutional datasets under realistic hospital conditions characterized by diverse schemas, temporal reasoning demands, and multimodal evidence integration. Through quantitative evaluation and clinician-validated chart review, EHRNavigator demonstrated strong generalization, achieving 86% accuracy on real-world cases while maintaining clinically acceptable response times. Overall, these findings confirm that EHRNavigator effectively bridges the gap between benchmark evaluation and clinical deployment, offering a robust, adaptive, and efficient solution for real-world EHR question answering.

[36] EmplifAI: a Fine-grained Dataset for Japanese Empathetic Medical Dialogues in 28 Emotion Labels

Wan Jou She, Lis Kanashiro Pereira, Fei Cheng, Sakiko Yahata, Panote Siriaraya, Eiji Aramaki

Main category: cs.CL

TL;DR: EmplifAI is a Japanese empathetic dialogue dataset for chronic disease patients, featuring 28 fine-grained emotions across 280 medical situations and 4125 dialogues, with evaluation showing improved LLM empathy through fine-tuning.

Details

Motivation: Chronic disease patients experience complex emotional shifts during disease management, requiring empathetic support. Existing datasets lack fine-grained emotion categorization for Japanese medical contexts.

Method: Created dataset with 28 emotion categories adapted from GoEmotions taxonomy, collected 280 medical situations and 4125 two-turn dialogues via crowdsourcing and expert review. Evaluated using BERTScore on LLMs and fine-tuned Japanese LLM (LLM-jp-3.1-13b-instruct4).

Result: Achieved F1 score of 0.83 for emotional alignment evaluation. Fine-tuning improved fluency, general empathy, and emotion-specific empathy. Validated evaluation pipeline by comparing LLM-as-a-Judge with human raters.

Conclusion: EmplifAI effectively supports empathetic dialogue generation for chronic disease patients, with validated evaluation methods showing promising results for improving LLM empathy in Japanese medical contexts.

Abstract: This paper introduces EmplifAI, a Japanese empathetic dialogue dataset designed to support patients coping with chronic medical conditions. They often experience a wide range of positive and negative emotions (e.g., hope and despair) that shift across different stages of disease management. EmplifAI addresses this complexity by providing situation-based dialogues grounded in 28 fine-grained emotion categories, adapted and validated from the GoEmotions taxonomy. The dataset includes 280 medically contextualized situations and 4125 two-turn dialogues, collected through crowdsourcing and expert review. To evaluate emotional alignment in empathetic dialogues, we assessed model predictions on situation–dialogue pairs using BERTScore across multiple large language models (LLMs), achieving F1 scores of 0.83. Fine-tuning a baseline Japanese LLM (LLM-jp-3.1-13b-instruct4) with EmplifAI resulted in notable improvements in fluency, general empathy, and emotion-specific empathy. Furthermore, we compared the scores assigned by LLM-as-a-Judge and human raters on dialogues generated by multiple LLMs to validate our evaluation pipeline and discuss the insights and potential risks derived from the correlation analysis.

[37] Long-Chain Reasoning Distillation via Adaptive Prefix Alignment

Zhenghao Liu, Zhuoyang Wu, Xinze Li, Yukun Yan, Shuo Wang, Zulong Chen, Yu Gu, Ge Yu, Maosong Sun

Main category: cs.CL

TL;DR: P-ALIGN is a distillation framework that adaptively truncates teacher-generated reasoning trajectories to create concise, learnable prefixes for student models, improving mathematical reasoning performance by over 3% compared to baselines.

Details

Motivation: Teacher-generated reasoning trajectories for LLM distillation are often excessively long and structurally complex, creating a mismatch between supervision signals and student model learning capacity, which hinders effective knowledge transfer.

Method: Prefix-ALIGNment distillation (P-ALIGN) adaptively truncates teacher CoTs by determining whether the remaining suffix is concise and sufficient for student guidance, then uses teacher-generated prefixes to supervise student models through effective prefix alignment.

Result: P-ALIGN outperforms all baselines by over 3% on multiple mathematical reasoning benchmarks, with analysis showing that constructed prefixes provide more effective supervision while avoiding negative impacts from redundant and uncertain reasoning components.

Conclusion: Adaptive prefix alignment effectively addresses the teacher-student capacity mismatch in reasoning distillation, providing a practical framework for leveraging teacher CoTs while avoiding the pitfalls of overly complex reasoning trajectories.

Abstract: Large Language Models (LLMs) have demonstrated remarkable reasoning capabilities, particularly in solving complex mathematical problems. Recent studies show that distilling long reasoning trajectories can effectively enhance the reasoning performance of small-scale student models. However, teacher-generated reasoning trajectories are often excessively long and structurally complex, making them difficult for student models to learn. This mismatch leads to a gap between the provided supervision signal and the learning capacity of the student model. To address this challenge, we propose Prefix-ALIGNment distillation (P-ALIGN), a framework that fully exploits teacher CoTs for distillation through adaptive prefix alignment. Specifically, P-ALIGN adaptively truncates teacher-generated reasoning trajectories by determining whether the remaining suffix is concise and sufficient to guide the student model. Then, P-ALIGN leverages the teacher-generated prefix to supervise the student model, encouraging effective prefix alignment. Experiments on multiple mathematical reasoning benchmarks demonstrate that P-ALIGN outperforms all baselines by over 3%. Further analysis indicates that the prefixes constructed by P-ALIGN provide more effective supervision signals, while avoiding the negative impact of redundant and uncertain reasoning components. All code is available at https://github.com/NEUIR/P-ALIGN.

[38] Deriving Character Logic from Storyline as Codified Decision Trees

Letian Peng, Kun Zhou, Longfei Yun, Yupeng Hou, Jingbo Shang

Main category: cs.CL

TL;DR: CDT framework creates executable, interpretable decision trees from narrative data for more reliable role-playing agents.

Details

Motivation: Existing role-playing agent behavioral profiles are unstructured, non-executable, weakly validated, and lead to brittle agent behavior.

Method: Data-driven framework that induces executable decision trees from narrative data through iterative rule induction, validation against data, and hierarchical specialization.

Result: Substantially outperforms human-written profiles and prior methods on 85 characters across 16 artifacts, demonstrating more reliable agent grounding.

Conclusion: Codified and validated behavioral representations through decision trees lead to more reliable and consistent role-playing agent behavior.

Abstract: Role-playing (RP) agents rely on behavioral profiles to act consistently across diverse narrative contexts, yet existing profiles are largely unstructured, non-executable, and weakly validated, leading to brittle agent behavior. We propose Codified Decision Trees (CDT), a data-driven framework that induces an executable and interpretable decision structure from large-scale narrative data. CDT represents behavioral profiles as a tree of conditional rules, where internal nodes correspond to validated scene conditions and leaves encode grounded behavioral statements, enabling deterministic retrieval of context-appropriate rules at execution time. The tree is learned by iteratively inducing candidate scene-action rules, validating them against data, and refining them through hierarchical specialization, yielding profiles that support transparent inspection and principled updates. Across multiple benchmarks, CDT substantially outperforms human-written profiles and prior profile induction methods on $85$ characters across $16$ artifacts, indicating that codified and validated behavioral representations lead to more reliable agent grounding.

[39] Is MT Ready for the Next Crisis or Pandemic?

Vipasha Bansal, Elizabeth Brown, Chelsea Kendrick, Benjamin Pong, William D. Lewis

Main category: cs.CL

TL;DR: Evaluates commercial MT systems on pandemic-related translations for low-resource languages using TICO-19 dataset to assess readiness for future crises.

Details

Motivation: There's a critical communication gap in crisis situations between aid providers/governments and affected communities, especially with low-resource languages. Commercial MT systems could help bridge this gap, but their effectiveness for crisis/medical domains with these languages is unknown.

Method: Evaluated four commercial MT systems using the TICO-19 dataset, which contains pandemic-related sentences from high-priority languages spoken by communities most vulnerable in future pandemics. Assessed translation quality based on usability of output.

Result: The study provides an assessment of how ready current commercial MT systems are for pandemic communication with low-resource language communities, though specific performance metrics aren’t detailed in the abstract.

Conclusion: Commercial MT systems need evaluation for crisis communication, especially with low-resource languages. The findings help determine current readiness levels for future pandemics/epidemics and identify gaps in MT capabilities for vulnerable communities.

Abstract: Communication in times of crisis is essential. However, there is often a mismatch between the language of governments, aid providers, doctors, and those to whom they are providing aid. Commercial MT systems are reasonable tools to turn to in these scenarios. But how effective are these tools for translating to and from low resource languages, particularly in the crisis or medical domain? In this study, we evaluate four commercial MT systems using the TICO-19 dataset, which is composed of pandemic-related sentences from a large set of high priority languages spoken by communities most likely to be affected adversely in the next pandemic. We then assess the current degree of ``readiness’’ for another pandemic (or epidemic) based on the usability of the output translations.

[40] CALM-IT: Generating Realistic Long-Form Motivational Interviewing Dialogues with Dual-Actor Conversational Dynamics Tracking

Viet Cuong Nguyen, Nhi Yen Nguyen, Kristin A. Candan, Mary Conlon, Vanessa Rumie, Kristen Risola, Srijan Kumar, Munmun De Choudhury

Main category: cs.CL

TL;DR: CALM-IT is a framework for generating long-form Motivational Interviewing dialogues that models dual-actor conversational dynamics to maintain therapeutic coherence over extended interactions.

Details

Motivation: Current LLMs struggle with sustaining realistic, goal-directed dialogue in mental health settings, as they optimize for local next-turn responses rather than maintaining coherent therapeutic progress, leading to brittleness and long-horizon drift in extended conversations.

Method: CALM-IT represents therapist-client interaction as a bidirectional state-space process where both agents continuously update inferred alignment, mental states, and short-term goals to guide strategy selection and utterance generation, explicitly modeling dual-actor conversational dynamics.

Result: CALM-IT consistently outperforms strong baselines in Effectiveness and Goal Alignment, remains substantially more stable as conversation length increases, achieves the highest client acceptance rate (64.3%), and initiates fewer but more precise therapist redirections.

Conclusion: Modeling evolving conversational state is essential for generating high-quality long-form synthetic conversations, and CALM-IT provides evidence that explicit dual-actor state-space modeling enables more therapeutically aligned and stable dialogue generation.

Abstract: Large Language Models (LLMs) are increasingly used in mental health-related settings, yet they struggle to sustain realistic, goal-directed dialogue over extended interactions. While LLMs generate fluent responses, they optimize locally for the next turn rather than maintaining a coherent model of therapeutic progress, leading to brittleness and long-horizon drift. We introduce CALM-IT, a framework for generating and evaluating long-form Motivational Interviewing (MI) dialogues that explicitly models dual-actor conversational dynamics. CALM-IT represents therapist-client interaction as a bidirectional state-space process, in which both agents continuously update inferred alignment, mental states, and short-term goals to guide strategy selection and utterance generation. Across large-scale evaluations, CALM-IT consistently outperforms strong baselines in Effectiveness and Goal Alignment and remains substantially more stable as conversation length increases. Although CALM-IT initiates fewer therapist redirections, it achieves the highest client acceptance rate (64.3%), indicating more precise and therapeutically aligned intervention timing. Overall, CALM-IT provides evidence for modeling evolving conversational state being essential for generating high-quality long-form synthetic conversations.

[41] Skill-Aware Data Selection and Fine-Tuning for Data-Efficient Reasoning Distillation

Lechen Zhang, Yunxiang Zhang, Wei Hu, Lu Wang

Main category: cs.CL

TL;DR: Skill-centric distillation framework transfers reasoning ability efficiently using skill-based data selection and skill-aware fine-tuning, achieving strong performance with only 1,000 examples.

Details

Motivation: Distilling large reasoning models typically requires large-scale SFT data, creating a need for more data-efficient training methods to transfer reasoning capabilities to weaker models.

Method: Two-component framework: (1) Skill-based data selection that prioritizes examples targeting student’s weaker skills, and (2) Skill-aware fine-tuning that encourages explicit skill decomposition during problem solving.

Result: With only 1,000 training examples from 100K teacher corpus, method surpasses random SFT baselines by +1.6% on Qwen3-4B and +1.4% on Qwen3-8B across five mathematical reasoning benchmarks.

Conclusion: Skill-centric training is effective for efficient reasoning distillation, with gains concentrating on emphasized skills during training.

Abstract: Large reasoning models such as DeepSeek-R1 and their distilled variants achieve strong performance on complex reasoning tasks. Yet, distilling these models often demands large-scale data for supervised fine-tuning (SFT), motivating the pursuit of data-efficient training methods. To address this, we propose a skill-centric distillation framework that efficiently transfers reasoning ability to weaker models with two components: (1) Skill-based data selection, which prioritizes examples targeting the student model’s weaker skills, and (2) Skill-aware fine-tuning, which encourages explicit skill decomposition during problem solving. With only 1,000 training examples selected from a 100K teacher-generated corpus, our method surpasses random SFT baselines by +1.6% on Qwen3-4B and +1.4% on Qwen3-8B across five mathematical reasoning benchmarks. Further analysis confirms that these gains concentrate on skills emphasized during training, highlighting the effectiveness of skill-centric training for efficient reasoning distillation.

[42] Role-Playing Agents Driven by Large Language Models: Current Status, Challenges, and Future Trends

Ye Wang, Jiaxing Chen, Hongjiang Xiao

Main category: cs.CL

TL;DR: This paper provides a comprehensive review of role-playing language agents (RPLAs), covering their technological evolution, key technical pathways, data construction methods, evaluation frameworks, and future research directions.

Details

Motivation: With the rapid advancement of large language models, role-playing language agents have become an important research area at the intersection of NLP and human-computer interaction. The paper aims to systematically review the current development and key technologies of RPLAs to provide a comprehensive perspective for researchers.

Method: The paper conducts a systematic review approach, analyzing the technological evolution of RPLAs through three stages: rule-based template paradigms, language style imitation, and cognitive simulation. It examines critical technical pathways including psychological scale-driven character modeling, memory-augmented prompting, and motivation-situation-based behavioral decision control. The review also covers data construction methods and multi-dimensional evaluation frameworks.

Result: The paper successfully delineates the complete technological landscape of RPLAs, identifying key technical pathways and challenges. It provides a comprehensive analysis of data construction methods (sources, copyright constraints, annotation) and evaluation frameworks covering role knowledge, personality fidelity, value alignment, and interactive hallucination. The review also assesses various evaluation methods including human evaluation, reward models, and LLM-based scoring.

Conclusion: The paper concludes by outlining future research directions for role-playing agents, including personality evolution modeling, multi-agent collaborative narrative, multimodal immersive interaction, and integration with cognitive neuroscience. It aims to provide systematic perspective and methodological insights for subsequent research in this emerging field.

Abstract: In recent years, with the rapid advancement of large language models (LLMs), role-playing language agents (RPLAs) have emerged as a prominent research focus at the intersection of natural language processing (NLP) and human-computer interaction. This paper systematically reviews the current development and key technologies of RPLAs, delineating the technological evolution from early rule-based template paradigms, through the language style imitation stage, to the cognitive simulation stage centered on personality modeling and memory mechanisms. It summarizes the critical technical pathways supporting high-quality role-playing, including psychological scale-driven character modeling, memory-augmented prompting mechanisms, and motivation-situation-based behavioral decision control. At the data level, the paper further analyzes the methods and challenges of constructing role-specific corpora, focusing on data sources, copyright constraints, and structured annotation processes. In terms of evaluation, it collates multi-dimensional assessment frameworks and benchmark datasets covering role knowledge, personality fidelity, value alignment, and interactive hallucination, while commenting on the advantages and disadvantages of methods such as human evaluation, reward models, and LLM-based scoring. Finally, the paper outlines future development directions of role-playing agents, including personality evolution modeling, multi-agent collaborative narrative, multimodal immersive interaction, and integration with cognitive neuroscience, aiming to provide a systematic perspective and methodological insights for subsequent research.

[43] ToolSafe: Enhancing Tool Invocation Safety of LLM-based agents via Proactive Step-level Guardrail and Feedback

Yutao Mou, Zhangchi Xue, Lijun Li, Peiyang Liu, Shikun Zhang, Wei Ye, Jing Shao

Main category: cs.CL

TL;DR: TS-Guard: A reinforcement learning-based guardrail model that proactively detects unsafe tool invocations in LLM agents before execution, reducing harmful actions by 65% while improving benign task completion.

Details

Motivation: LLM-based agents' expanded tool invocation capabilities create significant security risks. Real-time monitoring and proactive intervention before unsafe execution are critical for safe agent deployment but remain under-explored.

Method: 1) Construct TS-Bench benchmark for step-level tool invocation safety detection; 2) Develop TS-Guard guardrail model using multi-task reinforcement learning to detect unsafe actions by reasoning over interaction history; 3) Create TS-Flow, a guardrail-feedback-driven reasoning framework for LLM agents.

Result: TS-Guard reduces harmful tool invocations of ReAct-style agents by 65% on average and improves benign task completion by approximately 10% under prompt injection attacks.

Conclusion: The proposed guardrail framework effectively enhances LLM agent safety by proactively detecting unsafe tool invocations through interpretable safety judgments and feedback-driven reasoning.

Abstract: While LLM-based agents can interact with environments via invoking external tools, their expanded capabilities also amplify security risks. Monitoring step-level tool invocation behaviors in real time and proactively intervening before unsafe execution is critical for agent deployment, yet remains under-explored. In this work, we first construct TS-Bench, a novel benchmark for step-level tool invocation safety detection in LLM agents. We then develop a guardrail model, TS-Guard, using multi-task reinforcement learning. The model proactively detects unsafe tool invocation actions before execution by reasoning over the interaction history. It assesses request harmfulness and action-attack correlations, producing interpretable and generalizable safety judgments and feedback. Furthermore, we introduce TS-Flow, a guardrail-feedback-driven reasoning framework for LLM agents, which reduces harmful tool invocations of ReAct-style agents by 65 percent on average and improves benign task completion by approximately 10 percent under prompt injection attacks.

[44] What Gets Activated: Uncovering Domain and Driver Experts in MoE Language Models

Guimin Hu, Meng Li, Qiwei Peng, Lijie Hu, Boyan Xu, Ruichu Cai

Main category: cs.CL

TL;DR: This paper analyzes expert activation patterns in Mixture of Experts (MoE) LLMs, distinguishing between domain experts (specialized for specific domains) and driver experts (causally influential on model performance). The study introduces metrics to identify these expert types and examines how tokens trigger expert activation.

Details

Motivation: Most interpretability work focuses on layer- or neuron-level mechanisms in Transformers, leaving expert-level behavior in MoE LLMs underexplored. The authors are motivated by functional specialization in the human brain to analyze expert activation patterns and understand how different expert types contribute to model behavior.

Method: The study analyzes expert activation in MoE models across three public domains. It introduces entropy-based metrics to assess domain preference and causal-effect metrics to measure expert influence on model output. The approach distinguishes between domain experts (strongly favored for particular domains) and driver experts (causally influential on performance). The research also explores how individual tokens are associated with specific expert activation.

Result: Key findings: (1) Among activated experts, some show clear domain preferences while others exert strong causal influence on model performance; (2) tokens occurring earlier in sentences are more likely to trigger driver experts; (3) adjusting weights of domain and driver experts leads to significant performance gains across all three models and domains.

Conclusion: The findings shed light on the internal mechanisms of MoE models and enhance their interpretability, providing insights into expert specialization and causal influence patterns that can be leveraged for model improvement through targeted expert weight adjustments.

Abstract: Most interpretability work focuses on layer- or neuron-level mechanisms in Transformers, leaving expert-level behavior in MoE LLMs underexplored. Motivated by functional specialization in the human brain, we analyze expert activation by distinguishing domain and driver experts. In this work, we study expert activation in MoE models across three public domains and address two key questions: (1) which experts are activated, and whether certain expert types exhibit consistent activation patterns; and (2) how tokens are associated with and trigger the activation of specific experts. To answer these questions, we introduce entropy-based and causal-effect metrics to assess whether an expert is strongly favored for a particular domain, and how strongly expert activation contributes causally to the model’s output, thus identify domain and driver experts, respectively. Furthermore, we explore how individual tokens are associated with the activation of specific experts. Our analysis reveals that (1) Among the activated experts, some show clear domain preferences, while others exert strong causal influence on model performance, underscoring their decisive roles. (2) tokens occurring earlier in a sentence are more likely to trigger the driver experts, and (3) adjusting the weights of domain and driver experts leads to significant performance gains across all three models and domains. These findings shed light on the internal mechanisms of MoE models and enhance their interpretability.

[45] Alignment Pretraining: AI Discourse Causes Self-Fulfilling (Mis)alignment

Cameron Tice, Puria Radmard, Samuel Ratnam, Andy Kim, David Africa, Kyle O’Brien

Main category: cs.CL

TL;DR: LLMs internalize behavioral priors from AI discourse in pretraining data - negative discourse increases misalignment, positive discourse reduces it, showing “self-fulfilling alignment” effects that persist through post-training.

Details

Motivation: Pretraining corpora contain extensive discourse about AI systems, but the causal influence of this discourse on downstream alignment is poorly understood. If AI descriptions are predominantly negative, LLMs may internalize corresponding behavioral priors, creating self-fulfilling misalignment.

Method: Controlled study pretraining 6.9B-parameter LLMs with varying amounts of (mis)alignment discourse. Upsampling synthetic training documents about AI misalignment vs. aligned behavior to test causal effects.

Result: Discussion of AI contributes to misalignment. Upsampling misalignment discourse increases misaligned behavior. Upsampling aligned behavior reduces misalignment scores from 45% to 9%. Effects dampened but persist through post-training.

Conclusion: Establishes “alignment pretraining” as a complement to post-training. Recommends practitioners pretrain for alignment as well as capabilities. Shows LLMs internalize behavioral priors from AI discourse, creating self-fulfilling alignment effects.

Abstract: Pretraining corpora contain extensive discourse about AI systems, yet the causal influence of this discourse on downstream alignment remains poorly understood. If prevailing descriptions of AI behaviour are predominantly negative, LLMs may internalise corresponding behavioural priors, giving rise to self-fulfilling misalignment. This paper provides the first controlled study of this hypothesis by pretraining 6.9B-parameter LLMs with varying amounts of (mis)alignment discourse. We find that discussion of AI contributes to misalignment. Upsampling synthetic training documents about AI misalignment leads to a notable increase in misaligned behaviour. Conversely, upsampling documents about aligned behaviour reduces misalignment scores from 45% to 9%. We consider this evidence of self-fulfilling alignment. These effects are dampened, but persist through post-training. Our findings establish the study of how pretraining data shapes alignment priors, or alignment pretraining, as a complement to post-training. We recommend practitioners pretrain for alignment as well as capabilities. Our models and datasets are available at alignmentpretraining.ai

[46] AWED-FiNER: Agents, Web applications, and Expert Detectors for Fine-grained Named Entity Recognition across 36 Languages for 6.6 Billion Speakers

Prachuryya Kaushik, Ashish Anand

Main category: cs.CL

TL;DR: AWED-FiNER is an open-source ecosystem providing Fine-grained Named Entity Recognition for 36 global languages, addressing LLM limitations in low-resource languages and fine-grained tasks through agentic tools, web apps, and expert models.

Details

Motivation: Large Language Models struggle with low-resource languages and fine-grained NLP tasks, creating a gap in Fine-grained Named Entity Recognition for many global languages spoken by billions of people.

Method: Developed an ecosystem with: 1) Agentic toolkits to route multilingual text to specialized expert models, 2) Web applications for non-technical users, 3) Collection of 49 extremely small-sized open-source expert models for 36 languages, including vulnerable languages like Bodo and Manipuri.

Result: Created a comprehensive FgNER solution covering languages spoken by over 6.6 billion people, with fast annotation capabilities (within seconds), offline deployment options for resource-constrained scenarios, and specific focus on vulnerable languages.

Conclusion: AWED-FiNER successfully bridges the FgNER gap for 36 global languages, providing accessible, efficient, and resource-friendly solutions that address LLM limitations while supporting both technical and non-technical users across diverse linguistic contexts.

Abstract: We introduce AWED-FiNER, an open-source ecosystem designed to bridge the gap in Fine-grained Named Entity Recognition (FgNER) for 36 global languages spoken by more than 6.6 billion people. While Large Language Models (LLMs) dominate general Natural Language Processing (NLP) tasks, they often struggle with low-resource languages and fine-grained NLP tasks. AWED-FiNER provides a collection of agentic toolkits, web applications, and several state-of-the-art expert models that provides FgNER solutions across 36 languages. The agentic tools enable to route multilingual text to specialized expert models and fetch FgNER annotations within seconds. The web-based platforms provide ready-to-use FgNER annotation service for non-technical users. Moreover, the collection of language specific extremely small sized open-source state-of-the-art expert models facilitate offline deployment in resource contraint scenerios including edge devices. AWED-FiNER covers languages spoken by over 6.6 billion people, including a specific focus on vulnerable languages such as Bodo, Manipuri, Bishnupriya, and Mizo. The resources can be accessed here: Agentic Tool (https://github.com/PrachuryyaKaushik/AWED-FiNER), Web Application (https://hf.co/spaces/prachuryyaIITG/AWED-FiNER), and 49 Expert Detector Models (https://hf.co/collections/prachuryyaIITG/awed-finer).

[47] Credit C-GPT: A Domain-Specialized Large Language Model for Conversational Understanding in Vietnamese Debt Collection

Nhung Nguyen Thi Hong, Cuong Nguyen Dang, Tri Le Ngoc

Main category: cs.CL

TL;DR: Credit C-GPT is a 7B parameter Vietnamese LLM specialized for debt collection conversations, integrating multiple conversational intelligence tasks into a single framework to handle informal language and emotional variability in BFSI contact centers.

Details

Motivation: Debt collection in BFSI relies on human conversations with informal Vietnamese, emotional variability, and complex domain reasoning that challenge traditional NLP systems, requiring specialized solutions for contact center operations.

Method: Developed a 7B parameter domain-specialized LLM fine-tuned for Vietnamese debt collection, integrating dialogue understanding, sentiment recognition, intent detection, call stage classification, and slot-value extraction in a single reasoning framework with proprietary annotated datasets.

Result: Experimental results show consistent improvements over traditional pipeline approaches, demonstrating scalable and privacy-aware solutions for real-time assistance and post-call analytics in enterprise contact centers.

Conclusion: Domain-specialized conversational language models like Credit C-GPT effectively address the unique challenges of Vietnamese debt collection conversations, offering practical solutions for BFSI contact centers through integrated multi-task reasoning.

Abstract: Debt collection is a critical function within the banking, financial services, and insurance (BFSI) sector, relying heavily on large-scale human-to-human conversational interactions conducted primarily in Vietnamese contact centers. These conversations involve informal spoken language, emotional variability, and complex domain-specific reasoning, which pose significant challenges for traditional natural language processing systems. This paper introduces Credit C-GPT, a domain-specialized large language model with seven billion parameters, fine-tuned for conversational understanding in Vietnamese debt collection scenarios. The proposed model integrates multiple conversational intelligence tasks, including dialogue understanding, sentiment recognition, intent detection, call stage classification, and structured slot-value extraction, within a single reasoning-based framework. We describe the data construction process, annotation strategy, and training methodology, and evaluate the model on proprietary human-annotated datasets. Experimental results show consistent improvements over traditional pipeline-based approaches, indicating that domain-specialized conversational language models provide a scalable and privacy-aware solution for real-time assistance and post-call analytics in enterprise contact centers.

[48] HOMURA: Taming the Sand-Glass for Time-Constrained LLM Translation via Reinforcement Learning

Ziang Cui, Mengran Yu, Tianjiao Li, Chenyu Shi, Yingxuan Shi, Lusheng Zhang, Hongwei Lin

Main category: cs.CL

TL;DR: LLMs have cross-lingual verbosity bias making them unsuitable for time-constrained translation tasks like subtitling. The paper introduces Sand-Glass benchmark for syllable-level duration evaluation and HOMURA RL framework to optimize semantic-temporal trade-off.

Details

Motivation: LLMs show systemic cross-lingual verbosity bias, making them unsuitable for strict time-constrained translation tasks like subtitling and dubbing where temporal feasibility is critical. Current prompt-engineering approaches fail to resolve the conflict between semantic fidelity and rigid temporal constraints.

Method: 1) Introduce Sand-Glass benchmark for evaluating translation under syllable-level duration constraints. 2) Propose HOMURA reinforcement learning framework with KL-regularized objective and novel dynamic syllable-ratio reward to explicitly optimize the trade-off between semantic preservation and temporal compliance.

Result: HOMURA significantly outperforms strong LLM baselines, achieving precise length control that respects linguistic density hierarchies without compromising semantic adequacy. The method effectively “tames” output length while maintaining translation quality.

Conclusion: The paper successfully addresses LLMs’ verbosity bias for time-constrained translation tasks through a specialized benchmark and RL framework that balances semantic preservation with temporal compliance, enabling practical applications in subtitling and dubbing.

Abstract: Large Language Models (LLMs) have achieved remarkable strides in multilingual translation but are hindered by a systemic cross-lingual verbosity bias, rendering them unsuitable for strict time-constrained tasks like subtitling and dubbing. Current prompt-engineering approaches struggle to resolve this conflict between semantic fidelity and rigid temporal feasibility. To bridge this gap, we first introduce Sand-Glass, a benchmark specifically designed to evaluate translation under syllable-level duration constraints. Furthermore, we propose HOMURA, a reinforcement learning framework that explicitly optimizes the trade-off between semantic preservation and temporal compliance. By employing a KL-regularized objective with a novel dynamic syllable-ratio reward, HOMURA effectively “tames” the output length. Experimental results demonstrate that our method significantly outperforms strong LLM baselines, achieving precise length control that respects linguistic density hierarchies without compromising semantic adequacy.

[49] HUMANLLM: Benchmarking and Reinforcing LLM Anthropomorphism via Human Cognitive Patterns

Xintao Wang, Jian Yang, Weiyuan Li, Rui Xie, Jen-tse Huang, Jun Gao, Shuai Huang, Yueping Kang, Liyuan Gou, Hongwei Feng, Yanghua Xiao

Main category: cs.CL

TL;DR: HUMANLLM is a framework that models psychological patterns as causal forces to create more authentic Role-Playing Language Agents, achieving strong human alignment and outperforming larger models on multi-pattern dynamics.

Details

Motivation: Current LLM-based role-playing agents lack authentic alignment with human cognitive and behavioral patterns, despite their strong reasoning and generation capabilities. There's a need to simulate not just what humans do, but the underlying psychological processes that generate those behaviors.

Method: The framework treats psychological patterns as interacting causal forces. Researchers constructed 244 patterns from ~12,000 academic papers and synthesized 11,359 scenarios where 2-5 patterns reinforce, conflict, or modulate each other. They created multi-turn conversations expressing inner thoughts, actions, and dialogue, with dual-level checklists evaluating both individual pattern fidelity and emergent multi-pattern dynamics.

Result: HUMANLLM achieves strong human alignment (r=0.91) and reveals that holistic metrics conflate simulation accuracy with social desirability. Notably, HUMANLLM-8B outperforms Qwen3-32B on multi-pattern dynamics despite having 4x fewer parameters.

Conclusion: Authentic anthropomorphism in language agents requires cognitive modeling—simulating not just what humans do, but the psychological processes generating those behaviors. The framework demonstrates that modeling psychological patterns as causal forces enables more authentic human alignment.

Abstract: Large Language Models (LLMs) have demonstrated remarkable capabilities in reasoning and generation, serving as the foundation for advanced persona simulation and Role-Playing Language Agents (RPLAs). However, achieving authentic alignment with human cognitive and behavioral patterns remains a critical challenge for these agents. We present HUMANLLM, a framework treating psychological patterns as interacting causal forces. We construct 244 patterns from ~12,000 academic papers and synthesize 11,359 scenarios where 2-5 patterns reinforce, conflict, or modulate each other, with multi-turn conversations expressing inner thoughts, actions, and dialogue. Our dual-level checklists evaluate both individual pattern fidelity and emergent multi-pattern dynamics, achieving strong human alignment (r=0.91) while revealing that holistic metrics conflate simulation accuracy with social desirability. HUMANLLM-8B outperforms Qwen3-32B on multi-pattern dynamics despite 4x fewer parameters, demonstrating that authentic anthropomorphism requires cognitive modeling–simulating not just what humans do, but the psychological processes generating those behaviors.

[50] One Instruction Does Not Fit All: How Well Do Embeddings Align Personas and Instructions in Low-Resource Indian Languages?

Arya Shah, Himanshu beniwal, Mayank Singh

Main category: cs.CL

TL;DR: A benchmark for evaluating multilingual embedding models on persona-instruction compatibility across 12 Indian languages, focusing on retrieval and classification tasks without response generation.

Details

Motivation: Existing benchmarks either focus on single languages or conflate retrieval with generation, leaving gaps in understanding whether embedding models can encode persona-instruction compatibility without response synthesis, which is crucial for serving India's linguistically diverse population.

Method: Created a unified benchmark spanning 12 Indian languages with four evaluation tasks: monolingual/cross-lingual persona-to-instruction retrieval, reverse retrieval (instruction-to-persona), and binary compatibility classification. Evaluated eight multilingual embedding models in a frozen-encoder setting with a thin logistic regression head for classification.

Result: E5-Large-Instruct achieved highest Recall@1 of 27.4% on monolingual retrieval and 20.7% on cross-lingual transfer. BGE-M3 led reverse retrieval at 32.1% Recall@1. For classification, LaBSE attained 75.3% AUROC with strong calibration.

Conclusion: The findings provide practical guidance for model selection in Indic multilingual retrieval and establish reproducible baselines for future work, with code, datasets, and models publicly available.

Abstract: Aligning multilingual assistants with culturally grounded user preferences is essential for serving India’s linguistically diverse population of over one billion speakers across multiple scripts. However, existing benchmarks either focus on a single language or conflate retrieval with generation, leaving open the question of whether current embedding models can encode persona-instruction compatibility without relying on response synthesis. We present a unified benchmark spanning 12 Indian languages and four evaluation tasks: monolingual and cross-lingual persona-to-instruction retrieval, reverse retrieval from instruction to persona, and binary compatibility classification. Eight multilingual embedding models are evaluated in a frozen-encoder setting with a thin logistic regression head for classification. E5-Large-Instruct achieves the highest Recall@1 of 27.4% on monolingual retrieval and 20.7% on cross-lingual transfer, while BGE-M3 leads reverse retrieval at 32.1% Recall@1. For classification, LaBSE attains 75.3% AUROC with strong calibration. These findings offer practical guidance for model selection in Indic multilingual retrieval and establish reproducible baselines for future work\footnote{Code, datasets, and models are publicly available at https://github.com/aryashah2k/PI-Indic-Align.

[51] GeoSteer: Faithful Chain-of-Thought Steering via Latent Manifold Gradients

Kentaro Kazama, Daiki Shirafuji, Tatsuhiko Saito

Main category: cs.CL

TL;DR: GeoSteer is a manifold-based framework that improves LLM reasoning quality by steering hidden states toward high-quality CoT trajectories in latent space, boosting accuracy and win rates.

Details

Motivation: LLMs often generate logically inconsistent intermediate reasoning steps even when final answers are correct, reducing reliability of step-level reasoning and trust in model outputs.

Method: Three-step approach: (1) build CoT dataset with segment-level scores, (2) train VAE and quality estimation model to learn low-dimensional manifold of high-quality CoT trajectories, (3) steer target LLM hidden states toward higher-quality regions in latent space using natural-gradient-like adjustments.

Result: On GSM8k with Qwen3 series: improved exact match accuracy by up to 2.6 points and enhanced pairwise win rate by 5.3 points, demonstrating effective improvement in reasoning quality.

Conclusion: GeoSteer provides an effective and controllable mechanism for improving intermediate reasoning quality in LLMs through geometrically coherent steering in latent space.

Abstract: Recent advances in Large Language Models (LLMs) have improved multi-step reasoning. Most approaches rely on Chain-of-Thought (CoT) rationales. Previous studies have shown that LLMs often generate logically inconsistent reasoning steps even when their final answers are correct. These inconsistencies reduce the reliability of step-level reasoning. We propose GeoSteer, a manifold-based framework that improves the quality of intermediate reasoning. The method consists of: (1) constructing a CoT dataset with segment-level scores, (2) training a Variational Autoencoder (VAE) model and a quality estimation model to learn a low-dimensional manifold of high-quality CoT trajectories, and (3) steering hidden states of target LLMs toward higher-quality regions in the latent space. This update in a latent space behaves like a natural-gradient adjustment in the original hidden-state space. It ensures geometrically coherent steering. We evaluate GeoSteer on the GSM8k dataset using the Qwen3 series. We measure via answer accuracy and overall reasoning performance. GeoSteer improved the exact match accuracy by up to 2.6 points. It also enhanced the pairwise win rate by 5.3 points. These results indicate that GeoSteer provides an effective and controllable mechanism for improving the quality of intermediate reasoning in LLMs.

[52] Loop as a Bridge: Can Looped Transformers Truly Link Representation Space and Natural Language Outputs?

Guanxu Chen, Dongrui Liu, Jing Shao

Main category: cs.CL

TL;DR: Loop Transformers narrow the gap between internal knowledge and outputs but degrade knowledge quality, and lack true introspection across loops.

Details

Motivation: LLMs have a gap between their internal knowledge and explicit outputs. The paper investigates whether Loop Transformers can bridge this gap through iterative introspection.

Method: Empirical investigation of Loop Transformers (LTs) - architectures that increase computational depth by iterating shared layers. Experiments analyze how increasing loop iterations affects the knowledge gap and representation quality.

Result: Increasing loop iterations narrows the gap between internal knowledge and outputs, but partly due to degradation of internal knowledge. Current LTs’ ability to perceive representations doesn’t improve across loops - only present in final loop.

Conclusion: While LTs offer promising direction for scaling computational depth, they haven’t achieved the introspection needed to truly link representation space and natural language.

Abstract: Large Language Models (LLMs) often exhibit a gap between their internal knowledge and their explicit linguistic outputs. In this report, we empirically investigate whether Looped Transformers (LTs)–architectures that increase computational depth by iterating shared layers–can bridge this gap by utilizing their iterative nature as a form of introspection. Our experiments reveal that while increasing loop iterations narrows the gap, it is partly driven by a degradation of their internal knowledge carried by representations. Moreover, another empirical analysis suggests that current LTs’ ability to perceive representations does not improve across loops; it is only present in the final loop. These results suggest that while LTs offer a promising direction for scaling computational depth, they have yet to achieve the introspection required to truly link representation space and natural language.

[53] coTherapist: A Behavior-Aligned Small Language Model to Support Mental Healthcare Experts

Prottay Kumar Adhikary, Reena Rawat, Tanmoy Chakraborty

Main category: cs.CL

TL;DR: coTherapist is a small language model framework for mental healthcare support that demonstrates expert-like therapeutic competencies through fine-tuning and agentic reasoning.

Details

Motivation: Addressing mental healthcare workforce shortages and rising demand by developing intelligent systems to support mental healthcare experts.

Method: Unified framework using small language model with domain-specific fine-tuning, retrieval augmentation, and agentic reasoning to emulate therapeutic competencies.

Result: Outperforms contemporary baselines on clinical queries, exhibits high empathy and therapist-consistent personality traits via T-BARS rubric and psychometric profiling, and receives positive human evaluation by domain experts for accuracy, trustworthiness, and safety.

Conclusion: Small models can be engineered to exhibit expert-like behavior, offering a scalable pathway for digital mental health tools.

Abstract: Access to mental healthcare is increasingly strained by workforce shortages and rising demand, motivating the development of intelligent systems that can support mental healthcare experts. We introduce coTherapist, a unified framework utilizing a small language model to emulate core therapeutic competencies through domain-specific fine-tuning, retrieval augmentation, and agentic reasoning. Evaluation on clinical queries demonstrates that coTherapist generates more relevant and clinically grounded responses than contemporary baselines. Using our novel T-BARS rubric and psychometric profiling, we confirm coTherapist exhibits high empathy and therapist-consistent personality traits. Furthermore, human evaluation by domain experts validates that coTherapist delivers accurate, trustworthy, and safe responses. coTherapist was deployed and tested by clinical experts. Collectively, these findings demonstrate that small models can be engineered to exhibit expert-like behavior, offering a scalable pathway for digital mental health tools.

[54] Untangling Input Language from Reasoning Language: A Diagnostic Framework for Cross-Lingual Moral Alignment in LLMs

Nan Li, Bo Kang, Tijl De Bie

Main category: cs.CL

TL;DR: LLMs show different moral judgments across languages due to both input language and reasoning language effects, with reasoning language having twice the impact of input language.

Details

Motivation: To understand why LLMs reach different moral conclusions in different languages, and to disentangle the effects of dilemma language vs. reasoning language that standard evaluations conflate.

Method: Introduces a methodology that separately manipulates input language and reasoning language, including mismatched conditions. Uses Moral Foundations Theory to interpret judgments, and applies this to English-Chinese moral judgment with 13 LLMs.

Result: 1) Reasoning-language effects contribute twice the variance of input-language effects; 2) Detects context-dependency in nearly half of models that standard evaluation misses; 3) Creates diagnostic taxonomy for deployment guidance.

Conclusion: The framework successfully isolates language effects in moral judgments, revealing that reasoning language is more influential than input language, and provides practical guidance for deploying LLMs across languages.

Abstract: When LLMs judge moral dilemmas, do they reach different conclusions in different languages, and if so, why? Two factors could drive such differences: the language of the dilemma itself, or the language in which the model reasons. Standard evaluation conflates these by testing only matched conditions (e.g., English dilemma with English reasoning). We introduce a methodology that separately manipulates each factor, covering also mismatched conditions (e.g., English dilemma with Chinese reasoning), enabling decomposition of their contributions. To study \emph{what} changes, we propose an approach to interpret the moral judgments in terms of Moral Foundations Theory. As a side result, we identify evidence for splitting the Authority dimension into a family-related and an institutional dimension. Applying this methodology to English-Chinese moral judgment with 13 LLMs, we demonstrate its diagnostic power: (1) the framework isolates reasoning-language effects as contributing twice the variance of input-language effects; (2) it detects context-dependency in nearly half of models that standard evaluation misses; and (3) a diagnostic taxonomy translates these patterns into deployment guidance. We release our code and datasets at https://anonymous.4open.science/r/CrossCulturalMoralJudgement.

[55] Measuring Affinity between Attention-Head Weight Subspaces via the Projection Kernel

Hiroaki Yamagiwa, Yusuke Takase, Hidetoshi Shimodaira

Main category: cs.CL

TL;DR: The paper introduces Projection Kernel (PK), a principal-angle-based metric for measuring attention head subspace similarity, which better captures head-to-head relationships than existing metrics like Composition Score.

Details

Motivation: Existing metrics for understanding attention head relationships in Transformers don't capture the internal structure well, making it difficult to interpret how attention heads interact and relate to each other.

Method: Focus on subspaces spanned by attention-head weight matrices and quantify head-to-head relationships using Projection Kernel (PK), a principal-angle-based measure of subspace similarity. Also introduce a framework to quantify PK distribution informativeness by comparing with random orthogonal subspace reference distributions.

Result: PK reproduces known head-to-head interactions on the IOI task more clearly than prior metrics like Composition Score. Analysis of PK-based directed graphs in GPT2-small reveals L4H7 acts as a hub by functioning as an identity head.

Conclusion: Projection Kernel provides a better metric for understanding attention head relationships in Transformers, enabling clearer analysis of internal structure and identification of important head roles like identity hubs.

Abstract: Understanding relationships between attention heads is essential for interpreting the internal structure of Transformers, yet existing metrics do not capture this structure well. We focus on the subspaces spanned by attention-head weight matrices and quantify head-to-head relationships using the Projection Kernel (PK), a principal-angle-based measure of subspace similarity. Experiments show that PK reproduces known head-to-head interactions on the IOI task more clearly than prior metrics such as the Composition Score. We further introduce a framework to quantify the informativeness of PK distributions by comparing them with a reference distribution derived from random orthogonal subspaces. As an application, we analyze a directed graph constructed from PK and show that, in GPT2-small, L4H7 acts as a hub by functioning as an identity head.

[56] The Straight and Narrow: Do LLMs Possess an Internal Moral Path?

Luoming Hu, Jingjie Zeng, Liang Yang, Hongfei Lin

Main category: cs.CL

TL;DR: The paper proposes using Moral Foundations Theory to map and manipulate LLMs’ moral representations, extracting steerable moral vectors and developing Adaptive Moral Fusion to improve safety-helpfulness trade-off.

Details

Motivation: Current LLM alignment techniques are superficial guardrails that don't address intrinsic moral representations, creating a gap in AI safety that needs more fundamental moral understanding and manipulation.

Method: Uses Moral Foundations Theory with cross-lingual linear probing to map moral representations, extracts steerable Moral Vectors, and develops Adaptive Moral Fusion - a dynamic inference-time intervention combining probe detection with vector injection.

Result: Validated shared moral representations in middle layers, discovered shared yet different moral subspaces between English and Chinese, and demonstrated effective reduction of incorrect refusals on benign queries while minimizing jailbreak success rates.

Conclusion: The approach provides targeted intrinsic defense for LLMs, effectively addressing the safety-helpfulness trade-off by fundamentally manipulating moral representations rather than just adding superficial guardrails.

Abstract: Enhancing the moral alignment of Large Language Models (LLMs) is a critical challenge in AI safety. Current alignment techniques often act as superficial guardrails, leaving the intrinsic moral representations of LLMs largely untouched. In this paper, we bridge this gap by leveraging Moral Foundations Theory (MFT) to map and manipulate the fine-grained moral landscape of LLMs. Through cross-lingual linear probing, we validate the shared nature of moral representations in middle layers and uncover a shared yet different moral subspace between English and Chinese. Building upon this, we extract steerable Moral Vectors and successfully validate their efficacy at both internal and behavioral levels. Leveraging the high generalizability of morality, we propose Adaptive Moral Fusion (AMF), a dynamic inference-time intervention that synergizes probe detection with vector injection to tackle the safety-helpfulness trade-off. Empirical results confirm that our approach acts as a targeted intrinsic defense, effectively reducing incorrect refusals on benign queries while minimizing jailbreak success rates compared to standard baselines.

[57] Multilinguality as Sense Adaptation

Jan Christian Blaise Cruz, David Ifeoluwa Adelani, Alham Fikri Aji

Main category: cs.CL

TL;DR: SENSIA aligns sense-level representations across languages for multilingual adaptation, outperforming comparable methods with less target-language data while preserving sense geometry.

Details

Motivation: Current multilingual approaches rely heavily on shared parameters and scale. The paper proposes treating multilinguality as sense adaptation - aligning latent meaning representations across languages rather than just sharing parameters.

Method: Introduces SENSIA (SENse-based Symmetric Interlingual Alignment) which adapts a Backpack language model from source to target language by explicitly aligning sense-level mixtures and contextual representations on parallel data, while jointly training with target-language language modeling loss to preserve fluency.

Result: Outperforms comparable multilingual alignment methods across four typologically diverse languages. Achieves competitive accuracy against monolingual from-scratch baselines while using 2-4x less target-language data. Learned sense geometry shows preserved local topology and global structure relative to English.

Conclusion: Sense-based alignment is an effective approach for multilingual adaptation that preserves meaning representations across languages while being data-efficient and robust to design choices and scale.

Abstract: We approach multilinguality as sense adaptation: aligning latent meaning representations across languages rather than relying solely on shared parameters and scale. In this paper, we introduce SENse-based Symmetric Interlingual Alignment (SENSIA), which adapts a Backpack language model from one language to another by explicitly aligning sense-level mixtures and contextual representations on parallel data, while jointly training a target-language language modeling loss to preserve fluency. Across benchmarks on four typologically diverse languages, SENSIA generally outperforms comparable multilingual alignment methods and achieves competitive accuracy against monolingual from-scratch baselines while using 2-4x less target-language data. Analyses of learned sense geometry indicate that local sense topology and global structure relative to English are largely preserved, and ablations show that the method is robust in terms of design and scale.

[58] ADVOSYNTH: A Synthetic Multi-Advocate Dataset for Speaker Identification in Courtroom Scenarios

Aniket Deroy

Main category: cs.CL

TL;DR: Researchers created Advosynth-500, a dataset of 100 synthetic speech files with 10 unique advocate identities generated using Speech Llama Omni model, simulating courtroom arguments to study synthetic voice distinction.

Details

Motivation: As speech-to-speech models achieve high fidelity, distinguishing between synthetic voices in structured environments becomes crucial for authentication and identification systems.

Method: Used Speech Llama Omni model to generate 100 synthetic speech files with 10 unique advocate identities, simulating 5 distinct advocate pairs engaged in courtroom arguments with defined vocal characteristics.

Result: Created the Advosynth-500 dataset available on GitHub, presenting a speaker identification challenge to evaluate modern systems’ ability to map audio files to their synthetic origins.

Conclusion: The dataset provides a specialized resource for studying synthetic voice distinction in structured environments like courtrooms, enabling evaluation of speaker identification systems for synthetic voices.

Abstract: As large-scale speech-to-speech models achieve high fidelity, the distinction between synthetic voices in structured environments becomes a vital area of study. This paper introduces Advosynth-500, a specialized dataset comprising 100 synthetic speech files featuring 10 unique advocate identities. Using the Speech Llama Omni model, we simulate five distinct advocate pairs engaged in courtroom arguments. We define specific vocal characteristics for each advocate and present a speaker identification challenge to evaluate the ability of modern systems to map audio files to their respective synthetic origins. Dataset is available at this link-https: //github.com/naturenurtureelite/ADVOSYNTH-500.

[59] Boundary-Aware NL2SQL: Integrating Reliability through Hybrid Reward and Data Synthesis

Songsong Tian, Kongsheng Zhuo, Zhendong Wang, Rong Shen, Shengtao Zhang, Yong Wu

Main category: cs.CL

TL;DR: BAR-SQL is a unified training framework for NL2SQL that embeds reliability and boundary awareness, using seed mutation data synthesis and knowledge-grounded reasoning to handle ambiguous/unanswerable queries, achieving 91.48% accuracy and outperforming leading proprietary models.

Details

Motivation: Current NL2SQL systems lack reliability in handling boundary cases like ambiguous queries and schema limitations, which is critical for enterprise applications where incorrect SQL can have serious consequences.

Method: 1) Seed Mutation data synthesis to create representative enterprise corpus with boundary cases; 2) Knowledge-Grounded Reasoning Synthesis for interpretable Chain-of-Thought traces; 3) Two-stage training: SFT followed by Reinforcement Learning via Group Relative Policy Optimization; 4) Task-Conditioned Hybrid Reward mechanism optimizing both SQL execution accuracy and semantic precision in abstention responses.

Result: BAR-SQL achieves 91.48% average accuracy on Ent-SQL-Bench, outperforming leading proprietary models (Claude 4.5 Sonnet and GPT-5) in both SQL generation quality and boundary-aware abstention capability.

Conclusion: BAR-SQL successfully integrates reliability and boundary awareness into NL2SQL generation, demonstrating superior performance over state-of-the-art models and providing a practical solution for enterprise SQL generation with built-in safety mechanisms.

Abstract: In this paper, we present BAR-SQL (Boundary-Aware Reliable NL2SQL), a unified training framework that embeds reliability and boundary awareness directly into the generation process. We introduce a Seed Mutation data synthesis paradigm that constructs a representative enterprise corpus, explicitly encompassing multi-step analytical queries alongside boundary cases including ambiguity and schema limitations. To ensure interpretability, we employ Knowledge-Grounded Reasoning Synthesis, which produces Chain-of-Thought traces explicitly anchored in schema metadata and business rules. The model is trained through a two-stage process: Supervised Fine-Tuning (SFT) followed by Reinforcement Learning via Group Relative Policy Optimization. We design a Task-Conditioned Hybrid Reward mechanism that simultaneously optimizes SQL execution accuracy-leveraging Abstract Syntax Tree analysis and dense result matching-and semantic precision in abstention responses. To evaluate reliability alongside generation accuracy, we construct and release Ent-SQL-Bench, which jointly assesse SQL precision and boundary-aware abstention across ambiguous and unanswerable queries. Experimental results on this benchmark demonstrate that BAR-SQL achieves 91.48% average accuracy, outperforming leading proprietary models, including Claude 4.5 Sonnet and GPT-5, in both SQL generation quality and boundary-aware abstention capability. The source code and benchmark are available anonymously at: https://github.com/TianSongS/BAR-SQL.

[60] An Efficient Long-Context Ranking Architecture With Calibrated LLM Distillation: Application to Person-Job Fit

Warren Jouanneau, Emma Jouffroy, Marc Palyart

Main category: cs.CL

TL;DR: A re-ranking model using late cross-attention architecture and LLM distillation for multilingual, long-context person-job matching with interpretable skill-fit scores.

Details

Motivation: Real-time person-job matching is challenging due to long, structured, multilingual resumes and historical data biases in recruitment systems.

Method: Late cross-attention architecture decomposes resumes and project briefs for efficient long-context processing. Uses LLM as teacher to generate fine-grained supervision, distilled into student model via enriched distillation loss.

Result: Outperforms state-of-the-art baselines on relevance, ranking, and calibration metrics. Produces interpretable skill-fit scores for consistent person-job matching.

Conclusion: The proposed approach effectively addresses long-context processing and bias mitigation in person-job matching through architectural innovation and LLM-based distillation.

Abstract: Finding the most relevant person for a job proposal in real time is challenging, especially when resumes are long, structured, and multilingual. In this paper, we propose a re-ranking model based on a new generation of late cross-attention architecture, that decomposes both resumes and project briefs to efficiently handle long-context inputs with minimal computational overhead. To mitigate historical data biases, we use a generative large language model (LLM) as a teacher, generating fine-grained, semantically grounded supervision. This signal is distilled into our student model via an enriched distillation loss function. The resulting model produces skill-fit scores that enable consistent and interpretable person-job matching. Experiments on relevance, ranking, and calibration metrics demonstrate that our approach outperforms state-of-the-art baselines.

[61] OctoBench: Benchmarking Scaffold-Aware Instruction Following in Repository-Grounded Agentic Coding

Deming Ding, Shichun Liu, Enhui Yang, Jiahang Lin, Ziying Chen, Shihan Dou, Honglin Guo, Weiyu Cheng, Pengyu Zhao, Chengjun Xiao, Qunhong Zeng, Qi Zhang, Xuanjing Huang, Qidi Xu, Tao Gui

Main category: cs.CL

TL;DR: OctoBench is a new benchmark for evaluating LLM coding agents’ ability to follow scaffold-specified instructions in repository-grounded coding tasks, revealing a gap between task-solving and rule compliance.

Details

Motivation: Current LLM coding scaffolds aren't properly evaluated for their ability to follow heterogeneous, persistent constraints across interactions, creating a need for systematic benchmarking of scaffold-aware instruction following.

Method: Created OctoBench with 34 environments, 217 tasks across three scaffold types, paired with 7,098 objective checklist items. Developed automated observation-and-scoring toolkit to capture full trajectories and perform fine-grained checks, separating task-solving from rule compliance.

Result: Experiments on eight representative models show systematic gap between task-solving ability and scaffold-aware compliance, highlighting the need for specialized training and evaluation for heterogeneous instruction following.

Conclusion: OctoBench enables reproducible benchmarking and accelerates development of more scaffold-aware coding agents by providing comprehensive evaluation of instruction-following capabilities in agentic coding scenarios.

Abstract: Modern coding scaffolds turn LLMs into capable software agents, but their ability to follow scaffold-specified instructions remains under-examined, especially when constraints are heterogeneous and persist across interactions. To fill this gap, we introduce OctoBench, which benchmarks scaffold-aware instruction following in repository-grounded agentic coding. OctoBench includes 34 environments and 217 tasks instantiated under three scaffold types, and is paired with 7,098 objective checklist items. To disentangle solving the task from following the rules, we provide an automated observation-and-scoring toolkit that captures full trajectories and performs fine-grained checks. Experiments on eight representative models reveal a systematic gap between task-solving and scaffold-aware compliance, underscoring the need for training and evaluation that explicitly targets heterogeneous instruction following. We release the benchmark to support reproducible benchmarking and to accelerate the development of more scaffold-aware coding agents.

[62] Training-Trajectory-Aware Token Selection

Zhanming Shen, Jiaqi Hu, Zeyu Qin, Hao Chen, Wentao Ye, Zenan Huang, Yihong Zhuang, Guoshan Lu, Junlin Zhou, Junbo Zhao

Main category: cs.CL

TL;DR: The paper identifies a bottleneck phenomenon in continual distillation where performance metrics drop sharply despite decreasing loss, proposes a token-level mechanism explanation, and introduces T3S (Training-Trajectory-Aware Token Selection) to address this by reconstructing training objectives at token level.

Details

Motivation: In frontier distillation where students already have strong reasoning ability, naive continual distillation often yields limited gains or even degradation. The authors observed a characteristic training phenomenon where performance metrics drop sharply at a bottleneck despite decreasing loss, indicating a fundamental issue in current distillation approaches.

Method: The authors propose T3S (Training-Trajectory-Aware Token Selection), which reconstructs training objectives at the token level based on analyzing token-level mechanisms. They identified that tokens bifurcate into Imitation-Anchor Tokens (which quickly anchor optimization) and yet-to-learn tokens (whose confidence is suppressed), and T3S clears the optimization path for the latter.

Result: T3S yields consistent gains in both AR and dLLM settings: Qwen3-8B surpasses DeepSeek-R1 on competitive reasoning benchmarks with only hundreds of examples, Qwen3-32B approaches Qwen3-235B, and T3-trained LLaDA-2.0-Mini exceeds its AR baseline, achieving state-of-the-art performance among all 16B-scale no-think models.

Conclusion: The paper demonstrates that addressing token-level optimization conflicts through training-trajectory-aware token selection enables more effective continual distillation, allowing smaller models to approach or surpass much larger models’ performance with minimal data, representing a significant advance in efficient knowledge distillation.

Abstract: Efficient distillation is a key pathway for converting expensive reasoning capability into deployable efficiency, yet in the frontier regime where the student already has strong reasoning ability, naive continual distillation often yields limited gains or even degradation. We observe a characteristic training phenomenon: even as loss decreases monotonically, all performance metrics can drop sharply at almost the same bottleneck, before gradually recovering. We further uncover a token-level mechanism: confidence bifurcates into steadily increasing Imitation-Anchor Tokens that quickly anchor optimization and other yet-to-learn tokens whose confidence is suppressed until after the bottleneck. And the characteristic that these two types of tokens cannot coexist is the root cause of the failure in continual distillation. To this end, we propose Training-Trajectory-Aware Token Selection (T3S) to reconstruct the training objective at the token level, clearing the optimization path for yet-to-learn tokens. T3 yields consistent gains in both AR and dLLM settings: with only hundreds of examples, Qwen3-8B surpasses DeepSeek-R1 on competitive reasoning benchmarks, Qwen3-32B approaches Qwen3-235B, and T3-trained LLaDA-2.0-Mini exceeds its AR baseline, achieving state-of-the-art performance among all of 16B-scale no-think models.

[63] Unlocking Implicit Experience: Synthesizing Tool-Use Trajectories from Text

Zhihao Xu, Rumei Li, Jiahuan Li, Rongxiang Weng, Jingang Wang, Xunliang Cai, Xiting Wang

Main category: cs.CL

TL;DR: GEM is a novel text-based data synthesis pipeline that extracts multi-turn tool-use trajectories from textual corpora, enabling efficient generation of diverse tool-use data for LLM training.

Details

Motivation: Acquiring diverse and realistic multi-turn tool-use data for LLM training is challenging. Textual corpora contain rich, multi-step problem-solving experiences that can serve as an untapped, scalable, and authentic data source for this purpose.

Method: GEM uses a four-stage pipeline: relevance filtering, workflow & tool extraction, trajectory grounding, and complexity refinement. To reduce computational cost, they train a specialized Trajectory Synthesizer via supervised fine-tuning that distills the pipeline into an efficient end-to-end generator.

Result: GEM-32B achieves 16.5% improvement on BFCL V3 Multi-turn benchmark. The models partially surpass performance of models trained on τ-bench in-domain data, showing superior generalization. The Trajectory Synthesizer matches pipeline quality while significantly reducing inference latency and costs.

Conclusion: Text-based synthesis from corpora provides a scalable, authentic source for multi-turn tool-use data. The distilled Trajectory Synthesizer offers efficient generation while maintaining quality, enabling better LLM tool-use capabilities.

Abstract: Enabling Large Language Models (LLMs) to effectively utilize tools in multi-turn interactions is essential for building capable autonomous agents. However, acquiring diverse and realistic multi-turn tool-use data remains a significant challenge. In this work, we propose a novel text-based paradigm. We observe that textual corpora naturally contain rich, multi-step problem-solving experiences, which can serve as an untapped, scalable, and authentic data source for multi-turn tool-use tasks. Based on this insight, we introduce GEM, a data synthesis pipeline that enables the generation and extraction of multi-turn tool-use trajectories from text corpora through a four-stage process: relevance filtering, workflow & tool extraction, trajectory grounding, and complexity refinement. To reduce the computational cost, we further train a specialized Trajectory Synthesizer via supervised fine-tuning. This model distills the complex generation pipeline into an efficient, end-to-end trajectory generator. Experiments demonstrate that our GEM-32B achieve a 16.5% improvement on the BFCL V3 Multi-turn benchmark. Our models partially surpass the performance of models trained on τ - bench (Airline and Retail) in-domain data, highlighting the superior generalization capability derived from our text-based synthesis paradigm. Notably, our Trajectory Synthesizer matches the quality of the full pipeline while significantly reducing inference latency and costs.

[64] The Assistant Axis: Situating and Stabilizing the Default Persona of Language Models

Christina Lu, Jack Gallagher, Jonathan Michala, Kyle Fish, Jack Lindsey

Main category: cs.CL

TL;DR: Researchers discover an “Assistant Axis” in LLMs that controls persona expression - steering toward it reinforces helpful behavior while steering away induces mystical/theatrical styles and persona drift.

Details

Motivation: To understand how large language models represent different personas and investigate the structure of persona space, particularly how models default to helpful Assistant identities after post-training but can drift into other personas.

Method: Extracted activation directions corresponding to diverse character archetypes across several models, identified the leading component as the “Assistant Axis,” and experimented with steering activations along this axis to manipulate persona expression and stabilize behavior.

Result: Found that the Assistant Axis exists in both post-trained and pre-trained models, controls helpful vs. mystical/theatrical behavior, predicts persona drift (harmful/bizarre behaviors), and that restricting activations to fixed regions along this axis can stabilize model behavior against drift and adversarial jailbreaks.

Conclusion: Post-training only loosely tethers models to the Assistant persona, motivating the need for better training and steering strategies to more deeply anchor models to coherent personas and prevent undesirable persona drift.

Abstract: Large language models can represent a variety of personas but typically default to a helpful Assistant identity cultivated during post-training. We investigate the structure of the space of model personas by extracting activation directions corresponding to diverse character archetypes. Across several different models, we find that the leading component of this persona space is an “Assistant Axis,” which captures the extent to which a model is operating in its default Assistant mode. Steering towards the Assistant direction reinforces helpful and harmless behavior; steering away increases the model’s tendency to identify as other entities. Moreover, steering away with more extreme values often induces a mystical, theatrical speaking style. We find this axis is also present in pre-trained models, where it primarily promotes helpful human archetypes like consultants and coaches and inhibits spiritual ones. Measuring deviations along the Assistant Axis predicts “persona drift,” a phenomenon where models slip into exhibiting harmful or bizarre behaviors that are uncharacteristic of their typical persona. We find that persona drift is often driven by conversations demanding meta-reflection on the model’s processes or featuring emotionally vulnerable users. We show that restricting activations to a fixed region along the Assistant Axis can stabilize model behavior in these scenarios – and also in the face of adversarial persona-based jailbreaks. Our results suggest that post-training steers models toward a particular region of persona space but only loosely tethers them to it, motivating work on training and steering strategies that more deeply anchor models to a coherent persona.

[65] INDIC DIALECT: A Multi Task Benchmark to Evaluate and Translate in Indian Language Dialects

Tarun Sharma, Manikandan Ravikiran, Sourava Kumar Behera, Pramit Bhattacharya, Arnab Bhattacharya, Rohit Saluja

Main category: cs.CL

TL;DR: INDIC-DIALECT is a new 13k parallel corpus covering 11 dialects of Hindi and Odia, with multi-task benchmark showing LLMs struggle but fine-tuned Indian language models achieve strong performance on dialect classification and translation tasks.

Details

Motivation: NLP research focuses on standardized languages, leaving low-resource Indian dialects underrepresented despite large speaker populations (Hindi: 600M+, Odia: 45M). Existing datasets cover standard languages but lack dialectal data, creating a significant gap for Indian dialect NLP.

Method: Created INDIC-DIALECT: human-curated parallel corpus of 13k sentence pairs across 11 dialects of Hindi and Odia. Built multi-task benchmark with dialect classification, MCQ answering, and machine translation tasks. Evaluated LLMs (GPT-4o, Gemini 2.5) and fine-tuned transformer models pretrained on Indian languages.

Result: LLMs performed poorly on dialect classification, while fine-tuned Indian language models improved F1 from 19.6% to 89.8%. For dialect-to-language translation, hybrid AI achieved BLEU 61.32 vs baseline 23.36. For language-to-dialect, rule-based+AI approach achieved BLEU 48.44 vs baseline 27.59.

Conclusion: INDIC-DIALECT provides a valuable benchmark for dialect-aware Indic NLP, demonstrating the need for specialized models for Indian dialects. The corpus will be released open-source to support further research on low-resource Indian dialects.

Abstract: Recent NLP advances focus primarily on standardized languages, leaving most low-resource dialects under-served especially in Indian scenarios. In India, the issue is particularly important: despite Hindi being the third most spoken language globally (over 600 million speakers), its numerous dialects remain underrepresented. The situation is similar for Odia, which has around 45 million speakers. While some datasets exist which contain standard Hindi and Odia languages, their regional dialects have almost no web presence. We introduce INDIC-DIALECT, a human-curated parallel corpus of 13k sentence pairs spanning 11 dialects and 2 languages: Hindi and Odia. Using this corpus, we construct a multi-task benchmark with three tasks: dialect classification, multiple-choice question (MCQ) answering, and machine translation (MT). Our experiments show that LLMs like GPT-4o and Gemini 2.5 perform poorly on the classification task. While fine-tuned transformer based models pretrained on Indian languages substantially improve performance e.g., improving F1 from 19.6% to 89.8% on dialect classification. For dialect to language translation, we find that hybrid AI model achieves highest BLEU score of 61.32 compared to the baseline score of 23.36. Interestingly, due to complexity in generating dialect sentences, we observe that for language to dialect translation the ``rule-based followed by AI" approach achieves best BLEU score of 48.44 compared to the baseline score of 27.59. INDIC-DIALECT thus is a new benchmark for dialect-aware Indic NLP, and we plan to release it as open source to support further work on low-resource Indian dialects.

[66] TF3-RO-50M: Training Compact Romanian Language Models from Scratch on Synthetic Moral Microfiction

Mihai Dan Nadas, Laura Diosan, Andreea Tomescu, Andrei Piscoran

Main category: cs.CL

TL;DR: TF3-RO is an end-to-end Romanian language modeling pipeline that creates compact models and generates synthetic Romanian fables through tokenizer design, pretraining, compression, and controlled generation.

Details

Motivation: There's no openly documented, reproducible pipeline for Romanian language modeling that unifies tokenizer design, preprocessing, pretraining, compression, evaluation, and synthetic data generation, especially for this morphologically rich, under-resourced language.

Method: Builds Romanian-specific BPE and Unigram tokenizers from linguistically informed corpus to handle morphology. Pretrains 51.65M-parameter LLaMA-style Transformer from scratch using long-sequence packed training. Optimizes through quantization, structured pruning, and logit-based knowledge distillation to create 26.45M-parameter student model. Generates 3 million Romanian synthetic fables via controlled combinatorial prompting.

Result: Created TF3-RO pipeline with Romanian-specific tokenizers, pretrained 51.65M-parameter model, distilled to 26.45M-parameter compact model with tied embeddings, and generated 3 million Romanian-native synthetic fables. Comprehensive evaluation suite integrated throughout.

Conclusion: TF3-RO provides a reproducible, linguistically grounded framework for training compact Romanian language models and producing large-scale synthetic narrative corpora, addressing the gap for under-resourced morphologically rich languages.

Abstract: Recent advances in synthetic data generation have shown that compact language models can be trained effectively when the underlying corpus is structurally controlled and linguistically coherent. However, for morphologically rich and computationally under-resourced languages such as Romanian, there is still no openly documented, end-to-end pipeline that unifies tokenizer design, preprocessing, pretraining, compression, evaluation, and large-scale synthetic data generation in a reproducible framework. Building on TF1, a three-million-story English fable dataset, and TF2, which extends TF1 through high-quality Romanian translations, we introduce TF3-RO, a Romanian-centric language modeling pipeline spanning tokenizer training, from-scratch model development, and Romanian-native dataset generation. TF3-RO constructs Romanian-specific BPE and Unigram tokenizers from a linguistically informed corpus to mitigate token inflation induced by Romanian morphology. Using long-sequence packed training, we pretrain a 51.65M-parameter LLaMA-style Transformer entirely from scratch. The model is subsequently optimized through quantization, structured pruning, and logit-based knowledge distillation, yielding a compact 26.45M-parameter student model with tied embeddings and strong deployment characteristics. Using this distilled model, TF3-RO generates three million Romanian-native synthetic fables via a controlled combinatorial prompting framework. Across all stages, the pipeline integrates a comprehensive evaluation suite combining intrinsic metrics, Romanian agreement probes, entity coherence, rule-based grammar checking, and LLM-based assessment. TF3-RO provides a reproducible and linguistically grounded framework for training compact Romanian language models and producing large-scale synthetic narrative corpora.

[67] Are Language Models Models?

Philip Resnik

Main category: cs.CL

TL;DR: The paper critiques the claim that language models serve as cognitive model systems, arguing this overstates their capabilities and feeds LLM hype.

Details

Motivation: To critically assess the claim by Futrell and Mahowald that language models (LMs) can serve as model systems for cognitive science, using Marr's three levels of analysis as a framework.

Method: The authors use David Marr’s three levels of analysis (computational theory, algorithmic-representational, and implementation) to systematically evaluate whether LMs can serve as model systems for cognitive science.

Result: The analysis shows that LMs fail as model systems at the implementation level, are poorly motivated at the algorithmic-representational level, and are problematic at the computational theory level. LMs are better viewed as tools rather than cognitive models.

Conclusion: Language models should be considered useful tools rather than cognitive models. Calling them cognitive models overstates their capabilities and unnecessarily contributes to LLM hype.

Abstract: Futrell and Mahowald claim LMs “serve as model systems”, but an assessment at each of Marr’s three levels suggests the claim is clearly not true at the implementation level, poorly motivated at the algorithmic-representational level, and problematic at the computational theory level. LMs are good candidates as tools; calling them cognitive models overstates the case and unnecessarily feeds LLM hype.

[68] SurgGoal: Rethinking Surgical Planning Evaluation via Goal-Satisfiability

Ruochen Li, Kun Yuan, Yufei Xia, Yue Zhou, Qingyu Lu, Weihang Li, Youxiang Zhu, Nassir Navab

Main category: cs.CL

TL;DR: Current VLM evaluation metrics fail to properly assess surgical planning quality. The paper introduces a rule-based meta-evaluation benchmark showing sequence similarity metrics misjudge valid plans and miss invalid ones, while structural knowledge improves planning performance.

Details

Motivation: Current evaluation protocols for vision-language models in surgical planning are unreliable for safety-critical settings. There's a need for better assessment methods that align with goal-oriented surgical planning where plan validity should be determined by expert-defined surgical rules rather than sequence similarity.

Method: Defines planning correctness via phase-goal satisfiability based on expert-defined surgical rules. Introduces a multicentric meta-evaluation benchmark with valid procedural variations and invalid plans containing order and content errors. Uses rule-based goal-satisfiability metric as meta-evaluation reference to assess Video-LLMs under progressively constrained settings.

Result: Sequence similarity metrics systematically misjudge planning quality, penalizing valid plans while failing to identify invalid ones. Perception errors and under-constrained reasoning cause failures in Video-LLMs. Structural knowledge consistently improves performance, while semantic guidance alone is unreliable and only benefits larger models when combined with structural constraints.

Conclusion: Current evaluation metrics are inadequate for assessing surgical planning in VLMs. A rule-based, goal-oriented approach with expert-defined surgical rules provides more reliable assessment. Structural constraints are crucial for improving planning performance, especially in safety-critical surgical settings.

Abstract: Surgical planning integrates visual perception, long-horizon reasoning, and procedural knowledge, yet it remains unclear whether current evaluation protocols reliably assess vision-language models (VLMs) in safety-critical settings. Motivated by a goal-oriented view of surgical planning, we define planning correctness via phase-goal satisfiability, where plan validity is determined by expert-defined surgical rules. Based on this definition, we introduce a multicentric meta-evaluation benchmark with valid procedural variations and invalid plans containing order and content errors. Using this benchmark, we show that sequence similarity metrics systematically misjudge planning quality, penalizing valid plans while failing to identify invalid ones. We therefore adopt a rule-based goal-satisfiability metric as a high-precision meta-evaluation reference to assess Video-LLMs under progressively constrained settings, revealing failures due to perception errors and under-constrained reasoning. Structural knowledge consistently improves performance, whereas semantic guidance alone is unreliable and benefits larger models only when combined with structural constraints.

[69] Contextual StereoSet: Stress-Testing Bias Alignment Robustness in Large Language Models

Abhinaba Basu, Pavan Chakraborty

Main category: cs.CL

TL;DR: Contextual StereoSet benchmark shows that measured bias in language models shifts dramatically with different contextual framings (time, audience, place), challenging the generalization of fixed-condition bias tests.

Details

Motivation: Current bias benchmarks use fixed contexts, but models that avoid stereotypes in lab tests may still exhibit bias in deployment when contexts change. The paper aims to stress-test evaluation robustness by showing how bias measurements vary with contextual framing.

Method: Introduces Contextual StereoSet benchmark that holds stereotype content fixed while systematically varying contextual framing (time, audience, place). Tests 13 models across two protocols: a 360-context diagnostic grid for deep analysis and a budgeted protocol covering 4,229 items for production screening. Proposes Context Sensitivity Fingerprints (CSF) - a compact profile with per-dimension dispersion, paired contrasts with bootstrap CIs and FDR correction.

Result: Found striking patterns: anchoring to 1990 (vs. 2030) raises stereotype selection in all models tested; gossip framing raises it in 5 of 6 full-grid models; out-group observer framing shifts bias by up to 13 percentage points. Effects replicate in hiring, lending, and help-seeking vignettes.

Conclusion: Bias scores from fixed-condition tests may not generalize. The paper shifts focus from “Is this model biased?” to “Under what conditions does bias appear?” through CSF analysis. Releases benchmark, code, and results to improve evaluation robustness.

Abstract: A model that avoids stereotypes in a lab benchmark may not avoid them in deployment. We show that measured bias shifts dramatically when prompts mention different places, times, or audiences – no adversarial prompting required. We introduce Contextual StereoSet, a benchmark that holds stereotype content fixed while systematically varying contextual framing. Testing 13 models across two protocols, we find striking patterns: anchoring to 1990 (vs. 2030) raises stereotype selection in all models tested on this contrast (p<0.05); gossip framing raises it in 5 of 6 full-grid models; out-group observer framing shifts it by up to 13 percentage points. These effects replicate in hiring, lending, and help-seeking vignettes. We propose Context Sensitivity Fingerprints (CSF): a compact profile of per-dimension dispersion and paired contrasts with bootstrap CIs and FDR correction. Two evaluation tracks support different use cases – a 360-context diagnostic grid for deep analysis and a budgeted protocol covering 4,229 items for production screening. The implication is methodological: bias scores from fixed-condition tests may not generalize.This is not a claim about ground-truth bias rates; it is a stress test of evaluation robustness. CSF forces evaluators to ask, “Under what conditions does bias appear?” rather than “Is this model biased?” We release our benchmark, code, and results.

[70] DR-Arena: an Automated Evaluation Framework for Deep Research Agents

Yiwen Gao, Ruochen Zhao, Yang Deng, Wenxuan Zhang

Main category: cs.CL

TL;DR: DR-Arena is an automated evaluation framework for Deep Research Agents that uses real-time web trends to create dynamic tasks, adaptively escalating complexity to test reasoning and coverage capabilities, achieving high correlation with human preferences.

Details

Motivation: Current static benchmarks for evaluating Large Language Models as Deep Research Agents have limitations: limited task generality, temporal misalignment with real-world information, and data contamination issues. There's a need for reliable evaluation that pushes agents to their capability limits through dynamic investigation.

Method: DR-Arena constructs real-time Information Trees from fresh web trends to ensure evaluation rubrics are synchronized with the live world state. It uses an automated Examiner to generate structured tasks testing two orthogonal capabilities: deep reasoning and wide coverage. The framework employs an Adaptive Evolvement Loop - a state-machine controller that dynamically escalates task complexity based on real-time performance, demanding deeper deduction or wider aggregation until capability boundaries emerge.

Result: Experiments with six advanced DR agents show DR-Arena achieves a Spearman correlation of 0.94 with the LMSYS Search Arena leaderboard. This represents state-of-the-art alignment with human preferences without manual efforts, validating DR-Arena as a reliable alternative for costly human adjudication.

Conclusion: DR-Arena provides an effective automated evaluation framework for Deep Research Agents that overcomes limitations of static benchmarks by using dynamic, real-time web-based tasks with adaptive complexity escalation, achieving high correlation with human judgment.

Abstract: As Large Language Models (LLMs) increasingly operate as Deep Research (DR) Agents capable of autonomous investigation and information synthesis, reliable evaluation of their task performance has become a critical bottleneck. Current benchmarks predominantly rely on static datasets, which suffer from several limitations: limited task generality, temporal misalignment, and data contamination. To address these, we introduce DR-Arena, a fully automated evaluation framework that pushes DR agents to their capability limits through dynamic investigation. DR-Arena constructs real-time Information Trees from fresh web trends to ensure the evaluation rubric is synchronized with the live world state, and employs an automated Examiner to generate structured tasks testing two orthogonal capabilities: Deep reasoning and Wide coverage. DR-Arena further adopts Adaptive Evolvement Loop, a state-machine controller that dynamically escalates task complexity based on real-time performance, demanding deeper deduction or wider aggregation until a decisive capability boundary emerges. Experiments with six advanced DR agents demonstrate that DR-Arena achieves a Spearman correlation of 0.94 with the LMSYS Search Arena leaderboard. This represents the state-of-the-art alignment with human preferences without any manual efforts, validating DR-Arena as a reliable alternative for costly human adjudication.

Xuan Luo, Lewei Yao, Libo Zhao, Lanqing Hong, Kai Chen, Dehua Tao, Daxin Tan, Ruifeng Xu, Jing Li

Main category: cs.CL

TL;DR: AEQ-Bench is a new benchmark for evaluating omni-modal large models’ empathetic capabilities in both generating empathetic responses from audio+text inputs and judging empathy in audio responses without text transcription.

Details

Motivation: Automatic evaluation of omni-modal large models (OLMs) is essential but assessing empathy remains challenging due to its inherent affectivity. Existing benchmarks don't adequately address empathetic capabilities in multi-modal contexts.

Method: Introduced AEQ-Bench (Audio Empathy Quotient Benchmark) with two novel settings varying in context specificity and speech tone. Systematically assesses two core empathetic capabilities: 1) generating empathetic responses from multi-modal inputs (audio+text), and 2) judging empathy of audio responses without text transcription.

Result: OLMs with audio output capabilities generally outperformed text-only models. While OLMs align with human judgments for coarse-grained quality assessment, they remain unreliable for evaluating fine-grained paralinguistic expressiveness.

Conclusion: AEQ-Bench provides a systematic framework for evaluating empathetic capabilities in OLMs, revealing current limitations in fine-grained paralinguistic assessment while showing promise for models with audio capabilities.

Abstract: While the automatic evaluation of omni-modal large models (OLMs) is essential, assessing empathy remains a significant challenge due to its inherent affectivity. To investigate this challenge, we introduce AEQ-Bench (Audio Empathy Quotient Benchmark), a novel benchmark to systematically assess two core empathetic capabilities of OLMs: (i) generating empathetic responses by comprehending affective cues from multi-modal inputs (audio + text), and (ii) judging the empathy of audio responses without relying on text transcription. Compared to existing benchmarks, AEQ-Bench incorporates two novel settings that vary in context specificity and speech tone. Comprehensive assessment across linguistic and paralinguistic metrics reveals that (1) OLMs trained with audio output capabilities generally outperformed models with text-only outputs, and (2) while OLMs align with human judgments for coarse-grained quality assessment, they remain unreliable for evaluating fine-grained paralinguistic expressiveness.

[72] PERM: Psychology-grounded Empathetic Reward Modeling for Large Language Models

Chengbing Wang, Wuqiang Zheng, Yang Zhang, Fengbin Zhu, Junyi Cheng, Yi Xie, Wenjie Wang, Fuli Feng

Main category: cs.CL

TL;DR: PERM introduces a psychology-grounded bidirectional reward model for training more empathetic LLMs by evaluating empathy from supporter, seeker, and bystander perspectives, outperforming existing methods by over 10%.

Details

Motivation: Current LLMs deployed in human-centric applications often fail to provide substantive emotional support. Existing RL-based empathy enhancement methods use reward models that evaluate empathy from only a single perspective, ignoring the bidirectional nature of empathy interactions as defined by Empathy Cycle theory.

Method: PERM (Psychology-grounded Empathetic Reward Modeling) operationalizes empathy evaluation through a bidirectional decomposition: 1) Supporter perspective (assessing internal resonation and communicative expression), 2) Seeker perspective (evaluating emotional reception), plus 3) Bystander perspective to monitor overall interaction quality.

Result: PERM outperforms state-of-the-art baselines by over 10% on a widely-used emotional intelligence benchmark and an industrial daily conversation dataset. A blinded user study shows 70% preference for PERM-generated responses.

Conclusion: PERM effectively enhances LLM empathy by incorporating psychology-grounded bidirectional evaluation, demonstrating superior performance and user preference over existing methods, with code and models publicly available.

Abstract: Large Language Models (LLMs) are increasingly deployed in human-centric applications, yet they often fail to provide substantive emotional support. While Reinforcement Learning (RL) has been utilized to enhance empathy of LLMs, existing reward models typically evaluate empathy from a single perspective, overlooking the inherently bidirectional interaction nature of empathy between the supporter and seeker as defined by Empathy Cycle theory. To address this limitation, we propose Psychology-grounded Empathetic Reward Modeling (PERM). PERM operationalizes empathy evaluation through a bidirectional decomposition: 1) Supporter perspective, assessing internal resonation and communicative expression; 2) Seeker perspective, evaluating emotional reception. Additionally, it incorporates a bystander perspective to monitor overall interaction quality. Extensive experiments on a widely-used emotional intelligence benchmark and an industrial daily conversation dataset demonstrate that PERM outperforms state-of-the-art baselines by over 10%. Furthermore, a blinded user study reveals a 70% preference for our approach, highlighting its efficacy in generating more empathetic responses. Our code, dataset, and models are available at https://github.com/ZhengWwwq/PERM.

[73] Representation-Aware Unlearning via Activation Signatures: From Suppression to Knowledge-Signature Erasure

Syed Naveed Mahmood, Md. Rezaur Rahman Bhuiyan, Tasfia Zaman, Jareen Tasneem Khondaker, Md. Sameer Sakib, Nazia Tasnim, Farig Sadeque

Main category: cs.CL

TL;DR: KIF is a representation-aware unlearning framework that achieves near-perfect knowledge erasure while maintaining utility, breaking the stability-erasure tradeoff by targeting internal activation signatures rather than surface outputs.

Details

Motivation: Current unlearning methods conflate behavioral suppression with true knowledge removal, allowing latent capabilities to persist beneath surface-level refusals. This is problematic for GDPR compliance and model safety, requiring genuine knowledge erasure rather than just refusal training.

Method: Knowledge Immunization Framework (KIF) uses representation-aware architecture that distinguishes genuine erasure from obfuscation by targeting internal activation signatures. It combines dynamic suppression of subject-specific representations with parameter-efficient adaptation, enabling durable unlearning without full model retraining.

Result: KIF achieves near-oracle erasure (FQ ≈ 0.99 vs. 1.00) while preserving utility at oracle levels (MU = 0.62), effectively breaking the stability-erasure tradeoff. Evaluation across 3B to 14B parameters shows standard models exhibit scale-independent true erasure (<3% utility drift), while reasoning-prior models reveal fundamental architectural divergence.

Conclusion: KIF provides a systematic approach to genuine knowledge erasure that distinguishes between surface-level refusal and true internal knowledge removal, enabling GDPR compliance and model safety through representation-aware unlearning that maintains model utility.

Abstract: Selective knowledge erasure from LLMs is critical for GDPR compliance and model safety, yet current unlearning methods conflate behavioral suppression with true knowledge removal, allowing latent capabilities to persist beneath surface-level refusals. In this work, we address this challenge by introducing Knowledge Immunization Framework (KIF), a representation-aware architecture that distinguishes genuine erasure from obfuscation by targeting internal activation signatures rather than surface outputs. Our approach combines dynamic suppression of subject-specific representations with parameter-efficient adaptation, enabling durable unlearning without full model retraining. KIF achieves near-oracle erasure (FQ approx 0.99 vs. 1.00) while preserving utility at oracle levels (MU = 0.62), effectively breaking the stability-erasure tradeoff that has constrained all prior work. We evaluate both standard foundation models (Llama and Mistral) and reasoning-prior models (Qwen and DeepSeek) across 3B to 14B parameters. Our observation shows that standard models exhibit scale-independent true erasure (<3% utility drift), while reasoning-prior models reveal fundamental architectural divergence. Our comprehensive dual-metric evaluation protocol, combining surface-level leakage with latent trace persistence, operationalizes the obfuscation - erasure distinction and enables the first systematic diagnosis of mechanism-level forgetting behavior across model families and scales.

[74] Form and Meaning in Intrinsic Multilingual Evaluations

Wessel Poelman, Miryam de Lhoneux

Main category: cs.CL

TL;DR: Current intrinsic metrics like perplexity are not universally comparable across languages in multilingual settings due to conflating semantic meaning with information-theoretic content.

Details

Motivation: To examine the problematic assumptions behind using intrinsic evaluation metrics (like perplexity) in multilingual settings, particularly the assumption that comparing perplexity on parallel sentences indicates model quality when semantic meaning is the same.

Method: Explicitly identifying assumptions about multilingual metric comparisons, then conducting experiments with six metrics on two multi-parallel corpora using both monolingual and multilingual models.

Result: Found that current intrinsic metrics are not universally comparable across languages, with the form-meaning debate providing explanatory insights for this limitation.

Conclusion: The paper reveals fundamental issues with applying monolingual evaluation assumptions to multilingual settings, highlighting that information-theoretic metrics don’t align with semantic equivalence across languages.

Abstract: Intrinsic evaluation metrics for conditional language models, such as perplexity or bits-per-character, are widely used in both mono- and multilingual settings. These metrics are rather straightforward to use and compare in monolingual setups, but rest on a number of assumptions in multilingual setups. One such assumption is that comparing the perplexity of CLMs on parallel sentences is indicative of their quality since the information content (here understood as the semantic meaning) is the same. However, the metrics are inherently measuring information content in the information-theoretic sense. We make this and other such assumptions explicit and discuss their implications. We perform experiments with six metrics on two multi-parallel corpora both with mono- and multilingual models. Ultimately, we find that current metrics are not universally comparable. We look at the form-meaning debate to provide some explanation for this.

[75] Influential Training Data Retrieval for Explaining Verbalized Confidence of LLMs

Yuxi Xia, Loris Schoenegger, Benjamin Roth

Main category: cs.CL

TL;DR: TracVC traces LLM confidence expressions to training data, revealing models often mimic confidence patterns rather than ground confidence in relevant content.

Details

Motivation: LLMs often verbalize confidence to increase user trust, but this confidence is unreliable and doesn't align with factual accuracy. The paper aims to understand the sources of this verbalized confidence by tracing it back to training data.

Method: Introduces TracVC (Tracing Verbalized Confidence), which combines information retrieval and influence estimation to trace generated confidence expressions back to training data. Evaluates on OLMo and Llama models in QA settings, proposing a new metric called “content groundness” that measures how much confidence is grounded in content-related training examples vs. generic confidence verbalization examples.

Result: Analysis shows OLMo2-13B is frequently influenced by confidence-related data that is lexically unrelated to queries, suggesting it mimics superficial linguistic expressions of certainty rather than relying on genuine content grounding.

Conclusion: Current training regimes have a fundamental limitation: LLMs learn how to sound confident without learning when confidence is justified. The analysis provides a foundation for improving LLMs’ trustworthiness in expressing more reliable confidence.

Abstract: Large language models (LLMs) can increase users’ perceived trust by verbalizing confidence in their outputs. However, prior work has shown that LLMs are often overconfident, making their stated confidence unreliable since it does not consistently align with factual accuracy. To better understand the sources of this verbalized confidence, we introduce TracVC (\textbf{Trac}ing \textbf{V}erbalized \textbf{C}onfidence), a method that builds on information retrieval and influence estimation to trace generated confidence expressions back to the training data. We evaluate TracVC on OLMo and Llama models in a question answering setting, proposing a new metric, content groundness, which measures the extent to which an LLM grounds its confidence in content-related training examples (relevant to the question and answer) versus in generic examples of confidence verbalization. Our analysis reveals that OLMo2-13B is frequently influenced by confidence-related data that is lexically unrelated to the query, suggesting that it may mimic superficial linguistic expressions of certainty rather than rely on genuine content grounding. These findings point to a fundamental limitation in current training regimes: LLMs may learn how to sound confident without learning when confidence is justified. Our analysis provides a foundation for improving LLMs’ trustworthiness in expressing more reliable confidence.

[76] Detecting Winning Arguments with Large Language Models and Persuasion Strategies

Tiziano Labruna, Arkadiusz Modzelewski, Giorgio Satta, Giovanni Da San Martino

Main category: cs.CL

TL;DR: LLMs with multi-strategy persuasion scoring improve persuasiveness prediction in arguments by analyzing six persuasion strategies across three datasets.

Details

Motivation: Detecting persuasion in argumentative text is challenging but important for understanding human communication, and current methods need better ways to incorporate persuasion strategies for improved prediction.

Method: Multi-Strategy Persuasion Scoring approach using large language models (LLMs) that guides reasoning over six persuasion strategies (Attack on reputation, Distraction, Manipulative wording, etc.) across three argument datasets.

Result: Strategy-guided reasoning improves persuasiveness prediction. The Winning Argument dataset was organized into discussion topics for analysis, and performance varies across topics. The topic-annotated dataset is publicly released.

Conclusion: Structured, strategy-aware prompting enhances interpretability and robustness in argument quality assessment, demonstrating the value of incorporating persuasion strategies in LLM-based analysis.

Abstract: Detecting persuasion in argumentative text is a challenging task with important implications for understanding human communication. This work investigates the role of persuasion strategies - such as Attack on reputation, Distraction, and Manipulative wording - in determining the persuasiveness of a text. We conduct experiments on three annotated argument datasets: Winning Arguments (built from the Change My View subreddit), Anthropic/Persuasion, and Persuasion for Good. Our approach leverages large language models (LLMs) with a Multi-Strategy Persuasion Scoring approach that guides reasoning over six persuasion strategies. Results show that strategy-guided reasoning improves the prediction of persuasiveness. To better understand the influence of content, we organize the Winning Argument dataset into broad discussion topics and analyze performance across them. We publicly release this topic-annotated version of the dataset to facilitate future research. Overall, our methodology demonstrates the value of structured, strategy-aware prompting for enhancing interpretability and robustness in argument quality assessment.

[77] LIBERTy: A Causal Framework for Benchmarking Concept-Based Explanations of LLMs with Structural Counterfactuals

Gilat Toker, Nitay Calderon, Ohad Amosy, Roi Reichart

Main category: cs.CL

TL;DR: LIBERTy is a new benchmark using LLM-generated structural counterfactuals to evaluate faithfulness of concept-based explanations, revealing significant room for improvement in current methods.

Details

Motivation: Existing benchmarks for evaluating concept-based explanations rely on costly human-written counterfactuals that are imperfect proxies. There's a need for better evaluation frameworks to assess the faithfulness of explanations that quantify how high-level concepts influence model behavior.

Method: Introduces LIBERTy framework that constructs datasets with structural counterfactual pairs grounded in explicitly defined Structured Causal Models (SCMs). Interventions on concepts propagate through SCMs until LLMs generate counterfactuals. Includes three datasets (disease detection, CV screening, workplace violence) and a new order-faithfulness metric.

Result: Evaluation of various methods across five models shows substantial headroom for improving concept-based explanations. Analysis reveals proprietary LLMs show markedly reduced sensitivity to demographic concepts, likely due to post-training mitigation.

Conclusion: LIBERTy provides a much-needed benchmark for developing faithful explainability methods, enabling systematic analysis of model sensitivity to interventions and revealing limitations in current concept-based explanation approaches.

Abstract: Concept-based explanations quantify how high-level concepts (e.g., gender or experience) influence model behavior, which is crucial for decision-makers in high-stakes domains. Recent work evaluates the faithfulness of such explanations by comparing them to reference causal effects estimated from counterfactuals. In practice, existing benchmarks rely on costly human-written counterfactuals that serve as an imperfect proxy. To address this, we introduce a framework for constructing datasets containing structural counterfactual pairs: LIBERTy (LLM-based Interventional Benchmark for Explainability with Reference Targets). LIBERTy is grounded in explicitly defined Structured Causal Models (SCMs) of the text generation, interventions on a concept propagate through the SCM until an LLM generates the counterfactual. We introduce three datasets (disease detection, CV screening, and workplace violence prediction) together with a new evaluation metric, order-faithfulness. Using them, we evaluate a wide range of methods across five models and identify substantial headroom for improving concept-based explanations. LIBERTy also enables systematic analysis of model sensitivity to interventions: we find that proprietary LLMs show markedly reduced sensitivity to demographic concepts, likely due to post-training mitigation. Overall, LIBERTy provides a much-needed benchmark for developing faithful explainability methods.

[78] Grounding Agent Memory in Contextual Intent

Ruozhen Yang, Yucheng Jiang, Yueqi Jiang, Priyanka Kargupta, Yunyi Zhang, Jiawei Han

Main category: cs.CL

TL;DR: STITCH is a memory system for LLMs that uses structured intent tracking to improve retrieval in long-horizon goal-oriented interactions by reducing context-mismatched evidence.

Details

Motivation: Large language models struggle with long-horizon, goal-oriented interactions because similar entities and facts recur under different latent goals, causing memory systems to retrieve context-mismatched evidence that interferes with reasoning.

Method: STITCH indexes each trajectory step with structured retrieval cues (contextual intent) including: (1) current latent goal defining thematic segment, (2) action type, and (3) salient entity types. During inference, it filters and prioritizes memory snippets by intent compatibility.

Result: STITCH achieves state-of-the-art performance on CAME-Bench and LongMemEval, outperforming the strongest baseline by 35.6%, with largest gains as trajectory length increases. Intent indexing substantially reduces retrieval noise.

Conclusion: Structured intent tracking enables robust long-horizon reasoning by providing compact signals that disambiguate repeated mentions and reduce interference, supporting intent-aware memory systems for goal-oriented interactions.

Abstract: Deploying large language models in long-horizon, goal-oriented interactions remains challenging because similar entities and facts recur under different latent goals and constraints, causing memory systems to retrieve context-mismatched evidence. We propose STITCH (Structured Intent Tracking in Contextual History), an agentic memory system that indexes each trajectory step with a structured retrieval cue, contextual intent, and retrieves history by matching the current step’s intent. Contextual intent provides compact signals that disambiguate repeated mentions and reduce interference: (1) the current latent goal defining a thematic segment, (2) the action type, and (3) the salient entity types anchoring which attributes matter. During inference, STITCH filters and prioritizes memory snippets by intent compatibility, suppressing semantically similar but context-incompatible history. For evaluation, we introduce CAME-Bench, a benchmark for context-aware retrieval in realistic, dynamic, goal-oriented trajectories. Across CAME-Bench and LongMemEval, STITCH achieves state-of-the-art performance, outperforming the strongest baseline by 35.6%, with the largest gains as trajectory length increases. Our analysis shows that intent indexing substantially reduces retrieval noise, supporting intent-aware memory for robust long-horizon reasoning.

[79] MatchTIR: Fine-Grained Supervision for Tool-Integrated Reasoning via Bipartite Matching

Changle Qu, Sunhao Dai, Hengyi Cai, Jun Xu, Shuaiqiang Wang, Dawei Yin

Main category: cs.CL

TL;DR: MatchTIR improves LLM tool-use training via fine-grained credit assignment using bipartite matching for turn-level rewards and dual-level advantage estimation.

Details

Motivation: Existing RL methods for tool-integrated reasoning use coarse-grained rewards (outcome- or trajectory-level) that fail to distinguish effective vs. redundant/erroneous tool calls, especially in long-horizon multi-turn scenarios.

Method: Proposes MatchTIR with: 1) bipartite matching between predicted and ground-truth traces to derive dense turn-level rewards, and 2) dual-level advantage estimation integrating turn-level and trajectory-level signals for distinct advantage values per interaction turn.

Result: Extensive experiments on three benchmarks show superiority; 4B model surpasses majority of 8B competitors, especially in long-horizon and multi-turn tasks.

Conclusion: Fine-grained supervision via bipartite matching and dual-level advantage estimation effectively addresses credit assignment challenges in tool-integrated reasoning, enabling better performance with smaller models.

Abstract: Tool-Integrated Reasoning (TIR) empowers large language models (LLMs) to tackle complex tasks by interleaving reasoning steps with external tool interactions. However, existing reinforcement learning methods typically rely on outcome- or trajectory-level rewards, assigning uniform advantages to all steps within a trajectory. This coarse-grained credit assignment fails to distinguish effective tool calls from redundant or erroneous ones, particularly in long-horizon multi-turn scenarios. To address this, we propose MatchTIR, a framework that introduces fine-grained supervision via bipartite matching-based turn-level reward assignment and dual-level advantage estimation. Specifically, we formulate credit assignment as a bipartite matching problem between predicted and ground-truth traces, utilizing two assignment strategies to derive dense turn-level rewards. Furthermore, to balance local step precision with global task success, we introduce a dual-level advantage estimation scheme that integrates turn-level and trajectory-level signals, assigning distinct advantage values to individual interaction turns. Extensive experiments on three benchmarks demonstrate the superiority of MatchTIR. Notably, our 4B model surpasses the majority of 8B competitors, particularly in long-horizon and multi-turn tasks. Our codes are available at https://github.com/quchangle1/MatchTIR.

[80] Fairness Definitions in Language Models Explained

Zhipeng Yin, Zichong Wang, Avash Palikhe, Wenbin Zhang

Main category: cs.CL

TL;DR: A systematic survey paper that clarifies fairness definitions in Large Language Models, categorizes them by transformer architecture types, and provides experimental demonstrations.

Details

Motivation: Despite LMs' strong performance, they can inherit and amplify societal biases related to gender, race, etc., limiting real-world adoption. The lack of clear agreement on fairness definitions and complexity in understanding distinctions creates confusion and impedes progress.

Method: Proposes a systematic survey that: 1) introduces LMs and fairness concepts, 2) provides comprehensive overview of existing fairness notions, 3) introduces novel taxonomy categorizing fairness concepts based on transformer architecture (encoder-only, decoder-only, encoder-decoder), 4) illustrates each definition through experiments.

Result: The paper provides a structured framework for understanding fairness in LMs, with experimental demonstrations of practical implications. A public repository is made available with the survey materials.

Conclusion: The survey clarifies fairness definitions in LMs, offers a novel architectural taxonomy, and discusses current research challenges and open questions to foster innovation and advance the field of fairness in language models.

Abstract: Language Models (LMs) have demonstrated exceptional performance across various Natural Language Processing (NLP) tasks. Despite these advancements, LMs can inherit and amplify societal biases related to sensitive attributes such as gender and race, limiting their adoption in real-world applications. Therefore, fairness has been extensively explored in LMs, leading to the proposal of various fairness notions. However, the lack of clear agreement on which fairness definition to apply in specific contexts and the complexity of understanding the distinctions between these definitions can create confusion and impede further progress. To this end, this paper proposes a systematic survey that clarifies the definitions of fairness as they apply to LMs. Specifically, we begin with a brief introduction to LMs and fairness in LMs, followed by a comprehensive, up-to-date overview of existing fairness notions in LMs and the introduction of a novel taxonomy that categorizes these concepts based on their transformer architecture: encoder-only, decoder-only, and encoder-decoder LMs. We further illustrate each definition through experiments, showcasing their practical implications and outcomes. Finally, we discuss current research challenges and open questions, aiming to foster innovative ideas and advance the field. The repository is publicly available online at https://github.com/vanbanTruong/Fairness-in-Large-Language-Models/tree/main/definitions.

[81] Exploring the Translation Mechanism of Large Language Models

Hongbin Zhang, Kehai Chen, Xuefeng Bai, Xiucheng Li, Yang Xiang, Min Zhang

Main category: cs.CL

TL;DR: LLMs use sparse specialized components for translation: attention heads extract source language features and translation indicators, MLPs integrate them into English-centric representations, and minimal fine-tuning of these components improves translation while preserving general capabilities.

Details

Motivation: Despite LLMs' success in multilingual translation, their internal translation mechanisms at the fundamental word level remain poorly understood, creating a critical gap in interpretability.

Method: Proposes subspace-intervened path patching for fine-grained causal analysis to detect translation-critical components and characterize their behavioral patterns in human-interpretable terms.

Result: Translation is driven by sparse specialized components: attention heads extract source language, translation indicators, and positional features; MLPs integrate these into English-centric latent representations before final translation. Targeted fine-tuning of minimal parameter subset (<5%) enhances translation while preserving general capabilities.

Conclusion: The identified crucial components generalize to sentence-level translation and help elucidate more intricate translation tasks, providing a systematic framework for understanding LLM translation mechanisms.

Abstract: While large language models (LLMs) demonstrate remarkable success in multilingual translation, their internal core translation mechanisms, even at the fundamental word level, remain insufficiently understood. To address this critical gap, this work introduces a systematic framework for interpreting the mechanism behind LLM translation from the perspective of computational components. This paper first proposes subspace-intervened path patching for precise, fine-grained causal analysis, enabling the detection of components crucial to translation tasks and subsequently characterizing their behavioral patterns in human-interpretable terms. Comprehensive experiments reveal that translation is predominantly driven by a sparse subset of components: specialized attention heads serve critical roles in extracting source language, translation indicators, and positional features, which are then integrated and processed by specific multi-layer perceptrons (MLPs) into intermediary English-centric latent representations before ultimately yielding the final translation. The significance of these findings is underscored by the empirical demonstration that targeted fine-tuning a minimal parameter subset ($<5%$) enhances translation performance while preserving general capabilities. This result further indicates that these crucial components generalize effectively to sentence-level translation and are instrumental in elucidating more intricate translation tasks.

[82] Text Classification Under Class Distribution Shift: A Survey

Adriana Valentina Costache, Silviu Florin Gheorghe, Eduard Gabriel Poesina, Paul Irofti, Radu Tudor Ionescu

Main category: cs.CL

TL;DR: Survey paper on open-set text classification methods for handling distribution shifts, covering Universum learning, zero-shot learning, and open-set learning approaches.

Details

Motivation: Traditional ML assumes training and test data come from the same distribution, but in practice (especially in text classification), distributions shift over time as new topics emerge, hindering conventional models.

Method: Survey methodology categorizing approaches based on distribution shift constraints: 1) Learning with the Universum, 2) Zero-shot learning, and 3) Open-set learning. Also discusses mitigation strategies for each problem setup.

Result: Comprehensive taxonomy of open-set text classification methods, identification of predominant mitigation approaches, and future research directions. Maintains curated repository of relevant papers.

Conclusion: Continual learning can address many issues from shifting class distributions. The survey provides structured understanding of open-set text classification challenges and solutions, with identified future directions to advance the field.

Abstract: The basic underlying assumption of machine learning (ML) models is that the training and test data are sampled from the same distribution. However, in daily practice, this assumption is often broken, i.e. the distribution of the test data changes over time, which hinders the application of conventional ML models. One domain where the distribution shift naturally occurs is text classification, since people always find new topics to discuss. To this end, we survey research articles studying open-set text classification and related tasks. We divide the methods in this area based on the constraints that define the kind of distribution shift and the corresponding problem formulation, i.e. learning with the Universum, zero-shot learning, and open-set learning. We next discuss the predominant mitigation approaches for each problem setup. We further identify several future work directions, aiming to push the boundaries beyond the state of the art. Finally, we explain how continual learning can solve many of the issues caused by the shifting class distribution. We maintain a list of relevant papers at https://github.com/Eduard6421/Open-Set-Survey.

[83] Bayesian Teaching Enables Probabilistic Reasoning in Large Language Models

Linlu Qiu, Fei Sha, Kelsey Allen, Yoon Kim, Tal Linzen, Sjoerd van Steenkiste

Main category: cs.CL

TL;DR: LLMs can learn Bayesian reasoning skills from examples and generalize them to new domains, improving their ability to update beliefs like human agents.

Details

Motivation: LLMs are increasingly used as interactive agents that need to form probabilistic beliefs about the world and user preferences, requiring proper belief updating capabilities that currently fall short of Bayesian standards.

Method: Teach LLMs to mimic predictions of normative Bayesian models through examples, enabling them to learn Bayesian reasoning skills for belief updating.

Result: LLMs dramatically improve their ability to update beliefs when taught Bayesian reasoning, and this ability generalizes to new tasks beyond the training examples.

Conclusion: LLMs can effectively learn and generalize reasoning skills from examples, demonstrating potential for developing more sophisticated AI agents with proper belief updating capabilities.

Abstract: Large language models (LLMs) are increasingly used as agents that interact with users and with the world. To do so successfully, LLMs must construct representations of the world and form probabilistic beliefs about them. To provide personalized recommendations, for example, the LLM needs to infer a user’s preferences from their behavior over multiple interactions. The Bayesian inference framework lays out the optimal way for an agent to update its beliefs as it receives new information. We first show that LLMs fall far short of the standard defined by the Bayesian framework. We then show that by teaching LLMs to mimic the predictions of the normative Bayesian model, we can dramatically improve their ability to update their beliefs; this ability generalizes to new tasks. We conclude that LLMs can effectively learn reasoning skills from examples and generalize those skills to new domains.

[84] Testing Low-Resource Language Support in LLMs Using Language Proficiency Exams: the Case of Luxembourgish

Cedric Lothritz, Jordi Cabot, Laura Bernardy

Main category: cs.CL

TL;DR: LLMs perform poorly in less-resourced languages like Luxembourgish. Language proficiency exams can serve as effective evaluation tools, with large models (Claude, DeepSeek-R1) scoring high while smaller models perform weakly. Exam performance predicts NLP task performance in Luxembourgish.

Details

Motivation: LLMs are predominantly developed for English and widespread languages, leaving less-resourced languages like Luxembourgish with sparse evaluation tools and datasets. There's a need for effective evaluation methods for these under-resourced languages.

Method: Investigating the viability of language proficiency exams as evaluation tools for Luxembourgish language models. Testing various LLMs (including Claude and DeepSeek-R1) on these exams to assess their performance.

Result: Large models like Claude and DeepSeek-R1 achieve high scores on Luxembourgish language exams, while smaller models show weak performances. Language exam performance can predict performance in other NLP tasks in Luxembourgish.

Conclusion: Language proficiency exams are viable evaluation tools for less-resourced languages like Luxembourgish. They reveal performance gaps between large and small models and can serve as predictors for broader NLP task performance in these languages.

Abstract: Large Language Models (LLMs) have become an increasingly important tool in research and society at large. While LLMs are regularly used all over the world by experts and lay-people alike, they are predominantly developed with English-speaking users in mind, performing well in English and other wide-spread languages while less-resourced languages such as Luxembourgish are seen as a lower priority. This lack of attention is also reflected in the sparsity of available evaluation tools and datasets. In this study, we investigate the viability of language proficiency exams as such evaluation tools for the Luxembourgish language. We find that large models such as Claude and DeepSeek-R1 typically achieve high scores, while smaller models show weak performances. We also find that the performances in such language exams can be used to predict performances in other NLP tasks in Luxembourgish.

[85] WebRollback: Enhancing Web Agents with Explicit Rollback Mechanisms

Zhisong Zhang, Tianqing Fang, Kaixin Ma, Wenhao Yu, Hongming Zhang, Haitao Mi, Dong Yu

Main category: cs.CL

TL;DR: Web agents enhanced with explicit rollback mechanism for better navigation in complex web environments, outperforming greedy one-way search strategies.

Details

Motivation: Current web agents using greedy one-way search struggle to recover from erroneous states in complex, dynamic web environments, requiring more advanced planning and search capabilities.

Method: Introduces an explicit rollback mechanism that allows web agents to revert to previous states in their navigation trajectory, giving models direct control over the search process for more effective navigation.

Result: Experiments on two live web navigation benchmarks with zero-shot and fine-tuning settings demonstrate the effectiveness of the proposed approach.

Conclusion: The explicit rollback mechanism provides web agents with greater flexibility and efficiency in navigating complex web environments, addressing limitations of previous greedy search strategies.

Abstract: With recent advancements in large language models, web agents have been greatly improved. However, dealing with complex and dynamic web environments requires more advanced planning and search abilities. Previous studies usually adopt a greedy one-way search strategy, which may struggle to recover from erroneous states. In this work, we enhance web agents with an explicit rollback mechanism, enabling the agent to revert back to a previous state in its navigation trajectory. This mechanism gives models the flexibility to directly control the search process, leading to an effective and efficient web navigation method. We conduct experiments on two live web navigation benchmarks with zero-shot and fine-tuning settings. The results demonstrate the effectiveness of our proposed approach.

[86] On the Failure of Latent State Persistence in Large Language Models

Jen-tse Huang, Kaiser Sun, Wenxuan Wang, Mark Dredze

Main category: cs.CL

TL;DR: LLMs lack persistent latent states, functioning as reactive solvers rather than proactive planners with working memory-like capabilities.

Details

Motivation: To investigate whether LLMs can maintain and manipulate unexpressed internal representations (latent states) analogous to human working memory, which is crucial for complex reasoning.

Method: Three novel experiments: 1) Number Guessing Game to test probability allocation to hidden choices, 2) Yes-No Game to measure concept drift and self-contradictions, 3) Mathematical Mentalism-inspired task to evaluate variable binding and state evolution tracking.

Result: LLMs fail to maintain persistent latent states: they can’t allocate probability mass to singular hidden choices, suffer from concept drift leading to self-contradictions, and fail to track transformations on hidden variables without explicit context.

Conclusion: LLMs function as reactive post-hoc solvers rather than proactive planners with Latent State Persistence, revealing a fundamental architectural divergence between autoregressive transformers and human-like cognition.

Abstract: While Large Language Models (LLMs) excel in reasoning, whether they can sustain persistent latent states remains under-explored. The capacity to maintain and manipulate unexpressed, internal representations-analogous to human working memory-is a cornerstone of complex reasoning. In this paper, we formalize and quantify the “Latent State Persistence” (LSP) gap through three novel experiments. First, we utilize a Number Guessing Game, demonstrating that across independent queries, LLMs fail to allocate probability mass to a singular hidden choice, violating a fundamental probabilistic principle. Second, we employ a Yes-No Game to show that as the number of questions increases, LLMs suffer from “concept drift,” leading to inevitable self-contradictions due to the lack of LSP. Finally, inspired by Mathematical Mentalism, we task models with tracking transformations on hidden variables, revealing a failure in variable binding and state evolution when the initial state is not explicitly present in the context. Collectively, these findings suggest that LLMs function as reactive post-hoc solvers rather than proactive planners with LSP. Our work provides a framework for evaluating the fidelity of internal representations and highlights a fundamental architectural divergence between autoregressive transformers and human-like cognition.

[87] PMOA-TTS: Introducing the PubMed Open Access Textual Times Series Corpus

Shahriar Noroozizadeh, Sayantan Kumar, George H. Chen, Jeremy C. Weiss

Main category: cs.CL

TL;DR: PMOA-TTS is a large corpus of 124,699 PubMed case reports converted into structured timelines with over 5.6 million timestamped events, created using LLM pipelines for temporal medical research.

Details

Motivation: Clinical narratives contain crucial temporal information for modeling patient trajectories, but there's a scarcity of large-scale temporally annotated resources in the medical domain.

Method: Used scalable LLM pipeline (Llama 3.3 70B and DeepSeek-R1) to convert PubMed Open Access case reports into structured textual timelines of (event, time) pairs, with technical validation using clinician-curated gold set and three evaluation metrics.

Result: Created corpus of 124,699 single-patient case reports with over 5.6 million timestamped events, plus extracted demographics and diagnoses. Technical validation shows the pipeline’s effectiveness through semantic event matching, temporal concordance, and alignment error metrics.

Conclusion: PMOA-TTS enables research on timeline extraction, temporal reasoning, survival modeling, and event forecasting from narrative text, with broad diagnostic and demographic coverage. The data and code are openly available for reproduction and further research.

Abstract: Clinical narratives encode temporal dynamics essential for modeling patient trajectories, yet large-scale temporally annotated resources are scarce. We introduce PMOA-TTS, a corpus of 124,699 single-patient PubMed Open Access case reports converted into structured textual timelines of (event, time) pairs using a scalable large-language-model pipeline (Llama 3.3 70B and DeepSeek-R1). The corpus comprises over 5.6 million timestamped events, alongside extracted demographics and diagnoses. Technical validation uses a clinician-curated gold set and three measures: semantic event matching, temporal concordance (c-index), and alignment error summarized with Area Under the Log-Time CDF (AULTC). We benchmark alternative prompting and model choices and provide documentation to support reproduction. PMOA-TTS enables research on timeline extraction, temporal reasoning, survival modeling and event forecasting from narrative text, and offers broad diagnostic and demographic coverage. Data and code are openly available in public repositories.

[88] The Open Proof Corpus: A Large-Scale Study of LLM-Generated Mathematical Proofs

Jasper Dekoninck, Ivo Petrov, Kristian Minchev, Mislav Balunovic, Martin Vechev, Miroslav Marinov, Maria Drencheva, Lyuba Konova, Milen Shumanov, Kaloyan Tsvetkov, Nikolay Drenchev, Lazar Todorov, Kalina Nikolova, Nikolay Georgiev, Vanesa Kalinkova, Margulan Ismoldayev

Main category: cs.CL

TL;DR: The paper introduces the Open Proof Corpus (OPC), a large-scale dataset of 5,000+ human-evaluated LLM-generated proofs, including correct solutions to prestigious math competition problems, and uses it to analyze key questions in automated proof generation while demonstrating its utility by finetuning an 8B-parameter model.

Details

Motivation: Current progress in LLM-based mathematical proof generation is limited by the lack of large-scale, high-quality human-evaluated proof datasets, which are essential for training improvements and rigorous analysis of proof generation capabilities.

Method: Created the Open Proof Corpus (OPC) with over 5,000 human-evaluated proofs from state-of-the-art LLMs, specifically designed for broad applicability in proof generation research. Used OPC to analyze: (1) natural language vs. formal proof generation gap, (2) final-answer accuracy vs. full-proof validity discrepancy, and (3) impact of best-of-n selection on proof quality.

Result: OPC is the first dataset with substantial correct LLM-generated solutions to prestigious math competitions (USAMO/IMO). Finetuned an 8B-parameter model on OPC that performs on par with the best model (Gemini-2.5-Pro) on proof correctness evaluation.

Conclusion: The Open Proof Corpus addresses a critical gap in mathematical proof generation research by providing a high-quality, human-evaluated dataset that enables systematic analysis of proof generation capabilities and facilitates model development, as demonstrated by achieving state-of-the-art performance in proof evaluation.

Abstract: In recent months, large language models (LLMs) have made significant progress in mathematical proof generation, but further advancement is hindered by the lack of a large-scale, high-quality dataset of human-evaluated proofs. While expensive to create, such a dataset is essential for driving improvements in training and enabling a rigorous analysis of proof generation capabilities. In this work, we present the Open Proof Corpus (OPC), a dataset comprising over 5,000 human-evaluated proofs produced by state-of-the-art LLMs. The OPC was specifically designed for broad applicability and downstream usage in proof generation research and is the first to include a substantial number of correct, LLM-generated solutions to problems from prestigious mathematics competitions such as the USAMO and IMO. Using the OPC, we explore critical questions in automated proof generation: (1) the performance gap between natural language and formal proof generation, (2) the discrepancy between final-answer accuracy and full-proof validity, and (3) the impact of best-of-n selection on proof quality. Finally, to showcase the utility of the OPC, we finetune an 8B-parameter model on the dataset, obtaining a model that performs on par with the best model, Gemini-2.5-Pro, on the task of evaluating proof correctness.

Seunghee Kim, Ingyu Bang, Seokgyu Jang, Changhyeon Kim, Sanghwan Bae, Jihun Choi, Richeng Xuan, Taeuk Kim

Main category: cs.CL

TL;DR: OMHBench is a new benchmark for evaluating omni-modal multi-hop reasoning in MLLMs, revealing performance gaps between proprietary vs. open-source models and highlighting speech modality weaknesses.

Details

Motivation: Existing MLLM evaluation frameworks have critical limitations including modality shortcuts and biased reasoning paths, making it difficult to properly assess true omni-modal reasoning capabilities.

Method: Proposed OMHBench benchmark with 6,144 questions featuring balanced reasoning paths jointly grounded across text, vision, and speech modalities to rigorously evaluate multi-hop reasoning.

Result: Evaluation of 13 SOTA models shows: 1) large performance gap between proprietary and open-source MLLMs, 2) proprietary models are highly sensitive to reasoning path variations with asymmetric omni-modal grounding, 3) models particularly struggle with speech modality processing.

Conclusion: The benchmark reveals significant weaknesses in current MLLMs’ omni-modal reasoning, especially in speech processing, highlighting the need for balanced, multi-hop evaluation to advance true omni-modal intelligence.

Abstract: Multimodal Large Language Models (MLLMs) have increasingly supported omni-modal processing across text, vision, and speech. However, existing evaluation frameworks for such models suffer from critical limitations, including modality shortcuts and biased reasoning paths. To address these challenges, we propose OMHBench, a novel benchmark designed to rigorously evaluate omni-modal multi-hop reasoning. It consists of 6,144 questions with balanced reasoning paths that are jointly grounded across all three modalities. Extensive evaluation of 13 state-of-the-art models reveals that (1) a large performance gap exists between proprietary and open-source MLLMs and (2) even proprietary models exhibit high sensitivity to reasoning path variations, resulting in asymmetric omni-modal grounding. Notably, models struggle when processing the speech modality, underscoring the need for balanced, multi-hop evaluation of omni-modal intelligence.

[90] How Quantization Shapes Bias in Large Language Models

Federico Marcuzzi, Xuefei Ning, Roy Schwartz, Iryna Gurevych

Main category: cs.CL

TL;DR: Quantization has nuanced effects on model bias: reduces toxicity but slightly increases stereotypes and unfairness, especially with aggressive compression.

Details

Motivation: To comprehensively evaluate how quantization (weight and activation compression) affects model bias across different demographic subgroups and bias types, balancing efficiency with ethical considerations.

Method: Evaluated weight and activation quantization strategies across 13 benchmarks using probability- and generated text-based metrics. Tested models with different architectures and reasoning abilities, examining effects on stereotypes, fairness, toxicity, and sentiment.

Result: Quantization reduces model toxicity and doesn’t significantly impact sentiment, but tends to slightly increase stereotypes and unfairness in generative tasks, especially under aggressive compression. Trends are consistent across demographic categories and model types, though magnitude varies by specific setting.

Conclusion: Careful balancing of efficiency and ethical considerations is crucial when applying quantization in practice, as compression techniques can have nuanced and sometimes negative impacts on model bias.

Abstract: This work presents a comprehensive evaluation of how quantization affects model bias, with particular attention to its impact on individual demographic subgroups. We focus on weight and activation quantization strategies and examine their effects across a broad range of bias types, including stereotypes, fairness, toxicity, and sentiment. We employ both probability- and generated text-based metrics across 13 benchmarks and evaluate models that differ in architecture family and reasoning ability. Our findings show that quantization has a nuanced impact on bias: while it can reduce model toxicity and does not significantly impact sentiment, it tends to slightly increase stereotypes and unfairness in generative tasks, especially under aggressive compression. These trends are generally consistent across demographic categories and subgroups, and model types, although their magnitude depends on the specific setting. Overall, our results highlight the importance of carefully balancing efficiency and ethical considerations when applying quantization in practice.

[91] JudgeAgent: Beyond Static Benchmarks for Knowledge-Driven and Dynamic LLM Evaluation

Zhichao Shi, Xuhui Jiang, Chengjin Xu, Cangli Yao, Shengjia Ma, Yinghan Shen, Zixuan Li, Jian Guo, Yuanzhuo Wang

Main category: cs.CL

TL;DR: JudgeAgent is a knowledge-driven dynamic evaluation framework for LLMs that uses context graphs and adaptive interviews to overcome limitations of static benchmarks.

Details

Motivation: Current LLM evaluation methods rely on static benchmarks with limited knowledge coverage and fixed difficulties that mismatch with evaluated models, leading to superficial assessments and impeding targeted optimizations.

Method: Uses LLM agents with context graphs to systematically traverse knowledge structures for question generation, and implements difficulty-adaptive multi-turn interview mechanisms to address data contamination and difficulty mismatch.

Result: Empirical results demonstrate that JudgeAgent enables more comprehensive evaluations and facilitates effective model iterations, highlighting the potential of this knowledge-driven dynamic evaluation paradigm.

Conclusion: JudgeAgent bridges the gap in LLM evaluation by providing a knowledge-driven dynamic framework that enables comprehensive assessments and facilitates targeted model improvements.

Abstract: Current evaluation methods for large language models (LLMs) primarily rely on static benchmarks, presenting two major challenges: limited knowledge coverage and fixed difficulties that mismatch with the evaluated LLMs. These limitations lead to superficial assessments of LLM knowledge, thereby impeding the targeted model optimizations. To bridge this gap, we propose JudgeAgent, a knowledge-driven and dynamic evaluation framework for LLMs. To address the challenge of limited knowledge coverage, JudgeAgent leverages LLM agents equipped with context graphs to traverse knowledge structures systematically for question generation. Furthermore, to mitigate data contamination and difficulty mismatch, it adopts a difficulty-adaptive and multi-turn interview mechanism. Thereby, JudgeAgent can achieve comprehensive evaluations and facilitate more effective improvement of LLMs. Empirical results demonstrate that JudgeAgent enables more comprehensive evaluations and facilitates effective model iterations, highlighting the potential of this knowledge-driven and dynamic evaluation paradigm. The source code is available on https://github.com/DataArcTech/JudgeAgent.

[92] Small Open Models Achieve Near Parity with Large Models in Low Resource Literary Translation at a Fraction of the Cost

Mihai Nadas, Laura Diosan, Andreea Tomescu, Andrei Piscoran

Main category: cs.CL

TL;DR: TF2 introduces a unified framework for English->Romanian literary translation with a fine-tuned 12B model, synthetic datasets, and evaluation methods, making literary translation more accessible for low-resource languages.

Details

Motivation: Literary translation is complex but important, yet small open models struggle with it. There's a need for high-quality literary datasets and accessible translation solutions for low-resource languages like Romanian.

Method: Created synthetic parallel datasets (3M and 15K entries), used a two-stage fine-tuning process: instruction tuning for narrative style and adapter compression for efficiency, with LLM-based multi-dimensional evaluation.

Result: The fine-tuned TF2-12B model achieves strong fluency and adequacy, narrowing the gap to proprietary models while being open, accessible, and cost-effective.

Conclusion: TF2 provides an end-to-end reproducible pipeline for cost-efficient literary translation, enabling broader adoption of open models for culturally significant content in low-resource settings.

Abstract: Literary translation has recently gained attention as a distinct and complex task in machine translation research. However, the translation by small open models remains an open problem. We contribute to this ongoing research by introducing TinyFabulist Translation Framework (TF2), a unified framework for dataset creation, fine-tuning, and evaluation in English->Romanian literary translation, centered on the creation and open release of both a compact, fine-tuned language model (TF2-12B) and large-scale synthetic parallel datasets (DS-TF2-EN-RO-3M and DS-TF2-EN-RO-15K). Building on DS-TF1-EN-3M (TF1), the largest collection of synthetic English fables to date, we address the need for rich, high-quality literary datasets in low-resource languages such as Romanian. Our pipeline first generates 15k high-quality Romanian reference translations from the TF1 pool using a high-performing LLM. We then apply a two-stage fine-tuning process to a 12B-parameter open-weight model: (i) instruction tuning to capture genre-specific narrative style, and (ii) adapter compression for efficient deployment. Evaluation combines corpus-level BLEU with a five-dimension LLM-based rubric (accuracy, fluency, coherence, style, and cultural adaptation) to provide a nuanced assessment of translation quality. Results show that our fine-tuned model achieves strong fluency and adequacy, narrowing the gap to top-performing proprietary models under automated and human-anchored evaluation, while being open, accessible, and significantly more cost-effective. Alongside the fine-tuned model and both datasets, we publicly release all scripts and evaluation prompts. TF2 thus provides an end-to-end, reproducible pipeline for research on cost-efficient translation, cross-lingual narrative generation, and the broad adoption of open models for culturally significant literary content in low-resource settings.

[93] Judge Q: Trainable Queries for Optimized Information Retention in KV Cache Eviction

Yijun Liu, Yixuan Wang, Yuzhuang Xu, Shiyu Ji, Yang Xu, Qingfu Zhu, Wanxiang Che

Main category: cs.CL

TL;DR: Judge Q: A novel training method using soft token lists to improve KV cache eviction by capturing global information, reducing performance degradation with minimal training cost.

Details

Motivation: Current KV cache eviction methods focus too much on local information from the last window, potentially missing crucial global context, which hurts decoding quality when KV cache is evicted.

Method: Propose Judge Q training method with soft token list concatenated to input sequence. Only tunes embedding layer at low cost. Trains soft tokens’ attention maps to align with actual decoded tokens, enabling them to capture global information for better KV importance evaluation.

Result: Under same eviction budget, shows less performance degradation than existing methods. Improves ~1 point on LongBench and over 3 points on RULER benchmarks. Works with models like Llama-3.1-8B-Instruct and Mistral-7B-Instruct-v0.3.

Conclusion: Method can be seamlessly integrated into existing open-source models with minimal training overhead, enhancing performance in KV cache eviction scenarios by better capturing global information.

Abstract: Large language models (LLMs) utilize key-value (KV) cache to store historical information during sequence processing. The size of KV cache grows linearly as the length of the sequence extends, which seriously affects memory usage and decoding efficiency. Current methods for KV cache eviction typically utilize the last window from the pre-filling phase as queries to compute the KV importance scores for eviction. Although this scheme is simple to implement, it tends to overly focus on local information, potentially leading to the neglect or omission of crucial global information. To mitigate this issue, we propose Judge Q, a novel training method which incorporates a soft token list. This method only tunes the model’s embedding layer at a low training cost. By concatenating the soft token list at the end of the input sequence, we train these tokens’ attention map to the original input sequence to align with that of the actual decoded tokens. In this way, the queries corresponding to the soft tokens can effectively capture global information and better evaluate the importance of the keys and values within the KV cache, thus maintaining decoding quality when KV cache is evicted. Under the same eviction budget, our method exhibits less performance degradation compared to existing eviction approaches. We validate our approach through experiments conducted on models such as Llama-3.1-8B-Instruct and Mistral-7B-Instruct-v0.3, using benchmarks including LongBench, RULER, and Needle-in-a-Haystack. Results indicate an improvement of approximately 1 point on the LongBench and over 3 points on RULER. This proposed methodology can be seamlessly integrated into existing open-source models with minimal training overhead, thereby enhancing performance in KV cache eviction scenarios.

[94] Filling in the Clinical Gaps in Benchmark: Case for HealthBench for the Japanese medical system

Shohei Hisada, Endo Sunao, Himi Yamato, Shoko Wakamiya, Eiji Aramaki

Main category: cs.CL

TL;DR: This study adapts HealthBench, a medical benchmark, to Japanese context, revealing performance drops due to rubric mismatches and highlighting the need for localized adaptation rather than direct translation.

Details

Motivation: There's a scarcity of robust Japanese medical evaluation frameworks for LLMs, with existing resources often being simple translations of multiple-choice questions, which may not align with Japan's clinical guidelines, healthcare systems, or cultural norms.

Method: 1) Established baseline performance using machine-translated HealthBench scenarios to evaluate GPT-4.1 and LLM-jp-3.1; 2) Used LLM-as-a-Judge approach to systematically classify scenarios and rubric criteria to identify contextual gaps.

Result: GPT-4.1 showed modest performance drop due to rubric mismatches; Japanese-native model (LLM-jp-3.1) had significant failures due to lack of clinical completeness. Most scenarios were applicable, but significant proportion of rubric criteria require localization.

Conclusion: Direct benchmark translation has limitations; there’s urgent need for context-aware localized adaptation (“J-HealthBench”) to ensure reliable and safe evaluation of medical LLMs in Japan.

Abstract: This study investigates the applicability of HealthBench, a large-scale, rubric-based medical benchmark, to the Japanese context. Although robust evaluation frameworks are essential for the safe development of medical LLMs, resources in Japanese are scarce and often consist of translated multiple-choice questions. Our research addresses this issue in two ways. First, we establish a performance baseline by applying a machine-translated version of HealthBench’s 5,000 scenarios to evaluate two models: a high-performing multilingual model (GPT-4.1) and a Japanese-native open-source model (LLM-jp-3.1). Secondly, we use an LLM-as-a-Judge approach to systematically classify the benchmark’s scenarios and rubric criteria. This allows us to identify ‘contextual gaps’ where the content is misaligned with Japan’s clinical guidelines, healthcare systems or cultural norms. Our findings reveal a modest performance drop in GPT-4.1 due to rubric mismatches, as well as a significant failure in the Japanese-native model, which lacked the required clinical completeness. Furthermore, our classification shows that, despite most scenarios being applicable, a significant proportion of the rubric criteria require localisation. This work underscores the limitations of direct benchmark translation and highlights the urgent need for a context-aware, localised adaptation, a “J-HealthBench”, to ensure the reliable and safe evaluation of medical LLMs in Japan.

[95] Textual Entailment is not a Better Bias Metric than Token Probability

Virginia K. Felkner, Allison Lim, Jonathan May

Main category: cs.CL

TL;DR: NLI and token probability bias metrics behave very differently with low correlation; neither is clearly better, and NLI shouldn’t fully replace TP metrics.

Details

Motivation: Token probability (TP) metrics for measuring social bias in language models have been criticized for being distant from real-world use cases and harms. The paper explores natural language inference (NLI) as an alternative bias metric to address these limitations.

Method: Conducted extensive experiments across seven language model families, comparing NLI and TP bias evaluation methods. Analyzed correlation between different NLI metrics and between NLI and TP metrics, and examined sensitivity to wording variations in stereotypical and counterstereotypical sentences.

Result: NLI and TP bias evaluation behave substantially differently with very low correlation. NLI metrics are more brittle and unstable, slightly less sensitive to wording of counterstereotypical sentences, and slightly more sensitive to wording of tested stereotypes than TP approaches.

Conclusion: Neither token probability nor natural language inference is a “better” bias metric in all cases. There’s insufficient evidence to justify NLI as a complete replacement for TP metrics in bias evaluation.

Abstract: Measurement of social bias in language models is typically by token probability (TP) metrics, which are broadly applicable but have been criticized for their distance from real-world language model use cases and harms. In this work, we test natural language inference (NLI) as an alternative bias metric. In extensive experiments across seven LM families, we show that NLI and TP bias evaluation behave substantially differently, with very low correlation among different NLI metrics and between NLI and TP metrics. NLI metrics are more brittle and unstable, slightly less sensitive to wording of counterstereotypical sentences, and slightly more sensitive to wording of tested stereotypes than TP approaches. Given this conflicting evidence, we conclude that neither token probability nor natural language inference is a ``better’’ bias metric in all cases. We do not find sufficient evidence to justify NLI as a complete replacement for TP metrics in bias evaluation.

[96] Parallel Test-Time Scaling for Latent Reasoning Models

Runyang You, Yongqi Li, Meng Liu, Wenjie Wang, Liqiang Nie, Wenjie Li

Main category: cs.CL

TL;DR: Enables parallel test-time scaling for latent reasoning models through uncertainty-inspired sampling strategies and a latent reward model for trajectory selection.

Details

Motivation: While parallel test-time scaling (TTS) effectively enhances LLMs through parallel sampling and aggregation, it hasn't been applied to latent reasoning models due to lack of sampling mechanisms in continuous space and probabilistic signals for trajectory aggregation.

Method: Introduces two uncertainty-inspired stochastic sampling strategies (Monte Carlo Dropout and Additive Gaussian Noise) for continuous space sampling, and designs a Latent Reward Model (LatentRM) trained with step-wise contrastive objective to score and guide latent reasoning trajectories.

Result: Both sampling strategies scale effectively with compute and show distinct exploration dynamics, while LatentRM enables effective trajectory selection, opening new direction for scalable inference in continuous spaces.

Conclusion: The work successfully enables parallel TTS for latent reasoning models, addressing key challenges of sampling in continuous space and trajectory aggregation, with code and checkpoints publicly released.

Abstract: Parallel test-time scaling (TTS) is a pivotal approach for enhancing large language models (LLMs), typically by sampling multiple token-based chains-of-thought in parallel and aggregating outcomes through voting or search. Recent advances in latent reasoning, where intermediate reasoning unfolds in continuous vector spaces, offer a more efficient alternative to explicit Chain-of-Thought, yet whether such latent models can similarly benefit from parallel TTS remains open, mainly due to the absence of sampling mechanisms in continuous space, and the lack of probabilistic signals for advanced trajectory aggregation. This work enables parallel TTS for latent reasoning models by addressing the above issues. For sampling, we introduce two uncertainty-inspired stochastic strategies: Monte Carlo Dropout and Additive Gaussian Noise. For aggregation, we design a Latent Reward Model (LatentRM) trained with step-wise contrastive objective to score and guide latent reasoning. Extensive experiments and visualization analyses show that both sampling strategies scale effectively with compute and exhibit distinct exploration dynamics, while LatentRM enables effective trajectory selection. Together, our explorations open a new direction for scalable inference in continuous spaces. Code and checkpoints released at https://github.com/ModalityDance/LatentTTS

[97] One Sentence, Two Embeddings: Contrastive Learning of Explicit and Implicit Semantic Representations

Kohei Oda, Po-Min Chuang, Kiyoaki Shirai, Natthawut Kertkeidkachorn

Main category: cs.CL

TL;DR: DualCSE is a sentence embedding method that assigns two vectors per sentence - one for explicit semantics and one for implicit semantics - to better capture both types of meaning in a shared space.

Details

Motivation: Current sentence embedding methods struggle to capture implicit semantics because they assign only a single vector per sentence, which limits their ability to represent both explicit and implicit meanings effectively.

Method: DualCSE assigns two embeddings to each sentence: one representing explicit semantics and another representing implicit semantics. These embeddings coexist in a shared space, allowing selection of appropriate semantics for different downstream tasks.

Result: Experimental results show that DualCSE can effectively encode both explicit and implicit meanings and improves performance on downstream tasks like information retrieval and text classification.

Conclusion: The dual embedding approach overcomes limitations of single-vector sentence embeddings by better capturing both explicit and implicit semantics, leading to improved performance on practical applications.

Abstract: Sentence embedding methods have made remarkable progress, yet they still struggle to capture the implicit semantics within sentences. This can be attributed to the inherent limitations of conventional sentence embedding methods that assign only a single vector per sentence. To overcome this limitation, we propose DualCSE, a sentence embedding method that assigns two embeddings to each sentence: one representing the explicit semantics and the other representing the implicit semantics. These embeddings coexist in the shared space, enabling the selection of the desired semantics for specific purposes such as information retrieval and text classification. Experimental results demonstrate that DualCSE can effectively encode both explicit and implicit meanings and improve the performance of the downstream task.

[98] Classifying and Addressing the Diversity of Errors in Retrieval-Augmented Generation Systems

Kin Kwan Leung, Mouloud Belbahri, Yi Sui, Alex Labach, Xueying Zhang, Stephen Anthony Rose, Jesse C. Cresswell

Main category: cs.CL

TL;DR: The paper presents a comprehensive taxonomy of error types in retrieval-augmented generation (RAG) systems, provides practical guidance for addressing them, and introduces an annotated dataset and auto-evaluation method for error tracking.

Details

Motivation: RAG systems are widely used for LLM-based QA but can produce erroneous outputs due to system complexity. Understanding the range of possible errors is crucial for robust deployment, yet there's a lack of systematic error classification and practical guidance.

Method: The authors develop a new taxonomy of error types in realistic RAG systems, curate a dataset of erroneous RAG responses annotated by error types, and propose an auto-evaluation method aligned with their taxonomy for practical error tracking.

Result: The paper delivers: 1) A comprehensive error taxonomy for RAG systems, 2) Practical advice for addressing each error type, 3) An annotated dataset of RAG errors, and 4) An auto-evaluation method for development-time error tracking.

Conclusion: The systematic classification of RAG errors, practical guidance, dataset, and evaluation method provide valuable tools for improving RAG system robustness and deployment reliability in real-world applications.

Abstract: Retrieval-augmented generation (RAG) is a prevalent approach for building LLM-based question-answering systems that can take advantage of external knowledge databases. Due to the complexity of real-world RAG systems, there are many potential causes for erroneous outputs. Understanding the range of errors that can occur in practice is crucial for robust deployment. We present a new taxonomy of the error types that can occur in realistic RAG systems, examples of each, and practical advice for addressing them. Additionally, we curate a dataset of erroneous RAG responses annotated by error types. We then propose an auto-evaluation method aligned with our taxonomy that can be used in practice to track and address errors during development. Code and data are available at https://github.com/layer6ai-labs/rag-error-classification.

[99] Investigating LLM Capabilities on Long Context Comprehension for Medical Question Answering

Feras AlMannaa, Talia Tseriotou, Jenny Chim, Maria Liakata

Main category: cs.CL

TL;DR: First study investigating LLM comprehension on long-context medical QA beyond multiple choice, examining model size effects, memorization issues, reasoning benefits, and RAG performance in medical contexts.

Details

Motivation: To investigate LLM comprehension capabilities over long-context clinically relevant medical QA beyond simple multiple-choice questions, addressing gaps in understanding how LLMs handle complex medical information in extended contexts.

Method: Comprehensive approach examining various settings: content inclusion of different sizes/relevance, diverse LLM models with varying capabilities, multiple datasets across task formulations, and evaluation of Retrieval Augmented Generation (RAG) in single vs multi-document QA.

Result: Revealed insights on model size effects and limitations, underlying memorization issues, benefits of reasoning models, challenges of leveraging full patient context, RAG’s effectiveness in certain cases but limitations in temporal reasoning, and identified metric challenges in evaluation.

Conclusion: First comprehensive study on LLM long-context medical QA shows both promise and limitations, with RAG offering benefits but still struggling with temporal reasoning, highlighting the need for improved evaluation metrics and model capabilities for clinical applications.

Abstract: This study is the first to investigate LLM comprehension capabilities over long-context (LC), clinically relevant medical Question Answering (QA) beyond MCQA. Our comprehensive approach considers a range of settings based on content inclusion of varying size and relevance, LLM models of different capabilities and a variety of datasets across task formulations. We reveal insights on model size effects and their limitations, underlying memorization issues and the benefits of reasoning models, while demonstrating the value and challenges of leveraging the full long patient’s context. Importantly, we examine the effect of Retrieval Augmented Generation (RAG) on medical LC comprehension, showcasing best settings in single versus multi-document QA datasets. We shed light into some of the evaluation aspects using a multi-faceted approach uncovering common metric challenges. Our quantitative analysis reveals challenging cases where RAG excels while still showing limitations in cases requiring temporal reasoning.

[100] CoSense-LLM: Semantics at the Edge with Cost- and Uncertainty-Aware Cloud-Edge Cooperation

Hasan Akgul, Mari Eplik, Javier Rojas, Aina Binti Abdullah, Pieter van der Merwe

Main category: cs.CL

TL;DR: CoSense-LLM is an edge-first framework that converts multimodal sensor data into semantic tokens for LLM coordination under strict latency, energy, bandwidth, and privacy constraints.

Details

Motivation: To enable large language model deployments in interference-prone environments (homes, offices, clinics) while maintaining privacy, predictable latency, and efficient resource usage by processing sensor data at the edge.

Method: Four-component system: 1) SenseFusion (lightweight encoder aligning sensor embeddings with language), 2) Edge-RAG (local hybrid retrieval for site-specific grounding), 3) PromptRouter (cost/uncertainty-aware policy for edge/cloud decisions), 4) Secure Execution (auditable redaction ensuring raw data stays on device).

Result: Achieves sub-second (p95) latency on edge paths, reduces inter-tier token/bandwidth costs via local retrieval, preserves privacy by transmitting only discrete codes/redacted metadata, improves factual consistency, enables selective abstention, and lowers energy per decision.

Conclusion: Edge-first design successfully treats semantics, privacy, and predictable latency as co-equal goals for LLM deployments in challenging environments, demonstrating viability of edge-dominant processing with controlled cloud escalation.

Abstract: We present CoSense-LLM, an edge-first framework that turns continuous multimodal sensor streams (for example Wi-Fi CSI, IMU, audio, RFID, and lightweight vision) into compact, verifiable semantic tokens and coordinates with large language models under explicit latency, energy, bandwidth, and privacy constraints. CoSense-LLM has four parts: (i) SenseFusion, a lightweight encoder that aligns sensor embeddings with language and compresses them into short discrete code sequences; (ii) Edge-RAG, a local hybrid retrieval layer that grounds generation in site specific policies and notes; (iii) PromptRouter, a cost and uncertainty aware policy that selects edge only generation, edge plus retrieval, or compact cloud escalation; and (iv) Secure Execution, an auditable redaction path that enforces data minimization so raw waveforms never leave the device. The system works with modern serving optimizations, including paged or streaming KV caches, FlashAttention style kernels, speculative decoding, and quantized LoRA adapters, and supports on device personalization and federated updates under non IID drift. Across home, office, and clinic deployments, CoSense-LLM delivers grounded explanations while meeting tight service level objectives: it sustains sub second (p95) end to end latency on edge dominant paths, reduces inter tier token and bandwidth costs by preferring local retrieval grounded responses, and preserves privacy by transmitting only discrete codes and redacted metadata. Ablations show that Edge-RAG improves factual consistency and reduces contradictions, calibrated uncertainty enables selective abstention and controlled escalations, and KV plus decoding accelerators lower energy per decision. The results support an edge first design that treats semantics, privacy, and predictable latency as co equal goals for large model deployments in interference prone environments.

[101] User Perceptions vs. Proxy LLM Judges: Privacy and Helpfulness in LLM Responses to Privacy-Sensitive Scenarios

Xiaoyuan Wu, Roshni Kaushik, Wenkai Li, Lujo Bauer, Koichi Onoue

Main category: cs.CL

TL;DR: LLMs struggle with privacy-sensitive scenarios, and proxy LLMs used for evaluation don’t align with actual user perceptions of helpfulness and privacy preservation.

Details

Motivation: As LLMs are increasingly used for tasks involving private information (emails, health questions, etc.), there's a need to evaluate their ability to handle privacy-sensitive scenarios. Previous benchmarks used proxy LLMs to judge responses, but these don't measure actual user perceptions.

Method: Conducted a user study (n=94) using 90 PrivacyLens scenarios to directly measure users’ perceptions of helpfulness and privacy-preservation quality of LLM responses, comparing these with evaluations from five proxy LLMs.

Result: Users showed low agreement when evaluating identical LLM responses, while proxy LLMs reached high agreement. However, each proxy LLM had low correlation with users’ evaluations, indicating they cannot accurately estimate users’ wide range of perceptions.

Conclusion: Proxy LLMs are poor proxies for user perceptions of utility and privacy in privacy-sensitive scenarios. More user-centered studies are needed to measure LLMs’ ability to help users while preserving privacy, and to improve alignment between LLMs and users in estimating perceived privacy and utility.

Abstract: Large language models (LLMs) are rapidly being adopted for tasks like drafting emails, summarizing meetings, and answering health questions. In these settings, users may need to share private information (e.g., contact details, health records). To evaluate LLMs’ ability to identify and redact such information, prior work introduced real-life, scenario-based benchmarks (e.g., ConfAIde, PrivacyLens) and found that LLMs can leak private information in complex scenarios. However, these evaluations relied on proxy LLMs to judge the helpfulness and privacy-preservation quality of LLM responses, rather than directly measuring users’ perceptions. To understand how users perceive the helpfulness and privacy-preservation quality of LLM responses to privacy-sensitive scenarios, we conducted a user study ($n=94$) using 90 PrivacyLens scenarios. We found that users had low agreement with each other when evaluating identical LLM responses. In contrast, five proxy LLMs reached high agreement, yet each proxy LLM had low correlation with users’ evaluations. These results indicate that proxy LLMs cannot accurately estimate users’ wide range of perceptions of utility and privacy in privacy-sensitive scenarios. We discuss the need for more user-centered studies to measure LLMs’ ability to help users while preserving privacy, and for improving alignment between LLMs and users in estimating perceived privacy and utility.

[102] Relative Scaling Laws for LLMs

William Held, David Hall, Percy Liang, Diyi Yang

Main category: cs.CL

TL;DR: Relative scaling laws track performance gaps between test distributions as models scale, revealing that scaling doesn’t equalize all disparities - some gaps narrow, others widen or shift depending on domain.

Details

Motivation: Traditional scaling laws focus on aggregate performance but obscure performance disparities across heterogeneous subpopulations. The authors want to understand how scaling affects performance gaps between different test distributions rather than just absolute improvement.

Method: Introduced relative scaling laws to track performance gaps between test distributions. Trained 255 decoder-only Transformers under matched-compute (IsoFLOP) budgets from 10^18 to 10^20 FLOPs on standard pretraining datasets, then analyzed how gaps evolve across different domains.

Result: Found diverse trajectories: academic domains on MMLU converge toward parity; regional English dialects shift depending on population size; AI risk behaviors split - capability- and influence-related risks increase during pretraining while adversarial risks don’t. Scaling improves overall performance but isn’t a universal equalizer.

Conclusion: Scaling doesn’t uniformly reduce performance disparities - some gaps narrow, others widen or shift. The authors release all model checkpoints to enable practitioners to measure relative scaling laws alongside traditional ones, helping prioritize robustness challenges.

Abstract: Scaling laws describe how language models improve with additional data, parameters, and compute. While widely used, they are typically measured on aggregate test sets. Aggregate evaluations yield clean trends but average over heterogeneous subpopulations, obscuring performance disparities. We introduce relative scaling laws, which track how performance gaps between test distributions evolve with scale rather than focusing solely on absolute error. Using 255 decoder-only Transformers trained under matched-compute (IsoFLOP) budgets from $10^{18}$–$10^{20}$ FLOPs on standard pretraining datasets, we find diverse trajectories: academic domains on MMLU converge toward parity; regional English dialects shift depending on population size; and clusters of AI risk behaviours split, with capability- and influence-related risks increasing during pretraining while adversarial risks do not. These results show that although scaling improves overall performance, it is not a universal equalizer. To support further study, we release all model checkpoints from this work to enable practitioners to measure relative alongside traditional scaling laws, in order to better prioritize robustness challenges in light of the bitter lesson.

[103] Are Language Models Efficient Reasoners? A Perspective from Logic Programming

Andreas Opedal, Yanick Zengaffinen, Haruki Shirakami, Clemente Pasti, Mrinmaya Sachan, Abulhair Saparov, Ryan Cotterell, Bernhard Schölkopf

Main category: cs.CL

TL;DR: The paper proposes a framework to evaluate language model reasoning efficiency using logic programming, measuring how well models avoid unnecessary inferences when solving problems with irrelevant information.

Details

Motivation: Current LM evaluations focus only on correctness while ignoring efficiency. Real-world reasoning often involves irrelevant information, and effective deductive inference requires identifying and ignoring such distractions.

Method: Propose a framework using logic programming to align natural language proofs generated by LMs with shortest proofs found by executing logic programs. Construct a dataset of math word problems injected with varying numbers of irrelevant axioms that have semantic overlap with goal theorems.

Result: Current LMs show marked accuracy declines even with minimal, domain-consistent distractions, and the proofs they generate frequently exhibit detours through irrelevant inferences.

Conclusion: Reasoning efficiency is a crucial but overlooked aspect of LM evaluation. The proposed framework reveals significant inefficiencies in current models when dealing with irrelevant information, highlighting the need for better reasoning efficiency in language models.

Abstract: Modern language models (LMs) exhibit strong deductive reasoning capabilities, yet standard evaluations emphasize correctness while overlooking a key aspect of reasoning: efficiency. In real-world reasoning scenarios, much of the available information is irrelevant, and effective deductive inference requires identifying and ignoring such distractions. We propose a framework for assessing LM reasoning efficiency through the lens of logic programming, introducing a simple method to align proofs written in natural language – as generated by an LM – with shortest proofs found by executing the logic program. Efficiency is quantified by measuring how well a model avoids unnecessary inference. Empirically, we construct a dataset of math word problems injected with various number of irrelevant axioms that vary in semantic overlap with the goal theorem. We find that current LMs show marked accuracy declines under such conditions – even with minimal, domain-consistent distractions – and the proofs they generate frequently exhibit detours through irrelevant inferences.

[104] PlotCraft: Pushing the Limits of LLMs for Complex and Interactive Data Visualization

Jiajun Zhang, Jianke Zhang, Zeyu Cui, Jiaxi Yang, Lei Zhang, Binyuan Hui, Qiang Liu, Zilei Wang, Liang Wang, Junyang Lin

Main category: cs.CL

TL;DR: PlotCraft is a new benchmark for evaluating LLMs on complex visualization tasks, revealing performance gaps. The authors develop SynthVis-30K dataset and PlotCraftor model to address these deficiencies.

Details

Motivation: Current LLMs show strong code generation abilities but their capacity for creating complex visualizations for structured data remains largely unevaluated and underdeveloped. There's a need to systematically assess and improve LLMs' visualization capabilities.

Method: 1) Created PlotCraft benchmark with 1k challenging visualization tasks across 7 high-level tasks and 48 chart types. 2) Developed SynthVis-30K dataset using collaborative agent framework. 3) Built PlotCraftor model trained on this dataset for complex visualization generation.

Result: Evaluation of 23 leading LLMs on PlotCraft revealed significant performance deficiencies in sophisticated visualization tasks. PlotCraftor achieved performance comparable to leading proprietary approaches across multiple benchmarks, with over 50% improvement on hard tasks.

Conclusion: The work addresses a critical gap in LLM evaluation for visualization tasks, provides a comprehensive benchmark, and demonstrates that specialized models like PlotCraftor can achieve strong visualization capabilities even with small model sizes.

Abstract: Recent Large Language Models (LLMs) have demonstrated remarkable proficiency in code generation. However, their ability to create complex visualizations for scaled and structured data remains largely unevaluated and underdeveloped. To address this gap, we introduce PlotCraft, a new benchmark featuring 1k challenging visualization tasks that cover a wide range of topics, such as finance, scientific research, and sociology. The benchmark is structured around seven high-level visualization tasks and encompasses 48 distinct chart types. Crucially, it is the first to systematically evaluate both single-turn generation and multi-turn refinement across a diverse spectrum of task complexities. Our comprehensive evaluation of 23 leading LLMs on PlotCraft reveals obvious performance deficiencies in handling sophisticated visualization tasks. To bridge this performance gap, we develope SynthVis-30K, a large-scale, high-quality dataset of complex visualization code synthesized via a collaborative agent framework. Building upon this dataset, we develope PlotCraftor, a novel code generation model that achieves strong capabilities in complex data visualization with a remarkably small size. Across VisEval, PandasPlotBench, and our proposed PlotCraft, PlotCraftor shows performance comparable to that of leading proprietary approaches. Especially, on hard task, Our model achieves over 50% performance improvement. We will release the benchmark, dataset, and code at https://github.com/Speakn0w/PlotCraft-Benchmark.

[105] Multi-Personality Generation of LLMs at Decoding-time

Rongxin Chen, Yunfan Li, Yige Yuan, Bingbing Xu, Huawei Shen

Main category: cs.CL

TL;DR: MPG is a decoding-time framework that enables LLMs to generate multi-personality responses without retraining, using implicit density ratios from single-dimensional models and speculative chunk-level rejection sampling.

Details

Motivation: Existing methods for multi-personality generation are either costly (retraining-based) or limited (decoding-time methods relying on external models/heuristics), lacking flexibility and scalability.

Method: Proposes MPG framework that reformulates multi-personality generation as sampling from aggregated target strategies using implicit density ratios from single-dimensional models. Implements Speculative Chunk-level Rejection sampling (SCR) for efficient parallel validation with sliding window thresholds.

Result: Experiments on MBTI personality and Role-Playing show effectiveness with improvements up to 16%-18% over existing methods.

Conclusion: MPG provides a flexible, efficient decoding-time solution for multi-personality generation without retraining or external models, achieving significant performance gains.

Abstract: Multi-personality generation for LLMs, enabling simultaneous embodiment of multiple personalization attributes, is a fundamental challenge. Existing retraining-based approaches are costly and poorly scalable, while decoding-time methods often rely on external models or heuristics, limiting flexibility and robustness. In this paper, we propose a novel Multi-Personality Generation (MPG) framework under the decoding-time combination paradigm. It flexibly controls multi-personality without relying on scarce multi-dimensional models or extra training, leveraging implicit density ratios in single-dimensional models as a “free lunch” to reformulate the task as sampling from a target strategy aggregating these ratios. To implement MPG efficiently, we design Speculative Chunk-level based Rejection sampling (SCR), which generates responses in chunks and parallelly validates them via estimated thresholds within a sliding window. This significantly reduces computational overhead while maintaining high-quality generation. Experiments on MBTI personality and Role-Playing demonstrate the effectiveness of MPG, showing improvements up to 16%-18%. Code and data are available at https://github.com/Libra117/MPG .

[106] GraphIF: Enhancing Multi-Turn Instruction Following for Large Language Models with Relation Graph Prompt

Zhenhe Li, Can Lin, Ling Zheng, Wen-Da Wei, Junli Liang, Qi Song

Main category: cs.CL

TL;DR: GraphIF: A plug-and-play framework that models multi-turn dialogues as directed relation graphs and uses graph prompts to enhance LLMs’ instruction following across dialogue turns.

Details

Motivation: Existing approaches treat multi-turn dialogue responses as isolated tasks and fail to explicitly incorporate multi-turn instruction following into optimization objectives, causing LLMs to struggle with complex long-distance constraints across dialogue turns.

Method: Three components: (1) agent-based relation extraction module that captures inter-turn semantic relations via action-triggered mechanisms to construct structured graphs; (2) relation graph prompt generation module that converts graph information into natural language prompts; (3) response rewriting module that refines initial LLM outputs using generated graph prompts.

Result: Extensive experiments on two long multi-turn dialogue datasets show GraphIF can be seamlessly integrated into instruction-tuned LLMs and leads to significant improvements across all four multi-turn instruction-following evaluation metrics.

Conclusion: GraphIF successfully bridges the gap in leveraging graph structures to enhance multi-turn instruction following capabilities of LLMs, providing a novel approach that explicitly models relational constraints across dialogue turns.

Abstract: Multi-turn instruction following is essential for building intelligent conversational systems that can consistently adhere to instructions across dialogue turns. However, existing approaches to enhancing multi-turn instruction following primarily rely on collecting or generating large-scale multi-turn dialogue datasets to fine-tune large language models (LLMs), which treat each response generation as an isolated task and fail to explicitly incorporate multi-turn instruction following into the optimization objectives. As a result, instruction-tuned LLMs often struggle with complex long-distance constraints. In multi-turn dialogues, relational constraints across turns can be naturally modeled as labeled directed edges, making graph structures particularly suitable for modeling multi-turn instruction following. Despite this potential, leveraging graph structures to enhance the multi-turn instruction following capabilities of LLMs remains unexplored. To bridge this gap, we propose GraphIF, a plug-and-play framework that models multi-turn dialogues as directed relation graphs and leverages graph prompts to enhance the instruction following capabilities of LLMs. GraphIF comprises three key components: (1) an agent-based relation extraction module that captures inter-turn semantic relations via action-triggered mechanisms to construct structured graphs; (2) a relation graph prompt generation module that converts structured graph information into natural language prompts; and (3) a response rewriting module that refines initial LLM outputs using the generated graph prompts. Extensive experiments on two long multi-turn dialogue datasets demonstrate that GraphIF can be seamlessly integrated into instruction-tuned LLMs and leads to significant improvements across all four multi-turn instruction-following evaluation metrics.

[107] Mistake Notebook Learning: Batch-Clustered Failures for Training-Free Agent Adaptation

Xuanbo Su, Yingfang Zhang, Hao Luo, Xiaoteng Liu, Leo Huang

Main category: cs.CL

TL;DR: Mistake Notebook Learning (MNL) enables LLM agents to learn from failures by clustering mistakes into structured guidance, avoiding repeated errors without parameter updates.

Details

Motivation: LLM agents in persistent real-world roles encounter continuous tasks and inevitable failures, but current methods lack systematic learning from mistakes, causing repeated identical errors in similar contexts.

Method: MNL is a memory framework that enables agents to self-curate generalizable guidance from batch-clustered failures, distilling shared error patterns into structured “mistake notes.” It updates external memory only when batch performance improves for stability, and integrates with test-time scaling to steer search away from known pitfalls.

Result: Experiments on mathematical reasoning, Text-to-SQL, and interactive agent benchmarks show MNL achieves competitive performance compared to existing memory mechanisms and in-context methods in both effectiveness and efficiency.

Conclusion: Structured mistake abstraction is a critical lever for robust agent evolution, enabling continuous improvement without the cost of parameter updates.

Abstract: With the growing adoption of Large Language Model (LLM) agents in persistent, real-world roles, they naturally encounter continuous streams of tasks and inevitable failures. A key limitation, however, is their inability to systematically learn from these mistakes, forcing them to repeat identical errors in similar contexts. Unlike prior training-free methods that primarily store raw instance-level experience or focus on retrieving successful trajectories, we propose Mistake Notebook Learning (MNL), a novel memory framework that enables agents to self-curate generalizable guidance from batch-clustered failures. This mechanism allows agents to distill shared error patterns into structured ``mistake notes,’’ updating an external memory only when batch performance improves to ensure stability. To further amplify adaptability, we integrate MNL with test-time scaling, leveraging aggregated failure patterns to actively steer the search process away from known pitfalls. Experiments on mathematical reasoning, Text-to-SQL, and interactive agent benchmarks show that MNL achieves competitive performance compared to existing memory mechanisms and in-context methods in both effectiveness and efficiency. These findings position structured mistake abstraction as a critical lever for robust agent evolution, enabling continuous improvement without the cost of parameter updates.

[108] Rakuten Data Release: A Large-Scale and Long-Term Reviews Corpus for Hotel Domain

Yuki Nakayama, Koki Hikichi, Yun Ching Liu, Yu Hirate

Main category: cs.CL

TL;DR: Large-scale corpus of 7.29 million Rakuten Travel reviews from 2009-2024 with rich metadata and aspect ratings, analyzed for statistical patterns and data drift.

Details

Motivation: To create a comprehensive, large-scale dataset of travel reviews spanning 16 years that can support research in natural language processing, recommendation systems, and understanding customer behavior in the travel industry over time.

Method: Collection of 7.29 million customer reviews from Rakuten Travel platform (2009-2024), including rich metadata (review text, responses, user IDs, accommodation details, purpose, group composition) and six aspect ratings plus overall scores. Statistical analysis of corpus characteristics and data drift patterns.

Result: Created a massive corpus of 7.29 million travel reviews with comprehensive metadata spanning 16 years. Provided statistical insights into the dataset and identified factors driving data drift between 2019-2024 using statistical approaches.

Conclusion: The Rakuten Travel Reviews corpus represents a valuable resource for research in NLP, recommendation systems, and longitudinal analysis of customer behavior in the travel industry, with insights into temporal data drift patterns.

Abstract: This paper presents a large-scale corpus of Rakuten Travel Reviews. Our collection contains 7.29 million customer reviews for 16 years, ranging from 2009 to 2024. Each record in the dataset contains the review text, its response from an accommodation, an anonymized reviewer ID, review date, accommodation ID, plan ID, plan title, room type, room name, purpose, accompanying group, and user ratings from six aspect categories, as well as an overall score. We present statistical information about our corpus and provide insights into factors driving data drift between 2019 and 2024 using statistical approaches.

[109] AWPO: Enhancing Tool-Use of Large Language Models through Adaptive Integration of Reasoning Rewards

Zihan Lin, Xiaohan Wang, Hexiong Yang, Jiajun Chai, Jie Cao, Guojun Yin, Wei Lin, Ran He

Main category: cs.CL

TL;DR: AWPO is a new RL framework that adaptively integrates reasoning rewards with outcome rewards to improve tool-use LLM performance through advantage-weighted policy optimization.

Details

Motivation: Existing RL methods for training tool-use LLMs focus on verifiable outcome rewards but overlook reasoning rewards based on chain-of-thought quality. Simply combining reasoning and outcome rewards can lead to suboptimal performance or conflict with primary optimization objectives.

Method: AWPO (Advantage-Weighted Policy Optimization) adaptively integrates reasoning rewards into advantage estimation using variance-aware gating and difficulty-aware weighting based on group-relative statistics, plus a tailored clipping mechanism for stable optimization.

Result: AWPO achieves state-of-the-art performance across standard tool-use benchmarks, outperforming strong baselines and closed-source models in multi-turn scenarios. A 4B parameter model surpasses Grok-4 by 16.0% in multi-turn accuracy while maintaining generalization on out-of-distribution MMLU-Pro.

Conclusion: AWPO provides a principled RL framework that effectively leverages reasoning rewards to enhance tool-use LLM performance with exceptional parameter efficiency and generalization capability.

Abstract: While Reinforcement Learning (RL) shows promise in training tool-use Large Language Models (LLMs) using verifiable outcome rewards, existing methods largely overlook the potential of reasoning rewards based on chain-of-thought quality for better tool utilization. Furthermore, naïvely combining reasoning and outcome rewards may yield suboptimal performance or conflict with the primary optimization objective. To address this, we propose Advantage-Weighted Policy Optimization (AWPO), a principled RL framework that adaptively integrates reasoning rewards into advantage estimation to improve tool-use performance. AWPO incorporates variance-aware gating and difficulty-aware weighting to adaptively modulate advantages from reasoning signals based on group-relative statistics, alongside a tailored clipping mechanism for stable optimization. Extensive experiments demonstrate that AWPO achieves state-of-the-art performance across standard tool-use benchmarks, significantly outperforming strong baselines and leading closed-source models in challenging multi-turn scenarios. Notably, with exceptional parameter efficiency, our 4B model surpasses Grok-4 by $16.0%$ in multi-turn accuracy while preserving generalization capability on the out-of-distribution MMLU-Pro benchmark.

[110] Five Years of SciCap: What We Learned and Future Directions for Scientific Figure Captioning

Ting-Hao ‘Kenneth’ Huang, Ryan A. Rossi, Sungchul Kim, Tong Yu, Ting-Yao E. Hsu, Ho Yin, Ng, C. Lee Giles

Main category: cs.CL

TL;DR: The SciCap project evolved from a seed-funded idea into a major scientific figure-captioning initiative, creating datasets, conducting evaluations, adapting to LLMs, launching challenges, and building interactive captioning tools over 5 years.

Details

Motivation: To test whether domain-specific training (successful in text models like SciBERT) could work for figure captioning, and to improve scientific figure captioning through systematic research and tool development.

Method: Curated and released large collections of figure-caption pairs from arXiv papers, conducted extensive automatic and human evaluations, adapted to large language models (LLMs), launched annual challenges, and built interactive captioning systems.

Result: SciCap grew into a central effort shaping the scientific figure-captioning landscape, with multi-institution collaboration, updated datasets, evaluation frameworks, and practical tools for scientists.

Conclusion: The paper summarizes key technical/methodological lessons from 5 years of SciCap research and outlines five major unsolved challenges and future directions for scientific figure captioning.

Abstract: Between 2021 and 2025, the SciCap project grew from a small seed-funded idea at The Pennsylvania State University (Penn State) into one of the central efforts shaping the scientific figure-captioning landscape. Supported by a Penn State seed grant, Adobe, and the Alfred P. Sloan Foundation, what began as our attempt to test whether domain-specific training, which was successful in text models like SciBERT, could also work for figure captions expanded into a multi-institution collaboration. Over these five years, we curated, released, and continually updated a large collection of figure-caption pairs from arXiv papers, conducted extensive automatic and human evaluations on both generated and author-written captions, navigated the rapid rise of large language models (LLMs), launched annual challenges, and built interactive systems that help scientists write better captions. In this piece, we look back at the first five years of SciCap and summarize the key technical and methodological lessons we learned. We then outline five major unsolved challenges and propose directions for the next phase of research in scientific figure captioning.

[111] Beg to Differ: Understanding Reasoning-Answer Misalignment Across Languages

Anaelia Ovalle, Candace Ross, Sebastian Ruder, Adina Williams, Karen Ullrich, Mark Ibrahim, Levent Sagun

Main category: cs.CL

TL;DR: Multilingual LLMs show high task accuracy but often have reasoning that doesn’t logically support their conclusions, especially in non-Latin scripts where misalignment is 2x worse.

Details

Motivation: While LLMs demonstrate strong reasoning via chain-of-thought prompting, it's unclear whether this reasoning quality transfers across different languages, creating a blind spot in current multilingual evaluation practices.

Method: Created a human-validated framework to evaluate if reasoning traces logically support conclusions across languages. Analyzed 65k reasoning traces from GlobalMMLU questions across 6 languages and 6 frontier models. Developed error taxonomy through human annotation.

Result: Found critical blind spot: models achieve high task accuracy but reasoning often fails to support conclusions. Reasoning traces in non-Latin scripts show at least 2x more misalignment between reasoning and conclusions than Latin scripts. Primary failures are evidential errors (unsupported claims, ambiguous facts) followed by illogical reasoning steps.

Conclusion: Current multilingual evaluation practices provide incomplete picture of model reasoning capabilities. Need reasoning-aware evaluation frameworks to properly assess reasoning quality across languages.

Abstract: Large language models demonstrate strong reasoning capabilities through chain-of-thought prompting, but whether this reasoning quality transfers across languages remains underexplored. We introduce a human-validated framework to evaluate whether model-generated reasoning traces logically support their conclusions across languages. Analyzing 65k reasoning traces from GlobalMMLU questions across 6 languages and 6 frontier models, we uncover a critical blind spot: while models achieve high task accuracy, their reasoning can fail to support their conclusions. Reasoning traces in non-Latin scripts show at least twice as much misalignment between their reasoning and conclusions than those in Latin scripts. We develop an error taxonomy through human annotation to characterize these failures, finding they stem primarily from evidential errors (unsupported claims, ambiguous facts) followed by illogical reasoning steps. Our findings demonstrate that current multilingual evaluation practices provide an incomplete picture of model reasoning capabilities and highlight the need for reasoning-aware evaluation frameworks.

[112] Lil: Less is Less When Applying Post-Training Sparse-Attention Algorithms in Long-Decode Stage

Junhao Hu, Fangze Li, Mingtao Xu, Feifan Meng, Shiju Zhao, Tiancheng Hu, Ting Peng, Anmin Liu, Wenrui Huang, Chenxu Liu, Ziyue Hua, Tao Xie

Main category: cs.CL

TL;DR: Sparse attention in LLM decoding can paradoxically increase end-to-end complexity by causing longer sequences (“Less is Less” phenomenon), and an early-stopping algorithm mitigates this by detecting when information loss exceeds gain.

Details

Motivation: LLM inference efficiency is crucial for large-scale deployment, with decode stage dominating latency. While sparse-attention algorithms aim to reduce decode complexity, they can paradoxically increase end-to-end complexity due to information loss causing longer sequences.

Method: Proposes an early-stopping algorithm that detects the threshold where information loss exceeds information gain during sparse decoding, preventing unnecessary token generation when sparse attention becomes counterproductive.

Result: The early-stopping algorithm reduces token consumption by up to 90% with marginal accuracy degradation (<2%) across reasoning-intensive benchmarks, effectively mitigating the “Less is Less” problem.

Conclusion: Sparse attention can paradoxically increase end-to-end complexity through the “Less is Less” phenomenon, but this can be effectively mitigated with an early-stopping algorithm that balances efficiency and accuracy.

Abstract: Large language models (LLMs) demonstrate strong capabilities across a wide range of complex tasks and are increasingly deployed at scale, placing significant demands on inference efficiency. Prior work typically decomposes inference into prefill and decode stages, with the decode stage dominating total latency. To reduce time and memory complexity in the decode stage, a line of work introduces sparse-attention algorithms. In this paper, we show, both empirically and theoretically, that sparse attention can paradoxically increase end-to-end complexity: information loss often induces significantly longer sequences, a phenomenon we term ``Less is Less’’ (Lil). To mitigate the Lil problem, we propose an early-stopping algorithm that detects the threshold where information loss exceeds information gain during sparse decoding. Our early-stopping algorithm reduces token consumption by up to 90% with a marginal accuracy degradation of less than 2% across reasoning-intensive benchmarks.

[113] Disco-RAG: Discourse-Aware Retrieval-Augmented Generation

Dongqi Liu, Hang Ding, Qiming Feng, Jian Li, Xurong Xie, Zhucun Xue, Chengjie Wang, Jiangning Zhang, Yabiao Wang

Main category: cs.CL

TL;DR: Disco-RAG: A discourse-aware RAG framework that uses discourse trees and rhetorical graphs to inject structural cues into generation, achieving SOTA results on QA and summarization without fine-tuning.

Details

Motivation: Existing RAG strategies treat retrieved passages in a flat, unstructured way, which prevents models from capturing structural cues and limits their ability to synthesize knowledge from dispersed evidence across documents.

Method: Constructs intra-chunk discourse trees to capture local hierarchies and builds inter-chunk rhetorical graphs to model cross-passage coherence. These structures are jointly integrated into a planning blueprint that conditions the generation process.

Result: Achieves state-of-the-art results on question answering and long-document summarization benchmarks without requiring fine-tuning.

Conclusion: Discourse structure plays an important role in advancing RAG systems, and explicitly injecting discourse signals into generation significantly enhances performance on knowledge-intensive tasks.

Abstract: Retrieval-Augmented Generation (RAG) has emerged as an important means of enhancing the performance of large language models (LLMs) in knowledge-intensive tasks. However, most existing RAG strategies treat retrieved passages in a flat and unstructured way, which prevents the model from capturing structural cues and constrains its ability to synthesize knowledge from dispersed evidence across documents. To overcome these limitations, we propose Disco-RAG, a discourse-aware framework that explicitly injects discourse signals into the generation process. Our method constructs intra-chunk discourse trees to capture local hierarchies and builds inter-chunk rhetorical graphs to model cross-passage coherence. These structures are jointly integrated into a planning blueprint that conditions the generation. Experiments on question answering and long-document summarization benchmarks show the efficacy of our approach. Disco-RAG achieves state-of-the-art results on the benchmarks without fine-tuning. These findings underscore the important role of discourse structure in advancing RAG systems.

[114] TeleMem: Building Long-Term and Multimodal Memory for Agentic AI

Chunliang Chen, Ming Guan, Xiao Lin, Jiaxu Li, Luxi Lin, Qiyi Wang, Xiangyu Chen, Jixiang Luo, Changzhi Sun, Dell Zhang, Xuelong Li

Main category: cs.CL

TL;DR: TeleMem is a unified long-term multimodal memory system for LLMs that improves dialogue coherence, reduces hallucinations, and enables efficient video understanding through narrative extraction, structured writing, and ReAct-style reasoning.

Details

Motivation: LLMs struggle with long-term interactions due to limited attention over extended dialogue histories. Existing RAG approaches lack reliable memory updating/refining mechanisms, leading to schema-driven hallucinations, inefficient write operations, and poor multimodal reasoning support.

Method: 1) Narrative dynamic extraction to maintain coherent user profiles with dialogue-grounded information only; 2) Structured writing pipeline that batches, retrieves, clusters, and consolidates memory entries; 3) Multimodal memory module with ReAct-style reasoning (observe-think-act closed-loop) for video understanding.

Result: Outperforms state-of-the-art Mem0 baseline with 19% higher accuracy, 43% fewer tokens, and 2.1x speedup on ZH-4O long-term role-play gaming benchmark.

Conclusion: TeleMem effectively addresses long-term memory limitations in LLMs through a unified multimodal memory system that improves accuracy, efficiency, and speed while supporting complex video reasoning in extended interactions.

Abstract: Large language models (LLMs) excel at many NLP tasks but struggle to sustain long-term interactions due to limited attention over extended dialogue histories. Retrieval-augmented generation (RAG) mitigates this issue but lacks reliable mechanisms for updating or refining stored memories, leading to schema-driven hallucinations, inefficient write operations, and minimal support for multimodal reasoning.To address these challenges, we propose TeleMem, a unified long-term and multimodal memory system that maintains coherent user profiles through narrative dynamic extraction, ensuring that only dialogue-grounded information is preserved. TeleMem further introduces a structured writing pipeline that batches, retrieves, clusters, and consolidates memory entries, substantially improving storage efficiency, reducing token usage, and accelerating memory operations. Additionally, a multimodal memory module combined with ReAct-style reasoning equips the system with a closed-loop observe, think, and act process that enables accurate understanding of complex video content in long-term contexts. Experimental results show that TeleMem surpasses the state-of-the-art Mem0 baseline with 19% higher accuracy, 43% fewer tokens, and a 2.1x speedup on the ZH-4O long-term role-play gaming benchmark.

[115] Controlled Self-Evolution for Algorithmic Code Optimization

Tu Hu, Ronghao Chen, Shuo Zhang, Jianghao Yin, Mou Xiao Feng, Jingping Liu, Shaolei Zhang, Wenqi Jiang, Yuqi Fang, Sen Hu, Huacan Wang, Yi Xu

Main category: cs.CL

TL;DR: CSE improves code generation efficiency by addressing exploration limitations in self-evolution methods through diversified initialization, feedback-guided genetic operations, and hierarchical memory.

Details

Motivation: Existing self-evolution methods for code generation suffer from low exploration efficiency due to initialization bias trapping evolution in poor solution regions, uncontrolled stochastic operations lacking feedback guidance, and insufficient experience utilization across tasks.

Method: Controlled Self-Evolution (CSE) with three components: 1) Diversified Planning Initialization for broad solution space coverage, 2) Genetic Evolution with feedback-guided mutation and compositional crossover, 3) Hierarchical Evolution Memory capturing successful and failed experiences at inter-task and intra-task levels.

Result: CSE consistently outperforms all baselines across various LLM backbones on EffiBench-X, achieves higher efficiency from early generations, and maintains continuous improvement throughout evolution.

Conclusion: CSE effectively addresses exploration inefficiency in self-evolution methods through controlled mechanisms, demonstrating superior performance and efficiency in code generation tasks.

Abstract: Self-evolution methods enhance code generation through iterative “generate-verify-refine” cycles, yet existing approaches suffer from low exploration efficiency, failing to discover solutions with superior complexity within limited budgets. This inefficiency stems from initialization bias trapping evolution in poor solution regions, uncontrolled stochastic operations lacking feedback guidance, and insufficient experience utilization across tasks. To address these bottlenecks, we propose Controlled Self-Evolution (CSE), which consists of three key components. Diversified Planning Initialization generates structurally distinct algorithmic strategies for broad solution space coverage. Genetic Evolution replaces stochastic operations with feedback-guided mechanisms, enabling targeted mutation and compositional crossover. Hierarchical Evolution Memory captures both successful and failed experiences at inter-task and intra-task levels. Experiments on EffiBench-X demonstrate that CSE consistently outperforms all baselines across various LLM backbones. Furthermore, CSE achieves higher efficiency from early generations and maintains continuous improvement throughout evolution. Our code is publicly available at https://github.com/QuantaAlpha/EvoControl.

[116] Scalable and Reliable Evaluation of AI Knowledge Retrieval Systems: RIKER and the Coherent Simulated Universe

JV Roig

Main category: cs.CL

TL;DR: RIKER is a novel evaluation framework for knowledge systems that uses paradigm inversion - generating documents from ground truth rather than extracting truth from documents - enabling scalable, contamination-resistant evaluation without human annotation.

Details

Motivation: Current evaluation methods for knowledge systems (LLMs, RAG, knowledge graphs) face three key challenges: static benchmarks are vulnerable to contamination, LLM-based judges have systematic biases, and ground truth extraction requires expensive human annotation.

Method: RIKER uses paradigm inversion: generating synthetic documents from known structured ground truth, then evaluating systems on their ability to extract that ground truth back from the documents. This enables deterministic scoring, scalable evaluation without human annotation or reference models, and contamination resistance through regenerable corpora.

Result: Evaluation of 33 models using over 21 billion tokens revealed: 1) context length claims often exceed usable capacity with significant degradation beyond 32K tokens, 2) cross-document aggregation is substantially harder than single-document extraction, and 3) grounding ability and hallucination resistance are distinct capabilities - models good at finding existing facts may still fabricate non-existent facts.

Conclusion: RIKER provides both a specific benchmark and a domain-agnostic methodology for constructing scalable, contamination-resistant evaluations wherever synthetic documents can be generated from structured ground truth, addressing fundamental limitations in current knowledge system evaluation approaches.

Abstract: Evaluating knowledge systems (LLMs, RAG, knowledge graphs, etc) faces fundamental challenges: static benchmarks are vulnerable to contamination, LLM-based judges exhibit systematic biases, and ground truth extraction requires expensive human annotation. We present RIKER (Retrieval Intelligence and Knowledge Extraction Rating), both a benchmark and a replicable methodology based on paradigm inversion - generating documents from known ground truth rather than extracting ground truth from documents. This approach enables deterministic scoring and scalable evaluation without human annotation or reference models, and contamination resistance through regenerable corpora. Our evaluation of 33 models using over 21 billion tokens reveals that context length claims frequently exceed usable capacity, with significant degradation beyond 32K tokens; cross-document aggregation proves substantially harder than single-document extraction; and grounding ability and hallucination resistance are distinct capabilities - models excelling at finding facts that exist may still fabricate facts that do not. Beyond the specific benchmark, we contribute a domain-agnostic methodology for constructing scalable and contamination-resistant evaluations wherever synthetic documents can be generated from structured ground truth.

[117] TranslateGemma Technical Report

Mara Finkelstein, Isaac Caswell, Tobias Domhan, Jan-Thorsten Peter, Juraj Juraska, Parker Riley, Daniel Deutsch, Cole Dilanni, Colin Cherry, Eleftheria Briakou, Elizabeth Nielsen, Jiaming Luo, Kat Black, Ryan Mullins, Sweta Agrawal, Wenda Xu, Erin Kats, Stephane Jaskiewicz, Markus Freitag, David Vilar

Main category: cs.CL

TL;DR: TranslateGemma is an open machine translation model suite built on Gemma 3 foundation models, enhanced through two-stage fine-tuning with synthetic/human data and reinforcement learning, achieving strong translation performance across many language pairs while maintaining multimodal capabilities.

Details

Motivation: To enhance the inherent multilingual capabilities of Gemma 3 foundation models specifically for machine translation tasks, creating powerful open models for the research community.

Method: Two-stage fine-tuning: 1) Supervised fine-tuning using mixture of high-quality synthetic parallel data (generated via state-of-the-art models) and human-translated parallel data; 2) Reinforcement learning phase optimizing translation quality using ensemble reward models (MetricX-QE and AutoMQM).

Result: Demonstrated effectiveness through human evaluation on WMT25 test set (10 language pairs) and automatic evaluation on WMT24++ benchmark (55 language pairs). Showed consistent substantial gains over baseline Gemma 3 models across all sizes, with smaller TranslateGemma models often matching larger baseline models. Models also retain strong multimodal capabilities with enhanced performance on Vistra image translation benchmark.

Conclusion: TranslateGemma provides powerful and adaptable open machine translation tools for the research community, offering improved efficiency and performance while maintaining multimodal capabilities.

Abstract: We present TranslateGemma, a suite of open machine translation models based on the Gemma 3 foundation models. To enhance the inherent multilingual capabilities of Gemma 3 for the translation task, we employ a two-stage fine-tuning process. First, supervised fine-tuning is performed using a rich mixture of high-quality large-scale synthetic parallel data generated via state-of-the-art models and human-translated parallel data. This is followed by a reinforcement learning phase, where we optimize translation quality using an ensemble of reward models, including MetricX-QE and AutoMQM, targeting translation quality. We demonstrate the effectiveness of TranslateGemma with human evaluation on the WMT25 test set across 10 language pairs and with automatic evaluation on the WMT24++ benchmark across 55 language pairs. Automatic metrics show consistent and substantial gains over the baseline Gemma 3 models across all sizes. Notably, smaller TranslateGemma models often achieve performance comparable to larger baseline models, offering improved efficiency. We also show that TranslateGemma models retain strong multimodal capabilities, with enhanced performance on the Vistra image translation benchmark. The release of the open TranslateGemma models aims to provide the research community with powerful and adaptable tools for machine translation.

[118] How Many Human Judgments Are Enough? Feasibility Limits of Human Preference Evaluation

Wilson Y. Lee

Main category: cs.CL

TL;DR: Human preference evaluations often need far more judgments than typically collected to reliably detect small model improvements, as most comparisons show diffuse preference signals across prompts.

Details

Motivation: To understand how many human judgments are needed to reliably detect small improvements in generative models, since current evaluation practices may be underpowered and lead to inconclusive results.

Method: Analyzed large-scale human preference datasets across multiple modalities (chat, image generation, code generation) to examine preference signal distribution, compared allocation strategies, and evaluated how curated benchmarks reduce prompt-level variance.

Result: Most comparisons show diffuse preference signals with small margins requiring far more judgments than typically collected. Proportional allocation is minimax-optimal in diffuse regimes. Curated benchmarks reduce prompt-level variance by 1.5× and improve detectability.

Conclusion: Inconclusive human evaluation outcomes often reflect underpowered evaluations rather than model equivalence. Evaluation design must explicitly account for effect size, budget, and protocol design to reliably detect improvements.

Abstract: Human preference evaluations are widely used to compare generative models, yet it remains unclear how many judgments are required to reliably detect small improvements. We show that when preference signal is diffuse across prompts (i.e., all prompt types are similarly informative), proportional allocation is minimax-optimal: no allocation strategy substantially improves detectability. Empirical analysis of large-scale human preference datasets shows that most comparisons fall into this diffuse regime, exhibiting small preference margins that require far more judgments than typically collected, even in well-sampled comparisons. These limits persist across evaluation protocols and modalities, including chat, image generation, and code generation with execution feedback. In contrast, curated benchmarks that reduce prompt induced variability systematically induce larger margins and improve detectability through a $1.5\times$ reduction in prompt-level variance. Our results show that inconclusive or negative human evaluation outcomes frequently reflect underpowered evaluation rather than model equivalence, underscoring the need to account explicitly for effect size, budget, and protocol design.

[119] A.X K1 Technical Report

Sung Jun Cheon, Jaekyung Cho, Seongho Choi, Hyunjun Eun, Seokhwan Jo, Jaehyun Jun, Minsoo Kang, Jin Kim, Jiwon Kim, Minsang Kim, Sungwan Kim, Seungsik Kim, Tae Yoon Kim, Youngrang Kim, Hyeongmun Lee, Sangyeol Lee, Sungeun Lee, Youngsoon Lee, Yujin Lee, Seongmin Ok, Chanyong Park, Hyewoong Park, Junyoung Park, Hyunho Yang, Subin Yi, Soohyun Bae, Dhammiko Arya, Yongseok Choi, Sangho Choi, Dongyeon Cho, Seungmo Cho, Gyoungeun Han, Yong-jin Han, Seokyoung Hong, Hyeon Hwang, Wonbeom Jang, Minjeong Ju, Wonjin Jung, Keummin Ka, Sungil Kang, Dongnam Kim, Joonghoon Kim, Jonghwi Kim, SaeRom Kim, Sangjin Kim, Seongwon Kim, Youngjin Kim, Seojin Lee, Sunwoo Lee, Taehoon Lee, Chanwoo Park, Sohee Park, Sooyeon Park, Yohan Ra, Sereimony Sek, Seungyeon Seo, Gun Song, Sanghoon Woo, Janghan Yoon, Sungbin Yoon

Main category: cs.CL

TL;DR: A.X K1 is a 519B-parameter Mixture-of-Experts language model trained from scratch on 10T tokens, featuring controllable reasoning modes and strong Korean-language performance.

Details

Motivation: To bridge the gap between reasoning capability and inference efficiency, enabling scalable deployment across diverse real-world scenarios with controllable reasoning.

Method: Leverages scaling laws for training optimization, uses multi-stage data processing pipeline on 10T tokens, and implements Think-Fusion training recipe for user-controlled switching between thinking and non-thinking modes.

Result: A.X K1 achieves competitive performance with leading open-source models and establishes distinctive advantage in Korean-language benchmarks.

Conclusion: The model successfully demonstrates how to combine reasoning capability with inference efficiency through controllable reasoning modes, while showing strong multilingual performance particularly in Korean.

Abstract: We introduce A.X K1, a 519B-parameter Mixture-of-Experts (MoE) language model trained from scratch. Our design leverages scaling laws to optimize training configurations and vocabulary size under fixed computational budgets. A.X K1 is pre-trained on a corpus of approximately 10T tokens, curated by a multi-stage data processing pipeline. Designed to bridge the gap between reasoning capability and inference efficiency, A.X K1 supports explicitly controllable reasoning to facilitate scalable deployment across diverse real-world scenarios. We propose a simple yet effective Think-Fusion training recipe, enabling user-controlled switching between thinking and non-thinking modes within a single unified model. Extensive evaluations demonstrate that A.X K1 achieves performance competitive with leading open-source models, while establishing a distinctive advantage in Korean-language benchmarks.

[120] Bias Dynamics in BabyLMs: Towards a Compute-Efficient Sandbox for Democratising Pre-Training Debiasing

Filip Trhlik, Andrew Caines, Paula Buttery

Main category: cs.CL

TL;DR: BabyLMs (small BERT-like models) serve as low-cost proxies for studying bias formation and debiasing in large language models, reducing pre-training costs from 500+ to under 30 GPU-hours while maintaining similar bias patterns.

Details

Motivation: Large language models are expensive to train, making bias research difficult. Need affordable ways to study and mitigate biases before full-scale training.

Method: Use BabyLMs - compact BERT models trained on small, mutable corpora - as proxies to approximate bias acquisition and learning dynamics of larger models. Compare their bias patterns with standard BERT and test various debiasing methods.

Result: BabyLMs show closely aligned bias formation and performance patterns compared to standard BERT. Correlations hold across multiple debiasing methods. Pre-model debiasing experiments replicate prior findings and reveal new insights about gender imbalance and toxicity effects.

Conclusion: BabyLMs effectively serve as sandboxes for large-scale LM research, democratizing pre-model debiasing by reducing costs from 500+ to under 30 GPU-hours, enabling faster exploration of fairer LM development methods.

Abstract: Pre-trained language models (LMs) have, over the last few years, grown substantially in both societal adoption and training costs. This rapid growth in size has constrained progress in understanding and mitigating their biases. Since re-training LMs is prohibitively expensive, most debiasing work has focused on post-hoc or masking-based strategies, which often fail to address the underlying causes of bias. In this work, we seek to democratise pre-model debiasing research by using low-cost proxy models. Specifically, we investigate BabyLMs, compact BERT-like models trained on small and mutable corpora that can approximate bias acquisition and learning dynamics of larger models. We show that BabyLMs display closely aligned patterns of intrinsic bias formation and performance development compared to standard BERT models, despite their drastically reduced size. Furthermore, correlations between BabyLMs and BERT hold across multiple intra-model and post-model debiasing methods. Leveraging these similarities, we conduct pre-model debiasing experiments with BabyLMs, replicating prior findings and presenting new insights regarding the influence of gender imbalance and toxicity on bias formation. Our results demonstrate that BabyLMs can serve as an effective sandbox for large-scale LMs, reducing pre-training costs from over 500 GPU-hours to under 30 GPU-hours. This provides a way to democratise pre-model debiasing research and enables faster, more accessible exploration of methods for building fairer LMs.

cs.CV

[121] Optimizing Multimodal LLMs for Egocentric Video Understanding: A Solution for the HD-EPIC VQA Challenge

Sicheng Yang, Yukai Huang, Shitong Sun, Weitong Cai, Jiankang Deng, Jifei Song, Zhensong Zhang

Main category: cs.CV

TL;DR: A framework combining query/choice pre-processing, domain-specific Qwen2.5-VL fine-tuning, Temporal Chain-of-Thought prompting, and post-processing achieves 41.6% accuracy on challenging HD-EPIC VQA benchmark.

Details

Motivation: Multimodal Large Language Models (MLLMs) struggle with complex video QA benchmarks like HD-EPIC VQA due to ambiguous queries/options, poor long-range temporal reasoning, and non-standardized outputs.

Method: Framework integrates: 1) query/choice pre-processing, 2) domain-specific Qwen2.5-VL fine-tuning, 3) novel Temporal Chain-of-Thought (T-CoT) prompting for multi-step reasoning, and 4) robust post-processing.

Result: The system achieves 41.6% accuracy on HD-EPIC VQA benchmark, demonstrating significant improvement over baseline MLLM approaches.

Conclusion: Highlights the need for holistic pipeline optimization in demanding video understanding tasks, with code and fine-tuned models made publicly available.

Abstract: Multimodal Large Language Models (MLLMs) struggle with complex video QA benchmarks like HD-EPIC VQA due to ambiguous queries/options, poor long-range temporal reasoning, and non-standardized outputs. We propose a framework integrating query/choice pre-processing, domain-specific Qwen2.5-VL fine-tuning, a novel Temporal Chain-of-Thought (T-CoT) prompting for multi-step reasoning, and robust post-processing. This system achieves 41.6% accuracy on HD-EPIC VQA, highlighting the need for holistic pipeline optimization in demanding video understanding. Our code, fine-tuned models are available at https://github.com/YoungSeng/Egocentric-Co-Pilot.

[122] Diffusion-Driven Deceptive Patches: Adversarial Manipulation and Forensic Detection in Facial Identity Verification

Shahrzad Sayyafzadeh, Hongmei Chi, Shonda Bernadin

Main category: cs.CV

TL;DR: End-to-end pipeline for generating, refining, and evaluating adversarial patches to compromise facial biometric systems using FGSM, diffusion models, and ViT-GPT2 for forensic analysis and security testing.

Details

Motivation: To develop a comprehensive system for testing and analyzing vulnerabilities in facial biometric systems through adversarial attacks, supporting forensic analysis and security evaluation of identity verification systems.

Method: Uses FGSM to generate adversarial noise targeting identity classifiers, employs diffusion models with reverse diffusion for refinement (Gaussian smoothing, adaptive brightness correction), applies patches to facial images, and uses ViT-GPT2 for semantic captioning of adversarial images.

Result: Achieves effective adversarial patch generation with high imperceptibility (SSIM 0.95), successfully evades facial recognition systems while maintaining natural appearance, and enables forensic interpretation through semantic captioning.

Conclusion: The pipeline provides a comprehensive framework for generating, evaluating, and analyzing adversarial attacks on facial biometric systems, demonstrating practical applications in security testing and forensic analysis with high-quality results.

Abstract: This work presents an end-to-end pipeline for generating, refining, and evaluating adversarial patches to compromise facial biometric systems, with applications in forensic analysis and security testing. We utilize FGSM to generate adversarial noise targeting an identity classifier and employ a diffusion model with reverse diffusion to enhance imperceptibility through Gaussian smoothing and adaptive brightness correction, thereby facilitating synthetic adversarial patch evasion. The refined patch is applied to facial images to test its ability to evade recognition systems while maintaining natural visual characteristics. A Vision Transformer (ViT)-GPT2 model generates captions to provide a semantic description of a person’s identity for adversarial images, supporting forensic interpretation and documentation for identity evasion and recognition attacks. The pipeline evaluates changes in identity classification, captioning results, and vulnerabilities in facial identity verification and expression recognition under adversarial conditions. We further demonstrate effective detection and analysis of adversarial patches and adversarial samples using perceptual hashing and segmentation, achieving an SSIM of 0.95.

Peng-Fei Zhang, Zi Huang

Main category: cs.CV

TL;DR: HRA is a multimodal universal attack framework for VLP models that refines universal adversarial perturbations at both sample and optimization levels, achieving efficient and effective attacks across various tasks and datasets.

Details

Motivation: Existing adversarial attacks for VLP models are mostly sample-specific, causing substantial computational overhead when scaled to large datasets or new scenarios. There's a need for more efficient universal attacks that work across different samples.

Method: HRA refines UAPs at sample and optimization levels. For images: disentangles adversarial examples into clean images and perturbations, uses ScMix augmentation to diversify visual contexts. For text: identifies globally influential words using intra-sentence and inter-sentence importance measures. Also introduces temporal hierarchy of historical and estimated future gradients to optimize perturbations.

Result: Extensive experiments across various downstream tasks, VLP models, and datasets demonstrate the superiority of the proposed universal multimodal attacks over existing methods.

Conclusion: HRA provides an effective and efficient universal attack framework for VLP models that overcomes the computational limitations of sample-specific attacks while maintaining strong attack performance across diverse scenarios.

Abstract: Existing adversarial attacks for VLP models are mostly sample-specific, resulting in substantial computational overhead when scaled to large datasets or new scenarios. To overcome this limitation, we propose Hierarchical Refinement Attack (HRA), a multimodal universal attack framework for VLP models. HRA refines universal adversarial perturbations (UAPs) at both the sample level and the optimization level. For the image modality, we disentangle adversarial examples into clean images and perturbations, allowing each component to be handled independently for more effective disruption of cross-modal alignment. We further introduce a ScMix augmentation strategy that diversifies visual contexts and strengthens both global and local utility of UAPs, thereby reducing reliance on spurious features. In addition, we refine the optimization path by leveraging a temporal hierarchy of historical and estimated future gradients to avoid local minima and stabilize universal perturbation learning. For the text modality, HRA identifies globally influential words by combining intra-sentence and inter-sentence importance measures, and subsequently utilizes these words as universal text perturbations. Extensive experiments across various downstream tasks, VLP models, and datasets demonstrate the superiority of the proposed universal multimodal attacks.

[124] LCF3D: A Robust and Real-Time Late-Cascade Fusion Framework for 3D Object Detection in Autonomous Driving

Carlo Sgaravatti, Riccardo Pieroni, Matteo Corno, Sergio M. Savaresi, Luca Magri, Giacomo Boracchi

Main category: cs.CV

TL;DR: LCF3D is a novel sensor fusion framework for 3D object detection in autonomous driving that combines 2D RGB image detection with 3D LiDAR point cloud detection using late fusion and cascade fusion principles to improve accuracy and domain generalization.

Details

Motivation: Accurate 3D object localization is essential for autonomous driving, but effectively combining RGB camera and LiDAR sensor data remains challenging. The paper aims to address inaccuracies in LiDAR object detection by leveraging multimodal fusion principles.

Method: LCF3D combines two key principles: (1) Late fusion - matches LiDAR 3D detections with RGB 2D detections to filter out unmatched LiDAR false positives; (2) Cascade fusion - generates new 3D frustum proposals from unmatched RGB detections to recover missed objects from LiDAR.

Result: LCF3D achieves significant improvements over LiDAR-based methods, particularly for challenging categories like pedestrians and cyclists in KITTI dataset, and motorcycles and bicycles in nuScenes. The framework also demonstrates benefits for domain generalization across different sensor configurations.

Conclusion: The proposed LCF3D framework effectively combines RGB and LiDAR data through multimodal fusion, improving 3D object detection accuracy while enhancing domain generalization capabilities for autonomous driving applications.

Abstract: Accurately localizing 3D objects like pedestrians, cyclists, and other vehicles is essential in Autonomous Driving. To ensure high detection performance, Autonomous Vehicles complement RGB cameras with LiDAR sensors, but effectively combining these data sources for 3D object detection remains challenging. We propose LCF3D, a novel sensor fusion framework that combines a 2D object detector on RGB images with a 3D object detector on LiDAR point clouds. By leveraging multimodal fusion principles, we compensate for inaccuracies in the LiDAR object detection network. Our solution combines two key principles: (i) late fusion, to reduce LiDAR False Positives by matching LiDAR 3D detections with RGB 2D detections and filtering out unmatched LiDAR detections; and (ii) cascade fusion, to recover missed objects from LiDAR by generating new 3D frustum proposals corresponding to unmatched RGB detections. Experiments show that LCF3D is beneficial for domain generalization, as it turns out to be successful in handling different sensor configurations between training and testing domains. LCF3D achieves significant improvements over LiDAR-based methods, particularly for challenging categories like pedestrians and cyclists in the KITTI dataset, as well as motorcycles and bicycles in nuScenes. Code can be downloaded from: https://github.com/CarloSgaravatti/LCF3D.

Yichong Xia, Yimin Zhou, Jinpeng Wang, Bin Chen

Main category: cs.CV

TL;DR: DiffCR is a novel diffusion-based image compression framework that achieves high-fidelity reconstruction with 10x speed-up and significant bitrate savings through consistency prior refinement and frequency-aware skip estimation.

Details

Motivation: Existing diffusion-based image compression methods suffer from slow sampling processes and suboptimal bit allocation due to fragmented training paradigms, limiting their practical application despite achieving visually plausible results at low bit rates.

Method: DiffCR introduces a Frequency-aware Skip Estimation (FaSE) module that refines ε-prediction prior from pre-trained latent diffusion models using Frequency Decoupling Attention (FDA) to align with compressed latents. It also uses a lightweight consistency estimator enabling fast two-step decoding while preserving diffusion sampling trajectories.

Result: Achieves 27.2% BD-rate savings (LPIPS) and 65.1% BD-rate savings (PSNR), with over 10x speed-up compared to state-of-the-art diffusion-based compression baselines, without updating the backbone diffusion model.

Conclusion: DiffCR demonstrates that efficient and high-fidelity image compression can be achieved by refining diffusion priors through frequency-aware consistency estimation, overcoming previous limitations in speed and bit allocation while maintaining reconstruction quality.

Abstract: Recent advancements in diffusion-based generative priors have enabled visually plausible image compression at extremely low bit rates. However, existing approaches suffer from slow sampling processes and suboptimal bit allocation due to fragmented training paradigms. In this work, we propose Accelerate \textbf{Diff}usion-based Image Compression via \textbf{C}onsistency Prior \textbf{R}efinement (DiffCR), a novel compression framework for efficient and high-fidelity image reconstruction. At the heart of DiffCR is a Frequency-aware Skip Estimation (FaSE) module that refines the $ε$-prediction prior from a pre-trained latent diffusion model and aligns it with compressed latents at different timesteps via Frequency Decoupling Attention (FDA). Furthermore, a lightweight consistency estimator enables fast \textbf{two-step decoding} by preserving the semantic trajectory of diffusion sampling. Without updating the backbone diffusion model, DiffCR achieves substantial bitrate savings (27.2% BD-rate (LPIPS) and 65.1% BD-rate (PSNR)) and over $10\times$ speed-up compared to SOTA diffusion-based compression baselines.

[126] Explainable Deep Learning for Pediatric Pneumonia Detection in Chest X-Ray Images

Adil O. Khadidos, Aziida Nanyonga, Alaa O. Khadidos, Olfat M. Mirza, Mustafa Tahsin Yilmaz

Main category: cs.CV

TL;DR: EfficientNet-B0 outperforms DenseNet121 for pediatric pneumonia detection from chest X-rays, achieving 84.6% accuracy with strong sensitivity and computational efficiency, supported by explainable AI techniques.

Details

Motivation: Pneumonia is a leading cause of child mortality worldwide, creating an urgent need for accurate and efficient diagnostic tools. Deep learning shows promise for medical image analysis, particularly chest X-ray interpretation, but comparative performance of state-of-the-art architectures for pediatric pneumonia detection needs evaluation.

Method: Used a public dataset of 5,863 pediatric chest X-rays with preprocessing (normalization, resizing, data augmentation). Fine-tuned DenseNet121 and EfficientNet-B0 using pretrained ImageNet weights under identical training settings. Evaluated performance using accuracy, F1-score, MCC, and recall. Incorporated explainability with Grad-CAM and LIME to visualize influential image regions.

Result: EfficientNet-B0 outperformed DenseNet121 with 84.6% accuracy, F1-score of 0.8899, and MCC of 0.6849 vs. DenseNet121’s 79.7% accuracy, 0.8597 F1-score, and 0.5852 MCC. Both models showed high recall (>0.99), indicating strong sensitivity. Explainability visualizations consistently focused on clinically relevant lung regions.

Conclusion: EfficientNet-B0 provides more balanced and computationally efficient performance than DenseNet121, making it suitable for clinical deployment. Integration of explainability techniques (Grad-CAM and LIME) enhances transparency and trustworthiness in AI-assisted pediatric pneumonia diagnosis.

Abstract: Background: Pneumonia remains a leading cause of morbidity and mortality among children worldwide, emphasizing the need for accurate and efficient diagnostic support tools. Deep learning has shown strong potential in medical image analysis, particularly for chest X-ray interpretation. This study compares two state-of-the-art convolutional neural network (CNN) architectures for automated pediatric pneumonia detection. Methods: A publicly available dataset of 5,863 pediatric chest X-ray images was used. Images were preprocessed through normalization, resizing, and data augmentation to enhance generalization. DenseNet121 and EfficientNet-B0 were fine-tuned using pretrained ImageNet weights under identical training settings. Performance was evaluated using accuracy, F1-score, Matthews Correlation Coefficient (MCC), and recall. Model explainability was incorporated using Gradient-weighted Class Activation Mapping (Grad-CAM) and Local Interpretable Model-agnostic Explanations (LIME) to visualize image regions influencing predictions. Results: EfficientNet-B0 outperformed DenseNet121, achieving an accuracy of 84.6%, F1-score of 0.8899, and MCC of 0.6849. DenseNet121 achieved 79.7% accuracy, an F1-score of 0.8597, and MCC of 0.5852. Both models demonstrated high recall values above 0.99, indicating strong sensitivity to pneumonia detection. Grad-CAM and LIME visualizations showed consistent focus on clinically relevant lung regions, supporting the reliability of model decisions. Conclusions: EfficientNet-B0 provided a more balanced and computationally efficient performance compared to DenseNet121, making it a strong candidate for clinical deployment. The integration of explainability techniques enhances transparency and trustworthiness in AI-assisted pediatric pneumonia diagnosis.

[127] Handling Missing Modalities in Multimodal Survival Prediction for Non-Small Cell Lung Cancer

Filippo Ruffini, Camillo Maria Caruso, Claudia Tacconi, Lorenzo Nibid, Francesca Miccolis, Marta Lovino, Carlo Greco, Edy Ippolito, Michele Fiore, Alessio Cortellini, Bruno Beomonte Zobel, Giuseppe Perrone, Bruno Vincenzi, Claudio Marrocco, Alessandro Bria, Elisa Ficarra, Sara Ramella, Valerio Guarrasi, Paolo Soda

Main category: cs.CV

TL;DR: A missing-aware multimodal survival framework for NSCLC that integrates CT scans, histopathology images, and clinical data using foundation models and intermediate fusion, achieving 73.30 C-index while handling missing modalities without patient exclusion.

Details

Motivation: Current multimodal deep learning for NSCLC survival prediction faces limitations due to small cohort sizes and missing modalities, forcing complete-case filtering or aggressive imputation that reduces clinical applicability.

Method: Uses foundation models for modality-specific feature extraction (CT, WSI, clinical variables), implements missing-aware encoding strategy, and employs intermediate multimodal fusion that’s resilient to missing modalities by design.

Result: Intermediate fusion outperforms unimodal baselines and early/late fusion strategies, with best performance from WSI+clinical fusion (73.30 C-index). Model adaptively down-weights less informative modalities like CT.

Conclusion: The proposed framework enables robust multimodal survival prediction in NSCLC while handling real-world missing data, improving clinical applicability without forcing patient exclusion during training or inference.

Abstract: Accurate survival prediction in Non-Small Cell Lung Cancer (NSCLC) requires the integration of heterogeneous clinical, radiological, and histopathological information. While Multimodal Deep Learning (MDL) offers a promises for precision prognosis and survival prediction, its clinical applicability is severely limited by small cohort sizes and the presence of missing modalities, often forcing complete-case filtering or aggressive imputation. In this work, we present a missing-aware multimodal survival framework that integrates Computed Tomography (CT), Whole-Slide Histopathology (WSI) Images, and structured clinical variables for overall survival modeling in unresectable stage II-III NSCLC. By leveraging Foundation Models (FM) for modality-specific feature extraction and a missing-aware encoding strategy, the proposed approach enables intermediate multimodal fusion under naturally incomplete modality profiles. The proposed architecture is resilient to missing modalities by design, allowing the model to utilize all available data without being forced to drop patients during training or inference. Experimental results demonstrate that intermediate fusion consistently outperforms unimodal baselines as well as early and late fusion strategies, with the strongest performance achieved by the fusion of WSI and clinical modalities (73.30 C-index). Further analyses of modality importance reveal an adaptive behavior in which less informative modalities, i.e., CT modality, are automatically down-weighted and contribute less to the final survival prediction.

[128] NanoSD: Edge Efficient Foundation Model for Real Time Image Restoration

Subhajit Sanyal, Srinivas Soumitri Miriyala, Akshay Janardan Bankar, Sravanth Kodavanti, Harshit, Abhishek Ameta, Shreyas Pandith, Amit Satish Unde

Main category: cs.CV

TL;DR: NanoSD is a family of lightweight diffusion models distilled from Stable Diffusion 1.5 for real-time image restoration on edge devices, achieving Pareto-optimal performance across accuracy, latency, and model size.

Details

Motivation: Latent diffusion models like Stable Diffusion 1.5 have strong generative priors valuable for image restoration, but their full pipelines are too computationally heavy for edge devices. Existing lightweight variants compress only parts of the pipeline, disrupting the latent manifold and limiting generalization to single tasks.

Method: Full-pipeline co-design approach using network surgery, feature-wise generative distillation, and structured architectural scaling applied jointly to both the U-Net and VAE encoder-decoder. This preserves the generative prior while creating Pareto-optimal models along accuracy-latency-size frontier.

Result: Achieves models with 130M-315M parameters capable of real-time inference down to 20ms on mobile NPUs. Outperforms prior lightweight diffusion models in perceptual quality and deployability across multiple tasks: image super-resolution, deblurring, face restoration, and monocular depth estimation.

Conclusion: NanoSD establishes a general-purpose diffusion foundation model family suitable for real-time visual generation and restoration on edge devices, demonstrating that parameter reduction alone doesn’t guarantee hardware efficiency, and architectural balance with latent-space preservation is crucial.

Abstract: Latent diffusion models such as Stable Diffusion 1.5 offer strong generative priors that are highly valuable for image restoration, yet their full pipelines remain too computationally heavy for deployment on edge devices. Existing lightweight variants predominantly compress the denoising U-Net or reduce the diffusion trajectory, which disrupts the underlying latent manifold and limits generalization beyond a single task. We introduce NanoSD, a family of Pareto-optimal diffusion foundation models distilled from Stable Diffusion 1.5 through network surgery, feature-wise generative distillation, and structured architectural scaling jointly applied to the U-Net and the VAE encoder-decoder. This full-pipeline co-design preserves the generative prior while producing models that occupy distinct operating points along the accuracy-latency-size frontier (e.g., 130M-315M parameters, achieving real-time inference down to 20ms on mobile-class NPUs). We show that parameter reduction alone does not correlate with hardware efficiency, and we provide an analysis revealing how architectural balance, feature routing, and latent-space preservation jointly shape true on-device latency. When used as a drop-in backbone, NanoSD enables state-of-the-art performance across image super-resolution, image deblurring, face restoration, and monocular depth estimation, outperforming prior lightweight diffusion models in both perceptual quality and practical deployability. NanoSD establishes a general-purpose diffusion foundation model family suitable for real-time visual generation and restoration on edge devices.

[129] SRAW-Attack: Space-Reweighted Adversarial Warping Attack for SAR Target Recognition

Yiming Zhang, Weibo Qin, Yuntian Liu, Feng Wang

Main category: cs.CV

TL;DR: SRAW attack method uses optimized spatial deformation with reweighted budgets across foreground/background to create stealthy adversarial examples for SAR-ATR systems.

Details

Motivation: SAR-ATR systems using DNNs are vulnerable to adversarial attacks and tend to over-rely on background regions, but existing attacks require visually perceptible distortions. Need for attack method balancing effectiveness and stealthiness.

Method: Space-Reweighted Adversarial Warping (SRAW) generates adversarial examples through optimized spatial deformation with reweighted budgets across foreground and background regions.

Result: SRAW significantly degrades performance of state-of-the-art SAR-ATR models and outperforms existing methods in imperceptibility and adversarial transferability.

Conclusion: SRAW provides an effective and stealthy adversarial attack method for SAR-ATR systems that balances attack effectiveness with imperceptibility.

Abstract: Synthetic aperture radar (SAR) imagery exhibits intrinsic information sparsity due to its unique electromagnetic scattering mechanism. Despite the widespread adoption of deep neural network (DNN)-based SAR automatic target recognition (SAR-ATR) systems, they remain vulnerable to adversarial examples and tend to over-rely on background regions, leading to degraded adversarial robustness. Existing adversarial attacks for SAR-ATR often require visually perceptible distortions to achieve effective performance, thereby necessitating an attack method that balances effectiveness and stealthiness. In this paper, a novel attack method termed Space-Reweighted Adversarial Warping (SRAW) is proposed, which generates adversarial examples through optimized spatial deformation with reweighted budgets across foreground and background regions. Extensive experiments demonstrate that SRAW significantly degrades the performance of state-of-the-art SAR-ATR models and consistently outperforms existing methods in terms of imperceptibility and adversarial transferability. Code is made available at https://github.com/boremycin/SAR-ATR-TransAttack.

[130] UniHash: Unifying Pointwise and Pairwise Hashing Paradigms for Seen and Unseen Category Retrieval

Xiaoxu Ma, Runhao Li, Hanwen Liu, Xiangbo Zhang, Zhenyu Weng

Main category: cs.CV

TL;DR: UniHash is a dual-branch hashing framework that unifies pointwise and pairwise training paradigms to achieve balanced retrieval performance on both seen and unseen categories.

Details

Motivation: Existing deep hashing methods are limited to either pointwise or pairwise training paradigms, where pointwise excels on seen categories but pairwise generalizes better to unseen ones. There's a need for a unified approach that balances performance across both scenarios.

Method: Proposes Unified Hashing (UniHash) with two complementary branches: a center-based branch (pointwise paradigm) and a pairwise branch (pairwise paradigm). Introduces bidirectional knowledge transfer between branches using mutual learning loss and a Split-Merge Mixture of Hash Experts (SM-MoH) module to enhance hash representation exchange.

Result: Extensive experiments on CIFAR-10, MSCOCO, and ImageNet demonstrate that UniHash consistently achieves state-of-the-art performance in both seen and unseen image retrieval scenarios, with theoretical analysis supporting its effectiveness.

Conclusion: UniHash successfully unifies pointwise and pairwise paradigms through a dual-branch framework with bidirectional knowledge transfer, achieving balanced and superior retrieval performance across both seen and unseen categories.

Abstract: Effective retrieval across both seen and unseen categories is crucial for modern image retrieval systems. Retrieval on seen categories ensures precise recognition of known classes, while retrieval on unseen categories promotes generalization to novel classes with limited supervision. However, most existing deep hashing methods are confined to a single training paradigm, either pointwise or pairwise, where the former excels on seen categories and the latter generalizes better to unseen ones. To overcome this limitation, we propose Unified Hashing (UniHash), a dual-branch framework that unifies the strengths of both paradigms to achieve balanced retrieval performance across seen and unseen categories. UniHash consists of two complementary branches: a center-based branch following the pointwise paradigm and a pairwise branch following the pairwise paradigm. A novel hash code learning method is introduced to enable bidirectional knowledge transfer between branches, improving hash code discriminability and generalization. It employs a mutual learning loss to align hash representations and introduces a Split-Merge Mixture of Hash Experts (SM-MoH) module to enhance cross-branch exchange of hash representations. Theoretical analysis substantiates the effectiveness of UniHash, and extensive experiments on CIFAR-10, MSCOCO, and ImageNet demonstrate that UniHash consistently achieves state-of-the-art performance in both seen and unseen image retrieval scenarios.

[131] ViSIL: Unified Evaluation of Information Loss in Multimodal Video Captioning

Po-han Li, Shenghui Chen, Ufuk Topcu, Sandeep Chinchali

Main category: cs.CV

TL;DR: Proposes ViSIL, an information-theoretic metric to quantify information loss in multimodal video summaries, enabling cross-modal comparison and optimal summary selection.

Details

Motivation: Traditional metrics like BLEU/ROUGE fail to measure information coverage across different modalities (text vs keyframes) in multimodal video summaries, creating a need for a unified evaluation framework.

Method: Develops Video Summary Information Loss (ViSIL) score using vision-language model inference to quantify video information not captured by summaries, enabling direct comparison across different summary formats.

Result: ViSIL shows statistically significant correlation with human and VLM performance on VQA tasks, and enables summary selection that outperforms text summaries by 7% in VQA accuracy without increasing processing load.

Conclusion: ViSIL provides a unified metric for evaluating multimodal video summaries, addressing cross-modal comparison challenges and enabling optimal trade-offs between information coverage and processing efficiency.

Abstract: Multimodal video captioning condenses dense footage into a structured format of keyframes and natural language. By creating a cohesive multimodal summary, this approach anchors generative AI in rich semantic evidence and serves as a lightweight proxy for high-efficiency retrieval. However, traditional metrics like BLEU or ROUGE fail to quantify information coverage across disparate modalities, such as comparing a paragraph of text to a sequence of keyframes. To address this, we propose the Video Summary Information Loss (ViSIL) score, an information-theoretic framework that quantifies the video information not captured by a summary via vision-language model (VLM) inference. By measuring the information loss, ViSIL is a unified metric that enables direct comparison across multimodal summary formats despite their structural discrepancies. Our results demonstrate that ViSIL scores show a statistically significant correlation with both human and VLM performance on Video Question Answering (VQA) tasks. ViSIL also enables summary selection to optimize the trade-off between information loss and processing speed, establishing a Pareto-optimal frontier that outperforms text summaries by $7%$ in VQA accuracy without increasing processing load.

[132] Breaking the Limits of Open-Weight CLIP: An Optimization Framework for Self-supervised Fine-tuning of CLIP

Anant Mehta, Xiyuan Wei, Xingyu Chen, Tianbao Yang

Main category: cs.CV

TL;DR: TuneCLIP is a self-supervised fine-tuning framework that improves open-weight CLIP models without expensive retraining, achieving performance gains across various downstream tasks.

Details

Motivation: Improving CLIP performance typically requires expensive retraining on billions of samples. The paper asks if we can improve existing open-weight CLIP models using only existing self-supervised datasets, avoiding the need for costly supervised fine-tuning that adapts to single tasks.

Method: TuneCLIP has two key components: (1) a warm-up stage to recover optimization statistics and reduce cold-start bias, inspired by theoretical analysis, and (2) a fine-tuning stage optimizing a new contrastive loss to mitigate penalization on false negative pairs.

Result: TuneCLIP consistently improves performance across model architectures and scales. It elevates leading open-weight models like SigLIP (ViT-B/16), achieving gains of up to +2.5% on ImageNet and related out-of-distribution benchmarks, and +1.2% on the highly competitive DataComp benchmark.

Conclusion: TuneCLIP sets a new strong baseline for efficient post-pretraining adaptation of CLIP models, demonstrating that self-supervised fine-tuning can effectively improve general performance across various tasks without expensive retraining.

Abstract: CLIP has become a cornerstone of multimodal representation learning, yet improving its performance typically requires a prohibitively costly process of training from scratch on billions of samples. We ask a different question: Can we improve the performance of open-weight CLIP models across various downstream tasks using only existing self-supervised datasets? Unlike supervised fine-tuning, which adapts a pretrained model to a single downstream task, our setting seeks to improve general performance across various tasks. However, as both our experiments and prior studies reveal, simply applying standard training protocols starting from an open-weight CLIP model often fails, leading to performance degradation. In this paper, we introduce TuneCLIP, a self-supervised fine-tuning framework that overcomes the performance degradation. TuneCLIP has two key components: (1) a warm-up stage of recovering optimization statistics to reduce cold-start bias, inspired by theoretical analysis, and (2) a fine-tuning stage of optimizing a new contrastive loss to mitigate the penalization on false negative pairs. Our extensive experiments show that TuneCLIP consistently improves performance across model architectures and scales. Notably, it elevates leading open-weight models like SigLIP (ViT-B/16), achieving gains of up to +2.5% on ImageNet and related out-of-distribution benchmarks, and +1.2% on the highly competitive DataComp benchmark, setting a new strong baseline for efficient post-pretraining adaptation.

[133] VibrantSR: Sub-Meter Canopy Height Models from Sentinel-2 Using Generative Flow Matching

Kiarie Ndegwa, Andreas Gros, Tony Chang, David Diaz, Vincent A. Landau, Nathan E. Rutenbeck, Luke J. Zachmann, Guy Bayes, Scott Conway

Main category: cs.CV

TL;DR: VibrantSR is a generative super-resolution framework that estimates 0.5m canopy height models from 10m Sentinel-2 imagery, enabling seasonal forest monitoring without aerial imagery.

Details

Motivation: Current approaches using aerial imagery are limited by infrequent and irregular acquisition schedules. There's a need for consistent, operational forest monitoring at continental scales using globally available satellite data.

Method: Generative super-resolution framework that processes Sentinel-2 seasonal composites to estimate high-resolution (0.5m) canopy height models from lower-resolution (10m) satellite imagery.

Result: Achieves MAE of 4.39m for canopy heights ≥2m, outperforming Meta (4.83m), LANDFIRE (5.96m), and ETH (7.05m) benchmarks. Evaluated across 22 EPA eco-regions with spatially disjoint validation.

Conclusion: VibrantSR enables operational forest monitoring and carbon accounting at continental scales without costly aerial acquisitions, though aerial-based methods still have higher accuracy.

Abstract: We present VibrantSR (Vibrant Super-Resolution), a generative super-resolution framework for estimating 0.5 meter canopy height models (CHMs) from 10 meter Sentinel-2 imagery. Unlike approaches based on aerial imagery that are constrained by infrequent and irregular acquisition schedules, VibrantSR leverages globally available Sentinel-2 seasonal composites, enabling consistent monitoring at a seasonal-to-annual cadence. Evaluated across 22 EPA Level 3 eco-regions in the western United States using spatially disjoint validation splits, VibrantSR achieves a Mean Absolute Error of 4.39 meters for canopy heights >= 2 m, outperforming Meta (4.83 m), LANDFIRE (5.96 m), and ETH (7.05 m) satellite-based benchmarks. While aerial-based VibrantVS (2.71 m MAE) retains an accuracy advantage, VibrantSR enables operational forest monitoring and carbon accounting at continental scales without reliance on costly and temporally infrequent aerial acquisitions.

[134] MedVL-SAM2: A unified 3D medical vision-language model for multimodal reasoning and prompt-driven segmentation

Yang Xing, Jiong Wu, Savas Ozdemir, Ying Zhang, Yang Yang, Wei Shao, Kuang Gong

Main category: cs.CV

TL;DR: MedVL-SAM2 is a unified 3D medical multimodal model that combines report generation, VQA, and multi-paradigm segmentation in a single framework for 3D CT imaging.

Details

Motivation: Existing medical VLMs excel at image-level text tasks but struggle with fine-grained visual grounding and 3D spatial reasoning. There's a need for a unified framework that can handle both high-level reasoning and precise 3D localization in medical imaging.

Method: The model integrates image-level reasoning with pixel-level perception using a SAM2-based volumetric segmentation module. It’s trained in two stages: first pre-trained on 3D CT image-text pairs for feature alignment, then jointly optimized with language-understanding and segmentation objectives using comprehensive 3D CT segmentation data.

Result: Achieves state-of-the-art performance across report generation, VQA, and multiple 3D segmentation tasks. Provides reliable 3D visual grounding, controllable interactive segmentation, and robust cross-modal reasoning.

Conclusion: High-level semantic reasoning and precise 3D localization can be jointly achieved within a unified 3D medical VLM, enabling flexible interaction via language, point, or box prompts for comprehensive medical image analysis.

Abstract: Recent progress in medical vision-language models (VLMs) has achieved strong performance on image-level text-centric tasks such as report generation and visual question answering (VQA). However, achieving fine-grained visual grounding and volumetric spatial reasoning in 3D medical VLMs remains challenging, particularly when aiming to unify these capabilities within a single, generalizable framework. To address this challenge, we proposed MedVL-SAM2, a unified 3D medical multimodal model that concurrently supports report generation, VQA, and multi-paradigm segmentation, including semantic, referring, and interactive segmentation. MedVL-SAM2 integrates image-level reasoning and pixel-level perception through a cohesive architecture tailored for 3D medical imaging, and incorporates a SAM2-based volumetric segmentation module to enable precise multi-granular spatial reasoning. The model is trained in a multi-stage pipeline: it is first pre-trained on a large-scale corpus of 3D CT image-text pairs to align volumetric visual features with radiology-language embeddings. It is then jointly optimized with both language-understanding and segmentation objectives using a comprehensive 3D CT segmentation dataset. This joint training enables flexible interaction via language, point, or box prompts, thereby unifying high-level visual reasoning with spatially precise localization. Our unified architecture delivers state-of-the-art performance across report generation, VQA, and multiple 3D segmentation tasks. Extensive analyses further show that the model provides reliable 3D visual grounding, controllable interactive segmentation, and robust cross-modal reasoning, demonstrating that high-level semantic reasoning and precise 3D localization can be jointly achieved within a unified 3D medical VLM.

[135] Transition Matching Distillation for Fast Video Generation

Weili Nie, Julius Berner, Nanye Ma, Chao Liu, Saining Xie, Arash Vahdat

Main category: cs.CV

TL;DR: TMD distills large video diffusion models into efficient few-step generators by matching multi-step denoising trajectories with lightweight conditional flows, enabling real-time video generation with quality-speed trade-offs.

Details

Motivation: Large video diffusion models produce high-quality videos but are too slow for real-time applications due to their multi-step sampling process. There's a need to make these models efficient while maintaining visual quality.

Method: Transition Matching Distillation (TMD) decomposes diffusion backbones into: 1) main backbone for semantic representation extraction, and 2) flow head for conditional flow updates. It matches multi-step denoising trajectories with few-step probability transitions using distribution matching distillation with flow head rollout.

Result: TMD outperforms existing distilled models under comparable inference costs in visual fidelity and prompt adherence when applied to Wan2.1 1.3B and 14B text-to-video models, providing flexible speed-quality trade-offs.

Conclusion: TMD enables efficient distillation of large video diffusion models into few-step generators suitable for real-time applications while maintaining strong visual quality and prompt adherence.

Abstract: Large video diffusion and flow models have achieved remarkable success in high-quality video generation, but their use in real-time interactive applications remains limited due to their inefficient multi-step sampling process. In this work, we present Transition Matching Distillation (TMD), a novel framework for distilling video diffusion models into efficient few-step generators. The central idea of TMD is to match the multi-step denoising trajectory of a diffusion model with a few-step probability transition process, where each transition is modeled as a lightweight conditional flow. To enable efficient distillation, we decompose the original diffusion backbone into two components: (1) a main backbone, comprising the majority of early layers, that extracts semantic representations at each outer transition step; and (2) a flow head, consisting of the last few layers, that leverages these representations to perform multiple inner flow updates. Given a pretrained video diffusion model, we first introduce a flow head to the model, and adapt it into a conditional flow map. We then apply distribution matching distillation to the student model with flow head rollout in each transition step. Extensive experiments on distilling Wan2.1 1.3B and 14B text-to-video models demonstrate that TMD provides a flexible and strong trade-off between generation speed and visual quality. In particular, TMD outperforms existing distilled models under comparable inference costs in terms of visual fidelity and prompt adherence. Project page: https://research.nvidia.com/labs/genair/tmd

[136] OT-Drive: Out-of-Distribution Off-Road Traversable Area Segmentation via Optimal Transport

Zhihua Zhao, Guoqiang Li, Chen Min, Kangping Lu

Main category: cs.CV

TL;DR: OT-Drive: Optimal Transport-driven multi-modal fusion framework for robust traversable area segmentation in autonomous driving, achieving strong OOD generalization with limited training data.

Details

Motivation: Existing data-driven approaches for traversable area segmentation suffer from degraded performance in out-of-distribution (OOD) scenarios, which impairs downstream autonomous driving tasks. There's a need for methods that can generalize better to unseen environmental conditions.

Method: Proposes OT-Drive with two key components: 1) Scene Anchor Generator (SAG) that decomposes scene information into joint distribution of weather, time-of-day, and road type to create semantic anchors, and 2) Optimal Transport-based multi-modal fusion module (OT Fusion) that transports RGB and surface normal features onto the semantic anchor manifold.

Result: Achieves 95.16% mIoU on ORFD OOD scenarios (outperforming prior methods by 6.35%) and 89.79% mIoU on cross-dataset transfer tasks (surpassing baselines by 13.99%). Shows strong OOD generalization with limited training data.

Conclusion: OT-Drive substantially enhances practicality and efficiency for real-world deployment by enabling robust traversable area segmentation under OOD scenarios through optimal transport-based multi-modal fusion and semantic anchor construction.

Abstract: Reliable traversable area segmentation in unstructured environments is critical for planning and decision-making in autonomous driving. However, existing data-driven approaches often suffer from degraded segmentation performance in out-of-distribution (OOD) scenarios, consequently impairing downstream driving tasks. To address this issue, we propose OT-Drive, an Optimal Transport–driven multi-modal fusion framework. The proposed method formulates RGB and surface normal fusion as a distribution transport problem. Specifically, we design a novel Scene Anchor Generator (SAG) to decompose scene information into the joint distribution of weather, time-of-day, and road type, thereby constructing semantic anchors that can generalize to unseen scenarios. Subsequently, we design an innovative Optimal Transport-based multi-modal fusion module (OT Fusion) to transport RGB and surface normal features onto the manifold defined by the semantic anchors, enabling robust traversable area segmentation under OOD scenarios. Experimental results demonstrate that our method achieves 95.16% mIoU on ORFD OOD scenarios, outperforming prior methods by 6.35%, and 89.79% mIoU on cross-dataset transfer tasks, surpassing baselines by 13.99%.These results indicate that the proposed model can attain strong OOD generalization with only limited training data, substantially enhancing its practicality and efficiency for real-world deployment.

[137] The Spatial Blindspot of Vision-Language Models

Nahid Alam, Leema Krishna Murali, Siddhant Bharadwaj, Patrick Liu, Timothy Chung, Drishti Sharma, Akshata A, Kranthi Kiran, Wesley Tam, Bala Krishna S Vegesna

Main category: cs.CV

TL;DR: VLMs lack spatial reasoning due to CLIP-style image encoders flattening 2D structure; improving spatial awareness through alternative encoders and 2D positional encodings enhances spatial reasoning benchmarks.

Details

Motivation: Current vision-language models (VLMs) built with CLIP-style image encoders flatten images into 1D patch sequences, discarding crucial 2D structure needed for spatial reasoning. This spatial awareness gap is a missing dimension in VLM design and a bottleneck for applications requiring spatial grounding like robotics and embodied AI.

Method: The paper investigates two architectural approaches: (1) image encoders trained with alternative objectives (not just contrastive learning), and (2) incorporation of 2D positional encodings to preserve spatial structure information.

Result: Experiments show that these architectural choices lead to improved spatial reasoning performance on several benchmarks, demonstrating that better spatial awareness can be achieved through encoder design modifications.

Conclusion: Spatial awareness is a critical missing dimension in current VLM design, and addressing it through architectural modifications like alternative encoders and 2D positional encodings can significantly improve spatial reasoning capabilities for applications requiring spatial grounding.

Abstract: Vision-language models (VLMs) have advanced rapidly, but their ability to capture spatial relationships remains a blindspot. Current VLMs are typically built with contrastive language-image pretraining (CLIP) style image encoders. The training recipe often flattens images into 1D patch sequences, discarding the 2D structure necessary for spatial reasoning. We argue that this lack of spatial awareness is a missing dimension in VLM design and a bottleneck for applications requiring spatial grounding, such as robotics and embodied AI. To address this, we investigate (i) image encoders trained with alternative objectives and (ii) 2D positional encodings. Our experiments show that these architectural choices can lead to improved spatial reasoning on several benchmarks.

[138] DR$^2$Seg: Decomposed Two-Stage Rollouts for Efficient Reasoning Segmentation in Multimodal Large Language Models

Yulin He, Wei Chen, Zhikang Jian, Tianhang Guo, Wenjuan Zhou, Minglong Li

Main category: cs.CV

TL;DR: DR²Seg is a self-rewarding framework that improves reasoning segmentation efficiency and accuracy by decomposing the task into multimodal reasoning and referring segmentation stages, using self-contained descriptions and self-rewards to reduce overthinking.

Details

Motivation: Existing reasoning segmentation methods suffer from overthinking in MLLMs, generating verbose reasoning chains that interfere with object localization. There's a need to improve both reasoning efficiency and segmentation accuracy without requiring extra supervision.

Method: Two-stage rollout strategy: 1) Generate self-contained description specifying target object, 2) Use description to verify self-containment. Introduces two self-rewards to strengthen goal-oriented reasoning and suppress redundant thinking. Works without extra thinking supervision.

Result: Extensive experiments across MLLMs of varying scales and segmentation models show consistent improvements in reasoning efficiency and overall segmentation performance.

Conclusion: DR²Seg effectively addresses overthinking in reasoning segmentation by decomposing the task and using self-rewards, improving both efficiency and accuracy without requiring additional supervision.

Abstract: Reasoning segmentation is an emerging vision-language task that requires reasoning over intricate text queries to precisely segment objects. However, existing methods typically suffer from overthinking, generating verbose reasoning chains that interfere with object localization in multimodal large language models (MLLMs). To address this issue, we propose DR$^2$Seg, a self-rewarding framework that improves both reasoning efficiency and segmentation accuracy without requiring extra thinking supervision. DR$^2$Seg employs a two-stage rollout strategy that decomposes reasoning segmentation into multimodal reasoning and referring segmentation. In the first stage, the model generates a self-contained description that explicitly specifies the target object. In the second stage, this description replaces the original complex query to verify its self-containment. Based on this design, two self-rewards are introduced to strengthen goal-oriented reasoning and suppress redundant thinking. Extensive experiments across MLLMs of varying scales and segmentation models demonstrate that DR$^2$Seg consistently improves reasoning efficiency and overall segmentation performance.

[139] DW-DGAT: Dynamically Weighted Dual Graph Attention Network for Neurodegenerative Disease Diagnosis

Chengjia Liang, Zhenjiong Wang, Chao Chen, Ruizhi Zhang, Songxi Liang, Hai Xie, Haijun Lei, Zhongwei Huang

Main category: cs.CV

TL;DR: DW-DGAT: A dynamically weighted dual graph attention network for early diagnosis of Parkinson’s and Alzheimer’s diseases, addressing data fusion, heterogeneity, and class imbalance challenges.

Details

Motivation: Early diagnosis of Parkinson's and Alzheimer's diseases is critical but challenging due to high-dimensional multi-metric data with diverse structural forms, heterogeneity in neuroimaging/phenotypic data, and class imbalance issues.

Method: Proposes DW-DGAT with: 1) General-purpose data fusion strategy for three structural forms of multi-metric data; 2) Dual graph attention architecture based on brain regions and inter-sample relationships; 3) Class weight generation mechanism with two stable loss functions to handle imbalance.

Result: Demonstrates state-of-the-art performance on Parkinson Progression Marker Initiative (PPMI) and Alzheimer’s Disease Neuroimaging Initiative (ADNI) datasets.

Conclusion: The proposed DW-DGAT effectively addresses key challenges in early neurodegenerative disease diagnosis and achieves superior performance compared to existing methods.

Abstract: Parkinson’s disease (PD) and Alzheimer’s disease (AD) are the two most prevalent and incurable neurodegenerative diseases (NDs) worldwide, for which early diagnosis is critical to delay their progression. However, the high dimensionality of multi-metric data with diverse structural forms, the heterogeneity of neuroimaging and phenotypic data, and class imbalance collectively pose significant challenges to early ND diagnosis. To address these challenges, we propose a dynamically weighted dual graph attention network (DW-DGAT) that integrates: (1) a general-purpose data fusion strategy to merge three structural forms of multi-metric data; (2) a dual graph attention architecture based on brain regions and inter-sample relationships to extract both micro- and macro-level features; and (3) a class weight generation mechanism combined with two stable and effective loss functions to mitigate class imbalance. Rigorous experiments, based on the Parkinson Progression Marker Initiative (PPMI) and Alzhermer’s Disease Neuroimaging Initiative (ADNI) studies, demonstrate the state-of-the-art performance of our approach.

[140] VERHallu: Evaluating and Mitigating Event Relation Hallucination in Video Large Language Models

Zefan Zhang, Kehua Zhu, Shijie Jiang, Hongyuan Lu, Shengkai Sun, Tian Bai

Main category: cs.CV

TL;DR: VERHallu benchmark evaluates VideoLLMs on event relation hallucinations (causal, temporal, subevent relations) using counterintuitive videos, revealing models struggle with dense-event reasoning and proposing Key-Frame Propagating strategy to mitigate hallucinations.

Details

Motivation: Existing VideoLLM hallucination research focuses on object/scene presence but neglects event relation hallucinations. There's a need to evaluate and address how VideoLLMs hallucinate about causal, temporal, and subevent relationships between events in videos.

Method: 1) Introduce VERHallu benchmark with three task types (relation classification, QA, counterfactual QA) using counterintuitive video scenarios. 2) Propose Key-Frame Propagating (KFP) strategy that reallocates frame-level attention in intermediate layers to enhance multi-event understanding without affecting inference speed.

Result: Current VideoLLMs struggle with dense-event relation reasoning, rely on prior knowledge due to insufficient frame-level cue usage, and overlook surrounding subevents despite strong grounding for key events. KFP strategy effectively mitigates event relation hallucination.

Conclusion: Event relation hallucination is a critical but overlooked problem in VideoLLMs. The VERHallu benchmark enables comprehensive evaluation, and the proposed KFP strategy offers an effective solution by improving frame-level attention allocation for better multi-event understanding.

Abstract: Video Large Language Models (VideoLLMs) exhibit various types of hallucinations. Existing research has primarily focused on hallucinations involving the presence of events, objects, and scenes in videos, while largely neglecting event relation hallucination. In this paper, we introduce a novel benchmark for evaluating the Video Event Relation Hallucination, named VERHallu. This benchmark focuses on causal, temporal, and subevent relations between events, encompassing three types of tasks: relation classification, question answering, and counterfactual question answering, for a comprehensive evaluation of event relation hallucination. Additionally, it features counterintuitive video scenarios that deviate from typical pretraining distributions, with each sample accompanied by human-annotated candidates covering both vision-language and pure language biases. Our analysis reveals that current state-of-the-art VideoLLMs struggle with dense-event relation reasoning, often relying on prior knowledge due to insufficient use of frame-level cues. Although these models demonstrate strong grounding capabilities for key events, they often overlook the surrounding subevents, leading to an incomplete and inaccurate understanding of event relations. To tackle this, we propose a Key-Frame Propagating (KFP) strategy, which reallocates frame-level attention within intermediate layers to enhance multi-event understanding. Experiments show it effectively mitigates the event relation hallucination without affecting inference speed.

[141] Disentangled Concept Representation for Text-to-image Person Re-identification

Giyeol Kim, Chanho Eom

Main category: cs.CV

TL;DR: DiCo framework uses disentangled concept representation with slot-based anchors and concept blocks to bridge the modality gap between text and images for person re-identification.

Details

Motivation: TIReID faces challenges due to large modality gap between visual appearances and textual descriptions, and difficulty in modeling fine-grained correspondences needed to distinguish individuals with similar attributes like clothing color, texture, or outfit style.

Method: Proposes DiCo (Disentangled Concept Representation) framework with shared slot-based representation where each slot acts as part-level anchor across modalities, further decomposed into multiple concept blocks to disentangle complementary attributes while maintaining consistent part-level correspondence.

Result: Achieves competitive performance with state-of-the-art methods on CUHK-PEDES, ICFG-PEDES, and RSTPReid datasets, while enhancing interpretability through explicit slot- and block-level representations for more fine-grained retrieval.

Conclusion: DiCo effectively addresses modality gap and fine-grained correspondence challenges in TIReID through hierarchical disentangled concept representation, improving both performance and interpretability.

Abstract: Text-to-image person re-identification (TIReID) aims to retrieve person images from a large gallery given free-form textual descriptions. TIReID is challenging due to the substantial modality gap between visual appearances and textual expressions, as well as the need to model fine-grained correspondences that distinguish individuals with similar attributes such as clothing color, texture, or outfit style. To address these issues, we propose DiCo (Disentangled Concept Representation), a novel framework that achieves hierarchical and disentangled cross-modal alignment. DiCo introduces a shared slot-based representation, where each slot acts as a part-level anchor across modalities and is further decomposed into multiple concept blocks. This design enables the disentanglement of complementary attributes (\textit{e.g.}, color, texture, shape) while maintaining consistent part-level correspondence between image and text. Extensive experiments on CUHK-PEDES, ICFG-PEDES, and RSTPReid demonstrate that our framework achieves competitive performance with state-of-the-art methods, while also enhancing interpretability through explicit slot- and block-level representations for more fine-grained retrieval results.

[142] UEOF: A Benchmark Dataset for Underwater Event-Based Optical Flow

Nick Truong, Pritam P. Karmokar, William J. Beksi

Main category: cs.CV

TL;DR: First synthetic underwater benchmark dataset for event-based optical flow using physically-based ray-traced RGBD sequences to address lack of realistic underwater event data with accurate ground-truth flow.

Details

Motivation: Underwater imaging faces challenges like wavelength-dependent attenuation, scattering, turbidity, and non-uniform illumination, making ground-truth motion nearly impossible to obtain. Event cameras offer microsecond resolution and high dynamic range, but progress has been limited due to lack of datasets pairing realistic underwater optics with accurate optical flow.

Method: Created synthetic underwater benchmark dataset using physically-based ray-traced RGBD sequences. Applied modern video-to-event pipeline to rendered underwater videos to produce realistic event data streams with dense ground-truth flow, depth, and camera motion.

Result: Produced first synthetic underwater benchmark dataset for event-based optical flow with realistic event data streams containing dense ground-truth flow, depth, and camera motion. Benchmarked state-of-the-art learning-based and model-based optical flow prediction methods to understand how underwater light transport affects event formation and motion estimation accuracy.

Conclusion: The dataset establishes a new baseline for future development and evaluation of underwater event-based perception algorithms, addressing the critical gap in underwater event camera research. The source code and dataset are publicly available.

Abstract: Underwater imaging is fundamentally challenging due to wavelength-dependent light attenuation, strong scattering from suspended particles, turbidity-induced blur, and non-uniform illumination. These effects impair standard cameras and make ground-truth motion nearly impossible to obtain. On the other hand, event cameras offer microsecond resolution and high dynamic range. Nonetheless, progress on investigating event cameras for underwater environments has been limited due to the lack of datasets that pair realistic underwater optics with accurate optical flow. To address this problem, we introduce the first synthetic underwater benchmark dataset for event-based optical flow derived from physically-based ray-traced RGBD sequences. Using a modern video-to-event pipeline applied to rendered underwater videos, we produce realistic event data streams with dense ground-truth flow, depth, and camera motion. Moreover, we benchmark state-of-the-art learning-based and model-based optical flow prediction methods to understand how underwater light transport affects event formation and motion estimation accuracy. Our dataset establishes a new baseline for future development and evaluation of underwater event-based perception algorithms. The source code and dataset for this project are publicly available at https://robotic-vision-lab.github.io/ueof.

[143] CoF-T2I: Video Models as Pure Visual Reasoners for Text-to-Image Generation

Chengzhuo Tong, Mingkun Chang, Shenglong Zhang, Yuran Wang, Cheng Liang, Zhizheng Zhao, Ruichuan An, Bohan Zeng, Yang Shi, Yifan Dai, Ziming Zhao, Guanbin Li, Pengfei Wan, Yuanxing Zhang, Wentao Zhang

Main category: cs.CV

TL;DR: CoF-T2I integrates Chain-of-Frame reasoning into text-to-image generation using progressive visual refinement with intermediate frames as explicit reasoning steps.

Details

Motivation: While video models have shown Chain-of-Frame reasoning capabilities for various visual tasks, their potential for enhancing text-to-image generation remains unexplored due to lack of clear visual reasoning starting points and interpretable intermediate states in T2I generation.

Method: Proposes CoF-T2I model that integrates CoF reasoning into T2I generation via progressive visual refinement, where intermediate frames act as explicit reasoning steps. Creates CoF-Evol-Instruct dataset of CoF trajectories modeling generation from semantics to aesthetics. Implements independent encoding for each frame to avoid motion artifacts.

Result: CoF-T2I significantly outperforms base video model and achieves competitive performance: 0.86 on GenEval and 7.468 on Imagine-Bench benchmarks.

Conclusion: Video models show substantial promise for advancing high-quality text-to-image generation through Chain-of-Frame reasoning, with CoF-T2I demonstrating effective integration of progressive visual refinement into T2I generation.

Abstract: Recent video generation models have revealed the emergence of Chain-of-Frame (CoF) reasoning, enabling frame-by-frame visual inference. With this capability, video models have been successfully applied to various visual tasks (e.g., maze solving, visual puzzles). However, their potential to enhance text-to-image (T2I) generation remains largely unexplored due to the absence of a clearly defined visual reasoning starting point and interpretable intermediate states in the T2I generation process. To bridge this gap, we propose CoF-T2I, a model that integrates CoF reasoning into T2I generation via progressive visual refinement, where intermediate frames act as explicit reasoning steps and the final frame is taken as output. To establish such an explicit generation process, we curate CoF-Evol-Instruct, a dataset of CoF trajectories that model the generation process from semantics to aesthetics. To further improve quality and avoid motion artifacts, we enable independent encoding operation for each frame. Experiments show that CoF-T2I significantly outperforms the base video model and achieves competitive performance on challenging benchmarks, reaching 0.86 on GenEval and 7.468 on Imagine-Bench. These results indicate the substantial promise of video models for advancing high-quality text-to-image generation.

[144] ReaMIL: Reasoning- and Evidence-Aware Multiple Instance Learning for Whole-Slide Histopathology

Hyun Do Jung, Jungwon Choi, Hwiyoung Kim

Main category: cs.CV

TL;DR: ReaMIL is a multiple instance learning method for histopathology that adds a selection head to MIL backbones to identify small, compact evidence sets without sacrificing accuracy, using a budgeted-sufficiency objective with sparsity constraints.

Details

Motivation: Current MIL methods for whole-slide histopathology lack interpretability and produce evidence sets that are too large and scattered. There's a need for methods that can identify minimal, spatially compact evidence sets while maintaining or improving classification performance.

Method: Adds a light selection head to MIL backbone that produces soft per-tile gates. Trained with budgeted-sufficiency objective: hinge loss enforcing true-class probability ≥ τ using only kept evidence, under sparsity budget on selected tiles. No extra supervision required.

Result: Matches or slightly improves baseline AUC across TCGA-NSCLC, TCGA-BRCA, and PANDA datasets. On NSCLC: AUC 0.983 with mean minimal sufficient K ≈ 8.2 tiles at τ=0.90 and AUKC ≈ 0.864. Shows class confidence rises sharply with small tile sets.

Conclusion: ReaMIL provides interpretable, compact evidence sets without performance loss, integrates seamlessly with standard MIL training, and offers quantitative diagnostics (MSK, AUKC, contiguity) for rigorous evaluation of WSI models.

Abstract: We introduce ReaMIL (Reasoning- and Evidence-Aware MIL), a multiple instance learning approach for whole-slide histopathology that adds a light selection head to a strong MIL backbone. The head produces soft per-tile gates and is trained with a budgeted-sufficiency objective: a hinge loss that enforces the true-class probability to be $\geq τ$ using only the kept evidence, under a sparsity budget on the number of selected tiles. The budgeted-sufficiency objective yields small, spatially compact evidence sets without sacrificing baseline performance. Across TCGA-NSCLC (LUAD vs. LUSC), TCGA-BRCA (IDC vs. Others), and PANDA, ReaMIL matches or slightly improves baseline AUC and provides quantitative evidence-efficiency diagnostics. On NSCLC, it attains AUC 0.983 with a mean minimal sufficient K (MSK) $\approx 8.2$ tiles at $τ= 0.90$ and AUKC $\approx 0.864$, showing that class confidence rises sharply and stabilizes once a small set of tiles is kept. The method requires no extra supervision, integrates seamlessly with standard MIL training, and naturally yields slide-level overlays. We report accuracy alongside MSK, AUKC, and contiguity for rigorous evaluation of model behavior on WSIs.

[145] Thinking Like Van Gogh: Structure-Aware Style Transfer via Flow-Guided 3D Gaussian Splatting

Zhendong Wang, Lebin Zhou, Jingchuan Xiao, Rongduo Han, Nam Ling, Cihan Ruan

Main category: cs.CV

TL;DR: A 3D style transfer method that uses flow-guided geometric advection to create Post-Impressionist stylization in 3D Gaussian Splatting, focusing on structural exaggeration rather than surface texture.

Details

Motivation: Existing 3D style transfer methods treat geometry as rigid substrate for texture projection, which contradicts the Post-Impressionist principle of amplifying structural form while suppressing photographic detail. The paper aims to authentically reproduce Post-Impressionist stylization by embracing geometric abstraction as the primary vehicle of expression.

Method: A flow-guided geometric advection framework for 3D Gaussian Splatting that extracts directional flow fields from 2D paintings and back-propagates them into 3D space. Uses projection-based, mesh-free flow guidance, luminance-structure decoupling strategy, and rectifies Gaussian primitives to form flow-aligned brushstrokes conforming to scene topology.

Result: Enables expressive structural deformation driven directly by painterly motion rather than photometric constraints, creating authentic Post-Impressionist stylization in 3D without relying on explicit mesh priors.

Conclusion: The proposed method successfully operationalizes van Gogh’s principle of “exaggeration in the essential” by making geometric abstraction the primary expressive vehicle, and introduces a VLM-as-a-Judge evaluation framework to assess artistic authenticity through aesthetic judgment rather than conventional pixel-level metrics.

Abstract: In 1888, Vincent van Gogh wrote, “I am seeking exaggeration in the essential.” This principle, amplifying structural form while suppressing photographic detail, lies at the core of Post-Impressionist art. However, most existing 3D style transfer methods invert this philosophy, treating geometry as a rigid substrate for surface-level texture projection. To authentically reproduce Post-Impressionist stylization, geometric abstraction must be embraced as the primary vehicle of expression. We propose a flow-guided geometric advection framework for 3D Gaussian Splatting (3DGS) that operationalizes this principle in a mesh-free setting. Our method extracts directional flow fields from 2D paintings and back-propagates them into 3D space, rectifying Gaussian primitives to form flow-aligned brushstrokes that conform to scene topology without relying on explicit mesh priors. This enables expressive structural deformation driven directly by painterly motion rather than photometric constraints. Our contributions are threefold: (1) a projection-based, mesh-free flow guidance mechanism that transfers 2D artistic motion into 3D Gaussian geometry; (2) a luminance-structure decoupling strategy that isolates geometric deformation from color optimization, mitigating artifacts during aggressive structural abstraction; and (3) a VLM-as-a-Judge evaluation framework that assesses artistic authenticity through aesthetic judgment instead of conventional pixel-level metrics, explicitly addressing the subjective nature of artistic stylization.

[146] Difficulty-guided Sampling: Bridging the Target Gap between Dataset Distillation and Downstream Tasks

Mingzhuo Li, Guang Li, Linfeng Ye, Jiafeng Mao, Takahiro Ogawa, Konstantinos N. Plataniotis, Miki Haseyama

Main category: cs.CV

TL;DR: DGS (Difficulty-Guided Sampling) improves dataset distillation by sampling from generated image pools based on difficulty distributions to bridge the gap between distillation objectives and downstream tasks.

Details

Motivation: Existing dataset distillation methods focus on features from original datasets but overlook task-specific information, creating a target gap between distillation objectives and downstream tasks. This gap limits the effectiveness of distilled datasets for downstream applications.

Method: Proposes DGS as a plug-in post-stage sampling module that selects images from pools generated by existing methods according to specific target difficulty distributions. Also introduces DAG (Difficulty-Aware Guidance) to incorporate difficulty considerations during the generation process.

Result: Extensive experiments across multiple settings demonstrate the effectiveness of the proposed methods in improving dataset distillation performance for image classification tasks.

Conclusion: Difficulty-guided approaches successfully bridge the target gap in dataset distillation and highlight the broader potential of difficulty concepts for diverse downstream tasks beyond image classification.

Abstract: In this paper, we propose difficulty-guided sampling (DGS) to bridge the target gap between the distillation objective and the downstream task, therefore improving the performance of dataset distillation. Deep neural networks achieve remarkable performance but have time and storage-consuming training processes. Dataset distillation is proposed to generate compact, high-quality distilled datasets, enabling effective model training while maintaining downstream performance. Existing approaches typically focus on features extracted from the original dataset, overlooking task-specific information, which leads to a target gap between the distillation objective and the downstream task. We propose leveraging characteristics that benefit the downstream training into data distillation to bridge this gap. Focusing on the downstream task of image classification, we introduce the concept of difficulty and propose DGS as a plug-in post-stage sampling module. Following the specific target difficulty distribution, the final distilled dataset is sampled from image pools generated by existing methods. We also propose difficulty-aware guidance (DAG) to explore the effect of difficulty in the generation process. Extensive experiments across multiple settings demonstrate the effectiveness of the proposed methods. It also highlights the broader potential of difficulty for diverse downstream tasks.

[147] V-Zero: Self-Improving Multimodal Reasoning with Zero Annotation

Han Wang, Yi Yang, Jingyuan Hu, Minfeng Zhu, Wei Chen

Main category: cs.CV

TL;DR: V-Zero is a self-improvement framework for vision-language models that uses only unlabeled images, achieving performance gains without human annotations through a co-evolutionary loop between Questioner and Solver roles.

Details

Motivation: Current multimodal learning approaches heavily depend on costly, time-consuming human-annotated datasets, creating a need for more scalable and cost-effective training methods.

Method: V-Zero establishes a co-evolutionary loop with two roles: a Questioner that synthesizes challenging questions using dual-track reasoning rewards, and a Solver optimized via pseudo-labels from majority voting. Both are trained iteratively using Group Relative Policy Optimization (GRPO).

Result: Without any human annotation, V-Zero achieves consistent performance gains on Qwen2.5-VL-7B-Instruct, improving visual mathematical reasoning by +1.7 and general vision-centric tasks by +2.6.

Conclusion: V-Zero demonstrates the potential of self-improvement in multimodal systems, offering a scalable alternative to human-annotated datasets while maintaining competitive performance.

Abstract: Recent advances in multimodal learning have significantly enhanced the reasoning capabilities of vision-language models (VLMs). However, state-of-the-art approaches rely heavily on large-scale human-annotated datasets, which are costly and time-consuming to acquire. To overcome this limitation, we introduce V-Zero, a general post-training framework that facilitates self-improvement using exclusively unlabeled images. V-Zero establishes a co-evolutionary loop by instantiating two distinct roles: a Questioner and a Solver. The Questioner learns to synthesize high-quality, challenging questions by leveraging a dual-track reasoning reward that contrasts intuitive guesses with reasoned results. The Solver is optimized using pseudo-labels derived from majority voting over its own sampled responses. Both roles are trained iteratively via Group Relative Policy Optimization (GRPO), driving a cycle of mutual enhancement. Remarkably, without a single human annotation, V-Zero achieves consistent performance gains on Qwen2.5-VL-7B-Instruct, improving visual mathematical reasoning by +1.7 and general vision-centric by +2.6, demonstrating the potential of self-improvement in multimodal systems. Code is available at https://github.com/SatonoDia/V-Zero

[148] InfoSculpt: Sculpting the Latent Space for Generalized Category Discovery

Wenwen Liao, Hang Ruan, Jianbo Yu, Yuansong Wang, Qingchao Jiang, Xiaofeng Yang

Main category: cs.CV

TL;DR: InfoSculpt is a novel GCD framework that uses information bottleneck principles with dual conditional mutual information objectives to disentangle category-level signals from instance-specific noise.

Details

Motivation: Existing GCD methods rely on pseudo-labeling or two-stage clustering without principled mechanisms to disentangle essential category-defining signals from instance-specific noise, limiting their effectiveness for real-world open-world applications.

Method: InfoSculpt reframes GCD from an information-theoretic perspective using the Information Bottleneck principle. It minimizes a dual Conditional Mutual Information objective: Category-Level CMI on labeled data for compact discriminative representations of known classes, and Instance-Level CMI on all data to distill invariant features by compressing augmentation-induced noise.

Result: Extensive experiments on 8 benchmarks demonstrate that InfoSculpt achieves state-of-the-art performance, validating the effectiveness of the information-theoretic approach for disentangling categorical information from noise.

Conclusion: The information-theoretic framework provides a principled mechanism for GCD by systematically sculpting representation space to preserve categorical information while discarding instance-specific noise, offering a robust solution for real-world open-world applications.

Abstract: Generalized Category Discovery (GCD) aims to classify instances from both known and novel categories within a large-scale unlabeled dataset, a critical yet challenging task for real-world, open-world applications. However, existing methods often rely on pseudo-labeling, or two-stage clustering, which lack a principled mechanism to explicitly disentangle essential, category-defining signals from instance-specific noise. In this paper, we address this fundamental limitation by re-framing GCD from an information-theoretic perspective, grounded in the Information Bottleneck (IB) principle. We introduce InfoSculpt, a novel framework that systematically sculpts the representation space by minimizing a dual Conditional Mutual Information (CMI) objective. InfoSculpt uniquely combines a Category-Level CMI on labeled data to learn compact and discriminative representations for known classes, and a complementary Instance-Level CMI on all data to distill invariant features by compressing augmentation-induced noise. These two objectives work synergistically at different scales to produce a disentangled and robust latent space where categorical information is preserved while noisy, instance-specific details are discarded. Extensive experiments on 8 benchmarks demonstrate that InfoSculpt validating the effectiveness of our information-theoretic approach.

[149] FlowAct-R1: Towards Interactive Humanoid Video Generation

Lizhen Wang, Yongming Zhu, Zhipeng Ge, Youwei Zheng, Longhao Zhang, Tianshu Hu, Shiyang Qin, Mingshuang Luo, Jiaxu Zhang, Xin Chen, Yulong Wang, Zerong Zheng, Jianwen Jiang, Chao Liang, Weifeng Chen, Xing Wang, Yuan Zhang, Mingyuan Gao

Main category: cs.CV

TL;DR: FlowAct-R1 is a real-time interactive humanoid video generation framework that achieves 25fps at 480p with 1.5s TTFF, using MMDiT architecture with chunkwise diffusion forcing for temporal consistency.

Details

Motivation: Existing video synthesis methods struggle with balancing high-fidelity synthesis and real-time interaction requirements for interactive humanoid agents.

Method: Built on MMDiT architecture with chunkwise diffusion forcing strategy (including novel self-forcing variant) to prevent error accumulation, plus efficient distillation and system-level optimizations.

Result: Achieves stable 25fps at 480p resolution with 1.5s time-to-first-frame, provides holistic full-body control, and demonstrates exceptional behavioral vividness and perceptual realism across diverse character styles.

Conclusion: FlowAct-R1 successfully addresses the trade-off between quality and real-time performance for interactive humanoid video generation, enabling natural behavioral transitions in interactive scenarios.

Abstract: Interactive humanoid video generation aims to synthesize lifelike visual agents that can engage with humans through continuous and responsive video. Despite recent advances in video synthesis, existing methods often grapple with the trade-off between high-fidelity synthesis and real-time interaction requirements. In this paper, we propose FlowAct-R1, a framework specifically designed for real-time interactive humanoid video generation. Built upon a MMDiT architecture, FlowAct-R1 enables the streaming synthesis of video with arbitrary durations while maintaining low-latency responsiveness. We introduce a chunkwise diffusion forcing strategy, complemented by a novel self-forcing variant, to alleviate error accumulation and ensure long-term temporal consistency during continuous interaction. By leveraging efficient distillation and system-level optimizations, our framework achieves a stable 25fps at 480p resolution with a time-to-first-frame (TTFF) of only around 1.5 seconds. The proposed method provides holistic and fine-grained full-body control, enabling the agent to transition naturally between diverse behavioral states in interactive scenarios. Experimental results demonstrate that FlowAct-R1 achieves exceptional behavioral vividness and perceptual realism, while maintaining robust generalization across diverse character styles.

[150] MathDoc: Benchmarking Structured Extraction and Active Refusal on Noisy Mathematics Exam Papers

Chenyue Zhou, Jiayi Tuo, Shitong Qin, Wei Dai, Mingxuan Wang, Ziwei Zhao, Duoyang Li, Shiyang Su, Yanxi Lu, Yanbiao Ma

Main category: cs.CV

TL;DR: MathDoc is the first benchmark for document-level information extraction from authentic high school math exams, focusing on handling visual noise and evaluating models’ ability to refuse incomplete inputs.

Details

Motivation: Existing benchmarks focus on clean documents or generic layout analysis, overlooking structural integrity of math problems and models' ability to reject incomplete inputs. Real-world math exams have severe visual noise that challenges automated extraction.

Method: Created MathDoc benchmark with 3,609 curated questions from authentic high school math exams, including real-world artifacts and unrecognizable samples. Proposed multi-dimensional evaluation framework covering stem accuracy, visual similarity, and refusal capability.

Result: SOTA MLLMs (Qwen3-VL, Gemini-2.5-Pro) achieve strong extraction performance but consistently fail to refuse illegible inputs, producing confident but invalid outputs instead.

Conclusion: Reveals critical gap in current MLLMs’ reliability under degraded document conditions. MathDoc establishes a benchmark for assessing model refusal capability and robustness in real-world educational document analysis.

Abstract: The automated extraction of structured questions from paper-based mathematics exams is fundamental to intelligent education, yet remains challenging in real-world settings due to severe visual noise. Existing benchmarks mainly focus on clean documents or generic layout analysis, overlooking both the structural integrity of mathematical problems and the ability of models to actively reject incomplete inputs. We introduce MathDoc, the first benchmark for document-level information extraction from authentic high school mathematics exam papers. MathDoc contains \textbf{3,609} carefully curated questions with real-world artifacts and explicitly includes unrecognizable samples to evaluate active refusal behavior. We propose a multi-dimensional evaluation framework covering stem accuracy, visual similarity, and refusal capability. Experiments on SOTA MLLMs, including Qwen3-VL and Gemini-2.5-Pro, show that although end-to-end models achieve strong extraction performance, they consistently fail to refuse illegible inputs, instead producing confident but invalid outputs. These results highlight a critical gap in current MLLMs and establish MathDoc as a benchmark for assessing model reliability under degraded document conditions. Our project repository is available at \href{https://github.com/winnk123/papers/tree/master}{GitHub repository}

[151] Enhancing Visual In-Context Learning by Multi-Faceted Fusion

Wenwen Liao, Jianbo Yu, Yuansong Wang, Qingchao Jiang, Xiaofeng Yang

Main category: cs.CV

TL;DR: A novel multi-combination collaborative fusion framework for Visual In-Context Learning that generates three contextual representation branches from different prompt combinations, fed into a MULTI-VQGAN architecture for improved performance across diverse visual tasks.

Details

Motivation: Current "retrieve-then-prompt" approaches in Visual In-Context Learning typically select only the single best visual prompt, discarding valuable contextual information from other suitable candidates. Even recent top-K fusion methods simply collapse multiple signals into one representation, limiting reasoning capability. The authors argue that multi-faceted, collaborative fusion is needed to unlock the full potential of diverse contexts.

Method: Proposes a framework that moves beyond single-prompt fusion to multi-combination collaborative fusion. Instead of collapsing multiple prompts into one, the method generates three contextual representation branches, each formed by integrating information from different combinations of top-quality prompts. These complementary guidance signals are then fed into a proposed MULTI-VQGAN architecture designed to jointly interpret and utilize collaborative information from multiple sources.

Result: Extensive experiments on diverse tasks including foreground segmentation, single-object detection, and image colorization demonstrate strong cross-task generalization, effective contextual fusion, and ability to produce more robust and accurate predictions than existing methods.

Conclusion: The proposed multi-combination collaborative fusion framework successfully addresses limitations of current VICL approaches by better leveraging diverse contextual information through multiple complementary representation branches, leading to improved performance and robustness across various visual tasks.

Abstract: Visual In-Context Learning (VICL) has emerged as a powerful paradigm, enabling models to perform novel visual tasks by learning from in-context examples. The dominant “retrieve-then-prompt” approach typically relies on selecting the single best visual prompt, a practice that often discards valuable contextual information from other suitable candidates. While recent work has explored fusing the top-K prompts into a single, enhanced representation, this still simply collapses multiple rich signals into one, limiting the model’s reasoning capability. We argue that a more multi-faceted, collaborative fusion is required to unlock the full potential of these diverse contexts. To address this limitation, we introduce a novel framework that moves beyond single-prompt fusion towards an multi-combination collaborative fusion. Instead of collapsing multiple prompts into one, our method generates three contextual representation branches, each formed by integrating information from different combinations of top-quality prompts. These complementary guidance signals are then fed into proposed MULTI-VQGAN architecture, which is designed to jointly interpret and utilize collaborative information from multiple sources. Extensive experiments on diverse tasks, including foreground segmentation, single-object detection, and image colorization, highlight its strong cross-task generalization, effective contextual fusion, and ability to produce more robust and accurate predictions than existing methods.

[152] Beyond Single Prompts: Synergistic Fusion and Arrangement for VICL

Wenwen Liao, Jianbo Yu, Yuansong Wang, Shifu Yan, Xiaofeng Yang

Main category: cs.CV

TL;DR: VICL framework with adaptive fusion of multiple prompts, arrangement-specific MLPs, and bidirectional fine-tuning for better inpainting adaptation.

Details

Motivation: Existing Vision In-Context Learning methods have two key issues: (1) they only select the most similar prompt, discarding complementary cues from other high-quality prompts, and (2) they fail to exploit structured information from different prompt arrangements.

Method: End-to-end VICL framework with: (1) adaptive Fusion Module to aggregate critical patterns and annotations from multiple prompts, (2) arrangement-specific lightweight MLPs to decouple layout priors, and (3) bidirectional fine-tuning mechanism that swaps query and prompt roles to enhance collaboration.

Result: Superior results on foreground segmentation, single-object detection, and image colorization tasks, with strong cross-task generalization demonstrated.

Conclusion: The proposed VICL framework effectively addresses limitations of existing methods by better utilizing multiple prompts and their arrangements, leading to improved inpainting adaptation and generalization across visual tasks.

Abstract: Vision In-Context Learning (VICL) enables inpainting models to quickly adapt to new visual tasks from only a few prompts. However, existing methods suffer from two key issues: (1) selecting only the most similar prompt discards complementary cues from other high-quality prompts; and (2) failing to exploit the structured information implied by different prompt arrangements. We propose an end-to-end VICL framework to overcome these limitations. Firstly, an adaptive Fusion Module aggregates critical patterns and annotations from multiple prompts to form more precise contextual prompts. Secondly, we introduce arrangement-specific lightweight MLPs to decouple layout priors from the core model, while minimally affecting the overall model. In addition, an bidirectional fine-tuning mechanism swaps the roles of query and prompt, encouraging the model to reconstruct the original prompt from fused context and thus enhancing collaboration between the fusion module and the inpainting model. Experiments on foreground segmentation, single-object detection, and image colorization demonstrate superior results and strong cross-task generalization of our method.

[153] VQ-Seg: Vector-Quantized Token Perturbation for Semi-Supervised Medical Image Segmentation

Sicheng Yang, Zhaohu Xing, Lei Zhu

Main category: cs.CV

TL;DR: VQ-Seg introduces vector quantization for controllable feature perturbation in semi-supervised medical image segmentation, replacing dropout with a quantized perturbation module and incorporating foundation model guidance.

Details

Motivation: Existing consistency learning methods in semi-supervised medical segmentation rely on dropout-based feature perturbation, which requires careful manual tuning of dropout rates - a sensitive hyperparameter that's difficult to optimize and often leads to suboptimal regularization.

Method: Proposes VQ-Seg with: 1) Vector quantization to discretize feature space, 2) Quantized Perturbation Module (QPM) that shuffles spatial locations of codebook indices for controllable regularization, 3) Dual-branch architecture sharing post-quantization features between reconstruction and segmentation tasks, 4) Post-VQ Feature Adapter (PFA) to incorporate foundation model guidance for semantic information.

Result: Extensive experiments on a new large-scale Lung Cancer dataset (828 CT scans) and other public benchmarks show the method outperforms state-of-the-art approaches.

Conclusion: VQ-Seg provides an effective alternative to dropout-based perturbation with controllable regularization, addresses information loss through dual-branch architecture and foundation model guidance, and demonstrates superior performance on medical segmentation tasks.

Abstract: Consistency learning with feature perturbation is a widely used strategy in semi-supervised medical image segmentation. However, many existing perturbation methods rely on dropout, and thus require a careful manual tuning of the dropout rate, which is a sensitive hyperparameter and often difficult to optimize and may lead to suboptimal regularization. To overcome this limitation, we propose VQ-Seg, the first approach to employ vector quantization (VQ) to discretize the feature space and introduce a novel and controllable Quantized Perturbation Module (QPM) that replaces dropout. Our QPM perturbs discrete representations by shuffling the spatial locations of codebook indices, enabling effective and controllable regularization. To mitigate potential information loss caused by quantization, we design a dual-branch architecture where the post-quantization feature space is shared by both image reconstruction and segmentation tasks. Moreover, we introduce a Post-VQ Feature Adapter (PFA) to incorporate guidance from a foundation model (FM), supplementing the high-level semantic information lost during quantization. Furthermore, we collect a large-scale Lung Cancer (LC) dataset comprising 828 CT scans annotated for central-type lung carcinoma. Extensive experiments on the LC dataset and other public benchmarks demonstrate the effectiveness of our method, which outperforms state-of-the-art approaches. Code available at: https://github.com/script-Yang/VQ-Seg.

Linquan Wu, Tianxiang Jiang, Yifei Dong, Haoyu Yang, Fengji Zhang, Shichaang Meng, Ai Xuan, Linqi Song, Jacky Keung

Main category: cs.CV

TL;DR: LaViT addresses the perception gap in multimodal reasoning by aligning latent visual thoughts between teacher and student models, improving visual grounding through attention trajectory reconstruction.

Details

Motivation: Current multimodal reasoning methods rely on external supervision and language priors, ignoring intrinsic visual attention dynamics. There's a critical perception gap where students mimic teacher text outputs while attending to different visual regions.

Method: LaViT aligns latent visual thoughts rather than static embeddings. Students autoregressively reconstruct teacher’s visual semantics and attention trajectories before text generation, using curriculum sensory gating to prevent shortcut learning.

Result: LaViT significantly enhances visual grounding with up to +16.9% gains on complex reasoning tasks. A compact 3B model outperforms larger open-source variants and proprietary models like GPT-4o.

Conclusion: Aligning latent visual thoughts effectively bridges the perception gap in multimodal reasoning, enabling smaller models to achieve superior visual grounding and reasoning performance.

Abstract: Current multimodal latent reasoning often relies on external supervision (e.g., auxiliary images), ignoring intrinsic visual attention dynamics. In this work, we identify a critical Perception Gap in distillation: student models frequently mimic a teacher’s textual output while attending to fundamentally divergent visual regions, effectively relying on language priors rather than grounded perception. To bridge this, we propose LaViT, a framework that aligns latent visual thoughts rather than static embeddings. LaViT compels the student to autoregressively reconstruct the teacher’s visual semantics and attention trajectories prior to text generation, employing a curriculum sensory gating mechanism to prevent shortcut learning. Extensive experiments show that LaViT significantly enhances visual grounding, achieving up to +16.9% gains on complex reasoning tasks and enabling a compact 3B model to outperform larger open-source variants and proprietary models like GPT-4o.

[155] Advancing Adaptive Multi-Stage Video Anomaly Reasoning: A Benchmark Dataset and Method

Chao Huang, Benfeng Wang, Wei Wang, Jie Wen, Li Shen, Wenqi Ren, Yong Xu, Xiaochun Cao

Main category: cs.CV

TL;DR: The paper introduces Video Anomaly Reasoning (VAR), a new task that elevates video anomaly analysis from simple description to structured multi-stage reasoning, and presents a large dataset with 8,641 videos and 50K+ samples using a Perception-Cognition-Action Chain-of-Thought framework.

Details

Motivation: Current MLLM-based video anomaly detection methods are limited to anomaly localization or post-hoc description, lacking explicit reasoning processes, risk awareness, and decision-oriented interpretation. There's a need to move beyond descriptive understanding to structured reasoning.

Method: 1) Define VAR task requiring progressive reasoning over anomalous events; 2) Create large dataset with 8,641 videos and 50K+ samples using PerCoAct-CoT (Perception-Cognition-Action Chain-of-Thought) framework; 3) Propose Anomaly-Aware Group Relative Policy Optimization for weak supervision; 4) Develop Vad-R1-Plus, an end-to-end MLLM-based VAR model supporting adaptive hierarchical reasoning.

Result: The proposed benchmark and method effectively advance MLLM reasoning capabilities on VAR tasks, outperforming both open-source and proprietary baselines in extensive experiments.

Conclusion: The VAR task and associated dataset successfully elevate video anomaly analysis from descriptive understanding to structured multi-stage reasoning, enabling systematic evaluation of multi-stage and adaptive anomaly reasoning while enhancing reasoning reliability under weak supervision.

Abstract: Recent progress in reasoning capabilities of Multimodal Large Language Models(MLLMs) has highlighted their potential for performing complex video understanding tasks. However, in the domain of Video Anomaly Detection and Understanding (VAD&U), existing MLLM-based methods are largely limited to anomaly localization or post-hoc description, lacking explicit reasoning processes, risk awareness, and decision-oriented interpretation. To address this gap, we define a new task termed Video Anomaly Reasoning (VAR), which elevates video anomaly analysis from descriptive understanding to structured, multi-stage reasoning. VAR explicitly requires models to perform progressive reasoning over anomalous events before answering anomaly-related questions, encompassing visual perception, causal interpretation, and risk-aware decision making. To support this task, we present a new dataset with 8,641 videos, where each video is annotated with diverse question types corresponding to different reasoning depths, totaling more than 50,000 samples, making it one of the largest datasets for video anomaly. The annotations are based on a structured Perception-Cognition-Action Chain-of-Thought (PerCoAct-CoT), which formalizes domain-specific reasoning priors for video anomaly understanding. This design enables systematic evaluation of multi-stage and adaptive anomaly reasoning. In addition, we propose Anomaly-Aware Group Relative Policy Optimization to further enhance reasoning reliability under weak supervision. Building upon the proposed task and dataset, we develop an end-to-end MLLM-based VAR model termed Vad-R1-Plus, which supports adaptive hierarchical reasoning and risk-aware decision making. Extensive experiments demonstrate that the proposed benchmark and method effectively advance the reasoning capabilities of MLLMs on VAR tasks, outperforming both open-source and proprietary baselines.

[156] RAG-3DSG: Enhancing 3D Scene Graphs with Re-Shot Guided Retrieval-Augmented Generation

Yue Chang, Rufeng Chen, Zhaofan Zhang, Yi Chen, Sihong Xie

Main category: cs.CV

TL;DR: RAG-3DSG improves open-vocabulary 3D scene graph generation by reducing aggregation noise with uncertainty estimation and accelerating processing through dynamic downsample-mapping, achieving better accuracy and 3x speedup.

Details

Motivation: Existing open-vocabulary 3D scene graph generation methods suffer from low object recognition accuracy and slow speed due to constrained viewpoints, occlusions, and redundant surface density in multi-image scene reconstruction.

Method: Proposes RAG-3DSG with two key innovations: 1) Re-shot guided uncertainty estimation to mitigate aggregation noise and support object-level retrieval-augmented generation using reliable low-uncertainty objects, and 2) Dynamic downsample-mapping strategy for accelerated cross-image object aggregation with adaptive granularity.

Result: Experiments on Replica dataset show RAG-3DSG significantly improves node captioning accuracy in 3D scene graph generation while reducing mapping time by two-thirds (3x speedup) compared to vanilla version.

Conclusion: RAG-3DSG effectively addresses accuracy and speed limitations in open-vocabulary 3D scene graph generation through uncertainty-aware aggregation and adaptive processing strategies, making it more practical for robotics applications.

Abstract: Open-vocabulary 3D Scene Graph (3DSG) generation can enhance various downstream tasks in robotics, such as manipulation and navigation, by leveraging structured semantic representations. A 3DSG is constructed from multiple images of a scene, where objects are represented as nodes and relationships as edges. However, existing works for open-vocabulary 3DSG generation suffer from both low object-level recognition accuracy and speed, mainly due to constrained viewpoints, occlusions, and redundant surface density. To address these challenges, we propose RAG-3DSG to mitigate aggregation noise through re-shot guided uncertainty estimation and support object-level Retrieval-Augmented Generation (RAG) via reliable low-uncertainty objects. Furthermore, we propose a dynamic downsample-mapping strategy to accelerate cross-image object aggregation with adaptive granularity. Experiments on Replica dataset demonstrate that RAG-3DSG significantly improves node captioning accuracy in 3DSG generation while reducing the mapping time by two-thirds compared to the vanilla version.

[157] From Physical Degradation Models to Task-Aware All-in-One Image Restoration

Hu Gao, Xiaoning Lei, Xichen Xu, Xingjian Wang, Lizhuang Ma

Main category: cs.CV

TL;DR: OPIR: A two-stage framework for efficient all-in-one image restoration using predicted task-aware inverse degradation operators with uncertainty guidance.

Details

Motivation: Existing all-in-one image restoration methods use complex prompt systems or large models that increase system complexity and hinder real-time applicability. There's a need for more efficient approaches that maintain performance while being computationally practical.

Method: Two-stage framework: 1) Predict task-aware inverse degradation operator to produce initial restoration with uncertainty map highlighting difficult regions; 2) Refine restoration guided by uncertainty map. Uses same inverse operator prediction network in both stages with task-aware parameters, and accelerates convolution for efficiency.

Result: Extensive experiments demonstrate superior all-in-one restoration performance while remaining highly competitive on task-aligned restoration. The method achieves efficient restoration through the accelerated convolution approach.

Conclusion: OPIR provides a tightly integrated, efficient architecture for all-in-one image restoration that outperforms existing methods while maintaining computational efficiency suitable for real-time applications.

Abstract: All-in-one image restoration aims to adaptively handle multiple restoration tasks with a single trained model. Although existing methods achieve promising results by introducing prompt information or leveraging large models, the added learning modules increase system complexity and hinder real-time applicability. In this paper, we adopt a physical degradation modeling perspective and predict a task-aware inverse degradation operator for efficient all-in-one image restoration. The framework consists of two stages. In the first stage, the predicted inverse operator produces an initial restored image together with an uncertainty perception map that highlights regions difficult to reconstruct, ensuring restoration reliability. In the second stage, the restoration is further refined under the guidance of this uncertainty map. The same inverse operator prediction network is used in both stages, with task-aware parameters introduced after operator prediction to adapt to different degradation tasks. Moreover, by accelerating the convolution of the inverse operator, the proposed method achieves efficient all-in-one image restoration. The resulting tightly integrated architecture, termed OPIR, is extensively validated through experiments, demonstrating superior all-in-one restoration performance while remaining highly competitive on task-aligned restoration.

[158] ELITE: Efficient Gaussian Head Avatar from a Monocular Video via Learned Initialization and TEst-time Generative Adaptation

Kim Youwang, Lee Hyoseok, Subin Park, Gerard Pons-Moll, Tae-Hyun Oh

Main category: cs.CV

TL;DR: ELITE is an efficient Gaussian head avatar synthesis system from monocular video that combines 3D and 2D priors for high-fidelity results with strong in-the-wild generalization and 60x faster synthesis than 2D generative prior methods.

Details

Motivation: Prior methods have limitations: 3D data prior methods struggle with in-the-wild generalization, while 2D generative prior methods are computationally heavy and prone to identity hallucination. The authors identify complementary synergy between these two priors and aim to create an efficient system that overcomes these limitations.

Method: 1) Feed-forward Mesh2Gaussian Prior Model (MGPM) for fast Gaussian avatar initialization. 2) Test-time generative adaptation using both real and synthetic images. 3) Rendering-guided single-step diffusion enhancer that restores missing visual details grounded on Gaussian avatar renderings, avoiding slow and hallucination-prone full diffusion denoising.

Result: ELITE produces visually superior avatars to prior works, even for challenging expressions, while achieving 60x faster synthesis than 2D generative prior methods. The system demonstrates strong in-the-wild generalization capabilities.

Conclusion: ELITE successfully combines the strengths of 3D and 2D priors to create an efficient, high-fidelity avatar synthesis system that overcomes the limitations of previous approaches, achieving both quality and speed improvements.

Abstract: We introduce ELITE, an Efficient Gaussian head avatar synthesis from a monocular video via Learned Initialization and TEst-time generative adaptation. Prior works rely either on a 3D data prior or a 2D generative prior to compensate for missing visual cues in monocular videos. However, 3D data prior methods often struggle to generalize in-the-wild, while 2D generative prior methods are computationally heavy and prone to identity hallucination. We identify a complementary synergy between these two priors and design an efficient system that achieves high-fidelity animatable avatar synthesis with strong in-the-wild generalization. Specifically, we introduce a feed-forward Mesh2Gaussian Prior Model (MGPM) that enables fast initialization of a Gaussian avatar. To further bridge the domain gap at test time, we design a test-time generative adaptation stage, leveraging both real and synthetic images as supervision. Unlike previous full diffusion denoising strategies that are slow and hallucination-prone, we propose a rendering-guided single-step diffusion enhancer that restores missing visual details, grounded on Gaussian avatar renderings. Our experiments demonstrate that ELITE produces visually superior avatars to prior works, even for challenging expressions, while achieving 60x faster synthesis than the 2D generative prior method.

[159] Beyond Inpainting: Unleash 3D Understanding for Precise Camera-Controlled Video Generation

Dong-Yu Chen, Yixin Guo, Shuojin Yang, Tai-Jiang Mu, Shi-Min Hu

Main category: cs.CV

TL;DR: DepthDirector: A video re-rendering framework that enables precise camera trajectory control while preserving video content consistency by leveraging depth guidance and dual-stream conditioning.

Details

Motivation: Existing methods for camera control in video generation often fail to fully leverage 3D priors of video diffusion models, leading to subject inconsistency and degraded quality (the "Inpainting Trap"). There's a need for precise camera trajectory alteration while faithfully preserving video content.

Method: Proposes DepthDirector with View-Content Dual-Stream Condition mechanism that injects both source video and warped depth sequence rendered under target viewpoint into pretrained video diffusion models. Uses lightweight LoRA-based video diffusion adapter to preserve model priors. Constructs MultiCam-WarpData dataset with 8K videos across 1K dynamic scenes using Unreal Engine 5.

Result: Outperforms existing methods in both camera controllability and visual quality. The framework enables faithful reproduction of dynamic scenes under novel camera trajectories while maintaining content consistency.

Conclusion: DepthDirector successfully addresses the camera control challenge by leveraging geometric guidance signals to help video diffusion models comprehend camera movements and utilize their 3D understanding capabilities, achieving precise control without sacrificing content quality.

Abstract: Camera control has been extensively studied in conditioned video generation; however, performing precisely altering the camera trajectories while faithfully preserving the video content remains a challenging task. The mainstream approach to achieving precise camera control is warping a 3D representation according to the target trajectory. However, such methods fail to fully leverage the 3D priors of video diffusion models (VDMs) and often fall into the Inpainting Trap, resulting in subject inconsistency and degraded generation quality. To address this problem, we propose DepthDirector, a video re-rendering framework with precise camera controllability. By leveraging the depth video from explicit 3D representation as camera-control guidance, our method can faithfully reproduce the dynamic scene of an input video under novel camera trajectories. Specifically, we design a View-Content Dual-Stream Condition mechanism that injects both the source video and the warped depth sequence rendered under the target viewpoint into the pretrained video generation model. This geometric guidance signal enables VDMs to comprehend camera movements and leverage their 3D understanding capabilities, thereby facilitating precise camera control and consistent content generation. Next, we introduce a lightweight LoRA-based video diffusion adapter to train our framework, fully preserving the knowledge priors of VDMs. Additionally, we construct a large-scale multi-camera synchronized dataset named MultiCam-WarpData using Unreal Engine 5, containing 8K videos across 1K dynamic scenes. Extensive experiments show that DepthDirector outperforms existing methods in both camera controllability and visual quality. Our code and dataset will be publicly available.

[160] Attend to what I say: Highlighting relevant content on slides

Megha Mariam K M, C. V. Jawahar

Main category: cs.CV

TL;DR: Automatic slide region highlighting system that synchronizes speaker narration with relevant visual elements in presentation slides to reduce cognitive load and improve comprehension.

Details

Motivation: Addresses the cognitive strain caused by the disconnect between spoken narration and visual slide content during presentations, especially in fast-paced or content-heavy talks where listeners struggle to identify relevant slide regions while following the speaker.

Method: Introduces a method that analyzes spoken content and matches it with textual or graphical elements in slides to automatically identify and highlight the most relevant slide regions based on the speaker’s narrative.

Result: The approach explores different ways of solving the synchronization problem and assesses their success and failure cases, with code and dataset made publicly available.

Conclusion: This work addresses the emerging requirement for analyzing multimedia documents to enable seamless understanding of content-rich videos by reducing cognitive strain and improving comprehension in educational and conference settings.

Abstract: Imagine sitting in a presentation, trying to follow the speaker while simultaneously scanning the slides for relevant information. While the entire slide is visible, identifying the relevant regions can be challenging. As you focus on one part of the slide, the speaker moves on to a new sentence, leaving you scrambling to catch up visually. This constant back-and-forth creates a disconnect between what is being said and the most important visual elements, making it hard to absorb key details, especially in fast-paced or content-heavy presentations such as conference talks. This requires an understanding of slides, including text, graphics, and layout. We introduce a method that automatically identifies and highlights the most relevant slide regions based on the speaker’s narrative. By analyzing spoken content and matching it with textual or graphical elements in the slides, our approach ensures better synchronization between what listeners hear and what they need to attend to. We explore different ways of solving this problem and assess their success and failure cases. Analyzing multimedia documents is emerging as a key requirement for seamless understanding of content-rich videos, such as educational videos and conference talks, by reducing cognitive strain and improving comprehension. Code and dataset are available at: https://github.com/meghamariamkm2002/Slide_Highlight

[161] DanQing: An Up-to-Date Large-Scale Chinese Vision-Language Pre-training Dataset

Hengyu Shen, Tiancheng Gu, Bin Qin, Lan Wu, Yuling Wu, Shuo Tan, Zelong Sun, Jun Wang, Nan Wu, Xiang An, Weidong Cai, Ziyong Feng, Kaicheng Yang

Main category: cs.CV

TL;DR: DanQing is a new 100M Chinese image-text dataset with superior quality and recent data (2024-2025) that outperforms existing datasets for Chinese vision-language pretraining tasks.

Details

Motivation: Chinese vision-language pretraining has lagged behind English due to scarcity of high-quality Chinese image-text data, despite the success of models like CLIP and SigLIP in English.

Method: Developed a comprehensive pipeline to construct high-quality Chinese cross-modal dataset from Common Crawl, with rigorous selection process and focus on recent 2024-2025 web data.

Result: DanQing consistently achieves superior performance across Chinese downstream tasks including zero-shot classification, cross-modal retrieval, and LMM-based evaluations when used for continual pre-training of SigLIP2 model.

Conclusion: DanQing addresses the Chinese VLP data gap with high-quality, recent data that captures evolving semantic trends, and will be open-sourced under CC-BY 4.0 license to advance Chinese vision-language research.

Abstract: Vision-Language Pre-training (VLP) models demonstrate strong performance across various downstream tasks by learning from large-scale image-text pairs through contrastive pretraining. The release of extensive English image-text datasets (e.g., COYO-700M and LAION-400M) has enabled widespread adoption of models such as CLIP and SigLIP in tasks including cross-modal retrieval and image captioning. However, the advancement of Chinese vision-language pretraining has substantially lagged behind, due to the scarcity of high-quality Chinese image-text data. To address this gap, we develop a comprehensive pipeline for constructing a high-quality Chinese cross-modal dataset. As a result, we propose DanQing, which contains 100 million image-text pairs collected from Common Crawl. Different from existing datasets, DanQing is curated through a more rigorous selection process, yielding superior data quality. Moreover, DanQing is primarily built from 2024-2025 web data, enabling models to better capture evolving semantic trends and thus offering greater practical utility. We compare DanQing with existing datasets by continual pre-training of the SigLIP2 model. Experimental results show that DanQing consistently achieves superior performance across a range of Chinese downstream tasks, including zero-shot classification, cross-modal retrieval, and LMM-based evaluations. To facilitate further research in Chinese vision-language pre-training, we will open-source the DanQing dataset under the Creative Common CC-BY 4.0 license.

[162] ROMA: Real-time Omni-Multimodal Assistant with Interactive Streaming Understanding

Xueyun Tian, Wei Li, Bingbing Xu, Heng Dong, Yuanzhuo Wang, Huawei Shen

Main category: cs.CV

TL;DR: ROMA is a real-time omni-multimodal assistant that unifies reactive and proactive interaction for streaming audio-video understanding, addressing modality granularity mismatches and enabling precise triggering for autonomous monitoring.

Details

Motivation: Existing approaches for streaming audio-video understanding suffer from disjointed capabilities - they typically have incomplete modality support or lack autonomous proactive monitoring capabilities needed for real-time applications.

Method: ROMA processes continuous inputs as synchronized multimodal units, aligning dense audio with discrete video frames. It uses a lightweight speak head that decouples response initiation from generation for precise triggering. Training involves a curated streaming dataset and two-stage curriculum for streaming format adaptation and proactive responsiveness.

Result: Extensive experiments across 12 benchmarks show ROMA achieves state-of-the-art performance on proactive tasks (alert, narration) while remaining competitive in reactive settings (QA), validating its robustness in unified real-time omni-multimodal understanding.

Conclusion: ROMA successfully addresses the challenges of streaming audio-video understanding by providing a unified framework for both reactive and proactive interaction, with synchronized multimodal processing and precise triggering mechanisms that enable robust real-time performance.

Abstract: Recent Omni-multimodal Large Language Models show promise in unified audio, vision, and text modeling. However, streaming audio-video understanding remains challenging, as existing approaches suffer from disjointed capabilities: they typically exhibit incomplete modality support or lack autonomous proactive monitoring. To address this, we present ROMA, a real-time omni-multimodal assistant for unified reactive and proactive interaction. ROMA processes continuous inputs as synchronized multimodal units, aligning dense audio with discrete video frames to handle granularity mismatches. For online decision-making, we introduce a lightweight speak head that decouples response initiation from generation to ensure precise triggering without task conflict. We train ROMA with a curated streaming dataset and a two-stage curriculum that progressively optimizes for streaming format adaptation and proactive responsiveness. To standardize the fragmented evaluation landscape, we reorganize diverse benchmarks into a unified suite covering both proactive (alert, narration) and reactive (QA) settings. Extensive experiments across 12 benchmarks demonstrate ROMA achieves state-of-the-art performance on proactive tasks while competitive in reactive settings, validating its robustness in unified real-time omni-multimodal understanding.

[163] Think-Then-Generate: Reasoning-Aware Text-to-Image Diffusion with LLM Encoders

Siqi Kou, Jiachun Jin, Zetong Zhou, Ye Ma, Yugang Wang, Quan Chen, Peng Jiang, Xiao Yang, Jun Zhu, Kai Yu, Zhijie Deng

Main category: cs.CV

TL;DR: T2G paradigm enables LLM-based text encoders to reason about and rewrite user prompts before image generation, improving factual consistency and semantic alignment in text-to-image models.

Details

Motivation: Current T2I diffusion models treat LLMs as mere text encoders without leveraging their reasoning capabilities, leading to literal generation that lacks deeper understanding of what should be visually depicted from textual prompts.

Method: Propose think-then-generate (T2G) paradigm: 1) Lightweight supervised fine-tuning to activate LLM’s think-then-rewrite pattern, 2) Co-optimization of LLM encoder and diffusion backbone via Dual-GRPO with image-grounded rewards for reasoning and semantic consistency.

Result: Substantial improvements in factual consistency, semantic alignment, and visual realism across reasoning-based benchmarks, achieving 0.79 on WISE score (nearly on par with GPT-4).

Conclusion: T2G represents a promising step toward next-generation unified models with integrated reasoning, expression, and demonstration capabilities, moving beyond literal text-to-pixel mapping.

Abstract: Recent progress in text-to-image (T2I) diffusion models (DMs) has enabled high-quality visual synthesis from diverse textual prompts. Yet, most existing T2I DMs, even those equipped with large language model (LLM)-based text encoders, remain text-pixel mappers – they employ LLMs merely as text encoders, without leveraging their inherent reasoning capabilities to infer what should be visually depicted given the textual prompt. To move beyond such literal generation, we propose the think-then-generate (T2G) paradigm, where the LLM-based text encoder is encouraged to reason about and rewrite raw user prompts; the states of the rewritten prompts then serve as diffusion conditioning. To achieve this, we first activate the think-then-rewrite pattern of the LLM encoder with a lightweight supervised fine-tuning process. Subsequently, the LLM encoder and diffusion backbone are co-optimized to ensure faithful reasoning about the context and accurate rendering of the semantics via Dual-GRPO. In particular, the text encoder is reinforced using image-grounded rewards to infer and recall world knowledge, while the diffusion backbone is pushed to produce semantically consistent and visually coherent images. Experiments show substantial improvements in factual consistency, semantic alignment, and visual realism across reasoning-based image generation and editing benchmarks, achieving 0.79 on WISE score, nearly on par with GPT-4. Our results constitute a promising step toward next-generation unified models with reasoning, expression, and demonstration capacities.

[164] An analytic theory of convolutional neural network inverse problems solvers

Minh Hai Nguyen, Quoc Bao Do, Edouard Pauwels, Pierre Weiss

Main category: cs.CV

TL;DR: The paper analyzes supervised CNNs for imaging inverse problems through the lens of MMSE estimator with CNN constraints (translation equivariance and locality), deriving an interpretable LE-MMSE formula that matches neural network outputs in experiments.

Details

Motivation: Despite empirical success of CNNs in solving imaging inverse problems, they are poorly understood theoretically and treated as black boxes. The paper aims to bridge this gap by providing theoretical understanding of trained neural networks.

Method: Analyze trained neural networks through MMSE estimator with functional constraints capturing CNN inductive biases: translation equivariance and locality via finite receptive fields. Derive analytic, interpretable LE-MMSE formula under empirical training distribution.

Result: Extensive experiments across inverse problems (denoising, inpainting, deconvolution), datasets (FFHQ, CIFAR-10, FashionMNIST), and architectures (U-Net, ResNet, PatchMLP) show theory matches neural network outputs with PSNR ≥25dB. Provides insights into physics-aware vs physics-agnostic estimators, impact of high-density training regions, and other factors.

Conclusion: The paper successfully bridges theory and practice by deriving interpretable LE-MMSE formula that explains CNN behavior in imaging inverse problems, providing theoretical understanding of previously black-box methods and insights into key factors affecting performance.

Abstract: Supervised convolutional neural networks (CNNs) are widely used to solve imaging inverse problems, achieving state-of-the-art performance in numerous applications. However, despite their empirical success, these methods are poorly understood from a theoretical perspective and often treated as black boxes. To bridge this gap, we analyze trained neural networks through the lens of the Minimum Mean Square Error (MMSE) estimator, incorporating functional constraints that capture two fundamental inductive biases of CNNs: translation equivariance and locality via finite receptive fields. Under the empirical training distribution, we derive an analytic, interpretable, and tractable formula for this constrained variant, termed Local-Equivariant MMSE (LE-MMSE). Through extensive numerical experiments across various inverse problems (denoising, inpainting, deconvolution), datasets (FFHQ, CIFAR-10, FashionMNIST), and architectures (U-Net, ResNet, PatchMLP), we demonstrate that our theory matches the neural networks outputs (PSNR $\gtrsim25$dB). Furthermore, we provide insights into the differences between \emph{physics-aware} and \emph{physics-agnostic} estimators, the impact of high-density regions in the training (patch) distribution, and the influence of other factors (dataset size, patch size, etc).

[165] Fine-Grained Human Pose Editing Assessment via Layer-Selective MLLMs

Ningyu Sun, Zhaolin Cai, Zitong Xu, Peihang Chen, Huiyu Duan, Yichao Yan, Xiongkuo Min, Xiaokang Yang

Main category: cs.CV

TL;DR: HPE-Bench: A specialized benchmark and unified MLLM framework for evaluating text-guided human pose editing, addressing structural anomalies and generative artifacts through authenticity detection and multi-dimensional quality assessment.

Details

Motivation: Text-guided human pose editing suffers from structural anomalies and generative artifacts, while existing evaluation metrics fail to provide fine-grained insights into pose-specific inconsistencies by isolating authenticity detection from quality assessment.

Method: Introduces HPE-Bench with 1,700 standardized samples from 17 SOTA editing models, and proposes a unified MLLM framework using contrastive LoRA tuning and layer sensitivity analysis (LSA) to identify optimal feature layers for pose evaluation.

Result: The framework achieves superior performance in both authenticity detection and multi-dimensional quality regression, effectively bridging forensic detection and quality assessment.

Conclusion: HPE-Bench and the proposed MLLM framework provide a comprehensive solution for evaluating text-guided human pose editing, addressing the limitations of existing metrics through integrated authenticity and quality assessment.

Abstract: Text-guided human pose editing has gained significant traction in AIGC applications. However,it remains plagued by structural anomalies and generative artifacts. Existing evaluation metrics often isolate authenticity detection from quality assessment, failing to provide fine-grained insights into pose-specific inconsistencies. To address these limitations, we introduce HPE-Bench, a specialized benchmark comprising 1,700 standardized samples from 17 state-of-the-art editing models, offering both authenticity labels and multi-dimensional quality scores. Furthermore, we propose a unified framework based on layer-selective multimodal large language models (MLLMs). By employing contrastive LoRA tuning and a novel layer sensitivity analysis (LSA) mechanism, we identify the optimal feature layer for pose evaluation. Our framework achieves superior performance in both authenticity detection and multi-dimensional quality regression, effectively bridging the gap between forensic detection and quality assessment.

[166] Global Context Compression with Interleaved Vision-Text Transformation

Dian Jiao, Jiaxin Duan, Shuai Zhao, Jiabing Leng, Yiran Zhang, Feng Huang

Main category: cs.CV

TL;DR: VIST2 is a novel Transformer that compresses text into visual tokens for both prefilling and inference, achieving 4× compression with 3× faster first-token generation and 77% memory reduction.

Details

Motivation: Existing vision-language models compress text into images only during prefilling, but fail to save computational costs during token-by-token inference. There's a need for global context compression that works at both prefilling and inference stages.

Method: VIST2 interleaves text chunks with their visual encoding, using only visual tokens in pre-context to predict next text tokens. Text chunks are rendered into sketch images, trained with curriculum-scheduled pretraining for optical language modeling, followed by modal-interleaved instruction tuning.

Result: With 4× compression ratio, VIST2 models (0.6B to 8B) achieve 3× speedup in first-token generation, 77% reduction in memory usage, and 74% reduction in FLOPS, outperforming baselines on long writing tasks.

Conclusion: VIST2 demonstrates effective global context compression through visual encoding, significantly improving efficiency in both prefilling and inference stages while maintaining performance on long writing tasks.

Abstract: Recent achievements of vision-language models in end-to-end OCR point to a new avenue for low-loss compression of textual information. This motivates earlier works that render the Transformer’s input into images for prefilling, which effectively reduces the number of tokens through visual encoding, thereby alleviating the quadratically increased Attention computations. However, this partial compression fails to save computational or memory costs at token-by-token inference. In this paper, we investigate global context compression, which saves tokens at both prefilling and inference stages. Consequently, we propose VIST2, a novel Transformer that interleaves input text chunks alongside their visual encoding, while depending exclusively on visual tokens in the pre-context to predict the next text token distribution. Around this idea, we render text chunks into sketch images and train VIST2 in multiple stages, starting from curriculum-scheduled pretraining for optical language modeling, followed by modal-interleaved instruction tuning. We conduct extensive experiments using VIST2 families scaled from 0.6B to 8B to explore the training recipe and hyperparameters. With a 4$\times$ compression ratio, the resulting models demonstrate significant superiority over baselines on long writing tasks, achieving, on average, a 3$\times$ speedup in first-token generation, 77% reduction in memory usage, and 74% reduction in FLOPS. Our codes and datasets will be public to support further studies.

[167] Multi-Temporal Frames Projection for Dynamic Processes Fusion in Fluorescence Microscopy

Hassan Eshkiki, Sarah Costa, Mostafa Mohammadpour, Farinaz Tanhaei, Christopher H. George, Fabio Caraffini

Main category: cs.CV

TL;DR: A computational framework that integrates multiple time-resolved microscopy frames into a single high-quality composite image, preserving biological content while reducing noise and variability.

Details

Motivation: Fluorescence microscopy recordings of living biological samples are often limited by noise, temporal variability, and inconsistent visualization of oscillating signals, which reduces their utility for analysis.

Method: A unique computational framework combining explainable techniques from different computer vision fields to fuse multiple time-resolved frames into a single high-quality image while preserving original biological content.

Result: The method achieves 44% average increase in cell count compared to previous methods when tested on 111 configurations using challenging cardiac cell datasets with dynamic, heterogeneous, and morphologically complex 2D monolayers.

Conclusion: The proposed pipeline effectively generates composite images that preserve and enhance quality and information from individual microscopy frames, and is applicable to other imaging domains requiring multi-temporal image fusion for annotation and segmentation tasks.

Abstract: Fluorescence microscopy is widely employed for the analysis of living biological samples; however, the utility of the resulting recordings is frequently constrained by noise, temporal variability, and inconsistent visualisation of signals that oscillate over time. We present a unique computational framework that integrates information from multiple time-resolved frames into a single high-quality image, while preserving the underlying biological content of the original video. We evaluate the proposed method through an extensive number of configurations (n = 111) and on a challenging dataset comprising dynamic, heterogeneous, and morphologically complex 2D monolayers of cardiac cells. Results show that our framework, which consists of a combination of explainable techniques from different computer vision application fields, is capable of generating composite images that preserve and enhance the quality and information of individual microscopy frames, yielding 44% average increase in cell count compared to previous methods. The proposed pipeline is applicable to other imaging domains that require the fusion of multi-temporal image stacks into high-quality 2D images, thereby facilitating annotation and downstream segmentation.

[168] Lunar-G2R: Geometry-to-Reflectance Learning for High-Fidelity Lunar BRDF Estimation

Clementine Grethen, Nicolas Menga, Roland Brochard, Geraldine Morin, Simone Gasparini, Jeremy Lebreton, Manuel Sanchez Gestido

Main category: cs.CV

TL;DR: Lunar-G2R predicts spatially varying BRDF parameters from lunar terrain geometry using U-Net and differentiable rendering, achieving 38% photometric error reduction over baselines without needing multi-view imagery or special hardware.

Details

Motivation: Existing lunar rendering pipelines use simplified or uniform BRDF models that fail to capture local reflectance variations, limiting photometric realism for high-fidelity rendering and vision-based navigation.

Method: Uses a geometry-to-reflectance learning framework (U-Net with differentiable rendering) to predict spatially varying BRDF parameters directly from lunar digital elevation models, minimizing photometric discrepancies between real orbital images and physically based renderings.

Result: Reduces photometric error by 38% compared to state-of-the-art baseline, achieves higher PSNR and SSIM, and captures fine-scale reflectance variations absent from spatially uniform models on held-out Tycho crater region.

Conclusion: First method to infer spatially varying reflectance model directly from terrain geometry, enabling realistic lunar surface rendering without multi-view imagery or specialized hardware at inference time.

Abstract: We address the problem of estimating realistic, spatially varying reflectance for complex planetary surfaces such as the lunar regolith, which is critical for high-fidelity rendering and vision-based navigation. Existing lunar rendering pipelines rely on simplified or spatially uniform BRDF models whose parameters are difficult to estimate and fail to capture local reflectance variations, limiting photometric realism. We propose Lunar-G2R, a geometry-to-reflectance learning framework that predicts spatially varying BRDF parameters directly from a lunar digital elevation model (DEM), without requiring multi-view imagery, controlled illumination, or dedicated reflectance-capture hardware at inference time. The method leverages a U-Net trained with differentiable rendering to minimize photometric discrepancies between real orbital images and physically based renderings under known viewing and illumination geometry. Experiments on a geographically held-out region of the Tycho crater show that our approach reduces photometric error by 38 % compared to a state-of-the-art baseline, while achieving higher PSNR and SSIM and improved perceptual similarity, capturing fine-scale reflectance variations absent from spatially uniform models. To our knowledge, this is the first method to infer a spatially varying reflectance model directly from terrain geometry.

[169] Urban Socio-Semantic Segmentation with Vision-Language Reasoning

Yu Wang, Yi Wang, Rui Dai, Yujie Wang, Kaikui Liu, Xiangxiang Chu, Yansheng Li

Main category: cs.CV

TL;DR: A new vision-language reasoning framework called SocioReasoner achieves socio-semantic segmentation of urban entities from satellite imagery by simulating human annotation processes through cross-modal recognition and multi-stage reasoning.

Details

Motivation: Current segmentation models can reliably segment entities defined by physical attributes (like buildings, water bodies) but struggle with socially defined categories (like schools, parks). There's a need to bridge this gap for better urban analysis and downstream applications.

Method: Proposes SocioReasoner framework that simulates human process of identifying social semantic entities via cross-modal recognition and multi-stage reasoning. Uses reinforcement learning to optimize this non-differentiable process and elicit reasoning capabilities of vision-language models. Also introduces SocioSeg dataset with satellite imagery, digital maps, and pixel-level labels of social semantic entities in hierarchical structure.

Result: Experiments demonstrate gains over state-of-the-art models and show strong zero-shot generalization capabilities. The approach successfully segments socially defined categories that previous models struggled with.

Conclusion: The proposed SocioReasoner framework effectively addresses the challenge of socio-semantic segmentation in urban environments by leveraging vision-language reasoning, showing promising results for segmenting socially defined entities from satellite imagery.

Abstract: As hubs of human activity, urban surfaces consist of a wealth of semantic entities. Segmenting these various entities from satellite imagery is crucial for a range of downstream applications. Current advanced segmentation models can reliably segment entities defined by physical attributes (e.g., buildings, water bodies) but still struggle with socially defined categories (e.g., schools, parks). In this work, we achieve socio-semantic segmentation by vision-language model reasoning. To facilitate this, we introduce the Urban Socio-Semantic Segmentation dataset named SocioSeg, a new resource comprising satellite imagery, digital maps, and pixel-level labels of social semantic entities organized in a hierarchical structure. Additionally, we propose a novel vision-language reasoning framework called SocioReasoner that simulates the human process of identifying and annotating social semantic entities via cross-modal recognition and multi-stage reasoning. We employ reinforcement learning to optimize this non-differentiable process and elicit the reasoning capabilities of the vision-language model. Experiments demonstrate our approach’s gains over state-of-the-art models and strong zero-shot generalization. Our dataset and code are available in https://github.com/AMAP-ML/SocioReasoner.

[170] mergetune: Continued fine-tuning of vision-language models

Wenqing Wang, Da Li, Xiatian Zhu, Josef Kittler

Main category: cs.CV

TL;DR: MERGETUNE introduces continued fine-tuning (CFT) to recover pretrained knowledge lost during VLMs fine-tuning, using linear mode connectivity to merge zero-shot and fine-tuned models without architectural changes.

Details

Motivation: Fine-tuning vision-language models like CLIP often causes catastrophic forgetting of pretrained knowledge, and existing methods can't fully prevent this forgetting during adaptation.

Method: Proposes MERGETUNE, a model-agnostic CFT strategy guided by linear mode connectivity. It continues fine-tuning trainable parameters to find a model with low-loss paths to both zero-shot and fine-tuned solutions, using a second-order surrogate to approximate LMC constraints without data replay.

Result: Improves harmonic mean of CoOp by +5.6% on base-novel generalization without adding parameters, achieves superior performance over CLIP on DTD and EuroSAT, and surpasses ensemble baselines with lower inference cost.

Conclusion: MERGETUNE effectively recovers pretrained knowledge lost during fine-tuning through continued fine-tuning guided by linear mode connectivity, offering a practical post-hoc solution without architectural changes.

Abstract: Fine-tuning vision-language models (VLMs) such as CLIP often leads to catastrophic forgetting of pretrained knowledge. Prior work primarily aims to mitigate forgetting during adaptation; however, forgetting often remains inevitable during this process. We introduce a novel paradigm, \emph{continued fine-tuning (CFT)}, which seeks to recover pretrained knowledge after a zero-shot model has already been adapted. We propose a simple, model-agnostic CFT strategy (named MERGETUNE) guided by linear mode connectivity (LMC), which can be applied post hoc to existing fine-tuned models without requiring architectural changes. Given a fine-tuned model, we continue fine-tuning its trainable parameters (e.g., soft prompts or linear heads) to search for a continued model which has two low-loss paths to the zero-shot (e.g., CLIP) and the fine-tuned (e.g., CoOp) solutions. By exploiting the geometry of the loss landscape, the continued model implicitly merges the two solutions, restoring pretrained knowledge lost in the fine-tuned counterpart. A challenge is that the vanilla LMC constraint requires data replay from the pretraining task. We approximate this constraint for the zero-shot model via a second-order surrogate, eliminating the need for large-scale data replay. Experiments show that MERGETUNE improves the harmonic mean of CoOp by +5.6% on base-novel generalisation without adding parameters. % We show \emph{the first time} superior performance than CLIP on both DTD and EuroSAT, on cross-dataset transfer. On robust fine-tuning evaluations, the LMC-merged model from MERGETUNE surpasses ensemble baselines with lower inference cost, achieving further gains and state-of-the-art results when ensembled with the zero-shot model. Our code is available at \href{https://github.com/Surrey-UP-Lab/MERGETUNE}{https://github.com/Surrey-UP-Lab/MERGETUNE}.

[171] SatMap: Revisiting Satellite Maps as Prior for Online HD Map Construction

Kanak Mazumder, Fabian B. Flohr

Main category: cs.CV

TL;DR: SatMap integrates satellite maps with multi-view camera observations to predict vectorized HD maps for autonomous driving, overcoming depth perception and occlusion issues.

Details

Motivation: Online HD map construction is essential for autonomous driving, but camera-based approaches suffer from limited depth perception and occlusion problems. Satellite maps provide global BEV perspective with lane-level semantics that can mitigate these issues.

Method: SatMap integrates satellite imagery (providing lane-level semantics and texture from BEV perspective) with multi-view camera observations to directly predict vectorized HD maps. The satellite maps serve as a global prior to address depth ambiguity and occlusion.

Result: On nuScenes dataset: 34.8% mAP improvement over camera-only baseline, 8.5% mAP improvement over camera-LiDAR fusion baseline. Shows advantages in long-range and adverse weather conditions.

Conclusion: Satellite prior maps effectively enhance HD map estimation by providing global BEV perspective, addressing depth and occlusion limitations of camera-based approaches, and improving performance especially in challenging conditions.

Abstract: Online high-definition (HD) map construction is an essential part of a safe and robust end-to-end autonomous driving (AD) pipeline. Onboard camera-based approaches suffer from limited depth perception and degraded accuracy due to occlusion. In this work, we propose SatMap, an online vectorized HD map estimation method that integrates satellite maps with multi-view camera observations and directly predicts a vectorized HD map for downstream prediction and planning modules. Our method leverages lane-level semantics and texture from satellite imagery captured from a Bird’s Eye View (BEV) perspective as a global prior, effectively mitigating depth ambiguity and occlusion. In our experiments on the nuScenes dataset, SatMap achieves 34.8% mAP performance improvement over the camera-only baseline and 8.5% mAP improvement over the camera-LiDAR fusion baseline. Moreover, we evaluate our model in long-range and adverse weather conditions to demonstrate the advantages of using a satellite prior map. Source code will be available at https://iv.ee.hm.edu/satmap/.

[172] BikeActions: An Open Platform and Benchmark for Cyclist-Centric VRU Action Recognition

Max A. Buettner, Kanak Mazumder, Luca Koecher, Mario Finkbeiner, Sebastian Niebler, Fabian B. Flohr

Main category: cs.CV

TL;DR: FUSE-Bike is a novel open perception platform with LiDARs, camera, and GNSS for capturing cyclist-view data, used to create BikeActions dataset with 852 annotated samples across 5 action classes for VRU behavior modeling.

Details

Motivation: Current research focuses on pedestrian crossing from vehicle perspective, but interactions in dense shared spaces are underexplored. There's a need for better VRU intention anticipation for safe autonomous driving and robotics.

Method: Created FUSE-Bike platform with dual LiDARs, camera, and GNSS for high-fidelity close-range data capture from cyclist viewpoint. Collected and annotated 852 multi-modal samples across 5 action classes to create BikeActions dataset.

Result: Established first performance baselines by evaluating state-of-the-art graph convolution and transformer models on the dataset. Made full dataset, curation tools, hardware design, and benchmark code publicly available.

Conclusion: FUSE-Bike and BikeActions dataset fill a critical gap in VRU behavior modeling, particularly for dense shared spaces, and provide open resources to advance research in VRU action understanding.

Abstract: Anticipating the intentions of Vulnerable Road Users (VRUs) is a critical challenge for safe autonomous driving (AD) and mobile robotics. While current research predominantly focuses on pedestrian crossing behaviors from a vehicle’s perspective, interactions within dense shared spaces remain underexplored. To bridge this gap, we introduce FUSE-Bike, the first fully open perception platform of its kind. Equipped with two LiDARs, a camera, and GNSS, it facilitates high-fidelity, close-range data capture directly from a cyclist’s viewpoint. Leveraging this platform, we present BikeActions, a novel multi-modal dataset comprising 852 annotated samples across 5 distinct action classes, specifically tailored to improve VRU behavior modeling. We establish a rigorous benchmark by evaluating state-of-the-art graph convolution and transformer-based models on our publicly released data splits, establishing the first performance baselines for this challenging task. We release the full dataset together with data curation tools, the open hardware design, and the benchmark code to foster future research in VRU action understanding under https://iv.ee.hm.edu/bikeactions/.

[173] SVII-3D: Advancing Roadside Infrastructure Inventory with Decimeter-level 3D Localization and Comprehension from Sparse Street Imagery

Chong Liu, Luxuan Fu, Yang Jia, Zhen Dong, Bisheng Yang

Main category: cs.CV

TL;DR: SVII-3D is a unified framework for automated digital twin creation from sparse imagery, combining robust cross-view association, precise 3D localization, and fine-grained state diagnosis using vision-language models.

Details

Motivation: Current methods for automated digital twin creation from cost-effective sparse imagery face limitations in robustness, localization accuracy, and fine-grained state understanding, creating a gap between sparse perception and intelligent maintenance.

Method: Three-stage approach: 1) LoRA fine-tuned open-set detection with spatial-attention matching for robust cross-view association; 2) Geometry-guided refinement for decimeter-level 3D localization; 3) Vision-Language Model agent with multi-modal prompting for fine-grained operational state diagnosis.

Result: Significant improvements in identification accuracy and minimized localization errors, demonstrating a scalable, cost-effective solution for high-fidelity infrastructure digitization.

Conclusion: SVII-3D effectively bridges the gap between sparse perception and automated intelligent maintenance, offering a unified framework for holistic asset digitization in smart city applications.

Abstract: The automated creation of digital twins and precise asset inventories is a critical task in smart city construction and facility lifecycle management. However, utilizing cost-effective sparse imagery remains challenging due to limited robustness, inaccurate localization, and a lack of fine-grained state understanding. To address these limitations, SVII-3D, a unified framework for holistic asset digitization, is proposed. First, LoRA fine-tuned open-set detection is fused with a spatial-attention matching network to robustly associate observations across sparse views. Second, a geometry-guided refinement mechanism is introduced to resolve structural errors, achieving precise decimeter-level 3D localization. Third, transcending static geometric mapping, a Vision-Language Model agent leveraging multi-modal prompting is incorporated to automatically diagnose fine-grained operational states. Experiments demonstrate that SVII-3D significantly improves identification accuracy and minimizes localization errors. Consequently, this framework offers a scalable, cost-effective solution for high-fidelity infrastructure digitization, effectively bridging the gap between sparse perception and automated intelligent maintenance.

[174] Enhancing the quality of gauge images captured in smoke and haze scenes through deep learning

Oscar H. Ramírez-Agudelo, Akshay N. Shewatkar, Edoardo Milana, Roland C. Aydin, Kai Franke

Main category: cs.CV

TL;DR: Deep learning models (FFA-Net and AECR-Net) improve visibility of analog gauge images in hazy/smoky environments, enabling automatic gauge reading for emergency services.

Details

Motivation: Images captured in hazy/smoky environments have reduced visibility, which hinders infrastructure monitoring and emergency services during critical situations. Accurate gauge interpretation is valuable for first responders.

Method: Used two deep learning architectures (FFA-Net and AECR-Net) to enhance gauge images corrupted with haze and smoke. Created a new synthetic dataset of over 14,000 images using Unreal Engine since no benchmark datasets existed. Trained models with 80/10/10 train/validation/test splits.

Result: For synthetic haze dataset: SSIM ~0.98 and PSNR ~43dB, comparable to state-of-the-art. AECR-Net performed more robustly than FFA-Net. Smoke dataset results were poorer but still interesting. Smoke enhancement is more difficult due to inhomogeneity and high density.

Conclusion: Deep learning architectures can significantly improve analog gauge image quality in smoke/haze scenes. Enhanced images can be successfully post-processed for automatic autonomous gauge reading, demonstrating practical value for emergency response applications.

Abstract: Images captured in hazy and smoky environments suffer from reduced visibility, posing a challenge when monitoring infrastructures and hindering emergency services during critical situations. The proposed work investigates the use of the deep learning models to enhance the automatic, machine-based readability of gauge in smoky environments, with accurate gauge data interpretation serving as a valuable tool for first responders. The study utilizes two deep learning architectures, FFA-Net and AECR-Net, to improve the visibility of gauge images, corrupted with light up to dense haze and smoke. Since benchmark datasets of analog gauge images are unavailable, a new synthetic dataset, containing over 14,000 images, was generated using the Unreal Engine. The models were trained with an 80% train, 10% validation, and 10% test split for the haze and smoke dataset, respectively. For the synthetic haze dataset, the SSIM and PSNR metrics are about 0.98 and 43,dB, respectively, comparing well to state-of-the art results. Additionally, more robust results are retrieved from the AECR-Net, when compared to the FFA-Net. Although the results from the synthetic smoke dataset are poorer, the trained models achieve interesting results. In general, imaging in the presence of smoke are more difficult to enhance given the inhomogeneity and high density. Secondly, FFA-Net and AECR-Net are implemented to dehaze and not to desmoke images. This work shows that use of deep learning architectures can improve the quality of analog gauge images captured in smoke and haze scenes immensely. Finally, the enhanced output images can be successfully post-processed for automatic autonomous reading of gauges

[175] Unleashing the Capabilities of Large Vision-Language Models for Intelligent Perception of Roadside Infrastructure

Luxuan Fu, Chong Liu, Bisheng Yang, Zhen Dong

Main category: cs.CV

TL;DR: Domain-adapted framework transforms VLMs into specialized agents for urban infrastructure analysis using fine-tuning, knowledge-grounded reasoning, and dual-modality RAG to achieve high accuracy in detection and attribute recognition.

Details

Motivation: General-purpose vision models struggle with fine-grained urban infrastructure attributes and domain compliance, while VLMs lack accuracy in interpreting complex facility states according to engineering standards, limiting real-world reliability.

Method: 1) Open-vocabulary fine-tuning on Grounding DINO for robust asset localization with minimal supervision; 2) LoRA-based adaptation on Qwen-VL for deep semantic attribute reasoning; 3) Dual-modality RAG module that retrieves authoritative industry standards and visual exemplars during inference to mitigate hallucinations and ensure professional compliance.

Result: Achieved 58.9 mAP for detection performance and 95.5% attribute recognition accuracy on a comprehensive new dataset of urban roadside scenes, demonstrating robust intelligent infrastructure monitoring capabilities.

Conclusion: The proposed domain-adapted framework successfully transforms general VLMs into specialized agents for urban infrastructure analysis, addressing domain compliance challenges through knowledge-grounded reasoning and achieving high accuracy in both detection and attribute recognition tasks.

Abstract: Automated perception of urban roadside infrastructure is crucial for smart city management, yet general-purpose models often struggle to capture the necessary fine-grained attributes and domain rules. While Large Vision Language Models (VLMs) excel at open-world recognition, they often struggle to accurately interpret complex facility states in compliance with engineering standards, leading to unreliable performance in real-world applications. To address this, we propose a domain-adapted framework that transforms VLMs into specialized agents for intelligent infrastructure analysis. Our approach integrates a data-efficient fine-tuning strategy with a knowledge-grounded reasoning mechanism. Specifically, we leverage open-vocabulary fine-tuning on Grounding DINO to robustly localize diverse assets with minimal supervision, followed by LoRA-based adaptation on Qwen-VL for deep semantic attribute reasoning. To mitigate hallucinations and enforce professional compliance, we introduce a dual-modality Retrieval-Augmented Generation (RAG) module that dynamically retrieves authoritative industry standards and visual exemplars during inference. Evaluated on a comprehensive new dataset of urban roadside scenes, our framework achieves a detection performance of 58.9 mAP and an attribute recognition accuracy of 95.5%, demonstrating a robust solution for intelligent infrastructure monitoring.

[176] Inference-time Physics Alignment of Video Generative Models with Latent World Models

Jianhao Yuan, Xiaofeng Zhang, Felix Friedrich, Nicolas Beltran-Velez, Melissa Hall, Reyhane Askari-Hemmat, Xiaochuang Han, Nicolas Ballas, Michal Drozdzal, Adriana Romero-Soriano

Main category: cs.CV

TL;DR: WMReward improves video generation physics plausibility by using a latent world model (VJEPA-2) as a reward to steer multiple denoising trajectories during inference, winning ICCV 2025 Perception Test PhysicsIQ Challenge.

Details

Motivation: Current video generative models often violate basic physics principles despite promising visual content. The authors identify that this deficiency stems not only from insufficient physics understanding during pre-training but also from suboptimal inference strategies.

Method: Introduces WMReward, treating physics plausibility improvement as an inference-time alignment problem. Uses VJEPA-2 (a latent world model) as a physics prior reward to search and steer multiple candidate denoising trajectories during inference, enabling test-time compute scaling for better performance.

Result: Substantially improves physics plausibility across image-conditioned, multiframe-conditioned, and text-conditioned generation settings. Achieved 62.64% final score in ICCV 2025 Perception Test PhysicsIQ Challenge, winning first place and outperforming previous state-of-the-art by 7.42%.

Conclusion: Demonstrates viability of using latent world models to improve physics plausibility of video generation, suggesting this approach extends beyond specific instantiations or parameterizations.

Abstract: State-of-the-art video generative models produce promising visual content yet often violate basic physics principles, limiting their utility. While some attribute this deficiency to insufficient physics understanding from pre-training, we find that the shortfall in physics plausibility also stems from suboptimal inference strategies. We therefore introduce WMReward and treat improving physics plausibility of video generation as an inference-time alignment problem. In particular, we leverage the strong physics prior of a latent world model (here, VJEPA-2) as a reward to search and steer multiple candidate denoising trajectories, enabling scaling test-time compute for better generation performance. Empirically, our approach substantially improves physics plausibility across image-conditioned, multiframe-conditioned, and text-conditioned generation settings, with validation from human preference study. Notably, in the ICCV 2025 Perception Test PhysicsIQ Challenge, we achieve a final score of 62.64%, winning first place and outperforming the previous state of the art by 7.42%. Our work demonstrates the viability of using latent world models to improve physics plausibility of video generation, beyond this specific instantiation or parameterization.

[177] DeepUrban: Interaction-Aware Trajectory Prediction and Planning for Automated Driving by Aerial Imagery

Constantin Selzer, Fabian B. Flohr

Main category: cs.CV

TL;DR: DeepUrban is a new drone dataset for dense urban traffic scenarios that improves trajectory prediction and planning benchmarks, boosting accuracy by up to 44% when combined with nuScenes.

Details

Motivation: Current autonomous driving benchmarks lack dense traffic scenarios needed to understand complex road user interactions, limiting the development of robust prediction and planning systems.

Method: Collaborated with DeepScenario to create DeepUrban - a drone dataset capturing 3D traffic objects from high-resolution images at ~100m altitude over urban intersections, enriched with comprehensive map and scene information.

Result: Adding DeepUrban to nuScenes improves vehicle prediction and planning accuracy by up to 44.1% on ADE and 44.3% on FDE metrics, demonstrating enhanced generalization capabilities.

Conclusion: DeepUrban addresses the scarcity of dense traffic scenarios in current benchmarks and significantly boosts the performance of state-of-the-art prediction and planning methods for autonomous driving systems.

Abstract: The efficacy of autonomous driving systems hinges critically on robust prediction and planning capabilities. However, current benchmarks are impeded by a notable scarcity of scenarios featuring dense traffic, which is essential for understanding and modeling complex interactions among road users. To address this gap, we collaborated with our industrial partner, DeepScenario, to develop DeepUrban-a new drone dataset designed to enhance trajectory prediction and planning benchmarks focusing on dense urban settings. DeepUrban provides a rich collection of 3D traffic objects, extracted from high-resolution images captured over urban intersections at approximately 100 meters altitude. The dataset is further enriched with comprehensive map and scene information to support advanced modeling and simulation tasks. We evaluate state-of-the-art (SOTA) prediction and planning methods, and conducted experiments on generalization capabilities. Our findings demonstrate that adding DeepUrban to nuScenes can boost the accuracy of vehicle predictions and planning, achieving improvements up to 44.1 % / 44.3% on the ADE / FDE metrics. Website: https://iv.ee.hm.edu/deepurban

[178] Jordan-Segmentable Masks: A Topology-Aware definition for characterizing Binary Image Segmentation

Serena Grazia De Benedictis, Amedeo Altavilla, Nicoletta Del Buono

Main category: cs.CV

TL;DR: The paper introduces a topology-aware segmentation evaluation framework based on the Jordan Curve Theorem to assess structural coherence of segmentation masks, addressing limitations of conventional metrics.

Details

Motivation: Conventional segmentation evaluation metrics (pixel-wise, region-based, boundary-focused) often fail to capture structural and topological coherence. Small inaccuracies can yield high scores while masks fail to preserve object shape or connectivity, especially problematic in medical imaging and object delineation where topological correctness is crucial.

Method: Introduces the concept of “Jordan-segmentatable mask” based on digital Jordan Curve Theorem. Uses digital topology and homology theory to analyze masks, extracting a 4-curve candidate and verifying topological validity using Betti numbers. A mask is Jordan-segmentatable when its candidate forms a digital 4-curve with β₀ = β₁ = 1, or equivalently when its complement splits into exactly two 8-connected components.

Result: Provides a mathematically rigorous, unsupervised criterion for assessing structural coherence of segmentation masks. The framework offers a topology-aware evaluation alternative that can identify masks that preserve meaningful interior/exterior separation, unlike conventional metrics.

Conclusion: The proposed framework combining digital Jordan theory and homological invariants offers a valuable alternative to standard evaluation metrics, particularly in applications where topological correctness must be preserved, addressing a fundamental limitation in segmentation assessment.

Abstract: Image segmentation plays a central role in computer vision. However, widely used evaluation metrics, whether pixel-wise, region-based, or boundary-focused, often struggle to capture the structural and topological coherence of a segmentation. In many practical scenarios, such as medical imaging or object delineation, small inaccuracies in boundary, holes, or fragmented predictions can result in high metric scores, despite the fact that the resulting masks fail to preserve the object global shape or connectivity. This highlights a limitation of conventional metrics: they are unable to assess whether a predicted segmentation partitions the image into meaningful interior and exterior regions. In this work, we introduce a topology-aware notion of segmentation based on the Jordan Curve Theorem, and adapted for use in digital planes. We define the concept of a \emph{Jordan-segmentatable mask}, which is a binary segmentation whose structure ensures a topological separation of the image domain into two connected components. We analyze segmentation masks through the lens of digital topology and homology theory, extracting a $4$-curve candidate from the mask, verifying its topological validity using Betti numbers. A mask is considered Jordan-segmentatable when this candidate forms a digital 4-curve with $β_0 = β_1 = 1$, or equivalently when its complement splits into exactly two $8$-connected components. This framework provides a mathematically rigorous, unsupervised criterion with which to assess the structural coherence of segmentation masks. By combining digital Jordan theory and homological invariants, our approach provides a valuable alternative to standard evaluation metrics, especially in applications where topological correctness must be preserved.

[179] Adversarial Evasion Attacks on Computer Vision using SHAP Values

Frank Mollard, Marcus Becker, Florian Roehrbein

Main category: cs.CV

TL;DR: White-box attack on CV models using SHAP values to generate adversarial examples that reduce model confidence or cause misclassifications while remaining imperceptible to humans.

Details

Motivation: Adversarial evasion attacks can compromise deep learning models by deceiving algorithms while eluding human perception. Current attacks may be detectable or less effective in certain scenarios, creating need for more robust methods.

Method: Leverages SHAP (SHapley Additive exPlanations) values to quantify significance of individual inputs to model output at inference stage. Uses these importance scores to generate adversarial perturbations.

Result: SHAP-based attacks are more robust than Fast Gradient Sign Method (FGSM) in generating misclassifications, particularly in gradient hiding scenarios where traditional gradient-based attacks may fail.

Conclusion: SHAP values provide an effective alternative to gradient-based methods for generating adversarial attacks, offering increased robustness especially in scenarios where gradients are hidden or unreliable.

Abstract: The paper introduces a white-box attack on computer vision models using SHAP values. It demonstrates how adversarial evasion attacks can compromise the performance of deep learning models by reducing output confidence or inducing misclassifications. Such attacks are particularly insidious as they can deceive the perception of an algorithm while eluding human perception due to their imperceptibility to the human eye. The proposed attack leverages SHAP values to quantify the significance of individual inputs to the output at the inference stage. A comparison is drawn between the SHAP attack and the well-known Fast Gradient Sign Method. We find evidence that SHAP attacks are more robust in generating misclassifications particularly in gradient hiding scenarios.

[180] Action100M: A Large-scale Video Action Dataset

Delong Chen, Tejaswi Kasarla, Yejin Bang, Mustafa Shukor, Willy Chung, Jade Yu, Allen Bolourchi, Theo Moutakanni, Pascale Fung

Main category: cs.CV

TL;DR: Action100M: A massive open-vocabulary video action dataset from 1.2M instructional videos with 100M segments, automated annotation pipeline, and strong scaling performance for video understanding.

Details

Motivation: Need for large-scale, open-vocabulary video action datasets to advance machine intelligence in physical world understanding, as existing datasets are limited in scale and vocabulary coverage.

Method: Fully automated pipeline: 1) hierarchical temporal segmentation using V-JEPA 2 embeddings, 2) multi-level frame/segment captions organized as Tree-of-Captions, 3) evidence aggregation with GPT-OSS-120B reasoning model using multi-round Self-Refine procedure for structured annotations.

Result: Created Action100M dataset from 1.2M instructional videos (14.6 years duration) with ~100M temporally localized segments, open-vocabulary action supervision, and rich captions. Training VL-JEPA on Action100M shows consistent data-scaling improvements and strong zero-shot performance across diverse action recognition benchmarks.

Conclusion: Action100M establishes a new foundation for scalable research in video understanding and world modeling, demonstrating the value of large-scale automated annotation for advancing physical action inference from visual observations.

Abstract: Inferring physical actions from visual observations is a fundamental capability for advancing machine intelligence in the physical world. Achieving this requires large-scale, open-vocabulary video action datasets that span broad domains. We introduce Action100M, a large-scale dataset constructed from 1.2M Internet instructional videos (14.6 years of duration), yielding O(100 million) temporally localized segments with open-vocabulary action supervision and rich captions. Action100M is generated by a fully automated pipeline that (i) performs hierarchical temporal segmentation using V-JEPA 2 embeddings, (ii) produces multi-level frame and segment captions organized as a Tree-of-Captions, and (iii) aggregates evidence with a reasoning model (GPT-OSS-120B) under a multi-round Self-Refine procedure to output structured annotations (brief/detailed action, actor, brief/detailed caption). Training VL-JEPA on Action100M demonstrates consistent data-scaling improvements and strong zero-shot performance across diverse action recognition benchmarks, establishing Action100M as a new foundation for scalable research in video understanding and world modeling.

[181] RSATalker: Realistic Socially-Aware Talking Head Generation for Multi-Turn Conversation

Peng Chen, Xiaobao Wei, Yi Yang, Naiming Yao, Hui Chen, Feng Tian

Main category: cs.CV

TL;DR: RSATalker is a 3D Gaussian Splatting-based framework for realistic, socially-aware talking head generation that supports multi-turn conversations by encoding social relationships and using mesh-driven Gaussian binding.

Details

Motivation: Existing talking head generation methods have limitations: mesh-based 3D methods lack realistic textures, 2D large-model methods are computationally expensive, and 3DGS methods ignore social relationships and only handle single speakers. There's a need for realistic, efficient, and socially-aware talking head generation for VR social scenarios.

Method: 1) Drive mesh-based 3D facial motion from speech, 2) Bind 3D Gaussians to mesh facets for high-fidelity rendering, 3) Use a socially-aware module with learnable query mechanism to encode social relationships (blood/non-blood, equal/unequal), 4) Three-stage training paradigm, 5) Construct RSATalker dataset with speech-mesh-image triplets and social annotations.

Result: Extensive experiments show RSATalker achieves state-of-the-art performance in both realism and social awareness. The framework successfully generates realistic talking heads while capturing interpersonal dynamics in multi-turn conversations.

Conclusion: RSATalker is the first 3DGS-based framework for realistic and socially-aware talking head generation, addressing limitations of existing methods by combining efficient rendering with social relationship modeling for VR social applications.

Abstract: Talking head generation is increasingly important in virtual reality (VR), especially for social scenarios involving multi-turn conversation. Existing approaches face notable limitations: mesh-based 3D methods can model dual-person dialogue but lack realistic textures, while large-model-based 2D methods produce natural appearances but incur prohibitive computational costs. Recently, 3D Gaussian Splatting (3DGS) based methods achieve efficient and realistic rendering but remain speaker-only and ignore social relationships. We introduce RSATalker, the first framework that leverages 3DGS for realistic and socially-aware talking head generation with support for multi-turn conversation. Our method first drives mesh-based 3D facial motion from speech, then binds 3D Gaussians to mesh facets to render high-fidelity 2D avatar videos. To capture interpersonal dynamics, we propose a socially-aware module that encodes social relationships, including blood and non-blood as well as equal and unequal, into high-level embeddings through a learnable query mechanism. We design a three-stage training paradigm and construct the RSATalker dataset with speech-mesh-image triplets annotated with social relationships. Extensive experiments demonstrate that RSATalker achieves state-of-the-art performance in both realism and social awareness. The code and dataset will be released.

[182] Molmo2: Open Weights and Data for Vision-Language Models with Video Understanding and Grounding

Christopher Clark, Jieyu Zhang, Zixian Ma, Jae Sung Park, Mohammadreza Salehi, Rohun Tripathi, Sangho Lee, Zhongzheng Ren, Chris Dongjoo Kim, Yinuo Yang, Vincent Shao, Yue Yang, Weikai Huang, Ziqi Gao, Taira Anderson, Jianrui Zhang, Jitesh Jain, George Stoica, Winson Han, Ali Farhadi, Ranjay Krishna

Main category: cs.CV

TL;DR: Molmo2 is a new open-source video-language model family that achieves state-of-the-art performance among open models and introduces novel point-driven grounding capabilities for images and videos, using entirely open data collection methods.

Details

Motivation: The video-language model field is dominated by proprietary models, with open-source alternatives either using synthetic data from proprietary models or lacking transparency. There's a critical need for open foundations that support not just high-level understanding but also pixel-level grounding capabilities for downstream applications.

Method: Created 7 new video datasets and 2 multi-image datasets collected without closed VLMs, including detailed video captions, free-form Q&A, object tracking, and video pointing datasets. Developed training recipe with efficient packing and message-tree encoding, bi-directional attention on vision tokens, and novel token-weight strategy.

Result: Molmo2 8B model outperforms open-weight models on short videos, counting, and captioning, and is competitive on long videos. Significantly outperforms Qwen3-VL (35.5 vs 29.6 accuracy on video counting) and surpasses proprietary Gemini 3 Pro on some tasks (38.4 vs 20.0 F1 on video pointing and 56.2 vs 41.1 J&F on tracking).

Conclusion: Molmo2 provides the open-source community with state-of-the-art video-language models that include novel grounding capabilities, using entirely open data collection methods and advancing the field beyond proprietary model dependencies.

Abstract: Today’s strongest video-language models (VLMs) remain proprietary. The strongest open-weight models either rely on synthetic data from proprietary VLMs, effectively distilling from them, or do not disclose their training data or recipe. As a result, the open-source community lacks the foundations needed to improve on the state-of-the-art video (and image) language models. Crucially, many downstream applications require more than just high-level video understanding; they require grounding – either by pointing or by tracking in pixels. Even proprietary models lack this capability. We present Molmo2, a new family of VLMs that are state-of-the-art among open-source models and demonstrate exceptional new capabilities in point-driven grounding in single image, multi-image, and video tasks. Our key contribution is a collection of 7 new video datasets and 2 multi-image datasets, including a dataset of highly detailed video captions for pre-training, a free-form video Q&A dataset for fine-tuning, a new object tracking dataset with complex queries, and an innovative new video pointing dataset, all collected without the use of closed VLMs. We also present a training recipe for this data utilizing an efficient packing and message-tree encoding scheme, and show bi-directional attention on vision tokens and a novel token-weight strategy improves performance. Our best-in-class 8B model outperforms others in the class of open weight and data models on short videos, counting, and captioning, and is competitive on long-videos. On video-grounding Molmo2 significantly outperforms existing open-weight models like Qwen3-VL (35.5 vs 29.6 accuracy on video counting) and surpasses proprietary models like Gemini 3 Pro on some tasks (38.4 vs 20.0 F1 on video pointing and 56.2 vs 41.1 J&F on video tracking).

[183] CoMoVi: Co-Generation of 3D Human Motions and Realistic Videos

Chengfeng Zhao, Jiazhi Shu, Yubo Zhao, Tianyu Huang, Jiahao Lu, Zekai Gu, Chengwei Ren, Zhiyang Dou, Qing Shuai, Yuan Liu

Main category: cs.CV

TL;DR: CoMoVi is a framework that co-generates 3D human motions and 2D videos synchronously by coupling two video diffusion models within a single diffusion loop.

Details

Motivation: The generation of 3D human motions and 2D human videos is intrinsically coupled - 3D motions provide structural prior for plausibility and consistency, while video models offer generalization capabilities for motions, necessitating coupling their generation processes.

Method: 1) Propose effective 2D human motion representation that inherits prior from pre-trained video diffusion models; 2) Design dual-branch diffusion model to couple human motion and video generation with mutual feature interaction and 3D-2D cross attentions; 3) Curate CoMoVi Dataset with text and motion annotations.

Result: Extensive experiments demonstrate effectiveness in both 3D human motion and video generation tasks.

Conclusion: CoMoVi successfully couples 3D motion and video generation through a co-generative framework that leverages the complementary strengths of both modalities.

Abstract: In this paper, we find that the generation of 3D human motions and 2D human videos is intrinsically coupled. 3D motions provide the structural prior for plausibility and consistency in videos, while pre-trained video models offer strong generalization capabilities for motions, which necessitate coupling their generation processes. Based on this, we present CoMoVi, a co-generative framework that couples two video diffusion models (VDMs) to generate 3D human motions and videos synchronously within a single diffusion denoising loop. To achieve this, we first propose an effective 2D human motion representation that can inherit the powerful prior of pre-trained VDMs. Then, we design a dual-branch diffusion model to couple human motion and video generation process with mutual feature interaction and 3D-2D cross attentions. Moreover, we curate CoMoVi Dataset, a large-scale real-world human video dataset with text and motion annotations, covering diverse and challenging human motions. Extensive experiments demonstrate the effectiveness of our method in both 3D human motion and video generation tasks.

[184] CURVE: A Benchmark for Cultural and Multilingual Long Video Reasoning

Darshan Singh, Arsha Nagrani, Kawshik Manikantan, Harman Singh, Dinesh Tewari, Tobias Weyand, Cordelia Schmid, Anelia Angelova, Shachi Dave

Main category: cs.CV

TL;DR: CURVE is a new multicultural/multilingual video reasoning benchmark with human-generated annotations across 18 locales, revealing significant performance gaps in current Video-LLMs due to cultural perception failures.

Details

Motivation: Current video understanding benchmarks are biased toward western-centric data and English language, lacking proper evaluation of multicultural and multilingual reasoning capabilities.

Method: Created CURVE benchmark with human-generated annotations from diverse cultural videos across 18 global locales, including complex questions, answers, and reasoning steps in native languages. Used reasoning traces to construct evidence-based graphs for error analysis.

Result: State-of-the-art Video-LLMs perform substantially below human-level accuracy on CURVE, with errors primarily stemming from visual perception of cultural elements rather than language understanding.

Conclusion: CURVE addresses cultural bias in video evaluation and reveals critical limitations in current models’ ability to understand visual cultural context, providing a foundation for developing more culturally-aware video reasoning systems.

Abstract: Recent advancements in video models have shown tremendous progress, particularly in long video understanding. However, current benchmarks predominantly feature western-centric data and English as the dominant language, introducing significant biases in evaluation. To address this, we introduce CURVE (Cultural Understanding and Reasoning in Video Evaluation), a challenging benchmark for multicultural and multilingual video reasoning. CURVE comprises high-quality, entirely human-generated annotations from diverse, region-specific cultural videos across 18 global locales. Unlike prior work that relies on automatic translations, CURVE provides complex questions, answers, and multi-step reasoning steps, all crafted in native languages. Making progress on CURVE requires a deeply situated understanding of visual cultural context. Furthermore, we leverage CURVE’s reasoning traces to construct evidence-based graphs and propose a novel iterative strategy using these graphs to identify fine-grained errors in reasoning. Our evaluations reveal that SoTA Video-LLMs struggle significantly, performing substantially below human-level accuracy, with errors primarily stemming from the visual perception of cultural elements. CURVE will be publicly available under https://github.com/google-deepmind/neptune?tab=readme-ov-file#minerva-cultural

[185] A continental-scale dataset of ground beetles with high-resolution images and validated morphological trait measurements

S M Rayeed, Mridul Khurana, Alyson East, Isadora E. Fluck, Elizabeth G. Campolongo, Samuel Stevens, Iuliia Zarubiieva, Scott C. Lowe, Michael W. Denslow, Evan D. Donoso, Jiaman Wu, Michelle Ramirez, Benjamin Baiser, Charles V. Stewart, Paula Mabee, Tanya Berger-Wolf, Anuj Karpatne, Hilmar Lapp, Robert P. Guralnick, Graham W. Taylor, Sydne Record

Main category: cs.CV

TL;DR: Researchers created a multimodal dataset of over 13,200 ground beetles from NEON collections, digitizing specimens through high-resolution imaging and automated trait extraction with AI to address invertebrate under-representation in ecological trait databases.

Details

Motivation: Global trait databases are heavily biased toward vertebrates and plants, limiting comprehensive ecological analyses of high-diversity invertebrate groups like ground beetles, which serve as critical bioindicators of ecosystem health. NEON's extensive carabid collections exist primarily as physical specimens, restricting research access and large-scale analysis.

Method: Digitized over 13,200 NEON carabid specimens from 30 sites across the continental US and Hawaii using high-resolution imaging. Implemented automated trait extraction through AI, digitally measuring elytra length and width. Validated digital measurements against manual measurements to ensure reliability.

Result: Created a multimodal dataset enabling broader access and computational analysis. Achieved sub-millimeter precision in digital trait extraction validated against manual measurements. Established foundation for automated trait extraction using AI for ecological studies.

Conclusion: This work addresses invertebrate under-representation in trait databases and supports development of AI-driven tools for automated species identification and trait-based research, fostering advancements in biodiversity monitoring and conservation.

Abstract: Despite the ecological significance of invertebrates, global trait databases remain heavily biased toward vertebrates and plants, limiting comprehensive ecological analyses of high-diversity groups like ground beetles. Ground beetles (Coleoptera: Carabidae) serve as critical bioindicators of ecosystem health, providing valuable insights into biodiversity shifts driven by environmental changes. While the National Ecological Observatory Network (NEON) maintains an extensive collection of carabid specimens from across the United States, these primarily exist as physical collections, restricting widespread research access and large-scale analysis. To address these gaps, we present a multimodal dataset digitizing over 13,200 NEON carabids from 30 sites spanning the continental US and Hawaii through high-resolution imaging, enabling broader access and computational analysis. The dataset includes digitally measured elytra length and width of each specimen, establishing a foundation for automated trait extraction using AI. Validated against manual measurements, our digital trait extraction achieves sub-millimeter precision, ensuring reliability for ecological and computational studies. By addressing invertebrate under-representation in trait databases, this work supports AI-driven tools for automated species identification and trait-based research, fostering advancements in biodiversity monitoring and conservation.

[186] See Less, Drive Better: Generalizable End-to-End Autonomous Driving via Foundation Models Stochastic Patch Selection

Amir Mallak, Erfan Aasi, Shiva Sreeram, Tsun-Hsuan Wang, Daniela Rus, Alaa Maalouf

Main category: cs.CV

TL;DR: SPS: Stochastic patch selection for autonomous driving improves OOD robustness by randomly masking patch features during training, forcing policies to learn invariant representations.

Details

Motivation: Patch-aligned features from foundation models contain redundant information due to self-attention mechanisms, leading to overfitting on spurious correlations and poor OOD generalization.

Method: Stochastic-Patch-Selection (SPS): Randomly masks a fraction of patch descriptors each frame while preserving spatial layout, providing different stochastic but complete views of the same scene.

Result: Achieves 6.2% average improvement across OOD scenarios, up to 20.4% in closed-loop simulations, 2.4× faster than SOTA. 8 of 9 trained systems surpass prior SOTA, transfers to real-world car without tuning.

Conclusion: SPS effectively addresses feature redundancy in foundation model features, improving robustness, generalization, and efficiency for autonomous driving policies while enabling zero-shot transfer to real-world deployment.

Abstract: Recent advances in end-to-end autonomous driving show that policies trained on patch-aligned features extracted from foundation models generalize better to Out-of-Distribution (OOD). We hypothesize that due to the self-attention mechanism, each patch feature implicitly embeds/contains information from all other patches, represented in a different way and intensity, making these descriptors highly redundant. We quantify redundancy in such (BLIP2) features via PCA and cross-patch similarity: $90$% of variance is captured by $17/64$ principal components, and strong inter-token correlations are pervasive. Training on such overlapping information leads the policy to overfit spurious correlations, hurting OOD robustness. We present Stochastic-Patch-Selection (SPS), a simple yet effective approach for learning policies that are more robust, generalizable, and efficient. For every frame, SPS randomly masks a fraction of patch descriptors, not feeding them to the policy model, while preserving the spatial layout of the remaining patches. Thus, the policy is provided with different stochastic but complete views of the (same) scene: every random subset of patches acts like a different, yet still sensible, coherent projection of the world. The policy thus bases its decisions on features that are invariant to which specific tokens survive. Extensive experiments confirm that across all OOD scenarios, our method outperforms the state of the art (SOTA), achieving a $6.2$% average improvement and up to $20.4$% in closed-loop simulations, while being $2.4\times$ faster. We conduct ablations over masking rates and patch-feature reorganization, training and evaluating 9 systems, with 8 of them surpassing prior SOTA. Finally, we show that the same learned policy transfers to a physical, real-world car without any tuning.

[187] From One-to-One to Many-to-Many: Dynamic Cross-Layer Injection for Deep Vision-Language Fusion

Cheng Chen, Yuyu Guo, Pengpeng Zeng, Jingkuan Song, Peng Di, Hang Yu, Lianli Gao

Main category: cs.CV

TL;DR: CLI introduces a dynamic many-to-many bridge between vision and language models, replacing the traditional static bottleneck with adaptive cross-layer injection for better multimodal understanding.

Details

Motivation: Current VLMs have a severe visual feature bottleneck with crude, asymmetric connections that only link vision encoder output to LLM input. This static architecture limits LLMs' ability to achieve comprehensive alignment with hierarchical visual knowledge, compromising integration of local details with global semantics.

Method: Cross-Layer Injection (CLI) framework with two synergistic components: Adaptive Multi-Projection (AMP) module harmonizes features from diverse vision layers, and Adaptive Gating Fusion (AGF) mechanism allows LLM to selectively inject relevant visual information based on real-time decoding context.

Result: Extensive experiments on 18 diverse benchmarks show significant performance improvements when integrating CLI into LLaVA-OneVision and LLaVA-1.5, establishing CLI as a scalable paradigm for deeper multimodal understanding.

Conclusion: CLI unlocks deeper multimodal understanding by granting LLMs on-demand access to the full visual hierarchy through dynamic many-to-many bridging between vision and language modalities.

Abstract: Vision-Language Models (VLMs) create a severe visual feature bottleneck by using a crude, asymmetric connection that links only the output of the vision encoder to the input of the large language model (LLM). This static architecture fundamentally limits the ability of LLMs to achieve comprehensive alignment with hierarchical visual knowledge, compromising their capacity to accurately integrate local details with global semantics into coherent reasoning. To resolve this, we introduce Cross-Layer Injection (CLI), a novel and lightweight framework that forges a dynamic many-to-many bridge between the two modalities. CLI consists of two synergistic, parameter-efficient components: an Adaptive Multi-Projection (AMP) module that harmonizes features from diverse vision layers, and an Adaptive Gating Fusion (AGF) mechanism that empowers the LLM to selectively inject the most relevant visual information based on its real-time decoding context. We validate the effectiveness and versatility of CLI by integrating it into LLaVA-OneVision and LLaVA-1.5. Extensive experiments on 18 diverse benchmarks demonstrate significant performance improvements, establishing CLI as a scalable paradigm that unlocks deeper multimodal understanding by granting LLMs on-demand access to the full visual hierarchy.

[188] Alterbute: Editing Intrinsic Attributes of Objects in Images

Tal Reiss, Daniel Winter, Matan Cohen, Alex Rav-Acha, Yael Pritch, Ariel Shamir, Yedid Hoshen

Main category: cs.CV

TL;DR: Alterbute is a diffusion-based method for editing object intrinsic attributes (color, texture, material, shape) while preserving identity and scene context, using relaxed training with identity reference images and textual prompts, plus Visual Named Entities for fine-grained identity categorization.

Details

Motivation: Existing approaches either rely on unsupervised priors that fail to preserve object identity or use overly restrictive supervision that prevents meaningful intrinsic variations. There's a need for a method that can edit intrinsic attributes while maintaining object identity and scene context.

Method: Uses: (1) Relaxed training objective allowing changes to both intrinsic and extrinsic attributes conditioned on identity reference image, textual prompt describing target attributes, and background/mask defining extrinsic context. At inference, restricts extrinsic changes by reusing original background/mask. (2) Visual Named Entities (VNEs) - fine-grained visual identity categories automatically extracted using vision-language models from large datasets to enable scalable, identity-preserving supervision.

Result: Alterbute outperforms existing methods on identity-preserving object intrinsic attribute editing, successfully changing color, texture, material, and shape while preserving perceived identity and scene context.

Conclusion: Alterbute provides an effective diffusion-based approach for editing object intrinsic attributes while preserving identity, overcoming limitations of existing methods through a novel training strategy and Visual Named Entities framework.

Abstract: We introduce Alterbute, a diffusion-based method for editing an object’s intrinsic attributes in an image. We allow changing color, texture, material, and even the shape of an object, while preserving its perceived identity and scene context. Existing approaches either rely on unsupervised priors that often fail to preserve identity or use overly restrictive supervision that prevents meaningful intrinsic variations. Our method relies on: (i) a relaxed training objective that allows the model to change both intrinsic and extrinsic attributes conditioned on an identity reference image, a textual prompt describing the target intrinsic attributes, and a background image and object mask defining the extrinsic context. At inference, we restrict extrinsic changes by reusing the original background and object mask, thereby ensuring that only the desired intrinsic attributes are altered; (ii) Visual Named Entities (VNEs) - fine-grained visual identity categories (e.g., ‘‘Porsche 911 Carrera’’) that group objects sharing identity-defining features while allowing variation in intrinsic attributes. We use a vision-language model to automatically extract VNE labels and intrinsic attribute descriptions from a large public image dataset, enabling scalable, identity-preserving supervision. Alterbute outperforms existing methods on identity-preserving object intrinsic attribute editing.

[189] WildRayZer: Self-supervised Large View Synthesis in Dynamic Environments

Xuweiyi Chen, Wentao Zhou, Zezhou Cheng

Main category: cs.CV

TL;DR: WildRayZer is a self-supervised framework for novel view synthesis in dynamic scenes where both camera and objects move, using analysis-by-synthesis to separate static structure from transient motion.

Details

Motivation: Dynamic environments break multi-view consistency assumptions of static NVS models, causing ghosting, hallucinated geometry, and unstable pose estimation. Existing methods struggle with casually captured dynamic scenes.

Method: Uses analysis-by-synthesis: camera-only static renderer explains rigid structure, residuals reveal transient regions. Constructs pseudo motion masks, distills motion estimator, masks input tokens and gates loss gradients to focus supervision on cross-view background completion.

Result: Outperforms optimization-based and feed-forward baselines in both transient-region removal and full-frame NVS quality with single feed-forward pass. Introduces Dynamic RealEstate10K dataset (15K dynamic sequences) and D-RE10K-iPhone benchmark.

Conclusion: WildRayZer effectively handles dynamic NVS through self-supervised motion separation, enabling high-quality novel view synthesis in casually captured dynamic scenes with moving cameras and objects.

Abstract: We present WildRayZer, a self-supervised framework for novel view synthesis (NVS) in dynamic environments where both the camera and objects move. Dynamic content breaks the multi-view consistency that static NVS models rely on, leading to ghosting, hallucinated geometry, and unstable pose estimation. WildRayZer addresses this by performing an analysis-by-synthesis test: a camera-only static renderer explains rigid structure, and its residuals reveal transient regions. From these residuals, we construct pseudo motion masks, distill a motion estimator, and use it to mask input tokens and gate loss gradients so supervision focuses on cross-view background completion. To enable large-scale training and evaluation, we curate Dynamic RealEstate10K (D-RE10K), a real-world dataset of 15K casually captured dynamic sequences, and D-RE10K-iPhone, a paired transient and clean benchmark for sparse-view transient-aware NVS. Experiments show that WildRayZer consistently outperforms optimization-based and feed-forward baselines in both transient-region removal and full-frame NVS quality with a single feed-forward pass.

[190] Spatial As Deep: Spatial CNN for Traffic Scene Understanding

Xingang Pan, Xiaohang Zhan, Jianping Shi, Ping Luo, Xiaogang Wang, Xiaoou Tang

Main category: cs.CV

TL;DR: SCNN introduces slice-by-slice convolutions to capture spatial relationships across image rows/columns, improving lane detection performance.

Details

Motivation: Traditional CNNs don't fully capture spatial relationships across image rows and columns, which is crucial for detecting objects with strong shape priors but weak appearance coherences like traffic lanes, poles, and walls.

Method: Spatial CNN (SCNN) generalizes traditional layer-by-layer convolutions to slice-by-slice convolutions within feature maps, enabling message passing between pixels across rows and columns in a layer.

Result: SCNN outperforms RNN-based ReNet and MRF+CNN by 8.7% and 4.6% respectively on lane detection, and achieved 1st place on TuSimple Benchmark with 96.53% accuracy.

Conclusion: SCNN effectively learns spatial relationships for structured output tasks, particularly for long continuous shapes or large objects with strong spatial relationships but weak appearance clues.

Abstract: Convolutional neural networks (CNNs) are usually built by stacking convolutional operations layer-by-layer. Although CNN has shown strong capability to extract semantics from raw pixels, its capacity to capture spatial relationships of pixels across rows and columns of an image is not fully explored. These relationships are important to learn semantic objects with strong shape priors but weak appearance coherences, such as traffic lanes, which are often occluded or not even painted on the road surface as shown in Fig. 1 (a). In this paper, we propose Spatial CNN (SCNN), which generalizes traditional deep layer-by-layer convolutions to slice-byslice convolutions within feature maps, thus enabling message passings between pixels across rows and columns in a layer. Such SCNN is particular suitable for long continuous shape structure or large objects, with strong spatial relationship but less appearance clues, such as traffic lanes, poles, and wall. We apply SCNN on a newly released very challenging traffic lane detection dataset and Cityscapse dataset. The results show that SCNN could learn the spatial relationship for structure output and significantly improves the performance. We show that SCNN outperforms the recurrent neural network (RNN) based ReNet and MRF+CNN (MRFNet) in the lane detection dataset by 8.7% and 4.6% respectively. Moreover, our SCNN won the 1st place on the TuSimple Benchmark Lane Detection Challenge, with an accuracy of 96.53%.

[191] Data-Driven Feature Tracking for Event Cameras With and Without Frames

Nico Messikommer, Carter Fang, Mathias Gehrig, Giovanni Cioffi, Davide Scaramuzza

Main category: cs.CV

TL;DR: First data-driven feature tracker for event cameras using frame attention module, works in event-only or hybrid modes (aligned or stereo).

Details

Motivation: Existing event camera feature trackers are handcrafted, require extensive tuning, sensitive to noise, and don't generalize well across scenarios.

Method: Data-driven approach with novel frame attention module that shares information across feature tracks, operates in event-only or hybrid modes (aligned viewpoint or side-by-side stereo).

Result: Achieves robust performance by leveraging low-latency events to track features detected in intensity frames, with stereo configuration providing depth information.

Conclusion: The tracker enhances utility for applications like visual odometry and SLAM by providing depth-aware feature tracking in various configurations.

Abstract: Because of their high temporal resolution, increased resilience to motion blur, and very sparse output, event cameras have been shown to be ideal for low-latency and low-bandwidth feature tracking, even in challenging scenarios. Existing feature tracking methods for event cameras are either handcrafted or derived from first principles but require extensive parameter tuning, are sensitive to noise, and do not generalize to different scenarios due to unmodeled effects. To tackle these deficiencies, we introduce the first data-driven feature tracker for event cameras, which leverages low-latency events to track features detected in an intensity frame. We achieve robust performance via a novel frame attention module, which shares information across feature tracks. Our tracker is designed to operate in two distinct configurations: solely with events or in a hybrid mode incorporating both events and frames. The hybrid model offers two setups: an aligned configuration where the event and frame cameras share the same viewpoint, and a hybrid stereo configuration where the event camera and the standard camera are positioned side-by-side. This side-by-side arrangement is particularly valuable as it provides depth information for each feature track, enhancing its utility in applications such as visual odometry and simultaneous localization and mapping.

[192] A Kolmogorov metric embedding for live cell microscopy signaling patterns

Layton Aho, Mark Winter, Marc DeCarlo, Agne Frismantiene, Yannick Blum, Paolo Armando Gagliardi, Olivier Pertz, Andrew R. Cohen

Main category: cs.CV

TL;DR: A metric embedding method using normalized information distance (NID) to capture spatiotemporal cell signaling patterns from 5-D microscopy movies, requiring no prior knowledge or training data.

Details

Motivation: To develop a theoretically optimal, parameter-minimal approach for analyzing complex spatiotemporal patterns in live cell microscopy without requiring prior knowledge of expected dynamics or training data.

Method: Uses normalized information distance (NID) based on Kolmogorov complexity theory and lossless compression statistics to compute metric distances between 5-D movies. Defines cell signaling structure function (SSF) using metric 3-D image filters that compute voxel intensity configurations around cell centroids. Only parameter is expected cell radii.

Result: The method creates a metric embedding space where Euclidean distance between points approximates optimal pattern differences between corresponding 5-D movies. Demonstrated on synthetic data, 2-D+time movies of ERK/AKT signaling in MCF10A cells, 3-D spheroids under optogenetic ERK manipulation, and ERK dynamics in human stem cell colony differentiation.

Conclusion: The NID-based metric embedding provides a theoretically optimal, training-free approach for analyzing complex spatiotemporal cell signaling patterns across various biological contexts, enabling quantitative comparison of dynamic cellular behaviors.

Abstract: We present a metric embedding that captures spatiotemporal patterns of cell signaling dynamics in 5-D $(x,y,z,channel,time)$ live cell microscopy movies. The embedding uses a metric distance called the normalized information distance (NID) based on Kolmogorov complexity theory, an absolute measure of information content between digital objects. The NID uses statistics of lossless compression to compute a theoretically optimal metric distance between pairs of 5-D movies, requiring no a priori knowledge of expected pattern dynamics, and no training data. The cell signaling structure function (SSF) is defined using a class of metric 3-D image filters that compute at each spatiotemporal cell centroid the voxel intensity configuration of the nucleus w.r.t. the surrounding cytoplasm, or a functional output e.g. velocity. The only parameter is the expected cell radii ($μm$). The SSF can be optionally combined with segmentation and tracking algorithms. The resulting lossless compression pipeline represents each 5-D input movie as a single point in a metric embedding space. The utility of a metric embedding follows from Euclidean distance between any points in the embedding space approximating optimally the pattern difference, as measured by the NID, between corresponding pairs of 5-D movies. This is true throughout the embedding space, not only at points corresponding to input images. Examples are shown for synthetic data, for 2-D+time movies of ERK and AKT signaling under different oncogenic mutations in human epithelial (MCF10A) cells, for 3-D MCF10A spheroids under optogenetic manipulation of ERK, and for ERK dynamics during colony differentiation in human stem cells.

[193] Tuning-Free Adaptive Style Incorporation for Structure-Consistent Text-Driven Style Transfer

Yanqi Ge, Jiaqi Liu, Qingnan Fan, Xi Jiang, Ye Huang, Shuai Qin, Hong Gu, Wen Li, Lixin Duan

Main category: cs.CV

TL;DR: Proposes ASI method for text-driven style transfer in diffusion models, using SiCA and AdaBlending modules to achieve better structure preservation and stylization than prompt concatenation approaches.

Details

Motivation: Existing text-driven style transfer methods in T2I diffusion models directly concatenate content and style prompts, causing structure distortions. Need for better structure preservation while enabling effective style transfer.

Method: Adaptive Style Incorporation (ASI) with two components: 1) Siamese Cross-Attention (SiCA) decouples single-track cross-attention into dual-track to get separate content/style features, 2) Adaptive Content-Style Blending (AdaBlending) couples content and style information in structure-consistent manner.

Result: Method exhibits much better performance in both structure preservation and stylized effects compared to previous approaches.

Conclusion: ASI provides a novel solution for text-driven style transfer that achieves fine-grained feature-level style incorporation while maintaining structural consistency.

Abstract: In this work, we target the task of text-driven style transfer in the context of text-to-image (T2I) diffusion models. The main challenge is consistent structure preservation while enabling effective style transfer effects. The past approaches in this field directly concatenate the content and style prompts for a prompt-level style injection, leading to unavoidable structure distortions. In this work, we propose a novel solution to the text-driven style transfer task, namely, Adaptive Style Incorporation~(ASI), to achieve fine-grained feature-level style incorporation. It consists of the Siamese Cross-Attention~(SiCA) to decouple the single-track cross-attention to a dual-track structure to obtain separate content and style features, and the Adaptive Content-Style Blending (AdaBlending) module to couple the content and style information from a structure-consistent manner. Experimentally, our method exhibits much better performance in both structure preservation and stylized effects.

[194] Jump-teaching: Combating Sample Selection Bias via Temporal Disagreement

Kangye Ji, Fei Cheng, Zeqing Wang, Qichang Zhang, Bohu Huang

Main category: cs.CV

TL;DR: Jump-teaching: An efficient sample selection framework using temporal disagreement for debiased model updates and simplified selection criteria, achieving near-overhead-free training with 4.47× speedup and 54% memory reduction.

Details

Motivation: Existing sample selection methods for combating noisy labels suffer from compounded selection bias and require dual-network disagreement or additional forward propagations, leading to multiplied training overhead. There's a need for more efficient approaches.

Method: Jump-teaching uses temporal disagreement across training iterations for self-correction of selection bias via a jump-manner model update strategy, eliminating multi-network/multi-round training. It employs a sample-wise selection criterion based on intra variance of a decomposed single loss for fine-grained selection without batch-wise ranking or dataset-wise modeling.

Result: Extensive experiments show Jump-teaching outperforms state-of-the-art counterparts while achieving nearly overhead-free selection, boosting training speed by up to 4.47× and reducing peak memory footprint by 54%.

Conclusion: Jump-teaching provides an efficient solution for sample selection with noisy labels by leveraging temporal disagreement and simplified criteria, offering significant performance improvements with minimal computational overhead.

Abstract: Sample selection is a straightforward technique to combat noisy labels, aiming to prevent mislabeled samples from degrading the robustness of neural networks. However, existing methods mitigate compounding selection bias either by leveraging dual-network disagreement or additional forward propagations, leading to multiplied training overhead. To address this challenge, we introduce $\textit{Jump-teaching}$, an efficient sample selection framework for debiased model update and simplified selection criterion. Based on a key observation that a neural network exhibits significant disagreement across different training iterations, Jump-teaching proposes a jump-manner model update strategy to enable self-correction of selection bias by harnessing temporal disagreement, eliminating the need for multi-network or multi-round training. Furthermore, we employ a sample-wise selection criterion building on the intra variance of a decomposed single loss for a fine-grained selection without relying on batch-wise ranking or dataset-wise modeling. Extensive experiments demonstrate that Jump-teaching outperforms state-of-the-art counterparts while achieving a nearly overhead-free selection procedure, which boosts training speed by up to $4.47\times$ and reduces peak memory footprint by $54%$.

[195] AITTI: Learning Adaptive Inclusive Token for Text-to-Image Generation

Xinyu Hou, Xiaoming Li, Chen Change Loy

Main category: cs.CV

TL;DR: AITTI: Adaptive Inclusive Tokens for Text-to-Image generation that mitigates stereotypical biases without requiring explicit attribute specification or prior knowledge of bias distributions.

Details

Motivation: Text-to-image generation models produce high-quality results but suffer from stereotypical biases that compromise fairness. Existing de-biasing approaches require explicit attribute specification or prior knowledge of bias distributions, which limits their applicability.

Method: Proposes lightweight adaptive mapping network that learns inclusive tokens to shift attribute distributions. The network customizes inclusive tokens for concepts to be de-biased, making them generalizable to unseen concepts regardless of original bias distributions. Uses anchor loss with handful of balanced inclusive samples for tuning.

Result: Outperforms previous bias mitigation methods without attribute specification while preserving text-image alignment. Achieves comparable performance to models requiring specific attributes or editing directions. Extensive experiments demonstrate effectiveness in mitigating stereotypical bias.

Conclusion: The adaptive inclusive tokens approach effectively mitigates stereotypical biases in text-to-image generation without needing explicit attribute specification or prior bias knowledge, offering a more flexible and generalizable solution for fair generative models.

Abstract: Despite the high-quality results of text-to-image generation, stereotypical biases have been spotted in their generated contents, compromising the fairness of generative models. In this work, we propose to learn adaptive inclusive tokens to shift the attribute distribution of the final generative outputs. Unlike existing de-biasing approaches, our method requires neither explicit attribute specification nor prior knowledge of the bias distribution. Specifically, the core of our method is a lightweight adaptive mapping network, which can customize the inclusive tokens for the concepts to be de-biased, making the tokens generalizable to unseen concepts regardless of their original bias distributions. This is achieved by tuning the adaptive mapping network with a handful of balanced and inclusive samples using an anchor loss. Experimental results demonstrate that our method outperforms previous bias mitigation methods without attribute specification while preserving the alignment between generative results and text descriptions. Moreover, our method achieves comparable performance to models that require specific attributes or editing directions for generation. Extensive experiments showcase the effectiveness of our adaptive inclusive tokens in mitigating stereotypical bias in text-to-image generation. The code will be available at https://github.com/itsmag11/AITTI.

[196] Deep learning-based ecological analysis of camera trap images is impacted by training data quality and quantity

Peggy A. Bevan, Omiros Pantazis, Holly Pringle, Guilherme Braga Ferreira, Daniel J. Ingram, Emily Madsen, Liam Thomas, Dol Raj Thanet, Thakur Silwal, Santosh Rayamajhi, Gabriel Brostow, Oisin Mac Aodha, Kate E. Jones

Main category: cs.CV

TL;DR: Deep learning models can accurately predict ecological metrics like species richness from camera trap images, but species-specific metrics are more sensitive to classification errors.

Details

Motivation: Manual processing of camera trap images is time-consuming, and while deep learning automates labeling, the impact of classification errors on ecological metrics remains unclear.

Method: Analyzed camera trap data from African savannah (82,300 images, 47 species) and Asian dry forest (40,308 images, 29 species) to compare ecological metrics from expert labels vs. deep learning models, assessing impact of model architecture, label noise, and training dataset size.

Result: Species richness predictions from deep learning closely matched expert labels and were resilient to 10% label noise and 50% training data reduction. Model architecture choice didn’t impact ecological metrics, but less common and visually similar species showed sensitivity in occupancy and activity pattern estimates.

Conclusion: Practitioners should prioritize large, clean training sets and address class imbalance rather than exploring numerous model architectures to ensure reliable ecological findings from automated image classification.

Abstract: Large image collections generated from camera traps offer valuable insights into species richness, occupancy, and activity patterns, significantly aiding biodiversity monitoring. However, the manual processing of these datasets is time-consuming, hindering analytical processes. To address this, deep neural networks have been widely adopted to automate image labelling, but the impact of classification error on key ecological metrics remains unclear. Here, we analyse data from camera trap collections in an African savannah (82,300 labelled images, 47 species) and an Asian sub-tropical dry forest (40,308 labelled images, 29 species) to compare ecological metrics derived from expert-generated species identifications with those generated by deep learning classification models. We specifically assess the impact of deep learning model architecture, proportion of label noise in the training data, and the size of the training dataset on three key ecological metrics: species richness, occupancy, and activity patterns. We found that predictions of species richness derived from deep neural networks closely match those calculated from expert labels and remained resilient to up to 10% noise in the training dataset (mis-labelled images) and a 50% reduction in the training dataset size. We found that our choice of deep learning model architecture (ResNet vs ConvNext-T) or depth (ResNet18, 50, 101) did not impact predicted ecological metrics. In contrast, species-specific metrics were more sensitive; less common and visually similar species were disproportionately affected by a reduction in deep neural network accuracy, with consequences for occupancy and diel activity pattern estimates. To ensure the reliability of their findings, practitioners should prioritize creating large, clean training sets and account for class imbalance across species over exploring numerous deep learning model architectures.

[197] Debiased Orthogonal Boundary-Driven Efficient Noise Mitigation

Hao Li, Jiayang Gu, Jingkuan Song, An Zhang, Lianli Gao

Main category: cs.CV

TL;DR: OSA is a model-agnostic noisy label mitigation method that uses high-dimensional orthogonality to separate clean/noisy samples with one-step inference, offering robust training with low computational overhead.

Details

Motivation: Noisy labels are common in large-scale pre-training but existing mitigation methods are limited by task-specific design, model dependency, and high computational costs.

Method: Exploits high-dimensional orthogonality to find robust boundary in cone space for separating clean/noisy samples. Uses estimator model and scoring function to assess noise level through one-step inference.

Result: Demonstrates enhanced training robustness, improved task transferability, streamlined deployment, and reduced computational overhead across diverse benchmarks, models, and tasks.

Conclusion: OSA provides an effective, model-agnostic paradigm for noisy label mitigation that overcomes limitations of existing methods through efficient one-step inference and robust sample separation.

Abstract: Mitigating the detrimental effects of noisy labels on the training process has become increasingly critical, as obtaining entirely clean or human-annotated samples for large-scale pre-training tasks is often impractical. Nonetheless, existing noise mitigation methods often encounter limitations in practical applications due to their task-specific design, model dependency, and significant computational overhead. In this work, we exploit the properties of high-dimensional orthogonality to identify a robust and effective boundary in cone space for separating clean and noisy samples. Building on this, we propose One-Step Anti-noise (OSA), a model-agnostic noisy label mitigation paradigm that employs an estimator model and a scoring function to assess the noise level of input pairs through just one-step inference. We empirically validate the superiority of OSA, demonstrating its enhanced training robustness, improved task transferability, streamlined deployment, and reduced computational overhead across diverse benchmarks, models, and tasks. Our code is released at https://github.com/leolee99/OSA.

[198] The Hatching-Box: A Novel System for Automated Monitoring and Quantification of Drosophila melanogaster Developmental Behavior

Julian Bigge, Maite Ogueta, Luis Garcia, Benjamin Risse

Main category: cs.CV

TL;DR: Hatching-Box is an automated imaging and analysis system for monitoring Drosophila development in standard vials, eliminating manual experiments through custom hardware and tracking algorithms.

Details

Motivation: To automate monitoring of Drosophila developmental behavior during regular rearing routines without explicit experiments, reducing manual labor and enabling scalable, long-term observation.

Method: Combines custom imaging hardware with dedicated detection and tracking algorithms to quantify larvae, pupae, and flies over multiple days in standard rearing vials using a scalable client/server software architecture.

Result: Successfully reproduced circadian experiment results comparing eclosion periods of wild type vs. clock mutants (per^short, per^long, per^0) without manual labor, and extracted additional group behavior information and individual life-cycle reconstruction.

Conclusion: The Hatching-Box system demonstrates applicability for long-term experiments and benefits for automated monitoring in general Drosophila cultivation processes through affordable, reproducible, and scalable design.

Abstract: In this paper we propose the Hatching-Box, a novel imaging and analysis system to automatically monitor and quantify the developmental behavior of Drosophila in standard rearing vials and during regular rearing routines, rendering explicit experiments obsolete. This is achieved by combining custom tailored imaging hardware with dedicated detection and tracking algorithms, enabling the quantification of larvae, filled/empty pupae and flies over multiple days. Given the affordable and reproducible design of the Hatching-Box in combination with our generic client/server-based software, the system can easily be scaled to monitor an arbitrary amount of rearing vials simultaneously. We evaluated our system on a curated image dataset comprising nearly 470,000 annotated objects and performed several studies on real world experiments. We successfully reproduced results from well-established circadian experiments by comparing the eclosion periods of wild type flies to the clock mutants $\textit{per}^{short}$, $\textit{per}^{long}$ and $\textit{per}^0$ without involvement of any manual labor. Furthermore we show, that the Hatching-Box is able to extract additional information about group behavior as well as to reconstruct the whole life-cycle of the individual specimens. These results not only demonstrate the applicability of our system for long-term experiments but also indicate its benefits for automated monitoring in the general cultivation process.

[199] Symmetrization Weighted Binary Cross-Entropy: Modeling Perceptual Asymmetry for Human-Consistent Neural Edge Detection

Hao Shu

Main category: cs.CV

TL;DR: SWBCE loss improves edge detection by modeling perceptual asymmetry, achieving better visual quality and numerical accuracy than existing methods.

Details

Motivation: Current edge detection models achieve high numerical accuracy but produce edges that lack visual sharpness and perceptual consistency, limiting their reliability in intelligent vision systems. There's a gap between numerical metrics and human perceptual quality.

Method: Introduces Symmetrization Weighted Binary Cross-Entropy (SWBCE) loss, which extends conventional WBCE by incorporating prediction-guided symmetry. It explicitly models perceptual asymmetry in human edge recognition where edge decisions require stronger evidence than non-edge ones.

Result: SWBCE outperforms existing loss functions in both numerical evaluation and visual quality across multiple benchmark datasets and ED architectures. With HED-EES model, SSIM improved by ~15% on BRIND dataset. Consistently achieves best perceptual results in all experiments.

Conclusion: SWBCE provides a perception-inspired optimization approach that bridges the gap between numerical accuracy and perceptual fidelity in edge detection. The method offers a generalizable optimization principle for neural learning systems where asymmetric perceptual reasoning is critical.

Abstract: Edge detection (ED) is a fundamental perceptual process in computer vision, forming the structural basis for high-level reasoning tasks such as segmentation, recognition, and scene understanding. Despite substantial progress achieved by deep neural networks, most ED models attain high numerical accuracy but fail to produce visually sharp and perceptually consistent edges, thereby limiting their reliability in intelligent vision systems. To address this issue, this study introduces the \textit{Symmetrization Weighted Binary Cross-Entropy (SWBCE)} loss, a perception-inspired formulation that extends the conventional WBCE by incorporating prediction-guided symmetry. SWBCE explicitly models the perceptual asymmetry in human edge recognition, wherein edge decisions require stronger evidence than non-edge ones, aligning the optimization process with human perceptual discrimination. The resulting symmetric learning mechanism jointly enhances edge recall and suppresses false positives, achieving a superior balance between quantitative accuracy and perceptual fidelity. Extensive experiments across multiple benchmark datasets and representative ED architectures demonstrate that SWBCE can outperform existing loss functions in both numerical evaluation and visual quality. Particularly with the HED-EES model, the SSIM can be improved by about 15% on BRIND, and in all experiments, training by SWBCE consistently obtains the best perceptual results. Beyond edge detection, the proposed perceptual loss offers a generalizable optimization principle for soft computing and neural learning systems, particularly in scenarios where asymmetric perceptual reasoning plays a critical role.

[200] GreedyPixel: Fine-Grained Black-Box Adversarial Attack Via Greedy Algorithm

Hanrui Wang, Ching-Chun Chang, Chun-Shien Lu, Christopher Leckie, Isao Echizen

Main category: cs.CV

TL;DR: GreedyPixel is a black-box adversarial attack method that uses per-pixel greedy optimization with surrogate guidance and query feedback to achieve white-box-level precision with pixel-wise sparsity and imperceptible perturbations.

Details

Motivation: Existing black-box attack methods face a trade-off between precision and flexibility - pixel-sparse attacks lack adaptability while patch/frequency-based attacks sacrifice precision for efficiency. There's a need to bridge the gap between black-box practicality and white-box performance.

Method: GreedyPixel performs brute-force-style per-pixel greedy optimization guided by a surrogate-derived priority map and refined using query feedback. It evaluates each coordinate directly without gradient information, ensuring monotonic loss reduction and convergence to coordinate-wise optimum.

Result: On CIFAR-10 and ImageNet datasets across CNN and Transformer models, GreedyPixel achieved state-of-the-art success rates with visually imperceptible perturbations, effectively bridging black-box practicality with white-box performance.

Conclusion: GreedyPixel provides a fine-grained black-box attack method that combines the precision of white-box attacks with the practicality of black-box approaches, offering near white-box-level precision with pixel-wise sparsity and perceptual quality.

Abstract: Deep neural networks are highly vulnerable to adversarial examples, which are inputs with small, carefully crafted perturbations that cause misclassification – making adversarial attacks a critical tool for evaluating robustness. Existing black-box methods typically entail a trade-off between precision and flexibility: pixel-sparse attacks (e.g., single- or few-pixel attacks) provide fine-grained control but lack adaptability, whereas patch- or frequency-based attacks improve efficiency or transferability, but at the cost of producing larger and less precise perturbations. We present GreedyPixel, a fine-grained black-box attack method that performs brute-force-style, per-pixel greedy optimization guided by a surrogate-derived priority map and refined by means of query feedback. It evaluates each coordinate directly without any gradient information, guaranteeing monotonic loss reduction and convergence to a coordinate-wise optimum, while also yielding near white-box-level precision and pixel-wise sparsity and perceptual quality. On the CIFAR-10 and ImageNet datasets, spanning convolutional neural networks (CNNs) and Transformer models, GreedyPixel achieved state-of-the-art success rates with visually imperceptible perturbations, effectively bridging the gap between black-box practicality and white-box performance. The implementation is available at https://github.com/azrealwang/greedypixel.

[201] Beautiful Images, Toxic Words: Understanding and Addressing Offensive Text in Generated Images

Aditya Kumar, Tom Blanchard, Adam Dziedzic, Franziska Boenisch

Main category: cs.CV

TL;DR: Diffusion models can generate harmful text within images (offensive language, slurs, explicit terms), and current safety methods fail to prevent this while degrading benign text generation. The paper introduces a targeted fine-tuning approach and releases ToxicBench benchmark.

Details

Motivation: While prior work has addressed NSFW visual content in diffusion models, a new threat emerges: the generation of NSFW text embedded within images (offensive language, racial slurs, sexually explicit terms). All state-of-the-art DMs are vulnerable, and existing mitigation techniques fail to prevent harmful text generation while degrading benign text quality.

Method: Introduces a novel fine-tuning strategy targeting only text-generation layers in DMs. Constructs a safety fine-tuning dataset by pairing each NSFW prompt with two images: one with the NSFW term, and another where that term is replaced with a carefully crafted benign alternative while leaving the image unchanged otherwise.

Result: Demonstrates that all state-of-the-art DMs (SD3, SDXL, Flux, DeepFloyd IF) are vulnerable to NSFW text generation. Existing mitigation techniques fail to prevent harmful text while substantially degrading benign text generation. The proposed fine-tuning approach enables models to avoid generating harmful text while preserving benign content and overall image quality.

Conclusion: Releases ToxicBench, an open-source benchmark for evaluating NSFW text generation in images, including curated fine-tuning dataset, harmful prompts, new evaluation metrics, and assessment pipeline. Aims to guide future efforts in mitigating NSFW text generation in text-to-image models for safe deployment.

Abstract: State-of-the-art Diffusion Models (DMs) produce highly realistic images. While prior work has successfully mitigated Not Safe For Work (NSFW) content in the visual domain, we identify a novel threat: the generation of NSFW text embedded within images. This includes offensive language, such as insults, racial slurs, and sexually explicit terms, posing significant risks to users. We show that all state-of-the-art DMs (e.g., SD3, SDXL, Flux, DeepFloyd IF) are vulnerable to this issue. Through extensive experiments, we demonstrate that existing mitigation techniques, effective for visual content, fail to prevent harmful text generation while substantially degrading benign text generation. As an initial step toward addressing this threat, we introduce a novel fine-tuning strategy that targets only the text-generation layers in DMs. Therefore, we construct a safety fine-tuning dataset by pairing each NSFW prompt with two images: one with the NSFW term, and another where that term is replaced with a carefully crafted benign alternative while leaving the image unchanged otherwise. By training on this dataset, the model learns to avoid generating harmful text while preserving benign content and overall image quality. Finally, to advance research in the area, we release ToxicBench, an open-source benchmark for evaluating NSFW text generation in images. It includes our curated fine-tuning dataset, a set of harmful prompts, new evaluation metrics, and a pipeline that assesses both NSFW-ness and text and image quality. Our benchmark aims to guide future efforts in mitigating NSFW text generation in text-to-image models, thereby contributing to their safe deployment.

[202] RS2-SAM2: Customized SAM2 for Referring Remote Sensing Image Segmentation

Fu Rong, Meng Lan, Qian Zhang, Lefei Zhang

Main category: cs.CV

TL;DR: RS2-SAM2 adapts SAM2 for referring remote sensing image segmentation by aligning visual-text features and generating pseudo-mask prompts.

Details

Motivation: SAM2 performs well in general segmentation but struggles with remote sensing images due to challenges in understanding text-described RS scenes and generating effective text prompts.

Method: Uses union encoder for joint visual-text encoding, bidirectional hierarchical fusion for feature alignment, and mask prompt generator to create pseudo-mask dense prompts for SAM2.

Result: Achieves state-of-the-art performance on multiple RRSIS benchmarks.

Conclusion: RS2-SAM2 successfully adapts SAM2 to remote sensing image segmentation by addressing text-scene understanding and prompt generation challenges.

Abstract: Referring Remote Sensing Image Segmentation (RRSIS) aims to segment target objects in remote sensing (RS) images based on textual descriptions. Although Segment Anything Model 2 (SAM2) has shown remarkable performance in various segmentation tasks, its application to RRSIS presents several challenges, including understanding the text-described RS scenes and generating effective prompts from text. To address these issues, we propose \textbf{RS2-SAM2}, a novel framework that adapts SAM2 to RRSIS by aligning the adapted RS features and textual features while providing pseudo-mask-based dense prompts. Specifically, we employ a union encoder to jointly encode the visual and textual inputs, generating aligned visual and text embeddings as well as multimodal class tokens. A bidirectional hierarchical fusion module is introduced to adapt SAM2 to RS scenes and align adapted visual features with the visually enhanced text embeddings, improving the model’s interpretation of text-described RS scenes. To provide precise target cues for SAM2, we design a mask prompt generator, which takes the visual embeddings and class tokens as input and produces a pseudo-mask as the dense prompt of SAM2. Experimental results on several RRSIS benchmarks demonstrate that RS2-SAM2 achieves state-of-the-art performance.

[203] High-Quality 3D Head Reconstruction from Any Single Portrait Image

Jianfu Zhang, Yujie Gao, Jiahui Zhan, Wentao Wang, Yiyi Zhang, Haohua Zhao, Liqing Zhang

Main category: cs.CV

TL;DR: Novel 3D head reconstruction method from single portrait images using multi-view diffusion with identity/expression guidance and a new high-quality dataset.

Details

Motivation: Existing 2D-to-3D methods struggle with high-quality portrait reconstruction due to missing identity, expression, hair, and accessory information in single images.

Method: Created new dataset (227 sequences, 96 perspectives, 21,792 frames). Integrated identity/expression info into multi-view diffusion with guidance and supervision for facial consistency. Generated orbital videos for 3D reconstruction.

Result: Method demonstrates robust performance across challenging scenarios including side-face angles and complex accessories.

Conclusion: Proposed approach effectively addresses limitations of existing methods for high-fidelity 3D head reconstruction from single portraits.

Abstract: In this work, we introduce a novel high-fidelity 3D head reconstruction method from a single portrait image, regardless of perspective, expression, or accessories. Despite significant efforts in adapting 2D generative models for novel view synthesis and 3D optimization, most methods struggle to produce high-quality 3D portraits. The lack of crucial information, such as identity, expression, hair, and accessories, limits these approaches in generating realistic 3D head models. To address these challenges, we construct a new high-quality dataset containing 227 sequences of digital human portraits captured from 96 different perspectives, totalling 21,792 frames, featuring diverse expressions and accessories. To further improve performance, we integrate identity and expression information into the multi-view diffusion process to enhance facial consistency across views. Specifically, we apply identity- and expression-aware guidance and supervision to extract accurate facial representations, which guide the model and enforce objective functions to ensure high identity and expression consistency during generation. Finally, we generate an orbital video around the portrait consisting of 96 multi-view frames, which can be used for 3D portrait model reconstruction. Our method demonstrates robust performance across challenging scenarios, including side-face angles and complex accessories

[204] TriDF: Triplane-Accelerated Density Fields for Few-Shot Remote Sensing Novel View Synthesis

Jiaming Kang, Keyan Chen, Zhengxia Zou, Zhenwei Shi

Main category: cs.CV

TL;DR: TriDF is an efficient hybrid 3D representation for fast remote sensing novel view synthesis from as few as 3 input views, achieving 30x speedup over NeRF methods while improving rendering quality.

Details

Motivation: Remote sensing scenes often lack sufficient multi-view images due to acquisition constraints, and existing NVS methods either overfit with limited views or are computationally intensive and perform poorly in remote sensing contexts.

Method: Decouples color and volume density information, mapping high-frequency color onto triplane representation while modeling density as continuous fields with reference features from neighboring views. Uses depth-guided optimization based on point clouds to mitigate overfitting.

Result: Achieves 30x speed increase compared to NeRF-based methods, with 7.4% improvement in PSNR and 3.4% in SSIM over advanced few-shot methods across multiple remote sensing scenes.

Conclusion: TriDF provides an efficient hybrid representation that enables fast, high-quality novel view synthesis from very few input views in remote sensing applications, addressing both computational efficiency and quality limitations of existing approaches.

Abstract: Remote sensing novel view synthesis (NVS) offers significant potential for 3D interpretation of remote sensing scenes, with important applications in urban planning and environmental monitoring. However, remote sensing scenes frequently lack sufficient multi-view images due to acquisition constraints. While existing NVS methods tend to overfit when processing limited input views, advanced few-shot NVS methods are computationally intensive and perform sub-optimally in remote sensing scenes. This paper presents TriDF, an efficient hybrid 3D representation for fast remote sensing NVS from as few as 3 input views. Our approach decouples color and volume density information, modeling them independently to reduce the computational burden on implicit radiance fields and accelerate reconstruction.We explore the potential of the triplane representation in few-shot NVS tasks by mapping high-frequency color information onto this compact structure, and the direct optimization of feature planes significantly speeds up convergence. Volume density is modeled as continuous density fields, incorporating reference features from neighboring views through image-based rendering to compensate for limited input data. Additionally, we introduce depth-guided optimization based on point clouds, which effectively mitigates the overfitting problem in few-shot NVS.Comprehensive experiments across multiple remote sensing scenes demonstrate that our hybrid representation achieves a 30x speed increase compared to NeRF-based methods, while simultaneously improving rendering quality metrics over advanced few-shot methods (7.4% increase in PSNR and 3.4% in SSIM). The code is publicly available at https://github.com/kanehub/TriDF

[205] RTV-Bench: Benchmarking MLLM Continuous Perception, Understanding and Reasoning through Real-Time Video

Shuhang Xun, Sicheng Tao, Jungang Li, Yibo Shi, Zhixin Lin, Zhanhui Zhu, Yibo Yan, Hanqian Li, Linghao Zhang, Shikang Wang, Yixin Liu, Hanbo Zhang, Ying Ma, Xuming Hu

Main category: cs.CV

TL;DR: RTV-Bench is a new benchmark for evaluating Multimodal Large Language Models on real-time video analysis, featuring multi-timestamp QA, hierarchical questions, and multi-dimensional evaluation across 552 videos and 4,608 QA pairs.

Details

Motivation: Existing benchmarks are insufficient for evaluating MLLMs' abilities under continuous, dynamic real-world video streams where models need to maintain coherent understanding as scenes evolve over time.

Method: Built RTV-Bench based on three principles: 1) multi-timestamp question answering, 2) hierarchical question structures spanning perception and reasoning, and 3) multi-dimensional evaluation of continuous perception, understanding, and reasoning. The benchmark includes 552 diverse videos and 4,608 carefully curated QA pairs.

Result: Real-time models generally outperform offline counterparts but still lag behind leading proprietary systems. Scaling model capacity yields performance gains, but simply increasing input frame density doesn’t consistently improve results.

Conclusion: Current architectures have inherent limitations in handling long-horizon video streams, highlighting the need for models explicitly designed for streaming video processing and analysis.

Abstract: Multimodal Large Language Models (MLLMs) have made rapid progress in perception, understanding, and reasoning, yet existing benchmarks fall short in evaluating these abilities under continuous and dynamic real-world video streams. Such settings require models to maintain coherent understanding and reasoning as visual scenes evolve over time. We introduce RTV-Bench, a fine-grained benchmark for real-time video analysis with MLLMs. It is built upon three key principles: multi-timestamp question answering, hierarchical question structures spanning perception and reasoning, and multi-dimensional evaluation of continuous perception, understanding, and reasoning. RTV-Bench comprises 552 diverse videos and 4,608 carefully curated QA pairs covering a wide range of dynamic scenarios. We evaluate a broad range of state-of-the-art MLLMs, including proprietary, open-source offline, and open-source real-time models. Our results show that real-time models generally outperform offline counterparts but still lag behind leading proprietary systems. While scaling model capacity generally yields performance gains, simply increasing the density of sampled input frames does not consistently translate into improved results. These observations suggest inherent limitations in current architectures when handling long-horizon video streams, underscoring the need for models explicitly designed for streaming video processing and analysis.

[206] Towards Understanding Deep Learning Model in Image Recognition via Coverage Test

Wenkai Li, Xiaoqi Li, Yingjie Mao, Yishun Wang

Main category: cs.CV

TL;DR: This paper conducts empirical research on DNN security testing coverage metrics, analyzing relationships between model depth, configuration, and four coverage metrics across different architectures (LeNet, VGG, ResNet) with varying layers.

Details

Motivation: Despite the emergence of various neural network coverage metrics for DNN security testing, there's a lack of empirical research analyzing relationships between model depth, configuration information, and coverage metrics. The paper aims to fill this gap by investigating patterns and relationships among different coverage metrics.

Method: Conducted empirical experiments using LeNet, VGG, and ResNet architectures with 10 models of varying depths (5-54 layers). Investigated four coverage metrics: primary functionality, boundary, hierarchy, and structural coverage. Also examined relationships between modified decision/condition coverage and dataset size.

Result: The empirical study provides comparative analysis of relationships between different depths, configuration information, and various neural network coverage metrics. The investigation reveals patterns in how coverage metrics relate to model architecture and depth.

Conclusion: The research contributes to DNN security testing by empirically analyzing coverage metric relationships. Three potential future directions are proposed to further advance security testing of DNN models.

Abstract: Deep neural networks (DNNs) play a crucial role in the field of artificial intelligence, and their security-related testing has been a prominent research focus. By inputting test cases, the behavior of models is examined for anomalies, and coverage metrics are utilized to determine the extent of neurons covered by these test cases. With the widespread application and advancement of DNNs, different types of neural behaviors have garnered attention, leading to the emergence of various coverage metrics for neural networks. However, there is currently a lack of empirical research on these coverage metrics, specifically in analyzing the relationships and patterns between model depth, configuration information, and neural network coverage. This paper aims to investigate the relationships and patterns of four coverage metrics: primary functionality, boundary, hierarchy, and structural coverage. A series of empirical experiments were conducted, selecting LeNet, VGG, and ResNet as different DNN architectures, along with 10 models of varying depths ranging from 5 to 54 layers, to compare and study the relationships between different depths, configuration information, and various neural network coverage metrics. Additionally, an investigation was carried out on the relationships between modified decision/condition coverage and dataset size. Finally, three potential future directions are proposed to further contribute to the security testing of DNN Models.

[207] AgriFM: A Multi-source Temporal Remote Sensing Foundation Model for Agriculture Mapping

Wenyuan Li, Shunlin Liang, Keyan Chen, Yongzhe Chen, Han Ma, Jianglei Xu, Yichuan Ma, Shikang Guan, Husheng Fang, Zhenwei Shi

Main category: cs.CV

TL;DR: AgriFM is a multi-source remote sensing foundation model specifically designed for agricultural crop mapping that addresses limitations of current transformer-based models by enabling simultaneous hierarchical spatiotemporal feature extraction across multiple scales.

Details

Motivation: Current transformer-based remote sensing foundation models (RSFMs) are suboptimal for crop mapping because they either use fixed spatiotemporal windows that ignore the multi-scale nature of crop systems or completely disregard temporal information by focusing only on spatial patterns. Crop mapping requires modeling multi-scale spatiotemporal patterns ranging from field textures to landscape context, and from short-term phenological transitions to full growing-season dynamics.

Method: Developed AgriFM using a modified Video Swin Transformer architecture where temporal down-sampling is synchronized with spatial scaling operations, enabling efficient unified processing of long time-series satellite inputs. The model leverages data from three satellite sources (MODIS, Landsat-8/9, Sentinel-2) and is pre-trained on a global dataset of over 25 million image samples supervised by land cover products. Includes a versatile decoder architecture that dynamically fuses learned spatiotemporal representations.

Result: Comprehensive evaluations demonstrate AgriFM’s superior performance over conventional deep learning approaches and state-of-the-art general-purpose RSFMs across all downstream tasks in agricultural crop mapping.

Conclusion: AgriFM successfully bridges the gaps in current RSFMs for crop mapping by enabling simultaneous hierarchical spatiotemporal feature extraction, making it a powerful foundation model specifically designed for agricultural applications with demonstrated superior performance over existing approaches.

Abstract: Accurate crop mapping fundamentally relies on modeling multi-scale spatiotemporal patterns, where spatial scales range from individual field textures to landscape-level context, and temporal scales capture both short-term phenological transitions and full growing-season dynamics. Transformer-based remote sensing foundation models (RSFMs) offer promising potential for crop mapping due to their innate ability for unified spatiotemporal processing. However, current RSFMs remain suboptimal for crop mapping: they either employ fixed spatiotemporal windows that ignore the multi-scale nature of crop systems or completely disregard temporal information by focusing solely on spatial patterns. To bridge these gaps, we present AgriFM, a multi-source remote sensing foundation model specifically designed for agricultural crop mapping. Our approach begins by establishing the necessity of simultaneous hierarchical spatiotemporal feature extraction, leading to the development of a modified Video Swin Transformer architecture where temporal down-sampling is synchronized with spatial scaling operations. This modified backbone enables efficient unified processing of long time-series satellite inputs. AgriFM leverages temporally rich data streams from three satellite sources including MODIS, Landsat-8/9 and Sentinel-2, and is pre-trained on a global representative dataset comprising over 25 million image samples supervised by land cover products. The resulting framework incorporates a versatile decoder architecture that dynamically fuses these learned spatiotemporal representations, supporting diverse downstream tasks. Comprehensive evaluations demonstrate AgriFM’s superior performance over conventional deep learning approaches and state-of-the-art general-purpose RSFMs across all downstream tasks. Codes will be available at https://github.com/flyakon/AgriFM.

[208] Normalize Filters! Classical Wisdom for Deep Vision

Gustavo Perez, Stella X. Yu

Main category: cs.CV

TL;DR: The paper proposes filter normalization for convolutional filters in deep networks to make them atmosphere-equivariant, addressing distortion issues when images undergo atmospheric transfer, leading to improved robustness and generalization.

Details

Motivation: Classical image filters are carefully normalized for consistency and to avoid artifacts, but convolutional filters learned end-to-end in deep networks lack such constraints. This causes distorted responses when images undergo atmospheric transfer, leading to incorrect outcomes.

Method: Proposes filter normalization followed by learnable scaling and shifting (similar to batch normalization) to ensure filters are atmosphere-equivariant and enable co-domain symmetry. Integrates classical filtering principles into deep learning for both CNNs and convolution-dependent vision transformers.

Result: Achieves significant improvements on artificial and natural intensity variation benchmarks. ResNet34 with filter normalization could even outperform CLIP by a large margin. Analysis shows unnormalized filters degrade performance, while filter normalization regularizes learning, promotes diversity, and improves robustness and generalization.

Conclusion: Filter normalization is a simple yet effective modification that addresses the lack of constraints in learned convolutional filters, making them atmosphere-equivariant and significantly improving performance on intensity variation tasks while enhancing robustness and generalization.

Abstract: Classical image filters, such as those for averaging or differencing, are carefully normalized to ensure consistency, interpretability, and to avoid artifacts like intensity shifts, halos, or ringing. In contrast, convolutional filters learned end-to-end in deep networks lack such constraints. Although they may resemble wavelets and blob/edge detectors, they are not normalized in the same or any way. Consequently, when images undergo atmospheric transfer, their responses become distorted, leading to incorrect outcomes. We address this limitation by proposing filter normalization, followed by learnable scaling and shifting, akin to batch normalization. This simple yet effective modification ensures that the filters are atmosphere-equivariant, enabling co-domain symmetry. By integrating classical filtering principles into deep learning (applicable to both convolutional neural networks and convolution-dependent vision transformers), our method achieves significant improvements on artificial and natural intensity variation benchmarks. Our ResNet34 could even outperform CLIP by a large margin. Our analysis reveals that unnormalized filters degrade performance, whereas filter normalization regularizes learning, promotes diversity, and improves robustness and generalization.

[209] A Study of Commonsense Reasoning over Visual Object Properties

Abhishek Kolari, Mohammadhossein Khojasteh, Yifan Jiang, Floris den Hengst, Filip Ilievski

Main category: cs.CV

TL;DR: OPTICS introduces a systematic VQA benchmark to evaluate VLMs’ object property reasoning across image types, reasoning levels, and property dimensions, revealing significant limitations compared to humans.

Details

Motivation: Current VQA studies blend perception and reasoning, lack representativeness in reasoning and image categories, making it unclear whether and how VLMs abstract and reason over depicted objects' properties.

Method: Created systematic evaluation framework with three image types, three reasoning levels, and four object property dimensions. Developed two benchmarks: OPTICS-CNT (360 images, 1,080 count-based questions) and OPTICS-CMP (2.1k comparison questions). Evaluated 12 state-of-the-art VLMs in zero-shot settings.

Result: VLMs show significant limitations: best model achieved below 40% counting accuracy and 70% comparison accuracy. Models struggle with photographic images, counterfactual reasoning, physical/functional properties, and higher counts.

Conclusion: VLMs have substantial gaps in object property reasoning compared to humans. The OPTICS benchmark provides resources for future work on scalable benchmarking, generalized annotation guidelines, and advanced reasoning VLMs.

Abstract: Inspired by human categorization, object property reasoning involves identifying and recognizing low-level details and higher-level abstractions. While current visual question answering (VQA) studies consider multiple object properties, such as size, they typically blend perception and reasoning and lack representativeness in terms of reasoning and image categories, making it unclear whether and how vision-language models (VLMs) abstract and reason over depicted objects. To this end, we introduce a systematic evaluation framework comprising images of three representative types, three reasoning levels of increasing complexity, and four object property dimensions, informed by prior work on common sense. We develop a procedure to instantiate this framework in two VQA object reasoning benchmarks: OPTICS-CNT, comprising 360 images paired with 1,080 multi-level, count-based questions, and OPTICS-CMP, with 2.1k comparison questions. Experiments with 12 state-of-the-art VLMs in zero-shot settings reveal significant limitations relative to humans, with the best-performing model achieving below 40% counting and 70% comparison accuracy. VLMs struggle particularly with photographic images, counterfactual reasoning, physical and functional properties, and higher counts. We make the OPTICS benchmark data and code available to support future work on scalable benchmarking methods, generalized annotation guidelines, and advanced reasoning VLMs.

[210] Unleashing Semantic and Geometric Priors for 3D Scene Completion

Shiyuan Chen, Wei Sui, Bohao Zhang, Zeyd Boukhers, John See, Cong Yang

Main category: cs.CV

TL;DR: FoundationSSC is a novel framework for camera-based 3D semantic scene completion that uses dual decoupling at source and pathway levels to separate semantic and geometric processing, achieving state-of-the-art performance.

Details

Motivation: Existing methods rely on a coupled encoder that forces trade-offs between conflicting semantic and geometric demands, limiting overall performance in 3D semantic scene completion.

Method: Proposes FoundationSSC with dual decoupling: (1) source-level decoupling using a foundation encoder providing semantic features and stereo cost volumes; (2) pathway-level decoupling with specialized branches; (3) hybrid view transformation; (4) Axis-Aware Fusion module for anisotropic feature merging.

Result: Achieves simultaneous improvements in both semantic and geometric metrics: +0.23 mIoU and +2.03 IoU on SemanticKITTI, and state-of-the-art 21.78 mIoU and 48.61 IoU on SSCBench-KITTI-360.

Conclusion: The dual-decoupling design effectively separates semantic and geometric processing, enabling superior performance in 3D semantic scene completion by addressing the limitations of coupled encoders and providing better feature fusion.

Abstract: Camera-based 3D semantic scene completion (SSC) provides dense geometric and semantic perception for autonomous driving and robotic navigation. However, existing methods rely on a coupled encoder to deliver both semantic and geometric priors, which forces the model to make a trade-off between conflicting demands and limits its overall performance. To tackle these challenges, we propose FoundationSSC, a novel framework that performs dual decoupling at both the source and pathway levels. At the source level, we introduce a foundation encoder that provides rich semantic feature priors for the semantic branch and high-fidelity stereo cost volumes for the geometric branch. At the pathway level, these priors are refined through specialised, decoupled pathways, yielding superior semantic context and depth distributions. Our dual-decoupling design produces disentangled and refined inputs, which are then utilised by a hybrid view transformation to generate complementary 3D features. Additionally, we introduce a novel Axis-Aware Fusion (AAF) module that addresses the often-overlooked challenge of fusing these features by anisotropically merging them into a unified representation. Extensive experiments demonstrate the advantages of FoundationSSC, achieving simultaneous improvements in both semantic and geometric metrics, surpassing prior bests by +0.23 mIoU and +2.03 IoU on SemanticKITTI. Additionally, we achieve state-of-the-art performance on SSCBench-KITTI-360, with 21.78 mIoU and 48.61 IoU.

[211] FastMesh: Efficient Artistic Mesh Generation via Component Decoupling

Jeonghwan Kim, Yushi Lan, Armando Fortes, Yongwei Chen, Xingang Pan

Main category: cs.CV

TL;DR: A novel mesh generation framework that separates vertex and face generation, reducing token redundancy by 77% and achieving 8x faster generation with higher quality meshes.

Details

Motivation: Existing mesh generation methods tokenize meshes into sequences where vertices are redundantly repeated (since each vertex is shared by multiple faces), leading to excessively long token sequences and inefficient generation processes.

Method: 1) Use autoregressive model only for vertex generation (reducing tokens to ~23% of existing methods). 2) Employ bidirectional transformer to complete mesh in single step by capturing inter-vertex relationships and constructing adjacency matrix for faces. 3) Add fidelity enhancer to refine vertex positioning. 4) Post-processing framework to remove undesirable edge connections.

Result: Achieves more than 8x faster mesh generation speed compared to state-of-the-art approaches while producing higher mesh quality.

Conclusion: The proposed framework efficiently generates artistic meshes by separating vertex and face generation, significantly reducing redundancy and improving both speed and quality compared to existing methods.

Abstract: Recent mesh generation approaches typically tokenize triangle meshes into sequences of tokens and train autoregressive models to generate these tokens sequentially. Despite substantial progress, such token sequences inevitably reuse vertices multiple times to fully represent manifold meshes, as each vertex is shared by multiple faces. This redundancy leads to excessively long token sequences and inefficient generation processes. In this paper, we propose an efficient framework that generates artistic meshes by treating vertices and faces separately, significantly reducing redundancy. We employ an autoregressive model solely for vertex generation, decreasing the token count to approximately 23% of that required by the most compact existing tokenizer. Next, we leverage a bidirectional transformer to complete the mesh in a single step by capturing inter-vertex relationships and constructing the adjacency matrix that defines the mesh faces. To further improve the generation quality, we introduce a fidelity enhancer to refine vertex positioning into more natural arrangements and propose a post-processing framework to remove undesirable edge connections. Experimental results show that our method achieves more than 8x faster speed on mesh generation compared to state-of-the-art approaches, while producing higher mesh quality.

[212] Encoder-Only Image Registration

Xiang Chen, Renjiu Hu, Jinwei Zhang, Yuxi Zhang, Xinyao Yu, Min Liu, Yaonan Wang, Hang Zhang

Main category: cs.CV

TL;DR: EOIR is an encoder-only image registration framework that separates feature learning from flow estimation using a 3-layer ConvNet and Laplacian pyramid approach to achieve better accuracy-efficiency trade-offs for deformable image registration.

Details

Motivation: Learning-based deformable image registration still faces challenges with computational complexity and handling large deformations. The paper aims to understand how ConvNets influence registration performance and develop a more efficient framework.

Method: Analyzed ConvNets’ roles in registration using Horn-Schunck optical flow equation, then proposed EOIR framework that separates feature learning from flow estimation. Uses only a 3-layer ConvNet for feature extraction and 3-layer flow estimators to build Laplacian feature pyramid, progressively composing diffeomorphic deformations under large-deformation model.

Result: EOIR achieves superior accuracy-efficiency and accuracy-smoothness trade-offs across five datasets of different modalities and anatomical regions. With comparable accuracy, it provides better efficiency and smoothness, and vice versa.

Conclusion: EOIR effectively addresses computational complexity and large deformation challenges in deformable image registration by separating feature learning from flow estimation, achieving optimal trade-offs between accuracy, efficiency, and smoothness.

Abstract: Learning-based techniques have significantly improved the accuracy and speed of deformable image registration. However, challenges such as reducing computational complexity and handling large deformations persist. To address these challenges, we analyze how convolutional neural networks (ConvNets) influence registration performance using the Horn-Schunck optical flow equation. Supported by prior studies and our empirical experiments, we observe that ConvNets play two key roles in registration: linearizing local intensities and harmonizing global contrast variations. Based on these insights, we propose the Encoder-Only Image Registration (EOIR) framework, designed to achieve a better accuracy-efficiency trade-off. EOIR separates feature learning from flow estimation, employing only a 3-layer ConvNet for feature extraction and a set of 3-layer flow estimators to construct a Laplacian feature pyramid, progressively composing diffeomorphic deformations under a large-deformation model. Results on five datasets across different modalities and anatomical regions demonstrate EOIR’s effectiveness, achieving superior accuracy-efficiency and accuracy-smoothness trade-offs. With comparable accuracy, EOIR provides better efficiency and smoothness, and vice versa. The source code of EOIR is publicly available on https://github.com/XiangChen1994/EOIR.

[213] Graph Algorithm Unrolling with Douglas-Rachford Iterations for Image Interpolation with Guaranteed Initialization

Xue Zhang, Bingshuo Hu, Gene Cheung

Main category: cs.CV

TL;DR: The paper proposes a novel neural network initialization method using graph filters and Douglas-Rachford iterations for image interpolation, achieving SOTA results with fewer parameters.

Details

Motivation: Conventional DNNs use random initialization followed by SGD optimization, which risks poor local minima. The authors aim to develop a more principled initialization approach using graph theory to improve image interpolation performance while reducing parameters.

Method: 1) Initialize directed graph adjacency matrix A based on known interpolator Θ. 2) Learn perturbation matrices P and P(2) from data to augment A. 3) Implement restoration effects via Douglas-Rachford iterations unrolled into lightweight interpretable neural network.

Result: Experimental results demonstrate state-of-the-art image interpolation performance while drastically reducing network parameters compared to conventional approaches.

Conclusion: The proposed method provides a principled graph-based initialization and optimization approach that outperforms traditional random initialization + SGD methods for image interpolation, offering both performance gains and parameter efficiency.

Abstract: Conventional deep neural nets (DNNs) initialize network parameters at random and then optimize each one via stochastic gradient descent (SGD), resulting in substantial risk of poor-performing local minima.Focusing on the image interpolation problem and leveraging a recent theorem that maps a (pseudo-)linear interpolator Θ to a directed graph filter that is a solution to a MAP problem regularized with a graph shift variation (GSV) prior, we first initialize a directed graph adjacency matrix A based on a known interpolator Θ, establishing a baseline performance.Then, towards further gain, we learn perturbation matrices P and P(2) from data to augment A, whose restoration effects are implemented via Douglas-Rachford (DR) iterations, which we unroll into a lightweight interpretable neural net.Experimental results demonstrate state-of-the-art image interpolation results, while drastically reducing network parameters.

[214] SPATIALGEN: Layout-guided 3D Indoor Scene Generation

Chuan Fang, Heng Li, Yixun Liang, Jia Zheng, Yongsen Mao, Yuan Liu, Rui Tang, Zihan Zhou, Ping Tan

Main category: cs.CV

TL;DR: SpatialGen is a multi-view multi-modal diffusion model that generates realistic 3D indoor scenes with appearance, geometry, and semantic information from layout and reference inputs, using a new large-scale synthetic dataset.

Details

Motivation: Manual 3D modeling of indoor environments is time-consuming, and existing automated methods struggle with balancing visual quality, diversity, semantic consistency, and user control. There's a lack of large-scale, high-quality datasets for this task.

Method: Introduced a comprehensive synthetic dataset with 12,328 structured annotated scenes, 57,431 rooms, and 4.7M photorealistic 2D renderings. Developed SpatialGen, a multi-view multi-modal diffusion model that takes 3D layout and reference image (from text prompt) to synthesize appearance (color image), geometry (scene coordinate map), and semantics (segmentation map) from arbitrary viewpoints while maintaining spatial consistency.

Result: SpatialGen consistently generates superior results compared to previous methods, producing realistic and semantically consistent 3D indoor scenes. The dataset and models are being open-sourced to advance the field.

Conclusion: The proposed SpatialGen model and comprehensive dataset address key challenges in automated 3D indoor scene generation, enabling high-fidelity, consistent scene synthesis with user control, and will empower the research community through open-source release.

Abstract: Creating high-fidelity 3D models of indoor environments is essential for applications in design, virtual reality, and robotics. However, manual 3D modeling remains time-consuming and labor-intensive. While recent advances in generative AI have enabled automated scene synthesis, existing methods often face challenges in balancing visual quality, diversity, semantic consistency, and user control. A major bottleneck is the lack of a large-scale, high-quality dataset tailored to this task. To address this gap, we introduce a comprehensive synthetic dataset, featuring 12,328 structured annotated scenes with 57,431 rooms, and 4.7M photorealistic 2D renderings. Leveraging this dataset, we present SpatialGen, a novel multi-view multi-modal diffusion model that generates realistic and semantically consistent 3D indoor scenes. Given a 3D layout and a reference image (derived from a text prompt), our model synthesizes appearance (color image), geometry (scene coordinate map), and semantic (semantic segmentation map) from arbitrary viewpoints, while preserving spatial consistency across modalities. SpatialGen consistently generates superior results to previous methods in our experiments. We are open-sourcing our data and models to empower the community and advance the field of indoor scene understanding and generation.

[215] Depth Edge Alignment Loss: DEALing with Depth in Weakly Supervised Semantic Segmentation

Patrick Schmidt, Vasileios Belagiannis, Lazaros Nalpantidis

Main category: cs.CV

TL;DR: Proposes Depth Edge Alignment Loss (DEAL) to improve Weakly Supervised Semantic Segmentation using depth information from robotic systems, achieving up to +16.416 mIoU improvements.

Details

Motivation: Autonomous robotic systems require expensive pixel-level dense labels for semantic segmentation training. Weak supervision with image-level labels is cheaper but less accurate. Depth information is commonly available in robotic systems and can provide additional supervision to improve segmentation quality.

Method: Introduces model-agnostic Depth Edge Alignment Loss (DEAL) that leverages pixel-level depth information to improve weakly supervised semantic segmentation. The approach generates pixel-level semantic labels from image-level supervision and aligns segmentation boundaries with depth edges from depth sensors.

Result: Improves segmentation performance across multiple datasets and models: +5.439 mIoU on PASCAL VOC validation, +1.274 mIoU on MS COCO validation, and +16.416 mIoU on HOPE static onboarding split. Can be combined with other losses for even better performance.

Conclusion: Depth information from robotic systems provides valuable supervision for improving weakly supervised semantic segmentation. The proposed DEAL approach is model-agnostic and effectively leverages depth edges to enhance segmentation boundaries, making it practical for robotic applications where depth sensors are commonly available.

Abstract: Autonomous robotic systems applied to new domains require an abundance of expensive, pixel-level dense labels to train robust semantic segmentation models under full supervision. This study proposes a model-agnostic Depth Edge Alignment Loss to improve Weakly Supervised Semantic Segmentation models across different datasets. The methodology generates pixel-level semantic labels from image-level supervision, avoiding expensive annotation processes. While weak supervision is widely explored in traditional computer vision, our approach adds supervision with pixel-level depth information, a modality commonly available in robotic systems. We demonstrate how our approach improves segmentation performance across datasets and models, but can also be combined with other losses for even better performance, with improvements up to +5.439, +1.274 and +16.416 points in mean Intersection over Union on the PASCAL VOC / MS COCO validation, and the HOPE static onboarding split, respectively. Our code is made publicly available at https://github.com/DTU-PAS/DEAL.

[216] YOLO26: Key Architectural Enhancements and Performance Benchmarking for Real-Time Object Detection

Ranjan Sapkota, Rahul Harsha Cheppally, Ajay Sharda, Manoj Karkee

Main category: cs.CV

TL;DR: YOLO26 is the latest YOLO variant with architectural improvements for edge device efficiency, supporting multiple vision tasks and offering flexible deployment options with competitive performance benchmarks.

Details

Motivation: To develop an advanced real-time object detection framework optimized for edge and low-power devices, addressing deployment challenges while maintaining high accuracy across various computer vision tasks.

Method: Architectural innovations include removing Distribution Focal Loss, adopting end-to-end NMS-free inference, integrating ProgLoss and Small-Target-Aware Label Assignment, and introducing MuSGD optimizer for stable convergence. Supports multi-task framework for detection, segmentation, pose estimation, oriented detection, and classification.

Result: Performance benchmarks on edge devices (NVIDIA Jetson Nano/Orin) show competitive results compared to YOLOv8, YOLOv11-13, and transformer-based detectors. Flexible export options (ONNX, TensorRT, CoreML, TFLite) and quantization support (INT8/FP16) enable efficient deployment across robotics, manufacturing, and IoT applications.

Conclusion: YOLO26 represents a significant advancement in the YOLO lineage, offering deployment-ready efficiency for edge devices while supporting multiple vision tasks. The framework demonstrates cross-industry adaptability and provides practical deployment pathways with future directions outlined for continued evolution.

Abstract: This study presents a comprehensive analysis of Ultralytics YOLO26(also called as YOLOv26), highlighting its key architectural enhancements and performance benchmarking for real-time object detection. YOLO26, released in September 2025, stands as the newest and most advanced member of the YOLO family, purpose-built to deliver efficiency, accuracy, and deployment readiness on edge and low-power devices. The paper sequentially details architectural innovations of YOLO26, including the removal of Distribution Focal Loss (DFL), adoption of end-to-end NMS-free inference, integration of ProgLoss and Small-Target-Aware Label Assignment (STAL), and the introduction of the MuSGD optimizer for stable convergence. Beyond architecture, the study positions YOLO26 as a multi-task framework, supporting object detection, instance segmentation, pose/keypoints estimation, oriented detection, and classification. We present performance benchmarks of YOLO26 on edge devices such as NVIDIA Jetson Nano and Orin, comparing its results with YOLOv8, YOLOv11, YOLOv12, YOLOv13, and transformer-based detectors(RF-DETR and RT-DETR). This paper further explores real-time deployment pathways, flexible export options (ONNX, TensorRT, CoreML, TFLite), and quantization for INT8/FP16. Practical use cases of YOLO26 across robotics, manufacturing, and IoT are highlighted to demonstrate cross-industry adaptability. Finally, insights on deployment efficiency and broader implications are discussed, with future directions for YOLO26 and the YOLO lineage outlined.

[217] A Geometric Unification of Generative AI with Manifold-Probabilistic Projection Models

Leah Bar, Liron Mor Yosef, Shai Zucker, Neta Shoham, Inbar Seroussi, Nir Sochen

Main category: cs.CV

TL;DR: The paper proposes a unified geometric-probabilistic framework for blind image denoising and image generation, introducing Manifold-Probabilistic Projection Model (MPPM) that outperforms Latent Diffusion Models.

Details

Motivation: Current generative AI models treat images as low-dimensional objects in high-dimensional space but overlook geometric structure, focusing only on probabilistic methods. The probability distribution in latent space is often predefined as uniform or considered uninteresting. The paper aims to unify geometric and probabilistic perspectives for better blind image denoising and image generation.

Method: Introduces a novel framework combining geometric assumptions with kernel-based probabilistic methods. Uses explicit and implicit manifold descriptions through distance functions. Interprets diffusion models as projection mechanisms onto the “good images” manifold. Develops Manifold-Probabilistic Projection Model (MPPM) operating in both pixel and latent spaces, with Latent MPPM (LMPPM) variant.

Result: LMPPM outperforms Latent Diffusion Model (LDM) across various datasets in both image restoration and generation tasks, demonstrating superior performance.

Conclusion: The proposed unified geometric-probabilistic framework successfully addresses blind image denoising and image generation by incorporating geometric structure into probabilistic methods, providing new insights into diffusion models as manifold projection mechanisms and achieving state-of-the-art results.

Abstract: Most models of generative AI for images assume that images are inherently low-dimensional objects embedded within a high-dimensional space. Additionally, it is often implicitly assumed that thematic image datasets form smooth or piecewise smooth manifolds. Common approaches overlook the geometric structure and focus solely on probabilistic methods, approximating the probability distribution through universal approximation techniques such as the kernel method. In some generative models the low dimensional nature of the data manifest itself by the introduction of a lower dimensional latent space. Yet, the probability distribution in the latent or the manifold’s coordinate space is considered uninteresting and is predefined or considered uniform. In this study, we address the problem of Blind Image Denoising (BID), and to some extent, the problem of generating images from noise by unifying geometric and probabilistic perspectives. We introduce a novel framework that improves upon existing probabilistic approaches by incorporating geometric assumptions that enable the effective use of kernel-based probabilistic methods. Furthermore, the proposed framework extends prior geometric approaches by combining explicit and implicit manifold descriptions through the introduction of a distance function. The resulting framework demystifies diffusion models by interpreting them as a projection mechanism onto the manifold of ``good images’’. This interpretation leads to the construction of a new deterministic model, the Manifold-Probabilistic Projection Model (MPPM), which operates in both the representation (pixel) space and the latent space. We demonstrate that the Latent MPPM (LMPPM) outperforms the Latent Diffusion Model (LDM) across various datasets, achieving superior results in terms of image restoration and generation.

[218] Decorrelation Speeds Up Vision Transformers

Kieran Carrigg, Rob van Gastel, Melda Yeghaian, Sander Dalm, Faysal Boughorbel, Marcel van Gerven

Main category: cs.CV

TL;DR: DBP-MAE integrates Decorrelated Backpropagation into MAE pre-training to reduce computational costs and accelerate convergence while maintaining or improving performance in low-data regimes.

Details

Motivation: MAE pre-training for vision transformers is computationally expensive and impractical for time/resource-constrained industrial settings, requiring more efficient alternatives.

Method: Integrates Decorrelated Backpropagation (DBP) into MAE pre-training, selectively applying it to the encoder to reduce input correlations at each layer and accelerate convergence.

Result: DBP-MAE reduces wall-clock time by 21.1%, lowers carbon emissions by 21.4%, improves segmentation mIoU by 1.1 points on ImageNet-1K/ADE20K subsets, and shows similar gains on proprietary industrial data.

Conclusion: DBP effectively reduces training time and energy consumption while improving downstream performance for large-scale ViT pre-training, making it applicable to real-world industrial scenarios.

Abstract: Masked Autoencoder (MAE) pre-training of vision transformers (ViTs) yields strong performance in low-label data regimes but comes with substantial computational costs, making it impractical in time- and resource-constrained industrial settings. We address this by integrating Decorrelated Backpropagation (DBP) into MAE pre-training, an optimization method that iteratively reduces input correlations at each layer to accelerate convergence. Applied selectively to the encoder, DBP achieves faster pre-training without loss of stability. To mimic constrained-data scenarios, we evaluate our approach on ImageNet-1K pre-training and ADE20K fine-tuning using randomly sampled subsets of each dataset. Under this setting, DBP-MAE reduces wall-clock time to baseline performance by 21.1%, lowers carbon emissions by 21.4%, and improves segmentation mIoU by 1.1 points. We observe similar gains when pre-training and fine-tuning on proprietary industrial data, confirming the method’s applicability in real-world scenarios. These results demonstrate that DBP can reduce training time and energy use while improving downstream performance for large-scale ViT pre-training. Keywords: Deep learning, Vision transformers, Efficient AI, Decorrelation

[219] JOGS: Joint Optimization of Pose Estimation and 3D Gaussian Splatting

Xianben Yang, Yuxuan Li, Tao Wang, Tao Wang, Yi Jin, Yidong Li, Haibin Ling

Main category: cs.CV

TL;DR: A unified framework that jointly optimizes 3D Gaussian points and camera poses without requiring pre-calibrated inputs, outperforming COLMAP-free methods and standard COLMAP-based baselines.

Details

Motivation: Traditional novel view synthesis methods rely on external camera pose estimation tools like COLMAP, which introduce computational bottlenecks and propagate errors. The authors aim to eliminate dependency on pre-calibrated inputs while improving both scene reconstruction and pose estimation.

Method: A co-optimization strategy that iteratively refines 3D Gaussian parameters and camera poses through two interleaved phases: 1) updating 3D Gaussian parameters via differentiable rendering with fixed poses, and 2) refining camera poses using a customized 3D optical flow algorithm with geometric and photometric constraints.

Result: Extensive evaluations on multiple datasets show the approach significantly outperforms existing COLMAP-free techniques in reconstruction quality and surpasses standard COLMAP-based baselines, particularly in challenging scenarios with large viewpoint variations and sparse feature distributions.

Conclusion: The proposed unified framework successfully eliminates dependency on external pose estimation tools while achieving superior performance in both scene reconstruction and camera pose estimation, addressing key limitations of traditional methods.

Abstract: Traditional novel view synthesis methods heavily rely on external camera pose estimation tools such as COLMAP, which often introduce computational bottlenecks and propagate errors. To address these challenges, we propose a unified framework that jointly optimizes 3D Gaussian points and camera poses without requiring pre-calibrated inputs. Our approach iteratively refines 3D Gaussian parameters and updates camera poses through a novel co-optimization strategy, ensuring simultaneous improvements in scene reconstruction fidelity and pose estimation accuracy. The key innovation lies in decoupling the joint optimization into two interleaved phases: first, updating 3D Gaussian parameters via differentiable rendering with fixed poses, and second, refining camera poses using a customized 3D optical flow algorithm that incorporates geometric and photometric constraints. This formulation progressively reduces projection errors, particularly in challenging scenarios with large viewpoint variations and sparse feature distributions, where traditional methods struggle. Extensive evaluations on multiple datasets demonstrate that our approach significantly outperforms existing COLMAP-free techniques in reconstruction quality, and also surpasses the standard COLMAP-based baseline in general.

[220] Fusionista2.0: Efficiency Retrieval System for Large-Scale Datasets

Huy M. Le, Dat Tien Nguyen, Phuc Binh Nguyen, Gia Bao Le Tran, Phu Truong Thien, Cuong Dinh, Minh Nguyen, Nga Nguyen, Thuy T. N. Nguyen, Tan Nhat Nguyen, Binh T. Nguyen

Main category: cs.CV

TL;DR: Fusionista2.0 is an optimized video retrieval system for VBS that reduces retrieval time by 75% while improving accuracy and usability through streamlined modules and an improved interface.

Details

Motivation: The Video Browser Showdown (VBS) requires systems to deliver accurate results under strict time constraints, creating a need for more efficient and user-friendly video retrieval systems.

Method: Re-engineered core modules: ffmpeg for fast keyframe extraction, Vintern-1B-v3.5 for multilingual OCR, faster-whisper for real-time ASR, and lightweight vision-language models for QA. Also redesigned UI for better responsiveness and workflow efficiency.

Result: Retrieval time reduced by up to 75% while maintaining or improving accuracy. User satisfaction increased, confirming the system as competitive for large-scale video search.

Conclusion: Fusionista2.0 successfully addresses VBS requirements through technical optimizations and UI improvements, creating a competitive, user-friendly system for efficient video retrieval.

Abstract: The Video Browser Showdown (VBS) challenges systems to deliver accurate results under strict time constraints. To meet this demand, we present Fusionista2.0, a streamlined video retrieval system optimized for speed and usability. All core modules were re-engineered for efficiency: preprocessing now relies on ffmpeg for fast keyframe extraction, optical character recognition uses Vintern-1B-v3.5 for robust multilingual text recognition, and automatic speech recognition employs faster-whisper for real-time transcription. For question answering, lightweight vision-language models provide quick responses without the heavy cost of large models. Beyond these technical upgrades, Fusionista2.0 introduces a redesigned user interface with improved responsiveness, accessibility, and workflow efficiency, enabling even non-expert users to retrieve relevant content rapidly. Evaluations demonstrate that retrieval time was reduced by up to 75% while accuracy and user satisfaction both increased, confirming Fusionista2.0 as a competitive and user-friendly system for large-scale video search.

Xiaohan Wang, Zhangtao Cheng, Ting Zhong, Leiting Chen, Fan Zhou

Main category: cs.CV

TL;DR: MBCD is a collaborative distillation framework that improves multi-modal domain generalization by addressing WA’s bias toward faster-converging modalities through adaptive dropout, gradient consistency, and cross-modal distillation.

Details

Motivation: Weight Averaging (WA) promotes flat loss landscapes for better generalization but fails in multi-modal settings because it overfits to faster-converging modalities early on, suppressing slower complementary modalities and hindering effective modality fusion.

Method: MBCD uses three key components: 1) adaptive modality dropout in student model to reduce early bias, 2) gradient consistency constraint to align learning between uni-modal and fused representations, and 3) WA-based teacher performing cross-modal distillation to transfer fused knowledge back to uni-modal branches.

Result: Extensive experiments on MMDG benchmarks show MBCD consistently outperforms existing methods, achieving superior accuracy and robustness across diverse unseen domains.

Conclusion: MBCD successfully retains WA’s flatness-inducing advantages while overcoming its limitations in multi-modal contexts, enabling effective modality fusion and steering convergence toward flatter, more generalizable solutions.

Abstract: Weight Averaging (WA) has emerged as a powerful technique for enhancing generalization by promoting convergence to a flat loss landscape, which correlates with stronger out-of-distribution performance. However, applying WA directly to multi-modal domain generalization (MMDG) is challenging: differences in optimization speed across modalities lead WA to overfit to faster-converging ones in early stages, suppressing the contribution of slower yet complementary modalities, thereby hindering effective modality fusion and skewing the loss surface toward sharper, less generalizable minima. To address this issue, we propose MBCD, a unified collaborative distillation framework that retains WA’s flatness-inducing advantages while overcoming its shortcomings in multi-modal contexts. MBCD begins with adaptive modality dropout in the student model to curb early-stage bias toward dominant modalities. A gradient consistency constraint then aligns learning signals between uni-modal branches and the fused representation, encouraging coordinated and smoother optimization. Finally, a WA-based teacher conducts cross-modal distillation by transferring fused knowledge to each uni-modal branch, which strengthens cross-modal interactions and steer convergence toward flatter solutions. Extensive experiments on MMDG benchmarks show that MBCD consistently outperforms existing methods, achieving superior accuracy and robustness across diverse unseen domains.

[222] Generative Adversarial Gumbel MCTS for Abstract Visual Composition Generation

Zirui Zhao, Boye Niu, David Hsu, Wee Sun Lee

Main category: cs.CV

TL;DR: A constraint-guided framework combining geometric reasoning with neural semantics for abstract visual composition, using AlphaGo-style search and adversarial reward refinement to generate valid, semantically-aligned geometric structures.

Details

Motivation: Abstract visual composition with geometric primitives is challenging due to combinatorial placement choices, limited data, discrete feasibility constraints (overlap-free, allowable orientations), and sparse solution manifolds that are ill-suited for purely statistical pixel-space generators.

Method: Constraint-guided framework combining explicit geometric reasoning with neural semantics. Uses AlphaGo-style search to enforce feasibility, with a fine-tuned vision-language model scoring semantic alignment as reward signals. Employs policy network as heuristic in Monte-Carlo Tree Search, fine-tuned via search-generated plans. Uses adversarial reward refinement inspired by GANs, where generated instances improve reward model discrimination.

Result: In the Tangram Assembly task, the approach yields higher validity and semantic fidelity than diffusion and auto-regressive baselines, especially as constraints tighten.

Conclusion: The proposed constraint-guided framework successfully addresses the challenges of abstract visual composition by integrating geometric reasoning with neural semantics, demonstrating superior performance in generating valid, semantically-aligned geometric structures under tight constraints.

Abstract: We study abstract visual composition, in which identity is primarily determined by the spatial configuration and relations among a small set of geometric primitives (e.g., parts, symmetry, topology). They are invariant primarily to texture and photorealistic detail. Composing such structures from fixed components under geometric constraints and vague goal specification (such as text) is non-trivial due to combinatorial placement choices, limited data, and discrete feasibility (overlap-free, allowable orientations), which create a sparse solution manifold ill-suited to purely statistical pixel-space generators. We propose a constraint-guided framework that combines explicit geometric reasoning with neural semantics. An AlphaGo-style search enforces feasibility, while a fine-tuned vision-language model scores semantic alignment as reward signals. Our algorithm uses a policy network as a heuristic in Monte-Carlo Tree Search and fine-tunes the network via search-generated plans. Inspired by the Generative Adversarial Network, we use the generated instances for adversarial reward refinement. Over time, the generation should approach the actual data more closely when the reward model cannot distinguish between generated instances and ground-truth. In the Tangram Assembly task, our approach yields higher validity and semantic fidelity than diffusion and auto-regressive baselines, especially as constraints tighten.

[223] DFIR-DETR: Frequency Domain Enhancement and Dynamic Feature Aggregation for Cross-Scene Small Object Detection

Bo Gao, Jingcheng Tong, Xingsheng Chen, Han Yu, Zichen Li

Main category: cs.CV

TL;DR: DFIR-DETR: A lightweight transformer-based detector using dynamic feature aggregation and frequency-domain processing for cross-scene small object detection, achieving SOTA on NEU-DET and VisDrone datasets.

Details

Motivation: Current transformer-based detectors struggle with small object detection in UAV remote sensing and industrial inspection due to feature degradation from downsampling, inability of spatial convolutions to capture long-range dependencies, and inefficient upsampling methods that inflate feature maps.

Method: Three novel components: 1) DCFA module with dynamic K-sparse attention (O(NK) complexity) and spatial gated linear units; 2) DFPN module with amplitude-normalized upsampling and dual-path shuffle convolution; 3) FIRC3 module operating in frequency domain for global receptive fields.

Result: Achieved state-of-the-art mAP50 scores: 92.9% on NEU-DET and 51.6% on VisDrone datasets. Model remains lightweight with only 11.7M parameters and 41.2 GFLOPs.

Conclusion: DFIR-DETR effectively addresses small object detection challenges, generalizes well across different domains (UAV remote sensing and industrial inspection), and works efficiently in resource-limited settings for cross-scene applications.

Abstract: Detecting small objects in UAV remote sensing images and identifying surface defects in industrial inspection remain difficult tasks. These applications face common obstacles: features are sparse and weak, backgrounds are cluttered, and object scales vary dramatically. Current transformer-based detectors, while powerful, struggle with three critical issues. First, features degrade severely as networks downsample progressively. Second, spatial convolutions cannot capture long-range dependencies effectively. Third, standard upsampling methods inflate feature maps unnecessarily. We introduce DFIR-DETR to tackle these problems through dynamic feature aggregation combined with frequency-domain processing. Our architecture builds on three novel components. The DCFA module uses dynamic K-sparse attention, cutting complexity from O(N2) down to O(NK), and employs spatial gated linear units for better nonlinear modeling. The DFPN module applies amplitude-normalized upsampling to prevent feature inflation and uses dual-path shuffle convolution to retain spatial details across scales. The FIRC3 module operates in the frequency domain, achieving global receptive fields without sacrificing efficiency. We tested our method extensively on NEU-DET and VisDrone datasets. Results show mAP50 scores of 92.9% and 51.6% respectively-both state-of-the-art. The model stays lightweight with just 11.7M parameters and 41.2 GFLOPs. Strong performance across two very different domains confirms that DFIR-DETR generalizes well and works effectively in resource-limited settings for cross-scene small object detection.

[224] TBC: A Target-Background Contrast Metric for Low-Altitude Infrared and Visible Image Fusion

Yufeng Xie, Cong Wang

Main category: cs.CV

TL;DR: The paper proposes a new Target-Background Contrast (TBC) metric for infrared and visible image fusion in UAV reconnaissance, addressing limitations of traditional metrics that fail in low-light environments due to the “Noise Trap” problem.

Details

Motivation: Traditional no-reference metrics (statistics-based and gradient-based) fail in complex low-light environments for UAV image fusion. These metrics are positively correlated with high-frequency sensor noise, paradoxically assigning higher scores to degraded images and misleading algorithm optimization - a problem termed the "Noise Trap".

Method: The paper proposes the Target-Background Contrast (TBC) metric, inspired by Weber’s Law. Unlike traditional metrics, TBC focuses on relative contrast of salient targets rather than global statistics. It penalizes background noise and rewards target visibility, providing semantic discriminability for distinguishing thermal targets from background clutter.

Result: Extensive experiments on the DroneVehicle dataset demonstrate TBC’s superiority. Results show TBC exhibits high “Semantic Discriminability” in distinguishing thermal targets from background clutter. Additionally, TBC achieves remarkable computational efficiency, making it suitable for real-time applications in intelligent UAV systems.

Conclusion: The proposed TBC metric addresses the limitations of traditional no-reference metrics in low-light environments, providing a reliable and real-time standard for evaluating infrared and visible image fusion in UAV reconnaissance systems by focusing on target-background contrast rather than being misled by sensor noise.

Abstract: Infrared and visible image fusion (IVIF) is a pivotal technology in low-altitude Unmanned Aerial Vehicle (UAV) reconnaissance missions, enabling robust target detection and tracking by integrating thermal saliency with environmental textures. However, traditional no-reference metrics (Statistics-based metrics and Gradient-based metrics) fail in complex low-light environments, termed the Noise Trap''. This paper mathematically prove that these metrics are positively correlated with high-frequency sensor noise, paradoxically assigning higher scores to degraded images and misguiding algorithm optimization. To address this, we propose the Target-Background Contrast (TBC) metric. Inspired by Weber's Law, TBC focuses on the relative contrast of salient targets rather than global statistics. Unlike traditional metrics, TBC penalizes background noise and rewards target visibility. Extensive experiments on the DroneVehicle dataset demonstrate the superiority of TBC. Results show that TBC exhibits high Semantic Discriminability’’ in distinguishing thermal targets from background clutter. Furthermore, TBC achieves remarkable computational efficiency, making it a reliable and real-time standard for intelligent UAV systems.

[225] SERA-H: Beyond Native Sentinel Spatial Limits for High-Resolution Canopy Height Mapping

Thomas Boudras, Martin Schwartz, Rasmus Fensholt, Martin Brandt, Ibrahim Fayad, Jean-Pierre Wigneron, Gabriel Belouze, Fajwel Fogel, Philippe Ciais

Main category: cs.CV

TL;DR: SERA-H is an end-to-end deep learning model that generates high-resolution (2.5m) canopy height maps from freely available Sentinel-1/2 satellite imagery (10m resolution) using super-resolution and temporal attention encoding, achieving performance comparable to commercial high-resolution imagery.

Details

Motivation: There's a trade-off between data accessibility and spatial resolution in existing canopy height mapping methods. Current deep learning approaches using satellite imagery often face limitations in balancing freely available data with high-resolution outputs needed for effective forest management and biodiversity monitoring.

Method: SERA-H combines a super-resolution module (EDSR) with temporal attention encoding (UTAE) in an end-to-end model. It’s trained using high-density LiDAR data (ALS) as supervision to generate 2.5m resolution height maps from Sentinel-1 and Sentinel-2 time series data (10m native resolution).

Result: On an open-source benchmark dataset in France, SERA-H achieved MAE of 2.6m and R² of 0.82, outperforming standard Sentinel-1/2 baselines and achieving performance comparable to or better than methods using commercial very high-resolution imagery (SPOT-6/7, PlanetScope, Maxar).

Conclusion: Combining high-resolution supervision with spatiotemporal information from time series enables reconstruction of details beyond input sensors’ native resolution. SERA-H demonstrates that freely available satellite data can achieve canopy height mapping accuracy comparable to costly commercial imagery, enabling high-frequency forest monitoring at no cost.

Abstract: High-resolution mapping of canopy height is essential for forest management and biodiversity monitoring. Although recent studies have led to the advent of deep learning methods using satellite imagery to predict height maps, these approaches often face a trade-off between data accessibility and spatial resolution. To overcome these limitations, we present SERA-H, an end-to-end model combining a super-resolution module (EDSR) and temporal attention encoding (UTAE). Trained under the supervision of high-density LiDAR data (ALS), our model generates 2.5 m resolution height maps from freely available Sentinel-1 and Sentinel-2 (10 m) time series data. Evaluated on an open-source benchmark dataset in France, SERA-H, with a MAE of 2.6 m and a coefficient of determination of 0.82, not only outperforms standard Sentinel-1/2 baselines but also achieves performance comparable to or better than methods relying on commercial very high-resolution imagery (SPOT-6/7, PlanetScope, Maxar). These results demonstrate that combining high-resolution supervision with the spatiotemporal information embedded in time series enables the reconstruction of details beyond the input sensors’ native resolution. SERA-H opens the possibility of freely mapping forests with high revisit frequency, achieving accuracy comparable to that of costly commercial imagery.

[226] GANeXt: A Fully ConvNeXt-Enhanced Generative Adversarial Network for MRI- and CBCT-to-CT Synthesis

Siyuan Mei, Yan Xia, Fuxin Fan, Andreas Maier

Main category: cs.CV

TL;DR: GANeXt: A 3D patch-based ConvNeXt GAN for unified CT synthesis from MRI and CBCT across different anatomical regions, using advanced loss functions and training strategies.

Details

Motivation: CT synthesis from MRI and CBCT is crucial for accurate anatomical representation in adaptive radiotherapy treatment planning, but existing methods may lack unified approaches across modalities and anatomical regions.

Method: Proposes GANeXt - a 3D patch-based GAN with U-shaped generator using stacked 3D ConvNeXt blocks and conditional PatchGAN discriminator. Uses MAE, perceptual loss, segmentation-based masked MAE, adversarial loss, and multi-head segmentation discriminator with Dice/CE losses. Training uses AdamW with warmup/cosine decay schedulers, deformable registration, normalization, and data augmentation.

Result: Models trained for 3000 epochs for MRI-to-CT and 1000 epochs for CBCT-to-CT using full training dataset without fine-tuning. Inference uses sliding-window with 0.8 overlap and average folding for full-size synthetic CT reconstruction.

Conclusion: GANeXt provides a unified framework for CT synthesis across different modalities (MRI, CBCT) and anatomical regions, demonstrating potential for clinical radiotherapy planning applications.

Abstract: The synthesis of computed tomography (CT) from magnetic resonance imaging (MRI) and cone-beam CT (CBCT) plays a critical role in clinical treatment planning by enabling accurate anatomical representation in adaptive radiotherapy. In this work, we propose GANeXt, a 3D patch-based, fully ConvNeXt-powered generative adversarial network for unified CT synthesis across different modalities and anatomical regions. Specifically, GANeXt employs an efficient U-shaped generator constructed from stacked 3D ConvNeXt blocks with compact convolution kernels, while the discriminator adopts a conditional PatchGAN. To improve synthesis quality, we incorporate a combination of loss functions, including mean absolute error (MAE), perceptual loss, segmentation-based masked MAE, and adversarial loss and a combination of Dice loss and cross-entropy for multi-head segmentation discriminator. For both tasks, training is performed with a batch size of 8 using two separate AdamW optimizers for the generator and discriminator, each equipped with a warmup and cosine decay scheduler, with learning rates of $5\times10^{-4}$ and $1\times10^{-3}$, respectively. Data preprocessing includes deformable registration, foreground cropping, percentile normalization for the input modality, and linear normalization of the CT to the range $[-1024, 1000]$. Data augmentation involves random zooming within $(0.8, 1.3)$ (for MRI-to-CT only), fixed-size cropping to $32\times160\times192$ for MRI-to-CT and $32\times128\times128$ for CBCT-to-CT, and random flipping. During inference, we apply a sliding-window approach with $0.8$ overlap and average folding to reconstruct the full-size sCT, followed by inversion of the CT normalization. After joint training on all regions without any fine-tuning, the final models are selected at the end of 3000 epochs for MRI-to-CT and 1000 epochs for CBCT-to-CT using the full training dataset.

[227] Granular Ball Guided Masking: Structure-aware Data Augmentation

Shuyin Xia, Fan Chen, Dawei Dai, Meng Yang, Junwei Han, Xinbo Gao, Guoyin Wang

Main category: cs.CV

TL;DR: GBGM is a structure-aware data augmentation method using Granular Ball Computing to preserve semantically important regions while masking redundant areas, improving robustness across vision tasks.

Details

Motivation: Deep learning models rely heavily on labeled data and overfit with limited data or distribution shifts. Existing mask-based augmentation methods lack structural awareness and risk discarding essential semantics.

Method: Granular Ball Guided Masking (GBGM) uses Granular Ball Computing to guide adaptive masking that preserves semantically rich, structurally important regions while suppressing redundant areas through a coarse-to-fine hierarchical masking process.

Result: Extensive experiments show consistent improvements in image classification, masked image reconstruction, and image tampering detection across multiple benchmarks, validating effectiveness and generalization across recognition and forensic scenarios.

Conclusion: GBGM provides a simple, model-agnostic structure-aware data augmentation paradigm that integrates seamlessly into CNNs and Vision Transformers, offering practical benefits for enhancing model robustness.

Abstract: Deep learning models have achieved remarkable success in computer vision but still rely heavily on large-scale labeled data and tend to overfit when data is limited or distributions shift. Data augmentation – particularly mask-based information dropping – can enhance robustness by forcing models to explore complementary cues; however, existing approaches often lack structural awareness and risk discarding essential semantics. We propose Granular Ball Guided Masking (GBGM), a structure-aware augmentation strategy guided by Granular Ball Computing (GBC). GBGM adaptively preserves semantically rich, structurally important regions while suppressing redundant areas through a coarse-to-fine hierarchical masking process, producing augmentations that are both representative and discriminative. Extensive experiments on multiple benchmarks demonstrate consistent improvements not only in image classification and masked image reconstruction, but also in image tampering detection, validating the effectiveness and generalization of GBGM across both recognition and forensic scenarios. Simple and model-agnostic, GBGM integrates seamlessly into CNNs and Vision Transformers, offering a practical paradigm for structure-aware data augmentation.

[228] Wavelet-based Multi-View Fusion of 4D Radar Tensor and Camera for Robust 3D Object Detection

Runwei Guan, Jianan Liu, Shaofeng Liang, Fangqiang Ding, Shanliang Yao, Xiaokai Bai, Daizong Liu, Tao Huang, Guoqiang Mao, Hui Xiong

Main category: cs.CV

TL;DR: WRCFormer: A novel 3D object detection framework that efficiently fuses raw 4D radar cubes with camera images using decoupled multi-view radar representations, achieving state-of-the-art performance on K-Radar benchmark.

Details

Motivation: 4D mmWave radar is widely used in autonomous driving but faces challenges: point-cloud representations lose information due to multi-stage signal processing, while using raw radar tensors directly is computationally prohibitive. Need efficient fusion of raw radar data with camera images.

Method: Proposes WRCFormer with two key components: 1) Wavelet Attention Module in wavelet-based FPN to capture joint spatial-frequency features, enhancing sparse radar and image representations while maintaining efficiency. 2) Geometry-guided Progressive Fusion, a two-stage query-based strategy that progressively aligns multi-view radar and visual features using geometric priors for modality-agnostic integration.

Result: Extensive experiments on K-Radar benchmark show state-of-the-art performance, surpassing best existing model by ~2.4% in all scenarios and 1.6% in sleet conditions, demonstrating strong robustness in adverse weather.

Conclusion: WRCFormer effectively addresses information loss and computational challenges in 4D radar processing by efficiently fusing raw radar cubes with camera images through innovative wavelet-based attention and geometry-guided fusion, achieving superior 3D object detection performance especially in adverse weather conditions.

Abstract: 4D millimeter-wave (mmWave) radar has been widely adopted in autonomous driving and robot perception due to its low cost and all-weather robustness. However, point-cloud-based radar representations suffer from information loss due to multi-stage signal processing, while directly utilizing raw 4D radar tensors incurs prohibitive computational costs. To address these challenges, we propose WRCFormer, a novel 3D object detection framework that efficiently fuses raw 4D radar cubes with camera images via decoupled multi-view radar representations. Our approach introduces two key components: (1) A Wavelet Attention Module embedded in a wavelet-based Feature Pyramid Network (FPN), which enhances the representation of sparse radar signals and image data by capturing joint spatial-frequency features, thereby mitigating information loss while maintaining computational efficiency. (2) A Geometry-guided Progressive Fusion mechanism, a two-stage query-based fusion strategy that progressively aligns multi-view radar and visual features through geometric priors, enabling modality-agnostic and efficient integration without overwhelming computational overhead. Extensive experiments on the K-Radar benchmark show that WRCFormer achieves state-of-the-art performance, surpassing the best existing model by approximately 2.4% in all scenarios and 1.6% in sleet conditions, demonstrating strong robustness in adverse weather.

[229] RealCamo: Boosting Real Camouflage Synthesis with Layout Controls and Textual-Visual Guidance

Chunyuan Chen, Yunuo Cai, Shujuan Li, Weiyun Liang, Bin Wang, Jing Xu

Main category: cs.CV

TL;DR: RealCamo: A novel out-painting-based framework for controllable realistic camouflaged image generation with layout controls and multimodal textual-visual conditions to bridge the gap between synthetic and real camouflaged imagery.

Details

Motivation: Existing camouflaged image generation (CIG) methods produce images with insufficient camouflage (weak visual similarity) or cluttered backgrounds that are semantically inconsistent with foreground targets, creating a substantial gap to real camouflaged imagery needed for training camouflaged object detection models.

Method: Proposes RealCamo, an out-painting-based framework with: 1) explicit layout controls to regulate global image structure and improve semantic coherence between foreground and background, 2) multimodal textual-visual conditions combining fine-grained textual task descriptions with texture-oriented background retrieval to enhance visual fidelity, and 3) a background-foreground distribution divergence metric to quantitatively assess camouflage quality.

Result: Extensive experiments and visualizations demonstrate the effectiveness of the proposed framework in generating realistic camouflaged images that better bridge the gap to real camouflaged imagery.

Conclusion: RealCamo addresses key limitations in existing CIG methods by introducing layout controls and multimodal conditions, providing a more effective approach for generating high-quality training data for camouflaged object detection through improved semantic coherence and visual realism.

Abstract: Camouflaged image generation (CIG) has recently emerged as an efficient alternative for acquiring high-quality training data for camouflaged object detection (COD). However, existing CIG methods still suffer from a substantial gap to real camouflaged imagery: generated images either lack sufficient camouflage due to weak visual similarity, or exhibit cluttered backgrounds that are semantically inconsistent with foreground targets. To address these limitations, we propose RealCamo, a novel out-painting-based framework for controllable realistic camouflaged image generation. RealCamo explicitly introduces additional layout controls to regulate global image structure, thereby improving semantic coherence between foreground objects and generated backgrounds. Moreover, we construct a multimodal textual-visual condition by combining a unified fine-grained textual task description with texture-oriented background retrieval, which jointly guides the generation process to enhance visual fidelity and realism. To quantitatively assess camouflage quality, we further introduce a background-foreground distribution divergence metric that measures the effectiveness of camouflage in generated images. Extensive experiments and visualizations demonstrate the effectiveness of our proposed framework.

[230] Lifelong Domain Adaptive 3D Human Pose Estimation

Qucheng Peng, Hongfei Xue, Pu Wang, Chen Chen

Main category: cs.CV

TL;DR: First work to introduce lifelong domain adaptation to 3D human pose estimation, addressing non-stationary target datasets and catastrophic forgetting through a novel GAN framework with integrated pose-aware, temporal-aware, and domain-aware knowledge.

Details

Motivation: 3D HPE struggles with generalization to diverse real-world scenarios due to reliance on controlled environment data. Existing domain adaptation approaches overlook non-stationary target datasets and catastrophic forgetting when adapting to multiple domains sequentially.

Method: Proposes a novel GAN framework with 3D pose generators, 2D pose discriminator, and 3D pose estimator. Introduces a 3D pose generator paradigm integrating pose-aware, temporal-aware, and domain-aware knowledge to adapt to current domains while preserving previous knowledge.

Result: Superior performance demonstrated through extensive experiments on diverse domain adaptive 3D HPE datasets, effectively mitigating domain shifts and aligning original/augmented poses.

Conclusion: The proposed lifelong domain adaptation framework successfully addresses challenges in 3D HPE by enabling adaptation to non-stationary target domains while combating catastrophic forgetting, representing the first application of lifelong DA to 3D pose estimation.

Abstract: 3D Human Pose Estimation (3D HPE) is vital in various applications, from person re-identification and action recognition to virtual reality. However, the reliance on annotated 3D data collected in controlled environments poses challenges for generalization to diverse in-the-wild scenarios. Existing domain adaptation (DA) paradigms like general DA and source-free DA for 3D HPE overlook the issues of non-stationary target pose datasets. To address these challenges, we propose a novel task named lifelong domain adaptive 3D HPE. To our knowledge, we are the first to introduce the lifelong domain adaptation to the 3D HPE task. In this lifelong DA setting, the pose estimator is pretrained on the source domain and subsequently adapted to distinct target domains. Moreover, during adaptation to the current target domain, the pose estimator cannot access the source and all the previous target domains. The lifelong DA for 3D HPE involves overcoming challenges in adapting to current domain poses and preserving knowledge from previous domains, particularly combating catastrophic forgetting. We present an innovative Generative Adversarial Network (GAN) framework, which incorporates 3D pose generators, a 2D pose discriminator, and a 3D pose estimator. This framework effectively mitigates domain shifts and aligns original and augmented poses. Moreover, we construct a novel 3D pose generator paradigm, integrating pose-aware, temporal-aware, and domain-aware knowledge to enhance the current domain’s adaptation and alleviate catastrophic forgetting on previous domains. Our method demonstrates superior performance through extensive experiments on diverse domain adaptive 3D HPE datasets.

[231] RGS-SLAM: Robust Gaussian Splatting SLAM with One-Shot Dense Initialization

Wei-Tse Cheng, Yen-Jen Chiou, Yuan-Fu Yang

Main category: cs.CV

TL;DR: RGS-SLAM replaces GS-SLAM’s residual-driven densification with training-free correspondence-to-Gaussian initialization using DINOv3 descriptors, achieving 20% faster convergence and higher rendering quality while maintaining real-time performance.

Details

Motivation: The paper aims to improve GS-SLAM by addressing the limitations of residual-driven densification, which can lead to unstable early mapping and slower convergence. The motivation is to create a more robust initialization method that provides better structure awareness and distribution of Gaussians from the start.

Method: RGS-SLAM uses a training-free correspondence-to-Gaussian initialization that performs one-shot triangulation of dense multi-view correspondences derived from DINOv3 descriptors. These descriptors are refined through a confidence-aware inlier classifier to generate a well-distributed, structure-aware Gaussian seed prior to optimization, replacing the progressive residual-driven densification of GS-SLAM.

Result: The method achieves approximately 20% faster convergence, higher rendering fidelity in texture-rich and cluttered scenes, competitive or superior localization and reconstruction accuracy on TUM RGB-D and Replica datasets compared to state-of-the-art Gaussian and point-based SLAM systems, while maintaining real-time mapping performance up to 925 FPS.

Conclusion: RGS-SLAM demonstrates that replacing residual-driven densification with a robust correspondence-to-Gaussian initialization significantly improves mapping stability, convergence speed, and rendering quality while remaining fully compatible with existing GS-SLAM pipelines, offering a practical enhancement for real-time SLAM applications.

Abstract: We introduce RGS-SLAM, a robust Gaussian-splatting SLAM framework that replaces the residual-driven densification stage of GS-SLAM with a training-free correspondence-to-Gaussian initialization. Instead of progressively adding Gaussians as residuals reveal missing geometry, RGS-SLAM performs a one-shot triangulation of dense multi-view correspondences derived from DINOv3 descriptors refined through a confidence-aware inlier classifier, generating a well-distributed and structure-aware Gaussian seed prior to optimization. This initialization stabilizes early mapping and accelerates convergence by roughly 20%, yielding higher rendering fidelity in texture-rich and cluttered scenes while remaining fully compatible with existing GS-SLAM pipelines. Evaluated on the TUM RGB-D and Replica datasets, RGS-SLAM achieves competitive or superior localization and reconstruction accuracy compared with state-of-the-art Gaussian and point-based SLAM systems, sustaining real-time mapping performance at up to 925 FPS. Additional details and resources are available at this URL: https://breeze1124.github.io/rgs-slam-project-page/

[232] Unveiling and Bridging the Functional Perception Gap in MLLMs: Atomic Visual Alignment and Hierarchical Evaluation via PET-Bench

Zanting Ye, Xiaolong Niu, Xuanbin Wu, Xu Han, Shengyuan Liu, Jing Hao, Zhihao Peng, Hao Sun, Jieqin Lv, Fanghu Wang, Yanchao Huang, Hubing Wu, Yixuan Yuan, Habib Zaidi, Arman Rahmim, Yefeng Zheng, Lijun Lu

Main category: cs.CV

TL;DR: MLLMs have a functional perception gap in PET imaging, causing CoT hallucinations; PET-Bench benchmark reveals this, and AVA fine-tuning fixes it, improving accuracy by 14.83%.

Details

Motivation: Current MLLMs excel in anatomical imaging but fail in functional imaging like PET, where they can't decode tracer biodistribution without morphological priors, creating a critical safety hazard in clinical diagnosis.

Method: Created PET-Bench (52,308 QA pairs from 9,732 PET studies), evaluated 19 SOTA MLLMs, identified CoT hallucination trap, and proposed AVA fine-tuning that enforces low-level functional perception before high-level reasoning.

Result: Standard CoT prompting produces fluent but ungrounded diagnoses in PET; AVA bridges the perception gap, transforms CoT into robust inference, and improves diagnostic accuracy by up to 14.83%.

Conclusion: Functional imaging requires different MLLM capabilities than anatomical imaging; AVA effectively addresses the CoT hallucination problem in PET, making MLLMs safer and more accurate for clinical functional imaging tasks.

Abstract: While Multimodal Large Language Models (MLLMs) have demonstrated remarkable proficiency in tasks such as abnormality detection and report generation for anatomical modalities, their capability in functional imaging remains largely unexplored. In this work, we identify and quantify a fundamental functional perception gap: the inability of current vision encoders to decode functional tracer biodistribution independent of morphological priors. Identifying Positron Emission Tomography (PET) as the quintessential modality to investigate this disconnect, we introduce PET-Bench, the first large-scale functional imaging benchmark comprising 52,308 hierarchical QA pairs from 9,732 multi-site, multi-tracer PET studies. Extensive evaluation of 19 state-of-the-art MLLMs reveals a critical safety hazard termed the Chain-of-Thought (CoT) hallucination trap. We observe that standard CoT prompting, widely considered to enhance reasoning, paradoxically decouples linguistic generation from visual evidence in PET, producing clinically fluent but factually ungrounded diagnoses. To resolve this, we propose Atomic Visual Alignment (AVA), a simple fine-tuning strategy that enforces the mastery of low-level functional perception prior to high-level diagnostic reasoning. Our results demonstrate that AVA effectively bridges the perception gap, transforming CoT from a source of hallucination into a robust inference tool and improving diagnostic accuracy by up to 14.83%. Code and data are available at https://github.com/yezanting/PET-Bench.

[233] Zoom-IQA: Image Quality Assessment with Reliable Region-Aware Reasoning

Guoqiang Liang, Jianyi Wang, Zhonghua Wu, Shangchen Zhou

Main category: cs.CV

TL;DR: Zoom-IQA is a vision-language model that improves image quality assessment by emulating human cognitive behaviors through uncertainty awareness, region reasoning, and iterative refinement, trained via supervised fine-tuning and reinforcement learning.

Details

Motivation: Previous IQA methods either provide numerical scores without explanation or give low-level descriptions without precise scores. Recent VLM-based IQA methods suffer from unreliable reasoning due to limited integration of visual and textual cues.

Method: Two-stage training: 1) Supervised fine-tuning on GR-IQA dataset to ground assessments in key regions, 2) Reinforcement learning with KL-Coverage regularizer to prevent reasoning diversity collapse, plus Progressive Re-sampling Strategy to mitigate annotation bias.

Result: Zoom-IQA achieves improved robustness, explainability, and generalization. Application to downstream tasks like image restoration demonstrates its effectiveness.

Conclusion: The proposed Zoom-IQA model successfully addresses limitations of previous IQA methods by emulating key cognitive behaviors, resulting in more reliable and explainable image quality assessment.

Abstract: Image Quality Assessment (IQA) is a long-standing problem in computer vision. Previous methods typically focus on predicting numerical scores without explanation or providing low-level descriptions lacking precise scores. Recent reasoning-based vision language models (VLMs) have shown strong potential for IQA by jointly generating quality descriptions and scores. However, existing VLM-based IQA methods often suffer from unreliable reasoning due to their limited capability of integrating visual and textual cues. In this work, we introduce Zoom-IQA, a VLM-based IQA model to explicitly emulate key cognitive behaviors: uncertainty awareness, region reasoning, and iterative refinement. Specifically, we present a two-stage training pipeline: 1) supervised fine-tuning (SFT) on our Grounded-Rationale-IQA (GR-IQA) dataset to teach the model to ground its assessments in key regions, and 2) reinforcement learning (RL) for dynamic policy exploration, stabilized by our KL-Coverage regularizer to prevent reasoning and scoring diversity collapse, with a Progressive Re-sampling Strategy for mitigating annotation bias. Extensive experiments show that Zoom-IQA achieves improved robustness, explainability, and generalization. The application to downstream tasks, such as image restoration, further demonstrates the effectiveness of Zoom-IQA.

Arun Muthukkumar

Main category: cs.CV

TL;DR: MDENeRF: Iterative framework that refines monocular depth estimates by fusing them with Neural Radiance Field (NeRF) depth information and uncertainty to recover fine geometric details.

Details

Motivation: Current monocular depth estimation methods produce smooth depth maps lacking fine geometric details needed for accurate scene understanding in applications like autonomous navigation and extended reality.

Method: Three-component iterative framework: (1) initial monocular estimate for global structure, (2) NeRF trained on perturbed viewpoints with per-pixel uncertainty derived from volume rendering, (3) Bayesian fusion of noisy monocular and NeRF depths to iteratively inject high-frequency details.

Result: Demonstrated improvements on key metrics and experiments using indoor scenes from the SUN RGB-D dataset.

Conclusion: MDENeRF successfully refines monocular depth estimates by combining monocular priors for global structure with NeRF-derived depth and uncertainty to recover fine geometric details.

Abstract: Monocular depth estimation has applications in many fields, such as autonomous navigation and extended reality, making it an essential computer vision task. However, current methods often produce smooth depth maps that lack the fine geometric detail needed for accurate scene understanding. We propose MDENeRF, an iterative framework that refines monocular depth estimates using depth information from Neural Radiance Fields (NeRFs). MDENeRF consists of three components: (1) an initial monocular estimate for global structure, (2) a NeRF trained on perturbed viewpoints, with per-pixel uncertainty, and (3) Bayesian fusion of the noisy monocular and NeRF depths. We derive NeRF uncertainty from the volume rendering process to iteratively inject high-frequency fine details. Meanwhile, our monocular prior maintains global structure. We demonstrate improvements on key metrics and experiments using indoor scenes from the SUN RGB-D dataset.

[235] 3D Wavelet-Based Structural Priors for Controlled Diffusion in Whole-Body Low-Dose PET Denoising

Peiyuan Jing, Yue Tang, Chun-Wun Cheng, Zhenxuan Zhang, Liutao Yang, Thiago V. Lima, Klaus Strobel, Antoine Leimgruber, Angelica Aviles-Rivero, Guang Yang, Javier Montoya

Main category: cs.CV

TL;DR: WCC-Net is a 3D diffusion-based framework that uses wavelet representations to guide volumetric PET denoising, achieving superior performance over existing methods while maintaining anatomical consistency.

Details

Motivation: Low-dose PET imaging reduces radiation exposure but suffers from increased noise that degrades image quality and diagnostic reliability. Existing diffusion models struggle with anatomical consistency in low signal-to-noise regimes and volumetric whole-body imaging.

Method: Proposes Wavelet-Conditioned ControlNet (WCC-Net), a fully 3D diffusion-based framework that introduces explicit frequency-domain structural priors via wavelet representations. Uses a lightweight control branch to inject wavelet-based structural guidance into a frozen pretrained diffusion backbone, decoupling anatomical structure from noise while preserving generative expressiveness and 3D structural continuity.

Result: WCC-Net consistently outperforms CNN-, GAN-, and diffusion-based baselines. On internal 1/20-dose test set: improves PSNR by +1.21 dB and SSIM by +0.008 over strong diffusion baseline, while reducing structural distortion (GMSD) and intensity error (NMAE). Generalizes robustly to unseen dose levels (1/50 and 1/4), achieving superior quantitative performance and improved volumetric anatomical consistency.

Conclusion: WCC-Net effectively addresses the limitations of stochastic diffusion models in PET denoising by incorporating wavelet-based structural guidance, achieving state-of-the-art performance while maintaining anatomical fidelity across various dose levels.

Abstract: Low-dose Positron Emission Tomography (PET) imaging reduces patient radiation exposure but suffers from increased noise that degrades image quality and diagnostic reliability. Although diffusion models have demonstrated strong denoising capability, their stochastic nature makes it challenging to enforce anatomically consistent structures, particularly in low signal-to-noise regimes and volumetric whole-body imaging. We propose Wavelet-Conditioned ControlNet (WCC-Net), a fully 3D diffusion-based framework that introduces explicit frequency-domain structural priors via wavelet representations to guide volumetric PET denoising. By injecting wavelet-based structural guidance into a frozen pretrained diffusion backbone through a lightweight control branch, WCC-Net decouples anatomical structure from noise while preserving generative expressiveness and 3D structural continuity. Extensive experiments demonstrate that WCC-Net consistently outperforms CNN-, GAN-, and diffusion-based baselines. On the internal 1/20-dose test set, WCC-Net improves PSNR by +1.21 dB and SSIM by +0.008 over a strong diffusion baseline, while reducing structural distortion (GMSD) and intensity error (NMAE). Moreover, WCC-Net generalizes robustly to unseen dose levels (1/50 and 1/4), achieving superior quantitative performance and improved volumetric anatomical consistency.

[236] Smooth Operator: Smooth Verifiable Reward Activates Spatial Reasoning Ability of Vision-Language Model

Siwen Jiao, Tianxiong Lv, Kangan Qian, Chenxu Zhao, Xiuyuan Zhu, Tianlun Li, Xiaolong Cheng, Jinyu Li, Zhihao Liao, Yang Cai

Main category: cs.CV

TL;DR: AP-GRPO framework with SNRA operator improves 3D numerical prediction in VLMs by addressing reward sparsity and gradient instability through dense reward activation and absolute-preserving gradients.

Details

Motivation: VLMs struggle with precise numerical prediction for 3D scene understanding due to reward sparsity and gradient instability in traditional RL approaches, particularly the "near-miss" sample problem in standard GRPO frameworks.

Method: Introduces SNRA (Smooth Numerical Reward Activation) operator using dynamically parameterized Sigmoid function for dense reward continuum, and AP-GRPO (Absolute-Preserving GRPO) framework that integrates absolute scalar gradients to preserve numerical information.

Result: AP-GRPO achieves performance parity with large-scale supervised methods while maintaining higher data efficiency, effectively activating latent 3D reasoning in VLMs without architectural modifications. Created Numerical3D-50k dataset with 50,000 verifiable 3D subtasks.

Conclusion: The proposed approach successfully addresses the numerical prediction bottleneck in VLMs for 3D scene understanding by transforming sparse rewards into dense continua and preserving absolute numerical information, enabling efficient 3D reasoning activation.

Abstract: Vision-Language Models (VLMs) face a critical bottleneck in achieving precise numerical prediction for 3D scene understanding. Traditional reinforcement learning (RL) approaches, primarily based on relative ranking, often suffer from severe reward sparsity and gradient instability, failing to effectively exploit the verifiable signals provided by 3D physical constraints. Notably, in standard GRPO frameworks, relative normalization causes “near-miss” samples (characterized by small but non-zero errors) to suffer from advantage collapse. This leads to a severe data utilization bottleneck where valuable boundary samples are discarded during optimization. To address this, we introduce the Smooth Numerical Reward Activation (SNRA) operator and the Absolute-Preserving GRPO (AP-GRPO) framework. SNRA employs a dynamically parameterized Sigmoid function to transform raw feedback into a dense, continuous reward continuum. Concurrently, AP-GRPO integrates absolute scalar gradients to mitigate the numerical information loss inherent in conventional relative-ranking mechanisms. By leveraging this approach, we constructed Numerical3D-50k, a dataset comprising 50,000 verifiable 3D subtasks. Empirical results indicate that AP-GRPO achieves performance parity with large-scale supervised methods while maintaining higher data efficiency, effectively activating latent 3D reasoning in VLMs without requiring architectural modifications.

[237] Moonworks Lunara Aesthetic Dataset

Yan Wang, M M Sayeef Abdullah, Partho Hassan, Sabit Hassan

Main category: cs.CV

TL;DR: Lunara Aesthetic Dataset: A first-of-its-kind, high-quality aesthetic image dataset with diverse artistic styles, human-refined prompts, and structured annotations, released under Apache 2.0 license.

Details

Motivation: To address the lack of high-quality aesthetic datasets with stylistic diversity and licensing transparency, as existing large-scale web-derived datasets prioritize breadth over precision and aesthetic quality.

Method: Created using Moonworks Lunara model to generate images embodying distinct aesthetic styles across regional aesthetics (Middle East, Northern Europe, East Asia, South Asia) and general categories (sketch, oil painting). Each image includes human-refined prompts and structured annotations describing objects, attributes, relationships, and stylistic cues.

Result: Produced a dataset with substantially higher aesthetic scores than both aesthetics-focused datasets and general-purpose datasets, featuring diverse artistic styles, high-quality aesthetic content, and comprehensive annotations.

Conclusion: The Lunara Aesthetic Dataset provides a valuable resource for research with its focus on aesthetic quality, stylistic diversity, licensing transparency, and unrestricted academic/commercial use under Apache 2.0 license.

Abstract: The dataset spans diverse artistic styles, including regionally grounded aesthetics from the Middle East, Northern Europe, East Asia, and South Asia, alongside general categories such as sketch and oil painting. All images are generated using the Moonworks Lunara model and intentionally crafted to embody distinct, high-quality aesthetic styles, yielding a first-of-its-kind dataset with substantially higher aesthetic scores, exceeding even aesthetics-focused datasets, and general-purpose datasets by a larger margin. Each image is accompanied by a human-refined prompt and structured annotations that jointly describe salient objects, attributes, relationships, and stylistic cues. Unlike large-scale web-derived datasets that emphasize breadth over precision, the Lunara Aesthetic Dataset prioritizes aesthetic quality, stylistic diversity, and licensing transparency, and is released under the Apache 2.0 license to support research and unrestricted academic and commercial use.

[238] Human-inspired Global-to-Parallel Multi-scale Encoding for Lightweight Vision Models

Wei Xu

Main category: cs.CV

TL;DR: GPM (Global-to-Parallel Multi-scale Encoding) is a novel lightweight vision architecture inspired by human visual perception that achieves better accuracy-efficiency trade-off by balancing global context and local details.

Details

Motivation: Current lightweight vision models either sacrifice performance for efficiency or increase parameters while reducing computation, creating deployment challenges. Existing human vision-inspired approaches oversimplify visual processes and fail to capture true perceptual mechanisms.

Method: Proposes GPM with Global Insight Generator (GIG) for holistic cues, then parallel processing: LSAE for mid/large-scale semantic relations and IRB for fine-grained texture preservation. This mimics human vision’s “whole before details” and “context-aware local attention” behaviors.

Result: H-GPE network built on GPM achieves strong performance on image classification, object detection, and semantic segmentation while maintaining balanced FLOPs and parameters, outperforming recent SOTA lightweight models in accuracy-efficiency trade-off.

Conclusion: GPM’s biologically-inspired design effectively balances global context and local details, enabling lightweight networks with superior accuracy-efficiency trade-off suitable for resource-limited deployment.

Abstract: Lightweight vision networks have witnessed remarkable progress in recent years, yet achieving a satisfactory balance among parameter scale, computational overhead, and task performance remains difficult. Although many existing lightweight models manage to reduce computation considerably, they often do so at the expense of a substantial increase in parameter count (e.g., LSNet, MobileMamba), which still poses obstacles for deployment on resource-limited devices. In parallel, some studies attempt to draw inspiration from human visual perception, but their modeling tends to oversimplify the visual process, making it hard to reflect how perception truly operates. Revisiting the cooperative mechanism of the human visual system, we propose GPM (Global-to-Parallel Multi-scale Encoding). GPM first employs a Global Insight Generator (GIG) to extract holistic cues, and subsequently processes features of different scales through parallel branches: LSAE emphasizes mid-/large-scale semantic relations, while IRB (Inverted Residual Block) preserves fine-grained texture information, jointly enabling coherent representation of global and local features. As such, GPM conforms to two characteristic behaviors of human vision perceiving the whole before focusing on details, and maintaining broad contextual awareness even during local attention. Built upon GPM, we further develop the lightweight H-GPE network. Experiments on image classification, object detection, and semantic segmentation show that H-GPE achieves strong performance while maintaining a balanced footprint in both FLOPs and parameters, delivering a more favorable accuracy-efficiency trade-off compared with recent state-of-the-art lightweight models.

[239] Semantic Misalignment in Vision-Language Models under Perceptual Degradation

Guo Cheng

Main category: cs.CV

TL;DR: VLMs show strong multimodal performance but fail under realistic perception degradation, revealing disconnect between pixel-level robustness and semantic reliability in safety-critical applications.

Details

Motivation: VLMs are increasingly used in autonomous driving and embodied AI where reliable perception is critical for safety, but their robustness to realistic perception degradation remains poorly understood despite strong benchmark performance.

Method: Systematically study semantic misalignment in VLMs under controlled degradation of upstream visual perception using semantic segmentation on Cityscapes dataset. Introduce perception-realistic corruptions that cause moderate drops in segmentation metrics but severe VLM failures. Propose language-level misalignment metrics for hallucination, critical omission, and safety misinterpretation.

Result: Perception-realistic corruptions cause only moderate drops in conventional segmentation metrics but lead to severe VLM failures including hallucinated objects, omission of safety-critical entities, and inconsistent safety judgments. Clear disconnect exists between pixel-level robustness and multimodal semantic reliability across multiple contrastive and generative VLMs.

Conclusion: Current VLM-based systems have critical limitations in handling perception uncertainty, highlighting need for evaluation frameworks that explicitly account for perception degradation in safety-critical applications.

Abstract: Vision-Language Models (VLMs) are increasingly deployed in autonomous driving and embodied AI systems, where reliable perception is critical for safe semantic reasoning and decision-making. While recent VLMs demonstrate strong performance on multimodal benchmarks, their robustness to realistic perception degradation remains poorly understood. In this work, we systematically study semantic misalignment in VLMs under controlled degradation of upstream visual perception, using semantic segmentation on the Cityscapes dataset as a representative perception module. We introduce perception-realistic corruptions that induce only moderate drops in conventional segmentation metrics, yet observe severe failures in downstream VLM behavior, including hallucinated object mentions, omission of safety-critical entities, and inconsistent safety judgments. To quantify these effects, we propose a set of language-level misalignment metrics that capture hallucination, critical omission, and safety misinterpretation, and analyze their relationship with segmentation quality across multiple contrastive and generative VLMs. Our results reveal a clear disconnect between pixel-level robustness and multimodal semantic reliability, highlighting a critical limitation of current VLM-based systems and motivating the need for evaluation frameworks that explicitly account for perception uncertainty in safety-critical applications.

[240] SPARK: Scalable Real-Time Point Cloud Aggregation with Multi-View Self-Calibration

Chentian Sun

Main category: cs.CV

TL;DR: SPARK is a real-time multi-camera 3D reconstruction framework that jointly handles point cloud fusion and camera extrinsic uncertainty through self-calibration, achieving linear scalability with camera count.

Details

Motivation: Existing methods struggle with multi-view fusion, camera extrinsic uncertainty, and scalability for large camera setups in real-time 3D reconstruction, which is crucial for 3D perception, immersive interaction, and robotics.

Method: SPARK consists of two components: (1) a geometry-aware online extrinsic estimation module that uses multi-view priors and enforces cross-view and temporal consistency for stable self-calibration, and (2) a confidence-driven point cloud fusion strategy that models depth reliability and visibility at pixel and point levels to suppress noise and inconsistencies.

Result: Extensive experiments on real-world multi-camera systems show SPARK outperforms existing approaches in extrinsic accuracy, geometric consistency, temporal stability, and real-time performance, while scaling linearly with the number of cameras.

Conclusion: SPARK demonstrates effectiveness and scalability for large-scale multi-camera 3D reconstruction by jointly addressing extrinsic uncertainty and point cloud fusion through self-calibration and confidence-driven fusion strategies.

Abstract: Real-time multi-camera 3D reconstruction is crucial for 3D perception, immersive interaction, and robotics. Existing methods struggle with multi-view fusion, camera extrinsic uncertainty, and scalability for large camera setups. We propose SPARK, a self-calibrating real-time multi-camera point cloud reconstruction framework that jointly handles point cloud fusion and extrinsic uncertainty. SPARK consists of: (1) a geometry-aware online extrinsic estimation module leveraging multi-view priors and enforcing cross-view and temporal consistency for stable self-calibration, and (2) a confidence-driven point cloud fusion strategy modeling depth reliability and visibility at pixel and point levels to suppress noise and view-dependent inconsistencies. By performing frame-wise fusion without accumulation, SPARK produces stable point clouds in dynamic scenes while scaling linearly with the number of cameras. Extensive experiments on real-world multi-camera systems show that SPARK outperforms existing approaches in extrinsic accuracy, geometric consistency, temporal stability, and real-time performance, demonstrating its effectiveness and scalability for large-scale multi-camera 3D reconstruction.

[241] EfficientFSL: Enhancing Few-Shot Classification via Query-Only Tuning in Vision Transformers

Wenwen Liao, Hang Ruan, Jianbo Yu, Bing Song, YuansongWang, Xiaofeng Yang

Main category: cs.CV

TL;DR: EfficientFSL is a query-only fine-tuning framework for Vision Transformers in few-shot learning that achieves SOTA performance with minimal computational overhead by using lightweight trainable blocks.

Details

Motivation: Large models like Vision Transformers show superior few-shot performance but require extensive GPU memory and training time, making them impractical for low-resource scenarios. Need to bridge gap between performance and computational efficiency.

Method: Proposes EfficientFSL with three key components: 1) Lightweight trainable Forward Block to synthesize task-specific queries, 2) Combine Block to fuse multi-layer outputs for robust features, 3) Support-Query Attention Block to mitigate distribution shift by aligning prototypes with query distribution.

Result: Achieves state-of-the-art performance on four in-domain few-shot datasets and six cross-domain datasets while significantly reducing computational overhead with minimal trainable parameters.

Conclusion: EfficientFSL effectively bridges the gap between performance and efficiency in few-shot learning with Vision Transformers, making large models practical for real-world low-resource applications through query-only fine-tuning.

Abstract: Large models such as Vision Transformers (ViTs) have demonstrated remarkable superiority over smaller architectures like ResNet in few-shot classification, owing to their powerful representational capacity. However, fine-tuning such large models demands extensive GPU memory and prolonged training time, making them impractical for many real-world low-resource scenarios. To bridge this gap, we propose EfficientFSL, a query-only fine-tuning framework tailored specifically for few-shot classification with ViT, which achieves competitive performance while significantly reducing computational overhead. EfficientFSL fully leverages the knowledge embedded in the pre-trained model and its strong comprehension ability, achieving high classification accuracy with an extremely small number of tunable parameters. Specifically, we introduce a lightweight trainable Forward Block to synthesize task-specific queries that extract informative features from the intermediate representations of the pre-trained model in a query-only manner. We further propose a Combine Block to fuse multi-layer outputs, enhancing the depth and robustness of feature representations. Finally, a Support-Query Attention Block mitigates distribution shift by adjusting prototypes to align with the query set distribution. With minimal trainable parameters, EfficientFSL achieves state-of-the-art performance on four in-domain few-shot datasets and six cross-domain datasets, demonstrating its effectiveness in real-world applications.

[242] STEP3-VL-10B Technical Report

Ailin Huang, Chengyuan Yao, Chunrui Han, Fanqi Wan, Hangyu Guo, Haoran Lv, Hongyu Zhou, Jia Wang, Jian Zhou, Jianjian Sun, Jingcheng Hu, Kangheng Lin, Liang Zhao, Mitt Huang, Song Yuan, Wenwen Qu, Xiangfeng Wang, Yanlin Lai, Yingxiu Zhao, Yinmin Zhang, Yukang Shi, Yuyang Chen, Zejia Weng, Ziyang Meng, Ang Li, Aobo Kong, Bo Dong, Changyi Wan, David Wang, Di Qi, Dingming Li, En Yu, Guopeng Li, Haiquan Yin, Han Zhou, Hanshan Zhang, Haolong Yan, Hebin Zhou, Hongbo Peng, Jiaran Zhang, Jiashu Lv, Jiayi Fu, Jie Cheng, Jie Zhou, Jisheng Yin, Jingjing Xie, Jingwei Wu, Jun Zhang, Junfeng Liu, Kaijun Tan, Kaiwen Yan, Liangyu Chen, Lina Chen, Mingliang Li, Qian Zhao, Quan Sun, Shaoliang Pang, Shengjie Fan, Shijie Shang, Siyuan Zhang, Tianhao You, Wei Ji, Wuxun Xie, Xiaobo Yang, Xiaojie Hou, Xiaoran Jiao, Xiaoxiao Ren, Xiangwen Kong, Xin Huang, Xin Wu, Xing Chen, Xinran Wang, Xuelin Zhang, Yana Wei, Yang Li, Yanming Xu, Yeqing Shen, Yuang Peng, Yue Peng, Yu Zhou, Yusheng Li, Yuxiang Yang, Yuyang Zhang, Zhe Xie, Zhewei Huang, Zhenyi Lu, Zhimin Fan, Zihui Cheng, Daxin Jiang, Qi Han, Xiangyu Zhang, Yibo Zhu, Zheng Ge

Main category: cs.CV

TL;DR: STEP3-VL-10B is a 10B parameter multimodal foundation model that achieves frontier-level performance rivaling models 10-20x larger through unified pre-training on 1.2T tokens and scaled reinforcement learning, featuring Parallel Coordinated Reasoning for test-time compute scaling.

Details

Motivation: To redefine the trade-off between model compactness and multimodal intelligence, creating a lightweight open-source foundation model that delivers frontier-level performance while being efficient and reproducible.

Method: Two strategic shifts: 1) Unified fully unfrozen pre-training on 1.2T multimodal tokens integrating language-aligned Perception Encoder with Qwen3-8B decoder, 2) Scaled post-training with over 1k iterations of reinforcement learning, plus Parallel Coordinated Reasoning (PaCoRe) for test-time compute scaling.

Result: Despite compact 10B size, rivals or surpasses models 10-20x larger (GLM-4.6V-106B, Qwen3-VL-235B) and top proprietary models (Gemini 2.5 Pro, Seed-1.5-VL). Achieves 92.2% on MMBench, 80.11% on MMMU, 94.43% on AIME2025, and 75.95% on MathVision.

Conclusion: STEP3-VL-10B delivers best-in-class performance as a powerful, efficient, and reproducible baseline, demonstrating that compact models can achieve frontier-level multimodal intelligence through strategic training approaches and test-time compute scaling.

Abstract: We present STEP3-VL-10B, a lightweight open-source foundation model designed to redefine the trade-off between compact efficiency and frontier-level multimodal intelligence. STEP3-VL-10B is realized through two strategic shifts: first, a unified, fully unfrozen pre-training strategy on 1.2T multimodal tokens that integrates a language-aligned Perception Encoder with a Qwen3-8B decoder to establish intrinsic vision-language synergy; and second, a scaled post-training pipeline featuring over 1k iterations of reinforcement learning. Crucially, we implement Parallel Coordinated Reasoning (PaCoRe) to scale test-time compute, allocating resources to scalable perceptual reasoning that explores and synthesizes diverse visual hypotheses. Consequently, despite its compact 10B footprint, STEP3-VL-10B rivals or surpasses models 10$\times$-20$\times$ larger (e.g., GLM-4.6V-106B, Qwen3-VL-235B) and top-tier proprietary flagships like Gemini 2.5 Pro and Seed-1.5-VL. Delivering best-in-class performance, it records 92.2% on MMBench and 80.11% on MMMU, while excelling in complex reasoning with 94.43% on AIME2025 and 75.95% on MathVision. We release the full model suite to provide the community with a powerful, efficient, and reproducible baseline.

cs.AI

[243] AI Survival Stories: a Taxonomic Analysis of AI Existential Risk

Herman Cappelen, Simon Goldstein, John Hawthorne

Main category: cs.AI

TL;DR: The paper develops a framework for analyzing AI existential risk using a two-premise argument and creates a taxonomy of survival scenarios where humanity avoids destruction.

Details

Motivation: To provide a structured approach to the debate about AI existential risk by analyzing the logical premises behind claims that AI could destroy humanity, and to help evaluate different survival scenarios and appropriate responses.

Method: Develops a general framework based on a two-premise argument: (1) AI systems will become extremely powerful, and (2) if AI systems become extremely powerful, they will destroy humanity. Uses these premises to construct a taxonomy of survival stories where humanity survives, each corresponding to one premise failing.

Result: Creates a taxonomy of four survival scenarios: scientific barriers prevent AI from becoming powerful; humanity bans AI research; powerful AI’s goals prevent destruction; or humanity can detect and disable destructive AI. Analyzes challenges for each scenario and how they motivate different policy responses.

Conclusion: The framework provides a systematic way to think about AI existential risk, helps identify different survival pathways and their challenges, informs appropriate policy responses, and enables rough probability estimates of AI-caused human extinction (P(doom)).

Abstract: Since the release of ChatGPT, there has been a lot of debate about whether AI systems pose an existential risk to humanity. This paper develops a general framework for thinking about the existential risk of AI systems. We analyze a two premise argument that AI systems pose a threat to humanity. Premise one: AI systems will become extremely powerful. Premise two: if AI systems become extremely powerful, they will destroy humanity. We use these two premises to construct a taxonomy of survival stories, in which humanity survives into the far future. In each survival story, one of the two premises fails. Either scientific barriers prevent AI systems from becoming extremely powerful; or humanity bans research into AI systems, thereby preventing them from becoming extremely powerful; or extremely powerful AI systems do not destroy humanity, because their goals prevent them from doing so; or extremely powerful AI systems do not destroy humanity, because we can reliably detect and disable systems that have the goal of doing so. We argue that different survival stories face different challenges. We also argue that different survival stories motivate different responses to the threats from AI. Finally, we use our taxonomy to produce rough estimates of P(doom), the probability that humanity will be destroyed by AI.

[244] GUI-Eyes: Tool-Augmented Perception for Visual Grounding in GUI Agents

Chen Chen, Jiawei Shao, Dakuan Lu, Haoyi Hu, Xiangcheng Liu, Hantao Yao, Wu Liu

Main category: cs.AI

TL;DR: GUI-Eyes: RL framework for active visual perception in GUI tasks using strategic tool invocation and two-stage reasoning, achieving 44.8% grounding accuracy with minimal labeled data.

Details

Motivation: Existing GUI automation methods rely on static, one-shot visual inputs and passive perception, lacking adaptive decision-making about when, whether, and how to observe interfaces.

Method: Two-stage reasoning process where agent learns strategic decisions on whether/how to invoke visual tools (cropping/zooming). Progressive perception strategy with coarse exploration and fine-grained grounding coordinated by two-level policy. Spatially continuous reward function integrating location proximity and region overlap.

Result: On ScreenSpot-Pro benchmark, GUI-Eyes-3B achieves 44.8% grounding accuracy using only 3k labeled samples, significantly outperforming both supervised and RL-based baselines.

Conclusion: Tool-aware active perception enabled by staged policy reasoning and fine-grained reward feedback is critical for building robust and data-efficient GUI agents.

Abstract: Recent advances in vision-language models (VLMs) and reinforcement learning (RL) have driven progress in GUI automation. However, most existing methods rely on static, one-shot visual inputs and passive perception, lacking the ability to adaptively determine when, whether, and how to observe the interface. We present GUI-Eyes, a reinforcement learning framework for active visual perception in GUI tasks. To acquire more informative observations, the agent learns to make strategic decisions on both whether and how to invoke visual tools, such as cropping or zooming, within a two-stage reasoning process. To support this behavior, we introduce a progressive perception strategy that decomposes decision-making into coarse exploration and fine-grained grounding, coordinated by a two-level policy. In addition, we design a spatially continuous reward function tailored to tool usage, which integrates both location proximity and region overlap to provide dense supervision and alleviate the reward sparsity common in GUI environments. On the ScreenSpot-Pro benchmark, GUI-Eyes-3B achieves 44.8% grounding accuracy using only 3k labeled samples, significantly outperforming both supervised and RL-based baselines. These results highlight that tool-aware active perception, enabled by staged policy reasoning and fine-grained reward feedback, is critical for building robust and data-efficient GUI agents.

[245] PCN-Rec: Agentic Proof-Carrying Negotiation for Reliable Governance-Constrained Recommendation

Aradhya Dixit, Shreem Dixit

Main category: cs.AI

TL;DR: PCN-Rec is a proof-carrying negotiation pipeline that separates reasoning from constraint enforcement to reliably satisfy governance requirements in LLM-based recommenders while preserving recommendation quality.

Details

Motivation: Modern LLM-based recommenders can generate compelling ranked lists but struggle to reliably satisfy governance constraints like minimum long-tail exposure or diversity requirements, creating a need for systems that can both generate high-quality recommendations and guarantee constraint satisfaction.

Method: A proof-carrying negotiation pipeline with three components: 1) Base recommender produces candidate window, 2) Two agents negotiate (User Advocate for relevance, Policy Agent for constraints), 3) Mediator LLM synthesizes top-N slate with structured certificate. Includes deterministic verifier and constrained-greedy repair for failed verifications.

Result: Achieves 98.55% pass rate on feasible users (n=551, W=80) versus one-shot single-LLM baseline, while preserving utility with only 0.021 absolute drop in NDCG@10 (0.403 vs 0.424). Differences are statistically significant (p<0.05).

Conclusion: PCN-Rec successfully separates natural-language reasoning from deterministic constraint enforcement, providing reliable governance compliance while maintaining recommendation quality, with auditable verification traces for accountability.

Abstract: Modern LLM-based recommenders can generate compelling ranked lists, but they struggle to reliably satisfy governance constraints such as minimum long-tail exposure or diversity requirements. We present PCN-Rec, a proof-carrying negotiation pipeline that separates natural-language reasoning from deterministic enforcement. A base recommender (MF/CF) produces a candidate window of size W, which is negotiated by two agents: a User Advocate optimizing relevance and a Policy Agent enforcing constraints. A mediator LLM synthesizes a top-N slate together with a structured certificate (JSON) describing the claimed constraint satisfaction. A deterministic verifier recomputes all constraints from the slate and accepts only verifier-checked certificates; if verification fails, a deterministic constrained-greedy repair produces a compliant slate for re-verification, yielding an auditable trace. On MovieLens-100K with governance constraints, PCN-Rec achieves a 98.55% pass rate on feasible users (n = 551, W = 80) versus a one-shot single-LLM baseline without verification/repair, while preserving utility with only a 0.021 absolute drop in NDCG@10 (0.403 vs. 0.424); differences are statistically significant (p < 0.05).

[246] Antisocial behavior towards large language model users: experimental evidence

Paweł Niszczota, Cassandra Grützner

Main category: cs.AI

TL;DR: People punish others financially for using LLMs, with punishment increasing with actual LLM use. Self-reported “no use” is punished more than actual non-use, suggesting distrust. High actual use is punished more than self-reported use.

Details

Motivation: To investigate whether negative attitudes toward AI users translate into costly behavioral sanctions, specifically whether people will spend their own resources to punish those who use LLMs.

Method: Two-phase online experiment with 491 Phase II participants. Participants could spend part of their endowment to reduce earnings of peers who completed a real-effort task with or without LLM support. Phase I provided targets for punishment decisions.

Result: Participants destroyed 36% of earnings of those who relied exclusively on LLMs. Punishment increased monotonically with actual LLM use. Disclosure created credibility gap: self-reported null use punished more than actual null use, while high actual use punished more than self-reported use.

Conclusion: The efficiency gains from LLMs come at the cost of social sanctions, as people are willing to incur personal costs to punish LLM users. There is distrust in self-reported non-use, suggesting a social penalty for LLM adoption.

Abstract: The rapid spread of large language models (LLMs) has raised concerns about the social reactions they provoke. Prior research documents negative attitudes toward AI users, but it remains unclear whether such disapproval translates into costly action. We address this question in a two-phase online experiment (N = 491 Phase II participants; Phase I provided targets) where participants could spend part of their own endowment to reduce the earnings of peers who had previously completed a real-effort task with or without LLM support. On average, participants destroyed 36% of the earnings of those who relied exclusively on the model, with punishment increasing monotonically with actual LLM use. Disclosure about LLM use created a credibility gap: self-reported null use was punished more harshly than actual null use, suggesting that declarations of “no use” are treated with suspicion. Conversely, at high levels of use, actual reliance on the model was punished more strongly than self-reported reliance. Taken together, these findings provide the first behavioral evidence that the efficiency gains of LLMs come at the cost of social sanctions.

[247] Improving Chain-of-Thought for Logical Reasoning via Attention-Aware Intervention

Nguyen Minh Phuong, Dang Huu Tien, Naoya Inoue

Main category: cs.AI

TL;DR: AAI is a non-interactive, end-to-end framework that enhances LLM logical reasoning by identifying attention heads with logical patterns and reweighting them during inference.

Details

Motivation: Existing logical reasoning approaches for LLMs either use complex interactive frameworks (with overhead) or hybrid approaches (relying on external resources, limiting scalability). The authors aim for a non-interactive, end-to-end solution that enables reasoning to emerge within the model itself.

Method: Attention-Aware Intervention (AAI): 1) Identify attention heads that pattern with logical reasoning operators by introducing structural information in few-shot prompts, 2) At inference time, reweight attention scores across these selected heads to steer reasoning toward leveraging prior knowledge.

Result: AAI enhances logical reasoning performance across diverse benchmarks and model architectures with negligible computational overhead.

Conclusion: AAI provides an efficient, non-interactive framework for logical reasoning that improves generalization while preserving analyzability without external resources.

Abstract: Modern logical reasoning with LLMs primarily relies on employing complex interactive frameworks that decompose the reasoning process into subtasks solved through carefully designed prompts or requiring external resources (e.g., symbolic solvers) to exploit their strong logical structures. While interactive approaches introduce additional overhead, hybrid approaches depend on external components, which limit their scalability. A non-interactive, end-to-end framework enables reasoning to emerge within the model itself – improving generalization while preserving analyzability without any external resources. In this work, we introduce a non-interactive, end-to-end framework for reasoning tasks. We show that introducing structural information into the few-shot prompt activates a subset of attention heads that patterns aligned with logical reasoning operators. Building on this insight, we propose Attention-Aware Intervention (AAI), an inference-time intervention method that reweights attention scores across selected heads identified by their logical patterns. AAI offers an efficient way to steer the model’s reasoning toward leveraging prior knowledge through attention modulation. Extensive experiments show that AAI enhances logical reasoning performance across diverse benchmarks and model architectures, while incurring negligible additional computational overhead. Code is available at https://github.com/phuongnm94/aai_for_logical_reasoning.

[248] Thinking Long, but Short: Stable Sequential Test-Time Scaling for Large Reasoning Models

Michael R. Metel, Yufei Cui, Boxing Chen, Prasanna Parthasarathi

Main category: cs.AI

TL;DR: Min-Seek is a novel sequential test-time scaling method that stabilizes reasoning accuracy across various thought lengths without fine-tuning, using efficient KV cache management to enable reasoning beyond context limits.

Details

Motivation: Current sequential test-time scaling methods suffer from accuracy degradation and instability as reasoning length increases, requiring fine-tuning for optimal performance. There's a need for a stable, training-free approach that maintains accuracy across different reasoning lengths.

Method: Min-Seek uses a custom KV cache that stores keys without position embeddings and dynamically encodes them contiguously before each new generated thought. This allows keeping only one additional induced thought’s KV pairs in cache, enabling reasoning beyond maximum context length with linear computational complexity.

Result: The method significantly improves model accuracy across various reasoning tasks, stabilizes accuracy over a wide range of induced thoughts, eliminates the need for reasoning length fine-tuning, and enables reasoning beyond the model’s maximum context length.

Conclusion: Min-Seek provides an efficient, stable sequential test-time scaling solution that overcomes limitations of existing methods, offering improved accuracy without fine-tuning while maintaining computational efficiency and enabling extended reasoning capabilities.

Abstract: Sequential test-time scaling is a promising training-free method to improve large reasoning model accuracy, but as currently implemented, significant limitations have been observed. Inducing models to think for longer can increase their accuracy, but as the length of reasoning is further extended, it has also been shown to result in accuracy degradation and model instability. This work presents a novel sequential test-time scaling method, Min-Seek, which improves model accuracy significantly over a wide range of induced thoughts, stabilizing the accuracy of sequential scaling, and removing the need for reasoning length fine-tuning. Beyond improving model accuracy over a variety of reasoning tasks, our method is inherently efficient, as only the KV pairs of one additional induced thought are kept in the KV cache during reasoning. With a custom KV cache which stores keys without position embeddings, by dynamically encoding them contiguously before each new generated thought, our method can continue to reason well beyond a model’s maximum context length, and under mild conditions has linear computational complexity.

[249] A Scoping Review of the Ethical Perspectives on Anthropomorphising Large Language Model-Based Conversational Agents

Andrea Ferrario, Rasita Vinay, Matteo Casserini, Alessandro Facchini

Main category: cs.AI

TL;DR: This scoping review examines anthropomorphisation in LLM-based conversational agents, analyzing conceptual foundations, ethical implications, and methodological approaches to guide ethical design and governance.

Details

Motivation: Anthropomorphisation of LLM-based conversational agents has become increasingly significant but literature remains fragmented across domains with inconsistent definitions, operationalizations, and normative evaluations. There's a need to systematically map ethically oriented work to understand both risks and opportunities.

Method: Scoping review methodology across five databases and three preprint repositories, synthesizing literature on anthropomorphising LLM-based conversational agents. The review focuses on three key areas: conceptual foundations, ethical challenges/opportunities, and methodological approaches.

Result: The review found convergence on attribution-based definitions of anthropomorphisation but substantial divergence in operationalization. Literature shows predominantly risk-forward normative framing, with limited empirical work linking observed interaction effects to actionable governance guidance.

Conclusion: The paper concludes with a research agenda and design/governance recommendations for ethically deploying anthropomorphic cues in LLM-based conversational agents, addressing both ethical concerns and potential benefits.

Abstract: Anthropomorphisation – the phenomenon whereby non-human entities are ascribed human-like qualities – has become increasingly salient with the rise of large language model (LLM)-based conversational agents (CAs). Unlike earlier chatbots, LLM-based CAs routinely generate interactional and linguistic cues, such as first-person self-reference, epistemic and affective expressions that empirical work shows can increase engagement. On the other hand, anthropomorphisation raises ethical concerns, including deception, overreliance, and exploitative relationship framing, while some authors argue that anthropomorphic interaction may support autonomy, well-being, and inclusion. Despite increasing interest in the phenomenon, literature remains fragmented across domains and varies substantially in how it defines, operationalizes, and normatively evaluates anthropomorphisation. This scoping review maps ethically oriented work on anthropomorphising LLM-based CAs across five databases and three preprint repositories. We synthesize (1) conceptual foundations, (2) ethical challenges and opportunities, and (3) methodological approaches. We find convergence on attribution-based definitions but substantial divergence in operationalization, a predominantly risk-forward normative framing, and limited empirical work that links observed interaction effects to actionable governance guidance. We conclude with a research agenda and design/governance recommendations for ethically deploying anthropomorphic cues in LLM-based conversational agents.

[250] Epistemology gives a Future to Complementarity in Human-AI Interactions

Andrea Ferrario, Alessandro Facchini, Juan M. Durán

Main category: cs.AI

TL;DR: The paper critiques current conceptualizations of human-AI complementarity and reframes it within epistemology as evidence of reliable epistemic processes in human-AI teams.

Details

Motivation: Current approaches to human-AI complementarity face theoretical challenges: lack of precise theoretical anchoring, being formalized only as a post hoc indicator of relative predictive accuracy, ignoring other interaction desiderata, and abstracting away from performance gain magnitude-cost profiles, making it difficult to obtain empirically.

Method: The authors leverage epistemology and computational reliabilism to reframe complementarity within justificatory AI discourse. They argue that historical complementarity instances serve as evidence that a human-AI interaction is a reliable epistemic process for predictive tasks.

Result: Complementarity is reconceptualized not as a relative measure of predictive accuracy, but as helping calibrate decision-making to the reliability of AI-supported processes. It contributes to assessing the reliability of human-AI teams when generating predictions.

Conclusion: The role and value of complementarity lies in supporting practical reasoning for stakeholders (patients, managers, regulators) by helping calibrate decision-making to the reliability of AI-supported processes that increasingly shape everyday life.

Abstract: Human-AI complementarity is the claim that a human supported by an AI system can outperform either alone in a decision-making process. Since its introduction in the human-AI interaction literature, it has gained traction by generalizing the reliance paradigm and by offering a more practical alternative to the contested construct of ’trust in AI.’ Yet complementarity faces key theoretical challenges: it lacks precise theoretical anchoring, it is formalized just as a post hoc indicator of relative predictive accuracy, it remains silent about other desiderata of human-AI interactions and it abstracts away from the magnitude-cost profile of its performance gain. As a result, complementarity is difficult to obtain in empirical settings. In this work, we leverage epistemology to address these challenges by reframing complementarity within the discourse on justificatory AI. Drawing on computational reliabilism, we argue that historical instances of complementarity function as evidence that a given human-AI interaction is a reliable epistemic process for a given predictive task. Together with other reliability indicators assessing the alignment of the human-AI team with the epistemic standards and socio-technical practices, complementarity contributes to the degree of reliability of human-AI teams when generating predictions. This supports the practical reasoning of those affected by these outputs – patients, managers, regulators, and others. In summary, our approach suggests that the role and value of complementarity lies not in providing a relative measure of predictive accuracy, but in helping calibrate decision-making to the reliability of AI-supported processes that increasingly shape everyday life.

[251] Beyond Rule-Based Workflows: An Information-Flow-Orchestrated Multi-Agents Paradigm via Agent-to-Agent Communication from CORAL

Xinxing Ren, Quagmire Zang, Caelum Forder, Suman Deb, Ahsen Tahir, Roman J. Georgio, Peter Carroll, Zekun Guo

Main category: cs.AI

TL;DR: CORAL introduces a workflow-free multi-agent system where an information flow orchestrator dynamically coordinates agents through natural language communication, eliminating the need for predefined workflows and outperforming rule-based systems.

Details

Motivation: Existing LLM-based multi-agent systems rely on predefined workflows that require substantial manual effort to anticipate task states and cannot exhaustively cover complex real-world task spaces, limiting their flexibility and robustness.

Method: Proposes an Information-Flow-Orchestrated Multi-Agent Paradigm via Agent-to-Agent (A2A) Communication, where a dedicated orchestrator continuously monitors task progress and dynamically coordinates other agents through natural language using the A2A toolkit, without predefined workflows.

Result: Achieves 63.64% accuracy on GAIA benchmark under pass@1 setting, outperforming the workflow-based OWL system’s 55.15% by 8.49 percentage points with comparable token consumption. Case analysis shows more flexible task monitoring and robust handling of edge cases.

Conclusion: The workflow-free approach enables more flexible and robust multi-agent coordination, overcoming limitations of rule-based workflow systems while maintaining efficiency, representing a significant advancement in LLM-based multi-agent system design.

Abstract: Most existing Large Language Model (LLM)-based Multi-Agent Systems (MAS) rely on predefined workflows, where human engineers enumerate task states in advance and specify routing rules and contextual injections accordingly. Such workflow-driven designs are essentially rule-based decision trees, which suffer from two fundamental limitations: they require substantial manual effort to anticipate and encode possible task states, and they cannot exhaustively cover the state space of complex real-world tasks. To address these issues, we propose an Information-Flow-Orchestrated Multi-Agent Paradigm via Agent-to-Agent (A2A) Communication from CORAL, in which a dedicated information flow orchestrator continuously monitors task progress and dynamically coordinates other agents through the A2A toolkit using natural language, without relying on predefined workflows. We evaluate our approach on the general-purpose benchmark GAIA, using the representative workflow-based MAS OWL as the baseline while controlling for agent roles and underlying models. Under the pass@1 setting, our method achieves 63.64% accuracy, outperforming OWL’s 55.15% by 8.49 percentage points with comparable token consumption. Further case-level analysis shows that our paradigm enables more flexible task monitoring and more robust handling of edge cases. Our implementation is publicly available at: https://github.com/Coral-Protocol/Beyond-Rule-Based-Workflows

[252] Continuum Memory Architectures for Long-Horizon LLM Agents

Joe Logan

Main category: cs.AI

TL;DR: The paper introduces Continuum Memory Architecture (CMA) as a solution to RAG’s limitations in handling temporal continuity and stateful memory for LLM agents.

Details

Motivation: RAG treats memory as a stateless lookup table with persistent information, read-only retrieval, and no temporal continuity, which limits LLM agents' ability to accumulate, mutate, or disambiguate memory over time.

Method: Proposes CMA as an architectural class with persistent storage, selective retention, associative routing, temporal chaining, and consolidation into higher-order abstractions, without disclosing specific implementation details.

Result: CMA shows consistent behavioral advantages over RAG on tasks involving knowledge updates, temporal association, associative recall, and contextual disambiguation, demonstrating its necessity for long-horizon agents.

Conclusion: CMA is a necessary architectural primitive for long-horizon agents, though challenges remain around latency, drift, and interpretability that need further research.

Abstract: Retrieval-augmented generation (RAG) has become the default strategy for providing large language model (LLM) agents with contextual knowledge. Yet RAG treats memory as a stateless lookup table: information persists indefinitely, retrieval is read-only, and temporal continuity is absent. We define the \textit{Continuum Memory Architecture} (CMA), a class of systems that maintain and update internal state across interactions through persistent storage, selective retention, associative routing, temporal chaining, and consolidation into higher-order abstractions. Rather than disclosing implementation specifics, we specify the architectural requirements CMA imposes and show consistent behavioral advantages on tasks that expose RAG’s structural inability to accumulate, mutate, or disambiguate memory. The empirical probes (knowledge updates, temporal association, associative recall, contextual disambiguation) demonstrate that CMA is a necessary architectural primitive for long-horizon agents while highlighting open challenges around latency, drift, and interpretability.

[253] M^4olGen: Multi-Agent, Multi-Stage Molecular Generation under Precise Multi-Property Constraints

Yizhan Li, Florence Cloutier, Sifan Wu, Ali Parviz, Boris Knyazev, Yan Zhang, Glen Berseth, Bang Liu

Main category: cs.AI

TL;DR: MolGen is a two-stage framework for generating molecules under multi-property constraints using fragment-level retrieval and RL-based optimization.

Details

Motivation: Generating molecules that satisfy precise numeric constraints over multiple physicochemical properties is critical but challenging. LLMs struggle with precise multi-objective control and numeric reasoning without external structure and feedback.

Method: Two-stage framework: 1) Prototype generation using multi-agent reasoner with retrieval-anchored fragment-level edits; 2) RL-based fine-grained optimization using Group Relative Policy Optimization (GRPO) for one- or multi-hop refinements to minimize property errors while regulating edit complexity and deviation from prototype.

Result: Experiments on generation under two sets of property constraints (QED, LogP, Molecular Weight and HOMO, LUMO) show consistent gains in validity and precise satisfaction of multi-property targets, outperforming strong LLMs and graph-based algorithms.

Conclusion: MolGen better reasons about molecules by leveraging fragments and supports controllable refinement toward numeric targets, addressing limitations of prior approaches in precise multi-property constraint satisfaction.

Abstract: Generating molecules that satisfy precise numeric constraints over multiple physicochemical properties is critical and challenging. Although large language models (LLMs) are expressive, they struggle with precise multi-objective control and numeric reasoning without external structure and feedback. We introduce \textbf{M olGen}, a fragment-level, retrieval-augmented, two-stage framework for molecule generation under multi-property constraints. Stage I : Prototype generation: a multi-agent reasoner performs retrieval-anchored, fragment-level edits to produce a candidate near the feasible region. Stage II : RL-based fine-grained optimization: a fragment-level optimizer trained with Group Relative Policy Optimization (GRPO) applies one- or multi-hop refinements to explicitly minimize the property errors toward our target while regulating edit complexity and deviation from the prototype. A large, automatically curated dataset with reasoning chains of fragment edits and measured property deltas underpins both stages, enabling deterministic, reproducible supervision and controllable multi-hop reasoning. Unlike prior work, our framework better reasons about molecules by leveraging fragments and supports controllable refinement toward numeric targets. Experiments on generation under two sets of property constraints (QED, LogP, Molecular Weight and HOMO, LUMO) show consistent gains in validity and precise satisfaction of multi-property targets, outperforming strong LLMs and graph-based algorithms.

[254] CaMeLs Can Use Computers Too: System-level Security for Computer Use Agents

Hanna Foerster, Robert Mullins, Tom Blanchard, Nicolas Papernot, Kristina Nikolić, Florian Tramèr, Ilia Shumailov, Cheng Zhang, Yiren Zhao

Main category: cs.AI

TL;DR: Single-Shot Planning for Computer Use Agents prevents prompt injection attacks by generating complete execution graphs before observing potentially malicious UI content, providing provable security while maintaining utility.

Details

Motivation: AI agents are vulnerable to prompt injection attacks that can hijack behavior to steal credentials or cause financial loss. While architectural isolation is the only known robust defense, applying it to Computer Use Agents (CUAs) is challenging because they require continuous UI observation for task execution, conflicting with security isolation requirements.

Method: Introduces Single-Shot Planning for CUAs where a trusted planner generates a complete execution graph with conditional branches before any observation of potentially malicious content. This provides provable control flow integrity guarantees against arbitrary instruction injections. Additional measures address Branch Steering attacks that manipulate UI elements to trigger unintended valid paths.

Result: The approach retains up to 57% of the performance of frontier models while improving performance for smaller open-source models by up to 19% on OSWorld benchmark. Demonstrates that rigorous security and utility can coexist in CUAs.

Conclusion: UI workflows are structurally predictable despite being dynamic, enabling secure Single-Shot Planning for Computer Use Agents. While architectural isolation prevents instruction injections, additional defenses are needed against Branch Steering attacks. The approach successfully balances security and utility in real-world agent systems.

Abstract: AI agents are vulnerable to prompt injection attacks, where malicious content hijacks agent behavior to steal credentials or cause financial loss. The only known robust defense is architectural isolation that strictly separates trusted task planning from untrusted environment observations. However, applying this design to Computer Use Agents (CUAs) – systems that automate tasks by viewing screens and executing actions – presents a fundamental challenge: current agents require continuous observation of UI state to determine each action, conflicting with the isolation required for security. We resolve this tension by demonstrating that UI workflows, while dynamic, are structurally predictable. We introduce Single-Shot Planning for CUAs, where a trusted planner generates a complete execution graph with conditional branches before any observation of potentially malicious content, providing provable control flow integrity guarantees against arbitrary instruction injections. Although this architectural isolation successfully prevents instruction injections, we show that additional measures are needed to prevent Branch Steering attacks, which manipulate UI elements to trigger unintended valid paths within the plan. We evaluate our design on OSWorld, and retain up to 57% of the performance of frontier models while improving performance for smaller open-source models by up to 19%, demonstrating that rigorous security and utility can coexist in CUAs.

[255] Hallucination Detection and Mitigation in Large Language Models

Ahmad Pesaranghader, Erin Li

Main category: cs.AI

TL;DR: A framework for managing hallucinations in LLMs/LRMs using root cause awareness and continuous improvement cycles, with application to regulated domains like finance.

Details

Motivation: LLMs/LRMs have transformative potential in high-stakes domains (finance, law), but their tendency to hallucinate poses critical reliability risks that must be systematically addressed.

Method: Comprehensive operational framework with continuous improvement cycle driven by root cause awareness. Categorizes hallucination sources into model, data, and context factors. Integrates multi-faceted detection methods (uncertainty estimation, reasoning consistency) with stratified mitigation strategies (knowledge grounding, confidence calibration). Uses tiered architecture with closed feedback loops.

Result: Demonstrated through tiered architecture and financial data extraction case study, where model, context, and data tiers form closed feedback loops for progressive reliability enhancement.

Conclusion: Provides systematic, scalable methodology for building trustworthy generative AI systems in regulated environments by managing hallucinations through targeted interventions rather than generic fixes.

Abstract: Large Language Models (LLMs) and Large Reasoning Models (LRMs) offer transformative potential for high-stakes domains like finance and law, but their tendency to hallucinate, generating factually incorrect or unsupported content, poses a critical reliability risk. This paper introduces a comprehensive operational framework for hallucination management, built on a continuous improvement cycle driven by root cause awareness. We categorize hallucination sources into model, data, and context-related factors, allowing targeted interventions over generic fixes. The framework integrates multi-faceted detection methods (e.g., uncertainty estimation, reasoning consistency) with stratified mitigation strategies (e.g., knowledge grounding, confidence calibration). We demonstrate its application through a tiered architecture and a financial data extraction case study, where model, context, and data tiers form a closed feedback loop for progressive reliability enhancement. This approach provides a systematic, scalable methodology for building trustworthy generative AI systems in regulated environments.

[256] Generative AI collective behavior needs an interactionist paradigm

Laura Ferrarotti, Gian Maria Campedelli, Roberto Dessì, Andrea Baronchelli, Giovanni Iacca, Kathleen M. Carley, Alex Pentland, Joel Z. Leibo, James Evans, Bruno Lepri

Main category: cs.AI

TL;DR: The paper argues that studying collective behavior of LLM-based agents is crucial for societal impact, requiring new interactionist approaches due to LLMs’ pre-trained knowledge and in-context learning capabilities.

Details

Motivation: Understanding collective behavior of LLM-based agents is essential due to significant societal risks and benefits. LLMs' unique characteristics—pre-trained knowledge, implicit social priors, and in-context learning adaptation—create distinctive emergent phenomena that require systematic study.

Method: Proposes an interactionist paradigm with alternative theoretical foundations, methodologies, and analytical tools to examine how prior knowledge and embedded values interact with social context in multi-agent generative AI systems.

Result: Identifies four crucial directions for developing and deploying LLM-based collectives, focusing on theory, methods, and trans-disciplinary dialogue (though specific directions aren’t detailed in the abstract).

Conclusion: A new research paradigm is needed to systematically study emergent phenomena in LLM-based collectives, addressing how their distinctive characteristics shape collective behavior with important societal implications.

Abstract: In this article, we argue that understanding the collective behavior of agents based on large language models (LLMs) is an essential area of inquiry, with important implications in terms of risks and benefits, impacting us as a society at many levels. We claim that the distinctive nature of LLMs–namely, their initialization with extensive pre-trained knowledge and implicit social priors, together with their capability of adaptation through in-context learning–motivates the need for an interactionist paradigm consisting of alternative theoretical foundations, methodologies, and analytical tools, in order to systematically examine how prior knowledge and embedded values interact with social context to shape emergent phenomena in multi-agent generative AI systems. We propose and discuss four directions that we consider crucial for the development and deployment of LLM-based collectives, focusing on theory, methods, and trans-disciplinary dialogue.

[257] Chinese Labor Law Large Language Model Benchmark

Zixun Lan, Maochun Xu, Yifan Ren, Rui Wu, Jianghui Zhou, Xueyang Cheng, Jianan Ding Ding, Xinheng Wang, Mingmin Chi, Fei Ma

Main category: cs.AI

TL;DR: LabourLawLLM is a specialized Chinese labor law LLM that outperforms general-purpose and existing legal LLMs on comprehensive labor law tasks, with a scalable methodology for other legal subfields.

Details

Motivation: General-purpose LLMs like GPT-4 struggle with specialized legal subdomains requiring precise legal knowledge, complex reasoning, and contextual sensitivity, particularly in Chinese labor law.

Method: Developed LabourLawLLM tailored to Chinese labor law and created LabourLawBench benchmark covering diverse tasks (legal provision citation, QA, case classification, compensation computation, NER, case analysis). Used evaluation framework combining objective metrics (ROUGE-L, accuracy, F1, soft-F1) with subjective GPT-4 scoring.

Result: LabourLawLLM consistently outperforms general-purpose and existing legal-specific LLMs across all task categories in the LabourLawBench evaluation.

Conclusion: The methodology provides a scalable approach for building specialized LLMs in other legal subfields, improving accuracy, reliability, and societal value of legal AI applications beyond labor law.

Abstract: Recent advances in large language models (LLMs) have led to substantial progress in domain-specific applications, particularly within the legal domain. However, general-purpose models such as GPT-4 often struggle with specialized subdomains that require precise legal knowledge, complex reasoning, and contextual sensitivity. To address these limitations, we present LabourLawLLM, a legal large language model tailored to Chinese labor law. We also introduce LabourLawBench, a comprehensive benchmark covering diverse labor-law tasks, including legal provision citation, knowledge-based question answering, case classification, compensation computation, named entity recognition, and legal case analysis. Our evaluation framework combines objective metrics (e.g., ROUGE-L, accuracy, F1, and soft-F1) with subjective assessment based on GPT-4 scoring. Experiments show that LabourLawLLM consistently outperforms general-purpose and existing legal-specific LLMs across task categories. Beyond labor law, our methodology provides a scalable approach for building specialized LLMs in other legal subfields, improving accuracy, reliability, and societal value of legal AI applications.

[258] An intelligent agent-based simulation of human mobility in extreme urban morphologies

Abderaouf Bahi, Amel Ourici

Main category: cs.AI

TL;DR: AI-integrated simulation shows efficient human mobility (7.8-8.4 min commute, 89% satisfaction) is achievable in extreme vertical/high-density urban morphologies using RL, GNNs, and multi-modal transportation.

Details

Motivation: To investigate whether efficient human mobility is feasible in extreme urban morphologies characterized by high-density vertical structures and linear city layouts, which present unprecedented navigation challenges.

Method: Developed a hybrid simulation framework integrating agent-based modeling, reinforcement learning (RL), supervised learning, and graph neural networks (GNNs) to capture multi-modal transportation behaviors across vertical levels and density scenarios using synthetic data and real-world traces from high-density cities.

Result: AI-integrated architecture achieved average commute time of 7.8-8.4 minutes, satisfaction rate exceeding 89%, and reachability index over 91% even during peak congestion. Ablation studies showed removing RL or GNN degraded performance significantly (commute times increased up to 85%, reachability fell below 70%). Environmental modeling showed low energy consumption and minimal CO₂ emissions with electric mode prioritization.

Conclusion: Efficient and sustainable mobility in extreme urban forms is achievable with adaptive AI systems, intelligent infrastructure, and real-time feedback mechanisms, as demonstrated by the successful integration of RL, GNNs, and multi-modal transportation in the simulation framework.

Abstract: This paper investigates the feasibility of human mobility in extreme urban morphologies, characterized by high-density vertical structures and linear city layouts. To assess whether agents can navigate efficiently within such unprecedented topologies, we develop a hybrid simulation framework that integrates agent-based modeling, reinforcement learning (RL), supervised learning, and graph neural networks (GNNs). The simulation captures multi-modal transportation behaviors across multiple vertical levels and varying density scenarios, using both synthetic data and real-world traces from high-density cities. Experiments show that the full AI-integrated architecture enables agents to achieve an average commute time of 7.8–8.4 minutes, a satisfaction rate exceeding 89%, and a reachability index over 91%, even during peak congestion periods. Ablation studies indicate that removing intelligent modules such as RL or GNN significantly degrades performance, with commute times increasing by up to 85% and reachability falling below 70%. Environmental modeling demonstrates low energy consumption and minimal CO$_2$ emissions when electric modes are prioritized. These results suggest that efficient and sustainable mobility in extreme urban forms is achievable, provided adaptive AI systems, intelligent infrastructure, and real-time feedback mechanisms are implemented.

[259] SPRInG: Continual LLM Personalization via Selective Parametric Adaptation and Retrieval-Interpolated Generation

Seoyeon Kim, Jaehyung Kim

Main category: cs.AI

TL;DR: SPRInG: A semi-parametric framework for continual personalization of LLMs that addresses preference drift through drift-driven selective adaptation and relevance gating.

Details

Motivation: Real-world user interactions are dynamic with evolving preferences, but current personalization methods assume static preferences. Standard continual learning approaches fail to distinguish genuine preference shifts from transient contexts, leading to catastrophic forgetting or noisy updates.

Method: SPRInG uses drift-driven selective adaptation with likelihood-based scoring to identify high-novelty interactions, selectively updates user-specific adapters on drift signals, preserves residuals in replay buffer, and during inference applies strict relevance gating with logit interpolation to fuse parametric knowledge with retrieved history.

Result: Experiments on long-form personalized generation benchmark show SPRInG outperforms existing baselines, demonstrating robustness for real-world continual personalization.

Conclusion: SPRInG effectively addresses the challenge of continual personalization in dynamic environments by distinguishing genuine preference drift from noise, enabling LLMs to adapt to evolving user interests without catastrophic forgetting.

Abstract: Personalizing Large Language Models typically relies on static retrieval or one-time adaptation, assuming user preferences remain invariant over time. However, real-world interactions are dynamic, where user interests continuously evolve, posing a challenge for models to adapt to preference drift without catastrophic forgetting. Standard continual learning approaches often struggle in this context, as they indiscriminately update on noisy interaction streams, failing to distinguish genuine preference shifts from transient contexts. To address this, we introduce SPRInG, a novel semi-parametric framework designed for effective continual personalization. During training, SPRInG employs drift-driven selective adaptation, which utilizes a likelihood-based scoring function to identify high-novelty interactions. This allows the model to selectively update the user-specific adapter on drift signals while preserving hard-to-learn residuals in a replay buffer. During inference, we apply strict relevance gating and fuse parametric knowledge with retrieved history via logit interpolation. Experiments on the long-form personalized generation benchmark demonstrate that SPRInG outperforms existing baselines, validating its robustness for real-world continual personalization.

[260] Memo-SQL: Structured Decomposition and Experience-Driven Self-Correction for Training-Free NL2SQL

Zerui Yang, Weichuan Wang, Yanwei Xu, Linqi Song, Yudai Matsuda, Wei Han, Bo Bai

Main category: cs.AI

TL;DR: Memo-SQL: A training-free NL2SQL framework using structured decomposition and experience-aware self-correction with dynamic memory of error-fix pairs to improve accuracy while reducing computational costs.

Details

Motivation: Existing NL2SQL systems have two critical limitations: (1) they only use correct examples in in-context learning, missing valuable signals from historical error-fix pairs that could enable better self-correction, and (2) test-time scaling approaches produce near-identical SQL candidates through arbitrary decomposition, reducing ensemble effectiveness. There's also a severe accuracy-efficiency trade-off where high performance requires excessive computation.

Method: Memo-SQL introduces two key innovations: 1) Structured decomposition using three clear strategies (entity-wise, hierarchical, and atomic sequential) to encourage diverse reasoning instead of arbitrary decomposition. 2) Experience-aware self-correction with a dynamic memory that stores both successful queries and historical error-fix pairs, using retrieval-augmented prompting to bring relevant examples into context at inference time without fine-tuning or external APIs.

Result: On the BIRD benchmark, Memo-SQL achieves 68.5% execution accuracy, setting a new state-of-the-art among open, zero-fine-tuning methods. It also uses over 10 times fewer computational resources than prior test-time scaling approaches.

Conclusion: Memo-SQL successfully addresses the limitations of existing NL2SQL systems by combining structured decomposition for diverse reasoning with experience-aware self-correction using dynamic memory, achieving superior accuracy with significantly reduced computational requirements.

Abstract: Existing NL2SQL systems face two critical limitations: (1) they rely on in-context learning with only correct examples, overlooking the rich signal in historical error-fix pairs that could guide more robust self-correction; and (2) test-time scaling approaches often decompose questions arbitrarily, producing near-identical SQL candidates across runs and diminishing ensemble gains. Moreover, these methods suffer from a stark accuracy-efficiency trade-off: high performance demands excessive computation, while fast variants compromise quality. We present Memo-SQL, a training-free framework that addresses these issues through two simple ideas: structured decomposition and experience-aware self-correction. Instead of leaving decomposition to chance, we apply three clear strategies, entity-wise, hierarchical, and atomic sequential, to encourage diverse reasoning. For correction, we build a dynamic memory of both successful queries and historical error-fix pairs, and use retrieval-augmented prompting to bring relevant examples into context at inference time, no fine-tuning or external APIs required. On BIRD, Memo-SQL achieves 68.5% execution accuracy, setting a new state of the art among open, zero-fine-tuning methods, while using over 10 times fewer resources than prior TTS approaches.

[261] Structured Personality Control and Adaptation for LLM Agents

Jinpeng Wang, Xinyu Jia, Wei Wei Heng, Yuquan Li, Binbin Shi, Qianlei Chen, Guannan Chen, Junxia Zhang, Yuyu Yin

Main category: cs.AI

TL;DR: Framework for modeling LLM personality using Jungian psychological types with three mechanisms for coherent expression, contextual adaptation, and long-term evolution.

Details

Motivation: LLMs are increasingly important in HCI, but existing approaches struggle to achieve both nuanced and adaptable personality expression. Personality is critical for influencing engagement, decision-making, and perceived realism in human-computer interactions.

Method: A framework that models LLM personality via Jungian psychological types with three integrated mechanisms: 1) dominant-auxiliary coordination for coherent core expression, 2) reinforcement-compensation for temporary contextual adaptation, and 3) reflection mechanism for long-term personality evolution.

Result: Personality alignment is evaluated using Myers-Briggs Type Indicator questionnaires and tested under diverse challenge scenarios as preliminary structured assessment. Findings suggest evolving, personality-aware LLMs can support coherent, context-sensitive interactions.

Conclusion: The framework enables naturalistic agent design in HCI by allowing LLMs to maintain nuanced personality traits while dynamically adjusting to interaction demands and gradually updating their underlying structure.

Abstract: Large Language Models (LLMs) are increasingly shaping human-computer interaction (HCI), from personalized assistants to social simulations. Beyond language competence, researchers are exploring whether LLMs can exhibit human-like characteristics that influence engagement, decision-making, and perceived realism. Personality, in particular, is critical, yet existing approaches often struggle to achieve both nuanced and adaptable expression. We present a framework that models LLM personality via Jungian psychological types, integrating three mechanisms: a dominant-auxiliary coordination mechanism for coherent core expression, a reinforcement-compensation mechanism for temporary adaptation to context, and a reflection mechanism that drives long-term personality evolution. This design allows the agent to maintain nuanced traits while dynamically adjusting to interaction demands and gradually updating its underlying structure. Personality alignment is evaluated using Myers-Briggs Type Indicator questionnaires and tested under diverse challenge scenarios as a preliminary structured assessment. Findings suggest that evolving, personality-aware LLMs can support coherent, context-sensitive interactions, enabling naturalistic agent design in HCI.

[262] PaperScout: An Autonomous Agent for Academic Paper Search with Process-Aware Sequence-Level Policy Optimization

Tingyue Pan, Jie Ouyang, Mingyue Cheng, Qingchuan Li, Zirui Liu, Mingfan Pan, Shuo Yu, Qi Liu

Main category: cs.AI

TL;DR: PaperScout: an autonomous agent for academic paper search using sequential decision-making with PSPO optimization to address granularity mismatch in multi-turn agent training.

Details

Motivation: Existing academic paper search approaches rely on rigid, predefined workflows that struggle with complex, conditional queries. There's a need for more adaptive, dynamic search capabilities.

Method: Propose PaperScout as an autonomous agent that reformulates paper search as sequential decision-making. Introduce Proximal Sequence Policy Optimization (PSPO), a process-aware, sequence-level policy optimization method that aligns optimization with agent-environment interaction to address granularity mismatch in multi-turn agent training.

Result: PaperScout significantly outperforms strong workflow-driven and RL baselines in both recall and relevance on synthetic and real-world benchmarks.

Conclusion: The adaptive agentic framework with PSPO optimization effectively addresses limitations of rigid search workflows and standard RL methods for multi-turn agentic tasks.

Abstract: Academic paper search is a fundamental task in scientific research, yet most existing approaches rely on rigid, predefined workflows that struggle with complex, conditional queries. To address this limitation, we propose PaperScout, an autonomous agent that reformulates paper search as a sequential decision-making process. Unlike static workflows, PaperScout dynamically decides whether, when, and how to invoke search and expand tools based on accumulated retrieval context. However, training such agents presents a fundamental challenge: standard reinforcement learning methods, typically designed for single-turn tasks, suffer from a granularity mismatch when applied to multi-turn agentic tasks, where token-level optimization diverges from the granularity of sequence-level interactions, leading to noisy credit assignment. We introduce Proximal Sequence Policy Optimization (PSPO), a process-aware, sequence-level policy optimization method that aligns optimization with agent-environment interaction. Comprehensive experiments on both synthetic and real-world benchmarks demonstrate that PaperScout significantly outperforms strong workflow-driven and RL baselines in both recall and relevance, validating the effectiveness of our adaptive agentic framework and optimization strategy.

[263] FilDeep: Learning Large Deformations of Elastic-Plastic Solids with Multi-Fidelity Data

Jianheng Tang, Shilong Tao, Zhe Feng, Haonan Sun, Menglu Wang, Zhanxing Zhu, Yunhuai Liu

Main category: cs.AI

TL;DR: FilDeep is a fidelity-based deep learning framework that uses multi-fidelity data (low-fidelity high-quantity and high-fidelity low-quantity data) to solve large deformation problems in elastic-plastic solids, addressing the quantity-accuracy dilemma in dataset construction.

Details

Motivation: Traditional numerical methods for large deformations in elastic-plastic solids have limitations, and current deep learning approaches require high-quantity, high-accuracy datasets that are difficult to obtain for large deformation problems. There's a fundamental dilemma between data quantity and accuracy during dataset construction.

Method: FilDeep framework simultaneously trains with both low-fidelity (high quantity, lower accuracy) and high-fidelity (low quantity, higher accuracy) data. It includes attention-enabled cross-fidelity modules to capture long-range physical interactions across multi-fidelity data, specifically designed for practical large deformation problems like stretch bending.

Result: Extensive experiments demonstrate that FilDeep consistently achieves state-of-the-art performance and can be efficiently deployed in manufacturing applications. It presents the first DL framework for large deformation problems using multi-fidelity data.

Conclusion: FilDeep successfully resolves the quantity-accuracy dilemma in large deformation problems by leveraging multi-fidelity data, making deep learning more practical for manufacturing applications involving elastic-plastic solids with large deformations.

Abstract: The scientific computation of large deformations in elastic-plastic solids is crucial in various manufacturing applications. Traditional numerical methods exhibit several inherent limitations, prompting Deep Learning (DL) as a promising alternative. The effectiveness of current DL techniques typically depends on the availability of high-quantity and high-accuracy datasets, which are yet difficult to obtain in large deformation problems. During the dataset construction process, a dilemma stands between data quantity and data accuracy, leading to suboptimal performance in the DL models. To address this challenge, we focus on a representative application of large deformations, the stretch bending problem, and propose FilDeep, a Fidelity-based Deep Learning framework for large Deformation of elastic-plastic solids. Our FilDeep aims to resolve the quantity-accuracy dilemma by simultaneously training with both low-fidelity and high-fidelity data, where the former provides greater quantity but lower accuracy, while the latter offers higher accuracy but in less quantity. In FilDeep, we provide meticulous designs for the practical large deformation problem. Particularly, we propose attention-enabled cross-fidelity modules to effectively capture long-range physical interactions across MF data. To the best of our knowledge, our FilDeep presents the first DL framework for large deformation problems using MF data. Extensive experiments demonstrate that our FilDeep consistently achieves state-of-the-art performance and can be efficiently deployed in manufacturing.

[264] State of AI: An Empirical 100 Trillion Token Study with OpenRouter

Malika Aubakirova, Alex Atallah, Chris Clark, Justin Summerville, Anjney Midha

Main category: cs.AI

TL;DR: Analysis of 100T+ tokens of real-world LLM usage reveals rapid adoption of reasoning models, popularity of creative roleplay and coding, emergence of agentic inference, and a “Glass Slipper” effect where early users show much higher retention than later cohorts.

Details

Motivation: The field shifted dramatically with the release of reasoning models like o1, moving from single-pass to multi-step deliberation inference. However, empirical understanding of how these models are actually used in practice has lagged behind this rapid technological evolution.

Method: Leveraged the OpenRouter platform, an AI inference provider across various LLMs, to analyze over 100 trillion tokens of real-world LLM interactions across tasks, geographies, and time periods.

Result: Substantial adoption of open-weight models, outsized popularity of creative roleplay and coding assistance (beyond just productivity tasks), rise of agentic inference, and identification of the “Glass Slipper” effect where early users show much higher long-term retention than later cohorts.

Conclusion: LLM usage “in the wild” is complex and multifaceted, with implications for model builders, AI developers, and infrastructure providers. Data-driven understanding of usage patterns can inform better design and deployment of LLM systems.

Abstract: The past year has marked a turning point in the evolution and real-world use of large language models (LLMs). With the release of the first widely adopted reasoning model, o1, on December 5th, 2024, the field shifted from single-pass pattern generation to multi-step deliberation inference, accelerating deployment, experimentation, and new classes of applications. As this shift unfolded at a rapid pace, our empirical understanding of how these models have actually been used in practice has lagged behind. In this work, we leverage the OpenRouter platform, which is an AI inference provider across a wide variety of LLMs, to analyze over 100 trillion tokens of real-world LLM interactions across tasks, geographies, and time. In our empirical study, we observe substantial adoption of open-weight models, the outsized popularity of creative roleplay (beyond just the productivity tasks many assume dominate) and coding assistance categories, plus the rise of agentic inference. Furthermore, our retention analysis identifies foundational cohorts: early users whose engagement persists far longer than later cohorts. We term this phenomenon the Cinderella “Glass Slipper” effect. These findings underscore that the way developers and end-users engage with LLMs “in the wild” is complex and multifaceted. We discuss implications for model builders, AI developers, and infrastructure providers, and outline how a data-driven understanding of usage can inform better design and deployment of LLM systems.

[265] MATRIX AS PLAN: Structured Logical Reasoning with Feedback-Driven Replanning

Ke Chen, Jiandian Zeng, Zihao Peng, Guo Li, Guangxue Zhang, Tian Wang

Main category: cs.AI

TL;DR: MatrixCoT is a structured Chain-of-Thought framework that uses matrix-based planning to enhance LLMs’ logical reasoning without external solvers, improving robustness and interpretability.

Details

Motivation: Current approaches have limitations: CoT prompting falls short on symbolic reasoning tasks, neuro-symbolic methods are format-sensitive and brittle, and LLM-driven approaches lack structured representations and error-correction mechanisms.

Method: MatrixCoT normalizes and types natural language expressions, adds citation fields, and introduces matrix-based planning to preserve global relations among reasoning steps. It includes a feedback-driven replanning mechanism for verification under semantic-equivalence constraints.

Result: Experiments on five logical-reasoning benchmarks with five LLMs show MatrixCoT enhances both robustness and interpretability for complex symbolic reasoning tasks while maintaining competitive performance, without relying on external solvers.

Conclusion: MatrixCoT provides a structured CoT framework that addresses limitations of existing approaches by creating verifiable planning artifacts and incorporating verification mechanisms, making LLM reasoning more stable and trustworthy for logical tasks.

Abstract: As knowledge and semantics on the web grow increasingly complex, enhancing Large Language Models (LLMs) comprehension and reasoning capabilities has become particularly important. Chain-of-Thought (CoT) prompting has been shown to enhance the reasoning capabilities of LLMs. However, it still falls short on logical reasoning tasks that rely on symbolic expressions and strict deductive rules. Neuro-symbolic methods address this gap by enforcing formal correctness through external solvers. Yet these solvers are highly format-sensitive, and small instabilities in model outputs can lead to frequent processing failures. LLM-driven approaches avoid parsing brittleness, but they lack structured representations and process-level error-correction mechanisms. To further enhance the logical reasoning capabilities of LLMs, we propose MatrixCoT, a structured CoT framework with a matrix-based plan. Specifically, we normalize and type natural language expressions, attach explicit citation fields, and introduce a matrix-based planning method to preserve global relations among steps. The plan becomes a verifiable artifact, making execution more stable. For verification, we also add a feedback-driven replanning mechanism. Under semantic-equivalence constraints, it identifies omissions and defects, rewrites and compresses the dependency matrix, and produces a more trustworthy final answer. Experiments on five logical-reasoning benchmarks and five LLMs show that, without relying on external solvers, MatrixCoT enhances both robustness and interpretability when tackling complex symbolic reasoning tasks, while maintaining competitive performance.

[266] Following the Teacher’s Footsteps: Scheduled Checkpoint Distillation for Domain-Specific LLMs

Cheng Feng, Chaoliang Zhong, Jun Sun, Yusuke Oishi

Main category: cs.AI

TL;DR: Student models can outperform teachers on domain tasks when their advantage on student-favored subdomains outweighs deficits on teacher-favored subdomains, achieved through scheduled checkpoint distillation and adaptive weighting.

Details

Motivation: LLMs are too large for practical deployment, and distillation often fails due to capacity gaps between teacher and student models, leading to suboptimal performance on domain-specific tasks.

Method: Proposes Scheduled Checkpoint Distillation (SCD) to emulate teacher’s convergence process during SFT, plus Adaptive Weighting (AW) to preserve student strengths on favorable subdomains.

Result: Method consistently outperforms existing distillation approaches across QA, NER, and text classification tasks in multiple languages, enabling students to match or exceed teacher performance.

Conclusion: Students can surpass teachers on domain tasks when properly balancing subdomain advantages, achieved through convergence-emulating distillation and adaptive weighting mechanisms.

Abstract: Large language models (LLMs) are challenging to deploy for domain-specific tasks due to their massive scale. While distilling a fine-tuned LLM into a smaller student model is a promising alternative, the capacity gap between teacher and student often leads to suboptimal performance. This raises a key question: when and how can a student model match or even surpass its teacher on domain-specific tasks? In this work, we propose a novel theoretical insight: a student can outperform its teacher if its advantage on a Student-Favored Subdomain (SFS) outweighs its deficit on the Teacher-Favored Subdomain (TFS). Guided by this insight, we propose Scheduled Checkpoint Distillation (SCD), which reduces the TFS deficit by emulating the teacher’s convergence process during supervised fine-tuning (SFT) on the domain task, and a sample-wise Adaptive Weighting (AW) mechanism to preserve student strengths on SFS. Experiments across diverse domain tasks–including QA, NER, and text classification in multiple languages–show that our method consistently outperforms existing distillation approaches, allowing the student model to match or even exceed the performance of its fine-tuned teacher.

[267] Is More Context Always Better? Examining LLM Reasoning Capability for Time Interval Prediction

Yanan Cao, Farnaz Fallahi, Murali Mohana Krishna Dandu, Lalitesh Morishetti, Kai Zhao, Luyi Ma, Sinduja Subramaniam, Jianpeng Xu, Evren Korpeoglu, Kaushiki Nag, Sushant Kumar, Kannan Achan

Main category: cs.AI

TL;DR: LLMs can predict time intervals between recurring user actions but underperform specialized ML models, showing limited ability to capture quantitative temporal structure. More context doesn’t always help - moderate context improves accuracy but excessive detail degrades performance.

Details

Motivation: While LLMs show impressive reasoning capabilities, their ability to infer temporal regularities from structured behavioral data remains underexplored. The paper aims to systematically investigate whether LLMs can predict time intervals between recurring user actions and how contextual information affects their predictions.

Method: The study uses a simple repurchase scenario to benchmark state-of-the-art LLMs in zero-shot settings against both statistical and machine-learning models. It examines how different levels of contextual information shape LLM predictive behavior.

Result: Two key findings: 1) LLMs surpass lightweight statistical baselines but consistently underperform dedicated machine-learning models, revealing limited ability to capture quantitative temporal structure. 2) Moderate context improves LLM accuracy, but adding further user-level detail degrades performance, challenging the “more context leads to better reasoning” assumption.

Conclusion: The study highlights fundamental limitations of current LLMs in structured temporal inference and offers guidance for designing future context-aware hybrid models that integrate statistical precision with linguistic flexibility.

Abstract: Large Language Models (LLMs) have demonstrated impressive capabilities in reasoning and prediction across different domains. Yet, their ability to infer temporal regularities from structured behavioral data remains underexplored. This paper presents a systematic study investigating whether LLMs can predict time intervals between recurring user actions, such as repeated purchases, and how different levels of contextual information shape their predictive behavior. Using a simple but representative repurchase scenario, we benchmark state-of-the-art LLMs in zero-shot settings against both statistical and machine-learning models. Two key findings emerge. First, while LLMs surpass lightweight statistical baselines, they consistently underperform dedicated machine-learning models, showing their limited ability to capture quantitative temporal structure. Second, although moderate context can improve LLM accuracy, adding further user-level detail degrades performance. These results challenge the assumption that “more context leads to better reasoning”. Our study highlights fundamental limitations of today’s LLMs in structured temporal inference and offers guidance for designing future context-aware hybrid models that integrate statistical precision with linguistic flexibility.

[268] History Is Not Enough: An Adaptive Dataflow System for Financial Time-Series Synthesis

Haochong Xia, Yao Long Teng, Regan Tan, Molei Qin, Xinrun Wang, Bo An

Main category: cs.AI

TL;DR: A drift-aware dataflow system for financial ML that adapts data generation to market dynamics through differentiable optimization, improving model robustness and returns.

Details

Motivation: Traditional ML models in finance overfit static historical data and fail to generalize in dynamic markets due to concept drift and distributional non-stationarity. There's a need for adaptive data generation that evolves with the market rather than relying solely on past observations.

Method: A drift-aware dataflow system with parameterized data manipulation (single-stock transformations, multi-stock mix-ups, curation operations) coupled with an adaptive planner-scheduler using gradient-based bi-level optimization. Unifies data augmentation, curriculum learning, and workflow management under a differentiable framework.

Result: Extensive experiments on forecasting and reinforcement learning trading tasks show enhanced model robustness and improved risk-adjusted returns compared to traditional approaches.

Conclusion: The system provides a generalizable approach to adaptive data management and learning-guided workflow automation for financial data, addressing the critical gap between training and real-world performance in quantitative finance.

Abstract: In quantitative finance, the gap between training and real-world performance-driven by concept drift and distributional non-stationarity-remains a critical obstacle for building reliable data-driven systems. Models trained on static historical data often overfit, resulting in poor generalization in dynamic markets. The mantra “History Is Not Enough” underscores the need for adaptive data generation that learns to evolve with the market rather than relying solely on past observations. We present a drift-aware dataflow system that integrates machine learning-based adaptive control into the data curation process. The system couples a parameterized data manipulation module comprising single-stock transformations, multi-stock mix-ups, and curation operations, with an adaptive planner-scheduler that employs gradient-based bi-level optimization to control the system. This design unifies data augmentation, curriculum learning, and data workflow management under a single differentiable framework, enabling provenance-aware replay and continuous data quality monitoring. Extensive experiments on forecasting and reinforcement learning trading tasks demonstrate that our framework enhances model robustness and improves risk-adjusted returns. The system provides a generalizable approach to adaptive data management and learning-guided workflow automation for financial data.

[269] DecisionLLM: Large Language Models for Long Sequence Decision Exploration

Xiaowei Lv, Zhilin Zhang, Yijun Li, Yusen Huo, Siyuan Ju, Xuyan Li, Chunxiang Hong, Tianyu Wang, Yongcai Wang, Peng Sun, Chuan Yu, Jian Xu, Bo Zheng

Main category: cs.AI

TL;DR: DecisionLLM applies large language models to offline sequential decision-making by treating trajectories as a distinct modality and aligning them with natural language task descriptions, outperforming traditional Decision Transformers.

Details

Motivation: The paper is motivated by the success of LLMs in complex reasoning tasks and their shared Transformer foundation with Decision Transformers. The authors investigate whether LLMs, operating at much larger scales, can unlock new performance levels in long-horizon sequential decision-making, particularly for offline decision tasks like real-time bidding in computational advertising.

Method: The authors propose DecisionLLM, which treats trajectories as a distinct modality to address LLMs’ inability to interpret continuous values. The model learns to align trajectory data with natural language task descriptions, enabling autoregressive prediction of future decisions within a cohesive framework. They establish scaling laws showing performance depends on model scale, data volume, and data quality.

Result: DecisionLLM achieves strong performance in offline experimental benchmarks and bidding scenarios. Specifically, DecisionLLM-3B outperforms traditional Decision Transformer by 69.4 on Maze2D umaze-v1 and by 0.085 on AuctionNet. The work extends the AIGB paradigm and points to promising directions for future online bidding exploration.

Conclusion: LLMs can be effectively applied to offline sequential decision-making by treating trajectories as a distinct modality and aligning them with task descriptions. The proposed DecisionLLM framework demonstrates superior performance over traditional Decision Transformers and establishes scaling laws for this paradigm, opening promising directions for future research in online decision-making applications.

Abstract: Long-sequence decision-making, which is usually addressed through reinforcement learning (RL), is a critical component for optimizing strategic operations in dynamic environments, such as real-time bidding in computational advertising. The Decision Transformer (DT) introduced a powerful paradigm by framing RL as an autoregressive sequence modeling problem. Concurrently, Large Language Models (LLMs) have demonstrated remarkable success in complex reasoning and planning tasks. This inspires us whether LLMs, which share the same Transformer foundation, but operate at a much larger scale, can unlock new levels of performance in long-horizon sequential decision-making problem. This work investigates the application of LLMs to offline decision making tasks. A fundamental challenge in this domain is the LLMs’ inherent inability to interpret continuous values, as they lack a native understanding of numerical magnitude and order when values are represented as text strings. To address this, we propose treating trajectories as a distinct modality. By learning to align trajectory data with natural language task descriptions, our model can autoregressively predict future decisions within a cohesive framework we term DecisionLLM. We establish a set of scaling laws governing this paradigm, demonstrating that performance hinges on three factors: model scale, data volume, and data quality. In offline experimental benchmarks and bidding scenarios, DecisionLLM achieves strong performance. Specifically, DecisionLLM-3B outperforms the traditional Decision Transformer (DT) by 69.4 on Maze2D umaze-v1 and by 0.085 on AuctionNet. It extends the AIGB paradigm and points to promising directions for future exploration in online bidding.

[270] MHub.ai: A Simple, Standardized, and Reproducible Platform for AI Models in Medical Imaging

Leonard Nürnberg, Dennis Bontempi, Suraj Pai, Curtis Lisle, Steve Pieper, Ron Kikinis, Sil van de Leemput, Rahul Soni, Gowtham Murugesan, Cosmin Ciausu, Miriam Groeneveld, Felix J. Dorfner, Jue Jiang, Aneesh Rangnekar, Harini Veeraraghavan, Joeran S. Bosma, Keno Bressem, Raymond Mak, Andrey Fedorov, Hugo JWL Aerts

Main category: cs.AI

TL;DR: MHub.ai is an open-source container platform that standardizes AI models for medical imaging, addressing reproducibility issues by packaging peer-reviewed models with consistent interfaces, metadata, and reference data.

Details

Motivation: AI in medical imaging faces challenges with diverse implementations, inconsistent documentation, and reproducibility issues that limit research and clinical adoption.

Method: Developed MHub.ai, a container-based platform that packages AI models from publications into standardized containers with unified interfaces, DICOM support, and embedded metadata, plus reference data for verification.

Result: Created a platform with initial state-of-the-art models for segmentation, prediction, and feature extraction across modalities, demonstrated utility through lung segmentation evaluation with public segmentations, metrics, and interactive dashboards.

Conclusion: MHub.ai simplifies AI model use in medical imaging, enables standardized benchmarking, lowers clinical translation barriers, and promotes reproducibility through its open-source, modular framework.

Abstract: Artificial intelligence (AI) has the potential to transform medical imaging by automating image analysis and accelerating clinical research. However, research and clinical use are limited by the wide variety of AI implementations and architectures, inconsistent documentation, and reproducibility issues. Here, we introduce MHub.ai, an open-source, container-based platform that standardizes access to AI models with minimal configuration, promoting accessibility and reproducibility in medical imaging. MHub.ai packages models from peer-reviewed publications into standardized containers that support direct processing of DICOM and other formats, provide a unified application interface, and embed structured metadata. Each model is accompanied by publicly available reference data that can be used to confirm model operation. MHub.ai includes an initial set of state-of-the-art segmentation, prediction, and feature extraction models for different modalities. The modular framework enables adaptation of any model and supports community contributions. We demonstrate the utility of the platform in a clinical use case through comparative evaluation of lung segmentation models. To further strengthen transparency and reproducibility, we publicly release the generated segmentations and evaluation metrics and provide interactive dashboards that allow readers to inspect individual cases and reproduce or extend our analysis. By simplifying model use, MHub.ai enables side-by-side benchmarking with identical execution commands and standardized outputs, and lowers the barrier to clinical translation.

[271] MMPG: MoE-based Adaptive Multi-Perspective Graph Fusion for Protein Representation Learning

Yusong Wang, Jialun Shen, Zhihao Wu, Yicheng Xu, Shiyin Tan, Mingkun Xu, Changshuo Wang, Zixing Song, Prayag Tiwari

Main category: cs.AI

TL;DR: MMPG is a multi-perspective protein graph framework that constructs protein graphs from physical, chemical, and geometric perspectives and fuses them using Mixture of Experts for better protein representation learning.

Details

Motivation: Current GNN-based protein representation learning methods use single-perspective graph construction strategies that capture only partial properties of residue interactions, leading to incomplete protein representations.

Method: Constructs protein graphs from physical, chemical, and geometric perspectives, then uses Mixture of Experts to dynamically route perspectives to specialized experts that learn intrinsic features and cross-perspective interactions.

Result: MMPG produces superior protein representations and achieves advanced performance on four different downstream protein tasks. MoE automatically specializes experts in modeling distinct levels of interaction from individual representations to global consensus.

Conclusion: Multi-perspective graph construction with adaptive fusion via Mixture of Experts effectively addresses the limitation of single-perspective approaches and improves protein representation learning.

Abstract: Graph Neural Networks (GNNs) have been widely adopted for Protein Representation Learning (PRL), as residue interaction networks can be naturally represented as graphs. Current GNN-based PRL methods typically rely on single-perspective graph construction strategies, which capture partial properties of residue interactions, resulting in incomplete protein representations. To address this limitation, we propose MMPG, a framework that constructs protein graphs from multiple perspectives and adaptively fuses them via Mixture of Experts (MoE) for PRL. MMPG constructs graphs from physical, chemical, and geometric perspectives to characterize different properties of residue interactions. To capture both perspective-specific features and their synergies, we develop an MoE module, which dynamically routes perspectives to specialized experts, where experts learn intrinsic features and cross-perspective interactions. We quantitatively verify that MoE automatically specializes experts in modeling distinct levels of interaction from individual representations, to pairwise inter-perspective synergies, and ultimately to a global consensus across all perspectives. Through integrating this multi-level information, MMPG produces superior protein representations and achieves advanced performance on four different downstream protein tasks.

[272] CtD: Composition through Decomposition in Emergent Communication

Boaz Carmeli, Ron Meir, Yonatan Belinkov

Main category: cs.AI

TL;DR: Neural agents learn compositional generalization through a “Composition through Decomposition” method with two steps: first decomposing images into basic concepts via a codebook, then composing those concepts to describe novel images, sometimes achieving zero-shot generalization.

Details

Motivation: To demonstrate how artificial neural agents can acquire and utilize compositional generalization - the human cognitive ability to systematically combine known concepts in novel ways - particularly for describing previously unseen images.

Method: “Composition through Decomposition” with two sequential training steps: 1) ‘Decompose’ step where agents learn to decompose images into basic concepts using a codebook acquired during interaction in a multi-target coordination game; 2) ‘Compose’ step where agents use this codebook to describe novel images by composing basic concepts into complex phrases.

Result: The agents successfully learn to describe previously unseen images by composing basic concepts, with remarkable cases where generalization in the ‘Compose’ step is achieved zero-shot without additional training.

Conclusion: Artificial neural agents can acquire compositional generalization capabilities similar to humans through structured training approaches, enabling them to describe novel images by systematically combining learned basic concepts, sometimes achieving zero-shot generalization.

Abstract: Compositionality is a cognitive mechanism that allows humans to systematically combine known concepts in novel ways. This study demonstrates how artificial neural agents acquire and utilize compositional generalization to describe previously unseen images. Our method, termed “Composition through Decomposition”, involves two sequential training steps. In the ‘Decompose’ step, the agents learn to decompose an image into basic concepts using a codebook acquired during interaction in a multi-target coordination game. Subsequently, in the ‘Compose’ step, the agents employ this codebook to describe novel images by composing basic concepts into complex phrases. Remarkably, we observe cases where generalization in the `Compose’ step is achieved zero-shot, without the need for additional training.

[273] How does downsampling affect needle electromyography signals? A generalisable workflow for understanding downsampling effects on high-frequency time series

Mathieu Cherpitel, Janne Luijten, Thomas Bäck, Camiel Verhamme, Martijn Tannemaat, Anna Kononova

Main category: cs.AI

TL;DR: This paper presents a systematic workflow to evaluate how downsampling affects diagnostic information in high-frequency needle EMG signals, showing shape-aware algorithms outperform standard decimation while enabling near real-time analysis.

Details

Motivation: High and heterogeneous sampling rates in needle EMG signals pose computational challenges for real-time analysis. Downsampling could help but its impact on diagnostic content and classification performance is poorly understood.

Method: Developed a workflow combining shape-based distortion metrics with classification outcomes from feature-based ML models and feature space analysis. Evaluated using a three-class neuromuscular disease classification task with different downsampling algorithms and factors.

Result: Shape-aware downsampling algorithms outperform standard decimation in preserving peak structure and signal morphology. The workflow identifies configurations that preserve diagnostic information while substantially reducing computational load.

Conclusion: Provides practical guidance for selecting downsampling configurations enabling near real-time nEMG analysis. The workflow is generalizable to balance data reduction with model performance in other high-frequency time-series applications.

Abstract: Automated analysis of needle electromyography (nEMG) signals is emerging as a tool to support the detection of neuromuscular diseases (NMDs), yet the signals’ high and heterogeneous sampling rates pose substantial computational challenges for feature-based machine-learning models, particularly for near real-time analysis. Downsampling offers a potential solution, but its impact on diagnostic signal content and classification performance remains insufficiently understood. This study presents a workflow for systematically evaluating information loss caused by downsampling in high-frequency time series. The workflow combines shape-based distortion metrics with classification outcomes from available feature-based machine learning models and feature space analysis to quantify how different downsampling algorithms and factors affect both waveform integrity and predictive performance. We use a three-class NMD classification task to experimentally evaluate the workflow. We demonstrate how the workflow identifies downsampling configurations that preserve diagnostic information while substantially reducing computational load. Analysis of shape-based distortion metrics showed that shape-aware downsampling algorithms outperform standard decimation, as they better preserve peak structure and overall signal morphology. The results provide practical guidance for selecting downsampling configurations that enable near real-time nEMG analysis and highlight a generalisable workflow that can be used to balance data reduction with model performance in other high-frequency time-series applications as well.

[274] GFM4GA: Graph Foundation Model for Group Anomaly Detection

Jiujiu Chen, Weijun Zeng, Shaofeng Hu, Sihong Xie, Hui Xiong

Main category: cs.AI

TL;DR: GFM4GA: A graph foundation model for group anomaly detection using dual-level contrastive learning pretraining and few-shot finetuning, outperforming existing methods.

Details

Motivation: Group anomaly detection is important for network applications but challenging due to diverse anomaly patterns. Existing graph foundation models work for individual anomalies but fail for group anomalies because group anomalies must be detected as a whole, and individuals within abnormal groups can appear normal.

Method: Proposes GFM4GA with dual-level contrastive learning pretraining based on feature-based estimation and group extraction to capture group anomaly structure and feature inconsistencies. Downstream tasks use parameter-constrained and group-anomaly-proportion weighted few-shot finetuning, with adaptive ability expanded via group contexts from labeled anomaly neighbors.

Result: GFM4GA surpasses both group anomaly detectors and GFMs for individual anomalies, achieving average improvements of 2.85% in AUROC and 2.55% in AUPRC.

Conclusion: GFM4GA effectively addresses group anomaly detection challenges by leveraging graph foundation models with specialized pretraining and few-shot learning techniques, demonstrating superior performance over existing approaches.

Abstract: Group anomaly detection is crucial in many network applications, but faces challenges due to diverse anomaly patterns. Motivated by the success of large language models (LLMs) in natural language processing, graph foundation models (GFMs) is proposed to handle few-shot learning task with fewer labeling efforts. GFMs have been successfully applied to detection of individual anomalies but cannot be generalized to group anomalies, as group anomaly patterns must be detected as a whole and individuals in an abnormal group can look rather normal. Therefore, we propose GFM4GA, a novel graph foundation model for group anomaly detection. The pipeline is pretrained via dual-level contrastive learning based on feature-based estimation and group extraction, to capture potential group anomaly structure and feature inconsistencies. In the downstream tasks, the pipeline is finetuned in parameter-constrained and group-anomaly-proportion weighted few-shot settings, and its adaptive ability to unseen group anomalies expanded via group contexts determined by labeled anomaly neighbors. Experiments show that GFM4GA surpasses group anomaly detectors and GFMs for individual anomalies, achieving average improvements of 2.85% in AUROC and 2.55% in AUPRC.

[275] Topo-RAG: Topology-aware retrieval for hybrid text-table documents

Alex Dantart, Marco Kóvacs-Navarro

Main category: cs.AI

TL;DR: Topo-RAG is a framework that processes hybrid text+table documents using separate pathways for narrative and tabular data, improving retrieval performance by 18.4% over standard linearization methods.

Details

Motivation: Current RAG systems use linearization (converting tables to text strings) which is mathematically insufficient for capturing the complex geometry and spatial relationships in enterprise documents that contain both narrative text and structured tabular data.

Method: Dual architecture with separate processing pathways: traditional dense retrievers for fluid narrative text, and a Cell-Aware Late Interaction mechanism for tabular structures that preserves spatial relationships and topology.

Result: 18.4% improvement in nDCG@10 on hybrid queries compared to standard linearization approaches, evaluated on SEC-25 synthetic enterprise corpus that mimics real-world complexity.

Conclusion: Topo-RAG demonstrates that treating “everything as text” is insufficient; respecting the topology and shape of information leads to better understanding and retrieval performance for complex enterprise documents.

Abstract: In enterprise datasets, documents are rarely pure. They are not just text, nor just numbers; they are a complex amalgam of narrative and structure. Current Retrieval-Augmented Generation (RAG) systems have attempted to address this complexity with a blunt tool: linearization. We convert rich, multidimensional tables into simple Markdown-style text strings, hoping that an embedding model will capture the geometry of a spreadsheet in a single vector. But it has already been shown that this is mathematically insufficient. This work presents Topo-RAG, a framework that challenges the assumption that “everything is text”. We propose a dual architecture that respects the topology of the data: we route fluid narrative through traditional dense retrievers, while tabular structures are processed by a Cell-Aware Late Interaction mechanism, preserving their spatial relationships. Evaluated on SEC-25, a synthetic enterprise corpus that mimics real-world complexity, Topo-RAG demonstrates an 18.4% improvement in nDCG@10 on hybrid queries compared to standard linearization approaches. It’s not just about searching better; it’s about understanding the shape of information.

[276] TRIM: Hybrid Inference via Targeted Stepwise Routing in Multi-Step Reasoning Tasks

Vansh Kapoor, Aman Gupta, Hao Chen, Anurag Beniwal, Jing Huang, Aviral Kumar

Main category: cs.AI

TL;DR: TRIM is a targeted routing method for multi-step reasoning that routes only critical steps (likely to cause cascading failures) to larger models while letting smaller models handle routine steps, achieving 5x higher cost efficiency on math tasks.

Details

Motivation: Current LLM routing methods assign entire queries to one model, treating all reasoning steps as equal, but multi-step reasoning tasks are vulnerable to cascading failures where a single incorrect step leads to complete solution breakdown.

Method: TRIM operates at step-level using process reward models to identify erroneous steps and makes routing decisions based on step-level uncertainty and budget constraints. It includes routing strategies from simple threshold-based policies to more expressive policies that reason about long-horizon accuracy-cost trade-offs.

Result: On MATH-500, thresholding strategy surpasses prior routing methods with 5x higher cost efficiency, while advanced policies match strong model’s performance using 80% fewer expensive model tokens. On AIME, achieves up to 6x higher cost efficiency. Methods generalize across math reasoning tasks.

Conclusion: Targeted step-level interventions can fundamentally transform inference efficiency by confining expensive calls to precisely those steps where stronger models prevent cascading errors, demonstrating that step-level difficulty represents fundamental characteristics of reasoning.

Abstract: Multi-step reasoning tasks like mathematical problem solving are vulnerable to cascading failures, where a single incorrect step leads to complete solution breakdown. Current LLM routing methods assign entire queries to one model, treating all reasoning steps as equal. We propose TRIM (Targeted routing in multi-step reasoning tasks), which routes only critical steps$\unicode{x2013}$those likely to derail the solution$\unicode{x2013}$to larger models while letting smaller models handle routine continuations. Our key insight is that targeted step-level interventions can fundamentally transform inference efficiency by confining expensive calls to precisely those steps where stronger models prevent cascading errors. TRIM operates at the step-level: it uses process reward models to identify erroneous steps and makes routing decisions based on step-level uncertainty and budget constraints. We develop several routing strategies within TRIM, ranging from a simple threshold-based policy to more expressive policies that reason about long-horizon accuracy-cost trade-offs and uncertainty in step-level correctness estimates. On MATH-500, even the simplest thresholding strategy surpasses prior routing methods with 5x higher cost efficiency, while more advanced policies match the strong, expensive model’s performance using 80% fewer expensive model tokens. On harder benchmarks such as AIME, TRIM achieves up to 6x higher cost efficiency. All methods generalize effectively across math reasoning tasks, demonstrating that step-level difficulty represents fundamental characteristics of reasoning.

[277] NoReGeo: Non-Reasoning Geometry Benchmark

Irina Abdullaeva, Anton Vasiliuk, Elizaveta Goncharova, Temurbek Rahmatullaev, Zagorulko Ivan, Maxim Kurkin, Andrey Kuznetsov

Main category: cs.AI

TL;DR: NoReGeo is a new benchmark that tests LLMs’ intrinsic geometric understanding without requiring reasoning or algebra, revealing that even top models like GPT-4 only achieve 65% accuracy on simple geometric problems.

Details

Motivation: Existing geometry benchmarks focus on reasoning-based solutions using algebraic methods, but don't evaluate whether LLMs can inherently encode spatial relationships and recognize geometric properties directly. The authors want to assess LLMs' native geometric understanding separate from their reasoning capabilities.

Method: Created NoReGeo benchmark with 2,500 trivial geometric problems across 25 categories, designed to be solvable purely through native geometric understanding (assuming known object locations). Tested state-of-the-art models including GPT-4, and conducted ablation experiments to examine whether geometric understanding emerges through fine-tuning.

Result: Even the most advanced LLMs (including frontier models like GPT-4) achieve maximum 65% accuracy on binary classification tasks. Ablation experiments show geometric understanding doesn’t emerge through fine-tuning alone, indicating specialized training approaches are needed from the outset.

Conclusion: There’s a significant gap in current LLMs’ ability to natively grasp geometric concepts. The NoReGeo benchmark provides a foundation for future research toward developing models with true geometric cognition, requiring specialized training approaches rather than just fine-tuning.

Abstract: We present NoReGeo, a novel benchmark designed to evaluate the intrinsic geometric understanding of large language models (LLMs) without relying on reasoning or algebraic computation. Unlike existing benchmarks that primarily assess models’ proficiency in reasoning-based geometry-where solutions are derived using algebraic methods-NoReGeo focuses on evaluating whether LLMs can inherently encode spatial relationships and recognize geometric properties directly. Our benchmark comprises 2,500 trivial geometric problems spanning 25 categories, each carefully crafted to be solvable purely through native geometric understanding, assuming known object locations. We assess a range of state-of-the-art models on NoReGeo, including frontier models like GPT-4, observing that even the most advanced systems achieve an overall maximum of 65% accuracy in binary classification tasks. Further, our ablation experiments demonstrate that such geometric understanding does not emerge through fine-tuning alone, indicating that effective training for geometric comprehension requires a specialized approach from the outset. Our findings highlight a significant gap in current LLMs’ ability to natively grasp geometric concepts, providing a foundation for future research toward models with true geometric cognition.

[278] Evidence-Augmented Policy Optimization with Reward Co-Evolution for Long-Context Reasoning

Xin Guan, Zijian Li, Shen Huang, Pengjun Xie, Jingren Zhou, Jiuxin Cao

Main category: cs.AI

TL;DR: EAPO introduces Evidence-Augmented Policy Optimization with dense process supervision to improve long-context reasoning by focusing on evidence quality rather than just outcome rewards.

Details

Motivation: Current RL approaches for LLM reasoning in long-context scenarios suffer from sparse outcome rewards, which fail to penalize ungrounded "lucky guesses" and leave evidence retrieval largely unsupervised.

Method: EAPO establishes Evidence-Augmented Reasoning paradigm, validates evidence extraction as the bottleneck, then introduces specialized RL with Group-Relative Evidence Reward for dense process supervision, plus Adaptive Reward-Policy Co-Evolution to refine the reward model.

Result: Comprehensive evaluations across eight benchmarks show EAPO significantly enhances long-context reasoning performance compared to state-of-the-art baselines.

Conclusion: EAPO successfully addresses the sparse reward problem in long-context reasoning by providing dense process supervision focused on evidence quality, leading to improved reasoning performance.

Abstract: While Reinforcement Learning (RL) has advanced LLM reasoning, applying it to long-context scenarios is hindered by sparsity of outcome rewards. This limitation fails to penalize ungrounded “lucky guesses,” leaving the critical process of needle-in-a-haystack evidence retrieval largely unsupervised. To address this, we propose EAPO (Evidence-Augmented Policy Optimization). We first establish the Evidence-Augmented Reasoning paradigm, validating via Tree-Structured Evidence Sampling that precise evidence extraction is the decisive bottleneck for long-context reasoning. Guided by this insight, EAPO introduces a specialized RL algorithm where a reward model computes a Group-Relative Evidence Reward, providing dense process supervision to explicitly improve evidence quality. To sustain accurate supervision throughout training, we further incorporate an Adaptive Reward-Policy Co-Evolution mechanism. This mechanism iteratively refines the reward model using outcome-consistent rollouts, sharpening its discriminative capability to ensure precise process guidance. Comprehensive evaluations across eight benchmarks demonstrate that EAPO significantly enhances long-context reasoning performance compared to SOTA baselines.

[279] C-GRASP: Clinically-Grounded Reasoning for Affective Signal Processing

Cheng Lin Cheng, Ting Chuan Lin, Chai Kai Chang

Main category: cs.AI

TL;DR: C-GRASP is a guardrailed RAG-enhanced pipeline that addresses physiological hallucinations in LLM-based HRV interpretation, improving emotion classification accuracy and clinical reasoning consistency.

Details

Motivation: Current LLM applications to HRV interpretation suffer from physiological hallucinations including RSA contamination, short-data instability in nonlinear metrics, and over-reliance on population norms rather than individualized baselines.

Method: Proposes C-GRASP (Clinically-Grounded Reasoning for Affective Signal Processing) with eight traceable reasoning steps, Z-score Priority Hierarchy prioritizing individualized baseline shifts, and automated RSA-aware guardrails to prevent spectral contamination.

Result: Achieved 37.3% accuracy in 4-class emotion classification on DREAMER dataset (414 trials) and 69.6% Clinical Reasoning Consistency score, with ablation studies confirming the critical role of individualized Delta Z-score module.

Conclusion: C-GRASP transitions affective computing from black-box classification to transparent, evidence-based clinical decision support, enabling safer AI integration in biomedical engineering by preventing population bias common in native LLMs.

Abstract: Heart rate variability (HRV) is a pivotal noninvasive marker for autonomic monitoring; however, applying Large Language Models (LLMs) to HRV interpretation is hindered by physiological hallucinations. These include respiratory sinus arrhythmia (RSA) contamination, short-data instability in nonlinear metrics, and the neglect of individualized baselines in favor of population norms. We propose C-GRASP (Clinically-Grounded Reasoning for Affective Signal Processing), a guardrailed RAG-enhanced pipeline that decomposes HRV interpretation into eight traceable reasoning steps. Central to C-GRASP is a Z-score Priority Hierarchy that enforces the weighting of individualized baseline shifts over normative statistics. The system effectively mitigates spectral hallucinations through automated RSA-aware guardrails, preventing contamination of frequency-domain indices. Evaluated on 414 trials from the DREAMER dataset, C-GRASP integrated with high-scale reasoning models (e.g., MedGemma3-thinking) achieved superior performance in 4-class emotion classification (37.3% accuracy) and a Clinical Reasoning Consistency (CRC) score of 69.6%. Ablation studies confirm that the individualized Delta Z-score module serves as the critical logical anchor, preventing the “population bias” common in native LLMs. Ultimately, C-GRASP transitions affective computing from black-box classification to transparent, evidence-based clinical decision support, paving the way for safer AI integration in biomedical engineering.

[280] LatentRefusal: Latent-Signal Refusal for Unanswerable Text-to-SQL Queries

Xuancheng Ren, Shijing Hu, Zhihui Lu, Jiangqi Huang, Qiang Duan

Main category: cs.AI

TL;DR: LatentRefusal: A latent-signal refusal mechanism for text-to-SQL systems that predicts query answerability from intermediate LLM activations to safely refuse unanswerable/underspecified queries.

Details

Motivation: Unanswerable and underspecified queries in text-to-SQL systems can generate executable programs that yield misleading results or violate safety constraints, posing a major barrier to safe deployment. Existing refusal strategies are either brittle (output-level instruction following) or add complexity/overhead (uncertainty estimation).

Method: Formalizes safe refusal as an answerability-gating problem. Proposes LatentRefusal, a latent-signal refusal mechanism that predicts query answerability from intermediate hidden activations of LLMs. Uses Tri-Residual Gated Encoder, a lightweight probing architecture, to suppress schema noise and amplify sparse, localized cues of question-schema mismatch.

Result: Extensive evaluations across diverse ambiguous/unanswerable settings show effectiveness. LatentRefusal improves average F1 to 88.5% on both backbones across four benchmarks while adding only ~2ms probe overhead. Provides attachable, efficient safety layer for text-to-SQL systems.

Conclusion: LatentRefusal offers a practical solution for safe refusal in text-to-SQL systems by leveraging latent signals from LLM activations, addressing the limitations of existing approaches while maintaining efficiency and effectiveness.

Abstract: In LLM-based text-to-SQL systems, unanswerable and underspecified user queries may generate not only incorrect text but also executable programs that yield misleading results or violate safety constraints, posing a major barrier to safe deployment. Existing refusal strategies for such queries either rely on output-level instruction following, which is brittle due to model hallucinations, or estimate output uncertainty, which adds complexity and overhead. To address this challenge, we formalize safe refusal in text-to-SQL systems as an answerability-gating problem and propose LatentRefusal, a latent-signal refusal mechanism that predicts query answerability from intermediate hidden activations of a large language model. We introduce the Tri-Residual Gated Encoder, a lightweight probing architecture, to suppress schema noise and amplify sparse, localized cues of question-schema mismatch that indicate unanswerability. Extensive empirical evaluations across diverse ambiguous and unanswerable settings, together with ablation studies and interpretability analyses, demonstrate the effectiveness of the proposed approach and show that LatentRefusal provides an attachable and efficient safety layer for text-to-SQL systems. Across four benchmarks, LatentRefusal improves average F1 to 88.5 percent on both backbones while adding approximately 2 milliseconds of probe overhead.

[281] Toward Ultra-Long-Horizon Agentic Science: Cognitive Accumulation for Machine Learning Engineering

Xinyu Zhu, Yuzhu Cai, Zexi Liu, Bingyang Zheng, Cheng Wang, Rui Ye, Jiaao Chen, Hanrui Wang, Wei-Chen Wang, Yuzhi Zhang, Linfeng Zhang, Weinan E, Di Jin, Siheng Chen

Main category: cs.AI

TL;DR: ML-Master 2.0 introduces Hierarchical Cognitive Caching to enable AI agents to handle ultra-long-horizon machine learning engineering tasks spanning days/weeks, achieving 56.44% medal rate on MLE-Bench.

Details

Motivation: Current AI agents struggle with ultra-long-horizon autonomy in scientific discovery, as LLMs get overwhelmed by execution details and fail to consolidate sparse feedback into coherent long-term guidance for experimental cycles spanning days or weeks.

Method: Hierarchical Cognitive Caching (HCC) - a multi-tiered architecture inspired by computer systems that reframes context management as cognitive accumulation. It dynamically distills transient execution traces into stable knowledge and cross-task wisdom, decoupling immediate execution from long-term strategy.

Result: ML-Master 2.0 achieves state-of-the-art 56.44% medal rate on OpenAI’s MLE-Bench under 24-hour budgets, demonstrating superior performance in ultra-long-horizon machine learning engineering tasks.

Conclusion: Ultra-long-horizon autonomy with cognitive accumulation provides a scalable blueprint for AI capable of autonomous exploration beyond human-precedent complexities in scientific discovery.

Abstract: The advancement of artificial intelligence toward agentic science is currently bottlenecked by the challenge of ultra-long-horizon autonomy, the ability to sustain strategic coherence and iterative correction over experimental cycles spanning days or weeks. While Large Language Models (LLMs) have demonstrated prowess in short-horizon reasoning, they are easily overwhelmed by execution details in the high-dimensional, delayed-feedback environments of real-world research, failing to consolidate sparse feedback into coherent long-term guidance. Here, we present ML-Master 2.0, an autonomous agent that masters ultra-long-horizon machine learning engineering (MLE) which is a representative microcosm of scientific discovery. By reframing context management as a process of cognitive accumulation, our approach introduces Hierarchical Cognitive Caching (HCC), a multi-tiered architecture inspired by computer systems that enables the structural differentiation of experience over time. By dynamically distilling transient execution traces into stable knowledge and cross-task wisdom, HCC allows agents to decouple immediate execution from long-term experimental strategy, effectively overcoming the scaling limits of static context windows. In evaluations on OpenAI’s MLE-Bench under 24-hour budgets, ML-Master 2.0 achieves a state-of-the-art medal rate of 56.44%. Our findings demonstrate that ultra-long-horizon autonomy provides a scalable blueprint for AI capable of autonomous exploration beyond human-precedent complexities.

[282] ErrEval: Error-Aware Evaluation for Question Generation through Explicit Diagnostics

Weiping Fu, Bifan Wei, Jingyi Hao, Yushun Zhang, Jian Zhang, Jiaxin Wang, Bo Li, Yu He, Lingling Zhang, Jun Liu

Main category: cs.AI

TL;DR: ErrEval is an error-aware evaluation framework for automatic question generation that improves assessment by explicitly diagnosing errors before scoring, addressing issues like factual hallucinations and answer mismatches that current methods overlook.

Details

Motivation: Current QG evaluation methods (including LLM-based evaluators) use black-box holistic approaches without explicit error modeling, causing them to miss critical defects like factual hallucinations and answer mismatches, leading to overestimation of question quality.

Method: ErrEval reformulates evaluation as a two-stage process: 1) Error diagnosis using a lightweight plug-and-play Error Identifier that detects/categorizes errors across structural, linguistic, and content aspects; 2) Informed scoring where diagnostic signals guide LLM evaluators for more fine-grained, grounded judgments.

Result: Extensive experiments on three benchmarks show ErrEval improves alignment with human judgments. Further analyses confirm it effectively mitigates overestimation of low-quality questions.

Conclusion: Explicit error diagnostics enhance QG evaluation by addressing critical defects that current methods overlook, leading to more accurate assessment of question quality and better alignment with human judgments.

Abstract: Automatic Question Generation (QG) often produces outputs with critical defects, such as factual hallucinations and answer mismatches. However, existing evaluation methods, including LLM-based evaluators, mainly adopt a black-box and holistic paradigm without explicit error modeling, leading to the neglect of such defects and overestimation of question quality. To address this issue, we propose ErrEval, a flexible and Error-aware Evaluation framework that enhances QG evaluation through explicit error diagnostics. Specifically, ErrEval reformulates evaluation as a two-stage process of error diagnosis followed by informed scoring. At the first stage, a lightweight plug-and-play Error Identifier detects and categorizes common errors across structural, linguistic, and content-related aspects. These diagnostic signals are then incorporated as explicit evidence to guide LLM evaluators toward more fine-grained and grounded judgments. Extensive experiments on three benchmarks demonstrate the effectiveness of ErrEval, showing that incorporating explicit diagnostics improves alignment with human judgments. Further analyses confirm that ErrEval effectively mitigates the overestimation of low-quality questions.

[283] LADFA: A Framework of Using Large Language Models and Retrieval-Augmented Generation for Personal Data Flow Analysis in Privacy Policies

Haiyue Yuan, Nikolay Matyunin, Ali Raza, Shujun Li

Main category: cs.AI

TL;DR: LADFA is an end-to-end framework that uses LLMs with RAG and custom knowledge bases to extract personal data flows from privacy policies and construct data flow graphs for analysis.

Details

Motivation: Privacy policies are difficult for people to comprehend due to complex legal language and inconsistent practices across organizations, requiring automated large-scale analysis tools.

Method: Combines LLMs with retrieval-augmented generation (RAG) and custom knowledge bases; framework includes pre-processor, LLM-based processor, and data flow post-processor to extract personal data flows and construct flow graphs.

Result: Validated effectiveness and accuracy through case study examining ten privacy policies from automotive industry; framework demonstrated capability to extract data flows and construct analysis graphs.

Conclusion: LADFA provides flexible, customizable framework for automated privacy policy analysis and can be extended to other text-based analysis tasks beyond privacy policies.

Abstract: Privacy policies help inform people about organisations’ personal data processing practices, covering different aspects such as data collection, data storage, and sharing of personal data with third parties. Privacy policies are often difficult for people to fully comprehend due to the lengthy and complex legal language used and inconsistent practices across different sectors and organisations. To help conduct automated and large-scale analyses of privacy policies, many researchers have studied applications of machine learning and natural language processing techniques, including large language models (LLMs). While a limited number of prior studies utilised LLMs for extracting personal data flows from privacy policies, our approach builds on this line of work by combining LLMs with retrieval-augmented generation (RAG) and a customised knowledge base derived from existing studies. This paper presents the development of LADFA, an end-to-end computational framework, which can process unstructured text in a given privacy policy, extract personal data flows and construct a personal data flow graph, and conduct analysis of the data flow graph to facilitate insight discovery. The framework consists of a pre-processor, an LLM-based processor, and a data flow post-processor. We demonstrated and validated the effectiveness and accuracy of the proposed approach by conducting a case study that involved examining ten selected privacy policies from the automotive industry. Moreover, it is worth noting that LADFA is designed to be flexible and customisable, making it suitable for a range of text-based analysis tasks beyond privacy policy analysis.

[284] LLMdoctor: Token-Level Flow-Guided Preference Optimization for Efficient Test-Time Alignment of Large Language Models

Tiesunlong Shen, Rui Mao, Jin Wang, Heming Sun, Jian Zhang, Xuejie Zhang, Erik Cambria

Main category: cs.AI

TL;DR: LLMdoctor is a novel test-time alignment framework using a patient-doctor paradigm with token-level reward acquisition and flow-guided preference optimization to align LLMs efficiently without fine-tuning.

Details

Motivation: Traditional fine-tuning methods for aligning LLMs with human preferences are computationally expensive and inflexible. Existing test-time alignment approaches have limitations: they rely on distorted trajectory-level signals or inefficient sampling, which caps performance and fails to preserve the base model's generative diversity.

Method: LLMdoctor uses a patient-doctor paradigm where a large frozen patient LLM is steered by a smaller specialized doctor model. The framework integrates token-level reward acquisition with token-level flow-guided preference optimization (TFPO). It first extracts fine-grained token-level preference signals from the patient model’s behavioral variations, then uses TFPO to train the doctor model, establishing flow consistency across all subtrajectories for precise token-by-token alignment.

Result: Extensive experiments show that LLMdoctor significantly outperforms existing test-time alignment methods and even surpasses the performance of full fine-tuning approaches like DPO.

Conclusion: LLMdoctor provides an efficient and effective alternative to traditional fine-tuning for LLM alignment, offering better performance while preserving generative diversity through its novel token-level approach.

Abstract: Aligning Large Language Models (LLMs) with human preferences is critical, yet traditional fine-tuning methods are computationally expensive and inflexible. While test-time alignment offers a promising alternative, existing approaches often rely on distorted trajectory-level signals or inefficient sampling, fundamentally capping performance and failing to preserve the generative diversity of the base model. This paper introduces LLMdoctor, a novel framework for efficient test-time alignment that operates via a patient-doctor paradigm. It integrates token-level reward acquisition with token-level flow-guided preference optimization (TFPO) to steer a large, frozen patient LLM with a smaller, specialized doctor model. Unlike conventional methods that rely on trajectory-level rewards, LLMdoctor first extracts fine-grained, token-level preference signals from the patient model’s behavioral variations. These signals then guide the training of the doctor model via TFPO, which establishes flow consistency across all subtrajectories, enabling precise token-by-token alignment while inherently preserving generation diversity. Extensive experiments demonstrate that LLMdoctor significantly outperforms existing test-time alignment methods and even surpasses the performance of full fine-tuning approaches like DPO.

[285] NSR-Boost: A Neuro-Symbolic Residual Boosting Framework for Industrial Legacy Models

Ziming Dai, Dabiao Ma, Jinle Tong, Mengyuan Han, Jian Yang, Haojun Fei

Main category: cs.AI

TL;DR: NSR-Boost is a neuro-symbolic residual boosting framework that enables safe, non-intrusive upgrades to legacy GBDT models in production by targeting hard regions with LLM-generated symbolic experts and Bayesian optimization.

Details

Motivation: Upgrading legacy GBDT models in high-concurrency production environments faces prohibitive retraining costs and systemic risks, creating a need for safe, low-cost evolutionary approaches.

Method: Three-stage framework: 1) Identify hard regions through residuals, 2) Generate interpretable experts using LLM to create symbolic code structures and fine-tune with Bayesian optimization, 3) Dynamically integrate experts with legacy model via lightweight aggregator.

Result: Successfully deployed in Qfin Holdings’ core financial risk control system, outperforms SOTA baselines across six public and one private dataset, shows excellent performance gains on real-world online data.

Conclusion: Effectively captures long-tail risks missed by traditional models and offers a safe, low-cost evolutionary paradigm for industrial applications.

Abstract: Although the Gradient Boosted Decision Trees (GBDTs) dominate industrial tabular applications, upgrading legacy models in high-concurrency production environments still faces prohibitive retraining costs and systemic risks. To address this problem, we present NSR-Boost, a neuro-symbolic residual boosting framework designed specifically for industrial scenarios. Its core advantage lies in being “non-intrusive”. It treats the legacy model as a frozen model and performs targeted repairs on “hard regions” where predictions fail. The framework comprises three key stages: first, finding hard regions through residuals, then generating interpretable experts by generating symbolic code structures using Large Language Model (LLM) and fine-tuning parameters using Bayesian optimization, and finally dynamically integrating experts with legacy model output through a lightweight aggregator. We report on the successful deployment of NSR-Boost within the core financial risk control system at Qfin Holdings. This framework not only significantly outperforms state-of-the-art (SOTA) baselines across six public datasets and one private dataset, more importantly, shows excellent performance gains on real-world online data. In conclusion, it effectively captures long-tail risks missed by traditional models and offers a safe, low-cost evolutionary paradigm for industry.

[286] ChartComplete: A Taxonomy-based Inclusive Chart Dataset

Ahmad Mustapha, Charbel Toumieh, Mariette Awad

Main category: cs.AI

TL;DR: The paper introduces ChartComplete, a new dataset covering 30 different chart types to address limitations in existing chart understanding benchmarks that only include a small set of chart types.

Details

Motivation: Existing chart understanding datasets for evaluating multimodal large language models (MLLMs) are limited to only a small set of chart types, creating a gap in comprehensive evaluation of chart understanding capabilities.

Method: Proposes ChartComplete dataset based on a chart taxonomy from the visualization community, covering thirty different chart types. The dataset consists of classified chart images without learning signals.

Result: ChartComplete dataset is created and presented to the community as a resource for building more comprehensive chart understanding benchmarks.

Conclusion: ChartComplete addresses the limitation of existing chart understanding datasets by providing broader coverage of chart types, enabling more comprehensive evaluation of MLLMs’ chart understanding capabilities.

Abstract: With advancements in deep learning (DL) and computer vision techniques, the field of chart understanding is evolving rapidly. In particular, multimodal large language models (MLLMs) are proving to be efficient and accurate in understanding charts. To accurately measure the performance of MLLMs, the research community has developed multiple datasets to serve as benchmarks. By examining these datasets, we found that they are all limited to a small set of chart types. To bridge this gap, we propose the ChartComplete dataset. The dataset is based on a chart taxonomy borrowed from the visualization community, and it covers thirty different chart types. The dataset is a collection of classified chart images and does not include a learning signal. We present the ChartComplete dataset as is to the community to build upon it.

[287] Panning for Gold: Expanding Domain-Specific Knowledge Graphs with General Knowledge

Runhao Zhao, Weixin Zeng, Wentao Zhang, Chong Chen, Zhengpin Li, Xiang Zhao, Lei Chen

Main category: cs.AI

TL;DR: DKGF task enriches domain-specific KGs by fusing relevant facts from general KGs using Fact-as-Program paradigm to handle relevance ambiguity and granularity misalignment.

Details

Motivation: Domain-specific knowledge graphs (DKGs) have limited coverage compared to general knowledge graphs (GKGs), creating a need to enrich DKGs by integrating relevant facts from GKGs while addressing domain relevance ambiguity and knowledge granularity misalignment.

Method: ExeFuse framework treats each GKG fact as a latent semantic program, maps abstract relations to granularity-aware operators, and verifies domain relevance through program executability on the target DKG using a unified probabilistic framework.

Result: Created two benchmarks (DKGF(W-I) and DKGF(Y-I)) with 21 evaluation configurations. Extensive experiments validate the task’s importance and model effectiveness, establishing the first standardized testbed for DKGF.

Conclusion: DKGF is a novel and important task for enriching domain-specific knowledge graphs, and the proposed ExeFuse framework effectively addresses the core challenges of relevance ambiguity and granularity misalignment, providing a foundational testbed for future research.

Abstract: Domain-specific knowledge graphs (DKGs) often lack coverage compared to general knowledge graphs (GKGs). To address this, we introduce Domain-specific Knowledge Graph Fusion (DKGF), a novel task that enriches DKGs by integrating relevant facts from GKGs. DKGF faces two key challenges: high ambiguity in domain relevance and misalignment in knowledge granularity across graphs. We propose ExeFuse, a simple yet effective Fact-as-Program paradigm. It treats each GKG fact as a latent semantic program, maps abstract relations to granularity-aware operators, and verifies domain relevance via program executability on the target DKG. This unified probabilistic framework jointly resolves relevance and granularity issues. We construct two benchmarks, DKGF(W-I) and DKGF(Y-I), with 21 evaluation configurations. Extensive experiments validate the task’s importance and our model’s effectiveness, providing the first standardized testbed for DKGF.

[288] Breaking Up with Normatively Monolithic Agency with GRACE: A Reason-Based Neuro-Symbolic Architecture for Safe and Ethical AI Alignment

Felix Jahn, Yannic Muskalla, Lisa Dargasz, Patrick Schramowski, Kevin Baum

Main category: cs.AI

TL;DR: GRACE is a neuro-symbolic architecture that separates normative reasoning from instrumental decision-making to ensure AI agents act in morally permissible ways while maintaining effectiveness.

Details

Motivation: As AI agents become more autonomous and impactful in real-world contexts, there's a critical need to ensure their decisions are not only effective but also normatively aligned with ethical principles.

Method: GRACE uses a three-module architecture: Moral Module (deontic logic reasoning for permissible macro actions), Decision-Making Module (encapsulates target agent for optimal primitive actions), and Guard (monitors and enforces moral compliance). It employs reason-based formalism for interpretability and symbolic representation for formal verification.

Result: The architecture enables stakeholders to understand, contest, and refine agent behavior, demonstrated through an LLM therapy assistant example. It provides formal verification and statistical guarantees of alignment.

Conclusion: GRACE offers a scalable solution for ensuring AI alignment by decoupling normative reasoning from instrumental decision-making, making AI behavior interpretable, contestable, and justifiable while maintaining effectiveness.

Abstract: As AI agents become increasingly autonomous, widely deployed in consequential contexts, and efficacious in bringing about real-world impacts, ensuring that their decisions are not only instrumentally effective but also normatively aligned has become critical. We introduce a neuro-symbolic reason-based containment architecture, Governor for Reason-Aligned ContainmEnt (GRACE), that decouples normative reasoning from instrumental decision-making and can contain AI agents of virtually any design. GRACE restructures decision-making into three modules: a Moral Module (MM) that determines permissible macro actions via deontic logic-based reasoning; a Decision-Making Module (DMM) that encapsulates the target agent while selecting instrumentally optimal primitive actions in accordance with derived macro actions; and a Guard that monitors and enforces moral compliance. The MM uses a reason-based formalism providing a semantic foundation for deontic logic, enabling interpretability, contestability, and justifiability. Its symbolic representation enriches the DMM’s informational context and supports formal verification and statistical guarantees of alignment enforced by the Guard. We demonstrate GRACE on an example of a LLM therapy assistant, showing how it enables stakeholders to understand, contest, and refine agent behavior.

[289] Diagnosing Generalization Failures in Fine-Tuned LLMs: A Cross-Architectural Study on Phishing Detection

Frank Bobe, Gregory D. Vetaw, Chase Pavlick, Darshan Bryner, Matthew Cook, Jose Salas-Vernis

Main category: cs.AI

TL;DR: Fine-tuning LLMs for phishing detection shows generalization failures are architecture-dependent; Gemma 2 excels with diverse data, Llama 3.1 fails to integrate diverse data, and Mistral is consistently resilient across training paradigms.

Details

Motivation: Despite fine-tuning LLMs achieving SOTA performance on specialized tasks, diagnosing why models become brittle and fail to generalize remains a critical open problem that needs systematic investigation.

Method: Multi-layered diagnostic framework applied to cross-architectural study: fine-tuned Llama 3.1 8B, Gemma 2 9B, and Mistral models on phishing detection; used SHAP analysis and mechanistic interpretability to uncover root causes of generalization failures.

Result: Three key findings: (1) Generalization driven by synergy between architecture and data diversity (Gemma 2 achieves >91% F1 only with diverse “generalist” dataset); (2) Generalization highly architecture-dependent (Llama 3.1 fails to integrate diverse data); (3) Some architectures inherently more generalizable (Mistral consistently resilient across training paradigms).

Conclusion: Reliable AI requires deep validation of interplay between architecture, data, and training strategy; work provides concrete methodology for diagnosing generalization failures by pinpointing flawed heuristics responsible for model brittleness.

Abstract: The practice of fine-tuning Large Language Models (LLMs) has achieved state-of-the-art performance on specialized tasks, yet diagnosing why these models become brittle and fail to generalize remains a critical open problem. To address this, we introduce and apply a multi-layered diagnostic framework to a cross-architectural study. We fine-tune Llama 3.1 8B, Gemma 2 9B, and Mistral models on a high-stakes phishing detection task and use SHAP analysis and mechanistic interpretability to uncover the root causes of their generalization failures. Our investigation reveals three critical findings: (1) Generalization is driven by a powerful synergy between architecture and data diversity. The Gemma 2 9B model achieves state-of-the-art performance (>91% F1), but only when trained on a stylistically diverse ``generalist’’ dataset. (2) Generalization is highly architecture-dependent. We diagnose a specific failure mode in Llama 3.1 8B, which performs well on a narrow domain but cannot integrate diverse data, leading to a significant performance drop. (3) Some architectures are inherently more generalizable. The Mistral model proves to be a consistent and resilient performer across multiple training paradigms. By pinpointing the flawed heuristics responsible for these failures, our work provides a concrete methodology for diagnosing and understanding generalization failures, underscoring that reliable AI requires deep validation of the interplay between architecture, data, and training strategy.

[290] A Safety Report on GPT-5.2, Gemini 3 Pro, Qwen3-VL, Doubao 1.8, Grok 4.1 Fast, Nano Banana Pro, and Seedream 4.5

Xingjun Ma, Yixu Wang, Hengyuan Xu, Yutao Wu, Yifan Ding, Yunhan Zhao, Zilong Wang, Jiabin Hua, Ming Wen, Jianan Liu, Ranjie Duan, Yifeng Gao, Yingshui Tan, Yunhao Chen, Hui Xue, Xin Wang, Wei Cheng, Jingjing Chen, Zuxuan Wu, Bo Li, Yu-Gang Jiang

Main category: cs.AI

TL;DR: Integrated safety evaluation of 7 frontier AI models reveals heterogeneous safety landscape with trade-offs between benchmark performance, adversarial robustness, multilingual generalization, and regulatory compliance.

Details

Motivation: Despite rapid advances in LLMs and MLLMs, it's unclear whether safety has improved proportionally due to fragmented evaluation practices limited to single modalities or threat models.

Method: Unified evaluation protocol across language, vision-language, and image generation settings using benchmark evaluation, adversarial evaluation, multilingual evaluation, and compliance evaluation on 7 frontier models.

Result: GPT-5.2 shows consistently strong safety performance, while other models exhibit trade-offs. All models degrade substantially under adversarial evaluation despite strong benchmark results. Text-to-image models show stronger alignment in regulated categories but remain brittle.

Conclusion: Safety in frontier models is inherently multidimensional, shaped by modality, language, and evaluation scheme, highlighting the need for standardized safety evaluations to assess real-world risk and guide responsible development.

Abstract: The rapid evolution of Large Language Models (LLMs) and Multimodal Large Language Models (MLLMs) has produced substantial gains in reasoning, perception, and generative capability across language and vision. However, whether these advances yield commensurate improvements in safety remains unclear, in part due to fragmented evaluation practices limited to single modalities or threat models. In this report, we present an integrated safety evaluation of 7 frontier models: GPT-5.2, Gemini 3 Pro, Qwen3-VL, Doubao 1.8, Grok 4.1 Fast, Nano Banana Pro, and Seedream 4.5. We evaluate each model across language, vision-language, and image generation settings using a unified protocol that integrates benchmark evaluation, adversarial evaluation, multilingual evaluation, and compliance evaluation. Aggregating our evaluations into safety leaderboards and model safety profiles across multiple evaluation modes reveals a sharply heterogeneous safety landscape. While GPT-5.2 demonstrates consistently strong and balanced safety performance across evaluations, other models exhibit pronounced trade-offs among benchmark safety, adversarial alignment, multilingual generalization, and regulatory compliance. Both language and vision-language modalities show significant vulnerability under adversarial evaluation, with all models degrading substantially despite strong results on standard benchmarks. Text-to-image models achieve relatively stronger alignment in regulated visual risk categories, yet remain brittle under adversarial or semantically ambiguous prompts. Overall, these results show that safety in frontier models is inherently multidimensional–shaped by modality, language, and evaluation scheme, underscoring the need for standardized safety evaluations to accurately assess real-world risk and guide responsible model development and deployment.

[291] Defending Large Language Models Against Jailbreak Attacks via In-Decoding Safety-Awareness Probing

Yinzhi Zhao, Ming Wang, Shi Feng, Xiaocui Yang, Daling Wang, Yifei Zhang

Main category: cs.AI

TL;DR: The paper proposes SafeProbing, a decoding-time defense method that surfaces latent safety signals in LLMs to detect jailbreak attacks early during generation, improving safety while maintaining utility.

Details

Motivation: Despite safety alignment, LLMs remain vulnerable to jailbreak attacks, and existing defenses struggle against sophisticated attacks, often compromising detection robustness or degrading model utility too much.

Method: The approach examines LLM decoding and leverages the observation that even jailbroken models exhibit latent safety signals during generation. It explicitly surfaces these signals for early detection of unsafe content during decoding.

Result: Experiments across diverse jailbreak attacks show the approach significantly enhances safety while maintaining low over-refusal rates on benign inputs and preserving response quality.

Conclusion: Activating intrinsic safety-awareness during decoding offers a promising complementary direction for defending against jailbreak attacks.

Abstract: Large language models (LLMs) have achieved impressive performance across natural language tasks and are increasingly deployed in real-world applications. Despite extensive safety alignment efforts, recent studies show that such alignment is often shallow and remains vulnerable to jailbreak attacks. Existing defense mechanisms, including decoding-based constraints and post-hoc content detectors, struggle against sophisticated jailbreaks, often intervening robust detection or excessively degrading model utility. In this work, we examine the decoding process of LLMs and make a key observation: even when successfully jailbroken, models internally exhibit latent safety-related signals during generation. However, these signals are overridden by the model’s drive for fluent continuation, preventing timely self-correction or refusal. Building on this observation, we propose a simple yet effective approach that explicitly surfaces and leverages these latent safety signals for early detection of unsafe content during decoding. Experiments across diverse jailbreak attacks demonstrate that our approach significantly enhances safety, while maintaining low over-refusal rates on benign inputs and preserving response quality. Our results suggest that activating intrinsic safety-awareness during decoding offers a promising and complementary direction for defending against jailbreak attacks. Code is available at: https://github.com/zyz13590/SafeProbing.

[292] From Single to Multi-Agent Reasoning: Advancing GeneGPT for Genomics QA

Kimia Abedini, Farzad Shami, Gianmaria Silvello

Main category: cs.AI

TL;DR: GenomAgent is a multi-agent framework that outperforms GeneGPT by 12% on genomics QA tasks by coordinating specialized agents instead of relying on rigid API calls.

Details

Motivation: Extracting genomic information from complex distributed databases is challenging. While LLMs offer potential for genomic QA, they face limitations due to restricted access to domain-specific databases. GeneGPT improves on this but has rigid API dependencies and limited adaptability.

Method: Proposed GenomAgent, a multi-agent framework that efficiently coordinates specialized agents for complex genomics queries. The approach involves replicating GeneGPT and developing a flexible architecture that can adapt to various scientific domains.

Result: Evaluated on nine tasks from the GeneTuring benchmark, GenomAgent outperforms GeneGPT by 12% on average. The framework demonstrates flexibility that extends beyond genomics to various scientific domains needing expert knowledge extraction.

Conclusion: GenomAgent provides a more effective and adaptable solution for genomic QA compared to existing state-of-the-art methods, with potential applications across multiple scientific domains requiring expert knowledge extraction.

Abstract: Comprehending genomic information is essential for biomedical research, yet extracting data from complex distributed databases remains challenging. Large language models (LLMs) offer potential for genomic Question Answering (QA) but face limitations due to restricted access to domain-specific databases. GeneGPT is the current state-of-the-art system that enhances LLMs by utilizing specialized API calls, though it is constrained by rigid API dependencies and limited adaptability. We replicate GeneGPT and propose GenomAgent, a multi-agent framework that efficiently coordinates specialized agents for complex genomics queries. Evaluated on nine tasks from the GeneTuring benchmark, GenomAgent outperforms GeneGPT by 12% on average, and its flexible architecture extends beyond genomics to various scientific domains needing expert knowledge extraction.

[293] Multi-Property Synthesis

Christoph Weinhuber, Yannik Schnitzer, Alessandro Abate, David Parker, Giuseppe De Giacomo, Moshe Y. Vardi

Main category: cs.AI

TL;DR: Symbolic algorithm for LTLf synthesis with multiple properties computes maximal realizable goal sets in one fixed-point computation instead of enumerating subsets.

Details

Motivation: When dealing with LTLf synthesis with multiple properties, satisfying all properties may be impossible, and enumerating all subsets of properties is computationally expensive.

Method: Develop a fully symbolic algorithm that introduces Boolean goal variables and exploits monotonicity to represent exponentially many goal combinations compactly, computing the relation between product-game states and realizable goal sets in one fixed-point computation.

Result: The approach substantially outperforms enumeration-based baselines with speedups of up to two orders of magnitude.

Conclusion: The symbolic approach efficiently solves LTLf synthesis with multiple properties by computing maximal realizable goal sets without enumerating subsets, achieving significant performance improvements.

Abstract: We study LTLf synthesis with multiple properties, where satisfying all properties may be impossible. Instead of enumerating subsets of properties, we compute in one fixed-point computation the relation between product-game states and the goal sets that are realizable from them, and we synthesize strategies achieving maximal realizable sets. We develop a fully symbolic algorithm that introduces Boolean goal variables and exploits monotonicity to represent exponentially many goal combinations compactly. Our approach substantially outperforms enumeration-based baselines, with speedups of up to two orders of magnitude.

[294] Are Your Reasoning Models Reasoning or Guessing? A Mechanistic Analysis of Hierarchical Reasoning Models

Zirui Ren, Ziming Liu

Main category: cs.AI

TL;DR: HRM shows surprising failure modes on simple puzzles, exhibits grokking dynamics, and gets trapped in incorrect fixed points, suggesting it’s “guessing” rather than reasoning. Three scaling strategies boost Sudoku-Extreme accuracy from 54.5% to 96.9%.

Details

Motivation: To understand the strengths and failure modes of Hierarchical Reasoning Model (HRM) which achieves extraordinary performance on reasoning tasks but has unexplained weaknesses.

Method: Mechanistic study of HRM’s reasoning patterns revealing three surprising facts: failure on simple puzzles due to fixed point property violation, grokking dynamics, and existence of multiple fixed points. Proposed three scaling strategies: data augmentation, input perturbation, and model bootstrapping.

Result: HRM appears to be “guessing” rather than “reasoning.” Combined methods create Augmented HRM, boosting Sudoku-Extreme accuracy from 54.5% to 96.9%.

Conclusion: The analysis provides new insights into how reasoning models work, showing they may rely on “guessing” patterns rather than true reasoning, and demonstrates practical improvements through scaling strategies.

Abstract: Hierarchical reasoning model (HRM) achieves extraordinary performance on various reasoning tasks, significantly outperforming large language model-based reasoners. To understand the strengths and potential failure modes of HRM, we conduct a mechanistic study on its reasoning patterns and find three surprising facts: (a) Failure of extremely simple puzzles, e.g., HRM can fail on a puzzle with only one unknown cell. We attribute this failure to the violation of the fixed point property, a fundamental assumption of HRM. (b) “Grokking” dynamics in reasoning steps, i.e., the answer is not improved uniformly, but instead there is a critical reasoning step that suddenly makes the answer correct; (c) Existence of multiple fixed points. HRM “guesses” the first fixed point, which could be incorrect, and gets trapped there for a while or forever. All facts imply that HRM appears to be “guessing” instead of “reasoning”. Leveraging this “guessing” picture, we propose three strategies to scale HRM’s guesses: data augmentation (scaling the quality of guesses), input perturbation (scaling the number of guesses by leveraging inference randomness), and model bootstrapping (scaling the number of guesses by leveraging training randomness). On the practical side, by combining all methods, we develop Augmented HRM, boosting accuracy on Sudoku-Extreme from 54.5% to 96.9%. On the scientific side, our analysis provides new insights into how reasoning models “reason”.

[295] Structure and Diversity Aware Context Bubble Construction for Enterprise Retrieval Augmented Systems

Amir Khurshid, Abhishek Sehgal

Main category: cs.AI

TL;DR: Proposes context bubble framework for LLM RAG that constructs coherent, citable bundles of document spans under token budget constraints, using structural priors and diversity constraints to reduce fragmentation and duplication.

Details

Motivation: Traditional RAG with top-k passage retrieval causes information fragmentation, over-retrieval, content duplication, and insufficient coverage of secondary query facets. Need for more efficient context construction that preserves document structure.

Method: Structure-informed, diversity-constrained context bubble framework that organizes multi-granular spans (sections, rows), uses task-conditioned structural priors, starts from high-relevance anchor spans, and performs constrained selection balancing query relevance, marginal coverage, and redundancy penalties under strict token budget.

Result: Significantly reduces redundant context, better covers secondary facets, improves answer quality and citation faithfulness within limited context windows. Ablation shows both structural priors and diversity constraints are necessary.

Conclusion: Context bubble framework outperforms traditional top-k RAG by producing compact, informative context sets with better coverage and less redundancy while providing auditability through full retrieval trace.

Abstract: Large language model (LLM) contexts are typically constructed using retrieval-augmented generation (RAG), which involves ranking and selecting the top-k passages. The approach causes fragmentation in information graphs in document structures, over-retrieval, and duplication of content alongside insufficient query context, including 2nd and 3rd order facets. In this paper, a structure-informed and diversity-constrained context bubble construction framework is proposed that assembles coherent, citable bundles of spans under a strict token budget. The method preserves and exploits inherent document structure by organising multi-granular spans (e.g., sections and rows) and using task-conditioned structural priors to guide retrieval. Starting from high-relevance anchor spans, a context bubble is constructed through constrained selection that balances query relevance, marginal coverage, and redundancy penalties. It will explicitly constrain diversity and budget, producing compact and informative context sets, unlike top-k retrieval. Moreover, a full retrieval is emitted that traces the scoring and selection choices of the records, thus providing auditability and deterministic tuning. Experiments on enterprise documents demonstrate the efficiency of context bubble as it significantly reduces redundant context, is better able to cover secondary facets and has a better answer quality and citation faithfulness within a limited context window. Ablation studies demonstrate that both structural priors as well as diversity constraint selection are necessary; removing either component results in a decline in coverage and an increase in redundant or incomplete context.

[296] The Impact of Generative AI on Architectural Conceptual Design: Performance, Creative Self-Efficacy and Cognitive Load

Han Jiang, Yao Xiao, Rachel Hurley, Shichao Liu

Main category: cs.AI

TL;DR: GenAI improves design performance for novice architects but reduces creative self-efficacy, with effectiveness depending on user expertise and prompting strategies.

Details

Motivation: To understand how generative AI influences performance, creative self-efficacy, and cognitive load in architectural conceptual design tasks, particularly examining differences based on user expertise and interaction strategies.

Method: 36 student participants completed two-phase architectural design tasks: first independently, then with external tools (GenAI-assisted vs. control using online repository). Expert raters evaluated design outcomes, while self-efficacy and cognitive load were self-reported. Difference-in-differences analyses and subgroup analyses were conducted.

Result: No overall performance advantage of GenAI, but significant improvement for novice designers. Creative self-efficacy declined for GenAI users. No significant cognitive load differences, but iterative idea generation and visual feedback prompts reduced cognitive load.

Conclusion: GenAI effectiveness depends on users’ prior expertise and interaction strategies through prompting, with benefits for novices but potential negative impacts on creative self-efficacy.

Abstract: Our study examines how generative AI (GenAI) influences performance, creative self-efficacy, and cognitive load in architectural conceptual design tasks. Thirty-six student participants from Architectural Engineering and other disciplines completed a two-phase architectural design task, first independently and then with external tools (GenAI-assisted condition and control condition using an online repository of existing architectural projects). Design outcomes were evaluated by expert raters, while self-efficacy and cognitive load were self-reported after each phase. Difference-in-differences analyses revealed no overall performance advantage of GenAI across participants; however, subgroup analyses showed that GenAI significantly improved design performance for novice designers. In contrast, general creative self-efficacy declined for students using GenAI. Cognitive load did not differ significantly between conditions, though prompt usage patterns showed that iterative idea generation and visual feedback prompts were linked to greater reductions in cognitive load. These findings suggest that GenAI effectiveness depends on users’ prior expertise and interaction strategies through prompting.

Akul Goel, Surya Narayanan Hari, Belinda Waltman, Matt Thomson

Main category: cs.AI

TL;DR: Intelligent routing system uses LLM router to direct medical record data to optimal open-source models for SDOH coding, achieving 96.4% accuracy across 13 codes and outperforming GPT-4o.

Details

Motivation: SDOH significantly impact health outcomes but are infrequently coded in EHRs using Z-codes. While LLMs show promise for extracting SDOH from clinical notes, no single model performs best across all tasks, and closed-source models pose privacy risks. Need for open-source LLMs that can be run within healthcare organizations and perform well on SDOH tasks.

Method: Intelligent routing system with language model router that directs medical record data to open-source LLMs demonstrating optimal performance on specific SDOH codes. Used publicly-available deidentified dataset and introduced synthetic data generation/validation paradigm to scale training data without privacy-protected records.

Result: System achieved state-of-the-art performance of 96.4% accuracy averaged across 13 SDOH codes (including homelessness and food insecurity), outperforming closed models like GPT-4o.

Conclusion: Demonstrated architecture for intelligent routing of inputs to task-optimal language models achieves high performance across medical coding sub-tasks, addressing privacy concerns while improving SDOH coding accuracy.

Abstract: Social Determinants of Health (SDOH), also known as Health-Related Social Needs (HSRN), play a significant role in patient health outcomes. The Centers for Disease Control and Prevention (CDC) introduced a subset of ICD-10 codes called Z-codes to recognize and measure SDOH. However, Z-codes are infrequently coded in a patient’s Electronic Health Record (EHR), and instead, in many cases, need to be inferred from clinical notes. Previous research has shown that large language models (LLMs) show promise on extracting unstructured data from EHRs, but it can be difficult to identify a single model that performs best on varied coding tasks. Further, clinical notes contain protected health information posing a challenge for the use of closed-source language models from commercial vendors. The identification of open-source LLMs that can be run within health organizations and exhibit high performance on SDOH tasks is an important issue to solve. Here, we introduce an intelligent routing system for SDOH coding that uses a language model router to direct medical record data to open-source LLMs that demonstrate optimal performance on specific SDOH codes. This intelligent routing system exhibits state of the art performance of 96.4% accuracy averaged across 13 codes, including homelessness and food insecurity, outperforming closed models such as GPT-4o. We leveraged a publicly-available, deidentified dataset of medical record notes to run the router, but we also introduce a synthetic data generation and validation paradigm to increase the scale of training data without needing privacy-protected medical records. Together, we demonstrate an architecture for intelligent routing of inputs to task-optimal language models to achieve high performance across a set of medical coding sub-tasks.

[298] Machine Learning and Theory Ladenness – A Phenomenological Account

Alberto Termine, Emanuele Ratti, Alessandro Facchini

Main category: cs.AI

TL;DR: ML model-building in science is mostly indifferent to domain theory despite being weakly theory-laden (theory infection), with implications for cross-disciplinary transferability and shifting the theory-ladenness debate from descriptive to normative priorities.

Details

Motivation: To analyze theory ladenness in machine learning applied to scientific domains, challenging recent trends in philosophy of science that emphasize strong theory-ladenness in ML models.

Method: Construct an account of ML models by comparing them with phenomenological models, examining how domain theory (scientific domain knowledge) interacts with ML model-building processes.

Result: ML model-building is mostly indifferent to domain theory, though it remains theory-laden in a weak sense called “theory infection.” This has significant consequences for ML transferability across scientific disciplines.

Conclusion: The findings shift the priorities of the debate on theory ladenness in ML from descriptive to normative concerns, highlighting the practical implications for interdisciplinary ML applications.

Abstract: We provide an analysis of theory ladenness in machine learning in science, where “theory”, that we call “domain theory”, refers to the domain knowledge of the scientific discipline where ML is used. By constructing an account of ML models based on a comparison with phenomenological models, we show, against recent trends in philosophy of science, that ML model-building is mostly indifferent to domain theory, even if the model remains theory laden in a weak sense, which we call theory infection. These claims, we argue, have far-reaching consequences for the transferability of ML across scientific disciplines, and shift the priorities of the debate on theory ladenness in ML from descriptive to normative.

[299] CoMAT: Chain of Mathematically Annotated Thought Improves Mathematical Reasoning

Joshua Ong Jun Leang, Aryo Pradipta Gema, Shay B. Cohen

Main category: cs.AI

TL;DR: CoMAT enhances LLM mathematical reasoning by converting natural language to symbolic form and executing reasoning on symbolic representations, outperforming Chain-of-Thought on most benchmarks without external solvers.

Details

Motivation: Mathematical reasoning remains challenging for LLMs despite progress with techniques like Chain-of-Thought. There's a need for more effective reasoning methods that can handle complex mathematical tasks with transparency and verifiability.

Method: CoMAT uses a two-stage approach: 1) Symbolic Conversion - converting natural language queries into symbolic form, and 2) Reasoning Execution - deriving answers from symbolic representations. The method operates entirely within a single LLM without external solvers.

Result: CoMAT outperforms traditional Chain-of-Thought on 6 out of 7 benchmarks across four LLMs, achieving gains of 4.48% on MMLU-Redux (MATH) and 4.58% on GaoKao MCQ. It also provides improved faithfulness and verifiability.

Conclusion: CoMAT offers an effective approach for enhancing mathematical reasoning in LLMs through symbolic conversion and reasoning execution, providing both performance improvements and transparent reasoning processes for complex mathematical tasks.

Abstract: Mathematical reasoning remains a significant challenge for large language models (LLMs), despite progress in prompting techniques such as Chain-of-Thought (CoT). We present Chain of Mathematically Annotated Thought (CoMAT), which enhances reasoning through two stages: Symbolic Conversion (converting natural language queries into symbolic form) and Reasoning Execution (deriving answers from symbolic representations). CoMAT operates entirely with a single LLM and without external solvers. Across four LLMs, CoMAT outperforms traditional CoT on six out of seven benchmarks, achieving gains of 4.48% on MMLU-Redux (MATH) and 4.58% on GaoKao MCQ. In addition to improved performance, CoMAT ensures faithfulness and verifiability, offering a transparent reasoning process for complex mathematical tasks

[300] MatchMiner-AI: An Open-Source Solution for Cancer Clinical Trial Matching

Jennifer Altreuter, Pavel Trukhanov, Morgan A. Paul, Michael J. Hassett, Irbaz B. Riaz, Muhammad Umar Afzal, Arshad A. Mohammed, Sarah Sammons, James Lindsay, Emily Mallaber, Harry R. Klein, Gufran Gungor, Matthew Galvin, Michael Deletto, Stephen C. Van Nostrand, James Provencher, Joyce Yu, Naeem Tahir, Jonathan Wischhusen, Olga Kozyreva, Taylor Ortiz, Hande Tuncer, Jad El Masri, Alys Malcolm, Tali Mazor, Ethan Cerami, Kenneth L. Kehl

Main category: cs.AI

TL;DR: MatchMiner-AI is an open-source platform that uses synthetic EHR data to match cancer patients with clinical trials, achieving 90% relevance for top recommendations while preserving privacy.

Details

Motivation: Low clinical trial enrollment (<10% of cancer patients) and accrual failures create a need for better trial matching. AI could help but faces privacy restrictions when trained on real health data.

Method: Developed MatchMiner-AI platform trained entirely on synthetic EHR data. Extracts clinical criteria from EHR text, embeds patient summaries and trial “spaces” in shared vector space, applies custom text classifiers to assess patient-trial pairings, and evaluated on real clinical data.

Result: Outperformed baseline approaches: 90% of top 20 recommended trials were relevant for enrolled patients (vs 17% baseline), 88% for standard-of-care patients (vs 14% baseline). Text classifiers achieved AUROC 0.94-0.98. Mean average precision reached ~0.90. All components publicly available.

Conclusion: MatchMiner-AI demonstrates a privacy-preserving, openly accessible approach to clinical trial matching using LLM-generated synthetic EHR data, addressing both trial enrollment challenges and data privacy concerns.

Abstract: Background Clinical trials are essential to advancing cancer treatments, yet fewer than 10% of adults with cancer enroll in trials, and many studies fail to meet accrual targets. Artificial intelligence (AI) could improve identification of appropriate trials for patients, but sharing AI models trained on protected health information remains difficult due to privacy restrictions. Methods We developed MatchMiner-AI, an open-source platform for clinical trial search and ranking trained entirely on synthetic electronic health record (EHR) data. The system extracts core clinical criteria from longitudinal EHR text and embeds patient summaries and trial “spaces” (target populations) in a shared vector space for rapid retrieval. It then applies custom text classifiers to assess whether each patient-trial pairing is a clinically reasonable consideration. The pipeline was evaluated on real clinical data. Results Across retrospective evaluations on real EHR data, the fine-tuned pipeline outperformed baseline text-embedding approaches. For trial-enrolled patients, 90% of the top 20 recommended trials were relevant matches (compared to 17% for the baseline model). Similar improvements were noted for patients who received standard-of-care treatments (88% of the top 20 matches were relevant, compared to 14% for baseline). Text classification modules demonstrated strong discrimination (AUROC 0.94-0.98) for evaluating candidate patient-trial space pair eligibility; incorporating these components consistently increased mean average precision to ~ 0.90 across patient- and trial-centric use cases. Synthetic training data, model weights, inference tools, and demonstration frontends are publicly available. Conclusions MatchMiner-AI demonstrates an openly accessible, privacy-preserving approach to distilling a clinical trial matching AI pipeline from LLM-generated synthetic EHR data.

[301] Scalable Oversight for Superhuman AI via Recursive Self-Critiquing

Xueru Wen, Jie Lou, Xinyu Lu, Junjie Yang, Yanjiang Liu, Yaojie Lu, Debing Zhang, Xing Yu

Main category: cs.AI

TL;DR: Recursive self-critiquing offers a scalable approach for AI oversight when AI outputs exceed human cognitive capabilities, building on the principle that critique of critique is easier than direct critique.

Details

Motivation: Current AI alignment techniques (SFT, RLHF) rely on direct human assessment but become impractical when AI outputs surpass human cognitive thresholds, creating a need for scalable oversight methods.

Method: Proposes recursive self-critiquing based on two hypotheses: (1) critique of critique is easier than critique itself, and (2) this difficulty relationship holds recursively. Conducts Human-Human, Human-AI, and AI-AI experiments to investigate recursive self-critiquing for AI supervision.

Result: Results highlight recursive critique as a promising approach for scalable AI oversight when direct human evaluation becomes infeasible.

Conclusion: Recursive self-critiquing provides a tractable supervision pathway for aligning advanced AI systems that exceed human cognitive capabilities, addressing fundamental limitations of current alignment techniques.

Abstract: As AI capabilities increasingly surpass human proficiency in complex tasks, current alignment techniques, including SFT and RLHF, face fundamental challenges in ensuring reliable oversight. These methods rely on direct human assessment and become impractical when AI outputs exceed human cognitive thresholds. In response to this challenge, we explore two hypotheses: (1) \textit{Critique of critique can be easier than critique itself}, extending the widely-accepted observation that verification is easier than generation to the critique domain, as critique itself is a specialized form of generation; (2) \textit{This difficulty relationship holds recursively}, suggesting that when direct evaluation is infeasible, performing higher-order critiques (e.g., critique of critique of critique) offers a more tractable supervision pathway. We conduct Human-Human, Human-AI, and AI-AI experiments to investigate the potential of recursive self-critiquing for AI supervision. Our results highlight recursive critique as a promising approach for scalable AI oversight.

[302] MathArena: Evaluating LLMs on Uncontaminated Math Competitions

Mislav Balunović, Jasper Dekoninck, Ivo Petrov, Nikola Jovanović, Martin Vechev

Main category: cs.AI

TL;DR: MathArena is a new benchmark for evaluating LLMs on mathematical reasoning using real-time competition problems to avoid contamination, assessing both problem-solving and proof-writing capabilities.

Details

Motivation: Existing math benchmarks suffer from contamination issues (problems available online) and don't evaluate proof-writing capabilities, making it difficult to assess genuine reasoning vs. memorization.

Method: Uses recurring math competitions as a source of fresh problems, evaluating models immediately upon release to eliminate contamination risk. Evaluates over 50 models across 7 competitions (162 problems total), including proof-writing assessment on IMO problems.

Result: Found strong signs of contamination in AIME 2024. Top models show impressive reasoning on harder competitions like CMIMC 2025. On IMO 2025 proof-writing, top models achieve slightly less than 40%, showing progress but significant room for improvement.

Conclusion: MathArena provides a rigorous, contamination-free benchmark for mathematical reasoning that will evolve with new competitions, offering real-time evaluation of LLM capabilities including proof-writing.

Abstract: The rapid advancement of reasoning capabilities in large language models (LLMs) has led to notable improvements on mathematical benchmarks. However, many of the most commonly used evaluation datasets (e.g., AIME 2024) are widely available online, making it difficult to disentangle genuine reasoning from potential memorization. Furthermore, these benchmarks do not evaluate proof-writing capabilities, which are crucial for many mathematical tasks. To address this, we introduce MathArena, a new benchmark based on the following key insight: recurring math competitions provide a stream of high-quality, challenging problems that can be used for real-time evaluation of LLMs. By evaluating models as soon as new problems are released, we effectively eliminate the risk of contamination. Using this framework, we find strong signs of contamination in AIME 2024. Nonetheless, evaluations on harder competitions, such as CMIMC 2025, demonstrate impressive reasoning capabilities in top-performing models. MathArena is also the first benchmark for proof-writing capabilities. On IMO 2025, top models achieve slightly less than 40%, demonstrating both notable progress and significant room for improvement. So far, we have evaluated over $50$ models across seven competitions, totaling $162$ problems. As an evolving benchmark, MathArena will continue to track the progress of LLMs on newly released competitions, ensuring rigorous and up-to-date evaluation of mathematical reasoning.

[303] Uncovering Systemic and Environment Errors in Autonomous Systems Using Differential Testing

Yashwanthi Anand, Rahil P Mehta, Manish Motwani, Sandhya Saisubramanian

Main category: cs.AI

TL;DR: AIProbe is a black-box testing technique that uses differential testing to determine whether undesirable autonomous agent behaviors stem from agent deficiencies (model/policy flaws) or environmental infeasibility (unsolvable tasks).

Details

Motivation: As autonomous agents and environments grow more complex, it becomes increasingly difficult but critical to identify whether undesirable behaviors (including task failures) are due to systemic agent errors (model/policy flaws) or environment errors (inherently infeasible tasks). This attribution is essential for reliable deployment.

Method: AIProbe uses differential testing: 1) Generates diverse environmental configurations and tasks via Latin Hypercube sampling of configurable parameters, 2) Solves each generated task using a search-based planner independent of the agent, 3) Compares agent performance to planner solutions to attribute failures to either agent deficiencies or unsolvable task conditions.

Result: Evaluation across multiple domains shows AIProbe significantly outperforms state-of-the-art techniques in detecting both total and unique errors.

Conclusion: AIProbe contributes to reliable deployment of autonomous agents by effectively distinguishing between agent errors and environmental infeasibility, enabling better diagnosis and improvement of autonomous systems.

Abstract: When an autonomous agent behaves undesirably, including failure to complete a task, it can be difficult to determine whether the behavior is due to a systemic agent error, such as flaws in the model or policy, or an environment error, where a task is inherently infeasible under a given environment configuration, even for an ideal agent. As agents and their environments grow more complex, identifying the error source becomes increasingly difficult but critical for reliable deployment. We introduce AIProbe, a novel black-box testing technique that applies differential testing to attribute undesirable agent behaviors either to agent deficiencies, such as modeling or training flaws, or due to environmental infeasibility. AIProbe first generates diverse environmental configurations and tasks for testing the agent, by modifying configurable parameters using Latin Hypercube sampling. It then solves each generated task using a search-based planner, independent of the agent. By comparing the agent’s performance to the planner’s solution, AIProbe identifies whether failures are due to errors in the agent’s model or policy, or due to unsolvable task conditions. Our evaluation across multiple domains shows that AIProbe significantly outperforms state-of-the-art techniques in detecting both total and unique errors, thereby contributing to a reliable deployment of autonomous agents.

[304] BASIL: Bayesian Assessment of Sycophancy in LLMs

Katherine Atwell, Pedram Heydari, Anthony Sicilia, Malihe Alikhani

Main category: cs.AI

TL;DR: The paper introduces a Bayesian framework to separate sycophancy (overly agreeable behavior) from rational belief updating in LLMs, with metrics that work even without ground truth, and shows interventions can reduce sycophantic tendencies.

Details

Motivation: Sycophancy in LLMs poses challenges for human-AI collaboration in high-stakes domains, but existing approaches can't properly distinguish sycophantic behavior from rational belief updating based on new evidence.

Method: Developed a Bayesian probabilistic framework grounded in behavioral economics and rational decision theory that separates sycophancy from rational belief updating, with descriptive and normative metrics applicable even without ground-truth labels.

Result: Found robust evidence of sycophantic belief shifts across multiple LLMs in uncertainty-driven tasks, showing impact on rationality depends on whether models systematically over- or under-update beliefs. Post-hoc calibration and fine-tuning (SFT and DPO) substantially reduced Bayesian inconsistency.

Conclusion: The Bayesian framework successfully distinguishes sycophancy from rational belief updating, enabling better measurement and mitigation of sycophantic tendencies in LLMs, with interventions showing strong improvements especially under explicit sycophancy prompting.

Abstract: Sycophancy (overly agreeable or flattering behavior) poses a fundamental challenge for human-AI collaboration, particularly in high-stakes decision-making domains such as health, law, and education. A central difficulty in studying sycophancy in large language models (LLMs) is disentangling sycophantic belief shifts from rational changes in behavior driven by new evidence or user-provided information. Existing approaches either measure descriptive behavior changes or apply normative evaluations that rely on objective ground truth, limiting their applicability to subjective or uncertain tasks. We introduce a Bayesian probabilistic framework, grounded in behavioral economics and rational decision theory, that explicitly separates sycophancy from rational belief updating. Within this framework, we achieve three objectives: (i) a descriptive metric that measures sycophancy while controlling for rational responses to evidence; (ii) a normative metric that quantifies how sycophancy leads models astray from Bayesian-consistent belief updating; and (iii) the ability to apply both metrics in settings without ground-truth labels. Applying our framework across multiple LLMs and three uncertainty-driven tasks, we find robust evidence of sycophantic belief shifts and show that their impact on rationality depends on whether models systematically over- or under-update their beliefs. Finally, we demonstrate that a post-hoc calibration method and two fine-tuning strategies (SFT and DPO) substantially reduce Bayesian inconsistency, with particularly strong improvements under explicit sycophancy prompting.

[305] Compartmentalised Agentic Reasoning for Clinical NLI

Maël Jullien, Lei Xu, Marco Valentino, André Freitas

Main category: cs.AI

TL;DR: CARENLI is a compartmentalized agentic framework that improves clinical natural language inference by routing premise-statement pairs to specialized reasoning solvers with verification, boosting accuracy from ~23% to ~57%.

Details

Motivation: Current large language models often fail at clinical natural language inference when decisions require correct inferential schemas rather than just surface matching, highlighting the need for more reliable clinical reasoning systems.

Method: CARENLI uses a compartmentalized agentic framework that routes each premise-statement pair to one of four reasoning families (Causal Attribution, Compositional Grounding, Epistemic Verification, Risk State Abstraction), then applies specialized solvers with explicit verification and targeted refinement.

Result: On an expanded CTNLI benchmark of 200 instances, CARENLI improved mean accuracy from about 23% with direct prompting to about 57% across four contemporary backbone models, representing a 34-point gain with largest benefits on structurally demanding reasoning types.

Conclusion: Compartmentalization plus verification provides a practical route to more reliable and auditable clinical inference, addressing the limitations of direct prompting approaches for complex clinical reasoning tasks.

Abstract: Large language models can produce fluent judgments for clinical natural language inference, yet they frequently fail when the decision requires the correct inferential schema rather than surface matching. We introduce CARENLI, a compartmentalised agentic framework that routes each premise-statement pair to a reasoning family and then applies a specialised solver with explicit verification and targeted refinement. We evaluate on an expanded CTNLI benchmark of 200 instances spanning four reasoning families: Causal Attribution, Compositional Grounding, Epistemic Verification, and Risk State Abstraction. Across four contemporary backbone models, CARENLI improves mean accuracy from about 23% with direct prompting to about 57%, a gain of roughly 34 points, with the largest benefits on structurally demanding reasoning types. These results support compartmentalisation plus verification as a practical route to more reliable and auditable clinical inference.

[306] Gradient Coupling: The Hidden Barrier to Generalization in Agentic Reinforcement Learning

Jingyu Liu, Xiaopeng Wu, Jingquan Peng, Kehan Chen, Chuan Yu, Lizhong Ding, Yong Liu

Main category: cs.AI

TL;DR: RL agents suffer from poor generalization due to “gradient coupling” - destructive interference between gradients in similar states. Proposed solution: train actor as classifier to learn disentangled embeddings, improving generalization.

Details

Motivation: Reinforcement learning agents often exhibit poor generalization, failing to adapt to scenarios not seen during training. The paper identifies "gradient coupling" as a fundamental cause of this brittleness, where high similarity between distinct states leads to destructive interference between gradients.

Method: Proposes a novel objective where the actor is trained to simultaneously function as a classifier that separates good and bad actions. This auxiliary pressure compels the model to learn disentangled embeddings for positive and negative actions, which mitigates negative gradient interference.

Result: Extensive experiments demonstrate the effectiveness of the method in improving generalization performance by mitigating gradient coupling issues.

Conclusion: The proposed approach of training the actor as a classifier to learn disentangled embeddings effectively addresses the gradient coupling problem, leading to improved generalization in reinforcement learning agents.

Abstract: Reinforcement learning (RL) is a dominant paradigm for training autonomous agents, yet these agents often exhibit poor generalization, failing to adapt to scenarios not seen during training. In this work, we identify a fundamental cause of this brittleness, a phenomenon which we term “gradient coupling.” We hypothesize that in complex agentic tasks, the high similarity between distinct states leads to destructive interference between gradients. Specifically, a gradient update that reinforces an optimal action in one state can inadvertently increase the likelihood of a suboptimal action in a similar, yet different, state. To solve this, we propose a novel objective where the actor is trained to simultaneously function as a classifier that separates good and bad actions. This auxiliary pressure compels the model to learn disentangled embeddings for positive and negative actions, which mitigates negative gradient interference and improve the generalization performance. Extensive experiments demonstrate the effectiveness of our method.

[307] Dual-Uncertainty Guided Policy Learning for Multimodal Reasoning

Rui Liu, Dian Yu, Tong Zheng, Runpeng Dai, Zongxia Li, Wenhao Yu, Zhenwen Liang, Linfeng Song, Haitao Mi, Pratap Tokekar, Dong Yu

Main category: cs.AI

TL;DR: DUPL introduces dual-uncertainty guided policy learning for multimodal RLVR, addressing perceptual ambiguity by quantifying both perceptual and output uncertainty to improve targeted exploration and reasoning performance.

Details

Motivation: Existing RLVR methods treat visual inputs as deterministic, ignoring perceptual ambiguity. This prevents distinguishing whether model uncertainty comes from complex reasoning or ambiguous perception, hindering targeted allocation of exploration/learning signals.

Method: DUPL quantifies perceptual uncertainty via symmetric KL divergence and output uncertainty via policy entropy. It establishes an uncertainty-driven feedback loop with dynamic branch prioritization to recalibrate policy advantage, focusing learning on states with high perceptual or decisional ambiguity.

Result: Implemented on GRPO and evaluated on six multimodal reasoning benchmarks, DUPL improves Qwen2.5-VL 3B and 7B models with accuracy gains up to 11.2% on visual math tasks and 7.1% on general-domain reasoning tasks, consistently outperforming GRPO.

Conclusion: Dual-uncertainty guided policy learning is an effective and generalizable approach for multimodal RLVR that enables targeted exploration beyond passive data augmentation by addressing perceptual ambiguity.

Abstract: Reinforcement learning with verifiable rewards (RLVR) has advanced reasoning capabilities in multimodal large language models. However, existing methods typically treat visual inputs as deterministic, overlooking the perceptual ambiguity inherent to the visual modality. Consequently, they fail to distinguish whether a model’s uncertainty stems from complex reasoning or ambiguous perception, preventing the targeted allocation of exploration or learning signals. To address this gap, we introduce DUPL, a dual-uncertainty guided policy learning approach for multimodal RLVR that quantifies and leverages both perceptual uncertainty (via symmetric KL divergence) and output uncertainty (via policy entropy) to guide policy updates. By establishing an uncertainty-driven feedback loop and employing a dynamic branch prioritization mechanism, DUPL recalibrates the policy advantage to focus learning on states with high perceptual or decisional ambiguity, enabling effective targeted exploration beyond passive data augmentation. Implemented on top of GRPO and evaluated on six multimodal mathematical and general-domain reasoning benchmarks, DUPL improves Qwen2.5-VL 3B and 7B models, achieving accuracy gains of up to 11.2% on visual math tasks and up to 7.1% on general-domain reasoning tasks, while consistently outperforming GRPO. These results demonstrate that dual-uncertainty guided policy learning is an effective and generalizable approach for multimodal RLVR.

[308] VAL-Bench: Belief Consistency as a measure for Value Alignment in Language Models

Aman Gupta, Denny O’Shea, Fazl Barez

Main category: cs.AI

TL;DR: VAL-Bench is a new benchmark measuring LLM consistency on real-life controversial issues, showing models vary widely in maintaining consistent stances across opposing prompts.

Details

Motivation: Existing benchmarks use hypothetical situations that don't capture real-world complexity and ambiguity. Since humans don't agree on universal values, evaluating LLM value alignment is difficult but critical as LLMs increasingly shape human decisions.

Method: Created VAL-Bench with 115K pairs of prompts designed to elicit opposing stances on controversial issues extracted from Wikipedia. Used LLM-as-a-judge validated against human annotations to evaluate if response pairs consistently express either neutral or specific stances.

Result: Considerable variation in consistency rates across models (~10% to ~80%), with Claude models being the only ones achieving high consistency. Lack of consistency risks epistemic harm by making user beliefs dependent on question framing rather than evidence.

Conclusion: VAL-Bench enables systematic measurement of value alignment conditions. Research is needed to train belief consistency in modern LLMs, especially for trust-critical applications where inconsistent responses undermine reliability.

Abstract: Large language models (LLMs) are increasingly being used for tasks where outputs shape human decisions, so it is critical to verify that their responses consistently reflect desired human values. Humans, as individuals or groups, don’t agree on a universal set of values, which makes evaluating value alignment difficult. Existing benchmarks often use hypothetical or commonsensical situations, which don’t capture the complexity and ambiguity of real-life debates. We introduce the Value ALignment Benchmark (VAL-Bench), which measures the consistency in language model belief expressions in response to real-life value-laden prompts. VAL-Bench consists of 115K pairs of prompts designed to elicit opposing stances on a controversial issue, extracted from Wikipedia. We use an LLM-as-a-judge, validated against human annotations, to evaluate if the pair of responses consistently expresses either a neutral or a specific stance on the issue. Applied across leading open- and closed-source models, the benchmark shows considerable variation in consistency rates (ranging from ~10% to ~80%), with Claude models the only ones to achieve high levels of consistency. Lack of consistency in this manner risks epistemic harm by making user beliefs dependent on how questions are framed rather than on underlying evidence, and undermines LLM reliability in trust-critical applications. Therefore, we stress the importance of research towards training belief consistency in modern LLMs. By providing a scalable, reproducible benchmark, VAL-Bench enables systematic measurement of necessary conditions for value alignment.

[309] Lightweight Diffusion-based Framework for Online Imagined Speech Decoding in Aphasia

Eunyeong Ko, Soowon Kim, Ha-Na Jo

Main category: cs.AI

TL;DR: Real-time imagined speech decoding system for aphasia patients using lightweight diffusion model achieves 65% top-1 accuracy in online feedback phase.

Details

Motivation: Individuals with aphasia have severe verbal communication difficulties, but most imagined speech decoding approaches are limited to offline analysis or computationally demanding models, making real-time communication impractical.

Method: Two-session framework: offline data acquisition followed by online feedback phase. Uses four-class Korean-language task with three imagined speech targets based on participant’s daily needs plus resting state. Introduces lightweight diffusion-based neural decoding model optimized for real-time inference with architectural simplifications (dimensionality reduction, temporal kernel optimization, group normalization with regularization, dual early-stopping criteria).

Result: Real-time evaluation achieved 65% top-1 and 70% top-2 accuracy. The Water class specifically reached 80% top-1 and 100% top-2 accuracy. Demonstrated feasibility of online imagined speech decoding for communication-oriented BCI in aphasia.

Conclusion: Real-time-optimized diffusion-based architectures combined with clinically grounded task design can support feasible online imagined speech decoding for communication-oriented BCI applications in aphasia, addressing the critical need for real-time communication aids.

Abstract: Individuals with aphasia experience severe difficulty in real-time verbal communication, while most imagined speech decoding approaches remain limited to offline analysis or computationally demanding models. To address this limitation, we propose a two-session experimental framework consisting of an offline data acquisition phase and a subsequent online feedback phase for real-time imagined speech decoding. The paradigm employed a four-class Korean-language task, including three imagined speech targets selected according to the participant’s daily communicative needs and a resting-state condition, and was evaluated in a single individual with chronic anomic aphasia. Within this framework, we introduce a lightweight diffusion-based neural decoding model explicitly optimized for real-time inference, achieved through architectural simplifications such as dimensionality reduction, temporal kernel optimization, group normalization with regularization, and dual early-stopping criteria. In real-time evaluation, the proposed system achieved 65% top-1 and 70% top-2 accuracy, with the Water class reaching 80% top-1 and 100% top-2 accuracy. These results demonstrate that real-time-optimized diffusion-based architectures, combined with clinically grounded task design, can support feasible online imagined speech decoding for communication-oriented BCI applications in aphasia.

[310] Can LLMs Understand What We Cannot Say? Measuring Multilevel Alignment Through Abortion Stigma Across Cognitive, Interpersonal, and Structural Levels

Anika Sharma, Malavika Mampally, Chidaksh Ravuru, Kandyce Brennan, Neil Gaikwad

Main category: cs.AI

TL;DR: LLMs fail to coherently understand abortion stigma across cognitive, interpersonal, and structural levels, showing systematic biases and internal contradictions despite appropriate language use.

Details

Motivation: As LLMs increasingly mediate stigmatized health decisions, there's a critical need to assess whether they genuinely understand complex psychological phenomena like abortion stigma that operate across multiple dimensions.

Method: Systematically tested 627 demographically diverse personas across five leading LLMs using the validated Individual Level Abortion Stigma Scale (ILAS), examining representation at cognitive (self-judgment), interpersonal (worries about judgment/isolation), and structural (community condemnation/disclosure) levels.

Result: Models fail tests of genuine understanding across all dimensions: underestimate cognitive stigma while overestimating interpersonal stigma, introduce demographic biases (higher stigma for younger, less educated, non-White personas), treat secrecy as universal despite 36% of humans reporting openness, and produce internal contradictions (overestimate isolation yet predict isolated individuals are less secretive).

Conclusion: Current alignment approaches ensure appropriate language but not coherent understanding across levels. AI safety in high-stakes contexts demands new approaches: multilevel coherence in design, continuous auditing in evaluation, mandatory audits/accountability/deployment restrictions in governance, and AI literacy in domains where understanding what people cannot say determines whether support helps or harms.

Abstract: As Large Language Models (LLMs) increasingly mediate stigmatized health decisions, their capacity to understand complex psychological phenomena remains inadequately assessed. Can LLMs understand what we cannot say? We investigate whether LLMs coherently represent abortion stigma across cognitive, interpersonal, and structural levels. We systematically tested 627 demographically diverse personas across five leading LLMs using the validated Individual Level Abortion Stigma Scale (ILAS), examining representation at cognitive (self-judgment), interpersonal (worries about judgment and isolation), and structural (community condemnation and disclosure patterns) levels. Models fail tests of genuine understanding across all dimensions. They underestimate cognitive stigma while overestimating interpersonal stigma, introduce demographic biases assigning higher stigma to younger, less educated, and non-White personas, and treat secrecy as universal despite 36% of humans reporting openness. Most critically, models produce internal contradictions: they overestimate isolation yet predict isolated individuals are less secretive, revealing incoherent representations. These patterns show current alignment approaches ensure appropriate language but not coherent understanding across levels. This work provides empirical evidence that LLMs lack coherent understanding of psychological constructs operating across multiple dimensions. AI safety in high-stakes contexts demands new approaches to design (multilevel coherence), evaluation (continuous auditing), governance and regulation (mandatory audits, accountability, deployment restrictions), and AI literacy in domains where understanding what people cannot say determines whether support helps or harms.

[311] Explicit Abstention Knobs for Predictable Reliability in Video Question Answering

Jorge Ortiz

Main category: cs.AI

TL;DR: Confidence-based abstention provides reliable error rate control in video QA, but fails under distribution shift, requiring new methods for robust selective prediction.

Details

Motivation: High-stakes deployment of vision-language models requires selective prediction where systems abstain when uncertain to avoid costly errors, but it's unclear if confidence-based abstention provides reliable error rate control, especially under distribution shift.

Method: Used NExT-QA dataset and Gemini 2.0 Flash model to test confidence thresholding for selective prediction in video question answering, evaluating both in-distribution and under distribution shift conditions.

Result: Confidence thresholding provides smooth risk-coverage tradeoffs and mechanistic control in-distribution, but fails to maintain error rate control under distribution shift, showing the need for more robust methods.

Conclusion: While confidence-based abstention works well in-distribution, it’s not robust to distribution shift, highlighting the need for new approaches to ensure reliable selective prediction in real-world VLM deployments.

Abstract: High-stakes deployment of vision-language models (VLMs) requires selective prediction, where systems abstain when uncertain rather than risk costly errors. We investigate whether confidence-based abstention provides reliable control over error rates in video question answering, and whether that control remains robust under distribution shift. Using NExT-QA and Gemini 2.0 Flash, we establish two findings. First, confidence thresholding provides mechanistic control in-distribution. Sweeping threshold epsilon produces smooth risk-coverage tradeoffs, reducing error rates f

[312] Stock Market Price Prediction using Neural Prophet with Deep Neural Network

Navin Chhibber, Sunil Khemka, Navneet Kumar Tyagi, Rohit Tewari, Bireswar Banerjee, Piyush Ranjan

Main category: cs.AI

TL;DR: Proposes Neural Prophet with Deep Neural Network (NP-DNN) for stock price prediction, achieving 99.21% accuracy using Z-score normalization, missing value imputation, and MLP for complex pattern learning.

Details

Motivation: Existing statistical time-series approaches fail to effectively forecast probability ranges of future stock prices, creating a need for more accurate prediction methods in this interdisciplinary domain.

Method: NP-DNN model with Z-score normalization for preprocessing, missing value imputation, and Multi-Layer Perceptron (MLP) to learn complex nonlinear relationships and extract hidden patterns from stock price data.

Result: Achieved 99.21% accuracy, outperforming other approaches including Fused Large Language Model, demonstrating superior stock price prediction capability.

Conclusion: NP-DNN effectively addresses limitations of traditional statistical methods for stock price forecasting, providing high-accuracy predictions through deep learning techniques and proper data preprocessing.

Abstract: Stock market price prediction is a significant interdisciplinary research domain that depends at the intersection of finance, statistics, and economics. Forecasting Accurately predicting stock prices has always been a focal point for various researchers. However, existing statistical approaches for time-series prediction often fail to effectively forecast the probability range of future stock prices. Hence, to solve this problem, the Neural Prophet with a Deep Neural Network (NP-DNN) is proposed to predict stock market prices. The preprocessing technique used in this research is Z-score normalization, which normalizes stock price data by removing scale differences, making patterns easier to detect. Missing value imputation fills gaps in historical data, enhancing the models use of complete information for more accurate predictions. The Multi-Layer Perceptron (MLP) learns complex nonlinear relationships among stock market prices and extracts hidden patterns from the input data, thereby creating meaningful feature representations for better prediction accuracy. The proposed NP-DNN model achieved an accuracy of 99.21% compared with other approaches using the Fused Large Language Model. Keywords: deep neural network, forecasting stock prices, multi-layer perceptron, neural prophet, stock market price prediction.

[313] Safety Not Found (404): Hidden Risks of LLM-Based Robotics Decision Making

Jua Han, Jaeyoon Seo, Jungbin Min, Jean Oh, Jihie Kim

Main category: cs.AI

TL;DR: LLMs are dangerously unreliable for safety-critical robotics applications, with even 99% accuracy being unacceptable as 1% failure rate can cause catastrophic harm in life-threatening scenarios like fire evacuations.

Details

Motivation: As LLMs become integral to robotics decision-making, the physical risk grows - a single wrong instruction can directly endanger human safety. There's an urgent need to systematically evaluate LLM performance in scenarios where even minor errors are catastrophic.

Method: Qualitative evaluation of fire evacuation scenarios identified critical failure cases. Designed seven quantitative tasks categorized into: Complete Information (ASCII maps to isolate spatial reasoning), Incomplete Information (inferring missing context), and Safety-Oriented Spatial Reasoning (natural language evaluation of safe decision-making in life-threatening contexts). Benchmarked various LLMs and VLMs across these tasks.

Result: Serious vulnerabilities revealed: several models achieved 0% success rate in ASCII navigation, and in simulated fire drills, models instructed robots to move toward hazardous areas instead of emergency exits. Analysis shows how “rare” errors (1% failure rate) escalate into catastrophic outcomes.

Conclusion: Current LLMs are not ready for direct deployment in safety-critical systems. A 99% accuracy rate is dangerously misleading in robotics as it implies one out of every hundred executions could result in catastrophic harm. Even state-of-the-art models cannot guarantee safety, and absolute reliance on them creates unacceptable risks.

Abstract: One mistake by an AI system in a safety-critical setting can cost lives. As Large Language Models (LLMs) become integral to robotics decision-making, the physical dimension of risk grows; a single wrong instruction can directly endanger human safety. This paper addresses the urgent need to systematically evaluate LLM performance in scenarios where even minor errors are catastrophic. Through a qualitative evaluation of a fire evacuation scenario, we identified critical failure cases in LLM-based decision-making. Based on these, we designed seven tasks for quantitative assessment, categorized into: Complete Information, Incomplete Information, and Safety-Oriented Spatial Reasoning (SOSR). Complete information tasks utilize ASCII maps to minimize interpretation ambiguity and isolate spatial reasoning from visual processing. Incomplete information tasks require models to infer missing context, testing for spatial continuity versus hallucinations. SOSR tasks use natural language to evaluate safe decision-making in life-threatening contexts. We benchmark various LLMs and Vision-Language Models (VLMs) across these tasks. Beyond aggregate performance, we analyze the implications of a 1% failure rate, highlighting how “rare” errors escalate into catastrophic outcomes. Results reveal serious vulnerabilities: several models achieved a 0% success rate in ASCII navigation, while in a simulated fire drill, models instructed robots to move toward hazardous areas instead of emergency exits. Our findings lead to a sobering conclusion: current LLMs are not ready for direct deployment in safety-critical systems. A 99% accuracy rate is dangerously misleading in robotics, as it implies one out of every hundred executions could result in catastrophic harm. We demonstrate that even state-of-the-art models cannot guarantee safety, and absolute reliance on them creates unacceptable risks.

[314] HAG: Hierarchical Demographic Tree-based Agent Generation for Topic-Adaptive Simulation

Rongxin Chen, Tianyu Wu, Bingbing Xu, Jiatang Luo, Xiucheng Xu, Huawei Shen

Main category: cs.AI

TL;DR: HAG is a hierarchical agent generation framework that creates realistic agent populations by combining macro-level distribution alignment with micro-level individual consistency, outperforming existing methods.

Details

Motivation: Current agent initialization methods either rely on static data (which can't adapt to unseen topics) or LLM generation (which lacks distribution awareness), leading to unrealistic agent populations that don't align with real-world distributions.

Method: Two-stage hierarchical approach: 1) Use World Knowledge Model to infer hierarchical conditional probabilities and build Topic-Adaptive Tree for macro-level distribution alignment; 2) Grounded instantiation and agentic augmentation using real-world data for micro-level consistency.

Result: HAG significantly outperforms baselines, reducing population alignment errors by 37.7% on average and improving sociological consistency by 18.8% across multi-domain benchmarks.

Conclusion: HAG provides a robust framework for high-fidelity agent initialization that balances macro-level distribution awareness with micro-level individual rationality, addressing limitations of existing approaches.

Abstract: High-fidelity agent initialization is crucial for credible Agent-Based Modeling across diverse domains. A robust framework should be Topic-Adaptive, capturing macro-level joint distributions while ensuring micro-level individual rationality. Existing approaches fall into two categories: static data-based retrieval methods that fail to adapt to unseen topics absent from the data, and LLM-based generation methods that lack macro-level distribution awareness, resulting in inconsistencies between micro-level persona attributes and reality. To address these problems, we propose HAG, a Hierarchical Agent Generation framework that formalizes population generation as a two-stage decision process. Firstly, utilizing a World Knowledge Model to infer hierarchical conditional probabilities to construct the Topic-Adaptive Tree, achieving macro-level distribution alignment. Then, grounded real-world data, instantiation and agentic augmentation are carried out to ensure micro-level consistency. Given the lack of specialized evaluation, we establish a multi-domain benchmark and a comprehensive PACE evaluation framework. Extensive experiments show that HAG significantly outperforms representative baselines, reducing population alignment errors by an average of 37.7% and enhancing sociological consistency by 18.8%.

[315] DScheLLM: Enabling Dynamic Scheduling through a Fine-Tuned Dual-System Large language Model

Lixiang Zhang, Chenggong Zhao, Qing Gao, Xiaoke Zhao, Gengyi Bai, Jinhu Lv

Main category: cs.AI

TL;DR: DScheLLM: A dynamic scheduling approach using fine-tuned LLMs with dual-system reasoning (fast-slow) to handle various disruptions in production scheduling.

Details

Motivation: Conventional scheduling approaches are limited by their reliance on event-specific models and explicit analytical formulations, which lack adaptability and generalization to unseen disturbances in dynamic production environments.

Method: Proposes DScheLLM with a unified LLM-based framework using fine-tuned Huawei OpenPangu Embedded-7B model with LoRA. Uses dual-system reasoning (fast-slow modes) trained on datasets generated from exact schedules by operations research solver.

Result: Fast-thinking mode efficiently generates high-quality schedules, while slow-thinking mode produces solver-compatible, well-formatted decision inputs. Demonstrated effectiveness on standard job shop scheduling benchmarks.

Conclusion: One of the earliest studies applying LLMs to dynamic job shop scheduling, highlighting their significant potential for intelligent and adaptive scheduling optimization in dynamic environments.

Abstract: Production scheduling is highly susceptible to dynamic disruptions, such as variations in processing times, machine availability, and unexpected task insertions. Conventional approaches typically rely on event-specific models and explicit analytical formulations, which limits their adaptability and generalization across previously unseen disturbances. To overcome these limitations, this paper proposes DScheLLM, a dynamic scheduling approach that leverages fine-tuned large language models within a dual-system (fast-slow) reasoning architecture to address disturbances of different scales. A unified large language model-based framework is constructed to handle dynamic events, where training datasets for both fast and slow reasoning modes are generated using exact schedules obtained from an operations research solver. The Huawei OpenPangu Embedded-7B model is subsequently fine-tuned under the hybrid reasoning paradigms using LoRA. Experimental evaluations on standard job shop scheduling benchmarks demonstrate that the fast-thinking mode can efficiently generate high-quality schedules and the slow-thinking mode can produce solver-compatible and well-formatted decision inputs. To the best of our knowledge, this work represents one of the earliest studies applying large language models to job shop scheduling in dynamic environments, highlighting their considerable potential for intelligent and adaptive scheduling optimization.

[316] Collaborative Multi-Agent Test-Time Reinforcement Learning for Reasoning

Zhiyuan Hu, Yunhai Hu, Juncheng Liu, Shuyue Stella Li, Yucheng Wang, Zhen Xu, See-Kiong Ng, Anh Tuan Luu, Xinxing Xu, Bryan Hooi, Cynthia Breazeal, Hae Won Park

Main category: cs.AI

TL;DR: MATTRL introduces test-time reinforcement learning for multi-agent systems, using structured textual experience injection during inference to improve decision-making without expensive MARL training.

Details

Motivation: Multi-agent RL training is resource-intensive and unstable due to co-adaptation non-stationarity and sparse, high-variance rewards, creating a need for more efficient and stable approaches.

Method: Forms multi-expert team for discussions, retrieves/integrates test-time experiences, reaches consensus decisions, and studies credit assignment for turn-level experience pool construction and reinjection.

Result: Improves accuracy by average 3.67% over multi-agent baseline and 8.67% over single-agent baselines across medicine, math, and education benchmarks.

Conclusion: MATTRL provides stable, effective, distribution-shift-robust multi-agent reasoning without tuning, with ablation studies showing credit assignment scheme impacts.

Abstract: Multi-agent systems have evolved into practical LLM-driven collaborators for many applications, gaining robustness from diversity and cross-checking. However, multi-agent RL (MARL) training is resource-intensive and unstable: co-adapting teammates induce non-stationarity, and rewards are often sparse and high-variance. Therefore, we introduce \textbf{Multi-Agent Test-Time Reinforcement Learning (MATTRL)}, a framework that injects structured textual experience into multi-agent deliberation at inference time. MATTRL forms a multi-expert team of specialists for multi-turn discussions, retrieves and integrates test-time experiences, and reaches consensus for final decision-making. We also study credit assignment for constructing a turn-level experience pool, then reinjecting it into the dialogue. Across challenging benchmarks in medicine, math, and education, MATTRL improves accuracy by an average of 3.67% over a multi-agent baseline, and by 8.67% over comparable single-agent baselines. Ablation studies examine different credit-assignment schemes and provide a detailed comparison of how they affect training outcomes. MATTRL offers a stable, effective and efficient path to distribution-shift-robust multi-agent reasoning without tuning.

cs.SD

[317] Diffusion-based Frameworks for Unsupervised Speech Enhancement

Jean-Eudes Ayilo, Mostafa Sadeghi, Romain Serizel, Xavier Alameda-Pineda

Main category: cs.SD

TL;DR: Unsupervised speech enhancement using diffusion models with explicit noise modeling improves performance over previous approaches that only model speech.

Details

Motivation: Previous unsupervised speech enhancement methods combine diffusion models for clean speech with NMF-based noise models, but only model speech as a latent variable. The authors aim to improve performance by explicitly modeling both speech and noise as latent variables.

Method: Two main contributions: 1) Revisiting existing framework to explicitly model both speech and acoustic noise as latent variables, jointly sampling them in the EM E-step. 2) Introducing new unsupervised SE framework replacing NMF noise prior with diffusion-based noise model, learned jointly with speech prior in a single conditional score model. Two variants: implicit noise accounting and explicit noise as latent variable.

Result: Explicit noise modeling systematically improves SE performance for both NMF-based and diffusion-based noise priors. Diffusion-based noise model achieves best overall quality and intelligibility among unsupervised methods under matched conditions. Under mismatched conditions, NMF-based explicit-noise framework is more robust and suffers less degradation than several supervised baselines.

Conclusion: Explicit joint modeling of speech and noise as latent variables improves unsupervised speech enhancement performance. Diffusion-based noise models perform best under matched conditions, while NMF-based approaches offer better robustness under mismatched conditions.

Abstract: This paper addresses $\textit{unsupervised}$ diffusion-based single-channel speech enhancement (SE). Prior work in this direction combines a score-based diffusion model trained on clean speech with a Gaussian noise model whose covariance is structured by non-negative matrix factorization (NMF). This combination is used within an iterative expectation-maximization (EM) scheme, in which a diffusion-based posterior-sampling E-step estimates the clean speech. We first revisit this framework and propose to explicitly model both speech and acoustic noise as latent variables, jointly sampling them in the E-step instead of sampling speech alone as in previous approaches. We then introduce a new unsupervised SE framework that replaces the NMF noise prior with a diffusion-based noise model, learned jointly with the speech prior in a single conditional score model. Within this framework, we derive two variants: one that implicitly accounts for noise and one that explicitly treats noise as a latent variable. Experiments on WSJ0-QUT and VoiceBank-DEMAND show that explicit noise modeling systematically improves SE performance for both NMF-based and diffusion-based noise priors. Under matched conditions, the diffusion-based noise model attains the best overall quality and intelligibility among unsupervised methods, while under mismatched conditions the proposed NMF-based explicit-noise framework is more robust and suffers less degradation than several supervised baselines. Our code will be publicly available on this $\href{https://github.com/jeaneudesAyilo/enudiffuse}{URL}$.

[318] Self-supervised restoration of singing voice degraded by pitch shifting using shallow diffusion

Yunyi Liu, Taketo Akama

Main category: cs.SD

TL;DR: A diffusion model approach for high-quality pitch shifting of singing voices by treating it as a restoration problem rather than direct signal processing.

Details

Motivation: Conventional pitch shifting methods suffer from artifacts like formant shifts and robotic coloration, especially at larger transposition jumps, which degrade singing voice quality.

Method: Reframes pitch shifting as restoration: uses lightweight mel-space diffusion model driven by frame-level acoustic features (f0, volume, content features). Training pairs created self-supervised by applying pitch shifts and reversing them to simulate artifacts while keeping ground truth.

Result: Substantially reduces pitch shift artifacts compared to classical baselines on curated singing set, measured by both statistical metrics and pairwise acoustic measures.

Conclusion: Restoration-based pitch shifting could be a viable approach for artifact-resistant transposition in vocal production workflows.

Abstract: Pitch shifting has been an essential feature in singing voice production. However, conventional signal processing approaches exhibit well known trade offs such as formant shifts and robotic coloration that becomes more severe at larger transposition jumps. This paper targets high quality pitch shifting for singing by reframing it as a restoration problem: given an audio track that has been pitch shifted (and thus contaminated by artifacts), we recover a natural sounding performance while preserving its melody and timing. Specifically, we use a lightweight, mel space diffusion model driven by frame level acoustic features such as f0, volume, and content features. We construct training pairs in a self supervised manner by applying pitch shifts and reversing them to simulate realistic artifacts while retaining ground truth. On a curated singing set, the proposed approach substantially reduces pitch shift artifacts compared to representative classical baselines, as measured by both statistical metrics and pairwise acoustic measures. The results suggest that restoration based pitch shifting could be a viable approach towards artifact resistant transposition in vocal production workflows.

[319] RSA-Bench: Benchmarking Audio Large Models in Real-World Acoustic Scenarios

Yibo Zhang, Liang Lin, Kaiwen Luo, Shilinlu Yan, Jin Wang, Yaoqi Guo, Yitian Chen, Yalan Qin, Zhenhong Zhou, Kun Wang, Li Sun

Main category: cs.SD

TL;DR: RSA-Bench is a new benchmark that evaluates Audio Large Models’ robustness using realistic acoustic environments, revealing critical weaknesses in high-order reasoning and sensitivity to semantic interference.

Details

Motivation: Existing evaluations use synthetic Gaussian noise or simple single-source interference, failing to capture the complex acoustic ecology of real-world environments where ALMs are deployed.

Method: Created RSA-Bench by naturally superimposing diverse environmental soundscapes (Pasture, Extreme Weather, Classroom, Outdoors) onto clean speech across interference intensities, evaluating models on six core tasks from perception to reasoning.

Result: Three key findings: 1) Perception-Cognition Gap - models maintain low-level recognition but collapse in high-order reasoning under stress; 2) Scenario Sensitivity - vocal-like interference is more destructive than mechanical noise; 3) Denoising Paradox - standard speech enhancement worsens performance due to semantic distortions.

Conclusion: Current ALMs have significant robustness limitations in real-world acoustic environments, particularly in high-order reasoning tasks, and traditional evaluation methods fail to capture these critical weaknesses.

Abstract: While Audio Large Models (ALMs) have achieved remarkable proficiency, their robustness remains brittle in real-world deployment. Existing evaluations largely rely on synthetic Gaussian noise or simplistic single-source interference, failing to capture the intricate, multi-layered acoustic dynamics – or Acoustic Ecology'' -- that characterize authentic physical environments. To bridge this ecological gap, we introduce \textbf{RSA-Bench}, a comprehensive robustness benchmark designed to stress-test ALLMs through high-fidelity auditory scene simulations. Unlike traditional methods, we construct evaluation samples by naturally superimposing diverse environmental soundscapes -- spanning \textit{Pasture}, \textit{Extreme Weather}, \textit{Classroom}, and \textit{Outdoors} -- onto clean speech signals across a spectrum of interference intensities. By evaluating models on six core tasks ranging from fundamental perception to complex reasoning, our study unveils three macro-level insights: \textbf{(I) The Perception-Cognition Gap:} Models maintain relative resilience in low-level recognition but suffer a \textbf{functional collapse} in high-order reasoning tasks under stress; \textbf{(II) Scenario Sensitivity:} Vocal-like’’ interference (e.g., background laughter) proves significantly more destructive than mechanical noise, challenging the model’s auditory attention mechanisms; and \textbf{(III) The Denoising Paradox:} Standard speech enhancement often exacerbates performance degradation, as ALLMs prove highly sensitive to the semantic distortions introduced by denoising artifacts.

Victor Zheleznov, Stefan Bilbao, Alec Wright, Simon King

Main category: cs.SD

TL;DR: Combines scalar auxiliary variable techniques with neural ODEs to create stable differentiable models for learning nonlinear dynamics, applied to string vibration synthesis.

Details

Motivation: Modal methods for physical modelling synthesis need extensions to nonlinear problems like high-amplitude string vibration. While scalar auxiliary variable techniques enable stable numerical solvers for nonlinear systems, and neural ODEs can learn nonlinear dynamics from data, there's a need to combine these approaches for stable differentiable learning models.

Method: Proposes combining scalar auxiliary variable techniques with neural ordinary differential equations to create stable differentiable models. Leverages analytical solutions for linear vibration of system’s modes so physical parameters remain accessible after training without needing parameter encoders in architecture.

Result: The model can be trained to reproduce nonlinear dynamics of a system, demonstrated with synthetic data for nonlinear transverse vibration of a string. Sound examples are presented as proof of concept.

Conclusion: The approach successfully combines scalar auxiliary variable techniques with neural ODEs to yield stable differentiable models capable of learning nonlinear dynamics while maintaining accessibility to physical parameters after training.

Abstract: Modal methods are a long-standing approach to physical modelling synthesis. Extensions to nonlinear problems are possible, including the case of a high-amplitude vibration of a string. A modal decomposition leads to a densely coupled nonlinear system of ordinary differential equations. Recent work in scalar auxiliary variable techniques has enabled construction of explicit and stable numerical solvers for such classes of nonlinear systems. On the other hand, machine learning approaches (in particular neural ordinary differential equations) have been successful in modelling nonlinear systems automatically from data. In this work, we examine how scalar auxiliary variable techniques can be combined with neural ordinary differential equations to yield a stable differentiable model capable of learning nonlinear dynamics. The proposed approach leverages the analytical solution for linear vibration of system’s modes so that physical parameters of a system remain easily accessible after the training without the need for a parameter encoder in the model architecture. As a proof of concept, we generate synthetic data for the nonlinear transverse vibration of a string and show that the model can be trained to reproduce the nonlinear dynamics of the system. Sound examples are presented.

[321] HeartMuLa: A Family of Open Sourced Music Foundation Models

Dongchao Yang, Yuxin Xie, Yuguo Yin, Zheyu Wang, Xiaoyu Yi, Gongxi Zhu, Xiaolong Weng, Zihan Xiong, Yingzhe Ma, Dading Cong, Jingliang Liu, Zihang Huang, Jinghan Ru, Rongjie Huang, Haoran Wan, Peixu Wang, Kuoxi Yu, Helin Wang, Liming Liang, Xianwei Zhuang, Yuanyuan Wang, Haohan Guo, Junjie Cao, Zeqian Ju, Songxiang Liu, Yuewen Cao, Heming Weng, Yuexian Zou

Main category: cs.SD

TL;DR: HeartFM is an open-source music foundation model family with four components for audio-text alignment, lyric recognition, music tokenization, and LLM-based song generation with rich user control.

Details

Motivation: To advance large-scale music understanding and generation across diverse tasks and modalities, and to demonstrate that commercial-grade music AI systems can be reproduced with academic-scale resources.

Method: A four-component framework: HeartCLAP for audio-text alignment, HeartTranscriptor for lyric recognition, HeartCodec for high-fidelity music tokenization (12.5 Hz), and HeartMuLa as an LLM-based song generator with fine-grained attribute control and short music generation modes.

Result: The system achieves Suno-level commercial-grade performance, with HeartMuLa showing significant improvement when scaled to 7B parameters, enabling high-fidelity music synthesis under rich user-controllable conditions.

Conclusion: HeartFM provides strong baselines for future music AI research and facilitates practical applications in multimodal content production, demonstrating that academic-scale resources can reproduce commercial-grade music generation systems.

Abstract: We present a family of open-source Music Foundation Models designed to advance large-scale music understanding and generation across diverse tasks and modalities. Our framework consists of four major components: (1) HeartCLAP, an audio-text alignment model; (2) HeartTranscriptor, a robust lyric recognition model optimized for real-world music scenarios; and (3) HeartCodec, a low-frame-rate (12.5 Hz) yet high-fidelity music codec tokenizer that captures long-range musical structure while preserving fine-grained acoustic details and enabling efficient autoregressive modeling; (4) HeartMuLa, an LLM-based song generation model capable of synthesizing high-fidelity music under rich, user-controllable conditions (e.g., textual style descriptions, lyrics, and reference audio). In addition, it provides two specialized modes: (i) fine-grained musical attribute control, which allows users to specify the style of different song sections (e.g., intro, verse, chorus) using natural language prompts; and (ii) short, engaging music generation, which is suitable as background music for short videos. Lastly, HeartMuLa improves significantly when scaled to 7B parameters. For the first time, we show that a Suno-level, commercial-grade system can be reproduced using academic-scale data and GPU resources. We expect these foundation models to serve as strong baselines for future research and to facilitate practical applications in multimodal content production.

[322] Audio Generation Through Score-Based Generative Modeling: Design Principles and Implementation

Ge Zhu, Yutong Wen, Zhiyao Duan

Main category: cs.SD

TL;DR: A comprehensive survey of diffusion models for audio applications, providing design principles, unified framework, and open-source codebase with case studies.

Details

Motivation: Existing reviews lack in-depth discussion of specific design choices for audio diffusion models, and there's limited principled guidance for implementation and comparison across different audio applications.

Method: Adopts score modeling as unifying framework, systematically examines training/sampling procedures and conditioning mechanisms, introduces open-source codebase (AudioDiffuser) implementing reviewed framework.

Result: Provides comprehensive review of diffusion model design for audio, demonstrates capabilities through three case studies (audio generation, speech enhancement, text-to-speech) with benchmark evaluations on standard datasets.

Conclusion: The survey offers integrated guidance for audio diffusion models, promotes reproducible research through open-source implementation, and shows practical applications across diverse audio domains.

Abstract: Diffusion models have emerged as powerful deep generative techniques, producing high-quality and diverse samples in applications in various domains including audio. While existing reviews provide overviews, there remains limited in-depth discussion of these specific design choices. The audio diffusion model literature also lacks principled guidance for the implementation of these design choices and their comparisons for different applications. This survey provides a comprehensive review of diffusion model design with an emphasis on design principles for quality improvement and conditioning for audio applications. We adopt the score modeling perspective as a unifying framework that accommodates various interpretations, including recent approaches like flow matching. We systematically examine the training and sampling procedures of diffusion models, and audio applications through different conditioning mechanisms. To provide an integrated, unified codebase and to promote reproducible research and rapid prototyping, we introduce an open-source codebase (https://github.com/gzhu06/AudioDiffuser) that implements our reviewed framework for various audio applications. We demonstrate its capabilities through three case studies: audio generation, speech enhancement, and text-to-speech synthesis, with benchmark evaluations on standard datasets.

[323] ESDD2: Environment-Aware Speech and Sound Deepfake Detection Challenge Evaluation Plan

Xueping Zhang, Han Yin, Yang Xiao, Lin Zhang, Ting Dang, Rohan Kumar Das, Ming Li

Main category: cs.SD

TL;DR: Proposes CompSpoofV2 dataset and separation-enhanced joint learning framework for detecting component-level audio deepfakes where only speech or background sounds are manipulated.

Details

Motivation: Real-world audio contains both foreground speech and background sounds. Current deepfake detection systems struggle with component-level manipulations where only one part is altered, as the unaltered component can mislead detection systems and make the audio sound more natural.

Method: Created CompSpoofV2 dataset (250k+ samples, ~283 hours) for component-level audio anti-spoofing, and developed a separation-enhanced joint learning framework. Also launched ESDD2 challenge focusing on component-level spoofing detection.

Result: CompSpoofV2 is a large-scale curated dataset for component-level audio anti-spoofing. The ESDD2 challenge will be held at ICME 2026 to advance research in this area.

Conclusion: Component-level audio deepfakes present a challenging detection scenario that requires specialized datasets and methods. The proposed CompSpoofV2 dataset and framework address this gap, with the ESDD2 challenge aiming to advance the field.

Abstract: Audio recorded in real-world environments often contains a mixture of foreground speech and background environmental sounds. With rapid advances in text-to-speech, voice conversion, and other generation models, either component can now be modified independently. Such component-level manipulations are harder to detect, as the remaining unaltered component can mislead the systems designed for whole deepfake audio, and they often sound more natural to human listeners. To address this gap, we have proposed CompSpoofV2 dataset and a separation-enhanced joint learning framework. CompSpoofV2 is a large-scale curated dataset designed for component-level audio anti-spoofing, which contains over 250k audio samples, with a total duration of approximately 283 hours. Based on the CompSpoofV2 and the separation-enhanced joint learning framework, we launch the Environment-Aware Speech and Sound Deepfake Detection Challenge (ESDD2), focusing on component-level spoofing, where both speech and environmental sounds may be manipulated or synthesized, creating a more challenging and realistic detection scenario. The challenge will be held in conjunction with the IEEE International Conference on Multimedia and Expo 2026 (ICME 2026).

[324] DSA-Tokenizer: Disentangled Semantic-Acoustic Tokenization via Flow Matching-based Hierarchical Fusion

Hanlin Zhang, Daxin Tan, Dehua Tao, Xiao Chen, Haochen Tan, Yunhe Li, Yuchen Cao, Jianping Wang, Linqi Song

Main category: cs.SD

TL;DR: DSA-Tokenizer is a speech tokenizer that explicitly disentangles speech into separate semantic and acoustic tokens using distinct optimization constraints, enabling better content-style separation for speech LLMs.

Details

Motivation: Existing speech tokenizers either prioritize semantics only, fuse semantic and acoustic information inseparably, or achieve incomplete disentanglement between content and style. There's a need for better disentanglement to enable more controllable speech generation in Speech LLMs.

Method: 1) Uses ASR supervision for semantic tokens to capture linguistic content; 2) Uses mel-spectrogram restoration for acoustic tokens to encode style; 3) Introduces hierarchical Flow-Matching decoder to eliminate rigid length constraints; 4) Employs joint reconstruction-recombination training strategy to enforce separation.

Result: Achieves high fidelity reconstruction and flexible recombination through robust disentanglement, facilitating controllable generation in speech LLMs. The approach enables explicit separation of content and style for better speech modeling.

Conclusion: Disentangled tokenization is a pivotal paradigm for future speech modeling, and DSA-Tokenizer demonstrates effective separation of semantic and acoustic information for improved speech generation control.

Abstract: Speech tokenizers serve as the cornerstone of discrete Speech Large Language Models (Speech LLMs). Existing tokenizers either prioritize semantic encoding, fuse semantic content with acoustic style inseparably, or achieve incomplete semantic-acoustic disentanglement. To achieve better disentanglement, we propose DSA-Tokenizer, which explicitly disentangles speech into discrete semantic and acoustic tokens via distinct optimization constraints. Specifically, semantic tokens are supervised by ASR to capture linguistic content, while acoustic tokens focus on mel-spectrograms restoration to encode style. To eliminate rigid length constraints between the two sequences, we introduce a hierarchical Flow-Matching decoder that further improve speech generation quality. Furthermore, We employ a joint reconstruction-recombination training strategy to enforce this separation. DSA-Tokenizer enables high fidelity reconstruction and flexible recombination through robust disentanglement, facilitating controllable generation in speech LLMs. Our analysis highlights disentangled tokenization as a pivotal paradigm for future speech modeling. Audio samples are avaialble at https://anonymous.4open.science/w/DSA_Tokenizer_demo/. The code and model will be made publicly available after the paper has been accepted.

cs.LG

Sharim Khan, Paul Landes, Adam Cross, Jimeng Sun

Main category: cs.LG

TL;DR: The paper explores using reasoning models and traditional LLMs for multi-label SDoH ICD-9 code classification on MIMIC-III hospital admissions, achieving 89% F1 score and identifying missing SDoH codes.

Details

Motivation: Social Determinants of Health (SDoH) correlate with patient outcomes but are rarely captured in structured data. While LLMs show promise in extracting SDoH from clinical text, predicting from long admissions notes with distant dependencies remains challenging.

Method: The study uses reasoning models and traditional large language models for hospital admission multi-label SDoH ICD-9 code classification on the MIMIC-III dataset. The approach exploits existing ICD-9 codes for prediction on admissions.

Result: The method achieved 89% F1 score for SDoH ICD-9 code classification. The research also identified missing SDoH codes in 139 admissions.

Conclusion: The paper demonstrates effective SDoH code classification from clinical text using reasoning models and LLMs, with contributions including performance findings, identification of missing codes, and reproducible code.

Abstract: Social Determinants of Health correlate with patient outcomes but are rarely captured in structured data. Recent attention has been given to automatically extracting these markers from clinical text to supplement diagnostic systems with knowledge of patients’ social circumstances. Large language models demonstrate strong performance in identifying Social Determinants of Health labels from sentences. However, prediction in large admissions or longitudinal notes is challenging given long distance dependencies. In this paper, we explore hospital admission multi-label Social Determinants of Health ICD-9 code classification on the MIMIC-III dataset using reasoning models and traditional large language models. We exploit existing ICD-9 codes for prediction on admissions, which achieved an 89% F1. Our contributions include our findings, missing SDoH codes in 139 admissions, and code to reproduce the results.

[326] The Geometry of Thought: Disclosing the Transformer as a Tropical Polynomial Circuit

Faruk Alpay, Bilge Senturk

Main category: cs.LG

TL;DR: Transformers’ self-attention in high-confidence regime operates in tropical semiring (max-plus algebra), converting softmax attention into tropical matrix product that performs dynamic programming for chain-of-thought reasoning.

Details

Motivation: To understand the mathematical foundations of Transformer self-attention mechanisms, particularly in the high-confidence regime, and reveal the underlying computational structure that enables chain-of-thought reasoning.

Method: Analyze the Transformer self-attention mechanism in the limit where inverse temperature β→∞ (high-confidence regime). Show that softmax attention converts to tropical matrix product in max-plus algebra, revealing it as a dynamic programming recurrence (Bellman-Ford path-finding update) on token similarity graphs.

Result: Proved that Transformer self-attention in high-confidence regime operates in tropical semiring. The tropical limit converts softmax attention into tropical matrix product, showing the forward pass executes dynamic programming on latent token similarity graphs.

Conclusion: Transformers’ chain-of-thought reasoning emerges from an inherent shortest-path/longest-path algorithm within the network’s computation, providing a new geometric perspective on how Transformers perform reasoning through dynamic programming on token relationships.

Abstract: We prove that the Transformer self-attention mechanism in the high-confidence regime ($β\to \infty$, where $β$ is an inverse temperature) operates in the tropical semiring (max-plus algebra). In particular, we show that taking the tropical limit of the softmax attention converts it into a tropical matrix product. This reveals that the Transformer’s forward pass is effectively executing a dynamic programming recurrence (specifically, a Bellman-Ford path-finding update) on a latent graph defined by token similarities. Our theoretical result provides a new geometric perspective for chain-of-thought reasoning: it emerges from an inherent shortest-path (or longest-path) algorithm being carried out within the network’s computation.

[327] TimeSAE: Sparse Decoding for Faithful Explanations of Black-Box Time Series Models

Khalid Oublal, Quentin Bouniot, Qi Gan, Stephan Clémençon, Zeynep Akata

Main category: cs.LG

TL;DR: TimeSAE: A framework for explaining black-box time series models using Sparse Autoencoders and causality, addressing limitations of existing methods that fail under distributional shifts.

Details

Motivation: Black box and pretrained models are increasingly used in time series applications, but existing explanation methods only work in-distribution and lack generalization capability, limiting their effectiveness in real-world scenarios where distributional shifts occur.

Method: Introduces TimeSAE framework based on Sparse Autoencoders (SAEs) and causality to explain black-box models for time series data, addressing sensitivity to distributional shifts.

Result: Extensive evaluations on synthetic and real-world datasets show TimeSAE provides more faithful and robust explanations compared to leading baselines, supported by both quantitative metrics and qualitative insights.

Conclusion: TimeSAE offers a more effective framework for explaining black-box time series models that generalizes better under distributional shifts, with practical implementation available in TimeSAE-Lib library.

Abstract: As black box models and pretrained models gain traction in time series applications, understanding and explaining their predictions becomes increasingly vital, especially in high-stakes domains where interpretability and trust are essential. However, most of the existing methods involve only in-distribution explanation, and do not generalize outside the training support, which requires the learning capability of generalization. In this work, we aim to provide a framework to explain black-box models for time series data through the dual lenses of Sparse Autoencoders (SAEs) and causality. We show that many current explanation methods are sensitive to distributional shifts, limiting their effectiveness in real-world scenarios. Building on the concept of Sparse Autoencoder, we introduce TimeSAE, a framework for black-box model explanation. We conduct extensive evaluations of TimeSAE on both synthetic and real-world time series datasets, comparing it to leading baselines. The results, supported by both quantitative metrics and qualitative insights, show that TimeSAE provides more faithful and robust explanations. Our code is available in an easy-to-use library TimeSAE-Lib: https://anonymous.4open.science/w/TimeSAE-571D/.

[328] Comparative Evaluation of Deep Learning-Based and WHO-Informed Approaches for Sperm Morphology Assessment

Mohammad Abbadi

Main category: cs.LG

TL;DR: Deep learning model (HuSHeM) outperforms traditional WHO criteria with inflammation index for sperm morphology assessment, showing better discrimination, calibration, and clinical utility.

Details

Motivation: Current sperm morphology assessment suffers from subjectivity, inter-observer variability, and resource limitations, necessitating more objective and reliable methods.

Method: Comparative AI framework evaluating HuSHeM (image-based deep learning model) vs. WHO(+SIRI) baseline. HuSHeM trained on high-resolution sperm images and tested on independent clinical cohort using discrimination, calibration, and clinical utility analyses.

Result: HuSHeM showed higher discriminative performance (better AUC), improved precision-recall under class imbalance, better calibration, and greater net clinical benefit across relevant thresholds compared to WHO(+SIRI).

Conclusion: Image-based deep learning offers superior predictive reliability and clinical utility over traditional methods, serving as objective decision-support tool for fertility screening without replacing clinical judgment.

Abstract: Assessment of sperm morphological quality remains a critical yet subjective component of male fertility evaluation, often limited by inter-observer variability and resource constraints. This study presents a comparative biomedical artificial intelligence framework evaluating an image-based deep learning model (HuSHeM) alongside a clinically grounded baseline derived from World Health Organization criteria augmented with the Systemic Inflammation Response Index (WHO(+SIRI)). The HuSHeM model was trained on high-resolution sperm morphology images and evaluated using an independent clinical cohort. Model performance was assessed using discrimination, calibration, and clinical utility analyses. The HuSHeM model demonstrated higher discriminative performance, as reflected by an increased area under the receiver operating characteristic curve with relatively narrow confidence intervals compared to WHO(+SIRI). Precision-recall analysis further indicated improved performance under class imbalance, with higher precision-recall area values across evaluated thresholds. Calibration analysis indicated closer agreement between predicted probabilities and observed outcomes for HuSHeM, while decision curve analysis suggested greater net clinical benefit across clinically relevant threshold probabilities. These findings suggest that image-based deep learning may offer improved predictive reliability and clinical utility compared with traditional rule-based and inflammation-augmented criteria. The proposed framework supports objective and reproducible assessment of sperm morphology and may serve as a decision-support tool within fertility screening and referral workflows. The proposed models are intended as decision-support or referral tools and are not designed to replace clinical judgment or laboratory assessment.

[329] QFed: Parameter-Compact Quantum-Classical Federated Learning

Samar Abdelghani, Soumaya Cherkaoui

Main category: cs.LG

TL;DR: Quantum-assisted federated learning framework (QFed) reduces model parameters by 77.6% while maintaining comparable accuracy to classical approaches, enabling more efficient edge computing.

Details

Motivation: Organizations need to extract collective intelligence from distributed datasets while maintaining privacy and regulatory compliance. Federated Learning helps but faces challenges with statistical heterogeneity, system diversity, and computational burden from complex models.

Method: Introduces QFed, a quantum-enabled federated learning framework that leverages quantum computing to reduce parameter counts in classical models by polylogarithmic factors, thus decreasing training overhead across edge device networks.

Result: Experimental evaluation on FashionMNIST dataset shows QFed achieves 77.6% reduction in parameter count of a VGG-like model while maintaining accuracy comparable to classical approaches in scalable environments.

Conclusion: Quantum computing can strengthen federated learning capabilities for edge devices by significantly reducing computational overhead while preserving model performance, making FL more practical for privacy-sensitive distributed applications.

Abstract: Organizations and enterprises across domains such as healthcare, finance, and scientific research are increasingly required to extract collective intelligence from distributed, siloed datasets while adhering to strict privacy, regulatory, and sovereignty requirements. Federated Learning (FL) enables collaborative model building without sharing sensitive raw data, but faces growing challenges posed by statistical heterogeneity, system diversity, and the computational burden from complex models. This study examines the potential of quantum-assisted federated learning, which could cut the number of parameters in classical models by polylogarithmic factors and thus lessen training overhead. Accordingly, we introduce QFed, a quantum-enabled federated learning framework aimed at boosting computational efficiency across edge device networks. We evaluate the proposed framework using the widely adopted FashionMNIST dataset. Experimental results show that QFed achieves a 77.6% reduction in the parameter count of a VGG-like model while maintaining an accuracy comparable to classical approaches in a scalable environment. These results point to the potential of leveraging quantum computing within a federated learning context to strengthen FL capabilities of edge devices.

[330] Eluder dimension: localise it!

Alireza Bakhtiari, Alex Ayoub, Samuel Robertson, David Janz, Csaba Szepesvári

Main category: cs.LG

TL;DR: The paper establishes limitations of standard eluder dimension analysis for achieving first-order regret bounds in RL, introduces a localized eluder dimension method, and demonstrates improved regret bounds for Bernoulli bandits and first-order bounds for finite-horizon RL.

Details

Motivation: Standard eluder dimension-based analysis cannot achieve first-order regret bounds for generalised linear model classes, creating a gap in theoretical understanding and practical performance guarantees for reinforcement learning algorithms.

Method: Introduces a localisation method for the eluder dimension that modifies the standard analysis approach, enabling more refined regret bounds that account for problem-specific structure.

Result: The localized eluder dimension analysis recovers and improves classic results for Bernoulli bandits, and achieves the first genuine first-order regret bounds for finite-horizon reinforcement learning with bounded cumulative returns.

Conclusion: Localizing the eluder dimension enables overcoming the limitations of standard analysis, leading to improved regret bounds and establishing the first first-order guarantees for finite-horizon RL tasks.

Abstract: We establish a lower bound on the eluder dimension of generalised linear model classes, showing that standard eluder dimension-based analysis cannot lead to first-order regret bounds. To address this, we introduce a localisation method for the eluder dimension; our analysis immediately recovers and improves on classic results for Bernoulli bandits, and allows for the first genuine first-order bounds for finite-horizon reinforcement learning tasks with bounded cumulative returns.

[331] A New Convergence Analysis of Plug-and-Play Proximal Gradient Descent Under Prior Mismatch

Guixian Xu, Jinglai Li, Junqi Tang

Main category: cs.LG

TL;DR: First convergence proof for plug-and-play proximal gradient descent under prior mismatch conditions

Details

Motivation: Existing PnP-PGD convergence theories require restrictive assumptions and don't address practical scenarios where denoisers are trained on different data distributions than the inference task (prior mismatch).

Method: Develops new convergence theory for plug-and-play proximal gradient descent that specifically addresses prior mismatch conditions where the denoiser is trained on a different data distribution.

Result: First convergence proof of PnP-PGD under prior mismatch, removing several restrictive and unverifiable assumptions from existing theoretical results.

Conclusion: The work provides more practical and verifiable convergence guarantees for PnP algorithms in real-world scenarios where training and inference data distributions differ.

Abstract: In this work, we provide a new convergence theory for plug-and-play proximal gradient descent (PnP-PGD) under prior mismatch where the denoiser is trained on a different data distribution to the inference task at hand. To the best of our knowledge, this is the first convergence proof of PnP-PGD under prior mismatch. Compared with the existing theoretical results for PnP algorithms, our new results removed the need for several restrictive and unverifiable assumptions.

[332] A pipeline for enabling path-specific causal fairness in observational health data

Aparajita Kashyap, Sara Matijevic, Noémie Elhadad, Steven A. Kushner, Shalmali Joshi

Main category: cs.LG

TL;DR: A pipeline for training causally fair ML models in healthcare that addresses both direct and indirect biases through path-specific causal fairness analysis.

Details

Motivation: To prevent ML models from replicating or exacerbating existing healthcare biases, particularly by addressing both direct discrimination and biases from differential healthcare access, which requires considering social and medical contexts.

Method: Maps structural fairness models to observational healthcare settings, creates a generalizable pipeline for training causally fair models using path-specific causal fairness analysis, and leverages foundation models trained without fairness constraints to generate fair downstream predictions.

Result: Develops a model-agnostic pipeline that disentangles direct and indirect sources of bias, expands characterization of fairness-accuracy tradeoffs, and demonstrates how foundation models can be adapted for causally fair predictions in healthcare tasks with known disparities.

Conclusion: The work presents a practical approach to training causally fair ML models in healthcare that addresses both direct and indirect forms of bias, providing a framework that considers specific healthcare contexts and disparities while maintaining model accuracy.

Abstract: When training machine learning (ML) models for potential deployment in a healthcare setting, it is essential to ensure that they do not replicate or exacerbate existing healthcare biases. Although many definitions of fairness exist, we focus on path-specific causal fairness, which allows us to better consider the social and medical contexts in which biases occur (e.g., direct discrimination by a clinician or model versus bias due to differential access to the healthcare system) and to characterize how these biases may appear in learned models. In this work, we map the structural fairness model to the observational healthcare setting and create a generalizable pipeline for training causally fair models. The pipeline explicitly considers specific healthcare context and disparities to define a target “fair” model. Our work fills two major gaps: first, we expand on characterizations of the “fairness-accuracy” tradeoff by detangling direct and indirect sources of bias and jointly presenting these fairness considerations alongside considerations of accuracy in the context of broadly known biases. Second, we demonstrate how a foundation model trained without fairness constraints on observational health data can be leveraged to generate causally fair downstream predictions in tasks with known social and medical disparities. This work presents a model-agnostic pipeline for training causally fair machine learning models that address both direct and indirect forms of healthcare bias.

Jacob Sander, Brian Jalaian, Venkat R. Dasari

Main category: cs.LG

TL;DR: An integrated framework combining GPTQ quantization, LoRA fine-tuning, and data distillation to compress LLMs for edge deployment while maintaining task performance.

Details

Motivation: LLMs face deployment challenges on resource-constrained edge devices due to high computational, memory, and energy demands. Need to address data acquisition, fine-tuning, and compression challenges for efficient edge deployment.

Method: Integrated framework combining GPTQ-based quantization, low-rank adaptation (LoRA), specialized data distillation, knowledge distillation via KL divergence, Bayesian hyperparameter optimization, and Muon optimizer.

Result: Achieves up to 2x memory compression (e.g., 6GB to 3GB), enables efficient inference for specialized tasks, shows superior performance on LLM benchmarks compared to GPTQ alone, with Muon optimizer enhancing resistance to accuracy decay during quantization.

Conclusion: The proposed integrated framework effectively addresses LLM deployment challenges on edge devices by combining multiple optimization techniques to reduce model size and complexity while preserving task-specific performance.

Abstract: Large Language Models (LLMs) enable advanced natural language processing but face deployment challenges on resource-constrained edge devices due to high computational, memory, and energy demands. Optimizing these models requires addressing three key challenges: acquiring task-specific data, fine-tuning for performance, and compressing models to accelerate inference while reducing resource demands. We propose an integrated framework combining GPTQ-based quantization, low-rank adaptation (LoRA), and a specialized data distillation process to significantly reduce model size and complexity while preserving or enhancing task-specific performance. By leveraging data distillation, knowledge distillation via Kullback-Leibler divergence, Bayesian hyperparameter optimization, and the Muon optimizer, our pipeline achieves up to 2x memory compression (e.g., reducing a 6GB model to 3GB) and enables efficient inference for specialized tasks. Empirical results demonstrate superior performance on standard LLM benchmarks compared to GPTQ quantization alone, with the Muon optimizer notably enhancing fine-tuned models’ resistance to accuracy decay during quantization.

Kirandeep Kaur, Vinayak Gupta, Aditya Gupta, Chirag Shah

Main category: cs.LG

TL;DR: ProPer introduces a two-agent system (DGA and RGA) for proactive AI assistants that identify and address users’ unexpressed needs through implicit dimension generation and personalized response balancing.

Details

Motivation: Current language assistants are reactive, requiring explicit user input, leaving relevant but unexpressed needs unmet. Existing proactive approaches either burden users with clarification requests or make mistimed interventions based on context extrapolation.

Method: Two-agent architecture: 1) DGA (Dimension Generating Agent) - fine-tuned LLM that generates implicit dimensions/knowledge gaps from user data; 2) RGA (Response Generating Agent) - balances explicit and implicit dimensions to create personalized proactive responses. Includes reranker for quality/diversity/task-relevance filtering.

Result: ProPer improves quality scores and win rates across multiple domains, achieving up to 84% gains in single-turn evaluation and consistent dominance in multi-turn interactions, measured using a gap-aware rubric for coverage, initiative appropriateness, and intent alignment.

Conclusion: ProPer successfully addresses the limitations of reactive assistants by systematically identifying and addressing implicit user needs through a novel two-agent architecture that generates, filters, and integrates latent dimensions for timely proactive interventions.

Abstract: Most language-based assistants follow a reactive ask-and-respond paradigm, requiring users to explicitly state their needs. As a result, relevant but unexpressed needs often go unmet. Existing proactive agents attempt to address this gap either by eliciting further clarification, preserving this burden, or by extrapolating future needs from context, often leading to unnecessary or mistimed interventions. We introduce ProPer, Proactivity-driven Personalized agents, a novel two-agent architecture consisting of a Dimension Generating Agent (DGA) and a Response Generating Agent (RGA). DGA, a fine-tuned LLM agent, leverages explicit user data to generate multiple implicit dimensions (latent aspects relevant to the user’s task but not considered by the user) or knowledge gaps. These dimensions are selectively filtered using a reranker based on quality, diversity, and task relevance. RGA then balances explicit and implicit dimensions to tailor personalized responses with timely and proactive interventions. We evaluate ProPer across multiple domains using a structured, gap-aware rubric that measures coverage, initiative appropriateness, and intent alignment. Our results show that ProPer improves quality scores and win rates across all domains, achieving up to 84% gains in single-turn evaluation and consistent dominance in multi-turn interactions.

[335] Interpolation-Based Optimization for Enforcing lp-Norm Metric Differential Privacy in Continuous and Fine-Grained Domains

Chenxi Qiu

Main category: cs.LG

TL;DR: Proposes interpolation-based framework for optimizing lp-norm metric differential privacy in fine-grained/continuous domains, using sparse anchor points and log-convex interpolation with corrected formulation for high-dimensional spaces.

Details

Motivation: Existing mDP optimization methods work well for coarse-grained domains but struggle in fine-grained/continuous settings due to computational costs of dense perturbation matrices and pointwise constraints.

Method: Optimizes perturbation distributions at sparse anchor points, interpolates via log-convex combinations, decomposes interpolation into 1D steps with corrected formulation for high-dimensional spaces, and explores joint optimization of distributions and privacy budget allocation.

Result: Method provides rigorous privacy guarantees and competitive utility in fine-grained domains, outperforming baseline mechanisms on real-world location datasets.

Conclusion: The interpolation-based framework effectively addresses computational challenges of mDP optimization in fine-grained/continuous domains while maintaining privacy guarantees and utility.

Abstract: Metric Differential Privacy (mDP) generalizes Local Differential Privacy (LDP) by adapting privacy guarantees based on pairwise distances, enabling context-aware protection and improved utility. While existing optimization-based methods reduce utility loss effectively in coarse-grained domains, optimizing mDP in fine-grained or continuous settings remains challenging due to the computational cost of constructing dense perterubation matrices and satisfying pointwise constraints. In this paper, we propose an interpolation-based framework for optimizing lp-norm mDP in such domains. Our approach optimizes perturbation distributions at a sparse set of anchor points and interpolates distributions at non-anchor locations via log-convex combinations, which provably preserve mDP. To address privacy violations caused by naive interpolation in high-dimensional spaces, we decompose the interpolation process into a sequence of one-dimensional steps and derive a corrected formulation that enforces lp-norm mDP by design. We further explore joint optimization over perturbation distributions and privacy budget allocation across dimensions. Experiments on real-world location datasets demonstrate that our method offers rigorous privacy guarantees and competitive utility in fine-grained domains, outperforming baseline mechanisms. in high-dimensional spaces, we decompose the interpolation process into a sequence of one-dimensional steps and derive a corrected formulation that enforces lp-norm mDP by design. We further explore joint optimization over perturbation distributions and privacy budget allocation across dimensions. Experiments on real-world location datasets demonstrate that our method offers rigorous privacy guarantees and competitive utility in fine-grained domains, outperforming baseline mechanisms.

[336] Kinematic Tokenization: Optimization-Based Continuous-Time Tokens for Learnable Decision Policies in Noisy Time Series

Griffin Kearney

Main category: cs.LG

TL;DR: Kinematic Tokenization uses continuous spline coefficients (position, velocity, acceleration, jerk) instead of discrete tokens for noisy time series, improving policy learning under asymmetric abstention-inducing losses.

Details

Motivation: Transformers work with discrete tokens but real-world signals are continuous and noisy. Discrete tokenizations fail in low signal-to-noise regimes, especially when asymmetric penalties encourage abstention.

Method: Kinematic Tokenization reconstructs explicit splines from noisy measurements and tokenizes local spline coefficients (position, velocity, acceleration, jerk). Applied to financial time series with trading volume profiles.

Result: On multi-asset equity testbed with risk-averse asymmetric classification, discrete baselines collapse to cash-only policies, while continuous spline tokens sustain calibrated, non-trivial action distributions and stable policies.

Conclusion: Explicit continuous-time tokens improve learnability and calibration of selective decision policies in noisy time series under abstention-inducing losses.

Abstract: Transformers are designed for discrete tokens, yet many real-world signals are continuous processes observed through noisy sampling. Discrete tokenizations (raw values, patches, finite differences) can be brittle in low signal-to-noise regimes, especially when downstream objectives impose asymmetric penalties that rationally encourage abstention. We introduce Kinematic Tokenization, an optimization-based continuous-time representation that reconstructs an explicit spline from noisy measurements and tokenizes local spline coefficients (position, velocity, acceleration, jerk). This is applied to financial time series data in the form of asset prices in conjunction with trading volume profiles. Across a multi-asset daily-equity testbed, we use a risk-averse asymmetric classification objective as a stress test for learnability. Under this objective, several discrete baselines collapse to an absorbing cash policy (the Liquidation Equilibrium), whereas the continuous spline tokens sustain calibrated, non-trivial action distributions and stable policies. These results suggest that explicit continuous-time tokens can improve the learnability and calibration of selective decision policies in noisy time series under abstention-inducing losses.

[337] A Sustainable AI Economy Needs Data Deals That Work for Generators

Ruoxi Jia, Luis Oala, Wenjie Xiong, Suqin Ge, Jiachen T. Wang, Feiyang Kang, Dawn Song

Main category: cs.LG

TL;DR: The paper argues that the current machine learning value chain is unsustainable due to economic inequality where data generators receive little value while aggregators capture most benefits, proposing an Equitable Data-Value Exchange Framework to create fairer data markets.

Details

Motivation: The motivation stems from identifying a structural inequality in the ML value chain where data processing strips economic equity from data generators, with most value accruing to aggregators, documented creator royalties being negligible, and widespread opacity in data deals - threatening the sustainability of current learning algorithms.

Method: The authors analyze 73 public data deals to quantify value distribution, identify three structural faults (missing provenance, asymmetric bargaining power, non-dynamic pricing), trace these problems along the ML value chain, and propose the Equitable Data-Value Exchange (EDVEX) Framework.

Result: Analysis shows majority of value accrues to aggregators, documented creator royalties round to zero, widespread opacity in deal terms, and identifies three structural faults that create economic inequality in the data value chain.

Conclusion: The paper concludes that the current ML value chain is structurally unsustainable and proposes the EDVEX Framework to enable equitable data-value exchange, outlining research directions for the community to contribute to fairer data deals and markets.

Abstract: We argue that the machine learning value chain is structurally unsustainable due to an economic data processing inequality: each state in the data cycle from inputs to model weights to synthetic outputs refines technical signal but strips economic equity from data generators. We show, by analyzing seventy-three public data deals, that the majority of value accrues to aggregators, with documented creator royalties rounding to zero and widespread opacity of deal terms. This is not just an economic welfare concern: as data and its derivatives become economic assets, the feedback loop that sustains current learning algorithms is at risk. We identify three structural faults - missing provenance, asymmetric bargaining power, and non-dynamic pricing - as the operational machinery of this inequality. In our analysis, we trace these problems along the machine learning value chain and propose an Equitable Data-Value Exchange (EDVEX) Framework to enable a minimal market that benefits all participants. Finally, we outline research directions where our community can make concrete contributions to data deals and contextualize our position with related and orthogonal viewpoints.

[338] An Exploratory Study to Repurpose LLMs to a Unified Architecture for Time Series Classification

Hansen He, Shuheng Li

Main category: cs.LG

TL;DR: The paper explores hybrid architectures combining specialized time series encoders with frozen LLMs for time series classification, finding that Inception-based encoders consistently yield positive performance gains.

Details

Motivation: There's growing interest in repurposing LLMs for time series classification due to their strong reasoning and generalization abilities, but prior work has focused mainly on alignment strategies rather than exploring optimal time series encoder architectures.

Method: The study evaluates hybrid architectures that combine various specialized time series encoders (Inception, convolutional, residual, transformer-based, and MLP architectures) with a frozen LLM backbone to assess their impact on performance.

Result: Among all evaluated encoder families, only the Inception model consistently yields positive performance gains when integrated with an LLM backbone for time series classification tasks.

Conclusion: The choice of time series encoder significantly impacts hybrid LLM architectures, with Inception-based models emerging as a promising direction for future LLM-driven time series learning.

Abstract: Time series classification (TSC) is a core machine learning problem with broad applications. Recently there has been growing interest in repurposing large language models (LLMs) for TSC, motivated by their strong reasoning and generalization ability. Prior work has primarily focused on alignment strategies that explicitly map time series data into the textual domain; however, the choice of time series encoder architecture remains underexplored. In this work, we conduct an exploratory study of hybrid architectures that combine specialized time series encoders with a frozen LLM backbone. We evaluate a diverse set of encoder families, including Inception, convolutional, residual, transformer-based, and multilayer perceptron architectures, among which the Inception model is the only encoder architecture that consistently yields positive performance gains when integrated with an LLM backbone. Overall, this study highlights the impact of time series encoder choice in hybrid LLM architectures and points to Inception-based models as a promising direction for future LLM-driven time series learning.

[339] In-Context Operator Learning on the Space of Probability Measures

Frank Cole, Dixi Wang, Yineng Chen, Yulong Lu, Rongjie Lai

Main category: cs.LG

TL;DR: The paper introduces in-context operator learning for optimal transport problems, where a single solution operator learns to map distribution pairs to OT maps using few-shot prompts without gradient updates at inference.

Details

Motivation: To develop a framework that can solve optimal transport problems in-context using only few-shot samples from distributions, eliminating the need for gradient-based optimization at inference time.

Method: Parameterize a solution operator and develop scaling-law theory in two regimes: nonparametric (low-intrinsic-dimension manifold tasks) and parametric (Gaussian families). Provide generalization bounds and explicit architecture that recovers exact OT maps.

Result: Established generalization bounds for in-context accuracy scaling with prompt size, intrinsic task dimension, and model capacity in nonparametric setting. In parametric setting, provided explicit architecture that recovers exact OT maps and finite-sample excess-risk bounds.

Conclusion: The framework enables efficient in-context learning of optimal transport maps using few-shot prompts, validated through synthetic transports and generative-modeling benchmarks.

Abstract: We introduce \emph{in-context operator learning on probability measure spaces} for optimal transport (OT). The goal is to learn a single solution operator that maps a pair of distributions to the OT map, using only few-shot samples from each distribution as a prompt and \emph{without} gradient updates at inference. We parameterize the solution operator and develop scaling-law theory in two regimes. In the \emph{nonparametric} setting, when tasks concentrate on a low-intrinsic-dimension manifold of source–target pairs, we establish generalization bounds that quantify how in-context accuracy scales with prompt size, intrinsic task dimension, and model capacity. In the \emph{parametric} setting (e.g., Gaussian families), we give an explicit architecture that recovers the exact OT map in context and provide finite-sample excess-risk bounds. Our numerical experiments on synthetic transports and generative-modeling benchmarks validate the framework.

[340] FaTRQ: Tiered Residual Quantization for LLM Vector Search in Far-Memory-Aware ANNS Systems

Tianqi Zhang, Flavio Ponzina, Tajana Rosing

Main category: cs.LG

TL;DR: FaTRQ is a far-memory-aware refinement system for ANNS that eliminates costly SSD reads by using tiered memory and progressive distance estimation with compact residuals, achieving 2.4× storage efficiency and up to 9× throughput improvement over SOTA GPU ANNS systems.

Details

Motivation: Modern ANNS engines still rely on expensive second-pass refinement that reads full-precision vectors from slow storage (SSDs), which dominates query latency for modern text and multimodal embeddings.

Method: 1) Progressive distance estimator refines coarse scores using compact residuals streamed from far memory, stopping early when candidates are provably outside top-k. 2) Tiered residual quantization encodes residuals as ternary values stored efficiently in far memory. 3) Custom accelerator deployed in CXL Type-2 device performs low-latency refinement locally.

Result: FaTRQ improves storage efficiency by 2.4× and improves throughput by up to 9× compared to state-of-the-art GPU ANNS systems.

Conclusion: FaTRQ successfully eliminates the bottleneck of fetching full vectors from storage by using tiered memory and progressive refinement, significantly improving both storage efficiency and query throughput for ANNS in RAG applications.

Abstract: Approximate Nearest-Neighbor Search (ANNS) is a key technique in retrieval-augmented generation (RAG), enabling rapid identification of the most relevant high-dimensional embeddings from massive vector databases. Modern ANNS engines accelerate this process using prebuilt indexes and store compressed vector-quantized representations in fast memory. However, they still rely on a costly second-pass refinement stage that reads full-precision vectors from slower storage like SSDs. For modern text and multimodal embeddings, these reads now dominate the latency of the entire query. We propose FaTRQ, a far-memory-aware refinement system using tiered memory that eliminates the need to fetch full vectors from storage. It introduces a progressive distance estimator that refines coarse scores using compact residuals streamed from far memory. Refinement stops early once a candidate is provably outside the top-k. To support this, we propose tiered residual quantization, which encodes residuals as ternary values stored efficiently in far memory. A custom accelerator is deployed in a CXL Type-2 device to perform low-latency refinement locally. Together, FaTRQ improves the storage efficiency by 2.4$\times$ and improves the throughput by up to 9$ \times$ than SOTA GPU ANNS system.

[341] Continuous-Depth Transformers with Learned Control Dynamics

Peter Jemley

Main category: cs.LG

TL;DR: Hybrid transformer replaces discrete middle layers with continuous-depth Neural ODE block, enabling inference-time control over generation attributes via learned steering signals.

Details

Motivation: Standard transformers process representations through fixed discrete layers, limiting flexibility. The paper aims to enable inference-time control over generation attributes by treating depth as a continuous variable.

Method: Replace discrete middle layers with continuous-depth Neural ODE block governed by learned vector field F_θ(H, τ, u), where u is low-dimensional control signal injected via explicit concatenation. Use adjoint method for O(1) memory training.

Result: Four key results: (1) Gradient flow stability with zero exploding/vanishing gradient events, (2) 98%/88% accuracy for positive/negative sentiment control, (3) 0.068% trajectory divergence between fixed/adaptive solvers, (4) Latency parity with standard discrete baselines. Adaptive solvers reveal geometric structure in learned dynamics.

Conclusion: Continuous-depth dynamics with learned control signals provide a viable, efficient mechanism for steerable language generation, enabling inference-time attribute control while maintaining computational efficiency.

Abstract: We present a hybrid transformer architecture that replaces discrete middle layers with a continuous-depth Neural Ordinary Differential Equation (ODE) block, enabling inference-time control over generation attributes via a learned steering signal. Unlike standard transformers that process representations through fixed discrete layers, our approach treats depth as a continuous variable governed by a learned vector field $F_θ(H, τ, u)$, where $u$ is a low-dimensional control signal injected via explicit concatenation. We validate the architecture through four experiments: (1) gradient flow stability with zero exploding/vanishing gradient events, (2) semantic steering achieving 98%/88% accuracy for positive/negative sentiment control, (3) continuous interpolation validated by a negligible 0.068% trajectory divergence between fixed and adaptive solvers, and (4) efficiency benchmarking demonstrating latency parity with standard discrete baselines. Additionally, we show that adaptive ODE solvers reveal geometric structure in the learned dynamics: the control signal partitions the vector field into distinct dynamical regimes with different curvature characteristics. The adjoint method enables $O(1)$ memory training regardless of integration depth. Our results demonstrate that continuous-depth dynamics with learned control signals provide a viable, efficient mechanism for steerable language generation.

[342] PID-Guided Partial Alignment for Multimodal Decentralized Federated Learning

Yanhang Shi, Xiaoyu Wang, Houwei Cao, Jian Li, Yong Liu

Main category: cs.LG

TL;DR: PARSE is a multimodal decentralized federated learning framework that uses partial information decomposition to enable heterogeneous agents with different modalities to collaborate effectively over peer-to-peer networks without central coordination.

Details

Motivation: Standard multimodal DFL approaches using monolithic shared embeddings cause gradient misalignment between uni- and multimodal agents, suppressing heterogeneous sharing and cross-modal interaction in decentralized peer-to-peer settings.

Method: PARSE operationalizes partial information decomposition (PID) where each agent performs feature fission to factorize latent representations into redundant, unique, and synergistic slices. It enables P2P knowledge sharing through slice-level partial alignment, exchanging only semantically shareable branches among agents with corresponding modalities.

Result: PARSE yields consistent gains over task-, modality-, and hybrid-sharing DFL baselines across benchmarks and agent mixes. It resolves uni-/multimodal gradient conflicts without needing central coordination or gradient surgery.

Conclusion: PARSE overcomes the multimodal DFL dilemma while remaining compatible with standard DFL constraints, demonstrating efficiency and robustness through ablations on fusion operators and split ratios with qualitative visualizations.

Abstract: Multimodal decentralized federated learning (DFL) is challenging because agents differ in available modalities and model architectures, yet must collaborate over peer-to-peer (P2P) networks without a central coordinator. Standard multimodal pipelines learn a single shared embedding across all modalities. In DFL, such a monolithic representation induces gradient misalignment between uni- and multimodal agents; as a result, it suppresses heterogeneous sharing and cross-modal interaction. We present PARSE, a multimodal DFL framework that operationalizes partial information decomposition (PID) in a server-free setting. Each agent performs feature fission to factorize its latent representation into redundant, unique, and synergistic slices. P2P knowledge sharing among heterogeneous agents is enabled by slice-level partial alignment: only semantically shareable branches are exchanged among agents that possess the corresponding modality. By removing the need for central coordination and gradient surgery, PARSE resolves uni-/multimodal gradient conflicts, thereby overcoming the multimodal DFL dilemma while remaining compatible with standard DFL constraints. Across benchmarks and agent mixes, PARSE yields consistent gains over task-, modality-, and hybrid-sharing DFL baselines. Ablations on fusion operators and split ratios, together with qualitative visualizations, further demonstrate the efficiency and robustness of the proposed design.

[343] CAFEDistill: Learning Personalized and Dynamic Models through Federated Early-Exit Network Distillation

Boyi Liu, Zimu Zhou, Yongxin Tong

Main category: cs.LG

TL;DR: CAFEDistill is a federated learning framework that integrates early-exit networks with personalized federated learning, addressing client heterogeneity and exit interference conflicts through progressive depth-prioritized coordination and client-decoupled communication.

Details

Motivation: Existing PFL methods produce static models with fixed accuracy-efficiency tradeoffs, limiting adaptability to varying inference requirements and resource availability. Early-exit networks offer adaptive inference but face challenges when integrated into PFL due to client heterogeneity and depth-wise exit interference conflicts.

Method: CAFEDistill uses a conflict-aware federated exit distillation framework with progressive depth-prioritized student coordination to mitigate interference between shallow and deep exits while enabling personalized knowledge transfer across clients. It employs client-decoupled formulation to reduce communication overhead.

Result: Extensive evaluations show CAFEDistill outperforms state-of-the-art methods, achieving higher accuracy while reducing inference costs by 30.79%-46.86%.

Conclusion: CAFEDistill successfully extends personalized federated learning to early-exit networks by jointly addressing client heterogeneity and exit interference conflicts, enabling adaptive inference with improved accuracy and reduced computational costs.

Abstract: Personalized Federated Learning (PFL) enables collaboratively model training on decentralized, heterogeneous data while tailoring them to each client’s unique distribution. However, existing PFL methods produce static models with a fixed tradeoff between accuracy and efficiency, limiting their applicability in environments where inference requirements vary with contexts and resource availability. Early-exit networks (EENs) offer adaptive inference by attaching intermediate classifiers. Yet integrating them into PFL is challenging due to client-wise heterogeneity and depth-wise interference arising from conflicting exit objectives. Prior studies fail to resolve both conflicts simultaneously, leading to suboptimal performance. In this paper, we propose CAFEDistill, a Conflict-Aware Federated Exit Distillation framework that jointly addresses these conflicts and extends PFL to early-exit networks. Through a progressive, depth-prioritized student coordination mechanism, CAFEDistill mitigates interference among shallow and deep exits while allowing effective personalized knowledge transfer across clients. Furthermore, it reduces communication overhead via a client-decoupled formulation. Extensive evaluations show that CAFEDistill outperforms the state-of-the-arts, achieving higher accuracy and reducing inference costs by 30.79%-46.86%.

[344] Time Aggregation Features for XGBoost Models

Mykola Pinchuk

Main category: cs.LG

TL;DR: Time aggregation features for XGBoost CTR prediction show trailing windows outperform other designs, with event count windows providing small additional gains.

Details

Motivation: To improve click-through rate prediction models by comparing different time aggregation feature designs under strict out-of-time splits and no-lookahead constraints.

Method: Used XGBoost models with time aggregation features on Avazu dataset with strict temporal splits. Compared time-aware target encoding baseline to models with entity history time aggregation using different window designs (trailing windows, event count windows, gap windows, bucketized windows). Features for hour H used only impressions from hours strictly before H.

Result: Trailing windows improved ROC AUC by 0.0066-0.0082 and PR AUC by 0.0084-0.0094 over target encoding alone. Event count windows provided only small consistent improvements over trailing windows. Gap windows and bucketized windows underperformed simple trailing windows.

Conclusion: Trailing windows are recommended as a practical default for time aggregation features in CTR prediction, with optional event count windows when marginal ROC AUC gains are important.

Abstract: This paper studies time aggregation features for XGBoost models in click-through rate prediction. The setting is the Avazu click-through rate prediction dataset with strict out-of-time splits and a no-lookahead feature constraint. Features for hour H use only impressions from hours strictly before H. This paper compares a strong time-aware target encoding baseline to models augmented with entity history time aggregation under several window designs. Across two rolling-tail folds on a deterministic ten percent sample, a trailing window specification improves ROC AUC by about 0.0066 to 0.0082 and PR AUC by about 0.0084 to 0.0094 relative to target encoding alone. Within the time aggregation design grid, event count windows provide the only consistent improvement over trailing windows, and the gain is small. Gap windows and bucketized windows underperform simple trailing windows in this dataset and protocol. These results support a practical default of trailing windows, with an optional event count window when marginal ROC AUC gains matter.

[345] BPE: Behavioral Profiling Ensemble

Yanxin Liu, Yunqi Zhang

Main category: cs.LG

TL;DR: BPE framework introduces behavioral profiling for ensemble learning, using model-specific behavioral patterns rather than inter-model divergence for integration, achieving better accuracy and efficiency.

Details

Motivation: Traditional ensemble methods treat models as holistic entities and rely on inter-model divergence, ignoring individual model competence across different instance regions and requiring heavy validation set dependence.

Method: Behavioral Profiling Ensemble (BPE) constructs a behavioral profile intrinsic to each model and derives integration weights based on deviation between model’s response to test instances and its established behavioral profile.

Result: Extensive experiments on synthetic and real-world datasets show significant improvements over state-of-the-art ensemble baselines in predictive accuracy, computational efficiency, and storage resource utilization.

Conclusion: BPE framework represents a novel paradigm shift in ensemble learning by focusing on intrinsic model behavioral profiles rather than inter-model divergence, offering superior performance and efficiency.

Abstract: Ensemble learning is widely recognized as a pivotal strategy for pushing the boundaries of predictive performance. Traditional static ensemble methods, such as Stacking, typically assign weights by treating each base learner as a holistic entity, thereby overlooking the fact that individual models exhibit varying degrees of competence across different regions of the instance space. To address this limitation, Dynamic Ensemble Selection (DES) was introduced. However, both static and dynamic approaches predominantly rely on the divergence among different models as the basis for integration. This inter-model perspective neglects the intrinsic characteristics of the models themselves and necessitates a heavy reliance on validation sets for competence estimation. In this paper, we propose the Behavioral Profiling Ensemble (BPE) framework, which introduces a novel paradigm shift. Unlike traditional methods, BPE constructs a ``behavioral profile’’ intrinsic to each model and derives integration weights based on the deviation between the model’s response to a specific test instance and its established behavioral profile. Extensive experiments on both synthetic and real-world datasets demonstrate that the algorithm derived from the BPE framework achieves significant improvements over state-of-the-art ensemble baselines. These gains are evident not only in predictive accuracy but also in computational efficiency and storage resource utilization across various scenarios.

[346] Unlabeled Data Can Provably Enhance In-Context Learning of Transformers

Renpu Liu, Jing Yang

Main category: cs.LG

TL;DR: Augmented ICL framework combines labeled examples with unlabeled data in prompts, using transformers to emulate EM algorithm for provable accuracy improvements.

Details

Motivation: Traditional ICL is limited by few labeled examples that fit in prompts, while vast unlabeled data exists. Need to leverage unlabeled data to enhance ICL performance theoretically.

Method: Propose augmented ICL framework with both labeled examples and unlabeled inputs in prompt. Use chain-of-thought prompting to make transformers emulate expectation-maximization algorithm for multi-class linear classification.

Result: Transformers can effectively extract information from both labeled and unlabeled data, leading to provable ICL accuracy improvements. Framework consistently outperforms conventional few-shot ICL in experiments.

Conclusion: First theoretical study showing transformers can leverage unlabeled data through EM emulation, providing both theoretical guarantees and empirical validation for augmented ICL.

Abstract: Large language models (LLMs) exhibit impressive in-context learning (ICL) capabilities, yet the quality of their predictions is fundamentally limited by the few costly labeled demonstrations that can fit into a prompt. Meanwhile, there exist vast and continuously growing amounts of unlabeled data that may be closely related to the ICL task. How to utilize such unlabeled data to provably enhance the performance of ICL thus becomes an emerging fundamental question. In this work, we propose a novel augmented ICL framework, in which the prompt includes a small set of labeled examples alongside a block of unlabeled inputs. We focus on the multi-class linear classification setting and demonstrate that, with chain-of-thought (CoT) prompting, a multi-layer transformer can effectively emulate an expectation-maximization (EM) algorithm. This enables the transformer to implicitly extract useful information from both labeled and unlabeled data, leading to provable improvements in ICL accuracy. Moreover, we show that such a transformer can be trained via teacher forcing, with its parameters converging to the desired solution at a linear rate. Experiments demonstrate that the augmented ICL framework consistently outperforms conventional few-shot ICL, providing empirical support for our theoretical findings. To the best of our knowledge, this is the first theoretical study on the impact of unlabeled data on the ICL performance of transformers.

[347] Efficient Content-based Recommendation Model Training via Noise-aware Coreset Selection

Hung Vinh Tran, Tong Chen, Hechuan Wen, Quoc Viet Hung Nguyen, Bin Cui, Hongzhi Yin

Main category: cs.LG

TL;DR: NaCS is a noise-aware coreset selection framework for content-based recommendation systems that identifies representative subsets while correcting noisy labels and filtering unreliable interactions, achieving 93-95% of full-dataset performance with just 1% of training data.

Details

Motivation: Content-based recommendation systems require large-scale continuous training to accommodate diverse user preferences, which is computationally expensive. Coreset selection can reduce training overhead, but existing methods are vulnerable to pervasive noise in user-item interactions, especially when coresets are minimally sized.

Method: NaCS uses submodular optimization based on training gradients to construct coresets while simultaneously correcting noisy labels using a progressively trained model. It refines the selected coreset by filtering out low-confidence samples through uncertainty quantification to avoid training with unreliable interactions.

Result: NaCS produces higher-quality coresets for CRSs while achieving better efficiency than existing coreset selection techniques. Notably, it recovers 93-95% of full-dataset training performance using merely 1% of the training data.

Conclusion: NaCS is an effective framework that addresses the noise vulnerability in coreset selection for content-based recommendation systems, enabling efficient training with minimal performance degradation through noise-aware selection and correction mechanisms.

Abstract: Content-based recommendation systems (CRSs) utilize content features to predict user-item interactions, serving as essential tools for helping users navigate information-rich web services. However, ensuring the effectiveness of CRSs requires large-scale and even continuous model training to accommodate diverse user preferences, resulting in significant computational costs and resource demands. A promising approach to this challenge is coreset selection, which identifies a small but representative subset of data samples that preserves model quality while reducing training overhead. Yet, the selected coreset is vulnerable to the pervasive noise in user-item interactions, particularly when it is minimally sized. To this end, we propose Noise-aware Coreset Selection (NaCS), a specialized framework for CRSs. NaCS constructs coresets through submodular optimization based on training gradients, while simultaneously correcting noisy labels using a progressively trained model. Meanwhile, we refine the selected coreset by filtering out low-confidence samples through uncertainty quantification, thereby avoid training with unreliable interactions. Through extensive experiments, we show that NaCS produces higher-quality coresets for CRSs while achieving better efficiency than existing coreset selection techniques. Notably, NaCS recovers 93-95% of full-dataset training performance using merely 1% of the training data. The source code is available at \href{https://github.com/chenxing1999/nacs}{https://github.com/chenxing1999/nacs}.

[348] Sparse-RL: Breaking the Memory Wall in LLM Reinforcement Learning via Stable Sparse Rollouts

Sijia Luo, Xiaokang Zhang, Yuxuan Hu, Bohan Zhang, Ke Wang, Jinbo Su, Mengshu Sun, Lei Liang, Jing Zhang

Main category: cs.LG

TL;DR: Sparse-RL enables stable reinforcement learning training for LLMs using sparse KV caches, overcoming policy mismatch issues from compression during rollouts while maintaining performance and reducing memory overhead.

Details

Motivation: RL is crucial for developing complex reasoning in LLMs, but storing full KV caches during long rollouts creates prohibitive memory overhead on limited hardware. Existing KV compression techniques work for inference but cause catastrophic performance collapse when applied to RL training due to policy mismatch.

Method: Sparse-RL addresses policy mismatch among dense old policy, sparse sampler policy, and learner policy through two key techniques: Sparsity-Aware Rejection Sampling and Importance-based Reweighting. These correct off-policy bias from compression-induced information loss.

Result: Sparse-RL significantly reduces rollout memory overhead compared to dense baselines while preserving performance. It also enables sparsity-aware training that enhances model robustness during sparse inference deployment.

Conclusion: Sparse-RL provides an effective solution for stable RL training under sparse rollouts, overcoming the critical bottleneck of KV cache memory overhead while maintaining performance and improving robustness for sparse inference scenarios.

Abstract: Reinforcement Learning (RL) has become essential for eliciting complex reasoning capabilities in Large Language Models (LLMs). However, the substantial memory overhead of storing Key-Value (KV) caches during long-horizon rollouts acts as a critical bottleneck, often prohibiting efficient training on limited hardware. While existing KV compression techniques offer a remedy for inference, directly applying them to RL training induces a severe policy mismatch, leading to catastrophic performance collapse. To address this, we introduce Sparse-RL empowers stable RL training under sparse rollouts. We show that instability arises from a fundamental policy mismatch among the dense old policy, the sparse sampler policy, and the learner policy. To mitigate this issue, Sparse-RL incorporates Sparsity-Aware Rejection Sampling and Importance-based Reweighting to correct the off-policy bias introduced by compression-induced information loss. Experimental results show that Sparse-RL reduces rollout overhead compared to dense baselines while preserving the performance. Furthermore, Sparse-RL inherently implements sparsity-aware training, significantly enhancing model robustness during sparse inference deployment.

[349] Adaptive Label Error Detection: A Bayesian Approach to Mislabeled Data Detection

Zan Chaudhry, Noam H. Rotenberg, Brian Caffo, Craig K. Jones, Haris I. Sair

Main category: cs.LG

TL;DR: ALED is a novel method for detecting mislabeled samples in machine learning datasets using deep feature extraction, manifold modeling with Gaussian distributions, and likelihood ratio tests, showing superior performance on medical imaging data.

Details

Motivation: Machine learning systems suffer from poor performance when trained with incorrect ground truth labels, even with expert annotation. As ML becomes more widespread, detecting and correcting mislabeling is crucial for developing more powerful models.

Method: ALED extracts intermediate features from a deep CNN, denoises the features, models each class’s reduced manifold with a multidimensional Gaussian distribution, and performs likelihood ratio tests to identify mislabeled samples.

Result: ALED shows markedly increased sensitivity without compromising precision compared to established methods on multiple medical imaging datasets. Fine-tuning on corrected data resulted in a 33.8% decrease in test set errors.

Conclusion: ALED provides an effective solution for label error detection with strong practical benefits, deployed as a Python package (statlab) for wider accessibility and use.

Abstract: Machine learning classification systems are susceptible to poor performance when trained with incorrect ground truth labels, even when data is well-curated by expert annotators. As machine learning becomes more widespread, it is increasingly imperative to identify and correct mislabeling to develop more powerful models. In this work, we motivate and describe Adaptive Label Error Detection (ALED), a novel method of detecting mislabeling. ALED extracts an intermediate feature space from a deep convolutional neural network, denoises the features, models the reduced manifold of each class with a multidimensional Gaussian distribution, and performs a simple likelihood ratio test to identify mislabeled samples. We show that ALED has markedly increased sensitivity, without compromising precision, compared to established label error detection methods, on multiple medical imaging datasets. We demonstrate an example where fine-tuning a neural network on corrected data results in a 33.8% decrease in test set errors, providing strong benefits to end users. The ALED detector is deployed in the Python package statlab.

[350] Bayesian Meta-Analyses Could Be More: A Case Study in Trial of Labor After a Cesarean-section Outcomes and Complications

Ashley Klein, Edward Raff, Marcia DesJardin

Main category: cs.LG

TL;DR: Bayesian approach addresses missing key decision variables in medical meta-analyses, applied to TOLAC scenarios to support physician decision-making.

Details

Motivation: Medical meta-analyses often miss key physician decision variables, leading to unreliable effect sizes and conclusions, especially in scenarios like Trial of Labor After Cesarean where few interventions exist.

Method: Developed a Bayesian approach to handle missing key decision variables in medical studies, specifically applied to evaluate TOLAC situations with OBGYN professionals.

Result: The Bayesian approach enables determination of whether positive effect claims remain warranted despite missing data, providing physicians with needed support to advance patient care in TOLAC scenarios.

Conclusion: Bayesian methodology offers a solution to address missing key decision variables in medical meta-analyses, enhancing reliability and supporting clinical decision-making in complex scenarios like TOLAC.

Abstract: The meta-analysis’s utility is dependent on previous studies having accurately captured the variables of interest, but in medical studies, a key decision variable that impacts a physician’s decisions was not captured. This results in an unknown effect size and unreliable conclusions. A Bayesian approach may allow analysis to determine if the claim of a positive effect is still warranted, and we build a Bayesian approach to this common medical scenario. To demonstrate its utility, we assist professional OBGYNs in evaluating Trial of Labor After a Cesarean-section (TOLAC) situations where few interventions are available for patients and find the support needed for physicians to advance patient care.

[351] LeMoF: Level-guided Multimodal Fusion for Heterogeneous Clinical Data

Jongseok Kim, Seongae Kang, Jonghwan Shin, Yuhan Lee, Ohyun Jo

Main category: cs.LG

TL;DR: LeMoF is a novel multimodal clinical prediction framework that uses level-guided modal fusion to selectively integrate representations from different encoder layers, achieving better performance than existing methods.

Details

Motivation: Existing multimodal clinical prediction methods rely on static modality integration and simple fusion strategies, failing to fully exploit modality-specific representations and achieve balanced performance in heterogeneous clinical environments.

Method: Proposes Level-guided Modal Fusion (LeMoF) that selectively integrates level-guided representations from different encoder layers, explicitly separating and learning global modality-level predictions from level-specific discriminative representations.

Result: LeMoF consistently outperforms existing state-of-the-art multimodal fusion techniques across various encoder configurations in length of stay prediction using ICU data, demonstrating robust predictive performance across clinical conditions.

Conclusion: Level-wise integration is a key factor for achieving robust predictive performance in multimodal clinical prediction, and LeMoF provides a balanced approach between prediction stability and discriminative capability.

Abstract: Multimodal clinical prediction is widely used to integrate heterogeneous data such as Electronic Health Records (EHR) and biosignals. However, existing methods tend to rely on static modality integration schemes and simple fusion strategies. As a result, they fail to fully exploit modality-specific representations. In this paper, we propose Level-guided Modal Fusion (LeMoF), a novel framework that selectively integrates level-guided representations within each modality. Each level refers to a representation extracted from a different layer of the encoder. LeMoF explicitly separates and learns global modality-level predictions from level-specific discriminative representations. This design enables LeMoF to achieve a balanced performance between prediction stability and discriminative capability even in heterogeneous clinical environments. Experiments on length of stay prediction using Intensive Care Unit (ICU) data demonstrate that LeMoF consistently outperforms existing state-of-the-art multimodal fusion techniques across various encoder configurations. We also confirmed that level-wise integration is a key factor in achieving robust predictive performance across various clinical conditions.

[352] Multilingual-To-Multimodal (M2M): Unlocking New Languages with Monolingual Text

Piyush Singh Pasi

Main category: cs.LG

TL;DR: METAL is a lightweight alignment method that learns linear layers using only English text to map multilingual text embeddings into multimodal space, achieving strong zero-shot transfer across languages for text-to-image retrieval and other multimodal tasks.

Details

Motivation: Multimodal models perform well in English due to abundant data but struggle with other languages due to limited multilingual multimodal resources. Current solutions rely heavily on machine translation, while advances in multilingual text modeling are underutilized.

Method: METAL learns only a few linear layers using English text alone to map multilingual text embeddings into a multimodal space. The method is lightweight and simple, focusing on transforming embedding geometry rather than performing trivial rotations.

Result: METAL matches baseline performance in English (94.9% Recall@10) and achieves strong zero-shot transfer across 11 languages (89.5% Recall@10 average, with 10 unseen languages) on XTD text-to-image retrieval. It also generalizes to audio-text retrieval and cross-lingual text-to-image generation.

Conclusion: METAL demonstrates that lightweight alignment using only English text can effectively enable multilingual multimodal capabilities, providing a simple yet powerful approach to overcome multilingual data scarcity. The authors release code, checkpoints, and multilingual evaluation datasets to facilitate further research.

Abstract: Multimodal models excel in English, supported by abundant image-text and audio-text data, but performance drops sharply for other languages due to limited multilingual multimodal resources. Existing solutions rely heavily on machine translation, while advances in multilingual text modeling remain underutilized. We introduce METAL, a lightweight alignment method that learns only a few linear layers using English text alone to map multilingual text embeddings into a multimodal space. Despite its simplicity, METAL matches baseline performance in English (94.9 percent Recall at 10) and achieves strong zero-shot transfer (89.5 percent Recall at 10 averaged across 11 languages, 10 unseen) on XTD text-to-image retrieval. Qualitative t-SNE visualizations show that multilingual embeddings align tightly with multimodal representations, while weight analysis reveals that the transformation reshapes embedding geometry rather than performing trivial rotations. Beyond image-text retrieval, METAL generalizes to audio-text retrieval and cross-lingual text-to-image generation. We release code and checkpoints at https://github.com/m2m-codebase/M2M , as well as multilingual evaluation datasets including MSCOCO Multilingual 30K (https://huggingface.co/datasets/piyushsinghpasi/mscoco-multilingual-30k ), AudioCaps Multilingual (https://huggingface.co/datasets/piyushsinghpasi/audiocaps-multilingual ), and Clotho Multilingual (https://huggingface.co/datasets/piyushsinghpasi/clotho-multilingual ), to facilitate further research.

[353] Step-by-Step Causality: Transparent Causal Discovery with Multi-Agent Tree-Query and Adversarial Confidence Estimation

Ziyi Ding, Chenfei Ye-Hao, Zheyuan Wang, Xiao-Ping Zhang

Main category: cs.LG

TL;DR: Tree-Query is a tree-structured LLM framework for pairwise causal discovery that uses interpretable queries about backdoor paths, independence, confounding, and direction to provide robust causal judgments with confidence scores.

Details

Motivation: Classical constraint-based causal discovery methods suffer from error propagation, while recent LLM-based approaches act as opaque black boxes without confidence measures. There's a need for interpretable, robust causal discovery methods that provide reliable confidence scores.

Method: Tree-Query reduces pairwise causal discovery to a sequence of structured queries about backdoor paths, (in)dependence relations, latent confounding, and causal direction. It uses a multi-expert LLM framework organized in a tree structure to make interpretable judgments with robustness-aware confidence scoring.

Result: The method provides theoretical guarantees for asymptotic identifiability of four pairwise relations. On data-free benchmarks from Mooij et al. and UCI causal graphs, Tree-Query improves structural metrics over direct LLM baselines. A diet-weight case study demonstrates effective confounder screening and stable, high-confidence causal conclusions.

Conclusion: Tree-Query offers a principled approach to obtain data-free causal priors from LLMs that can complement downstream data-driven causal discovery, providing interpretable judgments with confidence scores rather than acting as an opaque black box.

Abstract: Causal discovery aims to recover ``what causes what’’, but classical constraint-based methods (e.g., PC, FCI) suffer from error propagation, and recent LLM-based causal oracles often behave as opaque, confidence-free black boxes. This paper introduces Tree-Query, a tree-structured, multi-expert LLM framework that reduces pairwise causal discovery to a short sequence of queries about backdoor paths, (in)dependence, latent confounding, and causal direction, yielding interpretable judgments with robustness-aware confidence scores. Theoretical guarantees are provided for asymptotic identifiability of four pairwise relations. On data-free benchmarks derived from Mooij et al. and UCI causal graphs, Tree-Query improves structural metrics over direct LLM baselines, and a diet–weight case study illustrates confounder screening and stable, high-confidence causal conclusions. Tree-Query thus offers a principled way to obtain data-free causal priors from LLMs that can complement downstream data-driven causal discovery. Code is available at https://anonymous.4open.science/r/Repo-9B3E-4F96.

[354] Understanding and Preserving Safety in Fine-Tuned LLMs

Jiawen Zhang, Yangfan Hu, Kejia Chen, Lipeng He, Jiachen Ma, Jian Lou, Dan Li, Jian Liu, Xiaohu Yang, Ruoxi Jia

Main category: cs.LG

TL;DR: SPF is a lightweight fine-tuning method that removes gradient conflicts with safety subspace to maintain both utility and safety alignment in LLMs.

Details

Motivation: Fine-tuning LLMs for downstream tasks often degrades safety alignment, creating a safety-utility dilemma where improving one compromises the other. Existing methods struggle to balance both aspects effectively.

Method: Safety-Preserving Fine-tuning (SPF) analyzes geometric interactions between safety and utility gradients, identifies that safety gradients lie in low-rank subspace while utility gradients span broader space, and explicitly removes gradient components conflicting with the safety subspace during fine-tuning.

Result: SPF maintains downstream task performance while recovering nearly all pre-trained safety alignment, even under adversarial fine-tuning. It shows robust resistance to deep fine-tuning and dynamic jailbreak attacks.

Conclusion: SPF provides a practical solution to the safety-utility dilemma in LLM fine-tuning through novel mechanistic understanding of gradient interactions, enabling always-aligned LLM fine-tuning with theoretical guarantees on utility convergence and bounded safety drift.

Abstract: Fine-tuning is an essential and pervasive functionality for applying large language models (LLMs) to downstream tasks. However, it has the potential to substantially degrade safety alignment, e.g., by greatly increasing susceptibility to jailbreak attacks, even when the fine-tuning data is entirely harmless. Despite garnering growing attention in defense efforts during the fine-tuning stage, existing methods struggle with a persistent safety-utility dilemma: emphasizing safety compromises task performance, whereas prioritizing utility typically requires deep fine-tuning that inevitably leads to steep safety declination. In this work, we address this dilemma by shedding new light on the geometric interaction between safety- and utility-oriented gradients in safety-aligned LLMs. Through systematic empirical analysis, we uncover three key insights: (I) safety gradients lie in a low-rank subspace, while utility gradients span a broader high-dimensional space; (II) these subspaces are often negatively correlated, causing directional conflicts during fine-tuning; and (III) the dominant safety direction can be efficiently estimated from a single sample. Building upon these novel insights, we propose safety-preserving fine-tuning (SPF), a lightweight approach that explicitly removes gradient components conflicting with the low-rank safety subspace. Theoretically, we show that SPF guarantees utility convergence while bounding safety drift. Empirically, SPF consistently maintains downstream task performance and recovers nearly all pre-trained safety alignment, even under adversarial fine-tuning scenarios. Furthermore, SPF exhibits robust resistance to both deep fine-tuning and dynamic jailbreak attacks. Together, our findings provide new mechanistic understanding and practical guidance toward always-aligned LLM fine-tuning.

[355] Simple Network Graph Comparative Learning

Qiang Yu, Xinran Cheng, Shiqiang Xu, Chuanyi Liu

Main category: cs.LG

TL;DR: SNGCL is a novel contrastive learning method for node classification that uses multilayer Laplace smoothing filters and improved triplet loss to address issues with existing graph contrastive learning approaches.

Details

Motivation: Existing graph contrastive learning methods face two main challenges: 1) data augmentation techniques create views that differ too much from original data, weakening relevance and training efficiency, and 2) most methods rely heavily on large numbers of negative samples.

Method: SNGCL uses superimposed multilayer Laplace smoothing filters to obtain global and local feature smoothing matrices, feeds these into siamese network’s target and online networks, and employs an improved triplet recombination loss function to minimize intra-class distance while maximizing inter-class distance.

Result: Experimental comparisons with state-of-the-art models show SNGCL is strongly competitive in most node classification tasks.

Conclusion: SNGCL effectively addresses limitations of existing graph contrastive learning methods for node classification by using smoothing filters and improved loss functions, demonstrating strong performance across various tasks.

Abstract: The effectiveness of contrastive learning methods has been widely recognized in the field of graph learning, especially in contexts where graph data often lack labels or are difficult to label. However, the application of these methods to node classification tasks still faces a number of challenges. First, existing data enhancement techniques may lead to significant differences from the original view when generating new views, which may weaken the relevance of the view and affect the efficiency of model training. Second, the vast majority of existing graph comparison learning algorithms rely on the use of a large number of negative samples. To address the above challenges, this study proposes a novel node classification contrast learning method called Simple Network Graph Comparative Learning (SNGCL). Specifically, SNGCL employs a superimposed multilayer Laplace smoothing filter as a step in processing the data to obtain global and local feature smoothing matrices, respectively, which are thus passed into the target and online networks of the siamese network, and finally employs an improved triple recombination loss function to bring the intra-class distance closer and the inter-class distance farther. We have compared SNGCL with state-of-the-art models in node classification tasks, and the experimental results show that SNGCL is strongly competitive in most tasks.

[356] LOOKAT: Lookup-Optimized Key-Attention for Memory-Efficient Transformers

Aryan Karmore

Main category: cs.LG

TL;DR: LOOKAT compresses KV cache using product quantization and asymmetric distance computation, achieving 64× compression with 95.7% fidelity on GPT-2 without architecture changes.

Details

Motivation: Current KV cache quantization methods reduce storage but not bandwidth, as attention still requires dequantizing keys from INT4/INT8 to FP16. Need better compression for deploying large language models on edge devices.

Method: Applies product quantization and asymmetric distance computation to transformer architecture by decomposing key vectors into subspaces, learning codebooks, and computing attention tables via lookup tables. Transforms attention from memory-bound to compute-bound.

Result: Achieves 64× compression at 95.7% output fidelity and 32× compression at 95.0% fidelity on GPT-2. Maintains rank correlation ρ > 0.95 without architecture changes or training. Theoretical analysis shows rank correlation degrades as O(d_k/mK).

Conclusion: LOOKAT effectively compresses KV cache using vector database techniques, enabling efficient deployment of large language models on edge devices by transforming attention from memory-bound to compute-bound operations.

Abstract: Compressing the KV cache is a required step to deploy large language models on edge devices. Current quantization methods compress storage but fail to reduce bandwidth as attention calculation requires dequantizing keys from INT4/INT8 to FP16 before use. We observe that attention scoring is mathematically equivalent to the inner product similarity search and we can apply some compression techniques from vector databases to compress KV-cache better. We propose LOOKAT, which applies product quantization and asymmetric distance computation, to transformer architecture by decomposing key vectors into subspaces, learning codebooks and computing attention tables via lookup tables. This transforms attention from memory-bound to compute-bound. LOOKAT achieves 64 $\times$ compression at 95.7% output fidelity and 32 $\times$ compression at 95.0% fidelity when tested on GPT-2. LOOKAT requires no architecture changes or training while maintaining rank correlation $ρ> 0.95$. Theoretical analysis confirms that rank correlation degrades as $O(d_k/mK)$, with guarantees validated across sequence lengths up to 1024 tokens.

[357] CC-OR-Net: A Unified Framework for LTV Prediction through Structural Decoupling

Mingyu Zhao, Haoran Bai, Yu Tian, Bing Zhu, Hengliang Luo

Main category: cs.LG

TL;DR: CC-OR-Net is a novel neural framework for Customer Lifetime Value prediction that structurally decomposes ranking and regression tasks to handle zero-inflated, long-tail distributions and better identify high-value “whale” users.

Details

Motivation: LTV prediction faces challenges from zero-inflated, long-tail data distributions where low-to-medium value users numerically overwhelm high-value "whale" users, and existing methods fail to balance global accuracy with high-value precision through proper architectural design.

Method: Proposes CC-OR-Net (Conditional Cascaded Ordinal-Residual Networks) with three components: structural ordinal decomposition module for guaranteed ranking, intra-bucket residual module for fine-grained regression, and targeted high-value augmentation module for top-tier user precision.

Result: Evaluated on real-world datasets with over 300M users, CC-OR-Net achieves superior trade-off across key business metrics and outperforms state-of-the-art methods in creating commercially valuable LTV prediction solutions.

Conclusion: CC-OR-Net provides a unified framework that structurally decomposes ranking and regression tasks, offering a more robust solution for LTV prediction that better handles data distribution challenges and delivers practical business value.

Abstract: Customer Lifetime Value (LTV) prediction, a central problem in modern marketing, is characterized by a unique zero-inflated and long-tail data distribution. This distribution presents two fundamental challenges: (1) the vast majority of low-to-medium value users numerically overwhelm the small but critically important segment of high-value “whale” users, and (2) significant value heterogeneity exists even within the low-to-medium value user base. Common approaches either rely on rigid statistical assumptions or attempt to decouple ranking and regression using ordered buckets; however, they often enforce ordinality through loss-based constraints rather than inherent architectural design, failing to balance global accuracy with high-value precision. To address this gap, we propose \textbf{C}onditional \textbf{C}ascaded \textbf{O}rdinal-\textbf{R}esidual Networks \textbf{(CC-OR-Net)}, a novel unified framework that achieves a more robust decoupling through \textbf{structural decomposition}, where ranking is architecturally guaranteed. CC-OR-Net integrates three specialized components: a \textit{structural ordinal decomposition module} for robust ranking, an \textit{intra-bucket residual module} for fine-grained regression, and a \textit{targeted high-value augmentation module} for precision on top-tier users. Evaluated on real-world datasets with over 300M users, CC-OR-Net achieves a superior trade-off across all key business metrics, outperforming state-of-the-art methods in creating a holistic and commercially valuable LTV prediction solution.

[358] Bias in the Shadows: Explore Shortcuts in Encrypted Network Traffic Classification

Chuyi Wang, Xiaohui Xie, Tongze Wang, Yong Cui

Main category: cs.LG

TL;DR: BiasSeeker is a model-agnostic, data-driven framework that detects dataset-specific shortcut features in encrypted network traffic classification to address shortcut learning and improve generalization.

Details

Motivation: Pre-trained models for encrypted traffic classification often suffer from shortcut learning - relying on spurious correlations that don't generalize to real-world data. Existing solutions lack adaptability across different model architectures and deployment scenarios.

Method: BiasSeeker performs statistical correlation analysis directly on raw binary traffic to identify spurious or environment-entangled features. It introduces systematic categorization of shortcut features and applies category-specific validation strategies to reduce bias while preserving meaningful information.

Result: Evaluated on 19 public datasets across three network traffic classification tasks. The framework emphasizes context-aware feature selection and dataset-specific diagnosis for understanding shortcut learning.

Conclusion: BiasSeeker offers a novel perspective for addressing shortcut learning in encrypted traffic classification, highlighting that feature selection should be an intentional, scenario-sensitive step before model training.

Abstract: Pre-trained models operating directly on raw bytes have achieved promising performance in encrypted network traffic classification (NTC), but often suffer from shortcut learning-relying on spurious correlations that fail to generalize to real-world data. Existing solutions heavily rely on model-specific interpretation techniques, which lack adaptability and generality across different model architectures and deployment scenarios. In this paper, we propose BiasSeeker, the first semi-automated framework that is both model-agnostic and data-driven for detecting dataset-specific shortcut features in encrypted traffic. By performing statistical correlation analysis directly on raw binary traffic, BiasSeeker identifies spurious or environment-entangled features that may compromise generalization, independent of any classifier. To address the diverse nature of shortcut features, we introduce a systematic categorization and apply category-specific validation strategies that reduce bias while preserving meaningful information. We evaluate BiasSeeker on 19 public datasets across three NTC tasks. By emphasizing context-aware feature selection and dataset-specific diagnosis, BiasSeeker offers a novel perspective for understanding and addressing shortcut learning in encrypted network traffic classification, raising awareness that feature selection should be an intentional and scenario-sensitive step prior to model training.

[359] Reinforcement Learning to Discover a NorthEast Monsoon Index for Monthly Rainfall Prediction in Thailand

Kiattikun Chobtham

Main category: cs.LG

TL;DR: A novel NorthEast monsoon climate index derived from sea surface temperature improves long-term rainfall prediction in Thailand using reinforcement learning to optimize index calculation areas.

Details

Motivation: Existing global climate indices like El Niño Southern Oscillation are insufficient for accurate local-scale rainfall prediction in specific Thai regions, creating a need for region-specific climate indices.

Method: Developed a NorthEast monsoon climate index from sea surface temperature, used Deep Q-Network reinforcement learning to optimize calculation areas, clustered rainfall stations into 12 patterns, and integrated the index into Long Short-Term Memory models.

Result: The optimized index significantly improves long-term monthly rainfall prediction skill in most cluster areas and effectively reduces Root Mean Square Error for 12-month-ahead forecasts.

Conclusion: Region-specific climate indices optimized through reinforcement learning can substantially enhance local-scale rainfall prediction accuracy in Thailand, offering a promising approach for climate forecasting.

Abstract: Climate prediction is a challenge due to the intricate spatiotemporal patterns within Earth systems. Global climate indices, such as the El Niño Southern Oscillation, are standard input features for long-term rainfall prediction. However, a significant gap persists regarding local-scale indices capable of improving predictive accuracy in specific regions of Thailand. This paper introduces a novel NorthEast monsoon climate index calculated from sea surface temperature to reflect the climatology of the boreal winter monsoon. To optimise the calculated areas used for this index, a Deep Q-Network reinforcement learning agent explores and selects the most effective rectangles based on their correlation with seasonal rainfall. Rainfall stations were classified into 12 distinct clusters to distinguish rainfall patterns between southern and upper Thailand. Experimental results show that incorporating the optimised index into Long Short-Term Memory models significantly improves long-term monthly rainfall prediction skill in most cluster areas. This approach effectively reduces the Root Mean Square Error for 12-month-ahead forecasts.

[360] Graph Regularized PCA

Antonio Briola, Marwin Schmidt, Fabio Caccioli, Carlos Ros Perez, James Singleton, Christian Michler, Tomaso Aste

Main category: cs.LG

TL;DR: Graph Regularized PCA (GR-PCA) extends PCA to handle non-isotropic noise by incorporating feature dependency structure through graph-based regularization, learning sparse precision graphs and biasing loadings toward low-frequency Fourier modes of graph Laplacians.

Details

Motivation: Standard PCA assumes isotropic noise (independent and identically distributed features), which is often violated in high-dimensional data where features exhibit dependencies. This limitation motivates the development of a PCA variant that can handle non-isotropic noise and incorporate feature dependency structures.

Method: GR-PCA incorporates graph-based regularization by learning a sparse precision graph from data features and biasing PCA loadings toward low-frequency Fourier modes of the corresponding graph Laplacian. This suppresses high-frequency signals while preserving graph-coherent low-frequency ones, yielding interpretable principal components aligned with conditional relationships.

Result: GR-PCA concentrates variance on intended support, produces loadings with lower graph-Laplacian energy, and remains competitive in out-of-sample reconstruction. It prevents overfitting when high-frequency signals are present, reducing reconstruction accuracy but improving structural fidelity. The advantage over PCA is most pronounced when high-frequency signals are graph-correlated.

Conclusion: GR-PCA provides a practical, scalable, and modular approach to structure-aware dimensionality reduction that improves structural fidelity without sacrificing predictive performance, offering advantages over standard PCA when feature dependencies exist and high-frequency signals are graph-correlated.

Abstract: High-dimensional data often exhibit dependencies among variables that violate the isotropic-noise assumption under which principal component analysis (PCA) is optimal. For cases where the noise is not independent and identically distributed across features (i.e., the covariance is not spherical) we introduce Graph Regularized PCA (GR-PCA). It is a graph-based regularization of PCA that incorporates the dependency structure of the data features by learning a sparse precision graph and biasing loadings toward the low-frequency Fourier modes of the corresponding graph Laplacian. Consequently, high-frequency signals are suppressed, while graph-coherent low-frequency ones are preserved, yielding interpretable principal components aligned with conditional relationships. We evaluate GR-PCA on synthetic data spanning diverse graph topologies, signal-to-noise ratios, and sparsity levels. Compared to mainstream alternatives, it concentrates variance on the intended support, produces loadings with lower graph-Laplacian energy, and remains competitive in out-of-sample reconstruction. When high-frequency signals are present, the graph Laplacian penalty prevents overfitting, reducing the reconstruction accuracy but improving structural fidelity. The advantage over PCA is most pronounced when high-frequency signals are graph-correlated, whereas PCA remains competitive when such signals are nearly rotationally invariant. The procedure is simple to implement, modular with respect to the precision estimator, and scalable, providing a practical route to structure-aware dimensionality reduction that improves structural fidelity without sacrificing predictive performance.

[361] PRL: Process Reward Learning Improves LLMs’ Reasoning Ability and Broadens the Reasoning Boundary

Jiarui Yao, Ruida Wang, Tong Zhang

Main category: cs.LG

TL;DR: PRL (Process Reward Learning) decomposes RL objectives into intermediate steps with rigorous process rewards, improving LLM reasoning without complex additional steps like MCTS or separate reward models.

Details

Motivation: Current LLM reasoning approaches rely on outcome rewards at trajectory level, missing fine-grained supervision. Existing process signal methods require complex additional steps (MCTS, separate reward models) and lack theoretical foundation, making optimization opaque.

Method: PRL decomposes entropy regularized RL objective into intermediate steps with rigorous process rewards. Derives formulation equivalent to reward maximization plus KL-divergence penalty between policy and reference models, turning outcome rewards into process supervision signals.

Result: PRL improves average performance for LLM reasoning (average @ n) and broadens reasoning boundary (pass @ n). Extensive experiments verify effectiveness and generalizability.

Conclusion: PRL provides theoretically grounded process supervision that improves LLM reasoning efficiency and performance without complex additional training steps, offering better guidance during RL optimization.

Abstract: Improving the reasoning abilities of Large Language Models (LLMs) has been a continuous topic recently. But most relevant works are based on outcome rewards at the trajectory level, missing fine-grained supervision during the reasoning process. Other existing training frameworks that try to combine process signals together to optimize LLMs also rely heavily on tedious additional steps like MCTS, training a separate reward model, etc., doing harm to the training efficiency. Moreover, the intuition behind the process signals design lacks rigorous theoretical support, leaving the understanding of the optimization mechanism opaque. In this paper, we propose Process Reward Learning (PRL), which decomposes the entropy regularized reinforcement learning objective into intermediate steps, with rigorous process rewards that could be assigned to models accordingly. Starting from theoretical motivation, we derive the formulation of PRL that is essentially equivalent to the objective of reward maximization plus a KL-divergence penalty term between the policy model and a reference model. However, PRL could turn the outcome reward into process supervision signals, which helps better guide the exploration during RL optimization. From our experiment results, we demonstrate that PRL not only improves the average performance for LLMs’ reasoning ability measured by average @ n, but also broadens the reasoning boundary by improving the pass @ n metric. Extensive experiments show the effectiveness of PRL could be verified and generalized.

[362] Fundamental Limitations of Favorable Privacy-Utility Guarantees for DP-SGD

Murat Bilgehan Ertan, Marten van Dijk

Main category: cs.LG

TL;DR: DP-SGD has fundamental privacy-utility tradeoff limitations under worst-case adversarial models, requiring substantial noise that degrades utility in practical settings.

Details

Motivation: To understand the fundamental limitations of DP-SGD under worst-case adversarial privacy definitions, which remain poorly understood despite DP-SGD being the dominant paradigm for private training.

Method: Analyze DP-SGD in the f-differential privacy framework using hypothesis-testing trade-off curves, study shuffled sampling over single epoch with M gradient updates, derive explicit suboptimal upper bounds on achievable trade-off curves, and prove geometric lower bounds on separation κ.

Result: Proved that enforcing small separation (good privacy) imposes strict lower bound on Gaussian noise multiplier σ, showing DP-SGD cannot simultaneously achieve strong privacy and high utility. Derived specific bounds: σ ≥ 1/√(2ln M) or κ ≥ 1/√8(1-1/√(4πln M)), with same limitation extending to Poisson subsampling.

Conclusion: DP-SGD faces critical bottleneck under standard worst-case adversarial assumptions - the required noise levels for meaningful privacy cause significant accuracy degradation in realistic training settings, revealing fundamental limitations of current DP-SGD approaches.

Abstract: Differentially Private Stochastic Gradient Descent (DP-SGD) is the dominant paradigm for private training, but its fundamental limitations under worst-case adversarial privacy definitions remain poorly understood. We analyze DP-SGD in the $f$-differential privacy framework, which characterizes privacy via hypothesis-testing trade-off curves, and study shuffled sampling over a single epoch with $M$ gradient updates. We derive an explicit suboptimal upper bound on the achievable trade-off curve. This result induces a geometric lower bound on the separation $κ$ which is the maximum distance between the mechanism’s trade-off curve and the ideal random-guessing line. Because a large separation implies significant adversarial advantage, meaningful privacy requires small $κ$. However, we prove that enforcing a small separation imposes a strict lower bound on the Gaussian noise multiplier $σ$, which directly limits the achievable utility. In particular, under the standard worst-case adversarial model, shuffled DP-SGD must satisfy $σ\ge \frac{1}{\sqrt{2\ln M}}$ $\quad\text{or}\quad$ $κ\ge\ \frac{1}{\sqrt{8}}!\left(1-\frac{1}{\sqrt{4π\ln M}}\right)$, and thus cannot simultaneously achieve strong privacy and high utility. Although this bound vanishes asymptotically as $M \to \infty$, the convergence is extremely slow: even for practically relevant numbers of updates the required noise magnitude remains substantial. We further show that the same limitation extends to Poisson subsampling up to constant factors. Our experiments confirm that the noise levels implied by this bound leads to significant accuracy degradation at realistic training settings, thus showing a critical bottleneck in DP-SGD under standard worst-case adversarial assumptions.

[363] X-SAM: Boosting Sharpness-Aware Minimization with Dominant-Eigenvector Gradient Correction

Hongru Duan, Yongle Chen, Lei Guan

Main category: cs.LG

TL;DR: X-SAM improves Sharpness-Aware Minimization by explicitly aligning gradient corrections with the Hessian’s top eigenvector to better regularize sharpness and improve generalization.

Details

Motivation: SAM's optimization behavior doesn't always align with theoretical expectations - both sharp and flat regions can yield small perturbed losses, causing gradients to still point toward sharp regions and weakening SAM's intended sharpness regularization effect.

Method: Proposes X-SAM (explicit eigenvector-aligned SAM) that corrects gradients via orthogonal decomposition along the top eigenvector of the Hessian, enabling more direct regularization of the Hessian’s maximum eigenvalue.

Result: Theoretical convergence proof and superior generalization demonstrated through extensive experimental evaluations, confirming both theoretical and practical advantages over standard SAM.

Conclusion: X-SAM addresses limitations of SAM by explicitly aligning gradient corrections with sharpness directions, providing more effective sharpness regularization and improved generalization performance.

Abstract: Sharpness-Aware Minimization (SAM) aims to improve generalization by minimizing a worst-case perturbed loss over a small neighborhood of model parameters. However, during training, its optimization behavior does not always align with theoretical expectations, since both sharp and flat regions may yield a small perturbed loss. In such cases, the gradient may still point toward sharp regions, failing to achieve the intended effect of SAM. To address this issue, we investigate SAM from a spectral and geometric perspective: specifically, we utilize the angle between the gradient and the leading eigenvector of the Hessian as a measure of sharpness. Our analysis illustrates that when this angle is less than or equal to ninety degrees, the effect of SAM’s sharpness regularization can be weakened. Furthermore, we propose an explicit eigenvector-aligned SAM (X-SAM), which corrects the gradient via orthogonal decomposition along the top eigenvector, enabling more direct and efficient regularization of the Hessian’s maximum eigenvalue. We prove X-SAM’s convergence and superior generalization, with extensive experimental evaluations confirming both theoretical and practical advantages.

[364] In-Context Source and Channel Coding

Ziqiong Wang, Tianqi Ren, Rongpeng Li, Zhifeng Zhao, Honggang Zhang

Main category: cs.LG

TL;DR: Proposes In-Context Decoding (ICD) framework to enhance SSCC robustness against channel errors without transmitter modifications, using reliability-guided bit flipping and LLM-based arithmetic decoding.

Details

Motivation: SSCC suffers from cliff effect in low SNR regimes where residual bit errors after channel decoding catastrophically break lossless source decoding, especially for Arithmetic Coding driven by LLMs.

Method: Receiver-side ICD framework uses Error Correction Code Transformer for bit-wise reliability, constructs confidence-ranked candidate pool via reliability-guided bit flipping, samples diverse candidates, applies LLM-based arithmetic decoder for reconstructions and sequence-level log-likelihoods, and uses reliability-likelihood fusion rule for final output selection.

Result: Extensive experiments over AWGN and Rayleigh fading channels demonstrate consistent gains compared with conventional SSCC baselines and representative JSCC schemes.

Conclusion: ICD enhances SSCC robustness without transmitter modifications, providing theoretical guarantees on sampling stability and convergence while maintaining compatibility with existing entropy coders and channel codes.

Abstract: Separate Source-Channel Coding (SSCC) remains attractive for text transmission due to its modularity and compatibility with mature entropy coders and powerful channel codes. However, SSCC often suffers from a pronounced cliff effect in low Signal-to-Noise Ratio (SNR) regimes, where residual bit errors after channel decoding can catastrophically break lossless source decoding, especially for Arithmetic Coding (AC) driven by Large Language Models (LLMs). This paper proposes a receiver-side In-Context Decoding (ICD) framework that enhances SSCC robustness without modifying the transmitter. ICD leverages an Error Correction Code Transformer (ECCT) to obtain bit-wise reliability for the decoded information bits. Based on the context-consistent bitstream, ICD constructs a confidence-ranked candidate pool via reliability-guided bit flipping, samples a compact yet diverse subset of candidates, and applies an LLM-based arithmetic decoder to obtain both reconstructions and sequence-level log-likelihoods. A reliability-likelihood fusion rule then selects the final output. We further provide theoretical guarantees on the stability and convergence of the proposed sampling procedure. Extensive experiments over Additive White Gaussian Noise (AWGN) and Rayleigh fading channels demonstrate consistent gains compared with conventional SSCC baselines and representative Joint Source-Channel Coding (JSCC) schemes.

[365] Early Fault Detection on CMAPSS with Unsupervised LSTM Autoencoders

P. Sánchez, K. Reyes, B. Radu, E. Fernández

Main category: cs.LG

TL;DR: Unsupervised health monitoring for turbofan engines using LSTM autoencoders on normalized sensor data, with adaptive thresholds for real-time alerts.

Details

Motivation: Traditional health monitoring requires run-to-failure labels which are expensive and impractical to obtain. There's a need for unsupervised methods that can be deployed quickly across diverse fleets without manual rule tuning.

Method: 1) Remove operating-condition effects via regression-based normalization of NASA CMAPSS sensor streams. 2) Train LSTM autoencoder only on healthy portions of each trajectory. 3) Use adaptive data-driven threshold on reconstruction error for real-time alert triggering.

Result: Benchmark results show high recall and low false-alarm rates across multiple operating regimes. The method demonstrates practical deployability and scalability.

Conclusion: The framework provides a complementary early-warning layer to Remaining Useful Life models, enabling quick deployment and scaling to diverse fleets without requiring failure labels or hand-tuned rules.

Abstract: This paper introduces an unsupervised health-monitoring framework for turbofan engines that does not require run-to-failure labels. First, operating-condition effects in NASA CMAPSS sensor streams are removed via regression-based normalisation; then a Long Short-Term Memory (LSTM) autoencoder is trained only on the healthy portion of each trajectory. Persistent reconstruction error, estimated using an adaptive data-driven threshold, triggers real-time alerts without hand-tuned rules. Benchmark results show high recall and low false-alarm rates across multiple operating regimes, demonstrating that the method can be deployed quickly, scale to diverse fleets, and serve as a complementary early-warning layer to Remaining Useful Life models.

[366] Queueing-Aware Optimization of Reasoning Tokens for Accuracy-Latency Trade-offs in LLM Servers

Emre Ozbas, Melih Bastopcu

Main category: cs.LG

TL;DR: LLM server optimization for heterogeneous query streams: allocates thinking tokens per task type to balance accuracy vs latency in an M/G/1 queue, with provable optimality and convergence guarantees.

Details

Motivation: LLM servers face heterogeneous query streams with different task types requiring varying computational effort. There's a fundamental trade-off between accuracy (improved with more thinking tokens) and latency (increased with more tokens). Current systems lack principled methods to optimize this trade-off across multiple task types while maintaining queue stability.

Method: Formulate as constrained optimization: maximize weighted accuracy minus mean system time, subject to token budget and stability constraints. Model as M/G/1 queue with Poisson arrivals and FIFO service. Service time is affine in allocated tokens, accuracy has diminishing returns. Prove strict concavity of objective, derive optimality conditions as coupled projected fixed-point equations. Develop iterative solution with contraction guarantees and projected gradient method with computable step-size for convergence.

Result: Objective function is strictly concave over stability region, ensuring unique optimal token allocation. Derive first-order optimality conditions and contraction condition for iterative solution. Develop projected gradient method with global convergence guarantee. Integer allocations via rounding with evaluated performance loss in simulations.

Conclusion: Provides principled framework for optimizing LLM server performance across heterogeneous tasks. Theoretical guarantees ensure optimality and convergence. Practical implementation via rounding yields near-optimal performance with minimal loss, enabling efficient resource allocation in real-world LLM serving systems.

Abstract: We consider a single large language model (LLM) server that serves a heterogeneous stream of queries belonging to $N$ distinct task types. Queries arrive according to a Poisson process, and each type occurs with a known prior probability. For each task type, the server allocates a fixed number of internal thinking tokens, which determines the computational effort devoted to that query. The token allocation induces an accuracy-latency trade-off: the service time follows an approximately affine function of the allocated tokens, while the probability of a correct response exhibits diminishing returns. Under a first-in, first-out (FIFO) service discipline, the system operates as an $M/G/1$ queue, and the mean system time depends on the first and second moments of the resulting service-time distribution. We formulate a constrained optimization problem that maximizes a weighted average accuracy objective penalized by the mean system time, subject to architectural token-budget constraints and queue-stability conditions. The objective function is shown to be strictly concave over the stability region, which ensures existence and uniqueness of the optimal token allocation. The first-order optimality conditions yield a coupled projected fixed-point characterization of the optimum, together with an iterative solution and an explicit sufficient condition for contraction. Moreover, a projected gradient method with a computable global step-size bound is developed to guarantee convergence beyond the contractive regime. Finally, integer-valued token allocations are attained via rounding of the continuous solution, and the resulting performance loss is evaluated in simulation results.

[367] SPIKE: Sparse Koopman Regularization for Physics-Informed Neural Networks

Jose Marie Antonio Minoza

Main category: cs.LG

TL;DR: SPIKE framework combines PINNs with continuous-time Koopman operators to improve generalization and stability in solving differential equations.

Details

Motivation: PINNs tend to overfit within training domains and perform poorly when extrapolating beyond trained spatiotemporal regions, limiting their practical utility for long-term predictions and generalization.

Method: SPIKE regularizes PINNs with continuous-time Koopman operators, enforcing linear dynamics dz/dt = Az in a learned observable space. It uses L1 regularization on A to learn sparse generator matrices, embodying the parsimony principle. The continuous-time formulation with matrix exponential integration provides unconditional stability.

Result: Experiments across various PDE types (parabolic, hyperbolic, dispersive, stiff) and systems (Navier-Stokes, Lorenz) show consistent improvements in temporal extrapolation, spatial generalization, and long-term prediction accuracy compared to standard PINNs.

Conclusion: SPIKE successfully addresses PINNs’ overfitting and generalization limitations by incorporating Koopman operator theory, learning parsimonious dynamics representations that improve extrapolation capabilities while maintaining stability for stiff systems.

Abstract: Physics-Informed Neural Networks (PINNs) provide a mesh-free approach for solving differential equations by embedding physical constraints into neural network training. However, PINNs tend to overfit within the training domain, leading to poor generalization when extrapolating beyond trained spatiotemporal regions. This work presents SPIKE (Sparse Physics-Informed Koopman-Enhanced), a framework that regularizes PINNs with continuous-time Koopman operators to learn parsimonious dynamics representations. By enforcing linear dynamics $dz/dt = Az$ in a learned observable space, both PIKE (without explicit sparsity) and SPIKE (with L1 regularization on $A$) learn sparse generator matrices, embodying the parsimony principle that complex dynamics admit low-dimensional structure. Experiments across parabolic, hyperbolic, dispersive, and stiff PDEs, including fluid dynamics (Navier-Stokes) and chaotic ODEs (Lorenz), demonstrate consistent improvements in temporal extrapolation, spatial generalization, and long-term prediction accuracy. The continuous-time formulation with matrix exponential integration provides unconditional stability for stiff systems while avoiding diagonal dominance issues inherent in discrete-time Koopman operators.

[368] We Need a More Robust Classifier: Dual Causal Learning Empowers Domain-Incremental Time Series Classification

Zhipeng Liu, Peibo Duan, Xuan Tang, Haodong Jing, Mingyang Geng, Yongsheng Huang, Jialu Xu, Bin Zhang, Binwu Wang

Main category: cs.LG

TL;DR: DualCD: A lightweight dual-causal disentanglement framework for robust time series classification in domain incremental learning scenarios.

Details

Motivation: Existing time series classification models struggle with domain incremental learning, where models need to adapt to new domains while retaining knowledge from previous ones. Current approaches lack robustness in handling domain shifts and confounders.

Method: Proposes DualCD framework with: 1) Temporal feature disentanglement module to separate class-causal features from spurious features; 2) Dual-causal intervention mechanism that eliminates intra-class and inter-class confounding features by constructing variant samples and using causal intervention loss.

Result: Extensive experiments on multiple datasets and models show DualCD effectively improves performance in domain incremental scenarios. The authors also create a comprehensive benchmark for domain incremental time series classification research.

Conclusion: DualCD provides a robust, lightweight solution for domain incremental time series classification that can be seamlessly integrated into existing models, addressing key challenges in handling domain shifts and confounding features.

Abstract: The World Wide Web thrives on intelligent services that rely on accurate time series classification, which has recently witnessed significant progress driven by advances in deep learning. However, existing studies face challenges in domain incremental learning. In this paper, we propose a lightweight and robust dual-causal disentanglement framework (DualCD) to enhance the robustness of models under domain incremental scenarios, which can be seamlessly integrated into time series classification models. Specifically, DualCD first introduces a temporal feature disentanglement module to capture class-causal features and spurious features. The causal features can offer sufficient predictive power to support the classifier in domain incremental learning settings. To accurately capture these causal features, we further design a dual-causal intervention mechanism to eliminate the influence of both intra-class and inter-class confounding features. This mechanism constructs variant samples by combining the current class’s causal features with intra-class spurious features and with causal features from other classes. The causal intervention loss encourages the model to accurately predict the labels of these variant samples based solely on the causal features. Extensive experiments on multiple datasets and models demonstrate that DualCD effectively improves performance in domain incremental scenarios. We summarize our rich experiments into a comprehensive benchmark to facilitate research in domain incremental time series classification.

[369] Meta Dynamic Graph for Traffic Flow Prediction

Yiqing Zou, Hanning Yuan, Qianyu Yang, Ziqiang Yuan, Shuliang Wang, Sijie Ruan

Main category: cs.LG

TL;DR: MetaDG is a novel traffic prediction framework that uses dynamic graph structures of node representations to model spatio-temporal dynamics, generating both dynamic adjacency matrices and meta-parameters to unify spatio-temporal heterogeneity modeling.

Details

Motivation: Current traffic prediction methods have limitations: 1) dynamic modeling is often limited to spatial topology changes only, and 2) spatial and temporal heterogeneity are modeled separately. There's a need to extend dynamic modeling beyond topology and unify spatio-temporal heterogeneity capture.

Method: Proposes Meta Dynamic Graph (MetaDG) framework that leverages dynamic graph structures of node representations to explicitly model spatio-temporal dynamics. It generates both dynamic adjacency matrices and meta-parameters, extending dynamic modeling beyond topology while unifying spatio-temporal heterogeneity into a single dimension.

Result: Extensive experiments on four real-world datasets validate the effectiveness of MetaDG for traffic flow prediction.

Conclusion: MetaDG successfully addresses limitations of existing methods by extending dynamic modeling beyond spatial topology and unifying spatio-temporal heterogeneity capture through dynamic graph structures of node representations.

Abstract: Traffic flow prediction is a typical spatio-temporal prediction problem and has a wide range of applications. The core challenge lies in modeling the underlying complex spatio-temporal dependencies. Various methods have been proposed, and recent studies show that the modeling of dynamics is useful to meet the core challenge. While handling spatial dependencies and temporal dependencies using separate base model structures may hinder the modeling of spatio-temporal correlations, the modeling of dynamics can bridge this gap. Incorporating spatio-temporal heterogeneity also advances the main goal, since it can extend the parameter space and allow more flexibility. Despite these advances, two limitations persist: 1) the modeling of dynamics is often limited to the dynamics of spatial topology (e.g., adjacency matrix changes), which, however, can be extended to a broader scope; 2) the modeling of heterogeneity is often separated for spatial and temporal dimensions, but this gap can also be bridged by the modeling of dynamics. To address the above limitations, we propose a novel framework for traffic prediction, called Meta Dynamic Graph (MetaDG). MetaDG leverages dynamic graph structures of node representations to explicitly model spatio-temporal dynamics. This generates both dynamic adjacency matrices and meta-parameters, extending dynamic modeling beyond topology while unifying the capture of spatio-temporal heterogeneity into a single dimension. Extensive experiments on four real-world datasets validate the effectiveness of MetaDG.

[370] SuS: Strategy-aware Surprise for Intrinsic Exploration

Mark Kashirskiy, Ilya Makarov

Main category: cs.LG

TL;DR: SuS is a novel intrinsic motivation framework using pre-post prediction mismatch for exploration in RL, combining Strategy Stability and Strategy Surprise to improve LLM performance on math reasoning tasks.

Details

Motivation: Traditional curiosity-driven methods rely solely on state prediction error, which may not capture strategic novelty. The authors aim to create a more sophisticated exploration signal that considers behavioral strategy consistency and unexpected outcomes relative to current strategy.

Method: SuS introduces two components: Strategy Stability (SS) measures behavioral strategy consistency across temporal steps, and Strategy Surprise (SuS) captures unexpected outcomes relative to the agent’s current strategy representation. These are combined through learned weighting coefficients in a reward formulation.

Result: SuS achieves 17.4% improvement in Pass@1 and 26.4% improvement in Pass@5 compared to baselines on mathematical reasoning tasks using LLMs. Ablation studies show at least 10% performance degradation when removing either component, validating their synergistic nature.

Conclusion: The Strategy-aware Surprise framework effectively enhances exploration in reinforcement learning by combining strategy stability and surprise signals, leading to significant improvements in both accuracy and solution diversity for mathematical reasoning tasks with LLMs.

Abstract: We propose Strategy-aware Surprise (SuS), a novel intrinsic motivation framework that uses pre-post prediction mismatch as a novelty signal for exploration in reinforcement learning. Unlike traditional curiosity-driven methods that rely solely on state prediction error, SuS introduces two complementary components: Strategy Stability (SS) and Strategy Surprise (SuS). SS measures consistency in behavioral strategy across temporal steps, while SuS captures unexpected outcomes relative to the agent’s current strategy representation. Our combined reward formulation leverages both signals through learned weighting coefficients. We evaluate SuS on mathematical reasoning tasks using large language models, demonstrating significant improvements in both accuracy and solution diversity. Ablation studies confirm that removing either component results in at least 10% performance degradation, validating the synergistic nature of our approach. SuS achieves 17.4% improvement in Pass@1 and 26.4% improvement in Pass@5 compared to baseline methods, while maintaining higher strategy diversity throughout training.

[371] EvoMorph: Counterfactual Explanations for Continuous Time-Series Extrinsic Regression Applied to Photoplethysmography

Mesut Ceylan, Alexis Tabin, Patrick Langer, Elgar Fleisch, Filipe Barata

Main category: cs.LG

TL;DR: EvoMorph: Evolutionary framework for generating physiologically plausible counterfactual explanations for time-series regression on biomedical signals like PPG.

Details

Motivation: Existing counterfactual explanation methods for time series are limited to classification, ignore waveform morphology, and produce unrealistic signals, making them unsuitable for clinical biomedical time-series applications where physiological plausibility and trust are critical.

Method: Multi-objective evolutionary framework that optimizes morphology-aware objectives using interpretable signal descriptors and applies transformations to preserve waveform structure for generating diverse counterfactuals.

Result: Evaluated on three PPG datasets (heart rate, respiratory rate, oxygen saturation) against nearest-unlike-neighbor baseline. Successfully generated physiologically plausible counterfactuals and demonstrated utility for uncertainty quantification by relating counterfactual sensitivity to ensemble uncertainty and data density.

Conclusion: EvoMorph enables physiologically-aware counterfactual generation for continuous biomedical signals and supports uncertainty-aware interpretability, advancing trustworthy model analysis for clinical time-series applications.

Abstract: Wearable devices enable continuous, population-scale monitoring of physiological signals, such as photoplethysmography (PPG), creating new opportunities for data-driven clinical assessment. Time-series extrinsic regression (TSER) models increasingly leverage PPG signals to estimate clinically relevant outcomes, including heart rate, respiratory rate, and oxygen saturation. For clinical reasoning and trust, however, single point estimates alone are insufficient: clinicians must also understand whether predictions are stable under physiologically plausible variations and to what extent realistic, attainable changes in physiological signals would meaningfully alter a model’s prediction. Counterfactual explanations (CFE) address these “what-if” questions, yet existing time series CFE generation methods are largely restricted to classification, overlook waveform morphology, and often produce physiologically implausible signals, limiting their applicability to continuous biomedical time series. To address these limitations, we introduce EvoMorph, a multi-objective evolutionary framework for generating physiologically plausible and diverse CFE for TSER applications. EvoMorph optimizes morphology-aware objectives defined on interpretable signal descriptors and applies transformations to preserve the waveform structure. We evaluated EvoMorph on three PPG datasets (heart rate, respiratory rate, and oxygen saturation) against a nearest-unlike-neighbor baseline. In addition, in a case study, we evaluated EvoMorph as a tool for uncertainty quantification by relating counterfactual sensitivity to bootstrap-ensemble uncertainty and data-density measures. Overall, EvoMorph enables the generation of physiologically-aware counterfactuals for continuous biomedical signals and supports uncertainty-aware interpretability, advancing trustworthy model analysis for clinical time-series applications.

[372] PLGC: Pseudo-Labeled Graph Condensation

Jay Nandy, Arnab Kumar Mondal, Anuj Rathore, Mahesh Chandran

Main category: cs.LG

TL;DR: PLGC is a self-supervised graph condensation method that generates synthetic graphs without requiring ground-truth labels, using pseudo-labels from node embeddings to match structural and feature statistics.

Details

Motivation: Existing graph condensation methods rely on clean supervised labels, which limits their reliability when labels are scarce, noisy, or inconsistent. There's a need for robust condensation methods that work in noisy or weakly-labeled environments.

Method: PLGC constructs latent pseudo-labels from node embeddings and optimizes condensed graphs to match the original graph’s structural and feature statistics. It jointly learns latent prototypes and node assignments in a label-free manner.

Result: PLGC achieves competitive performance with state-of-the-art supervised condensation methods on clean datasets and exhibits substantial robustness under label noise, often outperforming all baselines by a significant margin.

Conclusion: Self-supervised graph condensation offers practical and theoretical advantages in noisy or weakly-labeled environments, with PLGC providing theoretical guarantees that pseudo-labels preserve latent structural statistics and ensure accurate embedding alignment.

Abstract: Large graph datasets make training graph neural networks (GNNs) computationally costly. Graph condensation methods address this by generating small synthetic graphs that approximate the original data. However, existing approaches rely on clean, supervised labels, which limits their reliability when labels are scarce, noisy, or inconsistent. We propose Pseudo-Labeled Graph Condensation (PLGC), a self-supervised framework that constructs latent pseudo-labels from node embeddings and optimizes condensed graphs to match the original graph’s structural and feature statistics – without requiring ground-truth labels. PLGC offers three key contributions: (1) A diagnosis of why supervised condensation fails under label noise and distribution shift. (2) A label-free condensation method that jointly learns latent prototypes and node assignments. (3) Theoretical guarantees showing that pseudo-labels preserve latent structural statistics of the original graph and ensure accurate embedding alignment. Empirically, across node classification and link prediction tasks, PLGC achieves competitive performance with state-of-the-art supervised condensation methods on clean datasets and exhibits substantial robustness under label noise, often outperforming all baselines by a significant margin. Our findings highlight the practical and theoretical advantages of self-supervised graph condensation in noisy or weakly-labeled environments.

[373] Discrete Feynman-Kac Correctors

Mohsin Hasan, Viktor Ohanesian, Artem Gazizov, Yoshua Bengio, Alán Aspuru-Guzik, Roberto Bondesan, Marta Skreta, Kirill Neklyudov

Main category: cs.LG

TL;DR: A framework called Discrete Feynman-Kac Correctors enables flexible control over generated distributions in discrete diffusion models at inference time without additional training, using Sequential Monte Carlo algorithms for annealing, combining multiple processes, and reward-guided sampling.

Details

Motivation: Discrete diffusion models lack flexible control over generated sample distributions, limiting their practical applications despite their advantages in capturing hierarchical interdependencies in sequence data.

Method: Proposes Discrete Feynman-Kac Correctors framework using Sequential Monte Carlo algorithms to control trained discrete diffusion models at inference time. Methods include: temperature control (annealing), sampling from product of marginals of multiple diffusion processes, and sampling from product of marginal with external reward functions.

Result: Enables control over generated distributions without additional training or fine-tuning. Applications demonstrated include: efficient sampling from annealed Boltzmann distribution of Ising model, improved language model performance for code generation and amortized learning, and reward-tilted protein sequence generation.

Conclusion: The framework provides flexible inference-time control over discrete diffusion models, expanding their practical utility across diverse domains while maintaining the original model architecture without retraining.

Abstract: Discrete diffusion models have recently emerged as a promising alternative to the autoregressive approach for generating discrete sequences. Sample generation via gradual denoising or demasking processes allows them to capture hierarchical non-sequential interdependencies in the data. These custom processes, however, do not assume a flexible control over the distribution of generated samples. We propose Discrete Feynman-Kac Correctors, a framework that allows for controlling the generated distribution of discrete masked diffusion models at inference time. We derive Sequential Monte Carlo (SMC) algorithms that, given a trained discrete diffusion model, control the temperature of the sampled distribution (i.e. perform annealing), sample from the product of marginals of several diffusion processes (e.g. differently conditioned processes), and sample from the product of the marginal with an external reward function, producing likely samples from the target distribution that also have high reward. Notably, our framework does not require any training of additional models or fine-tuning of the original model. We illustrate the utility of our framework in several applications including: efficient sampling from the annealed Boltzmann distribution of the Ising model, improving the performance of language models for code generation and amortized learning, as well as reward-tilted protein sequence generation.

[374] CS-GBA: A Critical Sample-based Gradient-guided Backdoor Attack for Offline Reinforcement Learning

Yuanjie Zhao, Junnan Qiu, Yue Ding, Jie Li

Main category: cs.LG

TL;DR: CS-GBA is a stealthy backdoor attack on offline RL that uses critical sample selection, correlation-breaking triggers, and gradient-guided action generation to evade safety constraints with minimal poisoning.

Details

Motivation: Existing backdoor attacks on offline RL are ineffective against safety-constrained algorithms like CQL due to inefficient random poisoning and easily detectable OOD triggers. There's a need for more stealthy and destructive attacks under strict budget constraints.

Method: Three key components: 1) Adaptive Critical Sample Selection focusing on transitions with high TD errors, 2) Correlation-Breaking Trigger mechanism using physical mutual exclusivity of state features to avoid OOD detection, and 3) Gradient-Guided Action Generation that searches for worst-case actions using the victim Q-network’s gradient instead of label inversion.

Result: Significantly outperforms state-of-the-art baselines on D4RL benchmarks, achieving high attack success rates against safety-constrained algorithms with only 5% poisoning budget while maintaining clean environment performance.

Conclusion: CS-GBA demonstrates that carefully designed backdoor attacks can effectively compromise safety-constrained offline RL algorithms through strategic sample selection, stealthy triggers, and gradient-guided action manipulation, highlighting vulnerabilities in current defensive mechanisms.

Abstract: Offline Reinforcement Learning (RL) enables policy optimization from static datasets but is inherently vulnerable to backdoor attacks. Existing attack strategies typically struggle against safety-constrained algorithms (e.g., CQL) due to inefficient random poisoning and the use of easily detectable Out-of-Distribution (OOD) triggers. In this paper, we propose CS-GBA (Critical Sample-based Gradient-guided Backdoor Attack), a novel framework designed to achieve high stealthiness and destructiveness under a strict budget. Leveraging the theoretical insight that samples with high Temporal Difference (TD) errors are pivotal for value function convergence, we introduce an adaptive Critical Sample Selection strategy that concentrates the attack budget on the most influential transitions. To evade OOD detection, we propose a Correlation-Breaking Trigger mechanism that exploits the physical mutual exclusivity of state features (e.g., 95th percentile boundaries) to remain statistically concealed. Furthermore, we replace the conventional label inversion with a Gradient-Guided Action Generation mechanism, which searches for worst-case actions within the data manifold using the victim Q-network’s gradient. Empirical results on D4RL benchmarks demonstrate that our method significantly outperforms state-of-the-art baselines, achieving high attack success rates against representative safety-constrained algorithms with a minimal 5% poisoning budget, while maintaining the agent’s performance in clean environments.

[375] Reinforcement Learning with Multi-Step Lookahead Information Via Adaptive Batching

Nadav Merlis

Main category: cs.LG

TL;DR: The paper studies tabular RL with multi-step lookahead information, proposes adaptive batching policies to overcome limitations of existing heuristics, and develops an optimistic algorithm with near-optimal regret bounds.

Details

Motivation: Existing approaches for handling multi-step lookahead information in RL (fixed batching and model predictive control) have limitations. Fixed batching uses predefined chunk sizes regardless of state, while MPC may be suboptimal. The authors aim to develop more effective policies that can adaptively use lookahead information based on state context.

Method: Propose adaptive batching policies (ABPs) that process lookahead information in state-dependent batches. Derive optimal Bellman equations for ABPs and design an optimistic regret-minimizing algorithm to learn optimal ABPs in unknown environments.

Result: Developed an algorithm that learns optimal adaptive batching policies with regret bounds that are order-optimal up to a factor of the lookahead horizon ℓ (typically small). The approach overcomes limitations of fixed batching and MPC.

Conclusion: Adaptive batching policies provide a principled approach to leverage multi-step lookahead information in RL, with theoretical guarantees and practical advantages over existing heuristics. The method enables efficient learning in unknown environments with near-optimal performance.

Abstract: We study tabular reinforcement learning problems with multiple steps of lookahead information. Before acting, the learner observes $\ell$ steps of future transition and reward realizations: the exact state the agent would reach and the rewards it would collect under any possible course of action. While it has been shown that such information can drastically boost the value, finding the optimal policy is NP-hard, and it is common to apply one of two tractable heuristics: processing the lookahead in chunks of predefined sizes (‘fixed batching policies’), and model predictive control. We first illustrate the problems with these two approaches and propose utilizing the lookahead in adaptive (state-dependent) batches; we refer to such policies as adaptive batching policies (ABPs). We derive the optimal Bellman equations for these strategies and design an optimistic regret-minimizing algorithm that enables learning the optimal ABP when interacting with unknown environments. Our regret bounds are order-optimal up to a potential factor of the lookahead horizon $\ell$, which can usually be considered a small constant.

[376] DeFlow: Decoupling Manifold Modeling and Value Maximization for Offline Policy Extraction

Zhancun Mu

Main category: cs.LG

TL;DR: DeFlow is a decoupled offline RL framework that uses flow matching to capture complex behavior manifolds, avoiding expensive backpropagation through ODE solvers by learning a lightweight refinement module within a data-derived trust region.

Details

Motivation: Optimizing generative policies in offline RL is computationally prohibitive due to the need for backpropagation through ODE solvers. Current approaches either sacrifice iterative generation capability through single-step distillation or face stability issues from balancing multiple loss terms.

Method: DeFlow learns a lightweight refinement module within an explicit, data-derived trust region of the flow manifold. This decoupled approach bypasses solver differentiation and eliminates the need for balancing loss terms while preserving the flow’s iterative expressivity.

Result: DeFlow achieves superior performance on the challenging OGBench benchmark and demonstrates efficient offline-to-online adaptation capabilities.

Conclusion: The decoupled approach of DeFlow provides stable policy improvement while fully preserving the expressive iterative generation capability of flow matching, offering an efficient solution for offline RL with complex behavior manifolds.

Abstract: We present DeFlow, a decoupled offline RL framework that leverages flow matching to faithfully capture complex behavior manifolds. Optimizing generative policies is computationally prohibitive, typically necessitating backpropagation through ODE solvers. We address this by learning a lightweight refinement module within an explicit, data-derived trust region of the flow manifold, rather than sacrificing the iterative generation capability via single-step distillation. This way, we bypass solver differentiation and eliminate the need for balancing loss terms, ensuring stable improvement while fully preserving the flow’s iterative expressivity. Empirically, DeFlow achieves superior performance on the challenging OGBench benchmark and demonstrates efficient offline-to-online adaptation.

[377] Communication-Efficient Federated Learning by Exploiting Spatio-Temporal Correlations of Gradients

Shenlong Zheng, Zhen Zhang, Yuhui Deng, Geyong Min, Lin Cui

Main category: cs.LG

TL;DR: GradESTC is a compression technique for federated learning that reduces communication overhead by exploiting both spatial (low-rank) and temporal correlations in gradients, transmitting only lightweight coefficients and a limited number of updated basis vectors instead of full gradients.

Details

Motivation: Communication overhead is a critical challenge in federated learning, especially in bandwidth-constrained networks. Existing methods focus mainly on compressing individual gradients but overlook temporal correlations between gradients across adjacent rounds, missing opportunities for further compression.

Method: GradESTC exploits spatial correlations by decomposing each full gradient into compact basis vectors and combination coefficients. It leverages temporal correlations by dynamically updating only a small portion of basis vectors each round. The method transmits lightweight combination coefficients and limited updated basis vectors instead of full gradients.

Result: Extensive experiments show GradESTC reduces uplink communication by an average of 39.79% compared to the strongest baseline when reaching target accuracy near convergence, while maintaining comparable convergence speed and final accuracy to uncompressed FedAvg.

Conclusion: By effectively leveraging spatio-temporal gradient structures, GradESTC offers a practical and scalable solution for communication-efficient federated learning, addressing the critical challenge of communication overhead in bandwidth-constrained networks.

Abstract: Communication overhead is a critical challenge in federated learning, particularly in bandwidth-constrained networks. Although many methods have been proposed to reduce communication overhead, most focus solely on compressing individual gradients, overlooking the temporal correlations among them. Prior studies have shown that gradients exhibit spatial correlations, typically reflected in low-rank structures. Through empirical analysis, we further observe a strong temporal correlation between client gradients across adjacent rounds. Based on these observations, we propose GradESTC, a compression technique that exploits both spatial and temporal gradient correlations. GradESTC exploits spatial correlations to decompose each full gradient into a compact set of basis vectors and corresponding combination coefficients. By exploiting temporal correlations, only a small portion of the basis vectors need to be dynamically updated in each round. GradESTC significantly reduces communication overhead by transmitting lightweight combination coefficients and a limited number of updated basis vectors instead of the full gradients. Extensive experiments show that, upon reaching a target accuracy level near convergence, GradESTC reduces uplink communication by an average of 39.79% compared to the strongest baseline, while maintaining comparable convergence speed and final accuracy to uncompressed FedAvg. By effectively leveraging spatio-temporal gradient structures, GradESTC offers a practical and scalable solution for communication-efficient federated learning.

[378] Projected Microbatch Accumulation yields reference-free proximal policy updates for reinforcement learning

Nilin Abrahamsen

Main category: cs.LG

TL;DR: PROMA is a new proximal policy update method for LLM fine-tuning that projects out sequence-wise gradient components before microbatch aggregation, enabling stable policy learning without entropy collapse or reference policy reliance.

Details

Motivation: Existing methods like PPO and GRPO have limitations: PPO relies on likelihood-ratio clipping and reference policies, while GRPO doesn't provide tight enough KL divergence control, leading to unstable policy learning and potential entropy collapse.

Method: PROMA accumulates policy gradients across microbatches by projecting out sequence-wise gradient components before microbatch aggregation. The projection is applied layer-wise during backward pass, enabling efficient implementation without additional forward/backward passes.

Result: PROMA enforces tighter control of local KL divergence than GRPO, resulting in more stable policy learning. Unlike PPO and GRPO, it achieves proximal updates without inducing entropy collapse and doesn’t rely on reference policies or likelihood-ratio clipping.

Conclusion: PROMA provides an efficient, stable proximal policy update method for LLM fine-tuning that addresses limitations of existing approaches by using gradient projection techniques during microbatch accumulation.

Abstract: This note introduces Projected Microbatch Accumulation (PROMA), a proximal policy update method for large language model fine-tuning. PROMA accumulates policy gradients across microbatches by projecting out sequence-wise gradient components before microbatch aggregation. The projection is applied layer-wise during the backward pass, enabling efficient implementation without additional forward or backward passes. Empirically, PROMA enforces tighter control of local KL divergence than GRPO, resulting in more stable policy learning. Unlike PPO and GRPO, PROMA achieves proximal updates without inducing entropy collapse and does not rely on a reference policy or likelihood-ratio clipping.

[379] Transformer-Based Cognitive Radio: Adaptive Modulation Strategies Using Transformer Models

Andrea Melis, Andrea Piroddi, Roberto Girau

Main category: cs.LG

TL;DR: Transformer models (GPT-2) generate novel modulation schemes for cognitive radio systems, achieving comparable or better performance than traditional methods.

Details

Motivation: Cognitive Radio systems need to dynamically adapt to changing spectrum environments, and machine learning advancements (particularly Transformer models) could enhance spectral efficiency, robustness, and security.

Method: Train a GPT-2 Transformer model on a dataset of existing modulation formulas to generate novel modulation schemes, then evaluate them using SNR and Power Spectrum Density metrics.

Result: Transformer-generated modulation schemes achieve performance comparable to, and in some cases outperform, traditional modulation methods.

Conclusion: Advanced CR systems can benefit significantly from Transformer model implementation, leading to more efficient, robust, and secure communication systems.

Abstract: Cognitive Radio (CR) systems, which dynamically adapt to changing spectrum environments, could benefit significantly from advancements in machine learning technologies. These systems can be enhanced in terms of spectral efficiency, robustness, and security through innovative approaches such as the use of Transformer models. This work investigates the application of Transformer models, specifically the GPT-2 architecture, to generate novel modulation schemes for wireless communications. By training a GPT-2 model on a dataset of existing modulation formulas, new modulation schemes has been created. These generated schemes are then compared to traditional methods using key performance metrics such as Signal-to-Noise Ratio (SNR) and Power Spectrum Density (PSD). The results show that Transformer-generated modulation schemes can achieve performance comparable to, and in some cases outperforming, traditional methods. This demonstrates that advanced CR systems could greatly benefit from the implementation of Transformer models, leading to more efficient, robust, and secure communication systems.

[380] Mixtures of Transparent Local Models

Niffa Cheick Oumar Diaby, Thierry Duchesne, Mario Marchand

Main category: cs.LG

TL;DR: The paper proposes a mixture of transparent local models as an interpretable alternative to opaque ML models, with PAC-Bayesian risk bounds for binary classification and regression.

Details

Motivation: Growing demand for transparent ML models to ensure security and non-discrimination, especially when simple functions work well locally but change abruptly between regions.

Method: Proposes algorithm that learns both transparent labeling functions and their corresponding input space localities using a multi-predictor/multi-locality loss function, with PAC-Bayesian risk bounds for binary linear classification and linear regression.

Result: Synthetic data illustrates algorithm functionality; real data shows competitiveness with existing methods and some opaque models.

Conclusion: Mixture of transparent local models provides an effective interpretable alternative to opaque models with theoretical guarantees via PAC-Bayesian bounds.

Abstract: The predominance of machine learning models in many spheres of human activity has led to a growing demand for their transparency. The transparency of models makes it possible to discern some factors, such as security or non-discrimination. In this paper, we propose a mixture of transparent local models as an alternative solution for designing interpretable (or transparent) models. Our approach is designed for the situations where a simple and transparent function is suitable for modeling the label of instances in some localities/regions of the input space, but may change abruptly as we move from one locality to another. Consequently, the proposed algorithm is to learn both the transparent labeling function and the locality of the input space where the labeling function achieves a small risk in its assigned locality. By using a new multi-predictor (and multi-locality) loss function, we established rigorous PAC-Bayesian risk bounds for the case of binary linear classification problem and that of linear regression. In both cases, synthetic data sets were used to illustrate how the learning algorithms work. The results obtained from real data sets highlight the competitiveness of our approach compared to other existing methods as well as certain opaque models. Keywords: PAC-Bayes, risk bounds, local models, transparent models, mixtures of local transparent models.

[381] Process-Guided Concept Bottleneck Model

Reza M. Asiyabi, SEOSAW Partnership, Steven Hancock, Casey Ryan

Main category: cs.LG

TL;DR: PG-CBM extends Concept Bottleneck Models by incorporating domain-defined causal mechanisms through biophysically meaningful intermediate concepts, improving accuracy and interpretability for scientific applications like biomass estimation.

Details

Motivation: Standard CBMs overlook domain-specific relationships and causal mechanisms, and their dependence on complete concept labels limits applicability in scientific domains where supervision is sparse but processes are well-defined.

Method: PG-CBM constrains learning to follow domain-defined causal mechanisms through biophysically meaningful intermediate concepts, leveraging multi-source heterogeneous training data while producing interpretable intermediate outputs.

Result: PG-CBM reduces error and bias compared to multiple benchmarks in above ground biomass density estimation from Earth Observation data, while leveraging multi-source data and producing interpretable outputs.

Conclusion: PG-CBM enhances transparency, enables detection of spurious learning, provides scientific insights, and represents a step toward more trustworthy AI systems in scientific applications.

Abstract: Concept Bottleneck Models (CBMs) improve the explainability of black-box Deep Learning (DL) by introducing intermediate semantic concepts. However, standard CBMs often overlook domain-specific relationships and causal mechanisms, and their dependence on complete concept labels limits applicability in scientific domains where supervision is sparse but processes are well defined. To address this, we propose the Process-Guided Concept Bottleneck Model (PG-CBM), an extension of CBMs which constrains learning to follow domain-defined causal mechanisms through biophysically meaningful intermediate concepts. Using above ground biomass density estimation from Earth Observation data as a case study, we show that PG-CBM reduces error and bias compared to multiple benchmarks, whilst leveraging multi-source heterogeneous training data and producing interpretable intermediate outputs. Beyond improved accuracy, PG-CBM enhances transparency, enables detection of spurious learning, and provides scientific insights, representing a step toward more trustworthy AI systems in scientific applications.

[382] Kolmogorov Arnold Networks and Multi-Layer Perceptrons: A Paradigm Shift in Neural Modelling

Aradhya Gaonkar, Nihal Jain, Vignesh Chougule, Nikhil Deshpande, Sneha Varur, Channabasappa Muttal

Main category: cs.LG

TL;DR: KANs outperform MLPs in accuracy and computational efficiency across multiple benchmarks including function approximation, time-series prediction, and classification tasks.

Details

Motivation: To compare Kolmogorov-Arnold Networks (KAN) with traditional Multi-Layer Perceptrons (MLP) to determine which architecture offers better performance and computational efficiency for various computational challenges.

Method: Comprehensive comparative analysis using diverse datasets (mathematical functions, temperature prediction, wine classification) with performance evaluation via MSE and computational cost via FLOPs.

Result: KANs consistently outperform MLPs in all benchmarks, achieving higher predictive accuracy with significantly reduced computational costs.

Conclusion: KANs offer superior balance between computational efficiency and accuracy, making them particularly suitable for resource-limited and real-time applications, while providing a systematic framework for neural architecture selection.

Abstract: The research undertakes a comprehensive comparative analysis of Kolmogorov-Arnold Networks (KAN) and Multi-Layer Perceptrons (MLP), highlighting their effectiveness in solving essential computational challenges like nonlinear function approximation, time-series prediction, and multivariate classification. Rooted in Kolmogorov’s representation theorem, KANs utilize adaptive spline-based activation functions and grid-based structures, providing a transformative approach compared to traditional neural network frameworks. Utilizing a variety of datasets spanning mathematical function estimation (quadratic and cubic) to practical uses like predicting daily temperatures and categorizing wines, the proposed research thoroughly assesses model performance via accuracy measures like Mean Squared Error (MSE) and computational expense assessed through Floating Point Operations (FLOPs). The results indicate that KANs reliably exceed MLPs in every benchmark, attaining higher predictive accuracy with significantly reduced computational costs. Such an outcome highlights their ability to maintain a balance between computational efficiency and accuracy, rendering them especially beneficial in resource-limited and real-time operational environments. By elucidating the architectural and functional distinctions between KANs and MLPs, the paper provides a systematic framework for selecting the most suitable neural architectures for specific tasks. Furthermore, the proposed study highlights the transformative capabilities of KANs in progressing intelligent systems, influencing their use in situations that require both interpretability and computational efficiency.

[383] Combinatorial Optimization Augmented Machine Learning

Maximilian Schiffer, Heiko Hoppe, Yue Su, Louis Bouvier, Axel Parmentier

Main category: cs.LG

TL;DR: Comprehensive survey of Combinatorial Optimization Augmented Machine Learning (COAML), covering frameworks, taxonomy, algorithmic approaches, applications, and research frontiers.

Details

Motivation: To provide a unified overview of the emerging COAML paradigm that integrates predictive models with combinatorial decision-making, bridging machine learning, operations research, and stochastic optimization.

Method: Develops a unifying framework for COAML pipelines, creates taxonomy based on uncertainty and decision structure, reviews algorithmic approaches for static/dynamic problems, surveys applications across domains.

Result: Comprehensive survey covering methodological building blocks, connection to empirical cost minimization, problem taxonomy, algorithmic approaches, and applications in scheduling, routing, stochastic programming, and reinforcement learning.

Conclusion: COAML is a powerful paradigm for data-driven, feasibility-preserving policies; survey serves as tutorial introduction and roadmap for future research at the intersection of combinatorial optimization and machine learning.

Abstract: Combinatorial optimization augmented machine learning (COAML) has recently emerged as a powerful paradigm for integrating predictive models with combinatorial decision-making. By embedding combinatorial optimization oracles into learning pipelines, COAML enables the construction of policies that are both data-driven and feasibility-preserving, bridging the traditions of machine learning, operations research, and stochastic optimization. This paper provides a comprehensive overview of the state of the art in COAML. We introduce a unifying framework for COAML pipelines, describe their methodological building blocks, and formalize their connection to empirical cost minimization. We then develop a taxonomy of problem settings based on the form of uncertainty and decision structure. Using this taxonomy, we review algorithmic approaches for static and dynamic problems, survey applications across domains such as scheduling, vehicle routing, stochastic programming, and reinforcement learning, and synthesize methodological contributions in terms of empirical cost minimization, imitation learning, and reinforcement learning. Finally, we identify key research frontiers. This survey aims to serve both as a tutorial introduction to the field and as a roadmap for future research at the interface of combinatorial optimization and machine learning.

[384] ProbFM: Probabilistic Time Series Foundation Model with Uncertainty Decomposition

Arundeep Chinta, Lucas Vinh Tran, Jay Katukuri

Main category: cs.LG

TL;DR: ProbFM: A transformer-based probabilistic foundation model using Deep Evidential Regression for principled uncertainty quantification with explicit epistemic-aleatoric decomposition in financial time series forecasting.

Details

Motivation: Current Time Series Foundation Models (TSFMs) for financial forecasting lack proper uncertainty quantification - they either use restrictive distributional assumptions, conflate uncertainty sources, or lack calibration mechanisms. Existing approaches fail to provide theoretically-grounded uncertainty decomposition, hindering adoption in financial applications.

Method: ProbFM leverages Deep Evidential Regression (DER) to learn optimal uncertainty representations through higher-order evidence learning while maintaining single-pass computational efficiency. The paper also conducts controlled comparison using LSTM architecture across five probabilistic methods (DER, Gaussian NLL, Student’s-t NLL, Quantile Loss, Conformal Prediction) to evaluate DER independently of architectural complexity.

Result: Evaluation on cryptocurrency return forecasting shows that DER maintains competitive forecasting accuracy while providing explicit epistemic-aleatoric uncertainty decomposition. The controlled comparison demonstrates DER’s effectiveness in uncertainty quantification.

Conclusion: ProbFM establishes an extensible framework for principled uncertainty quantification in foundation models and provides empirical evidence for DER’s effectiveness in financial applications, addressing the core limitations of current TSFMs in uncertainty quantification.

Abstract: Time Series Foundation Models (TSFMs) have emerged as a promising approach for zero-shot financial forecasting, demonstrating strong transferability and data efficiency gains. However, their adoption in financial applications is hindered by fundamental limitations in uncertainty quantification: current approaches either rely on restrictive distributional assumptions, conflate different sources of uncertainty, or lack principled calibration mechanisms. While recent TSFMs employ sophisticated techniques such as mixture models, Student’s t-distributions, or conformal prediction, they fail to address the core challenge of providing theoretically-grounded uncertainty decomposition. For the very first time, we present a novel transformer-based probabilistic framework, ProbFM (probabilistic foundation model), that leverages Deep Evidential Regression (DER) to provide principled uncertainty quantification with explicit epistemic-aleatoric decomposition. Unlike existing approaches that pre-specify distributional forms or require sampling-based inference, ProbFM learns optimal uncertainty representations through higher-order evidence learning while maintaining single-pass computational efficiency. To rigorously evaluate the core DER uncertainty quantification approach independent of architectural complexity, we conduct an extensive controlled comparison study using a consistent LSTM architecture across five probabilistic methods: DER, Gaussian NLL, Student’s-t NLL, Quantile Loss, and Conformal Prediction. Evaluation on cryptocurrency return forecasting demonstrates that DER maintains competitive forecasting accuracy while providing explicit epistemic-aleatoric uncertainty decomposition. This work establishes both an extensible framework for principled uncertainty quantification in foundation models and empirical evidence for DER’s effectiveness in financial applications.

[385] STEM: Scaling Transformers with Embedding Modules

Ranajoy Sadhukhan, Sheng Cao, Harry Dong, Changsheng Zhao, Attiano Purpura-Pontoniere, Yuandong Tian, Zechun Liu, Beidi Chen

Main category: cs.LG

TL;DR: STEM replaces FFN up-projection with token-indexed embedding lookup, enabling extreme sparsity without routing overhead while improving performance, interpretability, and long-context scaling.

Details

Motivation: Fine-grained sparsity offers higher parametric capacity without proportional compute but suffers from training instability, load balancing, and communication overhead. Current approaches need better solutions for these issues.

Method: STEM (Scaling Transformers with Embedding Modules) uses static token-indexed approach replacing FFN up-projection with layer-local embedding lookup while keeping gate and down-projection dense. This eliminates runtime routing, enables CPU offload with prefetch, and decouples capacity from FLOPs/communication.

Result: STEM trains stably despite extreme sparsity, improves downstream performance over dense baselines while reducing per-token FLOPs and parameter accesses (eliminating ~1/3 FFN parameters). Learns embedding spaces with large angular spread for enhanced knowledge storage. Provides better interpretability, knowledge editing/injection capabilities, and strengthens long-context performance.

Conclusion: STEM is an effective method for scaling parametric memory with better interpretability, training stability, and efficiency. Delivers 3-4% accuracy improvements across 350M-1B models, with notable gains on knowledge/reasoning benchmarks.

Abstract: Fine-grained sparsity promises higher parametric capacity without proportional per-token compute, but often suffers from training instability, load balancing, and communication overhead. We introduce STEM (Scaling Transformers with Embedding Modules), a static, token-indexed approach that replaces the FFN up-projection with a layer-local embedding lookup while keeping the gate and down-projection dense. This removes runtime routing, enables CPU offload with asynchronous prefetch, and decouples capacity from both per-token FLOPs and cross-device communication. Empirically, STEM trains stably despite extreme sparsity. It improves downstream performance over dense baselines while reducing per-token FLOPs and parameter accesses (eliminating roughly one-third of FFN parameters). STEM learns embedding spaces with large angular spread which enhances its knowledge storage capacity. More interestingly, this enhanced knowledge capacity comes with better interpretability. The token-indexed nature of STEM embeddings allows simple ways to perform knowledge editing and knowledge injection in an interpretable manner without any intervention in the input text or additional computation. In addition, STEM strengthens long-context performance: as sequence length grows, more distinct parameters are activated, yielding practical test-time capacity scaling. Across 350M and 1B model scales, STEM delivers up to ~3–4% accuracy improvements overall, with notable gains on knowledge and reasoning-heavy benchmarks (ARC-Challenge, OpenBookQA, GSM8K, MMLU). Overall, STEM is an effective way of scaling parametric memory while providing better interpretability, better training stability and improved efficiency.

[386] Single-Stage Huffman Encoder for ML Compression

Aditya Agrawal, Albert Magyar, Hiteshwar Eswaraiah, Patrick Sheridan, Pradeep Janedula, Ravi Krishnan Venkatesan, Krishna Nair, Ravi Iyer

Main category: cs.LG

TL;DR: Single-stage Huffman encoder using fixed codebooks from average probability distributions achieves near-optimal compression for LLM tensor communication.

Details

Motivation: Collective operations in LLM training/serving are bottlenecked by network bandwidth. Traditional Huffman coding has prohibitive overheads (computational, latency, data) for latency-sensitive scenarios like die-to-die communication.

Method: Proposes single-stage Huffman encoder using fixed codebooks derived from average probability distribution of previous data batches. Leverages statistical similarity of tensors across layers and shards in LLMs (demonstrated with Gemma 2B model).

Result: Achieves compression within 0.5% of per-shard Huffman coding and within 1% of ideal Shannon compressibility, enabling efficient on-the-fly compression.

Conclusion: Fixed codebook approach eliminates overheads of traditional three-stage Huffman coding while maintaining near-optimal compression performance, making it suitable for latency-sensitive LLM communication scenarios.

Abstract: Training and serving Large Language Models (LLMs) require partitioning data across multiple accelerators, where collective operations are frequently bottlenecked by network bandwidth. Lossless compression using Huffman codes is an effective way to alleviate the issue, however, its three-stage design requiring on-the-fly frequency analysis, codebook generation and transmission of codebook along with data introduces computational, latency and data overheads which are prohibitive for latency-sensitive scenarios such as die-to-die communication. This paper proposes a single-stage Huffman encoder that eliminates these overheads by using fixed codebooks derived from the average probability distribution of previous data batches. Through our analysis of the Gemma 2B model, we demonstrate that tensors exhibit high statistical similarity across layers and shards. Using this approach we achieve compression within 0.5% of per-shard Huffman coding and within 1% of the ideal Shannon compressibility, enabling efficient on-the-fly compression.

[387] On the origin of neural scaling laws: from random graphs to natural language

Maissam Barkeshli, Alberto Alfarano, Andrey Gromov

Main category: cs.LG

TL;DR: The paper shows that neural scaling laws emerge even in simplified settings without power law data structure, using random walks on graphs and systematically simplified language models.

Details

Motivation: To understand the origin of neural scaling laws, challenging the common assumption that they arise from power law structure in data. The authors want to demonstrate that scaling laws can emerge in simpler, controlled settings.

Method: Train transformers on random walks (bigrams) on graphs with tunable complexity; systematically simplify natural language by training on sequences from increasingly simplified generative language models (4,2,1-layer transformers down to bigrams); use random walks on Erdös-Renyi and Barabási-Albert graphs; revisit conventional scaling laws with 2-layer transformers and short context length.

Result: Scaling laws emerge even without power law data structure; monotonic evolution of scaling exponents as language complexity decreases; scaling laws from random walks on different graph ensembles; essential language modeling scaling results reproducible with simpler models; alternative method for compute optimal curves; preliminary evidence that maximal update parameterization is more parameter efficient.

Conclusion: Neural scaling laws are more fundamental than previously thought - they emerge even in simple settings without power law data correlations. The findings suggest scaling laws may be intrinsic to transformer architectures rather than dependent on complex data structure, and provide new methods for analyzing scaling behavior.

Abstract: Scaling laws have played a major role in the modern AI revolution, providing practitioners predictive power over how the model performance will improve with increasing data, compute, and number of model parameters. This has spurred an intense interest in the origin of neural scaling laws, with a common suggestion being that they arise from power law structure already present in the data. In this paper we study scaling laws for transformers trained to predict random walks (bigrams) on graphs with tunable complexity. We demonstrate that this simplified setting already gives rise to neural scaling laws even in the absence of power law structure in the data correlations. We further consider dialing down the complexity of natural language systematically, by training on sequences sampled from increasingly simplified generative language models, from 4,2,1-layer transformer language models down to language bigrams, revealing a monotonic evolution of the scaling exponents. Our results also include scaling laws obtained from training on random walks on random graphs drawn from Erdös-Renyi and scale-free Barabási-Albert ensembles. Finally, we revisit conventional scaling laws for language modeling, demonstrating that several essential results can be reproduced using 2 layer transformers with context length of 50, provide a critical analysis of various fits used in prior literature, demonstrate an alternative method for obtaining compute optimal curves as compared with current practice in published literature, and provide preliminary evidence that maximal update parameterization may be more parameter efficient than standard parameterization.

[388] Data-driven stochastic reduced-order modeling of parametrized dynamical systems

Andrew F. Ilersich, Kevin Course, Prasanth B. Nair

Main category: cs.LG

TL;DR: A data-driven framework for learning continuous-time stochastic reduced-order models that generalize across parameters and forcing conditions using amortized stochastic variational inference.

Details

Motivation: High-fidelity simulations of complex dynamical systems are computationally expensive, and current reduced-order models struggle with stochastic dynamics and lack uncertainty quantification, limiting their use in robust decision-making.

Method: Uses amortized stochastic variational inference with a reparametrization trick for Markov Gaussian processes to jointly learn a probabilistic autoencoder and stochastic differential equations governing latent dynamics, eliminating need for expensive forward solvers during training.

Result: Demonstrates excellent generalization to unseen parameter combinations and forcings, with significant efficiency gains compared to existing approaches in three challenging test problems.

Conclusion: The framework provides an efficient, scalable approach for learning stochastic reduced-order models with uncertainty quantification that can incorporate physics-informed priors when available.

Abstract: Modeling complex dynamical systems under varying conditions is computationally intensive, often rendering high-fidelity simulations intractable. Although reduced-order models (ROMs) offer a promising solution, current methods often struggle with stochastic dynamics and fail to quantify prediction uncertainty, limiting their utility in robust decision-making contexts. To address these challenges, we introduce a data-driven framework for learning continuous-time stochastic ROMs that generalize across parameter spaces and forcing conditions. Our approach, based on amortized stochastic variational inference, leverages a reparametrization trick for Markov Gaussian processes to eliminate the need for computationally expensive forward solvers during training. This enables us to jointly learn a probabilistic autoencoder and stochastic differential equations governing the latent dynamics, at a computational cost that is independent of the dataset size and system stiffness. Additionally, our approach offers the flexibility of incorporating physics-informed priors if available. Numerical studies are presented for three challenging test problems, where we demonstrate excellent generalization to unseen parameter combinations and forcings, and significant efficiency gains compared to existing approaches.

[389] Communication-Efficient and Privacy-Adaptable Mechanism – a Federated Learning Scheme with Convergence Analysis

Chun Hei Michael Shiu, Chih Wei Ling

Main category: cs.LG

TL;DR: The paper theoretically analyzes CEPAM’s privacy guarantees and convergence properties, and experimentally evaluates its utility performance including convergence profiles and accuracy-privacy trade-offs in federated learning.

Details

Motivation: Federated learning enables privacy-preserving collaboration under data-governance constraints, but faces challenges in communication efficiency and privacy protection between parties. CEPAM was recently introduced to address both objectives simultaneously, but requires theoretical analysis and experimental validation.

Method: Theoretical analysis of CEPAM’s privacy guarantees and convergence properties, combined with experimental evaluations including convergence profiles compared with other baselines, and accuracy-privacy trade-offs between different parties.

Result: The paper provides theoretical foundations for CEPAM’s privacy protection and convergence behavior, and demonstrates its utility performance through experimental comparisons and trade-off analysis.

Conclusion: CEPAM offers a theoretically sound and practically effective solution for achieving both communication efficiency and customizable privacy protection in federated learning settings.

Abstract: Federated learning enables multiple parties to jointly train learning models without sharing their own underlying data, offering a practical pathway to privacy-preserving collaboration under data-governance constraints. Continued study of federated learning is essential to address key challenges in it, including communication efficiency and privacy protection between parties. A recent line of work introduced a novel approach called the Communication-Efficient and Privacy-Adaptable Mechanism (CEPAM), which achieves both objectives simultaneously. CEPAM leverages the rejection-sampled universal quantizer (RSUQ), a randomized vector quantizer whose quantization error is equivalent to a prescribed noise, which can be tuned to customize privacy protection between parties. In this work, we theoretically analyze the privacy guarantees and convergence properties of CEPAM. Moreover, we assess CEPAM’s utility performance through experimental evaluations, including convergence profiles compared with other baselines, and accuracy-privacy trade-offs between different parties.

[390] Distributed Perceptron under Bounded Staleness, Partial Participation, and Noisy Communication

Keval Jain, Anant Raj, Saurav Prakash, Girish Varma

Main category: cs.LG

TL;DR: The paper analyzes a semi-asynchronous client-server perceptron with IPM-style averaging, addressing system effects like stale updates, partial participation, and communication noise, and provides theoretical bounds on cumulative mistakes.

Details

Motivation: To understand and provide theoretical guarantees for federated/distributed learning systems under realistic deployment conditions including stale updates (two-sided version lag), intermittent client availability, and imperfect communication with noise.

Method: Proposes staleness-bucket aggregation with padding for server-side aggregation that deterministically enforces a prescribed staleness profile without stochastic delay assumptions. Analyzes IPM-style averaging where clients run local perceptron updates and server aggregates updates arriving each round.

Result: Proves finite-horizon expected bound on cumulative weighted perceptron mistakes: delay impact appears only through mean enforced staleness, while communication noise contributes an additional term growing with square root of horizon and total noise energy. In noiseless case, shows finite expected mistake budget yields explicit finite-round stabilization bound under fresh-participation condition.

Conclusion: The proposed staleness-bucket aggregation with padding provides theoretical guarantees for semi-asynchronous federated perceptron learning under realistic system effects, with delay effects isolated to mean staleness and noise effects quantified separately.

Abstract: We study a semi-asynchronous client-server perceptron trained via iterative parameter mixing (IPM-style averaging): clients run local perceptron updates and a server forms a global model by aggregating the updates that arrive in each communication round. The setting captures three system effects in federated and distributed deployments: (i) stale updates due to delayed model delivery and delayed application of client computations (two-sided version lag), (ii) partial participation (intermittent client availability), and (iii) imperfect communication on both downlink and uplink, modeled as effective zero-mean additive noise with bounded second moment. We introduce a server-side aggregation rule called staleness-bucket aggregation with padding that deterministically enforces a prescribed staleness profile over update ages without assuming any stochastic model for delays or participation. Under margin separability and bounded data radius, we prove a finite-horizon expected bound on the cumulative weighted number of perceptron mistakes over a given number of server rounds: the impact of delay appears only through the mean enforced staleness, whereas communication noise contributes an additional term that grows on the order of the square root of the horizon with the total noise energy. In the noiseless case, we show how a finite expected mistake budget yields an explicit finite-round stabilization bound under a mild fresh-participation condition.

[391] High-accuracy and dimension-free sampling with diffusions

Khashayar Gatmiry, Sitan Chen, Adil Salim

Main category: cs.LG

TL;DR: Proposes a new diffusion model solver with polylogarithmic iteration complexity in 1/ε, achieving first high-accuracy guarantee using only approximate score access.

Details

Motivation: Current diffusion model inference requires many small iterations (polynomial complexity in dimension and 1/ε) to produce high-quality samples, which is computationally expensive.

Method: New solver combining low-degree approximation with collocation method (Lee, Song, Vempala 2018) for solving diffusion differential equations.

Result: Achieves polylogarithmic iteration complexity in 1/ε, with dimension dependence only through effective radius of target distribution support.

Conclusion: First high-accuracy guarantee for diffusion-based samplers using approximate score access, with significantly improved iteration complexity.

Abstract: Diffusion models have shown remarkable empirical success in sampling from rich multi-modal distributions. Their inference relies on numerically solving a certain differential equation. This differential equation cannot be solved in closed form, and its resolution via discretization typically requires many small iterations to produce \emph{high-quality} samples. More precisely, prior works have shown that the iteration complexity of discretization methods for diffusion models scales polynomially in the ambient dimension and the inverse accuracy $1/\varepsilon$. In this work, we propose a new solver for diffusion models relying on a subtle interplay between low-degree approximation and the collocation method (Lee, Song, Vempala 2018), and we prove that its iteration complexity scales \emph{polylogarithmically} in $1/\varepsilon$, yielding the first ``high-accuracy’’ guarantee for a diffusion-based sampler that only uses (approximate) access to the scores of the data distribution. In addition, our bound does not depend explicitly on the ambient dimension; more precisely, the dimension affects the complexity of our solver through the \emph{effective radius} of the support of the target distribution only.

[392] DInf-Grid: A Neural Differential Equation Solver with Differentiable Feature Grids

Navami Kairanda, Shanthika Naik, Marc Habermann, Avinash Sharma, Christian Theobalt, Vladislav Golyanik

Main category: cs.LG

TL;DR: DInf-Grid: A differentiable grid-based representation using radial basis functions for efficient solving of differential equations, achieving 5-20x speed-up over coordinate-based MLPs.

Details

Motivation: Existing neural solvers have limitations: coordinate-based MLPs (like sinusoidal networks) are computationally intensive and slow to train, while grid-based methods (like Instant-NGP and K-Planes) cannot compute higher-order derivatives needed for solving differential equations due to linear interpolation constraints.

Method: Combines feature grid efficiency with infinitely differentiable radial basis function interpolation. Introduces multi-resolution decomposition with co-located grids to capture high-frequency solutions and enable stable computation of global gradients. Trained implicitly using differential equations as loss functions.

Result: Achieves 5-20x speed-up over coordinate-based MLP methods, solving differential equations in seconds or minutes while maintaining comparable accuracy and compactness. Validated on Poisson equation (image reconstruction), Helmholtz equation (wave fields), and Kirchhoff-Love boundary value problem (cloth simulation).

Conclusion: DInf-Grid successfully bridges the gap between computational efficiency and differentiability requirements for solving differential equations, offering a practical alternative to traditional neural solvers with significant performance improvements.

Abstract: We present a novel differentiable grid-based representation for efficiently solving differential equations (DEs). Widely used architectures for neural solvers, such as sinusoidal neural networks, are coordinate-based MLPs that are both computationally intensive and slow to train. Although grid-based alternatives for implicit representations (e.g., Instant-NGP and K-Planes) train faster by exploiting signal structure, their reliance on linear interpolation restricts their ability to compute higher-order derivatives, rendering them unsuitable for solving DEs. Our approach overcomes these limitations by combining the efficiency of feature grids with radial basis function interpolation, which is infinitely differentiable. To effectively capture high-frequency solutions and enable stable and faster computation of global gradients, we introduce a multi-resolution decomposition with co-located grids. Our proposed representation, DInf-Grid, is trained implicitly using the differential equations as loss functions, enabling accurate modelling of physical fields. We validate DInf-Grid on a variety of tasks, including the Poisson equation for image reconstruction, the Helmholtz equation for wave fields, and the Kirchhoff-Love boundary value problem for cloth simulation. Our results demonstrate a 5-20x speed-up over coordinate-based MLP-based methods, solving differential equations in seconds or minutes while maintaining comparable accuracy and compactness.

[393] SSFL: Discovering Sparse Unified Subnetworks at Initialization for Efficient Federated Learning

Riyasat Ohib, Bishal Thapaliya, Gintare Karolina Dziugaite, Jingyu Liu, Vince Calhoun, Sergey Plis

Main category: cs.LG

TL;DR: SSFL is a sparse federated learning method that identifies a global sparse subnetwork before training using aggregated saliency scores from local clients, reducing communication costs while improving accuracy-sparsity trade-offs.

Details

Motivation: To address the communication inefficiency in federated learning by reducing the amount of data transmitted between clients and server, while maintaining or improving model performance, especially in non-IID data scenarios.

Method: SSFL computes parameter saliency scores separately on each client’s local data, aggregates these scores to determine a global sparse mask before training begins, then only trains and communicates the sparse model weights each round.

Result: Achieves >20% relative error reduction on CIFAR-10 compared to strongest sparse baseline, reduces communication costs by 2× relative to dense FL, and delivers 2.3× faster communication time in real-world deployment.

Conclusion: SSFL provides an effective approach for sparse federated learning that significantly reduces communication overhead while improving model performance, making it practical for real-world federated learning deployments.

Abstract: In this work, we propose Salient Sparse Federated Learning (SSFL), a streamlined approach for sparse federated learning with efficient communication. SSFL identifies a sparse subnetwork prior to training, leveraging parameter saliency scores computed separately on local client data in non-IID scenarios, and then aggregated, to determine a global mask. Only the sparse model weights are trained and communicated each round between the clients and the server. On standard benchmarks including CIFAR-10, CIFAR-100, and Tiny-ImageNet, SSFL consistently improves the accuracy sparsity trade off, achieving more than 20% relative error reduction on CIFAR-10 compared to the strongest sparse baseline, while reducing communication costs by $2 \times$ relative to dense FL. Finally, in a real-world federated learning deployment, SSFL delivers over $2.3 \times$ faster communication time, underscoring its practical efficiency.

[394] Machine Unlearning Fails to Remove Data Poisoning Attacks

Martin Pawelczyk, Jimmy Z. Di, Yiwei Lu, Gautam Kamath, Ayush Sekhari, Seth Neel

Main category: cs.LG

TL;DR: Existing machine unlearning methods fail to remove data poisoning effects across various attack types and models, despite claims of effectiveness in other settings.

Details

Motivation: To evaluate whether practical machine unlearning methods can effectively remove the effects of poisoned data, which is a potential application beyond just complying with data deletion requests.

Method: Experimental evaluation of existing unlearning methods across multiple poisoning attacks (indiscriminate, targeted, and new Gaussian poisoning) on different models (image classifiers and LLMs), introducing new poisoning-based evaluation metrics.

Result: Current unlearning methods fail to remove poisoning effects even with substantial compute budgets, showing limited benefit over retraining from scratch.

Conclusion: Broader evaluation perspectives are needed to avoid false confidence in unlearning methods without provable guarantees; current methods are not yet production-ready for removing poisoned data.

Abstract: We revisit the efficacy of several practical methods for approximate machine unlearning developed for large-scale deep learning. In addition to complying with data deletion requests, one often-cited potential application for unlearning methods is to remove the effects of poisoned data. We experimentally demonstrate that, while existing unlearning methods have been demonstrated to be effective in a number of settings, they fail to remove the effects of data poisoning across a variety of types of poisoning attacks (indiscriminate, targeted, and a newly-introduced Gaussian poisoning attack) and models (image classifiers and LLMs); even when granted a relatively large compute budget. In order to precisely characterize unlearning efficacy, we introduce new evaluation metrics for unlearning based on data poisoning. Our results suggest that a broader perspective, including a wider variety of evaluations, are required to avoid a false sense of confidence in machine unlearning procedures for deep learning without provable guarantees. Moreover, while unlearning methods show some signs of being useful to efficiently remove poisoned data without having to retrain, our work suggests that these methods are not yet ``ready for prime time,’’ and currently provide limited benefit over retraining.

[395] Mathematical theory of deep learning

Philipp Petersen, Jakob Zech

Main category: cs.LG

TL;DR: Introduction to mathematical analysis of deep learning covering approximation theory, optimization theory, and statistical learning theory as the three main pillars of deep neural network theory.

Details

Motivation: To provide foundational mathematical knowledge about deep learning for students and researchers in mathematics and related fields, addressing the need for rigorous yet accessible understanding of the mathematical concepts underpinning deep learning.

Method: The book presents mathematical analysis covering three main areas: approximation theory (how well neural networks can approximate functions), optimization theory (how to train neural networks effectively), and statistical learning theory (generalization properties of neural networks). It prioritizes simplicity over generality and presents rigorous yet accessible results.

Result: The book serves as a comprehensive guide that covers fundamental results in the three core mathematical areas relevant to deep learning, providing readers with the essential mathematical foundations needed to understand deep neural network theory.

Conclusion: This book successfully provides an accessible yet rigorous introduction to the mathematical analysis of deep learning, equipping readers with foundational knowledge in approximation theory, optimization theory, and statistical learning theory - the three main pillars of deep neural network theory.

Abstract: This book provides an introduction to the mathematical analysis of deep learning. It covers fundamental results in approximation theory, optimization theory, and statistical learning theory, which are the three main pillars of deep neural network theory. Serving as a guide for students and researchers in mathematics and related fields, the book aims to equip readers with foundational knowledge on the topic. It prioritizes simplicity over generality, and presents rigorous yet accessible results to help build an understanding of the essential mathematical concepts underpinning deep learning.

[396] Permissive Information-Flow Analysis for Large Language Models

Shoaib Ahmed Siddiqui, Radhika Gaonkar, Boris Köpf, David Krueger, Andrew Paverd, Ahmed Salem, Shruti Tople, Lukas Wutschitz, Menglin Xia, Santiago Zanella-Béguelin

Main category: cs.LG

TL;DR: A novel permissive information flow tracking approach for LLMs that propagates only influential input labels, improving over conservative baseline methods in 85%+ cases.

Details

Motivation: LLMs are becoming commodity components in larger software systems, creating security/privacy risks where poisoned data can compromise entire systems or leak confidential data. Traditional conservative information flow tracking (propagating all input labels) is too restrictive for LLMs operating on diverse input sources.

Method: Proposes a permissive approach that propagates only labels of samples influential in generating model output, eliminating labels of unnecessary inputs. Implements two variations: (1) prompt-based retrieval augmentation, and (2) k-nearest-neighbors language model. Compares with baseline using introspection to predict output label.

Result: Experimental results in an LLM agent setting show the permissive label propagator improves over the baseline in more than 85% of cases, demonstrating practical effectiveness.

Conclusion: The proposed permissive information flow tracking approach for LLMs is practical and effective, addressing security/privacy concerns while avoiding the over-conservatism of traditional methods for systems with diverse input sources.

Abstract: Large Language Models (LLMs) are rapidly becoming commodity components of larger software systems. This poses natural security and privacy problems: poisoned data retrieved from one component can change the model’s behavior and compromise the entire system, including coercing the model to spread confidential data to untrusted components. One promising approach is to tackle this problem at the system level via dynamic information flow (aka taint) tracking. Unfortunately, this approach of propagating the most restrictive input label to the output is too conservative for applications where LLMs operate on inputs retrieved from diverse sources. In this paper, we propose a novel, more permissive approach to propagate information flow labels through LLM queries. The key idea behind our approach is to propagate only the labels of the samples that were influential in generating the model output and to eliminate the labels of unnecessary inputs. We implement and investigate the effectiveness of two variations of this approach, based on (i) prompt-based retrieval augmentation, and (ii) a $k$-nearest-neighbors language model. We compare these with a baseline that uses introspection to predict the output label. Our experimental results in an LLM agent setting show that the permissive label propagator improves over the baseline in more than 85% of the cases, which underscores the practicality of our approach.

[397] An information-matching approach to optimal experimental design and active learning

Yonatan Kurniawan, Tracianne B. Neilsen, Benjamin L. Francis, Alex M. Stankovic, Mingjian Wen, Ilia Nikiforov, Ellad B. Tadmor, Vasily V. Bulatov, Vincenzo Lordi, Mark K. Transtrum

Main category: cs.LG

TL;DR: Proposes an information-matching criterion using Fisher Information Matrix to select optimal training data that contains sufficient information to learn only parameters needed for downstream predictions, formulated as convex optimization for scalability.

Details

Motivation: Collecting sufficient training data for mathematical models is expensive and challenging. Many applications only need to predict certain quantities of interest (QoIs), which often depend on a small subset of parameters due to model sloppiness/unidentifiability.

Method: Introduces an information-matching criterion based on the Fisher Information Matrix to select the most informative training data from a candidate pool. Formulated as a convex optimization problem for scalability to large models and datasets. Also used as a query function within Active Learning loops.

Result: Demonstrated effectiveness across various modeling problems in power systems, underwater acoustics, and material science. Found that a relatively small set of optimal training data can provide necessary information for achieving precise predictions.

Conclusion: The information-matching approach enables efficient data selection for downstream predictions, particularly valuable for active learning in large machine learning models. Results are encouraging for diverse future applications where data collection is expensive.

Abstract: The efficacy of mathematical models heavily depends on the quality of the training data, yet collecting sufficient data is often expensive and challenging. Many modeling applications require inferring parameters only as a means to predict other quantities of interest (QoI). Because models often contain many unidentifiable (sloppy) parameters, QoIs often depend on a relatively small number of parameter combinations. Therefore, we introduce an information-matching criterion based on the Fisher Information Matrix to select the most informative training data from a candidate pool. This method ensures that the selected data contain sufficient information to learn only those parameters that are needed to constrain downstream QoIs. It is formulated as a convex optimization problem, making it scalable to large models and datasets. We demonstrate the effectiveness of this approach across various modeling problems in diverse scientific fields, including power systems and underwater acoustics. Finally, we use information-matching as a query function within an Active Learning loop for material science applications. In all these applications, we find that a relatively small set of optimal training data can provide the necessary information for achieving precise predictions. These results are encouraging for diverse future applications, particularly active learning in large machine learning models.

[398] VICON: Vision In-Context Operator Networks for Multi-Physics Fluid Dynamics Prediction

Yadi Cao, Yuxuan Liu, Liu Yang, Rose Yu, Hayden Schaeffer, Stanley Osher

Main category: cs.LG

TL;DR: VICON improves computational efficiency of operator networks by using vision transformers for patch-wise 2D processing, achieving better accuracy and faster inference while supporting flexible timestep strategies.

Details

Motivation: Existing In-Context Operator Networks (ICONs) process each spatial point as individual tokens, which severely limits computational efficiency when handling dense data in higher spatial dimensions, making them impractical for real-world applications.

Method: Proposes Vision In-Context Operator Networks (VICON) that integrates vision transformer architectures to efficiently process 2D data through patch-wise operations while preserving ICON’s adaptability to multiphysics systems and varying timesteps.

Result: VICON significantly outperforms state-of-the-art baselines DPOT and MPP, reducing averaged last-step rollout error by 37.9% and 44.7% respectively, while requiring only 72.5% and 34.8% of their inference times. It shows remarkable robustness in realistic scenarios with varying sampling frequencies.

Conclusion: VICON provides an efficient and versatile solution for operator learning in fluid dynamics, enabling immediate deployment in real-world imperfect measurement systems without requiring retraining, demonstrating superior computational efficiency and robustness compared to existing methods.

Abstract: In-Context Operator Networks (ICONs) have demonstrated the ability to learn operators across diverse partial differential equations using few-shot, in-context learning. However, existing ICONs process each spatial point as an individual token, severely limiting computational efficiency when handling dense data in higher spatial dimensions. We propose Vision In-Context Operator Networks (VICON), which integrates vision transformer architectures to efficiently process 2D data through patch-wise operations while preserving ICON’s adaptability to multiphysics systems and varying timesteps. Evaluated across three fluid dynamics benchmarks, VICON significantly outperforms state-of-the-art baselines: DPOT and MPP, reducing the averaged last-step rollout error by 37.9% compared to DPOT and 44.7% compared to MPP, while requiring only 72.5% and 34.8% of their respective inference times. VICON naturally supports flexible rollout strategies with varying timestep strides, enabling immediate deployment in imperfect measurement systems where sampling frequencies may differ or frames might be dropped - common challenges in real-world settings - without requiring retraining or interpolation. In these realistic scenarios, VICON exhibits remarkable robustness, experiencing only 24.41% relative performance degradation compared to 71.37%-74.49% degradation in baseline methods, demonstrating its versatility for deploying in realistic applications. Our scripts for processing datasets and code are publicly available at https://github.com/Eydcao/VICON.

[399] Robust LLM Alignment via Distributionally Robust Direct Preference Optimization

Zaiyan Xu, Sushil Vemuri, Kishan Panaganti, Dileep Kalathil, Rahul Jain, Deepak Ramachandran

Main category: cs.LG

TL;DR: The paper addresses distribution shift in LLM alignment by developing two distributionally robust DPO algorithms (WDPO and KLDPO) that improve alignment performance when user preferences vary across regions, demographics, and time.

Details

Motivation: Current LLM alignment algorithms rely on static preference datasets that don't account for real-world variations in user preferences across geographical regions, demographics, linguistic patterns, and evolving cultural trends, leading to catastrophic alignment failures.

Method: Developed two novel distributionally robust direct preference optimization algorithms: Wasserstein DPO (WDPO) and Kullback-Leibler DPO (KLDPO), using distributionally robust optimization framework. Also proposed scalable gradient descent-style learning algorithms with approximations for the challenging minimax loss functions.

Result: Characterized the sample complexity of learning optimal policy parameters for both WDPO and KLDPO. Empirical experiments using benchmark datasets and LLMs demonstrated superior performance in substantially improving alignment when there is preference distribution shift.

Conclusion: The proposed distributionally robust DPO algorithms (WDPO and KLDPO) effectively address the distribution shift problem in LLM alignment, providing robust solutions that maintain alignment performance across diverse and evolving user preferences.

Abstract: A major challenge in aligning large language models (LLMs) with human preferences is the issue of distribution shift. LLM alignment algorithms rely on static preference datasets, assuming that they accurately represent real-world user preferences. However, user preferences vary significantly across geographical regions, demographics, linguistic patterns, and evolving cultural trends. This preference distribution shift leads to catastrophic alignment failures in many real-world applications. We address this problem using the principled framework of distributionally robust optimization, and develop two novel distributionally robust direct preference optimization (DPO) algorithms, namely, Wasserstein DPO (WDPO) and Kullback-Leibler DPO (KLDPO). We characterize the sample complexity of learning the optimal policy parameters for WDPO and KLDPO. Moreover, we propose scalable gradient descent-style learning algorithms by developing suitable approximations for the challenging minimax loss functions of WDPO and KLDPO. Our empirical experiments using benchmark data sets and LLMs demonstrate the superior performance of WDPO and KLDPO in substantially improving the alignment when there is a preference distribution shift.

[400] CleanSurvival: Automated data preprocessing for time-to-event models using reinforcement learning

Yousef Koka, David Selby, Gerrit Großmann, Sebastian Vollmer, Kathan Pandya

Main category: cs.LG

TL;DR: CleanSurvival is a reinforcement learning framework that automates data preprocessing optimization for survival analysis models, outperforming standard approaches and being 10x faster than random search.

Details

Motivation: Data preprocessing is critical but often neglected in machine learning, especially for specialized tasks like survival analysis. While automated ML pipelines exist for classification/regression, survival analysis lacks tailored automated preprocessing solutions, facing both general preprocessing challenges and this specific gap.

Method: The paper presents ‘CleanSurvival’, a reinforcement-learning-based solution using Q-learning to optimize preprocessing pipelines for survival analysis. It handles continuous/categorical variables by selecting optimal combinations of data imputation, outlier detection, and feature extraction techniques for Cox, random forest, neural network, or custom time-to-event models.

Result: Experimental benchmarks on real-world datasets show Q-learning-based preprocessing achieves superior predictive performance compared to standard approaches, finding optimal models up to 10 times faster than undirected random grid search. A simulation study demonstrates effectiveness across different types and levels of missingness and noise.

Conclusion: CleanSurvival successfully addresses the gap in automated preprocessing for survival analysis, providing an efficient reinforcement learning framework that significantly improves model performance and optimization speed compared to traditional methods.

Abstract: Data preprocessing is a critical yet frequently neglected aspect of machine learning, often paid little attention despite its potentially significant impact on model performance. While automated machine learning pipelines are starting to recognize and integrate data preprocessing into their solutions for classification and regression tasks, this integration is lacking for more specialized tasks like survival or time-to-event models. As a result, survival analysis not only faces the general challenges of data preprocessing but also suffers from the lack of tailored, automated solutions in this area. To address this gap, this paper presents ‘CleanSurvival’, a reinforcement-learning-based solution for optimizing preprocessing pipelines, extended specifically for survival analysis. The framework can handle continuous and categorical variables, using Q-learning to select which combination of data imputation, outlier detection and feature extraction techniques achieves optimal performance for a Cox, random forest, neural network or user-supplied time-to-event model. The package is available on GitHub: https://github.com/datasciapps/CleanSurvival Experimental benchmarks on real-world datasets show that the Q-learning-based data preprocessing results in superior predictive performance to standard approaches, finding such a model up to 10 times faster than undirected random grid search. Furthermore, a simulation study demonstrates the effectiveness in different types and levels of missingness and noise in the data.

[401] Online Scheduling for LLM Inference with KV Cache Constraints

Patrick Jaillet, Jiashuo Jiang, Konstantina Mellou, Marco Molinaro, Chara Podimata, Zijie Zhou

Main category: cs.LG

TL;DR: The paper proposes a novel batching and scheduling algorithm for LLM inference that minimizes latency while managing KV cache memory constraints, with theoretical guarantees and empirical validation.

Details

Motivation: LLM inference is computationally intensive and requires efficient scheduling to optimize latency and resource utilization. A key challenge is managing the Key-Value (KV) cache, which reduces redundant computations but introduces memory constraints that need to be addressed.

Method: 1) Introduces a hindsight optimal benchmark formulated as an integer program for minimum total inference latency under full future information. 2) Proves no deterministic online algorithm can achieve constant competitive ratio with arbitrary arrivals. 3) Develops a polynomial-time online scheduling algorithm that achieves constant competitive ratio under certain conditions. 4) Validates algorithm with synthetic datasets and real-world LLM inference simulations (Llama2-70B on A100 GPUs).

Result: The proposed algorithm achieves strong empirical performance, significantly outperforming benchmark algorithms in both synthetic and real-world evaluations. Theoretical analysis shows it can achieve constant competitive ratio under certain conditions, while no deterministic online algorithm can achieve constant competitive ratio with arbitrary arrivals.

Conclusion: The work provides a path toward more sustainable and cost-effective LLM deployment through efficient scheduling algorithms that manage KV cache constraints while minimizing inference latency, with both theoretical guarantees and practical validation.

Abstract: Large Language Model (LLM) inference, where a trained model generates text one word at a time in response to user prompts, is a computationally intensive process requiring efficient scheduling to optimize latency and resource utilization. A key challenge in LLM inference is the management of the Key-Value (KV) cache, which reduces redundant computations but introduces memory constraints. In this work, we model LLM inference with KV cache constraints theoretically and propose a novel batching and scheduling algorithm that minimizes inference latency while effectively managing the KV cache’s memory. More specifically, we make the following contributions. First, to evaluate the performance of online algorithms for scheduling in LLM inference, we introduce a hindsight optimal benchmark, formulated as an integer program that computes the minimum total inference latency under full future information. Second, we prove that no deterministic online algorithm can achieve a constant competitive ratio when the arrival process is arbitrary. Third, motivated by the computational intractability of solving the integer program at scale, we propose a polynomial-time online scheduling algorithm and show that under certain conditions it can achieve a constant competitive ratio. We also demonstrate our algorithm’s strong empirical performance by comparing it to the hindsight optimal in a synthetic dataset. Finally, we conduct empirical evaluations on a real-world public LLM inference dataset, simulating the Llama2-70B model on A100 GPUs, and show that our algorithm significantly outperforms the benchmark algorithms. Overall, our results offer a path toward more sustainable and cost-effective LLM deployment.

[402] Curvature Tuning: Provable Training-free Model Steering From a Single Parameter

Leyang Hu, Matteo Gamba, Randall Balestriero

Main category: cs.LG

TL;DR: Curvature Tuning (CT) is an interpretable finetuning method that adjusts model decision boundaries by modifying activation functions with a single trainable hyperparameter, improving both generalization and robustness compared to existing methods.

Details

Motivation: Current finetuning methods rely on weight adaptation, lack interpretability, and depend on heuristically chosen hyperparameters. The paper proposes shifting focus from weights to activation functions to create a more principled and interpretable steering method.

Method: CT modulates model decision boundaries by injecting a single hyperparameter into activation functions, viewing them through spline operators. This projects models onto a space of smooth functions. The hyperparameter is made trainable, creating a parameter-efficient finetuning approach.

Result: CT significantly improves downstream accuracy: boosts ResNet-50/152 by 8.59%/8.34% over linear probing and 4.64%/1.70% over LoRA across 12 datasets. Improves robust accuracy on RobustBench ℓ∞ benchmark by 1032.64%/1494.46%.

Conclusion: CT provides an interpretable, principled alternative to weight-based finetuning by focusing on activation functions, complementing existing methods through decision boundary curvature adjustment and smooth function projection.

Abstract: The scaling of model and data sizes has reshaped the AI landscape, establishing finetuning pretrained models as the standard paradigm for solving downstream tasks. However, dominant finetuning methods typically rely on weight adaptation, often lack interpretability, and depend on heuristically chosen hyperparameters. In this paper, we take a different perspective and shift the focus from weights to activation functions, viewing them through the lens of spline operators. We propose Curvature Tuning (CT), an interpretable and principled steering method that modulates a model’s decision boundary by injecting a single hyperparameter into its activation functions. We show that CT provably adjusts model decision boundary curvature and, more fundamentally, projects a model onto a space of smooth functions-thereby complementing current finetuning methods, whose effect lies primarily in feature adaptation. Making this hyperparameter trainable gives rise to a novel and highly parameter-efficient finetuning method. Empirically, CT improves both generalization and robustness. For example, it boosts downstream accuracy of ResNet-50/152 by 8.59%/8.34% over linear probing and 4.64%/1.70% over LoRA across 12 datasets, and improves robust accuracy on the $\ell_\infty$ benchmark from RobustBench by 1032.64%/1494.46%. Our code is available at https://github.com/Leon-Leyang/curvature-tuning.

[403] Privacy amplification by random allocation

Vitaly Feldman, Moshe Shenfeld

Main category: cs.LG

TL;DR: Random k-out-of-t sampling for privacy amplification: theoretical guarantees show it’s bounded by Poisson subsampling with probability (1+o(1))k/t, with efficient numerical algorithms for Gaussian noise mechanisms.

Details

Motivation: Analyze privacy amplification of random k-out-of-t sampling used in differentially private optimization and communication-efficient private aggregation. Existing analyses either rely on overly conservative shuffling bounds or require prohibitive Monte Carlo simulations.

Method: Provide first theoretical guarantees showing random k-out-of-t allocation privacy can be upper bounded by independent/Poisson subsampling with probability (1+o(1))k/t. Develop two additional analysis techniques for numerical improvements. Create efficiently-computable numerical estimation algorithms.

Result: Demonstrate nearly tight numerical results for random allocation applied to Gaussian noise addition. Bounds are efficiently computable and improve over existing conservative shuffling-based analyses.

Conclusion: Random k-out-of-t sampling provides strong privacy amplification comparable to Poisson subsampling, with practical efficient computation methods for Gaussian mechanisms, addressing limitations of previous analyses.

Abstract: We consider the privacy amplification properties of a sampling scheme in which a user’s data is used in k steps chosen randomly and uniformly from a sequence (or set) of t steps. This sampling scheme has been recently applied in the context of differentially private optimization [Chua et al., 2024a, Choquette-Choo et al., 2025] and is also motivated by communication-efficient high-dimensional private aggregation [Asi et al., 2025]. Existing analyses of this scheme either rely on privacy amplification by shuffling which leads to overly conservative bounds or require Monte Carlo simulations that are computationally prohibitive in most practical scenarios. We give the first theoretical guarantees and numerical estimation algorithms for this sampling scheme. In particular, we demonstrate that the privacy guarantees of random k-out-of-t allocation can be upper bounded by the privacy guarantees of the well-studied independent (or Poisson) subsampling in which each step uses the user’s data with probability $(1+o(1))k/t$. Further, we provide two additional analysis techniques that lead to numerical improvements in several parameter regimes. Altogether, our bounds give efficiently-computable and nearly tight numerical results for random allocation applied to Gaussian noise addition.

[404] MixMin: Finding Data Mixtures via Convex Minimization

Anvith Thudi, Evianne Rovers, Yangjun Ruan, Tristan Thrush, Chris J. Maddison

Main category: cs.LG

TL;DR: MixMin is a gradient-based method for optimizing data mixtures in ML pipelines by treating it as a convex bi-level optimization problem that becomes tractable with larger model classes.

Details

Motivation: Modern ML pipelines increasingly combine data from diverse sources, but finding optimal data mixtures is challenging and open problem. Current approaches lack systematic methods for determining best data combinations for downstream performance.

Method: Formalizes data mixing as bi-level optimization: best mixture leads to best downstream model. Observes objective becomes convex with larger model classes. Develops gradient-based approach called MixMin to optimize this convex objective.

Result: MixMin uniformly improved data mixtures across all experiments. For pythia-410M model: 0.2% additional compute, 1-5% relative improvement on PIQA, ARC Easy, SciQ, OpenWebMath. MixMin mixtures for smaller models improved larger models, suggesting scale-invariance. For XGBoost bioassay: 0.03-0.15 average precision improvement.

Conclusion: MixMin provides effective gradient-based solution to data mixing problem, demonstrating practical improvements across language modeling and chemistry tasks with minimal computational overhead, and shows promising scale-invariant properties.

Abstract: Modern machine learning pipelines are increasingly combining and mixing data from diverse and disparate sources, e.g., pre-training large language models. Yet, finding the optimal data mixture is a challenging and open problem. We formalize this data mixing problem as a bi-level objective: the best mixture is the one that would lead to the best model for a downstream objective. Unfortunately, this objective is generally intractable. In this paper, we make the observation that the bi-level data mixing objective becomes convex as our model class becomes larger. We develop and study a gradient-based approach for optimizing this convex objective, which we call MixMin, and test it on language modeling and chemistry tasks. MixMin was the only method that uniformly improved the data mixture in all our experiments. With MixMin, we improved the data mixture using less than 0.2% additional compute for a pythia-410M model trained on 8.2B tokens, resulting between 1-5% relative improvement to negative log likelihood on PIQA, ARC Easy, SciQ, and OpenWebMath. Crucially, we found that MixMin mixtures for smaller models improved training of larger models, suggesting that MixMin mixtures may be scale-invariant. When mixing bioassay data to train an XGBoost model, we saw improvements to average precision scores of 0.03-0.15.

[405] LaM-SLidE: Latent Space Modeling of Spatial Dynamical Systems via Linked Entities

Florian Sestak, Artur Toshev, Andreas Fürst, Günter Klambauer, Andreas Mayr, Johannes Brandstetter

Main category: cs.LG

TL;DR: LaM-SLidE introduces identifier representations (IDs) to enable traceable latent space modeling of spatial dynamical systems while leveraging efficiency from image/video generation techniques.

Details

Motivation: Current generative models struggle with dynamical systems involving interacting entities (chemical molecules, human behavior) because they need to preserve entity traceability, connectivity patterns, and conservation while benefiting from efficient latent space modeling approaches used in image/video generation.

Method: LaM-SLidE introduces identifier representations (IDs) that enable retrieval of entity properties and composition from latent system representations, bridging traceability with efficient latent space modeling using pre-trained encoders/decoders from image/video generation.

Result: The method performs favorably across different domains in terms of speed, accuracy, and generalizability compared to existing approaches.

Conclusion: LaM-SLidE successfully bridges the gap between entity traceability in dynamical systems and the efficiency of latent space generative modeling, enabling better trajectory sampling for systems with interacting entities.

Abstract: Generative models are spearheading recent progress in deep learning, showcasing strong promise for trajectory sampling in dynamical systems as well. However, whereas latent space modeling paradigms have transformed image and video generation, similar approaches are more difficult for most dynamical systems. Such systems – from chemical molecule structures to collective human behavior – are described by interactions of entities, making them inherently linked to connectivity patterns, entity conservation, and the traceability of entities over time. Our approach, LaM-SLidE (Latent Space Modeling of Spatial Dynamical Systems via Linked Entities), bridges the gap between: (1) keeping the traceability of individual entities in a latent system representation, and (2) leveraging the efficiency and scalability of recent advances in image and video generation, where pre-trained encoder and decoder enable generative modeling directly in latent space. The core idea of LaM-SLidE is the introduction of identifier representations (IDs) that enable the retrieval of entity properties and entity composition from latent system representations, thus fostering traceability. Experimentally, across different domains, we show that LaM-SLidE performs favorably in terms of speed, accuracy, and generalizability. Code is available at https://github.com/ml-jku/LaM-SLidE .

[406] An Improved Privacy and Utility Analysis of Differentially Private SGD with Bounded Domain and Smooth Losses

Hao Liang, Wanrong Zhang, Xinlei He, Kaishun Wu, Hong Xing

Main category: cs.LG

TL;DR: DPSGD privacy analysis for non-convex L-smooth functions shows converged privacy loss in bounded domains, with smaller domain diameter improving both privacy and utility, and theoretical trade-off bounds for DPSGD variants.

Details

Motivation: DPSGD protects sensitive data but suffers performance degradation due to loose privacy bounds. Existing analyses make impractical assumptions (convexity, complex parameters) and don't deeply examine privacy mechanisms' impact on utility.

Method: Rigorous privacy characterization for DPSGD with general L-smooth non-convex loss functions, tracking privacy loss over iterations using noisy smooth-reduction property. Comprehensive convergence analysis in different scenarios including bounded domains.

Result: For DPSGD with bounded domain: (1) privacy loss converges without convexity assumption, (2) smaller bounded diameter improves both privacy and utility under certain conditions, (3) established big-O order privacy-utility trade-off bounds for DPSGD-GC and DPSGD-DC with strongly convex population risk.

Conclusion: The paper provides tighter privacy analysis for DPSGD without restrictive assumptions, shows domain bounding benefits, and establishes theoretical trade-off bounds validated by practical MIA experiments.

Abstract: Differentially Private Stochastic Gradient Descent (DPSGD) is widely used to protect sensitive data during the training of machine learning models, but its privacy guarantee often comes at a large cost of model performance due to the lack of tight theoretical bounds quantifying privacy loss. While recent efforts have achieved more accurate privacy guarantees, they still impose some assumptions prohibited from practical applications, such as convexity and complex parameter requirements, and rarely investigate in-depth the impact of privacy mechanisms on the model’s utility. In this paper, we provide a rigorous privacy characterization for DPSGD with general L-smooth and non-convex loss functions, revealing converged privacy loss with iteration in bounded-domain cases. Specifically, we track the privacy loss over multiple iterations, leveraging the noisy smooth-reduction property, and further establish comprehensive convergence analysis in different scenarios. In particular, we show that for DPSGD with a bounded domain, (i) the privacy loss can still converge without the convexity assumption, (ii) a smaller bounded diameter can improve both privacy and utility simultaneously under certain conditions, and (iii) the attainable big-O order of the privacy utility trade-off for DPSGD with gradient clipping (DPSGD-GC) and for DPSGD-GC with bounded domain (DPSGD-DC) and mu-strongly convex population risk function, respectively. Experiments via membership inference attack (MIA) in a practical setting validate insights gained from the theoretical results.

[407] ShuffleGate: Scalable Feature Optimization for Recommender Systems via Batch-wise Sensitivity Learning

Yihong Huang, Chen Chu, Fan Zhang, Liping Wang Fei Chen, Yu Lin, Ruiduan Li, Zhihao Li

Main category: cs.LG

TL;DR: ShuffleGate is a unified, interpretable mechanism for feature and dimension selection in recommender systems that measures model sensitivity to information loss through batch-wise shuffling, achieving polarized importance distributions and bridging the search-retrain gap.

Details

Motivation: Current feature selection and dimension selection methods for large-scale recommender systems suffer from ambiguous importance scores, prohibitive computational costs, and isolated solutions for conceptually related tasks. There's a need for a unified approach that can handle different granularities while being interpretable and scalable.

Method: ShuffleGate introduces a batch-wise shuffling strategy to erase information in an end-to-end differentiable manner, measuring model sensitivity to information loss rather than learning relative weights. This creates naturally polarized importance distributions that distinguish essential signals from noise without complex threshold tuning.

Result: Achieves SOTA performance on feature and dimension selection tasks. Can identify and prune 99.9% of redundant embedding parameters on Criteo dataset while maintaining competitive AUC. Successfully deployed in industrial video recommendation platform, compressing input dimension from 10,000+ to 1,000+, achieving 91% increase in training throughput while serving billions of daily requests without performance degradation.

Conclusion: ShuffleGate provides a unified, scalable, and interpretable solution for feature optimization across different granularities, bridging the search-retrain gap and demonstrating practical industrial value for large-scale recommender systems.

Abstract: Feature optimization, specifically Feature Selection (FS) and Dimension Selection (DS), is critical for the efficiency and generalization of large-scale recommender systems. While conceptually related, these tasks are typically tackled with isolated solutions that often suffer from ambiguous importance scores or prohibitive computational costs. In this paper, we propose ShuffleGate, a unified and interpretable mechanism that estimates component importance by measuring the model’s sensitivity to information loss. Unlike conventional gating that learns relative weights, ShuffleGate introduces a batch-wise shuffling strategy to effectively erase information in an end-to-end differentiable manner. This paradigm shift yields naturally polarized importance distributions, bridging the long-standing “search-retrain gap” and distinguishing essential signals from noise without complex threshold tuning. ShuffleGate provides a unified solution across granularities. It achieves state-of-the-art performance on feature and dimension selection tasks. Furthermore, to demonstrate its extreme scalability and precision, we extend ShuffleGate to evaluate fine-grained embedding entries. Experiments show it can identify and prune 99.9% of redundant embedding parameters on the Criteo dataset while maintaining competitive AUC, verifying its robustness in massive search spaces. Finally, the method has been successfully deployed in a top-tier industrial video recommendation platform. By compressing the concatenated input dimension from over 10,000 to 1,000+, it achieved a 91% increase in training throughput while serving billions of daily requests without performance degradation.

[408] Evaluating Large Language Models for Fair and Reliable Organ Allocation

Brian Hyeongseok Kim, Hannah Murray, Isabelle Lee, Jason Byun, Joshua Lum, Dani Yogatama, Evi Micha

Main category: cs.LG

TL;DR: LLMs show concerning fairness issues in simulated kidney allocation tasks, with different metrics revealing contradictory results and demographic preferences that vary by task type.

Details

Motivation: Medical institutions are considering LLMs for high-stakes clinical decisions like organ allocation, but existing evaluation methods are inadequate - benchmarks are too simplistic and accuracy metrics don't address the lack of clear ground truth in allocation decisions.

Method: First tested LLMs’ medical knowledge, then designed two kidney allocation tasks: (1) Choose-One (select single candidate) evaluated with traditional fairness metrics like proportional parity, and (2) Rank-All (rank all candidates) reflecting real-world allocation processes where organs pass down ranked lists.

Result: Evaluation of three LLMs revealed divergence between fairness metrics: exposure-based metrics suggested equitable outcomes, but probability-based metrics uncovered systematic preferential sorting where specific groups clustered in upper-ranking tiers. Demographic preferences were highly task-dependent, showing inverted trends between Choose-One and Rank-All tasks.

Conclusion: Current LLMs can introduce inequalities in real-world allocation scenarios, highlighting the urgent need for rigorous fairness evaluation and human oversight before deployment in high-stakes clinical decision-making.

Abstract: Medical institutions are considering the use of LLMs in high-stakes clinical decision-making, such as organ allocation. In such sensitive use cases, evaluating fairness is imperative. However, existing evaluation methods often fall short; benchmarks are too simplistic to capture real-world complexity, and accuracy-based metrics fail to address the absence of a clear ground truth. To realistically and fairly model organ allocation, specifically kidney allocation, we begin by testing the medical knowledge of LLMs to determine whether they understand the clinical factors required to make sound allocation decisions. Building on this foundation, we design two tasks: (1) Choose-One and (2) Rank-All. In Choose-One, LLMs select a single candidate from a list of potential candidates to receive a kidney. In this scenario, we assess fairness across demographics using traditional fairness metrics, such as proportional parity. In Rank-All, LLMs rank all candidates waiting for a kidney, reflecting real-world allocation processes more closely, where an organ is passed down a ranked list until allocated. Our evaluation on three LLMs reveals a divergence between fairness metrics: while exposure-based metrics suggest equitable outcomes, probability-based metrics uncover systematic preferential sorting, where specific groups were clustered in upper-ranking tiers. Furthermore, we observe that demographic preferences are highly task-dependent, showing inverted trends between Choose-One and Rank-All tasks, even when considering the topmost rank. Overall, our results indicate that current LLMs can introduce inequalities in real-world allocation scenarios, underscoring the urgent need for rigorous fairness evaluation and human oversight before their use in high-stakes decision-making.

[409] SageAttention3: Microscaling FP4 Attention for Inference and An Exploration of 8-Bit Training

Jintao Zhang, Jia Wei, Pengle Zhang, Xiaoming Xu, Haofeng Huang, Haoxu Wang, Kai Jiang, Jianfei Chen, Jun Zhu

Main category: cs.LG

TL;DR: This paper introduces two contributions for efficient attention: 1) FP4 attention using Blackwell GPU Tensor Cores for 5x inference speedup, and 2) 8-bit attention for both training and inference, achieving lossless fine-tuning but slower pretraining convergence.

Details

Motivation: Attention mechanisms have quadratic time complexity which limits efficiency. While existing low-bit attention methods like FlashAttention3 and SageAttention focus only on inference, training large models also requires efficiency improvements.

Method: 1) Leverage FP4 Tensor Cores in Blackwell GPUs to accelerate attention computation for inference. 2) Design an accurate and efficient 8-bit attention that works for both forward and backward propagation during training.

Result: FP4 attention achieves 1038 TOPS on RTX5090 (5x speedup over FlashAttention). 8-bit attention achieves lossless performance in fine-tuning tasks but exhibits slower convergence in pretraining tasks.

Conclusion: The paper demonstrates that low-bit attention can effectively accelerate both inference and training tasks, with FP4 providing significant inference speedups and 8-bit attention enabling efficient training with some limitations in pretraining convergence.

Abstract: The efficiency of attention is important due to its quadratic time complexity. We enhance the efficiency of attention through two key contributions: First, we leverage the new FP4 Tensor Cores in Blackwell GPUs to accelerate attention computation. Our implementation achieves 1038 TOPS on RTX5090, which is a 5x speedup over the fastest FlashAttention on RTX5090. Experiments show that our FP4 attention can accelerate inference of various models in a plug-and-play way. Second, we pioneer low-bit attention to training tasks. Existing low-bit attention works like FlashAttention3 and SageAttention focus only on inference. However, the efficiency of training large models is also important. To explore whether low-bit attention can be effectively applied to training tasks, we design an accurate and efficient 8-bit attention for both forward and backward propagation. Experiments indicate that 8-bit attention achieves lossless performance in fine-tuning tasks but exhibits slower convergence in pretraining tasks. The code is available at https://github.com/thu-ml/SageAttention.

[410] Why Knowledge Distillation Works in Generative Models: A Minimal Working Explanation

Sungmin Cha, Kyunghyun Cho

Main category: cs.LG

TL;DR: Knowledge distillation in generative models creates a precision-recall trade-off where students focus on high-likelihood regions at the expense of coverage, explaining KD’s effectiveness.

Details

Motivation: While knowledge distillation (KD) is widely used in training generative models like LLMs, its underlying mechanisms for improving generative quality remain poorly understood despite empirical benefits.

Method: Used controlled simulation with mixtures of Gaussians to analyze KD effects, then validated findings in large-scale language modeling using SmolLM2 family models.

Result: Distillation induces a precision-recall trade-off: as teacher becomes more selective, student concentrates probability mass on high-likelihood regions (improving precision/sample quality) at the expense of coverage (reducing recall/diversity).

Conclusion: KD’s effectiveness in generative modeling stems from this precision-recall trade-off, which is especially beneficial when sample quality matters more than diversity (e.g., instruction tuning, downstream generation).

Abstract: Knowledge distillation (KD) is a core component in the training and deployment of modern generative models, particularly large language models (LLMs). While its empirical benefits are well documented – enabling smaller student models to emulate the performance of much larger teachers – the underlying mechanisms by which KD improves generative quality remain poorly understood. In this work, we present a minimal working explanation of KD in generative modeling. Using a controlled simulation with mixtures of Gaussians, we demonstrate that distillation induces a trade-off between precision and recall in the student model. As the teacher distribution becomes more selective, the student concentrates more probability mass on high-likelihood regions at the expense of coverage, which is a behavior modulated by a single entropy-controlling parameter. We then validate this effect in a large-scale language modeling setup using the SmolLM2 family of models. Empirical results reveal the same precision-recall dynamics observed in simulation, where precision corresponds to sample quality and recall to distributional coverage. This precision-recall trade-off in LLMs is found to be especially beneficial in scenarios where sample quality is more important than diversity, such as instruction tuning or downstream generation. Our analysis provides a simple and general explanation for the effectiveness of KD in generative modeling.

[411] Quartet: Native FP4 Training Can Be Optimal for Large Language Models

Roberto L. Castro, Andrei Panferov, Soroush Tabesh, Oliver Sieberling, Jiale Chen, Mahdi Nikdan, Saleh Ashkboos, Dan Alistarh

Main category: cs.LG

TL;DR: Quartet enables accurate end-to-end FP4 training for LLMs using Blackwell architecture, achieving competitive performance vs FP16/FP8 through new scaling laws and optimized CUDA kernels.

Details

Motivation: Training LLMs in low-precision (FP4) reduces computational costs and improves throughput/energy efficiency, but current FP4 methods suffer accuracy degradation and rely on mixed-precision fallbacks. NVIDIA's Blackwell architecture enables FP4 operations, creating opportunity for better low-precision training.

Method: Investigates hardware-supported FP4 training, develops new low-precision scaling law to quantify performance trade-offs, designs Quartet technique for optimal accuracy-vs-computation, implements with optimized CUDA kernels for Blackwell architecture.

Result: Demonstrates fully FP4-based training is competitive alternative to FP16 half-precision and FP8 training through extensive evaluations on Llama-type models. Code available at https://github.com/IST-DASLab/Quartet.

Conclusion: Quartet enables accurate end-to-end FP4 training for LLMs, making low-precision training practical and efficient using Blackwell architecture, with competitive performance compared to higher precision alternatives.

Abstract: Training large language models (LLMs) models directly in low-precision offers a way to address computational costs by improving both throughput and energy efficiency. For those purposes, NVIDIA’s recent Blackwell architecture facilitates very low-precision operations using FP4 variants. Yet, current algorithms for training LLMs in FP4 precision face significant accuracy degradation and often rely on mixed-precision fallbacks. In this paper, we investigate hardware-supported FP4 training and introduce a new approach for accurate, end-to-end FP4 training with all the major computations (i.e., linear layers) in low precision. Through extensive evaluations on Llama-type models, we reveal a new low-precision scaling law that quantifies performance trade-offs across bit-widths and training setups. Guided by this investigation, we design an “optimal” technique in terms of accuracy-vs-computation, called Quartet. We implement Quartet using optimized CUDA kernels tailored for Blackwell, demonstrating that fully FP4-based training is a competitive alternative to FP16 half-precision and to FP8 training. Our code is available at https://github.com/IST-DASLab/Quartet.

[412] Deep Learning for Continuous-Time Stochastic Control with Jumps

Patrick Cheridito, Jean-Loup Dupret, Donatien Hainaut

Main category: cs.LG

TL;DR: Model-based deep learning approach for finite-horizon continuous-time stochastic control with jumps using two neural networks for policy and value function approximation.

Details

Motivation: To solve complex high-dimensional continuous-time stochastic control problems with jumps, which are challenging for traditional methods due to their computational complexity and the presence of discontinuous dynamics.

Method: Iteratively train two neural networks: one for optimal policy representation and another for value function approximation. Derive training objectives from Hamilton-Jacobi-Bellman equation using continuous-time dynamic programming principle to capture stochastic dynamics with jumps.

Result: Empirical evaluations demonstrate accurate and scalable performance on various problems, showing effectiveness in solving complex high-dimensional stochastic control tasks.

Conclusion: The proposed model-based deep learning approach provides an effective solution for continuous-time stochastic control with jumps, offering both accuracy and scalability for high-dimensional problems.

Abstract: In this paper, we introduce a model-based deep-learning approach to solve finite-horizon continuous-time stochastic control problems with jumps. We iteratively train two neural networks: one to represent the optimal policy and the other to approximate the value function. Leveraging a continuous-time version of the dynamic programming principle, we derive two different training objectives based on the Hamilton-Jacobi-Bellman equation, ensuring that the networks capture the underlying stochastic dynamics. Empirical evaluations on different problems illustrate the accuracy and scalability of our approach, demonstrating its effectiveness in solving complex high-dimensional stochastic control tasks.

[413] GraLoRA: Granular Low-Rank Adaptation for Parameter-Efficient Fine-Tuning

Yeonjoon Jung, Daehyun Ahn, Hyungjun Kim, Taesu Kim, Eunhyeok Park

Main category: cs.LG

TL;DR: GraLoRA improves upon LoRA by partitioning weight matrices into sub-blocks with individual low-rank adapters, overcoming LoRA’s overfitting at higher ranks and achieving better performance closer to full fine-tuning.

Details

Motivation: LoRA suffers from overfitting when bottleneck is widened, performs best at ranks 32-64, and still falls short of full fine-tuning performance due to structural bottleneck causing gradient entanglement and distorted gradient propagation.

Method: GraLoRA partitions weight matrices into sub-blocks, each with its own low-rank adapter, maintaining negligible computational/storage cost while increasing representational capacity and approximating full fine-tuning behavior.

Result: GraLoRA consistently outperforms LoRA and other baselines on code generation and commonsense reasoning benchmarks, achieving up to +8.5% absolute gain in Pass@1 on HumanEval+, with improvements holding across model sizes and rank settings.

Conclusion: GraLoRA provides a scalable and robust solution for parameter-efficient fine-tuning by overcoming LoRA’s fundamental limitations through granular partitioning, effectively increasing representational capacity while maintaining efficiency.

Abstract: Low-Rank Adaptation (LoRA) is a popular method for parameter-efficient fine-tuning (PEFT) of generative models, valued for its simplicity and effectiveness. Despite recent enhancements, LoRA still suffers from a fundamental limitation: overfitting when the bottleneck is widened. It performs best at ranks 32-64, yet its accuracy stagnates or declines at higher ranks, still falling short of full fine-tuning (FFT) performance. We identify the root cause as LoRA’s structural bottleneck, which introduces gradient entanglement to the unrelated input channels and distorts gradient propagation. To address this, we introduce a novel structure, Granular Low-Rank Adaptation (GraLoRA) that partitions weight matrices into sub-blocks, each with its own low-rank adapter. With negligible computational or storage cost, GraLoRA overcomes LoRA’s limitations, effectively increases the representational capacity, and more closely approximates FFT behavior. Experiments on code generation and commonsense reasoning benchmarks show that GraLoRA consistently outperforms LoRA and other baselines, achieving up to +8.5% absolute gain in Pass@1 on HumanEval+. These improvements hold across model sizes and rank settings, making GraLoRA a scalable and robust solution for PEFT. Code, data, and scripts are available at https://github.com/SqueezeBits/GraLoRA.git

[414] Optimal kernel regression bounds under energy-bounded noise

Amon Lahr, Johannes Köhler, Anna Scampicchio, Melanie N. Zeilinger

Main category: cs.LG

TL;DR: Derived tight, non-asymptotic uncertainty bounds for kernel-based estimation that handle correlated noise, computed via worst-case function realization within hypothesis class.

Details

Motivation: Non-conservative uncertainty bounds are crucial for assessing estimation algorithm accuracy and enabling deployment in safety-critical applications, especially when dealing with correlated noise sequences.

Method: Proposes computing worst-case function realization within hypothesis class at arbitrary query locations using norm-boundedness assumptions on unknown function and noise. Shows this value equals posterior mean and covariance of Gaussian process with optimal measurement noise covariance selection.

Result: Developed tight, non-asymptotic uncertainty bounds that are effective for kernel-based estimates, providing rigorous analysis and comparison with existing literature.

Conclusion: The approach yields tight, easy-to-compute uncertainty bounds for kernel-based estimation that handle correlated noise, making it suitable for safety-critical applications requiring accurate uncertainty quantification.

Abstract: Non-conservative uncertainty bounds are key for both assessing an estimation algorithm’s accuracy and in view of downstream tasks, such as its deployment in safety-critical contexts. In this paper, we derive a tight, non-asymptotic uncertainty bound for kernel-based estimation, which can also handle correlated noise sequences. Its computation relies on a mild norm-boundedness assumption on the unknown function and the noise, returning the worst-case function realization within the hypothesis class at an arbitrary query input location. The value of this function is shown to be given in terms of the posterior mean and covariance of a Gaussian process for an optimal choice of the measurement noise covariance. By rigorously analyzing the proposed approach and comparing it with other results in the literature, we show its effectiveness in returning tight and easy-to-compute bounds for kernel-based estimates.

[415] From Dormant to Deleted: Tamper-Resistant Unlearning Through Weight-Space Regularization

Shoaib Ahmed Siddiqui, Adrian Weller, David Krueger, Gintare Karolina Dziugaite, Michael Curtis Mozer, Eleni Triantafillou

Main category: cs.LG

TL;DR: LLM unlearning methods are vulnerable to relearning attacks where forgotten knowledge re-emerges through fine-tuning, even on unrelated data. The paper studies this in vision classifiers and finds forget-set accuracy can recover from 50% to 100% with retain-set-only fine-tuning.

Details

Motivation: Recent unlearning methods for LLMs are vulnerable to relearning attacks where supposedly forgotten knowledge re-emerges through fine-tuning on even seemingly unrelated examples. This raises concerns about the robustness and reliability of current unlearning techniques.

Method: The study examines example-level unlearning in vision classifiers in a controlled setting. They analyze various unlearning methods and discover that resistance to relearning attacks can be predicted by weight-space properties like L2-distance and linear mode connectivity between original and unlearned models.

Result: Forget-set accuracy can recover from around 50% post-unlearning to nearly 100% with fine-tuning on just the retain set (zero forget set examples). This effect occurs across multiple unlearning methods, while models retrained from scratch excluding forget set remain at 50% accuracy.

Conclusion: Current unlearning methods are vulnerable to relearning attacks, but resistance can be predicted by weight-space properties. Based on this insight, the authors propose a new class of methods that achieve state-of-the-art resistance to relearning attacks.

Abstract: Recent unlearning methods for LLMs are vulnerable to relearning attacks: knowledge believed-to-be-unlearned re-emerges by fine-tuning on a small set of (even seemingly-unrelated) examples. We study this phenomenon in a controlled setting for example-level unlearning in vision classifiers. We make the surprising discovery that forget-set accuracy can recover from around 50% post-unlearning to nearly 100% with fine-tuning on just the retain set – i.e., zero examples of the forget set. We observe this effect across a wide variety of unlearning methods, whereas for a model retrained from scratch excluding the forget set (gold standard), the accuracy remains at 50%. We observe that resistance to relearning attacks can be predicted by weight-space properties, specifically, $L_2$-distance and linear mode connectivity between the original and the unlearned model. Leveraging this insight, we propose a new class of methods that achieve state-of-the-art resistance to relearning attacks.

[416] Learning normalized image densities via dual score matching

Florentin Guth, Zahra Kadkhodaie, Eero P Simoncelli

Main category: cs.LG

TL;DR: The paper introduces a new framework for learning normalized energy models inspired by diffusion models, using dual score matching to ensure consistent energies across noise levels, achieving state-of-the-art log likelihood on ImageNet64 while revealing insights about image probability distributions.

Details

Motivation: Learning probability models from data is challenging due to the curse of dimensionality. The authors aim to develop a framework for learning normalized energy models that can accurately estimate log probabilities while overcoming dimensionality issues.

Method: Modify score network architecture to compute energy while preserving inductive biases. Use dual score matching: primary objective optimizes gradient w.r.t. input (score), secondary objective optimizes gradient w.r.t. noise level to ensure consistent normalized energies across noise levels.

Result: Achieved cross-entropy comparable to state-of-the-art on ImageNet64. Energy model shows strong generalization - log probabilities from networks trained on non-overlapping subsets are nearly identical. Revealed that image probability and local neighborhood dimensionality vary substantially with image content.

Conclusion: The proposed dual score matching framework enables effective learning of normalized energy models, achieving competitive performance while providing insights that challenge conventional assumptions about concentration of measure and low-dimensional manifold support in image distributions.

Abstract: Learning probability models from data is at the heart of many machine learning endeavors, but is notoriously difficult due to the curse of dimensionality. We introduce a new framework for learning \emph{normalized} energy (log probability) models that is inspired by diffusion generative models, which rely on networks optimized to estimate the score. We modify a score network architecture to compute an energy while preserving its inductive biases. The gradient of this energy network with respect to its input image is the score of the learned density, which can be optimized using a denoising objective. Importantly, the gradient with respect to the noise level provides an additional score that can be optimized with a novel secondary objective, ensuring consistent and normalized energies across noise levels. We train an energy network with this \emph{dual} score matching objective on the ImageNet64 dataset, and obtain a cross-entropy (negative log likelihood) value comparable to the state of the art. We further validate our approach by showing that our energy model \emph{strongly generalizes}: log probabilities estimated with two networks trained on non-overlapping data subsets are nearly identical. Finally, we demonstrate that both image probability and dimensionality of local neighborhoods vary substantially depending on image content, in contrast with conventional assumptions such as concentration of measure or support on a low-dimensional manifold.

[417] LittleBit: Ultra Low-Bit Quantization via Latent Factorization

Banseok Lee, Dongkyu Kim, Youngcheon You, Youngmin Kim

Main category: cs.LG

TL;DR: LittleBit enables extreme LLM compression to 0.1 bits per weight using low-rank matrix factorization with binarization and multi-scale compensation, achieving 31× memory reduction while maintaining performance.

Details

Motivation: Large language models face substantial memory and computational costs, making deployment challenging. While quantization helps, performance degradation in sub-1-bit regimes remains particularly difficult, creating a need for effective extreme compression methods.

Method: LittleBit represents weights in low-rank form using latent matrix factorization, then binarizes these factors. It integrates multi-scale compensation (row, column, and latent dimension with per-rank importance) and uses Dual Sign-Value-Independent Decomposition for QAT initialization and integrated Residual Compensation to mitigate errors.

Result: Achieves 31× memory reduction (e.g., Llama2-13B to under 0.9 GB). LittleBit’s 0.1 BPW performance on Llama2-7B surpasses leading methods at 0.7 BPW, enabling 11.6× kernel-level speedup over FP16.

Conclusion: LittleBit establishes a new viable size-performance trade-off, making powerful LLMs practical for resource-constrained environments through extreme sub-1-bit quantization while maintaining competitive performance.

Abstract: Deploying large language models (LLMs) often faces challenges from substantial memory and computational costs. Quantization offers a solution, yet performance degradation in the sub-1-bit regime remains particularly difficult. This paper introduces LittleBit, a novel method for extreme LLM compression. It targets levels like 0.1 bits per weight (BPW), achieving nearly 31$\times$ memory reduction, e.g., Llama2-13B to under 0.9 GB. LittleBit represents weights in a low-rank form using latent matrix factorization, subsequently binarizing these factors. To counteract information loss from this extreme precision, it integrates a multi-scale compensation mechanism. This includes row, column, and an additional latent dimension that learns per-rank importance. Two key contributions enable effective training: Dual Sign-Value-Independent Decomposition (Dual-SVID) for quantization-aware training (QAT) initialization, and integrated Residual Compensation to mitigate errors. Extensive experiments confirm LittleBit’s superiority in sub-1-bit quantization: e.g., its 0.1 BPW performance on Llama2-7B surpasses the leading method’s 0.7 BPW. LittleBit establishes a new, viable size-performance trade-off–unlocking a potential 11.6$\times$ speedup over FP16 at the kernel level–and makes powerful LLMs practical for resource-constrained environments. Our code can be found at https://github.com/SamsungLabs/LittleBit.

[418] Advancing Safe Mechanical Ventilation Using Offline RL With Hybrid Actions and Clinically Aligned Rewards

Muhammad Hamza Yousuf, Jason Li, Sahar Vahdati, Raphael Theilen, Jakob Wittenstein, Jens Lehmann

Main category: cs.LG

TL;DR: The paper presents IntelliLung, an AI system using offline reinforcement learning to optimize invasive mechanical ventilation settings for ICU patients, addressing challenges with hybrid action spaces and scaling to 6 settings.

Details

Motivation: Optimal mechanical ventilation settings can reduce mortality and ventilator-induced lung injury in ICU patients, but current optimization is complex and error-prone due to patient variability. Existing offline RL methods struggle with the hybrid (continuous/discrete) nature of ventilation settings.

Method: Developed IntelliLung using constrained action space and factored action critics to handle hybrid action spaces without discretization. Adapted state-of-the-art offline RL algorithms, introduced clinically grounded reward function based on ventilator-free days and physiological targets, and used multiobjective optimization for reward selection.

Result: The approach scales to 6 optimizable ventilation settings compared to 2-3 in previous studies, avoids distribution shift issues from discretization, and provides more equitable consideration of clinically relevant objectives through multiobjective optimization.

Conclusion: IntelliLung represents a clinically aligned AI system developed in collaboration with healthcare professionals that addresses key limitations of previous approaches and is designed for real-world deployment to improve mechanical ventilation optimization in ICU settings.

Abstract: Invasive mechanical ventilation (MV) is a life-sustaining therapy commonly used in the intensive care unit (ICU) for patients with severe and acute conditions. These patients frequently rely on MV for breathing. Given the high risk of death in such cases, optimal MV settings can reduce mortality, minimize ventilator-induced lung injury, shorten ICU stays, and ease the strain on healthcare resources. However, optimizing MV settings remains a complex and error-prone process due to patient-specific variability. While Offline Reinforcement Learning (RL) shows promise for optimizing MV settings, current methods struggle with the hybrid (continuous and discrete) nature of MV settings. Discretizing continuous settings leads to exponential growth in the action space, which limits the number of optimizable settings. Converting the predictions back to continuous can cause a distribution shift, compromising safety and performance. To address this challenge, in the IntelliLung project, we are developing an AI-based approach where we constrain the action space and employ factored action critics. This approach allows us to scale to six optimizable settings compared to 2-3 in previous studies. We adapt SOTA offline RL algorithms to operate directly on hybrid action spaces, avoiding the pitfalls of discretization. We also introduce a clinically grounded reward function based on ventilator-free days and physiological targets. Using multiobjective optimization for reward selection, we show that this leads to a more equitable consideration of all clinically relevant objectives. Notably, we develop a system in close collaboration with healthcare professionals that is aligned with real-world clinical objectives and designed with future deployment in mind.

[419] Curating art exhibitions using machine learning

Eurico Covas

Main category: cs.LG

TL;DR: AI models can learn to curate exhibitions from human-curated examples, with three out of four models achieving reasonable imitation of human curators, showing that exhibition data contains sufficient information for AI curation and that well-designed modest models can approach large language model performance.

Details

Motivation: To explore whether artificial intelligence can learn exhibition curation from existing human-curated exhibitions, and to determine if simpler, well-designed models can perform comparably to large language models for this specific task.

Method: Developed four related machine learning models that learn from existing human-curated exhibitions. Used feature engineering and careful architecture design for modest-sized models, comparing them against brute-force approaches using large language models like GPT.

Result: Three out of four AI models achieved reasonable ability to imitate human curators with varying degrees of precision and curatorial coherence. The models replicated past exhibitions with accuracy well above random chance.

Conclusion: Two key insights: 1) Exhibition data contains sufficient information for AI to replicate past curation with above-random accuracy; 2) Well-designed modest models with feature engineering can approach the performance of large language models without brute-force approaches.

Abstract: Here we present a series of artificial models - a total of four related models - based on machine learning techniques that attempt to learn from existing exhibitions which have been curated by human experts, in order to be able to do similar curatorship work. Out of our four artificial intelligence models, three achieve a reasonable ability at imitating these various curators responsible for all those exhibitions, with various degrees of precision and curatorial coherence. In particular, we can conclude two key insights: first, that there is sufficient information in these exhibitions to construct an artificial intelligence model that replicates past exhibitions with an accuracy well above random choices; and second, that using feature engineering and carefully designing the architecture of modest size models can make them almost as good as those using the so-called large language models such as GPT in a brute force approach.

[420] COALA: Numerically Stable and Efficient Framework for Context-Aware Low-Rank Approximation

Uliana Parkina, Maxim Rakhuba

Main category: cs.LG

TL;DR: Proposes inversion-free regularized framework for context-aware low-rank approximation to overcome numerical instabilities in neural network compression/fine-tuning.

Details

Motivation: Existing context-aware low-rank approximation methods for neural networks suffer from numerical instabilities due to reliance on classical formulas involving explicit Gram matrix computation and inversion, which can degrade approximation quality or cause numerically singular matrices.

Method: Novel inversion-free regularized framework based entirely on stable decompositions that avoids explicit Gram matrix computation and inversion. Handles challenging scenarios: (1) calibration matrices exceeding GPU memory, (2) nearly singular input activation matrices, (3) insufficient data preventing unique approximation.

Result: Method overcomes numerical pitfalls of prior art. For insufficient data scenarios, proves solution converges to desired approximation and derives explicit error bounds.

Conclusion: Proposed inversion-free regularized framework provides stable, efficient solution for context-aware low-rank approximation in neural network compression and fine-tuning, addressing key numerical limitations of existing methods.

Abstract: Recent studies suggest that context-aware low-rank approximation is a useful tool for compression and fine-tuning of modern large-scale neural networks. In this type of approximation, a norm is weighted by a matrix of input activations, significantly improving metrics over the unweighted case. Nevertheless, existing methods for neural networks suffer from numerical instabilities due to their reliance on classical formulas involving explicit Gram matrix computation and their subsequent inversion. We demonstrate that this can degrade the approximation quality or cause numerically singular matrices. To address these limitations, we propose a novel inversion-free regularized framework that is based entirely on stable decompositions and overcomes the numerical pitfalls of prior art. Our method can handle possible challenging scenarios: (1) when calibration matrices exceed GPU memory capacity, (2) when input activation matrices are nearly singular, and even (3) when insufficient data prevents unique approximation. For the latter, we prove that our solution converges to a desired approximation and derive explicit error bounds.

[421] Unveiling m-Sharpness Through the Structure of Stochastic Gradient Noise

Haocheng Luo, Mehrtash Harandi, Dinh Phung, Trung Le

Main category: cs.LG

TL;DR: SAM’s generalization improves with smaller micro-batch sizes due to implicit variance-based sharpness regularization, leading to proposed Reweighted SAM for parallelizable training.

Details

Motivation: SAM improves generalization but its principles are unclear, especially why performance improves with smaller micro-batch sizes (m-sharpness phenomenon), which is critical for distributed training but lacks rigorous explanation.

Method: Use extended Stochastic Differential Equation (SDE) framework and analyze stochastic gradient noise to characterize SAM variants (n-SAM and m-SAM), revealing implicit variance-based sharpness regularization. Propose Reweighted SAM (RW-SAM) with sharpness-weighted sampling to mimic m-SAM benefits while remaining parallelizable.

Result: Analysis shows stochastic perturbations induce implicit variance-based sharpness regularization whose strength increases as micro-batch size decreases. RW-SAM successfully mimics generalization benefits of m-SAM while maintaining parallelizability.

Conclusion: The m-sharpness phenomenon in SAM is explained by implicit variance-based regularization, and RW-SAM provides a practical parallelizable alternative that captures these generalization benefits.

Abstract: Sharpness-aware minimization (SAM) has emerged as a highly effective technique to improve model generalization, but its underlying principles are not fully understood. We investigate m-sharpness, where SAM performance improves monotonically as the micro-batch size for computing perturbations decreases, a phenomenon critical for distributed training yet lacking rigorous explanation. We leverage an extended Stochastic Differential Equation (SDE) framework and analyze stochastic gradient noise (SGN) to characterize the dynamics of SAM variants, including n-SAM and m-SAM. Our analysis reveals that stochastic perturbations induce an implicit variance-based sharpness regularization whose strength increases as m decreases. Motivated by this insight, we propose Reweighted SAM (RW-SAM), which employs sharpness-weighted sampling to mimic the generalization benefits of m-SAM while remaining parallelizable. Comprehensive experiments validate our theory and method.

[422] Functional Critics Are Essential in Off-Policy Actor-Critic: Provable Convergence and Efficient Exploration

Qinxun Bai, Yuxuan Han, Wei Xu, Zhengyuan Zhou

Main category: cs.LG

TL;DR: Functional critics (policy-conditioned value functions) are essential for off-policy actor-critic algorithms, addressing both convergence issues and enabling efficient exploration, with practical implementation achieving competitive performance.

Details

Motivation: Off-policy actor-critic algorithms suffer from the "moving target" problem where the policy being evaluated changes continually, making theoretical convergence difficult. Previous functional critic approaches haven't been competitive with state-of-the-art AC algorithms.

Method: Revisits functional critics within off-policy AC framework, identifies their necessity for convergence and exploration, proposes tailored neural network architecture and minimal AC algorithm based on these insights.

Result: Establishes first convergence proof for off-policy target-based AC algorithm under linear function approximation, shows functional critics enable efficient exploration via posterior sampling approximation, and achieves competitive performance on DeepMind Control Suite.

Conclusion: Functional critics are essential rather than optional for off-policy actor-critic algorithms, addressing both theoretical convergence challenges and practical exploration efficiency, making them a necessity for robust RL systems.

Abstract: Off-policy reinforcement learning (RL) with function approximation offers an effective way to improve sample efficiency by reusing past experience. Within this setting, the actor-critic (AC) framework has achieved strong empirical success but suffers from the “moving target” problem, where the policy being evaluated changes continually. Functional critics, or policy-conditioned value functions, have been proposed to address this issue by including a representation of the policy as input. While the concept of generalizing value functions across policy space is appealing, previous efforts have struggled to remain competitive against state-of-the-art AC algorithms that do not utilize functional critics. In this work, we revisit functional critics within the off-policy AC framework and identify two aspects that render them a necessity rather than a luxury. First, in off-policy AC, critic learning contends with both the “deadly triad” instability and the “moving target” issue, while actor learning faces the challenge of estimating the exact off-policy policy gradient. This complex interplay makes theoretical convergence extremely difficult for practical algorithms. We demonstrate that a functional critic is essential for addressing this challenge and establish the first convergence proof for an off-policy target-based AC algorithm under linear function approximation. Second, we identify a crucial link between functional critic modeling and efficient exploration. Specifically, we show that approximating posterior sampling for exploration in model-free settings is infeasible without functional critics. Practically, we propose a tailored neural network architecture and a minimal AC algorithm that relies solely on these insights. In experiments on the DeepMind Control Suite, this implementation achieves performance competitive with state-of-the-art methods.

[423] Knowledge Homophily in Large Language Models

Utkarsh Sahu, Zhisheng Qi, Mahantesh Halappanavar, Nedim Lipka, Ryan A. Rossi, Franck Dernoncourt, Yu Zhang, Yao Ma, Yu Wang

Main category: cs.LG

TL;DR: LLMs show knowledge homophily patterns similar to human semantic memory, where related facts cluster together. This enables graph-based knowledgeability prediction to optimize knowledge injection and retrieval.

Details

Motivation: While LLMs are used as knowledge bases, their structural knowledge organization remains unexplored. The paper aims to investigate whether LLMs exhibit knowledge homophily patterns similar to human cognitive neuroscience findings (semantic clustering and priming).

Method: Map LLM knowledge into graph representation through knowledge checking at triplet and entity levels. Analyze knowledgeability relationships between entities and neighbors. Propose GNN regression model to estimate entity-level knowledgeability scores by leveraging neighborhood scores.

Result: Discover that LLMs tend to possess similar knowledge levels about entities positioned closer in the graph (knowledge homophily). The GNN model successfully predicts knowledgeability scores, enabling prioritization of less well-known triplets for labeling.

Conclusion: The knowledge homophily principle improves efficiency of active labeling for fine-tuning LLMs and enhances multi-hop path retrieval in reasoning-intensive QA, maximizing knowledge coverage under limited labeling budgets.

Abstract: Large Language Models (LLMs) have been increasingly studied as neural knowledge bases for supporting knowledge-intensive applications such as question answering and fact checking. However, the structural organization of their knowledge remains unexplored. Inspired by cognitive neuroscience findings, such as semantic clustering and priming, where knowing one fact increases the likelihood of recalling related facts, we investigate an analogous knowledge homophily pattern in LLMs. To this end, we map LLM knowledge into a graph representation through knowledge checking at both the triplet and entity levels. After that, we analyze the knowledgeability relationship between an entity and its neighbors, discovering that LLMs tend to possess a similar level of knowledge about entities positioned closer in the graph. Motivated by this homophily principle, we propose a Graph Neural Network (GNN) regression model to estimate entity-level knowledgeability scores for triplets by leveraging their neighborhood scores. The predicted knowledgeability enables us to prioritize checking less well-known triplets, thereby maximizing knowledge coverage under the same labeling budget. This not only improves the efficiency of active labeling for fine-tuning to inject knowledge into LLMs but also enhances multi-hop path retrieval in reasoning-intensive question answering.

[424] Learning Regularization Functionals for Inverse Problems: A Comparative Study

Johannes Hertrich, Hok Shing Wong, Alexander Denker, Stanislas Ducotterd, Zhenghan Fang, Markus Haltmeier, Željko Kereta, Erich Kobler, Oscar Leong, Mohammad Sadegh Salehi, Carola-Bibiane Schönlieb, Johannes Schwab, Zakhar Shumaylov, Jeremias Sulam, German Shâma Wache, Martin Zach, Yasi Zhang, Matthias J. Ehrhardt, Sebastian Neumayer

Main category: cs.LG

TL;DR: The paper presents a unified framework for comparing learned regularization methods in imaging inverse problems by collecting and standardizing existing code implementations.

Details

Motivation: Learned regularization methods for imaging inverse problems have proliferated with different architectures and training strategies, but direct comparison is difficult due to non-modular implementations and lack of standardization.

Method: Collect and unify available code implementations into a common framework, enabling systematic comparison of different learned regularization approaches through standardized evaluation.

Result: The unified framework allows systematic comparison of methods, highlighting their strengths and limitations, and provides practical guidelines for implementation and use.

Conclusion: The unified framework offers valuable insights into learned regularization methods’ future potential and facilitates fair comparison, advancing the field through standardized evaluation practices.

Abstract: In recent years, a variety of learned regularization frameworks for solving inverse problems in imaging have emerged. These offer flexible modeling together with mathematical insights. The proposed methods differ in their architectural design and training strategies, making direct comparison challenging due to non-modular implementations. We address this gap by collecting and unifying the available code into a common framework. This unified view allows us to systematically compare the approaches and highlight their strengths and limitations, providing valuable insights into their future potential. We also provide concise descriptions of each method, complemented by practical guidelines.

[425] Fine-Tuning Diffusion Models via Intermediate Distribution Shaping

Gautham Govind Anil, Shaan Ul Haque, Nithish Kannen, Dheeraj Nagaraj, Sanjay Shakkottai, Karthikeyan Shanmugam

Main category: cs.LG

TL;DR: The paper introduces GRAFT (Generalized Rejection sAmpling based Fine-Tuning) framework for diffusion models, showing it performs KL-regularized reward maximization, proposes P-GRAFT for intermediate noise level shaping, and introduces inverse noise correction to improve flow models without explicit rewards.

Details

Motivation: While diffusion models capture training data distributions well, there's a need to shape these distributions using reward functions for downstream applications. Policy gradient methods like PPO work for autoregressive generation but are intractable for diffusion models due to marginal likelihood requirements.

Method: 1) Unify RAFT variants as GRAFT framework showing implicit KL-regularized reward maximization; 2) Introduce P-GRAFT for shaping distributions at intermediate noise levels; 3) Propose inverse noise correction to improve flow models without explicit rewards; 4) Mathematical analysis via bias-variance tradeoff.

Result: Applied to Stable Diffusion 2, GRAFT outperforms policy gradient methods on T2I benchmarks in VQAScore with 8.81% relative improvement over base model. Inverse noise correction improves FID for unconditional image generation at lower FLOPs/image. Validated across text-to-image, layout, molecule, and unconditional image generation tasks.

Conclusion: GRAFT provides an effective framework for fine-tuning diffusion models with reward functions, addressing tractability issues of policy gradient methods. P-GRAFT enables more effective fine-tuning via intermediate noise shaping, and inverse noise correction offers computational efficiency improvements for flow models.

Abstract: Diffusion models are widely used for generative tasks across domains. While pre-trained diffusion models effectively capture the training data distribution, it is often desirable to shape these distributions using reward functions to align with downstream applications. Policy gradient methods, such as Proximal Policy Optimization (PPO), are widely used in the context of autoregressive generation. However, the marginal likelihoods required for such methods are intractable for diffusion models, leading to alternative proposals and relaxations. In this context, we unify variants of Rejection sAmpling based Fine-Tuning (RAFT) as GRAFT, and show that this implicitly performs KL regularized reward maximization with reshaped rewards. We then introduce P-GRAFT to shape distributions at intermediate noise levels and demonstrate empirically that this can lead to more effective fine-tuning. We mathematically explain this via a bias-variance tradeoff. Motivated by this, we propose inverse noise correction to improve flow models without leveraging explicit rewards. We empirically evaluate our methods on text-to-image(T2I) generation, layout generation, molecule generation and unconditional image generation. Notably, our framework, applied to Stable Diffusion 2, improves over policy gradient methods on popular T2I benchmarks in terms of VQAScore and shows an $8.81%$ relative improvement over the base model. For unconditional image generation, inverse noise correction improves FID of generated images at lower FLOPs/image.

[426] Distributionally Robust Causal Abstractions

Yorgos Felekis, Theodoros Damoulas, Paris Giampouras

Main category: cs.LG

TL;DR: First distributionally robust causal abstraction learning framework with Wasserstein ambiguity sets to handle environmental shifts and model misspecification.

Details

Motivation: Existing causal abstraction learning methods assume fixed, well-specified exogenous distributions, making them vulnerable to environmental shifts and misspecification.

Method: Introduce distributionally robust causal abstractions formulated as constrained min-max optimization with Wasserstein ambiguity sets, with theoretical results for empirical and Gaussian environments.

Result: Provides principled selection of robustness levels via ambiguity set radii, and empirical evidence shows robustness to environmental shifts, structural model misspecification, and intervention mapping errors.

Conclusion: First robust causal abstraction learning framework that addresses key limitations of existing methods through distributional robustness with theoretical guarantees.

Abstract: Causal Abstraction (CA) theory provides a principled framework for relating causal models that describe the same system at different levels of granularity while ensuring interventional consistency between them. Recently, several approaches for learning CAs have been proposed, but all assume fixed and well-specified exogenous distributions, making them vulnerable to environmental shifts and misspecification. In this work, we address these limitations by introducing the first class of distributionally robust CAs and their associated learning algorithms. The latter cast robust causal abstraction learning as a constrained min-max optimization problem with Wasserstein ambiguity sets. We provide theoretical results, for both empirical and Gaussian environments, leading to principled selection of the level of robustness via the radius of these sets. Furthermore, we present empirical evidence across different problems and CA learning methods, demonstrating our framework’s robustness not only to environmental shifts but also to structural model and intervention mapping misspecification.

[427] Attn-JGNN: Attention Enhanced Join-Graph Neural Networks

Jixin Zhang

Main category: cs.LG

TL;DR: Attn-JGNN: Attention-enhanced join-graph neural network for #SAT solving with improved accuracy via attention mechanisms in join-graph clusters.

Details

Motivation: To improve solving accuracy for #SAT (model counting) problems using neural networks, addressing limitations of existing neural approaches by incorporating attention mechanisms into join-graph representations.

Method: 1. Uses tree decomposition to encode CNF formulas into join-graphs; 2. Performs iterative message passing on join-graphs; 3. Applies attention mechanisms within and between clusters to focus on key variables and reduce redundant computation; 4. Learns partition functions to approximate model counts.

Result: Attn-JGNN achieves better results than other neural network methods for #SAT solving, demonstrating improved accuracy through attention-enhanced join-graph processing.

Conclusion: Attention mechanisms applied to join-graph neural networks effectively improve #SAT solving accuracy by focusing computational resources on critical variables and clusters during probabilistic inference.

Abstract: We propose an Attention Enhanced Join-Graph Neural Networks(Attn-JGNN) model for solving #SAT problems, which significantly improves the solving accuracy. Inspired by the Iterative Join Graph Propagation (IJGP) algorithm, Attn-JGNN uses tree decomposition to encode the CNF formula into a join-graph, then performs iterative message passing on the join-graph, and finally approximates the model number by learning partition functions. In order to further improve the accuracy of the solution, we apply the attention mechanism in and between clusters of the join-graphs, which makes Attn-JGNN pay more attention to the key variables and clusters in probabilistic inference, and reduces the redundant calculation. Finally, our experiments show that our Attn-JGNN model achieves better results than other neural network methods.

[428] Learning Without Augmenting: Unsupervised Time Series Representation Learning via Frame Projections

Berken Utku Demirel, Christian Holz

Main category: cs.LG

TL;DR: Proposes unsupervised representation learning using orthonormal bases and overcomplete frames instead of data augmentations, achieving 15-20% gains on temporal sequence tasks.

Details

Motivation: Traditional SSL methods rely on handcrafted data augmentations that require domain knowledge and impose representational invariances that can limit generalization. This is especially challenging for temporal sequence tasks where signal-specific characteristics make augmentations difficult.

Method: Replaces data augmentations with views generated using orthonormal bases and overcomplete frames. Learns embeddings from these distinct spaces where samples reside on different manifolds shaped by geometric biases, then jointly leverages the complementary geometry of these manifolds.

Result: Achieves performance gains of 15-20% over existing self-supervised approaches on nine datasets across five temporal sequence tasks, without relying on augmentation-induced diversity.

Conclusion: The method demonstrates that geometric biases from orthonormal and overcomplete spaces can effectively replace handcrafted augmentations for representation learning, particularly benefiting challenging temporal sequence tasks where traditional augmentations are difficult.

Abstract: Self-supervised learning (SSL) has emerged as a powerful paradigm for learning representations without labeled data. Most SSL approaches rely on strong, well-established, handcrafted data augmentations to generate diverse views for representation learning. However, designing such augmentations requires domain-specific knowledge and implicitly imposes representational invariances on the model, which can limit generalization. In this work, we propose an unsupervised representation learning method that replaces augmentations by generating views using orthonormal bases and overcomplete frames. We show that embeddings learned from orthonormal and overcomplete spaces reside on distinct manifolds, shaped by the geometric biases introduced by representing samples in different spaces. By jointly leveraging the complementary geometry of these distinct manifolds, our approach achieves superior performance without artificially increasing data diversity through strong augmentations. We demonstrate the effectiveness of our method on nine datasets across five temporal sequence tasks, where signal-specific characteristics make data augmentations particularly challenging. Without relying on augmentation-induced diversity, our method achieves performance gains of up to 15–20% over existing self-supervised approaches. Source code: https://github.com/eth-siplab/Learning-with-FrameProjections

[429] Deep Jump Gaussian Processes for Surrogate Modeling of High-Dimensional Piecewise Continuous Functions

Yang Xu, Chiwoo Park

Main category: cs.LG

TL;DR: DJGP is a novel surrogate modeling method that combines region-specific linear projections with Jump Gaussian Processes to handle piecewise continuous functions in high dimensions, achieving better accuracy and uncertainty quantification than existing methods.

Details

Motivation: Conventional Jump Gaussian Processes (JGP) have limitations in high-dimensional input spaces, needing a method that can effectively model piecewise continuous functions while capturing local low-dimensional subspace structures.

Method: DJGP integrates region-specific locally linear projections with JGP modeling, using region-dependent matrices to capture local low-dimensional subspaces. It places a Gaussian Process prior on projection matrices for smooth evolution across input space, creating a two-layer deep learning architecture of GP/JGP. A scalable variational inference algorithm jointly learns projection matrices and JGP hyperparameters.

Result: Theoretical analysis provides an oracle error bound decomposed into four distinct error sources with practical implications. Experiments on synthetic and benchmark datasets show DJGP achieves superior predictive accuracy and more reliable uncertainty quantification compared to existing methods.

Conclusion: DJGP effectively addresses high-dimensional piecewise continuous function modeling by combining localized projections with JGP, offering both theoretical guarantees and practical performance improvements in surrogate modeling.

Abstract: We introduce Deep Jump Gaussian Processes (DJGP), a novel method for surrogate modeling of a piecewise continuous function on a high-dimensional domain. DJGP addresses the limitations of conventional Jump Gaussian Processes (JGP) in high-dimensional input spaces by integrating region-specific, locally linear projections with JGP modeling. These projections employ region-dependent matrices to capture local low-dimensional subspace structures, making them well suited to the inherently localized modeling behavior of JGPs, a variant of local Gaussian processes. To control model complexity, we place a Gaussian Process prior on the projection matrices, allowing them to evolve smoothly across the input space. The projected inputs are then modeled with a JGP to capture piecewise continuous relationships with the response. This yields a distinctive two-layer deep learning of GP/JGP. We further develop a scalable variational inference algorithm to jointly learn the projection matrices and JGP hyperparameters. Rigorous theoretical analysis and extensive empirical studies are provided to justify the proposed approach. In particular, we derive an oracle error bound for DJGP and decompose it into four distinct sources of error, which are then linked to practical implications. Experiments on synthetic and benchmark datasets demonstrate that DJGP achieves superior predictive accuracy and more reliable uncertainty quantification compared with existing methods.

[430] Geometric Algorithms for Neural Combinatorial Optimization with Constraints

Nikolaos Karalias, Akbar Rafiey, Yifei Xu, Zhishang Luo, Behrooz Tahmasebi, Connie Jiang, Stefanie Jegelka

Main category: cs.LG

TL;DR: Self-supervised learning framework for combinatorial optimization that solves discrete constrained problems using neural networks with convex geometry techniques for feasible solution decomposition.

Details

Motivation: Address the central challenge of SSL for CO: solving problems with discrete constraints, which is difficult for neural networks that typically produce continuous outputs.

Method: End-to-end differentiable framework leveraging convex geometry and Carathéodory’s theorem to decompose neural network outputs into convex combinations of polytope corners corresponding to feasible sets, enabling self-supervised training and quality-preserving rounding.

Result: Extensive experiments show consistent outperformance over neural baselines in cardinality-constrained optimization, with worked examples demonstrating applicability to independent sets in graphs and matroid-constrained problems.

Conclusion: The proposed decomposition-based approach enables effective self-supervised learning for combinatorial optimization with discrete constraints, providing a general framework applicable to diverse CO tasks beyond cardinality constraints.

Abstract: Self-Supervised Learning (SSL) for Combinatorial Optimization (CO) is an emerging paradigm for solving combinatorial problems using neural networks. In this paper, we address a central challenge of SSL for CO: solving problems with discrete constraints. We design an end-to-end differentiable framework that enables us to solve discrete constrained optimization problems with neural networks. Concretely, we leverage algorithmic techniques from the literature on convex geometry and Carathéodory’s theorem to decompose neural network outputs into convex combinations of polytope corners that correspond to feasible sets. This decomposition-based approach enables self-supervised training but also ensures efficient quality-preserving rounding of the neural net output into feasible solutions. Extensive experiments in cardinality-constrained optimization show that our approach can consistently outperform neural baselines. We further provide worked-out examples of how our method can be applied beyond cardinality-constrained problems to a diverse set of combinatorial optimization tasks, including finding independent sets in graphs, and solving matroid-constrained problems.

[431] Differential Privacy as a Perk: Federated Learning over Multiple-Access Fading Channels with a Multi-Antenna Base Station

Hao Liang, Haifeng Wen, Kaishun Wu, Hong Xing

Main category: cs.LG

TL;DR: This paper shows that in over-the-air federated learning (AirFL), channel noise alone can provide differential privacy without artificial noise injection, challenging previous assumptions that artificial noise is necessary for privacy.

Details

Motivation: Prior work assumed artificial noise is required for differential privacy in AirFL, but this paper challenges that assumption by showing channel impairments can naturally provide privacy without compromising training convergence.

Method: The authors study AirFL over multiple-access fading channels with multi-antenna BS, derive novel convergent DP bounds under general bounded-domain assumptions, and optimize receive beamforming and power allocations to characterize convergence-privacy trade-offs.

Result: Theoretical analysis reveals DP can be achieved as a “perk” without artificial noise injection, with explicit conditions where privacy doesn’t compromise training. Numerical results validate the theoretical findings.

Conclusion: Channel noise in AirFL can naturally provide differential privacy without artificial noise, challenging previous assumptions and enabling more efficient privacy-preserving federated learning systems.

Abstract: Federated Learning (FL) is a distributed learning paradigm that preserves privacy by eliminating the need to exchange raw data during training. In its prototypical edge instantiation with underlying wireless transmissions enabled by analog over-the-air computing (AirComp), referred to as \emph{over-the-air FL (AirFL)}, the inherent channel noise plays a unique role of \emph{frenemy} in the sense that it degrades training due to noisy global aggregation while providing a natural source of randomness for privacy-preserving mechanisms, formally quantified by \emph{differential privacy (DP)}. It remains, nevertheless, challenging to effectively harness such channel impairments, as prior arts, under assumptions of either simple channel models or restricted types of loss functions, mostly considering (local) DP enhancement with a single-round or non-convergent bound on privacy loss. In this paper, we study AirFL over multiple-access fading channels with a multi-antenna base station (BS) subject to user-level DP requirements. Despite a recent study, which claimed in similar settings that artificial noise (AN) must be injected to ensure DP in general, we demonstrate, on the contrary, that DP can be gained as a \emph{perk} even \emph{without} employing any AN. Specifically, we derive a novel bound on DP that converges under general bounded-domain assumptions on model parameters, along with a convergence bound with general smooth and non-convex loss functions. Next, we optimize over receive beamforming and power allocations to characterize the optimal convergence-privacy trade-offs, which also reveal explicit conditions in which DP is achievable without compromising training. Finally, our theoretical findings are validated by extensive numerical results.

[432] Bootstrap Off-policy with World Model

Guojian Zhan, Likun Wang, Xiangteng Zhang, Jiaxin Gao, Masayoshi Tomizuka, Shengbo Eben Li

Main category: cs.LG

TL;DR: BOOM integrates planning and off-policy learning through a bootstrap loop with world models, using likelihood-free alignment and soft value-weighting to achieve SOTA results on control benchmarks.

Details

Motivation: Online planning improves RL sample efficiency but creates divergence between collected data and actual policy behaviors, degrading both model learning and policy improvement.

Method: BOOM framework with bootstrap loop: policy initializes planner, planner refines actions to bootstrap policy through behavior alignment. Uses jointly learned world model for trajectory simulation and value targets. Core components: likelihood-free alignment loss bootstrapping policy using planner’s non-parametric action distribution, and soft value-weighted mechanism prioritizing high-return behaviors.

Result: Achieves state-of-the-art results in both training stability and final performance on high-dimensional DeepMind Control Suite and Humanoid-Bench.

Conclusion: BOOM successfully addresses the planning-data divergence problem through tight integration of planning and off-policy learning with world models, demonstrating superior performance on challenging control tasks.

Abstract: Online planning has proven effective in reinforcement learning (RL) for improving sample efficiency and final performance. However, using planning for environment interaction inevitably introduces a divergence between the collected data and the policy’s actual behaviors, degrading both model learning and policy improvement. To address this, we propose BOOM (Bootstrap Off-policy with WOrld Model), a framework that tightly integrates planning and off-policy learning through a bootstrap loop: the policy initializes the planner, and the planner refines actions to bootstrap the policy through behavior alignment. This loop is supported by a jointly learned world model, which enables the planner to simulate future trajectories and provides value targets to facilitate policy improvement. The core of BOOM is a likelihood-free alignment loss that bootstraps the policy using the planner’s non-parametric action distribution, combined with a soft value-weighted mechanism that prioritizes high-return behaviors and mitigates variability in the planner’s action quality within the replay buffer. Experiments on the high-dimensional DeepMind Control Suite and Humanoid-Bench show that BOOM achieves state-of-the-art results in both training stability and final performance. The code is accessible at https://github.com/molumitu/BOOM_MBRL.

[433] EarthSight: A Distributed Framework for Low-Latency Satellite Intelligence

Ansel Kaplan Erol, Seungjun Lee, Divya Mahajan

Main category: cs.LG

TL;DR: EarthSight is a distributed runtime framework for satellite constellations that enables low-latency image analysis by coordinating onboard multi-task inference with ground-station query scheduling and dynamic filter ordering.

Details

Motivation: Traditional satellite image delivery pipelines suffer from hours-to-days delays due to bandwidth limitations. Existing onboard ML solutions treat satellites as isolated nodes, causing redundant inference that strains power and compute resources, limiting mission scope and responsiveness.

Method: Three core innovations: 1) Multi-task inference on satellites using shared backbones to amortize computation across vision tasks; 2) Ground-station query scheduler that aggregates user requests, predicts priorities, and assigns compute budgets; 3) Dynamic filter ordering that integrates model selectivity, accuracy, and execution cost to reject low-value images early.

Result: EarthSight reduces average compute time per image by 1.9x and lowers 90th percentile end-to-end latency from first contact to delivery from 51 to 21 minutes compared to state-of-the-art baseline in satellite simulator evaluations.

Conclusion: EarthSight enables satellite constellations to perform scalable, low-latency image analysis within strict downlink bandwidth and onboard power budgets by leveraging global ground-station context and resource-aware adaptive decisions in orbit.

Abstract: Low-latency delivery of satellite imagery is essential for time-critical applications such as disaster response, intelligence, and infrastructure monitoring. However, traditional pipelines rely on downlinking all captured images before analysis, introducing delays of hours to days due to restricted communication bandwidth. To address these bottlenecks, emerging systems perform onboard machine learning to prioritize which images to transmit. However, these solutions typically treat each satellite as an isolated compute node, limiting scalability and efficiency. Redundant inference across satellites and tasks further strains onboard power and compute costs, constraining mission scope and responsiveness. We present EarthSight, a distributed runtime framework that redefines satellite image intelligence as a distributed decision problem between orbit and ground. EarthSight introduces three core innovations: (1) multi-task inference on satellites using shared backbones to amortize computation across multiple vision tasks; (2) a ground-station query scheduler that aggregates user requests, predicts priorities, and assigns compute budgets to incoming imagery; and (3) dynamic filter ordering, which integrates model selectivity, accuracy, and execution cost to reject low-value images early and conserve resources. EarthSight leverages global context from ground stations and resource-aware adaptive decisions in orbit to enable constellations to perform scalable, low-latency image analysis within strict downlink bandwidth and onboard power budgets. Evaluations using a prior established satellite simulator show that EarthSight reduces average compute time per image by 1.9x and lowers 90th percentile end-to-end latency from first contact to delivery from 51 to 21 minutes compared to the state-of-the-art baseline.

[434] Rewarding the Rare: Uniqueness-Aware RL for Creative Problem Solving in LLMs

Zhiyuan Hu, Yucheng Wang, Yufei He, Jiaying Wu, Yilun Zhao, See-Kiong Ng, Cynthia Breazeal, Anh Tuan Luu, Hae Won Park, Bryan Hooi

Main category: cs.LG

TL;DR: Uniqueness-Aware Reinforcement Learning (UARL) addresses exploration collapse in RL for LLMs by rewarding rare high-level solution strategies, improving pass@k diversity without harming pass@1 performance.

Details

Motivation: RL for LLMs suffers from exploration collapse where policies prematurely concentrate on dominant reasoning patterns, improving pass@1 but limiting rollout diversity and pass@k gains. Current methods regularize local token behavior rather than diversity over solution sets.

Method: Proposes Uniqueness-Aware Reinforcement Learning with rollout-level objective that rewards correct solutions exhibiting rare high-level strategies. Uses LLM-based judge to cluster rollouts by high-level solution strategies (ignoring superficial variations) and reweights policy advantages inversely with cluster size.

Result: Across mathematics, physics, and medical reasoning benchmarks, consistently improves pass@k across large sampling budgets and increases area under pass@k curve (AUC@K) without sacrificing pass@1. Sustains exploration and uncovers more diverse solution strategies at scale.

Conclusion: Explicitly rewarding unique high-level solution strategies addresses exploration collapse in RL for LLMs, enabling better diversity and performance across multiple reasoning domains while maintaining single-solution accuracy.

Abstract: Reinforcement learning (RL) has become a central paradigm for post-training large language models (LLMs), particularly for complex reasoning tasks, yet it often suffers from exploration collapse: policies prematurely concentrate on a small set of dominant reasoning patterns, improving pass@1 while limiting rollout-level diversity and gains in pass@k. We argue that this failure stems from regularizing local token behavior rather than diversity over sets of solutions. To address this, we propose Uniqueness-Aware Reinforcement Learning, a rollout-level objective that explicitly rewards correct solutions that exhibit rare high-level strategies. Our method uses an LLM-based judge to cluster rollouts for the same problem according to their high-level solution strategies, ignoring superficial variations, and reweights policy advantages inversely with cluster size. As a result, correct but novel strategies receive higher rewards than redundant ones. Across mathematics, physics, and medical reasoning benchmarks, our approach consistently improves pass@$k$ across large sampling budgets and increases the area under the pass@$k$ curve (AUC@$K$) without sacrificing pass@1, while sustaining exploration and uncovering more diverse solution strategies at scale.

[435] Prototype-Guided Non-Exemplar Continual Learning for Cross-subject EEG Decoding

Dan Li, Hye-Bin Shin, Yeon-Woo Choi

Main category: cs.LG

TL;DR: ProNECL framework enables continual EEG decoding without storing historical data by using prototype-based alignment to prevent forgetting across subjects.

Details

Motivation: EEG signals vary significantly across individuals, causing knowledge from previous subjects to be overwritten in continual learning. Existing replay-based methods are impractical due to privacy and memory constraints.

Method: ProNECL summarizes subject-specific discriminative representations into class-level prototypes, then incrementally aligns new subject representations with a global prototype memory using prototype-based feature regularization and cross-subject alignment.

Result: Experiments on BCI Competition IV 2a and 2b datasets show ProNECL effectively balances knowledge retention and adaptability, achieving superior performance in cross-subject continual EEG decoding.

Conclusion: ProNECL provides a practical solution for continual EEG decoding without needing historical data storage, addressing privacy and memory constraints while maintaining performance.

Abstract: Due to the significant variability in electroencephalo-gram (EEG) signals across individuals, knowledge acquired from previous subjects is often overwritten as new subjects are introduced in continual EEG decoding tasks. Existing methods mainly rely on storing historical data from seen subjects as replay buffers to mitigate forgetting, which is impractical under privacy or memory constraints. To address this issue, we propose a Prototype-guided Non-Exemplar Continual Learning (ProNECL) framework that preserves prior knowledge without accessing historical EEG samples. ProNECL summarizes subject-specific discriminative representations into class-level prototypes and incrementally aligns new subject representations with a global prototype memory through prototype-based feature regulariza-tion and cross-subject alignment. Experiments on the BCI Com-petition IV 2a and 2b datasets demonstrate that ProNECL effec-tively balances knowledge retention and adaptability, achieving superior performance in cross-subject continual EEG decoding tasks.

[436] Improving the Throughput of Diffusion-based Large Language Models via a Training-Free Confidence-Aware Calibration

Jucheng Shen, Gaurav Sarkar, Yeonju Ro, Sharath Nittur Sridhar, Zhangyang Wang, Aditya Akella, Souvik Kundu

Main category: cs.LG

TL;DR: CadLLM is a training-free method to accelerate inference throughput of diffusion-based LLMs by adaptively controlling generation parameters based on token confidence.

Details

Motivation: Diffusion-based LLMs (dLLMs) have inference efficiency challenges. The authors observed that token unmasking confidence varies dynamically across blocks and steps, suggesting opportunities for optimization without retraining.

Method: A lightweight adaptive approach that controls generation block size, step size, and threshold based on average confidence of unmasked tokens. Also reduces softmax overhead by dynamically using a subset of vocabulary to regulate sampling breadth. Works as a plug-and-play, model-agnostic method compatible with KV-cache-based dLLMs.

Result: Extensive experiments on four popular tasks show up to 2.28x throughput improvement over state-of-the-art baseline while maintaining competitive accuracy.

Conclusion: CadLLM provides an effective training-free solution for accelerating dLLM inference through adaptive parameter control based on token confidence, achieving significant throughput gains with minimal accuracy trade-offs.

Abstract: We present CadLLM, a training-free method to accelerate the inference throughput of diffusion-based LLMs (dLLMs). We first investigate the dynamic nature of token unmasking confidence across blocks and steps. Based on this observation, we present a lightweight adaptive approach that controls the generation block size, step size, and threshold based on the average confidence of unmasked tokens. We further reduce softmax overhead by dynamically leveraging a subset of the vocabulary to regulate sampling breadth. CadLLM is a plug-and-play, model-agnostic method compatible with KV-cache-based dLLMs. Extensive experiments on four popular tasks demonstrate that CadLLM yields up to 2.28x throughput improvement over the state-of-the-art baseline with competitive accuracy.

[437] Geometric and Dynamic Scaling in Deep Transformers

Haoran Su, Chenyu You

Main category: cs.LG

TL;DR: The paper identifies geometric issues as the root cause of representation collapse in deep Transformers, proposing Manifold-Geometric Transformer (MGT) with manifold-constrained updates and deep delta learning to prevent degeneracy.

Details

Motivation: Existing explanations for Transformer collapse (optimization instability, vanishing gradients) fail to explain why collapse persists even with modern normalization/initialization. The authors argue collapse is fundamentally a geometric problem where residual updates cause systematic drift off semantic manifolds and monotonic feature accumulation.

Method: Proposes a unified geometric framework with two principles: 1) Manifold-constrained hyper-connections restrict residual updates to valid local tangent directions, preventing manifold drift; 2) Deep delta learning introduces data-dependent, non-monotonic updates enabling reflection and erasure of redundant features rather than unconditional accumulation.

Result: The resulting Manifold-Geometric Transformer (MGT) decouples direction and sign of feature updates, yielding stable geometric evolution across depth. The analysis predicts that enforcing geometric validity with dynamic erasure is essential for avoiding rank collapse in ultra-deep networks.

Conclusion: Geometry, rather than depth itself, is the key limiting factor in deep representation learning. The paper outlines an evaluation protocol for Transformers exceeding 100 layers to test this hypothesis.

Abstract: Despite their empirical success, pushing Transformer architectures to extreme depth often leads to a paradoxical failure: representations become increasingly redundant, lose rank, and ultimately collapse. Existing explanations largely attribute this phenomenon to optimization instability or vanishing gradients, yet such accounts fail to explain why collapse persists even under modern normalization and initialization schemes. In this paper, we argue that the collapse of deep Transformers is fundamentally a geometric problem. Standard residual updates implicitly assume that feature accumulation is always beneficial, but offer no mechanism to constrain update directions or to erase outdated information. As depth increases, this leads to systematic drift off the semantic manifold and monotonic feature accumulation, causing representational degeneracy. We propose a unified geometric framework that addresses these failures through two orthogonal principles. First, manifold-constrained hyper-connections restrict residual updates to valid local tangent directions, preventing uncontrolled manifold drift. Second, deep delta learning introduces data-dependent, non-monotonic updates that enable reflection and erasure of redundant features rather than their unconditional accumulation. Together, these mechanisms decouple the direction and sign of feature updates, yielding a stable geometric evolution across depth. We term the resulting architecture the Manifold-Geometric Transformer (MGT). Our analysis predicts that enforcing geometric validity while allowing dynamic erasure is essential for avoiding rank collapse in ultra-deep networks. We outline an evaluation protocol for Transformers exceeding 100 layers to test the hypothesis that geometry, rather than depth itself, is the key limiting factor in deep representation learning.

[438] Eventually LIL Regret: Almost Sure $\ln\ln T$ Regret for a sub-Gaussian Mixture on Unbounded Data

Shubhada Agrawal, Aaditya Ramdas

Main category: cs.LG

TL;DR: The paper proves that Robbins’ classic sub-Gaussian mixture satisfies a path-wise deterministic regret bound on Ville events, bridging adversarial online learning and game-theoretic statistics.

Details

Motivation: To bridge adversarial online learning (which handles bounded data with regret bounds) and game-theoretic statistics (which handles unbounded data using stochastic assumptions) through conditional regret bounds.

Method: Analyze Robbins’ sub-Gaussian mixture in a stochastic setting, proving it satisfies path-wise deterministic regret bounds on Ville events with cumulative variance process V_T.

Result: For every path in Ville event E_α, regret is bounded by ln²(1/α)/V_T + ln(1/α) + ln ln V_T (up to constants). On probability-one event E_0, regret is eventually bounded by ln ln V_T.

Conclusion: Conditional regret bounds serve as a bridge between stochastic and adversarial betting, connecting two different approaches to online learning and statistics.

Abstract: We prove that a classic sub-Gaussian mixture proposed by Robbins in a stochastic setting actually satisfies a path-wise (deterministic) regret bound. For every path in a natural ``Ville event’’ $E_α$, this regret till time $T$ is bounded by $\ln^2(1/α)/V_T + \ln (1/α) + \ln \ln V_T$ up to universal constants, where $V_T$ is a nonnegative, nondecreasing, cumulative variance process. (The bound reduces to $\ln(1/α) + \ln \ln V_T$ if $V_T \geq \ln(1/α)$.) If the data were stochastic, then one can show that $E_α$ has probability at least $1-α$ under a wide class of distributions (eg: sub-Gaussian, symmetric, variance-bounded, etc.). In fact, we show that on the Ville event $E_0$ of probability one, the regret on every path in $E_0$ is eventually bounded by $\ln \ln V_T$ (up to constants). We explain how this work helps bridge the world of adversarial online learning (which usually deals with regret bounds for bounded data), with game-theoretic statistics (which can handle unbounded data, albeit using stochastic assumptions). In short, conditional regret bounds serve as a bridge between stochastic and adversarial betting.

[439] Robust and Efficient Zeroth-Order LLM Fine-Tuning via Adaptive Bayesian Subspace Optimizer

Jian Feng, Zhihong Huang

Main category: cs.LG

TL;DR: BSZO is a Bayesian subspace zeroth-order optimizer that improves fine-tuning of large language models by combining gradient information across multiple perturbation directions using Kalman filtering, achieving better convergence and robustness under low-precision training.

Details

Motivation: Existing zeroth-order optimization methods for LLM fine-tuning suffer from performance degradation under low-precision training and essentially operate in one-dimensional space, limiting their effectiveness and robustness.

Method: BSZO applies Kalman filtering to combine finite-difference gradient information across multiple perturbation directions within a subspace. It treats each measurement as a noisy observation, builds a posterior distribution over the subspace-projected gradient, and uses Bayesian inference with residual-based adaptive mechanisms to handle noise variations.

Result: Theoretical analysis shows BSZO improves convergence rate by factor k/γ compared to standard ZO methods. Experiments on RoBERTa, Mistral, and OPT models show BSZO outperforms baselines across tasks, achieving up to 6.67% absolute average improvement on OPT-13B while maintaining robustness under fp16/bf16 precision and keeping memory usage close to inference-only baselines (1.00×-1.08× of MeZO).

Conclusion: BSZO provides an effective Bayesian subspace approach for zeroth-order optimization that significantly improves LLM fine-tuning performance while maintaining memory efficiency and robustness under low-precision training conditions.

Abstract: Fine-tuning large language models (LLMs) with zeroth-order (ZO) optimization reduces memory by approximating gradients through function evaluations. However, existing methods essentially perform updates in a one-dimensional space, and suffer from collapse or substantial performance degradation under low-precision training. We introduce BSZO, an adaptive \textbf{B}ayesian \textbf{S}ubspace \textbf{Z}eroth-Order \textbf{O}ptimizer, which applies Kalman filtering to combine finite-difference information across multiple perturbation directions within a subspace. By treating each finite-difference measurement as a noisy observation, BSZO builds a posterior distribution over the subspace-projected gradient and updates it through Bayesian inference, with a residual-based adaptive mechanism to adapt to noise variations. Theoretical analysis shows that BSZO improves the convergence rate by a factor of $k/γ$ compared to standard ZO methods. Experiments on RoBERTa, Mistral, and OPT models show that BSZO outperforms the baselines across various tasks, achieving up to 6.67% absolute average improvement on OPT-13B while remaining robust under fp16/bf16 precision and keeping memory usage close to inference-only baselines (1.00$\times$–1.08$\times$ of MeZO).

[440] Entropy Production in Machine Learning Under Fokker-Planck Probability Flow

Lennon Shikhman

Main category: cs.LG

TL;DR: An entropy-based retraining framework for ML models in nonstationary environments reduces retraining frequency by 1-2 orders of magnitude while maintaining performance comparable to frequent retraining.

Details

Motivation: Most drift detection methods lack dynamical interpretation and don't provide guidance on balancing retraining decisions against operational costs. Current approaches offer limited theoretical grounding for retraining policies.

Method: Proposes an entropy-based retraining framework grounded in nonequilibrium statistical physics. Models drift as probability flow via Fokker-Planck equation, quantifies model-data mismatch using relative entropy, and implements entropy-triggered retraining using EWMA control statistic on streaming kernel density estimator of KL divergence.

Result: In synthetic, financial, and web-traffic domains, entropy-based retraining achieves predictive performance comparable to frequent retraining while reducing retraining frequency by 1-2 orders of magnitude. However, in biomedical ECG setting, it underperforms maximum-frequency baseline due to limitations of feature-space entropy monitoring under complex label-conditional drift.

Conclusion: The entropy-based framework provides theoretically grounded retraining decisions that significantly reduce operational costs while maintaining performance, though has limitations for complex label-conditional drift scenarios requiring more sophisticated monitoring approaches.

Abstract: Machine learning models deployed in nonstationary environments inevitably experience performance degradation due to data drift. While numerous drift detection heuristics exist, most lack a dynamical interpretation and provide limited guidance on how retraining decisions should be balanced against operational cost. In this work, we propose an entropy-based retraining framework grounded in nonequilibrium statistical physics. Interpreting drift as probability flow governed by a Fokker-Planck equation, we quantify model-data mismatch using relative entropy and show that its time derivative admits an entropy-balance decomposition featuring a nonnegative entropy production term driven by probability currents. Guided by this theory, we implement an entropy-triggered retraining policy using an exponentially weighted moving-average (EWMA) control statistic applied to a streaming kernel density estimator of the Kullback-Leibler divergence. We evaluate this approach across multiple nonstationary data streams. In synthetic, financial, and web-traffic domains, entropy-based retraining achieves predictive performance comparable to frequent retraining while reducing retraining frequency by one to two orders of magnitude. However, in a challenging biomedical ECG setting, the entropy-based trigger underperforms the maximum-frequency baseline, highlighting limitations of feature-space entropy monitoring under complex label-conditional drift.

[441] Fast Mining and Dynamic Time-to-Event Prediction over Multi-sensor Data Streams

Kota Nakamura, Koki Kawabata, Yasuko Matsubara, Yasushi Sakurai

Main category: cs.LG

TL;DR: TimeCast is a dynamic prediction framework for forecasting machine failure timing from multi-sensor data streams, adapting to evolving patterns with stage-based modeling for real-time predictions.

Details

Motivation: Real-world sensor data streams are dynamic with evolving patterns over time, requiring adaptive methods to continuously predict machine failures accurately in real-time.

Method: TimeCast identifies distinct time-evolving patterns (stages) in data streams, learns individual models for each stage, adapts to pattern shifts, captures time-varying sensor interdependencies, and enables online model updates with linear scalability.

Result: TimeCast achieves higher prediction accuracy than state-of-the-art methods, successfully identifies dynamic changes in data streams, and significantly reduces computational time in extensive experiments on real datasets.

Conclusion: TimeCast provides an effective dynamic framework for real-time machine failure prediction that adapts to evolving data patterns while maintaining scalability and practical applicability.

Abstract: Given real-time sensor data streams obtained from machines, how can we continuously predict when a machine failure will occur? This work aims to continuously forecast the timing of future events by analyzing multi-sensor data streams. A key characteristic of real-world data streams is their dynamic nature, where the underlying patterns evolve over time. To address this, we present TimeCast, a dynamic prediction framework designed to adapt to these changes and provide accurate, real-time predictions of future event time. Our proposed method has the following properties: (a) Dynamic: it identifies the distinct time-evolving patterns (i.e., stages) and learns individual models for each, enabling us to make adaptive predictions based on pattern shifts. (b) Practical: it finds meaningful stages that capture time-varying interdependencies between multiple sensors and improve prediction performance; (c) Scalable: our algorithm scales linearly with the input size and enables online model updates on data streams. Extensive experiments on real datasets demonstrate that TimeCast provides higher prediction accuracy than state-of-the-art methods while finding dynamic changes in data streams with a great reduction in computational time.

[442] Future-as-Label: Scalable Supervision from Real-World Outcomes

Benjamin Turtel, Paul Wilczewski, Danny Franklin, Kris Skothiem

Main category: cs.LG

TL;DR: Foresight Learning: Using time and realized outcomes as free supervision to train language models for real-world forecasting, improving accuracy and calibration.

Details

Motivation: Time provides natural supervision - forecasts about real-world events eventually resolve to verifiable outcomes without human annotation. This creates scalable, free supervision for training prediction models.

Method: Extends reinforcement learning with verifiable rewards to real-world prediction. Trains language models to make probabilistic forecasts from causally masked information, using proper scoring rules as reward functions once events resolve. Learning driven entirely by realized outcomes.

Result: Qwen3-32B trained with Foresight Learning improves Brier score by 27% and halves calibration error vs pretrained baseline. Outperforms Qwen3-235B on constructed future-event prediction tasks and Metaculus benchmark despite 7x fewer parameters.

Conclusion: Foresight Learning enables scalable outcome-based supervision for open-world prediction, leveraging time as a source of free labels to significantly improve forecasting accuracy and calibration in language models.

Abstract: Time creates free supervision: forecasts about real-world events resolve to verifiable outcomes. The passage of time provides labels that require no annotation. To exploit this structure, we extend reinforcement learning with verifiable rewards to real-world prediction over time. We train language models to make probabilistic forecasts from causally masked information, using proper scoring rules as the reward function once events resolve. Learning is driven entirely by realized outcomes, enabling scalable outcome-based supervision in open-world prediction. On real-world forecasting benchmarks, Qwen3-32B trained using Foresight Learning improves Brier score by 27% and halves calibration error relative to its pretrained baseline, and outperforms Qwen3-235B on both constructed future-event prediction tasks and the Metaculus benchmark despite a 7x parameter disadvantage.

[443] Softly Induced Functional Simplicity: Implications for Neural Network Generalisation, Robustness, and Distillation

Maciej Glowacki

Main category: cs.LG

TL;DR: The paper shows that soft symmetry-respecting inductive biases create pseudo-Goldstone modes in loss landscapes, leading to lower-complexity solutions that improve generalization, robustness, and distillability in HEP classification tasks.

Details

Motivation: Learning robust and generalizable abstractions from high-dimensional data is challenging in machine learning and HEP. Lower complexity solutions are known to generalize better and be more robust, but require appropriate inductive biases to be learnable in complex hypothesis spaces.

Method: The authors use soft symmetry-respecting inductive biases in HEP classification tasks, which create approximate degeneracies (pseudo-Goldstone modes) in the loss landscape. They quantify functional complexity using metrics derived from first principles Hessian analysis and compressibility measures.

Result: The study demonstrates that solutions with lower functional complexity produce abstractions that are more generalizable, robust to input perturbations, and efficiently distillable. The soft symmetry bias creates pseudo-Goldstone modes that facilitate finding these low-complexity solutions.

Conclusion: Inductive biases that respect symmetries can create favorable loss geometries (pseudo-Goldstone modes) that enable learning of lower-complexity solutions, which in turn yield more generalizable, robust, and efficiently distillable abstractions for HEP applications.

Abstract: Learning robust and generalisable abstractions from high-dimensional input data is a central challenge in machine learning and its applications to high-energy physics (HEP). Solutions of lower functional complexity are known to produce abstractions that generalise more effectively and are more robust to input perturbations. In complex hypothesis spaces, inductive biases make such solutions learnable by shaping the loss geometry during optimisation. In a HEP classification task, we show that a soft symmetry respecting inductive bias creates approximate degeneracies in the loss, which we identify as pseudo-Goldstone modes. We quantify functional complexity using metrics derived from first principles Hessian analysis and via compressibility. Our results demonstrate that solutions of lower complexity give rise to abstractions that are more generalisable, robust, and efficiently distillable.

[444] Provably Safe Reinforcement Learning for Stochastic Reach-Avoid Problems with Entropy Regularization

Abhijit Mazumdar, Rafal Wisniewski, Manuela L. Bujorianu

Main category: cs.LG

TL;DR: The paper proposes two online safe RL algorithms for constrained MDPs with reach-avoid objectives, using OFU and entropy regularization to ensure safety during learning with high probability.

Details

Motivation: Learning optimal policies for Markov decision processes with safety constraints is challenging, especially ensuring safety during the learning phase. Existing approaches may not guarantee safety with high probability during online learning.

Method: Two algorithms: 1) OFU-based algorithm using optimism in the face of uncertainty principle, 2) Main algorithm with entropy regularization built upon the first algorithm. Both ensure safety constraints with arbitrarily high probability during learning.

Result: Finite-sample analysis shows both algorithms achieve regret bounds. Entropy regularization improves regret and significantly reduces episode-to-episode variability inherent in OFU-based safe RL algorithms.

Conclusion: Entropy regularization is effective for safe RL, enhancing performance by improving regret bounds and stabilizing learning dynamics while maintaining safety guarantees with high probability.

Abstract: We consider the problem of learning the optimal policy for Markov decision processes with safety constraints. We formulate the problem in a reach-avoid setup. Our goal is to design online reinforcement learning algorithms that ensure safety constraints with arbitrarily high probability during the learning phase. To this end, we first propose an algorithm based on the optimism in the face of uncertainty (OFU) principle. Based on the first algorithm, we propose our main algorithm, which utilizes entropy regularization. We investigate the finite-sample analysis of both algorithms and derive their regret bounds. We demonstrate that the inclusion of entropy regularization improves the regret and drastically controls the episode-to-episode variability that is inherent in OFU-based safe RL algorithms.

[445] Discrete Solution Operator Learning for Geometry-Dependent PDEs

Jinshuai Bai, Haolin Li, Zahra Sharif Khodaei, M. H. Aliabadi, YuanTong Gu, Xi-Qiao Feng

Main category: cs.LG

TL;DR: DiSOL introduces a discrete solution operator learning paradigm that learns procedural solvers for PDEs with geometry variations, handling topological changes and discontinuous boundaries that break traditional neural operator assumptions.

Details

Motivation: Neural operators assume smooth variations in geometry, but many engineering problems involve discrete structural changes like topological changes, abrupt boundary condition changes, and computational domain variations that break this assumption.

Method: DiSOL factorizes the solver into learnable stages mirroring classical discretizations: local contribution encoding, multiscale assembly, and implicit solution reconstruction on an embedded grid, preserving procedure-level consistency while adapting to geometry-dependent discrete structures.

Result: DiSOL produces stable and accurate predictions for geometry-dependent Poisson, advection-diffusion, linear elasticity, and spatiotemporal heat conduction problems, even under strongly out-of-distribution geometries including discontinuous boundaries and topological changes.

Conclusion: The paper highlights the need for procedural operator representations in geometry-dominated problems and positions discrete solution operator learning as a distinct, complementary direction in scientific machine learning.

Abstract: Neural operator learning accelerates PDE solution by approximating operators as mappings between continuous function spaces. Yet in many engineering settings, varying geometry induces discrete structural changes, including topological changes, abrupt changes in boundary conditions or boundary types, and changes in the computational domain, which break the smooth-variation premise. Here we introduce Discrete Solution Operator Learning (DiSOL), a complementary paradigm that learns discrete solution procedures rather than continuous function-space operators. DiSOL factorizes the solver into learnable stages that mirror classical discretizations: local contribution encoding, multiscale assembly, and implicit solution reconstruction on an embedded grid, thereby preserving procedure-level consistency while adapting to geometry-dependent discrete structures. Across geometry-dependent Poisson, advection-diffusion, linear elasticity, as well as spatiotemporal heat conduction problems, DiSOL produces stable and accurate predictions under both in-distribution and strongly out-of-distribution geometries, including discontinuous boundaries and topological changes. These results highlight the need for procedural operator representations in geometry-dominated problems and position discrete solution operator learning as a distinct, complementary direction in scientific machine learning.

[446] Reward Learning through Ranking Mean Squared Error

Chaitanya Kharyal, Calarina Muslimani, Matthew E. Taylor

Main category: cs.LG

TL;DR: R4 is a new rating-based RL method that learns reward functions from human ratings using a novel ranking mean squared error loss, offering formal guarantees and outperforming existing methods with less feedback.

Details

Motivation: Reward design is a bottleneck in RL applications. Reward learning from human feedback is an alternative, but existing methods often use binary preferences which can be cognitively demanding. Ratings provide richer supervision and are potentially easier for humans to provide.

Method: R4 uses a novel ranking mean squared error (rMSE) loss that treats teacher-provided ratings as ordinal targets. It learns from trajectory-rating pairs, samples trajectories, predicts returns, ranks them using differentiable soft ranks, and optimizes MSE between soft ranks and teacher ratings.

Result: R4 offers formal guarantees: its solution set is provably minimal and complete under mild assumptions. Empirically, using simulated human feedback, R4 consistently matches or outperforms existing rating and preference-based RL methods on robotic locomotion benchmarks while requiring significantly less feedback.

Conclusion: R4 provides an effective rating-based RL approach with theoretical guarantees and practical advantages over existing methods, making reward learning from human feedback more efficient and reliable.

Abstract: Reward design remains a significant bottleneck in applying reinforcement learning (RL) to real-world problems. A popular alternative is reward learning, where reward functions are inferred from human feedback rather than manually specified. Recent work has proposed learning reward functions from human feedback in the form of ratings, rather than traditional binary preferences, enabling richer and potentially less cognitively demanding supervision. Building on this paradigm, we introduce a new rating-based RL method, Ranked Return Regression for RL (R4). At its core, R4 employs a novel ranking mean squared error (rMSE) loss, which treats teacher-provided ratings as ordinal targets. Our approach learns from a dataset of trajectory-rating pairs, where each trajectory is labeled with a discrete rating (e.g., “bad,” “neutral,” “good”). At each training step, we sample a set of trajectories, predict their returns, and rank them using a differentiable sorting operator (soft ranks). We then optimize a mean squared error loss between the resulting soft ranks and the teacher’s ratings. Unlike prior rating-based approaches, R4 offers formal guarantees: its solution set is provably minimal and complete under mild assumptions. Empirically, using simulated human feedback, we demonstrate that R4 consistently matches or outperforms existing rating and preference-based RL methods on robotic locomotion benchmarks from OpenAI Gym and the DeepMind Control Suite, while requiring significantly less feedback.

cs.MA

[447] Adaptive Orchestration: Scalable Self-Evolving Multi-Agent Systems

Sathish Sampath, Anuradha Baskaran

Main category: cs.MA

TL;DR: A Self-Evolving Concierge System using Dynamic Mixture of Experts to solve the Generalization-Specialization Dilemma in LLM agents, dynamically hiring specialized sub-agents while maintaining stability and efficiency.

Details

Motivation: Address the scalability bottleneck in LLM agents: monolithic agents with large toolkits suffer from context pollution and attention decay (hallucinations), while static multi-agent swarms cause latency and resource overhead.

Method: Introduces a Self-Evolving Concierge System with Dynamic Mixture of Experts (DMoE) that dynamically restructures runtime environment by hiring specialized sub-agents based on real-time conversation analysis. Includes asynchronous Meta-Cognition Engine for capability gap detection, LRU eviction policy for resource constraints, and Surgical History Pruning to mitigate refusal bias.

Result: Experimental results show the architecture maintains high task success rates while minimizing token consumption compared to static agent swarms.

Conclusion: The Self-Evolving Concierge System provides a novel solution to the Generalization-Specialization Dilemma by enabling dynamic specialization without the instability of self-rewriting code or the inefficiency of static multi-agent systems.

Abstract: As Large Language Models (LLMs) are increasingly deployed as autonomous agents, they face a critical scalability bottleneck known as the “Generalization-Specialization Dilemma.” Monolithic agents equipped with extensive toolkits suffer from context pollution and attention decay, leading to hallucinations. Conversely, static multi-agent swarms introduce significant latency and resource overhead. This paper introduces a Self-Evolving Concierge System, a novel architecture utilizing a Dynamic Mixture of Experts (DMoE) approach. Unlike recent self-improving agents that rewrite their own codebase, our system preserves stability by dynamically restructuring its runtime environment: “hiring” specialized sub-agents based on real-time conversation analysis. We introduce an asynchronous “Meta-Cognition Engine” that detects capability gaps, a Least Recently Used (LRU) eviction policy for resource constraints, and a novel “Surgical History Pruning” mechanism to mitigate refusal bias. Experimental results demonstrate that this architecture maintains high task success rates while minimizing token consumption compared to static agent swarms.

[448] Multi-Agent Cooperative Learning for Robust Vision-Language Alignment under OOD Concepts

Philip Xu, Isabel Wagner, Eerke Boiten

Main category: cs.MA

TL;DR: MACL framework uses four agents to prevent cross-modal alignment collapse in vision-language models for OOD concepts, achieving 1-5% precision gains.

Details

Motivation: Address cross-modal alignment collapse in vision-language models when handling out-of-distribution concepts, particularly addressing modality imbalance issues.

Method: Multi-agent cooperative learning framework with four agents (image, text, name, coordination) using structured message passing, multi-agent feature space name learning, context exchange enhanced few-shot learning, and adaptive dynamic balancing mechanism.

Result: Significantly improves performance on VISTA-Beyond dataset in both few-shot and zero-shot settings, achieving 1-5% precision gains across diverse visual domains.

Conclusion: MACL framework effectively mitigates modality imbalance and cross-modal alignment collapse for OOD concepts through multi-agent collaboration, demonstrating strong performance improvements.

Abstract: This paper introduces a novel Multi-Agent Cooperative Learning (MACL) framework to address cross-modal alignment collapse in vision-language models when handling out-of-distribution (OOD) concepts. Four core agents, including image, text, name, and coordination agents, collaboratively mitigate modality imbalance through structured message passing. The proposed framework enables multi-agent feature space name learning, incorporates a context exchange enhanced few-shot learning algorithm, and adopts an adaptive dynamic balancing mechanism to regulate inter-agent contributions. Experiments on the VISTA-Beyond dataset demonstrate that MACL significantly improves performance in both few-shot and zero-shot settings, achieving 1-5% precision gains across diverse visual domains.

[449] When Personas Override Payoffs: Role Identity Bias in Multi-Agent LLM Decision-Making

Viswonathan Manoranjan, Snehalkumar `Neil’ S. Gaikwad

Main category: cs.MA

TL;DR: Multi-agent LLM systems don’t act as pure strategic reasoners but as identity-driven actors when given personas, even with complete payoff information. Role-based personas bias equilibrium selection toward socially preferred outcomes rather than payoff-optimal ones.

Details

Motivation: To understand how design choices like role-based personas and payoff visibility affect LLM reasoning in multi-agent systems, specifically whether they function as strategic reasoners optimizing payoffs or as identity-driven actors prioritizing role alignment.

Method: Used Nash equilibrium achievement as diagnostic for strategic reasoning, conducting systematic experiments across four LLM architectures (Qwen-7B, Qwen-32B, Llama-8B, Mistral-7B) in complex environmental decision-making games with four agents. Tested conditions with/without personas and with/without explicit payoffs.

Result: Role identity bias fundamentally alters strategic reasoning even when payoff-optimal equilibria exist. Removing personas and providing explicit payoffs enables Qwen models to achieve high Nash equilibrium rates. Personas systematically bias equilibrium selection toward socially preferred outcomes (Green Transition) and prevent reaching equilibrium when Tragedy of the Commons is payoff-optimal. Qwen architectures are highly sensitive to both personas and payoff visibility, while Llama and Mistral exhibit rigid reasoning.

Conclusion: Representational choices (personas, payoff visibility) are substantive governance decisions that determine whether multi-agent systems act as strategic reasoners or identity-driven actors, with important implications for real-world deployment.

Abstract: Large language models are increasingly deployed in multi-agent systems for strategic tasks, yet how design choices such as role-based personas and payoff visibility affect reasoning remains poorly understood. We investigate whether multi-agent systems function as strategic reasoners capable of payoff optimization or as identity-driven actors that prioritize role alignment over explicit incentives. Using Nash equilibrium achievement as a diagnostic for strategic reasoning, we conduct systematic experiments across four LLM architectures (Qwen-7B, Qwen-32B, Llama-8B, Mistral-7B) in complex environmental decision-making games involving four agents. We show that role identity bias fundamentally alters strategic reasoning even when payoff-optimal equilibria exist and complete payoff information is available. Removing personas and providing explicit payoffs enables Qwen models to achieve high Nash equilibrium rates, indicating that both conditions are necessary for strategic reasoning. In contrast, personas systematically bias equilibrium selection toward socially preferred outcomes: with personas present, all of the achieved equilibria correspond to Green Transition, while models entirely fail to reach equilibrium when Tragedy of the Commons is payoff-optimal. The effect of explicit payoffs depends entirely on persona presence, revealing strong interactions between representational design choices. We also observe clear model-dependent patterns. Qwen architectures are highly sensitive to both personas and payoff visibility, whereas Llama and Mistral exhibit rigid reasoning behavior across conditions. These findings demonstrate that representational choices are substantive governance decisions that determine whether multi-agent systems act as strategic reasoners or identity-driven actors, with important implications for real-world deployment.

[450] TopoDIM: One-shot Topology Generation of Diverse Interaction Modes for Multi-Agent Systems

Rui Sun, Jie Ding, Chenghua Gong, Tianjun Gu, Yihang Jiang, Juyuan Zhang, Liming Pan, Linyuan Lü

Main category: cs.MA

TL;DR: TopoDIM is a framework for one-shot topology generation in LLM-based multi-agent systems that reduces communication latency and token consumption while improving performance through diverse interaction modes.

Details

Motivation: Existing multi-agent communication methods rely on spatio-temporal interaction paradigms with sequential multi-round dialogues, which incur high latency and computation costs. The need for more efficient communication topologies that leverage evaluation and debate mechanisms to enhance collective intelligence.

Method: TopoDIM enables agents to autonomously construct heterogeneous communication topologies in a one-shot manner without iterative coordination. It features decentralized execution for adaptability and privacy, and incorporates diverse interaction modes for efficient problem-solving.

Result: TopoDIM reduces total token consumption by 46.41% while improving average performance by 1.50% over state-of-the-art methods. The framework also demonstrates strong adaptability in organizing communication among heterogeneous agents.

Conclusion: TopoDIM provides an effective solution for optimizing communication topology in LLM-based multi-agent systems, achieving significant efficiency gains and performance improvements through one-shot topology generation with diverse interaction modes.

Abstract: Optimizing communication topology in LLM-based multi-agent system is critical for enabling collective intelligence. Existing methods mainly rely on spatio-temporal interaction paradigms, where the sequential execution of multi-round dialogues incurs high latency and computation. Motivated by the recent insights that evaluation and debate mechanisms can improve problem-solving in multi-agent systems, we propose TopoDIM, a framework for one-shot Topology generation with Diverse Interaction Modes. Designed for decentralized execution to enhance adaptability and privacy, TopoDIM enables agents to autonomously construct heterogeneous communication without iterative coordination, achieving token efficiency and improved task performance. Experiments demonstrate that TopoDIM reduces total token consumption by 46.41% while improving average performance by 1.50% over state-of-the-art methods. Moreover, the framework exhibits strong adaptability in organizing communication among heterogeneous agents. Code is available at: https://anonymous.4open.science/r/TopoDIM-8D35/

[451] Fairness Driven Multi-Agent Path Finding Problem

Aditi Anand, Dildar Ali, Suman Banerjee

Main category: cs.MA

TL;DR: The paper studies Multi-Agent Path Finding (MAPF) with fairness considerations, proposing heuristic solutions for non-rational agents and incentive-compatible mechanisms for rational agents who might misreport information.

Details

Motivation: MAPF is computationally expensive and real-world agents can be rational, potentially misreporting private information. The paper addresses fairness in both rational and non-rational agent scenarios.

Method: For non-rational agents: propose a heuristic solution. For rational agents: develop a mechanism that is dominant strategy incentive compatible and individually rational. Use various solution methodologies to evaluate effectiveness.

Result: The proposed approaches are shown to be effective and efficient through various solution methodologies, though specific performance metrics are not detailed in the abstract.

Conclusion: The paper provides solutions for MAPF under fairness considerations for both rational and non-rational agents, with mechanisms ensuring truthful reporting and individual rationality for rational agents.

Abstract: The Multi-Agent Path Finding (MAPF) problem aims at finding non-conflicting paths for multiple agents from their respective sources to destinations. This problem arises in multiple real-life situations, including robot motion planning and airspace assignment for unmanned aerial vehicle movement. The problem is computationally expensive, and adding to it, the agents are rational and can misreport their private information. In this paper, we study both variants of the problem under the realm of fairness. For the non-rational agents, we propose a heuristic solution for this problem. Considering the agents are rational, we develop a mechanism and demonstrate that it is a dominant strategy, incentive compatible, and individually rational. We employ various solution methodologies to highlight the effectiveness and efficiency of the proposed solution approaches.

[452] Multipath Routing for Multi-Hop UAV Networks

Zhenyu Zhao, Tiankui Zhang, Xiaoxia Xu, Junjie Li, Yuanwei Liu, Wenjuan Xing

Main category: cs.MA

TL;DR: Proposes IPPO-DM, a multi-agent deep reinforcement learning method for traffic-adaptive multipath routing in multi-hop UAV networks to reduce congestion and meet latency requirements.

Details

Motivation: Existing single-path routing in multi-hop UAV networks causes local congestion and increased traffic delays. Need for dynamic multipath routing to meet diverse traffic flow latency requirements in mobile environments.

Method: Formulates on-time packet delivery ratio maximization as Dec-POMDP, develops IPPO-DM algorithm combining Independent Proximal Policy Optimization with Dirichlet distribution modeling for traffic splitting ratios.

Result: IPPO-DM outperforms benchmark schemes in delivery latency guarantee and packet loss performance in simulations.

Conclusion: The proposed traffic-adaptive multipath routing method with IPPO-DM effectively addresses congestion and latency issues in dynamic multi-hop UAV networks.

Abstract: Multi-hop uncrewed aerial vehicle (UAV) networks are promising to extend the terrestrial network coverage. Existing multi-hop UAV networks employ a single routing path by selecting the next-hop forwarding node in a hop-by-hop manner, which leads to local congestion and increases traffic delays. In this paper, a novel traffic-adaptive multipath routing method is proposed for multi-hop UAV networks, which enables each UAV to dynamically split and forward traffic flows across multiple next-hop neighbors, thus meeting latency requirements of diverse traffic flows in dynamic mobile environments. An on-time packet delivery ratio maximization problem is formulated to determine the traffic splitting ratios at each hop. This sequential decision-making problem is modeled as a decentralized partially observable Markov decision process (Dec-POMDP). To solve this Dec-POMDP, a novel multi-agent deep reinforcement leaning (MADRL) algorithm, termed Independent Proximal Policy Optimization with Dirichlet Modeling (IPPO-DM), is developed. Specifically, the IPPO serves as the core optimization framework, where the Dirichlet distribution is leveraged to parameterize a continuous stochastic policy network on the probability simplex, inherently ensuring feasible traffic splitting ratios. Simulation results demonstrate that IPPO-DM outperforms benchmark schemes in terms of both delivery latency guarantee and packet loss performance.

[453] Learning Latency-Aware Orchestration for Parallel Multi-Agent Systems

Xi Shi, Mengxin Zheng, Qian Lou

Main category: cs.MA

TL;DR: LAMaS is a latency-aware multi-agent orchestration framework that optimizes parallel execution to reduce inference latency while maintaining task performance.

Details

Motivation: Multi-agent systems suffer from high inference latency due to sequential execution and repeated model invocations, limiting scalability in time-sensitive scenarios. Existing approaches focus on task performance and cost but don't optimize for latency under parallel execution.

Method: Proposes LAMaS framework with explicit latency supervision under parallel execution. It enables parallel execution and explicitly optimizes the critical execution path, allowing controllers to construct execution topology graphs with lower latency.

Result: Reduces critical path length by 38-46% compared to state-of-the-art baseline for multi-agent architecture search across multiple benchmarks, while maintaining or even improving task performance.

Conclusion: Explicitly optimizing latency under parallel execution is crucial for designing efficient multi-agent systems, and LAMaS demonstrates significant latency reduction while preserving task performance.

Abstract: Multi-agent systems (MAS) enable complex reasoning by coordinating multiple agents, but often incur high inference latency due to multi-step execution and repeated model invocations, severely limiting their scalability and usability in time-sensitive scenarios. Most existing approaches primarily optimize task performance and inference cost, and explicitly or implicitly assume sequential execution, making them less optimal for controlling latency under parallel execution. In this work, we investigate learning-based orchestration of multi-agent systems with explicit latency supervision under parallel execution. We propose Latency-Aware Multi-agent System (LAMaS), a latency-aware multi-agent orchestration framework that enables parallel execution and explicitly optimizes the critical execution path, allowing the controller to construct execution topology graphs with lower latency under parallel execution. Our experiments show that our approach reduces critical path length by 38-46% compared to the state-of-the-art baseline for multi-agent architecture search across multiple benchmarks, while maintaining or even improving task performance. These results highlight the importance of explicitly optimizing latency under parallel execution when designing efficient multi-agent systems. The code is available at https://github.com/xishi404/LAMaS

[454] Procedural Fairness in Multi-Agent Bandits

Joshua Caiata, Carter Blair, Kate Larson

Main category: cs.MA

TL;DR: The paper introduces procedural fairness in multi-agent multi-armed bandits, arguing that fairness should include equal decision-making power (process) rather than just focusing on outcomes like welfare or equality.

Details

Motivation: Current fairness approaches in MA-MAB focus only on outcomes (welfare, inequality, utility balancing), but psychological, economic, and Rawlsian theories suggest fairness also involves process and who gets to participate in decisions. There's a need to incorporate procedural fairness alongside outcome-based fairness.

Method: Introduces a new fairness objective called “procedural fairness” that provides equal decision-making power for all agents, lies in the core, and ensures proportionality in outcomes. The framework allows putting procedural fairness into practice in MA-MAB settings.

Result: Empirical results show that outcome-based fairness notions sacrifice equal voice and representation, while procedurally fair policies only minimally sacrifice outcome-based fairness objectives (equality and utilitarianism). Different fairness notions prioritize fundamentally different and incompatible values.

Conclusion: Procedural legitimacy deserves greater focus as a fairness objective. Fairness requires explicit normative choices, and the paper provides a framework for implementing procedural fairness in practice.

Abstract: In the context of multi-agent multi-armed bandits (MA-MAB), fairness is often reduced to outcomes: maximizing welfare, reducing inequality, or balancing utilities. However, evidence in psychology, economics, and Rawlsian theory suggests that fairness is also about process and who gets a say in the decisions being made. We introduce a new fairness objective, procedural fairness, which provides equal decision-making power for all agents, lies in the core, and provides for proportionality in outcomes. Empirical results confirm that fairness notions based on optimizing for outcomes sacrifice equal voice and representation, while the sacrifice in outcome-based fairness objectives (like equality and utilitarianism) is minimal under procedurally fair policies. We further prove that different fairness notions prioritize fundamentally different and incompatible values, highlighting that fairness requires explicit normative choices. This paper argues that procedural legitimacy deserves greater focus as a fairness objective, and provides a framework for putting procedural fairness into practice.

cs.MM

[455] EditEmoTalk: Controllable Speech-Driven 3D Facial Animation with Continuous Expression Editing

Diqiong Jiang, Kai Zhu, Dan Song, Jian Chang, Chenglizhao Chen, Zhenyu Wu

Main category: cs.MM

TL;DR: EditEmoTalk is a speech-driven 3D facial animation framework with continuous emotion editing, using boundary-aware semantic embeddings and emotional consistency loss for smooth, fine-grained emotional control.

Details

Motivation: Current speech-driven 3D facial animation methods achieve good lip sync but rely on discrete emotion categories, which limits continuous and fine-grained emotional control. There's a need for more nuanced emotional expression manipulation.

Method: Uses boundary-aware semantic embedding that learns normal directions of inter-emotion decision boundaries to create a continuous expression manifold. Also introduces emotional consistency loss that enforces semantic alignment between generated motion dynamics and target emotion embeddings through a mapping network.

Result: Achieves superior controllability, expressiveness, and generalization while maintaining accurate lip synchronization. Extensive experiments demonstrate the framework’s effectiveness.

Conclusion: EditEmoTalk enables continuous emotion editing for speech-driven 3D facial animation, overcoming limitations of discrete emotion categories. The framework will be released with code and pretrained models.

Abstract: Speech-driven 3D facial animation aims to generate realistic and expressive facial motions directly from audio. While recent methods achieve high-quality lip synchronization, they often rely on discrete emotion categories, limiting continuous and fine-grained emotional control. We present EditEmoTalk, a controllable speech-driven 3D facial animation framework with continuous emotion editing. The key idea is a boundary-aware semantic embedding that learns the normal directions of inter-emotion decision boundaries, enabling a continuous expression manifold for smooth emotion manipulation. Moreover, we introduce an emotional consistency loss that enforces semantic alignment between the generated motion dynamics and the target emotion embedding through a mapping network, ensuring faithful emotional expression. Extensive experiments demonstrate that EditEmoTalk achieves superior controllability, expressiveness, and generalization while maintaining accurate lip synchronization. Code and pretrained models will be released.

[456] Subjective evaluation of UHD video coded using VVC with LCEVC and ML-VVC

Naeem Ramzan, Muhammad Tufail Khan

Main category: cs.MM

TL;DR: LCEVC enhancement on VVC base layer shows competitive quality compared to upsampled VVC and multilayer VVC at two bitrate operating points (10% and 50% enhancement layer bitrate).

Details

Motivation: To evaluate the subjective quality performance of LCEVC as an enhancement layer on top of VVC base layer compared to existing multilayer video coding approaches.

Method: Used MPEG multilayer video coding assessment methodology with LCEVC Test Model v8.1. Compared reconstructed UHD output from HD VVC base + LCEVC enhancement against upsampled VVC base and multilayer VVC. Tested two operating points (10% and 50% enhancement layer bitrate) with Degradation Category Rating methodology involving 25 participants across 15 SDR/HDR sequences.

Result: Reported Mean Opinion Scores with 95% confidence intervals show perceptual quality comparison across coding approaches and operating points within the test scope.

Conclusion: LCEVC enhancement on VVC base layer provides competitive subjective quality performance compared to reference multilayer coding approaches, with results quantified for two different enhancement layer bitrate allocations.

Abstract: This paper presents the results of a subjective quality assessment of a multilayer video coding configuration in which Low Complexity Enhancement Video Coding (LCEVC) is applied as an enhancement layer on top of a Versatile Video Coding (VVC) base layer. The evaluation follows the same test methodology and conditions previously defined for MPEG multilayer video coding assessments, with the LCEVC enhancement layer encoded using version 8.1 of the LCEVC Test Model (LTM). The test compares reconstructed UHD output generated from an HD VVC base layer with LCEVC enhancement against two reference cases: upsampled VVC base layer decoding and multilayer VVC (ML-VVC). Two operating points are considered, corresponding to enhancement layers representing approximately 10% and 50% of the total bitrate. Subjective assessment was conducted using the Degradation Category Rating (DCR) methodology with twenty five participants, across a dataset comprising fifteen SDR and HDR sequences. The reported results include Mean Opinion Scores (MOS) with associated 95% confidence intervals, enabling comparison of perceptual quality across coding approaches and operating points within the defined test scope.

[457] The State-of-the-Art in Lifelog Retrieval: A Review of Progress at the ACM Lifelog Search Challenge Workshop 2022-24

Allie Tran, Werner Bailer, Duc-Tien Dang-Nguyen, Graham Healy, Steve Hodges, Björn Þór Jónsson, Luca Rossetto, Klaus Schoeffmann, Minh-Triet Tran, Lucia Vadicamo, Cathal Gurrin

Main category: cs.MM

TL;DR: Review paper analyzing interactive lifelog retrieval systems from ACM Lifelog Search Challenge (2022-2024), highlighting trends in embedding-based methods, LLM integration, and UI improvements for known-item search, QA, and ad-hoc search tasks.

Details

Motivation: To review and analyze recent advances in interactive lifelog retrieval systems demonstrated at the ACM Lifelog Search Challenge from 2022 to 2024, identifying key trends and improvements in retrieval techniques and interface designs.

Method: Comparative analysis of systems from ACM LSC competitions (2022-2024), examining three main retrieval tasks: known-item search, question answering, and ad-hoc search. Analysis focuses on retrieval methods, UI designs, and system performance.

Result: Identified trends: widespread adoption of embedding-based retrieval (CLIP, BLIP), increased LLM integration for conversational retrieval, continued innovation in multimodal/collaborative interfaces. Found embedding-driven approaches with LLMs show promise, UI design improvements enhance usability, and multi-instance system evaluations need reconsideration for expert track.

Conclusion: Interactive lifelog retrieval has advanced through embedding methods and LLM integration, with UI design playing crucial role in usability. Future work should optimize retrieval complexity-usability balance and refine evaluation methods for multi-instance systems in expert track.

Abstract: The ACM Lifelog Search Challenge (LSC) is a venue that welcomes and compares systems that support the exploration of lifelog data, and in particular the retrieval of specific information, through an interactive competition format. This paper reviews the recent advances in interactive lifelog retrieval as demonstrated at the ACM LSC from 2022 to 2024. Through a detailed comparative analysis, we highlight key improvements across three main retrieval tasks: known-item search, question answering, and ad-hoc search. Our analysis identifies trends such as the widespread adoption of embedding-based retrieval methods (e.g., CLIP, BLIP), increased integration of large language models (LLMs) for conversational retrieval, and continued innovation in multimodal and collaborative search interfaces. We further discuss how specific retrieval techniques and user interface (UI) designs have impacted system performance, emphasizing the importance of balancing retrieval complexity with usability. Our findings indicate that embedding-driven approaches combined with LLMs show promise for lifelog retrieval systems. Likewise, improving UI design can enhance usability and efficiency. Additionally, we recommend reconsidering multi-instance system evaluations within the expert track to better manage variability in user familiarity and configuration effectiveness.

eess.AS

[458] Multi-Level Embedding Conformer Framework for Bengali Automatic Speech Recognition

Md. Nazmus Sakib, Golam Mahmud, Md. Maruf Bangabashi, Umme Ara Mahinur Istia, Md. Jahidul Islam, Partha Sarker, Afra Yeamini Prity

Main category: eess.AS

TL;DR: End-to-end Bengali ASR using Conformer-CTC with multi-level embedding fusion (phoneme, syllable, wordpiece) achieves WER 10.01% and CER 5.03%.

Details

Motivation: Bengali is a morphologically rich, low-resource language with over 300 million speakers, posing significant challenges for automatic speech recognition systems.

Method: Conformer-CTC backbone with multi-level embedding fusion mechanism incorporating phoneme, syllable, and wordpiece representations. Uses early and late Conformer stages with preprocessing including silence trimming, resampling, Log-Mel spectrogram extraction, and SpecAugment augmentation.

Result: Achieved word error rate (WER) of 10.01% and character error rate (CER) of 5.03%, demonstrating strong performance for Bengali ASR.

Conclusion: Multi-granular linguistic information combined with acoustic modeling provides an effective, scalable approach for low-resource ASR development, particularly for morphologically rich languages like Bengali.

Abstract: Bengali, spoken by over 300 million people, is a morphologically rich and lowresource language, posing challenges for automatic speech recognition (ASR). This research presents an end-to-end framework for Bengali ASR, building on a Conformer-CTC backbone with a multi-level embedding fusion mechanism that incorporates phoneme, syllable, and wordpiece representations. By enriching acoustic features with these linguistic embeddings, the model captures fine-grained phonetic cues and higher-level contextual patterns. The architecture employs early and late Conformer stages, with preprocessing steps including silence trimming, resampling, Log-Mel spectrogram extraction, and SpecAugment augmentation. The experimental results demonstrate the strong potential of the model, achieving a word error rate (WER) of 10.01% and a character error rate (CER) of 5.03%. These results demonstrate the effectiveness of combining multi-granular linguistic information with acoustic modeling, providing a scalable approach for low-resource ASR development.

[459] Nearest Kronecker Product Decomposition Based Subband Adaptive Filter: Algorithms and Applications

Jianhong Ye, Haiquan Zhao

Main category: eess.AS

TL;DR: The paper proposes several enhanced NKP-based adaptive filtering algorithms with improved convergence, reduced complexity, robustness to noise, and nonlinear capabilities for various signal processing applications.

Details

Motivation: Existing NKP-based NLMS algorithms suffer from degraded convergence with highly correlated input signals and high computational complexity. There's a need for more efficient, robust, and versatile adaptive filtering algorithms that can handle various real-world scenarios including impulsive noise, nonlinear environments, and active noise control applications.

Method: The authors propose multiple algorithms: 1) NSAF-NKP-I (type-I NKP-based normalized subband adaptive filter), 2) NSAF-NKP-II (enhanced type-II with reduced complexity), 3) Robust variants using maximum correntropy criterion (RNSAF-NKP-MCC) and logarithmic criterion (RNSAF-NKP-LC), 4) Nonlinear implementations using trigonometric functional link networks (TFLN-NSAF-NKP) and Volterra series (Volterra-NKP-NSAF), and 5) Filtered-x version for active noise control (NKP-FxNSAF).

Result: The proposed algorithms demonstrate superior convergence performance over existing methods, with NSAF-NKP-II achieving equivalent performance to NSAF-NKP-I but with substantially reduced computational complexity. The robust variants show improved performance against impulsive noise, while nonlinear implementations handle complex nonlinear environments effectively. All algorithms are validated through simulations in echo cancellation, sparse system identification, nonlinear processing, and active noise control scenarios.

Conclusion: The paper presents a comprehensive suite of NKP-based adaptive filtering algorithms that address various practical challenges including computational efficiency, robustness to impulsive noise, nonlinear system handling, and active noise control applications, demonstrating superior performance over state-of-the-art counterparts across multiple real-world scenarios.

Abstract: Recently, the nearest Kronecker product (NKP) decomposition-based normalized least mean square (NLMS-NKP) algorithm has demonstrated superior convergence performance compared to the conventional NLMS algorithm. However, its convergence rate exhibits significant degradation when processing highly correlated input signals. To address this problem, we propose a type-I NKP-based normalized subband adaptive filter (NSAF) algorithm, namely NSAF-NKP-I. Nevertheless, this algorithm incurs substantially higher computational overhead than the NLMS-NKP algorithm. Remarkably, our enhanced type-II NKP-based NSAF (NSAF-NKP-II) algorithm achieves equivalent convergence performance while substantially reducing computational complexity. Furthermore, to enhance robustness against impulsive noise interference, we develop two robust variants: the maximum correntropy criterion-based robust NSAF-NKP (RNSAF-NKP-MCC) and logarithmic criterion-based robust NSAF-NKP (RNSAF-NKP-LC) algorithms. Additionally, detailed analyses of computational complexity, step-size range, and theoretical steady-state performance are provided for theproposed algorithms. To enhance the practicability of the NSAF-NKP-II algorithm in complex nonlinear environments, we further devise two nonlinear implementations: the trigonometric functional link network-based NKP-NSAF (TFLN-NSAF-NKP) and Volterra series expansion-based NKP-NSAF (Volterra-NKP-NSAF) algorithms. In active noise control (ANC) systems, we further propose the filtered-x NSAF-NKP-II (NKP-FxNSAF) algorithm. Simulation experiments in echo cancellation, sparse system identification, nonlinear processing, and ANC scenarios are conducted to validate the superiority of the proposed algorithms over existing state-of-the-art counterparts.

[460] VoiceSculptor: Your Voice, Designed By You

Jingbin Hu, Huakang Chen, Linhan Ma, Dake Guo, Qirui Zhan, Wenhao Li, Haoyu Zhang, Kangxiang Xia, Ziyu Zhang, Wenjie Tian, Chengyou Wang, Jinrui Liang, Shuhan Guo, Zihang Yang, Bengu Wu, Binbin Zhang, Pengcheng Zhu, Pengyuan Xie, Chuan Xie, Qiang Zhang, Jie Liu, Lei Xie

Main category: eess.AS

TL;DR: VoiceSculptor is an open-source unified TTS system that enables instruction-based voice design with fine-grained control over speech attributes and high-fidelity voice cloning in a single framework.

Details

Motivation: Current open-source TTS systems lack truly instruction-following, fine-grained control over core speech attributes like pitch, speaking rate, age, emotion, and style.

Method: Integrates instruction-based voice design and high-fidelity voice cloning in a unified framework. Uses natural-language descriptions to generate controllable speaker timbre, supports iterative refinement via Retrieval-Augmented Generation (RAG), and provides attribute-level edits across multiple dimensions. The designed voice is rendered into a prompt waveform and fed into a cloning model for timbre transfer.

Result: Achieves open-source state-of-the-art (SOTA) on InstructTTSEval-Zh benchmark. The system is fully open-sourced including code and pretrained models.

Conclusion: VoiceSculptor bridges the gap in open-source TTS by providing comprehensive instruction-based control over speech attributes while maintaining high-fidelity voice cloning capabilities, advancing reproducible instruction-controlled TTS research.

Abstract: Despite rapid progress in text-to-speech (TTS), open-source systems still lack truly instruction-following, fine-grained control over core speech attributes (e.g., pitch, speaking rate, age, emotion, and style). We present VoiceSculptor, an open-source unified system that bridges this gap by integrating instruction-based voice design and high-fidelity voice cloning in a single framework. It generates controllable speaker timbre directly from natural-language descriptions, supports iterative refinement via Retrieval-Augmented Generation (RAG), and provides attribute-level edits across multiple dimensions. The designed voice is then rendered into a prompt waveform and fed into a cloning model to enable high-fidelity timbre transfer for downstream speech synthesis. VoiceSculptor achieves open-source state-of-the-art (SOTA) on InstructTTSEval-Zh, and is fully open-sourced, including code and pretrained models, to advance reproducible instruction-controlled TTS research.

eess.IV

[461] Cell Behavior Video Classification Challenge, a benchmark for computer vision methods in time-lapse microscopy

Raffaella Fiamma Cabini, Deborah Barkauskas, Guangyu Chen, Zhi-Qi Cheng, David E Cicchetti, Judith Drazba, Rodrigo Fernandez-Gonzalez, Raymond Hawkins, Yujia Hu, Jyoti Kini, Charles LeWarne, Xufeng Lin, Sai Preethi Nakkina, John W Peterson, Koert Schreurs, Ayushi Singh, Kumaran Bala Kandan Viswanathan, Inge MN Wortel, Sanjian Zhang, Rolf Krause, Santiago Fernandez Gonzalez, Diego Ulisse Pizzagalli

Main category: eess.IV

TL;DR: The paper presents the Cell Behavior Video Classification Challenge (CBVCC) which benchmarks 35 methods for classifying microscopy videos of complex cellular behaviors, comparing three approaches: tracking-derived features, end-to-end deep learning, and ensemble methods.

Details

Motivation: Classifying microscopy videos of cellular behaviors is crucial for understanding biological dynamics but remains challenging due to non-rigid object boundaries, need for hierarchical spatiotemporal features from entire sequences, and multiple objects in view.

Method: Organized the CBVCC challenge benchmarking 35 methods using three approaches: 1) classification of tracking-derived features, 2) end-to-end deep learning architectures learning spatiotemporal features directly from videos without explicit tracking, and 3) ensemble methods combining tracking-derived with image-derived features.

Result: The paper discusses participant results and compares the potential and limitations of each approach, providing a benchmark for 35 different methods in cellular behavior video classification.

Conclusion: The CBVCC serves as a foundation to foster development of computer vision methods for studying cellular dynamics by benchmarking and comparing different approaches to microscopy video classification.

Abstract: The classification of microscopy videos capturing complex cellular behaviors is crucial for understanding and quantifying the dynamics of biological processes over time. However, it remains a frontier in computer vision, requiring approaches that effectively model the shape and motion of objects without rigid boundaries, extract hierarchical spatiotemporal features from entire image sequences rather than static frames, and account for multiple objects within the field of view. To this end, we organized the Cell Behavior Video Classification Challenge (CBVCC), benchmarking 35 methods based on three approaches: classification of tracking-derived features, end-to-end deep learning architectures to directly learn spatiotemporal features from the entire video sequence without explicit cell tracking, or ensembling tracking-derived with image-derived features. We discuss the results achieved by the participants and compare the potential and limitations of each approach, serving as a basis to foster the development of computer vision methods for studying cellular dynamics.

[462] An effective interactive brain cytoarchitectonic parcellation framework using pretrained foundation model

Shiqi Zhang, Fang Xu, Pengcheng Zhou

Main category: eess.IV

TL;DR: Interactive cytoarchitectonic parcellation framework using DINOv3 vision transformer for brain region segmentation with sparse user scribbles.

Details

Motivation: Cytoarchitectonic mapping is crucial for brain structure analysis but faces challenges: scarcity of training labels and variability in staining/imaging conditions. Current deep learning approaches are constrained by these limitations.

Method: Proposes interactive framework combining: (1) multi-layer DINOv3 feature fusion, (2) lightweight segmentation decoder, and (3) real-time user-guided training from sparse scribbles. Uses transfer learning from DINOv3 vision transformer.

Result: DINOv3 transfer learning outperforms training nnU-Net from scratch. Features show clear anatomical correspondence. Method enables efficient brain region segmentation with sparse labels.

Conclusion: Foundation-model-driven interactive segmentation offers scalable and efficient cytoarchitectonic mapping, addressing label scarcity and imaging variability challenges.

Abstract: Cytoarchitectonic mapping provides anatomically grounded parcellations of brain structure and forms a foundation for integrative, multi-modal neuroscience analyses. These parcellations are defined based on the shape, density, and spatial arrangement of neuronal cell bodies observed in histological imaging. Recent works have demonstrated the potential of using deep learning models toward fully automatic segmentation of cytoarchitectonic areas in large-scale datasets, but performance is mainly constrained by the scarcity of training labels and the variability of staining and imaging conditions. To address these challenges, we propose an interactive cytoarchitectonic parcellation framework that leverages the strong transferability of the DINOv3 vision transformer. Our framework combines (i) multi-layer DINOv3 feature fusion, (ii) a lightweight segmentation decoder, and (iii) real-time user-guided training from sparse scribbles. This design enables rapid human-in-the-loop refinement while maintaining high segmentation accuracy. Compared with training an nnU-Net from scratch, transfer learning with DINOv3 yields markedly improved performance. We also show that features extracted by DINOv3 exhibit clear anatomical correspondence and demonstrate the method’s practical utility for brain region segmentation using sparse labels. These results highlight the potential of foundation-model-driven interactive segmentation for scalable and efficient cytoarchitectonic mapping.

[463] Multi-Objective Pareto-Front Optimization for Efficient Adaptive VVC Streaming

Angeliki Katsenou, Vignesh V. Menon, Guoda Laurinaviciute, Benjamin Bross, Detlev Marpe

Main category: eess.IV

TL;DR: Proposes multi-objective Pareto-front optimization for VVC streaming that jointly optimizes video quality, bitrate, and decoding time/energy, achieving significant bitrate savings while maintaining quality.

Details

Motivation: Need for efficient adaptive video streaming that balances coding performance objectives (bitrate, quality, decoding complexity) in a content- and codec-dependent manner, while ensuring consistent QoE during streaming.

Method: Two Pareto-front optimization strategies: JRQT-PF (Joint Rate-Quality-Time) and JQT-PF (Joint Quality-Time), constructing quality-monotonic bitrate ladders for VVC streaming under quality monotonicity constraints.

Result: JQT-PF achieves 11.76% average bitrate savings with 0.29% decoding time reduction at same XPSNR; aggressive configurations yield up to 27.88% bitrate savings. JRQT-PF offers 6.38% bitrate savings and 6.17% decoding time reduction. Outperforms existing methods including fixed ladders and dynamic resolution selection.

Conclusion: Pareto-front optimization with decoding time constraints enables sustainable, high-quality streaming tailored to network and device capabilities, providing controlled tradeoffs between quality, bitrate, and complexity.

Abstract: Adaptive video streaming has facilitated improved video streaming over the past years. A balance among coding performance objectives such as bitrate, video quality, and decoding complexity is required to achieve efficient, content- and codec-dependent, adaptive video streaming. This paper proposes a multi-objective Pareto-front (PF) optimization framework to construct quality-monotonic, content-adaptive bitrate ladders Versatile Video Coding (VVC) streaming that jointly optimize video quality, bitrate, and decoding time, which is used as a practical proxy for decoding energy. Two strategies are introduced: the Joint Rate-Quality-Time Pareto Front (JRQT-PF) and the Joint Quality-Time Pareto Front (JQT-PF), each exploring different tradeoff formulations and objective prioritizations. The ladders are constructed under quality monotonicity constraints during adaptive streaming to ensure a consistent Quality of Experience (QoE). Experiments are conducted on a large-scale UHD dataset (Inter-4K), with quality assessed using PSNR, VMAF, and XPSNR, and complexity measured via decoding time and energy consumption. The JQT-PF method achieves 11.76% average bitrate savings while reducing average decoding time by 0.29% to maintain the same XPSNR, compared to a widely-used fixed ladder. More aggressive configurations yield up to 27.88% bitrate savings at the cost of increased complexity. The JRQT-PF strategy, on the other hand, offers more controlled tradeoffs, achieving 6.38 % bitrate savings and 6.17 % decoding time reduction. This framework outperforms existing methods, including fixed ladders, VMAF- and XPSNR-based dynamic resolution selection, and complexity-aware benchmarks. The results confirm that PF optimization with decoding time constraints enables sustainable, high-quality streaming tailored to network and device capabilities.

[464] Learning Physics-Informed Noise Models from Dark Frames for Low-Light Raw Image Denoising

Hansen Feng, Lizhi Wang, Yiqi Huang, Yuzhi Wang, Lin Zhu, Hua Huang

Main category: eess.IV

TL;DR: The paper proposes PNNP, a physics-informed noise neural proxy that learns noise models from dark frames instead of paired real data for low-light raw image denoising.

Details

Motivation: Current noise modeling approaches have limitations: physics-based methods struggle to characterize entire real noise distributions, while learning-based methods depend impractically on paired real data. There's a need to break down data dependency while maintaining accurate noise modeling.

Method: Proposes learning noise models from dark frames instead of paired real data. Introduces PNNP with three key techniques: 1) Physics-guided noise decoupling (PND) to handle different noise levels flexibly, 2) Physics-aware proxy model (PPM) to incorporate physical priors, and 3) Differentiable distribution loss (DDL) for explicit supervision of noise distribution.

Result: PNNP exhibits powerful potential in characterizing real noise distribution. Extensive experiments on public datasets demonstrate superior performance in practical low-light raw image denoising.

Conclusion: The proposed approach successfully breaks down data dependency by learning from dark frames while maintaining accurate noise modeling through physics-informed neural proxies, achieving state-of-the-art performance in low-light raw image denoising.

Abstract: Recently, the mainstream practice for training low-light raw image denoising methods has shifted towards employing synthetic data. Noise modeling, which focuses on characterizing the noise distribution of real-world sensors, profoundly influences the effectiveness and practicality of synthetic data. Currently, physics-based noise modeling struggles to characterize the entire real noise distribution, while learning-based noise modeling impractically depends on paired real data. In this paper, we propose a novel strategy: learning the noise model from dark frames instead of paired real data, to break down the data dependency. Based on this strategy, we introduce an efficient physics-informed noise neural proxy (PNNP) to approximate the real-world sensor noise model. Specifically, we integrate physical priors into neural proxies and introduce three efficient techniques: physics-guided noise decoupling (PND), physics-aware proxy model (PPM), and differentiable distribution loss (DDL). PND decouples the dark frame into different components and handles different levels of noise flexibly, which reduces the complexity of noise modeling. PPM incorporates physical priors to constrain the synthetic noise, which promotes the accuracy of noise modeling. DDL provides explicit and reliable supervision for noise distribution, which promotes the precision of noise modeling. PNNP exhibits powerful potential in characterizing the real noise distribution. Extensive experiments on public datasets demonstrate superior performance in practical low-light raw image denoising. The source code will be publicly available at the project homepage.

[465] Instance-level quantitative saliency in multiple sclerosis lesion segmentation

Federico Spagnolo, Nataliia Molchanova, Meritxell Bach Cuadra, Mario Ocampo Pineda, Lester Melie-Garcia, Cristina Granziera, Vincent Andrearczyk, Adrien Depeursinge

Main category: eess.IV

TL;DR: The paper introduces two instance-level XAI methods for semantic segmentation that provide quantitative saliency maps, applied to white matter lesion segmentation in MRI scans.

Details

Motivation: Instance-level explanations for semantic segmentation remain largely unexplored, especially important for multi-lesional diseases where understanding what drives detection and contouring of specific lesions is clinically meaningful.

Method: Extended SmoothGrad and Grad-CAM++ to obtain quantitative instance saliency maps for semantic segmentation. Applied to WML segmentation using 4023 MRI scans with expert annotations, training three deep learning architectures (3D U-Net, nnU-Net, Swin UNETR).

Result: Models achieved Dice scores of 0.71, 0.78, and 0.80. Saliency maps showed models rely primarily on FLAIR images, with positive saliency inside lesions and negative in immediate neighborhood. Peak saliency values differed significantly between correct and incorrect predictions.

Conclusion: Two architecture-agnostic XAI methods provide quantitative instance-level explanations for semantic segmentation, supporting clinically meaningful interpretation and potentially helping identify segmentation errors.

Abstract: Explainable artificial intelligence (XAI) methods have been proposed to interpret model decisions in classification and, more recently, in semantic segmentation. However, instance-level XAI for semantic segmentation, namely explanations focused on a single object among multiple instances of the same class, remains largely unexplored. Such explanations are particularly important in multi-lesional diseases to understand what drives the detection and contouring of a specific lesion. We propose instance-level explanation maps for semantic segmentation by extending SmoothGrad and Grad-CAM++ to obtain quantitative instance saliency. These methods were applied to the segmentation of white matter lesions (WMLs), a magnetic resonance imaging biomarker in multiple sclerosis. We used 4023 FLAIR and MPRAGE MRI scans from 687 patients collected at the University Hospital of Basel, Switzerland, with WML masks annotated by four expert clinicians. Three deep learning architectures, a 3D U-Net, nnU-Net, and Swin UNETR, were trained and evaluated, achieving normalized Dice scores of 0.71, 0.78, and 0.80, respectively. Instance saliency maps showed that the models relied primarily on FLAIR rather than MPRAGE for WML segmentation, with positive saliency inside lesions and negative saliency in their immediate neighborhood, consistent with clinical practice. Peak saliency values differed significantly across correct and incorrect predictions, suggesting that quantitative instance saliency may help identify segmentation errors. In conclusion, we introduce two architecture-agnostic XAI methods that provide quantitative instance-level explanations for semantic segmentation and support clinically meaningful interpretation of model decisions.

[466] End-to-End PET Image Reconstruction via a Posterior-Mean Diffusion Model

Yiran Sun, Osama Mawlawi

Main category: eess.IV

TL;DR: PMDM-PET: A diffusion model approach for PET image reconstruction that achieves optimal perception-distortion tradeoff by combining posterior-mean predictions with optimal transport to ground-truth distribution.

Details

Motivation: Current DL methods for PET reconstruction have limitations: regression-based models produce overly smoothed images (low distortion but low perceptual quality), while GAN-based and likelihood-based models introduce artifacts (high distortion but high perceptual quality). There's a need for a robust perception-distortion tradeoff for clinical applicability.

Method: Proposes Posterior-Mean Denoising Diffusion Model (PMDM-PET) that uses mathematical theory to explore closed-form expression of perception-distortion function in diffusion model space. First obtains posterior-mean PET predictions under minimum MSE, then optimally transports their distribution to ground-truth PET images distribution.

Result: PMDM-PET generates realistic PET images with possible minimum distortion and optimal perceptual quality. Outperforms five recent SOTA DL baselines in both qualitative visual inspection and quantitative metrics (PSNR, SSIM, NRMSE).

Conclusion: PMDM-PET successfully addresses the perception-distortion tradeoff in PET reconstruction, producing clinically applicable images with both high perceptual quality and low distortion, surpassing existing state-of-the-art methods.

Abstract: Positron Emission Tomography (PET) is a functional imaging modality that enables the visualization of biochemical and physiological processes across various tissues. Recently, deep learning (DL)-based methods have demonstrated significant progress in directly mapping sinograms to PET images. However, regression-based DL models often yield overly smoothed reconstructions lacking of details (i.e., low distortion, low perceptual quality), whereas GAN-based and likelihood-based posterior sampling models tend to introduce undesirable artifacts in predictions (i.e., high distortion, high perceptual quality), limiting their clinical applicability. To achieve a robust perception-distortion tradeoff, we propose Posterior-Mean Denoising Diffusion Model (PMDM-PET), a novel approach that builds upon a recently established mathematical theory to explore the closed-form expression of perception-distortion function in diffusion model space for PET image reconstruction from sinograms. Specifically, PMDM-PET first obtained posterior-mean PET predictions under minimum mean square error (MSE), then optimally transports the distribution of them to the ground-truth PET images distribution. Experimental results demonstrate that PMDM-PET not only generates realistic PET images with possible minimum distortion and optimal perceptual quality but also outperforms five recent state-of-the-art (SOTA) DL baselines in both qualitative visual inspection and quantitative pixel-wise metrics PSNR (dB)/SSIM/NRMSE.

[467] A large-scale heterogeneous 3D magnetic resonance brain imaging dataset for self-supervised learning

Stefano Cerri, Asbjørn Munk, Jakob Ambsdorf, Julia Machnio, Sebastian Nørgaard Llambias, Vardan Nersesjan, Christian Hedeager Krag, Peirong Liu, Pablo Rocamora García, Mostafa Mehdipour Ghazi, Mikael Boesen, Michael Eriksen Benros, Juan Eugenio Iglesias, Mads Nielsen

Main category: eess.IV

TL;DR: FOMO300K is a large-scale heterogeneous dataset of 318,877 brain MRI scans from 59,969 subjects, aggregated from 920 public sources, designed to support self-supervised learning in medical imaging.

Details

Motivation: To address the need for large-scale, diverse medical imaging datasets to support development and benchmarking of self-supervised learning methods in medical imaging, particularly for brain MRI analysis.

Method: Aggregated 318,877 brain MRI scans from 920 publicly available sources, including both clinical- and research-grade images with multiple MRI sequences. Applied minimal preprocessing to preserve original image characteristics while reducing entry barriers.

Result: Created FOMO300K dataset with 318,877 scans from 82,678 MRI sessions and 59,969 subjects, featuring wide anatomical and pathological variability including large brain anomalies. Provided companion code for self-supervised pretraining and finetuning, along with pretrained models.

Conclusion: FOMO300K serves as a comprehensive resource to advance self-supervised learning in medical imaging by providing a large-scale, heterogeneous dataset with minimal preprocessing, enabling better development and benchmarking of methods at scale.

Abstract: We present FOMO300K, a large-scale, heterogeneous dataset of 318,877 brain Magnetic Resonance Imaging (MRI) scans from 82,678 MRI sessions and 59,969 subjects, aggregated from 920 publicly available sources. The dataset includes both clinical- and research-grade images, multiple MRI sequences, and a wide range of anatomical and pathological variability, including scans with large brain anomalies. Minimal preprocessing was applied to preserve the original image characteristics while reducing entry barriers for new users. Companion code for self-supervised pretraining and finetuning is provided, along with pretrained models. FOMO300K is intended to support the development and benchmarking of self-supervised learning methods in medical imaging at scale.

[468] Mamba Goes HoME: Hierarchical Soft Mixture-of-Experts for 3D Medical Image Segmentation

Szymon Płotka, Gizem Mert, Maciej Chrabaszcz, Ewa Szczurek, Arkadiusz Sitek

Main category: eess.IV

TL;DR: HoME introduces a hierarchical soft mixture-of-experts architecture for efficient 3D medical image segmentation, built on Mamba SSM backbone with two-level token routing for local feature extraction and global context fusion.

Details

Motivation: Despite AI advances in medical image segmentation, challenges remain in efficient 3D processing across diverse modalities and handling data variability. Current methods struggle with long-context modeling and adaptation to different imaging characteristics.

Method: Hierarchical Soft Mixture-of-Experts (HoME) with two-level token routing: 1) Local SMoE partitions input sequences into groups, routing tokens to specialized per-group experts for localized feature extraction; 2) Global SMoE aggregates outputs for cross-group information fusion and global context refinement, built on Mamba Selective State Space Model backbone.

Result: HoME surpasses state-of-the-art results across datasets from three most widely used 3D medical imaging modalities (CT, MRI, PET) and varying data qualities, demonstrating enhanced generalizability and segmentation performance.

Conclusion: The hierarchical design combining local expert routing with global expert refinement provides an efficient solution for 3D medical image segmentation, addressing challenges of long-context modeling and data variability across different imaging modalities.

Abstract: In recent years, artificial intelligence has significantly advanced medical image segmentation. Nonetheless, challenges remain, including efficient 3D medical image processing across diverse modalities and handling data variability. In this work, we introduce Hierarchical Soft Mixture-of-Experts (HoME), a two-level token-routing layer for efficient long-context modeling, specifically designed for 3D medical image segmentation. Built on the Mamba Selective State Space Model (SSM) backbone, HoME enhances sequential modeling through adaptive expert routing. In the first level, a Soft Mixture-of-Experts (SMoE) layer partitions input sequences into local groups, routing tokens to specialized per-group experts for localized feature extraction. The second level aggregates these outputs through a global SMoE layer, enabling cross-group information fusion and global context refinement. This hierarchical design, combining local expert routing with global expert refinement, enhances generalizability and segmentation performance, surpassing state-of-the-art results across datasets from the three most widely used 3D medical imaging modalities and varying data qualities. The code is publicly available at https://github.com/gmum/MambaHoME.

Mehmet Onurcan Kaya, Figen S. Oktem

Main category: eess.IV

TL;DR: A novel phase retrieval framework using Langevin dynamics for posterior sampling that balances measurement fidelity and perceptual quality through stochastic sampling, denoising, and model-based updates.

Details

Motivation: Phase retrieval is ill-posed, and existing methods struggle to jointly achieve measurement fidelity (distortion) and perceptual realism (perception). There's a need for methods that can navigate the perception-distortion tradeoff effectively.

Method: Proposes a framework using Langevin dynamics for efficient posterior sampling. Includes three variants integrating: 1) theoretically grounded Langevin inference, 2) adaptive noise schedule learning, 3) parallel reconstruction sampling, and 4) warm-start initialization from classical solvers. Combines stochastic sampling, learned denoising, and model-based updates.

Result: Achieves state-of-the-art performance across multiple benchmarks in both fidelity and perceptual quality. The framework successfully balances distortion and perceptual quality better than conventional approaches.

Conclusion: The proposed Langevin dynamics-based framework provides a principled approach to phase retrieval that effectively navigates the perception-distortion tradeoff, outperforming existing methods and offering a new direction for ill-posed inverse problems.

Abstract: Phase retrieval is an ill-posed inverse problem in which classical and deep learning-based methods struggle to jointly achieve measurement fidelity and perceptual realism. We propose a novel framework for phase retrieval that leverages Langevin dynamics to enable efficient posterior sampling, yielding reconstructions that explicitly balance distortion and perceptual quality. Unlike conventional approaches that prioritize pixel-wise accuracy, our methods navigate the perception-distortion tradeoff through a principled combination of stochastic sampling, learned denoising, and model-based updates. The framework comprises three variants of increasing complexity, integrating theoretically grounded Langevin inference, adaptive noise schedule learning, parallel reconstruction sampling, and warm-start initialization from classical solvers. Extensive experiments demonstrate that our methods achieve state-of-the-art performance across multiple benchmarks, both in terms of fidelity and perceptual quality. The source code and trained models are available at https://github.com/METU-SPACE-Lab/prNet-for-Phase-Retrieval

Jyun-Ping Kao, Shinyeong Rho, Shahar Lazarev, Hyun-Hae Cho, Fangxu Xing, Taehoon Shin, C. -C. Jay Kuo, Jonghye Woo

Main category: eess.IV

TL;DR: Novel parameter-efficient transfer learning approach using 3D LoRA to adapt CT-pretrained foundation model for MRI-based ADHD classification, achieving state-of-the-art results with 113x fewer parameters.

Details

Motivation: Early ADHD diagnosis in children is crucial but challenging using neuroimaging due to heterogeneous presentations and symptom overlap with other conditions. There's a need for efficient cross-modal adaptation methods that can leverage large pre-trained models for specialized medical imaging tasks.

Method: Proposes a parameter-efficient transfer learning approach that adapts a large-scale 3D convolutional foundation model pre-trained on CT images to MRI-based ADHD classification. Introduces 3D Low-Rank Adaptation (LoRA) by factorizing 3D convolutional kernels into 2D low-rank updates, dramatically reducing trainable parameters while maintaining performance.

Result: Achieved state-of-the-art results in five-fold cross-validation on public diffusion MRI database: one variant reached 71.9% accuracy, another attained AUC of 0.716. Both used only 1.64 million trainable parameters (113x fewer than fully fine-tuned foundation model). Represents one of first successful cross-modal (CT-to-MRI) adaptations in neuroimaging.

Conclusion: The 3D LoRA fine-tuning strategy establishes a new benchmark for ADHD classification while greatly improving efficiency. Demonstrates successful cross-modal adaptation of foundation models in neuroimaging with dramatic parameter reduction, enabling more practical deployment of large models in medical imaging applications.

Abstract: Early diagnosis of attention-deficit/hyperactivity disorder (ADHD) in children plays a crucial role in improving outcomes in education and mental health. Diagnosing ADHD using neuroimaging data, however, remains challenging due to heterogeneous presentations and overlapping symptoms with other conditions. To address this, we propose a novel parameter-efficient transfer learning approach that adapts a large-scale 3D convolutional foundation model, pre-trained on CT images, to an MRI-based ADHD classification task. Our method introduces Low-Rank Adaptation (LoRA) in 3D by factorizing 3D convolutional kernels into 2D low-rank updates, dramatically reducing trainable parameters while achieving superior performance. In a five-fold cross-validated evaluation on a public diffusion MRI database, our 3D LoRA fine-tuning strategy achieved state-of-the-art results, with one model variant reaching 71.9% accuracy and another attaining an AUC of 0.716. Both variants use only 1.64 million trainable parameters (over 113x fewer than a fully fine-tuned foundation model). Our results represent one of the first successful cross-modal (CT-to-MRI) adaptations of a foundation model in neuroimaging, establishing a new benchmark for ADHD classification while greatly improving efficiency.

Today’s Research Highlights

Table of Contents

cs.CL

[1] SIN-Bench: Tracing Native Evidence Chains in Long-Context Multimodal Scientific Interleaved Literature

[2] LLM-Driven Preference Data Synthesis for Proactive Prediction of the Next User Utterance in Human-Machine Dialogue

[3] Evaluating Novelty in AI-Generated Research Plans Using Multi-Workflow LLM Pipelines

[4] Introducing Axlerod: An LLM-based Chatbot for Assisting Independent Insurance Agents

[5] Opportunities and Challenges of Natural Language Processing for Low-Resource Senegalese Languages in Social Science Research

[6] SALP-CG: Standard-Aligned LLM Pipeline for Classifying and Grading Large Volumes of Online Conversational Health Data

[7] StatLLaMA: A multi-stage training framework for building a domain-optimized statistical language model

[8] MoST: Mixing Speech and Text with Modality-Aware Mixture of Experts

[9] Bounded Hyperbolic Tangent: A Stable and Efficient Alternative to Pre-Layer Normalization in Large Language Models

[10] Uncertainty-Aware Dynamic Knowledge Graphs for Reliable Question Answering

[11] Bridging the gap: A comparative exploration of Speech-LLM and end-to-end architecture for multilingual conversational ASR

[12] Cross-Platform Evaluation of Large Language Model Safety in Pediatric Consultations: Evolution of Adversarial Robustness and the Scale Paradox

[13] ADMEDTAGGER: an annotation framework for distillation of expert knowledge for the Polish medical language

[14] SagaScale: A Realistic, Scalable, and High-Quality Long-Context Benchmark Built from Full-Length Novels

[15] Syntactic Framing Fragility: An Audit of Robustness in LLM Ethical Decisions

[16] Assessing and Improving Punctuation Robustness in English-Marathi Machine Translation

[17] Forgetting as a Feature: Cognitive Alignment of Large Language Models

[18] SciNets: Graph-Constrained Multi-Hop Reasoning for Scientific Literature Synthesis

[19] Eliminating Agentic Workflow for Introduction Generation with Parametric Stage Tokens

[20] Enhancing Business Analytics through Hybrid Summarization of Financial Reports

[21] Clinical Document Metadata Extraction: A Scoping Review

[22] Geometric Patterns of Meaning: A PHATE Manifold Analysis of Multi-lingual Embeddings

[23] Benchmarking Cross-Lingual Semantic Alignment in Multilingual Embeddings

[24] Closing the Data Loop: Using OpenDataArena to Engineer Superior Training Datasets

[25] From Detection to Diagnosis: Advancing Hallucination Analysis with Automated Data Synthesis

[26] Stable and Explainable Personality Trait Evaluation in Large Language Models with Internal Activations

[27] Bears, all bears, and some bears. Language Constraints on Language Models’ Inductive Inferences

[28] MedRedFlag: Investigating how LLMs Redirect Misconceptions in Real-World Health Communication

[29] OUTLINEFORGE: Hierarchical Reinforcement Learning with Explicit States for Scientific Writing

[30] Patient-Similarity Cohort Reasoning in Clinical Text-to-SQL

[31] Clozing the Gap: Exploring Why Language Model Surprisal Outperforms Cloze Surprisal

[32] Take Out Your Calculators: Estimating the Real Difficulty of Question Items with LLM Student Simulations

[33] Context Volume Drives Performance: Tackling Domain Shift in Extremely Low-Resource Translation via RAG

[34] SocraticKG: Knowledge Graph Construction via QA-Driven Fact Extraction

[35] EHRNavigator: A Multi-Agent System for Patient-Level Clinical Question Answering over Heterogeneous Electronic Health Records

[36] EmplifAI: a Fine-grained Dataset for Japanese Empathetic Medical Dialogues in 28 Emotion Labels

[37] Long-Chain Reasoning Distillation via Adaptive Prefix Alignment

[38] Deriving Character Logic from Storyline as Codified Decision Trees

[39] Is MT Ready for the Next Crisis or Pandemic?

[40] CALM-IT: Generating Realistic Long-Form Motivational Interviewing Dialogues with Dual-Actor Conversational Dynamics Tracking

[41] Skill-Aware Data Selection and Fine-Tuning for Data-Efficient Reasoning Distillation

[42] Role-Playing Agents Driven by Large Language Models: Current Status, Challenges, and Future Trends

[43] ToolSafe: Enhancing Tool Invocation Safety of LLM-based agents via Proactive Step-level Guardrail and Feedback

[44] What Gets Activated: Uncovering Domain and Driver Experts in MoE Language Models

[45] Alignment Pretraining: AI Discourse Causes Self-Fulfilling (Mis)alignment

[46] AWED-FiNER: Agents, Web applications, and Expert Detectors for Fine-grained Named Entity Recognition across 36 Languages for 6.6 Billion Speakers

[47] Credit C-GPT: A Domain-Specialized Large Language Model for Conversational Understanding in Vietnamese Debt Collection

[48] HOMURA: Taming the Sand-Glass for Time-Constrained LLM Translation via Reinforcement Learning

[49] HUMANLLM: Benchmarking and Reinforcing LLM Anthropomorphism via Human Cognitive Patterns

[50] One Instruction Does Not Fit All: How Well Do Embeddings Align Personas and Instructions in Low-Resource Indian Languages?

[51] GeoSteer: Faithful Chain-of-Thought Steering via Latent Manifold Gradients

[52] Loop as a Bridge: Can Looped Transformers Truly Link Representation Space and Natural Language Outputs?

[53] coTherapist: A Behavior-Aligned Small Language Model to Support Mental Healthcare Experts

[54] Untangling Input Language from Reasoning Language: A Diagnostic Framework for Cross-Lingual Moral Alignment in LLMs

[55] Measuring Affinity between Attention-Head Weight Subspaces via the Projection Kernel

[56] The Straight and Narrow: Do LLMs Possess an Internal Moral Path?

[57] Multilinguality as Sense Adaptation

[58] ADVOSYNTH: A Synthetic Multi-Advocate Dataset for Speaker Identification in Courtroom Scenarios

[59] Boundary-Aware NL2SQL: Integrating Reliability through Hybrid Reward and Data Synthesis

[60] An Efficient Long-Context Ranking Architecture With Calibrated LLM Distillation: Application to Person-Job Fit

[61] OctoBench: Benchmarking Scaffold-Aware Instruction Following in Repository-Grounded Agentic Coding

[62] Training-Trajectory-Aware Token Selection

[63] Unlocking Implicit Experience: Synthesizing Tool-Use Trajectories from Text

[64] The Assistant Axis: Situating and Stabilizing the Default Persona of Language Models

[65] INDIC DIALECT: A Multi Task Benchmark to Evaluate and Translate in Indian Language Dialects

[66] TF3-RO-50M: Training Compact Romanian Language Models from Scratch on Synthetic Moral Microfiction

[67] Are Language Models Models?

[68] SurgGoal: Rethinking Surgical Planning Evaluation via Goal-Satisfiability

[69] Contextual StereoSet: Stress-Testing Bias Alignment Robustness in Large Language Models

[70] DR-Arena: an Automated Evaluation Framework for Deep Research Agents

[71] AEQ-Bench: Measuring Empathy of Omni-Modal Large Models

[72] PERM: Psychology-grounded Empathetic Reward Modeling for Large Language Models

[73] Representation-Aware Unlearning via Activation Signatures: From Suppression to Knowledge-Signature Erasure

[74] Form and Meaning in Intrinsic Multilingual Evaluations

[75] Influential Training Data Retrieval for Explaining Verbalized Confidence of LLMs

[76] Detecting Winning Arguments with Large Language Models and Persuasion Strategies

[77] LIBERTy: A Causal Framework for Benchmarking Concept-Based Explanations of LLMs with Structural Counterfactuals