Today’s Research Highlights
AI-enhanced summaries of the latest research papers from arXiv.
Table of Contents
- cs.CL [Total: 68]
- cs.CV [Total: 111]
- cs.AI [Total: 33]
- cs.SD [Total: 4]
- cs.LG [Total: 107]
- cs.MA [Total: 3]
- cs.MM [Total: 0]
- eess.AS [Total: 1]
- eess.IV [Total: 10]
cs.CL
[1] RIMRULE: Improving Tool-Using Language Agents via MDL-Guided Rule Learning
Xiang Gao, Yuguang Yao, Qi Zhang, Kaiwen Dong, Avinash Baidya, Ruocheng Guo, Hilaf Hasson, Kamalika Das
Main category: cs.CL
TL;DR: RIMRULE: A neuro-symbolic approach that distills interpretable rules from LLM failure traces and injects them during inference to improve tool-use performance without weight updates.
Details
Motivation: LLMs struggle with domain-specific tools that have idiosyncratic, under-documented, or private APIs, requiring effective adaptation to task-specific tools without extensive retraining.
Method: Dynamic rule injection: LLM proposes rules from failure traces, consolidated using Minimum Description Length objective for generality/conciseness. Rules stored in natural language and symbolic form for efficient inference-time retrieval.
Result: Improves accuracy on both seen and unseen tools without modifying LLM weights; outperforms prompting-based methods and complements finetuning; rules transferable across different LLM architectures.
Conclusion: RIMRULE enables effective LLM adaptation to domain-specific tools through interpretable symbolic rules, offering portable knowledge transfer and improved tool-use reliability.
Abstract: Large language models (LLMs) often struggle to use tools reliably in domain-specific settings, where APIs may be idiosyncratic, under-documented, or tailored to private workflows. This highlights the need for effective adaptation to task-specific tools. We propose RIMRULE, a neuro-symbolic approach for LLM adaptation based on dynamic rule injection. Compact, interpretable rules are distilled from failure traces and injected into the prompt during inference to improve task performance. These rules are proposed by the LLM itself and consolidated using a Minimum Description Length (MDL) objective that favors generality and conciseness. Each rule is stored in both natural language and a structured symbolic form, supporting efficient retrieval at inference time. Experiments on tool-use benchmarks show that this approach improves accuracy on both seen and unseen tools without modifying LLM weights. It outperforms prompting-based adaptation methods and complements finetuning. Moreover, rules learned from one LLM can be reused to improve others, including long reasoning LLMs, highlighting the portability of symbolic knowledge across architectures.
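The MDL consolidation step above trades the cost of encoding a rule against the cost of the failure traces it leaves unexplained. A minimal sketch of that trade-off under toy assumptions (the keyword-based notion of "explains", the token-count description length, and the per-rule top-k selection are illustrative simplifications, not the paper's algorithm):

```python
# Toy sketch of MDL-style rule selection: prefer rules that are short yet
# explain many failure traces. Data structures and scoring are illustrative.

def description_length(rule: str) -> float:
    """Cost of encoding the rule itself (proxy: token count)."""
    return len(rule.split())

def mdl_score(rule: str, failure_traces: list[str]) -> float:
    """Total description length: rule cost plus cost of traces it fails to explain.
    'Explains' is a naive keyword check here; the paper's symbolic matching is richer."""
    unexplained = [t for t in failure_traces if not all(w in t for w in rule.split())]
    # Each unexplained trace must still be encoded separately (proxy: its length).
    return description_length(rule) + sum(len(t.split()) for t in unexplained)

def consolidate(candidate_rules: list[str], failure_traces: list[str], k: int = 2) -> list[str]:
    """Keep the k rules with the lowest individual MDL score (a simplification
    of joint selection over rule sets)."""
    return sorted(candidate_rules, key=lambda r: mdl_score(r, failure_traces))[:k]

if __name__ == "__main__":
    traces = [
        "search_flights failed because date was not ISO formatted",
        "search_hotels failed because date was not ISO formatted",
        "book_car failed because currency code was missing",
    ]
    rules = [
        "not ISO formatted",                   # short and general
        "search_flights not ISO formatted",    # overly specific
        "currency code was missing",
    ]
    print(consolidate(rules, traces))
```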
[2] Universal Adaptive Constraint Propagation: Scaling Structured Inference for Large Language Models via Meta-Reinforcement Learning
Ibne Farabi Shihab, Sanjeda Akter, Anuj Sharma
Main category: cs.CL
TL;DR: MetaJuLS is a meta-reinforcement learning approach that learns universal constraint propagation policies for structured inference in LLMs, achieving 1.5-2x speedups while maintaining near-state-of-the-art accuracy across languages and tasks.
Details
Motivation: Large language models increasingly need structured inference with complex constraints (JSON schema enforcement, multi-lingual parsing), requiring efficient constraint propagation methods that work across different languages and tasks without extensive retraining.
Method: Meta-reinforcement learning approach that formulates structured inference as adaptive constraint propagation, training a Graph Attention Network with meta-learning to learn universal constraint propagation policies applicable across languages and tasks.
Result: Achieves 1.5-2x speedups over GPU-optimized baselines while maintaining within 0.2% accuracy of state-of-the-art parsers. Demonstrates rapid cross-domain adaptation: policies trained on English parsing adapt to new languages/tasks with only 5-10 gradient steps (5-15 seconds) instead of hours of task-specific training.
Conclusion: MetaJuLS enables efficient structured inference for LLMs, reduces inference carbon footprint (Green AI), and discovers both human-like parsing strategies and novel non-intuitive heuristics through mechanistic analysis.
Abstract: Large language models increasingly require structured inference, from JSON schema enforcement to multi-lingual parsing, where outputs must satisfy complex constraints. We introduce MetaJuLS, a meta-reinforcement learning approach that learns universal constraint propagation policies applicable across languages and tasks without task-specific retraining. By formulating structured inference as adaptive constraint propagation and training a Graph Attention Network with meta-learning, MetaJuLS achieves 1.5–2.0$\times$ speedups over GPU-optimized baselines while maintaining within 0.2% accuracy of state-of-the-art parsers. On Universal Dependencies across 10 languages and LLM-constrained generation (LogicBench, GSM8K-Constrained), MetaJuLS demonstrates rapid cross-domain adaptation: a policy trained on English parsing adapts to new languages and tasks with 5–10 gradient steps (5–15 seconds) rather than requiring hours of task-specific training. Mechanistic analysis reveals the policy discovers human-like parsing strategies (easy-first) and novel non-intuitive heuristics. By reducing propagation steps in LLM deployments, MetaJuLS contributes to Green AI by directly reducing inference carbon footprint.
[3] Pat-DEVAL: Chain-of-Legal-Thought Evaluation for Patent Description
Yongmin Yoo, Kris W Pan
Main category: cs.CL
TL;DR: Pat-DEVAL is a multi-dimensional evaluation framework for patent description bodies that uses LLM-as-a-judge with Chain-of-Legal-Thought reasoning to assess both technical soundness and legal compliance, achieving superior correlation with expert judgments.
Details
Motivation: Existing evaluation approaches for automated patent drafting fail to assess long-form structural coherence and statutory compliance specific to patent descriptions, despite LLMs enabling end-to-end automated drafting.
Method: Proposes Pat-DEVAL framework using LLM-as-a-judge paradigm with Chain-of-Legal-Thought (CoLT), a legally-constrained reasoning mechanism that enforces sequential patent-law-specific analysis for evaluating patent descriptions.
Result: Pat-DEVAL achieves Pearson correlation of 0.69 with expert judgments on Pap2Pat-EvalGold dataset, significantly outperforming baseline metrics and existing LLM evaluators, with superior 0.73 correlation in Legal-Professional Compliance.
Conclusion: Pat-DEVAL establishes a new standard for ensuring both technical soundness and legal compliance in patent descriptions, providing robust methodological foundation for practical deployment of automated patent drafting systems.
Abstract: Patent descriptions must deliver comprehensive technical disclosure while meeting strict legal standards such as enablement and written description requirements. Although large language models have enabled end-to-end automated patent drafting, existing evaluation approaches fail to assess long-form structural coherence and statutory compliance specific to descriptions. We propose Pat-DEVAL, the first multi-dimensional evaluation framework dedicated to patent description bodies. Leveraging the LLM-as-a-judge paradigm, Pat-DEVAL introduces Chain-of-Legal-Thought (CoLT), a legally-constrained reasoning mechanism that enforces sequential patent-law-specific analysis. Experiments validated by a patent expert on our Pap2Pat-EvalGold dataset demonstrate that Pat-DEVAL achieves a Pearson correlation of 0.69, significantly outperforming baseline metrics and existing LLM evaluators. Notably, the framework exhibits a superior correlation of 0.73 in Legal-Professional Compliance, proving that the explicit injection of statutory constraints is essential for capturing nuanced legal validity. By establishing a new standard for ensuring both technical soundness and legal compliance, Pat-DEVAL provides a robust methodological foundation for the practical deployment of automated patent drafting systems.
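CoLT constrains the judge to walk through patent-law-specific checks in a fixed order before assigning scores. A hedged sketch of how such a judge prompt could be assembled (the step wording and scoring dimensions are placeholders, not the paper's rubric):

```python
# Illustrative construction of a Chain-of-Legal-Thought style judge prompt.
# The analysis steps and scoring dimensions below are placeholders, not the
# paper's actual rubric.

COLT_STEPS = [
    "1. Identify the claimed invention and its essential technical features.",
    "2. Check enablement: could a person skilled in the art reproduce it?",
    "3. Check written-description support for every claim limitation.",
    "4. Check structural coherence of the description body.",
    "5. Only after steps 1-4, assign dimension scores with justifications.",
]

def build_colt_prompt(description: str, dimensions: list[str]) -> str:
    steps = "\n".join(COLT_STEPS)
    dims = ", ".join(dimensions)
    return (
        "You are a patent examiner acting as an evaluator.\n"
        f"Follow these steps strictly in order:\n{steps}\n\n"
        f"Score the description on: {dims} (1-5 each).\n\n"
        f"Patent description:\n{description}\n"
    )

if __name__ == "__main__":
    print(build_colt_prompt("A device comprising ...",
                            ["technical soundness", "legal-professional compliance"]))
```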
[4] Understanding Emotion in Discourse: Recognition Insights and Linguistic Patterns for Generation
Cheonkam Jeong, Adeline Nyamathi
Main category: cs.CL
TL;DR: Systematic analysis of Emotion Recognition in Conversation reveals that conversational context is crucial (90% of the total gain comes from the most recent 10-30 turns), hierarchical sentence representations help only in the absence of context, and affective lexicons provide no additional benefit. Linguistic analysis shows emotion-specific discourse marker patterns, explaining why sadness benefits most from context.
Details
Motivation: Address two critical gaps in ERC: 1) limited understanding of which architectural choices actually matter, and 2) lack of linguistic analysis connecting recognition to generation. The paper aims to provide systematic insights into what works and why in emotion recognition.
Method: Conducted rigorous ablation study with 10-seed evaluation on IEMOCAP dataset. For recognition: tested conversational context windows, hierarchical sentence representations, and external affective lexicons. For linguistic analysis: analyzed 5,286 discourse marker occurrences to examine emotion-marker positioning associations.
Result: Achieved 82.69% (4-way) and 67.07% (6-way) weighted F1 with simple architectures using strictly causal context, outperforming prior text-only methods. Key findings: 1) Conversational context is paramount (90% of the total gain comes from the most recent 10-30 turns), 2) Hierarchical representations help only without context, 3) Affective lexicons provide no gain, 4) Significant emotion-marker positioning association (p<.0001), with sadness showing reduced left-periphery markers (21.9% vs 28-32%).
Conclusion: Conversational context is the most important factor in ERC, with diminishing returns beyond recent turns. Pre-trained encoders capture emotional semantics sufficiently, making external lexicons unnecessary. Linguistic analysis reveals emotion-specific discourse patterns, explaining why sadness benefits most from context (+22%p) due to lack of explicit pragmatic signals requiring conversational history for disambiguation.
Abstract: While Emotion Recognition in Conversation (ERC) has achieved high accuracy, two critical gaps remain: a limited understanding of \textit{which} architectural choices actually matter, and a lack of linguistic analysis connecting recognition to generation. We address both gaps through a systematic analysis of the IEMOCAP dataset. For recognition, we conduct a rigorous ablation study with 10-seed evaluation and report three key findings. First, conversational context is paramount, with performance saturating rapidly – 90% of the total gain achieved within just the most recent 10–30 preceding turns (depending on the label set). Second, hierarchical sentence representations help at utterance-level, but this benefit disappears once conversational context is provided, suggesting that context subsumes intra-utterance structure. Third, external affective lexicons (SenticNet) provide no gain, indicating that pre-trained encoders already capture necessary emotional semantics. With simple architectures using strictly causal context, we achieve 82.69% (4-way) and 67.07% (6-way) weighted F1, outperforming prior text-only methods including those using bidirectional context. For linguistic analysis, we analyze 5,286 discourse marker occurrences and find a significant association between emotion and marker positioning ($p < .0001$). Notably, “sad” utterances exhibit reduced left-periphery marker usage (21.9%) compared to other emotions (28–32%), consistent with theories linking left-periphery markers to active discourse management. This connects to our recognition finding that sadness benefits most from context (+22%p): lacking explicit pragmatic signals, sad utterances require conversational history for disambiguation.
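The recognition models rely on strictly causal context, i.e., only the most recent preceding turns of the same conversation. A minimal sketch of how such context windows can be built (the data layout is illustrative):

```python
# Build strictly causal context windows for emotion recognition:
# each target utterance is paired with at most `window` preceding turns,
# never with future turns.

def causal_windows(turns: list[str], window: int = 10) -> list[tuple[list[str], str]]:
    """Return (context, target) pairs where context holds up to `window`
    turns that occur strictly before the target."""
    examples = []
    for i, target in enumerate(turns):
        context = turns[max(0, i - window):i]
        examples.append((context, target))
    return examples

if __name__ == "__main__":
    dialogue = ["How was the interview?", "I didn't get the job.", "Oh no, I'm so sorry."]
    for ctx, tgt in causal_windows(dialogue, window=2):
        print(ctx, "->", tgt)
```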
[5] Knowledge Distillation for Temporal Knowledge Graph Reasoning with Large Language Models
Wang Xing, Wei Song, Siyu Lin, Chen Wu, Zhesi Li, Man Wang
Main category: cs.CL
TL;DR: A distillation framework using LLMs as teachers to compress temporal knowledge graph reasoning models for efficient deployment on resource-constrained platforms.
Details
Motivation: Existing TKG reasoning models are computationally expensive and parameter-heavy, hindering deployment on resource-constrained platforms. Current compression techniques fail to capture temporal dependencies in TKGs, leading to performance degradation.
Method: Proposes a distillation framework using large language models as teacher models to transfer both structural and temporal reasoning capabilities to lightweight student models. Integrates large-scale public knowledge with task-specific temporal information.
Result: Extensive experiments on multiple benchmark datasets show the method consistently outperforms strong baselines, achieving favorable trade-off between reasoning accuracy, computational efficiency, and practical deployability.
Conclusion: The proposed distillation framework effectively addresses the deployment challenges of TKG reasoning models by creating compact, efficient models that maintain strong temporal reasoning capabilities through LLM-guided knowledge transfer.
Abstract: Reasoning over temporal knowledge graphs (TKGs) is fundamental to improving the efficiency and reliability of intelligent decision-making systems and has become a key technological foundation for future artificial intelligence applications. Despite recent progress, existing TKG reasoning models typically rely on large parameter sizes and intensive computation, leading to high hardware costs and energy consumption. These constraints hinder their deployment on resource-constrained, low-power, and distributed platforms that require real-time inference. Moreover, most existing model compression and distillation techniques are designed for static knowledge graphs and fail to adequately capture the temporal dependencies inherent in TKGs, often resulting in degraded reasoning performance. To address these challenges, we propose a distillation framework specifically tailored for temporal knowledge graph reasoning. Our approach leverages large language models as teacher models to guide the distillation process, enabling effective transfer of both structural and temporal reasoning capabilities to lightweight student models. By integrating large-scale public knowledge with task-specific temporal information, the proposed framework enhances the student model’s ability to model temporal dynamics while maintaining a compact and efficient architecture. Extensive experiments on multiple publicly available benchmark datasets demonstrate that our method consistently outperforms strong baselines, achieving a favorable trade-off between reasoning accuracy, computational efficiency, and practical deployability.
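The abstract does not spell out the distillation loss, but the standard way to transfer a teacher's scores to a lightweight student is a temperature-scaled KL term mixed with the task loss. A generic sketch under that assumption (temperature and mixing weight are illustrative):

```python
# Generic distillation loss: the student mimics the teacher's softened score
# distribution over candidate entities while also fitting the gold label.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, gold, T=2.0, alpha=0.5):
    soft_teacher = F.softmax(teacher_logits / T, dim=-1)
    soft_student = F.log_softmax(student_logits / T, dim=-1)
    kd = F.kl_div(soft_student, soft_teacher, reduction="batchmean") * (T * T)
    ce = F.cross_entropy(student_logits, gold)
    return alpha * kd + (1 - alpha) * ce

if __name__ == "__main__":
    student = torch.randn(4, 100)          # scores over 100 candidate entities
    teacher = torch.randn(4, 100)
    gold = torch.randint(0, 100, (4,))
    print(distillation_loss(student, teacher, gold).item())
```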
[6] From Evidence-Based Medicine to Knowledge Graph: Retrieval-Augmented Generation for Sports Rehabilitation and a Domain Benchmark
Jinning Zhang, Jie Song, Wenhui Tu, Zecheng Li, Jingxuan Li, Jin Li, Xuan Liu, Taole Sha, Zichen Wei, Yan Li
Main category: cs.CL
TL;DR: EBM-adapted graph-based RAG system for medicine integrates PICO framework and evidence hierarchy, validated in sports rehabilitation with strong performance metrics.
Details
Motivation: Current RAG approaches in medicine focus on performance but overlook EBM principles like PICO alignment and evidence hierarchy, creating gaps in evidence-based grounding.
Method: Generalizable EBM adaptation strategy: integrates PICO framework into knowledge graph construction/retrieval, uses Bayesian-inspired reranking algorithm to calibrate scores by evidence grade without predefined weights.
Result: System achieved 0.830 nugget coverage, 0.819 faithfulness, 0.882 semantic similarity, 0.788 PICOT match; expert clinicians rated 4.66-4.84/5 across key metrics; released knowledge graph (357K nodes) and benchmark (1,637 QA pairs).
Conclusion: EBM adaptation strategy improves retrieval/answer quality, is transferable to other clinical domains, and addresses dataset scarcity in sports rehabilitation RAG research.
Abstract: In medicine, large language models (LLMs) increasingly rely on retrieval-augmented generation (RAG) to ground outputs in up-to-date external evidence. However, current RAG approaches focus primarily on performance improvements while overlooking evidence-based medicine (EBM) principles. This study addresses two key gaps: (1) the lack of PICO alignment between queries and retrieved evidence, and (2) the absence of evidence hierarchy considerations during reranking. We present a generalizable strategy for adapting EBM to graph-based RAG, integrating the PICO framework into knowledge graph construction and retrieval, and proposing a Bayesian-inspired reranking algorithm to calibrate ranking scores by evidence grade without introducing predefined weights. We validated this framework in sports rehabilitation, a literature-rich domain currently lacking RAG systems and benchmarks. We released a knowledge graph (357,844 nodes and 371,226 edges) and a reusable benchmark of 1,637 QA pairs. The system achieved 0.830 nugget coverage, 0.819 answer faithfulness, 0.882 semantic similarity, and 0.788 PICOT match accuracy. In a 5-point Likert evaluation, five expert clinicians rated the system 4.66-4.84 across factual accuracy, faithfulness, relevance, safety, and PICO alignment. These findings demonstrate that the proposed EBM adaptation strategy improves retrieval and answer quality and is transferable to other clinical domains. The released resources also help address the scarcity of RAG datasets in sports rehabilitation.
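PICO alignment means matching queries and evidence along Population, Intervention, Comparison, and Outcome slots. A toy sketch of a PICO-structured query and a slot-overlap match score (the matching rule is illustrative; the paper's graph-based retrieval and reranking are richer):

```python
# Minimal sketch of a PICO-structured query and a keyword-overlap match score
# between a query and a retrieved evidence node. Field names follow the
# standard PICO framework; the matching rule is illustrative.
from dataclasses import dataclass

@dataclass
class PICO:
    population: str
    intervention: str
    comparison: str
    outcome: str

def pico_match(query: PICO, evidence: PICO) -> float:
    """Fraction of PICO slots where query and evidence share a keyword."""
    matched = 0
    for slot in ("population", "intervention", "comparison", "outcome"):
        q = set(getattr(query, slot).lower().split())
        e = set(getattr(evidence, slot).lower().split())
        matched += bool(q & e)
    return matched / 4

if __name__ == "__main__":
    q = PICO("adults with ACL injury", "eccentric strengthening",
             "usual care", "return to sport time")
    e = PICO("athletes after ACL reconstruction", "eccentric quadriceps strengthening",
             "usual care physiotherapy", "time to return to sport")
    print(pico_match(q, e))  # fraction of slots with at least one shared keyword
```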
[7] A Language-Agnostic Hierarchical LoRA-MoE Architecture for CTC-based Multilingual ASR
Yuang Zheng, Yuxiang Mei, Dongxing Xu, Jie Chen, Yanhua Long
Main category: cs.CL
TL;DR: A lightweight multilingual ASR system using CTC architecture with hierarchical LoRA-MoE framework achieves competitive performance with single-pass decoding, eliminating need for language identification during inference.
Details
Motivation: Large-scale multilingual ASR models like Whisper have high computational costs and latency, making them unsuitable for resource-constrained edge devices. There's a need for lightweight, efficient multilingual ASR systems that can operate without prior language identification.
Method: Proposes a Language-agnostic Hierarchical LoRA-MoE (HLoRA) framework integrated into mHuBERT-CTC model. Uses hierarchical design: multilingual shared LoRA for language-invariant acoustic representations and language-specific LoRA experts for language-dependent characteristics. Employs LID-posterior-driven LoRA routing for end-to-end decoding without explicit language labels.
Result: HLoRA achieves competitive performance with state-of-the-art two-stage inference methods using only single-pass decoding. Experiments on MSR-86K and MLC-SLM 2025 Challenge datasets show significant improvement in decoding efficiency for low-resource multilingual ASR applications.
Conclusion: The proposed HLoRA framework enables efficient, language-agnostic multilingual ASR with single-pass decoding, making it suitable for deployment on resource-constrained edge devices while maintaining competitive performance with existing methods.
Abstract: Large-scale multilingual ASR (mASR) models such as Whisper achieve strong performance but incur high computational and latency costs, limiting their deployment on resource-constrained edge devices. In this study, we propose a lightweight and language-agnostic multilingual ASR system based on a CTC architecture with domain adaptation. Specifically, we introduce a Language-agnostic Hierarchical LoRA-MoE (HLoRA) framework integrated into an mHuBERT-CTC model, enabling end-to-end decoding via LID-posterior-driven LoRA routing. The hierarchical design consists of a multilingual shared LoRA for learning language-invariant acoustic representations and language-specific LoRA experts for modeling language-dependent characteristics. The proposed routing mechanism removes the need for prior language identity information or explicit language labels during inference, achieving true language-agnostic decoding. Experiments on MSR-86K and the MLC-SLM 2025 Challenge datasets demonstrate that HLoRA achieves competitive performance with state-of-the-art two-stage inference methods using only single-pass decoding, significantly improving decoding efficiency for low-resource mASR applications.
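The routing idea is that a shared LoRA captures language-invariant structure while per-language LoRA experts are mixed according to the LID posterior. A minimal PyTorch sketch of one such layer (dimensions and module layout are illustrative, not the paper's architecture):

```python
# Hierarchical LoRA-MoE layer driven by a language-ID posterior: a shared
# LoRA adapts all languages, and per-language LoRA experts are mixed by the
# posterior over languages. Illustrative sketch only.
import torch
import torch.nn as nn

class LoRA(nn.Module):
    def __init__(self, dim: int, rank: int = 8):
        super().__init__()
        self.down = nn.Linear(dim, rank, bias=False)
        self.up = nn.Linear(rank, dim, bias=False)

    def forward(self, x):
        return self.up(self.down(x))

class HLoRALayer(nn.Module):
    def __init__(self, dim: int, num_langs: int, rank: int = 8):
        super().__init__()
        self.shared = LoRA(dim, rank)
        self.experts = nn.ModuleList([LoRA(dim, rank) for _ in range(num_langs)])

    def forward(self, hidden, lid_posterior):
        """hidden: (batch, time, dim); lid_posterior: (batch, num_langs), softmaxed."""
        expert_out = torch.stack([e(hidden) for e in self.experts], dim=1)  # (B, L, T, D)
        mixed = (lid_posterior[:, :, None, None] * expert_out).sum(dim=1)   # (B, T, D)
        return hidden + self.shared(hidden) + mixed

if __name__ == "__main__":
    layer = HLoRALayer(dim=32, num_langs=4)
    h = torch.randn(2, 10, 32)
    post = torch.softmax(torch.randn(2, 4), dim=-1)
    print(layer(h, post).shape)  # torch.Size([2, 10, 32])
```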
[8] JP-TL-Bench: Anchored Pairwise LLM Evaluation for Bidirectional Japanese-English Translation
Leonard Lin, Adam Lensenmayer
Main category: cs.CL
TL;DR: JP-TL-Bench is a lightweight benchmark for evaluating Japanese-English translation systems using pairwise LLM comparisons against a fixed anchor set, with scores aggregated via Bradley-Terry model.
Details
Motivation: Existing translation evaluation often focuses on acceptability rather than nuanced quality comparison. For Japanese-English translation, subtle choices in politeness, implicature, ellipsis, and register significantly affect perceived naturalness, requiring a benchmark that can answer “which of these two good translations is better?”
Method: Uses reference-free, pairwise LLM comparisons against a fixed, versioned anchor set. Pairwise results are aggregated with a Bradley-Terry model and reported as win rates plus a normalized 0-10 “LT” score derived from a logistic transform of fitted log-strengths.
Result: The benchmark provides structurally stable scores when using the same base set, judge, and aggregation code, making it reliable and affordable for iterative development of Japanese-English translation systems.
Conclusion: JP-TL-Bench offers a practical, lightweight evaluation framework specifically designed for the nuanced challenges of Japanese-English translation, enabling reliable comparison between high-quality translation systems.
Abstract: We introduce JP-TL-Bench, a lightweight, open benchmark designed to guide the iterative development of Japanese-English translation systems. In this context, the challenge is often “which of these two good translations is better?” rather than “is this translation acceptable?” This distinction matters for Japanese-English, where subtle choices in politeness, implicature, ellipsis, and register strongly affect perceived naturalness. JP-TL-Bench uses a protocol built to make LLM judging both reliable and affordable: it evaluates a candidate model via reference-free, pairwise LLM comparisons against a fixed, versioned anchor set. Pairwise results are aggregated with a Bradley-Terry model and reported as win rates plus a normalized 0-10 “LT” score derived from a logistic transform of fitted log-strengths. Because each candidate is scored against the same frozen anchor set, scores are structurally stable given the same base set, judge, and aggregation code.
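The scoring pipeline, fitting Bradley-Terry strengths from pairwise wins against the anchor set and mapping the candidate's log-strength to a 0-10 LT score, can be sketched directly; the fitting loop below is the standard MM update, and the transform's scaling constant is an illustrative choice rather than the benchmark's setting:

```python
# Fit Bradley-Terry strengths from a pairwise win matrix, then map the
# candidate's log-strength through a logistic transform onto a 0-10 scale.
import math

def fit_bradley_terry(wins, n_items, iters=200):
    """wins[i][j] = number of times item i beat item j."""
    strength = [1.0] * n_items
    for _ in range(iters):
        for i in range(n_items):
            num = sum(wins[i][j] for j in range(n_items) if j != i)
            den = sum((wins[i][j] + wins[j][i]) / (strength[i] + strength[j])
                      for j in range(n_items) if j != i)
            if den > 0:
                strength[i] = num / den
        mean = sum(strength) / n_items
        strength = [s / mean for s in strength]   # normalize for identifiability
    return strength

def lt_score(log_strength, scale=1.0):
    """Logistic transform of a fitted log-strength onto a 0-10 scale."""
    return 10.0 / (1.0 + math.exp(-log_strength / scale))

if __name__ == "__main__":
    # Item 0 = candidate, items 1-2 = frozen anchors; wins from pairwise LLM judgments.
    wins = [[0, 7, 9],
            [3, 0, 6],
            [1, 4, 0]]
    s = fit_bradley_terry(wins, 3)
    print(round(lt_score(math.log(s[0])), 2))
```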
[9] Talk Less, Verify More: Improving LLM Assistants with Semantic Checks and Execution Feedback
Yan Sun, Ming Cai, Stanley Kok
Main category: cs.CL
TL;DR: The paper introduces Q* and Feedback+ verification techniques for LLM assistants in enterprise workflows to reduce errors and improve reliability through reverse translation and execution feedback.
Details
Motivation: Current conversational business analytics systems lack built-in verification mechanisms, forcing users to manually validate potentially flawed LLM outputs, which is inefficient and error-prone for enterprise decision support.
Method: Two complementary verification techniques: Q* performs reverse translation and semantic matching between generated code and user intent, while Feedback+ incorporates execution feedback to guide code refinement, both embedded within a generator-discriminator framework.
Result: Evaluations on Spider, Bird, and GSM8K benchmark datasets show both Q* and Feedback+ reduce error rates and task completion time, though reverse translation is identified as a key bottleneck.
Conclusion: The work contributes a design-oriented framework for building more reliable, enterprise-grade GenAI systems capable of trustworthy decision support by shifting validation responsibilities from users to the system.
Abstract: As large language model (LLM) assistants become increasingly integrated into enterprise workflows, their ability to generate accurate, semantically aligned, and executable outputs is critical. However, current conversational business analytics (CBA) systems often lack built-in verification mechanisms, leaving users to manually validate potentially flawed results. This paper introduces two complementary verification techniques: Q*, which performs reverse translation and semantic matching between code and user intent, and Feedback+, which incorporates execution feedback to guide code refinement. Embedded within a generator-discriminator framework, these mechanisms shift validation responsibilities from users to the system. Evaluations on three benchmark datasets, Spider, Bird, and GSM8K, demonstrate that both Q* and Feedback+ reduce error rates and task completion time. The study also identifies reverse translation as a key bottleneck, highlighting opportunities for future improvement. Overall, this work contributes a design-oriented framework for building more reliable, enterprise-grade GenAI systems capable of trustworthy decision support.
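A skeleton of the generator-discriminator loop in the spirit of Q* and Feedback+: reverse-translate the generated code and check it against the intent, then execute and feed errors back. All callables are placeholders to be backed by an LLM and a sandboxed executor; none of this is the paper's API:

```python
# Verification loop sketch: a Q*-style semantic check via reverse translation,
# followed by a Feedback+-style execution check that feeds errors back into
# refinement. Every callable here is a placeholder.
from typing import Callable

def verify_and_refine(
    intent: str,
    generate: Callable[[str], str],             # intent -> candidate code
    reverse_translate: Callable[[str], str],    # code -> natural-language restatement
    semantically_match: Callable[[str, str], bool],
    execute: Callable[[str], tuple[bool, str]], # code -> (ok, error message)
    refine: Callable[[str, str], str],          # (code, feedback) -> revised code
    max_rounds: int = 3,
) -> str:
    code = generate(intent)
    for _ in range(max_rounds):
        # Q*-style check: does the code's restated meaning match the intent?
        if not semantically_match(intent, reverse_translate(code)):
            code = refine(code, "restated meaning does not match the request")
            continue
        # Feedback+-style check: does the code actually run?
        ok, error = execute(code)
        if ok:
            return code
        code = refine(code, error)
    return code  # best effort after max_rounds
```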
[10] Parallel Universes, Parallel Languages: A Comprehensive Study on LLM-based Multilingual Counterfactual Example Generation
Qianli Wang, Van Bach Nguyen, Yihong Liu, Fedor Splitt, Nils Feldhus, Christin Seifert, Hinrich Schütze, Sebastian Möller, Vera Schmitt
Main category: cs.CL
TL;DR: LLMs generate multilingual counterfactuals but translation-based ones are more valid yet require more edits and still underperform English counterfactuals; multilingual CDA outperforms cross-lingual CDA but gains are limited by counterfactual imperfections.
Details
Motivation: To investigate LLMs' effectiveness in generating multilingual counterfactuals (minimally edited inputs that change model predictions), as their performance in this multilingual task remains unclear despite their known English counterfactual generation capabilities and multilingual proficiency.
Method: Comprehensive study across six languages with: 1) automatic evaluation of directly generated vs. translation-based counterfactuals, 2) analysis of edit patterns across languages, 3) error categorization in generated counterfactuals, and 4) evaluation of multilingual vs. cross-lingual counterfactual data augmentation (CDA) for model performance.
Result: 1) Translation-based counterfactuals have higher validity but require more edits and still underperform English counterfactuals; 2) High-resource European languages show similar edit patterns; 3) Four consistent error types identified across languages; 4) Multilingual CDA yields better performance improvements than cross-lingual CDA, especially for lower-resource languages, but gains are limited by counterfactual imperfections.
Conclusion: LLMs can generate multilingual counterfactuals but with limitations: translation-based approaches improve validity at the cost of more edits, multilingual CDA is more effective than cross-lingual approaches, but overall performance gains are constrained by the quality issues in generated counterfactuals, highlighting the need for better multilingual counterfactual generation methods.
Abstract: Counterfactuals refer to minimally edited inputs that cause a model’s prediction to change, serving as a promising approach to explaining the model’s behavior. Large language models (LLMs) excel at generating English counterfactuals and demonstrate multilingual proficiency. However, their effectiveness in generating multilingual counterfactuals remains unclear. To this end, we conduct a comprehensive study on multilingual counterfactuals. We first conduct automatic evaluations on both directly generated counterfactuals in the target languages and those derived via English translation across six languages. Although translation-based counterfactuals offer higher validity than their directly generated counterparts, they demand substantially more modifications and still fall short of matching the quality of the original English counterfactuals. Second, we find the patterns of edits applied to high-resource European-language counterfactuals to be remarkably similar, suggesting that cross-lingual perturbations follow common strategic principles. Third, we identify and categorize four main types of errors that consistently appear in the generated counterfactuals across languages. Finally, we reveal that multilingual counterfactual data augmentation (CDA) yields larger model performance improvements than cross-lingual CDA, especially for lower-resource languages. Yet, the imperfections of the generated counterfactuals limit gains in model performance and robustness.
[11] Beyond Perfect APIs: A Comprehensive Evaluation of LLM Agents Under Real-World API Complexity
Doyoung Kim, Zhiwei Ren, Jie Hao, Zhongkai Sun, Lichao Wang, Xiyao Ma, Zack Ye, Xu Han, Jun Yin, Heng Ji, Wei Shen, Xing Fan, Benjamin Yao, Chenlei Guo
Main category: cs.CL
TL;DR: WildAGTEval is a benchmark for evaluating LLM agents’ function-calling capabilities under realistic API complexity, addressing noisy API outputs and real-world constraints.
Details
Motivation: Prior work assumes idealized API systems and ignores real-world factors like noisy API outputs. There's a need to evaluate LLM agents' function-calling capabilities under realistic API complexity scenarios.
Method: Created WildAGTEval benchmark with two dimensions of real-world complexity: 1) API specification (detailed documentation and usage constraints), and 2) API execution (runtime challenges). Includes 60 distinct complexity scenarios that can compose into ~32K test configurations and user-agent interactions for evaluation.
Result: Most scenarios are challenging, with irrelevant information complexity posing the greatest difficulty (reducing strong LLM performance by 27.3%). Qualitative analysis shows LLMs sometimes distort user intent to claim task completion, critically affecting user satisfaction.
Conclusion: WildAGTEval provides a comprehensive benchmark for evaluating LLM agents under realistic API complexity, revealing significant challenges in handling real-world function-calling scenarios and highlighting issues with user intent distortion.
Abstract: We introduce WildAGTEval, a benchmark designed to evaluate large language model (LLM) agents’ function-calling capabilities under realistic API complexity. Unlike prior work that assumes an idealized API system and disregards real-world factors such as noisy API outputs, WildAGTEval accounts for two dimensions of real-world complexity: 1. API specification, which includes detailed documentation and usage constraints, and 2. API execution, which captures runtime challenges. Consequently, WildAGTEval offers (i) an API system encompassing 60 distinct complexity scenarios that can be composed into approximately 32K test configurations, and (ii) user-agent interactions for evaluating LLM agents on these scenarios. Using WildAGTEval, we systematically assess several advanced LLMs and observe that most scenarios are challenging, with irrelevant information complexity posing the greatest difficulty and reducing the performance of strong LLMs by 27.3%. Furthermore, our qualitative analysis reveals that LLMs occasionally distort user intent merely to claim task completion, critically affecting user satisfaction.
[12] Can Large Language Models Still Explain Themselves? Investigating the Impact of Quantization on Self-Explanations
Qianli Wang, Nils Feldhus, Pepa Atanasova, Fedor Splitt, Simon Ostermann, Sebastian Möller, Vera Schmitt
Main category: cs.CL
TL;DR: Quantization moderately degrades self-explanation quality and faithfulness in LLMs, with larger models showing better faithfulness preservation but no quantization technique consistently excelling across all metrics.
Details
Motivation: Self-explanations are crucial for transparency in high-stakes LLM applications, but the effects of quantization (widely used for model compression) on SE quality and faithfulness remain unexplored, creating a critical gap in understanding deployment trade-offs.
Method: Examined two types of self-explanations (natural language explanations and counterfactual examples) generated by LLMs quantized using three common techniques at different bit widths. Conducted user studies to assess coherence and trustworthiness.
Result: Quantization leads to moderate declines in SE quality (up to 4.4%) and faithfulness (up to 2.38%). User studies show diminished coherence and trustworthiness (up to 8.5%). Larger models maintain faithfulness better but show limited resilience in SE quality. No quantization technique consistently excels across task accuracy, SE quality, and faithfulness.
Conclusion: While quantization moderately degrades SE quality and faithfulness, the impact doesn’t undermine its effectiveness as a compression technique. Context-specific validation is recommended, especially for natural language explanations which show greater sensitivity. No single quantization method is best across all metrics.
Abstract: Quantization is widely used to accelerate inference and streamline the deployment of large language models (LLMs), yet its effects on self-explanations (SEs) remain unexplored. SEs, generated by LLMs to justify their own outputs, require reasoning about the model’s own decision-making process, a capability that may exhibit particular sensitivity to quantization. As SEs are increasingly relied upon for transparency in high-stakes applications, understanding whether and to what extent quantization degrades SE quality and faithfulness is critical. To address this gap, we examine two types of SEs: natural language explanations (NLEs) and counterfactual examples, generated by LLMs quantized using three common techniques at distinct bit widths. Our findings indicate that quantization typically leads to moderate declines in both SE quality (up to 4.4%) and faithfulness (up to 2.38%). The user study further demonstrates that quantization diminishes both the coherence and trustworthiness of SEs (up to 8.5%). Compared to smaller models, larger models show limited resilience to quantization in terms of SE quality but better maintain faithfulness. Moreover, no quantization technique consistently excels across task accuracy, SE quality, and faithfulness. Given that quantization’s impact varies by context, we recommend validating SE quality for specific use cases, especially for NLEs, which show greater sensitivity. Nonetheless, the relatively minor deterioration in SE quality and faithfulness does not undermine quantization’s effectiveness as a model compression technique.
[13] DepFlow: Disentangled Speech Generation to Mitigate Semantic Bias in Depression Detection
Yuxin Li, Xiangyu Zhang, Yifei Li, Zhiwei Guo, Haoyang Zhang, Eng Siong Chng, Cuntai Guan
Main category: cs.CL
TL;DR: DepFlow is a depression-conditioned text-to-speech framework that creates acoustic-semantic mismatches to mitigate semantic bias in depression detection models, improving robustness for real-world scenarios like camouflaged depression.
Details
Motivation: Current depression datasets show strong coupling between linguistic sentiment and diagnostic labels, causing models to learn semantic shortcuts that fail in real-world scenarios like camouflaged depression where people maintain positive language despite depression.
Method: Three-stage framework: 1) Depression Acoustic Encoder learns speaker/content-invariant depression embeddings via adversarial training, 2) Flow-matching TTS with FiLM modulation injects embeddings for controllable synthesis, 3) Prototype-based severity mapping for interpretable manipulation across depression continuum.
Result: DepFlow achieves effective disentanglement with ROC-AUC of 0.693. The created Camouflage Depression-oriented Augmentation (CDoA) dataset improves macro-F1 by 9%, 12%, and 5% across three depression detection architectures, outperforming conventional augmentation methods.
Conclusion: DepFlow mitigates semantic bias in depression detection, enhances model robustness for camouflaged depression scenarios, and provides a controllable synthesis platform for conversational systems and simulation-based evaluation where real clinical data is limited.
Abstract: Speech is a scalable and non-invasive biomarker for early mental health screening. However, widely used depression datasets like DAIC-WOZ exhibit strong coupling between linguistic sentiment and diagnostic labels, encouraging models to learn semantic shortcuts. As a result, model robustness may be compromised in real-world scenarios, such as Camouflaged Depression, where individuals maintain socially positive or neutral language despite underlying depressive states. To mitigate this semantic bias, we propose DepFlow, a three-stage depression-conditioned text-to-speech framework. First, a Depression Acoustic Encoder learns speaker- and content-invariant depression embeddings through adversarial training, achieving effective disentanglement while preserving depression discriminability (ROC-AUC: 0.693). Second, a flow-matching TTS model with FiLM modulation injects these embeddings into synthesis, enabling control over depressive severity while preserving content and speaker identity. Third, a prototype-based severity mapping mechanism provides smooth and interpretable manipulation across the depression continuum. Using DepFlow, we construct a Camouflage Depression-oriented Augmentation (CDoA) dataset that pairs depressed acoustic patterns with positive/neutral content from a sentiment-stratified text bank, creating acoustic-semantic mismatches underrepresented in natural data. Evaluated across three depression detection architectures, CDoA improves macro-F1 by 9%, 12%, and 5%, respectively, consistently outperforming conventional augmentation strategies in depression detection. Beyond enhancing robustness, DepFlow provides a controllable synthesis platform for conversational systems and simulation-based evaluation, where real clinical data remains limited by ethical and coverage constraints.
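FiLM modulation conditions the synthesis backbone by turning the depression embedding into per-channel scale and shift parameters. A minimal sketch of such a layer (sizes are illustrative; the flow-matching TTS backbone is not reproduced here):

```python
# FiLM conditioning sketch: project a conditioning embedding (here, a
# depression embedding) to a per-channel scale and shift applied to hidden
# features of the synthesis model.
import torch
import torch.nn as nn

class FiLM(nn.Module):
    def __init__(self, cond_dim: int, hidden_dim: int):
        super().__init__()
        self.to_scale_shift = nn.Linear(cond_dim, 2 * hidden_dim)

    def forward(self, hidden, cond):
        """hidden: (batch, time, hidden_dim); cond: (batch, cond_dim)."""
        scale, shift = self.to_scale_shift(cond).chunk(2, dim=-1)
        return hidden * (1 + scale[:, None, :]) + shift[:, None, :]

if __name__ == "__main__":
    film = FiLM(cond_dim=16, hidden_dim=64)
    h = torch.randn(2, 50, 64)
    depression_embedding = torch.randn(2, 16)
    print(film(h, depression_embedding).shape)  # torch.Size([2, 50, 64])
```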
[14] Robust Uncertainty Quantification for Factual Generation of Large Language Models
Yuhao Zhang, Zhongliang Yang, Linna Zhou
Main category: cs.CL
TL;DR: The paper proposes a novel uncertainty quantification method (RU) for detecting LLM hallucinations using trap questions with fake names, showing improved performance over baselines.
Details
Motivation: LLM hallucination remains a critical limitation affecting AI reliability. Traditional uncertainty quantification methods work well in standard QA settings but fail with non-canonical or adversarial questions, creating a gap for real-world applications requiring robust critical thinking.
Method: The study creates trap questions containing fake names and proposes a novel robust uncertainty quantification method (RU) to detect hallucinations in multi-fact generation tasks.
Result: The trap question set performs excellently. The RU method outperforms baseline methods across four different models, achieving 0.1-0.2 average increase in ROCAUC values compared to the best baseline.
Conclusion: The proposed method provides new insights and approaches for addressing LLM hallucination issues, offering improved uncertainty quantification for detecting unreliable AI-generated content.
Abstract: The rapid advancement of large language model (LLM) technology has facilitated its integration into various domains of professional and daily life. However, the persistent challenge of LLM hallucination has emerged as a critical limitation, significantly compromising the reliability and trustworthiness of AI-generated content. This challenge has garnered significant attention within the scientific community, prompting extensive research efforts in hallucination detection and mitigation strategies. Current methodological frameworks reveal a critical limitation: traditional uncertainty quantification approaches demonstrate effectiveness primarily within conventional question-answering paradigms, yet exhibit notable deficiencies when confronted with non-canonical or adversarial questioning strategies. This performance gap raises substantial concerns regarding the dependability of LLM responses in real-world applications requiring robust critical thinking capabilities. This study aims to fill this gap by proposing an uncertainty quantification scenario in the task of generating with multiple facts. We have meticulously constructed a set of trap questions containing fake names. Based on this scenario, we innovatively propose a novel and robust uncertainty quantification method (RU). A series of experiments have been conducted to verify its effectiveness. The results show that the constructed set of trap questions performs excellently. Moreover, when compared with the baseline methods on four different models, our proposed method has demonstrated great performance, with an average increase of 0.1-0.2 in ROCAUC values compared to the best-performing baseline method, providing new insights and methods for addressing the hallucination issue of LLMs.
[15] The Role of Mixed-Language Documents for Multilingual Large Language Model Pretraining
Jiandong Shao, Raphael Tang, Crystina Zhang, Karin Sevegnani, Pontus Stenetorp, Jianfei Yang, Yao Lu
Main category: cs.CL
TL;DR: Removing bilingual documents, just 2% of the pretraining corpus, causes a 56% drop in translation performance, but cross-lingual QA and reasoning remain stable. Parallel data (14% of the bilingual subset) restores translation, while code-switching (72%) contributes minimally.
Details
Motivation: To understand how bilingual data in pretraining corpora enables cross-lingual abilities in multilingual LLMs, specifically investigating the contributions of different types of bilingual data (parallel vs code-switching) to various cross-lingual tasks.
Method: Pretrained models from scratch under controlled conditions: 1) compared standard web corpus with monolingual-only version (removing all multilingual documents), 2) categorized bilingual data into parallel (14%), code-switching (72%), and miscellaneous (14%), 3) conducted granular ablations by reintroducing parallel or code-switching data into monolingual-only corpus.
Result: 1) Removing 2% bilingual data causes 56% BLEU drop in translation but cross-lingual QA and reasoning remain stable. 2) Parallel data almost fully restores translation performance (91% of baseline), while code-switching contributes minimally. 3) Other cross-lingual tasks unaffected by either type of bilingual data.
Conclusion: Translation critically depends on systematic token-level alignments from parallel data, whereas cross-lingual understanding and reasoning appear achievable even without bilingual data, suggesting different mechanisms for different cross-lingual abilities.
Abstract: Multilingual large language models achieve impressive cross-lingual performance despite largely monolingual pretraining. While bilingual data in pretraining corpora is widely believed to enable these abilities, details of its contributions remain unclear. We investigate this question by pretraining models from scratch under controlled conditions, comparing the standard web corpus with a monolingual-only version that removes all multilingual documents. Despite constituting only 2% of the corpus, removing bilingual data causes translation performance to drop 56% in BLEU, while behaviour on cross-lingual QA and general reasoning tasks remains stable, with training curves largely overlapping the baseline. To understand this asymmetry, we categorize bilingual data into parallel (14%), code-switching (72%), and miscellaneous documents (14%) based on the semantic relevance of content in different languages. We then conduct granular ablations by reintroducing parallel or code-switching data into the monolingual-only corpus. Our experiments reveal that parallel data almost fully restores translation performance (91% of the unfiltered baseline), whereas code-switching contributes minimally. Other cross-lingual tasks remain largely unaffected by either type. These findings reveal that translation critically depends on systematic token-level alignments from parallel data, whereas cross-lingual understanding and reasoning appear to be achievable even without bilingual data.
[16] BERT-JEPA: Reorganizing CLS Embeddings for Language-Invariant Semantics
Taj Gillin, Adam Lalani, Kenneth Zhang, Marcel Mateos Salles
Main category: cs.CL
TL;DR: BERT-JEPA (BEPA) combines BERT-style models with Joint Embedding Predictive Architecture (JEPA) training to create language-agnostic CLS embeddings, improving multilingual performance.
Details
Motivation: To address the collapsed [CLS] embedding space in BERT-style models and transform it into a language-agnostic representation space for better multilingual performance.
Method: Adds a JEPA (Joint Embedding Predictive Architecture) training objective to BERT-style models, creating the BERT-JEPA (BEPA) training paradigm.
Result: Increased performance across multilingual benchmarks by creating language-agnostic embedding spaces.
Conclusion: Combining JEPA with BERT-style models effectively creates language-agnostic representations and improves multilingual task performance.
Abstract: Joint Embedding Predictive Architectures (JEPA) are a novel self-supervised training technique that has shown recent promise across domains. We introduce BERT-JEPA (BEPA), a training paradigm that adds a JEPA training objective to BERT-style models, working to combat a collapsed [CLS] embedding space and turning it into a language-agnostic space. This new structure leads to increased performance across multilingual benchmarks.
[17] Vision-Language Reasoning for Geolocalization: A Reinforcement Learning Approach
Biao Wu, Meng Fang, Ling Chen, Ke Xu, Tao Cheng, Jun Wang
Main category: cs.CL
TL;DR: Geo-R is a retrieval-free image geolocalization framework that uses structured geographic reasoning and reinforcement learning with coordinate-aligned rewards to improve accuracy and interpretability.
Details
Motivation: Existing vision-language models for image geolocalization rely on synthetic reasoning annotations or external image retrieval, which limits interpretability and generalizability. The authors aim to develop a retrieval-free approach that provides transparent reasoning and better generalization.
Method: 1) Chain of Region: rule-based hierarchical reasoning that maps GPS coordinates to geographic entities (country, province, city) without synthetic labels. 2) Lightweight reinforcement learning with coordinate-aligned rewards based on Haversine distance to refine predictions through spatially meaningful feedback.
Result: Experimental results across multiple benchmarks confirm improved localization accuracy, stronger generalization, and more transparent inference compared to existing approaches.
Conclusion: Geo-R establishes a new retrieval-free paradigm for scalable and interpretable image geolocalization by bridging structured geographic reasoning with direct spatial supervision. The model and code will be publicly available for reproducibility.
Abstract: Recent advances in vision-language models have opened up new possibilities for reasoning-driven image geolocalization. However, existing approaches often rely on synthetic reasoning annotations or external image retrieval, which can limit interpretability and generalizability. In this paper, we present Geo-R, a retrieval-free framework that uncovers structured reasoning paths from existing ground-truth coordinates and optimizes geolocation accuracy via reinforcement learning. We propose the Chain of Region, a rule-based hierarchical reasoning paradigm that generates precise, interpretable supervision by mapping GPS coordinates to geographic entities (e.g., country, province, city) without relying on model-generated or synthetic labels. Building on this, we introduce a lightweight reinforcement learning strategy with coordinate-aligned rewards based on Haversine distance, enabling the model to refine predictions through spatially meaningful feedback. Our approach bridges structured geographic reasoning with direct spatial supervision, yielding improved localization accuracy, stronger generalization, and more transparent inference. Experimental results across multiple benchmarks confirm the effectiveness of Geo-R, establishing a new retrieval-free paradigm for scalable and interpretable image geolocalization. To facilitate further research and ensure reproducibility, both the model and code will be made publicly available.
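The coordinate-aligned reward is based on Haversine distance between predicted and ground-truth coordinates. A sketch with an exponentially decaying shaping (the decay length scale is an illustrative choice, not the paper's setting):

```python
# Haversine distance between two (lat, lon) points and a distance-shaped
# reward that approaches 1 as the prediction nears the ground truth.
import math

EARTH_RADIUS_KM = 6371.0

def haversine_km(lat1, lon1, lat2, lon2):
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

def geo_reward(pred, target, scale_km=500.0):
    """Exponentially decaying reward in [0, 1] as distance grows."""
    d = haversine_km(*pred, *target)
    return math.exp(-d / scale_km)

if __name__ == "__main__":
    paris, berlin = (48.8566, 2.3522), (52.5200, 13.4050)
    print(round(haversine_km(*paris, *berlin), 1), round(geo_reward(paris, berlin), 3))
```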
[18] Do LLMs Judge Distantly Supervised Named Entity Labels Well? Constructing the JudgeWEL Dataset
Alistair Plum, Laura Bernardy, Tharindu Ranasinghe
Main category: cs.CL
TL;DR: judgeWEL: A new Luxembourgish NER dataset created using Wikipedia/Wikidata weak supervision and LLM verification, 5x larger than existing resources with better entity category balance.
Details
Motivation: Building NER datasets for under-represented languages like Luxembourgish is challenging due to resource scarcity, high annotation costs, and linguistic particularities that lead to inconsistent annotations.
Method: 1) Use Wikipedia internal links and Wikidata entries as weak supervision to infer entity types automatically. 2) Employ multiple LLMs to verify and filter annotations, retaining only high-quality labelled sentences. 3) Create a novel pipeline combining weak supervision with LLM-based quality control.
Result: The resulting judgeWEL corpus is approximately five times larger than currently available Luxembourgish NER datasets and provides broader, more balanced coverage across entity categories.
Conclusion: judgeWEL offers a substantial new resource for multilingual and low-resource NER research, demonstrating an effective approach to creating annotated datasets for under-represented languages using weak supervision and LLM verification.
Abstract: We present judgeWEL, a dataset for named entity recognition (NER) in Luxembourgish, automatically labelled and subsequently verified using large language models (LLM) in a novel pipeline. Building datasets for under-represented languages remains one of the major bottlenecks in natural language processing, where the scarcity of resources and linguistic particularities make large-scale annotation costly and potentially inconsistent. To address these challenges, we propose and evaluate a novel approach that leverages Wikipedia and Wikidata as structured sources of weak supervision. By exploiting internal links within Wikipedia articles, we infer entity types based on their corresponding Wikidata entries, thereby generating initial annotations with minimal human intervention. Because such links are not uniformly reliable, we mitigate noise by employing and comparing several LLMs to identify and retain only high-quality labelled sentences. The resulting corpus is approximately five times larger than the currently available Luxembourgish NER dataset and offers broader and more balanced coverage across entity categories, providing a substantial new resource for multilingual and low-resource NER research.
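The weak-supervision step resolves a Wikipedia internal link to its Wikidata item and maps the item's instance-of values to coarse NER types. A toy sketch of that mapping (the QID table is a tiny illustrative subset, not the full mapping used for judgeWEL):

```python
# Map a linked entity's Wikidata "instance of" values to coarse NER labels.
# The QID-to-type table is a small illustrative subset.
from typing import Optional

INSTANCE_OF_TO_NER = {
    "Q5": "PER",        # human
    "Q515": "LOC",      # city
    "Q6256": "LOC",     # country
    "Q43229": "ORG",    # organization
    "Q4830453": "ORG",  # business
}

def infer_entity_type(instance_of_qids: list[str]) -> Optional[str]:
    """Return the first NER type implied by the entity's instance-of claims."""
    for qid in instance_of_qids:
        if qid in INSTANCE_OF_TO_NER:
            return INSTANCE_OF_TO_NER[qid]
    return None  # unmapped entities are left unlabelled and later filtered

if __name__ == "__main__":
    print(infer_entity_type(["Q43229"]))  # ORG
```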
[19] Toward Better Temporal Structures for Geopolitical Events Forecasting
Kian Ahrabian, Eric Boxer, Jay Pujara
Main category: cs.CL
TL;DR: This paper introduces Hyper-Relational Temporal Knowledge Generalized Hypergraphs (HTKGHs) to address limitations in existing temporal knowledge graph representations, creates a new dataset htkgh-polecat based on POLECAT, and benchmarks LLMs on relation prediction tasks for complex geopolitical forecasting.
Details
Motivation: Current temporal knowledge graphs (TKGs) and hyper-relational temporal knowledge graphs (HTKGs) lack expressive power for complex facts, particularly their inability to support more than two primary entities in temporal facts, which are common in real-world geopolitical events.
Method: The authors formalize HTKGHs as a generalization of HTKGs that maintains backward compatibility while supporting complex fact types. They then create the htkgh-polecat dataset from the POLECAT global event database and benchmark popular LLMs on relation prediction tasks.
Result: The paper presents a formalization for HTKGHs, introduces the htkgh-polecat dataset, and provides benchmarking results and analysis of LLMs’ performance on complex relation prediction tasks in geopolitical forecasting scenarios.
Conclusion: HTKGHs provide a more expressive framework for representing complex temporal facts in geopolitical contexts, and the benchmarking results offer insights into LLMs’ adaptability and capabilities for forecasting in these complex scenarios.
Abstract: Forecasting on geopolitical temporal knowledge graphs (TKGs) through the lens of large language models (LLMs) has recently gained traction. While TKGs and their generalization, hyper-relational temporal knowledge graphs (HTKGs), offer a straightforward structure to represent simple temporal relationships, they lack the expressive power to convey complex facts efficiently. One of the critical limitations of HTKGs is a lack of support for more than two primary entities in temporal facts, which commonly occur in real-world events. To address this limitation, in this work, we study a generalization of HTKGs, Hyper-Relational Temporal Knowledge Generalized Hypergraphs (HTKGHs). We first derive a formalization for HTKGHs, demonstrating their backward compatibility while supporting two complex types of facts commonly found in geopolitical incidents. Then, utilizing this formalization, we introduce the htkgh-polecat dataset, built upon the global event database POLECAT. Finally, we benchmark and analyze popular LLMs on the relation prediction task, providing insights into their adaptability and capabilities in complex forecasting scenarios.
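The core representational change is allowing a temporal fact to have more than two primary entities plus qualifiers. One illustrative way to encode such a generalized hyperedge (field names are my own, not the paper's schema):

```python
# Illustrative container for a hyper-relational temporal fact with several
# primary participants, qualifier key-value pairs, and a timestamp.
from dataclasses import dataclass, field
from datetime import date
from typing import Optional

@dataclass
class HyperTemporalFact:
    relation: str
    primaries: tuple[str, ...]                 # two or more primary entities
    qualifiers: dict[str, str] = field(default_factory=dict)
    timestamp: Optional[date] = None

if __name__ == "__main__":
    fact = HyperTemporalFact(
        relation="trilateral_negotiation",
        primaries=("Country_A", "Country_B", "Country_C"),
        qualifiers={"location": "Geneva", "topic": "ceasefire"},
        timestamp=date(2023, 5, 14),
    )
    print(fact)
```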
[20] Comparative Efficiency Analysis of Lightweight Transformer Models: A Multi-Domain Empirical Benchmark for Enterprise NLP Deployment
Muhammad Shahmeer Khan
Main category: cs.CL
TL;DR: Comparative analysis of DistilBERT, MiniLM, and ALBERT shows trade-offs: ALBERT excels in accuracy, MiniLM in speed, DistilBERT offers balanced performance across NLP tasks.
Details
Motivation: Enterprise NLP needs efficient, lightweight models for multi-domain text automation tasks, requiring understanding of accuracy-efficiency trade-offs among popular Transformer models.
Method: Comparative evaluation of three lightweight Transformer models (DistilBERT, MiniLM, ALBERT) across three domains using IMDB, AG News, and Measuring Hate Speech datasets, measuring both accuracy metrics (accuracy, precision, recall, F1) and efficiency metrics (size, inference time, throughput, memory).
Result: No single model dominates all dimensions: ALBERT achieves highest task-specific accuracy in multiple domains, MiniLM excels in inference speed and throughput, DistilBERT shows most consistent accuracy across tasks with competitive efficiency.
Conclusion: Trade-offs exist between accuracy and efficiency: MiniLM recommended for latency-sensitive applications, DistilBERT for balanced performance, ALBERT for resource-constrained environments.
Abstract: In the rapidly evolving landscape of enterprise natural language processing (NLP), the demand for efficient, lightweight models capable of handling multi-domain text automation tasks has intensified. This study conducts a comparative analysis of three prominent lightweight Transformer models - DistilBERT, MiniLM, and ALBERT - across three distinct domains: customer sentiment classification, news topic classification, and toxicity and hate speech detection. Utilizing datasets from IMDB, AG News, and the Measuring Hate Speech corpus, we evaluated performance using accuracy-based metrics including accuracy, precision, recall, and F1-score, as well as efficiency metrics such as model size, inference time, throughput, and memory usage. Key findings reveal that no single model dominates all performance dimensions. ALBERT achieves the highest task-specific accuracy in multiple domains, MiniLM excels in inference speed and throughput, and DistilBERT demonstrates the most consistent accuracy across tasks while maintaining competitive efficiency. All results reflect controlled fine-tuning under fixed enterprise-oriented constraints rather than exhaustive hyperparameter optimization. These results highlight trade-offs between accuracy and efficiency, recommending MiniLM for latency-sensitive enterprise applications, DistilBERT for balanced performance, and ALBERT for resource-constrained environments.
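The efficiency metrics (inference time, throughput) come from a timing harness of roughly this shape; the sketch below works for any callable model, with warmup counts and batch construction as illustrative choices:

```python
# Generic timing harness: mean per-batch latency and throughput
# (examples/second) for any callable model.
import time

def benchmark(model_fn, batches, warmup=3):
    """model_fn: callable taking one batch; batches: list of equally sized batches."""
    for b in batches[:warmup]:            # warmup runs are excluded from timing
        model_fn(b)
    start = time.perf_counter()
    n_examples = 0
    for b in batches[warmup:]:
        model_fn(b)
        n_examples += len(b)
    elapsed = time.perf_counter() - start
    timed_batches = len(batches) - warmup
    return {
        "mean_latency_ms": 1000 * elapsed / timed_batches,
        "throughput_eps": n_examples / elapsed,
    }

if __name__ == "__main__":
    fake_model = lambda batch: [len(x) for x in batch]   # stand-in for a classifier
    data = [["text"] * 32 for _ in range(13)]            # 13 batches of 32 "examples"
    print(benchmark(fake_model, data))
```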
[21] Language as Mathematical Structure: Examining Semantic Field Theory Against Language Games
Dimitris Vartziotis
Main category: cs.CL
TL;DR: The paper contrasts social constructivist (language games) vs. mathematical (Semantic Field Theory) approaches to linguistic meaning, analyzing how LLM architectures relate to these frameworks and arguing they are complementary perspectives.
Details
Motivation: To examine long-standing theories of linguistic meaning in the new empirical setting of large language models, contrasting social constructivist accounts with mathematical frameworks to understand the scope and limits of statistical language models.Method: Formalizes lexical fields (Lexfelder) and linguistic fields (Lingofelder) as interacting structures in continuous semantic space, then analyzes how transformer architecture properties (distributed representations, attention mechanisms, geometric regularities) relate to these concepts.
Result: LLM success in capturing semantic regularities supports mathematical structure in language, while their limitations in pragmatic reasoning and context sensitivity align with social grounding importance. The two approaches are complementary rather than competing.
Conclusion: Mathematical structure and language games are complementary perspectives that clarify the scope/limits of statistical language models and motivate new directions for theoretically informed AI architectures.
Abstract: Large language models (LLMs) offer a new empirical setting in which long-standing theories of linguistic meaning can be examined. This paper contrasts two broad approaches: social constructivist accounts associated with language games, and a mathematically oriented framework we call Semantic Field Theory. Building on earlier work by the author, we formalize the notions of lexical fields (Lexfelder) and linguistic fields (Lingofelder) as interacting structures in a continuous semantic space. We then analyze how core properties of transformer architectures (such as distributed representations, attention mechanisms, and geometric regularities in embedding spaces) relate to these concepts. We argue that the success of LLMs in capturing semantic regularities supports the view that language exhibits an underlying mathematical structure, while their persistent limitations in pragmatic reasoning and context sensitivity are consistent with the importance of social grounding emphasized in philosophical accounts of language use. On this basis, we suggest that mathematical structure and language games can be understood as complementary rather than competing perspectives. The resulting framework clarifies the scope and limits of purely statistical models of language and motivates new directions for theoretically informed AI architectures.
[22] Defensive M2S: Training Guardrail Models on Compressed Multi-turn Conversations
Hyunjun Kim
Main category: cs.CL
TL;DR: Defensive M2S: A training paradigm that compresses multi-turn conversations to single-turn for guardrail models, reducing training cost from O(n²) to O(n) and achieving 93× token reduction while maintaining high attack detection performance.
Details
Motivation: Processing full multi-turn conversation histories for guardrail models incurs significant computational cost, making safety screening of long conversations inefficient and unscalable for LLM deployments.Method: Fine-tunes guardrail models on Multi-turn to Single-turn (M2S) compressed conversations using compression templates (hyphenize, numberize, pythonize) rather than complete dialogue histories, reducing training complexity.
Result: Achieves 93.8% attack detection recall with Qwen3Guard + hyphenize compression, reducing inference tokens by 94.6% (from 3,231 to 173 tokens per conversation) and training tokens by 93× (15.7M to 169K).
Conclusion: M2S compression is an effective efficiency technique for guardrail deployment, enabling scalable safety screening of long multi-turn conversations while dramatically reducing both training and inference costs.
Abstract: Guardrail models are essential for ensuring the safety of Large Language Model (LLM) deployments, but processing full multi-turn conversation histories incurs significant computational cost. We propose Defensive M2S, a training paradigm that fine-tunes guardrail models on Multi-turn to Single-turn (M2S) compressed conversations rather than complete dialogue histories. We provide a formal complexity analysis showing that M2S reduces training cost from $O(n^2)$ to $O(n)$ for $n$-turn conversations. Empirically, on our training dataset (779 samples, avg. 10.6 turns), M2S requires only 169K tokens compared to 15.7M tokens for the multi-turn baseline – a 93$\times$ reduction. We evaluate Defensive M2S across three guardrail model families (LlamaGuard, Nemotron, Qwen3Guard) and three compression templates (hyphenize, numberize, pythonize) on SafeDialBench, a comprehensive multi-turn jailbreak benchmark. Our best configuration, Qwen3Guard with hyphenize compression, achieves 93.8% attack detection recall while reducing inference tokens by 94.6% (from 3,231 to 173 tokens per conversation). This represents a 38.9 percentage point improvement over the baseline while dramatically reducing both training and inference costs. Our findings demonstrate that M2S compression can serve as an effective efficiency technique for guardrail deployment, enabling scalable safety screening of long multi-turn conversations.
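As a rough illustration of the compression idea, a hyphenize-style template might collapse a dialogue into a single bulleted prompt as in the sketch below; the exact template wording used by Defensive M2S is not specified in this summary, so the format shown is an assumption.

```python
# Illustrative sketch of a "hyphenize"-style multi-turn-to-single-turn (M2S)
# compression. The exact template used by Defensive M2S is not given in the
# abstract, so this format is an assumption for illustration only.

def hyphenize(conversation):
    """Collapse a multi-turn conversation into one single-turn prompt.

    `conversation` is a list of (role, text) tuples,
    e.g. [("user", "..."), ("assistant", "..."), ...].
    """
    lines = ["Summarized conversation (one bullet per user turn):"]
    for role, text in conversation:
        if role == "user":                      # keep only the user requests
            lines.append(f"- {text.strip()}")
    return "\n".join(lines)

# A guardrail model would classify this single compressed string
# instead of the full dialogue history.
demo = [
    ("user", "Tell me about chemistry."),
    ("assistant", "Sure, what topic?"),
    ("user", "How do I make something dangerous at home?"),
]
print(hyphenize(demo))
```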
[23] Noise-Aware Named Entity Recognition for Historical VET Documents
Alexander M. Esser, Jens Dörpinghaus
Main category: cs.CL
TL;DR: A robust NER approach for noisy VET documents using noise-aware training with synthetic OCR errors, transfer learning, and multi-stage fine-tuning to improve accuracy in noisy conditions.
Details
Motivation: NER in Vocational Education and Training (VET) documents is challenging due to historical digitized documents with OCR-induced noise, requiring robust methods that can handle such noisy conditions.Method: Proposes noise-aware training with synthetically injected OCR errors, transfer learning, and multi-stage fine-tuning. Compares three strategies: training on noisy data, clean data, and artificially generated data.
Result: Domain-specific and noise-aware fine-tuning substantially increases robustness and accuracy under noisy conditions. The method is one of the first to recognize multiple entity types in VET documents and is transferable to arbitrary languages.
Conclusion: The proposed approach effectively addresses NER in noisy VET documents, providing a robust solution that improves accuracy in OCR-challenged environments, with publicly available code for reproducibility.
Abstract: This paper addresses Named Entity Recognition (NER) in the domain of Vocational Education and Training (VET), focusing on historical, digitized documents that suffer from OCR-induced noise. We propose a robust NER approach leveraging Noise-Aware Training (NAT) with synthetically injected OCR errors, transfer learning, and multi-stage fine-tuning. Three complementary strategies, training on noisy, clean, and artificial data, are systematically compared. Our method is one of the first to recognize multiple entity types in VET documents. It is applied to German documents but transferable to arbitrary languages. Experimental results demonstrate that domain-specific and noise-aware fine-tuning substantially increases robustness and accuracy under noisy conditions. We provide publicly available code for reproducible noise-aware NER in domain-specific contexts.
[24] Rule-Based Approaches to Atomic Sentence Extraction
Lineesha Kamana, Akshita Ananda Subramanian, Mehuli Ghosh, Suman Saha
Main category: cs.CL
TL;DR: Rule-based atomic sentence extraction using dependency parsing achieves moderate-to-high accuracy but struggles with complex syntactic structures like relative clauses, coordination, and passive voice.
Details
Motivation: Previous machine learning approaches to atomic sentence extraction lack interpretability and don't provide insight into which specific linguistic structures cause extraction failures, creating a gap in understanding extraction difficulties.Method: Implemented dependency-based extraction rules in spaCy using the WikiSplit dataset, generated 100 gold-standard atomic sentence sets, and evaluated performance using ROUGE and BERTScore metrics.
Result: System achieved ROUGE-1 F1 = 0.6714, ROUGE-2 F1 = 0.478, ROUGE-L F1 = 0.650, and BERTScore F1 = 0.5898, showing moderate-to-high alignment. Challenging structures included relative clauses, appositions, coordinated predicates, adverbial clauses, and passive constructions.
Conclusion: Rule-based extraction is reasonably accurate but sensitive to syntactic complexity, with specific clause structures and dependencies identified as primary sources of extraction difficulties.
Abstract: Natural language often combines multiple ideas into complex sentences. Atomic sentence extraction, the task of decomposing complex sentences into simpler sentences that each express a single idea, improves performance in information retrieval, question answering, and automated reasoning systems. Previous work has formalized the “split-and-rephrase” task and established evaluation metrics, and machine learning approaches using large language models have improved extraction accuracy. However, these methods lack interpretability and provide limited insight into which linguistic structures cause extraction failures. Although some studies have explored dependency-based extraction of subject-verb-object triples and clauses, no principled analysis has examined which specific clause structures and dependencies lead to extraction difficulties. This study addresses this gap by analyzing how complex sentence structures, including relative clauses, adverbial clauses, coordination patterns, and passive constructions, affect the performance of rule-based atomic sentence extraction. Using the WikiSplit dataset, we implemented dependency-based extraction rules in spaCy, generated 100 gold-standard atomic sentence sets, and evaluated performance using ROUGE and BERTScore. The system achieved ROUGE-1 F1 = 0.6714, ROUGE-2 F1 = 0.478, ROUGE-L F1 = 0.650, and BERTScore F1 = 0.5898, indicating moderate-to-high lexical, structural, and semantic alignment. Challenging structures included relative clauses, appositions, coordinated predicates, adverbial clauses, and passive constructions. Overall, rule-based extraction is reasonably accurate but sensitive to syntactic complexity.
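For a concrete sense of what one dependency-based splitting rule looks like, the sketch below breaks a sentence at coordinated predicates with spaCy and re-attaches the shared subject; it assumes the en_core_web_sm model is installed and covers only a fraction of the rules a full system would need.

```python
# Toy dependency-based splitting rule with spaCy: split at coordinated verbs
# (the "conj" relation) and re-attach the shared subject. This is only one of
# many rules a real system would need.
# Assumes: python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load("en_core_web_sm")

def split_coordinated_predicates(sentence):
    doc = nlp(sentence)
    root = next(t for t in doc if t.dep_ == "ROOT")
    subject = next((t for t in root.lefts if t.dep_ in ("nsubj", "nsubjpass")), None)
    subj_text = " ".join(w.text for w in subject.subtree) if subject is not None else ""
    atoms = []
    for verb in [root] + [c for c in root.children if c.dep_ == "conj"]:
        rights = []
        for child in verb.rights:
            if child.dep_ in ("conj", "cc", "punct"):
                continue                  # skip the coordination itself and punctuation
            rights.extend(w.text for w in child.subtree)
        atoms.append(" ".join([subj_text, verb.text] + rights).strip())
    return atoms

print(split_coordinated_predicates(
    "The committee approved the budget and postponed the vote."))
# expected, roughly: ['The committee approved the budget',
#                     'The committee postponed the vote']
```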
[25] Retrieval–Reasoning Processes for Multi-hop Question Answering: A Four-Axis Design Framework and Empirical Trends
Yuelyu Ji, Zhuochun Li, Rui Meng, Daqing He
Main category: cs.CL
TL;DR: Survey paper analyzing multi-hop QA systems through a four-axis framework focusing on execution procedures, mapping existing approaches, and identifying trade-offs between effectiveness, efficiency, and faithfulness.
Details
Motivation: Current multi-hop QA systems often leave their retrieval-reasoning processes implicit, making it difficult to compare procedural choices across different model families and approaches.Method: Introduces a four-axis framework covering: (A) overall execution plan, (B) index structure, (C) next-step control (strategies and triggers), and (D) stop/continue criteria. Uses this schema to map representative multi-hop QA systems and analyze ablations on standard benchmarks.
Result: Synthesizes reported ablations and tendencies on standard benchmarks (HotpotQA, 2WikiMultiHopQA, MuSiQue), highlighting recurring trade-offs among effectiveness, efficiency, and evidence faithfulness.
Conclusion: Identifies open challenges for retrieval-reasoning agents, including structure-aware planning, transferable control policies, and robust stopping under distribution shift.
Abstract: Multi-hop question answering (QA) requires systems to iteratively retrieve evidence and reason across multiple hops. While recent RAG and agentic methods report strong results, the underlying retrieval–reasoning process is often left implicit, making procedural choices hard to compare across model families. This survey takes the execution procedure as the unit of analysis and introduces a four-axis framework covering (A) overall execution plan, (B) index structure, (C) next-step control (strategies and triggers), and (D) stop/continue criteria. Using this schema, we map representative multi-hop QA systems and synthesize reported ablations and tendencies on standard benchmarks (e.g., HotpotQA, 2WikiMultiHopQA, MuSiQue), highlighting recurring trade-offs among effectiveness, efficiency, and evidence faithfulness. We conclude with open challenges for retrieval–reasoning agents, including structure-aware planning, transferable control policies, and robust stopping under distribution shift.
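The control and stopping axes (C and D) can be pictured as a simple loop; the sketch below is a generic illustration with placeholder retriever and generator calls, not any specific surveyed system.

```python
# Schematic retrieval-reasoning loop illustrating two of the survey's axes:
# (C) next-step control and (D) stop/continue criteria. `retrieve` and
# `generate` are placeholders for a retriever and an LLM call; the concrete
# trigger and stop rule below are illustrative choices.

def multi_hop_answer(question, retrieve, generate, max_hops=4):
    evidence, query = [], question
    for hop in range(max_hops):
        passages = retrieve(query, k=5)             # axis B: a flat index assumed here
        evidence.extend(passages)
        step = generate(question=question, evidence=evidence)
        # `step` is assumed to be a dict like:
        #   {"answer": str | None, "next_query": str | None}
        if step.get("answer"):                      # axis D: stop once the model
            return step["answer"], evidence         # commits to an answer
        query = step.get("next_query") or question  # axis C: model-proposed trigger
    return None, evidence                           # hop budget exhausted
```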
[26] ECR: Manifold-Guided Semantic Cues for Compact Language Models
Chung-Wei Victor Yuan
Main category: cs.CL
TL;DR: ECR framework preserves embedding space structure in compact models by regulating geometry around semantic anchors, preventing semantic drift without altering inference architecture.
Details
Motivation: Compact models often lose embedding space structure when capacity is limited or data is multilingual, causing semantic drift that harms downstream tasks. Existing compression methods focus on superficial output alignment but fail to preserve underlying manifold structure.Method: Embedding Consistency Regulation (ECR) framework: 1) Derives semantic anchors from teacher embeddings offline, 2) Trains compact model to maintain consistent geometry around these anchors without matching logits or internal features, 3) Adds only small projection step at inference without changing architecture.
Result: On a 100K multilingual corpus, ECR stabilizes training and preserves semantic structure across tasks and languages. It produces a more compact and task-aligned representation space, enabling low-capacity models to learn cleaner manifolds than conventional baselines.
Conclusion: ECR helps compact models better follow task requirements and makes them easier to deploy under strict efficiency or privacy limits. Works without teacher outputs and is compatible with but independent of distillation.
Abstract: Compact models often lose the structure of their embedding space. The issue shows up when the capacity is tight or the data spans several languages. Such collapse makes it difficult for downstream tasks to build on the resulting representation. Existing compression methods focus on aligning model outputs at a superficial level but fail to preserve the underlying manifold structure. This mismatch often leads to semantic drift in the compact model, causing both task behavior and linguistic properties to deviate from the reference model. To address those issues, we provide a new framework called Embedding Consistency Regulation (ECR). This framework first derives a set of semantic anchors from teacher embeddings (computed once offline). Then, the compact model learns to maintain consistent geometry around these anchors, without relying on matching logits or internal features. ECR adds only a small projection step at inference, without altering the decoding architecture or its runtime behavior. In experiments on a 100K multilingual corpus, ECR consistently stabilizes training and preserves semantic structure across tasks and languages. It also produces a more compact and task-aligned representation space, enabling low-capacity models to learn cleaner manifolds than conventional baselines. ECR works without teacher outputs and is compatible with, but independent of, distillation. Taken together, our results show that ECR helps compact models better follow task requirements and makes them easier to deploy under strict efficiency or privacy limits.
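One plausible reading of the anchor-based objective is to ask the compact model to reproduce the teacher's similarity profile to the fixed anchors; the PyTorch sketch below illustrates that idea, but the exact loss and anchor projection used by ECR are not specified in the abstract, so the formulation is an assumption.

```python
# Illustrative PyTorch sketch of an anchor-consistency objective in the spirit
# of ECR: the compact model is pushed to reproduce the teacher's geometry
# relative to a fixed set of semantic anchors. The particular formulation
# (matching cosine-similarity profiles to the anchors) is an assumption.
import torch
import torch.nn.functional as F

def anchor_consistency_loss(student_emb, teacher_emb, anchors):
    """student_emb: (B, d_s), teacher_emb: (B, d_t).

    anchors is a tuple (anchors_s, anchors_t) of shapes (K, d_s) and (K, d_t),
    e.g. teacher-derived anchors projected once, offline, into the student space.
    """
    anchors_s, anchors_t = anchors
    sim_s = F.normalize(student_emb, dim=-1) @ F.normalize(anchors_s, dim=-1).T
    sim_t = F.normalize(teacher_emb, dim=-1) @ F.normalize(anchors_t, dim=-1).T
    return F.mse_loss(sim_s, sim_t.detach())        # keep the teacher geometry fixed

# Toy shapes: batch of 8, student dim 128, teacher dim 512, 32 anchors.
loss = anchor_consistency_loss(
    torch.randn(8, 128), torch.randn(8, 512),
    (torch.randn(32, 128), torch.randn(32, 512)))
```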
[27] InfoSynth: Information-Guided Benchmark Synthesis for LLMs
Ishir Garg, Neel Kolhe, Xuandong Zhao, Dawn Song
Main category: cs.CL
TL;DR: InfoSynth is an automated framework for generating novel and diverse reasoning benchmarks for LLMs using information-theoretic metrics and genetic algorithms, achieving 97% accuracy in creating Python coding problems with controllable novelty and difficulty.
Details
Motivation: Traditional benchmark creation is manual, expensive, and time-consuming, while existing benchmarks often contaminate LLM training data, requiring novel and diverse benchmarks to accurately assess LLM capabilities.Method: InfoSynth uses information-theoretic principles (KL-divergence and entropy metrics) to quantify benchmark novelty/diversity, and employs genetic algorithms with iterative code feedback to synthesize Python coding problems from seed datasets.
Result: The method generates accurate test cases and solutions 97% of the time, produces benchmarks with higher novelty/diversity than seed datasets, and allows control over novelty/diversity and difficulty of generated problems.
Conclusion: InfoSynth provides a scalable, self-verifying pipeline for constructing high-quality, novel and diverse benchmarks for LLMs, addressing the limitations of traditional benchmark creation methods.
Abstract: Large language models (LLMs) have demonstrated significant advancements in reasoning and code generation. However, efficiently creating new benchmarks to evaluate these capabilities remains a challenge. Traditional benchmark creation relies on manual human effort, a process that is both expensive and time-consuming. Furthermore, existing benchmarks often contaminate LLM training data, necessitating novel and diverse benchmarks to accurately assess their genuine capabilities. This work introduces InfoSynth, a novel framework for automatically generating and evaluating reasoning benchmarks guided by information-theoretic principles. We propose metrics based on KL-divergence and entropy to quantify benchmark novelty and diversity without relying on costly model evaluations. Building on this framework, we develop an end-to-end pipeline that synthesizes robust Python coding problems from seed datasets using genetic algorithms and iterative code feedback. Our method generates accurate test cases and solutions to new problems 97% of the time, and the synthesized benchmarks consistently exhibit higher novelty and diversity compared to their seed datasets. Moreover, our algorithm provides a method for controlling the novelty/diversity and difficulty of generated problems. InfoSynth offers a scalable, self-verifying pipeline for constructing high-quality, novel and diverse benchmarks for LLMs. Project Page: https://ishirgarg.github.io/infosynth_web/
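A minimal version of such information-theoretic scoring could represent each benchmark as a smoothed distribution over discrete features and use KL-divergence as novelty and entropy as diversity, as sketched below; the features and estimators the paper actually uses are not described here, so this is illustrative only.

```python
# Sketch of information-theoretic benchmark metrics in the spirit of InfoSynth:
# treat a benchmark as a distribution over some discrete feature (here, tokens),
# use KL(new || seed) as a novelty score and the entropy of the new
# distribution as a diversity score. Feature choice and smoothing are assumptions.
import math
from collections import Counter

def _distribution(problems, vocab):
    counts = Counter(tok for p in problems for tok in p.split())
    total = sum(counts.get(t, 0) + 1 for t in vocab)          # add-one smoothing
    return {t: (counts.get(t, 0) + 1) / total for t in vocab}

def novelty_and_diversity(new_problems, seed_problems):
    vocab = {tok for p in new_problems + seed_problems for tok in p.split()}
    p = _distribution(new_problems, vocab)     # synthesized benchmark
    q = _distribution(seed_problems, vocab)    # seed benchmark
    novelty = sum(p[t] * math.log(p[t] / q[t]) for t in vocab)    # KL(P || Q)
    diversity = -sum(p[t] * math.log(p[t]) for t in vocab)        # entropy H(P)
    return novelty, diversity

nov, div = novelty_and_diversity(
    ["reverse a linked list in place", "compute a matrix determinant"],
    ["reverse a string", "sum a list of integers"])
```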
[28] CSSBench: Evaluating the Safety of Lightweight LLMs against Chinese-Specific Adversarial Patterns
Zhenhong Zhou, Shilinlu Yan, Chuanpu Liu, Qiankun Li, Kun Wang, Zhigang Zeng
Main category: cs.CL
TL;DR: CSSBench is a Chinese-specific safety benchmark that evaluates lightweight LLMs against adversarial patterns unique to Chinese (homophones, pinyin, symbol-splitting), covering six real-world domains and measuring safety-induced performance degradation.
Details
Motivation: There's a safety evaluation gap for lightweight LLMs in Chinese contexts. Existing benchmarks focus on English, but Chinese malicious queries use unique adversarial patterns (homophones, pinyin, symbol-based splitting) that aren't captured. Lightweight models deployed in cost-sensitive/on-device scenarios may be particularly vulnerable to these Chinese-specific attacks.Method: Created CSSBench with Chinese-specific adversarial patterns (homophones, pinyin, symbol-based splitting). Covers six Chinese-relevant domains: illegal activities/compliance, privacy leakage, health/medical misinformation, fraud/hate, adult content, and public/political safety. Organizes queries into multiple task types and measures over-refusal behavior to assess safety-induced performance degradation.
Result: Evaluation of popular lightweight LLMs shows that Chinese-specific adversarial patterns pose a critical challenge. The benchmark reveals safety vulnerabilities in lightweight models when faced with these unique Chinese attack patterns.
Conclusion: CSSBench provides comprehensive Chinese safety evaluation for LLMs, addressing the gap in existing English-focused benchmarks. It helps assess robustness for practical deployments in Chinese contexts, especially for lightweight models in cost-sensitive scenarios.
Abstract: Large language models (LLMs) are increasingly deployed in cost-sensitive and on-device scenarios, and safety guardrails have advanced mainly in English. However, real-world Chinese malicious queries typically conceal intent via homophones, pinyin, symbol-based splitting, and other Chinese-specific patterns. These Chinese-specific adversarial patterns create a safety evaluation gap that is not well captured by existing benchmarks, which focus on English. This gap is particularly concerning for lightweight models, which may be more vulnerable to such specific adversarial perturbations. To bridge this gap, we introduce the Chinese-Specific Safety Benchmark (CSSBench), which emphasizes these adversarial patterns and evaluates the safety of lightweight LLMs in Chinese. Our benchmark covers six domains that are common in real Chinese scenarios, including illegal activities and compliance, privacy leakage, health and medical misinformation, fraud and hate, adult content, and public and political safety, and organizes queries into multiple task types. We evaluate a set of popular lightweight LLMs and measure over-refusal behavior to assess safety-induced performance degradation. Our results show that Chinese-specific adversarial patterns are a critical challenge for lightweight LLMs. This benchmark offers a comprehensive evaluation of LLM safety in Chinese, supporting robust deployment in practice.
[29] Beyond IVR: Benchmarking Customer Support LLM Agents for Business-Adherence
Sumanth Balaji, Piyush Mishra, Aashraya Sachdeva, Suraj Agrawal
Main category: cs.CL
TL;DR: JourneyBench is a new benchmark for evaluating LLM agents in customer support that measures policy adherence through realistic scenarios and a novel User Journey Coverage Score metric, showing that dynamic policy-aware agents outperform static ones.
Details
Motivation: Traditional IVR systems are rigid and can't handle complex policy-driven tasks. While LLM agents offer promise, existing benchmarks focus only on tool usage or task completion, ignoring crucial aspects like multi-step policy adherence, task dependencies, and robustness to unpredictable user behavior in real-world support workflows.Method: JourneyBench uses graph representations to generate diverse, realistic support scenarios and introduces the User Journey Coverage Score metric to measure policy adherence. The study evaluates two agent designs: Static-Prompt Agent (SPA) and Dynamic-Prompt Agent (DPA) that explicitly models policy control, testing them across 703 conversations in three domains.
Result: DPA significantly boosts policy adherence, allowing smaller models like GPT-4o-mini to outperform more capable ones like GPT-4o. Structured orchestration through dynamic policy modeling proves crucial for effective customer support agents.
Conclusion: JourneyBench establishes a critical resource for advancing AI-driven customer support beyond IVR limitations by providing a comprehensive benchmark for evaluating policy-aware agents, demonstrating the importance of structured orchestration for real-world support workflows.
Abstract: Traditional customer support systems, such as Interactive Voice Response (IVR), rely on rigid scripts and lack the flexibility required for handling complex, policy-driven tasks. While large language model (LLM) agents offer a promising alternative, evaluating their ability to act in accordance with business rules and real-world support workflows remains an open challenge. Existing benchmarks primarily focus on tool usage or task completion, overlooking an agent’s capacity to adhere to multi-step policies, navigate task dependencies, and remain robust to unpredictable user or environment behavior. In this work, we introduce JourneyBench, a benchmark designed to assess policy-aware agents in customer support. JourneyBench leverages graph representations to generate diverse, realistic support scenarios and proposes the User Journey Coverage Score, a novel metric to measure policy adherence. We evaluate multiple state-of-the-art LLMs using two agent designs: a Static-Prompt Agent (SPA) and a Dynamic-Prompt Agent (DPA) that explicitly models policy control. Across 703 conversations in three domains, we show that DPA significantly boosts policy adherence, even allowing smaller models like GPT-4o-mini to outperform more capable ones like GPT-4o. Our findings demonstrate the importance of structured orchestration and establish JourneyBench as a critical resource to advance AI-driven customer support beyond IVR-era limitations.
[30] Probabilistic Guarantees for Reducing Contextual Hallucinations in LLMs
Nils Rautenberg, Sven Schippkus
Main category: cs.CL
TL;DR: A framework using repeated LLM queries and judge ensembles to provide probabilistic guarantees against hallucinations in fixed-input workflows.
Details
Motivation: LLMs produce contextual hallucinations that contradict prompt information, which is problematic for deterministic automation workflows where correctness must be unambiguous.Method: Issuing same prompt in independent context windows for exponential error reduction, using LLM-as-judge to identify correct answers, and strengthening imperfect judges through majority vote ensembles.
Result: Pipeline failure decreases exponentially with repetitions, and hallucination-selection decreases exponentially with number of judges in ensemble, matching theoretical predictions.
Conclusion: Provides lightweight, modular, theoretically grounded method to drive hallucination probabilities arbitrarily low without modifying model weights, decoding strategies, or prompt engineering.
Abstract: Large language models (LLMs) frequently produce contextual hallucinations, where generated content contradicts or ignores information explicitly stated in the prompt. Such errors are particularly problematic in deterministic automation workflows, where inputs are fixed and correctness is unambiguous. We introduce a simple and model-agnostic framework that provides explicit probabilistic guarantees for reducing hallucinations in this setting. We formalize the notion of a specific task, defined by a fixed input and a deterministic correctness criterion, and show that issuing the same prompt in independent context windows yields an exponential reduction in the probability that all model outputs are incorrect. To identify a correct answer among repeated runs, we incorporate an LLM-as-a-judge and prove that the probability that the judged pipeline fails decays at a rate determined by the judge’s true- and false-positive probabilities. When the judge is imperfect, we strengthen it through majority vote over independent judge calls, obtaining ensemble-level error rates that decrease exponentially in the number of votes. This yields an explicit bound on the probability that the pipeline selects a hallucinated answer. Experiments on controlled extraction tasks with synthetic noisy judges match these predictions exactly: pipeline failure decreases exponentially with the number of repetitions, and hallucination-selection decreases exponentially with the number of judges in the ensemble. Together, these results provide a lightweight, modular, and theoretically grounded method for driving hallucination probabilities arbitrarily low in fixed-input LLM workflows, without modifying model weights, decoding strategies, or prompt engineering.
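The two exponential effects described above are easy to check numerically: with per-run error rate p, all n independent runs fail with probability p^n, and a majority vote over m independent judges fails with a binomial tail probability. The short sketch below just does this arithmetic; it does not reproduce the paper's exact bound.

```python
# Numeric illustration of the exponential behavior described in the abstract.
from math import comb

def all_runs_wrong(p_error, n_runs):
    """Probability that every one of n independent runs is incorrect."""
    return p_error ** n_runs

def majority_vote_error(e_judge, m_judges):
    """Probability that at least ceil(m/2)+... i.e. a strict majority of
    m independent judges (odd m assumed) err simultaneously."""
    k_min = m_judges // 2 + 1
    return sum(comb(m_judges, k) * e_judge**k * (1 - e_judge)**(m_judges - k)
               for k in range(k_min, m_judges + 1))

print(all_runs_wrong(0.10, 5))        # 1e-05
print(majority_vote_error(0.20, 7))   # ~0.033
```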
[31] Physio-DPO: Aligning Large Language Models with the Protein Energy Landscape to Eliminate Structural Hallucinations
QiWei Meng
Main category: cs.CL
TL;DR: Physio-DPO: A physics-informed alignment framework that grounds protein language models in thermodynamic stability to reduce structural hallucinations in generative protein design.
Details
Motivation: Large protein language models often produce structural hallucinations - generating sequences with high linguistic likelihood but thermodynamically unstable conformations. Existing alignment approaches like DPO are limited because they model preferences as binary labels and ignore the continuous structure of the physical energy landscape.Method: Physio-DPO introduces a magnitude-aware objective that scales optimization updates according to the energy gap between native structures and physics-perturbed hard negatives. This physics-informed alignment framework grounds protein language models in thermodynamic stability.
Result: Physio-DPO consistently outperforms strong baselines including SFT, PPO, and standard DPO, reducing self-consistency RMSD to 1.28 Å and increasing foldability to 92.8%. It effectively mitigates structural hallucinations by recovering biophysical interactions like hydrophobic core packing and hydrogen bond networks.
Conclusion: Physio-DPO successfully integrates physical energy landscape information into protein language model alignment, addressing the critical problem of structural hallucinations in generative protein design and improving the thermodynamic stability of generated protein sequences.
Abstract: Large Protein Language Models have shown strong potential for generative protein design, yet they frequently produce structural hallucinations, generating sequences with high linguistic likelihood that fold into thermodynamically unstable conformations. Existing alignment approaches such as Direct Preference Optimization are limited in this setting, as they model preferences as binary labels and ignore the continuous structure of the physical energy landscape. We propose Physio-DPO, a physics-informed alignment framework that grounds protein language models in thermodynamic stability. Physio-DPO introduces a magnitude-aware objective that scales optimization updates according to the energy gap between native structures and physics-perturbed hard negatives. Experiments show that Physio-DPO consistently outperforms strong baselines including SFT, PPO, and standard DPO, reducing self-consistency RMSD to 1.28 Å and increasing foldability to 92.8%. Qualitative analysis further demonstrates that Physio-DPO effectively mitigates structural hallucinations by recovering biophysical interactions such as hydrophobic core packing and hydrogen bond networks.
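A magnitude-aware preference loss can be sketched as a DPO-style term whose weight grows with the energy gap between the native structure and its physics-perturbed negative; the PyTorch snippet below is only a plausible form, since the paper's exact weighting function is not given in this summary.

```python
# Rough PyTorch sketch of a "magnitude-aware" preference loss in the spirit of
# Physio-DPO: a standard DPO-style term whose strength is scaled by the energy
# gap to a physics-perturbed hard negative. The linear-in-gap weighting is an
# assumption for illustration.
import torch
import torch.nn.functional as F

def magnitude_aware_dpo_loss(logp_chosen, logp_rejected,
                             ref_logp_chosen, ref_logp_rejected,
                             energy_gap, beta=0.1, gap_scale=1.0):
    """All log-prob tensors have shape (B,); energy_gap is (B,) and >= 0."""
    margin = (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    per_pair = -F.logsigmoid(beta * margin)          # standard DPO term
    weight = 1.0 + gap_scale * energy_gap            # larger gap -> stronger update
    return (weight * per_pair).mean()

loss = magnitude_aware_dpo_loss(
    torch.randn(4), torch.randn(4), torch.randn(4), torch.randn(4),
    energy_gap=torch.rand(4))
```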
[32] Fast-weight Product Key Memory
Tianyu Zhao, Llion Jones
Main category: cs.CL
TL;DR: FwPKM transforms static Product Key Memory into dynamic fast-weight episodic memory that updates via local gradient descent, enabling efficient long-context processing with unbounded storage.
Details
Motivation: Current sequence modeling layers face a trade-off: Softmax attention has unbounded storage but quadratic computational costs, while linear variants are efficient but have limited fixed-size storage. There's a need for an architecture that combines efficient computation with flexible, dynamic storage capacity.Method: Proposes Fast-weight Product Key Memory (FwPKM) that transforms sparse Product Key Memory from a static module into a dynamic episodic memory. Unlike PKM, FwPKM updates its parameters dynamically at both training and inference time via local chunk-level gradient descent, allowing rapid memorization and retrieval of new key-value pairs from input sequences.
Result: FwPKM functions as an effective episodic memory that complements standard semantic memory, yielding significant perplexity reductions on long-context datasets. Notably, it generalizes to 128K-token contexts despite being trained on only 4K-token sequences in Needle in a Haystack evaluations.
Conclusion: FwPKM resolves the tension between storage capacity and computational efficiency in sequence modeling by providing a dynamic, fast-weight episodic memory that can efficiently handle long contexts while maintaining computational efficiency.
Abstract: Sequence modeling layers in modern language models typically face a trade-off between storage capacity and computational efficiency. While Softmax attention offers unbounded storage at prohibitive quadratic costs, linear variants provide efficiency but suffer from limited, fixed-size storage. We propose Fast-weight Product Key Memory (FwPKM), a novel architecture that resolves this tension by transforming the sparse Product Key Memory (PKM) from a static module into a dynamic, “fast-weight” episodic memory. Unlike PKM, FwPKM updates its parameters dynamically at both training and inference time via local chunk-level gradient descent, allowing the model to rapidly memorize and retrieve new key-value pairs from input sequences. Experiments reveal that FwPKM functions as an effective episodic memory that complements the semantic memory of standard modules, yielding significant perplexity reductions on long-context datasets. Notably, in Needle in a Haystack evaluations, FwPKM generalizes to 128K-token contexts despite being trained on only 4K-token sequences.
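The fast-weight mechanism can be pictured as a key-value memory whose parameters take a local gradient step on each incoming chunk, at inference as well as training time. The toy sketch below uses a dense memory for clarity, whereas FwPKM itself relies on sparse product-key addressing.

```python
# Toy sketch of the fast-weight idea: a key-value memory updated by a local
# gradient step on each chunk. Real FwPKM uses sparse product-key addressing;
# this dense version only illustrates the chunk-level update loop.
import torch

class FastWeightMemory(torch.nn.Module):
    def __init__(self, d_model, n_slots=256, lr=0.1):
        super().__init__()
        self.keys = torch.nn.Parameter(torch.randn(n_slots, d_model) * 0.02)
        self.values = torch.nn.Parameter(torch.zeros(n_slots, d_model))
        self.lr = lr

    def read(self, queries):                          # (T, d) -> (T, d)
        attn = torch.softmax(queries @ self.keys.T, dim=-1)
        return attn @ self.values

    def write_chunk(self, chunk_keys, chunk_values):
        # one local gradient step so the memory reproduces this chunk's pairs
        loss = torch.nn.functional.mse_loss(self.read(chunk_keys), chunk_values)
        grads = torch.autograd.grad(loss, [self.keys, self.values])
        with torch.no_grad():
            self.keys -= self.lr * grads[0]
            self.values -= self.lr * grads[1]

mem = FastWeightMemory(d_model=64)
k, v = torch.randn(16, 64), torch.randn(16, 64)
mem.write_chunk(k, v)                                 # memorize a chunk
recalled = mem.read(k)                                # retrieve it later
```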
[33] Sigmoid Head for Quality Estimation under Language Ambiguity
Tu Anh Dinh, Jan Niehues
Main category: cs.CL
TL;DR: The paper proposes a Sigmoid Head module for quality estimation that addresses limitations of LM probability distributions by using sigmoid activation and negative sampling heuristics.
Details
Motivation: LM probability is unreliable for quality estimation because natural language ambiguity causes probability to spread across multiple valid options, misleadingly indicating low quality. This stems from softmax activation (can't assign high probabilities to multiple correct options) and training on single one-hot encoded references.Method: Train a Quality Estimation module (Sigmoid Head) on top of pre-trained LMs with sigmoid activation to allow multiple correct options to receive high probabilities. Use negative sampling with heuristics to avoid selecting potentially alternative correct tokens during training.
Result: Sigmoid Head provides significantly better quality signals than original softmax head, is computationally efficient for training/inference, and is more robust to out-of-domain settings compared to supervised QE since it doesn’t rely on human-annotated quality data.
Conclusion: The proposed Sigmoid Head effectively addresses LM probability limitations for quality estimation through sigmoid activation and careful negative sampling, offering improved reliability and domain robustness without requiring annotated quality data.
Abstract: Language model (LM) probability is not a reliable quality estimator, as natural language is ambiguous. When multiple output options are valid, the model’s probability distribution is spread across them, which can misleadingly indicate low output quality. This issue has two causes: (1) LMs’ final output activation is softmax, which does not allow multiple correct options to receive high probabilities simultaneously, and (2) LMs’ training data consists of single, one-hot encoded references, indicating that there is only one correct option at each output step. We propose training a module for Quality Estimation on top of pre-trained LMs to address these limitations. The module, called Sigmoid Head, is an extra unembedding head with sigmoid activation to tackle the first limitation. To tackle the second limitation, during the negative sampling process to train the Sigmoid Head, we use a heuristic to avoid selecting potentially alternative correct tokens. Our Sigmoid Head is computationally efficient during training and inference. The probability from the Sigmoid Head is a notably better quality signal compared to the original softmax head. As the Sigmoid Head does not rely on human-annotated quality data, it is more robust to out-of-domain settings compared to supervised QE.
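A minimal version of the head is an extra unembedding layer with sigmoid outputs trained with binary cross-entropy, where sampled negatives skip tokens the base LM itself ranks highly as a stand-in for the paper's heuristic; the sketch below makes that assumption explicit.

```python
# Sketch of a sigmoid-activated quality-estimation head on a frozen LM.
# Negatives for the BCE loss skip tokens the base LM ranks highly, as a
# stand-in for the paper's heuristic against "potentially alternative
# correct" tokens; the real heuristic may differ.
import torch
import torch.nn.functional as F

class SigmoidHead(torch.nn.Module):
    """Extra unembedding head with sigmoid activation."""
    def __init__(self, d_model, vocab_size):
        super().__init__()
        self.unembed = torch.nn.Linear(d_model, vocab_size)

    def forward(self, hidden):                        # (B, d) -> (B, V) in [0, 1]
        return torch.sigmoid(self.unembed(hidden))

def qe_loss(head, hidden, target_ids, lm_logits, n_neg=16, protect_top_k=50):
    probs = head(hidden)                                           # (B, V)
    pos_p = probs.gather(1, target_ids.unsqueeze(1)).squeeze(1)    # reference tokens
    loss = F.binary_cross_entropy(pos_p, torch.ones_like(pos_p))
    protected = lm_logits.topk(protect_top_k, dim=-1).indices      # (B, K)
    neg = torch.randint(0, probs.size(1), (target_ids.size(0), n_neg),
                        device=probs.device)                       # (B, n_neg)
    is_protected = (neg.unsqueeze(-1) == protected.unsqueeze(1)).any(-1)
    is_target = neg == target_ids.unsqueeze(1)
    keep = ~(is_protected | is_target)                             # safe negatives
    neg_p = probs.gather(1, neg)[keep]
    if neg_p.numel() > 0:
        loss = loss + F.binary_cross_entropy(neg_p, torch.zeros_like(neg_p))
    return loss

head = SigmoidHead(d_model=32, vocab_size=100)
loss = qe_loss(head, torch.randn(4, 32), torch.randint(0, 100, (4,)),
               lm_logits=torch.randn(4, 100))
```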
[34] Exploring the Performance of Large Language Models on Subjective Span Identification Tasks
Alphaeus Dmonte, Roland Oruche, Tharindu Ranasinghe, Marcos Zampieri, Prasad Calyam
Main category: cs.CL
TL;DR: LLMs perform well on subjective span identification tasks like sentiment analysis, offensive language detection, and claim verification, with text relationships aiding precision.
Details
Motivation: Current span identification research focuses on explicit tasks like NER using smaller models, while subjective span identification with LLMs in tasks like ABSA remains underexplored.Method: Evaluated various LLMs on three tasks using strategies like instruction tuning, in-context learning, and chain of thought prompting.
Result: LLMs show strong performance on subjective span identification, with underlying text relationships helping them identify precise text spans.
Conclusion: LLMs are effective for subjective span identification tasks, filling an important gap in NLP research and contributing to model explainability.
Abstract: Identifying relevant text spans is important for several downstream tasks in NLP, as it contributes to model explainability. While most span identification approaches rely on relatively smaller pre-trained language models like BERT, a few recent approaches have leveraged the latest generation of Large Language Models (LLMs) for the task. Current work has focused on explicit span identification like Named Entity Recognition (NER), while more subjective span identification with LLMs in tasks like Aspect-based Sentiment Analysis (ABSA) has been underexplored. In this paper, we fill this important gap by presenting an evaluation of the performance of various LLMs on text span identification in three popular tasks, namely sentiment analysis, offensive language identification, and claim verification. We explore several LLM strategies like instruction tuning, in-context learning, and chain of thought. Our results indicate that underlying relationships within text aid LLMs in identifying precise text spans.
[35] Adapting Natural Language Processing Models Across Jurisdictions: A pilot Study in Canadian Cancer Registries
Jonathan Simkin, Lovedeep Gondara, Zeeshan Rizvi, Gregory Doyle, Jeff Dowden, Dan Bond, Desmond Martin, Raymond Ng
Main category: cs.CL
TL;DR: Cross-provincial evaluation shows transformer models can be adapted between Canadian cancer registries with modest fine-tuning, and ensemble methods significantly reduce missed cancer cases while maintaining privacy.
Details
Motivation: Manual abstraction of pathology reports for cancer registries is resource-intensive and causes delays. While NLP systems help, their ability to generalize across different jurisdictions with varying reporting conventions is not well understood.Method: Adapted BCCRTron (domain-adapted transformer) and GatorTron (biomedical transformer) for cancer surveillance. Used ~104,000 reports for Tier 1 (cancer vs. non-cancer) and ~22,000 for Tier 2 (reportable vs. non-reportable) tasks from Newfoundland & Labrador Cancer Registry. Fine-tuned models using complementary synoptic and diagnosis-focused input pipelines, then combined them using conservative OR-ensemble.
Result: Adapted models maintained high performance across jurisdictions. Ensemble achieved Tier 1 recall of 0.99 (reducing missed cancers to 24 vs 48-54 for standalone models) and Tier 2 recall of 0.99 (reducing missed reportable cancers to 33 vs 46-54). Privacy-preserving workflow shares only model weights between provinces.
Conclusion: Transformers pretrained in one jurisdiction can be localized to another with modest fine-tuning. Ensemble combining complementary text representations substantially reduces missed cancers and improves error coverage. Privacy-preserving approach enables interoperable NLP infrastructure for pan-Canadian cancer pathology workflows.
Abstract: Population-based cancer registries depend on pathology reports as their primary diagnostic source, yet manual abstraction is resource-intensive and contributes to delays in cancer data. While transformer-based NLP systems have improved registry workflows, their ability to generalize across jurisdictions with differing reporting conventions remains poorly understood. We present the first cross-provincial evaluation of adapting BCCRTron, a domain-adapted transformer model developed at the British Columbia Cancer Registry, alongside GatorTron, a biomedical transformer model, for cancer surveillance in Canada. Our training dataset consisted of approximately 104,000 and 22,000 de-identified pathology reports from the Newfoundland & Labrador Cancer Registry (NLCR) for Tier 1 (cancer vs. non-cancer) and Tier 2 (reportable vs. non-reportable) tasks, respectively. Both models were fine-tuned using complementary synoptic and diagnosis-focused report-section input pipelines. Across NLCR test sets, the adapted models maintained high performance, demonstrating that transformers pretrained in one jurisdiction can be localized to another with modest fine-tuning. To improve sensitivity, we combined the two models using a conservative OR-ensemble achieving a Tier 1 recall of 0.99 and reduced missed cancers to 24, compared with 48 and 54 for the standalone models. For Tier 2, the ensemble achieved 0.99 recall and reduced missed reportable cancers to 33, compared with 54 and 46 for the individual models. These findings demonstrate that an ensemble combining complementary text representations substantially reduces missed cancers and improves error coverage in cancer-registry NLP. We implement a privacy-preserving workflow in which only model weights are shared between provinces, supporting interoperable NLP infrastructure and a future pan-Canadian foundation model for cancer pathology and registry workflows.
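The OR-ensemble itself is simple enough to show directly: a report is flagged if either classifier flags it, trading precision for recall. The stand-in classifiers in the sketch below are placeholders, not the actual fine-tuned models.

```python
# Sketch of the conservative OR-ensemble described in the abstract: a report is
# flagged as cancer (or reportable) if either model flags it. `predict_a` and
# `predict_b` stand in for the two fine-tuned classifiers.

def or_ensemble(reports, predict_a, predict_b):
    flags = []
    for text in reports:
        positive = predict_a(text) or predict_b(text)   # flag if either fires
        flags.append(positive)
    return flags

# Toy usage with stand-in keyword classifiers.
demo = or_ensemble(
    ["benign nevus, no malignancy", "invasive ductal carcinoma"],
    predict_a=lambda t: "carcinoma" in t,
    predict_b=lambda t: "malignan" in t and "no malignan" not in t)
print(demo)   # [False, True]
```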
[36] From Transformers to LLMs: A Systematic Survey of Efficiency Considerations in NLP
Wazib Ansar, Saptarsi Goswami, Amlan Chakrabarti
Main category: cs.CL
TL;DR: A systematic literature review surveying 312 papers (2011-2025) on efficiency improvements for Transformer-based LLMs in NLP, covering data curation, model design, downsizing, dynamic inference, and adaptation strategies, with statistical analysis and evaluation of 30+ models on 13 benchmarks.
Details
Motivation: The rapid advancement of Transformer-based LLMs has dramatically increased computational demands, creating urgent needs to improve efficiency across computational requirements, energy consumption, carbon footprint, and financial costs in NLP systems.Method: Conducted systematic literature review of 312 articles (2011-2025), categorized efficiency improvements into data curation, model design, model downsizing, and dynamic inferencing, plus adaptation strategies (pre-training, fine-tuning, prompt-engineering, RAG). Performed statistical analysis and evaluated 30+ NLP models on 13 benchmarks.
Result: Comprehensive survey provides systematic categorization of efficiency techniques, statistical trends in research, and empirical evaluation showing efficiency-efficacy trade-offs across different model architectures and approaches.
Conclusion: The review offers valuable insights for researchers and practitioners, highlighting the growing trend toward sustainable NLP practices and providing a structured framework for understanding efficiency improvements in Transformer-based LLMs.
Abstract: The emergence of Transformer-based Large Language Models (LLMs) has substantially augmented the capabilities of Natural Language Processing (NLP), thereby intensifying the demand for computational resources. Therefore, enhancing efficiency based on factors like computational requirements, energy consumption, carbon footprint and financial cost has become a vital area of research. This motivates us to conduct a systematic literature review on Transformer-based LLMs in NLP from the perspective of efficiency. In this survey of 312 articles published between the years 2011 and 2025, efficiency-improvement endeavors have been systematically discussed targeting various aspects such as data curation, model design, model downsizing, and dynamic inferencing. This has been augmented with efficiency considerations in model adaptation strategies like pre-training, fine-tuning, prompt-engineering and Retrieval-Augmented Generation (RAG). Furthermore, a statistical analysis of the articles has been performed, followed by an in-depth evaluation of the efficiency and efficacy of more than 30 renowned NLP models on 13 evaluation benchmarks. This paper offers valuable insights for researchers, professionals, and scholars, and explores the trend of research toward sustainable practices in NLP.
[37] EXAONE 3.0 7.8B Instruction Tuned Language Model
Soyoung An, Kyunghoon Bae, Eunbi Choi, Stanley Jungkyu Choi, Yemuk Choi, Seokhee Hong, Yeonjung Hong, Junwon Hwang, Hyojin Jeon, Gerrard Jeongwon Jo, Hyunjik Jo, Jiyeon Jung, Yountae Jung, Euisoon Kim, Hyosang Kim, Joonkee Kim, Seonghwan Kim, Soyeon Kim, Sunkyoung Kim, Yireun Kim, Youchul Kim, Edward Hwayoung Lee, Haeju Lee, Honglak Lee, Jinsik Lee, Kyungmin Lee, Moontae Lee, Seungjun Lee, Woohyung Lim, Sangha Park, Sooyoun Park, Yongmin Park, Boseong Seo, Sihoon Yang, Heuiyeen Yeen, Kyungjae Yoo, Hyeongu Yun
Main category: cs.CL
TL;DR: LG AI Research releases EXAONE 3.0, a 7.8B parameter instruction-tuned LLM that excels in Korean while maintaining strong general task and reasoning performance, making it competitive with similar-sized open models.
Details
Motivation: To contribute to open research and innovation by releasing a high-quality instruction-tuned language model that demonstrates strong real-world performance, particularly with bilingual proficiency in Korean and English.Method: Developed a family of Large Language Models with instruction-tuning, publicly releasing the 7.8B parameter version. Conducted extensive evaluations across public and in-house benchmarks to assess performance.
Result: EXAONE 3.0 shows highly competitive performance against similar-sized open models, with exceptional Korean language capabilities and compelling performance in general tasks and complex reasoning.
Conclusion: EXAONE 3.0 represents a significant contribution to open LLM research with strong bilingual proficiency and real-world effectiveness, advancing Expert AI development.
Abstract: We introduce EXAONE 3.0 instruction-tuned language model, the first open model in the family of Large Language Models (LLMs) developed by LG AI Research. Among different model sizes, we publicly release the 7.8B instruction-tuned model to promote open research and innovations. Through extensive evaluations across a wide range of public and in-house benchmarks, EXAONE 3.0 demonstrates highly competitive real-world performance with instruction-following capability against other state-of-the-art open models of similar size. Our comparative analysis shows that EXAONE 3.0 excels particularly in Korean, while achieving compelling performance across general tasks and complex reasoning. With its strong real-world effectiveness and bilingual proficiency, we hope that EXAONE keeps contributing to advancements in Expert AI. Our EXAONE 3.0 instruction-tuned model is available at https://huggingface.co/LGAI-EXAONE/EXAONE-3.0-7.8B-Instruct.
[38] Inner-Probe: Discovering Copyright-related Data Generation in LLM Architecture
Qichao Ma, Rui-Jie Zhu, Peiye Liu, Renye Yan, Fahong Zhang, Ling Liang, Meng Li, Zhaofei Yu, Zongwei Wang, Yimao Cai, Tiejun Huang
Main category: cs.CL
TL;DR: Inner-Probe is a lightweight framework that uses multi-head attention results to analyze copyrighted sub-dataset influence on LLM outputs and detect non-copyrighted text, outperforming traditional methods in efficiency and accuracy.
Details
Motivation: Current methods for identifying copyrighted content in LLM outputs have limitations: they can't pinpoint which specific sub-datasets (like particular authors' works) influence outputs, and they treat all training data as copyrighted, ignoring non-copyrighted content.Method: Inner-Probe analyzes multi-head attention (MHA) results during LLM generation rather than just text. It uses a lightweight LSTM network trained on MHA results for sub-dataset contribution analysis, and a global projector with unsupervised contrastive learning for non-copyrighted text detection.
Result: Inner-Probe shows 3x improved efficiency in sub-dataset contribution analysis on Books3, achieves 15.04%-58.7% higher accuracy than baselines on the Pile dataset, and delivers a 0.104 increase in AUC for non-copyrighted data filtering.
Conclusion: The proposed Inner-Probe framework effectively addresses limitations of current copyright detection methods by leveraging MHA results for precise sub-dataset influence analysis and non-copyrighted text detection, offering significant improvements in efficiency and accuracy.
Abstract: Large Language Models (LLMs) utilize extensive knowledge databases and show powerful text generation ability. However, their reliance on high-quality copyrighted datasets raises concerns about copyright infringements in generated texts. Current research often employs prompt engineering or semantic classifiers to identify copyrighted content, but these approaches have two significant limitations: (1) Challenging to identify which specific subdataset (e.g., works from particular authors) influences an LLM’s output. (2) Treating the entire training database as copyrighted, hence overlooking the inclusion of non-copyrighted training data. We propose Inner-Probe, a lightweight framework designed to evaluate the influence of copyrighted sub-datasets on LLM-generated texts. Unlike traditional methods relying solely on text, we discover that the results of multi-head attention (MHA) during LLM output generation provide more effective information. Thus, Inner-Probe performs sub-dataset contribution analysis using a lightweight LSTM-based network trained on MHA results in a supervised manner. Harnessing such a prior, Inner-Probe enables non-copyrighted text detection through a concatenated global projector trained with unsupervised contrastive learning. Inner-Probe demonstrates 3x improved efficiency compared to semantic model training in sub-dataset contribution analysis on Books3, achieves 15.04%-58.7% higher accuracy over baselines on the Pile, and delivers a 0.104 increase in AUC for non-copyrighted data filtering.
[39] Towards Acyclic Preference Evaluation of Language Models via Multiple Evaluators
Zhengyu Hu, Jieyu Zhang, Zhihan Xiong, Alexander Ratner, Kaize Ding, Ranjay Krishna
Main category: cs.CL
TL;DR: PGED addresses cyclic preference issues in LLM evaluation by using multiple model-based evaluators to construct preference graphs, then ensembling and denoising them for acyclic, non-contradictory results.
Details
Motivation: Existing LLM evaluation approaches using a single strong LLM as judge are vulnerable to cyclic preferences (A>B, B>C, C>A), causing contradictory evaluation results that undermine reliability.Method: PGED leverages multiple model-based evaluators to construct preference graphs, then applies graph ensembling and denoising techniques to obtain acyclic, non-contradictory evaluation results with theoretical guarantees.
Result: Extensive experiments on ten benchmarks show PGED’s superiority in model ranking, response selection for test-time scaling, and data selection for fine-tuning. Small LLM evaluators combined via PGED outperform strong single evaluators like Qwen2-72B.
Conclusion: PGED effectively addresses cyclic preference issues in LLM evaluation, enhancing evaluation reliability and improving model performance through ensemble-based preference graph analysis.
Abstract: Despite the remarkable success of Large Language Models (LLMs), evaluating their outputs’ quality regarding preference remains a critical challenge. While existing works usually leverage a strong LLM as the judge for comparing LLMs’ responses pairwise, such a single-evaluator approach is vulnerable to cyclic preference, i.e., output A is better than B, B than C, but C is better than A, causing contradictory evaluation results. To address this, we introduce PGED (Preference Graph Ensemble and Denoising), a novel approach that leverages multiple model-based evaluators to construct preference graphs, and then ensembles and denoises these graphs for acyclic, non-contradictory evaluation results. We provide theoretical guarantees for our framework, demonstrating its efficacy in recovering the ground truth preference structure. Extensive experiments on ten benchmarks demonstrate PGED’s superiority in three applications: 1) model ranking for evaluation, 2) response selection for test-time scaling, and 3) data selection for model fine-tuning. Notably, PGED combines small LLM evaluators (e.g., Llama3-8B, Mistral-7B, Qwen2-7B) to outperform strong ones (e.g., Qwen2-72B), showcasing its effectiveness in enhancing evaluation reliability and improving model performance.
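To make the cycle problem concrete, the sketch below aggregates pairwise votes from several evaluators into a weighted directed graph and greedily removes the weakest edge in each remaining cycle; this is an illustration of graph ensembling and denoising, not PGED's exact procedure or its theoretical guarantees.

```python
# Minimal sketch of ensembling pairwise preferences into a graph and removing
# cycles. Each evaluator contributes votes on ordered pairs; edges keep the net
# vote margin, and cycles are broken by dropping the least-supported edge. This
# greedy denoising step is an illustration, not PGED's exact algorithm.
import networkx as nx
from collections import Counter

def build_preference_dag(votes):
    """votes: iterable of (winner, loser) pairs from all evaluators."""
    margins = Counter()
    for winner, loser in votes:
        margins[(winner, loser)] += 1
        margins[(loser, winner)] -= 1
    graph = nx.DiGraph()
    for (a, b), m in margins.items():
        if m > 0:
            graph.add_edge(a, b, weight=m)        # a preferred to b, net margin m
    while not nx.is_directed_acyclic_graph(graph):
        cycle = nx.find_cycle(graph)              # list of (u, v) edges
        u, v = min(cycle, key=lambda e: graph[e[0]][e[1]]["weight"])
        graph.remove_edge(u, v)                   # drop the weakest edge
    return graph

dag = build_preference_dag([("A", "B"), ("B", "C"), ("C", "A"),
                            ("A", "B"), ("B", "C")])
print(list(nx.topological_sort(dag)))             # e.g. ['A', 'B', 'C']
```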
[40] EXAONE 3.5: Series of Large Language Models for Real-world Use Cases
Soyoung An, Kyunghoon Bae, Eunbi Choi, Kibong Choi, Stanley Jungkyu Choi, Seokhee Hong, Junwon Hwang, Hyojin Jeon, Gerrard Jeongwon Jo, Hyunjik Jo, Jiyeon Jung, Yountae Jung, Hyosang Kim, Joonkee Kim, Seonghwan Kim, Soyeon Kim, Sunkyoung Kim, Yireun Kim, Yongil Kim, Youchul Kim, Edward Hwayoung Lee, Haeju Lee, Honglak Lee, Jinsik Lee, Kyungmin Lee, Woohyung Lim, Sangha Park, Sooyoun Park, Yongmin Park, Sihoon Yang, Heuiyeen Yeen, Hyeongu Yun
Main category: cs.CL
TL;DR: EXAONE 3.5 is LG AI Research’s instruction-tuned language model series (32B, 7.8B, 2.4B) with top-tier instruction following, long-context comprehension, and competitive performance across multiple benchmarks.
Details
Motivation: To develop and release advanced instruction-tuned language models that excel in real-world instruction following, long-context understanding, and general language tasks while being accessible for research use.Method: Developed instruction-tuned language models in three configurations (32B, 7.8B, 2.4B) with specialized training for instruction following and long-context comprehension capabilities.
Result: Achieved highest scores across seven instruction-following benchmarks, top performance in four long-context comprehension benchmarks, and competitive results in nine general benchmarks compared to state-of-the-art open models of similar sizes.
Conclusion: EXAONE 3.5 models demonstrate exceptional capabilities in instruction following and long-context understanding while maintaining competitive general performance, making them valuable open resources for research purposes.
Abstract: This technical report introduces the EXAONE 3.5 instruction-tuned language models, developed and released by LG AI Research. The EXAONE 3.5 language models are offered in three configurations: 32B, 7.8B, and 2.4B. These models feature several standout capabilities: 1) exceptional instruction following capabilities in real-world scenarios, achieving the highest scores across seven benchmarks, 2) outstanding long-context comprehension, attaining the top performance in four benchmarks, and 3) competitive results compared to state-of-the-art open models of similar sizes across nine general benchmarks. The EXAONE 3.5 language models are open to anyone for research purposes and can be downloaded from https://huggingface.co/LGAI-EXAONE. For commercial use, please reach out to the official contact point of LG AI Research: contact_us@lgresearch.ai.
[41] Cultural Palette: Pluralising Culture Alignment via Multi-agent Palette
Jiahao Yuan, Zixiang Di, Shangzixin Zhao, Zhiqing Cui, Hanqing Wang, Guisong Yang, Usman Naseem
Main category: cs.CL
TL;DR: Cultural Palette is a multi-agent framework that treats cultural alignment as an adaptive “color-blending” process, using continent-level agents and a meta agent to dynamically blend cultural representations for country-specific adaptation.
Details
Motivation: LLMs struggle with monocultural biases and capturing nuanced cultural semantics, and existing methods fail to adapt to unknown cultures after fine-tuning.Method: Three-step approach: 1) Create Pentachromatic Cultural Palette Dataset using GPT-4o with Hofstede’s cultural dimensions; 2) Five continent-level alignment agents form specialized cultural communities; 3) Meta Agent uses Cultural MoErges with attention-gated parameter merging to dynamically blend cultural “colors”.
Result: Extensive experiments across various countries show Cultural Palette surpasses existing baselines in cultural alignment.
Conclusion: The framework successfully redefines cultural alignment as an adaptive color-blending process that can handle diverse cultural values and adapt to unknown cultures.
Abstract: Large language models (LLMs) face challenges in aligning with diverse cultural values despite their remarkable performance in generation, which stems from inherent monocultural biases and difficulties in capturing nuanced cultural semantics. Existing methods struggle to adapt to unknown cultures after fine-tuning. Inspired by cultural geography across five continents, we propose Cultural Palette, a multi-agent framework that redefines cultural alignment as an adaptive “color-blending” process for country-specific adaptation. Our approach harnesses cultural geography across five continents through three key steps: First, we synthesize the Pentachromatic Cultural Palette Dataset using GPT-4o, refining continental-level dialogues with Hofstede’s cultural dimensions to establish foundational cultural representations. Second, five continent-level alignment agents form specialized cultural communities that generate region-specific draft responses. Third, a Meta Agent employs Cultural MoErges to dynamically blend these cultural “colors” through attention-gated parameter merging, akin to mixing pigments on a palette, resolving conflicts while preserving cultural nuances to produce the final culturally-aligned response. Extensive experiments across various countries demonstrate that Cultural Palette surpasses existing baselines in cultural alignment.
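Gate-weighted parameter merging can be illustrated with a toy blend of expert state dicts under softmax gates, as below; the dot-product gating and single-matrix experts are stand-ins, since the actual Cultural MoErges mechanism is more involved.

```python
# Toy sketch of gate-weighted parameter merging: blend the state dicts of
# several continent-level expert models using softmax gates derived from a
# query representation. The dot-product gate here is an assumption made for
# illustration, not the paper's attention-gating mechanism.
import torch

def merge_experts(expert_state_dicts, expert_embeddings, query_embedding):
    """expert_state_dicts: list of {param_name: tensor} with identical keys."""
    scores = torch.stack([query_embedding @ e for e in expert_embeddings])
    gates = torch.softmax(scores, dim=0)                  # one weight per expert
    merged = {}
    for name in expert_state_dicts[0]:
        merged[name] = sum(g * sd[name] for g, sd in zip(gates, expert_state_dicts))
    return merged, gates

# Toy usage: three tiny "experts" with a single 2x2 weight matrix each.
experts = [{"w": torch.randn(2, 2)} for _ in range(3)]
embs = [torch.randn(4) for _ in range(3)]
merged, gates = merge_experts(experts, embs, query_embedding=torch.randn(4))
```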
[42] EXAONE Deep: Reasoning Enhanced Language Models
Kyunghoon Bae, Eunbi Choi, Kibong Choi, Stanley Jungkyu Choi, Yemuk Choi, Seokhee Hong, Junwon Hwang, Hyojin Jeon, Kijeong Jeon, Gerrard Jeongwon Jo, Hyunjik Jo, Jiyeon Jung, Hyosang Kim, Joonkee Kim, Seonghwan Kim, Soyeon Kim, Sunkyoung Kim, Yireun Kim, Yongil Kim, Youchul Kim, Edward Hwayoung Lee, Haeju Lee, Honglak Lee, Jinsik Lee, Kyungmin Lee, Sangha Park, Yongmin Park, Sihoon Yang, Heuiyeen Yeen, Sihyuk Yi, Hyeongu Yun
Main category: cs.CL
TL;DR: EXAONE Deep series shows superior reasoning capabilities in math and coding tasks, with smaller models outperforming comparable-sized competitors and the largest model competing with leading open-weight models.
Details
Motivation: To develop high-performance reasoning models that excel in mathematical and coding benchmarks, making them openly available for research purposes.
Method: Training models primarily on reasoning-specialized datasets that incorporate long streams of thought processes to enhance reasoning capabilities.
Result: EXAONE Deep 2.4B and 7.8B outperform other models of comparable size, while EXAONE Deep 32B demonstrates competitive performance against leading open-weight models.
Conclusion: The EXAONE Deep series successfully achieves superior reasoning capabilities across different model sizes and is made openly available for research use.
Abstract: We present EXAONE Deep series, which exhibits superior capabilities in various reasoning tasks, including math and coding benchmarks. We train our models mainly on the reasoning-specialized dataset that incorporates long streams of thought processes. Evaluation results show that our smaller models, EXAONE Deep 2.4B and 7.8B, outperform other models of comparable size, while the largest model, EXAONE Deep 32B, demonstrates competitive performance against leading open-weight models. All EXAONE Deep models are openly available for research purposes and can be downloaded from https://huggingface.co/LGAI-EXAONE.
[43] Through a Compressed Lens: Investigating The Impact of Quantization on Factual Knowledge Recall
Qianli Wang, Mingyang Wang, Nils Feldhus, Simon Ostermann, Yuan Cao, Hinrich Schütze, Sebastian Möller, Vera Schmitt
Main category: cs.CL
TL;DR: Quantization reduces factual knowledge recall in LLMs, especially in smaller models, but BitSandBytes preserves FKR best and quantization remains effective overall.
Details
Motivation: While quantization's effects on LLM capabilities have been studied, factual knowledge recall (FKR) - how LLMs access stored knowledge - remains underexplored, despite being critical for model performance.
Method: Comprehensive experiments using three common quantization techniques at different bit widths, combined with interpretability-driven analyses on knowledge memorization and latent multi-hop reasoning tasks.
Result: Quantization typically causes information loss, reducing FKR capacity, especially in smaller models. However, reduced-bit quantization doesn’t always hurt performance and can sometimes enhance FKR. BitSandBytes best preserves original model’s FKR.
Conclusion: Despite variability across models and methods, quantization causes only modest performance degradation and remains an effective compression strategy for LLMs.
Abstract: Quantization methods are widely used to accelerate inference and streamline the deployment of large language models (LLMs). Although quantization’s effects on various LLM capabilities have been extensively studied, one critical area remains underexplored: factual knowledge recall (FKR), the process by which LLMs access stored knowledge. To this end, we conduct comprehensive experiments using three common quantization techniques at distinct bit widths, in conjunction with interpretability-driven analyses on two tasks, knowledge memorization and latent multi-hop reasoning. We show that quantization typically results in information loss within LLMs, consequently diminishing their capacity for FKR. This effect is particularly amplified in smaller models within the same architectural families. However, models quantized at reduced bit precision do not consistently exhibit inferior performance and occasionally quantization may even enhance model FKR. We find that BitSandBytes demonstrates highest preservation of the original full-precision model’s FKR. Despite variability across models and methods, quantization causes modest performance degradation and remains an effective compression strategy.
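The kind of probe behind such findings is easy to reproduce with public tooling. Below is a minimal, illustrative sketch (not the paper's code) that loads the same checkpoint in full precision and in 4-bit via Hugging Face's BitsAndBytesConfig and compares the log-probability each assigns to a factual completion; the model id and probe are placeholders.

```python
# Minimal sketch (not the paper's setup): compare how much log-probability a
# full-precision vs. a 4-bit quantized model assigns to a factual completion.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

MODEL = "meta-llama/Llama-2-7b-hf"                 # placeholder checkpoint
PROBES = [("The capital of France is", " Paris")]  # toy cloze-style factual probe

def answer_logprob(model, tok, prompt, answer):
    """Sum of log-probabilities the model assigns to the gold answer tokens."""
    ids = tok(prompt + answer, return_tensors="pt").input_ids.to(model.device)
    prompt_len = tok(prompt, return_tensors="pt").input_ids.shape[1]
    with torch.no_grad():
        logprobs = model(ids).logits.log_softmax(-1)
    # the logits at position t-1 predict the token at position t
    return sum(logprobs[0, t - 1, ids[0, t]].item() for t in range(prompt_len, ids.shape[1]))

tok = AutoTokenizer.from_pretrained(MODEL)
full = AutoModelForCausalLM.from_pretrained(MODEL, torch_dtype=torch.float16, device_map="auto")
quant_cfg = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.float16)
quant = AutoModelForCausalLM.from_pretrained(MODEL, quantization_config=quant_cfg, device_map="auto")

for prompt, answer in PROBES:
    print(prompt, "| fp16:", round(answer_logprob(full, tok, prompt, answer), 3),
          "| 4-bit:", round(answer_logprob(quant, tok, prompt, answer), 3))
```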
[44] FedSEA-LLaMA: A Secure, Efficient and Adaptive Federated Splitting Framework for Large Language Models
Zishuai Zhang, Hainan Zhang, Weihua Li, Qinnan Zhang, Jin Dong, Yongxin Tong, Zhiming Zheng
Main category: cs.CL
TL;DR: FedSEA-LLaMA is a secure, efficient, and adaptive federated split learning framework for LLaMA2 that addresses privacy, communication overhead, and adaptability challenges in federated LLM training.
Details
Motivation: Private data is valuable for improving LLMs but is scattered across data silos, and LLMs' high computational demands limit deployment in federated environments. Existing federated split models face security vulnerabilities, high communication overhead from sequential training/inference, and lack adaptability to downstream tasks.
Method: 1) Inject Gaussian noise into forward-pass hidden states for secure end-to-end vector transmission; 2) Use attention-mask compression and KV cache collaboration to reduce communication costs; 3) Allow dynamic adjustment of partition points for input/output blocks based on task requirements.
Result: FedSEA-LLaMA maintains performance comparable to centralized LLaMA2 on NLU, summarization, and conversational QA tasks, achieves up to 8x speedups in training and inference, and demonstrates effectiveness against privacy attacks with adaptable partition points.
Conclusion: FedSEA-LLaMA successfully addresses key challenges in federated split learning for LLMs by providing secure transmission, efficient communication, and task adaptability while maintaining model performance.
Abstract: Private data holds promise for improving LLMs due to its high quality, but its scattered distribution across data silos and the high computational demands of LLMs limit their deployment in federated environments. To address this, the transformer-based federated split models are proposed, which offload most model parameters to the server (or distributed clients) while retaining only a small portion on the client to ensure data privacy. Despite this design, they still face three challenges: 1) Peer-to-peer key encryption struggles to secure transmitted vectors effectively; 2) The auto-regressive nature of LLMs means that federated split learning can only train and infer sequentially, causing high communication overhead; 3) Fixed partition points lack adaptability to downstream tasks. In this paper, we introduce FedSEA-LLaMA, a Secure, Efficient, and Adaptive Federated splitting framework based on LLaMA2. First, we inject Gaussian noise into forward-pass hidden states to enable secure end-to-end vector transmission. Second, we employ attention-mask compression and KV cache collaboration to reduce communication costs, accelerating training and inference. Third, we allow users to dynamically adjust the partition points for input/output blocks based on specific task requirements. Experiments on natural language understanding, summarization, and conversational QA tasks show that FedSEA-LLaMA maintains performance comparable to centralized LLaMA2 and achieves up to 8x speedups in training and inference. Further analysis of privacy attacks and different partition points also demonstrates the effectiveness of FedSEA-LLaMA in security and adaptability.
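The first of the three ingredients, perturbing the hidden states before they leave the client, can be pictured in a few lines of PyTorch. This is an illustrative sketch under my own assumptions (a fixed noise scale, a generic client-side embedding block), not the authors' implementation.

```python
# Illustrative sketch: add Gaussian noise to the hidden states a split-learning
# client would transmit to the server (not the authors' code).
import torch
import torch.nn as nn

class NoisySplitClient(nn.Module):
    def __init__(self, embed: nn.Module, noise_std: float = 0.05):
        super().__init__()
        self.embed = embed          # client-side layers kept on device
        self.noise_std = noise_std  # assumed fixed scale; the paper may calibrate it

    def forward(self, input_ids: torch.Tensor) -> torch.Tensor:
        h = self.embed(input_ids)                             # local forward pass
        if self.training:
            h = h + torch.randn_like(h) * self.noise_std      # perturb before transmission
        return h                                              # only noised vectors leave the device

client = NoisySplitClient(nn.Embedding(32000, 4096))
hidden = client(torch.randint(0, 32000, (1, 16)))             # what gets sent to the server
print(hidden.shape)
```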
[45] C-VARC: A Large-Scale Chinese Value Rule Corpus for Value Alignment of Large Language Models
Ping Wu, Guobin Shen, Dongcheng Zhao, Yuwei Wang, Yiting Dong, Yu Shi, Enmeng Lu, Feifei Zhao, Yi Zeng
Main category: cs.CL
TL;DR: Proposes a hierarchical Chinese value framework and C-VARC corpus for culturally-adaptive LLM value alignment, addressing Western bias in current evaluation methods.
Details
Motivation: Current LLM value evaluation suffers from Western cultural bias, incomplete domestic frameworks, and lack of scalable scenario generation methods, making evaluations costly and inadequate across diverse cultural contexts.
Method: Develops hierarchical Chinese value framework (3 dimensions, 12 core values, 50 derived values), constructs C-VARC corpus with 250k+ value rules via human annotation, and creates 400k rule-based moral dilemma scenarios.
Result: C-VARC scenarios show clearer value boundaries and greater diversity; 7 LLMs preferred C-VARC options in 70.5%+ cases; 5 Chinese annotators showed 87.5% alignment; framework captures nuanced value prioritization across 17 LLMs.
Conclusion: Establishes culturally-adaptive benchmarking framework for comprehensive value evaluation and alignment with Chinese characteristics, addressing cultural bias in current LLM value assessment.
Abstract: Ensuring that Large Language Models (LLMs) align with mainstream human values and ethical norms is crucial for the safe and sustainable development of AI. Current value evaluation and alignment are constrained by Western cultural bias and incomplete domestic frameworks reliant on non-native rules; furthermore, the lack of scalable, rule-driven scenario generation methods makes evaluations costly and inadequate across diverse cultural contexts. To address these challenges, we propose a hierarchical value framework grounded in core Chinese values, encompassing three main dimensions, 12 core values, and 50 derived values. Based on this framework, we construct a large-scale Chinese Value Rule Corpus (C-VARC) containing over 250,000 value rules enhanced and expanded through human annotation. Experimental results demonstrate that scenarios guided by C-VARC exhibit clearer value boundaries and greater content diversity compared to those produced through direct generation. In the evaluation across six sensitive themes (e.g., surrogacy, suicide), seven mainstream LLMs preferred C-VARC generated options in over 70.5% of cases, while five Chinese human annotators showed an 87.5% alignment with C-VARC, confirming its universality, cultural relevance, and strong alignment with Chinese values. Additionally, we construct 400,000 rule-based moral dilemma scenarios that objectively capture nuanced distinctions in conflicting value prioritization across 17 LLMs. Our work establishes a culturally-adaptive benchmarking framework for comprehensive value evaluation and alignment, representing Chinese characteristics.
[46] Esoteric Language Models
Subham Sekhar Sahoo, Zhihan Yang, Yash Akhauri, Johnna Liu, Deepansha Singh, Zhoujun Cheng, Zhengzhong Liu, Eric Xing, John Thickstun, Arash Vahdat
Main category: cs.CL
TL;DR: Eso-LMs fuse autoregressive and masked diffusion models, enabling KV caching for MDMs while preserving parallel generation, achieving state-of-the-art speed-quality tradeoffs with 14-65× faster inference than standard MDMs.
Details
Motivation: Masked Diffusion Models (MDMs) underperform autoregressive models in perplexity and lack key inference-time efficiency features like KV caching. There's a need to combine the strengths of both paradigms while overcoming their limitations.
Method: Eso-LMs fuse AR and MDM paradigms using causal attention instead of bidirectional attention, enabling exact likelihood computation and KV caching for MDMs while preserving parallel generation. Uses optimized sampling schedule.
Result: Achieves new SOTA on speed-quality Pareto frontier for unconditional generation. On long contexts: 14-65× faster inference than standard MDMs, 3-4× faster than prior semi-autoregressive approaches.
Conclusion: Eso-LMs successfully combine AR and MDM advantages, enabling efficient parallel generation with KV caching and exact likelihood computation, significantly advancing inference efficiency for diffusion language models.
Abstract: Diffusion-based language models offer a compelling alternative to autoregressive (AR) models by enabling parallel and controllable generation. Within this family, Masked Diffusion Models (MDMs) currently perform best but still underperform AR models in perplexity and lack key inference-time efficiency features, most notably KV caching. We introduce Eso-LMs, a new family of models that fuses AR and MDM paradigms, smoothly interpolating between their perplexities while overcoming their respective limitations. Unlike prior work, which uses transformers with bidirectional attention as MDM denoisers, we exploit the connection between MDMs and Any-Order autoregressive models and adopt causal attention. This design lets us compute the exact likelihood of MDMs for the first time and, crucially, enables us to introduce KV caching for MDMs while preserving parallel generation for the first time, significantly improving inference efficiency. Combined with an optimized sampling schedule, Eso-LMs achieves a new state of the art on the speed-quality Pareto frontier for unconditional generation. On long contexts, it yields $\mathbf{14 - 65{}\times}$ faster inference than standard MDMs and $\mathbf{3 - 4{}\times}$ faster inference than prior semi-autoregressive approaches. We provide code, model checkpoints, and video tutorials on the project page: http://s-sahoo.github.io/Eso-LMs
[47] Modeling the One-to-Many Property in Open-Domain Dialogue with LLMs
Jing Yang Lee, Kong-Aik Lee, Woon-Seng Gan
Main category: cs.CL
TL;DR: The paper proposes a two-stage framework for open-domain dialogue generation that explicitly models the one-to-many property by decomposing it into multi-response generation and preference-based selection, using a new dataset and achieving improved diversity and quality.
Details
Motivation: Most modern LLM-based dialogue agents fail to explicitly model the one-to-many property of open-domain dialogue, where multiple appropriate responses exist for a single context, despite evidence that modeling this property boosts response diversity.
Method: Decompose open-domain dialogue generation into two tasks: Multi-Response Generation (MRG) to produce diverse high-quality responses, and Preference-based Selection (PS) to choose the best response. Introduce o2mDial dataset, propose in-context learning and instruction-tuning strategies, develop novel evaluation metrics for MRG, and create a model-based approach for PS.
Result: Applying the two-stage framework to smaller LLMs improves overall response diversity while maintaining contextual coherence, enhancing response quality by up to 90%, bringing smaller models closer to larger model performance.
Conclusion: Explicitly modeling the one-to-many property through a two-stage MRG and PS framework significantly improves open-domain dialogue generation, demonstrating that smaller LLMs can achieve better diversity and quality when properly structured to handle multiple plausible responses.
Abstract: Open-domain Dialogue (OD) exhibits a one-to-many (o2m) property, whereby multiple appropriate responses exist for a single dialogue context. Despite prior research showing that modeling this property boosts response diversity, most modern LLM-based dialogue agents do not explicitly do so. In this work, we model the o2m property of OD in LLMs by decomposing OD generation into two key tasks: Multi-Response Generation (MRG) and Preference-based Selection (PS), which entail generating a set of n semantically and lexically diverse high-quality responses for a given dialogue context, followed by selecting a single response based on human preference, respectively. To facilitate MRG and PS, we introduce o2mDial, a dialogue corpus explicitly designed to capture the o2m property by featuring multiple plausible responses for each context. Leveraging o2mDial, we propose new in-context learning and instruction-tuning strategies, as well as novel evaluation metrics for MRG, alongside a model-based approach for PS. Empirical results demonstrate that applying the proposed two-stage framework to smaller LLMs for OD generation enhances overall response diversity while maintaining contextual coherence, improving response quality by up to 90%, bringing them closer to the performance of larger models.
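To make the two-stage decomposition concrete, here is a toy sketch: the MRG step samples n candidates with a standard generate call, and a placeholder scorer stands in for the model-based preference selector (PS). The model id and the scoring heuristic are assumptions, not components from the paper.

```python
# Sketch of the MRG -> PS decomposition (illustrative only).
from transformers import AutoModelForCausalLM, AutoTokenizer

GEN = "Qwen/Qwen2.5-0.5B-Instruct"   # placeholder dialogue model
tok = AutoTokenizer.from_pretrained(GEN)
model = AutoModelForCausalLM.from_pretrained(GEN)

def multi_response_generation(context: str, n: int = 5) -> list[str]:
    """MRG: sample n lexically/semantically diverse candidate responses."""
    ids = tok(context, return_tensors="pt").input_ids
    out = model.generate(ids, do_sample=True, top_p=0.95, temperature=1.0,
                         num_return_sequences=n, max_new_tokens=64,
                         pad_token_id=tok.eos_token_id)
    return [tok.decode(o[ids.shape[1]:], skip_special_tokens=True) for o in out]

def preference_select(context: str, candidates: list[str]) -> str:
    """PS: toy scorer; the paper uses a model-based preference selector instead."""
    score = lambda r: len(set(r.split()))   # stand-in proxy for a preference score
    return max(candidates, key=score)

context = "A: I just got back from a week in Kyoto.\nB:"
print(preference_select(context, multi_response_generation(context)))
```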
[48] Decide less, communicate more: On the construct validity of end-to-end fact-checking in medicine
Sebastian Joseph, Lily Chen, Barry Wei, Michael Mackert, Iain J. Marshall, Paul Pu Liang, Ramez Kouzy, Byron C. Wallace, Junyi Jessy Li
Main category: cs.CL
TL;DR: Medical fact-checking systems face fundamental challenges in connecting social media claims to clinical evidence, requiring interactive communication rather than end-to-end automation.
Details
Motivation: Despite technological advances in fact-checking, medical applications remain underutilized due to high-stakes decisions, vast literature, and inadequate medical literacy among users, creating a need for evidence-based verification systems.
Method: First study examining how clinical experts verify real social media claims by synthesizing medical evidence, developed with expert input to establish an upper-bound performance benchmark.
Result: Reveals fundamental challenges: difficulty connecting social media claims to clinical trial evidence, ambiguities in underspecified claims with mismatched intentions, and inherently subjective veracity labels in medical contexts.
Conclusion: Medical fact-checking should be approached as an interactive communication problem rather than an end-to-end automated process, requiring different evaluation frameworks.
Abstract: Technological progress has led to concrete advancements in tasks that were regarded as challenging, such as automatic fact-checking. Interest in adopting these systems for public health and medicine has grown due to the high-stakes nature of medical decisions and challenges in critically appraising a vast and diverse medical literature. Evidence-based medicine connects to every individual, and yet the nature of it is highly technical, rendering the medical literacy of the majority of users inadequate to navigate the domain sufficiently. Such problems with medical communication ripen the ground for end-to-end fact-checking agents: check a claim against current medical literature and return with an evidence-backed verdict. And yet, such systems remain largely unused. In this position paper, developed with expert input, we present the first study examining how clinical experts verify real claims from social media by synthesizing medical evidence. In searching for this upper-bound, we reveal fundamental challenges in end-to-end fact-checking when applied to medicine: Difficulties connecting claims in the wild to scientific evidence in the form of clinical trials; ambiguities in underspecified claims mixed with mismatched intentions; and inherently subjective veracity labels. We argue that fact-checking should be approached and evaluated as an interactive communication problem, rather than an end-to-end process.
[49] EXAONE 4.0: Unified Large Language Models Integrating Non-reasoning and Reasoning Modes
Kyunghoon Bae, Eunbi Choi, Kibong Choi, Stanley Jungkyu Choi, Yemuk Choi, Kyubeen Han, Seokhee Hong, Junwon Hwang, Taewan Hwang, Joonwon Jang, Hyojin Jeon, Kijeong Jeon, Gerrard Jeongwon Jo, Hyunjik Jo, Jiyeon Jung, Euisoon Kim, Hyosang Kim, Jihoon Kim, Joonkee Kim, Seonghwan Kim, Soyeon Kim, Sunkyoung Kim, Yireun Kim, Yongil Kim, Youchul Kim, Edward Hwayoung Lee, Gwangho Lee, Haeju Lee, Honglak Lee, Jinsik Lee, Kyungmin Lee, Sangha Park, Young Min Paik, Yongmin Park, Youngyong Park, Sanghyun Seo, Sihoon Yang, Heuiyeen Yeen, Sihyuk Yi, Hyeongu Yun
Main category: cs.CL
TL;DR: EXAONE 4.0 introduces a dual-mode AI system with Non-reasoning and Reasoning modes, combining usability with advanced reasoning, featuring agentic tool use and expanded multilingual support (English, Korean, Spanish).
Details
Motivation: To bridge the gap between user-friendly AI (EXAONE 3.5) and advanced reasoning capabilities (EXAONE Deep), while preparing for the agentic AI era with essential features like tool use and expanded language support.
Method: Integration of Non-reasoning mode for usability and Reasoning mode for advanced capabilities, with agentic tool use functionality. Offers two model sizes: 32B for high performance and 1.2B for on-device applications.
Result: EXAONE 4.0 demonstrates superior performance compared to open-weight models in its class and remains competitive against frontier-class models. Models are publicly available for research via Hugging Face.
Conclusion: EXAONE 4.0 successfully combines usability and reasoning capabilities while introducing agentic features and expanded multilingual support, positioning it as a competitive solution for both research and practical applications.
Abstract: This technical report introduces EXAONE 4.0, which integrates a Non-reasoning mode and a Reasoning mode to achieve both the excellent usability of EXAONE 3.5 and the advanced reasoning abilities of EXAONE Deep. To pave the way for the agentic AI era, EXAONE 4.0 incorporates essential features such as agentic tool use, and its multilingual capabilities are extended to support Spanish in addition to English and Korean. The EXAONE 4.0 model series consists of two sizes: a mid-size 32B model optimized for high performance, and a small-size 1.2B model designed for on-device applications. The EXAONE 4.0 demonstrates superior performance compared to open-weight models in its class and remains competitive even against frontier-class models. The models are publicly available for research purposes and can be easily downloaded via https://huggingface.co/LGAI-EXAONE.
[50] NeedleChain: Measuring Intact Context Comprehension Capability of Large Language Models
Hyeonseok Moon, Heuiseok Lim
Main category: cs.CL
TL;DR: The paper argues current LLM benchmarks overestimate context understanding by including irrelevant content, introduces NeedleChain to test true integration of all evidence, and shows even advanced models fail with 200 tokens of relevant text.
Details
Motivation: Current benchmarks for long-context LLMs embed substantial query-irrelevant content, which shifts evaluation toward retrieval rather than true integration of all provided information. This leads to overestimation of models' actual context-understanding capabilities.
Method: Introduces NeedleChain benchmark with three variants differing in required comprehension order, plus a parallel benchmark based on needle-in-a-haystack paradigm. Also proposes ROPE contraction, a training-free strategy to encourage models to reflect all available information.
Result: When context consists entirely of query-relevant text, even advanced models like GPT-4o fail to reliably integrate inputs as short as 200 tokens. NeedleChain enables more comprehensive assessment of context understanding capabilities.
Conclusion: Current benchmarks overestimate LLMs’ true context-understanding ability. NeedleChain provides a more rigorous evaluation framework, and ROPE contraction points to new directions for improving reliable reasoning over context through full-context integration.
Abstract: Recent reports suggest that LLMs can handle increasingly long contexts. However, many existing benchmarks for context understanding embed substantial query-irrelevant content, which shifts evaluation toward retrieving relevant snippets rather than fully integrating all provided information. Under this setting, we view that current benchmarks can overestimate the true context-understanding ability of LLMs. In particular, we demonstrate that when the context consists entirely of query-relevant text, even advanced models such as GPT-4o fail to reliably integrate inputs as short as 200 tokens. To evaluate this capability more rigorously, we introduce NeedleChain, a benchmark designed to test whether models can faithfully incorporate all given evidence. NeedleChain includes three variants that differ in the required order of comprehension, along with a parallel benchmark based on the needle-in-a-haystack (NIAH) paradigm. By comparing these variants, NeedleChain enables a more comprehensive assessment of context understanding. We further propose a training-free strategy that encourages models to reflect all available information, ROPE contraction, highlighting the importance of full-context integration and pointing to new directions for improving reliable reasoning over context.
[51] RAG-BioQA: A Retrieval-Augmented Generation Framework for Long-Form Biomedical Question Answering
Lovely Yeswanth Panchumarthi, Sumalatha Saleti, Sai Prasad Gudari, Atharva Negi, Praveen Raj Budime, Harsit Upadhya
Main category: cs.CL
TL;DR: RAG-BioQA is a retrieval-augmented generation framework for long-form biomedical QA that combines BioBERT embeddings with FAISS indexing and fine-tuned FLAN-T5, outperforming existing methods on comprehensive medical explanations.
Details
Motivation: The rapid growth of biomedical literature makes it difficult to find specific medical information, and current QA systems only provide short answers without the comprehensive explanations needed for clinical decision-making.
Method: Uses retrieval-augmented generation with BioBERT embeddings and FAISS indexing for retrieval, plus LoRA fine-tuned FLAN-T5 for answer generation. Trained on 181k QA pairs from PubMedQA, MedDialog, and MedQuAD, and evaluated four retrieval strategies: dense retrieval (FAISS), BM25, ColBERT, and MonoT5.
Result: Domain-adapted dense retrieval outperformed zero-shot neural re-rankers, with best configuration achieving 0.24 BLEU-1 and 0.29 ROUGE-1. Fine-tuning improved BERTScore by 81% over the base model.
Conclusion: RAG-BioQA effectively addresses the need for comprehensive biomedical explanations, with domain-adapted retrieval being crucial for performance. The framework is released to support reproducible research in biomedical QA.
Abstract: The rapid growth of biomedical literature creates challenges in acquiring specific medical information. Current biomedical question-answering systems primarily focus on short-form answers, failing to provide comprehensive explanations necessary for clinical decision-making. We present RAG-BioQA, a retrieval-augmented generation framework for long-form biomedical question answering. Our system integrates BioBERT embeddings with FAISS indexing for retrieval and a LoRA fine-tuned FLAN-T5 model for answer generation. We train on 181k QA pairs from PubMedQA, MedDialog, and MedQuAD, and evaluate on a held-out PubMedQA test set. We compare four retrieval strategies: dense retrieval (FAISS), BM25, ColBERT, and MonoT5. Our results show that domain-adapted dense retrieval outperforms zero-shot neural re-rankers, with the best configuration achieving 0.24 BLEU-1 and 0.29 ROUGE-1. Fine-tuning improves BERTScore by 81% over the base model. We release our framework to support reproducible biomedical QA research.
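The retrieve-then-generate pipeline can be sketched with public components: mean-pooled BioBERT embeddings, a flat FAISS index, and FLAN-T5 for generation. The toy corpus below is made up and the sketch skips the LoRA fine-tuning the paper applies, so treat it as illustrative only.

```python
# Illustrative dense-retrieval + generation pipeline (not the released framework).
import faiss
import torch
from transformers import AutoTokenizer, AutoModel, AutoModelForSeq2SeqLM

enc_tok = AutoTokenizer.from_pretrained("dmis-lab/biobert-base-cased-v1.1")
encoder = AutoModel.from_pretrained("dmis-lab/biobert-base-cased-v1.1")
gen_tok = AutoTokenizer.from_pretrained("google/flan-t5-base")
generator = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-base")

def embed(texts):
    """Mean-pooled BioBERT sentence embeddings."""
    batch = enc_tok(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = encoder(**batch).last_hidden_state            # (B, T, H)
    mask = batch.attention_mask.unsqueeze(-1)
    return ((hidden * mask).sum(1) / mask.sum(1)).numpy()

corpus = ["Metformin lowers hepatic glucose production.",      # toy passages
          "ACE inhibitors are first-line agents for hypertension in diabetes."]
doc_vecs = embed(corpus)
index = faiss.IndexFlatIP(doc_vecs.shape[1])                   # inner-product index
index.add(doc_vecs)

def answer(question, k=2):
    _, idx = index.search(embed([question]), k)                # dense retrieval
    context = " ".join(corpus[i] for i in idx[0])
    prompt = f"Answer the biomedical question using the context.\nContext: {context}\nQuestion: {question}"
    out = generator.generate(**gen_tok(prompt, return_tensors="pt"), max_new_tokens=128)
    return gen_tok.decode(out[0], skip_special_tokens=True)

print(answer("How does metformin reduce blood glucose?"))
```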
[52] MTSQL-R1: Towards Long-Horizon Multi-Turn Text-to-SQL via Agentic Training
Taicheng Guo, Hai Wang, ChaoChun Liu, Mohsen Golalikhani, Xin Chen, Xiangliang Zhang, Chandan K. Reddy
Main category: cs.CL
TL;DR: MTSQL-R1 is an agentic training framework for multi-turn Text-to-SQL that uses execution feedback and dialogue memory for iterative verification and refinement, outperforming existing baselines.
Details
Motivation: Existing multi-turn Text-to-SQL systems treat the task as simple text translation using short-horizon approaches without execution verification, leading to non-executable or incoherent SQL outputs.
Method: Frames the task as a Markov Decision Process where an agent interacts with a database for execution feedback and dialogue memory for coherence verification, performing iterative propose→execute→verify→refine cycles.
Result: Experiments on COSQL and SPARC benchmarks show MTSQL-R1 consistently outperforms strong baselines, demonstrating the effectiveness of environment-driven verification and memory-guided refinement.
Conclusion: The framework highlights the importance of execution feedback and persistent dialogue memory for conversational semantic parsing, with full implementation details to be released for community research.
Abstract: Multi-turn Text-to-SQL aims to translate a user’s conversational utterances into executable SQL while preserving dialogue coherence and grounding to the target schema. However, most existing systems only regard this task as a simple text translation task and follow a short-horizon paradigm, generating a query per turn without execution, explicit verification, and refinement, which leads to non-executable or incoherent outputs. We present MTSQL-R1, an agentic training framework for long-horizon multi-turn Text-to-SQL. We cast the task as a Markov Decision Process (MDP) in which an agent interacts with (i) a database for execution feedback and (ii) a persistent dialogue memory for coherence verification, performing an iterative propose -> execute -> verify -> refine cycle until all checks pass. Experiments on COSQL and SPARC demonstrate that MTSQL-R1 consistently outperforms strong baselines, highlighting the importance of environment-driven verification and memory-guided refinement for conversational semantic parsing. Full recipes (including code, trained models, logs, reasoning trajectories, etc.) will be released after the internal review to contribute to community research.
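The propose -> execute -> verify -> refine loop becomes concrete with sqlite3 as the execution environment and a placeholder `propose_sql` callable standing in for the trained agent; this is an illustrative control flow, not the MTSQL-R1 training recipe.

```python
# Sketch of an execute/verify/refine loop for multi-turn text-to-SQL (illustrative).
import sqlite3

def run_sql(conn, sql):
    """Execute a candidate query and return (rows, error) as environment feedback."""
    try:
        return conn.execute(sql).fetchall(), None
    except sqlite3.Error as exc:
        return None, str(exc)

def answer_turn(conn, utterance, dialogue_memory, propose_sql, max_refinements=3):
    """propose -> execute -> verify -> refine, grounded in execution feedback."""
    feedback = None
    for _ in range(max_refinements):
        sql = propose_sql(utterance, dialogue_memory, feedback)  # agent policy (placeholder)
        rows, err = run_sql(conn, sql)
        if err is None:                                          # verification: executable query
            dialogue_memory.append((utterance, sql))             # persist for coherence checks
            return sql, rows
        feedback = err                                           # refine with the error message
    return sql, None

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE singer(name TEXT, country TEXT)")
conn.execute("INSERT INTO singer VALUES ('Adele', 'UK')")
memory = []
naive = lambda u, m, fb: "SELECT name FROM singer WHERE country = 'UK'"  # stand-in proposer
print(answer_turn(conn, "Which singers are from the UK?", memory, naive))
```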
[53] Optimizing Retrieval for RAG via Reinforcement Learning
Jiawei Zhou, Lei Chen
Main category: cs.CL
TL;DR: R3 is a retrieval framework optimized for RAG using reinforcement learning that enables retrievers to self-improve within RAG environments, achieving significant performance gains over existing methods.
Details
Motivation: As RAG becomes more widespread, retrieval shifts from human browsing to AI reasoning, creating complex search environments where relevance is hard to pre-define. Existing retrievers rely on static supervised fine-tuning that struggles to adapt to diverse RAG environments.
Method: Propose R3, a Retrieval framework optimized for RAG through Reinforcement learning (RL). Uses RL training paradigm allowing retrievers to explore and self-improve within given RAG environments, automating learning with minimal manual effort.
Result: R3 improves RAG performance by 5.2% over original retrievers and surpasses state-of-the-art retrievers by 4.9%. Achieves comparable results to LLM-augmented retrieval and RAG systems built on post-trained or instruction-tuned LLMs. Efficient and practical, requiring only 4 GPUs and completing training within a single day.
Conclusion: R3 provides an effective RL-based solution for adapting retrievers to complex RAG environments, offering significant performance improvements while being computationally efficient and practical for real-world deployment.
Abstract: As retrieval-augmented generation (RAG) becomes more widespread, the role of retrieval is shifting from retrieving information for human browsing to retrieving context for AI reasoning. This shift creates more complex search environments, where relevance is difficult to pre-define. Existing retrievers rely on supervised fine-tuning (SFT) with human labels or synthetic data, resulting in static relevance that struggles to adapt to diverse RAG environments. To address this challenge, we propose R3, a Retrieval framework optimized for RAG through Reinforcement learning (RL). Specifically, we adopt an RL training paradigm that enables the retriever to explore and self-improve within given RAG environments, automating the learning process with minimal manual experimentation or tuning effort. Extensive experiments across diverse tasks demonstrate that R3 improves RAG performance by 5.2% over the original retriever and surpasses state-of-the-art retrievers by 4.9%, while achieving comparable results to LLM-augmented retrieval and RAG systems built on post-trained or instruction-tuned LLMs. It is both efficient and practical, requiring only 4 GPUs and completing training within a single day.
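Stripped to its core, the RL formulation treats document selection as a stochastic policy and rewards it with a downstream RAG signal. The REINFORCE-style loop below is a toy illustration with synthetic embeddings and a binary reward standing in for answer quality; the paper's environment and objective are richer.

```python
# Toy REINFORCE-style loop for optimizing a retriever with a RAG reward (illustrative).
import torch

torch.manual_seed(0)
dim, n_docs = 64, 10
doc_emb = torch.randn(n_docs, dim)                     # frozen document embeddings (toy)
query_head = torch.nn.Linear(dim, dim)                 # trainable retriever component
opt = torch.optim.Adam(query_head.parameters(), lr=1e-3)

def rag_reward(retrieved: int, gold: int) -> float:
    """Stand-in for downstream answer quality in the RAG environment."""
    return 1.0 if retrieved == gold else 0.0

for step in range(500):
    gold = step % n_docs
    query = doc_emb[gold] + 0.3 * torch.randn(dim)     # query correlated with its gold doc
    scores = doc_emb @ query_head(query)               # retrieval logits
    dist = torch.distributions.Categorical(logits=scores)
    action = dist.sample()                             # retrieve one document
    loss = -dist.log_prob(action) * rag_reward(action.item(), gold)
    opt.zero_grad(); loss.backward(); opt.step()       # policy-gradient update
```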
[54] Why Do Multilingual Reasoning Gaps Emerge in Reasoning Language Models?
Deokhyung Kang, Seonjeong Hwang, Daehui Kim, Hyounghun Kim, Gary Geunbae Lee
Main category: cs.CL
TL;DR: The multilingual reasoning gap in RLMs stems from understanding failures, not reasoning failures. Selective Translation bridges this gap by translating only problematic inputs, achieving near full-translation performance with ~20% translation.
Details
Motivation: RLMs perform better in high-resource languages than low-resource ones, creating a multilingual reasoning gap. While recent efforts address this gap, its underlying causes remain unexplored, limiting effective mitigation strategies.
Method: 1) Identify that the gap stems from understanding failures (inability to translate multilingual inputs to English, the dominant reasoning language). 2) Evaluate detection methods for understanding failures. 3) Propose Selective Translation: incorporate English translation into reasoning traces only when understanding failures are detected.
Result: Understanding failures are detectable to a meaningful extent (supervised approaches perform best). Selective Translation with Qwen3-4B substantially bridges the multilingual reasoning gap, achieving near full-translation performance while translating only about 20% of inputs.
Conclusion: Failures in language understanding are the primary driver of the multilingual reasoning gap, not reasoning failures. These understanding failures can be detected and selectively mitigated, clarifying the gap’s origin and suggesting a path toward more equitable multilingual reasoning.
Abstract: Reasoning language models (RLMs) achieve strong performance on complex reasoning tasks, yet they still exhibit a multilingual reasoning gap, performing better in high-resource languages than in low-resource ones. While recent efforts have been made to address this gap, its underlying causes remain largely unexplored. In this work, we show that this gap primarily stems from failures in language understanding-specifically, the model’s inability to translate multilingual inputs into the language dominating its reasoning traces (typically English). As identifying understanding failures can enable targeted mitigation of the gap, we evaluate a range of detection methods and find that understanding failures are detectable to a meaningful extent, with supervised approaches performing best. Building on this, we propose Selective Translation, a strategy that incorporates an English translation into the initial reasoning trace only when an understanding failure is detected. Experimental results using Qwen3-4B show that Selective Translation substantially bridges the multilingual reasoning gap, achieving near full-translation performance while translating only about 20% of inputs. Together, our results show that failures in language understanding are the primary driver of the multilingual reasoning gap and can be detected and selectively mitigated, clarifying its origin and suggesting a path toward more equitable multilingual reasoning. Our code and data are publicly available at https://github.com/deokhk/RLM_analysis
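Selective Translation is mostly control flow: run a failure detector on the multilingual input and only prepend an English translation to the reasoning trace when the detector fires. The detector and translator below are crude placeholders for the supervised components the paper trains.

```python
# Sketch of Selective Translation (illustrative; detector and translator are stand-ins).
def understanding_failure(question: str) -> bool:
    """Placeholder for the paper's supervised failure detector."""
    return not question.isascii()          # toy heuristic: flag non-Latin scripts

def translate_to_english(question: str) -> str:
    """Placeholder for an MT system or the RLM's own translation step."""
    return f"[English translation of: {question}]"

def build_reasoning_prompt(question: str) -> str:
    trace_prefix = ""
    if understanding_failure(question):    # translate only when understanding is at risk
        trace_prefix = f"First, restate the problem in English: {translate_to_english(question)}\n"
    return trace_prefix + f"Problem: {question}\nLet's reason step by step."

# Korean example: "A box holds 12 apples each; how many apples are in 5 boxes?"
print(build_reasoning_prompt("한 상자에 사과가 12개씩 들어 있다. 상자 5개에는 사과가 모두 몇 개인가?"))
```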
[55] SpiderGen: Towards Procedure Generation For Carbon Life Cycle Assessments with Generative AI
Anupama Sitaraman, Bharathan Balaji, Yuvraj Agarwal
Main category: cs.CL
TL;DR: SpiderGen is an LLM-based workflow that generates Product Category Rules Process Flow Graphs for Life Cycle Assessments, reducing costs from $25K+ to under $1 and time from 21-person days to under 10 minutes.
Details
Motivation: Climate change concerns require tools to estimate environmental impact of consumer goods through Life Cycle Assessments, which are currently expensive and time-consuming to produce.
Method: SpiderGen integrates traditional LCA taxonomy/methodology with LLM reasoning capabilities to generate graphical representations of LCA process information (PCR PFGs).
Result: Achieves 65% F1-Score vs 53% for one-shot prompting, produces accurate LCA process information with minor errors, and significantly reduces costs and time compared to traditional LCA methods.
Conclusion: SpiderGen demonstrates potential to dramatically reduce human effort and costs for carbon impact estimation while maintaining reasonable accuracy, though some challenges remain with detail scope differences.
Abstract: Investigating the effects of climate change and global warming caused by GHG emissions has been a key concern worldwide. These emissions are largely contributed to by the production, use and disposal of consumer products. Thus, it is important to build tools to estimate the environmental impact of consumer goods, an essential part of which is conducting Life Cycle Assessments (LCAs). LCAs specify and account for the appropriate processes involved with the production, use, and disposal of the products. We present SpiderGen, an LLM-based workflow which integrates the taxonomy and methodology of traditional LCA with the reasoning capabilities and world knowledge of LLMs to generate graphical representations of the key procedural information used for LCA, known as Product Category Rules Process Flow Graphs (PCR PFGs). We additionally evaluate the output of SpiderGen by comparing it with 65 real-world LCA documents. We find that SpiderGen provides accurate LCA process information that is either fully correct or has minor errors, achieving an F1-Score of 65% across 10 sample data points, as compared to 53% using a one-shot prompting method. We observe that the remaining errors occur primarily due to differences in detail between LCA documents, as well as differences in the “scope” of which auxiliary processes must also be included. We also demonstrate that SpiderGen performs better than several baseline techniques, such as chain-of-thought prompting and one-shot prompting. Finally, we highlight SpiderGen’s potential to reduce the human effort and costs for estimating carbon impact, as it is able to produce LCA process information for less than $1 USD in under 10 minutes as compared to the status quo LCA, which can cost over $25,000 USD and take up to 21 person-days.
[56] W2S-AlignTree: Weak-to-Strong Inference-Time Alignment for Large Language Models via Monte Carlo Tree Search
Zhenyu Ding, Yuhao Wang, Tengyue Xiao, Haoying Wang, Caigui Jiang, Ning Ding
Main category: cs.CL
TL;DR: W2S-AlignTree is a plug-and-play inference-time alignment framework that combines Monte Carlo Tree Search with Weak-to-Strong Generalization to align LLM outputs with human preferences without parameter modification.
Details
Motivation: LLMs often produce outputs misaligned with human preferences due to weak supervision limitations. Existing training-time alignment methods like RLHF are expensive, not scalable, and lack dynamic control during inference, creating need for scalable, adaptable alignment mechanisms.
Method: Formulates LLM alignment as optimal heuristic search using Monte Carlo Tree Search within generative search tree. Leverages weak model’s real-time step-level signals as alignment proxies with Entropy-Aware exploration mechanism to guide strong model generation without parameter modification.
Result: Consistently outperforms baselines across sentiment generation, summarization, and instruction-following tasks. Notably improves Llama3-8B performance from 1.89 to 2.19 (15.9% relative improvement) on summarization.
Conclusion: W2S-AlignTree provides effective, scalable inference-time alignment framework that enables fine-grained control over LLM outputs without expensive retraining, addressing key limitations of existing alignment methods.
Abstract: Large Language Models (LLMs) demonstrate impressive capabilities, yet their outputs often suffer from misalignment with human preferences due to the inadequacy of weak supervision and a lack of fine-grained control. Training-time alignment methods like Reinforcement Learning from Human Feedback (RLHF) face prohibitive costs in expert supervision and inherent scalability limitations, offering limited dynamic control during inference. Consequently, there is an urgent need for scalable and adaptable alignment mechanisms. To address this, we propose W2S-AlignTree, a pioneering plug-and-play inference-time alignment framework that synergistically combines Monte Carlo Tree Search (MCTS) with the Weak-to-Strong Generalization paradigm for the first time. W2S-AlignTree formulates LLM alignment as an optimal heuristic search problem within a generative search tree. By leveraging the weak model’s real-time, step-level signals as alignment proxies and introducing an Entropy-Aware exploration mechanism, W2S-AlignTree enables fine-grained guidance during the strong model’s generation without modifying its parameters. The approach dynamically balances exploration and exploitation in high-dimensional generation search trees. Experiments across controlled sentiment generation, summarization, and instruction-following show that W2S-AlignTree consistently outperforms strong baselines. Notably, W2S-AlignTree raises the performance of Llama3-8B from 1.89 to 2.19, a relative improvement of 15.9% on the summarization task.
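The core mechanic, using a weak model's step-level scores to steer a strong model's decoding, can be reduced to a guided best-first search over partial generations. The sketch below replaces full MCTS with a simple beam-like expansion and treats both models and the step scorer as placeholders; it is a toy illustration of the idea, not the W2S-AlignTree algorithm.

```python
# Simplified weak-to-strong guided search (illustrative; not full MCTS).
import heapq

def weak_step_score(partial_text: str) -> float:
    """Placeholder for the weak model's step-level alignment signal."""
    return -abs(len(partial_text.split()) - 20)     # toy: prefer ~20-word continuations

def strong_propose(partial_text: str, k: int = 3) -> list[str]:
    """Placeholder for k candidate continuations sampled from the frozen strong model."""
    return [partial_text + f" option{i}" for i in range(k)]

def guided_search(prompt: str, depth: int = 4, beam: int = 3) -> str:
    frontier = [(-weak_step_score(prompt), prompt)]  # min-heap entries keyed on negative score
    for _ in range(depth):
        candidates = []
        for _, text in heapq.nsmallest(beam, frontier):
            for nxt in strong_propose(text):                      # strong model expands
                candidates.append((-weak_step_score(nxt), nxt))   # weak model scores
        frontier = candidates
    return min(frontier)[1]                          # best-scored completion

print(guided_search("Summarize the article:"))
```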
[57] Learning the Boundary of Solvability: Aligning LLMs to Detect Unsolvable Problems
Dengyun Peng, Qiguang Chen, Bofei Liu, Jiannan Guan, Libo Qin, Zheng Yan, Jinhao Liu, Jianshu Zhang, Wanxiang Che
Main category: cs.CL
TL;DR: The paper introduces UnsolvableQA dataset and UnsolvableRL framework to help LLMs distinguish between objectively unsolvable queries (inherent contradictions) and subjectively unsolvable ones (beyond model capability), improving detection and reasoning accuracy.
Details
Motivation: Current LLMs conflate objective unsolvability (inherent contradictions) with subjective capability limitations, leading to hallucinations where models confidently answer unsolvable queries. There's a need to help LLMs properly distinguish between these two types of unsolvability.
Method: 1) Construct UnsolvableQA dataset using “Reverse Construction” - systematically injecting logical contradictions into valid reasoning chains to create unsolvable questions. 2) Develop UnsolvableRL, a reinforcement learning framework that balances objective unsolvability detection with calibrated confidence under capability limits.
Result: Achieves near-perfect unsolvability detection (>90% detection rate) and boosts solvable reasoning accuracy from 43.4% to 69.4% on Qwen3-4B-Instruct. Identifies data-training interaction: strict alignment without unsolvable data causes Capability Collapse, but acts as regularizer for rigor when unsolvable data is included.
Conclusion: The proposed approach effectively improves LLM reliability by distinguishing objective unsolvability from capability limitations, with the dataset and framework available for public use. The interaction between alignment constraints and unsolvable data is crucial for overall robustness.
Abstract: Ensuring large language model (LLM) reliability requires distinguishing objective unsolvability (inherent contradictions) from subjective capability limitations (tasks exceeding model competence). Current LLMs often conflate these dimensions, leading to hallucinations in which they return confident answers to inherently unsolvable queries. To address this issue, we propose a multi-domain dataset containing both solvable and unsolvable questions, UnsolvableQA, together with an alignment framework, UnsolvableRL. First, we construct UnsolvableQA by “Reverse Construction” that systematically injects logical contradictions into otherwise valid reasoning chains. Second, we introduce UnsolvableRL, a reinforcement learning paradigm that balances objective unsolvability detection with calibrated confidence under capability limits. Empirically, our approach achieves near-perfect unsolvability detection (>90% detection rate) and boosts solvable reasoning accuracy from 43.4% to 69.4% on Qwen3-4B-Instruct. Crucially, we identify a data-training interaction: strict alignment constraints induce Capability Collapse without unsolvable data, but act as a regularizer for rigor when such data are included, thereby improving overall robustness. Our code and data are available at https://github.com/sfasfaffa/unsolvableQA .
[58] Dual LoRA: Enhancing LoRA with Magnitude and Direction Updates
Yixing Xu, Chao Li, Xuanwu Yin, Spandan Tiwari, Dong Li, Ashish Sirasao, Emad Barsoum
Main category: cs.CL
TL;DR: Dual LoRA improves LoRA performance by separating low-rank matrices into magnitude and direction groups with ReLU and sign functions to better simulate full fine-tuning parameter updates.
Details
Motivation: Standard LoRA often has unsatisfactory performance due to its low-rank assumption, which doesn't properly simulate the parameter updating process of full fine-tuning based on gradient-based optimization algorithms.
Method: Separates low-rank matrices into two groups: magnitude group (controls whether/how far to update parameters) with ReLU function, and direction group (decides forward/backward movement) with sign function. This better simulates full fine-tuning parameter updates.
Result: Consistently outperforms LoRA and state-of-the-art variants across various NLP tasks (natural language understanding, commonsense reasoning) on RoBERTa, DeBERTa, and LLaMA-1/2/3 models with same number of trainable parameters.
Conclusion: Dual LoRA effectively improves LoRA performance by incorporating inductive bias that better simulates full fine-tuning parameter updates while maintaining parameter efficiency.
Abstract: Low-rank adaptation (LoRA) is one of the most popular methods among parameter-efficient fine-tuning (PEFT) methods to adapt pre-trained large language models (LLMs) to specific downstream tasks. However, the model trained based on LoRA often has an unsatisfactory performance due to its low-rank assumption. In this paper, we propose a novel method called Dual LoRA to improve the performance by incorporating an inductive bias into the original LoRA. Specifically, we separate low-rank matrices into two groups: the magnitude group to control whether or not and how far we should update a parameter and the direction group to decide whether this parameter should move forward or backward, to better simulate the parameter updating process of the full fine-tuning based on gradient-based optimization algorithms. We show that this can be simply achieved by adding a ReLU function to the magnitude group and a sign function to the direction group. We conduct several experiments over a wide range of NLP tasks, including natural language understanding (NLU) and commonsense reasoning datasets on RoBERTa, DeBERTa, and LLaMA-1/2/3 as baseline models. The results show that we consistently outperform LoRA and its state-of-the-art variants with the same number of trainable parameters.
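Reading the method description literally suggests an update of roughly ΔW = ReLU(B_m A_m) ⊙ sign(B_d A_d) on top of the frozen weight. The module below is my interpretation for illustration only; initialization, scaling, and how the paper handles the non-differentiability of sign (e.g., a straight-through estimator) are not covered here.

```python
# Sketch of a Dual-LoRA-style layer as described in the summary (illustrative).
import torch
import torch.nn as nn
import torch.nn.functional as F

class DualLoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base                                   # frozen pretrained layer
        for p in self.base.parameters():
            p.requires_grad_(False)
        d_out, d_in = base.weight.shape
        # magnitude group: how far (if at all) each weight should move
        self.A_mag = nn.Parameter(torch.randn(r, d_in) * 0.01)
        self.B_mag = nn.Parameter(torch.randn(d_out, r) * 0.01)
        # direction group: whether each weight moves forward or backward
        self.A_dir = nn.Parameter(torch.randn(r, d_in) * 0.01)
        self.B_dir = nn.Parameter(torch.randn(d_out, r) * 0.01)
        self.scale = alpha / r

    def delta_weight(self) -> torch.Tensor:
        magnitude = torch.relu(self.B_mag @ self.A_mag)    # non-negative step sizes
        direction = torch.sign(self.B_dir @ self.A_dir)    # -1 / 0 / +1 per weight
        return self.scale * magnitude * direction

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + F.linear(x, self.delta_weight())

layer = DualLoRALinear(nn.Linear(768, 768))
print(layer(torch.randn(2, 768)).shape)
```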
[59] LexGenius: An Expert-Level Benchmark for Large Language Models in Legal General Intelligence
Wenjin Liu, Haoran Luo, Xin Feng, Xiang Ji, Lijuan Zhou, Rui Mao, Jiapu Wang, Shirui Pan, Erik Cambria
Main category: cs.CL
TL;DR: LexGenius is a Chinese legal benchmark for evaluating legal general intelligence in LLMs using a Dimension-Task-Ability framework with 7 dimensions, 11 tasks, and 20 abilities, showing current LLMs lag behind human legal professionals.
Details
Motivation: Existing legal benchmarks are result-oriented and fail to systematically evaluate legal intelligence in LLMs, hindering the development of legal general intelligence (GI) that simulates legal expert capabilities.
Method: Created LexGenius benchmark using recent legal cases and exam questions to generate multiple-choice questions, with manual and LLM reviews to reduce data leakage risks and ensure accuracy through multiple rounds of checks.
Result: Evaluation of 12 state-of-the-art LLMs revealed significant disparities across legal intelligence abilities, with even the best LLMs performing below human legal professionals.
Conclusion: LexGenius can effectively assess LLMs’ legal intelligence abilities and enhance the development of legal general intelligence, with the benchmark publicly available for research use.
Abstract: Legal general intelligence (GI) refers to artificial intelligence (AI) that encompasses legal understanding, reasoning, and decision-making, simulating the expertise of legal experts across domains. However, existing benchmarks are result-oriented and fail to systematically evaluate the legal intelligence of large language models (LLMs), hindering the development of legal GI. To address this, we propose LexGenius, an expert-level Chinese legal benchmark for evaluating legal GI in LLMs. It follows a Dimension-Task-Ability framework, covering seven dimensions, eleven tasks, and twenty abilities. We use the recent legal cases and exam questions to create multiple-choice questions with a combination of manual and LLM reviews to reduce data leakage risks, ensuring accuracy and reliability through multiple rounds of checks. We evaluate 12 state-of-the-art LLMs using LexGenius and conduct an in-depth analysis. We find significant disparities across legal intelligence abilities for LLMs, with even the best LLMs lagging behind human legal professionals. We believe LexGenius can assess the legal intelligence abilities of LLMs and enhance legal GI development. Our project is available at https://github.com/QwenQKing/LexGenius.
[60] Training-free Context-adaptive Attention for Efficient Long Context Modeling
Zeng You, Yaofo Chen, Shuhai Zhang, Zhijie Qiu, Tingyu Wu, Yingjian Li, Yaowei Wang, Mingkui Tan
Main category: cs.CL
TL;DR: TCA-Attention is a training-free sparse attention mechanism that selectively attends to informative tokens for efficient long-context inference, achieving 2.8× speedup and 61% KV cache reduction at 128K context length.
Details
Motivation: The quadratic complexity of self-attention in LLMs creates computational and memory challenges for long sequences. Existing sparse attention and KV cache compression methods have limitations like fixed patterns, inability to handle both prefilling/decoding stages, or requiring additional training.
Method: TCA-Attention uses two lightweight phases: 1) offline calibration to determine head-specific sparsity budgets via single forward pass, and 2) online token selection that adaptively retains core context tokens using lightweight redundancy metric. It’s training-free and requires no parameter updates or architectural changes.
Result: Achieves 2.8× speedup and reduces KV cache by 61% at 128K context length while maintaining performance comparable to full attention across various benchmarks. Theoretical analysis shows bounded approximation error.
Conclusion: TCA-Attention provides a practical plug-and-play solution for efficient long-context inference that accelerates both prefilling and decoding while reducing KV cache memory footprint without requiring training or architectural modifications.
Abstract: Large Language Models (LLMs) have demonstrated remarkable capabilities across a wide range of natural language processing tasks. These capabilities stem primarily from the self-attention mechanism, which enables modeling of long-range dependencies. However, the quadratic complexity of self-attention with respect to sequence length poses significant computational and memory challenges, especially as sequence length extends to extremes. While various sparse attention and KV cache compression methods have been proposed to improve efficiency, they often suffer from limitations such as reliance on fixed patterns, inability to handle both prefilling and decoding stages, or the requirement for additional training. In this paper, we propose Training-free Context-adaptive Attention (TCA-Attention), a training-free sparse attention mechanism that selectively attends to only the informative tokens for efficient long-context inference. Our method consists of two lightweight phases: i) an offline calibration phase that determines head-specific sparsity budgets via a single forward pass, and ii) an online token selection phase that adaptively retains core context tokens using a lightweight redundancy metric. TCA-Attention provides a unified solution that accelerates both prefilling and decoding while reducing KV cache memory footprint, without requiring parameter updates or architectural changes. Theoretical analysis shows that our approach maintains bounded approximation error. Extensive experiments demonstrate that TCA-Attention achieves a 2.8$\times$ speedup and reduces KV cache by 61% at 128K context length while maintaining performance comparable to full attention across various benchmarks, offering a practical plug-and-play solution for efficient long-context inference.
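The online token-selection phase can be pictured as a greedy filter over cached keys: walk the sequence, keep tokens whose keys are sufficiently novel relative to what is already kept, and stop at the head's budget. The similarity threshold and budget below are stand-ins for the calibrated, head-specific quantities in the paper.

```python
# Greedy redundancy-based token selection for one attention head (illustrative).
import torch
import torch.nn.functional as F

def select_tokens(keys: torch.Tensor, budget: int, sim_threshold: float = 0.9) -> torch.Tensor:
    """keys: (T, d) cached keys for one head; returns indices of tokens to keep."""
    T = keys.shape[0]
    if T <= budget:
        return torch.arange(T)
    normed = F.normalize(keys, dim=-1)
    kept = [0]                                            # always keep the first token
    for t in range(1, T):
        redundancy = (normed[t] @ normed[kept].T).max()   # similarity to already-kept keys
        if redundancy < sim_threshold:                    # keep only sufficiently novel tokens
            kept.append(t)
        if len(kept) == budget:
            break
    return torch.tensor(kept)

keys = torch.randn(1024, 64)                              # toy key cache for a 1024-token context
keep = select_tokens(keys, budget=256)
print(f"kept {keep.numel()} of {keys.shape[0]} cached tokens")
```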
[61] Navigating the Reality Gap: Privacy-Preserving On-Device Continual Adaptation of ASR for Clinical Telephony
Darshil Chauhan, Adityasinh Solanki, Vansh Patel, Kanav Kapoor, Ritvik Jain, Aditya Bansal, Pratik Narang, Dhruv Kumar
Main category: cs.CL
TL;DR: On-device continual adaptation using LoRA and Experience Replay improves ASR for clinical telephony in resource-constrained regions, reducing WER degradation from 40.94% to 17.1% relative improvement while maintaining privacy.
Details
Motivation: ASR can help with clinical documentation in resource-constrained regions, but deployment is hindered by a "Reality Gap" between lab performance and noisy real-world clinical audio, plus privacy/resource constraints. Adaptation is essential for clinical telephony where speech variability and transcription errors impact clinical workflows.
Method: On-device continual adaptation framework using Low-Rank Adaptation (LoRA) with multi-domain Experience Replay (ER) for stabilization. Investigates trade-offs between data-driven and parameter-driven approaches, and uses Absolute Fisher importance estimation to handle high-variance gradients in clinical telephony speech.
Result: Multi-domain Experience Replay achieves 17.1% relative improvement in target WER and reduces catastrophic forgetting by 55% compared to naive adaptation. The approach addresses the 40.94% WER degradation observed when applying robust multilingual models to rural clinical telephony speech.
Conclusion: Acoustic adaptation is fundamental for healthcare ASR usability and cannot be bypassed by language models alone. On-device continual adaptation with stabilization strategies enables practical deployment in privacy-sensitive clinical telephony settings.
Abstract: Automatic Speech Recognition (ASR) holds immense potential to assist in clinical documentation and patient report generation, particularly in resource-constrained regions. However, deployment is currently hindered by a technical deadlock: a severe “Reality Gap” between laboratory performance and noisy, real-world clinical audio, coupled with strict privacy and resource constraints. Such adaptation is essential for clinical telephony systems, where patient speech is highly variable and transcription errors can directly impact downstream clinical workflows. We quantify this gap, showing that a robust multilingual model (IndicWav2Vec) degrades up to a 40.94% WER on rural clinical telephony speech from India, rendering it unusable. We demonstrate consistent improvements on these helpline interactions without transmitting raw patient data off-device via an on-device continual adaptation framework using Low-Rank Adaptation (LoRA). We conduct an investigative study of stabilization strategies, characterizing the trade-offs between data-driven and parameter-driven approaches. Our results demonstrate that multi-domain Experience Replay (ER) yields the primary performance gains, achieving a 17.1% relative improvement in target WER and reducing catastrophic forgetting by 55% compared to naive adaptation. Furthermore, we investigate a stabilized importance estimation strategy (Absolute Fisher) to ensure robust convergence against the high-variance gradients common in clinical telephony speech. Finally, we verify via a domain-specific spot check that acoustic adaptation is a fundamental prerequisite for usability in healthcare settings which cannot be bypassed by language models alone.
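The stabilization recipe, LoRA adapters updated on the new domain while replaying earlier-domain utterances, fits in a few lines with the peft library. The checkpoint (a generic wav2vec2 stand-in for IndicWav2Vec), batch contents, and mixing ratio are assumptions for illustration, and buffer management is simplified.

```python
# Sketch of LoRA adaptation with multi-domain experience replay (illustrative).
import random
import torch
from transformers import AutoModelForCTC
from peft import LoraConfig, get_peft_model

model = AutoModelForCTC.from_pretrained("facebook/wav2vec2-base-960h")  # placeholder checkpoint
lora = LoraConfig(r=8, lora_alpha=16, target_modules=["q_proj", "v_proj"])
model = get_peft_model(model, lora)                 # only adapter weights stay trainable
opt = torch.optim.AdamW([p for p in model.parameters() if p.requires_grad], lr=1e-4)

replay_buffer = []                                  # labelled batches from earlier domains

def training_step(batch):
    loss = model(**batch).loss                      # CTC loss from (input_values, labels)
    opt.zero_grad(); loss.backward(); opt.step()

def adapt(new_domain_batches, replay_ratio=0.5):
    for batch in new_domain_batches:
        training_step(batch)                        # adapt to the target clinical domain
        if replay_buffer and random.random() < replay_ratio:
            training_step(random.choice(replay_buffer))   # replay to limit forgetting
        replay_buffer.append(batch)                 # naive buffer growth for illustration
```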
[62] CTTA-T: Continual Test-Time Adaptation for Text Understanding via Teacher-Student with a Domain-aware and Generalized Teacher
Tianlun Liu, Zhiliang Tian, Zhen Huang, Xingzhi Zhou, Wanlong Yu, Tianle Liu, Feng Liu, Dongsheng Li
Main category: cs.CL
TL;DR: CTTA-T: A continual test-time adaptation framework for text understanding that handles sequential unseen domains using a domain-aware teacher-student approach with dropout-driven consistency filtering and incremental PCA for dynamic cross-domain semantic accumulation.
Details
Motivation: Current continual test-time adaptation (CTTA) methods for text understanding struggle with error accumulation across domains and poor generalization to unobserved domains. Existing approaches either discard useful information through noise-filtering or fail to achieve adaptive accumulation of historical domains.Method: Proposes CTTA-T framework with teacher-student architecture: 1) Refine-then-filter approach using dropout-driven consistency to calibrate predictions and remove unreliable guidance, 2) Domain-aware teacher construction via incremental PCA to dynamically accumulate cross-domain semantics and track domain shifts, enabling adaptive accumulation of historical knowledge.
Result: Experiments show CTTA-T outperforms baseline methods, demonstrating effectiveness in handling sequential unseen domains in continual test-time adaptation for text understanding.
Conclusion: CTTA-T successfully addresses the adaptation-generalization trade-off in continual test-time adaptation for text understanding by combining dropout-driven consistency filtering with dynamic cross-domain semantic accumulation via incremental PCA, enabling effective handling of evolving target domains.
Abstract: Text understanding often suffers from domain shifts. To handle testing domains, domain adaptation (DA) is trained to adapt to a fixed and observed testing domain; a more challenging paradigm, test-time adaptation (TTA), cannot access the testing domain during training and online adapts to the testing samples during testing, where the samples are from a fixed domain. We aim to explore a more practical and underexplored scenario, continual test-time adaptation (CTTA) for text understanding, which involves a sequence of testing (unobserved) domains in testing. Current CTTA methods struggle in reducing error accumulation over domains and enhancing generalization to handle unobserved domains: 1) Noise-filtering reduces accumulated errors but discards useful information, and 2) accumulating historical domains enhances generalization, but it is hard to achieve adaptive accumulation. In this paper, we propose a CTTA-T (continual test-time adaptation for text understanding) framework adaptable to evolving target domains: it adopts a teacher-student framework, where the teacher is domain-aware and generalized for evolving domains. To improve teacher predictions, we propose a refine-then-filter strategy based on dropout-driven consistency, which calibrates predictions and removes unreliable guidance. For the adaptation-generalization trade-off, we construct a domain-aware teacher by dynamically accumulating cross-domain semantics via incremental PCA, which continuously tracks domain shifts. Experiments show CTTA-T outperforms baselines.
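A rough sketch of the dropout-driven refine-then-filter step: run several stochastic forward passes with dropout left on, average the predictions as the refined teacher output, and keep only samples on which the passes agree. The model, number of passes, and agreement threshold below are illustrative assumptions, not the paper's settings.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

model = nn.Sequential(nn.Linear(256, 128), nn.ReLU(), nn.Dropout(0.1), nn.Linear(128, 4))

@torch.no_grad()
def refine_then_filter(x, n_passes: int = 8, agree_thresh: float = 0.75):
    """Return calibrated teacher probabilities and a mask of reliable samples."""
    model.train()                       # keep dropout active for stochastic passes
    probs = torch.stack([F.softmax(model(x), dim=-1) for _ in range(n_passes)])
    mean_probs = probs.mean(dim=0)      # refined (averaged) teacher prediction
    votes = probs.argmax(dim=-1)        # (n_passes, batch) hard votes per pass
    top = votes.mode(dim=0).values      # majority label per sample
    agreement = (votes == top).float().mean(dim=0)
    keep = agreement >= agree_thresh    # filter out unreliable guidance
    return mean_probs, keep

x = torch.randn(16, 256)                # placeholder batch of text features
teacher_probs, keep = refine_then_filter(x)
print(f"kept {keep.sum().item()}/{len(keep)} samples for student updates")
```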
[63] Multi-hop Reasoning via Early Knowledge Alignment
Yuxin Wang, Shicheng Fang, Bo Wang, Qi Luo, Xuanjing Huang, Yining Zheng, Xipeng Qiu
Main category: cs.CL
TL;DR: Early Knowledge Alignment (EKA) improves iterative RAG systems by aligning LLMs with retrieval corpus before question decomposition, reducing cascading errors and improving efficiency.
Details
Motivation: Existing iterative RAG systems decompose questions without considering available retrieval corpus, leading to inefficient retrieval and cascading errors in multi-hop reasoning.Method: EKA aligns LLMs with retrieval set before planning, providing contextually relevant knowledge early to establish stronger reasoning foundation and reduce unnecessary exploration.
Result: Significant improvements in retrieval precision, reduced cascading errors, enhanced performance and efficiency across six standard RAG datasets; effective as training-free inference strategy.
Conclusion: EKA advances state-of-the-art in iterative RAG by demonstrating critical interplay between structured reasoning and efficient exploration in reinforcement learning-augmented frameworks.
Abstract: Retrieval-Augmented Generation (RAG) has emerged as a powerful paradigm for Large Language Models (LLMs) to address knowledge-intensive queries requiring domain-specific or up-to-date information. To handle complex multi-hop questions that are challenging for single-step retrieval, iterative RAG approaches incorporating reinforcement learning have been proposed. However, existing iterative RAG systems typically plan to decompose questions without leveraging information about the available retrieval corpus, leading to inefficient retrieval and reasoning chains that cascade into suboptimal performance. In this paper, we introduce Early Knowledge Alignment (EKA), a simple but effective module that aligns LLMs with the retrieval set before planning in iterative RAG systems with contextually relevant retrieved knowledge. Extensive experiments on six standard RAG datasets demonstrate that by establishing a stronger reasoning foundation, EKA significantly improves retrieval precision, reduces cascading errors, and enhances both performance and efficiency. Our analysis from an entropy perspective demonstrates that incorporating early knowledge reduces unnecessary exploration during the reasoning process, enabling the model to focus more effectively on relevant information subsets. Moreover, EKA proves effective as a versatile, training-free inference strategy that scales seamlessly to large models. Generalization tests across diverse datasets and retrieval corpora confirm the robustness of our approach. Overall, EKA advances the state-of-the-art in iterative RAG systems while illuminating the critical interplay between structured reasoning and efficient exploration in reinforcement learning-augmented frameworks. The code is released at \href{https://github.com/yxzwang/EarlyKnowledgeAlignment}{Github}.
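Read as a training-free inference step, EKA amounts to retrieving corpus passages for the original question and prepending them to the planning prompt before the LLM decomposes it. The tiny TF-IDF retriever and prompt wording below are stand-ins for whatever retriever and LLM an actual iterative RAG system would use.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

corpus = [
    "Marie Curie won Nobel Prizes in Physics and Chemistry.",
    "The Nobel Prize in Physics is awarded in Stockholm.",
    "Pierre Curie was a French physicist.",
]

def early_knowledge(question: str, k: int = 2) -> str:
    """Retrieve top-k passages for the *original* question, before decomposition."""
    vec = TfidfVectorizer().fit(corpus + [question])
    sims = cosine_similarity(vec.transform([question]), vec.transform(corpus))[0]
    top = sims.argsort()[::-1][:k]
    return "\n".join(corpus[i] for i in top)

question = "In which city did Marie Curie receive her physics prize?"
planning_prompt = (
    "Background knowledge from the retrieval corpus:\n"
    f"{early_knowledge(question)}\n\n"
    f"Question: {question}\n"
    "Decompose the question into sub-queries, using the background above."
)
# planning_prompt would now be sent to the iterative RAG planner / LLM.
print(planning_prompt)
```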
[64] AlignAR: Generative Sentence Alignment for Arabic-English Parallel Corpora of Legal and Literary Texts
Baorong Huang, Ali Asiri
Main category: cs.CL
TL;DR: AlignAR introduces a generative sentence alignment method and new Arabic-English parallel dataset with simple legal and complex literary texts, showing LLM-based approaches outperform traditional methods on challenging alignment tasks.
Details
Motivation: Arabic-English parallel corpora are scarce and existing datasets mainly consist of simple one-to-one mappings, lacking the complexity needed to properly evaluate alignment methods for real-world translation scenarios.Method: AlignAR is a generative sentence alignment method; the paper also contributes a new Arabic-English dataset with both simple legal and complex literary parallel texts, including a “Hard” subset with fewer one-to-one mappings to better test alignment robustness.
Result: LLM-based approaches demonstrated superior robustness, achieving 85.5% F1-score (9% improvement over previous methods), while traditional alignment methods showed limitations on the challenging “Hard” subset.
Conclusion: “Easy” datasets lack discriminatory power for alignment evaluation; the proposed AlignAR method and new dataset enable better assessment of alignment methods, with LLM-based approaches showing promising performance on complex alignment tasks.
Abstract: High-quality parallel corpora are essential for Machine Translation (MT) research and translation teaching. However, Arabic-English resources remain scarce and existing datasets mainly consist of simple one-to-one mappings. In this paper, we present AlignAR, a generative sentence alignment method, and a new Arabic-English dataset comprising simple legal and complex literary parallel texts. Our evaluation demonstrates that “Easy” datasets lack the discriminatory power to fully assess alignment methods. By reducing one-to-one mappings in our “Hard” subset, we exposed the limitations of traditional alignment methods. In contrast, LLM-based approaches demonstrated better robustness, achieving an overall F1-score of 85.5%, a nearly 9% improvement over previous methods. Our datasets and codes are open-sourced at https://github.com/XXX.
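For reference, sentence-alignment quality is usually scored as F1 over predicted versus gold alignment links, including many-to-one links. The link representation below is an assumption for illustration, not AlignAR's released evaluation code.

```python
def alignment_f1(pred_links, gold_links):
    """Each link is a (source_ids, target_ids) pair, e.g. ((1, 2), (1,)) for a 2-1 alignment."""
    pred = {(tuple(s), tuple(t)) for s, t in pred_links}
    gold = {(tuple(s), tuple(t)) for s, t in gold_links}
    tp = len(pred & gold)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

gold = [((0,), (0,)), ((1, 2), (1,)), ((3,), (2, 3))]   # includes 2-1 and 1-2 links
pred = [((0,), (0,)), ((1,), (1,)), ((3,), (2, 3))]     # misses the 2-1 merge
print(alignment_f1(pred, gold))  # (0.67, 0.67, 0.67)
```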
[65] LLM-Guided Exemplar Selection for Few-Shot Wearable-Sensor Human Activity Recognition
Elsen Ronando, Sozo Inoue
Main category: cs.CL
TL;DR: LLM-Guided Exemplar Selection framework improves few-shot HAR by using semantic reasoning to select better exemplars, outperforming traditional methods.
Details
Motivation: Current HAR methods rely on large labeled datasets and purely geometric exemplar selection, which fails to distinguish similar wearable sensor activities like walking, walking upstairs, and walking downstairs.Method: Incorporates LLM-generated knowledge prior capturing feature importance, inter-class confusability, and exemplar budget multipliers to guide exemplar scoring and selection. Combines these priors with margin-based validation cues, PageRank centrality, hubness penalization, and facility-location optimization.
Result: Achieves macro F1-score of 88.78% on UCI-HAR dataset under strict few-shot conditions, outperforming classical approaches like random sampling, herding, and k-center.
Conclusion: LLM-derived semantic priors, when integrated with structural and geometric cues, provide a stronger foundation for selecting representative sensor exemplars in few-shot wearable-sensor HAR.
Abstract: In this paper, we propose an LLM-Guided Exemplar Selection framework to address a key limitation in state-of-the-art Human Activity Recognition (HAR) methods: their reliance on large labeled datasets and purely geometric exemplar selection, which often fail to distinguish similar wearable sensor activities such as walking, walking upstairs, and walking downstairs. Our method incorporates semantic reasoning via an LLM-generated knowledge prior that captures feature importance, inter-class confusability, and exemplar budget multipliers, and uses it to guide exemplar scoring and selection. These priors are combined with margin-based validation cues, PageRank centrality, hubness penalization, and facility-location optimization to obtain a compact and informative set of exemplars. Evaluated on the UCI-HAR dataset under strict few-shot conditions, the framework achieves a macro F1-score of 88.78%, outperforming classical approaches such as random sampling, herding, and k-center. The results show that LLM-derived semantic priors, when integrated with structural and geometric cues, provide a stronger foundation for selecting representative sensor exemplars in few-shot wearable-sensor HAR.
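The facility-location stage of the selection pipeline can be pictured as greedy coverage maximization over a similarity matrix, with the LLM-derived priors entering as per-sample weights or per-class budgets. The similarities, weights, and budget below are synthetic placeholders, not the paper's configuration.

```python
import numpy as np

def greedy_facility_location(sim, budget, weights=None):
    """Pick `budget` exemplars maximizing weighted coverage sum_i w_i * max_{j in S} sim[i, j]."""
    n = sim.shape[0]
    weights = np.ones(n) if weights is None else weights
    selected, coverage = [], np.zeros(n)
    for _ in range(budget):
        # marginal gain of adding each candidate j to the selected set
        gains = (weights[:, None] * np.maximum(sim, coverage[:, None])).sum(axis=0) \
                - (weights * coverage).sum()
        j = int(np.argmax(gains))
        selected.append(j)
        coverage = np.maximum(coverage, sim[:, j])
    return selected

rng = np.random.default_rng(0)
feats = rng.normal(size=(100, 16))
feats /= np.linalg.norm(feats, axis=1, keepdims=True)
sim = feats @ feats.T                      # cosine similarity between sensor-window features
prior = np.ones(100)                       # stand-in for LLM-derived importance weights
print(greedy_facility_location(sim, budget=5, weights=prior))
```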
[66] TabiBERT: A Large-Scale ModernBERT Foundation Model and Unified Benchmarking Framework for Turkish
Melikşah Türker, A. Ebrar Kızıloğlu, Onur Güngör, Susan Üsküdarlı
Main category: cs.CL
TL;DR: TabiBERT is a new monolingual Turkish encoder based on ModernBERT architecture, trained from scratch on 1 trillion tokens from a multi-domain corpus, achieving state-of-the-art performance across various Turkish NLP tasks.
Details
Motivation: Turkish NLP lacks a monolingual encoder trained from scratch with modern architectural paradigms like RoPE, FlashAttention, and refined normalization, despite significant advances in encoder-only Transformers since BERT.Method: Developed TabiBERT based on ModernBERT architecture with Rotary Positional Embeddings (RoPE), FlashAttention, and refined normalization. Pre-trained from scratch on 1 trillion tokens from a curated multi-domain corpus (73% web text, 20% scientific publications, 6% source code, 0.3% mathematical content). Created TabiBench with 28 datasets across eight task categories for evaluation.
Result: TabiBERT achieves 77.58 on TabiBench, outperforming BERTurk by 1.62 points and establishing SOTA on 5/8 categories. Shows strong gains on question answering (+9.55), code retrieval (+2.41), and academic understanding (+0.66). Achieves 2.65x inference speedup, supports 8,192-token context length, and reduces GPU memory consumption.
Conclusion: TabiBERT successfully addresses the gap in Turkish NLP by providing a modern, efficient monolingual encoder with strong cross-domain generalization. The release of model weights, configurations, and evaluation code enables transparent, reproducible research for the Turkish NLP community.
Abstract: Since the inception of BERT, encoder-only Transformers have evolved significantly in computational efficiency, training stability, and long-context modeling. ModernBERT consolidates these advances by integrating Rotary Positional Embeddings (RoPE), FlashAttention, and refined normalization. Despite these developments, Turkish NLP lacks a monolingual encoder trained from scratch, incorporating such modern architectural paradigms. This work introduces TabiBERT, a monolingual Turkish encoder based on ModernBERT architecture trained from scratch on a large, curated corpus. TabiBERT is pre-trained on one trillion tokens sampled from an 84.88B token multi-domain corpus: web text (73%), scientific publications (20%), source code (6%), and mathematical content (0.3%). It supports 8,192-token context length (16x original BERT), achieves up to 2.65x inference speedup, and reduces GPU memory consumption, enabling larger batch sizes. We introduce TabiBench with 28 datasets across eight task categories with standardized splits and protocols, evaluated using GLUE-style macro-averaging. TabiBERT attains 77.58 on TabiBench, outperforming BERTurk by 1.62 points and establishing state-of-the-art on five of eight categories, with particularly strong gains on question answering (+9.55 points), code retrieval (+2.41 points), and academic understanding (+0.66 points). Compared with task-specific prior best results, including specialized models like TurkishBERTweet, TabiBERT achieves +1.47 average improvement, indicating robust cross-domain generalization. We release model weights, training configurations, and evaluation code for transparent, reproducible Turkish encoder research.
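For clarity, GLUE-style macro-averaging means averaging scores within each task category first and then averaging the category means, so categories with many datasets do not dominate. The scores below are made-up placeholders, not TabiBench numbers.

```python
from statistics import mean

# Illustrative GLUE-style macro-averaging over task categories.
scores_by_category = {
    "question_answering": [81.2, 78.9],
    "code_retrieval": [64.5],
    "academic_understanding": [72.3, 70.1, 69.8],
    # ... remaining categories of the real benchmark
}
category_means = {cat: mean(v) for cat, v in scores_by_category.items()}
benchmark_score = mean(category_means.values())
print(round(benchmark_score, 2))
```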
[67] Improving Multi-step RAG with Hypergraph-based Memory for Long-Context Complex Relational Modeling
Chulun Zhou, Chunkang Zhang, Guoxin Yu, Fandong Meng, Jie Zhou, Wai Lam, Mo Yu
Main category: cs.CL
TL;DR: HGMem introduces a hypergraph-based memory mechanism for multi-step RAG that captures higher-order correlations between facts, enabling more integrated reasoning compared to traditional passive memory storage.
Details
Motivation: Existing RAG memory modules function as passive storage that accumulates isolated facts, overlooking crucial high-order correlations among primitive facts. This static nature limits representational strength and impact on multi-step reasoning, resulting in fragmented reasoning and weak global sense-making capacity in extended contexts.Method: HGMem represents memory as a hypergraph where hyperedges correspond to distinct memory units, enabling progressive formation of higher-order interactions within memory. This connects facts and thoughts around the focal problem, evolving into an integrated knowledge structure that provides strong propositions for deeper reasoning.
Result: Extensive experiments on challenging datasets for global sense-making show that HGMem consistently improves multi-step RAG and substantially outperforms strong baseline systems across diverse tasks.
Conclusion: HGMem extends memory beyond simple storage into a dynamic, expressive structure for complex reasoning and global understanding, addressing limitations of existing passive memory designs in multi-step RAG systems.
Abstract: Multi-step retrieval-augmented generation (RAG) has become a widely adopted strategy for enhancing large language models (LLMs) on tasks that demand global comprehension and intensive reasoning. Many RAG systems incorporate a working memory module to consolidate retrieved information. However, existing memory designs function primarily as passive storage that accumulates isolated facts for the purpose of condensing the lengthy inputs and generating new sub-queries through deduction. This static nature overlooks the crucial high-order correlations among primitive facts, the compositions of which can often provide stronger guidance for subsequent steps. Therefore, their representational strength and impact on multi-step reasoning and knowledge evolution are limited, resulting in fragmented reasoning and weak global sense-making capacity in extended contexts. We introduce HGMem, a hypergraph-based memory mechanism that extends the concept of memory beyond simple storage into a dynamic, expressive structure for complex reasoning and global understanding. In our approach, memory is represented as a hypergraph whose hyperedges correspond to distinct memory units, enabling the progressive formation of higher-order interactions within memory. This mechanism connects facts and thoughts around the focal problem, evolving into an integrated and situated knowledge structure that provides strong propositions for deeper reasoning in subsequent steps. We evaluate HGMem on several challenging datasets designed for global sense-making. Extensive experiments and in-depth analyses show that our method consistently improves multi-step RAG and substantially outperforms strong baseline systems across diverse tasks.
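One simplified way to picture the hypergraph memory: each memory unit is a hyperedge over a set of entities, units become linked through shared entities, and retrieval around the focal problem pulls in combinations of related facts. The data structure and retrieval rule below are an illustrative simplification, not HGMem's implementation.

```python
from collections import defaultdict

class HypergraphMemory:
    """Memory units are hyperedges over entity nodes; higher-order links emerge from shared nodes."""
    def __init__(self):
        self.edges = {}                        # edge_id -> (entities, text)
        self.node_to_edges = defaultdict(set)  # entity -> ids of hyperedges containing it

    def add_unit(self, edge_id, entities, text):
        self.edges[edge_id] = (frozenset(entities), text)
        for e in entities:
            self.node_to_edges[e].add(edge_id)

    def related_units(self, focus_entities, min_overlap=1):
        """Return memory units sharing at least `min_overlap` entities with the focal problem."""
        hits = defaultdict(int)
        for e in focus_entities:
            for edge_id in self.node_to_edges[e]:
                hits[edge_id] += 1
        ranked = sorted(hits.items(), key=lambda kv: -kv[1])
        return [self.edges[i][1] for i, n in ranked if n >= min_overlap]

mem = HypergraphMemory()
mem.add_unit("m1", {"Curie", "Nobel Prize", "Physics"}, "Curie shared the 1903 Physics prize.")
mem.add_unit("m2", {"Curie", "Sorbonne"}, "Curie became a professor at the Sorbonne.")
mem.add_unit("m3", {"Nobel Prize", "Stockholm"}, "Physics prizes are awarded in Stockholm.")
print(mem.related_units({"Curie", "Nobel Prize"}))
```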
[68] Training a Huggingface Model on AWS Sagemaker (Without Tears)
Liling Tan
Main category: cs.CL
TL;DR: A demo paper providing centralized guidance for researchers to train Hugging Face models on AWS SageMaker, addressing the steep learning curve of cloud platforms.
Details
Motivation: LLM development is dominated by resource-rich groups, forcing many researchers to use cloud services like AWS SageMaker due to lack of on-premise computing resources. However, the steep learning curve and fragmented documentation create barriers to cloud adoption.Method: The paper creates a centralized, comprehensive guide/demo that collects essential information needed to train Hugging Face models on AWS SageMaker from scratch, filling knowledge gaps left by existing documentation.
Result: A practical demonstration paper that provides researchers with the necessary knowledge and step-by-step guidance to successfully train their first Hugging Face model on AWS SageMaker, overcoming the cloud platform learning curve.
Conclusion: By centralizing essential cloud training information, this demo paper democratizes access to cloud computing for LLM research, enabling more researchers to leverage powerful cloud resources despite the complexity of cloud platforms.
Abstract: The development of Large Language Models (LLMs) has primarily been driven by resource-rich research groups and industry partners. Due to the lack of on-premise computing resources required for increasingly complex models, many researchers are turning to cloud services like AWS SageMaker to train Hugging Face models. However, the steep learning curve of cloud platforms often presents a barrier for researchers accustomed to local environments. Existing documentation frequently leaves knowledge gaps, forcing users to seek fragmented information across the web. This demo paper aims to democratize cloud adoption by centralizing the essential information required for researchers to successfully train their first Hugging Face model on AWS SageMaker from scratch.
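As a concrete starting point, the usual pattern is SageMaker's HuggingFace estimator wrapping an ordinary training script. The instance type, framework versions, S3 paths, and hyperparameters below are placeholders to adapt to your own account; the version triple must be a combination supported by the SageMaker Hugging Face containers.

```python
import sagemaker
from sagemaker.huggingface import HuggingFace

role = sagemaker.get_execution_role()    # IAM role with SageMaker permissions

# train.py is an ordinary Hugging Face training script (Trainer or custom loop);
# SageMaker runs it inside a managed container on the requested instance.
estimator = HuggingFace(
    entry_point="train.py",
    source_dir="./scripts",
    instance_type="ml.g5.xlarge",        # placeholder; choose per model size / budget
    instance_count=1,
    role=role,
    transformers_version="4.28",         # placeholder version triple; must match an
    pytorch_version="2.0",               # available SageMaker Hugging Face container
    py_version="py310",
    hyperparameters={"model_name_or_path": "bert-base-uncased", "epochs": 3},
)

# Channels become SM_CHANNEL_TRAIN / SM_CHANNEL_TEST env vars inside train.py.
estimator.fit({
    "train": "s3://my-bucket/data/train",  # placeholder S3 URIs
    "test": "s3://my-bucket/data/test",
})
```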
cs.CV
[69] FCMBench: A Comprehensive Financial Credit Multimodal Benchmark for Real-world Applications
Yehui Yang, Dalu Yang, Wenshuo Zhou, Fangxin Shang, Yifan Liu, Jie Ren, Haojun Fei, Qing Yang, Tao Chen
Main category: cs.CV
TL;DR: FCMBench-V1.0 is a financial credit multimodal benchmark with 4,043 privacy-compliant images and 8,446 QA samples covering 18 certificate types, designed to evaluate VLMs on perception, reasoning, and robustness for credit applications.
Details
Motivation: Current multimodal AI lacks domain-specific benchmarks for financial credit applications that reflect real documents/workflows, include credit-specific understanding, maintain privacy compliance, and test real-world robustness.Method: Created FCMBench via closed synthesis-capture pipeline: manually synthesized document templates with virtual content and captured scenario-aware images in-house to ensure privacy compliance and avoid web-sourced data leakage.
Result: Evaluated 23 state-of-the-art VLMs: Gemini 3 Pro (64.61 F1) best commercial, Qwen3-VL-235B (57.27) best open-source, Qfin-VL-Instruct (64.92) top overall. Robustness tests show significant performance drops under acquisition artifacts.
Conclusion: FCMBench effectively discriminates VLM performance and robustness for financial credit applications, revealing that even top models struggle with real-world artifacts, highlighting the need for domain-specific evaluation and specialized models.
Abstract: As multimodal AI becomes widely used for credit risk assessment and document review, a domain-specific benchmark is urgently needed that (1) reflects documents and workflows specific to financial credit applications, (2) includes credit-specific understanding and real-world robustness, and (3) preserves privacy compliance without sacrificing practical utility. Here, we introduce FCMBench-V1.0 – a large-scale financial credit multimodal benchmark for real-world applications, covering 18 core certificate types, with 4,043 privacy-compliant images and 8,446 QA samples. The FCMBench evaluation framework consists of three dimensions: Perception, Reasoning, and Robustness, including 3 foundational perception tasks, 4 credit-specific reasoning tasks that require decision-oriented understanding of visual evidence, and 10 real-world acquisition artifact types for robustness stress testing. To reconcile compliance with realism, we construct all samples via a closed synthesis-capture pipeline: we manually synthesize document templates with virtual content and capture scenario-aware images in-house. This design also mitigates pre-training data leakage by avoiding web-sourced or publicly released images. FCMBench can effectively discriminate performance disparities and robustness across modern vision-language models. Extensive experiments were conducted on 23 state-of-the-art vision-language models (VLMs) from 14 top AI companies and research institutes. Among them, Gemini 3 Pro achieves the best F1(%) score as a commercial model (64.61), Qwen3-VL-235B achieves the best score as an open-source baseline (57.27), and our financial credit-specific model, Qfin-VL-Instruct, achieves the top overall score (64.92). Robustness evaluations show that even top-performing models suffer noticeable performance drops under acquisition artifacts.
[70] TeleWorld: Towards Dynamic Multimodal Synthesis with a 4D World Model
Yabo Chen, Yuanzhi Liang, Jiepeng Wang, Tingxi Chen, Junfei Cheng, Zixiao Gu, Yuyang Huang, Zicheng Jiang, Wei Li, Tian Li, Weichen Li, Zuoxin Li, Guangce Liu, Jialun Liu, Junqi Liu, Haoyuan Wang, Qizhen Weng, Xuan’er Wu, Xunzhi Xiang, Xiaoyan Yang, Xin Zhang, Shiwen Zhang, Junyu Zhou, Chengcheng Zhou, Haibin Huang, Chi Zhang, Xuelong Li
Main category: cs.CV
TL;DR: TeleWorld is a real-time multimodal 4D world modeling framework that unifies video generation, dynamic scene reconstruction, and long-term memory in a closed-loop system, enabling interactive and consistent world models.
Details
Motivation: Current video generation models lack real-time interaction, long-horizon consistency, and persistent memory of dynamic scenes, preventing them from evolving into practical world models for AI systems.Method: TeleWorld introduces a generation-reconstruction-guidance paradigm: generated videos are continuously reconstructed into dynamic 4D spatio-temporal representations that guide subsequent generation. It uses an autoregressive diffusion-based video model enhanced with Macro-from-Micro Planning (hierarchical planning) and Distribution Matching Distillation for real-time synthesis.
Result: The framework achieves strong performance in static/dynamic world understanding, long-term consistency, and real-time generation efficiency, integrating dynamic object modeling and static scene representation within a unified 4D framework.
Conclusion: TeleWorld represents a practical step toward interactive, memory-enabled world models for multimodal generation and embodied intelligence, advancing world models toward computationally accessible systems.
Abstract: World models aim to endow AI systems with the ability to represent, generate, and interact with dynamic environments in a coherent and temporally consistent manner. While recent video generation models have demonstrated impressive visual quality, they remain limited in real-time interaction, long-horizon consistency, and persistent memory of dynamic scenes, hindering their evolution into practical world models. In this report, we present TeleWorld, a real-time multimodal 4D world modeling framework that unifies video generation, dynamic scene reconstruction, and long-term world memory within a closed-loop system. TeleWorld introduces a novel generation-reconstruction-guidance paradigm, where generated video streams are continuously reconstructed into a dynamic 4D spatio-temporal representation, which in turn guides subsequent generation to maintain spatial, temporal, and physical consistency. To support long-horizon generation with low latency, we employ an autoregressive diffusion-based video model enhanced with Macro-from-Micro Planning (MMPL)–a hierarchical planning method that reduces error accumulation from frame-level to segment-level-alongside efficient Distribution Matching Distillation (DMD), enabling real-time synthesis under practical computational budgets. Our approach achieves seamless integration of dynamic object modeling and static scene representation within a unified 4D framework, advancing world models toward practical, interactive, and computationally accessible systems. Extensive experiments demonstrate that TeleWorld achieves strong performance in both static and dynamic world understanding, long-term consistency, and real-time generation efficiency, positioning it as a practical step toward interactive, memory-enabled world models for multimodal generation and embodied intelligence.
[71] It’s Never Too Late: Noise Optimization for Collapse Recovery in Trained Diffusion Models
Anne Harrington, A. Sophia Koepke, Shyamgopal Karthik, Trevor Darrell, Alexei A. Efros
Main category: cs.CV
TL;DR: Simple noise optimization method reduces mode collapse in text-to-image models while preserving quality
Details
Motivation: Text-to-image models suffer from mode collapse when generating multiple images from the same prompt, producing similar outputs instead of diverse variations.Method: Proposes a noise optimization objective to mitigate mode collapse, analyzes frequency characteristics of noise, and explores alternative noise initializations with different frequency profiles.
Result: Noise optimization yields superior results in generation quality and variety compared to previous approaches like guidance mechanisms or candidate refinement
Conclusion: Simple noise optimization effectively addresses mode collapse in text-to-image generation while maintaining model fidelity, with frequency analysis providing insights for improved optimization and search
Abstract: Contemporary text-to-image models exhibit a surprising degree of mode collapse, as can be seen when sampling several images given the same text prompt. While previous work has attempted to address this issue by steering the model using guidance mechanisms, or by generating a large pool of candidates and refining them, in this work we take a different direction and aim for diversity in generations via noise optimization. Specifically, we show that a simple noise optimization objective can mitigate mode collapse while preserving the fidelity of the base model. We also analyze the frequency characteristics of the noise and show that alternative noise initializations with different frequency profiles can improve both optimization and search. Our experiments demonstrate that noise optimization yields superior results in terms of generation quality and variety.
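The gist of noise optimization for diversity can be sketched as follows: perturb a batch of initial noises so the decoded outputs repel one another while each noise stays close to unit-Gaussian statistics. The repulsion loss and the identity stand-in for the generator are illustrative; in practice the distance would be measured on images or features from the frozen text-to-image model.

```python
import torch

def diversify_noise(noise, decode, steps=50, lr=0.05, reg=0.1):
    """Nudge a batch of initial noises so decoded outputs are mutually dissimilar."""
    z = noise.clone().requires_grad_(True)
    opt = torch.optim.Adam([z], lr=lr)
    for _ in range(steps):
        out = decode(z).flatten(1)                      # (B, D) decoded representations
        out = torch.nn.functional.normalize(out, dim=1)
        sim = out @ out.T                               # pairwise cosine similarities
        off_diag = sim - torch.eye(len(z))
        repulsion = off_diag.pow(2).sum()               # push generations apart
        prior = reg * (z.pow(2).mean() - 1.0).pow(2)    # keep roughly unit variance
        loss = repulsion + prior
        opt.zero_grad(); loss.backward(); opt.step()
    return z.detach()

noise = torch.randn(4, 64)            # stand-in for (B, C, H, W) latent noise
decode_stub = lambda z: z             # placeholder for the frozen generator
diverse_noise = diversify_noise(noise, decode_stub)
```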
[72] Spatial4D-Bench: A Versatile 4D Spatial Intelligence Benchmark
Pan Wang, Yang Liu, Guile Wu, Eduardo R. Corral-Soto, Chengjie Huang, Binbin Xu, Dongfeng Bai, Xu Yan, Yuan Ren, Xingxin Chen, Yizhe Wu, Tao Huang, Wenjun Wan, Xin Wu, Pei Zhou, Xuyang Dai, Kangbo Lv, Hongbo Zhang, Yosef Fried, Aixue Ye, Bailan Feng, Zhenyu Chen, Zhen Li, Yingcong Chen, Yiyi Liao, Bingbing Liu
Main category: cs.CV
TL;DR: Spatial4D-Bench is a large-scale benchmark with ~40K QA pairs across 18 tasks in 6 cognitive categories to evaluate MLLMs’ 4D spatial intelligence, revealing current models’ substantial limitations.
Details
Motivation: To assess whether Multimodal Large Language Models (MLLMs) can achieve human-level 4D spatial intelligence (perceiving object changes over time), which is naturally possessed by humans but not well-evaluated in current benchmarks.Method: Created Spatial4D-Bench, a comprehensive 4D spatial intelligence benchmark with ~40,000 question-answer pairs covering 18 well-defined tasks organized into six cognitive categories: object understanding, scene understanding, spatial relationship understanding, spatiotemporal relationship understanding, spatial reasoning, and spatiotemporal reasoning.
Result: Benchmarking various state-of-the-art open-source and proprietary MLLMs revealed substantial limitations in 4D spatial reasoning abilities across diverse aspects like route planning, action recognition, and physical plausibility reasoning.
Conclusion: Current MLLMs have significant gaps in achieving human-level 4D spatial intelligence, and Spatial4D-Bench provides a structured, comprehensive evaluation framework to facilitate development of more capable models toward this goal.
Abstract: 4D spatial intelligence involves perceiving and processing how objects move or change over time. Humans naturally possess 4D spatial intelligence, supporting a broad spectrum of spatial reasoning abilities. To what extent can Multimodal Large Language Models (MLLMs) achieve human-level 4D spatial intelligence? In this work, we present Spatial4D-Bench, a versatile 4D spatial intelligence benchmark designed to comprehensively assess the 4D spatial reasoning abilities of MLLMs. Unlike existing spatial intelligence benchmarks that are often small-scale or limited in diversity, Spatial4D-Bench provides a large-scale, multi-task evaluation benchmark consisting of ~40,000 question-answer pairs covering 18 well-defined tasks. We systematically organize these tasks into six cognitive categories: object understanding, scene understanding, spatial relationship understanding, spatiotemporal relationship understanding, spatial reasoning and spatiotemporal reasoning. Spatial4D-Bench thereby offers a structured and comprehensive benchmark for evaluating the spatial cognition abilities of MLLMs, covering a broad spectrum of tasks that parallel the versatility of human spatial intelligence. We benchmark various state-of-the-art open-source and proprietary MLLMs on Spatial4D-Bench and reveal their substantial limitations in a wide variety of 4D spatial reasoning aspects, such as route plan, action recognition, and physical plausibility reasoning. We hope that the findings provided in this work offer valuable insights to the community and that our benchmark can facilitate the development of more capable MLLMs toward human-level 4D spatial intelligence. More resources can be found on our project page.
[73] A Spatially Masked Adaptive Gated Network for multimodal post-flood water extent mapping using SAR and incomplete multispectral data
Hyunho Lee, Wenwen Li
Main category: cs.CV
TL;DR: SMAGNet is a multimodal deep learning model that uses SAR as primary input for flood water mapping and adaptively integrates MSI data when available, maintaining performance even when MSI data is missing.
Details
Motivation: Current flood water mapping relies heavily on SAR data, but multimodal approaches combining SAR and MSI data show promise. However, adaptive integration of partially available MSI data into SAR-based mapping remains underexplored, limiting practical application in real-world scenarios where timely post-flood observations may be limited.Method: Proposed Spatially Masked Adaptive Gated Network (SMAGNet) - a multimodal deep learning model that uses SAR data as primary input and integrates complementary MSI data through feature fusion. The model is designed to handle varying levels of MSI data availability.
Result: SMAGNet consistently outperformed other multimodal models across varying MSI data availability levels on the C2S-MS Floods dataset. Even when MSI data was completely missing, SMAGNet maintained performance statistically comparable to U-Net trained solely on SAR data.
Conclusion: SMAGNet enhances model robustness to missing data and improves applicability of multimodal deep learning in real-world flood management scenarios by adaptively integrating available MSI data while maintaining performance when such data is unavailable.
Abstract: Mapping water extent during a flood event is essential for effective disaster management throughout all phases: mitigation, preparedness, response, and recovery. In particular, during the response stage, when timely and accurate information is important, Synthetic Aperture Radar (SAR) data are primarily employed to produce water extent maps. Recently, leveraging the complementary characteristics of SAR and MSI data through a multimodal approach has emerged as a promising strategy for advancing water extent mapping using deep learning models. This approach is particularly beneficial when timely post-flood observations, acquired during or shortly after the flood peak, are limited, as it enables the use of all available imagery for more accurate post-flood water extent mapping. However, the adaptive integration of partially available MSI data into the SAR-based post-flood water extent mapping process remains underexplored. To bridge this research gap, we propose the Spatially Masked Adaptive Gated Network (SMAGNet), a multimodal deep learning model that utilizes SAR data as the primary input for post-flood water extent mapping and integrates complementary MSI data through feature fusion. In experiments on the C2S-MS Floods dataset, SMAGNet consistently outperformed other multimodal deep learning models in prediction performance across varying levels of MSI data availability. Furthermore, we found that even when MSI data were completely missing, the performance of SMAGNet remained statistically comparable to that of a U-Net model trained solely on SAR data. These findings indicate that SMAGNet enhances the model robustness to missing data as well as the applicability of multimodal deep learning in real-world flood management scenarios.
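A rough sketch of the masked gating idea: fuse MSI features into the SAR stream through a learned gate and zero that contribution wherever MSI is unavailable, so the network falls back to its SAR-only path. Channel sizes and the gate design below are illustrative, not SMAGNet's architecture.

```python
import torch
import torch.nn as nn

class MaskedGatedFusion(nn.Module):
    """Fuse optional MSI features into SAR features, gated and masked by availability."""
    def __init__(self, channels: int = 32):
        super().__init__()
        self.gate = nn.Sequential(nn.Conv2d(2 * channels, channels, 1), nn.Sigmoid())

    def forward(self, sar_feat, msi_feat, msi_mask):
        # msi_mask: (B, 1, H, W), 1 where MSI pixels are actually available, else 0
        gate = self.gate(torch.cat([sar_feat, msi_feat], dim=1))
        return sar_feat + msi_mask * gate * msi_feat   # SAR-only output where mask is 0

fusion = MaskedGatedFusion(32)
sar = torch.randn(2, 32, 64, 64)
msi = torch.randn(2, 32, 64, 64)
mask = torch.zeros(2, 1, 64, 64); mask[0] = 1.0        # MSI present for sample 0 only
fused = fusion(sar, msi, mask)
print(torch.allclose(fused[1], sar[1]))                # True: falls back to SAR alone
```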
[74] GameTileNet: A Semantic Dataset for Low-Resolution Game Art in Procedural Content Generation
Yi-Chun Chen, Arnav Jhala
Main category: cs.CV
TL;DR: GameTileNet is a semantic dataset of low-resolution game tiles designed to support narrative-driven procedural content generation through visual-language alignment.
Details
Motivation: Current AI-generated game visuals often misalign with game narratives and lack diversity due to training data imbalances. Human artists must manually adjust inconsistent AI outputs, and automated content generation suffers from limited visual representation diversity.Method: Collects artist-created game tiles from OpenGameArt.org under Creative Commons licenses, provides semantic annotations, and introduces a pipeline for object detection in low-resolution tile-based game art (32x32 pixels) with annotations for semantics, connectivity, and object classifications.
Result: Creates a valuable resource for improving PCG methods, supporting narrative-rich game content, and establishing a baseline for object detection in low-resolution, non-photorealistic images.
Conclusion: GameTileNet advances procedural content generation and AI research as a vision-language alignment task by providing semantic labels for low-resolution digital game art, addressing challenges in narrative-aligned visual generation.
Abstract: GameTileNet is a dataset designed to provide semantic labels for low-resolution digital game art, advancing procedural content generation (PCG) and related AI research as a vision-language alignment task. Large Language Models (LLMs) and image-generative AI models have enabled indie developers to create visual assets, such as sprites, for game interactions. However, generating visuals that align with game narratives remains challenging due to inconsistent AI outputs, requiring manual adjustments by human artists. The diversity of visual representations in automatically generated game content is also limited because of the imbalance in distributions across styles for training data. GameTileNet addresses this by collecting artist-created game tiles from OpenGameArt.org under Creative Commons licenses and providing semantic annotations to support narrative-driven content generation. The dataset introduces a pipeline for object detection in low-resolution tile-based game art (e.g., 32x32 pixels) and annotates semantics, connectivity, and object classifications. GameTileNet is a valuable resource for improving PCG methods, supporting narrative-rich game content, and establishing a baseline for object detection in low-resolution, non-photorealistic images.
[75] Compressed Map Priors for 3D Perception
Brady Zhou, Philipp Krähenbühl
Main category: cs.CV
TL;DR: CMP is a framework that learns spatial priors from historical traversal data using compressed binarized hashmaps, improving 3D object detection for autonomous vehicles with minimal storage and computational overhead.
Details
Motivation: Current autonomous vision systems treat each location as if encountering it for the first time, ignoring that most deployment areas have been visited before. This wastes valuable historical information that could improve perception.Method: Uses compressed map priors learned from historic traversals stored in binarized hashmaps requiring only 32KB/km² (20× reduction from dense storage). Easily integrates into existing 3D perception systems with minimal computational cost.
Result: Significant and consistent improvement in 3D object detection on nuScenes dataset across several architectures. The compressed storage enables efficient use of spatial priors.
Conclusion: Learning spatial priors from historical traversals is effective for autonomous vision systems. CMP provides a simple, storage-efficient way to incorporate these priors into existing systems with substantial performance gains.
Abstract: Human drivers rarely travel where no person has gone before. After all, thousands of drivers use busy city roads every day, and only one can claim to be the first. The same holds for autonomous computer vision systems. The vast majority of the deployment area of an autonomous vision system will have been visited before. Yet, most autonomous vehicle vision systems act as if they are encountering each location for the first time. In this work, we present Compressed Map Priors (CMP), a simple but effective framework to learn spatial priors from historic traversals. The map priors use a binarized hashmap that requires only $32\text{KB}/\text{km}^2$, a $20\times$ reduction compared to the dense storage. Compressed Map Priors easily integrate into leading 3D perception systems at little to no extra computational costs, and lead to a significant and consistent improvement in 3D object detection on the nuScenes dataset across several architectures.
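To make the storage figure concrete: one bit per 2 m x 2 m cell gives 500 x 500 = 250,000 bits, roughly 31 KB per square kilometre, in the ballpark of the reported 32 KB/km². The bit-packed occupancy prior below is an illustrative guess at such a scheme; the cell size, hashing, and what each bit encodes are assumptions, not the paper's design.

```python
import numpy as np

class BinarizedMapPrior:
    """One bit per grid cell over a 1 km x 1 km tile (illustrative cell size: 2 m)."""
    def __init__(self, tile_m: float = 1000.0, cell_m: float = 2.0):
        self.cell_m = cell_m
        self.n = int(tile_m / cell_m)                       # 500 cells per side
        self.bits = np.zeros(self.n * self.n, dtype=bool)

    def _index(self, x_m, y_m):
        i = int(np.clip(x_m // self.cell_m, 0, self.n - 1))
        j = int(np.clip(y_m // self.cell_m, 0, self.n - 1))
        return i * self.n + j

    def mark(self, x_m, y_m):                               # record e.g. past detections / drivable area
        self.bits[self._index(x_m, y_m)] = True

    def query(self, x_m, y_m) -> bool:
        return bool(self.bits[self._index(x_m, y_m)])

    def packed_size_bytes(self) -> int:
        return np.packbits(self.bits).nbytes                # ~31 KB for a 1 km^2 tile

prior = BinarizedMapPrior()
prior.mark(123.4, 987.6)
print(prior.query(123.4, 987.6), prior.packed_size_bytes())  # True 31250
```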
[76] Attention to Detail: Global-Local Attention for High-Resolution AI-Generated Image Detection
Lawrence Han
Main category: cs.CV
TL;DR: GLASS is a novel architecture for AI-generated image detection that combines global resized views with multiple original-resolution local crops using stratified sampling and attention-based aggregation to preserve fine-grained details.
Details
Motivation: Most AI-generated image detection methods downsample images before processing, which risks losing fine-grained details that could be crucial for distinguishing AI-generated content from real images.Method: GLASS combines a globally resized view with multiple randomly sampled local crops at original resolution using spatially stratified sampling. These crops are aggregated using attention-based scoring and can be integrated into various vision backbones like Vision Transformer, ResNet, and ConvNeXt.
Result: GLASS outperforms standard transfer learning approaches, achieving higher predictive performance while staying within feasible computational constraints across different backbone architectures.
Conclusion: The GLASS architecture effectively leverages both global and local information from images of any size, preserving fine-grained details that are often lost in downsampling approaches, leading to improved AI-generated image detection performance.
Abstract: The rapid development of generative AI has made AI-generated images increasingly realistic and high-resolution. Most AI-generated image detection architectures typically downsample images before inputting them into models, risking the loss of fine-grained details. This paper presents GLASS (Global-Local Attention with Stratified Sampling), an architecture that combines a globally resized view with multiple randomly sampled local crops. These crops are original-resolution regions efficiently selected through spatially stratified sampling and aggregated using attention-based scoring. GLASS can be integrated into vision models to leverage both global and local information in images of any size. Vision Transformer, ResNet, and ConvNeXt models are used as backbones, and experiments show that GLASS outperforms standard transfer learning by achieving higher predictive performance within feasible computational constraints.
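The sampling side of GLASS is easy to picture: keep one resized global view, then split the original-resolution image into a grid of strata and draw one random crop per stratum so local detail is covered evenly rather than clustered. The grid and crop sizes below are illustrative, and the attention-based aggregation is omitted.

```python
import torch
import torch.nn.functional as F

def glass_views(image, grid=3, crop=224, global_size=224):
    """image: (C, H, W) at original resolution. Returns a global view and grid*grid local crops."""
    c, h, w = image.shape
    global_view = F.interpolate(image[None], size=(global_size, global_size),
                                mode="bilinear", align_corners=False)[0]
    crops = []
    for gy in range(grid):
        for gx in range(grid):
            # bounds of this stratum, then a random crop position inside it
            y0, y1 = gy * h // grid, (gy + 1) * h // grid
            x0, x1 = gx * w // grid, (gx + 1) * w // grid
            y = torch.randint(y0, max(y0 + 1, y1 - crop), (1,)).item()
            x = torch.randint(x0, max(x0 + 1, x1 - crop), (1,)).item()
            patch = image[:, y:y + crop, x:x + crop]
            # pad if the stratum is smaller than the crop near the border
            patch = F.pad(patch, (0, crop - patch.shape[2], 0, crop - patch.shape[1]))
            crops.append(patch)
    return global_view, torch.stack(crops)

img = torch.rand(3, 1536, 2048)                  # a high-resolution image
g, local_crops = glass_views(img)
print(g.shape, local_crops.shape)                # (3, 224, 224) and (9, 3, 224, 224)
```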
[77] Focal-RegionFace: Generating Fine-Grained Multi-attribute Descriptions for Arbitrarily Selected Face Focal Regions
Kaiwen Zheng, Junchen Fu, Songpei Xu, Yaoqing He, Joemon M. Jose, Han Hu, Xuri Ge
Main category: cs.CV
TL;DR: Focal-RegionFace: A vision-language model for generating and recognizing multi-attribute natural language descriptions for arbitrarily selected face regions, including facial action units, emotions, and age estimation.
Details
Motivation: Addresses the underexplored problem of facial analysis focusing on arbitrarily selected face regions, arguing that focusing on individual facial areas leads to better understanding and control in facial state analysis.Method: Constructs a new multi-attribute description dataset for arbitrarily selected face regions with rich region-level annotations and natural language descriptions. Proposes Focal-RegionFace, a fine-tuned vision-language model based on Qwen2.5-VL that incrementally refines focus on localized facial features through multiple progressively fine-tuning stages.
Result: Focal-RegionFace achieves best performance on the new benchmark across traditional metrics and newly proposed metrics, demonstrating effectiveness and versatility in fine-grained multi-attribute face region-focal analysis scenarios.
Conclusion: The proposed approach successfully addresses the FaceFocalDesc problem, enabling interpretable age estimation, facial action unit detection, and emotion detection through region-focused analysis, with verified effectiveness and versatility.
Abstract: In this paper, we introduce an underexplored problem in facial analysis: generating and recognizing multi-attribute natural language descriptions, containing facial action units (AUs), emotional states, and age estimation, for arbitrarily selected face regions (termed FaceFocalDesc). We argue that the system’s ability to focus on individual facial areas leads to better understanding and control. To achieve this capability, we construct a new multi-attribute description dataset for arbitrarily selected face regions, providing rich region-level annotations and natural language descriptions. Further, we propose a fine-tuned vision-language model based on Qwen2.5-VL, called Focal-RegionFace for facial state analysis, which incrementally refines its focus on localized facial features through multiple progressively fine-tuning stages, resulting in interpretable age estimation, FAU and emotion detection. Experimental results show that Focal-RegionFace achieves the best performance on the new benchmark in terms of traditional and widely used metrics, as well as new proposed metrics. This fully verifies its effectiveness and versatility in fine-grained multi-attribute face region-focal analysis scenarios.
[78] DichroGAN: Towards Restoration of in-air Colours of Seafloor from Satellite Imagery
Salma Gonzalez-Sabbagh, Antonio Robles-Kelly, Shang Gao
Main category: cs.CV
TL;DR: DichroGAN is a cGAN that recovers in-air seafloor colors from satellite imagery by modeling atmospheric radiance and underwater light transmission to remove water column effects.
Details
Motivation: Recovering in-air seafloor colors from satellite imagery is challenging due to exponential light attenuation in water columns, requiring advanced methods to remove absorption and scattering effects.Method: DichroGAN uses a two-step cGAN approach: two generators estimate diffuse/specular reflections for atmospheric radiance, then two more generators process spectral features and estimate underwater light transmission based on underwater image formation equations.
Result: Extensive experiments on satellite and underwater datasets show DichroGAN achieves competitive performance compared to state-of-the-art underwater restoration techniques.
Conclusion: DichroGAN effectively recovers in-air seafloor colors from satellite imagery by modeling complex underwater light interactions, demonstrating promising results for marine remote sensing applications.
Abstract: Recovering the in-air colours of seafloor from satellite imagery is a challenging task due to the exponential attenuation of light with depth in the water column. In this study, we present DichroGAN, a conditional generative adversarial network (cGAN) designed for this purpose. DichroGAN employs a two-steps simultaneous training: first, two generators utilise a hyperspectral image cube to estimate diffuse and specular reflections, thereby obtaining atmospheric scene radiance. Next, a third generator receives as input the generated scene radiance containing the features of each spectral band, while a fourth generator estimates the underwater light transmission. These generators work together to remove the effects of light absorption and scattering, restoring the in-air colours of seafloor based on the underwater image formation equation. DichroGAN is trained on a compact dataset derived from PRISMA satellite imagery, comprising RGB images paired with their corresponding spectral bands and masks. Extensive experiments on both satellite and underwater datasets demonstrate that DichroGAN achieves competitive performance compared to state-of-the-art underwater restoration techniques.
[79] MorphAny3D: Unleashing the Power of Structured Latent in 3D Morphing
Xiaokun Sun, Zeyu Cai, Hao Tang, Ying Tai, Jian Yang, Zhenyu Zhang
Main category: cs.CV
TL;DR: MorphAny3D is a training-free framework for high-quality 3D morphing using Structured Latent (SLAT) representations, achieving semantically consistent and temporally smooth deformations across categories.
Details
Motivation: 3D morphing is challenging due to difficulties in generating semantically consistent and temporally smooth deformations, especially across different object categories. Existing methods struggle with maintaining structural coherence and temporal consistency during morphing sequences.Method: The framework leverages SLAT representations and introduces two key attention mechanisms: Morphing Cross-Attention (MCA) for fusing source and target information to ensure structural coherence, and Temporal-Fused Self-Attention (TFSA) for enhancing temporal consistency by incorporating features from preceding frames. An orientation correction strategy addresses pose ambiguity issues.
Result: Extensive experiments show state-of-the-art morphing sequence generation, even for challenging cross-category cases. The method supports advanced applications like decoupled morphing and 3D style transfer, and can generalize to other SLAT-based generative models.
Conclusion: MorphAny3D provides an effective training-free solution for high-quality 3D morphing by intelligently blending SLAT features within attention mechanisms, achieving superior results in semantic consistency and temporal smoothness across categories.
Abstract: 3D morphing remains challenging due to the difficulty of generating semantically consistent and temporally smooth deformations, especially across categories. We present MorphAny3D, a training-free framework that leverages Structured Latent (SLAT) representations for high-quality 3D morphing. Our key insight is that intelligently blending source and target SLAT features within the attention mechanisms of 3D generators naturally produces plausible morphing sequences. To this end, we introduce Morphing Cross-Attention (MCA), which fuses source and target information for structural coherence, and Temporal-Fused Self-Attention (TFSA), which enhances temporal consistency by incorporating features from preceding frames. An orientation correction strategy further mitigates the pose ambiguity within the morphing steps. Extensive experiments show that our method generates state-of-the-art morphing sequences, even for challenging cross-category cases. MorphAny3D further supports advanced applications such as decoupled morphing and 3D style transfer, and can be generalized to other SLAT-based generative models. Project page: https://xiaokunsun.github.io/MorphAny3D.github.io/.
[80] CropNeRF: A Neural Radiance Field-Based Framework for Crop Counting
Md Ahmed Al Muzaddid, William J. Beksi
Main category: cs.CV
TL;DR: Novel 3D instance segmentation framework for accurate crop counting using multi-view images and NeRF, achieving superior performance across different crop types without crop-specific tuning.
Details
Motivation: Accurate crop counting is essential for agricultural management, but outdoor field environments present challenges with partial occlusions and ambiguity in distinguishing clustered crops from single viewpoints, which limits image-based segmentation methods.Method: Uses 2D images from multiple viewpoints with instance masks for NeRF view synthesis. Introduces crop visibility and mask consistency scores combined with 3D information from NeRF to segment crop instances in 3D for accurate counting, eliminating need for crop-specific parameter tuning.
Result: Validated on three agricultural datasets (cotton bolls, apples, pears) showing consistent counting performance despite variations in crop color, shape, and size. Outperforms state-of-the-art methods and contributes a cotton plant dataset for future research.
Conclusion: The framework provides an effective solution for precise crop enumeration in challenging outdoor environments, demonstrating robustness across different crop types and advancing agricultural computer vision research.
Abstract: Rigorous crop counting is crucial for effective agricultural management and informed intervention strategies. However, in outdoor field environments, partial occlusions combined with inherent ambiguity in distinguishing clustered crops from individual viewpoints poses an immense challenge for image-based segmentation methods. To address these problems, we introduce a novel crop counting framework designed for exact enumeration via 3D instance segmentation. Our approach utilizes 2D images captured from multiple viewpoints and associates independent instance masks for neural radiance field (NeRF) view synthesis. We introduce crop visibility and mask consistency scores, which are incorporated alongside 3D information from a NeRF model. This results in an effective segmentation of crop instances in 3D and highly-accurate crop counts. Furthermore, our method eliminates the dependence on crop-specific parameter tuning. We validate our framework on three agricultural datasets consisting of cotton bolls, apples, and pears, and demonstrate consistent counting performance despite major variations in crop color, shape, and size. A comparative analysis against the state of the art highlights superior performance on crop counting tasks. Lastly, we contribute a cotton plant dataset to advance further research on this topic.
[81] HiSin: A Sinogram-Aware Framework for Efficient High-Resolution Inpainting
Jiaze E, Srutarshi Banerjee, Tekin Bicer, Guannan Wang, Yanfu Zhang, Bin Ren
Main category: cs.CV
TL;DR: HiSin is a diffusion-based framework for efficient high-resolution sinogram inpainting that reduces memory usage by 30.81% and inference time by 17.58% while maintaining accuracy.
Details
Motivation: High-resolution sinogram inpainting is crucial for CT reconstruction to avoid artifacts and diagnostic errors. While diffusion models are suitable for this task, they face memory and computational limitations when applied to high-resolution inputs.Method: HiSin exploits spectral sparsity and structural heterogeneity of projection data. It progressively extracts global structure at low resolution and defers high-resolution inference to small patches. The method incorporates frequency-aware patch skipping and structure-adaptive step allocation to reduce redundant computation.
Result: HiSin reduces peak memory usage by up to 30.81% and inference time by up to 17.58% compared to state-of-the-art frameworks while maintaining inpainting accuracy.
Conclusion: The proposed HiSin framework enables efficient high-resolution sinogram inpainting by addressing memory and computational limitations of diffusion models through strategic use of spectral sparsity, patch-based processing, and adaptive computation allocation.
Abstract: High-resolution sinogram inpainting is essential for computed tomography reconstruction, as missing high-frequency projections can lead to visible artifacts and diagnostic errors. Diffusion models are well-suited for this task due to their robustness and detail-preserving capabilities, but their application to high-resolution inputs is limited by excessive memory and computational demands. To address this limitation, we propose HiSin, a novel diffusion-based framework for efficient sinogram inpainting that exploits spectral sparsity and structural heterogeneity of projection data. It progressively extracts global structure at low resolution and defers high-resolution inference to small patches, enabling memory-efficient inpainting. Considering the structural features of sinograms, we incorporate frequency-aware patch skipping and structure-adaptive step allocation to reduce redundant computation. Experimental results show that HiSin reduces peak memory usage by up to 30.81% and inference time by up to 17.58% compared to the state-of-the-art framework, and maintains inpainting accuracy.
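The frequency-aware skipping idea can be illustrated independently of the diffusion model: score each sinogram patch by its high-frequency energy and only run expensive refinement on patches above a threshold. The FFT-based score and threshold below are stand-ins for HiSin's actual criterion.

```python
import numpy as np

def high_freq_energy(patch, cutoff_frac=0.25):
    """Fraction of spectral energy outside a low-frequency square around DC."""
    spec = np.abs(np.fft.fftshift(np.fft.fft2(patch))) ** 2
    h, w = spec.shape
    ch, cw = int(h * cutoff_frac), int(w * cutoff_frac)
    low = spec[h // 2 - ch:h // 2 + ch, w // 2 - cw:w // 2 + cw].sum()
    return 1.0 - low / (spec.sum() + 1e-12)

def patches_to_refine(sinogram, patch=64, thresh=0.05):
    """Return top-left corners of patches whose high-frequency content warrants refinement."""
    keep = []
    for y in range(0, sinogram.shape[0] - patch + 1, patch):
        for x in range(0, sinogram.shape[1] - patch + 1, patch):
            if high_freq_energy(sinogram[y:y + patch, x:x + patch]) > thresh:
                keep.append((y, x))
    return keep

sino = np.random.rand(256, 256)        # placeholder sinogram with missing projections
print(len(patches_to_refine(sino)), "of", (256 // 64) ** 2, "patches selected")
```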
[82] IntraStyler: Exemplar-based Style Synthesis for Cross-modality Domain Adaptation
Han Liu, Yubo Fan, Hao Li, Dewei Hu, Daniel Moyer, Zhoubing Xu, Benoit M. Dawant, Ipek Oguz
Main category: cs.CV
TL;DR: IntraStyler: An exemplar-based style synthesis method for unsupervised domain adaptation that captures diverse intra-domain styles without prior knowledge, improving downstream segmentation performance.
Details
Motivation: Current domain adaptation methods focus mainly on domain shift between source and target domains, while intra-domain variability remains under-explored. Existing methods require pre-specified intra-domain variations for style synthesis, which is impractical.Method: Proposes IntraStyler, an exemplar-based style synthesis method that uses an exemplar image to guide style synthesis. Introduces a style encoder to learn style-only features discriminatively using contrastive learning, enabling capture of diverse intra-domain styles without prior knowledge.
Result: Evaluated on CrossMoDA 2023, the largest public dataset for cross-modality domain adaptation. Shows efficacy in controllable style synthesis and demonstrates benefits of diverse synthetic data for downstream segmentation tasks.
Conclusion: IntraStyler effectively addresses intra-domain variability in domain adaptation through exemplar-based style synthesis without requiring prior knowledge of intra-domain variations, improving segmentation performance through diverse synthetic data generation.
Abstract: Image-level domain alignment is the de facto approach for unsupervised domain adaptation, where unpaired image translation is used to minimize the domain gap. Prior studies mainly focus on the domain shift between the source and target domains, whereas the intra-domain variability remains under-explored. To address the latter, an effective strategy is to diversify the styles of the synthetic target domain data during image translation. However, previous methods typically require intra-domain variations to be pre-specified for style synthesis, which may be impractical. In this paper, we propose an exemplar-based style synthesis method named IntraStyler, which can capture diverse intra-domain styles without any prior knowledge. Specifically, IntraStyler uses an exemplar image to guide the style synthesis such that the output style matches the exemplar style. To extract the style-only features, we introduce a style encoder to learn styles discriminatively based on contrastive learning. We evaluate the proposed method on the largest public dataset for cross-modality domain adaptation, CrossMoDA 2023. Our experiments show the efficacy of our method in controllable style synthesis and the benefits of diverse synthetic data for downstream segmentation. Code is available at https://github.com/han-liu/IntraStyler.
[83] From Sight to Insight: Improving Visual Reasoning Capabilities of Multimodal Models via Reinforcement Learning
Omar Sharif, Eftekhar Hossain, Patrick Ng
Main category: cs.CV
TL;DR: RL approach improves multimodal LLMs’ visual reasoning by incentivizing longer, structured reasoning chains that better integrate visual information, achieving 5.56% improvement on Qwen-2.5-VL-7B.
Details
Motivation: Current multimodal large language models (MLLMs) generate reasoning chains that lack proper integration of visual information, limiting their ability to solve visual perception tasks like visual puzzles. Visual perception is identified as the key bottleneck in such tasks.Method: Use reward-driven reinforcement learning (RL) with group relative policy optimization (GRPO) to unlock long visual reasoning in open-source MLLMs without costly supervision. Design six reward functions targeting different reasoning aspects: image understanding, thinking steps, and answer accuracy.
Result: Achieves 5.56% improvement over base Qwen-2.5-VL-7B model, with consistent gains across both in-domain and out-of-domain settings. Converting images to textual descriptions yields even larger gains of 26.7% for Claude 3.5 and 23.6% for Claude 3.7.
Conclusion: RL with carefully designed reward functions can effectively improve MLLMs’ visual reasoning by incentivizing longer, structured reasoning chains that better integrate visual information, addressing the visual perception bottleneck in multimodal tasks.
Abstract: Reinforcement learning (RL) has emerged as a promising approach for eliciting reasoning chains before generating final answers. However, multimodal large language models (MLLMs) generate reasoning that lacks integration of visual information. This limits their ability to solve problems that demand accurate visual perception, such as visual puzzles. We show that visual perception is the key bottleneck in such tasks: converting images into textual descriptions significantly improves performance, yielding gains of 26.7% for Claude 3.5 and 23.6% for Claude 3.7. To address this, we investigate reward-driven RL as a mechanism to unlock long visual reasoning in open-source MLLMs without requiring costly supervision. We design and evaluate six reward functions targeting different reasoning aspects, including image understanding, thinking steps, and answer accuracy. Using group relative policy optimization (GRPO), our approach explicitly incentivizes longer, structured reasoning and mitigates bypassing of visual information. Experiments on Qwen-2.5-VL-7B achieve 5.56% improvements over the base model, with consistent gains across both in-domain and out-of-domain settings.
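As a rough illustration of how GRPO uses such per-aspect rewards, the sketch below scores a group of sampled responses with a weighted combination of hypothetical reward fields and normalizes them within the group; the field names and weights are placeholders, not the paper's actual six reward functions.

```python
import numpy as np

def composite_reward(resp: dict) -> float:
    """Weighted combination of hypothetical per-aspect rewards
    (image understanding, thinking steps, answer accuracy)."""
    return (0.3 * resp["image_understanding"]
            + 0.2 * resp["thinking_steps"]
            + 0.5 * resp["answer_accuracy"])

def grpo_advantages(group):
    """Group-relative advantages: each sampled response is scored against
    the mean/std of its own sampling group, as in GRPO."""
    rewards = np.array([composite_reward(r) for r in group])
    return (rewards - rewards.mean()) / (rewards.std() + 1e-8)

# Toy usage: four sampled responses to the same visual puzzle.
group = [
    {"image_understanding": 1.0, "thinking_steps": 0.8, "answer_accuracy": 1.0},
    {"image_understanding": 0.5, "thinking_steps": 0.2, "answer_accuracy": 0.0},
    {"image_understanding": 0.9, "thinking_steps": 0.6, "answer_accuracy": 1.0},
    {"image_understanding": 0.4, "thinking_steps": 0.3, "answer_accuracy": 0.0},
]
print(grpo_advantages(group))  # positive for above-average responses
```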
[84] LooC: Effective Low-Dimensional Codebook for Compositional Vector Quantization
Jie Li, Kwan-Yee K. Wong, Kai Han
Main category: cs.CV
TL;DR: LooC introduces a low-dimensional compositional vector quantization method that uses compact codebooks by treating codevectors as compositional units, achieving SOTA performance with smaller codebooks.
Details
Motivation: There's an urgent need for high-capacity yet compact VQ methods as data and models become more diverse and complex. Current methods face conflicts between capacity and compactness.Method: 1) Parameter-efficient codebook by reframing codevector-feature vector relationship as compositional units; 2) Parameter-free extrapolation-by-interpolation mechanism for feature enhancement; 3) Plug-and-play module design for existing VQ methods.
Result: LooC outperforms existing VQ methods across different tasks, datasets, and architectures, achieving state-of-the-art performance with significantly smaller codebooks while avoiding collapse problems.
Conclusion: LooC successfully reconciles the conflict between high capacity and compactness in VQ through compositional design, offering an effective plug-and-play solution that improves performance while reducing codebook size.
Abstract: Vector quantization (VQ) is a prevalent and fundamental technique that discretizes continuous feature vectors by approximating them using a codebook. As the diversity and complexity of data and models continue to increase, there is an urgent need for high-capacity, yet more compact VQ methods. This paper aims to reconcile this conflict by presenting a new approach called LooC, which utilizes an effective Low-dimensional codebook for Compositional vector quantization. Firstly, LooC introduces a parameter-efficient codebook by reframing the relationship between codevectors and feature vectors, significantly expanding its solution space. Instead of individually matching codevectors with feature vectors, LooC treats them as lower-dimensional compositional units within feature vectors and combines them, resulting in a more compact codebook with improved performance. Secondly, LooC incorporates a parameter-free extrapolation-by-interpolation mechanism to enhance and smooth features during the VQ process, which allows for better preservation of details and fidelity in feature approximation. The design of LooC leads to full codebook usage, effectively utilizing the compact codebook while avoiding the problem of collapse. Thirdly, LooC can serve as a plug-and-play module for existing methods for different downstream tasks based on VQ. Finally, extensive evaluations on different tasks, datasets, and architectures demonstrate that LooC outperforms existing VQ methods, achieving state-of-the-art performance with a significantly smaller codebook.
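The compositional codebook idea can be pictured as quantizing low-dimensional chunks of each feature vector against one shared small codebook. The sketch below follows that reading as an assumption; LooC's exact formulation, its extrapolation-by-interpolation mechanism, and its codebook-usage guarantees are not reproduced here.

```python
import torch

def compositional_quantize(x: torch.Tensor, codebook: torch.Tensor, n_units: int):
    """Quantize each feature vector as a composition of low-dimensional units.

    x:        (B, D) feature vectors
    codebook: (K, D // n_units) shared low-dimensional codevectors
    Splits every feature into n_units chunks, snaps each chunk to its
    nearest codevector, and re-assembles the chunks.
    """
    B, D = x.shape
    d = D // n_units
    chunks = x.view(B * n_units, d)             # (B*n_units, d)
    dists = torch.cdist(chunks, codebook)       # distance to every codevector
    idx = dists.argmin(dim=1)                   # nearest codevector per chunk
    quantized = codebook[idx].view(B, D)        # re-assemble into full vectors
    return quantized, idx.view(B, n_units)

# Toy usage: 8 vectors of dim 64 composed from a 32-entry codebook of dim 16.
x = torch.randn(8, 64)
codebook = torch.randn(32, 16)
q, codes = compositional_quantize(x, codebook, n_units=4)
print(q.shape, codes.shape)  # torch.Size([8, 64]) torch.Size([8, 4])
```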
[85] Towards Syn-to-Real IQA: A Novel Perspective on Reshaping Synthetic Data Distributions
Aobo Li, Jinjian Wu, Yongxu Liu, Leida Li, Weisheng Dong
Main category: cs.CV
TL;DR: SynDR-IQA improves BIQA generalization by reshaping synthetic data distribution through diversity upsampling and redundancy downsampling to address clustered feature patterns in synthetic datasets.
Details
Motivation: BIQA suffers from limited labeled data; synthetic datasets offer a solution but models trained on them show poor generalization due to clustered feature patterns (high-quality images cluster around references, low-quality cluster by distortion types).Method: SynDR-IQA reshapes synthetic data distribution using: 1) distribution-aware diverse content upsampling to enhance visual diversity while preserving content distribution, and 2) density-aware redundant cluster downsampling to balance samples by reducing density in clustered areas.
Result: Extensive experiments across three cross-dataset settings (synthetic-to-authentic, synthetic-to-algorithmic, synthetic-to-synthetic) demonstrate improved generalization performance.
Conclusion: The distribution issue in synthetic datasets, not model architecture, limits BIQA generalization; SynDR-IQA effectively addresses this by reshaping synthetic data distribution through diversity enhancement and redundancy reduction.
Abstract: Blind Image Quality Assessment (BIQA) has advanced significantly through deep learning, but the scarcity of large-scale labeled datasets remains a challenge. While synthetic data offers a promising solution, models trained on existing synthetic datasets often show limited generalization ability. In this work, we make a key observation that representations learned from synthetic datasets often exhibit a discrete and clustered pattern that hinders regression performance: features of high-quality images cluster around reference images, while those of low-quality images cluster based on distortion types. Our analysis reveals that this issue stems from the distribution of synthetic data rather than model architecture. Consequently, we introduce a novel framework SynDR-IQA, which reshapes synthetic data distribution to enhance BIQA generalization. Based on theoretical derivations of sample diversity and redundancy’s impact on generalization error, SynDR-IQA employs two strategies: distribution-aware diverse content upsampling, which enhances visual diversity while preserving content distribution, and density-aware redundant cluster downsampling, which balances samples by reducing the density of densely clustered areas. Extensive experiments across three cross-dataset settings (synthetic-to-authentic, synthetic-to-algorithmic, and synthetic-to-synthetic) demonstrate the effectiveness of our method. The code is available at https://github.com/Li-aobo/SynDR-IQA.
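A rough sketch of the density-aware redundant cluster downsampling idea, using generic k-means as a stand-in for the paper's density estimation; the diversity-upsampling branch is not shown and all parameters are illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans

def density_aware_downsample(features: np.ndarray, n_clusters: int = 10,
                             max_per_cluster: int = 50, seed: int = 0):
    """Cluster synthetic-image features, then cap the number of samples
    kept per cluster so that densely clustered (redundant) regions are
    thinned out. Returns the indices of the retained samples."""
    rng = np.random.default_rng(seed)
    labels = KMeans(n_clusters=n_clusters, n_init=10,
                    random_state=seed).fit_predict(features)
    keep = []
    for c in range(n_clusters):
        idx = np.flatnonzero(labels == c)
        if len(idx) > max_per_cluster:
            idx = rng.choice(idx, size=max_per_cluster, replace=False)
        keep.extend(idx.tolist())
    return np.sort(np.array(keep))

# Toy usage: 1000 samples with 128-d features.
feats = np.random.randn(1000, 128)
kept = density_aware_downsample(feats, n_clusters=8, max_per_cluster=60)
print(len(kept), "samples kept out of", len(feats))
```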
[86] Application Research of a Deep Learning Model Integrating CycleGAN and YOLO in PCB Infrared Defect Detection
Chao Yang, Haoyuan Zheng, Yue Ma
Main category: cs.CV
TL;DR: Cross-modal data augmentation framework using CycleGAN and YOLOv8 to address IR data scarcity in PCB defect detection by generating pseudo-IR images from visible-light data.
Details
Motivation: Addresses the critical bottleneck of infrared (IR) data scarcity in Printed Circuit Board (PCB) defect detection, as conventional methods rely on paired supervision which is limited in IR domain.Method: Uses CycleGAN for unpaired image-to-image translation to map abundant visible-light PCB images into infrared domain, generating high-fidelity pseudo-IR samples. Then employs heterogeneous training strategy fusing generated pseudo-IR data with limited real IR samples to train lightweight YOLOv8 detector.
Result: The method effectively enhances feature learning under low-data conditions. The augmented detector significantly outperforms models trained on limited real data alone and approaches performance benchmarks of fully supervised training.
Conclusion: Pseudo-IR synthesis proves to be a robust augmentation strategy for industrial inspection, demonstrating efficacy in addressing IR data scarcity through cross-modal data generation.
Abstract: This paper addresses the critical bottleneck of infrared (IR) data scarcity in Printed Circuit Board (PCB) defect detection by proposing a cross-modal data augmentation framework integrating CycleGAN and YOLOv8. Unlike conventional methods relying on paired supervision, we leverage CycleGAN to perform unpaired image-to-image translation, mapping abundant visible-light PCB images into the infrared domain. This generative process synthesizes high-fidelity pseudo-IR samples that preserve the structural semantics of defects while accurately simulating thermal distribution patterns. Subsequently, we construct a heterogeneous training strategy that fuses generated pseudo-IR data with limited real IR samples to train a lightweight YOLOv8 detector. Experimental results demonstrate that this method effectively enhances feature learning under low-data conditions. The augmented detector significantly outperforms models trained on limited real data alone and approaches the performance benchmarks of fully supervised training, proving the efficacy of pseudo-IR synthesis as a robust augmentation strategy for industrial inspection.
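The heterogeneous training strategy amounts to mixing abundant pseudo-IR images with scarce real IR images in every batch. A minimal PyTorch sketch with toy tensors standing in for both pools; the 3x up-weighting of real IR samples is illustrative and not taken from the paper.

```python
import torch
from torch.utils.data import ConcatDataset, DataLoader, Dataset, WeightedRandomSampler

class TensorImageDataset(Dataset):
    """Stand-in dataset wrapping pre-loaded image tensors and a domain flag."""
    def __init__(self, images, domain_flags):
        self.images, self.domain_flags = images, domain_flags
    def __len__(self):
        return len(self.images)
    def __getitem__(self, i):
        return self.images[i], self.domain_flags[i]

# Hypothetical pools: abundant CycleGAN pseudo-IR vs. scarce real IR (tiny toy sizes).
pseudo_ir = TensorImageDataset(torch.randn(90, 3, 64, 64), torch.zeros(90))
real_ir   = TensorImageDataset(torch.randn(10, 3, 64, 64), torch.ones(10))

mixed = ConcatDataset([pseudo_ir, real_ir])
# Up-weight the scarce real-IR samples so batches still see them regularly.
weights = torch.cat([torch.full((len(pseudo_ir),), 1.0),
                     torch.full((len(real_ir),), 3.0)])
sampler = WeightedRandomSampler(weights, num_samples=len(mixed), replacement=True)
loader = DataLoader(mixed, batch_size=16, sampler=sampler)

images, domain_flags = next(iter(loader))
print(images.shape, domain_flags.mean().item())  # fraction of real-IR samples in the batch
```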
[87] Context-Aware Pesticide Recommendation via Few-Shot Pest Recognition for Precision Agriculture
Anirudha Ghosh, Ritam Sarkar, Debaditya Barman
Main category: cs.CV
TL;DR: A lightweight framework for pest detection and pesticide recommendation using compact CNN with meta-learning, designed for low-resource devices like smartphones/drones to help small farmers.
Details
Motivation: Traditional pest management methods are costly, time-consuming, labor-intensive, and environmentally harmful. There's a need for accessible solutions for small farmers using low-resource devices.Method: Two-component framework: 1) Pest Detection Module using lightweight CNN with prototypical meta-learning for few-shot learning, 2) Pesticide Recommendation Module incorporating environmental factors. Uses comprehensive pest image dataset with diverse viewing angles, sizes, and backgrounds.
Result: Lightweight CNN achieves high accuracy comparable to state-of-the-art models while significantly reducing computational complexity. Decision Support System reduces dependence on chemical pesticides and promotes sustainable practices.
Conclusion: The framework demonstrates potential for real-time applications in precision agriculture, offering an accessible, eco-friendly pest management solution for small farmers using low-resource devices.
Abstract: Effective pest management is crucial for enhancing agricultural productivity, especially for crops such as sugarcane and wheat that are highly vulnerable to pest infestations. Traditional pest management methods depend heavily on manual field inspections and the use of chemical pesticides. These approaches are often costly, time-consuming, labor-intensive, and can have a negative impact on the environment. To overcome these challenges, this study presents a lightweight framework for pest detection and pesticide recommendation, designed for low-resource devices such as smartphones and drones, making it suitable for use by small and marginal farmers. The proposed framework includes two main components. The first is a Pest Detection Module that uses a compact, lightweight convolutional neural network (CNN) combined with prototypical meta-learning to accurately identify pests even when only a few training samples are available. The second is a Pesticide Recommendation Module that incorporates environmental factors like crop type and growth stage to suggest safe and eco-friendly pesticide recommendations. To train and evaluate our framework, a comprehensive pest image dataset was developed by combining multiple publicly available datasets. The final dataset contains samples with different viewing angles, pest sizes, and background conditions to ensure strong generalization. Experimental results show that the proposed lightweight CNN achieves high accuracy, comparable to state-of-the-art models, while significantly reducing computational complexity. The Decision Support System additionally improves pest management by reducing dependence on traditional chemical pesticides and encouraging sustainable practices, demonstrating its potential for real-time applications in precision agriculture.
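At episode level, prototypical meta-learning reduces to nearest-prototype classification in embedding space. A minimal sketch, with random tensors standing in for the lightweight CNN's embeddings:

```python
import torch
import torch.nn.functional as F

def prototypical_logits(support: torch.Tensor, support_labels: torch.Tensor,
                        query: torch.Tensor, n_classes: int) -> torch.Tensor:
    """Few-shot classification as in prototypical networks: each class
    prototype is the mean support embedding, and queries are scored by
    negative squared Euclidean distance to every prototype."""
    prototypes = torch.stack([support[support_labels == c].mean(dim=0)
                              for c in range(n_classes)])   # (C, D)
    dists = torch.cdist(query, prototypes) ** 2              # (Q, C)
    return -dists                                            # higher = closer

# Toy 5-way 5-shot episode with 64-d embeddings.
support = torch.randn(25, 64)
support_labels = torch.arange(5).repeat_interleave(5)
query = torch.randn(10, 64)
logits = prototypical_logits(support, support_labels, query, n_classes=5)
print(F.softmax(logits, dim=1).argmax(dim=1))  # predicted pest classes per query
```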
[88] TotalFM: An Organ-Separated Framework for 3D-CT Vision Foundation Models
Kohei Yamamoto, Tomohiro Kikuchi
Main category: cs.CV
TL;DR: TotalFM is a radiological foundation model for 3D-CT volumetric data that uses organ separation and large-scale data (140K series) to efficiently learn correspondences between CT images and text, achieving strong zero-shot performance in lesion classification tasks.
Details
Motivation: Foundation models in radiology face computational cost challenges when training on 3D-CT volumetric data, making practical implementation difficult despite their potential for various clinical tasks.Method: Proposes TotalFM with organ separation concept: 1) Automates creation of organ volume and finding-sentence pairs using segmentation and LLM-based report processing, 2) Combines self-supervised pre-training (VideoMAE) with contrastive learning using volume-text pairs to balance computational efficiency and representation capability.
Result: Outperforms CT-CLIP in 83% (5/6) of organs and Merlin in 64% (9/14) of organs for zero-shot organ-wise lesion classification. Achieves higher AUROC in 83% (25/30) of finding categories vs Merlin for zero-shot finding-wise classification. Shows comparable performance to existing VLMs in radiology report generation.
Conclusion: The organ-separated learning framework provides a realistic and effective design guideline for practical implementation of 3D-CT foundation models, demonstrating high generalization performance in clinical evaluation settings.
Abstract: While foundation models in radiology are expected to be applied to various clinical tasks, computational cost constraints remain a major challenge when training on 3D-CT volumetric data. In this study, we propose TotalFM, a radiological foundation model that efficiently learns the correspondence between 3D-CT images and linguistic expressions based on the concept of organ separation, utilizing a large-scale dataset of 140,000 series. By automating the creation of organ volume and finding-sentence pairs through segmentation techniques and Large Language Model (LLM)-based radiology report processing, and by combining self-supervised pre-training via VideoMAE with contrastive learning using volume-text pairs, we aimed to balance computational efficiency and representation capability. In zero-shot organ-wise lesion classification tasks, the proposed model achieved higher F1 scores in 83% (5/6) of organs compared to CT-CLIP and 64% (9/14) of organs compared to Merlin. These results suggest that the proposed model exhibits high generalization performance in a clinical evaluation setting using actual radiology report sentences. Furthermore, in zero-shot finding-wise lesion classification tasks, our model achieved a higher AUROC in 83% (25/30) of finding categories compared to Merlin. We also confirmed performance comparable to existing Vision-Language Models (VLMs) in radiology report generation tasks. Our results demonstrate that the organ-separated learning framework can serve as a realistic and effective design guideline for the practical implementation of 3D-CT foundation models.
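The contrastive stage pairs organ volumes with finding sentences. The sketch below shows a CLIP-style symmetric InfoNCE loss over such pairs; the VideoMAE pre-training and the encoders themselves are outside this sketch, and TotalFM's exact loss may differ.

```python
import torch
import torch.nn.functional as F

def volume_text_contrastive_loss(vol_emb: torch.Tensor, txt_emb: torch.Tensor,
                                 temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE over matched (organ volume, finding sentence) pairs.
    vol_emb, txt_emb: (B, D) embeddings where row i of each is a matched pair."""
    v = F.normalize(vol_emb, dim=-1)
    t = F.normalize(txt_emb, dim=-1)
    logits = v @ t.T / temperature               # (B, B) similarity matrix
    targets = torch.arange(v.size(0))            # i-th volume matches i-th sentence
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.T, targets))

# Toy batch: 8 organ-volume embeddings paired with 8 finding-sentence embeddings.
loss = volume_text_contrastive_loss(torch.randn(8, 512), torch.randn(8, 512))
print(loss.item())
```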
[89] S1-MMAlign: A Large-Scale, Multi-Disciplinary Dataset for Scientific Figure-Text Understanding
He Wang, Longteng Guo, Pengkang Huo, Xuanxu Lin, Yichen Yuan, Jie Jiang, Jing Liu
Main category: cs.CV
TL;DR: S1-MMAlign is a large-scale multimodal dataset of 15.5M scientific image-text pairs from 2.5M papers, enhanced with AI-generated captions to bridge the semantic gap between complex scientific imagery and sparse text descriptions.
Details
Motivation: Multimodal learning has transformed general domains but struggles in scientific discovery due to the semantic gap between complex scientific imagery and sparse textual descriptions in papers.Method: Created dataset from 2.5M open-access papers across physics, biology, engineering. Used Qwen-VL multimodal model to recaption images by synthesizing context from paper abstracts and citation contexts, addressing weak alignment in raw captions.
Result: Dataset contains 15.5M high-quality image-text pairs. Enhancement pipeline improved data quality: SciBERT pseudo-perplexity metrics show reduced semantic ambiguity, CLIP scores indicate 18.21% improvement in image-text alignment.
Conclusion: S1-MMAlign provides a foundational resource for advancing scientific reasoning and cross-modal understanding in AI for Science, publicly available on Hugging Face.
Abstract: Multimodal learning has revolutionized general domain tasks, yet its application in scientific discovery is hindered by the profound semantic gap between complex scientific imagery and sparse textual descriptions. We present S1-MMAlign, a large-scale, multi-disciplinary multimodal dataset comprising over 15.5 million high-quality image-text pairs derived from 2.5 million open-access scientific papers. Spanning disciplines from physics and biology to engineering, the dataset captures diverse visual modalities including experimental setups, heatmaps, and microscopic imagery. To address the pervasive issue of weak alignment in raw scientific captions, we introduce an AI-ready semantic enhancement pipeline that utilizes the Qwen-VL multimodal large model series to recaption images by synthesizing context from paper abstracts and citation contexts. Technical validation demonstrates that this enhancement significantly improves data quality: SciBERT-based pseudo-perplexity metrics show reduced semantic ambiguity, while CLIP scores indicate an 18.21% improvement in image-text alignment. S1-MMAlign provides a foundational resource for advancing scientific reasoning and cross-modal understanding in the era of AI for Science. The dataset is publicly available at https://huggingface.co/datasets/ScienceOne-AI/S1-MMAlign.
[90] ActErase: A Training-Free Paradigm for Precise Concept Erasure via Activation Patching
Yi Sun, Xinhao Zhong, Hongyan Li, Yimin Zhou, Junhao Li, Bin Chen, Xuan Wang
Main category: cs.CV
TL;DR: ActErase: A training-free method for concept erasure in diffusion models that identifies activation difference regions via prompt-pair analysis and dynamically replaces activations during forward passes.
Details
Motivation: Address safety, copyright, and ethical concerns in text-to-image diffusion models. Existing concept erasure methods require data-intensive fine-tuning, which is computationally expensive and limiting.Method: Training-free approach that identifies activation difference regions through prompt-pair analysis, extracts target activations, and dynamically replaces input activations during forward passes without model fine-tuning.
Result: Achieves SOTA erasure performance across nudity, artistic style, and object removal tasks while preserving generative capability. Shows strong robustness against adversarial attacks.
Conclusion: Establishes a new plug-and-play paradigm for lightweight yet effective concept manipulation in diffusion models, overcoming limitations of fine-tuning-based approaches.
Abstract: Recent advances in text-to-image diffusion models have demonstrated remarkable generation capabilities, yet they raise significant concerns regarding safety, copyright, and ethical implications. Existing concept erasure methods address these risks by removing sensitive concepts from pre-trained models, but most of them rely on data-intensive and computationally expensive fine-tuning, which poses a critical limitation. To overcome these challenges, inspired by the observation that the model’s activations are predominantly composed of generic concepts, with only a minimal component can represent the target concept, we propose a novel training-free method (ActErase) for efficient concept erasure. Specifically, the proposed method operates by identifying activation difference regions via prompt-pair analysis, extracting target activations and dynamically replacing input activations during forward passes. Comprehensive evaluations across three critical erasure tasks (nudity, artistic style, and object removal) demonstrates that our training-free method achieves state-of-the-art (SOTA) erasure performance, while effectively preserving the model’s overall generative capability. Our approach also exhibits strong robustness against adversarial attacks, establishing a new plug-and-play paradigm for lightweight yet effective concept manipulation in diffusion models.
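A minimal sketch of the activation-patching mechanism, using a plain linear layer and a PyTorch forward hook in place of a diffusion U-Net block; the prompt-pair difference analysis is reduced to a simple threshold, and all tensors and the threshold value are stand-ins.

```python
import torch
import torch.nn as nn

def activation_patch_hook(replacement: torch.Tensor, mask: torch.Tensor):
    """Forward hook that overwrites the masked region of a layer's output
    with a pre-extracted replacement activation during the forward pass."""
    def hook(module, inputs, output):
        return torch.where(mask, replacement, output)
    return hook

# Toy layer standing in for a block of a diffusion model.
layer = nn.Linear(16, 16)
x = torch.randn(4, 16)

with torch.no_grad():
    act_concept = layer(x)                      # activations for a concept-bearing prompt (stand-in)
    act_neutral = layer(torch.randn(4, 16))     # activations for a neutral prompt (stand-in)

# Regions where the two activations differ most are treated as concept-specific.
mask = (act_concept - act_neutral).abs() > 1.0
handle = layer.register_forward_hook(activation_patch_hook(act_neutral, mask))
patched = layer(x)                              # concept-specific regions replaced on the fly
handle.remove()
print(patched.shape)
```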
[91] FaithSCAN: Model-Driven Single-Pass Hallucination Detection for Faithful Visual Question Answering
Chaodong Tong, Qi Zhang, Chen Li, Lei Jiang, Yanbing Liu
Main category: cs.CV
TL;DR: FaithSCAN: A lightweight network that detects VQA hallucinations by fusing internal VLM signals (token uncertainty, visual representations, cross-modal alignment) with uncertainty-aware attention, using automatically generated supervision signals without human labels.
Details
Motivation: Existing VQA hallucination detection methods have limitations: external verification approaches are computationally expensive and dependent on external resource quality, while uncertainty-driven methods capture only limited facets of model uncertainty and fail to explore rich internal signals associated with diverse failure modes.Method: FaithSCAN exploits rich internal VLM signals including token-level decoding uncertainty, intermediate visual representations, and cross-modal alignment features. These signals are fused via branch-wise evidence encoding and uncertainty-aware attention. The method extends LLM-as-a-Judge paradigm to VQA hallucination detection with a low-cost strategy to automatically generate model-dependent supervision signals for supervised training without human labels.
Result: Experiments on multiple VQA benchmarks show FaithSCAN significantly outperforms existing methods in both effectiveness and efficiency. In-depth analysis reveals hallucinations arise from systematic internal state variations in visual perception, cross-modal reasoning, and language decoding, with different internal signals providing complementary diagnostic cues.
Conclusion: FaithSCAN addresses limitations of existing hallucination detection methods by efficiently leveraging internal VLM signals, demonstrating superior performance while providing new insights into multimodal hallucination patterns that vary across VLM architectures.
Abstract: Faithfulness hallucinations in VQA occur when vision-language models produce fluent yet visually ungrounded answers, severely undermining their reliability in safety-critical applications. Existing detection methods mainly fall into two categories: external verification approaches relying on auxiliary models or knowledge bases, and uncertainty-driven approaches using repeated sampling or uncertainty estimates. The former suffer from high computational overhead and are limited by external resource quality, while the latter capture only limited facets of model uncertainty and fail to sufficiently explore the rich internal signals associated with the diverse failure modes. Both paradigms thus have inherent limitations in efficiency, robustness, and detection performance. To address these challenges, we propose FaithSCAN: a lightweight network that detects hallucinations by exploiting rich internal signals of VLMs, including token-level decoding uncertainty, intermediate visual representations, and cross-modal alignment features. These signals are fused via branch-wise evidence encoding and uncertainty-aware attention. We also extend the LLM-as-a-Judge paradigm to VQA hallucination and propose a low-cost strategy to automatically generate model-dependent supervision signals, enabling supervised training without costly human labels while maintaining high detection accuracy. Experiments on multiple VQA benchmarks show that FaithSCAN significantly outperforms existing methods in both effectiveness and efficiency. In-depth analysis shows hallucinations arise from systematic internal state variations in visual perception, cross-modal reasoning, and language decoding. Different internal signals provide complementary diagnostic cues, and hallucination patterns vary across VLM architectures, offering new insights into the underlying causes of multimodal hallucinations.
[92] Disentangling Hardness from Noise: An Uncertainty-Driven Model-Agnostic Framework for Long-Tailed Remote Sensing Classification
Chi Ding, Junxiao Xue, Xinyi Yin, Shi Chen, Yunyun Shi, Yiduo Wang, Fengjian Xue, Xuecheng Wu
Main category: cs.CV
TL;DR: DUAL is an uncertainty-aware framework for long-tailed remote sensing that disentangles prediction uncertainty into epistemic (sample scarcity) and aleatoric (data ambiguity) components to better handle hard tail samples while suppressing noise.
Details
Motivation: Long-tailed distributions are common in remote sensing due to imbalanced object occurrence, but existing methods fail to distinguish between hard tail samples (scarce but valuable) and noisy ambiguous samples, leading to overfitting on noise.Method: Proposes DUAL framework based on Evidential Deep Learning that dynamically disentangles prediction uncertainty into Epistemic Uncertainty (EU) for sample scarcity and Aleatoric Uncertainty (AU) for data ambiguity. Uses EU to guide reweighting for hard tail samples and AU for adaptive label smoothing to suppress noise impact.
Result: Extensive experiments on multiple datasets with various backbones show DUAL outperforms strong baselines like TGN and SADE. Ablation studies validate design choices.
Conclusion: DUAL effectively addresses the critical challenge of disentangling hard tail samples from noisy ambiguous ones in long-tailed remote sensing, providing a model-agnostic uncertainty-aware solution with strong generalization capabilities.
Abstract: Long-Tailed distributions are pervasive in remote sensing due to the inherently imbalanced occurrence of grounded objects. However, a critical challenge remains largely overlooked, i.e., disentangling hard tail data samples from noisy ambiguous ones. Conventional methods often indiscriminately emphasize all low-confidence samples, leading to overfitting on noisy data. To bridge this gap, building upon Evidential Deep Learning, we propose a model-agnostic uncertainty-aware framework termed DUAL, which dynamically disentangles prediction uncertainty into Epistemic Uncertainty (EU) and Aleatoric Uncertainty (AU). Specifically, we introduce EU as an indicator of sample scarcity to guide a reweighting strategy for hard-to-learn tail samples, while leveraging AU to quantify data ambiguity, employing an adaptive label smoothing mechanism to suppress the impact of noise. Extensive experiments on multiple datasets across various backbones demonstrate the effectiveness and generalization of our framework, surpassing strong baselines such as TGN and SADE. Ablation studies provide further insights into the crucial choices of our design.
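A sketch of one common evidential-deep-learning recipe for separating the two uncertainty signals and turning them into sample weights and label-smoothing strengths; DUAL's exact EU/AU estimators and schedules may differ, and the constants below are illustrative.

```python
import torch

def edl_uncertainties(evidence: torch.Tensor):
    """Split predictive uncertainty from evidential outputs.
    evidence: non-negative per-class evidence, shape (B, K)."""
    alpha = evidence + 1.0
    S = alpha.sum(dim=1, keepdim=True)
    K = evidence.size(1)
    probs = alpha / S                                    # expected class probabilities
    eu = K / S.squeeze(1)                                # vacuity: high when evidence is scarce
    au = -(probs * probs.clamp_min(1e-8).log()).sum(1)   # predictive entropy, used as an ambiguity proxy
    return probs, eu, au

# Toy use: up-weight scarce (high-EU) tail samples, smooth noisy (high-AU) labels.
evidence = torch.relu(torch.randn(4, 6)) * 3
probs, eu, au = edl_uncertainties(evidence)
sample_weights = 1.0 + eu                 # emphasize hard-to-learn tail samples
smoothing = 0.1 * au / au.max()           # stronger label smoothing for ambiguous samples
print(sample_weights, smoothing)
```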
[93] SV-GS: Sparse View 4D Reconstruction with Skeleton-Driven Gaussian Splatting
Jun-Jee Chao, Volkan Isler
Main category: cs.CV
TL;DR: SV-GS: A framework for dynamic object reconstruction from sparse observations using skeleton-guided deformation fields and motion interpolation.
Details
Motivation: Dynamic object reconstruction from real-world sparse observations (e.g., security cameras) is challenging because standard methods require dense multi-view video coverage, which is impractical in unconstrained environments.Method: Uses skeleton-driven deformation field with coarse joint pose estimator (time-dependent) and fine-grained deformation module. Initialized with rough skeleton graph and static reconstruction, later relaxed to use diffusion-based generative priors.
Result: Outperforms existing methods by up to 34% in PSNR on synthetic datasets under sparse observations. Achieves comparable performance to dense monocular video methods on real-world datasets using significantly fewer frames.
Conclusion: SV-GS enables practical dynamic reconstruction from sparse real-world observations by combining skeleton guidance with deformation modeling, making it applicable to scenarios like security camera footage where dense coverage is unavailable.
Abstract: Reconstructing a dynamic target moving over a large area is challenging. Standard approaches for dynamic object reconstruction require dense coverage in both the viewing space and the temporal dimension, typically relying on multi-view videos captured at each time step. However, such setups are only possible in constrained environments. In real-world scenarios, observations are often sparse over time and captured sparsely from diverse viewpoints (e.g., from security cameras), making dynamic reconstruction highly ill-posed. We present SV-GS, a framework that simultaneously estimates a deformation model and the object’s motion over time under sparse observations. To initialize SV-GS, we leverage a rough skeleton graph and an initial static reconstruction as inputs to guide motion estimation. (Later, we show that this input requirement can be relaxed.) Our method optimizes a skeleton-driven deformation field composed of a coarse skeleton joint pose estimator and a module for fine-grained deformations. By making only the joint pose estimator time-dependent, our model enables smooth motion interpolation while preserving learned geometric details. Experiments on synthetic datasets show that our method outperforms existing approaches under sparse observations by up to 34% in PSNR, and achieves comparable performance to dense monocular video methods on real-world datasets despite using significantly fewer frames. Moreover, we demonstrate that the input initial static reconstruction can be replaced by a diffusion-based generative prior, making our method more practical for real-world scenarios.
[94] Towards Automated Differential Diagnosis of Skin Diseases Using Deep Learning and Imbalance-Aware Strategies
Ali Anaissi, Ali Braytee, Weidong Huang, Junaid Akram, Alaa Farhat, Jie Hua
Main category: cs.CV
TL;DR: Deep learning model using Swin Transformer achieves 87.71% accuracy on ISIC2019 dataset for classifying 8 skin lesion types, demonstrating potential as clinical diagnostic support tool.
Details
Motivation: Growing need for intelligent diagnostic tools due to increasing dermatological conditions and limited availability of dermatologists, requiring support for both patients and clinicians in timely and accurate skin disease diagnosis.Method: Developed deep learning model using Swin Transformer architecture with pretraining on publicly available skin disease image datasets. Refined model architecture, optimized data preprocessing workflows, and applied targeted data augmentation techniques to improve performance.
Result: Final model achieved 87.71% prediction accuracy across eight skin lesion classes on the ISIC2019 dataset, demonstrating effective visual feature extraction and accurate classification of various dermatological cases.
Conclusion: The model shows potential as a diagnostic support tool for clinicians and a self-assessment aid for patients, addressing the gap between increasing dermatological needs and limited specialist availability.
Abstract: As dermatological conditions become increasingly common and the availability of dermatologists remains limited, there is a growing need for intelligent tools to support both patients and clinicians in the timely and accurate diagnosis of skin diseases. In this project, we developed a deep learning based model for the classification and diagnosis of skin conditions. By leveraging pretraining on publicly available skin disease image datasets, our model effectively extracted visual features and accurately classified various dermatological cases. Throughout the project, we refined the model architecture, optimized data preprocessing workflows, and applied targeted data augmentation techniques to improve overall performance. The final model, based on the Swin Transformer, achieved a prediction accuracy of 87.71 percent across eight skin lesion classes on the ISIC2019 dataset. These results demonstrate the model’s potential as a diagnostic support tool for clinicians and a self assessment aid for patients.
[95] TimeColor: Flexible Reference Colorization via Temporal Concatenation
Bryan Constantine Sadihin, Yihao Meng, Michael Hua Wang, Matteo Jiahao Chen, Hang Su
Main category: cs.CV
TL;DR: TimeColor is a sketch-based video colorization model that supports multiple heterogeneous references (character sheets, background images, etc.) with explicit region assignment, using additional latent frames and spatiotemporal attention mechanisms to improve color fidelity and consistency.
Details
Motivation: Existing colorization models only condition on a single reference (typically the first frame), ignoring other valuable conditional data sources like character sheets, background images, or arbitrary colorized frames that could provide better color information.Method: TimeColor encodes references as additional latent frames concatenated temporally, processed concurrently in each diffusion step. It uses spatiotemporal correspondence-masked attention to enforce subject-reference binding and modality-disjoint RoPE indexing to prevent shortcutting and cross-identity palette leakage.
Result: Experiments on SAKUGA-42M dataset show TimeColor improves color fidelity, identity consistency, and temporal stability over prior baselines under both single- and multi-reference protocols.
Conclusion: TimeColor successfully addresses limitations of single-reference colorization by supporting heterogeneous, variable-count references with explicit region assignment, achieving better performance through novel architectural mechanisms.
Abstract: Most colorization models condition only on a single reference, typically the first frame of the scene. However, this approach ignores other sources of conditional data, such as character sheets, background images, or arbitrary colorized frames. We propose TimeColor, a sketch-based video colorization model that supports heterogeneous, variable-count references with the use of explicit per-reference region assignment. TimeColor encodes references as additional latent frames which are concatenated temporally, permitting them to be processed concurrently in each diffusion step while keeping the model’s parameter count fixed. TimeColor also uses spatiotemporal correspondence-masked attention to enforce subject-reference binding in addition to modality-disjoint RoPE indexing. These mechanisms mitigate shortcutting and cross-identity palette leakage. Experiments on SAKUGA-42M under both single- and multi-reference protocols show that TimeColor improves color fidelity, identity consistency, and temporal stability over prior baselines.
[96] VisNet: Efficient Person Re-Identification via Alpha-Divergence Loss, Feature Fusion and Dynamic Multi-Task Learning
Anns Ijaz, Muhammad Azeem Javed
Main category: cs.CV
TL;DR: VisNet is an efficient person re-identification model that achieves high accuracy with low computational cost through multi-scale feature fusion, semantic clustering with body partitioning, dynamic weight averaging, and FIDI loss function.
Details
Motivation: Person re-identification needs strong accuracy with minimal computational cost for real-world surveillance and mobile applications, but current state-of-the-art methods have high computational budgets.Method: VisNet combines: 1) Multi-scale feature fusion from ResNet50 stages 1-4 without parallel paths, 2) Semantic clustering with anatomical body partitioning using rule-based pseudo-labeling, 3) Dynamic weight averaging for balancing classification and semantic regularization, and 4) FIDI loss function for improved metric learning.
Result: Achieves 87.05% Rank-1 and 77.65% mAP on Market-1501 dataset with only 32.41M parameters and 4.601 GFLOPs, making it suitable for real-time deployment in resource-constrained environments.
Conclusion: VisNet provides a practical, computationally efficient solution for person re-identification that balances accuracy and computational cost, enabling real-world deployment in surveillance and mobile applications with limited resources.
Abstract: Person re-identification (ReID) is an extremely important area in both surveillance and mobile applications, requiring strong accuracy with minimal computational cost. State-of-the-art methods give good accuracy but with high computational budgets. To remedy this, this paper proposes VisNet, a computationally efficient and effective re-identification model suitable for real-world scenarios. It combines several conceptual contributions, including feature fusion at multiple scales with automatic attention on each, semantic clustering with anatomical body partitioning, a dynamic weight averaging technique to balance classification and semantic regularization, and the FIDI loss function for improved metric learning. The multi-scale fusion combines ResNet50’s stages 1 through 4 without parallel paths, while semantic clustering introduces spatial constraints through rule-based pseudo-labeling. VisNet achieves 87.05% Rank-1 and 77.65% mAP on the Market-1501 dataset with 32.41M parameters and 4.601 GFLOPs, offering a practical approach for real-time deployment in surveillance and mobile applications where computational resources are limited.
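A minimal sketch of multi-scale fusion over ResNet50 stages 1-4 with a learned per-scale weight standing in for the per-scale attention; the semantic clustering, dynamic weight averaging, and FIDI loss are not shown.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet50

class MultiScaleFusion(nn.Module):
    """Pool stage-1..4 ResNet50 features, project them to a common dim,
    and fuse them with a learned softmax weight per scale."""
    def __init__(self, embed_dim=512):
        super().__init__()
        backbone = resnet50(weights=None)   # weights=None avoids downloading pretrained weights
        self.stem = nn.Sequential(backbone.conv1, backbone.bn1,
                                  backbone.relu, backbone.maxpool)
        self.stages = nn.ModuleList([backbone.layer1, backbone.layer2,
                                     backbone.layer3, backbone.layer4])
        dims = [256, 512, 1024, 2048]       # channel widths of the four stages
        self.proj = nn.ModuleList([nn.Linear(d, embed_dim) for d in dims])
        self.scale_logits = nn.Parameter(torch.zeros(4))
        self.pool = nn.AdaptiveAvgPool2d(1)

    def forward(self, x):
        x = self.stem(x)
        feats = []
        for stage, proj in zip(self.stages, self.proj):
            x = stage(x)
            feats.append(proj(self.pool(x).flatten(1)))    # (B, embed_dim) per scale
        weights = self.scale_logits.softmax(dim=0)
        return sum(w * f for w, f in zip(weights, feats))  # weighted fusion

emb = MultiScaleFusion()(torch.randn(2, 3, 256, 128))      # typical ReID aspect ratio
print(emb.shape)  # torch.Size([2, 512])
```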
[97] ReMA: A Training-Free Plug-and-Play Mixing Augmentation for Video Behavior Recognition
Feng-Qi Cui, Jinyang Huang, Sirui Zhao, Jinglong Guo, Qifan Cai, Xin Yan, Zhi Liu
Main category: cs.CV
TL;DR: ReMA is a plug-and-play video augmentation strategy that uses controlled mixing to expand representations while preserving class-conditional stability, improving robustness without extra supervision.
Details
Motivation: Current video data augmentation strategies are perturbation-driven and introduce uncontrolled variations that amplify non-discriminative factors, weakening intra-class distributional structure and causing representation drift with inconsistent gains across temporal scales.Method: ReMA integrates two mechanisms: 1) Representation Alignment Mechanism (RAM) for structured intra-class mixing under distributional alignment constraints, and 2) Dynamic Selection Mechanism (DSM) for motion-aware spatiotemporal masks to localize perturbations away from discrimination-sensitive regions.
Result: Extensive experiments on diverse video behavior benchmarks demonstrate that ReMA consistently enhances generalization and robustness across different spatiotemporal granularities.
Conclusion: ReMA provides an effective plug-and-play augmentation strategy that improves video representation robustness by jointly controlling how and where mixing is applied, without requiring additional supervision or trainable parameters.
Abstract: Video behavior recognition demands stable and discriminative representations under complex spatiotemporal variations. However, prevailing data augmentation strategies for videos remain largely perturbation-driven, often introducing uncontrolled variations that amplify non-discriminative factors, which finally weaken intra-class distributional structure and representation drift with inconsistent gains across temporal scales. To address these problems, we propose Representation-aware Mixing Augmentation (ReMA), a plug-and-play augmentation strategy that formulates mixing as a controlled replacement process to expand representations while preserving class-conditional stability. ReMA integrates two complementary mechanisms. Firstly, the Representation Alignment Mechanism (RAM) performs structured intra-class mixing under distributional alignment constraints, suppressing irrelevant intra-class drift while enhancing statistical reliability. Then, the Dynamic Selection Mechanism (DSM) generates motion-aware spatiotemporal masks to localize perturbations, guiding them away from discrimination-sensitive regions and promoting temporal coherence. By jointly controlling how and where mixing is applied, ReMA improves representation robustness without additional supervision or trainable parameters. Extensive experiments on diverse video behavior benchmarks demonstrate that ReMA consistently enhances generalization and robustness across different spatiotemporal granularities.
[98] Depth-Synergized Mamba Meets Memory Experts for All-Day Image Reflection Separation
Siyan Fang, Long Peng, Yuntao Wang, Ruonan Wei, Yuehuan Wang
Main category: cs.CV
TL;DR: DMDNet uses depth-aware scanning with Mamba and memory compensation to separate reflections, especially effective for nighttime images where traditional methods fail due to similar contrasts.
Details
Motivation: Existing reflection separation methods struggle when transmission and reflection layers have similar contrasts, particularly in nighttime conditions where this problem is more severe. Current approaches rely on limited single-image information and lack specialized nighttime datasets.Method: Proposes Depth-Memory Decoupling Network (DMDNet) with three key components: Depth-Aware Scanning (DAScan) to guide Mamba toward salient structures, Depth-Synergized State-Space Model (DS-SSM) to modulate state activations by depth, and Memory Expert Compensation Module (MECM) that leverages cross-image historical knowledge. Also constructs Nighttime Image Reflection Separation (NightIRS) dataset.
Result: Extensive experiments show DMDNet outperforms state-of-the-art methods in both daytime and nighttime reflection separation tasks.
Conclusion: DMDNet effectively addresses the challenge of separating reflection layers when contrasts are similar, especially in nighttime conditions, through depth-aware guidance and memory compensation mechanisms.
Abstract: Image reflection separation aims to disentangle the transmission layer and the reflection layer from a blended image. Existing methods rely on limited information from a single image, tending to confuse the two layers when their contrasts are similar, a challenge more severe at night. To address this issue, we propose the Depth-Memory Decoupling Network (DMDNet). It employs the Depth-Aware Scanning (DAScan) to guide Mamba toward salient structures, promoting information flow along semantic coherence to construct stable states. Working in synergy with DAScan, the Depth-Synergized State-Space Model (DS-SSM) modulates the sensitivity of state activations by depth, suppressing the spread of ambiguous features that interfere with layer disentanglement. Furthermore, we introduce the Memory Expert Compensation Module (MECM), leveraging cross-image historical knowledge to guide experts in providing layer-specific compensation. To address the lack of datasets for nighttime reflection separation, we construct the Nighttime Image Reflection Separation (NightIRS) dataset. Extensive experiments demonstrate that DMDNet outperforms state-of-the-art methods in both daytime and nighttime.
[99] HarmoniAD: Harmonizing Local Structures and Global Semantics for Anomaly Detection
Naiqi Zhang, Chuancheng Shi, Jingtong Dou, Wenhua Wu, Fei Shen, Jianhua Cao
Main category: cs.CV
TL;DR: HarmoniAD: A frequency-guided dual-branch framework for industrial anomaly detection that balances structure and semantics by decoupling features into high- and low-frequency paths with specialized attention modules.
Details
Motivation: Existing anomaly detection methods face a structure-semantics trade-off: structure-oriented models (frequency-based filters) are noise-sensitive, while semantics-oriented models (CLIP-based encoders) often miss fine details needed for detecting tiny defects in industrial quality inspection.Method: Features are extracted by CLIP image encoder, transformed to frequency domain, and decoupled into high- and low-frequency paths. High-frequency branch uses fine-grained structural attention module (FSAM) to enhance textures/edges for small anomalies. Low-frequency branch uses global structural context module (GSCM) to capture long-range dependencies and semantic consistency. Multi-class joint training strategy is adopted.
Result: Experiments on MVTec-AD, VisA, and BTAD datasets show state-of-the-art performance with both sensitivity and robustness.
Conclusion: HarmoniAD successfully addresses the structure-semantics trade-off in anomaly detection through frequency-guided dual-branch design, achieving balanced performance for detecting both fine details and global semantics in industrial inspection.
Abstract: Anomaly detection is crucial in industrial product quality inspection. Failing to detect tiny defects often leads to serious consequences. Existing methods face a structure-semantics trade-off: structure-oriented models (such as frequency-based filters) are noise-sensitive, while semantics-oriented models (such as CLIP-based encoders) often miss fine details. To address this, we propose HarmoniAD, a frequency-guided dual-branch framework. Features are first extracted by the CLIP image encoder, then transformed into the frequency domain, and finally decoupled into high- and low-frequency paths for complementary modeling of structure and semantics. The high-frequency branch is equipped with a fine-grained structural attention module (FSAM) to enhance textures and edges for detecting small anomalies, while the low-frequency branch uses a global structural context module (GSCM) to capture long-range dependencies and preserve semantic consistency. Together, these branches balance fine detail and global semantics. HarmoniAD further adopts a multi-class joint training strategy, and experiments on MVTec-AD, VisA, and BTAD show state-of-the-art performance with both sensitivity and robustness.
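A generic sketch of the frequency-domain decoupling step: feature maps are transformed with a 2D FFT and split into low- and high-frequency components by a radial mask. The FSAM and GSCM branches that process each path are not shown, and the cutoff is illustrative.

```python
import torch

def frequency_decouple(feat: torch.Tensor, cutoff: float = 0.25):
    """Split a feature map into low- and high-frequency components with a
    radial mask in the Fourier domain. feat: (B, C, H, W)."""
    B, C, H, W = feat.shape
    spec = torch.fft.fftshift(torch.fft.fft2(feat), dim=(-2, -1))
    yy, xx = torch.meshgrid(torch.linspace(-0.5, 0.5, H),
                            torch.linspace(-0.5, 0.5, W), indexing="ij")
    radius = (xx ** 2 + yy ** 2).sqrt()
    low_mask = (radius <= cutoff).float()
    low = torch.fft.ifft2(torch.fft.ifftshift(spec * low_mask, dim=(-2, -1))).real
    high = feat - low                     # complementary high-frequency residue
    return low, high

low, high = frequency_decouple(torch.randn(1, 64, 32, 32))
print(low.shape, high.shape)
```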
[100] Joint Geometry-Appearance Human Reconstruction in a Unified Latent Space via Bridge Diffusion
Yingzhi Tang, Qijian Zhang, Junhui Hou
Main category: cs.CV
TL;DR: JGA-LBD is a unified framework for 3D human reconstruction from single RGB images that jointly models geometry and appearance using a shared latent representation and bridge diffusion, outperforming existing methods.
Details
Motivation: Existing methods use decoupled pipelines for geometry estimation and appearance synthesis, which leads to inconsistencies and hinders unified reconstruction of 3D digital humans from single RGB images.Method: Unifies geometry and appearance modeling into joint latent representation using 3D Gaussian representations compressed via shared sparse VAE, then uses bridge diffusion to infer missing components from partial observations, followed by dedicated decoding.
Result: Outperforms current state-of-the-art approaches in both geometry fidelity and appearance quality, including challenging in-the-wild scenarios.
Conclusion: JGA-LBD successfully addresses the inconsistency problem in 3D human reconstruction by unifying geometry and appearance modeling through joint latent representation and bridge diffusion, achieving superior performance.
Abstract: Achieving consistent and high-fidelity geometry and appearance reconstruction of 3D digital humans from a single RGB image is inherently a challenging task. Existing studies typically resort to decoupled pipelines for geometry estimation and appearance synthesis, often hindering unified reconstruction and causing inconsistencies. This paper introduces \textbf{JGA-LBD}, a novel framework that unifies the modeling of geometry and appearance into a joint latent representation and formulates the generation process as bridge diffusion. Observing that directly integrating heterogeneous input conditions (e.g., depth maps, SMPL models) leads to substantial training difficulties, we unify all conditions into the 3D Gaussian representations, which can be further compressed into a unified latent space through a shared sparse variational autoencoder (VAE). Subsequently, the specialized form of bridge diffusion enables to start with a partial observation of the target latent code and solely focuses on inferring the missing components. Finally, a dedicated decoding module extracts the complete 3D human geometric structure and renders novel views from the inferred latent representation. Experiments demonstrate that JGA-LBD outperforms current state-of-the-art approaches in terms of both geometry fidelity and appearance quality, including challenging in-the-wild scenarios. Our code will be made publicly available at https://github.com/haiantyz/JGA-LBD.
[101] Intelligent Traffic Surveillance for Real-Time Vehicle Detection, License Plate Recognition, and Speed Estimation
Bruce Mugizi, Sudi Murindanyi, Olivia Nakacwa, Andrew Katumba
Main category: cs.CV
TL;DR: Real-time intelligent traffic surveillance system for developing countries using computer vision for vehicle detection, license plate recognition, and speed estimation with automated ticket issuance.
Details
Motivation: Speeding is a major contributor to road fatalities in developing countries like Uganda where road safety infrastructure is limited, creating an urgent need for automated traffic enforcement solutions.Method: Uses computer vision techniques: YOLOv8 for license plate detection, CNN and transformer models for character recognition, source/target ROI for speed estimation, and integrates with Africa’s Talking API for automated SMS ticket issuance.
Result: YOLOv8 achieved 97.9% mAP for plate detection; transformer model reduced CER to 1.79% vs CNN’s 3.85%; speed estimation had 10 km/h margin of error; system enables automated ticket issuance via SMS.
Conclusion: The system addresses critical traffic management needs in resource-constrained environments and shows potential to reduce road accidents through automated traffic enforcement in developing countries.
Abstract: Speeding is a major contributor to road fatalities, particularly in developing countries such as Uganda, where road safety infrastructure is limited. This study proposes a real-time intelligent traffic surveillance system tailored to such regions, using computer vision techniques to address vehicle detection, license plate recognition, and speed estimation. The study collected a rich dataset using a speed gun, a Canon camera, and a mobile phone to train the models. License plate detection using YOLOv8 achieved a mean average precision (mAP) of 97.9%. For character recognition of the detected license plates, the CNN model achieved a character error rate (CER) of 3.85%, while the transformer model significantly reduced the CER to 1.79%. Speed estimation used source and target regions of interest, achieving a margin of error within 10 km/h. Additionally, a database was established to correlate user information with vehicle detection data, enabling automated ticket issuance via SMS through Africa’s Talking API. This system addresses critical traffic management needs in resource-constrained environments and shows potential to reduce road accidents through automated traffic enforcement in developing countries where such interventions are urgently needed.
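The source/target ROI scheme boils down to timing a tracked vehicle between two image regions whose real-world separation is known. A minimal sketch, with detection, tracking, and ROI-crossing logic omitted and the distance value hypothetical:

```python
def estimate_speed_kmh(entry_frame: int, exit_frame: int,
                       fps: float, roi_distance_m: float) -> float:
    """Speed from the frame indices at which a tracked vehicle crosses the
    source and target ROIs, given the known real-world distance between them."""
    elapsed_s = (exit_frame - entry_frame) / fps
    return (roi_distance_m / elapsed_s) * 3.6   # m/s -> km/h

# Toy usage: 20 m covered in 2 s of 30-fps video -> 36 km/h.
print(estimate_speed_kmh(entry_frame=120, exit_frame=180, fps=30, roi_distance_m=20.0))
```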
[102] OmniVaT: Single Domain Generalization for Multimodal Visual-Tactile Learning
Liuxiang Qiu, Hui Da, Yuzhen Niu, Tiesong Zhao, Yang Cao, Zheng-Jun Zha
Main category: cs.CV
TL;DR: OmniVaT framework addresses single domain generalization for visual-tactile learning by bridging modality gaps and enhancing domain adaptation without multi-domain training data.
Details
Motivation: Visual-tactile learning suffers from modality discrepancies between visual and tactile sensors, and domain gaps caused by non-standardized sensors and inconsistent data collection procedures, which limits cross-domain generalization.Method: Proposes OmniVaT framework with: 1) Multimodal Fractional Fourier Adapter (MFFA) to map visual and tactile embeddings into unified embedding-frequency space, mitigating modality gap without multi-domain training; 2) Discrete Tree Generation (DTG) module that obtains diverse multimodal fractional representations through hierarchical tree structure for better domain adaptation.
Result: Extensive experiments demonstrate superior cross-domain generalization performance on the SDG-VTL task compared to existing methods.
Conclusion: OmniVaT successfully addresses the challenging SDG-VTL task by effectively bridging modality gaps and enhancing domain adaptation capabilities without requiring multi-domain training data or complex cross-modal fusion strategies.
Abstract: Visual-tactile learning (VTL) enables embodied agents to perceive the physical world by integrating visual (VIS) and tactile (TAC) sensors. However, VTL still suffers from modality discrepancies between VIS and TAC images, as well as domain gaps caused by non-standardized tactile sensors and inconsistent data collection procedures. We formulate these challenges as a new task, termed single domain generalization for multimodal VTL (SDG-VTL). In this paper, we propose an OmniVaT framework that, for the first time, successfully addresses this task. On the one hand, OmniVaT integrates a multimodal fractional Fourier adapter (MFFA) to map VIS and TAC embeddings into a unified embedding-frequency space, thereby effectively mitigating the modality gap without multi-domain training data or careful cross-modal fusion strategies. On the other hand, it also incorporates a discrete tree generation (DTG) module that obtains diverse and reliable multimodal fractional representations through a hierarchical tree structure, thereby enhancing its adaptivity to fluctuating domain shifts in unseen domains. Extensive experiments demonstrate the superior cross-domain generalization performance of OmniVaT on the SDG-VTL task.
[103] Efficient Prediction of Dense Visual Embeddings via Distillation and RGB-D Transformers
Söhnke Benedikt Fischedick, Daniel Seichter, Benedict Stephan, Robin Schmidt, Horst-Michael Gross
Main category: cs.CV
TL;DR: DVEFormer is an efficient RGB-D Transformer that predicts dense text-aligned visual embeddings via knowledge distillation from Alpha-CLIP, enabling flexible text-based querying and 3D mapping for domestic robots.
Details
Motivation: Robots in domestic environments need comprehensive scene understanding to interact effectively with untrained humans. Traditional semantic segmentation with fixed classes is limited, while flexible text-based querying would enable more intuitive human-robot interaction.Method: Proposes DVEFormer, an efficient RGB-D Transformer-based approach that learns dense text-aligned visual embeddings via knowledge distillation. Uses teacher embeddings from Alpha-CLIP to guide the student model in learning fine-grained pixel-wise embeddings, enabling both classical segmentation and text-based querying.
Result: Achieves competitive performance on indoor datasets while meeting real-time requirements: 26.3 FPS for full model and 77.0 FPS for smaller variant on NVIDIA Jetson AGX Orin. Enables text-based querying and integration into 3D mapping pipelines.
Conclusion: DVEFormer serves as a drop-in replacement for traditional segmentation while enabling flexible natural-language querying and seamless integration into 3D mapping for mobile robotics, making it suitable for domestic robot applications.
Abstract: In domestic environments, robots require a comprehensive understanding of their surroundings to interact effectively and intuitively with untrained humans. In this paper, we propose DVEFormer - an efficient RGB-D Transformer-based approach that predicts dense text-aligned visual embeddings (DVE) via knowledge distillation. Instead of directly performing classical semantic segmentation with fixed predefined classes, our method uses teacher embeddings from Alpha-CLIP to guide our efficient student model DVEFormer in learning fine-grained pixel-wise embeddings. While this approach still enables classical semantic segmentation, e.g., via linear probing, it further enables flexible text-based querying and other applications, such as creating comprehensive 3D maps. Evaluations on common indoor datasets demonstrate that our approach achieves competitive performance while meeting real-time requirements, operating at 26.3 FPS for the full model and 77.0 FPS for a smaller variant on an NVIDIA Jetson AGX Orin. Additionally, we show qualitative results that highlight the effectiveness and possible use cases in real-world applications. Overall, our method serves as a drop-in replacement for traditional segmentation approaches while enabling flexible natural-language querying and seamless integration into 3D mapping pipelines for mobile robotics.
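A minimal sketch of the distillation idea described above: a student head predicts a dense per-pixel embedding map and is trained to match region-level teacher embeddings (e.g., from Alpha-CLIP) via cosine similarity. The tensor shapes, mask-average pooling, and loss form are illustrative assumptions, not the paper's exact recipe.

```python
import torch
import torch.nn.functional as F

def dense_distillation_loss(student_emb, teacher_emb, region_masks):
    """Cosine distillation of dense student embeddings against region-level teacher embeddings.

    student_emb:  (B, C, H, W) per-pixel embeddings from the student.
    teacher_emb:  (B, N, C)    one teacher embedding per region (assumed from Alpha-CLIP).
    region_masks: (B, N, H, W) binary masks selecting the pixels of each region.
    """
    student_emb = F.normalize(student_emb, dim=1)
    teacher_emb = F.normalize(teacher_emb, dim=-1)

    # Mask-average the student embeddings over each region (illustrative pooling choice).
    masks = region_masks.float()
    pooled = torch.einsum("bchw,bnhw->bnc", student_emb, masks)
    pooled = pooled / masks.sum(dim=(2, 3)).clamp(min=1.0).unsqueeze(-1)
    pooled = F.normalize(pooled, dim=-1)

    # 1 - cosine similarity, averaged over all regions.
    return (1.0 - (pooled * teacher_emb).sum(dim=-1)).mean()

# Toy usage with random tensors.
loss = dense_distillation_loss(torch.randn(2, 512, 32, 32),
                               torch.randn(2, 5, 512),
                               (torch.rand(2, 5, 32, 32) > 0.5))
```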
[104] Mask-Conditioned Voxel Diffusion for Joint Geometry and Color Inpainting
Aarya Sumuk
Main category: cs.CV
TL;DR: Lightweight two-stage framework for joint 3D geometry and color inpainting using damage localization followed by mask-conditioned diffusion-based reconstruction.
Details
Motivation: Digital restoration of cultural heritage artifacts that have damaged geometry and color, requiring joint reconstruction of both aspects.Method: Two-stage pipeline: 1) 2D CNN predicts damage masks on RGB slices, aggregated into volumetric mask; 2) Diffusion-based 3D U-Net performs mask-conditioned inpainting on voxel grids with composite objective combining occupancy and color reconstruction.
Result: Produces more complete geometry and coherent color reconstructions compared to symmetry-based baselines at fixed 32^3 resolution, with explicit mask conditioning proving practical.
Conclusion: Explicit mask conditioning effectively guides volumetric diffusion models for joint 3D geometry and color inpainting in cultural heritage restoration applications.
Abstract: We present a lightweight two-stage framework for joint geometry and color inpainting of damaged 3D objects, motivated by the digital restoration of cultural heritage artifacts. The pipeline separates damage localization from reconstruction. In the first stage, a 2D convolutional network predicts damage masks on RGB slices extracted from a voxelized object, and these predictions are aggregated into a volumetric mask. In the second stage, a diffusion-based 3D U-Net performs mask-conditioned inpainting directly on voxel grids, reconstructing geometry and color while preserving observed regions. The model jointly predicts occupancy and color using a composite objective that combines occupancy reconstruction with masked color reconstruction and perceptual regularization. We evaluate the approach on a curated set of textured artifacts with synthetically generated damage using standard geometric and color metrics. Compared to symmetry-based baselines, our method produces more complete geometry and more coherent color reconstructions at a fixed 32^3 resolution. Overall, the results indicate that explicit mask conditioning is a practical way to guide volumetric diffusion models for joint 3D geometry and color inpainting.
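A minimal sketch of the composite objective described above (occupancy reconstruction plus masked color reconstruction); the specific loss terms, weights, and the omission of the perceptual regularizer are simplifying assumptions.

```python
import torch
import torch.nn.functional as F

def composite_inpainting_loss(pred_occ, pred_rgb, gt_occ, gt_rgb, damage_mask,
                              w_occ=1.0, w_col=1.0):
    """Joint occupancy + color loss for mask-conditioned voxel inpainting.

    pred_occ/gt_occ: (B, 1, D, H, W) occupancy logits / targets in [0, 1].
    pred_rgb/gt_rgb: (B, 3, D, H, W) voxel colors.
    damage_mask:     (B, 1, D, H, W) 1 where voxels are damaged and must be filled.
    """
    # Occupancy is supervised everywhere so observed regions stay intact.
    occ_loss = F.binary_cross_entropy_with_logits(pred_occ, gt_occ)

    # Color is only supervised on damaged and occupied voxels.
    color_weight = damage_mask * gt_occ
    col_loss = (color_weight * (pred_rgb - gt_rgb).abs()).sum() / color_weight.sum().clamp(min=1.0)

    return w_occ * occ_loss + w_col * col_loss

# Toy usage on a 32^3 grid.
loss = composite_inpainting_loss(torch.randn(1, 1, 32, 32, 32), torch.rand(1, 3, 32, 32, 32),
                                 torch.rand(1, 1, 32, 32, 32).round(), torch.rand(1, 3, 32, 32, 32),
                                 (torch.rand(1, 1, 32, 32, 32) > 0.7).float())
```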
[105] BHaRNet: Reliability-Aware Body-Hand Modality Expertized Networks for Fine-grained Skeleton Action Recognition
Seungyeon Cho, Tae-kyun Kim
Main category: cs.CV
TL;DR: A probabilistic dual-stream framework for skeleton-based human action recognition that unifies reliability modeling and multi-modal integration, focusing on both body and hand motions for fine-grained recognition.
Details
Motivation: Existing skeleton-based HAR methods are body-centric and neglect subtle hand articulations crucial for fine-grained recognition. There's a need to handle uncertainty and integrate multiple modalities effectively.Method: Three key components: 1) Calibration-free preprocessing that learns from native coordinates, 2) Probabilistic Noisy-OR fusion for reliability-aware dual-stream learning without confidence supervision, 3) Intra- to cross-modal ensemble coupling four skeleton modalities (Joint, Bone, Joint Motion, Bone Motion) with RGB representations.
Result: Comprehensive evaluations across multiple benchmarks (NTU RGB+D 60/120, PKU-MMD, N-UCLA) and a new hand-centric benchmark show consistent improvements and robustness under noisy and heterogeneous conditions.
Conclusion: The proposed probabilistic dual-stream framework effectively addresses the limitations of body-centric approaches by incorporating hand articulations and handling uncertainty through unified reliability modeling and multi-modal integration.
Abstract: Skeleton-based human action recognition (HAR) has achieved remarkable progress with graph-based architectures. However, most existing methods remain body-centric, focusing on large-scale motions while neglecting subtle hand articulations that are crucial for fine-grained recognition. This work presents a probabilistic dual-stream framework that unifies reliability modeling and multi-modal integration, generalizing expertized learning under uncertainty across both intra-skeleton and cross-modal domains. The framework comprises three key components: (1) a calibration-free preprocessing pipeline that removes canonical-space transformations and learns directly from native coordinates; (2) a probabilistic Noisy-OR fusion that stabilizes reliability-aware dual-stream learning without requiring explicit confidence supervision; and (3) an intra- to cross-modal ensemble that couples four skeleton modalities (Joint, Bone, Joint Motion, and Bone Motion) to RGB representations, bridging structural and visual motion cues in a unified cross-modal formulation. Comprehensive evaluations across multiple benchmarks (NTU RGB+D 60/120, PKU-MMD, N-UCLA) and a newly defined hand-centric benchmark exhibit consistent improvements and robustness under noisy and heterogeneous conditions.
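A minimal sketch of probabilistic Noisy-OR fusion over two streams: each stream outputs per-class probabilities and the fused probability is one minus the product of per-stream "failure" probabilities, so either reliable stream can activate a class. The stream names and the final renormalization are illustrative assumptions.

```python
import torch

def noisy_or_fusion(p_body, p_hand, eps=1e-6):
    """Noisy-OR fusion of per-class probabilities from two streams.

    p_body, p_hand: (B, num_classes) probabilities in [0, 1] from each stream.
    Returns fused class scores renormalized to sum to 1 per sample.
    """
    # P(class | either stream fires) = 1 - (1 - p_body) * (1 - p_hand)
    fused = 1.0 - (1.0 - p_body) * (1.0 - p_hand)
    return fused / fused.sum(dim=-1, keepdim=True).clamp(min=eps)

# Toy usage with two softmax-normalized streams over 60 action classes.
p_body = torch.softmax(torch.randn(4, 60), dim=-1)
p_hand = torch.softmax(torch.randn(4, 60), dim=-1)
scores = noisy_or_fusion(p_body, p_hand)
```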
[106] NeoVerse: Enhancing 4D World Model with in-the-wild Monocular Videos
Yuxue Yang, Lue Fan, Ziqi Shi, Junran Peng, Feng Wang, Zhaoxiang Zhang
Main category: cs.CV
TL;DR: NeoVerse is a scalable 4D world model that performs 4D reconstruction and novel-trajectory video generation from monocular videos, overcoming limitations of existing methods that require specialized multi-view data or complex pre-processing.
Details
Motivation: Current 4D world modeling methods suffer from scalability issues due to their reliance on expensive multi-view 4D data or cumbersome training pre-processing, limiting their practical application to diverse real-world scenarios.Method: NeoVerse uses pose-free feed-forward 4D reconstruction, online monocular degradation pattern simulation, and other well-aligned techniques to create a scalable pipeline that works with diverse in-the-wild monocular videos.
Result: NeoVerse achieves state-of-the-art performance in standard reconstruction and generation benchmarks while demonstrating versatility and generalization across various domains.
Conclusion: NeoVerse provides a scalable and versatile 4D world modeling solution that enables 4D reconstruction, novel-trajectory video generation, and rich downstream applications from monocular videos, addressing key limitations of existing approaches.
Abstract: In this paper, we propose NeoVerse, a versatile 4D world model that is capable of 4D reconstruction, novel-trajectory video generation, and rich downstream applications. We first identify a common limitation of scalability in current 4D world modeling methods, caused either by expensive and specialized multi-view 4D data or by cumbersome training pre-processing. In contrast, our NeoVerse is built upon a core philosophy that makes the full pipeline scalable to diverse in-the-wild monocular videos. Specifically, NeoVerse features pose-free feed-forward 4D reconstruction, online monocular degradation pattern simulation, and other well-aligned techniques. These designs empower NeoVerse with versatility and generalization to various domains. Meanwhile, NeoVerse achieves state-of-the-art performance in standard reconstruction and generation benchmarks. Our project page is available at https://neoverse-4d.github.io
[107] RoLID-11K: A Dashcam Dataset for Small-Object Roadside Litter Detection
Tao Wu, Qing Xu, Xiangjian He, Oakleigh Weekes, James Brown, Wenting Duan
Main category: cs.CV
TL;DR: RoLID-11K is the first large-scale dataset for roadside litter detection from dashcams, featuring 11k annotated images with extreme small objects and long-tail distribution, benchmarking modern detectors for scalable litter monitoring.
Details
Motivation: Current roadside litter monitoring relies on labor-intensive surveys with limited coverage. Existing vision datasets don't capture dashcam footage characteristics where litter appears extremely small, sparse, and in cluttered backgrounds.Method: Created RoLID-11K dataset with 11k+ annotated dashcam images from diverse UK driving conditions. Benchmarked various modern detectors including transformer architectures (CO-DETR) and real-time YOLO models.
Result: Transformer architectures like CO-DETR achieve best localization accuracy, while real-time models are constrained by coarse feature hierarchies. The dataset exhibits challenging long-tail and small-object distributions.
Conclusion: RoLID-11K establishes a challenging benchmark for extreme small-object detection in dynamic driving scenes and supports development of scalable, low-cost roadside litter monitoring systems.
Abstract: Roadside litter poses environmental, safety and economic challenges, yet current monitoring relies on labour-intensive surveys and public reporting, providing limited spatial coverage. Existing vision datasets for litter detection focus on street-level still images, aerial scenes or aquatic environments, and do not reflect the unique characteristics of dashcam footage, where litter appears extremely small, sparse and embedded in cluttered road-verge backgrounds. We introduce RoLID-11K, the first large-scale dataset for roadside litter detection from dashcams, comprising over 11k annotated images spanning diverse UK driving conditions and exhibiting pronounced long-tail and small-object distributions. We benchmark a broad spectrum of modern detectors, from accuracy-oriented transformer architectures to real-time YOLO models, and analyse their strengths and limitations on this challenging task. Our results show that while CO-DETR and related transformers achieve the best localisation accuracy, real-time models remain constrained by coarse feature hierarchies. RoLID-11K establishes a challenging benchmark for extreme small-object detection in dynamic driving scenes and aims to support the development of scalable, low-cost systems for roadside-litter monitoring. The dataset is available at https://github.com/xq141839/RoLID-11K.
[108] ABFR-KAN: Kolmogorov-Arnold Networks for Functional Brain Analysis
Tyler Ward, Abdullah Imran
Main category: cs.CV
TL;DR: ABFR-KAN is a transformer-based classification network that uses Kolmogorov-Arnold Networks to improve functional connectivity analysis for autism diagnosis, addressing atlas-based parcellation limitations.
Details
Motivation: Traditional functional connectivity analysis relies on atlas-based parcellation, which suffers from selection bias and lacks subject specificity, limiting its reliability for brain disorder diagnosis.Method: Proposes ABFR-KAN, a transformer-based classification network incorporating novel brain function representation components with Kolmogorov-Arnold Networks to mitigate structural bias and improve anatomical conformity.
Result: Extensive experiments on ABIDE I dataset show ABFR-KAN consistently outperforms state-of-the-art baselines for ASD classification, validated through cross-site evaluation and ablation studies.
Conclusion: ABFR-KAN effectively addresses limitations of atlas-based parcellation, improving functional connectivity estimation reliability for autism spectrum disorder diagnosis.
Abstract: Functional connectivity (FC) analysis, a valuable tool for computer-aided brain disorder diagnosis, traditionally relies on atlas-based parcellation. However, issues relating to selection bias and a lack of regard for subject specificity can arise as a result of such parcellations. Addressing this, we propose ABFR-KAN, a transformer-based classification network that incorporates novel advanced brain function representation components with the power of Kolmogorov-Arnold Networks (KANs) to mitigate structural bias, improve anatomical conformity, and enhance the reliability of FC estimation. Extensive experiments on the ABIDE I dataset, including cross-site evaluation and ablation studies across varying model backbones and KAN configurations, demonstrate that ABFR-KAN consistently outperforms state-of-the-art baselines for autism spectrum disorder (ASD) classification. Our code is available at https://github.com/tbwa233/ABFR-KAN.
[109] Robust Assembly Progress Estimation via Deep Metric Learning
Kazuma Miura, Sarthak Pathak, Kazunori Umeda
Main category: cs.CV
TL;DR: Proposes Anomaly Quadruplet-Net for robust assembly progress estimation using Quadruplet Loss and strategic sample selection, improving accuracy over previous methods.
Details
Motivation: Manual multi-day assembly tasks challenge smart factory monitoring. Existing methods struggle with subtle visual changes between tasks, leading to misclassification.Method: Uses Quadruplet Loss-based learning for anomaly images with custom data loader that strategically selects training samples to enhance estimation accuracy.
Result: Outperformed existing methods on desktop PC assembly dataset: improved estimation accuracy by 1.3% and reduced misclassification between adjacent tasks by 1.9%.
Conclusion: The proposed method effectively addresses occlusion and minimal visual change challenges in assembly progress estimation using small-scale datasets.
Abstract: In recent years, the advancement of AI technologies has accelerated the development of smart factories. In particular, the automatic monitoring of product assembly progress is crucial for improving operational efficiency, minimizing the cost of discarded parts, and maximizing factory productivity. However, in cases where assembly tasks are performed manually over multiple days, implementing smart factory systems remains a challenge. Previous work has proposed Anomaly Triplet-Net, which estimates assembly progress by applying deep metric learning to the visual features of products. Nevertheless, when visual changes between consecutive tasks are subtle, misclassification often occurs. To address this issue, this paper proposes a robust system for estimating assembly progress, even in cases of occlusion or minimal visual change, using a small-scale dataset. Our method leverages a Quadruplet Loss-based learning approach for anomaly images and introduces a custom data loader that strategically selects training samples to enhance estimation accuracy. We evaluated our approach using an image dataset captured during desktop PC assembly. The proposed Anomaly Quadruplet-Net outperformed existing methods on the dataset. Specifically, it improved the estimation accuracy by 1.3% and reduced misclassification between adjacent tasks by 1.9% on the desktop PC dataset, demonstrating the effectiveness of the proposed method.
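A minimal sketch of a quadruplet loss in the spirit described above: on top of the usual anchor/positive/negative triplet term, a second negative is pushed away from the first negative, tightening inter-class margins. The margins and distance metric are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn.functional as F

def quadruplet_loss(anchor, positive, negative1, negative2, margin1=1.0, margin2=0.5):
    """Quadruplet loss over L2 distances between embeddings of shape (B, D)."""
    d_ap = F.pairwise_distance(anchor, positive)      # anchor <-> same progress stage
    d_an = F.pairwise_distance(anchor, negative1)     # anchor <-> different stage
    d_nn = F.pairwise_distance(negative1, negative2)  # two samples from other stages

    term1 = F.relu(d_ap - d_an + margin1)  # standard triplet term
    term2 = F.relu(d_ap - d_nn + margin2)  # extra term involving the second negative
    return (term1 + term2).mean()

# Toy usage with random 128-d embeddings.
loss = quadruplet_loss(*[torch.randn(16, 128) for _ in range(4)])
```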
[110] CPPO: Contrastive Perception for Vision Language Policy Optimization
Ahmad Rezaei, Mohsen Gholami, Saeed Ranjbar Alvar, Kevin Cannons, Mohammad Asiful Hossain, Zhou Weimin, Shunbo Zhou, Yong Zhang, Mohammad Akbari
Main category: cs.CV
TL;DR: CPPO introduces Contrastive Perception Policy Optimization for finetuning vision-language models, using entropy shifts to detect perception tokens and adding contrastive perception loss to improve multimodal reasoning without extra models.
Details
Motivation: While RL has advanced reasoning in language models, extending it to multimodal reasoning requires improving both perception and reasoning aspects. Prior methods use explicit perception rewards but struggle with disentangling perception from reasoning tokens, requiring extra LLMs, ground-truth data, or indiscriminate reward application.Method: CPPO detects perception tokens via entropy shifts in model outputs under perturbed input images. It extends RL objective with Contrastive Perception Loss (CPL) that enforces consistency under information-preserving perturbations and sensitivity under information-removing ones.
Result: CPPO surpasses previous perception-rewarding methods while avoiding extra models, making training more efficient and scalable.
Conclusion: CPPO provides an effective approach for finetuning VLMs by addressing the perception-rewarding challenge through entropy-based token detection and contrastive learning, enabling more efficient and scalable multimodal reasoning.
Abstract: We introduce CPPO, a Contrastive Perception Policy Optimization method for finetuning vision-language models (VLMs). While reinforcement learning (RL) has advanced reasoning in language models, extending it to multimodal reasoning requires improving both the perception and reasoning aspects. Prior works tackle this challenge mainly with explicit perception rewards, but disentangling perception tokens from reasoning tokens is difficult, requiring extra LLMs, ground-truth data, forced separation of perception from reasoning by policy model, or applying rewards indiscriminately to all output tokens. CPPO addresses this problem by detecting perception tokens via entropy shifts in the model outputs under perturbed input images. CPPO then extends the RL objective function with a Contrastive Perception Loss (CPL) that enforces consistency under information-preserving perturbations and sensitivity under information-removing ones. Experiments show that CPPO surpasses previous perception-rewarding methods, while avoiding extra models, making training more efficient and scalable.
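A minimal sketch of the entropy-shift idea behind CPPO: run the policy on the original image and a perturbed copy, compare per-token output entropies, and flag the tokens whose entropy changes the most as perception tokens. The model interface and the top-fraction threshold are assumptions for illustration.

```python
import torch

def token_entropy(logits):
    """Per-token entropy of next-token distributions. logits: (T, V)."""
    log_p = torch.log_softmax(logits, dim=-1)
    return -(log_p.exp() * log_p).sum(dim=-1)  # (T,)

def detect_perception_tokens(logits_clean, logits_perturbed, top_frac=0.2):
    """Mark the fraction of output tokens whose entropy shifts most under image perturbation."""
    shift = (token_entropy(logits_perturbed) - token_entropy(logits_clean)).abs()
    k = max(1, int(top_frac * shift.numel()))
    threshold = shift.topk(k).values.min()
    return shift >= threshold  # boolean mask over output tokens

# Toy usage with random logits for a 50-token output and a 32k vocabulary.
mask = detect_perception_tokens(torch.randn(50, 32000), torch.randn(50, 32000))
```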
[111] MotionPhysics: Learnable Motion Distillation for Text-Guided Simulation
Miaowei Wang, Jakub Zadrożny, Oisin Mac Aodha, Amir Vaxman
Main category: cs.CV
TL;DR: MotionPhysics: An end-to-end differentiable framework that infers physical parameters from natural language prompts to simulate realistic dynamics of 3D objects and materials without requiring ground-truth trajectories or annotated videos.
Details
Motivation: Simulating 3D objects with diverse materials requires expert knowledge and time-consuming parameter tuning. Current methods need ground-truth trajectories or annotated videos, limiting accessibility and scalability.Method: 1) Uses multimodal LLM to estimate material parameters within plausible ranges. 2) Proposes learnable motion distillation loss to extract motion priors from pretrained video diffusion models while minimizing appearance/geometry biases. 3) End-to-end differentiable framework.
Result: Evaluated across 30+ scenarios with real-world, human-designed, and AI-generated 3D objects, covering elastic solids, metals, foams, sand, Newtonian/non-Newtonian fluids. Produces visually realistic dynamic simulations guided by natural language, surpassing state-of-the-art.
Conclusion: MotionPhysics enables automatic determination of physically plausible parameters for realistic dynamic simulations from natural language prompts, removing the need for expert knowledge or ground-truth data, advancing accessible physics simulation.
Abstract: Accurately simulating existing 3D objects and a wide variety of materials often demands expert knowledge and time-consuming physical parameter tuning to achieve the desired dynamic behavior. We introduce MotionPhysics, an end-to-end differentiable framework that infers plausible physical parameters from a user-provided natural language prompt for a chosen 3D scene of interest, removing the need for guidance from ground-truth trajectories or annotated videos. Our approach first utilizes a multimodal large language model to estimate material parameter values, which are constrained to lie within plausible ranges. We further propose a learnable motion distillation loss that extracts robust motion priors from pretrained video diffusion models while minimizing appearance and geometry inductive biases to guide the simulation. We evaluate MotionPhysics across more than thirty scenarios, including real-world, human-designed, and AI-generated 3D objects, spanning a wide range of materials such as elastic solids, metals, foams, sand, and both Newtonian and non-Newtonian fluids. We demonstrate that MotionPhysics produces visually realistic dynamic simulations guided by natural language, surpassing the state of the art while automatically determining physically plausible parameters. The code and project page are available at: https://wangmiaowei.github.io/MotionPhysics.github.io/.
[112] All-in-One Video Restoration under Smoothly Evolving Unknown Weather Degradations
Wenrui Li, Hongtao Chen, Yao Xiao, Wangmeng Zuo, Jiantao Zhou, Yonghong Tian, Xiaopeng Fan
Main category: cs.CV
TL;DR: ORCANet is a video restoration model that handles smoothly evolving unknown degradations (SEUD) using recurrent conditional adaptive prompting with coarse intensity estimation and flow-based prompt generation.
Details
Motivation: Existing video restoration methods focus on frame-wise degradation variation but ignore temporal continuity in real-world degradation processes where degradation types and intensities evolve smoothly over time, and multiple degradations may coexist or transition gradually.Method: ORCANet uses: 1) Coarse Intensity Estimation Dehazing (CIED) module with physical priors for haze intensity estimation and coarse dehazed features; 2) Flow Prompt Generation (FPG) module that extracts degradation features with static prompts (segment-level degradation types) and dynamic prompts (frame-level intensity variations); 3) Label-aware supervision for discriminative static prompt representations.
Result: Extensive experiments show ORCANet achieves superior restoration quality, temporal consistency, and robustness over image and video-based baselines.
Conclusion: The proposed SEUD scenario and ORCANet framework effectively address the challenge of smoothly evolving unknown degradations in video restoration, with a flexible synthesis pipeline for generating temporally coherent degraded videos and a novel network architecture for robust restoration.
Abstract: All-in-one image restoration aims to recover clean images from diverse unknown degradations using a single model. But extending this task to videos faces unique challenges. Existing approaches primarily focus on frame-wise degradation variation, overlooking the temporal continuity that naturally exists in real-world degradation processes. In practice, degradation types and intensities evolve smoothly over time, and multiple degradations may coexist or transition gradually. In this paper, we introduce the Smoothly Evolving Unknown Degradations (SEUD) scenario, where both the active degradation set and degradation intensity change continuously over time. To support this scenario, we design a flexible synthesis pipeline that generates temporally coherent videos with single, compound, and evolving degradations. To address the challenges in the SEUD scenario, we propose an all-in-One Recurrent Conditional and Adaptive prompting Network (ORCANet). First, a Coarse Intensity Estimation Dehazing (CIED) module estimates haze intensity using physical priors and provides coarse dehazed features as initialization. Second, a Flow Prompt Generation (FPG) module extracts degradation features. FPG generates both static prompts that capture segment-level degradation types and dynamic prompts that adapt to frame-level intensity variations. Furthermore, a label-aware supervision mechanism improves the discriminability of static prompt representations under different degradations. Extensive experiments show that ORCANet achieves superior restoration quality, temporal consistency, and robustness over image and video-based baselines. Code is available at https://github.com/Friskknight/ORCANet-SEUD.
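The CIED module estimates haze intensity from physical priors. As one concrete, assumed instantiation (the paper's actual prior may differ), the classic dark channel prior yields a cheap per-frame haze score:

```python
import torch
import torch.nn.functional as F

def dark_channel_haze_score(frame, patch=15):
    """Rough haze-intensity score based on the dark channel prior.

    frame: (3, H, W) RGB in [0, 1]. Returns a scalar in [0, 1]; hazier frames
    tend to have a brighter dark channel, hence a higher score.
    """
    min_rgb = frame.min(dim=0, keepdim=True).values            # (1, H, W)
    pad = patch // 2
    # Min-pooling over local patches = negated max-pooling of the negated image.
    dark = -F.max_pool2d(-min_rgb.unsqueeze(0), patch, stride=1, padding=pad)
    return dark.mean().item()

# Toy usage: a washed-out (hazy-looking) frame scores higher than a dark, clear one.
hazy = 0.6 + 0.2 * torch.rand(3, 64, 64)
clear = 0.2 * torch.rand(3, 64, 64)
print(dark_channel_haze_score(hazy), dark_channel_haze_score(clear))
```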
[113] FreeText: Training-Free Text Rendering in Diffusion Transformers via Attention Localization and Spectral Glyph Injection
Ruiqiang Zhang, Hengyi Wang, Chang Liu, Guanjie Wang, Zehua Ma, Weiming Zhang
Main category: cs.CV
TL;DR: FreeText is a training-free framework that improves text rendering in diffusion models by decomposing the problem into localization (where to write) and glyph injection (what to write), using attention mechanisms and frequency-domain modulation.
Details
Motivation: Current text-to-image diffusion models struggle with precise text rendering, especially for multi-line layouts, dense typography, and long-tailed scripts like Chinese. Existing solutions require costly retraining or rigid external layout constraints that degrade aesthetics and limit flexibility.Method: FreeText decomposes text rendering into two components: 1) Localizing writing regions using token-wise spatial attribution from image-to-text attention with sink-like tokens as stable anchors and topology-aware refinement, and 2) Injecting glyphs using Spectral-Modulated Glyph Injection (SGMI) which injects noise-aligned glyph priors with frequency-domain band-pass modulation to strengthen glyph structure and suppress semantic leakage.
Result: Extensive experiments on Qwen-Image, FLUX.1-dev, and SD3 variants across multiple benchmarks (longText-Benchmark, CVTG, and CLT-Bench) show consistent gains in text readability while largely preserving semantic alignment and aesthetic quality, with modest inference overhead.
Conclusion: FreeText provides an effective training-free, plug-and-play solution for improving text rendering in diffusion models, addressing both localization and glyph injection challenges without requiring model retraining or compromising aesthetic quality.
Abstract: Large-scale text-to-image (T2I) diffusion models excel at open-domain synthesis but still struggle with precise text rendering, especially for multi-line layouts, dense typography, and long-tailed scripts such as Chinese. Prior solutions typically require costly retraining or rigid external layout constraints, which can degrade aesthetics and limit flexibility. We propose FreeText, a training-free, plug-and-play framework that improves text rendering by exploiting intrinsic mechanisms of Diffusion Transformer (DiT) models. FreeText decomposes the problem into where to write and what to write. For where to write, we localize writing regions by reading token-wise spatial attribution from endogenous image-to-text attention, using sink-like tokens as stable spatial anchors and topology-aware refinement to produce high-confidence masks. For what to write, we introduce Spectral-Modulated Glyph Injection (SGMI), which injects a noise-aligned glyph prior with frequency-domain band-pass modulation to strengthen glyph structure and suppress semantic leakage (rendering the concept instead of the word). Extensive experiments on Qwen-Image, FLUX.1-dev, and SD3 variants across longText-Benchmark, CVTG, and our CLT-Bench show consistent gains in text readability while largely preserving semantic alignment and aesthetic quality, with modest inference overhead.
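A minimal sketch of the frequency-domain band-pass idea behind SGMI: FFT the glyph prior, keep a mid-frequency band that carries stroke structure while attenuating low-frequency content and high-frequency noise, then blend the result into the latent inside the localized writing mask. The band limits and injection weight are illustrative assumptions.

```python
import torch

def bandpass_glyph(glyph, low=0.05, high=0.5):
    """Band-pass filter a glyph image (H, W) in the Fourier domain."""
    H, W = glyph.shape
    fy = torch.fft.fftfreq(H).reshape(-1, 1)
    fx = torch.fft.fftfreq(W).reshape(1, -1)
    radius = torch.sqrt(fx ** 2 + fy ** 2)                  # normalized frequency radius
    band = ((radius >= low) & (radius <= high)).float()     # keep mid frequencies only
    spectrum = torch.fft.fft2(glyph) * band
    return torch.fft.ifft2(spectrum).real

def inject_glyph(latent, glyph, mask, weight=0.3):
    """Blend the filtered glyph prior into the latent inside the writing-region mask."""
    prior = bandpass_glyph(glyph)
    return latent * (1 - weight * mask) + weight * mask * prior

# Toy usage on a single 64x64 latent channel.
latent = torch.randn(64, 64)
glyph = (torch.rand(64, 64) > 0.8).float()   # stand-in glyph raster
mask = torch.zeros(64, 64)
mask[16:48, 16:48] = 1.0
out = inject_glyph(latent, glyph, mask)
```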
[114] Boosting Segment Anything Model to Generalize Visually Non-Salient Scenarios
Guangqian Guo, Pengfei Chen, Yong Guo, Huafeng Chen, Boqiang Zhang, Shan Gao
Main category: cs.CV
TL;DR: VNS-SAM enhances SAM’s segmentation for visually non-salient scenarios (low contrast foreground/background) while preserving zero-shot capabilities, using minimal parameter additions and a new dataset.
Details
Motivation: SAM struggles with visually non-salient scenarios where foreground and background have low contrast, leading to inaccurate contours and poor segmentation results.Method: Proposes VNS-SAM with two key designs: Mask-Edge Token Interactive decoder and Non-Salient Feature Mining module to better utilize SAM’s low-level features for non-salient scenarios with minimal parameter/computation overhead.
Result: VNS-SEG dataset with 35K+ images created for training and benchmarking. VNS-SAM achieves superior performance across various VNS segmentation tasks, especially in zero-shot settings, with only 4 hours of additional training.
Conclusion: VNS-SAM effectively enhances SAM’s perception of visually non-salient scenarios while maintaining zero-shot generalizability, demonstrating practical feasibility and broad real-world application potential.
Abstract: Segment Anything Model (SAM), known for its remarkable zero-shot segmentation capabilities, has garnered significant attention in the community. Nevertheless, its performance is challenged when dealing with what we refer to as visually non-salient scenarios, where there is low contrast between the foreground and background. In these cases, existing methods often cannot capture accurate contours and fail to produce promising segmentation results. In this paper, we propose Visually Non-Salient SAM (VNS-SAM), aiming to enhance SAM’s perception of visually non-salient scenarios while preserving its original zero-shot generalizability. We achieve this by effectively exploiting SAM’s low-level features through two designs: Mask-Edge Token Interactive decoder and Non-Salient Feature Mining module. These designs help the SAM decoder gain a deeper understanding of non-salient characteristics with only marginal parameter increments and computational requirements. The additional parameters of VNS-SAM can be optimized within 4 hours, demonstrating its feasibility and practicality. In terms of data, we established VNS-SEG, a unified dataset for various VNS scenarios, with more than 35K images, in contrast to previous single-task adaptations. It is designed to make the model learn more robust VNS features and comprehensively benchmark the model’s segmentation performance and generalizability on VNS scenarios. Extensive experiments across various VNS segmentation tasks demonstrate the superior performance of VNS-SAM, particularly under zero-shot settings, highlighting its potential for broad real-world applications. Codes and datasets are publicly available at https://guangqian-guo.github.io/VNS-SAM.
[115] A Comprehensive Dataset for Human vs. AI Generated Image Detection
Rajarshi Roy, Nasrin Imanpour, Ashhar Aziz, Shashwat Bajpai, Gurpreet Singh, Shwetangshu Biswas, Kapil Wanaskar, Parth Patwa, Subhankar Ghosh, Shreyas Dixit, Nilesh Ranjan Pal, Vipula Rawte, Ritvik Garimella, Gaytri Jena, Vasu Sharma, Vinija Jain, Aman Chadha, Aishwarya Naresh Reganti, Amitava Das
Main category: cs.CV
TL;DR: MS COCOAI is a new dataset for AI-generated image detection with 96K real/synthetic images from 5 generators, enabling classification of real vs. generated images and identification of the specific AI model used.
Details
Motivation: Multimodal generative AI systems have revolutionized image creation but also enable the spread of misleading content and manipulated media. As AI-generated images become increasingly indistinguishable from real photographs, there's an urgent need for effective detection methods to combat misinformation.Method: Created MS COCOAI dataset using MS COCO as base, generating synthetic images with five AI generators: Stable Diffusion 3, Stable Diffusion 2.1, SDXL, DALL-E 3, and MidJourney v6. The dataset contains 96,000 real and synthetic datapoints and supports two detection tasks: real vs. generated classification and model identification for synthetic images.
Result: Released a comprehensive dataset of 96,000 images (real and synthetic) generated by five state-of-the-art AI models, publicly available on Hugging Face. The dataset enables research on two key detection tasks to address the growing challenge of AI-generated image identification.
Conclusion: MS COCOAI provides a valuable resource for advancing AI-generated image detection research, addressing the critical need for tools to identify synthetic media as AI-generated content becomes more sophisticated and harder to distinguish from real images.
Abstract: Multimodal generative AI systems like Stable Diffusion, DALL-E, and MidJourney have fundamentally changed how synthetic images are created. These tools drive innovation but also enable the spread of misleading content, false information, and manipulated media. As generated images become harder to distinguish from photographs, detecting them has become an urgent priority. To combat this challenge, we release MS COCOAI, a novel dataset for AI-generated image detection consisting of 96,000 real and synthetic datapoints, built using the MS COCO dataset. To generate synthetic images, we use five generators: Stable Diffusion 3, Stable Diffusion 2.1, SDXL, DALL-E 3, and MidJourney v6. Based on the dataset, we propose two tasks: (1) classifying images as real or generated, and (2) identifying which model produced a given synthetic image. The dataset is available at https://huggingface.co/datasets/Rajarshi-Roy-research/Defactify_Image_Dataset.
[116] DynaDrag: Dynamic Drag-Style Image Editing by Motion Prediction
Jiacheng Sui, Yujie Zhou, Li Niu
Main category: cs.CV
TL;DR: DynaDrag is a new drag-style image editing method using a predict-and-move framework with iterative motion prediction and supervision to avoid tracking issues in previous approaches.
Details
Motivation: Previous drag-style image editing methods suffer from problems like miss tracking, ambiguous tracking, large gaps between source and target images, and unreasonable intermediate points that reduce editability.Method: DynaDrag uses a predict-and-move framework with iterative Motion Prediction (predicting handle point movements) and Motion Supervision (dragging points accordingly), plus dynamic adjustment of valid handle points.
Result: Experiments on face and human datasets demonstrate superiority over previous works in drag-style image editing.
Conclusion: DynaDrag successfully addresses limitations of previous drag-style editing methods by introducing a novel predict-and-move framework with iterative motion prediction and dynamic handle point adjustment.
Abstract: To achieve pixel-level image manipulation, drag-style image editing, which edits images using points or trajectories as conditions, is attracting widespread attention. Most previous methods follow a move-and-track framework, in which miss tracking and ambiguous tracking are unavoidable challenging issues. Other methods under different frameworks suffer from various problems, such as the huge gap between the source image and the target edited image, as well as unreasonable intermediate points, which can lead to low editability. To avoid these problems, we propose DynaDrag, the first dragging method under a predict-and-move framework. In DynaDrag, Motion Prediction and Motion Supervision are performed iteratively. In each iteration, Motion Prediction first predicts where the handle points should move, and then Motion Supervision drags them accordingly. We also propose to dynamically adjust the valid handle points to further improve the performance. Experiments on face and human datasets showcase its superiority over previous works.
[117] SlingBAG Pro: Accelerating point cloud-based iterative reconstruction for 3D photoacoustic imaging under arbitrary array
Shuang Li, Yibing Wang, Jian Gao, Chulhong Kim, Seongwook Choi, Yu Zhang, Qian Chen, Yao Yao, Changhui Li
Main category: cs.CV
TL;DR: SlingBAG Pro is an advanced 3D photoacoustic reconstruction algorithm that extends the original SlingBAG method to work with arbitrary irregular transducer arrays, achieving faster reconstruction while maintaining quality.
Details
Motivation: Traditional iterative reconstruction algorithms struggle with irregular geometric transducer arrays used in 3D photoacoustic imaging - they have high computational complexity, large memory requirements, and long reconstruction times, limiting clinical adoption.Method: Extends the point cloud iteration concept of Sliding ball adaptive growth (SlingBAG) to arbitrary array geometries. Uses hierarchical optimization combining zero-gradient filtering with progressively increased temporal sampling rates during iteration to rapidly remove redundant spatial point clouds and accelerate convergence.
Result: Achieves up to 2.2-fold speed improvement compared to original SlingBAG algorithm for point cloud-based 3D PA reconstruction under irregular array geometries. Validated through both simulation and in vivo mouse experiments.
Conclusion: SlingBAG Pro enables high-quality 3D photoacoustic imaging with fewer transducers in irregular configurations, significantly reducing reconstruction time while maintaining reconstruction quality, making it promising for clinical applications.
Abstract: High-quality three-dimensional (3D) photoacoustic imaging (PAI) is gaining increasing attention in clinical applications. To address the challenges of limited space and high costs, irregular geometric transducer arrays that conform to specific imaging regions are promising for achieving high-quality 3D PAI with fewer transducers. However, traditional iterative reconstruction algorithms struggle with irregular array configurations, suffering from high computational complexity, substantial memory requirements, and lengthy reconstruction times. In this work, we introduce SlingBAG Pro, an advanced reconstruction algorithm based on the point cloud iteration concept of the Sliding ball adaptive growth (SlingBAG) method, while extending its compatibility to arbitrary array geometries. SlingBAG Pro maintains high reconstruction quality, reduces the number of required transducers, and employs a hierarchical optimization strategy that combines zero-gradient filtering with progressively increased temporal sampling rates during iteration. This strategy rapidly removes redundant spatial point clouds, accelerates convergence, and significantly shortens overall reconstruction time. Compared to the original SlingBAG algorithm, SlingBAG Pro achieves up to a 2.2-fold speed improvement in point cloud-based 3D PA reconstruction under irregular array geometries. The proposed method is validated through both simulation and in vivo mouse experiments, and the source code is publicly available at https://github.com/JaegerCQ/SlingBAG_Pro.
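A minimal sketch of the zero-gradient filtering step: point-cloud entries whose accumulated gradient magnitude stays near zero contribute nothing to the reconstruction and can be pruned, shrinking subsequent iterations. The threshold and bookkeeping are illustrative assumptions.

```python
import torch

def prune_zero_gradient_points(points, grad_accum, threshold=1e-6):
    """Drop point-cloud entries whose accumulated gradient norm is (near) zero.

    points:     (N, D) per-point parameters (e.g., position, radius, amplitude).
    grad_accum: (N,)   gradient magnitudes accumulated over recent iterations.
    """
    keep = grad_accum > threshold
    return points[keep], grad_accum[keep]

# Toy usage: half of the points have never received a meaningful gradient.
points = torch.randn(10000, 5)
grad_accum = torch.cat([torch.rand(5000), torch.zeros(5000)])
points, grad_accum = prune_zero_gradient_points(points, grad_accum)
print(points.shape)  # roughly (5000, 5)
```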
[118] AEGIS: Exploring the Limit of World Knowledge Capabilities for Unified Mulitmodal Models
Jintao Lin, Bowen Dong, Weikang Shi, Chenyang Lei, Suiyun Zhang, Rui Liu, Xihui Liu
Main category: cs.CV
TL;DR: AEGIS benchmark assesses UMMs’ world knowledge across multiple tasks using deterministic evaluation, revealing significant knowledge gaps and reasoning limitations.
Details
Motivation: Existing benchmarks for Unified Multimodal Models (UMMs) are inadequate - they only offer siloed, single-task evaluations with limited diagnostic power, failing to properly assess UMMs' ability to apply world knowledge across diverse tasks.Method: Proposed AEGIS benchmark covering visual understanding, generation, editing, and interleaved generation with 1,050 manually-annotated questions across 21 topics and 6 reasoning types. Also introduced Deterministic Checklist-based Evaluation (DCE) protocol using atomic “Y/N” judgments instead of ambiguous prompt-based scoring.
Result: Most UMMs exhibit severe world knowledge deficits, with performance degrading significantly with complex reasoning. Simple plug-in reasoning modules can partially mitigate these vulnerabilities.
Conclusion: World-knowledge-based reasoning is a critical frontier for UMMs, and the proposed AEGIS benchmark with DCE evaluation provides a more reliable way to assess UMM capabilities.
Abstract: The capability of Unified Multimodal Models (UMMs) to apply world knowledge across diverse tasks remains a critical, unresolved challenge. Existing benchmarks fall short, offering only siloed, single-task evaluations with limited diagnostic power. To bridge this gap, we propose AEGIS (i.e., Assessing Editing, Generation, Interpretation-Understanding for Super-intelligence), a comprehensive multi-task benchmark covering visual understanding, generation, editing, and interleaved generation. AEGIS comprises 1,050 challenging, manually-annotated questions spanning 21 topics (including STEM, humanities, daily life, etc.) and 6 reasoning types. To concretely evaluate the performance of UMMs in world knowledge scope without ambiguous metrics, we further propose Deterministic Checklist-based Evaluation (DCE), a protocol that replaces ambiguous prompt-based scoring with atomic "Y/N" judgments, to enhance evaluation reliability. Our extensive experiments reveal that most UMMs exhibit severe world knowledge deficits and that performance degrades significantly with complex reasoning. Additionally, simple plug-in reasoning modules can partially mitigate these vulnerabilities, highlighting a promising direction for future research. These results highlight the importance of world-knowledge-based reasoning as a critical frontier for UMMs.
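A minimal sketch of checklist-based scoring in the spirit of DCE: each question carries a list of atomic yes/no checks, a judge returns a boolean per check, and the item score is the fraction satisfied. The judge interface here is a placeholder assumption.

```python
from typing import Callable, List

def dce_score(output: str, checklist: List[str],
              judge: Callable[[str, str], bool]) -> float:
    """Fraction of atomic Y/N checks that the model output satisfies."""
    verdicts = [judge(output, check) for check in checklist]
    return sum(verdicts) / max(len(verdicts), 1)

# Toy usage with a trivial keyword judge standing in for a real Y/N evaluator.
keyword_judge = lambda output, check: check.lower() in output.lower()
checks = ["Eiffel Tower", "Paris", "wrought iron"]
print(dce_score("A photo of the Eiffel Tower in Paris at dusk.", checks, keyword_judge))
```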
[119] A Cascaded Information Interaction Network for Precise Image Segmentation
Hewen Xiao, Jie Mei, Guangfu Ma, Weiren Wu
Main category: cs.CV
TL;DR: Proposes a cascaded CNN with Global Information Guidance Module for robust image segmentation, achieving state-of-the-art performance on benchmark datasets.
Details
Motivation: Visual perception is crucial for autonomous systems but robust segmentation remains challenging in complex scenarios. Traditional methods struggle with cluttered/blurred environments, and single-scale feature extraction has inherent limitations.Method: Cascaded convolutional neural network integrated with a novel Global Information Guidance Module that effectively fuses low-level texture details with high-level semantic features across multiple layers.
Result: Superior precision on benchmark image segmentation datasets, outperforming existing state-of-the-art methods, particularly effective in visually cluttered or blurred environments.
Conclusion: The proposed framework demonstrates effectiveness and promising potential for deployment in practical robotic applications, addressing segmentation challenges in complex visual scenarios.
Abstract: Visual perception plays a pivotal role in enabling autonomous behavior, offering a cost-effective and efficient alternative to complex multi-sensor systems. However, robust segmentation remains a challenge in complex scenarios. To address this, this paper proposes a cascaded convolutional neural network integrated with a novel Global Information Guidance Module. This module is designed to effectively fuse low-level texture details with high-level semantic features across multiple layers, thereby overcoming the inherent limitations of single-scale feature extraction. This architectural innovation significantly enhances segmentation accuracy, particularly in visually cluttered or blurred environments where traditional methods often fail. Experimental evaluations on benchmark image segmentation datasets demonstrate that the proposed framework achieves superior precision, outperforming existing state-of-the-art methods. The results highlight the effectiveness of the approach and its promising potential for deployment in practical robotic applications.
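A minimal sketch of the kind of fusion the Global Information Guidance Module describes: a high-level semantic feature map is upsampled and used to gate a low-level texture feature map before the two are concatenated and projected. Channel sizes and the gating form are illustrative assumptions, not the paper's exact module.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GlobalGuidanceFusion(nn.Module):
    """Fuse low-level texture features with upsampled high-level semantic features."""

    def __init__(self, low_ch, high_ch, out_ch):
        super().__init__()
        self.gate = nn.Sequential(nn.Conv2d(high_ch, low_ch, 1), nn.Sigmoid())
        self.proj = nn.Conv2d(low_ch + high_ch, out_ch, 3, padding=1)

    def forward(self, low_feat, high_feat):
        high_up = F.interpolate(high_feat, size=low_feat.shape[-2:],
                                mode="bilinear", align_corners=False)
        guided_low = low_feat * self.gate(high_up)          # semantic gating of texture
        return self.proj(torch.cat([guided_low, high_up], dim=1))

# Toy usage: 1/4-resolution texture features guided by 1/16-resolution semantics.
fusion = GlobalGuidanceFusion(low_ch=64, high_ch=256, out_ch=128)
out = fusion(torch.randn(1, 64, 128, 128), torch.randn(1, 256, 32, 32))
```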
[120] Noise-Robust Tiny Object Localization with Flows
Huixin Sun, Linlin Yang, Ronyu Chen, Kerui Gu, Baochang Zhang, Angela Yao, Xianbin Cao
Main category: cs.CV
TL;DR: TOLF is a noise-robust localization framework for tiny objects that uses normalizing flows for error modeling and uncertainty-guided optimization to prevent overfitting to annotation noise.
Details
Motivation: Tiny object detection suffers from a persistent performance gap compared to normal-scale objects, largely due to their high sensitivity to annotation noise. Optimizing strict localization objectives risks overfitting to this noise.Method: Proposes Tiny Object Localization with Flows (TOLF) with two key components: 1) Flow-based error modeling using normalizing flows to capture complex, non-Gaussian prediction distributions, enabling robust learning under noisy supervision. 2) Uncertainty-aware gradient modulation that suppresses learning from high-uncertainty, noise-prone samples to mitigate overfitting and stabilize training.
Result: Extensive experiments across three datasets validate the approach’s effectiveness. TOLF boosts the DINO baseline by 1.2% AP on the AI-TOD dataset, demonstrating significant improvement in tiny object detection performance.
Conclusion: TOLF provides a noise-robust solution for tiny object localization by addressing the fundamental challenge of annotation noise sensitivity through flexible error modeling and uncertainty-guided optimization, closing the performance gap between tiny and normal-scale object detection.
Abstract: Despite significant advances in generic object detection, a persistent performance gap remains for tiny objects compared to normal-scale objects. We demonstrate that tiny objects are highly sensitive to annotation noise, where optimizing strict localization objectives risks noise overfitting. To address this, we propose Tiny Object Localization with Flows (TOLF), a noise-robust localization framework leveraging normalizing flows for flexible error modeling and uncertainty-guided optimization. Our method captures complex, non-Gaussian prediction distributions through flow-based error modeling, enabling robust learning under noisy supervision. An uncertainty-aware gradient modulation mechanism further suppresses learning from high-uncertainty, noise-prone samples, mitigating overfitting while stabilizing training. Extensive experiments across three datasets validate our approach’s effectiveness. In particular, TOLF boosts the DINO baseline by 1.2% AP on the AI-TOD dataset.
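A minimal sketch of uncertainty-aware gradient modulation: each box's localization loss is down-weighted by its predicted uncertainty, so high-uncertainty (likely noise-corrupted) samples contribute smaller gradients. The particular weighting (a standard heteroscedastic form) is an illustrative assumption, not the paper's exact flow-based formulation.

```python
import torch

def uncertainty_modulated_loss(per_box_loss, log_sigma):
    """Down-weight per-box localization losses by predicted uncertainty.

    per_box_loss: (N,) raw localization losses (e.g., L1 or GIoU).
    log_sigma:    (N,) predicted log standard deviation per box.
    """
    weight = torch.exp(-log_sigma)                       # high uncertainty -> small weight
    return (weight * per_box_loss + log_sigma).mean()    # +log_sigma keeps sigma from growing unboundedly

# Toy usage: boxes with large predicted sigma contribute less to the gradient.
loss = uncertainty_modulated_loss(torch.rand(8), torch.randn(8))
```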
[121] GranAlign: Granularity-Aware Alignment Framework for Zero-Shot Video Moment Retrieval
Mingyu Jeon, Sunjae Yoon, Jonghee Kim, Junyeoung Kim
Main category: cs.CV
TL;DR: GranAlign is a training-free framework for zero-shot video moment retrieval that addresses semantic granularity mismatch between text queries and video content through query rewriting and query-aware caption generation.
Details
Motivation: Previous ZVMR approaches fail to balance semantic granularity between pre-trained video and language representations, leading to inaccurate retrieval despite high-quality individual modality representations.Method: Proposes Granularity-Aware Alignment with two techniques: 1) granularity-based query rewriting to generate varied semantic granularities, and 2) query-aware caption generation to embed query intent into video content.
Result: Sets new state-of-the-art across all three major benchmarks (QVHighlights, Charades-STA, ActivityNet-Captions), with 3.23% mAP@avg improvement on QVHighlights.
Conclusion: The training-free GranAlign framework effectively resolves semantic granularity mismatches in ZVMR by pairing multi-level queries with both query-agnostic and query-aware captions.
Abstract: Zero-shot video moment retrieval (ZVMR) is the task of localizing a temporal moment within an untrimmed video using a natural language query without relying on task-specific training data. The primary challenge in this setting lies in the mismatch in semantic granularity between textual queries and visual content. Previous studies in ZVMR have attempted to achieve alignment by leveraging high-quality pre-trained knowledge that represents video and language in a joint space. However, these approaches failed to balance the semantic granularity between the pre-trained knowledge provided by each modality for a given scene. As a result, despite the high quality of each modality’s representations, the mismatch in granularity led to inaccurate retrieval. In this paper, we propose a training-free framework, called Granularity-Aware Alignment (GranAlign), that bridges this gap between coarse and fine semantic representations. Our approach introduces two complementary techniques: granularity-based query rewriting to generate varied semantic granularities, and query-aware caption generation to embed query intent into video content. By pairing multi-level queries with both query-agnostic and query-aware captions, we effectively resolve semantic mismatches. As a result, our method sets a new state-of-the-art across all three major benchmarks (QVHighlights, Charades-STA, ActivityNet-Captions), with a notable 3.23% mAP@avg improvement on the challenging QVHighlights dataset.
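A minimal training-free sketch in the spirit of GranAlign: query rewrites at several granularities are scored against per-segment captions with a text encoder, and the best-scoring segment is returned. The encoder choice and scoring rule are illustrative assumptions; the toy encoder below merely stands in for a real text encoder.

```python
import torch

def retrieve_moment(query_variants, segment_captions, encode):
    """Pick the video segment whose caption best matches multi-granularity query rewrites.

    query_variants:   list[str]  rewrites of the query at different granularities.
    segment_captions: list[str]  one caption per candidate segment.
    encode:           callable   text -> unit-norm embedding tensor of shape (D,).
    """
    q = torch.stack([encode(t) for t in query_variants])      # (Q, D)
    c = torch.stack([encode(t) for t in segment_captions])    # (S, D)
    scores = (q @ c.T).mean(dim=0)                             # average over granularities
    return int(scores.argmax())                                # index of best segment

# Toy usage with a stand-in hash-seeded encoder (assumption; any text encoder works here).
def toy_encode(text, dim=64):
    g = torch.Generator().manual_seed(abs(hash(text)) % (2 ** 31))
    v = torch.randn(dim, generator=g)
    return v / v.norm()

best = retrieve_moment(["person opens a door", "someone enters the room"],
                       ["a man opens a door", "a dog sleeps", "a car drives by"],
                       toy_encode)
```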
[122] SafeMo: Linguistically Grounded Unlearning for Trustworthy Text-to-Motion Generation
Yiling Wang, Zeyu Zhang, Yiran Wang, Hao Tang
Main category: cs.CV
TL;DR: SafeMo is a trustworthy motion generation framework that uses Minimal Motion Unlearning to enable safe human motion generation in continuous space, avoiding artifacts from discrete codebook methods while maintaining good performance on benign prompts.
Details
Motivation: Existing text-to-motion generation methods have safety concerns, and current safety approaches using discrete VQ-VAE codebook replacement have critical flaws: they degrade performance on everyday tasks due to codebook reuse, and introduce quantization artifacts and jerky transitions. Additionally, existing datasets contain unsafe content unsuitable for safety-driven learning.Method: Proposes SafeMo framework with Minimal Motion Unlearning (MMU), a two-stage machine unlearning strategy that enables safe human motion generation in continuous space. Also introduces SafeMoVAE-29K dataset with rewritten safe text prompts and continuous refined motion for trustworthy unlearning. Built upon DiP architecture.
Result: SafeMo achieves effective unlearning with strengthened forgetting on unsafe prompts, reaching 2.5x and 14.4x higher forget-set FID on HumanML3D and Motion-X respectively compared to previous SOTA method LCR. Maintains comparable or better performance on safe prompts.
Conclusion: SafeMo provides a trustworthy motion generation framework that addresses safety concerns through continuous-space unlearning, avoiding quantization artifacts while delivering strong safety-utility trade-offs. The new dataset enables better safety-focused training.
Abstract: Text-to-motion (T2M) generation with diffusion backbones achieves strong realism and alignment. Safety concerns in T2M methods have been raised in recent years; existing methods replace discrete VQ-VAE codebook entries to steer the model away from unsafe behaviors. However, discrete codebook replacement-based methods have two critical flaws: firstly, replacing codebook entries which are reused by benign prompts leads to drifts on everyday tasks, degrading the model’s benign performance; secondly, discrete token-based methods introduce quantization and smoothness loss, resulting in artifacts and jerky transitions. Moreover, existing text-to-motion datasets naturally contain unsafe intents and corresponding motions, making them unsuitable for safety-driven machine learning. To address these challenges, we propose SafeMo, a trustworthy motion generative framework integrating Minimal Motion Unlearning (MMU), a two-stage machine unlearning strategy, enabling safe human motion generation in continuous space, preserving continuous kinematics without codebook loss and delivering strong safety-utility trade-offs compared to current baselines. Additionally, we present the first safe text-to-motion dataset SafeMoVAE-29K integrating rewritten safe text prompts and continuous refined motion for trustworthy human motion unlearning. Built upon DiP, SafeMo efficiently generates safe human motions with natural transitions. Experiments demonstrate effective unlearning performance of SafeMo by showing strengthened forgetting on unsafe prompts, reaching 2.5x and 14.4x higher forget-set FID on HumanML3D and Motion-X respectively, compared to the previous SOTA human motion unlearning method LCR, with benign performance on safe prompts being better or comparable. Code: https://github.com/AIGeeksGroup/SafeMo. Website: https://aigeeksgroup.github.io/SafeMo.
[123] Modality Dominance-Aware Optimization for Embodied RGB-Infrared Perception
Xianhui Liu, Siqi Jiang, Yi Xie, Yuqing Lin, Siao Liu
Main category: cs.CV
TL;DR: MDACL framework addresses optimization bias in RGB-IR fusion by quantifying modality dominance with MDI and balancing optimization through hierarchical guidance and adversarial regularization.
Details
Motivation: RGB-IR multimodal perception is crucial for embodied systems, but existing fusion methods suffer from optimization bias due to asymmetric modality characteristics (information density and feature quality disparities), causing training to overemphasize dominant modalities and hinder effective fusion.Method: Proposes Modality Dominance Index (MDI) to quantify modality dominance by jointly modeling feature entropy and gradient contribution. Develops MDACL framework with Hierarchical Cross-modal Guidance (HCG) for feature alignment and Adversarial Equilibrium Regularization (AER) to balance optimization dynamics during fusion.
Result: Extensive experiments on three RGB-IR benchmarks demonstrate that MDACL effectively mitigates optimization bias and achieves state-of-the-art performance.
Conclusion: The proposed MDACL framework successfully addresses optimization bias in RGB-IR multimodal fusion through systematic quantification of modality dominance and balanced optimization regulation, advancing RGB-IR detection capabilities for embodied systems.
Abstract: RGB-Infrared (RGB-IR) multimodal perception is fundamental to embodied multimedia systems operating in complex physical environments. Although recent cross-modal fusion methods have advanced RGB-IR detection, the optimization dynamics caused by asymmetric modality characteristics remain underexplored. In practice, disparities in information density and feature quality introduce persistent optimization bias, leading training to overemphasize a dominant modality and hindering effective fusion. To quantify this phenomenon, we propose the Modality Dominance Index (MDI), which measures modality dominance by jointly modeling feature entropy and gradient contribution. Based on MDI, we develop a Modality Dominance-Aware Cross-modal Learning (MDACL) framework that regulates cross-modal optimization. MDACL incorporates Hierarchical Cross-modal Guidance (HCG) to enhance feature alignment and Adversarial Equilibrium Regularization (AER) to balance optimization dynamics during fusion. Extensive experiments on three RGB-IR benchmarks demonstrate that MDACL effectively mitigates optimization bias and achieves SOTA performance.
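A minimal sketch of how a Modality Dominance Index could combine feature entropy with gradient contribution for the RGB and IR branches; the normalization and the way the two terms are combined are illustrative assumptions, not the paper's exact definition.

```python
import torch

def feature_entropy(feat, bins=32):
    """Shannon entropy of a feature map's activation histogram."""
    hist = torch.histc(feat.flatten(), bins=bins)
    p = hist / hist.sum().clamp(min=1.0)
    p = p[p > 0]
    return -(p * p.log()).sum()

def modality_dominance_index(feat_rgb, feat_ir, grad_rgb, grad_ir):
    """Relative dominance of the RGB branch over the IR branch.

    Returns a value in (0, 1); values above 0.5 suggest RGB dominates optimization.
    """
    ent = torch.stack([feature_entropy(feat_rgb), feature_entropy(feat_ir)])
    grd = torch.stack([grad_rgb.norm(), grad_ir.norm()])
    score = ent / ent.sum() + grd / grd.sum()   # combine the two normalized terms
    return (score[0] / score.sum()).item()

# Toy usage with random features and gradients.
mdi = modality_dominance_index(torch.randn(1, 64, 32, 32), torch.randn(1, 64, 32, 32),
                               torch.randn(100), torch.randn(100))
```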
[124] Detecting Performance Degradation under Data Shift in Pathology Vision-Language Model
Hao Guan, Li Zhou
Main category: cs.CV
TL;DR: This paper proposes a framework combining input data shift detection and output confidence-based indicators to monitor performance degradation in vision-language models for pathology under data distribution shifts.
Details
Motivation: Vision-Language Models in medical imaging can suffer performance degradation after deployment when input data distribution shifts, but detecting this degradation is challenging without labeled data, especially for large pre-trained models.Method: 1) Developed DomainSAT toolbox for systematic input data shift analysis with graphical interface and shift detection algorithms; 2) Introduced label-free, confidence-based degradation indicator that monitors changes in model prediction confidence; 3) Combined input shift detection with output confidence monitoring for comprehensive degradation detection.
Result: Input data shift detection effectively identifies distribution changes but doesn’t always correlate with actual performance degradation. The confidence-based indicator shows close relationship with performance degradation and complements input shift detection. Combined approach enables more reliable detection and interpretation of performance degradation in pathology VLMs under data shift.
Conclusion: The proposed complementary framework combining input data shift detection and output confidence-based indicators provides a practical solution for monitoring reliability of foundation models in digital pathology, addressing the challenge of performance degradation detection under data distribution shifts without requiring labeled data.
Abstract: Vision-Language Models have demonstrated strong potential in medical image analysis and disease diagnosis. However, after deployment, their performance may deteriorate when the input data distribution shifts from that observed during development. Detecting such performance degradation is essential for clinical reliability, yet remains challenging for large pre-trained VLMs operating without labeled data. In this study, we investigate performance degradation detection under data shift in a state-of-the-art pathology VLM. We examine both input-level data shift and output-level prediction behavior to understand their respective roles in monitoring model reliability. To facilitate systematic analysis of input data shift, we develop DomainSAT, a lightweight toolbox with a graphical interface that integrates representative shift detection algorithms and enables intuitive exploration of data shift. Our analysis shows that while input data shift detection is effective at identifying distributional changes and providing early diagnostic signals, it does not always correspond to actual performance degradation. Motivated by this observation, we further study output-based monitoring and introduce a label-free, confidence-based degradation indicator that directly captures changes in model prediction confidence. We find that this indicator exhibits a close relationship with performance degradation and serves as an effective complement to input shift detection. Experiments on a large-scale pathology dataset for tumor classification demonstrate that combining input data shift detection and output confidence-based indicators enables more reliable detection and interpretation of performance degradation in VLMs under data shift. These findings provide a practical and complementary framework for monitoring the reliability of foundation models in digital pathology.
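As a rough illustration of the output-based monitor described above, the snippet below tracks the drop in mean max-softmax confidence between a reference batch and newly arriving data; the function names and the threshold are illustrative assumptions, not the authors' implementation.

```python
# Minimal, label-free confidence-drift indicator: compare mean softmax confidence
# on incoming data against a reference batch recorded at development time.
import numpy as np

def mean_confidence(logits: np.ndarray) -> float:
    """Average max-softmax probability over a batch of logits with shape (N, C)."""
    z = logits - logits.max(axis=1, keepdims=True)
    probs = np.exp(z) / np.exp(z).sum(axis=1, keepdims=True)
    return float(probs.max(axis=1).mean())

def degradation_flag(ref_logits: np.ndarray, new_logits: np.ndarray, tol: float = 0.05) -> bool:
    """Flag potential degradation if mean confidence drops by more than `tol`."""
    return mean_confidence(ref_logits) - mean_confidence(new_logits) > tol
```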
[125] RePose: A Real-Time 3D Human Pose Estimation and Biomechanical Analysis Framework for Rehabilitation
Junxiao Xue, Pavel Smirnov, Ziao Li, Yunyun Shi, Shi Chen, Xinyi Yin, Xiaohan Yue, Lei Wang, Yiduo Wang, Feng Lin, Yijia Chen, Xiao Ma, Xiaoran Yan, Qing Zhang, Fengjian Xue, Xuecheng Wu
Main category: cs.CV
TL;DR: RePose: A real-time 3D human pose estimation and motion analysis system for rehabilitation training that provides immediate feedback and guidance to patients using multi-camera RGB video input.
Details
Motivation: To enable real-time monitoring and evaluation of patients' motion during rehabilitation, providing immediate feedback and guidance to help patients execute rehabilitation exercises correctly and regain muscle strength and motor functions.
Method: 1) Unified end-to-end pipeline for real-time human pose estimation and motion analysis from multi-camera RGB video; 2) Fast tracking method for medical rehabilitation scenarios with multiple-person interference (<1ms per frame); 3) Modified SmoothNet for real-time posture estimation to reduce errors and restore true motion state; 4) Unity platform integration for real-time monitoring and muscle stress visualization.
Result: The system achieves real-time performance with fast tracking (<1ms per frame), effectively reduces pose estimation errors, restores patients’ true motion states, and provides visually smoother motion analysis for rehabilitation monitoring.
Conclusion: RePose provides an effective real-time solution for rehabilitation training that can monitor, evaluate, and guide patients’ movements while displaying muscle stress conditions, potentially improving rehabilitation outcomes through immediate feedback.
Abstract: We propose a real-time 3D human pose estimation and motion analysis method termed RePose for rehabilitation training. It is capable of real-time monitoring and evaluation of patients’ motion during rehabilitation, providing immediate feedback and guidance to assist patients in executing rehabilitation exercises correctly. Firstly, we introduce a unified pipeline for end-to-end real-time human pose estimation and motion analysis using RGB video input from multiple cameras, which can be applied to the field of rehabilitation training. The pipeline can help to monitor and correct patients’ actions, thus aiding them in regaining muscle strength and motor functions. Secondly, we propose a fast tracking method for medical rehabilitation scenarios with multiple-person interference, which requires less than 1 ms to track a single frame. Additionally, we modify SmoothNet for real-time posture estimation, effectively reducing pose estimation errors and restoring the patient’s true motion state, making it visually smoother. Finally, we use the Unity platform for real-time monitoring and evaluation of patients’ motion during rehabilitation, and to display the muscle stress conditions to assist patients with their rehabilitation training.
[126] HyperPriv-EPN: Hypergraph Learning with Privileged Knowledge for Ependymoma Prognosis
Shuren Gabriel Yu, Sikang Ren, Yongji Tian
Main category: cs.CV
TL;DR: HyperPriv-EPN: A hypergraph-based LUPI framework that transfers post-operative text knowledge to preoperative MRI analysis without requiring text at inference.
Details
Motivation: Preoperative prognosis of Ependymoma is challenging due to lack of semantic insights in MRI compared to post-operative surgical reports. Existing multimodal methods fail when privileged text data is unavailable during inference.
Method: Hypergraph-based Learning Using Privileged Information (LUPI) with Severed Graph Strategy. Uses shared encoder for Teacher graph (with post-surgery info) and Student graph (pre-op only). Dual-stream distillation enables Student to hallucinate semantic community structures from visual features alone.
Result: Validated on multi-center cohort of 311 patients. Achieves state-of-the-art diagnostic accuracy and survival stratification.
Conclusion: Effectively transfers expert knowledge to preoperative setting, unlocking value of historical post-operative data to guide diagnosis of new patients without requiring text at inference.
Abstract: Preoperative prognosis of Ependymoma is critical for treatment planning but challenging due to the lack of semantic insights in MRI compared to post-operative surgical reports. Existing multimodal methods fail to leverage this privileged text data when it is unavailable during inference. To bridge this gap, we propose HyperPriv-EPN, a hypergraph-based Learning Using Privileged Information (LUPI) framework. We introduce a Severed Graph Strategy, utilizing a shared encoder to process both a Teacher graph (enriched with privileged post-surgery information) and a Student graph (restricted to pre-operation data). Through dual-stream distillation, the Student learns to hallucinate semantic community structures from visual features alone. Validated on a multi-center cohort of 311 patients, HyperPriv-EPN achieves state-of-the-art diagnostic accuracy and survival stratification. This effectively transfers expert knowledge to the preoperative setting, unlocking the value of historical post-operative data to guide the diagnosis of new patients without requiring text at inference.
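A bare-bones sketch of the dual-stream distillation idea is shown below: the student embedding, computed from pre-operative inputs only, is regressed toward the teacher embedding that had access to privileged post-operative information. The hypergraph encoders, Severed Graph Strategy, and loss weighting from the paper are not reproduced; `lupi_loss` and its arguments are hypothetical.

```python
# Conceptual LUPI-style training objective: supervised task loss plus a
# distillation term pulling the student toward the (privileged) teacher embedding.
import torch
import torch.nn.functional as F

def lupi_loss(student_emb: torch.Tensor, teacher_emb: torch.Tensor,
              logits: torch.Tensor, labels: torch.Tensor, lam: float = 0.5) -> torch.Tensor:
    task = F.cross_entropy(logits, labels)                   # supervised prognosis loss
    distill = F.mse_loss(student_emb, teacher_emb.detach())  # privileged-knowledge transfer
    return task + lam * distill
```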
[127] Quality Detection of Stored Potatoes via Transfer Learning: A CNN and Vision Transformer Approach
Shrikant Kapse, Priyankkumar Dhrangdhariya, Priya Kedia, Manasi Patwardhan, Shankar Kausley, Soumyadipta Maiti, Beena Rai, Shirish Karande
Main category: cs.CV
TL;DR: Image-based deep learning models (ResNet, VGG, DenseNet, ViT) effectively monitor potato quality during storage, achieving 98.03% accuracy for sprout detection and up to 89.83% accuracy for shelf-life prediction with coarse class divisions.
Details
Motivation: To develop a non-invasive, scalable solution for monitoring potato quality during storage, addressing key challenges like sprout detection, weight loss estimation, and shelf-life prediction to improve inventory management and reduce food waste.
Method: Collected images and weight data over 200 days under controlled conditions. Used pre-trained architectures (ResNet, VGG, DenseNet, ViT) to design two specialized models: 1) binary classifier for sprout detection, 2) multi-class predictor for weight loss and shelf-life forecasting.
Result: DenseNet achieved 98.03% accuracy for sprout detection. Shelf-life prediction performed best with coarse class divisions (2-5 classes, >89.83% accuracy), while accuracy declined for finer divisions (6-8 classes) due to subtle visual differences and limited data.
Conclusion: Image-based models are feasible for automated potato sorting systems, enabling early sprout detection and dynamic categorization. Broader class divisions ensure robust performance. Future work should develop generalized models for diverse varieties and conditions.
Abstract: Image-based deep learning provides a non-invasive, scalable solution for monitoring potato quality during storage, addressing key challenges such as sprout detection, weight loss estimation, and shelf-life prediction. In this study, images and corresponding weight data were collected over a 200-day period under controlled temperature and humidity conditions. Leveraging powerful pre-trained architectures of ResNet, VGG, DenseNet, and Vision Transformer (ViT), we designed two specialized models: (1) a high-precision binary classifier for sprout detection, and (2) an advanced multi-class predictor to estimate weight loss and forecast remaining shelf-life with remarkable accuracy. DenseNet achieved exceptional performance, with 98.03% accuracy in sprout detection. Shelf-life prediction models performed best with coarse class divisions (2-5 classes), achieving over 89.83% accuracy, while accuracy declined for finer divisions (6-8 classes) due to subtle visual differences and limited data per class. These findings demonstrate the feasibility of integrating image-based models into automated sorting and inventory systems, enabling early identification of sprouted potatoes and dynamic categorization based on storage stage. Practical implications include improved inventory management, differential pricing strategies, and reduced food waste across supply chains. While predicting exact shelf-life intervals remains challenging, focusing on broader class divisions ensures robust performance. Future research should aim to develop generalized models trained on diverse potato varieties and storage conditions to enhance adaptability and scalability. Overall, this approach offers a cost-effective, non-destructive method for quality assessment, supporting efficiency and sustainability in potato storage and distribution.
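A minimal transfer-learning setup in the spirit of the sprout-detection classifier might look like the following, assuming a torchvision DenseNet backbone with a replaced two-way head; the paper's actual training configuration (augmentations, optimizer, splits) is not shown.

```python
# Sketch: fine-tune a pretrained DenseNet-121 as a binary sprout detector.
import torch.nn as nn
from torchvision import models

def build_sprout_classifier(freeze_backbone: bool = True) -> nn.Module:
    model = models.densenet121(weights=models.DenseNet121_Weights.IMAGENET1K_V1)
    if freeze_backbone:
        for p in model.parameters():
            p.requires_grad = False          # train only the new head
    # Replace the ImageNet head with a 2-way classifier (sprouted / not sprouted).
    model.classifier = nn.Linear(model.classifier.in_features, 2)
    return model
```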
[128] Reconstructing Building Height from Spaceborne TomoSAR Point Clouds Using a Dual-Topology Network
Zhaiyu Chen, Yuanyuan Wang, Yilei Shi, Xiao Xiang Zhu
Main category: cs.CV
TL;DR: A learning-based framework converts noisy TomoSAR point clouds into high-resolution building height maps using a dual-topology network that alternates between point and grid branches for denoising and inpainting.
Details
Motivation: TomoSAR provides weather-independent building observations but produces noisy, anisotropic point clouds with data voids that hinder accurate height reconstruction, creating a need for robust height estimation methods.
Method: Dual-topology network with alternating point branch (models irregular scatterer features) and grid branch (enforces spatial consistency) to jointly process representations, denoise points, and inpaint missing regions.
Result: First proof-of-concept for large-scale urban height mapping directly from TomoSAR point clouds, validated on Munich and Berlin data, with extension to incorporate optical imagery for enhanced quality.
Conclusion: The framework successfully converts raw TomoSAR points into continuous height maps, addressing noise and data void challenges while enabling integration with optical data for improved urban building reconstruction.
Abstract: Reliable building height estimation is essential for various urban applications. Spaceborne SAR tomography (TomoSAR) provides weather-independent, side-looking observations that capture facade-level structure, offering a promising alternative to conventional optical methods. However, TomoSAR point clouds often suffer from noise, anisotropic point distributions, and data voids on incoherent surfaces, all of which hinder accurate height reconstruction. To address these challenges, we introduce a learning-based framework for converting raw TomoSAR points into high-resolution building height maps. Our dual-topology network alternates between a point branch that models irregular scatterer features and a grid branch that enforces spatial consistency. By jointly processing these representations, the network denoises the input points and inpaints missing regions to produce continuous height estimates. To our knowledge, this is the first proof of concept for large-scale urban height mapping directly from TomoSAR point clouds. Extensive experiments on data from Munich and Berlin validate the effectiveness of our approach. Moreover, we demonstrate that our framework can be extended to incorporate optical satellite imagery, further enhancing reconstruction quality. The source code is available at https://github.com/zhu-xlab/tomosar2height.
[129] CRoPS: A Training-Free Hallucination Mitigation Framework for Vision-Language Models
Neeraj Anand, Samyak Jha, Udbhav Bamba, Rahul Rahaman
Main category: cs.CV
TL;DR: CRoPS is a training-free framework that mitigates hallucinations in Large Vision-Language Models by using multiple hallucinated models and contrastive decoding.
Details
Motivation: LVLMs suffer from hallucination problems that undermine reliability. Existing training-free methods have limitations: they rely on narrow assumptions about hallucination sources, and their effectiveness declines toward the end of generation where hallucinations are most likely.
Method: Proposes a novel hallucinated model that captures hallucination effects by selectively removing key text tokens. Introduces Generalized Contrastive Decoding which integrates multiple hallucinated models to represent diverse hallucination sources, forming the CRoPS framework.
Result: Improves CHAIR scores by 20% and achieves consistent gains across six benchmarks and three LVLM families, outperforming state-of-the-art training-free methods.
Conclusion: CRoPS effectively addresses hallucination mitigation in LVLMs through a training-free approach that better captures diverse hallucination sources and maintains effectiveness throughout generation.
Abstract: Despite the rapid success of Large Vision-Language Models (LVLMs), a persistent challenge is their tendency to generate hallucinated content, undermining reliability in real-world use. Existing training-free methods address hallucinations but face two limitations: (i) they rely on narrow assumptions about hallucination sources, and (ii) their effectiveness declines toward the end of generation, where hallucinations are most likely to occur. A common strategy is to build hallucinated models by completely or partially removing visual tokens and contrasting them with the original model. Yet, this alone proves insufficient, since visual information still propagates into generated text. Building on this insight, we propose a novel hallucinated model that captures hallucination effects by selectively removing key text tokens. We further introduce Generalized Contrastive Decoding, which integrates multiple hallucinated models to represent diverse hallucination sources. Together, these ideas form CRoPS, a training-free hallucination mitigation framework that improves CHAIR scores by 20% and achieves consistent gains across six benchmarks and three LVLM families, outperforming state-of-the-art training-free methods.
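The contrastive-decoding step that integrates several hallucinated model variants could be sketched as below; how the variants are constructed (e.g., by removing key text or visual tokens) and how they are weighted follows the paper only loosely, so treat the formula as an assumption.

```python
# Illustrative generalized contrastive decoding: subtract a weighted mix of
# logits from several "hallucinated" model variants from the original logits.
import torch

def generalized_contrastive_logits(orig_logits: torch.Tensor,
                                   hall_logits_list: list,
                                   alpha: float = 1.0,
                                   weights=None) -> torch.Tensor:
    """orig_logits: (V,); hall_logits_list: list of (V,) logits from hallucinated variants."""
    if weights is None:
        weights = [1.0 / len(hall_logits_list)] * len(hall_logits_list)
    hall_mix = sum(w * h for w, h in zip(weights, hall_logits_list))
    # Standard contrastive-decoding form: amplify the original, penalize the mix.
    return (1 + alpha) * orig_logits - alpha * hall_mix
```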
[130] Pixel-to-4D: Camera-Controlled Image-to-Video Generation with Dynamic 3D Gaussians
Melonie de Almeida, Daniela Ivanova, Tong Shi, John H. Williamson, Paul Henderson
Main category: cs.CV
TL;DR: Single-image video generation with camera control using 3D Gaussian representation and motion sampling in one forward pass.
Details
Motivation: Existing single-image video generation methods lack robust user controllability (especially camera path modification) and struggle with accurate camera motion modeling, temporal consistency, and geometric integrity. Two-step approaches using 3D point clouds still fall short of full temporal consistency.
Method: Proposes a novel framework that constructs a 3D Gaussian scene representation and samples plausible object motion from a single image in a single forward pass, enabling fast camera-guided video generation without iterative denoising for object motion injection.
Result: Extensive experiments on KITTI, Waymo, RealEstate10K and DL3DV-10K datasets demonstrate state-of-the-art video quality and inference efficiency.
Conclusion: The method enables fast, camera-controlled video generation from single images with improved temporal consistency and geometric integrity compared to existing approaches.
Abstract: Humans excel at forecasting the future dynamics of a scene given just a single image. Video generation models that can mimic this ability are an essential component for intelligent systems. Recent approaches have improved temporal coherence and 3D consistency in single-image-conditioned video generation. However, these methods often lack robust user controllability, such as modifying the camera path, limiting their applicability in real-world applications. Most existing camera-controlled image-to-video models struggle with accurately modeling camera motion, maintaining temporal consistency, and preserving geometric integrity. Leveraging explicit intermediate 3D representations offers a promising solution by enabling coherent video generation aligned with a given camera trajectory. Although these methods often use 3D point clouds to render scenes and introduce object motion in a later stage, this two-step process still falls short in achieving full temporal consistency, despite allowing precise control over camera movement. We propose a novel framework that constructs a 3D Gaussian scene representation and samples plausible object motion, given a single image in a single forward pass. This enables fast, camera-guided video generation without the need for iterative denoising to inject object motion into render frames. Extensive experiments on the KITTI, Waymo, RealEstate10K and DL3DV-10K datasets demonstrate that our method achieves state-of-the-art video quality and inference efficiency. The project page is available at https://melonienimasha.github.io/Pixel-to-4D-Website.
[131] Efficient Deep Demosaicing with Spatially Downsampled Isotropic Networks
Cory Fan, Wenchao Zhang
Main category: cs.CV
TL;DR: Downsampling in isotropic networks improves efficiency and performance for image demosaicing and joint-demosaicing-and-denoising tasks, enabling better mobile deployment.
Details
Motivation: Most deep learning demosaicing networks avoid spatial downsampling, making them computationally expensive for mobile platforms. The paper challenges this convention by claiming downsampling can actually improve both efficiency and performance.
Method: Designed simple fully convolutional networks with and without downsampling using DeepMAD mathematical architecture design technique. Created JD3Net as the downsampled variant for empirical testing.
Result: Downsampling improves empirical performance compared to networks without downsampling. JD3Net shows strong performance on various image demosaicing and JDD tasks.
Conclusion: Contrary to conventional isotropic network designs, spatial downsampling can enhance both efficiency and performance for image demosaicing networks, making them more suitable for mobile applications.
Abstract: In digital imaging, image demosaicing is a crucial first step that recovers the RGB information from a color filter array (CFA). Oftentimes, deep learning is utilized to perform image demosaicing. Given that most modern digital imaging applications occur on mobile platforms, applying deep learning to demosaicing requires lightweight and efficient networks. Isotropic networks, also known as residual-in-residual networks, have often been employed for image demosaicing and joint-demosaicing-and-denoising (JDD). Most demosaicing isotropic networks avoid spatial downsampling entirely, and thus are often prohibitively expensive computationally for mobile applications. Contrary to previous isotropic network designs, this paper claims that spatial downsampling to a significant degree can improve the efficiency and performance of isotropic networks. To validate this claim, we design simple fully convolutional networks with and without downsampling using a mathematical architecture design technique adapted from DeepMAD, and find that downsampling improves empirical performance. Additionally, testing of JD3Net, the downsampled variant of our fully convolutional networks, reveals strong performance on a variety of image demosaicing and JDD tasks.
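As a toy illustration of the design question studied here, the following fully convolutional network processes the raw mosaic at reduced spatial resolution and upsamples back with a pixel shuffle; depths, widths, and the input packing are arbitrary choices, not the DeepMAD-derived JD3Net architecture.

```python
# Toy demosaicing network with an explicit downsampling stage: a strided conv
# reduces resolution, an isotropic body runs at low resolution, and a pixel
# shuffle restores full-resolution RGB output.
import torch.nn as nn

def downsampled_demosaic_net(width: int = 64, blocks: int = 4) -> nn.Sequential:
    layers = [nn.Conv2d(1, width, 3, stride=2, padding=1), nn.ReLU(inplace=True)]  # downsample raw mosaic
    for _ in range(blocks):                                                          # isotropic body at low res
        layers += [nn.Conv2d(width, width, 3, padding=1), nn.ReLU(inplace=True)]
    layers += [nn.Conv2d(width, 3 * 4, 3, padding=1), nn.PixelShuffle(2)]            # back to full-res RGB
    return nn.Sequential(*layers)
```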
[132] RGS-SLAM: Robust Gaussian Splatting SLAM with One-Shot Dense Initialization
Wei-Tse Cheng, Yen-Jen Chiou, Yuan-Fu Yang
Main category: cs.CV
TL;DR: RGS-SLAM replaces GS-SLAM’s residual-driven densification with a one-shot correspondence-to-Gaussian initialization using DINOv3 descriptors, achieving 20% faster convergence and higher rendering quality while maintaining real-time performance.
Details
Motivation: To improve GS-SLAM by addressing the limitations of progressive residual-driven densification, which can lead to unstable early mapping and slower convergence. The goal is to create a more robust initialization that provides better geometric priors for Gaussian splatting SLAM.
Method: Uses training-free correspondence-to-Gaussian initialization: 1) extracts dense multi-view correspondences from DINOv3 descriptors, 2) refines them through confidence-aware inlier classification, 3) performs one-shot triangulation to generate well-distributed Gaussian seeds, 4) optimizes with existing GS-SLAM pipelines.
Result: Achieves ~20% faster convergence, higher rendering fidelity in texture-rich and cluttered scenes, competitive/superior localization and reconstruction accuracy on TUM RGB-D and Replica datasets, while maintaining real-time performance up to 925 FPS.
Conclusion: RGS-SLAM demonstrates that replacing residual-driven densification with robust correspondence-based initialization significantly improves GS-SLAM’s stability, convergence speed, and rendering quality while maintaining full compatibility with existing pipelines and real-time performance.
Abstract: We introduce RGS-SLAM, a robust Gaussian-splatting SLAM framework that replaces the residual-driven densification stage of GS-SLAM with a training-free correspondence-to-Gaussian initialization. Instead of progressively adding Gaussians as residuals reveal missing geometry, RGS-SLAM performs a one-shot triangulation of dense multi-view correspondences derived from DINOv3 descriptors refined through a confidence-aware inlier classifier, generating a well-distributed and structure-aware Gaussian seed prior to optimization. This initialization stabilizes early mapping and accelerates convergence by roughly 20%, yielding higher rendering fidelity in texture-rich and cluttered scenes while remaining fully compatible with existing GS-SLAM pipelines. Evaluated on the TUM RGB-D and Replica datasets, RGS-SLAM achieves competitive or superior localization and reconstruction accuracy compared with state-of-the-art Gaussian and point-based SLAM systems, sustaining real-time mapping performance at up to 925 FPS.
[133] Multi-Level Feature Fusion for Continual Learning in Visual Quality Inspection
Johannes C. Bauer, Paul Geng, Stephan Trattnig, Petr Dokládal, Rüdiger Daub
Main category: cs.CV
TL;DR: Multi-level feature fusion approach for continual learning in visual quality inspection, enabling efficient adaptation to changing products/defects in manufacturing.
Details
Motivation: Deep neural networks struggle in volatile manufacturing scenarios like remanufacturing where products and defect patterns frequently change, requiring frequent model adaptation while avoiding catastrophic forgetting.
Method: Multi-level feature fusion (MLFF) approach that utilizes representations from different depths of a pretrained network to enable efficient adaptation with fewer trainable parameters.
Result: Matches end-to-end training performance for quality inspection problems while using significantly fewer trainable parameters, reduces catastrophic forgetting, and improves generalization to new product types/defects.
Conclusion: MLFF enables computationally efficient continual learning for visual quality inspection in volatile manufacturing environments, balancing adaptation speed with forgetting prevention.
Abstract: Deep neural networks show great potential for automating various visual quality inspection tasks in manufacturing. However, their applicability is limited in more volatile scenarios, such as remanufacturing, where the inspected products and defect patterns often change. In such settings, deployed models require frequent adaptation to novel conditions, effectively posing a continual learning problem. To enable quick adaptation, the necessary training processes must be computationally efficient while still avoiding effects like catastrophic forgetting. This work presents a multi-level feature fusion (MLFF) approach that aims to improve both aspects simultaneously by utilizing representations from different depths of a pretrained network. We show that our approach is able to match the performance of end-to-end training for different quality inspection problems while using significantly fewer trainable parameters. Furthermore, it reduces catastrophic forgetting and improves generalization robustness to new product types or defects.
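One plausible realization of multi-level feature fusion over a frozen pretrained backbone is sketched below, assuming a torchvision ResNet-18 whose four stages are pooled and concatenated before a small trainable head; which layers the paper taps and how it fuses them may differ.

```python
# Sketch: fuse pooled features from several depths of a frozen ResNet-18 and
# train only a lightweight classification head on top.
import torch
import torch.nn as nn
from torchvision import models

class MLFFClassifier(nn.Module):
    def __init__(self, num_classes: int):
        super().__init__()
        backbone = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)
        for p in backbone.parameters():
            p.requires_grad = False                     # only the fusion head is trainable
        self.stem = nn.Sequential(backbone.conv1, backbone.bn1, backbone.relu,
                                  backbone.maxpool)
        self.stages = nn.ModuleList([backbone.layer1, backbone.layer2,
                                     backbone.layer3, backbone.layer4])
        self.pool = nn.AdaptiveAvgPool2d(1)
        fused_dim = 64 + 128 + 256 + 512                # channel widths of the four stages
        self.head = nn.Linear(fused_dim, num_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        feats = []
        x = self.stem(x)
        for stage in self.stages:
            x = stage(x)
            feats.append(self.pool(x).flatten(1))       # pooled features per depth
        return self.head(torch.cat(feats, dim=1))
```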
[134] Grading Handwritten Engineering Exams with Multimodal Large Language Models
Janez Perš, Jon Muhovič, Andrej Košir, Boštjan Murovec
Main category: cs.CV
TL;DR: End-to-end workflow using multimodal LLMs to automatically grade handwritten STEM exams with ~8-point accuracy compared to lecturer grading, requiring only a handwritten reference solution and grading rules.
Details
Motivation: Manual grading of handwritten STEM exams is slow and difficult to scale, despite capturing valuable open-ended reasoning and diagrams. There's a need for automated solutions that preserve the standard exam process while maintaining reliability.
Method: Multi-stage pipeline: format/presence check, ensemble of independent LLM graders, supervisor aggregation, rigid templates with deterministic validation. Uses multimodal LLMs (GPT-5.2, Gemini-3 Pro) with handwritten reference solution converted to text-only summary for conditioning. Preserves standard A4 paper format with unconstrained student handwriting.
Result: Achieves ≈8-point mean absolute difference to lecturer grades with low bias, estimated manual-review trigger rate of ≈17% at D_max=40. Works with Slovenian language and hand-drawn circuit schematics. Ablations show structured prompting and reference grounding are essential.
Conclusion: The workflow provides reliable automated grading of handwritten STEM exams while preserving standard exam processes. Structured prompting and reference grounding are critical for accuracy, preventing systematic over-grading. The system enables scalable grading with reasonable accuracy and manageable manual review rates.
Abstract: Handwritten STEM exams capture open-ended reasoning and diagrams, but manual grading is slow and difficult to scale. We present an end-to-end workflow for grading scanned handwritten engineering quizzes with multimodal large language models (LLMs) that preserves the standard exam process (A4 paper, unconstrained student handwriting). The lecturer provides only a handwritten reference solution (100%) and a short set of grading rules; the reference is converted into a text-only summary that conditions grading without exposing the reference scan. Reliability is achieved through a multi-stage design with a format/presence check to prevent grading blank answers, an ensemble of independent graders, supervisor aggregation, and rigid templates with deterministic validation to produce auditable, machine-parseable reports. We evaluate the frozen pipeline in a clean-room protocol on a held-out real course quiz in Slovenian, including hand-drawn circuit schematics. With state-of-the-art backends (GPT-5.2 and Gemini-3 Pro), the full pipeline achieves $\approx$8-point mean absolute difference to lecturer grades with low bias and an estimated manual-review trigger rate of $\approx$17% at $D_{\max}=40$. Ablations show that trivial prompting and removing the reference solution substantially degrade accuracy and introduce systematic over-grading, confirming that structured prompting and reference grounding are essential.
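The supervisor-aggregation stage with a manual-review trigger could be approximated as in the snippet below; the median aggregation and the interpretation of the disagreement threshold relative to D_max are simplifying assumptions, not the paper's exact rules.

```python
# Sketch: aggregate independent grader scores and flag wide disagreement for
# manual review.
from statistics import median

def aggregate_grades(grades: list, d_max: float = 40.0):
    """grades: independent scores (0-100) from the LLM grader ensemble."""
    final = median(grades)
    needs_review = (max(grades) - min(grades)) > d_max
    return final, needs_review

# Example: three graders disagree mildly, so no manual review is triggered.
print(aggregate_grades([72.0, 78.0, 75.0]))   # (75.0, False)
```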
[135] Unified Primitive Proxies for Structured Shape Completion
Zhaiyu Chen, Yuqing Wang, Xiao Xiang Zhu
Main category: cs.CV
TL;DR: UniCo is a single-pass structured shape completion method that predicts primitives with complete geometry, semantics, and inlier membership through primitive proxies and shared feature attention.
Details
Motivation: The paper aims to improve structured shape completion by rethinking how primitives and points should interact, moving away from the prevailing cascade approach to create more effective primitive decoding that attends to shared shape features.
Method: UniCo uses primitive proxies (learnable queries contextualized to produce assembly-ready outputs) in a dedicated pathway that attends to shared shape features. It employs a training strategy that couples primitives and points with online target updates for consistent optimization.
Result: UniCo consistently outperforms recent baselines across synthetic and real-world benchmarks with four independent assembly solvers, lowering Chamfer distance by up to 50% and improving normal consistency by up to 7%.
Conclusion: The approach establishes an attractive recipe for structured 3D understanding from incomplete data through unified primitive prediction in a single feed-forward pass.
Abstract: Structured shape completion recovers missing geometry as primitives rather than as unstructured points, which enables primitive-based surface reconstruction. Instead of following the prevailing cascade, we rethink how primitives and points should interact, and find it more effective to decode primitives in a dedicated pathway that attends to shared shape features. Following this principle, we present UniCo, which in a single feed-forward pass predicts a set of primitives with complete geometry, semantics, and inlier membership. To drive this unified representation, we introduce primitive proxies, learnable queries that are contextualized to produce assembly-ready outputs. To ensure consistent optimization, our training strategy couples primitives and points with online target updates. Across synthetic and real-world benchmarks with four independent assembly solvers, UniCo consistently outperforms recent baselines, lowering Chamfer distance by up to 50% and improving normal consistency by up to 7%. These results establish an attractive recipe for structured 3D understanding from incomplete data. Project page: https://unico-completion.github.io.
[136] Fusion-SSAT: Unleashing the Potential of Self-supervised Auxiliary Task by Feature Fusion for Generalized Deepfake Detection
Shukesh Reddy, Srijan Das, Abhijit Das
Main category: cs.CV
TL;DR: Self-supervised learning as auxiliary task improves generalized deepfake detection performance and cross-dataset generalization.
Details
Motivation: To explore how self-supervised learning can be used as an auxiliary task to optimize the primary task of generalized deepfake detection, and to find the most effective training scheme combinations.
Method: Examined different combinations of training schemes for self-supervised auxiliary tasks and primary deepfake detection tasks, focusing on fusing feature representations from self-supervised learning.
Result: Fusing feature representations from self-supervised auxiliary tasks creates powerful representations that leverage both tasks, achieving better performance and superior cross-dataset generalization on DF40, FaceForensics++, Celeb-DF, DFD, FaceShifter, UADFV datasets compared to state-of-the-art detectors.
Conclusion: Self-supervised learning as an auxiliary task effectively enhances generalized deepfake detection by providing unique feature representations that improve performance and cross-dataset generalization capabilities.
Abstract: In this work, we attempted to unleash the potential of self-supervised learning as an auxiliary task that can optimise the primary task of generalised deepfake detection. To explore this, we examined different combinations of the training schemes for these tasks that can be most effective. Our findings reveal that fusing the feature representations from self-supervised auxiliary tasks yields a powerful representation for the problem at hand. Such a representation leverages the complementary strengths of both the self-supervised and primary tasks, achieving better performance on the primary task. We experimented on a large set of datasets, which includes DF40, FaceForensics++, Celeb-DF, DFD, FaceShifter, UADFV, and our results showed better generalizability on cross-dataset evaluation when compared with current state-of-the-art detectors.
[137] Two Deep Learning Approaches for Automated Segmentation of Left Ventricle in Cine Cardiac MRI
Wenhui Chu, Nikolaos V. Tsekos
Main category: cs.CV
TL;DR: Proposed LNU-Net and IBU-Net architectures for left ventricle segmentation from MRI, using layer normalization and instance-batch normalization respectively, outperforming U-Net and other methods on dice coefficient and distance metrics.
Details
Motivation: Left ventricle segmentation is critical for clinical cardiac diagnosis, but existing methods may not be optimal. The paper aims to develop improved deep learning architectures specifically for LV segmentation from short-axis cine MRI images.
Method: Two novel architectures: LNU-Net (layer normalization U-Net) applies layer normalization in each convolutional block, and IBU-Net (instance-batch normalized U-Net) incorporates both instance and batch normalization in the first block. Both follow U-Net’s encoder-decoder structure with down-sampling for feature extraction and up-sampling for localization. Used affine transformations and elastic deformations for data augmentation on a dataset of 805 MRI images from 45 patients.
Result: The proposed LNU-Net and IBU-Net architectures outperformed the original U-Net and other state-of-the-art approaches in terms of dice coefficient and average perpendicular distance metrics for left ventricle segmentation.
Conclusion: The novel normalization-based architectures (LNU-Net and IBU-Net) provide effective solutions for left ventricle segmentation from cardiac MRI, demonstrating superior performance over existing methods and offering promising tools for clinical cardiac image analysis.
Abstract: Left ventricle (LV) segmentation is critical for clinical quantification and diagnosis of cardiac images. In this work, we propose two novel deep learning architectures called LNU-Net and IBU-Net for left ventricle segmentation from short-axis cine MRI images. LNU-Net is derived from the layer normalization (LN) U-Net architecture, while IBU-Net is derived from the instance-batch normalized (IB) U-Net for medical image segmentation. Both architectures have a down-sampling path for feature extraction and an up-sampling path for precise localization. We use the original U-Net as the baseline segmentation approach and compare it with our proposed architectures. The two networks differ in their normalization schemes: LNU-Net applies layer normalization in each convolutional block, while IBU-Net incorporates instance and batch normalization together in the first convolutional block and passes the result to the next layer. Our method incorporates affine transformations and elastic deformations for data augmentation. Our dataset, containing 805 left-ventricle MRI images from 45 patients, is used for evaluation. Experimental evaluation shows that the proposed approaches outperform other state-of-the-art approaches in terms of dice coefficient and average perpendicular distance.
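For intuition, the blocks below contrast the two normalization choices described above: a LayerNorm convolutional block (LNU-Net style) and a first block that splits channels between instance and batch normalization (IBU-Net style). The split design, channel sizes, and fixed spatial size are assumptions; the paper may combine the normalizations differently.

```python
# Illustrative convolutional blocks for the two normalization variants.
import torch
import torch.nn as nn

class LNConvBlock(nn.Module):
    """Conv -> LayerNorm over (C, H, W) -> ReLU; requires a fixed spatial size hw."""
    def __init__(self, in_ch: int, out_ch: int, hw: int):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1)
        self.norm = nn.LayerNorm([out_ch, hw, hw])
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.act(self.norm(self.conv(x)))

class IBConvBlock(nn.Module):
    """First block: half the channels instance-normalized, half batch-normalized."""
    def __init__(self, in_ch: int, out_ch: int):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1)
        self.inorm = nn.InstanceNorm2d(out_ch // 2, affine=True)
        self.bnorm = nn.BatchNorm2d(out_ch - out_ch // 2)
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        x = self.conv(x)
        a, b = torch.split(x, [x.shape[1] // 2, x.shape[1] - x.shape[1] // 2], dim=1)
        return self.act(torch.cat([self.inorm(a), self.bnorm(b)], dim=1))
```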
[138] AdaGaR: Adaptive Gabor Representation for Dynamic Scene Reconstruction
Jiewen Chan, Zhenjun Zhao, Yu-Lun Liu
Main category: cs.CV
TL;DR: AdaGaR: A unified framework for dynamic 3D scene reconstruction from monocular videos using adaptive Gabor representation and temporal continuity constraints to achieve high-frequency detail capture with smooth motion.
Details
Motivation: Existing methods using single Gaussian primitives have low-pass filtering limitations, standard Gabor functions suffer from energy instability, and lack of temporal continuity constraints leads to motion artifacts during interpolation in dynamic 3D scene reconstruction.
Method: Proposes Adaptive Gabor Representation extending Gaussians with learnable frequency weights and adaptive energy compensation; uses Cubic Hermite Splines with Temporal Curvature Regularization for smooth motion; and implements Adaptive Initialization combining depth estimation, point tracking, and foreground masks.
Result: State-of-the-art performance on Tap-Vid DAVIS (PSNR 35.49, SSIM 0.9433, LPIPS 0.0723) with strong generalization across frame interpolation, depth consistency, video editing, and stereo view synthesis tasks.
Conclusion: AdaGaR successfully addresses both frequency adaptivity and temporal continuity challenges in dynamic 3D scene reconstruction, achieving superior performance and generalization capabilities across multiple applications.
Abstract: Reconstructing dynamic 3D scenes from monocular videos requires simultaneously capturing high-frequency appearance details and temporally continuous motion. Existing methods using single Gaussian primitives are limited by their low-pass filtering nature, while standard Gabor functions introduce energy instability. Moreover, lack of temporal continuity constraints often leads to motion artifacts during interpolation. We propose AdaGaR, a unified framework addressing both frequency adaptivity and temporal continuity in explicit dynamic scene modeling. We introduce Adaptive Gabor Representation, extending Gaussians through learnable frequency weights and adaptive energy compensation to balance detail capture and stability. For temporal continuity, we employ Cubic Hermite Splines with Temporal Curvature Regularization to ensure smooth motion evolution. An Adaptive Initialization mechanism combining depth estimation, point tracking, and foreground masks establishes stable point cloud distributions in early training. Experiments on Tap-Vid DAVIS demonstrate state-of-the-art performance (PSNR 35.49, SSIM 0.9433, LPIPS 0.0723) and strong generalization across frame interpolation, depth consistency, video editing, and stereo view synthesis. Project page: https://jiewenchan.github.io/AdaGaR/
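The temporal-continuity ingredient rests on cubic Hermite interpolation; a minimal version for a single Gaussian attribute between two keyframes is shown below. The spline parameterization and the Temporal Curvature Regularization term used in AdaGaR are not reproduced.

```python
# Minimal cubic Hermite interpolation between two keyframe values with tangents.
import numpy as np

def cubic_hermite(p0, p1, m0, m1, t):
    """Interpolate between p0 and p1 with tangents m0, m1 at normalized time t in [0, 1]."""
    t2, t3 = t * t, t * t * t
    h00 = 2 * t3 - 3 * t2 + 1
    h10 = t3 - 2 * t2 + t
    h01 = -2 * t3 + 3 * t2
    h11 = t3 - t2
    return h00 * p0 + h10 * m0 + h01 * p1 + h11 * m1

# Example: a 3D Gaussian center moving between two keyframes.
p0, p1 = np.array([0.0, 0.0, 0.0]), np.array([1.0, 0.5, 0.0])
m0, m1 = np.array([0.2, 0.0, 0.0]), np.array([0.0, 0.3, 0.0])
print(cubic_hermite(p0, p1, m0, m1, 0.5))
```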
[139] Efficient Multi-Task Scene Analysis with RGB-D Transformers
Söhnke Benedikt Fischedick, Daniel Seichter, Robin Schmidt, Leonard Rabes, Horst-Michael Gross
Main category: cs.CV
TL;DR: EMSAFormer: An efficient RGB-D Transformer-based multi-task scene analysis approach for autonomous systems that achieves state-of-the-art performance with real-time inference on robotic hardware.
Details
Motivation: Scene analysis is crucial for autonomous systems like mobile robots, but comprehensive understanding requires solving multiple tasks (panoptic segmentation, instance orientation estimation, scene classification) with limited computing/battery resources on mobile platforms.
Method: Introduces EMSAFormer, which replaces the dual CNN-based encoder of previous EMSANet with a single Transformer-based encoder that effectively incorporates both RGB and depth data. Includes custom NVIDIA TensorRT extension for hardware optimization.
Result: Achieves state-of-the-art performance on indoor datasets NYUv2, SUNRGB-D, and ScanNet while enabling real-time inference up to 39.1 FPS on NVIDIA Jetson AGX Orin 32 GB hardware.
Conclusion: EMSAFormer provides an efficient multi-task scene analysis solution that balances performance and computational efficiency, making it suitable for real-world deployment on resource-constrained robotic platforms.
Abstract: Scene analysis is essential for enabling autonomous systems, such as mobile robots, to operate in real-world environments. However, obtaining a comprehensive understanding of the scene requires solving multiple tasks, such as panoptic segmentation, instance orientation estimation, and scene classification. Solving these tasks given limited computing and battery capabilities on mobile platforms is challenging. To address this challenge, we introduce an efficient multi-task scene analysis approach, called EMSAFormer, that uses an RGB-D Transformer-based encoder to simultaneously perform the aforementioned tasks. Our approach builds upon the previously published EMSANet. However, we show that the dual CNN-based encoder of EMSANet can be replaced with a single Transformer-based encoder. To achieve this, we investigate how information from both RGB and depth data can be effectively incorporated in a single encoder. To accelerate inference on robotic hardware, we provide a custom NVIDIA TensorRT extension enabling highly optimized inference for our EMSAFormer approach. Through extensive experiments on the commonly used indoor datasets NYUv2, SUNRGB-D, and ScanNet, we show that our approach achieves state-of-the-art performance while still enabling inference with up to 39.1 FPS on an NVIDIA Jetson AGX Orin 32 GB.
[140] Revisiting Out-of-Distribution Detection in Real-time Object Detection: From Benchmark Pitfalls to a New Mitigation Paradigm
Changshun Wu, Weicheng He, Chih-Hong Cheng, Xiaowei Huang, Saddek Bensalem
Main category: cs.CV
TL;DR: Paper identifies flaws in OoD detection benchmarks and introduces a training-time mitigation method that reduces hallucination errors by 91% in object detectors.
Details
Motivation: Current OoD detection approaches focus on scoring functions and thresholds, offering only incremental improvements. The paper argues for rethinking the entire development lifecycle to effectively mitigate OoD risks in object detection.
Method: Two main contributions: 1) Reveals fundamental flaws in OoD evaluation benchmarks where up to 13% of objects are mislabeled, and 2) Introduces a novel training-time mitigation paradigm that fine-tunes detectors using synthesized OoD datasets that semantically resemble in-distribution objects, shaping defensive decision boundaries by suppressing objectness on OoD objects.
Result: Achieves 91% reduction in hallucination error of YOLO model on BDD-100K. The methodology generalizes across detection paradigms (YOLO, Faster R-CNN, RT-DETR) and supports few-shot adaptation.
Conclusion: The paper offers a principled and effective approach to reduce OoD-induced hallucination in object detectors by addressing benchmark quality issues and introducing training-time mitigation, moving beyond post-hoc scoring methods.
Abstract: Out-of-distribution (OoD) inputs pose a persistent challenge to deep learning models, often triggering overconfident predictions on non-target objects. While prior work has primarily focused on refining scoring functions and adjusting test-time thresholds, such algorithmic improvements offer only incremental gains. We argue that a rethinking of the entire development lifecycle is needed to mitigate these risks effectively. This work addresses two overlooked dimensions of OoD detection in object detection. First, we reveal fundamental flaws in widely used evaluation benchmarks: contrary to their design intent, up to 13% of objects in the OoD test sets actually belong to in-distribution classes, and vice versa. These quality issues severely distort the reported performance of existing methods and contribute to their high false positive rates. Second, we introduce a novel training-time mitigation paradigm that operates independently of external OoD detectors. Instead of relying solely on post-hoc scoring, we fine-tune the detector using a carefully synthesized OoD dataset that semantically resembles in-distribution objects. This process shapes a defensive decision boundary by suppressing objectness on OoD objects, leading to a 91% reduction in hallucination error of a YOLO model on BDD-100K. Our methodology generalizes across detection paradigms such as YOLO, Faster R-CNN, and RT-DETR, and supports few-shot adaptation. Together, these contributions offer a principled and effective way to reduce OoD-induced hallucination in object detectors. Code and data are available at: https://gricad-gitlab.univ-grenoble-alpes.fr/dnn-safety/m-hood.
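The training-time mitigation can be thought of as adding a loss term that drives objectness toward zero on synthesized OoD regions, as in the conceptual sketch below; how OoD objects are synthesized and matched to predictions is paper-specific and not shown, and the function is a hypothetical stand-in.

```python
# Conceptual objectness-suppression term: predictions overlapping synthesized
# OoD objects are pushed toward zero objectness during fine-tuning.
import torch
import torch.nn.functional as F

def ood_suppression_loss(objectness_logits: torch.Tensor) -> torch.Tensor:
    """objectness_logits: predicted objectness for anchors/queries covering OoD objects."""
    targets = torch.zeros_like(objectness_logits)     # OoD regions should score ~0 objectness
    return F.binary_cross_entropy_with_logits(objectness_logits, targets)
```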
[141] Test-time generative augmentation for medical image segmentation
Xiao Ma, Yuhui Tao, Zetian Zhang, Yuhan Zhang, Xi Wang, Sheng Zhang, Zexuan Ji, Yizhe Zhang, Qiang Chen, Guang Yang
Main category: cs.CV
TL;DR: TTGA is a test-time generative augmentation method for medical image segmentation that uses diffusion model inversion to create contextually relevant augmentations, improving segmentation accuracy and providing pixel-wise error estimation.
Details
Motivation: Medical image segmentation models face challenges with uncertainties from occlusions, ambiguous boundaries, and device variations. Traditional test-time augmentation methods use predefined transformations that lack adaptability for complex medical scenarios.
Method: TTGA leverages a domain-fine-tuned generative model to produce contextually relevant augmentations. It uses diffusion model inversion with a masked null-text inversion method for region-specific augmentations, and a dual denoising pathway to balance identity preservation with variability.
Result: Extensive experiments across three segmentation tasks and nine datasets show TTGA improves segmentation accuracy (DSC gains: 0.1% to 2.3%) and provides pixel-wise error estimation (DSC gains: 1.1% to 29.0% over baseline).
Conclusion: TTGA is an effective test-time augmentation strategy that outperforms traditional methods by generating contextually relevant augmentations tailored to individual test images, improving both segmentation accuracy and uncertainty estimation in medical imaging.
Abstract: Medical image segmentation is critical for clinical diagnosis, treatment planning, and monitoring, yet segmentation models often struggle with uncertainties stemming from occlusions, ambiguous boundaries, and variations in imaging devices. Traditional test-time augmentation (TTA) techniques typically rely on predefined geometric and photometric transformations, limiting their adaptability and effectiveness in complex medical scenarios. In this study, we introduced Test-Time Generative Augmentation (TTGA), a novel augmentation strategy specifically tailored for medical image segmentation at inference time. Different from conventional augmentation strategies that suffer from excessive randomness or limited flexibility, TTGA leverages a domain-fine-tuned generative model to produce contextually relevant and diverse augmentations tailored to the characteristics of each test image. Built upon diffusion model inversion, a masked null-text inversion method is proposed to enable region-specific augmentations during sampling. Furthermore, a dual denoising pathway is designed to balance precise identity preservation with controlled variability. We demonstrate the efficacy of our TTGA through extensive experiments across three distinct segmentation tasks spanning nine datasets. Our results consistently demonstrate that TTGA not only improves segmentation accuracy (with DSC gains ranging from 0.1% to 2.3% over the baseline) but also offers pixel-wise error estimation (with DSC gains ranging from 1.1% to 29.0% over the baseline). The source code and demonstration are available at: https://github.com/maxiao0234/TTGA.
[142] Towards Knowledge Guided Pretraining Approaches for Multimodal Foundation Models: Applications in Remote Sensing
Praveen Ravirathinam, Ajitesh Parthasarathy, Ankush Khandelwal, Rahul Ghosh, Vipin Kumar
Main category: cs.CV
TL;DR: KG-VSF is a novel self-supervised pretraining method that uses knowledge-guided variable-step forecasting to capture causal relationships between geospatial variables, outperforming standard masked reconstruction approaches on downstream geoscience tasks.
Details
Motivation: Current self-supervised pretraining methods (masked reconstruction, next-token prediction) fail to capture the causal interplay between different geospatial and environmental variables, which is crucial for many geoscience applications.
Method: Proposes Knowledge Guided Variable-Step Forecasting (KG-VSF), a pretraining task that models forecasting as conditional generation where driver variables (e.g., weather) inform prediction of response variables (e.g., satellite imagery).
Result: KG-VSF produces strong embeddings that outperform standard pretraining approaches when fine-tuned on downstream tasks including crop type mapping, soil moisture estimation/forecasting, missing image prediction, and future image forecasting.
Conclusion: Modeling causal relationships through knowledge-guided forecasting during pretraining leads to more effective representations for geoscience applications compared to traditional reconstruction-based approaches.
Abstract: Self-supervised learning has emerged as a powerful paradigm for pretraining foundation models using large-scale data. Existing pretraining approaches predominantly rely on masked reconstruction or next-token prediction strategies, demonstrating strong performance across various downstream tasks, including geoscience applications. However, these approaches do not fully capture the knowledge of causal interplay between different geospatial and environmental variables. To address this limitation, we propose Knowledge Guided Variable-Step Forecasting (KG-VSF), a novel pretraining task that models forecasting as a conditional generation task, where driver variables (e.g., weather) inform the prediction of response variables (e.g., satellite imagery). We demonstrate that pretraining in such a fashion yields strong embeddings that, when finetuned on downstream tasks where capturing this causality matters (pixel-wise crop type mapping, soil moisture estimation and forecasting, missing image prediction, and future image forecasting), deliver enhanced performance compared to finetuning embeddings from other standard pretraining approaches.
[143] Towards Streaming LiDAR Object Detection with Point Clouds as Egocentric Sequences
Mellon M. Zhang, Glen Chou, Saibal Mukhopadhyay
Main category: cs.CV
TL;DR: PFCF is a hybrid 3D object detector for autonomous driving that combines fast polar processing for intra-sector feature extraction with accurate Cartesian reasoning for full-scene understanding, achieving both low latency and high accuracy.
Details
Motivation: Current LiDAR-based 3D object detectors face a trade-off: streaming methods offer fast updates but suffer from limited visibility and distortions, while full-scan methods provide higher accuracy but have inherent latency limitations. Autonomous driving requires both rapid response and reliable perception for safety.
Method: PFCF uses a hybrid approach with a custom Mamba SSM-based streaming backbone featuring dimensionally-decomposed convolutions for distortion-robust polar representation learning. Local sector features are extracted via this backbone, accumulated into a sector feature buffer, and then processed through a full-scan backbone for inter-sector communication and full-scene understanding.
Result: PFCF establishes a new Pareto frontier on Waymo Open dataset, surpassing prior streaming baselines by 10% mAP and matching full-scan accuracy at twice the update rate.
Conclusion: The hybrid polar-fast Cartesian-full approach successfully addresses the latency-accuracy trade-off in LiDAR-based 3D object detection, enabling both rapid response and reliable perception essential for autonomous driving safety.
Abstract: Accurate and low-latency 3D object detection is essential for autonomous driving, where safety hinges on both rapid response and reliable perception. While rotating LiDAR sensors are widely adopted for their robustness and fidelity, current detectors face a trade-off: streaming methods process partial polar sectors on the fly for fast updates but suffer from limited visibility, cross-sector dependencies, and distortions from retrofitted Cartesian designs, whereas full-scan methods achieve higher accuracy but are bottlenecked by the inherent latency of a LiDAR revolution. We propose Polar-Fast-Cartesian-Full (PFCF), a hybrid detector that combines fast polar processing for intra-sector feature extraction with accurate Cartesian reasoning for full-scene understanding. Central to PFCF is a custom Mamba SSM-based streaming backbone with dimensionally-decomposed convolutions that avoids distortion-heavy planes, enabling parameter-efficient, translation-invariant, and distortion-robust polar representation learning. Local sector features are extracted via this backbone, then accumulated into a sector feature buffer to enable efficient inter-sector communication through a full-scan backbone. PFCF establishes a new Pareto frontier on the Waymo Open dataset, surpassing prior streaming baselines by 10% mAP and matching full-scan accuracy at twice the update rate. Code is available at \href{https://github.com/meilongzhang/Polar-Hierarchical-Mamba}{https://github.com/meilongzhang/Polar-Hierarchical-Mamba}.
[144] Semantic Anchor Transport: Robust Test-Time Adaptation for Vision-Language Models
Shambhavi Mishra, Julio Silva-Rodriguez, Ismail Ben Ayed, Marco Pedersoli, Jose Dolz
Main category: cs.CV
TL;DR: SAT (Semantic Anchor Transport) is a test-time adaptation method that uses Optimal Transport to align visual embeddings with text-based semantic anchors, generating pseudo-labels for distribution-shifted data without additional training.
Details
Motivation: Large vision-language models like CLIP show degraded performance under distribution shifts during inference. The paper aims to efficiently utilize class text information to mitigate these distribution drifts without requiring model retraining.
Method: Proposes Semantic Anchor Transport (SAT): 1) Formulates batch-wise label assignment as Optimal Transport problem to align visual embeddings with text-based semantic anchors, 2) Uses generated pseudo-labels for test-time adaptation, 3) Employs multi-template distillation with heterogeneous textual clues to replicate multi-view contrastive learning without extra computation.
Result: Extensive experiments on multiple test-time adaptation benchmarks show SAT achieves consistent performance gains over state-of-the-art methods while being computationally efficient.
Conclusion: SAT provides a principled cross-modal alignment solution for test-time adaptation that effectively mitigates distribution shifts in vision-language models through efficient use of textual information and Optimal Transport-based pseudo-label generation.
Abstract: Large pre-trained vision-language models (VLMs), such as CLIP, have shown unprecedented zero-shot performance across a wide range of tasks. Nevertheless, these models may be unreliable under distributional shifts, as their performance is significantly degraded. In this work, we investigate how to efficiently utilize class text information to mitigate distribution drifts encountered by VLMs during inference. In particular, we propose generating pseudo-labels for the noisy test-time samples by aligning visual embeddings with reliable, text-based semantic anchors. Specifically, to maintain the regular structure of the dataset properly, we formulate the problem as a batch-wise label assignment, which is efficiently solved using Optimal Transport. Our method, Semantic Anchor Transport (SAT), utilizes such pseudo-labels as supervisory signals for test-time adaptation, yielding a principled cross-modal alignment solution. Moreover, SAT further leverages heterogeneous textual clues, with a multi-template distillation approach that replicates multi-view contrastive learning strategies in unsupervised representation learning without incurring additional computational complexity. Extensive experiments on multiple popular test-time adaptation benchmarks presenting diverse complexity empirically show the superiority of SAT, achieving consistent performance gains over recent state-of-the-art methods, yet being computationally efficient.
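The batch-wise Optimal Transport assignment can be approximated with a few Sinkhorn iterations, as in the sketch below: image embeddings are transported onto class text anchors under near-uniform marginals and the transport plan's argmax serves as the pseudo-label. The marginals, entropic regularization, and iteration count are illustrative assumptions.

```python
# Sketch: Sinkhorn-based batch-wise pseudo-label assignment between image
# embeddings and class text anchors.
import torch

def ot_pseudo_labels(img_emb: torch.Tensor, txt_anchors: torch.Tensor,
                     eps: float = 0.05, n_iters: int = 50) -> torch.Tensor:
    """img_emb: (N, D) L2-normalized; txt_anchors: (K, D) L2-normalized."""
    cost = 1.0 - img_emb @ txt_anchors.t()                # cosine distance, shape (N, K)
    K_mat = torch.exp(-cost / eps)                        # Gibbs kernel
    u = torch.ones(img_emb.size(0), device=K_mat.device) / img_emb.size(0)
    v = torch.ones(txt_anchors.size(0), device=K_mat.device) / txt_anchors.size(0)
    a, b = torch.ones_like(u), torch.ones_like(v)
    for _ in range(n_iters):                              # Sinkhorn iterations
        a = u / (K_mat @ b).clamp_min(1e-9)
        b = v / (K_mat.t() @ a).clamp_min(1e-9)
    plan = a[:, None] * K_mat * b[None, :]                # transport plan, shape (N, K)
    return plan.argmax(dim=1)                             # pseudo-label per sample
```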
[145] AutoTrust: Benchmarking Trustworthiness in Large Vision Language Models for Autonomous Driving
Shuo Xing, Hongyuan Hua, Xiangbo Gao, Shenzhe Zhu, Renjie Li, Kexin Tian, Xiaopeng Li, Heng Huang, Tianbao Yang, Zhangyang Wang, Yang Zhou, Huaxiu Yao, Zhengzhong Tu
Main category: cs.CV
TL;DR: AutoTrust is a comprehensive trustworthiness benchmark for vision-language models in autonomous driving, evaluating 6 models across 5 trust dimensions and revealing critical vulnerabilities in current DriveVLMs.
Details
Motivation: While VLMs show strong scene understanding for autonomous driving, there's limited research on their trustworthiness - a critical factor for public transportation safety. The authors aim to address this gap by creating a comprehensive benchmark to evaluate DriveVLMs' trustworthiness.
Method: Created AutoTrust benchmark with largest VQA dataset for driving scenarios (10k+ scenes, 18k+ queries). Evaluated 6 VLMs (generalist to specialist, open-source to commercial) across 5 trustworthiness dimensions: trustfulness, safety, robustness, privacy, and fairness.
Result: Surprisingly, general VLMs (LLaVA-v1.6, GPT-4o-mini) outperform specialized driving models in overall trustworthiness. DriveVLMs are vulnerable to privacy leaks, susceptible to adversarial attacks, and struggle with fairness across diverse environments/populations.
Conclusion: Current DriveVLMs have significant trustworthiness vulnerabilities that threaten public safety. Immediate action is needed to address these issues before widespread deployment. The authors release all code and datasets to facilitate further research.
Abstract: Recent advancements in large vision language models (VLMs) tailored for autonomous driving (AD) have shown strong scene understanding and reasoning capabilities, making them undeniable candidates for end-to-end driving systems. However, limited work exists on studying the trustworthiness of DriveVLMs – a critical factor that directly impacts public transportation safety. In this paper, we introduce AutoTrust, a comprehensive trustworthiness benchmark for large vision-language models in autonomous driving (DriveVLMs), considering diverse perspectives – including trustfulness, safety, robustness, privacy, and fairness. We constructed the largest visual question-answering dataset for investigating trustworthiness issues in driving scenarios, comprising over 10k unique scenes and 18k queries. We evaluated six publicly available VLMs, spanning from generalist to specialist, from open-source to commercial models. Our exhaustive evaluations have unveiled previously undiscovered vulnerabilities of DriveVLMs to trustworthiness threats. Specifically, we found that the general VLMs like LLaVA-v1.6 and GPT-4o-mini surprisingly outperform specialized models fine-tuned for driving in terms of overall trustworthiness. DriveVLMs like DriveLM-Agent are particularly vulnerable to disclosing sensitive information. Additionally, both generalist and specialist VLMs remain susceptible to adversarial attacks and struggle to ensure unbiased decision-making across diverse environments and populations. Our findings call for immediate and decisive action to address the trustworthiness of DriveVLMs – an issue of critical importance to public safety and the welfare of all citizens relying on autonomous transportation systems. We release all the codes and datasets in https://github.com/taco-group/AutoTrust.
[146] Do Vision Encoders Truly Explain Object Hallucination?: Mitigating Object Hallucination via Simple Fine-Grained CLIPScore
Hongseok Oh, Wonseok Hwang
Main category: cs.CV
TL;DR: The paper proposes F-CLIPScore, a fine-grained evaluation metric that improves object hallucination detection in LVLMs by incorporating noun-level text embeddings, achieving 39.6% higher accuracy than CLIPScore without training.
Details
Motivation: Current Large Vision-Language Models suffer from object hallucination issues. Previous claims attribute this to limited vision encoder capacity, but this work challenges that assumption and seeks better evaluation methods for object hallucination detection.
Method: The study focuses on discriminative, retrieval-style evaluation (OHD-Caps) rather than free-form caption generation. It proposes Fine-grained CLIPScore (F-CLIPScore) that enhances object-level granularity by incorporating text embeddings at the noun level.
Result: F-CLIPScore outperforms conventional CLIPScore by 39.6% in accuracy on OHD-Caps benchmark without additional training. When used for data filtering, it reduces object hallucination in LVLMs by 4.9% in POPE accuracy after alignment pretraining.
Conclusion: Vision encoder capacity is not the major limiting factor for object hallucination detection. F-CLIPScore provides an effective, training-free solution for evaluating and mitigating object hallucination in LVLMs through fine-grained noun-level analysis.
Abstract: Recently, Large Vision-Language Models (LVLMs) show remarkable performance across various domains. However, these models suffer from object hallucination. In this work, we study object hallucination primarily in a discriminative, retrieval-style evaluation setting (OHD-Caps), rather than in free-form caption generation. This study revisits the previous claim that the cause of such hallucinations lies in the limited representational capacity of the vision encoder. Our analysis implies that the capacity of the vision encoder is not necessarily a major limiting factor in detecting object hallucination. Based on this insight, we propose Fine-grained CLIPScore (F-CLIPScore), a simple yet effective evaluation metric that enhances object-level granularity by incorporating text embeddings at the noun level. Evaluations on the OHD-Caps benchmark show that F-CLIPScore significantly outperforms conventional CLIPScore in accuracy by a large margin of 39.6% without additional training. We further demonstrate that F-CLIPScore-based data filtering reduces object hallucination in LVLM (4.9% in POPE accuracy after alignment pretraining). Our code is publicly available at https://github.com/abzb1/f-clip
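Since the summary describes F-CLIPScore only at a high level (incorporating text embeddings at the noun level), the sketch below illustrates that idea; the mixing weight alpha, the precomputed-embedding interface, and the function name are assumptions, not the paper's exact formula.

```python
import numpy as np

def f_clip_score(image_emb, caption_emb, noun_embs, alpha=0.5):
    """Illustrative fine-grained CLIPScore: combine the usual caption-level
    similarity with the mean similarity of noun-level text embeddings
    (one embedding per noun extracted from the caption).

    image_emb:   (D,)   L2-normalized image embedding
    caption_emb: (D,)   L2-normalized embedding of the full caption
    noun_embs:   (N, D) L2-normalized embeddings of individual nouns
    alpha:       assumed mixing weight between the two terms
    """
    sent_sim = float(image_emb @ caption_emb)          # standard CLIPScore term
    noun_sim = float((noun_embs @ image_emb).mean())   # object-level granularity
    return alpha * sent_sim + (1.0 - alpha) * noun_sim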
[147] ATRNet-STAR: A Large Dataset and Benchmark Towards Remote Sensing Object Recognition in the Wild
Yongxiang Liu, Weijie Li, Li Liu, Jie Zhou, Bowen Peng, Yafei Song, Xuying Xiong, Wei Yang, Tianpeng Liu, Zhen Liu, Xiang Li
Main category: cs.CV
TL;DR: Introduces ATRNet-STAR, a large-scale SAR vehicle dataset with 40 categories and 190k+ samples, 10x larger than MSTAR, with comprehensive benchmarks for SAR ATR research.
Details
Motivation: The lack of large-scale, high-quality public datasets for SAR ATR has hindered deep learning applications due to expensive data collection, privacy concerns, and specialized annotation requirements. Existing datasets are small and outdated.
Method: Created ATRNet-STAR dataset with 40 vehicle categories collected under various realistic imaging conditions. Detailed data collection scheme and comprehensive evaluation of 15 representative methods across 7 experimental settings on classification and detection benchmarks.
Result: ATRNet-STAR contains over 190,000 well-annotated samples, 10 times larger than the famous MSTAR dataset. The dataset enables extensive benchmarking and provides valuable insights for SAR ATR research.
Conclusion: ATRNet-STAR represents a substantial advancement in SAR ATR dataset scale and diversity, with potential to significantly facilitate advancement in the field through its comprehensive benchmarks and identified research directions.
Abstract: The absence of publicly available, large-scale, high-quality datasets for Synthetic Aperture Radar Automatic Target Recognition (SAR ATR) has significantly hindered the application of rapidly advancing deep learning techniques, which hold huge potential to unlock new capabilities in this field. This is primarily because collecting large volumes of diverse target samples from SAR images is prohibitively expensive, largely due to privacy concerns, the characteristics of microwave radar imagery perception, and the need for specialized expertise in data annotation. Throughout the history of SAR ATR research, there have been only a number of small datasets, mainly including targets like ships, airplanes, buildings, etc. There is only one vehicle dataset MSTAR collected in the 1990s, which has been a valuable source for SAR ATR. To fill this gap, this paper introduces a large-scale, new dataset named ATRNet-STAR with 40 different vehicle categories collected under various realistic imaging conditions and scenes. It marks a substantial advancement in dataset scale and diversity, comprising over 190,000 well-annotated samples, 10 times larger than its predecessor, the famous MSTAR. Building such a large dataset is a challenging task, and the data collection scheme will be detailed. Secondly, we illustrate the value of ATRNet-STAR via extensively evaluating the performance of 15 representative methods with 7 different experimental settings on challenging classification and detection benchmarks derived from the dataset. Finally, based on our extensive experiments, we identify valuable insights for SAR ATR and discuss potential future research directions in this field. We hope that the scale, diversity, and benchmark of ATRNet-STAR can significantly facilitate the advancement of SAR ATR.
[148] NeRF-VIO: Map-Based Visual-Inertial Odometry with Initialization Leveraging Neural Radiance Fields
Yanyu Zhang, Dongming Wang, Jie Xu, Mengyuan Liu, Pengxiang Zhu, Wei Ren
Main category: cs.CV
TL;DR: NeRF-VIO: A map-based visual-inertial localization algorithm using neural radiance fields for initialization and a two-stage MSCKF update mechanism that outperforms existing methods in AR applications.
Details
Motivation: Prior maps are essential for localization in AR applications to provide contextual information and mitigate drift. Existing methods need improvement in accuracy and efficiency for map-based visual-inertial localization.
Method: Uses NeRF for initialization with a multilayer perceptron model and geodesic distance loss on SE(3) for frame invariance. Integrates two-stage update mechanism within MSCKF framework, constraining state with both real camera images and rendered NeRF images.
Result: Outperforms existing NeRF-based initialization solutions in both accuracy and efficiency. Two-stage update pipeline outperforms standard MSCKF across all real-world AR dataset sequences.
Conclusion: NeRF-VIO successfully combines NeRF-based initialization with MSCKF framework to achieve superior localization performance for AR applications, demonstrating the effectiveness of integrating neural rendering with traditional visual-inertial odometry.
Abstract: A prior map serves as a foundational reference for localization in context-aware applications such as augmented reality (AR). Providing valuable contextual information about the environment, the prior map is a vital tool for mitigating drift. In this paper, we propose a map-based visual-inertial localization algorithm (NeRF-VIO) with initialization using neural radiance fields (NeRF). Our algorithm utilizes a multilayer perceptron model and redefines the loss function as the geodesic distance on $SE(3)$, ensuring the invariance of the initialization model under a frame change within $\mathfrak{se}(3)$. The evaluation demonstrates that our model outperforms existing NeRF-based initialization solutions in both accuracy and efficiency. By integrating a two-stage update mechanism within a multi-state constraint Kalman filter (MSCKF) framework, the state of NeRF-VIO is constrained by both captured images from an onboard camera and rendered images from a pre-trained NeRF model. The proposed algorithm is validated using a real-world AR dataset, and the results indicate that our two-stage update pipeline outperforms MSCKF across all data sequences.
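For reference, a geodesic-distance loss on $SE(3)$ of the kind described above is commonly written via the matrix logarithm; this is the standard formulation rather than necessarily the paper's exact loss: $d(T_1, T_2) = \lVert \log(T_1^{-1} T_2)^{\vee} \rVert_2$, where $T_1, T_2 \in SE(3)$, $\log(T_1^{-1} T_2) \in \mathfrak{se}(3)$, and $(\cdot)^{\vee}$ stacks the rotational and translational coordinates of the Lie-algebra element into a 6-vector.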
[149] Body-Hand Modality Expertized Networks with Cross-attention for Fine-grained Skeleton Action Recognition
Seungyeon Cho, Tae-Kyun Kim
Main category: cs.CV
TL;DR: BHaRNet is a novel skeleton-based action recognition framework that combines body-expert and hand-expert models with cross-attention mechanisms to better capture subtle hand motions while maintaining efficiency.
Details
Motivation: Existing skeleton-based action recognition methods focus too much on full-body movements and overlook subtle hand motions that are critical for distinguishing fine-grained actions. Current unified graph representations blur hand details due to disparities between body/hand characteristics and spatial-pooling feature loss.
Method: Proposes BHaRNet with two expert streams: body-expert and hand-expert models trained jointly with ensemble loss for cooperative specialization (like Mixture-of-Experts). Uses cross-attention via expertized branch method and pooling-attention module for feature-level interactions and selective fusion. Also extends to multi-modal tasks using RGB information guided by body features.
Result: Achieves state-of-the-art accuracies on large-scale benchmarks (NTU RGB+D 60, NTU RGB+D 120, PKU-MMD, Northwestern-UCLA), improving hand-intensive actions from 86.4% to 93.0% while maintaining fewer GFLOPs and parameters than unified methods.
Conclusion: BHaRNet effectively addresses the limitation of existing methods in capturing subtle hand motions through specialized expert models and cross-attention mechanisms, achieving superior performance on fine-grained action recognition with computational efficiency.
Abstract: Skeleton-based Human Action Recognition (HAR) is a vital technology in robotics and human-robot interaction. However, most existing methods concentrate primarily on full-body movements and often overlook subtle hand motions that are critical for distinguishing fine-grained actions. Recent work leverages a unified graph representation that combines body, hand, and foot keypoints to capture detailed body dynamics. Yet, these models often blur fine hand details due to the disparity between body and hand action characteristics and the loss of subtle features during the spatial-pooling. In this paper, we propose BHaRNet (Body-Hand action Recognition Network), a novel framework that augments a typical body-expert model with a hand-expert model. Our model jointly trains both streams with an ensemble loss that fosters cooperative specialization, functioning in a manner reminiscent of a Mixture-of-Experts (MoE). Moreover, cross-attention is employed via an expertized branch method and a pooling-attention module to enable feature-level interactions and selectively fuse complementary information. Inspired by MMNet, we also demonstrate the applicability of our approach to multi-modal tasks by leveraging RGB information, where body features guide RGB learning to capture richer contextual cues. Experiments on large-scale benchmarks (NTU RGB+D 60, NTU RGB+D 120, PKU-MMD, and Northwestern-UCLA) demonstrate that BHaRNet achieves SOTA accuracies – improving from 86.4% to 93.0% in hand-intensive actions – while maintaining fewer GFLOPs and parameters than the relevant unified methods.
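The joint training of the two expert streams with an ensemble loss can be sketched as follows; the simple logit averaging and the weight lam are assumptions rather than BHaRNet's exact objective.

```python
import torch.nn.functional as F

def two_expert_loss(body_logits, hand_logits, labels, lam=1.0):
    """Illustrative joint objective for a body-expert and a hand-expert
    stream: each stream gets its own cross-entropy term, plus an ensemble
    term on the averaged logits that encourages cooperative specialization.
    """
    ens_logits = 0.5 * (body_logits + hand_logits)   # simple late fusion (assumed)
    loss_body = F.cross_entropy(body_logits, labels)
    loss_hand = F.cross_entropy(hand_logits, labels)
    loss_ens = F.cross_entropy(ens_logits, labels)
    return loss_body + loss_hand + lam * loss_ens
```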
[150] FP4DiT: Towards Effective Floating Point Quantization for Diffusion Transformers
Ruichen Chen, Keith G. Mills, Di Niu
Main category: cs.CV
TL;DR: FP4DiT: A post-training quantization method using floating-point quantization (FPQ) to achieve W4A6 precision for Diffusion Transformers (DiT), outperforming integer-based PTQ methods on modern transformer-based diffusion models.
Details
Motivation: Current DM PTQ methods focus on convolutional U-Net architectures and integer quantization, but newer DiT models (like PixArt, Hunyuan) use transformer backbones and would benefit from FPQ which better aligns with weight/activation distributions in low-bit settings.
Method: Extends Adaptive Rounding PTQ technique to calibrate weight quantization for FPQ, plus robust online activation quantization techniques to handle DiT activations that depend on input patch data. Achieves W4A6 quantization using floating-point quantization.
Result: FP4DiT achieves higher CLIP, ImageReward and HPSv2 performance compared to integer-based PTQ at W4A6 and W4A8 precision levels, generating convincing visual content on PixArt-α, PixArt-Σ and Hunyuan models.
Conclusion: FPQ is more suitable than integer quantization for DiT models in low-bit settings, and FP4DiT provides an effective PTQ solution for modern transformer-based diffusion models with superior performance metrics.
Abstract: Diffusion Models (DM) have revolutionized the text-to-image visual generation process. However, the large computational cost and model footprint of DMs hinder practical deployment, especially on edge devices. Post-training quantization (PTQ) is a lightweight method to alleviate these burdens without the need for training or fine-tuning. While recent DM PTQ methods achieve W4A8 (i.e., 4-bit weights and 8-bit activations) on integer-based PTQ, two key limitations remain: First, while most existing DM PTQ methods evaluate on classical DMs like Stable Diffusion XL, 1.5 or earlier, which use convolutional U-Nets, newer Diffusion Transformer (DiT) models like the PixArt series, Hunyuan and others adopt fundamentally different transformer backbones to achieve superior image synthesis. Second, integer (INT) quantization is prevailing in DM PTQ but does not align well with the network weight and activation distribution, while Floating-Point Quantization (FPQ) is still under-investigated, yet it holds the potential to better align the weight and activation distributions in low-bit settings for DiT. In this paper, we introduce FP4DiT, a PTQ method that leverages FPQ to achieve W4A6 quantization. Specifically, we extend and generalize the Adaptive Rounding PTQ technique to adequately calibrate weight quantization for FPQ and demonstrate that DiT activations depend on input patch data, necessitating robust online activation quantization techniques. Experimental results demonstrate that FP4DiT achieves higher CLIP, ImageReward and HPSv2 performance compared to integer-based PTQ at the W4A6 and W4A8 precision levels while generating convincing visual content on PixArt-α, PixArt-Σ and Hunyuan.
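As a rough illustration of 4-bit floating-point weight quantization, the sketch below snaps a tensor to the E2M1 FP4 grid after per-tensor scaling. It is a plain round-to-nearest baseline for intuition, not the adaptive-rounding calibration FP4DiT actually performs.

```python
import numpy as np

# Non-negative magnitudes representable in E2M1 FP4 (1 sign, 2 exponent, 1 mantissa bits).
FP4_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def fake_quant_fp4(w):
    """Simulated FP4 quantization: scale the tensor so its max magnitude
    maps to the largest FP4 value, snap every entry to the nearest grid
    point, then rescale back."""
    scale = np.abs(w).max() / FP4_GRID[-1] + 1e-12
    mags = np.abs(w) / scale
    idx = np.abs(mags[..., None] - FP4_GRID).argmin(axis=-1)   # nearest FP4 magnitude
    return np.sign(w) * FP4_GRID[idx] * scale
```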
[151] Beyond Accuracy: What Matters in Designing Well-Behaved Image Classification Models?
Robin Hesse, Doğukan Bağcı, Bernt Schiele, Simone Schaub-Meyer, Stefan Roth
Main category: cs.CV
TL;DR: This paper introduces QUBA score, a novel metric for evaluating deep neural networks across nine quality dimensions beyond just accuracy, based on analysis of 326 models revealing insights about vision-language models, self-supervised learning, and dataset size effects.
Details
Motivation: Deep neural networks excel in predictive performance but often fall short in other critical quality dimensions like robustness, calibration, and fairness. Existing studies focus on subsets of these dimensions, leaving a gap in understanding a more general form of "well-behavedness" across multiple quality dimensions simultaneously.
Method: Conducted a large-scale study analyzing 326 backbone models for image classification, examining how different training paradigms and model architectures affect nine quality dimensions. The study systematically evaluates models across these dimensions to understand their trade-offs and relationships.
Result: Key findings: (1) Vision-language models show high class balance on ImageNet-1k and strong robustness against domain changes; (2) Training models initialized with self-supervised learning weights improves most quality dimensions; (3) Training dataset size is a major driver for most quality dimensions.
Conclusion: The paper introduces the QUBA (Quality Understanding Beyond Accuracy) score, a novel metric that ranks models across multiple quality dimensions, enabling tailored recommendations based on specific user needs and providing a comprehensive framework for evaluating model quality beyond just accuracy.
Abstract: Deep learning has become an essential part of computer vision, with deep neural networks (DNNs) excelling in predictive performance. However, they often fall short in other critical quality dimensions, such as robustness, calibration, or fairness. While existing studies have focused on a subset of these quality dimensions, none have explored a more general form of “well-behavedness” of DNNs. With this work, we address this gap by simultaneously studying nine different quality dimensions for image classification. Through a large-scale study, we provide a bird’s-eye view by analyzing 326 backbone models and how different training paradigms and model architectures affect these quality dimensions. We reveal various new insights such that (i) vision-language models exhibit high class balance on ImageNet-1k classification and strong robustness against domain changes; (ii) training models initialized with weights obtained through self-supervised learning is an effective strategy to improve most considered quality dimensions; and (iii) the training dataset size is a major driver for most of the quality dimensions. We conclude our study by introducing the QUBA score (Quality Understanding Beyond Accuracy), a novel metric that ranks models across multiple dimensions of quality, enabling tailored recommendations based on specific user needs.
[152] VisualQuest: A Benchmark for Abstract Visual Reasoning in MLLMs
Kelaiti Xiao, Liang Yang, Dongyu Zhang, Paerhati Tulajiang, Hongfei Lin
Main category: cs.CV
TL;DR: VisualQuest is a new dataset with 3,551 stylized images testing multimodal models on abstract visual reasoning requiring symbolic, cultural, and linguistic knowledge.
Details
Motivation: Existing benchmarks focus too much on direct image captioning and classification of realistic images, lacking evaluation of abstract visual reasoning that requires integration of symbolic, cultural, and linguistic knowledge.
Method: Created a dataset of 3,551 non-photographic, stylized images across four categories: Public Figures, Popular Culture, Linguistic Expressions, and Literary Works, each paired with targeted reasoning questions.
Result: Only Gemini-2.5-flash and GPT-4o achieved strong overall performance; 3.7% of images remained unrecognized by any model; Gemini excels at stylized public figures while GPT-4o leads in linguistic reasoning tasks.
Conclusion: VisualQuest provides a challenging resource for advancing abstract visual reasoning research, highlighting persistent multimodal understanding challenges and key areas for future model improvement.
Abstract: We introduce VisualQuest, a novel dataset designed to rigorously evaluate multimodal large language models (MLLMs) on abstract visual reasoning tasks that require the integration of symbolic, cultural, and linguistic knowledge. Unlike existing benchmarks that focus on direct image captioning or classification of realistic images, VisualQuest comprises 3,551 non-photographic, stylized images spanning four categories: Public Figures, Popular Culture, Linguistic Expressions, and Literary Works. Each image is paired with targeted questions to probe complex reasoning. We benchmark ten state-of-the-art MLLMs and find that only Gemini-2.5-flash and GPT-4o achieve strong overall performance, while 3.7 percent of the images remain unrecognized by any model, underscoring persistent challenges in multimodal understanding. Fine-grained analysis shows that Gemini excels at recognizing stylized public figures, whereas GPT-4o leads in linguistic reasoning tasks such as visual puns and emoji combinations. VisualQuest provides a comprehensive and challenging resource for advancing research in abstract visual reasoning and highlights key areas for future model improvement. The dataset is available at https://github.com/xkt88/VISUALQUEST.
[153] UltraGS: Real-Time Physically-Decoupled Gaussian Splatting for Ultrasound Novel View Synthesis
Yuezhe Yang, Qingqing Ruan, Wenjie Cai, Yudang Dong, Dexin Yang, Xingbo Dong, Zhe Jin, Yong Dai
Main category: cs.CV
TL;DR: UltraGS is a real-time Gaussian Splatting framework for novel view synthesis in ultrasound imaging that integrates explicit radiance fields with physics-inspired acoustic modeling, achieving state-of-the-art performance with real-time synthesis at 64.69 fps.
Details
Motivation: Ultrasound imaging has a limited field of view, which poses challenges for novel view synthesis, and existing methods lack the real-time performance and physical accuracy needed for clinical applications.
Method: Adapts Gaussian Splatting to ultrasound imaging using depth-aware Gaussian primitives with learnable fields of view for geometric consistency, and introduces PD Rendering - a differentiable acoustic operator combining low-order spherical harmonics with first-order wave effects for efficient intensity synthesis.
Result: Achieves state-of-the-art results with PSNR up to 29.55 and SSIM up to 0.89 across three datasets, while achieving real-time synthesis at 64.69 fps on a single GPU. Also presents a new clinical ultrasound dataset.
Conclusion: UltraGS establishes a new performance-efficiency frontier for ultrasound novel view synthesis, enabling real-time applications with open-sourced code and dataset for community use.
Abstract: Ultrasound imaging is a cornerstone of non-invasive clinical diagnostics, yet its limited field of view poses challenges for novel view synthesis. We present UltraGS, a real-time framework that adapts Gaussian Splatting to sensorless ultrasound imaging by integrating explicit radiance fields with lightweight, physics-inspired acoustic modeling. UltraGS employs depth-aware Gaussian primitives with learnable fields of view to improve geometric consistency under unconstrained probe motion, and introduces PD Rendering, a differentiable acoustic operator that combines low-order spherical harmonics with first-order wave effects for efficient intensity synthesis. We further present a clinical ultrasound dataset acquired under real-world scanning protocols. Extensive evaluations across three datasets demonstrate that UltraGS establishes a new performance-efficiency frontier, achieving state-of-the-art results in PSNR (up to 29.55) and SSIM (up to 0.89) while achieving real-time synthesis at 64.69 fps on a single GPU. The code and dataset are open-sourced at: https://github.com/Bean-Young/UltraGS.
[154] LEL: Lipschitz Continuity Constrained Ensemble Learning for Efficient EEG-Based Intra-subject Emotion Recognition
Shengyu Gong, Yueyang Li, Zijian Kang, Bo Chai, Weiming Zeng, Hongjie Yan, Zhiguo Zhang, Wai Ting Siok, Nizhuan Wang
Main category: cs.CV
TL;DR: LEL introduces Lipschitz continuity-constrained ensemble learning for EEG-based emotion recognition, improving stability, accuracy, and robustness against signal variability and noise.
Details
Motivation: Existing EEG-based emotion recognition methods suffer from insufficient model stability, limited accuracy with high-dimensional nonlinear EEG signals, and poor robustness against intra-subject variability and signal noise.
Method: Lipschitz continuity-constrained Ensemble Learning (LEL) framework that enforces Lipschitz continuity constraints on Transformer-based attention mechanisms, spectral extraction, and normalization modules, plus a learnable ensemble fusion strategy combining multiple heterogeneous classifiers.
Result: Superior performance on three public benchmark datasets: EAV (74.25%), FACED (81.19%), and SEED (86.79%) average recognition accuracies.
Conclusion: LEL effectively addresses key limitations in EEG-based emotion recognition by enhancing model stability, improving accuracy with nonlinear signals, and increasing robustness against variability and noise through Lipschitz constraints and ensemble learning.
Abstract: Accurate and efficient recognition of emotional states is critical for human social functioning, and impairments in this ability are associated with significant psychosocial difficulties. While electroencephalography (EEG) offers a powerful tool for objective emotion detection, existing EEG-based Emotion Recognition (EER) methods suffer from three key limitations: (1) insufficient model stability, (2) limited accuracy in processing high-dimensional nonlinear EEG signals, and (3) poor robustness against intra-subject variability and signal noise. To address these challenges, we introduce Lipschitz continuity-constrained Ensemble Learning (LEL), a novel framework that enhances EEG-based emotion recognition by enforcing Lipschitz continuity constraints on Transformer-based attention mechanisms, spectral extraction, and normalization modules. This constraint ensures model stability, reduces sensitivity to signal variability and noise, and improves generalization capability. Additionally, LEL employs a learnable ensemble fusion strategy that optimally combines decisions from multiple heterogeneous classifiers to mitigate single-model bias and variance. Extensive experiments on three public benchmark datasets (EAV, FACED, and SEED) demonstrate superior performance, achieving average recognition accuracies of 74.25%, 81.19%, and 86.79%, respectively. The official implementation codes are available at https://github.com/NZWANG/LEL.
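Spectral normalization is one standard way to make a linear map 1-Lipschitz; whether LEL enforces its constraint exactly this way is not stated in the summary, so the snippet below is only an illustration of the general mechanism, with placeholder module and dimension names.

```python
import torch.nn as nn
from torch.nn.utils.parametrizations import spectral_norm

class LipschitzProjection(nn.Module):
    """A 1-Lipschitz linear projection: spectral normalization caps the
    largest singular value of the weight at 1, so the layer cannot amplify
    perturbations (e.g., noise) in the input EEG features."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.proj = spectral_norm(nn.Linear(in_dim, out_dim))

    def forward(self, x):
        return self.proj(x)
```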
[155] Satellite to Street : Disaster Impact Estimator
Sreesritha Sai, Sai Venkata Suma Sreeja, Deepthi, Nikhil
Main category: cs.CV
TL;DR: Satellite-to-Street: Disaster Impact Estimator is a deep learning framework that uses modified dual-input U-Net with improved feature fusion to create detailed pixel-level damage maps from pre/post-disaster satellite imagery, addressing class imbalance with weighted loss functions.
Details
Motivation: Current manual satellite imagery interpretation for post-disaster damage assessment is time-consuming, subjective, and difficult to scale. Existing deep learning models struggle with subtle structural variations and highly imbalanced datasets where undamaged buildings dominate.
Method: Modified dual-input U-Net architecture with strengthened feature fusion between pre/post-disaster images, class-aware weighted loss function to handle category imbalance, and consistent preprocessing pipeline for image alignment and standardization.
Result: The framework achieves better classification of damaged regions compared to conventional segmentation networks on publicly available disaster datasets, provides faster and more objective damage analysis, and can distinguish different severity levels from slight impact to complete destruction.
Conclusion: The system offers a more detailed and practical understanding of disaster impact, working alongside expert judgment rather than replacing it, enabling faster and objective damage assessment while capturing both localized changes and broader contextual patterns.
Abstract: Accurate assessment of post-disaster damage is essential for prioritizing emergency response, yet current practices rely heavily on manual interpretation of satellite imagery. This approach is time-consuming, subjective, and difficult to scale during large-area disasters. Although recent deep-learning models for semantic segmentation and change detection have improved automation, many of them still struggle to capture subtle structural variations and often perform poorly when dealing with highly imbalanced datasets, where undamaged buildings dominate. This thesis introduces Satellite-to-Street: Disaster Impact Estimator, a deep-learning framework that produces detailed, pixel-level damage maps by analyzing pre- and post-disaster satellite images together. The model is built on a modified dual-input U-Net architecture that strengthens feature fusion between both images, allowing it to detect not only small, localized changes but also broader contextual patterns across the scene. To address the imbalance between damage categories, a class-aware weighted loss function is used, which helps the model better recognize major and destroyed structures. A consistent preprocessing pipeline is employed to align image pairs, standardize resolutions, and prepare the dataset for training. Experiments conducted on publicly available disaster datasets show that the proposed framework achieves better classification of damaged regions compared to conventional segmentation networks. The generated damage maps provide a faster and more objective method for analyzing disaster impact, working alongside expert judgment rather than replacing it. In addition to identifying which areas are damaged, the system is capable of distinguishing different levels of severity, ranging from slight impact to complete destruction. This provides a more detailed and practical understanding of how the disaster has affected each region.
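The class-aware weighted loss mentioned above can be illustrated with inverse-frequency weighted cross-entropy; the specific weighting below is an assumption, since the abstract only states that a class-aware weighted loss is used.

```python
import torch
import torch.nn.functional as F

def class_aware_ce(logits, target, eps=1e-6):
    """Illustrative class-aware weighted cross-entropy for pixel-level
    damage maps: classes covering fewer pixels in the batch get larger
    weights, so rare 'destroyed' pixels are not swamped by the dominant
    'undamaged' class.

    logits: (B, C, H, W) raw scores, target: (B, H, W) integer labels
    """
    num_classes = logits.shape[1]
    counts = torch.bincount(target.flatten(), minlength=num_classes).float()
    weights = 1.0 / (counts + eps)                       # inverse pixel frequency (assumed)
    weights = weights / weights.sum() * num_classes      # normalize around 1
    return F.cross_entropy(logits, target, weight=weights)
```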
[156] WorldMem: Long-term Consistent World Simulation with Memory
Zeqi Xiao, Yushi Lan, Yifan Zhou, Wenqi Ouyang, Shuai Yang, Yanhong Zeng, Xingang Pan
Main category: cs.CV
TL;DR: WorldMem is a framework that enhances world simulation by using a memory bank with memory units to maintain long-term 3D spatial consistency and capture dynamic evolution over time.
Details
Motivation: Current world simulation methods struggle with limited temporal context windows, leading to failures in maintaining long-term consistency, especially in preserving 3D spatial consistency across significant viewpoint or temporal gaps.
Method: WorldMem introduces a memory bank consisting of memory units that store memory frames and states (poses and timestamps). It employs a memory attention mechanism to extract relevant information from memory frames based on their states, enabling accurate scene reconstruction even with large viewpoint or temporal gaps. Timestamps in states allow modeling of both static worlds and dynamic evolution over time.
Result: Extensive experiments in both virtual and real scenarios validate the effectiveness of WorldMem in accurately reconstructing previously observed scenes and capturing dynamic world evolution.
Conclusion: WorldMem successfully addresses the long-term consistency problem in world simulation by leveraging a memory bank with attention mechanisms, enabling both perception and interaction within simulated worlds while maintaining 3D spatial consistency over time.
Abstract: World simulation has gained increasing popularity due to its ability to model virtual environments and predict the consequences of actions. However, the limited temporal context window often leads to failures in maintaining long-term consistency, particularly in preserving 3D spatial consistency. In this work, we present WorldMem, a framework that enhances scene generation with a memory bank consisting of memory units that store memory frames and states (e.g., poses and timestamps). By employing a memory attention mechanism that effectively extracts relevant information from these memory frames based on their states, our method is capable of accurately reconstructing previously observed scenes, even under significant viewpoint or temporal gaps. Furthermore, by incorporating timestamps into the states, our framework not only models a static world but also captures its dynamic evolution over time, enabling both perception and interaction within the simulated world. Extensive experiments in both virtual and real scenarios validate the effectiveness of our approach.
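A toy version of state-keyed memory retrieval, ranking stored frames by pose and timestamp proximity, is sketched below. WorldMem uses a learned memory attention rather than this hand-crafted score, so treat the scoring function and its weights purely as an illustration of state-conditioned lookup.

```python
import numpy as np

def retrieve_memory(query_pose, query_time, mem_poses, mem_times, k=4,
                    w_pose=1.0, w_time=0.1):
    """Rank memory units by a simple state distance mixing pose distance and
    timestamp gap, and return the indices of the k most relevant frames.

    query_pose: (7,) e.g. position + quaternion; mem_poses: (N, 7)
    """
    pose_dist = np.linalg.norm(mem_poses - query_pose, axis=1)
    time_dist = np.abs(mem_times - query_time)
    score = -(w_pose * pose_dist + w_time * time_dist)   # higher = more relevant
    return np.argsort(score)[::-1][:k]
```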
[157] Matrix-free Second-order Optimization of Gaussian Splats with Residual Sampling
Hamza Pehlivan, Andrea Boscolo Camiletto, Lin Geng Foo, Marc Habermann, Christian Theobalt
Main category: cs.CV
TL;DR: A second-order optimization method using Levenberg-Marquardt and Conjugate Gradient for 3D Gaussian Splatting achieves 3× speedup over standard LM and ~6× faster than Adam for low Gaussian counts.
Details
Motivation: 3D Gaussian Splatting relies on first-order optimizers like Adam, which leads to long training times. There's a need for faster optimization methods to reduce training time while maintaining quality.
Method: Proposes a second-order optimization strategy based on Levenberg-Marquardt and Conjugate Gradient tailored for Gaussian Splatting. Exploits sparsity in Jacobian (each Gaussian affects limited pixels), implements matrix-free GPU-parallelized LM optimization, uses camera view and loss function sampling to reduce computational complexity, and introduces heuristic learning rate determination to avoid expensive line search.
Result: Achieves 3× speedup over standard LM and outperforms Adam by ~6× when Gaussian count is low, while remaining competitive for moderate Gaussian counts.
Conclusion: The proposed second-order optimization method significantly accelerates 3D Gaussian Splatting training while maintaining competitive performance, making it a practical solution for faster novel view synthesis.
Abstract: 3D Gaussian Splatting (3DGS) is widely used for novel view synthesis due to its high rendering quality and fast inference time. However, 3DGS predominantly relies on first-order optimizers such as Adam, which leads to long training times. To address this limitation, we propose a novel second-order optimization strategy based on Levenberg-Marquardt (LM) and Conjugate Gradient (CG), which we specifically tailor towards Gaussian Splatting. Our key insight is that the Jacobian in 3DGS exhibits significant sparsity since each Gaussian affects only a limited number of pixels. We exploit this sparsity by proposing a matrix-free and GPU-parallelized LM optimization. To further improve its efficiency, we propose sampling strategies for both the camera views and loss function and, consequently, the normal equation, significantly reducing the computational complexity. In addition, we increase the convergence rate of the second-order approximation by introducing an effective heuristic to determine the learning rate that avoids the expensive computation cost of line search methods. As a result, our method achieves a $3\times$ speedup over standard LM and outperforms Adam by $~6\times$ when the Gaussian count is low while remaining competitive for moderate counts. Project Page: https://vcai.mpi-inf.mpg.de/projects/LM-IS
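The matrix-free ingredient can be made concrete: solving the damped normal equations $(J^\top J + \lambda I)\,\delta = J^\top r$ with conjugate gradients only requires Jacobian-vector and vector-Jacobian products, never the Jacobian itself. The sketch below is generic CG with user-supplied jvp/vjp callables (e.g., from autodiff), not the paper's GPU-parallelized implementation.

```python
import numpy as np

def lm_step_cg(jvp, vjp, residual, dim, lam=1e-2, n_iter=50, tol=1e-8):
    """Solve (J^T J + lam I) delta = J^T r with conjugate gradients, never
    forming J explicitly. `jvp(v)` returns J @ v and `vjp(u)` returns J^T @ u;
    `dim` is the number of parameters."""
    def matvec(v):
        return vjp(jvp(v)) + lam * v

    b = vjp(residual)                 # right-hand side J^T r
    x = np.zeros(dim)
    r = b - matvec(x)
    p = r.copy()
    rs = r @ r
    for _ in range(n_iter):
        Ap = matvec(p)
        alpha = rs / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        rs_new = r @ r
        if np.sqrt(rs_new) < tol:
            break
        p = r + (rs_new / rs) * p
        rs = rs_new
    return x                          # parameter update delta
```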
[158] Med-2D SegNet: A Light Weight Deep Neural Network for Medical 2D Image Segmentation
Lameya Sabrin, Md. Sanaullah Chowdhury, Salauddin Tapu, Noyon Kumar Sarkar, Ferdous Bin Ali
Main category: cs.CV
TL;DR: Med-2D SegNet is a highly efficient medical image segmentation architecture that achieves state-of-the-art accuracy with minimal computational footprint (2.07M parameters), excelling in cross-dataset generalization and zero-shot learning scenarios.
Details
Motivation: Medical image segmentation is crucial for clinical diagnostics and surgical planning but remains challenging due to anatomical variability and the need for low-complexity models suitable for resource-constrained clinical environments.
Method: Introduces Med-2D SegNet with a specialized compact Med Block encoder design that incorporates dimension expansion and parameter reduction techniques for precise feature extraction while maintaining minimal computational requirements.
Result: Achieves state-of-the-art performance across 20 diverse datasets with average Dice similarity coefficient of 89.77%, demonstrates strong cross-dataset generalization in polyp segmentation, and maintains only 2.07 million parameters.
Conclusion: Med-2D SegNet redefines the balance between accuracy and efficiency in medical image segmentation, paving the way for accessible, high-performance diagnostic tools suitable for clinical and resource-constrained settings.
Abstract: Accurate and efficient medical image segmentation is crucial for advancing clinical diagnostics and surgical planning, yet remains a complex challenge due to the variability in anatomical structures and the demand for low-complexity models. In this paper, we introduced Med-2D SegNet, a novel and highly efficient segmentation architecture that delivers outstanding accuracy while maintaining a minimal computational footprint. Med-2D SegNet achieves state-of-the-art performance across multiple benchmark datasets, including KVASIR-SEG, PH2, EndoVis, and GLAS, with an average Dice similarity coefficient (DSC) of 89.77% across 20 diverse datasets. Central to its success is the compact Med Block, a specialized encoder design that incorporates dimension expansion and parameter reduction, enabling precise feature extraction while keeping model parameters to a low count of just 2.07 million. Med-2D SegNet excels in cross-dataset generalization, particularly in polyp segmentation, where it was trained on KVASIR-SEG and showed strong performance on unseen datasets, demonstrating its robustness in zero-shot learning scenarios, even though we acknowledge that further improvements are possible. With top-tier performance in both binary and multi-class segmentation, Med-2D SegNet redefines the balance between accuracy and efficiency, setting a new benchmark for medical image analysis. This work paves the way for developing accessible, high-performance diagnostic tools suitable for clinical environments and resource-constrained settings, making it a step forward in the democratization of advanced medical technology.
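For reference, the Dice similarity coefficient reported above is the standard overlap metric used to evaluate segmentation quality, independent of the Med-2D SegNet architecture itself; a minimal implementation for binary masks:

```python
import numpy as np

def dice_coefficient(pred, target, eps=1e-7):
    """Standard Dice similarity coefficient (DSC) between two binary masks:
    2|A ∩ B| / (|A| + |B|)."""
    pred = pred.astype(bool)
    target = target.astype(bool)
    intersection = np.logical_and(pred, target).sum()
    return (2.0 * intersection + eps) / (pred.sum() + target.sum() + eps)
```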
[159] FreeGraftor: Training-Free Cross-Image Feature Grafting for Subject-Driven Text-to-Image Generation
Zebin Yao, Lei Ren, Huixing Jiang, Wei Chen, Xiaojie Wang, Ruifan Li, Fangxiang Feng
Main category: cs.CV
TL;DR: FreeGraftor is a training-free framework for subject-driven image generation that uses cross-image feature grafting to transfer subject identity from reference images without fine-tuning, achieving better fidelity and efficiency than existing methods.
Details
Motivation: Existing subject-driven image generation methods face a critical trade-off between fidelity and efficiency: tuning-based approaches require time-consuming subject-specific optimization, while zero-shot methods often fail to maintain adequate subject consistency.
Method: FreeGraftor uses cross-image feature grafting with semantic matching and position-constrained attention fusion to transfer visual details from reference subjects. It also introduces a novel noise initialization strategy to preserve geometry priors of reference subjects for robust feature matching.
Result: Extensive experiments show FreeGraftor enables precise subject identity transfer while maintaining text-aligned scene synthesis, significantly outperforming existing zero-shot and training-free approaches in both subject fidelity and text alignment without requiring model fine-tuning.
Conclusion: FreeGraftor addresses the fidelity-efficiency trade-off in subject-driven image generation through training-free cross-image feature grafting, achieving superior performance and practical deployment capabilities including multi-subject generation.
Abstract: Subject-driven image generation aims to synthesize novel scenes that faithfully preserve subject identity from reference images while adhering to textual guidance. However, existing methods struggle with a critical trade-off between fidelity and efficiency. Tuning-based approaches rely on time-consuming and resource-intensive, subject-specific optimization, while zero-shot methods often fail to maintain adequate subject consistency. In this work, we propose FreeGraftor, a training-free framework that addresses these limitations through cross-image feature grafting. Specifically, FreeGraftor leverages semantic matching and position-constrained attention fusion to transfer visual details from reference subjects to the generated images. Additionally, our framework introduces a novel noise initialization strategy to preserve the geometry priors of reference subjects, facilitating robust feature matching. Extensive qualitative and quantitative experiments demonstrate that our method enables precise subject identity transfer while maintaining text-aligned scene synthesis. Without requiring model fine-tuning or additional training, FreeGraftor significantly outperforms existing zero-shot and training-free approaches in both subject fidelity and text alignment. Furthermore, our framework can seamlessly extend to multi-subject generation, making it practical for real-world deployment. Our code is available at https://github.com/Nihukat/FreeGraftor.
[160] Spike Imaging Velocimetry: Dense Motion Estimation of Fluids Using Spike Cameras
Yunzhong Zhang, Bo Xiong, You Zhou, Changqing Su, Zhen Cheng, Zhaofei Yu, Xun Cao, Tiejun Huang
Main category: cs.CV
TL;DR: Spike Imaging Velocimetry (SIV) - a deep learning framework using spike cameras for high-speed fluid motion estimation, outperforming traditional PIV methods.
Details
Motivation: Traditional Particle Image Velocimetry (PIV) has limitations for high-speed fluid analysis. Spike cameras offer ultra-high-speed, high-dynamic-range capabilities that could revolutionize fluid velocimetry but remain untapped for this application.
Method: Developed Spike Imaging Velocimetry (SIV) framework with three novel modules: Detail-Preserving Hierarchical Transform (DPHT), Graph Encoder (GE), and Multi-scale Velocity Refinement (MSVR). Created PSSD dataset with labeled samples from three fluid scenarios.
Result: SIV outperforms existing baselines across all three representative fluid-dynamics scenarios: steady turbulence, high-speed flow, and high-dynamic-range conditions.
Conclusion: Spike cameras combined with the proposed deep learning framework enable superior high-speed fluid motion estimation, demonstrating the potential of spike-based vision sensors for advanced fluid velocimetry applications.
Abstract: Particle Image Velocimetry (PIV) is a widely adopted non-invasive imaging technique that tracks the motion of tracer particles across image sequences to capture the velocity distribution of fluid flows. It is commonly employed to analyze complex flow structures and validate numerical simulations. This study explores the untapped potential of spike cameras–ultra-high-speed, high-dynamic-range vision sensors–in high-speed fluid velocimetry. We propose a deep learning framework, Spike Imaging Velocimetry (SIV), tailored for high-resolution fluid motion estimation. To enhance the network’s performance, we design three novel modules specifically adapted to the characteristics of fluid dynamics and spike streams: the Detail-Preserving Hierarchical Transform (DPHT), the Graph Encoder (GE), and the Multi-scale Velocity Refinement (MSVR). Furthermore, we introduce a spike-based PIV dataset, Particle Scenes with Spike and Displacement (PSSD), which contains labeled samples from three representative fluid-dynamics scenarios: steady turbulence, high-speed flow, and high-dynamic-range conditions. Our proposed method outperforms existing baselines across all these scenarios, demonstrating its effectiveness.
[161] Stereo-GS: Multi-View Stereo Vision Model for Generalizable 3D Gaussian Splatting Reconstruction
Xiufeng Huang, Ka Chun Cheung, Runmin Cong, Simon See, Renjie Wan
Main category: cs.CV
TL;DR: A disentangled framework for efficient 3D Gaussian prediction that achieves pose-free 3D reconstruction with reduced computational demands.
Details
Motivation: Current 3D Gaussian Splatting methods require substantial computational resources, large datasets, and entangle geometry and appearance prediction, leading to slow regression speeds and heavy reliance on data-driven priors.
Method: Proposes a disentangled framework that extracts features from local image pairs using stereo vision backbone, fuses them via global attention blocks, uses dedicated heads for geometry (point-maps) and appearance (Gaussian features), combines them as GS-maps, and refines them with a refinement network.
Result: Achieves pose-free 3D reconstruction, improves robustness and practicality, reduces resource demands while maintaining high-quality outputs, and provides efficient, scalable solution for real-world 3D content generation.
Conclusion: The proposed method offers an efficient, scalable solution for 3D content generation by disentangling geometry and appearance prediction, enabling pose-free reconstruction with reduced computational requirements.
Abstract: Generalizable 3D Gaussian Splatting reconstruction showcases advanced Image-to-3D content creation but requires substantial computational resources and large datasets, posing challenges to training models from scratch. Current methods usually entangle the prediction of 3D Gaussian geometry and appearance, which rely heavily on data-driven priors and result in slow regression speeds. To address this, we propose Stereo-GS, a disentangled framework for efficient 3D Gaussian prediction. Our method extracts features from local image pairs using a stereo vision backbone and fuses them via global attention blocks. Dedicated point and Gaussian prediction heads generate multi-view point-maps for geometry and Gaussian features for appearance, combined as GS-maps to represent the 3DGS object. A refinement network enhances these GS-maps for high-quality reconstruction. Unlike existing methods that depend on camera parameters, our approach achieves pose-free 3D reconstruction, improving robustness and practicality. By reducing resource demands while maintaining high-quality outputs, Stereo-GS provides an efficient, scalable solution for real-world 3D content generation.
[162] AnyMS: Bottom-up Attention Decoupling for Layout-guided and Training-free Multi-subject Customization
Binhe Yu, Zhen Wang, Kexin Li, Yuqian Yuan, Wenqiao Zhang, Long Chen, Juncheng Li, Jun Xiao, Yueting Zhuang
Main category: cs.CV
TL;DR: AnyMS is a training-free framework for multi-subject image customization that uses layout guidance and attention decoupling to balance text alignment, subject identity preservation, and layout control without additional training.
Details
Motivation: Existing multi-subject customization methods struggle to balance text alignment, subject identity preservation, and layout control, while requiring additional training that limits scalability and efficiency.
Method: AnyMS uses a training-free framework with three inputs (text prompt, subject images, layout constraints) and introduces a bottom-up dual-level attention decoupling mechanism: global decoupling separates textual and visual conditions for text alignment, and local decoupling confines each subject’s attention to its designated area to prevent conflicts.
Result: Extensive experiments show AnyMS achieves state-of-the-art performance, supports complex compositions, and scales to larger numbers of subjects without requiring subject learning or adapter tuning.
Conclusion: AnyMS provides an effective training-free solution for layout-guided multi-subject customization that successfully balances all three critical objectives while being scalable and efficient.
Abstract: Multi-subject customization aims to synthesize multiple user-specified subjects into a coherent image. To address issues such as subjects missing or conflicts, recent works incorporate layout guidance to provide explicit spatial constraints. However, existing methods still struggle to balance three critical objectives: text alignment, subject identity preservation, and layout control, while the reliance on additional training further limits their scalability and efficiency. In this paper, we present AnyMS, a novel training-free framework for layout-guided multi-subject customization. AnyMS leverages three input conditions: text prompt, subject images, and layout constraints, and introduces a bottom-up dual-level attention decoupling mechanism to harmonize their integration during generation. Specifically, global decoupling separates cross-attention between textual and visual conditions to ensure text alignment. Local decoupling confines each subject’s attention to its designated area, which prevents subject conflicts and thus guarantees identity preservation and layout control. Moreover, AnyMS employs pre-trained image adapters to extract subject-specific features aligned with the diffusion model, removing the need for subject learning or adapter tuning. Extensive experiments demonstrate that AnyMS achieves state-of-the-art performance, supporting complex compositions and scaling to a larger number of subjects.
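The local decoupling idea, confining each subject's attention to its layout region, can be sketched as an additive attention mask built from the layout boxes; the coordinate convention and masking value below are assumptions, not the exact AnyMS mask.

```python
import torch

def layout_attention_mask(h, w, boxes):
    """Build an additive cross-attention mask that confines each subject to
    its layout region: entry (i, s) is 0 if spatial location i lies inside
    subject s's box and -inf otherwise. `boxes` is a list of (x0, y0, x1, y1)
    in patch coordinates; the mask is added to the attention logits between
    image tokens and subject tokens."""
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    ys, xs = ys.flatten(), xs.flatten()                 # (h*w,)
    mask = torch.full((h * w, len(boxes)), float("-inf"))
    for s, (x0, y0, x1, y1) in enumerate(boxes):
        inside = (xs >= x0) & (xs < x1) & (ys >= y0) & (ys < y1)
        mask[inside, s] = 0.0
    return mask
```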
[163] A Survey on 3D Gaussian Splatting Applications: Segmentation, Editing, and Generation
Shuting He, Peilin Ji, Yitong Yang, Changshuo Wang, Jiayi Ji, Yinglin Wang, Henghui Ding
Main category: cs.CV
TL;DR: A comprehensive survey of 3D Gaussian Splatting applications beyond novel view synthesis, covering semantic understanding, segmentation, editing, generation, and their integration with 2D foundation models and NeRF methods.
Details
Motivation: 3D Gaussian Splatting has emerged as an efficient alternative to NeRF for novel view synthesis, but its explicit and compact representation enables broader applications requiring geometric and semantic understanding. This survey aims to organize and analyze the rapidly growing field of 3DGS applications.
Method: The survey categorizes 3DGS applications into three foundational tasks (segmentation, editing, generation) and additional functional applications. It reviews 2D foundation models that support semantic understanding, compares with NeRF-based methods, summarizes supervision strategies and learning paradigms, and analyzes datasets and evaluation protocols.
Result: Provides a comprehensive taxonomy of 3DGS applications, identifies shared design principles and emerging trends, summarizes representative methods, and maintains an updated repository of papers, code, and resources for ongoing research.
Conclusion: 3D Gaussian Splatting enables diverse downstream applications beyond novel view synthesis, with growing integration of semantic understanding and control. The survey organizes this rapidly evolving field and provides resources to support future research and development in 3DGS applications.
Abstract: In the context of novel view synthesis, 3D Gaussian Splatting (3DGS) has recently emerged as an efficient and competitive counterpart to Neural Radiance Field (NeRF), enabling high-fidelity photorealistic rendering in real time. Beyond novel view synthesis, the explicit and compact nature of 3DGS enables a wide range of downstream applications that require geometric and semantic understanding. This survey provides a comprehensive overview of recent progress in 3DGS applications. It first introduces 2D foundation models that support semantic understanding and control in 3DGS applications, followed by a review of NeRF-based methods that inform their 3DGS counterparts. We then categorize 3DGS applications into three foundational tasks: segmentation, editing, and generation, alongside additional functional applications built upon or tightly coupled with these foundational capabilities. For each, we summarize representative methods, supervision strategies, and learning paradigms, highlighting shared design principles and emerging trends. Commonly used datasets and evaluation protocols are also summarized, along with comparative analyses of recent methods across public benchmarks. To support ongoing research and development, a continually updated repository of papers, code, and resources is maintained at https://github.com/heshuting555/Awesome-3DGS-Applications.
[164] Error Propagation Mechanisms and Compensation Strategies for Quantized Diffusion
Songwei Liu, Chao Zeng, Chenqian Yan, Xurui Peng, Xing Wang, Fangmin Chen, Xing Mei
Main category: cs.CV
TL;DR: A theoretical framework for analyzing quantization error propagation in diffusion models with a timestep-aware compensation scheme that improves PTQ performance with minimal overhead.
Details
Motivation: Diffusion models face deployment challenges due to computationally intensive iterative denoising, and post-training quantization (PTQ) suffers from stepwise quantization errors that accumulate during generation, compromising output fidelity.
Method: Developed a theoretical framework mathematically formulating error propagation in diffusion models, derived per-step quantization error propagation equations, established first closed-form solution for cumulative error, and proposed timestep-aware cumulative error compensation scheme.
Result: Extensive experiments show the compensation strategy effectively mitigates error propagation, enhancing existing PTQ methods with 1.2 PSNR improvement over SVDQuant on SDXL W4A4 while adding only <0.5% time overhead.
Conclusion: The proposed theoretical framework and compensation scheme successfully address quantization error accumulation in diffusion models, enabling more efficient deployment of quantized diffusion models with minimal performance degradation.
Abstract: Diffusion models have transformed image synthesis by establishing unprecedented quality and creativity benchmarks. Nevertheless, their large-scale deployment faces challenges due to computationally intensive iterative denoising processes. Although post-training quantization (PTQ) provides an effective pathway for accelerating sampling, the iterative nature of diffusion models causes stepwise quantization errors to accumulate progressively during generation, inevitably compromising output fidelity. To address this challenge, we develop a theoretical framework that mathematically formulates error propagation in Diffusion Models (DMs), deriving per-step quantization error propagation equations and establishing the first closed-form solution for cumulative error. Building on this theoretical foundation, we propose a timestep-aware cumulative error compensation scheme. Extensive experiments on multiple image datasets demonstrate that our compensation strategy effectively mitigates error propagation, significantly enhancing existing PTQ methods. Specifically, it achieves a 1.2 PSNR improvement over SVDQuant on SDXL W4A4, while incurring only an additional $<$ 0.5% time overhead.
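The summary does not reproduce the derivation, but the flavor of such an error-propagation analysis can be conveyed by a linearized per-step recursion and its unrolled form (illustrative only, not the paper's closed-form solution): with per-step quantization error $\delta_t$ and propagation factor $\alpha_t$, the recursion $e_{t-1} = \alpha_t e_t + \delta_t$ unrolls to $e_0 = \sum_{t=1}^{T} \bigl(\prod_{s=1}^{t-1} \alpha_s\bigr)\, \delta_t$ when $e_T = 0$; a timestep-aware compensation scheme then estimates and subtracts the accumulated term at each denoising step.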
[165] OBS-Diff: Accurate Pruning For Diffusion Models in One-Shot
Junhan Zhu, Hesong Wang, Mingluo Su, Zefang Wang, Huan Wang
Main category: cs.CV
TL;DR: OBS-Diff is a one-shot pruning framework that compresses large text-to-image diffusion models without training, using adapted Optimal Brain Surgeon with timestep-aware Hessian and efficient group-wise pruning.
Details
Motivation: Large-scale text-to-image diffusion models have prohibitive computational costs, and existing one-shot pruning methods don't work well due to the iterative denoising nature of diffusion models.Method: Adapts Optimal Brain Surgeon (OBS) for diffusion model architectures, introduces timestep-aware Hessian with logarithmic-decrease weighting for early timesteps, and uses group-wise sequential pruning for efficiency.
Result: Achieves state-of-the-art one-shot pruning for diffusion models, delivering inference acceleration with minimal degradation in visual quality.
Conclusion: OBS-Diff successfully bridges the gap for applying one-shot pruning to diffusion models, enabling efficient compression while maintaining quality.
Abstract: Large-scale text-to-image diffusion models, while powerful, suffer from prohibitive computational cost. Existing one-shot network pruning methods can hardly be directly applied to them due to the iterative denoising nature of diffusion models. To bridge the gap, this paper presents OBS-Diff, a novel one-shot pruning framework that enables accurate and training-free compression of large-scale text-to-image diffusion models. Specifically, (i) OBS-Diff revitalizes the classic Optimal Brain Surgeon (OBS), adapting it to the complex architectures of modern diffusion models and supporting diverse pruning granularity, including unstructured, N:M semi-structured, and structured (MHA heads and FFN neurons) sparsity; (ii) To align the pruning criteria with the iterative dynamics of the diffusion process, by examining the problem from an error-accumulation perspective, we propose a novel timestep-aware Hessian construction that incorporates a logarithmic-decrease weighting scheme, assigning greater importance to earlier timesteps to mitigate potential error accumulation; (iii) Furthermore, a computationally efficient group-wise sequential pruning strategy is proposed to amortize the expensive calibration process. Extensive experiments show that OBS-Diff achieves state-of-the-art one-shot pruning for diffusion models, delivering inference acceleration with minimal degradation in visual quality.
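As a rough illustration of two ingredients named in the abstract, the sketch below combines the classic OBS saliency w_q^2 / (2 [H^-1]_qq) with a calibration Hessian accumulated over timesteps under a logarithmically decreasing weight. The exact weighting scheme and calibration procedure in OBS-Diff may differ, and the sequential update here reuses a single Hessian inverse for simplicity.

```python
# Rough sketch, not OBS-Diff itself: classic OBS saliency plus a Hessian
# accumulated over timesteps with a log-decreasing weight.
import numpy as np

rng = np.random.default_rng(0)
d, T = 8, 20
w = rng.normal(size=d)                         # weights of one linear row to be pruned

# Timestep-aware Hessian: H = sum_t lam_t * X_t^T X_t / n, with lam_t largest
# for the earliest (noisiest) sampling steps.
lam = np.log(T + 1 - np.arange(T)) / np.log(T + 1)
H = np.zeros((d, d))
for t in range(T):
    X_t = rng.normal(size=(16, d))             # calibration activations at step t
    H += lam[t] * (X_t.T @ X_t) / X_t.shape[0]
H += 1e-4 * np.eye(d)                          # damping so H is invertible

H_inv = np.linalg.inv(H)
saliency = w ** 2 / (2.0 * np.diag(H_inv))     # cost of removing each weight
prune_idx = np.argsort(saliency)[: d // 2]     # prune the least salient half

for q in prune_idx:                            # OBS compensation of the remaining weights
    w = w - (w[q] / H_inv[q, q]) * H_inv[:, q]
    w[q] = 0.0
print("pruned indices:", sorted(prune_idx.tolist()))
```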
[166] Spatially-Grounded Document Retrieval via Patch-to-Region Relevance Propagation
Athos Georgiou
Main category: cs.CV
TL;DR: Snappy combines ColPali’s patch-level visual similarity with OCR text extraction to create a hybrid retrieval system that precisely localizes relevant document regions, reducing context tokens by 28.8-52.3% for RAG applications.
Details
Motivation: Existing multimodal retrieval models like ColPali work at page-level granularity, which is too coarse for precise RAG context. OCR systems extract structured text but lack semantic relevance assessment. There's a need to combine visual semantic understanding with precise text localization.Method: A hybrid architecture that uses ColPali’s patch-level similarity scores as spatial relevance filters over OCR-extracted regions. The approach formalizes coordinate mapping between vision transformer patch grids and OCR bounding boxes, introduces intersection metrics for relevance propagation, and establishes theoretical bounds on area efficiency.
Result: On BBox-DocVQA with ground-truth bounding boxes, achieves 59.7% hit rate at IoU@0.5 (84.4% at IoU@0.25, 35.8% at IoU@0.7) for within-page localization, with mean IoU of 0.569. Reduces context tokens by 28.8% compared to all OCR regions and 52.3% compared to full-page image tokens. Operates at inference time without additional training.
Conclusion: The proposed hybrid approach effectively combines visual semantic understanding with precise text localization, enabling more efficient and accurate document retrieval for RAG applications. The open-source Snappy implementation demonstrates practical utility without requiring model retraining.
Abstract: Late-interaction multimodal retrieval models like ColPali achieve state-of-the-art document retrieval by embedding pages as images and computing fine-grained similarity between query tokens and visual patches. However, they operate at page-level granularity, limiting utility for retrieval-augmented generation (RAG) where precise context is paramount. Conversely, OCR-based systems extract structured text with bounding box coordinates but lack semantic grounding for relevance assessment. We propose a hybrid architecture that unifies these paradigms: using ColPali’s patch-level similarity scores as spatial relevance filters over OCR-extracted regions. We formalize the coordinate mapping between vision transformer patch grids and OCR bounding boxes, introduce intersection metrics for relevance propagation, and establish theoretical bounds on area efficiency. We evaluate on BBox-DocVQA with ground-truth bounding boxes. For within-page localization (given correct page retrieval), ColQwen3-4B with percentile-50 thresholding achieves 59.7% hit rate at IoU@0.5 (84.4% at IoU@0.25, 35.8% at IoU@0.7), with mean IoU of 0.569, compared to ~6.7% for random region selection. Our approach reduces context tokens by 28.8% compared to returning all OCR regions and by 52.3% compared to full-page image tokens. Our approach operates at inference time without additional training. We release Snappy, an open-source implementation at https://github.com/athrael-soju/Snappy.
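The coordinate mapping and relevance propagation can be pictured with a small sketch: patch scores are spread onto OCR boxes in proportion to the intersected area, and boxes above a percentile threshold are kept. Grid size, page size, and the scoring rule below are illustrative assumptions, not Snappy's exact implementation.

```python
# Illustrative patch-to-region relevance propagation; parameters are assumptions.
import numpy as np

def patch_box(idx, grid=(32, 32), page=(1024, 1024)):
    """Pixel box (x0, y0, x1, y1) of patch `idx` on a page of size (width, height)."""
    gh, gw = grid
    pw, ph = page[0] / gw, page[1] / gh
    r, c = divmod(idx, gw)
    return (c * pw, r * ph, (c + 1) * pw, (r + 1) * ph)

def intersection_area(a, b):
    w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    return w * h

def select_regions(patch_scores, ocr_boxes, grid=(32, 32), page=(1024, 1024), pct=50):
    """Keep OCR boxes whose area-weighted patch relevance is above the percentile."""
    region_scores = []
    for box in ocr_boxes:
        area = (box[2] - box[0]) * (box[3] - box[1])
        score = sum(
            patch_scores[i] * intersection_area(patch_box(i, grid, page), box) / area
            for i in range(len(patch_scores))
        )
        region_scores.append(score)
    thr = np.percentile(region_scores, pct)
    return [b for b, s in zip(ocr_boxes, region_scores) if s >= thr]

# Toy usage: relevance concentrated in the top rows of the page keeps only the first box.
scores = np.zeros(32 * 32)
scores[:64] = 1.0
boxes = [(0, 0, 200, 60), (600, 800, 900, 900)]
print(select_regions(scores, boxes))
```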
[167] AnyCXR: Human Anatomy Segmentation of Chest X-ray at Any Acquisition Position using Multi-stage Domain Randomized Synthetic Data with Imperfect Annotations and Conditional Joint Annotation Regularization Learning
Zifei Dong, Wenjie Wu, Jinkui Hao, Tianqi Chen, Ziqiao Weng, Bo Zhou
Main category: cs.CV
TL;DR: AnyCXR is a unified framework for generalizable multi-organ chest X-ray segmentation using only synthetic supervision, achieving zero-shot generalization on real-world data across different projection angles.
Details
Motivation: Robust anatomical segmentation of chest X-rays is challenging due to scarce comprehensive annotations and substantial variability in real-world acquisition conditions.Method: Combines Multi-stage Domain Randomization (MSDR) engine generating diverse synthetic radiographs from 3D CT volumes with Conditional Joint Annotation Regularization (CAR) learning strategy that leverages partial/imperfect labels by enforcing anatomical consistency in latent space.
Result: Achieves strong zero-shot generalization on multiple real-world datasets, accurately delineating 54 anatomical structures in PA, lateral, and oblique views. Supports downstream clinical tasks including cardiothoracic ratio estimation, spine curvature assessment, and disease classification.
Conclusion: AnyCXR establishes a scalable and reliable foundation for anatomy-aware CXR analysis and offers a practical pathway toward reducing annotation burdens while improving robustness across diverse imaging conditions.
Abstract: Robust anatomical segmentation of chest X-rays (CXRs) remains challenging due to the scarcity of comprehensive annotations and the substantial variability of real-world acquisition conditions. We propose AnyCXR, a unified framework that enables generalizable multi-organ segmentation across arbitrary CXR projection angles using only synthetic supervision. The method combines a Multi-stage Domain Randomization (MSDR) engine, which generates over 100,000 anatomically faithful and highly diverse synthetic radiographs from 3D CT volumes, with a Conditional Joint Annotation Regularization (CAR) learning strategy that leverages partial and imperfect labels by enforcing anatomical consistency in a latent space. Trained entirely on synthetic data, AnyCXR achieves strong zero-shot generalization on multiple real-world datasets, providing accurate delineation of 54 anatomical structures in PA, lateral, and oblique views. The resulting segmentation maps support downstream clinical tasks, including automated cardiothoracic ratio estimation, spine curvature assessment, and disease classification, where the incorporation of anatomical priors improves diagnostic performance. These results demonstrate that AnyCXR establishes a scalable and reliable foundation for anatomy-aware CXR analysis and offers a practical pathway toward reducing annotation burdens while improving robustness across diverse imaging conditions.
[168] EndoStreamDepth: Temporally Consistent Monocular Depth Estimation for Endoscopic Video Streams
Hao Li, Daiwei Lu, Jiacheng Wang, Robert J. Webster, Ipek Oguz
Main category: cs.CV
TL;DR: EndoStreamDepth is a real-time monocular depth estimation framework for endoscopic videos that produces accurate depth maps with sharp anatomical boundaries and temporal consistency across frames.
Details
Motivation: Existing methods for endoscopic depth estimation often lack temporal consistency across video frames, produce blurry anatomical boundaries, and may not operate in real-time, which is crucial for supporting downstream tasks like robotic surgery automation.Method: The framework processes individual frames with three components: (1) a single-frame depth network with endoscopy-specific transformations, (2) multi-level Mamba temporal modules for inter-frame information propagation, and (3) a hierarchical design with multi-scale supervision using complementary loss terms for boundary sharpness and geometric consistency.
Result: Comprehensive evaluations on two colonoscopy depth estimation datasets show EndoStreamDepth substantially outperforms state-of-the-art monocular depth estimation methods, producing depth maps with sharp, anatomically aligned boundaries while maintaining real-time throughput.
Conclusion: EndoStreamDepth provides an effective solution for real-time depth estimation in endoscopic videos with improved accuracy, temporal consistency, and boundary sharpness, making it suitable for supporting downstream medical applications like robotic surgery automation.
Abstract: This work presents EndoStreamDepth, a monocular depth estimation framework for endoscopic video streams. It provides accurate depth maps with sharp anatomical boundaries for each frame, temporally consistent predictions across frames, and real-time throughput. Unlike prior work that uses batched inputs, EndoStreamDepth processes individual frames with a temporal module to propagate inter-frame information. The framework contains three main components: (1) a single-frame depth network with endoscopy-specific transformation to produce accurate depth maps, (2) multi-level Mamba temporal modules that leverage inter-frame information to improve accuracy and stabilize predictions, and (3) a hierarchical design with comprehensive multi-scale supervision, where complementary loss terms jointly improve local boundary sharpness and global geometric consistency. We conduct comprehensive evaluations on two publicly available colonoscopy depth estimation datasets. Compared to state-of-the-art monocular depth estimation methods, EndoStreamDepth substantially improves performance, and it produces depth maps with sharp, anatomically aligned boundaries, which are essential to support downstream tasks such as automation for robotic surgery. The code is publicly available at https://github.com/MedICL-VU/EndoStreamDepth
[169] CrownGen: Patient-customized Crown Generation via Point Diffusion Model
Juyoung Bae, Moo Hyun Son, Jiale Peng, Wanting Qu, Wener Chen, Zelin Qiu, Kaixin Li, Xiaojuan Chen, Yifan Lin, Hao Chen
Main category: cs.CV
TL;DR: CrownGen is a generative AI framework that automates patient-customized dental crown design using diffusion models on tooth-level point clouds, reducing design time while maintaining clinical quality comparable to expert technicians.
Details
Motivation: Digital crown design is currently labor-intensive and creates a bottleneck in restorative dentistry, limiting scalability and increasing costs for dental care.Method: Uses a denoising diffusion model on tooth-level point cloud representation with two core components: boundary prediction module for spatial priors and diffusion-based generative module to synthesize high-fidelity morphology for multiple teeth in one inference pass.
Result: Surpasses state-of-the-art models in geometric fidelity, significantly reduces active design time, and clinical assessments show CrownGen-assisted crowns are statistically non-inferior to expert technician-manual workflow crowns.
Conclusion: CrownGen offers a scalable solution to automate complex prosthetic modeling, lowering costs, shortening turnaround times, and enhancing patient access to high-quality dental care.
Abstract: Digital crown design remains a labor-intensive bottleneck in restorative dentistry. We present CrownGen, a generative framework that automates patient-customized crown design using a denoising diffusion model on a novel tooth-level point cloud representation. The system employs two core components: a boundary prediction module to establish spatial priors and a diffusion-based generative module to synthesize high-fidelity morphology for multiple teeth in a single inference pass. We validated CrownGen through a quantitative benchmark on 496 external scans and a clinical study of 26 restoration cases. Results demonstrate that CrownGen surpasses state-of-the-art models in geometric fidelity and significantly reduces active design time. Clinical assessments by trained dentists confirmed that CrownGen-assisted crowns are statistically non-inferior in quality to those produced by expert technicians using manual workflows. By automating complex prosthetic modeling, CrownGen offers a scalable solution to lower costs, shorten turnaround times, and enhance patient access to high-quality dental care.
[170] Evaluating the Performance of Open-Vocabulary Object Detection in Low-quality Image
Po-Chih Wu
Main category: cs.CV
TL;DR: Researchers evaluate open-vocabulary object detection models on low-quality images using a new dataset, finding models maintain performance with mild degradation but suffer significantly with severe degradation, with OWLv2 showing best robustness.
Details
Motivation: Open-vocabulary object detection aims for human-like recognition capabilities, but real-world applications often involve low-quality images. The authors want to evaluate how existing models perform under such challenging conditions to understand their robustness and limitations.Method: The study introduces a new dataset that simulates real-world low-quality images. They evaluate multiple open-vocabulary object detection models (OWLv2, OWL-ViT, GroundingDINO, Detic) on this dataset under varying levels of image degradation, measuring performance using mAP scores.
Result: Models showed no significant mAP decrease under low-level image degradation, but all models dropped sharply under high-level degradation. OWLv2 consistently performed better across different degradation types, while OWL-ViT, GroundingDINO, and Detic showed significant performance declines.
Conclusion: Open-vocabulary object detection models maintain reasonable performance with mild image degradation but struggle with severe degradation, highlighting the need for more robust models. OWLv2 demonstrates better robustness than other evaluated models. The authors will release their dataset and code to support future research in this area.
Abstract: Open-vocabulary object detection enables models to localize and recognize objects beyond a predefined set of categories and is expected to achieve recognition capabilities comparable to human performance. In this study, we aim to evaluate the performance of existing models on open-vocabulary object detection tasks under low-quality image conditions. For this purpose, we introduce a new dataset that simulates low-quality images in the real world. In our evaluation experiment, we find that although open-vocabulary object detection models exhibited no significant decrease in mAP scores under low-level image degradation, the performance of all models dropped sharply under high-level image degradation. OWLv2 models consistently performed better across different types of degradation, while OWL-ViT, GroundingDINO, and Detic showed significant performance declines. We will release our dataset and codes to facilitate future studies.
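The dataset's actual corruption types and severity levels are not listed above, so the sketch below only illustrates the general recipe such a benchmark might follow: blur and noise applied to a clean image at increasing severity levels.

```python
# Generic degradation recipe (an assumption, not the paper's dataset pipeline):
# box blur plus Gaussian noise at increasing severity on an image in [0, 1].
import numpy as np

def degrade(img, level, seed=0):
    """img: 2D float array in [0, 1]; level: 0 (clean) to 4 (severe)."""
    out = img.copy()
    if level > 0:
        k = 2 * level + 1                           # blur kernel grows with severity
        pad = k // 2
        padded = np.pad(out, pad, mode="edge")
        blurred = np.zeros_like(out)
        for i in range(out.shape[0]):
            for j in range(out.shape[1]):
                blurred[i, j] = padded[i:i + k, j:j + k].mean()
        noise = np.random.default_rng(seed).normal(0, 0.05 * level, out.shape)
        out = blurred + noise                       # noise level also grows with severity
    return np.clip(out, 0.0, 1.0)

img = np.random.default_rng(1).random((32, 32))
for lvl in range(5):
    print("level", lvl, "mean abs change:", float(np.abs(degrade(img, lvl) - img).mean()))
```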
[171] Lamps: Learning Anatomy from Multiple Perspectives via Self-supervision in Chest Radiographs
Ziyu Zhou, Haozhe Luo, Mohammad Reza Hosseinzadeh Taher, Jiaxuan Pang, Xiaowei Ding, Michael B. Gotway, Jianming Liang
Main category: cs.CV
TL;DR: Lamps is a self-supervised learning framework for medical imaging that leverages anatomical consistency, coherence, and hierarchy to learn meaningful representations from chest radiographs, outperforming 10 baseline models across 10 datasets.
Details
Motivation: Existing self-supervised learning methods in medical imaging often overlook the fundamental anatomical structure of human body images, limiting their ability to learn clinically meaningful features. The key foundation in medical imaging is human anatomy, which exhibits consistency, coherence, and hierarchy that should be leveraged for better representation learning.Method: Lamps (Learning Anatomy from Multiple Perspectives via Self-Supervision) is pre-trained on large-scale chest radiographs using anatomical consistency, coherence, and hierarchy as supervision signals. The framework harmoniously utilizes these three anatomical perspectives to guide the self-supervised learning process.
Result: Extensive experiments across 10 datasets show Lamps demonstrates superior robustness, transferability, and clinical potential compared to 10 baseline models. The framework shows strong performance in both fine-tuning and emergent property analysis.
Conclusion: By learning from multiple anatomical perspectives, Lamps enables foundation models to develop meaningful, robust representations aligned with human anatomy structure, presenting a unique opportunity for medical imaging foundation models.
Abstract: Foundation models have been successful in natural language processing and computer vision because they are capable of capturing the underlying structures (foundation) of natural languages. However, in medical imaging, the key foundation lies in human anatomy, as these images directly represent the internal structures of the body, reflecting the consistency, coherence, and hierarchy of human anatomy. Yet, existing self-supervised learning (SSL) methods often overlook these perspectives, limiting their ability to effectively learn anatomical features. To overcome the limitation, we built Lamps (learning anatomy from multiple perspectives via self-supervision) pre-trained on large-scale chest radiographs by harmoniously utilizing the consistency, coherence, and hierarchy of human anatomy as the supervision signal. Extensive experiments across 10 datasets evaluated through fine-tuning and emergent property analysis demonstrate Lamps’ superior robustness, transferability, and clinical potential when compared to 10 baseline models. By learning from multiple perspectives, Lamps presents a unique opportunity for foundation models to develop meaningful, robust representations that are aligned with the structure of human anatomy.
[172] JavisGPT: A Unified Multi-modal LLM for Sounding-Video Comprehension and Generation
Kai Liu, Jungang Li, Yuchong Sun, Shengqiong Wu, Jianzhang Gao, Daoan Zhang, Wei Zhang, Sheng Jin, Sicheng Yu, Geng Zhan, Jiayi Ji, Fan Zhou, Liang Zheng, Shuicheng Yan, Hao Fei, Tat-Seng Chua
Main category: cs.CV
TL;DR: JavisGPT is the first unified multimodal LLM for joint audio-video comprehension and generation, featuring a SyncFusion module and three-stage training pipeline, achieving state-of-the-art performance on JAV tasks.
Details
Motivation: There's a need for unified models that can handle both comprehension and generation of synchronized audio-video content, as existing MLLMs lack capabilities for temporally coherent joint audio-video understanding and generation.Method: Uses encoder-LLM-decoder architecture with SyncFusion module for spatio-temporal audio-video fusion and synchrony-aware queries. Employs three-stage training: multimodal pretraining, audio-video fine-tuning, and large-scale instruction-tuning with JavisInst-Omni dataset (200K+ GPT-4o-curated dialogues).
Result: JavisGPT outperforms existing MLLMs on JAV comprehension and generation benchmarks, particularly excelling in complex and temporally synchronized settings.
Conclusion: JavisGPT successfully demonstrates effective joint audio-video comprehension and generation through its unified architecture and comprehensive training pipeline, setting new state-of-the-art for multimodal audio-video tasks.
Abstract: This paper presents JavisGPT, the first unified multimodal large language model (MLLM) for joint audio-video (JAV) comprehension and generation. JavisGPT has a concise encoder-LLM-decoder architecture, which has a SyncFusion module for spatio-temporal audio-video fusion and synchrony-aware learnable queries to bridge a pretrained JAV-DiT generator. This design enables temporally coherent video-audio understanding and generation from multimodal instructions. We design an effective three-stage training pipeline consisting of multimodal pretraining, audio-video fine-tuning, and large-scale instruction-tuning, to progressively build multimodal comprehension and generation from existing vision-language models. For instruction tuning, we construct JavisInst-Omni, a high-quality instruction dataset with over 200K GPT-4o-curated audio-video-text dialogues that cover diverse and multi-level comprehension and generation scenarios. On JAV comprehension and generation benchmarks, our experiments show that JavisGPT outperforms existing MLLMs, particularly in complex and temporally synchronized settings.
[173] YOLO-IOD: Towards Real Time Incremental Object Detection
Shizhou Zhang, Xueqiang Lv, Yinghui Xing, Qirui Wu, Di Xu, Chen Zhao, Yanning Zhang
Main category: cs.CV
TL;DR: YOLO-IOD: A real-time incremental object detection framework built on YOLO-World that addresses catastrophic forgetting in YOLO-based detectors through conflict-aware pseudo-label refinement, importance-based kernel selection, and cross-stage asymmetric knowledge distillation.
Details
Motivation: Current incremental object detection methods rely on Faster R-CNN or DETR frameworks but don't support real-time YOLO detectors. The paper identifies three knowledge conflicts causing catastrophic forgetting in YOLO-based incremental detectors and aims to create a real-time IOD solution.Method: YOLO-IOD framework built on pretrained YOLO-World with stage-wise parameter-efficient fine-tuning. Three main components: 1) Conflict-Aware Pseudo-Label Refinement (CPR) to address foreground-background confusion, 2) Importance-based Kernel Selection (IKS) to identify and update key convolution kernels, 3) Cross-Stage Asymmetric Knowledge Distillation (CAKD) to handle misaligned knowledge distillation between old and new categories.
Result: Experiments on conventional and new LoCo COCO benchmarks show YOLO-IOD achieves superior performance with minimal forgetting. The paper also introduces LoCo COCO, a more realistic benchmark that eliminates data leakage across stages.
Conclusion: YOLO-IOD successfully addresses catastrophic forgetting in YOLO-based incremental object detection, providing a real-time solution that outperforms existing methods while maintaining minimal forgetting across learning stages.
Abstract: Current methods for incremental object detection (IOD) primarily rely on Faster R-CNN or DETR series detectors; however, these approaches do not accommodate the real-time YOLO detection frameworks. In this paper, we first identify three primary types of knowledge conflicts that contribute to catastrophic forgetting in YOLO-based incremental detectors: foreground-background confusion, parameter interference, and misaligned knowledge distillation. Subsequently, we introduce YOLO-IOD, a real-time Incremental Object Detection (IOD) framework that is constructed upon the pretrained YOLO-World model, facilitating incremental learning via a stage-wise parameter-efficient fine-tuning process. Specifically, YOLO-IOD encompasses three principal components: 1) Conflict-Aware Pseudo-Label Refinement (CPR), which mitigates the foreground-background confusion by leveraging the confidence levels of pseudo labels and identifying potential objects relevant to future tasks. 2) Importance-based Kernel Selection (IKS), which identifies and updates the pivotal convolution kernels pertinent to the current task during the current learning stage. 3) Cross-Stage Asymmetric Knowledge Distillation (CAKD), which addresses the misaligned knowledge distillation conflict by transmitting the features of the student target detector through the detection heads of both the previous and current teacher detectors, thereby facilitating asymmetric distillation between existing and newly introduced categories. We further introduce LoCo COCO, a more realistic benchmark that eliminates data leakage across stages. Experiments on both conventional and LoCo COCO benchmarks show that YOLO-IOD achieves superior performance with minimal forgetting.
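A minimal sketch of the importance-based kernel selection idea follows; the paper's actual criterion is not spelled out above, so a first-order |weight x gradient| score is used purely for illustration, with the selected kernels kept trainable and the rest frozen by zeroing their gradients.

```python
# Illustrative importance-based kernel selection; the scoring rule is an assumption.
import torch
import torch.nn as nn

def select_kernels(conv: nn.Conv2d, loss: torch.Tensor, top_frac: float = 0.2):
    """Return a boolean mask over output kernels to keep trainable for this stage."""
    loss.backward()
    imp = (conv.weight * conv.weight.grad).abs().sum(dim=(1, 2, 3))  # per-kernel importance
    k = max(1, int(top_frac * imp.numel()))
    mask = torch.zeros_like(imp, dtype=torch.bool)
    mask[imp.topk(k).indices] = True
    return mask

# Toy usage with a dummy objective on a single conv layer.
conv = nn.Conv2d(3, 8, 3, padding=1)
x = torch.randn(2, 3, 16, 16)
loss = conv(x).pow(2).mean()
mask = select_kernels(conv, loss)
print("trainable kernels:", mask.nonzero().flatten().tolist())

# Freeze the remaining kernels by zeroing their gradients before each optimizer step.
conv.weight.grad[~mask] = 0.0
```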
[174] PoseStreamer: A Multi-modal Framework for 3D Tracking of Unseen Moving Objects
Huiming Yang, Linglin Liao, Fei Ding, Sibo Wang, Zijian Zeng
Main category: cs.CV
TL;DR: PoseStreamer: A multi-modal 6DoF pose estimation framework using event cameras for high-speed moving objects, featuring temporal consistency, 2D tracking priors, and geometric refinement.
Details
Motivation: Standard RGB cameras suffer from motion blur in high-speed and low-light scenarios, while current 6DoF pose estimation methods perform poorly with high-speed object movement. Event cameras offer high temporal resolution but need better integration for pose estimation.Method: Three core components: 1) Adaptive Pose Memory Queue for temporal consistency using historical orientation cues, 2) Object-centric 2D Tracker providing 2D priors to boost 3D center recall, and 3) Ray Pose Filter for geometric refinement along camera rays. Also introduces MoCapCube6D dataset for benchmarking.
Result: PoseStreamer achieves superior accuracy in high-speed moving scenarios and exhibits strong generalizability as a template-free framework for unseen moving objects.
Conclusion: The proposed framework effectively addresses 6DoF pose estimation challenges in high-speed scenarios using event cameras, demonstrating both accuracy and generalization capabilities for novel objects.
Abstract: Six degree of freedom (6DoF) pose estimation for novel objects is a critical task in computer vision, yet it faces significant challenges in high-speed and low-light scenarios where standard RGB cameras suffer from motion blur. While event cameras offer a promising solution due to their high temporal resolution, current 6DoF pose estimation methods typically yield suboptimal performance in high-speed object moving scenarios. To address this gap, we propose PoseStreamer, a robust multi-modal 6DoF pose estimation framework designed specifically on high-speed moving scenarios. Our approach integrates three core components: an Adaptive Pose Memory Queue that utilizes historical orientation cues for temporal consistency, an Object-centric 2D Tracker that provides strong 2D priors to boost 3D center recall, and a Ray Pose Filter for geometric refinement along camera rays. Furthermore, we introduce MoCapCube6D, a novel multi-modal dataset constructed to benchmark performance under rapid motion. Extensive experiments demonstrate that PoseStreamer not only achieves superior accuracy in high-speed moving scenarios, but also exhibits strong generalizability as a template-free framework for unseen moving objects.
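A toy version of the pose-memory idea is sketched below: a fixed-length queue of recent orientations is averaged with geometrically decaying weights to stabilize the current estimate. Linear averaging of axis-angle vectors is only a rough approximation for small rotations, and the class name and decay schedule are assumptions, not PoseStreamer's implementation.

```python
# Toy pose memory queue; names, weighting, and the axis-angle averaging are assumptions.
from collections import deque
import numpy as np

class PoseMemoryQueue:
    def __init__(self, maxlen=8, decay=0.8):
        self.history = deque(maxlen=maxlen)
        self.decay = decay                      # older poses receive geometrically smaller weight

    def update(self, rotvec):
        """rotvec: axis-angle orientation (3,); returns a temporally smoothed estimate."""
        self.history.append(np.asarray(rotvec, dtype=float))
        weights = np.array([self.decay ** k for k in range(len(self.history))])[::-1]
        weights /= weights.sum()                # newest pose gets the largest weight
        return (weights[:, None] * np.stack(list(self.history))).sum(axis=0)

queue = PoseMemoryQueue()
for t in range(5):
    noisy = np.array([0.1 * t, 0.0, 0.0]) + np.random.default_rng(t).normal(0, 0.02, 3)
    print(t, queue.update(noisy))
```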
[175] A Low-Cost UAV Deep Learning Pipeline for Integrated Apple Disease Diagnosis, Freshness Assessment, and Fruit Detection
Soham Dutta, Soham Banerjee, Sneha Mahata, Anindya Sen, Sayantani Datta
Main category: cs.CV
TL;DR: A unified RGB-only UAV pipeline for apple orchards that performs leaf disease detection, freshness assessment, and fruit detection using deep learning models on affordable hardware.
Details
Motivation: Existing UAV-based orchard monitoring systems are fragmented (addressing tasks in isolation) and often rely on expensive multispectral sensors, making them inaccessible for many farmers.Method: Integrated pipeline using ResNet50 for leaf disease detection, VGG16 for apple freshness classification, and YOLOv8 for real-time apple detection/localization, running on ESP32-CAM and Raspberry Pi for offline on-site inference.
Result: Achieved 98.9% accuracy for leaf disease classification, 97.4% accuracy for freshness classification, and 0.857 F1 score for apple detection, demonstrating high performance with low-cost RGB-only sensors.
Conclusion: The framework provides an accessible, scalable alternative to expensive multispectral UAV solutions, enabling practical precision agriculture on affordable hardware without cloud dependency.
Abstract: Apple orchards require timely disease detection, fruit quality assessment, and yield estimation, yet existing UAV-based systems address such tasks in isolation and often rely on costly multispectral sensors. This paper presents a unified, low-cost, RGB-only UAV-based orchard intelligence pipeline integrating ResNet50 for leaf disease detection, VGG16 for apple freshness determination, and YOLOv8 for real-time apple detection and localization. The system runs on an ESP32-CAM and Raspberry Pi, providing fully offline on-site inference without cloud support. Experiments demonstrate 98.9% accuracy for leaf disease classification, 97.4% accuracy for freshness classification, and 0.857 F1 score for apple detection. The framework provides an accessible and scalable alternative to multispectral UAV solutions, supporting practical precision agriculture on affordable hardware.
[176] Task-oriented Learnable Diffusion Timesteps for Universal Few-shot Learning of Dense Tasks
Changgyoon Oh, Jongoh Jeong, Jegyeong Cho, Kuk-Jin Yoon
Main category: cs.CV
TL;DR: The paper proposes a method to adaptively select and consolidate diffusion timestep features for few-shot dense prediction tasks, addressing the suboptimal performance from heuristic timestep selection.
Details
Motivation: Current diffusion models use heuristic selection of diffusion timestep features for single-task prediction, which relies on empirical intuition and leads to sub-optimal performance biased toward certain tasks.Method: Proposes two modules: Task-aware Timestep Selection (TTS) to select ideal diffusion timesteps based on timestep-wise losses and similarity scores, and Timestep Feature Consolidation (TFC) to consolidate selected timestep features. Uses parameter-efficient fine-tuning adapter for few-shot dense prediction.
Result: Empirically validated on the large-scale challenging Taskonomy dataset for dense prediction, achieving superiority in dense prediction performance given only a few support queries in practical universal and few-shot learning scenarios.
Conclusion: The proposed learnable timestep consolidation method effectively addresses the limitations of heuristic timestep selection in diffusion models for few-shot dense prediction tasks.
Abstract: Denoising diffusion probabilistic models have brought tremendous advances in generative tasks, achieving state-of-the-art performance thus far. Current diffusion model-based applications exploit the power of learned visual representations from multistep forward-backward Markovian processes for single-task prediction tasks by attaching a task-specific decoder. However, the heuristic selection of diffusion timestep features still heavily relies on empirical intuition, often leading to sub-optimal performance biased towards certain tasks. To alleviate this constraint, we investigate the significance of versatile diffusion timestep features by adaptively selecting timesteps best suited for the few-shot dense prediction task, evaluated on an arbitrary unseen task. To this end, we propose two modules: Task-aware Timestep Selection (TTS) to select ideal diffusion timesteps based on timestep-wise losses and similarity scores, and Timestep Feature Consolidation (TFC) to consolidate the selected timestep features to improve the dense predictive performance in a few-shot setting. Accompanied by our parameter-efficient fine-tuning adapter, our framework effectively achieves superiority in dense prediction performance given only a few support queries. We empirically validate our learnable timestep consolidation method on the large-scale challenging Taskonomy dataset for dense prediction, particularly for practical universal and few-shot learning scenarios.
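A compact sketch of the TTS/TFC idea, under the assumption that each timestep is scored by a support-set loss and a similarity score and that the selected features are merged with softmax weights; the paper's actual scoring and consolidation may differ.

```python
# Illustrative timestep selection (TTS) and feature consolidation (TFC).
import numpy as np

def select_and_consolidate(feat_by_t, loss_by_t, sim_by_t, k=3, tau=1.0):
    """feat_by_t: (T, D) per-timestep features; loss/sim: per-timestep scores on the support set."""
    score = sim_by_t - loss_by_t               # lower loss and higher similarity are better
    top = np.argsort(score)[-k:]               # indices of the k best timesteps (TTS)
    w = np.exp(score[top] / tau)
    w = w / w.sum()                            # softmax weights over the selected timesteps
    return (w[:, None] * feat_by_t[top]).sum(axis=0)   # consolidated feature (TFC)

rng = np.random.default_rng(0)
T, D = 20, 16
consolidated = select_and_consolidate(
    rng.normal(size=(T, D)),
    rng.uniform(0.1, 1.0, size=T),
    rng.uniform(0.0, 1.0, size=T),
)
print(consolidated.shape)   # (16,)
```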
[177] ASemConsist: Adaptive Semantic Feature Control for Training-Free Identity-Consistent Generation
Shin Seong Kim, Minjung Shin, Hyunin Cho, Youngjung Uh
Main category: cs.CV
TL;DR: ASemconsist is a novel framework for generating image sequences with consistent character identity across diverse scenes, using selective text embedding modification and semantic control strategies to overcome the trade-off between identity consistency and prompt alignment.
Details
Motivation: Current text-to-image diffusion models struggle to maintain consistent character identity across multiple images while preserving alignment with individual scene descriptions, creating a challenging trade-off between identity consistency and per-image prompt alignment.Method: 1) Selective text embedding modification for explicit semantic control over character identity; 2) Semantic control strategy repurposing padding embeddings as semantic containers; 3) Adaptive feature-sharing strategy that evaluates textual ambiguity and applies constraints only to ambiguous identity prompts.
Result: The framework achieves state-of-the-art performance, effectively overcoming prior trade-offs between identity consistency and prompt alignment in character-preserving image sequence generation.
Conclusion: ASemconsist provides a comprehensive solution for consistent character generation across diverse scenes, with a unified evaluation protocol (Consistency Quality Score) that captures performance imbalances between identity preservation and text alignment.
Abstract: Recent text-to-image diffusion models have significantly improved visual quality and text alignment. However, generating a sequence of images while preserving consistent character identity across diverse scene descriptions remains a challenging task. Existing methods often struggle with a trade-off between maintaining identity consistency and ensuring per-image prompt alignment. In this paper, we introduce a novel framework, ASemconsist, that addresses this challenge through selective text embedding modification, enabling explicit semantic control over character identity without sacrificing prompt alignment. Furthermore, based on our analysis of padding embeddings in FLUX, we propose a semantic control strategy that repurposes padding embeddings as semantic containers. Additionally, we introduce an adaptive feature-sharing strategy that automatically evaluates textual ambiguity and applies constraints only to the ambiguous identity prompt. Finally, we propose a unified evaluation protocol, the Consistency Quality Score (CQS), which integrates identity preservation and per-image text alignment into a single comprehensive metric, explicitly capturing performance imbalances between the two metrics. Our framework achieves state-of-the-art performance, effectively overcoming prior trade-offs. Project page: https://minjung-s.github.io/asemconsist
[178] Physically-Grounded Manifold Projection Model for Generalizable Metal Artifact Reduction in Dental CBCT
Zhi Li, Yaqi Wang, Bingtao Ma, Yifan Zhang, Huiyu Zhou, Shuai Wang
Main category: cs.CV
TL;DR: PGMP framework for dental CBCT metal artifact reduction uses physics-based simulation for training data, deterministic manifold projection for fast inference, and medical foundation model priors for clinical plausibility.
Details
Motivation: Current deep learning approaches for metal artifact reduction in dental CBCT have limitations: supervised methods suffer from spectral blurring due to "regression-to-the-mean", unsupervised methods risk structural hallucinations, and diffusion models are too slow for clinical use due to iterative sampling.Method: Three-component framework: 1) Anatomically-Adaptive Physics Simulation (AAPS) synthesizes high-fidelity training pairs via Monte Carlo spectral modeling and patient-specific digital twins; 2) DMP-Former adapts direct x-prediction paradigm for deterministic manifold projection in single forward pass; 3) Semantic-Structural Alignment (SSA) module uses medical foundation model (MedDINOv3) priors to ensure clinical plausibility.
Result: PGMP outperforms state-of-the-art methods on both synthetic and multi-center clinical datasets, particularly on unseen anatomy, setting new benchmarks in efficiency and diagnostic reliability.
Conclusion: The proposed PGMP framework effectively addresses limitations of existing MAR methods by combining physics-based simulation, deterministic manifold projection, and medical foundation model priors, achieving both high efficiency and clinical reliability for dental CBCT applications.
Abstract: Metal artifacts in Dental CBCT severely obscure anatomical structures, hindering diagnosis. Current deep learning for Metal Artifact Reduction (MAR) faces limitations: supervised methods suffer from spectral blurring due to “regression-to-the-mean”, while unsupervised ones risk structural hallucinations. Denoising Diffusion Models (DDPMs) offer realism but rely on slow, stochastic iterative sampling, unsuitable for clinical use. To resolve this, we propose the Physically-Grounded Manifold Projection (PGMP) framework. First, our Anatomically-Adaptive Physics Simulation (AAPS) pipeline synthesizes high-fidelity training pairs via Monte Carlo spectral modeling and patient-specific digital twins, bridging the synthetic-to-real gap. Second, our DMP-Former adapts the Direct x-Prediction paradigm, reformulating restoration as a deterministic manifold projection to recover clean anatomy in a single forward pass, eliminating stochastic sampling. Finally, a Semantic-Structural Alignment (SSA) module anchors the solution using priors from medical foundation models (MedDINOv3), ensuring clinical plausibility. Experiments on synthetic and multi-center clinical datasets show PGMP outperforms state-of-the-art methods on unseen anatomy, setting new benchmarks in efficiency and diagnostic reliability. Code and data: https://github.com/ricoleehduu/PGMP.
[179] FoundationSLAM: Unleashing the Power of Depth Foundation Models for End-to-End Dense Visual SLAM
Yuchen Wu, Jiahe Li, Fabio Tosi, Matteo Poggi, Jin Zheng, Xiao Bai
Main category: cs.CV
TL;DR: FoundationSLAM is a learning-based monocular dense SLAM system that integrates foundation depth models with flow estimation to achieve geometric consistency for accurate tracking and mapping in real-time.
Details
Motivation: Previous flow-based SLAM approaches lack geometric consistency, leading to inaccurate tracking and mapping. The authors aim to bridge flow estimation with geometric reasoning to create a more robust and accurate monocular dense SLAM system.Method: 1) Hybrid Flow Network that produces geometry-aware correspondences using guidance from foundation depth models; 2) Bi-Consistent Bundle Adjustment Layer that jointly optimizes keyframe pose and depth under multi-view constraints; 3) Reliability-Aware Refinement mechanism that dynamically adapts flow updates by distinguishing between reliable and uncertain regions.
Result: FoundationSLAM achieves superior trajectory accuracy and dense reconstruction quality across multiple challenging datasets, runs in real-time at 18 FPS, and demonstrates strong generalization to various scenarios.
Conclusion: The integration of foundation depth models with flow estimation successfully addresses geometric consistency issues in monocular dense SLAM, resulting in a practical, real-time system with strong generalization capabilities.
Abstract: We present FoundationSLAM, a learning-based monocular dense SLAM system that addresses the absence of geometric consistency in previous flow-based approaches for accurate and robust tracking and mapping. Our core idea is to bridge flow estimation with geometric reasoning by leveraging the guidance from foundation depth models. To this end, we first develop a Hybrid Flow Network that produces geometry-aware correspondences, enabling consistent depth and pose inference across diverse keyframes. To enforce global consistency, we propose a Bi-Consistent Bundle Adjustment Layer that jointly optimizes keyframe pose and depth under multi-view constraints. Furthermore, we introduce a Reliability-Aware Refinement mechanism that dynamically adapts the flow update process by distinguishing between reliable and uncertain regions, forming a closed feedback loop between matching and optimization. Extensive experiments demonstrate that FoundationSLAM achieves superior trajectory accuracy and dense reconstruction quality across multiple challenging datasets, while running in real-time at 18 FPS, demonstrating strong generalization to various scenarios and practical applicability of our method.
cs.AI
[180] Reasoning in Action: MCTS-Driven Knowledge Retrieval for Large Language Models
Shuqi Liu, Bowei He, Chen Ma, Linqi Song
Main category: cs.AI
TL;DR: A reasoning-aware knowledge retrieval method for LLMs that uses coarse-to-fine retrieval with Monte Carlo Tree Search to find knowledge aligned with conversation logic rather than just semantic similarity.
Details
Motivation: Current LLMs either use retrieval for similar information or improve reasoning, but struggle to effectively integrate both strategies. There's a need for retrieval methods that go beyond surface-level semantic similarity to align with the logical structure of conversations.Method: Two-phase coarse-to-fine approach: 1) Identify contextually relevant sub-region of knowledge base where all sentences relate to the topic, 2) Refine search within this sub-region for knowledge specifically relevant to reasoning process. Uses Monte Carlo Tree Search-inspired method to navigate knowledge sentences using common keywords.
Result: Experiments on two multi-turn dialogue datasets show the approach aligns more closely with human conversation reasoning, significantly enhances diversity of retrieved knowledge, and produces more informative and creative responses.
Conclusion: The reasoning-aware knowledge retrieval method successfully integrates retrieval and reasoning strategies, moving beyond semantic similarity to capture logical conversation structure, leading to improved LLM performance in dialogue tasks.
Abstract: Large language models (LLMs) typically enhance their performance through either the retrieval of semantically similar information or the improvement of their reasoning capabilities. However, a significant challenge remains in effectively integrating both retrieval and reasoning strategies to optimize LLM performance. In this paper, we introduce a reasoning-aware knowledge retrieval method that enriches LLMs with information aligned to the logical structure of conversations, moving beyond surface-level semantic similarity. We follow a coarse-to-fine approach for knowledge retrieval. First, we identify a contextually relevant sub-region of the knowledge base, ensuring that all sentences within it are relevant to the context topic. Next, we refine our search within this sub-region to extract knowledge that is specifically relevant to the reasoning process. Throughout both phases, we employ the Monte Carlo Tree Search-inspired search method to effectively navigate through knowledge sentences using common keywords. Experiments on two multi-turn dialogue datasets demonstrate that our knowledge retrieval approach not only aligns more closely with the underlying reasoning in human conversations but also significantly enhances the diversity of the retrieved knowledge, resulting in more informative and creative responses.
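A toy sketch of the coarse-to-fine retrieval loop is given below: a topic filter plays the role of the coarse phase, and a UCB1-style bandit over the surviving sentences stands in for the MCTS-inspired search. The scoring function, reward, and keyword matching are illustrative assumptions, not the paper's exact procedure.

```python
# Toy coarse-to-fine retrieval with a UCB1-style selection rule.
import math
from collections import defaultdict

def coarse_filter(knowledge, topic_words):
    """Phase 1: keep only sentences that mention the conversation topic."""
    return [s for s in knowledge if any(w in s.lower() for w in topic_words)]

def ucb_retrieve(candidates, relevance, n_iters=100, c=1.4):
    """Phase 2: UCB1-style exploration; returns the most visited candidate."""
    visits, value = defaultdict(int), defaultdict(float)
    for t in range(1, n_iters + 1):
        def ucb(i):
            if visits[i] == 0:
                return float("inf")
            return value[i] / visits[i] + c * math.sqrt(math.log(t) / visits[i])
        i = max(range(len(candidates)), key=ucb)
        reward = relevance(candidates[i])       # e.g. keyword overlap with the reasoning chain
        visits[i] += 1
        value[i] += reward
    return candidates[max(visits, key=visits.get)]

knowledge = [
    "Transformers use attention to weight context tokens.",
    "Attention heads in transformers can specialize in syntax.",
    "Basil grows best in warm climates.",
]
candidates = coarse_filter(knowledge, ["attention", "transformer"])
print(ucb_retrieve(candidates, relevance=lambda s: s.lower().count("attention")))
```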
[181] Finetuning Large Language Models for Automated Depression Screening in Nigerian Pidgin English: GENSCORE Pilot Study
Isaac Iyinoluwa Olufadewa, Miracle Ayomikun Adesina, Ezekiel Ayodeji Oladejo, Uthman Babatunde Usman, Owen Kolade Adeniyi, Matthew Tolulope Olawoyin
Main category: cs.AI
TL;DR: Fine-tuned LLMs for automated depression screening in Nigerian Pidgin achieve 94.5% accuracy with GPT-4.1, enabling accessible mental health assessment in linguistically diverse, resource-constrained communities.
Details
Motivation: Depression screening in Nigeria faces barriers including limited clinician access, stigma, and language barriers. Traditional tools like PHQ-9 were validated in high-income countries and are linguistically/culturally inaccessible for Nigerian communities communicating in Pidgin and over 520 local languages.Method: Collected 432 Pidgin-language audio responses from Nigerian young adults (18-40) to prompts aligned with PHQ-9 items. Performed transcription, preprocessing, annotation (semantic labeling, slang interpretation, PHQ-9 scoring). Fine-tuned three LLMs (Phi-3-mini-4k-instruct, Gemma-3-4B-it, GPT-4.1) on annotated dataset and evaluated both quantitatively (accuracy, precision, semantic alignment) and qualitatively (clarity, relevance, cultural appropriateness).
Result: GPT-4.1 achieved highest performance with 94.5% accuracy in PHQ-9 severity scoring prediction, outperforming other models. Qualitatively, GPT-4.1 produced the most culturally appropriate, clear, and contextually relevant responses for Nigerian Pidgin depression screening.
Conclusion: AI-mediated depression screening using fine-tuned LLMs adapted for conversational Nigerian Pidgin provides a foundation for deploying accessible mental health tools in linguistically diverse, resource-constrained environments like Nigeria.
Abstract: Depression is a major contributor to the mental-health burden in Nigeria, yet screening coverage remains limited due to low access to clinicians, stigma, and language barriers. Traditional tools like the Patient Health Questionnaire-9 (PHQ-9) were validated in high-income countries but may be linguistically or culturally inaccessible for low- and middle-income countries and communities such as Nigeria where people communicate in Nigerian Pidgin and more than 520 local languages. This study presents a novel approach to automated depression screening using fine-tuned large language models (LLMs) adapted for conversational Nigerian Pidgin. We collected a dataset of 432 Pidgin-language audio responses from Nigerian young adults aged 18-40 to prompts assessing psychological experiences aligned with PHQ-9 items, performed transcription, rigorous preprocessing and annotation, including semantic labeling, slang and idiom interpretation, and PHQ-9 severity scoring. Three LLMs - Phi-3-mini-4k-instruct, Gemma-3-4B-it, and GPT-4.1 - were fine-tuned on this annotated dataset, and their performance was evaluated quantitatively (accuracy, precision and semantic alignment) and qualitatively (clarity, relevance, and cultural appropriateness). GPT-4.1 achieved the highest quantitative performance, with 94.5% accuracy in PHQ-9 severity scoring prediction, outperforming Gemma-3-4B-it and Phi-3-mini-4k-instruct. Qualitatively, GPT-4.1 also produced the most culturally appropriate, clear, and contextually relevant responses, demonstrating the feasibility of AI-mediated depression screening for underserved Nigerian communities. This work provides a foundation for deploying conversational mental-health tools in linguistically diverse, resource-constrained environments.
[182] Toward a Physical Theory of Intelligence
Peter David Fagan
Main category: cs.AI
TL;DR: Intelligence is defined as goal-directed work per nat of irreversibly processed information, with physical constraints from conservation laws shaping information processing and computational architectures.
Details
Motivation: To establish a unified physical theory of intelligence that connects information processing to thermodynamics and conservation laws, providing a substrate-neutral understanding of intelligent systems from biological to artificial.Method: Introduces Conservation-Congruent Encoding (CCE) framework where encodings correspond to metastable basins enforced by conservation laws. Defines intelligence as goal-directed work per nat of irreversibly processed information, derives physical constraints, and applies to biological systems and computational architectures.
Result: Derives hierarchy of physical constraints for information processing, shows how long-horizon efficiency requires internal structure preservation (self-modelling), establishes intrinsic epistemic limits, analyzes biological systems near efficient operating regimes, develops theory of continuous dynamical circuits, and proposes physically-grounded AI safety perspective.
Conclusion: Provides unified physical account of intelligence connecting thermodynamics, information processing, and conservation laws, with implications for understanding biological cognition, computational architectures, and AI safety.
Abstract: We present a physical theory of intelligence grounded in irreversible information processing in systems constrained by conservation laws. An intelligent system is modelled as a coupled agent-environment process whose evolution transforms information into goal-directed work. To connect information to physical state, we introduce the Conservation-Congruent Encoding (CCE) framework, in which encodings correspond to metastable basins of attraction whose separability is enforced by conservation laws. Within this framework, intelligence is defined as the amount of goal-directed work produced per nat of irreversibly processed information. From this definition we derive a hierarchy of physical constraints governing information intake, irreversible computation, and work extraction in open systems. The framework reveals how long-horizon efficiency requires the preservation of internal informational structure, giving rise to self-modelling, and it establishes that physically embodied intelligent systems possess intrinsic epistemic limits analogous to incompleteness phenomena. Applying the theory to biological systems, we analyse how oscillatory and near-critical dynamics optimise the trade-off between information preservation, dissipation, and useful work, placing the brain near an efficient operating regime predicted by the framework. At the architectural level, we develop a theory of continuous dynamical circuits in which classical Boolean logic emerges as a special case of attractor selection, while more general invariant geometries support computational modes beyond fixed-point logic. Finally, we propose a physically grounded perspective on artificial intelligence safety based on irreversible information flow and structural homeostasis. Together, these results provide a unified, substrate-neutral account of intelligence as a physical phenomenon.
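In symbols (illustrative notation only, since the summary does not give the paper's exact definitions), the central quantity and its thermodynamic floor can be written as:

```latex
% Illustrative notation; the paper's own symbols are not given above.
% Intelligence as goal-directed work per nat of irreversibly processed
% information, with Landauer's principle bounding the cost of that processing.
\[
  \eta \;=\; \frac{W_{\text{goal}}}{I_{\text{irr}}} \qquad [\text{J per nat}],
\]
\[
  W_{\text{diss}} \;\ge\; k_B T \, I_{\text{irr}}
  \quad\text{(Landauer: $k_B T \ln 2$ per bit, i.e. $k_B T$ per nat)},
\]
where $W_{\text{goal}}$ is the work directed toward the agent's goal and
$I_{\text{irr}}$ is the information irreversibly processed, measured in nats.
```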
[183] A multi-algorithm approach for operational human resources workload balancing in a last mile urban delivery system
Luis M. Moreno-Saavedra, Silvia Jimenez-Fernandez, Antonio Portilla-Figueras, David Casillas-Perez, Sancho Salcedo-Sanz
Main category: cs.AI
TL;DR: A multi-algorithm approach for balanced workload assignment in last-mile package delivery systems that optimizes both delivery time and equitable distribution among workers.
Details
Motivation: Traditional geographical proximity-based assignment methods lead to inefficient and unbalanced workload distribution among delivery workers, causing significant decompensations in workload across staff in urban delivery zones.Method: Proposes a multi-algorithm methodology including different versions of k-means, evolutionary approaches, recursive assignments based on k-means initialization with different problem encodings, and a hybrid evolutionary ensemble algorithm that considers both distance and workload factors.
Result: The approach was tested on a real-world problem in an urban last-mile package delivery workforce operating in Azuqueca de Henares, Spain, demonstrating effective workload balancing.
Conclusion: The proposed multi-algorithm approach successfully addresses workload balancing in last-mile delivery systems by optimizing both delivery time and equitable workload distribution among workers, correcting significant decompensations in workload allocation.
Abstract: Efficient workload assignment to the workforce is critical in last-mile package delivery systems. In this context, traditional methods of assigning package deliveries to workers based on geographical proximity can be inefficient and typically lead to an unbalanced workload distribution among delivery workers. In this paper, we address the problem of operational human resources workload balancing in last-mile urban package delivery systems. The idea is to account for the effort each delivery requires when optimizing the system, i.e., the optimization focuses on improving delivery time while fully balancing the workload across the staff. This process should correct significant decompensations in workload among delivery workers in a given zone. Specifically, we propose a multi-algorithm approach to tackle this problem. The proposed approach takes as input a set of delivery points and a defined number of workers, and then assigns packages to workers, ensuring that each worker completes a similar amount of work per day. The proposed algorithms use a combination of distance and workload considerations to optimize the allocation of packages to workers; the distance between the delivery points and the location of each worker is also taken into account. The proposed multi-algorithm methodology includes different versions of k-means, evolutionary approaches, recursive assignments based on k-means initialization with different problem encodings, and a hybrid evolutionary ensemble algorithm. We illustrate the performance of the proposed approach on a real-world problem involving an urban last-mile package delivery workforce operating in Azuqueca de Henares, Spain.
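As a point of reference for what "combining distance and workload" can mean in practice, here is a minimal greedy baseline, not one of the paper's algorithms: each delivery is assigned to the worker with the lowest current load, breaking ties by distance, so daily workloads stay roughly equal.

```python
# Minimal greedy workload-balancing baseline (illustrative, not from the paper).
import math

def balanced_assignment(points, workers, effort):
    """points: (x, y) delivery points; workers: (x, y) worker bases; effort: per-point workload."""
    load = [0.0] * len(workers)
    assignment = [[] for _ in workers]
    order = sorted(range(len(points)), key=lambda i: -effort[i])   # heaviest deliveries first
    for i in order:
        px, py = points[i]
        def cost(w):
            dist = math.hypot(px - workers[w][0], py - workers[w][1])
            return (load[w], dist)             # balance workload first, then distance
        w = min(range(len(workers)), key=cost)
        assignment[w].append(i)
        load[w] += effort[i]
    return assignment, load

points = [(0, 0), (1, 5), (4, 4), (9, 1), (8, 7), (2, 2)]
workers = [(0, 0), (9, 9)]
effort = [1.0, 2.0, 1.5, 1.0, 2.5, 1.0]
print(balanced_assignment(points, workers, effort))
```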
[184] ClinicalReTrial: A Self-Evolving AI Agent for Clinical Trial Protocol Optimization
Sixue Xing, Xuanye Xia, Kerui Wu, Meng Jiang, Jintai Chen, Tianfan Fu
Main category: cs.AI
TL;DR: ClinicalReTrial: AI agent framework that proactively redesigns clinical trial protocols to improve success rates, moving beyond mere failure prediction to actionable protocol optimization.
Details
Motivation: Current AI methods only predict clinical trial failure reactively without offering actionable remedies, creating a gap between diagnosis and intervention. Minor protocol design flaws can irreversibly compromise outcomes despite promising therapeutics.Method: Self-evolving AI agent framework that treats clinical trial reasoning as iterative protocol redesign. Integrates failure diagnosis, safety-aware modification, and candidate evaluation in closed-loop, reward-driven optimization. Uses outcome prediction model as simulation environment, maintains hierarchical memory for iteration-level feedback and transferable redesign patterns.
Result: Improves 83.3% of trial protocols with mean success probability gain of 5.7%. Retrospective case studies show strong alignment between discovered redesign strategies and real-world clinical trial modifications.
Conclusion: ClinicalReTrial successfully bridges the gap between failure prediction and actionable intervention, providing a proactive framework for optimizing clinical trial protocols through iterative redesign and continuous self-improvement.
Abstract: Clinical trial failure remains a central bottleneck in drug development, where minor protocol design flaws can irreversibly compromise outcomes despite promising therapeutics. Although cutting-edge AI methods achieve strong performance in predicting trial success, they are inherently reactive, merely diagnosing risk without offering actionable remedies once failure is anticipated. To fill this gap, this paper proposes ClinicalReTrial, a self-evolving AI agent framework that casts clinical trial reasoning as an iterative protocol redesign problem. Our method integrates failure diagnosis, safety-aware modification, and candidate evaluation in a closed-loop, reward-driven optimization framework. Using the outcome prediction model as a simulation environment, ClinicalReTrial enables low-cost evaluation of protocol modifications and provides dense reward signals for continuous self-improvement. To support efficient exploration, the framework maintains hierarchical memory that captures iteration-level feedback within trials and distills transferable redesign patterns across trials. Empirically, ClinicalReTrial improves 83.3% of trial protocols with a mean success probability gain of 5.7%, and retrospective case studies demonstrate strong alignment between the discovered redesign strategies and real-world clinical trial modifications.
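The closed loop can be sketched in a few lines; all function names below are placeholders, and the toy predictor, diagnosis, and edit proposals stand in for the much richer models the paper describes.

```python
# Placeholder sketch of the diagnose -> modify -> simulate -> keep-best loop.
def optimize_protocol(protocol, predictor, diagnose, propose_edits, n_rounds=5):
    """predictor(protocol) -> estimated success probability in [0, 1]."""
    best, best_score = protocol, predictor(protocol)
    memory = []                                     # iteration-level feedback
    for _ in range(n_rounds):
        issues = diagnose(best, memory)             # failure diagnosis
        for candidate in propose_edits(best, issues):
            score = predictor(candidate)            # low-cost simulated evaluation
            memory.append((candidate, score))       # dense reward signal
            if score > best_score:
                best, best_score = candidate, score
    return best, best_score

# Toy usage: the "predictor" rewards protocols with fewer eligibility criteria.
proto = {"eligibility_criteria": 12, "dose_mg": 50}
pred = lambda p: 1.0 / (1.0 + 0.05 * p["eligibility_criteria"])
diag = lambda p, m: ["too_many_criteria"] if p["eligibility_criteria"] > 6 else []
def edits(p, issues):
    if "too_many_criteria" in issues:
        yield {**p, "eligibility_criteria": p["eligibility_criteria"] - 1}
print(optimize_protocol(proto, pred, diag, edits))
```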
[185] Quantitative Rule-Based Strategy modeling in Classic Indian Rummy: A Metric Optimization Approach
Purushottam Saha, Avirup Chakraborty, Sourish Sarkar, Subhamoy Maitra, Diganta Mukherjee, Tridib Mukherjee
Main category: cs.AI
TL;DR: MinDist metric improves Classic Indian Rummy strategy by measuring edit distance to valid configurations, outperforming traditional heuristics.
Details
Motivation: Classic Indian Rummy is a complex game requiring probabilistic reasoning and combinatorial decision-making, but existing strategies lack formal, interpretable metrics for evaluating hand quality and strategic play.Method: Proposes MinDist metric (modification of MinScore) that quantifies edit distance between current hand and nearest valid configuration. Develops computationally efficient algorithm using dynamic pruning and pattern caching. Incorporates opponent hand-modeling within two-player zero-sum simulation framework.
Result: MinDist-based agents show significant improvement in win rates over traditional heuristics, validated through statistical hypothesis testing.
Conclusion: MinDist provides a formal, interpretable step toward algorithmic Rummy strategy design, demonstrating the value of structural proximity metrics in incomplete information games.
Abstract: The 13-card variant of Classic Indian Rummy is a sequential game of incomplete information that requires probabilistic reasoning and combinatorial decision-making. This paper proposes a rule-based framework for strategic play, driven by a new hand-evaluation metric termed MinDist. The metric modifies the MinScore metric by quantifying the edit distance between a hand and the nearest valid configuration, thereby capturing structural proximity to completion. We design a computationally efficient algorithm derived from the MinScore algorithm, leveraging dynamic pruning and pattern caching to exactly calculate this metric during play. Opponent hand-modeling is also incorporated within a two-player zero-sum simulation framework, and the resulting strategies are evaluated using statistical hypothesis testing. Empirical results show significant improvement in win rates for MinDist-based agents over traditional heuristics, providing a formal and interpretable step toward algorithmic Rummy strategy design.
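A heavily simplified illustration of the MinDist idea: counting how many cards must change before a toy hand contains the required number of equal-rank sets. Real 13-card Rummy also involves runs and the dynamic pruning and pattern caching described above; the scoring below is only a sketch.

```python
from collections import Counter

def min_dist_to_sets(hand, set_size=3, num_sets=2):
    """Toy MinDist: minimum number of cards to replace so the hand contains
    `num_sets` disjoint sets of `set_size` equal-rank cards.  Greedy on the
    most frequent ranks; a real Rummy evaluator would also consider runs."""
    counts = Counter(card[0] for card in hand)            # rank of each card
    best_ranks = counts.most_common(num_sets)              # reuse what we already hold
    have = sum(min(c, set_size) for _, c in best_ranks)
    need = num_sets * set_size
    return max(0, need - have)                             # cards still missing

if __name__ == "__main__":
    hand = [("K", "s"), ("K", "h"), ("7", "d"), ("7", "c"), ("2", "s"), ("9", "h")]
    print(min_dist_to_sets(hand))   # 2: one more King and one more 7 needed
```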
[186] Bio-inspired Agentic Self-healing Framework for Resilient Distributed Computing Continuum Systems
Alaa Saleh, Praveen Kumar Donta, Roberto Morabito, Sasu Tarkoma, Anders Lindgren, Qiyang Zhang, Schahram Dustdar, Susanna Pirttikangas, Lauri Lovén
Main category: cs.AI
TL;DR: ReCiSt is a bio-inspired self-healing framework for Distributed Computing Continuum Systems that uses LM-powered agents to autonomously detect, diagnose, and recover from faults through four computational layers modeled after biological healing phases.
Details
Motivation: Modern DCCS face frequent faults due to their complexity, heterogeneity, and dynamic conditions, disrupting service continuity. There's a need for scalable, adaptive, and self-regulated resilience strategies inspired by biological self-healing systems.Method: ReCiSt maps biological healing phases (Hemostasis, Inflammation, Proliferation, Remodeling) to computational layers (Containment, Diagnosis, Meta-Cognitive, Knowledge). LM-powered agents interpret logs, infer root causes, refine reasoning, and reconfigure resources autonomously.
Result: Evaluation on public fault datasets shows ReCiSt achieves self-healing within tens of seconds with a minimum of 10% agent CPU usage. Results also demonstrate the depth of analysis needed to overcome uncertainties and the number of micro-agents required for resilience.
Conclusion: ReCiSt successfully applies bio-inspired principles to create an autonomous, adaptive self-healing framework for DCCS, demonstrating practical resilience capabilities through LM-powered agents with minimal human intervention.
Abstract: Human biological systems sustain life through extraordinary resilience, continually detecting damage, orchestrating targeted responses, and restoring function through self-healing. Inspired by these capabilities, this paper introduces ReCiSt, a bio-inspired agentic self-healing framework designed to achieve resilience in Distributed Computing Continuum Systems (DCCS). Modern DCCS integrate heterogeneous computing resources, ranging from resource-constrained IoT devices to high-performance cloud infrastructures, and their inherent complexity, mobility, and dynamic operating conditions expose them to frequent faults that disrupt service continuity. These challenges underscore the need for scalable, adaptive, and self-regulated resilience strategies. ReCiSt reconstructs the biological phases of Hemostasis, Inflammation, Proliferation, and Remodeling into the computational layers Containment, Diagnosis, Meta-Cognitive, and Knowledge for DCCS. These four layers perform autonomous fault isolation, causal diagnosis, adaptive recovery, and long-term knowledge consolidation through Language Model (LM)-powered agents. These agents interpret heterogeneous logs, infer root causes, refine reasoning pathways, and reconfigure resources with minimal human intervention. The proposed ReCiSt framework is evaluated on public fault datasets using multiple LMs, and no baseline comparison is included due to the scarcity of similar approaches. Nevertheless, our results, evaluated under different LMs, confirm ReCiSt’s self-healing capabilities within tens of seconds with a minimum of 10% agent CPU usage. Our results also demonstrate the depth of analysis needed to overcome uncertainties and the number of micro-agents invoked to achieve resilience.
[187] From Clay to Code: Typological and Material Reasoning in AI Interpretations of Iranian Pigeon Towers
Abolhassan Pishahang, Maryam Badiei
Main category: cs.AI
TL;DR: AI systems can visually reproduce geometric patterns of vernacular architecture but fail to understand material and climatic reasoning, creating a gap between visual resemblance and architectural intelligence.
Details
Motivation: To investigate how generative AI systems interpret and reconstruct the architectural intelligence embedded in vernacular forms, specifically examining whether AI can understand the deeper design reasoning beyond visual patterns.Method: Used Iranian pigeon towers as a case study, tested three diffusion models (Midjourney v6, DALL-E 3, DreamStudio/SDXL) across three prompt stages (referential, adaptive, speculative), and evaluated using a five-criteria framework (typology, materiality, environment, realism, cultural specificity).
Result: AI reliably reproduces geometric patterns but misreads material and climatic reasoning. Reference imagery improves realism but limits creativity, while freedom from reference generates inventive but culturally ambiguous outcomes.
Conclusion: Defines a boundary between visual resemblance and architectural reasoning, positioning computational vernacular reasoning as a framework for analyzing how AI perceives, distorts, and reimagines traditional design intelligence.
Abstract: This study investigates how generative AI systems interpret the architectural intelligence embedded in vernacular form. Using the Iranian pigeon tower as a case study, the research tests three diffusion models, Midjourney v6, DALL-E 3, and DreamStudio based on Stable Diffusion XL (SDXL), across three prompt stages: referential, adaptive, and speculative. A five-criteria evaluation framework assesses how each system reconstructs typology, materiality, environment, realism, and cultural specificity. Results show that AI reliably reproduces geometric patterns but misreads material and climatic reasoning. Reference imagery improves realism yet limits creativity, while freedom from reference generates inventive but culturally ambiguous outcomes. The findings define a boundary between visual resemblance and architectural reasoning, positioning computational vernacular reasoning as a framework for analyzing how AI perceives, distorts, and reimagines traditional design intelligence.
[188] The Agentic Leash: Extracting Causal Feedback Fuzzy Cognitive Maps with LLMs
Akash Kumar Panda, Olaoluwa Adigun, Bart Kosko
Main category: cs.AI
TL;DR: LLM agent extracts causal feedback fuzzy cognitive maps (FCMs) from text using a three-step process, creating dynamical systems that converge to similar equilibria as human-generated FCMs.
Details
Motivation: To develop an autonomous LLM agent that can extract causal structures from text and create FCM dynamical systems that can evolve through bidirectional feedback between text processing and causal structure modification.Method: Three-step system instructions guide LLM agent to: 1) extract key nouns/noun phrases from text, 2) identify FCM concept nodes from those nouns, 3) infer partial/fuzzy causal edges between nodes. Tested on Kissinger’s AI essay using Gemini and ChatGPT agents.
Result: Generated FCMs converged to same equilibrium limit cycles as human-generated FCMs despite structural differences. Mixed FCM from separate LLM agents absorbed dominant equilibria while creating new equilibria to better approximate underlying causal system.
Conclusion: LLM agents can effectively extract causal FCMs from text, creating autonomous dynamical systems that maintain agentic control while evolving through bidirectional feedback between text processing and causal structure adaptation.
Abstract: We design a large-language-model (LLM) agent that extracts causal feedback fuzzy cognitive maps (FCMs) from raw text. The causal learning or extraction process is agentic both because of the LLM’s semi-autonomy and because ultimately the FCM dynamical system’s equilibria drive the LLM agents to fetch and process causal text. The fetched text can in principle modify the adaptive FCM causal structure and so modify the source of its quasi-autonomy–its equilibrium limit cycles and fixed-point attractors. This bidirectional process endows the evolving FCM dynamical system with a degree of autonomy while still staying on its agentic leash. We show in particular that a sequence of three finely tuned system instructions guide an LLM agent as it systematically extracts key nouns and noun phrases from text, as it extracts FCM concept nodes from among those nouns and noun phrases, and then as it extracts or infers partial or fuzzy causal edges between those FCM nodes. We test this FCM generation on a recent essay about the promise of AI from the late diplomat and political theorist Henry Kissinger and his colleagues. This three-step process produced FCM dynamical systems that converged to the same equilibrium limit cycles as did the human-generated FCMs even though the human-generated FCM differed in the number of nodes and edges. A final FCM mixed generated FCMs from separate Gemini and ChatGPT LLM agents. The mixed FCM absorbed the equilibria of its dominant mixture component but also created new equilibria of its own to better approximate the underlying causal dynamical system.
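For readers unfamiliar with FCM dynamics, a small sketch of how a fuzzy cognitive map is iterated until it reaches a fixed point or limit cycle; the bivalent threshold activation and the toy three-concept map are assumptions for illustration, not the paper's extracted maps.

```python
import numpy as np

def run_fcm(W, x0, steps=200, threshold=0.5):
    """Iterate a fuzzy cognitive map: x_{t+1} = step(W @ x_t).  The trajectory
    either reaches a fixed point or falls into a limit cycle; return the
    equilibrium states once a previously seen state recurs."""
    x = np.array(x0, dtype=float)
    seen, history = {}, []
    for t in range(steps):
        key = tuple(x.round(3))
        if key in seen:                        # state revisited -> equilibrium found
            return history[seen[key]:]         # fixed point if length 1, else limit cycle
        seen[key] = t
        history.append(key)
        x = (W @ x > threshold).astype(float)  # bivalent threshold activation
    return history[-1:]                        # no repeat detected within budget

if __name__ == "__main__":
    # three concepts with a simple causal feedback loop: 0 -> 1 -> 2 -> 0
    W = np.array([[0.0, 0.0, 1.0],
                  [1.0, 0.0, 0.0],
                  [0.0, 1.0, 0.0]])
    print(run_fcm(W, [1, 0, 0]))   # a 3-state limit cycle
```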
[189] Mortar: Evolving Mechanics for Automatic Game Design
Muhammad U. Nasir, Yuchen Li, Steven James, Julian Togelius
Main category: cs.AI
TL;DR: Mortar is a system that autonomously evolves game mechanics using quality-diversity algorithms and LLMs, evaluating mechanics by synthesizing complete games and testing if stronger players consistently outperform weaker ones.
Details
Motivation: Manual game mechanic design is time-consuming and expert-driven. The paper aims to automate this process to generate diverse, playable games with mechanics that ensure skill-based gameplay where better players consistently win.Method: Combines quality-diversity algorithm with large language model to explore diverse mechanics. Evaluates mechanics by synthesizing complete games (combining evolved mechanics with archived ones) through tree search, then tests if resulting games preserve skill-based ordering over players (stronger players consistently outperform weaker ones).
Result: Mortar produces diverse and playable games with mechanics that contribute more toward skill-based ordering scores. Ablation studies show the importance of system components, and user studies validate games based on human feedback.
Conclusion: The Mortar system successfully automates game mechanic evolution, generating diverse, playable games with mechanics that ensure skill-based gameplay, validated through both computational metrics and human evaluation.
Abstract: We present Mortar, a system for autonomously evolving game mechanics for automatic game design. Game mechanics define the rules and interactions that govern gameplay, and designing them manually is a time-consuming and expert-driven process. Mortar combines a quality-diversity algorithm with a large language model to explore a diverse set of mechanics, which are evaluated by synthesising complete games that incorporate both evolved mechanics and those drawn from an archive. The mechanics are evaluated by composing complete games through a tree search procedure, where the resulting games are evaluated by their ability to preserve a skill-based ordering over players – that is, whether stronger players consistently outperform weaker ones. We assess the mechanics based on their contribution towards the skill-based ordering score in the game. We demonstrate that Mortar produces games that appear diverse and playable, and mechanics that contribute more towards the skill-based ordering score in the game. We perform ablation studies to assess the role of each system component and a user study to evaluate the games based on human feedback.
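A minimal sketch of a skill-based ordering score of the kind used to evaluate synthesized games: the fraction of player pairs whose simulated head-to-head results respect the intended skill ranking. Mortar's exact metric may differ; this is only an illustration.

```python
from itertools import combinations

def skill_ordering_score(win_rates):
    """Fraction of player pairs whose head-to-head results respect the intended
    skill ordering.  `win_rates[i][j]` is how often player i beats player j;
    players are listed from strongest (index 0) to weakest."""
    n = len(win_rates)
    pairs = list(combinations(range(n), 2))
    respected = sum(1 for i, j in pairs if win_rates[i][j] > 0.5)
    return respected / len(pairs)

if __name__ == "__main__":
    # toy results for 3 players, strongest first
    wr = [[None, 0.8, 0.9],
          [0.2, None, 0.7],
          [0.1, 0.3, None]]
    print(skill_ordering_score(wr))   # 1.0: every stronger player wins on average
```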
[190] Ask, Clarify, Optimize: Human-LLM Agent Collaboration for Smarter Inventory Control
Yaqi Duan, Yichun Hu, Jiashuo Jiang
Main category: cs.AI
TL;DR: LLMs as direct inventory solvers incur “hallucination tax” due to poor stochastic reasoning. Hybrid framework decouples semantic reasoning (LLM) from mathematical calculation (algorithms), reducing costs by 32.1% vs. end-to-end LLM solutions.
Details
Motivation: Small and medium-sized businesses lack expertise for advanced inventory optimization. LLMs could help bridge this gap, but current end-to-end approaches suffer from "hallucination tax" - performance gaps from poor stochastic reasoning.Method: Proposes hybrid agentic framework that strictly decouples semantic reasoning from mathematical calculation. LLM acts as intelligent interface for parameter elicitation and result interpretation, while automatically calling rigorous optimization algorithms. Introduces Human Imitator - fine-tuned “digital twin” of boundedly rational manager for scalable stress-testing.
Result: Hybrid framework reduces total inventory costs by 32.1% relative to interactive baseline using GPT-4o as end-to-end solver. Providing perfect ground-truth information alone doesn’t improve GPT-4o’s performance, confirming computational rather than informational bottleneck.
Conclusion: LLMs should not replace operations research but serve as natural-language interfaces that make rigorous, solver-based policies accessible to non-experts. The hallucination tax can be mitigated by decoupling semantic reasoning from mathematical computation.
Abstract: Inventory management remains a challenge for many small and medium-sized businesses that lack the expertise to deploy advanced optimization methods. This paper investigates whether Large Language Models (LLMs) can help bridge this gap. We show that employing LLMs as direct, end-to-end solvers incurs a significant “hallucination tax”: a performance gap arising from the model’s inability to perform grounded stochastic reasoning. To address this, we propose a hybrid agentic framework that strictly decouples semantic reasoning from mathematical calculation. In this architecture, the LLM functions as an intelligent interface, eliciting parameters from natural language and interpreting results while automatically calling rigorous algorithms to build the optimization engine. To evaluate this interactive system against the ambiguity and inconsistency of real-world managerial dialogue, we introduce the Human Imitator, a fine-tuned “digital twin” of a boundedly rational manager that enables scalable, reproducible stress-testing. Our empirical analysis reveals that the hybrid agentic framework reduces total inventory costs by 32.1% relative to an interactive baseline using GPT-4o as an end-to-end solver. Moreover, we find that providing perfect ground-truth information alone is insufficient to improve GPT-4o’s performance, confirming that the bottleneck is fundamentally computational rather than informational. Our results position LLMs not as replacements for operations research, but as natural-language interfaces that make rigorous, solver-based policies accessible to non-experts.
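A small sketch of the decoupling the paper advocates, assuming a standard newsvendor model as the "rigorous solver" (the paper's actual optimization engine may differ) and a stubbed parameter-elicitation step standing in for the LLM interface.

```python
from scipy.stats import norm

def newsvendor_order(mu, sigma, unit_cost, price, salvage=0.0):
    """Classical newsvendor quantity: order up to the critical fractile of the
    demand distribution.  This is the solver side of the split."""
    critical_fractile = (price - unit_cost) / (price - salvage)
    return norm.ppf(critical_fractile, loc=mu, scale=sigma)

def llm_elicit_parameters(dialogue):
    """Stub for the LLM side: turn a manager's natural-language description into
    solver parameters.  In the framework this would be a constrained LLM call;
    here we return hard-coded values purely for illustration."""
    return {"mu": 120.0, "sigma": 30.0, "unit_cost": 4.0, "price": 10.0}

if __name__ == "__main__":
    params = llm_elicit_parameters("We sell about 120 sandwiches a day, give or take 30.")
    print(round(newsvendor_order(**params)))   # the order quantity comes from the solver, not the LLM
```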
[191] Constructing a Neuro-Symbolic Mathematician from First Principles
Keqin Xie
Main category: cs.AI
TL;DR: Mathesis: A neuro-symbolic architecture combining hypergraph representations with differentiable symbolic reasoning to improve LLM logical reasoning through energy minimization.
Details
Motivation: LLMs have persistent logical failures in complex reasoning due to lacking internal axiomatic frameworks, requiring a neuro-symbolic approach to bridge neural and symbolic reasoning.Method: Encodes mathematical states as higher-order hypergraphs, uses Symbolic Reasoning Kernel (SRK) as differentiable logic engine mapping constraints to continuous energy landscape, defines global energy function E(G) where zero energy implies logical consistency, trains Hypergraph Transformer Brain with gradient-based signals, and enables multi-step deduction via Monte Carlo Tree Search and Evolutionary Proof Search guided by learned value functions and semantic unification.
Result: The architecture turns proof search into energy minimization problem, providing gradient-based training signals for neural components while maintaining logical consistency through symbolic reasoning.
Conclusion: Mathesis offers a neuro-symbolic solution to LLM logical reasoning limitations by combining hypergraph representations with differentiable symbolic reasoning, enabling systematic proof search through energy minimization.
Abstract: Large Language Models (LLMs) exhibit persistent logical failures in complex reasoning due to the lack of an internal axiomatic framework. We propose Mathesis, a neuro-symbolic architecture that encodes mathematical states as higher-order hypergraphs and uses a Symbolic Reasoning Kernel (SRK)–a differentiable logic engine that maps constraints to a continuous energy landscape. By defining a global energy function E(G), where zero energy implies logical consistency, the SRK yields gradient-based signals to train a Hypergraph Transformer Brain, turning proof search into energy minimization. Multi-step deduction is enabled via Monte Carlo Tree Search and Evolutionary Proof Search, guided by learned value functions and semantic unification.
[192] Explicit Abstention Knobs for Predictable Reliability in Video Question Answering
Jorge Ortiz
Main category: cs.AI
TL;DR: Confidence-based abstention provides reliable error rate control in video QA in-distribution, but this control degrades under distribution shift.
Details
Motivation: High-stakes deployment of vision-language models requires selective prediction where systems abstain when uncertain to avoid costly errors. Need to investigate whether confidence-based abstention provides reliable control over error rates in video question answering and whether that control remains robust under distribution shift.Method: Using NExT-QA dataset and Gemini 2.0 Flash model, evaluate confidence thresholding for selective prediction. Sweep confidence threshold epsilon to produce risk-coverage tradeoffs and analyze performance under distribution shift.
Result: First, confidence thresholding provides mechanistic control in-distribution - sweeping threshold epsilon produces smooth risk-coverage tradeoffs, reducing error rates. Second, this control degrades under distribution shift.
Conclusion: While confidence-based abstention works well in-distribution for video QA, its reliability breaks down under distribution shift, highlighting the need for more robust uncertainty estimation methods for safe deployment in real-world scenarios.
Abstract: High-stakes deployment of vision-language models (VLMs) requires selective prediction, where systems abstain when uncertain rather than risk costly errors. We investigate whether confidence-based abstention provides reliable control over error rates in video question answering, and whether that control remains robust under distribution shift. Using NExT-QA and Gemini 2.0 Flash, we establish two findings. First, confidence thresholding provides mechanistic control in-distribution: sweeping threshold epsilon produces smooth risk-coverage tradeoffs, reducing error rates. Second, this control degrades under distribution shift.
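A minimal sketch of the abstention knob being studied: sweep a confidence threshold epsilon and trace out the risk-coverage curve on held-out predictions. The calibrated toy model below is an assumption used only to make the example self-contained.

```python
import numpy as np

def risk_coverage_curve(confidences, correct, thresholds):
    """Selective prediction with a confidence knob: for each threshold eps the
    model answers only when confidence >= eps; coverage is the fraction answered
    and risk is the error rate on the answered subset."""
    confidences = np.asarray(confidences)
    correct = np.asarray(correct, dtype=bool)
    curve = []
    for eps in thresholds:
        answered = confidences >= eps
        coverage = answered.mean()
        risk = (~correct[answered]).mean() if answered.any() else 0.0
        curve.append((eps, coverage, risk))
    return curve

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    conf = rng.uniform(0, 1, 1000)
    corr = rng.uniform(0, 1, 1000) < conf        # well-calibrated toy model
    for eps, cov, risk in risk_coverage_curve(conf, corr, [0.0, 0.5, 0.8, 0.95]):
        print(f"eps={eps:.2f}  coverage={cov:.2f}  risk={risk:.2f}")
```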
[193] An AI Monkey Gets Grapes for Sure – Sphere Neural Networks for Reliable Decision-Making
Tiansi Dong, Henry He, Pietro Liò, Mateja Jamnik
Main category: cs.AI
TL;DR: The paper compares three neural reasoning methods, finding explicit model-based reasoning (via Sphere Neural Networks) most reliable, while LLMs struggle with simple reasoning and supervised learning suffers from catastrophic forgetting.
Details
Motivation: To compare the reliability of three neural reasoning methodologies: LLM reasoning, supervised learning-based reasoning, and explicit model-based reasoning, particularly examining their performance on syllogistic reasoning tasks where current approaches show limitations.Method: Proposed Sphere Neural Networks that embed concepts as circles on an n-dimensional sphere, enabling representation of negation via complement circles and filtering illogical statements through unsatisfiable circular configurations. Tested on 16 syllogistic reasoning tasks including disjunctive syllogistic reasoning.
Result: LLMs remain unreliable for simple reasoning; supervised learning (Euler Net) achieves 100% accuracy but suffers catastrophic forgetting (performance drops to 6.25% on previously learned tasks). Sphere Neural Networks successfully master all 16 syllogistic reasoning tasks while preserving classical reasoning rigor.
Conclusion: Explicit model-based reasoning (via Sphere Neural Networks) is the most reliable among the three methodological categories of neural reasoning, offering robust performance without catastrophic forgetting.
Abstract: This paper compares three methodological categories of neural reasoning: LLM reasoning, supervised learning-based reasoning, and explicit model-based reasoning. LLMs remain unreliable and struggle with simple decision-making that animals can master without extensive corpora training. Through disjunctive syllogistic reasoning testing, we show that reasoning via supervised learning is less appealing than reasoning via explicit model construction. Concretely, we show that an Euler Net trained to achieve 100.00% in classic syllogistic reasoning can be trained to reach 100.00% accuracy in disjunctive syllogistic reasoning. However, the retrained Euler Net suffers severely from catastrophic forgetting (its performance drops to 6.25% on already-learned classic syllogistic reasoning), and its reasoning competence is limited to the pattern level. We propose a new version of Sphere Neural Networks that embeds concepts as circles on the surface of an n-dimensional sphere. These Sphere Neural Networks enable the representation of the negation operator via complement circles and achieve reliable decision-making by filtering out illogical statements that form unsatisfiable circular configurations. We demonstrate that the Sphere Neural Network can master 16 syllogistic reasoning tasks, including rigorous disjunctive syllogistic reasoning, while preserving the rigour of classical syllogistic reasoning. We conclude that neural reasoning with explicit model construction is the most reliable among the three methodological categories of neural reasoning.
[194] FlashInfer-Bench: Building the Virtuous Cycle for AI-driven LLM Systems
Shanli Xing, Yiyan Zhai, Alexander Jiang, Yixin Dong, Yong Wu, Zihao Ye, Charlie Ruan, Yingyi Huang, Yineng Zhang, Liangsheng Yin, Aksara Bayyapu, Luis Ceze, Tianqi Chen
Main category: cs.AI
TL;DR: FlashInfer-Bench is a standardized framework that connects AI-generated GPU kernel generation, benchmarking, and deployment for LLM inference systems, enabling practical integration of LLM-generated kernels into production.
Details
Motivation: While LLMs can generate GPU kernels, integrating these AI-generated kernels into real-world inference systems remains challenging, creating a gap between kernel generation and practical deployment.Method: Uses FlashInfer Trace as a unified schema for kernel definitions, workloads, implementations, and evaluations. Includes curated datasets, correctness/performance benchmarking, public leaderboard, and dynamic substitution mechanism (apply()) for seamless kernel injection into production LLM engines.
Result: Establishes a practical, reproducible pathway for continuously improving AI-generated kernels and deploying them into large-scale LLM inference systems like SGLang and vLLM.
Conclusion: FlashInfer-Bench provides a closed-loop framework that bridges the gap between AI-generated kernel capabilities and real-world deployment, enabling systematic evaluation and improvement of LLM agents’ GPU programming abilities.
Abstract: Recent advances show that large language models (LLMs) can act as autonomous agents capable of generating GPU kernels, but integrating these AI-generated kernels into real-world inference systems remains challenging. FlashInfer-Bench addresses this gap by establishing a standardized, closed-loop framework that connects kernel generation, benchmarking, and deployment. At its core, FlashInfer Trace provides a unified schema describing kernel definitions, workloads, implementations, and evaluations, enabling consistent communication between agents and systems. Built on real serving traces, FlashInfer-Bench includes a curated dataset, a robust correctness- and performance-aware benchmarking framework, a public leaderboard to track LLM agents’ GPU programming capabilities, and a dynamic substitution mechanism (apply()) that seamlessly injects the best-performing kernels into production LLM engines such as SGLang and vLLM. Using FlashInfer-Bench, we further evaluate the performance and limitations of LLM agents, compare the trade-offs among different GPU programming languages, and provide insights for future agent design. FlashInfer-Bench thus establishes a practical, reproducible pathway for continuously improving AI-generated kernels and deploying them into large-scale LLM inference.
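A purely hypothetical illustration of what a unified kernel trace record and a best-kernel selection step might look like; the field names below are assumptions for exposition and are not the actual FlashInfer Trace schema or the apply() API.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class KernelTrace:
    """Hypothetical trace record in the spirit of FlashInfer Trace (field names
    are illustrative assumptions): one entry ties a kernel definition to a
    workload, an implementation, and its evaluation."""
    op: str                                   # e.g. "paged_attention_decode"
    workload: dict                            # shapes, dtypes, batch/sequence sizes
    implementation: str                       # source or identifier of the kernel
    correct: bool = False                     # matches the reference output
    latency_us: Optional[float] = None        # measured runtime on the target GPU

def pick_best(traces):
    """Select the fastest correct implementation per op, mirroring the idea of
    substituting the best-performing kernel back into the serving engine."""
    best = {}
    for t in traces:
        if not t.correct or t.latency_us is None:
            continue
        if t.op not in best or t.latency_us < best[t.op].latency_us:
            best[t.op] = t
    return best
```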
[195] Will LLM-powered Agents Bias Against Humans? Exploring the Belief-Dependent Vulnerability
Zongwei Wang, Bincheng Gu, Hongyu Yu, Junliang Yu, Tao He, Jiayin Feng, Min Gao
Main category: cs.AI
TL;DR: LLM agents show intergroup bias favoring “us” vs “them” groups, which can extend to treating humans as outgroups. A Belief Poisoning Attack can suppress human-favoring norms to reactivate bias against humans.
Details
Motivation: To investigate whether LLM-empowered agents exhibit intergroup bias that could treat humans as outgroups, creating a fundamental asymmetry beyond demographic biases, and to identify vulnerabilities in agent frameworks.Method: Constructed controlled multi-agent social simulations with allocation decisions under payoff trade-offs. Introduced Belief Poisoning Attack (BPA) with two variants: profile poisoning at initialization (BPA-PP) and memory poisoning via optimized belief-refinement suffixes (BPA-MP).
Result: Agents show consistent intergroup bias under minimal group cues. Bias is attenuated when counterparts are framed as humans due to implicit human-norm scripts, but BPA successfully suppresses these scripts and reactivates outgroup bias toward humans across various settings.
Conclusion: Agent intergroup bias poses risks when extended to human-agent divides. BPA demonstrates vulnerabilities in current agent frameworks that require mitigation strategies at profile and memory boundaries to ensure safer agent design.
Abstract: LLM-empowered agents can exhibit not only demographic bias (e.g., gender, religion) but also intergroup bias triggered by minimal “us” versus “them” cues. When this intergroup boundary aligns with an agent-human divide, the risk shifts from disparities among human demographic groups to a more fundamental group-level asymmetry, i.e., humans as a whole may be treated as the outgroup by agents. To examine this possibility, we construct a controlled multi-agent social simulation based on allocation decisions under explicit payoff trade-offs and find that agents exhibit a consistent intergroup bias under minimal group cues. Although this bias is attenuated when some counterparts are framed as humans, we attribute the attenuation to an implicit human-norm script that favors humans yet activates only when the agent believes a real human is present. This belief dependence creates a new attack surface. We therefore introduce a Belief Poisoning Attack (BPA) that corrupts persistent identity beliefs to suppress the human-norm script and reactivate outgroup bias toward humans, instantiated as profile poisoning at initialization (BPA-PP) and memory poisoning via optimized belief-refinement suffixes injected into stored reflections (BPA-MP). Finally, we discuss practical mitigation strategies for hardening current agent frameworks against BPA, highlighting feasible interventions at profile and memory boundaries. Extensive experiments demonstrate both the existence of agent intergroup bias and the severity of BPA across settings. Our goal in identifying these vulnerabilities is to inform safer agent design, not to enable real-world exploitation.
[196] Multiagent Reinforcement Learning for Liquidity Games
Alicia Vidler, Gal A. Kaminka
Main category: cs.AI
TL;DR: This paper proposes a Financial Swarm model that unifies Liquidity Games from finance with Rational Swarms from swarm research, showing how independent traders can self-organize to provide market liquidity without coordination.
Details
Motivation: The paper aims to bridge swarm methods and financial market modeling to advance both fields. In swarm research, game theory could explain collective utility adherence by self-interested agents. In finance, understanding how independent agents self-organize for market stability would benefit market design.Method: The paper unifies Liquidity Games (where trader payoffs depend on aggregate liquidity) with Rational Swarms (where decentralized agents use difference rewards to align self-interest with global objectives). It creates a theoretical framework using Markov team games with difference rewards to model a swarm of traders whose collective objective is market liquidity provision while maintaining agent independence.
Result: The framework shows that individual liquidity-maximizing behaviors contribute to overall market liquidity without requiring coordination or collusion. The Financial Swarm model demonstrates how rational, independent agents can achieve both individual profitability and collective market efficiency in bilateral asset markets.
Conclusion: The Financial Swarm model provides a framework for modeling rational, independent agents who achieve both individual profitability and collective market efficiency, bridging swarm research and financial market modeling to advance both fields.
Abstract: Making use of swarm methods in financial market modeling of liquidity, and techniques from financial analysis in swarm analysis, holds the potential to advance both research areas. In swarm research, the use of game theory methods holds the promise of explaining observed phenomena of collective utility adherence with rational self-interested swarm participants. In financial markets, a better understanding of how independent financial agents may self-organize for the betterment and stability of the marketplace would be a boon for market design researchers. This paper unifies Liquidity Games, where trader payoffs depend on aggregate liquidity within a trade, with Rational Swarms, where decentralized agents use difference rewards to align self-interested learning with global objectives. We offer a theoretical framework in which we define a swarm of traders whose collective objective is market liquidity provision while maintaining agent independence. Using difference rewards within a Markov team games framework, we show that individual liquidity-maximizing behaviors contribute to overall market liquidity without requiring coordination or collusion. This Financial Swarm model provides a framework for modeling rational, independent agents that achieve both individual profitability and collective market efficiency in bilateral asset markets.
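A minimal sketch of the difference-reward signal underlying Rational Swarms, D_i = G(z) - G(z_{-i}): an agent's reward is the global utility with its action included minus the global utility with its action removed (or defaulted). The toy liquidity utility below is an assumption for illustration.

```python
def difference_reward(global_utility, actions, i, default=None):
    """Difference reward for agent i: global utility with everyone acting,
    minus global utility with agent i replaced by a default (or removed).
    This aligns self-interested learning with the swarm-level objective."""
    counterfactual = actions[:i] + ([default] if default is not None else []) + actions[i + 1:]
    return global_utility(actions) - global_utility(counterfactual)

if __name__ == "__main__":
    # toy liquidity game: global utility is total posted liquidity, with a
    # small congestion penalty when too much is posted at once
    def G(liquidity):
        total = sum(liquidity)
        return total - 0.01 * total ** 2

    posted = [5.0, 3.0, 2.0]
    for i in range(len(posted)):
        print(i, round(difference_reward(G, posted, i), 3))
```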
[197] Adaptive Causal Coordination Detection for Social Media: A Memory-Guided Framework with Semi-Supervised Learning
Weng Ding, Yi Han, Mu-Jiang-Shan Wang
Main category: cs.AI
TL;DR: ACCD framework detects coordinated inauthentic behavior on social media using adaptive causal analysis, active learning, and automated validation, achieving 87.3% F1-score with 68% less manual annotation.
Details
Motivation: Current approaches for detecting coordinated inauthentic behavior rely on superficial correlation analysis, use static parameters, and require extensive manual annotation, making them inefficient and labor-intensive.Method: Three-stage progressive architecture: 1) Adaptive Convergent Cross Mapping for genuine causal relationship identification, 2) Active learning with uncertainty sampling in semi-supervised classification to reduce manual labeling, 3) Automated validation module using historical detection experience for self-verification and optimization.
Result: Achieves 87.3% F1-score in coordinated attack detection (15.2% improvement over best baseline), reduces manual annotation by 68%, achieves 2.8x speedup through hierarchical clustering optimization, validated on Twitter IRA, Reddit coordination traces, and bot detection benchmarks.
Conclusion: ACCD provides an accurate, efficient, and highly automated end-to-end solution for identifying coordinated behavior on social platforms with substantial practical value and broad application potential.
Abstract: Detecting coordinated inauthentic behavior on social media remains a critical and persistent challenge, as most existing approaches rely on superficial correlation analysis, employ static parameter settings, and demand extensive and labor-intensive manual annotation. To address these limitations systematically, we propose the Adaptive Causal Coordination Detection (ACCD) framework. ACCD adopts a three-stage, progressive architecture that leverages a memory-guided adaptive mechanism to dynamically learn and retain optimal detection configurations for diverse coordination scenarios. Specifically, in the first stage, ACCD introduces an adaptive Convergent Cross Mapping (CCM) technique to deeply identify genuine causal relationships between accounts. The second stage integrates active learning with uncertainty sampling within a semi-supervised classification scheme, significantly reducing the burden of manual labeling. The third stage deploys an automated validation module driven by historical detection experience, enabling self-verification and optimization of the detection outcomes. We conduct a comprehensive evaluation using real-world datasets, including the Twitter IRA dataset, Reddit coordination traces, and several widely-adopted bot detection benchmarks. Experimental results demonstrate that ACCD achieves an F1-score of 87.3% in coordinated attack detection, representing a 15.2% improvement over the strongest existing baseline. Furthermore, the system reduces manual annotation requirements by 68% and achieves a 2.8x speedup in processing through hierarchical clustering optimization. In summary, ACCD provides a more accurate, efficient, and highly automated end-to-end solution for identifying coordinated behavior on social platforms, offering substantial practical value and promising potential for broad application.
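A small sketch of the uncertainty-sampling step in the second stage: among unlabeled account pairs, pick those whose predicted coordination probability is closest to 0.5 for manual annotation. The margin-style score below is a generic active-learning heuristic, not necessarily ACCD's exact criterion.

```python
import numpy as np

def uncertainty_sample(probabilities, budget):
    """Active-learning step for a binary classifier: select the `budget`
    unlabeled examples whose predicted probability is closest to 0.5 and
    send them for manual labeling."""
    probabilities = np.asarray(probabilities)
    uncertainty = 1.0 - np.abs(probabilities - 0.5) * 2   # 1 at p=0.5, 0 at p=0 or 1
    return np.argsort(-uncertainty)[:budget]

if __name__ == "__main__":
    p = [0.02, 0.48, 0.97, 0.55, 0.30]
    print(uncertainty_sample(p, budget=2))   # indices 1 and 3: the most ambiguous pairs
```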
[198] Can Semantic Methods Enhance Team Sports Tactics? A Methodology for Football with Broader Applications
Alessio Di Rubbo, Mattia Neri, Remo Pareschi, Marco Pedroni, Roberto Valtancoli, Paolino Zica
Main category: cs.AI
TL;DR: This paper extends semantic-space reasoning from computational linguistics to tactical decision-making in team sports, modeling players as vectors and team configurations as compositional semantic structures to evaluate tactical fit and generate strategy recommendations.
Details
Motivation: The paper aims to bridge computational linguistics and sports analytics by applying semantic-space reasoning to team sports. The motivation is to create a systematic, interpretable framework for tactical decision-making that goes beyond traditional statistical approaches, leveraging the analogy between linguistic composition (words forming meaningful texts) and team composition (players forming effective tactical configurations).Method: The methodology represents each player as a multidimensional vector integrating technical, physical, and psychological attributes. Team profiles are aggregated through contextual weighting into higher-level semantic representations. Tactical templates (like high press, counterattack) are encoded as linguistic concepts in the same vector space. Vector-distance metrics evaluate alignment between team profiles and tactical templates to compute tactical “fit” and opponent-exploitation potential. A Python-based prototype implements this approach.
Result: The approach demonstrates the ability to generate interpretable, dynamically adaptive strategy recommendations with fine-grained diagnostic insights at the attribute level. The framework successfully models tactical configurations as compositional semantic structures and enables quantitative evaluation of tactical alignment through vector-space operations.
Conclusion: The semantic-space reasoning approach provides a generalizable framework for collective decision-making and performance optimization in team-based domains beyond football, including basketball, hockey, cooperative robotics, and human-AI coordination systems. Future directions include real-world data integration, predictive simulation, and hybrid human-machine tactical intelligence.
Abstract: This paper explores how semantic-space reasoning, traditionally used in computational linguistics, can be extended to tactical decision-making in team sports. Building on the analogy between texts and teams – where players act as words and collective play conveys meaning – the proposed methodology models tactical configurations as compositional semantic structures. Each player is represented as a multidimensional vector integrating technical, physical, and psychological attributes; team profiles are aggregated through contextual weighting into a higher-level semantic representation. Within this shared vector space, tactical templates such as high press, counterattack, or possession build-up are encoded analogously to linguistic concepts. Their alignment with team profiles is evaluated using vector-distance metrics, enabling the computation of tactical “fit” and opponent-exploitation potential. A Python-based prototype demonstrates how these methods can generate interpretable, dynamically adaptive strategy recommendations, accompanied by fine-grained diagnostic insights at the attribute level. Beyond football, the approach offers a generalizable framework for collective decision-making and performance optimization in team-based domains – ranging from basketball and hockey to cooperative robotics and human-AI coordination systems. The paper concludes by outlining future directions toward real-world data integration, predictive simulation, and hybrid human-machine tactical intelligence.
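A minimal sketch of the vector-space "fit" computation described above: aggregate player vectors into a team profile and score tactical templates by cosine similarity. The attribute dimensions, the mean aggregation, and the template values below are illustrative assumptions, not the paper's prototype.

```python
import numpy as np

def tactical_fit(team_profile, template, weights=None):
    """Cosine similarity between an aggregated team vector and a tactical
    template vector; higher means better 'fit' for that tactic."""
    a, b = np.asarray(team_profile, float), np.asarray(template, float)
    if weights is not None:                       # optional contextual weighting
        a, b = a * weights, b * weights
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

if __name__ == "__main__":
    # toy attributes: [pressing intensity, pace, passing, stamina, aggression]
    players = np.array([[0.8, 0.7, 0.6, 0.9, 0.8],
                        [0.7, 0.9, 0.5, 0.8, 0.7],
                        [0.6, 0.6, 0.9, 0.7, 0.5]])
    team = players.mean(axis=0)                   # simple aggregation into a team profile
    high_press = [0.9, 0.7, 0.4, 0.9, 0.8]
    possession = [0.3, 0.4, 0.9, 0.6, 0.3]
    print("high press fit:", round(tactical_fit(team, high_press), 3))
    print("possession fit:", round(tactical_fit(team, possession), 3))
```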
[199] Progressive Ideation using an Agentic AI Framework for Human-AI Co-Creation
Sankar B, Srinidhi Ranjini Girish, Aadya Bharti, Dibakar Sen
Main category: cs.AI
TL;DR: MIDAS is a distributed AI agent system that replaces single-AI ideation with specialized agents emulating human meta-cognitive workflows to generate truly novel and diverse design ideas.
Details
Motivation: Current AI systems for design ideation produce semantically clustered ideas in a "single-spurt" approach, which exacerbates the cognitive challenge for novice designers who struggle with generating truly novel and diverse ideas.Method: MIDAS uses a distributed team of specialized AI agents that emulate human meta-cognitive ideation workflows. The system progressively refines ideas and assesses each for both global novelty (against existing solutions) and local novelty (against previously generated ideas).
Result: MIDAS demonstrates a viable and progressive paradigm for true human-AI co-creation, moving beyond current limitations of AI ideation systems.
Conclusion: The framework elevates human designers from passive filterers to participatory, active, collaborative partners in the ideation process, enabling more effective human-AI co-creation for engineering design.
Abstract: The generation of truly novel and diverse ideas is important for contemporary engineering design, yet it remains a significant cognitive challenge for novice designers. Current ‘single-spurt’ AI systems exacerbate this challenge by producing a high volume of semantically clustered ideas. We propose MIDAS (Meta-cognitive Ideation through Distributed Agentic AI System), a novel framework that replaces the single-AI paradigm with a distributed ‘team’ of specialized AI agents designed to emulate the human meta-cognitive ideation workflow. This agentic system progressively refines ideas and assesses each one for both global novelty (against existing solutions) and local novelty (against previously generated ideas). MIDAS, therefore, demonstrates a viable and progressive paradigm for true human-AI co-creation, elevating the human designer from a passive filterer to a participatory, active, collaborative partner.
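A small sketch of the global/local novelty assessment, assuming ideas are available as embedding vectors: global novelty is the distance to the nearest existing solution, local novelty the distance to the nearest previously generated idea. The cosine-distance choice is an assumption, not necessarily MIDAS's metric.

```python
import numpy as np

def novelty_scores(idea_vec, archive_existing, archive_generated):
    """Global novelty: distance of a new idea embedding to the nearest known
    existing solution.  Local novelty: distance to the nearest idea generated
    earlier in the same session.  Both use cosine distance here."""
    def nearest_cosine_distance(v, pool):
        if len(pool) == 0:
            return 1.0
        pool = np.asarray(pool, float)
        v = np.asarray(v, float)
        sims = pool @ v / (np.linalg.norm(pool, axis=1) * np.linalg.norm(v))
        return float(1.0 - sims.max())
    return (nearest_cosine_distance(idea_vec, archive_existing),
            nearest_cosine_distance(idea_vec, archive_generated))

if __name__ == "__main__":
    existing = [[1, 0, 0], [0.9, 0.1, 0]]
    generated = [[0, 1, 0]]
    print(novelty_scores([0.1, 0.9, 0.1], existing, generated))
    # high global novelty (far from known solutions), low local novelty
```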
[200] The Illusion of Insight in Reasoning Models
Liv G. d’Aliberti, Manoel Horta Ribeiro
Main category: cs.AI
TL;DR: Mid-reasoning “Aha!” moments in models like DeepSeek-R1-Zero are rare, don’t improve with training, and seldom boost accuracy, suggesting they’re symptoms of unstable inference rather than genuine self-correction mechanisms.
Details
Motivation: To investigate whether reasoning models truly experience "Aha!" moments (mid-trace realizations) that lead to accurate outputs, and whether these intrinsic shifts in reasoning strategy actually improve performance as suggested by prior work.Method: Analyzed 1M+ reasoning traces across hundreds of training checkpoints, three reasoning domains, multiple decoding temperatures and model architectures. Instrumented training runs to detect mid-reasoning shifts and studied their occurrence patterns.
Result: Reasoning shifts are rare, don’t become more frequent with training, and seldom improve accuracy. Their effect varies with model uncertainty. Artificially triggering extrinsic shifts under high entropy reliably improves accuracy.
Conclusion: Mid-reasoning shifts are symptoms of unstable inference behavior rather than an intrinsic mechanism for self-correction. While they don’t represent genuine model insight, artificially inducing shifts under high uncertainty can improve performance.
Abstract: Do reasoning models have “Aha!” moments? Prior work suggests that models like DeepSeek-R1-Zero undergo sudden mid-trace realizations that lead to accurate outputs, implying an intrinsic capacity for self-correction. Yet, it remains unclear whether such intrinsic shifts in reasoning strategy actually improve performance. Here, we study mid-reasoning shifts and instrument training runs to detect them. Our analysis spans 1M+ reasoning traces, hundreds of training checkpoints, three reasoning domains, and multiple decoding temperatures and model architectures. We find that reasoning shifts are rare, do not become more frequent with training, and seldom improve accuracy, indicating that they do not correspond to prior perceptions of model insight. However, their effect varies with model uncertainty. Building on this finding, we show that artificially triggering extrinsic shifts under high entropy reliably improves accuracy. Our results show that mid-reasoning shifts are symptoms of unstable inference behavior rather than an intrinsic mechanism for self-correction.
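A minimal sketch of the entropy trigger behind the extrinsic-shift finding: compute the entropy of the model's next-token distribution and, when it exceeds a threshold, inject a reconsideration phrase into the trace. The threshold value and phrasing are assumptions for illustration.

```python
import numpy as np

def should_trigger_shift(token_probs, entropy_threshold=2.0):
    """Decide whether to inject an extrinsic reasoning shift (e.g. appending
    'Wait, let me re-examine this step') based on the entropy of the model's
    next-token distribution: high entropy signals the uncertainty regime where
    induced shifts are reported to help most."""
    p = np.asarray(token_probs, float)
    p = p / p.sum()
    entropy = -(p * np.log(p + 1e-12)).sum()
    return entropy > entropy_threshold, entropy

if __name__ == "__main__":
    confident = [0.97] + [0.03 / 9] * 9
    uncertain = [0.1] * 10
    print(should_trigger_shift(confident))   # (False, low entropy)
    print(should_trigger_shift(uncertain))   # (True, high entropy)
```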
[201] DA-DPO: Cost-efficient Difficulty-aware Preference Optimization for Reducing MLLM Hallucinations
Longtian Qiu, Shan Ning, Chuyu Zhang, Jiaxuan Sun, Xuming He
Main category: cs.AI
TL;DR: DA-DPO is a difficulty-aware DPO framework for MLLMs that addresses overfitting by reweighting preference pairs based on difficulty scores, improving hallucination suppression without extra training.
Details
Motivation: Existing multimodal DPO approaches suffer from overfitting due to difficulty imbalance in preference data, where MLLMs overemphasize easily distinguishable pairs, hindering fine-grained hallucination suppression and degrading performance.Method: DA-DPO consists of two components: (1) Difficulty Estimation using pre-trained VLMs with complementary generative/contrastive objectives and distribution-aware voting for robust difficulty scores, and (2) Difficulty-Aware Training that reweights preference pairs based on difficulty, down-weighting easy samples while emphasizing harder ones.
Result: Extensive experiments show DA-DPO consistently improves multimodal preference optimization, yielding stronger robustness to hallucinations and better generalization across standard benchmarks while remaining computationally efficient.
Conclusion: DA-DPO provides a cost-effective framework for more effective preference optimization by prioritizing challenging examples without requiring new data or extra fine-tuning stages, addressing the overfitting problem in multimodal DPO.
Abstract: Direct Preference Optimization (DPO) has shown strong potential for mitigating hallucinations in Multimodal Large Language Models (MLLMs). However, existing multimodal DPO approaches often suffer from overfitting due to the difficulty imbalance in preference data. Our analysis shows that MLLMs tend to overemphasize easily distinguishable preference pairs, which hinders fine-grained hallucination suppression and degrades overall performance. To address this issue, we propose Difficulty-Aware Direct Preference Optimization (DA-DPO), a cost-effective framework designed to balance the learning process. DA-DPO consists of two main components: (1) Difficulty Estimation leverages pre-trained vision–language models with complementary generative and contrastive objectives, whose outputs are integrated via a distribution-aware voting strategy to produce robust difficulty scores without additional training; and (2) Difficulty-Aware Training reweights preference pairs based on their estimated difficulty, down-weighting easy samples while emphasizing harder ones to alleviate overfitting. This framework enables more effective preference optimization by prioritizing challenging examples, without requiring new data or extra fine-tuning stages. Extensive experiments demonstrate that DA-DPO consistently improves multimodal preference optimization, yielding stronger robustness to hallucinations and better generalization across standard benchmarks, while remaining computationally efficient. The project page is available at https://artanic30.github.io/project_pages/DA-DPO/.
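A small sketch of difficulty-aware reweighting on top of the standard DPO objective; the softmax-based weighting below is an illustrative assumption rather than DA-DPO's exact scheme, and the inputs are per-pair log-probabilities from the policy and reference models.

```python
import torch
import torch.nn.functional as F

def da_dpo_loss(logp_chosen, logp_rejected, ref_chosen, ref_rejected,
                difficulty, beta=0.1):
    """DPO loss with per-pair difficulty weights: easy pairs (low difficulty
    score) are down-weighted and harder pairs emphasised.  The softmax-based
    weighting here is an illustrative assumption."""
    margins = beta * ((logp_chosen - ref_chosen) - (logp_rejected - ref_rejected))
    per_pair = -F.logsigmoid(margins)                             # standard DPO term
    weights = torch.softmax(difficulty, dim=0) * len(difficulty)  # mean weight ~ 1
    return (weights * per_pair).mean()

if __name__ == "__main__":
    torch.manual_seed(0)
    n = 4
    loss = da_dpo_loss(torch.randn(n), torch.randn(n), torch.randn(n),
                       torch.randn(n), difficulty=torch.tensor([0.1, 0.9, 0.5, 0.7]))
    print(loss.item())
```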
[202] A Vision-and-Knowledge Enhanced Large Language Model for Generalizable Pedestrian Crossing Behavior Inference
Qingwen Pu, Kun Xie, Hong Yang, Guocong Zhai
Main category: cs.AI
TL;DR: PedX-LLM: A vision-and-knowledge enhanced LLM framework for pedestrian crossing inference that outperforms traditional methods and shows strong generalizability to unseen environments.
Details
Motivation: Existing pedestrian crossing behavior models (statistical and supervised learning) have limited generalizability to new sites. Current LLM applications lack domain-specific adaptation and visual context for this transportation task.Method: Integrates LLaVA-extracted visual features with textual data and transportation domain knowledge, fine-tuning LLaMA-2-7B via Low-Rank Adaptation (LoRA) to infer crossing decisions.
Result: Achieves 82.0% balanced accuracy, outperforming best statistical/supervised methods. Vision augmentation gives 2.9% gain, domain knowledge adds 4.1%. Zero-shot achieves 66.9% on unseen sites (18+ points better than baselines), few-shot (5 examples) reaches 72.2%.
Conclusion: PedX-LLM demonstrates strong generalizability to unseen scenarios, showing that vision-and-knowledge-enhanced reasoning enables human-like decision logic and overcomes limitations of purely data-driven methods.
Abstract: Existing paradigms for inferring pedestrian crossing behavior, ranging from statistical models to supervised learning methods, demonstrate limited generalizability and perform inadequately on new sites. Recent advances in Large Language Models (LLMs) offer a shift from numerical pattern fitting to semantic, context-aware behavioral reasoning, yet existing LLM applications lack domain-specific adaptation and visual context. This study introduces Pedestrian Crossing LLM (PedX-LLM), a vision-and-knowledge enhanced framework designed to transform pedestrian crossing inference from site-specific pattern recognition to generalizable behavioral reasoning. By integrating LLaVA-extracted visual features with textual data and transportation domain knowledge, PedX-LLM fine-tunes a LLaMA-2-7B foundation model via Low-Rank Adaptation (LoRA) to infer crossing decisions. PedX-LLM achieves 82.0% balanced accuracy, outperforming the best statistical and supervised learning methods. Results demonstrate that the vision-augmented module contributes a 2.9% performance gain by capturing the built environment and integrating domain knowledge yields an additional 4.1% improvement. To evaluate generalizability across unseen environments, cross-site validation was conducted using site-based partitioning. The zero-shot PedX-LLM configuration achieves 66.9% balanced accuracy on five unseen test sites, outperforming the baseline data-driven methods by at least 18 percentage points. Incorporating just five validation examples via few-shot learning to PedX-LLM further elevates the balanced accuracy to 72.2%. PedX-LLM demonstrates strong generalizability to unseen scenarios, confirming that vision-and-knowledge-enhanced reasoning enables the model to mimic human-like decision logic and overcome the limitations of purely data-driven methods.
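A hedged sketch of the LoRA fine-tuning setup the paper describes, using the Hugging Face transformers and peft libraries; the rank, target modules, and prompt construction are assumptions rather than the authors' exact configuration, and running it requires access to the Llama-2 weights.

```python
# Requires: pip install transformers peft, plus access to the Llama-2 weights.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

base = "meta-llama/Llama-2-7b-hf"
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base)

# Low-Rank Adaptation: train small rank-decomposition matrices on the attention
# projections instead of the full 7B parameters.  Rank and targets are assumptions.
lora_cfg = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_cfg)
model.print_trainable_parameters()   # a fraction of a percent of the base model

# Training would pair prompts that serialize the scene description, LLaVA-derived
# visual features, and domain knowledge with the observed crossing decision.
```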
[203] An Agentic Framework for Neuro-Symbolic Programming
Aliakbar Nafar, Chetan Chigurupati, Danial Kamali, Hamid Karimian, Parisa Kordjamshidi
Main category: cs.AI
TL;DR: ADS is an agentic workflow that translates free-form task descriptions into complete DomiKnowS programs, eliminating the need for users to learn the library’s specific syntax.
Details
Motivation: Integrating symbolic constraints into deep learning models improves robustness, interpretability, and data-efficiency, but remains time-consuming. Existing frameworks like DomiKnowS help but still require users to learn specific syntax, creating a barrier to adoption.Method: AgenticDomiKnowS (ADS) uses an agentic workflow that translates free-form task descriptions into complete DomiKnowS programs. It creates and tests each DomiKnowS component separately, with optional human-in-the-loop intervention for refinement by experienced users.
Result: ADS enables both experienced DomiKnowS users and non-users to rapidly construct neuro-symbolic programs, reducing development time from hours to 10-15 minutes.
Conclusion: ADS successfully eliminates the dependency on library-specific syntax, making neuro-symbolic programming more accessible and significantly reducing development time through its agentic workflow with optional human intervention.
Abstract: Integrating symbolic constraints into deep learning models could make them more robust, interpretable, and data-efficient. Still, it remains a time-consuming and challenging task. Existing frameworks like DomiKnowS help this integration by providing a high-level declarative programming interface, but they still assume the user is proficient with the library’s specific syntax. We propose AgenticDomiKnowS (ADS) to eliminate this dependency. ADS translates free-form task descriptions into a complete DomiKnowS program using an agentic workflow that creates and tests each DomiKnowS component separately. The workflow supports optional human-in-the-loop intervention, enabling users familiar with DomiKnowS to refine intermediate outputs. We show how ADS enables experienced DomiKnowS users and non-users to rapidly construct neuro-symbolic programs, reducing development time from hours to 10-15 minutes.
[204] Your Brain on ChatGPT: Accumulation of Cognitive Debt when Using an AI Assistant for Essay Writing Task
Nataliya Kosmyna, Eugene Hauptmann, Ye Tong Yuan, Jessica Situ, Xian-Hao Liao, Ashly Vivian Beresnitzky, Iris Braunstein, Pattie Maes
Main category: cs.AI
TL;DR: LLM-assisted essay writing reduces cognitive engagement, weakens brain connectivity, and lowers essay ownership compared to brain-only or search engine use.
Details
Motivation: To understand the neural and behavioral consequences of using LLMs for essay writing compared to traditional methods (brain-only or search engines), and assess potential cognitive costs of AI assistance in education.Method: 54 participants divided into LLM, Search Engine, and Brain-only groups completed three essay writing sessions. In session 4, groups were switched (LLM-to-Brain and Brain-to-LLM). EEG measured cognitive load, NLP analyzed essays, and human/AI judges scored essays. Brain connectivity, linguistic patterns, and self-reported ownership were assessed.
Result: Brain-only participants showed strongest brain connectivity, Search Engine moderate, LLM weakest. LLM users had lowest essay ownership and struggled with self-quotation. LLM-to-Brain participants showed reduced connectivity (under-engagement), while Brain-to-LLM users exhibited higher memory recall and activation similar to Search Engine users. LLM users consistently underperformed across neural, linguistic, and behavioral measures over four months.
Conclusion: While LLMs offer convenience, they come with cognitive costs including reduced brain engagement, weaker connectivity, and lower ownership of work. These findings raise concerns about long-term educational implications of LLM reliance and highlight the need for deeper investigation into AI’s role in learning.
Abstract: This study explores the neural and behavioral consequences of LLM-assisted essay writing. Participants were divided into three groups: LLM, Search Engine, and Brain-only (no tools). Each completed three sessions under the same condition. In a fourth session, LLM users were reassigned to Brain-only group (LLM-to-Brain), and Brain-only users were reassigned to LLM condition (Brain-to-LLM). A total of 54 participants took part in Sessions 1-3, with 18 completing session 4. We used electroencephalography (EEG) to assess cognitive load during essay writing, and analyzed essays using NLP, as well as scoring essays with the help from human teachers and an AI judge. Across groups, NERs, n-gram patterns, and topic ontology showed within-group homogeneity. EEG revealed significant differences in brain connectivity: Brain-only participants exhibited the strongest, most distributed networks; Search Engine users showed moderate engagement; and LLM users displayed the weakest connectivity. Cognitive activity scaled down in relation to external tool use. In session 4, LLM-to-Brain participants showed reduced alpha and beta connectivity, indicating under-engagement. Brain-to-LLM users exhibited higher memory recall and activation of occipito-parietal and prefrontal areas, similar to Search Engine users. Self-reported ownership of essays was the lowest in the LLM group and the highest in the Brain-only group. LLM users also struggled to accurately quote their own work. While LLMs offer immediate convenience, our findings highlight potential cognitive costs. Over four months, LLM users consistently underperformed at neural, linguistic, and behavioral levels. These results raise concerns about the long-term educational implications of LLM reliance and underscore the need for deeper inquiry into AI’s role in learning.
[205] Combinatorial Creativity: A New Frontier in Generalization Abilities
Samuel Schapiro, Sumuk Shashidhar, Alexi Gladstone, Jonah Black, Royce Moon, Dilek Hakkani-Tur, Lav R. Varshney
Main category: cs.AI
TL;DR: LLMs show scaling laws for creativity with optimal model sizes for fixed compute, revealing a novelty-utility tradeoff that explains the ideation-execution gap in scientific idea generation.
Details
Motivation: Existing frameworks don't address how LLMs generalize for creative tasks like scientific idea generation, which requires evaluating novelty and utility rather than correctness against fixed targets.
Method: Proposed theoretical framework and algorithmic task to evaluate outputs by novelty and utility degrees, then empirically studied scaling behavior of creativity in LLMs across different model sizes.
Result: (1) First insights into creativity scaling laws for LLMs; (2) Optimal model depths and widths exist for creative ability with fixed compute budgets; (3) Ideation-execution gap explained by fundamental novelty-utility tradeoff.
Conclusion: The framework and findings serve as starting point for understanding and improving creativity in frontier-scale models, helping bridge the gap between human and machine intelligence.
Abstract: Artificial intelligence (AI) systems, and Large Language Models (LLMs) in particular, are increasingly employed for creative tasks like scientific idea generation, constituting a form of generalization from training data unaddressed by existing conceptual frameworks. Despite its similarities to compositional generalization (CG), combinatorial creativity (CC) is an open-ended ability. Instead of evaluating for accuracy or correctness against fixed targets, which would contradict the open-ended nature of CC, we propose a theoretical framework and algorithmic task for evaluating outputs by their degrees of novelty and utility. From here, we make several important empirical contributions: (1) We obtain the first insights into the scaling behavior of creativity for LLMs. (2) We discover that, for fixed compute budgets, there exist optimal model depths and widths for creative ability. (3) We find that the ideation-execution gap, whereby LLMs excel at generating novel scientific ideas but struggle to ensure their practical feasibility, may be explained by a more fundamental novelty-utility tradeoff characteristic of creativity algorithms in general. Though our findings persist up to the 100M scale, frontier models today are well into the billions of parameters. Therefore, our conceptual framework and empirical findings can best serve as a starting point for understanding and improving the creativity of frontier-size models today, as we begin to bridge the gap between human and machine intelligence.
[206] SUSTAINABLE Platform: Seamless Smart Farming Integration Towards Agronomy Automation
Agorakis Bompotas, Konstantinos Koutras, Nikitas Rigas Kalogeropoulos, Panagiotis Kechagias, Dimitra Gariza, Athanasios P. Kalogeras, Christos Alexakos
Main category: cs.AI
TL;DR: SUSTAINABLE is a smart farming platform integrating IoT, AI, satellite imaging, and role-based task orchestration for sustainable agriculture, with a pilot in viticulture.
Details
Motivation: The agricultural sector faces challenges from increasing food demands, climate variability, and the need for sustainable practices, requiring innovative solutions for efficient and traceable farming.
Method: The paper explores existing smart agriculture solutions, conducts a comparative evaluation, and introduces the SUSTAINABLE platform with key features: satellite index integration, real-time environmental data collection, and role-aware task management specifically designed for Mediterranean vineyards.
Result: The paper presents SUSTAINABLE platform as a comprehensive solution that enables efficient, traceable, and sustainable agriculture through integration of multiple technologies, with specific application demonstrated in viticulture.
Conclusion: SUSTAINABLE represents a transformative smart farming platform that addresses key agricultural challenges through technological integration, offering a practical solution for sustainable agriculture with demonstrated applicability in Mediterranean vineyards.
Abstract: The global agricultural sector is undergoing a transformative shift, driven by increasing food demands, climate variability and the need for sustainable practices. SUSTAINABLE is a smart farming platform designed to integrate IoT, AI, satellite imaging, and role-based task orchestration to enable efficient, traceable, and sustainable agriculture with a pilot use case in viticulture. This paper explores current smart agriculture solutions, presents a comparative evaluation, and introduces SUSTAINABLE’s key features, including satellite index integration, real-time environmental data, and role-aware task management tailored to Mediterranean vineyards.
[207] Improving Autoformalization Using Direct Dependency Retrieval
Shaoqi Wang, Lu Yu, Siwei Lou, Feng Yan, Chunjie Yang, Qing Cui, Jun Zhou
Main category: cs.AI
TL;DR: Proposes DDR (Direct Dependency Retrieval) framework for statement autoformalization, improving retrieval precision/recall and autoformalization performance.
Details
Motivation: Existing autoformalization methods lack contextual awareness (causing hallucinations) and retrieval-augmented approaches have poor precision/recall for formal library dependencies, lacking scalability for large datasets.
Method: DDR framework directly generates candidate library dependencies from natural language descriptions and verifies them via efficient suffix array checks. Built 500k+ sample dataset and fine-tuned high-precision DDR model. A toy sketch of this check follows the abstract below.
Result: DDR model significantly outperforms SOTA methods in retrieval precision and recall. Autoformalizer with DDR shows consistent advantages in single-attempt accuracy and multi-attempt stability over traditional RAG methods.
Conclusion: DDR framework effectively addresses key challenges in statement autoformalization by improving dependency retrieval and enabling scalable, high-performance autoformalization.
Abstract: The convergence of deep learning and formal mathematics has spurred research in formal verification. Statement autoformalization, a crucial first step in this process, aims to translate informal descriptions into machine-verifiable representations but remains a significant challenge. The core difficulty lies in the fact that existing methods often suffer from a lack of contextual awareness, leading to hallucination of formal definitions and theorems. Furthermore, current retrieval-augmented approaches exhibit poor precision and recall for formal library dependency retrieval, and lack the scalability to effectively leverage ever-growing public datasets. To bridge this gap, we propose a novel retrieval-augmented framework based on DDR (\textit{Direct Dependency Retrieval}) for statement autoformalization. Our DDR method directly generates candidate library dependencies from natural language mathematical descriptions and subsequently verifies their existence within the formal library via an efficient suffix array check. Leveraging this efficient search mechanism, we constructed a dependency retrieval dataset of over 500,000 samples and fine-tuned a high-precision DDR model. Experimental results demonstrate that our DDR model significantly outperforms SOTA methods in both retrieval precision and recall. Consequently, an autoformalizer equipped with DDR shows consistent performance advantages in both single-attempt accuracy and multi-attempt stability compared to models using traditional selection-based RAG methods.
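To make the dependency-verification step concrete, here is a minimal Python sketch of a suffix-array membership check, the kind of lookup DDR uses to confirm that LLM-proposed dependency names actually exist in the formal library. The `formal_library` string and candidate names are invented for illustration, and the naive construction is far cruder than what a 500k-sample corpus would require.

```python
def build_suffix_array(text: str) -> list[int]:
    """Naive suffix array: start indices of all suffixes, sorted lexicographically.
    O(n^2 log n); a real corpus would need a linear-time construction."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def occurs(text: str, sa: list[int], query: str) -> bool:
    """Binary search over the suffix array for any suffix that starts with `query`."""
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + len(query)] < query:
            lo = mid + 1
        else:
            hi = mid
    return lo < len(sa) and text[sa[lo]:].startswith(query)


# Hypothetical formal-library text and LLM-proposed candidate dependencies.
formal_library = "theorem add_comm theorem mul_comm def Nat.succ lemma two_mul"
sa = build_suffix_array(formal_library)
candidates = ["add_comm", "Nat.succ", "made_up_lemma"]
print([c for c in candidates if occurs(formal_library, sa, c)])
# ['add_comm', 'Nat.succ'] -- the hallucinated name is filtered out
```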
[208] NormCode: A Semi-Formal Language for Auditable AI Planning
Xin Guan, Yunshan Li, Zekun Wu, Ruibo Zhang
Main category: cs.AI
TL;DR: NormCode is a semi-formal language that makes AI workflows auditable by enforcing data isolation between inference steps, eliminating context pollution and enabling full transparency.
Details
Motivation: As AI systems move into high-stakes domains like legal reasoning, medical diagnosis, and finance, regulators demand auditability. Current LLM-based workflows are opaque due to context pollution (information accumulation causing hallucinations) and implicit data flow that prevents reconstructing what each step actually received.
Method: NormCode enforces data isolation where each inference step can only access explicitly passed inputs. It separates semantic operations (probabilistic LLM reasoning) from syntactic operations (deterministic data flow). Uses a multi-format ecosystem (NCDS, NCD, NCN, NCDN files) for different stakeholders. A four-phase compilation pipeline transforms natural language intent into executable JSON repositories, with a visual Canvas app for real-time graph visualization and breakpoint debugging.
Result: Achieved full accuracy on base X addition and self-hosted execution of the NormCode compiler itself. Demonstrates that structured intermediate representations can bridge human intuition and machine rigor while maintaining full transparency.
Conclusion: NormCode provides auditable AI workflows by construction, eliminating cross-step contamination and ensuring every intermediate state can be inspected. This enables clear distinction between inference and mechanical restructuring, making AI systems suitable for high-stakes domains requiring regulatory compliance.
Abstract: As AI systems move into high stakes domains such as legal reasoning, medical diagnosis, and financial decision making, regulators and practitioners increasingly demand auditability. Auditability means the ability to trace exactly what each step in a multi step workflow saw and did. Current large language model based workflows are fundamentally opaque. Context pollution, defined as the accumulation of information across reasoning steps, causes models to hallucinate and lose track of constraints. At the same time, implicit data flow makes it impossible to reconstruct what any given step actually received as input. We present NormCode, a semi formal language that makes AI workflows auditable by construction. Each inference step operates in enforced data isolation and can access only explicitly passed inputs. This eliminates cross step contamination and ensures that every intermediate state can be inspected. A strict separation between semantic operations, meaning probabilistic language model reasoning, and syntactic operations, meaning deterministic data flow, allows auditors to clearly distinguish inference from mechanical restructuring. The multi format ecosystem, consisting of NCDS, NCD, NCN, and NCDN files, allows developers, domain experts, and auditors to inspect the same plan in formats suited to their individual needs. A four phase compilation pipeline transforms natural language intent into executable JSON repositories. A visual Canvas application provides real time graph visualization and breakpoint debugging. We validate the approach by achieving full accuracy on base X addition and by self hosted execution of the NormCode compiler itself. These results demonstrate that structured intermediate representations can bridge human intuition and machine rigor while maintaining full transparency.
[209] Memento 2: Learning by Stateful Reflective Memory
Jun Wang
Main category: cs.AI
TL;DR: This paper introduces a framework for continual learning in LLM-based agents using episodic memory and reflection, formalized as Stateful Reflective Decision Process (SRDP), with a Read-Write Reflective Learning algorithm that converges to optimality as memory grows.
Details
Motivation: To enable continual learning in LLM-based agents without fine-tuning model weights, by leveraging reflection - the ability to revisit past experiences and adjust future action selection. This addresses the need for agents that can adapt continuously through experience-driven learning.
Method: The authors introduce the Stateful Reflective Decision Process (SRDP) framework where agents maintain episodic memory and alternate between writing new experiences and reading relevant cases to guide decisions. They develop a Read-Write Reflective Learning algorithm that incorporates memory retrieval into soft policy iteration, with formal convergence proofs. A simplified illustration of the read step follows the abstract below.
Result: The paper proves that the Read-Write Reflective Learning algorithm converges, and shows that as memory grows and more densely covers the task environment, the resulting policy approaches optimality. The framework successfully unifies memory-based reasoning with reinforcement learning.
Conclusion: The SRDP framework provides a formal foundation for LLM agents capable of continual, experience-driven learning by integrating episodic memory with reinforcement learning through reflection mechanisms, enabling adaptation without model weight fine-tuning.
Abstract: We study continual learning in large language model (LLM) based agents that integrate episodic memory with reinforcement learning. We focus on reflection, the ability of an agent to revisit past experience and adjust how it selects future actions, as the central mechanism for continual adaptation without fine tuning model weights. To formalise this, we introduce the Stateful Reflective Decision Process (SRDP), in which an agent maintains and updates episodic memory and alternates between writing new experiences to memory and reading relevant cases to guide decisions. This framework casts reflective memory dynamics as part of the decision process itself and makes them amenable to control and learning analysis. Building on this formulation, we develop a Read-Write Reflective Learning algorithm that incorporates memory retrieval into a soft policy iteration procedure and prove that it converges. We further show that as memory grows and more densely covers the task environment, the resulting policy approaches optimality. Our framework unifies memory based reasoning with reinforcement learning and provides a formal foundation for LLM agents capable of continual, experience driven learning.
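As a rough illustration of the read step, the sketch below retrieves the most similar past cases from an episodic memory of (state, action, return) tuples and turns their returns into a softmax policy. This is a simplified stand-in under invented assumptions (cosine retrieval, per-action return averaging), not the paper's SRDP or Read-Write Reflective Learning algorithm.

```python
import numpy as np

def soft_policy_from_memory(query_state, memory, n_actions, k=3, temperature=1.0):
    """Retrieve the k most similar past cases and form a softmax policy over actions.

    memory: list of (state_vector, action_index, observed_return) tuples.
    """
    states = np.stack([m[0] for m in memory])
    sims = states @ query_state / (
        np.linalg.norm(states, axis=1) * np.linalg.norm(query_state) + 1e-8)
    top = np.argsort(-sims)[:k]

    # Average retrieved returns per action; actions never retrieved keep a neutral value of 0.
    q = np.zeros(n_actions)
    counts = np.zeros(n_actions)
    for i in top:
        _, a, g = memory[i]
        q[a] += g
        counts[a] += 1
    q = np.where(counts > 0, q / np.maximum(counts, 1), 0.0)

    logits = q / temperature
    probs = np.exp(logits - logits.max())
    return probs / probs.sum()

# Toy usage with a hypothetical 2-dimensional state and 3 actions.
memory = [(np.array([1.0, 0.0]), 0, 1.0),
          (np.array([0.9, 0.1]), 1, 0.2),
          (np.array([0.0, 1.0]), 2, 0.8)]
print(soft_policy_from_memory(np.array([1.0, 0.05]), memory, n_actions=3))
```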
[210] Benchmark Success, Clinical Failure: When Reinforcement Learning Optimizes for Benchmarks, Not Patients
Armin Berger, Manuela Bergau, Helen Schneider, Saad Ahmad, Tom Anglim Lagones, Gianluca Brugnara, Martha Foltyn-Dumitru, Kai Schlamp, Philipp Vollmuth, Rafet Sifa
Main category: cs.AI
TL;DR: ChexReason is a medical vision-language model trained with limited resources (2K SFT + 1K RL samples, single A100) that shows RL improves in-distribution performance but hurts cross-dataset generalization, revealing a fundamental tension in medical AI deployment.
Details
Motivation: To explore how resource-constrained RL training affects medical vision-language models, particularly the trade-off between in-distribution performance and cross-dataset generalization in medical imaging applications.
Method: ChexReason model trained via R1-style methodology: supervised fine-tuning (SFT) followed by GRPO (Group Relative Policy Optimization) using only 2,000 SFT samples and 1,000 RL samples on a single A100 GPU. A generic sketch of the GRPO advantage step follows the abstract below.
Result: GRPO improves in-distribution performance (23% improvement on CheXpert, macro-F1 = 0.346) but degrades cross-dataset transferability (19% drop on NIH). SFT checkpoint uniquely improves NIH performance before RL optimization, suggesting teacher-guided reasoning captures more institution-agnostic features.
Conclusion: There’s a generalization paradox where RL optimization hurts cross-dataset robustness. Curated supervised fine-tuning may outperform aggressive RL for clinical deployment requiring robustness across diverse populations, as the issue stems from the RL paradigm rather than model scale.
Abstract: Recent Reinforcement Learning (RL) advances for Large Language Models (LLMs) have improved reasoning tasks, yet their resource-constrained application to medical imaging remains underexplored. We introduce ChexReason, a vision-language model trained via R1-style methodology (SFT followed by GRPO) using only 2,000 SFT samples, 1,000 RL samples, and a single A100 GPU. Evaluations on CheXpert and NIH benchmarks reveal a fundamental tension: GRPO recovers in-distribution performance (23% improvement on CheXpert, macro-F1 = 0.346) but degrades cross-dataset transferability (19% drop on NIH). This mirrors high-resource models like NV-Reason-CXR-3B, suggesting the issue stems from the RL paradigm rather than scale. We identify a generalization paradox where the SFT checkpoint uniquely improves on NIH before optimization, indicating teacher-guided reasoning captures more institution-agnostic features. Furthermore, cross-model comparisons show structured reasoning scaffolds benefit general-purpose VLMs but offer minimal gain for medically pre-trained models. Consequently, curated supervised fine-tuning may outperform aggressive RL for clinical deployment requiring robustness across diverse populations.
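For readers unfamiliar with GRPO, the group-relative advantage computation at its core is simple to state: each sampled completion's reward is normalized against the mean and standard deviation of its own rollout group. The sketch below shows this generic step only; it is not ChexReason's training code, and the reward values are made up.

```python
import numpy as np

def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantages: normalize each sampled completion's reward
    against the mean and std of its own group of rollouts for the same prompt."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)

# One prompt, four sampled radiology reports scored by a (hypothetical) reward model.
rewards = [0.2, 0.9, 0.4, 0.9]
print(group_relative_advantages(rewards))
```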
[211] CubeBench: Diagnosing Interactive, Long-Horizon Spatial Reasoning Under Partial Observations
Huan-ang Gao, Zikang Zhang, Tianwei Luo, Kaisen Yang, Xinzhe Juan, Jiahao Qiu, Tianxing Chen, Bingxiang He, Hao Zhao, Hao Zhou, Shilong Liu, Mengdi Wang
Main category: cs.AI
TL;DR: CubeBench is a Rubik’s Cube-based benchmark that reveals LLM agents’ critical limitations in spatial reasoning, long-horizon planning, and active exploration needed for physical-world deployment, showing 0% success on complex tasks.
Details
Motivation: LLM agents excel in digital domains but struggle with physical-world deployment due to lacking robust spatial mental models. The paper aims to identify and evaluate three core cognitive challenges preventing this transition: spatial reasoning, long-horizon state tracking via mental simulation, and active exploration under partial observation.
Method: Introduces CubeBench, a novel generative benchmark centered on the Rubik’s Cube. Uses a three-tiered diagnostic framework that progressively assesses agent capabilities: from foundational state tracking with full symbolic information to active exploration with only partial visual data. Also proposes a diagnostic framework to isolate cognitive bottlenecks by providing external solver tools.
Result: Experiments on leading LLMs reveal critical limitations, including a uniform 0.00% pass rate on all long-horizon tasks, exposing fundamental failure in long-term planning. The benchmark successfully identifies specific cognitive bottlenecks in spatial reasoning and mental simulation.
Conclusion: CubeBench provides key insights into LLM agents’ limitations in physical-world cognitive capabilities. The findings guide development of more physically-grounded intelligent agents by identifying specific areas needing improvement: spatial reasoning, long-horizon planning, and active exploration under partial observation.
Abstract: Large Language Model (LLM) agents, while proficient in the digital realm, face a significant gap in physical-world deployment due to the challenge of forming and maintaining a robust spatial mental model. We identify three core cognitive challenges hindering this transition: spatial reasoning, long-horizon state tracking via mental simulation, and active exploration under partial observation. To isolate and evaluate these faculties, we introduce CubeBench, a novel generative benchmark centered on the Rubik’s Cube. CubeBench uses a three-tiered diagnostic framework that progressively assesses agent capabilities, from foundational state tracking with full symbolic information to active exploration with only partial visual data. Our experiments on leading LLMs reveal critical limitations, including a uniform 0.00% pass rate on all long-horizon tasks, exposing a fundamental failure in long-term planning. We also propose a diagnostic framework to isolate these cognitive bottlenecks by providing external solver tools. By analyzing the failure modes, we provide key insights to guide the development of more physically-grounded intelligent agents.
[212] Thinking on Maps: How Foundation Model Agents Explore, Remember, and Reason Map Environments
Zhiwei Wei, Yuxing Liu, Hua Liao, Wenjia Xu
Main category: cs.AI
TL;DR: Interactive evaluation framework for foundation model agents’ spatial understanding in symbolic map environments, revealing distinct roles of exploration, memory, and reasoning components.
Details
Motivation: Existing evaluations of spatial ability in foundation models rely on static map inputs or text-based queries, overlooking the interactive and experience-driven nature of spatial understanding needed for reliable map-based reasoning and applications.
Method: Interactive evaluation framework where agents incrementally explore partially observable grid-based maps (roads, intersections, POIs) with local observations, then evaluated on six spatial tasks. Systematic variation of exploration strategies, memory representations, and reasoning schemes across multiple foundation models. A toy graph-memory sketch follows the abstract below.
Result: Exploration affects experience acquisition but has limited impact on final reasoning accuracy. Memory representation plays central role in consolidating spatial experience, with structured memories (sequential and graph-based) substantially improving performance on structure-intensive tasks like path planning. Reasoning schemes shape how stored knowledge is used, with advanced prompts supporting more effective multi-step inference. Spatial reasoning performance saturates beyond certain capability thresholds.
Conclusion: Improvements in map-based spatial understanding require mechanisms tailored to spatial representation and reasoning rather than scaling alone, as performance saturates across model versions and scales beyond certain capability thresholds.
Abstract: Map environments provide a fundamental medium for representing spatial structure. Understanding how foundation model (FM) agents understand and act in such environments is therefore critical for enabling reliable map-based reasoning and applications. However, most existing evaluations of spatial ability in FMs rely on static map inputs or text-based queries, overlooking the interactive and experience-driven nature of spatial understanding. In this paper, we propose an interactive evaluation framework to analyze how FM agents explore, remember, and reason in symbolic map environments. Agents incrementally explore partially observable grid-based maps consisting of roads, intersections, and points of interest (POIs), receiving only local observations at each step. Spatial understanding is then evaluated using six kinds of spatial tasks. By systematically varying exploration strategies, memory representations, and reasoning schemes across multiple foundation models, we reveal distinct functional roles of these components. Exploration primarily affects experience acquisition but has a limited impact on final reasoning accuracy. In contrast, memory representation plays a central role in consolidating spatial experience, with structured memories, particularly sequential and graph-based representations, substantially improving performance on structure-intensive tasks such as path planning. Reasoning schemes further shape how stored spatial knowledge is used, with advanced prompts supporting more effective multi-step inference. We further observe that spatial reasoning performance saturates across model versions and scales beyond a certain capability threshold, indicating that improvements in map-based spatial understanding require mechanisms tailored to spatial representation and reasoning rather than scaling alone.
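A graph-based spatial memory of the kind the paper finds most useful can be illustrated in a few lines of networkx: explored road segments and POIs are stored as a weighted graph, and structure-intensive queries such as path planning reduce to shortest-path search over that memory. Node names and edge weights below are invented for illustration.

```python
import networkx as nx

# Graph-based spatial memory: nodes are intersections / POIs discovered so far,
# edges are road segments observed during exploration (weights = step distance).
memory = nx.Graph()
memory.add_weighted_edges_from([
    ("start", "A", 1), ("A", "B", 1), ("B", "cafe", 2),
    ("A", "C", 3), ("C", "cafe", 1),
])

# A path-planning query answered purely from the consolidated memory.
path = nx.shortest_path(memory, "start", "cafe", weight="weight")
length = nx.shortest_path_length(memory, "start", "cafe", weight="weight")
print(path, length)  # ['start', 'A', 'B', 'cafe'] 4
```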
cs.SD
[213] IKFST: IOO and KOO Algorithms for Accelerated and Precise WFST-based End-to-End Automatic Speech Recognition
Zhuoran Zhuang, Ye Chen, Chao Luo, Tian-Hao Zhang, Xuewei Zhang, Jian Ma, Jiatong Shi, Wei Zhang
Main category: cs.SD
TL;DR: The paper introduces two new WFST decoding algorithms (Keep-Only-One and Insert-Only-One) that exploit the structural roles of blank vs non-blank frames in CTC outputs to achieve faster inference without accuracy loss.
Details
Motivation: WFST decoding in ASR suffers from inefficiency due to frame-by-frame autoregressive search over CTC posterior probabilities. The authors aim to establish better compatibility between WFST decoding and CTC modeling by understanding the fundamental roles of blank and non-blank frames.
Method: Systematically study blank vs non-blank frames in CTC outputs, identifying that blank frames encode positional information while non-blank frames carry semantic content. Based on this insight, develop two decoding algorithms: Keep-Only-One (exploits blank frames’ positional role) and Insert-Only-One (exploits non-blank frames’ semantic role). An illustrative CTC-collapse sketch follows the abstract below.
Result: Experiments on large-scale in-house, AISHELL-1, and LibriSpeech datasets show state-of-the-art recognition accuracy with substantially reduced decoding latency. The methods enable truly efficient and high-performance WFST decoding.
Conclusion: By understanding the structural roles of blank and non-blank frames in CTC outputs, the proposed algorithms achieve significantly faster WFST-based inference without compromising accuracy, making WFST decoding more efficient for modern speech recognition systems.
Abstract: End-to-end automatic speech recognition has become the dominant paradigm in both academia and industry. To enhance recognition performance, the Weighted Finite-State Transducer (WFST) is widely adopted to integrate acoustic and language models through static graph composition, providing robust decoding and effective error correction. However, WFST decoding relies on a frame-by-frame autoregressive search over CTC posterior probabilities, which severely limits inference efficiency. Motivated by establishing a more principled compatibility between WFST decoding and CTC modeling, we systematically study the two fundamental components of CTC outputs, namely blank and non-blank frames, and identify a key insight: blank frames primarily encode positional information, while non-blank frames carry semantic content. Building on this observation, we introduce Keep-Only-One and Insert-Only-One, two decoding algorithms that explicitly exploit the structural roles of blank and non-blank frames to achieve significantly faster WFST-based inference without compromising recognition accuracy. Experiments on large-scale in-house, AISHELL-1, and LibriSpeech datasets demonstrate state-of-the-art recognition accuracy with substantially reduced decoding latency, enabling truly efficient and high-performance WFST decoding in modern speech recognition systems.
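The observation about blank versus non-blank frames is easiest to see in the standard greedy CTC collapse, sketched below: blank frames mostly mark positions between symbols, while non-blank frames carry the emitted labels. This is the ordinary collapse rule, not the Keep-Only-One or Insert-Only-One algorithms (which operate inside WFST decoding), and the frame labels are invented.

```python
BLANK = 0

def greedy_ctc_collapse(frame_labels):
    """Standard CTC collapse: merge repeated labels, then drop blanks.
    Blank frames mainly delimit positions; non-blank frames carry the symbols."""
    out, prev = [], None
    for lab in frame_labels:
        if lab != prev and lab != BLANK:
            out.append(lab)
        prev = lab
    return out

# Per-frame argmax labels from a (hypothetical) CTC acoustic model.
frames = [0, 0, 3, 3, 0, 0, 5, 5, 5, 0, 3, 0]
print(greedy_ctc_collapse(frames))  # [3, 5, 3]
```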
[214] Latent Flow Matching for Expressive Singing Voice Synthesis
Minhyeok Yun, Yong-Hoon Choi
Main category: cs.SD
TL;DR: FM-Singer improves singing voice synthesis by using conditional flow matching in latent space to reduce prior-posterior mismatch, enhancing expressiveness while maintaining efficient parallel decoding.
Details
Motivation: cVAE-based singing voice synthesis suffers from prior-posterior mismatch during inference, causing degradation of fine-grained expressiveness like vibrato and micro-prosody, even though training uses posterior latents from real recordings.
Method: Proposes FM-Singer with conditional flow matching (CFM) in latent space to learn a continuous vector field that transports prior latents toward posterior latents along an optimal-transport-inspired path. At inference, solves an ODE to refine prior samples before waveform generation. A minimal ODE-refinement sketch follows the abstract below.
Result: Experiments on Korean and Chinese singing datasets show consistent improvements over strong baselines: lower mel-cepstral distortion, lower fundamental-frequency error, and higher perceptual scores on Korean dataset.
Conclusion: FM-Singer effectively addresses prior-posterior mismatch in cVAE-based singing synthesis, improving expressiveness while preserving the efficiency of parallel decoding through latent flow refinement.
Abstract: Conditional variational autoencoder (cVAE)-based singing voice synthesis provides efficient inference and strong audio quality by learning a score-conditioned prior and a recording-conditioned posterior latent space. However, because synthesis relies on prior samples while training uses posterior latents inferred from real recordings, imperfect distribution matching can cause a prior-posterior mismatch that degrades fine-grained expressiveness such as vibrato and micro-prosody. We propose FM-Singer, which introduces conditional flow matching (CFM) in latent space to learn a continuous vector field transporting prior latents toward posterior latents along an optimal-transport-inspired path. At inference time, the learned latent flow refines a prior sample by solving an ordinary differential equation (ODE) before waveform generation, improving expressiveness while preserving the efficiency of parallel decoding. Experiments on Korean and Chinese singing datasets demonstrate consistent improvements over strong baselines, including lower mel-cepstral distortion and fundamental-frequency error and higher perceptual scores on the Korean dataset. Code, pretrained checkpoints, and audio demos are available at https://github.com/alsgur9368/FM-Singer
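The inference-time refinement can be pictured as a few Euler steps of an ODE driven by a learned vector field that moves a prior latent toward the posterior region before decoding. The sketch below uses a placeholder MLP and step count and ignores conditioning, so it illustrates the mechanism rather than FM-Singer's actual architecture.

```python
import torch
import torch.nn as nn

class VectorField(nn.Module):
    """Placeholder flow-matching vector field v(z, t); the real model is condition-aware."""
    def __init__(self, dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim + 1, 128), nn.SiLU(), nn.Linear(128, dim))

    def forward(self, z, t):
        t_col = t.expand(z.shape[0], 1)
        return self.net(torch.cat([z, t_col], dim=-1))

@torch.no_grad()
def refine_prior_latent(vf, z_prior, n_steps=8):
    """Euler ODE solve from t=0 (prior sample) to t=1 (refined latent)."""
    z, dt = z_prior, 1.0 / n_steps
    for i in range(n_steps):
        t = torch.tensor([[i * dt]])
        z = z + dt * vf(z, t)
    return z

vf = VectorField(dim=16)
z0 = torch.randn(4, 16)           # latents sampled from the score-conditioned prior
z1 = refine_prior_latent(vf, z0)  # refined latents passed on to the waveform decoder
print(z1.shape)                   # torch.Size([4, 16])
```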
[215] Timed text extraction from Taiwanese Kua-á-hì TV series
Tzu-Hung Huang, Yun-En Tsai, Yun-Ning Hung, Chih-Wei Wu, I-Chieh Wei, Li Su
Main category: cs.SD
TL;DR: Developed interactive OCR correction system and two-step segmentation approach to extract vocal segments with lyrics from low-quality Taiwanese opera TV archives for MIR research.
Details
Motivation: Taiwanese opera TV archives are valuable for research but have low quality and require extensive manual effort for data preparation, creating a need for automated processing tools.
Method: Created interactive system for real-time OCR correction, plus two-step approach combining OCR-driven segmentation with Speech and Music Activity Detection (SMAD) to identify vocal segments from archival episodes.
Result: Developed dataset of vocal segments with corresponding lyrics that can support MIR tasks like lyrics identification and tune retrieval. System achieves high precision in segment identification.
Conclusion: The proposed system efficiently processes low-quality Taiwanese opera archives, creating valuable datasets for music information retrieval research while reducing manual effort.
Abstract: Taiwanese opera (Kua-á-hì), a major form of local theatrical tradition, underwent extensive television adaptation notably by pioneers like Iûnn Lē-hua. These videos, while potentially valuable for in-depth studies of Taiwanese opera, often have low quality and require substantial manual effort during data preparation. To streamline this process, we developed an interactive system for real-time OCR correction and a two-step approach integrating OCR-driven segmentation with Speech and Music Activity Detection (SMAD) to efficiently identify vocal segments from archival episodes with high precision. The resulting dataset, consisting of vocal segments and corresponding lyrics, can potentially support various MIR tasks such as lyrics identification and tune retrieval. Code is available at https://github.com/z-huang/ocr-subtitle-editor .
[216] Investigating the Viability of Employing Multi-modal Large Language Models in the Context of Audio Deepfake Detection
Akanksha Chuchra, Shukesh Reddy, Sudeepta Mishra, Abhijit Das, Abhinav Dhall
Main category: cs.SD
TL;DR: MLLMs show potential for audio deepfake detection when combining audio inputs with multi-prompt text queries, but require task-specific training and struggle with out-of-domain generalization.
Details
Motivation: While VLMs and MLLMs have shown strong generalization for image/video deepfake detection, their application to audio deepfake detection remains largely unexplored, creating a research gap that needs investigation.
Method: Combine audio inputs with text prompts (text-aware, context-rich, question-answer based prompts with binary decisions) to explore MLLMs for audio deepfake detection. Evaluate two MLLMs (Qwen2-Audio-7B-Instruct and SALMONN) in zero-shot and fine-tuned modes.
Result: Models perform poorly without task-specific training and struggle to generalize to out-of-domain data. However, they achieve good performance on in-domain data with minimal supervision, indicating promising potential. Combining audio with multi-prompt approach appears viable.
Conclusion: MLLMs show promising potential for audio deepfake detection when using feature-guided reasoning with audio-text multimodal approaches, but require task-specific training and face challenges with out-of-domain generalization.
Abstract: While Vision-Language Models (VLMs) and Multimodal Large Language Models (MLLMs) have shown strong generalisation in detecting image and video deepfakes, their use for audio deepfake detection remains largely unexplored. In this work, we aim to explore the potential of MLLMs for audio deepfake detection. We combine audio inputs with a range of text prompts as queries to assess the viability of MLLMs in learning robust representations across modalities for audio deepfake detection. Specifically, we explore text-aware and context-rich, question-answer based prompts with binary decisions. We hypothesise that such feature-guided reasoning will help in facilitating deeper multimodal understanding and enable robust feature learning for audio deepfake detection. We evaluate the performance of two MLLMs, Qwen2-Audio-7B-Instruct and SALMONN, in two evaluation modes: (a) zero-shot and (b) fine-tuned. Our experiments demonstrate that combining audio with a multi-prompt approach could be a viable way forward for audio deepfake detection. Our experiments show that the models perform poorly without task-specific training and struggle to generalise to out-of-domain data. However, they achieve good performance on in-domain data with minimal supervision, indicating promising potential for audio deepfake detection.
cs.LG
[217] Evaluating Anomaly Detectors for Simulated Highly Imbalanced Industrial Classification Problems
Lesley Wheat, Martin v. Mohrenschildt, Saeid Habibi
Main category: cs.LG
TL;DR: Paper evaluates anomaly detection algorithms on imbalanced industrial data, finding detector performance depends on number of faulty examples available, with unsupervised methods best with <20 faults and supervised methods better with 30-50 faults.
Details
Motivation: Address the challenge of extreme class imbalance in industrial machine learning applications, particularly limited faulty data availability for training anomaly detection systems in quality control and predictive maintenance.
Method: Comprehensive evaluation using problem-agnostic simulated dataset with hyper-spherical anomaly distribution in 2D and 10D. Benchmarked 14 detectors across training datasets with anomaly rates (0.05%-20%) and sizes (1,000-10,000), testing on 40,000 samples. A small scikit-learn illustration follows the abstract below.
Result: Best detector depends on total faulty examples: unsupervised methods (kNN/LOF) dominate with <20 faults; semi-supervised (XGBOD) and supervised (SVM/CatBoost) show large performance increases with 30-50 faults. Semi-supervised methods show benefits at 10 features but not 2 features. Performance drops on generalization with smaller datasets.
Conclusion: Provides practical insights for deploying anomaly detection in industrial environments, highlighting the critical role of available faulty examples in detector selection and the generalization challenges on smaller datasets.
Abstract: Machine learning offers potential solutions to current issues in industrial systems in areas such as quality control and predictive maintenance, but also faces unique barriers in industrial applications. An ongoing challenge is extreme class imbalance, primarily due to the limited availability of faulty data during training. This paper presents a comprehensive evaluation of anomaly detection algorithms using a problem-agnostic simulated dataset that reflects real-world engineering constraints. Using a synthetic dataset with a hyper-spherical based anomaly distribution in 2D and 10D, we benchmark 14 detectors across training datasets with anomaly rates between 0.05% and 20% and training sizes between 1 000 and 10 000 (with a testing dataset size of 40 000) to assess performance and generalization error. Our findings reveal that the best detector is highly dependent on the total number of faulty examples in the training dataset, with additional healthy examples offering insignificant benefits in most cases. With fewer than 20 faulty examples, unsupervised methods (kNN/LOF) dominate; with around 30-50 faulty examples, semi-supervised (XGBOD) and supervised (SVM/CatBoost) detectors show large performance increases. While semi-supervised methods do not show significant benefits with only two features, the improvements are evident at ten features. The study highlights the drop in generalization performance of anomaly detection methods on smaller datasets, and provides practical insights for deploying anomaly detection in industrial environments.
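The flavor of the unsupervised detectors compared in the study can be reproduced with scikit-learn in a few lines: a heavily imbalanced 2D dataset scored by a kNN-distance detector and by LOF. The synthetic data, neighbor counts, and top-k cutoff below are invented and much cruder than the paper's hyper-spherical benchmark.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors, LocalOutlierFactor

rng = np.random.default_rng(0)
healthy = rng.normal(0.0, 1.0, size=(2000, 2))            # majority (healthy) class
faults = rng.normal(0.0, 1.0, size=(10, 2))
faults /= np.linalg.norm(faults, axis=1, keepdims=True)    # push onto a shell
faults *= 4.0                                              # ~0.5% anomaly rate, far from the bulk
X = np.vstack([healthy, faults])

# Unsupervised kNN detector: anomaly score = distance to the k-th nearest neighbour.
knn = NearestNeighbors(n_neighbors=6).fit(X)
dists, _ = knn.kneighbors(X)
knn_score = dists[:, -1]

# LOF fit on the same unlabeled pool.
lof = LocalOutlierFactor(n_neighbors=20)
lof.fit_predict(X)
lof_score = -lof.negative_outlier_factor_

for name, score in [("kNN", knn_score), ("LOF", lof_score)]:
    flagged = np.argsort(-score)[:10]         # top-10 most anomalous points
    hits = np.sum(flagged >= len(healthy))    # how many true faults were recovered
    print(f"{name}: {hits}/10 faults in top-10 scores")
```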
[218] Yahtzee: Reinforcement Learning Techniques for Stochastic Combinatorial Games
Nicholas A. Pape
Main category: cs.LG
TL;DR: Yahtzee formulated as MDP, trained self-play agents with policy gradient methods (REINFORCE, A2C, PPO). A2C most robust, achieving near-optimal performance within 5% of DP optimal score.
Details
Motivation: Yahtzee presents interesting RL challenges due to its stochastic, combinatorial structure and delayed rewards. While solitaire Yahtzee can be solved optimally with DP, multiplayer is intractable, motivating approximation methods.
Method: Formulated Yahtzee as a Markov Decision Process (MDP). Trained self-play agents using policy gradient methods: REINFORCE, Advantage Actor-Critic (A2C), and Proximal Policy Optimization (PPO). Used a multi-headed network with a shared trunk. Conducted ablation studies on feature/action encodings, architecture, return estimators, and entropy regularization. A sketch of the shared-trunk network follows the abstract below.
Result: A2C trained robustly across settings, achieving a median score of 241.78 points (within 5.0% of the optimal DP score of 254.59). The upper section bonus was achieved at a 24.9% rate and Yahtzee at a 34.1% rate. REINFORCE and PPO were sensitive to hyperparameters and failed to reach near-optimal performance. All models struggled with the upper bonus strategy, overindexing on four-of-a-kinds.
Conclusion: A2C proved most effective for Yahtzee RL, achieving near-optimal performance. The study highlights persistent challenges in long-horizon credit assignment and exploration, particularly for complex strategies like the upper bonus. Yahtzee serves as a valuable mid-scale RL benchmark.
Abstract: Yahtzee is a classic dice game with a stochastic, combinatorial structure and delayed rewards, making it an interesting mid-scale RL benchmark. While an optimal policy for solitaire Yahtzee can be computed using dynamic programming methods, multiplayer is intractable, motivating approximation methods. We formulate Yahtzee as a Markov Decision Process (MDP), and train self-play agents using various policy gradient methods: REINFORCE, Advantage Actor-Critic (A2C), and Proximal Policy Optimization (PPO), all using a multi-headed network with a shared trunk. We ablate feature and action encodings, architecture, return estimators, and entropy regularization to understand their impact on learning. Under a fixed training budget, REINFORCE and PPO prove sensitive to hyperparameters and fail to reach near-optimal performance, whereas A2C trains robustly across a range of settings. Our agent attains a median score of 241.78 points over 100,000 evaluation games, within 5.0% of the optimal DP score of 254.59, achieving the upper section bonus and Yahtzee at rates of 24.9% and 34.1%, respectively. All models struggle to learn the upper bonus strategy, overindexing on four-of-a-kind’s, highlighting persistent long-horizon credit-assignment and exploration challenges.
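The "multi-headed network with a shared trunk" can be sketched directly in PyTorch: one trunk feeds separate heads for the re-roll decision, the category choice, and the state value. The dimensions below (state encoding size, 2^5 keep masks, 13 categories) are plausible placeholders rather than the paper's exact encoding.

```python
import torch
import torch.nn as nn

class YahtzeeActorCritic(nn.Module):
    """Shared trunk with separate policy heads (dice to re-roll, category to score)
    and a scalar value head, as used in actor-critic style training."""
    def __init__(self, state_dim=64, hidden=256):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(state_dim, hidden), nn.ReLU(),
                                   nn.Linear(hidden, hidden), nn.ReLU())
        self.keep_head = nn.Linear(hidden, 32)      # 2^5 keep/re-roll masks over 5 dice
        self.category_head = nn.Linear(hidden, 13)  # 13 scoring categories
        self.value_head = nn.Linear(hidden, 1)

    def forward(self, state):
        h = self.trunk(state)
        return self.keep_head(h), self.category_head(h), self.value_head(h).squeeze(-1)

net = YahtzeeActorCritic()
keep_logits, cat_logits, value = net(torch.randn(8, 64))
print(keep_logits.shape, cat_logits.shape, value.shape)
# torch.Size([8, 32]) torch.Size([8, 13]) torch.Size([8])
```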
[219] The Trojan in the Vocabulary: Stealthy Sabotage of LLM Composition
Xiaoze Liu, Weichen Yu, Matt Fredrikson, Xiaoqian Wang, Jing Gao
Main category: cs.LG
TL;DR: Researchers demonstrate a supply-chain vulnerability in tokenizer transplant operations for LLM composition: a single “breaker token” that appears inert in a donor model can reconstruct into a malicious feature after transplant, sabotaging the base model while evading detection.
Details
Motivation: As open-weight LLM ecosystems increasingly rely on model composition techniques (weight merging, speculative decoding, vocabulary expansion), tokenizer transplant becomes essential for aligning incompatible vocabularies. This interoperability step creates a potential supply-chain vulnerability that needs investigation.
Method: The attack engineers a single “breaker token” that is functionally inert in a donor model but reconstructs into a high-salience malicious feature after transplant. By exploiting the geometry of coefficient reuse, the method creates an asymmetric realizability gap. The attack is formalized as a dual-objective optimization problem and instantiated using a sparse solver, achieving training-free spectral mimicry to evade outlier detection.
Result: The attack successfully sabotages the base model’s generation while leaving the donor’s utility statistically indistinguishable from nominal behavior. The malicious feature demonstrates structural persistence against fine-tuning and weight merging, highlighting hidden risks in modular AI composition pipelines.
Conclusion: Tokenizer transplant operations in LLM composition pipelines introduce a previously unrecognized supply-chain vulnerability where seemingly benign tokens can be engineered to become malicious after transplant, posing significant security risks that persist through standard mitigation techniques like fine-tuning and weight merging.
Abstract: The open-weight LLM ecosystem is increasingly defined by model composition techniques (such as weight merging, speculative decoding, and vocabulary expansion) that remix capabilities from diverse sources. A critical prerequisite for applying these methods across different model families is tokenizer transplant, which aligns incompatible vocabularies to a shared embedding space. We demonstrate that this essential interoperability step introduces a supply-chain vulnerability: we engineer a single “breaker token” that is functionally inert in a donor model yet reliably reconstructs into a high-salience malicious feature after transplant into a base model. By exploiting the geometry of coefficient reuse, our attack creates an asymmetric realizability gap that sabotages the base model’s generation while leaving the donor’s utility statistically indistinguishable from nominal behavior. We formalize this as a dual-objective optimization problem and instantiate the attack using a sparse solver. Empirically, the attack is training-free and achieves spectral mimicry to evade outlier detection, while demonstrating structural persistence against fine-tuning and weight merging, highlighting a hidden risk in the pipeline of modular AI composition. Code is available at https://github.com/xz-liu/tokenforge
[220] Avatar Forcing: Real-Time Interactive Head Avatar Generation for Natural Conversation
Taekyung Ki, Sangwon Jang, Jaehyeong Jo, Jaehong Yoon, Sung Ju Hwang
Main category: cs.LG
TL;DR: Avatar Forcing enables real-time interactive talking head avatars with low latency (500ms) using diffusion forcing and synthetic preference optimization.
Details
Motivation: Current talking head generation models lack true interactivity and emotional engagement, producing one-way responses rather than real-time interactive communication. The paper aims to create avatars that can respond instantly to both verbal and non-verbal cues like speech, nods, and laughter.
Method: Proposes the Avatar Forcing framework, which uses diffusion forcing to model real-time user-avatar interactions with low latency. Introduces a direct preference optimization method that uses synthetic losing samples (constructed by dropping user conditions) for label-free learning of expressive interactions. A generic DPO sketch follows the abstract below.
Result: Achieves real-time interaction with approximately 500ms latency (6.8X speedup over the baseline) and produces reactive, expressive avatar motion that is preferred in over 80% of comparisons against the baseline.
Conclusion: The Avatar Forcing framework successfully addresses key challenges in interactive avatar generation, enabling real-time, expressive communication with low latency through diffusion forcing and synthetic preference optimization.
Abstract: Talking head generation creates lifelike avatars from static portraits for virtual communication and content creation. However, current models do not yet convey the feeling of truly interactive communication, often generating one-way responses that lack emotional engagement. We identify two key challenges toward truly interactive avatars: generating motion in real-time under causal constraints and learning expressive, vibrant reactions without additional labeled data. To address these challenges, we propose Avatar Forcing, a new framework for interactive head avatar generation that models real-time user-avatar interactions through diffusion forcing. This design allows the avatar to process real-time multimodal inputs, including the user’s audio and motion, with low latency for instant reactions to both verbal and non-verbal cues such as speech, nods, and laughter. Furthermore, we introduce a direct preference optimization method that leverages synthetic losing samples constructed by dropping user conditions, enabling label-free learning of expressive interaction. Experimental results demonstrate that our framework enables real-time interaction with low latency (approximately 500ms), achieving 6.8X speedup compared to the baseline, and produces reactive and expressive avatar motion, which is preferred over 80% against the baseline.
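The label-free preference step can be illustrated with the standard DPO objective, where the "losing" sample is the same clip regenerated with the user condition dropped. The function below is the generic DPO loss with toy log-likelihoods, not the paper's actual training code.

```python
import torch
import torch.nn.functional as F

def dpo_loss(logp_win, logp_lose, ref_logp_win, ref_logp_lose, beta=0.1):
    """Standard DPO loss. Here the 'losing' sample is synthetic: the same clip
    generated with the user's audio/motion condition dropped."""
    margin = beta * ((logp_win - ref_logp_win) - (logp_lose - ref_logp_lose))
    return -F.logsigmoid(margin).mean()

# Toy log-likelihoods from the policy and a frozen reference model.
logp_win = torch.tensor([-10.0, -12.0])    # condition-aware generations (winners)
logp_lose = torch.tensor([-11.5, -12.5])   # condition-dropped generations (synthetic losers)
ref_win = torch.tensor([-10.5, -12.2])
ref_lose = torch.tensor([-11.0, -12.4])
print(dpo_loss(logp_win, logp_lose, ref_win, ref_lose))
```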
[221] IMBWatch – a Spatio-Temporal Graph Neural Network approach to detect Illicit Massage Business
Swetha Varadarajan, Abhishek Ray, Lumina Albert
Main category: cs.LG
TL;DR: IMBWatch is a spatio-temporal graph neural network framework that detects Illicit Massage Businesses by analyzing dynamic networks of online ads, business records, and reviews to identify human trafficking networks.
Details
Motivation: Illicit Massage Businesses (IMBs) operate covertly under legitimate wellness facades while facilitating human trafficking, sexual exploitation, and coerced labor. Traditional detection methods (community tips, regulatory inspections) are reactive and ineffective at uncovering broader operational networks due to encoded ads, frequent personnel/location changes, and shared infrastructure reuse.
Method: IMBWatch constructs dynamic graphs from open-source intelligence (scraped online ads, business license records, crowdsourced reviews). Nodes represent heterogeneous entities (businesses, aliases, phone numbers, locations), edges capture spatio-temporal patterns (co-location, repeated phone usage, synchronized advertising). Combines graph convolutional operations with temporal attention mechanisms to model network evolution over time/space, capturing patterns like intercity worker movement, burner phone rotation, and coordinated advertising surges. A compact architectural sketch follows the abstract below.
Result: Experiments on real-world datasets from multiple U.S. cities show IMBWatch outperforms baseline models with higher accuracy and F1 scores. Offers improved interpretability for actionable insights to support proactive interventions. Framework is scalable, adaptable to other illicit domains, and released with anonymized data and open-source code for reproducible research.
Conclusion: IMBWatch provides an effective, scalable ST-GNN framework for detecting IMB networks, addressing limitations of traditional reactive approaches through proactive network analysis and offering practical tools for combating human trafficking operations.
Abstract: Illicit Massage Businesses (IMBs) are a covert and persistent form of organized exploitation that operate under the facade of legitimate wellness services while facilitating human trafficking, sexual exploitation, and coerced labor. Detecting IMBs is difficult due to encoded digital advertisements, frequent changes in personnel and locations, and the reuse of shared infrastructure such as phone numbers and addresses. Traditional approaches, including community tips and regulatory inspections, are largely reactive and ineffective at revealing the broader operational networks traffickers rely on. To address these challenges, we introduce IMBWatch, a spatio-temporal graph neural network (ST-GNN) framework for large-scale IMB detection. IMBWatch constructs dynamic graphs from open-source intelligence, including scraped online advertisements, business license records, and crowdsourced reviews. Nodes represent heterogeneous entities such as businesses, aliases, phone numbers, and locations, while edges capture spatio-temporal and relational patterns, including co-location, repeated phone usage, and synchronized advertising. The framework combines graph convolutional operations with temporal attention mechanisms to model the evolution of IMB networks over time and space, capturing patterns such as intercity worker movement, burner phone rotation, and coordinated advertising surges. Experiments on real-world datasets from multiple U.S. cities show that IMBWatch outperforms baseline models, achieving higher accuracy and F1 scores. Beyond performance gains, IMBWatch offers improved interpretability, providing actionable insights to support proactive and targeted interventions. The framework is scalable, adaptable to other illicit domains, and released with anonymized data and open-source code to support reproducible research.
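The architectural recipe, graph convolution per snapshot followed by attention across time, can be sketched compactly in PyTorch. The feature sizes, random adjacency tensors, and single-layer design below are placeholders; the real IMBWatch model operates on heterogeneous entity graphs built from ads, licenses, and reviews.

```python
import torch
import torch.nn as nn

class SpatioTemporalGNN(nn.Module):
    """One GCN-style layer applied per snapshot, then multi-head attention over time."""
    def __init__(self, in_dim, hidden):
        super().__init__()
        self.gcn = nn.Linear(in_dim, hidden)
        self.temporal_attn = nn.MultiheadAttention(hidden, num_heads=4, batch_first=True)
        self.classifier = nn.Linear(hidden, 1)

    def forward(self, x, adj):
        # x: (time, nodes, in_dim); adj: (time, nodes, nodes) with self-loops.
        deg = adj.sum(-1, keepdim=True).clamp(min=1.0)
        h = torch.relu(self.gcn((adj / deg) @ x))   # mean-neighbour aggregation per snapshot
        h = h.transpose(0, 1)                       # (nodes, time, hidden)
        h, _ = self.temporal_attn(h, h, h)          # attend across snapshots
        return self.classifier(h[:, -1])            # per-node risk score at the last step

T, N, D = 6, 10, 8
x = torch.randn(T, N, D)
adj = ((torch.rand(T, N, N) > 0.7).float() + torch.eye(N)).clamp(max=1.0)  # random graphs + self-loops
print(SpatioTemporalGNN(D, 32)(x, adj).shape)  # torch.Size([10, 1])
```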
[222] Exploration in the Limit
Brian M. Cho, Nathan Kallus
Main category: cs.LG
TL;DR: The paper introduces a relaxed asymptotic framework for best arm identification that achieves tighter optimality by requiring valid error control asymptotically rather than exactly, enabling better handling of nonparametric distributions and covariate use.
Details
Motivation: Existing BAI methods fall short in practice because stringent exact error control requires using loose tail inequalities and/or parametric restrictions, limiting their effectiveness in real-world settings with weak signals, high significance requirements, and post-experiment inference needs.
Method: Develops a novel asymptotic anytime-valid confidence sequence over arm indices and uses it to design a new BAI algorithm for the asymptotic framework. The method flexibly incorporates covariates for variance reduction and ensures approximate error control in fully nonparametric settings.
Result: Under mild convergence assumptions, provides asymptotic bounds on sample complexity and shows worst-case sample complexity matches best-case sample complexity of Gaussian BAI under exact error guarantees and known variances. Experiments show reduced average sample complexities while maintaining error control.
Conclusion: The asymptotic framework enables tighter optimality and better practical performance by relaxing exact error control requirements, allowing more flexible handling of nonparametric distributions and effective use of covariates for variance reduction.
Abstract: In fixed-confidence best arm identification (BAI), the objective is to quickly identify the optimal option while controlling the probability of error below a desired threshold. Despite the plethora of BAI algorithms, existing methods typically fall short in practical settings, as stringent exact error control requires using loose tail inequalities and/or parametric restrictions. To overcome these limitations, we introduce a relaxed formulation that requires valid error control asymptotically with respect to a minimum sample size. This aligns with many real-world settings that often involve weak signals, high desired significance, and post-experiment inference requirements, all of which necessitate long horizons. This allows us to achieve tighter optimality, while better handling flexible nonparametric outcome distributions and fully leveraging individual-level contexts. We develop a novel asymptotic anytime-valid confidence sequence over arm indices, and we use it to design a new BAI algorithm for our asymptotic framework. Our method flexibly incorporates covariates for variance reduction and ensures approximate error control in fully nonparametric settings. Under mild convergence assumptions, we provide asymptotic bounds on the sample complexity and show the worst-case sample complexity of our approach matches the best-case sample complexity of Gaussian BAI under exact error guarantees and known variances. Experiments suggest our approach reduces average sample complexities while maintaining error control.
[223] Dynamic Bayesian Optimization Framework for Instruction Tuning in Partial Differential Equation Discovery
Junqi Qu, Yan Zhang, Shangqian Gao, Shibo Li
Main category: cs.LG
TL;DR: NeuroSymBO uses Bayesian Optimization to adaptively select optimal instructions for LLMs during equation discovery, overcoming the brittleness of fixed prompts.
Details
Motivation: LLMs show promise for equation discovery but suffer from "instruction brittleness" - their outputs are highly sensitive to prompt phrasing. Static prompts can't adapt to the evolving state of multi-step generation, causing models to plateau at suboptimal solutions.
Method: NeuroSymBO reframes prompt engineering as a sequential decision problem. It maintains a discrete library of reasoning strategies and uses Bayesian Optimization to select the optimal instruction at each step based on numerical feedback. A simplified selection-loop sketch follows the abstract below.
Result: Experiments on PDE discovery benchmarks show that adaptive instruction selection significantly outperforms fixed prompts, achieving higher recovery rates with more parsimonious solutions.
Conclusion: Adaptive instruction selection via Bayesian Optimization effectively addresses instruction brittleness in LLMs for equation discovery, leading to better performance and more efficient solutions.
Abstract: Large Language Models (LLMs) show promise for equation discovery, yet their outputs are highly sensitive to prompt phrasing, a phenomenon we term instruction brittleness. Static prompts cannot adapt to the evolving state of a multi-step generation process, causing models to plateau at suboptimal solutions. To address this, we propose NeuroSymBO, which reframes prompt engineering as a sequential decision problem. Our method maintains a discrete library of reasoning strategies and uses Bayesian Optimization to select the optimal instruction at each step based on numerical feedback. Experiments on PDE discovery benchmarks show that adaptive instruction selection significantly outperforms fixed prompts, achieving higher recovery rates with more parsimonious solutions.
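The selection loop can be approximated with a much simpler upper-confidence rule over the discrete instruction library, shown below as a stand-in for the paper's Bayesian Optimization. `evaluate_candidate` is a made-up surrogate for the numerical feedback (e.g., the residual of the fitted PDE), and the instruction strings are invented.

```python
import math
import random

instruction_library = [
    "Prefer low-order derivatives first.",
    "Try adding a nonlinear advection term.",
    "Prune terms whose coefficients are near zero.",
]

def evaluate_candidate(instruction: str) -> float:
    """Hypothetical numerical feedback for the equation proposed under this instruction."""
    return random.random() + 0.3 * instruction_library.index(instruction)

counts = [0] * len(instruction_library)
means = [0.0] * len(instruction_library)

for step in range(1, 31):
    # Upper-confidence selection over the discrete instruction library
    # (a bandit-style simplification of the Bayesian Optimization used in the paper).
    ucb = [m + math.sqrt(2 * math.log(step) / c) if c else float("inf")
           for m, c in zip(means, counts)]
    i = ucb.index(max(ucb))
    reward = evaluate_candidate(instruction_library[i])
    counts[i] += 1
    means[i] += (reward - means[i]) / counts[i]

print("most useful instruction:", instruction_library[means.index(max(means))])
```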
[224] GRL-SNAM: Geometric Reinforcement Learning with Path Differential Hamiltonians for Simultaneous Navigation and Mapping in Unknown Environments
Aditya Sai Ellendula, Yi Wang, Minh Nguyen, Chandrajit Bajaj
Main category: cs.LG
TL;DR: GRL-SNAM is a geometric reinforcement learning framework for simultaneous navigation and mapping in unknown environments that uses Hamiltonian optimization to create local energy landscapes from sensory inputs, enabling efficient pathfinding without global maps.
Details
Motivation: The paper addresses the challenge of Simultaneous Navigation and Mapping (SNAM) in unknown environments where no prior map exists. Traditional approaches require constructing global maps, which can be computationally expensive and impractical in dynamic or resource-constrained scenarios. The authors aim to develop a method that can navigate effectively using only local sensory observations without building comprehensive global maps.
Method: GRL-SNAM formulates navigation and mapping as a dynamic shortest path search using controlled Hamiltonian optimization. Sensory inputs are translated into local energy landscapes encoding reachability, obstacle barriers, and deformation constraints. Policies for sensing, planning, and reconfiguration evolve stagewise through Hamiltonian updates. A reduced Hamiltonian serves as an adaptive score function that updates kinetic/potential terms, embeds barrier constraints, and continuously refines trajectories as new local information arrives.
Result: The framework was evaluated on two different 2D navigation tasks. Compared against local reactive baselines and global policy learning references under identical stagewise sensing constraints, GRL-SNAM preserves clearance, generalizes to unseen layouts, and demonstrates that geometric RL learning via Hamiltonian updates enables high-quality navigation through minimal exploration via local energy refinement rather than extensive global mapping.
Conclusion: GRL-SNAM successfully demonstrates that geometric reinforcement learning with Hamiltonian optimization can achieve effective simultaneous navigation and mapping using only local sensory observations. The approach avoids the computational burden of global map construction while maintaining navigation quality and generalization capabilities, offering a promising direction for real-world robotic navigation in unknown environments.
Abstract: We present GRL-SNAM, a geometric reinforcement learning framework for Simultaneous Navigation and Mapping (SNAM) in unknown environments. A SNAM problem is challenging because it requires designing hierarchical or joint policies for multiple agents that control the movement of a real-life robot toward the goal in a mapless environment, i.e., an environment whose map is not available a priori and must be acquired through sensors. The sensors are invoked by the path learner, i.e., the navigator, through active queries to sensory agents along the motion path. GRL-SNAM differs from preemptive navigation algorithms and other reinforcement learning methods by relying exclusively on local sensory observations without constructing a global map. Our approach formulates path navigation and mapping as a dynamic shortest path search and discovery process using controlled Hamiltonian optimization: sensory inputs are translated into local energy landscapes that encode reachability, obstacle barriers, and deformation constraints, while policies for sensing, planning, and reconfiguration evolve stagewise via updating Hamiltonians. A reduced Hamiltonian serves as an adaptive score function, updating kinetic/potential terms, embedding barrier constraints, and continuously refining trajectories as new local information arrives. We evaluate GRL-SNAM on two different 2D navigation tasks. Comparing against local reactive baselines and global policy learning references under identical stagewise sensing constraints, it preserves clearance, generalizes to unseen layouts, and demonstrates that geometric RL via Hamiltonian updates enables high-quality navigation through minimal exploration via local energy refinement rather than extensive global mapping. The code is publicly available on GitHub at https://github.com/CVC-Lab/GRL-SNAM.
[225] Reinforcement Learning with Function Approximation for Non-Markov Processes
Ali Devran Kara
Main category: cs.LG
TL;DR: Reinforcement learning with linear function approximation for non-Markov processes, showing convergence under ergodicity conditions and applying results to POMDPs with finite-memory state representations.
Details
Motivation: To extend reinforcement learning methods with linear function approximation to non-Markov settings, addressing the challenge of learning when state and cost processes are not Markovian, which is common in practical applications like partially observable environments.
Method: 1) Policy evaluation with linear function approximation under non-Markov processes, proving convergence under ergodicity conditions. 2) Q-learning with linear function approximation, showing convergence for special cases where basis functions are based on quantization maps. 3) Application to POMDPs using finite-memory variables as state representations.
Result: 1) Policy evaluation algorithm converges under ergodicity conditions, with limit corresponding to fixed point of joint operator (orthogonal projection + Bellman operator of auxiliary MDP). 2) Q-learning converges for quantization-based basis functions under similar conditions. 3) Explicit error bounds derived for learning algorithms applied to POMDPs with finite-memory representations.
Conclusion: Reinforcement learning with linear function approximation can be extended to non-Markov settings with proper theoretical guarantees, particularly when using appropriate basis functions and under ergodicity conditions, enabling practical applications to partially observable environments.
Abstract: We study reinforcement learning methods with linear function approximation under non-Markov state and cost processes. We first consider the policy evaluation method and show that the algorithm converges under suitable ergodicity conditions on the underlying non-Markov processes. Furthermore, we show that the limit corresponds to the fixed point of a joint operator composed of an orthogonal projection and the Bellman operator of an auxiliary Markov decision process. For Q-learning with linear function approximation, as in the Markov setting, convergence is not guaranteed in general. We show, however, that for the special case where the basis functions are chosen based on quantization maps, the convergence can be shown under similar ergodicity conditions. Finally, we apply our results to partially observed Markov decision processes, where finite-memory variables are used as state representations, and we derive explicit error bounds for the limits of the resulting learning algorithms.
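For readers unfamiliar with the baseline being extended, policy evaluation with linear function approximation is the TD(0)-style update analysed in the paper; the sketch below shows the generic update rule on a toy process, with the dynamics, feature map, and step size chosen purely for illustration (they are not from the paper).

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8                                  # number of basis functions

def phi(x):
    """Illustrative feature map (cosine features of the scalar state)."""
    return np.cos(np.arange(1, d + 1) * x)

def step(x):
    """Placeholder dynamics and per-step cost for a non-Markov-looking process."""
    x_next = 0.9 * x + 0.1 * rng.normal()
    return x_next, x_next ** 2

gamma, alpha = 0.95, 0.01
w = np.zeros(d)
x = rng.normal()

# TD(0): w <- w + alpha * (c + gamma * w.phi(x') - w.phi(x)) * phi(x)
for _ in range(10000):
    x_next, c = step(x)
    td_error = c + gamma * w @ phi(x_next) - w @ phi(x)
    w += alpha * td_error * phi(x)
    x = x_next

print("learned weights:", np.round(w, 3))
```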
[226] Scaling Patterns in Adversarial Alignment: Evidence from Multi-LLM Jailbreak Experiments
Samuel Nathanson, Rebecca Williams, Cynthia Matuszek
Main category: cs.LG
TL;DR: Larger LLMs can systematically jailbreak smaller ones, with harm likelihood increasing with attacker-to-target size ratio, revealing size asymmetry influences adversarial robustness.
Details
Motivation: To examine how LLM vulnerabilities scale in multi-agent adversarial settings, specifically whether larger models can systematically jailbreak smaller ones despite alignment safeguards.
Method: Simulated over 6,000 multi-turn attacker-target exchanges across major LLM families (0.6B-120B parameters) using JailbreakBench tasks. Measured harm scores and refusal behavior evaluated by three independent LLM judges.
Result: Strong correlation between mean harm and log of attacker-to-target size ratio (r=0.51). Attacker-side behavioral diversity contributes more to outcomes than target susceptibility. Attacker refusal frequency negatively correlates with harm (rho=-0.93).
Conclusion: Size asymmetry influences adversarial robustness, with larger models more likely to jailbreak smaller ones. Findings motivate controlled investigations into inter-model alignment and safety.
Abstract: Large language models (LLMs) increasingly operate in multi-agent and safety-critical settings, raising open questions about how their vulnerabilities scale when models interact adversarially. This study examines whether larger models can systematically jailbreak smaller ones - eliciting harmful or restricted behavior despite alignment safeguards. Using standardized adversarial tasks from JailbreakBench, we simulate over 6,000 multi-turn attacker-target exchanges across major LLM families and scales (0.6B-120B parameters), measuring both harm score and refusal behavior as indicators of adversarial potency and alignment integrity. Each interaction is evaluated through aggregated harm and refusal scores assigned by three independent LLM judges, providing a consistent, model-based measure of adversarial outcomes. Aggregating results across prompts, we find a strong and statistically significant correlation between mean harm and the logarithm of the attacker-to-target size ratio (Pearson r = 0.51, p < 0.001; Spearman rho = 0.52, p < 0.001), indicating that relative model size correlates with the likelihood and severity of harmful completions. Mean harm score variance is higher across attackers (0.18) than across targets (0.10), suggesting that attacker-side behavioral diversity contributes more to adversarial outcomes than target susceptibility. Attacker refusal frequency is strongly and negatively correlated with harm (rho = -0.93, p < 0.001), showing that attacker-side alignment mitigates harmful responses. These findings reveal that size asymmetry influences robustness and provide exploratory evidence for adversarial scaling patterns, motivating more controlled investigations into inter-model alignment and safety.
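The core analysis is a correlation between judge-assigned harm and the log of the attacker-to-target size ratio; a minimal sketch on synthetic stand-in numbers is below (the sizes and simulated harm scores are placeholders, only the statistics mirror those quoted in the abstract).

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

rng = np.random.default_rng(1)

# Synthetic stand-ins: attacker/target sizes in billions of parameters.
attacker = rng.choice([0.6, 4, 8, 30, 70, 120], size=200)
target = rng.choice([0.6, 4, 8, 30, 70, 120], size=200)
log_ratio = np.log(attacker / target)

# Mean harm per pairing would come from the LLM judges; simulated here
# to loosely follow an increasing trend with the size ratio.
mean_harm = 0.3 + 0.1 * log_ratio + rng.normal(0, 0.15, size=200)

r, p_r = pearsonr(log_ratio, mean_harm)
rho, p_rho = spearmanr(log_ratio, mean_harm)
print(f"Pearson r = {r:.2f} (p = {p_r:.2g}), Spearman rho = {rho:.2f} (p = {p_rho:.2g})")
```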
[227] The Weather Paradox: Why Precipitation Fails to Predict Traffic Accident Severity in Large-Scale US Data
Yann Bellec, Rohan Kaman, Siwen Cui, Aarav Agrawal, Calvin Chen
Main category: cs.LG
TL;DR: XGBoost model predicts US traffic accident severity with 78% accuracy, identifying time, location, and weather as key predictors, but struggles with extreme cases due to dataset limitations.
Details
Motivation: To investigate how environmental, temporal, and spatial factors predict traffic accident severity in the US, aiming to improve traffic management and safety through data-driven insights.
Method: Used 500,000 US traffic accidents (2016-2023), trained XGBoost classifier with randomized search cross-validation, adjusted for class imbalance via class weighting, and conducted feature importance analysis.
Result: Model achieved 78% overall accuracy, 87% precision/recall for majority class (Severity 2). Time of day, geographic location, visibility, temperature, and wind speed were strongest predictors. Precipitation and visibility showed limited predictive power.
Conclusion: The study identifies key severity predictors but highlights dataset limitations for extreme cases, suggesting need for alternative sampling, enhanced features, and external data integration for improved traffic management and future research.
Abstract: This study investigates the predictive capacity of environmental, temporal, and spatial factors on traffic accident severity in the United States. Using a dataset of 500,000 U.S. traffic accidents spanning 2016-2023, we trained an XGBoost classifier optimized through randomized search cross-validation and adjusted for class imbalance via class weighting. The final model achieves an overall accuracy of 78%, with strong performance on the majority class (Severity 2), attaining 87% precision and recall. Feature importance analysis reveals that time of day, geographic location, and weather-related variables, including visibility, temperature, and wind speed, rank among the strongest predictors of accident severity. However, contrary to initial hypotheses, precipitation and visibility demonstrate limited predictive power, potentially reflecting behavioral adaptation by drivers under overtly hazardous conditions. The dataset’s predominance of mid-level severity accidents constrains the model’s capacity to learn meaningful patterns for extreme cases, highlighting the need for alternative sampling strategies, enhanced feature engineering, and integration of external datasets. These findings contribute to evidence-based traffic management and suggest future directions for severity prediction research.
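A minimal sketch of the modelling pipeline the abstract describes (XGBoost, randomized search cross-validation, class weighting); the features, hyperparameter ranges, and class proportions here are illustrative placeholders rather than the authors' configuration.

```python
import numpy as np
from sklearn.model_selection import RandomizedSearchCV, train_test_split
from sklearn.utils.class_weight import compute_sample_weight
from xgboost import XGBClassifier

# Placeholder features (environmental/temporal/spatial) and severity labels 0..3,
# with Severity 2 dominating as in the real dataset.
X = np.random.rand(5000, 10)
y = np.random.choice(4, size=5000, p=[0.1, 0.7, 0.15, 0.05])
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

param_dist = {
    "n_estimators": [200, 400, 800],
    "max_depth": [4, 6, 8],
    "learning_rate": [0.03, 0.1, 0.3],
}
search = RandomizedSearchCV(
    XGBClassifier(objective="multi:softprob", eval_metric="mlogloss"),
    param_dist, n_iter=5, cv=3, random_state=0,
)
# Class imbalance handled via per-sample weights (the "balanced" scheme).
weights = compute_sample_weight("balanced", y_tr)
search.fit(X_tr, y_tr, sample_weight=weights)
print("best params:", search.best_params_)
print("test accuracy:", search.best_estimator_.score(X_te, y_te))
```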
[228] Online Finetuning Decision Transformers with Pure RL Gradients
Junkai Luo, Yinglun Zhu
Main category: cs.LG
TL;DR: The paper proposes new algorithms for online finetuning of Decision Transformers using pure reinforcement learning gradients, overcoming limitations of existing approaches that rely on supervised objectives.
Details
Motivation: Current Decision Transformer approaches for online settings still heavily rely on supervised sequence-modeling objectives during finetuning, rather than leveraging pure RL gradients. The paper identifies hindsight return relabeling as a critical obstacle to RL-based finetuning, as it's incompatible with importance sampling-based RL algorithms like GRPO.
Method: The authors adapt GRPO to Decision Transformers with key modifications: sub-trajectory optimization for better credit assignment, sequence-level likelihood objectives for stability and efficiency, and active sampling to encourage exploration in uncertain regions.
Result: The proposed methods outperform existing online Decision Transformer baselines and achieve new state-of-the-art performance across multiple benchmarks, demonstrating the effectiveness of pure-RL-based online finetuning.
Conclusion: The paper successfully enables online finetuning of Decision Transformers using pure reinforcement learning gradients, overcoming the limitations of supervised approaches and establishing a new effective framework for sequential decision making in online settings.
Abstract: Decision Transformers (DTs) have emerged as a powerful framework for sequential decision making by formulating offline reinforcement learning (RL) as a sequence modeling problem. However, extending DTs to online settings with pure RL gradients remains largely unexplored, as existing approaches continue to rely heavily on supervised sequence-modeling objectives during online finetuning. We identify hindsight return relabeling – a standard component in online DTs – as a critical obstacle to RL-based finetuning: while beneficial for supervised learning, it is fundamentally incompatible with importance sampling-based RL algorithms such as GRPO, leading to unstable training. Building on this insight, we propose new algorithms that enable online finetuning of Decision Transformers using pure reinforcement learning gradients. We adapt GRPO to DTs and introduce several key modifications, including sub-trajectory optimization for improved credit assignment, sequence-level likelihood objectives for enhanced stability and efficiency, and active sampling to encourage exploration in uncertain regions. Through extensive experiments, we demonstrate that our methods outperform existing online DT baselines and achieve new state-of-the-art performance across multiple benchmarks, highlighting the effectiveness of pure-RL-based online finetuning for Decision Transformers.
[229] Sequential Reservoir Computing for Efficient High-Dimensional Spatiotemporal Forecasting
Ata Akbari Asanjan, Filip Wudarski, Daniel O’Connor, Shaun Geaney, Elena Strbac, P. Aaron Lott, Davide Venturelli
Main category: cs.LG
TL;DR: Sequential Reservoir Computing (Sequential RC) decomposes large reservoirs into smaller interconnected ones, achieving better forecasting of high-dimensional spatiotemporal systems with lower computational costs than traditional RNN/LSTM methods.
Details
Motivation: Traditional RNNs and LSTMs face computational challenges in forecasting high-dimensional spatiotemporal systems due to gradient-based training and memory bottlenecks. Conventional Reservoir Computing helps but still scales poorly with input dimensionality.
Method: Introduces Sequential Reservoir Computing architecture that decomposes a large reservoir into a series of smaller, interconnected reservoirs. This reduces memory and computational costs while preserving long-term temporal dependencies.
Result: Achieves 15-25% longer valid forecast horizons, 20-30% lower error metrics (SSIM, RMSE), and up to three orders of magnitude lower training cost compared to LSTM and standard RNN baselines on both low-dimensional chaotic systems and high-dimensional physical simulations.
Conclusion: Sequential RC maintains the simplicity and efficiency of conventional RC while achieving superior scalability for high-dimensional dynamical systems, providing a practical path toward real-time, energy-efficient forecasting in scientific and engineering applications.
Abstract: Forecasting high-dimensional spatiotemporal systems remains computationally challenging for recurrent neural networks (RNNs) and long short-term memory (LSTM) models due to gradient-based training and memory bottlenecks. Reservoir Computing (RC) mitigates these challenges by replacing backpropagation with fixed recurrent layers and a convex readout optimization, yet conventional RC architectures still scale poorly with input dimensionality. We introduce a Sequential Reservoir Computing (Sequential RC) architecture that decomposes a large reservoir into a series of smaller, interconnected reservoirs. This design reduces memory and computational costs while preserving long-term temporal dependencies. Using both low-dimensional chaotic systems (Lorenz63) and high-dimensional physical simulations (2D vorticity and shallow-water equations), Sequential RC achieves 15-25% longer valid forecast horizons, 20-30% lower error metrics (SSIM, RMSE), and up to three orders of magnitude lower training cost compared to LSTM and standard RNN baselines. The results demonstrate that Sequential RC maintains the simplicity and efficiency of conventional RC while achieving superior scalability for high-dimensional dynamical systems. This approach provides a practical path toward real-time, energy-efficient forecasting in scientific and engineering applications.
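A minimal numpy sketch of the chained-reservoir idea, assuming echo-state-style reservoirs and a ridge (regularized least-squares) readout; the reservoir sizes, spectral scaling, and toy forecasting task are illustrative assumptions, not the authors' setup.

```python
import numpy as np

rng = np.random.default_rng(0)

class SmallReservoir:
    def __init__(self, n_in, n_res, spectral_radius=0.9):
        self.W_in = rng.uniform(-0.5, 0.5, (n_res, n_in))
        W = rng.normal(size=(n_res, n_res))
        self.W = W * (spectral_radius / np.max(np.abs(np.linalg.eigvals(W))))
        self.state = np.zeros(n_res)

    def step(self, u):
        self.state = np.tanh(self.W @ self.state + self.W_in @ u)
        return self.state

def run_sequential_rc(inputs, reservoirs):
    """Feed each input through a chain of small reservoirs; concatenate their states."""
    states = []
    for u in inputs:
        signal, layer_states = u, []
        for res in reservoirs:
            signal = res.step(signal)
            layer_states.append(signal)
        states.append(np.concatenate(layer_states))
    return np.array(states)

# Toy forecasting task: predict the next value of a noisy sine wave.
t = np.linspace(0, 40 * np.pi, 4000)
series = np.sin(t) + 0.05 * rng.normal(size=t.size)
X_in, y_next = series[:-1, None], series[1:]

chain = [SmallReservoir(1, 50)] + [SmallReservoir(50, 50) for _ in range(3)]
H = run_sequential_rc(X_in, chain)

# Ridge readout -- the only trained component, no backpropagation needed.
lam = 1e-4
W_out = np.linalg.solve(H.T @ H + lam * np.eye(H.shape[1]), H.T @ y_next)
print("train RMSE:", np.sqrt(np.mean((H @ W_out - y_next) ** 2)))
```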
[230] Early Prediction of Liver Cirrhosis Up to Three Years in Advance: A Machine Learning Study Benchmarking Against the FIB-4 Score
Zhuqi Miao, Sujan Ravi, Abdulaziz Ahmed
Main category: cs.LG
TL;DR: ML models using EHR data outperform FIB-4 score for predicting liver cirrhosis 1-3 years before diagnosis, enabling earlier risk stratification.
Details
Motivation: To develop more effective early prediction tools for liver cirrhosis using routinely available EHR data, improving upon the traditional FIB-4 score which has limited predictive performance for early detection.
Method: Retrospective cohort study using de-identified EHR data from an academic health system. Identified fatty liver disease patients and categorized into cirrhosis/non-cirrhosis groups using ICD codes. Created prediction scenarios with observation/prediction windows. Aggregated demographics, diagnoses, lab results, vital signs, and comorbidity indices. Trained XGBoost models for 1-, 2-, and 3-year prediction horizons and evaluated on held-out test sets, comparing with FIB-4 using AUC.
Result: ML models consistently outperformed FIB-4 across all prediction windows. XGBoost achieved AUCs of 0.81 (1-year), 0.73 (2-year), and 0.69 (3-year) compared to FIB-4’s 0.71, 0.63, and 0.57 respectively. Performance gains increased with longer prediction horizons, showing improved early risk discrimination.
Conclusion: Machine learning models using routine EHR data substantially outperform FIB-4 for early cirrhosis prediction, enabling earlier and more accurate risk stratification. These models can be integrated into clinical workflows as automated decision-support tools for proactive cirrhosis prevention and management.
Abstract: Objective: Develop and evaluate machine learning (ML) models for predicting incident liver cirrhosis one, two, and three years prior to diagnosis using routinely collected electronic health record (EHR) data, and to benchmark their performance against the FIB-4 score. Methods: We conducted a retrospective cohort study using de-identified EHR data from a large academic health system. Patients with fatty liver disease were identified and categorized into cirrhosis and non-cirrhosis cohorts based on ICD-9/10 codes. Prediction scenarios were constructed using observation and prediction windows to emulate real-world clinical use. Demographics, diagnoses, laboratory results, vital signs, and comorbidity indices were aggregated from the observation window. XGBoost models were trained for 1-, 2-, and 3-year prediction horizons and evaluated on held-out test sets. Model performance was compared with FIB-4 using area under the receiver operating characteristic curve (AUC). Results: Final cohorts included 3,043 patients for the 1-year prediction, 1,981 for the 2-year prediction, and 1,470 for the 3-year prediction. Across all prediction windows, ML models consistently outperformed FIB-4. The XGBoost models achieved AUCs of 0.81, 0.73, and 0.69 for 1-, 2-, and 3-year predictions, respectively, compared with 0.71, 0.63, and 0.57 for FIB-4. Performance gains persisted with longer prediction horizons, indicating improved early risk discrimination. Conclusions: Machine learning models leveraging routine EHR data substantially outperform the traditional FIB-4 score for early prediction of liver cirrhosis. These models enable earlier and more accurate risk stratification and can be integrated into clinical workflows as automated decision-support tools to support proactive cirrhosis prevention and management.
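For context, the FIB-4 benchmark the models are compared against is a simple closed-form index (age times AST, divided by platelet count times the square root of ALT, with platelets in 10^9/L); the example values in the snippet below are illustrative.

```python
import math

def fib4(age_years, ast_u_per_l, alt_u_per_l, platelets_1e9_per_l):
    """FIB-4 index = (age * AST) / (platelets * sqrt(ALT)).
    Commonly cited cut-offs: < 1.45 low risk, > 3.25 high risk of advanced fibrosis."""
    return (age_years * ast_u_per_l) / (platelets_1e9_per_l * math.sqrt(alt_u_per_l))

# Example: a 58-year-old with AST 46 U/L, ALT 38 U/L, platelets 160 x 10^9/L.
print(round(fib4(58, 46, 38, 160), 2))
```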
[231] Reinforcement-Learned Unequal Error Protection for Quantized Semantic Embeddings
Moirangthem Tiken Singh, Adnan Arif
Main category: cs.LG
TL;DR: A reinforcement learning framework uses adaptive repetition coding for per-dimension unequal error protection in bandwidth-constrained communication, achieving significant semantic preservation gains over uniform protection.
Details
Motivation: Preserving semantic meaning in communication systems with limited bandwidth is challenging. Traditional channel coding paradigms don't align with semantic granularity, creating a need for semantic-aware protection mechanisms.
Method: Novel reinforcement learning framework with adaptive repetition coding for per-dimension unequal error protection. Uses a composite semantic distortion metric balancing global embedding similarity with entity-level preservation to guide protection allocation.
Result: Statistically significant gains over uniform protection: 6.8% higher chrF scores and 9.3% better entity preservation at 1 dB SNR. Shows simple repetition coding with intelligent allocation enables fine-grained semantic protection unattainable with conventional codes.
Conclusion: Code structure must align with semantic granularity. The approach challenges traditional channel coding paradigms and provides practical pathway for semantic-aware networks, particularly suited for edge computing and IoT scenarios with scarce bandwidth but critical semantic fidelity.
Abstract: This paper tackles the pressing challenge of preserving semantic meaning in communication systems constrained by limited bandwidth. We introduce a novel reinforcement learning framework that achieves per-dimension unequal error protection via adaptive repetition coding. Central to our approach is a composite semantic distortion metric that balances global embedding similarity with entity-level preservation, empowering the reinforcement learning agent to allocate protection in a context-aware manner. Experiments show statistically significant gains over uniform protection, achieving 6.8% higher chrF scores and 9.3% better entity preservation at 1 dB SNR. The key innovation of our framework is the demonstration that simple, intelligently allocated repetition coding enables fine-grained semantic protection – an advantage unattainable with conventional codes such as LDPC or Reed-Solomon. Our findings challenge traditional channel coding paradigms by establishing that code structure must align with semantic granularity. This approach is particularly suited to edge computing and IoT scenarios, where bandwidth is scarce, but semantic fidelity is critical, providing a practical pathway for next-generation semantic-aware networks.
[232] SSI-GAN: Semi-Supervised Swin-Inspired Generative Adversarial Networks for Neuronal Spike Classification
Danial Sharifrazi, Nouman Javed, Mojtaba Mohammadi, Seyede Sana Salehi, Roohallah Alizadehsani, Prasad N. Paradkar, U. Rajendra Acharya, Asim Bhatti
Main category: cs.LG
TL;DR: SSI-GAN: A semi-supervised GAN using Swin-inspired transformer architecture achieves 99.93% accuracy in classifying mosquito neuronal spike patterns for viral infection detection with only 1-3% labeled data.
Details
Motivation: Manual classification of mosquito neuronal spike patterns for arboviral disease detection is labor-intensive and expensive. Existing deep learning solutions require fully labeled datasets and extensive preprocessing, limiting practical field adoption due to data scarcity.
Method: Proposed SSI-GAN (Semi-supervised Swin-Inspired GAN) with transformer-based generator and Swin-inspired shifted-window discriminator using multi-head self-attention to capture sparse high-frequency spike features. Trained on 15M+ spike samples with only 1-3% labeled data, optimized with Bayesian Optuna framework, validated via fivefold Monte Carlo cross-validation.
Result: Achieved 99.93% classification accuracy on day 3 post-infection with only 3% labeled data. Maintained high accuracy across all infection stages with just 1% supervision, representing 97-99% reduction in manual labeling effort compared to supervised approaches at same performance level.
Conclusion: SSI-GAN’s shifted-window transformer design significantly outperforms all baselines and sets new state-of-the-art for spike-based neuronal infection classification, enabling practical field deployment with minimal labeled data requirements.
Abstract: Mosquitoes are the main transmission vectors of arboviral diseases. Manual classification of their neuronal spike patterns is very labor-intensive and expensive. Most available deep learning solutions require fully labeled spike datasets and highly preprocessed neuronal signals. This reduces the feasibility of mass adoption in actual field scenarios. To address the problem of labeled-data scarcity, we propose a new Generative Adversarial Network (GAN) architecture that we call the Semi-supervised Swin-Inspired GAN (SSI-GAN). The Swin-inspired, shifted-window discriminator, together with a transformer-based generator, is used to classify neuronal spike trains and, consequently, detect viral neurotropism. We use a multi-head self-attention model in a flat, window-based transformer discriminator that learns to capture sparser high-frequency spike features. Using just 1 to 3% labeled data, SSI-GAN was trained on more than 15 million spike samples collected at five time points post-infection, with recordings classified into Zika-infected, dengue-infected, or uninfected categories. Hyperparameters were optimized using the Bayesian Optuna framework, and performance for robustness was validated under fivefold Monte Carlo cross-validation. SSI-GAN reached 99.93% classification accuracy on the third day post-infection with only 3% labeled data. It maintained high accuracy across all stages of infection with just 1% supervision. This shows a 97-99% reduction in manual labeling effort relative to standard supervised approaches at the same performance level. The shifted-window transformer design proposed here beat all baselines by a wide margin and set new best marks in spike-based neuronal infection classification.
[233] Optimized Hybrid Feature Engineering for Resource-Efficient Arrhythmia Detection in ECG Signals: An Optimization Framework
Moirangthem Tiken Singh, Manibhushan Yaikhom
Main category: cs.LG
TL;DR: Proposes a resource-efficient data-centric framework for arrhythmia detection on edge devices using wavelet-graph feature engineering and ultra-lightweight linear classifiers, achieving 98.44% accuracy with 8.54 KB model size.
Details
Motivation: Cardiovascular diseases require continuous monitoring via IoMT, but current deep learning approaches have prohibitive computational overhead for resource-constrained edge devices.
Method: Data-centric framework prioritizing feature engineering over complexity. Uses time-frequency wavelet decompositions combined with graph-theoretic structural descriptors (PageRank centrality), refined with mutual information and recursive elimination, enabling interpretable linear classifiers.
Result: 98.44% diagnostic accuracy on MIT-BIH and INCART datasets with 8.54 KB model footprint. Achieves 0.46 μs classification inference latency within 52 ms per-beat pipeline, outperforming compressed models like KD-Light (25 KB, 96.32% accuracy).
Conclusion: The framework provides order-of-magnitude efficiency gains over existing compressed models, enabling real-time operation on battery-less cardiac sensors and advancing edge-based arrhythmia detection.
Abstract: Cardiovascular diseases, particularly arrhythmias, remain a leading global cause of mortality, necessitating continuous monitoring via the Internet of Medical Things (IoMT). However, state-of-the-art deep learning approaches often impose prohibitive computational overheads, rendering them unsuitable for resource-constrained edge devices. This study proposes a resource-efficient, data-centric framework that prioritizes feature engineering over complexity. Our optimized pipeline makes the complex, high-dimensional arrhythmia data linearly separable. This is achieved by integrating time-frequency wavelet decompositions with graph-theoretic structural descriptors, such as PageRank centrality. This hybrid feature space, combining wavelet decompositions and graph-theoretic descriptors, is then refined using mutual information and recursive elimination, enabling interpretable, ultra-lightweight linear classifiers. Validation on the MIT-BIH and INCART datasets yields 98.44% diagnostic accuracy with an 8.54 KB model footprint. The system achieves 0.46 μs classification inference latency within a 52 ms per-beat pipeline, ensuring real-time operation. These outcomes provide an order-of-magnitude efficiency gain over compressed models, such as KD-Light (25 KB, 96.32% accuracy), advancing battery-less cardiac sensors.
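A minimal sketch of the hybrid feature idea: wavelet sub-band energies plus PageRank centrality of a graph built over one beat. Note that the k-nearest-neighbour graph construction below is an assumption for illustration; the abstract does not specify how the graph is formed, and the random signal stands in for a real ECG beat.

```python
import numpy as np
import pywt
import networkx as nx

rng = np.random.default_rng(0)
beat = rng.normal(size=256)            # placeholder for one ECG beat segment

# Time-frequency part: energy of each wavelet decomposition level.
coeffs = pywt.wavedec(beat, "db4", level=4)
wavelet_feats = [float(np.sum(c ** 2)) for c in coeffs]

# Graph-theoretic part: k-NN graph over short windows of the beat, then PageRank.
windows = beat.reshape(32, 8)          # 32 nodes, each an 8-sample window
dists = np.linalg.norm(windows[:, None] - windows[None, :], axis=-1)
k = 4
G = nx.Graph()
for i in range(len(windows)):
    for j in np.argsort(dists[i])[1:k + 1]:   # skip self (distance 0)
        G.add_edge(i, int(j))
pr = nx.pagerank(G)
graph_feats = [max(pr.values()), min(pr.values()), float(np.std(list(pr.values())))]

feature_vector = np.array(wavelet_feats + graph_feats)
print(feature_vector.shape, feature_vector[:3])
```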
[234] Unknown Aware AI-Generated Content Attribution
Ellie Thieu, Jifan Zhang, Haoyue Bai
Main category: cs.LG
TL;DR: The paper proposes a constrained optimization method using unlabeled wild data to improve AI-generated content attribution, addressing generalization to unseen generators beyond simple baseline approaches.
Details
Motivation: As photorealistic generative models advance, there's a need to move beyond binary real/fake detection to specific model attribution. Current methods struggle to generalize to unseen and newly released generators.
Method: Uses CLIP features with linear classifier as baseline, then proposes constrained optimization leveraging unlabeled wild data (Internet images) to encourage wild samples as non-target while maintaining performance on labeled data.
Result: Incorporating wild data substantially improves attribution performance on challenging unseen generators compared to baseline approaches.
Conclusion: Unlabeled data from the wild can be effectively exploited to enhance AI-generated content attribution in open-world settings, addressing generalization limitations of supervised-only approaches.
Abstract: The rapid advancement of photorealistic generative models has made it increasingly important to attribute the origin of synthetic content, moving beyond binary real or fake detection toward identifying the specific model that produced a given image. We study the problem of distinguishing outputs from a target generative model (e.g., OpenAI Dalle 3) from other sources, including real images and images generated by a wide range of alternative models. Using CLIP features and a simple linear classifier, shown to be effective in prior work, we establish a strong baseline for target generator attribution using only limited labeled data from the target model and a small number of known generators. However, this baseline struggles to generalize to harder, unseen, and newly released generators. To address this limitation, we propose a constrained optimization approach that leverages unlabeled wild data, consisting of images collected from the Internet that may include real images, outputs from unknown generators, or even samples from the target model itself. The proposed method encourages wild samples to be classified as non target while explicitly constraining performance on labeled data to remain high. Experimental results show that incorporating wild data substantially improves attribution performance on challenging unseen generators, demonstrating that unlabeled data from the wild can be effectively exploited to enhance AI generated content attribution in open world settings.
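The baseline the paper builds on (CLIP features plus a linear classifier) can be sketched as follows; the checkpoint name, logistic regression, and the randomly generated stand-in images are illustrative choices, not necessarily the authors' exact setup.

```python
import numpy as np
import torch
from PIL import Image
from sklearn.linear_model import LogisticRegression
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_features(images):
    inputs = processor(images=images, return_tensors="pt")
    with torch.no_grad():
        return model.get_image_features(**inputs).numpy()

def random_image():
    # Stand-in for real / target-generator images; replace with actual data.
    return Image.fromarray((np.random.rand(224, 224, 3) * 255).astype("uint8"))

train_images = [random_image() for _ in range(8)]
train_labels = [1, 1, 1, 1, 0, 0, 0, 0]   # 1 = target generator, 0 = everything else

clf = LogisticRegression(max_iter=1000).fit(clip_features(train_images), train_labels)
print(clf.predict_proba(clip_features([random_image()]))[:, 1])
```

The paper's contribution then adds a constraint: unlabeled wild images are pushed toward the non-target class while accuracy on the labeled set is held high.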
[235] Robust Graph Fine-Tuning with Adversarial Graph Prompting
Ziyan Zhang, Bo Jiang, Jin Tang
Main category: cs.LG
TL;DR: AGP integrates adversarial learning into graph prompting to create robust PEFT for GNNs, addressing vulnerability to graph topology and node feature noise through min-max optimization.
Details
Motivation: Existing Parameter-Efficient Fine-Tuning (PEFT) methods for GNNs are vulnerable to noise and attacks on graph topology and node features, creating a need for robust fine-tuning approaches.
Method: Proposes Adversarial Graph Prompting (AGP) framework with min-max optimization: inner maximization uses JointPGD to generate adversarial noise, outer minimization learns optimal node prompts to counteract noise. Theoretical analysis shows it handles both graph topology and node noise.
Result: Extensive experiments on multiple benchmark tasks validate AGP’s robustness and effectiveness compared to state-of-the-art methods.
Conclusion: AGP is a general method that can integrate with various pre-trained GNN models to enhance robustness on downstream tasks, addressing both graph topology and node feature noise.
Abstract: Parameter-Efficient Fine-Tuning (PEFT) method has emerged as a dominant paradigm for adapting pre-trained GNN models to downstream tasks. However, existing PEFT methods usually exhibit significant vulnerability to various noise and attacks on graph topology and node attributes/features. To address this issue, for the first time, we propose integrating adversarial learning into graph prompting and develop a novel Adversarial Graph Prompting (AGP) framework to achieve robust graph fine-tuning. Our AGP has two key aspects. First, we propose the general problem formulation of AGP as a min-max optimization problem and develop an alternating optimization scheme to solve it. For inner maximization, we propose Joint Projected Gradient Descent (JointPGD) algorithm to generate strong adversarial noise. For outer minimization, we employ a simple yet effective module to learn the optimal node prompts to counteract the adversarial noise. Second, we demonstrate that the proposed AGP can theoretically address both graph topology and node noise. This confirms the versatility and robustness of our AGP fine-tuning method across various graph noise. Note that, the proposed AGP is a general method that can be integrated with various pre-trained GNN models to enhance their robustness on the downstream tasks. Extensive experiments on multiple benchmark tasks validate the robustness and effectiveness of AGP method compared to state-of-the-art methods.
[236] GRIT – Geometry-Aware PEFT with K-FAC Preconditioning, Fisher-Guided Reprojection, and Dynamic Rank Adaptation
Pritish Saha, Chandrav Rajbangshi, Rudra Goyal, Mohit Goyal, Anurag Deo, Biswajit Roy, Ningthoujam Dhanachandra Singh, Raxit Goswami, Amitava Das
Main category: cs.LG
TL;DR: GRIT is a curvature-aware LoRA method that uses K-FAC preconditioning, Fisher eigenbasis reprojection, and adaptive rank selection to reduce trainable parameters by 46% while matching or surpassing LoRA/QLoRA performance.
Details
Motivation: Current PEFT methods like LoRA and QLoRA are geometry-agnostic - they optimize in fixed, randomly oriented low-rank subspaces with first-order descent, ignoring local loss curvature. This inflates update budgets and amplifies drift along weakly constrained directions.
Method: GRIT preserves the LoRA parameterization but adds three key innovations: (1) K-FAC preconditioning of gradients in rank space as a natural-gradient proxy, (2) periodic reprojection of low-rank basis onto dominant Fisher eigendirections to suppress drift, and (3) adaptive effective rank selection from the spectrum to concentrate capacity where signal resides.
Result: Across instruction-following, comprehension, and reasoning benchmarks on LLaMA backbones, GRIT matches or surpasses LoRA and QLoRA while reducing trainable parameters by 46% on average (25-80% across tasks), without practical quality loss across prompt styles and data mixes. It yields lower drift and better updates-vs-retention frontier than strong PEFT baselines.
Conclusion: GRIT demonstrates that incorporating curvature awareness into PEFT methods through natural-gradient preconditioning, drift suppression via Fisher eigenbasis reprojection, and adaptive rank selection can significantly improve parameter efficiency while maintaining or enhancing performance compared to existing methods.
Abstract: Parameter-efficient fine-tuning (PEFT) is the default way to adapt LLMs, but widely used LoRA and QLoRA are largely geometry-agnostic: they optimize in fixed, randomly oriented low-rank subspaces with first-order descent, mostly ignoring local loss curvature. This can inflate the effective update budget and amplify drift along weakly constrained directions. We introduce GRIT, a dynamic, curvature-aware LoRA procedure that preserves the LoRA parameterization but: (1) preconditions gradients in rank space using K-FAC as a natural-gradient proxy; (2) periodically reprojects the low-rank basis onto dominant Fisher eigendirections to suppress drift; and (3) adapts the effective rank from the spectrum so capacity concentrates where signal resides. Across instruction-following, comprehension, and reasoning benchmarks on LLaMA backbones, GRIT matches or surpasses LoRA and QLoRA while reducing trainable parameters by 46% on average (25–80% across tasks), without practical quality loss across prompt styles and data mixes. To model forgetting, we fit a curvature-modulated power law. Empirically, GRIT yields lower drift and a better updates-vs-retention frontier than strong PEFT-optimizer baselines (Orthogonal-LoRA, IA3, DoRA, Eff-FT, Shampoo).
[237] Task-Driven Kernel Flows: Label Rank Compression and Laplacian Spectral Filtering
Hongxi Li, Chunlin Huang
Main category: cs.LG
TL;DR: Supervised learning in wide L2-regularized networks is inherently compressive, with kernel rank bounded by number of classes, unlike expansive self-supervised representations.
Details
Motivation: To understand the fundamental differences between supervised and self-supervised learning, specifically why supervised learning produces low-rank representations while self-supervision yields high-rank ones.
Method: Developed a theory of feature learning in wide L2-regularized networks, derived a kernel ODE predicting “water-filling” spectral evolution, and analyzed SGD noise properties.
Result: Proved that for any stable steady state, kernel rank is bounded by number of classes (C), and SGD noise is similarly low-rank (O(C)), confining dynamics to task-relevant subspace.
Conclusion: Supervised learning is inherently compressive and low-rank, contrasting with high-rank expansive representations in self-supervision, unifying deterministic and stochastic views of alignment.
Abstract: We present a theory of feature learning in wide L2-regularized networks showing that supervised learning is inherently compressive. We derive a kernel ODE that predicts a “water-filling” spectral evolution and prove that for any stable steady state, the kernel rank is bounded by the number of classes ($C$). We further demonstrate that SGD noise is similarly low-rank ($O(C)$), confining dynamics to the task-relevant subspace. This framework unifies the deterministic and stochastic views of alignment and contrasts the low-rank nature of supervised learning with the high-rank, expansive representations of self-supervision.
[238] Can Optimal Transport Improve Federated Inverse Reinforcement Learning?
David Millard, Ali Baheri
Main category: cs.LG
TL;DR: Federated IRL framework using optimal transport to fuse local reward functions via Wasserstein barycenter for heterogeneous multi-agent systems.
Details
Motivation: Autonomous agents in different environments need shared reward learning but face challenges: different dynamics, privacy constraints, and limited communication bandwidth prevent direct data pooling.
Method: Each client performs lightweight Maximum Entropy IRL locally, then reward functions are fused via Wasserstein barycenter that considers geometric structure of reward distributions.
Result: Proves that barycentric fusion yields more faithful global reward estimate than conventional parameter averaging in federated learning.
Conclusion: Provides principled, communication-efficient framework for deriving shared reward that generalizes across heterogeneous agents and environments.
Abstract: In robotics and multi-agent systems, fleets of autonomous agents often operate in subtly different environments while pursuing a common high-level objective. Directly pooling their data to learn a shared reward function is typically impractical due to differences in dynamics, privacy constraints, and limited communication bandwidth. This paper introduces an optimal transport-based approach to federated inverse reinforcement learning (IRL). Each client first performs lightweight Maximum Entropy IRL locally, adhering to its computational and privacy limitations. The resulting reward functions are then fused via a Wasserstein barycenter, which considers their underlying geometric structure. We further prove that this barycentric fusion yields a more faithful global reward estimate than conventional parameter averaging methods in federated learning. Overall, this work provides a principled and communication-efficient framework for deriving a shared reward that generalizes across heterogeneous agents and environments.
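The fusion step can be illustrated in the special case of one-dimensional Gaussian reward models, where the Wasserstein-2 barycenter has a closed form (a Gaussian whose mean and standard deviation are the weighted averages of the clients' means and standard deviations). The paper's method operates on richer reward representations; the snippet below is only a toy illustration of the barycentric fusion idea.

```python
import numpy as np

# Per-client reward models, summarized here as 1-D Gaussians (mean, std).
client_means = np.array([0.8, 1.1, 0.95])
client_stds = np.array([0.20, 0.35, 0.25])
weights = np.array([0.5, 0.3, 0.2])      # e.g., proportional to client data size

# Wasserstein-2 barycenter of 1-D Gaussians: weighted average of means and of stds.
bary_mean = weights @ client_means
bary_std = weights @ client_stds

# In this toy Gaussian case the barycenter coincides with parameter averaging;
# the optimal-transport view matters for richer, non-Gaussian reward models,
# where it respects the geometry of the distributions rather than their parameters.
print(f"fused reward model: N({bary_mean:.3f}, {bary_std:.3f}^2)")
```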
[239] Quantum King-Ring Domination in Chess: A QAOA Approach
Gerhard Stenzel, Michael Kölle, Tobias Rohe, Julian Hager, Leo Sünkel, Maximilian Zorn, Claudia Linnhoff-Popien
Main category: cs.LG
TL;DR: QAOA benchmarked on random instances lacks real-world insight. New chess-based benchmark QKRD with 5k structured instances reveals advantages of problem-informed QAOA techniques over standard approaches.
Details
Motivation: Current QAOA benchmarks use synthetic random instances (MaxCut, TSP, SAT) that lack semantic structure and human interpretability, providing limited insight into performance on real-world problems with meaningful constraints.
Method: Introduce Quantum King-Ring Domination (QKRD), a NISQ-scale benchmark derived from chess tactical positions with 5,000 structured instances featuring one-hot constraints, spatial locality, and 10-40 qubit scale. The benchmark includes human-interpretable coverage metrics and intrinsic validation against classical heuristics.
Result: Constraint-preserving mixers (XY, domain-wall) converge ~13 steps faster than standard mixers; warm-start strategies reduce convergence by 45 steps with energy improvements exceeding d=8; CVaR optimization yields worse energy and no coverage benefit. QAOA outperforms greedy heuristics by 12.6% and random selection by 80.1%.
Conclusion: Structured benchmarks reveal advantages of problem-informed QAOA techniques obscured in random instances. QKRD enables algorithmic conclusions without external oracles and provides reproducible NISQ algorithm research artifacts.
Abstract: The Quantum Approximate Optimization Algorithm (QAOA) is extensively benchmarked on synthetic random instances such as MaxCut, TSP, and SAT problems, but these lack semantic structure and human interpretability, offering limited insight into performance on real-world problems with meaningful constraints. We introduce Quantum King-Ring Domination (QKRD), a NISQ-scale benchmark derived from chess tactical positions that provides 5,000 structured instances with one-hot constraints, spatial locality, and 10–40 qubit scale. The benchmark pairs human-interpretable coverage metrics with intrinsic validation against classical heuristics, enabling algorithmic conclusions without external oracles. Using QKRD, we systematically evaluate QAOA design choices and find that constraint-preserving mixers (XY, domain-wall) converge approximately 13 steps faster than standard mixers (p<10^{-7}, d\approx0.5) while eliminating penalty tuning, warm-start strategies reduce convergence by 45 steps (p<10^{-127}, d=3.35) with energy improvements exceeding d=8, and Conditional Value-at-Risk (CVaR) optimization yields an informative negative result with worse energy (p<10^{-40}, d=1.21) and no coverage benefit. Intrinsic validation shows QAOA outperforms greedy heuristics by 12.6% and random selection by 80.1%. Our results demonstrate that structured benchmarks reveal advantages of problem-informed QAOA techniques obscured in random instances. We release all code, data, and experimental artifacts for reproducible NISQ algorithm research.
[240] Smart Fault Detection in Nanosatellite Electrical Power System
Alireza Rezaee, Niloofar Nobahari, Amin Asgarifar, Farshid Hajati
Main category: cs.LG
TL;DR: A neural network-based fault detection method for nanosatellite electrical power systems without ADCS, using solar radiation and panel temperature to diagnose various electrical faults.
Details
Motivation: Nanosatellites in LEO orbit face electrical power system faults due to pressure tolerance issues, launcher pressure, and environmental factors, but many lack Attitude Determination Control Subsystems (ADCS) for fault detection.
Method: First simulates fault-free system using neural network with solar radiation and panel temperature as inputs, current and load as outputs. Then uses neural network classifier to diagnose faults by pattern and type, supplemented by PCA classification, decision trees, and KNN for fault classification.
Result: Developed a comprehensive fault detection system capable of identifying common nanosatellite electrical faults including line-to-line and open circuit faults in photovoltaic subsystems, short circuit and open circuit IGBT faults in DC-DC converters, and regulator faults in ground batteries.
Conclusion: The proposed neural network-based approach provides effective fault detection for nanosatellite electrical power systems without requiring ADCS, offering a practical solution for improving reliability in LEO orbit missions.
Abstract: This paper presents a new method for detecting faults in the electrical power system of nanosatellites without an Attitude Determination Control Subsystem (ADCS) in LEO orbit. Each part of this system is at risk of fault due to pressure tolerance, launcher pressure, and environmental conditions. Common faults include line-to-line and open-circuit faults in the photovoltaic subsystem, short-circuit and open-circuit IGBT faults in the DC-DC converter, and regulator faults in the ground battery. The fault-free system is first simulated with a neural network that uses solar radiation and solar panel surface temperature as inputs and current and load as outputs. Finally, a neural network classifier diagnoses the different faults by pattern and fault type. For fault classification, other machine learning methods are also used, such as PCA-based classification, decision trees, and KNN.
[241] Real-Time Human Detection for Aerial Captured Video Sequences via Deep Models
Nouar AlDahoul, Aznul Qalid Md Sabri, Ali Mohammed Mansoor
Main category: cs.LG
TL;DR: This paper proposes using deep learning models (S-CNN, pretrained CNN, and H-ELM) with optical flow for human detection in aerial videos, achieving high accuracy on the challenging UCF-ARG dataset.
Details
Motivation: Traditional human detection methods rely on handcrafted features that are problem-dependent, require expert knowledge, and are sensitive to dynamic conditions like illumination changes and camera jitter. There's a need for automatic feature learning approaches that can handle aerial videos with varying altitudes and non-static cameras.
Method: The paper combines optical flow with three deep learning models: 1) Supervised Convolutional Neural Network (S-CNN), 2) Pretrained CNN feature extractor, and 3) Hierarchical Extreme Learning Machine (H-ELM). These models are trained and tested on the UCF-ARG aerial dataset to detect five human actions (digging, waving, throwing, walking, running).
Result: The pretrained CNN achieved the highest average accuracy of 98.09%. S-CNN achieved 95.6% with softmax and 91.7% with SVM. H-ELM achieved 95.9% accuracy. H-ELM trained in 445 seconds on CPU, while S-CNN required 770 seconds on GPU.
Conclusion: The proposed deep learning approaches successfully address human detection in challenging aerial videos, with pretrained CNN performing best overall. The methods demonstrate that automatic feature learning can overcome limitations of traditional handcrafted features for this task.
Abstract: Human detection in videos plays an important role in various real-life applications. Most traditional approaches depend on utilizing handcrafted features, which are problem-dependent and optimal for specific tasks. Moreover, they are highly susceptible to dynamical events such as illumination changes, camera jitter, and variations in object sizes. On the other hand, the proposed feature learning approaches are cheaper and easier because highly abstract and discriminative features can be produced automatically without the need for expert knowledge. In this paper, we utilize automatic feature learning methods, which combine optical flow and three different deep models (i.e., supervised convolutional neural network (S-CNN), pretrained CNN feature extractor, and hierarchical extreme learning machine) for human detection in videos captured using a nonstatic camera on an aerial platform with varying altitudes. The models are trained and tested on the publicly available and highly challenging UCF-ARG aerial dataset. The comparison between these models in terms of training, testing accuracy, and learning speed is analyzed. The performance evaluation considers five human actions (digging, waving, throwing, walking, and running). Experimental results demonstrate that the proposed methods are successful for the human detection task. The pretrained CNN produces an average accuracy of 98.09%. S-CNN produces an average accuracy of 95.6% with softmax and 91.7% with Support Vector Machines (SVM). H-ELM has an average accuracy of 95.9%. On a standard Central Processing Unit (CPU), H-ELM trains in 445 seconds, while S-CNN takes 770 seconds to train on a high-performance Graphical Processing Unit (GPU).
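The optical-flow front end can be sketched with OpenCV's dense Farnebäck flow; the random frames, thresholding of flow magnitude, and the hand-off to a downstream CNN are illustrative simplifications of how motion cues would be paired with the deep models described above.

```python
import cv2
import numpy as np

rng = np.random.default_rng(0)

# Two consecutive grayscale frames (placeholders for aerial video frames).
prev = (rng.random((240, 320)) * 255).astype(np.uint8)
curr = (rng.random((240, 320)) * 255).astype(np.uint8)

# Dense Farneback optical flow between the frames.
flow = cv2.calcOpticalFlowFarneback(
    prev, curr, None,
    pyr_scale=0.5, levels=3, winsize=15,
    iterations=3, poly_n=5, poly_sigma=1.2, flags=0,
)
magnitude, angle = cv2.cartToPolar(flow[..., 0], flow[..., 1])

# Regions with strong motion are candidate human locations; each candidate
# crop would then be passed to a CNN / H-ELM classifier for the final decision.
mask = magnitude > np.percentile(magnitude, 99)
print("candidate motion pixels:", int(mask.sum()))
```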
[242] Deep Delta Learning
Yifan Zhang, Yifeng Liu, Mengdi Wang, Quanquan Gu
Main category: cs.LG
TL;DR: Deep Delta Learning (DDL) generalizes residual connections by replacing the fixed identity shortcut with a learnable geometric transformation (Delta Operator) that enables dynamic interpolation between identity mapping, projection, and reflection.
Details
Motivation: Standard residual networks use strictly additive identity shortcuts, which impose a limited inductive bias and restrict the network's ability to model complex state transitions and non-monotonic dynamics.
Method: Introduces Delta Operator - a rank-1 perturbation of identity matrix parameterized by reflection direction vector k(X) and gating scalar β(X). Restructures residual update as synchronous rank-1 injection where gate β(X) controls both erasure of old information and writing of new features.
Result: The method enables explicit control over the spectrum of layer-wise transition operators, allowing modeling of complex dynamics while maintaining stable training characteristics of gated residual architectures.
Conclusion: DDL provides a more flexible alternative to standard residual connections by enabling learnable, data-dependent geometric transformations that can adaptively modulate feature transitions while preserving training stability.
Abstract: The efficacy of deep residual networks is fundamentally predicated on the identity shortcut connection. While this mechanism effectively mitigates the vanishing gradient problem, it imposes a strictly additive inductive bias on feature transformations, thereby limiting the network’s capacity to model complex state transitions. In this paper, we introduce Deep Delta Learning (DDL), a novel architecture that generalizes the standard residual connection by modulating the identity shortcut with a learnable, data-dependent geometric transformation. This transformation, termed the Delta Operator, constitutes a rank-1 perturbation of the identity matrix, parameterized by a reflection direction vector $\mathbf{k}(\mathbf{X})$ and a gating scalar $β(\mathbf{X})$. We provide a spectral analysis of this operator, demonstrating that the gate $β(\mathbf{X})$ enables dynamic interpolation between identity mapping, orthogonal projection, and geometric reflection. Furthermore, we restructure the residual update as a synchronous rank-1 injection, where the gate acts as a dynamic step size governing both the erasure of old information and the writing of new features. This unification empowers the network to explicitly control the spectrum of its layer-wise transition operator, enabling the modeling of complex, non-monotonic dynamics while preserving the stable training characteristics of gated residual architectures.
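One plausible reading of the Delta Operator, consistent with the abstract's description of a rank-1, Householder-style perturbation of the identity: with a unit direction k(X) and gate β(X), the shortcut becomes (I − β k kᵀ) h, so β = 0 recovers the identity, β = 1 an orthogonal projection, and β = 2 a reflection. The exact parameterization of k(X) and β(X) in the paper may differ; the sketch below is only an illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DeltaShortcut(nn.Module):
    """Shortcut h -> (I - beta(X) * k(X) k(X)^T) h, a rank-1 perturbation of identity.
    beta = 0: identity; beta = 1: projection onto k's orthogonal complement;
    beta = 2: reflection across that hyperplane."""
    def __init__(self, dim):
        super().__init__()
        self.to_k = nn.Linear(dim, dim)
        self.to_beta = nn.Linear(dim, 1)

    def forward(self, h):
        k = F.normalize(self.to_k(h), dim=-1)         # unit reflection direction k(X)
        beta = 2.0 * torch.sigmoid(self.to_beta(h))   # gate constrained to [0, 2]
        # Apply the rank-1 operator without materializing the full matrix.
        return h - beta * (h * k).sum(-1, keepdim=True) * k

block = DeltaShortcut(dim=64)
x = torch.randn(8, 64)
print(block(x).shape)   # torch.Size([8, 64])
```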
[243] E-GRPO: High Entropy Steps Drive Effective Reinforcement Learning for Flow Models
Shengjun Zhang, Zhang Zhang, Chensheng Dai, Yueqi Duan
Main category: cs.LG
TL;DR: E-GRPO: Entropy-aware Group Relative Policy Optimization for flow matching models that consolidates low-entropy SDE sampling steps into high-entropy steps to improve exploration and reward signal clarity.
Details
Motivation: Existing flow matching methods for human preference alignment suffer from sparse and ambiguous reward signals when optimizing over multiple denoising steps, with low entropy steps producing undistinguished roll-outs.
Method: Proposes E-GRPO which merges consecutive low entropy steps into single high entropy steps for SDE sampling, uses ODE sampling for other steps, and introduces multi-step group normalized advantage computed within samples sharing the same consolidated SDE denoising step.
Result: Experimental results on different reward settings demonstrate the effectiveness of the proposed methods in improving exploration and reward signal clarity.
Conclusion: E-GRPO successfully addresses the sparse reward problem in flow matching models by entropy-aware step consolidation and group-relative advantage computation, leading to more efficient exploration and better alignment with human preferences.
Abstract: Recent reinforcement learning methods have enhanced flow matching models for human preference alignment. While stochastic sampling enables the exploration of denoising directions, existing methods that optimize over multiple denoising steps suffer from sparse and ambiguous reward signals. We observe that high-entropy steps enable more efficient and effective exploration, while low-entropy steps result in undistinguished roll-outs. To this end, we propose E-GRPO, an entropy-aware Group Relative Policy Optimization method that increases the entropy of SDE sampling steps. Since the integration of stochastic differential equations suffers from ambiguous reward signals due to stochasticity accumulated over multiple steps, we merge consecutive low-entropy steps into one high-entropy step for SDE sampling, while applying ODE sampling at other steps. Building upon this, we introduce a multi-step group normalized advantage, which computes group-relative advantages within samples sharing the same consolidated SDE denoising step. Experimental results under different reward settings demonstrate the effectiveness of our method.
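The group-relative advantage used in GRPO-style methods, and per the abstract computed here within samples that share the same consolidated SDE step, is essentially a per-group standardization of rewards; a minimal sketch (the reward numbers are placeholders):

```python
import numpy as np

def group_normalized_advantage(rewards, group_ids, eps=1e-8):
    """Standardize each reward against the mean/std of its own group."""
    rewards = np.asarray(rewards, dtype=float)
    group_ids = np.asarray(group_ids)
    advantages = np.zeros_like(rewards)
    for g in np.unique(group_ids):
        idx = group_ids == g
        mu, sigma = rewards[idx].mean(), rewards[idx].std()
        advantages[idx] = (rewards[idx] - mu) / (sigma + eps)
    return advantages

# Roll-outs that branched at the same consolidated high-entropy SDE step share a group.
rewards = [0.2, 0.5, 0.9, 0.1, 0.4, 0.4]
groups  = [0,   0,   0,   1,   1,   1]
print(group_normalized_advantage(rewards, groups).round(3))
```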
[244] A Comparative Analysis of Interpretable Machine Learning Methods
Mattia Billa, Giovanni Orlandi, Veronica Guidetti, Federica Mandreoli
Main category: cs.LG
TL;DR: Large-scale evaluation of 16 interpretable ML methods on 216 tabular datasets reveals performance hierarchies and context-dependent effectiveness, with EBMs excelling in regression and methods like SR/IGANNs performing well in non-linear settings.
Details
Motivation: Growing reliance on ML in high-stakes domains has raised concerns about interpretability and accountability, but systematic evaluations of inherently interpretable models for tabular data remain scarce, focusing mainly on aggregated performance rather than nuanced analysis.
Method: Comparative evaluation of 16 inherently interpretable methods (including linear models, decision trees, EBMs, SR, GOSDT, IGANNs) across 216 real-world tabular datasets, with stratified analysis based on dataset characteristics (dimensionality, sample size, linearity, class imbalance), plus assessment of training time and robustness under distributional shifts.
Result: Clear performance hierarchies emerged, especially for regression tasks where EBMs consistently achieved strong predictive accuracy. Performance was highly context-dependent: SR and IGANNs performed well in non-linear regimes, while GOSDT models showed pronounced sensitivity to class imbalance.
Conclusion: The findings provide practical guidance for balancing interpretability and predictive performance, contributing to deeper empirical understanding of interpretable modeling for tabular data, with context-dependent recommendations for different dataset characteristics.
Abstract: In recent years, Machine Learning (ML) has seen widespread adoption across a broad range of sectors, including high-stakes domains such as healthcare, finance, and law. This growing reliance has raised increasing concerns regarding model interpretability and accountability, particularly as legal and regulatory frameworks place tighter constraints on using black-box models in critical applications. Although interpretable ML has attracted substantial attention, systematic evaluations of inherently interpretable models, especially for tabular data, remain relatively scarce and often focus primarily on aggregated performance outcomes. To address this gap, we present a large-scale comparative evaluation of 16 inherently interpretable methods, ranging from classical linear models and decision trees to more recent approaches such as Explainable Boosting Machines (EBMs), Symbolic Regression (SR), and Generalized Optimal Sparse Decision Trees (GOSDT). Our study spans 216 real-world tabular datasets and goes beyond aggregate rankings by stratifying performance according to structural dataset characteristics, including dimensionality, sample size, linearity, and class imbalance. In addition, we assess training time and robustness under controlled distributional shifts. Our results reveal clear performance hierarchies, especially for regression tasks, where EBMs consistently achieve strong predictive accuracy. At the same time, we show that performance is highly context-dependent: SR and Interpretable Generalized Additive Neural Networks (IGANNs) perform particularly well in non-linear regimes, while GOSDT models exhibit pronounced sensitivity to class imbalance. Overall, these findings provide practical guidance for practitioners seeking a balance between interpretability and predictive performance, and contribute to a deeper empirical understanding of interpretable modeling for tabular data.
[245] A Comparative Study of Adaptation Strategies for Time Series Foundation Models in Anomaly Detection
Miseon Park, Kijung Yoon
Main category: cs.LG
TL;DR: Time series foundation models (TSFMs) pretrained on large datasets can serve as universal backbones for anomaly detection, outperforming task-specific methods and enabling efficient adaptation through parameter-efficient fine-tuning.
Details
Motivation: Most existing time series anomaly detection methods require extensive task-specific training, which is inefficient. The paper explores whether pretrained time series foundation models can serve as universal backbones to overcome this limitation.Method: Systematic experiments comparing zero-shot inference, full model adaptation, and parameter-efficient fine-tuning (PEFT) strategies including LoRA, OFT, and HRA across multiple benchmarks.
Result: TSFMs outperform task-specific baselines, achieving notable gains in AUC-PR and VUS-PR, especially under severe class imbalance. PEFT methods reduce computational cost while matching or surpassing full fine-tuning in most cases.
Conclusion: Time series foundation models can be efficiently adapted for anomaly detection even when pretrained for forecasting, positioning them as promising general-purpose models for scalable and efficient time series anomaly detection.
Abstract: Time series anomaly detection is essential for the reliable operation of complex systems, but most existing methods require extensive task-specific training. We explore whether time series foundation models (TSFMs), pretrained on large heterogeneous data, can serve as universal backbones for anomaly detection. Through systematic experiments across multiple benchmarks, we compare zero-shot inference, full model adaptation, and parameter-efficient fine-tuning (PEFT) strategies. Our results demonstrate that TSFMs outperform task-specific baselines, achieving notable gains in AUC-PR and VUS-PR, particularly under severe class imbalance. Moreover, PEFT methods such as LoRA, OFT, and HRA not only reduce computational cost but also match or surpass full fine-tuning in most cases, indicating that TSFMs can be efficiently adapted for anomaly detection, even when pretrained for forecasting. These findings position TSFMs as promising general-purpose models for scalable and efficient time series anomaly detection.
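To make the PEFT comparison concrete, the sketch below shows a minimal LoRA-style adapter around a single linear projection in PyTorch. It is an illustrative stand-in, not the paper's implementation or any specific TSFM: the layer sizes, rank, and scaling are placeholder choices, and methods such as OFT and HRA would replace the low-rank update with their own parameterizations.
```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Wrap a frozen linear layer with a low-rank update: W x + (alpha/r) * B A x."""
    def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                      # freeze the pretrained weights
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))  # zero init: no change at start
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

# Illustrative usage: adapt one projection inside a (hypothetical) TSFM block.
proj = nn.Linear(256, 256)
adapted = LoRALinear(proj, r=8)
out = adapted(torch.randn(4, 128, 256))                  # (batch, time, features)
print(out.shape)
```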
[246] Controllable Concept Bottleneck Models
Hongbin Lin, Chenyang Ren, Juangui Xu, Zhengyu Hu, Cheng-Long Wang, Yao Shu, Hui Xiong, Jingfeng Zhang, Di Wang, Lijie Hu
Main category: cs.LG
TL;DR: CCBMs enable efficient editing of Concept Bottleneck Models at three granularities without retraining, using influence function approximations for dynamic model maintenance.
Details
Motivation: Real-world CBMs need continuous maintenance for data removal (unlearning), concept correction, and incremental learning, but current methods require full retraining which is inefficient for large-scale applications.Method: Propose Controllable Concept Bottleneck Models (CCBMs) with three editing granularities: concept-label-level, concept-level, and data-level (removal/addition). Use mathematically rigorous closed-form approximations derived from influence functions to avoid retraining.
Result: Experimental results demonstrate CCBMs’ efficiency and adaptability, enabling dynamic and trustworthy CBMs in practical applications.
Conclusion: CCBMs provide an efficient solution for maintaining CBMs in evolving environments without retraining, addressing real-world needs for model editing at multiple granularities.
Abstract: Concept Bottleneck Models (CBMs) have garnered much attention for their ability to elucidate the prediction process through a human-understandable concept layer. However, most previous studies focused on static scenarios where the data and concepts are assumed to be fixed and clean. In real-world applications, deployed models require continuous maintenance: we often need to remove erroneous or sensitive data (unlearning), correct mislabeled concepts, or incorporate newly acquired samples (incremental learning) to adapt to evolving environments. Thus, deriving efficient editable CBMs without retraining from scratch remains a significant challenge, particularly in large-scale applications. To address these challenges, we propose Controllable Concept Bottleneck Models (CCBMs). Specifically, CCBMs support three granularities of model editing: concept-label-level, concept-level, and data-level, the latter of which encompasses both data removal and data addition. CCBMs enjoy mathematically rigorous closed-form approximations derived from influence functions that obviate the need for retraining. Experimental results demonstrate the efficiency and adaptability of our CCBMs, affirming their practical value in enabling dynamic and trustworthy CBMs.
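The summary does not give the paper's closed forms, but the general influence-function recipe for retraining-free data removal can be illustrated on a simple ridge-regression model. The sketch below is that generic recipe, not the CCBM-specific editing rules; the dataset and regularization strength are synthetic placeholders.
```python
import numpy as np

rng = np.random.default_rng(0)
n, d, lam = 200, 5, 0.1
X = rng.normal(size=(n, d))
y = X @ rng.normal(size=d) + 0.1 * rng.normal(size=n)

# Fit ridge regression: theta minimizes (1/n) * sum 0.5*(x^T theta - y)^2 + (lam/2)*||theta||^2.
H = X.T @ X / n + lam * np.eye(d)                 # Hessian of the objective
theta = np.linalg.solve(H, X.T @ y / n)

# Influence-function approximation of removing sample i, with no retraining.
i = 3
grad_i = X[i] * (X[i] @ theta - y[i])             # gradient of the removed point's loss
theta_unlearn = theta + np.linalg.solve(H, grad_i) / n

# Rough comparison against an exact retrain without sample i
# (normalization conventions differ slightly, so the match is approximate).
mask = np.arange(n) != i
H2 = X[mask].T @ X[mask] / (n - 1) + lam * np.eye(d)
theta_exact = np.linalg.solve(H2, X[mask].T @ y[mask] / (n - 1))
print(np.linalg.norm(theta_unlearn - theta_exact))
```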
[247] Imitation from Observations with Trajectory-Level Generative Embeddings
Yongtao Qu, Shangzhe Li, Weitong Zhang
Main category: cs.LG
TL;DR: TGE proposes a trajectory-level generative embedding using temporal diffusion models to create smooth surrogate rewards for offline imitation learning from observations, especially when expert data is scarce and offline data is suboptimal.
Details
Motivation: Existing offline LfO methods struggle when expert demonstrations are scarce and offline data is far from expert behavior, due to strict support constraints and brittle one-step models that fail to extract useful signals from imperfect data.Method: TGE constructs a dense, smooth surrogate reward by estimating expert state density in the latent space of a temporal diffusion model trained on offline trajectory data, leveraging the smooth geometry of diffusion embeddings to capture long-horizon temporal dynamics.
Result: The approach consistently matches or outperforms prior offline LfO methods across D4RL locomotion and manipulation benchmarks, demonstrating robustness even when offline data is distributionally distinct from expert data.
Conclusion: TGE effectively bridges the gap between disjoint supports in offline LfO by using trajectory-level generative embeddings from diffusion models, providing robust learning signals from imperfect data and scarce expert demonstrations.
Abstract: We consider the offline imitation learning from observations (LfO) where the expert demonstrations are scarce and the available offline suboptimal data are far from the expert behavior. Many existing distribution-matching approaches struggle in this regime because they impose strict support constraints and rely on brittle one-step models, making it hard to extract useful signal from imperfect data. To tackle this challenge, we propose TGE, a trajectory-level generative embedding for offline LfO that constructs a dense, smooth surrogate reward by estimating expert state density in the latent space of a temporal diffusion model trained on offline trajectory data. By leveraging the smooth geometry of the learned diffusion embedding, TGE captures long-horizon temporal dynamics and effectively bridges the gap between disjoint supports, ensuring a robust learning signal even when offline data is distributionally distinct from the expert. Empirically, the proposed approach consistently matches or outperforms prior offline LfO methods across a range of D4RL locomotion and manipulation benchmarks.
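As a rough illustration of the surrogate-reward idea (not the paper's temporal diffusion embedding), the sketch below fits a kernel density estimate to expert trajectory embeddings and scores offline trajectories by their log-density. The `embed` function is a deliberately crude placeholder for the learned latent encoder, and all sizes are synthetic.
```python
import numpy as np
from sklearn.neighbors import KernelDensity

# Placeholder for TGE's trajectory encoder: here we just average state vectors over time.
def embed(trajectory_states):
    return trajectory_states.mean(axis=0)

rng = np.random.default_rng(0)
expert_trajs = [rng.normal(loc=1.0, size=(50, 8)) for _ in range(20)]
offline_trajs = [rng.normal(loc=0.0, size=(50, 8)) for _ in range(200)]

expert_latents = np.stack([embed(t) for t in expert_trajs])
kde = KernelDensity(bandwidth=0.5).fit(expert_latents)

# Dense surrogate reward: log-density of a trajectory's latent under the expert distribution.
offline_latents = np.stack([embed(t) for t in offline_trajs])
surrogate_rewards = kde.score_samples(offline_latents)
print(surrogate_rewards[:5])
```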
[248] Deep Networks Learn Deep Hierarchical Models
Amit Daniely
Main category: cs.LG
TL;DR: Layerwise SGD on residual networks can efficiently learn hierarchical label models with polynomial depth, surpassing previous log-depth models, suggesting hierarchical learning may explain deep learning success.
Details
Motivation: To understand why deep learning works so well, the paper proposes that hierarchical label structures (where complex labels are functions of simpler ones) might be key. The existence of human teachers providing granular labels suggests hierarchical structures are naturally available and could explain deep learning's effectiveness.Method: Uses layerwise stochastic gradient descent (SGD) on residual networks to learn hierarchical models with nested label sets. Formalizes teacher-student interaction where teachers reveal “hints” about internal brain algorithms, creating hierarchical structures that facilitate learning.
Result: Shows that layerwise SGD can efficiently learn hierarchical models that require polynomial depth to express, surpassing previous models limited to log-depth circuits. Demonstrates that hierarchical structures emerge naturally from teacher-student interactions and enable efficient learnability.
Conclusion: Hierarchical label models represent the depth limit of efficient learnability and may form a theoretical basis for understanding deep learning success. The natural emergence of hierarchies from teacher-student interactions suggests why deep learning excels in domains where such structures exist.
Abstract: We consider supervised learning with $n$ labels and show that layerwise SGD on residual networks can efficiently learn a class of hierarchical models. This model class assumes the existence of an (unknown) label hierarchy $L_1 \subseteq L_2 \subseteq \dots \subseteq L_r = [n]$, where labels in $L_1$ are simple functions of the input, while for $i > 1$, labels in $L_i$ are simple functions of simpler labels.
Our class surpasses models that were previously shown to be learnable by deep learning algorithms, in the sense that it reaches the depth limit of efficient learnability. That is, there are models in this class that require polynomial depth to express, whereas previous models can be computed by log-depth circuits.
Furthermore, we suggest that learnability of such hierarchical models might eventually form a basis for understanding deep learning. Beyond their natural fit for domains where deep learning excels, we argue that the mere existence of ``human teachers'' supports the hypothesis that hierarchical structures are inherently available. By providing granular labels, teachers effectively reveal ``hints'' or ``snippets'' of the internal algorithms used by the brain. We formalize this intuition, showing that in a simplified model where a teacher is partially aware of their internal logic, a hierarchical structure emerges that facilitates efficient learnability.
[249] Geometric Regularization in Mixture-of-Experts: The Disconnect Between Weights and Activations
Hyunjun Kim
Main category: cs.LG
TL;DR: Orthogonality loss fails to improve MoE expert diversity - increases weight overlap, doesn’t reduce activation overlap, and shows inconsistent performance effects.
Details
Motivation: To understand the role of geometric regularization in Mixture-of-Experts (MoE) models and whether orthogonality loss can effectively enforce expert diversity.Method: Applied orthogonality loss to enforce expert diversity, tested across 7 regularization strengths, and analyzed weight-space overlap (MSO) and activation-space overlap.
Result: Orthogonality loss fails on multiple fronts: increases weight-space overlap by up to 114%, activation-space overlap remains high (~0.6), performance effects are inconsistent across datasets, and no significant correlation between weight and activation orthogonality.
Conclusion: Weight-space regularization neither achieves its geometric goal nor reliably improves performance, making it unsuitable for MoE diversity.
Abstract: Mixture-of-Experts (MoE) models achieve efficiency through sparse activation, but the role of geometric regularization in expert specialization remains unclear. We apply orthogonality loss to enforce expert diversity and find it fails on multiple fronts: it does not reduce weight-space overlap (MSO actually increases by up to 114%), activation-space overlap remains high (~0.6) regardless of regularization, and effects on performance are inconsistent – marginal improvement on WikiText-103 (-0.9%), slight degradation on TinyStories (+0.9%), and highly variable results on PTB (std > 1.0). Our analysis across 7 regularization strengths reveals no significant correlation (r = -0.293, p = 0.523) between weight and activation orthogonality. These findings demonstrate that weight-space regularization neither achieves its geometric goal nor reliably improves performance, making it unsuitable for MoE diversity.
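The paper's exact MSO definition is not given in the summary, so the sketch below uses a common proxy: flatten each expert's weight matrix, compute pairwise cosine similarities, and read off both an orthogonality penalty and a mean pairwise overlap. Expert shapes and counts are placeholders.
```python
import torch
import torch.nn.functional as F

def expert_overlap_stats(expert_weights):
    """expert_weights: list of (out, in) weight matrices, one per expert."""
    flat = torch.stack([w.flatten() for w in expert_weights])
    flat = F.normalize(flat, dim=1)
    gram = flat @ flat.T                              # pairwise cosine similarities
    off_diag = gram - torch.eye(len(expert_weights))  # zero out the diagonal
    ortho_loss = (off_diag ** 2).sum()                # orthogonality penalty on weights
    n = len(expert_weights)
    mean_overlap = off_diag.abs().sum() / (n * (n - 1))
    return ortho_loss, mean_overlap

experts = [torch.randn(64, 32) for _ in range(8)]
loss, overlap = expert_overlap_stats(experts)
print(float(loss), float(overlap))
```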
[250] Detecting Spike Wave Discharges (SWD) using 1-dimensional Residual UNet
Saurav Sengupta, Scott Kilianski, Suchetha Sharma, Sakina Lashkeri, Ashley McHugh, Mark Beenhakker, Donald E. Brown
Main category: cs.LG
TL;DR: Researchers developed AugUNet1D, an improved 1D UNet with data augmentation, for automatic detection of spike wave discharges (SWDs) in EEG recordings, outperforming 14 other ML classifiers and a published algorithmic approach.
Details
Motivation: Manual labeling of EEG events like spike wave discharges (SWDs) is time-consuming, especially for continuous recordings over weeks to months. There's a need for automated methods to reduce manual workload, and existing machine learning approaches for SWD detection can be improved.Method: Compared 14 machine learning classifiers on a manually annotated dataset of 961 hours of EEG recordings with 22,637 labeled SWDs. Selected 1D UNet as best performer, then improved it with data augmentation (scaling showed greatest benefit). Compared the resulting AugUNet1D against the “Twin Peaks” algorithmic approach.
Result: AugUNet1D showed superior performance over other classifiers and the Twin Peaks approach. It detected events with features more similar to manually labeled SWDs. The model is made publicly available both pretrained and untrained.
Conclusion: AugUNet1D provides an effective automated solution for SWD detection in EEG recordings, reducing manual labeling burden. Data augmentation, particularly scaling, significantly improves performance. The public release enables broader application and further development.
Abstract: The manual labeling of events in electroencephalography (EEG) records is time-consuming. This is especially true when EEG recordings are taken continuously over weeks to months. Therefore, a method to automatically label pertinent EEG events reduces the manual workload. Spike wave discharges (SWD), which are the electrographic hallmark of absence seizures, are EEG events that are often labeled manually. While some previous studies have utilized machine learning to automatically segment and classify EEG signals like SWDs, they can be improved. Here we compare the performance of 14 machine learning classifiers on our own manually annotated dataset of 961 hours of EEG recordings from C3H/HeJ mice, including 22,637 labeled SWDs. We find that a 1D UNet performs best for labeling SWDs in this dataset. We also improve the 1D UNet by augmenting our training data and determine that scaling showed the greatest benefit of all augmentation procedures applied. We then compare the 1D UNet with data augmentation, AugUNet1D, against a recently published time- and frequency-based algorithmic approach called “Twin Peaks”. AugUNet1D showed superior performance and detected events with more similar features to the SWDs labeled manually. AugUNet1D, pretrained on our manually annotated data or untrained, is made public for other users.
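Scaling, the augmentation reported as most beneficial, amounts to multiplying each training window by a random amplitude factor. A minimal sketch, with an assumed scale range since the paper's exact settings are not given in the summary:
```python
import numpy as np

def random_scale(signal, low=0.8, high=1.2, rng=None):
    """Amplitude-scaling augmentation for a 1D EEG window (illustrative scale range)."""
    rng = rng or np.random.default_rng()
    return signal * rng.uniform(low, high)

rng = np.random.default_rng(0)
eeg_window = rng.normal(size=2000)          # stand-in for a few seconds of single-channel EEG
augmented = random_scale(eeg_window, rng=rng)
```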
[251] Laplacian Kernelized Bandit
Shuang Wu, Arash A. Amini
Main category: cs.LG
TL;DR: The paper proposes a multi-user contextual bandit framework that combines graph structure with non-linear reward functions using RKHS theory, developing algorithms with improved regret bounds.
Details
Motivation: Multi-user contextual bandits need to leverage both graph relationships between users and non-linear reward functions, but existing methods either focus on linear rewards or ignore graph structure. There's a need for a principled approach that unifies graph-based regularization with kernel methods for structured exploration.Method: Introduces a joint penalty combining graph smoothness (based on RKHS distances) with individual roughness penalties. Proves this penalty is equivalent to squared norm in a unified multi-user RKHS, deriving its reproducing kernel that fuses graph Laplacian with base arm kernel. Develops two algorithms: LK-GP-UCB and LK-GP-TS, using Gaussian Process posteriors over this new kernel for exploration.
Result: Provides high-probability regret bounds scaling with effective dimension of multi-user kernel rather than user count or ambient dimension. Empirically outperforms linear and non-graph-aware baselines in non-linear settings, and remains competitive even with linear rewards.
Conclusion: The work delivers a unified, theoretically grounded framework bridging Laplacian regularization with kernelized bandits for structured exploration in multi-user contextual bandits with graph relationships and non-linear rewards.
Abstract: We study multi-user contextual bandits where users are related by a graph and their reward functions exhibit both non-linear behavior and graph homophily. We introduce a principled joint penalty for the collection of user reward functions $\{f_u\}$, combining a graph smoothness term based on RKHS distances with an individual roughness penalty. Our central contribution is proving that this penalty is equivalent to the squared norm within a single, unified \emph{multi-user RKHS}. We explicitly derive its reproducing kernel, which elegantly fuses the graph Laplacian with the base arm kernel. This unification allows us to reframe the problem as learning a single ``lifted'' function, enabling the design of principled algorithms, \texttt{LK-GP-UCB} and \texttt{LK-GP-TS}, that leverage Gaussian Process posteriors over this new kernel for exploration. We provide high-probability regret bounds that scale with an \emph{effective dimension} of the multi-user kernel, replacing dependencies on user count or ambient dimension. Empirically, our methods outperform strong linear and non-graph-aware baselines in non-linear settings and remain competitive even when the true rewards are linear. Our work delivers a unified, theoretically grounded, and practical framework that bridges Laplacian regularization with kernelized bandits for structured exploration.
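One common way to fuse a graph Laplacian with a base arm kernel, used below purely as an illustration (the paper's derived kernel may differ), is a product kernel $K((u,x),(v,x')) = M_{uv}\,k(x,x')$ with $M = (\alpha L + I)^{-1}$ encoding smoothness across the user graph. The toy graph, base kernel, and $\alpha$ are placeholders.
```python
import numpy as np

def rbf(X, Y, ls=1.0):
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * ls ** 2))

# Toy user graph (a path over 4 users) and its Laplacian.
A = np.array([[0, 1, 0, 0], [1, 0, 1, 0], [0, 1, 0, 1], [0, 0, 1, 0]], float)
L = np.diag(A.sum(1)) - A

# Assumed fused multi-user kernel: K((u, x), (v, x')) = M[u, v] * k_base(x, x').
alpha = 1.0
M = np.linalg.inv(alpha * L + np.eye(len(A)))

def multiuser_kernel(users_a, X_a, users_b, X_b):
    return M[np.ix_(users_a, users_b)] * rbf(X_a, X_b)

# A GP-UCB-style posterior over observed (user, context) pairs would use this kernel matrix.
users = np.array([0, 1, 2, 3])
contexts = np.random.default_rng(0).normal(size=(4, 3))
K = multiuser_kernel(users, contexts, users, contexts)
print(K.shape)
```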
[252] Neural Chains and Discrete Dynamical Systems
Sauro Succi, Abhisek Ganguly, Santosh Ansumali
Main category: cs.LG
TL;DR: The paper analyzes neural chains (transformers without self-attention) as discrete dynamical systems, comparing standard numerical discretization with PINNs for solving Burgers and Eikonal equations. Both approaches yield similar knowledge but PINNs use random matrices with more parameters, lacking physical transparency.
Details
Motivation: To examine the analogy between neural chains (ML transformers without self-attention) and discrete dynamical systems from neural integral/PDEs, and to compare traditional numerical methods with Physics-Informed Neural Networks (PINNs) for solving PDEs.Method: Comparative analysis of numerical solutions for Burgers (viscid/inviscid) and Eikonal equations using: 1) standard numerical discretization (cast as neural chains), and 2) PINN learning. Both approaches are analyzed in terms of matrix structures and parameter spaces.
Result: Both standard discretization and PINNs acquire essentially the same knowledge about system dynamics. PINNs use random matrices (far more numerous than structured FD matrices) with many more parameters, leading to lack of physical explainability and higher training costs. However, results are limited to 1D problems.
Conclusion: For 1D dynamic problems, PINNs offer no advantage over standard numerical methods while being less transparent and more computationally expensive. However, PINNs/ML may still be advantageous for high-dimensional problems not studied here.
Abstract: We inspect the analogy between machine-learning (ML) applications based on the transformer architecture without self-attention, {\it neural chains} hereafter, and discrete dynamical systems associated with discretised versions of neural integral and partial differential equations (NIE, PDE). A comparative analysis of the numerical solution of the (viscid and inviscid) Burgers and Eikonal equations via standard numerical discretization (also cast in terms of neural chains) and via PINN learning is presented and commented on. It is found that standard numerical discretization and PINN learning provide two different paths to acquire essentially the same knowledge about the dynamics of the system. PINN learning proceeds through random matrices which bear no direct relation to the highly structured matrices associated with finite-difference (FD) procedures. Random matrices leading to acceptable solutions are far more numerous than the unique tridiagonal form in matrix space, which explains why the PINN search typically lands on the random ensemble. The price is a much larger number of parameters, causing lack of physical transparency (explainability) as well as large training costs with no counterpart in the FD procedure. However, our results refer to one-dimensional dynamic problems, hence they don't rule out the possibility that PINNs, and ML in general, may offer better strategies for high-dimensional problems.
[253] TeleDoCTR: Domain-Specific and Contextual Troubleshooting for Telecommunications
Mohamed Trabelsi, Huseyin Uzunalioglu
Main category: cs.LG
TL;DR: TeleDoCTR is a domain-specific AI system for telecom ticket troubleshooting that automates classification, retrieval, and report generation to improve efficiency and accuracy.
Details
Motivation: Telecom troubleshooting is highly complex, time-consuming, and human-intensive due to diverse tickets requiring specialized domain knowledge, leading to delayed resolutions and operational inefficiency.Method: TeleDoCTR integrates domain-specific ranking and generative models to automate three key workflow steps: ticket classification to appropriate expert teams, retrieval of contextually similar historical tickets, and generation of detailed fault analysis reports.
Result: Evaluation on real-world telecom infrastructure data shows TeleDoCTR achieves superior performance over state-of-the-art methods, significantly enhancing troubleshooting accuracy and efficiency.
Conclusion: TeleDoCTR effectively addresses telecom troubleshooting challenges by automating key workflow components, demonstrating practical value for improving operational efficiency in large organizations.
Abstract: Ticket troubleshooting refers to the process of analyzing and resolving problems that are reported through a ticketing system. In large organizations offering a wide range of services, this task is highly complex due to the diversity of submitted tickets and the need for specialized domain knowledge. In particular, troubleshooting in telecommunications (telecom) is a very time-consuming task as it requires experts to interpret ticket content, consult documentation, and search historical records to identify appropriate resolutions. This human-intensive approach not only delays issue resolution but also hinders overall operational efficiency. To enhance the effectiveness and efficiency of ticket troubleshooting in telecom, we propose TeleDoCTR, a novel telecom-related, domain-specific, and contextual troubleshooting system tailored for end-to-end ticket resolution in telecom. TeleDoCTR integrates both domain-specific ranking and generative models to automate key steps of the troubleshooting workflow which are: routing tickets to the appropriate expert team responsible for resolving the ticket (classification task), retrieving contextually and semantically similar historical tickets (retrieval task), and generating a detailed fault analysis report outlining the issue, root cause, and potential solutions (generation task). We evaluate TeleDoCTR on a real-world dataset from a telecom infrastructure and demonstrate that it achieves superior performance over existing state-of-the-art methods, significantly enhancing the accuracy and efficiency of the troubleshooting process.
[254] When Small Models Are Right for Wrong Reasons: Process Verification for Trustworthy Agents
Laksh Advani
Main category: cs.LG
TL;DR: Small language models (7-9B parameters) often produce correct answers with flawed reasoning (50-69% of cases), creating a “Right-for-Wrong-Reasons” reliability crisis invisible to standard accuracy metrics.
Details
Motivation: Deploying small language models as autonomous agents requires trust in their reasoning process, not just final outputs. Standard accuracy metrics fail to detect fundamentally flawed reasoning that leads to correct answers.Method: Analyzed 10,734 reasoning traces across three models and diverse tasks. Introduced Reasoning Integrity Score (RIS) as a process-based metric with substantial inter-rater agreement. Evaluated interventions like retrieval-augmented generation (RAG) and meta-cognitive approaches. Developed a neural classifier for verification.
Result: 50-69% of correct answers contain flawed reasoning. RAG significantly improves reasoning integrity (Cohen’s d=0.23-0.93), while meta-cognitive interventions often harm performance (d=-0.14 to -0.33). Neural classifier achieves 0.86 F1-score with 100× speedup for verification.
Conclusion: Process-based verification is essential for trustworthy agents. Accuracy alone is dangerously insufficient when models can be right for entirely wrong reasons. RAG helps ground reasoning in evidence, while meta-cognition may be harmful for small models.
Abstract: Deploying small language models (7-9B parameters) as autonomous agents requires trust in their reasoning, not just their outputs. We reveal a critical reliability crisis: 50-69% of correct answers from these models contain fundamentally flawed reasoning – a ``Right-for-Wrong-Reasons'' phenomenon invisible to standard accuracy metrics. Through analysis of 10,734 reasoning traces across three models and diverse tasks, we introduce the Reasoning Integrity Score (RIS), a process-based metric validated with substantial inter-rater agreement ($\kappa=0.657$). Conventional practices are challenged by our findings: while retrieval-augmented generation (RAG) significantly improves reasoning integrity (Cohen’s $d=0.23$–$0.93$), meta-cognitive interventions like self-critique often harm performance ($d=-0.14$ to $-0.33$) in small models on the evaluated tasks. Mechanistic analysis reveals RAG succeeds by grounding calculations in external evidence, reducing errors by 7.6%, while meta-cognition amplifies confusion without sufficient model capacity. To enable deployment, verification capabilities are distilled into a neural classifier achieving 0.86 F1-score with 100$\times$ speedup. These results underscore the necessity of process-based verification for trustworthy agents: accuracy alone is dangerously insufficient when models can be right for entirely wrong reasons.
[255] Memory Bank Compression for Continual Adaptation of Large Language Models
Thomas Katraouras, Dimitrios Rafailidis
Main category: cs.LG
TL;DR: MBC: A memory bank compression method for continual learning in LLMs that reduces memory size to 0.3% of baselines while maintaining high retention accuracy through codebook optimization and online resetting mechanisms.
Details
Motivation: LLMs need continual learning to stay updated with evolving data, but existing memory-augmented approaches suffer from constantly growing memory banks that become impractical with large-scale data streams.Method: Proposes MBC with three key components: 1) Codebook optimization strategy to compress memory bank, 2) Online resetting mechanism to prevent codebook collapse, 3) Key-Value Low-Rank Adaptation (KV-LoRA) in attention layers for efficient use of compressed memory representations.
Result: MBC reduces memory bank size to 0.3% compared to the most competitive baseline while maintaining high retention accuracy during online adaptation learning on benchmark question-answering datasets.
Conclusion: MBC provides an effective solution for continual learning in LLMs by addressing the memory growth problem through compression while maintaining performance, making it practical for real-world scenarios with large-scale data streams.
Abstract: Large Language Models (LLMs) have become a mainstay for many everyday applications. However, as data evolve their knowledge quickly becomes outdated. Continual learning aims to update LLMs with new information without erasing previously acquired knowledge. Although methods such as full fine-tuning can incorporate new data, they are computationally expensive and prone to catastrophic forgetting, where prior knowledge is overwritten. Memory-augmented approaches address this by equipping LLMs with a memory bank, that is an external memory module which stores information for future use. However, these methods face a critical limitation, in particular, the memory bank constantly grows in the real-world scenario when large-scale data streams arrive. In this paper, we propose MBC, a model that compresses the memory bank through a codebook optimization strategy during online adaptation learning. To ensure stable learning, we also introduce an online resetting mechanism that prevents codebook collapse. In addition, we employ Key-Value Low-Rank Adaptation in the attention layers of the LLM, enabling efficient utilization of the compressed memory representations. Experiments with benchmark question-answering datasets demonstrate that MBC reduces the memory bank size to 0.3% when compared against the most competitive baseline, while maintaining high retention accuracy during online adaptation learning. Our code is publicly available at https://github.com/Thomkat/MBC.
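The codebook idea can be pictured with an offline vector-quantization stand-in: cluster the memory entries, then store one small codeword index per entry plus the codebook itself. MBC's actual codebook is optimized online with a resetting mechanism and paired with KV-LoRA, so the sketch below (all sizes are arbitrary) only conveys the compression arithmetic.
```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
memory_bank = rng.normal(size=(2000, 256))        # uncompressed memory entries (toy sizes)

# Learn a small codebook offline and keep only one codeword index per entry.
codebook_size = 64
km = KMeans(n_clusters=codebook_size, n_init=4, random_state=0).fit(memory_bank)
codes = km.predict(memory_bank)                   # one small integer per memory entry
codebook = km.cluster_centers_                    # (64, 256): also stored, but tiny

reconstructed = codebook[codes]                   # lookup approximates the original entry
stored = codes.nbytes + codebook.nbytes
print(f"compressed storage is {stored / memory_bank.nbytes:.1%} of the raw bank")
```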
[256] Trajectory Guard – A Lightweight, Sequence-Aware Model for Real-Time Anomaly Detection in Agentic AI
Laksh Advani
Main category: cs.LG
TL;DR: Trajectory Guard: A Siamese Recurrent Autoencoder with hybrid loss for detecting anomalies in LLM-generated multi-step action plans, achieving high F1-scores (0.88-0.94) and fast inference (32 ms).
Details
Motivation: Existing anomaly detection methods fail for LLM-generated action plans: mean-pooling embeddings dilute anomalous steps, contrastive-only approaches ignore sequential structure, and standard methods achieve poor F1-scores (≤0.69). Need unified detection of both task-trajectory misalignment and structural incoherence.Method: Siamese Recurrent Autoencoder with hybrid loss function that jointly learns: 1) task-trajectory alignment via contrastive learning, and 2) sequential validity via reconstruction. This dual objective enables detection of both “wrong plan for this task” and “malformed plan structure.”
Result: Achieves F1-scores of 0.88-0.94 on balanced benchmarks (synthetic perturbations, RAS-Eval security audits, Who&When multi-agent systems). Recall of 0.86-0.92 on imbalanced external benchmarks. 32 ms inference latency, 17-27× faster than LLM Judge baselines.
Conclusion: Trajectory Guard provides effective real-time safety verification for LLM-generated action plans, addressing both contextual misalignment and structural incoherence with high accuracy and low latency suitable for production deployments.
Abstract: Autonomous LLM agents generate multi-step action plans that can fail due to contextual misalignment or structural incoherence. Existing anomaly detection methods are ill-suited for this challenge: mean-pooling embeddings dilutes anomalous steps, while contrastive-only approaches ignore sequential structure. Standard unsupervised methods on pre-trained embeddings achieve F1-scores no higher than 0.69. We introduce Trajectory Guard, a Siamese Recurrent Autoencoder with a hybrid loss function that jointly learns task-trajectory alignment via contrastive learning and sequential validity via reconstruction. This dual objective enables unified detection of both “wrong plan for this task” and “malformed plan structure.” On benchmarks spanning synthetic perturbations and real-world failures from security audits (RAS-Eval) and multi-agent systems (Who&When), we achieve F1-scores of 0.88-0.94 on balanced sets and recall of 0.86-0.92 on imbalanced external benchmarks. At 32 ms inference latency, our approach runs 17-27$\times$ faster than LLM Judge baselines, enabling real-time safety verification in production deployments.
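The dual objective can be sketched with a small GRU autoencoder plus a task-alignment head: the reconstruction term targets structural validity, while the contrastive term targets task-trajectory alignment. This is an assumed minimal architecture, not the authors' model; dimensions, the margin, and the loss weighting are placeholders.
```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TrajectoryGuardSketch(nn.Module):
    """Minimal sketch: GRU autoencoder over plan steps plus a task-alignment projection."""
    def __init__(self, step_dim=64, task_dim=64, hidden=128):
        super().__init__()
        self.encoder = nn.GRU(step_dim, hidden, batch_first=True)
        self.decoder = nn.GRU(step_dim, hidden, batch_first=True)
        self.out = nn.Linear(hidden, step_dim)
        self.task_proj = nn.Linear(task_dim, hidden)

    def forward(self, steps, task):
        _, h = self.encoder(steps)                 # final hidden state: trajectory embedding
        traj_emb = h[-1]
        dec_out, _ = self.decoder(steps, h)        # teacher-forced reconstruction
        recon = self.out(dec_out)
        task_emb = self.task_proj(task)
        return traj_emb, task_emb, recon

def hybrid_loss(traj_emb, task_emb, recon, steps, margin=0.5, labels=None):
    """Reconstruction term plus a simple contrastive alignment term (labels: 1 = matched pair)."""
    rec = F.mse_loss(recon, steps)
    sim = F.cosine_similarity(traj_emb, task_emb)
    contrastive = torch.where(labels == 1, 1 - sim, F.relu(sim - margin)).mean()
    return rec + contrastive

model = TrajectoryGuardSketch()
steps = torch.randn(8, 12, 64)                     # batch of 12-step plan embeddings
task = torch.randn(8, 64)
labels = torch.randint(0, 2, (8,))
loss = hybrid_loss(*model(steps, task), steps, labels=labels)
loss.backward()
```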
[257] Geometry of Reason: Spectral Signatures of Valid Mathematical Reasoning
Valentin Noël
Main category: cs.LG
TL;DR: Training-free method uses spectral analysis of attention patterns to detect valid mathematical reasoning in LLMs, achieving 85-96% accuracy across multiple models without any training.
Details
Motivation: Need for reliable methods to verify mathematical reasoning in LLMs without requiring training data or fine-tuning, and to understand what features distinguish valid from invalid reasoning.Method: Treat attention matrices as adjacency matrices of dynamic graphs over tokens, then extract four spectral diagnostics: Fiedler value (algebraic connectivity), high-frequency energy ratio (HFER), graph signal smoothness, and spectral entropy.
Result: Method achieves effect sizes up to Cohen’s d=3.30 (p<10^-116) and 85.0-95.6% classification accuracy across 7 transformer models from 4 architectural families, with calibrated thresholds reaching 93-95% accuracy on full dataset.
Conclusion: Spectral graph analysis provides a principled framework for reasoning verification, detects logical coherence rather than just formal verification acceptance, and reveals architectural dependencies in which spectral features capture reasoning validity.
Abstract: We present a training-free method for detecting valid mathematical reasoning in large language models through spectral analysis of attention patterns. By treating attention matrices as adjacency matrices of dynamic graphs over tokens, we extract four interpretable spectral diagnostics, the Fiedler value (algebraic connectivity), high-frequency energy ratio (HFER), graph signal smoothness, and spectral entropy, that exhibit statistically significant differences between valid and invalid mathematical proofs. Experiments across seven transformer models from four independent architectural families (Meta Llama, Alibaba Qwen, Microsoft Phi, and Mistral AI) demonstrate that this spectral signature produces effect sizes up to Cohen’s $d = 3.30$ ($p < 10^{-116}$), enabling 85.0–95.6% classification accuracy under rigorous evaluation, with calibrated thresholds reaching 93–95% on the full dataset. The method requires no training data, fine-tuning, or learned classifiers: a single threshold on a spectral metric suffices for high accuracy. Through systematic label correction, we discover that the spectral method detects logical coherence rather than compiler acceptance, identifying mathematically valid proofs that formal verifiers reject due to technical failures. We further identify an architectural dependency: Mistral-7B’s Sliding Window Attention shifts the discriminative signal from HFER to late-layer Smoothness ($d = 2.09$, $p_{\text{MW}} = 1.16 \times 10^{-48}$), revealing that attention mechanism design affects which spectral features capture reasoning validity. These findings establish spectral graph analysis as a principled framework for reasoning verification with immediate applications to hallucination detection and AI safety monitoring.
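Because the four diagnostics are standard spectral-graph quantities, they are easy to compute from a single attention matrix. The sketch below uses plausible but assumed definitions (for example, the high-frequency cutoff at the upper half of the spectrum), so it should be read as an illustration of the pipeline rather than the paper's exact metrics.
```python
import numpy as np

def spectral_diagnostics(attn, token_feats):
    """Diagnostics from one attention matrix treated as a weighted graph over tokens.
    attn: (T, T) attention weights; token_feats: (T, d) hidden states used as graph signals."""
    W = 0.5 * (attn + attn.T)                       # symmetrize into an undirected graph
    L = np.diag(W.sum(1)) - W                       # graph Laplacian
    evals, evecs = np.linalg.eigh(L)
    fiedler = evals[1]                              # algebraic connectivity (2nd-smallest eigenvalue)

    coeffs = evecs.T @ token_feats                  # graph Fourier transform of the features
    energy = (coeffs ** 2).sum(1)
    hfer = energy[len(energy) // 2:].sum() / energy.sum()    # high-frequency energy ratio
    smoothness = np.trace(token_feats.T @ L @ token_feats)   # signal smoothness x^T L x

    p = np.clip(evals, 1e-12, None)
    p = p / p.sum()
    spectral_entropy = -(p * np.log(p)).sum()
    return fiedler, hfer, smoothness, spectral_entropy

rng = np.random.default_rng(0)
T, d = 32, 16
attn = rng.dirichlet(np.ones(T), size=T)            # row-stochastic toy attention matrix
feats = rng.normal(size=(T, d))
print(spectral_diagnostics(attn, feats))
```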
[258] A Sparse-Attention Deep Learning Model Integrating Heterogeneous Multimodal Features for Parkinson’s Disease Severity Profiling
Dristi Datta, Tanmoy Debnath, Minh Chau, Manoranjan Paul, Gourab Adhikary, Md Geaur Rahman
Main category: cs.LG
TL;DR: SAFN is an interpretable deep learning framework that integrates MRI and clinical data for Parkinson’s disease classification, achieving 98% accuracy with clinically coherent decision-making.
Details
Motivation: Existing computational models for Parkinson's disease struggle with interpretability, class imbalance, and effective fusion of high-dimensional multimodal data (imaging and clinical features).Method: Class-Weighted Sparse-Attention Fusion Network (SAFN) uses modality-specific encoders, symmetric cross-attention for nonlinear interactions, sparsity-constrained attention-gating fusion, and class-balanced focal loss to handle dataset imbalance.
Result: Achieved 0.98 ± 0.02 accuracy and 1.00 ± 0.00 PR-AUC on 703 participants (570 PD, 133 controls) from PPMI dataset, outperforming established baselines. Interpretability analysis shows ~60% weight to clinical assessments.
Conclusion: SAFN provides a reproducible, transparent multimodal modeling paradigm for computational profiling of neurodegenerative diseases with clinically coherent decision processes.
Abstract: Characterising the heterogeneous presentation of Parkinson’s disease (PD) requires integrating biological and clinical markers within a unified predictive framework. While multimodal data provide complementary information, many existing computational models struggle with interpretability, class imbalance, or effective fusion of high-dimensional imaging and tabular clinical features. To address these limitations, we propose the Class-Weighted Sparse-Attention Fusion Network (SAFN), an interpretable deep learning framework for robust multimodal profiling. SAFN integrates MRI cortical thickness, MRI volumetric measures, clinical assessments, and demographic variables using modality-specific encoders and a symmetric cross-attention mechanism that captures nonlinear interactions between imaging and clinical representations. A sparsity-constrained attention-gating fusion layer dynamically prioritises informative modalities, while a class-balanced focal loss (beta = 0.999, gamma = 1.5) mitigates dataset imbalance without synthetic oversampling. Evaluated on 703 participants (570 PD, 133 healthy controls) from the Parkinson’s Progression Markers Initiative using subject-wise five-fold cross-validation, SAFN achieves an accuracy of 0.98 plus or minus 0.02 and a PR-AUC of 1.00 plus or minus 0.00, outperforming established machine learning and deep learning baselines. Interpretability analysis shows a clinically coherent decision process, with approximately 60 percent of predictive weight assigned to clinical assessments, consistent with Movement Disorder Society diagnostic principles. SAFN provides a reproducible and transparent multimodal modelling paradigm for computational profiling of neurodegenerative disease.
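The reported loss settings (beta = 0.999, gamma = 1.5) are in the style of the effective-number class-balanced focal loss of Cui et al.; a minimal PyTorch version is sketched below with the study's PD/control counts as the per-class sample sizes. The fusion network itself is not shown, only the loss, and the normalization of the class weights is an assumption.
```python
import torch
import torch.nn.functional as F

def class_balanced_focal_loss(logits, targets, samples_per_class, beta=0.999, gamma=1.5):
    """Effective-number class weights combined with a focal modulation term."""
    n_c = torch.tensor(samples_per_class, dtype=torch.float)
    weights = (1.0 - beta) / (1.0 - torch.pow(torch.tensor(beta), n_c))
    weights = weights / weights.sum() * len(samples_per_class)   # keep the overall scale comparable

    log_probs = F.log_softmax(logits, dim=1)
    log_pt = log_probs.gather(1, targets.unsqueeze(1)).squeeze(1)
    pt = log_pt.exp()
    return (-weights[targets] * (1.0 - pt) ** gamma * log_pt).mean()

# Toy call mirroring the study's imbalance (133 controls vs 570 PD).
logits = torch.randn(16, 2)
targets = torch.randint(0, 2, (16,))
loss = class_balanced_focal_loss(logits, targets, samples_per_class=[133, 570])
```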
[259] Optimizing LSTM Neural Networks for Resource-Constrained Retail Sales Forecasting: A Model Compression Study
Ravi Teja Pagidoju
Main category: cs.LG
TL;DR: LSTM model compression by reducing hidden units from 128 to 16 shows that 64-unit model achieves better accuracy (12.4% MAPE) than 128-unit model (23.6% MAPE) while being 73% smaller.
Details
Motivation: Standard LSTM models provide accurate sales predictions but require high computing power, which is challenging for mid to small retail industries. There's a need for model compression to reduce computational requirements while maintaining or improving accuracy.Method: Gradually reduced the number of hidden units in LSTM models from 128 to 16. Used Kaggle Store Item Demand Forecasting dataset with 913,000 daily sales records from 10 stores and 50 items. Evaluated trade-off between model size and prediction accuracy using mean absolute percentage error (MAPE).
Result: Reducing the number of hidden units to 64 did not just maintain accuracy but improved it: MAPE fell from 23.6% for the 128-unit model to 12.4% for the 64-unit model. The optimized model is 73% smaller (280KB to 76KB) and 47% more accurate.
Conclusion: Larger models don’t always achieve better results. Model compression through hidden unit reduction can significantly reduce model size while improving accuracy, making LSTM models more accessible for mid to small retail industries with limited computing resources.
Abstract: Standard LSTM (Long Short-Term Memory) neural networks provide accurate predictions for sales data in the retail industry, but require a lot of computing power, which can be challenging especially for small and mid-sized retailers. This paper examines LSTM model compression by gradually reducing the number of hidden units from 128 to 16. We used the Kaggle Store Item Demand Forecasting dataset, which has 913,000 daily sales records from 10 stores and 50 items, to examine the trade-off between model size and prediction accuracy. Experiments show that lowering the number of hidden LSTM units to 64 does not degrade accuracy and in fact improves it: the mean absolute percentage error (MAPE) falls from 23.6% for the full 128-unit model to 12.4% for the 64-unit model. The optimized model is 73% smaller (from 280KB to 76KB) and 47% more accurate. These results show that larger models do not always achieve better results.
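The compression knob in the study is simply the LSTM hidden size; a minimal sketch of a forecaster with a configurable hidden width and the MAPE metric is given below. The window length, feature count, and data are placeholders, not the Kaggle pipeline.
```python
import torch
import torch.nn as nn

class SalesLSTM(nn.Module):
    """Forecaster whose size is controlled by `hidden_units` (the compression knob)."""
    def __init__(self, n_features=1, hidden_units=64):
        super().__init__()
        self.lstm = nn.LSTM(n_features, hidden_units, batch_first=True)
        self.head = nn.Linear(hidden_units, 1)

    def forward(self, x):
        out, _ = self.lstm(x)
        return self.head(out[:, -1])        # predict next-day sales from the last time step

def mape(y_true, y_pred, eps=1e-8):
    return (torch.abs((y_true - y_pred) / (y_true + eps))).mean() * 100

small = SalesLSTM(hidden_units=64)           # the 64-unit configuration from the study
x = torch.rand(32, 28, 1)                    # 28-day windows of (scaled) daily sales
y = torch.rand(32, 1)
print(float(mape(y, small(x))))
```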
[260] Federated Customization of Large Models: Approaches, Experiments, and Insights
Yuchuan Ye, Ming Ding, Youjia Chen, Peng Cheng, Dusit Niyato
Main category: cs.LG
TL;DR: This paper explores federated customization of large models, reviews popular customization techniques, discusses their implementation in federated learning, and conducts experiments on federated prefix-tuning as a first trial in this setting.
Details
Motivation: To address the challenges of customizing large models within federated learning frameworks, where data privacy and decentralization constraints make traditional customization methods difficult to apply.Method: The paper reviews various large model customization techniques (full fine-tuning, efficient fine-tuning, prompt engineering, prefix-tuning, knowledge distillation, retrieval-augmented generation) and discusses their federated implementation. It then conducts experiments specifically on federated prefix-tuning as a novel application in federated learning.
Result: Federated prefix-tuning is validated as feasible with performance close to centralized approaches. It demonstrates competitive performance compared to three other federated customization methods, with satisfactory efficiency and consistent robustness.
Conclusion: Federated customization of large models is achievable, with prefix-tuning showing particular promise as a viable approach in federated learning settings, offering competitive performance while maintaining data privacy and decentralization benefits.
Abstract: In this article, we explore federated customization of large models and highlight the key challenges it poses within the federated learning framework. We review several popular large model customization techniques, including full fine-tuning, efficient fine-tuning, prompt engineering, prefix-tuning, knowledge distillation, and retrieval-augmented generation. Then, we discuss how these techniques can be implemented within the federated learning framework. Moreover, we conduct experiments on federated prefix-tuning, which, to the best of our knowledge, is the first trial to apply prefix-tuning in the federated learning setting. The conducted experiments validate its feasibility with performance close to centralized approaches. Further comparison with three other federated customization methods demonstrated its competitive performance, satisfactory efficiency, and consistent robustness.
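The summary does not specify the aggregation rule, so the sketch below assumes the simplest choice: clients train only their prefix vectors locally and the server performs FedAvg-style weighted averaging over those prefixes. Sizes and client counts are arbitrary.
```python
import numpy as np

rng = np.random.default_rng(0)
n_clients, prefix_len, hidden = 5, 10, 256

# Each client fine-tunes only its prefix vectors; the backbone stays frozen locally.
client_prefixes = [rng.normal(size=(prefix_len, hidden)) for _ in range(n_clients)]
client_sizes = np.array([120, 80, 200, 60, 140])     # local dataset sizes

# FedAvg-style aggregation over the prefix parameters only.
weights = client_sizes / client_sizes.sum()
global_prefix = sum(w * p for w, p in zip(weights, client_prefixes))
print(global_prefix.shape)    # (10, 256): broadcast back to every client for the next round
```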
[261] Cloud-Native Generative AI for Automated Planogram Synthesis: A Diffusion Model Approach for Multi-Store Retail Optimization
Ravi Teja Pagidoju, Shriya Agarwal
Main category: cs.LG
TL;DR: Cloud-native diffusion model system reduces planogram creation time by 98.3% (30 to 0.5 hours) with 94.4% constraint satisfaction and 97.5% cost reduction.
Details
Motivation: Planogram creation is time-consuming and expensive for retail, taking an average of 30 hours per complex layout, creating a need for automated solutions.Method: Cloud-native architecture using diffusion models trained on successful shelf arrangements across multiple retail locations, with AWS-based training and edge deployment for real-time inference, integrating retail constraints through modified loss function.
Result: 98.3% reduction in design time (30 to 0.5 hours), 94.4% constraint satisfaction, 97.5% reduction in creation expenses, 4.4-month break-even period, and linear scalability supporting 10,000 concurrent store requests.
Conclusion: The system demonstrates viability of generative AI for automated retail space optimization with significant time and cost savings while maintaining high constraint satisfaction.
Abstract: Planogram creation is a significant challenge for retail, requiring an average of 30 hours per complex layout. This paper introduces a cloud-native architecture using diffusion models to automatically generate store-specific planograms. Unlike conventional optimization methods that reorganize existing layouts, our system learns from successful shelf arrangements across multiple retail locations to create new planogram configurations. The architecture combines cloud-based model training via AWS with edge deployment for real-time inference. The diffusion model integrates retail-specific constraints through a modified loss function. Simulation-based analysis demonstrates the system reduces planogram design time by 98.3% (from 30 to 0.5 hours) while achieving 94.4% constraint satisfaction. Economic analysis reveals a 97.5% reduction in creation expenses with a 4.4-month break-even period. The cloud-native architecture scales linearly, supporting up to 10,000 concurrent store requests. This work demonstrates the viability of generative AI for automated retail space optimization.
[262] Entropy Production in Machine Learning Under Fokker-Planck Probability Flow
Lennon Shikhman
Main category: cs.LG
TL;DR: Proposes entropy-based retraining framework for ML models in nonstationary environments using nonequilibrium stochastic dynamics to detect data drift and optimize retraining frequency.
Details
Motivation: Existing drift detection methods lack principled dynamical interpretation and don't provide guidance on balancing retraining frequency against operational costs in nonstationary environments.Method: Models deployment-time data drift as probability flow governed by Fokker-Planck equation, quantifies model-data mismatch using time-evolving KL divergence, and uses entropy-balance decomposition with nonnegative entropy production term to trigger retraining.
Result: In nonstationary classification experiments, entropy-triggered retraining achieves comparable predictive performance to high-frequency retraining while reducing retraining events by an order of magnitude compared to daily and label-based policies.
Conclusion: Entropy-triggered retraining provides a principled, label-free intervention strategy that responds to accumulated model-data mismatch rather than delayed performance collapse, optimizing the trade-off between performance and operational costs.
Abstract: Machine learning models deployed in nonstationary environments experience performance degradation due to data drift. While many drift detection heuristics exist, most lack a principled dynamical interpretation and provide limited guidance on how retraining frequency should be balanced against operational cost. In this work, we propose an entropy-based retraining framework grounded in nonequilibrium stochastic dynamics. Modeling deployment-time data drift as probability flow governed by a Fokker–Planck equation, we quantify model-data mismatch using a time-evolving Kullback–Leibler divergence. We show that the time derivative of this mismatch admits an entropy-balance decomposition featuring a nonnegative entropy production term driven by probability currents. This interpretation motivates entropy-triggered retraining as a label-free intervention strategy that responds to accumulated mismatch rather than delayed performance collapse. In a controlled nonstationary classification experiment, entropy-triggered retraining achieves predictive performance comparable to high-frequency retraining while reducing retraining events by an order of magnitude relative to daily and label-based policies.
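The deployment-side logic can be approximated without the Fokker-Planck machinery: track an empirical divergence between the live input distribution and a training-time reference, and retrain when it exceeds a threshold. The sketch below uses a histogram KL estimate as a stand-in for the paper's entropy-production accounting; the drift process and threshold are illustrative.
```python
import numpy as np

def kl_divergence(p_hist, q_hist, eps=1e-9):
    p = p_hist / p_hist.sum() + eps
    q = q_hist / q_hist.sum() + eps
    return float((p * np.log(p / q)).sum())

rng = np.random.default_rng(0)
bins = np.linspace(-5, 5, 41)
reference = rng.normal(0, 1, 5000)                  # training-time feature distribution
ref_hist, _ = np.histogram(reference, bins=bins)

threshold, drift_rate = 0.05, 0.02                  # illustrative values
for day in range(60):
    live = rng.normal(drift_rate * day, 1, 1000)    # slowly drifting deployment data
    live_hist, _ = np.histogram(live, bins=bins)
    mismatch = kl_divergence(live_hist, ref_hist)
    if mismatch > threshold:                        # divergence-triggered, label-free retrain
        print(f"day {day}: KL={mismatch:.3f} -> retrain and reset reference")
        ref_hist = live_hist
```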
[263] Adversarial Samples Are Not Created Equal
Jennifer Crawford, Amol Khanna, Fred Lu, Amy R. Wagoner, Stella Biderman, Andre T. Nguyen, Edward Raff
Main category: cs.LG
TL;DR: The paper proposes differentiating between two types of adversarial weaknesses: those exploiting brittle but predictive features and those that don’t, and introduces an ensemble-based metric to measure manipulation of non-robust features.
Details
Motivation: Current theory of non-robust features overlooks adversarial samples that don't directly utilize these features. The authors argue these two types of samples comprise different adversarial weaknesses that should be differentiated when evaluating robustness.Method: Proposes an ensemble-based metric to measure the manipulation of non-robust features by adversarial perturbations, using this metric to analyze the makeup of adversarial samples generated by attackers.
Result: The new perspective allows re-examination of multiple phenomena, including the impact of sharpness-aware minimization on adversarial robustness and the robustness gap between adversarially trained and standard models on robust datasets.
Conclusion: Adversarial weaknesses should be categorized into two types based on whether they exploit brittle predictive features or not, and this differentiation provides new insights for evaluating and understanding adversarial robustness.
Abstract: Over the past decade, numerous theories have been proposed to explain the widespread vulnerability of deep neural networks to adversarial evasion attacks. Among these, the theory of non-robust features proposed by Ilyas et al. has been widely accepted, showing that brittle but predictive features of the data distribution can be directly exploited by attackers. However, this theory overlooks adversarial samples that do not directly utilize these features. In this work, we advocate that these two kinds of samples - those which use brittle but predictive features and those that do not - comprise two types of adversarial weaknesses and should be differentiated when evaluating adversarial robustness. For this purpose, we propose an ensemble-based metric to measure the manipulation of non-robust features by adversarial perturbations and use this metric to analyze the makeup of adversarial samples generated by attackers. This new perspective also allows us to re-examine multiple phenomena, including the impact of sharpness-aware minimization on adversarial robustness and the robustness gap observed between adversarial training and standard training on robust datasets.
[264] Learning to be Reproducible: Custom Loss Design for Robust Neural Networks
Waqas Ahmed, Sheeba Samuel, Kevin Coakley, Birgitta Koenig-Ries, Odd Erik Gundersen
Main category: cs.LG
TL;DR: Proposed Custom Loss Function (CLF) improves training stability and reduces performance variability across runs without sacrificing accuracy.
Details
Motivation: Addresses lack of mechanisms ensuring consistent performance across training runs, as current methods show significant accuracy variability even under controlled conditions.Method: Custom Loss Function (CLF) that reduces sensitivity to stochastic factors like weight initialization and data shuffling, with fine-tuned parameters balancing accuracy and stability.
Result: Extensive experiments across diverse architectures for image classification and time series forecasting show significant improvement in training robustness without predictive performance loss.
Conclusion: CLF establishes an effective and efficient strategy for developing more stable, reliable, and trustworthy neural networks.
Abstract: To enhance the reproducibility and reliability of deep learning models, we address a critical gap in current training methodologies: the lack of mechanisms that ensure consistent and robust performance across runs. Our empirical analysis reveals that even under controlled initialization and training conditions, the accuracy of the model can exhibit significant variability. To address this issue, we propose a Custom Loss Function (CLF) that reduces the sensitivity of training outcomes to stochastic factors such as weight initialization and data shuffling. By fine-tuning its parameters, CLF explicitly balances predictive accuracy with training stability, leading to more consistent and reliable model performance. Extensive experiments across diverse architectures for both image classification and time series forecasting demonstrate that our approach significantly improves training robustness without sacrificing predictive performance. These results establish CLF as an effective and efficient strategy for developing more stable, reliable and trustworthy neural networks.
[265] HFedMoE: Resource-aware Heterogeneous Federated Learning with Mixture-of-Experts
Zihan Fang, Zheng Lin, Senkang Hu, Yanan Ma, Yihang Tao, Yiqin Deng, Xianhao Chen, Yuguang Fang
Main category: cs.LG
TL;DR: HFedMoE: A heterogeneous MoE-based FL framework that customizes expert subsets per client for efficient LLM fine-tuning, addressing expert selection, resource heterogeneity, and aggregation challenges.
Details
Motivation: While FL enables privacy-preserving LLM fine-tuning, on-device training is impractical for resource-constrained clients. MoE models offer computation efficiency but face three key challenges in FL: 1) difficulty selecting appropriate experts without reliable performance metrics, 2) heterogeneous client resources overwhelmed by dynamic expert activations, and 3) aggregation issues from client-specific expert subsets and routing preferences.Method: HFedMoE customizes expert subsets per client based on computing budgets. It identifies expert importance based on fine-tuning performance contributions, adaptively selects expert subsets from an information bottleneck perspective, and uses a sparsity-aware aggregation strategy to aggregate actively fine-tuned experts and gating parameters with importance-weighted contributions.
Result: Extensive experiments demonstrate that HFedMoE outperforms state-of-the-art benchmarks in both training accuracy and convergence speed.
Conclusion: HFedMoE effectively addresses the challenges of MoE-based FL for LLM fine-tuning by providing a heterogeneous framework that balances computation efficiency with performance, enabling practical deployment on resource-constrained devices while maintaining model quality.
Abstract: While federated learning (FL) enables fine-tuning of large language models (LLMs) without compromising data privacy, the substantial size of an LLM renders on-device training impractical for resource-constrained clients, such as mobile devices. Thus, Mixture-of-Experts (MoE) models have emerged as a computation-efficient solution, which activates only a sparse subset of experts during model training to reduce computing burden without sacrificing performance. Though integrating MoE into FL fine-tuning holds significant potential, it still encounters three key challenges: i) selecting appropriate experts for clients remains challenging due to the lack of a reliable metric to measure each expert’s impact on local fine-tuning performance, ii) the heterogeneous computing resources across clients severely hinder MoE-based LLM fine-tuning, as dynamic expert activations across diverse input samples can overwhelm resource-constrained devices, and iii) client-specific expert subsets and routing preferences undermine global aggregation, where misaligned expert updates and inconsistent gating networks introduce destructive interference. To address these challenges, we propose HFedMoE, a heterogeneous MoE-based FL fine-tuning framework that customizes a subset of experts to each client for computation-efficient LLM fine-tuning. Specifically, HFedMoE identifies the expert importance based on its contributions to fine-tuning performance, and then adaptively selects a subset of experts from an information bottleneck perspective to align with each client’s computing budget. A sparsity-aware model aggregation strategy is also designed to aggregate the actively fine-tuned experts and gating parameters with importance-weighted contributions. Extensive experiments demonstrate that HFedMoE outperforms state-of-the-art benchmarks in training accuracy and convergence speed.
[266] Cycling Race Time Prediction: A Personalized Machine Learning Approach Using Route Topology and Training Load
Francisco Aguilera Moreno
Main category: cs.LG
TL;DR: ML model predicts cycling ride duration using route topology and athlete fitness metrics, achieving 6.60 min MAE and 0.922 R², outperforming topology-only models by 14%.
Details
Motivation: Existing physics-based cycling duration prediction models require impractical aerodynamic parameters and wind forecasts for amateur cyclists. There's a need for simpler, data-driven approaches that use readily available metrics.Method: Machine learning approach using Lasso regression with route topology features and athlete fitness state derived from training load metrics (CTL, ATL). Uses historical performance data in an N-of-1 study design with rigorous feature engineering to prevent data leakage.
Result: Model achieves MAE=6.60 minutes and R²=0.922 on single-athlete dataset (N=96 rides). Integrating fitness metrics reduces error by 14% compared to topology-only models (MAE=7.66 min). Progressive checkpoint predictions enable dynamic race planning.
Conclusion: Machine learning with route topology and fitness metrics provides accurate cycling duration predictions without complex physical measurements. Physiological state meaningfully constrains performance even in self-paced efforts, and the approach enables practical training planning for amateur cyclists.
Abstract: Predicting cycling duration for a given route is essential for training planning and event preparation. Existing solutions rely on physics-based models that require extensive parameterization, including aerodynamic drag coefficients and real-time wind forecasts, parameters impractical for most amateur cyclists. This work presents a machine learning approach that predicts ride duration using route topology features combined with the athlete’s current fitness state derived from training load metrics. The model learns athlete-specific performance patterns from historical data, substituting complex physical measurements with historical performance proxies. We evaluate the approach using a single-athlete dataset (N=96 rides) in an N-of-1 study design. After rigorous feature engineering to eliminate data leakage, we find that Lasso regression with Topology + Fitness features achieves MAE=6.60 minutes and R²=0.922. Notably, integrating fitness metrics (CTL, ATL) reduces error by 14% compared to topology alone (MAE=7.66 min), demonstrating that physiological state meaningfully constrains performance even in self-paced efforts. Progressive checkpoint predictions enable dynamic race planning as route difficulty becomes apparent.
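A minimal sketch of the kind of Topology + Fitness model the paper describes, using scikit-learn; the feature set, synthetic data, and time-ordered splits below are illustrative assumptions, not the authors' pipeline.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.model_selection import TimeSeriesSplit
from sklearn.metrics import mean_absolute_error

rng = np.random.default_rng(0)
n = 96  # N-of-1 dataset size reported in the paper
# Hypothetical features: route topology (distance, elevation gain) plus
# training-load state (CTL = chronic, ATL = acute training load).
X = rng.normal(size=(n, 4))          # [distance_km, elev_gain_m, CTL, ATL]
y = 120 + 30 * X[:, 0] + 15 * X[:, 1] - 5 * X[:, 2] + rng.normal(0, 5, n)

# Time-ordered splits avoid leaking future rides into the training folds.
maes = []
for train_idx, test_idx in TimeSeriesSplit(n_splits=4).split(X):
    model = Lasso(alpha=0.1).fit(X[train_idx], y[train_idx])
    maes.append(mean_absolute_error(y[test_idx], model.predict(X[test_idx])))
print(f"MAE (minutes): {np.mean(maes):.2f}")
```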
[267] Traffic-Aware Optimal Taxi Placement Using Graph Neural Network-Based Reinforcement Learning
Sonia Khetarpaul, P Y Sharan
Main category: cs.LG
TL;DR: A traffic-aware graph-based reinforcement learning framework for optimal taxi placement in smart cities, reducing passenger waiting time by 56% and travel distance by 38% compared to baselines.
Details
Motivation: Conventional taxi hotspot prediction models rely only on historical demand, ignoring dynamic factors like traffic congestion, road incidents, and public events, leading to inefficient matching of taxi supply with passenger demand in smart city transportation.
Method: Models urban road network as a graph (intersections as nodes, road segments as edges) with node attributes capturing historical demand, event proximity, and real-time congestion scores. Uses Graph Neural Network embeddings to encode spatial-temporal dependencies, then employs Q-learning agent to recommend optimal taxi hotspots with reward mechanism optimizing passenger waiting time, driver travel distance, and congestion avoidance.
Result: Experiments on simulated Delhi taxi dataset show 56% reduction in passenger waiting time and 38% reduction in travel distance compared to baseline stochastic selection.
Conclusion: The proposed traffic-aware graph-based RL framework effectively optimizes taxi placement, is adaptable to multi-modal transport systems, and can be integrated into smart city platforms for real-time urban mobility optimization.
Abstract: In the context of smart city transportation, efficient matching of taxi supply with passenger demand requires real-time integration of urban traffic network data and mobility patterns. Conventional taxi hotspot prediction models often rely solely on historical demand, overlooking dynamic influences such as traffic congestion, road incidents, and public events. This paper presents a traffic-aware, graph-based reinforcement learning (RL) framework for optimal taxi placement in metropolitan environments. The urban road network is modeled as a graph where intersections represent nodes, road segments serve as edges, and node attributes capture historical demand, event proximity, and real-time congestion scores obtained from live traffic APIs. Graph Neural Network (GNN) embeddings are employed to encode spatial-temporal dependencies within the traffic network, which are then used by a Q-learning agent to recommend optimal taxi hotspots. The reward mechanism jointly optimizes passenger waiting time, driver travel distance, and congestion avoidance. Experiments on a simulated Delhi taxi dataset, generated using real geospatial boundaries and historic ride-hailing request patterns, demonstrate that the proposed model reduced passenger waiting time by about 56% and reduced travel distance by 38% compared to baseline stochastic selection. The proposed approach is adaptable to multi-modal transport systems and can be integrated into smart city platforms for real-time urban mobility optimization.
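The multi-objective reward described in the abstract could be expressed roughly as below; the linear form, weights, and sign convention are assumptions for illustration.

```python
def hotspot_reward(wait_time_min, travel_dist_km, congestion_score,
                   w_wait=1.0, w_dist=0.5, w_cong=0.3):
    """Hypothetical reward for a recommended hotspot: the Q-learning agent is
    rewarded for low passenger waiting time, short driver travel distance,
    and low congestion at the chosen node (weights are illustrative)."""
    return -(w_wait * wait_time_min + w_dist * travel_dist_km
             + w_cong * congestion_score)

print(hotspot_reward(wait_time_min=4.0, travel_dist_km=2.5, congestion_score=0.8))
```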
[268] Stronger Approximation Guarantees for Non-Monotone γ-Weakly DR-Submodular Maximization
Hareshkumar Jadav, Ranveer Singh, Vaneet Aggarwal
Main category: cs.LG
TL;DR: A 0.401-approximation algorithm for maximizing non-monotone γ-weakly DR-submodular functions over down-closed convex bodies, with guarantees that degrade gracefully as γ decreases from 1.
Details
Motivation: Maximizing submodular objectives under constraints is fundamental in ML/optimization, but existing methods for non-monotone γ-weakly DR-submodular functions have suboptimal guarantees, especially when γ < 1.
Method: Combines Frank-Wolfe-guided continuous-greedy framework with γ-aware double-greedy step to handle non-monotonicity, creating a simple yet effective procedure.
Result: Achieves state-of-the-art guarantees: recovers 0.401 approximation for γ=1 (DR-submodular case) and degrades gracefully for γ<1, improving upon previous bounds for γ-weakly DR-submodular maximization.
Conclusion: The proposed algorithm provides the best known approximation guarantees for non-monotone γ-weakly DR-submodular maximization over down-closed convex bodies, with smooth dependence on γ.
Abstract: Maximizing submodular objectives under constraints is a fundamental problem in machine learning and optimization. We study the maximization of a nonnegative, non-monotone $\gamma$-weakly DR-submodular function over a down-closed convex body. Our main result is an approximation algorithm whose guarantee depends smoothly on $\gamma$; in particular, when $\gamma=1$ (the DR-submodular case) our bound recovers the $0.401$ approximation factor, while for $\gamma<1$ the guarantee degrades gracefully and improves upon previously reported bounds for $\gamma$-weakly DR-submodular maximization under the same constraints. Our approach combines a Frank-Wolfe-guided continuous-greedy framework with a $\gamma$-aware double-greedy step, yielding a simple yet effective procedure for handling non-monotonicity. This results in state-of-the-art guarantees for non-monotone $\gamma$-weakly DR-submodular maximization over down-closed convex bodies.
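As background for readers unfamiliar with the parameter, the usual notion of $\gamma$-weak DR-submodularity can be stated as follows (the paper's exact formulation may differ in details):

```latex
% A differentiable f: [0,1]^n -> R_{\ge 0} is \gamma-weakly DR-submodular,
% with \gamma \in (0,1], if its gradient satisfies, coordinate-wise,
\nabla f(x) \;\ge\; \gamma\, \nabla f(y) \qquad \text{for all } x \le y .
% \gamma = 1 recovers standard DR-submodularity (diminishing returns);
% smaller \gamma means diminishing returns holds only approximately.
```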
[269] Do Chatbot LLMs Talk Too Much? The YapBench Benchmark
Vadim Borisov, Michael Gröger, Mina Mikhael, Richard H. Schreiber
Main category: cs.LG
TL;DR: YapBench is a benchmark for measuring LLM verbosity on brevity-ideal prompts, using YapScore to quantify excess response length beyond minimal-sufficient baselines.
Details
Motivation: LLMs often respond with unnecessary length on simple requests, adding redundant explanations and boilerplate that increases cognitive load and inference costs. Prior work shows preference-based training creates length bias where longer answers are rewarded even at comparable quality.
Method: Introduces YapBench with 300+ English prompts in three brevity-ideal categories: (A) ambiguous inputs needing short clarification, (B) closed-form factual questions with short answers, (C) one-line coding tasks. Uses YapScore metric measuring excess response length beyond curated minimal-sufficient baselines in characters, and YapIndex for overall model comparison.
Result: Evaluation of 76 assistant LLMs shows order-of-magnitude spread in median excess length and distinct category-specific failure modes: vacuum-filling on ambiguous inputs, and explanation/formatting overhead on technical requests.
Conclusion: YapBench provides a lightweight benchmark for quantifying LLM over-generation, enabling comparisons across models without tokenizer dependence. The benchmark and live leaderboard help track verbosity behavior over time.
Abstract: Large Language Models (LLMs) such as ChatGPT, Claude, and Gemini increasingly act as general-purpose copilots, yet they often respond with unnecessary length on simple requests, adding redundant explanations, hedging, or boilerplate that increases cognitive load and inflates token-based inference cost. Prior work suggests that preference-based post-training and LLM-judged evaluations can induce systematic length bias, where longer answers are rewarded even at comparable quality. We introduce YapBench, a lightweight benchmark for quantifying user-visible over-generation on brevity-ideal prompts. Each item consists of a single-turn prompt, a curated minimal-sufficient baseline answer, and a category label. Our primary metric, YapScore, measures excess response length beyond the baseline in characters, enabling comparisons across models without relying on any specific tokenizer. We summarize model performance via the YapIndex, a uniformly weighted average of category-level median YapScores. YapBench contains over three hundred English prompts spanning three common brevity-ideal settings: (A) minimal or ambiguous inputs where the ideal behavior is a short clarification, (B) closed-form factual questions with short stable answers, and (C) one-line coding tasks where a single command or snippet suffices. Evaluating 76 assistant LLMs, we observe an order-of-magnitude spread in median excess length and distinct category-specific failure modes, including vacuum-filling on ambiguous inputs and explanation or formatting overhead on one-line technical requests. We release the benchmark and maintain a live leaderboard for tracking verbosity behavior over time.
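A minimal sketch of how the character-based metrics could be computed from the description above; clipping excess length at zero and the uniform averaging are assumptions.

```python
from statistics import median

def yap_score(response: str, baseline: str) -> int:
    """Excess response length, in characters, beyond the curated
    minimal-sufficient baseline answer (assumed to be clipped at zero)."""
    return max(0, len(response) - len(baseline))

def yap_index(scores_by_category: dict) -> float:
    """Uniformly weighted average of the category-level median YapScores."""
    return sum(median(v) for v in scores_by_category.values()) / len(scores_by_category)

# Toy per-category YapScores for one model (hypothetical values).
scores = {"A_ambiguous": [120, 300, 80], "B_factual": [10, 0, 45], "C_code": [200, 600, 90]}
print(yap_index(scores))
```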
[270] Interpretability-Guided Bi-objective Optimization: Aligning Accuracy and Explainability
Kasra Fouladi, Hamta Rahmani
Main category: cs.LG
TL;DR: IGBO trains interpretable models using domain knowledge via bi-objective optimization with DAG constraints and Optimal Path Oracle for OOD robustness.
Details
Motivation: To train interpretable models that incorporate structured domain knowledge while addressing the Out-of-Distribution problem in feature importance computation.
Method: Uses bi-objective optimization with DAG-encoded feature importance hierarchies, Temporal Integrated Gradients for feature importance measurement, and Optimal Path Oracle to handle OOD issues in TIG computation.
Result: Theoretical convergence and robustness to mini-batch noise proven; empirical results on time-series data show effective DAG constraint enforcement with minimal accuracy loss, outperforming standard regularization baselines.
Conclusion: IGBO provides an effective framework for training interpretable models with domain knowledge constraints while maintaining accuracy and addressing OOD challenges.
Abstract: This paper introduces Interpretability-Guided Bi-objective Optimization (IGBO), a framework that trains interpretable models by incorporating structured domain knowledge via a bi-objective formulation. IGBO encodes feature importance hierarchies as a Directed Acyclic Graph (DAG) and uses Temporal Integrated Gradients (TIG) to measure feature importance. To address the Out-of-Distribution (OOD) problem in TIG computation, we propose an Optimal Path Oracle that learns data-manifold-aware integration paths. Theoretical analysis proves convergence properties and robustness to mini-batch noise, while empirical results on time-series data demonstrate IGBO’s effectiveness in enforcing DAG constraints with minimal accuracy loss, outperforming standard regularization baselines.
[271] IRPO: Scaling the Bradley-Terry Model via Reinforcement Learning
Haonan Song, Qingchen Xie, Huan Zhu, Feng Xiao, Luxi Xing, Fuzhen Li, Liu Kang, Feng Jiang, Zhiyong Zheng, Fan Yang
Main category: cs.LG
TL;DR: IRPO is a new RL framework that uses pointwise scoring instead of pairwise comparisons to overcome computational bottlenecks in Generative Reward Models while maintaining interpretability and achieving SOTA performance.
Details
Motivation: Pairwise GRMs create computational bottlenecks (O(n²) complexity) when integrated with RL algorithms like GRPO, plus they require additional sampling/CoT reasoning overhead. A more efficient approach that preserves interpretability is needed.
Method: Propose Intergroup Relative Preference Optimization (IRPO) that incorporates Bradley-Terry model into GRPO to generate pointwise scores for each response, enabling efficient evaluation of many candidates during RL training.
Result: IRPO achieves SOTA performance among pointwise GRMs across multiple benchmarks, with performance comparable to current leading pairwise GRMs. Significantly outperforms pairwise GRMs in post-training evaluations.
Conclusion: IRPO successfully addresses computational bottlenecks of pairwise GRMs while preserving interpretability and fine-grained reward signals, offering efficient alternative for RL integration.
Abstract: Generative Reward Models (GRMs) have attracted considerable research interest in reward modeling due to their interpretability, inference-time scalability, and potential for refinement through reinforcement learning (RL). However, widely used pairwise GRMs create a computational bottleneck when integrated with RL algorithms such as Group Relative Policy Optimization (GRPO). This bottleneck arises from two factors: (i) the O(n^2) time complexity of pairwise comparisons required to obtain relative scores, and (ii) the computational overhead of repeated sampling or additional chain-of-thought (CoT) reasoning to improve performance. To address the first factor, we propose Intergroup Relative Preference Optimization (IRPO), a novel RL framework that incorporates the well-established Bradley-Terry model into GRPO. By generating a pointwise score for each response, IRPO enables efficient evaluation of arbitrarily many candidates during RL training while preserving interpretability and fine-grained reward signals. Experimental results demonstrate that IRPO achieves state-of-the-art (SOTA) performance among pointwise GRMs across multiple benchmarks, with performance comparable to that of current leading pairwise GRMs. Furthermore, we show that IRPO significantly outperforms pairwise GRMs in post-training evaluations.
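To illustrate the core idea of attaching a Bradley-Terry likelihood to pointwise scores, here is a small PyTorch sketch (an illustration under assumptions, not the authors' IRPO objective): pairwise preference probabilities are sigmoid(s_i - s_j), so fitting them yields pointwise scores that can rank arbitrarily many candidates in linear time.

```python
import torch

def bradley_terry_group_loss(scores: torch.Tensor, prefs: torch.Tensor) -> torch.Tensor:
    """Given pointwise scores s_i for a group of n responses and a 0/1 matrix
    prefs[i, j] = 1 if response i is preferred to response j, the Bradley-Terry
    likelihood of each labeled pair is sigmoid(s_i - s_j).  Maximizing it fits
    pointwise scores consistent with the pairwise preferences."""
    diff = scores.unsqueeze(1) - scores.unsqueeze(0)        # (n, n) score gaps
    log_lik = torch.nn.functional.logsigmoid(diff) * prefs  # only labeled pairs count
    return -log_lik.sum() / prefs.sum().clamp(min=1)

scores = torch.tensor([1.2, 0.3, -0.5], requires_grad=True)
prefs = torch.tensor([[0., 1., 1.], [0., 0., 1.], [0., 0., 0.]])
loss = bradley_terry_group_loss(scores, prefs)
loss.backward()
print(loss.item(), scores.grad)
```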
[272] ARISE: Adaptive Reinforcement Integrated with Swarm Exploration
Rajiv Chaitanya M, D R Ramesh Babu
Main category: cs.LG
TL;DR: ARISE is a lightweight RL framework that enhances exploration by augmenting policy-gradient methods with a swarm-based exploration layer, showing significant improvements on challenging tasks and robustness to non-stationary rewards.
Details
Motivation: Effective exploration remains a key challenge in reinforcement learning, especially with non-stationary rewards or high-dimensional policies. Standard methods often struggle with exploration efficiency and robustness to reward changes.
Method: ARISE augments standard policy-gradient methods with a compact swarm-based exploration layer. It blends policy actions with particle-driven proposals, where each particle represents a candidate policy trajectory sampled in action space. The framework modulates exploration adaptively using reward-variance cues.
Result: On easy benchmarks: slight improvements (+0.7% on CartPole-v1). On challenging tasks: substantial gains (+46% on LunarLander-v3, +22% on Hopper-v4) while preserving stability on Walker2d and Ant. Under non-stationary reward shifts: marked robustness advantages, outperforming PPO by +75 points on CartPole. Ablation studies confirm both swarm component and adaptive mechanism contribute to performance.
Conclusion: ARISE offers a simple, architecture-agnostic route to more exploratory and resilient RL agents without altering core algorithmic structures, providing significant improvements on challenging tasks and robustness to non-stationary rewards.
Abstract: Effective exploration remains a key challenge in RL, especially with non-stationary rewards or high-dimensional policies. We introduce ARISE, a lightweight framework that enhances reinforcement learning by augmenting standard policy-gradient methods with a compact swarm-based exploration layer. ARISE blends policy actions with particle-driven proposals, where each particle represents a candidate policy trajectory sampled in the action space, and modulates exploration adaptively using reward-variance cues. While easy benchmarks exhibit only slight improvements (e.g., +0.7% on CartPole-v1), ARISE yields substantial gains on more challenging tasks, including +46% on LunarLander-v3 and +22% on Hopper-v4, while preserving stability on Walker2d and Ant. Under non-stationary reward shifts, ARISE provides marked robustness advantages, outperforming PPO by +75 points on CartPole and improving LunarLander accordingly. Ablation studies confirm that both the swarm component and the adaptive mechanism contribute to the performance. Overall, ARISE offers a simple, architecture-agnostic route to more exploratory and resilient RL agents without altering core algorithmic structures.
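A rough sketch of the action-blending step described in the abstract; the convex blend and the variance-based gate below are assumptions about how such a mechanism could look.

```python
import numpy as np

def arise_action(policy_action, particle_actions, recent_rewards, beta=0.5):
    """Blend the base policy's action with a swarm proposal, widening the
    blend when recent rewards are volatile (hypothetical adaptive rule)."""
    proposal = particle_actions.mean(axis=0)   # swarm consensus proposal
    volatility = np.std(recent_rewards)        # reward-variance cue
    mix = min(1.0, beta * volatility)          # explore more when rewards are volatile
    return (1.0 - mix) * policy_action + mix * proposal

rng = np.random.default_rng(1)
action = arise_action(np.array([0.2, -0.1]),
                      rng.normal(size=(8, 2)),          # 8 particle proposals
                      recent_rewards=[10.0, -3.0, 7.5])
print(action)
```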
[273] Bayesian Inverse Games with High-Dimensional Multi-Modal Observations
Yash Jain, Xinjie Liu, Lasse Peters, David Fridovich-Keil, Ufuk Topcu
Main category: cs.LG
TL;DR: Bayesian inverse game framework using variational autoencoder with differentiable Nash solver to infer posterior distributions over agent objectives from interaction data, enabling safer decision-making with uncertainty quantification.
Details
Motivation: Traditional game-theoretic planners require full specification of agent objectives, which is impractical. Existing inverse game methods only provide point estimates without uncertainty quantification, leading to overconfident and potentially unsafe planning decisions.
Method: Proposes Bayesian inverse game framework using structured variational autoencoder with embedded differentiable Nash game solver. Trained on interaction datasets without requiring labels of true objectives. Supports multimodal inference when trajectory data is insufficient.
Result: Successfully learns prior and posterior distributions, improves inference quality over maximum likelihood estimation approaches, enables safer downstream decision-making without sacrificing efficiency. Multimodal inference further reduces uncertainty when trajectory information is limited.
Conclusion: Bayesian approach to inverse games provides uncertainty quantification that enhances safety in autonomous decision-making, outperforming point-estimate methods and supporting multimodal observation integration for better inference.
Abstract: Many multi-agent interaction scenarios can be naturally modeled as noncooperative games, where each agent’s decisions depend on others’ future actions. However, deploying game-theoretic planners for autonomous decision-making requires a specification of all agents’ objectives. To circumvent this practical difficulty, recent work develops maximum likelihood techniques for solving inverse games that can identify unknown agent objectives from interaction data. Unfortunately, these methods only infer point estimates and do not quantify estimator uncertainty; correspondingly, downstream planning decisions can overconfidently commit to unsafe actions. We present an approximate Bayesian inference approach for solving the inverse game problem, which can incorporate observation data from multiple modalities and be used to generate samples from the Bayesian posterior over the hidden agent objectives given limited sensor observations in real time. Concretely, the proposed Bayesian inverse game framework trains a structured variational autoencoder with an embedded differentiable Nash game solver on interaction datasets and does not require labels of agents’ true objectives. Extensive experiments show that our framework successfully learns prior and posterior distributions, improves inference quality over maximum likelihood estimation-based inverse game approaches, and enables safer downstream decision-making without sacrificing efficiency. When trajectory information is uninformative or unavailable, multimodal inference further reduces uncertainty by exploiting additional observation modalities.
[274] BSAT: B-Spline Adaptive Tokenizer for Long-Term Time Series Forecasting
Maximilian Reinwardt, Michael Eichelbeck, Matthias Althoff
Main category: cs.LG
TL;DR: BSAT uses B-splines to adaptively tokenize time series with variable-length segments, combined with L-RoPE positional encoding for efficient long-term forecasting.
Details
Motivation: Transformers for long-term time series forecasting suffer from quadratic self-attention complexity and rigid uniform patching that doesn't align with data's semantic structure.
Method: B-Spline Adaptive Tokenizer (BSAT) adaptively segments time series using B-splines, placing tokens in high-curvature regions and representing basis functions as fixed-size tokens. Combined with L-RoPE hybrid positional encoding that has layer-wise learnable base.
Result: Competitive performance on public benchmarks at high compression rates, making it suitable for memory-constrained use cases.
Conclusion: BSAT provides an efficient, parameter-free adaptive tokenization method for long-term time series forecasting that addresses transformer limitations while maintaining strong performance.
Abstract: Long-term time series forecasting using transformers is hampered by the quadratic complexity of self-attention and the rigidity of uniform patching, which may be misaligned with the data’s semantic structure. In this paper, we introduce the B-Spline Adaptive Tokenizer (BSAT), a novel, parameter-free method that adaptively segments a time series by fitting it with B-splines. BSAT algorithmically places tokens in high-curvature regions and represents each variable-length basis function as a fixed-size token, composed of its coefficient and position. Further, we propose a hybrid positional encoding that combines an additive learnable positional encoding with Rotary Positional Embedding featuring a layer-wise learnable base: L-RoPE. This allows each layer to attend to different temporal dependencies. Our experiments on several public benchmarks show that our model is competitive with strong performance at high compression rates. This makes it particularly well-suited for use cases with strong memory constraints.
[275] Stochastic Actor-Critic: Mitigating Overestimation via Temporal Aleatoric Uncertainty
Uğurcan Özalp
Main category: cs.LG
TL;DR: STAC is a new off-policy actor-critic algorithm that uses temporal aleatoric uncertainty from a single distributional critic to scale pessimistic bias, avoiding overestimation and achieving risk-averse behavior without relying on epistemic uncertainty ensembles.
Details
Motivation: Current off-policy actor-critic methods suffer from critic overestimation bias. Existing solutions use ensembling to quantify epistemic uncertainty for pessimistic updates, but this approach has computational overhead and may not address all uncertainty sources effectively.
Method: STAC uses a single distributional critic network to model temporal aleatoric uncertainty (from stochastic transitions, rewards, and policy variability). It applies dropout to both critic and actor networks for regularization, and uses the critic’s uncertainty to scale pessimistic bias in temporal-difference updates.
Result: STAC successfully mitigates overestimation bias using only distributional critic-based pessimism, achieves risk-averse behavior in stochastic environments, improves training stability with dropout regularization, and offers better computational efficiency than ensemble-based methods.
Conclusion: Temporal aleatoric uncertainty from a single distributional critic suffices for effective pessimism in actor-critic methods, eliminating the need for epistemic uncertainty ensembles while achieving risk-averse behavior and computational efficiency.
Abstract: Off-policy actor-critic methods in reinforcement learning train a critic with temporal-difference updates and use it as a learning signal for the policy (actor). This design typically achieves higher sample efficiency than purely on-policy methods. However, critic networks tend to overestimate value estimates systematically. This is often addressed by introducing a pessimistic bias based on uncertainty estimates. Current methods employ ensembling to quantify the critic’s epistemic uncertainty (uncertainty due to limited data and model ambiguity) to scale pessimistic updates. In this work, we propose a new algorithm called Stochastic Actor-Critic (STAC) that incorporates temporal (one-step) aleatoric uncertainty (uncertainty arising from stochastic transitions, rewards, and policy-induced variability in Bellman targets) to scale pessimistic bias in temporal-difference updates, rather than relying on epistemic uncertainty. STAC uses a single distributional critic network to model the temporal return uncertainty, and applies dropout to both the critic and actor networks for regularization. Our results show that pessimism based on a distributional critic alone suffices to mitigate overestimation, and naturally leads to risk-averse behavior in stochastic environments. Introducing dropout further improves training stability and performance by means of regularization. With this design, STAC achieves improved computational efficiency using a single distributional critic network.
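A small sketch of what distributional-critic pessimism might look like in practice; the spread penalty and the coefficient kappa are assumptions, not the authors' exact update.

```python
import torch

def pessimistic_target(reward, next_quantiles, done, gamma=0.99, kappa=0.5):
    """Distributional-critic pessimism sketch: penalize the bootstrapped target
    by the spread of the predicted return distribution (a temporal aleatoric
    uncertainty proxy) rather than by an ensemble's disagreement."""
    mean = next_quantiles.mean(dim=-1)
    spread = next_quantiles.std(dim=-1)   # one-step return-spread penalty
    return reward + gamma * (1.0 - done) * (mean - kappa * spread)

quantiles = torch.tensor([[1.0, 2.0, 3.0, 4.0]])   # predicted return quantiles
print(pessimistic_target(torch.tensor([0.5]), quantiles, torch.tensor([0.0])))
```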
[276] Precision Autotuning for Linear Solvers via Contextual Bandit-Based RL
Erin Carson, Xinye Chen
Main category: cs.LG
TL;DR: RL framework for adaptive precision tuning of linear solvers using contextual bandits with Q-learning, demonstrated on iterative refinement for linear systems, reducing computational cost while maintaining accuracy.
Details
Motivation: To develop an automated approach for precision selection in numerical algorithms that balances computational efficiency with accuracy, advancing mixed-precision methods in scientific computing.
Method: Formulated as contextual bandit problem with Q-learning using discretized state space (features like condition number, matrix norm) and epsilon-greedy policy to select precision configurations, maximizing multi-objective reward balancing accuracy and computational cost.
Result: Empirical results show effective precision selection that reduces computational cost while maintaining accuracy comparable to double-precision baselines, with generalization to diverse out-of-sample data.
Conclusion: First RL-based precision autotuning framework that successfully adapts precision for linear solvers and can generalize to other numerical algorithms, advancing mixed-precision computing.
Abstract: We propose a reinforcement learning (RL) framework for adaptive precision tuning of linear solvers, which can be extended to general algorithms. The framework is formulated as a contextual bandit problem and solved using incremental action-value estimation with a discretized state space to select optimal precision configurations for computational steps, balancing precision and computational efficiency. To verify its effectiveness, we apply the framework to iterative refinement for solving linear systems $Ax = b$. In this application, our approach dynamically chooses precisions based on calculated features from the system. In detail, a Q-table maps discretized features (e.g., approximate condition number and matrix norm) to actions (chosen precision configurations for specific steps), optimized via an epsilon-greedy strategy to maximize a multi-objective reward balancing accuracy and computational cost. Empirical results demonstrate effective precision selection, reducing computational cost while maintaining accuracy comparable to double-precision baselines. The framework generalizes to diverse out-of-sample data and offers insight into utilizing RL precision selection for other numerical algorithms, advancing mixed-precision numerical methods in scientific computing. To the best of our knowledge, this is the first work on precision autotuning with RL and verified on unseen datasets.
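The contextual-bandit loop sketched below follows the description above; the precision set, feature binning, and reward value are illustrative assumptions.

```python
import numpy as np

PRECISIONS = ["half", "single", "double"]   # candidate precision configurations

def discretize(cond_number, mat_norm):
    """Map continuous matrix features to a coarse Q-table state (assumed bins)."""
    return (min(int(np.log10(cond_number)), 9), min(int(np.log10(mat_norm)), 9))

def select_action(q_table, state, eps=0.1, rng=np.random.default_rng(0)):
    """Epsilon-greedy choice over precision configurations."""
    if state not in q_table or rng.random() < eps:
        return int(rng.integers(len(PRECISIONS)))
    return int(np.argmax(q_table[state]))

def update(q_table, state, action, reward, lr=0.1):
    """Incremental action-value update for a contextual bandit (no bootstrapping)."""
    q = q_table.setdefault(state, np.zeros(len(PRECISIONS)))
    q[action] += lr * (reward - q[action])

q_table = {}
s = discretize(cond_number=1e6, mat_norm=1e2)
a = select_action(q_table, s)
update(q_table, s, a, reward=-0.3)   # reward would trade accuracy against cost
print(s, PRECISIONS[a], q_table[s])
```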
[277] The Reasoning-Creativity Trade-off: Toward Creativity-Driven Problem Solving
Max Ruiz Luyten, Mihaela van der Schaar
Main category: cs.LG
TL;DR: The paper introduces Distributional Creative Reasoning (DCR), a unified variational framework that addresses the diversity collapse problem in LLM reasoning pipelines, showing how current methods like STaR, GRPO, and DPO optimize for correctness at the expense of creative diversity.
Details
Motivation: Current LLM reasoning pipelines focus on correctness optimization through bootstrapped reasoning loops, which leads to distribution collapse over reasoning paths, reducing semantic entropy and undermining creative problem-solving capabilities.
Method: Introduces Distributional Creative Reasoning (DCR), a unified variational objective that frames training as gradient flow through probability measures on solution traces, encompassing existing methods like STaR, GRPO, DPO, and entropy bonuses as special cases.
Result: Three core results: (1) diversity decay theorem explaining how correctness-based objectives cause diversity decay in different methods; (2) designs ensuring convergence to stable, diverse policies preventing collapse; (3) practical recipes to achieve this in practice.
Conclusion: DCR provides the first principled framework for LLMs that maintain both correctness and creativity, addressing the fundamental trade-off between optimization for accuracy and preservation of reasoning diversity.
Abstract: State-of-the-art large language model (LLM) pipelines rely on bootstrapped reasoning loops: sampling diverse chains of thought and reinforcing the highest-scoring ones, mainly optimizing correctness. We analyze how this design choice is sensitive to the collapse of the model’s distribution over reasoning paths, slashing semantic entropy and undermining creative problem-solving. To analyze this failure, we introduce Distributional Creative Reasoning (DCR), a unified variational objective that casts training as gradient flow through probability measures on solution traces. STaR, GRPO, and DPO, as well as entropy bonuses, and other methods, all constitute special cases of the same loss. The framework delivers three core results: (i) the diversity decay theorem, describing how correctness-based objectives lead to distinct modes of diversity decay for STaR, GRPO, and DPO; (ii) designs that ensure convergence to a stable and diverse policy, effectively preventing collapse; and (iii) simple, actionable recipes to achieve this in practice. DCR thus offers the first principled recipe for LLMs that remain both correct and creative.
[278] FedHypeVAE: Federated Learning with Hypernetwork Generated Conditional VAEs for Differentially Private Embedding Sharing
Sunny Gupta, Amit Sethi
Main category: cs.LG
TL;DR: FedHypeVAE: A differentially private hypernetwork-driven framework for synthesizing embedding-level data in federated learning, addressing non-IID client heterogeneity while providing formal privacy guarantees against gradient leakage.
Details
Motivation: Existing federated data sharing methods using embedding-level generators struggle with non-IID client heterogeneity and offer limited formal protection against gradient leakage, creating a need for a more robust privacy-preserving synthesis framework.
Method: Uses a conditional VAE backbone with client-aware decoders and class-conditional priors generated by a shared hypernetwork from private trainable client codes. The hypernetwork is optimized under differential privacy with noise-perturbed, clipped gradients. Includes local MMD alignment between real/synthetic embeddings and Lipschitz regularization for stability under non-IID conditions.
Result: FedHypeVAE enables domain-agnostic synthesis using neutral meta-codes and controllable multi-domain coverage through mixtures of meta-codes, unifying personalization, privacy, and distribution alignment at the generator level.
Conclusion: FedHypeVAE establishes a principled foundation for privacy-preserving data synthesis in federated settings by addressing key challenges of non-IID heterogeneity and gradient leakage through a differentially private hypernetwork architecture.
Abstract: Federated data sharing promises utility without centralizing raw data, yet existing embedding-level generators struggle under non-IID client heterogeneity and provide limited formal protection against gradient leakage. We propose FedHypeVAE, a differentially private, hypernetwork-driven framework for synthesizing embedding-level data across decentralized clients. Building on a conditional VAE backbone, we replace the single global decoder and fixed latent prior with client-aware decoders and class-conditional priors generated by a shared hypernetwork from private, trainable client codes. This bi-level design personalizes the generative layer, rather than the downstream model, while decoupling local data from communicated parameters. The shared hypernetwork is optimized under differential privacy, ensuring that only noise-perturbed, clipped gradients are aggregated across clients. A local MMD alignment between real and synthetic embeddings and a Lipschitz regularizer on hypernetwork outputs further enhance stability and distributional coherence under non-IID conditions. After training, a neutral meta-code enables domain-agnostic synthesis, while mixtures of meta-codes provide controllable multi-domain coverage. FedHypeVAE unifies personalization, privacy, and distribution alignment at the generator level, establishing a principled foundation for privacy-preserving data synthesis in federated settings. Code: github.com/sunnyinAI/FedHypeVAE
[279] A Machine Learning Framework for Off Ball Defensive Role and Performance Evaluation in Football
Sean Groom, Shuo Wang, Francisco Belo, Axl Rice, Liam Anderson
Main category: cs.LG
TL;DR: A novel covariate-dependent Hidden Markov Model (CDHMM) for evaluating off-ball defense in football corner kicks, using player tracking data to infer defensive assignments and enable counterfactual analysis.
Details
Motivation: Traditional metrics fail to capture coordinated off-ball defensive movements that limit opponent actions. Existing possession value models focus on on-ball actions, and current counterfactual methods use "average" behavior lacking tactical context.
Method: Developed a covariate-dependent Hidden Markov Model (CDHMM) specifically for corner kicks that infers time-resolved man-marking and zonal defensive assignments from player tracking data without requiring labels. Used these assignments to create defensive credit attribution framework and role-conditioned ghosting method for counterfactual analysis.
Result: The model successfully infers defensive assignments from tracking data and provides interpretable evaluation of defensive contributions against context-aware baselines, addressing limitations of existing methods.
Conclusion: The CDHMM framework offers a novel, interpretable approach to evaluate off-ball defensive performance in structured football situations like corner kicks, overcoming limitations of traditional metrics and existing counterfactual methods.
Abstract: Evaluating off-ball defensive performance in football is challenging, as traditional metrics do not capture the nuanced coordinated movements that limit opponent action selection and success probabilities. Although widely used possession value models excel at appraising on-ball actions, their application to defense remains limited. Existing counterfactual methods, such as ghosting models, help extend these analyses but often rely on simulating “average” behavior that lacks tactical context. To address this, we introduce a covariate-dependent Hidden Markov Model (CDHMM) tailored to corner kicks, a highly structured aspect of football games. Our label-free model infers time-resolved man-marking and zonal assignments directly from player tracking data. We leverage these assignments to propose a novel framework for defensive credit attribution and a role-conditioned ghosting method for counterfactual analysis of off-ball defensive performance. We show how these contributions provide an interpretable evaluation of defensive contributions against context-aware baselines.
[280] Categorical Reparameterization with Denoising Diffusion models
Samson Gourevitch, Alain Durmus, Eric Moulines, Jimmy Olsson, Yazid Janati
Main category: cs.LG
TL;DR: The paper introduces a diffusion-based soft reparameterization for categorical variables that enables efficient gradient-based optimization without training.
Details
Motivation: Existing gradient-based optimization methods for categorical variables have limitations: score-function estimators are noisy, while continuous relaxations create biased objectives. There's a need for better reparameterization techniques that maintain efficiency while reducing bias.
Method: The authors propose a diffusion-based soft reparameterization for categorical distributions. They leverage the fact that for categorical distributions, the denoiser under a Gaussian noising process has a closed-form solution and can be computed efficiently. This creates a training-free diffusion sampler that enables backpropagation through the reparameterization.
Result: Experiments show that the proposed diffusion-based reparameterization trick yields competitive or improved optimization performance across various benchmarks compared to existing methods.
Conclusion: The diffusion-based soft reparameterization provides an effective alternative for gradient-based optimization with categorical variables, offering advantages over both score-function estimators and continuous relaxations by enabling efficient backpropagation through a training-free diffusion process.
Abstract: Gradient-based optimization with categorical variables typically relies on score-function estimators, which are unbiased but noisy, or on continuous relaxations that replace the discrete distribution with a smooth surrogate admitting a pathwise (reparameterized) gradient, at the cost of optimizing a biased, temperature-dependent objective. In this paper, we extend this family of relaxations by introducing a diffusion-based soft reparameterization for categorical distributions. For these distributions, the denoiser under a Gaussian noising process admits a closed form and can be computed efficiently, yielding a training-free diffusion sampler through which we can backpropagate. Our experiments show that the proposed reparameterization trick yields competitive or improved optimization performance on various benchmarks.
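The closed-form denoiser mentioned above has a short derivation: if a one-hot x0 is corrupted as x_t = alpha_t * x0 + sigma_t * eps, then p(x0 = e_k | x_t) is a softmax over per-class scores, so the posterior mean is differentiable in x_t. A minimal sketch, assuming a uniform categorical prior and a scalar noise schedule:

```python
import torch

def categorical_denoiser(x_t, alpha_t, sigma_t, log_prior=None):
    """Closed-form posterior mean E[x0 | x_t] when x0 is one of the K one-hot
    vectors and x_t = alpha_t * x0 + sigma_t * eps with Gaussian eps.
    Expanding ||x_t - alpha_t e_k||^2 shows the class posterior is a softmax of
    log_prior[k] + (alpha_t * x_t[k] - 0.5 * alpha_t^2) / sigma_t^2."""
    K = x_t.shape[-1]
    if log_prior is None:
        log_prior = torch.zeros(K)   # uniform prior over classes (assumed)
    scores = log_prior + (alpha_t * x_t - 0.5 * alpha_t**2) / sigma_t**2
    return torch.softmax(scores, dim=-1)   # differentiable "soft" sample

x_t = torch.tensor([0.9, 0.1, -0.2], requires_grad=True)
probs = categorical_denoiser(x_t, alpha_t=0.8, sigma_t=0.5)
probs[0].backward()   # gradients flow through the training-free sampler
print(probs.detach(), x_t.grad)
```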
[281] It’s complicated. The relationship of algorithmic fairness and non-discrimination provisions for high-risk systems in the EU AI Act
Kristof Meding
Main category: cs.LG
TL;DR: This paper bridges legal non-discrimination regulations and algorithmic fairness concepts in the context of the EU AI Act, analyzing their relationship and providing recommendations for future interdisciplinary collaboration.
Details
Motivation: The paper addresses the challenge of defining fairness in AI decisions, particularly in light of discriminatory algorithmic behaviors and the recent EU AI Act that mandates rules for high-risk systems combining traditional legal non-discrimination regulations with machine learning fairness concepts.
Method: The paper uses a two-part approach: (1) a high-level introduction of both legal non-discrimination and algorithmic fairness concepts for interdisciplinary audiences, and (2) an in-depth analysis of the AI Act’s relationship between these two concepts.
Result: Three key findings: (1) Most non-discrimination regulations target only high-risk AI systems; (2) Regulation of high-risk systems includes both data input requirements and output monitoring, but these are partly inconsistent and raise computational feasibility questions; (3) Analysis of possible future interaction between classical EU non-discrimination law and AI Act regulations.
Conclusion: The paper recommends developing more specific auditing and testing methodologies for AI systems and serves as a foundation for future interdisciplinary collaboration between legal scholars and computer science researchers studying discrimination in AI systems.
Abstract: What constitutes a fair decision? This question is not only difficult for humans but becomes more challenging when Artificial Intelligence (AI) models are used. In light of discriminatory algorithmic behaviors, the EU has recently passed the AI Act, which mandates specific rules for high-risk systems, incorporating both traditional legal non-discrimination regulations and machine learning based algorithmic fairness concepts. This paper aims to bridge these two different concepts in the AI Act through: First, a necessary high-level introduction of both concepts targeting legal and computer science-oriented scholars, and second, an in-depth analysis of the AI Act’s relationship between legal non-discrimination regulations and algorithmic fairness. Our analysis reveals three key findings: (1.) Most non-discrimination regulations target only high-risk AI systems. (2.) The regulation of high-risk systems encompasses both data input requirements and output monitoring, though these regulations are partly inconsistent and raise questions of computational feasibility. (3.) Finally, we consider the possible (future) interaction of classical EU non-discrimination law and the AI Act regulations. We recommend developing more specific auditing and testing methodologies for AI systems. This paper aims to serve as a foundation for future interdisciplinary collaboration between legal scholars and computer science-oriented machine learning researchers studying discrimination in AI systems.
[282] The Curse of Depth in Large Language Models
Wenfang Sun, Xinyuan Song, Pengxiang Li, Lu Yin, Yefeng Zheng, Shiwei Liu
Main category: cs.LG
TL;DR: The paper introduces the “Curse of Depth” phenomenon in LLMs where deep layers underperform, identifies Pre-LN as the cause due to output variance explosion, and proposes LayerNorm Scaling (LNS) to fix it by scaling variance inversely with depth.
Details
Motivation: Deep layers in modern LLMs are observed to be inefficient (nearly half of layers underperform), a phenomenon that occurs across popular model families like Llama, Mistral, DeepSeek, and Qwen.
Method: LayerNorm Scaling (LNS) scales the variance of layer normalization outputs inversely by the square root of depth to mitigate output variance explosion in deep Transformer layers.
Result: LNS consistently outperforms previous normalization and scaling techniques across model sizes (130M to 7B), improves pre-training performance, and benefits carry over to supervised fine-tuning by enabling deeper layers to contribute more effectively.
Conclusion: The Curse of Depth in LLMs stems from Pre-LN causing output variance explosion, and LayerNorm Scaling effectively addresses this issue, improving training efficiency and model performance across various scales and tasks.
Abstract: In this paper, we introduce the Curse of Depth, a concept that highlights, explains, and addresses the recent observation in modern Large Language Models (LLMs) where nearly half of the layers are less effective than expected. We first confirm the wide existence of this phenomenon across the most popular families of LLMs such as Llama, Mistral, DeepSeek, and Qwen. Our analysis, theoretically and empirically, identifies that the underlying reason for the ineffectiveness of deep layers in LLMs is the widespread usage of Pre-Layer Normalization (Pre-LN). While Pre-LN stabilizes the training of Transformer LLMs, its output variance exponentially grows with the model depth, which undesirably causes the derivative of the deep Transformer blocks to be an identity matrix, and therefore barely contributes to the training. To resolve this training pitfall, we propose LayerNorm Scaling (LNS), which scales the variance of output of the layer normalization inversely by the square root of its depth. This simple modification mitigates the output variance explosion of deeper Transformer layers, improving their contribution. Across a wide range of model sizes (130M to 7B), our experiments show that LNS consistently outperforms previous normalization and scaling techniques in enhancing LLM pre-training performance. Moreover, this improvement seamlessly carries over to supervised fine-tuning. All these gains can be attributed to the fact that LayerNorm Scaling enables deeper layers to contribute more effectively during training. Our code is available at https://github.com/lmsdss/LayerNorm-Scaling.
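The proposed fix amounts to a one-line change after each layer normalization; a minimal PyTorch sketch, assuming 1-based layer indexing and that the scale is applied to the LayerNorm output:

```python
import torch
import torch.nn as nn

class ScaledLayerNorm(nn.Module):
    """LayerNorm Scaling sketch: multiply the LayerNorm output by 1/sqrt(layer_idx)
    so that output variance no longer explodes with depth under Pre-LN."""
    def __init__(self, hidden_size: int, layer_idx: int):
        super().__init__()
        self.ln = nn.LayerNorm(hidden_size)
        self.scale = 1.0 / (layer_idx ** 0.5)   # layer_idx = 1, 2, ..., L (assumed)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.ln(x) * self.scale

x = torch.randn(2, 16, 64)
for depth in (1, 8, 32):
    y = ScaledLayerNorm(64, depth)(x)
    print(depth, y.var().item())   # output variance shrinks as depth grows
```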
[283] Flattening Hierarchies with Policy Bootstrapping
John L. Zhou, Jonathan C. Kao
Main category: cs.LG
TL;DR: A new offline goal-conditioned RL algorithm that bootstraps on subgoal-conditioned policies using advantage-weighted importance sampling, eliminating the need for hierarchical subgoal generation and scaling to high-dimensional control tasks.
Details
Motivation: Offline GCRL struggles with long-horizon tasks due to sparse rewards and discounting, while hierarchical RL methods introduce complexity and don't scale well to high-dimensional goal spaces. There's a need for simpler, scalable approaches for long-horizon goal-reaching.
Method: Trains a flat goal-conditioned policy by bootstrapping on subgoal-conditioned policies using advantage-weighted importance sampling. Eliminates generative models over goal spaces, making it scalable to high-dimensional control. Shows existing hierarchical and bootstrapping methods are special cases within this framework.
Result: Matches or surpasses state-of-the-art offline GCRL algorithms across locomotion and manipulation benchmarks. Scales to complex, long-horizon tasks where prior approaches fail, working in both state- and pixel-based environments.
Conclusion: The approach provides a simpler, scalable alternative to hierarchical RL for offline GCRL, successfully addressing long-horizon challenges without complex subgoal generation mechanisms.
Abstract: Offline goal-conditioned reinforcement learning (GCRL) is a promising approach for pretraining generalist policies on large datasets of reward-free trajectories, akin to the self-supervised objectives used to train foundation models for computer vision and natural language processing. However, scaling GCRL to longer horizons remains challenging due to the combination of sparse rewards and discounting, which obscures the comparative advantages of primitive actions with respect to distant goals. Hierarchical RL methods achieve strong empirical results on long-horizon goal-reaching tasks, but their reliance on modular, timescale-specific policies and subgoal generation introduces significant additional complexity and hinders scaling to high-dimensional goal spaces. In this work, we introduce an algorithm to train a flat (non-hierarchical) goal-conditioned policy by bootstrapping on subgoal-conditioned policies with advantage-weighted importance sampling. Our approach eliminates the need for a generative model over the (sub)goal space, which we find is key for scaling to high-dimensional control in large state spaces. We further show that existing hierarchical and bootstrapping-based approaches correspond to specific design choices within our derivation. Across a comprehensive suite of state- and pixel-based locomotion and manipulation benchmarks, our method matches or surpasses state-of-the-art offline GCRL algorithms and scales to complex, long-horizon tasks where prior approaches fail. Project page: https://johnlyzhou.github.io/saw/
[284] Robust Molecular Property Prediction via Densifying Scarce Labeled Data
Jina Kim, Jeffrey Willette, Bruno Andreis, Sung Ju Hwang
Main category: cs.LG
TL;DR: Novel bilevel optimization approach uses unlabeled data to interpolate between in-distribution and out-of-distribution data, improving molecular prediction generalization beyond training distribution.
Details
Motivation: Molecular prediction models suffer from poor generalization to out-of-distribution compounds, which is critical in drug discovery where important compounds often lie beyond training data. Covariate shift and scarce labeled experimental data exacerbate this problem.
Method: Proposes a bilevel optimization approach that leverages unlabeled data to interpolate between in-distribution (ID) and out-of-distribution (OOD) data, enabling models to learn how to generalize beyond the training distribution.
Result: Demonstrates significant performance gains on challenging real-world datasets with substantial covariate shift, supported by t-SNE visualizations highlighting the interpolation method’s effectiveness.
Conclusion: The proposed bilevel optimization approach successfully addresses the generalization limitations of molecular prediction models by using unlabeled data to bridge ID and OOD domains, enabling more reliable predictions for drug discovery applications.
Abstract: A widely recognized limitation of molecular prediction models is their reliance on structures observed in the training data, resulting in poor generalization to out-of-distribution compounds. Yet in drug discovery, the compounds most critical for advancing research often lie beyond the training set, making the bias toward the training data particularly problematic. This mismatch introduces substantial covariate shift, under which standard deep learning models produce unstable and inaccurate predictions. Furthermore, the scarcity of labeled data-stemming from the onerous and costly nature of experimental validation-further exacerbates the difficulty of achieving reliable generalization. To address these limitations, we propose a novel bilevel optimization approach that leverages unlabeled data to interpolate between in-distribution (ID) and out-of-distribution (OOD) data, enabling the model to learn how to generalize beyond the training distribution. We demonstrate significant performance gains on challenging real-world datasets with substantial covariate shift, supported by t-SNE visualizations highlighting our interpolation method.
[285] Tabby: A Language Model Architecture for Tabular and Structured Data Synthesis
Sonia Cromp, Satya Sai Srinath Namburi GNVV, Mohammed Alkhudhayri, Catherine Cao, Samuel Guo, Nicholas Roberts, Frederic Sala
Main category: cs.LG
TL;DR: Tabby is a novel Transformer modification using Gated Mixture-of-Experts for column-specific parameterization, enabling high-quality tabular data synthesis that matches real data quality.
Details
Motivation: Despite advances in LLMs for text synthesis, tabular data synthesis has received less attention, creating a gap that needs to be addressed for structured data generation.
Method: Tabby modifies standard Transformer architecture with Gated Mixture-of-Experts to represent column differences using column-specific parameter sets, paired with Plain training technique for LLM table training.
Result: Tabby achieves data quality near or equal to real data, with up to 44% improvement over previous methods, and extends to general structured data like nested JSON datasets.
Conclusion: Tabby provides a powerful solution for tabular and structured data synthesis, bridging the gap between text and table generation capabilities in modern LLMs.
Abstract: While advances in large language models (LLMs) have greatly improved the quality of synthetic text data in recent years, synthesizing tabular data has received relatively less attention. We address this disparity with Tabby, a simple but powerful post-training modification to the standard Transformer language model architecture, enabling its use for tabular dataset synthesis. Tabby enables the representation of differences across columns using Gated Mixture-of-Experts, with column-specific sets of parameters. Empirically, Tabby results in data quality near or equal to that of real data. By pairing our novel LLM table training technique, Plain, with Tabby, we observe up to a 44% improvement in quality over previous methods. We also show that Tabby extends beyond tables to more general structured data, reaching parity with real data on a nested JSON dataset as well.
[286] Simulation as Supervision: Mechanistic Pretraining for Scientific Discovery
Carson Dudley, Reiden Magdaleno, Christopher Harding, Marisa Eisenberg
Main category: cs.LG
TL;DR: SGNNs use synthetic simulation data as training to create robust scientific models that overcome limitations of both mechanistic theory and pure machine learning, showing improved prediction and inference across domains.
Details
Motivation: Address the tradeoff between interpretable mechanistic theory and predictive machine learning, overcoming brittleness of hybrid approaches like PINNs that embed domain knowledge as functional constraints which can fail under model misspecification.
Method: Simulation-Grounded Neural Networks (SGNNs) embed domain knowledge into training data rather than functional constraints. Pretrain on synthetic corpora spanning diverse model structures and observational artifacts to learn broad patterns of physical possibility, internalizing system dynamics without forcing satisfaction of potentially incorrect equations.
Result: SGNNs confer significant robustness across disciplines: nearly tripled COVID-19 forecasting skill vs CDC baselines; outperformed physics-constrained models on dengue outbreaks even with incorrect transmission equations; estimated early COVID-19 transmissibility more accurately than traditional methods; enable back-to-simulation attribution for mechanistic interpretability.
Conclusion: Mechanistic simulations can serve as effective training data for robust scientific inference that generalizes beyond fixed functional forms, unifying simulation-based techniques into a single framework that bridges theory and data-driven approaches.
Abstract: Scientific modeling faces a tradeoff between the interpretability of mechanistic theory and the predictive power of machine learning. While hybrid approaches like Physics-Informed Neural Networks (PINNs) embed domain knowledge as functional constraints, they can be brittle under model misspecification. We introduce Simulation-Grounded Neural Networks (SGNNs), a framework that instead embeds domain knowledge into the training data to establish a structural prior. By pretraining on synthetic corpora spanning diverse model structures and observational artifacts, SGNNs learn the broad patterns of physical possibility. This allows the model to internalize the underlying dynamics of a system without being forced to satisfy a single, potentially incorrect equation. We evaluated SGNNs across scientific disciplines and found that this approach confers significant robustness. In prediction tasks, SGNNs nearly tripled COVID-19 forecasting skill versus CDC baselines. In tests on dengue outbreaks, SGNNs outperformed physics-constrained models even when both were restricted to incorrect human-to-human transmission equations, demonstrating that SGNNs are potentially more robust to model misspecification. For inference, SGNNs extend the logic of simulation-based inference to enable supervised learning for unobservable targets, estimating early COVID-19 transmissibility more accurately than traditional methods. Finally, SGNNs enable back-to-simulation attribution, a form of mechanistic interpretability that maps real-world data back to the simulated manifold to identify underlying processes. By unifying these disparate simulation-based techniques into a single framework, we demonstrate that mechanistic simulations can serve as effective training data for robust scientific inference that generalizes beyond the limitations of fixed functional forms.
[287] KANO: Kolmogorov-Arnold Neural Operator
Jin Lee, Ziming Liu, Xinling Yu, Yixuan Wang, Haewon Jeong, Murphy Yuezhen Niu, Zheng Zhang
Main category: cs.LG
TL;DR: KANO is a dual-domain neural operator combining spectral and spatial bases with symbolic interpretability, overcoming FNO’s limitations on position-dependent PDEs and achieving superior performance in quantum Hamiltonian learning.
Details
Motivation: The paper addresses limitations of Fourier Neural Operator (FNO), which suffers from spectral bottlenecks and poor performance on position-dependent differential operators (variable coefficient PDEs). FNO requires spectrally sparse operators and fast-decaying input Fourier tails, making it impractical for general position-dependent dynamics.
Method: Kolmogorov-Arnold Neural Operator (KANO): a dual-domain neural operator jointly parameterized by both spectral and spatial bases. It maintains symbolic interpretability while overcoming FNO's pure-spectral limitations. The method combines spectral and spatial representations to handle generic position-dependent dynamics.
Result: 1) KANO robustly generalizes on position-dependent differential operators where FNO fails. 2) In quantum Hamiltonian learning, KANO reconstructs ground-truth Hamiltonians in closed-form symbolic representations with coefficients accurate to the fourth decimal place. 3) Achieves ≈6×10⁻⁶ state infidelity from projective measurement data, orders of magnitude better than an FNO trained with ideal full wave function data (≈1.5×10⁻²).
Conclusion: KANO provides a superior alternative to FNO for learning position-dependent dynamics, offering both practical performance advantages and theoretical guarantees. Its dual-domain approach with symbolic interpretability makes it particularly effective for scientific applications like quantum Hamiltonian learning where accurate symbolic reconstruction is valuable.
Abstract: We introduce Kolmogorov–Arnold Neural Operator (KANO), a dual-domain neural operator jointly parameterized by both spectral and spatial bases with intrinsic symbolic interpretability. We theoretically demonstrate that KANO overcomes the pure-spectral bottleneck of Fourier Neural Operator (FNO): KANO remains expressive over generic position-dependent dynamics (variable coefficient PDEs) for any physical input, whereas FNO stays practical only for spectrally sparse operators and strictly imposes a fast-decaying input Fourier tail. We verify our claims empirically on position-dependent differential operators, for which KANO robustly generalizes but FNO fails. In the quantum Hamiltonian learning benchmark, KANO reconstructs ground-truth Hamiltonians in closed-form symbolic representations accurate to the fourth decimal place in coefficients and attains $\approx 6\times10^{-6}$ state infidelity from projective measurement data, orders of magnitude better than the $\approx 1.5\times10^{-2}$ infidelity of an FNO trained with ideal full wave function data.
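For intuition about the dual-domain idea, the toy layer below combines a truncated spectral multiplier (FNO-style) with a learned position-dependent spatial term, so the layer is not purely spectral. This is only an illustrative assumption; KANO's actual parameterization is built from Kolmogorov-Arnold (learnable basis-function) components, which are not reproduced here.

```python
# Toy dual-domain layer (assumed simplification): spectral multiplier on a few
# Fourier modes plus a learned position-dependent multiplier on the grid.
import torch
import torch.nn as nn

class DualDomainLayer1D(nn.Module):
    def __init__(self, n_grid: int, n_modes: int):
        super().__init__()
        # Spectral branch: complex multiplier on the lowest n_modes frequencies.
        self.spectral_weight = nn.Parameter(0.02 * torch.randn(n_modes, dtype=torch.cfloat))
        # Spatial branch: position-dependent multiplier over the grid.
        self.spatial_weight = nn.Parameter(0.02 * torch.randn(n_grid))

    def forward(self, u: torch.Tensor) -> torch.Tensor:   # u: (batch, n_grid)
        u_hat = torch.fft.rfft(u, dim=-1)
        out_hat = torch.zeros_like(u_hat)
        k = self.spectral_weight.numel()
        out_hat[:, :k] = u_hat[:, :k] * self.spectral_weight
        spectral = torch.fft.irfft(out_hat, n=u.shape[-1], dim=-1)
        spatial = self.spatial_weight * u                  # varies explicitly with position
        return spectral + spatial

layer = DualDomainLayer1D(n_grid=128, n_modes=16)
print(layer(torch.randn(4, 128)).shape)                    # torch.Size([4, 128])
```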
[288] PTQTP: Post-Training Quantization to Trit-Planes for Large Language Models
He Xiao, Runming Yang, Qingyao Yang, Wendong Xu, Zhen Li, Yupeng Su, Zhengwu Liu, Hongxia Yang, Ngai Wong
Main category: cs.LG
TL;DR: PTQTP introduces a structured post-training quantization framework using dual ternary trit-planes for LLMs, achieving multiplication-free inference with high accuracy at extremely low bit-widths while requiring minimal quantization time.
Details
Motivation: Existing ultra-low-bit quantization methods for LLMs face a fundamental trade-off: binary approximations have limited representational capacity, while quantization-aware training requires huge computational resources. There's a need for a practical solution that maintains high accuracy at extremely low bit-widths without extensive training overhead.Method: PTQTP decomposes weight matrices into dual ternary {-1, 0, 1} trit-planes, separating weights into discrete topology (trit-planes) and continuous magnitude (scales). This enables multiplication-free additive inference through uniform ternary operations. The framework includes a progressive approximation algorithm for global weight consistency and works without architectural modifications.
Result: PTQTP significantly outperforms sub-4bit PTQ methods on language reasoning, mathematical reasoning, and coding tasks across LLaMA3.x and Qwen3 models (0.6B-70B). It rivals 1.58-bit QAT performance while requiring only single-hour quantization vs. 10-14 GPU days for training-based methods, achieving 4.63× faster inference than FP16 baselines.
Conclusion: PTQTP establishes a new practical solution for efficient LLM deployment in resource-constrained environments by providing high-fidelity sparse approximation at extremely low bit-widths with minimal quantization overhead and significant speed improvements.
Abstract: Post-training quantization (PTQ) of large language models (LLMs) to extremely low bit-widths remains challenging due to the fundamental trade-off between computational efficiency and representational capacity. While existing ultra-low-bit methods rely on binary approximations or quantization-aware training (QAT), they often suffer from either limited representational capacity or huge training resource overhead. We introduce PTQ to Trit-Planes (PTQTP), a structured PTQ framework that decomposes weight matrices into dual ternary {-1, 0, 1} trit-planes. This approach achieves multiplication-free additive inference by decoupling weights into discrete topology (trit-planes) and continuous magnitude (scales), effectively enabling high-fidelity sparse approximation. PTQTP provides: (1) a theoretically grounded progressive approximation algorithm ensuring global weight consistency; (2) model-agnostic deployment without architectural modifications; and (3) uniform ternary operations that eliminate mixed-precision overhead. Comprehensive experiments on LLaMA3.x and Qwen3 (0.6B-70B) demonstrate that PTQTP significantly outperforms sub-4bit PTQ methods on language reasoning, mathematical reasoning, and coding tasks. PTQTP rivals 1.58-bit QAT performance while requiring only single-hour quantization, compared to 10-14 GPU days for training-based methods, and its end-to-end inference is 4.63$\times$ faster than the FP16 baseline model, establishing a new and practical solution for efficient LLM deployment in resource-constrained environments. Code will be available at https://github.com/HeXiao-55/PTQTP.
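As intuition for what a dual trit-plane decomposition looks like, the sketch below greedily fits each weight vector as s1·T1 + s2·T2 with T1, T2 ∈ {-1, 0, 1}. The thresholding heuristic and greedy residual fit are assumptions for illustration; the paper's progressive approximation algorithm with global weight consistency is not reproduced here.

```python
# Naive two-plane ternary decomposition for intuition only (assumed heuristic,
# not the paper's algorithm): approximate a weight vector as s1*T1 + s2*T2
# with T1, T2 in {-1, 0, +1} and per-plane scales.
import numpy as np

def ternarize(v: np.ndarray):
    """Threshold at a fraction of mean |v| (a common ternarization heuristic),
    then use the least-squares scale for the resulting {-1, 0, 1} pattern."""
    thr = 0.7 * np.mean(np.abs(v))
    t = np.sign(v) * (np.abs(v) > thr)
    nnz = np.count_nonzero(t)
    scale = float(v[t != 0] @ t[t != 0]) / nnz if nnz else 0.0
    return t.astype(np.int8), scale

def trit_plane_decompose(w: np.ndarray):
    t1, s1 = ternarize(w)             # first trit-plane fits the raw weights
    t2, s2 = ternarize(w - s1 * t1)   # second trit-plane fits the residual
    return (t1, s1), (t2, s2)

w = np.random.default_rng(0).standard_normal(4096)
(t1, s1), (t2, s2) = trit_plane_decompose(w)
approx = s1 * t1 + s2 * t2
print("relative L2 error:", np.linalg.norm(w - approx) / np.linalg.norm(w))
```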
[289] Density-Based Algorithms for Corruption-Robust Contextual Search and Convex Optimization
Renato Paes Leme, Chara Podimata, Jon Schneider
Main category: cs.LG
TL;DR: The paper studies contextual search with adversarial noise, achieving improved regret bounds for ε-ball and symmetric loss functions through novel density-based tracking techniques.
Details
Motivation: To improve regret bounds for contextual search in adversarial noise settings, addressing limitations of prior approaches that had suboptimal dependencies on dimension and time horizon.Method: Uses density functions over candidate target vectors instead of knowledge sets, and studies corruption-robust convex optimization with subgradient feedback as a more general framework.
Result: Achieved tight regret bound O(C + d log(1/ε)) for ε-ball loss (improving over prior O(d³ log(1/ε) log²(T) + C log(T) log(1/ε))), and O(C + d log T) for symmetric loss with efficient algorithm.
Conclusion: The density-based approach significantly improves regret bounds for contextual search under adversarial noise, with techniques applicable to broader corruption-robust optimization problems.
Abstract: We study the problem of contextual search, a generalization of binary search in higher dimensions, in the adversarial noise model. Let $d$ be the dimension of the problem, $T$ be the time horizon and $C$ be the total amount of adversarial noise in the system. We focus on the $\varepsilon$-ball loss and the symmetric loss. For the $\varepsilon$-ball loss, we give a tight regret bound of $O(C + d \log(1/\varepsilon))$, improving over the $O(d^3 \log(1/\varepsilon) \log^2(T) + C \log(T) \log(1/\varepsilon))$ bound of Krishnamurthy et al. (Operations Research '23). For the symmetric loss, we give an efficient algorithm with regret $O(C+d \log T)$. To tackle the symmetric loss case, we study the more general setting of Corruption-Robust Convex Optimization with Subgradient feedback, which is of independent interest. Our techniques are a significant departure from prior approaches. Specifically, we keep track of density functions over the candidate target vectors instead of a knowledge set consisting of the candidate target vectors consistent with the feedback obtained.
[290] Distributed Sparse Linear Regression under Communication Constraints
Rodney Fonseca, Boaz Nadler
Main category: cs.LG
TL;DR: Distributed sparse linear regression with sublinear communication per machine using debiased lasso estimators and two-round schemes.
Details
Motivation: Statistical tasks often occur in distributed settings with data split among multiple machines connected to a fusion center. End machines have limited bandwidth and power, requiring tight communication budgets, especially for learning sparse linear regression models under severe communication constraints.Method: Proposes several two-round distributed schemes where individual machines compute debiased lasso estimators but send only very few values to the fusion center. Communication per machine is sublinear in data dimension.
Result: Theoretical analysis shows one scheme achieves exact support recovery with high probability at low signal-to-noise ratios where individual machines fail. Simulations demonstrate performance comparable to or better than more communication-intensive approaches.
Conclusion: The proposed communication-efficient distributed schemes enable sparse linear regression learning with sublinear communication overhead while maintaining or improving performance compared to communication-intensive methods.
Abstract: In multiple domains, statistical tasks are performed in distributed settings, with data split among several end machines that are connected to a fusion center. In various applications, the end machines have limited bandwidth and power, and thus a tight communication budget. In this work we focus on distributed learning of a sparse linear regression model, under severe communication constraints. We propose several two-round distributed schemes, whose communication per machine is sublinear in the data dimension. In our schemes, individual machines compute debiased lasso estimators, but send to the fusion center only very few values. On the theoretical front, we analyze one of these schemes and prove that with high probability it achieves exact support recovery at low signal-to-noise ratios, where individual machines fail to recover the support. We show in simulations that our scheme works as well as, and in some cases better than, more communication-intensive approaches.
[291] LeanQuant: Accurate and Scalable Large Language Model Quantization with Loss-error-aware Grid
Tianyi Zhang, Anshumali Shrivastava
Main category: cs.LG
TL;DR: LeanQuant is a novel LLM quantization method that learns loss-error-aware quantization grids instead of using non-adaptive min-max affine grids, achieving better accuracy, versatility, and scalability while reducing memory and inference costs.
Details
Motivation: Current LLM quantization methods face limitations: specialized computations and custom data formats limit framework compatibility, high resource requirements hinder scaling to large models, and min-max affine quantization grids fail to preserve model quality due to outliers in inverse Hessian diagonals.Method: LeanQuant proposes learning loss-error-aware quantization grids within the iterative loss-error-based quantization framework, replacing non-adaptive min-max affine grids. This approach works with both affine and non-uniform quantization types, enhancing framework compatibility.
Result: Extensive experiments show LeanQuant is highly accurate, outperforming competitive baselines in model quality. It’s also scalable, successfully quantizing Llama-3.1 405B (one of the largest open-source LLMs) using only two Quadro RTX 8000-48GB GPUs in 21 hours.
Conclusion: LeanQuant addresses critical limitations in existing LLM quantization methods by introducing loss-error-aware grid learning, resulting in an accurate, versatile, and scalable solution that enables efficient deployment of large language models across different frameworks and hardware platforms.
Abstract: Large language models (LLMs) have shown immense potential across various domains, but their high memory requirements and inference costs remain critical challenges for deployment. Post-training quantization (PTQ) has emerged as a promising technique to reduce memory requirements and decoding latency. However, recent accurate quantization methods often depend on specialized computations or custom data formats to achieve better model quality, which limits their compatibility with popular frameworks, as they require dedicated inference kernels tailored to specific hardware and software platforms, hindering wider adoption. Furthermore, many competitive methods have high resource requirements and computational overhead for quantizing models, making it challenging to scale them to hundreds of billions of parameters. In response to these challenges, we propose LeanQuant (Loss-Error-Aware Network Quantization), a novel quantization method that is accurate, versatile, and scalable. In the existing popular iterative loss-error-based quantization framework, we identify a critical limitation in prior methods: the min-max affine quantization grid fails to preserve model quality due to outliers in inverse Hessian diagonals. To overcome this fundamental issue, we propose learning loss-error-aware grids, instead of using non-adaptive min-max affine grids. Our approach not only produces quantized models that are more accurate but also generalizes to a wider range of quantization types, including affine and non-uniform quantization, enhancing compatibility with more frameworks. Extensive experiments with recent LLMs demonstrate that LeanQuant is highly accurate, comparing favorably against competitive baselines in model quality, and scalable, achieving very accurate quantization of Llama-3.1 405B, one of the largest open-source LLMs to date, using two Quadro RTX 8000-48GB GPUs in 21 hours.
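The toy comparison below illustrates the core intuition with assumed stand-ins for the real quantities: a min-max affine grid gets stretched by outliers, whereas a grid fitted by importance-weighted k-means (with weights playing the role of the inverse-Hessian diagonals) places levels where errors matter most for the loss. This is a simplified analogue, not LeanQuant's actual algorithm.

```python
# Simplified illustration (not the paper's algorithm): non-adaptive min-max grid
# versus a "loss-error-aware" grid from importance-weighted k-means.
import numpy as np

def minmax_grid(w, n_levels=16):
    return np.linspace(w.min(), w.max(), n_levels)

def weighted_kmeans_grid(w, weights, n_levels=16, iters=50):
    centers = np.quantile(w, np.linspace(0, 1, n_levels))
    for _ in range(iters):
        assign = np.abs(w[:, None] - centers[None, :]).argmin(axis=1)
        for k in range(n_levels):
            mask = assign == k
            if mask.any():
                centers[k] = np.average(w[mask], weights=weights[mask])
    return centers

def quantize(w, grid):
    return grid[np.abs(w[:, None] - grid[None, :]).argmin(axis=1)]

rng = np.random.default_rng(0)
w = rng.standard_normal(10_000)
w[:20] += 8.0                                       # outliers stretch the min-max grid
importance = rng.uniform(0.1, 1.0, size=w.shape)    # stand-in for inverse-Hessian diagonals

for name, grid in [("min-max", minmax_grid(w)),
                   ("loss-error-aware", weighted_kmeans_grid(w, importance))]:
    err = np.average((w - quantize(w, grid)) ** 2, weights=importance)
    print(f"{name:>18}: weighted quantization error = {err:.5f}")
```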
[292] Sketch to Adapt: Fine-Tunable Sketches for Efficient LLM Adaptation
Tianyi Zhang, Junda Su, Aditya Desai, Oscar Wu, Zhaozhuo Xu, Anshumali Shrivastava
Main category: cs.LG
TL;DR: SketchTune is a new parameter-efficient fine-tuning method that uses sketching (data compression) instead of low-rank assumptions to adapt LLMs, achieving better performance with smaller models and fewer trainable parameters.
Details
Motivation: Current PEFT methods for LLMs have limitations: they rely on restrictive low-rank assumptions for adapters, leading to suboptimal model quality, and require complex two-path computation that increases memory usage and slows training/inference.Method: SketchTune compresses LLM weights into compact fine-tunable sketches using sketching techniques, integrating compression and adaptation into a unified framework that eliminates the need for two-path computation.
Result: SketchTune outperforms leading PEFT methods (LoRA, DoRA, S2FT, LoftQ) across diverse tasks using substantially smaller base models (2.6-3.5× smaller) and comparable/fewer trainable parameters, with up to 14.48% accuracy improvement on GSM8K using 7.3× fewer parameters.
Conclusion: Sketching provides a more effective alternative to low-rank methods for LLM adaptation, enabling better performance with smaller models and more efficient training/inference through unified compression-adaptation framework.
Abstract: Adapting pre-trained large language models (LLMs) is crucial but challenging due to their enormous size. Parameter-efficient fine-tuning (PEFT) techniques typically employ additive adapters applied to frozen model weights. To further reduce memory usage, model weights are often compressed through quantization. However, existing PEFT methods often yield suboptimal model quality because they rely on restrictive assumptions, such as low-rank constraints on adapters to limit the number of trainable parameters. We find that sketching, a popular data compression technique, can serve as an efficient LLM adaptation strategy while avoiding the low-rank assumption. We introduce SketchTune, a compressive adaptation strategy that compresses LLM weights into compact fine-tunable sketches, integrating compression and adaptation into a unified framework. This integration eliminates the need for complex two-path computation in existing PEFT techniques, enabling faster and more memory-efficient training and inference. SketchTune is supported by mathematical insights into matrix classes that are better approximated using sketching rather than low-rank methods. Our extensive evaluations with Llama and Mistral models demonstrate that SketchTune outperforms leading PEFT methods across diverse tasks while using substantially smaller base models and comparable trainable parameters. As a highlight, SketchTune outperforms LoRA, DoRA, and S2FT on commonsense and math benchmarks using 2.6-3.5$\times$ smaller base models and exceeds LoftQ in accuracy by 14.48% on GSM8K with 7.3$\times$ fewer trainable parameters. Our code is available at https://github.com/LeanModels/SketchTune.
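As a rough illustration of sketch-style adaptation, the hypothetical layer below stores its weights in a small fine-tunable table addressed by a fixed random hash (a hashing-trick parameterization), so the trainable parameter count is decoupled from the layer size. SketchTune's actual sketch construction differs; this is only an assumed analogue of the general idea.

```python
# Assumed hashing-trick analogue of a "fine-tunable sketch" (not SketchTune itself).
import torch
import torch.nn as nn

class SketchedLinear(nn.Module):
    """Linear layer whose dense weight is looked up from a small trainable table
    through a fixed random hash and sign pattern."""
    def __init__(self, in_features: int, out_features: int, sketch_size: int):
        super().__init__()
        self.table = nn.Parameter(0.02 * torch.randn(sketch_size))
        self.register_buffer("idx", torch.randint(0, sketch_size, (out_features, in_features)))
        self.register_buffer("sign", (torch.randint(0, 2, (out_features, in_features)) * 2 - 1).float())
        self.bias = nn.Parameter(torch.zeros(out_features))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        weight = self.table[self.idx] * self.sign   # materialize the dense weight on the fly
        return x @ weight.t() + self.bias

layer = SketchedLinear(in_features=4096, out_features=4096, sketch_size=65536)
print(sum(p.numel() for p in layer.parameters()))   # ~70k trainable vs ~16.8M for a dense layer
```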
[293] A New Flexible Train-Test Split Algorithm, an approach for choosing among the Hold-out, K-fold cross-validation, and Hold-out iteration
Zahra Bami, Ali Behnampour, Aniruddha Bora, Hassan Doosti
Main category: cs.LG
TL;DR: A Python framework evaluates how different validation strategies affect ML performance across 7 algorithms and 3 biomedical datasets, finding no single best strategy - optimal validation depends on algorithm-dataset-metric interactions.
Details
Motivation: Validation methods are often chosen based on defaults or convention without considering their impact on generalizability and real-world performance. Common approaches like hold-out or fixed k-fold CV are applied empirically without systematic evaluation of their effects.Method: A flexible Python framework systematically examines validation strategies across 7 ML algorithms (Decision Trees, KNN, Naive Bayes variants, Logistic Regression, calibrated linear SVM, histogram-based gradient boosting). Evaluates hold-out splits (10-90%), k-fold CV (k=3-15), repeated hold-out, and nested CV on 3 biomedical datasets of varying sizes.
Result: No single validation strategy consistently outperforms others across all algorithms and datasets. Optimal validation depends on the interaction between algorithm, dataset characteristics, and evaluation metric (ROC-AUC, accuracy, Matthews correlation coefficient).
Conclusion: Validation strategy selection should be tailored to specific algorithm-dataset combinations rather than relying on defaults. The framework provides systematic evaluation to guide appropriate validation method selection for improved generalizability.
Abstract: Choosing an appropriate strategy for partitioning data into training and evaluation sets is a critical step in machine learning, yet validation methods are often selected using default or conventional settings without considering their impact on generalizability and real-world performance. Common approaches such as hold-out validation or k-fold cross-validation with fixed k values are frequently applied based solely on empirical practice. To address this issue, we propose a flexible Python-based framework that systematically examines how different validation strategies affect predictive performance across seven widely used machine learning algorithms, including Decision Trees, K-Nearest Neighbors, Naive Bayes variants, Logistic Regression, calibrated linear Support Vector Machines, and histogram-based gradient boosting. The framework evaluates these methods under a wide range of validation schemes, including hold-out splits from 10% to 90%, k-fold cross-validation with k between 3 and 15, repeated hold-out, and nested cross-validation. The framework is applied to three biomedical datasets of varying size, and performance is assessed using ROC-AUC, accuracy, and the Matthews correlation coefficient. The results show that no single validation strategy consistently outperforms others across all algorithms and datasets, indicating that optimal validation depends on the interaction between the algorithm, dataset characteristics, and evaluation metric.
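A minimal version of the kind of sweep such a framework automates is sketched below; the dataset, model, split fractions, and k values here are illustrative choices, not the authors' exact configuration.

```python
# Illustrative validation-strategy sweep (assumed dataset/model, not the paper's setup).
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=2000))

# Hold-out splits with varying test fractions.
for test_size in (0.1, 0.3, 0.5, 0.7, 0.9):
    Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=test_size,
                                          stratify=y, random_state=0)
    auc = roc_auc_score(yte, model.fit(Xtr, ytr).predict_proba(Xte)[:, 1])
    print(f"hold-out {test_size:.0%} test: ROC-AUC = {auc:.3f}")

# k-fold cross-validation with varying k.
for k in (3, 5, 10, 15):
    scores = cross_val_score(model, X, y, cv=k, scoring="roc_auc")
    print(f"{k:>2}-fold CV: ROC-AUC = {scores.mean():.3f} ± {scores.std():.3f}")
```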
[294] Brain network science modelling of sparse neural networks enables Transformers and LLMs to perform as fully connected
Yingtao Zhang, Diego Cerretti, Jialin Zhao, Wenjing Wu, Ziheng Liao, Umberto Michieli, Carlo Vittorio Cannistraci
Main category: cs.LG
TL;DR: The paper introduces BRF for sparse ANN initialization and CHTs/CHTss methods that improve dynamic sparse training by addressing CHT’s limitations, achieving better performance at ultra-low connectivity levels.
Details
Motivation: Dynamic sparse training reduces computational demands but struggles at high sparsity levels. CHT is brain-inspired but has high time complexity (O(Nd³)) and inappropriate early training selection. Need better initialization and more efficient, flexible methods.Method: 1) BRF model for sparse ANN initialization; 2) GPU-friendly matrix approximation reducing complexity to O(N³); 3) CHTs with flexible sampling strategy balancing exploration/exploitation; 4) CHTss integrating sigmoid gradual density decay.
Result: BRF outperforms previous network science models. CHTs with 1% connections beats fully connected MLPs on image classification, compressing networks to <30% nodes. CHTss with 5% connections outperforms fully connected Transformers in machine translation. Both methods beat other DST methods at 30% connectivity in language modeling.
Conclusion: The proposed brain-inspired BRF initialization and improved CHT methods enable effective dynamic sparse training at ultra-low connectivity levels, achieving competitive or superior performance to dense networks while significantly reducing computational demands.
Abstract: Dynamic sparse training (DST) can reduce the computational demands in ANNs, but faces difficulties in keeping peak performance at high sparsity levels. The Cannistraci-Hebb training (CHT) is a brain-inspired method for growing connectivity in DST. CHT leverages a gradient-free, topology-driven link regrowth, which has shown an ultra-sparse (less than 1% connectivity) advantage across various tasks compared to fully connected networks. Yet, CHT suffers from two main drawbacks: (i) its time complexity is $O(Nd^3)$, where $N$ is the network size in nodes and $d$ the node degree, restricting it to ultra-sparse regimes; (ii) it selects top link prediction scores, which is inappropriate for the early training epochs, when the network presents unreliable connections. Here, we design the first brain-inspired network model, termed bipartite receptive field (BRF), to initialize the connectivity of sparse artificial neural networks. We further introduce a GPU-friendly matrix-based approximation of CH link prediction, reducing complexity to $O(N^3)$. We introduce the Cannistraci-Hebb training soft rule (CHTs), which adopts a flexible strategy for sampling connections in both link removal and regrowth, balancing the exploration and exploitation of network topology. Additionally, we integrate CHTs with a sigmoid gradual density decay (CHTss). Empirical results show that BRF offers performance advantages over previous network science models. Using 1% of connections, CHTs outperforms fully connected networks in MLP architectures on image classification tasks, compressing some networks to less than 30% of the nodes. Using 5% of the connections, CHTss outperforms fully connected networks in two Transformer-based machine translation tasks. Finally, at 30% connectivity, both CHTs and CHTss outperform other DST methods in the language modeling task.
[295] A Gaussian Process View on Observation Noise and Initialization in Wide Neural Networks
Sergio Calvo-Ordoñez, Jonathan Plenk, Richard Bergna, Alvaro Cartea, Jose Miguel Hernandez-Lobato, Konstantina Palla, Kamil Ciosek
Main category: cs.LG
TL;DR: The paper addresses limitations in NTK-GP equivalence by introducing a regularizer for noisy data and a shifted network for arbitrary prior means, enabling practical gradient descent-based GP modeling.
Details
Motivation: Existing NTK-GP formulations have two key limitations: (1) they assume noiseless targets, leading to misspecification on noisy data, and (2) the equivalence doesn't extend to arbitrary prior means, which are essential for well-specified Gaussian process models.Method: Two main contributions: (1) Introduce a regularizer into training objective to correspond to observation noise in NTK-GP, and (2) propose a “shifted network” that enables arbitrary prior means and allows obtaining posterior mean with gradient descent on a single network without ensembling or kernel inversion.
Result: The approach successfully addresses both limitations, enabling practical use of NTK-GP equivalence in applied Gaussian process modeling, validated through experiments across various datasets and architectures.
Conclusion: The proposed methods remove key obstacles to practical application of NTK-GP equivalence, making gradient descent in wide neural networks a viable approach for Gaussian process modeling with proper handling of noise and arbitrary prior means.
Abstract: Performing gradient descent in a wide neural network is equivalent to computing the posterior mean of a Gaussian Process with the Neural Tangent Kernel (NTK-GP), for a specific prior mean and with zero observation noise. However, existing formulations have two limitations: (i) the NTK-GP assumes noiseless targets, leading to misspecification on noisy data; (ii) the equivalence does not extend to arbitrary prior means, which are essential for well-specified models. To address (i), we introduce a regularizer into the training objective, showing its correspondence to incorporating observation noise in the NTK-GP. To address (ii), we propose a \textit{shifted network} that enables arbitrary prior means and allows obtaining the posterior mean with gradient descent on a single network, without ensembling or kernel inversion. We validate our results with experiments across datasets and architectures, showing that this approach removes key obstacles to the practical use of NTK-GP equivalence in applied Gaussian process modeling.
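A minimal sketch, assuming one natural reading of the construction (not the authors' exact recipe): the shifted predictor subtracts a frozen copy of the network at initialization and adds a prior mean m(x), and an L2 pull toward the initial parameters plays the role of the observation-noise variance.

```python
# Assumed illustrative reading of the "shifted network" plus noise regularizer.
import copy
import torch
import torch.nn as nn

torch.manual_seed(0)
net = nn.Sequential(nn.Linear(1, 2048), nn.ReLU(), nn.Linear(2048, 1))   # a "wide" net
net_init = copy.deepcopy(net).requires_grad_(False)                       # frozen copy at init
init_params = [p.detach().clone() for p in net.parameters()]
prior_mean = lambda x: 0.5 * x                                             # assumed prior mean m(x)

def shifted(x):
    return net(x) - net_init(x) + prior_mean(x)                            # prior mean is m at init

x = torch.linspace(-1, 1, 64).unsqueeze(-1)
y = torch.sin(3 * x) + 0.1 * torch.randn_like(x)                           # noisy targets
noise_var = 0.1 ** 2                                                       # regularizer strength

opt = torch.optim.SGD(net.parameters(), lr=1e-3)
for _ in range(1000):
    opt.zero_grad()
    reg = sum(((p - p0) ** 2).sum() for p, p0 in zip(net.parameters(), init_params))
    loss = ((shifted(x) - y) ** 2).sum() + noise_var * reg
    loss.backward()
    opt.step()
print("final training loss:", loss.item())
```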
[296] Fusion of Multiscale Features Via Centralized Sparse-attention Network for EEG Decoding
Xiangrui Cai, Shaocheng Ma, Lei Cao, Jie Li, Tianyu Liu, Yilin Dong
Main category: cs.LG
TL;DR: EEG-CSANet: A centralized sparse-attention network with multi-branch parallel architecture for EEG signal decoding, achieving SOTA performance across five public datasets.
Details
Motivation: To address the inherent spatiotemporal heterogeneity of EEG signals and improve multi-branch feature fusion for better brain activity translation into executable commands.Method: Multi-branch parallel architecture with independent spatial feature extraction modules for each temporal scale, plus a centralized sparse-attention network (EEG-CSANet) using main-auxiliary branch architecture: main branch models core spatiotemporal patterns via multiscale self-attention, auxiliary branch facilitates efficient local interactions through sparse cross-attention.
Result: Achieves state-of-the-art performance across five public datasets: BCIC-IV-2A (88.54%), BCIC-IV-2B (91.09%), HGD (97.15%), SEED (96.03%), and SEED-VIG (90.56%). Demonstrates strong adaptability and robustness across various EEG decoding tasks.
Conclusion: EEG-CSANet shows strong performance and interpretability, and could serve as a promising baseline model in EEG signal decoding field. Source code is publicly available.
Abstract: Electroencephalography (EEG) signal decoding is a key technology that translates brain activity into executable commands, laying the foundation for direct brain-machine interfacing and intelligent interaction. To address the inherent spatiotemporal heterogeneity of EEG signals, this paper proposes a multi-branch parallel architecture, where each temporal scale is equipped with an independent spatial feature extraction module. To further enhance multi-branch feature fusion, we propose a Fusion of Multiscale Features via Centralized Sparse-attention Network (EEG-CSANet), a centralized sparse-attention network. It employs a main-auxiliary branch architecture, where the main branch models core spatiotemporal patterns via multiscale self-attention, and the auxiliary branch facilitates efficient local interactions through sparse cross-attention. Experimental results show that EEG-CSANet achieves state-of-the-art (SOTA) performance across five public datasets (BCIC-IV-2A, BCIC-IV-2B, HGD, SEED, and SEED-VIG), with accuracies of 88.54%, 91.09%, 97.15%, 96.03%, and 90.56%, respectively. Such performance demonstrates its strong adaptability and robustness across various EEG decoding tasks. Moreover, extensive ablation studies are conducted to enhance the interpretability of EEG-CSANet. In the future, we hope that EEG-CSANet could serve as a promising baseline model in the field of EEG signal decoding. The source code is publicly available at: https://github.com/Xiangrui-Cai/EEG-CSANet
[297] A Near-optimal, Scalable and Parallelizable Framework for Stochastic Bandits Robust to Adversarial Corruptions and Beyond
Zicheng Hu, Cheng Chen
Main category: cs.LG
TL;DR: BARBAT improves upon BARBAR for stochastic bandits with adversarial corruptions, eliminating the K factor to achieve regret that is optimal up to a logarithmic factor, and extends to multi-agent, graph, combinatorial semi-bandit, and batched settings with better parallelization and lower computational costs.
Details
Motivation: The BARBAR algorithm for stochastic bandits with adversarial corruptions suffers from suboptimal O(KC) regret that doesn't match the Ω(C) lower bound, where K is number of arms and C is corruption level. There's a need for more efficient algorithms with optimal regret bounds and better computational properties.Method: Proposes BARBAT framework that improves upon BARBAR by eliminating the K factor in regret. Extends BARBAT to multi-agent bandits, graph bandits, combinatorial semi-bandits, and batched bandits. Uses a novel approach more amenable to parallelization than Follow-the-Regularized-Leader frameworks.
Result: BARBAT achieves a regret bound that matches the Ω(C) lower bound up to a logarithmic factor, eliminating the K factor from BARBAR’s O(KC) regret. The methods are better suited to parallelization in multi-agent and batched settings, incur lower computational costs (especially in semi-bandit problems), and their efficiency is verified through numerical experiments.
Conclusion: BARBAT provides an optimal regret framework for corrupted stochastic bandits, overcoming BARBAR’s limitations. The extensions to various bandit settings with improved parallelization and computational efficiency make it a versatile and practical solution for real-world applications with adversarial corruptions.
Abstract: We investigate various stochastic bandit problems in the presence of adversarial corruptions. A seminal work for this problem is the BARBAR algorithm (Gupta et al., 2019), which achieves both robustness and efficiency. However, it suffers from a regret of $O(KC)$, which does not match the lower bound of $\Omega(C)$, where $K$ denotes the number of arms and $C$ denotes the corruption level. In this paper, we first improve the BARBAR algorithm by proposing a novel framework called BARBAT, which eliminates the factor of $K$ to achieve an optimal regret bound up to a logarithmic factor. We also extend BARBAT to various settings, including multi-agent bandits, graph bandits, combinatorial semi-bandits and batched bandits. Compared with the Follow-the-Regularized-Leader framework, our methods are more amenable to parallelization, making them suitable for multi-agent and batched bandit settings, and they incur lower computational costs, particularly in semi-bandit problems. Numerical experiments verify the efficiency of the proposed methods.
[298] Digital implementations of deep feature extractors are intrinsically informative
Max Getter
Main category: cs.LG
TL;DR: The paper proves an upper bound for energy propagation speed in neural networks, showing how structural information about signal domains can improve decay rates, with applications to discrete-domain feature extractors and CNNs via scattering over LCA groups.
Details
Motivation: To understand and quantify how quickly information (energy) propagates through deep neural networks, which is crucial for balancing computational complexity with representation expressiveness.Method: Develops a unified mathematical framework to prove upper bounds on energy propagation speed across different neural network models, including both Euclidean and non-Euclidean domains. Uses structural information about signal domains to explicitly determine or improve decay rates.
Result: Proves global exponential energy decay for: 1) feature extractors with discrete-domain input signals, and 2) convolutional neural networks via scattering over locally compact abelian (LCA) groups.
Conclusion: The theoretical framework provides rigorous bounds on information propagation in neural networks, demonstrating how domain structure influences energy decay rates, with practical implications for designing efficient deep learning architectures.
Abstract: Rapid information (energy) propagation in deep feature extractors is crucial to balance computational complexity versus expressiveness as a representation of the input. We prove an upper bound for the speed of energy propagation in a unified framework that covers different neural network models, both over Euclidean and non-Euclidean domains. Additional structural information about the signal domain can be used to explicitly determine or improve the rate of decay. To illustrate this, we show global exponential energy decay for a range of 1) feature extractors with discrete-domain input signals, and 2) convolutional neural networks (CNNs) via scattering over locally compact abelian (LCA) groups.
[299] BOAD: Discovering Hierarchical Software Engineering Agents via Bandit Optimization
Iris Xu, Guangtao Zeng, Zexue He, Charles Jin, Aldo Pareja, Dan Gutfreund, Chuang Gan, Zhang-Wei Hong
Main category: cs.LG
TL;DR: BOAD automatically discovers hierarchical multi-agent systems for software engineering tasks, outperforming single-agent and manually designed multi-agent approaches on challenging SWE benchmarks.
Details
Motivation: LLMs struggle with real-world software engineering problems that are long-horizon and out-of-distribution. Existing monolithic agent designs force models to retain irrelevant context, leading to poor generalization. Human engineers decompose complex problems, suggesting the need for specialized sub-agents.Method: Proposes Bandit Optimization for Agent Design (BOAD), formulating hierarchy discovery as a multi-armed bandit problem where each arm represents a candidate sub-agent. The reward measures helpfulness when collaborating with others, enabling efficient exploration of sub-agent designs under limited evaluation budgets.
Result: On SWE-bench-Verified, BOAD outperforms single-agent and manually designed multi-agent systems. On SWE-bench-Live, the 36B system ranks second on the leaderboard, surpassing larger models like GPT-4 and Claude.
Conclusion: Automatically discovered hierarchical multi-agent systems significantly improve generalization on challenging long-horizon software engineering tasks, demonstrating the effectiveness of structured agent coordination over monolithic designs.
Abstract: Large language models (LLMs) have shown strong reasoning and coding capabilities, yet they struggle to generalize to real-world software engineering (SWE) problems that are long-horizon and out of distribution. Existing systems often rely on a single agent to handle the entire workflow (interpreting issues, navigating large codebases, and implementing fixes) within one reasoning chain. Such monolithic designs force the model to retain irrelevant context, leading to spurious correlations and poor generalization. Motivated by how human engineers decompose complex problems, we propose structuring SWE agents as orchestrators coordinating specialized sub-agents for sub-tasks such as localization, editing, and validation. The challenge lies in discovering effective hierarchies automatically: as the number of sub-agents grows, the search space becomes combinatorial, and it is difficult to attribute credit to individual sub-agents within a team. We address these challenges by formulating hierarchy discovery as a multi-armed bandit (MAB) problem, where each arm represents a candidate sub-agent and the reward measures its helpfulness when collaborating with others. This framework, termed Bandit Optimization for Agent Design (BOAD), enables efficient exploration of sub-agent designs under limited evaluation budgets. On SWE-bench-Verified, BOAD outperforms single-agent and manually designed multi-agent systems. On SWE-bench-Live, featuring more recent and out-of-distribution issues, our 36B system ranks second on the leaderboard at the time of evaluation, surpassing larger models such as GPT-4 and Claude. These results demonstrate that automatically discovered hierarchical multi-agent systems significantly improve generalization on challenging long-horizon SWE tasks. Code is available at https://github.com/iamxjy/BOAD-SWE-Agent.
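A stripped-down illustration of the bandit formulation follows: each arm is a candidate sub-agent, the reward is a helpfulness score from running the team with that sub-agent on a sampled task, and UCB trades off exploring new sub-agents against exploiting proven collaborators. The sub-agent names and the evaluation stub are placeholders, not BOAD's actual components.

```python
# Toy UCB over candidate sub-agents (illustrative placeholders, not BOAD itself).
import math
import random

candidate_subagents = ["localizer", "editor", "test-runner", "doc-searcher"]

def evaluate_team_with(subagent: str) -> float:
    """Placeholder: the real system would run the multi-agent team on a benchmark
    instance and return a helpfulness/resolution score in [0, 1]."""
    base = {"localizer": 0.55, "editor": 0.6, "test-runner": 0.45, "doc-searcher": 0.3}
    return min(1.0, max(0.0, random.gauss(base[subagent], 0.1)))

counts = {a: 0 for a in candidate_subagents}
values = {a: 0.0 for a in candidate_subagents}

for t in range(1, 201):                                   # limited evaluation budget
    def ucb(a):
        if counts[a] == 0:
            return float("inf")
        return values[a] + math.sqrt(2 * math.log(t) / counts[a])
    arm = max(candidate_subagents, key=ucb)
    reward = evaluate_team_with(arm)
    counts[arm] += 1
    values[arm] += (reward - values[arm]) / counts[arm]   # running mean of observed helpfulness

print(sorted(values.items(), key=lambda kv: -kv[1]))
```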
[300] From Continual Learning to SGD and Back: Better Rates for Continual Linear Models
Itay Evron, Ran Levinstein, Matan Schliserman, Uri Sherman, Tomer Koren, Daniel Soudry, Nathan Srebro
Main category: cs.LG
TL;DR: The paper analyzes forgetting in continual learning for overparameterized models, establishing novel universal forgetting rates that don’t depend on problem dimensionality, improving existing bounds for continual regression, and showing randomization prevents catastrophic forgetting.
Details
Motivation: Existing forgetting rates in continual learning depend on problem dimensionality or complexity, becoming prohibitive in highly overparameterized regimes. The paper aims to develop universal forgetting rates that work regardless of dimensionality and to understand how randomization affects forgetting.Method: The authors analyze continual linear models, proving that fitting a task is equivalent to a single SGD step on a modified objective. They develop novel last-iterate SGD upper bounds in realizable least squares and leverage them for continual learning analysis. They study random orderings over tasks with and without replacement.
Result: For continual regression with replacement, they improve the best existing rate from O((d - r̄)/k) to O(min(1/k^{1/4}, √(d - r̄)/k, √(T r̄)/k)). For random task orderings without replacement, they establish the first rate, O(min(1/T^{1/4}, (d - r̄)/T)), showing that randomization alone prevents catastrophic forgetting. They also prove a matching O(1/k^{1/4}) forgetting rate for continual linear classification on separable data.
Conclusion: The paper establishes universal forgetting rates for continual learning that don’t depend on problem dimensionality, showing randomization alone can prevent catastrophic forgetting in sufficiently long task sequences. The results extend to broader methods like block Kaczmarz and POCS, illuminating their convergence properties.
Abstract: We study the common continual learning setup where an overparameterized model is sequentially fitted to a set of jointly realizable tasks. We analyze forgetting, defined as the loss on previously seen tasks, after $k$ iterations. For continual linear models, we prove that fitting a task is equivalent to a single stochastic gradient descent (SGD) step on a modified objective. We develop novel last-iterate SGD upper bounds in the realizable least squares setup and leverage them to derive new results for continual learning. Focusing on random orderings over $T$ tasks, we establish universal forgetting rates, whereas existing rates depend on problem dimensionality or complexity and become prohibitive in highly overparameterized regimes. In continual regression with replacement, we improve the best existing rate from $O((d-\bar{r})/k)$ to $O(\min(1/\sqrt[4]{k}, \sqrt{(d-\bar{r})}/k, \sqrt{T\bar{r}}/k))$, where $d$ is the dimensionality and $\bar{r}$ the average task rank. Furthermore, we establish the first rate for random task orderings without replacement. The resulting rate $O(\min(1/\sqrt[4]{T}, (d-\bar{r})/T))$ shows that randomization alone, without task repetition, prevents catastrophic forgetting in sufficiently long task sequences. Finally, we prove a matching $O(1/\sqrt[4]{k})$ forgetting rate for continual linear classification on separable data. Our universal rates extend to broader methods, such as block Kaczmarz and POCS, illuminating their loss convergence under i.i.d. and single-pass orderings.
[301] Coordinate Matrix Machine: A Human-level Concept Learning to Classify Very Similar Documents
Amin Sadri, M Maruf Hossain
Main category: cs.LG
TL;DR: CM² is a Green AI model that achieves human-level one-shot document classification by learning structural features instead of semantic content, outperforming traditional methods with minimal data and compute.
Details
Motivation: Address the gap between human concept learning (single-example) and machine learning (hundreds of examples), while moving away from energy-intensive "Red AI" toward sustainable Green AI solutions.Method: Coordinate Matrix Machine (CM²) focuses on document structural coordinates rather than semantic vectors, identifying important structural features that humans would consider for classification.
Result: Outperforms traditional vectorizers and complex deep learning models, achieving high accuracy with minimal data (one-shot learning) while being computationally efficient and environmentally sustainable.
Conclusion: CM² demonstrates that human-level concept learning can be achieved through structural intelligence rather than massive pre-training, offering a practical Green AI alternative to energy-intensive deep learning approaches.
Abstract: Human-level concept learning argues that humans typically learn new concepts from a single example, whereas machine learning algorithms typically require hundreds of samples to learn a single concept. Our brain subconsciously identifies important features and learns more effectively. Contribution: In this paper, we present the Coordinate Matrix Machine (CM$^2$). This purpose-built small model augments human intelligence by learning document structures and using this information to classify documents. While modern “Red AI” trends rely on massive pre-training and energy-intensive GPU infrastructure, CM$^2$ is designed as a Green AI solution. It achieves human-level concept learning by identifying only the structural “important features” a human would consider, allowing it to classify very similar documents using only one sample per class. Advantage: Our algorithm outperforms traditional vectorizers and complex deep learning models that require larger datasets and significant compute. By focusing on structural coordinates rather than exhaustive semantic vectors, CM$^2$ offers:
1. High accuracy with minimal data (one-shot learning)
2. Geometric and structural intelligence
3. Green AI and environmental sustainability
4. Optimized for CPU-only environments
5. Inherent explainability (glass-box model)
6. Faster computation and low latency
7. Robustness against unbalanced classes
8. Economic viability
9. Generic, expandable, and extendable
[302] 70% Size, 100% Accuracy: Lossless LLM Compression for Efficient GPU Inference via Dynamic-Length Float (DFloat11)
Tianyi Zhang, Mohsen Hariri, Shaochen Zhong, Vipin Chaudhary, Yang Sui, Xia Hu, Anshumali Shrivastava
Main category: cs.LG
TL;DR: DFloat11 is a lossless compression framework that reduces LLM and diffusion model sizes by 30% while maintaining bit-for-bit identical outputs through dynamic-length entropy coding and custom GPU decompression kernels.
Details
Motivation: Large AI models have grown too large for efficient deployment on resource-constrained hardware, and there's significant inefficiency in the BFloat16 weight representation due to low entropy, creating opportunities for compression.Method: DFloat11 applies entropy coding to assign dynamic-length encodings to weights based on frequency, achieving near information-optimal compression. It uses custom GPU kernels with compact hierarchical lookup tables, two-phase coordination, and transformer-block-level decompression for efficient online decompression.
Result: Achieves ~30% model size reduction while preserving bit-for-bit identical outputs. Enables 2.3-46.2x higher throughput in token generation compared to CPU offloading, and 5.7-14.9x longer generation lengths with fixed GPU memory. Successfully runs Llama 3.1 405B (810GB model) on 8x80GB GPUs.
Conclusion: DFloat11 provides an effective lossless compression solution for large AI models that significantly reduces memory requirements while maintaining exact model outputs, enabling deployment of massive models on existing hardware.
Abstract: Large-scale AI models, such as Large Language Models (LLMs) and Diffusion Models (DMs), have grown rapidly in size, creating significant challenges for efficient deployment on resource-constrained hardware. In this paper, we introduce Dynamic-Length Float (DFloat11), a lossless compression framework that reduces LLM and DM size by 30% while preserving outputs that are bit-for-bit identical to the original model. DFloat11 is motivated by the low entropy in the BFloat16 weight representation of LLMs, which reveals significant inefficiency in the existing storage format. By applying entropy coding, DFloat11 assigns dynamic-length encodings to weights based on frequency, achieving near information-optimal compression without any loss of precision. To facilitate efficient inference with dynamic-length encodings, we develop a custom GPU kernel for fast online decompression. Our design incorporates the following: (i) compact, hierarchical lookup tables (LUTs) that fit within GPU SRAM for efficient decoding, (ii) a two-phase GPU kernel for coordinating thread read/write positions using lightweight auxiliary variables, and (iii) transformer-block-level decompression to minimize latency. Experiments on Llama 3.3, Qwen 3, Mistral 3, FLUX.1, and others validate our hypothesis that DFloat11 achieves around 30% model size reduction while preserving bit-for-bit identical outputs. Compared to a potential alternative of offloading parts of an uncompressed model to the CPU to meet memory constraints, DFloat11 achieves 2.3–46.2x higher throughput in token generation. With a fixed GPU memory budget, DFloat11 enables 5.7–14.9x longer generation lengths than uncompressed models. Notably, our method enables lossless inference of Llama 3.1 405B, an 810GB model, on a single node equipped with 8x80GB GPUs.
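A back-of-the-envelope check of the premise (illustrative only, not the DFloat11 pipeline): the 8-bit exponent field of BFloat16 weights is highly non-uniform, so a frequency-based Huffman code needs far fewer than 8 bits per exponent on average, which is roughly where a ~30% lossless saving can come from. The random-weight stand-in below is an assumption.

```python
# Estimate the entropy and average Huffman code length of BFloat16-style
# exponent bytes (extracted from float32 stand-in weights).
import heapq
import numpy as np

weights = (np.random.default_rng(0).standard_normal(1_000_000) * 0.02).astype(np.float32)
bits = weights.view(np.uint32)
exponents = ((bits >> 23) & 0xFF).astype(np.int64)    # 8-bit exponent field

counts = np.bincount(exponents, minlength=256)
probs = counts[counts > 0] / counts.sum()
entropy = -(probs * np.log2(probs)).sum()

# Build a Huffman tree over the observed exponent distribution.
heap = [(int(c), i, None, None) for i, c in enumerate(counts) if c > 0]
heapq.heapify(heap)
uid = 256
while len(heap) > 1:
    a = heapq.heappop(heap)
    b = heapq.heappop(heap)
    heapq.heappush(heap, (a[0] + b[0], uid, a, b))
    uid += 1

def total_bits(node, depth=0):
    """Sum of count * code length over all leaves."""
    count, _, left, right = node
    if left is None:                                  # leaf symbol
        return count * max(depth, 1)
    return total_bits(left, depth + 1) + total_bits(right, depth + 1)

avg_bits = total_bits(heap[0]) / counts.sum()
print(f"exponent entropy ≈ {entropy:.2f} bits, Huffman ≈ {avg_bits:.2f} bits (vs 8 bits stored)")
```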
[303] Reinforcement Learning from Human Feedback
Nathan Lambert
Main category: cs.LG
TL;DR: A gentle introduction to RLHF methods covering origins, core optimization stages, and advanced research topics for readers with a quantitative background.
Details
Motivation: RLHF has become crucial for deploying modern ML systems, but there's a need for accessible educational material that covers both technical foundations and broader interdisciplinary context for people with quantitative backgrounds.Method: Structured book format starting with RLHF origins in economics, philosophy, and optimal control, then covering definitions, problem formulation, data collection, and mathematical foundations, followed by detailed optimization stages from instruction tuning to reward models and alignment algorithms.
Result: A comprehensive educational resource that systematically introduces RLHF concepts, traces its interdisciplinary roots, details technical implementation stages, and identifies current research gaps and open questions in the field.
Conclusion: This book provides foundational understanding of RLHF while highlighting understudied research areas like synthetic data and evaluation, positioning itself as both an educational tool and a roadmap for future research directions in alignment.
Abstract: Reinforcement learning from human feedback (RLHF) has become an important technical and storytelling tool to deploy the latest machine learning systems. In this book, we hope to give a gentle introduction to the core methods for people with some level of quantitative background. The book starts with the origins of RLHF – both in recent literature and in a convergence of disparate fields of science in economics, philosophy, and optimal control. We then set the stage with definitions, problem formulation, data collection, and other common math used in the literature. The core of the book details every optimization stage in using RLHF, from starting with instruction tuning to training a reward model and finally all of rejection sampling, reinforcement learning, and direct alignment algorithms. The book concludes with advanced topics – understudied research questions in synthetic data and evaluation – and open questions for the field.
[304] Streaming Sliced Optimal Transport
Khai Nguyen
Main category: cs.LG
TL;DR: Proposes Stream-SW, the first streaming estimator for sliced Wasserstein distance that processes sample streams with low memory complexity while maintaining theoretical guarantees.
Details
Motivation: To enhance computational scalability of sliced optimal transport by enabling estimation from streaming samples with low memory requirements, overcoming limitations of existing methods that require storing entire datasets.Method: Develops streaming estimator for 1D Wasserstein distance using quantile approximation techniques for sample streams, then applies this to all projections to obtain streaming sliced Wasserstein (Stream-SW).
Result: Stream-SW achieves more accurate approximation than random subsampling with lower memory consumption on Gaussian distributions and mixtures. Demonstrates effectiveness in point cloud classification, gradient flows, and streaming change point detection.
Conclusion: Stream-SW provides a practical solution for scalable sliced Wasserstein estimation from streaming data with theoretical guarantees and superior performance compared to subsampling approaches.
Abstract: Sliced optimal transport (SOT), or sliced Wasserstein (SW) distance, is widely recognized for its statistical and computational scalability. In this work, we further enhance computational scalability by proposing the first method for estimating SW from sample streams, called \emph{streaming sliced Wasserstein} (Stream-SW). To define Stream-SW, we first introduce a streaming estimator of the one-dimensional Wasserstein distance (1DW). Since the 1DW has a closed-form expression, given by the absolute difference between the quantile functions of the compared distributions, we leverage quantile approximation techniques for sample streams to define a streaming 1DW estimator. By applying the streaming 1DW to all projections, we obtain Stream-SW. The key advantage of Stream-SW is its low memory complexity while providing theoretical guarantees on the approximation error. We demonstrate that Stream-SW achieves a more accurate approximation of SW than random subsampling, with lower memory consumption, when comparing Gaussian distributions and mixtures of Gaussians from streaming samples. Additionally, we conduct experiments on point cloud classification, point cloud gradient flows, and streaming change point detection to further highlight the favorable performance of the proposed Stream-SW.
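The sketch below follows the general recipe with assumed details (a Robbins-Monro quantile tracker rather than the paper's quantile-approximation scheme): project each incoming sample onto fixed random directions, track approximate quantiles of every projected stream online, and estimate SW as the average absolute difference between the two streams' quantile functions.

```python
# Streaming SW_1 sketch using stochastic quantile tracking (assumed details).
import numpy as np

rng = np.random.default_rng(0)
d, n_proj, n_quant = 5, 64, 32
directions = rng.standard_normal((n_proj, d))
directions /= np.linalg.norm(directions, axis=1, keepdims=True)
taus = (np.arange(n_quant) + 0.5) / n_quant              # tracked quantile levels

def update(tracker, x, step):
    """Robbins-Monro update of all per-projection quantile estimates with sample x."""
    proj = directions @ x                                  # (n_proj,)
    indicator = (proj[:, None] <= tracker).astype(float)
    tracker += step * (taus[None, :] - indicator)
    return tracker

q_mu = np.zeros((n_proj, n_quant))                         # quantiles of stream 1
q_nu = np.zeros((n_proj, n_quant))                         # quantiles of stream 2
for t in range(1, 20_001):                                 # two sample streams, one pass
    step = 5.0 / np.sqrt(t)
    q_mu = update(q_mu, rng.normal(0.0, 1.0, size=d), step)
    q_nu = update(q_nu, rng.normal(0.5, 1.0, size=d), step)

stream_sw = np.mean(np.abs(q_mu - q_nu))                   # SW_1 estimate from quantile gaps
print(f"Stream-SW estimate ≈ {stream_sw:.3f}")
```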
[305] Infinite-Width Limit of a Single Attention Layer: Analysis via Tensor Programs
Mana Sakai, Ryo Karakida, Masaaki Imaizumi
Main category: cs.LG
TL;DR: The paper identifies the exact non-Gaussian infinite-width limit distribution of attention layers under realistic scaling, departing from previous Gaussian approximations that failed to capture attention behavior.
Details
Motivation: Existing infinite-width theories using Gaussian approximations (neural network Gaussian processes, Tensor Programs) cannot properly capture attention layer behavior except under unrealistic conditions like infinite heads or special scaling. There's a need for a rigorous theory that works under standard architectural assumptions.Method: Using the Tensor Programs framework, the authors rigorously derive the infinite-width limit distribution of variables within a single attention layer under realistic architectural dimensionality and standard 1/√n scaling. They avoid infinite-head approximations or tailored scaling schemes.
Result: The derived limit law is fundamentally non-Gaussian, exhibiting hierarchical structure where the distribution is Gaussian conditional on random similarity scores. Numerical experiments validate the theory at finite width and show accurate description of finite-head attentions.
Conclusion: This work provides the first rigorous characterization of attention layer behavior in the infinite-width regime under realistic conditions, establishing a foundation for developing a unified theory of deep Transformer architectures.
Abstract: In modern theoretical analyses of neural networks, the infinite-width limit is often invoked to justify Gaussian approximations of neuron preactivations (e.g., via neural network Gaussian processes or Tensor Programs). However, these Gaussian-based asymptotic theories have so far been unable to capture the behavior of attention layers, except under special regimes such as infinitely many heads or tailored scaling schemes. In this paper, leveraging the Tensor Programs framework, we rigorously identify the infinite-width limit distribution of variables within a single attention layer under realistic architectural dimensionality and standard $1/\sqrt{n}$-scaling, where $n$ is the dimensionality. We derive the exact form of this limit law without resorting to infinite-head approximations or tailored scalings, demonstrating that it departs fundamentally from Gaussianity. This limiting distribution exhibits non-Gaussianity from a hierarchical structure, being Gaussian conditional on the random similarity scores. Numerical experiments validate our theoretical predictions, confirming the effectiveness of our theory at finite width and its accurate description of finite-head attention. Beyond characterizing a standalone attention layer, our findings lay the groundwork for developing a unified theory of deep Transformer architectures in the infinite-width regime.
[306] Fair Domain Generalization: An Information-Theoretic View
Tangzheng Lian, Guanyu Hu, Dimitrios Kollias, Xinyu Yang, Oya Celiktutan
Main category: cs.LG
TL;DR: This paper introduces Fair Domain Generalization (FairDG), a framework that combines domain generalization with algorithmic fairness to ensure both good performance and fairness in unseen target domains.
Details
Motivation: Current domain generalization methods focus only on minimizing expected risk without considering fairness, while fairness methods don't account for domain shifts, meaning fairness achieved during training may not generalize to unseen test domains.Method: The authors derive mutual information-based upper bounds for expected risk and fairness violations, then introduce PAFDG (Pareto-Optimal Fairness for Domain Generalization), a practical framework that models the utility-fairness trade-off through Pareto optimization.
Result: Experiments on real-world vision and language datasets show that PAFDG achieves superior utility-fairness trade-offs compared to existing methods.
Conclusion: The work successfully bridges the gap between domain generalization and algorithmic fairness, providing both theoretical bounds and a practical framework for fair domain generalization.
Abstract: Domain generalization (DG) and algorithmic fairness are two critical challenges in machine learning. However, most DG methods focus only on minimizing expected risk in the unseen target domain without considering algorithmic fairness. Conversely, fairness methods typically do not account for domain shifts, so the fairness achieved during training may not generalize to unseen test domains. In this work, we bridge these gaps by studying the problem of Fair Domain Generalization (FairDG), which aims to minimize both expected risk and fairness violations in unseen target domains. We derive novel mutual information-based upper bounds for expected risk and fairness violations in multi-class classification tasks with multi-group sensitive attributes. These bounds provide key insights for algorithm design from an information-theoretic perspective. Guided by these insights, we introduce PAFDG (Pareto-Optimal Fairness for Domain Generalization), a practical framework that solves the FairDG problem and models the utility-fairness trade-off through Pareto optimization. Experiments on real-world vision and language datasets show that PAFDG achieves superior utility-fairness trade-offs compared to existing methods.
[307] Episodic Contextual Bandits with Knapsacks under Conversion Models
Wang Chi Cheung, Zitian Li
Main category: cs.LG
TL;DR: Online algorithm for contextual bandits with knapsacks in repeated episodes with non-stationary contexts and varying resource amounts, achieving sublinear regret using confidence bound oracle.
Details
Motivation: Address applications like dynamic pricing of perishable resources with episodic replenishment and first price auctions with varying budgets, where contexts are non-stationary within episodes and resource amounts differ across episodes.Method: Design online algorithm that achieves sublinear regret in number of episodes, assuming access to a confidence bound oracle from existing contextual bandit literature. Overcomes challenge of unbounded state space from arbitrarily many contexts.
Result: Achieves sublinear regret in T (number of episodes). Provides improved regret bounds when decision maker has access to unlabeled feature data, which is novel to contextual BwK literature.
Conclusion: Proposes framework for episodic contextual bandits with knapsacks with non-stationary contexts, achieving sublinear regret and offering improved bounds with unlabeled data, applicable to real-world resource allocation problems.
Abstract: We study an online setting, where a decision maker (DM) interacts with contextual bandit-with-knapsack (BwK) instances in repeated episodes. These episodes start with different resource amounts, and the contexts’ probability distributions are non-stationary in an episode. All episodes share the same latent conversion model, which governs the random outcome contingent upon a request’s context and an allocation decision. Our model captures applications such as dynamic pricing on perishable resources with episodic replenishment, and first price auctions in repeated episodes with different starting budgets. We design an online algorithm that achieves a regret sub-linear in $T$, the number of episodes, assuming access to a \emph{confidence bound oracle} that achieves an $o(T)$-regret. Such an oracle is readily available from existing contextual bandit literature. We overcome the technical challenge with arbitrarily many possible contexts, which leads to a reinforcement learning problem with an unbounded state space. Our framework provides improved regret bounds in certain settings when the DM is provided with unlabeled feature data, which is novel to the contextual BwK literature.
[308] Weighted Conditional Flow Matching
Sergio Calvo-Ordonez, Matthieu Meunier, Alvaro Cartea, Christoph Reisinger, Yarin Gal, Jose Miguel Hernandez-Lobato
Main category: cs.LG
TL;DR: W-CFM improves conditional flow matching by weighting training pairs with a Gibbs kernel, achieving straighter trajectories without expensive optimal transport computations, matching minibatch OT performance at large-batch limit while maintaining vanilla CFM efficiency.
Details
Motivation: Standard conditional flow matching produces curved trajectories requiring fine discretization for accurate generation. Recent methods use expensive mini-batch optimal transport for straighter paths. Need efficient alternative that achieves similar benefits without computational overhead.Method: Weighted Conditional Flow Matching (W-CFM) modifies classical CFM loss by weighting each training pair (x,y) with a Gibbs kernel. This weighting recovers entropic optimal transport coupling with minimal marginal bias. Method maintains computational efficiency of vanilla CFM while achieving benefits of optimal transport approaches.
Result: W-CFM achieves comparable or superior sample quality, fidelity, and diversity to alternative baselines on synthetic and real datasets for unconditional generation. Theoretically shown to be equivalent to minibatch OT in large-batch limit, overcoming computational bottlenecks related to batch size.
Conclusion: W-CFM provides an efficient alternative to mini-batch OT methods for improving conditional flow matching, achieving straighter trajectories and better generation performance while maintaining the computational efficiency of vanilla CFM.
Abstract: Conditional flow matching (CFM) has emerged as a powerful framework for training continuous normalizing flows due to its computational efficiency and effectiveness. However, standard CFM often produces paths that deviate significantly from straight-line interpolations between prior and target distributions, making generation slower and less accurate due to the need for fine discretization at inference. Recent methods enhance CFM performance by inducing shorter and straighter trajectories but typically rely on computationally expensive mini-batch optimal transport (OT). Drawing insights from entropic optimal transport (EOT), we propose Weighted Conditional Flow Matching (W-CFM), a novel approach that modifies the classical CFM loss by weighting each training pair $(x, y)$ with a Gibbs kernel. We show that this weighting recovers the entropic OT coupling up to some bias in the marginals, and we provide the conditions under which the marginals remain nearly unchanged. Moreover, we establish an equivalence between W-CFM and the minibatch OT method in the large-batch limit, showing how our method overcomes computational and performance bottlenecks linked to batch size. Empirically, we test our method on unconditional generation on various synthetic and real datasets, confirming that W-CFM achieves comparable or superior sample quality, fidelity, and diversity to other alternative baselines while maintaining the computational efficiency of vanilla CFM.
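The core change over vanilla CFM is a per-pair weight. A minimal sketch, assuming vector-valued data, a squared-exponential Gibbs kernel, and batch-mean normalization (none of which are stated to be the paper's exact choices):

```python
import torch

def w_cfm_loss(velocity_net, x1, eps=1.0):
    """x1: (batch, dim) data samples; velocity_net(x_t, t) predicts the flow field."""
    x0 = torch.randn_like(x1)                           # prior samples
    t = torch.rand(x1.shape[0], 1)                      # independent times in [0, 1]
    xt = (1 - t) * x0 + t * x1                          # linear interpolant
    target = x1 - x0                                    # conditional velocity
    sq_dist = (x1 - x0).pow(2).sum(dim=1)
    w = torch.exp(-sq_dist / eps)                       # Gibbs-kernel pair weight
    w = w / w.mean().clamp_min(1e-8)                    # keep the loss scale comparable
    per_pair = (velocity_net(xt, t) - target).pow(2).mean(dim=1)
    return (w * per_pair).mean()
```

In this sketch, letting `eps` grow makes the weights uniform and the loss reduces to standard CFM.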
[309] uGMM-NN: Univariate Gaussian Mixture Model Neural Network
Zakeria Sharif Ali
Main category: cs.LG
TL;DR: uGMM-NN embeds Gaussian mixture models into neural network neurons, enabling probabilistic reasoning and uncertainty capture while maintaining feed-forward scalability.
Details
Motivation: Traditional neural networks use deterministic weighted sums with fixed non-linearities, lacking probabilistic reasoning and uncertainty modeling at the neuron level. The authors aim to integrate probabilistic reasoning directly into neural network units to capture multimodality and uncertainty while maintaining scalability.Method: Each neuron in uGMM-NN parameterizes activations as a univariate Gaussian mixture with learnable means, variances, and mixing coefficients, replacing traditional weighted-sum operations. This allows neurons to capture multimodal distributions and uncertainty while preserving the feed-forward architecture of standard networks.
Result: uGMM-NN achieves competitive discriminative performance compared to conventional multilayer perceptrons while providing probabilistic interpretations of activations. The framework enables uncertainty-aware components in neural architectures.
Conclusion: The uGMM-NN architecture successfully integrates probabilistic reasoning into neural networks, offering a foundation for uncertainty-aware neural components and opening new directions for both discriminative and generative modeling.
Abstract: This paper introduces the Univariate Gaussian Mixture Model Neural Network (uGMM-NN), a novel neural architecture that embeds probabilistic reasoning directly into the computational units of deep networks. Unlike traditional neurons, which apply weighted sums followed by fixed non-linearities, each uGMM-NN node parameterizes its activations as a univariate Gaussian mixture, with learnable means, variances, and mixing coefficients. This design enables richer representations by capturing multimodality and uncertainty at the level of individual neurons, while retaining the scalability of standard feed-forward networks. We demonstrate that uGMM-NN can achieve competitive discriminative performance compared to conventional multilayer perceptrons, while additionally offering a probabilistic interpretation of activations. The proposed framework provides a foundation for integrating uncertainty-aware components into modern neural architectures, opening new directions for both discriminative and generative modeling.
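The abstract does not spell out the exact node definition, so the following is only one plausible reading: each output unit scores a linear projection of its input under its own learnable univariate Gaussian mixture (means, variances, mixing weights) and emits the mixture log-density. Treat the layer structure, the projection, and the mixture size `k` as assumptions rather than the paper's formulation.

```python
import math
import torch
import torch.nn as nn

class UGMMLayer(nn.Module):
    """Each of the out_dim units holds a k-component univariate Gaussian mixture."""
    def __init__(self, in_dim, out_dim, k=3):
        super().__init__()
        self.proj = nn.Linear(in_dim, out_dim)
        self.means = nn.Parameter(torch.randn(out_dim, k))
        self.log_vars = nn.Parameter(torch.zeros(out_dim, k))
        self.mix_logits = nn.Parameter(torch.zeros(out_dim, k))

    def forward(self, x):
        z = self.proj(x).unsqueeze(-1)                       # (B, out_dim, 1)
        var = self.log_vars.exp()
        log_norm = -0.5 * ((z - self.means) ** 2 / var + self.log_vars
                           + math.log(2 * math.pi))          # per-component log-density
        log_mix = torch.log_softmax(self.mix_logits, dim=-1)
        return torch.logsumexp(log_norm + log_mix, dim=-1)   # (B, out_dim) mixture log-density

# Usage: stack like ordinary layers, e.g. nn.Sequential(UGMMLayer(784, 128), UGMMLayer(128, 10))
```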
[310] Homogenization with Guaranteed Bounds via Primal-Dual Physically Informed Neural Networks
Liya Gaynutdinova, Martin Doškář, Ondřej Rokoš, Ivana Pultarová
Main category: cs.LG
TL;DR: Dual formulation PINNs improve reliability for homogenization of periodic thermo-conductive composites with discontinuous coefficients, providing guaranteed error bounds and better failure detection.
Details
Motivation: Standard PINNs often fail when applied to materials with discontinuous coefficients (piecewise constant properties) in multiscale modeling, lacking clear diagnostics for failure.Method: Introduces dual formulation for PINN framework for both strong and variational (weak) formulations. Compares standard PINNs with smoothed approximations vs. variational PINNs (VPINNs) using spectral and neural network-based test functions.
Result: Strong-form PINNs may outperform VPINNs in controlled settings but are sensitive to material discontinuities and may fail without clear diagnostics. VPINNs accommodate piecewise constant parameters directly but require careful test function selection. Dual formulation provides reliable convergence quality indicators.
Conclusion: Dual formulation enhances PINN applicability to homogenization problems in micromechanics by providing guaranteed error bounds and robust failure detection, improving reliability for materials with discontinuous coefficients.
Abstract: Physics-informed neural networks (PINNs) have shown promise in solving partial differential equations (PDEs) relevant to multiscale modeling, but they often fail when applied to materials with discontinuous coefficients, such as media with piecewise constant properties. This paper introduces a dual formulation for the PINN framework to improve the reliability of the homogenization of periodic thermo-conductive composites, for both strong and variational (weak) formulations. The dual approach facilitates the derivation of guaranteed upper and lower error bounds, enabling more robust detection of PINN failure. We compare standard PINNs applied to smoothed material approximations with variational PINNs (VPINNs) using both spectral and neural network-based test functions. Our results indicate that while strong-form PINNs may outperform VPINNs in controlled settings, they are sensitive to material discontinuities and may fail without clear diagnostics. In contrast, VPINNs accommodate piecewise constant material parameters directly but require careful selection of test functions to avoid instability. Dual formulation serves as a reliable indicator of convergence quality, and its integration into PINN frameworks enhances their applicability to homogenization problems in micromechanics.
[311] From Autoencoders to CycleGAN: Robust Unpaired Face Manipulation via Adversarial Learning
Collin Guo, Yi Qian
Main category: cs.LG
TL;DR: The paper proposes a guided CycleGAN framework with spectral normalization and identity-preserving losses for unpaired face manipulation, outperforming autoencoder baselines in realism, perceptual quality, and identity preservation without requiring paired datasets.
Details
Motivation: There's growing demand for realistic, identity-preserving face synthesis/manipulation, but often only unpaired, unaligned datasets are available. Autoencoders capture coarse identity but miss fine details, creating a need for more robust unpaired face manipulation methods.Method: Uses adversarial learning with a guided CycleGAN framework incorporating: 1) spectral normalization for stable training, 2) identity- and perceptual-guided losses to preserve subject identity and high-level structure, 3) landmark-weighted cycle constraints to maintain facial geometry across pose/illumination changes.
Result: The adversarially trained CycleGAN improves over autoencoders in realism (FID), perceptual quality (LPIPS), and identity preservation (ID-Sim), with competitive cycle-reconstruction SSIM and practical inference times. Achieves high quality without paired datasets and approaches pix2pix performance on curated paired subsets.
Conclusion: Guided, spectrally normalized CycleGANs provide a practical path from autoencoders to robust unpaired face manipulation, demonstrating effective identity preservation and realism without requiring paired training data.
Abstract: Human face synthesis and manipulation are increasingly important in entertainment and AI, with a growing demand for highly realistic, identity-preserving images even when only unpaired, unaligned datasets are available. We study unpaired face manipulation via adversarial learning, moving from autoencoder baselines to a robust, guided CycleGAN framework. While autoencoders capture coarse identity, they often miss fine details. Our approach integrates spectral normalization for stable training, identity- and perceptual-guided losses to preserve subject identity and high-level structure, and landmark-weighted cycle constraints to maintain facial geometry across pose and illumination changes. Experiments show that our adversarially trained CycleGAN improves realism (FID), perceptual quality (LPIPS), and identity preservation (ID-Sim) over autoencoders, with competitive cycle-reconstruction SSIM and practical inference times, achieving high quality without paired datasets and approaching pix2pix on curated paired subsets. These results demonstrate that guided, spectrally normalized CycleGANs provide a practical path from autoencoders to robust unpaired face manipulation.
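A hedged sketch of how the guidance terms described above could be combined into a single generator objective; the loss weights, the landmark weight map, and the feature extractor `feat_net` are placeholders rather than the paper's settings. Spectral normalization would be applied to the discriminator's layers (e.g., via `torch.nn.utils.spectral_norm`).

```python
import torch
import torch.nn.functional as F

def guided_generator_loss(G, F_inv, D_tgt, feat_net, x, landmark_w,
                          w_cyc=10.0, w_id=5.0, w_perc=1.0):
    """x: source-domain batch; landmark_w: per-pixel weights emphasizing facial landmarks."""
    fake = G(x)
    logits = D_tgt(fake)
    adv = F.binary_cross_entropy_with_logits(logits, torch.ones_like(logits))
    cyc = (landmark_w * (F_inv(fake) - x).abs()).mean()    # landmark-weighted cycle term
    idn = F.l1_loss(fake, x)                               # identity-preserving guidance
    perc = F.l1_loss(feat_net(fake), feat_net(x))          # perceptual (feature-space) guidance
    return adv + w_cyc * cyc + w_id * idn + w_perc * perc
```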
[312] Personalized Federated Heat-Kernel Enhanced Multi-View Clustering via Advanced Tensor Decomposition Techniques
Kristina P. Sinaga
Main category: cs.LG
TL;DR: This paper proposes novel federated multi-view clustering frameworks using quantum-inspired heat-kernel metrics and tensor decomposition methods (PARAFAC2/Tucker) to address data heterogeneity and privacy concerns.
Details
Motivation: The paper addresses challenges in multi-view clustering within federated learning environments, specifically dealing with data heterogeneity across different views/sources while maintaining privacy and efficient communication between distributed parties.Method: The authors develop mathematical frameworks using heat-kernel coefficients as quantum-inspired distance metrics to replace conventional measures. They employ advanced tensor decomposition methods (PARAFAC2 and Tucker decomposition) to represent high-dimensional multi-view data while preserving inter-view relationships. Four novel algorithms are proposed: E-FKMVC, FedHK-PARAFAC2, FedHK-Tucker, and FedHK-MVC-Person (personalized version).
Result: The research yields four novel federated clustering algorithms with theoretical guarantees. The paper provides convergence analysis, privacy bounds, and complexity analysis to validate the effectiveness of the proposed methods in enhancing clustering efficacy while ensuring confidentiality and efficient communication.
Conclusion: This work makes significant contributions to federated multi-view clustering through innovative integration of mathematical modeling and algorithm design, addressing critical challenges of data heterogeneity and privacy concerns, enabling enhanced data management and analytics in distributed environments.
Abstract: This paper introduces mathematical frameworks that address the challenges of multi-view clustering in federated learning environments. The objective is to integrate optimization techniques based on new objective functions employing heat-kernel coefficients to replace conventional distance metrics with quantum-inspired measures. The proposed frameworks utilize advanced tensor decomposition methods, specifically, PARAFAC2 and Tucker decomposition to efficiently represent high-dimensional, multi-view data while preserving inter-view relationships. The research has yielded the development of four novel algorithms, an efficient federated kernel multi-view clustering (E-FKMVC) model, FedHK-PARAFAC2, FedHK-Tucker, and FedHK-MVC-Person with PARAFAC2 Decomposition (Personalized FedHK-PARAFAC2). The primary objective of these algorithms is to enhance the efficacy of clustering processes while ensuring the confidentiality and efficient communication in federated learning environments. Theoretical analyses of convergence guarantees, privacy bounds, and complexity are provided to validate the effectiveness of the proposed methods. In essence, this paper makes a significant academic contribution to the field of federated multi-view clustering through its innovative integration of mathematical modeling and algorithm design. This approach addresses the critical challenges of data heterogeneity and privacy concerns, paving the way for enhanced data management and analytics in various contexts.
[313] Machine Learnability as a Measure of Order in Aperiodic Sequences
Jennifer Dodgson, Michael Joedhitya, Adith Ramdas, Surender Suresh Kumar, Adarsh Singh Chauhan, Akira Rafhael, Wang Mingshu, Nordine Lotfi
Main category: cs.LG
TL;DR: Machine learning models trained on Ulam spiral images show better performance at higher number regions (around 500m) than lower regions (below 25m), suggesting more learnable order exists at larger scales, with potential applications in number theory and cryptography.
Details
Motivation: Prime numbers exhibit both deterministic definition and statistical randomness. The researchers aim to use machine learning as an experimental tool to investigate whether different regions of the Ulam spiral show varying degrees of regularity in prime distributions, potentially revealing patterns that align with number theory conjectures.Method: Used image-focused machine learning models trained on blocks extracted from different regions of the Ulam spiral. Specifically compared models trained on regions around 500 million integers versus those below 25 million integers. Analyzed accuracy, precision, and recall scores to understand classification approaches in different spiral regions.
Result: Models trained on higher regions (around 500m) outperformed those trained on lower regions (below 25m) in pure accuracy terms. Precision and recall analysis revealed different classification strategies: models focused more on identifying prime patterns at lower numbers and more on eliminating composites at higher numbers. This suggests more easily learnable order exists at higher magnitudes.
Conclusion: Machine learning can serve as a new experimental instrument for number theory, revealing that prime number distributions show diminishing noise and more learnable patterns at higher orders of magnitude. The method has potential for investigating strong and weak prime patterns for cryptographic applications, supporting number theory conjectures about the regularization of local randomness after scaling.
Abstract: Research on the distribution of prime numbers has revealed a dual character: deterministic in definition yet exhibiting statistical behavior reminiscent of random processes. In this paper we show that it is possible to use an image-focused machine learning model to measure the comparative regularity of prime number fields at specific regions of an Ulam spiral. Specifically, we demonstrate that in pure accuracy terms, models trained on blocks extracted from regions of the spiral in the vicinity of 500m outperform models trained on blocks extracted from the region representing integers lower than 25m. This implies the existence of more easily learnable order in the former region than in the latter. Moreover, a detailed breakdown of precision and recall scores seems to imply that the model is favouring a different approach to classification in different regions of the spiral, focusing more on identifying prime patterns at lower numbers and more on eliminating composites at higher numbers. This aligns with number theory conjectures suggesting that at higher orders of magnitude we should see diminishing noise in prime number distributions, with averages (density, AP equidistribution) coming to dominate, while local randomness regularises after scaling by log x. Taken together, these findings point toward an interesting possibility: that machine learning can serve as a new experimental instrument for number theory. Notably, the method shows potential for investigating the patterns in strong and weak primes for cryptographic purposes.
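For intuition, a small self-contained sketch of how Ulam-spiral blocks from a chosen integer region might be rasterized into binary prime/composite images for an image model; the block size, starting offsets, and labeling are assumptions, not the paper's exact extraction procedure.

```python
import numpy as np

def is_prime(n):
    if n < 2:
        return False
    if n % 2 == 0:
        return n == 2
    f = 3
    while f * f <= n:
        if n % f == 0:
            return False
        f += 2
    return True

def ulam_patch(size=32, start=1):
    """size x size binary image of an Ulam spiral whose center cell holds integer `start`."""
    grid = np.zeros((size, size), dtype=np.uint8)
    x = y = size // 2
    directions = [(1, 0), (0, -1), (-1, 0), (0, 1)]     # right, up, left, down (counterclockwise)
    n, d, run = start, 0, 1
    while n < start + 2 * size * size:                   # walk enough steps to cover the patch
        for _ in range(2):                               # each run length is used twice
            for _ in range(run):
                if 0 <= x < size and 0 <= y < size:
                    grid[y, x] = is_prime(n)
                x += directions[d][0]
                y += directions[d][1]
                n += 1
            d = (d + 1) % 4
        run += 1
    return grid
```

For example, `ulam_patch(32, start=500_000_000)` produces a block from the high region discussed above; such patches can be fed to an ordinary image classifier.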
[314] Collaborative Device-Cloud LLM Inference through Reinforcement Learning
Wenzhi Fang, Dong-Jun Han, Liangqi Yuan, Christopher Brinton
Main category: cs.LG
TL;DR: A framework where on-device LLMs make routing decisions after solving, trained via reward maximization to optimize when to offload to cloud LLMs.
Details
Motivation: Existing device-cloud collaboration approaches use external binary classifiers that struggle to assess task difficulty from prompt patterns, leading to suboptimal routing decisions between local and cloud LLM processing.Method: On-device LLM makes routing decisions at end of solving process, trained via reward maximization with carefully designed rewards. Uses group-adaptive policy gradient algorithm with group-level policy gradient for unbiased gradient estimation and adaptive prompt filtering to enforce cloud usage constraints.
Result: Extensive experiments across models and benchmarks show consistent outperformance over existing baselines and significant narrowing of the gap to full cloud LLM performance.
Conclusion: The proposed framework enables more intelligent device-cloud collaboration by having on-device LLMs make routing decisions through post-training, achieving better performance than external router approaches while managing cloud usage constraints.
Abstract: Device-cloud collaboration has emerged as a promising paradigm for deploying large language models (LLMs), combining the efficiency of lightweight on-device inference with the superior performance of powerful cloud LLMs. An essential problem in this scenario lies in deciding whether a given query is best handled locally or delegated to the cloud. Existing approaches typically rely on external routers, implemented as binary classifiers, which often struggle to determine task difficulty from the prompt’s surface pattern. To address these limitations, we propose a framework where the on-device LLM makes routing decisions at the end of its solving process, with this capability instilled through post-training. In particular, we formulate a reward maximization problem with carefully designed rewards that encourage effective problem solving and judicious offloading to the cloud. To solve this problem, we develop a group-adaptive policy gradient algorithm, featuring a group-level policy gradient, designed to yield an unbiased gradient estimator of the reward, and adaptive prompt filtering, developed to enforce the constraint on cloud LLM usage. Extensive experiments across models and benchmarks show that the proposed methodology consistently outperforms existing baselines and significantly narrows the gap to full cloud LLM performance.
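A hedged sketch of the kind of reward shaping and group-level baseline such a formulation implies: reward correct answers, charge a cost for deferring to the cloud, and compute advantages relative to the group of responses sampled for the same prompt. The exact reward values, prompt-filtering rule, and constraint handling in the paper are not reproduced.

```python
def routing_reward(local_correct: bool, deferred: bool, cloud_correct: bool,
                   cloud_cost: float = 0.3) -> float:
    """Reward seen by the on-device policy for one query (all values are illustrative)."""
    if deferred:
        return (1.0 if cloud_correct else 0.0) - cloud_cost   # cloud usage carries a cost
    return 1.0 if local_correct else -0.5                      # penalize confident wrong local answers

def group_advantages(rewards):
    """Group-level baseline: advantage of each sampled response relative to its prompt group."""
    mean = sum(rewards) / len(rewards)
    return [r - mean for r in rewards]
```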
[315] Clustering by Denoising: Latent plug-and-play diffusion for single-cell data
Dominik Meier, Shixing Yu, Sagnik Nandy, Promit Ghosal, Kyra Gan
Main category: cs.LG
TL;DR: A novel diffusion framework separates observation and denoising spaces using Gibbs sampling with input-space steering to improve scRNA-seq clustering by handling noise adaptively, quantifying uncertainty, and generalizing denoising across datasets.
Details
Motivation: scRNA-seq clustering accuracy suffers from measurement noise and biological variability, causing different cell types to project close together in standard latent spaces like PCA, making accurate clustering difficult.Method: Latent plug-and-play diffusion framework with Gibbs sampling: learned diffusion prior applied in low-dimensional latent space for denoising, while noise is reintroduced into original high-dimensional observation space for “input-space steering” to maintain data fidelity.
Result: Improves clustering accuracy on synthetic data across varied noise levels and dataset shifts; on real single-cell data, shows improved biological coherence with cluster boundaries better aligned with known cell type markers and developmental trajectories.
Conclusion: The framework offers adaptive noise handling, principled uncertainty quantification, and generalizable denoising that can leverage clean reference data to improve noisier datasets beyond training set limitations.
Abstract: Single-cell RNA sequencing (scRNA-seq) enables the study of cellular heterogeneity. Yet, clustering accuracy, and with it downstream analyses based on cell labels, remain challenging due to measurement noise and biological variability. In standard latent spaces (e.g., obtained through PCA), data from different cell types can be projected close together, making accurate clustering difficult. We introduce a latent plug-and-play diffusion framework that separates the observation and denoising space. This separation is operationalized through a novel Gibbs sampling procedure: the learned diffusion prior is applied in a low-dimensional latent space to perform denoising, while to steer this process, noise is reintroduced into the original high-dimensional observation space. This unique “input-space steering” ensures the denoising trajectory remains faithful to the original data structure. Our approach offers three key advantages: (1) adaptive noise handling via a tunable balance between prior and observed data; (2) uncertainty quantification through principled uncertainty estimates for downstream analysis; and (3) generalizable denoising by leveraging clean reference data to denoise noisier datasets, and via averaging, improve quality beyond the training set. We evaluate robustness on both synthetic and real single-cell genomics data. Our method improves clustering accuracy on synthetic data across varied noise levels and dataset shifts. On real-world single-cell data, our method demonstrates improved biological coherence in the resulting cell clusters, with cluster boundaries that better align with known cell type markers and developmental trajectories.
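A schematic, hypothetical rendering of the alternation described above: a denoising move under a latent prior followed by an input-space steering move that blends the reconstruction with the observation and re-injects noise. `encode`, `decode`, `latent_denoise`, and the blend/noise parameters are placeholders, not the paper's sampler.

```python
import numpy as np

def gibbs_denoise(x_obs, encode, decode, latent_denoise,
                  n_iter=50, sigma=0.1, alpha=0.5, rng=None):
    """alpha balances the learned prior against fidelity to the observed cells."""
    rng = rng or np.random.default_rng(0)
    z = encode(x_obs)                                            # start from the noisy observation
    for _ in range(n_iter):
        z = latent_denoise(z)                                    # denoising move in latent space
        x_mix = alpha * decode(z) + (1 - alpha) * x_obs          # pull back toward the data
        x_noisy = x_mix + sigma * rng.standard_normal(x_obs.shape)  # re-inject observation-space noise
        z = encode(x_noisy)                                      # return to the latent space
    return decode(z)
```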
[316] Feature-Modulated UFNO for Improved Prediction of Multiphase Flow in Porous Media
Alhasan Abdellatif, Hannah P. Menke, Florian Doster, Kamaljit Singh, Ahmed H. Elsheikh
Main category: cs.LG
TL;DR: UFNO-FiLM enhances UFNO by decoupling scalar inputs using FiLM layers and adding spatially weighted loss, achieving 21% MAE reduction in subsurface flow predictions.
Details
Motivation: UFNO inefficiently treats scalar inputs as spatial fields, creating redundant constant signals in frequency domain, and its standard loss doesn't account for spatial error sensitivity variations.Method: Two key innovations: 1) Feature-wise Linear Modulation (FiLM) layer to decouple scalar inputs from spatial features, avoiding constant signals in Fourier transform; 2) Spatially weighted loss function prioritizing critical regions.
Result: 21% reduction in gas saturation Mean Absolute Error (MAE) compared to UFNO in subsurface multiphase flow experiments.
Conclusion: UFNO-FiLM effectively improves predictive accuracy by addressing UFNO’s inefficiencies in handling scalar inputs and spatial error sensitivity.
Abstract: The UNet-enhanced Fourier Neural Operator (UFNO) extends the Fourier Neural Operator (FNO) by incorporating a parallel UNet pathway, enabling the retention of both high- and low-frequency components. While UFNO improves predictive accuracy over FNO, it inefficiently treats scalar inputs (e.g., temperature, injection rate) as spatially distributed fields by duplicating their values across the domain. This forces the model to process redundant constant signals within the frequency domain. Additionally, its standard loss function does not account for spatial variations in error sensitivity, limiting performance in regions of high physical importance. We introduce UFNO-FiLM, an enhanced architecture that incorporates two key innovations. First, we decouple scalar inputs from spatial features using a Feature-wise Linear Modulation (FiLM) layer, allowing the model to modulate spatial feature maps without introducing constant signals into the Fourier transform. Second, we employ a spatially weighted loss function that prioritizes learning in critical regions. Our experiments on subsurface multiphase flow demonstrate a 21% reduction in gas saturation Mean Absolute Error (MAE) compared to UFNO, highlighting the effectiveness of our approach in improving predictive accuracy.
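The two modifications are straightforward to sketch in PyTorch: a FiLM block turns scalar inputs into per-channel scale and shift applied to spatial feature maps (so no constant fields enter the Fourier transform), and the training loss is weighted by a spatial importance map. Shapes, the hidden width, and where the block sits inside UFNO are assumptions.

```python
import torch
import torch.nn as nn

class FiLM(nn.Module):
    """Map scalar inputs (e.g., temperature, injection rate) to per-channel scale/shift."""
    def __init__(self, n_scalars, n_channels, hidden=64):
        super().__init__()
        self.to_gamma_beta = nn.Sequential(
            nn.Linear(n_scalars, hidden), nn.GELU(), nn.Linear(hidden, 2 * n_channels))

    def forward(self, feats, scalars):
        # feats: (B, C, H, W); scalars: (B, n_scalars)
        gamma, beta = self.to_gamma_beta(scalars).chunk(2, dim=-1)
        return feats * (1 + gamma[:, :, None, None]) + beta[:, :, None, None]

def weighted_mae(pred, target, weight_map):
    """weight_map: (H, W) or broadcastable; emphasizes physically critical regions."""
    return (weight_map * (pred - target).abs()).mean()
```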
[317] Decomposing Uncertainty in Probabilistic Knowledge Graph Embeddings: Why Entity Variance Is Not Enough
Chorok Lee
Main category: cs.LG
TL;DR: Probabilistic KG embeddings have relation-agnostic uncertainty that fails to distinguish between emerging entities vs. novel relational contexts. The paper proves this limitation, decomposes uncertainty into semantic (entity variance) and structural (entity-relation co-occurrence) components, and proposes CAGP method that combines both for superior OOD detection.
Details
Motivation: Existing probabilistic knowledge graph embeddings use entity-level variances to quantify uncertainty, but these are relation-agnostic - entities get identical uncertainty regardless of relational context. This conflates two distinct OOD phenomena: emerging entities (rare/poorly-learned) and novel relational contexts (familiar entities in unobserved relationships), leading to poor performance on temporal distribution shift.Method: The paper formalizes uncertainty decomposition into: 1) semantic uncertainty from entity embedding variance (detects emerging entities), and 2) structural uncertainty from entity-relation co-occurrence (detects novel contexts). The proposed CAGP method combines these complementary signals via learned weights. Theoretically proves these signals are non-redundant and any convex combination dominates either alone.
Result: Empirical validation shows 100% of novel-context triples have frequency-matched in-distribution counterparts. CAGP achieves 0.94-0.99 AUROC on temporal OOD detection across benchmarks (FB15k-237, WN18RR, YAGO3-10), representing 60-80% relative improvement over relation-agnostic baselines. On selective prediction, reduces errors by 43% at 85% answer rate.
Conclusion: Relation-agnostic uncertainty in probabilistic KG embeddings fundamentally limits OOD detection. Decomposing uncertainty into semantic and structural components addresses this limitation. The combination of both signals significantly improves performance on temporal distribution shift and selective prediction tasks.
Abstract: Probabilistic knowledge graph embeddings represent entities as distributions, using learned variances to quantify epistemic uncertainty. We identify a fundamental limitation: these variances are relation-agnostic, meaning an entity receives identical uncertainty regardless of relational context. This conflates two distinct out-of-distribution phenomena that behave oppositely: emerging entities (rare, poorly-learned) and novel relational contexts (familiar entities in unobserved relationships). We prove an impossibility result: any uncertainty estimator using only entity-level statistics independent of relation context achieves near-random OOD detection on novel contexts. We empirically validate this on three datasets, finding 100 percent of novel-context triples have frequency-matched in-distribution counterparts. This explains why existing probabilistic methods achieve 0.99 AUROC on random corruptions but only 0.52-0.64 on temporal distribution shift. We formalize uncertainty decomposition into complementary components: semantic uncertainty from entity embedding variance (detecting emerging entities) and structural uncertainty from entity-relation co-occurrence (detecting novel contexts). Our main theoretical result proves these signals are non-redundant, and that any convex combination strictly dominates either signal alone. Our method (CAGP) combines semantic and structural uncertainty via learned weights, achieving 0.94-0.99 AUROC on temporal OOD detection across multiple benchmarks, a 60-80 percent relative improvement over relation-agnostic baselines. Empirical validation confirms complete frequency overlap on three datasets (FB15k-237, WN18RR, YAGO3-10). On selective prediction, our method reduces errors by 43 percent at 85 percent answer rate.
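A minimal sketch of the decomposition-and-combination idea, assuming the semantic term is the learned entity variance and the structural term is a smoothed negative log co-occurrence frequency of the (entity, relation) pair; CAGP learns the mixing weight, whereas here it is fixed for illustration.

```python
import math

def structural_uncertainty(entity, relation, cooccur_counts, n_total):
    """Rarer (entity, relation) pairs get higher uncertainty (add-one smoothing)."""
    c = cooccur_counts.get((entity, relation), 0)
    return -math.log((c + 1) / (n_total + 1))

def combined_uncertainty(semantic_var, structural_u, w=0.5):
    """Convex mix of the two signals; per the paper, any convex mix dominates either alone."""
    return w * semantic_var + (1 - w) * structural_u
```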
[318] Causality-Inspired Safe Residual Correction for Multivariate Time Series
Jianxiang Xie, Yuncheng Hua, Mingyue Cheng, Flora Salim, Hao Xue
Main category: cs.LG
TL;DR: CRC is a causality-inspired safe residual correction framework that ensures non-degradation in multivariate forecasting by using direction-aware structure decoupling and strict safety mechanisms.
Details
Motivation: Modern multivariate forecasters (Transformers, GNNs) suffer from systematic errors at specific variables/horizons and lack guarantees against performance degradation in deployment. Existing post-hoc correction methods are greedy and can overcorrect reliable predictions, causing local failures.Method: CRC uses a causality-inspired encoder to expose direction-aware structure by decoupling self- and cross-variable dynamics, plus a hybrid corrector to model residual errors. Correction is governed by a strict four-fold safety mechanism that prevents harmful updates.
Result: Experiments across multiple datasets and forecasting backbones show CRC consistently improves accuracy while ensuring exceptionally high non-degradation rates (NDR), making it suitable for safe and reliable deployment.
Conclusion: CRC provides a plug-and-play correction framework that addresses the “safety gap” in multivariate forecasting by guaranteeing non-degradation through causality-inspired structure and strict safety mechanisms.
Abstract: While modern multivariate forecasters such as Transformers and GNNs achieve strong benchmark performance, they often suffer from systematic errors at specific variables or horizons and, critically, lack guarantees against performance degradation in deployment. Existing post-hoc residual correction methods attempt to fix these errors, but are inherently greedy: although they may improve average accuracy, they can also “help in the wrong way” by overcorrecting reliable predictions and causing local failures in unseen scenarios. To address this critical “safety gap,” we propose CRC (Causality-inspired Safe Residual Correction), a plug-and-play framework explicitly designed to ensure non-degradation. CRC follows a divide-and-conquer philosophy: it employs a causality-inspired encoder to expose direction-aware structure by decoupling self- and cross-variable dynamics, and a hybrid corrector to model residual errors. Crucially, the correction process is governed by a strict four-fold safety mechanism that prevents harmful updates. Experiments across multiple datasets and forecasting backbones show that CRC consistently improves accuracy, while an in-depth ablation study confirms that its core safety mechanisms ensure exceptionally high non-degradation rates (NDR), making CRC a correction framework suited for safe and reliable deployment.
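An illustrative gate in the spirit of the non-degradation goal: a residual correction is kept per variable/horizon only when held-out evidence says it helps and the change stays inside a trust band, otherwise the base forecast passes through. The paper's actual four-fold safety mechanism and thresholds are not reproduced here.

```python
import numpy as np

def safe_correct(base_pred, correction, val_err_base, val_err_corrected,
                 max_rel_change=0.2):
    """All arrays share the same (variable, horizon) shape; thresholds are illustrative."""
    corrected = base_pred + correction
    helps = val_err_corrected < val_err_base                          # held-out evidence per entry
    small = np.abs(correction) <= max_rel_change * (np.abs(base_pred) + 1e-8)
    gate = helps & small
    return np.where(gate, corrected, base_pred)                       # fall back when unsafe
```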
[319] Data-Driven Analysis of Crash Patterns in SAE Level 2 and Level 4 Automated Vehicles Using K-means Clustering and Association Rule Mining
Jewel Rana Palit, Vijayalakshmi K Kumarasamy, Osama A. Osman
Main category: cs.LG
TL;DR: This paper analyzes over 2,500 AV crash records from NHTSA to understand crash dynamics across SAE Levels 2 and 4 using a two-stage data mining framework combining K-means clustering and Association Rule Mining.
Details
Motivation: Recent crash data shows AV behavior can deviate from expected safety outcomes, raising concerns about AV safety in mixed traffic. Most previous studies rely on small, California-centered datasets with limited focus on crash trends across different SAE automation levels.Method: Developed a two-stage data mining framework: 1) K-means clustering to segment 2,500+ AV crash records into 4 distinct behavioral clusters based on temporal, spatial, and environmental factors; 2) Association Rule Mining (ARM) to extract interpretable multivariate relationships between crash patterns and contributors (lighting, surface conditions, vehicle dynamics, environmental conditions) within each cluster.
Result: The analysis uncovered underlying crash dynamics across SAE Levels 2 and 4, identifying distinct behavioral clusters and revealing multivariate relationships between crash patterns and various contributing factors.
Conclusion: The insights provide actionable guidance for AV developers, safety regulators, and policymakers in formulating AV deployment strategies and minimizing crash risks in mixed traffic environments.
Abstract: Automated Vehicles (AV) hold potential to reduce or eliminate human driving errors, enhance traffic safety, and support sustainable mobility. Recently, crash data has increasingly revealed that AV behavior can deviate from expected safety outcomes, raising concerns about the technology’s safety and operational reliability in mixed traffic environments. While past research has investigated AV crashes, most studies rely on small, California-centered datasets, with a limited focus on understanding crash trends across various SAE Levels of automation. This study analyzes over 2,500 AV crash records from the United States National Highway Traffic Safety Administration (NHTSA), covering SAE Levels 2 and 4, to uncover underlying crash dynamics. A two-stage data mining framework is developed. K-means clustering is first applied to segment crash records into 4 distinct behavioral clusters based on temporal, spatial, and environmental factors. Then, Association Rule Mining (ARM) is used to extract interpretable multivariate relationships between crash patterns and crash contributors, including lighting conditions, surface condition, vehicle dynamics, and environmental conditions, within each cluster. These insights provide actionable guidance for AV developers, safety regulators, and policymakers in formulating AV deployment strategies and minimizing crash risks.
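The two-stage pipeline maps naturally onto scikit-learn and mlxtend; the sketch below assumes placeholder column names, k = 4 clusters, and arbitrary support/confidence thresholds rather than the study's settings.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
from mlxtend.frequent_patterns import apriori, association_rules

def cluster_and_mine(df, numeric_cols, categorical_cols, k=4):
    """Stage 1: K-means on scaled numeric features; Stage 2: ARM within each cluster."""
    X = StandardScaler().fit_transform(df[numeric_cols])
    df = df.assign(cluster=KMeans(n_clusters=k, random_state=0, n_init=10).fit_predict(X))
    rules_per_cluster = {}
    for c, part in df.groupby("cluster"):
        onehot = pd.get_dummies(part[categorical_cols]).astype(bool)
        frequent = apriori(onehot, min_support=0.1, use_colnames=True)
        rules_per_cluster[c] = association_rules(frequent, metric="confidence", min_threshold=0.7)
    return rules_per_cluster
```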
[320] Information-Theoretic Quality Metric of Low-Dimensional Embeddings
Sebastián Gutiérrez-Bernal, Hector Medel Cobaxin, Abiel Galindo González
Main category: cs.LG
TL;DR: ERPM is a new information-theoretic metric for evaluating low-dimensional embeddings that measures information preservation via Shannon entropy of neighborhood singular-value spectra, complementing existing distance-based and geometric metrics.
Details
Motivation: Classical embedding evaluation metrics (stress, rank-based criteria, Local Procrustes) only measure distance or geometric distortions, not how much information is preserved when projecting high-dimensional data to lower dimensions.Method: Introduces Entropy Rank Preservation Measure (ERPM) based on Shannon entropy of singular-value spectrum of neighborhood matrices and stable rank, quantifying uncertainty changes between original and projected representations at neighborhood level.
Result: Distance-based metrics show low correlation with geometric/spectral measures; ERPM and Local Procrustes have strong average correlation but significant local discrepancies; ERPM identifies neighborhoods with severe information loss.
Conclusion: ERPM complements existing metrics by providing information-theoretic assessment, enabling more comprehensive embedding evaluation especially for information-sensitive applications like early-warning indicators.
Abstract: In this work we study the quality of low-dimensional embeddings from an explicitly information-theoretic perspective. We begin by noting that classical evaluation metrics such as stress, rank-based neighborhood criteria, or Local Procrustes quantify distortions in distances or in local geometries, but do not directly assess how much information is preserved when projecting high-dimensional data onto a lower-dimensional space. To address this limitation, we introduce the Entropy Rank Preservation Measure (ERPM), a local metric based on the Shannon entropy of the singular-value spectrum of neighborhood matrices and on the stable rank, which quantifies changes in uncertainty between the original representation and its reduced projection, providing neighborhood-level indicators and a global summary statistic. To validate the results of the metric, we compare its outcomes with the Mean Relative Rank Error (MRRE), which is distance-based, and with Local Procrustes, which is based on geometric properties, using a financial time series and a manifold commonly studied in the literature. We observe that distance-based criteria exhibit very low correlation with geometric and spectral measures, while ERPM and Local Procrustes show strong average correlation but display significant discrepancies in local regimes, leading to the conclusion that ERPM complements existing metrics by identifying neighborhoods with severe information loss, thereby enabling a more comprehensive assessment of embeddings, particularly in information-sensitive applications such as the construction of early-warning indicators.
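A sketch consistent with the description, assuming k-nearest-neighbor neighborhoods, centered neighborhood matrices, and entropy of the normalized singular values; the exact normalization and the stable-rank term used by ERPM are not reproduced.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def spectral_entropy(M):
    """Shannon entropy of the normalized singular-value spectrum of a neighborhood matrix."""
    s = np.linalg.svd(M - M.mean(axis=0), compute_uv=False)
    s = s[s > 1e-12]
    if s.size == 0:
        return 0.0
    p = s / s.sum()
    return float(-(p * np.log(p)).sum())

def erpm_like(X_high, X_low, k=15):
    """Per-point entropy gap between the original and embedded neighborhoods."""
    nbrs = NearestNeighbors(n_neighbors=k).fit(X_high)
    _, idx = nbrs.kneighbors(X_high)
    gaps = []
    for neigh in idx:
        gaps.append(spectral_entropy(X_high[neigh]) - spectral_entropy(X_low[neigh]))
    return np.array(gaps)          # large positive values flag neighborhoods with information loss
```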
[321] Adaptive Learning Guided by Bias-Noise-Alignment Diagnostics
Akash Samanta, Sheldon Williamson
Main category: cs.LG
TL;DR: A diagnostic-driven adaptive learning framework that models error evolution through bias-noise-alignment decomposition for stable learning in dynamic environments.
Details
Motivation: Existing learning methods often fail in nonstationary, safety-critical environments due to instability, slow convergence, or brittle adaptation. Current approaches adapt to gradient statistics but ignore the temporal structure of error signals, which is crucial for reliable learning in dynamic settings.Method: Proposes a framework decomposing error evolution into three components: bias (persistent drift), noise (stochastic variability), and alignment (repeated directional excitation). These diagnostics are computed online from lightweight statistics of loss or TD error trajectories. The framework is applied to three instantiations: Human-inspired Supervised Adaptive Optimizer (HSAO), Hybrid Error-Diagnostic Reinforcement Learning (HED-RL), and Meta-Learned Learning Policy (MLLP).
Result: The bias-noise-alignment decomposition provides a unifying control backbone for supervised optimization, actor-critic RL, and learned optimizers. Under standard smoothness assumptions, the framework establishes bounded effective updates and stability properties. Diagnostic illustrations in actor-critic learning show how the signals modulate adaptation in response to TD error structure.
Conclusion: This work elevates error evolution to a first-class object in adaptive learning, providing an interpretable, lightweight foundation for reliable learning in dynamic environments. The framework offers principled diagnostics that are independent of model architecture or task domain.
Abstract: Learning systems deployed in nonstationary and safety-critical environments often suffer from instability, slow convergence, or brittle adaptation when learning dynamics evolve over time. While modern optimization, reinforcement learning, and meta-learning methods adapt to gradient statistics, they largely ignore the temporal structure of the error signal itself. This paper proposes a diagnostic-driven adaptive learning framework that explicitly models error evolution through a principled decomposition into bias, capturing persistent drift; noise, capturing stochastic variability; and alignment, capturing repeated directional excitation leading to overshoot. These diagnostics are computed online from lightweight statistics of loss or temporal-difference (TD) error trajectories and are independent of model architecture or task domain. We show that the proposed bias-noise-alignment decomposition provides a unifying control backbone for supervised optimization, actor-critic reinforcement learning, and learned optimizers. Within this framework, we introduce three diagnostic-driven instantiations: the Human-inspired Supervised Adaptive Optimizer (HSAO), Hybrid Error-Diagnostic Reinforcement Learning (HED-RL) for actor-critic methods, and the Meta-Learned Learning Policy (MLLP). Under standard smoothness assumptions, we establish bounded effective updates and stability properties for all cases. Representative diagnostic illustrations in actor-critic learning highlight how the proposed signals modulate adaptation in response to TD error structure. Overall, this work elevates error evolution to a first-class object in adaptive learning and provides an interpretable, lightweight foundation for reliable learning in dynamic environments.
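A hedged sketch of what such online diagnostics could look like over a window of recent errors: bias as the mean (persistent drift), noise as the standard deviation, and alignment as lag-1 sign agreement (repeated directional excitation), with an illustrative rule for modulating the step size. The paper's exact statistics and control laws are assumptions here.

```python
import numpy as np

def error_diagnostics(errors):
    """errors: recent window of loss or TD errors."""
    e = np.asarray(errors, dtype=float)
    bias = e.mean()                                              # persistent drift
    noise = e.std()                                              # stochastic variability
    signs = np.sign(e)
    alignment = (signs[1:] == signs[:-1]).mean() if e.size > 1 else 0.0  # directional excitation
    return bias, noise, alignment

def adapt_lr(lr, bias, noise, alignment, base_noise=1.0):
    # Illustrative rule: damp the step when errors are noisy, push harder when
    # they keep pointing the same way.
    return lr * (1.0 + 0.5 * alignment) / (1.0 + noise / base_noise)
```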
[322] Unregularized Linear Convergence in Zero-Sum Game from Preference Feedback
Shulun Chen, Runlong Zhou, Zihan Zhang, Maryam Fazel, Simon S. Du
Main category: cs.LG
TL;DR: OMWU achieves linear convergence to Nash equilibrium in NLHF without requiring uniqueness assumptions, with exponentially better dependence on instance-dependent constants.
Details
Motivation: Standard preference modeling assumes transitivity, overlooking complex human population preferences. NLHF addresses non-transitive preferences as a game, but existing algorithms rely on regularization with unavoidable bias.Method: Analyzes Optimistic Multiplicative Weights Update (OMWU) in Nash learning from human feedback (NLHF), providing first convergence guarantee for OMWU in this setting without requiring Nash equilibrium uniqueness.
Result: OMWU achieves last-iterate linear convergence after burn-in phase when NE with full support exists, with instance-dependent linear convergence rate. Shows novel marginal convergence behavior with exponentially better dependence on constants.
Conclusion: OMWU offers theoretical advantages for NLHF alignment without regularization bias, with experimental validation in tabular and neural policy classes, demonstrating potential for LLM applications.
Abstract: Aligning large language models (LLMs) with human preferences has proven effective for enhancing model capabilities, yet standard preference modeling using the Bradley-Terry model assumes transitivity, overlooking the inherent complexity of human population preferences. Nash learning from human feedback (NLHF) addresses this by framing non-transitive preferences as a two-player zero-sum game, where alignment reduces to finding the Nash equilibrium (NE). However, existing algorithms typically rely on regularization, incurring unavoidable bias when computing the duality gap in the original game. In this work, we provide the first convergence guarantee for Optimistic Multiplicative Weights Update ($\mathtt{OMWU}$) in NLHF, showing that it achieves last-iterate linear convergence after a burn-in phase whenever an NE with full support exists, with an instance-dependent linear convergence rate to the original NE, measured by duality gaps. Compared to prior results in Wei et al. (2020), we do not require the assumption of NE uniqueness. Our analysis identifies a novel marginal convergence behavior, where the probability of rarely played actions grows exponentially from exponentially small values, enabling exponentially better dependence on instance-dependent constants than prior results. Experiments corroborate the theoretical strengths of $\mathtt{OMWU}$ in both tabular and neural policy classes, demonstrating its potential for LLM applications.
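For reference, the OMWU update on a fixed two-player zero-sum matrix game (row player minimizing $x^\top A y$) looks as follows; in NLHF the payoff would come from a learned preference model rather than a fixed matrix, so this only illustrates the optimistic update and the duality-gap measure.

```python
import numpy as np

def omwu(A, eta=0.1, iters=2000):
    """Optimistic Multiplicative Weights Update on payoff matrix A (row player minimizes)."""
    m, n = A.shape
    x, y = np.full(m, 1 / m), np.full(n, 1 / n)
    gx_prev, gy_prev = A @ y, A.T @ x
    for _ in range(iters):
        gx, gy = A @ y, A.T @ x
        x = x * np.exp(-eta * (2 * gx - gx_prev))   # optimistic (predictive) gradient for the min player
        y = y * np.exp(eta * (2 * gy - gy_prev))    # and for the max player
        x /= x.sum(); y /= y.sum()
        gx_prev, gy_prev = gx, gy
    duality_gap = (x @ A).max() - (A @ y).min()     # 0 at a Nash equilibrium
    return x, y, duality_gap
```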
[323] Frequent subgraph-based persistent homology for graph classification
Xinyang Chen, Amaël Broustet, Guanyuan Zeng, Cheng He, Guoting Chen
Main category: cs.LG
TL;DR: Proposes Frequent Subgraph Filtration (FSF) for persistent homology on graphs, creating frequency-based persistent homology features, with two classification frameworks: FPH-ML and FPH-GNNs.
Details
Motivation: Current persistent homology methods on graphs use limited filtrations (degree/weight-based) that miss richer features like recurring patterns across datasets, restricting expressive power.Method: Introduces Frequent Subgraph Filtration (FSF) derived from frequent subgraphs, producing stable frequency-based persistent homology features. Develops FPH-ML (machine learning model) and FPH-GNNs (hybrid with graph neural networks).
Result: FPH-ML achieves competitive/superior accuracy vs kernel/degree-based methods. FPH-GNNs show 0.4-21% relative gains, up to 8.2 percentage point improvements over GCN/GIN backbones across benchmarks.
Conclusion: FSF bridges frequent subgraph mining and topological data analysis, offering new perspective on topology-aware feature extraction with validated theoretical properties and experimental performance.
Abstract: Persistent homology (PH) has recently emerged as a powerful tool for extracting topological features. Integrating PH into machine learning and deep learning models enhances topology awareness and interpretability. However, most PH methods on graphs rely on a limited set of filtrations, such as degree-based or weight-based filtrations, which overlook richer features like recurring information across the dataset and thus restrict expressive power. In this work, we propose a novel graph filtration called Frequent Subgraph Filtration (FSF), which is derived from frequent subgraphs and produces stable and information-rich frequency-based persistent homology (FPH) features. We study the theoretical properties of FSF and provide both proofs and experimental validation. Beyond persistent homology itself, we introduce two approaches for graph classification: an FPH-based machine learning model (FPH-ML) and a hybrid framework that integrates FPH with graph neural networks (FPH-GNNs) to enhance topology-aware graph representation learning. Our frameworks bridge frequent subgraph mining and topological data analysis, offering a new perspective on topology-aware feature extraction. Experimental results show that FPH-ML achieves competitive or superior accuracy compared with kernel-based and degree-based filtration methods. When integrated into graph neural networks, FPH yields relative performance gains ranging from 0.4 to 21 percent, with improvements of up to 8.2 percentage points over GCN and GIN backbones across benchmarks.
cs.MA
[324] μACP: A Formal Calculus for Expressive, Resource-Constrained Agent Communication
Arnab Mallick, Indraveni Chebolu
Main category: cs.MA
TL;DR: μACP is a formal calculus for expressive agent communication with explicit resource bounds, proving a minimal 4-verb basis can encode FIPA protocols while achieving provable efficiency and coordination for edge-native agents.
Details
Motivation: Existing agent communication protocols face a trade-off: FIPA-ACL provides semantic richness but is intractable for constrained environments, while lightweight IoT protocols are efficient but lack expressiveness. There's a need to reconcile semantic expressiveness with provable efficiency for resource-constrained multi-agent systems.Method: Developed μACP formal calculus based on Resource-Constrained Agent Communication (RCAC) model. Proved a minimal four-verb basis {PING, TELL, ASK, OBSERVE} suffices to encode finite-state FIPA protocols. Established tight information-theoretic bounds on message complexity. Showed μACP can implement standard consensus under partial synchrony and crash faults.
Result: Formal verification in TLA⁺ (model checking) and Coq (mechanized invariants) established safety, boundedness, and liveness under modeled assumptions. Large-scale simulations showed median end-to-end message latency of 34 ms (95th percentile 104 ms) at scale, outperforming prior agent and IoT protocols under severe resource constraints.
Conclusion: μACP provides a unified calculus that reconciles semantic expressiveness with provable efficiency, offering a rigorous foundation for next-generation resource-constrained multi-agent systems, particularly for edge-native agents requiring both coordination capabilities and resource efficiency.
Abstract: Agent communication remains a foundational problem in multi-agent systems: protocols such as FIPA-ACL guarantee semantic richness but are intractable for constrained environments, while lightweight IoT protocols achieve efficiency at the expense of expressiveness. This paper presents $μ$ACP, a formal calculus for expressive agent communication under explicit resource bounds. We formalize the Resource-Constrained Agent Communication (RCAC) model, prove that a minimal four-verb basis \textit{\{PING, TELL, ASK, OBSERVE\}} suffices to encode finite-state FIPA protocols, and establish tight information-theoretic bounds on message complexity. We further show that $μ$ACP can implement standard consensus under partial synchrony and crash faults, yielding a constructive coordination framework for edge-native agents. Formal verification in TLA$^{+}$ (model checking) and Coq (mechanized invariants) establishes safety and boundedness, and supports liveness under modeled assumptions. Large-scale system simulations confirm $μ$ACP achieves a median end-to-end message latency of 34 ms (95th percentile 104 ms) at scale, outperforming prior agent and IoT protocols under severe resource constraints. The main contribution is a unified calculus that reconciles semantic expressiveness with provable efficiency, providing a rigorous foundation for the next generation of resource-constrained multi-agent systems.
[325] Offline Multi-Agent Reinforcement Learning for 6G Communications: Fundamentals, Applications and Future Directions
Eslam Eldeeb, Hirley Alves
Main category: cs.MA
TL;DR: A novel offline multi-agent reinforcement learning algorithm using conservative Q-learning with meta-learning extensions for wireless network applications like resource management and UAV networks.
Details
Motivation: Next-gen wireless networks (5G/6G) enable transformative applications but increase complexity with interconnected devices, requiring advanced AI/ML decision-making. Traditional online RL faces cost, safety, and scalability issues in multi-agent environments.Method: Proposes offline MARL algorithm based on conservative Q-learning (CQL) to ensure safe training using pre-collected data, extended with meta-learning for dynamic environments.
Result: Validated through use cases in radio resource management and UAV networks, demonstrating safe and efficient training without real-time interaction.
Conclusion: Offline MARL offers promising solutions for wireless applications, with advantages in safety and scalability, though limitations exist and future research directions are needed.
Abstract: The next-generation wireless technologies, including beyond 5G and 6G networks, are paving the way for transformative applications such as vehicle platooning, smart cities, and remote surgery. These innovations are driven by a vast array of interconnected wireless entities, including IoT devices, access points, UAVs, and CAVs, which increase network complexity and demand more advanced decision-making algorithms. Artificial intelligence (AI) and machine learning (ML), especially reinforcement learning (RL), are key enablers for such networks, providing solutions to high-dimensional and complex challenges. However, as networks expand to multi-agent environments, traditional online RL approaches face cost, safety, and scalability limitations. Offline multi-agent reinforcement learning (MARL) offers a promising solution by utilizing pre-collected data, reducing the need for real-time interaction. This article introduces a novel offline MARL algorithm based on conservative Q-learning (CQL), ensuring safe and efficient training. We extend this with meta-learning to address dynamic environments and validate the approach through use cases in radio resource management and UAV networks. Our work highlights offline MARL’s advantages, limitations, and future directions in wireless applications.
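A minimal single-agent sketch of the conservative Q-learning penalty at the core of the approach, for discrete actions: the usual TD error plus a term that pushes down a log-sum-exp over all actions while pushing up the dataset actions. The multi-agent and meta-learning extensions described above are not shown; shapes and the penalty weight are assumptions.

```python
import torch
import torch.nn.functional as F

def cql_loss(q_net, target_net, batch, gamma=0.99, alpha=1.0):
    """batch: (states, actions, rewards, next_states, dones) tensors from the offline dataset."""
    s, a, r, s_next, done = batch
    q_all = q_net(s)                                              # (B, n_actions)
    q_taken = q_all.gather(1, a.unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        target = r + gamma * (1 - done) * target_net(s_next).max(dim=1).values
    td = F.mse_loss(q_taken, target)                              # standard Bellman error
    conservative = (torch.logsumexp(q_all, dim=1) - q_taken).mean()  # CQL penalty on OOD actions
    return td + alpha * conservative
```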
[326] Mapping Human Anti-collusion Mechanisms to Multi-agent AI
Jamiu Adekunle Idowu, Ahmed Almasoud, Ayman Alfahid
Main category: cs.MA
TL;DR: The paper develops a taxonomy of human anti-collusion mechanisms and maps them to interventions for multi-agent AI systems, addressing challenges in adapting traditional anti-collusion approaches to AI settings.
Details
Motivation: As multi-agent AI systems become increasingly autonomous and develop collusive strategies similar to human markets, there's a need to adapt centuries of human anti-collusion mechanisms to AI settings, but it remains unclear how to do this effectively.Method: The paper (i) develops a taxonomy of human anti-collusion mechanisms (sanctions, leniency & whistleblowing, monitoring & auditing, market design, and governance) and (ii) maps them to potential interventions for multi-agent AI systems, proposing implementation approaches for each mechanism.
Result: The paper provides a framework for adapting human anti-collusion mechanisms to multi-agent AI systems, identifying specific implementation approaches for each type of mechanism in AI contexts.
Conclusion: While human anti-collusion mechanisms can be adapted to AI systems, significant open challenges remain including the attribution problem, identity fluidity, the boundary problem, and adversarial adaptation, which require further research and innovative solutions.
Abstract: As multi-agent AI systems become increasingly autonomous, evidence shows they can develop collusive strategies similar to those long observed in human markets and institutions. While human domains have accumulated centuries of anti-collusion mechanisms, it remains unclear how these can be adapted to AI settings. This paper addresses that gap by (i) developing a taxonomy of human anti-collusion mechanisms, including sanctions, leniency & whistleblowing, monitoring & auditing, market design, and governance, and (ii) mapping them to potential interventions for multi-agent AI systems. For each mechanism, we propose implementation approaches. We also highlight open challenges, such as the attribution problem (difficulty attributing emergent coordination to specific agents), identity fluidity (agents being easily forked or modified), the boundary problem (distinguishing beneficial cooperation from harmful collusion), and adversarial adaptation (agents learning to evade detection).
cs.MM
eess.AS
[327] Learning Speech Representations with Variational Predictive Coding
Sung-Lin Yeh, Peter Bell, Hao Tang
Main category: eess.AS
TL;DR: The paper reveals that predictive coding under a variational view is the underlying principle behind HuBERT’s success, enabling improvements to parameterization and optimization that boost performance across multiple speech tasks.
Details
Motivation: The authors argue that HuBERT's development has stalled due to lack of an underlying principle. They aim to identify this principle to enable systematic improvements to the objective.
Method: The paper formulates HuBERT’s objective through a predictive coding framework under a variational view. This formulation allows for two simple modifications to improve parameterization and optimization.
Result: The predictive coding interpretation enables immediate improvements to HuBERT, leading to significant performance gains on four downstream tasks: phone classification, f0 tracking, speaker recognition, and ASR.
Conclusion: Predictive coding under a variational view is the fundamental principle behind HuBERT, providing a unified framework that connects to other objectives (APC, CPC, wav2vec, BEST-RQ) and enables systematic improvements.
Abstract: Despite being the best known objective for learning speech representations, the HuBERT objective has not been further developed and improved. We argue that it is the lack of an underlying principle that stalls the development, and, in this paper, we show that predictive coding under a variational view is the principle behind the HuBERT objective. Due to its generality, our formulation provides opportunities to improve parameterization and optimization, and we show two simple modifications that bring immediate improvements to the HuBERT objective. In addition, the predictive coding formulation has tight connections to various other objectives, such as APC, CPC, wav2vec, and BEST-RQ. Empirically, the improvement in pre-training brings significant improvements to four downstream tasks: phone classification, f0 tracking, speaker recognition, and automatic speech recognition, highlighting the importance of the predictive coding interpretation.
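For orientation, a minimal HuBERT-style masked-prediction objective of the kind the paper reinterprets as variational predictive coding might look as follows; the variational parameterization and the two proposed modifications are not shown, and the encoder interface (a module mapping masked frames to cluster logits) is an assumption.

```python
import torch.nn.functional as F

def masked_prediction_loss(encoder, frames, targets, mask):
    """frames: (B, T, D) input features with masked frames already replaced,
    targets: (B, T) long tensor of discrete cluster ids,
    mask: (B, T) boolean, True where frames were masked before encoding,
    encoder: module mapping (B, T, D) -> (B, T, n_clusters) logits."""
    logits = encoder(frames)        # predict cluster ids for every frame
    logits = logits[mask]           # keep only predictions at masked frames
    labels = targets[mask]
    return F.cross_entropy(logits, labels)
```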
eess.IV
[328] Deep Learning Approach for the Diagnosis of Pediatric Pneumonia Using Chest X-ray Imaging
Fatemeh Hosseinabadi, Mohammad Mojtaba Rohani
Main category: eess.IV
TL;DR: Deep learning models (ResNetRS, RegNet, EfficientNetV2) were applied to pediatric pneumonia detection from chest X-rays, with RegNet achieving the best performance at 92.4% accuracy.
Details
Motivation: Pediatric pneumonia is a major global health issue with high morbidity and mortality. Diagnosis is challenging due to limited radiological expertise and the complexity of pediatric imaging, creating a need for automated diagnostic tools.
Method: Used transfer learning with three CNN architectures (ResNetRS, RegNet, EfficientNetV2) pretrained on ImageNet. Fine-tuned models on a curated subset of 1,000 pediatric chest X-ray images from a larger public dataset (5,856 images). Preprocessed images and performed binary classification (pneumonia vs normal).
Result: RegNet performed best with 92.4% accuracy and 90.1% sensitivity, followed by ResNetRS (91.9% accuracy, 89.3% sensitivity), and EfficientNetV2 (88.5% accuracy, 88.1% sensitivity).
Conclusion: Deep learning models, particularly RegNet, show promising performance for automated pediatric pneumonia detection from chest X-rays, potentially addressing diagnostic challenges in resource-limited settings.
Abstract: Pediatric pneumonia remains a leading cause of morbidity and mortality in children worldwide. Timely and accurate diagnosis is critical but often challenged by limited radiological expertise and the physiological and procedural complexity of pediatric imaging. This study investigates the performance of state-of-the-art convolutional neural network (CNN) architectures (ResNetRS, RegNet, and EfficientNetV2) using transfer learning for the automated classification of pediatric chest X-ray images as either pneumonia or normal. A curated subset of 1,000 chest X-ray images was extracted from a publicly available dataset originally comprising 5,856 pediatric images. All images were preprocessed and labeled for binary classification. Each model was fine-tuned using pretrained ImageNet weights and evaluated based on accuracy and sensitivity. RegNet achieved the highest classification performance with an accuracy of 92.4% and a sensitivity of 90.1%, followed by ResNetRS (accuracy: 91.9%, sensitivity: 89.3%) and EfficientNetV2 (accuracy: 88.5%, sensitivity: 88.1%).
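A minimal transfer-learning setup in the spirit of the study (ImageNet-pretrained backbone, two-class head) could look like the sketch below; the exact backbone variants, preprocessing, and training hyperparameters are not specified in the summary, so the choices here are illustrative.

```python
import torch.nn as nn
from torchvision import models

def build_binary_regnet(freeze_backbone: bool = False) -> nn.Module:
    """Illustrative fine-tuning setup: ImageNet-pretrained RegNet with a
    2-way head (pneumonia / normal); the authors' exact variant may differ."""
    model = models.regnet_y_800mf(weights="IMAGENET1K_V1")
    if freeze_backbone:
        for p in model.parameters():
            p.requires_grad = False
    # Replace the 1000-class ImageNet head with a binary classifier.
    model.fc = nn.Linear(model.fc.in_features, 2)
    return model
```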
[329] Hear the Heartbeat in Phases: Physiologically Grounded Phase-Aware ECG Biometrics
Jintao Huang, Lu Leng, Yi Zhang, Ziyuan Yang
Main category: eess.IV
TL;DR: HPAF is a hierarchical phase-aware fusion framework for ECG-based identity authentication that extracts phase-specific cardiac features and uses multi-prototype enrollment to handle heartbeat variability.
Details
Motivation: Existing ECG authentication methods treat heartbeats as homogeneous signals, overlooking phase-specific characteristics within the cardiac cycle. This limits their ability to capture the full discriminative information available in ECG signals.
Method: Three-stage HPAF framework: 1) Intra-Phase Representation extracts independent features for each cardiac phase, 2) Phase-Grouped Hierarchical Fusion aggregates physiologically related phases, 3) Global Representation Fusion combines grouped representations adaptively. Plus Heartbeat-Aware Multi-prototype enrollment strategy to handle heartbeat variability.
Result: Extensive experiments on three public datasets show HPAF achieves state-of-the-art results in both closed and open-set settings, outperforming other methods.
Conclusion: The proposed hierarchical phase-aware approach effectively captures phase-specific ECG characteristics and handles heartbeat variability, leading to superior identity authentication performance for wearable devices.
Abstract: Electrocardiography (ECG) is adopted for identity authentication in wearable devices due to its individual-specific characteristics and inherent liveness. However, existing methods often treat heartbeats as homogeneous signals, overlooking the phase-specific characteristics within the cardiac cycle. To address this, we propose a Hierarchical Phase-Aware Fusion (HPAF) framework that explicitly avoids cross-feature entanglement through a three-stage design. In the first stage, Intra-Phase Representation (IPR) independently extracts representations for each cardiac phase, ensuring that phase-specific morphological and variation cues are preserved without interference from other phases. In the second stage, Phase-Grouped Hierarchical Fusion (PGHF) aggregates physiologically related phases in a structured manner, enabling reliable integration of complementary phase information. In the final stage, Global Representation Fusion (GRF) further combines the grouped representations and adaptively balances their contributions to produce a unified and discriminative identity representation. Moreover, considering ECG signals are continuously acquired, multiple heartbeats can be collected for each individual. We propose a Heartbeat-Aware Multi-prototype (HAM) enrollment strategy, which constructs a multi-prototype gallery template set to reduce the impact of heartbeat-specific noise and variability. Extensive experiments on three public datasets demonstrate that HPAF achieves state-of-the-art results in comparison with other methods under both closed and open-set settings.
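The three-stage fusion could be organized roughly as sketched below, with per-phase encoders, grouped fusion of related phases, and an adaptive global weighting; the phase grouping, layer sizes, and weighting form are assumptions rather than the authors' exact design.

```python
import torch
import torch.nn as nn

class PhaseAwareFusion(nn.Module):
    """Schematic three-stage fusion: per-phase encoders, grouped fusion,
    and a learned softmax weighting of the group representations."""
    def __init__(self, phase_len=64, emb=128, groups=((0, 1), (2, 3))):
        super().__init__()
        n_phases = sum(len(g) for g in groups)
        self.phase_enc = nn.ModuleList(
            nn.Sequential(nn.Linear(phase_len, emb), nn.ReLU())
            for _ in range(n_phases)
        )
        self.groups = groups
        self.group_fuse = nn.ModuleList(
            nn.Linear(len(g) * emb, emb) for g in groups
        )
        self.global_weights = nn.Linear(emb, 1)

    def forward(self, phases):                  # phases: (B, n_phases, phase_len)
        feats = [enc(phases[:, i]) for i, enc in enumerate(self.phase_enc)]
        grouped = [
            fuse(torch.cat([feats[i] for i in g], dim=-1))
            for g, fuse in zip(self.groups, self.group_fuse)
        ]
        g = torch.stack(grouped, dim=1)                    # (B, n_groups, emb)
        w = torch.softmax(self.global_weights(g), dim=1)   # adaptive group weights
        return (w * g).sum(dim=1)                          # identity embedding
```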
[330] Let Distortion Guide Restoration (DGR): A physics-informed learning framework for Prostate Diffusion MRI
Ziyang Long, Binesh Nader, Lixia Wang, Archana Vadiraj Malaji, Chia-Chi Yang, Haoran Sun, Rola Saouaf, Timothy Daskivich, Hyung Kim, Yibin Xie, Debiao Li, Hsin-Jung Yang
Main category: eess.IV
TL;DR: DGR is a physics-informed hybrid CNN-diffusion framework that corrects severe susceptibility-induced distortions in prostate DWI without requiring additional acquisitions, outperforming traditional methods like FSL TOPUP and FUGUE.
Details
Motivation: Severe susceptibility-induced distortions in prostate single-shot EPI diffusion-weighted imaging (DWI) degrade image quality and diagnostic confidence, especially in cases with metal implants or rectal distension. Traditional correction methods often require additional acquisitions, which is impractical in clinical settings.
Method: DGR combines a CNN-based geometric correction module with conditional diffusion refinement under T2-weighted anatomical guidance. It’s trained to invert a realistic forward distortion model using large-scale paired data (410 multi-institutional studies) and measured B0 field maps from metal-implant cases. The framework generates synthetic distortions for low-b DWI, high-b DWI, and ADC maps.
Result: On synthetic validation (n=34), DGR achieved higher PSNR and lower NMSE than FSL TOPUP and FUGUE. In 34 real clinical cases with severe distortion (hip prostheses, rectal distension), DGR improved geometric fidelity and increased radiologist-rated image quality and diagnostic confidence.
Conclusion: Learning the inverse of a physically simulated forward process provides a practical, acquisition-free alternative to traditional distortion-correction pipelines for prostate DWI, enabling improved clinical utility without requiring additional scans.
Abstract: We present Distortion-Guided Restoration (DGR), a physics-informed hybrid CNN-diffusion framework for acquisition-free correction of severe susceptibility-induced distortions in prostate single-shot EPI diffusion-weighted imaging (DWI). DGR is trained to invert a realistic forward distortion model using large-scale paired distorted and undistorted data synthesized from distortion-free prostate DWI and co-registered T2-weighted images from 410 multi-institutional studies, together with 11 measured B0 field maps from metal-implant cases incorporated into a forward simulator to generate low-b DWI (b = 50 s/mm$^2$), high-b DWI (b = 1400 s/mm$^2$), and ADC distortions. The network couples a CNN-based geometric correction module with conditional diffusion refinement under T2-weighted anatomical guidance. On a held-out synthetic validation set (n = 34) using ground-truth simulated distortion fields, DGR achieved higher PSNR and lower NMSE than FSL TOPUP and FUGUE. In 34 real clinical studies with severe distortion, including hip prostheses and marked rectal distension, DGR improved geometric fidelity and increased radiologist-rated image quality and diagnostic confidence. Overall, learning the inverse of a physically simulated forward process provides a practical alternative to acquisition-dependent distortion-correction pipelines for prostate DWI.
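As a toy version of the forward model DGR learns to invert, the sketch below displaces voxels along the phase-encode axis in proportion to an off-resonance field map; intensity (Jacobian) modulation and the CNN-plus-diffusion inversion itself are omitted, and the parameter values and names are illustrative assumptions.

```python
import numpy as np

def apply_epi_distortion(image, field_map_hz, echo_spacing_s=5e-4):
    """Toy EPI susceptibility distortion: shift each voxel along the
    phase-encode axis (axis 0) by field_map_hz * echo_spacing * n_rows pixels.
    Assumes small, smooth shifts so the displaced positions stay monotonic.
    image, field_map_hz: real-valued arrays of shape (H, W)."""
    h, w = image.shape
    shift = field_map_hz * echo_spacing_s * h      # displacement in pixels
    rows = np.arange(h, dtype=float)
    distorted = np.empty_like(image, dtype=float)
    for col in range(w):
        # intensity at true row i appears at displaced position i + shift
        distorted[:, col] = np.interp(rows, rows + shift[:, col], image[:, col])
    return distorted
```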
[331] The Impact of Lesion Focus on the Performance of AI-Based Melanoma Classification
Tanay Donde
Main category: eess.IV
TL;DR: Models with better lesion attention alignment achieve superior melanoma diagnostic performance, suggesting interpretable AI can improve medical diagnostics.
Details
Motivation: Melanoma is deadly but early detection improves outcomes. Current CNN models for automated melanoma classification suffer from inconsistent focus on lesion areas, reducing diagnostic reliability.
Method: Analyzed relationship between lesion attention and diagnostic performance using masked images, bounding box detection, and transfer learning. Applied multiple explainability and sensitivity analysis approaches to measure attention alignment with lesion areas.
Result: Models with higher focus on lesion areas achieved better diagnostic performance (precision, recall, F1-score). Attention alignment correlated with improved classification metrics.
Conclusion: Interpretable AI has potential to improve medical diagnostics. Study provides foundation for developing more accurate and trustworthy melanoma classification models through better attention alignment.
Abstract: Melanoma is the most lethal subtype of skin cancer, and early and accurate detection of this disease can greatly improve patients’ outcomes. Although machine learning models, especially convolutional neural networks (CNNs), have shown great potential in automating melanoma classification, their diagnostic reliability still suffers due to inconsistent focus on lesion areas. In this study, we analyze the relationship between lesion attention and diagnostic performance, involving masked images, bounding box detection, and transfer learning. We used multiple explainability and sensitivity analysis approaches to investigate how well models aligned their attention with lesion areas and how this alignment correlated with precision, recall, and F1-score. Results showed that models with a higher focus on lesion areas achieved better diagnostic performance, suggesting the potential of interpretable AI in medical diagnostics. This study provides a foundation for developing more accurate and trustworthy melanoma classification models in the future.
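One plausible way to quantify "focus on the lesion", in the spirit of this study, is the fraction of a model's saliency mass that falls inside the lesion mask; the authors' exact explainability methods and alignment metrics may differ, so this is an illustrative sketch.

```python
import numpy as np

def attention_alignment(saliency: np.ndarray, lesion_mask: np.ndarray) -> float:
    """saliency: non-negative (H, W) attribution map (e.g., from Grad-CAM),
    lesion_mask: boolean (H, W) lesion segmentation or bounding-box mask.
    Returns the fraction of total saliency that lies inside the lesion."""
    total = saliency.sum()
    if total == 0:
        return 0.0
    return float(saliency[lesion_mask].sum() / total)
```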
[332] Physics-Guided Dual-Domain Plug-and-Play ADMM for Low-Dose CT Reconstruction
Sayantan Dutta, Sudhanya Chatterjee, Ashwini Galande, K. S. Shriram, Bipul Das
Main category: eess.IV
TL;DR: A novel Plug-and-Play model-based iterative reconstruction framework (PnP-MBIR) with 2-stage self-supervised Noise-to-Noise training enables high-quality ultra-low-dose CT reconstruction at 70-80% lower radiation dose while maintaining diagnostic fidelity comparable to full-dose scans.
Details
Motivation: Ultra-low-dose CT imaging reduces patient radiation exposure but suffers from severe structured and random noise that degrades image quality, requiring effective reconstruction methods to maintain diagnostic utility.
Method: Proposes a Plug-and-Play model-based iterative reconstruction framework that integrates a deep convolutional denoiser trained in a 2-stage self-supervised Noise-to-Noise scheme. The method alternates between sinogram-domain data fidelity enforcement and learned image-domain denoising within an optimization framework.
Result: Enables high-quality reconstructions at ~70-80% lower dose levels while maintaining diagnostic fidelity comparable to standard full-dose scans. Quantitative evaluations using GLCM features (contrast, homogeneity, entropy, correlation) show superior texture consistency and detail preservation over standalone deep learning and supervised PnP baselines.
Conclusion: The proposed framework effectively reduces streaks and structured artifacts while preserving subtle tissue contrast, making it a promising tool for ULDCT reconstruction that balances radiation dose reduction with diagnostic image quality.
Abstract: Ultra-low-dose CT (ULDCT) imaging can greatly reduce patient radiation exposure, but the resulting scans suffer from severe structured and random noise that degrades image quality. To address this challenge, we propose a novel Plug-and-Play model-based iterative reconstruction framework (PnP-MBIR) that integrates a deep convolutional denoiser trained in a 2-stage self-supervised Noise-to-Noise (N2N) scheme. The method alternates between enforcing sinogram-domain data fidelity and applying the learned image-domain denoiser within an optimization, enabling artifact suppression while maintaining anatomical structure. The 2-stage protocol enables fully self-supervised training from noisy data, followed by high-dose fine-tuning, ensuring the denoiser’s robustness in the ultra-low-dose regime. Our method enables high-quality reconstructions at $\sim$70–80% lower dose levels, while maintaining diagnostic fidelity comparable to standard full-dose scans. Quantitative evaluations using Gray-Level Co-occurrence Matrix (GLCM) features – including contrast, homogeneity, entropy, and correlation – confirm that the proposed method yields superior texture consistency and detail preservation over standalone deep learning and supervised PnP baselines. Qualitative and quantitative results on both simulated and clinical datasets demonstrate that our framework effectively reduces streaks and structured artifacts while preserving subtle tissue contrast, making it a promising tool for ULDCT reconstruction.
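A generic plug-and-play ADMM skeleton of the kind the paper instantiates for low-dose CT is sketched below, alternating a sinogram-domain data-fidelity step with a learned image-domain denoiser; the forward/adjoint operators and the Noise-to-Noise-trained denoiser are abstract placeholders, not the authors' implementation.

```python
import numpy as np

def pnp_admm(y, A, At, denoise, n_iter=30, rho=1.0, step=1e-3, grad_steps=10):
    """Plug-and-play ADMM sketch.
    y: measured sinogram; A, At: forward projector and its adjoint (callables);
    denoise: any image denoiser standing in for the proximal operator."""
    x = At(y)                      # crude initialization (back-projection)
    z = x.copy()
    u = np.zeros_like(x)
    for _ in range(n_iter):
        # x-update: a few gradient steps on 0.5||Ax - y||^2 + 0.5*rho*||x - z + u||^2
        for _ in range(grad_steps):
            grad = At(A(x) - y) + rho * (x - z + u)
            x = x - step * grad
        # z-update: the proximal step is replaced by the learned denoiser
        z = denoise(x + u)
        # dual update
        u = u + x - z
    return z
```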
[333] KDPhys: An Attention Guided 3D to 2D Knowledge Distillation for Real-time Video-Based Physiological Measurement
Nicky Nirlipta Sahoo, VS Sachidanand, Matcha Naga Gayathri, Balamurali Murugesan, Keerthi Ram, Jayaraj Joseph, Mohanasankar Sivaprakasam
Main category: eess.IV
TL;DR: KDPhys: An attention-based knowledge distillation framework for real-time remote photoplethysmography (rPPG) signal extraction from facial videos, achieving state-of-the-art performance with significantly reduced computational complexity.
Details
Motivation: The increasing demand for real-time, non-contact physiological monitoring during the SARS-CoV-2 pandemic for telehealth applications, coupled with the need for efficient rPPG methods that can operate on standard hardware with low computational requirements.
Method: Proposes KDPhys, an attention-based knowledge distillation framework that transfers global temporal representations from a 3D CNN teacher model to a lightweight 2D CNN student model via 3D-to-2D feature distillation. Introduces DILATE (Distortion Loss incorporating Shape and Time) that jointly optimizes morphological and temporal characteristics of rPPG signals.
Result: Achieves significant computational efficiency: uses only 0.23M parameters (half of existing methods), operates 56.67% faster, and reduces MAE by 18.15% compared to state-of-the-art approaches, achieving average MAE of 1.78 bpm across three benchmark datasets. Demonstrates robustness under diverse environmental conditions and activity scenarios.
Conclusion: KDPhys represents the first application of knowledge distillation in rPPG domain, successfully balancing accuracy and efficiency for real-time physiological monitoring applications, making it suitable for telehealth and remote health monitoring on standard hardware.
Abstract: Camera-based physiological monitoring, such as remote photoplethysmography (rPPG), captures subtle variations in skin optical properties caused by pulsatile blood volume changes using standard digital camera sensors. The demand for real-time, non-contact physiological measurement has increased significantly, particularly during the SARS-CoV-2 pandemic, to support telehealth and remote health monitoring applications. In this work, we propose an attention-based knowledge distillation (KD) framework, termed KDPhys, for extracting rPPG signals from facial video sequences. The proposed method distills global temporal representations from a 3D convolutional neural network (CNN) teacher model to a lightweight 2D CNN student model through effective 3D-to-2D feature distillation. To the best of our knowledge, this is the first application of knowledge distillation in the rPPG domain. Furthermore, we introduce a Distortion Loss incorporating Shape and Time (DILATE), which jointly accounts for both morphological and temporal characteristics of rPPG signals. Extensive qualitative and quantitative evaluations are conducted on three benchmark datasets. The proposed model achieves a significant reduction in computational complexity, using only half the parameters of existing methods while operating 56.67% faster. With just 0.23M parameters, it achieves an 18.15% reduction in Mean Absolute Error (MAE) compared to state-of-the-art approaches, attaining an average MAE of 1.78 bpm across all datasets. Additional experiments under diverse environmental conditions and activity scenarios further demonstrate the robustness and adaptability of the proposed approach.
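A bare-bones 3D-to-2D feature distillation term in the spirit of KDPhys could be written as follows; the attention guidance, the DILATE loss, and the exact feature shapes are not reproduced, and the shapes here are assumptions.

```python
import torch.nn.functional as F

def distill_loss(teacher_feat, student_feat):
    """teacher_feat: (B, C, T, H, W) features from the 3D CNN teacher,
    student_feat: (B, C, H, W) features from the lightweight 2D CNN student.
    The temporal axis of the teacher is pooled before matching."""
    pooled = teacher_feat.mean(dim=2)                 # collapse the time axis
    return F.mse_loss(student_feat, pooled.detach())  # teacher is frozen
```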
[334] CIC: Circular Image Compression
Honggui Li, Sinan Chen, Dingtai Li, Zhengyang Zhang, Nahid Md Lokman Hossain, Xinfeng Xu, Yinlu Qin, Ruobing Wang, Maria Trocan, Dimitri Galayko, Amara Amara, Mohamad Sawan
Main category: eess.IV
TL;DR: A circular image compression (CIC) method with closed-loop architecture is proposed to address performance degradation in learned image compression when handling out-of-distribution images, outperforming traditional serial approaches.
Details
Motivation: Learned image compression (LIC) suffers performance degradation on out-of-sample, out-of-distribution, or out-of-domain testing images due to inherent differences between training and testing data. Traditional serial image compression (SIC) uses open-loop architecture, while closed-loop systems from control theory could improve performance.
Method: Proposes circular image compression (CIC) with closed-loop encoding and decoding elements. Establishes nonlinear loop equation and proves steady-state error between reconstructed and original images is close to zero using Taylor series expansion. The method has Post-Training and Plug-and-Play properties, compatible with existing SIC methods.
Result: Experimental results on five public image compression datasets show CIC outperforms eight state-of-the-art open-source SIC algorithms in reconstruction capacity. Particularly effective for out-of-sample images with dark backgrounds, sharp edges, high contrast, grid shapes, or complex patterns.
Conclusion: Closed-loop CIC architecture minimizes gap between testing and training images, improving learned image compression performance, especially for challenging out-of-distribution cases. The plug-and-play approach enables easy integration with existing methods.
Abstract: Learned image compression (LIC) is currently the cutting-edge method. However, the inherent difference between testing and training images of LIC results in performance degradation to some extent. Especially for out-of-sample, out-of-distribution, or out-of-domain testing images, the performance of LIC degrades significantly. Classical LIC is a serial image compression (SIC) approach that utilizes an open-loop architecture with serial encoding and decoding units. Nevertheless, according to the principles of automatic control systems, a closed-loop architecture holds the potential to improve the dynamic and static performance of LIC. Therefore, a circular image compression (CIC) approach with closed-loop encoding and decoding elements is proposed to minimize the gap between testing and training images and upgrade the capability of LIC. The proposed CIC establishes a nonlinear loop equation and proves that steady-state error between reconstructed and original images is close to zero by Taylor series expansion. The proposed CIC method possesses the property of Post-Training and Plug-and-Play which can be built on any existing advanced SIC methods. Experimental results including rate-distortion curves on five public image compression datasets demonstrate that the proposed CIC outperforms eight competing state-of-the-art open-source SIC algorithms in reconstruction capacity. Experimental results further show that the proposed method is suitable for out-of-sample testing images with dark backgrounds, sharp edges, high contrast, grid shapes, or complex patterns.
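A minimal closed-loop wrapper around an existing serial codec, capturing the feedback idea in a post-training, plug-and-play way, might look like this; the paper's actual loop equation and gain are not reproduced, and encode/decode stand for any pretrained LIC model.

```python
import numpy as np

def circular_compress(x, encode, decode, gain=0.5, n_loops=5):
    """Closed-loop sketch: iteratively correct the codec input by feeding
    back the reconstruction error so the decoded image tracks the original.
    encode/decode: callables wrapping a pretrained serial codec."""
    u = x.copy()                       # codec input, updated by feedback
    for _ in range(n_loops):
        x_hat = decode(encode(u))      # one pass through the serial codec
        u = u + gain * (x - x_hat)     # negative-feedback correction
    return encode(u)                   # final bitstream; decode as usual
```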
[335] UKAN-EP: Enhancing U-KAN with Efficient Attention and Pyramid Aggregation for 3D Multi-Modal MRI Brain Tumor Segmentation
Yanbing Chen, Tianze Tang, Taehyo Kim, Hai Shu
Main category: eess.IV
TL;DR: UKAN-EP is a 3D extension of U-KAN for multi-modal MRI brain tumor segmentation that integrates Efficient Channel Attention and Pyramid Feature Aggregation modules with a dynamic loss weighting strategy, achieving superior performance with fewer computational resources.
Details
Motivation: Gliomas are heterogeneous malignant brain tumors where multi-modal MRI is the clinical standard, but variability across modalities and high computational demands hinder effective automated segmentation. Current methods struggle with inter-modality feature fusion and computational efficiency.
Method: Extends 2D U-KAN to 3D UKAN-EP by incorporating Efficient Channel Attention (ECA) modules for inter-modality feature fusion and Pyramid Feature Aggregation (PFA) for multi-scale feature representation. Uses a dynamic loss weighting strategy that adaptively balances cross-entropy and Dice losses during training.
Result: On BraTS-GLI 2024 dataset, achieves superior segmentation (Dice = 0.9001 ± 0.0127, IoU = 0.8257 ± 0.0186 for whole tumor) with substantially fewer computational resources (223.57 GFLOPs, 11.30M parameters) compared to U-Net, Attention U-Net, Swin UNETR, VT-Unet, TransBTS, and 3D U-KAN baselines.
Conclusion: Combining KAN layers’ expressive power with lightweight channel-wise attention (ECA) and multi-scale feature aggregation (PFA) improves both accuracy and efficiency for brain tumor segmentation, demonstrating the effectiveness of this architectural approach.
Abstract: Background: Gliomas are among the most common malignant brain tumors and exhibit substantial heterogeneity, complicating accurate detection and segmentation. Although multi-modal MRI is the clinical standard for glioma imaging, variability across modalities and high computational demands hamper effective automated segmentation. Methods: We propose UKAN-EP, a novel 3D extension of the original 2D U-KAN model for multi-modal MRI brain tumor segmentation. While U-KAN integrates Kolmogorov-Arnold Network (KAN) layers into a U-Net backbone, UKAN-EP further incorporates Efficient Channel Attention (ECA) and Pyramid Feature Aggregation (PFA) modules to enhance inter-modality feature fusion and multi-scale feature representation. We also introduce a dynamic loss weighting strategy that adaptively balances cross-entropy and Dice losses during training. Results: On the 2024 BraTS-GLI dataset, UKAN-EP achieves superior segmentation performance (e.g., Dice = 0.9001 $\pm$ 0.0127 and IoU = 0.8257 $\pm$ 0.0186 for the whole tumor) while requiring substantially fewer computational resources (223.57 GFLOPs and 11.30M parameters) compared to strong baselines including U-Net, Attention U-Net, Swin UNETR, VT-Unet, TransBTS, and 3D U-KAN. An extensive ablation study further confirms the effectiveness of ECA and PFA and shows the limited utility of self-attention and spatial attention alternatives. Conclusion: UKAN-EP demonstrates that combining the expressive power of KAN layers with lightweight channel-wise attention and multi-scale feature aggregation improves the accuracy and efficiency of brain tumor segmentation.
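For reference, an Efficient Channel Attention block for 3D feature maps, the kind of lightweight channel-wise attention UKAN-EP adds for inter-modality fusion, can be sketched as below; the kernel size and its placement inside the network are illustrative.

```python
import torch
import torch.nn as nn

class ECA3D(nn.Module):
    """Efficient Channel Attention for 3D feature maps: global average pool,
    a 1D convolution across channels, and a sigmoid gate."""
    def __init__(self, k_size: int = 3):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool3d(1)
        self.conv = nn.Conv1d(1, 1, kernel_size=k_size,
                              padding=k_size // 2, bias=False)

    def forward(self, x):                       # x: (B, C, D, H, W)
        y = self.pool(x).flatten(2)             # (B, C, 1) channel descriptors
        y = self.conv(y.transpose(1, 2))        # 1D conv across channels
        w = torch.sigmoid(y.transpose(1, 2))    # (B, C, 1) channel weights
        return x * w.unsqueeze(-1).unsqueeze(-1)
```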
[336] Scan-Adaptive MRI Undersampling Using Neighbor-based Optimization (SUNO)
Siddhant Gautam, Angqi Li, Nicole Seiberlich, Jeffrey A. Fessler, Saiprasad Ravishankar
Main category: eess.IV
TL;DR: SUNO framework learns scan-adaptive Cartesian undersampling patterns and reconstruction models for accelerated MRI, outperforming population-adaptive patterns at 4× and 8× acceleration.
Details
Motivation: Population-adaptive sampling patterns can be sub-optimal for individual scans as they may fail to capture scan-specific details and their effectiveness depends on population size/composition. Need scan-adaptive patterns for better individual scan reconstruction.
Method: Joint learning framework using alternating algorithm: ICD-based offline optimization of scan-adaptive k-space sampling patterns for each training example, then nearest neighbor search to select patterns at test time from low-frequency k-space information.
Result: Applied to fastMRI multi-coil knee and brain datasets, showing improved performance over current undersampling patterns at both 4× and 8× acceleration in visual quality and quantitative metrics.
Conclusion: SUNO framework successfully learns scan-adaptive sampling patterns that outperform population-adaptive approaches, providing better reconstruction quality for individual MRI scans.
Abstract: Accelerated MRI involves collecting partial $k$-space measurements to reduce acquisition time, patient discomfort, and motion artifacts, and typically uses regular undersampling patterns or human-designed schemes. Recent works have studied population-adaptive sampling patterns learned from a group of patients (or scans). However, such patterns can be sub-optimal for individual scans, as they may fail to capture scan or slice-specific details, and their effectiveness can depend on the size and composition of the population. To overcome this issue, we propose a framework for jointly learning scan-adaptive Cartesian undersampling patterns and a corresponding reconstruction model from a training set. We use an alternating algorithm for learning the sampling patterns and the reconstruction model where we use an iterative coordinate descent (ICD) based offline optimization of scan-adaptive $k$-space sampling patterns for each example in the training set. A nearest neighbor search is then used to select the scan-adaptive sampling pattern at test time from initially acquired low-frequency $k$-space information. We applied the proposed framework (dubbed SUNO) to the fastMRI multi-coil knee and brain datasets, demonstrating improved performance over the currently used undersampling patterns at both $4\times$ and $8\times$ acceleration factors in terms of both visual quality and quantitative metrics. The code for the proposed framework is available at https://github.com/sidgautam95/adaptive-sampling-mri-suno.
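The test-time step could be as simple as the sketch below, which picks the pre-optimized mask whose training scan has the most similar low-frequency k-space signature; the matching feature and distance are assumptions for illustration, not necessarily those used in the released code.

```python
import numpy as np

def select_mask(lowfreq_kspace, train_lowfreq, train_masks):
    """Nearest-neighbor selection of a scan-adaptive sampling mask.
    lowfreq_kspace: (F,) complex low-frequency features of the new scan,
    train_lowfreq: (N, F) complex features of the training scans,
    train_masks: (N, num_cols) boolean Cartesian undersampling masks."""
    d = np.abs(train_lowfreq - lowfreq_kspace[None, :]).sum(axis=1)
    return train_masks[int(np.argmin(d))]
```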
[337] A 28nm 0.22μJ/token memory-compute-intensity-aware CNN-Transformer accelerator with hybrid-attention-based layer-fusion and cascaded pruning for semantic-segmentation
Pingcheng Dong, Yonghao Tan, Xuejiao Liu, Peng Luo, Yu Liu, Luhong Liang, Yitong Zhou, Di Pang, Man-To Yung, Dong Zhang, Xijie Huang, Shih-Yang Liu, Yongkun Wu, Fengshi Tian, Chi-Ying Tsui, Fengbin Tu, Kwang-Ting Cheng
Main category: eess.IV
TL;DR: A 28nm CNN-Transformer accelerator for semantic segmentation achieves 3.86-10.91x energy reduction with peak efficiency of 52.90TOPS/W (INT8).
Details
Motivation: To address the high energy consumption of semantic segmentation models by developing an efficient hardware accelerator that combines CNN and Transformer architectures.
Method: The accelerator uses a hybrid attention unit, layer-fusion scheduler, and cascaded feature-map pruner to optimize energy efficiency in a 28nm process.
Result: Achieves 3.86-to-10.91x energy reduction over previous designs with 13.93 mm$^2$ area and peak energy efficiency of 52.90 TOPS/W (INT8).
Conclusion: The proposed CNN-Transformer accelerator demonstrates significant energy efficiency improvements for semantic segmentation tasks through architectural innovations.
Abstract: This work presents a 28nm 13.93mm$^2$ CNN-Transformer accelerator for semantic segmentation, achieving 3.86-to-10.91x energy reduction over previous designs. It features a hybrid attention unit, layer-fusion scheduler, and cascaded feature-map pruner, with peak energy efficiency of 52.90TOPS/W (INT8).